By Alicia Pastrana García and José Carlos Cerdán Medina, National Library of Spain
In the last 20 years, most web archives have been building their web collections. These will become ever more valuable as the years go by, since much of this information will no longer exist on the live web. However, do we have to wait that long for our collections to be useful?
The huge amount of information the National Library of Spain (BNE) has collected since 2009 has emerged as one of the largest linguistic corpora of contemporary Spanish. For this reason, the BNE has collaborated with the Barcelona Supercomputing Center (BSC) to create the first massive AI model of the Spanish language. This collaboration takes place in the framework of the Language Technologies Plan of the State Secretariat for Digitization and Artificial Intelligence of the Ministry of Economic Affairs and Digital Agenda of Spain.
The National Library of Spain has been harvesting information from the web for more than 10 years. The Spanish Web Archive is still young but it already contains more than a Petabyte of information.
On the other hand, the Barcelona Supercomputing Center (BSC) is the leading supercomputing center in Spain. It offers infrastructure and supercomputing services to Spanish and European researchers in order to generate knowledge and technology for society.
The Spanish Web Archive, like most national library web archives, is based on a mixed model combining broad and selective crawls. The broad crawls harvest as many Spanish domains as possible without going very deep into the navigation levels; their scope is the .es domain. Selective crawls complement the broad crawls and harvest a smaller sample of websites, but in greater depth and with greater frequency. The sites are selected for their relevance to history, society and culture. Selective crawls also include other kinds of domains (.org, .com, etc.).
Web Curators, from the BNE and the regional libraries, select the seeds that will be part of these collections. They assess the relevance of the websites from the heritage point of view and the importance for research and knowledge in the future.
For this project we chose the content harvested on selective crawls, a collection of around 40,000 websites.
How to prepare WARC files
The harvested collections are stored in WARC files (Web ARChive file format). The BSC needed only the text extracted from the WARC files to train the language models, so everything else was removed using a specific script. The script uses a parser to keep exclusively the HTML text tags (paragraphs, headlines, keywords, etc.) and to discard everything that was not useful for the purpose of the project (e.g. images, audio, video).
This parser is an open source Python module called Selectolax. It is reported to be up to seven times faster than comparable parsers and is easily customizable. Selectolax can be configured to keep the tags that contain text and to discard those that are not useful for the project. At the end of the process, the script generated JSON files organized according to the selected HTML tags, structuring the information into paragraphs, headlines, keywords, etc. These files are not only useful for this project, but will also help us improve the Spanish Web Archive's full-text search.
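The BNE script itself is not reproduced here. As a rough illustration of the same tag-filtering idea (keep text-bearing tags, discard everything else, emit JSON organized by tag), the sketch below uses Python's standard `html.parser` so it is dependency-free; the tag list and the JSON layout are assumptions, not BNE's actual schema. With Selectolax, the selection step would instead be something like `HTMLParser(html).css("title, h1, p")`.

```python
import json
from html.parser import HTMLParser

# Tags whose text we keep (illustrative choice); script/style
# content is always discarded, as are tags not listed here.
KEEP = ("title", "h1", "h2", "h3", "p")
SKIP = {"script", "style"}

class TextTagExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current = None              # KEEP tag we are inside
        self.skipping = 0                # depth inside script/style
        self.fields = {tag: [] for tag in KEEP}

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skipping += 1
        elif tag in KEEP:
            self.current = tag

    def handle_endtag(self, tag):
        if tag in SKIP and self.skipping:
            self.skipping -= 1
        elif tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current and not self.skipping and data.strip():
            self.fields[self.current].append(data.strip())

def extract(html: str) -> str:
    """Return the kept text as JSON, organized by HTML tag."""
    parser = TextTagExtractor()
    parser.feed(html)
    return json.dumps(parser.fields, ensure_ascii=False)

html = ("<html><head><title>Demo</title><script>x=1</script></head>"
        "<body><h1>Heading</h1><p>First paragraph.</p>"
        "<img src='a.png'></body></html>")
print(extract(html))
```

Images, scripts and other non-text markup simply never reach the output, which is the point of the cleaning step.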
All this work was done in the Library itself, in order to obtain files that were more manageable. It must be taken into account that the huge volume of information was a challenge. It was not easy to transfer the files to the BSC, where the supercomputer was. Hence the importance of starting with this cleaning process in the Library.
Once at the BSC, a second cleaning process was run. The BSC project team removed everything that is not well-formed text (unfinished or duplicated sentences, erroneous encodings, other languages, etc.). The result was only well-formed Spanish text, as the language is actually used.
BSC used the supercomputer MareNostrum, the most powerful computer in Spain and the only one capable of processing such a volume of information in a short time frame.
The language model
Once the files were prepared, the BSC used a neural network technology based on the Transformer architecture, already proven with English. It was trained to learn to use the language. The result is an AI model able to understand the Spanish language: its vocabulary and its mechanisms for expressing meaning and writing at an expert level. The model is also able to understand abstract concepts and deduces the meaning of words from the context in which they are used.
This model is larger and better than the other models of the Spanish language available today. It is called MarIA and is open access. The project represents a milestone both in the application of artificial intelligence to the Spanish language and in collaboration between national libraries and research centers; it is a good example of the value of cooperation between institutions with common objectives. The uses of MarIA can be many: language correctors and predictors, automatic summarization apps, chatbots, smart search, translation engines, automatic captioning, etc. These are all broad fields that promote the use of Spanish in technological applications, helping to increase its presence in the world. In this way, the BNE fulfils part of its mission: promoting scientific research and the dissemination of knowledge, and helping to transform information into technology accessible to all.
By Emmanuelle Bermès, Deputy Director for Services and Networks at BnF and Membership Engagement Portfolio Lead
During the refresh of our consortium agreement in 2016, three portfolios were created to lead on a strategy in three areas: membership engagement, tools development, and partnerships and outreach. The first major project for the Membership Portfolio was to survey our members and find common grounds for potential collaborations. While this remains our goal, in the new Strategic Action Plan, we would also like to focus on regular conversations with our members, in the spirit of the updates we have at our General Assembly (GA), and on supporting regional initiatives.
Following up on this, on Monday 13 September, the Membership Engagement Portfolio hosted two calls open to all members, scheduled for two slots to accommodate different time zones.
The past months have made it so much more complicated to share, engage and work together. Our yearly event, the GA/WAC, held fully online by the National Library of Luxembourg, was very successful, and we all enjoyed this opportunity to feel the strength and vitality of our community. So we thought that opening up more options for informal communication online could also be a way to keep up the momentum.
We also wanted this Members Call to be a moment to share your updates on what’s happening in your organization. With short presentations from the Library of Congress, the National Archives UK, the ResPaDon project at the Bibliothèque nationale de France, the National Széchényi Library in Hungary, the National and University Library in Zagreb, the National Library of New Zealand, and the National Library of Australia, the agenda was very rich. One of the major outcomes of these two calls may be a future workshop on capturing social media, as this was a topic of great interest in both meetings. The workshop will be organised by the IIPC and the National and University Library in Zagreb. If you’re interested in contributing to this specific topic, please let us know at events[at]netpreserve.org.
We hope this shared moment can become a regular meeting point between members, where you can share your institution’s hot topics and let us know your expectations about IIPC activities. So, we hope you will join us for this conversation during our next call, expected to take place on Wednesday, 15th of December (UTC). Abbie Grotke and Karolina Holub, Membership Engagement Portfolio co-leads, Olga Holownia and myself, are looking forward to meeting you there.
The Steering Committee, composed of no more than fifteen Member Institutions, provides oversight of the Consortium and defines and oversees its strategy. This year five seats are up for election/re-election. In response to the call for nominations to serve on the IIPC Steering Committee for a three-year term commencing 1 January 2022, six IIPC member organisations have put themselves forward:
The election will be held from 15 September to 15 October. The IIPC designated representatives of all member organisations will receive an email with instructions on how to vote. Each member will be asked to cast five votes, and representatives should ensure that they read all the nomination statements before voting. The results of the vote will be announced on the Netpreserve blog and the Members mailing list on 18 October. The first Steering Committee meeting will be held online.
If you have any questions, please contact the IIPC Senior Program Officer.
Nomination statements in alphabetical order:
Bibliothèque nationale de France / National Library of France
The National Library of France (BnF) started its web archiving programme in the early 2000s and now holds an archive of nearly 1.5 petabytes. We develop national strategies for the growth and outreach of web archives and host several academic projects in our Data Lab. We use and share expertise about key tools for IIPC members (Heritrix 3, OpenWayback, NetarchiveSuite, webarchive-discovery) and contribute to the development of several of them. We have developed BCweb, an open source application for seed selection and curation, also shared with other national libraries in Europe.
The BnF has been involved in IIPC since its very beginning and remains committed to the development of a strong community, not only in order to sustain these open source tools but also to share experiences and practices. We have attended, and frequently actively contributed to, general assembly meetings, workshops and hackathons, and most IIPC working groups.
The BnF chaired the consortium in 2016-2017 and currently leads the Membership Engagement Portfolio. Our participation in the Steering Committee, if continued, will be focused as ever on making web archiving a thriving community, engaging researchers in the study of web archives and further developing access strategies.
The British Library
The British Library is an IIPC founding member and has enjoyed active engagement with the work of the IIPC. This has included leading technical workshops and hackathons; helping to co-ordinate and lead member calls and other resources for tools development; co-chairing the Collection Development Group; hosting the Web Archive Conference in 2017; and participating in the development of training materials. In 2020, the British Library, with Dr Tim Sherratt, the National Library of Australia and National Library of New Zealand, led the IIPC Discretionary Funding project to develop Jupyter notebooks for researchers using web archives. The British Library hosted the Programme and Communications Officer for the IIPC up until the end of March this year, and has continued to work closely on strategic direction for the IIPC. If elected, the British Library would continue to work on IIPC strategy, and collaborate on the strategic plan. The British Library benefits a great deal from being part of the IIPC, and places a high value on the continued support, professional engagement, and friendships that have resulted from membership. The nomination for membership of the Steering Committee forms part of the British Library’s ongoing commitment to the international community of web archiving.
Deutsche Nationalbibliothek / German National Library
The German National Library (DNB) has been archiving the web since 2012. Legal deposit in Germany covers websites and all kinds of digital publications, such as eBooks, eJournals and eTheses. The selective web archive currently includes about 5,000 sites with 30,000 crawls, and it is planned to expand the collection to a larger scale. Crawling, quality assurance, storage and access are handled together with a service provider rather than with common tools like Heritrix and the Wayback Machine.
Digital preservation has always been an important topic for the German National Library. The DNB has worked on concepts and solutions in this area in many international and national projects and co-operations. Nestor, the German network of expertise in the long-term storage of digital resources, has its office at the DNB, and the Preservation Working Group of the IIPC was co-led for many years by the DNB. On the IIPC Steering Committee, the German National Library would like to advance the joint preservation of the web.
Det Kongelige Bibliotek / Royal Library of Denmark
Royal Danish Library (in charge of Netarkivet, the Danish national web archiving programme) will bring to the IIPC Steering Committee (SC) great expertise in web archiving, dating back to 2001. Netarkivet now holds a collection of more than 800 TB and is active in the open source development of web archiving tools such as NetarchiveSuite and SolrWayback. The representative from RDL will bring the IIPC more than 20 years of experience of working with web archiving. RDL will contribute both technical and strategic competences to the SC, as well as skills in financial management, budgeting and project portfolio management. Royal Danish Library was among the founding members of the IIPC, served on the SC for a number of years, and is now ready for another term.
Koninklijke Bibliotheek / National Library of the Netherlands
As the National Library of the Netherlands (KBNL), our work is fueled by the power of the written word. It preserves stories, essays and ideas, both printed and digital. When people come into contact with these words, whether through reading, studying or conducting research, it has an impact on their lives. With this perspective in mind we find it of vital importance to preserve web content for future generations.
We believe the IIPC is an important network organization which brings together ideas, knowledge and best practices on how to preserve the web and retain access to its information in all its diversity. In past years, KBNL has used its voice in the SC to raise awareness of the sustainability of tools (as we do by improving the Web Curator Tool) and of the importance of quality assurance, and we co-organized the WAC 2021. Furthermore, we have shared our insights and expertise on preservation in webinars and workshops, and have recently joined the Partnerships & Outreach Portfolio.
We would like to continue this work and bring together more organizations, large and small, across the world, to learn from each other and ensure web content remains findable, accessible and re-usable for generations to come.
The National Archives (UK)
The National Archives (UK) is an extremely active web archiving practitioner and runs two open access web archive services – the UK Government Web Archive (UKGWA), which also includes an extensive social media archive, and the EU Exit Web Archive (EEWA). While our scope is limited to information produced by the government of the UK, we have nonetheless built up our collections to over 200TB.
Our team has grown in capacity over the years and we are now increasingly becoming involved in research initiatives that will be relevant to the IIPC’s strategic interests.
With over 35 years’ collective team experience in the field, through building and running one of the largest and most used open access web archives in the world, we believe that we can provide valuable experience and we are extremely keen to actively contribute to the objectives of the IIPC through membership of the Steering Committee.
One thing I quickly noticed as I read through the guide is that it recommends using OutbackCDX as a backend for PyWb, rather than continuing to rely on “flat file”, sorted CDXs. PyWb does support flat CDXs, as long as they are in the 9 or 11 column format, but a convincing argument is made that using OutbackCDX for resolving URLs is preferable, whether you use PyWb or OpenWayback.
What is OutbackCDX?
OutbackCDX is a tool created by Alex Osborne, Web Archive Technical Lead at the National Library of Australia. It handles the fundamental task of indexing the contents of web archives: mapping URLs to records in WARC files.
A “traditional” CDX file (or set of files) accomplishes this by listing every URL, in order, in a simple text file, along with information such as which WARC file it is stored in. This has the benefit of simplicity and can be managed using simple GNU tools, such as sort. Plain CDXs, however, make inefficient use of disk space, and as they get larger they become increasingly difficult to update, because inserting even a small amount of data into the middle of a large file requires rewriting a large part of the file.
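To make the trade-off concrete, here is a toy sketch of how a flat, sorted CDX supports lookups: a binary search finds the first line for a URL key, then adjacent lines are scanned. The line format is simplified (real CDX files carry 9 or 11 space-separated columns) and the SURT-style keys are invented for illustration.

```python
import bisect

# A toy, sorted "flat CDX": one line per capture, keyed by
# SURT-style URL then timestamp. Columns simplified here.
cdx_lines = sorted([
    "is,vefsafn)/ 20190101000000 warcs/IS-2019-0001.warc.gz",
    "is,vefsafn)/ 20200101000000 warcs/IS-2020-0007.warc.gz",
    "is,vefsafn)/about 20200315120000 warcs/IS-2020-0011.warc.gz",
])

def lookup(urlkey: str) -> list[str]:
    """Binary-search the sorted lines for all captures of one URL."""
    lo = bisect.bisect_left(cdx_lines, urlkey + " ")
    out = []
    for line in cdx_lines[lo:]:
        if not line.startswith(urlkey + " "):
            break
        out.append(line)
    return out

print(lookup("is,vefsafn)/"))
```

Lookups are cheap because the file is sorted, but that same sorted-ness is what makes inserts expensive: adding a capture in the middle means rewriting everything after the insertion point, which is exactly the problem OutbackCDX's key-value store avoids.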
OutbackCDX improves on this by using a simple but powerful key-value store, RocksDB. The URLs are the keys and the remaining info from the CDX is the stored value. RocksDB then does the heavy lifting of storing the data efficiently and providing speedy lookups and updates. Notably, OutbackCDX enables updates to the index without any disruption to the service.
Given all this, transitioning to OutbackCDX for PyWb makes sense. But OutbackCDX also works with OpenWayback. If you aren’t quite ready to move to PyWb, adopting OutbackCDX first can serve as a stepping stone. It offers enough benefits all on its own to be worth it. And, once in place, it is fairly trivial to have it serve as a backend for both OpenWayback and PyWb at the same time.
So, this is what I decided to do. Our web archive, Vefsafn.is, has been running on OpenWayback with a flat file CDX index for a very long time. The index has grown to 4 billion URLs and takes up around 1.4 terabytes of disk space. Time for an upgrade.
Of course, there were a few bumps on that road, but more on that later.
Installing OutbackCDX was entirely trivial. You get the latest release JAR, run it like any standalone Java application and it just works. It takes a few parameters to determine where the index should be, what port it should be on and so forth, but configuration really is minimal.
Unlike OpenWayback, OutbackCDX is not installed into a servlet container like Tomcat; instead (like Heritrix) it comes with its own built-in web server. End users do not need access to it, so it may be advisable to configure it to be accessible only internally.
Building the Index
Once it is running, you’ll need to feed your existing CDXs into it. OutbackCDX can ingest most commonly used CDX formats — certainly all that PyWb can read. CDX files can simply be “posted” to OutbackCDX using a command line tool like curl.
In our environment, we keep a gzipped CDX for each (W)ARC file, in addition to the merged, searchable CDX that powered OpenWayback. I initially wrote a script that looped through the whole batch and posted them one at a time. I noticed, though, that the number of URLs ingested per second was much higher for CDXs that contained many URLs: there is an overhead to each post. On the other hand, you can’t just post your entire mega CDX in one go, as OutbackCDX will run out of memory.
Ultimately, I wrote a script that posted about 5 MB of my compressed CDXs at a time. Using it, I was able to add all ~4 billion URLs in our collection to OutbackCDX in about 2 days. I should note that our OutbackCDX is on high-performance SSDs, the same as our regular CDX files have used.
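A script along those lines can be sketched like this. The batching logic is the interesting part; the endpoint URL, collection name and 5 MB batch size are illustrative (OutbackCDX accepts CDX lines POSTed to the collection's path), and this is not the actual script used at Vefsafn.

```python
import glob
import gzip
import urllib.request

OUTBACKCDX = "http://localhost:8080/myindex"  # collection endpoint (name illustrative)
BATCH_BYTES = 5 * 1024 * 1024                 # ~5 MB of CDX text per POST

def batches(paths, limit=BATCH_BYTES):
    """Yield chunks of CDX lines, each roughly `limit` bytes, so a
    single POST stays small enough for OutbackCDX while still
    amortizing the per-request overhead over many URLs."""
    buf, size = [], 0
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
            for line in fh:
                buf.append(line)
                size += len(line)
                if size >= limit:
                    yield "".join(buf)
                    buf, size = [], 0
    if buf:
        yield "".join(buf)

def post(chunk: str) -> None:
    """POST one chunk of CDX lines to the OutbackCDX collection."""
    req = urllib.request.Request(
        OUTBACKCDX, data=chunk.encode("utf-8"), method="POST")
    with urllib.request.urlopen(req) as resp:
        resp.read()

# Usage (commented out so the sketch stays side-effect free):
# for chunk in batches(sorted(glob.glob("cdx/*.cdx.gz"))):
#     post(chunk)
```

Larger batches mean fewer round trips; the ~5 MB cap keeps any single request from exhausting OutbackCDX's memory.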
Next up was to configure our OpenWayback instance to use OutbackCDX. This proved easy to do, but turned up some issues with OutbackCDX. First the configuration.
OpenWayback has a module called ‘RemoteResourceIndex’. This can be trivially enabled in the wayback.xml configuration file. Simply replace the existing `resourceIndex` with something like:
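The snippet itself did not survive in this post; based on the OutbackCDX documentation, the bean looks roughly like the following, where the host, port and index name (`myindex` here) are whatever you chose when starting OutbackCDX:

```xml
<bean name="resourceIndex" class="org.archive.wayback.resourceindex.RemoteResourceIndex">
  <property name="searchUrlBase" value="http://localhost:8080/myindex" />
</bean>
```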
And OpenWayback will use OutbackCDX to resolve URLs. Easy as that.
This is, of course, where I started running into those bumps I mentioned earlier. It turns out there were a number of edge cases where OutbackCDX and OpenWayback had different ideas. Luckily, Alex – the aforementioned creator of OutbackCDX – was happy to help resolve them. Thanks again, Alex.
The first issue I encountered was due to the age of some of our ARCs. The date fields had variable precision: rather than all being exactly 14 digits long, some had less precision and were only 10–12 digits long. This was resolved by having OutbackCDX pad the shorter dates with zeros.
I also discovered some inconsistencies in the metadata supplied along with the query results. OpenWayback expected some fields that were either missing or misnamed. These were a little tricky, as they only affected some aspects of OpenWayback, most notably the metadata in the banner inserted at the top of each page. All of this has been resolved.
Lastly, I ran into an issue related not to OpenWayback but to PyWb. It stemmed from the fact that my CDXs were not generated in the 11 column CDX format, which includes the compressed size of the WARC record holding the resource. OutbackCDX was recording this value as 0 when absent. Unfortunately, PyWb didn’t like this and would fail to load such resources. Again, Alex helped me resolve this.
OutbackCDX 0.9.1 is now the most recent release, and includes the fixes to all the issues I encountered.
Having gone through all of this, I feel fairly confident that swapping in OutbackCDX to replace a ‘regular’ CDX index for OpenWayback is very doable for most installations. And the benefits are considerable.
The size of the OutbackCDX index on disk ended up being about 270 GB. As noted before, the existing CDX index powering our OpenWayback was 1.4 TB. A reduction of more than 80%. OpenWayback also feels notably snappier after the upgrade. And updates are notably easier.
Next we will be looking at replacing it with PyWb. I’ll write more about that later, once we’ve made more progress, but I will say that having it run on the same OutbackCDX proved trivial to accomplish, and we now have a beta website up, using PyWb, http://beta.vefsafn.is.
In this blog post I will go into the more technical details of SolrWayback and the new version 4.0 release. The whole frontend GUI was rewritten from scratch to meet the expectations of modern web applications, and many new features were implemented in the backend. I recommend reading the frontend blog post first; it has beautiful animated GIFs demonstrating most of the features in SolrWayback.
Live demo of SolrWayback
You can access a live demo of SolrWayback here. Thanks to National Széchényi Library of Hungary for providing the SolrWayback demo site!
Back in 2018…
The open source SolrWayback project was created in 2018 as an alternative to the Netarchive frontend applications existing at that time. At the Royal Danish Library we were already using Blacklight as a search frontend. Blacklight is an all-purpose Solr frontend application and is very easy to configure and install by defining a few properties, such as the Solr server URL, fields and facet fields. But since Blacklight is a generic Solr frontend, it had no special handling of the rich data structure we had in Solr. Also, binary data such as images and videos are not in Solr, so integration with the WARC-file repository can enrich the experience and make playback possible, since Solr has enough information to work as a CDX server as well.
WARC-Indexer. Where the magic happens!
WARC files are indexed into Solr using the WARC-Indexer. The WARC-Indexer reads every WARC record, extracts all kinds of information and splits it into up to 60 different fields. It uses Tika to parse all the different MIME types that can be encountered in WARC files. Tika extracts the text from HTML, PDF, Excel and Word documents, etc., and also extracts metadata from binary documents when present: created/modified time, title, description, author and so on. For images it can also extract width/height, or EXIF information such as latitude/longitude. The binary data themselves are not stored in Solr, but for every record in the WARC file there is a record in Solr. This even includes empty records, such as HTTP 302 (MOVED) responses with information about the new URL.
WARC-Indexer. Paying the price up front…
Indexing a large amount of WARC files requires massive amounts of CPU, but it is easily parallelized, as the WARC-Indexer takes a single WARC file as input. To give an idea of the requirements: indexing 700 TB (5.5M WARC files) took 3 months using 280 CPUs. Once the existing collection is indexed, it is easier to keep up with the incremental growth of the collection. So this is the drawback of using SolrWayback on large collections: the WARC files have to be indexed first.
Solr provides multiple ways of aggregating data, moving common netarchive statistics tasks from slow batch processing to interactive requests. Based on input from researchers, the feature set is continuously expanding with aggregation, visualization and extraction of data.
Due to the amazing performance of Solr, a query is often answered in less than 2 seconds in a collection with 32 billion (32×10⁹) documents, and this includes facets. The search results are not limited to HTML pages where the free text is found, but include every document that matches the search query. When presenting the results, each document type has a custom display for its MIME type.
HTML results are enriched with thumbnail images from the page as part of the result; images are shown directly, and audio and video files can be played directly from the results list with an in-browser player, or downloaded if the browser does not support the format.
Solr. Reaping the benefits from the WARC-indexer
The SolrWayback Java backend offers a lot more than just sending queries to Solr and returning them to the frontend. Methods can aggregate data from multiple Solr queries or read WARC entries directly, and return the processed data to the frontend in a simple format. Instead of re-parsing the WARC files, which is a very tedious task, the information can be retrieved from Solr, and the task can be done in seconds or minutes instead of weeks.
A word cloud image is generated by extracting the text from 1,000 random HTML pages from the domain and generating a word cloud from the extracted text.
By extracting the domains that link to a given domain (A), and also extracting the outgoing links from that domain (A), you can build a link graph. Repeating this for each newly found domain gives you a two-level local link graph for domain A. Even though this can mean hundreds of separate Solr queries, it is still done in seconds on a large corpus. Clicking a domain will highlight its neighbors in the graph (try the demo: interactive linkgraph).
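The two-level walk can be sketched as follows. Here `fetch_links(domain)` stands in for the pair of Solr queries (inbound and outbound links for one domain) and is stubbed with static data, since the real field names and endpoint depend on your index; the `.dk` domains are invented.

```python
def build_two_level_graph(start, fetch_links):
    """Build a two-level local link graph around `start`.
    fetch_links(domain) -> (incoming, outgoing) domain lists,
    each normally answered by one Solr query per domain."""
    edges, seen = set(), {start}
    incoming, outgoing = fetch_links(start)
    edges |= {(src, start) for src in incoming}
    edges |= {(start, dst) for dst in outgoing}
    # Second level: repeat once for every newly discovered domain.
    for domain in set(incoming) | set(outgoing):
        if domain in seen:
            continue
        seen.add(domain)
        inc, out = fetch_links(domain)
        edges |= {(s, domain) for s in inc} | {(domain, d) for d in out}
    return edges

# Stubbed link data standing in for the Solr responses:
LINKS = {
    "a.dk": (["b.dk"], ["c.dk"]),
    "b.dk": ([], ["a.dk"]),
    "c.dk": (["a.dk"], []),
}
graph = build_two_level_graph("a.dk", lambda d: LINKS.get(d, ([], [])))
print(sorted(graph))
```

Each discovered domain costs a constant number of queries, which is why even a graph of hundreds of domains resolves in seconds against a fast Solr index.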
Large scale linkgraph
Extraction of massive linkgraphs with up to 500K domains can be done in hours.
The exported link-graph data was rendered in Gephi and made zoomable and interactive using Graph presenter. Link graphs can be exported quickly because all links (a href) in each HTML record are extracted and indexed as part of the corresponding Solr document.
Freetext search can be used to find HTML documents. The HTML documents in Solr are already enriched with the image links on each page, without having to parse the HTML again. Instead of showing the HTML pages, SolrWayback collects all the images from the pages and shows them in a Google-like image search result. Under the assumption that the text on an HTML page relates to its images, you can find images that match the query. If you search for “cats” in the HTML pages, the results will most likely show pictures of cats. Such pictures could not be found by searching the image documents themselves if no metadata (or image name) contains “cats”.
CSV stream export
You can export result sets with millions of documents to a CSV file. Instead of exporting all 60 possible Solr fields for each result, you can pick which fields to export. This CSV export has already been used by several researchers at the Royal Danish Library, and it gives them the opportunity to use other tools, such as RStudio, to analyse the data. The National Széchényi Library demo site has disabled CSV export in the SolrWayback configuration, so it cannot be tested live.
WARC corpus extraction
Besides CSV export, you can also export a result set to a WARC file. The export reads the WARC entry for each document in the result set and copies the WARC header + HTTP header + payload, creating a new WARC file with all results combined.
Extracting a sub-corpus this easily has already proven extremely useful for researchers. Examples include the extraction of a domain for a given date range, or a query restricted to a list of defined domains. This export is a 1-1 mapping from the results in Solr to the entries in the WARC files.
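SolrWayback's export is implemented in Java; purely to illustrate the per-record copy it performs, here is a minimal Python sketch that copies one uncompressed WARC record verbatim (header block, `Content-Length` bytes of payload, and the blank lines that terminate a record). Real exports also handle gzipped WARCs and seek to the offset stored in Solr, which this sketch omits.

```python
def copy_record(src, dst):
    """Copy one uncompressed WARC record from `src` to `dst`,
    byte for byte. Returns False at end of input."""
    # Read the header block up to the blank line that ends it.
    header = b""
    while not header.endswith(b"\r\n\r\n"):
        b = src.read(1)
        if not b:
            return False
        header += b
    # The payload length is declared in the Content-Length header.
    length = 0
    for line in header.split(b"\r\n"):
        if line.lower().startswith(b"content-length:"):
            length = int(line.split(b":", 1)[1])
    dst.write(header)
    dst.write(src.read(length))
    dst.write(src.read(4))  # trailing \r\n\r\n record terminator
    return True

# Tiny demonstration record (WARC/1.0, 5-byte payload):
import io
record = (b"WARC/1.0\r\nWARC-Type: resource\r\n"
          b"Content-Length: 5\r\n\r\nhello\r\n\r\n")
out = io.BytesIO()
copy_record(io.BytesIO(record), out)
```

Because each record is copied unchanged, the resulting file is itself a valid WARC: the 1-1 mapping the export relies on.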
SolrWayback can also perform an extended WARC export, which includes all resources (JS/CSS/images) for every HTML page in the export. The extended export ensures that playback will also work for the sub-corpus. Since the exported WARC file can become very large, you can use a WARC splitter tool, or simply split the export into smaller batches by adding the crawl year/month to the query. The National Széchényi Library demo site has disabled WARC export in the SolrWayback configuration, so it cannot be tested live.
SolrWayback playback engine
SolrWayback has a built-in playback engine, but it is optional: SolrWayback can be configured to use any other playback engine that uses the same “/server/&lt;date&gt;/&lt;url&gt;” URL API, such as PyWb. It has been a common misunderstanding that SolrWayback forces you to use its own playback engine. The demo at the National Széchényi Library has PyWb configured as an alternative playback engine: clicking the icon next to the title of an HTML result will open playback in PyWb instead of SolrWayback.
The SolrWayback playback has been designed to be as authentic as possible, without a fixed toolbar at the top of the browser. Only a small overlay is included in the top left corner, which can be removed with a click, so that you see the page as it was harvested. From the playback overlay you can open the calendar and an overview of the resources included in the HTML page, along with their timestamps compared to the main page, similar to the feature provided by the archive.org playback engine.
The URL replacement is done up front and fully resolved to an exact WARC file and offset. An HTML page can have hundreds of different resources, and each of them requires a URL lookup for the version nearest to the crawl time of the HTML page. All resource lookups for a single HTML page are batched into a single Solr query, which improves both performance and scalability.
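The nearest-capture selection can be sketched like this. The `captures` mapping stands in for the response to that one batched Solr query, and for brevity the 14-digit timestamps are compared as plain integers (a real implementation would compare them as proper datetimes); all URLs and timestamps are invented.

```python
def nearest_captures(page_ts, resources, captures):
    """For every embedded resource, pick the capture whose timestamp
    is closest to the page's crawl time. `captures` maps
    url -> list of available capture timestamps, normally obtained
    with one batched Solr query covering all resources at once."""
    best = {}
    for url in resources:
        ts_list = captures.get(url, [])
        if ts_list:
            best[url] = min(ts_list, key=lambda t: abs(t - page_ts))
    return best

page_ts = 20200315120000
resources = ["http://example.org/style.css", "http://example.org/logo.png"]
captures = {
    "http://example.org/style.css": [20190101000000, 20200316000000],
    "http://example.org/logo.png":  [20200101000000],
}
print(nearest_captures(page_ts, resources, captures))
```

Resolving all resources in one round trip is what keeps playback fast: one query per page instead of one query per embedded resource.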
SolrWayback and Scalability
For scalability, it all comes down to the scalability of SolrCloud, which has proven without a doubt to be one of the leading search technologies and is still rapidly improving in each new version. Storing the indexes on SSD gives substantial performance boosts as well but can be costly. The Danish Netarchive has 126 Solr services running in a SolrCloud setup.
One of the servers is the master and the only one that receives requests. The Solr master has an empty index but is responsible for gathering the data from the other Solr services; if the master also held an index, there would be an overhead. 112 of the Solr servers each have a 900 GB index with an average of ~300M documents, while the last 13 servers currently have empty indexes, which makes expanding the collection easy without any configuration changes. Even with 32 billion documents, query response times are under 2 seconds. The result query and the facet query are separate simultaneous calls, with the advantage that the results can be rendered very quickly while the facets finish loading later.
For very large results in the billions, the facets can take 10 seconds or more, but such queries are not realistic and the user should be more precise in limiting the results up front.
Building new shards
Building new shards (collection pieces) is done outside the production environment; a shard is moved onto one of the empty Solr servers when its index reaches ~900GB. The index is optimized before it is moved, since no more data will be written to it that would undo the optimization. This also gives a small performance improvement in query times. If the indexing were done directly into the production index, it would impact response times as well. The separation of the production and building environments has spared us from dealing with complex problems we would have faced otherwise. It also makes speeding up the index building trivial: assign more machines/CPUs to the task and create multiple indexes at once.
You cannot keep indexing into the same shard forever, as this would cause other problems. We found the sweet spot at the time to be a ~900GB index size, which fit on the 932GB SSDs that were available to us when the servers were built. The size of the index also demands more memory on each Solr server, and we have allocated 8 GB to each. For our large-scale netarchive, we keep track of which WARC files have been indexed using Archon and Arctika.
Archon is the central server; it keeps a database of all WARC files, tracking whether each has been indexed and into which shard.
Arctika is a small workflow application that starts WARC-indexer jobs; it queries Archon for the next WARC file to process and reports back when indexing has completed.
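The division of labour between the two can be pictured with a toy bookkeeping class. This is purely illustrative: the real Archon is a server backed by a database, and the status values here are invented:

```python
class WarcTracker:
    """Toy sketch of Archon-style bookkeeping: which WARC files exist,
    whether they have been indexed, and into which shard they went."""

    def __init__(self):
        self.files = {}  # path -> {"status": ..., "shard": ...}

    def register(self, path):
        # Archon learns about a new WARC file
        self.files.setdefault(path, {"status": "new", "shard": None})

    def next_file(self):
        # what an Arctika-style worker polls for before starting an indexer job
        for path, record in self.files.items():
            if record["status"] == "new":
                record["status"] = "indexing"
                return path
        return None

    def complete(self, path, shard):
        # the worker reports back when the WARC-indexer job has finished
        self.files[path] = {"status": "indexed", "shard": shard}
```

The central table is what makes the build environment restartable: any worker can ask for the next unprocessed file without coordinating with the other workers.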
SolrWayback – framework
SolrWayback is a single Java web application containing both the Vue frontend and the Java backend. The backend has two REST service interfaces written with JAX-RS: one is responsible for services called by the Vue frontend and the other handles playback logic.
SolrWayback software bundle
SolrWayback comes with an out-of-the-box bundle release. The release contains a Tomcat server with SolrWayback, a Solr server, and a workflow for indexing, all preconfigured. All that is required is unzipping the release and copying the two property files to your home directory. Add some WARC files and start the indexing job.
So, it’s finally here! SolrWayback 4.0 was released December 20th, after an intense development period. In this blog post, we’ll give you a nice little overview of the changes we made, some of the improvements and some of the added functionality that we’re very proud of having released. So let’s dig in!
A small intro – What is SolrWayback really?
As the name implies, SolrWayback is a fusion of discovery (Solr) and playback (Wayback) functionality. Besides full-text search, Solr provides multiple ways of aggregating data, moving common net archive statistics tasks from slow batch processing to interactive requests. Based on input from researchers the feature set is continuously expanding with aggregation, visualization and extraction of data.
SolrWayback relies on real time access to WARC files and a Solr index populated by the UKWA webarchive-discovery tool. The basic workflow is:
Amass a collection of WARCs (using Heritrix, wget, ArchiveIT…) and put them on live storage
Analyze and process the WARCs using webarchive-discovery. Depending on the amount of WARCs, this can be a fairly heavy job: processing ½ petabyte of WARCs at the Royal Danish Library took 40+ CPU-years
Index the result from webarchive-discovery into Solr. For non-small collections, this means SolrCloud and Solid State Drives. A rule of thumb is that the index takes up about 5-10% of the size of the compressed WARCs
Connect SolrWayback to the WARC storage and the Solr index
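The sizing rule of thumb from step 3 is easy to turn into a quick capacity estimate, using the numbers from the post (½ PB of compressed WARCs, index at 5–10% of that):

```python
def estimate_index_size_tb(compressed_warcs_tb, low=0.05, high=0.10):
    """Apply the rule of thumb that the Solr index takes up roughly
    5-10% of the size of the compressed WARCs."""
    return compressed_warcs_tb * low, compressed_warcs_tb * high

# ~1/2 petabyte of compressed WARCs, i.e. 500 TB
low_tb, high_tb = estimate_index_size_tb(500)
# roughly 25-50 TB of Solr index, which drives the SSD budget
```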
We decided to give the SolrWayback a complete makeover, making the interface more coherent, the design more stylish, and the information architecture better structured. At first glance, not much has changed apart from an update on the color scheme, but looking closely, we’ve added some new functionality, and grouped some of the existing features in a new, and much improved, way.
The search page is still the same, and after searching, you’ll still see all the results lined in a nice single column. We’ve added some more functionality up front, giving you the opportunity to see the WARC header for a single post, as well as selecting an alternative playback engine for the post. Some of the more noticeable reworks and optimizations are highlighted in the section below.
We’ve done some work under the hood too, to make the application run faster. A lot of our calls to the backend have been reworked into individual calls, only issued when needed. This means that facet calls are now made separately to the backend instead of being bundled with every query. So when you’re paging through results, we only request the results, giving a faster response since the facets stay the same. The same principle has been applied to loading images and individual post data.
As mentioned, we’ve done some cleanup in the interface, making it easier to navigate. The search field has been reworked to serve many needs. It will expand if the query is line separated (do so with SHIFT+Enter), making large and complex queries much easier to manage. We’ve even added context-sensitive help, so if you’re making queries with boolean operators or similar, SolrWayback will tell you whether the syntax is correct.
We’ve kept the most used features upfront, with image and URL search readily available from the get go. The same goes for the option to group the search results to avoid URL duplicates.
Below the line are some of the other features not directly linked to the query field, but nice to have up front: searching with an uploaded file, searching by GPS, and the toolbox containing many of the different tools that can help gain insight into the archive, by generating word clouds or link graphs, searching through the Ngram interface, and much more.
Image searching by location rethought
We’ve reworked the way to search and browse the results when searching by GPS coordinates. We’ve made it easy to search for a specific location, and we’ve grouped the results so that they are easier to interpret.
Zooming into the map will expand the places where images are clustered. Furthermore, we realize that sometimes the need is to look through all the images regardless of their exact position, so we’ve made a split screen that can expand either way, depending on your needs. It’s still possible to do a new search based on any of the found images in the list.
Elaborated export options
We’ve added more functionality to the export options. It’s possible to export both fields from the full search result and the raw WARC records for the search result, if enabled in the settings. You can even decide the format of your export and we’ve added an option to select exactly which fields in the search result you want exported – so if you want to leave out some stuff, that is now possible!
Quickly move through your archive
The standard search UI is pretty much as you are accustomed to, but we made an effort to keep things simple and clean as well as facilitating in-depth research and tracking of subject interests. In the search results you get a basic outline of metadata on each post. You can narrow your search with the provided facet filters. When expanding a post you get access to all metadata, and every field has a link if you wish to explore a particular angle related to your post. So you can quickly navigate the archive by starting wide, filtering, and afterwards doing a specific drill-down to find related material.
Visualization of search result by domain
We’ve also made it very easy to quickly get an overview of the results. When clicking the icon in the results headline, you get a complete overview of the different domains in the results, and how large a portion of the search results they account for in each year. This is a very neat way to get an overview of the results and their relative distribution by year.
With quick access from right under the search box, we have gathered the Toolbox with utilities for further data exploration. In the following, we will give you a quick tour of the updates and new functionality in this section.
Linkgraph, domain stats and wordcloud
We reworked the Linkgraph, the Wordcloud and the Domain stats components a little, adding some more interaction to the graph and domain stats, and polished the interface for all of them. For the Linkgraph, it is now possible to highlight certain sections within the graph, making it much easier to navigate the sometimes rather large clusters and look at connections you find relevant. These tools now provide an easy and quick way to gain deeper insight into specific domains and what content they hold.
We are so pleased to finally be able to supply a classical Ngram search tool, complete with graphs and all. In this version you are able to search through the entire HTML content of your archive and see how the results are distributed over time (harvest time). You can even do comparisons by providing several queries sequentially and see how they compare. On every graph the datapoint at each year is clickable and will trigger a search for the underlying results, which is a very handy feature for checking the context and further exploring the underlying data. Oh, and before we forget – if things get a little crowded in the graph area you can always click on the nicely colored labels at the top of the chart and deselect/select each query.
If the HTML content isn’t really your thing but your passion lies with the HTML tags themselves, we’ve got you covered. Just flip the radio button under the search box over to HTML-tags in HTML-pages and you will have all the same features listed above, but now the underlying data will be the HTML tags themselves. As easy as that, you will finally be able to get answers to important questions like ‘when did we actually start to frown upon the blink tag?’
The possibility to export a query, in a format that can be used in Gephi, is still present in the new version of SolrWayback. This will allow you to create some very nice visual graphs that can help you explore how exactly a collection of results is tied together. If you’re interested in this, feel free to visit the labs website about Gephi graphs, where we’ve showcased some of the possibilities of using Gephi.
Tools for the playback
SolrWayback comes with a built-in playback engine, but it can be configured to use another playback engine such as PyWb. The SolrWayback playback viewer shows a small toolbar overlay on the page that can be opened or hidden. When the toolbar is hidden, the page is displayed without any frame or top toolbar, showing the page exactly as it was harvested.
When you have clicked a specific result, you’re taken to the harvested resource. If it is a website, you will be shown a menu to the right, giving you some more ways to analyse the resource. This menu is hidden in the upper left corner when you enter, but can be expanded by clicking on it.
The harvest calendar will give you a very smooth overview of the harvest times of the resource, so you can easily see when, and how often, the resource has been harvested in the current index. This gives you an excellent opportunity to look at your index over time, and see how a website evolved.
The PWID option lets you export the harvested resource’s metadata, so you can share what’s in that particular resource in a nice and clean way. The PWID standard is an excellent way to keep track of, and share, resources between researchers, so that a list of the exact dataset is preserved, along with all the resources to go with it.
View page resources gives you a clean overview of the contents of the harvested page, along with all the resources. We’ve even added a way to quickly see the difference between the first and the last harvested resource on the page, giving you a quick hint of the contents and if they are all from the same period. You can even see a preview of the page here and download the individual resources from the page, if you wish.
Customization of your local SolrWayback instance
We’ve made it possible to customize your installation to fit your needs. The logo can be changed, the about text can be changed, and you can even customize your search guidelines if you need to. This makes sure that you have a chance to make the instance your own in some way, so that people can recognize when they are using your instance of SolrWayback, and it can now reflect your organisation and the people who are contributing to it.
The future of the SolrWayback
This is just the beginning for SolrWayback. Further down the road, we hope to add even more functionality that can help you dig deeper into the archives. One of our main goals is to provide you with the tools necessary to understand and analyse the vast amounts of data, that lies in most of the archives that SolrWayback is designed for. We already have a few ideas as to what could be useful, but if you have any suggestions for tools that might be helpful, feel free to reach out to us.
By Ben Els, Digital Curator at the National Library of Luxembourg & the Chair of the Organising Committee for the 2021 IIPC General Assembly and the Web Archiving Conference
Our previous blog post from the Luxembourg Web Archive focused on the typical steps that many web archiving initiatives take at the start of their program: to gain first experience with event-based crawls. Elections, natural disasters, and events of national importance are typical examples of event collections. These temporary projects have occupied our crawler for the past 3 years (and continue to do so for the Covid-19 collection), but we also feel that it’s about time for a change of scenery on our seed lists.
Aside from following the news on elections and Covid-19, we also operate 2 domain crawls a year, where basically all websites from the “.lu” top level domain are captured. We use the research from the event collections to expand the seed list for domain crawls and, therefore, also add another layer of coverage to those events. However, the captures of websites from the event collections remain very selective and are usually not revisited, once discussions around the event are over. This is why we plan to focus our efforts in the near future on building thematic collections. As a comparison:
Event collections: multifaceted coverage of one topic or event
Thematic collections: focus on one subject area
The idea is that event collections serve as a base to extract the subject areas for thematic collections. In turn, the thematic collections will serve as a base to start event collections, and save time on research. In time, event collections will help with a more intense coverage for the subjects of thematic collections and the latter will capture information before and after the topic of an event collection. For example, the seed list from an election crawl can serve as a basis for the thematic collection “Politics & Society”. The continued coverage and expansion from this collection will serve as an improved basis for a seed list, once the next election campaign comes around. Moreover, both types of collections will help in broadening the scope of domain crawls and achieve better coverage of the Luxembourg web.
Collaboration with subject experts
During election crawls, it has always been important for us to invite the input from different stakeholders, to make sure that the seed list covers all important areas surrounding the topic. The same principle has to be applied to the thematic collections. No curator can become an expert in every field and our web archiving team will never be able to research and find all relevant websites in all domains and all languages from all corners of the Luxembourg web. Therefore, the curator’s job has to be focused on finding the right people, who know the web around their subject, experts in their field and representatives of their communities, who can help to build and expand seed lists over time. This means relying on internal and external subject experts, who are familiar with the principles of web archiving and incentivised to offer their help in contributing to the Luxembourg web archive.
While, technically, we haven’t tested the idea of this collaborative Lego-tower in reality, here are some of the challenges we would like to tackle this year:
The workflows and platform used to collect the experts’ contributions need to be as easy to use as possible. Our contributors should not require hours of training and tutorials to get started, and it should be intuitive enough to pick up working on a seed list after not having looked at it for several months.
Subject experts should be able to contribute in the way that best fits their work rhythm: a quick and easy option to add single seeds spontaneously when coming across an interesting website, as well as a way to dive in deeper into research and add several seeds at a time.
We are going to ask for help, which means additional work for contributors inside and outside the library. This means that we need to keep the subject experts motivated and convince them that a working and growing web archive represents a benefit for everybody and that their input is indispensable.
As a first step, we would like to set up thematic collections with BnL subject experts, to see what the collaborative platform should look like and what kind of work input can be expected from contributors in terms of initial training and regular participation. The second stage will be to involve contributors from other heritage institutions who already provided lists to our domain crawls in the past. After that, we count on involving representatives of professional associations, communities or other organisations interested in seeing their line of business represented in the web archive.
On an even larger scale, the Luxembourg Web Archive will be open to contributions from students and researchers, website owners, web content creators and archive users in general, which is already possible through the “Suggest a website” form on webarchive.lu. While we haven’t received as many submissions as we would like, there have been very valuable contributions of websites that we would perhaps never have found otherwise. We also noticed that it helps to raise awareness through calls for participation in the media. For instance, we received very positive feedback for our Covid-19 collection. If we are able to create interest on a larger scale, we can get many more people involved and improve the services provided by the Luxembourg Web Archive.
Save the date!
While we work on putting the pieces of this puzzle together, we are also moving closer and closer to the 2021 General Assembly and Web Archiving Conference. It’s been two years since the IIPC community was able to meet for a conference, and surely you are all as eager as we are to catch up, to learn and to exchange ideas about problems and projects. So, if you haven’t done so already, please save the date for a virtual trip to Luxembourg from 14th to 16th June.
By Martin Klein, Scientist in the Research Library at Los Alamos National Laboratory and Karolina Holub, Library Adviser at the Croatian Digital Library Development Centre at National and University Library Zagreb
We are excited to share the news of a newly IIPC-funded collaborative project between the Los Alamos National Laboratory (LANL) and the National and University Library Zagreb (NSK). In this one-year project we will develop a software framework for web archives to create Bloom filters of their archival holdings. A Bloom filter, in this context, consists of hash values of archived URIs and can therefore be thought of as an encrypted index of an archive’s holdings. Its encrypted nature allows web archives to share information about their holdings in a passive manner, meaning only hashed URI values are communicated, rather than plain-text URIs. Sharing Bloom filters with interested parties can enable a variety of downstream applications such as search, synchronized crawling, and cataloging of archived resources.
Bloom filters and Memento TimeTravel
As many readers of this blog will know, the Prototyping Team at LANL has developed and maintained the Memento TimeTravel service, implemented as a federated search across more than two dozen Memento-compliant web archives. This service therefore allows a user (or a machine via the underlying APIs) to search for archived web resources (mementos) across many web archives at the same time. We have tested, evaluated, and implemented various optimizations for the search system to improve speed and avoid unnecessary network requests against participating web archives, but we can always do better. As part of this project, we aim to pilot a TimeTravel service based on Bloom filters that, if successful, should provide close to an ideal false-positive rate, meaning almost no unnecessary network requests to web archives that do not have a memento of the requested URI.
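To make the technique concrete, here is a minimal, self-contained Bloom filter over URIs in Python. It is a sketch of the general data structure, not the project’s implementation; the filter size and hash count are arbitrary:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: only bit positions derived from hashes are
    stored, so holdings can be shared without exposing plain-text URIs."""

    def __init__(self, size_bits=1 << 20, num_hashes=7):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, uri):
        # derive k bit positions from salted SHA-256 digests of the URI
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{uri}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, uri):
        for pos in self._positions(uri):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, uri):
        # False: definitely not in the archive; True: probably in the archive
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(uri))

holdings = BloomFilter()
holdings.add("http://example.hr/page")
```

An aggregator in the TimeTravel style could check `uri in holdings` before issuing a network request: archives that definitely lack the URI are never contacted, and the small false-positive rate only costs an occasional unnecessary request.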
While Bloom filters are widely used to support membership queries (e.g., is element A part of set B?), they have, to the best of our knowledge, not been applied to querying web archive holdings. We are aware of opportunities to improve the filters and, as additional components of this project, will investigate their scalability (in relation to CDX index size, for example) as well as the potential for incremental updates to the filters. Insights into the former will inform the applicability for archives and individual collections of different sizes, and the latter will guide a best-practice process of filter creation.
The development and testing of Bloom filters will be performed by using data from the Croatian Web Archive’s collections. NSK develops the Croatian Web Archive (HAW) in collaboration with the University Computing Centre University of Zagreb (Srce), which is responsible for technical development and will work closely with LANL and NSK on this project.
LANL and NSK are excited about this project and new collaboration. We are thankful to the IIPC for their support and are looking forward to regularly sharing project updates with the web archiving community. If you would like to collaborate on any aspects of this project, please do not hesitate to get in touch.
By Claire Newing, The National Archives (UK) and Phil Clegg, MirrorWeb.
At The National Archives of the UK we have been archiving the online presence of UK central government since 2003. We originally worked with suppliers to capture traditional websites which we made available for all to browse and search through the UK Government Web Archive: http://www.nationalarchives.gov.uk/webarchive/.
In the early 2010s, we recognised that the government was increasingly using social media platforms to communicate with the public. At that time Twitter, YouTube and Flickr were the most widely used platforms. We experimented with trying to capture channels using our standard Heritrix based process but with very limited success, so we embarked on a project with Internet Memory Foundation (IMF), our then web archiving service provider, to develop a custom solution.
The project was partially successful. We developed a method of capturing Tweets and YouTube videos and the associated metadata directly from APIs and providing access to them through a custom interface. We also developed a method of crawling all shortlinks in Tweets so the links resolved correctly in the archive.
Unfortunately, we were unable to find a way of capturing Flickr content. The UK Government Social Media Archive was launched in 2014. We continued to capture a small number of channels regularly until mid-July 2017 when we started working with a new supplier, MirrorWeb.
Captures in the cloud
MirrorWeb social media capture is undertaken using serverless functions running in the cloud. These functions authenticate with the social API and request the metadata for all new posts created on the social channel since the last capture point. Each post is then stored in a database for later replay. Further serverless functions are triggered when a post object is written to the database, to check if the post contains media content like images or videos. If media content is found, the items are added to a queue, which in turn triggers another serverless function to download the media for the post. For replay, the stored JSON objects are read back from the database and presented to the user with media objects in a layout similar to the original platform.
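The media-detection step can be pictured as a small pure function over the stored post object. The post shape used here is a hypothetical illustration, not MirrorWeb’s actual schema:

```python
def extract_media_urls(post):
    """Sketch of the media-detection trigger: given a captured post's
    metadata (hypothetical shape), return the media URLs that should be
    queued for download by the next serverless function."""
    urls = []
    for attachment in post.get("attachments", []):
        # only images and videos are fetched; plain links are left alone
        if attachment.get("type") in ("image", "video"):
            urls.append(attachment["url"])
    return urls

post = {
    "id": "1",
    "attachments": [
        {"type": "image", "url": "https://media.example/photo.jpg"},
        {"type": "link", "url": "https://example.org/article"},
    ],
}
media_queue = extract_media_urls(post)
```

Keeping the function pure and side-effect free is what makes the queue-and-trigger pattern work: each invocation handles one post, and the platform scales the invocations.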
MirrorWeb chose to archive most social accounts daily to ensure that all new content is captured. Twitter, for example, limits the number of requests to its API in a 15-minute window and restricts the number of historic posts that can be collected, so some tuning is undertaken to ensure rate limits are not exceeded and that all posts are captured.
We also increased the number of channels we were capturing and took the opportunity to redesign our custom access pages, incorporating feedback from our users. Excitingly, images and video content embedded in Tweets were made accessible for the first time. The archive was formally re-launched in August 2018, but we always knew it could be even better!
Flickr, access, and full text search
Towards the end of 2018 we launched an improvement project with MirrorWeb. It had three key aims:
(1) To develop a method of capturing Flickr images. This became urgent as in November 2018 Flickr announced that as of early 2019 free account holders would be limited to 1000 images per account and any additional images above that number would be deleted. A survey showed that several UK government accounts, particularly older accounts which were no longer being updated, were at risk of losing some content.
(2) To further improve our custom access pages.
(3) To provide full text search across the social media collection.
The project aims were fulfilled, and the new functionality went live late in 2019.
By far the most exciting new development was the implementation of full text search. We undertook some user research earlier in 2018 which revealed that users considered the archive to be interesting but didn’t think it was very useful without search. This emphasized to us how important it was to provide such a service.
The search service was built by MirrorWeb using Elasticsearch, the same technology we use for the full text search facility on the UK Government Web Archive, our collection of archived websites. MirrorWeb once again makes use of serverless functions to provide full text search of the social accounts. When social post metadata is written to the database, a serverless function is triggered to extract the relevant metadata, which is then added to Elasticsearch.
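The extraction trigger essentially flattens a stored post into the document that gets indexed. A hedged sketch, with illustrative field names rather than MirrorWeb’s real mapping:

```python
def to_search_doc(post):
    """Sketch of the extraction step: flatten captured post metadata into
    the document indexed in Elasticsearch. Field names are illustrative."""
    return {
        "platform": post["platform"],
        "channel": post["channel"],
        # Tweets carry full text; YouTube/Flickr items carry title/description
        "text": post.get("text") or post.get("description", ""),
        "title": post.get("title", ""),
        # year of post becomes a filterable facet
        "year": post["created_at"][:4],
    }

doc = to_search_doc({
    "platform": "twitter",
    "channel": "@10DowningStreet",
    "text": "Statement from the Prime Minister",
    "created_at": "2019-05-01T10:00:00Z",
})
```

Indexing a flat document like this is what enables the platform, channel and year filters described below without any joins at query time.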
Each search queries the full text of Tweets and the descriptions and titles of YouTube and Flickr content. Users can initially search for keywords or a phrase and are then given the opportunity to filter the results by platform, channel and year of post. They can also choose to only display results which include or exclude specific words.
Additionally, we added a search box to the top of each of our custom access pages to enable users to search all data captured from a specific channel. For example, on this page a user can search the titles and descriptions of all the videos we’ve captured from the Prime Minister’s Office YouTube channel.
When we started to investigate capturing social media content over a decade ago, there was a feeling in some quarters that content posted on social media was ephemeral and that posts were merely being used to point to content on traditional sites. Events of recent years have demonstrated that social media is of increasing importance. In some cases, government departments and ministers announce important information on social media some time before they update a traditional website.
We have achieved a lot already, but we know there is lots more to do. In future, we aspire to add a unified search across our website and social media archives. We are aware that the API capture method does not work for all platforms, so we are actively working to find other methods of capture, particularly for Instagram and Github. We hope to find a way of displaying metadata we capture which is not currently surfaced on the access pages – for example changes to the channel thumbnail image over time.
We are also aware that there are many gaps in our web archive where we were unable to capture embedded YouTube videos. We hope to develop a method of linking between those gaps and the equivalent videos held in the YouTube archive. Finally, we plan to do some user research to guide future developments. We are very proud of the UK Government Social Media Archive and we want to make sure it is used to its full potential.
By Shawn M. Jones, Ph.D. student and Graduate Research Assistant at Los Alamos National Laboratory (LANL), Martin Klein, Scientist in the Research Library at LANL, Michele C. Weigle, Professor in the Computer Science Department at Old Dominion University (ODU), and Michael L. Nelson, Professor in the Computer Science Department at ODU.
Individual web archive collections can contain thousands of documents. Seeds inform capture, but the documents in these collections are archived web pages (mementos) created from those seeds. The sheer size of these collections makes them challenging to understand and compare. Consider Archive-It as an example platform. Archive-It has many collections on the same topic. As of this writing, a search for the query “COVID” returns 215 collections. If a researcher wants to use one of these collections, which one best meets their information need? How does the researcher differentiate them? Archive-It allows its collection owners to apply metadata, but our 2019 study found that as a collection’s number of seeds rises, the amount of metadata per seed falls. This relationship is likely due to the increased effort required to maintain the metadata for a growing number of seeds. It is paradoxical for those viewing the collection because the more seeds exist, the more metadata they need to understand the collection. Additionally, organizations add more collections each year, resulting in more than 15,000 Archive-It collections as of the end of 2020. Too many collections, too many documents, and not enough metadata make human review of these collections a costly proposition.
Ideally, a user would be able to glance at a visualization and gain understanding of the collection, but existing visualizations require a lot of cognitive load and training even to convey one aspect of a collection. Social media storytelling provides us with an approach. We see social cards all of the time on social media. Each card summarizes a single web resource. If we group those cards together, we summarize a topic. Thus social media storytelling produces a summary of summaries. Tools like Storify and Wakelet already apply this technique for live web resources. We want to use this proven technique because readers already understand how to view these visualizations. The Dark and Stormy Archives (DSA) Project explores how to summarize web archive collections through these visualizations. We make our DSA Toolkit freely available to others so they can explore web archive collections through storytelling.
The Dark and Stormy Archives Toolkit
Telling a story with web archives consists of three steps. First, we select the mementos for our story. Next, we gather the information to summarize each memento. Finally, we summarize all mementos together and publish the story. We evaluated more than 60 platforms and determined that no platform could reliably tell stories with mementos. Many could not even create cards for mementos, and some mixed information from the archive with details from the underlying document, creating confusing visualizations.
Hypercane selects the mementos for a story. It is a rich solution that gives the storyteller many customization options. With Hypercane, we submit a collection of thousands of documents, and Hypercane reduces them to a manageable number. Hypercane provides commands that allow the archivist to cluster, filter, score, and order mementos automatically. The output from some Hypercane commands can be fed into others so that archivists can create recipes with the intelligent selection steps that work best for them. For those looking for an existing selection algorithm, we provide random selection, filtered random selection, and AlNoamany’s Algorithm as prebuilt intelligent sampling techniques. We are experimenting with new recipes. Hypercane also produces reports, helping us include named entities, gather collection metadata, and select an overall striking image for our story.
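As an illustration of the simplest prebuilt technique mentioned above, filtered random selection can be sketched in a few lines of Python. Hypercane’s actual recipes chain many more clustering, scoring and ordering steps; this only shows the idea:

```python
import random

def filtered_random_sample(mementos, predicate, k, seed=0):
    """Sketch of 'filtered random selection': keep only mementos that
    pass a filter, then draw a random sample of size k from them.
    The predicate and seed are caller-supplied assumptions."""
    rng = random.Random(seed)  # seeded for reproducible story-building
    candidates = [m for m in mementos if predicate(m)]
    return rng.sample(candidates, min(k, len(candidates)))

# toy collection: 100 mementos, half tagged English
mementos = [{"uri": f"memento-{i}", "lang": "en" if i % 2 == 0 else "da"}
            for i in range(100)]
story = filtered_random_sample(mementos, lambda m: m["lang"] == "en", 10)
```

Because the output is just a smaller list of mementos, the result of one step can be piped into the next, which is what makes recipe-style composition possible.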
To gather the information needed to summarize individual mementos, we required an archive-aware card service; thus, we created MementoEmbed. MementoEmbed can create summaries of individual mementos in the form of cards, browser screenshots, word clouds, and animated GIFs. If a web page author needs to summarize a single memento, we provide a graphical user interface that returns the proper HTML for them to embed in their page. MementoEmbed also provides an extensive API on top of which developers can build clients.
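Because MementoEmbed is a web service, a client summarizes a memento by requesting a service endpoint for that memento's URI-M. The endpoint path below follows the social card pattern as we understand it, but treat both the path and the deployment host as assumptions and confirm them against the MementoEmbed API documentation.

```python
# Sketch of addressing an archive-aware card service over HTTP.
# The endpoint path is an assumption based on MementoEmbed's documented
# pattern; the host is a hypothetical deployment.

def socialcard_endpoint(service, urim):
    """Build the request URL for a social card summarizing one memento."""
    return f"{service}/services/memento/socialcard/{urim}"

url = socialcard_endpoint(
    "http://mementoembed.example:5550",  # hypothetical deployment host
    "https://wayback.archive-it.org/13529/20200401000000/https://example.com/",
)
print(url)
```

A developer would issue an HTTP GET against this URL and receive structured data (or embeddable HTML) describing the memento, which is the layer Raintale builds upon.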
Raintale is one such client. Raintale summarizes all mementos together and publishes a story. An archivist can supply Raintale with a list of mementos. For more complex stories, including overall striking images and metadata, archivists can also provide output from Hypercane’s reports. Because we needed flexibility for our research, we incorporated templates into Raintale. These templates allow us to publish stories to Twitter, HTML, and other file formats and services. With these templates, an archivist can not only choose which elements to include in their cards; they can also brand the output for their institution.
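The template idea is simple: the per-memento data gathered earlier is poured into a card template, and the filled cards are concatenated into the published story. The sketch below shows only that idea using Python's `string.Template`; Raintale's real templates have their own syntax and far richer elements.

```python
from string import Template

# Illustration of template-driven story publishing. Raintale's actual
# template syntax differs; this only demonstrates the concept.
card_template = Template(
    '<div class="card"><a href="$urim">$title</a><p>$snippet</p></div>'
)

cards = [
    {"urim": "https://archive.example/m1", "title": "First capture", "snippet": "..."},
    {"urim": "https://archive.example/m2", "title": "Second capture", "snippet": "..."},
]

story_html = "\n".join(card_template.substitute(c) for c in cards)
print(story_html.count('<div class="card">'))  # 2
```

Swapping the template swaps the output target, which is how one toolchain can publish the same story as HTML, a Twitter thread, or another format.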
The DSA Toolkit at work
Through these tools, we have produced a variety of stories from web archives. As shown above, we debuted with a story summarizing IIPC’s COVID-19 Archive-It collection, distilling its 23,376 mementos into an intelligent sample of 36. Instead of seed URLs and metadata, our visualization displays people in masks, places that the virus has affected, text drawn from the underlying mementos, correct source attribution, and, of course, links back to the Archive-It collection so that people can explore it further. We recently generated stories that allow readers to view the differences between Archive-It collections about the mass shootings in Norway, El Paso, and Virginia Tech. Instead of facets and seed metadata, our stories show victims, places, survivors, and other information drawn from the sampled mementos. The reader can also follow the links back to the full collection page and get even more information using the tools provided by the archivists at Archive-It.
But our stories are not limited to Archive-It. We designed the tools to work with any Memento-compliant web archive. In collaboration with Storygraph, we produce daily news stories built with mementos stored at Archive.Today and the Internet Archive. We are also experimenting with summarizing a scholar’s grey literature as stored in the web archive maintained by the Scholarly Orphans project.
Our Thanks To The IIPC For Funding The DSA Toolkit
We are excited to say that, starting in 2021, as part of a recent IIPC grant, we will be working with the National Library of Australia to pilot the DSA Toolkit with their collections. In addition to solving potential integration problems with their archive, we look forward to improving the DSA Toolkit based on feedback and ideas from the archivists themselves. We will incorporate the lessons learned back into the DSA Toolkit so that all web archives may benefit, which is what the IIPC is all about.