By SHIMURA Tsutomu, National Diet Library (NDL), Japan
2022 marked the 20th anniversary of the start of the Web Archiving Project (WARP) at the National Diet Library, Japan. The following article introduces the progress made over the 20 years since the project was first launched on an experimental basis in 2002.
The History of the Web Archiving Project
The National Diet Library’s Web Archiving Project, or WARP, was launched in 2002 as an experimental project to collect, preserve, and provide access to a small number of Japanese websites from both the public and the private sectors with the permission of the webmasters. In 2006, the project was expanded to include all government-related organizations.
In 2009, the National Diet Library Law was amended to enable us to comprehensively collect any website published by public institutions, including all national and municipal government agencies, without permission from the publisher. And when this amendment came into force the following year, 2010, we started to archive at regular intervals, such as monthly or quarterly, depending on the type of institution. It was at this time that the basic framework of the current WARP was solidified.
In 2013, we updated the system and began providing access to curated content, such as the Monthly Special feature. In 2018, we developed an English-language user interface in the hope of further expanding our audience. In 2021, we improved the display of search results, and in 2022 we greatly improved the mechanism for following links within archived content.
Changes in the WARP website
The layout of the WARP website has been changed three times so far: at the start of the experimental project in 2002, at the start of comprehensive archiving in 2010, and at the time of the 2013 update.
Number of targets
During the period from FY2002, when we began as an experimental project, until FY2009, we only archived websites when we had obtained permission from the webmasters, irrespective of whether the website was published by a public or a private institution.
With the start of comprehensive archiving in FY2010, we were able to archive the websites of all public institutions, which greatly increased the number of targets. The graph shows that between FY2009 and FY2011, the number of targets increased by more than 2,000.
In addition, we continued to request permission to collect private websites on a daily basis, and the number of targets has increased each year. In 2015, the number of targets increased significantly due to intensive requests made to public interest foundations. Generally, we focus on requesting permission from specific types of institutions for a certain period of time. This has brought the number of private-sector targets to about 8,000, which currently exceeds the number of public-sector targets.
In order to provide access via the Internet to archived websites, we need the permission of the holder of the public transmission rights granted under the Copyright Law of Japan. Therefore, before providing such access, we request permission from each webmaster to make the archived website available via the Internet. As of FY2021, we were able to provide access via the Internet to 12,435 targets, or about 90% of those we have archived, making WARP one of the most internet-accessible web archives in the world.
The size of archived data has increased rapidly since the start of comprehensive archiving of the websites of public agencies in FY2010, and nearly reached 2,400 TB in FY2021. This is due to the increase in the number of targets collected as well as the size of data published by each institution.
System configuration transition
Over the past 20 years, the system configuration and various technologies implemented in WARP have changed significantly. The three most important technologies for collecting, preserving, and providing access to web archives are harvest software, storage format, and replay software. In addition, we provide a full-text search function to make it easier for users to find content of interest from the vast amount of archives. Here is a brief summary of the transition of each system configuration.
At first, we used the open-source software Wget and stored the harvested websites in units of files. In 2010, we implemented Heritrix, standard harvesting software specialized for web archiving and used by web archiving organizations around the world, and we have been using it ever since. In 2013, a duplication reduction function was added to reduce the volume of data to be stored: it saves only the files that have been updated, thus reducing the total volume of data saved and conserving storage space.
The data was saved in units of files when using Wget, but in 2010, with the implementation of Heritrix, the storage format was changed to the WARC format. The files that comprise each website, as well as metadata about those files, are stored together in a WARC file. The WARC format allows for the archiving of information that could not be included when saving data in units of files. For example, information that has no content of its own, such as a redirection to a new URL when the URL of a website changes, can now be saved.
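To make the format concrete, here is a minimal sketch of the WARC record layout, using only the Python standard library. The header names follow the WARC specification, while the example URL and payload are purely illustrative; real archives should use a dedicated library rather than hand-rolled serialization.

```python
# Minimal sketch of a WARC record: a version line, named header fields,
# a blank line, then the payload. Not a full implementation of the spec.

def build_warc_record(record_type: str, target_uri: str, payload: bytes) -> bytes:
    """Serialize a single WARC record: header block, blank line, payload."""
    headers = [
        "WARC/1.0",
        f"WARC-Type: {record_type}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Length: {len(payload)}",
    ]
    return ("\r\n".join(headers) + "\r\n\r\n").encode() + payload + b"\r\n\r\n"

def parse_warc_headers(record: bytes) -> dict:
    """Read the header block back into a dict (version line excluded)."""
    head, _, _ = record.partition(b"\r\n\r\n")
    lines = head.decode().split("\r\n")
    return dict(line.split(": ", 1) for line in lines[1:])

# A "response" record holds a harvested file itself; duplication reduction
# instead writes a small "revisit" record pointing at an earlier capture.
rec = build_warc_record("response", "http://example.jp/", b"<html>...</html>")
print(parse_warc_headers(rec)["WARC-Type"])  # response
```

Because a revisit record carries only headers and a digest rather than the full payload, storing one in place of an unchanged file is what yields the storage savings described above.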
In order to view a website saved in WARC format, either the files comprising the website must be extracted from the WARC files and saved to a general web server, or dedicated replay software is needed. Initially, we adopted the former method, which meant that, in addition to the original WARC files, storage capacity for the extracted data was required. We currently use OpenWayback, which allows users to browse WARC files directly, eliminating the need for storage space for data extracted into units of files.
Full-text search software
Full-text search software was introduced during the experimental project period, but at that time it was created exclusively for WARP. In 2010, we adopted Solr, open-source full-text search software widely used around the world, to improve search speed over large-scale data.
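As a rough illustration of how a client queries a Solr index, the sketch below builds a standard `/select` request URL with the Python standard library. The host, core name, and field name are hypothetical, not WARP's actual schema.

```python
# Build a Solr /select URL for a simple full-text query. The "text" field
# and "warp" core are illustrative assumptions, not WARP's real schema.
from urllib.parse import urlencode

def solr_select_url(base: str, core: str, query: str, rows: int = 10) -> str:
    params = {"q": f"text:{query}", "rows": rows, "wt": "json"}
    return f"{base}/{core}/select?{urlencode(params)}"

url = solr_select_url("http://localhost:8983/solr", "warp", "library", rows=5)
```

The `wt=json` parameter asks Solr to return results as JSON, which is convenient for building a search UI on top.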
What prompted us to launch Monthly Special?
At the time of the interface renewal in 2013, WARP had been steadily archiving websites, but web archiving itself was not yet well known in Japan. We wanted to attract the interest of more people, so we started publishing introductory articles on web archiving, such as Monthly Special, Mechanism of Web Archiving, and Web Archives of the World.
In particular, the Monthly Special features archived websites related to a topic of interest chosen by our staff or includes an article explaining WARP.
Currently this content is available only in Japanese.
Looking back over the 20 years since the start of the project, you can see that the number of targets and the size of data archived have steadily increased.
We believe that the role of web archives will continue to grow in importance. We are committed to collecting and preserving websites on a regular basis as well as making them available to as many people as possible.
By André Mourão, Senior Software Engineer, Arquivo.pt and Daniel Gomes, Head of Arquivo.pt
Arquivo.pt launched a service that enables search over 1.8 billion images archived from the web since the 1990s. Users can submit text queries and immediately receive a list of historical web-archived images through a web user interface or an API.
The goal was to develop a service that addressed the challenges raised by the inherent temporal properties of web-archived data, but at the same time provided a familiar look-and-feel to users of platforms such as Google Images.
Supporting image search using web archives raised new challenges: little research had been published on the subject, and the volume of data to be processed was large and heterogeneous, totaling over 530 TB of historical web data published since the early days of the Web.
The Arquivo.pt Image Search service has been running officially since March 2021 and it is based on Apache Solr. All the developed software is available as open-source to be freely reused and improved.
Search images from the Past Web
The simplest way to access the search service is using the web interface. Users can, for example, search for GIF images published during the early days of the Web related to Christmas by defining the time span of the search.
Users can select a given result and consult metadata about the image (e.g. title, ALT text, original URL, resolution or media type) or about the web page that contained it (e.g. page title, original URL or crawl date). Quickly identifying the page that embedded the image enables the interpretation of its original context.
Automatic identification of Not Suitable For Work images
Arquivo.pt automatically performs broad crawls of web pages hosted under the .PT domain. Thus, some of the images archived may contain pornographic content that users do not want to be immediately displayed by default, for instance while using Arquivo.pt in a classroom.
The Image Search service retrieves images based on the filename, alternative text and the surrounding text of an image contained on a web page. Images returned to answer a search query may include offensive content even for inoffensive queries due to the prevalence of web spam.
The detection of NSFW (not suitable for work) content on the archived Web pages from the Internet is challenging due to the scale (billions of images) and the diversity (small to very large images, graphic, colour images, among others) of image content.
Currently, Arquivo.pt applies an NSFW image classifier trained with over 60 GB of images scraped from the web. Instead of simply labelling images as safe or not safe, this classifier returns the probability of an image belonging to each of five categories: drawing (SFW drawings), neutral (SFW photographic images), hentai (explicit drawings), porn (explicit photographic images), and sexy (potentially suggestive images that are not pornographic, e.g. a woman in a bikini). An overall nsfw score is then computed as the sum of the hentai and porn probabilities.
By default, Arquivo.pt hides pornographic images from the search results if their nsfw score is higher than 0.5. This filter can be disabled by the user through the Advanced Image Search interface.
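The filtering rule just described can be sketched in a few lines: combine the per-category probabilities into an nsfw score and hide results above the 0.5 threshold unless the user disables safe search. The field names below are illustrative, not the service's actual schema.

```python
# Minimal sketch of the default NSFW filter: nsfw = hentai + porn,
# and results scoring above 0.5 are hidden unless safe search is off.

NSFW_THRESHOLD = 0.5

def nsfw_score(probs: dict) -> float:
    """Overall score: sum of the 'hentai' and 'porn' probabilities."""
    return probs.get("hentai", 0.0) + probs.get("porn", 0.0)

def filter_results(results, safe_search=True):
    """Drop results whose nsfw score exceeds the threshold (unless disabled)."""
    if not safe_search:
        return results
    return [r for r in results if nsfw_score(r["probs"]) <= NSFW_THRESHOLD]

images = [
    {"url": "a.gif", "probs": {"neutral": 0.9, "porn": 0.05}},
    {"url": "b.jpg", "probs": {"porn": 0.6, "hentai": 0.1}},
]
print([r["url"] for r in filter_results(images)])  # ['a.gif']
```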
Image Search API
Arquivo.pt developed a free and open Image Search API so that third-party software developers can integrate the Arquivo.pt image search results into their applications and, for instance, apply for the annual Arquivo.pt Awards.
The ImageSearch API supports keyword-to-image search and provides access to preserved web content and related metadata. The API returns a JSON object containing the metadata elements also available through the “Details” button.
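The sketch below shows how a client might form an ImageSearch request. The `q`, `from`, `to`, and `maxItems` parameter names are assumptions based on common usage of the Arquivo.pt APIs; consult the official API documentation for the authoritative parameter list before integrating.

```python
# Build an ImageSearch request URL. Parameter names here are assumptions,
# not a definitive description of the API; check the official docs.
from urllib.parse import urlencode

API = "https://arquivo.pt/imagesearch"

def image_search_url(query, ts_from=None, ts_to=None, max_items=50):
    params = {"q": query, "maxItems": max_items}
    if ts_from:
        params["from"] = ts_from  # timestamps as YYYYMMDDHHMMSS
    if ts_to:
        params["to"] = ts_to
    return f"{API}?{urlencode(params)}"

url = image_search_url("christmas",
                       ts_from="19960101000000", ts_to="20001231235959")
```

Fetching this URL would return the JSON object described above, ready to be rendered or post-processed by a third-party application.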
Scientific and technical contributions
There are several services that enable image search over web collections (e.g. Google Images). However, the literature published about them is very limited and even less research has been published about how to search images in web archives.
Moreover, supporting image search over the historical web data preserved by web archives raises new challenges that live-web search engines do not need to address: dealing with multiple versions of images and pages referenced by the same URLs, handling duplication of web-archived images over time, and ranking search results in light of the temporal features of historical web data published over decades.
Developing and maintaining an Image Search engine over the Arquivo.pt web archive yielded scientific and technical contributions by addressing the following research questions:
How to extract relevant textual content in web pages that best describes images?
How to de-duplicate billions of archived images collected from the web over decades?
How to index and rank search results over web-archived images?
The main contributions of our work are:
A toolkit of algorithms that extract textual metadata to describe web-archived images
A system architecture and workflow to index large amounts of web-archived images considering their specific temporal features
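On the de-duplication question, a common building block is to hash image bytes and keep one representative per digest. The sketch below illustrates that idea under simplified assumptions (in-memory data, earliest capture kept); Arquivo.pt's actual pipeline is more elaborate, e.g. it must also track every capture timestamp per image across decades of crawls.

```python
# Minimal content-hash de-duplication sketch: identical payloads collapse
# to one entry, keyed by SHA-256 digest, keeping the earliest capture.
import hashlib

def deduplicate(captures):
    """Keep the earliest capture of each distinct image payload.

    `captures` is an iterable of (timestamp, image_bytes) pairs.
    Returns {digest: (first_timestamp, image_bytes)}.
    """
    seen = {}
    for ts, data in sorted(captures, key=lambda c: c[0]):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen[digest] = (ts, data)
    return seen

caps = [
    ("20050101", b"GIF89a..."),
    ("20060101", b"GIF89a..."),   # identical bytes: a duplicate capture
    ("20060101", b"\x89PNG..."),
]
print(len(deduplicate(caps)))  # 2
```

At the scale of billions of images this is typically run as a distributed batch job rather than in memory, but the digest-keyed grouping is the same.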
By Kristinn Sigurðsson, Head of Digital Projects and Development at the National and University Library of Iceland, and Georg Perreiter, Software Developer at the National and University Library of Iceland.
Here at the National and University Library of Iceland (NULI) we have over the last couple of years eagerly awaited each new deliverable of the IIPC funded pywb project, developed by Webrecorder’s Ilya Kreymer. Last year Kristinn wrote a blog post about our adoption of OutbackCDX based on the recommendation from the OpenWayback to pywb transition guide that was a part of the first deliverable. In that post he noted that we’d gotten pywb to run against the index but there were still many issues that were expected to be addressed as the pywb project continued. Now that the project has been completed, we’d like to use this opportunity to share our experience of this transition.
As Kristinn is a member of the IIPC’s Tools Development Portfolio (TDP) – which oversees the project – this was partly an effort on our behalf to help the TDP evaluate the project deliverables. Primarily, however, this was motivated by the need to be able to replace our aging OpenWayback installation.
It is worth noting that prior to this project, we had no experience with using Python based software beyond some personal hobby projects. We were (and are) primarily a “Java shop.” We note this as the same is likely true of many organizations considering this switch. As we’ll describe below, this proved quite manageable despite our limited familiarity with Python.
Get pywb Running
The first obstacle we encountered related to the required Python version: pywb requires Python 3.8, but our production environment, running Red Hat Enterprise Linux (RHEL) 7, defaulted to Python 3.6, so we had to install Python 3.8 alongside it. We also had to learn how to use a Python virtual environment so we could run pywb in isolation, and then how to resolve site-package conflicts using Python’s package manager (pip3) arising from differences between Ubuntu and RHEL.
Of course, all of that could be avoided if you deploy pywb on a machine with a compatible version of Python or use pywb’s Docker image. Indeed, when we first set up a test instance on a “throwaway” virtual machine, we were able to get pywb up and running against our OutbackCDX in a matter of minutes.
Our web archive is open to the world. However, we do need to limit access to a small number of resources. With OpenWayback this has been handled using a plain text exclusion file. We were able to use pywb’s wb-manager command line tool to migrate this file to the JSON based file format that pywb uses. The only issue we ran into was that we needed to strip out empty lines and comments (i.e. lines starting with #) before passing it to this utility.
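The pre-processing step we needed is simple enough to show in full: strip blank lines and `#` comments from the plain-text exclusion list before handing it to `wb-manager`. This is a minimal sketch with made-up file contents, not the exact script we ran.

```python
# Clean an OpenWayback-style exclusion list before migrating it with
# pywb's wb-manager: drop empty lines and '#' comment lines.

def clean_exclusion_lines(lines):
    """Return only non-empty, non-comment lines, stripped of whitespace."""
    return [ln for ln in (l.strip() for l in lines)
            if ln and not ln.startswith("#")]

raw = [
    "# private material",
    "",
    "http://example.is/private/",
    "http://example.is/internal/  ",
]
print(clean_exclusion_lines(raw))
```

The cleaned lines can then be written back out and passed to `wb-manager` for conversion to pywb's JSON-based format.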
Making pywb Also Speak Icelandic
We want our web archive user interface to be available in both Icelandic and English. When adopting OpenWayback, we ran into issues with such internationalization (i18n) support and ultimately just translated it into Icelandic and abandoned the i18n effort. pywb already supported i18n and further support and documentation of this was one of the elements of the IIPC pywb project. So we very much wanted to take advantage of this and fully support both languages in our pywb installation.
We found the documentation describing this process to be very robust and easy to follow. Following it, we installed pywb’s i18n tool, added an “is” locale and edited the provided CSV file to include Icelandic translations.
Along the way we had a few minor issues with textual elements that were hard coded and translations could not be provided for. This was notably more common in new features being added, as one might expect. We were, in a sense, acting as beta testers of the software, picking up each new update as it came, so this isn’t all that surprising. We reported these omissions as we discovered them and they were quickly addressed.
One remaining issue involved locale-dependent date formatting; we were able to work around it in Chrome by using a German locale, as none of the date formatting patterns relied on outputting the names of days or months.
Making pywb Fit In
Here at NULI we have a lot of websites. To help us maintain a “brand” identity, we – to the extent possible – like them to have a consistent look and feel. So, in addition to making pywb speak Icelandic, we wanted it to fit in.
Much like i18n, UI customizations were identified as being important to many IIPC members and additional support for and documentation of that was included in the IIPC pywb project. Following the documentation, we found the customization work to be very straightforward.
You can easily add your own templates and static files or copy and modify the existing ones. As you can always remove your added files, there is no chance of messing anything up.
As you can see on our website, we were able to bring our standard theme to pywb.
Additionally, we added about 20 lines of code to frontendapp.py to allow serving additional localized static content, fed by an extra template (including a header and footer) that loads static HTML files as content. This allowed us to add a few extra web pages to serve our FAQ and some other static content. This was our only “hack” and is, of course, only needed if you want to serve static content directly from pywb (as opposed to linking to another web host).
New Calendar & Sparkline and Performance
The final deliverable of the IIPC funded pywb project included the introduction of a new calendar-based search result page and a “sparkline” navigation element into the UI header. These were both features found in OpenWayback and, in our view, the last “missing” functionality in pywb. We were very happy to see these features in pywb but also discovered a performance problem.
Our web archive is by no means the largest one in the world. It is, however, somewhat unique in that it contains some pages with over one hundred thousand captures (yes, 100,000+). These mostly come from our RSS-based crawls, which capture news sites’ front pages every time a new item appears in the RSS feed. The largest is likely the front page of our state broadcaster (RÚV), with 159,043 captures available as we write this (and probably another thousand or so waiting to be indexed).
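To get a sense of the scale the calendar and sparkline views must handle, capture counts per URL can be tallied from index lines. A minimal sketch, assuming space-separated CDX-style lines whose first field is the canonicalized URL key (the example lines are made up):

```python
# Tally captures per URL key from CDX-style index lines. Assumes the
# first whitespace-separated field is the canonicalized URL key.
from collections import Counter

def captures_per_urlkey(cdx_lines):
    counts = Counter()
    for line in cdx_lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts

cdx = [
    "is,ruv)/ 20220101000000 https://www.ruv.is/ text/html 200 ...",
    "is,ruv)/ 20220101001500 https://www.ruv.is/ text/html 200 ...",
    "is,example)/ 20220101000000 http://example.is/ text/html 200 ...",
]
print(captures_per_urlkey(cdx)["is,ruv)/"])  # 2
```

A replay tool rendering a calendar for a URL key with 159,043 such lines has to fetch and group all of them, which is why caching matters at this scale.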
The initial version of the calendar and sparkline struggled with these URLs. After we reported the issue, some improvements were made involving additional caching and “loading” animations so users would know the system was busy instead of just seeing a blank screen. This improved matters notably, but pywb’s performance under these circumstances could stand further improvement.
We recognize that our web archive is somewhat unusual in having material like this. However, as time goes on, archives with a high number of captures of the same pages will only increase, so this is worth considering in the future.
We’ve been very pleased with this migration process. In particular we’d like to commend the Webrecorder team for the excellent documentation now available for pywb. We’d also like to acknowledge all testing and vetting of the IIPC pywb project deliverables that Lauren Ko (UNT and member of the IIPC TDP) did alongside – and often ahead of – us.
We can also reaffirm Kristinn’s recommendation from last year to use OutbackCDX as a backend for pywb (and OpenWayback). Having a single OutbackCDX instance powering both our OpenWayback and pywb installations notably simplified the setup of pywb and ensured we only had one index to update.
We still have pywb in a public “beta” – in part due to the performance issues discussed above – while we continue to test it. But we expect it will replace OpenWayback as our main replay interface at some point this year.
Today is a doubly special day for the UK Government Web Archive (UKGWA): as well as celebrating World Digital Preservation Day, we mark our 25th Anniversary.
For this occasion, we are releasing blog posts and a series of social media posts with facts and statistics, as well as a time-lapse video of the GOV.UK website that shows how the design of the site has evolved and how the Government has communicated with the public. See @UKNatArchives and https://www.facebook.com/TheNationalArchives/ for more.
Maintaining and preserving the government web estate and its interlinking network of resources was the driver for the original Web Continuity Initiative. The TNA Web Archiving team began by capturing 50 websites in 2003, but the collection dates from 1996 when we received copies of UK government websites from the Internet Archive. Since then, we have been expanding our collection and increasing our capacity to handle the growing volume and complexity of websites over time.
The UK Government’s use of the web is extensive: websites present information, act as document stores, and provide dynamic transactional services. Often, the web is the only place where that information is available. There is a tension between providing up-to-date information and ensuring that published information remains available in its original context for future reference; this is where the UKGWA steps in.
The UKGWA is a comprehensive, cloud-based archive, freely accessible to everyone, including students, historians, researchers, government employees, businesses, and journalists – and it has become a reference in web archiving.
The numbers are impressive: there are around 6,400 websites in our collection, along with over 644 social media accounts across YouTube, Twitter, and Flickr, with over 1.9 million posts archived. Around 63% of the pages we link to from our A-Z list are no longer available on the live web. If the UKGWA were not here, much of the government’s online content published since 1996 would probably have been lost and would be unavailable to the public.
As important as it is to preserve our collection, it is vital to provide a great experience for our users. We are currently investing in more user-friendly guidance for website owners. The new documentation aims to provide a straightforward explanation of the archiving process and of the requirements for successfully capturing and publishing websites.
We are continually upgrading the technologies used to capture and replay websites to ensure the highest fidelity possible – a website in the archive should look and function, as far as possible, like the original site. We are investing in auto-QA technologies to ensure our collection is of the highest possible quality. And we are focusing on opening up the data in our collection for researchers: we have been working on the development of tools and methods for this purpose, which you will hear more about in 2022!
By Andrew Jackson, Web Archiving Technical Lead, UK Web Archive, British Library
It’s World Digital Preservation Day 2021 #WDPD2021 so this is a good opportunity to give an update on what is going on today at the DPC Award Winning UK Web Archive.
Domain and frequent crawls
The 2021 domain crawl is still running. There’s been a few ups and downs, and it’s not going as fast as we’d ideally like, but it’s chugging away at 200-250 URLs a second, on track to get to around two billion URLs by the end of the year.
We run two main crawl streams, ‘frequent’ and ‘domain’ crawls. The ‘frequent’ one gets fresh content from thousands of sites everyday. It also takes screenshots while it goes, so in the future we will know what the site was supposed to look like!
We’ve been systematically taking thousands of screenshots every day since about 2018.
e.g. here’s how Twitter’s UI changed since January 2020 #PureDigiPres (who remembers Moments?)
The frequent crawls also refresh site maps every day, to make sure we know when content is added or updated. This is how websites make sure search engine indexes are up-to-date, and we can take advantage of that!
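The sitemap-refresh idea can be sketched with the standard library: parse a sitemap and pick out URLs whose `<lastmod>` is newer than the previous crawl. The sitemap content below is a made-up example; our production crawler's logic is more involved.

```python
# Find URLs in a sitemap whose <lastmod> is newer than the last crawl date.
# Uses the standard sitemaps.org namespace and ISO date string comparison.
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def updated_urls(sitemap_xml, since):
    """Return URLs with a <lastmod> date strictly after `since` (ISO dates)."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if lastmod and lastmod > since:
            urls.append(loc)
    return urls

sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.uk/a</loc><lastmod>2021-11-01</lastmod></url>
  <url><loc>https://example.uk/b</loc><lastmod>2021-10-01</lastmod></url>
</urlset>"""
print(updated_urls(sitemap, "2021-10-15"))  # ['https://example.uk/a']
```

Only the URLs returned here need to be re-fetched, which keeps the daily frequent crawls efficient.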
The ‘domain’ crawl is different – it runs once a year and attempts to crawl every known UK website once. It gathers around two billion URLs from millions of websites, so it’s a bit of a challenge to run.
The last two domain crawls have been run in the cloud, which has brought many new challenges, but also makes it much easier to experiment with different virtual hardware configurations.
All in all, we gather around 150TB of web content each year, and we’re over a petabyte in total now. This graph shows how the crawls have grown over the years (although it doesn’t include the 2020/2021 domain crawls as they are still in the cloud).
The legal framework we operate under means we can’t make all of this available openly on our website, but the curatorial team work on getting agreements in place to allow this where we can, as well as telling the system what should be crawled.
Open source tools
All that hard work means many millions of web pages are openly accessible via www.webarchive.org.uk/ – you can look up URLs and also run full-text faceted searches. Not all web archives offer full-text search, and it’s been a challenge for us. Our search indexes are not as up-to-date or complete as we’d like, but we’re working on it. At least we can be glad that we’re not working on it alone: our search indexing tools are open source and are now in use at a number of web archives across the world. This makes me very happy, as I believe our under-funded sector can only sustain the custom tools we need if we find ways to share the load.
This is why almost the entire UK Web Archive software stack is open source, and all our custom components can be found at https://github.com/ukwa/. Where appropriate, we also support or fund developments in open source tools beyond our own, e.g. the Webrecorder tools – we want everyone to be able to make their own web archives!
We also collaborate through the IIPC and with their support we help run Online Hours to support open source users and collaborators. These are regular online videocalls to help us keep in touch with colleagues across the world: see here for more details – Join us!
We’re very fortunate to be able to work in the open, with almost all code on GitHub. Some of our work has been taken up and re-used by others. Among these collaborators, I’d like to highlight the work of the Danish Net Archive. They understand Apache Solr better than we do, and are working on an excellent search front-end called SolrWayback.
As well as full-text search, we also use this index to support activities specific to digital preservation by performing format analysis and metadata extraction during the text extraction process, and indexing that metadata as additional facets. We’ve not had time to make all this information available to the public, but some of it is accessible.
Working with data and researchers
Our Shine search prototype covers older web archive material held by the Internet Archive and if you know the Secret Squirrel Syntax, you can poke around in there and look for format information e.g. by MIME type, by file extension or by first-four-bytes. We also generate datasets of information extracted from our collections, including format statistics, and make this available as open data via the new Shared Research Repository. But again, we don’t always have time to work on this, so keeping those up to date is a big challenge.
One way to alleviate this is to partner with researchers in projects that can fund the resources and bring in the people to do the data extraction and analysis, while we facilitate access to the data and work with our British Library Open Research colleagues to give the results a long-term home. This is what happened with the recent work on how words have changed meaning over time, a research project led by Barbara McGillvray, and we’d like to support these kinds of projects in the future too.
Another tactic is to open up as much metadata as we can, and provide it via APIs that others can then build on. This was the notion behind our recent IIPC-funded collaboration with Tim Sherratt to add a web archives section to the GLAM Workbench, a collection of tools and examples to help you work with data from galleries, libraries, archives, and museums. We hope the workbench will be a great starting point for anyone wanting to do research with and about web archives.
If you’d like to know more about any of this, feel free to get in touch with me or with the UK Web Archive.
By Alicia Pastrana García and José Carlos Cerdán Medina, National Library of Spain
In the last 20 years, most web archives have been building their web collections. These will become all the more valuable as the years go by, since much of this information will no longer exist on the live Internet. But do we have to wait that long for our collections to be useful?
The huge collection that the National Library of Spain (BNE) has built since 2009 has emerged as one of the largest linguistic corpora of contemporary language. For this reason, the BNE has collaborated with the Barcelona Supercomputing Center (BSC) to create the first massive AI model of the Spanish language. This collaboration takes place within the framework of the Language Technologies Plan of the State Secretariat for Digitization and Artificial Intelligence of the Ministry of Economic Affairs and Digital Agenda of Spain.
The National Library of Spain has been harvesting information from the web for more than 10 years. The Spanish Web Archive is still young but it already contains more than a Petabyte of information.
The Barcelona Supercomputing Center (BSC), for its part, is the leading supercomputing center in Spain. It offers infrastructure and supercomputing services to Spanish and European researchers in order to generate knowledge and technology for society.
The Spanish Web Archive, like most national library web archives, is based on a mixed model combining broad and selective crawls. The broad crawls harvest as many Spanish domains as possible without going very deep into the navigation levels; their scope is the .es domain. Selective crawls complement the broad crawls by harvesting a smaller sample of websites in greater depth and with greater frequency. These sites are selected for their relevance to history, society, and culture, and the selective crawls include other kinds of domains (.org, .com, etc.).
Web Curators, from the BNE and the regional libraries, select the seeds that will be part of these collections. They assess the relevance of the websites from the heritage point of view and the importance for research and knowledge in the future.
For this project we chose the content harvested on selective crawls, a collection of around 40,000 websites.
How to prepare WARC files
The collections are stored in WARC (Web ARChive) files. The BSC needed only the text extracted from the WARC files to train the language models, so everything else was removed using a specific script. The script uses a parser to keep exclusively the HTML text tags (paragraphs, headlines, keywords, etc.) and to discard everything that was not useful for the purpose of the project (e.g. images, audio, video).
This parser is an open-source Python module called Selectolax. It is seven times faster than comparable parsers and easily customizable. Selectolax can be configured to keep tags that contain text and to discard those that are not useful for the project. At the end of the process, the script generated JSON files organized according to the selected HTML tags, with the information structured into paragraphs, headlines, keywords, etc. These files are not only useful for the project, but will also help us improve the Spanish Web Archive’s full-text search.
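The project's pipeline uses Selectolax, but the underlying idea (keep text-bearing tags, emit structured JSON) can be illustrated with a dependency-free sketch built on Python's standard-library HTMLParser. The tag-to-bucket mapping below is a simplification of what such a script might use.

```python
# Minimal sketch of tag-filtered text extraction: collect text inside a
# small set of "kept" tags into labelled buckets, ignore everything else.
import json
from html.parser import HTMLParser

KEEP = {"p": "paragraphs", "h1": "headlines", "h2": "headlines"}

class TextTagExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = {"paragraphs": [], "headlines": []}
        self._bucket = None

    def handle_starttag(self, tag, attrs):
        self._bucket = KEEP.get(tag)  # None for tags we discard

    def handle_endtag(self, tag):
        self._bucket = None

    def handle_data(self, data):
        if self._bucket and data.strip():
            self.out[self._bucket].append(data.strip())

html = "<h1>MarIA</h1><p>Un modelo del español.</p><img src='x.png'>"
ex = TextTagExtractor()
ex.feed(html)
print(json.dumps(ex.out, ensure_ascii=False))
```

The resulting JSON, organized by tag role, is the kind of intermediate file the article describes handing to the BSC.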
All of this work was done at the Library itself in order to obtain more manageable files. The huge volume of information was a real challenge: it was not easy to transfer the files to the BSC, where the supercomputer is located, hence the importance of starting the cleaning process at the Library.
Once at the BSC, a second cleaning process was run. The BSC project team removed everything that is not well-formed text (unfinished or duplicated sentences, erroneous encodings, other languages, etc.). The result was well-formed Spanish text only, as the language is actually used.
BSC used the MareNostrum supercomputer, the most powerful computer in Spain and the only one capable of processing such a volume of information in a short time frame.
The language model
Once the files were prepared, the BSC used a neural-network technology based on the Transformer architecture, already proven for English, and trained it to learn to use the language. The result is an AI model that understands the Spanish language, its vocabulary, and its mechanisms for expressing meaning and writing at an expert level. The model can also understand abstract concepts and deduce the meaning of words from the context in which they are used.
This model is larger and better than the other models of the Spanish language available today. It is called MarIA and it is open access. The project represents a milestone both in the application of artificial intelligence to the Spanish language and in collaboration between national libraries and research centers; it is a good example of the value of cooperation between different institutions with common objectives. The possible uses of MarIA are many: language correctors and predictors, automatic summarization apps, chatbots, smart search, translation engines, automatic captioning, etc. These are all broad fields that promote the use of Spanish in technological applications, helping to increase its presence in the world. In this way, the BNE fulfils part of its mission: promoting scientific research and the dissemination of knowledge, and helping to transform information into technology accessible to all.
By Emmanuelle Bermès, Deputy Director for Services and Networks at BnF and Membership Engagement Portfolio Lead
During the refresh of our consortium agreement in 2016, three portfolios were created to lead on strategy in three areas: membership engagement, tools development, and partnerships and outreach. The first major project for the Membership Engagement Portfolio was to survey our members and find common ground for potential collaborations. While this remains our goal in the new Strategic Action Plan, we would also like to focus on regular conversations with our members, in the spirit of the updates we have at our General Assembly (GA), and on supporting regional initiatives.
Following up on this, on Monday 13 September, the Membership Engagement Portfolio hosted two calls open to all members, scheduled for two slots to accommodate different time zones.
The past months have made it much more complicated to share, engage and work together. Our yearly event, the GA/WAC, held fully online by the National Library of Luxembourg, was very successful, and we all enjoyed this opportunity to feel the strength and vitality of our community. So we thought that opening up more options for informal communication online could also be a way to keep up the momentum.
We also wanted this Members Call to be a moment to share updates on what's happening in your organizations. With short presentations from the Library of Congress, The National Archives (UK), the ResPaDon project at the Bibliothèque nationale de France, the National Széchényi Library in Hungary, the National and University Library in Zagreb, the National Library of New Zealand and the National Library of Australia, the agenda was very rich. One of the major outcomes of these two calls may be a future workshop on capturing social media, as this was a topic of great interest in both meetings. The workshop will be organised by the IIPC and the National and University Library in Zagreb. If you're interested in contributing to this topic, please let us know at events[at]netpreserve.org.
We hope this shared moment can become a regular meeting point between members, where you can share your institution's hot topics and let us know your expectations about IIPC activities. We hope you will join us for this conversation during our next call, expected to take place on Wednesday, 15th of December (UTC). Abbie Grotke and Karolina Holub, Membership Engagement Portfolio co-leads, Olga Holownia and I are looking forward to meeting you there.
The Steering Committee, composed of no more than fifteen Member Institutions, provides oversight of the Consortium and defines and oversees its strategy. This year five seats are up for election/re-election. In response to the call for nominations to serve on the IIPC Steering Committee for a three-year term commencing 1 January 2022, six IIPC member organisations have put themselves forward:
The election will be held from 15 September to 15 October. The IIPC designated representatives from all member organisations will receive an email with instructions on how to vote. Each member will be asked to cast five votes, and representatives should ensure that they read all the nomination statements before voting. The results will be announced on the Netpreserve blog and the Members mailing list on 18 October. The first Steering Committee meeting will be held online.
If you have any questions, please contact the IIPC Senior Program Officer.
Nomination statements in alphabetical order:
Bibliothèque nationale de France / National Library of France
The National Library of France (BnF) started its web archiving programme in the early 2000s and now holds an archive of nearly 1.5 petabytes. We develop national strategies for the growth and outreach of web archives and host several academic projects in our Data Lab. We use and share expertise about key tools for IIPC members (Heritrix 3, OpenWayback, NetarchiveSuite, webarchive-discovery) and contribute to the development of several of them. We have also developed BCweb, an open source application for seed selection and curation, shared with other national libraries in Europe.
The BnF has been involved in IIPC since its very beginning and remains committed to the development of a strong community, not only in order to sustain these open source tools but also to share experiences and practices. We have attended, and frequently actively contributed to, general assembly meetings, workshops and hackathons, and most IIPC working groups.
The BnF chaired the consortium in 2016-2017 and currently leads the Membership Engagement Portfolio. Our participation in the Steering Committee, if continued, will be focused as ever on making web archiving a thriving community, engaging researchers in the study of web archives and further developing access strategies.
The British Library
The British Library is an IIPC founding member and has enjoyed active engagement with the work of the IIPC. This has included leading technical workshops and hackathons; helping to co-ordinate and lead member calls and other resources for tools development; co-chairing the Collection Development Group; hosting the Web Archive Conference in 2017; and participating in the development of training materials. In 2020, the British Library, with Dr Tim Sherratt, the National Library of Australia and National Library of New Zealand, led the IIPC Discretionary Funding project to develop Jupyter notebooks for researchers using web archives. The British Library hosted the Programme and Communications Officer for the IIPC up until the end of March this year, and has continued to work closely on strategic direction for the IIPC. If elected, the British Library would continue to work on IIPC strategy, and collaborate on the strategic plan. The British Library benefits a great deal from being part of the IIPC, and places a high value on the continued support, professional engagement, and friendships that have resulted from membership. The nomination for membership of the Steering Committee forms part of the British Library’s ongoing commitment to the international community of web archiving.
Deutsche Nationalbibliothek / German National Library
The German National Library (DNB) has been archiving the web since 2012. Legal deposit in Germany covers websites and all kinds of digital publications, such as e-books, e-journals and e-theses. The selective web archive currently includes about 5,000 sites with 30,000 crawls, and it is planned to expand the collection to a larger scale. Crawling, quality assurance, storage and access are done together with a service provider rather than with common tools like Heritrix and the Wayback Machine.
Digital preservation has always been an important topic for the German National Library. DNB has worked on concepts and solutions in this area in many international and national projects and co-operations. Nestor, the German network of expertise in the long-term storage of digital resources, has its office at the DNB, and the Preservation Working Group of the IIPC was co-led by the DNB for many years. On the IIPC Steering Committee, the German National Library would like to advance the joint preservation of the web.
Det Kongelige Bibliotek / Royal Library of Denmark
Royal Danish Library (in charge of Netarkivet, the Danish national web archiving programme) will serve the IIPC SC with expertise in web archiving built up since 2001. Netarkivet now holds a collection of more than 800 TB and is active in the open source development of web archiving tools such as NetarchiveSuite and SolrWayback. The representative from RDL will bring the IIPC more than 20 years of web archiving experience, offering the SC both technical and strategic competences, along with skills in financial management, budgeting and project portfolio management. Royal Danish Library was among the founding members of the IIPC, served on the SC for a number of years, and is now ready for another term.
Koninklijke Bibliotheek / National Library of the Netherlands
At the National Library of the Netherlands (KBNL), our work is fueled by the power of the written word. The library preserves stories, essays and ideas, both printed and digital. When people come into contact with these words, whether through reading, studying or conducting research, it has an impact on their lives. With this perspective in mind, we find it of vital importance to preserve web content for future generations.
We believe the IIPC is an important network organization that brings together ideas, knowledge and best practices on how to preserve the web and retain access to its information in all its diversity. In past years, KBNL used its voice in the SC to raise awareness of the sustainability of tools (as we do by improving the Web Curator Tool), pointed out the importance of quality assurance, and co-organized the WAC 2021. Furthermore, we shared our insights and expertise on preservation in webinars and workshops. We have recently joined the Partnerships & Outreach Portfolio.
We would like to continue this work and bring together more organizations, large and small, across the world, to learn from each other and to ensure that web content remains findable, accessible and re-usable for generations to come.
The National Archives (UK)
The National Archives (UK) is an extremely active web archiving practitioner and runs two open access web archive services – the UK Government Web Archive (UKGWA), which also includes an extensive social media archive, and the EU Exit Web Archive (EEWA). While our scope is limited to information produced by the government of the UK, we have nonetheless built up our collections to over 200TB.
Our team has grown in capacity over the years and we are now increasingly becoming involved in research initiatives that will be relevant to the IIPC’s strategic interests.
With over 35 years’ collective team experience in the field, through building and running one of the largest and most used open access web archives in the world, we believe that we can provide valuable experience and we are extremely keen to actively contribute to the objectives of the IIPC through membership of the Steering Committee.
One thing I quickly noticed as I read through the guide is that it recommends using OutbackCDX as a backend for PyWb, rather than continuing to rely on flat, sorted CDX files. PyWb does support flat CDXs, as long as they are in the 9- or 11-column format, but a convincing argument is made that using OutbackCDX for resolving URLs is preferable, whether you use PyWb or OpenWayback.
What is OutbackCDX?
OutbackCDX is a tool created by Alex Osborne, Web Archive Technical Lead at the National Library of Australia. It handles the fundamental task of indexing the contents of web archives: mapping URLs to contents in WARC files.
A “traditional” CDX file (or set of files) accomplishes this by listing each and every URL, in order, in a simple text file, along with information such as which WARC file it is stored in. This has the benefit of simplicity and can be managed using standard GNU tools, such as sort. Plain CDXs, however, make inefficient use of disk space, and as they grow they become increasingly difficult to update, because inserting even a small amount of data into the middle of a large file requires rewriting a large part of that file.
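As a minimal illustration of how such a flat index is maintained with GNU tools (the CDX fields are abbreviated here to SURT key, timestamp and WARC filename), sorted per-crawl fragments can be merged into a master index with `sort -m`, and any insertion means rewriting the merged file:

```shell
# Two already-sorted CDX fragments (fields abbreviated for the example)
mkdir -p /tmp/cdxdemo
printf 'com,example)/ 20200101000000 crawl-a.warc.gz\n' >  /tmp/cdxdemo/a.cdx
printf 'org,example)/ 20200102000000 crawl-a.warc.gz\n' >> /tmp/cdxdemo/a.cdx
printf 'com,example)/about 20200103000000 crawl-b.warc.gz\n' > /tmp/cdxdemo/b.cdx

# Merge-sort keeps the master index in byte order (LC_ALL=C, as CDXs expect);
# every update effectively rewrites the whole merged file.
LC_ALL=C sort -m /tmp/cdxdemo/a.cdx /tmp/cdxdemo/b.cdx > /tmp/cdxdemo/index.cdx
```

A binary search over the sorted master file is then what resolves a URL lookup.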
OutbackCDX improves on this by using a simple but powerful key-value store, RocksDB. The URLs are the keys, and the remaining info from the CDX is the stored value. RocksDB then does the heavy lifting of storing the data efficiently and providing speedy lookups and updates. Notably, OutbackCDX enables updates to the index without any disruption to the service.
Given all this, transitioning to OutbackCDX for PyWb makes sense. But OutbackCDX also works with OpenWayback. If you aren’t quite ready to move to PyWb, adopting OutbackCDX first can serve as a stepping stone. It offers enough benefits all on its own to be worth it. And, once in place, it is fairly trivial to have it serve as a backend for both OpenWayback and PyWb at the same time.
So, this is what I decided to do. Our web archive, Vefsafn.is, has been running on OpenWayback with a flat file CDX index for a very long time. The index has grown to 4 billion URLs and takes up around 1.4 terabytes of disk space. Time for an upgrade.
Of course, there were a few bumps on that road, but more on that later.
Installing OutbackCDX was entirely trivial. You get the latest release JAR, run it like any standalone Java application and it just works. It takes a few parameters to determine where the index should be, what port it should be on and so forth, but configuration really is minimal.
Unlike OpenWayback, OutbackCDX is not installed into a servlet container like Tomcat, but instead (like Heritrix) comes with its own built-in web server. End users do not need access to it, so it may be advisable to configure it to be accessible only internally.
Building the Index
Once it is running, you'll need to feed your existing CDXs into it. OutbackCDX can ingest most commonly used CDX formats, certainly all that PyWb can read. CDX files can simply be “posted” to OutbackCDX using a command line tool like curl.
In our environment, we keep a gzipped CDX for each (W)ARC file, in addition to the merged, searchable CDX that powered OpenWayback. I initially wrote a script that looped through the whole batch and posted them one at a time. I realized, though, that the number of URLs ingested per second was much higher for CDXs that contained a lot of URLs: there is an overhead to each post. On the other hand, you can't just post your entire mega-CDX in one go, as OutbackCDX will run out of memory.
Ultimately, I wrote a script that posted about 5 MB of my compressed CDXs at a time. Using it, I was able to add all ~4 billion URLs in our collection to OutbackCDX in about 2 days. I should note that our OutbackCDX is on high-performance SSDs, the same as our regular CDX files used.
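The batching idea can be sketched as follows (a hypothetical reimplementation, not the author's script; the endpoint URL and index name `myindex` are examples, and OutbackCDX accepts plain CDX lines POSTed to the index URL):

```python
import urllib.request

BATCH_BYTES = 5 * 1024 * 1024  # post roughly 5 MB of CDX text per request
OUTBACKCDX = "http://localhost:8080/myindex"  # example host and index name

def batches(lines, max_bytes=BATCH_BYTES):
    """Group CDX lines into chunks of roughly max_bytes each."""
    chunk, size = [], 0
    for line in lines:
        chunk.append(line)
        size += len(line.encode("utf-8"))
        if size >= max_bytes:
            yield "".join(chunk)
            chunk, size = [], 0
    if chunk:
        yield "".join(chunk)

def post_cdx(lines):
    """POST each batch to OutbackCDX; moderate batch sizes amortize the
    per-request overhead without exhausting the server's memory."""
    for body in batches(lines):
        req = urllib.request.Request(
            OUTBACKCDX, data=body.encode("utf-8"), method="POST")
        urllib.request.urlopen(req).read()
```

Feeding it the decompressed lines of each per-crawl CDX, in any order, is enough, since RocksDB keeps the index sorted internally.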
Next up was to configure our OpenWayback instance to use OutbackCDX. This proved easy to do, but turned up some issues with OutbackCDX. First the configuration.
OpenWayback has a module called ‘RemoteResourceIndex’. This can be trivially enabled in the wayback.xml configuration file. Simply replace the existing `resourceIndex` with something like:
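A sketch of such a bean, based on the `RemoteResourceIndex` class shipped with OpenWayback (the port and the index name `myindex` are assumptions for a local OutbackCDX instance):

```xml
<property name="resourceIndex">
  <bean class="org.archive.wayback.resourceindex.RemoteResourceIndex">
    <property name="searchUrlBase" value="http://localhost:8080/myindex" />
  </bean>
</property>
```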
And OpenWayback will use OutbackCDX to resolve URLs. Easy as that.
This is, of course, where I started running into those bumps I mentioned earlier. It turns out there were a number of edge cases where OutbackCDX and OpenWayback had different ideas. Luckily, Alex, the aforementioned creator of OutbackCDX, was happy to help resolve them. Thanks again, Alex.
The first issue I encountered was due to the age of some of our ARCs. The date fields had variable precision: rather than all being exactly 14 digits long, some had less precision and were only 10-12 characters long. This was resolved by having OutbackCDX pad the shorter dates with zeros.
I also discovered some inconsistencies in the metadata supplied along with the query results. OpenWayback expected some fields that were either missing or misnamed. These were a little tricky, as they only affected some aspects of OpenWayback, most notably the metadata in the banner inserted at the top of each page. All of this has been resolved.
Lastly, I ran into an issue related not to OpenWayback but to PyWb. It stemmed from the fact that my CDXs are not generated in the 11-column CDX format. The 11-column format includes the compressed size of the WARC record holding the resource, and OutbackCDX was recording this value as 0 when it was absent. Unfortunately, PyWb didn't like this and would fail to load such resources. Again, Alex helped me resolve this.
OutbackCDX 0.9.1 is now the most recent release, and includes the fixes to all the issues I encountered.
Having gone through all of this, I feel fairly confident that swapping in OutbackCDX to replace a ‘regular’ CDX index for OpenWayback is very doable for most installations. And the benefits are considerable.
The size of the OutbackCDX index on disk ended up being about 270 GB. As noted before, the existing CDX index powering our OpenWayback was 1.4 TB. A reduction of more than 80%. OpenWayback also feels notably snappier after the upgrade. And updates are notably easier.
Next we will be looking at replacing OpenWayback with PyWb. I'll write more about that later, once we've made more progress, but I will say that having it run on the same OutbackCDX proved trivial to accomplish, and we now have a beta website up using PyWb at http://beta.vefsafn.is.
In this blog post I will go into the more technical details of SolrWayback and the new version 4.0 release. The whole frontend GUI was rewritten from scratch to meet the expectations of 2020 web applications, and many new features were implemented in the backend. I recommend reading the frontend blog post first; it has beautiful animated GIFs demonstrating most of the features in SolrWayback.
Live demo of SolrWayback
You can access a live demo of SolrWayback here. Thanks to National Széchényi Library of Hungary for providing the SolrWayback demo site!
Back in 2018…
The open source SolrWayback project was created in 2018 as an alternative to the Netarchive frontend applications existing at that time. At the Royal Danish Library we were already using Blacklight as a search frontend. Blacklight is an all-purpose Solr frontend application and is very easy to configure and install by defining a few properties, such as the Solr server URL, fields and facet fields. But since Blacklight is a generic Solr frontend, it had no special handling of the rich data structure we had in Solr. Also, binary data such as images and videos are not in Solr, so integration with the WARC-file repository can enrich the experience and make playback possible, since Solr holds enough information to work as a CDX server as well.
WARC-Indexer. Where the magic happens!
WARC files are indexed into Solr using the WARC-Indexer. The WARC-Indexer reads every WARC record, extracts all kinds of information and splits it into up to 60 different fields. It uses Tika to parse the many MIME types that can be encountered in WARC files. Tika extracts the text from HTML, PDF, Excel and Word documents, etc. It also extracts metadata from binary documents when present; the metadata can include created/modified time, title, description, author and so on. For images, it can also extract width/height or EXIF information such as latitude/longitude. The binary data themselves are not stored in Solr, but for every record in the WARC file there is a record in Solr. This also includes empty records, such as HTTP 302 (MOVED) responses with information about the new URL.
WARC-Indexer. Paying the price up front…
Indexing a large amount of WARC files requires massive amounts of CPU, but it is easily parallelized, as the WARC-Indexer takes a single WARC file as input. To give an idea of the requirements, indexing 700 TB (5.5M WARC files) took 3 months using 280 CPUs. Once the existing collection is indexed, it is easier to keep up with the incremental growth of the collection. So this is the drawback of using SolrWayback on large collections: the WARC files have to be indexed first.
Solr provides multiple ways of aggregating data, moving common netarchive statistics tasks from slow batch processing to interactive requests. Based on input from researchers, the feature set is continuously expanding with aggregation, visualization and extraction of data.
Due to the amazing performance of Solr, a query is often performed in less than 2 seconds in a collection with 32 billion (32×10⁹) documents, and this includes facets. The search results are not limited to HTML pages where the free text is found, but include every document that matches the search query. When presenting the results, each document type has a custom display for its MIME type.
HTML results are enriched with thumbnail images from the page as part of the result, images are shown directly, and audio and video files can be played directly from the results list with an in-browser player, or downloaded if the browser does not support the format.
Solr. Reaping the benefits from the WARC-indexer
The SolrWayback Java backend offers a lot more than just sending queries to Solr and returning the results to the frontend. Methods can aggregate data from multiple Solr queries or directly read WARC entries and return the processed data to the frontend in a simple format. Instead of re-parsing the WARC files, which is a very tedious task, the information can be retrieved from Solr, and the task can be done in seconds or minutes instead of weeks.
A wordcloud image is generated by extracting the text from 1,000 random HTML pages from the domain and generating a wordcloud from the extracted text.
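The aggregation step behind such a wordcloud is essentially a term-frequency count over the sampled pages' extracted text. A minimal sketch of that counting step (the stopword list is illustrative; the real one would be much larger and language-specific):

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "of", "to", "a", "in"}  # illustrative only

def top_terms(texts, n=50):
    """Count word frequencies across extracted page texts; the (word, count)
    pairs are what a wordcloud renderer sizes its words by."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-zA-Z]{2,}", text.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(n)
```

In SolrWayback the input texts come straight from the indexed content field of the sampled documents, so no WARC parsing is needed for this feature.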
By extracting the domains that link to a given domain (A) and also extracting the outgoing links from that domain (A), you can build a link graph. Repeating this for the newly found domains gives you a two-level local link graph for domain (A). Even though this can require hundreds of separate Solr queries, it is still done in seconds on a large corpus. Clicking a domain will highlight its neighbors in the graph (try the demo: interactive linkgraph).
Large scale linkgraph
Extraction of massive link graphs with up to 500K domains can be done in hours.
The exported link-graph data was rendered in Gephi and made zoomable and interactive using Graph presenter. Link graphs can be exported quickly because all links (a href) in each HTML record are extracted and indexed as part of the corresponding Solr document.
Freetext search can be used to find HTML documents. The HTML documents in Solr are already enriched with the image links on each page, without having to parse the HTML again. Instead of showing the HTML pages, SolrWayback collects all the images from the pages and shows them in a Google-like image search result. Under the assumption that the text on an HTML page relates to its images, you can find images that match the query: if you search for “cats” in the HTML pages, the results will most likely show pictures of cats. These pictures could not be found by searching the image documents alone if no metadata (or image name) contains “cats”.
CSV stream export
You can export result sets with millions of documents to a CSV file. Instead of exporting all 60 possible Solr fields for each result, you can pick which fields to export. This CSV export has already been used by several researchers at the Royal Danish Library, and it gives them the opportunity to use other tools, such as RStudio, to analyse the data. The National Széchényi Library demo site has disabled CSV export in the SolrWayback configuration, so it cannot be tested live.
WARC corpus extraction
Besides CSV export, you can also export a result set to a WARC file. The export reads the WARC entry for each document in the result set and copies the WARC header + HTTP header + payload, creating a new WARC file with all the results combined.
Extracting a sub-corpus this easily has already proven extremely useful for researchers. Examples include the extraction of a domain for a given date range, or a query restricted to a list of defined domains. This export is a 1-to-1 mapping from the results in Solr to the entries in the WARC files.
SolrWayback can also perform an extended WARC export, which includes all the resources (JS/CSS/images) for every HTML page in the export. The extended export ensures that playback will also work for the sub-corpus. Since the exported WARC file can become very large, you can use a WARC splitter tool, or simply split up the export into smaller batches by adding crawl year/month to the query, etc. The National Széchényi Library demo site has disabled WARC export in the SolrWayback configuration, so it cannot be tested live.
SolrWayback playback engine
SolrWayback has a built-in playback engine, but it is optional: SolrWayback can be configured to use any other playback engine that uses the same URL API for playback, “/server/<date>/<url>”, such as PyWb. It has been a common misunderstanding that SolrWayback forces you to use its own playback engine. The demo at the National Széchényi Library has PyWb configured as an alternative playback engine; clicking the icon next to the title of an HTML result will open playback in PyWb instead of SolrWayback.
The SolrWayback playback has been designed to be as authentic as possible, without a fixed toolbar at the top of the browser. Only a small overlay is shown in the top left corner, which can be removed with a click, so that you see the page as it was harvested. From the playback overlay you can open the calendar and an overview of the resources included by the HTML page, along with their timestamps compared to the main HTML page, similar to the feature provided by the archive.org playback engine.
The URL replacement is done up front and fully resolved to an exact WARC file and offset. An HTML page can include hundreds of different resources, and each of them requires a URL lookup for the version nearest to the crawl time of the HTML page. All resource lookups for a single HTML page are batched as a single Solr query, which improves both performance and scalability.
SolrWayback and Scalability
For scalability, it all comes down to the scalability of SolrCloud, which has proven without a doubt to be one of the leading search technologies and is still improving rapidly with each new version. Storing the indexes on SSDs gives a substantial performance boost as well, but can be costly. The Danish Netarchive has 126 Solr services running in a SolrCloud setup.
One of the servers is the master and the only one that receives requests. The Solr master has an empty index but is responsible for gathering the data from the other Solr services; if the master server also had an index, there would be an overhead. 112 of the Solr servers each have a 900 GB index with an average of ~300M documents, while the last 13 servers currently have an empty index, which makes expanding the collection easy without any configuration changes. Even with 32 billion documents, query response times are below 2 seconds. The result query and the facet query are separate simultaneous calls, with the advantage that the result can be rendered very fast while the facets finish loading later.
For very large result sets in the billions, the facets can take 10 seconds or more, but such queries are not realistic, and the user should be more precise in limiting the results up front.
Building new shards
Building new shards (collection pieces) is done outside the production environment, and a shard is moved onto one of the empty Solr servers when its index reaches ~900 GB. The index is optimized before it is moved, since no more data will be written to it that would undo the optimization; this also gives a small performance improvement in query times. If the indexing were done directly into the production index, it would also impact response times. The separation of the production and building environments has spared us from dealing with complex problems we would otherwise have faced. It also makes speeding up index building trivial, by assigning more machines/CPUs to the task and creating multiple indexes at once.
You cannot keep indexing into the same shard forever, as this would cause other problems. We found the sweet spot at the time to be an index size of ~900 GB, which could fit on the 932 GB SSDs that were available to us when the servers were built. The size of the index also determines the memory requirements of each Solr server, and we have allocated 8 GB of memory to each. For our large-scale netarchive, we keep track of which WARC files have been indexed using Archon and Arctika.
Archon is the central server; it has a database and keeps track of all WARC files, whether they have been indexed, and into which shard number.
Arctika is a small workflow application that starts WARC-Indexer jobs, queries Archon for the next WARC file to process, and reports back when the file has been completed.
SolrWayback – framework
SolrWayback is a single Java web application containing both the Vue frontend and the Java backend. The backend has two REST service interfaces written with JAX-RS: one is responsible for services called by the Vue frontend, and the other handles the playback logic.
SolrWayback software bundle
SolrWayback comes with an out-of-the-box bundle release. The release contains a Tomcat server with SolrWayback, a Solr server, and a workflow for indexing, all pre-configured. All that is required is unzipping the zip file and copying the two property files to your home directory. Then add some WARC files of your own and start the indexing job.