Migrating to pywb at the National Library of Luxembourg

The National Library of Luxembourg (BnL) has been harvesting the Luxembourg web under the digital legal deposit since 2016. In addition to broad crawls of the entire .lu domain, the Luxembourg Web Archive conducts targeted crawls focusing on specific topics or events. Due to legal restrictions, the web archives of the BnL can only be viewed on the library premises in Luxembourg. BnL joined the IIPC in 2017 and co-organised the 2021 online General Assembly and Web Archiving Conference.


By László Tóth, Web Archiving Software Developer at the National Library of Luxembourg

During 2023, the Bibliothèque Nationale du Luxembourg | National Library of Luxembourg (BnL) undertook the task of migrating its web archive to a new infrastructure. This migration affected all aspects of the archive:

  1. Hardware: the BnL invested in 4 new high-end servers for hosting the applications related to indexing and playback
  2. Software: the outdated OpenWayback application was upgraded to a modern pywb/OutbackCDX duo
  3. Web archive storage: the 339 TB of WARC files were migrated from NetApp NFS to high-performance IBM S3 object storage

In theory, such a migration is not a very complicated task; however, in our case, several additional factors had to be considered:

  • pywb has no module for reading WARC data from an IBM-based S3 bucket or communicating with a custom S3 endpoint
  • pywb does not know which bucket each resource is stored in
  • Our 339 TB of data had to be indexed in a “reasonable” amount of time

In this blog post, we will discuss each of the points mentioned above and provide details on how we overcame these difficulties.

The “before”

Up until December 2023, the BnL offered OpenWayback as a playback engine for users wishing to access the Luxembourg Web Archive. Simple but slow (and somewhat cumbersome to use), OpenWayback lacks a number of features required for an ergonomic user experience and efficient browsing of the archive. 

The WARC files were stored on locally mounted NFS shares, and the server handling the archive and serving OpenWayback to clients was a virtual machine with 8 cores and 24 GB of RAM.

One of the big drawbacks of this setup was the way indexes were handled. These were stored in CDX files at the rate of one file per collection, resulting in a bloated OpenWayback configuration file:

Figure 1: There has to be a better way of doing this…

In order for OpenWayback to access a given resource, it first had to find it; thus, the main causes of the long loading times were the size of the CDX files (3.1 TB in total) and the slowness of the NFS share they were stored on. Furthermore, every time a new collection was added or a new WARC file indexed into the archive, the Tomcat web server had to be restarted, resulting in a few minutes of downtime.

The slow page loads discouraged users from visiting our web archive.

Planning the update

In order to improve the BnL’s offerings to users, we decided to address these problems by switching to pywb, the popular playback engine developed by Webrecorder, and OutbackCDX, a high-performance CDX index server developed by the National Library of Australia.

For hosting these applications, and with a future SolrWayback setup in mind, the BnL purchased 4 high-end servers, boasting 96 CPUs and 768 GB RAM each.

Figure 2: Hardware comparison – before and after

The migration to S3 storage

Although not strictly necessary for installing the new applications, we decided to migrate to S3 before installing pywb and SolrWayback. Our state IT service provider had successfully implemented an S3-based storage system and had a positive experience with its performance compared to NFS. Since the web archive consumes a sizable chunk of their storage infrastructure, we made a joint strategic decision to move to S3. Migrating the storage layer at the same time as the access systems made sense, so this was done first.

This entailed several additional tasks:

  1. Setting up a database to store the S3 location of each WARC file together with various metadata, such as integrity hashes and harvest details for each file
  2. Physically copying 339 TB of WARC files to S3 buckets
  3. Developing a pywb module to read WARC data directly from IBM S3 buckets and another module to get the S3 bucket characteristics from the database

We began by setting up a MariaDB database with a few tables for storing file and collection metadata. Here is an example entry in the table “file”:

Figure 3: Some WARC file metadata in our DB

We then copied the files to S3 storage using a simple multithreaded Python script that used the ibm_boto3¹ module to upload files to S3 buckets and compute various pieces of metadata. We divided our web archive into separate collections, each stored in a single bucket and corresponding to a specific harvest made within a specific time period by a specific organization. Our naming scheme also includes internal or external identifiers where they exist. For instance, files harvested by the Internet Archive during the 2023 autumn broad crawl, which has the ID #22, are stored in a bucket named “bnl-broad-022-2023-autumn”, while those collected during the second behind-the-paywall campaign of 2023 are stored in “bnl-internal-paywall-2023-2”. In total, we have 32 such buckets.
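As an illustration, a stripped-down version of such an upload script could look like the sketch below. The endpoint, credentials, paths and bucket name are placeholders, and the metadata fields are simplified compared to our actual database schema.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import ibm_boto3  # IBM COS SDK; mirrors the boto3 client API

# Placeholder endpoint and credentials -- not our actual configuration
cos = ibm_boto3.client(
    "s3",
    endpoint_url="https://s3.cos.example.lu",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

BUCKET = "bnl-broad-022-2023-autumn"

def sha256_of(path: Path) -> str:
    """Compute an integrity hash without loading the whole WARC into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def upload_warc(path: Path) -> dict:
    """Upload one WARC file and return the metadata to store in the database."""
    cos.upload_file(str(path), BUCKET, path.name)
    return {"filename": path.name, "bucket": BUCKET,
            "size_bytes": path.stat().st_size, "sha256": sha256_of(path)}

warcs = sorted(Path("/mnt/nfs/broad-022-2023-autumn").glob("*.warc.gz"))
with ThreadPoolExecutor(max_workers=8) as pool:
    for meta in pool.map(upload_warc, warcs):
        print(meta)  # in practice, inserted into the MariaDB "file" table
```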

Finally, we developed pywb modules to ensure that every time the application requests WARC data, it first queries the aforementioned database to find out which bucket the resource is in, and then loads the data from the requested offset up to the requested length. Of course, we didn’t allow direct communication between pywb and our database, so we developed a small Java application with a REST API whose sole purpose is to facilitate and moderate such communication.

At this point, we’d like to note that pywb already has an S3Loader class; however, this is based on Amazon S3 technology and does not allow defining a custom endpoint for communicating with the S3 service itself. In order to adapt this to our needs, we modified pywb by implementing a BnlLoader class that extends BaseLoader, which does all the above and uses ibm_boto3 to get the WARC data. We then mapped it to a custom loader prefix in the loaders.py file:

Figure 4: Custom BnL loader reference (loaders.py)

Notice our special class on the last line in the figure above. Now in order for pywb to use this class, it has to be referenced in the config file. Here is what the relevant part of our config file looks like:

Figure 5: Our configuration file (config.yaml)

Note the “bnl://” prefix in “archive_paths”: this tells pywb to load the BnlLoader class as the WARC data handler for the corresponding collection. The value of this key is simply the URL linking to our database API server that we mentioned above, which allows controlled communication with the underlying MariaDB database. 
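To illustrate the idea, a heavily simplified sketch of such a loader is shown below. The method signature, the REST API query parameters and the response fields are assumptions made for this example; the real implementation lives inside pywb’s loaders.py and additionally handles errors, credentials and caching.

```python
# Sketch only -- added alongside the other loader classes in pywb's loaders.py,
# where BaseLoader is defined. Endpoint, parameter names and API fields are
# illustrative assumptions, not the exact BnL implementation.
import requests
import ibm_boto3

class BnlLoader(BaseLoader):
    def __init__(self, api_url="http://warcdb-api.example.lu/files", **kwargs):
        self.api_url = api_url  # REST service in front of the MariaDB database
        self.cos = ibm_boto3.client("s3", endpoint_url="https://s3.cos.example.lu")

    def load(self, url, offset=0, length=-1):
        # 1. Ask the database API which bucket/key holds this WARC file
        warc_name = url.split("://", 1)[-1]
        info = requests.get(self.api_url, params={"filename": warc_name}).json()

        # 2. Fetch only the requested byte range from the IBM S3 cluster
        end = "" if length < 0 else str(int(offset) + int(length) - 1)
        resp = self.cos.get_object(Bucket=info["bucket"], Key=info["key"],
                                   Range=f"bytes={offset}-{end}")

        # 3. Hand the stream back to pywb, which parses the WARC record from it
        return resp["Body"]
```

The class is then mapped to the “bnl://” prefix alongside the existing loaders, as shown in Figure 4.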

So, in summary:

  1. pywb needs to load a resource from a WARC file
  2. The BnlLoader class’ overridden load() method is called
  3. In this method, pywb makes an API call to our REST API service in order to get the S3 bucket where the WARC is stored via the database
  4. The S3 path that is returned is then used together with the requested offset and length (provided by OutbackCDX) to make a call to the IBM S3 cluster using ibm_boto3
  5. pywb now has all the data it needs to display the page

The result can be summarized in the following workflow diagram:

Figure 6: The BnL’s pywb + OutbackCDX workflow

Note that OutbackCDX and pywb are set up on the same physical server, with 96 CPUs and 768 GB of RAM; however, in the diagram above, we have shown them separately for clarity.

Our new access portal

Our access portal to the web archives was also completely redesigned using pywb’s templating engine, which allows us to fully customize the appearance of almost all elements in the interface. We decided to provide our users with two main sub-portals: one for accessing websites harvested behind paywalls, and another one for everything else. We had two main reasons for this:

  1. Several of our collections contained harvests of paywalled versions of websites. We did not want to mix these together with the un-paywalled versions.
  2. We wanted to offer our users a clear distinction between open (“freely accessible”) content and paywalled content.

The screenshot below shows our main access portal:

Figure 7: The BnL’s web archive portal

Next steps

2024 will see the installation of Netarkivet’s SolrWayback at the BnL, providing advanced search features such as full-text search, faceted search, file type search, and many more. Our hybrid solution will use pywb as the playback engine while SolrWayback will be responsible for the search aspects of our archive.

The BnL will also add an NSFW filter and virus scanning in SolrWayback. During indexing, the content inside the WARC files will be scanned using artificial intelligence techniques in order to label each item as “safe” or “not safe for work”. This way, we will be able to restrict certain content, such as pornography, in our reading rooms and block other harmful elements, such as viruses.

Finally, the BnL will develop an automated QA workflow, aimed at detecting and patching missing elements after harvests. Since this work is currently done manually by a student helper, the workflow will likely greatly increase the quality and efficiency of our in-house harvests.


Footnotes
  1. ibm_boto3: Python module and documentation

Web Archiving the War in Ukraine

By Olga Holownia, Senior Program Officer, IIPC & Kelsey Socha, Administrative Officer, IIPC with contributions to the Collaborative Collection section by Nicola Bingham, Lead Curator, Web Archives, British Library; CDG co-chair


This month, the IIPC Content Development Working Group (CDG) launched a new collaborative collection to archive web content related to the war in Ukraine, aiming to map the impact of this conflict on digital history and culture. In this blog, we describe what is involved in creating a transnational collection and we also give an overview of web archiving efforts that started earlier this year: both collections by IIPC members and collaborative volunteer initiatives.

Collaborative Collection 2022

In line with the broader content development policy, CDG collections focus on topics that are transnational in scope and are considered of high interest to IIPC members. Each collection represents more perspectives than a similar collection by a single member archive could include. Nominations are submitted by IIPC members, who have been archiving the conflict since as early as January 2022 (see below), as well as by the general public.

How do members contribute?

Topics for special collections are proposed by IIPC members, who submit their ideas to the IIPC CDG mailing list or contact the IIPC co-chairs directly at any time. Provided that the topic fits within the CDG collecting scope, there is enough data budget to cover the collection, and a lead curator and volunteers are in place to perform the archiving work, the collection can go ahead. IIPC members are then canvassed widely to submit web content on a shared Google spreadsheet together with associated metadata such as title, language and description. The URLs are taken from the spreadsheet and crawled in Archive-It by the project team, formed of volunteers from IIPC members for each collection. Many IIPC members add a selection of seeds from their institutions’ own collections, which helps to make CDG collections very diverse in terms of coverage and language.

There will be overlap between the seeds that members submit to CDG collections and their own institutions’ collections; however, there are differences. Selections for IIPC collections can be more geographically wide-ranging than those in members’ own collections, which may, for example, have to adhere to a regional scope, as in the case of a national library. Selection decisions that are appropriate for members’ own collections may not be appropriate for CDG collections. For example, members may want to curate individual articles from an online newspaper by crawling each one separately, whereas, given the larger scope of CDG collections, it would be more appropriate to create the target at the level of the sub-section of the online newspaper. Public access to collections provided by Archive-It is a positive factor for those institutions that, for various reasons, can’t provide access to their collections. You can learn more about the War in Ukraine 2022 collection’s scope and parameters here.

Public nominations

We encourage everyone to nominate relevant web content as defined by the collection’s lead curators: Vladimir Tybin and Anaïs Crinière-Boizet of the BnF, National Library of France and Kees Teszelszky of KB, National Library of the Netherlands. The first crawl is scheduled to take place on 27 July and it will be followed by two additional crawls in September and October. We will be publishing updates on the collection at #Ukraine 2022 Collection. We are also planning to make this collection available to researchers.

Member collections

In Spring 2022, we conducted a survey of the work done by IIPC members. We asked about the collection start date, scope, frequency, type of collected websites, way of collecting (e.g. locally and/or via Archive-It), social media platforms and access.

IIPC members have collected content related to the war, ranging from news portals, to governmental websites, to embassies, charities, and cultural heritage sites. They have also selectively collected content from Ukrainian and Russian websites and social media, including Facebook, Reddit, Instagram, and, most prominently, Twitter. The CDG collection offers another chance for members without special collections to contribute seeds from their own country domains.

Many of our members are national libraries and archives, and legal deposit informs what these institutions are able to collect and how they provide access. In most cases, that would mean crawling country-level domains, offering a localized perspective on the war. Access varies from completely open (e.g. the Internet Archive, the National Library of Australia and the Croatian Web Archive), to onsite-only with published and browsable metadata such as collected URLs (e.g. the Hungarian Web Archive), to reading-room only (e.g. Netarkivet at the Royal Danish Library or the “Archives de l’internet” at the National Library of France). The UK Web Archive collection has a mixed model of access, where the full list of metadata and collected URLs is available, but access to individual websites depends on whether the website owner has granted permission for off-site open access. Some institutions, such as the Library of Congress, may have time-based embargoes in place for collection access.

Some of our members have also begun work preparing datasets and visualisations for researchers. The Internet Archive has been supporting multiple collections and volunteer projects and our members have provided valuable advice on capturing content that is difficult to archive (e.g. Telegram messages).

A map of IIPC members currently collecting content related to the war in Ukraine can be seen below. It includes Stanford University, which has been supporting SUCHO (Saving Ukrainian Cultural Heritage Online).

Survey results

Access

While many members have been collecting content related to the war, only a small number of collections are currently publicly available online. Some members provide access to browsable metadata or a list of URLs. The National Library of Australia has been collecting publicly available Australian websites related to the conflict, as is the case for the National Library of the Czech Republic. A special event collection of 162 crowd-sourced URLs is now accessible at the Croatian Web Archive. The UK Web Archive’s special collection of nearly 300 websites is fully available on-site; however, information about the collected resources, which currently include websites of Russian Oligarchs in the UK, Commentators, Charities, Think Tanks and the UK Embassies of Ukraine and the surrounding nations, is publicly available online. Some websites from the UK Web Archive’s collection are also fully available off-site, where website owners have granted permission. The National Library of Scotland has set up a special collection, ‘Scottish Communities and the Ukraine’, which contains nearly 100 websites and focuses on the local response to the Ukraine War. This collection will be viewable in the near future, pending QA checks. Most of the University Library of Bratislava’s collection is only available on-site, but information about the sites collected is browsable on their web portal, with links to current versions of the archived pages.

The web archiving team at the National Széchényi Library in Hungary, which has been capturing content from 75 news portals, has created a SolrWayback-based public search interface which provides access to metadata and full-text search, though full pages cannot be viewed due to copyright. The web archiving team has also been collaborating with the library’s Digital Humanities Center to create datasets and visualisations related to captured content.

[Word cloud visualisation from the Hungarian Web Archive]
Márton Nemeth of National Széchényi Library and Gyula Kalcsó of Digital Humanities Center, National Széchényi Library presented on this collection at the 2022 Web Archiving Conference.

Multiple institutions plan to make their content available online at a later date, after collecting has finished or after a specified period of time has passed. The Library of Congress has been capturing content in a number of collections within the scope of their collecting policies, including the ongoing East European Government Ministries Web Archive.

Frequency of Collection

Most institutions have been collecting with a variety of frequencies. Institutions rarely answered with just one of the frequency options, opting instead to pick multiple options or “Other.” Of answers in the “Other” category, some were doing one-time collection, while others were collecting yearly, six-monthly, and quarterly.

How the content is collected

Most IIPC members crawl the content locally, while a few have also been using Archive-It. SUCHO has mostly relied on the browser-based crawler Browsertrix, which was developed by Ilya Kreymer of Webrecorder and is in part funded by the IIPC, and on the Internet Archive’s Wayback Machine.

Type of collected websites (your domain)

When asked about types of websites being collected within local domains, most institutions have been focusing on governmental and news-related sites, followed by embassies and official sites related to Ukraine and Russia as well as cultural heritage sites. Other websites included a variety of crisis relief organisations, non-profits, blogs, think tanks, charities, and research organisations.

Types of websites/social media collected

When asked more broadly, most members have been focusing on local websites from their home countries. Outside local websites, some institutions were collecting Ukrainian websites and social media, while a smaller number were collecting Russian websites and social media.

Specific social media platforms collected

The survey also asked specifically about the social media platforms our members were collecting from: Reddit, Instagram, TikTok, Tumblr, and YouTube. While many institutions were not collecting social media, Twitter was otherwise the most commonly collected social media platform.

Internet Archive

The Internet Archive (IA) has been instrumental in providing support for multiple initiatives related to the war in Ukraine. IA’s initiatives have included:

  1. giving free Archive-It accounts, as well as general data storage, to a number of different community archiving efforts
  2. uploading files to the SUCHO collection at archive.org
  3. supporting the extensive use of Save Page Now (especially via the Google Sheets interface) with the help of numerous SUCHO volunteers (many tens of TB have been archived this way)
  4. supporting the uploading of WACZ files to the Wayback Machine. This work has just started, but a significant number of files are expected to be archived and, similar to other collections featured in the new “Collection Search” service, a full-text index will be available
  5. crawling the entire country code top level domain of the Ukrainian web (the crawl was launched in April and is still running)
  6. archiving Russian Independent Media (TV, TV Rain), Radio (Echo of Moscow) and web-based resources (see “Russian Independent Media” option in the “Collection Search” service at the bottom of the Wayback Machine).

IA’s Television News Archive, the GDELT Project, and the Media-Data Research Consortium have all collaborated to create the Television News Visual Explorer, which allows for greater research access to the Television News Archive, including channels from across Russia, Belarus, and Ukraine. This blog post by GDELT’s Dr. Kalev H. Leetaru explains the significance of this collaboration and the importance of this new research collection of Belarusian, Russian and Ukrainian television news coverage.

Volunteer initiatives

SUCHO

One of the largest volunteer initiatives focusing on preserving Ukrainian web content has been SUCHO. Involving over 1,300 librarians, archivists, researchers and programmers, SUCHO is led by Stanford University’s Quinn Dombrowski, Anna E. Kijas of Tufts University, and Sebastian Majstorovic of the Austrian Centre for Digital Humanities and Cultural Heritage. In its first phase, the project’s primary goal was to archive at-risk sites, digital content, and data in Ukrainian cultural heritage institutions. So far, over 30 TB of content and 3,500+ websites of Ukrainian museums, libraries and archives have been preserved, and a subset of this collection is available at https://www.sucho.org/archives. The project is beginning its second phase, focusing on coordinating aid shipments of digitization hardware, exhibiting Ukrainian culture online and organizing training for Ukrainian cultural workers in digitization methods.

The SUCHO leads and Ilya Kreymer presented on their work at the 2022 Web Archiving Conference and participated in a Q&A session moderated by Abbie Grotke of the Library of Congress.

The Telegram Archive of the War

Screenshot from the Telegram Archive of the War, taken July 20, 2022.

Telegram has been the most widely used application in Ukraine since the onset of the war, but this messaging app is notoriously difficult to archive. A team of five archivists at the Center for Urban History in Lviv, led by Taras Nazaruk, has been archiving almost 1,000 Telegram channels since late February to create the Telegram Archive of the War. Each team member has been assigned to monitor and archive a topic or a region in Ukraine. They focus on capturing official announcements from different military administrative districts, ministries, local and regional news, volunteer groups helping with evacuation, searches for missing people, local channels for different towns, databases, cyberattacks, Russian propaganda and fake news, as well as personal diaries, artistic reflections, humour and memes. Russian government propaganda and pro-Russian channels and chats are also archived. The multimedia content is currently grouped into over 20 thematic collections. The project coordinators have also been working with universities interested in supporting this archive and are planning to set up a working group to provide guidance on future access to this invaluable archive.

Ukraine collections on Archive-It

New content has been gradually made available within the Ukraine collections on Archive-It, which provided free or heavily cost-shared accounts to its partners earlier this year. These collections also include websites documenting the Ukraine Crisis 2014-2015, curated by the University of California, Berkeley (UC Berkeley) and by Internet Archive Global Events. Four new collections have been created since February 2022, with over 2.5 TB of content. The largest publicly available one about the 2022 conflict (around 200 URLs) is curated by the Ukrainian Research Institute at Harvard University. Other collections that focus on Ukrainian content are curated by the Center for Urban History of East Central Europe, UC Berkeley and SUCHO. To learn more about the “War in Ukraine: 2022” collection, read this blog post by Liladhar R. Pendse, Librarian for East European, Central European, Central Asian and Armenian Studies Collections, UC Berkeley. New College, University of Oxford, has been archiving at-risk Russian cultural heritage on the web as well as Russian opposition efforts against the war on Ukraine.

Ukrainian Research Institute at Harvard University’s collection at Archive-It.

Organisations interested in collecting web content related to the war in Ukraine can contact Mirage Berry, Business Development Manager at the Internet Archive.

How to get involved

  1. Nominate web content for the CDG collection
  2. Use the Internet Archive’s “Save Page Now”
  3. Check updates on the SUCHO Page for information on how you can contribute to the new phase of the project. SUCHO is currently accepting donations to pay for server costs and funding digitization equipment to send to Ukraine. Those interested in volunteering with SUCHO can sign up for the standby volunteer list here
  4. Help the Center for Urban History in Lviv by nominating Ukrainian Telegram channels that you think are worth archiving and participate in their events
  5. Submit information about your project: we are working to maintain a comprehensive and up-to-date list of web archiving efforts related to the war in Ukraine. If you are involved in a collection or a project and would like to see it included here, please use this form to contact us: https://bit.ly/archiving-the-war-in-Ukraine.

Many thanks to all of the institutions and projects featured on this list! We appreciate the time our members spent filling out our survey and answering questions. Special thanks to Nicola Bingham of the British Library, Mark Graham and Mirage Berry of the Internet Archive, and Taras Nazaruk of the Center for Urban History in Lviv for providing supplementary information on their institutions’ collecting efforts.


SolrWayback 4.0 release! What’s it all about? Part 2

By Thomas Egense, Programmer at the Royal Danish Library and the Lead Developer on SolrWayback.

This blog post is republished from Software Development at Royal Danish Library.

In this blog post I will go into the more technical details of SolrWayback and the new 4.0 release. The whole frontend GUI was rewritten from scratch to meet the expectations of 2020 web applications, and many new features were implemented in the backend. I recommend reading the frontend blog post first; it has beautiful animated GIFs demonstrating most of the features in SolrWayback.

Live demo of SolrWayback

You can access a live demo of SolrWayback here. Thanks to National Széchényi Library of Hungary for providing the SolrWayback demo site!

Back in 2018…

The open source SolrWayback project was created in 2018 as an alternative to the Netarchive frontend applications existing at that time. At the Royal Danish Library we were already using Blacklight as a search frontend. Blacklight is an all-purpose Solr frontend application and is very easy to configure and install by defining a few properties such as the Solr server URL, fields and facet fields. But since Blacklight is a generic Solr frontend, it had no special handling of the rich data structure we had in Solr. Also, binary data such as images and videos are not in Solr, so integration with the WARC-file repository can enrich the experience and make playback possible, since Solr holds enough information to work as a CDX server as well.

Another interesting frontend was Shine. It was custom-tailored for the Solr index created with the warc-indexer and had features such as trend analysis (n-gram) visualization of search results over time. The showstopper was that Shine was built on an older version of the Play framework, and the latest version of the Play framework was not backwards compatible with the maintained branch. Upgrading was far from trivial and would require a major rewrite of the application. Adding to that, the frontend developers had years of experience with the larger, more widely used pure JavaScript frameworks. The frontend developers’ weapon of choice for SolrWayback was the VUE JS framework. Both SolrWayback 3.0 and the new, rewritten SolrWayback 4.0 have their frontend developed in VUE JS. If you have skills in VUE JS and an interest in SolrWayback, your collaboration will be appreciated.

WARC-Indexer. Where the magic happens!

WARC files are indexed into Solr using the WARC-Indexer. The WARC-Indexer reads every WARC record, extracts all kinds of information and splits this into up to 60 different fields. It uses Tika to parse all the different MIME types that can be encountered in WARC files. Tika extracts the text from HTML, PDF, Excel, Word documents etc. It also extracts metadata from binary documents if present. The metadata can include created/modified time, title, description, author etc. For images, it can also extract metadata such as width/height or EXIF information such as latitude/longitude. The binary data themselves are not stored in Solr, but for every record in the WARC file there is a record in Solr. This also includes empty records such as HTTP 302 (MOVED) responses, with information about the new URL.

WARC-Indexer. Paying the price up front…

Indexing a large amount of WARC files requires massive amounts of CPU, but it is easily parallelized, as the warc-indexer takes a single WARC file as input. To give an idea of the requirements, indexing 700 TB of WARC files (5.5M files) took 3 months using 280 CPUs. Once the existing collection is indexed, it is easier to keep up with the incremental growth of the collection. So this is the drawback when using SolrWayback on large collections: the WARC files have to be indexed first.
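Because each warc-indexer job handles exactly one WARC file, spreading the work over many cores is straightforward. The sketch below illustrates the idea; the jar name, command-line flags and Solr URL are illustrative assumptions rather than our exact setup.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

SOLR_URL = "http://solr.example.org:8983/solr/netarchive"  # assumed collection URL

def index_warc(warc: Path) -> tuple[Path, int]:
    """Run one warc-indexer job for a single WARC file."""
    cmd = ["java", "-jar", "warc-indexer.jar", "-s", SOLR_URL, str(warc)]
    return warc, subprocess.run(cmd, capture_output=True).returncode

warcs = sorted(Path("/data/warcs").glob("*.warc.gz"))
with ThreadPoolExecutor(max_workers=64) as pool:   # roughly one job per core
    failed = [w for w, rc in pool.map(index_warc, warcs) if rc != 0]

print(f"{len(warcs) - len(failed)} WARC files indexed, {len(failed)} failed")
```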

Solr provides multiple ways of aggregating data, moving common netarchive statistics tasks from slow batch processing to interactive requests. Based on input from researchers, the feature set is continuously expanding with aggregation, visualization and extraction of data.

Due to the amazing performance of Solr, a query is often answered in less than 2 seconds in a collection with 32 billion (32×10⁹) documents, and this includes facets. The search results are not limited to HTML pages where the free text is found, but include every document that matches the search query. When presenting the results, each document type has a custom display for its MIME type.
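As a rough illustration of what such a query looks like against the warc-indexer schema, the snippet below runs a free-text search with a few facets over plain HTTP. The collection name and field names are assumptions for the example.

```python
import requests

SOLR = "http://solr.example.org:8983/solr/netarchive/select"  # assumed endpoint

resp = requests.get(SOLR, params={
    "q": 'content:"climate summit"',            # assumed free-text field name
    "rows": 20,
    "facet": "true",
    "facet.field": ["domain", "content_type_norm", "crawl_year"],
}).json()

print(resp["response"]["numFound"], "matching records")

# Solr returns facets as a flat [value, count, value, count, ...] list
years = resp["facet_counts"]["facet_fields"]["crawl_year"]
for year, count in zip(years[::2], years[1::2]):
    print(year, count)
```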

HTML results are enriched with thumbnail images from the page as part of the result, images are shown directly, and audio and video files can be played directly from the results list with an in-browser player or downloaded if the browser does not support the format.

Solr. Reaping the benefits from the WARC-indexer

The SolrWayback Java backend offers a lot more than just sending queries to Solr and returning the results to the frontend. Methods can aggregate data from multiple Solr queries or read WARC entries directly and return the processed data in a simple format to the frontend. Instead of re-parsing the WARC files, which is a very tedious task, the information can be retrieved from Solr, and the task can be done in seconds or minutes instead of weeks.

See the frontend blog post for more feature examples.

Wordcloud
Generating a wordcloud image is done by extracting the text from 1,000 random HTML pages from the domain and generating a wordcloud from the extracted text.
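Conceptually, this boils down to a random sample from Solr plus an off-the-shelf word-cloud library, something like the sketch below. The field names and the random-sort field are assumptions about the schema, and SolrWayback does this in its Java backend rather than in Python.

```python
import requests
from wordcloud import WordCloud  # pip install wordcloud

SOLR = "http://solr.example.org:8983/solr/netarchive/select"

resp = requests.get(SOLR, params={
    "q": "domain:example.lu AND content_type_norm:html",
    "rows": 1000,
    "sort": "random_42 asc",   # assumes a random_* RandomSortField in the schema
    "fl": "content",
}).json()

chunks = []
for doc in resp["response"]["docs"]:
    value = doc.get("content", "")
    chunks.append(" ".join(value) if isinstance(value, list) else value)

WordCloud(width=1200, height=800).generate(" ".join(chunks)).to_file("wordcloud.png")
```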

Interactive linkgraph
By extracting the domains that link to a given domain (A) and also extracting the outgoing links from that domain (A), you can build a link graph. Repeating this for the newly found domains gives you a two-level local link graph for domain (A). Even though this can amount to hundreds of separate Solr queries, it is still done in seconds on a large corpus. Clicking a domain will highlight its neighbours in the graph (try the demo: interactive linkgraph).
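The sketch below shows the idea behind the two-level graph: one facet query for the outgoing links of a domain, one for its incoming links, then the same again for every newly found domain. The field names domain and links_domains follow the warc-indexer schema but are assumptions here.

```python
import requests

SOLR = "http://solr.example.org:8983/solr/netarchive/select"

def top_domains(query: str, facet_field: str, limit: int = 50) -> list[str]:
    """Return the most frequent values of facet_field for the given query."""
    resp = requests.get(SOLR, params={
        "q": query, "rows": 0, "facet": "true",
        "facet.field": facet_field, "facet.limit": limit,
    }).json()
    flat = resp["facet_counts"]["facet_fields"][facet_field]
    return flat[::2]  # [value, count, value, count, ...] -> values only

domain = "example.lu"
outgoing = top_domains(f"domain:{domain}", "links_domains")   # A -> others
incoming = top_domains(f"links_domains:{domain}", "domain")   # others -> A

edges = {(domain, d) for d in outgoing} | {(d, domain) for d in incoming}
for neighbour in set(outgoing + incoming):                    # second level
    edges |= {(neighbour, d) for d in top_domains(f"domain:{neighbour}", "links_domains")}

print(len(edges), "edges in the two-level link graph")
```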

Large scale linkgraph
Extraction of massive linkgraphs with up to 500K domains can be done in hours.

Link graph example from the Danish NetArchive.

The exported link-graph data was rendered in Gephi and made zoomable and interactive using Graph presenter. The link graphs can be exported fast, as all links (a href) for each HTML record are extracted and indexed as part of the corresponding Solr document.

Image search
Freetext search can be used to find HTML documents. The HTML documents in Solr are already enriched with the image links on the page, without having to parse the HTML again. Instead of showing the HTML pages, SolrWayback collects all the images from the pages and shows them in a Google-like image search result. Under the assumption that the text on an HTML page relates to its images, you can find images that match the query. If you search for “cats” in the HTML pages, the results will most likely show pictures of cats. The pictures could not have been found by searching the image documents alone if no metadata (or image name) contains “cats”.

CSV stream export
You can export result sets with millions of documents to a CSV file. Instead of exporting all of the up to 60 Solr fields for each result, you can pick exactly which fields to export. This CSV export has already been used by several researchers at the Royal Danish Library and gives them the opportunity to use other tools, such as RStudio, to perform analysis on the data. The National Széchényi Library demo site has disabled CSV export in the SolrWayback configuration, so it cannot be tested live.
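Under the hood this kind of export can be done with Solr’s cursorMark paging, which streams arbitrarily large result sets without deep-paging penalties. The sketch below is a conceptual Python version, not SolrWayback’s actual Java implementation, and the field names are assumptions.

```python
import csv
import requests

SOLR = "http://solr.example.org:8983/solr/netarchive/select"
FIELDS = ["crawl_date", "url", "domain", "content_type_norm", "status_code"]

cursor = "*"
with open("export.csv", "w", newline="", encoding="utf-8") as fh:
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    while True:
        resp = requests.get(SOLR, params={
            "q": "domain:example.lu", "rows": 1000, "fl": ",".join(FIELDS),
            "sort": "id asc",          # cursorMark requires a sort on the unique key
            "cursorMark": cursor,
        }).json()
        for doc in resp["response"]["docs"]:
            writer.writerow({f: doc.get(f, "") for f in FIELDS})
        if resp["nextCursorMark"] == cursor:   # no more results
            break
        cursor = resp["nextCursorMark"]
```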

WARC corpus extraction
Besides CSV export, you can also export a result set to a WARC file. The export will read the WARC entry for each document in the result set, copy the WARC header + HTTP header + payload, and create a new WARC file with all results combined.

Extracting a sub-corpus is this easy, and it has already proven to be extremely useful for researchers. Examples include the extraction of a domain for a given date range, or a query restricted to a list of defined domains. This export is a 1-1 mapping from the results in Solr to the entries in the WARC files.
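The sketch below shows the principle using warcio: each Solr hit carries the path and byte offset of its source WARC record, so the record can be copied verbatim into a new WARC file. SolrWayback does this server-side in Java; the field names used here are assumptions.

```python
import requests
from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

SOLR = "http://solr.example.org:8983/solr/netarchive/select"

hits = requests.get(SOLR, params={
    "q": "domain:example.lu AND crawl_year:2023",
    "rows": 500,
    "fl": "source_file_path,source_file_offset",   # assumed field names
}).json()["response"]["docs"]

with open("subcorpus.warc.gz", "wb") as out:
    writer = WARCWriter(out, gzip=True)
    for doc in hits:
        with open(doc["source_file_path"], "rb") as warc:
            warc.seek(int(doc["source_file_offset"]))
            for record in ArchiveIterator(warc):
                writer.write_record(record)  # WARC header + HTTP header + payload
                break                        # only the record at this offset
```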

SolrWayback can also perform an extended WARC export, which will include all resources (JS/CSS/images) for every HTML page in the export. The extended export ensures that playback will also work for the sub-corpus. Since the exported WARC file can become very large, you can use a WARC splitter tool or just split the export into smaller batches by adding crawl year/month to the query, etc. The National Széchényi Library demo site has disabled WARC export in the SolrWayback configuration, so it cannot be tested live.

SolrWayback playback engine

SolrWayback has a built-in playback engine, but it is optional, and SolrWayback can be configured to use any other playback engine that uses the same URL API for playback (“/server/<date>/<url>”), such as PyWb. It has been a common misunderstanding that SolrWayback forces you to use the SolrWayback playback engine. The demo at the National Széchényi Library has PyWb configured as an alternative playback engine. Clicking the icon next to the title of an HTML result will open playback in PyWb instead of SolrWayback.

Playback quality

The playback quality of SolrWayback is an improvement over OpenWayback for the Danish Netarchive, but not as good as PyWb. The technique used is URL rewriting, just as in PyWb, replacing URLs according to the HTML specification in HTML pages and CSS files. However, SolrWayback does not yet replace links generated from JavaScript, though this is likely to be improved in a future major release. It has not been a priority, since the content for the Danish Netarchive is harvested with Heritrix and dynamic JavaScript resources are not harvested by Heritrix.

This is only a problem for absolute links, i.e. links starting with http://domain/…, since all relative URL paths are resolved automatically by the URL playback API. Relative links that refer to the root of the playback server are resolved by the SolrWaybackRootProxy application, which has this sole purpose. It calculates the correct URL from the HTTP referer header and redirects back into SolrWayback. Absolute URLs from JavaScript (or dynamically generated JavaScript) can result in live leaks. This can be avoided with an HTTP proxy or by adding a whitelist of URLs to the browser. In the Danish Citrix production environment, live leaks are blocked by sandboxing the environment. Improving playback is in the pipeline.

The SolrWayback playback has been designed to be as authentic as possible, without showing a fixed toolbar at the top of the browser. Only a small overlay is included in the top left corner, which can be removed with a click, so that you see the page as it was harvested. From the playback overlay you can open the calendar and an overview of the resources included in the HTML page, along with their timestamps compared to the main HTML page, similar to the feature provided by the archive.org playback engine.

The URL replacement is done up front and fully resolved to an exact WARC file and offset. An HTML page can have hundreds of different resources, and each of them requires a URL lookup for the version nearest to the crawl time of the HTML page. All resource lookups for a single HTML page are batched as a single Solr query, which improves both performance and scalability.

SolrWayback and Scalability

For scalability, it all comes down to the scalability of SolrCloud, which has proven without a doubt to be one of the leading search technologies and is still improving rapidly with each new version. Storing the indexes on SSDs gives a substantial performance boost as well, but can be costly. The Danish Netarchive has 126 Solr services running in a SolrCloud setup.

One of the servers is the master and the only one that receives requests. The Solr master has an empty index but is responsible for gathering the data from the other Solr services; if the master also had an index of its own, there would be an overhead. 112 of the Solr servers have a 900 GB index with an average of ~300M documents, while the last 13 servers currently have an empty index, which makes expanding the collections easy without any configuration changes. Even with 32 billion documents, query response times are below 2 seconds. The result query and the facet query are separate, simultaneous calls; the advantage is that the results can be rendered very fast, and the facets will finish loading later.

For very large results in the billions, the facets can take 10 seconds or more, but such queries are not realistic, and the user should be more precise in limiting the results up front.

Building new shards
Building new shards (collection pieces) is done outside the production environment, and a shard is moved onto one of the empty Solr servers when the index reaches ~900 GB. The index is optimized before it is moved, since no more data will be written to it that would undo the optimization. This also gives a small performance improvement in query times. If the indexing were done directly into the production index, it would also impact response times. The separation of the production and building environments has spared us from dealing with complex problems we would otherwise have faced. It also makes speeding up the index building trivial, by assigning more machines/CPUs to the task and creating multiple indexes at once.
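Triggering the optimize (forced merge) on a freshly built shard is a single call to Solr, along these lines (host and core name are illustrative):

```python
import requests

# Force-merge the shard down to one segment before moving it into production.
resp = requests.get(
    "http://solr-build.example.org:8983/solr/netarchive_shard42/update",
    params={"optimize": "true", "maxSegments": 1},
    timeout=None,   # an optimize on a ~900 GB index can run for a long time
)
resp.raise_for_status()
print("optimize status:", resp.json()["responseHeader"]["status"])
```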

You cannot keep indexing into the same shard forever, as this would cause other problems. We found the sweet spot at that time to be an index size of ~900 GB, which could fit on the 932 GB SSDs that were available to us when the servers were built. A larger index also requires more memory on each Solr server; we have allocated 8 GB of memory to each. For our large-scale netarchive, we keep track of which WARC files have been indexed using Archon and Arctika.

Archon is the central server with a database; it keeps track of all WARC files, whether they have been indexed, and into which shard number.

Arctika is a small workflow application that starts WARC-indexer jobs, queries Archon for the next WARC file to process, and reports back when it has been completed.

SolrWayback – framework

SolrWayback is a single Java web application containing both the VUE frontend and the Java backend. The backend has two REST service interfaces written with JAX-RS: one is responsible for the services called by the VUE frontend, and the other handles the playback logic.

SolrWayback software bundle

SolrWayback comes with an out-of-the-box bundle release. The release contains a Tomcat server with SolrWayback, a Solr server and a workflow for indexing. All products are preconfigured. All that is required is unzipping the zip file and copying the two property files to your home directory. Add some WARC files yourself and start the indexing job.

Try SolrWayback Software bundle!

SolrWayback 4.0 release! What’s it all about?

By Jesper Lauridsen, frontend developer at the Royal Danish Library.

This blog post is republished from Software Development at Royal Danish Library.


So, it’s finally here! SolrWayback 4.0 was released December 20th, after an intense development period. In this blog post, we’ll give you a nice little overview of the changes we made, some of the improvements and some of the added functionality that we’re very proud of having released. So let’s dig in!

A small intro – What is SolrWayback really?

As the name implies, SolrWayback is a fusion of discovery (Solr) and playback (Wayback) functionality. Besides full-text search, Solr provides multiple ways of aggregating data, moving common net archive statistics tasks from slow batch processing to interactive requests. Based on input from researchers, the feature set is continuously expanding with aggregation, visualization and extraction of data.

SolrWayback relies on real time access to WARC files and a Solr index populated by the UKWA webarchive-discovery tool. The basic workflow is:

  • Amass a collection of WARCs (using Heritrix, wget, ArchiveIT…) and put them on live storage
  • Analyze and process the WARCs using webarchive-discovery. Depending on the number of WARCs, this can be a fairly heavy job: processing ½ petabyte of WARCs at the Royal Danish Library took 40+ CPU-years
  • Index the result from webarchive-discovery into Solr. For non-small collections, this means SolrCloud and solid-state drives. A rule of thumb is that the index takes up about 5-10% of the size of the compressed WARCs (see the back-of-the-envelope sketch after this list)
  • Connect SolrWayback to the WARC storage and the Solr index
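As a back-of-the-envelope example of that rule of thumb (the numbers are purely illustrative):

```python
compressed_warcs_tb = 500   # roughly the ½ petabyte mentioned above
low, high = 0.05 * compressed_warcs_tb, 0.10 * compressed_warcs_tb
print(f"Expect roughly {low:.0f}-{high:.0f} TB of Solr index, ideally on SSDs")
```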

A small visual illustration of the components used for SolrWayback.

Live demo

Try the live demo provided by the National Széchényi Library, Hungary (thanks!).

Helicopter view: What happened to SolrWayback

We decided to give SolrWayback a complete makeover, making the interface more coherent, the design more stylish, and the information architecture better structured. At first glance, not much has changed apart from an updated color scheme, but looking closely, we’ve added some new functionality and grouped some of the existing features in a new, and much improved, way.

The new interface of SolrWayback.

The search page is still the same, and after searching, you’ll still see all the results lined up in a nice single column. We’ve added some more functionality up front, giving you the opportunity to see the WARC header for a single post, as well as to select an alternative playback engine for the post. Some of the more noticeable reworks and optimizations are highlighted in the sections below.

Faster loadtimes

We’ve done some work under the hood too, to make the application run faster. A lot of our calls to the backend have been reworked into individual calls, made only when needed. This means that facet calls are now made as a separate call to the backend instead of being bundled with every query. So when you’re paging through results, we only request the results – giving us a faster response, since the facets stay the same. The same principle has been applied to loading images and individual post data.

GUI polished

As mentioned, we’ve done some cleanup in the interface, making it easier to navigate. The search field has been reworked to serve many needs. It will expand if the query is line-separated (do so with SHIFT+Enter), making large and complex queries much easier to manage. We’ve even added context-sensitive help, so if you’re making queries with boolean operators or similar, SolrWayback will tell you if the syntax is correct.

We’ve kept the most used features up front, with image and URL search readily available from the get-go. The same goes for the option to group the search results to avoid URL duplicates.

Below the line are some of the other features not directly linked to the query field, but nice to have up front: searching with an uploaded file, searching by GPS, and the toolbox containing many of the different tools that can help you gain insight into the archive by generating wordclouds or link graphs, searching through the Ngram interface and much more.

The nifty helper when making complex queries for SolrWayback.

Image searching by location rethought

We’ve reworked the way you search and look through the results when searching by GPS coordinates. We’ve made it easy to search for a specific location, and we’ve grouped the results so that they are easier to interpret.

The new and improved location search interface. Images intentionally blurred.

Zooming into the map will expand the places where images are clustered. Furthermore, we realize that sometimes the need is to look through all the images regardless of their exact position, so we’ve made a split screen that can expand either way, depending on your needs. It’s still possible to do a new search based on any of the found images in the list.

Elaborated export options

We’ve added more functionality to the export options. It’s possible to export both the fields from the full search result and the raw WARC records for the search result, if enabled in the settings. You can even decide the format of your export, and we’ve added an option to select exactly which fields in the search result you want exported – so if you want to leave out some stuff, that is now possible!

Quickly move through your archive

The standard search UI is pretty much as you are accustomed to, but we made an effort to keep things simple and clean as well as to facilitate in-depth research and tracking of subject interests. In the search results you get a basic outline of metadata for each post. You can narrow your search with the provided facet filters. When expanding a post you get access to all metadata, and every field has a link if you wish to explore a particular angle related to your post. So you can quickly navigate the archive by starting wide, filtering, and afterwards doing a specific drill-down to find related material.

Visualization of search result by domain

We’ve also made it very easy to quickly get an overview of the results. When clicking the icon in the results headline, you get a complete overview of the different domains in the results and how big a portion of the search result they account for in each year. This is a very neat way to get an overview of the results and their relative distribution by year.

The toolbox

With quick access from right under the search box, we have gathered the Toolbox with utilities for further data exploration. In the following, we will give you a quick tour of the updates and new functionality in this section.

Linkgraph, domain stats and wordcloud

Link graph.

Domain stats.

Wordcloud.

We reworked the Linkgraph, Wordcloud and Domain stats components, adding some more interaction to the graph and the domain stats, and polished the interface for all of them. For the Linkgraph, it is now possible to highlight certain sections within the graph, making it much easier to navigate the sometimes rather large cluster and to look at the connections you find relevant. These tools now provide an easy and quick way to gain a deeper insight into specific domains and what content they hold.

Ngram

We are so pleased to finally be able to supply a classical Ngram search tool, complete with graphs and all. In this version you are able to search through the entire HTML content of your archive and see how the results are distributed over time (harvest time). You can even do comparisons by providing several queries sequentially and see how they compare. On every graph, the data point at each year is clickable and will trigger a search for the underlying results, which is a very handy feature for checking the context and further exploring the underlying data. Oh, and before we forget – if things get a little crowded in the graph area, you can always click on the nicely colored labels at the top of the chart and deselect/select each query.

The ngram interface.

The evolution of the blink tag.

If the HTML content isn’t really your thing but your passion lies with the HTML tags themselves, we’ve got you covered. Just flip the radio button under the search box over to HTML tags in HTML pages and you will have all the same features listed above, but now the underlying data will be the HTML tags themselves. As easy as that, you will finally be able to get answers to important questions like ‘when did we actually start to frown upon the blink tag?’

The export functionality for Ngram.

Gephi Export

The possibility to export a query, in a format that can be used in Gephi, is still present in the new version of SolrWayback. This will allow you to create some very nice visual graphs that can help you explore how exactly a collection of results is tied together. If you’re interested in this, feel free to visit the labs website about Gephi graphs, where we’ve showcased some of the possibilities of using Gephi.

Tools for the playback

SolrWayback comes with a built-in playback engine, but can be configured to use another playback engine such as PyWb. The SolrWayback playback viewer shows a small toolbar overlay on the page that can be opened or hidden. When the toolbar is hidden, the page is displayed without any frame/top toolbar etc., to show the page exactly as it was harvested.

The menu when you access the individual search results.

When you have clicked a specific result, you’re taken to the harvested resource. If it is a website, you will be shown a menu to the right, giving you some more ways to analyse the resource. This menu is hidden in the upper left corner when you enter, but can be expanded by clicking on it.

The harvest calendar will give you a very smooth overview of the harvest times of the resource, so you can easily see when, and how often, the resource has been harvested in the current index. This gives you an excellent opportunity to look at your index over time and see how a website evolved.

The date harvest calendar module.

The PWID option lets you export the harvested resource’s metadata, so you can share what’s in that particular resource in a nice and clean way. The PWID standard is an excellent way to keep track of, and share, resources between researchers, so that a list of the exact dataset is preserved – along with all the resources that go with it.

View page resources gives you a clean overview of the contents of the harvested page, along with all the resources. We’ve even added a way to quickly see the difference between the first and the last harvested resource on the page, giving you a quick hint of the contents and whether they are all from the same period. You can even see a preview of the page here and download the individual resources from the page, if you wish.

Customization of your local SolrWayback instance

We’ve made it possible to customize your installation to fit your needs. The logo can be changed, the about text can be changed, and you can even customize your search guidelines if you need to. This makes sure that you have a chance to make the instance your own in some way – making sure that people can recognize when they are using your instance of SolrWayback, and that it reflects your organisation and the people who are contributing to it.

The future of the SolrWayback

This is just the beginning for SolrWayback. Further down the road, we hope to add even more functionality that can help you dig deeper into the archives. One of our main goals is to provide you with the tools necessary to understand and analyse the vast amounts of data that lie in most of the archives that SolrWayback is designed for. We already have a few ideas as to what could be useful, but if you have any suggestions for tools that might be helpful, feel free to reach out to us.

Online Hours: Supporting Open Source

By Andrew Jackson, Web Archiving Technical Lead at the British Library

At the UK Web Archive, we believe in working in the open, and that organisations like ours can achieve more by working together and pooling our knowledge through shared practices and open source tools. However, we’ve come to realise that simply working in the open is not enough – it’s relatively easy to share the technical details, but less clear how to build real collaborations (particularly when not everyone is able to release their work as open source).

To help us work together (and maintain some momentum in the long gaps between conferences or workshops), we were keen to try something new, and hit upon the idea of Online Hours. It’s simply a regular web conference slot (organised and hosted by the IIPC, but open to all) which can act as a forum for anyone interested in collaborating on open source tools for web archiving. We’ve been running for a while now, and have settled on a rough agenda:

Full-text indexing:
– Mostly focussing on our Web Archive Discovery toolkit so far.

Heritrix3:
– including Heritrix3 release management, and the migration of Heritrix3 documentation to the GitHub wiki.

Playback:
– covering e.g. SolrWayback as well as OpenWayback and pywb.

AOB/SOS:
– for Any Other Business, and for anyone to ask for help if they need it.

This gives the meetings some structure, but is really just a starting point. If you look at the notes from the meetings, you’ll see we’ve talked about a wide range of technical topics, e.g.

  • OutbackCDX features and documentation, including its API;
  • web archive analysis, e.g. via the Archives Unleashed Toolkit;
  • summary of technologies so we can compare how we do things in our organisations, to find out which tools and approaches are shared and so might benefit from more collaboration;
  • coming up with ideas for possible new tools that meet a shared need in a modular, reusable way and identify potential collaborative projects.

The meeting is weekly, but we’ve attempted to make the meetings inclusive by alternating the specific time between 10am and 4pm (GMT). This doesn’t catch everyone who might like to attend, but at the moment I’m personally not able to run the call at a time that might tempt those of you on Pacific Standard Time. Of course, I’m more than happy to pass the baton if anyone else wants to run one or more calls at a more suitable time.

If you can’t make the calls, please consider:

My thanks go to everyone who has come along to the calls so far, and to the IIPC for supporting us while still keeping it open to non-members.

Maybe see you online?