A Retrospective with the Archives Unleashed Project

At the 2016 IIPC Web Archiving Conference in Reykjavík, Ian Milligan and Matthew Weber spoke about the importance of building communities around analysing web archives and of bringing together interdisciplinary researchers, which is exactly what Archives Unleashed 1.0, the first Web Archive Hackathon, hosted by University of Toronto Libraries, set out to do. At the same conference, Nick Ruest and Ian gave a workshop on the earliest version of the Archives Unleashed Toolkit ("Hands on with Warcbase"). Five years and seven datathons later, the Archives Unleashed Project has seen major technological developments (including the Cloud version of the Toolkit, integrated with Archive-It collections), a growing community of researchers, an expanded team, new partnerships, and two major grants. At the project's core there remains a desire to engage the community, and the most recent initiative, which builds on the datathons, is the Cohort Program: a year-long collaboration in which research teams engage with web archives while receiving mentorship and support from the Archives Unleashed team.

In her blog post, Samantha Fritz, the Project Manager of the Archives Unleashed Project, reflects on the strategy and key milestones achieved between 2017 and 2020, as well as on the new partnership with Archive-It and the plans for the next three years.


By Samantha Fritz, Project Manager, Archives Unleashed Project

The web archiving world blends the work and contributions of many institutions, groups, projects, and individuals. The field is witnessing work and progress in many areas, from policies to professional development and learning resources to the development of tools that address replay, acquisition, and analysis.

For over two decades, memory institutions and organizations around the world have engaged in web archiving to ensure the preservation of born-digital content that is vital to our understanding of post-1990s research topics. Increasingly, web archiving programs are adopted as part of institutional activities, because there is a broad recognition among librarians, archivists, scholars, and others that web archives are critical, and vulnerable, resources for stewarding our cultural heritage.

The National Digital Stewardship Alliance has conducted surveys to “understand the landscape of web archiving activities in the United States.” In the most recent survey, from 2017, respondents indicated that the area in which they perceived the least progress over the preceding two years was access, use, and reuse. The report indicates that this could suggest “a lack of clarity about how Web archives are to be used post-capture” (Farrell et al. 2018, p. 13). This finding makes sense given that the field's focus has largely revolved around selection, appraisal, scoping, and capture.

Ultimately, the active use of web archives by researchers, and by extension the development of tools to explore them, has lagged. As a result, institutions and service providers such as librarians and archivists are tasked with figuring out how to “use” web archives.

We have petabytes of data, but we also have barriers

The amount of data captured is well into the petabyte range. Larger organizations like the Internet Archive, the British Library, the Bibliothèque Nationale de France, Denmark’s Netarchive, the National Library of Australia’s Trove platform, and Portugal’s Arquivo.pt have curated extensive web archive collections, yet we still do not see mainstream or heavy use of web archives as primary sources in research. This is due in part to access and usability barriers. Essentially, the technical experience needed to work with web archives, especially at scale, is beyond the reach of most scholars.

It is this barrier that offers an opportunity for discussion and work in and beyond the web archiving community. With that in mind, this post reflects on the Archives Unleashed Project's contributions to lowering barriers to working with web archives.

About the Archives Unleashed Project

Archives Unleashed was established in 2017 with support from The Andrew W. Mellon Foundation. The project grew out of an earlier series of events which identified a collective need among researchers, scholars, librarians and archivists for analytics tools, community infrastructure, and accessible web archival interfaces.

In recognizing the vital role web archives play in studying topics from the 1990s forward, the team has focused on developing open-source tools that lower the barrier to working with and analyzing web archives at scale.

From 2017 to 2020, Archives Unleashed pursued a three-pronged strategy for tackling the computational challenges of working with large data, and more specifically W/ARCs:

  1. Development of the Archives Unleashed Toolkit: apply modern big data analytics infrastructure to scholarly analysis of web archives.
  2. Deployment of the Archives Unleashed Cloud: provide a one-stop, web-based portal for scholars to ingest their Archive-It collections and execute a number of analyses with the click of a mouse.
  3. Organization of Archives Unleashed Datathons: build a sustainable user community around our open-source software.

Milestones + Achievements

If we look at how Archives Unleashed tools have developed, we have to reach back to 2013 when Warcbase was developed. It was the forerunner to the Toolkit and was built on Hadoop and HBase as an open-source platform to support temporal browsing and large-scale analytics of web archives (Ruest et al., 2020, p. 157).

The Toolkit moves beyond the foundations of Warcbase. Our first major transition was to replace Apache HBase with Apache Spark to modernize analytical functions. In developing the Toolkit, the team was able to leverage the needs of users to inform two significant development choices. First, we created a Python interface with functional parity with the Scala interface. Python is widely accepted, and more commonly known, among scholars in the digital humanities who engage in computational work. From a sustainability perspective, Python is stable, open source, and ranked as one of the most popular programming languages.

Second, the Toolkit shifted from Spark’s resilient distributed datasets (RDDs), part of the Warcbase legacy, to support DataFrames. While this was part of the initial Toolkit roadmap, the team engaged with users to discuss the impact of alternative options to RDDs. Essentially, DataFrames offer the ability within Apache Spark to produce tabular output. The community accepted this approach unanimously, in large part because of its familiarity from pandas, and because DataFrames made the data outputs easier to read visually (Fritz et al., 2018). A minimal sketch of what this looks like from the Python side is shown below.
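The sketch below is a minimal, hedged illustration of a PySpark session using the Toolkit's Python interface. It assumes a PySpark shell launched with the aut package and its JAR, so that sc and sqlContext are already defined; the WARC path is illustrative.

# Sketch only: assumes a PySpark shell started with the aut package and JAR,
# so `sc` and `sqlContext` already exist. The path below is illustrative.
from aut import WebArchive

# Load a directory of W/ARC files and expose the captured web pages
# as a Spark DataFrame (tabular output, pandas-like to read).
archive = WebArchive(sc, sqlContext, "/path/to/warcs/")

archive.webpages() \
    .select("crawl_date", "url") \
    .show(10, truncate=False)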

Comparison between RDD and DataFrame outputs
 

The Toolkit is currently at release 0.90.0, and while it offers powerful analytical functionality, it is still geared towards an advanced user. Recognizing that scholars often did not know where to start with analyzing W/ARC files, and that the command line can be intimidating, we took a cookbook approach in developing our Toolkit user documentation. With it, researchers can modify dozens of example scripts for extracting and exploring information, along the lines of the sketch below. Our team focused on designing documentation that presented possibilities and options, while at the same time guiding and supporting user learning.
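As a flavour of such a recipe, here is a hedged sketch of a derivative-style extraction: counting captures per domain and writing the result out as CSV. It assumes the same PySpark session as above; extract_domain is assumed to be the aut package's domain-extraction helper, and both paths are illustrative.

# Sketch of a cookbook-style recipe: captures per domain, written out as CSV.
# Assumes a PySpark shell with the aut package loaded; `extract_domain` is
# assumed to be aut's domain-extraction UDF, and both paths are illustrative.
from aut import WebArchive, extract_domain

WebArchive(sc, sqlContext, "/path/to/warcs/") \
    .webpages() \
    .groupBy(extract_domain("url").alias("domain")) \
    .count() \
    .sort("count", ascending=False) \
    .write.csv("/path/to/output/domain-counts")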

 
Sparkshell for using the Archives Unleashed Toolkit

The work to develop the Toolkit provided the foundations for other platforms and experimental methods of working with web archives. The second large milestone reached by the project was the launch of the Archives Unleashed Cloud.

The Archives Unleashed Cloud, largely developed by project co-investigator Nick Ruest, is an open-source platform that provides a web-based front end for the most recent version of the Archives Unleashed Toolkit. A core feature of the Cloud is that it uses the Archive-It WASAPI, which means that users are directly connected to their Archive-It collections and can proceed to analyze web archives without having to spend time delving into the technical details.
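For context, the sketch below shows the kind of call WASAPI makes possible: listing the WARC files behind one of your Archive-It collections. The endpoint and response fields follow the published WASAPI specification as I understand it, and the collection ID and credentials are placeholders; this is not the Cloud's own code.

# Hedged sketch of a WASAPI call: list the WARC files for an Archive-It
# collection. Endpoint and field names follow the WASAPI spec as published;
# the collection id and credentials below are placeholders.
import requests

resp = requests.get(
    "https://partner.archive-it.org/wasapi/v1/webdata",
    params={"collection": 12345},              # hypothetical collection id
    auth=("archive-it-user", "secret"),        # your Archive-It credentials
)
resp.raise_for_status()

for f in resp.json()["files"]:
    print(f["filename"], f["locations"][0])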

 

 

Archives Unleashed Cloud Interface for Analysis

Recognizing that the Toolkit, while flexible and powerful, may still be a little too advanced for some scholars, the Cloud offers a more user-friendly and familiar interface for interacting with data. Users are presented with simple dashboards that provide insights into WARC collections, downloadable derivative files, and simple in-browser visualizations.

By June 2020, which marked the end of our grant, the Cloud had analyzed just under a petabyte of data and had been used by individuals from 59 unique institutions across 10 countries. The Cloud remains an open-source project, with code available through a GitHub repository. The canonical instance will be deprecated as of June 30, 2021 and migrated into Archive-It, but more on that project in a bit.

Datathons + Community Engagement

Datathons provided an opportunity to build a sustainable community around Archives Unleashed tools, scholarly discussion, and training for scholars with limited technical expertise to explore archived web content.

Adapting the hackathon model, these events saw participants from over fifty institutions in seven countries engage in a hands-on learning environment, working directly with web archive data and new analytical tools to produce creative and inventive projects that explore W/ARCs. In collaborating with host institutions, the datathons also highlighted the hosts' web archive collections, increasing the visibility of, and use cases for, their curated collections.

In a recently published article, “Fostering Community Engagement through Datathon Events: The Archives Unleashed Experience,” we reflected on the impact that our series of datathon events had on community engagement within the web archiving field, and on the professional practices of attendees. We conducted interviews with datathon participants to learn about their experiences and complemented this with an exploration of established models from the community engagement literature. Our article culminates in contextualizing a model for community building and engagement within the Archives Unleashed Project, with potential applications for the wider digital humanities field. 

Our team has also invested in and participated in the wider web archiving community through additional scholarly activities, such as institutional collaborations, conferences, and meetings. We recognize that these activities bring together many perspectives, and they have been a great opportunity to listen to the needs of users and engage in conversations that impact adjacent disciplines and communities.

Archives Unleashed Datathon, Gelman Library, George Washington University

Lessons Learned

1. It takes a community

If there is one main takeaway we’ve learned as a team, and that all our activities point to, it’s that projects can’t live in silos! Whether in digital humanities, digital libraries, or any other discipline, projects need communities to function, survive, and thrive.

We’ve been fortunate and grateful to have been able to connect with various existing groups including being welcomed by the web archiving and digital humanities communities. Community development takes time and focused efforts, but it is certainly worthwhile! Ask yourself, if you don’t have a community, who are you building your tools, services, or platforms for? Who will engage with your work?

We have approached community building through a variety of avenues. First and foremost, we have developed relationships with people and organizations. This is clearly highlighted through our institutional collaborations in hosting datathon events, but we’ve also used platforms like Slack and Twitter to support discussion and connection opportunities among individuals. For instance, in creating both general and specific Slack channels, new users are able to connect with the project team and user community to share information and resources, ask for help, and engage in broader conversations on methods, tools, and data. 

Regardless of platform, successful community building relies on authentic interactions and an acknowledgment that each user brings unique perspectives and experiences to the group. In many cases we have connected with users who are new either to the field or to methods for analyzing web archives. This perspective has helped to inform an empathetic approach to the way we create learning materials, deliver reports and presentations, and share resources.

2. Interdisciplinary teams are important

So often we see projects and initiatives that highlight an interdisciplinary environment – and we’ve found it to be an important part of why our project has been successful. 

Each of our project investigators personifies a group of users that the Archives Unleashed Project aims to support, all of which converge around data, more specifically WARCs or web archive data. We have a historian who is broadly representative of digital humanists and researchers who analyze and explore web archives; a librarian who represents the curators and service providers of web archives; and a computer scientist who reflects tool builders.

A key strength of our team has been to look at the same problem from different perspectives, allowing each member to apply their unique skills and experiences in different ways. This has been especially valuable in developing underlying systems, processes and structures which now make up the Toolkit. For instance, triaging technical components offered a chance for team members to apply their unique skill sets, which often assisted in navigating issues and roadblocks.

We also recognized that each sector has its own language and jargon that can be jarring to new users. In identifying the wide range of technical skills within our team, we leveraged (and valued) those “I have no idea what this means or what this does” moments. If these kinds of statements were made by team members or close collaborators, chances are they would carry through to our user community.

Ultimately, the interdisciplinary nature and the wide range of technical expertise found within our team helped us to see and think like our users.

3. Sustainability planning is really hard

Sustainability has been part question, part riddle, as it is for many digital humanities projects. These sustainability questions speak to the long-term lifecycle of the project, and our primary goal has always been to ensure the project’s survival and continued use once the grant cycle has ended.

As such, the Archives Unleashed team has developed tools and platforms with sustainability in mind, specifically by building on stable, widely used programming languages and best practices. We have also been committed to ensuring that all our platforms and tools are developed in the spirit of open access and are available in public GitHub repositories.

One overarching question remained as our project entered its final stages in the spring of 2020: how will the Toolkit live on? Three years of development and use cases not only demonstrated the need for, and adoption of, the tools created under the Archives Unleashed Project, but also made clear that without these tools there are currently no simplified processes to adequately replace them.

Where we are headed (2020-2023)

Our team was awarded a second grant from The Andrew W. Mellon Foundation, which started in 2020 and will secure the future of Archives Unleashed. The goal of this second phase is the integration of the Cloud with Archive-It, so that as a tool it can succeed in a sustainable, long-term environment. The collaboration between Archives Unleashed and Archive-It also aims to continue to widen and enhance the accessibility and usability of web archives.

Priorities of the Project

First, we will merge the Archives Unleashed analytical tools with the Internet Archive’s Archive-It service to provide an end-to-end process for collecting and studying web archives. This will be completed in three stages:

  1. Build. Our team will be setting up the physical infrastructure and computing environment needed to kick start the project. We will be purchasing dedicated infrastructure with the Internet Archive.
  2. Integrate. Here we will be migrating the back end of the Archives Unleashed Cloud to Archive-It and paying attention to how the Cloud can scale to work within its new infrastructure. This stage will also see the development of a new user interface that will provide a basic set of derivatives to users.
  3. Enhance. The team will incorporate consultation with users to develop an expanded and enhanced set of derivatives and implement new features.

Secondly, we will engage the community by facilitating opportunities to support web archives research and scholarly outputs. Building on our earlier successful datathons, we will be launching the Archives Unleashed Cohort program to engage with and support web archives research. The Cohorts will see research teams participate in year-long intensive collaborations and receive mentorship from Archives Unleashed with the intention of producing a full-length manuscript.

We have made tremendous progress as the close of our first year comes into sight. Our major milestone will be completing the integration of the Archives Unleashed Cloud and Toolkit with Archive-It. Users will soon see a beta release of the new interface for conducting analysis with their web archive collections, with over a dozen downloadable derivatives for further analysis and access to simple in-browser visualizations.

Our team looks forward to the road ahead, and would like to express our appreciation for the support and enthusiasm Archives Unleashed has received!

 

We would like to recognize our 2017-2020 work was primarily supported by the Andrew W. Mellon Foundation, with financial and in-kind support from the Social Sciences and Humanities Research Council, Compute Canada, the Ontario Ministry of Research, Innovation, and Science, York University Libraries, Start Smart Labs, and the Faculty of Arts and David R. Cheriton School of Computer Science at the University of Waterloo.

 

References

Farrell, M., McCain, E., Praetzellis, M., Thomas, G., and Walker, P. 2018. Web Archiving in the United States: A 2017 Survey. National Digital Stewardship Alliance Report. DOI 10.17605/OSF.IO/3QH6N

Ruest, N., Lin, J., Milligan, I., and Fritz, S. 2020. The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL ’20). Association for Computing Machinery, New York, NY, USA, 157–166. DOI: https://doi.org/10.1145/3383583.3398513

Fritz, S., Milligan, I., Ruest, N., and Lin, J. 2018. To DataFrame or Not, that is the Questions: A PySpark DataFrames Discussion. Medium, May 29, 2018. https://news.archivesunleashed.org/to-dataframe-or-not-that-is-the-questions-a-pyspark-dataframes-discussion-600f761674c4

 

Resources

We’ve provided some additional reading materials and resources that have been written by our team, and shared with the community over the course of our project work.

For a full list please visit our publications page: https://archivesunleashed.org/publications/.

Shorter blog posts can be found on our Medium site: https://news.archivesunleashed.org

Warcbase

Toolkit

Cloud

Datathons/Community

Using OutbackCDX with OpenWayback

By Kristinn Sigurðsson, Head of Digital Projects and Development at the National and University Library of Iceland, IIPC Vice-Chair and Co-Lead of the Tools Development Portfolio

Last year I wrote about The Future of Playback and the work that the IIPC was funding to facilitate the migration from OpenWayback to PyWb for our members. That work is being done by Ilya Kreymer, and the first work package has now been delivered: an exhaustive transition guide detailing how OpenWayback configuration options can be translated into equivalent PyWb settings.

One thing I quickly noticed as I read through the guide is that it recommends using OutbackCDX as a backend for PyWb, rather than continuing to rely on flat-file, sorted CDXs. PyWb does support flat CDXs, as long as they are in the 9- or 11-column format, but a convincing argument is made that using OutbackCDX for resolving URLs is preferable, whether you use PyWb or OpenWayback.

What is OutbackCDX?

OutbackCDX is a tool created by Alex Osborne, Web Archive Technical Lead at the National Library of Australia. It handles the fundamental task of indexing the contents of web archives: mapping URLs to content in WARC files.

A “traditional” CDX file (or set of files) accomplishes this by listing each and every URL, in order, in a simple text file, along with information such as which WARC file it is stored in. This has the benefit of simplicity and can be managed using simple GNU tools, such as sort. Plain CDXs, however, make inefficient use of disk space, and as they get larger they become increasingly difficult to update, because inserting even a small amount of data into the middle of a large file requires rewriting a large part of the file.

OutbackCDX improves on this by using a simple but powerful key-value store, RocksDB. The URLs are the keys and the remaining CDX fields are the stored value. RocksDB then does the heavy lifting of storing the data efficiently and providing speedy lookups and updates to the data. Notably, OutbackCDX enables updates to the index without any disruption to the service.
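To make the lookup side concrete, here is a hedged sketch of querying OutbackCDX over HTTP for the captures of a single URL, following its CDX-server-style query API; the port and index name match the examples later in this post, and the exact set of returned columns depends on what was indexed.

# Hedged sketch: ask an OutbackCDX index which captures it holds for a URL.
# Port and index name match the examples in this post; adjust to your setup.
import requests

resp = requests.get(
    "http://localhost:8901/myindex",
    params={"url": "http://example.org/"},
)
resp.raise_for_status()

# Each returned line is a CDX record: timestamp, original URL,
# WARC filename, offset, and so on.
for line in resp.text.splitlines():
    print(line)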

The Mission

Given all this, transitioning to OutbackCDX for PyWb makes sense. But OutbackCDX also works with OpenWayback. If you aren’t quite ready to move to PyWb, adopting OutbackCDX first can serve as a stepping stone. It offers enough benefits all on its own to be worth it. And, once in place, it is fairly trivial to have it serve as a backend for both OpenWayback and PyWb at the same time.

So, this is what I decided to do. Our web archive, Vefsafn.is, has been running on OpenWayback with a flat file CDX index for a very long time. The index has grown to 4 billion URLs and takes up around 1.4 terabytes of disk space. Time for an upgrade.

Of course, there were a few bumps on that road, but more on that later.

Installing OutbackCDX

Installing OutbackCDX was entirely trivial. You get the latest release JAR, run it like any standalone Java application and it just works. It takes a few parameters to determine where the index should be, what port it should be on and so forth, but configuration really is minimal.

Unlike OpenWayback, OutbackCDX is not installed into a servlet container like Tomcat, but instead (like Heritrix) comes with its own built-in web server. End users do not need access to this, so it may be advisable to configure it to be accessible internally only.

Building the Index

Once running, you’ll need to feed your existing CDXs into it. OutbackCDX can ingest most commonly used CDX formats, certainly all that PyWb can read. CDX files can simply be posted to OutbackCDX using a command line tool like curl.

Example:

curl -X POST --data-binary @index.cdx http://localhost:8901/myindex

In our environment, we keep around a gzipped CDX for each (W)ARC file, in addition to the merged, searchable CDX that powered OpenWayback. I initially just wrote a script that looped through the whole batch and posted them, one at a time. I realized, though, that the number of URLs ingested per second was much higher in CDXs that contained a lot of URLs. There is an overhead to each post. On the other hand, you can’t just post your entire mega CDX in one go, as OutbackCDX will run out of memory.

Ultimately, I wrote a script that posted about 5MB of my compressed CDXs at a time. Using it, I was able to add all ~4 billion URLs in our collection to OutbackCDX in about 2 days. I should note that our OutbackCDX is on high-performance SSDs, the same as our regular CDX files have used.
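For illustration, here is a hedged sketch of that kind of batching script: it walks per-WARC gzipped CDX files, accumulates roughly 5 MB of them, and posts each batch to OutbackCDX. It is not the actual script used at Vefsafn.is; the paths, port, and index name are placeholders.

# Hedged sketch of the batching approach described above; not the actual
# script used at Vefsafn.is. Paths, port, and index name are placeholders.
import glob
import gzip
import os
import requests

OUTBACKCDX = "http://localhost:8901/myindex"
BATCH_BYTES = 5 * 1024 * 1024  # post roughly 5 MB of compressed CDX at a time

def post_batch(lines):
    # OutbackCDX accepts plain CDX text in the POST body.
    requests.post(OUTBACKCDX, data="".join(lines).encode("utf-8")).raise_for_status()

batch, batch_size = [], 0
for path in sorted(glob.glob("/data/cdx/*.cdx.gz")):
    with gzip.open(path, "rt") as f:
        batch.append(f.read())
    batch_size += os.path.getsize(path)
    if batch_size >= BATCH_BYTES:
        post_batch(batch)
        batch, batch_size = [], 0

if batch:  # post the final partial batch
    post_batch(batch)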

Configuring OpenWayback

Next up was to configure our OpenWayback instance to use OutbackCDX. This proved easy to do, but turned up some issues with OutbackCDX. First the configuration.

OpenWayback has a module called ‘RemoteResourceIndex’. This can be trivially enabled in the wayback.xml configuration file. Simply replace the existing `resourceIndex` with something like:

<property name="resourceIndex">
  <bean class="org.archive.wayback.resourceindex.RemoteResourceIndex">
    <property name="searchUrlBase" value="http://localhost:8080/myindex" />
  </bean>
</property>

And OpenWayback will use OutbackCDX to resolve URLs. Easy as that.

Those ‘bumps’

This is, of course, where I started running into those bumps I mentioned earlier. It turns out there were a number of edge cases where OutbackCDX and OpenWayback had different ideas. Luckily, Alex, the aforementioned creator of OutbackCDX, was happy to help resolve them. Thanks again, Alex.

The first issue I encountered was due to the age of some of our ARCs. The date fields had variable precision: rather than all being exactly 14 digits long, some had less precision and were only 10-12 characters long. This was resolved by having OutbackCDX pad those shorter dates with zeros.

I also discovered some inconsistencies in the metadata supplied along with the query results. OpenWayback expected some fields that were either missing or misnamed. These were a little tricky, as they only affected some aspects of OpenWayback, most notably the metadata in the banner inserted at the top of each page. All of this has been resolved.

Lastly, I ran into an issue related not to OpenWayback but to PyWb. It stemmed from the fact that my CDXs are not generated in the 11-column CDX format. The 11-column format includes the compressed size of the record holding the resource. OutbackCDX was recording this value as 0 when absent. Unfortunately, PyWb didn’t like this and would fail to load such resources. Again, Alex helped me resolve this.

OutbackCDX 0.9.1 is now the most recent release, and includes the fixes to all the issues I encountered.

Summary

Having gone through all of this, I feel fairly confident that swapping in OutbackCDX to replace a ‘regular’ CDX index for OpenWayback is very doable for most installations. And the benefits are considerable.

The size of the OutbackCDX index on disk ended up being about 270 GB. As noted before, the existing CDX index powering our OpenWayback was 1.4 TB. A reduction of more than 80%. OpenWayback also feels notably snappier after the upgrade. And updates are notably easier.

Our OpenWayback at https://vefsafn.is is now fully powered by OutbackCDX.

Next we will be looking at replacing it with PyWb. I’ll write more about that later, once we’ve made more progress, but I will say that having it run on the same OutbackCDX proved trivial to accomplish, and we now have a beta website running PyWb at http://beta.vefsafn.is.

Search results in Vefsafn.is (beta) that uses PyWb.


SolrWayback 4.0 release! What’s it all about? Part 2

By Thomas Egense, Programmer at the Royal Danish Library and the Lead Developer on SolrWayback.

This blog post is republished from Software Development at Royal Danish Library.

In this blog post I will go into the more technical details of SolrWayback and the new 4.0 release. The whole frontend GUI was rewritten from scratch to meet the expectations of 2020 web applications, and many new features were implemented in the backend. I recommend reading the frontend blog post first; it has beautiful animated GIFs demonstrating most of the features in SolrWayback.

Live demo of SolrWayback

You can access a live demo of SolrWayback here. Thanks to National Széchényi Library of Hungary for providing the SolrWayback demo site!

Back in 2018…

The open-source SolrWayback project was created in 2018 as an alternative to the Netarchive frontend applications that existed at the time. At the Royal Danish Library we were already using Blacklight as a search frontend. Blacklight is an all-purpose Solr frontend application and is very easy to configure and install by defining a few properties such as the Solr server URL, fields, and facet fields. But since Blacklight is a generic Solr frontend, it had no special handling of the rich data structure we had in Solr. Also, binary data such as images and videos are not in Solr, so integration with the WARC-file repository can enrich the experience and make playback possible, since Solr has enough information to also work as a CDX server.

Another interesting frontend was Shine. It was custom tailored for the Solr index created with the warc-indexer and had features such as trend analysis (n-gram) visualization of search results over time. The showstopper was that Shine was using an older version of the Play framework, and the latest version of the Play framework was not backwards compatible with it. Upgrading was far from trivial and would have required a major rewrite of the application. Adding to that, our frontend developers had years of experience with the larger, more widely used pure JavaScript frameworks. The frontenders’ weapon of choice for SolrWayback was the VUE JS framework; both SolrWayback 3.0 and the newly rewritten SolrWayback 4.0 have their frontends developed in VUE JS. If you have skills in VUE JS and an interest in SolrWayback, your collaboration will be appreciated.

WARC-Indexer. Where the magic happens!

WARC files are indexed into Solr using the WARC-Indexer. The WARC-Indexer reads every WARC record, extracts all kinds of information, and splits this into up to 60 different fields. It uses Tika to parse all the different MIME types that can be encountered in WARC files. Tika extracts the text from HTML, PDF, Excel, Word documents, etc. It also extracts metadata from binary documents if present; the metadata can include created/modified time, title, description, author, and so on. For images it can also extract metadata such as width/height or EXIF information such as latitude/longitude. The binary data themselves are not stored in Solr, but for every record in the WARC file there is a record in Solr. This also includes empty records such as HTTP 302 (MOVED) responses, with information about the new URL.

WARC-Indexer. Paying the price up front…

Indexing a large amount of WARC files requires massive amounts of CPU, but it is easily parallelized as the WARC-Indexer takes a single WARC file as input. To give an idea of the requirements, indexing 700 TB of WARC files (5.5M files) took 3 months using 280 CPUs. Once the existing collection is indexed, it is easier to keep up with the incremental growth of the collection. So this is the drawback when using SolrWayback on large collections: the WARC files have to be indexed first.

Solr provides multiple ways of aggregating data, moving common netarchive statistics tasks from slow batch processing to interactive requests. Based on input from researchers, the feature set is continuously expanding with aggregation, visualization and extraction of data.

Due to the amazing performance of Solr, a query is often answered in less than 2 seconds in a collection with 32 billion (32×10⁹) documents, and this includes facets. The search results are not limited to HTML pages where the free text is found; every document that matches the search query is returned. When presenting the results, each document type has a custom display for that MIME type. The sketch below illustrates the kind of request involved.
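The following is a generic, hedged sketch of such a query: a free-text search with facets, sent directly to a Solr collection built by the warc-indexer. It is not SolrWayback's own API; the core name and URL are placeholders, and the field names (content, domain, crawl_year) are assumptions about the warc-indexer schema.

# Hedged sketch of a faceted free-text query against a warc-indexer-built
# Solr collection. Not SolrWayback's own API; core name, URL, and field
# names (content, domain, crawl_year) are assumptions/placeholders.
import requests

resp = requests.get(
    "http://localhost:8983/solr/netarchive/select",
    params={
        "q": 'content:"climate change"',
        "rows": 10,
        "facet": "true",
        "facet.field": ["domain", "crawl_year"],
        "wt": "json",
    },
)
resp.raise_for_status()
data = resp.json()

print(data["response"]["numFound"], "matching documents")
# Facet counts come back as a flat [term, count, term, count, ...] list.
print(data["facet_counts"]["facet_fields"]["domain"][:10])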

HTML results are enriched with thumbnail images from the page as part of the result, images are shown directly, and audio and video files can be played directly from the results list with an in-browser player or downloaded if the browser does not support that format.

Solr. Reaping the benefits from the WARC-indexer

The SolrWayback Java backend offers a lot more than just sending queries to Solr and returning them to the frontend. Methods can aggregate data from multiple Solr queries or directly read WARC entries and return the processed data in a simple format to the frontend. Instead of re-parsing the WARC files, which is a very tedious task, the information can be retrieved from Solr, and the task can be done in seconds or minutes instead of weeks.

See the frontend blog post for more feature examples.

Wordcloud
Generating a wordcloud image is done by extracting text from 1,000 random HTML pages from the domain and generating a wordcloud from the extracted text.

Interactive linkgraph
By extracting the domains that link to a given domain (A) and also extracting the outgoing links from that domain (A), you can build a link graph. Repeating this for each newly found domain gives you a two-level local link graph for domain (A). Even though this can require hundreds of separate Solr queries, it is still done in seconds on a large corpus. Clicking a domain will highlight its neighbours in the graph (try the demo: interactive linkgraph).

Large scale linkgraph
Extraction of massive linkgraphs with up to 500K domains can be done in hours.

Link graph example from the Danish NetArchive.

The exported link-graph data was rendered in Gephi and made zoomable and interactive using Graph presenter. The link graphs can be exported quickly because all links (a href) for each HTML record are extracted and indexed as part of the corresponding Solr document.

Image search
Freetext search can be used to find HTML documents. The HTML documents in Solr are already enriched with the image links on the page, without having to parse the HTML again. Instead of showing the HTML pages, SolrWayback collects all the images from the pages and shows them in a Google-like image search result. Under the assumption that the text on an HTML page relates to its images, you can find images that match the query. If you search for “cats” in the HTML pages, the results will most likely show pictures of cats. These pictures could not have been found by searching the image documents alone if no metadata (or image name) contains “cats”.

CSV Stream export
You can export result sets with millions of documents to a CSV file. Instead of exporting all 60 possible Solr fields for each result, you can pick exactly which fields to export. This CSV export has already been used by several researchers at the Royal Danish Library and gives them the opportunity to use other tools, such as RStudio, to perform analysis on the data. The National Széchényi Library demo site has disabled CSV export in the SolrWayback configuration, so it cannot be tested live.

WARC corpus extraction
Besides CSV export, you can also export a result set to a WARC file. The export will read the WARC entry for each document in the result set, copy the WARC header + HTTP header + payload, and create a new WARC file with all results combined.

Extracting a sub-corpus this easily has already proven to be extremely useful for researchers. Examples include extracting a domain for a given date range, or a query restricted to a list of defined domains. This export is a 1:1 mapping from the results in Solr to the entries in the WARC files.

SolrWayback can also perform an extended WARC export which includes all resources (JS/CSS/images) for every HTML page in the export. The extended export ensures that playback will also work for the sub-corpus. Since the exported WARC file can become very large, you can use a WARC splitter tool or just split the export into smaller batches by adding crawl year/month to the query, etc. The National Széchényi Library demo site has disabled WARC export in the SolrWayback configuration, so it cannot be tested live.

SolrWayback playback engine

SolrWayback has a built-in playback engine, but it is optional, and SolrWayback can be configured to use any other playback engine that uses the same URL API for playback, “/server/<date>/<url>”, such as PyWb. It has been a common misunderstanding that SolrWayback forces you to use the SolrWayback playback engine. The demo at the National Széchényi Library has configured PyWb as an alternative playback engine: clicking the icon next to the title of an HTML result will open playback in PyWb instead of SolrWayback.

Playback quality

The playback quality of SolrWayback is an improvement over OpenWayback for the Danish Netarchive, but not as good as PyWb. The technique used is URL rewriting, just as in PyWb, replacing URLs according to the HTML specification for HTML pages and CSS files. However, SolrWayback does not yet replace links generated from JavaScript, though this is likely to be improved in a future major release. It has not been a priority, since the content of the Danish Netarchive is harvested with Heritrix and dynamic JavaScript resources are not harvested by Heritrix.

This is only a problem for absolute links, i.e. those starting with http://domain/…, since all relative URL paths are resolved automatically by the URL playback API. Relative links that refer to the root of the playback server are also resolved, by the SolrWaybackRootProxy application, which has this sole purpose: it calculates the correct URL from the HTTP referer header and redirects back into SolrWayback. Absolute URLs from JavaScript (or dynamic JavaScript) can result in live leaks. This can be avoided with an HTTP proxy or by adding a whitelist of URLs to the browser. In the Danish Citrix production environment, live leaks are blocked by sandboxing the environment. Improving playback is in the pipeline.

The SolrWayback playback has been designed to be as authentic as possible, without showing a fixed toolbar at the top of the browser. Only a small overlay is included in the top left corner, which can be removed with a click, so that you see the page as it was harvested. From the playback overlay you can open the calendar and an overview of the resources included by the HTML page, along with their timestamps compared to the main HTML page, similar to the feature provided by the archive.org playback engine.

The URL replacement is done up front and fully resolved to an exact WARC file and offset. An HTML page can have hundreds of different resources, and each of them requires a URL lookup for the version nearest to the crawl time of the HTML page. All resource lookups for a single HTML page are batched as a single Solr query, which improves both performance and scalability; a hedged sketch of the idea follows.
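To illustrate the batching idea (and only the idea, not SolrWayback's actual implementation), the sketch below resolves all of a page's embedded resources with one Solr request and then, per URL, keeps the capture closest in time to the page itself. The field names url_norm and crawl_date are assumptions about the warc-indexer schema, and the Solr URL is a placeholder.

# Hedged sketch of batched resource resolution: one Solr query for all of a
# page's resources, then pick the capture nearest the page's crawl time.
# Not SolrWayback's actual code; url_norm/crawl_date are schema assumptions.
import requests
from datetime import datetime

SOLR = "http://localhost:8983/solr/netarchive/select"
page_time = datetime(2020, 6, 1, 12, 0, 0)
resources = ["http://example.org/style.css", "http://example.org/logo.png"]

query = "url_norm:(" + " OR ".join('"%s"' % u for u in resources) + ")"
docs = requests.get(SOLR, params={
    "q": query,
    "rows": 1000,
    "fl": "url_norm,crawl_date",
    "wt": "json",
}).json()["response"]["docs"]

nearest = {}
for doc in docs:
    captured = datetime.strptime(doc["crawl_date"], "%Y-%m-%dT%H:%M:%SZ")
    url = doc["url_norm"]
    if url not in nearest or abs(captured - page_time) < abs(nearest[url] - page_time):
        nearest[url] = captured  # SolrWayback would also keep WARC file + offset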

SolrWayback and Scalability

For scalability, it all comes down to the scalability of SolrCloud, which has proven without a doubt to be one of the leading search technologies and is still rapidly improving in each new version. Storing the indexes on SSD gives substantial performance boosts as well but can be costly. The Danish Netarchive has 126 Solr services running in a SolrCloud setup.

One of the servers is the master and the only one that receives requests. The Solr master has an empty index but is responsible for gathering the data from the other Solr services; if the master server also had an index, there would be an overhead. 112 of the Solr servers have a 900 GB index with an average of ~300M documents each, while the last 13 servers currently have empty indexes, which makes expanding the collection easy without any configuration changes. Even with 32 billion documents, query response times are under 2 seconds. The result query and the facet query are separate, simultaneous calls; the advantage is that the results can be rendered very quickly and the facets finish loading later.

For very large results in the billions, the facets can take 10 seconds or more, but such queries are not realistic and the user should be more precise in limiting the results up front.

Building new shards
Building new shards (collection pieces) is done outside the production environment, and a shard is moved onto one of the empty Solr servers when its index reaches ~900 GB. The index is optimized before it is moved, since no more data will be written to it that would undo the optimization. This also gives a small performance improvement in query times. If the indexing were done directly into the production index, it would also impact response times. The separation of the production and building environments has spared us from dealing with complex problems we would have faced otherwise. It also makes speeding up the index building trivial, by assigning more machines/CPUs to the task and creating multiple indexes at once.

You cannot keep indexing into the same shard forever, as this would cause other problems. We found the sweet spot at the time to be an index size of ~900 GB, which could fit on the 932 GB SSDs available to us when the servers were built. The size of the index also demands more memory from each Solr server, and we have allocated 8 GB of memory to each. For our large-scale netarchive, we keep track of which WARC files have been indexed using Archon and Arctika.

Archon is the central server with a database; it keeps track of all WARC files, whether they have been indexed, and into which shard number.

Arctika is a small workflow application that starts WARC-Indexer jobs, queries Archon for the next WARC file to process, and reports back when it has been completed.

SolrWayback – framework

SolrWayback is a single Java web application containing both the VUE frontend and the Java backend. The backend has two REST service interfaces written with JAX-RS: one is responsible for services called by the VUE frontend and the other handles playback logic.

SolrWayback software bundle

SolrWayback comes with an out-of-the-box bundle release. The release contains a Tomcat server with SolrWayback, a Solr server, and a workflow for indexing. All products are preconfigured; all that is required is unzipping the zip file and copying the two property files to your home directory. Add some WARC files yourself and start the indexing job.

Try SolrWayback Software bundle!

IIPC Chair Address

By Abbie Grotke, Assistant Head, Digital Content Management Section
(Web Archiving Program), Library of Congress
and the IIPC Chair 2021-2022


Hello IIPC community!

I am thrilled to be the Chair of the IIPC in 2021. I’ve been involved in this organization since the very early days, so much so that somewhere buried in my folders in my office (which I have not been in for almost a year) are meeting notes from the very first discussions that led to the IIPC being formed back in 2003. Involvement in IIPC has been incredibly rewarding personally, and for our institution and all of our team members who have had the chance to interact with the community through working groups, projects, events, and informal discussions.

This year brings changes, challenges and opportunities for our community. Particularly during a time when many of us are isolated and working from home, both documenting web content about the pandemic and living it at the same time, connections to my friends and colleagues around the world seem more important than ever.

Here are a few key things to highlight for the coming year:

A Big Year for Organisation, Governance, and Strategic Planning Change

As a result of the fine work of the Strategic Direction Group led by Hansueli Locher of the Swiss National Library, the IIPC has a new Consortium Agreement for 2021-2025! This document is renewed every 4-5 years, and this time some key changes were made to strengthen our ability to manage the Consortium more efficiently and to reflect the organisational changes that have taken place since 2016. Feedback from IIPC members was used to create the new agreement, and you’ll notice a slight update of objectives, which now acknowledge the importance of collaborative collections and research. Many thanks to the Strategic Direction Group (Emmanuelle Bermès of the BnF, Steve Knight of the National Library of New Zealand, Hansueli Locher, Alex Thurman of the Columbia University Libraries, and the IIPC Programme and Communications Officer) for their work on this and continued engagement.

Executive Board and the Steering Committee’s terms

The new agreement establishes a new Executive Board composed of the Chair, the Vice-Chair, the Treasurer and our new senior staff member, as well as additional members of the SC appointed as needed. While the Steering Committee is responsible for setting out the strategic direction for our organisation for the next 5 years, one of our key tasks for this year is to convert it into an Action Plan.

The new Consortium Agreement aligns the terms of the Steering Committee members and the Executive Board. What it means in practice is that the SC members’ 3-year term will start on January 1 rather than June 1. We will open a call for nominations to serve on the SC during our next General Assembly, but if you are interested in nominating your institution, you can contact the PCO.

For more information about the responsibilities of the new Executive Board please review section 2.5 of the new Consortium Agreement.

Administrative Host

Our ability to have and compensate our Administrative and Financial Host has been formalized in the new agreement. We are excited to collaborate more with  the Council on Library and Information Resources (CLIR) this year through this arrangement, particularly in setting up some new staffing arrangements for us. More on this will be announced in the coming months.

Strategic Plan

One of our big tasks in 2021 will be working on the Strategic Plan. This work is led by the Strategic Direction Group, with input from the Steering Committee, Working Groups, and Portfolio Leads. Since this work is one of our most important activities for the year, Hansueli has joined the Executive Board to ensure close collaboration and support for the initiative.

Missing Your IIPC Colleagues? Join our Virtual Events!

A blast from the past: the IIPC General Assembly at the Library of Congress, May 2012.
From the left: Kristinn Sigurðsson (IIPC Vice-Chair, National and University Library of Iceland), Gildas Illien (BnF), and Abbie.

As anyone who has attended an IIPC event in person knows, it is one of the best parts about being a member. In my case, interacting with colleagues from around the world who have similar challenges, experiences, and new and exciting insights has been great for my own professional growth, and has only helped the Library of Congress web archiving program be more successful. While it’s sad that we cannot travel and meet in person together right now, there are opportunities to continue to connect virtually and to engage others in our institutions who may not have been able to travel to the in-person meetings. We’re already working on developing a more robust calendar of events for members (and some that will be more widely open to non-members).

As you’re aware, our big event, the General Assembly (June 14) and the Web Archiving Conference (June 15-16)  have been moved to a virtual event as a part of Web Archiving Week (virtually from Luxembourg). Many thanks to the National Library of Luxembourg for offering to host the online event!

Beyond the GA and WAC, due to the success of the well-received and well-attended webinars and calls with members in 2020, we will continue to deliver those over the course of the year. We are also working on additional training events and continuing report-outs of technical projects and member updates. Stay tuned for more soon and check our events page for updates!

Working Groups and funded projects

The IIPC continues to work collaboratively in 2021 on a number of initiatives through our Working Groups, including our transnational collections (the Covid-19 collection continues in 2021), training materials, and activities focusing on research use of web archives. 2021 also brings exciting funded project news, thanks to the continuation of the DFP, a funding programme launched in June 2019 and led by three former IIPC Chairs: Emmanuelle Bermès, Jefferson Bailey (Internet Archive), and Mark Phillips (University of North Texas Libraries). In 2020 the Jupyter Notebooks project, led by Andy Jackson of the British Library and created by Tim Sherratt, was successfully completed and won the British Library Labs award. This year, we are launching Developing Bloom Filters for Web Archives’ Holdings (a collaboration between Los Alamos National Laboratory (LANL) and the National and University Library in Zagreb) and Improving the Dark and Stormy Archives Framework by Summarizing the Collections of the National Library of Australia (a collaboration between Old Dominion University, the National Library of Australia, and LANL), continuing LinkGate: Core Functionality and Future Use Cases (Bibliotheca Alexandrina & National Library of New Zealand), and hoping to be able to hold the Archives Unleashed datathon led by the BnF in partnership with KBR / Royal Library of Belgium and the National Library of Luxembourg later in 2021.

We are also working with Webrecorder on the pywb transition support for members. The migration guide, with inputs from the IIPC Members, is already available and the work continues on the next stages of the project. Look for more updates on these projects through our events and blog posts throughout the year. There will also be an opportunity in 2021 for more projects to be funded, so we encourage members to start thinking about other projects that could use support and that would benefit the community.

Lastly, I want to remind you to continue to follow our activities on the IIPC website and Twitter (do tweet on #WebArchiveWednesday!). To subscribe to our mailing list, send an email to communications@iipc.simplelists.com.

I look forward to working with you all more closely this year. Please feel free to reach out to me if you have any questions or concerns during my time as Chair.

Happy Web Archiving to you all!

Abbie Grotke

Assistant Head, Digital Content Management Section (Web Archiving Program), Library of Congress

IIPC Chair 2021-2022

SolrWayback 4.0 release! What’s it all about?

By Jesper Lauridsen, frontend developer at the Royal Danish Library.

This blog post is republished from Software Development at Royal Danish Library.


So, it’s finally here! SolrWayback 4.0 was released December 20th, after an intense development period. In this blog post, we’ll give you a nice little overview of the changes we made, some of the improvements and some of the added functionality that we’re very proud of having released. So let’s dig in!

A small intro – What is SolrWayback really?

As the name implies, SolrWayback is a fusion of discovery (Solr) and playback (Wayback) functionality. Besides full-text search, Solr provides multiple ways of aggregating data, moving common net archive statistics tasks from slow batch processing to interactive requests. Based on input from researchers the feature set is continuously expanding with aggregation, visualization and extraction of data.

SolrWayback relies on real time access to WARC files and a Solr index populated by the UKWA webarchive-discovery tool. The basic workflow is:

  • Amass a collection of WARCs (using Heritrix, wget, ArchiveIT…) and put them on live storage
  • Analyze and process the WARCs using webarchive-discovery. Depending on the number of WARCs, this can be a fairly heavy job: processing ½ petabyte of WARCs at the Royal Danish Library took 40+ CPU-years
  • Index the result from webarchive-discovery into Solr. For non-small collections, this means SolrCloud and Solid State Drives. A rule of thumb is that the index takes up about 5-10% of the size of the compressed WARCs
  • Connect SolrWayback to the WARC storage and the Solr index

A small visual illustration of the components used for SolrWayback.

Live demo

Try Live demo provided by National Széchényi Library, Hungary. (thanks!)

Helicopter view: What happened to SolrWayback

We decided to give SolrWayback a complete makeover, making the interface more coherent, the design more stylish, and the information architecture better structured. At first glance, not much has changed apart from an update to the color scheme, but looking closely, we’ve added some new functionality and grouped some of the existing features in a new, and much improved, way.

The new interface of SolrWayback.

The search page is still the same, and after searching, you’ll still see all the results lined up in a nice single column. We’ve added some more functionality up front, giving you the opportunity to see the WARC header for a single post, as well as to select an alternative playback engine for the post. Some of the more noticeable reworks and optimizations are highlighted in the sections below.

Faster loadtimes

We’ve done some work under the hood too, to make the application run faster. A lot of our calls to the backend have been reworked into individual calls, only made when needed. This means that facet calls are now made as a separate call to the backend instead of being bundled with every query. So when you’re paging through results, we only request the results, giving us a faster response, since the facets stay the same. The same principle has been applied to loading images and individual post data.

GUI polished

As mentioned, we’ve done some cleanup in the interface, making it easier to navigate. The search field has been reworked to serve many needs. It will expand if the query is line-separated (do so with SHIFT+Enter), making large and complex queries much easier to manage. We’ve even added context-sensitive help, so if you’re making queries with boolean operators or similar, SolrWayback will tell you whether their syntax is correct.

We’ve kept the most used features upfront, with image and URL search readily available from the get go. The same goes for the option to group the search results to avoid URL duplicates.

Below the line are some of the other features not directly linked to the query field, but nice to have up front: searching with an uploaded file, searching by GPS, and the toolbox containing many of the different tools that can help you gain insight into the archive, by generating wordclouds or link graphs, searching through the Ngram interface, and much more.

The nifty helper when making complex queries for SolrWayback.

Image searching by location rethought

We’ve reworked the way you search and look through the results when searching by GPS coordinates. We’ve made it easy to search for a specific location, and we’ve grouped the results so that they are easier to interpret.

The new and improved location search interface. Images intentionally blurred.

Zooming into the map will expand the places where images are clustered. Furthermore, we realize that sometimes the need is to look through all the images regardless of their exact position, so we’ve made a split screen that can expand either way, depending on your needs. It’s still possible to do a new search based on any of the found images in the list.

Elaborated export options

We’ve added more functionality to the export options. It’s possible to export both fields from the full search result and the raw WARC records for the search result, if enabled in the settings. You can even decide the format of your export and we’ve added an option to select exactly which fields in the search result you want exported – so if you want to leave out some stuff, that is now possible!

Quickly move through your archive

The standard search UI is pretty much as you are accustomed to, but we made an effort to keep things simple and clean as well as to facilitate in-depth research and the tracking of subject interests. In the search results you get a basic outline of metadata for each post. You can narrow your search with the provided facet filters. When expanding a post you get access to all its metadata, and every field has a link if you wish to explore a particular angle related to your post. So you can quickly navigate the archive by starting wide, filtering, and afterwards doing a specific drill-down to find related material.

Visualization of search result by domain

We’ve also made it very easy to quickly get an overview of the results. When clicking the icon in the results headline, you get a complete overview of the different domains in the results and how big a portion of the search result they account for in each year. This is a very neat way to get an overview of the results and their relative distribution by year.

The toolbox

With quick access from right under the search box, we have gathered the Toolbox, with utilities for further data exploration. In the following we will give you a quick tour of the updates and new functionality in this section.

Linkgraph, domain stats and wordcloud

Link graph.

Domain stats.

Wordcloud.

We’ve reworked the Linkgraph, Wordcloud and Domain stats components a little, adding more interaction to the link graph and domain stats, and polishing the interface for all of them. In the Linkgraph it is now possible to highlight certain sections within the graph, making it much easier to navigate the sometimes rather large clusters and to look at the connections you find relevant. These tools now provide a quick and easy way to gain deeper insight into specific domains and the content they hold.

Ngram

We are so pleased to finally be able to supply a classic Ngram search tool, complete with graphs and all. In this version you are able to search through the entire HTML content of your archive and see how the results are distributed over time (harvest time). You can even make comparisons by providing several queries in sequence and seeing how they compare. On every graph the data point for each year is clickable and will trigger a search for the underlying results, which is a very handy feature for checking the context and further exploring the underlying data. Oh, and before we forget – if things get a little crowded in the graph area, you can always click on the nicely colored labels at the top of the chart to deselect/select each query.
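Under the hood, an ngram-over-time count of this kind can be approximated with plain Solr queries. The sketch below is illustrative only; the endpoint and field names (content, crawl_year) are assumptions about a typical web archive index, not SolrWayback’s actual implementation.

```python
import requests

SOLR_SELECT = "http://localhost:8983/solr/netarchivebuilder/select"  # assumed endpoint

def yearly_counts(query, years=range(2005, 2022)):
    """Count matching pages per harvest year for one ngram query."""
    counts = {}
    for year in years:
        params = {
            "q": f"content:({query})",
            "fq": f"crawl_year:{year}",  # restrict to one harvest year
            "rows": 0,                   # we only need the hit count
            "wt": "json",
        }
        response = requests.get(SOLR_SELECT, params=params).json()
        counts[year] = response["response"]["numFound"]
    return counts

# Compare two queries, roughly what the Ngram chart plots:
# series = {q: yearly_counts(q) for q in ["blink", "marquee"]}
```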

The ngram interface.

The evolution of the blink tag.

If the HTML content isn’t really your thing, but your passion lies with the HTML tags themselves, we’ve got you covered. Just flip the radio button under the search box over to HTML-tags in HTML-pages and you will have all of the same features listed above, but now the underlying data will be the HTML tags themselves. Just like that, you will finally be able to get answers to important questions like ‘when did we actually start to frown upon the blink tag?’

The export functionality for Ngram.

Gephi Export

The possibility to export a query in a format that can be used in Gephi is still present in the new version of SolrWayback. This will allow you to create some very nice visual graphs that can help you explore exactly how a collection of results is tied together. If you’re interested in this, feel free to visit the labs website about Gephi graphs, where we’ve showcased some of the possibilities of using Gephi.

Tools for the playback

SolrWayback comes with a built-in playback engine, but it can be configured to use another playback engine such as PyWb. The SolrWayback playback viewer shows a small toolbar overlay on the page that can be opened or hidden. When the toolbar is hidden, the page is displayed without any frame or top toolbar, showing the page exactly as it was harvested.

The menu when you access the individual search results.

When you have clicked a specific result, you’re taken to the harvested resource. If it is a website, you will be shown a menu to the right, giving you some more ways to analyse the resource. This menu is hidden in the upper left corner when you enter, but can be expanded by clicking on it.

The harvest calendar gives you a very smooth overview of the harvest times of the resource, so you can easily see when, and how often, the resource has been harvested in the current index. This gives you an excellent opportunity to look at your index over time and see how a website evolved.

The date harvest calendar module.

The PWID option lets you export the harvested resource’s metadata, so you can share what’s in that particular resource in a nice and clean way. The PWID standard is an excellent way to keep track of, and share, resources between researchers, so that a list describing the exact dataset is preserved – along with all the resources that go with it.
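For readers who haven’t met it before, a PWID (Persistent Web IDentifier) is a URN that records the archive, the archival timestamp, a precision level and the archived URI. The identifier below is purely illustrative (the archive, timestamp and URI are made up), but it shows the general shape:

```
urn:pwid:archive.org:2019-03-21T12:00:00Z:page:https://www.example.org/
```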

View page resources gives you a clean overview of the contents of the harvested page, along with all its resources. We’ve even added a way to quickly see the difference between the first and the last harvested resource on the page, giving you a quick hint of the contents and whether they are all from the same period. You can even see a preview of the page here and download the individual resources from the page, if you wish.

Customization of your local SolrWayback instance

We’ve made it possible to customize your installation to fit your needs. The logo can be changed, the about text can be changed, and you can even customize your search guidelines if you need to. This ensures that you can make the instance your own in some way, so that people can recognize when they are using your instance of SolrWayback, and so that it reflects your organisation and the people who contribute to it.

The future of SolrWayback

This is just the beginning for SolrWayback. Further down the road, we hope to add even more functionality that can help you dig deeper into the archives. One of our main goals is to provide you with the tools necessary to understand and analyse the vast amounts of data that lie in most of the archives SolrWayback is designed for. We already have a few ideas as to what could be useful, but if you have any suggestions for tools that might be helpful, feel free to reach out to us.

Collaborative collecting at webarchive.lu

By Ben Els, Digital Curator at the National Library of Luxembourg & the Chair of the Organising Committee for the 2021 IIPC General Assembly and the Web Archiving Conference

Our previous blog post from the Luxembourg Web Archive focused on the typical steps that many web archiving initiatives take at the start of their program: to gain first experience with event-based crawls. Elections, natural disasters, and events of national importance are typical examples of event collections. These temporary projects have occupied our crawler for the past 3 years (and continue to do so for the Covid-19 collection), but we also feel that it’s about time for a change of scenery on our seed lists.

How it works

Domain crawl

Aside from following the news on elections and Covid-19, we also operate two domain crawls a year, in which essentially all websites from the “.lu” top-level domain are captured. We use the research from the event collections to expand the seed list for domain crawls and, therefore, also add another layer of coverage to those events. However, the captures of websites from the event collections remain very selective and are usually not revisited once discussions around the event are over. This is why we plan to focus our efforts in the near future on building thematic collections. As a comparison:

Event collections: temporary; multifaceted coverage of one topic or event.

Thematic collections: evolving; focus on one subject area.

The idea is that event collections serve as a base from which to extract the subject areas for thematic collections. In turn, the thematic collections will serve as a base for starting event collections and save time on research. Over time, event collections will provide more intensive coverage of the subjects of thematic collections, and the latter will capture information before and after the topic of an event collection. For example, the seed list from an election crawl can serve as a basis for the thematic collection “Politics & Society”. The continued coverage and expansion of this collection will in turn provide an improved basis for a seed list once the next election campaign comes around. Moreover, both types of collections will help to broaden the scope of domain crawls and achieve better coverage of the Luxembourg web.

Collaboration with subject experts

Special Collections at webarchive.lu

During election crawls, it has always been important for us to invite input from different stakeholders, to make sure that the seed list covers all important areas surrounding the topic. The same principle has to be applied to the thematic collections. No curator can become an expert in every field, and our web archiving team will never be able to research and find all relevant websites in all domains and all languages from all corners of the Luxembourg web. Therefore, the curator’s job has to focus on finding the right people: people who know the web around their subject, experts in their field and representatives of their communities, who can help to build and expand seed lists over time. This means relying on internal and external subject experts who are familiar with the principles of web archiving and incentivised to contribute to the Luxembourg web archive.

While, technically, we haven’t tested the idea of this collaborative Lego-tower in reality, here are some of the challenges we would like to tackle this year:

• The workflows and platform used to collect the experts’ contributions need to be as easy to use as possible. Our contributors should not require hours of training and tutorials to get started, and it should be intuitive enough to pick up work on a seed list again after not having looked at it for several months.

• Subject experts should be able to contribute in the way that best fits their work rhythm: a quick and easy option to add single seeds spontaneously when coming across an interesting website, as well as a way to dive deeper into research and add several seeds at a time.

• We are going to ask for help, which means additional work for contributors inside and outside the library. This means that we need to keep the subject experts motivated and convince them that a working and growing web archive benefits everybody and that their input is indispensable.

Selection criteria for special collections

Next steps

As a first step, we would like to set up thematic collections with BnL subject experts, to see what the collaborative platform should look like and what kind of work input can be expected from contributors in terms of initial training and regular participation. The second stage will be to involve contributors from other heritage institutions who already provided lists to our domain crawls in the past. After that, we count on involving representatives of professional associations, communities or other organisations interested in seeing their line of business represented in the web archive.

On an even larger scale, the Luxembourg Web Archive will be open to contributions from students and researchers, website owners, web content creators and archive users in general, which is already possible through the “Suggest a website” form on webarchive.lu. While we haven’t received as many submissions as we would like, there have been very valuable contributions of websites that we would perhaps never have found otherwise. We have also noticed that it helps to raise awareness through calls for participation in the media. For instance, we received very positive feedback on our Covid-19 collection. If we are able to create interest on a larger scale, we can get many more people involved and improve the services provided by the Luxembourg Web Archive.

Call for participation in the Covid-19 collection on RTL Radio

Save the date!

While we work on putting the pieces of this puzzle together, we are also moving closer and closer to the 2021 General Assembly and Web Archiving Conference. It’s been two years since the IIPC community was last able to meet for a conference, and surely you are all as eager as we are to catch up, to learn, and to exchange ideas about problems and projects. So, if you haven’t done so already, please save the date for a virtual trip to Luxembourg from 14 to 16 June.

IIPC-supported project “Developing Bloom Filters for Web Archives’ Holdings”

By Martin Klein, Scientist in the Research Library at Los Alamos National Laboratory and Karolina Holub, Library Adviser at the Croatian Digital Library Development Centre at  National and University Library Zagreb

We are excited to share the news of a new IIPC-funded collaborative project between the Los Alamos National Laboratory (LANL) and the National and University Library Zagreb (NSK). In this one-year project we will develop a software framework for web archives to create Bloom filters of their archival holdings. A Bloom filter, in this context, consists of hash values of archived URIs and can therefore be thought of as an encrypted index of an archive’s holdings. Its encrypted nature allows web archives to share information about their holdings in a passive manner, meaning only hashed URI values are communicated, rather than plain text URIs. Sharing Bloom filters with interested parties can enable a variety of downstream applications such as search, synchronized crawling, and cataloging of archived resources.
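To make the idea concrete, a Bloom filter over archived URIs could be built along the following lines. This is a minimal illustrative sketch, not the project’s software (which is yet to be developed); the bit-array size, hash count and input file name are assumptions.

```python
import hashlib

class BloomFilter:
    """A simple Bloom filter: k hash functions over an m-bit array."""

    def __init__(self, m_bits=8_000_000, k_hashes=7):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions by salting a SHA-256 digest.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Build a filter from an archive's list of archived URIs (hypothetical export).
holdings = BloomFilter()
with open("archived-uris.txt") as fh:   # one URI per line
    for uri in fh:
        holdings.add(uri.strip())
```

Only the bit array is shared, so the plain-text URIs themselves never leave the archive, which is the passive-sharing property described above.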

Bloom filters and Memento TimeTravel

As many readers of this blog will know, the Prototyping Team at LANL has developed and maintains the Memento TimeTravel service, implemented as a federated search across more than two dozen Memento-compliant web archives. This service allows a user (or a machine, via the underlying APIs) to search for archived web resources (mementos) across many web archives at the same time. We have tested, evaluated, and implemented various optimizations for the search system to improve speed and avoid unnecessary network requests against participating web archives, but we can always do better. As part of this project, we aim to pilot a TimeTravel service based on Bloom filters that, if successful, should provide a false positive rate close to ideal, meaning almost no unnecessary network requests to web archives that do not have a memento of the requested URI.
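Building on the sketch above, the routing idea behind a filter-based TimeTravel lookup is simple: a filter miss is definitive, so only archives whose filters report a possible hit need to be contacted. The function below is again only an illustration, with placeholder archive names.

```python
def archives_to_query(uri, filters):
    """Return only the archives whose Bloom filter reports a possible hit.

    `filters` maps an archive name to a BloomFilter as sketched above.
    A miss is definitive; a hit may be a false positive and is verified
    by the normal Memento request that follows.
    """
    return [name for name, bloom in filters.items() if uri in bloom]

# e.g. filters = {"archive-a": BloomFilter(), "archive-b": BloomFilter()}
# archives_to_query("http://example.org/", filters)
```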

While Bloom filters are widely used to support membership queries (e.g., is element A part of set B?), they have, to the best of our knowledge, not been applied to querying web archive holdings. We see opportunities to improve the filters and, as additional components of this project, will investigate their scalability (in relation to CDX index size, for example) as well as the potential for incremental updates to the filters. Insights into the former will inform the applicability to archives and individual collections of different sizes, and the latter will guide a best-practice process for filter creation.

The development and testing of Bloom filters will be performed by using data from the Croatian Web Archive’s collections. NSK develops the Croatian Web Archive (HAW) in collaboration with the University Computing Centre University of Zagreb (Srce), which is responsible for technical development and will work closely with LANL and NSK on this project.

LANL and NSK are excited about this project and the new collaboration. We are thankful to the IIPC for its support and look forward to regularly sharing project updates with the web archiving community. If you would like to collaborate on any aspect of this project, please do not hesitate to get in touch.

Making the UK Government Social Media Archive even better

By Claire Newing, The National Archives (UK) and Phil Clegg, MirrorWeb.

At The National Archives of the UK we have been archiving the online presence of UK central government since 2003. We originally worked with suppliers to capture traditional websites which we made available for all to browse and search through the UK Government Web Archive: http://www.nationalarchives.gov.uk/webarchive/.

In the early 2010s, we recognised that the government was increasingly using social media platforms to communicate with the public. At that time Twitter, YouTube and Flickr were the most widely used platforms. We experimented with capturing channels using our standard Heritrix-based process, but with very limited success, so we embarked on a project with the Internet Memory Foundation (IMF), our then web archiving service provider, to develop a custom solution.

The project was partially successful. We developed a method of capturing Tweets and YouTube videos and the associated metadata directly from APIs and providing access to them through a custom interface. We also developed a method of crawling all shortlinks in Tweets so the links resolved correctly in the archive.

The Original Twitter Archive access page.

 

Original Archived Twitter Feed access page

Unfortunately, we were unable to find a way of capturing Flickr content. The UK Government Social Media Archive was launched in 2014. We continued to capture a small number of channels regularly until mid-July 2017 when we started working with a new supplier, MirrorWeb.

Captures in the cloud

MirrorWeb’s social media capture is undertaken using serverless functions running in the cloud. These functions authenticate with the social platform’s API and request the metadata for all new posts created on the social channel since the last capture point. Each post is then stored in a database for later replay. Further serverless functions are triggered when a post object is written to the database, to check whether the post contains media content such as images or videos. If media content is found, it is added to a queue, which in turn triggers another serverless function to download the media for the post. For replay, the stored JSON objects are read back from the database and presented to the user, with media objects, in a layout similar to the original platform.
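As a generic sketch of this kind of pipeline (not MirrorWeb’s actual code), a cloud function that reacts to newly stored posts and queues their media for download might look roughly like this; the event shape, queue URL and field names are assumptions for illustration.

```python
import json
import boto3

sqs = boto3.client("sqs")
# Placeholder queue URL; in a real deployment this would come from configuration.
MEDIA_QUEUE_URL = "https://sqs.example.amazonaws.com/123456789012/media-downloads"

def handle_new_posts(event, context):
    """Triggered whenever captured posts are written to the database.

    The event shape is assumed for illustration: a list of post records,
    each with an id and any media URLs found in the post.
    """
    for post in event.get("posts", []):
        for url in post.get("media_urls", []):
            # Queue the media so a separate downloader function can fetch it.
            sqs.send_message(
                QueueUrl=MEDIA_QUEUE_URL,
                MessageBody=json.dumps({"post_id": post["id"], "url": url}),
            )
```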

MirrorWeb chose to archive most social accounts daily to ensure that all new content is captured. Twitter, for example, limits the number of requests to its API in a 15-minute window and restricts the number of historic posts that can be collected, so some tuning is undertaken to ensure rate limits are not exceeded and that all posts are captured.

We also increased the number of channels we were capturing and took the opportunity to redesign our custom access pages, incorporating feedback from our users. Excitingly, images and video content embedded in Tweets were made accessible for the first time. The archive was formally re-launched in August 2018, but we always knew it could be even better!

Flickr, access, and full text search

Towards the end of 2018 we launched an improvement project with MirrorWeb. It had three key aims:

(1) To develop a method of capturing Flickr images. This became urgent when, in November 2018, Flickr announced that as of early 2019 free account holders would be limited to 1,000 images per account and any additional images above that number would be deleted. A survey showed that several UK government accounts, particularly older accounts which were no longer being updated, were at risk of losing content.

(2) To further improve our custom access pages.

(3) To provide full text search across the social media collection.

The project aims were fulfilled, and the new functionality went live late in 2019.

By far the most exciting new development was the implementation of full text search. We undertook some user research earlier in 2018 which revealed that users considered the archive to be interesting but didn’t think it was very useful without search. This emphasized to us how important it was to provide such a service.

The search service was built by MirrorWeb using Elasticsearch, the same technology we use for the full-text search facility on the UK Government Web Archive, our collection of archived websites. MirrorWeb once again makes use of serverless functions to provide full-text search of the social accounts. When social post metadata is written to the database, a serverless function is triggered to extract the relevant metadata, which is then added to Elasticsearch.
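As a rough sketch of that indexing step (not the archive’s actual code), writing a captured post into Elasticsearch via its standard document API could look like this; the index name and field names are illustrative assumptions.

```python
import requests

ES_URL = "http://localhost:9200"        # placeholder Elasticsearch endpoint
INDEX = "social-media-archive"          # assumed index name

def index_post(post):
    """Index one captured post so it becomes full-text searchable."""
    doc = {
        "platform": post["platform"],    # e.g. "twitter", "youtube", "flickr"
        "channel": post["channel"],
        "text": post.get("text", ""),    # Tweet text, or title + description
        "posted_at": post["posted_at"],  # ISO 8601 timestamp
    }
    # PUT /<index>/_doc/<id> creates or updates the document.
    response = requests.put(f"{ES_URL}/{INDEX}/_doc/{post['id']}", json=doc)
    response.raise_for_status()
```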

Each search queries the full text of Tweets and the descriptions and titles of YouTube and Flickr content. Users can initially search for keywords or a phrase and are then given the opportunity to filter the results by platform, channel and year of post. They can also choose to display only results which include or exclude specific words.
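In Elasticsearch terms, that combination of a keyword search with platform, channel and year filters, plus excluded words, maps naturally onto a bool query. The sketch below uses the same assumed index and field names as above.

```python
import requests

ES_URL = "http://localhost:9200"
INDEX = "social-media-archive"   # same assumed index as above

def search(phrase, platform=None, channel=None, year=None, exclude=None):
    """Full-text search with optional facet-style filters."""
    query = {
        "bool": {
            "must": [{"match": {"text": phrase}}],
            "filter": [],
            "must_not": [],
        }
    }
    if platform:
        query["bool"]["filter"].append({"term": {"platform": platform}})
    if channel:
        query["bool"]["filter"].append({"term": {"channel": channel}})
    if year:
        query["bool"]["filter"].append(
            {"range": {"posted_at": {"gte": f"{year}-01-01", "lt": f"{year + 1}-01-01"}}}
        )
    if exclude:
        query["bool"]["must_not"].append({"match": {"text": exclude}})
    body = {"query": query, "size": 20}
    return requests.post(f"{ES_URL}/{INDEX}/_search", json=body).json()

# e.g. search("Christmas", platform="twitter", year=2019)
```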

Results of a search for the word ‘Christmas’. The filters can be seen on the left-hand side.

Additionally, we added a search box to the top of each of our custom access pages to enable users to search all data captured from a specific channel. For example, on this page a user can search the titles and descriptions of all the videos we’ve captured from the Prime Minister’s Office YouTube channel.

Current access page for archive of the Prime Minister’s Office YouTube Channel.

What’s next?

When we started to investigate capturing social media content over a decade ago, there was a feeling in some quarters that content posted on social media was ephemeral and that posts were mainly used to point to content on traditional sites. Events of recent years have demonstrated that social media is of increasing importance. In some cases, government departments and ministers announce important information on social media some time before they update a traditional website.

We have achieved a lot already, but we know there is lots more to do. In future, we aspire to add a unified search across our website and social media archives. We are aware that the API capture method does not work for all platforms, so we are actively working to find other methods of capture, particularly for Instagram and GitHub. We hope to find a way of displaying metadata we capture that is not currently surfaced on the access pages – for example, changes to a channel’s thumbnail image over time.

We are also aware that there are many gaps in our web archive where we were unable to capture embedded YouTube videos. We hope to develop a method of linking from those gaps to the equivalent videos held in the YouTube archive. Finally, we plan to do some user research to guide future developments. We are very proud of the UK Government Social Media Archive and we want to make sure it is used to its full potential.

Current access page for archived Tweets from the HM Treasury Twitter feed showing embedded media.

The Dark and Stormy Archives Project: Summarizing Web Archives Through Social Media Storytelling

By Shawn M. Jones, Ph.D. student and Graduate Research Assistant at Los Alamos National Laboratory (LANL), Martin Klein, Scientist in the Research Library at LANL, Michele C. Weigle, Professor in the Computer Science Department at Old Dominion University (ODU), and Michael L. Nelson, Professor in the Computer Science Department at ODU.

The Dark and Stormy Archives Project applies social media storytelling to automatically summarize web archive collections in a format that readers already understand.

Individual web archive collections can contain thousands of documents. Seeds inform capture, but the documents in these collections are archived web pages (mementos) created from those seeds. The sheer size of these collections makes them challenging to understand and compare. Consider Archive-It as an example platform. Archive-It has many collections on the same topic. As of this writing, a search for the query “COVID” returns 215 collections. If a researcher wants to use one of these collections, which one best meets their information need? How does the researcher differentiate them? Archive-It allows its collection owners to apply metadata, but our 2019 study found that as a collection’s number of seeds rises, the amount of metadata per seed falls. This relationship is likely due to the increased effort required to maintain the metadata for a growing number of seeds. It is paradoxical for those viewing the collection because the more seeds exist, the more metadata they need to understand the collection. Additionally, organizations add more collections each year, resulting in more than 15,000 Archive-It collections as of the end of 2020. Too many collections, too many documents, and not enough metadata make human review of these collections a costly proposition.

We use cards to summarize web documents all of the time. Here is the same document rendered as cards on different platforms.

An example of social media storytelling at Storify (now defunct) and Wakelet: cards created from individual pages, pictures, and short text describe a topic.

Ideally, a user would be able to glance at a visualization and gain an understanding of the collection, but existing visualizations impose a lot of cognitive load and require training even to convey one aspect of a collection. Social media storytelling provides us with an approach. We see social cards all the time on social media. Each card summarizes a single web resource. If we group those cards together, we summarize a topic. Thus social media storytelling produces a summary of summaries. Tools like Storify and Wakelet already apply this technique to live web resources. We want to use this proven technique because readers already understand how to read these visualizations. The Dark and Stormy Archives (DSA) Project explores how to summarize web archive collections through these visualizations. We make our DSA Toolkit freely available to others so they can explore web archive collections through storytelling.

The Dark and Stormy Archives Toolkit

The Dark and Stormy Archives (DSA) Toolkit provides a solution for each stage of the storytelling lifecycle.

Telling a story with web archives consists of three steps. First, we select the mementos for our story. Next, we gather the information to summarize each memento. Finally, we summarize all mementos together and publish the story. We evaluated more than 60 platforms and determined that no platform could reliably tell stories with mementos. Many could not even create cards for mementos, and some mixed information from the archive with details from the underlying document, creating confusing visualizations.

Hypercane selects the mementos for a story. It is a rich solution that gives the storyteller many customization options. With Hypercane, we submit a collection of thousands of documents, and Hypercane reduces them to a manageable number. Hypercane provides commands that allow the archivist to cluster, filter, score, and order mementos automatically. The output from some Hypercane commands can be fed into others so that archivists can create recipes with the intelligent selection steps that work best for them. For those looking for an existing selection algorithm, we provide random selection, filtered random selection, and AlNoamany’s Algorithm as prebuilt intelligent sampling techniques. We are experimenting with new recipes. Hypercane also produces reports, helping us include named entities, gather collection metadata, and select an overall striking image for our story.
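As a purely conceptual illustration (not Hypercane’s actual commands or API), the “filtered random selection” idea boils down to filtering the candidate mementos with whatever predicate the curator chooses and then sampling from what remains:

```python
import random

def filtered_random_selection(mementos, is_relevant, sample_size=36, seed=42):
    """Conceptual sketch: filter candidate mementos, then sample a story's worth.

    `mementos` is a list of memento URIs (URI-Ms); `is_relevant` is any
    predicate the curator chooses (off-topic removal, language detection,
    de-duplication, ...).
    """
    candidates = [urim for urim in mementos if is_relevant(urim)]
    random.seed(seed)  # reproducible sample
    return random.sample(candidates, min(sample_size, len(candidates)))
```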

To gather the information needed to summarize individual mementos, we required an archive-aware card service; thus, we created MementoEmbed. MementoEmbed can create summaries of individual mementos in the form of cards, browser screenshots, word clouds, and animated GIFs. If a web page author needs to summarize a single memento, we provide a graphical user interface that returns the proper HTML for them to embed in their page. MementoEmbed also provides an extensive API on top of which developers can build clients.
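For instance, requesting a social card for a single memento from a MementoEmbed instance might look roughly like the following; the host, port and endpoint path are assumptions based on MementoEmbed’s documentation and may differ between deployments and versions.

```python
import requests

# Placeholder MementoEmbed deployment and example memento (URI-M).
MEMENTOEMBED = "http://localhost:5550"
URIM = "https://wayback.archive-it.org/4887/20141126100658/http://www.example.org/"

# Assumed endpoint shape: /services/product/socialcard/<URI-M>
response = requests.get(f"{MEMENTOEMBED}/services/product/socialcard/{URIM}")
html_snippet = response.text  # HTML for embedding the card in a page
```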

Raintale is one such client. Raintale summarizes all of the mementos together and publishes a story. An archivist can supply Raintale with a list of mementos. For more complex stories, including overall striking images and metadata, archivists can also provide output from Hypercane’s reports. Because we needed flexibility for our research, we incorporated templates into Raintale. These templates allow us to publish stories to Twitter, HTML, and other file formats and services. With these templates, an archivist can not only choose which elements to include in their cards; they can also brand the output for their institution.

Raintale uses templates to allow the storyteller to tell their story in different formats, with various options, including branding.

The DSA Toolkit at work

The DSA Toolkit produced stories from Archive-It collections about mass shootings (from left to right) at Virginia Tech, Norway, and El Paso.

 

Through these tools, we have produced a variety of stories from web archives. As shown above, we debuted with a story summarizing IIPC’s COVID-19 Archive-It collection, reducing a collection of 23,376 mementos to an intelligent sample of 36. Instead of seed URLs and metadata, our visualization displays people in masks, places that the virus has affected, text drawn from the underlying mementos, correct source attribution, and, of course, links back to the Archive-It collection so that people can explore it further. We recently generated stories that allow readers to view the differences between Archive-It collections about the mass shootings in Norway, El Paso, and Virginia Tech. Instead of facets and seed metadata, our stories show victims, places, survivors, and other information drawn from the sampled mementos. The reader can also follow the links back to the full collection page and get even more information using the tools provided by the archivists at Archive-It.

With help from StoryGraph, the DSA Toolkit produces daily news stories so that readers can compare the biggest story of the day across different years.

But our stories are not limited to Archive-It. We designed the tools to work with any Memento-compliant web archive. In collaboration with StoryGraph, we produce daily news stories built with mementos stored at Archive.Today and the Internet Archive. We are also experimenting with summarizing a scholar’s grey literature as stored in the web archive maintained by the Scholarly Orphans project.

We designed the DSA Toolkit to work with any Memento-compliant archive. Here we summarize Ian Milligan’s grey literature as captured by the web archive at the Scholarly Orphans Project.

Our Thanks To The IIPC For Funding The DSA Toolkit

We are excited to say that, starting in 2021, as part of a recent IIPC grant, we will be working with the National Library of Australia to pilot the DSA Toolkit with their collections. In addition to solving potential integration problems with their archive, we look forward to improving the DSA Toolkit based on feedback and ideas from the archivists themselves. We will incorporate the lessons learned back into the DSA Toolkit so that all web archives may benefit, which is what the IIPC is all about.

Relevant URLs

DSA web site: http://oduwsdl.github.io/dsa

DSA Toolkit: https://oduwsdl.github.io/dsa/software.html

Raintale web site: https://oduwsdl.github.io/raintale/

Hypercane web site: https://oduwsdl.github.io/hypercane/

WCT 3.0 Release

By Ben O’Brien, Web Archive Technical Lead, National Library of New Zealand

Let’s rewind 15 years, back to 2006. The Nintendo Wii is released, Google has just bought YouTube, Facebook switches to open registration, Italy has won the FIFA World Cup, and Borat is shocking cinema screens across the globe.

Java 6, Spring 1.2, Hibernate 3.1, Struts 1.2 and Acegi Security are some of the technologies we’re using to deliver open source enterprise web applications. One application in particular, the Web Curator Tool (WCT), is starting its journey into the wide world of web archiving. WCT is an open source tool for managing the selective web harvesting process.

2018 Relaunch

Fast forward to 2018, and these technologies themselves belong inside an archive. Instead, they were still being used by the WCT to collect content for web archives. Twelve years is a long time in the world of the Internet and IT, so needless to say, a fair amount of technical debt had caught up with the WCT and its users.

The collaborative development of the WCT between the National Library of the Netherlands and the National Library of New Zealand was full steam ahead after the release of the long-awaited Heritrix 3 integration in November 2018. With new features in mind, we knew we needed a modern, stable foundation within the WCT if we were to take it forward. Cue the Technical Uplift.

WCT 3.0

What followed was two years of development by teams in opposing time zones, battling resourcing constraints, lockdowns and endless regression testing. Now, at the beginning of 2021, we can at last announce the release of version 3.0 of the WCT.

While some of the names in the technology stack are the same (Java/Spring/Hibernate), the upgrade of these languages and frameworks represents a big milestone for the WCT. A launchpad to tackle the challenges of the next decade of web archiving!

For more information, see our recent blog post on webcuratortool.org. And check out a demo of v3.0 inside our virtual box image here.

WCT Team:

KB-NL

Jeffrey van der Hoeven
Sophie Ham
Trienka Rohrbach
Hanna Koppelaar

NLNZ

Ben O’Brien
Andrea Goethals
Steve Knight
Frank Lee
Charmaine Fajardo

Further reading on WCT:

WCT tutorial on IIPC
Documentation on WCT
WCT on GitHub
WCT on Slack
WCT on Twitter
Recent blogpost on WCT with links to old documentation