LinkGate is a scalable web archive graph visualization project. It was launched with funding from the IIPC in January 2020. During this round of funding, Bibliotheca Alexandrina (BA) and the National Library of New Zealand (NLNZ) partnered to develop core functionality for a scalable graph visualization solution geared towards web archiving and to compile an inventory of research use cases to guide future development of LinkGate.
What does LinkGate do?
LinkGate seeks to address the need to visualize data stored in a web archive. Fundamentally, the web is a graph, where nodes are webpages and other web resources, and edges are the hyperlinks that connect web resources together. A web archive introduces the time dimension to this pool of data and makes the graph a temporal graph, where each node has multiple versions according to the time of capture. Because the web is big, web archive graph data is big data, and scalability of a visualization solution is a key concern.
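To make the idea of a temporal graph concrete, here is a minimal sketch in Python (the URLs, timestamps, and data layout are invented for illustration; this is not LinkGate's actual data model):

```python
# Sketch of a temporal web graph: each node (a URL) maps capture
# timestamps to the set of URLs it linked to at that capture time.
# All URLs and timestamps below are made up for illustration.
temporal_graph = {
    "example.org/": {
        "20200115": {"example.org/about", "other.org/"},
        "20210115": {"example.org/about"},  # link to other.org disappeared
    },
    "example.org/about": {
        "20200115": {"example.org/"},
    },
}

def outlinks_at(graph, url, timestamp):
    """Return the outlinks of `url` in the capture closest to
    (but not after) `timestamp`, or an empty set if none exists."""
    versions = graph.get(url, {})
    eligible = [t for t in versions if t <= timestamp]
    return versions[max(eligible)] if eligible else set()
```

Each node having multiple timestamped versions is what distinguishes a web archive graph from a snapshot of the live web, and it is also what multiplies the data volume.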
APIs and use cases
We developed a scalable graph data service that exposes temporal graph data via an API, a data collection tool for feeding interlinking data extracted from web archive data files into the data service, and a web-based frontend for visualizing web archive graph data streamed by the data service. Because this project was first conceived to fulfill a research need, we reached out to the web archive community and interviewed researchers to identify use cases to guide development beyond core functionality. Source code for the three software components (link-serv, link-indexer, and link-viz, respectively), together with the use cases, is openly available on GitHub.
An instance of LinkGate is deployed on Bibliotheca Alexandrina’s infrastructure and accessible at linkgate.bibalex.org. Insertion of data into the backend data service is ongoing. The following are a few screenshots of the frontend:
Please see the project’s IIPC Discretionary Funding Program (DFP) 2020 final report for additional details.
We will be presenting the project at the upcoming IIPC Web Archiving Conference on Tuesday, 15 June 2021, and will also share the results of our work at a Research Speakers Series webinar on 28 July. If you have any questions or feedback, please contact the LinkGate team at linkgate[at]iipc.simplelists.com.
This development phase of Project LinkGate has focused on the core functionality of a scalable, modular graph visualization environment for web archive data. Our team shares a common passion for this work, and we remain committed to continuing to build up the components, including:
Design and development of the plugin API to support the implementation of add-on finders and vizors (graph exploration tools)
Integration of alternative data stores (e.g., the Solr index in SolrWayback, so that data may be served by link-serv to visualize in link-viz or Gephi)
General improvements to the software implementation.
The LinkGate team is grateful to the IIPC for providing the funding to get the project started and develop the core functionality. The team is passionate about this work and is eager to carry on with development.
Lana Alsabbagh, NLNZ, Research Use Cases
Youssef Eldakar, BA, Project Coordination
Mohammed Elfarargy, BA, Link Visualizer (link-viz) & Development Coordination
Mohamed Elsayed, BA, Link Indexer (link-indexer)
Andrea Goethals, NLNZ, Project Coordination
Amr Morad, BA, Link Service (link-serv)
Ben O’Brien, NLNZ, Research Use Cases
Amr Rizq, BA, Link Visualizer (link-viz)
Tasneem Allam, BA, link-viz development
Suzan Attia, BA, UI design
Dalia Elbadry, BA, UI design
Nada Eliba, BA, link-serv development
Mirona Gamil, BA, link-serv development
Olga Holownia, IIPC, project support
Andy Jackson, British Library, technical advice
Amged Magdey, BA, logo design
Liquaa Mahmoud, BA, logo design
Alex Osborne, National Library of Australia, technical advice
We would also like to thank the researchers who agreed to be interviewed for our Inventory of Use Cases.
Based in the Washington, DC, area, CLIR forges strategies to enhance research, teaching, and learning environments in collaboration with libraries, cultural institutions, and communities of higher learning. CLIR has a number of international affiliates, including IIIF and NDSA. These affiliations give organizations opportunities to engage meaningfully with new constituencies, and to work together toward integrating services, tools, platforms, research, and expertise across organizations in ways that will reduce costs, create greater efficiencies, and better serve our collective constituencies.
In 2017, IIPC became a CLIR Affiliate and CLIR agreed to serve as the organization’s fiscal agent, but IIPC staff were hosted by the British Library until Holownia’s move to CLIR. IIPC will remain independent, and its Steering Committee and Executive Board will continue to be responsible for setting the strategy, overseeing membership, tools development and outreach, as well as the Consortium’s key events.
“We warmly welcome Olga Holownia to the staff,” said CLIR president Charles Henry. “IIPC’s work is closely aligned with CLIR’s mission, and her presence will open new opportunities to enrich the work of both organizations.”
“We are thrilled that Olga has accepted the role of senior program officer with CLIR, after performing in a program officer role for many years through her position with the British Library,” said Abbie Grotke, IIPC Chair. “With CLIR now hosting this role in addition to other administrative host activities, the IIPC is well suited to serve its members and the broader web archiving community in the future.”
The IIPC community is encouraged to mark their calendars for CLIR’s Digital Library Federation (DLF) 2021 Forum, November 1-3. The annual Forum is a meeting place, marketplace, and congress for digital library practitioners, featuring panels, individual presentations, lightning talks, and birds of a feather sessions. The Forum program will be announced in late August, when registration opens. The 2021 Forum will be virtual and free of charge, as will its two affiliated events: Digital Preservation 2021, the annual conference of the National Digital Stewardship Alliance (NDSA) on November 4; and Learn@DLF, a workshop series, November 8-10.
At the 2016 IIPC Web Archiving Conference in Reykjavík, Ian Milligan and Matthew Weber talked about the importance of building communities around analysing web archives and bringing together interdisciplinary researchers, which is what Archives Unleashed 1.0, the first Web Archive Hackathon, hosted by University of Toronto Libraries, attempted to do. At the same conference, Nick Ruest and Ian gave a workshop on the earliest version of the Archives Unleashed Toolkit (“Hands on with Warcbase“). Five years and seven datathons later, the Archives Unleashed Project has seen major technological developments (including the Cloud version of the Toolkit, integrated with Archive-It collections), a growing community of researchers, an expanded team, new partnerships, and two major grants. At the project’s core, there is still a desire to engage the community, and the most recent initiative, which builds on the datathons, is the Cohort Program: a year-long collaboration that facilitates research engagement with web archives, with mentorship and support from the Archives Unleashed team.
In her blog post, Samantha Fritz, the Project Manager at the Archives Unleashed Project, reflects on the strategy and key milestones achieved between 2017 and 2020, as well as the new partnership with Archive-It and the plans for the next 3 years.
The web archiving world blends the work and contributions of many institutions, groups, projects, and individuals. The field is witnessing work and progress in many areas, from policies, to professional development and learning resources, to the development of tools that address replay, acquisition, and analysis.
For over two decades memory institutions and organizations around the world have engaged in web archiving to ensure the preservation of born-digital content that is vital to our understanding of post-1990s research topics. Increasingly, web archiving programs are being adopted as part of institutional activities, because there is broad recognition among librarians, archivists, scholars, and others that web archives are critical resources for stewarding our cultural heritage, and that they are vulnerable.
The National Digital Stewardship Alliance has conducted surveys to “understand the landscape of web archiving activities in the United States.” Reflecting on the most recent survey, from 2017, respondents indicated that they perceived the least progress over the preceding two years in the category of access, use, and reuse. The 2017 report notes that this could suggest “a lack of clarity about how Web archives are to be used post-capture” (Farrell et al., 2017 Report, p. 13). This finding makes sense, given that the field’s focus has largely revolved around selection, appraisal, scoping, and capture.
Ultimately, the active use of web archives by researchers, and by extension the development of tools to explore web archives, has lagged. As such, institutions and service providers, such as librarians and archivists, are tasked with figuring out how to “use” web archives.
We have petabytes of data, but we also have barriers
The amount of data captured is well into the petabyte range. Larger organizations, such as the Internet Archive, the British Library, the Bibliothèque nationale de France, Denmark’s Netarchive, the National Library of Australia’s Trove platform, and Portugal’s Arquivo.pt, have curated extensive web archive collections, yet we still don’t see mainstream or heavy use of web archives as primary sources in research. This is in part due to access and usability barriers: essentially, the technical experience needed to work with web archives, especially at scale, is beyond the reach of most scholars.
It is this barrier that offers an opportunity for discussion and work in and beyond the web archiving community. With that in mind, we turn to a reflection on the contributions of the Archives Unleashed Project toward lowering barriers to web archives.
About the Archives Unleashed Project
Archives Unleashed was established in 2017 with support from The Andrew W. Mellon Foundation. The project grew out of an earlier series of events which identified a collective need among researchers, scholars, librarians and archivists for analytics tools, community infrastructure, and accessible web archival interfaces.
In recognizing the vital role web archives play in studying topics from the 1990s forward, the team has focused on developing open-source tools to lower the barrier to working with and analyzing web archives at scale.
From 2017 to 2020, Archives Unleashed pursued a three-pronged strategy for tackling the computational woes of working with large data, and more specifically W/ARCs:
Development of the Archives Unleashed Toolkit: apply modern big data analytics infrastructure to scholarly analysis of web archives.
Deployment of the Archives Unleashed Cloud: provide a one-stop, web-based portal for scholars to ingest their Archive-It collections and execute a number of analyses with the click of a mouse.
Organization of Archives Unleashed Datathons: build a sustainable user community around our open-source software.
Milestones + Achievements
If we look at how Archives Unleashed tools have developed, we have to reach back to 2013 when Warcbase was developed. It was the forerunner to the Toolkit and was built on Hadoop and HBase as an open-source platform to support temporal browsing and large-scale analytics of web archives (Ruest et al., 2020, p. 157).
The Toolkit moves beyond the foundations of Warcbase. Our first major transition was to replace Apache HBase with Apache Spark to modernize analytical functions. In developing the Toolkit, the team leveraged the needs of users to inform two significant development choices. First, we created a Python interface with functional parity with the Scala interface. Python is widely accepted, and more commonly known, among scholars in the digital humanities who engage in computational work. From a sustainability perspective, Python is stable, open source, and ranked as one of the most popular programming languages.
Second, the Toolkit shifted from Spark’s resilient distributed datasets (RDDs), part of the Warcbase legacy, to supporting DataFrames. While this was part of the initial Toolkit roadmap, the team engaged with users to discuss the impact of alternatives to RDDs. Essentially, DataFrames offer the ability within Apache Spark to produce tabular output. This approach was unanimously accepted by the community, in large part because of familiarity with pandas, and because DataFrames made it easier to visually read the data outputs (Fritz et al., 2018, Medium post).
The Toolkit is currently at a 0.90.0 release, and while it offers powerful analytical functionality, it is still geared towards an advanced user. Recognizing that scholars often didn’t know where to start with analyzing W/ARC files, and that the command line can be intimidating, we took a cookbook approach in developing our Toolkit user documentation. With it, researchers can modify dozens of example scripts for extracting and exploring information. Our team focused on designing documentation that presents possibilities and options while guiding and supporting user learning.
The work to develop the Toolkit provided the foundations for other platforms and experimental methods of working with web archives. The second large milestone reached by the project was the launch of the Archives Unleashed Cloud.
The Archives Unleashed Cloud, largely developed by project co-investigator Nick Ruest, is an open-source platform that provides a web-based front end for users to access the most recent version of the Archives Unleashed Toolkit. A core feature of the Cloud is that it uses the Archive-It WASAPI, which means that users are directly connected to their Archive-It collections and can proceed to analyze web archives without having to spend time delving into the technical details.
Recognizing that the Toolkit, while flexible and powerful, may still be a little too advanced for some scholars, the Cloud offers a more user-friendly and familiar interface for interacting with data. Users are presented with simple dashboards that provide insights into WARC collections, downloadable derivative files, and simple in-browser visualizations.
In June 2020, marking the end of our grant, the Cloud had analyzed just under a petabyte of data and had been used by individuals from 59 unique institutions across 10 countries. The Cloud remains an open-source project, with code available through a GitHub repository. The canonical instance will be deprecated as of June 30, 2021, and migrated into Archive-It, but more on that project in a bit.
Datathons + Community Engagement
Datathons provided an opportunity to build a sustainable community around Archives Unleashed tools, scholarly discussion, and training for scholars with limited technical expertise to explore archived web content.
Adapting the hackathon model, these events saw participants from over fifty institutions in seven countries engage in a hands-on learning environment, working directly with web archive data and new analytical tools to produce creative and inventive projects that explore W/ARCs. The datathons also highlighted web archive collections from the host institutions, increasing the visibility and use of their curated collections.
In a recently published article, “Fostering Community Engagement through Datathon Events: The Archives Unleashed Experience,” we reflected on the impact that our series of datathon events had on community engagement within the web archiving field, and on the professional practices of attendees. We conducted interviews with datathon participants to learn about their experiences and complemented this with an exploration of established models from the community engagement literature. Our article culminates in contextualizing a model for community building and engagement within the Archives Unleashed Project, with potential applications for the wider digital humanities field.
Our team has also invested and participated in the wider web archival community through additional scholarly activities, such as institutional collaborations, conferences, and meetings. We recognize that these activities bring together many perspectives, and have been a great opportunity to listen to the needs of users and engage in conversations that impact adjacent disciplines and communities.
1. It takes a community
If there is one main takeaway we’ve learned as a team, and that all our activities point to, it’s that projects can’t live in silos! Be they digital humanities, digital libraries, or any other discipline, projects need communities to function, survive, and thrive.
We’ve been fortunate and grateful to have been able to connect with various existing groups including being welcomed by the web archiving and digital humanities communities. Community development takes time and focused efforts, but it is certainly worthwhile! Ask yourself, if you don’t have a community, who are you building your tools, services, or platforms for? Who will engage with your work?
We have approached community building through a variety of avenues. First and foremost, we have developed relationships with people and organizations. This is clearly highlighted through our institutional collaborations in hosting datathon events, but we’ve also used platforms like Slack and Twitter to support discussion and connection opportunities among individuals. For instance, in creating both general and specific Slack channels, new users are able to connect with the project team and user community to share information and resources, ask for help, and engage in broader conversations on methods, tools, and data.
Regardless of platform, successful community building relies on authentic interactions and an acknowledgment that each user brings unique perspectives and experiences to the group. In many cases we have connected with users who are new either to the field or to methods of analyzing web archives. This perspective has helped inform an empathetic approach to the way we create learning materials, deliver reports and presentations, and share resources.
2. Interdisciplinary teams are important
So often we see projects and initiatives that highlight an interdisciplinary environment – and we’ve found it to be an important part of why our project has been successful.
Each of our project investigators personifies a group of users that the Archives Unleashed Project aims to support, all of whom converge around data, more specifically WARCs or web archive data. We have a historian who is broadly representative of digital humanists and researchers who analyze and explore web archives; a librarian who represents the curators and service providers of web archives; and a computer scientist who represents tool builders.
A key strength of our team has been to look at the same problem from different perspectives, allowing each member to apply their unique skills and experiences in different ways. This has been especially valuable in developing underlying systems, processes and structures which now make up the Toolkit. For instance, triaging technical components offered a chance for team members to apply their unique skill sets, which often assisted in navigating issues and roadblocks.
We also recognized that each sector has its own language and jargon that can be jarring to new users. In identifying the wide range of technical skills within our team, we leveraged (and valued) those “I have no idea what this means / what this does” moments. If such statements were made by team members or close collaborators, chances are they would carry through to our user community.
Ultimately, the interdisciplinary nature and the wide range of technical expertise within our team helped us to see and think like our users.
3. Sustainability planning is really hard
Sustainability has been part question, part riddle. This is the case for many digital humanities projects. These sustainability questions speak to the long term lifecycle of the project, and our primary goal has always been to ensure a project’s survival and continued efforts once the grant cycle has ended.
As such, the Archives Unleashed team has developed tools and platforms with sustainability in mind, specifically by adopting stable, widely used programming languages and best practices. We’ve also been committed to ensuring that all our platforms and tools are developed in the spirit of open access and are available in public GitHub repositories.
One overarching question remained as our project entered its final stages in the spring of 2020: how will the Toolkit live on? Three years of development and use cases demonstrated not only the need for and adoption of the tools created under the Archives Unleashed Project, but also that there are currently no simplified processes that could adequately replace them.
Where we are headed (2020-2023)
Our team was awarded a second grant from The Andrew W. Mellon Foundation, which started in 2020 and will secure the future of Archives Unleashed. The goal of this second phase is to integrate the Cloud with Archive-It so that the tool can succeed in a sustainable, long-term environment. The collaboration between Archives Unleashed and Archive-It also aims to continue to widen and enhance the accessibility and usability of web archives.
Priorities of the Project
First, we will merge the Archives Unleashed analytical tools with the Internet Archive’s Archive-It service to provide an end-to-end process for collecting and studying web archives. This will be completed in three stages:
Build. Our team will be setting up the physical infrastructure and computing environment needed to kick start the project. We will be purchasing dedicated infrastructure with the Internet Archive.
Integrate. Here we will be migrating the back end of the Archives Unleashed Cloud to Archive-It, paying attention to how the Cloud can scale to work within its new infrastructure. This stage will also see the development of a new user interface that will provide a basic set of derivatives to users.
Enhance. The team will incorporate consultation with users to develop an expanded and enhanced set of derivatives and implement new features.
Second, we will engage the community by facilitating opportunities to support web archives research and scholarly outputs. Building on our earlier successful datathons, we will be launching the Archives Unleashed Cohort program to engage with and support web archives research. The Cohorts will see research teams participate in year-long intensive collaborations and receive mentorship from Archives Unleashed, with the intention of producing a full-length manuscript.
We’ve made tremendous progress as the close of our first year comes into sight. Our major milestone will be to complete the migration of the Archives Unleashed Cloud/Toolkit over to Archive-It. Users will soon see a beta release of the new interface for conducting analysis with their web archive collections, including the ability to download over a dozen derivatives for further analysis and to access simple in-browser visualizations.
Our team looks forward to the road ahead, and would like to express our appreciation for the support and enthusiasm Archives Unleashed has received!
Farrell, M., McCain, E., Praetzellis, M., Thomas, G., and Walker, P. 2018. Web Archiving in the United States: A 2017 Survey. National Digital Stewardship Alliance Report. DOI 10.17605/OSF.IO/3QH6N
Ruest, N., Lin, J., Milligan, I., and Fritz, S. 2020. The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL ’20). Association for Computing Machinery, New York, NY, USA, 157–166. DOI: https://doi.org/10.1145/3383583.3398513
Lin, J., Milligan, I., Wiebe, J., and Zhou, A. 2017. Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives. J. Comput. Cult. Herit. 10, 4, Article 22 (October 2017), 30 pages. DOI: https://doi.org/10.1145/3097570
One thing I quickly noticed as I read through the guide is that it recommends using OutbackCDX as a backend for PyWb, rather than continuing to rely on flat-file, sorted CDXs. PyWb does support flat CDXs, as long as they are in the 9- or 11-column format, but a convincing argument is made that using OutbackCDX for resolving URLs is preferable, whether you use PyWb or OpenWayback.
What is OutbackCDX?
OutbackCDX is a tool created by Alex Osborne, Web Archive Technical Lead at the National Library of Australia. It handles the fundamental task of indexing the contents of web archives: mapping URLs to contents in WARC files.
A “traditional” CDX file (or set of files) accomplishes this by listing each and every URL, in order, in a simple text file, along with information such as which WARC file they are stored in. This has the benefit of simplicity and can be managed using simple GNU tools, such as sort. Plain CDXs, however, make inefficient use of disk space, and as they get larger, they become increasingly difficult to update, because inserting even a small amount of data into the middle of a large file requires rewriting a large part of the file.
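To make the trade-off concrete, here is a small Python sketch (the CDX lines are made up and simplified to three fields; real lines carry 9 or 11) of how a sorted flat CDX supports fast lookups via binary search, while insertion means splicing into the middle of the file:

```python
import bisect

# A tiny sorted "flat CDX": each line is "<url-key> <timestamp> <warc-file>".
# Real CDX lines have 9 or 11 space-separated fields; these are simplified.
cdx_lines = [
    "org,example)/ 20200101000000 crawl-001.warc.gz",
    "org,example)/about 20200101000000 crawl-001.warc.gz",
    "org,example)/contact 20210101000000 crawl-042.warc.gz",
]

def lookup(lines, url_key):
    """Binary-search the sorted CDX for all entries of one URL key."""
    i = bisect.bisect_left(lines, url_key + " ")
    out = []
    while i < len(lines) and lines[i].startswith(url_key + " "):
        out.append(lines[i])
        i += 1
    return out

# Adding a new capture means inserting into the middle; on disk this
# forces a rewrite of everything after the insertion point, which is
# exactly what hurts once the file is hundreds of gigabytes.
bisect.insort(cdx_lines, "org,example)/about 20210101000000 crawl-042.warc.gz")
```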
OutbackCDX improves on this by using a simple but powerful key-value store, RocksDB. The URLs are the keys, and the remaining info from the CDX is the stored value. RocksDB then does the heavy lifting of storing the data efficiently and providing speedy lookups and updates. Notably, OutbackCDX enables updates to the index without any disruption to the service.
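The key-value approach can be sketched with Python's standard-library `dbm` module standing in for RocksDB (the keys, values, and file names are invented for illustration):

```python
import dbm
import os
import tempfile

# Stand-in for RocksDB: a stdlib key-value store. Keys are
# "<url-key> <timestamp>"; values are the remaining CDX fields.
path = os.path.join(tempfile.mkdtemp(), "index")
with dbm.open(path, "c") as index:
    index[b"org,example)/ 20200101000000"] = b"text/html 200 crawl-001.warc.gz"
    # An update is just another key put -- no rewriting of a big sorted
    # file, which is why the index can grow without service disruption.
    index[b"org,example)/ 20210101000000"] = b"text/html 200 crawl-042.warc.gz"

with dbm.open(path, "r") as index:
    # Last field of the stored value points back to the WARC file.
    warc = index[b"org,example)/ 20210101000000"].split()[-1]
```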
Given all this, transitioning to OutbackCDX for PyWb makes sense. But OutbackCDX also works with OpenWayback. If you aren’t quite ready to move to PyWb, adopting OutbackCDX first can serve as a stepping stone. It offers enough benefits all on its own to be worth it. And, once in place, it is fairly trivial to have it serve as a backend for both OpenWayback and PyWb at the same time.
So, this is what I decided to do. Our web archive, Vefsafn.is, has been running on OpenWayback with a flat file CDX index for a very long time. The index has grown to 4 billion URLs and takes up around 1.4 terabytes of disk space. Time for an upgrade.
Of course, there were a few bumps on that road, but more on that later.
Installing OutbackCDX was entirely trivial. You get the latest release JAR, run it like any standalone Java application, and it just works. It takes a few parameters to determine where the index should be, which port it should listen on, and so forth, but configuration really is minimal.
Unlike OpenWayback, OutbackCDX is not installed into a servlet container like Tomcat, but instead (like Heritrix) comes with its own built-in web server. End users do not need access to this, so it may be advisable to configure it to be accessible internally only.
Building the Index
Once it is running, you’ll need to feed your existing CDXs into it. OutbackCDX can ingest most commonly used CDX formats, certainly all that PyWb can read. CDX files can simply be “posted” to OutbackCDX using a command-line tool like curl.
In our environment, we keep a gzipped CDX for each (W)ARC file, in addition to the merged, searchable CDX that powered OpenWayback. I initially wrote a script that looped through the whole batch and posted them one at a time. I noticed, though, that the number of URLs ingested per second was much higher for CDXs that contained a lot of URLs; there is an overhead to each post. On the other hand, you can’t just post your entire mega-CDX in one go, as OutbackCDX will run out of memory.
Ultimately, I wrote a script that posted about 5 MB of my compressed CDXs at a time. Using it, I was able to add all ~4 billion URLs in our collection to OutbackCDX in about two days. I should note that our OutbackCDX is on high-performance SSDs, the same as our regular CDX files have used.
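The batching approach can be sketched like this (a hedged sketch, not my actual script: the endpoint URL and index name are hypothetical, and the demo below exercises only the grouping logic, not the HTTP posting):

```python
import os
import tempfile
import urllib.request

def batches(paths, limit=5 * 1024 * 1024):
    """Group CDX files into batches of roughly `limit` bytes, so each
    POST is neither tiny (per-request overhead dominates) nor so large
    that the server runs out of memory."""
    batch, size = [], 0
    for p in paths:
        batch.append(p)
        size += os.path.getsize(p)
        if size >= limit:
            yield batch
            batch, size = [], 0
    if batch:
        yield batch

def post_batch(batch, endpoint="http://localhost:8080/myindex"):
    """Concatenate one batch and POST it to OutbackCDX. The endpoint
    host, port, and index name here are placeholders."""
    body = b"".join(open(p, "rb").read() for p in batch)
    urllib.request.urlopen(urllib.request.Request(endpoint, data=body))

# Demo of the grouping logic only (no server needed): three 2-byte
# files with a 4-byte limit split into batches of two files and one.
d = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(d, f"{i}.cdx.gz")
    open(p, "wb").write(b"xx")
    paths.append(p)
groups = list(batches(paths, limit=4))
```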
Next up was configuring our OpenWayback instance to use OutbackCDX. This proved easy to do, but turned up some issues with OutbackCDX. First, the configuration.
OpenWayback has a module called ‘RemoteResourceIndex’. This can be trivially enabled in the wayback.xml configuration file. Simply replace the existing `resourceIndex` with something like:
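For instance (the host, port, and index name `myindex` are placeholder values in this sketch; substitute your own OutbackCDX endpoint):

```xml
<bean id="resourceIndex" class="org.archive.wayback.resourceindex.RemoteResourceIndex">
    <property name="searchUrlBase" value="http://localhost:8080/myindex" />
</bean>
```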
And OpenWayback will use OutbackCDX to resolve URLs. Easy as that.
This is, of course, where I started running into those bumps I mentioned earlier. It turns out there were a number of edge cases where OutbackCDX and OpenWayback had different ideas. Luckily, Alex, the aforementioned creator of OutbackCDX, was happy to help resolve them. Thanks again, Alex.
The first issue I encountered was due to the age of some of our ARCs. The date fields had variable precision: rather than all being exactly 14 digits long, some had less precision and were only 10-12 characters long. This was resolved by having OutbackCDX pad those shorter dates with zeros.
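The padding behaviour can be illustrated in a couple of lines (a sketch of the idea, not OutbackCDX's actual code):

```python
def pad_timestamp(ts, width=14):
    """Pad a variable-precision ARC/WARC date (e.g. 10-12 digits) with
    trailing zeros so all timestamps compare as full 14-digit dates
    (YYYYMMDDhhmmss)."""
    return ts.ljust(width, "0")
```

A 10-digit date like `2005031704` then sorts and compares consistently alongside full 14-digit captures.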
I also discovered some inconsistencies in the metadata supplied along with the query results. OpenWayback expected some fields that were either missing or misnamed. These were a little trickier, as they only affected some aspects of OpenWayback, most notably the metadata in the banner inserted at the top of each page. All of this has been resolved.
Lastly, I ran into an issue related not to OpenWayback but to PyWb. It stemmed from the fact that my CDXs are not generated in the 11-column CDX format, which includes the compressed size of the WARC record holding the resource. OutbackCDX was recording this value as 0 when absent. Unfortunately, PyWb didn’t like this and would fail to load such resources. Again, Alex helped me resolve this.
OutbackCDX 0.9.1 is now the most recent release, and includes the fixes to all the issues I encountered.
Having gone through all of this, I feel fairly confident that swapping in OutbackCDX to replace a ‘regular’ CDX index for OpenWayback is very doable for most installations. And the benefits are considerable.
The size of the OutbackCDX index on disk ended up being about 270 GB. As noted before, the existing CDX index powering our OpenWayback was 1.4 TB, a reduction of more than 80%. OpenWayback also feels notably snappier after the upgrade, and updates are notably easier.
Next, we will be looking at replacing OpenWayback with PyWb. I’ll write more about that later, once we’ve made more progress, but I will say that having PyWb run on the same OutbackCDX proved trivial to accomplish, and we now have a beta website up using PyWb at http://beta.vefsafn.is.
In this blog post I will go into the more technical details of SolrWayback and the new version 4.0 release. The whole frontend GUI was rewritten from scratch to meet the expectations of 2020 web applications, and many new features were implemented in the backend. I recommend reading the frontend blog post first; it has beautiful animated GIFs demonstrating most of the features in SolrWayback.
Live demo of SolrWayback
You can access a live demo of SolrWayback here. Thanks to National Széchényi Library of Hungary for providing the SolrWayback demo site!
Back in 2018…
The open source SolrWayback project was created in 2018 as an alternative to the existing netarchive frontend applications at that time. At the Royal Danish Library we were already using Blacklight as a search frontend. Blacklight is an all-purpose Solr frontend application and is very easy to configure and install by defining a few properties, such as the Solr server URL, fields, and facet fields. But since Blacklight is a generic Solr frontend, it had no special handling of the rich data structure we had in Solr. Also, binary data such as images and videos are not in Solr, so integration with the WARC-file repository can enrich the experience and make playback possible, since Solr has enough information to work as a CDX server as well.
WARC-Indexer. Where the magic happens!
WARC files are indexed into Solr using the WARC-Indexer. The WARC-Indexer reads every WARC record, extracts all kinds of information and splits it into up to 60 different fields. It uses Tika to parse the many different MIME types that can be encountered in WARC files. Tika extracts the text from HTML, PDF, Excel, Word documents etc. It also extracts metadata from binary documents if present. The metadata can include created/modified time, title, description, author etc. For images, it can also extract width/height or EXIF information such as latitude/longitude. The binary data themselves are not stored in Solr, but for every record in the WARC file there is a record in Solr. This also includes records without a payload, such as HTTP 302 redirects, which carry information about the new URL.
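To illustrate the shape of this mapping, here is a minimal sketch of turning one parsed WARC record into a flat Solr document. The field names and the plain-dict record are illustrative assumptions; the real WARC-Indexer runs Tika on the payload and emits many more fields:

```python
from urllib.parse import urlparse

def record_to_solr_doc(record):
    """Map a parsed WARC record (a plain dict in this sketch) to a flat
    Solr document. Field names here are illustrative, not the indexer's."""
    url = record["url"]
    doc = {
        "url": url,
        "domain": urlparse(url).hostname,
        "crawl_date": record["warc_date"],           # from the WARC-Date header
        "content_type": record.get("content_type"),  # from the HTTP header
        "status_code": record.get("status_code"),
    }
    # Records without a payload (e.g. an HTTP 302 redirect) still become
    # Solr documents, carrying the redirect target instead of text content.
    if record.get("status_code") == 302:
        doc["redirect_to"] = record.get("location")
    else:
        doc["content_text"] = record.get("text", "")
    return doc

doc = record_to_solr_doc({
    "url": "http://example.org/page",
    "warc_date": "2020-01-15T12:00:00Z",
    "content_type": "text/html",
    "status_code": 200,
    "text": "Hello archive",
})
print(doc["domain"])  # example.org
```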
WARC-Indexer. Paying the price up front…
Indexing a large amount of WARC files requires massive amounts of CPU, but the work is easily parallelized since the WARC-Indexer takes a single WARC file as input. To give an idea of the requirements: indexing 700 TB of WARC files (5.5M files) took 3 months using 280 CPUs. Once the existing collection is indexed, it is easier to keep up with the incremental growth of the collection. So this is the drawback of using SolrWayback on large collections: the WARC files have to be indexed first.
Solr provides multiple ways of aggregating data, moving common netarchive statistics tasks from slow batch processing to interactive requests. Based on input from researchers, the feature set is continuously expanding with aggregation, visualization and extraction of data.
Due to the amazing performance of Solr, a query is often completed in less than 2 seconds in a collection with 32 billion (32×10⁹) documents, and this includes facets. The search results are not limited to HTML pages where the free text is found, but include every document that matches the search query. When presenting the results, each document type has a custom display for its MIME type.
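A faceted free-text query like this goes through Solr's standard select API. The sketch below only assembles the request parameters; the collection name is omitted and the field names (`content_text`, `domain`, etc.) are assumptions for illustration:

```python
from urllib.parse import urlencode

def build_facet_query(text, facet_fields, rows=20):
    """Build the parameter string for a faceted Solr /select request.

    A single request can return both the matching documents and the facet
    counts; SolrWayback instead issues the result query and the facet query
    as separate simultaneous calls so results can render before facets.
    """
    params = [
        ("q", f'content_text:"{text}"'),
        ("rows", rows),
        ("facet", "true"),
    ]
    params += [("facet.field", f) for f in facet_fields]
    return urlencode(params)

qs = build_facet_query("climate", ["domain", "content_type_norm", "crawl_year"])
print(qs)
```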
HTML results are enriched with thumbnail images from the page as part of the result, images are shown directly, and audio and video files can be played directly from the result list with an in-browser player, or downloaded if the browser does not support the format.
Solr. Reaping the benefits from the WARC-indexer
The SolrWayback Java backend offers a lot more than just sending queries to Solr and returning the results to the frontend. Methods can aggregate data from multiple Solr queries or directly read WARC entries and return the processed data in a simple format to the frontend. Instead of re-parsing the WARC files, which is a very tedious task, the information can be retrieved from Solr, and the task can be done in seconds or minutes instead of weeks.
Generating a wordcloud image is done by extracting the text from 1,000 random HTML pages from the domain and generating a wordcloud from the extracted text.
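Once the sampled text is in hand, the wordcloud step reduces to counting word frequencies. A minimal sketch (the random sampling via Solr is not shown, and the stopword list is a placeholder):

```python
import re
from collections import Counter

def top_words(texts, n=50, stopwords=frozenset({"the", "a", "and", "of"})):
    """Count word frequencies across extracted page texts for a wordcloud."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-zA-Z]{2,}", text.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts.most_common(n)

pages = ["The archive of the web", "web archive crawls the web"]
print(top_words(pages, n=2))  # [('web', 3), ('archive', 2)]
```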
By extracting the domains that link to a given domain (A) and also extracting the outgoing links from that domain (A), you can build a link graph. Repeating this for each newly found domain gives you a two-level local link graph for domain (A). Even though this can require hundreds of separate Solr queries, it is still done in seconds on a large corpus. Clicking a domain will highlight its neighbors in the graph (try demo: interactive linkgraph).
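The two-level expansion can be sketched as follows, with a stub function standing in for the per-domain Solr queries (in SolrWayback each domain needs queries for both inlinks and outlinks; this sketch only models neighbor sets):

```python
def build_local_linkgraph(start, links_for):
    """Build a two-level local link graph around `start`.

    `links_for(domain)` stands in for the Solr queries for one domain;
    here it simply returns the set of neighboring domains.
    """
    edges = set()
    level1 = links_for(start)
    for d in level1:
        edges.add((start, d))
    for d in list(level1):          # expand each newly found domain once
        for e in links_for(d):
            edges.add((d, e))
    return edges

# Stub link data standing in for Solr query results.
graph = {"a.dk": {"b.dk", "c.dk"}, "b.dk": {"c.dk"}, "c.dk": {"a.dk"}}
edges = build_local_linkgraph("a.dk", lambda d: graph.get(d, set()))
print(sorted(edges))
```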
Large scale linkgraph
Extraction of massive linkgraphs with up to 500K domains can be done in hours.
The exported link-graph data was rendered in Gephi and made zoomable and interactive using Graph presenter. The link graphs can be exported quickly because all links (a href) for each HTML record are extracted and indexed as part of the corresponding Solr document.
Freetext search can be used to find HTML documents. The HTML documents in Solr are already enriched with the image links on each page, without having to parse the HTML again. Instead of showing the HTML pages, SolrWayback collects all the images from the pages and shows them in a Google-like image search result. Under the assumption that the text on an HTML page relates to its images, you can find images that match the query. If you search for “Cats” in the HTML pages, the results will most likely show pictures of cats. These pictures could not have been found by searching the image documents alone unless their metadata (or image name) contained “Cats”.
CSV Stream export
You can export result sets with millions of documents to a CSV file. Instead of exporting all 60 possible Solr fields for each result, you can pick exactly which fields to export. This CSV export has already been used by several researchers at the Royal Danish Library and gives them the opportunity to use other tools, such as RStudio, to perform analysis on the data. The National Széchényi Library demo site has disabled CSV export in the SolrWayback configuration, so it cannot be tested live.
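Streaming only the chosen fields to CSV can be sketched with the standard csv module. The field names and documents below are illustrative; a real export would page through Solr rather than hold results in memory:

```python
import csv
import io

def export_csv(docs, fields):
    """Stream Solr documents to CSV, writing only the chosen fields."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for doc in docs:   # in practice, page through Solr instead of a list
        writer.writerow(doc)
    return out.getvalue()

docs = [
    {"url": "http://example.org/", "crawl_year": 2020, "domain": "example.org"},
    {"url": "http://example.org/a", "crawl_year": 2021, "domain": "example.org"},
]
print(export_csv(docs, ["url", "crawl_year"]))
```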
WARC corpus extraction
Besides CSV export, you can also export a result set to a WARC file. The export will read the WARC entry for each document in the result set, copy the WARC header + HTTP header + payload, and create a new WARC file with all results combined.
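Since each Solr document records the source WARC file and byte offset of its record, the export essentially reduces to copying raw byte ranges into one combined file. A minimal sketch, using a fake file in place of a real WARC:

```python
import os
import tempfile

def export_subcorpus(hits, out_path):
    """Copy the raw record bytes for each hit into one combined file.

    Each hit is (source_file, offset, length), as would be stored in Solr.
    """
    with open(out_path, "wb") as out:
        for source_file, offset, length in hits:
            with open(source_file, "rb") as src:
                src.seek(offset)
                out.write(src.read(length))

# Demo: a fake "WARC" file holding two records of known offset/length.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "part-1.warc")
with open(src, "wb") as f:
    f.write(b"record-one\nrecord-two\n")
out = os.path.join(tmp, "subcorpus.warc")
export_subcorpus([(src, 0, 11), (src, 11, 11)], out)
print(open(out, "rb").read())  # b'record-one\nrecord-two\n'
```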
Extracting a sub-corpus is this easy, and it has already proven to be extremely useful for researchers. Examples include extraction of a domain for a given date range, or a query restricted to a list of defined domains. This export is a one-to-one mapping from the results in Solr to the entries in the WARC files.
SolrWayback can also perform an extended WARC export which includes all resources (JS/CSS/images) for every HTML page in the export. The extended export ensures that playback will also work for the sub-corpus. Since the exported WARC file can become very large, you can use a WARC splitter tool or simply split the export into smaller batches by adding crawl year/month to the query. The National Széchényi Library demo site has disabled WARC export in the SolrWayback configuration, so it cannot be tested live.
SolrWayback playback engine
SolrWayback has a built-in playback engine, but it is optional, and SolrWayback can be configured to use any other playback engine that uses the same URL API for playback, “/server/<date>/<url>”, such as PyWb. It has been a common misunderstanding that SolrWayback forces you to use the SolrWayback playback engine. The demo at the National Széchényi Library has PyWb configured as an alternative playback engine. Clicking the icon next to the title of an HTML result will open playback in PyWb instead of SolrWayback.
The SolrWayback playback has been designed to be as authentic as possible, without a fixed toolbar at the top of the browser. Only a small overlay is included in the top left corner, which can be removed with a click, so that you see the page as it was harvested. From the playback overlay you can open the calendar and an overview of the resources included by the HTML page, along with their timestamps compared to the main HTML page, similar to the feature provided by the archive.org playback engine.
The URL replacement is done up front and fully resolved to an exact WARC file and offset. An HTML page can have hundreds of different resources, and each of them requires a URL lookup for the version nearest to the crawl time of the HTML page. All resource lookups for a single HTML page are batched as a single Solr query, which improves both performance and scalability.
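Picking the capture nearest the page's crawl time can be sketched like this; the pre-fetched `candidates` dict stands in for the single batched Solr query covering all resources on the page (the data shapes are assumptions for illustration):

```python
from datetime import datetime

def nearest_capture(candidates, page_time):
    """For each resource URL, pick the capture closest to the page's crawl time.

    `candidates` maps url -> list of (timestamp, warc_file, offset), as a
    single batched Solr query for all of a page's resources would return.
    """
    resolved = {}
    for url, captures in candidates.items():
        resolved[url] = min(
            captures, key=lambda c: abs((c[0] - page_time).total_seconds())
        )
    return resolved

page_time = datetime(2020, 6, 1)
candidates = {
    "http://example.org/style.css": [
        (datetime(2020, 1, 1), "a.warc", 100),
        (datetime(2020, 7, 1), "b.warc", 200),
    ],
}
print(nearest_capture(candidates, page_time))
```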
SolrWayback and Scalability
For scalability, it all comes down to the scalability of SolrCloud, which has proven without a doubt to be one of the leading search technologies and is still rapidly improving with each new version. Storing the indexes on SSD gives a substantial performance boost as well, but can be costly. The Danish Netarchive has 126 Solr services running in a SolrCloud setup.
One of the servers is the master and the only one that receives requests. The Solr master has an empty index but is responsible for gathering the data from the other Solr services; if the master server also had an index, there would be an overhead. 112 of the Solr servers have a 900 GB index with an average of ~300M documents, while the last 13 servers currently have an empty index, which makes expanding the collection easy without any configuration changes. Even with 32 billion documents, the query response times are under 2 seconds. The result query and the facet query are separate simultaneous calls, with the advantage that the results can be rendered very fast while the facets finish loading later.
For very large results in the billions, the facets can take 10 seconds or more, but such queries are not realistic and the user should be more precise in limiting the results up front.
Building new shards
Building new shards (collection pieces) is done outside the production environment, and a shard is moved onto one of the empty Solr servers when its index reaches ~900 GB. The index is optimized before it is moved, since no more data will be written to it that would undo the optimization. This also gives a small performance improvement in query times. If the indexing were done directly into the production index, it would also impact response times. The separation of the production and building environments has spared us from dealing with complex problems we would otherwise have faced. It also makes speeding up the index building trivial: assign more machines/CPUs to the task and create multiple indexes at once.
You cannot keep indexing into the same shard forever, as this would cause other problems. We found the sweet spot at the time to be an index size of ~900 GB, which could fit on the 932 GB SSDs that were available to us when the servers were built. A larger index also requires more memory on each Solr server; we have allocated 8 GB of memory to each. For our large-scale netarchive, we keep track of which WARC files have been indexed using Archon and Arctika.
Archon is the central server with a database; it keeps track of all WARC files, whether each has been indexed, and into which shard.
Arctika is a small workflow application that starts WARC-Indexer jobs; it queries Archon for the next WARC file to process and reports back when the file has been completed.
SolrWayback – framework
SolrWayback is a single Java web application containing both the Vue frontend and the Java backend. The backend has two REST service interfaces written with JAX-RS: one is responsible for services called by the Vue frontend, and the other handles playback logic.
SolrWayback software bundle
SolrWayback comes with an out-of-the-box bundle release. The release contains a Tomcat server with SolrWayback, a Solr server, and a workflow for indexing. All components are pre-configured. All that is required is unzipping the zip file and copying the two property files to your home directory. Add some WARC files and start the indexing job.
By Abbie Grotke, Assistant Head, Digital Content Management Section
(Web Archiving Program), Library of Congress
and the IIPC Chair 2021-2022
Hello IIPC community!
I am thrilled to be the Chair of the IIPC in 2021. I’ve been involved in this organization since the very early days, so much so that somewhere buried in my folders in my office (which I have not been in for almost a year), are meeting notes from the very first discussions that led to the IIPC being formed back in 2003. Involvement in IIPC has been incredibly rewarding personally, and for our institution and all of our team members who have had the chance to interact with the community through working groups, projects, events, and informal discussions.
This year brings changes, challenges and opportunities for our community. Particularly during a time when many of us are isolated and working from home, both documenting web content about the pandemic and living it at the same time, connections to my friends and colleagues around the world seem more important than ever.
Here are a few key things to highlight for the coming year:
A Big Year for Organisation, Governance, and Strategic Planning Change
As a result of the fine work of the Strategic Direction Group led by Hansueli Locher of Swiss National Library, the IIPC has a new Consortium Agreement for 2021-2025! This document is renewed every 4-5 years, and this time some key changes were made to strengthen our ability to manage the Consortium more efficiently and to reflect the organisational changes that have taken place since 2016. Feedback from IIPC members was used to create the new agreement, and you’ll notice a slight update of objectives, which now acknowledge the importance of collaborative collections and research. Many thanks to the Strategic Direction Group (Emmanuelle Bermès of the BnF, Steve Knight of the National Library of New Zealand, Hansueli Locher, Alex Thurman of the Columbia University Libraries, and IIPC Programme and Communications Officer) for their work on this and continued engagement.
Executive Board and the Steering Committee’s terms
The new agreement establishes a new Executive Board composed of the Chair, the Vice-Chair, the Treasurer and our new senior staff member, as well as additional members of the SC appointed as needed. While the Steering Committee is responsible for setting out the strategic direction for our organisation for the next 5 years, one of our key tasks for this year is to convert it into an Action Plan.
The new Consortium Agreement aligns the terms of the Steering Committee members and the Executive Board. What this means in practice is that the SC members’ 3-year term will start on January 1 and not June 1. We will open a call for nominations to serve on the SC during our next General Assembly, but if you are interested in nominating your institution, you can contact the PCO.
For more information about the responsibilities of the new Executive Board please review section 2.5 of the new Consortium Agreement.
Our ability to have and compensate our Administrative and Financial Host has been formalized in the new agreement. We are excited to collaborate more with the Council on Library and Information Resources (CLIR) this year through this arrangement, particularly in setting up some new staffing arrangements for us. More on this will be announced in the coming months.
One of our big tasks in 2021 will be working on the Strategic Plan. This work is led by the Strategic Direction Group, with input from the Steering Committee, Working Groups, and Portfolio Leads. Since this work is one of our most important activities for the year, Hansueli has joined the Executive Board to ensure close collaboration and support for the initiative.
Missing Your IIPC Colleagues? Join our Virtual Events!
As anyone who has attended an IIPC event in person knows, it is one of the best parts about being a member. In my case, interacting with colleagues from around the world who have similar challenges, experiences, and new and exciting insights has been great for my own professional growth, and has only helped the Library of Congress web archiving program be more successful. While it’s sad that we cannot travel and meet in person together right now, there are opportunities to continue to connect virtually and to engage others in our institutions who may not have been able to travel to the in-person meetings. We’re already working on developing a more robust calendar of events for members (and some that will be more widely open to non-members).
Beyond the GA and WAC, due to the success of the well-received and well-attended webinars and calls with members in 2020, we will continue to deliver those over the course of the year. We are also working on additional training events and continuing report-outs of technical projects and member updates. Stay tuned for more soon and check our events page for updates!
We are also working with Webrecorder on the pywb transition support for members. The migration guide, with inputs from the IIPC Members, is already available and the work continues on the next stages of the project. Look for more updates on these projects through our events and blog posts throughout the year. There will also be an opportunity in 2021 for more projects to be funded, so we encourage members to start thinking about other projects that could use support and that would benefit the community.
So, it’s finally here! SolrWayback 4.0 was released December 20th, after an intense development period. In this blog post, we’ll give you a nice little overview of the changes we made, some of the improvements and some of the added functionality that we’re very proud of having released. So let’s dig in!
A small intro – What is SolrWayback really?
As the name implies, SolrWayback is a fusion of discovery (Solr) and playback (Wayback) functionality. Besides full-text search, Solr provides multiple ways of aggregating data, moving common net archive statistics tasks from slow batch processing to interactive requests. Based on input from researchers the feature set is continuously expanding with aggregation, visualization and extraction of data.
SolrWayback relies on real time access to WARC files and a Solr index populated by the UKWA webarchive-discovery tool. The basic workflow is:
Amass a collection of WARCs (using Heritrix, wget, ArchiveIT…) and put them on live storage
Analyze and process the WARCs using webarchive-discovery. Depending on the amount of WARCs, this can be a fairly heavy job: Processing ½ petabyte of WARCs at the Royal Danish Library took 40+ CPU-years
Index the result from webarchive-discovery into Solr. For non-small collections, this means SolrCloud and Solid State Drives. A rule of thumb is that the index takes up about 5-10% of the size of the compressed WARCs
Connect SolrWayback to the WARC storage and the Solr index
We decided to give the SolrWayback a complete makeover, making the interface more coherent, the design more stylish, and the information architecture better structured. At first glance, not much has changed apart from an update on the color scheme, but looking closely, we’ve added some new functionality, and grouped some of the existing features in a new, and much improved, way.
The search page is still the same, and after searching, you’ll still see all the results lined in a nice single column. We’ve added some more functionality up front, giving you the opportunity to see the WARC header for a single post, as well as selecting an alternative playback engine for the post. Some of the more noticeable reworks and optimizations are highlighted in the section below.
We’ve done some work under the hood too, to make the application run faster. Many of our calls to the backend have been reworked into individual calls, only requested when needed. This means that facet calls are now made as a separate call to the backend instead of being bundled with every query. So when you’re paging through results, we only request the results, giving a faster response, since the facets stay the same. The same principle has been applied to loading images and individual post data.
As mentioned, we’ve done some cleanup in the interface, making it easier to navigate. The search field has been reworked to serve many needs. It will expand if the query is line separated (do so with SHIFT+Enter), making large and complex queries much easier to manage. We’ve even added context-sensitive help, so if you’re making queries with boolean operators or similar, SolrWayback will tell you whether the syntax is correct.
We’ve kept the most used features upfront, with image and URL search readily available from the get go. The same goes for the option to group the search results to avoid URL duplicates.
Below the line are some of the other features not directly linked to the query field, but nice to have up front: searching with an uploaded file, searching by GPS, and the toolbox containing many of the different tools that can help you gain insight into the archive, by generating wordclouds or link graphs, searching through the Ngram interface, and much more.
Image searching by location rethought
We’ve reworked the way to search, and to look through the results, when searching by GPS coordinates. We’ve made it easy to search for a specific location, and we’ve grouped the results so that they are easier to interpret.
Zooming into the map will expand the places where images are clustered. Furthermore, we realize that sometimes the need is to look through all the images regardless of their exact position, so we’ve made a split screen that can expand either way, depending on your needs. It’s still possible to do a new search based on any of the found images in the list.
Elaborated export options
We’ve added more functionality to the export options. It’s possible to export both fields from the full search result and the raw WARC records for the search result, if enabled in the settings. You can even decide the format of your export and we’ve added an option to select exactly which fields in the search result you want exported – so if you want to leave out some stuff, that is now possible!
Quickly move through your archive
The standard search UI is pretty much as you are accustomed to, but we made an effort to keep things simple and clean, as well as facilitating in-depth research and tracking of subject interests. In the search results you get a basic outline of metadata on each post. You can narrow your search with the provided facet filters. When expanding a post, you get access to all metadata, and every field has a link if you wish to explore a particular angle related to your post. So you can quickly navigate the archive by starting wide, filtering, and afterwards doing a specific drill-down to find related material.
Visualization of search result by domain
We’ve also made it very easy to quickly get an overview of the results. When clicking the icon in the results headline, you get a complete overview of the different domains in the results, and how large a portion of the search results they account for in each year. This is a very neat way to get an overview of the results and their relative distribution by year.
With quick access from right under the search box, we have gathered the Toolbox with utilities for further data exploration. In the following we will give you a quick tour of the updates and new functionality in this section.
Linkgraph, domain stats and wordcloud
We reworked the Linkgraph, the Wordcloud and the Domain stats components a little, adding some more interaction to the graph and domain stats, and polished the interface for all of them. For the Linkgraph, it is now possible to highlight certain sections within the graph, making it much easier to navigate the sometimes rather large cluster and to look at the connections you find relevant. These tools now provide an easy and quick way to gain a deeper insight into specific domains and the content they hold.
We are so pleased to finally be able to supply a classical Ngram search tool complete with graphs and all. In this version you are able to search through the entire HTML content of your archive and see how the results are distributed over time (harvest time). You can even do comparisons by providing several queries sequentially and see how they compare. On every graph the datapoint at each year is clickable and will trigger a search for the underlying results which is a very handy feature for checking the context and further exploring underlying data. Oh and before we forget – if things get a little crowded in the graph area you can always click on the nicely colored labels at the top of the chart and deselect/select each query.
If the HTML content isn’t really your thing, but your passion lies within the HTML tags themselves, we’ve got you covered. Just flip the radio button under the search box over to HTML-tags in HTML-pages, and you will have all the same features listed above, but now the underlying data will be the HTML tags themselves. As easy as that, you will finally be able to get answers to important questions like ‘when did we actually start to frown upon the blink tag?’
The possibility to export a query, in a format that can be used in Gephi, is still present in the new version of SolrWayback. This will allow you to create some very nice visual graphs that can help you explore how exactly a collection of results is tied together. If you’re interested in this, feel free to visit the labs website about Gephi graphs, where we’ve showcased some of the possibilities of using Gephi.
Tools for the playback
SolrWayback comes with a built-in playback engine, but it can be configured to use another playback engine such as PyWb. The SolrWayback playback viewer shows a small toolbar overlay on the page that can be opened or hidden. When the toolbar is hidden, the page is displayed without any frame or top toolbar, showing the page exactly as it was harvested.
When you have clicked a specific result, you’re taken to the harvested resource. If it is a website, you will be shown a menu giving you some more ways to analyse the resource. This menu is hidden in the upper left corner when you enter, but can be expanded by clicking on it.
The harvest calendar will give you a very smooth overview of the harvest times of the resource, so you can easily see when, and how often, the resource has been harvested in the current index. This gives you an excellent opportunity to look at your index over time, and see how a website evolved.
The PWID option lets you export the harvested resource metadata, so you can share what’s in that particular resource in a nice and clean way. The PWID standard is an excellent way to keep track of, and share, resources between researchers, so that a list of the exact dataset is preserved, along with all the resources that go with it.
View page resources gives you a clean overview of the contents of the harvested page, along with all the resources. We’ve even added a way to quickly see the difference between the first and the last harvested resource on the page, giving you a quick hint of the contents and if they are all from the same period. You can even see a preview of the page here and download the individual resources from the page, if you wish.
Customization of your local SolrWayback instance
We’ve made it possible to customize your installation to fit your needs. The logo can be changed, the about text can be changed, and you can even customize your search guidelines if you need to. This makes sure that you have a chance to make the instance your own in some way, so that people can recognize when they are using your instance of SolrWayback, and it can reflect your organisation and the people who are contributing to it.
The future of the SolrWayback
This is just the beginning for SolrWayback. Further down the road, we hope to add even more functionality that can help you dig deeper into the archives. One of our main goals is to provide you with the tools necessary to understand and analyse the vast amounts of data that lie in most of the archives SolrWayback is designed for. We already have a few ideas as to what could be useful, but if you have any suggestions for tools that might be helpful, feel free to reach out to us.
By Ben Els, Digital Curator at the National Library of Luxembourg & the Chair of the Organising Committee for the 2021 IIPC General Assembly and the Web Archiving Conference
Our previous blog post from the Luxembourg Web Archive focused on the typical steps that many web archiving initiatives take at the start of their program: to gain first experience with event-based crawls. Elections, natural disasters, and events of national importance are typical examples of event collections. These temporary projects have occupied our crawler for the past 3 years (and continue to do so for the Covid-19 collection), but we also feel that it’s about time for a change of scenery on our seed lists.
Aside from following the news on elections and Covid-19, we also operate 2 domain crawls a year, where basically all websites from the “.lu” top level domain are captured. We use the research from the event collections to expand the seed list for domain crawls and, therefore, also add another layer of coverage to those events. However, the captures of websites from the event collections remain very selective and are usually not revisited, once discussions around the event are over. This is why we plan to focus our efforts in the near future on building thematic collections. As a comparison:
Event collections: multifaceted coverage of one topic or event
Thematic collections: focus on one subject area
The idea is that event collections serve as a base to extract the subject areas for thematic collections. In turn, the thematic collections will serve as a base to start event collections, and save time on research. In time, event collections will help with a more intense coverage for the subjects of thematic collections and the latter will capture information before and after the topic of an event collection. For example, the seed list from an election crawl can serve as a basis for the thematic collection “Politics & Society”. The continued coverage and expansion from this collection will serve as an improved basis for a seed list, once the next election campaign comes around. Moreover, both types of collections will help in broadening the scope of domain crawls and achieve better coverage of the Luxembourg web.
Collaboration with subject experts
During election crawls, it has always been important for us to invite the input from different stakeholders, to make sure that the seed list covers all important areas surrounding the topic. The same principle has to be applied to the thematic collections. No curator can become an expert in every field and our web archiving team will never be able to research and find all relevant websites in all domains and all languages from all corners of the Luxembourg web. Therefore, the curator’s job has to be focused on finding the right people, who know the web around their subject, experts in their field and representatives of their communities, who can help to build and expand seed lists over time. This means relying on internal and external subject experts, who are familiar with the principles of web archiving and incentivised to offer their help in contributing to the Luxembourg web archive.
While, technically, we haven’t tested the idea of this collaborative Lego-tower in reality, here are some of the challenges we would like to tackle this year:
The workflows and platform used to collect the experts’ contributions need to be as easy to use as possible. Our contributors should not require hours of training and tutorials to get started, and it should be intuitive enough to pick up working on a seed list after not having looked at it for several months.
Subject experts should be able to contribute in the way that best fits their work rhythm: a quick and easy option to add single seeds spontaneously when coming across an interesting website, as well as a way to dive in deeper into research and add several seeds at a time.
We are going to ask for help, which means additional work for contributors inside and outside the library. This means that we need to keep the subject experts motivated and convince them that a working and growing web archive represents a benefit for everybody and that their input is indispensable.
As a first step, we would like to set up thematic collections with BnL subject experts, to see what the collaborative platform should look like and what kind of work input can be expected from contributors in terms of initial training and regular participation. The second stage will be to involve contributors from other heritage institutions who already provided lists to our domain crawls in the past. After that, we count on involving representatives of professional associations, communities or other organisations interested in seeing their line of business represented in the web archive.
On an even larger scale, the Luxembourg Web Archive will be open to contributions from students and researchers, website owners, web content creators and archive users in general, which is already possible through the “Suggest a website” form on webarchive.lu. While we haven’t received as many submissions as we would like, there have been very valuable contributions of websites that we would perhaps never have found otherwise. We also noticed that it helps to raise awareness through calls for participation in the media. For instance, we received very positive feedback for our Covid-19 collection. If we are able to create interest on a larger scale, we can get many more people involved and improve the services provided by the Luxembourg Web Archive.
Save the date!
While we work on putting the pieces of this puzzle together, we are also moving closer and closer to the 2021 General Assembly and Web Archiving Conference. It’s been two years since the IIPC community was able to meet for a conference, and surely you are all as eager as we are to catch up, to learn and to exchange ideas about problems and projects. So, if you haven’t done so already, please save the date for a virtual trip to Luxembourg from 14 to 16 June.
By Martin Klein, Scientist in the Research Library at Los Alamos National Laboratory and Karolina Holub, Library Adviser at the Croatian Digital Library Development Centre at National and University Library Zagreb
We are excited to share the news of a newly IIPC-funded collaborative project between the Los Alamos National Laboratory (LANL) and the National and University Library Zagreb (NSK). In this one-year project we will develop a software framework for web archives to create Bloom filters of their archival holdings. A Bloom filter, in this context, consists of hash values of archived URIs and can therefore be thought of as an obfuscated index of an archive’s holdings. This allows web archives to share information about their holdings in a passive manner, meaning only hashed URI values are communicated, rather than plain-text URIs. Sharing Bloom filters with interested parties can enable a variety of downstream applications such as search, synchronized crawling, and cataloging of archived resources.
Bloom filters and Memento TimeTravel
As many readers of this blog will know, the Prototyping Team at LANL has developed and maintained the Memento TimeTravel service, implemented as a federated search across more than two dozen Memento-compliant web archives. This service therefore allows a user (or a machine via the underlying APIs) to search for archived web resources (mementos) across many web archives at the same time. We have tested, evaluated, and implemented various optimizations for the search system to improve speed and avoid unnecessary network requests against participating web archives, but we can always do better. As part of this project, we aim to pilot a TimeTravel service based on Bloom filters that, if successful, should provide a close-to-ideal false positive rate, meaning almost no unnecessary network requests to web archives that do not have a memento of the requested URI.
While Bloom filters are widely used to support membership queries (e.g., is element A part of set B?), they have, to the best of our knowledge, not been applied to querying web archive holdings. We are aware of opportunities to improve the filters and, as additional components of this project, will investigate their scalability (in relation to CDX index size, for example) as well as the potential for incremental updates to the filters. Insights into the former will inform the applicability for archives and individual collections of different sizes, and the latter will guide a best-practice process of filter creation.
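To make the idea concrete, here is a minimal, self-contained sketch of a Bloom filter answering URI membership queries, the kind of pre-check the piloted TimeTravel service could run before contacting an archive. The sizing, hash scheme (double hashing over SHA-256), and class layout are our own illustrative choices, not the project’s actual framework.

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter for URI membership queries (illustrative sketch)."""

    def __init__(self, size_bits=1 << 20, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, uri):
        # Derive k bit positions from the URI via double hashing of one digest.
        digest = hashlib.sha256(uri.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, uri):
        for pos in self._positions(uri):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, uri):
        # False means "definitely not held"; True means "probably held",
        # so only a True result warrants a network request to the archive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(uri))

holdings = BloomFilter()
holdings.add("https://www.haw.nsk.hr/")
```

Only hashed bit positions are stored, so a filter can be shared without exposing plain-text URIs; the cost is a tunable false positive rate, which is exactly what the pilot aims to keep near zero.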
The development and testing of Bloom filters will be performed by using data from the Croatian Web Archive’s collections. NSK develops the Croatian Web Archive (HAW) in collaboration with the University Computing Centre University of Zagreb (Srce), which is responsible for technical development and will work closely with LANL and NSK on this project.
LANL and NSK are excited about this project and new collaboration. We are thankful to the IIPC for their support and are looking forward to regularly sharing project updates with the web archiving community. If you would like to collaborate on any aspects of this project, please do not hesitate to get in touch.
By Claire Newing, The National Archives (UK) and Phil Clegg, MirrorWeb.
At The National Archives of the UK we have been archiving the online presence of UK central government since 2003. We originally worked with suppliers to capture traditional websites which we made available for all to browse and search through the UK Government Web Archive: http://www.nationalarchives.gov.uk/webarchive/.
In the early 2010s, we recognised that the government was increasingly using social media platforms to communicate with the public. At that time Twitter, YouTube and Flickr were the most widely used platforms. We experimented with trying to capture channels using our standard Heritrix based process but with very limited success, so we embarked on a project with Internet Memory Foundation (IMF), our then web archiving service provider, to develop a custom solution.
The project was partially successful. We developed a method of capturing Tweets and YouTube videos and the associated metadata directly from APIs and providing access to them through a custom interface. We also developed a method of crawling all shortlinks in Tweets so the links resolved correctly in the archive.
Unfortunately, we were unable to find a way of capturing Flickr content. The UK Government Social Media Archive was launched in 2014. We continued to capture a small number of channels regularly until mid-July 2017 when we started working with a new supplier, MirrorWeb.
Captures in the cloud
MirrorWeb’s social media capture is undertaken using serverless functions running in the cloud. These functions authenticate with the platform’s API and request the metadata for all new posts created on the social channel since the last capture point. Each post is then stored in a database for later replay. Further serverless functions are triggered when a post object is written to the database to check whether the post contains media content such as images or videos. If media content is found, it is added to a queue, which in turn triggers another serverless function to download the media for the post. For replay, the stored JSON objects are read back from the database and presented to the user, with media objects, in a layout similar to the original platform.
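The event-driven pipeline described above can be sketched with in-memory stand-ins for the API, database, queue, and storage. Everything here is illustrative: the function names, fields, and triggers are our assumptions, not MirrorWeb’s actual implementation.

```python
# In-memory stand-ins for the cloud services (illustrative only).
posts_db = {}       # post-metadata database
media_queue = []    # queue of media objects awaiting download
media_store = {}    # archived media files

def capture_channel(api_posts, last_capture):
    """Function 1: fetch posts newer than the last capture point and store them."""
    for post in api_posts:
        if post["created"] > last_capture:
            posts_db[post["id"]] = post
            on_post_written(post)          # simulates the database-write trigger

def on_post_written(post):
    """Function 2 (triggered on database write): queue any attached media."""
    for url in post.get("media_urls", []):
        media_queue.append((post["id"], url))

def drain_media_queue(fetch):
    """Function 3 (triggered per queue message): download and store the media."""
    while media_queue:
        post_id, url = media_queue.pop(0)
        media_store[(post_id, url)] = fetch(url)

capture_channel(
    [{"id": "t1", "created": 2, "media_urls": ["img.jpg"]},
     {"id": "t0", "created": 1, "media_urls": []}],
    last_capture=1,
)
drain_media_queue(fetch=lambda url: b"<bytes of %s>" % url.encode())
```

Decoupling the stages through a queue is what lets each serverless function stay small and scale independently: a burst of media-heavy posts simply lengthens the queue rather than slowing down metadata capture.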
MirrorWeb chose to archive most social accounts daily to ensure that all new content is captured. Twitter, for example, limits the number of requests to its API in a 15-minute window and restricts the number of historic posts that can be collected, so some tuning is undertaken to ensure rate limits are not exceeded and that all posts are captured.
We also increased the number of channels we were capturing and took the opportunity to redesign our custom access pages, incorporating feedback from our users. Excitingly, images and video content embedded in Tweets became accessible for the first time. The archive was formally re-launched in August 2018, but we always knew it could be even better!
Flickr, access, and full text search
Towards the end of 2018 we launched an improvement project with MirrorWeb. It had three key aims:
(1) To develop a method of capturing Flickr images. This became urgent as in November 2018 Flickr announced that as of early 2019 free account holders would be limited to 1000 images per account and any additional images above that number would be deleted. A survey showed that several UK government accounts, particularly older accounts which were no longer being updated, were at risk of losing some content.
(2) To further improve our custom access pages.
(3) To provide full text search across the social media collection.
The project aims were fulfilled, and the new functionality went live late in 2019.
By far the most exciting new development was the implementation of full text search. We undertook some user research earlier in 2018 which revealed that users considered the archive to be interesting but didn’t think it was very useful without search. This emphasized to us how important it was to provide such a service.
The search service was built by MirrorWeb using Elasticsearch, the same technology we use for the full text search facility on the UK Government Web Archive, our collection of archived websites. MirrorWeb once again makes use of serverless functions to provide full text search of the social accounts. When social post metadata is written to the database, a serverless function is triggered to extract the relevant metadata, which is then added to Elasticsearch.
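The indexing trigger can be sketched as a small extraction step that maps a raw post onto the searchable fields. The field names and index layout below are assumptions for illustration, not the archive’s actual schema.

```python
# Stand-in for the Elasticsearch index (illustrative only; a real function
# would call the Elasticsearch index API instead of appending to a list).
search_index = []

def on_metadata_written(raw_post):
    """Triggered on each database write: extract searchable fields and index them."""
    doc = {
        "platform": raw_post["platform"],                     # e.g. twitter, youtube
        "channel": raw_post["channel"],
        "year": raw_post["created"][:4],                      # for the year facet
        "text": raw_post.get("text") or raw_post.get("title", ""),
    }
    search_index.append(doc)

on_metadata_written({
    "platform": "twitter",
    "channel": "ukgovchannel",          # hypothetical channel name
    "created": "2019-03-01T09:00:00Z",
    "text": "Example announcement text",
})
```

Extracting only the fields that are queried keeps the search index small relative to the full post metadata held in the database.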
Each search queries the full text of Tweets and the descriptions and titles of YouTube and Flickr content. Users can initially search for keywords or a phrase and are then given the opportunity to filter the results by platform, channel and year of post. They can also choose to only display results which include or exclude specific words.
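A search of the shape just described, a keyword or phrase query filtered by platform, channel and year, with specific words excluded, maps naturally onto an Elasticsearch bool query. The field names below are assumptions for illustration, not the archive’s actual mapping.

```python
# Illustrative Elasticsearch query body: phrase search with facet filters
# and an excluded word (field names are hypothetical).
query = {
    "query": {
        "bool": {
            "must": [{"match_phrase": {"text": "example announcement"}}],
            "must_not": [{"match": {"text": "draft"}}],   # "exclude this word"
            "filter": [
                {"term": {"platform": "youtube"}},        # platform facet
                {"term": {"channel": "ukgovchannel"}},    # hypothetical channel
                {"term": {"year": "2019"}},               # year-of-post facet
            ],
        }
    }
}
```

Putting the facets in the `filter` clause rather than `must` means they do not affect relevance scoring and can be cached by Elasticsearch across requests.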
Additionally, we added a search box to the top of each of our custom access pages to enable users to search all data captured from a specific channel. For example, on this page a user can search the titles and descriptions of all the videos we’ve captured from the Prime Minister’s Office YouTube channel.
When we started to investigate capturing social media content over a decade ago, there was a feeling in some quarters that content posted on social media was ephemeral and that posts merely pointed to content on traditional sites. Events of recent years have demonstrated that social media is of increasing importance. In some cases, government departments and ministers announce important information on social media some time before they update a traditional website.
We have achieved a lot already, but we know there is lots more to do. In future, we aspire to add a unified search across our website and social media archives. We are aware that the API capture method does not work for all platforms, so we are actively working to find other methods of capture, particularly for Instagram and Github. We hope to find a way of displaying metadata we capture which is not currently surfaced on the access pages – for example changes to the channel thumbnail image over time.
We are also aware that there are many gaps in our web archive where we were unable to capture embedded YouTube videos. We hope to develop a method of linking between those gaps and the equivalent videos held in the YouTube archive. Finally, we plan to do some user research to guide future developments. We are very proud of the UK Government Social Media Archive and we want to make sure it is used to its full potential.