By Kristinn Sigurðsson, Head of IT at the National and University Library of Iceland and the Lead of the IIPC Tools Development Portfolio
It is difficult to overstate the importance of playback in web archiving. While it is possible to evaluate and make use of a web archive via data mining, text extraction and analysis, and so on, it is the ability to present the captured content in its original form that enables human inspection of the pages. A good playback tool opens up a wide range of practical use cases for the general public, facilitates non-automated quality assurance efforts and (sometimes most importantly) creates a highly visible “face” for our efforts.
Over the last decade or so, most IIPC members who operate their own web archives in-house have relied on OpenWayback, even before it acquired that name. Recognizing the need for a playback tool and the prevalence of OpenWayback, the IIPC has been supporting OpenWayback in a variety of ways over the last five or six years. Most recently, Lauren Ko (UNT), a co-lead of the IIPC’s Tools Development Portfolio, has shepherded work on OpenWayback and pushed out new releases (thanks Lauren!).
Unfortunately, it has been clear for some time that OpenWayback would require a ground-up rewrite if it were to be continued. The software, now almost a decade and a half old, is complicated and archaic. Adding features is nearly impossible, and bug fixes often require exceptional effort. This has led to OpenWayback falling behind as web material evolves, and its replay fidelity has faded.
As there was no prospect for the IIPC to fund a full rewrite, the Tools Development Portfolio, along with other interested IIPC members, began to consider alternatives. As far as we could see, there was only one viable contender on the market, Pywb.
Last fall the IIPC sent out a survey to our members to get insights into the playback software currently in use, plans to transition to Pywb, and the key roadblocks preventing IIPC members from adopting Pywb. The IIPC also organised online calls for members and gathered feedback from institutions that had already adopted Pywb.
Unsurprisingly, these consultations with the membership confirmed the current importance of OpenWayback. The results also showed a general interest in adopting Pywb whilst highlighting a number of likely hurdles our members faced in making that change. Consequently, we decided to move ahead with the decision to endorse Pywb as a replay solution and to work to support IIPC members’ adoption of Pywb.
The members of the IIPC’s Tools Development Portfolio then analyzed the results of the survey and, in consultation with Ilya Kreymer, came up with a list of requirements that, once met, would make it much easier for IIPC members to adopt Pywb. These requirements were then divided into three work packages to be delivered over the next year.
Over the last few years, Pywb has emerged as a capable alternative to OpenWayback. In some areas of playback it is better than, or at least comparable to, OpenWayback, having been updated to account for recent developments in web technology. Because Pywb is more modern and still actively maintained, the gap between it and OpenWayback is only likely to grow. As it is also open source, it makes a reasonable choice for the IIPC to support as the new “go-to” replay tool.
However, while Pywb’s replay abilities are impressive, it is far from a drop-in replacement for OpenWayback. Notably, OpenWayback offers more customization and localization support than Pywb. There are also many differences between the two tools that make migration from one to the other difficult.
To address this, the IIPC has signed a contract with Ilya Kreymer, the maintainer of the web archive replay tool Pywb. The IIPC will be providing financial support for the development of key new features in Pywb.
The first work package will focus on developing a detailed migration guide for existing OpenWayback users. This will include example configuration for common cases and cover diverse backend setups (e.g. CDX vs. ZipNum vs. OutbackCDX).
The second package will deliver Pywb improvements to make it more modular, along with extended support and documentation for localization and extended access control options.
The third package will focus on customization and integration with existing services. It will also improve the Pywb “calendar page” and “banner”, bringing to them features now available in OpenWayback.
There is clearly more work that can be done on replay. The ever fluid nature of the web means we will always be playing catch-up. As work progresses on the work packages mentioned above, we will be soliciting feedback from our community. Based on that, we will consider how best to meet those challenges.
By Tim Sherratt, Associate Professor of Digital Heritage, University of Canberra & the creator of the GLAM Workbench
We tend to think of a web archive as a site we go to when links are broken – a useful fallback, rather than a source of new research data. But web archives don’t just store old web pages, they capture multiple versions of web resources over time. Using web archives we can observe change – we can ask historical questions. But web archives store huge amounts of data, and access is often limited for legal reasons. Just knowing what data is available and how to get to it can be difficult.
Where do you start?
The GLAM Workbench’s new web archives section can help! Here you’ll find a collection of Jupyter notebooks that document web archive data sources and standards, and walk through methods of harvesting, analysing, and visualising that data. It’s a mix of examples, explorations, apps and tools. The notebooks use existing APIs to get data in manageable chunks, but many of the examples demonstrated can also be scaled up to build substantial datasets for research – you just have to be patient!
What can you do?
Have you ever wanted to find when a particular fragment of text first appeared in a web page? Or compare full-page screenshots of archived sites? Perhaps you want to explore how the text content of a page has changed over time, or create a side-by-side comparison of web archive captures. There are notebooks to help you with all of these. To dig deeper you might want to assemble a dataset of text extracted from archived web pages, construct your own database of archived Powerpoint files, or explore patterns within a whole domain.
A number of the notebooks use Timegates and Timemaps to explore change over time. They could easily be adapted to work with any Memento-compliant system. For example, one notebook steps through the process of creating and compiling annual full-page screenshots into a time series.
Another walks forwards or backwards through a Timemap to find when a phrase first appears in (or disappears from) a page. You can also view the total number of occurrences over time as a chart.
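The Timemap-walking approach can be sketched in a few lines of Python. This is a simplified illustration, not the notebook's actual code: the link-format parsing is reduced to a regex, and fetching is abstracted behind a callable so the walking logic stands on its own.

```python
import re

# Matches one memento entry in a link-format Timemap, e.g.
# <https://web.archive.org/web/2020.../http://example.com/>; rel="memento"; datetime="..."
TIMEMAP_MEMENTO = re.compile(
    r'<([^>]+)>;\s*rel="[^"]*memento[^"]*";\s*datetime="([^"]+)"')

def parse_timemap(timemap_text):
    """Return (datetime, memento_url) pairs from a link-format Timemap."""
    return [(dt, url) for url, dt in TIMEMAP_MEMENTO.findall(timemap_text)]

def find_first_appearance(mementos, phrase, fetch):
    """Walk mementos (assumed chronological) and return the first capture
    whose text contains the phrase; `fetch` maps a memento URL to page text."""
    for dt, url in mementos:
        if phrase in fetch(url):
            return dt, url
    return None
```

For the Internet Archive, a Timemap can be fetched from `https://web.archive.org/web/timemap/link/{url}` and passed to `parse_timemap`; walking backwards to find when a phrase disappears works the same way over `reversed(mementos)`.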
The notebooks document a number of possible workflows. One uses the Internet Archive’s CDX API to find all the Powerpoint files within a domain. It then downloads the files, converts them to PDFs, saves an image of the first slide, extracts the text, and compiles everything into an SQLite database. You end up with a searchable dataset that can be easily loaded into Datasette for exploration.
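As a rough illustration of the first step in that workflow, the CDX query could be assembled like this. The parameter choices are assumptions based on the public CDX API, not the notebook's exact settings:

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def powerpoint_query(domain):
    """Build a CDX API query for Powerpoint captures across a whole domain."""
    params = {
        "url": domain,
        "matchType": "domain",  # include subdomains of the given domain
        "output": "json",
        "fl": "urlkey,timestamp,original,mimetype,statuscode",
        "filter": "mimetype:application/vnd.ms-powerpoint",
        "collapse": "urlkey",   # keep one capture per unique URL
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"
```

Each row of the response gives a timestamp and original URL, from which the raw file can then be downloaded from the archive for conversion and text extraction.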
While most of the notebooks work with small slices of web archive data, one harvests all the unique URLs from the gov.au domain and attempts to visualise the subdomains. The notebooks provide a range of approaches that can be extended or modified according to your research questions.
Thanks to everyone who contributed to the discussion on the IIPC Slack, in particular Alex Osborne, Ben O’Brien and Andy Jackson who helped out with understanding how to use NLA/NZNL/UKWA collections respectively.
By Youssef Eldakar of Bibliotheca Alexandrina and Lana Alsabbagh of the National Library of New Zealand
Bibliotheca Alexandrina (BA) and the National Library of New Zealand (NLNZ) are working together to bring to the web archiving community a tool for scalable web archive visualization: LinkGate. The project was awarded funding by the IIPC for the year 2020. This blog post gives a detailed overview of the work that has been done so far and outlines what lies ahead.
In all domains of science, visualization is essential for deriving meaning from data. In web archiving, the data is linked data that may be visualized as a graph, with web resources as nodes and outlinks as edges.
This phase of the project aims to deliver the core functionality of a scalable web archive visualization environment consisting of a Link Service (link-serv), Link Indexer (link-indexer), and Link Visualizer (link-viz) components as well as to document potential research use cases within the domain of web archiving for future development.
The following illustrates the data flow for LinkGate in the web archiving ecosystem: a web crawler archives captured web resources into WARC/ARC files, which are then checked into storage; metadata is extracted from the WARC/ARC files into WAT files; link-indexer extracts outlink data from the WAT files and inserts it into link-serv; and link-serv serves graph data to link-viz for rendering as the user navigates the graph representation of the web archive:
In what follows, we look at development by Bibliotheca Alexandrina to get each of the project’s three main components, Link Service, Link Indexer and Link Visualizer, off the ground. We also discuss the outreach part of the project, coordinated by the National Library of New Zealand, which involves gathering researcher input and putting together an inventory of use cases.
Please watch the project’s code repositories on GitHub for commits following a code review later this month:
link-serv is the Link Service that provides an API for inserting web archive interlinking data into a data store and for retrieving back that data for rendering and navigation.
We worked on the following:
Data store scalability
API definition and Gephi compatibility
Data store scalability
link-serv depends on an underlying graph database as the repository for web resources as nodes and outlinks as relationships. Building upon BA’s previous experience with graph databases in the Encyclopedia of Life project, we worked on adapting the Neo4j graph database for versioned web archive data. Scalability being a key interest, we ran a benchmark of Neo4j on Intel Xeon E5-2630 v3 hardware using a generated test dataset and examined bottlenecks to tune performance. In the benchmark, over a series of progressions, a total of 15 billion nodes and 34 billion relationships were loaded into Neo4j, and matching and updating performance was tested. While inserting nodes into the database took hours or even days for the larger progressions, match and update times in all progressions remained in seconds once a database index was added: 0.01 to 25 seconds for nodes, with 85% of cases below 7 seconds, and 0.5 to 34 seconds for relationships, with 67% of cases below 9 seconds. Considering these results promising, we hope that tuning work during the coming months will lead to even better performance. Further testing is underway using a second set of generated relationships to more realistically simulate web links.
We ruled out Virtuoso, 4store, and OrientDB as graph data store options for being less suitable for the purposes of this project. A more recent alternative, ArangoDB, is currently being looked into and is also showing promising initial results, and we are leaving open the possibility of additionally supporting it as an option for the graph data store in link-serv.
To represent web archive data in the graph data store, we designed a schema with the goals of supporting time-versioned interlinked web resources and being friendly to search using the Cypher Query Language. The schema defines Node and VersionNode as node types and HAS_VERSION and LINKED_TO as relationship types linking a Node to a descendant VersionNode and a VersionNode to a hyperlinked Node, respectively. A Node has the URI of the resource as attribute in Sort-friendly URI Reordering Transform (SURT), and a VersionNode has the ISO 8601 timestamp of the version as attribute. The following illustrates the schema:
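As a sketch of how that schema might be populated, the following builds parameterized Cypher statements for the two relationship types. Property names such as `surt` and `timestamp` are assumptions for illustration, not link-serv's actual schema:

```python
def merge_version(surt_uri, timestamp):
    """Cypher (with parameters) to upsert a Node and one of its VersionNodes."""
    query = (
        "MERGE (n:Node {surt: $surt}) "
        "MERGE (n)-[:HAS_VERSION]->(v:VersionNode {timestamp: $ts}) "
        "RETURN n, v"
    )
    return query, {"surt": surt_uri, "ts": timestamp}

def link_version_to_target(surt_uri, timestamp, target_surt):
    """Cypher linking an existing VersionNode to a hyperlinked target Node."""
    query = (
        "MATCH (:Node {surt: $surt})-[:HAS_VERSION]->"
        "(v:VersionNode {timestamp: $ts}) "
        "MERGE (t:Node {surt: $target}) "
        "MERGE (v)-[:LINKED_TO]->(t)"
    )
    return query, {"surt": surt_uri, "ts": timestamp, "target": target_surt}
```

With the official Neo4j driver, each (query, parameters) pair could then be executed in a session; using `MERGE` keeps repeated insertions of the same resource idempotent.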
API definition and Gephi compatibility
link-serv is to receive data extracted by link-indexer from a web archive and respond to queries by link-viz as the graph representation of web resources is navigated. At this point, two API operations have been defined for this interfacing: updateGraph and getGraph. updateGraph is to be invoked by link-indexer and takes as input a JSON representation of outlinks to be loaded into the data store. getGraph, on the other hand, is to be invoked by link-viz and returns a JSON representation of possibly nested outlinks for rendering. Additional API operations may be defined as development progresses.
One of the project’s premises is maintaining compatibility with the popular graph visualization tool, Gephi. This would enable users to render web archive data served by link-serv using Gephi as an alternative to the project’s frontend component, link-viz. To achieve this, the updateGraph and getGraph API operations were based on their counterparts in the Gephi graph streaming API with the following adaptations:
Redefining the workspace to refer to a timestamp and URL
Adding timestamp and url parameters to both updateGraph and getGraph
Adding depth parameter to getGraph
An instance of Gephi with the graph streaming plugin installed was used to examine API behavior. We also examined API behavior using the Neo4j APOC library, which provides a procedure for data export to Gephi.
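To illustrate the Gephi-compatible shape of the data, here is a sketch of how one capture's outlinks might be serialized as graph-streaming events. The event identifiers and attributes are illustrative, not the actual updateGraph payload:

```python
import json

def outlinks_to_gephi_events(source, timestamp, outlinks):
    """Serialize one capture's outlinks as Gephi graph-streaming events:
    "an" adds a node, "ae" adds an edge (identifiers here are illustrative)."""
    events = [{"an": {source: {"label": source, "timestamp": timestamp}}}]
    for i, target in enumerate(outlinks):
        events.append({"an": {target: {"label": target}}})
        events.append({"ae": {f"{source}->{i}": {
            "source": source, "target": target, "directed": True}}})
    # the streaming format is one JSON event object per line
    return "\n".join(json.dumps(event) for event in events)
```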
An initial minimal API service for link-serv was implemented in Java using the Spring Boot framework and Neo4j bindings.
We have the following issues up next:
Continue to develop the service API implementation
Tune insertion and matching performance
Test integration with link-indexer and link-viz
link-indexer is the tool that runs on web archive storage where WARC/ARC files are kept and collects outlinks data to feed to link-serv to load into the graph data store. In a subsequent phase of the project, collected data may include details besides outlinks to enrich the visualization.
We worked on the following:
Invocation model and choice of programming tools
Web Archive Transformation (WAT) as input format
Invocation model and choice of programming tools
link-indexer collects data from the web archive’s underlying file storage, which means it will often be invoked on multiple nodes in a computer cluster. To handle future research use cases, the tool will also eventually need to do a fair amount of data processing, such as language detection, named entity recognition, or geolocation. For these reasons, we found Python a fitting choice for link-indexer. Additionally, several modules are readily available for Python that implement functionality related to web archiving, such as WARC file reading and writing and URI transformation.
In a distributed environment such as a computer cluster, invocation would be on an ad-hoc basis using a tool such as Ansible, dsh, or pdsh (among many others), or configured using a configuration management tool (again, such as Ansible) for periodic execution on each host in the distributed environment. Given this intended usage and the magnitude of the input data, we identified the following requirements for the tool:
Flexible configuration using a configuration file as well as command-line options
Reduced system resource footprint and optimized performance
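A minimal sketch of the first requirement, layering command-line options over a configuration file, might look like this in Python. The option and section names are hypothetical, not link-indexer's actual interface:

```python
import argparse
import configparser

def load_settings(argv=None):
    """Merge settings from an INI file with command-line overrides
    (option names are illustrative, not link-indexer's actual ones)."""
    parser = argparse.ArgumentParser(description="link-indexer sketch")
    parser.add_argument("--config", default=None, help="path to INI file")
    parser.add_argument("--batch-size", type=int, default=None)
    parser.add_argument("--api-url", default=None)
    parser.add_argument("wat_files", nargs="*")
    args = parser.parse_args(argv)

    # defaults, overridden first by the file, then by the command line
    settings = {"batch_size": 100,
                "api_url": "http://localhost:8080/updateGraph"}
    if args.config:
        cfg = configparser.ConfigParser()
        cfg.read(args.config)
        if "link-indexer" in cfg:
            section = cfg["link-indexer"]
            settings["batch_size"] = section.getint("batch_size",
                                                    settings["batch_size"])
            settings["api_url"] = section.get("api_url", settings["api_url"])
    if args.batch_size is not None:
        settings["batch_size"] = args.batch_size
    if args.api_url is not None:
        settings["api_url"] = args.api_url
    settings["wat_files"] = args.wat_files
    return settings
```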
Web Archive Transformation (WAT) as input format
Building upon already existing tools, Web Archive Transformation (WAT) is used as the input format rather than reading full WARC/ARC files directly. WAT files hold metadata extracted from the web archive. Using WAT as input reduces code complexity, promotes modularity, and makes it possible to run link-indexer on auxiliary storage holding only WAT files, which are significantly smaller than their original WARC/ARC sources. warcio is used in the Python code to read WAT files, which conform in structure to the WARC format. We initially used archive-metadata-extractor to generate WAT files. However, testing our implementation with sample files showed that the tool generates files that do not exactly conform to the WARC structure and cause warcio to fail on reading. The more recent webarchive-commons library was subsequently used instead to generate WAT files.
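Reading outlinks from a WAT file with warcio can be sketched as follows. The nested key path follows the WAT specification's Envelope structure; the function names are our own, not link-indexer's:

```python
import json

def extract_outlinks(wat_payload):
    """Pull outlink URLs out of one parsed WAT metadata record
    (nested keys follow the WAT spec's Envelope structure)."""
    links = (wat_payload.get("Envelope", {})
                        .get("Payload-Metadata", {})
                        .get("HTTP-Response-Metadata", {})
                        .get("HTML-Metadata", {})
                        .get("Links", []))
    return [link["url"] for link in links if "url" in link]

def iter_wat_outlinks(wat_path):
    """Yield (target_uri, outlinks) for each metadata record in a WAT file."""
    from warcio.archiveiterator import ArchiveIterator  # third-party: warcio
    with open(wat_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "metadata":
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI")
            payload = json.loads(record.content_stream().read())
            yield uri, extract_outlinks(payload)
```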
The current initial minimal implementation of link-indexer includes the following:
Basic command-line invocation with multiple input WAT files as arguments
Traversal of metadata records in WAT files using warcio
Collecting outlink data and converting relative links to absolute
Composing JSON graph data compatible with the Gephi streaming API
Grouping a defined count of records into batches to reduce hits on the API service
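The link-resolution and batching steps above can be sketched as follows; this is a simplified illustration, not the actual link-indexer code:

```python
from itertools import islice
from urllib.parse import urljoin

def absolutize(page_url, outlinks):
    """Resolve relative outlinks against the page they were found on."""
    return [urljoin(page_url, link) for link in outlinks]

def batched(records, size):
    """Group records into fixed-size batches to limit API round-trips."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```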
We plan to continue work on the following:
Rewriting links in Sort-friendly URI Reordering Transform (SURT)
Integration with the link-serv API
link-viz is the project’s web-based frontend for accessing data provided by link-serv as a graph that can be navigated and explored.
We worked on the following:
Graph rendering toolkit
Web development framework and tools
UI design and artwork
Graph visualization libraries, as well as web application frameworks, were researched for the web-based link visualization frontend. D3.js and Vis.js emerged as the most suitable candidates for the visualization toolkit. After experimenting with both, we decided to go with Vis.js, which fits the needs of the application and is better documented.
We also took a fresh look at current web development frameworks and decided to house the Vis.js visualization logic within a Laravel framework application combining PHP and Vue.js for future expandability of the application’s features, e.g., user profile management, sharing of graphs, etc.
A virtual machine was allocated on BA’s server infrastructure to host link-viz for the project demo that we will be working on.
We built a barebone frontend consisting of the following:
Graph rendering page with the following UI elements:
URL, depth, and date selection inputs
Placeholders for add-ons
As we outlined in the project proposal, we plan to implement add-ons during a later phase of the project to extend functionality. Add-ons would come in two categories: vizors, for modifying how the user sees the graph, e.g., GeoVizor for superimposing nodes on a map of the world, and finders, to help the user explore the graph, e.g., PathFinder for finding all paths from one node to another.
Some work has already been done in UI design, color theming, and artwork, and we plan to continue work on the following:
Integration with the link-serv API
Continue work on UI design and artwork
Research use cases for web archive visualization
In terms of outreach, the National Library of New Zealand has been getting in touch with researchers from a wide array of backgrounds, ranging from data scientists to historians, to gather feedback on potential use cases and the types of features researchers would like to see in a web archive visualization tool. Several issues have been brought up, including frustrations with existing tools’ lack of scalability, being tied to a physical workstation, time wasted on preprocessing datasets, and the inability to customize an existing tool to a researcher’s individual needs. Gathering first-hand input from researchers has led to many interesting insights. The next steps are to document and publish these potential research use cases on the wiki to guide future developments in the project.
We would like to extend our thanks and appreciation to all the researchers who generously gave their time to provide us with feedback, including Dr. Ian Milligan, Dr. Niels Brügger, Emily Maemura, Ryan Deschamps, Erin Gallagher, and Edward Summers.
Meet the people involved in the project at Bibliotheca Alexandrina:
And at the National Library of New Zealand:
We would also like to thank Alex Osborne at the National Library of Australia and Andy Jackson at the British Library for their advice on technical issues.
If you have any questions or feedback, please contact the LinkGate Team at linkgate[at]iipc.simplelists.com.
By Jefferson Bailey, Internet Archive & IIPC Steering Committee
IIPC is excited to announce the launch of its Discretionary Funding Program (DFP) to support the collaborative activities of its members by providing funding to accelerate the preservation and accessibility of the web. Following the announcement to membership at the recent IIPC General Assembly in Zagreb, Croatia, the IIPC DFP aims to advance the development of tools, training, and practices that further the organization’s mission “to acquire, preserve and make accessible knowledge and information from the Internet for future generations everywhere, promoting global exchange and international relations.”
The inaugural DFP Call for Proposals will award funding according to an application process. Applications will be due on September 1, 2019 for one-year projects starting January 1, 2020 or July 1, 2020. The program will grant awards in three categories:
Seed Grants ($0 to $10,000) fund smaller, individual efforts, help smaller projects/events scale up, or support smaller-scope projects.
Development Grants ($10,000 to $25,000) fund efforts that require meaningful funding for event hosting, engineering, publications, project growth, etc.
Program Grants ($25,000 to $50,000) fund larger initiatives, either to launch new initiatives or to increase the impact and expansion of proven work or technologies.
The IIPC has earmarked a significant portion of its reserve funds and of income from member dues to support the joint work of its members through this program. Applications will be reviewed by a team of IIPC Steering Committee members as well as representatives from the broader IIPC membership. Our hope is that the IIPC DFP serves as a catalyst to promote grassroots, member-driven innovation and collaboration across the IIPC membership.
Please visit the IIPC DFP page (http://netpreserve.org/projects/funding/) for an overview of the application process, links to the application form and a FAQ page, and other details and contact information. We encourage all IIPC members to apply for DFP funding and to coordinate with their peer members on brainstorming programs to advance the field of web archiving. The DFP team intends to administer the program with the utmost equity and transparency and encourages any members with questions not answered by online resources to post them on the dedicated IIPC Slack channel (#projects at http://iipc.slack.com) or via email at projects[at]iipc.simplelists.com.