LinkGate is a scalable web archive graph visualization environment. The project was launched with funding from the IIPC in January 2020. During this round of funding, Bibliotheca Alexandrina (BA) and the National Library of New Zealand (NLNZ) partnered to develop core functionality for a scalable graph visualization solution geared towards web archiving and to compile an inventory of research use cases to guide future development of LinkGate.
What does LinkGate do?
LinkGate seeks to address the need to visualize data stored in a web archive. Fundamentally, the web is a graph, where nodes are webpages and other web resources, and edges are the hyperlinks that connect web resources together. A web archive introduces the time dimension to this pool of data and makes the graph a temporal graph, where each node has multiple versions according to the time of capture. Because the web is big, web archive graph data is big data, and scalability of a visualization solution is a key concern.
APIs and use cases
We developed a scalable graph data service that exposes temporal graph data via an API, a data collection tool for feeding interlinking data extracted from web archive data files into the data service, and a web-based frontend for visualizing web archive graph data streamed by the data service. Because this project was first conceived to fulfill a research need, we reached out to the web archive community and interviewed researchers to identify use cases to guide development beyond core functionality. Source code for the three software components, link-serv, link-indexer, and link-viz, respectively, as well as the use cases, are openly available on GitHub.
An instance of LinkGate is deployed on Bibliotheca Alexandrina’s infrastructure and accessible at linkgate.bibalex.org. Insertion of data into the backend data service is ongoing. The following are a few screenshots of the frontend:
Please see the project’s IIPC Discretionary Funding Program (DFP) 2020 final report for additional details.
We will be presenting the project at the upcoming IIPC Web Archiving Conference on Tuesday, 15 June 2021, and will also share the results of our work at a Research Speaker Series webinar on 28 July. If you have any questions or feedback, please contact the LinkGate Team at linkgate[at]iipc.simplelists.com.
This development phase of Project LinkGate has focused on the core functionality of a scalable, modular graph visualization environment for web archive data. Our team shares a common passion for this work, and we remain committed to continuing to build up the components, including:
Design and development of the plugin API to support the implementation of add-on finders and vizors (graph exploration tools)
Integration of alternative data stores (e.g., the Solr index in SolrWayback, so that data may be served by link-serv to visualize in link-viz or Gephi)
Improved implementation of the software in general.
The LinkGate team is grateful to the IIPC for providing the funding to get the project started and develop the core functionality. The team is passionate about this work and is eager to carry on with development.
Lana Alsabbagh, NLNZ, Research Use Cases
Youssef Eldakar, BA, Project Coordination
Mohammed Elfarargy, BA, Link Visualizer (link-viz) & Development Coordination
Mohamed Elsayed, BA, Link Indexer (link-indexer)
Andrea Goethals, NLNZ, Project Coordination
Amr Morad, BA, Link Service (link-serv)
Ben O’Brien, NLNZ, Research Use Cases
Amr Rizq, BA, Link Visualizer (link-viz)
Tasneem Allam, BA, link-viz development
Suzan Attia, BA, UI design
Dalia Elbadry, BA, UI design
Nada Eliba, BA, link-serv development
Mirona Gamil, BA, link-serv development
Olga Holownia, IIPC, project support
Andy Jackson, British Library, technical advice
Amged Magdey, BA, logo design
Liquaa Mahmoud, BA, logo design
Alex Osborne, National Library of Australia, technical advice
We would also like to thank the researchers who agreed to be interviewed for our Inventory of Use Cases.
By Mohammed Elfarargy and Youssef Eldakar of Bibliotheca Alexandrina
LinkGate is an IIPC-funded project to develop a scalable web archive graph visualization environment and collect research use cases, led by Bibliotheca Alexandrina (BA) and the National Library of New Zealand (NLNZ). The project provides three modular components:
Link Service (link-serv) for the scalable temporal graph data service with an underlying graph data store and API
Link Indexer (link-indexer) for collecting inter-linking data from the web archive
Link Visualizer (link-viz) for the web-based frontend geared towards web archive graph data navigation and exploration
During a webinar held at the end of July as part of the IIPC Research Speaker Series (RSS), we presented a demo of the tools being developed and a summary of feedback gathered so far from the community towards a research use case inventory. In this blog post, we give an update on progress of the technical development, focusing on the initial UI of link-viz.
LinkGate’s frontend visualization component, link-viz, has developed on many fronts over the last four months. While the link-serv component is compatible with the Gephi streaming API, Gephi remains a desktop-only, general-purpose graph visualization tool. link-viz, on the other hand, is a web-based, scalable graph visualization tool made specifically to visualize web archive graph data. This makes it possible to produce more informative graphs for web archive users.
link-viz works in a similar manner to web-based map services like Google Maps. The user gets a graph based on the queried URL and the desired snapshot. Users can set the initial depth of the graph and then incrementally add more nodes as they explore deeper in the graph. This smart loading makes the exploration of such a dense graph run more smoothly.
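The idea of loading a graph up to an initial depth and then expanding it on demand can be illustrated with a depth-limited breadth-first traversal. The sketch below is only an illustration under stated assumptions: the graph is a plain in-memory dictionary of outlinks, and the function name `neighborhood` is hypothetical rather than part of link-viz or link-serv.

```python
from collections import deque


def neighborhood(outlinks, start, depth):
    """Return the nodes and edges reachable within `depth` hops of `start`.

    `outlinks` maps each URL to the list of URLs it links to (an assumed
    in-memory stand-in for data that would come from a graph service).
    """
    seen = {start}
    edges = []
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == depth:
            # Nodes at the depth limit are kept but not expanded further.
            continue
        for target in outlinks.get(node, []):
            edges.append((node, target))
            if target not in seen:
                seen.add(target)
                queue.append((target, dist + 1))
    return seen, edges


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["a"]}
    print(neighborhood(graph, "a", 1))
    print(neighborhood(graph, "a", 2))
```

Calling the function again with a larger depth, or starting from a node on the boundary, corresponds to the user asking for more of the graph as they explore.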
The link-viz UI is designed to keep the main focus on the graph. Users can click on any graph node to select it and perform actions using tools available in the UI. Graph nodes can be moved around and are, by default, distributed using a spring force model to help make a uniform distribution over 2D space. It’s possible to toggle this off to give users the option to organize nodes manually. Users can easily pan the view and zoom in or out using mouse controls or touch gestures. All other tools are located in four floating panels surrounding the main graph area:
The left-hand panel is used to search for a URL and to select the desired snapshot based on which the initial graph will be rendered. The snapshot selection widget is illustrated in Figure 1:
Figure 1: Snapshot selection widget
The bottom panel shows detailed information on the highlighted graph node. This includes a full URL and a listing of all the outlinks and inlinks. This can be seen in Figure 2:
The top panel contains a set of tools for graph navigation (zoom in/out and reset view), taking graph screenshots, setting graph depth, collapsing/expanding portions of the graph, and configuring the look of the graph (selection of color, size, and shape for both graph nodes and edges to represent different pieces of information). One nice feature of link-viz compared to standard graph visualization tools is the usage of website favicons for graph nodes instead of geometric shapes, which makes nodes instantly identifiable and results in a much more readable graph. Figures 3 and 4 show the top panel and favicon usage, respectively:
The right-hand panel contains two tabs reserved for two sets of tools, Vizors and Finders. Vizors are tools to display the same graph highlighting additional information. Two vizors are currently planned. The GeoVizor will put graph nodes on top of a world map to show the hosting physical location. The FileTypeVizor will display file-type icons as graph nodes, making it very easy to identify most common file types and their distribution over the web. Finders perform graph exploration functions, such as finding loops or paths between nodes.
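To give a feel for what a Finder might compute, here is a minimal sketch of a loop search: it looks for one chain of links that leaves a page and eventually returns to it. The function name `find_loop` and the dictionary-based graph are assumptions made for illustration and do not describe how link-viz implements Finders.

```python
def find_loop(outlinks, start):
    """Return one loop through `start` as a list of nodes, or None.

    `outlinks` maps each URL to the list of URLs it links to.
    """
    stack = [(start, [start])]
    visited = set()
    while stack:
        node, path = stack.pop()
        for target in outlinks.get(node, []):
            if target == start:
                # A link back to the starting node closes the loop.
                return path + [start]
            if target not in visited:
                visited.add(target)
                stack.append((target, path + [target]))
    return None


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
    print(find_loop(graph, "a"))
    print(find_loop(graph, "d"))
```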
Apart from Vizors and Finders, we are also working on other features, including smart graph loading and an animated graph timeline. We are also going to improve UI styling.
link-indexer is now integrated with link-serv via the API. We have been testing the process of inserting data extracted with link-indexer into link-serv to identify data and scalability problems to work on. link-indexer now accepts command-line options for specifying the target link-serv instance and controlling the insertion batch size to manage how often the API is invoked. More command-line options are being added to control various aspects of the tool, as well as the ability to load options from a configuration file. We are also working to enhance tolerance to data issues, such as very long URLs, and network issues, such as short service outages. Figure 5 shows a sample output from a link-indexer run:
link-serv implements an API for link-indexer and link-viz to communicate with the graph data store. The API is compatible with the Gephi streaming API, giving users the option to connect to link-serv using the popular graph visualization tool, Gephi, as an alternative to the project’s frontend, link-viz. Figure 6 shows a Gephi client streaming graph data from a link-serv instance:
A data schema customized for temporal, versioned web archive data is used in the underlying Neo4j graph data store, and link-serv defines extra API operations not defined in the Gephi streaming API to support temporal navigation functionality in link-viz.
As more data is added to link-serv, the underlying graph data store has difficulty scaling up when reliant on a single instance. Our primary focus in link-serv at the moment, therefore, is to implement clustering. Work is in progress on a customized dispatcher service for the Neo4j graph data store as a substitute for the clustering functionality in the commercially licensed Neo4j Enterprise Edition. As a side track, we are also looking into ArangoDB as a possible alternative deployment option for link-serv’s graph data store.
By Youssef Eldakar of Bibliotheca Alexandrina and Lana Alsabbagh of the National Library of New Zealand
Bibliotheca Alexandrina (BA) and the National Library of New Zealand (NLNZ) are working together to bring to the web archiving community a tool for scalable web archive visualization: LinkGate. The project was awarded funding by the IIPC for the year 2020. This blog post gives a detailed overview of the work that has been done so far and outlines what lies ahead.
In all domains of science, visualization is essential for deriving meaning from data. In web archiving, data is linked data that may be visualized as a graph with web resources as nodes and outlinks as edges.
This phase of the project aims to deliver the core functionality of a scalable web archive visualization environment consisting of Link Service (link-serv), Link Indexer (link-indexer), and Link Visualizer (link-viz) components, as well as to document potential research use cases within the domain of web archiving for future development.
The following illustrates data flow for LinkGate in the web archiving ecosystem, where a web crawler archives captured web resources into WARC/ARC files that are then checked into storage, metadata is extracted from WARC/ARC files into WAT files, link-indexer extracts outlink data from WAT files and inserts it into link-serv, which then serves graph data to link-viz for rendering as the user navigates the graph representation of the web archive:
In what follows, we look at development by Bibliotheca Alexandrina to get each of the project’s three main components, Link Service, Link Indexer and Link Visualizer, off the ground. We also discuss the outreach part of the project, coordinated by the National Library of New Zealand, which involves gathering researcher input and putting together an inventory of use cases.
Please watch the project’s code repositories on GitHub for commits following a code review later this month:
link-serv is the Link Service that provides an API for inserting web archive interlinking data into a data store and for retrieving back that data for rendering and navigation.
We worked on the following:
Data store scalability
API definition and Gephi compatibility
Data store scalability
link-serv depends on an underlying graph database as a repository for web resources as nodes and outlinks as relationships. Building upon BA’s previous experience with graph databases in the Encyclopedia of Life project, we worked on adapting the Neo4j graph database for versioned web archive data. Scalability being a key interest, we ran a benchmark of Neo4j on Intel Xeon E5-2630 v3 hardware using a generated test dataset and examined bottlenecks to tune performance. In the benchmark, over a series of progressions, a total of 15 billion nodes and 34 billion relationships were loaded into Neo4j, and matching and updating performance was tested. While the time to insert nodes into the database for the larger progressions was in hours or even days, match and update times in all progressions, after a database index was added, remained in seconds: from 0.01 to 25 seconds for nodes, with 85% of cases remaining below 7 seconds, and from 0.5 to 34 seconds for relationships, with 67% of cases remaining below 9 seconds. Considering the results promising, we hope that tuning work during the coming months will lead to more desirable performance. Further testing is underway using a second set of generated relationships to more realistically simulate web links.
We ruled out Virtuoso, 4store, and OrientDB as graph data store options for being less suitable for the purposes of this project. A more recent alternative, ArangoDB, is currently being looked into and is also showing promising initial results, and we are leaving open the possibility of additionally supporting it as an option for the graph data store in link-serv.
To represent web archive data in the graph data store, we designed a schema with the goals of supporting time-versioned interlinked web resources and being friendly to search using the Cypher Query Language. The schema defines Node and VersionNode as node types and HAS_VERSION and LINKED_TO as relationship types linking a Node to a descendant VersionNode and a VersionNode to a hyperlinked Node, respectively. A Node has the URI of the resource as an attribute in Sort-friendly URI Reordering Transform (SURT) form, and a VersionNode has the ISO 8601 timestamp of the version as an attribute. The following illustrates the schema:
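The same structure can also be sketched in code. The following in-memory model is an illustration only, not the Neo4j schema itself: the class and method names are hypothetical, and the VersionNode here carries the URI alongside the timestamp purely so that it can be identified without a database, which is an assumption of this sketch rather than part of the schema described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    uri: str  # URI of the resource, in SURT form


@dataclass(frozen=True)
class VersionNode:
    uri: str        # assumption: kept here only to identify the version in memory
    timestamp: str  # ISO 8601 timestamp of the capture


class TemporalGraph:
    """Toy store for HAS_VERSION and LINKED_TO relationships."""

    def __init__(self):
        self.has_version = set()  # pairs of (Node, VersionNode)
        self.linked_to = set()    # pairs of (VersionNode, Node)

    def add_capture(self, uri, timestamp, outlinks):
        """Record one capture of `uri` and the resources it links to."""
        node = Node(uri)
        version = VersionNode(uri, timestamp)
        self.has_version.add((node, version))
        for target in outlinks:
            self.linked_to.add((version, Node(target)))
        return version

    def versions_of(self, uri):
        """Return the sorted capture timestamps known for `uri`."""
        return sorted(v.timestamp for n, v in self.has_version if n.uri == uri)

    def outlinks_at(self, uri, timestamp):
        """Return the sorted outlink URIs of `uri` as captured at `timestamp`."""
        version = VersionNode(uri, timestamp)
        return sorted(t.uri for v, t in self.linked_to if v == version)


if __name__ == "__main__":
    g = TemporalGraph()
    g.add_capture("org,example)/", "2019-05-01T00:00:00Z", ["org,example)/about"])
    g.add_capture("org,example)/", "2020-05-01T00:00:00Z", ["org,example)/news"])
    print(g.versions_of("org,example)/"))
    print(g.outlinks_at("org,example)/", "2020-05-01T00:00:00Z"))
```

Note that a link points from a specific version to a plain Node, not to another version, which mirrors the LINKED_TO relationship described above: the target is resolved to a particular capture only when the graph is navigated for a given point in time.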
API definition and Gephi compatibility
link-serv is to receive data extracted by link-indexer from a web archive and respond to queries by link-viz as the graph representation of web resources is navigated. So far, two API operations have been defined for this interfacing: updateGraph and getGraph. updateGraph is to be invoked by link-indexer and takes as input a JSON representation of outlinks to be loaded into the data store. getGraph, on the other hand, is to be invoked by link-viz and returns a JSON representation of possibly nested outlinks for rendering. Additional API operations may be defined in the future as development progresses.
One of the project’s premises is maintaining compatibility with the popular graph visualization tool, Gephi. This would enable users to render web archive data served by link-serv using Gephi as an alternative to the project’s frontend component, link-viz. To achieve this, the updateGraph and getGraph API operations were based on their counterparts in the Gephi graph streaming API with the following adaptations:
Redefining the workspace to refer to a timestamp and URL
Adding timestamp and url parameters to both updateGraph and getGraph
Adding depth parameter to getGraph
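As a rough illustration of these adaptations, a client might compose request URLs along the following lines. Only the operation names and the timestamp, url, and depth parameters come from the description above; the base address, the exact query-string layout, and the helper name `build_operation_url` are assumptions made for this sketch and should be checked against the actual link-serv API.

```python
from urllib.parse import urlencode


def build_operation_url(base, operation, url, timestamp, depth=None):
    """Compose a request URL for a Gephi-style streaming operation.

    `base` is a hypothetical service address; `depth` applies to getGraph only.
    """
    params = {"operation": operation, "url": url, "timestamp": timestamp}
    if depth is not None:
        params["depth"] = depth
    return f"{base}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local instance, for illustration only.
    print(build_operation_url(
        "http://localhost:8080/", "getGraph",
        "http://example.org/", "2020-01-01T00:00:00Z", depth=2))
```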
An instance of Gephi with the graph streaming plugin installed was used to examine API behavior. We also examined API behavior using the Neo4j APOC library, which provides a procedure for data export to Gephi.
An initial minimal API service for link-serv was implemented. The implementation is in Java and uses the Spring Boot framework and Neo4j bindings.
We have the following issues up next:
Continue to develop the service API implementation
Tune insertion and matching performance
Test integration with link-indexer and link-viz
link-indexer is the tool that runs on web archive storage where WARC/ARC files are kept and collects outlink data to feed to link-serv to load into the graph data store. In a subsequent phase of the project, collected data may include details besides outlinks to enrich the visualization.
We worked on the following:
Invocation model and choice of programming tools
Web Archive Transformation (WAT) as input format
Invocation model and choice of programming tools
link-indexer collects data from the web archive’s underlying file storage, which means it will often be invoked on multiple nodes in a computer cluster. To handle future research use cases, the tool will also eventually need to do a fair amount of data processing, such as language detection, named entity recognition, or geolocation. For these reasons, we found Python a fitting choice for link-indexer. Additionally, several modules are readily available for Python that implement functionality related to web archiving, such as WARC file reading and writing and URI transformation.
In a distributed environment such as a computer cluster, invocation would be either on an ad hoc basis using a tool such as Ansible, dsh, or pdsh (among many others) or configured through a configuration management tool (again, such as Ansible) for periodic execution on each host. Given this intended usage and the magnitude of the input data, we identified the following requirements for the tool:
Flexible configuration using a configuration file as well as command-line options
Reduced system resource footprint and optimized performance
Web Archive Transformation (WAT) as input format
Building upon already existing tools, Web Archive Transformation (WAT) is used as the input format rather than directly reading full WARC/ARC files. WAT files hold metadata extracted from the web archive. Using WAT as input reduces code complexity, promotes modularity, and makes it possible to run link-indexer on auxiliary storage having only WAT files, which are significantly smaller in size compared to their original WARC/ARC sources. warcio is used in the Python code to read WAT files, which conform in structure to the WARC format. We initially used archive-metadata-extractor to generate WAT files. However, testing our implementation with sample files showed that the tool generates files that do not exactly conform to the WARC structure and cause warcio to fail when reading them. The more recent webarchive-commons library was subsequently used instead to generate WAT files.
The current initial minimal implementation of link-indexer includes the following:
Basic command-line invocation with multiple input WAT files as arguments
Traversal of metadata records in WAT files using warcio
Collecting outlink data and converting relative links to absolute
Composing JSON graph data compatible with the Gephi streaming API
Grouping a defined count of records into batches to reduce hits on the API service
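The steps listed above can be sketched roughly as follows. This is not the project's code: the helper names are hypothetical, and the JSON field names (Envelope, Payload-Metadata, HTML-Metadata, Links) reflect the layout commonly found in WAT metadata records and should be verified against real files. Reading records with warcio is left out so that the sketch depends only on the standard library; it starts from the JSON payload of a single metadata record. The "an" (add node) and "ae" (add edge) keys are the event types of the Gephi streaming format.

```python
import json
from urllib.parse import urljoin


def extract_outlinks(wat_payload):
    """Return (page_uri, absolute outlink URLs) for one WAT record payload."""
    envelope = json.loads(wat_payload).get("Envelope", {})
    page_uri = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    html = (envelope.get("Payload-Metadata", {})
            .get("HTTP-Response-Metadata", {})
            .get("HTML-Metadata", {}))
    links = []
    for link in html.get("Links", []):
        href = link.get("url")
        if href and page_uri:
            # Resolve relative links against the URI of the capturing page.
            links.append(urljoin(page_uri, href))
    return page_uri, links


def to_gephi_events(page_uri, outlinks):
    """Build Gephi-streaming-style add-node and add-edge events."""
    events = [{"an": {page_uri: {"label": page_uri}}}]
    for target in outlinks:
        events.append({"an": {target: {"label": target}}})
        edge_id = f"{page_uri}->{target}"  # assumed edge identifier scheme
        events.append({"ae": {edge_id: {"source": page_uri,
                                        "target": target,
                                        "directed": True}}})
    return events


def in_batches(items, size):
    """Yield lists of up to `size` items so the API is called less often."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    payload = json.dumps({"Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "http://example.org/a/index.html"},
        "Payload-Metadata": {"HTTP-Response-Metadata": {"HTML-Metadata": {
            "Links": [{"path": "A@/href", "url": "../b.html"}]}}}}})
    uri, links = extract_outlinks(payload)
    for batch in in_batches(to_gephi_events(uri, links), 2):
        print(json.dumps(batch))
```

Batching trades a little latency for far fewer requests, which matters when a single WAT file can yield many thousands of links.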
We plan to continue work on the following:
Rewriting links in Sort-friendly URI Reordering Transform (SURT) form
Integration with the link-serv API
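To show what rewriting a link in SURT form involves, the sketch below reverses the host labels so that URLs from the same domain sort next to each other. It is a deliberate simplification with a hypothetical function name: real SURT handling also covers schemes, ports, user info, and canonicalization rules, for which a dedicated library would be the better choice.

```python
from urllib.parse import urlsplit


def simple_surt(url):
    """Return a simplified SURT-like key: reversed host, then path and query."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    reversed_host = ",".join(reversed(host.split(".")))
    key = f"{reversed_host}){parts.path or '/'}"
    if parts.query:
        key += f"?{parts.query}"
    return key


if __name__ == "__main__":
    urls = ["http://b.example.org/", "http://a.other.com/", "http://a.example.org/"]
    # Sorting by the transformed key groups hosts of the same domain together.
    for key in sorted(simple_surt(u) for u in urls):
        print(key)
```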
link-viz is the project’s web-based frontend for accessing data provided by link-serv as a graph that can be navigated and explored.
We worked on the following:
Graph rendering toolkit
Web development framework and tools
UI design and artwork
Graph visualization libraries, as well as web application frameworks, were researched for the web-based link visualization frontend. Both D3.js and Vis.js emerged as the most suitable candidates for the visualization toolkit. Experimentally coding using both toolkits, we decided to go with Vis.js, which fits the needs of the application and is better documented.
We also took a fresh look at current web development frameworks and decided to house the Vis.js visualization logic within a Laravel framework application combining PHP and Vue.js for future expandability of the application’s features, e.g., user profile management, sharing of graphs, etc.
A virtual machine was allocated on BA’s server infrastructure to host link-viz for the project demo that we will be working on.
We built a barebone frontend consisting of the following:
Graph rendering page with the following UI elements:
URL, depth, and date selection inputs
Placeholders for add-ons
As we outlined in the project proposal, we plan to implement add-ons during a later phase of the project to extend functionality. Add-ons would come in two categories: vizors for modifying how the user sees the graph, e.g., GeoVizor for superimposing nodes on a map of the world, and finders to help the user explore the graph, e.g., PathFinder for finding all paths from one node to another.
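As an indication of what a finder such as PathFinder would need to compute, here is a minimal sketch that enumerates all paths between two nodes without revisiting a node, up to a hop limit. The function name, the hop limit, and the dictionary-based graph are assumptions for illustration and do not reflect a planned implementation.

```python
def find_paths(outlinks, source, target, max_hops):
    """Return all simple paths from `source` to `target` of at most `max_hops` links."""
    paths = []

    def walk(node, path):
        if node == target:
            paths.append(path)
            return
        if len(path) - 1 == max_hops:
            # The hop limit keeps the search bounded on a dense web graph.
            return
        for nxt in outlinks.get(node, []):
            if nxt not in path:  # never revisit a node within one path
                walk(nxt, path + [nxt])

    walk(source, [source])
    return paths


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
    print(find_paths(graph, "a", "d", 3))
```

On real web archive data the number of paths grows very quickly, so a hop limit of this kind, or some other bound, would likely be needed in practice.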
Some work has already been done in UI design, color theming, and artwork, and we plan to continue work on the following:
Integration with the link-serv API
Continue work on UI design and artwork
Research use cases for web archive visualization
In terms of outreach, the National Library of New Zealand has been getting in touch with researchers from a wide array of backgrounds, ranging from data scientists to historians, to gather feedback on potential use cases and the types of features researchers would like to see in a web archive visualization tool. Several issues have been brought up, including frustrations with existing tools’ lack of scalability, being tied to a physical workstation, time wasted on preprocessing datasets, and the inability to customize an existing tool to a researcher’s individual needs. Gathering first-hand input from researchers has led to many interesting insights. The next steps are to document and publish these potential research use cases on the wiki to guide future developments in the project.
We would like to extend our thanks and appreciation to all the researchers who generously gave their time to provide us with feedback, including Dr. Ian Milligan, Dr. Niels Brügger, Emily Maemura, Ryan Deschamps, Erin Gallagher, and Edward Summers.
Meet the people involved in the project at Bibliotheca Alexandrina:
And at the National Library of New Zealand:
We would also like to thank Alex Osborne at the National Library of Australia and Andy Jackson at the British Library for their advice on technical issues.
If you have any questions or feedback, please contact the LinkGate Team at linkgate[at]iipc.simplelists.com.