Asking questions with web archives – introductory notebooks for historians

“Asking questions with web archives – introductory notebooks for historians” is one of three projects awarded a grant in the first round of the Discretionary Funding Programme (DFP), whose aim is to support the collaborative activities of IIPC members by providing funding to accelerate the preservation and accessibility of the web. The project was led by Dr Andy Jackson of the British Library. The project co-lead and developer was Dr Tim Sherratt, the creator of the GLAM Workbench, which provides researchers with examples, tools, and documentation to help them explore and use the online collections of libraries, archives, and museums. The notebooks were developed with the participation of the British Library (UK Web Archive), the National Library of Australia (Australian Web Archive), and the National Library of New Zealand (the New Zealand Web Archive).


By Tim Sherratt, Associate Professor of Digital Heritage, University of Canberra & the creator of the GLAM Workbench

We tend to think of a web archive as a site we go to when links are broken – a useful fallback, rather than a source of new research data. But web archives don’t just store old web pages; they capture multiple versions of web resources over time. Using web archives, we can observe change – we can ask historical questions. But web archives store huge amounts of data, and access is often limited for legal reasons. Just knowing what data is available and how to get to it can be difficult.

Where do you start?

The GLAM Workbench’s new web archives section can help! Here you’ll find a collection of Jupyter notebooks that document web archive data sources and standards, and walk through methods of harvesting, analysing, and visualising that data. It’s a mix of examples, explorations, apps and tools. The notebooks use existing APIs to get data in manageable chunks, but many of the examples demonstrated can also be scaled up to build substantial datasets for research – you just have to be patient!

What can you do?

Have you ever wanted to find when a particular fragment of text first appeared in a web page? Or compare full-page screenshots of archived sites? Perhaps you want to explore how the text content of a page has changed over time, or create a side-by-side comparison of web archive captures. There are notebooks to help you with all of these. To dig deeper you might want to assemble a dataset of text extracted from archived web pages, construct your own database of archived PowerPoint files, or explore patterns within a whole domain.

A number of the notebooks use Timegates and Timemaps to explore change over time, and they could easily be adapted to work with any Memento-compliant system. For example, one notebook steps through the process of creating annual full-page screenshots and compiling them into a time series.

Using screenshots to visualise change in a page over time.
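To give a sense of the underlying mechanics, here is a minimal sketch (not one of the notebooks themselves) that requests a Timemap from the Internet Archive’s Memento endpoint and lists the available captures for a URL. The endpoint and the parsing of the standard “link” serialisation are assumptions; other Memento-compliant archives expose similar Timemap URLs.

```python
# Minimal sketch: fetch a Timemap and list capture datetimes for a URL.
# The endpoint below is the Internet Archive's; swap in another
# Memento-compliant archive's Timemap URL as needed.
import re
import requests

target = "http://example.org/"
timemap_url = f"http://web.archive.org/web/timemap/link/{target}"

response = requests.get(timemap_url)
response.raise_for_status()

# Each memento entry in the link-format response carries a datetime attribute.
capture_dates = re.findall(r'datetime="([^"]+)"', response.text)
print(f"{len(capture_dates)} captures; first: {capture_dates[0]}; last: {capture_dates[-1]}")
```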

Another walks forwards or backwards through a Timemap to find when a phrase first appears in (or disappears from) a page. You can also view the total number of occurrences over time as a chart.


Find when a piece of text appears in an archived web page.
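As a rough illustration of the same idea, the sketch below lists a page’s captures using the Internet Archive’s CDX API (a simpler alternative to walking the Timemap itself) and fetches each one until the phrase is found. The `id_` URL modifier, which asks the Wayback Machine for the unmodified archived content, and the parameter names are assumptions based on the public Wayback APIs.

```python
# Rough sketch: find the earliest capture of a page that contains a phrase.
import requests

def first_capture_containing(url, phrase):
    # List capture timestamps for the URL, oldest first (first row is the header).
    resp = requests.get(
        "http://web.archive.org/cdx/search/cdx",
        params={"url": url, "output": "json", "fl": "timestamp"},
    )
    rows = resp.json() if resp.text.strip() else []
    for (timestamp,) in rows[1:]:
        # 'id_' requests the raw archived content rather than the rewritten page.
        snapshot = requests.get(f"http://web.archive.org/web/{timestamp}id_/{url}")
        if phrase.lower() in snapshot.text.lower():
            return timestamp
    return None

print(first_capture_containing("http://example.org/", "coronavirus"))
```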

The notebooks document a number of possible workflows. One uses the Internet Archive’s CDX API to find all the PowerPoint files within a domain. It then downloads the files, converts them to PDFs, saves an image of the first slide, extracts the text, and compiles everything into an SQLite database. You end up with a searchable dataset that can be easily loaded into Datasette for exploration.

Find and explore PowerPoint presentations from a specific domain.
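The sketch below illustrates just the first step of such a workflow: querying the Internet Archive’s CDX API for PowerPoint captures within a domain. The parameter names follow the public Wayback CDX server API, but the target domain and MIME-type filter are illustrative assumptions, and the download, conversion, and indexing steps are omitted.

```python
# Sketch: list unique PowerPoint files captured within a (hypothetical) domain.
import requests

params = {
    "url": "example.gov.au",                           # hypothetical target domain
    "matchType": "domain",                             # include all subdomains
    "filter": "mimetype:application/vnd.ms-powerpoint",
    "fl": "timestamp,original,mimetype",
    "collapse": "urlkey",                              # one row per unique URL
    "output": "json",
}
resp = requests.get("http://web.archive.org/cdx/search/cdx", params=params)
rows = resp.json() if resp.text.strip() else []
captures = rows[1:]                                    # first row is the header
print(f"Found {len(captures)} unique PowerPoint files")
for timestamp, original, _ in captures[:5]:
    # Each file could then be downloaded for conversion, text extraction and indexing.
    print(f"http://web.archive.org/web/{timestamp}id_/{original}")
```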

While most of the notebooks work with small slices of web archive data, one harvests all the unique URLs from the gov.au domain and attempts to visualise its subdomains. The notebooks provide a range of approaches that can be extended or modified according to your research questions.

Visualising subdomains in the gov.au domain as captured by the Internet Archive.
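Harvesting at that scale relies on the CDX server’s pagination interface. The fragment below sketches the general approach under the same assumptions as above; for a domain the size of gov.au it would run for a very long time and produce a very large result set.

```python
# Sketch: harvest unique URL keys for a whole domain via CDX pagination.
import requests

CDX = "http://web.archive.org/cdx/search/cdx"
base = {"url": "gov.au", "matchType": "domain", "fl": "urlkey",
        "collapse": "urlkey", "output": "json"}

# Ask how many pages the paginated index holds for this query.
num_pages = int(requests.get(
    CDX, params={"url": "gov.au", "matchType": "domain", "showNumPages": "true"}
).text)

unique_urls = set()
for page in range(num_pages):
    resp = requests.get(CDX, params={**base, "page": page})
    if not resp.text.strip():
        continue
    unique_urls.update(row[0] for row in resp.json()[1:])  # skip the header row
print(f"{len(unique_urls)} unique URL keys harvested")
```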

Acknowledgements

Thanks to everyone who contributed to the discussion on the IIPC Slack, in particular Alex Osborne, Ben O’Brien and Andy Jackson who helped out with understanding how to use NLA/NZNL/UKWA collections respectively.


The French coronavirus (COVID-19) web archive collection: focus on collaborative networks

BnF’s Covid-19 web archive collection has drawn considerable media attention in France, including coverage in Le Monde, 20 minutes and the TV channel France 3. The following blog post was first published in Web Corpora, BnF’s blog dedicated to web archives.


By Alexandre Faye, Digital Collection Manager, Bibliothèque nationale de France (BnF)
English translation by Alexandre Faye and Karine Delvert

The current global coronavirus pandemic (Covid-19) poses an unprecedented challenge for web archiving activities. The impact on society is such that the ongoing collection requires several levels of coordination and cooperation at national and international levels.

Since its spread out of China and its subsequent development in Europe, the coronavirus outbreak has become a pervasive theme on the web. This health crisis is being experienced in real time by populations simultaneously confined and largely connected, with a sense of emergency as well as underlying questioning. Archived websites, blogs, and social media should make up a coherent, significant and representative collection. They will be primary sources for future research, and they are already the trace and memory of the event.

#jenesuispasunvirus

At the end of January 2020, while the megalopolis of Wuhan was quarantined, the first hashtags #JeNeSuisPasUnVirus and #CORONAVIRUSENFRANCE appeared on Twitter, denouncing the stigma experienced by the Asian community in France. The Movement against Racism and for Friendship between Peoples (Mouvement contre le racisme et pour l’amitié entre les peuples, MRAP) quickly published a page on its website entitled “a virus has no ethnic origin”. This was the first webpage related to the coronavirus to be selected, crawled and preserved under French legal deposit.

Group dynamics

The coronavirus collection is not conceived as a project, in the sense that it is not programmed, does not have a precise calendar and is not limited to predetermined topics. It grows as part of two existing collections: National and local news media, and News Current Topics (ephemeral news). The National and local news media collection brings together around a hundred national and local press websites, including editorial content such as headlines and related articles, as well as Twitter accounts which are collected once a day. The News Current Topics collection, which requires both a technical and an organizational approach, relies on the coordination of an internal network of digital curators working in their relevant fields. It facilitates dynamic and reactive identification of web content related to contemporary issues and important events. By documenting the evolution, spread and overall impact of the pandemic in France, the archiving policy embraces all facets of the public health crisis: medical, social, economic, political and, more broadly, scientific, cultural and moral aspects.

“A virus has no ethnic origin”. Movement Against Racism and for Friendship Between Peoples (MRAP) website. Archive of February 21, 2020.

Seventy selected seed URLs were crawled in January and February, while the spread of the virus outside China seemed to be limited and under control. Since March 17, the date of the French lockdown, 500 to 600 seed URLs per week have been selected and assigned a crawl frequency: several times a day for social networks, daily for national and local press sites, weekly for news sections dedicated to the coronavirus, and monthly for articles and dedicated websites created ex nihilo. Thus the coronavirus section of the economic review L’Usine nouvelle is crawled weekly because it publishes a steady stream of articles, while the less dynamic recommendation pages of the National Research and Security Institute (INRES) are assigned a monthly frequency.

By mid-April 2020, more than 2,000 selections and settings had been created. This reactivity is all the more necessary because some web pages selected in the first phase have already disappeared from the live web.

The regional dimension

The geographical approach is also at the core of the archiving dynamics. The web does not entirely do away with territorial dimensions, as research on this topic has shown. One may even argue that these dimensions have been reinforced as France was hit by the health crisis, which coincided with the campaign for the municipal elections.

The curators of partner institutions across France have spontaneously enriched the selections on the coronavirus health crisis by taking local and regional content into account. This network is a key element of the national cooperation framework. Initiated in 2004 by the BnF, it relies on 26 regional libraries and archives services, which share the mission of print and web legal deposit by participating in collaborative nominations. Their contribution has proved significant, since over 50% of the websites nominated by 15 April refer to local or regional content.

Simplified access to teleconsultation. ARS (Regional Health Agency) of French Guiana. Archive of April 5, 2020.

As a corollary, the crawl devoted to the local elections was not suspended after the first round (which took place on March 15), even though the second round (due to take place the following weekend) was postponed and the whole electoral process suspended because of the crisis. In particular, the Twitter and Facebook accounts of the mayors elected in the first round and of the candidates still in contention for the second round have continued to be collected. These archives, as statements made by mayors and candidates on the web during the weeks preceding and following the first round, already appear to be a major source for both the electoral history and the coronavirus pandemic in France.

Historic abstention rate in the local elections in the Oise “cluster”. francetvinfo.fr. Capture of March 16, 2020.

International cooperation

At the international level, the BnF, along with the other participating French libraries, contributes to the archiving project “Novel Coronavirus (2019-nCoV) outbreak”. This initiative, launched in February 2020, is supported by the IIPC Content Development Group (CDG) in association with the Internet Archive. It brings together about thirty libraries and institutions around the world collaborating on this web archive collection. By the end of May, more than 6,800 preserved websites representing 45 languages had been put online on Archive-it.org and indexed in full text.


The BnF has for many years been pursuing a policy of cooperation with the IIPC to promote the preservation and use of web archives on an international scale. One of the research challenges is to facilitate comparisons of the different national webs, in particular for global and transnational phenomena such as #MeToo and the current health crisis. A first contribution was sent to the IIPC at the end of February. It consisted of a selection of 80 seeds made during the first phase of the pandemic, just before Europe overtook China as the main active centre of the outbreak. Some of these pages have already disappeared from the live web.

In line with the IIPC’s new recommendations, and considering the evolution of the pandemic in France, the next contribution to the IIPC should be a tight selection (almost 5% of the French collection) linked to high-priority subtopics, including: information about the spread of infection; regional or local containment efforts; medical and scientific aspects; social aspects; economic aspects; and political aspects. A third of these websites report on the medical domain. Another third provides information about French territories remote from Europe: French Guiana and the West Indies, Réunion and Mayotte. The last third concerns citizens’ initiatives and debates during the lockdown.

For example, INED’s dedicated website gives information on local excess mortality; articles from Madinin’art, Montray Kreyol and Free Pawol were selected by a local curator; and banlieues-sante.org is the website of an NGO which fights medical inequality and has created a YouTube channel explaining protective measures in 24 languages, including sign language.

Dr François Ehlinger on EHPAD. Nicole Bertin’s Blog. Website capture from the Charente-Maritime region. Capture on April 3, 2020

What’s next?

Some of the websites nominated by the BnF and its partners tend to constitute a collective memory of the event. Up to mid-April, social networks represented 40% of the nominations, with a slight predominance of Twitter over Facebook. Although a large share is devoted to official accounts of institutions or associations (@AssembleeNat, @restosducoeur, @banlieuesante) or to accounts created ex nihilo (@CovidRennes, @CoronaVictimes, @InitiativeCovid), hashtags prevail in the selections.

The aim is to archive a representative part of individual and collective expression by capturing tweets around the most significant hashtags: multiple variations on the terms “coronavirus” and “confinement” (#coronavacances, #ConfinementJour29), criticism of the way the crisis has been managed (#OuSontLesMasques, #OnOublieraPas), the dissemination of instructions, and expressions of sympathy. Together they show a unique and characteristic mobilisation of citizens that follows the pace of the news (#chloroquine, #Luxfer).

Daniel Bourrion, “The virus journals” on face-ecran.fr. Archived April 3, 2020.

Because they account for the outcomes of the health crisis and of the lockdown in various domains, the archives relating to the coronavirus end up overlapping with themes to which the BnF and its partners pay particular attention, or for which focused crawls have already been, or will be, conducted: for instance, digital literature and confinement diaries, the relationship between the body and public health policies, epidemiology and artificial intelligence, family life in confinement, and feminism.

“Next” is not just a matter of finding a single way to promote this special archive collection, which remains a work in progress. It is neither a delimited project nor an already closed one. It is documentation for many kinds of research projects, and also heritage for all of us.

Guide for confined parents. The French Secretariat for Equality (Le Secrétariat d’Etat chargé de l’égalité entre les femmes et les hommes et de la lutte contre les discriminations). Capture of April 10, 2020.

COVID-19: Collecting so that we don’t forget

by Martine Renaud, Librarian, Bibliothèque et Archives nationales du Québec [1]

The COVID-19 pandemic has dominated the news for months because of its sheer scale and its impact on our economy and social life as well as our health. How will it be remembered in a few years? The Spanish flu epidemic of 1918-1919 is sometimes described as the forgotten pandemic[2]. This time, how can we make sure nothing is forgotten? Preserving the memory of this turbulent and exceptional time is crucially important for tomorrow’s researchers.

Capturing the Web

The Web and social media are playing a key role in the pandemic. They enable the instant spread of information (as well as fake news) and provide a space for exchange and communication in a context of social distancing. BAnQ has been collecting Québec websites on a selective basis since 2009. The result of this harvesting is largely available on the BAnQ portal. Sites for which BAnQ has not obtained permission are preserved but not made publicly available; they can be accessed for research purposes.

Collaborative Collection

In February 2020, the International Internet Preservation Consortium (IIPC) called on its members, including BAnQ, to create a collaborative collection of websites dealing with the emerging pandemic.

BAnQ’s contribution to this collection formed the basis of the Québec collection, which we decided to create once the scale of the crisis became apparent. BAnQ had already created several collections on special events, such as the 375th anniversary of the city of Montréal; the collection on the pandemic is part of this corpus of exceptional events.

The Québec collection includes Québec government websites, and sections of websites, dealing with the pandemic. It also includes the websites of public health authorities (Directions de la santé publique), Québec’s National Public Health Institute (INSPQ), as well as the CISSS and CIUSSS (Integrated Health and Social Services Centres). Web pages about the pandemic from a number of cities and towns are included, as well as those of universities, CEGEPs (senior high schools), and school boards. Websites of companies that are particularly affected by the pandemic, such as financial institutions and supermarket chains, are also included.

Articles dealing with COVID-19 from Québec-wide and regional papers are collected, as well as parts of the websites of professional orders and associations. Of course, sites that have emerged or been in the news since mid-March, such as Jebenevole.ca, are also harvested. At the time of writing, over 15,000 URLs have been collected, and new ones are added every week.

Capturing social media

As for social media, BAnQ collects the Twitter feeds and Facebook pages of personalities and public bodies involved in front-line management of the crisis, such as Premier François Legault, Québec’s health ministry (Santé Québec), and the City of Montréal’s police department (Service de police de la Ville de Montréal). All over the world, memory institutions are working to preserve traces of the pandemic. Thanks to these efforts, it is our hope that nothing will be forgotten.

References:

[1] This article will appear in the June 2020 issue of À rayons ouverts – Chroniques de Bibliothèque et Archives nationales du Québec, No. 106 (Spring/Summer 2020), p. 26.

[2] Alfred W. Crosby, America’s Forgotten Pandemic – The Influenza of 1918, 2nd edition, Cambridge: Cambridge University Press, 2003, https://books.google.ca/books?id=KYtAkAIHw24C&redir_esc=y&hl=en (accessed 4 May 2020).

Let’s time travel with the IIPC!

IIPC has been organising its annual meetings for over 15 years. The first full Steering Committee meeting and the meetings of working groups were held in Canberra in 2004. The most recent General Assembly (GA) and Web Archiving Conference (WAC) were held in Zagreb in June 2019. What started as a small get-together of web archiving enthusiasts from a dozen national libraries and the Internet Archive has gradually become an important fixture in the web archiving calendar. We have been very fortunate that our members have volunteered to host the events in Singapore, The Hague, Washington D.C., Ljubljana, Stanford, Reykjavík, London, Wellington, Zagreb and Ottawa. The GA also returned to Canberra in 2008.


Due to Covid-19, this year we will not meet in person but we can time travel! While preparing for the next annual event hosted by the National Library of Luxembourg (15-18 June 2021), we will be trawling through the history of the GA and the WAC. We will be collecting, publishing and archiving memories from past events in a variety of formats, ranging from tweets and blog posts to a GA and WAC digital repository and bibliography. All new and older posts will be available in the “GAWAC” archive.


We are starting from 2019, which was the first GA for Friedel Geeraert of KBR, The Royal Library of Belgium. This was also the first GA for the British Library web archivists Helena Byrne and Carlos Rarugal, the organisers of a workshop called “Reflecting on how we train new starters in web archiving”.

Abstracts from the 2019 presentations and slides are available on the conference website. You can also watch the keynote speeches and panel discussions on our YouTube channel and browse through the photos on the IIPC Flickr. The 2019 GA and WAC were hosted by the National and University Library in Zagreb. The Croatian Web Archive (HAW), which last year celebrated its 15th anniversary, launched its new interface earlier this year. You can browse the archive and the thematic collections at https://haw.nsk.hr/en.

Photo: Tibor God.

Discovering the web archiving community at the IIPC events in Zagreb

By Friedel Geeraert, Scientific Assistant Web Archiving, KBR – Royal Library of Belgium

Last year I had the privilege of participating in the IIPC General Assembly and Web Archiving Conference in Zagreb for the first time as the representative of KBR (the Belgian Royal Library), which was at that time the youngest IIPC member. At the time, KBR was involved in a research project called PROMISE that studied the question of web archiving at the federal level in Belgium.

The General Assembly provided good insight into the workings of IIPC as an organisation. It was very interesting to participate in the reflection on the future form of IIPC during the General Assembly. According to member institutions, the top three priorities for the coming years should be: 1) community-led tools, 2) providing platforms for sharing knowledge and 3) networking and support for innovation in research on the archived web. Furthermore, the reports of the Treasurer and of the Programme and Communications Officer indicated the different possibilities of engaging with the organisation and other IIPC members: TSS (Technical Speaker Series) and RSS (Research Speaker Series) webinars, Online Hours, the different working groups (Content Development, Training, Preservation, Research) and the Discretionary Funding Programme. I took part in the workshops of the Preservation, Training and Research Working Groups, which allowed me to discover different initiatives launched within web archiving institutions all over the world.

The Web Archiving Conference brought a plethora of developments within web archiving to light. A lot of focus was on outreach and on how to promote web archives (via library labs, for example). Another theme was researcher interaction with web archives and opening up access to complementary files such as crawl and access logs, derivative files or documentation about curatorial decisions and Heritrix settings. The use of machine learning on archived web material was another recurring theme. From a curatorial perspective, trending collection themes include minorities, emerging formats such as interactive fiction, and retrospective web archiving. It was also stressed that divergent opinions should feature in a web archive in order to avoid curatorial bias. Furthermore, even though I don’t have a technical background, it was fascinating to discover new developments such as the size reduction of indexes, Browsertrix and automated quality assurance.

On top of all that rich information, the networking possibilities were fantastic. Within the PROMISE project, we did an extensive literature review concerning web archiving initiatives in Europe and Canada. It was a wonderful opportunity to meet some of the web archivists and researchers I admire in person. It is safe to say that I came back inspired and with a head full of ideas for the Belgian web archive. I’m already looking forward to the next edition.


Reflecting on how we train new starters in web archiving

This blog post is a summary of a workshop that took place at the 2019 IIPC Web Archiving Conference in Zagreb, Croatia. The abstract and the final slides used during the workshop are available on the IIPC website.


By Helena Byrne, Web Curator and Carlos Rarugal, Assistant Web Archivist at the British Library


Most people, when learning, can relate to the Benjamin Franklin quote:

tell me and I forget, teach me and I may remember, involve me and I learn.*

It can be very challenging to find the most effective way to involve a trainee in web archiving and transfer your specialist knowledge. Web archiving is a relatively new profession that is constantly changing, and it is only in recent years that a body of work from practitioners and researchers has started to grow. In addition, each web archiving institution has its own collection policies and many use their own web archiving technology, meaning that there is no one-size-fits-all solution to training people who work in this field.

However, before taking on new strategies it is important to understand our own beliefs about training and what we currently do when training new staff. Reflecting on these points can help us become more aware of any biases we may have in terms of preferred training delivery style, which may not match what the trainee really needs.

What we did

Before we started the workshop, participants answered a series of questions about their own experience of delivering or receiving training on web archives via a Menti poll. We then reviewed the training practices of the curatorial web archive team at the British Library and, in groups, discussed which methods participants felt worked well or not.

“Reflecting on how we train new starters in web archiving” at the Web Archiving Conference in Zagreb, 6 June 2019.
Photo: Tibor God.

Menti Poll Results

Menti Poll Results: Average Score for each question.

Overall, there were about 26 participants in the workshop who had varying degrees of experience training people on how to work with their web archive. As shown in Slide 3, only 31% of participants train people in web archiving on a regular basis while 50% of participants train people occasionally and the remaining 19% don’t train other people in web archiving. Some of the people in this final category work as solo web archivists and don’t have any resources for additional staff.

When asked if there was a structured training programme on web archiving at their organisation, 65% of participants responded “no” while only 35% of respondents had a programme in place. Not surprisingly, when asked ‘how were you trained in web archiving?’, hands-on training was the most popular method used to train participants at the workshop.

Results of this poll can be viewed here.

Training practices at the British Library

During this workshop we reviewed common training methods and reflected on the current practices of the curatorial team of the UK Web Archive based at the British Library as well as how we would like to change these practices in the future. (Slides 7-8)

Group Discussion

Participants in small groups discussed a series of questions about how they train people in their institutions:

Questions

1. Who do you train about web archiving?
2. How do you currently train them?
3. What web archiving training resources do you have available to your team?
4. What methods do you use for training? Computer based, documentation (handouts, user guides etc.), one to one learning, shadowing etc.

After discussing these questions participants then placed their current training methods onto a scale of what they felt works and doesn’t work.

Brainstorming

Overall, participants in 6 different groups filled in 56 points on post-it notes. These can be loosely grouped into 10 categories:

reading lists, videos, hands-on training, documentation, networking, case studies, examples/modelling, verbal training, forums and tutorials. A more detailed breakdown of these categories can be viewed here.

Most of the points noted (30/56) were in the ‘what works’ section, 10/56 were neutral, and only 8/56 were in the ‘what doesn’t work’ section. However, there was some overlap between the ‘what works’ and ‘what doesn’t work’ sections, with some methods, such as videos and reading lists, appearing in both sections but in different groups.

Review

In the last workshop activity, participants used two coloured stickers to vote on which training methods they considered most aspirational and most achievable.

As you can see from the votes below, the most popular activity that could be achieved in the short term by the workshop participants was hands-on individual training, with 9 votes. Participants were split on writing manuals: 7 felt it was achievable, while 6 felt it was aspirational.

How people voted

Conclusion

Overall, participants were keen to see a training-related event on the IIPC Web Archiving Conference programme. As the importance of web archiving grows, so too does the need for training in this field, and it has become increasingly evident that these responsibilities fall on web archivists.

All the data collected during this workshop was shared with the IIPC Training Working Group and it is hoped that it will help inform the development of materials to support training within the field.

More information about the IIPC Training Working Group can be found here: http://netpreserve.org/about-us/working-groups/training-working-group/

References:

* Goodreads.com, ‘Benjamin Franklin > Quotes > Quotable Quote’, https://www.goodreads.com/quotes/21262-tell-me-and-i-forget-teach-me-and-i-may (accessed December 20, 2018).

LinkGate: Let’s build a scalable visualization tool for web archive research

By Youssef Eldakar of Bibliotheca Alexandrina and Lana Alsabbagh of the National Library of New Zealand

Bibliotheca Alexandrina (BA) and the National Library of New Zealand (NLNZ) are working together to bring to the web archiving community a tool for scalable web archive visualization: LinkGate. The project was awarded funding by the IIPC for the year 2020. This blog post gives a detailed overview of the work that has been done so far and outlines what lies ahead.


In all domains of science, visualization is essential for deriving meaning from data. In web archiving, the data is linked data that may be visualized as a graph, with web resources as nodes and outlinks as edges.

This phase of the project aims to deliver the core functionality of a scalable web archive visualization environment consisting of a Link Service (link-serv), Link Indexer (link-indexer), and Link Visualizer (link-viz) components as well as to document potential research use cases within the domain of web archiving for future development.

The following illustrates the data flow for LinkGate in the web archiving ecosystem: a web crawler archives captured web resources into WARC/ARC files, which are then checked into storage; metadata is extracted from the WARC/ARC files into WAT files; link-indexer extracts outlink data from the WAT files and inserts it into link-serv; and link-serv serves graph data to link-viz for rendering as the user navigates the graph representation of the web archive:

LinkGate: data flow

In what follows, we look at development by Bibliotheca Alexandrina to get each of the project’s three main components, Link Service, Link Indexer and Link Visualizer, off the ground. We also discuss the outreach part of the project, coordinated by the National Library of New Zealand, which involves gathering researcher input and putting together an inventory of use cases.

Please watch the project’s code repositories on GitHub for commits following a code review later this month:

Please see also the Research Use Cases for Web Archive Visualization wiki.

Link Service

link-serv is the Link Service that provides an API for inserting web archive interlinking data into a data store and for retrieving back that data for rendering and navigation.
We worked on the following:

  • Data store scalability
  • Data schema
  • API definition and Gephi compatibility
  • Initial implementation

Data store scalability

link-serv depends on an underlying graph database as the repository for web resources as nodes and outlinks as relationships. Building upon BA’s previous experience with graph databases in the Encyclopedia of Life project, we worked on adapting the Neo4j graph database for versioned web archive data. Scalability being a key interest, we ran a benchmark of Neo4j on Intel Xeon E5-2630 v3 hardware using a generated test dataset and examined bottlenecks to tune performance. In the benchmark, over a series of progressions, a total of 15 billion nodes and 34 billion relationships were loaded into Neo4j, and matching and updating performance was tested. While the time to insert nodes for the larger progressions ran to hours or even days, match and update times in all progressions remained in seconds once a database index was added: 0.01 to 25 seconds for nodes, with 85% of cases below 7 seconds, and 0.5 to 34 seconds for relationships, with 67% of cases below 9 seconds. Considering these results promising, we hope that tuning work during the coming months will lead to further performance gains. Further testing is underway using a second set of generated relationships that more realistically simulates web links.

We ruled out Virtuoso, 4store, and OrientDB as graph data store options for being less suitable for the purposes of this project. A more recent alternative, ArangoDB, is currently being looked into and is also showing promising initial results, and we are leaving open the possibility of additionally supporting it as an option for the graph data store in link-serv.

Data schema

To represent web archive data in the graph data store, we designed a schema with the goals of supporting time-versioned, interlinked web resources and being friendly to search using the Cypher Query Language. The schema defines Node and VersionNode as node types, and HAS_VERSION and LINKED_TO as relationship types linking a Node to a descendant VersionNode and a VersionNode to a hyperlinked Node, respectively. A Node carries the URI of the resource in Sort-friendly URI Reordering Transform (SURT) form as an attribute, and a VersionNode carries the ISO 8601 timestamp of the capture as an attribute. The following illustrates the schema:

LinkGate: data schema
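As a concrete illustration of this schema (not the project’s actual code), the sketch below populates Neo4j with one capture of a resource and its outlinks using the official Python driver. The label and relationship names follow the schema described above; the property names `surt` and `timestamp`, the connection details, and the helper function are assumptions made for the example.

```python
# Sketch: insert one capture (VersionNode) of a resource (Node) and its outlinks.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

INSERT_CAPTURE = """
MERGE (n:Node {surt: $source_surt})
MERGE (n)-[:HAS_VERSION]->(v:VersionNode {timestamp: $timestamp})
WITH v
UNWIND $outlinks AS outlink
MERGE (target:Node {surt: outlink})
MERGE (v)-[:LINKED_TO]->(target)
"""

def insert_capture(tx, source_surt, timestamp, outlinks):
    tx.run(INSERT_CAPTURE, source_surt=source_surt, timestamp=timestamp, outlinks=outlinks)

with driver.session() as session:
    session.execute_write(
        insert_capture,
        "org,example)/",                        # SURT-form URI of the captured page
        "2020-04-05T12:00:00Z",                 # ISO 8601 timestamp of the capture
        ["org,iipc)/", "org,example)/about"],   # outlinks, also in SURT form
    )
driver.close()
```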

API definition and Gephi compatibility

link-serv is to receive data extracted by link-indexer from a web archive and respond to queries by link-viz as the graph representation of web resources is navigated. At this point, two API operations were defined for this interfacing: updateGraph and getGraph. updateGraph is to be invoked by link-indexer and takes as input a JSON representation of outlinks to be loaded into the data store. getGraph, on the other hand, is to be invoked by link-viz and returns a JSON representation of possibly nested outlinks for rendering. Additional API operations may be defined in the future as development progresses.

One of the project’s premises is maintaining compatibility with the popular graph visualization tool, Gephi. This would enable users to render web archive data served by link-serv using Gephi as an alternative to the project’s frontend component, link-viz. To achieve this, the updateGraph and getGraph API operations were based on their counterparts in the Gephi graph streaming API, with the following adaptations:

  • Redefining the workspace to refer to a timestamp and URL
  • Adding timestamp and url parameters to both updateGraph and getGraph
  • Adding depth parameter to getGraph
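To make the interface concrete, here is a hypothetical client-side sketch of the two operations. The endpoint paths, base URL, parameter names (url, timestamp, depth) and the Gephi-style JSON events are assumptions for illustration only; the actual link-serv API may differ once the implementation stabilises.

```python
# Hypothetical sketch of calling the updateGraph and getGraph operations.
import json
import requests

LINK_SERV = "http://localhost:8080"  # assumed base URL of a link-serv instance

# Push a small batch of outlink data as Gephi graph-streaming style JSON events.
events = [
    {"an": {"org,example)/": {"label": "http://example.org/"}}},          # add node
    {"an": {"org,iipc)/": {"label": "http://iipc.org/"}}},                # add node
    {"ae": {"e1": {"source": "org,example)/", "target": "org,iipc)/"}}},  # add edge
]
requests.post(
    f"{LINK_SERV}/updateGraph",
    params={"url": "http://example.org/", "timestamp": "20200405120000"},
    data="\r\n".join(json.dumps(e) for e in events),
)

# Retrieve the neighbourhood of a node, two hops deep, for rendering.
response = requests.get(
    f"{LINK_SERV}/getGraph",
    params={"url": "http://example.org/", "timestamp": "20200405120000", "depth": 2},
)
graph_events = [json.loads(line) for line in response.text.splitlines() if line.strip()]
print(f"Received {len(graph_events)} graph events")
```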

An instance of Gephi with the graph streaming plugin installed was used to examine API behavior. We also examined API behavior using the Neo4j APOC library, which provides a procedure for data export to Gephi.

Initial implementation

An initial minimal API service for link-serv was implemented. The implementation is in Java and uses the Spring Boot framework and Neo4j bindings.
We have the following issues up next:

  • Continue to develop the service API implementation
  • Tune insertion and matching performance
  • Test integration with link-indexer and link-viz
  • ArangoDB benchmark

Link Indexer

link-indexer is the tool that runs on web archive storage, where WARC/ARC files are kept, and collects outlink data to feed to link-serv for loading into the graph data store. In a subsequent phase of the project, the collected data may include details beyond outlinks to enrich the visualization.
We worked on the following:

  • Invocation model and choice of programming tools
  • Web Archive Transformation (WAT) as input format
  • Initial implementation

Invocation model and choice of programming tools

link-indexer collects data from the web archive’s underlying file storage, which means it will often be invoked on multiple nodes in a computer cluster. To handle future research use cases, the tool will also eventually need to do a fair amount of data processing, such as language detection, named entity recognition, or geolocation. For these reasons, we found Python a fitting choice for link-indexer. Additionally, several modules are readily available for Python that implement functionality related to web archiving, such as WARC file reading and writing and URI transformation.
In a distributed environment such as a computer cluster, invocation would be on an ad hoc basis using a tool such as Ansible, dsh, or pdsh (among many others), or configured using a configuration management tool (such as Ansible) for periodic execution on each host in the distributed environment. Given this intended usage and the magnitude of the input data, we identified the following requirements for the tool:

  • Non-interactive (unattended) command-line execution
  • Flexible configuration using a configuration file as well as command-line options
  • Reduced system resource footprint and optimized performance

Web Archive Transformation (WAT) as input format

Building upon already existing tools, Web Archive Transformation (WAT) is used as input format rather than directly reading full WARC/ARC files. WAT files hold metadata extracted from the web archive. Using WAT as input reduces code complexity, promotes modularity, and makes it possible to run link-indexer on auxiliary storage having only WAT files, which are significantly smaller in size compared to their original WARC/ARC sources.
warcio is used in the Python code to read WAT files, which conform in structure to the WARC format. We initially used archive-metadata-extractor to generate WAT files. However, testing our implementation with sample files showed that the tool generates files that do not exactly conform to the WARC structure and cause warcio to fail on reading. The more recent webarchive-commons library was subsequently used instead to generate WAT files.
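For illustration, here is a minimal sketch (not the project’s code) of extracting outlinks from a WAT file with warcio. The JSON path to the link list follows the WAT layout commonly produced by webarchive-commons, but it should be treated as an assumption and checked against your own WAT files.

```python
# Sketch: iterate over WAT metadata records and yield (page URI, outlink URLs).
import json
from warcio.archiveiterator import ArchiveIterator

def iter_outlinks(wat_path):
    with open(wat_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "metadata":
                continue
            envelope = json.loads(record.content_stream().read()).get("Envelope", {})
            source = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
            links = (
                envelope.get("Payload-Metadata", {})
                .get("HTTP-Response-Metadata", {})
                .get("HTML-Metadata", {})
                .get("Links", [])
            )
            if source:
                yield source, [link.get("url") for link in links if link.get("url")]

# Example usage:
# for page, outlinks in iter_outlinks("example.warc.wat.gz"):
#     print(page, len(outlinks))
```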

Initial implementation

The current initial minimal implementation of link-indexer includes the following:

  • Basic command-line invocation with multiple input WAT files as arguments
  • Traversal of metadata records in WAT files using warcio
  • Collecting outlink data and converting relative links to absolute
  • Composing JSON graph data compatible with the Gephi streaming API
  • Grouping a defined count of records into batches to reduce hits on the API service

We plan to continue work on the following:

  • Rewriting links in Sort-friendly URI Reordering Transform (SURT) form (see the sketch after this list)
  • Integration with the link-serv API
  • Command-line options
  • Configuration file
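As a small illustration of that SURT rewriting step, the snippet below assumes the `surt` package from PyPI, which is widely used in the web archiving community; it is not the project’s own implementation, and the exact output depends on the canonicalisation options chosen.

```python
# Sketch: convert a URL to its SURT form with the `surt` package's default canonicaliser.
from surt import surt

print(surt("http://www.example.org/about?b=2&a=1"))
# With the default canonicaliser this prints something like: org,example)/about?a=1&b=2
```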

Link Visualizer

link-viz is the project’s web-based frontend for accessing data provided by link-serv as a graph that can be navigated and explored.
We worked on the following:

  • Graph rendering toolkit
  • Web development framework and tools
  • UI design and artwork

Graph visualization libraries, as well as web application frameworks, were researched for the web-based link visualization frontend. Both D3.js and Vis.js emerged as the most suitable candidates for the visualization toolkit. After experimenting with both toolkits, we decided to go with Vis.js, which fits the needs of the application and is better documented.
We also took a fresh look at current web development frameworks and decided to house the Vis.js visualization logic within a Laravel framework application combining PHP and Vue.js for future expandability of the application’s features, e.g., user profile management, sharing of graphs, etc.
A virtual machine was allocated on BA’s server infrastructure to host link-viz for the project demo that we will be working on.
We built a bare-bones frontend consisting of the following:

  • Landing page
  • Graph rendering page with the following UI elements:
    • Graph area
    • URL, depth, and date selection inputs
    • Placeholders for add-ons

As we outlined in the project proposal, we plan to implement add-ons during a later phase of the project to extend functionality. Add-ons would come in two categories: vizors, for modifying how the user sees the graph, e.g., GeoVizor for superimposing nodes on a map of the world, and finders, to help the user explore the graph, e.g., PathFinder for finding all paths from one node to another.
Some work has already been done in UI design, color theming, and artwork, and we plan to continue work on the following:

  • Integration with the link-serv API
  • Continue work on UI design and artwork
  • UI actions
  • Performance considerations

Research use cases for web archive visualization

In terms of outreach, the National Library of New Zealand has been getting in touch with researchers from a wide array of backgrounds, ranging from data scientists to historians, to gather feedback on potential use cases and the types of features researchers would like to see in a web archive visualization tool. Several issues have been brought up, including frustrations with existing tools’ lack of scalability, being tied to a physical workstation, time wasted on preprocessing datasets, and the inability to customize an existing tool to a researcher’s individual needs. Gathering first-hand input from researchers has led to many interesting insights. The next steps are to document and publish these potential research use cases on the wiki to guide future developments in the project.

We would like to extend our thanks and appreciation to all the researchers who generously gave their time to provide us with feedback, including Dr. Ian Milligan, Dr. Niels Brügger, Emily Maemura, Ryan Deschamps, Erin Gallagher, and Edward Summers.

Acknowledgements

Meet the people involved in the project at Bibliotheca Alexandrina:

  • Amr Morad
  • Amr Rizq
  • Mohamed Elsayed
  • Mohammed Elfarargy
  • Youssef Eldakar

And at the National Library of New Zealand:

  • Andrea Goethals
  • Ben O’Brien
  • Lana Alsabbagh

We would also like to thank Alex Osborne at the National Library of Australia and Andy Jackson at the British Library for their advice on technical issues.

If you have any questions or feedback, please contact the LinkGate Team at linkgate[at]iipc.simplelists.com.