Launching LinkGate

By Youssef Eldakar of Bibliotheca Alexandrina

We are pleased to invite the web archiving community to visit LinkGate at linkgate.bibalex.org.

LinkGate is scalable web archive graph visualization. The project was launched with funding by the IIPC in January 2020. During this round of funding, Bibliotheca Alexandrina (BA) and the National Library of New Zealand (NLNZ) partnered to develop core functionality for a scalable graph visualization solution geared towards web archiving and to compile an inventory of research use cases to guide future development of LinkGate.

What does LinkGate do?

LinkGate seeks to address the need to visualize data stored in a web archive. Fundamentally, the web is a graph, where nodes are webpages and other web resources, and edges are the hyperlinks that connect web resources together. A web archive introduces the time dimension to this pool of data and makes the graph a temporal graph, where each node has multiple versions according to the time of capture. Because the web is big, web archive graph data is big data, and scalability of a visualization solution is a key concern.
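To make that data model concrete, here is a minimal sketch in Python (purely illustrative; the class and field names are hypothetical and not LinkGate’s actual schema) of a temporal graph in which each URL node carries one set of outlinks per capture timestamp:

```python
from collections import defaultdict

class TemporalWebGraph:
    """Toy temporal graph: one node per URL, one version per capture time."""

    def __init__(self):
        # url -> {capture timestamp (YYYYMMDDHHMMSS) -> set of outlinked urls}
        self.versions = defaultdict(dict)

    def add_capture(self, url, timestamp, outlinks):
        """Record a capture of `url` at `timestamp` with its outgoing links."""
        self.versions[url][timestamp] = set(outlinks)

    def outlinks_at(self, url, timestamp):
        """Return the outlinks of the most recent capture at or before `timestamp`."""
        captures = self.versions.get(url, {})
        eligible = [t for t in captures if t <= timestamp]
        return captures[max(eligible)] if eligible else set()

graph = TemporalWebGraph()
graph.add_capture("http://example.org/", "20200115093000",
                  ["http://example.org/about", "http://iipc.example/"])
print(graph.outlinks_at("http://example.org/", "20200301000000"))
```

A real deployment keeps this in a graph data store and serves it over an API, but the version-per-capture idea is the same.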

APIs and use cases

We developed a scalable graph data service that exposes temporal graph data via an API, a data collection tool for feeding interlinking data extracted from web archive data files into the data service, and a web-based frontend for visualizing web archive graph data streamed by the data service. Because this project was first conceived to fulfill a research need, we reached out to the web archive community and interviewed researchers to identify use cases to guide development beyond core functionality. Source code for the three software components (link-serv, link-indexer, and link-viz, respectively), as well as the use cases, is openly available on GitHub.

Using LinkGate

An instance of LinkGate is deployed on Bibliotheca Alexandrina’s infrastructure and accessible at linkgate.bibalex.org. Insertion of data into the backend data service is ongoing. The following are a few screenshots of the frontend:

  • Graph with nodes colorized by domain
  • Nodes being zoomed in
  • Settings dialog for customizing graph
  • Showing properties for a selected node
  • PathFinder for finding routes between any two nodes

Please see the project’s IIPC Discretionary Funding Program (DFP) 2020 final report for additional details.

We will be presenting the project at the upcoming IIPC Web Archiving Conference on Tuesday, 15 June 2021 and will also share the results of our work at a Research Speaker Series webinar on 28 July. If you have any questions or feedback, please contact the LinkGate Team at linkgate[at]iipc.simplelists.com.

Next steps

This development phase of Project LinkGate focused on the core functionality of a scalable, modular graph visualization environment for web archive data. Our team shares a common passion for this work, and we remain committed to continuing to build out the components, including:

  • Improved scalability
  • Design and development of the plugin API to support the implementation of add-on finders and vizors (graph exploration tools)
  • Enriched metadata
  • Integration of alternative data stores (e.g., the Solr index in SolrWayback, so that data may be served by link-serv to visualize in link-viz or Gephi)
  • Improved implementation of the software in general

BA intends to maintain and expand the deployment at linkgate.bibalex.org on a long-term basis.

Acknowledgements

The LinkGate team is grateful to the IIPC for providing the funding to get the project started and develop the core functionality. The team is passionate about this work and is eager to carry on with development.

LinkGate Team

  • Lana Alsabbagh, NLNZ, Research Use Cases
  • Youssef Eldakar, BA, Project Coordination
  • Mohammed Elfarargy, BA, Link Visualizer (link-viz) & Development Coordination
  • Mohamed Elsayed, BA, Link Indexer (link-indexer)
  • Andrea Goethals, NLNZ, Project Coordination
  • Amr Morad, BA, Link Service (link-serv)
  • Ben O’Brien, NLNZ, Research Use Cases
  • Amr Rizq, BA, Link Visualizer (link-viz)

Additional Thanks

  • Tasneem Allam, BA, link-viz development
  • Suzan Attia, BA, UI design
  • Dalia Elbadry, BA, UI design
  • Nada Eliba, BA, link-serv development
  • Mirona Gamil, BA, link-serv development
  • Olga Holownia, IIPC, project support
  • Andy Jackson, British Library, technical advice
  • Amged Magdey, BA, logo design
  • Liquaa Mahmoud, BA, logo design
  • Alex Osborne, National Library of Australia, technical advice

We would also like to thank the researchers who agreed to be interviewed for our Inventory of Use Cases.


Resources

IIPC-supported project “Developing Bloom Filters for Web Archives’ Holdings”

By Martin Klein, Scientist in the Research Library at Los Alamos National Laboratory, and Karolina Holub, Library Adviser at the Croatian Digital Library Development Centre at the National and University Library Zagreb

We are excited to share the news of a newly IIPC-funded collaborative project between the Los Alamos National Laboratory (LANL) and the National and University Library Zagreb (NSK). In this one-year project we will develop a software framework for web archives to create Bloom filters of their archival holdings. A Bloom filter, in this context, consists of hash values of archived URIs and can therefore be thought of as an encrypted index of an archive’s holdings. Its encrypted nature allows web archives to share information about their holdings in a passive manner, meaning only hashed URI values are communicated, rather than plain-text URIs. Sharing Bloom filters with interested parties can enable a variety of downstream applications such as search, synchronized crawling, and cataloging of archived resources.
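To illustrate the idea (a minimal sketch, not the framework this project will produce), a Bloom filter sets a handful of bit positions, derived from hashes of each archived URI, in a fixed-size bit array; lookups can return false positives but never false negatives, and the original URIs are never exposed:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over archived URIs (illustrative only)."""

    def __init__(self, size_bits=1_000_000, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, uri):
        # Derive num_hashes bit positions from one SHA-256 digest of the URI.
        digest = hashlib.sha256(uri.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, uri):
        for pos in self._positions(uri):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, uri):
        # A True result may be a false positive; a False result is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(uri))

bf = BloomFilter()
bf.add("https://example.org/archived-page")
print(bf.might_contain("https://example.org/archived-page"))  # True
print(bf.might_contain("https://example.com/missing"))        # almost certainly False
```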

Bloom filters and Memento TimeTravel

As many readers of this blog will know, the Prototyping Team at LANL has developed and maintained the Memento TimeTravel service, implemented as a federated search across more than two dozen Memento-compliant web archives. This service allows a user (or a machine, via the underlying APIs) to search for archived web resources (mementos) across many web archives at the same time. We have tested, evaluated, and implemented various optimizations for the search system to improve speed and avoid unnecessary network requests against participating web archives, but we can always do better. As part of this project, we aim to pilot a TimeTravel service based on Bloom filters that, if successful, should provide a close-to-ideal false positive rate, meaning almost no unnecessary network requests to web archives that do not have a memento of the requested URI.

While Bloom filters are widely used to support membership queries (e.g., is element A part of set B?), they have, to the best of our knowledge, not been applied to querying web archive holdings. We are aware of opportunities to improve the filters and, as additional components of this project, will investigate their scalability (in relation to CDX index size, for example) as well as the potential for incremental updates to the filters. Insights into the former will inform the applicability to archives and individual collections of different sizes, and the latter will guide a best-practice process of filter creation.
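To give a feel for the scalability question, the standard sizing formula m = -n·ln(p) / (ln 2)², with roughly k = (m/n)·ln 2 hash functions, estimates how large a filter must be for a target false positive rate. A short calculation, using an illustrative (not actual) archive size:

```python
import math

def bloom_parameters(num_items, false_positive_rate):
    """Return (bits, hash_count) for a Bloom filter with the target error rate."""
    bits = math.ceil(-num_items * math.log(false_positive_rate) / (math.log(2) ** 2))
    hashes = max(1, round((bits / num_items) * math.log(2)))
    return bits, hashes

# Illustrative figure only: a hypothetical archive holding 100 million captured URIs.
bits, hashes = bloom_parameters(100_000_000, 0.001)
print(f"{bits / 8 / 1024 ** 2:.0f} MiB, {hashes} hash functions")  # roughly 171 MiB, 10 hashes
```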

The development and testing of Bloom filters will be performed by using data from the Croatian Web Archive’s collections. NSK develops the Croatian Web Archive (HAW) in collaboration with the University Computing Centre University of Zagreb (Srce), which is responsible for technical development and will work closely with LANL and NSK on this project.

LANL and NSK are excited about this project and new collaboration. We are thankful to the IIPC and their support and are looking forward to regularly sharing project updates with the web archiving community. If you would like to collaborate on any aspects of this project, please do not hesitate and get in touch.

The Dark and Stormy Archives Project: Summarizing Web Archives Through Social Media Storytelling

By Shawn M. Jones, Ph.D. student and Graduate Research Assistant at Los Alamos National Laboratory (LANL), Martin Klein, Scientist in the Research Library at LANL, Michele C. Weigle, Professor in the Computer Science Department at Old Dominion University (ODU), and Michael L. Nelson, Professor in the Computer Science Department at ODU.

The Dark and Stormy Archives Project applies social media storytelling to automatically summarize web archive collections in a format that readers already understand.

Individual web archive collections can contain thousands of documents. Seeds inform capture, but the documents in these collections are archived web pages (mementos) created from those seeds. The sheer size of these collections makes them challenging to understand and compare. Consider Archive-It as an example platform. Archive-It has many collections on the same topic. As of this writing, a search for the query “COVID” returns 215 collections. If a researcher wants to use one of these collections, which one best meets their information need? How does the researcher differentiate them? Archive-It allows its collection owners to apply metadata, but our 2019 study found that as a collection’s number of seeds rises, the amount of metadata per seed falls. This relationship is likely due to the increased effort required to maintain the metadata for a growing number of seeds. It is paradoxical for those viewing the collection because the more seeds exist, the more metadata they need to understand the collection. Additionally, organizations add more collections each year, resulting in more than 15,000 Archive-It collections as of the end of 2020. Too many collections, too many documents, and not enough metadata make human review of these collections a costly proposition.

We use cards to summarize web documents all of the time. Here is the same document rendered as cards on different platforms.

An example of social media storytelling at Storify (now defunct) and Wakelet: cards created from individual pages, pictures, and short text describe a topic.

Ideally, a user would be able to glance at a visualization and gain an understanding of the collection, but existing visualizations require a lot of cognitive load and training even to convey one aspect of a collection. Social media storytelling provides us with an approach. We see social cards all of the time on social media. Each card summarizes a single web resource. If we group those cards together, we summarize a topic. Thus, social media storytelling produces a summary of summaries. Tools like Storify and Wakelet already apply this technique to live web resources. We want to use this proven technique because readers already understand how to view these visualizations. The Dark and Stormy Archives (DSA) Project explores how to summarize web archive collections through these visualizations. We make our DSA Toolkit freely available so that others can explore web archive collections through storytelling.

The Dark and Stormy Archives Toolkit

The Dark and Stormy Archives (DSA) Toolkit provides a solution for each stage of the storytelling lifecycle.

Telling a story with web archives consists of three steps. First, we select the mementos for our story. Next, we gather the information to summarize each memento. Finally, we summarize all mementos together and publish the story. We evaluated more than 60 platforms and determined that no platform could reliably tell stories with mementos. Many could not even create cards for mementos, and some mixed information from the archive with details from the underlying document, creating confusing visualizations.

Hypercane selects the mementos for a story. It is a rich solution that gives the storyteller many customization options. With Hypercane, we submit a collection of thousands of documents, and Hypercane reduces them to a manageable number. Hypercane provides commands that allow the archivist to cluster, filter, score, and order mementos automatically. The output from some Hypercane commands can be fed into others so that archivists can create recipes with the intelligent selection steps that work best for them. For those looking for an existing selection algorithm, we provide random selection, filtered random selection, and AlNoamany’s Algorithm as prebuilt intelligent sampling techniques. We are experimenting with new recipes. Hypercane also produces reports, helping us include named entities, gather collection metadata, and select an overall striking image for our story.
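The sketch below illustrates the recipe idea in Python, with hypothetical placeholder steps rather than Hypercane’s actual commands or algorithms: each stage consumes the memento list produced by the previous one, ending in a small sample for the story.

```python
import random

def filter_off_topic(mementos):
    """Keep only mementos flagged as on-topic (placeholder heuristic)."""
    return [m for m in mementos if m.get("on_topic", True)]

def score_by_text_length(mementos):
    """Rank mementos by extracted text length (placeholder scoring metric)."""
    return sorted(mementos, key=lambda m: m.get("text_length", 0), reverse=True)

def sample(mementos, k=36):
    """Take a small, shuffled sample of the highest-ranked mementos."""
    top = mementos[: k * 2]
    random.shuffle(top)
    return top[:k]

def recipe(mementos):
    # A "recipe": each step consumes the memento list produced by the previous one.
    return sample(score_by_text_length(filter_off_topic(mementos)))

collection = [{"urim": f"https://archive.example/{i}",
               "on_topic": i % 3 != 0,
               "text_length": i * 10}
              for i in range(1000)]
print(len(recipe(collection)))  # 36
```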

To gather the information needed to summarize individual mementos, we required an archive-aware card service; thus, we created MementoEmbed. MementoEmbed can create summaries of individual mementos in the form of cards, browser screenshots, word clouds, and animated GIFs. If a web page author needs to summarize a single memento, we provide a graphical user interface that returns the proper HTML for them to embed in their page. MementoEmbed also provides an extensive API on top of which developers can build clients.
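For developers, calling the service might look roughly like the sketch below; the base URL and endpoint path are assumptions for illustration, so consult the MementoEmbed API documentation for the exact routes your instance exposes.

```python
import requests

# Assumptions for illustration: a MementoEmbed instance at this base URL and a
# social-card endpoint of the form /services/memento/socialcard/<URI-M>.
# Check the MementoEmbed API documentation for the exact routes and parameters.
MEMENTOEMBED = "http://localhost:5550"  # hypothetical local instance
URIM = "https://web.archive.org/web/20200315000000/https://example.com/"  # example memento

response = requests.get(f"{MEMENTOEMBED}/services/memento/socialcard/{URIM}", timeout=60)
response.raise_for_status()
print(response.text)  # an HTML snippet for the card, ready to embed in a page
```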

Raintale is one such client. Raintale summarizes all mementos together and publishes a story. An archivist can supply Raintale with a list of mementos. For more complex stories, including overall striking images and metadata, archivists can also provide output from Hypercane’s reports. Because we needed flexibility for our research, we incorporated templates into Raintale. These templates allow us to publish stories to Twitter, HTML, and other file formats and services. With these templates, an archivist can not only choose what elements to include in their cards; they can also brand the output for their institution.

Raintale uses templates to allow the storyteller to tell their story in different formats, with various options, including branding.

The DSA Toolkit at work

The DSA Toolkit produced stories from Archive-It collections about mass shootings (from left to right) at Virginia Tech, Norway, and El Paso.


Through these tools, we have produced a variety of stories from web archives. As shown above, we debuted with a story summarizing IIPC’s COVID-19 Archive-It collection, reducing 23,376 mementos to an intelligent sample of 36. Instead of seed URLs and metadata, our visualization displays people in masks, places that the virus has affected, text drawn from the underlying mementos, correct source attribution, and, of course, links back to the Archive-It collection so that people can explore it further. We recently generated stories that allow readers to view the differences between Archive-It collections about the mass shootings in Norway, El Paso, and Virginia Tech. Instead of facets and seed metadata, our stories show victims, places, survivors, and other information drawn from the sampled mementos. The reader can also follow the links back to the full collection page and get even more information using the tools provided by the archivists at Archive-It.

With help from StoryGraph, the DSA Toolkit produces daily news stories so that readers can compare the biggest story of the day across different years.

But our stories are not limited to Archive-It. We designed the tools to work with any Memento-compliant web archive. In collaboration with StoryGraph, we produce daily news stories built with mementos stored at Archive.Today and the Internet Archive. We are also experimenting with summarizing a scholar’s grey literature as stored in the web archive maintained by the Scholarly Orphans project.

We designed the DSA Toolkit to work with any Memento-compliant archive. Here we summarize Ian Milligan’s grey literature as captured by the web archive at the Scholarly Orphans Project.

Our Thanks To The IIPC For Funding The DSA Toolkit

We are excited to say that, starting in 2021, as part of a recent IIPC grant, we will be working with the National Library of Australia to pilot the DSA Toolkit with their collections. In addition to solving potential integration problems with their archive, we look forward to improving the DSA Toolkit based on feedback and ideas from the archivists themselves. We will incorporate the lessons learned back into the DSA Toolkit so that all web archives may benefit, which is what the IIPC is all about.

Relevant URLs

DSA web site: http://oduwsdl.github.io/dsa

DSA Toolkit: https://oduwsdl.github.io/dsa/software.html

Raintale web site: https://oduwsdl.github.io/raintale/

Hypercane web site: https://oduwsdl.github.io/hypercane/

WCT 3.0 Release

By Ben O’Brien, Web Archive Technical Lead, National Library of New Zealand

Let’s rewind 15 years, back to 2006. The Nintendo Wii is released, Google has just bought YouTube, Facebook switches to open registration, Italy has won the FIFA World Cup, and Borat is shocking cinema screens across the globe.

Java 6, Spring 1.2, Hibernate 3.1, Struts 1.2, and Acegi Security are some of the technologies we’re using to deliver open source enterprise web applications. One application in particular, the Web Curator Tool (WCT), is starting its journey into the wide world of web archiving. WCT is an open source tool for managing the selective web harvesting process.

2018 Relaunch

Fast forward to 2018, and these technologies themselves belonged inside an archive; instead, they were still being used by the WCT to collect content for web archives. Twelve years is a long time in the world of the Internet and IT, so, needless to say, a fair amount of technical debt had caught up with the WCT and its users.

The collaborative development of the WCT between the National Library of the Netherlands and the National Library of New Zealand was full steam ahead after the release of the long-awaited Heritrix 3 integration in November 2018. With new features in mind, we knew we needed a modern, stable foundation within the WCT if we were to take it forward. Cue the Technical Uplift.

WCT 3.0

What followed was two years of development by teams in opposite time zones, battling resourcing constraints, lockdowns and endless regression testing. Now, at the beginning of 2021, we can at last announce the release of version 3.0 of the WCT.

While some of the names in the technology stack are the same (Java/Spring/Hibernate), the upgrade of these languages and frameworks represents a big milestone for the WCT: a launchpad to tackle the challenges of the next decade of web archiving!

For more information, see our recent blog post on webcuratortool.org. And check out a demo of v3.0 inside our VirtualBox image here.

WCT Team:

KB-NL

Jeffrey van der Hoeven
Sophie Ham
Trienka Rohrbach
Hanna Koppelaar

NLNZ

Ben O’Brien
Andrea Goethals
Steve Knight
Frank Lee
Charmaine Fajardo

Further reading on WCT:

WCT tutorial on IIPC
Documentation on WCT
WCT on GitHub
WCT on Slack
WCT on Twitter
Recent blogpost on WCT with links to old documentation

The BESOCIAL project: towards a sustainable strategy for social media archiving in Belgium

By Jessica Pranger, Scientific Assistant at KBR / Royal Library of Belgium

In August, we had the opportunity to present the new BESOCIAL research project during the IIPC RSS webinar. Many thanks to all viewers who have shared their remarks, questions and enthusiasm with us!

The aim of the BESOCIAL project is to set up a sustainable strategy for social media archiving in Belgium. Some Belgian institutions are already archiving social media content related to their holdings or interests, but it is necessary to reflect on a national strategy. Launched in summer 2020, this project will run over two years and is divided into seven steps, called ‘Work Packages’ (WP):

  • WP1: Review of existing social media archiving projects and corpora in Belgium and abroad (M1-M6). The aim of this WP is to analyse selection, access and preservation policies, existing foreign legal frameworks and existing technical solutions.
  • WP2: Preparation of a pilot for social media archiving (M4-M15) including the development of a methodology for selection and the technical and functional requirements. An analysis of the user requirements and the existing legal framework is also included.
  • WP3: Pilot for social media archiving (M7-M24) including harvesting, quality control and the development of a preservation plan.
  • WP4: Pilot for access to social media archive (M16-M21) focusing on legal considerations, the development of an access platform and evaluating the pilot.
  • WP5: Recommendations for sustainable social media archiving in Belgium on the legal, technical and operational level (M16-M24).
  • WP6: Coordination, dissemination and valorisation.
  • WP7: Helpdesk for legal enquiries throughout the project.

Figure 1 shows these seven stages and how they will unfold over the two years of the project.

Figure 1. Work Packages of the BESOCIAL project.

Review of existing projects

We are currently in the first stage of the project (Work Package 1). To this end, a survey has been sent to 18 international heritage institutions and 10 Belgian institutions to ask questions on various topics related to the management of their born-digital collections. To date, we have received 13 responses and a first analysis of these answers has been completed and submitted for publication. Another task that is currently being undertaken is to provide an overview of the tools used for social media archiving. It is now important to dig deeper and check which kind of metadata is supported by the tools. We are also working on an analysis of the digital preservation policies, strategies and plans of libraries and archives that already archive digital content, especially social media data. For the legal aspects, we are analysing the legal framework of social media archiving in other European and non-European countries.

Our team

The BESOCIAL project is coordinated by the Royal Library of Belgium (KBR) and is financed by the Belgian Science Policy Office’s (Belspo) BRAIN-be programme. KBR partnered with three universities for this project: CRIDS (University of Namur) works on legal issues related to the information society, CENTAL (University of Louvain) and IDLab (Ghent University) will contribute the necessary technical skills related to information and data science, whereas GhentCDH and MICT (both from Ghent University) have significant expertise in the field of communication studies and digital humanities.

The interdisciplinarity of the team and the thorough analyses of existing policies will ensure that the social media archiving strategy for Belgium will be based on existing best practices and that all involved stakeholders (heritage institutions, users, legislators, etc.) will be taken into account.

If you want to learn more about this project, feel free to follow our hashtag #BeSocialProject on social media platforms, and visit the BESOCIAL web page.

LinkGate: Initial web archive graph visualization demo

By Mohammed Elfarargy and Youssef Eldakar of Bibliotheca Alexandrina

LinkGate is an IIPC-funded project to develop a scalable web archive graph visualization environment and collect research use cases, led by Bibliotheca Alexandrina (BA) and the National Library of New Zealand (NLNZ). The project provides three modular components:

  • Link Service (link-serv) for the scalable temporal graph data service with an underlying graph data store and API
  • Link Indexer (link-indexer) for collecting inter-linking data from the web archive

  • Link Visualizer (link-viz) for the web-based frontend geared towards web archive graph data navigation and exploration

Research use cases are being documented to guide future development.

You can read more about our work in the blog post published in April.

During a webinar held at the end of July as part of the IIPC Research Speaker Series (RSS), we presented a demo of the tools being developed and a summary of feedback gathered so far from the community towards a research use case inventory. In this blog post, we give an update on progress of the technical development, focusing on the initial UI of link-viz.

Link Visualizer

LinkGate’s frontend visualization component, link-viz, has developed on many fronts over the last four months. While the Link Service component (link-serv) is compatible with the Gephi streaming API, Gephi remains a desktop-only, general-purpose graph visualization tool. link-viz, on the other hand, is a web-based, scalable graph visualization tool made specifically to visualize web archive graph data. This makes it possible to produce more informative graphs for web archive users.

link-viz works in a similar manner to web-based map services like Google Maps. The user gets a graph based on the queried URL and the desired snapshot. Users can set the initial depth of the graph and then incrementally add more nodes as they explore deeper in the graph. This smart loading makes the exploration of such a dense graph run more smoothly.
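A rough sketch of that incremental-loading pattern (illustrative only; fetch_neighbors stands in for a request to the backend data service and is not link-viz’s actual code):

```python
def fetch_neighbors(url, timestamp):
    """Placeholder for a call to the graph data service; returns outlinked URLs."""
    return demo_edges.get(url, [])

def expand(seed, timestamp, depth):
    """Load the graph around `seed` up to `depth` hops, one ring at a time."""
    nodes, frontier = {seed}, [seed]
    for _ in range(depth):
        next_frontier = []
        for url in frontier:
            for neighbor in fetch_neighbors(url, timestamp):
                if neighbor not in nodes:
                    nodes.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier  # only the newly discovered ring is expanded next
    return nodes

demo_edges = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/c"],
}
print(expand("http://example.org/", "20200701000000", depth=2))
```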

The link-viz UI is designed to set the main focus on the graph. Users can click on any graph node to select it and perform actions using tools available in the UI. Graph nodes can be moved around and are, by default, distributed using a spring force model to help make a uniform distribution over 2D space. It’s possible to toggle this off to give users the option to organize nodes manually. Users can easily pan and zoom in/out the view using mouse controls or touch gestures. All other tools are located in four floating panels surrounding the main graph area:

The left-hand panel is used to search for a URL and to select the desired snapshot based on which the initial graph will be rendered. The snapshot selection widget is illustrated in Figure 1:

Figure 1: Snapshot selection widget

The bottom panel shows detailed information on the highlighted graph node. This includes a full URL and a listing of all the outlinks and inlinks. This can be seen in Figure 2:

Figure 2: Node details panel

The top panel contains a set of tools for graph navigation (zoom in/out and reset view), taking graph screenshots, setting graph depth, collapsing/expanding portions of the graph, and configuring the look of the graph (selection of color, size, and shape for both graph nodes and edges to represent different pieces of information). One nice feature of link-viz compared to standard graph visualization tools is the usage of website favicons for graph nodes instead of geometric shapes, which makes nodes instantly identifiable and results in a much more readable graph. Figures 3 and 4 show the top panel and favicon usage, respectively:

Figure 3: Top panel


Figure 4: Favicons for graph nodes

The right-hand panel contains two tabs reserved for two sets of tools, Vizors and Finders. Vizors are tools that display the same graph while highlighting additional information. Two Vizors are currently planned. The GeoVizor will place graph nodes on top of a world map to show where they are physically hosted. The FileTypeVizor will display file-type icons as graph nodes, making it very easy to identify the most common file types and their distribution over the web. Finders perform graph exploration functions, such as finding loops or paths between nodes.

Apart from Vizors and Finders, we are also working on other features, including smart graph loading and an animated graph timeline, and we will continue to improve the UI styling.

Link Indexer

link-indexer is now integrated with link-serv via the API. We have been testing the process of inserting data extracted with link-indexer into link-serv to identify data and scalability problems to work on. link-indexer now accepts command-line options for specifying the target link-serv instance and controlling the insertion batch size to manage how often the API is invoked. More command-line options are being added to control various aspects of the tool, as well as the ability to load options from a configuration file. We are also working to enhance tolerance to data issues, such as very long URLs, and network issues, such as short service outages. Figure 5 shows a sample output from a link-indexer run:

Figure 5: Sample output from a link-indexer run
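A minimal sketch of that batched-insertion pattern (the endpoint, payload shape, and batch size below are hypothetical placeholders, not link-indexer’s actual options):

```python
import time
import requests

LINK_SERV = "http://localhost:8080/updateGraph"  # hypothetical link-serv endpoint
BATCH_SIZE = 100  # controls how often the API is invoked

def post_batch(records, retries=3):
    """Send one batch of link records, backing off and retrying on short outages."""
    for attempt in range(retries):
        try:
            requests.post(LINK_SERV, json=records, timeout=30).raise_for_status()
            return
        except requests.RequestException:
            time.sleep(2 ** attempt)
    raise RuntimeError("giving up on this batch after repeated failures")

def insert(link_records):
    """Stream (source, timestamp, target) records to the service in batches."""
    batch = []
    for record in link_records:
        if len(record["source"]) > 2000:  # skip pathologically long URLs
            continue
        batch.append(record)
        if len(batch) >= BATCH_SIZE:
            post_batch(batch)
            batch = []
    if batch:
        post_batch(batch)
```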

Link Service

link-serv implements an API for link-indexer and link-viz to communicate with the graph data store. The API is compatible with the Gephi streaming API, giving users the option to connect to link-serv using the popular graph visualization tool, Gephi, as an alternative to the project’s frontend, link-viz.  Figure 6 shows a Gephi client streaming graph data from a link-serv instance:

Figure 6: Gephi client streaming from a link-serv instance

A data schema customized for temporal, versioned web archive data is used in the underlying Neo4j graph data store, and link-serv defines extra API operations not defined in the Gephi streaming API to support temporal navigation functionality in link-viz.
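For a flavor of what that compatibility means, the Gephi streaming format is a sequence of JSON events such as "an" (add node) and "ae" (add edge). The sketch below emits one link as such events; the endpoint URL is illustrative, and the timestamp attribute merely hints at the temporal extensions that link-serv defines itself.

```python
import json
import requests

# Illustrative only: a Gephi-streaming-style endpoint; link-serv defines its own
# URL and additional temporal operations beyond this format.
ENDPOINT = "http://localhost:8080/workspace1?operation=updateGraph"

def stream_link(source, target, timestamp):
    """Emit add-node ("an") and add-edge ("ae") events in Gephi streaming JSON style."""
    events = [
        {"an": {source: {"label": source, "timestamp": timestamp}}},
        {"an": {target: {"label": target, "timestamp": timestamp}}},
        {"ae": {f"{source}->{target}@{timestamp}": {
            "source": source, "target": target, "directed": True}}},
    ]
    body = "\r\n".join(json.dumps(event) for event in events)
    requests.post(ENDPOINT, data=body, timeout=30).raise_for_status()
```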

As more data is added to link-serv, the underlying graph data store has difficulty scaling up when reliant on a single instance. Our primary focus in link-serv at the moment, therefore, is to implement clustering. Work is in progress on a customized dispatcher service for the Neo4j graph data store as a substitute for the clustering functionality in the commercially licensed Neo4j Enterprise Edition. As a side track, we are also looking into ArangoDB as a possible alternative deployment option for link-serv’s graph data store.

Robustify your links! A working solution to create persistently robust links

By Martin Klein, Scientist in the Research Library at Los Alamos National Laboratory (LANL), Shawn M. Jones, Ph.D. student and Graduate Research Assistant at LANL, Herbert Van de Sompel, Chief Innovation Officer at Data Archiving and Network Services (DANS), and Michael L. Nelson, Professor in the Computer Science Department at Old Dominion University (ODU).

Links on the web break all the time. We frequently experience the infamous “404 – Page not found” message, also known as “a broken link” or “link rot.” Sometimes we follow a link and discover that the linked page has significantly changed and its content no longer represents what was originally referenced, a scenario known as “content drift.” Both link rot and content drift are forms of “reference rot”, a significant detriment to our web experience. In the realm of scholarly communication where we increasingly reference web resources such as blog posts, source code, videos, social media posts, datasets, etc. in our manuscripts, we recognize that we are losing our scholarly record to reference rot.

Robust Links background

As part of the Andrew W. Mellon Foundation-funded Hiberlink project, the Prototyping team of the Los Alamos National Laboratory’s Research Library, together with colleagues from Edina and the Language Technology Group of the University of Edinburgh, developed the Robust Links concept a few years ago to address the problem. Given the renewed interest in the digital preservation community, we have now collaborated with colleagues from DANS and the Web Science and Digital Libraries Research Group at Old Dominion University on a service that makes creating Robust Links straightforward. To create a Robust Link, we need to:

  1. Create an archival snapshot (memento) of the link URL and
  2. Robustify the link in our web page by adding a couple of attributes to the link.

Robust Links creation

The first step can be done by submitting a URL to a proactive web archiving service such as the Internet Archive’s “Save Page Now”, Perma.cc, or archive.today. The second step guarantees that the link retains the original URL, the URL of the archived snapshot (memento), and the datetime of linking. We detail this step in the Robust Links specification. With both done, we truly have robust links with multiple fallback options. If the original link on the live web is subject to reference rot, readers can access the memento from the web archive. If the memento itself is unavailable, for example, because the web archive is temporarily out of service, we can use the original URL and the datetime of linking to locate another suitable memento in a different web archive. The Memento protocol and infrastructure provides a federated search that seamlessly enables this sort of lookup.
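Based on our reading of the Robust Links specification (check the spec for the authoritative attribute definitions), a robustified link keeps the original URL in href and carries the memento URL and linking datetime as data attributes. A small helper, with hypothetical example values:

```python
from datetime import date

def robust_link(original_url, memento_url, link_text, linked_on=None):
    """Build an HTML anchor carrying the original URL, memento URL, and linking date."""
    linked_on = linked_on or date.today().isoformat()
    return (f'<a href="{original_url}" '
            f'data-versionurl="{memento_url}" '
            f'data-versiondate="{linked_on}">{link_text}</a>')

# Hypothetical example values for illustration.
print(robust_link(
    "https://example.com/article",
    "https://web.archive.org/web/20210301120000/https://example.com/article",
    "an interesting article",
    linked_on="2021-03-01",
))
```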

Robust Links web service.

To make Robust Links more accessible to everyone, we provide a web service to easily create Robust Links. To “robustify” your links, submit the URL of your HTML link to the web form, optionally specify a link text, and click “Robustify”. The Robust Links service creates a memento of the provided URL either with the Internet Archive or with archive.today (the selection is made randomly). To increase robustness, the service utilizes multiple publicly available web archives and we are working to include additional web archives in the future. From the result page after submitting the form, copy the HTML snippet for your robust link (shown as step 1 on the result page) and paste it into your web page. To make robust links actionable in a web browser, you need to include the Robust Links JavaScript and CSS in your page. We make this easy by providing an HTML snippet (step 2 on the result page) that you can copy and paste inside the HEAD section of your page.

Robust Links web service result page.

Robust Links sustainability

During the implementation of this service, we identified two main concerns regarding its sustainability. The first issue is the reliable inclusion of the Robust Links JavaScript and CSS to make Robust Links actionable. Specifically, we were looking for a feasible approach to improve the chances that both files remain available in the long term, can continuously be maintained, and that their URIs persistently resolve to the latest version. Our approach is two-fold:

  1. we moved the source files into the IIPC GitHub repository so they can be maintained (and versioned) by the community and served with the correct mime type via GitHub Pages and
  2. we minted two Digital Object Identifiers (DOIs) with DataCite, one to resolve to the latest version of the Robust Links JavaScript and the other to the CSS.

The other sustainability issue relates to the Memento infrastructure used to automatically access mementos across web archives (the second fallback mentioned above). Here, we rely on the fact that LANL and ODU, both IIPC member organizations, maintain the Memento infrastructure.

Because of limitations with the WordPress platform, we unfortunately cannot demonstrate robust links in this blog post. However, we created a copy with robustified links hosted at https://robustlinks.mementoweb.org/demo/IIPC/robust_links_blog.html. In addition, our Robust Links demo page showcases how robust links are actionable in a browser via the included CSS and JavaScript. We also created an API for machine access to our Robust Links service.

Robust Links in action.

Acknowledgements and feedback

Lastly, we would like to thank DataCite for granting two DOIs to the IIPC for this effort at no cost. We are also grateful to ODU’s Karen Vaughan for her help minting the DOIs.

For feedback/comments/questions, please do not hesitate and get in touch (martinklein0815[at]gmail.com)!

Relevant URIs

https://robustlinks.mementoweb.org/
https://robustlinks.mementoweb.org/about/
https://robustlinks.mementoweb.org/spec/
https://robustlinks.mementoweb.org/api-docs/

The French coronavirus (COVID-19) web archive collection: focus on collaborative networks

BnF’s Covid-19 web archive collection has drawn considerable media attention in France, including coverage in Le Monde, 20 Minutes and on the TV channel France 3. The following blog post was first published in Web Corpora, BnF’s blog dedicated to web archives.


By Alexandre Faye, Digital Collection Manager, Bibliothèque nationale de France (BnF)
English translation by Alexandre Faye and Karine Delvert

The current global coronavirus (Covid-19) pandemic poses an unprecedented challenge for web archiving activities. The impact on society is such that the ongoing collection requires several levels of coordination and cooperation at the national and international level.

Since it spread out of China and later developed in Europe, the coronavirus outbreak has become a pervasive theme on the web. This health crisis is being experienced in real time by populations that are simultaneously confined and largely connected, with a sense of emergency as well as underlying questioning. Archived websites, blogs, and social media should make up a coherent, significant and representative collection. They will be primary sources for future research, and they are already the trace and memory of the event.

#jenesuispasunvirus

At the end of January 2020, while the Wuhan megalopolis was quarantined, the first hashtags #JeNeSuisPasUnVirus and #CORONAVIRUSENFRANCE appeared on Twitter. They denounced the stigma experienced by the Asian community in France. The Movement against racism and for friendship between peoples (Mouvement contre le racisme et pour l’amitié entre les peuples, MRAP) quickly published a page on its website entitled “a virus has no ethnic origin”. This was the first webpage related to the coronavirus to be selected, crawled and preserved under French legal deposit.

Group dynamics

The coronavirus collection is not conceived as a project in the sense that it would be programmed, have a precise calendar and be limited to predetermined topics. It grows as part of both the National and Local News Media collection and the ephemeral News Current Topics collection. The National and Local News Media collection brings together around a hundred national and local press websites, including editorial content such as headlines and related articles, as well as Twitter accounts, which are collected once a day. The News Current Topics collection, which requires both a technical and an organizational approach, relies on the coordination of an internal network of digital curators working in their relevant fields. It facilitates dynamic and reactive identification of web content related to contemporary issues and important events. By documenting the evolution, spread and overall impact of the pandemic in France, the archiving policy embraces all facets of the public health crisis: medical, social, economic, political and, more broadly, scientific, cultural and moral aspects.

“A virus has no ethnic origin”. Movement Against Racism and for Friendship Between Peoples (MRAP) website. Archive of February 21, 2020.

Seventy selected seed URLs were crawled in January and February, while the spread of the virus out of China seemed limited and under control. Since March 17, the date of the French lockdown, 500 to 600 seed URLs per week have been selected and assigned a crawl frequency: several times a day for social networks, daily for national and local press sites, weekly for news sections dedicated to the coronavirus, and monthly for articles and dedicated websites created ex nihilo. Thus, the coronavirus section of the economic review L’Usine nouvelle is crawled weekly because it publishes a steady stream of articles. The less dynamic recommendation pages of the National Research and Security Institute (INRES) are assigned a monthly frequency.

By mid-April 2020, more than 2,000 selections and settings had been created. This reactivity is all the more necessary given that certain web pages selected in the first phase have already disappeared from the live web.

The regional dimension

The geographical approach is also at the core of the archiving dynamics. The web does not entirely do away with territorial dimensions, as shown by research conducted on this topic. One may even think that they have been reinforced as France was hit by the health crisis, since the crisis coincided with the campaign for the municipal elections.

Curators from partner institutions all over the French territory have spontaneously enriched the selections on the coronavirus health crisis by contributing local and regional content. This network is a key element of the national cooperation framework. Initiated in 2004 by the BnF, it relies on 26 regional libraries and archives services, which share the mission of print and web legal deposit by participating in collaborative nominations. Its contribution has proved significant, since over 50% of the websites nominated by 15 April refer to local or regional content.

Simplified access to teleconsultation. ARS Guyana. Archived, April 5, 2020.

As a corollary, the crawl devoted to the local elections was not suspended after the first round (which took place on March 15th), even though the second round (due to take place the following weekend) had been postponed and the whole electoral process suspended due to the crisis. In particular, the Twitter and Facebook accounts of the mayors elected in the first round and of the candidates still in contention for the second round have continued to be collected. These archives, capturing the statements made on the web by mayors and candidates during the weeks preceding and following the first round, already appear to be a major source for both electoral history and the coronavirus pandemic in France.

Historic abstention rate in the local elections in the Oise “cluster”. francetvinfo.fr. Capture of March 16, 2020.

International cooperation

At the international level, the BnF, and through it the other participating French libraries, contributes to the archiving project “Novel Coronavirus (2019-nCoV) outbreak”. This initiative, launched in February 2020, is supported by the IIPC Content Development Group (CDG) in association with the Internet Archive. It brings together about thirty libraries and institutions around the world collaborating on this web archive collection. At the end of May, more than 6,800 preserved websites representing 45 languages had been put online on Archive-It.org and indexed in full text.


The BnF has for many years pursued a policy of cooperation with the IIPC to promote the preservation and use of web archives on an international scale. One of the research challenges is to facilitate comparisons of the different national webs, in particular for global and transnational phenomena such as #MeToo and the current health crisis. A first contribution was sent to the IIPC at the end of February. It consisted of a selection of 80 seeds made during the first phase of the pandemic, just before Europe overtook China as the main active centre. Some of these pages have already disappeared from the live web.

Following the IIPC’s new recommendations and considering the evolution of the pandemic in France, the next contribution to the IIPC should be a tight selection (almost 5% of the French collection) linked to high-priority subtopics, including: information about the spread of infection; regional or local containment efforts; medical and scientific aspects; social aspects; economic aspects; and political aspects. A third of those websites covers the medical domain. Another third provides information about French territories remote from Europe: French Guiana and the West Indies, Reunion and Mayotte. The last part concerns citizens’ initiatives and debates during the lockdown.

For example, the special website hosted by INED gives information on local excess mortality; articles from Madinin’art, Montray Kreyol and Free Pawol were selected by a local curator; and banlieues-sante.org is the website of an NGO that acts against medical inequality and has created a YouTube channel explaining protection measures in 24 languages, including sign language.

Dr François Ehlinger on EHPAD. Nicole Bertin’s Blog. Website capture from the Charente-Maritime region. Capture on April 3, 2020

What’s next?

Some of the websites nominated by the BnF and its partners tend to constitute a collective memory of the event. Until mid-April, social networks represented 40% of the nominations, with a slight predominance of Twitter over Facebook. Although a large share is devoted to official accounts – namely those of institutions or associations (@AssembleeNat, @restosducoeur, @banlieuesante) or accounts created ex nihilo (@CovidRennes, @CoronaVictimes, @InitiativeCovid) – hashtags prevail in the set of selections.

The aim is to archive a representative part of individual and collective expression by capturing tweets around the most significant hashtags: multiple variations on the terms “coronavirus” and “confinement” (#coronavacances, #ConfinementJour29), criticism of the way the crisis has been managed (#OuSontLesMasques, #OnOublieraPas), and the dissemination of instructions and expressions of sympathy, which show a unique and characteristic mobilisation of citizens following the pace of the news (#chloroquine, #Luxfer).

Daniel Bourrion, “The virus journals” on face-ecran.fr. Archived April 3, 2020.

Archives relating to the coronavirus, as they account for the outcomes of the health crisis and the lockdown in various domains, end up overlapping with the set of themes to which the BnF and its partners pay particular attention or for which focused crawls have already been conducted or will be. Examples include digital literature and confinement diaries, relationships between the body and public health policies, epidemiology and artificial intelligence, and family life in confinement and feminism.

What comes next is not just a matter of finding a single way to promote this special archive collection, which remains a work in progress. It is neither a delimited project nor an already closed one. It is documentation for many kinds of research projects, and also heritage for all of us.

Guide for confined parents. The French Secretariat for Equality (Le Secrétariat d’Etat chargé de l’égalité entre les femmes et les hommes et de la lutte contre les discriminations). Capture of April 10, 2020.

Contribute to CDG’s AI Collection!

By Tiiu Daniel, Web Archive Leading Specialist, National Library of Estonia

“Trurl” by Daniel Mróz, from The Cyberiad by Stanisław Lem (Wydawnictwo Literackie, Kraków, 1972). Illustration copyright © 1972 Daniel Mróz. Reprinted by permission.

After significant breakthroughs at the end of the 20th century and the beginning of the 21st, artificial intelligence (AI) has come to play a greater role in our daily lives. Although AI has a huge positive impact on a variety of fields such as manufacturing, healthcare, art, transportation and retail, the use of new technologies also raises ethical issues as well as security risks. One critical and hotly debated issue is the impact of ongoing automation on labor markets, including changing educational requirements for jobs, job elimination, and various models for transitions.

The IIPC Content Development Group invites curators and web archivists around the world to contribute websites to a new “Artificial Intelligence” web collection.

The purpose of this collection is to bring together and record web content related to the use of AI and its impact on any aspect of life, reflecting attitudes and thoughts towards it, future predictions, etc.

The content can be in any language, focus on specific countries or cultures, or have a global scope.

We especially welcome contributions from underrepresented countries, cultures, languages and other groups, or from countries without IIPC members. Curators currently building AI-related collections at their own institutions are welcome to contribute their seeds (matching the criteria below) to aid in the development of a collection with an international perspective.

The collection aims to cover the following subtopics:

  • Machine learning, natural language processing, robotics, automation;
  • AI in literature, visual arts (e.g. ceramics, drawing, painting, sculpture, design, photography, filmmaking, architecture) and performing arts (e.g. theater, public speech, dance, music etc.); AI in emerging art forms;
  • AI and law/legislation;
  • Social and economic impact (e.g. impact on behavior/interaction, bias in AI, unemployment, inequality, changes in labor markets);
  • Ethical issues (e.g. weaponization of AI, security, robot rights);
  • Future predictions/scenarios concerning AI.

Types of web content to include are personal forms such as blogs, forum posts, and artist websites, as well as trend reports, statements, and analyses (e.g. from government agencies, NGOs, scientific or academic institutions, advocacy groups, and businesses).

Time frame covered by content: from the 1990s onwards.

Out of scope are: full social media feeds and channels (Facebook, Twitter, Instagram, YouTube, WhatsApp), users’ video channels (YouTube, Vimeo), apps, and other content which is difficult or impossible to crawl.

That said, if you locate individual social media posts of unique value, such as an Instagram post by a bot or a particularly relevant and ephemeral individual video, please submit them for consideration.

Nominations are welcomed using the following form.

The call for nominations will close on 30 June 2019. Crawls will be run during the summer of 2019, and the collection will be made available at the end of 2019.

 For more information about this collection, contact Tiiu Daniel (tiiu.daniel[at]nlib.ee).


Lead-Curators of CDG Artificial Intelligence Collection
Tiiu Daniel, Web Archive Leading Specialist, National Library of Estonia
  • Liisi Esse, Ph.D., Associate Curator for Estonian and Baltic Studies, Stanford University Libraries
  • Rashi Joshi, Reference Librarian/Collections Specialist, Library of Congress

CDG Co-Chairs
Nicola Bingham, Lead Curator Web Archiving, British Library
Alex Thurman, Web Resources Collection Coordinator, Columbia University Libraries

Contribute to CDG’s Climate Change Collection!

By Kees Teszelszky, Curator Digital Collections, Koninklijke Bibliotheek – National Library of The Netherlands and Lead Curator, CDG Climate Change Collection

Climate change is one of the most urgent and hotly debated issues on the web in recent years. The IIPC Content Development Group is inviting all curators and web archivists from around the world to contribute websites to a new collaborative “Climate Change” collection.

Breiðamerkurlón, Iceland

In recent decades there has been strong evidence that the earth is experiencing rapid climate change, characterized by global temperature rise, warming oceans, shrinking ice sheets, glacial retreat, decreased snow cover, sea level rise, declining arctic sea ice, extreme weather events, and ocean acidification. Ninety-seven percent of climate scientists agree that these climate-warming trends over the past century are very likely due to human activities, and most of the leading scientific organizations worldwide have issued public statements endorsing this position (source: climate.nasa.gov/evidence). Global and local action to mitigate this crisis has been complicated by political, economic, technical, cultural, and religious debates.

Many people feel the urge to reflect on this topic on the web. We would like to take an international snapshot of born-digital culture relating to the documentation of and social debate on the challenging issue of climate change. You can contribute to this collection by nominating web content about any aspect of climate change; the content can focus on specific countries or cultures or have a global scope, and can be in any language.

We especially welcome contributions from underrepresented countries, cultures, languages and other groups, or from countries without IIPC members. Curators currently building climate-change-related collections at their own institutions are welcome to contribute their seeds (matching the criteria below) to help us build a collection with an international perspective.

Examples of subtopics might include climatology, climate change denial, climate refugees, religious reflections on climate change, etc. Eligible types of web content include organizational reports or statements (i.e. from government agencies, NGOs, scientific or academic institutions, advocacy groups, political parties/platforms, businesses, religious groups) or more personal forms such as blogs or artistic projects.

Out of scope are: social media feeds (Facebook, Twitter, Instagram, YouTube channels, WhatsApp), video (YouTube, Vimeo), apps and other content which is difficult or impossible to crawl.

Collecting seeds started on 1 April 2019 and more nominations can be added to this spreadsheet. Crawls will be run during the summer of 2019, to conclude shortly after the upcoming UN Climate Action Summit on 23 September 2019.

Organized by the IIPC and supported by web archivists around the world, the special web collection ‘Climate Change’ is one of the ways the IIPC helps raise awareness of the strategic, cultural and technological issues which make up the web archiving and digital preservation challenge.

For more information about this collection, contact Kees Teszelszky: kees.teszelszky[at]kb.nl