LinkGate: Initial web archive graph visualization demo

by Mohammed Elfarargy and Youssef Eldakar of Bibliotheca Alexandrina

LinkGate is an IIPC-funded project to develop a scalable web archive graph visualization environment and collect research use cases, led by Bibliotheca Alexandrina (BA) and the National Library of New Zealand (NLNZ). The project provides three modular components:

  • Link Service (link-serv) for the scalable temporal graph data service with an underlying graph data store and API
  • Link Indexer (link-indexer) for collecting inter-linking data from the web archive
  • Link Visualizer (link-viz) for the web-based frontend geared towards web archive graph data navigation and exploration

Research use cases are being documented to guide future development.

You can read more about our work in the blog post published in April.

During a webinar held at the end of July as part of the IIPC Research Speaker Series (RSS), we presented a demo of the tools being developed and a summary of feedback gathered so far from the community towards a research use case inventory. In this blog post, we give an update on the progress of technical development, focusing on the initial UI of link-viz.

Link Visualizer

LinkGate's frontend visualization component, link-viz, has seen development on many fronts over the last four months. While the link-serv component is compatible with the Gephi streaming API, Gephi itself remains a desktop-only, general-purpose graph visualization tool. link-viz, on the other hand, is a web-based, scalable graph visualization tool built specifically for web archive graph data. This makes it possible to produce more informative graphs for web archive users.

link-viz works in a similar manner to web-based map services like Google Maps. The user gets a graph based on the queried URL and the desired snapshot. Users can set the initial depth of the graph and then incrementally add more nodes as they explore deeper in the graph. This smart loading makes the exploration of such a dense graph run more smoothly.
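To make the idea concrete, below is a minimal sketch of this kind of incremental loading, not link-viz's actual implementation: neighbours of a node are fetched from the graph service only when the user expands it, and the `fetch_neighbors` callback is a hypothetical stand-in for whatever call the frontend makes to link-serv.

```python
from typing import Callable, Dict, List, Set

def expand_node(url: str,
                depth: int,
                fetch_neighbors: Callable[[str], List[str]],
                loaded: Set[str],
                edges: Dict[str, List[str]]) -> None:
    """Lazily load the neighbourhood of `url` down to `depth` levels.

    `fetch_neighbors` is a hypothetical callback that asks the graph
    service for the outlinks of a single node; only nodes the user
    actually expands are ever requested.
    """
    if depth == 0 or url in loaded:
        return
    neighbors = fetch_neighbors(url)   # one service call per expanded node
    loaded.add(url)
    edges[url] = neighbors
    for n in neighbors:
        expand_node(n, depth - 1, fetch_neighbors, loaded, edges)

# Example: render the initial graph at depth 1, then call expand_node()
# again on a node the user clicks, instead of loading the whole dense graph.
```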

The link-viz UI is designed to keep the main focus on the graph. Users can click on any graph node to select it and perform actions using tools available in the UI. Graph nodes can be moved around and are, by default, distributed using a spring force model to produce a uniform distribution over the 2D space. It is possible to toggle this off to give users the option to organize nodes manually. Users can easily pan the view and zoom in and out using mouse controls or touch gestures. All other tools are located in four floating panels surrounding the main graph area:

The left-hand panel is used to search for a URL and to select the desired snapshot from which the initial graph will be rendered. The snapshot selection widget is illustrated in Figure 1:

Figure 1: Snapshot selection widget

The bottom panel shows detailed information on the highlighted graph node, including the full URL and a listing of all outlinks and inlinks. This can be seen in Figure 2:

Figure 2: Node details panel

The top panel contains a set of tools for graph navigation (zoom in/out and reset view), taking graph screenshots, setting graph depth, collapsing/expanding portions of the graph, and configuring the look of the graph (selection of color, size, and shape for both graph nodes and edges to represent different pieces of information). One nice feature of link-viz compared to standard graph visualization tools is the use of website favicons for graph nodes instead of geometric shapes, which makes nodes instantly identifiable and results in a much more readable graph. Figures 3 and 4 show the top panel and favicon usage, respectively:

Figure 3: Top panel

 

Figure 4: Favicons for graph nodes

The right-hand panel contains two tabs reserved for two sets of tools, Vizors and Finders. Vizors are tools that display the same graph while highlighting additional information. Two vizors are currently planned: the GeoVizor will place graph nodes on top of a world map to show the physical hosting location, and the FileTypeVizor will display file-type icons as graph nodes, making it very easy to identify the most common file types and their distribution over the web. Finders perform graph exploration functions, such as finding loops or paths between nodes.

Apart from Vizors and Finders, we are also working on other features, including smart graph loading and an animated graph timeline. We are also going to improve the UI styling.

Link Indexer

link-indexer is now integrated with link-serv via the API. We have been testing the process of inserting data extracted with link-indexer into link-serv to identify data and scalability problems to work on. link-indexer now accepts command-line options for specifying the target link-serv instance and controlling the insertion batch size to manage how often the API is invoked. More command-line options are being added to control various aspects of the tool, as well as the ability to load options from a configuration file. We are also working to enhance tolerance to data issues, such as very long URLs, and network issues, such as short service outages. Figure 5 shows a sample output from a link-indexer run:

Figure 5: Sample output from a link-indexer run

Link Service

link-serv implements an API for link-indexer and link-viz to communicate with the graph data store. The API is compatible with the Gephi streaming API, giving users the option to connect to link-serv using the popular graph visualization tool, Gephi, as an alternative to the project’s frontend, link-viz.  Figure 6 shows a Gephi client streaming graph data from a link-serv instance:

Figure 6: Gephi client streaming from a link-serv instance
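For illustration, the Gephi graph streaming format that link-serv is compatible with expresses graph updates as small JSON events such as "an" (add node) and "ae" (add edge). The sketch below streams two nodes and an edge to a Gephi-streaming-style endpoint; the host, workspace path, and node attributes are assumptions for the example, not link-serv's actual configuration.

```python
import json
import requests  # assumes the `requests` package is installed

# Hypothetical endpoint of a Gephi-streaming-compatible server (e.g. a link-serv instance).
ENDPOINT = "http://localhost:8080/workspace0?operation=updateGraph"

# Graph updates expressed as Gephi streaming events: "an" adds a node, "ae" adds an edge.
events = [
    {"an": {"n1": {"label": "http://example.org/", "timestamp": "20200701000000"}}},
    {"an": {"n2": {"label": "http://example.com/page", "timestamp": "20200701000000"}}},
    {"ae": {"e1": {"source": "n1", "target": "n2", "directed": True}}},
]

# Events are sent one JSON object per line.
payload = "\r\n".join(json.dumps(event) for event in events)
response = requests.post(ENDPOINT, data=payload)
response.raise_for_status()
```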

A data schema customized for temporal, versioned web archive data is used in the underlying Neo4j graph data store, and link-serv defines extra API operations not defined in the Gephi streaming API to support temporal navigation functionality in link-viz.
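As a purely hypothetical sketch of what such a temporal, versioned schema could look like (the labels and properties below are illustrative assumptions, not link-serv's actual schema), each URL might be modelled as a resource node with per-snapshot version nodes, written via the official Neo4j Python driver:

```python
from neo4j import GraphDatabase  # official Neo4j Python driver

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Hypothetical schema: (:Resource {url}) -[:HAS_VERSION]-> (:Version {timestamp}),
# with (:Version)-[:LINKS_TO]->(:Resource) edges recording outlinks at that snapshot.
ADD_LINK = """
MERGE (src:Resource {url: $src})
MERGE (v:Version {url: $src, timestamp: $ts})
MERGE (src)-[:HAS_VERSION]->(v)
MERGE (dst:Resource {url: $dst})
MERGE (v)-[:LINKS_TO]->(dst)
"""

with driver.session() as session:
    session.run(ADD_LINK,
                src="http://example.org/",
                dst="http://example.com/page",
                ts="20200701000000")
driver.close()
```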

As more data is added to link-serv, the underlying graph data store has difficulty scaling up when reliant on a single instance. Our primary focus in link-serv at the moment, therefore, is to implement clustering. Work is in progress on a customized dispatcher service for the Neo4j graph data store as a substitute for the clustering functionality in the commercially licensed Neo4j Enterprise Edition. As a side track, we are also looking into ArangoDB as a possible alternative deployment option for link-serv's graph data store.

Robustify your links! A working solution to create persistently robust links

By Martin Klein, Scientist in the Research Library at Los Alamos National Laboratory (LANL), Shawn M. Jones, Ph.D. student and Graduate Research Assistant at LANL, Herbert Van de Sompel, Chief Innovation Officer at Data Archiving and Network Services (DANS), and Michael L. Nelson, Professor in the Computer Science Department at Old Dominion University (ODU).

Links on the web break all the time. We frequently experience the infamous “404 – Page not found” message, also known as “a broken link” or “link rot.” Sometimes we follow a link and discover that the linked page has significantly changed and its content no longer represents what was originally referenced, a scenario known as “content drift.” Both link rot and content drift are forms of “reference rot”, a significant detriment to our web experience. In the realm of scholarly communication where we increasingly reference web resources such as blog posts, source code, videos, social media posts, datasets, etc. in our manuscripts, we recognize that we are losing our scholarly record to reference rot.

Robust Links background

As part of the Andrew W. Mellon Foundation-funded Hiberlink project, the Prototyping team of the Los Alamos National Laboratory's Research Library, together with colleagues from EDINA and the Language Technology Group of the University of Edinburgh, developed the Robust Links concept a few years ago to address this problem. Given the renewed interest in the digital preservation community, we have now collaborated with colleagues from DANS and the Web Science and Digital Libraries Research Group at Old Dominion University on a service that makes creating Robust Links straightforward. To create a Robust Link, we need to:

  1. Create an archival snapshot (memento) of the link URL and
  2. Robustify the link in our web page by adding a couple of attributes to the link.

Robust Links creation

The first step can be done by submitting a URL to a proactive web archiving service such as the Internet Archive's "Save Page Now", Perma.cc, or archive.today. The second step guarantees that the link retains the original URL, the URL of the archived snapshot (memento), and the datetime of linking. We detail this step in the Robust Links specification. With both done, we truly have robust links with multiple fallback options. If the original link on the live web is subject to reference rot, readers can access the memento from the web archive. If the memento itself is unavailable, for example because the web archive is temporarily out of service, we can use the original URL and the datetime of linking to locate another suitable memento in a different web archive. The Memento protocol and infrastructure provide a federated search that seamlessly enables this sort of lookup.
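To make step 2 concrete, here is a minimal sketch based on our reading of the Robust Links specification: the anchor keeps the original URL in href, while data-versionurl and data-versiondate carry the memento URL and the datetime of linking (the URLs below are examples only).

```python
def robust_link(original_url: str, memento_url: str, version_date: str, text: str) -> str:
    """Build a Robust Links-style anchor: original URL plus snapshot URL and linking datetime."""
    return (f'<a href="{original_url}" '
            f'data-versionurl="{memento_url}" '
            f'data-versiondate="{version_date}">{text}</a>')

print(robust_link("http://example.org/article",
                  "https://web.archive.org/web/20200701000000/http://example.org/article",
                  "2020-07-01",
                  "an example robust link"))
```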

Robust Links web service.

To make Robust Links more accessible to everyone, we provide a web service to easily create Robust Links. To “robustify” your links, submit the URL of your HTML link to the web form, optionally specify a link text, and click “Robustify”. The Robust Links service creates a memento of the provided URL either with the Internet Archive or with archive.today (the selection is made randomly). To increase robustness, the service utilizes multiple publicly available web archives and we are working to include additional web archives in the future. From the result page after submitting the form, copy the HTML snippet for your robust link (shown as step 1 on the result page) and paste it into your web page. To make robust links actionable in a web browser, you need to include the Robust Links JavaScript and CSS in your page. We make this easy by providing an HTML snippet (step 2 on the result page) that you can copy and paste inside the HEAD section of your page.

Robust Links web service result page.

Robust Links sustainability

During the implementation of this service, we identified two main concerns regarding its sustainability. The first issue is the reliable inclusion of the Robust Links JavaScript and CSS to make Robust Links actionable. Specifically, we were looking for a feasible approach to improve the chances that both files are available in the long term and can continuously be maintained, and that their URIs persistently resolve to the latest version. Our approach is two-fold:

  1. we moved the source files into the IIPC GitHub repository so they can be maintained (and versioned) by the community and served with the correct MIME type via GitHub Pages and
  2. we minted two Digital Object Identifiers (DOIs) with DataCite, one to resolve to the latest version of the Robust Links JavaScript and the other to the CSS.

The other sustainability issue relates to the Memento infrastructure that automatically accesses mementos across web archives (the second fallback mentioned above). Here, LANL and ODU, both IIPC member organizations, continue to maintain the Memento infrastructure.

Because of limitations of the WordPress platform, we unfortunately cannot demonstrate robust links in this blog post. However, we created a copy with robustified links hosted at https://robustlinks.mementoweb.org/demo/IIPC/robust_links_blog.html. In addition, our Robust Links demo page showcases how robust links are made actionable in a browser via the included CSS and JavaScript. We also created an API for machine access to our Robust Links service.

Robust Links in action.

Acknowledgements and feedback

Lastly, we would like to thank DataCite for granting two DOIs to the IIPC for this effort at no cost. We are also grateful to ODU’s Karen Vaughan for her help minting the DOIs.

For feedback/comments/questions, please do not hesitate to get in touch (martinklein0815[at]gmail.com)!

Relevant URIs

https://robustlinks.mementoweb.org/
https://robustlinks.mementoweb.org/about/
https://robustlinks.mementoweb.org/spec/
https://robustlinks.mementoweb.org/api-docs/

The Danish Coronavirus web collection – Coronavirus on the curators’ minds

By Sabine Schostag, Web Curator, The Royal Danish Library

Introduction – a provoking cartoon

In a sense, the story of Corona and the national Danish web archive (Netarchive) starts at the end of January 2020 – about six weeks before Corona came to Denmark. A cartoon by Niels Bo Bojesen in the Danish newspaper "Jyllands-Posten" (2020-01-26), showing the Chinese flag with a circle of yellow coronaviruses instead of the stars, caused indignation in China and captured attention worldwide. We focused on collecting reactions on different social media and in the international news media. Particularly on Twitter, a seething discussion arose with vehement comments and memes about Denmark.

From epidemic to pandemic

After that, the curators again focused on the daily routines in web archiving, as we believed that Corona (Covid-19) was a closed chapter in Netarchive’s history. But this was not the case. When the IIPC Content Development Working Group launched the Covid-19 collection in February, the Royal Danish Library contributed the Danish seeds.

Suddenly, the coronavirus arrived in Europe and the first infected Dane came home from a skiing trip in Italy. The epidemic turned into a pandemic. On March 12, the Danish Government decided to lock down the country: all public employees were sent to their home offices and the borders were closed. Not only did the public sector shut down; trade and industry, shops, restaurants, bars, etc. had to close too. Only supermarkets were still open, and people in the health care sector had to work overtime.

While Denmark came to a standstill, so to speak, the Netarchive curators worked at full throttle on the coronavirus event collection. Zoom became the most important work tool for the following 2½ months. In daily Zoom meetings, we coordinated who worked on which facet of this collection. To put it briefly, we curators had coronavirus on our minds.

Event crawls in Netarchive

The Danish web archive crawls all Danish news media at frequencies ranging from several times daily to once weekly, so there is no need to include news articles in an event crawl. Thus, with an event crawl we focus on increased activity on social media, blog articles, new sites emerging in connection with the event – and reactions in news media outside Denmark.

Coronavirus documentation in Denmark

The Danish web collection on coronavirus in Denmark is part of a general documentation of the corona lockdown in Denmark in 2020. This documentation is a cooperation between several cultural institutions: the National Archives (Rigsarkivet), the National Museum (Nationalmuseet), the Workers Museum (Arbejdermuseet), local archives and, last but not least, the Royal Danish Library. The corona lockdown documentation was planned in two steps: the "here and now" collection of documentation during the corona lockdown and a more systematic follow-up collecting materials from authorities and public bodies.

“Days with Corona” – a call for help

All Danes were asked to contribute to the corona lockdown documentation, for instance by sending photos and narratives from their daily life under the lockdown. “Days with Corona” is the title of this part of the documentation of the Danish Folklore Archives run by the National Museum and the Royal Library.

Netarchive also asked the public for help by nominating URLs of web pages related to coronavirus, social media profiles, hashtags, memes and any other relevant material.

Help from colleagues

Web archiving is part of the Department for Digital Cultural Heritage at the Royal Library. Almost all colleagues from the department were able to continue their everyday work from their home offices. Many colleagues from other departments were not able to do so. Some of them helped the Netarchive team by nominating URLs, as this event crawl could keep curators busy for more than 7½ hours a day. We used a Google spreadsheet for all nominations (Fig. 1).

Fig. 1 Nomination sheet for curators and colleagues from other departments, and a call for contributions.

The Queen’s 80th birthday

On April 16, Queen Margrethe II celebrated her 80th birthday. One of the first things she did after the corona lockdown on March 13 was to cancel all her birthday celebration events. In a way, she set a good example, as everybody was asked not to meet in groups of more than ten people; ideally, we should only socialize with members of our own household.

As part of the Corona event crawl, we collected web activity related to the Queen’s birthday, which mainly consisted of reactions on social media.

The big challenge – capturing social media

Knowledge of the coronavirus Covid-19 changes continuously. Consequently, authorities, public bodies, private institutions, and companies frequently change the information and precaution rules on their webpages. We try to capture as many of these changes as possible. Companies and private individuals offering safety gear for protection against the virus were another facet of the collection. However, capturing all relevant activity on social media was much more challenging than the frequent updates on traditional web pages. Most of the social media platforms use technologies that Heritrix (used by Netarchive for event crawling) is not able to capture.

Fig. 2 The Queen's speech to the Danes on how to cope with the corona crisis. This was only the second time in history (the first was during World War II) that a royal head of state addressed the nation outside the annual New Year's Eve speech.

More or less successfully, we tried to capture content from Facebook, TikTok, Twitter, YouTube, Instagram, Reddit, Imgur, Soundcloud, and Pinterest. Twitter is the platform we are able to crawl with Heritrix with rather good results. We collect Facebook profiles with an account at Archive-It, as they have a better set of tools for capturing Facebook. With frequent quality assurance and follow-ups, we also get rather good results from Instagram, TikTok and Reddit. We capture YouTube videos by crawling the watch URLs with a specific configuration using youtube-dl. One of the collected YouTube videos comes from the Royal family's YouTube channel: the Queen's address to the people on how to behave to prevent or limit the spread of the coronavirus (https://www.youtube.com/watch?v=TZKVUQ-E-UI, Fig. 2).
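The exact Netarchive configuration is not reproduced here, but as a rough sketch of the approach, watch URLs can be fetched with the youtube-dl library along these lines (the format selection and output template are illustrative assumptions):

```python
import youtube_dl  # the youtube-dl package

options = {
    "format": "best",                      # illustrative: download the best single file
    "outtmpl": "harvest/%(id)s.%(ext)s",   # illustrative output path template
    "writeinfojson": True,                 # keep the video metadata next to the file
}

with youtube_dl.YoutubeDL(options) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=TZKVUQ-E-UI"])
```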

As Heritrix has problems with dynamic web content and streaming, we also used Webrecorder.io, although we have not yet implemented this tool in our harvesting setup. However, captures with Webrecorder.io are only drops in the ocean. The use of Webrecorder.io is manual: a curator clicks on all the elements on a page we want to capture. An example is a page on the BBC website, with a video of the reopening of Danish primary schools after the total lockdown (https://www.bbc.com/news/av/world-europe-52649919/coronavirus-inside-a-reopened-primary-school-in-the-time-of-covid-19, Fig. 3). There is still an issue with ingesting the resulting WARC files from Webrecorder.io in our web archive.

Danes produced a range of podcasts on coronavirus issues. We crawled the podcasts we had identified. We get good results when we have a URL for an RSS feed, which we crawl with XML extraction.
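As a simple illustration of that kind of XML extraction (not the production Netarchive setup), the episode URLs of a podcast can be pulled out of an RSS feed like this; the feed URL is a placeholder:

```python
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example.org/podcast/feed.xml"  # placeholder feed URL

with urllib.request.urlopen(FEED_URL) as response:
    tree = ET.parse(response)

# In RSS 2.0, each episode is an <item> whose media file is given by <enclosure url="...">.
episode_urls = [
    enclosure.get("url")
    for item in tree.getroot().iter("item")
    for enclosure in item.findall("enclosure")
]

for url in episode_urls:
    print(url)  # these URLs would then be added to the crawl
```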

Fig. 3 Crawled with Webrecorder.io to get the video.

Capture as much as possible – a broad crawl

Netarchive runs up to four broad crawls a year. We launched our first broad crawl for 2020 right at the beginning of the Danish corona lockdown, on March 14. A broad crawl is an in-depth snapshot of all .dk domains and all other top-level domains (TLDs) where we have identified Danish content. A side benefit of this broad crawl might be getting corona-related content into the archive – content which the curators do not find with their different methods. We identify content both with keyword searches and with a variety of link scraping tools.

Is the coronavirus related web collection of any value to anybody?

In accordance with the Danish personal data protection law, the public has no access to the archived web material. Only researchers affiliated with Danish research institutions can apply for access in connection with specific research projects. We have already received an application for one research project dealing with values in the Covid-19 communication. We hope that our collection will inspire more research projects.

The Croatian Web Archive – what’s new?

The Croatian Web Archive (Hrvatski arhiv weba, HAW), launched in 2004, is open access. To celebrate its 15th anniversary, the National and University Library in Zagreb hosted the IIPC General Assembly and the Web Archiving Conference in June 2019. HAW has been the central point in Croatia for researching website development (.hr domain) and the HAW Team has also been organising training for librarians. One of HAW’s most recent projects was the development of the new portal.


By Karolina Holub, Library Adviser at the Croatian Digital Library Development Centre, Croatian Institute for Librarianship, Ingeborg Rudomino, Senior Librarian at the Croatian Web Archive, & Marta Matijević, Librarian at the Croatian Web Archive (National and University Library in Zagreb)

June 2019 – June 2020

It’s been more than a year since the National and University Library in Zagreb (NSK) hosted the IIPC General Assembly and Web Archiving Conference, which we remember with nostalgia.

Last year was a very busy year for the Croatian Web Archive (HAW) and we would like to share some of the key projects that we have been working on.

New portal design

The highlight of the last period was the launch of the new HAW portal.

Croatian Web Archive (HAW)

It was a complex project that took two years – from the initial idea to the launch of the portal in February 2020. The portal was developed, and is maintained, by NSK website developers and the HAW team, using a customized WordPress theme. Since the new portal had to be integrated with the database of archived content, which is maintained by our partner, the University of Zagreb University Computing Centre (SRCE), a lot of coding was required to connect the portal with the archive database and to ensure that everything works properly and smoothly.

Below you can see snapshots of our previous portals, used from 2006 to 2011 and from 2011 to 2020:

HAW’s website from 2006 until 2011

HAW’s website from 2011 until 2020

So, what’s new?

The most important objective was to put the search box in focus for all types of crawls and give users an easier way to find a resource. Because of the diverse ways of searching, our goal was to make a clear distinction between selective crawls (which are indexed and can be searched by keyword, by any word in the title or URL, or through advanced search) and domain crawls (which can only be searched by entering the full URL). A valuable addition to this version of the portal is the basic metadata that accompanies each resource with a catalogue record available in the portal.


Archived resource with the basic metadata elements (available also via library catalogue)

Additionally, the browsing of subject categories has been expanded with subject subcategories.

The visibility of the thematic collections has been improved by placing them on the title page. A new feature, In Focus, has also been added to highlight some of the most important or interesting events or anniversaries happening in the country, the city or at the Library, in the form of blog posts. This feature is available only in the Croatian version of the portal. The central part of the homepage features the New in HAW and Gone from the web sections, where users can browse publications that are new or that are no longer available on the live web. The About HAW page features a timeline marking all the important dates in the history of HAW.

Some parts of the new portal have largely remained the same with only slight improvements to make them more user-friendly and up to date. More information about Selection criteria, National .hr domain crawls, Statistics, Bibliography, FAQ etc. can be found in the footer.

The portal is also available in English.

New thematic collections

During this one-year period, we have been working on six thematic collections. Some of them are already available and others are still ongoing:

Elections for the President of the Republic of Croatia 2019-2020

At the end of 2019, presidential elections were held in Croatia. The thematic crawl was conducted in January and the content is publicly available as part of this thematic collection.

Rijeka – European Capital of Culture 2020

The Croatian city of Rijeka is the European Capital of Culture 2020. All content related to this event during this challenging time will be harvested. We are still collecting the content.

Croatian Presidency of the Council of the European Union

Croatia chaired the Council of the European Union from January to June 2020. We are finishing this thematic collection and it will soon be publicly available on HAW's portal.

COVID-19

Our largest thematic collection so far is definitely COVID-19, which is still ongoing. We have involved the public in collecting the content by inviting nominations related to the coronavirus. In this thematic collection, we follow events from the onset of the coronavirus in the Republic of Croatia and the world, as featured on Croatian portals, blogs and articles – from the outbreak of the coronavirus, through the general lockdown, to the gradual normalization we are in now.

Archived website (19.03.2020)

2020 Zagreb earthquake

On March 22, just a few days after the start of the coronavirus lockdown in Croatia, Zagreb was hit by its biggest earthquake in 140 years, causing numerous injuries and extensive damage. The Croatian Web Archive immediately started collecting content about this disaster. This thematic collection is publicly available on HAW's portal.

Archived website (15.04.2020) (photo by HINA; Damir Senčar)

2020 Parliamentary Elections

When the spread of the coronavirus was believed to be under control, Croatia held the Parliamentary Elections on July 5. The content for this collection will be collected until the constitution of the new Croatian Parliament.

In May of this year, we started cataloguing thematic collections at the collection level. We have also contributed the Croatian content to the IIPC Coronavirus (Covid-19) Collection.

Annual .hr crawl

In December 2019, we conducted the 9th annual domain crawl and collected 119 million resources amounting to 9.3 TB.

HAW has also started the installation and configuration of tools for indexing and enabling full-text search for domain and thematic crawls: Webarchive-Discovery for parsing and indexing WARC files, Apache Solr for indexing and searching text content, and the SHINE web interface for index search and analysis. We are still in the testing phase and only a part of the existing crawled content is indexed.
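As an illustration of what the full-text index makes possible (the core name and field names below are assumptions, not HAW's actual configuration), a Solr index populated by Webarchive-Discovery can be queried over HTTP:

```python
import requests  # assumes the `requests` package is installed

# Hypothetical Solr core; the core name and document fields depend on the local setup.
SOLR_SELECT = "http://localhost:8983/solr/webarchive/select"

params = {
    "q": "content:potres",   # e.g. a full-text search for "potres" (earthquake)
    "rows": 10,
    "wt": "json",
}

results = requests.get(SOLR_SELECT, params=params).json()
for doc in results["response"]["docs"]:
    print(doc.get("url"), doc.get("crawl_date"))
```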

Testing Web Curator Tool for new collaborative processes – Local Web Crowd crawls

A new development phase is the collaboration with public libraries in crawling their local history collections, for which we are testing the Web Curator Tool. We expect the first results by the end of November this year.

What’s next?

In the coming months, we will be working on enabling more advanced use of HAW's content to better suit researchers, starting with the creation of datasets from HAW collections. We will also prepare guidelines for using archived content on HAW's portal. In addition, we are planning to update our training material in line with the new IIPC training material. In the meantime, we invite you to explore our new portal.

Documenting COVID-19 and the Great Confinement in Canada

By Sylvain Bélanger, Director General, Transition Team, Library and Archives Canada and Treasurer, International Internet Preservation Consortium

It seemed like it happened overnight: suddenly we were told to work from home and limit our physical interactions with people outside our household until further notice. The information was changing and evolving very rapidly, and as we started seeing the rise in COVID-19-related cases globally, the anxiety among colleagues and employees was rising as well. Business rapidly ground to an almost complete halt and only essential services would continue to operate, with strict controls and restrictions.

Spanish Flu and the Great Confinement of 2020

Even during these early days, in these times of uncertainty, a group of individuals saw a parallel between the current situation and the period of the Spanish Flu a century earlier. Thinking ahead to fifty years from now, this group was asking the question: how will future generations know about this period of time – the Great Confinement of 2020, as they may call it, or the time of great creativity, or perhaps the time the Internet became our lifeline? Turning the clock back one hundred years to the period of the Spanish Flu gives us hints. Let's not forget that the tragedy of the early 1900s was documented through newspapers, diaries, photographs, and publications detailing the fight against the Spanish Flu and its aftermath.

In 2020, when social media and websites are the key means citizens use to document events and stay informed, how do we capture such ephemeral output? Does any country have the answer? Isn't that the question we often ask ourselves?

The importance of web archiving

Screenshot from the Public Health Agency of Canada website.

This period has given all of us an opportunity to educate news publishers, citizens, and government decision makers about the work done by web archiving teams across Canada and around the world. The efforts of the IIPC have been pushed to the forefront in this crisis, and have helped us demonstrate the importance of preserving web content for future generations.

In Canada, the work entails coordinating efforts with other governmental institutions as well as with university libraries and provincial/territorial archives to limit duplication of effort. At Library and Archives Canada (LAC), to ensure a proper reflection of Canadian society, we have captured over 662,000 tweets with hashtags such as #covidcanada, #covid19canada, #canadalockdown and #canadacovid19, as part of over 38 million digital assets collected for COVID-19 in 2020. Of that, a little over 87% of the content is non-governmental, from media and non-media web resources selected for the COVID-19 collection. This includes 33 Canadian news and media sites collected daily, to ensure we capture a robust sample of the published news on COVID-19. Added to that are non-media web resources that bring the overall LAC seed list to over 900 resources. Total data collected to date is a little more than 3.09 TB at LAC alone.

Documenting the Canadian response

In addition to our web archiving program, LAC librarians have noticed an increase in books being published about the crisis. That has been measured through our ISBN team observing an increase in authors requesting ISBNs for books about various aspects of the pandemic. In addition, LAC will document the Government of Canada's response to the COVID-19 pandemic through our Government Records Disposition Program. In this way, the government's decision-making on COVID-19 and its impact on Canadians will be acquired and preserved by LAC for present and future generations. Also, our Private Archives personnel are monitoring the activities, responses and reactions of individuals, communities, organizations and associations within their respective portfolios. LAC will endeavour to acquire documents about the pandemic when discussing possible acquisitions with current and potential donors and when evaluating offers. Descriptions in archival fonds will now highlight COVID-19 content where appropriate.

The efforts undertaken to date at LAC are meant to document the Canadian response. Are they enough to help citizens 100 years from now understand the times we were living in, and how we responded to and tackled the challenges of COVID-19? Only time will tell whether this is enough, or whether we need to do more work to truly document the historical times we live in.

IIPC Content Development Group’s activities 2019-2020

By Nicola Bingham, Lead Curator Web Archives, British Library and Co-Chair of the IIPC Content Development Working Group

Introduction

I was delighted to present an update on the Content Development Group's (CDG) activities at the 2020 IIPC General Assembly (GA) on behalf of myself, Alex and the curators who have worked so hard on collaborative collections over the past year.

Socks, not contributing to Web Archiving

Although it was disappointing not to have been in Montreal for the GA and Web Archiving Conference (WAC), there are many advantages to attending a conference remotely. Apart from cost and time savings, it meant that many more staff members from our organisations could attend. I liked the fact that I could see many "old" web archiving friends online, and it did feel like the same friendly, enthusiastic, innovative environment that is normally fostered at IIPC events. I was also delighted to see some of the attendees' pets on screen, although it did highlight that other people's cats are generally much more affectionate than my own, who has, I have to say, contributed little to the field of web archiving over the years, although he did show a mild interest in Warcat.

Several things become clear when tasked with pre-recording a presentation with a time limit of 2 to 3 minutes. Firstly, it is extremely difficult to fit everything you need to say into such a short space of time; secondly, what you do want to say must be tightly scripted – although this does have the advantage that there is no room for pauses or “errs” in a way that can sometimes pepper my in-person presentations. Thirdly, recording even a two-minute video calls for a surprising number of retakes, taking many hours for no apparent reason. Fourthly, naively explaining these facts to the Programme and Communications Officer leads quite seamlessly to the suggestion of writing a blog post in order that one can be more expansive on the points bulleted in the two-minute presentation….

CDG Collection Update

Since our last General Assembly in Zagreb, in June 2019, the CDG has continued working on several established, and two new collections:

  • The International Cooperation Organizations Collection was initiated in 2015 and is led by Alex Thurman of Columbia University Libraries. It previously consisted of all known active websites in the .int top-level domain (available only to organizations created by treaties), but was expanded to include a large group of similar organizations with .org domain hosts, and renamed Intergovernmental Organizations this year. This increased the collection from 163 to 403 intergovernmental organizations, all of which will continue to be crawled each year.
  • The National Olympic and Paralympic Committees collection, led by Helena Byrne of the British Library, was initiated in 2016 and consists of websites of national Olympic and Paralympic committees and associations, as identified from the official listings of these groups found on http://www.olympic.org and http://www.paralympic.org.
  • Online News Around the World, led by Sabine Schostag of the Royal Danish Library. This collection of seeds was first crawled in October 2018 to document a selection of online news from as many countries as possible. It was crawled again in November 2019. The collection was promoted at the Third RESAW Conference, "The web that was: archives, traces, reflections", in Amsterdam in June 2019 and at the IFLA News Media Conference at Universidad Nacional Autónoma de México, Mexico City, in March 2020.
  • New in 2019, the CDG undertook a Climate Change Collection, led by Kees Teszelszky of the National Library of the Netherlands. The first crawl took place in June, with a final crawl shortly after the UN Climate summit in September 2019.
  • New in 2019, a collection on Artificial Intelligence was undertaken between May and December, led by Tiiu Daniel (National Library of Estonia), Liisi Esse (Stanford University Libraries) and Rashi Joshi (Library of Congress).

Coronavirus (Covid-19) Collection

The main collecting activity in 2020 has been around the Covid-19 Global pandemic. This has involved a huge effort by IIPC members with contributions from over 30 members as well as public nominations from over 100 individuals/institutions.

We have been very careful with scoping rules so that we are able to collect a diverse range of content within the data budget – and Archive-It generously increased the data limit for this collection to 5TB. Collecting will continue to run, budget permitting, while the event is of global significance.

Publicly available CDG collections can be viewed on the Archive-It website (https://archive-it.org/home/IIPC) and an overview of the collection statistics can be seen below.

CDG Collection statistics. Figures correct as of 15th June 2020. Slide presented at IIPC GA 17th June 2020.

Researcher-use of Collections

The CDG has worked closely with the Research Working Group co-chairs to promote and facilitate use of the CDG collections, which are now available through the Archives Unleashed Cloud thanks to the Archives Unleashed project. The collections have been analysed and a large number of derivatives are available to researchers at IIPC-led events and/or for research projects. For more information about how to access these collections, please refer to the guidelines.

Next Steps/Getting in touch

We would very much welcome new members to the CDG. We will be having an online meeting in the next couple of months which would be an excellent opportunity to find out more. In the meantime, any IIPC member is welcome to suggest and/or lead on possible 2021 collaborative collections. For more information please contact the co-chairs or the Programme and Communications Officer.

Nicola Bingham & Alex Thurman CDG co-chairs

The CDG Working Group at the 2019 IIPC General Assembly in Zagreb.

From pilot to portal: a year of web archiving in Hungary

National Széchényi Library started a web archiving pilot project in 2017. The aim of the pilot project was to identify the requirements of establishing the Hungarian Internet Archive. In the two years of the pilot phase, some hundred cultural and scientific websites were selected and published with the owners’ permission. The Hungarian Web Archive (MIA) was officially launched in 2017. The Library joined the IIPC in 2018 and the Hungarian Web Archive was first introduced at the General Assembly in Wellington in 2018. Last year, the achievements of the project were presented at the Web Archiving Conference (WAC) in Zagreb, in June 2019. This blog post offers a summary of some key developments since the 2019 conference.


By Márton Németh, Digital librarian at the National Széchényi Library, Hungary

In just about a year, we moved from a pilot project to officially launching our web archive, running a comprehensive crawl and creating special collections. In May 2020, the Hungarian parliament passed modifications to the Cultural Law which allow us to run web archiving activities as part of the Library's basic service portfolio. Over the past year we have also organised training and participated in various collaborative initiatives.

Conferences and collaborations

In the summer, just after the Zagreb conference, we exchanged experiences with our Czech and Slovak colleagues about the current status and major development points of web archiving projects in the Czech Republic, Slovakia and Hungary at the Visegrad 4 Library Conference in Bratislava. Our presentation is available here. In the autumn, at the annual international digital preservation conference in Bratislava, we elaborated on our thoughts about the potential use of microdata in a library environment. That presentation can be downloaded here.

At the Digital Humanities 2020 conference in Budapest, Hungary, we organized a whole web archiving session with presentations and panel discussions, together with Marie Haskovcová from the Czech National Library, Kees Teszelszky from the National Library of the Netherlands, Balázs Indig from the Digital Humanities Research Centre of Eötvös Loránd University and Márton Németh from the National Széchényi Library. The main aim was to put a spotlight on Digital Humanities research activities in the web archiving context. Our presentation is available here.

Training

Our annual workshop in the National Széchényi Library focused on the metadata enrichment of web archives, crawling and managing local web content in university library and city library environments, crawling and managing online newspaper articles and setting the limits of web archiving in research library environments.

We also ran several accredited training courses for Hungarian librarians and summarized our experiences in the field of web archiving education in an article published by Emerald. Membership in the IIPC Training Working Group has offered us valuable experience in this field.

Domain crawl and new portal

We ran our second comprehensive harvest of a large segment of the Hungarian web domain at the end of 2019. The crawler started from 246,819 seed addresses and crawled 110 million URLs in less than eight days, using 6.4 TB of storage.

Our original project website was the first repository of resources related to web archiving in Hungarian. In 2019 we built a new portal. This new website serves as a knowledge base for the web archiving field in Hungary. Beyond introducing the web archive and the project, it offers separate groups of resources (info materials, documents, etc.) for everyday users, content owners, professional experts and journalists. It is available at https://webarchivum.oszk.hu.

webarchivum.oszk.hu

We created a new sub-collection in 2019-2020 on the Francis II Rákóczi Memorial Year at the National Széchényi Library (NSZL), within the framework of the Public Collection Digitization Strategy. Its primary goal was to present the technology of web archiving and the integration of the web archive with other digital collections through a demo application. The content focuses on webpages and websites related to the Memorial Year, the War of Independence, the Prince and his family. Furthermore, it contains born-digital or digitized books from the Hungarian Electronic Library, articles from the Electronic Periodical Archives, and photos, illustrations and other visual documents from the Digital Archive of Pictures. The service is available at the following address: http://rakoczi2019.webarchivum.oszk.hu.

rakoczi2019.webarchivum.oszk.hu

Legislation and new collections

In May 2020, the Hungarian parliament passed modifications to the Cultural Law that entitle the National Széchényi Library to run web archiving activities as part of its basic service portfolio. Legal deposit of web materials will also be established. The corresponding governmental and ministerial decrees will appear soon, and all the law modifications and decrees will come into effect on 1 January 2021.

We made our first experiment in harvesting material from Instagram, capturing 700 pages with more than 100,000 posts using the Webrecorder software. We are also running event-based harvests on COVID-19, the Summer Olympic Games and the Paris Peace Conference (1919-1920), and we are joining the corresponding international IIPC collaborative collection development projects.

Next steps

Supported by the framework of the Public Collection Digitization Strategy, we have started to develop a collaboration network with various regional libraries in Hungary in order to collect local materials for the Hungarian Web Archive. We hope to summarize our first experiences during our next annual workshop in the autumn and to further develop our joint collection activities.

Luxembourg Web Archive – Coronavirus Response

By Ben Els, Digital Curator, The National Library of Luxembourg

The National Library of Luxembourg has been harvesting the Luxembourg web under digital legal deposit since 2016. In addition to the large-scale domain crawls, the Luxembourg Web Archive also operates targeted crawls aimed at specific subjects or events. During the past weeks and months, the global coronavirus pandemic has confronted society with unprecedented challenges. While large parts of our professional and social lives had to move even further online, the need to capture and document the implications of this crisis on the Internet has seen enormous support in all domains of society. While it is safe to admit that web archiving is still a relatively unknown concept to most people in Luxembourg (and probably in other countries too), it is also safe to say that we have never seen a better case to illustrate the necessity of web archiving and to ask for support in this overwhelming challenge.

webarchive.lu

Media and communities

At the National Library, we started our coronavirus collection on March 16th, when there were 81 known cases in Luxembourg. While we have been harvesting websites in several event crawls over the past three years, it was clear from the start that the amount of information to be captured would surpass any other subject by a great deal. Therefore, we decided to ask for support from the Luxembourg news media by asking them to send us lists of related news articles from their websites. This appeal to editors quickly evolved into a call for participation to the general public, asking all communities, associations and civil interest groups to share their responses and online information about the crisis. Addressing the news media in the first place gave us great support in spreading the word about the collection. Part of our approach to building an event collection is to follow the news and take in information about new developments and publications from different organisations and persons of interest. As the flow and high-paced rhythm of new public information and support was vital to many communities, we also had to try and keep up with new websites, support groups and solidarity platforms being launched every day. However, many of these initiatives are not covered equally in the news or on social media, a situation made even more complicated by Luxembourg's multilingual makeup. We learned about the challenges the government and administrations face in conveying important and urgent information in four or five languages at a time: Luxembourgish, French, German, English and Portuguese. The same goes for news and social media and, as a result, for the Luxembourg Web Archive. We were therefore grateful to receive contributions from organisations which we would not have thought of including ourselves, and which were not talked about as much in the news.

© The Luxembourg Government

Effort and resources

While the need and support for web archiving exploded during March and April, it was also clear that the standard resources allocated to the yearly operations of the web archive would not suffice to respond to the challenge in front of us. The National Library was able to increase our efforts by securing additional funding, which allowed us to launch an impromptu domain crawl and to expand the data budget on Archive-It crawls. We are all aware of the uphill battle in communicating the benefits of archiving the web. There is a feeling that, while people generally agree on the necessity of preserving websites, in most cases there is little sense of urgency or immediate requirement – since, after all, most everyday changes are perceived as corrections of mistakes or improvements on previous versions. In my opinion, the case of coronavirus-related websites made the idea of web archiving as a service and obligation to society much clearer and easier to convey.

© Ministry of Health

Private and public

The Web offers many spaces and facets for personal expression and communication. While social media have played a crucial part in helping people to deal with the crisis, web archives face some of their biggest challenges in harvesting and preserving social media. Alongside the technical difficulties and the enormous related costs, there is the question of ethics in collecting content which is not 100% private, but also not 100% public. For instance, in Luxembourg, many support groups launched on Facebook, where people could ask questions about the current situation and about new developments in terms of what is allowed, and find help and comfort for their uncertainties. There are several active groups in every language, even some dedicated to districts of the city, with neighbours looking after each other. While it is important to try to capture all facets of an event (especially if this information is unique to the Internet), I am uncertain whether it is ethical to capture the questions, comments and conversations of people in vulnerable situations. Even though there are sometimes thousands of members per group and pretty much everyone can join, these groups are not fully open to the public.

Collecting and sharing

covidmemory.lu

Besides the large-scale crawls and the Archive-It collection, we also contributed part of our seed list to the IIPC's collaborative Novel Coronavirus collection, led by the Content Development Working Group. Of course, the National Library did not limit its response to archiving websites. With our call for participation, we also received a variety of physical and digital documents, mainly from municipalities and public administrations, which submitted numerous documents issued to the public in relation to the reorganisation of public services and the temporary restrictions on social life.

We also received some unexpected contributions, in the form of poems, essays and short diary entries written during confinement, describing and reflecting upon the current situation from a very personal angle. Likewise, a researcher shared his private bibliometric analysis of scientific literature about the Coronavirus. Furthermore, the University of Luxembourg’s Centre for Contemporary and Digital History has launched the sharing platform covidmemory.lu, enabling ordinary people living or working in Luxembourg to share their photos, videos, stories and interviews related to COVID-19.

Web Archiving Week 2021

Since the 2021 edition of the IIPC Web Archiving Conference will be part of Web Archiving Week, organised in partnership with the University of Luxembourg and the RESAW network, I am not going to spoil too much of the program; suffice it to say that we will continue exploring these shared efforts and responses during the week of June 14th-18th, 2021. We are looking forward to welcoming you all to Luxembourg!

The Future of Playback

By Kristinn Sigurðsson, Head of IT at the National and University Library of Iceland and the Lead of the IIPC Tools Development Portfolio

It is difficult to overstate the importance of playback in web archiving. While it is possible to evaluate and make use of a web archive via data mining, text extraction and analysis, and so on, the ability to present the captured content in its original form, enabling human inspection of the pages, remains essential. A good playback tool opens up a wide range of practical use cases for the general public, facilitates non-automated quality assurance efforts and (sometimes most importantly) creates a highly visible "face" for our efforts.

OpenWayback

Over the last decade or so, most IIPC members who operate their own web archives in-house have relied on OpenWayback, even before it acquired that name. Recognizing the need for a playback tool and the prevalence of OpenWayback, the IIPC has been supporting OpenWayback in a variety of ways over the last five or six years. Most recently, Lauren Ko (UNT), a co-lead of the IIPC's Tools Development Portfolio, has shepherded work on OpenWayback and pushed out new releases (thanks Lauren!).

Unfortunately, it has been clear for some time that OpenWayback would require a ground-up rewrite if it were to be continued. The software, now almost a decade and a half old, is complicated and archaic. Adding features is nearly impossible, and even bug fixes often require exceptional effort. This has led to OpenWayback falling behind as web material evolves, its replay fidelity fading.

As there was no prospect for the IIPC to fund a full rewrite, the Tools Development Portfolio, along with other interested IIPC members, began to consider alternatives. As far as we could see, there was only one viable contender on the market, Pywb.

Survey

Last fall, the IIPC sent out a survey to our members to get some insight into the playback software currently being used, plans to transition to Pywb, and the key roadblocks preventing IIPC members from adopting Pywb. The IIPC also organised online calls for members and got feedback from institutions that had already adopted Pywb.

Unsurprisingly, these consultations with the membership confirmed the – current – importance of OpenWayback. The results also showed a general interest in adopting Pywb, whilst highlighting a number of likely hurdles our members face in making that change. Consequently, we decided to move ahead with the decision to endorse Pywb as a replay solution and to work to support IIPC members' adoption of Pywb.

The members of the IIPC’s Tools Development Portfolio then analyzed the results of the survey and, in consultation with Ilya Kreymer, came up with a list of requirements that, once met, would make it much easier for IIPC members to adopt Pywb. These requirements were then divided into three work packages to be delivered over the next year.

Pywb

Over the last few years, Pywb has emerged as a capable alternative to OpenWayback. In some areas of playback it is better than, or at least comparable to, OpenWayback, having been updated to account for recent developments in web technology. As Pywb is more modern and still actively maintained, the gap between it and OpenWayback is only likely to grow. As it is also open source, it makes a reasonable alternative for the IIPC to support as the new "go-to" replay tool.

However, while Pywb's replay abilities are impressive, it is far from a drop-in replacement for OpenWayback. Notably, OpenWayback offers more customization and localization support than Pywb. There are also many differences between the two tools that make migration from one to the other difficult.

To address this, the IIPC has signed a contract with Ilya Kreymer, the maintainer of the web archive replay tool Pywb. The IIPC will be providing financial support for the development of key new features in Pywb.

Planned work

The first work package will focus on developing a detailed migration guide for existing OpenWayback users. This will include example configuration for common cases and cover diverse backend setups (e.g. CDX vs. ZipNum vs. OutbackCDX).

The second package will include Pywb improvements to make it more modular, extended support and documentation for localization, and extended access control options.

The third package will focus on customization and integration with existing services. It will also bring in some improvements to the Pywb “calendar page” and “banner”, bringing to them features now available in OpenWayback.

There is clearly more work that can be done on replay. The ever fluid nature of the web means we will always be playing catch-up. As work progresses on the work packages mentioned above, we will be soliciting feedback from our community. Based on that, we will consider how best to meet those challenges.

Resources:

Launching IIPC training programme

By Olga Holownia, IIPC Programme and Communications Officer

It is no longer uncommon for heritage institutions, particularly national and university libraries, to employ web archivists and web curators. It is, however, still rather unusual for the librarians or archivists who hold these positions to have received any formal training in web archiving before joining web archiving teams. The majority of participants in a small poll organised during a workshop on training new starters in web archiving at last year's Web Archiving Conference in Zagreb confirmed that they received 'on-the-job training'.

Varying approaches, similar needs

Discussions during the 2016 IIPC General Assembly in Ottawa led to the conclusion that while IIPC members have varying approaches to web archiving, reflecting their own institutional mandates, legal contexts and technical infrastructures, they all need technical and curatorial training – for practitioners and for researchers. This inspired the creation of the IIPC Training Working Group (TWG), initiated by Tom Cramer, with participation open to the global web archiving community. The TWG, co-chaired by Tom, Abbie Grotke and Maria Praetzellis, was tasked with creating training materials.

The TWG's first activities included a comprehensive overview of existing curricula as well as a survey to assess the current level of training needs. The results of both helped inform the decisions behind the content of the training modules that were introduced at last year's conference. We are delighted to announce that the beginner's training is now available on the IIPC website.

Who is the training for?

This training programme is aimed at practitioners (including new starters), curators, policy makers and managers, or anyone who would like to learn the following: what web archives are, how they work, how to curate web archive collections, how to acquire basic skills in capturing web content, and how to plan and implement a web archiving programme.

This course contains eight sessions and comprises presentation slides and speakers’ notes. Each module starts with an introduction which outlines the learning objectives, target audience and includes information about the way the slides can be customised as well as a comprehensive list of related resources and tools. Published under a CC licence, the training materials can be fully customised and modified by the users.

Video Case Studies

The TWG used the opportunity of the annual gathering of the web archiving community to complement the training material with video case studies. Alex Osborne, Jessica Cebra, Mark Phillips, Eléonore Alquier, Daniel Gomes, Mar Pérez Morillo, Ben Els and Yves Maurer, representing seven IIPC member institutions from around the world, speak about their experiences of becoming involved in web archiving. They also share their knowledge on organisational approaches, collaborations, collecting policies and access, as well as the evolution and challenges of web archiving.

Try it and tell us what you think!

This training is freely available to download and we encourage you to experiment with customising it for your trainees. We are also interested in how you use it, so please give us feedback by filling out this short form.

Acknowledgements

As with many other IIPC projects, the creation of the training material was a collaborative effort and we thank everyone who has been involved. The project was launched by Tom Cramer of Stanford University Libraries, Abbie Grotke of the Library of Congress and Maria Praetzellis of the California Digital Library (previously the Internet Archive). Maria Ryan of the National Library of Ireland and Claire Newing of The National Archives UK, who took over as co-chairs at the last General Assembly, led the project to completion together with Abbie. The group of 38 volunteers who form the TWG were involved at various stages of the project, starting with the surveys, followed by brainstorming sessions, many rounds of extensive feedback and the final phase of preparing the materials for publication. A special thank you to Samantha Abrams, Jefferson Bailey, Helena Byrne, Friedel Geeraert, Márton Németh, Anna Perricci, and the participants of the video case studies.

The beginner’s training materials were produced in partnership with the Digital Preservation Coalition (DPC) and we would particularly like to thank Sharon McMeekin, Head of Training and Skills and Sara Day Thomson of University of Edinburgh (previously DPC).

Resources