Robustify your links! A working solution to create persistently robust links

By Martin Klein, Scientist in the Research Library at Los Alamos National Laboratory (LANL), Shawn M. Jones, Ph.D. student and Graduate Research Assistant at LANL, Herbert Van de Sompel, Chief Innovation Officer at Data Archiving and Network Services (DANS), and Michael L. Nelson, Professor in the Computer Science Department at Old Dominion University (ODU).

Links on the web break all the time. We frequently experience the infamous “404 – Page not found” message, also known as “a broken link” or “link rot.” Sometimes we follow a link and discover that the linked page has significantly changed and its content no longer represents what was originally referenced, a scenario known as “content drift.” Both link rot and content drift are forms of “reference rot”, a significant detriment to our web experience. In the realm of scholarly communication where we increasingly reference web resources such as blog posts, source code, videos, social media posts, datasets, etc. in our manuscripts, we recognize that we are losing our scholarly record to reference rot.

Robust Links background

As part of The Andrew W. Mellon Foundation-funded Hiberlink project, the Prototyping team of the Los Alamos National Laboratory’s Research Library, together with colleagues from EDINA and the Language Technology Group of the University of Edinburgh, developed the Robust Links concept a few years ago to address this problem. Given the renewed interest in the digital preservation community, we have now collaborated with colleagues from DANS and the Web Science and Digital Libraries Research Group at Old Dominion University on a service that makes creating Robust Links straightforward. To create a Robust Link, we need to:

  1. Create an archival snapshot (memento) of the link URL and
  2. Robustify the link in our web page by adding a couple of attributes to the link.

Robust Links creation

The first step can be done by submitting a URL to a proactive web archiving service such as the Internet Archive’s “Save Page Now”, Perma.cc, or archive.today. The second step guarantees that the link retains the original URL, the URL of the archived snapshot (memento), and the datetime of linking. We detail this step in the Robust Links specification. With both done, we truly have robust links with multiple fallback options. If the original link on the live web is subject to reference rot, readers can access the memento from the web archive. If the memento itself is unavailable, for example, because the web archive is temporarily out of service, we can use the original URL and the datetime of linking to locate another suitable memento in a different web archive. The Memento protocol and infrastructure provide a federated search that seamlessly enables this sort of lookup.
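To illustrate the second step, the sketch below composes an anchor element carrying the attributes defined in the Robust Links specification: href holds the original URL, data-versionurl the memento URL, and data-versiondate the datetime of linking. The helper function itself is ours, for illustration only.

```python
# Compose a Robust Link per the Robust Links specification's attributes;
# this helper is illustrative, not part of any Robust Links library.
def robust_link(original_url, memento_url, version_date, text):
    """Return an HTML anchor carrying the Robust Links attributes."""
    return (f'<a href="{original_url}" '
            f'data-versionurl="{memento_url}" '
            f'data-versiondate="{version_date}">{text}</a>')

print(robust_link(
    'http://example.com/page',
    'https://web.archive.org/web/20200615120000/http://example.com/page',
    '2020-06-15',
    'an example page'))
```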

Robust Links web service.

To make Robust Links more accessible to everyone, we provide a web service that makes it easy to create them. To “robustify” your links, submit the URL of your HTML link to the web form, optionally specify a link text, and click “Robustify”. The Robust Links service creates a memento of the provided URL with either the Internet Archive or archive.today (the selection is made randomly). To increase robustness, the service utilizes multiple publicly available web archives, and we are working to include additional web archives in the future. From the result page after submitting the form, copy the HTML snippet for your robust link (shown as step 1 on the result page) and paste it into your web page. To make robust links actionable in a web browser, you need to include the Robust Links JavaScript and CSS in your page. We make this easy by providing an HTML snippet (step 2 on the result page) that you can copy and paste inside the HEAD section of your page.

Robust Links web service result page.

Robust Links sustainability

During the implementation of this service, we identified two main concerns regarding its sustainability. The first issue is the reliable inclusion of the Robust Links JavaScript and CSS that make Robust Links actionable. Specifically, we were looking for a feasible approach to improve the chances that both files are available in the long term, can continuously be maintained, and that their URIs persistently resolve to the latest versions. Our approach is two-fold:

  1. we moved the source files into the IIPC GitHub repository so they can be maintained (and versioned) by the community and served with the correct MIME type via GitHub Pages, and
  2. we minted two Digital Object Identifiers (DOIs) with DataCite, one to resolve to the latest version of the Robust Links JavaScript and the other to the CSS.

The other sustainability issue relates to the Memento infrastructure used to automatically access mementos across web archives (the second fallback mentioned above). Here, LANL and ODU, both IIPC member organizations, continue to maintain the Memento infrastructure.
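As an illustration of that second fallback, the sketch below asks the Memento Time Travel service for the memento closest to the datetime of linking, across many web archives. The /api/json/&lt;datetime&gt;/&lt;uri&gt; URL pattern follows the public Time Travel API; the response fields shown are our reading of its documentation and should be verified against it.

```python
import requests

# Find the memento closest to the datetime of linking, across archives
# (field names per the public Time Travel API docs; verify before use).
resp = requests.get('http://timetravel.mementoweb.org/api/json/'
                    '20200615120000/http://example.com/page')
if resp.ok:
    closest = resp.json()['mementos']['closest']
    print(closest['datetime'], closest['uri'])
```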

Because of limitations with the WordPress platform, we unfortunately cannot demonstrate robust links in this blog post. However, we created a copy with robustified links hosted at https://robustlinks.mementoweb.org/demo/IIPC/robust_links_blog.html. In addition, our Robust Links demo page showcases how robust links are actionable in a browser via the included CSS and JavaScript. We also created an API for machine access to our Robust Links service.
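As a rough sketch of such machine access, a client might call the service along the following lines. The endpoint path and the url parameter are our assumptions; the authoritative request and response contract is in the API documentation listed under Relevant URIs below.

```python
import requests

# Hypothetical call to the Robust Links API; the endpoint path and
# parameter name are assumptions -- see the API docs for the contract.
resp = requests.get('https://robustlinks.mementoweb.org/api/',
                    params={'url': 'http://example.com/page'})
resp.raise_for_status()
# The JSON response should include the memento created for the URL and
# ready-made HTML snippets for the robust link; inspect it to see the
# actual field names.
print(resp.json())
```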

Robust Links in action.

Acknowledgements and feedback

Lastly, we would like to thank DataCite for granting two DOIs to the IIPC for this effort at no cost. We are also grateful to ODU’s Karen Vaughan for her help minting the DOIs.

For feedback/comments/questions, please do not hesitate to get in touch (martinklein0815[at]gmail.com)!

Relevant URIs

https://robustlinks.mementoweb.org/
https://robustlinks.mementoweb.org/about/
https://robustlinks.mementoweb.org/spec/
https://robustlinks.mementoweb.org/api-docs/

IIPC Content Development Group’s activities 2019-2020

By Nicola Bingham, Lead Curator Web Archives, British Library and Co-Chair of the IIPC Content Development Working Group

Introduction

I was delighted to present an update on the Content Development Group’s (CDG) activities at the 2020 IIPC General Assembly (GA) on behalf of myself, Alex and the curators who have worked so hard on collaborative collections over the past year.

Socks, not contributing to Web Archiving

Although it was disappointing not to have been in Montreal for the GA and Web Archiving Conference (WAC), there are many advantages to attending a conference remotely. Apart from the cost and time savings, it meant that many more staff members from our organisations could attend. I liked the fact that I could see many “old” web archiving friends online, and it did feel like the same friendly, enthusiastic, innovative environment that is normally fostered at IIPC events. I was also delighted to see some of the attendees’ pets on screen, although it did highlight that other people’s cats are generally much more affectionate than my own, who has, I have to say, contributed little to the field of web archiving over the years, although he did show a mild interest in Warcat.

Several things become clear when tasked with pre-recording a presentation with a time limit of 2 to 3 minutes. Firstly, it is extremely difficult to fit everything you need to say into such a short space of time. Secondly, what you do want to say must be tightly scripted – although this does have the advantage that there is no room for the pauses or “errs” that can sometimes pepper my in-person presentations. Thirdly, recording even a two-minute video calls for a surprising number of retakes, taking many hours for no apparent reason. Fourthly, naively explaining these facts to the Programme and Communications Officer leads quite seamlessly to the suggestion of writing a blog post in order that one can be more expansive on the points bulleted in the two-minute presentation…

CDG Collection Update

Since our last General Assembly in Zagreb in June 2019, the CDG has continued working on several established collections, as well as two new ones:

  • The International Cooperation Organizations Collection was initiated in 2015 and is led by Alex Thurman of Columbia University Libraries. It previously consisted of all known active websites in the .int top-level domain (available only to organizations created by treaties), but was expanded to include a large group of similar organizations with .org domain hosts, and renamed Intergovernmental Organizations this year. This increased the collection from 163 to 403 intergovernmental organizations, all of which will continue to be crawled each year.
  • The National Olympic and Paralympic Committees collection, led by Helena Byrne of the British Library, was initiated in 2016 and consists of the websites of national Olympic and Paralympic committees and associations, as identified from the official listings of these groups on http://www.olympic.org and http://www.paralympic.org.
  • Online News Around the World, led by Sabine Schostag of the Royal Danish Library. This collection of seeds was first crawled in October 2018 to document a selection of online news from as many countries as possible; it was crawled again in November 2019. The collection was promoted at the Third RESAW Conference, “The web that was: archives, traces, reflections”, in Amsterdam in June 2019 and at the IFLA News Media Conference at Universidad Nacional Autónoma de México, Mexico City, in March 2020.
  • New in 2019, the CDG undertook a Climate Change Collection, led by Kees Teszelszky of the National Library of the Netherlands. The first crawl took place in June, with a final crawl shortly after the UN Climate summit in September 2019.
  • New in 2019, a collection on Artificial Intelligence was undertaken between May and December, led by Tiiu Daniel (National Library of Estonia), Liisi Esse (Stanford University Libraries) and Rashi Joshi (Library of Congress).

Coronavirus (Covid-19) Collection

The main collecting activity in 2020 has been around the Covid-19 global pandemic. This has involved a huge effort, with contributions from over 30 IIPC members as well as public nominations from over 100 individuals and institutions.

We have been very careful with scoping rules so that we are able to collect a diverse range of content within the data budget – and Archive-It generously increased the data limit for this collection to 5TB. Collecting will continue, budget permitting, while the event remains of global significance.

Publicly available CDG collections can be viewed on the Archive-It website at https://archive-it.org/home/IIPC, and an overview of the collection statistics can be seen below.

CDG Collection statistics. Figures correct as of 15th June 2020. Slide presented at IIPC GA 17th June 2020.

Researcher-use of Collections

The CDG has worked closely with the Research Working Group co-chairs to promote and facilitate use of the CDG collections, which are now available through the Archives Unleashed Cloud thanks to the Archives Unleashed project. The collections have been analysed, and a large number of derivatives are available to researchers at IIPC-led events and/or for research projects. For more information about how to access these collections, please refer to the guidelines.

Next Steps/Getting in touch

We would very much welcome new members to the CDG. We will be having an online meeting in the next couple of months, which will be an excellent opportunity to find out more. In the meantime, any IIPC member is welcome to suggest and/or lead on possible 2021 collaborative collections. For more information, please contact the co-chairs or the Programme and Communications Officer.

Nicola Bingham & Alex Thurman, CDG co-chairs

The CDG Working Group at the 2019 IIPC General Assembly in Zagreb.

Luxembourg Web Archive – Coronavirus Response

By Ben Els, Digital Curator, The National Library of Luxembourg

The National Library of Luxembourg has been harvesting the Luxembourg web under digital legal deposit since 2016. In addition to the large-scale domain crawls, the Luxembourg Web Archive also operates targeted crawls aimed at specific subjects or events. During the past weeks and months, the global Coronavirus pandemic has confronted society with unprecedented challenges. While large parts of our professional and social lives had to move even further online, the need to capture and document the implications of this crisis on the Internet has seen enormous support in all domains of society. While it is safe to admit that web archiving is still a relatively unknown concept to most people in Luxembourg (and probably in other countries too), it is also safe to say that we have never seen a better case to illustrate the necessity of web archiving and to ask for support with this overwhelming challenge.

webarchive.lu

Media and communities

At the National Library, we started our Coronavirus collection on March 16th, when there were 81 known cases in Luxembourg. While we have been harvesting websites in several event crawls over the past three years, it was clear from the start that the amount of information to be captured would surpass any other subject by a great deal. We therefore decided to ask the Luxembourg news media for support, asking them to send us lists of related news articles from their websites. This appeal to editors quickly evolved into a call for participation to the general public, asking all communities, associations and civil interest groups to share their responses and online information about the crisis. Addressing the news media in the first place gave us great support in spreading the word about the collection. Part of our approach to building an event collection is to follow the news and take in information about new developments and publications of different organisations and persons of interest. As the flow and high-paced rhythm of new public information and support was vital to many communities, we also had to try to keep up with new websites, support groups and solidarity platforms being launched every day. However, many of these initiatives are not covered equally in the news or social media, a situation made even more complicated by Luxembourg’s multilingual makeup. We learned about the challenges the government and administrations face in conveying important and urgent information in four or five languages at a time: Luxembourgish, French, German, English and Portuguese. The same goes for news and social media and, as a result, for the Luxembourg Web Archive. We were therefore grateful to receive contributions from organisations that we would not have thought of including ourselves and that were not talked about as much in the news.

© The Luxembourg Government

Effort and resources

While the need and support for web archiving exploded during March and April, it was also clear that the standard resources allocated to the yearly operations of the web archive would not suffice to respond to the challenge in front of us. The National Library was able to increase its efforts by securing additional funding, which allowed us to launch an impromptu domain crawl and to expand the data budget on Archive-It crawls. We are all aware of the uphill battle in communicating the benefits of archiving the web. There is a feeling that, while people generally agree on the necessity of preserving websites, in most cases there is little sense of urgency or immediate requirement – since, after all, most everyday changes are perceived as corrections of mistakes or improvements on previous versions. In my opinion, the case of Coronavirus-related websites made the idea of web archiving as a service and obligation to society much clearer and easier to convey.

© Ministry of Health

Private and public

The Web offers many spaces and facets for personal expression and communication. While social media have played a crucial part in helping people deal with the crisis, web archives face some of their biggest challenges in harvesting and preserving social media. Alongside the technical difficulties and the enormous related costs, there is the question of the ethics of collecting content which is not 100% private, but also not 100% public. For instance, in Luxembourg, many support groups launched on Facebook, where people could ask their questions about the current situation and new developments in terms of what was allowed, find help, and seek comfort for their uncertainties. There are several active groups in every language, even some dedicated to districts of the city, with neighbours looking after each other. While it is important to try to capture all facets of an event (especially if this information is unique to the Internet), I am uncertain whether it is ethical to capture the questions, comments and conversations of people in vulnerable situations. Even though there are sometimes thousands of members per group and pretty much everyone can join, these groups are not fully open to the public.

Collecting and sharing

covidmemory.lu

Besides the large-scale crawls and the Archive-It collection, we also contributed part of our seed list to the IIPC’s collaborative Novel Coronavirus collection, led by the Content Development Working Group. Of course, the National Library did not limit its response to archiving websites. Through our call for participation, we also received a variety of physical and digital documents, mainly from municipalities and public administrations, which submitted numerous documents issued to the public in relation to the reorganisation of public services and the temporary restrictions on social life.

We also received some unexpected contributions, in the form of poems, essays and short diary entries written during confinement, describing and reflecting upon the current situation from a very personal angle. Likewise, a researcher shared his private bibliometric analysis of scientific literature about the Coronavirus. Furthermore, the University of Luxembourg’s Centre for Contemporary and Digital History has launched the sharing platform covidmemory.lu, enabling ordinary people living or working in Luxembourg to share their photos, videos, stories and interviews related to COVID-19.

Web Archiving Week 2021

The 2021 edition of the IIPC Web Archiving Conference will be part of Web Archiving Week, organised in partnership with the University of Luxembourg and the RESAW network. Without spoiling too much of the programme, I can say that we will continue exploring these shared efforts and responses during the week of June 14th – 18th 2021. We are looking forward to welcoming you all to Luxembourg!

The Future of Playback

By Kristinn Sigurðsson, Head of IT at the National and University Library of Iceland and the Lead of the IIPC Tools Development Portfolio

It is difficult to overstate the importance of playback in web archiving. While it is possible to evaluate and make use of a web archive via data mining, text extraction and analysis, and so on, it is the ability to present the captured content in its original form that enables human inspection of the pages. A good playback tool opens up a wide range of practical use cases for the general public, facilitates non-automated quality assurance efforts and (sometimes most importantly) creates a highly visible “face” for our efforts.

OpenWayback

Over the last decade or so, most IIPC members who operate their own web archives in-house have relied on OpenWayback, even before it acquired that name. Recognizing the need for a playback tool and the prevalence of OpenWayback, the IIPC has been supporting OpenWayback in a variety of ways over the last five or six years. Most recently, Lauren Ko (UNT), a co-lead of the IIPC’s Tools Development Portfolio, has shepherded work on OpenWayback and pushed out new releases (thanks Lauren!).

Unfortunately, it has been clear for some time that OpenWayback would require a ground-up rewrite if it were to be continued. The software, now almost a decade and a half old, is complicated and archaic. Adding features is nearly impossible, and even bug fixes often require exceptional effort. This has led to OpenWayback falling behind as web material evolves, its replay fidelity fading.

As there was no prospect for the IIPC to fund a full rewrite, the Tools Development Portfolio, along with other interested IIPC members, began to consider alternatives. As far as we could see, there was only one viable contender on the market: pywb.

Survey

Last fall the IIPC sent out a survey to our members to gain some insight into the playback software currently in use, plans to transition to pywb, and the key roadblocks preventing IIPC members from adopting pywb. The IIPC also organised online calls for members and got feedback from institutions that had already adopted pywb.

Unsurprisingly, these consultations with the membership confirmed the current importance of OpenWayback. The results also showed a general interest in adopting pywb, whilst highlighting a number of likely hurdles our members faced in that change. Consequently, we decided to move ahead with the decision to endorse pywb as a replay solution and to work to support IIPC members’ adoption of pywb.

The members of the IIPC’s Tools Development Portfolio then analyzed the results of the survey and, in consultation with Ilya Kreymer, came up with a list of requirements that, once met, would make it much easier for IIPC members to adopt pywb. These requirements were then divided into three work packages to be delivered over the next year.

pywb

Over the last few years, pywb has emerged as a capable alternative to OpenWayback. In some areas of playback it is better than, or at least comparable to, OpenWayback, having been updated to account for recent developments in web technology. Because pywb is more modern and still actively maintained, the gap between it and OpenWayback is only likely to grow. As it is also open source, it makes a reasonable alternative for the IIPC to support as the new “go-to” replay tool.

However, while pywb’s replay abilities are impressive, it is far from a drop-in replacement for OpenWayback. Notably, OpenWayback offers more customization and localization support than pywb. There are also many differences between the two tools that make migration from one to the other difficult.

To address this, the IIPC has signed a contract with Ilya Kreymer, the maintainer of the web archive replay tool pywb. The IIPC will be providing financial support for the development of key new features in pywb.

Planned work

The first work package will focus on developing a detailed migration guide for existing OpenWayback users. This will include example configurations for common cases and cover diverse backend setups (e.g. CDX vs. ZipNum vs. OutbackCDX).

The second package will include pywb improvements to make it more modular, extended support and documentation for localization, and extended access control options.

The third package will focus on customization and integration with existing services. It will also bring improvements to the pywb “calendar page” and “banner”, adding features now available in OpenWayback.

There is clearly more work that can be done on replay. The ever-fluid nature of the web means we will always be playing catch-up. As work progresses on the work packages mentioned above, we will be soliciting feedback from our community. Based on that, we will consider how best to meet those challenges.


Launching IIPC training programme

By Olga Holownia, IIPC Programme and Communications Officer

It is no longer uncommon for heritage institutions, particularly national and university libraries, to employ web archivists and web curators. It is, however, still rather unusual for the librarians or archivists who hold these positions to have received any formal training in web archiving before joining a web archiving team. The majority of participants in a small poll organised during a workshop on training new starters in web archiving at last year’s Web Archiving Conference in Zagreb confirmed that they had received ‘on-the-job training’.

Varying approaches, similar needs

Discussions during the 2016 IIPC General Assembly in Ottawa led to the conclusion that while IIPC members have varying approaches to web archiving, reflecting their own institutional mandates, legal contexts and technical infrastructures, they all need technical and curatorial training – for practitioners and for researchers. This inspired the creation of the IIPC Training Working Group (TWG), initiated by Tom Cramer, with participation open to the global web archiving community. The TWG, co-chaired by Tom, Abbie Grotke and Maria Praetzellis, was tasked with creating training materials.

The TWG’s first activities included a comprehensive overview of existing curricula as well as a survey to assess the current level of training needs. The results of both helped inform the decisions behind the content of the training modules that were introduced at last year’s conference. We are delighted to announce that the beginner’s training is now available on the IIPC website.

Who is the training for?

This training programme is aimed at practitioners (including new starters), curators, policy makers, managers, and anyone who would like to learn what web archives are, how they work, how to curate web archive collections, how to acquire basic skills in capturing web content, and how to plan and implement a web archiving programme.

The course contains eight sessions and comprises presentation slides and speakers’ notes. Each module starts with an introduction which outlines the learning objectives and target audience, explains how the slides can be customised, and includes a comprehensive list of related resources and tools. Published under a CC licence, the training materials can be fully customised and modified by their users.

Video Case Studies

The TWG used the opportunity of the annual gathering of the web archiving community to complement the training material with video case studies. Alex Osborne, Jessica Cebra, Mark Phillips, Eléonore Alquier, Daniel Gomes, Mar Pérez Morillo, Ben Els and Yves Maurer, representing seven IIPC member institutions from around the world, speak about their experiences of becoming involved in web archiving. They also share their knowledge on organisational approaches, collaborations, collecting policies, access as well as the evolution and challenges of web archiving.

Try it and tell us what you think!

This training is freely available to download and we encourage you to experiment with customising it for your trainees. We are also interested in how you use it, so please give us feedback by filling out this short form.

Acknowledgements

As with many other IIPC projects, the creation of the training material was a collaborative effort, and we thank everyone who has been involved. The project was launched by Tom Cramer of Stanford University Libraries, Abbie Grotke of the Library of Congress and Maria Praetzellis of the California Digital Library (previously the Internet Archive). Maria Ryan of the National Library of Ireland and Claire Newing of The National Archives UK, who took over as co-chairs at the last General Assembly, led the project to completion together with Abbie. The group of 38 volunteers who form the TWG were involved at various stages of the project, starting with the surveys, followed by brainstorming sessions, many rounds of extensive feedback and the final phase of preparing the materials for publication. A special thank you to Samantha Abrams, Jefferson Bailey, Helena Byrne, Friedel Geeraert, Márton Németh, Anna Perricci, and the participants of the video case studies.

The beginner’s training materials were produced in partnership with the Digital Preservation Coalition (DPC), and we would particularly like to thank Sharon McMeekin, Head of Training and Skills, and Sara Day Thomson of the University of Edinburgh (previously DPC).


Let’s time travel with the IIPC!

The IIPC has been organising its annual meetings for over 15 years. The first full Steering Committee meeting and the meetings of working groups were held in Canberra in 2004. The most recent General Assembly (GA) and Web Archiving Conference (WAC) were held in Zagreb in June 2019. What started as a small get-together of web archiving enthusiasts from a dozen national libraries and the Internet Archive has gradually become an important fixture in the web archiving calendar. We have been very fortunate that our members have volunteered to host the events in Singapore, The Hague, Washington D.C., Ljubljana, Stanford, Reykjavík, London, Wellington, Zagreb and Ottawa. The GA also returned to Canberra in 2008.


Due to Covid-19, this year we will not meet in person, but we can time travel! While preparing for the next annual event hosted by the National Library of Luxembourg (15-18 June 2021), we will be trawling through the history of the GA and the WAC. We will be collecting, publishing and archiving memories from past events in a variety of formats, ranging from tweets and blog posts to a GA and WAC digital repository and bibliography. All new and older posts will be available in the “GAWAC” archive.


We are starting with 2019, which was the first GA for Friedel Geeraert of KBR, the Royal Library of Belgium. It was also the first GA for the British Library web archivists Helena Byrne and Carlos Rarugal, the organisers of a workshop called “Reflecting on how we train new starters in web archiving”.

Abstracts from the 2019 presentations and slides are available on the conference website. You can also watch the keynote speeches and panel discussions on our YouTube channel and browse through the photos on the IIPC Flickr. The 2019 GA and WAC were hosted by the National and University Library in Zagreb. The Croatian Web Archive (HAW), which last year celebrated its 15th anniversary, launched its new interface earlier this year. You can browse the archive and the thematic collections at https://haw.nsk.hr/en.

Photo: Tibor God.

Discovering the web archiving community at the IIPC events in Zagreb

By Friedel Geeraert, Scientific Assistant Web Archiving, KBR – Royal Library of Belgium

Last year I had the privilege of participating in the IIPC General Assembly and Web Archiving Conference in Zagreb for the first time as the representative of KBR (the Belgian Royal Library), which was at that time the newest IIPC member. KBR was then involved in a research project called PROMISE that studied the question of web archiving at the federal level in Belgium.

The General Assembly provided good insight into the workings of the IIPC as an organisation. It was very interesting to participate in the reflection on the future form of the IIPC during the General Assembly. According to member institutions, the top three priorities for the coming years should be: 1) community-led tools, 2) providing platforms for sharing knowledge and 3) networking and support for innovation in research on the archived web. Furthermore, the reports of the Treasurer and the Programme and Communications Officer outlined the different possibilities for engaging with the organisation and other IIPC members: TSS (Technical Speaker Series) and RSS (Research Speaker Series) webinars, Online Hours, the different working groups (Content Development, Training, Preservation and Research) and the Discretionary Funding Programme. I took part in the workshops of the Preservation, Training and Research Working Groups, which allowed me to discover different initiatives launched within web archiving institutions all over the world.

The Web Archiving Conference brought to light a plethora of developments within web archiving. A lot of focus was on outreach and on how to promote web archives (via library labs, for example). Another theme was researcher interaction with web archives and opening up access to complementary files such as crawl and access logs, derivative files or documentation about curatorial decisions and Heritrix settings. The use of machine learning on archived web material was another recurring theme. From a curatorial perspective, trending collection themes include minorities, emerging formats such as interactive fiction, and retrospective web archiving. It was also stressed that divergent opinions should feature in a web archive in order to avoid curatorial bias. Furthermore, even though I don’t have a technical background, it was fascinating to discover new developments such as size reduction of indexes, Browsertrix or automated quality assurance.

On top of all that rich information, the networking possibilities were fantastic. Within the PROMISE project, we did an extensive literature review of web archiving initiatives in Europe and Canada, so it was a wonderful opportunity to meet in person some of the web archivists and researchers I admire. It is safe to say that I came back inspired and with a head full of ideas for the Belgian web archive. I’m already looking forward to the next edition.


Reflecting on how we train new starters in web archiving

This blog post is a summary of a workshop that took place at the 2019 IIPC Web Archiving Conference in Zagreb, Croatia. The abstract and the final slides used during the workshop are available on the IIPC website.


By Helena Byrne, Web Curator and Carlos Rarugal, Assistant Web Archivist at the British Library


Most people, when learning, can relate to the Benjamin Franklin quote:

tell me and I forget, teach me and I may remember, involve me and I learn.*

It can be very challenging to find the most effective way to involve a trainee in web archiving and transfer your specialist knowledge. Web archiving is a relatively new profession that is constantly changing, and it is only in recent years that a body of work from practitioners and researchers has started to grow. In addition, each web archiving institution has its own collection policies and many use their own web archiving technology, meaning that there is no one-size-fits-all solution to training people who work in this field.

However, before taking on new strategies it is important to understand our own beliefs about training and what actions we currently take when training new staff. Reflecting on these points can help us become more aware of any biases we may have in terms of preferred training delivery style, which may contradict what the trainee really needs.

What we did

Before we started the workshop, participants answered, via a Menti poll, a series of questions about their own experience of delivering or receiving training on web archives. We then reviewed the training practices of the curatorial web archive team at the British Library and, in groups, discussed which methods participants felt worked well or not.

“Reflecting on how we train new starters in web archiving” at the Web Archiving Conference in Zagreb, 6 June 2019.
Photo: Tibor God.

Menti Poll Results

Menti Poll Results: Average Score for each question.

Overall, there were about 26 participants in the workshop who had varying degrees of experience training people on how to work with their web archive. As shown in Slide 3, only 31% of participants train people in web archiving on a regular basis while 50% of participants train people occasionally and the remaining 19% don’t train other people in web archiving. Some of the people in this final category work as solo web archivists and don’t have any resources for additional staff.

When asked if there was a structured training programme on web archiving at their organisation, 65% of participants responded “no” while only 35% of respondents had a programme in place. Not surprisingly, when asked ‘how were you trained in web archiving?’, hands-on training was the most popular method used to train participants at the workshop.

Results of this poll can be viewed here.

Training practices at the British Library

During this workshop we reviewed common training methods and reflected on the current practices of the curatorial team of the UK Web Archive, based at the British Library, as well as how we would like to change these practices in the future (Slides 7-8).

Group Discussion

Participants in small groups discussed a series of questions about how they train people in their institutions:

Questions

1. Who do you train about web archiving?
2. How do you currently train them?
3. What web archiving training resources do you have available to your team?
4. What methods do you use for training? Computer-based, documentation (handouts, user guides, etc.), one-to-one learning, shadowing, etc.

After discussing these questions, participants placed their current training methods on a scale of what they felt works and doesn’t work.

Brainstorming

Overall, there were 56 points filled in on post-it notes by participants in six different groups. These can be loosely grouped into 10 categories:

Reading lists, videos, hands-on training, documentation, networking, case studies, examples/modelling, verbal training, forums and tutorials. A more detailed breakdown of these categories can be viewed here.

Most of the points noted (30/56) were in the ‘what works’ section, 10/56 were neutral, while only 8/56 were in the ‘what doesn’t work’ section. However, there was some overlap between the ‘what works’ and ‘what doesn’t work’ sections, with some methods, like videos and reading lists, appearing in both sections but in different groups.

Review

In the last workshop activity, participants voted, using two coloured stickers, on which training methods they considered the most aspirational and the most achievable.

As you can see from the votes below, the most popular activity that could be achieved in the short term by the workshop participants was hands-on individual training, with 9 votes. There was a split between participants who felt that writing manuals was achievable (7 votes) and those who felt it was aspirational (6 votes).

How people voted

Conclusion

Overall, participants were keen to see a training-related event on the IIPC Web Archiving Conference programme. As the importance of web archiving grows, so too does the need for training in this field, and it has become more evident that these responsibilities are falling on web archivists.

All the data collected during this workshop was shared with the IIPC Training Working Group, and it is hoped that it will help inform the development of materials to support training within the field.

More information about the IIPC Training Working Group can be found here: http://netpreserve.org/about-us/working-groups/training-working-group/

References:

* Goodreads.com, ‘Benjamin Franklin > Quotes > Quotable Quote’, https://www.goodreads.com/quotes/21262-tell-me-and-i-forget-teach-me-and-i-may (accessed December 20, 2018).

LinkGate: Let’s build a scalable visualization tool for web archive research

By Youssef Eldakar of Bibliotheca Alexandrina and Lana Alsabbagh of the National Library of New Zealand

Bibliotheca Alexandrina (BA) and the National Library of New Zealand (NLNZ) are working together to bring to the web archiving community a tool for scalable web archive visualization: LinkGate. The project was awarded funding by the IIPC for the year 2020. This blog post gives a detailed overview of the work that has been done so far and outlines what lies ahead.


In all domains of science, visualization is essential for deriving meaning from data. In web archiving, that data is linked data, which may be visualized as a graph with web resources as nodes and outlinks as edges.

This phase of the project aims to deliver the core functionality of a scalable web archive visualization environment, consisting of the Link Service (link-serv), Link Indexer (link-indexer) and Link Visualizer (link-viz) components, as well as to document potential research use cases within the domain of web archiving for future development.

The following illustrates the flow of data for LinkGate in the web archiving ecosystem: a web crawler archives captured web resources into WARC/ARC files, which are then checked into storage; metadata is extracted from the WARC/ARC files into WAT files; link-indexer extracts outlink data from the WAT files and inserts it into link-serv; and link-serv serves graph data to link-viz for rendering as the user navigates the graph representation of the web archive:

LinkGate: data flow

In what follows, we look at development by Bibliotheca Alexandrina to get each of the project’s three main components, Link Service, Link Indexer and Link Visualizer, off the ground. We also discuss the outreach part of the project, coordinated by the National Library of New Zealand, which involves gathering researcher input and putting together an inventory of use cases.

Please watch the project’s code repositories on GitHub for commits following a code review later this month.

Please see also the Research Use Cases for Web Archive Visualization wiki.

Link Service

link-serv is the Link Service that provides an API for inserting web archive interlinking data into a data store and for retrieving that data back for rendering and navigation.
We worked on the following:

  • Data store scalability
  • Data schema
  • API definition and Gephi compatibility
  • Initial implementation

Data store scalability

link-serv depends on an underlying graph database as the repository for web resources as nodes and outlinks as relationships. Building upon BA’s previous experience with graph databases in the Encyclopedia of Life project, we worked on adapting the Neo4j graph database for versioned web archive data. Scalability being a key interest, we ran a benchmark of Neo4j on Intel Xeon E5-2630 v3 hardware using a generated test dataset and examined bottlenecks to tune performance. In the benchmark, over a series of progressions, a total of 15 billion nodes and 34 billion relationships were loaded into Neo4j, and matching and updating performance was tested. While inserting nodes for the larger progressions took hours or even days, match and update times in all progressions remained in seconds once a database index was added, ranging from 0.01 to 25 seconds for nodes (with 85% of cases below 7 seconds) and from 0.5 to 34 seconds for relationships (with 67% of cases below 9 seconds). Considering the results promising, we hope that tuning work during the coming months will lead to more desirable performance. Further testing is underway using a second set of generated relationships to more realistically simulate web links.

We ruled out Virtuoso, 4store, and OrientDB as graph data store options for being less suitable for the purposes of this project. A more recent alternative, ArangoDB, is currently being looked into and is also showing promising initial results, and we are leaving open the possibility of additionally supporting it as an option for the graph data store in link-serv.

Data schema

To represent web archive data in the graph data store, we designed a schema with the goals of supporting time-versioned interlinked web resources and being friendly to search using the Cypher Query Language. The schema defines Node and VersionNode as node types, and HAS_VERSION and LINKED_TO as relationship types linking a Node to a descendant VersionNode and a VersionNode to a hyperlinked Node, respectively. A Node has the URI of the resource, in Sort-friendly URI Reordering Transform (SURT) form, as an attribute, and a VersionNode has the ISO 8601 timestamp of the version as an attribute. The following illustrates the schema:

LinkGate: data schema
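To make the schema concrete, here is a minimal sketch of inserting one outlink observation with the neo4j Python driver and the surt package. The Cypher statement and the property names (surt, timestamp) are our assumptions for illustration; the node and relationship types are those defined by the schema.

```python
from neo4j import GraphDatabase
from surt import surt  # Sort-friendly URI Reordering Transform

# Record that a version of the source page, captured at time $ts,
# linked to the target page. Property names are illustrative assumptions.
INSERT_OUTLINK = """
MERGE (src:Node {surt: $src})
MERGE (src)-[:HAS_VERSION]->(v:VersionNode {timestamp: $ts})
MERGE (dst:Node {surt: $dst})
MERGE (v)-[:LINKED_TO]->(dst)
"""

driver = GraphDatabase.driver('bolt://localhost:7687',
                              auth=('neo4j', 'password'))
with driver.session() as session:
    session.run(INSERT_OUTLINK,
                src=surt('http://example.org/'),
                dst=surt('http://example.com/page'),
                ts='2020-06-15T12:00:00Z')
driver.close()
```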

API definition and Gephi compatibility

link-serv is to receive data extracted by link-indexer from a web archive and respond to queries by link-viz as the graph representation of web resources is navigated. At this point, two API operations have been defined for this interfacing: updateGraph and getGraph. updateGraph is invoked by link-indexer and takes as input a JSON representation of outlinks to be loaded into the data store. getGraph, on the other hand, is invoked by link-viz and returns a JSON representation of possibly nested outlinks for rendering. Additional API operations may be defined in the future as development progresses.

One of the project’s premises is maintaining compatibility with the popular graph visualization tool Gephi. This would enable users to render web archive data served by link-serv using Gephi as an alternative to the project’s frontend component, link-viz. To achieve this, the updateGraph and getGraph API operations were based on their counterparts in the Gephi graph streaming API, with the following adaptations (a client-side sketch follows the list):

  • Redefining the workspace to refer to a timestamp and URL
  • Adding timestamp and url parameters to both updateGraph and getGraph
  • Adding depth parameter to getGraph
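As an illustration, a client interaction with these two operations might look like the sketch below. The base URL, endpoint paths and payload framing are our assumptions; the url, timestamp and depth parameters and the Gephi-style JSON events are those described above.

```python
import json

import requests

LINK_SERV = 'http://localhost:8080'  # hypothetical link-serv base URL

# updateGraph: link-indexer pushes outlinks as Gephi-streaming-style events
# ('an' = add node, 'ae' = add edge); the exact payload shape may differ.
events = [
    {'an': {'org,example)/': {'label': 'org,example)/'}}},
    {'an': {'com,example)/page': {'label': 'com,example)/page'}}},
    {'ae': {'e1': {'source': 'org,example)/',
                   'target': 'com,example)/page',
                   'directed': True}}},
]
requests.post(LINK_SERV + '/updateGraph',
              params={'url': 'org,example)/', 'timestamp': '20200615120000'},
              data='\r\n'.join(json.dumps(e) for e in events))

# getGraph: link-viz requests (possibly nested) outlinks around a node,
# up to the given depth, for rendering.
resp = requests.get(LINK_SERV + '/getGraph',
                    params={'url': 'org,example)/',
                            'timestamp': '20200615120000',
                            'depth': 2})
print(resp.json())
```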

An instance of Gephi with the graph streaming plugin installed was used to examine API behavior. We also examined API behavior using the Neo4j APOC library, which provides a procedure for data export to Gephi.

Initial implementation

An initial minimal API service for link-serv has been implemented. The implementation is in Java and uses the Spring Boot framework and Neo4j bindings.
We have the following issues up next:

  • Continue to develop the service API implementation
  • Tune insertion and matching performance
  • Test integration with link-indexer and link-viz
  • ArangoDB benchmark

Link Indexer

link-indexer is the tool that runs on web archive storage, where WARC/ARC files are kept, and collects outlink data to feed to link-serv for loading into the graph data store. In a subsequent phase of the project, the collected data may include details besides outlinks to enrich the visualization.
We worked on the following:

  • Invocation model and choice of programming tools
  • Web Archive Transformation (WAT) as input format
  • Initial implementation

Invocation model and choice of programming tools

link-indexer collects data from the web archive’s underlying file storage, which means it will often be invoked on multiple nodes in a computer cluster. To handle future research use cases, the tool will also eventually need to do a fair amount of data processing, such as language detection, named entity recognition, or geolocation. For these reasons, we found Python a fitting choice for link-indexer. Additionally, several modules that implement functionality related to web archiving, such as WARC file reading and writing and URI transformation, are readily available for Python.
In a distributed environment such as a computer cluster, invocation would be on an ad-hoc basis using a tool such as Ansible, dsh, or pdsh (among many others), or configured for periodic execution on each host using a configuration management tool (again, such as Ansible). Given this intended usage and the magnitude of the input data, we identified the following requirements for the tool:

  • Non-interactive (unattended) command-line execution
  • Flexible configuration using a configuration file as well as command-line options
  • Reduced system resource footprint and optimized performance

Web Archive Transformation (WAT) as input format

Building upon already existing tools, Web Archive Transformation (WAT) is used as the input format rather than directly reading full WARC/ARC files. WAT files hold metadata extracted from the web archive. Using WAT as input reduces code complexity, promotes modularity, and makes it possible to run link-indexer on auxiliary storage holding only WAT files, which are significantly smaller than their original WARC/ARC sources.
warcio is used in the Python code to read WAT files, which conform in structure to the WARC format. We initially used archive-metadata-extractor to generate WAT files. However, testing our implementation with sample files showed that the tool generates files that do not exactly conform to the WARC structure and cause warcio to fail on reading. The more recent webarchive-commons library was subsequently used instead to generate WAT files.
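For illustration, here is a minimal sketch of pulling outlinks from a WAT file with warcio and resolving relative links, in the spirit of link-indexer. The JSON key paths follow the WAT structure produced by webarchive-commons, but should be verified against actual WAT output.

```python
import json
import sys
from urllib.parse import urljoin

from warcio.archiveiterator import ArchiveIterator


def extract_outlinks(wat_path):
    """Yield (source URI, target URI) outlink pairs from a WAT file."""
    with open(wat_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'metadata':
                continue
            data = json.loads(record.content_stream().read())
            envelope = data.get('Envelope', {})
            source = envelope.get('WARC-Header-Metadata',
                                  {}).get('WARC-Target-URI')
            links = (envelope.get('Payload-Metadata', {})
                             .get('HTTP-Response-Metadata', {})
                             .get('HTML-Metadata', {})
                             .get('Links', []))
            if not source:
                continue
            for link in links:
                if 'url' in link:
                    # Convert relative links to absolute against the source
                    yield source, urljoin(source, link['url'])


if __name__ == '__main__':
    for src, dst in extract_outlinks(sys.argv[1]):
        print(src, dst)
```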

Initial implementation

The current initial minimal implementation of link-indexer includes the following:

  • Basic command-line invocation with multiple input WAT files as arguments
  • Traversal of metadata records in WAT files using warcio
  • Collecting outlink data and converting relative links to absolute
  • Composing JSON graph data compatible with the Gephi streaming API
  • Grouping a defined count of records into batches to reduce hits on the API service

We plan to continue work on the following:

  • Rewriting links in Sort-friendly URI Reordering Transform (SURT)
  • Integration with the link-serv API
  • Command-line options
  • Configuration file

Link Visualizer

link-viz is the project’s web-based frontend for accessing data provided by link-serv as a graph that can be navigated and explored.
We worked on the following:

  • Graph rendering toolkit
  • Web development framework and tools
  • UI design and artwork

Graph visualization libraries, as well as web application frameworks, were researched for the web-based link visualization frontend. D3.js and Vis.js emerged as the most suitable candidates for the visualization toolkit. After experimenting with both, we decided to go with Vis.js, which fits the needs of the application and is better documented.
We also took a fresh look at current web development frameworks and decided to house the Vis.js visualization logic within a Laravel framework application combining PHP and Vue.js for future expandability of the application’s features, e.g., user profile management, sharing of graphs, etc.
A virtual machine was allocated on BA’s server infrastructure to host link-viz for the project demo that we will be working on.
We built a bare-bones frontend consisting of the following:

  • Landing page
  • Graph rendering page with the following UI elements:
    • Graph area
    • URL, depth, and date selection inputs
    • Placeholders for add-ons

As we outlined in the project proposal, we plan to implement add-ons during a later phase of the project to extend functionality. Add-ons will come in two categories: vizors, which modify how the user sees the graph, e.g., GeoVizor for superimposing nodes on a map of the world, and finders, which help the user explore the graph, e.g., PathFinder for finding all paths from one node to another.
Some work has already been done in UI design, color theming, and artwork, and we plan to continue work on the following:

  • Integration with the link-serv API
  • Continue work on UI design and artwork
  • UI actions
  • Performance considerations

Research use cases for web archive visualization

In terms of outreach, the National Library of New Zealand has been getting in touch with researchers from a wide array of backgrounds, ranging from data scientists to historians, to gather feedback on potential use cases and the types of features researchers would like to see in a web archive visualization tool. Several issues have been brought up, including frustrations with existing tools’ lack of scalability, being tied to a physical workstation, time wasted on preprocessing datasets, and the inability to customize an existing tool to a researcher’s individual needs. Gathering first-hand input from researchers has led to many interesting insights. The next steps are to document and publish these potential research use cases on the wiki to guide future developments in the project.

We would like to extend our thanks and appreciation to all the researchers who generously gave their time to provide us with feedback, including Dr. Ian Milligan, Dr. Niels Brügger, Emily Maemura, Ryan Deschamps, Erin Gallagher, and Edward Summers.

Acknowledgements

Meet the people involved in the project at Bibliotheca Alexandrina:

  • Amr Morad
  • Amr Rizq
  • Mohamed Elsayed
  • Mohammed Elfarargy
  • Youssef Eldakar

And at the National Library of New Zealand:

  • Andrea Goethals
  • Ben O’Brien
  • Lana Alsabbagh

We would also like to thank Alex Osborne at the National Library of Australia and Andy Jackson at the British Library for their advice on technical issues.

If you have any questions or feedback, please contact the LinkGate Team at linkgate[at]iipc.simplelists.com.

IIPC Chair Address

Hello there,

I would like to take this opportunity to say how excited I am to be the Chair of the IIPC Steering Committee for 2020, working with a great set of officers. I have had the opportunity to be involved with this organization since my first General Assembly in Canberra, Australia in 2008. The involvement of the University of North Texas Libraries in the IIPC has been rewarding and has allowed us to grow and feel more secure in our web archiving program over the years.

IIPC funded projects

This year we are off to a quick start for the IIPC. We have announced the recipients of the first round of the Discretionary Funding Program (DFP), which resulted in several awarded projects. The Steering Committee is excited to see the results of these projects, which will be completed during the 2020 calendar year. In addition to the DFP projects that have already been awarded, we will be announcing the next round of funding opportunities at this year’s General Assembly in Montreal.

IIPC GA & WAC in Montreal

The annual General Assembly and Web Archiving Conference has become an event that many of us eagerly look forward to each year. I know that I find it to be my most professionally engaging conference, and I enjoy both the mental stimulation and connecting with friends and colleagues from around the world. This year’s Web Archiving Conference program looks to be an exciting one, and we are working to schedule the General Assembly to make the most of the day we have together for IIPC business. I am grateful to the Bibliothèque et Archives nationales du Québec (BAnQ) and their organizing partners, Library and Archives Canada (LAC) and University of Toronto Libraries. I expect that this year’s conference will be a memorable one for all of us.

Consortium Agreement renewal

As you know, the IIPC is a consortium of institutions with a shared set of goals: to preserve and provide access to the web around the world. The consortium is organized and governed by an agreement that each member signs when joining the IIPC. The current document has served us well over the years, but it is time for us to rework this agreement to reflect changes that have occurred in the structure of the IIPC, as well as to enable us to continue to grow and evolve as an organization. A sub-committee of the Steering Committee is working to revise the consortium agreement, and we are planning to share this with the membership for review at this year’s General Assembly.

CLIR

One of the things that we will be incorporating into the new consortium agreement is the new relationship we have with the Council on Library and Information Resources (CLIR). As many of you will remember, we moved our financial oversight from the Bibliothèque nationale de France (BnF) in 2017 and have been exploring how this relationship can be beneficial to both organizations as time moves on. In the fall of 2019, the IIPC Steering Committee decided to further explore what opportunities exist for collaboration between CLIR and the IIPC, and we have started to use existing tools such as CLIR’s MemberSuite for our annual billing activities. In addition, we are in conversations to see how we can better cross-promote our activities and take advantage of the resources that each of our organizations brings to the table.

IIPC training materials

Another activity that I am happy to see being released is a set of training materials developed to introduce the topic of web archiving to different audiences. This project was carried out as a collaboration between the IIPC Training Working Group and the Digital Preservation Coalition, which worked with us to help create the training modules. We expect to release this training material to the public shortly, so stay tuned.

Tools

There are many other IIPC activities that I could continue to enumerate here, including a project and contract to document the use of pywb in large-scale web archive environments, which many of our member institutions operate. Another event, happening in April, is a Hackathon on Automated Quality Assurance organized by the IIPC, the University of North Texas Libraries and the British Library, in collaboration with the National Library of Australia and the Library of Congress. This event will take place both in person at the British Library in London and at the University of North Texas in Denton, Texas. In addition to in-person participation, we encourage remote participation during the three-day event.

Collaborative collections

Collaborative collections, led by our Content Development Working Group, have continued to be one of the IIPC’s most successful outreach projects. The “Climate Change” collection has just been published on Archive-It, and the IIPC has teamed up with the Internet Archive to preserve web content related to the ongoing Novel Coronavirus (Covid-19) outbreak.


You can follow our activities on the IIPC website and Twitter. To subscribe to our mailing list, send an email to communications@iipc.simplelists.com.


Mark Phillips
Associate Dean for Digital Libraries at the University of North Texas (UNT)
IIPC Chair 2020-2021