World Wide Webarchiving: Upgrading the Web Curator Tool

by Kees Teszelszky, Curator digital collections, National Library of the Netherlands

The Web Curator Tool (WCT) is a workflow management application designed for selective web archiving. It was created for use in libraries and other digital heritage collecting organisations, and supports collection by non-technical users while still allowing complete control of the web harvesting process. It supports the selection, harvesting and quality assessment of online material by collaborating users in a library environment. The application is integrated with the existing Heritrix web crawler and supports key processes such as permissions, job scheduling, harvesting, quality review, and the collection of descriptive metadata. The WCT allows institutions to capture almost any online resource. These artefacts are handled with all possible care, so that their integrity and authenticity are preserved.

The WCT was developed in 2006 as a collaborative effort by the National Library of New Zealand (NLNZ) and the British Library (BL), initiated by the International Internet Preservation Consortium (IIPC), as described in the original documentation. The WCT is open source and available under the terms of the Apache License. The project was moved from SourceForge to GitHub in 2014. The latest ‘binary’ release of the WCT, v1.6.3, was published in July 2017 on the NLNZ GitHub page. Even after 12 years, the WCT remains one of the most common open-source enterprise solutions for web archiving. It has an active user forum on GitHub and Slack.

Since January 2018, NLNZ has been collaborating with the Koninklijke Bibliotheek – National Library of the Netherlands (KB-NL) to upgrade the WCT and add new features that make the application future-proof. This involves learning the lessons of the previous development and recognising the advancements and trends in the web archiving community. The objective is to bring the WCT to a platform where it can keep pace with the requirements of archiving the modern web. In addition, the Permission Request module will be extended to fit the Dutch situation, which lacks legal deposit for digital publications.

The first step in that process was decoupling the WCT from the old Heritrix 1.x web crawler, and allowing the WCT to harvest using the updated Heritrix 3.x version. A proof of concept for this change was successfully developed and deployed by the NLNZ, and has been the basis for a joint development work plan. The project will be extensively documented.

The NLNZ has been using the WCT for its selective web archiving programme since January 2007, and KB-NL since 2009. In 2008 NLNZ published an article describing its experience using the WCT in a production environment. However, the software subsequently fell into a period of neglect, with mounting technical debt, most notably its tight integration with an outdated version of the Heritrix web crawler. While the last public release of the WCT is still used day-to-day in various institutions, this release has essentially reached its end-of-life as it has fallen further and further behind the requirements for harvesting the modern web. The community of users has echoed these sentiments over the last few years.

During 2016-2017 the NLNZ conducted a review of the WCT and how it fulfils business requirements, and compared the WCT to alternative software and services. The NLNZ concluded that the WCT was still the closest solution to meeting its requirements, provided the necessary upgrades could be done, namely a change to the modern Heritrix 3 web crawler. Through a series of fortunate conversations the NLNZ discovered that another WCT user, KB-NL, was going through a similar review process and had reached the same conclusions. This led to collaborative development between the two institutions to uplift the WCT technically and functionally into a fit-for-purpose tool within these institutions’ respective web archiving programmes.

Who is involved:

National Library of New Zealand:

Steve Knight
Andrea Goethals
Ben O’Brien
Gillian Lee
Susanna Joe
Sholto Duncan

Koninklijke Bibliotheek:

Peter de Bode
Jeffrey van der Hoeven
Hanna Koppelaar
Tymen Kwant
Barbara Sierman
René Voorburg
Kees Teszelszky



IIPC Steering Committee Election 2018: nominations and results

The 2018 IIPC Steering Committee (SC) elections featured three vacant seats. The KB (Netherlands), BnF (France), and UNT (United States) had all reached the end of their prior three-year terms. The period for IIPC members to nominate themselves for election to the SC opened on December 1, 2017 and ran until March 25, 2018. During the nomination period, three nominations were submitted, by KB, BnF, and UNT. Thus, unlike in prior years, no election process is necessary, since the members whose terms were expiring were the only three to nominate themselves for the three vacancies. Congratulations and thanks to KB, BnF, and UNT for their long service on the SC and their willingness to serve another term. In 2019, the Steering Committee will have five (or potentially six) seats open for election, and we encourage any members interested in joining the SC for the first time and contributing to the management and strategic direction of the organization to nominate themselves. The SC meets in early April at DNB (Germany). Be on the lookout for reports on outcomes from that upcoming meeting.

Jefferson Bailey (current Chair, IIPC SC)


Nomination statements:

Bibliothèque nationale de France / The National Library of France

The National Library of France (BnF) started its web archiving programme in the early 2000s and now holds an archive of nearly a petabyte. We use and share expertise about key tools for IIPC members (Heritrix 3, OpenWayback, NetarchiveSuite, webarchive-discovery) and contribute to the development of several of them. We have developed BCweb, an application for seed selection and curation by librarians, which is being open-sourced.

The BnF has been involved in IIPC since its beginning and remains firmly committed to the development of a strong community, in order to sustain these open source tools and to share experiences and practices. We have attended, and frequently actively contributed to, general assembly meetings, workshops and hackathons, and most IIPC working groups, in particular Preservation and Collections Development. We are also involved in the new Training working group. Finally, we have invested effort in making the WARC format an ISO standard and will continue to work on its evolution. Our participation in the steering committee, if continued, will be focused on making web archiving a thriving community, engaging researchers in the study of web archives and developing strong archiving strategies for all kinds of web content, including social media.

Koninklijke Bibliotheek / National Library of the Netherlands

The KB is currently a member of the Steering Committee and chair of the Membership Engagement Portfolio Group, and would like to nominate itself for election to a new term on the Steering Committee.

The Netherlands was one of the early adopters of the Internet: in fact the 3rd website worldwide was from the Dutch National Institute for Subatomic Physics. The KB started collecting websites in 2007, based on selective harvesting. Currently we harvest around 13,000 websites. For copyright reasons, the websites can only be viewed on the premises. Collaboration with other Dutch organizations will improve the coverage of the preserved Dutch national web. In the nationwide Dutch “Network Digital Heritage” we work together on various projects with GLAM institutions, researchers and suppliers of web archiving services to improve the web archiving of the Dutch web. The KB is looking forward to bringing this experience to the IIPC and to developing plans to make new connections between the members of IIPC and with other organizations related to the field of creating web collections, web publications, researchers, tool development and digital preservation.

The University of North Texas Libraries 

The University of North Texas (UNT) Libraries is interested in serving another term on the IIPC Steering Committee. As a library that serves a Tier One university with 38,000 students, we are committed to providing a wide range of resources to researchers. We believe that the preservation of and access to Web archives is an important component of these resources. We began capturing websites in 1997 and joined the IIPC in 2007. We find great benefit in participating in an international community dedicated to preserving the Web.

In the last decade, we have participated in working groups and served on the Steering Committee for a number of years. We have actively participated in projects such as tool development and maintenance for OpenWayback and Heritrix, with UNT Libraries serving as project lead for the OpenWayback project. We have also participated in collaborative archiving projects, including development of the URL Nomination Tool, and served as Steering Committee officers when requested.

If elected, the UNT Libraries will strive to collaborate with our fellow members and represent the best interests of the IIPC community to continue to move forward the preservation of the Web.