The PROMISE of a Belgian web archive

By Friedel Geeraert, Researcher on the PROMISE project at the Royal Library of Belgium

It all began in 2016 when the State Archives and KBR (the Royal Library of Belgium) decided to join forces and set up a joint web archiving project at the federal level in Belgium. Belgium is, sadly, one of the few European countries without a national web archive. Together with the universities of Ghent and Namur and the university college Bruxelles-Brabant, they set themselves the task of developing a federal strategy for the preservation of the Belgian web. Funding was secured via the BRAIN.be programme of the Belgian Science Policy Office, and in July 2017 the PROMISE project (Preserving Online Multiple Information: towards a Belgian strategy) kicked off.

Interdisciplinary team

Sally Chambers presenting the PROMISE team’s work at the 2019 RESAW conference in Amsterdam. Photo: Olga Holownia.

One of the strengths of the PROMISE project is the interdisciplinarity of the research team. The State Archives and KBR provided expertise in collection curation and in information and documentation management, while the University of Namur (Research Centre in Information, Law and Society) provided the legal expertise. The University of Ghent (Research Group for Media, Innovation and Communication Technologies; Ghent Centre for Digital Humanities) and the university college Bruxelles-Brabant (HE2B) collaborated on the technical aspects of the project. The University of Ghent also analysed the user requirements for web archives. This approach not only ensured the necessary expertise but also led to cross-fertilisation between the different research domains.

Our objectives and how we learned from others

The project team worked on four main objectives:

  1. Identify best practices in the field of web-archiving
  2. Develop a strategy for archiving the Belgian web
  3. Set up a pilot project for the archiving of the Belgian web and providing access to these collections
  4. Make recommendations for the implementation of a sustainable web archiving service

More than two years on, a lot has happened within the project. To achieve the first objective, the research team carried out an extensive literature review of web archiving practices. This was supplemented by in-depth interviews with representatives of 13 web archiving institutions in Europe and Canada. The interviews covered operational, technical and legal aspects, and this phase was very instructive for all researchers involved. The research results were published in the International Journal of Digital Humanities.

Inspired by the first phase, KBR and the State Archives outlined a strategy that covers the entire web archiving workflow. The legal analysis done within the project also informed both institutions about what they are legally allowed or required to do. The results of a survey on user requirements were another important source of input, since KBR and the State Archives intend to focus on the user when developing a functional web archive.

Budgeting scenarios

The strategy also included detailed cost calculations for three scenarios, each linked to a different selection strategy: limited selective collections only; elaborate selective collections in combination with a limited broad crawl; and elaborate selective collections in combination with an extensive broad crawl. A list of tasks and necessary infrastructure was drafted for each of these scenarios, spanning the different functions of the OAIS model with the addition of the functions selection and capture. For each task, the time needed was estimated per job profile involved in the task. The total number of hours per profile was then multiplied by an average wage for that profile to arrive at a total cost for each scenario. The purpose of this exercise was to allow the board of directors of the State Archives and KBR to make informed decisions about which web archiving strategy is preferable and financially viable.
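To make the arithmetic concrete, here is a minimal sketch of the calculation described above: hours per task are summed per job profile and multiplied by an average wage for that profile. All task names, job profiles, hours and wages below are hypothetical placeholders, not figures from the PROMISE project, and infrastructure costs are left out for brevity.

```python
from collections import defaultdict

# Hypothetical average hourly wages per job profile (illustrative values only).
HOURLY_WAGE = {"curator": 40, "developer": 50, "legal_advisor": 60}

# Hypothetical task lists per scenario: (task, job profile, estimated hours).
SCENARIOS = {
    "limited selective": [
        ("selection", "curator", 200),
        ("capture", "developer", 100),
    ],
    "selective + limited broad crawl": [
        ("selection", "curator", 400),
        ("capture", "developer", 300),
        ("access", "legal_advisor", 50),
    ],
}


def hours_per_profile(tasks):
    """Sum the estimated hours of all tasks for each job profile."""
    totals = defaultdict(int)
    for _task, profile, hours in tasks:
        totals[profile] += hours
    return dict(totals)


def scenario_cost(tasks, wages):
    """Multiply the total hours per profile by that profile's average wage."""
    return sum(
        hours * wages[profile]
        for profile, hours in hours_per_profile(tasks).items()
    )


if __name__ == "__main__":
    for name, tasks in SCENARIOS.items():
        print(f"{name}: {scenario_cost(tasks, HOURLY_WAGE)}")
```

Keeping the hours per profile as an intermediate result makes it easy to see which profiles drive the cost of a scenario before the wages are applied.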

Selection and metadata

The third research phase consisted of a number of elements: creating seed lists for selective collections in accordance with the collection development policies of KBR and the State Archives, creating descriptive metadata based on a recent study by OCLC, doing a pilot broad crawl based on samples of 10,000 and 100,000 domain names, capturing these collections and providing access to them. The prototype for access is in its final stages of development, after which we aim to evaluate the entire pilot project.

Next steps

The project was completed at the end of December 2019 and the PROMISE project team is now working on making recommendations for the implementation of a sustainable web archiving service including legal considerations concerning access to web archives, operational procedures, a business model and technical and functional requirements for web archiving tools.

Niels Brügger, keynote speaker at the colloquium ‘Saving the web: the promise of a Belgian web archive’. Photo: KBR.

So how promising is the future of the Belgian web archive? As is the case with many new endeavours, structural financing plays a key role. It is the intention of KBR and the State Archives to approach the political level in Belgium and make a convincing case for the necessity of a Belgian web archive. During the concluding colloquium ‘Saving the web: the promise of a Belgian web archive’ that was held on 18 October 2019, Niels Brügger, Valérie Schafer and many others shared inspiring ideas with the PROMISE project team that can be used to make a very strong case. It is the sincere hope of both institutions that the results of the PROMISE project will live on in a sustainable web archive at the federal level in Belgium.

The end of the project also invites reflection. Over the course of the project, the team had the pleasure of being introduced to the (inter)national web archiving community, for which the IIPC and RESAW provide very important platforms. We feel that we owe a lot to the exchanges we had with other web archiving professionals and researchers. We would like to thank you all for the inspiration you have given us over the years, and we look forward to many exchanges to come.

Friedel Geeraert introducing KBR at the IIPC General Assembly in Zagreb (5 June 2019). Photo: Tibor God.

Digging in Digital Dust: Internet Archaeology at KB-NL in the Netherlands

By Peter de Bode and Kees Teszelszky

The Dutch .nl ccTLD is the third-biggest national top-level domain in the world and consists of 5.68 million URLs, according to the Dutch SIDN. The first website of the Netherlands was published on the web in 1992: it was the third website on the World Wide Web. Web archiving in the Netherlands started in 2000 with the project Archipol in Groningen. The Koninklijke Bibliotheek | National Library of The Netherlands (KB-NL) started web archiving with a selection of Dutch websites in 2007. The KB not only selects and harvests these sites, but also develops a strategy to ensure their long-term usability. As the Netherlands lacks a legal deposit law, the KB cannot crawl the Dutch national domain. KB uses the Web Curator Tool (WCT) to conduct its harvests. Since January 2018, the National Library of New Zealand (NLNZ) has been collaborating with KB-NL to upgrade this tool and add new features to make the application future-proof.

Since 2011, the Dutch web archive has been available in the KB reading rooms. In addition, researchers may request access to the data for specific projects. Between 2012 and 2016 the research project WebArt was carried out. As of November 2018, 15,000 websites have been selected. The Dutch web archive contains about 37 terabytes of data.

On the occasion of World Digital Preservation Day, KB unveiled a special collection: internet archaeology Euronet-Internet (1994-2017) [in Dutch: Webcollectie internetarcheologie Euronet]. It is made up of archived websites that were hosted by the internet provider Euronet-Internet between 1994 and 2017. The collection was started in 2017 and completed in 2018. Websites are identified for harvest by Peter de Bode and Kees Teszelszky as part of the larger KB web archiving project “internet archaeology.” Euronet (1994) is one of the oldest internet providers in the Netherlands and has since been bought by Online.nl. Priority is given to websites published in the early years of the Dutch web (1994-2000).

These sites can be considered “web incunables”, as they are among the first born-digital publications on the Dutch web. Some of the digital treasures in this collection are the oldest website of a national political party, a virtual bank building and several sites of internet pioneers dating from 1995. Information about the collection and its heritage value can be found on a special dataset page of KB-Lab and in a collection description (in Dutch). The collection can be studied on the terminals in the KB reading room with a valid library card. Researchers can also use the dataset, which contains the URLs and a link analysis.