Digging in Digital Dust: Internet Archaeology at KB-NL in the Netherlands

By Peter de Bode and Kees Teszelszky

The Dutch .nl ccTLD is the third-largest national top-level domain in the world and consists of 5.68 million URLs, according to the Dutch SIDN. The first website of the Netherlands was published on the web in 1992: it was the third website on the World Wide Web. Web archiving in the Netherlands started in 2000 with the project Archipol in Groningen. The Koninklijke Bibliotheek | National Library of The Netherlands (KB-NL) started web archiving with a selection of Dutch websites in 2007. The KB not only selects and harvests these sites, but also develops a strategy to ensure their long-term usability. As the Netherlands lacks a legal deposit law, the KB cannot crawl the Dutch national domain. KB uses the Web Curator Tool (WCT) to conduct its harvests. Since January 2018, the National Library of New Zealand (NLNZ) has been collaborating with KB-NL to upgrade this tool and add new features to make the application future-proof.

Since 2011, the Dutch web archive has been available in the KB reading rooms. In addition, researchers may request access to the data for specific projects. Between 2012 and 2016 the research project WebArt was carried out. As of November 2018, 15,000 websites have been selected. The Dutch web archive contains about 37 terabytes of data.

On the occasion of World Digital Preservation Day, KB unveiled a special internet archaeology collection, Euronet-Internet (1994-2017) [in Dutch: Webcollectie internetarcheologie Euronet]. It is made up of archived websites hosted by internet provider Euronet-Internet between 1994 and 2017. Collecting started in 2017 and ended in 2018. Websites were identified for harvest by Peter de Bode and Kees Teszelszky as part of the larger KB web archiving project “internet archaeology.” Euronet, founded in 1994, is one of the oldest internet providers in the Netherlands and has since been bought by Online.nl. Priority was given to websites published in the early years of the Dutch web (1994-2000).

These sites can be considered “web incunables,” as they are among the first born-digital publications on the Dutch web. Some of the digital treasures in this collection are the oldest website of a national political party, a virtual bank building and several sites of internet pioneers dating from 1995. Information about the collection and its heritage value can be found on a special dataset page of KB-Lab and in a collection description (in Dutch). The collection can be studied on the terminals in the reading room of KB with a valid library card. Researchers can also use the dataset with URLs and a link analysis.

Web Archiving at the National Library of Ireland

National Library of Ireland Reading Room © National Library of Ireland.

The National Library of Ireland has a long-standing tradition of collecting, preserving and making accessible the published and printed output of Ireland. The library is over 140 years old and we now also have rich digital collections concerning the political, cultural and creative life of Ireland. The NLI has been archiving the Irish web on a selective basis since 2011. We have over 17 TB of data in the selective web archive, openly available for research through our website. A particular strength of our web archive is the coverage of Irish politics, including a representation of every election and referendum since 2011. The web archive is no longer in its infancy, and the NLI has made some exciting developments in recent years. This year we have begun working with the Internet Archive for our selective web archive and are looking forward to the new opportunities that this partnership will bring. We have also begun working closely with an academic researcher from a higher education institute in Ireland, who is carrying out network analysis on a portion of our selective data.

In 2007 and 2017, the NLI undertook domain crawling projects and there are now over 43 TB of data archived from these crawls. The National Library of Ireland is a legal deposit library, entitling it to a copy of everything published in Ireland. However, unlike in many European countries, legal deposit legislation in Ireland does not currently extend to online material, so we cannot make these crawls available. Despite these barriers, the library remains committed to preserving the online story of Ireland in whatever way we can.

Revisions to the legislation are currently before the Irish parliament and, if passed, will add e-publications such as e-books and journals to the scope of legal deposit. The addition of websites to that list is currently being considered.

In 2017, the National Library of Ireland became a member of the IIPC and we are excited to be attending our first General Assembly in Wellington. While we had anticipated talking about our newly available domain web archive portal and how this had impacted our selective crawls, we are looking forward to discussing the challenges we continue to face, including with legal deposit, and how we are developing the web archive as a whole. We also hope to be able to give an update on progress with the legislative framework. We look forward to seeing you in Wellington!

Human scale web collecting for individuals and institutions (Webrecorder workshop)

By Anna Perricci, Rhizome

Web archiving ‘at scale’ is usually equated with collecting by automated software (a web crawler), but the assumption that more information means more value is not always right, especially with web archives. A massive scope or scale isn’t required to make meaningful, useful web archives. Collecting at a ‘human scale’ can be as good or better for forming certain collections.

Webrecorder is a free, easy-to-use, browser-based web archiving tool set provided by Rhizome. Rhizome, an affiliate of the New Museum in New York City, champions born-digital art and culture through commissions, exhibitions, digital preservation, and software development. Webrecorder’s development has been generously supported by the Andrew W. Mellon Foundation.

With Webrecorder you can make high fidelity interactive captures of web content as you browse web pages. A “high fidelity capture” means that from a user’s perspective there is a complete or high level of similarity between the original web pages and the archived copies, including the retention of important characteristics and functionality such as: video or audio that requires a user to press ‘play’, or resources that require entry of login credentials for access (e.g. social media accounts). Webrecorder can capture most types of media files, JavaScript and user-triggered actions, which are things that most crawlers struggle with or are unable to obtain.

Workshop attendees will be given an overview of Webrecorder’s features, then engage in hands-on activities and discussions. Further instruction will alternate with opportunities for participants to use the tools introduced and share their thoughts or questions. Instructions on how to manage the collected materials, download them (as a WARC file), and open a local copy offline using Webrecorder Player will also be covered in this workshop.

Human scale web collecting with Webrecorder is not expected to meet all the requirements of a large web archiving program but can satisfy many needs of researchers or smaller web collecting initiatives. Webrecorder can be a great tool for personal digital archiving projects as well. Larger web archiving programs can benefit from using Webrecorder to capture dynamic content and user-triggered behaviors on websites. The WARC files created with Webrecorder can be downloaded and ingested to join WARCs that have been created using crawler-based systems.
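To illustrate why WARCs from different tools can sit side by side, here is a minimal sketch in Python. It relies on the fact that a WARC file is a sequence of self-describing records, so two uncompressed files can simply be concatenated and read back as one stream. This is not Webrecorder's or any crawler's own code: the helper names `make_record` and `iter_warc_records` and the example.org URLs are hypothetical, the toy records omit fields a real WARC requires (such as WARC-Record-ID and WARC-Date), and compressed `.warc.gz` downloads would need to be decompressed first.

```python
import io


def make_record(record_type, target_uri, payload):
    """Build a simplified WARC record (hypothetical helper, for illustration only)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Each record ends with two CRLFs after its content block.
    return head + payload + b"\r\n\r\n"


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line.strip() == b"":
            continue  # blank lines separating records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # end of the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length says exactly how many bytes belong to this record.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Stand-ins for a browser-based capture and a crawler-based capture.
browser_warc = make_record("resource", "http://example.org/a", b"captured by hand")
crawler_warc = make_record("resource", "http://example.org/b", b"captured by crawler")

# Joining the two files is plain concatenation of their records.
merged = browser_warc + crawler_warc
records = list(iter_warc_records(io.BytesIO(merged)))
target_uris = [headers["WARC-Target-URI"] for headers, _ in records]
```

In practice an established WARC library and the ingest workflow of the receiving archive would do this work; the sketch only shows that the record-based format is what makes mixing sources straightforward.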

With a tool like Webrecorder anyone can get started with web archiving quickly at no cost, which is empowering for both information professionals and their stakeholders.

On November 14th you can also learn more about Webrecorder in an afternoon session entirely focused on Webrecorder and high fidelity web archiving. The session will start with a 30-minute presentation on Python Wayback (pywb), a core component of Webrecorder, by pywb’s creator and Webrecorder’s lead developer, Ilya Kreymer. Then there will be a one-hour panel on capturing complex websites and publications using Webrecorder with Jasmine Mulliken, Sumitra Duncan, Nicole Coleman, and me (Anna Perricci).

Whether you are a seasoned expert or newer to web archiving I hope you will be able to join us for the session and this workshop on November 14th at the IIPC WAC. The limit on the number of workshop attendees has been removed so please feel welcome to register.

IIPC Steering Committee Election 2019

The nomination process for IIPC Steering Committee is now open.

The Steering Committee is the executive body of the IIPC. It currently comprises 15 member organisations, which take a leadership role in high-level strategic planning, development and management of programs, policy creation, overall administration, and contribution to IIPC Portfolios and other activities.

What is at stake?

Serving on the Steering Committee is an opportunity for motivated members to help guide the IIPC’s mission of improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. Steering Committee members are expected to take an active role in leadership, contribute to SC and Portfolio activities, and help guide and administer the organisation.

Who can run for election?

Serving on the Steering Committee is open to any current IIPC member and we strongly encourage any organisation interested in serving on the Steering Committee to nominate itself for election. SC members are elected for 3 years and meet twice a year in person (once during the General Assembly and once in September) and two or more additional times by teleconference.

Please note that the nomination should be on behalf of an organisation, not an individual. Once elected, the member organisation designates a representative to serve on the Steering Committee. The list of current SC member organisations is available on the IIPC website.

How to run for election?

All nominee institutions, both new nominees and existing members whose term is expiring but who are interested in continuing to serve, are asked to write a short statement (max 200 words) outlining their vision for how they would contribute to IIPC via serving on the Steering Committee. Statements can point to past contributions to the IIPC or the SC, relevant experience or expertise, new ideas for advancing the organisation, or any other relevant information.

All statements will be posted online and emailed to members prior to the election, with ample time for review by the whole membership. The results will be announced in mid-May and the three-year term on the Steering Committee will start on 1 June.

Below you will find the election calendar. We are very much looking forward to receiving your nominations. If you have any questions, please contact the IIPC PCO.

Election Calendar

  • 12 November to 1 March: Members are invited to nominate themselves by sending an email including a statement to the IIPC Programme and Communications Officer.
  • 1 April: Nominees’ statements are published on the Netpreserve blog and Members mailing list. Nominees are encouraged to campaign through their own networks.
  • 1 April to 30 April: Members are invited to vote online. An online voting tool will be used to conduct the vote. The PCO will monitor the vote, ensuring that each organisation votes only once for all nominated seats and that the vote is cast by the organisation’s official representative. People will be encouraged to cast their vote before, during, and after the GA.
  • 30 April: Voting ends.
  • 1 May: The results of the vote are announced officially on the Netpreserve blog and Members mailing list.
  • 1 June: End/start of SC members’ terms. The newly elected SC members start their term on 1 June and are invited to attend a first meeting (by teleconference) by the end of June. The next face-to-face SC meeting will take place in Zagreb in June 2019.