Webarchiv: 20 Years of Web Archiving in the Czech Republic

By Marie Haškovcová, Illyria Brejchová, Luboš Svoboda, and Andrea Prokopová (Czech Web Archive of the National Library of the Czech Republic)

An Introduction to Webarchiv

The idea of creating a national web archive to preserve the growing amount of Czech born-digital media was conceived as early as 1999. In 2000, Webarchiv was founded as a joint project of the National Library of the Czech Republic, the Moravian Library, and Masaryk University, making it one of the oldest web archives. The first websites were archived in 2001, regular harvesting began in 2005, and in 2007 Webarchiv joined the IIPC.

Webarchiv home page

Currently, Webarchiv is part of the National Library of the Czech Republic and holds approximately 400 TB of data. Webarchiv collects this data in several ways. Through comprehensive harvests, second-level domains under .cz are harvested once or twice a year thanks to cooperation with the Czech domain registry CZ.NIC (currently about 1.4 million URLs). Czech web resources with historical, scientific or cultural value are harvested selectively, more frequently and in greater depth than in comprehensive harvests. Finally, resources connected to a specific event or topic are collected through topical harvests. Webarchiv currently has more than 30 topical collections, covering elections, the Olympics, climate change and more. Continuous harvesting (automated, several times a day) is currently being tested on some topical collections, such as COVID-19 or Czech media.

A GIF showing the development of the Webarchiv website, created using Time Map Visualization

Data harvesting and accessibility

The big challenge for Webarchiv at the moment is ensuring the accessibility of the data in its collection, both in maintaining the ability to display archived websites and in giving researchers and the public access to the collection. In terms of public access, only 0.4% of the whole collection is freely available online. This is due to current Czech legislation, which allows the National Library to make reproductions of a work for its own archiving and conservation purposes but does not entitle libraries to make them available. Online access is therefore provided only for resources in the selective harvest that are licensed under a Creative Commons licence or whose publisher has signed a contract with the library. Websites available to the public are catalogued in accordance with the RDA rules and integrated into the Czech national bibliography. The entire collection can be accessed by the public on the library premises.

Webarchiv catalogue

On the technical side, Webarchiv uses open source software, such as Heritrix 3.4 and OpenWayback 3.0, but also develops its own open source tools, such as Seeder for managing electronic resources, websites and harvests, and WA-KAT, an online tool for cataloguing resources. We are testing the harvesting of politicians' social media accounts, experimenting with UMBRA, and harvesting manually with Webrecorder 2.3, which allows curators to capture Web 2.0 or more technologically complex websites such as online exhibitions or multimedia magazines. We plan to replace the current OpenWayback 3.0 application with Python Wayback (pywb) to display this type of content. We consider the autonomy of curators in planning harvests and assuring quality to be key even in automated harvests, which is why we continue to improve Seeder. In the future, Seeder should allow curators to perform harvests without technical support, which will let them react more efficiently to the ephemeral online environment.
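For readers unfamiliar with replay tools such as OpenWayback and pywb: they conventionally address an archived capture by a 14-digit UTC timestamp (YYYYMMDDhhmmss) followed by the original URL. The short Python sketch below illustrates that convention only. The base address `https://archive.example` and the collection name are hypothetical placeholders, not Webarchiv's actual endpoints.

```python
from datetime import datetime, timezone


def wayback_timestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as a 14-digit UTC timestamp."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


def replay_url(base: str, collection: str, moment: datetime, original_url: str) -> str:
    """Compose a replay address of the form <base>/<collection>/<timestamp>/<url>.

    `base` and `collection` are assumptions for illustration only.
    """
    return f"{base.rstrip('/')}/{collection}/{wayback_timestamp(moment)}/{original_url}"


# Hypothetical capture time and addresses, used purely as an example.
capture_time = datetime(2020, 3, 15, 12, 30, 0, tzinfo=timezone.utc)
print(replay_url("https://archive.example/", "covid-19", capture_time, "https://example.cz/"))
```

A replay tool typically resolves such an address to the capture closest to the requested timestamp, so the timestamp does not need to match a harvest time exactly.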

Seeder – tool for managing electronic resources, websites and harvests

Collaboration with key partners

Over the years, Webarchiv has developed collaborations with various institutions. Notably, these include the aforementioned CZ.NIC; the Institute of Czech Literature of the Czech Academy of Sciences, for which we are archiving the online Czech literary tradition from the beginning of the Czech Internet to the present day; and the Czech National Archive, for which we are archiving websites of public agencies, such as ministries and other central administrative authorities. As for international collaboration, we worked with the University Library in Bratislava on a shared topical collection of online resources relating to the 30th anniversary of the Velvet Revolution, which led to the collapse of the communist regime in the former Czechoslovakia, and we regularly contribute to the IIPC collaborative collections, most recently to the COVID-19 collection.

Topical collections

As for making our collection more accessible to researchers, Webarchiv is involved in a research project titled “Development of a centralized interface for extracting big data from web archives”. The National Library of the Czech Republic has partnered with the Department of Cybernetics of the Faculty of Applied Sciences at the University of West Bohemia and the Institute of Sociology of the Czech Academy of Sciences on this project, which focuses on research into Webarchiv's data. The main aim of the project is to develop a centralized user interface that allows researchers to search through data collected by Webarchiv and obtain datasets for further research. The outcome of the project will be a faceted full-text search engine for analyzing large quantities of web archive data, with an integrated application for exporting selected datasets. The research project is expected to be completed in 2022.
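To make the idea of faceted full-text search with dataset export more concrete, here is a minimal Python sketch over a handful of in-memory records. It is not the project's implementation: the `Page` record, its fields, the sample data and all function names are assumptions made up for illustration, and a real engine would use a proper index rather than a linear scan.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """Hypothetical archived-page record; the fields are illustrative only."""
    url: str
    year: int
    collection: str
    text: str


def search(pages, query, **filters):
    """Return pages containing every query word and matching all facet filters."""
    words = query.lower().split()
    hits = []
    for page in pages:
        body = page.text.lower()
        text_ok = all(word in body for word in words)
        facets_ok = all(getattr(page, name) == value for name, value in filters.items())
        if text_ok and facets_ok:
            hits.append(page)
    return hits


def facet_counts(hits, facet):
    """Count hits per value of one facet, e.g. per year or per collection."""
    return Counter(getattr(page, facet) for page in hits)


def export_csv(hits):
    """Serialize the selected hits (metadata only) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["url", "year", "collection"])
    for page in hits:
        writer.writerow([page.url, page.year, page.collection])
    return buffer.getvalue()


# Made-up sample records, not real Webarchiv data.
PAGES = [
    Page("https://example.cz/a", 2019, "elections", "Election results and turnout"),
    Page("https://example.cz/b", 2020, "covid-19", "Pandemic measures and election delay"),
    Page("https://example.cz/c", 2020, "covid-19", "Vaccination schedule"),
]

hits = search(PAGES, "election")
print(facet_counts(hits, "year"))
print(export_csv(search(PAGES, "election", year=2020)))
```

The facet counts let a researcher see how results are distributed before narrowing the query, and the export step turns the narrowed selection into a dataset for further analysis.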

Engaging the public

Website nomination form

Webarchiv also actively engages with the public. We accept suggestions on our website for resources to include in selective harvests, and we are active on social media. We have a long-running campaign on Facebook in which we share dead websites from our collection (websites that no longer exist but of which we hold archived copies), and we have recently started doing the same on Instagram. Through these activities, we hope to raise awareness of and interest in web archiving. We are also active on Twitter, where we recently participated in the #WarcnetChallenge.

A contribution of Webarchiv to the Warcnet Challenge

We also launched a new blog, where we post a series called 10 websites for eternity, in which personalities from various fields share a list of Czech websites they cannot imagine life without and would regret losing if they were discontinued without being preserved in an archive. It is an opportunity for them to share the top 10 treasures the Czech web has to offer, both forgotten websites from their bookmarks and accomplished veterans of the internet. Webarchiv then archives the websites on the list and adds them to a topical collection. We see opening Webarchiv to curatorship by external specialists from various fields as an important direction to head in and a great way to expand our current curatorship strategies. Another topic we are considering is link rot. We see it as an area in which we can be very beneficial: in the future, a catalogue of valuable web resources could be created, curated directly by scientists and students.

From series 10 websites for eternity

The future of Webarchiv

Like other web archives, Webarchiv is facing numerous challenges, ranging from the acquisition, preservation and accessibility of data to legislative changes. The internet is an ever-changing environment, and we are therefore always a step behind in our efforts to preserve it. Numerous questions are up for debate: How should we approach ethical questions regarding the use of data collected during harvests? How can content on social media be preserved within its appropriate context when feeds are personalized? Should we preserve software along with the archived web pages? How should we approach valuable online content accessible only behind a paywall? We are excited to be part of Webarchiv's journey and to witness the ways in which it tackles these questions and matures in the coming decades!
