Archiving the Croatian web: has it been fourteen years already?

The National and University Library in Zagreb has been an IIPC member since 2008. The Croatian Web Archive (Hrvatski arhiv weba, HAW), established in 2004, is open access. Current projects include delivering metadata to Europeana, implementing the URN:NBN persistent identifier, migrating to OpenWayback, developing a new user interface and integrating with the Digital Library portal. The Web Archiving Team has also been involved in introducing librarians, archivists and researchers to web archiving and to using HAW resources.


By Ingeborg Rudomino, Croatian Web Archive, National and University Library in Zagreb and Karolina Holub, Croatian Digital Library Development Centre, Croatian Institute for Librarianship, National and University Library in Zagreb

About HAW

The National and University Library in Zagreb (NUL) in collaboration with the University Computing Centre in Zagreb (Srce) established the Croatian Web Archive (Hrvatski arhiv weba, HAW) in 2004 and started to acquire, catalogue and archive online publications according to the legal deposit provisions of the Library Act from 1997. Due to the well-known characteristics of web resources, the NUL started to archive selectively and established selection criteria.

Fig. 1. Croatian Web Archive Homepage.

We use several methods to identify a web resource for cataloguing and archiving: the HAW team searches and browses the web, website owners or content providers fill out the Registration form, or we receive notifications from the ISSN Centre for Croatia.

After identification, every resource is catalogued in the library system and automatically transferred into our custom-built archiving system, where the archiving process starts. Our long-standing experience in cataloguing this type of resource has shown the process to be very challenging: because the content is dynamic and variable, the bibliographic records need daily interventions. For that reason, we created cataloguing guidelines with a variety of examples. Our goal has been to preserve the original websites (their look and feel) as much as possible. To achieve this quality, each resource is approached individually during the archiving process. The DAMP software, developed by the University Computing Centre in Zagreb, was built especially for this purpose. The workflow for processing web resources is integrated within the organisational structure of the Library.

We are proud of the quantity and quality of web resources stored in the Croatian Web Archive. They include websites of institutions, associations, clubs, research projects, news media, portals and blogs, official websites of counties and cities, as well as journals and books. Special attention is given to news media websites and portals, which are archived daily, weekly or monthly.

Access and the first full domain crawl

This selective approach ensures quality and provides full control over the management of web resources. So far, over 6,700 titles have been archived and almost all are publicly available. All content is full-text searchable, and it is possible to search by any word in the title, URL or keywords. Advanced search is available as well. Users can browse HAW alphabetically and by subject categories, which are extracted from the UDC field in the catalogue.

Fig. 2. Screenshots of archived Croatian websites.

To secure permanent access to archived web resources, we have recently implemented the URN:NBN persistent identifier and assigned it to archived titles and to all archived instances (Fig. 3).

Fig. 3. Screenshot of archived instances with URN:NBN.

Since 2013, metadata from HAW has been delivered to Europeana through HAW’s OAI-PMH interface.
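For readers unfamiliar with OAI-PMH, the following minimal sketch shows how a harvester such as an aggregator typically forms a request for records. The `ListRecords` verb and the `oai_dc` metadata prefix come from the OAI-PMH specification; the endpoint address and the function name are assumptions made for illustration and do not describe HAW's actual interface.

```python
from urllib.parse import urlencode

# Hypothetical endpoint used only for illustration; not HAW's real address.
EXAMPLE_BASE_URL = "https://example.org/oai"


def build_list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL.

    'verb' and 'metadataPrefix' are required by the protocol for this request;
    'from' is an optional argument for selective (date-based) harvesting.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date  # e.g. "2013-01-01"
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_list_records_url(EXAMPLE_BASE_URL, from_date="2013-01-01"))
```

A harvester would fetch this URL and parse the XML response. Large result sets are returned in pages that are continued with a `resumptionToken`, which this sketch leaves out.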

To overcome the limitations of selective archiving, the first harvest of the whole .hr domain was conducted in 2011 with the Heritrix web crawler. Since then, we have been harvesting the .hr domain annually. The collected content is publicly available via HAW’s website through the OpenWayback access interface (Fig. 4). To date, we have conducted seven .hr domain harvests.

Fig. 4. Screenshot of harvested website in OpenWayback.
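As a rough illustration of how archived pages are addressed in Wayback-style access interfaces, the sketch below builds a replay URL from a capture time and the original address, following the common convention of a 14-digit timestamp (YYYYMMDDhhmmss) placed before the original URL. The access prefix, the capture time and the function name are assumptions for illustration only, not HAW's actual configuration.

```python
from datetime import datetime


def replay_url(access_prefix, captured_at, original_url):
    """Build a Wayback-style replay URL: <prefix>/<14-digit timestamp>/<original URL>."""
    timestamp = captured_at.strftime("%Y%m%d%H%M%S")
    return f"{access_prefix.rstrip('/')}/{timestamp}/{original_url}"


if __name__ == "__main__":
    # Hypothetical access prefix and capture time, for illustration only.
    print(replay_url("https://example.org/wayback",
                     datetime(2011, 7, 15, 12, 0, 0),
                     "http://www.nsk.hr/"))
```

Encoding the capture time in the address lets one original URL point to many archived instances, for example one per annual domain harvest.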

Thematic crawls

In 2011, we also started to periodically harvest websites related to topics and events of national importance, using Heritrix and OpenWayback. Nine thematic collections have been created, mainly related to themes such as presidential, parliamentary or local elections, accession to the EU and the flood in Croatia. Each collection is described by several metadata elements: title, size, number of seeds/URLs and description.

Training and outreach

Twice a year, we organize a workshop within the Centre of Continuing Education for Librarians. With the main goal of introducing web archiving to library professionals and students, the workshop focuses on learning how to recognize online materials that should be preserved according to the existing criteria for cataloguing and archiving Croatian web resources. The participants are also introduced to the workflow of selective archiving, .hr harvests, the process of selecting materials for thematic collections and different ways of browsing the archived content.

With the experience we have gained over the years, we are happy to share our knowledge and expertise on web archiving and to support all those interested. To raise awareness of HAW and web archiving among librarians, archivists and the wider community, we use every opportunity available, such as presenting at national and international conferences and giving lectures to students and researchers.

A few thoughts for the future

The Croatian Web Archive currently holds more than 40 TB of content. We are working on a new web interface with additional functionalities and features, including full-text search for the domain harvests and a news section for the web archiving community and researchers. We also plan to integrate HAW’s metadata into the Digital Library portal in order to have a single access point for all digital collections.

By combining all three approaches (selective archiving, annual .hr domain harvests and thematic crawls) and using different software, the Library will attempt to cover, to the greatest extent possible, the contemporary part of Croatian cultural and scientific heritage.

Visit us: http://haw.nsk.hr/en

Announcing the IIPC Technical Speaker Series

By Jefferson Bailey, Director, Web Archiving (Internet Archive) & IIPC Chair

The IIPC is excited to announce a call for presenters in a new online series, the IIPC Technical Speaker Series. The goal of the IIPC Technical Speaker Series (TSS) is to facilitate knowledge sharing and foster conversations and collaborations among IIPC members around web archiving technical work.

The TSS will feature 30-60 minute online presentations or demonstrations related to tool development, software engineering, infrastructure management, or other specific technology projects. Presentations can take any format, including prepared slides, open conversations, or live demonstrations via screen sharing. Presentations will be from employees at IIPC member organizations and attendance will be open to all IIPC members. The TSS is intended to be informational, not a formal training or education program, and to provide an open venue for knowledge exchange on technical issues. The series will also give IIPC members the chance to demo and discuss technical work (including R&D, prototype, or early-stage work) taking place in member institutions that may have no other venue for presentation or discussion.

If you are interested in presenting, please fill out the short application form.

Details on applying:

  • Applicants must be employed by an IIPC member institution in good standing
  • Access to an online webinar system (WebEx, Zoom, etc.) will be provided
  • Presentations will be scheduled for 60 minutes, but can be shorter and should allow time for questions and discussion
  • Small stipends are available to presenters, if needed or if helpful in getting managerial approval to participate.

We aim to have 2-3 TSS events per quarter, scheduled at a time amenable to as many time zones as possible. Details on upcoming speakers and registration will be shared via the normal IIPC communication channels (listservs, blog, Slack, Twitter). This project is funded as part of IIPC’s 2018 suite of projects, including work by IIPC Portfolios and Working Groups, as well as other forthcoming member services. The TSS is currently administered by the IIPC Steering Committee Chair (jefferson@archive.org) and the IIPC Program and Communications Officer (Olga.Holownia@bl.uk). Contact either or both with any questions.

Please apply and present to the IIPC community all the excellent technical work taking place at your organization!