Archiving the Croatian web: has it been fourteen years already?

The National and University Library in Zagreb has been an IIPC member since 2008. The Croatian Web Archive (Hrvatski arhiv weba, HAW), established in 2004, is open access. Current projects include delivering metadata to Europeana, implementing the URN:NBN persistent identifier, migrating to OpenWayback, developing a new user interface and integrating with the Digital Library portal. The Web Archiving Team has also been introducing librarians, archivists and researchers to web archiving and to using HAW resources.


By Ingeborg Rudomino, Croatian Web Archive, National and University Library in Zagreb and Karolina Holub, Croatian Digital Library Development Centre, Croatian Institute for Librarianship, National and University Library in Zagreb

About HAW

The National and University Library in Zagreb (NUL), in collaboration with the University Computing Centre in Zagreb (Srce), established the Croatian Web Archive (Hrvatski arhiv weba, HAW) in 2004 and started to acquire, catalogue and archive online publications according to the legal deposit provisions of the Library Act of 1997. Due to the well-known characteristics of web resources, the NUL started by archiving selectively and established selection criteria.

Fig. 1. Croatian Web Archive Homepage.

We use several methods to identify a web resource for cataloguing and archiving: the HAW team searches and browses the web; website owners or content providers fill out the Registration form; or we receive notifications from the ISSN Centre for Croatia.

After identification, every resource is catalogued in the library system and automatically transferred into our custom-built archiving system, where the archiving process starts. Our long-standing experience in cataloguing this type of resource has shown the process to be very challenging: because the content is dynamic and variable, the bibliographic records need daily interventions. For that reason, we created cataloguing guidelines with a variety of examples. Our goal has been to preserve the original websites (their look and feel) as much as possible. To achieve quality, each resource is approached individually during the archiving process. The DAMP software, developed by the University Computing Centre in Zagreb, was built especially for this purpose. The workflow for processing web resources is integrated within the organisational structure of the Library.

We are proud of the quantity and quality of web resources stored in the Croatian Web Archive, which include websites of institutions, associations, clubs, research projects, news media, portals and blogs, official websites of counties and cities, and online journals and books. Special attention is given to news media websites and portals, which are archived daily, weekly or monthly.

Access and the first full domain crawl

This selective approach ensures quality and provides full control over the management of web resources. So far, over 6,700 titles have been archived and almost all are publicly available. All content is full-text searchable, and it’s possible to search by any word in the title, URL or keywords. Advanced search is available as well. Users can browse HAW alphabetically and by subject category; the categories are extracted from the UDC field in the catalogue.

Fig. 2. Screenshots of archived Croatian websites.

To secure permanent access to archived web resources, we have recently implemented the URN:NBN persistent identifier and assigned one to each archived title and each archived instance (Fig. 3).

Fig. 3. Screenshot of archived instances with URN:NBN.
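For readers who want to handle such identifiers programmatically, here is a minimal sketch of a URN:NBN parser. It assumes only the general shape of the namespace: the literal prefix `urn:nbn:`, a two-letter country code, a delimiter, and a locally assigned string. The function name `parse_urn_nbn` and the sample value are made up for illustration and are not taken from HAW.

```python
import re
from typing import NamedTuple


class NbnUrn(NamedTuple):
    country: str      # two-letter country code, lower-cased
    local_part: str   # everything the national library assigned after the code


# Assumption: "urn:nbn:", a two-letter country code, then ":" or "-" and the rest.
_URN_NBN = re.compile(r"^urn:nbn:([a-z]{2})[:-](.+)$", re.IGNORECASE)


def parse_urn_nbn(value: str) -> NbnUrn:
    """Split a URN:NBN into country code and locally assigned part."""
    match = _URN_NBN.match(value.strip())
    if match is None:
        raise ValueError(f"not a URN:NBN: {value!r}")
    return NbnUrn(country=match.group(1).lower(), local_part=match.group(2))


# Hypothetical identifier, not a real HAW value.
parsed = parse_urn_nbn("URN:NBN:hr:000-example")
print(parsed.country, parsed.local_part)
```

The parser deliberately leaves the locally assigned part untouched, since its internal structure is decided by each national library.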

Since 2013, the metadata from HAW has been delivered to Europeana through HAW’s OAI-PMH interface.
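As an illustration of how a harvester talks to an OAI-PMH interface in general, the sketch below builds a `ListRecords` request URL. The `verb`, `metadataPrefix` and `resumptionToken` parameters and the `oai_dc` format come from the OAI-PMH protocol itself; the base URL and the function name `list_records_url` are placeholders, not HAW's actual endpoint or code.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     resumption_token: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL.

    A resumption token is an exclusive argument in OAI-PMH, so when one is
    given it replaces metadataPrefix instead of being added to it.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint; the real HAW OAI-PMH address is not given in this post.
print(list_records_url("https://example.org/oai"))
```

A harvester would fetch the first URL, read the resumption token from the response if there is one, and repeat with that token until no token is returned.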

To overcome the limitations of selective archiving, the first harvest of the whole .hr domain was conducted in 2011 with the Heritrix web crawler. Since then, we have been harvesting the .hr domain annually. The collected content is publicly available via HAW’s website through the OpenWayback access interface (Fig. 4). To date, we have conducted seven .hr domain harvests.

Fig. 4. Screenshot of harvested website in OpenWayback.
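Wayback-style access interfaces usually address a capture by a 14-digit timestamp followed by the original URL. The sketch below builds such a replay URL under that assumption; the access-point prefix, the example website and the function name `wayback_url` are hypothetical and do not describe HAW's actual configuration.

```python
from datetime import datetime


def wayback_url(prefix: str, captured_at: datetime, original_url: str) -> str:
    """Build a Wayback-style replay URL: <prefix>/<YYYYMMDDhhmmss>/<original URL>."""
    timestamp = captured_at.strftime("%Y%m%d%H%M%S")
    return f"{prefix.rstrip('/')}/{timestamp}/{original_url}"


# Hypothetical access point and website, for illustration only.
print(wayback_url("https://example.org/wayback/",
                  datetime(2017, 12, 1, 8, 30, 0),
                  "http://www.example.hr/"))
```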

Thematic crawls

In 2011, we also started to periodically harvest websites related to topics and events of national importance, again using Heritrix and OpenWayback. Nine thematic collections have been created, mainly related to themes such as presidential, parliamentary or local elections, accession to the EU and the flood in Croatia. Each collection is described by several metadata elements: title, size, number of seeds/URLs and description.

Training and outreach

Twice a year, we organize a workshop within the Centre of Continuing Education for Librarians. Its main goal is to introduce web archiving to library professionals and students, so the workshop focuses on learning how to recognize online materials that should be preserved according to the existing criteria for cataloguing and archiving Croatian web resources. The participants are also introduced to the workflow of selective archiving, .hr harvests, the process of selecting materials for thematic collections and the different ways of browsing the archived content.

With the experience we have gained over the years, we are happy to share our knowledge and expertise on web archiving and to support all those interested. To increase awareness of HAW and web archiving among librarians, archivists and the wider community, we use every opportunity to do so, such as presenting at national and international conferences and giving lectures to students and researchers.

A few thoughts for the future

The Croatian Web Archive currently holds more than 40 TB of content. We are working on a new web interface with additional functionality, including full-text search for the domain harvests and a news section for the web archiving community and researchers. We also plan to integrate HAW’s metadata into the Digital Library portal in order to have a single access point for all digital collections.

By combining all three approaches (selective archiving, annual .hr domain harvests and thematic crawls) and using different software, the Library will attempt to cover, to the greatest extent possible, the contemporary part of Croatian cultural and scientific heritage.

Visit us: http://haw.nsk.hr/en
