The Croatian Web Archive (Hrvatski arhiv weba, HAW), launched in 2004, is open access. To celebrate its 15th anniversary, the National and University Library in Zagreb hosted the IIPC General Assembly and the Web Archiving Conference in June 2019. HAW has been the central point in Croatia for researching website development (.hr domain) and the HAW Team has also been organising training for librarians. One of HAW’s most recent projects was the development of the new portal.
By Karolina Holub, Library Adviser at the Croatian Digital Library Development Centre, Croatian Institute for Librarianship, Ingeborg Rudomino, Senior Librarian at the Croatian Web Archive, & Marta Matijević, Librarian at the Croatian Web Archive (National and University Library in Zagreb)
June 2019 – June 2020
It’s been more than a year since the National and University Library in Zagreb (NSK) hosted the IIPC General Assembly and Web Archiving Conference, which we remember with nostalgia.
Last year was a very busy year for the Croatian Web Archive (HAW) and we would like to share some of the key projects that we have been working on.
New portal design
The highlight of the last period was the launch of the new HAW portal.
It was a complex project that took two years, from the initial idea to the launch of the portal in February 2020. The portal was developed and is maintained by NSK website developers and the HAW team, and it is built on a customized WordPress theme. Because the new portal had to be integrated with the database of archived content, which is maintained by our partner, the University of Zagreb University Computing Centre (SRCE), a lot of coding was required to connect the portal with the archive database and ensure that everything works properly and smoothly.
Below you can see parts of our previous portals, in use from 2006 until 2020:
HAW’s website from 2006 until 2011
HAW’s website from 2011 until 2020
So, what’s new?
The most important objective was to put the search box in focus for all types of crawls and give users an easier way to find a resource. Because the crawls are searched in different ways, our goal was a clear distinction between selective crawls (which are indexed and can be searched by keyword, by any word in the title or URL, or through advanced search) and domain crawls (which can only be searched by entering the full URL). A valuable addition to this version of the portal is the set of basic metadata elements that accompany each resource in the portal that has a catalogue record.
Archived resource with the basic metadata elements (available also via library catalogue)
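The distinction between the two search modes described above can be illustrated with a minimal sketch. This is not the portal's actual code: the function names, the crawl type labels `"selective"` and `"domain"`, and the returned labels are all hypothetical, chosen only to show how a search box might decide between a keyword lookup and a full-URL lookup.

```python
from urllib.parse import urlsplit


def is_full_url(query: str) -> bool:
    """Return True if the query looks like a complete http(s) URL."""
    parts = urlsplit(query.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def route_search(query: str, crawl_type: str) -> str:
    """Decide which kind of lookup to run (hypothetical labels).

    Selective crawls are indexed, so keywords and URLs are both accepted.
    Domain crawls accept only a full URL.
    """
    if crawl_type == "selective":
        return "url" if is_full_url(query) else "keyword"
    if crawl_type == "domain":
        if not is_full_url(query):
            raise ValueError("domain crawls can only be searched by full URL")
        return "url"
    raise ValueError(f"unknown crawl type: {crawl_type!r}")
```

For example, `route_search("potres", "selective")` selects a keyword lookup, while the same query against `"domain"` is rejected because it is not a full URL.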
Additionally, the browsing of subject categories has been expanded with subject subcategories.
The visibility of the thematic collections has been improved by placing them on the title page. A new feature, In Focus, has also been added to highlight, in the form of blog posts, some of the most important or interesting events or anniversaries happening in the country, the city or at the Library. This feature is available only in the Croatian version of the portal. The central part of the homepage features the New in HAW and Gone from the web sections, where users can browse all publications that are new or that are no longer available on the live web. The About HAW page features a timeline marking all the important dates in the history of HAW.
Some parts of the new portal have largely remained the same with only slight improvements to make them more user-friendly and up to date. More information about Selection criteria, National .hr domain crawls, Statistics, Bibliography, FAQ etc. can be found in the footer.
The portal is also available in English.
New thematic collections
During this one-year period, we have been working on six thematic collections. Some of them are already available and others are still ongoing:
At the end of 2019, Presidential Elections were held in Croatia. The thematic crawl was conducted in January and the content is publicly available as part of this thematic collection.
Rijeka – European Capital of Culture 2020
The Croatian city of Rijeka is the European Capital of Culture 2020. All content related to this event, which is taking place during this challenging time, will be harvested. We are still collecting the content.
Croatian Presidency of the Council of the European Union
Croatia chaired the Council of the European Union from January to June 2020. We are finishing this thematic collection and it will soon be publicly available on HAW’s portal.
Our largest thematic collection so far is definitely COVID-19, which is still ongoing. We have involved the public in collecting the content by inviting nominations related to the coronavirus. In this thematic collection, we follow the events as covered on Croatian portals, blogs and articles, from the outbreak of the coronavirus in the Republic of Croatia and the world, through the general lockdown, to the gradual normalization we are in now.
Archived website (19.03.2020)
On March 22, just a few days after the start of the coronavirus lockdown in Croatia, Zagreb was hit by its biggest earthquake in 140 years, which caused numerous injuries and extensive damage. The Croatian Web Archive immediately started collecting content about this disaster. This thematic collection is publicly available on HAW’s portal.
Archived website (15.04.2020) (photo by HINA; Damir Senčar)
2020 Parliamentary Elections
When the spread of the coronavirus was believed to be under control, Croatia held the Parliamentary Elections on July 5. The content for this collection will be collected until the constitution of the new Croatian Parliament.
In May of this year, we started cataloguing thematic collections at the collection level. We have also contributed Croatian content to the IIPC Coronavirus (Covid-19) Collection.
Annual .hr crawl
In December 2019, we conducted the 9th annual domain crawl and collected 119 million resources amounting to 9.3 TB.
HAW also started installing and configuring tools for indexing and enabling full-text search of domain and thematic crawls: Webarchive-Discovery for parsing and indexing WARC files, Apache Solr for indexing and searching text content, and the SHINE web interface for index search and analysis. We are still in the testing phase and only part of the existing crawled content has been indexed.
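To give a sense of what full-text search over such an index involves, here is a minimal sketch that builds a request URL for Solr's standard `select` handler. The endpoint, the core name `haw-test`, and the field names `content` and `crawl_year` are assumptions made for illustration; they are not HAW's actual configuration, and the fields available depend on the schema used when the WARC files were indexed.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical Solr endpoint and core name, not HAW's real setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/haw-test/select"


def build_fulltext_query(text, crawl_year=None, rows=10):
    """Build a Solr select URL for a phrase search over indexed page text."""
    # Escape backslashes and double quotes so the phrase stays well-formed.
    phrase = text.replace("\\", "\\\\").replace('"', '\\"')
    params = [
        ("q", f'content:"{phrase}"'),  # assumed full-text field name
        ("rows", str(rows)),           # number of results to return
        ("wt", "json"),                # ask Solr for a JSON response
    ]
    if crawl_year is not None:
        # Filter query on an assumed crawl-year field.
        params.append(("fq", f"crawl_year:{crawl_year}"))
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


if __name__ == "__main__":
    url = build_fulltext_query("potres Zagreb", crawl_year=2020, rows=5)
    print(url)
    # Decode the query string again to inspect the parameters sent to Solr.
    print(parse_qs(urlsplit(url).query))
```

The sketch only constructs the URL and does not contact a server, so it can be run without a Solr installation.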
Testing Web Curator Tool for new collaborative processes – Local Web Crowd crawls
A new development phase is the collaboration with public libraries in crawling their local history collections, for which we are testing the Web Curator Tool. We expect the first results by the end of November this year.
In the coming months, we will be working on enabling more advanced use of HAW’s content to better suit researchers, starting with the creation of data sets from HAW collections. We will also prepare guidelines for using archived content on HAW’s portal. In addition, we are planning to update our training material in line with the new IIPC training material. In the meantime, we invite you to explore our new portal.