Web Archiving at the National Library of Ireland

National Library of Ireland Reading Room © National Library of Ireland.

The National Library of Ireland has a long-standing tradition of collecting, preserving and making accessible the published and printed output of Ireland. The library is over 140 years old, and we now also have rich digital collections concerning the political, cultural and creative life of Ireland. The NLI has been archiving the Irish web on a selective basis since 2011. We have over 17 TB of data in the selective web archive, openly available for research through our website. A particular strength of our web archive is its coverage of Irish politics, including a representation of every election and referendum since 2011. No longer in its infancy, the web archive has seen some exciting developments in recent years. This year we have begun working with the Internet Archive on our selective web archive and are looking forward to the new opportunities that this partnership will bring. We have also begun working closely with an academic researcher from a higher education institution in Ireland, who is carrying out network analysis on a portion of our selective data.

In 2007 and 2017, the NLI undertook domain crawling projects, and there is now over 43 TB of data archived from these crawls. The National Library of Ireland is a legal deposit library, entitling it to a copy of everything published in Ireland. However, unlike in many European countries, legal deposit legislation does not currently extend to online material, so we cannot make these crawls available. Despite these barriers, the library remains committed to preserving the online story of Ireland in whatever way we can.

Revisions to the legislation are currently before the Irish parliament and, if passed, will extend legal deposit to e-publications such as e-books and journals. The addition of websites to that list is currently being considered.

In 2017, the National Library of Ireland became a member of the IIPC, and we are excited to be attending our first General Assembly in Wellington. While we had anticipated talking about our newly available domain web archive portal and how it had affected our selective crawls, we now look forward to discussing the challenges we continue to face, including legal deposit, and how we are developing the web archive as a whole. We also hope to be able to report on progress with the legislative framework. We look forward to seeing you in Wellington!

How Can We Use Web Archives? A Brief Overview of WARP and How It Is Used

By Naotoshi Maeda, National Diet Library of Japan

As we all know, the use of web archives has recently become a hot topic in the web archiving community. At the 14th iPRES, held in Kyoto, the National Diet Library of Japan (NDL) took part in several sessions and presented some examples of how web archives can be used. Here, I share the poster and revisit those topics.

Fig. 1: The poster on web archive use cases presented at iPRES 2017 (pdf)

Overview of WARP

Since 2002, the NDL has been operating a web archive called WARP, which harvests websites under two different frameworks. The first is Japan’s legal deposit system; the second is the permission of the copyright holder. The National Diet Library Law allows the NDL to harvest the websites of public agencies, including those of the national government, municipal governments, public universities, and independent administrative agencies. Legal deposit does not, however, allow the NDL to harvest the websites of private organizations, so the NDL needs to obtain permission from the copyright holder beforehand. At present, WARP holds roughly 1 petabyte of data, comprising 5 billion files from 130,000 captures.

Fig. 2: Statistics of archived content in WARP

With the permission of rights holders, 85% of the archived websites can be accessed via the Internet, and WARP provides a variety of search methods, including search by URL, full text, metadata, and category.

WARP uses standard web archiving technologies: Heritrix for crawling, the WARC file format for storage, OpenWayback for playback, and Apache Solr for full-text search.
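To make the storage layer concrete, below is a minimal sketch of reading capture records from a WARC file with the open-source warcio library. This is an illustrative example only: the choice of warcio and the file name are assumptions, and the post does not describe WARP's internal tooling beyond the components named above.

```python
# Minimal sketch: iterate over the captures stored in a WARC file.
# Assumptions: the warcio library and the file name "example.warc.gz"
# are illustrative choices, not part of WARP's documented stack.
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # Each "response" record holds one capture of one URL.
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            date = record.rec_headers.get_header("WARC-Date")
            print(date, url)
```

Each response record carries both the HTTP payload and WARC headers such as the target URI and capture date, which is what lets the same files feed both playback in OpenWayback and full-text indexing in Solr.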

Linking from live websites

Given this background, here I show some examples of how WARP can be used.

The first use case is linking from live websites. As mentioned above, WARP comprehensively harvests and archives the websites of public agencies under the legal deposit system. A significant quantity of content is posted, updated, and deleted on these websites every day. Many of these agencies use WARP as a backup: before deleting content from their own websites, they add a link to the archived copy in WARP. Doing this keeps the content seamlessly available to visitors while reducing the operating costs of the agencies' own web servers.

Fig. 3: Linking from live websites to WARP.

Analysis and Visualization

The graphs below present the results of some analysis of content archived in WARP. The first, circular graph illustrates link relations between websites in Japan’s 47 prefectures, showing the extent of their interconnection on the Web. The second graph shows the percentage of URLs on national government websites that were still live in 2015, and indicates that 60% of the URLs that existed in 2010 returned 404 errors in 2015. The third, a bubble chart, shows the relative size of the data accumulated from each of the 10,000 websites archived in WARP, so you can see at a glance which websites are archived in WARP and how much data each contributes.

Fig. 4: Link relations between websites in Japan’s 47 prefectures.
Fig. 5: The percentage of URLs on national government websites that were still live in 2015.
Fig. 6: Relative size of data accumulated from each of the 10,000 websites archived in WARP.
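To give a sense of the kind of processing behind a graph like Fig. 4, the sketch below derives host-to-host link relations from archived HTML pages. It is a hypothetical reconstruction using warcio and BeautifulSoup with an invented file name, not the NDL's actual analysis pipeline.

```python
# Hypothetical sketch: build a host-to-host link graph from a WARC file.
# The libraries and the file name are assumptions for illustration.
from collections import Counter
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup
from warcio.archiveiterator import ArchiveIterator

edges = Counter()  # (source host, target host) -> number of links

with open("prefectures.warc.gz", "rb") as stream:  # hypothetical file name
    for record in ArchiveIterator(stream):
        if record.rec_type != "response" or record.http_headers is None:
            continue
        content_type = record.http_headers.get_header("Content-Type") or ""
        if "html" not in content_type:
            continue
        page_url = record.rec_headers.get_header("WARC-Target-URI")
        src_host = urlparse(page_url).netloc
        soup = BeautifulSoup(record.content_stream().read(), "html.parser")
        for anchor in soup.find_all("a", href=True):
            dst_host = urlparse(urljoin(page_url, anchor["href"])).netloc
            if dst_host and dst_host != src_host:
                edges[(src_host, dst_host)] += 1

# Print the ten strongest host-to-host connections.
for (src, dst), count in edges.most_common(10):
    print(f"{src} -> {dst}: {count} links")
```

Aggregating the resulting edges by prefecture would yield the kind of circular link-relation diagram shown in Fig. 4.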

Curation

The next use case shows how WARP can be used for curation. Curators can use a variety of search methods to find content of interest archived in WARP, but it is not easy for them to gauge the full extent of the archived content. The NDL curates archived content on a variety of subjects and provides visual representations that can lead curators to unexpected discoveries. Here are two examples: a search by region for the obsolete websites of defunct municipalities, and the 3D wall for the collection of the Great East Japan Earthquake in 2011.

Fig. 7: Search by region for obsolete websites of defunct municipalities.
Fig. 8: 3D wall for the collection of the Great East Japan Earthquake in 2011.

Extracting PDF Documents

The fourth use case is extracting PDF documents. The websites archived in WARP contain many PDF files of books and periodical articles. We search for these online publications and add metadata to those that are considered significant. These PDF files, together with their metadata, are then stored in the “NDL Digital Collections” as the “Online Publications” collection. Furthermore, the metadata are harvested via OAI-PMH by “NDL Search”, an integrated search service covering the catalogs of libraries, archives, museums, and academic institutions across Japan, so that the PDF files can be found using conventional search methods. Already, 1,400,000 PDF files cataloged in 1,000,000 records are available online. The NDL launched a new OPAC in January 2018 that implements a similar integrated search.

Fig. 9: Extracting PDF documents from archived websites.
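The extraction step itself can be sketched roughly as follows: walk a WARC file and write out every PDF response found in it. The library, file name, and output layout here are assumptions for illustration; WARP's actual workflow, including the creation of catalog metadata, is more involved.

```python
# Hypothetical sketch: extract PDF payloads from archived responses.
# warcio, the input file name, and the output directory are assumptions.
import hashlib
import pathlib

from warcio.archiveiterator import ArchiveIterator

out_dir = pathlib.Path("extracted_pdfs")
out_dir.mkdir(exist_ok=True)

with open("gov-sites.warc.gz", "rb") as stream:  # hypothetical file name
    for record in ArchiveIterator(stream):
        if record.rec_type != "response" or record.http_headers is None:
            continue
        content_type = record.http_headers.get_header("Content-Type") or ""
        if "application/pdf" not in content_type:
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        body = record.content_stream().read()
        # Name each file by a hash of its source URL to avoid collisions;
        # descriptive metadata would be attached in a later cataloging step.
        name = hashlib.sha1(url.encode()).hexdigest() + ".pdf"
        (out_dir / name).write_bytes(body)
        print("saved", url, "->", name)
```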

Future challenges

I would like to conclude this post with a short discussion of future challenges, which have also been the subject of lively discussion among IIPC members.

Web archives have tremendous potential for big data analysis, which could uncover how human history has been recorded in cyberspace. The NDL also needs to study how to make data sets suitable for data mining and how to promote engagement with researchers.

Another challenge is the development of even more robust search engine technology. WARP provides full-text search with Apache Solr and has already indexed 2.5 billion files, with indexes totaling 17 terabytes. But we are not satisfied with the search results, which contain duplicate material archived at different times, along with other noise. We need to develop a robust and accurate search engine specialized for web archives, one that makes use of temporal information.
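One possible direction, sketched below, is to have Solr collapse duplicate captures of the same URL and restrict results by capture date. The core name, field names, and endpoint in this example are hypothetical and do not reflect WARP's actual schema.

```python
# Hedged sketch: query a Solr index of web archive captures while
# collapsing duplicates on the page URL and filtering by capture date.
# The core "warp" and the fields "url", "capture_date", "content" are
# invented for illustration.
import requests

params = {
    "q": "content:地震",          # full-text query (example term: "earthquake")
    "fq": [
        "{!collapse field=url}",  # keep only one result per URL
        "capture_date:[2011-03-01T00:00:00Z TO 2011-12-31T23:59:59Z]",
    ],
    "sort": "capture_date desc",  # prefer the most recent capture per URL
    "rows": 10,
    "wt": "json",
}

resp = requests.get("http://localhost:8983/solr/warp/select", params=params)
for doc in resp.json()["response"]["docs"]:
    print(doc.get("capture_date"), doc.get("url"))
```

Collapsing on the URL only removes exact re-captures; near-duplicate pages with different URLs would still need content-based deduplication and temporal ranking, which is where the research effort lies.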