How Can We Use Web Archives? A Brief Overview of WARP and How It Is Used

By Naotoshi Maeda, National Diet Library of Japan

As we all know, the use of web archives has recently become a hot topic in the web archiving community. At the 14th iPRES, held in Kyoto, the National Diet Library of Japan (NDL) took part in several sessions and presented some examples of how web archives can be used. Here, I share the poster and revisit those topics.

Fig. 1: The poster on use cases of web archives presented at iPRES 2017 (PDF)

Overview of WARP

Since 2002, the NDL has operated a web archive called WARP, which harvests websites under two different frameworks: Japan’s legal deposit system and, for other sites, prior permission from the copyright holder. The National Diet Library Law allows the NDL to harvest the websites of public agencies, including those of the national government, municipal governments, public universities, and independent administrative agencies. Legal deposit does not, however, cover the websites of private organizations, so the NDL must obtain permission from the copyright holder beforehand. At present, WARP holds roughly 1 petabyte of data, comprising 5 billion files from 130,000 captures.

Fig. 2: Statistics of archived content in WARP

85% of the archived websites can be accessed via the Internet, based on permissions granted by rights holders, and WARP provides a variety of search methods, including search by URL, full text, metadata, and category.

WARP uses standard web archiving technologies: Heritrix for crawling, the WARC file format for storage, OpenWayback for playback, and Apache Solr for full-text search.
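WARP’s internal pipeline is not published, but the same standard components can be explored with open tooling. Below is a minimal sketch, assuming a local WARC file produced by a Heritrix crawl and the third-party Python library warcio; the filename is a placeholder.

```python
from warcio.archiveiterator import ArchiveIterator

# Iterate over the records of a (possibly gzipped) WARC file and list
# the captured URLs of HTML responses. "example.warc.gz" is a placeholder.
with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue  # skip request, metadata, and revisit records
        url = record.rec_headers.get_header("WARC-Target-URI")
        content_type = record.http_headers.get_header("Content-Type", "")
        if "text/html" in content_type:
            print(url)
```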

Linking from live websites

Given this background, here are some examples of how WARP can be used.

The first use case is linking from live websites. As mentioned above, WARP comprehensively harvests and archives the websites of public agencies under the legal deposit system. A significant quantity of content is posted, updated, and deleted on these websites every day, and many of these agencies use WARP as a backup database. Before deleting content from their own websites, they add a link to the copy archived in WARP. This keeps the content seamlessly available to visitors while also reducing the operating costs of their own web servers.

Fig. 3: Linking from live websites to WARP.
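On the agency side, the mechanics amount to link rewriting. The sketch below is a hypothetical illustration only: the archived_url() helper and its path pattern are assumptions, not WARP’s actual snapshot URL scheme, and the BeautifulSoup library is used simply to rewrite anchors in a page before the original content is removed.

```python
from bs4 import BeautifulSoup


def archived_url(original_url: str, snapshot_id: str) -> str:
    """Hypothetical mapping from an original URL to its archived copy.
    The path pattern below is a placeholder, not WARP's actual scheme."""
    return f"https://warp.da.ndl.go.jp/{snapshot_id}/{original_url}"


def rewrite_links(html: str, to_delete: dict) -> str:
    """Replace links to pages scheduled for deletion with links to their
    archived copies. `to_delete` maps original URLs to snapshot IDs."""
    soup = BeautifulSoup(html, "html.parser")
    for anchor in soup.find_all("a", href=True):
        if anchor["href"] in to_delete:
            anchor["href"] = archived_url(anchor["href"], to_delete[anchor["href"]])
    return str(soup)
```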

Analysis and Visualization

The graphs below present the results of some analyses of content archived in WARP. The first, a circular graph, illustrates link relations between the websites of Japan’s 47 prefectures, showing the extent of their interconnection on the Web. The second shows the percentage of URLs on national government websites that were still live in 2015, and indicates that 60% of the URLs that existed in 2010 returned 404 errors in 2015. The third, a bubble chart, shows the relative amount of data accumulated from each of the 10,000 websites archived in WARP, so you can see at a glance which websites are archived and how much data each accounts for.

Fig. 4: Link relations between websites in Japan’s 47 prefectures.
Fig. 5: Percentage of URLs on national government websites that were still live in 2015.
Fig. 6: Relative size of data accumulated from each of the 10,000 websites archived in WARP.
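The NDL’s analysis pipeline is not described in the poster, but the liveness measurement behind Fig. 5 can be approximated with a simple check. A rough sketch, assuming a list of URLs extracted from the 2010 captures and using the requests library:

```python
import requests


def classify_liveness(urls):
    """Count how many previously archived URLs are still live, now return
    404, or are unreachable. Network errors are counted separately."""
    counts = {"live": 0, "gone_404": 0, "unreachable": 0}
    for url in urls:
        try:
            response = requests.head(url, allow_redirects=True, timeout=10)
            if response.status_code == 404:
                counts["gone_404"] += 1
            else:
                counts["live"] += 1
        except requests.RequestException:
            counts["unreachable"] += 1
    return counts


# Example: share of URLs that now return 404.
# urls = [...]  # URLs extracted from 2010 captures
# counts = classify_liveness(urls)
# print(counts["gone_404"] / sum(counts.values()))
```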

Curation

The next use case shows how WARP can be used for curation. Curators can use a variety of search methods to find content of interest archived in WARP, but it is not easy for them to gauge the full extent of the archived content. The NDL curates archived content on a variety of subjects and provides visual representations that can lead curators to unexpected discoveries. Here are two examples: a search by region for obsolete websites of defunct municipalities, and the 3D wall for the collection on the Great East Japan Earthquake of 2011.

Fig. 7: Search by region for obsolete websites of defunct municipalities.
Fig. 8: 3D wall for the collection of the Great East Japan Earthquake in 2011.

Extracting PDF Documents

The fourth use case is extracting PDF documents. The websites archived in WARP contain many PDF files of books and periodical articles. We search for these online publications and add metadata to those considered significant. These PDF files, together with their metadata, are then stored in the “NDL Digital Collections” as the “Online Publications” collection. The metadata are also harvested via OAI-PMH by “NDL Search”, an integrated search service covering the catalogs of libraries, archives, museums, and academic institutions across Japan, so that curators can find the PDF files using conventional search methods. Some 1,400,000 PDF files, cataloged in 1,000,000 records, are already available online. The NDL launched a new OPAC in January 2018 that implements a similar integrated search.

Fig. 9: Extracting PDF documents from archived websites.
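As a rough illustration of the OAI-PMH harvesting step, the sketch below uses the third-party Sickle client to pull Dublin Core records. The endpoint URL is a placeholder; the actual base URL and metadata format should be taken from the NDL Search documentation.

```python
from sickle import Sickle  # third-party OAI-PMH harvesting client

# Placeholder endpoint; replace with the real OAI-PMH base URL.
ENDPOINT = "https://example.org/oaipmh"

sickle = Sickle(ENDPOINT)
# ListRecords handles resumption tokens transparently, so this loop
# iterates over the full set of Dublin Core records.
for record in sickle.ListRecords(metadataPrefix="oai_dc"):
    metadata = record.metadata  # dict of Dublin Core elements
    print(metadata.get("title", []), metadata.get("identifier", []))
```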

Future challenges

I want to conclude this post with a short discussion of future challenges, which have also been actively discussed among IIPC members.

Web archives have tremendous potential for use in big data analysis, which could be used to uncover how human history has been recorded in cyberspace. The NDL also needs to study how to make data sets suitable for data mining and how to promote engagement with researchers.

Another challenge is the development of even more robust search engine technology. WARP provides full-text search with Apache Solr and has already indexed 2.5 billion files, with indexes totaling 17 terabytes. But we are not yet satisfied with the search results, which contain duplicate material archived at different times, along with other noise. We need to develop a robust and accurate search engine, specialized for web archives, that takes temporal information into account.
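WARP’s Solr schema is not public, but one common way to suppress duplicates of the same page captured at different times is Solr’s collapsing query parser, which keeps a single result per group. A minimal sketch follows, in which the field names fulltext and url and the collection name warp are assumptions for illustration.

```python
import requests

# Placeholder Solr endpoint and collection name.
SOLR_SELECT = "http://localhost:8983/solr/warp/select"

params = {
    "q": "fulltext:earthquake",
    # Collapse results so that only one capture per original URL is returned;
    # within each group the highest-scoring document is kept.
    "fq": "{!collapse field=url}",
    "rows": 20,
}
response = requests.get(SOLR_SELECT, params=params)
for doc in response.json()["response"]["docs"]:
    print(doc.get("url"))
```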
