20 Years of the Web Archiving Project (WARP) at the National Diet Library, Japan

By SHIMURA Tsutomu, National Diet Library (NDL), Japan



2022 marked the 20th anniversary of the start of the Web Archiving Project (WARP) at the National Diet Library, Japan. The following article introduces the progress made over the 20 years since the project was first launched on an experimental basis in 2002.

The History of the Web Archiving Project

The National Diet Library’s Web Archiving Project, or WARP, was launched in 2002 as an experimental project to collect, preserve, and provide access to a small number of Japanese websites from both the public and the private sectors with the permission of the webmasters. In 2006, the project was expanded to include all government-related organizations.

In 2009, the National Diet Library Law was amended to enable us to comprehensively collect any website published by public institutions, including all national and municipal government agencies, without permission from the publisher. When the amendment came into force the following year, in 2010, we began archiving these websites at regular intervals, such as monthly or quarterly, depending on the type of institution. It was at this time that the basic framework of the current WARP took shape.

In 2013, we updated the system and began providing access to curated content, such as the Monthly Special feature. In 2018, we developed an English-language user interface in the hope of further expanding our audience. In 2021, we improved the display of search results, and in 2022 we greatly improved the mechanism for navigating links within archived content.

Changes in the WARP website

The layout of the WARP website has been changed three times so far: at the start of the experimental project in 2002, at the start of comprehensive archiving in 2010, and at the time of the 2013 update.

Figure: WARP as an experimental project (screenshot from 20 Apr 2003)
Figure: The beginning of archiving based on the National Diet Library Law (screenshot from 13 Jul 2010)
Figure: Updated layout (screenshot from 1 Aug 2013)

Number of targets

Figure: Changes in number of targets

During the period from FY2002, when we began as an experimental project, until FY2009, we only archived websites when we had obtained permission from the webmasters, irrespective of whether the website was published by a public or a private institution.

With the start of comprehensive archiving in FY2010, we were able to archive the websites of all public institutions, which greatly increased the number of targets. The graph shows that between FY2009 and FY2011, the number of targets increased by more than 2,000.

In addition, we have continued to request permission to collect private websites on a daily basis, and the number of targets has increased each year. We generally concentrate our requests on a specific type of institution for a certain period of time; in 2015, for example, the number of targets increased significantly due to intensive requests made to public interest foundations. As a result, the number of private-sector targets has grown to about 8,000, which now exceeds the number of public-sector targets.

To provide access to archived websites via the Internet, we need the permission of the holder of the right of public transmission, a right granted under the Copyright Law of Japan. We therefore ask each webmaster for permission before making an archived website available online. As of FY2021, we were able to provide access via the Internet to 12,435, or about 90%, of the targets we have archived. This makes WARP one of the most Internet-accessible web archives in the world.

Data size

Figure: Changes in data size

The size of archived data has increased rapidly since the start of comprehensive archiving of the websites of public agencies in FY2010, reaching nearly 2,400 TB in FY2021. This is due to increases in both the number of targets collected and the size of the data published by each institution.

System configuration transition

Over the past 20 years, the system configuration and the technologies implemented in WARP have changed significantly. The three most important technologies for collecting, preserving, and providing access to web archives are harvest software, storage format, and replay software. In addition, we provide a full-text search function to make it easier for users to find content of interest in the vast volume of archived material. Here is a brief summary of how each of these components has changed.

Harvest software

At first, we used the open-source software Wget to store the harvested websites in units of files. In 2010, we implemented Heritrix, harvesting software that is specialized for web archiving and is used as a standard by web archiving organizations around the world, and we have been using it ever since. In 2013, a deduplication function was added to reduce the volume of data to be stored. This function saves only the files that have been updated since the previous harvest, which reduces the total volume of data saved and conserves storage space.
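The idea of saving only updated files can be sketched as follows. This is a minimal illustration, assuming that changes are detected by comparing a content digest per URL against the previous harvest; it does not claim to reproduce how Heritrix or WARP implement the function, and the function names and example.org URLs are hypothetical.

```python
import hashlib


def content_digest(payload: bytes) -> str:
    """Return a SHA-1 hex digest used to detect whether content has changed."""
    return hashlib.sha1(payload).hexdigest()


def select_updated(previous_digests, harvested):
    """Pick the files that are new or changed since the previous harvest.

    previous_digests: dict mapping URL -> digest recorded at the last harvest
    harvested:        dict mapping URL -> bytes fetched in this harvest
    Returns (files to store, digests to carry over to the next harvest).
    """
    to_store = {}
    current_digests = {}
    for url, payload in harvested.items():
        digest = content_digest(payload)
        current_digests[url] = digest
        # Store the file only if it is new or its digest differs.
        if previous_digests.get(url) != digest:
            to_store[url] = payload
    return to_store, current_digests


# Hypothetical first harvest: nothing is known yet, so everything is stored.
first_crawl = {
    "http://www.example.org/": b"<html>top page v1</html>",
    "http://www.example.org/logo.png": b"\x89PNG placeholder bytes",
}
stored_first, digests = select_updated({}, first_crawl)

# Hypothetical second harvest: only the top page has changed.
second_crawl = dict(first_crawl)
second_crawl["http://www.example.org/"] = b"<html>top page v2</html>"
stored_second, digests = select_updated(digests, second_crawl)

print(sorted(stored_first))
print(sorted(stored_second))
```

In this sketch the second harvest stores only the changed top page, while the unchanged image is skipped.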

Storage format

While we were using Wget, data was saved in units of files. With the implementation of Heritrix in 2010, the storage format was changed to the WARC format. The files that comprise each website, as well as metadata about those files, are stored together in a WARC format file. The WARC format also allows for the archiving of information that could not be included when saving data in units of files. For example, information that has no content of its own, such as a redirection to a new URL when the URL of a website changes, can now be saved.
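To illustrate how a redirection with no content can be kept, here is a simplified sketch that serializes one WARC/1.0 `response` record by hand. It is not how Heritrix writes records; the function name `build_warc_response` and the example.org URLs are assumptions made for illustration only.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response(target_uri: str, http_response: bytes, when: datetime) -> bytes:
    """Serialize one simplified WARC/1.0 'response' record.

    The record block is the raw HTTP response (status line, headers, body),
    so a redirect with an empty body is still captured in full.
    """
    header_lines = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    # Header lines end with CRLF; a blank line separates header and block,
    # and two CRLFs terminate the record.
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    return head + http_response + b"\r\n\r\n"


# A hypothetical redirect: the HTTP body is empty, yet the status line and
# the Location header are preserved inside the record block.
redirect = (
    b"HTTP/1.1 301 Moved Permanently\r\n"
    b"Location: http://www.example.org/new/\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
)
record = build_warc_response(
    "http://www.example.org/old/",
    redirect,
    datetime(2010, 7, 13, tzinfo=timezone.utc),
)
print(record.decode("utf-8"))
```

When saving in units of files, there would be no file to write for such a response, whereas here the redirection itself becomes part of the archive.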

Replay software

To view a website saved in WARC format files, either the files comprising the website must be extracted from the WARC format files and placed on a general web server, or dedicated software for replaying WARC format files is needed. Initially, we adopted the former method. This meant that storage capacity was required for the extracted data in addition to the original WARC format files. We currently use OpenWayback, which allows users to browse the contents of WARC format files directly, eliminating the need for storage space for data extracted into units of files.
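The difference between the two approaches can be sketched with a small reader that serves a page straight from WARC data without extracting any files. This is an illustration under simplifying assumptions (uncompressed records and a linear scan), not a description of OpenWayback's internals; production replay tools generally rely on an index rather than scanning. All names and URLs here are hypothetical.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, block) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line.strip() == b"":
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("not a WARC record header")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the record header
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Read exactly the number of bytes announced for the record block.
        block = stream.read(int(headers["content-length"]))
        yield headers, block


def replay(stream, uri):
    """Return the HTTP body archived for uri, or None if it is not found."""
    for headers, block in iter_warc_records(stream):
        if headers.get("warc-type") == "response" and headers.get("warc-target-uri") == uri:
            _http_head, _, body = block.partition(b"\r\n\r\n")
            return body
    return None


def make_record(uri, http_response):
    """Build a small sample record (fixed date and ID, for the demo only)."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        "WARC-Date: 2013-08-01T00:00:00Z\r\n"
        "WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
        f"Content-Length: {len(http_response)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + http_response + b"\r\n\r\n"


page = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>archived page</html>"
css = b"HTTP/1.1 200 OK\r\nContent-Type: text/css\r\n\r\nbody { margin: 0; }"
warc_bytes = (
    make_record("http://www.example.org/", page)
    + make_record("http://www.example.org/style.css", css)
)

body = replay(io.BytesIO(warc_bytes), "http://www.example.org/style.css")
missing = replay(io.BytesIO(warc_bytes), "http://www.example.org/none")
print(body)
print(missing)
```

Because the content is read from the WARC data on request, no second copy of the website needs to be kept on a web server.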

Full-text search software

Full-text search software was introduced during the experimental project period, but at that time it was created exclusively for WARP. In 2010, we adopted Solr, open-source full-text search software widely used around the world, to improve the search speed for large-scale data.
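Solr is built on Apache Lucene, which answers queries from an inverted index rather than by scanning every document. The toy sketch below shows that general technique only; it does not use Solr or reflect WARP's configuration, and the function names and sample pages are hypothetical. Its simple word tokenizer would not segment Japanese text, which in practice needs a dedicated analyzer.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive analyzer)."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Build an inverted index mapping each term to the set of document IDs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        # Intersect posting sets instead of rescanning the document texts.
        results &= index.get(term, set())
    return results


# Hypothetical archived pages keyed by URL.
pages = {
    "http://www.example.org/a": "Annual report of the city library",
    "http://www.example.org/b": "City council meeting minutes",
    "http://www.example.org/c": "Library opening hours",
}
index = build_index(pages)
hits = search(index, "city library")
print(sorted(hits))
```

Lookup cost depends on the number of matching terms and postings rather than on the total size of the archive, which is why this kind of structure suits large-scale data.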

What prompted us to launch Monthly Special?

At the time of the interface renewal in 2013, WARP had been steadily archiving websites, but web archiving itself was not yet well known in Japan. We wanted to attract the interest of more people, so we started publishing introductory articles on web archiving, such as Monthly Special, Mechanism of Web Archiving, and Web Archives of the World.

In particular, each Monthly Special either features archived websites related to a topic of interest chosen by our staff or presents an article explaining WARP.

Currently this content is available only in Japanese.

In closing

Looking back over the 20 years since the start of the project, you can see that the number of targets and the size of data archived have steadily increased.

We believe that the role of web archives will continue to grow in importance. We are committed to collecting and preserving websites on a regular basis as well as making them available to as many people as possible.
