The WARC file format celebrates its 10th anniversary

By Sara Aubry, Web Archiving Project Manager at BnF

The WARC format is our Web ARChive format. It defines a way of combining digital resources into an aggregate archival file along with related metadata. Today it is commonly used to store web crawls. For newcomers, a WARC file is made of one or more records. Each record consists of a header followed by a content block. The header has mandatory named fields that document, for instance, the URI, the date, the type and the length of the record. The content block may contain resources in any format, such as an HTML page, a binary image or a video file. WARC is an extension of the ARC file format designed by the Internet Archive in 1996.

The WARC format was initially released as an ISO international standard 10 years ago, in May 2009, under the number 28500:2009 (we also call it WARC version 1.0). Standardization opened the path to wider use and implementation in a variety of applications for harvesting, accessing, mining, exchanging and preserving digital resources. While it is the only standard format for web archives, it has also been adopted beyond the web archiving community to store born-digital or digitized materials.
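To make the record layout described above more concrete (a header of named fields followed by a content block), here is a minimal Python sketch that writes two uncompressed records into memory and reads them back. It is an illustration only, not a reference implementation: the helper names build_record and read_record, the example URIs, the record ID and the date are all made up for this example, and the reader ignores details a real tool would handle, such as folded header lines and gzip compression.

```python
import io


def build_record(target_uri: str, payload: bytes,
                 content_type: str = "text/html") -> bytes:
    """Serialize one record: version line, named fields, blank line, block.

    Hypothetical helper; the field values below are placeholders.
    """
    header_lines = [
        "WARC/1.0",
        "WARC-Type: resource",
        "WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>",
        "WARC-Date: 2009-05-01T00:00:00Z",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    header = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # A record ends with two CRLF pairs after the content block.
    return header + payload + b"\r\n\r\n"


def read_record(stream):
    """Read one record from a binary stream, or return None at end of file."""
    version = stream.readline().rstrip(b"\r\n").decode("utf-8")
    if not version:
        return None
    fields = {}
    while True:
        line = stream.readline().rstrip(b"\r\n")
        if not line:  # the blank line separates the header from the block
            break
        name, _, value = line.decode("utf-8").partition(":")
        fields[name.strip()] = value.strip()
    # Content-Length tells the reader how many bytes the block occupies,
    # so the block can hold any format, including binary data.
    block = stream.read(int(fields["Content-Length"]))
    stream.read(4)  # skip the trailing CRLF CRLF
    return version, fields, block


if __name__ == "__main__":
    data = (build_record("http://example.org/", b"<html>hi</html>")
            + build_record("http://example.org/a.png", b"\x89PNG\r\n",
                           content_type="image/png"))
    stream = io.BytesIO(data)
    while (record := read_record(stream)) is not None:
        version, fields, block = record
        print(version, fields["WARC-Type"], fields["WARC-Target-URI"],
              len(block))
```

Because the reader relies on the length field rather than scanning for a delimiter, the content block is free to contain line breaks or arbitrary bytes, which is what lets a single file aggregate pages, images and video.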

As with all ISO standards, the WARC standard is periodically reviewed to ensure that it continues to meet the changing needs that emerge from our practice. The first revision, supported by an IIPC task force and the subcommittee in charge of technical interoperability within the ISO information and documentation technical committee (ISO/TC46/SC4), was published in August 2017 as ISO 28500:2017 (it is also known as WARC version 1.1). This revision mainly introduced new named fields for deduplication and the possibility of more precise timestamps (see the IIPC GitHub for more details).

During the last IIPC general assembly, which took place in November 2018 in Wellington, we started to discuss possible evolutions for the second revision. The ISO vote required to launch the revision process is currently scheduled for 2022. Alex Osborne from the National Library of Australia challenged the format to support the HTTP/2 protocol. Ilya Kremer presented Rhizome's current implementation for recording provenance headers, which indicate that a record has been created from another record and not from the original URL. Ilya also presented a need to keep track of the dynamic history of a web page's display. Exchanges have continued and are still active on the IIPC GitHub and Slack (#warc channel). Current hot topics are how to keep track of media conversion (in particular for video and audio files) and how to reference a “transcluded” video or audio file from another page.

All these topics need time for raising awareness, in-depth discussion, shared testing and tool implementation within our community before they can be drafted and included in the standard. If you want to join the current discussions or raise any other topic, please join the IIPC #warc channel on Slack.
