Investigate holdings of web archives through summaries: cdx-summarize

By Yves Maurer, Web Archiving technical lead at the National Library of Luxembourg


Introduction

When researchers want to access web archives, they have two main options. They can either use one of the excellent web archives that are freely accessible online, such as web.archive.org, arquivo.pt, vefsafn.is, haw.nsk.hr, or Common Crawl, or they can travel to the libraries and archives whose web archives are only available in their respective reading rooms. In fact, most web archives have some restrictions on access: copyright and other legal considerations often make it difficult for institutions to open up a web archive to the broader Internet. Closed web archives are hard to access for researchers, especially those who live far from the physical reading rooms. The overall effect is that closed web archives are less used, studied, and published about, and researchers only travel to the closest reading rooms, if at all.

However, web archiving institutions would like more researchers to use their archives and to popularize web archives among users from contemporary history, sociology, linguistics, economics, law, and other disciplines. For closed web archives, little data about their contents is usually publicly available, so it is difficult to convince researchers to travel to the reading room when they do not know in advance what exactly the archive contains and whether it is pertinent to their research question. Web archives are also very large, which makes handling the raw WARC files difficult for all parties, so sending extracts of data from institution to research team is often not feasible.

It would certainly be preferable for researchers if those closed web archives simply opened their entire service to the Internet, but the wholesale lifting of legal restrictions is not easy. Therefore, if researchers cannot access the whole dataset, can they at least access some part that gives them an overview of the collection? A size indication alone (e.g. 340 TB) and the number of mementos (e.g. 3 billion) will not help much. A collection policy documenting the aims, scope, and contents of the web archive (e.g. https://www.bnf.fr/fr/archives-de-linternet) is already more helpful, but holds no numbers or information about particular sites of interest. There is, however, a type of data that sits in between the legal challenges of full access on the one hand and a textual description or rough single numbers on the other. Such data must be neither encumbered by legal restrictions nor so massive that it becomes unwieldy.

Developed as part of the WARCNET network's working group 1, the cdx-summarize (https://github.com/ymaurer/cdx-summarize) toolset proposes to generate and handle such a dataset. The data it contains carries no legal restrictions, since it is aggregated and contains neither copyrighted information nor personal data. Moreover, the file is of a manageable size. An institution with a closed web archive can publish the summary file for the whole collection, and researchers can then investigate its contents or compare it to the summary files from other institutions. In this way, web archives can publicize their collections and make them available for a rough first level of data exploration.

Sample uses of summary files for researchers

The summary files produced by cdx-summarize are simple, but still contain statistics about the different years in which mementos were harvested, as well as the number and sizes of the different file types included in the collection. None of the following samples requires direct access to a web archive, only to a summary file. It is not the aim of this blog post to investigate these examples in detail, just to give readers an idea of how rich this summary data still is and what can be done with it.

A very simple example is a chart comparing the evolution of the average size of HTML files over the years.

Fig 1. Average size of HTML files in the Luxembourg Web Archive
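
Producing such a chart takes only a few lines of code. The following is a minimal Python sketch, assuming a local copy of the Luxembourg summary file and the line format described in the format section below; only the plotting is left out.

```python
import json
from collections import defaultdict

# Per-year totals: [sum of HTML bytes, number of HTML mementos]
totals = defaultdict(lambda: [0, 0])

with open("webarchive-lu.summary", encoding="utf-8") as f:
    for line in f:
        # Each line is "host.tld {json}"; split off the domain prefix
        domain, js = line.split(" ", 1)
        for year, stats in json.loads(js).items():
            totals[year][0] += stats.get("s_html", 0)
            totals[year][1] += stats.get("n_html", 0)

for year in sorted(totals):
    size, count = totals[year]
    if count:
        print(f"{year}: {size / count / 1024:.1f} KiB average HTML size")
```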

Another example is to use the information about 2nd-level domains that is still present in the summary file to find out more about domain names in general, as in the following example:

Fig 2. First letter frequency in Internet Archive 2nd-level domains vs French dictionary for the TLD .fr

Here, you could, for example, explain the overall abundance of 2nd-level domains starting with the letter "L" by the fact that the French articles "le, la, les" all start with "l", and so probably do quite a lot of domain names. Other deviations from the mean may need a deeper explanation.
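
Counting first letters is equally straightforward. A sketch, assuming a summary file for the relevant TLD has already been generated (the file name ia-fr.summary is hypothetical):

```python
import json
from collections import Counter

first_letters = Counter()

with open("ia-fr.summary", encoding="utf-8") as f:  # hypothetical summary for the .fr TLD
    for line in f:
        domain = line.split(" ", 1)[0]
        labels = domain.split(".")
        # The 2nd-level domain is the label just left of the TLD, e.g. "louvre" in louvre.fr
        if len(labels) >= 2 and labels[-2]:
            first_letters[labels[-2][0].lower()] += 1

total = sum(first_letters.values())
for letter, count in first_letters.most_common():
    print(f"{letter}: {100 * count / total:.1f}%")
```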

Comparing web archives

Another nice thing about the summary files is that they can be produced for different web archives and then compared. At the time of writing, I do not have access to any other closed web archive's summary file apart from the one for the Luxembourg Web Archive (https://github.com/ymaurer/cdx-summarize/blob/main/summaries/webarchive-lu.summary.xz) (19.1 MB). However, there are open web archives with public APIs, like the Internet Archive's CDX server (https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server) or the Common Crawl (https://index.commoncrawl.org/). These can be used to generate a summary file for, e.g., a whole top-level domain, as sketched below.
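
For instance, the Internet Archive's CDX server can be queried over HTTP. The query parameters below (url, matchType, from, to, fl, output) are part of the documented CDX API, but the aggregation is reduced to a single count here, so this is an illustration rather than the toolset's actual fetching code.

```python
import json
import urllib.request

# Fetch all 2003 captures for one domain from the Internet Archive CDX server.
# Large domains may need the API's pagination; this sketch ignores that.
url = ("https://web.archive.org/cdx/search/cdx"
       "?url=bnl.lu&matchType=domain&from=2003&to=2003"
       "&fl=timestamp,mimetype,length&output=json")

with urllib.request.urlopen(url) as resp:
    rows = json.load(resp)

captures = rows[1:] if rows else []  # first row is the field-name header
n_html = sum(1 for _, mime, _ in captures if mime == "text/html")
print(f"{len(captures)} captures in 2003, {n_html} of them text/html")
```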

A first comparison between web archives can be done on the 2nd-level domains. Do all the web archives concerned hold data from all the domains? Or does one archive have a clear focus on just a small subset of the domains? The following chart compares the inclusion of domains from the TLD ".lu" in three web archives:

Fig 3. Inclusion of domains from the TLD ".lu" in three web archives

The graph clearly shows that the Luxembourg Web Archive started in 2016 and that it is collaborating with the Internet Archive, which holds a second copy of the same data. It also shows that the Common Crawl is much less broad in terms of included domains.
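
Such a domain-level comparison boils down to set operations on the summary files. A sketch, assuming two summary files are at hand (ia-lu.summary, a file built via the public APIs above, is hypothetical):

```python
import json

def domains_per_year(path):
    """Map each year to the set of domains with at least one memento in that year."""
    per_year = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            domain, js = line.split(" ", 1)
            for year, stats in json.loads(js).items():
                if stats.get("n_total", 0) > 0:
                    per_year.setdefault(year, set()).add(domain)
    return per_year

lu = domains_per_year("webarchive-lu.summary")
ia = domains_per_year("ia-lu.summary")  # hypothetical file built via the CDX API

for year in sorted(set(lu) | set(ia)):
    a, b = lu.get(year, set()), ia.get(year, set())
    print(f"{year}: {len(a)} LU domains, {len(b)} IA domains, {len(a & b)} in both")
```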

A deeper comparison of the mementos held by different web archives is probably better done on the CDXJ index files themselves. There will still be edge cases where mementos differ slightly because of embedded timestamps, session identifiers, etc., but it will give a more detailed picture of the overlaps.
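
As a rough illustration of that idea, the payload digests recorded in two CDXJ files can be intersected directly. This sketch assumes pywb-style CDXJ lines of the form "surt timestamp {json}" with a digest field; the file names are hypothetical.

```python
import json

def digests(path):
    """Collect the set of payload digests recorded in a CDXJ index file."""
    seen = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split(" ", 2)  # "<surt> <timestamp> <json>"
            if len(parts) == 3 and parts[2].startswith("{"):
                digest = json.loads(parts[2]).get("digest")
                if digest:
                    seen.add(digest)
    return seen

a = digests("archive-a.cdxj")  # hypothetical index files
b = digests("archive-b.cdxj")
print(f"{len(a & b)} shared digests, {len(a - b)} only in A, {len(b - a)} only in B")
```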

The summary file format

The file consists of JSON lines prefixed by the domain name. This is inspired by the CDXJ format and simplifies using Unix tools such as sort or join on the summary files. In the JSON part, there is a key for each year and, inside each year, keys for each (simplified) MIME type giving the number of mementos (n_) and their sizes (s_):

host.tld {"year": {"n_html": A, …, "s_html": B}}

A sample entry could be:

bnl.lu {"2003": {"n_audio": 0, "n_css": 8, "n_font": 0, "n_html": 639, "n_http": 728, "n_https": 0, "n_image": 44, "n_js": 0, "n_json": 0, "n_other": 7, "n_pdf": 30, "n_total": 728, "n_video": 0, "s_audio": 0, "s_css": 5268, "s_font": 0, "s_html": 1295481, "s_http": 4680354, "s_https": 0, "s_image": 295235, "s_js": 0, "s_json": 0, "s_other": 13156, "s_pdf": 3071214, "s_total": 4680354, "s_video": 0}}
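
Because each line is just a domain followed by a JSON object, parsing one in Python is a matter of splitting on the first space; a minimal sketch:

```python
import json

line = 'bnl.lu {"2003": {"n_html": 639, "s_html": 1295481, "n_total": 728, "s_total": 4680354}}'

# Everything before the first space is the domain, the rest is plain JSON
domain, js = line.split(" ", 1)
stats = json.loads(js)

print(domain, "held", stats["2003"]["n_html"], "HTML mementos in 2003")
```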

The MIME types are simplified according to the following rules:

MIME type(s)                                                        Category  Rationale
text/html, application/xhtml+xml, text/plain                        HTML      these are counted as "web pages" by the Internet Archive
text/css                                                            CSS       interesting for changing usage in formatting pages
image/*                                                             IMAGE     all image types are grouped together
application/pdf                                                     PDF       interesting independently, although IA groups PDFs under "web pages" too
video/*                                                             VIDEO     all video types
audio/*                                                             AUDIO     all audio types
application/javascript, text/javascript, application/x-javascript   JS        these three MIME types are common for JavaScript
application/json, text/json                                         JSON      relatively common and indicates dynamic pages
font/*, application/vnd.ms-fontobject, application/font, application/x-font*  FONT  usage of custom fonts

How do I generate the summary file for my web archive?

As the name cdx-summarize implies, the programs only need access to the CDX(J) files, not the underlying WARC files. Just run cdx-summarize.py --compact *.cdx > mywebarchive.summary in your CDX directory and it will do the summarization.
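
Conceptually, the summarization performs an aggregation like the following. This is a much-simplified sketch, not the actual tool: it assumes pywb-style CDXJ input with mime and length fields and implements only a part of the MIME table above.

```python
import glob
import json
from collections import defaultdict

def bucket(mime):
    # Illustrative subset of the MIME simplification table above
    if mime in ("text/html", "application/xhtml+xml", "text/plain"):
        return "html"
    for prefix, name in (("image/", "image"), ("video/", "video"),
                         ("audio/", "audio"), ("font/", "font")):
        if mime.startswith(prefix):
            return name
    return "pdf" if mime == "application/pdf" else "other"

# domain -> year -> counters ("n_*" and "s_*" keys)
summary = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

for path in glob.glob("*.cdx"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split(" ", 2)  # "<surt> <timestamp> <json>"
            if len(parts) != 3 or not parts[2].startswith("{"):
                continue  # skip headers and non-CDXJ lines
            surt, timestamp, js = parts
            rec = json.loads(js)
            labels = surt.split(")", 1)[0].split(",")
            domain = ".".join(reversed(labels[:2]))  # "lu,bnl,www" -> "bnl.lu"
            year, b = timestamp[:4], bucket(rec.get("mime", ""))
            summary[domain][year]["n_" + b] += 1
            summary[domain][year]["s_" + b] += int(rec.get("length", 0))

for domain in sorted(summary):
    print(domain, json.dumps(summary[domain], sort_keys=True))
```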

If you are using the warc-indexer from the British Library and have a backend with a Solr index, it is even simpler, since there is a version contributed by Toke Eskildsen that pulls the data directly and efficiently from Solr (https://github.com/ymaurer/cdx-summarize-warc-indexer). All types of CDXJ files should be supported, and different encodings are supported as well.
