Quebec Websites: A Decade of Harvesting

This year Bibliothèque et Archives nationales du Québec (BAnQ) celebrates its 10th anniversary of archiving Québec websites. We are delighted to announce that BAnQ will be hosting the next IIPC General Assembly and Web Archiving Conference, to be held on 11–13 May 2020.

By Martine Renaud, Librarian, Legal Deposit and Acquisitions Department at Bibliothèque et Archives nationales du Québec 

About Bibliothèque et Archives nationales du Québec
At once national library, national archives and public library of a major metropolitan city, Bibliothèque et Archives nationales du Québec (BAnQ) brings together, preserves and promotes heritage materials from or related to Quebec.

Context 
In 2009, after several years of work and reflection, BAnQ began to harvest and archive Québec websites. As discussed in an article on BAnQ’s blog (in French), these heritage materials are often volatile and ephemeral. Harvests were initially carried out as part of a pilot project.

BAnQ takes a selective approach to Web harvesting. Several factors make it difficult to harvest the Quebec Web thoroughly: the size of the body of materials to be collected, given BAnQ’s limited resources; legal constraints, namely the requirement to obtain a license from the website producer or other copyright owners before making an archived site accessible; and context, since Quebec does not have its own top-level domain.

In the news in 2009 
The reach of these first harvests was modest: about 25 government organizations, chiefly ministries.

Looking at the sites collected in 2009, what do we see? Obviously, they reflect what was topical at the time. In 2009, much attention was paid to the influenza pandemic. Does anyone still remember the infamous H1N1 virus? A major vaccination campaign was underway during the winter of 2009, and the Quebec government had a website dedicated to the topic:

The Pandémie influenza website, which no longer exists.

On the Quebec Ministry of Finance website, a number of documents dealt with the effects of the 2008 global financial crisis on Quebec’s economy:

Quebec Ministry of Finance website, 2009.

Still in the news today
While the flu pandemic and the economic crisis are presumably behind us, some news items from 2009 are still topical today. In 2009, reports submitted as part of the Bouchard-Taylor Consultation Commission on Accommodation Practices Related to Cultural Differences were available on the Commission website:

Website of the Bouchard-Taylor Consultation Commission, which no longer exists.

The website is not online anymore, and yet cultural differences and accommodations are still in the news today.

The Quebec National Assembly
Quebec’s National Assembly website also provides interesting historical perspectives. It includes a page dedicated to Quebec’s current Premier, François Legault, who at the time was simply an elected member of the Parti Québécois. He now leads the Coalition Avenir Québec, a party he co-founded in 2011.

Quebec Web harvests since 2009
Ten years later, harvests have become more numerous. They are broader in scope and much more diverse, with BAnQ’s reach now extending beyond government websites.

The following table compares the 2009 harvests and those carried out as of March 1, 2019:

                                                         2009          2009–2019
Number of harvests                                       16            12,823
Number of organizations whose website is made available  25            1,295
Documents harvested                                      17,026,257    149,647,697
Total size of archives (terabytes)                       0.90          31

It is interesting to see how the use of images, audio, and video materials has increased:

                                          2009                       2009–2019
Type of document harvested                Number        Size (GB)    Number         Size (GB)
HTML pages                                15,073,735    306          122,146,682    4,967
Images                                    1,275,183     49           18,159,220     1,454
Applications (PDF, Word, Excel, etc.)     644,117       526          5,702,995      3,695
Video materials                           17,009        19           1,309,413      20,288
Audio materials                           7,458         4            79,660         320
Other                                     8,755         0.01         2,249,727      235

Proliferating application formats are a major challenge for institutions that harvest websites. For its crawls, BAnQ relies on the Heritrix crawler.
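As a rough illustration of how figures like those in the table above can be produced, the sketch below tallies harvested documents by MIME type from a WARC file generated by a crawl. It uses the open-source warcio library; the file name is a placeholder, and real reporting would group MIME types into broader categories.

```python
# Minimal sketch: tally harvested documents by MIME type from a WARC file.
# "harvest.warc.gz" is a placeholder name for a file produced by a crawl.
from collections import Counter

from warcio.archiveiterator import ArchiveIterator

counts = Counter()
with open("harvest.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # Only HTTP responses carry harvested documents.
        if record.rec_type != "response" or record.http_headers is None:
            continue
        mime = record.http_headers.get_header("Content-Type", "unknown")
        counts[mime.split(";")[0].strip()] += 1  # drop charset parameters

for mime, n in counts.most_common(10):
    print(f"{mime}\t{n}")
```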

Contents to explore and to work with
Since 2009, harvests have progressively widened in scope. They now provide a number of corpora of interest to researchers, particularly in the digital humanities field. Websites dealing with the Quebec provincial elections of 2012, 2014 and 2018 have been harvested (major parties, political blogs, news sites, etc.). The municipal elections of 2013 and 2017 have also been covered. In addition, we harvest what are known as “thematic” (i.e. non-governmental) sites: cultural organizations (museums, libraries, and archives), community organizations, professional associations, regional newspapers, and so on.

Websites harvested by BAnQ can be accessed through a dedicated online interface. Interested researchers may also access the data directly on request.

Web Archiving at the National Library of Ireland

National Library of Ireland Reading Room © National Library of Ireland.

The National Library of Ireland (NLI) has a long-standing tradition of collecting, preserving and making accessible the published and printed output of Ireland. The library is over 140 years old, and we now also hold rich digital collections concerning the political, cultural and creative life of Ireland. The NLI has been archiving the Irish web on a selective basis since 2011. We have over 17 TB of data in the selective web archive, openly available for research through our website. A particular strength of our web archive is its coverage of Irish politics, including a representation of every election and referendum since 2011. No longer in its infancy, the web archive has seen some exciting developments in recent years. This year we began working with the Internet Archive on our selective web archive and are looking forward to the new opportunities that this partnership will bring. We have also begun working closely with an academic researcher from a higher education institution in Ireland, who is carrying out network analysis on a portion of our selective data.

In 2007 and 2017, the NLI undertook domain crawling projects, and there are now over 43 TB of data archived from these crawls. The National Library of Ireland is a legal deposit library, entitling it to a copy of everything published in Ireland. However, unlike in many other European countries, legal deposit legislation does not currently extend to online material, so we cannot make these crawls available. Despite these barriers, the library remains committed to preserving the online story of Ireland in whatever way we can.

Revisions to the legislation are currently before the Irish parliament and, if passed, will extend legal deposit to e-publications such as e-books and journals. The addition of websites to that list is currently being considered.

In 2017, the National Library of Ireland became a member of the IIPC, and we are excited to be attending our first General Assembly in Wellington. While we had anticipated talking about our newly available domain web archive portal and how it had affected our selective crawls, we look forward to discussing the challenges we continue to face, including legal deposit, and how we are developing the web archive as a whole. We also hope to be able to report on progress with the legislative framework. We look forward to seeing you in Wellington!

How Can We Use Web Archives? A Brief Overview of WARP and How It Is Used

By Naotoshi Maeda, National Diet Library of Japan

The use of web archives has recently become a hot topic in the web archiving community. At the 14th iPRES, held in Kyoto, the National Diet Library of Japan (NDL) took part in several sessions and presented examples of how web archives can be used. Here, I share the poster and revisit those topics.

Fig. 1: The poster on web archive use cases presented at iPRES 2017 (PDF)

Overview of WARP

Since 2002, the NDL has been operating a web archive called WARP, which harvests websites under two different frameworks: Japan’s legal deposit system, and permission from the copyright holder. The National Diet Library Law allows the NDL to harvest the websites of public agencies, including those of the national government, municipal governments, public universities, and independent administrative agencies. Legal deposit does not, however, cover the websites of private organizations, so the NDL needs to obtain permission from the copyright holder beforehand. At present, WARP holds roughly 1 petabyte of data, comprising 5 billion files from 130,000 captures.

Fig. 2: Statistics of archived content in WARP

85% of the archived websites can be accessed via the Internet with the permission of the rights holders, and WARP provides a variety of search methods, including by URL, full text, metadata, and category.

WARP uses standard web archiving technologies, such as Heritrix for web-crawling, WARC file format for storage, OpenWayback for playback, and Apache Lucene Solr for full text search.
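To give one concrete detail of how such a stack fits together, here is a minimal sketch of the lookup step behind wayback-style playback: scanning a CDX-style capture index for entries matching a URL. The field order shown follows one common CDX layout and the file name is a placeholder; WARP’s internal indexes may well differ.

```python
# Minimal sketch: look up captures of a URL in a CDX-style index file.
# The field order follows one common CDX layout; "index.cdx" is a placeholder.
FIELDS = ["urlkey", "timestamp", "original", "mimetype",
          "statuscode", "digest", "length", "offset", "filename"]

def captures_of(cdx_path, target_url):
    """Yield (timestamp, WARC file, offset) for captures of target_url."""
    with open(cdx_path) as f:
        for line in f:
            rec = dict(zip(FIELDS, line.split()))
            if rec.get("original") == target_url:
                yield rec["timestamp"], rec["filename"], rec["offset"]

for ts, warc, offset in captures_of("index.cdx", "http://www.example.go.jp/"):
    print(f"capture at {ts} stored in {warc} at byte offset {offset}")
```

A playback tool such as OpenWayback resolves a timestamped request to the closest such capture, then reads the record from the named WARC file at the stored offset.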

Linking from live websites

Given this background, here I show some examples of how WARP can be used.

The first use case is linking from live websites. As mentioned above, WARP comprehensively harvests and archives the websites of public agencies under the legal deposit system. A significant quantity of content is posted, updated, and deleted on these websites every day. Many of these agencies use WARP as a backup: before deleting content from their websites, they add a link to the copy archived in WARP. This keeps the content seamlessly available while also reducing the operating costs of their own web servers.
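As a rough sketch of what this hand-off can look like on the agency side, the snippet below rewrites links that are about to go dead so that they point at an archived copy instead. The archive URL template is hypothetical, not WARP’s actual address scheme.

```python
# Minimal sketch: point soon-to-be-deleted links at their archived copies.
# ARCHIVE_TEMPLATE is a hypothetical pattern, not WARP's real URL scheme.
from bs4 import BeautifulSoup

ARCHIVE_TEMPLATE = "https://warp.example.jp/archive/{original}"

def rewrite_dead_links(html, dead_urls):
    """Replace links to pages slated for deletion with archive links."""
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        if a["href"] in dead_urls:
            a["href"] = ARCHIVE_TEMPLATE.format(original=a["href"])
    return str(soup)

page = '<a href="https://www.example.go.jp/report2009.html">2009 report</a>'
print(rewrite_dead_links(page, {"https://www.example.go.jp/report2009.html"}))
```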

Fig. 3: Linking from live websites to WARP.

Analysis and Visualization

The graphs below present the results of some analyses of content archived in WARP. The first, a circular graph, illustrates link relations between websites in Japan’s 47 prefectures, thereby showing the extent of their interconnection on the Web. The second graph shows the percentage of URLs on national government websites that were still live in 2015, and indicates that 60% of the URLs that existed in 2010 returned 404 errors in 2015. The third, a bubble chart, shows the relative size of the data accumulated from each of the 10,000 websites archived in WARP, so you can see at a glance which websites are archived and how much data each represents.

Fig. 4: Link relations between websites in Japan’s 47 prefectures.
Fig. 5: The percentage of URLs on national government websites that were still live in 2015.
Fig. 6: Relative size of data accumulated from each of the 10,000 websites archived in WARP.
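For a sense of how a figure like Fig. 5 might be computed, here is a minimal sketch that checks a list of previously captured URLs against the live web and reports the share that still resolve. The URL list is a placeholder, and a real analysis at this scale would need batching, politeness delays, and fuller error handling.

```python
# Minimal sketch: measure link rot for a set of previously captured URLs.
# The URL list is a placeholder; a real run would throttle and batch requests.
import requests

def live_ratio(urls, timeout=10):
    """Return the fraction of URLs that still answer without a 404."""
    live = 0
    for url in urls:
        try:
            r = requests.head(url, timeout=timeout, allow_redirects=True)
            if r.status_code != 404:
                live += 1
        except requests.RequestException:
            pass  # unreachable hosts count as dead
    return live / len(urls)

urls_2010 = ["https://www.example.go.jp/old-page.html"]  # placeholder list
print(f"{live_ratio(urls_2010):.0%} of the 2010 URLs are still live")
```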

Curation

The next use case shows how WARP can be used for curation. Curators can use a variety of search methods to find content of interest archived in WARP, but it is not easy for them to gauge the full extent of the archived content. The NDL curates archived content on a variety of subjects and provides visual representations that can lead to unexpected discoveries. Here are two examples: a search by region for the obsolete websites of defunct municipalities, and the 3D wall for the collection on the Great East Japan Earthquake of 2011.

Fig.7: Search by region for obsolete websites of defunct municipalities.
Fig. 8: 3D wall for the collection of the Great East Japan Earthquake in 2011.

Extracting PDF Documents

The fourth use case is extracting PDF documents. The websites archived in WARP contain many PDF files of books and periodical articles. We search for these online publications and add metadata to those considered significant. The PDF files with metadata are then stored in the “NDL Digital Collections” as the “Online Publications” collection. Furthermore, the metadata are harvested via OAI-PMH by “NDL Search”, an integrated search service covering the catalogs of libraries, archives, museums, and academic institutions in Japan, so that users can find the PDF files using conventional search methods. 1,400,000 PDF files cataloged in 1,000,000 records are already available online. The NDL launched a new OPAC in January 2018, which implements a similar integrated search.
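OAI-PMH itself is a simple HTTP-and-XML protocol, so a harvesting client fits in a few lines. The sketch below fetches the first page of records from a repository and prints their Dublin Core titles; the endpoint URL is a placeholder, not NDL Search’s actual address.

```python
# Minimal OAI-PMH harvesting sketch: list record titles from a repository.
# The endpoint URL below is a placeholder, not NDL Search's actual address.
import requests
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def list_titles(endpoint, metadata_prefix="oai_dc"):
    """Yield Dublin Core titles from the first page of ListRecords."""
    resp = requests.get(endpoint, params={"verb": "ListRecords",
                                          "metadataPrefix": metadata_prefix})
    root = ET.fromstring(resp.content)
    for record in root.iter(f"{OAI}record"):
        title = record.find(f".//{DC}title")
        if title is not None:
            yield title.text

for t in list_titles("https://example.ndl.go.jp/oai"):  # placeholder endpoint
    print(t)
```

A full harvester would also follow the resumptionToken returned with each page until the list is exhausted.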

Fig. 9: Extracting PDF documents from archived websites.

Future challenges

I want to conclude this post with a short discussion of future challenges, which have also been actively discussed among IIPC members.

Web archives have tremendous potential for use in big data analysis, which could be used to uncover how human history has been recorded in cyberspace. The NDL also needs to study how to make data sets suitable for data mining and how to promote engagement with researchers.

Another challenge is the development of even more robust search engine technology. WARP provides full-text search with Apache Lucene Solr and has already indexed 2.5 billion files, with indexes totaling 17 terabytes. But we are not satisfied with the search results, which contain duplicate material archived at different times, along with other noise. We need to develop a robust and accurate search engine specialized for web archives, one that makes use of temporal elements.
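One common mitigation for the duplicate problem is Solr’s result grouping, which collapses hits that share a field value, such as the original URL. The sketch below shows what such a query might look like; the core name ("warp") and the field names ("content", "url", "capture_date") are assumptions for illustration, not WARP’s actual schema.

```python
# Minimal sketch: a deduplicated full-text query using Solr result grouping.
# Core name and field names are illustrative assumptions, not WARP's schema.
import requests

params = {
    "q": "content:earthquake",          # full-text query on an assumed field
    "group": "true",                    # enable Solr result grouping
    "group.field": "url",               # collapse captures of the same URL
    "group.limit": 1,                   # keep one capture per URL
    "group.sort": "capture_date desc",  # prefer the newest capture
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/warp/select", params=params)
for group in resp.json()["grouped"]["url"]["groups"]:
    doc = group["doclist"]["docs"][0]
    print(doc.get("url"), doc.get("capture_date"))
```

Grouping removes exact URL duplicates from the result list, but the temporal ranking the post calls for, weighing when a page was captured as well as how well it matches, remains an open problem.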