Quebec Websites: A Decade of Harvesting

This year, Bibliothèque et Archives nationales du Québec (BAnQ) celebrates the 10th anniversary of its archiving of Québec websites. We are delighted to announce that BAnQ will be hosting the next IIPC General Assembly and Web Archiving Conference. The events will be held on 11-13 May 2020.


By Martine Renaud, Librarian, Legal Deposit and Acquisitions Department at Bibliothèque et Archives nationales du Québec 

About Bibliothèque et Archives nationales du Québec
At once national library, national archives and public library of a major metropolitan city, Bibliothèque et Archives nationales du Québec (BAnQ) brings together, preserves and promotes heritage materials from or related to Quebec.

Context 
In 2009, after several years of work and reflection, BAnQ began to harvest and archive Québec websites. As discussed in an article on BAnQ’s blog (in French), these heritage materials are often volatile and ephemeral. Harvests were initially carried out as part of a pilot project.

BAnQ takes a selective approach to Web harvesting. Several factors make it difficult to harvest the Quebec Web exhaustively: the size of the body of materials to be collected relative to BAnQ’s limited resources; legal constraints, namely the requirement to obtain a license from the Web producer or other copyright owners before making their site accessible; and context, because Quebec does not have its own domain name.

In the news in 2009 
The reach of these first harvests was modest: about 25 government organizations, chiefly ministries.

Looking at the sites collected in 2009, what do we see? Obviously, they reflect what was topical at the time. In 2009, much attention was paid to the influenza epidemic. Does anyone still remember the infamous H1N1 virus? A major vaccination campaign was underway during the winter of 2009, and the Quebec government had a website dedicated to this topic:

The Pandémie influenza website, which is no longer in existence. 

On the Quebec Ministry of Finance website, a number of documents dealt with the effects of the 2008 global financial crisis on Quebec’s economy:

Quebec Ministry of Finance website, 2009.

Still in the news today
While the flu pandemic and the economic crisis are presumably behind us, some news items from 2009 are still topical today. In 2009, reports submitted as part of the Bouchard-Taylor Consultation Commission on Accommodation Practices Related to Cultural Differences were available on the Commission website:

Website of the Bouchard-Taylor Consultation Commission, which no longer exists.

The website is not online anymore, and yet cultural differences and accommodations are still in the news today.

The Quebec National Assembly
The 2009 harvest of Quebec’s National Assembly website also provides interesting historical perspectives. It includes a page dedicated to Quebec’s current Premier, François Legault, who at the time was simply an elected member of the Parti Québécois. Now Premier, he leads the Coalition Avenir Québec, a party he co-founded in 2011.

Quebec Web harvests since 2009
Ten years later, harvests have become more numerous. They are broader in scope and much more diverse, with BAnQ’s reach now extending beyond government websites.

The following table compares the 2009 harvests with the cumulative totals for all harvests carried out from 2009 to March 1, 2019:

| | 2009 | 2009-2019 |
| Number of harvests | 16 | 12,823 |
| Number of organizations whose website is made available | 25 | 1,295 |
| Documents harvested | 17,026,257 | 149,647,697 |
| Total size of archives (terabytes) | 0.90 | 31 |

It is interesting to see how the use of images, and audio and video materials, has increased:

| Type of documents harvested | Number (2009) | Size in GB (2009) | Number (2009-2019) | Size in GB (2009-2019) |
| HTML pages | 15,073,735 | 306 | 122,146,682 | 4,967 |
| Images | 1,275,183 | 49 | 18,159,220 | 1,454 |
| Applications (PDF, Word, Excel, etc.) | 644,117 | 526 | 5,702,995 | 3,695 |
| Video materials | 17,009 | 19 | 1,309,413 | 20,288 |
| Audio materials | 7,458 | 4 | 79,660 | 320 |
| Other | 8,755 | 0.01 | 2,249,727 | 235 |
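
To make the shift easier to see, the sizes in the table can be converted into shares of the total archive. The short Python sketch below is only an illustration: the figures are copied from the table, while the names `SIZES_GB` and `size_shares` are hypothetical and not part of any BAnQ tool.

```python
# Sizes in gigabytes per document type, copied from the table above:
# (size in 2009, cumulative size 2009-2019).
SIZES_GB = {
    "HTML pages": (306, 4967),
    "Images": (49, 1454),
    "Applications": (526, 3695),
    "Video materials": (19, 20288),
    "Audio materials": (4, 320),
    "Other": (0.01, 235),
}


def size_shares(sizes, column):
    """Return each type's percentage of the total size.

    column 0 is the 2009 figure, column 1 the 2009-2019 figure.
    """
    total = sum(values[column] for values in sizes.values())
    return {name: 100 * values[column] / total for name, values in sizes.items()}


shares_2009 = size_shares(SIZES_GB, 0)
shares_all = size_shares(SIZES_GB, 1)

# Print one line per document type with both percentages side by side.
for name in SIZES_GB:
    print(f"{name:<16} {shares_2009[name]:5.1f}%  {shares_all[name]:5.1f}%")
```

Read this way, video materials go from about 2% of the total size in 2009 to roughly two thirds of the cumulative total, while applications (PDF, Word, Excel, etc.) were the largest category by size in 2009.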

Proliferating applications are a major challenge for institutions that harvest websites. For its harvests, BAnQ relies on Heritrix, the open-source web crawler developed by the Internet Archive.

Contents to explore and to work with
Since 2009, harvests have progressively widened their scope. They now provide a number of corpora of interest to researchers, particularly in the digital humanities. Websites dealing with the Quebec provincial elections of 2012, 2014 and 2018 have been harvested (major parties, political blogs, news sites, etc.). The municipal elections of 2013 and 2017 have also been covered. In addition, we harvest what are known as “thematic” (i.e. non-governmental) sites: cultural organizations (museums, libraries, and archives), community organizations, professional associations, regional newspapers, and so on.

Websites harvested by BAnQ can be accessed through an interface. Interested researchers may also access the data directly on request.