Quebec Websites: A Decade of Harvesting

This year Bibliothèque et Archives nationales du Québec (BAnQ) celebrates its 10th anniversary of archiving Québec websites. We are delighted to announce that BAnQ will be hosting the next IIPC General Assembly and Web Archiving Conference. The events will be held on 11-13 May 2020.


By Martine Renaud, Librarian, Legal Deposit and Acquisitions Department at Bibliothèque et Archives nationales du Québec 

About Bibliothèque et Archives nationales du Québec
At once national library, national archives and public library of a major metropolitan city, Bibliothèque et Archives nationales du Québec (BAnQ) brings together, preserves and promotes heritage materials from or related to Quebec.

Context 
In 2009, after several years of work and reflection, BAnQ began to harvest and archive Québec websites. As discussed in an article on BAnQ’s blog (in French), these heritage materials are often volatile and ephemeral. Harvests were initially carried out as part of a pilot project.

BAnQ takes a selective approach to Web harvesting. Several factors make it difficult to harvest the Quebec Web exhaustively: the size of the body of materials to be collected, given BAnQ’s limited resources; legal constraints, namely the requirement to obtain a license from the Web producer or other copyright owners granting permission to make their site accessible; and context, because Quebec does not have its own domain name.

In the news in 2009 
The reach of these first harvests was modest: about 25 government organizations, chiefly ministries.

Looking at the sites collected in 2009, what do we see? Obviously, they reflect what was topical at the time. In 2009, much attention was paid to the influenza epidemic. Does anyone still remember the infamous H1N1 virus? A major vaccination campaign was underway during the winter of 2009, and the Quebec government had a website dedicated to this topic:

The Pandémie influenza website, which is no longer in existence. 

On the Quebec Ministry of Finance website, a number of documents dealt with the effects of the 2008 global financial crisis on Quebec’s economy:

Quebec Ministry of Finance website, 2009.

Still in the news today
While the flu pandemic and the economic crisis are presumably behind us, some news items from 2009 are still topical today. In 2009, reports submitted as part of the Bouchard-Taylor Consultation Commission on Accommodation Practices Related to Cultural Differences were available on the Commission website:

Website of the Bouchard-Taylor Consultation Commission, which no longer exists.

The website is not online anymore, and yet cultural differences and accommodations are still in the news today.

The Quebec National Assembly
Quebec’s National Assembly website also provides interesting historical perspectives. It includes a page dedicated to Quebec’s current Premier, François Legault, who at the time was simply an elected member of the Parti Québécois. As Premier, he is now leader of the Coalition Avenir Québec, a party he co-founded in 2011.

Quebec Web harvests since 2009
Ten years later, harvests have become more numerous. They are broader in scope and much more diverse, with BAnQ’s reach now extending beyond government websites.

The following table compares the 2009 harvests and those carried out as of March 1, 2019:

                                                                 2009    2009-2019
Number of harvests                                                 16       12,823
Number of organizations whose website is made available            25        1,295
Documents harvested                                        17,026,257  149,647,697
Total size of archives (terabytes)                               0.90           31

It is interesting to see how the use of images, and audio and video materials, has increased:

Type of documents harvested              Number (2009)  Size in GB (2009)  Number (2009-2019)  Size in GB (2009-2019)
HTML pages                                  15,073,735                306         122,146,682                   4,967
Images                                       1,275,183                 49          18,159,220                   1,454
Applications (PDF, Word, Excel, etc.)          644,117                526           5,702,995                   3,695
Video materials                                 17,009                 19           1,309,413                  20,288
Audio materials                                  7,458                  4              79,660                     320
Other                                            8,755               0.01           2,249,727                     235
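
As a rough illustration of the shift described above, the following sketch recomputes each document type's share of the total archive size from the figures in the table. The dictionary and function names are illustrative only and are not part of any BAnQ tooling; the sizes are simply copied from the table.

```python
# Sizes in GB copied from the table above: (2009, 2009-2019).
sizes_gb = {
    "HTML pages": (306, 4967),
    "Images": (49, 1454),
    "Applications": (526, 3695),
    "Video materials": (19, 20288),
    "Audio materials": (4, 320),
    "Other": (0.01, 235),
}


def shares(column):
    """Return each document type's share of the total size, in percent."""
    total = sum(row[column] for row in sizes_gb.values())
    return {kind: 100 * row[column] / total for kind, row in sizes_gb.items()}


shares_2009 = shares(0)  # first column: harvests carried out in 2009
shares_all = shares(1)   # second column: cumulative harvests, 2009-2019

for kind in sizes_gb:
    print(f"{kind:<16} {shares_2009[kind]:5.1f}% {shares_all[kind]:5.1f}%")
```

With these figures, video materials go from about 2% of the volume harvested in 2009 to about 66% of the cumulative 2009-2019 volume, while HTML pages fall from roughly a third of the volume to about a sixth.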

Proliferating applications are a major challenge for institutions that harvest websites. BAnQ relies on Heritrix.

Contents to explore and to work with
Since 2009, harvests have progressively widened their scope. They now provide a number of corpuses of interest to researchers, particularly in the digital humanities field. Websites dealing with Quebec provincial elections in 2012, 2014 and 2018 have been harvested (major parties, political blogs, news sites, etc.). The municipal elections of 2013 and 2017 have also been covered. In addition, we harvest what are known as “thematic” (i.e. non-governmental) sites: cultural organizations (museums, libraries, and archives), community organizations, professional associations, regional newspapers, and so on.

Websites harvested by BAnQ can be accessed through an interface. Interested researchers may also access the data directly on request.

The WARC file format celebrates its 10th anniversary

By Sara Aubry, Web Archiving Project Manager at BnF

The WARC format is our Web ARChives format. It defines a way of combining digital resources into an aggregate archival file along with related metadata. Today it is commonly used to store web crawls. For newcomers, a WARC file is made of one or more records. Each record consists of a header followed by a content block. The header has mandatory named fields that document, for instance, the URI, the date, the type and the length of the record. The content block may contain resources in any format, such as an HTML page, a binary image or a video file. WARC is an extension of the ARC file format designed by the Internet Archive in 1996. The WARC format was initially released as an ISO international standard 10 years ago, in May 2009, under the number 28500:2009 (we also call it WARC version 1.0). The standardization opened the path to wider use and implementation in a variety of applications for harvesting, accessing, mining, exchanging and preserving digital resources. While it is the only standard format for web archives, it has also been adopted beyond the web archiving community to store born-digital or digitized materials.
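
To make the record structure described above more concrete, here is a minimal sketch, in Python, that builds one uncompressed WARC 1.0 record in memory and reads it back. It is a simplification under stated assumptions, not a reference implementation: the helper names (build_record, parse_record), the example URI and the record ID are invented for illustration, and real WARC files are usually compressed and should be read with an established WARC library rather than with code like this.

```python
def build_record(uri, date, record_type, record_id, block):
    """Assemble a single uncompressed WARC 1.0 record (simplified sketch)."""
    header_lines = [
        "WARC/1.0",                        # version line
        f"WARC-Type: {record_type}",       # the type of the record
        f"WARC-Record-ID: <{record_id}>",  # unique identifier of the record
        f"WARC-Date: {date}",              # the date of the record
        f"WARC-Target-URI: {uri}",         # the URI the content came from
        f"Content-Length: {len(block)}",   # length of the content block in bytes
    ]
    header = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # A record is: header, blank line, content block, two CRLF pairs.
    return header + block + b"\r\n\r\n"


def parse_record(data):
    """Split a record produced by build_record into version, fields, block."""
    header, _, rest = data.partition(b"\r\n\r\n")
    lines = header.decode("utf-8").split("\r\n")
    version = lines[0]
    fields = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    length = int(fields["Content-Length"])
    return version, fields, rest[:length]


# Hypothetical example values, for illustration only.
block = b"<html><body>Bonjour</body></html>"
record = build_record(
    "http://example.org/",
    "2009-05-01T12:00:00Z",
    "resource",
    "urn:uuid:12345678-1234-5678-1234-567812345678",
    block,
)
version, fields, content = parse_record(record)
print(version, fields["WARC-Type"], fields["WARC-Target-URI"], len(content))
```

The parser relies on Content-Length to cut the content block, which is why that field matters: the block can hold arbitrary binary data, so it cannot be delimited by scanning for a terminator.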

As with all ISO standards, the WARC standard is periodically reviewed to ensure that it continues to meet the changing needs that emerge from our practice. The first revision, supported by an IIPC task force and the subcommittee in charge of technical interoperability within the ISO information and documentation technical committee (ISO/TC46/SC4), was published in August 2017 as ISO 28500:2017 (it is also known as WARC version 1.1). This revision mainly introduced new named fields for deduplication and the possibility of more precise timestamps (see IIPC GitHub for more details).

During the last IIPC General Assembly, which took place in November 2018 in Wellington, we started to discuss possible evolutions for the second revision. The ISO vote, which is required to launch the revision process, is currently scheduled for 2022. Alex Osborne from the National Library of Australia challenged the format to support the HTTP/2 protocol. Ilya Kremer presented Rhizome’s current implementation for recording provenance headers to indicate that a record has been created from another record and not from the original URL. Ilya also presented the need to keep track of the dynamic history of a web page display. Exchanges have continued and are still active on IIPC GitHub and Slack (#warc channel). Current hot topics are how to keep track of media conversion (in particular for video and audio files) and how to reference a “transcluded” video or audio file from another page.

All these topics need time for raising awareness, in-depth discussions, shared testing and tool implementation within our community before they can be drafted and included in the standard. If you want to join the current discussions or raise any other topic, please join the IIPC #warc channel on Slack.

Contribute to CDG’s AI Collection!

By Tiiu Daniel, Web Archive Leading Specialist, National Library of Estonia

“Trurl” by Daniel Mróz, from The Cyberiad by Stanisław Lem (Wydawnictwo Literackie, Kraków, 1972). Illustration copyright © 1972 Daniel Mróz. Reprinted by permission.

After significant breakthroughs at the end of the 20th century and the beginning of the 21st, artificial intelligence (AI) has played a greater role in our daily lives. Although AI has a huge positive impact on a variety of fields such as manufacturing, healthcare, art, transportation, retail and so on, the use of new technologies also raises ethical issues as well as security risks. One critical and hotly debated issue is the impact of ongoing automation on labor markets, including changing educational requirements for jobs, job elimination, and various models for transitions.

The IIPC Content Development Group invites curators and web archivists around the world to contribute websites to a new “Artificial Intelligence” web collection.

The purpose of this collection is to bring together and record web content related to use of AI and its impact on any possible aspect of life, reflecting attitudes and thoughts towards it, future predictions etc.

The content can be in any language focusing on specific countries or cultures or have a global scope.

We especially welcome contributions from underrepresented countries, cultures, languages and other groups, or those countries without IIPC members. Curators currently building AI-related collections at their own institutions are welcome to contribute their seeds (matching the criteria below) to aid in the development of a collection with an international perspective.

The collection aims to cover the following subtopics:

  • Machine learning, natural language processing, robotics, automation;
  • AI in literature, visual arts (e.g. ceramics, drawing, painting, sculpture, design, photography, filmmaking, architecture) and performing arts (e.g. theater, public speech, dance, music etc.); AI in emerging art forms;
  • AI and law/legislation;
  • Social and economic impact (e.g. impact on behavior/interaction, bias in AI, unemployment, inequality, changes in labor markets);
  • Ethical issues (e.g. weaponization of AI, security, robot rights);
  • Future predictions/scenarios concerning AI.

Types of web content to include are personal forms such as blogs, forum posts, and artist websites; trend reports, statements, and analyses (i.e. from government agencies, NGOs, scientific or academic institutions, advocacy groups, businesses).

Time frame covered by content: from the 1990s onwards.

Out of scope are: full social media feeds and channels (Facebook, Twitter, Instagram, YouTube, WhatsApp), users’ video channels (YouTube, Vimeo), apps and other content which is difficult or impossible to crawl.

That said, if you locate individual social media posts of unique value, such as an Instagram post by a bot or a particularly relevant and ephemeral individual video, please submit them for consideration.

Nominations are welcomed using the following form.

The call for nominations will close on the 30th of June 2019. Crawls will be run during the summer of 2019. The collection will be made available at the end of 2019.

 For more information about this collection, contact Tiiu Daniel (tiiu.daniel[at]nlib.ee).


Lead-Curators of CDG Artificial Intelligence Collection
Tiiu Daniel, Web Archive Leading Specialist, National Library of Estonia
Liisi Esse, Ph.D., Associate Curator for Estonian and Baltic Studies, Stanford University Libraries
Rashi Joshi, Reference Librarian /Collections Specialist, Library of Congress

CDG Co-Chairs
Nicola Bingham, Lead Curator Web Archiving, British Library
Alex Thurman, Web Resources Collection Coordinator, Columbia University Libraries

Contribute to CDG’s Climate Change Collection!

By Kees Teszelszky, Curator Digital Collections, Koninklijke Bibliotheek – National Library of The Netherlands and Lead Curator, CDG Climate Change Collection

Climate change is one of the most urgent and hotly debated issues on the web in recent years. The IIPC Content Development Group is inviting all curators and web archivists from around the world to contribute websites to a new collaborative “Climate Change” collection.

Breiðamerkurlón, Iceland

In recent decades there has been strong evidence that the earth is experiencing rapid climate change, characterized by global temperature rise, warming oceans, shrinking ice sheets, glacial retreat, decreased snow cover, sea level rise, declining arctic sea ice, extreme weather events, and ocean acidification. Ninety-seven percent of climate scientists agree that these climate-warming trends over the past century are very likely due to human activities, and most of the leading scientific organizations worldwide have issued public statements endorsing this position (source: climate.nasa.gov/evidence). Global and local action to mitigate this crisis has been complicated by political, economic, technical, cultural, and religious debates.

Many people feel the urge to reflect on this topic on the web. We would like to take an international snapshot of born digital culture relating to documentation of and social debate on the challenging issue of climate change. You can contribute to this collection by nominating web content about any aspect of climate change, and the content can be focused on specific countries or cultures or have a global focus, and can be in any language.

We especially welcome contributions from underrepresented countries, cultures, languages and other groups, or those countries without IIPC members. Curators currently building climate change related collections at their own institutions are welcome to contribute their seeds (matching the criteria below) to help us build a collection with an international perspective.

Examples of subtopics might include climatology, climate change denial, climate refugees, religious reflections on climate change, etc. Eligible types of web content include organizational reports or statements (i.e. from government agencies, NGOs, scientific or academic institutions, advocacy groups, political parties/platforms, businesses, religious groups) or more personal forms such as blogs or artistic projects.

Out of scope are: social media feeds (Facebook, Twitter, Instagram, YouTube channels, WhatsApp), video (YouTube, Vimeo), apps and other content which is difficult or impossible to crawl.

Collecting seeds started on 1 April 2019 and more nominations can be added to this spreadsheet. Crawls will be run during the summer of 2019, to conclude shortly after the upcoming UN Climate Action Summit on 23 September 2019.

Organized by the IIPC and supported by web archivists around the world, the special web collection ‘Climate Change’ is one of the ways the IIPC helps raise awareness of the strategic, cultural and technological issues which make up the web archiving and digital preservation challenge.

For more information about this collection, contact Kees Teszelszky: kees.teszelszky[at]kb.nl

IIPC Content Development Group: 2019 collections

By Nicola Bingham, Lead Curator Web Archiving, British Library and Co-Chair of the Content Development Working Group

During 2019, the Content Development Group (CDG) will continue to work on several established collections: 

New for 2019, the CDG is undertaking a Climate Change Collection, led by Kees Teszelszky of the National Library of the Netherlands. The first crawl will take place before the General Assembly and the Web Archiving Conference in June, with a final crawl shortly after the next UN Climate summit in September. This collection has sparked a lot of interest on the CDG mailing list and many curators have expressed an interest in contributing.

We are also planning an Artificial Intelligence Collection, led by Tiiu Daniel of the National Library of Estonia, Liisi Esse of Stanford University Libraries and Rashi Joshi of Library of Congress. The details are still to be firmed up.

We are planning to crawl one of our collections, or a subset of a collection, in order that it can be used by researchers.

Results of the Steering Committee Election 2019

The following IIPC member organisations have been elected to serve for a period of three years starting on 1st of June 2019 –

On behalf of the membership I would like to thank all of those who have taken part in this election.

IIPC PCO

 

IIPC Steering Committee Election 2019: nomination statements

The Steering Committee is the executive body of the IIPC, currently comprising 15 member organisations. This year five seats are up for election/re-election. In response to the call for nominations to serve on the IIPC Steering Committee for a three-year term commencing 1 June 2019, seven IIPC member organisations have put themselves forward:

An election will be held from 3 March to 31 March. The IIPC designated representatives from all member organisations will receive an email with instructions on how to vote. Each member will be asked to cast five votes. The representatives should ensure that they read all the nomination statements before casting their votes. The results of the vote will be announced on the Netpreserve blog and Members mailing list on 1 April. The first Steering Committee meeting will be held before the General Assembly in Zagreb, on 4 June.

If you have any questions, please contact the IIPC Programme and Communications Officer.


Nomination statements in alphabetical order:

Deutsche Nationalbibliothek / German National Library

As a member of the IIPC since 2007, the German National Library has always been particularly interested in preservation aspects and the representative Tobias Steinke is co-lead of the Preservation Working Group. The selective web archive of the German National Library started in 2012. Its workflow is based on a co-operation with the service provider oia and does not include the common open source tools, which could give the IIPC a different perspective and help to represent the various members.

 

Internet Archive

Internet Archive seeks to continue its role on the IIPC Steering Committee. As the oldest and largest publicly-available web archive in the world, a creator and ongoing developer of many of the core technologies used in web archiving, and an original founding member of the IIPC, Internet Archive plays a key role in advancing web archiving and fostering broad community participation in preserving and providing access to the web-published records that document our shared cultural heritage. Internet Archive has also served in a variety of leadership and program roles within the Steering Committee since IIPC’s formation. In continuing this active role on the IIPC Steering Committee, Internet Archive will contribute to furthering the IIPC’s strategic initiatives building a collaborative framework to advance web archiving and grow and diversify the IIPC’s membership. The web is the most significant communication platform of our era; it is also one that can only be preserved and made accessible through broad-based, multi-institutional efforts led by organizations such as the IIPC. By extending our role on the IIPC Steering Committee, Internet Archive will continue its participation in the knowledge-sharing and leadership that supports the IIPC and the broader community in its ongoing efforts to preserve the web.


 

Landsbókasafn Íslands – Háskólabókasafn / National and University Library of Iceland

The National and University Library of Iceland is interested in serving another term on the IIPC Steering Committee. The library has had an active web archiving effort for nearly two decades. Our participation in the IIPC has been instrumental in its success.

As one of the IIPC’s smaller members, we are keenly aware of the importance of collaboration to this specialized endeavor. The knowledge and tools that this community has given us access to are priceless.

We believe that in this community active engagement ultimately brings the greatest rewards. As such we have participated in projects, including Heritrix and OpenWayback. We have hosted IIPC events, including the 2016 GA/WAC and an upcoming hackathon in April. And we have provided leadership in various areas, including working groups and the SC chair (2008); our SC representative is currently in charge of the tools portfolio.

If re-elected to the SC, we will aim to continue on in the same spirit.


 

Library of Congress

The Library of Congress (LC) has been involved in web archiving for almost 20 years, building a variety of thematic and event-based collections for its web archives. LC has worked collaboratively with national and international organizations on collections, preservation tools and workflow processes, while developing in-house expertise and curatorial tools to enable effective collection and management of over 1.7 petabytes of web content collected to date. As a founding member of IIPC, LC has served in a variety of leadership roles, currently as SC member, Preservation WG and Training WG co-chair, and in prior years as SC Chair, Communications Officer, Content Development Group co-chair, and on the Membership Engagement portfolio, and helped secure a new fiscal agent. If re-elected, the LC looks forward to continuing to focus on developing a web archiving training program, encouraging new opportunities for membership engagement and funding opportunities for member projects. We will continue to participate in discussions around preservation, tools, and processes that will enable us all to work more efficiently and collaboratively as a community, and look forward to engaging in activities and discussions that will help strengthen the IIPC for the future and next membership agreement.


 

National Library of Australia

The National Library of Australia (NLA) was a founding IIPC member and Steering Committee member until 2009, hosting the second general committee meeting in Canberra in 2008. In 2004 the NLA organized the first major international conference on web archiving for cultural institutions. The NLA’s experience and leadership in web archiving goes back to 1996 with the establishment of PANDORA, one of the first collaborative web archiving programs.  The NLA has been a continuous IIPC member and has actively contributed expertise to the preservation working group.

The NLA’s strengths include experience in operational maturity, sustainability and open access through its web archiving program, which embraces selective, domain and bulk collecting methods. The NLA has a strong commitment to, and experience with, collaborative web archiving through PANDORA. The NLA has a demonstrated record of innovation, building the first selective web archiving workflow systems (PANDAS) and the recent ‘outbackCDX’ tool, which makes index management more efficient. In March 2019 the NLA launched the Australian Web Archive, which made the whole .au web archive fully accessible and openly searchable in Trove. The NLA believes it is time for Australia to rejoin the IIPC leadership, adding southern hemisphere representation and experience to the Steering Committee.


 

National Library of New Zealand / Te Puna Mātauranga o Aotearoa

National Library of New Zealand’s mandate to preserve New Zealand’s social and cultural history includes:

  1. A legal mandate to perform web harvests under the National Library of New Zealand Act 2003.
  2. A social responsibility to develop collections (including digital collections) reflecting the social, cultural, economic and other endeavours of New Zealanders.

The Library has a programme of selective web harvesting and has conducted eight whole of domain ‘snapshots’ since 2008. We are also experimenting with Twitter, focusing on hashtag crawls of major NZ events or activities considered culturally important (e.g. Kaikoura Earthquake, GE2017, Moko Kauae, Grace Millane, Te Matatini, Nelson Fires). The Library is also collaborating with the National Library of the Netherlands on the ongoing enhancement and development of the Web Curator Tool.

The National Library has been a continuous member of IIPC since 2007 and has previously been a member of the IIPC Steering Committee. Having recently appointed a dedicated web archiving role to the Library’s digital preservation team, we now feel that we are able to contribute more fully to the work of the IIPC, and we feel that membership of the IIPC Steering Committee is one of the ways that we can contribute.


 

Stanford University Libraries

We have concluded our three-year term on the Steering Committee and appreciate your consideration for serving another term. IIPC has progressed notably in these three years. Our private, member-focused GA has been eclipsed by an increasingly visible and rigorously-curated WAC. IIPC as an organization has befittingly matured as well, re-administering itself under CLIR’s fiscal sponsorship. These changes reflect opportunities to continue to evolve IIPC from its start as a largely inward-looking, homogeneous cadre of collaborating member institutions to a professionalized organization more keenly focused on the diversification of participating stakeholders and advancement of web archiving practice broadly.

We are interested in continuing to move IIPC in this direction, in keeping with the vision presented by Jefferson Bailey as outgoing Chair. As a consistent contributor to IIPC activities and goals, we can be counted on to “do the work.” Our tangible contributions to date include serving as Treasurer, serving as Training Working Group co-chair, chairing the 2017 WAC Program Committee, organizing and co-hosting the 2015 GA and WAC, and serving on every WAC Program Committee since 2015.