Discretionary Funding Program Launched by IIPC

By Jefferson Bailey, Internet Archive & IIPC Steering Committee

IIPC is excited to announce the launch of its Discretionary Funding Program (DFP) to support the collaborative activities of its members by providing funding to accelerate the preservation and accessibility of the web. Following the announcement to membership at the recent IIPC General Assembly in Zagreb, Croatia, the IIPC DFP aims to advance the development of tools, training, and practices that further the organization’s mission “to acquire, preserve and make accessible knowledge and information from the Internet for future generations everywhere, promoting global exchange and international relations.”

The inaugural DFP Call for Proposals will award funding according to an application process. Applications will be due on September 1, 2019 for one-year projects starting January 1, 2020 or July 1, 2020. The program will grant awards in three categories:

  • Seed Grants ($0 to $10,000) fund smaller, individual efforts, help smaller projects/events scale up, or support smaller-scope projects.
  • Development Grants ($10,000 to $25,000) fund efforts that require meaningful funding for event hosting, engineering, publications, project growth, etc.
  • Program Grants ($25,000 to $50,000) fund larger initiatives, either to launch new initiatives or to increase the impact and expansion of proven work or technologies.

The IIPC has earmarked a significant portion of its reserve funds and of income from member dues to support the joint work of its members through this program. Applications will be reviewed by a team of IIPC Steering Committee members as well as representatives from the broader IIPC membership. Our hope is that the IIPC DFP serves as a catalyst to promote grassroots, member-driven innovation and collaboration across the IIPC membership.

Please visit the IIPC DFP page (http://netpreserve.org/projects/funding/) for an overview of the application process, links to the application form and a FAQ page, and other details and contact information. We encourage all IIPC members to apply for DFP funding and to coordinate with their peer members on brainstorming programs to advance the field of web archiving. The DFP team intends to administer the program with the utmost equity and transparency and encourages any members with questions not answered by the online resources to post them on the dedicated IIPC Slack channel (#projects at http://iipc.slack.com) or to send them by email to projects[at]iipc.simplelists.com.

Quebec Websites: A Decade of Harvesting

This year Bibliothèque et Archives nationales du Québec (BAnQ) celebrates its 10th anniversary of archiving Québec websites. We are delighted to announce that BAnQ will be hosting the next IIPC General Assembly and Web Archiving Conference. The events will be held on 11-13 May 2020.


By Martine Renaud, Librarian, Legal Deposit and Acquisitions Department at Bibliothèque et Archives nationales du Québec 

About Bibliothèque et Archives nationales du Québec
At once national library, national archives and public library of a major metropolitan city, Bibliothèque et Archives nationales du Québec (BAnQ) brings together, preserves and promotes heritage materials from or related to Quebec.

Context 
In 2009, after several years of work and reflection, BAnQ began to harvest and archive Québec websites. As discussed in an article on BAnQ’s blog (in French), these heritage materials are often volatile and ephemeral. Harvests were initially carried out as part of a pilot project.

BAnQ takes a selective approach to Web harvesting. A number of factors make it difficult to harvest the Quebec Web thoroughly: the size of the body of materials to be collected, given BAnQ’s limited resources; legal constraints, i.e., the requirement to obtain a license from the Web producer or other copyright owners granting permission to make their site accessible; and finally context, because Quebec does not have its own domain name.

In the news in 2009 
The reach of these first harvests was modest: about 25 government organizations, chiefly ministries.

Looking at the sites collected in 2009, what do we see? Obviously, they reflect what was topical at the time. In 2009, much attention was paid to the influenza epidemic. Does anyone still remember the infamous H1N1 virus? A major vaccination campaign was underway during the winter of 2009, and the Quebec government had a website dedicated to this topic:

The Pandémie influenza website, which is no longer in existence. 

On the Quebec Ministry of Finance website, a number of documents dealt with the effects of the 2008 global financial crisis on Quebec’s economy:

Quebec Ministry of Finance website, 2009.

Still in the news today
While the flu pandemic and the economic crisis are presumably behind us, some news items from 2009 are still topical today. In 2009, reports submitted as part of the Bouchard-Taylor Consultation Commission on Accommodation Practices Related to Cultural Differences were available on the Commission website:

Website of the Bouchard-Taylor Consultation Commission, which no longer exists.

The website is not online anymore, and yet cultural differences and accommodations are still in the news today.

The Quebec National Assembly
Quebec’s National Assembly website also provides interesting historical perspectives. It includes a page dedicated to Quebec’s current Premier, François Legault, who at the time was simply an elected member of the Parti Québécois. As Premier, he is now leader of the Coalition Avenir Québec, a party he co-founded in 2011.

Quebec Web harvests since 2009
Ten years later, harvests have become more numerous. They are broader in scope and much more diverse, with BAnQ’s reach now extending beyond government websites.

The following table compares the 2009 harvests and those carried out as of March 1, 2019:

                                                                2009     2009-2019
Number of harvests                                                16        12,823
Number of organizations whose website is made available          25         1,295
Documents harvested                                       17,026,257   149,647,697
Total size of archives (terabytes)                              0.90            31

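It is interesting to see how the use of images and of audio and video materials has increased. Video materials alone grew from 19 GB in 2009 to 20,288 GB, about two-thirds of the total size of the archives: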

                                       2009                     2009-2019
Type of documents harvested                Number   Size (GB)        Number   Size (GB)
HTML pages                             15,073,735         306   122,146,682       4,967
Images                                  1,275,183          49    18,159,220       1,454
Applications (PDF, Word, Excel, etc.)     644,117         526     5,702,995       3,695
Video materials                            17,009          19     1,309,413      20,288
Audio materials                             7,458           4        79,660         320
Other                                       8,755        0.01     2,249,727         235

Proliferating applications are a major challenge for institutions that harvest websites. BAnQ relies on Heritrix.

Contents to explore and to work with
Since 2009, harvests have progressively widened their scope. They now provide a number of corpuses of interest to researchers, particularly in the digital humanities field. Websites dealing with Quebec provincial elections in 2012, 2014 and 2018 have been harvested (major parties, political blogs, news sites, etc.). The municipal elections of 2013 and 2017 have also been covered. In addition, we harvest what are known as “thematic” (i.e. non-governmental) sites: cultural organizations (museums, libraries, and archives), community organizations, professional associations, regional newspapers, and so on.

Websites harvested by BAnQ can be accessed through an interface. Interested researchers may also access the data directly on request.

The WARC file format celebrates its 10th anniversary

By Sara Aubry, Web Archiving Project Manager at BnF

The WARC format is our Web ARChive format. It defines a way of combining digital resources into an aggregate archival file along with related metadata. Today it is commonly used to store web crawls. For newcomers, a WARC file is made of one or more records. Each record consists of a header followed by a content block. The header has mandatory named fields that document, for instance, the URI, the date, the type and the length of the record. The content block may contain resources in any format, such as an HTML page, a binary image or a video file. WARC is an extension of the ARC file format designed by the Internet Archive in 1996.

The WARC format was initially released as an ISO international standard 10 years ago, in May 2009, under the number 28500:2009 (we also call it WARC version 1.0). The standardization opened the path to wider use and implementation in a variety of applications for harvesting, accessing, mining, exchanging and preserving digital resources. While it remains the only standard format for web archives, it has also been adopted beyond the web archiving community to store born-digital or digitized materials.
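For readers who prefer to see the record layout in code, here is a minimal sketch in Python of a single record: a version line, a header of named fields, a blank line, then the content block. It is an illustration only, not a reference implementation. The helper names build_record and parse_record are hypothetical, the field names (WARC-Type, WARC-Record-ID, WARC-Date, WARC-Target-URI, Content-Type, Content-Length) follow the published WARC 1.0 specification rather than anything stated in this article, and the example URI and payload are made up.

```python
import uuid
from datetime import datetime, timezone


def build_record(uri: str, payload: bytes, content_type: str) -> bytes:
    """Serialize one 'resource' record: version line, named fields, blank line, block."""
    fields = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),  # length of the content block in bytes
    ]
    header = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in fields) + "\r\n"
    # Each record ends with two CRLF pairs after the content block.
    return header.encode("utf-8") + payload + b"\r\n\r\n"


def parse_record(data: bytes) -> tuple[dict[str, str], bytes]:
    """Split a single serialized record back into its named fields and content block."""
    head, _, rest = data.partition(b"\r\n\r\n")  # first blank line ends the header
    lines = head.decode("utf-8").split("\r\n")
    if not lines[0].startswith("WARC/"):
        raise ValueError("not a WARC record")
    fields = dict(line.split(": ", 1) for line in lines[1:])
    length = int(fields["Content-Length"])
    return fields, rest[:length]


# Hypothetical example: wrap a tiny HTML page and read it back.
record = build_record("http://example.org/", b"<html>hello</html>", "text/html")
fields, block = parse_record(record)
print(fields["WARC-Type"], fields["WARC-Target-URI"], len(block))
```

The sketch handles one uncompressed record in memory. Real WARC files usually hold many records, often compressed individually, so production work is better served by an established WARC library.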

As with all ISO standards, the WARC standard is periodically reviewed to ensure that it continues to meet the changing needs that emerge from our practice. The first revision, supported by an IIPC task force and the subcommittee in charge of technical interoperability within the ISO information and documentation technical committee (ISO/TC46/SC4), was published in August 2017 as ISO 28500:2017 (also known as WARC version 1.1). This revision mainly introduced new named fields for deduplication and the possibility of more precise timestamps (see the IIPC GitHub for more details).

During the last IIPC General Assembly, which took place in November 2018 in Wellington, we started to discuss possible evolutions for the second revision. The ISO vote required to launch the revision process is currently scheduled for 2022. Alex Osborne from the National Library of Australia challenged the format to support the HTTP/2 protocol. Ilya Kremer presented Rhizome’s current implementation for recording provenance headers to indicate that a record has been created from another record and not from the original URL. Ilya also presented a need to keep track of the dynamic history of a web page display. Exchanges have continued and are still active on the IIPC GitHub and Slack (#warc channel). Current hot topics are how to keep track of media conversion (in particular for video and audio files) and how to reference a “transcluded” video or audio file from another page.

All these topics need time for raising awareness, in-depth discussions, shared testing and tool implementation within our community before they can be drafted and included in the standard. If you want to join the current discussions or raise any other topic, please join the IIPC #warc channel on Slack.

Contribute to CDG’s AI Collection!

By Tiiu Daniel, Web Archive Leading Specialist, National Library of Estonia

“Trurl” by Daniel Mróz, from The Cyberiad by Stanisław Lem (Wydawnictwo Literackie, Kraków, 1972). Illustration copyright © 1972 Daniel Mróz. Reprinted by permission.

After significant breakthroughs at the end of the 20th and the beginning of the 21st centuries, artificial intelligence (AI) has come to play a greater role in our daily lives. Although AI has a huge positive impact on a variety of fields such as manufacturing, healthcare, art, transportation and retail, the use of new technologies also raises ethical issues as well as security risks. One critical and hotly debated issue is the impact of ongoing automation on labor markets, including changing educational requirements for jobs, job elimination, and various models for transitions.

The IIPC Content Development Group invites curators and web archivists around the world to contribute websites to a new “Artificial Intelligence” web collection.

The purpose of this collection is to bring together and record web content related to the use of AI and its impact on any possible aspect of life, reflecting attitudes and thoughts towards it, future predictions, etc.

The content can be in any language and can focus on specific countries or cultures or have a global scope.

We especially welcome contributions from underrepresented countries, cultures, languages and other groups, or those countries without IIPC members. Curators currently building AI-related collections at their own institutions are welcome to contribute their seeds (matching the criteria below) to aid in the development of a collection with an international perspective.

The collection aims to cover the following subtopics:

  • Machine learning, natural language processing, robotics, automation;
  • AI in literature, visual arts (e.g. ceramics, drawing, painting, sculpture, design, photography, filmmaking, architecture) and performing arts (e.g. theater, public speech, dance, music); AI in emerging art forms;
  • AI and law/legislation;
  • Social and economic impact (e.g. impact on behavior/interaction, bias in AI, unemployment, inequality, changes in labor markets);
  • Ethical issues (e.g. weaponization of AI, security, robot rights);
  • Future predictions/scenarios concerning AI.

Types of web content to include are personal forms such as blogs, forum posts, and artist websites, as well as trend reports, statements, and analyses (e.g. from government agencies, NGOs, scientific or academic institutions, advocacy groups, businesses).

Time frame covered by content: from the 1990s onwards.

Out of scope are: full social media feeds and channels (Facebook, Twitter, Instagram, YouTube, WhatsApp), users’ video channels (YouTube, Vimeo), apps and other content which is difficult or impossible to crawl.

That said, if you locate individual social media posts of unique value, such as an Instagram post by a bot or a particularly relevant and ephemeral individual video, please submit them for consideration.

Nominations are welcomed using the following form.

The call for nominations will close on the 30th of June 2019. Crawls will be run during the summer of 2019. The collection will be made available at the end of 2019.

For more information about this collection, contact Tiiu Daniel (tiiu.daniel[at]nlib.ee).


Lead-Curators of CDG Artificial Intelligence Collection
Tiiu Daniel, Web Archive Leading Specialist, National Library of Estonia
Liisi Esse, Ph.D., Associate Curator for Estonian and Baltic Studies, Stanford University Libraries
Rashi Joshi, Reference Librarian/Collections Specialist, Library of Congress

CDG Co-Chairs
Nicola Bingham, Lead Curator Web Archiving, British Library
Alex Thurman, Web Resources Collection Coordinator, Columbia University Libraries

Contribute to CDG’s Climate Change Collection!

By Kees Teszelszky, Curator Digital Collections, Koninklijke Bibliotheek – National Library of The Netherlands and Lead Curator, CDG Climate Change Collection

Climate change has been one of the most urgent and hotly debated issues on the web in recent years. The IIPC Content Development Group is inviting all curators and web archivists from around the world to contribute websites to a new collaborative “Climate Change” collection.

Breiðamerkurlón, Iceland

In recent decades there has been strong evidence that the earth is experiencing rapid climate change, characterized by global temperature rise, warming oceans, shrinking ice sheets, glacial retreat, decreased snow cover, sea level rise, declining arctic sea ice, extreme weather events, and ocean acidification. Ninety-seven percent of climate scientists agree that these climate-warming trends over the past century are very likely due to human activities, and most of the leading scientific organizations worldwide have issued public statements endorsing this position (source: climate.nasa.gov/evidence). Global and local action to mitigate this crisis has been complicated by political, economic, technical, cultural, and religious debates.

Many people feel the urge to reflect on this topic on the web. We would like to take an international snapshot of born-digital culture relating to the documentation of, and social debate on, the challenging issue of climate change. You can contribute to this collection by nominating web content about any aspect of climate change. The content can focus on specific countries or cultures or have a global focus, and can be in any language.

We especially welcome contributions from underrepresented countries, cultures, languages and other groups, or those countries without IIPC members. Curators currently building climate change-related collections at their own institutions are welcome to contribute their seeds (matching the criteria below) to help us build a collection with an international perspective.

Examples of subtopics might include climatology, climate change denial, climate refugees, religious reflections on climate change, etc. Eligible types of web content include organizational reports or statements (e.g. from government agencies, NGOs, scientific or academic institutions, advocacy groups, political parties/platforms, businesses, religious groups) or more personal forms such as blogs or artistic projects.

Out of scope are: social media feeds (Facebook, Twitter, Instagram, YouTube channels, WhatsApp), video (YouTube, Vimeo), apps and other content which is difficult or impossible to crawl.

Collecting seeds started on 1 April 2019 and more nominations can be added to this spreadsheet. Crawls will be run during the summer of 2019, to conclude shortly after the upcoming UN Climate Action Summit on 23 September 2019.

Organized by the IIPC and supported by web archivists around the world, the special web collection ‘Climate Change’ is one of the ways the IIPC helps raise awareness of the strategic, cultural and technological issues which make up the web archiving and digital preservation challenge.

For more information about this collection, contact Kees Teszelszky: kees.teszelszky[at]kb.nl

IIPC Content Development Group: 2019 collections

By Nicola Bingham, Lead Curator Web Archiving, British Library and Co-Chair of the Content Development Working Group

During 2019, the Content Development Group (CDG) will continue to work on several established collections: 

New for 2019, the CDG is undertaking a Climate Change Collection, led by Kees Teszelszky of the National Library of the Netherlands. The first crawl will take place before the General Assembly & the Web Archiving Conference in June, with a final crawl shortly after the next UN Climate summit in September. This collection has sparked a lot of interest on the CDG mailing list and many curators have expressed an interest in contributing.

We are also planning an Artificial Intelligence Collection, led by Tiiu Daniel of the National Library of Estonia, Liisi Esse of Stanford University Libraries and Rashi Joshi of the Library of Congress. The details are still to be firmed up.

We are planning to crawl one of our collections, or a subset of a collection, so that it can be used by researchers.

Results of the Steering Committee Election 2019

The following IIPC member organisations have been elected to serve for a period of three years starting on 1st of June 2019 –

On behalf of the membership I would like to thank all of those who have taken part in this election.

IIPC PCO