Web Archiving the War in Ukraine

By Olga Holownia, Senior Program Officer, IIPC & Kelsey Socha, Administrative Officer, IIPC with contributions to the Collaborative Collection section by Nicola Bingham, Lead Curator, Web Archives, British Library; CDG co-chair


This month, the IIPC Content Development Working Group (CDG) launched a new collaborative collection to archive web content related to the war in Ukraine, aiming to map the impact of this conflict on digital history and culture. In this blog, we describe what is involved in creating a transnational collection and we also give an overview of web archiving efforts that started earlier this year: both collections by IIPC members and collaborative volunteer initiatives.

Collaborative Collection 2022

In line with the broader content development policy, CDG collections focus on topics that are transnational in scope and are considered of high interest to IIPC members. Each collection represents more perspectives than similar collections by a single member archive may include. Nominations are submitted by IIPC members, who have been archiving the conflict since as early as January 2022 (see below), as well as by the general public.

How do members contribute?

Topics for special collections are proposed by IIPC members who submit their ideas to the IIPC CDG mailing list, or contact the IIPC co-chairs directly at any time. Provided that the topic fits the CDG collecting scope, there is enough data budget to cover the collection, and a lead curator and volunteers to perform the archiving work are in place, the collection can go ahead. IIPC members are then canvassed widely to submit web content on a shared Google spreadsheet together with associated metadata such as title, language and description. The URLs are taken from the spreadsheet and crawled in Archive-It by the project team, which is formed of volunteers from IIPC members for each collection. Many IIPC members add a selection of seeds from their institutions’ own collections, which helps to make CDG collections very diverse in terms of coverage and language.
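The step from nomination spreadsheet to seed list can be pictured with a minimal Python sketch. It assumes the spreadsheet has been exported as CSV with the hypothetical column names `url`, `title`, `language` and `description` (the actual layout of the CDG spreadsheet is not described here), and it only deduplicates nominations; it does not talk to Archive-It.

```python
import csv
import io
from urllib.parse import urlsplit, urlunsplit

# Hypothetical export of a nominations spreadsheet; column names are assumptions.
SAMPLE = """url,title,language,description
https://example.org/news/,Example news section,en,Hypothetical news sub-section
HTTPS://Example.org/news/#top,Duplicate nomination,en,Same page nominated twice
https://example.net/,Example charity,uk,Hypothetical relief organisation
"""


def normalise_url(url: str) -> str:
    """Lower-case scheme and host and drop the fragment so near-duplicates match."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def load_seeds(csv_text: str) -> list[dict]:
    """Return one record per unique URL, keeping the first nomination seen."""
    seen = set()
    seeds = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        url = normalise_url(row["url"])
        if url in seen:
            continue  # a later nomination of the same page is skipped
        seen.add(url)
        seeds.append(
            {
                "url": url,
                "title": row["title"].strip(),
                "language": row["language"].strip(),
                "description": row["description"].strip(),
            }
        )
    return seeds


if __name__ == "__main__":
    for seed in load_seeds(SAMPLE):
        print(seed["language"], seed["url"])
```

Keeping the first nomination of each URL is a design choice made for this sketch; a project team might instead prefer to merge the metadata from duplicate rows.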

There will be overlap between the seeds that members submit to CDG collections and their own institutions’ collections. However, there are differences: selections for IIPC collections can be more geographically wide-ranging than those in members’ own collections, which may have to adhere to a regional scope, as in the case of a national library. Selection decisions that are appropriate for members’ own collections may not be appropriate for CDG collections. For example, members may want to curate individual articles from an online newspaper by crawling each one separately whereas, given the larger scope of CDG collections, it would be more appropriate to create the target at the level of the sub-section of the online newspaper. Public access to collections provided by Archive-It is a positive factor for those institutions that, for various reasons, can’t provide access to their collections. You can learn more about the War in Ukraine 2022 collection’s scope and parameters here.

Public nominations

We encourage everyone to nominate relevant web content as defined by the collection’s lead curators: Vladimir Tybin and Anaïs Crinière-Boizet of the BnF, National Library of France and Kees Teszelszky of KB, National Library of the Netherlands. The first crawl is scheduled to take place on 27 July and it will be followed by two additional crawls in September and October. We will be publishing updates on the collection at #Ukraine 2022 Collection. We are also planning to make this collection available to researchers.

Member collections

In Spring 2022, we compiled a survey of the work done by IIPC members. We asked about the collection start date, scope, frequency, type of collected websites, way of collecting (e.g. locally and/or via Archive-It), social media platforms and access.

IIPC members have collected content related to the war, ranging from news portals, to governmental websites, to embassies, charities, and cultural heritage sites. They have also selectively collected content from Ukrainian and Russian websites and social media, including Facebook, Reddit, Instagram, and, most prominently, Twitter. The CDG collection offers another chance for members without special collections to contribute seeds from their own country domains.

Many of our members are national libraries and archives, and legal deposit informs what these institutions are able to collect and how they provide access. In most cases, that would mean crawling country-level domains, offering a localized perspective on the war. Access varies from completely open (e.g. the Internet Archive, National Library of Australia and the Croatian Web Archive), to onsite-only with published and browsable metadata such as collected URLs (e.g. the Hungarian Web Archive) to reading-room only (e.g. Netarkivet at the Royal Danish Library or the “Archives de l’internet” at the National Library of France). The UK Web Archive collection has a mixed model of access, where the full list of metadata and collected URLs is available, but access to individual websites depends on whether the website owner has granted permission for off-site open access. Some institutions, such as the Library of Congress, may have time-based embargoes in place for collection access.

Some of our members have also begun work preparing datasets and visualisations for researchers. The Internet Archive has been supporting multiple collections and volunteer projects and our members have provided valuable advice on capturing content that is difficult to archive (e.g. Telegram messages).

A map of IIPC members currently collecting content related to the war in Ukraine can be seen below. It includes Stanford University, which has been supporting SUCHO (Saving Ukrainian Cultural Heritage Online).

Survey results

Access

While many members have been collecting content related to the war, only a small number of collections are currently publicly available online. Some members provide access to browsable metadata or a list of URLs. The National Library of Australia has been collecting publicly available Australian websites related to the conflict, as is the case for the National Library of the Czech Republic. A special event collection of 162 crowd-sourced URLs is now accessible at the Croatian Web Archive. The UK Web Archive’s special collection of nearly 300 websites is fully available on-site; however, information about the collected resources, which currently include websites of Russian Oligarchs in the UK, Commentators, Charities, Think Tanks and the UK Embassies of Ukraine and the surrounding nations, is publicly available online. Some websites from the UK Web Archive’s collection are also fully available off-site, where website owners have granted permission. The National Library of Scotland has set up a special collection, ‘Scottish Communities and the Ukraine’, which contains nearly 100 websites and focuses on the local response to the Ukraine War. This collection will be viewable in the near future pending QA checks. Most of the University Library of Bratislava’s collection is only available on-site, but information about sites collected is browsable on their web portal with links to current versions of the archived pages.

The web archiving team at the National Széchényi Library in Hungary, which has been capturing content from 75 news portals, has created a SolrWayback-based public search interface which provides access to metadata and full-text search, though full pages cannot be viewed due to copyright. The web archiving team has also been collaborating with the library’s Digital Humanities Center to create datasets and visualisations related to captured content.

Márton Nemeth of National Széchényi Library and Gyula Kalcsó of Digital Humanities Center, National Széchényi Library presented on this collection at the 2022 Web Archiving Conference.

Multiple institutions plan to make their content available online at a later date, after collecting has finished or after a specified period of time has passed. The Library of Congress has been capturing content in a number of collections within the scope of their collecting policies, including the ongoing East European Government Ministries Web Archive.

Frequency of Collection

Most institutions have been collecting with a variety of frequencies. Institutions rarely answered with just one of the frequency options, opting instead to pick multiple options or “Other.” Of answers in the “Other” category, some were doing one-time collection, while others were collecting yearly, six-monthly, and quarterly.

How the content is collected

Most IIPC members crawl the content locally, while a few have also been using Archive-It. SUCHO has mostly relied on the browser-based crawler Browsertrix, which was developed by Ilya Kreymer of Webrecorder and is in part funded by the IIPC, and on the Internet Archive’s Wayback Machine.

Type of collected websites (your domain)

When asked about types of websites being collected within local domains, most institutions have been focusing on governmental and news-related sites, followed by embassies and official sites related to Ukraine and Russia as well as cultural heritage sites. Other websites included a variety of crisis relief organisations, non-profits, blogs, think tanks, charities, and research organisations.

Types of websites/social media collected

When asked more broadly, most members have been focusing on local websites from their home countries. Outside local websites, some institutions were collecting Ukrainian websites and social media, while a smaller number were collecting Russian websites and social media.

Specific social media platforms collected

The survey also asked specifically about social media platforms our members were pulling from: Reddit, Instagram, TikTok, Tumblr, and Youtube. While many institutions were not collecting social media, Twitter was otherwise the most commonly collected social media platform.

Internet Archive

The Internet Archive (IA) has been instrumental in providing support for multiple initiatives related to the war in Ukraine. IA’s initiatives have included:

  1. giving free Archive-It accounts, as well as general data storage, to a number of different community archiving efforts
  2. uploading files to SUCHO collection at archive.org
  3. supporting the extensive use of Save Page Now (especially via the Google Sheets interface) with the help of numerous SUCHO volunteers (many tens of TB have been archived this way)
  4. supporting the uploading of WACZ files to the Wayback Machine. This work has just started but a significant number of files are expected to be archived and, similar to other collections featured in the new “Collection Search” service, a full-text index will be available
  5. crawling the entire country code top level domain of the Ukrainian web (the crawl was launched in April and is still running)
  6. archiving Russian Independent Media (TV, TV Rain), Radio (Echo of Moscow) and web-based resources (see “Russian Independent Media” option in the “Collection Search” service at the bottom of the Wayback Machine).

IA’s Television News Archive, the GDELT Project, and the Media-Data Research Consortium have all collaborated to create the Television News Visual Explorer, which allows for greater research access to the Television News Archive, including channels from across Russia, Belarus, and Ukraine. This blog post by GDELT’s Dr. Kalev H. Leetaru explains more of the significance of this collaboration, and the importance of this new research collection of Belarusian, Russian and Ukrainian television news coverage.

Volunteer initiatives

SUCHO

One of the largest volunteer initiatives focusing on preserving Ukrainian web content has been SUCHO. Involving over 1300 librarians, archivists, researchers and programmers, SUCHO is led by Stanford University’s Quinn Dombrowski, Anna E. Kijas of Tufts University, and Sebastian Majstorovic of the Austrian Centre for Digital Humanities and Cultural Heritage. In its first phase, the project’s primary goal was to archive at-risk sites, digital content, and data in Ukrainian cultural heritage institutions. So far over 30TB of content and 3,500+ websites of Ukrainian museums, libraries and archives have been preserved and a subset of this collection is available at https://www.sucho.org/archives. The project is beginning its second phase, focusing on coordinating aid shipments of digitization hardware, exhibiting Ukrainian culture online and organizing training for Ukrainian cultural workers in digitization methods.

The SUCHO leads and Ilya Kreymer presented on their work at the 2022 Web Archiving Conference and participated in a Q&A session moderated by Abbie Grotke of the Library of Congress.

The Telegram Archive of the War

Screenshot from the Telegram Archive of the War, taken July 20, 2022.

Telegram has been the most widely used application in Ukraine since the onset of the war, but this messaging app is notoriously difficult to archive. A team of five archivists at the Center for Urban History in Lviv, led by Taras Nazaruk, has been archiving almost 1000 Telegram channels since late February to create the Telegram Archive of the War. Each team member has been assigned to monitor and archive a topic or a region in Ukraine. They focus on capturing official announcements from different military administrative districts, ministries, local and regional news, volunteer groups helping with evacuation, searches for missing people, local channels for different towns, databases, cyberattacks, Russian propaganda, fake news as well as personal diaries, artistic reflections, humour and memes. Russian government propaganda and pro-Russian channels and chats are also archived. The multi-media content is currently grouped into over 20 thematic collections. The project coordinators have also been working with universities interested in supporting this archive and are planning to set up a working group to provide guidance for the future access to this invaluable archive.

Ukraine collections on Archive-It

New content has gradually been made available within the Ukraine collections on Archive-It, which provided free or heavily cost-shared accounts to its partners earlier this year. These collections also include websites documenting the Ukraine Crisis 2014-2015 curated by University of California Berkeley (UC Berkeley) and by Internet Archive Global Events. Four new collections have been created since February 2022 with over 2.5TB of content. The largest publicly available collection about the 2022 conflict (around 200 URLs) is curated by the Ukrainian Research Institute at Harvard University. Other collections that focus on Ukrainian content are curated by the Center for Urban History of East Central Europe, UC Berkeley and SUCHO. To learn more about the “War in Ukraine: 2022” collection, read this blog post by Liladhar R. Pendse, Librarian for East European, Central European, Central Asian and Armenian Studies Collections, UC Berkeley. University of Oxford, New College has been archiving at-risk Russian cultural heritage on the web as well as Russian opposition efforts to the war on Ukraine.

Ukrainian Research Institute at Harvard University’s collection at Archive-It.

Organisations interested in collecting web content related to the war in Ukraine can contact Mirage Berry, Business Development Manager at the Internet Archive.

How to get involved

  1. Nominate web content for the CDG collection
  2. Use the Internet Archive’s “Save Page Now”
  3. Check updates on the SUCHO Page for information on how you can contribute to the new phase of the project. SUCHO is currently accepting donations to pay for server costs and to fund digitization equipment to send to Ukraine. Those interested in volunteering with SUCHO can sign up for the standby volunteer list here.
  4. Help the Center for Urban History in Lviv by nominating Ukrainian Telegram channels that you think are worth archiving and participate in their events
  5. Submit information about your project: we are working to maintain a comprehensive and up-to-date list of web archiving efforts related to the war in Ukraine. If you are involved in a collection or a project and would like to see it included here, please use this form to contact us: https://bit.ly/archiving-the-war-in-Ukraine.

Many thanks to all of the institutions and projects featured on this list! We appreciate the time our members spent filling out our survey, and answering questions. Special thanks to Nicola Bingham of the British Library, Mark Graham and Mirage Berry of the Internet Archive, and Taras Nazaruk of the Center for Urban History in Lviv for providing supplementary information on their institutions’ collecting efforts.

Resources

Get Involved in Web Archiving the War in Ukraine 2022

By Kees Teszelszky, Curator Digital Collections, National Library of the Netherlands & Vladimir Tybin, Head of Digital Legal Deposit, National Library of France

On February 24, 2022, the armed forces of the Russian Federation invaded Ukrainian territory, annexing certain regions and cities and carrying out a series of military strikes throughout the country, thus triggering a war in Ukraine. Since then, the clashes between the Russian military and the Ukrainian population have had unprecedented repercussions on the situation of the civilian population and on international relations.

What we want to collect

This collaborative collection aims to collect web content related to this event in order to map the impact of this conflict on digital history and culture.

This collection will be built through the following themes: 

  • General information about the military confrontations
  • Consequences of the war on the civilian population
  • Refugee crisis and international relief efforts
  • Political consequences
  • International relations
  • Diaspora communities – Ukrainian people around the world 
  • Human rights organisations 
  • Foreign embassies and diplomatic relations
  • Sanctions imposed against Russia by foreign powers
  • Consequences on energy and agri-food trade
  • Public opinion: blogs/protest sites/activists

The list is not exhaustive and it is expected that contributing partners may wish to explore other sub-topics within their own areas of interest and expertise, providing that they are within the general collection development scope.

Out of scope

The following types of content are out of scope for the collection:

  • Data-intensive audio/video content (e.g. YouTube channels)
  • Social media platforms
  • Private member forums, intranets, or email (non-published material)
  • Content identifying vulnerable people and compromising their safety

How to get involved

Once you have selected the web pages that you would like to see in the collection, it takes less than 5 minutes to fill in the submission form:

https://bit.ly/Ukraine-2022-collection-public-nominations 

For the first crawl, the call for nominations will close on July 20, 2022.

For more information and updates, you can contact the IIPC Content Development Working Group team at Collaborative-collections@iipc.simplelists.com or follow the collection hashtag on Twitter at #iipcCDG.

Resources

About IIPC collaborative collections
IIPC CDG updates on the IIPC Blog

Celebrating the 2022 Winter Olympics and Paralympics Web Archive Collection

By Helena Byrne, Curator of Web Archives, British Library

The first IIPC collection focused just on the 2010 Winter Olympics in Vancouver. Since 2012, the IIPC has archived web content on both the Olympic and Paralympic Games. To date, the IIPC has archived seven Games. Beijing 2022 was also the 4th Winter Games collection.

Collection Name | Data | Docs
2014 Winter Olympics | 1.6 TB | 57,145,052
2014 Winter Paralympics | 1.3 TB | 42,542,659
2016 Summer Olympics and Paralympics | 3.1 TB | 18,205,981
2018 Winter Olympics and Paralympics | 1.2 TB | 12,218,514
2020 Summer Olympics and Paralympics [held in 2021] | 610.9 GB | 6,923,179
2022 Winter Olympics and Paralympics | 361.1 GB | 14,410,542

You can view the 2022 Winter Olympics and Paralympics collection here:

https://archive-it.org/collections/18422

In this final blog post on the IIPC Content Development Group (CDG) Beijing 2022 Olympic and Paralympic Games web archive collection, we look back at what content was crawled. 

Social media was excluded from the collection policy as these platforms update their code and design frequently and do not prioritise archivability. As a result they present ever-changing technical challenges to archivists, requiring more labour-intensive scoping with less satisfactory results. For that reason, we have not accepted nominations of content from Facebook, Instagram, Twitter, and other similar social media platforms.

What we collected

Crawl dates

There were five crawl dates for this collection. The collection period started in January and finished towards the end of March. A sixth crawl of 32 seeds was conducted on April 26, as these seeds had been missed in the first crawl. This issue was only noticed when preparing the metadata for publishing the collection.

  1. February 02, 2022 (308 seeds crawled)
  2. February 15, 2022 (264 seeds crawled)
  3. February 23, 2022 (65 seeds crawled)
  4. March 07, 2022 (29 seeds crawled)
  5. March 21, 2022 (198 seeds crawled)

We had a steady number of nominations for each crawl date. The only exception was the fourth crawl on March 7th with only 29 seeds crawled. This figure also includes a number of URLs that returned an error in the previous crawl. Nominations to this collection are done on a voluntary basis by members of the IIPC and the public from around the world. 

Countries covered

Athletes from 91 National Olympic Committees (NOCs) competed in Beijing 2022. Haiti and Saudi Arabia made their Winter Olympic debuts at this edition of the Games. 

We received nominations from 38 countries for the IIPC CDG 2022 Winter Olympics and Paralympics collection. Some of these countries might have only one or two websites nominated from them, and there are many more countries that competed in multiple events and have no content nominated. 

Languages covered

We have 24 languages in the collection including French (228 nominations), English (162 nominations) and Japanese (89 nominations). But many languages only have a few nominations and there are many other languages that haven’t been represented in the collection.

Data size 

We have archived 863 seeds out of the 889 seeds that were nominated. These seeds include full websites, subsections of websites and individual web pages in multiple languages from around the world. The 26 nominated seeds that were not archived were social media accounts, so they weren’t added to the crawler. Roughly 54 seeds in total were flagged in the Archive-It crawl reports. These were URLs that, for technical reasons, the crawlers were unable to archive when they visited the seed. These seeds were then assessed and added to the next crawl in the series, with some additional techniques used to try to capture them. However, not all of these recrawl attempts were successful. Quality assurance was carried out on these 54 seeds, and 36 of them were set to private as they displayed no content or just error messages.

We archived 361.1 GB of data and 14,410,542 documents at the end of five crawl cycles. We had initially set aside 1 TB for this collection but, as we weren’t archiving any social media content and had implemented a size cap on all seeds, we did not use as much data as expected.

We used the following policy when setting the scope of the crawl:

  1. Full seed host or directory (Example: team or athlete website)
    • These seeds will be capped at 3 GB
  2. Crawl one page only (Example: news article)
    • These seeds will be capped at 1 GB 
  3. Seed page plus 1 click of all links on seed page (Example: news page linking to multiple articles)
    • These seeds will be capped at 2 GB
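The scoping policy above can be written down as a small lookup table. The following Python sketch is only an illustration: the enum and function names are hypothetical and do not correspond to Archive-It settings; only the three scope types and their caps come from the policy.

```python
from enum import Enum


class Scope(Enum):
    """The three seed scope types named in the policy (names are illustrative)."""
    FULL_HOST_OR_DIRECTORY = "full seed host or directory"
    ONE_PAGE = "crawl one page only"
    PAGE_PLUS_ONE_CLICK = "seed page plus 1 click of all links"


# Size caps in GB, as listed in the policy.
SIZE_CAP_GB = {
    Scope.FULL_HOST_OR_DIRECTORY: 3,
    Scope.ONE_PAGE: 1,
    Scope.PAGE_PLUS_ONE_CLICK: 2,
}


def within_cap(scope: Scope, crawled_gb: float) -> bool:
    """Return True if the data crawled for a seed stays within the cap for its scope."""
    return crawled_gb <= SIZE_CAP_GB[scope]


if __name__ == "__main__":
    # A team website crawled as a full host, using 2.5 GB, is within its 3 GB cap.
    print(within_cap(Scope.FULL_HOST_OR_DIRECTORY, 2.5))
```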

In the 2018 Winter Games collection, we collected 1,413 seeds and used 1.2 TB of data with 12,218,514 documents. However, if we just compare the URL nominations, the 2022 and 2018 collections are quite similar once the 557 social media URLs tagged as Blogs & Social Media are excluded from the 2018 total.
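The comparison can be checked with a few lines of arithmetic. This Python sketch uses only the figures quoted in this post; the variable names are illustrative.

```python
# Figures quoted in this post.
seeds_2018_total = 1413        # seeds collected for the 2018 Winter Games
seeds_2018_social = 557        # 2018 seeds tagged as Blogs & Social Media
seeds_2022_nominated = 889     # seeds nominated for Beijing 2022
seeds_2022_archived = 863      # seeds archived for Beijing 2022

# 2018 total with the Blogs & Social Media seeds removed.
seeds_2018_without_social = seeds_2018_total - seeds_2018_social

# Gaps between the 2022 figures and the adjusted 2018 figure.
gap_archived = seeds_2022_archived - seeds_2018_without_social
gap_nominated = seeds_2022_nominated - seeds_2018_without_social

if __name__ == "__main__":
    print(seeds_2018_without_social, gap_archived, gap_nominated)
```

With the Blogs & Social Media seeds removed, the 2018 total comes to 856 seeds, which is 7 fewer than the 863 seeds archived in 2022 and 33 fewer than the 889 nominated.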

Related blog posts

Get Involved in Web Archiving the Winter Games – Beijing 2022 

Steeze (Style & Ease) on the Slopes – Web Archiving Beijing 2022

Resources

About IIPC collaborative collections

IIPC CDG updates on the IIPC Blog

The Summer and Winter Olympics and Paralympics Collections in Archive-It

The Summer and Winter Olympics and Paralympics Collections 2010-2020 poster

Despite not collecting social media content, we did promote the call for nominations for this collection on social media channels (mostly Twitter) with the collection hashtag #WAGames2022.

For more information and updates on Content Development Group activities, you can contact the IIPC CDG team at Collaborative-collections@iipc.simplelists.com

Steeze (Style & Ease) on the Slopes – Web Archiving Beijing 2022

By Helena Byrne, Curator Web Archives, British Library

The Winter Olympics may be over but the IIPC Content Development Group Beijing 2022 collection is still running until March 20th, 2022. The Winter Paralympics got underway on March 4th and we would love to see your nominations for this edition of the Games. 

In our first blog post Get Involved in Web Archiving the Winter Games – Beijing 2022, we outlined details of what and how to nominate. Once you have selected the web pages you would like to see in the collection, it takes less than 5 minutes to fill in the submission form:

https://bit.ly/CDG-2022Games-collection-public-nominations 

What we have collected so far

We have archived 616 nominated seeds so far. These nominations include full websites, subsections of websites and individual web pages in multiple languages from around the world. We have archived 280.3GB of data and 13,197,402 documents at the end of three crawl cycles. 

Social media policy

Social media was excluded from the collection policy as these platforms update their code and design frequently and do not prioritise archivability. As a result they present ever-changing technical challenges to archivists, requiring more labour-intensive scoping with less satisfactory results. For that reason, we have not accepted nominations of content from Facebook, Instagram, Twitter, and other similar social media platforms.

Map of the world

Qualified athletes from 91 National Olympic Committees (NOCs) competed in Beijing 2022. Only 29 of these NOCs received a medal. Athletes competed across 109 events over 15 disciplines in seven sports. Haiti and Saudi Arabia made their Winter Olympic debuts at this edition of the Games. So far, nominations have been received for the IIPC CDG 2022 Winter Olympics and Paralympics collection from 27 countries. Some of these countries might have only one or two websites nominated from them, and there are many more countries that competed and have no content nominated such as Austria (106 athletes), Sweden (116 athletes) and Ukraine (45 athletes).

Languages covered

We have 24 languages in the collection including French (200 nominations), Japanese (91 nominations) and English (79 nominations). But many languages only have a few nominations and there are many other languages that haven’t been represented in the collection. We would like to see more nominations in multiple languages, especially Chinese (4 nominations) and Russian (1 nomination).

How to get involved

Once you have selected the web pages you would like to see in the collection, add them to the submission form: https://bit.ly/CDG-2022Games-collection-public-nominations

If you know anyone who may be interested in contributing to this collection, please share the link with them! The call for nominations will close on March 20, 2022.

For more information and updates, you can contact the IIPC CDG team Collaborative-collections@iipc.simplelists.com or follow the collection hashtag #WAGames2022.

Resources

Get Involved in Web Archiving the Winter Games – Beijing 2022

By Helena Byrne, Curator Web Archives, British Library

It’s that time of year again when we all become winter sports experts while watching the Winter Olympic and Paralympic Games on TV. There are even helpful guides online to all the slang used on the slopes like steeze (style and ease), pow (fresh ski powder) and sendy (to ride it with full vigour).

The ongoing pandemic has caused havoc on sporting schedules since the start, but the Beijing 2022 Winter Games are going ahead as scheduled. This means that the IIPC Content Development Group (CDG) will again be collecting web content related to the Games from across the world in multiple languages.

10 years of collecting the Games

2020 marked ten years of archiving the Games. The first collection focused just on the 2010 Winter Olympics in Vancouver. But since 2012 the IIPC has archived web content on both the Olympic and Paralympic Games. To date the IIPC has archived six Games. Beijing 2022 will be the 4th Winter Games collection.

Helena Byrne: Going for Gold: Web Archiving the Olympics & Paralympics Games, poster presented at IIPC Web Archiving Conference 2021.

Previous CDG Olympic and Paralympic collections have focused on events both on and off the field/slopes. Key themes have included doping, corruption and Zika Virus as well as Covid-19 in the 2020 Summer Games. This year will be no different as Covid-19 is still a big issue and the human rights issues in China have meant that some nations like the USA, UK and Australia will not be sending any diplomatic representatives.

What we want to collect

Public platforms in various formats such as:

  • Websites
  • Subsections of websites with an Olympic tag
  • Individual Articles
  • News Reports
  • Blogs
  • Audio visual content

Social media platforms update their code and design frequently and do not prioritize archivability, and as a result present ever-changing technical challenges to archivists, requiring more labour-intensive scoping with less satisfactory results. For that reason we will not be accepting nominations from Facebook, Instagram, or Twitter, nor from other similar social media platforms.

The subjects covered on these sites can include but are not limited to:

  • Athletes/Teams
  • Computer Games (eGames)
  • Covid-19
  • Diplomatic Relations with China
  • Doping/Cheating and Corruption
  • Environmental Issues
  • Fandom
  • Gender Issues (Ex. media coverage, sexual harassment etc.)
  • General News/ Commentary
  • Human Rights issues
  • Olympic/Paralympic Venues
  • Security
  • Sports Events
  • Other

How to get involved

Once you have selected the web pages you would like to see in the collection, it takes less than 5 minutes to fill in the submission form:
https://bit.ly/CDG-2022Games-collection-public-nominations 

The call for nominations will close on March 20, 2022.

For more information and updates you can contact the IIPC CDG team Collaborative-collections@iipc.simplelists.com or follow the collection hashtag #WAGames2022.

IIPC Collaborative collection: “Afghanistan regime change (2021) and the international response”

By Nicola Bingham, Lead Curator, Web Archiving, the British Library; Co-chair, IIPC Content Development Working Group

On 4th October 2021 the Content Development Group (CDG) initiated a thematic website collection in response to recent developments in Afghanistan at the behest of several CDG members.

Background

Recent events in Afghanistan have precipitated a humanitarian crisis which escalated markedly after foreign armed forces withdrew from the country in May 2021.[1] As US and Allied troops retreated, the Taliban quickly gained ground, seizing cities across the country and raising the threat of a worsening civil war. The Taliban have now claimed control of all major cities in Afghanistan, including the capital Kabul, where fighters have seized the presidential palace, forcing the president to flee. The Afghan government, which was supported by the US and the Allies, has collapsed and there has been a transition of power to the Taliban.

As violence intensifies across large areas of the country, civilians are being caught up in the fighting and hundreds of Afghans have been killed in recent weeks, while thousands have been forced to flee their homes.

The Department of Defense is committed to supporting the U.S. State Department in the departure of U.S. and allied civilian personnel from Afghanistan, and to evacuate Afghan allies safely. (U.S. Air Force photo by Staff Sgt. Brandon Cribelar)
Public domain, via Wikimedia Commons: https://commons.wikimedia.org/wiki/File:Operation_Allies_Refuge_210819-F-DT970-0064.jpg

The humanitarian crisis is of great concern internationally; however, the cultural heritage of Afghanistan is also under threat. As described by Richard Ovenden in an article in the Financial Times (24th September 2021), the global Library and Archive community has been trying to do what it can, from concerted efforts to help Afghans working in the cultural heritage sector to leave the country, to supporting the preservation of cultural artefacts including digital materials.[2]

It is likely that the new regime will want to bring the Internet under greater censorship and control,[3] meaning web content and the information contained therein is at risk. Alongside the internal threat is the risk that foreign internet service providers, largely based in the US, could turn off cloud servers, social media platforms, etc. if America decided to act on the threat to impose sanctions on Afghanistan.[4]

Existing collecting efforts

Afghanistan Web Archive at the Library of Congress: https://www.loc.gov/collections/afghanistan-web-archive/ 

Rapid response collecting of at-risk Afghan Internet content has already been undertaken by several archiving institutions, alongside ongoing Afghanistan collections curated by the Library of Congress. Examples include:

The CDG does not wish to duplicate these efforts but rather to complement them by focussing on the international aspects of events in Afghanistan, documenting transnational involvement and worldwide interest in the process of the change of regime, recording how the situation evolves over time.

Content Scope

With this in mind, the Afghanistan collection has been scoped to adhere to the broader content development policy of the CDG, namely that a collection meets the following criteria:

  • It is of high interest to IIPC members;
  • It does not map to any one member’s responsibility or mandate;
  • It is of higher value to research because it represents more perspectives than similar collections in only one member archive would do;
  • It is transnational in scope, but not necessarily “global”.

Taliban/IS content

The aim of any CDG collection is to reflect multiple viewpoints and to preserve a snapshot of society as it was at the time of archiving. It will be important to researchers that websites from across the spectrum of all human activity are collected in order to present a more accurate picture of the times.

Websites produced by the Taliban or IS, or that are pro-Taliban/IS, can be included in the collection. Most Government websites will begin to express pro-Taliban views in any case.

The Taliban/IS are likely to have used communication networks that cannot be archived for technical reasons, e.g. Facebook/WhatsApp, and so this type of content will be excluded.

Daniel Wilkinson (U.S. Department of State), Public domain, via Wikimedia Commons https://commons.wikimedia.org/wiki/File:Afghan_females_using_internet_in_Herat.jpg

Sub-topics
Sub-topics may include:

  • Military experience in Afghanistan; nations withdrawing armed forces from Afghanistan; statements of defence and military analysis
  • Analysis and policy of think tanks such as Chatham House (UK), the Brookings Institution and RAND International Affairs (US)
  • Afghan refugees in Pakistan, Iran and elsewhere
  • International relief efforts (The Red Cross, United Nations etc.)
  • Diaspora communities – Afghan people around the world
  • Human rights/Women’s rights/LGBTQ+ rights
  • Foreign embassies and diplomatic relations
  • Sanctions imposed against Afghanistan by foreign powers
  • Transnational websites and social media (SoundCloud, Squarespace, Twitter, WordPress, YouTube, Facebook Group pages (not individual Facebook profiles)) about Afghanistan from any country and in any language.

The list is not exhaustive, and it is expected that contributors may wish to explore other sub-topics within their own areas of interest and expertise, providing they are within the general collection development scope.

Stakeholders

The lead curator for this collection is Nicola Bingham. She will be responsible for developing the content strategy, overseeing the progression of the collection, and promoting the collection to potential users.

IIPC members, together with a wide range of stakeholders in the Library and Archive community, including staff at the Bodleian Libraries, Oxford, as well as members of the public, are expected to contribute to the collection (see below for details about how to contribute).

Crawls are being undertaken in Archive-It by Janko Klasinc (National and University Library, Slovenia) and Carlos Lelkes-Rarugal (Assistant Web Archivist, British Library).

Size of collection

The CDG’s full budget in 2021 is 4 TB, of which 1.8 TB had been used by the end of September. The CDG plans to undertake small crawls for our ongoing and new collections as follows:

  • 2020 Summer Olympics and Paralympics [held in 2021]
  • Novel Coronavirus (COVID-19)
  • Intergovernmental Organizations
  • National Olympic and Paralympic Committees.

At this stage, c. 400 GB of data has been allocated to the Afghanistan collection.
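As a rough worked example, the budget figures above can be put side by side. The sketch assumes decimal units (1 TB = 1000 GB), which the post does not specify, and treats the approximate 400 GB allocation as exact. The variable names are illustrative only.

```python
# All figures in GB, assuming decimal units (1 TB = 1000 GB).
TOTAL_BUDGET_GB = 4000   # full 2021 budget: 4 TB
USED_GB = 1800           # used through the end of September: 1.8 TB
AFGHANISTAN_GB = 400     # approximate allocation for the Afghanistan collection

# Budget still available before the Afghanistan allocation.
remaining_gb = TOTAL_BUDGET_GB - USED_GB
# Budget left for the other planned crawls once the allocation is set aside.
left_for_other_crawls_gb = remaining_gb - AFGHANISTAN_GB
# Fraction of the remaining budget taken by the Afghanistan collection.
share = AFGHANISTAN_GB / remaining_gb

print(f"Remaining before allocation: {remaining_gb} GB")
print(f"Left for the other planned crawls: {left_for_other_crawls_gb} GB")
print(f"Afghanistan share of remaining budget: {share:.0%}")
```

Under these assumptions the allocation takes a little under a fifth of what remained at the end of September, which leaves room for the small crawls listed above.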

Co-Curation

Nominations will be sought from IIPC Members and external agencies such as the UK Legal Deposit Libraries, University Libraries and the Library and Archive community. A Google form will be sent out to elicit nominations from non-IIPC members and members of the public. This form contains the relevant metadata fields which will populate a Google sheet. The aim of distributing the work of co-curation for the collection is to enable a diverse range of communities and individuals to contribute, including members of the Afghanistan community, helping to ensure that the collection is as representative as possible.

IIPC Members will be able to add their nominations directly to a Google sheet which will be reviewed by the lead curator against collection scope and marked for inclusion in the collection.
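The review step described above can be pictured as a small data model: each row of the sheet is a nomination with some metadata, and only rows the lead curator marks as in scope become crawl seeds. The sketch below is an assumption-laden illustration, not the CDG's actual workflow: the field names, the `in_scope` flag, the trailing-slash de-duplication and the example.org rows are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Nomination:
    # Hypothetical field names; the real sheet's columns may differ.
    url: str
    title: str
    language: str
    description: str
    in_scope: bool = False  # set by the lead curator after review


def seeds_for_crawl(nominations):
    """Return de-duplicated URLs of nominations marked for inclusion, in sheet order."""
    seen = set()
    seeds = []
    for n in nominations:
        # Treat URLs that differ only by surrounding spaces or a trailing slash as duplicates.
        key = n.url.strip().rstrip("/")
        if n.in_scope and key not in seen:
            seen.add(key)
            seeds.append(n.url.strip())
    return seeds


# Made-up rows for illustration only.
rows = [
    Nomination("https://example.org/afghanistan-report", "Report", "en",
               "Think tank analysis", in_scope=True),
    Nomination("https://example.org/afghanistan-report/", "Report (duplicate)", "en",
               "Same page nominated twice", in_scope=True),
    Nomination("https://example.com/unrelated", "Unrelated", "en",
               "Off-topic page", in_scope=False),
]
seeds = seeds_for_crawl(rows)
print(seeds)
```

Keeping the curator's decision as an explicit field on each row means the sheet remains a full record of what was nominated, not only of what was crawled.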

Access

Access to the collection will be through the Internet Archive’s Archive-It interface. Metadata will be exposed as facets on the collection home page and will be browseable by users.

How to contribute:

  1. Please read the Collection Scoping Document. This goes into more detail about what is in and out of scope
  2. If you are an IIPC member, please nominate URLs and add basic metadata to this Google Sheet
  3. If you are not an IIPC member you may contribute nominations and a small amount of basic metadata on this Google form.

References

1 Kiely, E. and Farley, R. Timeline of U.S. Withdrawal from Afghanistan. August 17, 2021. FactCheck.org
https://www.factcheck.org/2021/08/timeline-of-u-s-withdrawal-from-afghanistan/

2 Ovenden, R. The Battle for Afghanistan’s libraries. September 24, 2021. Financial Times. https://www.ft.com/content/82fffcc8-3631-48dc-829d-44f237549a59

3 Afghanistan’s Internet: who has control of what? Goman Web. September 20, 2021. https://gomanweb.net/2021/09/20/afghanistans-internet-who-has-control-of-what/

Digital oppression in Afghanistan. NordVPN Blog. August 20, 2021. https://nordvpn.com/blog/digital-oppression-in-afghanistan

Baibhawi, R. Taliban Shuts Internet In Panjshir To Stop Northern Alliance From Galvanizing Support. August 29, 2021. Republic. https://www.republicworld.com/world-news/rest-of-the-world-news/taliban-shuts-internet-in-panjshir-to-stop-northern-alliance-from-galvanizing-support.html

Vavra, S. and Falzone, D. This Is Why the Taliban Keeps F*cking Up the Internet. September 16, 2021. Daily Beast. https://www.thedailybeast.com/this-is-why-the-taliban-keeps-fcking-up-afghanistans-internet

Sorkin, A. R., Karaian, J., Kessler, S., Gandel, S., Hirsch, L., Livni, E. and Schaverien, A. Big Tech and the Taliban. August 19, 2021. The New York Times. https://www.nytimes.com/2021/08/19/business/dealbook/taliban-social-media.html

4 Stokel-Walker, C. The battle for control of Afghanistan’s internet. September 7, 2021. Wired. https://www.wired.co.uk/article/afghanistan-taliban-internet

5 Gomes, P. Automated seed selection to preserve Afghan sites (Arquivo.pt). IIPC Curating Special Collections Workshop, September 24, 2021. https://youtu.be/Aa_-BBnEr8I

IIPC Content Development Group’s activities 2019-2020

By Nicola Bingham, Lead Curator Web Archives, British Library and Co-Chair of the IIPC Content Development Working Group

Introduction

I was delighted to present an update on the Content Development Group’s (CDG) activities at the 2020 IIPC General Assembly (GA) on behalf of myself, Alex and the curators that have worked so hard on collaborative collections over the past year.

Socks, not contributing to Web Archiving

Although it was disappointing not to have been in Montreal for the GA and Web Archiving Conference (WAC), there are many advantages in attending a conference remotely. Apart from cost and time savings, it meant that many more staff members from our organisations could attend. I liked the fact that I could see many “old” web archiving friends online and it did feel like the same friendly, enthusiastic, innovative environment that is normally fostered at IIPC events. I was also delighted to see some of the attendees’ pets on screen, although it did highlight that other people’s cats are generally much more affectionate than my own, who has, I have to say, contributed little to the field of web archiving over the years, although he did show a mild interest in Warcat.

Several things become clear when tasked with pre-recording a presentation with a time limit of 2 to 3 minutes. Firstly, it is extremely difficult to fit everything you need to say into such a short space of time; secondly, what you do want to say must be tightly scripted – although this does have the advantage that there is no room for the pauses or “errs” that can sometimes pepper my in-person presentations. Thirdly, recording even a two-minute video calls for a surprising number of retakes, taking many hours for no apparent reason. Fourthly, naively explaining these facts to the Programme and Communications Officer leads quite seamlessly to the suggestion of writing a blog post in order that one can be more expansive on the points bulleted in the two-minute presentation….

CDG Collection Update

Since our last General Assembly in Zagreb, in June 2019, the CDG has continued working on several established, and two new collections:

  • The International Cooperation Organizations Collection was initiated in 2015 and is led by Alex Thurman of Columbia University Libraries. It previously consisted of all known active websites in the .int top-level domain (available only to organizations created by treaties), but was expanded to include a large group of similar organizations with .org domain hosts, and renamed Intergovernmental Organizations this year. This increased the collection from 163 to 403 intergovernmental organizations, all of which will continue to be crawled each year.
  • The National Olympic and Paralympic Committees collection, led by Helena Byrne of the British Library, was initiated in 2016 and consists of websites of national Olympic and Paralympic committees and associations, as identified from the official listings of these groups found on the official sites http://www.olympic.org and http://www.paralympic.org.
  • Online News Around the World is led by Sabine Schostag of the Royal Danish Library. This collection of seeds was first crawled in October 2018 to document a selection of online news from as many countries as possible. It was crawled again in November 2019. The collection was promoted at the Third RESAW Conference, “The web that was: archives, traces, reflections” in Amsterdam in June 2019 and at the IFLA News Media Conference at Universidad Nacional Autónoma de México, Mexico City in March 2020.
  • New in 2019, the CDG undertook a Climate Change Collection, led by Kees Teszelszky of the National Library of the Netherlands. The first crawl took place in June, with a final crawl shortly after the UN Climate summit in September 2019.
  • New in 2019, a collection on Artificial Intelligence was undertaken between May and December, led by Tiiu Daniel (National Library of Estonia), Liisi Esse (Stanford University Libraries) and Rashi Joshi (Library of Congress).

Coronavirus (Covid-19) Collection

The main collecting activity in 2020 has been around the Covid-19 global pandemic. This has involved a huge effort by IIPC members, with contributions from over 30 members as well as public nominations from over 100 individuals/institutions.

We have been very careful with scoping rules so that we are able to collect a diverse range of content within the data budget – and Archive-It generously increased the data limit for this collection to 5TB. Collecting will continue to run, budget permitting, while the event is of global significance.

Publicly available CDG collections can be viewed on the Archive-It website: https://archive-it.org/home/IIPC. An overview of the collection statistics can be seen below.

CDG Collection statistics. Figures correct as of 15th June 2020. Slide presented at IIPC GA 17th June 2020.

Researcher-use of Collections

The CDG has worked closely with the Research Working Group co-chairs to promote and facilitate use of the CDG collections, which are now available through the Archives Unleashed Cloud thanks to the Archives Unleashed project. The collections have been analysed and a large number of derivatives are available to researchers at IIPC-led events and/or research projects. For more information about how to access these collections please refer to the guidelines.

Next Steps/Getting in touch

We would very much welcome new members to the CDG. We will be having an online meeting in the next couple of months which would be an excellent opportunity to find out more. In the meantime, any IIPC member is welcome to suggest and/or lead on possible 2021 collaborative collections. For more information please contact the co-chairs or the Programme and Communications Officer.

Nicola Bingham & Alex Thurman CDG co-chairs

The CDG Working Group at the 2019 IIPC General Assembly in Zagreb.

Novel Coronavirus outbreak: help us collect websites

The International Internet Preservation Consortium’s Content Development Group and Archive-It are collaborating on a web archive collection preserving web content related to the ongoing Novel Coronavirus (Covid-19) outbreak. Due to the urgency of the outbreak, archiving of nominated web content will begin soon (mid-February 2020). Collection of new nominations and new crawls will continue as needed depending on the course of the outbreak and its containment.

What we are collecting

Web content from all countries and in any language is in scope. High priority subtopics include:

  • Coronavirus origins
  • Information about the spread of infection
  • Regional or local containment efforts
  • Medical/Scientific aspects
  • Social aspects
  • Economic aspects
  • Political aspects

Published information resources are a higher priority for seed nominations for this collection than social media feeds or hashtags (though the latter can be useful for finding examples of the former).

How to get involved

If you would like to participate in the collection, please nominate websites by using this submission form: https://forms.gle/zHgJK3DcfGpzAtCz5

The final collection is available at this link: https://archive-it.org/collections/13529

Contribute to CDG’s AI Collection!

By Tiiu Daniel, Web Archive Leading Specialist, National Library of Estonia

“Trurl” by Daniel Mróz, from The Cyberiad by Stanisław Lem (Wydawnictwo Literackie, Kraków, 1972). Illustration copyright © 1972 Daniel Mróz. Reprinted by permission.

After significant breakthroughs at the end of the 20th and the beginning of the 21st century, artificial intelligence (AI) has come to play a greater role in our daily lives. Although AI has a huge positive impact on a variety of fields such as manufacturing, healthcare, art, transportation, retail and so on, the use of new technologies also raises ethical issues as well as security risks. One critical and hotly debated issue is the impact of ongoing automation on labor markets, including changing educational requirements for jobs, job elimination, and various models for transitions.

The IIPC Content Development Group invites curators and web archivists around the world to contribute websites to a new “Artificial Intelligence” web collection.

The purpose of this collection is to bring together and record web content related to the use of AI and its impact on any possible aspect of life, reflecting attitudes and thoughts towards it, future predictions etc.

The content can be in any language and can focus on specific countries or cultures or have a global scope.

We especially welcome contributions from underrepresented countries, cultures, languages and other groups, or those countries without IIPC members. Curators currently building AI-related collections at their own institutions are welcome to contribute their seeds (matching the criteria below) to aid in the development of a collection with an international perspective.

The collection aims to cover the following subtopics:

  • Machine learning, natural language processing, robotics, automation;
  • AI in literature, visual arts (e.g. ceramics, drawing, painting, sculpture, design, photography, filmmaking, architecture) and performing arts (e.g. theater, public speech, dance, music etc.); AI in emerging art forms;
  • AI and law/legislation;
  • Social and economic impact (e.g. impact on behavior/interaction, bias in AI, unemployment, inequality, changes in labor markets);
  • Ethical issues (e.g. weaponization of AI, security, robot rights);
  • Future predictions/scenarios concerning AI.

Types of web content to include are personal forms such as blogs, forum posts, and artist websites; trend reports, statements, and analyses (e.g. from government agencies, NGOs, scientific or academic institutions, advocacy groups, businesses).

Time frame covered by content: from the 1990s onwards.

Out of scope are: full social media feeds and channels (Facebook, Twitter, Instagram, YouTube, WhatsApp), users’ video channels (YouTube, Vimeo), apps and other content which is difficult or impossible to crawl.

That said, if you locate individual social media posts of unique value, such as an Instagram post by a bot or a particularly relevant and ephemeral individual video, please submit them for consideration.

Nominations are welcomed using the following form.

The call for nominations will close on the 30th of June 2019. Crawls will be run during the summer of 2019. The collection will be made available at the end of 2019.

 For more information about this collection, contact Tiiu Daniel (tiiu.daniel[at]nlib.ee).


Lead-Curators of CDG Artificial Intelligence Collection
Tiiu Daniel, Web Archive Leading Specialist, National Library of Estonia
Liisi Esse, Ph.D., Associate Curator for Estonian and Baltic Studies, Stanford University Libraries
Rashi Joshi, Reference Librarian/Collections Specialist, Library of Congress

CDG Co-Chairs
Nicola Bingham, Lead Curator Web Archiving, British Library
Alex Thurman, Web Resources Collection Coordinator, Columbia University Libraries

Contribute to CDG’s Climate Change Collection!

By Kees Teszelszky, Curator Digital Collections, Koninklijke Bibliotheek – National Library of The Netherlands and Lead Curator, CDG Climate Change Collection

Climate change is one of the most urgent and hotly debated issues on the web in recent years. The IIPC Content Development Group is inviting all curators and web archivists from around the world to contribute websites to a new collaborative “Climate Change” collection.

Breiðamerkurlón, Iceland

In recent decades there has been strong evidence that the earth is experiencing rapid climate change, characterized by global temperature rise, warming oceans, shrinking ice sheets, glacial retreat, decreased snow cover, sea level rise, declining arctic sea ice, extreme weather events, and ocean acidification. Ninety-seven percent of climate scientists agree that these climate-warming trends over the past century are very likely due to human activities, and most of the leading scientific organizations worldwide have issued public statements endorsing this position (source: climate.nasa.gov/evidence). Global and local action to mitigate this crisis has been complicated by political, economic, technical, cultural, and religious debates.

Many people feel the urge to reflect on this topic on the web. We would like to take an international snapshot of born-digital culture relating to the documentation of, and social debate on, the challenging issue of climate change. You can contribute to this collection by nominating web content about any aspect of climate change. The content can focus on specific countries or cultures or have a global focus, and can be in any language.

We especially welcome contributions from underrepresented countries, cultures, languages and other groups, or those countries without IIPC members. Curators currently building climate change related collections at their own institutions are welcome to contribute their seeds (matching the criteria below) to help us build a collection with an international perspective.

Examples of subtopics might include climatology, climate change denial, climate refugees, religious reflections on climate change, etc. Eligible types of web content include organizational reports or statements (e.g. from government agencies, NGOs, scientific or academic institutions, advocacy groups, political parties/platforms, businesses, religious groups) or more personal forms such as blogs or artistic projects.

Out of scope are: social media feeds (Facebook, Twitter, Instagram, YouTube channels, WhatsApp), video (YouTube, Vimeo), apps and other content which is difficult or impossible to crawl.

Collecting seeds started on 1 April 2019 and more nominations can be added to this spreadsheet. Crawls will be run during the summer of 2019, to conclude shortly after the upcoming UN Climate Action Summit on 23 September 2019.

Organized by the IIPC and supported by web archivists around the world, the special web collection ‘Climate Change’ is one of the ways the IIPC helps raise awareness of the strategic, cultural and technological issues which make up the web archiving and digital preservation challenge.

For more information about this collection, contact Kees Teszelszky: kees.teszelszky[at]kb.nl