One year on: an update on the War in Ukraine CDG collection

By the lead curators: Anaïs Crinière-Boizet, Digital Curator (National Library of France), Kees Teszelszky, Curator Digital Collections (National Library of the Netherlands) & Vladimir Tybin, Head of Digital Legal Deposit (National Library of France).


This month, the IIPC Content Development Working Group (CDG) launched a new web crawl to archive web content related to the war in Ukraine, based on suggestions by curators, web archivists and members of the public worldwide. The aim of this effort is to map the impact of this conflict on digital history and culture on the web for future historians. The conflict is being fought on the battlefield, but it also takes place in cyberspace, and it has a tremendous influence on web culture and internet history.

We launched three crawls in 2022: the first on July 20, the second in September and the third in October. A fourth crawl was launched on March 16, 2023. In this blog post, we describe what has been done so far in creating a transnational collection documenting this important historical event.

On 24 February 2022, the armed forces of the Russian Federation invaded the territory of Ukraine, annexing certain regions and cities and carrying out a series of military strikes throughout the country, thus triggering the first large-scale war in Europe since WWII. The war on the territory of Ukraine has different phases[1] which can be summed up as follows:

  • 0: Prelude of the war (up to 23 February 2022), when Russian troops were building up near the borders of Ukraine;
  • 1: Initial invasion (24 February – 7 April 2022), when the Russian president Putin announced a ‘special military operation’ and Russian troops invaded Ukraine territory;
  • 2: Southeastern front (8 April – 11 September 2022);
    • This is the phase during which we began archiving websites as part of the CDG collection.
  • 3: Ukrainian counter offensives (12 September – 9 November 2022);
  • 4: Second stalemate (10 November 2022 – present, March 2023)

Since February 2022, the clashes between the Russian military and the Ukrainian army and population have had an unprecedented impact on the situation in the region and on international relations. The aim of this collaborative project is to collect web content related to this event in order to map the impact of this conflict on digital history and culture. Identification of seed websites and initial web crawling began in July 2022. The archived websites have been preserved in a special web collection hosted by Archive-It, where most of the sites are already available to view. The collection will be expanded with new content as the conflict evolves and new developments occur.

The curators included high priority subtopics in the call for nominations, such as general information on: military confrontations; consequences of the war on the civilian population in Ukraine; refugee crisis and international relief efforts in and outside Europe; political consequences; international relations; diaspora communities like Ukrainians around the world; human rights organizations; foreign embassies and diplomatic relations; sanctions imposed on Russia by foreign powers; consequences on energy and agri-food trade; and public opinion like blogs, protest sites, online writings of activists etc. Websites from countries all over the world and in all languages are in scope. Special attention has been devoted to websites which can be a source of internet culture, such as sites with internet memes.

Many institutions, as well as members of the public, responded to this call for contributions to document the conflict. No fewer than 1,137 proposals were received from members and 252 via the public nomination form, making 1,389 seeds in total. After cleaning up duplicates and invalid URLs, 1,358 seeds remained. All of these were crawled at least once between July 2022 and March 2023.
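A clean-up step of this kind could be sketched as follows. This is a minimal illustration and not the curators' actual workflow: the function names `normalise_seed` and `clean_seeds`, the normalisation rules and the sample URLs are all assumptions made for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise_seed(url):
    """Return a canonical form of a seed URL, or None if it is not a usable http(s) URL."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        return None  # malformed URL, e.g. a broken IPv6 host
    if parts.scheme.lower() not in ("http", "https") or not parts.netloc:
        return None
    # Lower-case scheme and host, drop the fragment, treat an empty path as "/".
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def clean_seeds(nominations):
    """Drop invalid URLs and duplicates, keeping the first-seen order."""
    seen = set()
    cleaned = []
    for raw in nominations:
        seed = normalise_seed(raw)
        if seed is None or seed in seen:
            continue
        seen.add(seed)
        cleaned.append(seed)
    return cleaned


# Hypothetical nominations, for illustration only.
sample = [
    "https://example.org/news/article-1",
    "HTTPS://Example.org/news/article-1#comments",  # duplicate after normalisation
    "not a url",                                     # invalid
    "http://example.com",
]
cleaned = clean_seeds(sample)
```

Real nomination lists would probably need more careful rules (for instance for trailing slashes or tracking parameters), which is why manual review remains part of the process.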

We launched the fourth crawl for the War in Ukraine web collection on 1,060 seeds in March. 303 new seeds were submitted between the last crawl in October and now. No fewer than 298 seeds have been deactivated since July 2022. These were pages which had not been updated since the last crawl or had gone offline. These “404 file not found” errors also show why our collection work is important, as some sites have already gone offline. In total, 22 new jobs have been launched. Of these, 19 crawls were done with the standard web crawler software and 3 with Brozzler.[2] Brozzler is a distributed web crawler that uses a real web browser (Chrome or Chromium) to fetch pages and embedded URLs and to extract links. It is especially valuable for its media capture capabilities. We had a total budget of 1 TB for the three crawls in 2022 and 500 GB for the fourth crawl in March 2023.
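The seed figures quoted so far appear to fit together as shown in the short check below. This is one reading of the numbers rather than something the post states: in particular, treating the 1,060 seeds of the fourth crawl as the cleaned total minus the deactivated seeds is an assumption, and the variable names are illustrative.

```python
member_nominations = 1137
public_nominations = 252
total_nominated = member_nominations + public_nominations   # 1,389 seeds nominated

seeds_after_cleanup = 1358
removed_in_cleanup = total_nominated - seeds_after_cleanup  # duplicates and invalid URLs

deactivated = 298  # not updated since the last crawl, or offline
active_for_fourth_crawl = seeds_after_cleanup - deactivated
```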

It is easy to see from the distribution of seeds by scope that this special web collection is the result of an event harvest. For most of the URLs, only one or two pages were selected for the crawl. These pages (mostly from news sites) contain important historical source information which might otherwise have been lost. Only 204 sites were selected to be fully crawled.

CDG_WarinUkraine_Mar22_01

Looking at the distribution of sites by website type, it is noticeable that a large proportion of the sites are news sites, NGOs and government websites. The role of blogs in internet culture has diminished in recent years, as is also visible in this collection.[3] In contrast, NGO websites contain more and more information worth preserving for historians of the future, as they document their activities to their donors.

CDG_WarinUkraine_Mar22_02

We see a language shift in the distribution since the first crawl took place. Most of the sites selected during the first crawl were published in international languages such as English, French and German. Now we see more websites written in national languages, such as Ukrainian (122), Russian (31) and Belarusian (5). The impact of the conflict on the rest of Central, Eastern and Southern Europe around Ukraine can be seen in the collection of sites in Hungarian (45), Czech (44), Serbian (42) and Slovak (23).

CDG_WarinUkraine_Mar22_03

Heritage and culture are among the arenas most heavily affected by war. We have all seen the images of looted museums and libraries and of books scattered on the streets. As Erasmus of Rotterdam wrote in 1515: “If the laws are already silent amid the clash of arms, how much more are not the virgin muses silent when the world is full of noise, turmoil, confusion due to those raging storms?”[4] It is therefore perhaps a hopeful fact that no fewer than 14 websites have been selected that contain poetry from or about Ukraine.

In conclusion, it is worth recalling the value of this initiative, which aims to keep track of the very heterogeneous content disseminated on the web about this tragic event. We know that the living web is an extremely fragile publication space where content is ephemeral and often difficult to find some time after its publication; content can also disappear for technical reasons or be deleted by its producers. The web now concentrates most of the publications of the major media and the press, as well as the reactions of the population, of institutions and of non-governmental organizations, not least through social media. Building a collection of web archives is therefore worth undertaking, even though it is necessarily fragmentary and incomplete, in order to provide some primary sources for future historians of this conflict.


[1] https://en.wikipedia.org/wiki/Timeline_of_the_2022_Russian_invasion_of_Ukraine

[2] https://github.com/internetarchive/brozzler

[3] P. de Bode, I. Geldermans, & K. Teszelszky. (2021). Web collection NL-blogosfeer. Zenodo. https://doi.org/10.5281/zenodo.4593479

[4] Letter of Erasmus to Raffaele Riario, London, 15 May 1515. https://www.dbnl.org/tekst/eras001corr04_01/eras001corr04_01_0039.php

Get Involved in Web Archiving Street Art

By CDG Street Art Collection Co-Leads Ricardo Basílio, Web curator, Arquivo.pt & Miranda Siler, Web Collection Librarian, Ivy Plus Libraries Confederation



Street art is ephemeral and so are the websites and web channels that document it. For this reason the IIPC’s Content Development Working Group is taking up the challenge of preserving web content related to street art. Some institutions already do this locally, but a representative web collection of street art with a global scope is lacking.

Street art can be found all over the world and reflects social, political and cultural attitudes. The Web has become the primary means of dissemination of these works beyond the street itself. Thus, we are asking for nominations for web content from different parts of the globe to be preserved in time and to serve for study and research in the future.

Mural. Author: Douglas Pereira (Bicicleta Sem Freio). Title: The Observatory. WOOL, Covilhã Urban Art, 2019 (Portugal). Photo credit: Ricardo Basílio.

What we want to collect

Stencil. Author: Adres, WOOL, Covilhã Urban Art, 2017 (Portugal). Photo credit: Ricardo Basílio.

This collaborative collection aims to collect web content related to street art as a social, political, and cultural manifestation that can be found all over the world.

The types of street art covered by this collection include but are not limited to:

  • Mural art
  • Graffiti
  • Stencil art
  • Fly-posting (gluing posters)
  • Stickering
  • Yarn-bombing
  • Mosaic

The collection will also include a number of different types of websites such as:

The list is not exhaustive and it is expected that contributing partners may wish to explore other sub-topics within their own areas of interest and expertise, providing that they are within the general collection development scope.

Out of scope

The following types of content are out of scope for the collection:

  • Because of data budget considerations, websites that are heavy with audio-visual content, such as YouTube, will be deprioritised.
  • Social media platforms such as Facebook, YouTube channels, Instagram and TikTok, which are labour-intensive to archive and unlikely to be captured successfully.
  • Content which is in the form of a private members’ forum, intranet or email (non-published material).
  • Content which may identify or dox street artists who wish to remain anonymous or known only by their tagger name.
  • Artist websites where the artist works primarily in mediums other than street art.

Media websites (tv/radio and online newspapers) will be selected in moderation, as generally this type of content is being archived elsewhere, although nominations at the level of the news article documenting specific debates around street art may be considered (as opposed to media landing pages or splash pages). Independent news sources devoted to street art specifically are welcome.

How to get involved

Once you have looked over the collection scope document and selected the web pages that you would like to see in the collection, it takes less than 2 minutes to fill in the submission form:

https://bit.ly/CDG-street-art-public-nominations

For the first crawl, the call for nominations will close on January 20, 2023. 

For more information and updates, you can contact the IIPC Content Development Working Group team at Collaborative-collections@iipc.simplelists.com or follow the collection hashtag on Twitter at #iipcCDG.

Resources

About IIPC collaborative collections

IIPC CDG updates on the IIPC Blog

Get Involved in Web Archiving the War in Ukraine 2022

By Kees Teszelszky, Curator Digital Collections, National Library of the Netherlands & Vladimir Tybin, Head of Digital Legal Deposit, National Library of France

On February 24, 2022, the armed forces of the Russian Federation invaded Ukrainian territory, annexing certain regions and cities and carrying out a series of military strikes throughout the country, thus triggering a war in Ukraine. Since then, the clashes between the Russian military and the Ukrainian population have had unprecedented repercussions on the situation of the civilian population and on international relations.

What we want to collect

This collaborative collection aims to collect web content related to this event in order to map the impact of this conflict on digital history and culture.

This collection will be built through the following themes: 

  • General information about the military confrontations
  • Consequences of the war on the civilian population
  • Refugee crisis and international relief efforts
  • Political consequences
  • International relations
  • Diaspora communities – Ukrainian people around the world 
  • Human rights organisations 
  • Foreign embassies and diplomatic relations
  • Sanctions imposed against Russia by foreign powers
  • Consequences on energy and agri-food trade
  • Public opinion: blogs/protest sites/activists

The list is not exhaustive and it is expected that contributing partners may wish to explore other sub-topics within their own areas of interest and expertise, providing that they are within the general collection development scope.

Out of scope

The following types of content are out of scope for the collection:

  • Data-intensive audio/video content (e.g. YouTube channels)
  • Social media platforms
  • Private member forums, intranets, or email (non-published material)
  • Content identifying vulnerable people and compromising their safety

How to get involved

Once you have selected the web pages that you would like to see in the collection, it takes less than 5 minutes to fill in the submission form:

https://bit.ly/Ukraine-2022-collection-public-nominations 

For the first crawl, the call for nominations will close on July 20, 2022.

For more information and updates, you can contact the IIPC Content Development Working Group team at Collaborative-collections@iipc.simplelists.com or follow the collection hashtag on Twitter at #iipcCDG.

Resources

About IIPC collaborative collections
IIPC CDG updates on the IIPC Blog

Celebrating the 2022 Winter Olympics and Paralympics Web Archive Collection

By Helena Byrne, Curator of Web Archives, British Library


The first IIPC collection focused just on the 2010 Winter Olympics in Vancouver. Since 2012, the IIPC has archived web content on both the Olympic and Paralympic Games. To date, the IIPC has archived seven Games. Beijing 2022 was also the 4th Winter Games collection.

Collection Name                                        Data        Docs
2014 Winter Olympics                                   1.6 TB      57,145,052
2014 Winter Paralympics                                1.3 TB      42,542,659
2016 Summer Olympics and Paralympics                   3.1 TB      18,205,981
2018 Winter Olympics and Paralympics                   1.2 TB      12,218,514
2020 Summer Olympics and Paralympics [held in 2021]    610.9 GB    6,923,179
2022 Winter Olympics and Paralympics                   361.1 GB    14,410,542

You can view the 2022 Winter Olympics and Paralympics here:

https://archive-it.org/collections/18422

In this final blog post on the IIPC Content Development Group (CDG) Beijing 2022 Olympic and Paralympic Games web archive collection, we look back at what content was crawled. 

Social media was excluded from the collection policy as these platforms update their code and design frequently and do not prioritise archivability. As a result they present ever-changing technical challenges to archivists, requiring more labour-intensive scoping with less satisfactory results. For that reason, we have not accepted nominations of content from Facebook, Instagram, Twitter, and other similar social media platforms.

What we collected

Crawl dates

There were five crawl dates for this collection. The collection period started in January and finished towards the end of March. A sixth crawl of 32 seeds was conducted on April 26, as these seeds had been missed in the first crawl. This issue was only noticed when preparing the metadata for publishing the collection.

  1. February 02, 2022 (308 seeds crawled)
  2. February 15, 2022 (264 seeds crawled)
  3. February 23, 2022 (65 seeds crawled)
  4. March 07, 2022 (29 seeds crawled)
  5. March 21, 2022 (198 seeds crawled)

We had a steady number of nominations for each crawl date. The only exception was the fourth crawl on March 7th with only 29 seeds crawled. This figure also includes a number of URLs that returned an error in the previous crawl. Nominations to this collection are done on a voluntary basis by members of the IIPC and the public from around the world. 

Countries covered

Athletes from 91 National Olympic Committees (NOCs) competed in Beijing 2022. Haiti and Saudi Arabia made their Winter Olympic debuts at this edition of the Games. 

We received nominations from 38 countries for the IIPC CDG 2022 Winter Olympics and Paralympics collection. Some of these countries might have only one or two websites nominated from them, and there are many more countries that competed in multiple events and have no content nominated. 

Languages covered

We have 24 languages in the collection including French (228 nominations), English (162 nominations) and Japanese (89 nominations). But many languages only have a few nominations and there are many other languages that haven’t been represented in the collection.

Data size 


We archived 863 of the 889 seeds that were nominated. These seeds include full websites, subsections of websites and individual web pages in multiple languages from around the world. The 26 nominated seeds that were not archived were social media accounts, so they were not added to the crawler. Roughly 54 seeds in total were flagged in the Archive-It crawl reports: these were URLs that the crawlers were unable to archive for technical reasons when they visited the seed. They were assessed and added to the next crawl in the series, with some additional techniques used to try to capture them. However, not all of these recrawl attempts were successful. Quality assurance was carried out on these 54 seeds, and 36 of them were set to private as they displayed no content or only error messages.

We archived 361.1 GB of data and 14,410,542 documents by the end of five crawl cycles. We had initially set aside 1 TB for this collection, but as we were not archiving any social media content and had implemented a size cap on all seeds, we did not use as much data as expected.

We used the following policy when setting the scope of the crawl:

  1. Full seed host or directory (Example: team or athlete website)
    • These seeds will be capped at 3 GB
  2. Crawl one page only (Example: news article)
    • These seeds will be capped at 1 GB 
  3. Seed page plus 1 click of all links on seed page (Example: news page linking to multiple articles)
    • These seeds will be capped at 2 GB
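The scoping policy above can be expressed as a small lookup table. The sketch below is only an illustration of the rule, not the Archive-It configuration itself: the key names, the `Seed` class, the sample URLs and the choice of binary gigabytes are all assumptions.

```python
from dataclasses import dataclass

GB = 1024 ** 3  # assumption: binary gigabytes; the policy does not say which unit is meant

# Scope types and caps as listed in the policy; the key names are made up for this sketch.
SIZE_CAPS_GB = {
    "full_host_or_directory": 3,  # e.g. team or athlete website
    "one_page": 1,                # e.g. news article
    "page_plus_one_click": 2,     # e.g. news page linking to multiple articles
}


@dataclass(frozen=True)
class Seed:
    url: str
    scope: str


def size_cap_bytes(seed):
    """Return the data cap for a seed, in bytes, based on its scope type."""
    try:
        return SIZE_CAPS_GB[seed.scope] * GB
    except KeyError:
        raise ValueError(f"unknown scope type: {seed.scope!r}") from None


# Hypothetical seeds, one per scope type.
seeds = [
    Seed("https://example.org/team/", "full_host_or_directory"),
    Seed("https://example.org/news/article-1", "one_page"),
    Seed("https://example.org/news/", "page_plus_one_click"),
]
# Upper bound on the data these three seeds could use if every cap were reached.
worst_case_gb = sum(size_cap_bytes(s) for s in seeds) / GB
```

Summing the caps in this way gives a worst-case estimate only; in practice most seeds stay well below their cap.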

In the 2018 Winter Games collection, we collected 1,413 seeds and used 1.2 TB of data with 12,218,514 documents. However, if we just compare the URL nominations, the 2022 and 2018 collection are quite similar excluding the 557 social media URLs tagged as Blogs & Social Media from the 2018 total. 

Related blog posts

Get Involved in Web Archiving the Winter Games – Beijing 2022 

Steeze (Style & Ease) on the Slopes – Web Archiving Beijing 2022

Resources

About IIPC collaborative collections

IIPC CDG updates on the IIPC Blog

The Summer and Winter Olympics and Paralympics Collections in Archive-It

The Summer and Winter Olympics and Paralympics Collections 2010-2020 poster

Despite not collecting social media content, we did promote the call for nominations for this collection on social media channels (mostly Twitter) with the collection hashtag #WAGames2022.

For more information and updates on Content Development Group activities, you can contact the IIPC CDG team at Collaborative-collections@iipc.simplelists.com

Steeze (Style & Ease) on the Slopes – Web Archiving Beijing 2022

By Helena Byrne, Curator Web Archives, British Library


The Winter Olympics may be over but the IIPC Content Development Group Beijing 2022 collection is still running until March 20th, 2022. The Winter Paralympics got underway on March 4th and we would love to see your nominations for this edition of the Games. 

In our first blog post Get Involved in Web Archiving the Winter Games – Beijing 2022, we outlined details of what and how to nominate. Once you have selected the web pages you would like to see in the collection, it takes less than 5 minutes to fill in the submission form:

https://bit.ly/CDG-2022Games-collection-public-nominations 

What we have collected so far

We have archived 616 nominated seeds so far. These nominations include full websites, subsections of websites and individual web pages in multiple languages from around the world. We had archived 280.3 GB of data and 13,197,402 documents by the end of three crawl cycles.

Screenshot of total data archived on Beijing 2022 collection. Figures in paragraph above.

Social media policy

Social media was excluded from the collection policy as these platforms update their code and design frequently and do not prioritise archivability. As a result they present ever-changing technical challenges to archivists, requiring more labour-intensive scoping with less satisfactory results. For that reason, we have not accepted nominations of content from Facebook, Instagram, Twitter, and other similar social media platforms.

Map of the world

Qualified athletes from 91 National Olympic Committees (NOCs) competed in Beijing 2022. Only 29 of these NOCs received a medal. Athletes competed across 109 events over 15 disciplines in seven sports. Haiti and Saudi Arabia made their Winter Olympic debuts at this edition of the Games. So far, nominations have been received for the IIPC CDG 2022 Winter Olympics and Paralympics collection from 27 countries. Some of these countries might have only one or two websites nominated from them, and there are many more countries that competed and have no content nominated such as Austria (106 athletes), Sweden (116 athletes) and Ukraine (45 athletes).

Languages covered

We have 24 languages in the collection including French (200 nominations), Japanese (91 nominations) and English (79 nominations). But many languages only have a few nominations and there are many other languages that haven’t been represented in the collection. We would like to see more nominations in multiple languages, especially Chinese (4 nominations) and Russian (1 nomination).

How to get involved

Once you have selected the web pages you would like to see in the collection, add them to the submission form: https://bit.ly/CDG-2022Games-collection-public-nominations

If you know anyone who may be interested in contributing to this collection, please share the link with them! The call for nominations will close on March 20, 2022.

For more information and updates, you can contact the IIPC CDG team Collaborative-collections@iipc.simplelists.com or follow the collection hashtag #WAGames2022.

Resources

Get Involved in Web Archiving the Winter Games – Beijing 2022

By Helena Byrne, Curator Web Archives, British Library

It’s that time of year again when we all become winter sports experts while watching the Winter Olympic and Paralympic Games on TV. There are even helpful guides online to all the slang used on the slopes like steeze (style and ease), pow (fresh ski powder) and sendy (to ride it with full vigour).

The ongoing pandemic has caused havoc on sporting schedules since the start, but the Beijing 2022 Winter Games are going ahead as scheduled. This means that the IIPC Content Development Group (CDG) will again be collecting web content related to the Games from across the world in multiple languages.

10 years of collecting the Games

2020 marked ten years of archiving the Games. The first collection focused just on the 2010 Winter Olympics in Vancouver. But since 2012 the IIPC has archived web content on both the Olympic and Paralympic Games. To date the IIPC has archived six Games. Beijing 2022 will be the 4th Winter Games collection.

Helena Byrne: Going for Gold: Web Archiving the Olympics & Paralympics Games, poster presented at IIPC Web Archiving Conference 2021.

Previous CDG Olympic and Paralympic collections have focused on events both on and off the field/slopes. Key themes have included doping, corruption and Zika Virus as well as Covid-19 in the 2020 Summer Games. This year will be no different as Covid-19 is still a big issue and the human rights issues in China have meant that some nations like the USA, UK and Australia will not be sending any diplomatic representatives.

What we want to collect

Public platforms in various formats such as:

  • Websites
  • Subsections of websites with an Olympic tag
  • Individual Articles
  • News Reports
  • Blogs
  • Audio visual content

Social media platforms update their code and design frequently and do not prioritize archivability, and as a result present ever-changing technical challenges to archivists, requiring more labour-intensive scoping with less satisfactory results. For that reason we will not be accepting nominations from Facebook, Instagram, or Twitter, nor from other similar social media platforms.

The subjects covered on these sites can include but are not limited to:

  • Athletes/Teams
  • Computer Games (eGames)
  • Covid-19
  • Diplomatic Relations with China
  • Doping/Cheating and Corruption
  • Environmental Issues
  • Fandom
  • Gender Issues (Ex. media coverage, sexual harassment etc.)
  • General News/ Commentary
  • Human Rights issues
  • Olympic/Paralympic Venues
  • Security
  • Sports Events
  • Other

How to get involved

Once you have selected the web pages you would like to see in the collection, it takes less than 5 minutes to fill in the submission form:
https://bit.ly/CDG-2022Games-collection-public-nominations 

The call for nominations will close on March 20, 2022.

For more information and updates you can contact the IIPC CDG team Collaborative-collections@iipc.simplelists.com or follow the collection hashtag #WAGames2022.

IIPC Collaborative collection: “Afghanistan regime change (2021) and the international response”

By Nicola Bingham, Lead Curator, Web Archiving, the British Library; Co-chair, IIPC Content Development Working Group

On 4th October 2021 the Content Development Group (CDG) initiated a thematic website collection in response to recent developments in Afghanistan at the behest of several CDG members.

Background

Recent events in Afghanistan have precipitated a humanitarian crisis which escalated markedly after foreign armed forces withdrew from the country in May 2021.[1] As US and Allied troops retreated, the Taliban quickly gained ground, seizing cities across the country, increasing threats of a worsening civil war. The Taliban have now claimed control of all major cities in Afghanistan, including the capital Kabul, where fighters have seized the presidential palace, forcing the president to flee. The Afghan government which was supported by the US and the Allies has collapsed and there has been a transition of power to the Taliban.

As violence intensifies across large areas of the country, civilians are being caught up in the fighting and hundreds of Afghans have been killed in recent weeks, while thousands have been forced to flee their homes.

The Department of Defense is committed to supporting the U.S. State Department in the departure of U.S. and allied civilian personnel from Afghanistan, and to evacuate Afghan allies safely. (U.S. Air Force photo by Staff Sgt. Brandon Cribelar)
Public domain, via Wikimedia Commons: https://commons.wikimedia.org/wiki/File:Operation_Allies_Refuge_210819-F-DT970-0064.jpg

The humanitarian crisis is obviously of great concern internationally; however, the cultural heritage of Afghanistan is also under threat. As described by Richard Ovenden in an article in the Financial Times (24th September 2021), the global Library and Archive community has been trying to do what it can, from concerted efforts to help Afghans working in the cultural heritage sector to leave the country, to supporting the preservation of cultural artefacts including digital materials.[2]

It is likely that the new regime will want to bring the Internet under greater censorship and control,[3] meaning web content and the information contained therein is at risk. Alongside the internal threat is the risk that foreign internet service providers, largely based in the US, could turn off cloud servers, social media platforms etc. if America decided to act on the threat to impose sanctions on Afghanistan.[4]

Existing collecting efforts

Afghanistan Web Archive at the Library of Congress: https://www.loc.gov/collections/afghanistan-web-archive/ 

Rapid response collecting of at risk Afghan Internet content has already been undertaken by several archiving institutions, alongside ongoing Afghanistan collections curated by the Library of Congress. Examples include:

The CDG does not wish to duplicate these efforts but rather to complement them by focussing on the international aspects of events in Afghanistan, documenting transnational involvement and worldwide interest in the process of the change of regime, recording how the situation evolves over time.

Content Scope

With this in mind, the Afghanistan collection has been scoped so that it adheres to the broader content development policy of the CDG, namely that it meets the following criteria:

  • It is of high interest to IIPC members;
  • It does not map to any one member’s responsibility or mandate;
  • It is of higher value to research because it represents more perspectives than similar collections in only one member archive would do;
  • It is transnational in scope, but not necessarily “global”.

Taliban/IS content

The aim of any CDG collection is to reflect multiple viewpoints and to preserve a snapshot of society as it was at the time of archiving. It will be important to researchers that websites from across the spectrum of all human activity are collected in order to present a more accurate picture of the times.

Websites produced by the Taliban or IS, or that are pro-Taliban/IS, can be included in the collection. Most Government websites will begin to express pro-Taliban views in any case.

The Taliban/IS are likely to have used communication networks that cannot be archived for technical reasons, e.g. Facebook/WhatsApp and so this type of content will be excluded.

Daniel Wilkinson (U.S. Department of State), Public domain, via Wikimedia Commons https://commons.wikimedia.org/wiki/File:Afghan_females_using_internet_in_Herat.jpg

Sub-topics
Sub-topics may include:

  • Military experience in Afghanistan; nations withdrawing armed forces from Afghanistan; statements of defence and military analysis
  • Analysis and policy of think tanks such as Chatham House (UK), Brookings Institute and RAND International Affairs (US), for example
  • Afghan refugees in Pakistan, Iran and elsewhere
  • International relief efforts (The Red Cross, United Nations etc.)
  • Diaspora communities – Afghan people around the world
  • Human rights/Women’s rights/LGBTQ+ rights
  • Foreign embassies and diplomatic relations
  • Sanctions imposed against Afghanistan by foreign powers
  • Transnational websites and social media (SoundCloud, Squarespace, Twitter, WordPress, YouTube, Facebook Group pages (not individual Facebook profiles)) about Afghanistan from any country and in any language.

The list is not exhaustive, and it is expected that contributors may wish to explore other sub-topics within their own areas of interest and expertise, providing they are within the general collection development scope.

Stakeholders

The lead curator for this collection is Nicola Bingham. She will be responsible for developing the content strategy, overseeing the progression of the collection, and promoting the collection to potential users.

IIPC members, together with a wide range of stakeholders in the Library and Archive community, including staff at the Bodleian Libraries, Oxford, as well as members of the public, are expected to contribute to the collection (see below for details about how to contribute).

Crawls are being undertaken in Archive-It by Janko Klasinc (National and University Library, Slovenia) and Carlos Lelkes-Rarugal (Assistant Web Archivist, British Library).

Size of collection

The CDG’s full budget in 2021 is 4 TB, of which 1.8 TB has been used through the end of September. The CDG plan to undertake small crawls for our ongoing and new collections as follows:

  • 2020 Summer Olympics and Paralympics [held in 2021]
  • Novel Coronavirus (COVID-19)
  • Intergovernmental Organizations
  • National Olympic and Paralympic Committees.

At this stage, c. 400 GB of data has been allocated to the Afghanistan collection.

Co-Curation

Nominations will be sought from IIPC Members and external agencies such as the UK Legal Deposit Libraries, University Libraries and the Library and Archive community. A Google form will be sent out to elicit nominations from non-IIPC members and members of the public. This form contains the relevant metadata fields which will populate a Google sheet. The aim of distributing the work of co-curation for the collection is to enable a diverse range of communities and individuals to contribute, including members of the Afghan community, helping to ensure that the collection is as representative as possible.

IIPC Members will be able to add their nominations directly to a Google sheet which will be reviewed by the lead curator against collection scope and marked for inclusion in the collection.

Access

Access to the collection will be through the Internet Archive’s Archive-It interface. Metadata will be exposed as facets on the collection home page and will be browseable by users.

How to contribute:

  1. Please read the Collection Scoping Document. This goes into more detail about what is in and out of scope.
  2. If you are an IIPC member, please nominate URLs and add basic metadata to this Google Sheet
  3. If you are not an IIPC member you may contribute nominations and a small amount of basic metadata on this Google form.

References

1 Kiely, E. and Farley, R. Timeline of U.S. Withdrawal from Afghanistan. August 17, 2021. FactCheck.org
https://www.factcheck.org/2021/08/timeline-of-u-s-withdrawal-from-afghanistan/

2 Ovenden, R. The Battle for Afghanistan’s libraries. September 24, 2021. Financial Times. https://www.ft.com/content/82fffcc8-3631-48dc-829d-44f237549a59

3 Afghanistan’s Internet: who has control of what? Goman Web. September 20, 2021. https://gomanweb.net/2021/09/20/afghanistans-internet-who-has-control-of-what/

Digital oppression in Afghanistan. NordVPN Blog. August 20, 2021. https://nordvpn.com/blog/digital-oppression-in-afghanistan

Baibhawi, R. Taliban Shuts Internet In Panjshir To Stop Northern Alliance From Galvanizing Support. August 29, 2021. Republic. https://www.republicworld.com/world-news/rest-of-the-world-news/taliban-shuts-internet-in-panjshir-to-stop-northern-alliance-from-galvanizing-support.html

Vavra, S. and Falzone, D. This Is Why the Taliban Keeps F*cking Up the Internet. September 16, 2021. Daily Beast. https://www.thedailybeast.com/this-is-why-the-taliban-keeps-fcking-up-afghanistans-internet

Sorkin, A. R., Karaian, J., Kessler, S., Gandel, S., Hirsch, L., Livni, E. and Schaverien, A. Big Tech and the Taliban. August 19, 2021. The New York Times. https://www.nytimes.com/2021/08/19/business/dealbook/taliban-social-media.html

4 Stokel-Walker, C. The battle for control of Afghanistan’s internet. September 7, 2021. Wired. https://www.wired.co.uk/article/afghanistan-taliban-internet

5 Gomes, P. Automated seed selection to preserve Afghan sites (Arquivo.pt). IIPC Curating Special Collections Workshop, September 24, 2021. https://youtu.be/Aa_-BBnEr8I

Rapid Response Twitter Collecting at NLNZ

By Gillian Lee, Coordinator, Web Archives at the Alexander Turnbull Library, National Library of New Zealand (NLNZ)

This blog post has been adapted from an IIPC RSS webinar held in August where presenters shared their social media web archiving projects. Thanks to everyone who participated and for your feedback. It’s always encouraging to see the projects colleagues are working on.

Collecting content when you only have a short window of opportunity

The National Library responds quickly to collecting web content when unexpected events occur. Our focus in the past was to collect websites, and this worked well for us using Web Curator Tool. Collecting social media, however, was much more difficult. We tried capturing social media using different web archiving tools, but none of them produced satisfactory results.

The Preservation, Research and Consultancy (PRC) team include programmers and web technicians. They thought running Twitter crawls using the public Twitter API could be a good solution for capturing Twitter content. It has enabled us to capture commentary about significant New Zealand events and we’ve been running these Twitter crawls since late 2016.

One such event was the Christchurch Mosque shootings, which took place on 15 March 2019. This terrorist attack by a lone gunman at two mosques in Christchurch, in which 51 people were killed, was the deadliest mass shooting in modern New Zealand history. The image you see here by Shaun Yeo was created in response to the tragic events and was shared widely via social media.

Shaun Yeo: Crying Kiwi
Crying Kiwi. Ref: DCDL-0038997. Alexander Turnbull Library, Wellington, New Zealand. http://natlib.govt.nz/records/42144570   (used with permission)

While the web archivists focussed on collecting web content relating to the attacks, and the IIPC community assisted us by providing links to international commentary for us to crawl using Archive-It, the PRC web technician was busy getting the Twitter harvest underway. He needed to work quickly because there were only a few days’ leeway to pick up Tweets using Twitter’s public API.

Search Criteria

Our web technician checked Twitter and found a wide range of hashtags and search terms that we needed to use to collect the Tweets.

Hashtags: ‘#ChristchurchMosqueShooting’ ‘#ChristchurchMosqueShootings’ ‘#ChristchurchMosqueAttack’ ‘#ChristchurchTerrorAttack’ ‘#ChristchurchTerroristAttack’ ‘#KiaKahaChristchurch’ ‘#NewZealandMosqueShooting’ ‘#NewZealandShooting’ ‘#NewZealandTerroristAttack’ ‘#NewZealandMosqueAttacks’ ‘#PrayForChristchurch’ ‘#ThisIsNotNewZealand’ ‘#ThisIsNotUs’ ‘#TheyAreUs’

Keywords: ‘zealand AND (gun OR ban OR bans OR automatic OR assault OR weapon OR weapons OR rifle OR military)’ ‘zealand AND (terrorist OR Terrorism OR terror)’ ‘zealand AND mass AND shooting’ ‘Christchurch AND mosque’ ‘Auckland AND vigil’ ‘Wellington AND vigil’
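A long hashtag list like the one above usually has to be split across several search queries. The following is a minimal sketch of how that could be done, not the script our web technician used: it packs terms into OR-joined query strings that stay under a length limit. The function name and the `max_len` default are assumptions for illustration, and the real limit should be checked against the search API in use. Compound keyword expressions containing AND would also need to be wrapped in parentheses before being joined this way.

```python
def build_queries(terms, max_len=500):
    """Pack search terms into OR-joined query strings no longer than max_len.

    max_len is an assumed limit; a single term longer than max_len is
    still emitted on its own rather than dropped.
    """
    queries = []
    current = ""
    for term in terms:
        candidate = term if not current else f"{current} OR {term}"
        if len(candidate) > max_len and current:
            # The current query is full: store it and start a new one.
            queries.append(current)
            current = term
        else:
            current = candidate
    if current:
        queries.append(current)
    return queries


hashtags = [
    "#ChristchurchMosqueShooting",
    "#ChristchurchTerrorAttack",
    "#KiaKahaChristchurch",
    "#TheyAreUs",
]

# A deliberately small limit so the split is visible in the output.
for query in build_queries(hashtags, max_len=60):
    print(query)
```

With a realistic limit, the fourteen hashtags listed above would fit into one or two queries.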

The Dataset

The Twitter crawl ran from 15-29 March 2019. We captured 3.2 million tweets in JSON files. We also collected 30,000 media files that were found in the tweets and we crawled 27,000 seeds referenced in the tweets. The dataset in total was around 108 GB in size.

Collecting the Twitter content

We used Twarc to capture the Tweets and we also used some in-house scripts that enabled us to merge and deduplicate the Tweets each time a crawl was run. The original set for each crawl was kept in case anything went wrong during the deduping process or if we needed to change search parameters.
We also used scripts to capture the media files referenced in the Tweets and a harvest was run using Heritrix to pick up webpages. These webpage URLs were run through a URL unshortening service prior to crawling to ensure we were collecting the original URL, and not a tiny URL that might become invalid within a few months. We felt that the Tweet text without accompanying images, media files and links might lose its context. We were also thinking about long term preservation of content that will be an important part of New Zealand’s documentary heritage.
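The merge and deduplication step can be illustrated with a short sketch. This is not the in-house script itself. It assumes each crawl is stored as line-delimited JSON (one Tweet per line) and that each Tweet carries an `id_str` field, as in Twitter's v1.1 JSON; the function name is hypothetical.

```python
import json


def merge_crawls(crawls):
    """Merge several crawls, keeping the first copy of each Tweet.

    Each crawl is an iterable of JSON lines, for example an open file.
    Tweets are identified by their "id_str" field (an assumption about
    the data format).
    """
    seen = set()
    merged = []
    for lines in crawls:
        for line in lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            if tweet["id_str"] in seen:
                continue  # already captured by an earlier crawl
            seen.add(tweet["id_str"])
            merged.append(tweet)
    return merged


crawl_1 = [
    '{"id_str": "1", "full_text": "Kia kaha"}',
    '{"id_str": "2", "full_text": "They are us"}',
]
crawl_2 = [
    '{"id_str": "2", "full_text": "They are us"}',
    '{"id_str": "3", "full_text": "Aroha"}',
]

merged = merge_crawls([crawl_1, crawl_2])
print([tweet["id_str"] for tweet in merged])  # ['1', '2', '3']
```

Because the input crawls are only read, the original set for each crawl stays untouched and the merge can be repeated if search parameters change.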

Access copies

We created three access copies that provide different views of the dataset, namely Tweet IDs and hashed and non-hashed text files. This enables the Library to restrict access to content where necessary.

Tweet IDs

Tweet IDs (system numbers) will be available to the public online. When you rehydrate the Tweet IDs online, you only receive back the Tweets that are still publicly available – not any of the Tweets that have since been deleted.
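The effect of rehydration can be shown with a small sketch. The rehydration itself would be done with a tool such as Twarc and needs API credentials, so it is not shown here; the list of returned IDs below is a hypothetical result. The function simply compares the archived IDs with the IDs that came back.

```python
def split_by_availability(archived_ids, hydrated_ids):
    """Split archived Tweet IDs into those still public and those not returned."""
    returned = set(hydrated_ids)
    still_public = [i for i in archived_ids if i in returned]
    unavailable = [i for i in archived_ids if i not in returned]
    return still_public, unavailable


archived = ["1001", "1002", "1003", "1004"]
hydrated = ["1001", "1004"]  # hypothetical IDs returned by a rehydration run

public, gone = split_by_availability(archived, hydrated)
print(public)  # ['1001', '1004']
print(gone)    # ['1002', '1003']
```

Tweets that are not returned may have been deleted, or the account may have been protected or suspended; the IDs alone do not say which.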

Hashed and non-hashed access copies

In 2018, Twitter released a series of election integrity datasets (https://transparency.twitter.com/en/information-operations.html), which contained access copies of Tweets. We have used their structure and format as a precedent for our own reading room copies. These provide access to all Tweets and the majority of their metadata, but with all identifying user details obfuscated by hashed values. You can see in the table below the Tweet ID highlighted in yellow, the user display name in red and its corresponding system number (instead of an actual name) and the tweet text highlighted in blue.

The non-hashed copy provides the actual names and full URL rather than system numbers.
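As an illustration of the hashed access copy, the sketch below replaces identifying user fields with salted SHA-256 digests while leaving the Tweet ID and text intact. It is a sketch under assumptions, not the Library's implementation: the choice of fields, the hashing scheme and the salt value are all hypothetical, and the field names follow Twitter's v1.1 JSON.

```python
import copy
import hashlib

# Assumed identifying fields inside the Tweet's "user" object.
USER_FIELDS = ("id_str", "screen_name", "name")


def hash_value(value, salt):
    """Return a salted SHA-256 hex digest of a string."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def hashed_copy(tweet, salt):
    """Return a deep copy of the Tweet with identifying user fields hashed."""
    result = copy.deepcopy(tweet)
    user = result.get("user", {})
    for field in USER_FIELDS:
        if user.get(field) is not None:
            user[field] = hash_value(str(user[field]), salt)
    return result


tweet = {
    "id_str": "1001",
    "full_text": "Kia kaha Christchurch",
    "user": {"id_str": "42", "screen_name": "example_user", "name": "Example User"},
}

access = hashed_copy(tweet, salt="hypothetical-secret")
print(access["id_str"], access["full_text"])
print(access["user"]["screen_name"])  # a 64-character hex digest
```

The same user always maps to the same hash, so Tweets by one account can still be grouped without revealing who the account belongs to. User names mentioned inside the Tweet text are not touched by this sketch and would need separate handling.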

Shaping the SIP for ingest
National Digital Heritage Archive (NDHA)

We have had some technical challenges ingesting Twitter files into the National Digital Heritage Archive (NDHA). Some files were too large to ingest using Indigo, which is a tool the web and digital archivists use to deposit content into the NDHA, so we have had to use another tool called the SIP Factory, which enables the ingest of large files to the NDHA. This is being carried out by the PRC team.

We’ve shaped the SIPs (submission information packages) according to the files listed below and have chosen to use file naming conventions for each event. We thought it would be helpful to create a readme file that shows some of the provenance and technical details of the dataset. Some of this information will be added to the descriptive record, but we felt that a readme file might include more information and it will remain with the dataset.

chch_terror_attack_2019_twitter_tweet_IDs
chch_terror_attack_2019_Twitter_access_copy
chch_terror_attack_2019_Twitter_access_copy_hashed
chch_terror_attack_2019_Twitter_crawl
chch_terror_attack_2019_twitter_readme
chch_terror_attack_2019_twitter_media_files
chch_terror_attack_2019_twitter_warc_files

Description of the dataset

Even though the tweets are published, we have decided to describe them in Tiaki, our archival content management system. This is because we’re effectively creating the dataset and our archival system works better for describing this kind of content than our published catalogue does.
NLNZ, Tiaki archival content management system

Research interest in the dataset

A PhD student was keen to view the dataset as a possible research topic. This was a great opportunity to see what we could provide and the level of assistance that might be required.

Due to the sensitivity of the dataset and the fact that it wasn’t in our archive yet, we liaised with the Library’s Access and Use Committee around what data the library was comfortable to provide. The decision was that the data should only come from Tweets that were still available online in this initial stage while the researcher was still determining the scope of her research study.

The Tweet IDs were put in Dropbox for the researcher to download. There were several complicating factors that meant she was unable to rehydrate the Tweet IDs, so we did what we could to assist her.

We determined that the researcher simply wanted to get a sense of what was in the dataset, so we extracted a random sample of 2000 Tweets. This sample included only original Tweets (no retweets) and had been rehydrated to remove any deleted Tweets. The data included were Tweet time, user location, likes, retweets, Tweet language and the Tweet text. She was pleased with what we were able to provide, because it gave her some idea of what was in the dataset even though it was a very small subset of the dataset itself.
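Drawing such a sample can be sketched as follows. This is an illustration under assumptions rather than the script we ran: it treats any Tweet with a `retweeted_status` field as a retweet and uses field names from Twitter's v1.1 JSON (`created_at`, `lang`, `favorite_count`, `retweet_count`, `full_text`, `user.location`). The function names are hypothetical.

```python
import random

FIELDS = ("created_at", "lang", "favorite_count", "retweet_count", "full_text")


def sample_original_tweets(tweets, k=2000, seed=None):
    """Draw up to k Tweets at random, skipping retweets."""
    originals = [t for t in tweets if "retweeted_status" not in t]
    rng = random.Random(seed)
    return rng.sample(originals, min(k, len(originals)))


def to_row(tweet):
    """Reduce a Tweet to the handful of fields shared with the researcher."""
    row = {field: tweet.get(field) for field in FIELDS}
    row["user_location"] = tweet.get("user", {}).get("location")
    return row


# Ten made-up Tweets; every third one is marked as a retweet.
tweets = []
for n in range(10):
    tweet = {"id_str": str(n), "full_text": f"tweet {n}", "lang": "en",
             "user": {"location": "Wellington"}}
    if n % 3 == 0:
        tweet["retweeted_status"] = {"id_str": "0"}
    tweets.append(tweet)

sample = sample_original_tweets(tweets, k=3, seed=1)
rows = [to_row(t) for t in sample]
print(len(rows))  # 3
```

Passing a fixed `seed` makes the sample reproducible, which helps if the same subset has to be regenerated later.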

Unfortunately, the research project has been put on hold due to Covid-19. If the research project does go ahead, we will need to work with the University to see what level of support they can provide the researcher and what kind of support we will need to provide.

The Danish Coronavirus web collection – Coronavirus on the curators’ minds

By Sabine Schostag, Web Curator, The Royal Danish Library

Introduction – a provoking cartoon

In a sense, the story of Corona and the national Danish Web Archive (Netarchive) starts at the end of January 2020 – about 6 weeks before Corona came to Denmark. A cartoon by Niels Bo Bojesen in the Danish newspaper “Jyllands-Posten” (2020-01-26) showing the Chinese flag with a circle of yellow coronaviruses instead of the stars caused indignation in China and captured attention worldwide. We focused on collecting reactions on different social media and in the international news media. Particularly on Twitter, a seething discussion arose with vehement comments and memes about Denmark.

From epidemic to pandemic

After that, the curators again focused on the daily routines in web archiving, as we believed that Corona (Covid-19) was a closed chapter in Netarchive’s history. But this was not the case. When the IIPC Content Development Working Group launched the Covid-19 collection in February, the Royal Danish Library contributed the Danish seeds.

Suddenly, the coronavirus arrived in Europe and the first infected Dane came home from a skiing trip in Italy. The epidemic turned into a pandemic. On March 12, the Danish Government decided to lock down the country: all public employees were sent to their home offices and borders were closed. Not only did the public sector shut down; trade and industry, shops, restaurants, bars etc. had to close too. Only supermarkets were still open and people in the health care sector had to work overtime.

While Denmark came to a standstill, so to speak, the Netarchive curators worked at full throttle on the coronavirus event collection. Zoom became the most important work tool for the following 2½ months. In daily Zoom meetings, we coordinated who worked on which facet of this collection. To put it briefly, we curators had coronavirus on our minds.

Event crawls in Netarchive

The Danish Web Archive crawls all Danish news media at frequencies ranging from several times a day to once a week, so there is no need to include news articles in an event crawl. Thus, with an event crawl we focus on increased activity on social media, blog articles, new sites emerging in connection with the event – and reactions in news media outside Denmark.

Coronavirus documentation in Denmark

The Danish Web collection on coronavirus in Denmark is part of a general documentation on the corona lockdown in Denmark in 2020. This documentation is a cooperation between several cultural institutions, the National Archives (Rigsarkivet), the National Museum (Nationalmuseet), the Workers Museum (Arbejdermuseet), local archives and, last but not least, the Royal Danish Library. The corona lockdown documentation was supposed to be done in two steps: the “here and now” collection of documentation during the corona lockdown and a more systematic follow-up by collecting materials from authorities and public bodies.

“Days with Corona” – a call for help

All Danes were asked to contribute to the corona lockdown documentation, for instance by sending photos and narratives from their daily life under the lockdown. “Days with Corona” is the title of this part of the documentation of the Danish Folklore Archives run by the National Museum and the Royal Library.

Netarchive also asked the public for help by nominating URLs of web pages related to coronavirus, social media profiles, hashtags, memes and any other relevant material.

Help from colleagues

Web archiving is part of the Department for Digital Cultural Heritage at the Royal Library. Almost all colleagues from the department were able to continue with their everyday work from their home offices. Many colleagues from other departments were not able to do so. Some of them helped the Netarchive team by nominating URLs, as this event crawl could keep curators busy more than 7½ hours a day. We used a Google spreadsheet for all nominations (fig. 1).

Fig. 1 Nomination sheet for curators and colleagues from other departments and a call for contributions.

The Queen’s 80th birthday

On April 16, Queen Margrethe II celebrated her 80th birthday. One of the first things she did after the Corona lockdown, on March 13, was to cancel all her birthday celebration events. In a way, she set a good example, as everybody was asked not to meet with more than ten people; ideally, we should socialize only with members of our own household.

As part of the Corona event crawl, we collected web activity related to the Queen’s birthday, which mainly consisted of reactions on social media.

The big challenge – capturing social media

Knowledge of the coronavirus Covid-19 changes continuously. Consequently, authorities, public bodies, private institutions, and companies frequently change the information and precaution rules on their webpages. We try to capture as many of these changes as possible. Companies and private individuals offering safety gear for protection against the virus were another facet of the collection. However, capturing all relevant activity on social media was much more challenging than capturing the frequent updates on traditional web pages. Most social media platforms use technologies that Heritrix (used by Netarchive for event crawling) is not able to capture.

Fig. 2 The Queen’s speech to the Danes on how to cope with the corona crisis. This was the second time in history (the first time was during World War II) that a Royal Head of State addressed the nation outside the annual New Year’s Eve speech.

More or less successfully, we tried to capture content from Facebook, TikTok, Twitter, YouTube, Instagram, Reddit, Imgur, Soundcloud, and Pinterest. Twitter is the platform we are able to crawl with Heritrix with rather good results. We collect Facebook profiles with an account at Archive-It, as they have a better set of tools for capturing Facebook. With frequent Quality Assurance and follow-ups, we also get rather good results from Instagram, TikTok and Reddit. We capture YouTube videos by crawling the watch-URLs with a specific configuration using youtube-dl. One of the collected YouTube videos comes from the Royal family’s YouTube channel: the Queen’s address to the people on how to behave to prevent or limit the spreading of the coronavirus (https://www.youtube.com/watch?v=TZKVUQ-E-UI, Fig. 2).

As Heritrix has problems with dynamic web content and streaming, we also used Webrecorder.io, although we have not yet implemented this tool in our harvesting setup. However, captures with Webrecorder.io are only drops in the ocean. The use of Webrecorder.io is manual: a curator clicks on all the elements on a page we want to capture. An example is a page on the BBC website, with a video of the reopening of Danish primary schools after the total lockdown (https://www.bbc.com/news/av/world-europe-52649919/coronavirus-inside-a-reopened-primary-school-in-the-time-of-covid-19, Fig. 3). There is still an issue with ingesting the resulting WARC files from Webrecorder.io in our web archive.

Danes produced a range of podcasts on coronavirus issues. We crawled the podcasts we had identified. We get good results when we have a URL to an RSS feed, which we crawl with XML extraction.
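The idea behind extracting podcast episodes from an RSS feed can be sketched in a few lines. This is not Netarchive's Heritrix configuration; it only shows, with Python's standard library, how the audio file URLs sit in the `<enclosure>` elements of an RSS 2.0 feed. The feed content and the function name are made up for illustration.

```python
import xml.etree.ElementTree as ET


def enclosure_urls(rss_xml):
    """Return the URLs found in <enclosure> elements of an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    return [
        enclosure.get("url")
        for enclosure in root.iter("enclosure")
        if enclosure.get("url")
    ]


# A made-up feed with two episodes.
feed = """
<rss version="2.0">
  <channel>
    <title>Example podcast</title>
    <item>
      <title>Episode 1</title>
      <enclosure url="https://example.org/audio/ep1.mp3" type="audio/mpeg" length="1000"/>
    </item>
    <item>
      <title>Episode 2</title>
      <enclosure url="https://example.org/audio/ep2.mp3" type="audio/mpeg" length="2000"/>
    </item>
  </channel>
</rss>
"""

for url in enclosure_urls(feed.strip()):
    print(url)
```

The resulting URLs could then be added as seeds so that the audio files are harvested alongside the feed itself.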

Fig. 3 Crawled with Webrecorder.io to get the video.

Capture as much as possible – a broad crawl

Netarchive runs up to four broad crawls a year. We launched our first broad crawl for 2020 right at the beginning of the Danish Corona lockdown – on March 14. A broad crawl is an in-depth snapshot of all dk-domains and all other Top Level Domains (TLDs) where we have identified Danish content. A side benefit of this broad crawl might be getting Corona-related content into the archive – content which the curators do not find with their different methods. We identify content both with classic keyword searches and with a variety of link scraping tools.
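A link scraper combined with a keyword filter can be sketched as follows. This is a simplified illustration, not one of the tools the curators actually use: it collects the links on a page with Python's standard HTML parser and keeps those whose URL contains a keyword. The keyword list, class name and function name are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

KEYWORDS = ("corona", "covid")  # hypothetical keyword list


class LinkCollector(HTMLParser):
    """Collect the absolute URLs of all <a href="..."> links on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def candidate_seeds(html, base_url, keywords=KEYWORDS):
    """Return links whose URL contains at least one of the keywords."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return [url for url in parser.links
            if any(keyword in url.lower() for keyword in keywords)]


page = """
<a href="/nyheder/corona-status">Status</a>
<a href="https://example.dk/sport">Sport</a>
<a href="covid19.html">Covid-19</a>
"""

for url in candidate_seeds(page, "https://example.dk/side/"):
    print(url)
```

Matching on the URL alone misses pages whose address does not mention the topic, which is one reason a broad crawl is a useful complement to curated nominations.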

Is the coronavirus related web collection of any value to anybody?

In accordance with the Danish personal data protection law, the public has no access to the archived web material. Only researchers affiliated with Danish research institutions can apply for access in connection with specific research projects. We have already received an application for one research project dealing with values in the Covid-19 communication. We hope that our collection will inspire more research projects.

Covid-19 Collecting at the National Library of New Zealand

By Gillian Lee, Coordinator, Web Archives at the Alexander Turnbull Library, National Library of New Zealand

The National Library of New Zealand reflects on their rapid response collecting of Covid-19 related websites since February 2020.

Collecting in response to the pandemic

Web Archivists at the National Library of New Zealand are used to collecting websites relating to major events, but the Covid-19 pandemic has had such a global impact that it has affected every member of society. It has been heartbreaking to see the tragic loss of life and economic hardships that people are facing world-wide. The effects of this pandemic will be with us for a long time.

Collecting content relating to these events always produces mixed emotions as a web archivist. There’s the tension between collecting content before it disappears, and in that regard, we put on our hard hats and get on with it. At the same time however, these events are raw and personal to each one of us and the websites we’ve collected reflect that.

IIPC Collaborative Collection

When the IIPC put out a call to contribute to the Novel Coronavirus Outbreak Collaborative Collection, we got involved. Initially New Zealand sources were commenting on what was happening internationally, so URLs identified were mainly news stories, until our first reported case of Coronavirus occurred in February and then we started to see New Zealand websites created in response to Covid-19 here. We continued to contribute seed URLs to the IIPC collection, but our focus necessarily switched to the selective harvesting we undertake for the National Library’s collections.

Lockdown

The New Zealand government instituted a four-level alert system on March 21 and we quickly moved to level 4 lockdown on March 24. The lockdown lasted a month, before gradually moving down to level 1 on June 8.

The rapidly changing alert levels were reflected in the constantly changing webpages online. It seemed that most websites we regularly harvest had content relating to Covid-19. Our selective web harvesting team focussed on identifying websites that had significant Covid-19 content or were created to cover Covid-19 events during our rapid response collecting phase. Even then it was difficult to capture all changes on a website as they responded to the different alert levels.

We were working from home during this time and connected to Web Curator Tool through our work computers. The harvesting was consistent, but our internet connections were not always stable, so we often got thrown out of the system! If we had technical issues with any particular website harvest, by the time we resolved it, the pages online had sometimes shifted to another alert level! We also used Web Recorder and Archive-It for some of our web harvests.

Due to the enormous amount of Covid-19 content being generated and because we are a very small team (along with the challenges of working from home), what we collected could really only be a very selective representation.

Unite against Covid-19 – Unite for the Recovery

Unite Against Covid-19 harvested 18 March 2020.

One prominent website captured during this time was the government website ‘Unite Against Covid-19’ which was the go-to place for anyone wanting to know what the current rules were. This website was updated constantly, sometimes several times a day.

When we entered alert level one the website changed to “Unite for the Recovery.” We expect to be collecting this site for some time. While we have completed our rapid response phase we will be continuing to collect Covid-19 related material as part of our regular harvesting.

Unite for the Recovery harvested 9 June 2020.

Economic Impact
Apart from official government websites, we captured websites that reflected the economic impact on our society, such as event cancellations and business closures. We documented how some businesses responded to the pandemic, by changing production lines from clothing to making face masks and from alcohol production to making hand sanitiser. New products like respirators and PPE (personal protective equipment) gear were also being produced. Tourism is a major industry in New Zealand and with border lockdowns still in place, advertising is now targeting New Zealanders. There is talk about extending this to a “Trans-Tasman” bubble to include Australia and possibly some Pacific Islands in the near future.

Social impact
As in many countries, community responses during lockdown provided both unique and shared experiences. New Zealanders were able to walk locally (with social distancing) so people put bears and other soft toys in the windows for kids (and adults) to count as they walked by. The daily televised 1pm Covid-19 updates from Prime Minister Jacinda Ardern and Director General of Health, Dr Ashley Bloomfield, during lockdown were compulsive viewing and generated memorabilia such as T-shirts, bags and coasters. These were all reflected in the websites we collected. We also harvested personal blogs such as ‘lockdown diaries’.

Web archiving and beyond
During this rapid collecting phase, the web archivists focussed on collecting websites, and that’s reflected in this blog post. There was also a significant amount of content we wanted to collect from social media such as memes, digital posters and podcasts, New Zealand social commentary on Twitter and email from businesses and associations. This has required considerable effort from the Library’s Digital Collecting and Legal Deposit teams. You can find out more about this in an earlier National Library blog post by our Senior Digital Archivist Valerie Love. We are also working with our GLAM sector colleagues and donors to continue to build these collections.