One year on: an update on the War in Ukraine CDG collection

By the lead curators: Anaïs Crinière-Boizet, Digital Curator (National Library of France), Kees Teszelszky, Curator Digital Collections (National Library of the Netherlands) & Vladimir Tybin, Head of Digital Legal Deposit (National Library of France).


This month, the IIPC Content Development Working Group (CDG) launched a new web crawl to archive web content related to the war in Ukraine, based on suggestions by curators, web archivists and members of the public worldwide. The aim of this effort is to map the impact of this conflict on digital history and culture on the web for future historians. The conflict is being fought on the battlefield, but it also takes place in cyberspace, and it has a tremendous influence on web culture and internet history.

We launched three crawls in 2022: the first on July 20, the second in September and the third in October. A fourth crawl was launched on March 16, 2023. In this blog post, we describe what has been done so far in creating a transnational collection documenting this important historical event.

On 24 February 2022, the armed forces of the Russian Federation invaded the territory of Ukraine, annexing certain regions and cities and carrying out a series of military strikes throughout the country, thus triggering the first large-scale war in Europe since WWII. The war has gone through several phases[1], which can be summed up as follows:

  • 0: Prelude of the war (up to 23 February 2022), when Russian troops were building up near the borders of Ukraine;
  • 1: Initial invasion (24 February – 7 April 2022), when the Russian president Putin announced a ‘special military operation’ and Russian troops invaded Ukrainian territory;
  • 2: Southeastern front (8 April – 11 September 2022);
    • This is the phase during which we began archiving websites as part of the CDG collection.
  • 3: Ukrainian counteroffensives (12 September – 9 November 2022);
  • 4: Second stalemate (10 November 2022 – present, March 2023).

Since February 2022, the clashes between the Russian military and the Ukrainian army and population have had an unprecedented impact on the situation in the region and on international relations. The aim of this collaborative project is to collect web content related to this event in order to map the impact of this conflict on digital history and culture. Identification of seed websites and initial web crawling began in July 2022. The archived websites have been preserved in a special web collection hosted by Archive-It, where most of the sites are already available to view. The collection will be expanded with new content as the conflict evolves and new developments occur.

The curators included high priority subtopics in the call for nominations, such as general information on: military confrontations; consequences of the war on the civilian population in Ukraine; refugee crisis and international relief efforts in and outside Europe; political consequences; international relations; diaspora communities, such as Ukrainians around the world; human rights organizations; foreign embassies and diplomatic relations; sanctions imposed on Russia by foreign powers; consequences on energy and agri-food trade; and expressions of public opinion, such as blogs, protest sites and online writings of activists. Websites from countries all over the world and in all languages are in scope. Special attention has been devoted to websites which can be a source of internet culture, such as sites with internet memes.

Many institutions, as well as members of the public, responded to this call for contributions to document the conflict. No fewer than 1,137 proposals were received from members and 252 via the public nomination form, making 1,389 seeds in total. After cleaning up duplicates and invalid URLs, 1,358 seeds remained. All of these were crawled at least once between July 2022 and March 2023.
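The clean-up of duplicates and invalid URLs mentioned above can be illustrated with a minimal sketch. This is not the curators' actual workflow, which the post does not describe; the function names `normalize_seed` and `clean_seeds`, the normalization rules (lowercased host, dropped fragment) and the example URLs are all assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_seed(url):
    """Return a normalized http(s) URL, or None if the input is not usable.

    Hypothetical rules: trim whitespace, lowercase scheme and host,
    drop the fragment and any userinfo, and use "/" for an empty path.
    """
    try:
        parts = urlsplit(url.strip())
        scheme = parts.scheme.lower()
        if scheme not in ("http", "https") or not parts.hostname:
            return None
        host = parts.hostname.lower()
        if parts.port:  # raises ValueError for a malformed port
            host = f"{host}:{parts.port}"
    except ValueError:
        return None
    path = parts.path or "/"
    return urlunsplit((scheme, host, path, parts.query, ""))


def clean_seeds(nominations):
    """Drop invalid URLs and duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for raw in nominations:
        seed = normalize_seed(raw)
        if seed is None or seed in seen:
            continue
        seen.add(seed)
        cleaned.append(seed)
    return cleaned


if __name__ == "__main__":
    # Illustrative nominations only; these are not seeds from the collection.
    nominations = [
        "https://Example.org/war",
        "https://example.org/war#top",
        "ftp://example.org/file",
        "not a url",
    ]
    print(clean_seeds(nominations))
```

In practice, curators may also want to treat `http` and `https` variants or trailing slashes as duplicates; the sketch deliberately leaves those decisions out.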

We launched the fourth crawl for the War in Ukraine web collection on 1,060 seeds in March. A total of 303 new seeds were submitted between the last crawl in October and now, and 298 seeds have been deactivated since July 2022. The deactivated seeds were pages which had not been updated since the last crawl or had gone offline. The “404 file not found” errors also show why our collection work is important, as some sites have already disappeared. In total, 22 new jobs have been launched. Of these, 19 crawls were done with the standard web crawler software and 3 with Brozzler.[2] Brozzler is a distributed web crawler that uses a real web browser (Chrome or Chromium) to fetch pages and embedded URLs and to extract links. It is especially valuable for its media capture capabilities. We had a total budget of 1 TB for the three crawls in 2022 and 500 GB for the fourth crawl in March 2023.
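The seed figures quoted in this post appear to reconcile as follows. The short tally below is only a reading of the published numbers, assuming that the 298 deactivated seeds are subtracted from the 1,358 seeds that remained after clean-up; the variable names are ours.

```python
# Figures taken from the text of this post.
member_nominations = 1137
public_nominations = 252
seeds_after_cleanup = 1358
deactivated_seeds = 298
standard_crawler_jobs = 19
brozzler_jobs = 3

# Derived values (assumption: deactivations apply to the cleaned seed list).
total_nominated = member_nominations + public_nominations
removed_in_cleanup = total_nominated - seeds_after_cleanup
active_seeds = seeds_after_cleanup - deactivated_seeds
total_jobs = standard_crawler_jobs + brozzler_jobs

print(f"Nominated seeds:      {total_nominated}")
print(f"Removed in clean-up:  {removed_in_cleanup}")
print(f"Active in 4th crawl:  {active_seeds}")
print(f"Crawl jobs launched:  {total_jobs}")
```

Under this reading, 31 nominations were removed as duplicates or invalid URLs, and the 1,060 active seeds match the number used for the fourth crawl.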

It is easy to see from the distribution of seeds by scope that this special web collection is the result of an event harvest. For most of the URLs, only one or two pages were selected for the crawl. These pages (mostly from news sites) contain important historical source information which might otherwise have been lost. Only 204 sites were selected to be fully crawled.

[Figure: distribution of seeds by scope]

Looking at the distribution of sites by website type, it is noticeable that a large proportion of the sites are news sites, NGO sites and government websites. The role of blogs in internet culture has diminished in recent years, as is also visible in this collection.[3] In contrast, NGO websites contain more and more information worth preserving for historians of the future, as they document their activities for their donors.

[Figure: distribution of sites by website type]

We see a shift in the distribution of languages since the first crawl took place. Most of the sites selected during the first crawl were published in international languages such as English, French and German. Now we see more websites written in national languages, such as Ukrainian (122), Russian (31) and Belarusian (5). The impact of the conflict on the rest of Central, Eastern and Southern Europe around Ukraine can be seen in the collection of sites in Hungarian (45), Czech (44), Serbian (42) and Slovak (23).

[Figure: distribution of sites by language]

Heritage and culture are among the areas most heavily affected by war. We have all seen the images of looted museums and libraries and of books scattered on the streets. As Erasmus of Rotterdam wrote in 1515: “If the laws are already silent amid the clash of arms, how much more are not the virgin muses silent when the world is full of noise, turmoil, confusion due to those raging storms?”[4] It is therefore perhaps a hopeful fact that no fewer than 14 websites have been selected that contain poetry from or about Ukraine.

In conclusion, it is worth recalling the value of this initiative, which aims to keep track of the very heterogeneous content disseminated on the web about this tragic event. We know that the living web is an extremely fragile publication space where content is ephemeral and often difficult to find some time after its publication; content can also disappear for technical reasons or be deleted by its producers. At a time when the web concentrates most of the publications of the major media and the press, as well as the reactions of the population and of institutional and non-governmental organizations, and in the age of social media networks, building a collection of web archives is a worthwhile undertaking, even if the result is necessarily fragmentary and incomplete: it provides some primary sources for future historians of this conflict.


[1] https://en.wikipedia.org/wiki/Timeline_of_the_2022_Russian_invasion_of_Ukraine

[2] https://github.com/internetarchive/brozzler

[3] P. de Bode, I. Geldermans, & K. Teszelszky. (2021). Web collection NL-blogosfeer. Zenodo. https://doi.org/10.5281/zenodo.4593479

[4] Letter of Erasmus to Raffaele Riario, London, 15 May 1515. https://www.dbnl.org/tekst/eras001corr04_01/eras001corr04_01_0039.php
