Celebrating the 2022 Winter Olympics and Paralympics Web Archive Collection

By Helena Byrne, Curator of Web Archives, British Library

The first IIPC collection covered only the 2010 Winter Olympics in Vancouver. Since 2012, the IIPC has archived web content on both the Olympic and Paralympic Games. To date, the IIPC has archived seven Games, and Beijing 2022 is the fourth Winter Games collection.

Collection name | Data archived | Documents
2014 Winter Olympics | 1.6 TB | 57,145,052
2014 Winter Paralympics | 1.3 TB | 42,542,659
2016 Summer Olympics and Paralympics | 3.1 TB | 18,205,981
2018 Winter Olympics and Paralympics | 1.2 TB | 12,218,514
2020 Summer Olympics and Paralympics [held in 2021] | 610.9 GB | 6,923,179
2022 Winter Olympics and Paralympics | 361.1 GB | 14,410,542

You can view the 2022 Winter Olympics and Paralympics collection here:

https://archive-it.org/collections/18422

In this final blog post on the IIPC Content Development Group (CDG) Beijing 2022 Olympic and Paralympic Games web archive collection, we look back at what content was crawled. 

Social media was excluded from the collection policy as these platforms update their code and design frequently and do not prioritise archivability. As a result they present ever-changing technical challenges to archivists, requiring more labour-intensive scoping with less satisfactory results. For that reason, we have not accepted nominations of content from Facebook, Instagram, Twitter, and other similar social media platforms.

What we collected

Crawl dates

There were five crawl dates for this collection. The collection period started in January and finished towards the end of March. An additional, sixth crawl of 32 seeds was run on April 26 because these seeds had been missed in the first crawl. This issue was only noticed when preparing the metadata for publishing the collection.

  1. February 02, 2022 (308 seeds crawled)
  2. February 15, 2022 (264 seeds crawled)
  3. February 23, 2022 (65 seeds crawled)
  4. March 07, 2022 (29 seeds crawled)
  5. March 21, 2022 (198 seeds crawled)

We had a steady number of nominations for each crawl date. The only exception was the fourth crawl on March 7, with only 29 seeds crawled. This figure also includes a number of URLs that returned an error in the previous crawl. Nominations to this collection are made on a voluntary basis by members of the IIPC and the public from around the world.

Countries covered

Athletes from 91 National Olympic Committees (NOCs) competed in Beijing 2022. Haiti and Saudi Arabia made their Winter Olympic debuts at this edition of the Games. 

We received nominations from 38 countries for the IIPC CDG 2022 Winter Olympics and Paralympics collection. Some of these countries may have only one or two nominated websites, and many more countries that competed in multiple events have no content nominated.

Languages covered

We have 24 languages in the collection, including French (228 nominations), English (162 nominations) and Japanese (89 nominations). However, many languages have only a few nominations, and many other languages are not represented in the collection.

Data size 

We archived 863 of the 889 seeds that were nominated. These seeds include full websites, subsections of websites and individual web pages in multiple languages from around the world. The 26 nominated seeds that were not archived were social media accounts, so they were not added to the crawler. Roughly 54 seeds in total were flagged in the Archive-It crawl reports. These were URLs that, for technical reasons, the crawlers were unable to archive when they visited the seed. These seeds were assessed and added to the next crawl in the series, with some additional techniques used to try to capture them. However, not all of these recrawl attempts were successful. Quality assurance was carried out on these 54 seeds, and 36 of them were set to private as they displayed no content or just error messages.
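The seed figures in this paragraph can be cross-checked with a short Python sketch. It uses only the numbers quoted in this post. The variable names are illustrative, and the count of flagged seeds that stayed public is an assumption derived by subtraction, not a figure taken from the crawl reports.

```python
# Figures quoted in this post.
nominated_seeds = 889
archived_seeds = 863
flagged_in_reports = 54   # "roughly 54" in the post, so treat as approximate
set_to_private = 36

# Nominated seeds that were never crawled (the social media accounts).
not_archived = nominated_seeds - archived_seeds

# Assumption: flagged seeds not set to private remained publicly viewable.
flagged_still_public = flagged_in_reports - set_to_private

# Share of nominated seeds that were archived, as a percentage.
archived_share = round(100 * archived_seeds / nominated_seeds, 1)

print(f"Not archived: {not_archived}")
print(f"Flagged but still public (derived): {flagged_still_public}")
print(f"Archived share of nominations: {archived_share}%")
```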

We archived 361.1 GB of data and 14,410,542 documents by the end of five crawl cycles. We had initially set aside 1 TB for this collection, but as we weren’t archiving any social media content and had implemented a size cap on all seeds, we did not use as much data as expected.

We used the following policy when setting the scope of the crawl:

  1. Full seed host or directory (Example: team or athlete website)
    • These seeds will be capped at 3 GB
  2. Crawl one page only (Example: news article)
    • These seeds will be capped at 1 GB 
  3. Seed page plus 1 click of all links on seed page (Example: news page linking to multiple articles)
    • These seeds will be capped at 2 GB
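The scoping policy above could be written down as a small lookup table. The Python sketch below is only an illustration of the three rules and their caps. It is not Archive-It's configuration format or API, the names `ScopeRule`, `SCOPE_RULES`, `cap_bytes` and `within_cap` are hypothetical, and it assumes decimal gigabytes.

```python
from dataclasses import dataclass

GB = 10**9  # assumption: 1 GB = 1,000,000,000 bytes


@dataclass(frozen=True)
class ScopeRule:
    """One scoping rule from the policy: what is crawled and its size cap."""
    description: str
    example: str
    cap_gb: int


# Hypothetical keys; descriptions, examples and caps come from the policy above.
SCOPE_RULES = {
    "full_host_or_directory": ScopeRule(
        "Full seed host or directory", "team or athlete website", 3),
    "one_page": ScopeRule(
        "Crawl one page only", "news article", 1),
    "one_page_plus_links": ScopeRule(
        "Seed page plus 1 click of all links on seed page",
        "news page linking to multiple articles", 2),
}


def cap_bytes(scope: str) -> int:
    """Return the size cap for a scope type, in bytes."""
    return SCOPE_RULES[scope].cap_gb * GB


def within_cap(scope: str, crawled_bytes: int) -> bool:
    """Check whether a seed's crawled data fits under its scope's cap."""
    return crawled_bytes <= cap_bytes(scope)


for key, rule in SCOPE_RULES.items():
    print(f"{key}: {rule.description} (cap {rule.cap_gb} GB)")
```

For example, `within_cap("one_page", 2 * GB)` returns `False`, because a single-page seed is capped at 1 GB.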

In the 2018 Winter Games collection, we collected 1,413 seeds and used 1.2 TB of data with 12,218,514 documents. However, if we compare just the URL nominations, the 2022 and 2018 collections are quite similar once the 557 URLs tagged as Blogs & Social Media are excluded from the 2018 total.
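The comparison can be made concrete with a little arithmetic. This Python sketch uses the figures quoted in this post. It assumes that the 863 archived seeds are the right 2022 figure to set against the 2018 total, which the post does not state explicitly.

```python
# 2018 figures quoted in this post.
seeds_2018 = 1413
social_media_2018 = 557  # URLs tagged as Blogs & Social Media

# 2022 figure: seeds archived (assumed to be the comparable number).
seeds_2022 = 863

# 2018 seeds left once social media URLs are excluded.
comparable_2018 = seeds_2018 - social_media_2018

# How far apart the two collections are on that basis.
difference = seeds_2022 - comparable_2018

print(f"2018 seeds excluding social media: {comparable_2018}")
print(f"2022 seeds archived: {seeds_2022}")
print(f"Difference: {difference}")
```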

Related blog posts

Get Involved in Web Archiving the Winter Games – Beijing 2022 

Steeze (Style & Ease) on the Slopes – Web Archiving Beijing 2022

Resources

About IIPC collaborative collections

IIPC CDG updates on the IIPC Blog

The Summer and Winter Olympics and Paralympics Collections in Archive-It

The Summer and Winter Olympics and Paralympics Collections 2010-2020 poster

Despite not collecting social media content, we did promote the call for nominations for this collection on social media channels (mostly Twitter) with the collection hashtag #WAGames2022.

For more information and updates on Content Development Group activities, you can contact the IIPC CDG team at Collaborative-collections@iipc.simplelists.com
