IIPC Collaborative collection: “Afghanistan regime change (2021) and the international response”

By Nicola Bingham, Lead Curator, Web Archiving, the British Library; Co-chair, IIPC Content Development Working Group

On 4th October 2021 the Content Development Group (CDG) initiated a thematic website collection in response to recent developments in Afghanistan at the behest of several CDG members.

Background

Recent events in Afghanistan have precipitated a humanitarian crisis which escalated markedly after foreign armed forces withdrew from the country in May 2021.[1] As US and Allied troops retreated, the Taliban quickly gained ground, seizing cities across the country and raising the threat of a worsening civil war. The Taliban have now claimed control of all major cities in Afghanistan, including the capital Kabul, where fighters have seized the presidential palace, forcing the president to flee. The Afghan government, which was supported by the US and the Allies, has collapsed and there has been a transition of power to the Taliban.

As violence intensifies across large areas of the country, civilians are being caught up in the fighting and hundreds of Afghans have been killed in recent weeks, while thousands have been forced to flee their homes.

The Department of Defense is committed to supporting the U.S. State Department in the departure of U.S. and allied civilian personnel from Afghanistan, and to evacuate Afghan allies safely. (U.S. Air Force photo by Staff Sgt. Brandon Cribelar)
Public domain, via Wikimedia Commons: https://commons.wikimedia.org/wiki/File:Operation_Allies_Refuge_210819-F-DT970-0064.jpg

The humanitarian crisis is of great concern internationally, but the cultural heritage of Afghanistan is also under threat. As described by Richard Ovenden in an article in the Financial Times (24th September 2021), the global Library and Archive community has been trying to do what it can, from concerted efforts to help Afghans working in the cultural heritage sector to leave the country, to supporting the preservation of cultural artefacts including digital materials.[2]

It is likely that the new regime will want to bring the Internet under greater censorship and control,[3] meaning that web content and the information it contains are at risk. Alongside this internal threat is the risk that foreign internet service providers, largely based in the US, could turn off cloud servers, social media platforms and similar services if America decided to act on the threat to impose sanctions on Afghanistan.[4]

Existing collecting efforts

Afghanistan Web Archive at the Library of Congress: https://www.loc.gov/collections/afghanistan-web-archive/ 

Rapid response collecting of at risk Afghan Internet content has already been undertaken by several archiving institutions, alongside ongoing Afghanistan collections curated by the Library of Congress. Examples include:

The CDG does not wish to duplicate these efforts but rather to complement them by focussing on the international aspects of events in Afghanistan, documenting transnational involvement and worldwide interest in the process of the change of regime, recording how the situation evolves over time.

Content Scope

With this in mind, the Afghanistan collection has been scoped to adhere to the broader content development policy of the CDG, which requires that a collection meets the following criteria:

  • It is of high interest to IIPC members;
  • It does not map to any one member’s responsibility or mandate;
  • It is of higher value to research because it represents more perspectives than similar collections in only one member archive would do;
  • It is transnational in scope, but not necessarily “global”.

Taliban/IS content

The aim of any CDG collection is to reflect multiple viewpoints and to preserve a snapshot of society as it was at the time of archiving. It will be important to researchers that websites from across the spectrum of all human activity are collected in order to present a more accurate picture of the times.

Websites produced by the Taliban or IS, or that are pro-Taliban/IS, can be included in the collection. Most Government websites will begin to express pro-Taliban views in any case.

The Taliban/IS are likely to have used communication networks that cannot be archived for technical reasons, e.g. Facebook/WhatsApp, and so this type of content will be excluded.

Daniel Wilkinson (U.S. Department of State), Public domain, via Wikimedia Commons https://commons.wikimedia.org/wiki/File:Afghan_females_using_internet_in_Herat.jpg

Sub-topics
Sub-topics may include:

  • Military experience in Afghanistan; nations withdrawing armed forces from Afghanistan; statements of defence and military analysis
  • Analysis and policy of think tanks such as Chatham House (UK), Brookings Institute and RAND International Affairs (US), for example
  • Afghan refugees in Pakistan, Iran and elsewhere
  • International relief efforts (The Red Cross, United Nations etc.)
  • Diaspora communities – Afghan people around the world
  • Human rights/Women’s rights/LGBTQ+ rights
  • Foreign embassies and diplomatic relations
  • Sanctions imposed against Afghanistan by foreign powers
  • Transnational websites and social media (SoundCloud, Squarespace, Twitter, WordPress, YouTube, and Facebook Group pages, but not individual Facebook profiles) about Afghanistan from any country and in any language.

The list is not exhaustive, and it is expected that contributors may wish to explore other sub-topics within their own areas of interest and expertise, providing they are within the general collection development scope.

Stakeholders

The lead curator for this collection is Nicola Bingham. She will be responsible for developing the content strategy, overseeing the progression of the collection, and promoting the collection to potential users.

IIPC members, together with a wide range of stakeholders in the Library and Archive community (including staff at the Bodleian Libraries, Oxford) and members of the public, are expected to contribute to the collection (see below for details about how to contribute).

Crawls are being undertaken in Archive-It by Janko Klasinc (National and University Library, Slovenia) and Carlos Lelkes-Rarugal (Assistant Web Archivist, British Library).

Size of collection

The CDG’s full data budget in 2021 is 4 TB, of which 1.8 TB had been used by the end of September. The CDG plans to undertake small crawls for its ongoing and new collections as follows:

  • 2020 Summer Olympics and Paralympics [held in 2021]
  • Novel Coronavirus (COVID-19)
  • Intergovernmental Organizations
  • National Olympic and Paralympic Committees.

At this stage, c. 400 GB of data has been allocated to the Afghanistan collection.
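
The allocation above can be checked with some simple arithmetic. The sketch below is illustrative only: it assumes decimal units (1 TB = 1,000 GB), which the post does not specify, and the function name is hypothetical rather than part of any CDG or Archive-It tooling. It shows how much of the 2021 budget would remain once the Afghanistan allocation is set aside.

```python
GB_PER_TB = 1000  # assumption: decimal units; the post does not say which convention applies


def remaining_budget_gb(total_tb: float, used_tb: float, allocated_gb: int) -> int:
    """Return the data budget (in GB) left after existing usage and a new allocation."""
    total_gb = round(total_tb * GB_PER_TB)
    used_gb = round(used_tb * GB_PER_TB)
    return total_gb - used_gb - allocated_gb


# Figures quoted in this post: 4 TB budget, 1.8 TB used, c. 400 GB allocated.
left_gb = remaining_budget_gb(total_tb=4, used_tb=1.8, allocated_gb=400)
print(f"Left for the other planned crawls: about {left_gb} GB")
```

Under these assumptions, roughly 1.8 TB would remain for the four other collections listed above.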

Co-Curation

Nominations will be sought from IIPC Members and external agencies such as the UK Legal Deposit Libraries, University Libraries and the Library and Archive community. A Google form will be sent out to elicit nominations from non-IIPC members and members of the public. This form contains the relevant metadata fields which will populate a Google sheet. The aim of distributing the work of co-curation for the collection is to enable a diverse range of communities and individuals to contribute, including members of the Afghanistan community, helping to ensure that the collection is as representative as possible.

IIPC Members will be able to add their nominations directly to a Google sheet which will be reviewed by the lead curator against collection scope and marked for inclusion in the collection.
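
As a rough illustration of the workflow described above, the sketch below models nominations arriving from the two routes (the members’ sheet and the public form), a review step by the lead curator, and the resulting seed list. The class, field and function names are hypothetical; they are not the actual metadata fields used in the CDG’s Google form or sheet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Nomination:
    url: str
    source: str  # hypothetical labels: "member_sheet" or "public_form"
    in_scope: Optional[bool] = None  # None until the lead curator has reviewed it


def review(nomination: Nomination, curator_decision: bool) -> Nomination:
    """Record the lead curator's scope decision on a nomination."""
    nomination.in_scope = curator_decision
    return nomination


def seeds_for_crawl(nominations: list[Nomination]) -> list[str]:
    """Return URLs marked for inclusion, dropping duplicates but keeping order."""
    seen: set[str] = set()
    seeds: list[str] = []
    for n in nominations:
        if n.in_scope and n.url not in seen:
            seen.add(n.url)
            seeds.append(n.url)
    return seeds
```

Nominations that have not yet been reviewed, or that were judged out of scope, are simply left out of the seed list; the same URL nominated through both routes is only crawled once.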

Access

Access to the collection will be through the Internet Archive’s Archive-It interface. Metadata will be exposed as facets on the collection home page and will be browseable by users.

How to contribute:

  1. Please read the Collection Scoping Document. This goes into more detail about what is in and out of scope.
  2. If you are an IIPC member, please nominate URLs and add basic metadata to this Google Sheet
  3. If you are not an IIPC member you may contribute nominations and a small amount of basic metadata on this Google form.

References

1 Kiely, E. and Farley, R. Timeline of U.S. Withdrawal from Afghanistan. August 17, 2021. FactCheck.org
https://www.factcheck.org/2021/08/timeline-of-u-s-withdrawal-from-afghanistan/

2 Ovenden, R. The Battle for Afghanistan’s libraries. September 24, 2021. Financial Times. https://www.ft.com/content/82fffcc8-3631-48dc-829d-44f237549a59

3 Afghanistan’s Internet: who has control of what? Goman Web. September 20, 2021. https://gomanweb.net/2021/09/20/afghanistans-internet-who-has-control-of-what/

Digital oppression in Afghanistan. NordVPN Blog. August 20, 2021. https://nordvpn.com/blog/digital-oppression-in-afghanistan

Baibhawi, R. Taliban Shuts Internet In Panjshir To Stop Northern Alliance From Galvanizing Support. August 29, 2021. Republic. https://www.republicworld.com/world-news/rest-of-the-world-news/taliban-shuts-internet-in-panjshir-to-stop-northern-alliance-from-galvanizing-support.html

Vavra, S. and Falzone, D. This Is Why the Taliban Keeps F*cking Up the Internet. September 16, 2021. Daily Beast. https://www.thedailybeast.com/this-is-why-the-taliban-keeps-fcking-up-afghanistans-internet

Sorkin, A. R., Karaian, J., Kessler, S., Gandel, S., Hirsch, L., Livni, E. and Schaverien, A. Big Tech and the Taliban. August 19, 2021. The New York Times. https://www.nytimes.com/2021/08/19/business/dealbook/taliban-social-media.html

4 Stokel-Walker, C. The battle for control of Afghanistan’s internet. September 7, 2021. Wired. https://www.wired.co.uk/article/afghanistan-taliban-internet

5 Gomes, P. Automated seed selection to preserve Afghan sites (Arquivo.pt). IIPC Curating Special Collections Workshop, September 24, 2021. https://youtu.be/Aa_-BBnEr8I

IIPC Content Development Group’s activities 2019-2020

By Nicola Bingham, Lead Curator Web Archives, British Library and Co-Chair of the IIPC Content Development Working Group

Introduction

I was delighted to present an update on the Content Development Group’s (CDG) activities at the 2020 IIPC General Assembly (GA) on behalf of myself, Alex and the curators that have worked so hard on collaborative collections over the past year.

Socks, not contributing to Web Archiving

Although it was disappointing not to have been in Montreal for the GA and Web Archiving Conference (WAC), there are many advantages to attending a conference remotely. Apart from cost and time savings, it meant that many more staff members from our organisations could attend. I liked the fact that I could see many “old” web archiving friends online and it did feel like the same friendly, enthusiastic, innovative environment that is normally fostered at IIPC events. I was also delighted to see some of the attendees’ pets on screen, although it did highlight that other people’s cats are generally much more affectionate than my own, who has, I have to say, contributed little to the field of web archiving over the years, although he did show a mild interest in Warcat.

Several things become clear when tasked with pre-recording a presentation with a time limit of 2 to 3 minutes. Firstly, it is extremely difficult to fit everything you need to say into such a short space of time; secondly, what you do want to say must be tightly scripted, although this does have the advantage that there is no room for the pauses or “errs” that can sometimes pepper my in-person presentations. Thirdly, recording even a two-minute video calls for a surprising number of retakes, taking many hours for no apparent reason. Fourthly, naively explaining these facts to the Programme and Communications Officer leads quite seamlessly to the suggestion of writing a blog post in order that one can be more expansive on the points bulleted in the two-minute presentation…

CDG Collection Update

Since our last General Assembly in Zagreb, in June 2019, the CDG has continued working on several established collections and two new ones:

  • The International Cooperation Organizations Collection was initiated in 2015 and is led by Alex Thurman of Columbia University Libraries. It previously consisted of all known active websites in the .int top-level domain (available only to organizations created by treaties), but was expanded to include a large group of similar organizations with .org domain hosts, and renamed Intergovernmental Organizations this year. This increased the collection from 163 to 403 intergovernmental organizations, all of which will continue to be crawled each year.
  • The National Olympic and Paralympic Committees collection, led by Helena Byrne of the British Library, was initiated in 2016 and consists of websites of national Olympic and Paralympic committees and associations, as identified from the listings of these groups found on the official sites http://www.olympic.org and http://www.paralympic.org.
  • Online News Around the World is led by Sabine Schostag of the Royal Danish Library. This collection of seeds was first crawled in October 2018 to document a selection of online news from as many countries as possible. It was crawled again in November 2019. The collection was promoted at the Third RESAW Conference, “The web that was: archives, traces, reflections” in Amsterdam in June 2019 and at the IFLA News Media Conference at Universidad Nacional Autónoma de México, Mexico City in March 2020.
  • New in 2019, the CDG undertook a Climate Change Collection, led by Kees Teszelszky of the National Library of the Netherlands. The first crawl took place in June, with a final crawl shortly after the UN Climate summit in September 2019.
  • New in 2019, a collection on Artificial Intelligence was undertaken between May and December, led by Tiiu Daniel (National Library of Estonia), Liisi Esse (Stanford University Libraries) and Rashi Joshi (Library of Congress).

Coronavirus (Covid-19) Collection

The main collecting activity in 2020 has been around the Covid-19 global pandemic. This has involved a huge effort by IIPC members, with contributions from over 30 members as well as public nominations from over 100 individuals/institutions.

We have been very careful with scoping rules so that we are able to collect a diverse range of content within the data budget – and Archive-It generously increased the data limit for this collection to 5TB. Collecting will continue to run, budget permitting, while the event is of global significance.

Publicly available CDG collections can be viewed on the Archive-It website: https://archive-it.org/home/IIPC. An overview of the collection statistics can be seen below.

CDG Collection statistics. Figures correct as of 15th June 2020. Slide presented at IIPC GA 17th June 2020.

Researcher-use of Collections

The CDG has worked closely with the Research Working Group co-chairs to promote and facilitate use of the CDG collections, which are now available through the Archives Unleashed Cloud thanks to the Archives Unleashed project. The collections have been analysed and a large number of derivatives are available to researchers at IIPC-led events and/or research projects. For more information about how to access these collections please refer to the guidelines.

Next Steps/Getting in touch

We would very much welcome new members to the CDG. We will be having an online meeting in the next couple of months which would be an excellent opportunity to find out more. In the meantime, any IIPC member is welcome to suggest and/or lead on possible 2021 collaborative collections. For more information please contact the co-chairs or the Programme and Communications Officer.

Nicola Bingham & Alex Thurman CDG co-chairs

The CDG Working Group at the 2019 IIPC General Assembly in Zagreb.

Novel Coronavirus outbreak: help us collect websites

The International Internet Preservation Consortium’s Content Development Group and Archive-It are collaborating on a web archive collection preserving web content related to the ongoing Novel Coronavirus (Covid-19) outbreak. Due to the urgency of the outbreak, archiving of nominated web content will begin soon (mid-February 2020). Collection of new nominations and new crawls will continue as needed depending on the course of the outbreak and its containment.

What we are collecting

Web content from all countries and in any language is in scope. High priority subtopics include:

  • Coronavirus origins
  • Information about the spread of infection
  • Regional or local containment efforts
  • Medical/Scientific aspects
  • Social aspects
  • Economic aspects
  • Political aspects

Published information resources are a higher priority for seed nominations for this collection than social media feeds or hashtags (though the latter can be useful for finding examples of the former).

How to get involved

If you would like to participate in the collection, please nominate websites by using this submission form: https://forms.gle/zHgJK3DcfGpzAtCz5

The final collection is available at this link: https://archive-it.org/collections/13529

Contribute to CDG’s AI Collection!

By Tiiu Daniel, Web Archive Leading Specialist, National Library of Estonia

“Trurl” by Daniel Mróz, from The Cyberiad by Stanisław Lem (Wydawnictwo Literackie, Kraków, 1972). Illustration copyright © 1972 Daniel Mróz. Reprinted by permission.

After significant breakthroughs at the end of the 20th and the beginning of the 21st centuries, artificial intelligence (AI) has played a greater role in our daily lives. Although AI has a huge positive impact on a variety of fields such as manufacturing, healthcare, art, transportation, retail and so on, the use of new technologies also raises ethical issues as well as security risks. One critical and hotly debated issue is the impact of ongoing automation on labor markets, including changing educational requirements for jobs, job elimination, and various models for transitions.

The IIPC Content Development Group invites curators and web archivists around the world to contribute websites to a new “Artificial Intelligence” web collection.

The purpose of this collection is to bring together and record web content related to the use of AI and its impact on any aspect of life, reflecting attitudes and thoughts towards it, future predictions etc.

The content can be in any language focusing on specific countries or cultures or have a global scope.

We especially welcome contributions from underrepresented countries, cultures, languages and other groups, or those countries without IIPC members. Curators currently building AI related collections at their own institutions are welcome to contribute their seeds (matching below criteria) to aid in the development of a collection with an international perspective.

The collection aims to cover the following subtopics:

  • Machine learning, natural language processing, robotics, automation;
  • AI in literature, visual arts (e.g. ceramics, drawing, painting, sculpture, design, photography, filmmaking, architecture) and performing arts (e.g. theater, public speech, dance, music etc.); AI in emerging art forms;
  • AI and law/legislation;
  • Social and economic impact (e.g. impact on behavior/interaction, bias in AI, unemployment, inequality, changes in labor markets);
  • Ethical issues (e.g. weaponization of AI, security, robot rights);
  • Future predictions/scenarios concerning AI.

Types of web content to include are personal forms such as blogs, forum posts, and artist websites; trend reports, statements, and analyses (e.g. from government agencies, NGOs, scientific or academic institutions, advocacy groups, businesses).

Time frame covered by content: from the 1990s onwards.

Out of scope are: full social media feeds and channels (Facebook, Twitter, Instagram, YouTube, WhatsApp), users’ video channels (YouTube, Vimeo), apps and other content which is difficult or impossible to crawl.

That said, if you locate individual social media posts of unique value, such as an Instagram post by a bot or a particularly relevant and ephemeral individual video, please submit them for consideration.

Nominations are welcomed using the following form.

The call for nominations will close on the 30th of June 2019. Crawls will be run during the summer of 2019. The collection will be made available at the end of 2019.

 For more information about this collection, contact Tiiu Daniel (tiiu.daniel[at]nlib.ee).


Lead-Curators of CDG Artificial Intelligence Collection
Tiiu Daniel, Web Archive Leading Specialist, National Library of Estonia
Liisi Esse, Ph.D. Associate Curator for Estonian and Baltic Studies Stanford University Libraries
Rashi Joshi, Reference Librarian /Collections Specialist, Library of Congress

CDG Co-Chairs
Nicola Bingham, Lead Curator Web Archiving, British Library
Alex Thurman, Web Resources Collection Coordinator, Columbia University Libraries

Contribute to CDG’s Climate Change Collection!

By Kees Teszelszky, Curator Digital Collections, Koninklijke Bibliotheek – National Library of The Netherlands and Lead Curator, CDG Climate Change Collection

Climate change is one of the most urgent and hotly debated issues on the web in recent years. The IIPC Content Development Group is inviting all curators and web archivists from around the world to contribute websites to a new collaborative “Climate Change” collection.

Breiðamerkurlón, Iceland

In recent decades there has been strong evidence that the earth is experiencing rapid climate change, characterized by global temperature rise, warming oceans, shrinking ice sheets, glacial retreat, decreased snow cover, sea level rise, declining arctic sea ice, extreme weather events, and ocean acidification. Ninety-seven percent of climate scientists agree that these climate-warming trends over the past century are very likely due to human activities, and most of the leading scientific organizations worldwide have issued public statements endorsing this position (source: climate.nasa.gov/evidence). Global and local action to mitigate this crisis has been complicated by political, economic, technical, cultural, and religious debates.

Many people feel the urge to reflect on this topic on the web. We would like to take an international snapshot of born digital culture relating to documentation of and social debate on the challenging issue of climate change. You can contribute to this collection by nominating web content about any aspect of climate change. The content can be focused on specific countries or cultures or have a global focus, and can be in any language.

We especially welcome contributions from underrepresented countries, cultures, languages and other groups, or those countries without IIPC members. Curators currently building climate change related collections at their own institutions are welcome to contribute their seeds (matching below criteria) to help us build a collection with an international perspective.

Examples of subtopics might include climatology, climate change denial, climate refugees, religious reflections on climate change, etc. Eligible types of web content include organizational reports or statements (e.g. from government agencies, NGOs, scientific or academic institutions, advocacy groups, political parties/platforms, businesses, religious groups) or more personal forms such as blogs or artistic projects.

Out of scope are: social media feeds (Facebook, Twitter, Instagram, YouTube channels, WhatsApp), video (YouTube, Vimeo), apps and other content which is difficult or impossible to crawl.

Collecting seeds started on 1 April 2019 and more nominations can be added to this spreadsheet. Crawls will be run during the summer of 2019, to conclude shortly after the upcoming UN Climate Action Summit on 23 September 2019.

Organized by the IIPC and supported by web archivists around the world, the special web collection ‘Climate Change’ is one of the ways the IIPC helps raise awareness of the strategic, cultural and technological issues which make up the web archiving and digital preservation challenge.

For more information about this collection, contact Kees Teszelszky: kees.teszelszky[at]kb.nl

IIPC Content Development Group: 2019 collections

By Nicola Bingham, Lead Curator Web Archiving, British Library and Co-Chair of the Content Development Working Group

During 2019, the Content Development Group (CDG) will continue to work on several established collections: 

New for 2019, the CDG is undertaking a Climate Change Collection, led by Kees Teszelszky of the National Library of the Netherlands. The first crawl will take place before the General Assembly & the Web Archiving Conference in June, with a final crawl shortly after the next UN Climate summit in September. This collection has sparked a lot of interest on the CDG mailing list and many curators have expressed an interest in contributing.

We are also planning an Artificial Intelligence Collection, led by Tiiu Daniel of the National Library of Estonia, Liisi Esse of Stanford University Libraries and Rashi Joshi of Library of Congress. The details are still to be firmed up.

We are planning to crawl one of our collections, or a subset of a collection, in order that it can be used by researchers.

IIPC Content Development Group: What’s on in 2018

by Nicola Bingham, Lead Curator, Web Archiving British Library and IIPC CDG Co-Chair

The co-chairs of the IIPC Content Development Group (CDG) are pleased to submit the following update on the group’s activity so far this year and the major projects which will occupy the group going forward in 2018.

What do we do?

For those new to the IIPC or those who may be interested in either contributing to planned collections or thinking about submitting ideas for new ones, it is worth revisiting the CDG’s mandate.

The CDG was formed in 2014 and crawling began in early 2015. The Group is charged with building publicly accessible web collections on transnational themes or events. Collections are multinational, multilingual and cover a wide variety of perspectives. They are intended not only to be of particular value to researchers now and in the future, but also to promote awareness of web archiving globally, encouraging individuals and institutions that are not yet involved in web archiving, or that want to become involved, to find out more.

How to propose a collection?

New collections can be proposed on the CDG members’ mailing list, where the CDG co-chairs and the group (sometimes in consultation with researchers and others) develop a list of collections to pursue in line with pre-defined criteria in the collection policy and our capacity according to the budget approved by the Steering Committee. Each collection is supported by the co-chairs, who serve as project admins, while a lead curator (often, but not necessarily, the person who proposes the collection) scopes the collection, determines the metadata, monitors the collection and leads on quality assurance. Each collection is open to all members to contribute to. We strive to open up the nomination procedure as widely as possible, to non-members and members of the public, to elicit as wide a coverage of particular topics as possible.

Collections developed so far, via the IIPC Archive-It account, can be viewed here https://archive-it.org/home/IIPC

2018 collecting

So far in 2018 we have completed the 2018 Winter Olympics & Paralympics Collection, which contains nearly 1,500 seeds and is 1.2TB of data. The collection covered 35 countries in 21 languages. The nominations came from a mix of IIPC members and a public nomination form that was available through previous blog posts. For more information on this collection see the blog posts by lead curator Helena Byrne.

In addition, we updated the National Olympic & Paralympic Committees collection with committees that were missing from the crawl in 2016. This collection was crawled again during the 2018 Winter Olympics & Paralympics. Not all National Committees have a website, but if you notice we are missing any websites, please get in touch (2018-winter-olympics [at] iipc.simplelists .com).

We are now turning our attention to resuming the World War I Commemoration and the ‘Online News around the World’ collections.

The World War I Commemoration project, led by Peter Stirling, BnF, started in October 2015. It already includes over 2,000 seeds and covers a wide variety of different websites from official commemorations to amateur history websites, and the reporting of the centenary in the media. Websites from several different countries and in many languages have been selected by the members of the IIPC. 2018 is an important year for this collection as we will be looking to capture activity leading up to and during the centenary of the armistice in November.

The ‘Online News around the World’ collection, led by Sabine Schostag of the Royal Danish Library, has been several years in the planning and will begin in earnest shortly. This ambitious project aims to document a selection of online news websites from as many countries as possible during one week of the year (likely to be in November 2018). Once the metadata has been finalised, we will post details of how to nominate content for this collection. The IIPC has members in over 34 countries around the world, which is already a good starting point, but we hope to canvass much more widely than this to achieve our goal of global coverage!

This summer we will also be running new crawls of the seeds in the International Cooperation Organizations collection, led by Alex Thurman from Columbia University Libraries, which consists of all known active websites in the .int top-level domain (available only to organizations created by treaties). This collection was started in 2016 and includes important agencies in areas that require international cooperation, like environmental protection, economic development, and telecommunication.

In the meantime, we hope to see as many CDG members as possible for our session at the IIPC General Assembly on 12th November. More details to follow shortly.

IIPC Going for Gold – Get involved in #WAOlympics2018

By Helena Byrne, Curator of Web Archives, The British Library

The IIPC Content Development Group (CDG) has been busy archiving the events of the 2018 Winter Olympics in Pyeongchang, South Korea since the start of February 2018. The IIPC CDG has been building web archive collections on the Olympic and the Paralympic Games since 2010. The IIPC has members in more than 30 countries but there are over 100 countries competing in the Games and we need your help to ensure that these countries are represented in the collection.

So far there have been over 1,360 nominations from at least 28 countries around the world. As you can see from the map of the world, there is a high concentration from Europe as many IIPC members are based there. However, as you zoom in on the map of European nominations, there are still many gaps.

This is your chance to get involved in the collection phase by nominating online content that you are reading or using for research, or content from a country whose language you know. We are trying to get as many pins on the map from around the world as possible. However, some of the pins already there may represent just one website nomination so far. Even if you see a pin on your country, or on another country where you speak the language, we still want your nominations.

Just to remind you, what we want to collect:
Public platforms in various formats such as:

  • Websites
  • Subsections of websites with an Olympic tag
  • Individual Articles
  • News Reports
  • Blogs and Social Media

The subjects covered on these sites can include, but are not limited to:

  • Athletes/Teams
  • Computer Games (eGames)
  • Doping/Cheating and Corruption
  • Environmental Issues
  • Fandom
  • Gender Issues (e.g. media coverage, testosterone levels etc.)
  • General News/Commentary
  • Olympic/Paralympic Venues
  • Security
  • Sports Events
  • US/North Korean Relations
  • Other

How to get involved:
Once you have selected the web pages you would like to see in the collection, it takes less than 5 minutes to fill in the submission form.

https://goo.gl/forms/UwxiBg5klE6I7Z7g1

The call for nominations will close on the 20th of March 2018.

For more information and updates you can contact the IIPC CDG team via email (2018-winter-olympics [at] iipc.simplelists .com) or follow the collection hashtag #WAOlympics2018

 

Archives Unleashed at the British Library: Study of gender distribution in National Olympic Committees

Who we are

Sara Aubry (National Library of France), Helena Byrne (British Library), Naomi Dushay (Stanford University), Pamela Graham (Columbia University), Andy Jackson (British Library), Gillian Lee (National Library of New Zealand) and Gethin Rees (British Library).

From the 11th to 13th of June 2017 a group of seven individuals from five institutions came together to analyse a web archive collection at a datathon held at the British Library as part of the Web Archiving Week. The aim of Archives Unleashed is for programmers and researchers to come together to develop new strategies to analyse web archive collections. Our team was a mix of technical and curatorial staff, and we were working with the IIPC Content Development Group (CDG) National Olympic Committees collection.

The IIPC Content Development Group

The IIPC is a membership organization dedicated to improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. The Content Development Group is a subgroup of the IIPC and specialises in building collaborative international web archive collections.

Previous CDG collections include:

  • Summer/Winter Olympics/Paralympics (2010-2016)
  • National Olympic & Paralympic Committees (2016- )
  • European Refugee Crisis (2015-2016)
  • World War One Commemoration (2015- )
  • International Cooperative Organizations (2015- )

All these collections can be viewed from here: https://archive-it.org/home/iipc

What we tried to do

We initially planned to work with the London 2012 and Rio 2016 Olympic collaborative crawl collections, but both data sets were too large for us to work with in the short time frame we had. We therefore decided to work with the National Olympic and Paralympic Committees collection instead.

Our research question was “What is the gender distribution of National Olympic Committees?”.

The data we had

The 2016 National Olympic and Paralympic Committees collection is a comprehensive collection of national Olympic/Paralympic committees drawn from IOC official sites. In 2016, 191 seeds were crawled, as not all International Olympic Committee member countries have a website.

The 191 seeds translated into 152 WARC files totalling 294 GB. However, there was an issue when the files were downloaded and a number of them were corrupted. After programmatically separating the corrupted files from the good ones, we had 76 WARC files totalling 74 GB to work with. Although this was only half of the collection's WARC files (and roughly a quarter of its size), it was still more than enough data to work with over the two days.
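The post does not say how the corrupted files were identified. One plausible approach, sketched below in Python, assumes the WARCs were gzip-compressed (.warc.gz): read each file to the end and treat any decompression error as a sign of corruption. The function name and the directory layout are hypothetical, and this is not the script the team used.

```python
import gzip
import zlib
from pathlib import Path


def split_readable_warcs(directory):
    """Return (good, corrupted) lists of .warc.gz paths found in `directory`.

    Assumption: a file counts as corrupted if its gzip stream cannot be
    read to the end. This checks compression integrity only, not whether
    the WARC records inside are well formed.
    """
    good, corrupted = [], []
    for path in sorted(Path(directory).glob("*.warc.gz")):
        try:
            with gzip.open(path, "rb") as stream:
                # Read in 1 MiB chunks so large files never sit in memory.
                while stream.read(1024 * 1024):
                    pass
            good.append(path)
        except (OSError, EOFError, zlib.error):
            # Bad header, truncated stream or damaged compressed data.
            corrupted.append(path)
    return good, corrupted
```

Calling split_readable_warcs on the download directory would give the two lists, and only the first would be passed on to the analysis.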

After the technical team had isolated the usable WARCs and looked at the tools available to run our analysis, we decided to scale down our research question to “What is the gender distribution of English-speaking National Olympic Committees?”. As the tools used to run this analysis were developed in North America, there is a bias towards English-language names. The curatorial team identified all the English-speaking countries that were represented in the full collection. We used this list to filter out non-English-speaking countries from the clean WARCs, giving us a smaller subset on which to run our analyses. The usable WARCs covered seven English-speaking countries.

The 7 English-speaking countries identified in the set.
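Filtering by country can be pictured as an allow-list check on the host name of each captured URL. The Python sketch below is an illustration only: the function names are hypothetical, and the allow-list holds one host that appears in the sample output later in this post plus a placeholder, not the actual list of seven countries.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list of hosts for English-speaking committees.
# paralympic.org.au appears in the sample output in this post;
# example.org is a placeholder, not a real committee site.
ENGLISH_SPEAKING_HOSTS = {"paralympic.org.au", "example.org"}


def normalise_host(url):
    """Lower-case the host of `url` and drop a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def keep_english_speaking(urls, allowed_hosts=ENGLISH_SPEAKING_HOSTS):
    """Keep only the URLs whose host is on the allow-list."""
    return [url for url in urls if normalise_host(url) in allowed_hosts]
```

An allow-list keeps the curatorial decision (which countries count) separate from the processing code, so the list can be revised without touching the filter.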

How we worked on it

The organizers prepared several Linux virtual machines specifically for the hackathon, so that the WARC files were easily accessible, participants didn’t have to transfer large amounts of data, and there was enough processing capacity. We started by installing three tools that we had identified as useful on a designated virtual machine:
– Warcbase, an open source platform to facilitate the analysis and processing of web archives with Hadoop and Apache Spark. It provides tools to extract content, filter it down, and then analyze, aggregate and visualize it. [Note that Warcbase has now been superseded by the Archives Unleashed Toolkit.]
– Stanford Named Entity Recognizer (NER), which is included with Warcbase and was a key tool for our analysis. It identifies and labels sequences of words in a text that are the names of things, particularly for the three classes person, organization and location.
– OpenGenderTracking, another open source tool, which provides a framework for identifying the likely gender of a person from their first name.

Step 1 of the analysis consisted of extracting all named entities from the WARCs using Warcbase and NER, with a Scala script derived from sample scripts provided with Warcbase. The output was a list (in JSON format) of domain records, each with its associated PERSON, ORGANIZATION and LOCATION entities and their frequency of occurrence.

In step 2, we used a Python script to match the extracted PERSON names against a framework containing a large structured list of first names, built from the US census, along with the probability of each being a male or a female first name. The output was a list (in CSV format) of the resulting associations.

Snippet:
20160329, http://www.paralympic.org.au, Cochrane, 3, No Match
20160329, http://www.paralympic.org.au, Sam Carter, 2, Male
20160329, http://www.paralympic.org.au, Alistair, 1, Male
20160329, http://www.paralympic.org.au, Carlee Beattie, 4, Female
20160329, http://www.paralympic.org.au, Ernest Van Dyk, 1, Male
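The matching in step 2 can be sketched roughly as follows. This is not the script used at the hackathon. The lookup table, its probability values, the threshold and the function names are all made up for illustration, and the rule that an inconclusive probability yields ‘Unknown’ is an assumption about how that label arises.

```python
# Hypothetical lookup: lower-cased first name -> probability that the
# name is male. The values are invented placeholders, not census data.
MALE_PROBABILITY = {
    "sam": 0.9,
    "alistair": 0.99,
    "carlee": 0.01,
    "ernest": 0.99,
    "jordan": 0.5,
}


def classify(person, table=MALE_PROBABILITY, threshold=0.8):
    """Label a PERSON entity using only its first token."""
    parts = person.split()
    if not parts or parts[0].lower() not in table:
        return "No Match"  # name is absent from the reference list
    p_male = table[parts[0].lower()]
    if p_male >= threshold:
        return "Male"
    if p_male <= 1 - threshold:
        return "Female"
    return "Unknown"  # assumed meaning: the probability is inconclusive


def label_rows(rows):
    """Append a gender label to (crawl_date, url, name, count) tuples."""
    return [(date, url, name, count, classify(name))
            for date, url, name, count in rows]
```

Because only the first token is looked up, an entity that is a surname on its own, such as "Cochrane" in the snippet above, falls through to "No Match".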

The analysis was run on two subsets of the data:
– the committee pages: 16 of them contained entities (a small set that was fast to process);
– the entire collection: 1,251 pages contained entities (a larger set that took a few hours to process).

Step 3 consisted of adapting existing JavaScript code to visualize the results of the named entity recognition and the gender distribution as web graphs.

All scripts developed during the hackathon can be accessed on GitHub:

https://github.com/ukwa/archives-unleashed-olympics

Results

The gender distribution within the subset of the collection.

Gender representation by country of the 7 English-speaking countries identified in the set.
  • ‘No match’ means the name didn’t appear in the reference source for identifying names.
  • ‘Unknown’ means the reference source couldn’t identify whether the name was male or female.

Male/female representation over the complete dataset of 76 WARC files regardless of language.

The gender distribution within the subset of the collection and the overall data set showed that males are more represented than females on National Olympic Committees.

Alternative research question?

Each National Committee has official partners that sponsor their participation in the Olympics. When we ran the entity extraction for corporations, it raised further questions about what percentage of the site is taken up with references to commercial sponsorship. The gender and corporation names are just two of many entities that could be extracted from the data set using this methodology.

What we got out of it  

Sara Aubry, Bibliothèque nationale de France

“My participation in the hackathon was linked to the BnF’s current efforts to engage researchers in using web archives as data sets. We aimed at discussing research topic ideas, learning how to use available open source tools, tackling limitations and sharing practices among participants. What I liked most was the hackathon model itself, which challenged us to work collaboratively in a very short period of time. I guess a little more time would have been useful to explore and compare the results of the analysis we ran.”

Pamela Graham, Columbia University

“I enjoyed our sub-groupings into programmers/technical experts and curators (forgive this oversimplification). As a curator, I needed a better understanding of the process of working with web archive data. Since I don’t have programming skills, this was more of a conceptual exercise than a practical one. I gained a good, first-hand sense of the issues and challenges of analyzing web data. But even more helpful was the attempt the curators made to evaluate the collection–how and why were the sites selected and what’s missing? This is really important to interpreting the results and reinforced for me the importance of curation. I greatly benefited from talking with Helena and Gillian on these issues.”

Gethin Rees, British Library

“Having recently started as a curator working with digital collections at the British Library, I was keen to learn about web archives. I was also intent on improving my use of Python for data science. I loved being introduced to new technologies like Hadoop and connecting to powerful computers in North America. Next time I would try to get stuck in to processing some WARC files independently.”

Gillian Lee, National Library of New Zealand

“I wanted to see what tools were available to help people analyse data in web archives. The collaborative aspect was great. I discovered you have to refine and reduce your data set quite substantially and that the scope and provenance of the collections is really important for researchers. I don’t feel I’m any closer to actually using Warcbase myself (yet), but I had more of an understanding of the kind of research that could be done using Warcbase and associated tools. Given the time frame we were working in and the amount of corrupted data we encountered, I would say the process was more valuable than the output!”

Helena Byrne, British Library

“For me as a curator my expectations of how things work were quite different from the reality but the overall experience was still good as it gave me a better understanding of the process. It was also useful to discuss the differences I had in expectation and reality with Pamela and Gillian as we were able to come up with ways we could assist the technical team.”

2018 Winter Olympics Collection Building – Get Involved!

By Helena Byrne, Curator of Web Archives, The British Library

The International Internet Preservation Consortium Content Development Group (IIPC CDG) would like your help to archive websites from around the world related to the 2018 Winter Olympic and Paralympic Games.

The IIPC has members in 33 countries but there are over 100 countries competing in the Games and we need your help to ensure that these countries are represented in the collection. The IIPC CDG has been building web archive collections on the Olympic and the Paralympic Games since 2010. The 2016 Summer Games was the first time the group actively collected content related to activities both on and off the playing field.* The final 2018 Winter Games collection will be published here: https://archive-it.org/home/IIPC

What we want to collect:

Public platforms in various formats such as:

  • Websites
  • Subsections of websites with an Olympic tag
  • Individual Articles
  • News Reports
  • Blogs and Social Media

The subjects covered on these sites can include, but are not limited to:

  • Athletes/Teams
  • Computer Games (eGames)
  • Doping/Cheating and Corruption
  • Environmental Issues
  • Fandom
  • Gender Issues (e.g. media coverage, testosterone levels, etc.)
  • General News/Commentary
  • Olympic/Paralympic Venues
  • Security
  • Sports Events
  • US/North Korean Relations
  • Other

How to get involved:

Once you have selected the web pages you would like to see in the collection, it takes less than 5 minutes to fill in the submission form.

https://goo.gl/forms/UwxiBg5klE6I7Z7g1

For more information and updates you can contact the IIPC CDG team via email (2018-winter-olympics [at] iipc.simplelists .com) or follow the collection hashtag #WAOlympics2018


* 2016 Olympics collection round-up