IIPC Content Development Group: What’s on in 2018

by Nicola Bingham, Lead Curator, Web Archiving, British Library and IIPC CDG Co-Chair

The co-chairs of the IIPC Content Development Group (CDG) are pleased to submit the following update on the group’s activity so far this year and the major projects which will occupy the group going forward in 2018.

What do we do?

For those new to the IIPC or those who may be interested in either contributing to planned collections or thinking about submitting ideas for new ones, it is worth revisiting the CDG’s mandate.

The CDG was formed in 2014 and crawling began in early 2015. The Group is charged with building publicly accessible web collections on transnational themes or events. Collections are multinational and multilingual, and cover a wide variety of perspectives. They are intended not only to be of particular value to researchers now and in the future, but also to promote awareness of web archiving globally, encouraging individuals and institutions that are not yet involved in web archiving, or that want to become involved, to find out more.

How to propose a collection?

New collections can be proposed on the CDG members’ mailing list. The CDG co-chairs and the group (sometimes in consultation with researchers and others) then develop a list of collections to pursue, in line with the pre-defined criteria in the collection policy and with our capacity under the budget approved by the Steering Committee. Each collection is supported by the co-chairs, who serve as project admins, while a lead curator (often, but not necessarily, the person who proposed the collection) scopes the collection, determines the metadata, monitors the collection and leads on quality assurance. Each collection is open to contributions from all members. We strive to open up the nomination procedure as widely as possible, including to non-members and members of the public, to elicit as wide a coverage of particular topics as possible.

Collections developed so far, via the IIPC Archive-It account, can be viewed here: https://archive-it.org/home/IIPC

2018 collecting

So far in 2018 we have completed the 2018 Winter Olympics & Paralympics Collection, which contains nearly 1,500 seeds and 1.2 TB of data. The collection covers 35 countries in 21 languages. The nominations came from a mix of IIPC members and a public nomination form that was made available through previous blog posts. For more information on this collection, see the blog posts by lead curator Helena Byrne.

In addition, we updated the National Olympic & Paralympic Committees collection with committees that were missing from the crawl in 2016. This collection was crawled again during the 2018 Winter Olympics & Paralympics. Not all National Committees have a website, but if you notice that we are missing any websites, please get in touch (2018-winter-olympics [at] iipc.simplelists .com).

We are now turning our attention to resuming the World War I Commemoration and the ‘Online News around the World’ collections.

The World War I Commemoration project, led by Peter Stirling, BnF, started in October 2015. It already includes over 2,000 seeds and covers a wide variety of websites, from official commemorations to amateur history websites and the reporting of the centenary in the media. Websites from several different countries and in many languages have been selected by the members of the IIPC. 2018 is an important year for this collection as we will be looking to capture activity leading up to and during the centenary of the armistice in November.

The ‘Online News around the World’ collection, led by Sabine Schostag of the Royal Danish Library, has been several years in planning and will begin in earnest shortly. This ambitious project aims to document a selection of online news websites from as many countries in the world as possible during one week of the year (likely to be in November 2018). Once the metadata has been finalised, we will post details of how to nominate content for this collection. The IIPC has members in over 34 countries around the world, which is already a good starting point, but we hope to canvass much more widely than this to achieve our goal of global coverage!

This summer we will also be running new crawls of the seeds in the International Cooperation Organizations collection, led by Alex Thurman from Columbia University Libraries, which consists of all known active websites in the .int top-level domain (available only to organizations created by treaties). This collection was started in 2016 and includes important agencies in areas that require international cooperation, like environmental protection, economic development, and telecommunication.

In the meantime, we hope to see as many CDG members as possible for our session at the IIPC General Assembly on 12th November. More details will follow shortly.

IIPC Going for Gold – Get involved in #WAOlympics2018

By Helena Byrne, Curator of Web Archives, The British Library

The IIPC Content Development Group (CDG) has been busy archiving the events of the 2018 Winter Olympics in Pyeongchang, South Korea since the start of February 2018. The IIPC CDG has been building web archive collections on the Olympic and the Paralympic Games since 2010. The IIPC has members in more than 30 countries but there are over 100 countries competing in the Games and we need your help to ensure that these countries are represented in the collection.

So far there have been over 1,360 nominations from at least 28 countries around the world. As you can see from the map of the world, there is a high concentration from Europe as many IIPC members are based there. However, as you zoom in on the map of European nominations, there are still many gaps.

This is your chance to get involved in the collection phase by nominating online content that you are reading or using for research, or content from a country whose language you know. We are trying to get as many pins on the map from around the world as possible. However, some of the pins already there may represent just one website nomination so far. Even if you see a pin on your country, or on another country where you speak the language, we still want your nominations.

As a reminder, this is what we want to collect:
Public platforms in various formats such as:

  • Websites
  • Subsections of websites with an Olympic tag
  • Individual Articles
  • News Reports
  • Blogs and Social Media

The subjects covered on these sites can include but are not limited to:

  • Athletes/Teams
  • Computer Games (eGames)
  • Doping/Cheating and Corruption
  • Environmental Issues
  • Fandom
  • Gender Issues (e.g. media coverage, testosterone levels)
  • General News/Commentary
  • Olympic/Paralympic Venues
  • Security
  • Sports Events
  • US/North Korean Relations
  • Other

How to get involved:
Once you have selected the web pages you would like to see in the collection, it takes less than 5 minutes to fill in the submission form.

https://goo.gl/forms/UwxiBg5klE6I7Z7g1

The call for nominations will close on the 20th of March 2018.

For more information and updates you can contact the IIPC CDG team via email (2018-winter-olympics [at] iipc.simplelists .com) or follow the collection hashtag #WAOlympics2018

 

Archives Unleashed at the British Library: Study of gender distribution in National Olympic Committees

Who we are

Sara Aubry (National Library of France), Helena Byrne (British Library), Naomi Dushay (Stanford University), Pamela Graham (Columbia University), Andy Jackson (British Library), Gillian Lee (National Library of New Zealand) and Gethin Rees (British Library).

From the 11th to 13th of June 2017 a group of seven individuals from five institutions came together to analyse a web archive collection at a datathon held at the British Library as part of the Web Archiving Week. The aim of Archives Unleashed is for programmers and researchers to come together to develop new strategies to analyse web archive collections. Our team was a mix of technical and curatorial staff, and we were working with the IIPC Content Development Group (CDG) National Olympic Committees collection.

The IIPC Content Development Group

The IIPC is a membership organization dedicated to improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. The Content Development Group is a subgroup of the IIPC and specialises in building collaborative international web archive collections.

Previous CDG collections include:

  • Summer/Winter Olympics/Paralympics (2010-2016)
  • National Olympic & Paralympic Committees (2016- )
  • European Refugee Crisis (2015-2016)
  • World War One Commemoration (2015- )
  • International Cooperation Organizations (2015- )

All these collections can be viewed from here: https://archive-it.org/home/iipc

What we tried to do

We initially started with the idea of working with the London 2012 and Rio 2016 Olympic collaborative crawl collections, but both of these data sets were too large for us to work with in the short time frame we had. This is why we decided to work with the National Olympic and Paralympic Committees collection.

Our research question was “What is the gender distribution of National Olympic Committees?”.

The data we had

The 2016 National Olympic and Paralympic Committees collection is a comprehensive collection of national Olympic/Paralympic committee websites drawn from official IOC sites. In 2016, 191 seeds were crawled, as not all International Olympic Committee member countries have a website.

The 191 seeds translated into 152 WARC files totalling 294 GB. However, there was an issue when the files were downloaded and a number of them were corrupted. After programmatically separating the corrupted files from the good ones, we had 76 WARC files totalling 74 GB to work with. Although this was only 50% of the collection's WARC files, it was more than enough data to work with over the two days.
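The post does not say how the corrupted files were separated from the good ones. As a minimal sketch of one possible approach, assuming the WARCs are gzip-compressed (`*.warc.gz`), each file can be read to the end and set aside if decompression fails. The function names `is_readable_gzip` and `split_warcs` are hypothetical and are not taken from the team's scripts.

```python
import gzip
import zlib
from pathlib import Path


def is_readable_gzip(path, chunk_size=1 << 20):
    """Return True if the whole file decompresses without error.

    Assumption: a truncated or damaged download shows up as a gzip error.
    This does not validate the WARC records themselves.
    """
    try:
        with gzip.open(path, "rb") as fh:
            # Read in chunks so that large files are never held in memory.
            while fh.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        return False


def split_warcs(directory):
    """Sort the *.warc.gz files in a directory into (good, bad) lists."""
    good, bad = [], []
    for path in sorted(Path(directory).glob("*.warc.gz")):
        (good if is_readable_gzip(path) else bad).append(path)
    return good, bad
```

Calling `split_warcs("/path/to/warcs")` would return two lists of paths. A check like this only catches damage at the compression level, so a file that passes could still contain malformed records.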

After the technical team had isolated the usable WARCs and looked at the tools available to run our analysis, we decided to scale down our research question to “What is the gender distribution of English speaking National Olympic Committees?”. As the tools used to run this analysis were developed in North America, there is a bias towards English language names. The curatorial team identified all the English-speaking countries that were represented in the full collection. We used this list to filter out non-English-speaking countries from the clean WARCs so that we would have a smaller subset on which to run our analyses. The usable WARCs covered seven English-speaking countries.

The 7 English speaking countries identified in the set.

How we worked on it

Several Linux virtual machines were prepared by the organizers specifically for the hackathon, so that the WARC files were easily accessible, participants didn’t have to transfer large amounts of data, and there was enough processing capacity. We started by installing three tools that we had identified as being useful on a designated virtual machine:
– Warcbase, an open source platform to facilitate the analysis and processing of web archives with Hadoop and Apache Spark. It provides tools to extract content, filter it down, and then analyze, aggregate and visualize it. (Note that Warcbase has since been superseded by the Archives Unleashed Toolkit.)
– Stanford Named Entity Recognizer (NER), a key tool for our analysis that is included with Warcbase. It can identify and label sequences of words in a text which are the names of things, particularly for the three classes person, organization and location.
– OpenGenderTracking, another open source tool, which provides a framework to identify the likely gender based on a person’s first name.

Step 1 of the analysis consisted of extracting all named entities from the WARCs using Warcbase and NER, with a Scala script derived from the sample scripts provided with Warcbase. The output was a list (in JSON format) of domain records, each with its extracted PERSON, ORGANIZATION and LOCATION entities and their frequency of occurrence.

In step 2, we used a Python script to match the extracted PERSON names against a framework containing a large structured list of first names built from the US census, together with the probability of each being a male or a female first name. The output was a result list (in CSV format) of this association.

Snippet:
20160329, http://www.paralympic.org.au, Cochrane, 3, No Match
20160329, http://www.paralympic.org.au, Sam Carter, 2, Male
20160329, http://www.paralympic.org.au, Alistair, 1, Male
20160329, http://www.paralympic.org.au, Carlee Beattie, 4, Female
20160329, http://www.paralympic.org.au, Ernest Van Dyk, 1, Male
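The matching in step 2 can be illustrated with a minimal sketch. It assumes a lookup table that maps a lower-cased first name to the probability that it is a female name. The table values, the threshold and the function name `guess_gender` are invented for illustration, and this is not the OpenGenderTracking interface or the team's actual script.

```python
# Hypothetical lookup: first name -> probability that it is a female name.
# These values are made up for illustration; they are not census figures.
FEMALE_PROBABILITY = {
    "sam": 0.10,
    "alistair": 0.00,
    "carlee": 1.00,
    "ernest": 0.00,
    "jordan": 0.50,
}


def guess_gender(full_name, table=FEMALE_PROBABILITY, threshold=0.75):
    """Label a PERSON entity using only the first token of the name."""
    tokens = full_name.split()
    first = tokens[0].lower() if tokens else ""
    if first not in table:
        # The name does not appear in the reference list at all.
        return "No Match"
    p_female = table[first]
    if p_female >= threshold:
        return "Female"
    if p_female <= 1 - threshold:
        return "Male"
    # The name is listed, but is not clearly male or female.
    return "Unknown"
```

With this table, `guess_gender("Carlee Beattie")` returns `"Female"` and `guess_gender("Cochrane")` returns `"No Match"`, mirroring the labels in the snippet above. A surname extracted on its own, such as "Cochrane", is looked up as if it were a first name, which is one reason unmatched entries appear.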

The analysis was run on two sub-datasets:
– the committee pages: 16 of them contained entities (a small set that was fast to process);
– the entire collection: 1,251 pages contained entities (a larger set that took a few hours to process).

Step 3 consisted of adapting JavaScript code to visualize the results of the named entity recognition and the gender distribution as web graphs.

All scripts developed during the hackathon can be accessed on GitHub:

https://github.com/ukwa/archives-unleashed-olympics

Results

The gender distribution within the subset of the collection.

Gender representation by country of the 7 English speaking countries identified in the set.
  • ‘No match’ means the name didn’t appear in the reference source for identifying names.
  • ‘Unknown’ means the reference source couldn’t identify whether the name was male or female.

Male/female representation over the complete dataset of 76 WARC files regardless of language.

The gender distribution within the subset of the collection and the overall data set showed that males are more represented than females on National Olympic Committees.
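As a rough illustration of how a distribution like this can be tallied from the step 2 output, here is a minimal sketch using the five rows from the snippet shown earlier. The column layout (crawl date, site, name, frequency, label) is inferred from that snippet and the function name `gender_counts` is hypothetical.

```python
import csv
from collections import Counter
from io import StringIO

# The five rows quoted in the snippet from step 2.
SAMPLE = """\
20160329, http://www.paralympic.org.au, Cochrane, 3, No Match
20160329, http://www.paralympic.org.au, Sam Carter, 2, Male
20160329, http://www.paralympic.org.au, Alistair, 1, Male
20160329, http://www.paralympic.org.au, Carlee Beattie, 4, Female
20160329, http://www.paralympic.org.au, Ernest Van Dyk, 1, Male
"""


def gender_counts(csv_text, weighted=False):
    """Count labels per distinct name, or per mention if weighted=True."""
    counts = Counter()
    reader = csv.reader(StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if len(row) != 5:
            continue  # skip blank or malformed lines
        _date, _site, _name, freq, label = (cell.strip() for cell in row)
        counts[label] += int(freq) if weighted else 1
    return counts


by_name = gender_counts(SAMPLE)
by_mention = gender_counts(SAMPLE, weighted=True)
```

Counting each name once and counting every mention can give different pictures of the same site, so it is worth deciding which of the two a chart is meant to show.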

Alternative research question?

Each National Committee has official partners that sponsor their participation in the Olympics. When we ran the entity extraction for corporations, it raised further questions about what percentage of the site is taken up with references to commercial sponsorship. The gender and corporation names are just two of many entities that could be extracted from the data set using this methodology.

What we got out of it  

Sara Aubry, Bibliothèque nationale de France

“My participation in the hackathon was linked to the BnF’s current efforts to engage researchers in using web archives as data sets. We aimed at discussing research topic ideas, learning how to use available open source tools, tackling limitations and sharing practices among participants. What I liked most was the hackathon model itself, which challenged us to work collaboratively in a very short period of time. I guess a little more time would have been useful to explore and compare the results of the analysis we ran.”

Pamela Graham, Columbia University

“I enjoyed our sub-groupings into programmers/technical experts and curators (forgive this oversimplification). As a curator, I needed a better understanding of the process of working with web archive data. Since I don’t have programming skills, this was more of a conceptual exercise than a practical one. I gained a good, first-hand sense of the issues and challenges of analyzing web data. But even more helpful was the attempt the curators made to evaluate the collection–how and why were the sites selected and what’s missing? This is really important to interpreting the results and reinforced for me the importance of curation. I greatly benefited from talking with Helena and Gillian on these issues.”

Gethin Rees, British Library

“Having recently started as a curator working with digital collections at the British Library I was keen to learn about web archives. I was also intent on improving my use of Python for data science. I loved being introduced to new technologies like Hadoop and connecting to powerful computers in North America. Next time I would try to get stuck into processing some WARC files independently.”

Gillian Lee, National Library of New Zealand

“I wanted to see what tools were available to help people analyse data in web archives. The collaborative aspect was great. I discovered you have to refine and reduce your data set quite substantially and that the scope and provenance of the collections is really important for researchers. I don’t feel I’m any closer to actually using Warcbase myself (yet), but I had more of an understanding of the kind of research that could be done using Warcbase and associated tools. Given the time frame we were working in and the amount of corrupted data we encountered, I would say the process was more valuable than the output!”

Helena Byrne, British Library

“For me as a curator my expectations of how things work were quite different from the reality but the overall experience was still good as it gave me a better understanding of the process. It was also useful to discuss the differences I had in expectation and reality with Pamela and Gillian as we were able to come up with ways we could assist the technical team.”

2018 Winter Olympics Collection Building – Get Involved!

By Helena Byrne, Curator of Web Archives, The British Library

The International Internet Preservation Consortium Content Development Group (IIPC CDG) would like your help to archive websites from around the world related to the 2018 Winter Olympic and Paralympic Games.

The IIPC has members in 33 countries but there are over 100 countries competing in the Games and we need your help to ensure that these countries are represented in the collection. The IIPC CDG has been building web archive collections on the Olympic and the Paralympic Games since 2010. The 2016 Summer Games was the first time they actively collected content related to activities both on and off the playing field.* The final 2018 Winter Games collection will be published here: https://archive-it.org/home/IIPC

What we want to collect:

Public platforms in various formats such as:

  • Websites
  • Subsections of websites with an Olympic tag
  • Individual Articles
  • News Reports
  • Blogs and Social Media

The subjects covered on these sites can include but are not limited to:

  • Athletes/Teams
  • Computer Games (eGames)
  • Doping/Cheating and Corruption
  • Environmental Issues
  • Fandom
  • Gender Issues (e.g. media coverage, testosterone levels)
  • General News/Commentary
  • Olympic/Paralympic Venues
  • Security
  • Sports Events
  • US/North Korean Relations
  • Other

How to get involved:

Once you have selected the web pages you would like to see in the collection, it takes less than 5 minutes to fill in the submission form.

https://goo.gl/forms/UwxiBg5klE6I7Z7g1

For more information and updates you can contact the IIPC CDG team via email (2018-winter-olympics [at] iipc.simplelists .com) or follow the collection hashtag #WAOlympics2018


* 2016 Olympics collection round-up

Rio 2016 Round Up

By Helena Byrne, Assistant Web Archivist, The British Library

The IIPC Content Development Group (CDG) 2016 Summer Olympic and Paralympic Games collection is now live: http://archive-it.org/collections/7235

The collection period ran from June to October 2016 and covered events on and off the playing field. The CDG used a combination of collaborative tools during this project as well as input from the general public.

Collection Fast Facts:

Final Number of Nominations:

In total 4,817 seeds were nominated: 4,642 from CDG members and 176 from the public nomination form.

Countries:

125 countries are covered in the collection, but the number of nominations varies between countries, ranging from between 1 and 5 seeds to a couple of hundred. The top 5 countries covered were France (681), Brazil (553), Japan (447), Great Britain (341) and Canada (327).

Languages:

34 different languages were recorded.

What’s Next?:

Quality Assurance:

Now that the collection phase of the project is over, we hope to be able to do some Quality Assurance (QA) on the archived nominations. Criteria on how to evaluate an archived website can be found here. There are two ways this will be done: the first is through the crawl reports generated by the Archive-It account, while the second is through a visual inspection of the website. The second option can be done by anyone using the collection, whether they are IIPC members or individuals interested in the web archiving process. As there are a large number of sites to look through, this will require input from people outside the CDG. Can you help us do QA on this collection?

Report an issue with the collection:

If you would like to flag any issues with the content while using the collection, you can fill in this Google Form: https://goo.gl/forms/utvyE8FztZdjFSaB3

Guidelines:

The CDG will publish a ‘Best Practice for Developing Collaborative Collections’ on the IIPC website. This will not only form the guidelines for future CDG collections but will hopefully be of use for anyone working on a collaborative project.

Target Audience:

This collection will be invaluable for web archive researchers in terms of data mining as well as researchers who focus on sports and Olympic events.

Thank you for contributing to this project. You can keep up to date with any further developments through the collection hashtag #Rio2016WA.


Collection timelines and updates:

Help Identify News Sites for the IIPC Online News Around the World Project!

By Sabine Schostag, Statsbiblioteket, Aarhus

What?

The IIPC’s Content Development Working Group, which is leading an effort to build collaborative, global web archives on a variety of topics of interest to our members, is kicking off a new project that we are calling “Online News Around the World: A Snapshot in Time”. Our goal is to document online news websites during one week of the year from ALL of the countries in the world.


Why?

You read that right – ONLINE NEWS FROM ALL OF THE COUNTRIES IN THE WORLD. We never said IIPC members were entirely sane, did we? We know this is a lofty goal, but we have a few reasons for doing this:

  • raise global awareness of the critical need for web archiving;
  • raise awareness of the importance of preserving born-digital news;
  • create a cohesive and comprehensive collection that will engage researchers;
  • archive content from countries and regions not currently being archived by IIPC members.

When?

“Week 46”: November 14, 2017 – November 20, 2017. Strange idea? Maybe not… Week 46 was designated an “ordinary news week” back at the end of the 1990s by Anker Brink Lund, doctor of philosophy and professor at Copenhagen Business School. He wrote in 2014:

The project News week has its origin in an old dream I tried to realize for many years. My burning desire was not only to be able to analyze spectacular news cases, but also to map the journalistic feeding chain in general by registering ALL news in a specific period. This kind of project needed lots of money and many expert resources. In autumn 1999, I was given both, because the newly opened journalist studies in Odense needed trainee places for their students and because a parliamentary analysis group on the political power wanted to know more about the journalistic power in Denmark. Ever since, I have used week 46 for all kinds of media analyses, together with national and international research colleagues… 1

The World Wide Web is more than twenty years old. The IIPC thinks it is time to include web news in this “extraordinary ordinary news week.”

Who?

We know we might not reach our lofty goal instantly, and that it will take some time to identify news sites from around the world. We plan to start gradually, with an initial goal of news sites from IIPC member countries. But here is where you fit in. The Content Development Group needs your help! Please nominate 10 news sites from your country using our nomination tool: http://digital2.library.unt.edu/nomination/iipc-news/. Once we receive nominations, the Content Development Group will review the list to determine what set will be archived.

For more information about the project and to find out more about how to help, please contact the Project Team at online-news-project@iipc.simplelists.com or reply to this blog post with your questions!

References:

1 Citation from: Anker Brink Lund: Analysis – An extraordinary ordinary news week. In: Journalisten.dk, 2014-11-14.