Who we are
Sara Aubry (National Library of France), Helena Byrne (British Library), Naomi Dushay (Stanford University), Pamela Graham (Columbia University), Andy Jackson (British Library), Gillian Lee (National Library of New Zealand) and Gethin Rees (British Library).
From the 11th to the 13th of June 2017, a group of seven individuals from five institutions came together to analyse a web archive collection at the Archives Unleashed datathon, held at the British Library as part of Web Archiving Week. The aim of Archives Unleashed is for programmers and researchers to come together to develop new strategies to analyse web archive collections. Our team was a mix of technical and curatorial staff, and we worked with the IIPC Content Development Group (CDG) National Olympic Committees collection.
The IIPC is a membership organization dedicated to improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. The Content Development Group is a subgroup of the IIPC and specialises in building collaborative international web archive collections.
Previous CDG collections include:
- Summer/Winter Olympics/Paralympics (2010-2016)
- National Olympic & Paralympic Committees (2016- )
- European Refugee Crisis (2015-2016)
- World War One Commemoration (2015- )
- International Cooperative Organizations (2015- )
All these collections can be viewed from here: https://archive-it.org/home/iipc
What we tried to do
We initially started with the idea of working with the London 2012 and Rio 2016 Olympic collaborative crawl collections, but both of these data sets were too large for us to work with in the short time frame we had. This is why we decided to work with the National Olympic and Paralympic Committees collection.
Our research question was “What is the gender distribution of National Olympic Committees?”.
The data we had
The 2016 National Olympic and Paralympic Committees collection is a comprehensive collection of national Olympic/Paralympic committee websites drawn from official IOC sites. In 2016, 191 seeds were crawled, as not all International Olympic Committee member countries have a website.
The 191 seeds translated into 152 WARC files totalling 294 GB. However, there was an issue when the files were downloaded, and a number of them were corrupted. After programmatically separating the corrupted files from the good ones, we had 76 WARC files (74 GB) to work with. Although this was only 50% of the collection's WARC files, it was more than enough data to work with over the two days.
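The post does not say how the corrupted files were detected. One common approach, shown here as a minimal sketch and not as the team's actual script, is to try to decompress each file to the end and set aside those that fail. It assumes the files are gzip-compressed (`*.warc.gz`) and that corruption shows up as a gzip stream that cannot be fully read; the function names are hypothetical.

```python
import gzip
import zlib
from pathlib import Path


def is_readable_gzip(path, chunk_size=1024 * 1024):
    """Return True if the whole gzip stream decompresses without error."""
    try:
        with gzip.open(path, "rb") as stream:
            # Read to the end so that truncation or checksum errors surface.
            while stream.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # BadGzipFile is a subclass of OSError; truncated files raise EOFError.
        return False


def split_warcs(directory):
    """Partition the *.warc.gz files in a directory into (good, corrupted)."""
    good, corrupted = [], []
    for path in sorted(Path(directory).glob("*.warc.gz")):
        (good if is_readable_gzip(path) else corrupted).append(path)
    return good, corrupted
```

Calling `split_warcs("/path/to/warcs")` returns two lists of paths. A check like this only validates the compression layer; it does not confirm that the WARC records inside are well formed.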
After the technical team had isolated the usable WARCs and looked at the tools available to run our analysis, we decided to narrow our research question to “What is the gender distribution of English-speaking National Olympic Committees?”. As the tools used to run this analysis were developed in North America, there is a bias towards English-language names. The curatorial team identified all the English-speaking countries represented in the full collection. We used this list to filter out non-English-speaking countries from the clean WARCs, giving us a smaller subset on which to run our analyses. The usable WARCs covered seven English-speaking countries.
How we worked on it
The organizers prepared several Linux virtual machines specifically for the hackathon. This made the WARC files easily accessible, spared participants from transferring large amounts of data, and ensured that there was enough processing capacity. We started by installing three tools that we had identified as useful on a designated virtual machine:
– Warcbase, an open source platform to facilitate the analysis and processing of web archives with Hadoop and Apache Spark. It provides tools to extract content, filter it down, and then analyze, aggregate and visualize it. (Note that Warcbase has now been superseded by The Archives Unleashed Toolkit.)
– Stanford Named Entity Recognizer (NER), a key tool for our analysis that is included with Warcbase. It identifies and labels sequences of words in a text that are the names of things, particularly for three classes: person, organization and location.
– Finally, OpenGenderTracking, another open source tool, which provides a framework for identifying a person’s likely gender based on their first name.
Step 1 of the analysis consisted of extracting all named entities from the WARCs using Warcbase and NER, with a Scala script derived from the sample scripts provided with Warcbase. The output was a list (in JSON format) of domain records, each with its extracted PERSON, ORGANIZATION and LOCATION entities and their frequency of occurrence.
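To make the shape of this output more concrete, here is a small Python sketch that flattens one such record into per-name rows. The record layout and field names (`date`, `domain`, and a name-to-frequency mapping under `PERSON`) are assumptions made for illustration and are not the exact output of the Warcbase script; the two names and counts are taken from the sample results shown in this post.

```python
import json

# Hypothetical record layout: the field names are assumptions,
# not the exact format produced by Warcbase.
SAMPLE_RECORD = """
{"date": "20160329",
 "domain": "http://www.paralympic.org.au",
 "PERSON": {"Carlee Beattie": 4, "Sam Carter": 2},
 "ORGANIZATION": {},
 "LOCATION": {}}
"""


def person_rows(record_json):
    """Yield (date, domain, name, frequency) tuples for PERSON entities."""
    record = json.loads(record_json)
    for name, frequency in sorted(record.get("PERSON", {}).items()):
        yield record["date"], record["domain"], name, frequency


for row in person_rows(SAMPLE_RECORD):
    print(row)
```

Keeping only the PERSON entities at this stage means the later name-matching step does not need to know anything about organizations or locations.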
In step 2, with a Python script, we matched the extracted PERSON names with a framework containing a large structured list of first names built from the US census and the probability of each being a male or a female first name. The output was a result list (in CSV format) of this association.
20160329, http://www.paralympic.org.au, Cochrane, 3, No Match
20160329, http://www.paralympic.org.au, Sam Carter, 2, Male
20160329, http://www.paralympic.org.au, Alistair, 1, Male
20160329, http://www.paralympic.org.au, Carlee Beattie, 4, Female
20160329, http://www.paralympic.org.au, Ernest Van Dyk, 1, Male
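The matching logic of step 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not the script used at the hackathon: the lookup table, its probability values and the `threshold` parameter are all hypothetical placeholders, whereas the real reference list was built from US census data.

```python
# Hypothetical lookup: first name -> probability that it is a female name.
# The values are illustrative placeholders, not census figures.
FEMALE_PROBABILITY = {
    "sam": 0.1,
    "alistair": 0.0,
    "carlee": 1.0,
    "ernest": 0.0,
    "jordan": 0.5,
}


def guess_gender(full_name, table=FEMALE_PROBABILITY, threshold=0.75):
    """Classify a PERSON entity by looking up its first token."""
    tokens = full_name.split()
    if not tokens:
        return "No Match"
    first_name = tokens[0].lower()
    if first_name not in table:
        # The name does not appear in the reference list at all.
        return "No Match"
    p_female = table[first_name]
    if p_female >= threshold:
        return "Female"
    if p_female <= 1 - threshold:
        return "Male"
    # The name is listed, but neither gender is likely enough.
    return "Unknown"


for name in ["Cochrane", "Sam Carter", "Carlee Beattie", "Jordan Smith"]:
    print(name, "->", guess_gender(name))
```

Because only the first token is looked up, an entity that consists of a surname alone, such as "Cochrane", falls through to "No Match", which is consistent with the sample rows above.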
The analysis was run on two subsets of the data:
– the committee pages: 16 of them contained entities (a small set that was fast to process);
– the entire collection: 1,251 pages contained entities (a bigger set that took a few hours to process).
All scripts developed during the hackathon can be accessed on GitHub:
- Gender representation by country of the seven English-speaking countries identified in the set.
- ‘No match’ means the name didn’t appear in the reference source for identifying names.
- ‘Unknown’ means the reference source couldn’t identify whether the name was male or female.
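A per-site tally of this kind could be derived from the step 2 CSV along the following lines. This is a sketch, not the team's script: it assumes the fourth column is the number of times the name occurred and weights each name by that number, and the function name is hypothetical. The rows are the sample output shown earlier in this post.

```python
import csv
import io
from collections import Counter, defaultdict

# Rows copied from the sample output shown earlier in this post.
SAMPLE_CSV = """\
20160329, http://www.paralympic.org.au, Cochrane, 3, No Match
20160329, http://www.paralympic.org.au, Sam Carter, 2, Male
20160329, http://www.paralympic.org.au, Alistair, 1, Male
20160329, http://www.paralympic.org.au, Carlee Beattie, 4, Female
20160329, http://www.paralympic.org.au, Ernest Van Dyk, 1, Male
"""


def gender_counts_by_site(csv_text):
    """Sum name occurrences per site and per gender label."""
    totals = defaultdict(Counter)
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if len(row) != 5:
            continue  # skip blank or malformed lines
        _date, site, _name, frequency, gender = row
        totals[site][gender] += int(frequency)
    return totals


for site, counts in gender_counts_by_site(SAMPLE_CSV).items():
    print(site, dict(counts))
```

Counting each distinct name once instead of weighting by frequency would be an equally defensible choice and would give different proportions, so the weighting should be stated alongside any chart built from these numbers.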
Alternative research question?
Each National Committee has official partners that sponsor their participation in the Olympics. When we ran the entity extraction for corporations, it raised further questions about what percentage of the site is taken up with references to commercial sponsorship. The gender and corporation names are just two of many entities that could be extracted from the data set using this methodology.
What we got out of it
Sara Aubry, Bibliothèque nationale de France
“My participation to the hackathon was linked to BnF current efforts in engaging researchers to use web archives as data sets. We aimed at discussing research topic ideas, learning how to use available open source tools, tackling limitations and sharing practices among participants. What I liked most was the hackathon model itself that challenged us into collaborative work in a very short period of time. I guess a little more time would have been useful to explore and compare the results of the analysis we ran.”
Pamela Graham, Columbia University
“I enjoyed our sub-groupings into programmers/technical experts and curators (forgive this oversimplification). As a curator, I needed a better understanding of the process of working with web archive data. Since I don’t have programming skills, this was more of a conceptual exercise than a practical one. I gained a good, first-hand sense of the issues and challenges of analyzing web data. But even more helpful was the attempt the curators made to evaluate the collection–how and why were the sites selected and what’s missing? This is really important to interpreting the results and reinforced for me the importance of curation. I greatly benefited from talking with Helena and Gillian on these issues.”
Gethin Rees, British Library
“Having recently started as a curator working with digital collections at the British Library I was keen to learn about web archives. I was also intent on improving my use of Python for data science. I loved being introduced to new technologies like Hadoop and connecting to powerful computers in North America. Next time I would try to get stuck in to processing some WARC files independently.”
Gillian Lee, National Library of New Zealand
“I wanted to see what tools were available to help people analyse data in web archives. The collaborative aspect was great. I discovered you have to refine and reduce your data set quite substantially and that the scope and provenance of the collections is really important for researchers. I don’t feel I’m any closer to actually using Warcbase myself (yet), but I had more of an understanding of the kind of research that could be done using Warcbase and associated tools. Given the time frame we were working in and the amount of corrupted data we encountered, I would say the process was more valuable than the output!”
Helena Byrne, British Library
“For me as a curator my expectations of how things work were quite different from the reality but the overall experience was still good as it gave me a better understanding of the process. It was also useful to discuss the differences I had in expectation and reality with Pamela and Gillian as we were able to come up with ways we could assist the technical team.”