Web Archivists, Assemble!

By Alex Thurman, Columbia University Libraries, Member of the IIPC Steering Committee and the WAC Program Committee (2016-2018), Co-Chair of the Content Development Group

The IIPC General Assembly & Web Archiving Conference is the professional gathering I anticipate most eagerly each year. In an energizing atmosphere of international cooperation, web curators, librarians, archivists, tool developers, computer scientists, and academic researchers from member organizations and beyond meet to share experiences and best practices and plan projects to tackle the collective challenge of preserving web resources.

I’ve had the good fortune of attending each year since 2012, and for the past three years I’ve also had the rewarding experience of serving on the program committees planning these events. As we look forward to the exciting upcoming 2018 conference in Wellington, New Zealand, here is some background on the recent evolution of the GA/WAC and the work of the 2018 WAC Program Committee.

Recent background

2018 marks the fifteenth anniversary of the IIPC, and the twelfth consecutive year that members of the IIPC will come together in an annual General Assembly. The IIPC Steering Committee has striven to cycle (loosely, as dependent on members volunteering to host the event) the venue of the GA/WAC in alternate years between Europe, North America and Australasia. And from the start, the GA event programs have combined days reserved for IIPC members (focused on Consortium planning and working group activities) with one or more open days to welcome the perspectives and expertise of the wider web archiving community and of researchers.

To emphasize this aspect of outreach to researchers and promoting awareness of web archiving, the Steering Committee has in recent years opted to formalize the “open days” as a distinct event—the IIPC Web Archiving Conference. The 2016 event was the first to thus distinguish the General Assembly from the Web Archiving Conference, and thereafter, at the suggestion of that PC’s Chair (Kristinn Sigurðsson, National and University Library of Iceland), planning responsibility for the different event components became more distributed: the GA program would be determined by the Steering Committee Officers and Portfolio Leads and the Working Group Chairs; a mostly local Organizing Committee would see to the logistical planning of securing a venue and catering and possible sponsors; and the Web Archiving Conference program would be developed by a Program Committee. The 2017 Program Committee (chaired by Nicholas Taylor, Stanford University) was the first to include some non-IIPC members, and their CFP was the first to attract more relevant submissions than we had space to accept, a milestone in the maturation of the conference.

Work of the 2018 Program Committee

Co-chaired by Jan Hutař (Archives New Zealand) and Paul Koerbin (National Library of Australia), the 12-member 2018 Program Committee started work in November 2017. Our first task was drafting a call for papers, which involved first discussing whether the conference would have a stated theme, which types of submission proposals (presentations, panels, workshops, tutorials) we’d ask for, and the nature of the submission (abstracts? full papers?). We needed a flexible theme that would acknowledge the IIPC’s milestone 15th anniversary and the value of our collective work preserving the web so far, while embracing creative new approaches to the evolving challenges we face. In his draft CFP, Paul Koerbin hit on “Web Archiving Histories and Futures” and we ran with that. And as the Wellington event will be the first GA/WAC held in Australasia in 10 years, we especially encouraged submissions related to Asia/Pacific web archiving activities.

To encourage submissions from all types of web archiving practitioners and users, in the CFP we further listed some suggested topics, under the rubrics of “building web archives,” “maintaining web archive content and operations,” “using and researching web archives,” and “web archive histories and futures.” And we opted to ask applicants to submit abstracts only rather than full papers, both to lower the barriers to application in order to get more submissions, and to allow all Program Committee members to consider (and vote on) all submissions, rather than assigning reviewers to specific papers. Once the CFP was ready, PC members worked hard to distribute it to a wide selection of mailing lists, reaching beyond IIPC members and other cultural memory institutions to also get submissions from independent researchers.

This strategy worked (boosted no doubt by the intrinsic appeal of visiting Wellington!), as we received a record number of submissions for the WAC, submitted through EasyChair. The breadth and depth of interesting submissions allowed us to build a strong program, while unfortunately having to reject some relevant proposals. Each committee member read all the submitted abstracts and rated each one on a 3-point scale, yielding an average score for each submission from which the committee could decide which submissions would be accepted for the conference. In order to know how many submissions could be accepted we first had to consider how much conference schedule time we had available, which would depend in part on whether we would have multiple tracks.

We decided the program would have a mix of plenary talks and usually two tracks of presentations or workshops, and Olga Holownia (IIPC Program & Communication Officer) provided a range of detailed schedule templates for us to use to figure out how many individual presentations, panels, and workshops we’d have room for. We then began grouping accepted proposals into thematic sessions, loosely conceived as more-technical and less-technical tracks, in order to reduce (though not eliminate) the frustration of attendees wishing they could be in both tracks at once. Committee members then divided up the responsibility of serving as session chairs, to introduce the speakers and keep the sessions running on time.

Between the tasks of preparing the CFP and evaluating the submissions and shaping them into a program, the committee had the additional enjoyable responsibility of brainstorming possible keynote speaker candidates. Committee members suggested over two dozen possible keynoters, voted on them, and eventually submitted a few outstanding candidates to the Organizing Committee for their consideration. The Organizing Committee took these suggestions and added others based on their familiarity with the Australasian digital library and academic scene and delivered two exciting keynote speakers – Wendy Seltzer (World Wide Web Consortium) and Rachael Ka’ai-Mahuta (Te Ipukarea, the National Maori Language Institute, Auckland Institute of Technology) – and an additional plenary talk from Vint Cerf (Google). With these and many other talented contributors from within and beyond IIPC member institutions, the 2018 IIPC Web Archiving Conference looks to be a rich and stimulating event.

Register now!

Serving on the WAC Program Committee is a great opportunity to work directly with IIPC colleagues and other web archiving enthusiasts. And the work continues – you can volunteer now to serve on the Program Committee and start shaping the 2019 IIPC WAC.


A personal reflection on the IIPC WAC

By Gillian Lee, Coordinator, Web Archives at the National Library of New Zealand, Member of the IIPC Steering Committee and the WAC Program Committee

This year I’ve had the privilege of being part of the programme committee for IIPC WAC. Reading through the abstracts that many of you sent in gave me a real sense of excitement about the work that we are all involved in. That caused me to reflect on the benefits of the IIPC conference and what it means to us as members. Some of you might attend these conferences on a regular basis, others may never have had that opportunity.

I’ve been web archiving for 11 years and have been fortunate to attend 3 IIPC conferences during that time. It’s rare for me to attend a conference that’s actually about the work I do, so I really value those times! It’s an opportunity to finally meet people, who were formerly just names on mailing lists and blog posts. Getting together with other web archivists is invaluable, whether it’s talking to someone who is just starting out in the web archiving world, sharing the struggles of budget constraints, or learning more about what members are doing. You can’t beat that!

Even in this digital age it’s easy to feel isolated here in New Zealand when we hear so much about web archiving developments, especially in Europe and the States. There’s only so much you can learn from emails, blog posts and the odd webinar that’s not scheduled for 2am NZ time!!

Despite the distance, we have collaborated with other IIPC members over the years. Back in 2006 the National Library of New Zealand worked with the British Library to build the Web Curator Tool (WCT). The BL have moved on and developed other tools since then, and this year we’ve collaborated with the National Library of the Netherlands on a major upgrade to WCT. Kees Teszelszky blogged about this recently. You can find out more about it during the IIPC conference in Wellington in November.

We’ve also been involved with the Content Development Working Group by submitting seed lists to collaborative collections such as the Olympic Games, World War One Commemoration and the News around the World project. If you’re new to IIPC, do consider getting involved in one of the IIPC groups.

We’re really excited to be hosting IIPC this year and look forward to meeting you all in person! A number of my colleagues have never had the chance to attend an IIPC conference, so they’re in for a treat! See you soon!

National Library of New Zealand, Photo by Mark Beatty / CC BY-NC 3.0 NZ.

Welcome to WAC in Wellington

By Peter McKinney, Digital Preservation Policy Analyst at the National Library of New Zealand and the Chair of the IIPC 2018 General Assembly and Web Archiving Conference Organising Committee

National Library of New Zealand Te Puna Mātauranga o Aotearoa.

I remember my first time in New Zealand. It was wonderful. But I do remember commenting to my partner, as we sat on the tarmac in Auckland, that I couldn’t live here as it was too far away from anything (I lived in Scotland at the time).

Just over a year later I moved to Wellington.

I’m not sure whether this shows my unerring ability to change my mind at a whim, or the strength of what I found over here. I hope the latter. The travel for visitors is well worth it. Wellington and New Zealand are amazing. And while the work of the National Library has attracted a number of us to come and live our lives here, it is the country that makes it home.

It is therefore my great honour to be part of the team that is welcoming you here. The National Library of New Zealand feels greatly privileged to be hosting this year’s IIPC General Assembly and Web Archiving Conference. The Library has received great benefit from being a member of the IIPC over the years, and to be able to entice members and the wider web archiving community all the way down to the South Pacific is an amazing opportunity for us. We can open up participation to those who just have not been able to travel those distances up to the northern hemisphere. It is also a great chance for us to show off what we have down here.

I have two primary responsibilities in my role as Chair of the Organising Committee. The first is to ensure that IIPC members have a productive week. This means providing a comfortable environment where members can get their business done and enjoy everything Wellington has to offer. My second responsibility is to ensure that “locals” (New Zealanders and our Pacific neighbours) are able to take advantage of the experience and expertise that will be converging at the Library; this is a precious opportunity that will not come round again in the foreseeable future.

The website has a host of information about the GA and WAC, and I encourage you to check it out (and get in touch if you need more information). Alex Thurman has written about the work of the programme committee pulling together what is a brilliant selection of papers, panels, posters, tutorials and workshops. Gillian Lee has also covered off what it means to staff in the National Library to be able to have the IIPC event down here in Wellington.

Personally, I can’t wait to hear from our keynote speakers (Rachael Ka’ai-Mahuta and Wendy Seltzer). They have been asked to challenge us and make us pause and consider what the future of web archiving may look like. Vint Cerf needs no introduction and we are incredibly grateful that he has accepted our invitation to share his current thinking with us. We’re also having a public event on Tuesday, which we will be announcing in the next few weeks.

The week will be busy and, hopefully, productive and inspiring. I also can’t encourage you enough to explore Wellington and beyond if you have time. There is, of course, plenty of time to sleep on the plane on the way back!

Archives Unleashed at the British Library: Study of gender distribution in National Olympic Committees

Who we are

Sara Aubry (National Library of France), Helena Byrne (British Library), Naomi Dushay (Stanford University), Pamela Graham (Columbia University), Andy Jackson (British Library), Gillian Lee (National Library of New Zealand) and Gethin Rees (British Library).

From the 11th to 13th of June 2017 a group of seven individuals from five institutions came together to analyse a web archive collection at a datathon held at the British Library as part of the Web Archiving Week. The aim of Archives Unleashed is for programmers and researchers to come together to develop new strategies to analyse web archive collections. Our team was a mix of technical and curatorial staff, and we were working with the IIPC Content Development Group (CDG) National Olympic Committees collection.

The IIPC Content Development Group

The IIPC is a membership organization dedicated to improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. The Content Development Group is a subgroup of the IIPC and specialises in building collaborative international web archive collections.

Previous CDG collections include:

  • Summer/Winter Olympics/Paralympics (2010-2016)
  • National Olympic & Paralympic Committees (2016- )
  • European Refugee Crisis (2015-2016)
  • World War One Commemoration (2015- )
  • International Cooperative Organizations (2015- )

All these collections can be viewed from here: https://archive-it.org/home/iipc

What we tried to do

We initially started with the idea of working with the London 2012 and Rio 2016 Olympic collaborative crawl collections, but both these data sets were too large for us to work with in the short time frame we had. This is why we decided to work with the National Olympic and Paralympic Committees collection.

Our research question was “What is the gender distribution of National Olympic Committees?”.

The data we had

The 2016 National Olympic and Paralympic Committees collection is a comprehensive collection of national Olympic/Paralympic committees drawn from IOC official sites. In 2016, 191 seeds were crawled, as not all International Olympic Committee member countries have a website.

The 191 seeds translated into 152 WARC files totalling 294 GB. However, there was an issue when the files were downloaded, and a number of them were corrupted. After programmatically separating the corrupted files from the good ones, there were 76 WARC files (74 GB) to work with. Although this was only 50% of the collection’s files (and about a quarter of its data by size), it was more than enough data to work with over the two days.
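We did not record here exactly how the corrupted files were separated from the good ones. As a minimal sketch, assuming the WARC files are gzip-compressed (`*.warc.gz`) and that a "corrupted" file is one whose gzip stream cannot be read to the end, the check could look something like the following. The function names `is_readable_gzip` and `split_warcs` are hypothetical and not taken from the hackathon scripts.

```python
import gzip
import zlib
from pathlib import Path


def is_readable_gzip(path, chunk_size=1024 * 1024):
    """Return True if the whole gzip stream at `path` decompresses cleanly."""
    try:
        with gzip.open(path, "rb") as stream:
            # Read to the end; truncated or damaged files raise partway through.
            while stream.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers gzip.BadGzipFile (bad header or CRC mismatch),
        # EOFError covers truncated downloads.
        return False


def split_warcs(directory):
    """Partition the *.warc.gz files in `directory` into (good, corrupted)."""
    good, corrupted = [], []
    for path in sorted(Path(directory).glob("*.warc.gz")):
        (good if is_readable_gzip(path) else corrupted).append(path)
    return good, corrupted
```

This only confirms that the compression layer is intact. A stricter check would also parse the WARC records themselves with a dedicated WARC library.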

After the technical team isolated the usable WARCs and had a look at the tools available to run our analysis, we decided to scale down our research question to “What is the gender distribution of English speaking national Olympic Committees?”. As the tools used to run this analysis were developed in North America, there is a bias towards English language names. The curatorial team identified all the English speaking countries that were represented in the full collection. We used this list to filter out non-English speaking countries from the clean WARCs so that we would have a smaller subset to run our analyses on. The usable WARCs had seven English speaking countries.

The 7 English speaking countries identified in the set.
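To illustrate the filtering step, here is a minimal Python sketch that keeps only URLs whose host is on a curated allow-list. It is an assumption about how such a filter could work, not the script we ran. Only `paralympic.org.au` comes from our data; the second host and the names `ENGLISH_SPEAKING_HOSTS`, `host_of` and `keep_english` are hypothetical placeholders.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list built by the curators. Only paralympic.org.au
# appears in our results; the other entry is a made-up placeholder.
ENGLISH_SPEAKING_HOSTS = {"paralympic.org.au", "example-noc.org"}


def host_of(url):
    """Return the lowercased host of `url` without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def keep_english(urls, allowed=ENGLISH_SPEAKING_HOSTS):
    """Keep only the URLs whose host is on the allow-list."""
    return [url for url in urls if host_of(url) in allowed]


urls = ["http://www.paralympic.org.au/team", "http://www.example.fr/"]
print(keep_english(urls))  # ['http://www.paralympic.org.au/team']
```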

How we worked on it

The organizers prepared several Linux virtual machines specifically for the hackathon so that the WARC files were easily accessible, participants didn’t have to transfer large amounts of data, and there was enough processing capacity. We started by installing three tools that we had identified as being useful on a designated virtual machine:
– Warcbase, an open source platform to facilitate the analysis and processing of web archives with Hadoop and Apache Spark. It provides tools to extract content, filter it down, and then analyze, aggregate and visualize it. (Note that Warcbase has now been superseded by The Archives Unleashed Toolkit.)
– Stanford Named Entity Recognizer (NER), a key tool for our analysis that is included with Warcbase. It gives the ability to identify and label sequences of words in a text which are the names of things, particularly for the 3 classes person, organization and location.
– finally, OpenGenderTracking, another open source tool, which gives a framework to identify the likely gender based on a person’s first name.

Step 1 of the analysis consisted of extracting all named entities from the WARCs using Warcbase and NER, with a Scala script derived from the sample scripts provided with Warcbase. The output was a list (in JSON format) of domain records, each with its extracted PERSON, ORGANIZATION and LOCATION entities and their frequency of occurrence.

In step 2, with a Python script, we matched the extracted PERSON names with a framework containing a large structured list of first names built from the US census and the probability of each being a male or a female first name. The output was a result list (in CSV format) of this association.
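The matching in step 2 can be pictured with a small Python sketch. This is not the OpenGenderTracking code: the probability table holds made-up placeholder values, the 0.75 threshold is an assumption, and treating in-between probabilities as ‘Unknown’ is our reading of that label. The names `MALE_PROBABILITY` and `guess_gender` are hypothetical.

```python
# Placeholder table: first name -> probability that the name is male.
# The values are invented for illustration, not taken from the US census.
MALE_PROBABILITY = {"sam": 0.9, "carlee": 0.02, "alex": 0.5}


def guess_gender(person, table=MALE_PROBABILITY, threshold=0.75):
    """Label a PERSON entity by looking up its first token in `table`."""
    tokens = person.split()
    first_name = tokens[0].lower() if tokens else ""
    if first_name not in table:
        return "No Match"  # name absent from the reference list
    probability = table[first_name]
    if probability >= threshold:
        return "Male"
    if probability <= 1 - threshold:
        return "Female"
    return "Unknown"  # the reference list is ambiguous for this name


for person in ["Sam Carter", "Carlee Beattie", "Cochrane", "Alex Smith"]:
    print(person, "->", guess_gender(person))
```

Using only the first token explains why an entity that is just a surname, such as "Cochrane", ends up as "No Match".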

Snippet:
20160329, http://www.paralympic.org.au, Cochrane, 3, No Match
20160329, http://www.paralympic.org.au, Sam Carter, 2, Male
20160329, http://www.paralympic.org.au, Alistair, 1, Male
20160329, http://www.paralympic.org.au, Carlee Beattie, 4, Female
20160329, http://www.paralympic.org.au, Ernest Van Dyk, 1, Male
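As a rough illustration of how rows like these can be turned into a gender distribution, the sketch below tallies the five rows above with Python's `csv` module. It assumes the columns are crawl date, URL, extracted name, frequency of occurrence and gender label; the function name `tally` is hypothetical and this is not the script from our repository.

```python
import csv
import io
from collections import Counter

SNIPPET = """\
20160329, http://www.paralympic.org.au, Cochrane, 3, No Match
20160329, http://www.paralympic.org.au, Sam Carter, 2, Male
20160329, http://www.paralympic.org.au, Alistair, 1, Male
20160329, http://www.paralympic.org.au, Carlee Beattie, 4, Female
20160329, http://www.paralympic.org.au, Ernest Van Dyk, 1, Male
"""


def tally(csv_text):
    """Count distinct names and total mentions per gender label."""
    names, mentions = Counter(), Counter()
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if not row:
            continue  # skip blank lines
        _date, _url, _name, count, label = row
        names[label] += 1
        mentions[label] += int(count)
    return names, mentions


names, mentions = tally(SNIPPET)
print(dict(names))     # {'No Match': 1, 'Male': 3, 'Female': 1}
print(dict(mentions))  # {'No Match': 3, 'Male': 4, 'Female': 4}
```

Counting names and counting mentions can give different pictures, so it is worth stating which one a chart is based on.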

The analysis was run on two sub datasets:
– the committee pages: 16 of them contained entities (which was small and fast to process);
– the entire collection: 1,251 pages contained entities (which was bigger and took a few hours to process).

Step 3 consisted of adapting JavaScript scripts to visualize the results of the named entity recognition and the gender distribution as web graphs.

All scripts developed during the hackathon can be accessed on Github:

https://github.com/ukwa/archives-unleashed-olympics

Results

The gender distribution within the subset of the collection.

Gender representation by country of the 7 English speaking countries identified in the set.
  • ‘No match’ means the name didn’t appear in the reference source for identifying names.
  • ‘Unknown’ means the reference source couldn’t identify whether the name was male or female.

Male/female representation over the complete dataset of 76 WARC files regardless of language.

The gender distribution within the subset of the collection and the overall data set showed that males are more represented than females on National Olympic Committees.

Alternative research question?

Each National Committee has official partners that sponsor their participation in the Olympics. When we ran the entity extraction for corporations, it raised further questions about what percentage of the site is taken up with references to commercial sponsorship. The gender and corporation names are just two of many entities that could be extracted from the data set using this methodology.

What we got out of it  

Sara Aubry, Bibliothèque nationale de France

“My participation in the hackathon was linked to the BnF’s current efforts in engaging researchers to use web archives as data sets. We aimed at discussing research topic ideas, learning how to use available open source tools, tackling limitations and sharing practices among participants. What I liked most was the hackathon model itself, which challenged us into collaborative work in a very short period of time. I guess a little more time would have been useful to explore and compare the results of the analysis we ran.”

Pamela Graham, Columbia University

“I enjoyed our sub-groupings into programmers/technical experts and curators (forgive this oversimplification). As a curator, I needed a better understanding of the process of working with web archive data. Since I don’t have programming skills, this was more of a conceptual exercise than a practical one. I gained a good, first-hand sense of the issues and challenges of analyzing web data. But even more helpful was the attempt the curators made to evaluate the collection–how and why were the sites selected and what’s missing? This is really important to interpreting the results and reinforced for me the importance of curation. I greatly benefited from talking with Helena and Gillian on these issues.”

Gethin Rees, British Library

“Having recently started as a curator working with digital collections at the British Library, I was keen to learn about web archives. I was also intent on improving my use of Python for data science. I loved being introduced to new technologies like Hadoop and connecting to powerful computers in North America. Next time I would try to get stuck in to processing some WARC files independently.”

Gillian Lee, National Library of New Zealand

“I wanted to see what tools were available to help people analyse data in web archives. The collaborative aspect was great. I discovered you have to refine and reduce your data set quite substantially and that the scope and provenance of the collections is really important for researchers. I don’t feel I’m any closer to actually using Warcbase myself (yet), but I had more of an understanding of the kind of research that could be done using Warcbase and associated tools. Given the time frame we were working in and the amount of corrupted data we encountered, I would say the process was more valuable than the output!”

Helena Byrne, British Library

“For me as a curator my expectations of how things work were quite different from the reality but the overall experience was still good as it gave me a better understanding of the process. It was also useful to discuss the differences I had in expectation and reality with Pamela and Gillian as we were able to come up with ways we could assist the technical team.”
