So far there have been over 1,360 nominations from at least 28 countries around the world. As you can see from the map of the world, there is a high concentration from Europe as many IIPC members are based there. However, as you zoom in on the map of European nominations, there are still many gaps.
This is your chance to get involved in the collection phase by nominating online content that you are reading or using for research, or content from a country whose language you know. We are trying to get as many pins on the map from around the world as possible. Bear in mind that some of the pins already there may represent just one website nomination so far. Even if you see a pin on your country or another country where you speak the language, we still want your nominations.
Just to remind you, what we want to collect:
Public platforms in various formats such as:
Subsections of websites with an Olympic tag
Blogs and Social Media
The subjects covered on these sites can include but are not limited to:
Sara Aubry (National Library of France), Helena Byrne (British Library), Naomi Dushay (Stanford University), Pamela Graham (Columbia University), Andy Jackson (British Library), Gillian Lee (National Library of New Zealand) and Gethin Rees (British Library).
From the 11th to 13th of June 2017 a group of seven individuals from five institutions came together to analyse a web archive collection at a datathon held at the British Library as part of the Web Archiving Week. The aim of Archives Unleashed is for programmers and researchers to come together to develop new strategies to analyse web archive collections. Our team was a mix of technical and curatorial staff, and we were working with the IIPC Content Development Group (CDG) National Olympic Committees collection.
The IIPC is a membership organization dedicated to improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. The Content Development Group is a subgroup of the IIPC and specialises in building collaborative international web archive collections.
We initially started with the idea of working with the London 2012 and Rio 2016 Olympic collaborative crawl collections, but both of these data sets were too large for us to work with in the short time frame we had. This is why we decided to work with the National Olympic and Paralympic Committees collection.
Our research question was “What is the gender distribution of National Olympic Committees?”.
The data we had
The 2016 National Olympic and Paralympic Committees collection is a comprehensive collection of national Olympic/Paralympic committees drawn from IOC official sites. In 2016, 191 seeds were crawled, as not all International Olympic Committee member countries have a website.
The 191 seeds translated into 152 WARC files totalling 294 GB. However, there was an issue when the files were downloaded and a number of the files were corrupted. After programmatically separating the corrupted files from the good ones, there were 76 WARC files totalling 74 GB to work with. Although this was only 50% of the collection's WARC files (and about a quarter of its data by size), it was more than enough data to work with over the two days.
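For readers curious what this kind of separation can look like, here is a minimal Python sketch. It assumes the WARCs are gzip-compressed (`.warc.gz`) and treats a file as corrupted when it cannot be decompressed from start to finish. The function names and the directory layout are hypothetical, and this is not the script the team used at the datathon.

```python
import gzip
import zlib
from pathlib import Path


def is_readable_gzip(path, chunk_size=1024 * 1024):
    """Return True if the whole file decompresses without error."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read in chunks so a large WARC is never held in memory at once.
            while handle.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # Truncated or damaged gzip data raises one of these exceptions.
        return False


def split_warcs(directory):
    """Sort the *.warc.gz files in a directory into (good, corrupted) lists."""
    good, corrupted = [], []
    for path in sorted(Path(directory).glob("*.warc.gz")):
        (good if is_readable_gzip(path) else corrupted).append(path)
    return good, corrupted
```

Calling `split_warcs("/path/to/warcs")` (a placeholder path) would return the two lists. A check like this only catches damage at the compression level; a file that decompresses cleanly could still contain malformed WARC records.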
After the technical team isolated the usable WARCs and had a look at the tools available to run our analysis, it was decided to scale down our research question to “What is the gender distribution of English speaking national Olympic Committees?”. As the tools used to run this analysis were developed in North America, there is a bias towards English language names. The curatorial team identified all the English speaking countries that were represented in the full collection. We used this list to filter out non-English speaking countries from the clean WARCs so that we would have a smaller subset on which to run our analyses. The usable WARCs covered seven English speaking countries.
How we worked on it
Several Linux virtual machines were prepared by the organizers specifically for the hackathon so that the WARC files were easily accessible and the participants didn’t have to transfer large amounts of data and also to ensure that there was enough processing capacity. We started by installing three tools that we had identified as being useful on a designated virtual machine:
– Warcbase, an open source platform to facilitate the analysis and processing of web archives with Hadoop and Apache Spark. It provides tools to extract content, filter it down, and then analyze, aggregate and visualize it. [Note that Warcbase has now been superseded by the Archives Unleashed Toolkit.]
– Warcbase also includes a key tool for our analysis: Stanford Named Entity Recognizer (NER) for named entity recognition. It gives the ability to identify and label sequences of words in a text which are the names of things, particularly for the 3 classes person, organization and location.
– finally, OpenGenderTracking, another open source tool, which gives a framework to identify the likely gender based on a person’s first name.
Step 1 of the analysis consisted of extracting all named entities from the WARCs using Warcbase and NER, with a Scala script derived from sample scripts provided with Warcbase. The output was a list (in JSON format) of domain records, each with its associated PERSON, ORGANIZATION and LOCATION entities and their frequency of occurrence.
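To give a feel for how output like this can be consumed downstream, here is a small Python sketch that totals PERSON frequencies across domain records. The JSON shape, the domains and the names in it are all invented for illustration; the real Warcbase output may well be structured differently.

```python
import json
from collections import Counter

# Hypothetical stand-in for the step 1 output: one record per domain, with
# entity frequencies grouped by type. Not real data from the collection.
SAMPLE_OUTPUT = """
[
  {"domain": "example-noc.org",
   "PERSON": {"Jane Smith": 3, "John Brown": 1},
   "ORGANIZATION": {"Example Olympic Committee": 5},
   "LOCATION": {"London": 2}},
  {"domain": "example-paralympic.org",
   "PERSON": {"Jane Smith": 2},
   "ORGANIZATION": {},
   "LOCATION": {"Rio": 4}}
]
"""


def person_counts(records):
    """Sum PERSON entity frequencies across all domain records."""
    totals = Counter()
    for record in records:
        # Counter.update adds the counts from a mapping to the running totals.
        totals.update(record.get("PERSON", {}))
    return totals


records = json.loads(SAMPLE_OUTPUT)
totals = person_counts(records)
print(totals.most_common())
```

Swapping `"PERSON"` for `"ORGANIZATION"` in the same function would give the equivalent totals for organisations.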
In step 2, with a Python script, we matched the extracted PERSON names with a framework containing a large structured list of first names built from the US census and the probability of each being a male or a female first name. The output was a result list (in CSV format) of this association.
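The matching idea can be sketched in a few lines of Python. This is an illustration only, not the script written at the hackathon and not the OpenGenderTracking API. The name table, its probabilities and the 0.1/0.9 cut-offs are assumptions made up for the example, and the labels follow the ‘No match’ and ‘Unknown’ categories used in the results.

```python
import csv
import io

# Hypothetical reference table: first name -> probability that it is a female
# first name. The values are invented and are not taken from the US census.
NAME_TABLE = {"jane": 0.99, "john": 0.01, "jordan": 0.5}


def classify(full_name, table=NAME_TABLE, male_below=0.1, female_above=0.9):
    """Label a PERSON entity by looking up its first token in the table."""
    tokens = full_name.strip().split()
    first = tokens[0].lower() if tokens else ""
    if first not in table:
        return "no match"  # the name is absent from the reference source
    p_female = table[first]
    if p_female >= female_above:
        return "female"
    if p_female <= male_below:
        return "male"
    return "unknown"  # the name is listed but too ambiguous to call


def to_csv(names):
    """Write name/gender pairs as CSV text, one row per extracted name."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "gender"])
    for name in names:
        writer.writerow([name, classify(name)])
    return buffer.getvalue()
```

For example, `to_csv(["Jane Smith", "Jordan Lee"])` yields a header row followed by one row per name. Taking the first token as the first name is itself a simplification: it would mislabel entities such as “President Obama”, and it carries the English language bias mentioned above.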
The analysis was run on two sub datasets:
– the committee pages: 16 of them contained entities (which was small and fast to process);
– the entire collection: 1,251 pages contained entities (which was bigger and took a few hours to process).
All scripts developed during the hackathon can be accessed on GitHub:
Gender representation by country of the 7 English speaking countries identified in the set.
‘No match’ means the name didn’t appear in the reference source for identifying names.
‘Unknown’ means the reference source couldn’t identify whether the name was male or female.
Alternative research question?
Each National Committee has official partners that sponsor their participation in the Olympics. When we ran the entity extraction for corporations, it raised further questions about what percentage of the site is taken up with references to commercial sponsorship. The gender and corporation names are just two of many entities that could be extracted from the data set using this methodology.
What we got out of it
Sara Aubry, Bibliothèque nationale de France
“My participation to the hackathon was linked to BnF current efforts in engaging researchers to use web archives as data sets. We aimed at discussing research topic ideas, learning how to use available open source tools, tackling limitations and sharing practices among participants. What I liked most was the hackathon model itself that challenged us into collaborative work in a very short period of time. I guess a little more time would have been useful to explore and compare the results of the analysis we ran.”
Pamela Graham, Columbia University
“I enjoyed our sub-groupings into programmers/technical experts and curators (forgive this oversimplification). As a curator, I needed a better understanding of the process of working with web archive data. Since I don’t have programming skills, this was more of a conceptual exercise than a practical one. I gained a good, first-hand sense of the issues and challenges of analyzing web data. But even more helpful was the attempt the curators made to evaluate the collection–how and why were the sites selected and what’s missing? This is really important to interpreting the results and reinforced for me the importance of curation. I greatly benefited from talking with Helena and Gillian on these issues.”
Gethin Rees, British Library
“Having recently started as a curator working with digital collections at the British Library I was keen to learn about web archives. I was also intent on improving my use of Python for data science. I loved being introduced to new technologies like Hadoop and connecting to powerful computers in North America. Next time I would try to get stuck in to processing some WARC files independently.”
Gillian Lee, National Library of New Zealand
“I wanted to see what tools were available to help people analyse data in web archives. The collaborative aspect was great. I discovered you have to refine and reduce your data set quite substantially and that the scope and provenance of the collections is really important for researchers. I don’t feel I’m any closer to actually using Warcbase myself (yet), but I had more of an understanding of the kind of research that could be done using Warcbase and associated tools. Given the time frame we were working in and the amount of corrupted data we encountered, I would say the process was more valuable than the output!”
Helena Byrne, British Library
“For me as a curator my expectations of how things work were quite different from the reality but the overall experience was still good as it gave me a better understanding of the process. It was also useful to discuss the differences I had in expectation and reality with Pamela and Gillian as we were able to come up with ways we could assist the technical team.”
The IIPC’s Content Development Working Group, which is leading an effort to build collaborative, global web archives on a variety of topics of interest to our members, is kicking off a new project that we are calling “Online News Around the World: A Snapshot in Time”. Our goal is to document online news websites during one week of the year from ALL of the countries in the world.
You read that right – ONLINE NEWS FROM ALL OF THE COUNTRIES IN THE WORLD. We never said IIPC members were entirely sane, did we? We know this is a lofty goal, but we have a few reasons for doing this:
raise global awareness of the critical need for web archiving;
raise awareness of the importance of preserving born-digital news;
create a cohesive and comprehensive collection that will engage researchers;
archive content from countries and regions not currently being archived by IIPC members.
“Week 46” November 14, 2017 – November 20, 2017. Strange idea? Maybe not… Week 46 was appointed “ordinary news week” back at the end of the 1990s, by Anker Brink Lund, philosophical doctor and professor at Copenhagen Business School. He wrote in 2014:
The project News week has its origin in an old dream I tried to realize for many years. My burning desire was not only to be able to analyze spectacular news cases, but also to map the journalistic feeding chain in general by registering ALL news in a specific period. This kind of project needed lots of money and many expert resources. In autumn 1999, I was given both, because the newly opened journalist studies in Odense needed trainee places for their students and because a parliamentary analysis group on the political power wanted to know more about the journalistic power in Denmark. Ever since, I have used week 46 for all kinds of media analyses, together with national and international research colleagues…1
The World Wide Web is more than twenty years old. The IIPC thinks it is time to include web news in this “extraordinary ordinary news week.”
We know we might not reach our lofty goal instantly, and that it will take some time to identify news sites from around the world. We plan to start gradually with a goal of news sites from IIPC member countries, at first. But, here is where you fit in. The Content Development Group needs your help! Please nominate 10 news sites from your country to our nomination tool: http://digital2.library.unt.edu/nomination/iipc-news/. Once we receive nominations, the Content Development Group will review the list to determine what set will be archived.
For more information about the project and to find out more about how to help, please contact the Project Team at firstname.lastname@example.org or reply to this blog post with your questions!
This past week, 22-23 September 2016, members of the IIPC gathered at the British Library for a hackathon focused on web crawling technologies and techniques. The event saw 14 technologists from 12 institutions near (the UK, Netherlands, France) and far (Denmark, Iceland, Estonia, the US and Australia). The event provided a rare opportunity for an intensive, two-day, uninterrupted deep dive into how institutions are capturing web content, and to explore opportunities for advancing the state of the art.
I was struck by the breadth and depth of topics. In particular…
Heritrix nuts and bolts. Everything from small tricks and known issues for optimizing captures with Heritrix 3, to how people were innovating around its edges, to the history of the crawler, to a wishlist for improving it (including better documentation).
Brozzler and browser-based capture. Noah Levitt from the Internet Archive, and the engineer behind Brozzler, gave a mini-workshop on the latest developments, and how to get it up and running. This was one of the biggest points of interest as institutions look to enhance their ability to capture dynamic content and social media. About ⅓ of the workshop attendees went home with fresh installs on their laptops. (Also note, per Noah, pull requests welcome!)
Technical training. Web archiving is a relatively esoteric domain without a huge community; how have institutions trained new staff or fractionally assigned staff to engage effectively with web archiving systems? This appears to be a major, common need, and also one that is approachable. Watch this space for developments…
QA of web captures: as Andy Jackson of the British Library put it, how can we tip the scales of mostly manual QA with some automated processes, to mostly automated QA with some manual training and intervention?
An up-to-date registry of web archiving tools. The IIPC currently maintains a list of web archiving tools, but it’s a bit dated (as these sites tend to become). Just to get the list in a place where tool users and developers can update it, a working copy of this list is now in the IIPC GitHub organization. Importantly, the group decided that it might be just as valuable to create a list of dead or deprecated tools, as these can often be dead ends for new adopters. See (and contribute to) https://github.com/iipc/iipc.github.io/wiki. Updates welcome!
System & storage architectures for web archiving. How institutions are storing, preserving and computing on the bits. There was a great diversity of approaches here, and this is likely good fodder for a future event and more structured knowledge sharing.
The biggest outcome of the event may have been the energy and inherent value in having engineers and technical program managers spending lightly structured face time exchanging information and collaborating. The event was a significant step forward in building awareness of approaches and people doing web archiving.
The participants committed to keeping the dialogue going, and to expanding the number of participants within and beyond IIPC. Slack is emerging as one of the main channels for technical communication; if you’d like to join in, let us know. We also expect to run multiple, smaller face-to-face events in the next year: 3 in Europe and another 2-3 in North America, with several delving into APIs, archiving time-based media, and access. (These are all in addition to the IIPC General Assembly and Web Archiving Conference on 27-30 March 2017, in Lisbon.) If you have an idea for a specific topic or would like to host an event, please let us know!
Many thanks to all the participants at last week’s hackathon, and to the British Library (especially Andy Jackson and Olga Holownia) for hosting it. It provided exactly the kind of forum needed by the web archiving community to share knowledge among practitioners and to advance the state of the art.
Today marks my final day as Chair of our Consortium. It has been an exciting and busy 17 months since I took on this role. I leave my post with a sense of accomplishment and pride in how the organization has evolved.
When I took over the role in January 2015, I made the commitment to work with the Steering Committee to ensure we modernized the governance and management structure of the IIPC to create a foundation that would allow us to grow and extend our reach. I am happy to say that we have accomplished just that.
As most of you know I am not a career Archivist or Librarian but I have been privileged to work with and learn from professionals within my home organization (Library and Archives Canada) as well as many of you from across the globe. I am pleased to hand over the reins to Emmanuelle Bermès from the National Library of France. She will bring not only deep management and leadership skills to the role, but also (and maybe more importantly) significant experience in the business of the IIPC. I think this balance of experience and competencies is what we need now.
I had the privilege of being involved in three General Assemblies (GA) and the associated conferences. I was continuously amazed by the level of engagement and interaction between the members. Based on the feedback I have received, this last GA and WAC set the bar – this is due in no small part to the leadership of Kristinn Sigurðsson.
As with any organization, the goal is to keep that level of engagement going virtually after the face-to-face meetings have ended. We still have much work to do on that front, but I am pleased that our new portfolio structure ensures that there will be dedicated resources and leadership for Birgit Nordsmark Henriksen (Netarchive.dk) and the Membership and Engagement Portfolio. Stay tuned for some steps to facilitate that year-long engagement.
The ecosystem that our respective organizations work in, and the one that the IIPC is trying to foster, is very complex and continues to include new players. Working alongside other organizations and associations will be key in delivering our mandate. Again we have ensured that we leverage partnerships with complementary organizations. Listen out for more from Hansueli Locher (Swiss National Library) and the group supporting the Partnership and Outreach Portfolio.
One of the areas that we heard loud and clear was that our members wanted help with tools. At some point I am sure that there will be more and more commercially available solutions for Web harvesting and archiving, but for now it is up to us as a community to rally together to build the tools to support our work led by Tom Cramer (Stanford University Libraries) and the Tools Development Portfolio.
I can say that one of the best ways to support our organization is to get involved. Whether you decide to apply for a position on the Steering Committee, or if you support one of the portfolios, or if you simply ‘lean in’ on some of the discussions that circulate via email – the goal is the same: get involved!
I want to thank my colleagues on the Steering Committee for supporting me (and putting up with me) over the past year and a half. As IIPC members, you can be confident that you have a steering committee which has your best interest at heart. Many excellent and passionate discussions have brought us to where we are today.
I also want to thank the Program and Communications team. In particular, I want to thank Jason Webber from the British Library. He and I worked closely together and spoke almost weekly in an effort to move the agenda forward. Jason (and now Olga) are the glue between the various activities of the Steering Committee and it is often a thankless job.
Lastly, I want to thank all of you – from the emails I received to the one-on-one discussions you have made sure that we heard your needs and expectations.
As they say, the best is yet to come…. so let’s step forward together.
Paul N. Wagner
Chair, International Internet Preservation Consortium
The International Internet Preservation Consortium (IIPC) renewed its consortial agreement at the end of 2015. In the process, it affirmed its longstanding mission to work collaboratively to foster the implementation of solutions to collect, preserve and provide access to Internet content. To achieve this aim, the Consortium is committed to “facilitate the development of appropriate and interoperable, preferably Open Source, software and tools.”
As the IIPC sets its strategic direction for 2016 and beyond, Tools Development will feature as one of three main portfolios of activity (along with Member Engagement, and Partnerships & Outreach). At its General Assembly in Reykjavik, IIPC members held a series of break out meetings to discuss Tools Development. This blog post presents some of that discussion, and lays out the beginnings of a direction for IIPC, and perhaps the web archiving community at large, to pursue in order to build a richer toolscape.
The Current State of Tools Development within the IIPC
The IIPC has always emphasized tool development. Per its website, one of the main objectives “has been to develop a high-quality, easy-to-use open source tools for setting up a web archiving chain.” The registry of software lists an impressive array of tools for everything from acquisition and curation to storage and access. And coming from the 2016 General Assembly and Web Archiving conference, it’s clear that there is actually quite a lot of development going on among and beyond member institutions. Despite all this, the reality may be slightly less rosy than the multitude of listings for tools for web archiving might indicate…
Many are deprecated, or worse, abandoned
Much of the local development is kept local, and not accessible to others for reuse or enhancement
There is a high degree of redundancy among development efforts, due to lack of visibility, lack of understanding, or lack of an effective collaborative framework for code exchange or coordinated development
Many of the tools are not interoperable with each other due to differences in approach in policy, data models or workflows (sometimes intentional, many times not)
Many of the big tools which serve as mainstays for the community (e.g., Heritrix for crawling, Open Wayback for replay) are large, monolithic, complex pieces of software that have multiple forks and less-than-optimal documentation
Given all this, one wonders if IIPC members really believe that coordinated tool development is important; perhaps instead it’s better to let a thousand flowers bloom? The answer to this last question was, refreshingly, a resounding NO. When discussed among members at Reykjavik, support for tools development as a top priority was unanimous, and enthusiastic. The world of the Web and web archiving is vast, yet the number of participants relatively small; the more we can foster a rich (and interoperable) tool environment, the more everyone can benefit in any part of the web archiving chain. Many members in fact said they joined IIPC expressly because they sought a collaboratively defined and community-supported set of tools to support their institutional programs.
In the words of Daniel Gomes from the Portuguese Web Archive: of course tool development is a priority for IIPC; if we don’t develop these tools, who will?
A Brighter Future for Collaborative Tool Development
Several possibilities and principles presented themselves as ways to enhance the way the web archiving community pursues tool development in the future. Interestingly, many of these were more about how the community can work together rather than specific projects. The main principles were:
Interoperability | modularity | APIs are key. The web archiving community needs a bigger suite of smaller, simpler tools that connect together. This promotes reuse of tools, as well as ease of maintenance; allows for institutions to converge on common flows but differentiate where it matters; enables smaller development projects which are more likely to be successful; and provides on ramps for new developers and institutions to take up (and add back to) code. Developing a consensus set of APIs for the web archiving chain is a clear priority and prerequisite here.
Design and development need to be driven by use cases. Many times, the biggest stumbling block to effective collaboration is differing goals or assumptions. Much of the lack of interoperability comes from differences in institutional models and workflows that make it difficult for code or data to connect with other systems. Doing the analysis work upfront to clarify not just what a tool might be doing but why can bring institutional models and developers onto the same page, and facilitate collaborative development.
We need collaborative platforms & social engineering for the web archiving technical community. It’s clear from events like the IIPC Web Archiving Conference and reports such as Helen Hockx-Yu’s of the Internet Archive that a lot of uncoordinated and largely invisible development is happening locally at institutions. Why? Not because people don’t want to collaborate, but because it’s less expensive and more expedient. IIPC and its members need to reduce the friction of exchanging information and code to the point that, as Barbara Sierman of the National Library of the Netherlands said, “collaboration becomes a habit.” Or as Ian Milligan of the University of Waterloo put it, we need the right balance between “hacking” and “yacking”.
IIPC should foster better development of tools both large and small. Collaboration on small tools development is a clear opportunity; innovation is happening at the edges and by working together individual programs can advance their end-to-end workflows in compelling new ways (social media, browser-based capture and new forms of visualization and analysis are all striking examples here). But it’s also clear that there is appetite and need for collaboration on the traditional “big” things that are beyond any single member’s capacity to engineer unilaterally (e.g., Heritrix, WayBack, full text search). As IIPC hasn’t been as successful as anyone might like in terms of directed, top-down development of larger projects, what can be done to carve these larger efforts up into smaller pieces that have a greater chance of success? How can IIPC take on the role of facilitator and matchmaker rather than director & do-er?
The stage is set for revisiting and revitalizing how IIPC works together to build high quality, use case-driven, interoperable tools. Over the next few months (and years!) we will begin translating these needs and strategies into concrete actions. What can we do? Several possibilities suggested themselves in Reykjavik.
Convene web archiving “hack fests”. The web archiving technical community needs face time. As Andy Jackson of the British Library opined in Reykjavik, “How can we collaborate with each other if we don’t know who we are, or what we’re doing?” Face time fuels collaboration in a way that no amount of WebEx’ing or GitHub comments can. Let’s begin to engineer the social ties that will lead to stronger software ties. A couple of three-day unconferences per year would go a long way to accelerating collaboration and diffusion of local innovation.
Convene meetings on key technical topics. It’s clear that IIPC members are beginning to tackle major efforts that would benefit from some early and intensive coordination: Heritrix & browser-based crawling, elaborations on WARC, next steps for Open Wayback, full text search and visualization, use of proxies for enhanced capture, dashboards and metrics for curators and crawl engineers. All of these are likely to see significant development (sometimes at as many as 4-5 different institutions) in the next year. Bringing implementers together early offers the promise of coordinated activity.
Coordinate on API identification and specification. There is clear interest in specifying APIs and more modular interactions across the entire web archiving tool chain. IIPC holds a privileged place as a coordinating body across the sites and players interested in this work. IIPC should structure some way to track, communicate, and help systematize this work, leading to a consortium-based reference architecture (based on APIs rather than specific tools) for the web archiving tool chain.
Use cases. Reykjavik saw a number of excellent presentations on user centered design and use case-driven development. This work should be captured and exposed to the web archiving community to let participants learn from each other’s work, and to generate a consensus reference architecture based on demonstrated (not just theoretical) needs.
Note that all of these potential steps focus as much on how IIPC can work together as on any specific project, and they all seem to fall into the “small steps” category. In this they have the twin benefits of being feasible to accomplish in the next year and of having a good chance to succeed. And if they do succeed, they promise to lay the groundwork for more and larger efforts in the coming years.
What do you think IIPC can do in the next year to advance tools development? Post a comment in this blog or send an email.