On Your Marks, Get Set, Go!

By Helena Byrne, Assistant Web Archivist, The British Library

The Rio 2016 Olympic and Paralympic Games are nearly underway and for the next few weeks sports fans will be glued to the events. As with all major sporting events so much happens on and off the playing field.

When we look back at these events, what do we look at? Archives play an essential role in collecting these snapshots in our lives. Because we live in a digital world, web archives play a central role in this process. The IIPC Content Development Group curated three large Summer and Winter Olympics collections (2010, 2012 and 2014) and is now archiving the events both on and off the playing field in Rio.

Now it’s your opportunity to have your say about what goes into this collection. The IIPC CDG is calling on you to get involved through the public nomination form. As you can see from our map we still have large parts of the world that aren’t represented in the collection. Do you know of any Olympic or Paralympic websites from these countries?

If you want to find out more about what’s involved in documenting Rio 2016, why not join our Twitter chat and help us archive Rio 2016?

When: Wednesday 10th August at 3pm GMT
Where: At your desk
How: Using Twitter hashtag #Rio2016WA and our previous blog post
Audience: Librarians, Archivists, Sports Researchers and anyone with an interest in web archiving. 
 Nicola Bingham and Helena Byrne, British Library; Eilidh MacGlone, National Library of Scotland

Chat Programme

  1. Introductions
  2. Questions on selecting websites
  3. Instructions on how you can select sites
  4. Add web selections to the public nomination form
  5. Wrap up

Chat Questions

  1. What Olympic collections are available online or in libraries and museums?
    • Are they physical or digital collections?
    • Do you have a favourite go-to collection that you like to use?
  2. What’s involved in selecting websites or web pages for the collection?
    • Sourcing, appraising, selecting
  3. What types of resources do researchers like to use most when researching sport?
    • If you could only choose one resource what would it be?
  4. Questions and answers from the audience about the Rio 2016 Collection.

Don’t forget to use the collection hashtag #Rio2016WA when answering the questions. So on your marks, get set, go!

A map of the nominations so far. There are still some parts of the world not covered in this collection. However, all of the National Olympic and Paralympic Committees from around the world are archived in a separate collection.

2016 Rio Games Collection – How to Get Involved!

By Helena Byrne, Assistant Web Archivist, The British Library

The International Internet Preservation Consortium (IIPC) would like your help to archive websites from around the world related to the Olympic and Paralympic Games.

The IIPC has members in 33 countries but there are over 200 countries competing in the games and we need your help to ensure that these countries are represented in the collection.

IIPC World Map

What we want to collect:

Public platforms in various formats such as:

  • Websites
  • Articles
  • News Reports
  • Blogs
  • Facebook
  • Twitter

The subjects covered on these sites can vary from:

  • Sports Events
  • Athletes/Teams
  • Doping/Cheating and Corruption
  • Olympic/Paralympic Venues
  • Gender
  • Fandom
  • Environmental Issues
  • Zika Virus
  • General News/Commentary
  • Computer Games (eGames)
  • Other

How to get involved:

Once you have selected the web pages you would like to see in the collection, it takes less than 5 minutes to fill in the submission form.



IIPC Chair Address

Dear colleagues,

As I’m starting my term as Chair of the IIPC for 2016-2017, I’d like to share a few thoughts on what is ahead of us for this year. 2016 is the year of a new start, with a new Consortium Agreement signed for 5 years and the new organisation based on three portfolios: Partnership and Outreach, Tools Development and Membership Engagement. Time has come to build on these solid foundations, laid thanks to the great leadership and vision that Paul Wagner, my predecessor in the Chair position, has provided to our Consortium during the past 18 months.

The tasks undertaken since the General Assembly in Reykjavik have already demonstrated the efficiency of this new work structure. We have taken on board your feedback from the breakout sessions. The Membership Engagement Portfolio Lead, Birgit Nordsmark Henriksen, along with our Programme and Communication Officer, is now committed to making information about Members’ activities in the field of web archiving more readily available on a renewed website. The Tools Development Portfolio Lead, Tom Cramer, has outlined a list of suggested actions and is planning an open call in order to identify potential projects that may be started this year with the IIPC’s support. Finally, the Partnership and Outreach Portfolio Lead, Hansueli Locher, is gathering ideas on how to engage new members in the web archiving community but also new partners in other domains such as academic research, technology and web development.

In June, during their next phone meeting, your Steering Committee will endorse a one-year strategic plan describing the main areas of activity that we want to work on and the actions that we plan to complete by mid-2017. We are targeting concrete, short-term actions with deliverables that will demonstrate our commitment to move forward and make the IIPC an organisation that is relevant to its Members and to the web archiving community at large.

One of the key actions is the organisation of the 2017 General Assembly and Web Archiving Conference. It will be held in Lisbon, Portugal, on 27-31 March 2017. I would like to thank Daniel Gomes from FCCN (Fundação para a Computação Científica Nacional) for agreeing to take the lead on the organisation of this event, with the help of the Conference Programme Committee chaired by Nicholas Taylor (Stanford University Libraries). We expect the General Assembly to be an opportunity for fruitful exchanges and discussions and an input to the following year’s strategic plan. Regarding the Web Archiving Conference, building on this year’s success, we aim to make it an open forum for sharing the latest updates in the field, with a strong contribution from the researcher community.

In the meantime, exciting work is going on within the very active IIPC working groups, in particular the Preservation Working Group (PWG) chaired by Gina Jones (Library of Congress) and Tobias Steinke (German National Library) and the Content Development Working Group (CDG) led by Abbie Grotke (Library of Congress) and Alex Thurman (Columbia University Libraries). Both PWG and CDG are building on the impetus of the GA workshops in their forthcoming projects. The PWG are working on the Compatibility Initiative while the CDG are focusing on the 2016 Summer Olympics and Paralympics Collections as well as the planned online News Around the World project. Stay tuned for more updates.

Finally, I want to thank our new Officers, Olga Holownia, our Programme and Communication Officer, and Marie Chouleur, our Treasurer, for a very efficient start in their new duties this year. The day-to-day activities of our Consortium rely heavily on their work and I know they are very committed to providing us with a reliable work environment.

I’m looking forward to the great work we’ll carry out this year together, building on the great skills and impressive experience that this Consortium has been able to pull together. Please feel free to contact me or the Steering Committee and Officers if you want to get involved and learn more about what’s going on.

Emmanuelle Bermès
Bibliothèque nationale de France
Chair of the International Internet Preservation Consortium

IIPC – Meet the Officers, 2016

The IIPC Officers include the Chair and Vice-Chair who are elected by the Steering Committee, the standing officers of Treasurer (based at the Bibliothèque nationale de France, BnF) as well as the Programme and Communications Officer (based at The British Library). The new IIPC Chair and Vice-Chair were elected during the General Assembly that took place in Reykjavík.


Emmanuelle Bermès, photo by Isabelle Jullien Chazal (BnF)

Emmanuelle Bermès has been the deputy director for services and networks at the National Library of France (BnF) since 2014. From 2003 to 2011, she worked in the digital libraries and digital preservation area, then moved into metadata management. From 2011 to 2014, she was in charge of multimedia and digital services at the Centre Pompidou (Paris, France).
In the course of her career, Emmanuelle has held a number of responsibilities at international level. She worked as an expert within Europeana and contributed to the design of the Europeana Data Model, before being elected to the Europeana Association Members Council in 2015. In 2010-2011, she was a co-chair of the Library Linked Data W3C incubator group. A member of the IFLA IT section since 2009, she initiated the creation of a Semantic Web special interest group (SWSIG) within IFLA. She became BnF’s representative within the International Internet Preservation Consortium (IIPC) in 2015, and was elected chair of IIPC in 2016.


Jefferson Bailey is the director of Web Archiving Programs at the Internet Archive. He joined the Internet Archive in Summer 2014. Prior to joining IA, he worked on strategic initiatives, digital preservation, archives, and digital collections at institutions such as Metropolitan New York Library Council, Library of Congress, Brooklyn Public Library, and Frick Art Reference Library and has worked in the archives at NARA, NASA, and Atlantic Records. He has an MLIS in Archival Studies from University of Pittsburgh and a BA in English from Oberlin College. He once flew NASA’s Space Shuttle Simulator and caused, according to the flight engineer, “minor landing gear damage”. He has deaccessioned all records of this event from his personal archive.


Marie Chouleur joined the National Library of France (BnF) in September 2015, becoming the head of Digital Legal Deposit. This service is responsible for collecting, preserving and promoting a large part of the National Library’s born-digital heritage: web archives, e-newspapers and e-books. She graduated from the École nationale des Chartes and the National Institute for Cultural Heritage (Institut national du patrimoine), obtaining a bachelor’s degree in literature and a master’s degree in history. Marie previously worked for the National Archives of France (Archives nationales) as a curator in charge of records related to environment, housing and town planning.

Programme and Communication Officer (PCO)

Olga Holownia, photo by Mira Mykkänen

Olga Holownia is the IIPC Programme and Communication Officer at the British Library in London. With a PhD in Icelandic and English studies she has a keen interest in research spanning the fields of comparative literature, translation, cultural literacy and digital humanities. Olga has experience working on a number of interdisciplinary digital projects at the University of Iceland as well as organising international cultural events.

Signing Off


Today marks my final day as Chair of our Consortium. It has been an exciting and busy 17 months since I took on this role. I leave my post with a sense of accomplishment and pride in how the organization has evolved.

When I took over the role in January 2015, I made the commitment to work with the Steering Committee to ensure we modernized the governance and management structure of the IIPC to create a foundation that would allow us to grow and extend our reach.  I am happy to say that we have accomplished just that.

As most of you know I am not a career Archivist or Librarian but I have been privileged to work with and learn from professionals within my home organization (Library and Archives Canada) as well as many of you from across the globe. I am pleased to hand over the reins to Emmanuelle Bermès from the National Library of France. She will bring not only deep management and leadership skills to the role, but also (and maybe more importantly) significant experience in the business of the IIPC.  I think this balance of experience and competencies is what we need now.

I had the privilege of being involved in three General Assemblies (GA) and the associated conferences. I was continuously amazed by the level of engagement and interaction between the members. Based on the feedback I have received, this last GA and WAC set the bar – this is due in no small part to the leadership of Kristinn Sigurðsson.

As with any organization, the goal is to keep that level of engagement going virtually after the face-to-face meetings have ended. We still have much work to do on that front, but I am pleased that our new portfolio structure ensures that there will be dedicated resources and leadership for the Membership Engagement Portfolio, led by Birgit Nordsmark Henriksen (Netarchive.dk). Stay tuned for some steps to facilitate that year-long engagement.

The ecosystem that our respective organizations work in, and the one that the IIPC is trying to foster, is very complex and continues to include new players. Working alongside other organizations and associations will be key in delivering our mandate. Again we have ensured that we leverage partnerships with complementary organizations. Listen out for more from Hansueli Locher (Swiss National Library) and the group supporting the Partnership and Outreach Portfolio.

One of the messages that we heard loud and clear was that our members wanted help with tools. At some point I am sure that there will be more and more commercially available solutions for Web harvesting and archiving, but for now it is up to us as a community to rally together to build the tools to support our work, an effort led by Tom Cramer (Stanford University Libraries) and the Tools Development Portfolio.

I can say that one of the best ways to support our organization is to get involved. Whether you decide to apply for a position on the Steering Committee, or if you support one of the portfolios, or if you simply ‘lean in’ on some of the discussions that circulate via email –  the goal is the same: get involved!

I want to thank my colleagues on the Steering Committee for supporting me (and putting up with me) over the past year and a half. As IIPC members, you can be confident that you have a Steering Committee which has your best interests at heart. Many excellent and passionate discussions have brought us to where we are today.

I also want to thank the Program and Communications team. In particular, I want to thank Jason Webber from the British Library. He and I worked closely together and spoke almost weekly in an effort to move the agenda forward. Jason (and now Olga) are the glue between the various activities of the Steering Committee and it is often a thankless job.

Lastly, I want to thank all of you – from the emails I received to the one-on-one discussions, you have made sure that we heard your needs and expectations.

As they say, the best is yet to come…. so let’s step forward together.


Paul N. Wagner
Chair, International Internet Preservation Consortium

What can IIPC do to advance tools development?

By Tom Cramer, Stanford University

The International Internet Preservation Consortium (IIPC) renewed its consortial agreement at the end of 2015. In the process, it affirmed its longstanding mission to work collaboratively to foster the implementation of solutions to collect, preserve and provide access to Internet content. To achieve this aim, the Consortium is committed to “facilitate the development of appropriate and interoperable, preferably Open Source, software and tools.”

As the IIPC sets its strategic direction for 2016 and beyond, Tools Development will feature as one of three main portfolios of activity (along with Member Engagement, and Partnerships & Outreach). At its General Assembly in Reykjavik, IIPC members held a series of break out meetings to discuss Tools Development. This blog post presents some of that discussion, and lays out the beginnings of a direction for IIPC, and perhaps the web archiving community at large, to pursue in order to build a richer toolscape.

The Current State of Tools Development within the IIPC

The IIPC has always emphasized tool development. Per its website, one of the main objectives “has been to develop a high-quality, easy-to-use open source tools for setting up a web archiving chain.” And the registry of software lists an impressive array of tools for everything from acquisition and curation to storage and access. And coming from the 2016 General Assembly and Web Archiving conference, it’s clear that there is actually quite a lot of development going on among and beyond member institutions. Despite all this, the reality may be slightly less rosy than the multitude of listings for tools for web archiving might indicate…

  • Many are deprecated, or worse, abandoned
  • Much of the local development is kept local, and not accessible to others for reuse or enhancement
  • There is a high degree of redundancy among development efforts, due to lack of visibility, lack of understanding, or lack of an effective collaborative framework for code exchange or coordinated development
  • Many of the tools are not interoperable with each other due to differences in approach in policy, data models or workflows (sometimes intentional, many times not)
  • Many of the big tools which serve as mainstays for the community (e.g., Heritrix for crawling, Open Wayback for replay) are large, monolithic, complex pieces of software that have multiple forks and less-than-optimal documentation

Given all this, one wonders if IIPC members really believe that coordinated tool development is important; perhaps instead it’s better to let a thousand flowers bloom? The answer to this last question was, refreshingly, a resounding NO. When discussed among members at Reykjavik, support for tools development as a top priority was unanimous, and enthusiastic. The world of the Web and web archiving is vast, yet the number of participants relatively small; the more we can foster a rich (and interoperable) tool environment, the more everyone can benefit in any part of the web archiving chain. Many members in fact said they joined IIPC expressly because they sought a collaboratively defined and community-supported set of tools to support their institutional programs.

In the words of Daniel Gomes from the Portuguese Web Archive: of course tool development is a priority for IIPC; if we don’t develop these tools, who will?

A Brighter Future for Collaborative Tool Development

Several possibilities and principles presented themselves as ways to enhance the way the web archiving community pursues tool development in the future. Interestingly, many of these were more about how the community can work together rather than specific projects.  The main principles were:

  • Interoperability | modularity | APIs are key. The web archiving community needs a bigger suite of smaller, simpler tools that connect together. This promotes reuse of tools, as well as ease of maintenance; allows for institutions to converge on common flows but differentiate where it matters; enables smaller development projects which are more likely to be successful; and provides on ramps for new developers and institutions to take up (and add back to) code. Developing a consensus set of APIs for the web archiving chain is a clear priority and prerequisite here.
  • Design and development need to be driven by use cases. Many times, the biggest stumbling block to effective collaboration is differing goals or assumptions. Much of the lack of interoperability comes from differences in institutional models and workflows that make it difficult for code or data to connect with other systems. Doing the analysis work upfront, to clarify not just what a tool might be doing but why, can bring institutional models and developers onto the same page, and facilitate collaborative development.
  • We need collaborative platforms & social engineering for the web archiving technical community. It’s clear from events like the IIPC Web Archiving Conference and reports such as Helen Hockx-Yu’s of the Internet Archive that a lot of uncoordinated and largely invisible development is happening locally at institutions. Why? Not because people don’t want to collaborate, but because it’s less expensive and more expedient. IIPC and its members need to reduce the friction of exchanging information and code to the point that, as Barbara Sierman of the National Library of the Netherlands said, “collaboration becomes a habit.” Or as Ian Milligan of the University of Waterloo put it, we need the right balance between “hacking” and “yacking”.
  • IIPC should support better development of tools both large and small. Collaboration on small tools development is a clear opportunity; innovation is happening at the edges and by working together individual programs can advance their end-to-end workflows in compelling new ways (social media, browser-based capture and new forms of visualization and analysis are all striking examples here). But it’s also clear that there is appetite and need for collaboration on the traditional “big” things that are beyond any single member’s capacity to engineer unilaterally (e.g., Heritrix, WayBack, full text search). As IIPC hasn’t been as successful as anyone might like in terms of directed, top-down development of larger projects, what can be done to carve these larger efforts up into smaller pieces that have a greater chance of success? How can IIPC take on the role of facilitator and matchmaker rather than director & do-er?
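
To make the first principle above a little more concrete, here is a minimal sketch in Python of what "smaller, simpler tools that connect together" could look like. Everything in it is hypothetical and for illustration only: the Capture record, the stage functions and run_pipeline are invented names, not part of any existing IIPC tool or agreed API. The only point is that when every stage accepts and returns the same simple record type, stages can be written by different institutions, swapped, or reordered without touching each other.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List


@dataclass
class Capture:
    """Hypothetical minimal record exchanged between tools (not a real standard)."""
    url: str
    body: str
    metadata: dict = field(default_factory=dict)


# The shared "API": every stage takes captures in and hands captures out.
Stage = Callable[[Iterable[Capture]], Iterator[Capture]]


def drop_empty(captures: Iterable[Capture]) -> Iterator[Capture]:
    """Small tool 1: discard captures whose body is blank."""
    for capture in captures:
        if capture.body.strip():
            yield capture


def tag_word_count(captures: Iterable[Capture]) -> Iterator[Capture]:
    """Small tool 2: annotate each capture with a word count."""
    for capture in captures:
        capture.metadata["word_count"] = len(capture.body.split())
        yield capture


def run_pipeline(captures: Iterable[Capture], stages: List[Stage]) -> List[Capture]:
    """Chain the stages in order; each one only sees the shared record type."""
    for stage in stages:
        captures = stage(captures)
    return list(captures)


# Illustrative input only; example.org is a placeholder domain.
sample = [
    Capture("http://example.org/a", "Opening ceremony report"),
    Capture("http://example.org/b", "   "),
]
result = run_pipeline(sample, [drop_empty, tag_word_count])
```

An institution that needs a different filter or annotation would replace one function in the list passed to run_pipeline and leave the rest alone, which is the kind of "converge on common flows but differentiate where it matters" arrangement described above.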

Next Steps

The stage is set for revisiting and revitalizing how IIPC works together to build high quality, use case-driven, interoperable tools. Over the next few months (and years!) we will begin translating these needs and strategies into concrete actions. What can we do? Several possibilities suggested themselves in Reykjavik.

  1. Convene web archiving “hack fests”. The web archiving technical community needs face time. As Andy Jackson of the British Library opined in Reykjavik, “How can we collaborate with each other if we don’t know who we are, or what we’re doing?” Face time fuels collaboration in a way that no amount of WebEx’ing or GitHub comments can. Let’s begin to engineer the social ties that will lead to stronger software ties. A couple of three-day unconferences per year would go a long way to accelerating collaboration and diffusion of local innovation.
  2. Convene meetings on key technical topics. It’s clear that IIPC members are beginning to tackle major efforts that would benefit from some early and intensive coordination: Heritrix & browser-based crawling, elaborations on WARC, next steps for Open Wayback, full text search and visualization, use of proxies for enhanced capture, dashboards and metrics for curators and crawl engineers. All of these are likely to see significant development (sometimes at as many as 4-5 different institutions) in the next year. Bringing implementers together early offers the promise of coordinated activity.
  3. Coordinate on API identification and specification. There is clear interest in specifying APIs and more modular interactions across the entire web archiving tool chain. IIPC holds a privileged place as a coordinating body across the sites and players interested in this work. IIPC should structure some way to track, communicate, and help systematize this work, leading to a consortium-based reference architecture (based on APIs rather than specific tools) for the web archiving tool chain.
  4. Use cases. Reykjavik saw a number of excellent presentations on user-centered design and use case-driven development. This work should be captured and exposed to the web archiving community to let participants learn from each other’s work, and to generate a consensus reference architecture based on demonstrated (not just theoretical) needs.

Note that all of these potential steps focus as much on how IIPC can work together as on any specific project, and they all seem to fall into the “small steps” category. In this they have the twin benefits of being feasible to accomplish in the next year and having a good chance to succeed. And if they do succeed, they promise to lay the groundwork for more and larger efforts in the coming years.

What do you think IIPC can do in the next year to advance tools development? Post a comment in this blog or send an email.

New Report on Web Archiving Available

By Andrea Goethals, Harvard Library

This is an expanded version of a post to the Library of Congress’ Signal blog.

Harvard Library recently released a report that is the result of a five-month environmental scan of the landscape of web archiving, made possible by the generous support of the Arcadia Fund. The purpose of the study was to explore and document current web archiving programs to identify common practices, needs, and expectations in the collection and provision of web archives to users; the provision and maintenance of web archiving infrastructure and services; and the use of web archives by researchers. The findings will inform Harvard Library’s strategy for scaling up its web archiving activities, and are also being shared broadly to help inform research and development priorities in the global web archiving community.

The heart of the study was a series of interviews with web archiving practitioners from archives, museums and libraries worldwide; web archiving service providers; and researchers who use web archives. The interviewees were selected from the membership of several organizations, including the IIPC of course, but also the Web Archiving Roundtable at the Society of American Archivists (SAA), the Internet Archive’s Archive-It Partner Community, the Ivy Plus institutions, Working with Internet archives for REsearch (Reuters/WIRE Group), and the Research infrastructure for the Study of Archived Web materials (RESAW).

The interviews of web archiving practitioners covered a wide range of areas, everything from how the institution is maintaining their web archiving infrastructure (e.g. outsourcing, staffing, location in the organization), to how they are (or aren’t) integrating their web archives with their other collections. From this data, profiles were created for 23 institutions, and the data was aggregated and analyzed to look for common themes, challenges and opportunities.

Opportunities for Research & Development

In the end, the environmental scan revealed 22 opportunities for future research and development. These opportunities are listed below and described in more detail in the report. At a high level, these opportunities fall under four themes: (1) increase communication and collaboration, (2) focus on “smart” technical development, (3) focus on training and skills development, and (4) build local capacity.

22 Opportunities to Address Common Challenges

(the order has no significance)

  1. Dedicate full-time staff to work in web archiving so that institutions can stay abreast of latest developments, best practices and fully engage in the web archiving community.
  2. Conduct outreach, training and professional development for existing staff, particularly those working with more traditional collections, such as print, who are being asked to collect web archives.
  3. Increase communication and collaboration across types of collectors since they might collect in different areas or for different reasons.
  4. A funded collaboration program (bursary award, for example) to support researcher use of web archives by gathering feedback on requirements and impediments to the use of web archives.
  5. Leverage the membership overlap between RESAW and European IIPC membership to facilitate formal researcher/librarian/archivist collaboration projects.
  6. Institutional web archiving programs become transparent about holdings, indicating what material each has, terms of use, preservation commitment, plus curatorial decisions made for each capture.
  7. Develop a collection development tool (e.g. registry or directory) to expose holdings information to researchers and other collecting institutions even if the content is viewable only in on-site reading rooms.
  8. Conduct outreach and education to website developers to provide guidance on creating sites that can be more easily archived and described by web archiving practitioners.
  9. IIPC, or similar large international organization, attempts to educate and influence tech company content hosting sites (e.g. Google/YouTube) on the importance of supporting libraries and archives in their efforts to archive their content (even if the content cannot be made immediately available to researchers).
  10. Investigate Memento further, for example conduct user studies, to see if more web archiving institutions should adopt it as part of their discovery infrastructure.
  11. Fund a collection development, nomination tool that can enable rapid collection development decisions, possibly building on one or more of the current tools that are targeted for open source deployment.
  12. Gather requirements across institutions and among web researchers for next generation of tools that need to be developed.
  13. Develop specifications for a web archiving API that would allow web archiving tools and services to be used interchangeably.
  14. Train researchers with the skills they need to be able to analyze big data found in web archives.
  15. Provide tools to make researcher analysis of big data found in web archives easier, leveraging existing tools where possible.
  16. Establish a standard for describing the curatorial decisions behind collecting web archives so that there is consistent (and machine-actionable) information for researchers.
  17. Establish a feedback loop between researchers and the librarians/archivists.
  18. Explore how institutions can augment the Archive-It service and provide local support to researchers, possibly using a collaborative model.
  19. Increase interaction with users, and develop deep collaborations with computer scientists.
  20. Explore what, and how, a service might support running computing and software tools and infrastructure for institutions that lack their own onsite infrastructure to do so.
  21. Service providers develop more offerings around the available tools to lower the barrier to entry and make them accessible to those lacking programming skills and/or IT support.
  22. Work with service providers to help reduce any risks of reliance on them (e.g. support for APIs so that service providers could more easily be changed and content exported if needed).

Communication & Collaboration are Key!

One of the biggest takeaways is that the first theme, the need to radically increase communication and collaboration among all individuals and organizations involved in some way in web archiving, was the most prevalent theme found by the scan. Thirteen of the 22 opportunities fell under this theme. Clearly much more communication and collaboration is needed between those collecting web content, but also between those who are collecting it and researchers who would like to use it.

This environmental scan has given us a great deal of insight into how other institutions are approaching web archiving, which will inform our own web archiving strategy at Harvard Library in the coming years. We hope that it has also highlighted key areas for research and development that need to be addressed if we are to build efficient and sustainable web archiving programs that result in complementary and rich collections that are truly useful to researchers.

A Note about the Tools

There is a section in the report (Appendix C) that lists all the current web archiving tools that were identified during the environmental scan. The IIPC Tools and Software web page was one of the resources used to construct this list, along with what was learned through interviews, conferences and independent research. The tools are organized according to the various activities needed throughout the lifecycle of acquiring, processing, preserving and providing web archive collections. Many of the tools discovered are fairly new, especially the ones associated with the analysis of web archives. The state of the tools will continue to change rapidly so this list will quickly become out of date unless a group like the IIPC decides to maintain it.  I will be at the GA in April if any IIPC member would like to talk about maintaining this list or other parts of the report.