IIPC Hackathon at the British Library: Laying a New Foundation

By Tom Cramer, Stanford University

This past week, 22-23 September 2016, members of the IIPC gathered at the British Library for a hackathon focused on web crawling technologies and techniques. The event saw 14 technologists from 12 institutions near (the UK, Netherlands, France) and far (Denmark, Iceland, Estonia, the US and Australia). The event provided a rare opportunity for an intensive, two-day, uninterrupted deep dive into how institutions are capturing web content, and to explore opportunities for advancing the state of the art.

I was struck by the breadth and depth of topics. In particular…

  • Heritrix nuts and bolts. Everything from small tricks and known issues for optimizing captures with Heritrix 3, to how people were innovating around its edges, to the history of the crawler, to a wishlist for improving it (including better documentation).
  • Brozzler and browser-based capture. Noah Levitt of the Internet Archive, the engineer behind Brozzler, gave a mini-workshop on the latest developments and how to get it up and running. This was one of the biggest points of interest as institutions look to enhance their ability to capture dynamic content and social media. About ⅓ of the workshop attendees went home with fresh installs on their laptops. (Also note, per Noah, pull requests welcome!)
  • Technical training. Web archiving is a relatively esoteric domain without a huge community; how have institutions trained new staff or fractionally assigned staff to engage effectively with web archiving systems? This appears to be a major, common need, and also one that is approachable. Watch this space for developments…
  • QA of web captures. As Andy Jackson of the British Library put it, how can we tip the scales from mostly manual QA with some automated processes to mostly automated QA with some manual training and intervention?
  • An up-to-date registry of web archiving tools. The IIPC currently maintains a list of web archiving tools, but it’s a bit dated (as these sites tend to become). Just to get the list in a place where tool users and developers can update it, a working copy of this list is now in the IIPC GitHub organization. Importantly, the group decided that it might be just as valuable to create a list of dead or deprecated tools, as these can often be dead ends for new adopters. See (and contribute to) https://github.com/iipc/iipc.github.io/wiki. Updates welcome!
  • System & storage architectures for web archiving. How institutions are storing, preserving and computing on the bits. There was a great diversity of approaches here, and this is likely good fodder for a future event and more structured knowledge sharing.

The biggest outcome of the event may have been the energy and inherent value in having engineers and technical program managers spending lightly structured face time exchanging information and collaborating. The event was a significant step forward in building awareness of approaches and people doing web archiving.

IIPC Hackathon, Day 1.

This validates one of the main focal points for the IIPC’s portfolio on Tools Development, which is to foster more grassroots exchange among web archiving practitioners.

The participants committed to keeping the dialogue going, and to expanding the number of participants within and beyond IIPC. Slack is emerging as one of the main channels for technical communication; if you’d like to join in, let us know. We also expect to run multiple, smaller face-to-face events in the next year: 3 in Europe and another 2-3 in North America, with several delving into APIs, archiving time-based media, and access. (These are all in addition to the IIPC General Assembly and Web Archiving Conference on 27-30 March 2017, in Lisbon.) If you have an idea for a specific topic or would like to host an event, please let us know!

Many thanks to all the participants at the hackathon last week, and to the British Library (especially Andy Jackson and Olga Holownia) for hosting it. It provided exactly the kind of forum needed by the web archiving community to share knowledge among practitioners and to advance the state of the art.

Web Archiving Rio 2016: The Story So Far

By Helena Byrne, Assistant Web Archivist, The British Library

The IIPC Content Development Group (CDG) has been busy archiving the trials and tribulations of the Rio 2016 Summer Olympic and Paralympic Games. The Olympics might be over, but in just a few days the Paralympics will begin and fans will be glued to their screens again.

This project is collecting public platforms such as websites, articles, news reports, blogs and social media about Rio 2016. You can follow updates on this project on Twitter by using the collection hashtag #Rio2016WA. The CDG has been more active on Twitter and hosted a Twitter chat on 10th August 2016 to give an insight into what’s involved in web archiving the Olympics. The chat was based on set questions published in an IIPC blog post, with a Q&A session and some time for live nominations. This was an international chat; even though it was small, it helped us to make connections with a wider audience. The chat was added to Storify as well as to the final archived collection of the Games.

So far the Rio 2016 Collection has over 4,000 nominations from IIPC members and the general public. The nominations up to now are from seventy-six countries across the world. However, as you can see from the Google Map, there are still many countries that have not been covered. Can you help fill the void?

The majority of the public nominations cover Ireland, the Pacific Islands and South Korea, and are in a range of languages such as English, Korean, Dutch, Georgian and French, to name but a few. Some countries on the map have only one site nominated while others have many. Even if you see that there are nominations from your country, the web pages you are looking at might not be in the collection. There is still time for you to get involved in web archiving the Olympics and Paralympics. The public nomination form will be open until 21st September 2016. If you would like to make a nomination, you can follow these guidelines. This is your chance to be part of the Games!

On Your Marks, Get Set, Go!

By Helena Byrne, Assistant Web Archivist, The British Library

The Rio 2016 Olympic and Paralympic Games are nearly underway and for the next few weeks sports fans will be glued to the events. As with all major sporting events so much happens on and off the playing field.

When we look back at these events, what do we look at? Archives play an essential role in collecting these snapshots of our lives. As we live in a digital world, web archives play a central role in this process. The IIPC Content Development Group curated three large Summer and Winter Olympics collections (2010, 2012 and 2014) and is now archiving the events both on and off the playing field in Rio.

Now it’s your opportunity to have your say about what goes into this collection. The IIPC CDG is calling on you to get involved through the public nomination form. As you can see from our map we still have large parts of the world that aren’t represented in the collection. Do you know of any Olympic or Paralympic websites from these countries?

If you want to find out more about what’s involved in documenting Rio 2016, why not join our Twitter chat and help us archive Rio 2016?

When: Wednesday 10th August at 3pm GMT
Where: At your desk
How: Using Twitter hashtag #Rio2016WA and our previous blog post
Audience: Librarians, Archivists, Sports Researchers and anyone with an interest in web archiving.
Contributors: Nicola Bingham and Helena Byrne, British Library; Eilidh MacGlone, National Library of Scotland

Chat Programme

  1. Introductions
  2. Questions on selecting websites
  3. Instructions on how you can select sites
  4. Add web selections to the public nomination form
  5. Wrap up

Chat Questions

  1. What Olympic collections are available online or in libraries and museums?
    • Are they physical or digital collections?
    • Do you have a favourite go to collection that you like to use?
  2. What’s involved in selecting websites or web pages for the collection?
    • Sourcing, appraising, selecting
  3. What types of resources do researchers like to use most when researching sport?
    • If you could only choose one resource what would it be?
  4. Questions and answers from the audience about the Rio 2016 Collection.

Don’t forget to use the collection hashtag #Rio2016WA when answering the questions. So on your marks, get set, go!

A map of the nominations so far. There are still some parts of the world not covered in this collection. However, all of the National Olympic and Paralympic Committees from around the world are archived in a separate collection.

2016 Rio Games Collection – How to Get Involved!

By Helena Byrne, Assistant Web Archivist, The British Library

The International Internet Preservation Consortium (IIPC) would like your help to archive websites from around the world related to the Olympic and Paralympic Games.

The IIPC has members in 33 countries, but there are over 200 countries competing in the Games and we need your help to ensure that these countries are represented in the collection.

IIPC World Map

What we want to collect:

Public platforms in various formats such as:

  • Websites
  • Articles
  • News Reports
  • Blogs
  • Facebook
  • Twitter

The subjects covered on these sites can vary from:

  • Sports Events
  • Athletes/Teams
  • Doping/Cheating and Corruption
  • Olympic/Paralympic Venues
  • Gender
  • Fandom
  • Environmental Issues
  • Zika Virus
  • General News/ Commentary
  • Computer Games (eGames)
  • Other

How to get involved:

Once you have selected the web pages you would like to see in the collection, it takes less than 5 minutes to fill in the submission form.

http://goo.gl/forms/n4M4XJKfg6STvosb2

 

IIPC Chair Address

Dear colleagues,

As I’m starting my term as Chair of the IIPC for 2016-2017, I’d like to share a few thoughts on what is ahead of us for this year. 2016 is the year of a new start, with a new Consortium Agreement signed for 5 years and the new organisation based on three portfolios: Partnership and Outreach, Tools Development and Membership Engagement. Time has come to build on these solid foundations, laid thanks to the great leadership and vision that Paul Wagner, my predecessor in the Chair position, has provided to our Consortium during the past 18 months.

The tasks undertaken since the General Assembly in Reykjavik have already demonstrated the efficiency of this new work structure. We have taken on board your feedback from the breakout sessions. The Membership Engagement Portfolio Lead, Birgit Nordsmark Henriksen, along with our Programme and Communication Officer, is now committed to making information about Members’ activities in the field of web archiving more readily available on a renewed website. The Tools Development Portfolio Lead, Tom Cramer, has outlined a list of suggested actions and is planning an open call in order to identify potential projects that may be started this year with the IIPC’s support. Finally, the Partnership and Outreach Portfolio Lead, Hansueli Locher, is gathering ideas on how to engage new members in the web archiving community as well as new partners in other domains such as academic research, technology and web development.

In June, during their next phone meeting, your Steering Committee will endorse a one year strategic plan describing the main areas of activities that we want to work on and the actions that we plan to achieve until mid-2017. We are targeting concrete, short-term actions with deliverables that will demonstrate our commitment to move forward and make the IIPC an organisation that is relevant to its Members and to the web archiving community at large.

One of the key actions is the organisation of the 2017 General Assembly and Web Archiving Conference. It will be held in Lisbon, Portugal, on 27-31 March 2017. I would like to thank Daniel Gomes from FCCN (Fundação para a Computação Científica Nacional) for agreeing to take the lead on the organisation of this event, with the help of the Conference Programme Committee chaired by Nicholas Taylor (Stanford University Libraries). We expect the General Assembly to be an opportunity for fruitful exchanges and discussions and an input to the following year’s strategic plan. As for the Web Archiving Conference, building on this year’s success, we aim to make it an open forum for sharing the latest updates in the field, with a strong contribution from the researcher community.

In the meantime, exciting work is going on within the very active IIPC working groups, in particular the Preservation Working Group (PWG) chaired by Gina Jones (Library of Congress) and Tobias Steinke (German National Library) and the Content Development Working Group (CDG) led by Abbie Grotke (Library of Congress) and Alex Thurman (Columbia University Libraries). Both PWG and CDG are building on the impetus of the GA workshops in their forthcoming projects. The PWG is working on the Compatibility Initiative, while the CDG is focusing on the 2016 Summer Olympics and Paralympics Collections as well as the planned online News Around the World project. Stay tuned for more updates.

Finally, I want to thank our new Officers, Olga Holownia, our Programme and Communication Officer, and Marie Chouleur, our Treasurer, for a very efficient start in their new duties this year. The day-to-day activities of our Consortium rely heavily on their work and I know they are very committed to providing us with a reliable work environment.

I’m looking forward to the great work we’ll carry out this year together, building on the great skills and impressive experience that this Consortium has been able to pull together. Please feel free to contact me or the Steering Committee and Officers if you want to get involved and learn more about what’s going on.

Emmanuelle Bermès
Bibliothèque nationale de France
Chair of the International Internet Preservation Consortium

IIPC – Meet the Officers, 2016

The IIPC Officers include the Chair and Vice-Chair, who are elected by the Steering Committee, as well as two standing officers: the Treasurer (based at the Bibliothèque nationale de France, BnF) and the Programme and Communication Officer (based at the British Library). The new IIPC Chair and Vice-Chair were elected during the General Assembly that took place in Reykjavík.

Chair

Emmanuelle Bermès, photo by Isabelle Jullien Chazal (BnF)

Emmanuelle Bermès has been the deputy director for services and networks at the National Library of France (BnF) since 2014. From 2003 to 2011, she worked in the digital libraries and digital preservation area, then moved into metadata management. From 2011 to 2014, she was in charge of multimedia and digital services at the Centre Pompidou (Paris, France).
In the course of her career, Emmanuelle has held a number of responsibilities at the international level. She worked as an expert within Europeana and contributed to the design of the Europeana Data Model, before being elected to the Europeana Association Members Council in 2015. In 2010-2011, she was a co-chair of the Library Linked Data W3C incubator group. A member of the IFLA IT section since 2009, she initiated the creation of a Semantic Web special interest group (SWSIG) within IFLA. She became BnF’s representative within the International Internet Preservation Consortium (IIPC) in 2015, and was elected chair of IIPC in 2016.

Vice-Chair

Jefferson Bailey is the director of Web Archiving Programs at the Internet Archive. He joined the Internet Archive in Summer 2014. Prior to joining IA, he worked on strategic initiatives, digital preservation, archives, and digital collections at institutions such as Metropolitan New York Library Council, Library of Congress, Brooklyn Public Library, and Frick Art Reference Library and has worked in the archives at NARA, NASA, and Atlantic Records. He has an MLIS in Archival Studies from University of Pittsburgh and a BA in English from Oberlin College. He once flew NASA’s Space Shuttle Simulator and caused, according to the flight engineer, “minor landing gear damage”. He has deaccessioned all records of this event from his personal archive.

Treasurer

Marie Chouleur joined the National Library of France (BnF) in September 2015, becoming the head of Digital Legal Deposit. This service is responsible for collecting, preserving and promoting a large part of the National Library’s born-digital heritage: web archives, e-newspapers and e-books. She graduated from the École nationale des Chartes and the National Institute for Cultural Heritage (Institut national du patrimoine), obtaining a bachelor’s degree in literature and a master’s degree in history. Marie previously worked for the National Archives of France (Archives nationales) as a curator in charge of records related to environment, housing and town planning.

Programme and Communication Officer (PCO)

Olga Holownia, photo by Mira Mykkänen

Olga Holownia is the IIPC Programme and Communication Officer at the British Library in London. With a PhD in Icelandic and English studies, she has a keen interest in research spanning the fields of comparative literature, translation, cultural literacy and digital humanities. Olga has experience working on a number of interdisciplinary digital projects at the University of Iceland as well as organising international cultural events.

Signing Off

Colleagues,

Today marks my final day as Chair of our Consortium. It has been an exciting and busy 17 months since I took on this role. I leave my post with a sense of accomplishment and pride in how the organization has evolved.

When I took over the role in January 2015, I made the commitment to work with the Steering Committee to ensure we modernized the governance and management structure of the IIPC to create a foundation that would allow us to grow and extend our reach. I am happy to say that we have accomplished just that.

As most of you know, I am not a career Archivist or Librarian, but I have been privileged to work with and learn from professionals within my home organization (Library and Archives Canada) as well as many of you from across the globe. I am pleased to hand over the reins to Emmanuelle Bermès from the National Library of France. She will bring not only deep management and leadership skills to the role, but also (and maybe more importantly) significant experience in the business of the IIPC. I think this balance of experience and competencies is what we need now.

I had the privilege of being involved in three General Assemblies (GA) and the associated conferences. I was continually amazed by the level of engagement and interaction between the members. Based on the feedback I have received, this last GA and WAC set the bar, due in no small part to the leadership of Kristinn Sigurðsson.

As with any organization, the goal is to keep that level of engagement going virtually after the face-to-face meetings have ended. We still have much work to do on that front, but I am pleased that our new portfolio structure ensures that there will be dedicated resources and leadership for Birgit Nordsmark Henriksen (Netarchive.dk) and the Membership and Engagement Portfolio. Stay tuned for some steps to facilitate that year-long engagement.

The ecosystem that our respective organizations work in, and the one that the IIPC is trying to foster, is very complex and continues to include new players. Working alongside other organizations and associations will be key in delivering our mandate. Again, we have ensured that we leverage partnerships with complementary organizations. Listen out for more from Hansueli Locher (Swiss National Library) and the group supporting the Partnership and Outreach Portfolio.

One of the areas that we heard about loud and clear was that our members wanted help with tools. At some point I am sure that there will be more and more commercially available solutions for Web harvesting and archiving, but for now it is up to us as a community to rally together to build the tools to support our work, an effort led by Tom Cramer (Stanford University Libraries) and the Tools Development Portfolio.

I can say that one of the best ways to support our organization is to get involved. Whether you decide to apply for a position on the Steering Committee, or if you support one of the portfolios, or if you simply ‘lean in’ on some of the discussions that circulate via email, the goal is the same: get involved!

I want to thank my colleagues on the Steering Committee for supporting me (and putting up with me) over the past year and a half. As IIPC members, you can be confident that you have a Steering Committee which has your best interests at heart. Many excellent and passionate discussions have brought us to where we are today.

I also want to thank the Program and Communications team. In particular, I want to thank Jason Webber from the British Library. He and I worked closely together and spoke almost weekly in an effort to move the agenda forward. Jason (and now Olga) are the glue between the various activities of the Steering Committee and it is often a thankless job.

Lastly, I want to thank all of you. From the emails I received to the one-on-one discussions, you have made sure that we heard your needs and expectations.

As they say, the best is yet to come… so let’s step forward together.

Regards.

PnW
Paul N. Wagner
Chair, International Internet Preservation Consortium