Facing the Challenge of Web Archives Preservation Collaboratively

Web archiving is often about collecting the web, but that is only half the story. Once collected, the content also has to be preserved. This is the focus of the IIPC’s Preservation Working Group. D-Lib Magazine has recently published an article titled “Facing the Challenge of Web Archives Preservation Collaboratively: The Role and Work of the IIPC Preservation Working Group”.

The article was written by the group members: Andrea Goethals (Harvard Library), Clément Oury (International ISSN Centre), David Pearson (National Library of Australia), Barbara Sierman (KB National Library of the Netherlands) and Tobias Steinke (Deutsche Nationalbibliothek – German National Library).

The article sets out the goals, activities and results of the Preservation Working Group, describing the findings of a survey that was done amongst the members of the IIPC in 2013 about their approaches to preserving the web. The authors also feature a set of databases maintained by the group with crucial information for web archiving: namely the Environments Database and the Risks Database.

Barbara Sierman, x-post from the KB (Dutch National Library) blog

Update on OpenWayback

OpenWayback 2.2.0 was recently released. This marks OpenWayback’s third release since becoming a ward of the IIPC in late 2013. This is a fairly modest update and reflects our desire to make frequent, modest sized releases. A few things are still worth pointing out.

First, as of this release, OpenWayback requires Java 7. Java 7 has been out for four years and Java 6 has not been publicly updated in over two years. It is time to move on.

Second, OpenWayback now officially supports internationalized domain names, i.e. domain names containing non-ASCII characters.
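
To illustrate what such names look like in practice: a hostname with non-ASCII characters also has an ASCII-compatible (Punycode) form, which is what DNS actually uses. The sketch below is not taken from OpenWayback. It uses Python’s built-in `idna` codec, which implements the older IDNA 2003 rules, and the hostname is a made-up example.

```python
def to_ascii_hostname(hostname: str) -> str:
    """Convert a Unicode hostname to its ASCII-compatible (Punycode) form."""
    return hostname.encode("idna").decode("ascii")


def to_unicode_hostname(hostname: str) -> str:
    """Convert an ASCII-compatible hostname back to its Unicode form."""
    return hostname.encode("ascii").decode("idna")


# Hypothetical example hostname, chosen only for illustration.
print(to_ascii_hostname("bücher.example"))           # xn--bcher-kva.example
print(to_unicode_hostname("xn--bcher-kva.example"))  # bücher.example
```

An archive that indexes one form and is queried with the other needs to normalise both to the same representation before comparing them.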

Third, UI localization has been much improved. It should now be possible to translate the entire interface without having to mess with the JSP files and otherwise “go under the hood”.

And the last thing I’ll mention is the new WatchedCDXSource which removes the need to enumerate all the CDX files you wish to use. Simply designate a folder and OpenWayback will pick up all the CDX files in it.
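
The idea of a watched folder can be sketched in a few lines. This is only an illustration of the concept, not OpenWayback’s implementation (which is written in Java). The function names and the assumption that index files end in `.cdx` are mine.

```python
from pathlib import Path


def find_cdx_files(folder: Path) -> list[Path]:
    """Return all files ending in .cdx directly inside `folder`, sorted by name."""
    return sorted(p for p in folder.iterdir() if p.is_file() and p.suffix == ".cdx")


def refresh(known: set[Path], folder: Path) -> list[Path]:
    """Return CDX files that appeared since the last call and remember them in `known`."""
    new_files = [p for p in find_cdx_files(folder) if p not in known]
    known.update(new_files)
    return new_files
```

Calling `refresh` periodically with the same `known` set returns only the files added since the previous call, so nothing has to be listed by hand.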

The road to here hasn’t been easy, but it is encouraging to see that the number of people involved is slowly but surely rising. For the 2.2.0 release, we had code contributions from Roger Coram (BL), Lauren Ko (UNT), John Erik Halse (NLN), Sawood Alam (ODU), Mohamed Elsayed (BA) and myself, in addition to the IIPC-paid-for work by Roger Mathisen (NLN). Even more people were involved in reporting issues, managing the project and testing the release candidate. My thanks to everyone who helped out.

And going forward, we are certainly going to need people to help out.

Version 2.3.0 of OpenWayback will be another modest bundle of fixes and minor features. We hope it will be ready in September (or so). There are already 10 issues open for it as I write this.

But, we also have larger ambitions. Enter version 3.0.0. It will be developed in parallel with 2.3.0 and aims to make some big changes. Breaking changes. OpenWayback is built on an aging codebase, almost a decade old at this point. To move forward, some big changes need to be made.

The exact features to be implemented will likely shift as work progresses but we are going to increase modularity by pushing the CDXServer front and center and removing the legacy resource stores. In addition to simplifying the codebase, this fits very nicely with the talk at the last GA about APIs.

We’ll also be looking at redoing the user interface using iFrames and providing insight into the temporal drift of the page being viewed. The planned issues are available on GitHub. The list is far from locked and we welcome additional input on which features to work on.

We welcome additional work on those features even more!

I’d like to wrap this up with a call to action. We need a reasonably large community around the project to sustain it. Whether it’s testing and bug reporting, occasional development work or taking on more by becoming one of our core developers, your help is both needed and appreciated.

If you’d like to become involved, you can simply join the conversation on the OpenWayback GitHub page. Anyone can open new issues and post comments on existing issues. You can also join the OpenWayback developers mailing list.

Kristinn Sigurðsson – Head of IT at the National and University Library of Iceland – x-posted from Kris’s blog

 

A first attempt to archive the .EU domain

The .EU domain is commonly used to reference sites related to Europe. EURid is the organization appointed by the European Commission to operate the .EU domain and presents it under the slogan “Your European Identity”.

Therefore, preserving online information published on sites hosted under the .EU domain is crucial to preserve European Cultural Heritage for future generations.

The strategy adopted to archive the World Wide Web has been to delegate responsibility for each domain to the respective national archiving institution. However, the .EU domain does not fit this model because it covers multiple nations. Thus, the preservation of .EU sites has not yet been assigned to, or undertaken by, any institution.

RESAW is a European network that aims to create a Research Infrastructure for the Study of Archived Web Materials (resaw.eu). Within the scope of RESAW activities, the Portuguese Web Archive made a first attempt to crawl and preserve web sites hosted under the .EU domain. This first crawl began on 21 November 2014 and finished on 16 December 2014.

Challenges crawling .EU

The first challenge was obtaining the seeds for the crawl, because our attempts to get the list of .EU domains from EURid failed. The crawl was launched using a total of 34 138 unique seeds obtained from several sources such as Google.com, DomainTyper.com, DMOZ.org and Alexa Top Sites.

During this first crawl we had to iteratively tune crawl configurations in order to overcome hazardous situations caused by web spam sites. The set of spam filters created will be useful to optimize future crawls.

We crawled 250 million documents from over 1 million hosts. The crawled documents were stored in 5.8 TB of disk space using the compressed ARC format. We extracted 135 907 unique domain URLs, which will be used as seeds for the next crawl.

Two more crawls of the .EU domain planned

As future work we intend to perform two more crawls of the .EU domain, to be integrated into the Portuguese Web Archive collections. The next crawl is planned to start in November 2015. We estimate that 23 TB of disk space will be required for the next crawl of the .EU domain (without performing deduplication).
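
As a rough sanity check on these figures, the arithmetic below uses only the numbers reported in this post. How the 23 TB estimate was actually derived is not stated here, so the comparison with the growth in seeds is just an observation, and the use of decimal terabytes is an assumption.

```python
# Figures reported for the first .EU crawl and the planned next one.
DOCUMENTS = 250_000_000   # documents crawled
STORAGE_TB = 5.8          # compressed ARC storage used
SEEDS_FIRST = 34_138      # seeds used for the first crawl
SEEDS_NEXT = 135_907      # seeds extracted for the next crawl
ESTIMATE_NEXT_TB = 23     # estimated storage for the next crawl

BYTES_PER_TB = 10**12     # assumption: decimal terabytes

# Average compressed size per document, in kilobytes.
avg_doc_kb = STORAGE_TB * BYTES_PER_TB / DOCUMENTS / 1000

# Growth factors: seeds versus estimated storage.
seed_growth = SEEDS_NEXT / SEEDS_FIRST
storage_growth = ESTIMATE_NEXT_TB / STORAGE_TB

print(f"average compressed document size: {avg_doc_kb:.1f} KB")
print(f"seed growth factor: {seed_growth:.2f}")
print(f"estimated storage growth factor: {storage_growth:.2f}")
```

Both growth factors come out at just under four, so the storage estimate is at least in line with the increase in seeds.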

Each .EU crawl will be indexed and become searchable through www.arquivo.pt one year after its finish date.

 

Researchers wanted!

Collaborations with researchers interested in studying the collected web data or crawl logs are welcome. If researchers express interest, we can create a prototype system with restricted access to enable search and processing of the .EU crawls.

This first experiment in archiving the .EU domain was performed mostly using resources from the Portuguese Web Archive. Collaborations with other institutions, for instance to identify relevant seeds, are crucial to improve the quality of the crawls. The results of this experiment are encouraging, but an effective archive of the .EU domain requires more resources and further collaboration.

Learn more

Daniel Gomes and Daniel Bicho, RESAW / Portuguese Web Archive

IIPC GA2015 – Videos of the Conference

All of the presentations from the Open Conference (Mon 27 Apr 2015) and most from the Open Workshop (Tue 28 Apr 2015) are available at the IIPC Youtube Channel.

The opening keynote was given by Vinton Cerf, Chief Internet Evangelist, Google, and Mahadev Satyanarayanan, Carnegie Group Professor, Carnegie Mellon University.

Note: We were unable to capture a few of the sessions at the Open Workshop due to sound issues.

What are they saying about IIPC GA2015?

Blog posts about the IIPC General Assembly 2015

See the full GA schedule with links to slides.

IIPC GA – Open Conference – Monday 27 April 2015

The first day of the 2015 IIPC GA is behind us. We have seen and heard amazing talks on web archiving from passionate researchers and problem solvers, covering a wide range of specialisations and questions.

Here are the links to the slides. The recordings will follow soon. Look out for them on the IIPC Youtube channel.

Keynote

Digital Vellum: Interacting with Digital Objects Over Centuries
  • Vinton Cerf, Chief Internet Evangelist, Google
  • Mahadev Satyanarayanan, Carnegie Group Professor, Carnegie Mellon University

Big Picture

Studying a nation’s websphere over time: analytical and methodological considerations
  • Niels Brügger, Associate Professor of Aesthetics and Communication, Aarhus University
  • Ditte Laursen, Senior researcher, Ph.D, at State Media Archive, State and University Library
Ten years of the UK web archive: what have we saved?
  • Andy Jackson, Web Archiving Technical Lead, British Library

Small Data Research

Keynote

Should we archive Facebook? Why the users are wrong and the NSA is right.
  • Cathy Marshall, Associate Research Scientist, Texas A&M University
Generating granular evidence of lived experience with the Web: archiving everyday digitally lived life
  • Meghan Dougherty, Assistant Professor of Digital Communication, Loyola University Chicago
Everyday saving practices: “small data” and digital heritage strategies
  • Susan Aasman, Assistant Professor, University of Groningen

Big Data Research

Big UK domain data for Arts and Humanities
  • Jane Winters, Professor of Digital History, Institute of Historical Research
  • Helen Hockx-Yu, Head of Web Archiving, British Library
  • Josh Cowls, Research Assistant, Oxford Internet Institute

The History of the IIPC, through Web Archives

By Nicholas Taylor, Web Archiving Service Manager, Stanford University

Web archives have now been around long enough that the web content they’ve preserved may never have been previously experienced by full-grown adults today; to this cohort, some websites were only ever “historical.” Web archives represent an increasingly vital and singular body of cultural heritage and a tool for understanding both the past and social phenomena. They’re also a handy tool for understanding the evolution of the IIPC itself.

[Image: home page of the IIPC website, 16 March 2015]

While I trust that our own programmatic record-keeping would be sufficient to reconstruct some of the following findings, they would also be thankfully self-evident to a future historian (one unusually interested in the history of the history of the Web) from the web archives themselves. Consulting the UK Web Archive front-end for the IIPC-funded, LANL-developed and -hosted Memento Aggregator shows that Internet Archive has the greatest number of snapshots of the entire history of the IIPC’s web presence.

Here’s some of what I learned, exploring the timeline:

[Image: home page of the IIPC website, 3 June 2004]

I imagine that these latter three points especially will be interesting to consider in the context of our forthcoming discussions for a new membership agreement to replace the one expiring this year (PDF) and to inform refined IIPC mission and goals. Here’s hoping that the most exciting history of the history of the Web is still ahead of us!

What’s Next for OpenWayback

By Kristinn Sigurðsson, Head of IT at the National and University Library of Iceland. Cross-posted from his own blog.

About one month ago, OpenWayback 2.1.0 was released. This was mostly a bug-fix release with a few new features merged in from Internet Archive’s Wayback development fork. For the most part, the OpenWayback effort has focused on ‘fixing’ things. Making sure everything builds and runs nicely and is better documented.

I think we’ve made some very positive strides.

Work is now ongoing for version 2.2.0. Finally, we are moving towards implementing new things! 2.2.0 still has some fixing to do. For example, localization support needs to be improved. But we’re also planning to implement something new: support for internationalized domain names.

We’ve tentatively scheduled the 2.2.0 release for “spring/early summer”.

After 2.2.0 is released, the question will be which features or improvements to focus on next. The OpenWayback issue tracker on GitHub has (at the time of writing) about 60 open issues in the backlog (i.e. not assigned to a specific release).

We’re currently in the process of trying to prioritize these. Our current resources are nowhere near sufficient to resolve them all. Prioritization will involve several aspects, including how difficult the issues are to implement, how popular they are and, not least, how clearly they are defined.

This is where you, dear reader, can help us out by reviewing the backlog and commenting on issues you believe to be relevant to your organization. We also invite you to submit new issues if needed.

It is enough to just leave a comment saying that an issue is relevant to your organization. Even better would be to explain why it is relevant (this helps frame the solution). Where appropriate, we would also welcome suggestions for how to implement the feature, notably in issues like the one about surfacing metadata in the interface.

If you really want to see a feature happen, the best way to make it happen is, of course, to pitch in.

Some of the features and improvements we are currently reviewing are:

  • Enable users to ‘diff’ different captures of an HTML page. Issue 15.
  • Enable search results with a very large number of hits. Issue 19.
  • Surface more metadata. Issues 28 and 29.
  • Enable time ranged exclusions. Issue 212.
  • Create a revisit test dataset. Issue 117.
  • Using CDX indexing as the default instead of the BDB index. Issue 132.
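
To give a feel for the first item on the list, here is a minimal sketch of comparing two captures of a page line by line with Python’s standard `difflib` module. It says nothing about how the feature would be built in OpenWayback. The function name and the sample HTML are invented for illustration.

```python
import difflib


def diff_captures(earlier: str, later: str) -> list[str]:
    """Return a unified diff (as a list of lines) between two captures of a page."""
    return list(
        difflib.unified_diff(
            earlier.splitlines(),
            later.splitlines(),
            fromfile="earlier-capture",
            tofile="later-capture",
            lineterm="",
        )
    )


# Two invented captures of the same page.
capture_a = "<html>\n<h1>Old title</h1>\n<p>Same text</p>\n</html>"
capture_b = "<html>\n<h1>New title</h1>\n<p>Same text</p>\n</html>"

for line in diff_captures(capture_a, capture_b):
    print(line)
```

A plain text diff like this shows changes in the markup. A user-facing feature would more likely want to compare the rendered content.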

As I said, these are just the ones currently being considered. We’re happy to look at others if there is someone championing them.

If you’d like to join the conversation, go to the OpenWayback issue tracker on GitHub and review issues without a milestone.

If you’d like to submit a new issue, please read the instructions on the wiki. The main thing to remember is to provide ample details.

We only have so many resources available. Your input is important to help us allocate them most effectively.

 

 

 

 

IIPC Technical Training Workshop – 14th – 16th January 2015

The idea of running a training workshop focusing on technical matters was formed during the 2014 IIPC General Assembly in Paris. It became apparent that there is a great deal of transferable experience among the members and that some institutions are more advanced than others in using the key software for web archiving. Having a forum to exchange ideas and discuss common issues would be extremely useful and welcome.

Consortium of memory organisations

In his blog, Kristinn Sigurðsson gave an accurate account of how the idea developed from a thought, to exciting sessions of discussion, and eventually to a proposal supported by the IIPC Steering Committee. Staff development and training is one of the key areas of work for the IIPC. As a consortium of memory organisations sharing the mission of preserving the Internet for posterity, there is great advantage in collaborating, helping each other and not reinventing the wheel. The IIPC has an Education and Training Programme and each year allocates a certain amount of funding for collective learning and development. The National Library of France, for example, organised a week-long workshop in 2012 to offer training for organisations planning to embark on web archiving.

Joint expertise

The joint training workshop of the British Library and the National and University Library of Iceland was the first one dedicated to technical issues, covering the three key applications for web archiving: Heritrix, OpenWayback and Solr. The speakers mainly came from both libraries’ capable technical teams, including Kristinn Sigurðsson, Andy Jackson, Roger Coram and Gil Hoggarth. Their expertise was strengthened by Toke Eskildsen of the State and University Library in Denmark, who has worked extensively on the Danish Web Archive’s large-scale Solr index. Toke also reported on his visit to the British Library in his blog, describing his experience of “being embedded in tech talk with intelligent people for 5 days” as “exhausting and very fulfilling”. The British Library also took advantage of Toke’s presence and picked his brain on performance issues related to Solr, a perfect example of what other good things can come out of putting techies together.

For the future

Evaluation of the workshop indicates overall satisfaction among the attendees. More people seemed to favour the presentations on day one, and wanted more structure in the hands-on sessions on days two and three, with more real-world examples to be solved together. The presence of strong technical expertise and the opportunity to talk to peers were appreciated the most. From the organisers’ perspective, there are a few things we could have done better: software could have been pre-installed to avoid network congestion and save time; and for the catering we will remember for future occasions that brilliant minds need adequate and varied fuels to be kept well-oiled and running up to speed.

Training is vital for any organisation that aims to progress. It is not a cost but an investment which safeguards our continued capability to do our job. It is worth considering establishing technical training as a fixed element of the Education and Training Programme. The British Library’s Web Archiving crew are happy to contribute.

Helen Hockx-Yu, Head of Web Archiving, The British Library, 17th Feb 2015

IIPC – Meet the Officers, 2015

The IIPC is governed by the Steering Committee, formed by representatives of 15 member organisations who are each elected for three years.

The IIPC Officers include the Chair and Vice-Chair who are elected by the Steering Committee plus the standing officers of Treasurer and the Program and Communications team.

They invest their expertise and, more importantly, their time in dealing with the day-to-day business of running the IIPC. The IIPC secretariat, so to speak, is based at the British Library and the Bibliothèque nationale de France. At the BL the two Programme and Communication Officers ensure that the IIPC runs smoothly and that all of the projects and programs are completed. The BnF is the treasurer of the IIPC and oversees all financial transactions. One of the main tasks each year for the secretariat is organising a successful annual General Assembly, this year hosted by Stanford University, California.

Chair

Paul N. Wagner, Senior Director General, Innovation & Chief Information Officer, Chief Information Officer Branch, Library and Archives Canada

Paul Wagner is the Senior Director General, Innovation and Chief Information Officer for Library and Archives Canada.  In this role Paul provides the leadership for the Digital Agenda as it pertains to Canada’s Documentary Heritage.

Previous to this role Paul was Director General, Client Relationships and Business Intake Directorate, Projects and Client Relationships Branch, at Shared Services Canada (SSC).  In this role, Paul built the first enterprise Partnership Management function for technology in the Government of Canada.

Paul joined SSC from the Department of Justice (DoJ) where he held the position of Chief Information Officer.  As CIO for the department, he developed and led an aggressive IM/IT transformation program.  Prior to that, Paul was the Chief Technology Officer at DoJ where he was responsible for all technology operations. Paul also held several leadership positions at Services Canada, Human Resources and Skills Development Canada and the Department of Public Works and Government Services Canada in the areas of Business Planning, Relationship Management and IT Product/Service Management.

Paul holds a B.A. with a major in Economics from McGill University and an MBA from the University of Ottawa’s Executive MBA program.

 Vice-Chair

Cathy Hartman is the Associate Dean of Libraries at the University of North Texas in Denton, Texas (University Profile).  Her interests have long been in digital libraries, collection building, and digital preservation.

She first began capturing U.S. government websites in 1997 as government agencies closed and their websites were taken down.  With this early start in web archiving, the University of North Texas (UNT) continued to capture such websites and joined the IIPC in 2007.

Hartman serves as the current Steering Committee co-chair, and served as chair of the IIPC Steering Committee in 2013.  UNT participates in many IIPC initiatives including Steering Committee membership, the Access Working Group, the new Collaborative Collections group, and the Education Committee.

Our Nomination Tool is offered for use by any IIPC member organization to support collaborative collection building, and UNT is currently contributing to the Open Wayback development effort.

 Treasurer

Clément Oury is head of Digital Legal Deposit at the Bibliothèque nationale de France (BnF). This service is in charge of collecting and preserving a large part of BnF’s born-digital heritage: web archives, e-newspapers and e-books.

Clément Oury also serves as convenor of two ISO working groups (on the “WARC archiving file format” and on “Statistics and quality issues for web archiving”).

He is a graduate of the École nationale des Chartes and has a PhD in early modern history from the University of Paris-Sorbonne.

As Clément will be leaving the BnF and therefore the IIPC in 2015, the position of treasurer is in transition. To ease this situation Peter Stirling has agreed to be second in command and act as interim treasurer until the BnF has decided who is going to follow in Clément’s very competent footsteps.

 Interim-Treasurer

Peter Stirling is a digital curator in the Digital Legal Deposit team at the BnF. He is responsible for services for users of the web archives, and is currently working on developing data mining services for researchers.

He also works on day-to-day web archiving activity and the international activity of the team in the context of the IIPC.

He holds an M.A. in English Literature and an M.Sc. in Information and Library Studies, and previously worked for an online information portal for health professionals in the UK and in online information monitoring for the French National Cancer Institute before joining the BnF in 2009.

Programme & Communication Officers

The PCOs both split their time evenly between Program and Communication for the IIPC and Engagement and Liaison for the UK Web Archive. 

Jason Webber is Web Archiving Engagement & Liaison Manager at the British Library in London. He is responsible for bringing the UK Web Archive to as wide an audience as possible as well as finding and maintaining partnerships and co-operation in research and technology.

Previously he worked on various collections-based digital projects at the Museum of London and as a Web Content Manager at the Natural History Museum, London.

Sabine Hartmann is Web Archiving Engagement & Liaison Officer at the British Library in London. During her career Sabine has worked in museum, archives and heritage organisations in Germany, Belgium and the Netherlands before moving to the UK in 2014.

With a Master’s degree in History of Art and Archaeology she has a keen interest in digital applications and research connecting history and ICT. Sabine has managed various heritage projects including geo-location apps and websites, oral history and other heritage websites.