Results of the Web Archiving API Survey of IIPC Members

If you attended the recent GA, or read some of the many blog posts about it, you probably heard about the potential benefits of standardized web archiving APIs. It was a recurring theme in presentations and informal discussions. During a lunchtime conversation mid-week, someone suggested that the IIPC form a new working group focused on web archiving APIs. Clearly some institutions were interested, but how many? And are they interested enough to participate in a new working group? A group of us at Harvard decided to find out: we developed a short survey and advertised it on the IIPC mailing list.

The survey was open from May 14 through June 1 and received 18 responses from 17 institutions in 8 countries.

Country | Institutions
Czech Republic | National Library of the Czech Republic
Denmark | Netarkivet.dk
France | Bibliothèque nationale de France
Iceland | National and University Library of Iceland
New Zealand | National Library of New Zealand
Spain | National Library of Spain
United Kingdom | The British Library, The National Archives
United States | Stanford University Libraries, Old Dominion University, Internet Archive, LANL, California Digital Library, Harvard Library, Library of Congress, UCLA, University of North Texas

Table 1: The institutions that responded to the survey

The survey asked “Is the topic of web archiving APIs of interest to your institution?” The answer was overwhelmingly “Yes”: all 17 institutions are interested in web archiving APIs. Personally, this is the first unanimous survey response I have ever seen.

Figure 1: A rare unanimous response

When asked “Why are web archiving APIs of interest to your institution?” the responses (see Figure 2) were varied but had common themes. Many of the reasons were from the perspective of an institution providing or maintaining web archiving programs or infrastructure, for example:

  • “The sustainability of our program depends on the web archiving community as a whole better aligning itself to collaboratively maintain and augment a core set of interoperable systems…”
  • “…appreciate how this would reduce our technical spend in the long term…”
  • “… APIs should ease the maintenance and evolution of the complex set of tools we are using to complete the document cycle: selection, collect, access and preservation.”

Another common response was from the perspective of providing a better service for researchers, for example:

  • “…a common/standard API would make it easier for researchers to work with multiple web archives with standard methodologies.”
  • “To help researchers explore our collection, including within our catalogue system, to link with other web archive collections and potentially to interface with different components of our infrastructure.”
  • “We often do aggregation and want to have a way to archive resources of interest with the help of scripts, in both of these cases an API would be ideal.”

Figure 2: A word cloud generated from the free-text responses about why institutions are interested in web archiving APIs [created with the Word Cloud Generator by Jason Davies]

The respondents were asked “If we organized a new working group within the IIPC to work on web archiving APIs would your institution be willing to participate?” All but one institution said “Yes”. The institution that said “No” explained that it was interested but currently lacked the staff resources to participate actively.

Figure 3: Most of the institutions are willing to participate in the new working group.

We asked “In what specific ways could your institution participate? Please select all that apply.” The results are shown in Table 2. Most of the respondents would like to help define the functional requirements, but a good number would also like to contribute use cases and help design the technical details. Importantly, some institutions are willing to help run the meetings.

Specific Way | % of Respondents | Count of Respondents
Help define the functional requirements for a web archiving API | 94% | 15
Contribute curatorial, researcher or management requirements and use cases | 81% | 13
Help design the technical details of a web archiving API | 69% | 11
Help schedule and run the working group meetings | 19% | 3
Other* | 6% | 1

Table 2: The specific ways institutions would participate in the working group
* One institution said that they would be willing to implement and test web archiving APIs where appropriate and aligned with local needs

So the answer to our original question is a clear YES! There are enough IIPC institutions that are interested and willing to participate in meaningful ways in this new working group. Stay tuned while we work through the logistics of how to start. One of the first steps will be to identify co-chairs for the group. If you are interested in this please let me know! And thanks everyone for taking the time to fill out this survey.

By Andrea Goethals, Manager of Digital Preservation and Repository Services, Harvard Library


Facing the Challenge of Web Archives Preservation Collaboratively

Web archiving is often about collecting the web, but that is only half the story: once collected, the material has to be preserved. This is what the Preservation Working Group of the IIPC is focused on. D-Lib Magazine recently published an article called Facing the Challenge of Web Archives Preservation Collaboratively: The Role and Work of the IIPC Preservation Working Group.


The article was written by the group members: Andrea Goethals (Harvard Library), Clément Oury (International ISSN Centre), David Pearson (National Library of Australia), Barbara Sierman (KB National Library of the Netherlands) and Tobias Steinke (Deutsche Nationalbibliothek – German National Library).

The article sets out the goals, activities and results of the Preservation Working Group, describing the findings of a 2013 survey of IIPC members about their approaches to preserving the web. The authors also describe the databases maintained by the group that hold crucial information for web archiving: the Environments Database and the Risks Database.

Barbara Sierman, x-post from the KB (Dutch National Library) blog

Update on OpenWayback


OpenWayback 2.2.0 was recently released. This marks OpenWayback’s third release since becoming a ward of the IIPC in late 2013. It is a fairly modest update, reflecting our desire to make frequent, modest-sized releases. A few things are still worth pointing out.

First, as of this release, OpenWayback requires Java 7. Java 7 has been out for four years and Java 6 has not been publicly updated in over two years. It is time to move on.

Second, OpenWayback now officially supports internationalized domain names, i.e. domain names containing non-ASCII characters.
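To make the distinction concrete, here is a minimal Python sketch (not OpenWayback code) of what IDN support involves: mapping between the Unicode form of a domain name that users type and the ASCII “Punycode” form that actually appears in URLs, WARC records and CDX indexes.

```python
# Internationalized domain names travel and get indexed in an
# ASCII-compatible encoding (Punycode, RFC 3492). Supporting IDNs means
# converting between the Unicode form and that ASCII form.
unicode_name = "bücher.eu"

# Unicode -> ASCII: "xn--bcher-kva.eu"
ascii_name = unicode_name.encode("idna").decode("ascii")

# ASCII -> Unicode: back to "bücher.eu"
roundtrip = ascii_name.encode("ascii").decode("idna")
```

Python’s built-in `idna` codec implements the older IDNA 2003 rules, which is enough to illustrate the round trip.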

Third, UI localization has been much improved. It should now be possible to translate the entire interface without having to mess with the JSP files and otherwise “go under the hood”.

And the last thing I’ll mention is the new WatchedCDXSource which removes the need to enumerate all the CDX files you wish to use. Simply designate a folder and OpenWayback will pick up all the CDX files in it.
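The idea can be sketched in a few lines of Python (a hypothetical illustration of the behaviour, not OpenWayback’s actual Java implementation): instead of enumerating index files in the configuration, the designated folder is scanned and every CDX file found there is treated as part of the index.

```python
from pathlib import Path

def discover_cdx_files(folder):
    """Return every *.cdx file in `folder`, sorted by name, so that
    newly added index files are picked up without a configuration change."""
    return sorted(Path(folder).glob("*.cdx"))
```

Dropping a new CDX file into the watched folder is then all that is needed to make its captures available.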

The road to here hasn’t been easy, but it is encouraging to see that the number of people involved is slowly but surely rising. For the 2.2.0 release, we had code contributions from Roger Coram (BL), Lauren Ko (UNT), John Erik Halse (NLN), Sawood Alam (ODU), Mohamed Elsayed (BA) and myself, in addition to the IIPC-paid-for work by Roger Mathisen (NLN). Even more people were involved in reporting issues, managing the project and testing the release candidate. My thanks to everyone who helped out.

And going forward, we are certainly going to need people to help out.


Version 2.3.0 of OpenWayback will be another modest bundle of fixes and minor features. We hope it will be ready in September (or so). There are already 10 issues open for it as I write this.

But we also have larger ambitions. Enter version 3.0.0. It will be developed in parallel with 2.3.0 and aims to make some big changes. Breaking changes. OpenWayback is built on an aging codebase, almost a decade old at this point, and to move forward some fundamental changes are needed.

The exact features to be implemented will likely shift as work progresses but we are going to increase modularity by pushing the CDXServer front and center and removing the legacy resource stores. In addition to simplifying the codebase, this fits very nicely with the talk at the last GA about APIs.

We’ll also be looking at redoing the user interface using iFrames and providing insight into the temporal drift of the page being viewed. The planned issues are available on GitHub. The list is far from locked and we welcome additional input on which features to work on.

We welcome additional work on those features even more!

I’d like to wrap this up with a call to action. We need a reasonably large community around the project to sustain it. Whether it’s testing and bug reporting, occasional development work or taking on more by becoming one of our core developers, your help is both needed and appreciated.

If you’d like to become involved, you can simply join the conversation on the OpenWayback GitHub page. Anyone can open new issues and post comments on existing issues. You can also join the OpenWayback developers mailing list.

Kristinn Sigurðsson – Head of IT at the National and University Library of Iceland – x-posted from Kris’s blog

 

A first attempt to archive the .EU domain


The .EU domain is commonly used to reference sites related to Europe. EURid is the organization appointed by the European Commission to operate the .EU domain and presents it under the slogan “Your European Identity”.

Preserving online information published on sites hosted under the .EU domain is therefore crucial to preserving European cultural heritage for future generations.

The strategy adopted to archive the World Wide Web has been to delegate responsibility for each domain to the respective national archiving institution. However, the .EU domain does not fit this model because it covers multiple nations. Thus, the preservation of .EU sites has not yet been assigned to or undertaken by any institution.

RESAW is a European network that aims to create a Research Infrastructure for the Study of Archived Web Materials (resaw.eu). Within the scope of RESAW activities, the Portuguese Web Archive made a first attempt to crawl and preserve websites hosted under the .EU domain. This first crawl began on 21 November 2014 and finished on 16 December 2014.

Challenges crawling .EU

The first challenge was obtaining seeds for the crawl, because our contacts with EURid to get the list of .EU domains failed. The crawl was launched using a total of 34,138 unique seeds obtained from several sources such as Google.com, DomainTyper.com, DMOZ.org and Alexa Top Sites.

During this first crawl we had to iteratively tune the crawl configuration to overcome hazardous situations caused by web spam sites. The set of spam filters created will be useful for optimizing future crawls.

We crawled 250 million documents from over 1 million hosts. The documents were stored in 5.8 TB of disk space using the compressed ARC format. We also extracted 135,907 unique domain URLs that will be used as seeds for the next crawl.

Two more crawls of the .EU domain planned

As future work we intend to perform two more crawls of the .EU domain, to be integrated into the Portuguese Web Archive collections. The next crawl is planned to start in November 2015. We estimate that 23 TB of disk space will be required for it (without performing deduplication).
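The reported numbers are roughly self-consistent, as this back-of-the-envelope check in Python shows (scaling storage linearly with seed growth is my own assumption, not the archive's stated estimation method):

```python
# Figures reported for the first .EU crawl
docs = 250_000_000        # documents crawled
storage_tb = 5.8          # compressed ARC storage, in TB
seeds_first = 34_138      # seeds used for the first crawl
seeds_next = 135_907      # domain URLs extracted for the next crawl

# Average compressed size per document: about 23 KB
avg_bytes = storage_tb * 1e12 / docs

# Naively scaling storage by the ~4x growth in seeds lands close
# to the 23 TB quoted for the next crawl
next_crawl_tb = storage_tb * seeds_next / seeds_first
```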

Each of the .EU crawls will be indexed and become searchable through www.arquivo.pt one year after its completion.

 

Researchers wanted!

Collaborations with researchers interested in studying the collected web data or crawl logs are welcome. If researchers express interest, we can create a prototype system with restricted access to enable search and processing of the .EU crawls.

This first experiment in archiving the .EU domain was performed mostly with resources from the Portuguese Web Archive. Collaborations with other institutions, for instance to identify relevant seeds, are crucial to improving the quality of the crawls. The results of this experiment are encouraging, but an effective archive of the .EU domain requires further resources and collaboration.

Learn more


Daniel Gomes and Daniel Bicho, RESAW / Portuguese Web Archive