IIPC Hackathon at the British Library: Laying a New Foundation

By Tom Cramer, Stanford University

This past week, 22-23 September 2016, members of the IIPC gathered at the British Library for a hackathon focused on web crawling technologies and techniques. The event saw 14 technologists from 12 institutions near (the UK, Netherlands, France) and far (Denmark, Iceland, Estonia, the US and Australia). The event provided a rare opportunity for an intensive, two-day, uninterrupted deep dive into how institutions are capturing web content, and to explore opportunities for advancing the state of the art.

I was struck by the breadth and depth of topics. In particular…

  • Heritrix nuts and bolts. Everything from small tricks and known issues for optimizing captures with Heritrix 3, to how people were innovating around its edges, to the history of the crawler, to a wishlist for improving it (including better documentation).
  • Brozzler and browser-based capture. Noah Levitt of the Internet Archive, the engineer behind Brozzler, gave a mini-workshop on the latest developments and on how to get it up and running. This was one of the biggest points of interest as institutions look to enhance their ability to capture dynamic content and social media. About ⅓ of the workshop attendees went home with fresh installs on their laptops. (Also note, per Noah, pull requests welcome!)
  • Technical training. Web archiving is a relatively esoteric domain without a huge community; how have institutions trained new staff or fractionally assigned staff to engage effectively with web archiving systems? This appears to be a major, common need, and also one that is approachable. Watch this space for developments…
  • QA of web captures: as Andy Jackson of the British Library put it, how can we tip the scales of mostly manual QA with some automated processes, to mostly automated QA with some manual training and intervention?
  • An up-to-date registry of web archiving tools. The IIPC currently maintains a list of web archiving tools, but it’s a bit dated (as these sites tend to become). Just to get the list in a place where tool users and developers can update it, a working copy of this list is now in the IIPC Github organization. Importantly, the group decided that it might be just as valuable to create a list of dead or deprecated tools, as these can often be dead ends for new adopters. See (and contribute to) https://github.com/iipc/iipc.github.io/wiki  Updates welcome!
  • System & storage architectures for web archiving. How institutions are storing, preserving and computing on the bits. There was a great diversity of approaches here, and this is likely good fodder for a future event and more structured knowledge sharing.

The biggest outcome of the event may have been the energy and inherent value in having engineers and technical program managers spending lightly structured face time exchanging information and collaborating. The event was a significant step forward in building awareness of approaches and people doing web archiving.

IIPC Hackathon, Day 1.

This validates one of the main focal points for the IIPC’s portfolio on Tools Development, which is to foster more grassroots exchange among web archiving practitioners.

The participants committed to keeping the dialogue going, and to expanding the number of participants within and beyond IIPC. Slack is emerging as one of the main channels for technical communication; if you’d like to join in, let us know. We also expect to run multiple, smaller face-to-face events in the next year: 3 in Europe and another 2-3 in North America, with several delving into APIs, archiving time-based media, and access. (These are all in addition to the IIPC General Assembly and Web Archiving Conference on 27-30 March 2017, in Lisbon.) If you have an idea for a specific topic or would like to host an event, please let us know!

Many thanks to all the participants at last week’s hackathon, and to the British Library (especially Andy Jackson and Olga Holownia) for hosting it. It provided exactly the kind of forum needed by the web archiving community to share knowledge among practitioners and to advance the state of the art.

What can IIPC do to advance tools development?

By Tom Cramer, Stanford University

The International Internet Preservation Consortium (IIPC) renewed its consortial agreement at the end of 2015. In the process, it affirmed its longstanding mission to work collaboratively to foster the implementation of solutions to collect, preserve and provide access to Internet content. To achieve this aim, the Consortium is committed to “facilitate the development of appropriate and interoperable, preferably Open Source, software and tools.”

As the IIPC sets its strategic direction for 2016 and beyond, Tools Development will feature as one of three main portfolios of activity (along with Member Engagement, and Partnerships & Outreach). At its General Assembly in Reykjavik, IIPC members held a series of break out meetings to discuss Tools Development. This blog post presents some of that discussion, and lays out the beginnings of a direction for IIPC, and perhaps the web archiving community at large, to pursue in order to build a richer toolscape.

The Current State of Tools Development within the IIPC

The IIPC has always emphasized tool development. Per its website, one of the main objectives “has been to develop a high-quality, easy-to-use open source tools for setting up a web archiving chain.” The registry of software lists an impressive array of tools for everything from acquisition and curation to storage and access. And coming from the 2016 General Assembly and Web Archiving Conference, it’s clear that there is actually quite a lot of development going on among and beyond member institutions. Despite all this, the reality may be slightly less rosy than the multitude of listings for web archiving tools might indicate…

  • Many are deprecated, or worse, abandoned
  • Much of the local development is kept local, and not accessible to others for reuse or enhancement
  • There is a high degree of redundancy among development efforts, due to lack of visibility, lack of understanding, or lack of an effective collaborative framework for code exchange or coordinated development
  • Many of the tools are not interoperable with each other due to differences in approach in policy, data models or workflows (sometimes intentional, many times not)
  • Many of the big tools which serve as mainstays for the community (e.g., Heritrix for crawling, Open Wayback for replay) are large, monolithic, complex pieces of software that have multiple forks and less-than-optimal documentation

Given all this, one wonders if IIPC members really believe that coordinated tool development is important; perhaps instead it’s better to let a thousand flowers bloom? The answer to this last question was, refreshingly, a resounding NO. When discussed among members at Reykjavik, support for tools development as a top priority was unanimous, and enthusiastic. The world of the Web and web archiving is vast, yet the number of participants relatively small; the more we can foster a rich (and interoperable) tool environment, the more everyone can benefit in any part of the web archiving chain. Many members in fact said they joined IIPC expressly because they sought a collaboratively defined and community-supported set of tools to support their institutional programs.

In the words of Daniel Gomes from the Portuguese Web Archive: of course tool development is a priority for IIPC; if we don’t develop these tools, who will?

A Brighter Future for Collaborative Tool Development

Several possibilities and principles presented themselves as ways to enhance the way the web archiving community pursues tool development in the future. Interestingly, many of these were more about how the community can work together rather than specific projects.  The main principles were:

  • Interoperability | modularity | APIs are key. The web archiving community needs a bigger suite of smaller, simpler tools that connect together. This promotes reuse of tools, as well as ease of maintenance; allows for institutions to converge on common flows but differentiate where it matters; enables smaller development projects which are more likely to be successful; and provides on ramps for new developers and institutions to take up (and add back to) code. Developing a consensus set of APIs for the web archiving chain is a clear priority and prerequisite here.
  • Design and development need to be driven by use cases. Many times, the biggest stumbling block to effective collaboration is differing goals or assumptions. Much of the lack of interoperability comes from differences in institutional models and workflows that make it difficult for code or data to connect with other systems. Doing the analysis work upfront to clarify not just what a tool might be doing but why can bring institutional models and developers onto the same page, and facilitate collaborative development.
  • We need collaborative platforms & social engineering for the web archiving technical community. It’s clear from events like the IIPC Web Archiving Conference and reports such as Helen Hockx-Yu’s of the Internet Archive that a lot of uncoordinated and largely invisible development is happening locally at institutions. Why? Not because people don’t want to collaborate, but because it’s less expensive and more expedient. IIPC and its members need to reduce the friction of exchanging information and code to the point that, as Barbara Sierman of the National Library of the Netherlands said, “collaboration becomes a habit.” Or as Ian Milligan of the University of Waterloo put it, we need the right balance between “hacking” and “yacking”.
  • IIPC should foster better development of tools both large and small. Collaboration on small tools development is a clear opportunity; innovation is happening at the edges, and by working together individual programs can advance their end-to-end workflows in compelling new ways (social media, browser-based capture and new forms of visualization and analysis are all striking examples here). But it’s also clear that there is appetite and need for collaboration on the traditional “big” things that are beyond any single member’s capacity to engineer unilaterally (e.g., Heritrix, WayBack, full text search). As IIPC hasn’t been as successful as anyone might like in terms of directed, top-down development of larger projects, what can be done to carve these larger efforts up into smaller pieces that have a greater chance of success? How can IIPC take on the role of facilitator and matchmaker rather than director & do-er?

Next Steps

The stage is set for revisiting and revitalizing how IIPC works together to build high quality, use case-driven, interoperable tools. Over the next few months (and years!) we will begin translating these needs and strategies into concrete actions. What can we do? Several possibilities suggested themselves in Reykjavik.

  1. Convene web archiving “hack fests”. The web archiving technical community needs face time. As Andy Jackson of the British Library opined in Reykjavik, “How can we collaborate with each other if we don’t know who we are, or what we’re doing?” Face time fuels collaboration in a way that no amount of WebEx’ing or GitHub comments can. Let’s begin to engineer the social ties that will lead to stronger software ties. A couple of three-day unconferences per year would go a long way to accelerating collaboration and diffusion of local innovation.
  2. Convene meetings on key technical topics. It’s clear that IIPC members are beginning to tackle major efforts that would benefit from some early and intensive coordination: Heritrix & browser-based crawling, elaborations on WARC, next steps for Open Wayback, full text search and visualization, use of proxies for enhanced capture, dashboards and metrics for curators and crawl engineers. All of these are likely to see significant development (sometimes at as many as 4-5 different institutions) in the next year. Bringing implementers together early offers the promise of coordinated activity.
  3. Coordinate on API identification and specification. There is clear interest in specifying APIs and more modular interactions across the entire web archiving tool chain. IIPC holds a privileged place as a coordinating body across the sites and players interested in this work. IIPC should structure some way to track, communicate, and help systematize this work, leading to a consortium-based reference architecture (based on APIs rather than specific tools) for the web archiving tool chain.
  4. Use cases. Reykjavik saw a number of excellent presentations on user-centered design and use case-driven development. This work should be captured and exposed to the web archiving community to let participants learn from each other’s work, and to generate a consensus reference architecture based on demonstrated (not just theoretical) needs.

Note that all of these potential steps focus as much on how IIPC can work together as on any specific project, and they all seem to fall into the “small steps” category. In this they have the twin benefits of being feasible to accomplish in the next year and having a good chance to succeed. And if they do succeed, they promise to lay the groundwork for more and larger efforts in the coming years.

What do you think IIPC can do in the next year to advance tools development? Post a comment in this blog or send an email.

Five Takeaways from AOIR 2015

I recently attended the annual Association of Internet Researchers (AOIR) conference in Phoenix, AZ. It was a great conference that I would highly recommend to anyone interested in learning first hand about research questions, methods, and studies broadly related to the Internet.

Researchers presented on a wide range of topics, across a wide range of media, using both qualitative and quantitative methods. You can get an idea of the range of topics by looking at the conference schedule.

I’d like to briefly share some of my key takeaways. I apologize in advance for oversimplifying what was a rich and deep array of research work; my goal here is to provide a quick summary and not an in-depth review of the conference.

  1. Digital Methods Are Where It’s At

I attended an all-day, pre-conference digital methods workshop. As a testament to the interest in this subject, the workshop was so overbooked they had to run three concurrent sessions. The workshops were organized by Axel Bruns, Jean Burgess, Tim Highfield, Ben Light, and Patrik Wikstrom (Queensland University of Technology), and Tama Leaver (Curtin University).

Researchers are recognizing that digital research skills are essential. And, if you have some basic coding knowledge, all the better.

At the digital methods workshop, we learned about the “Walkthrough” method for studying software apps, tools for “web scraping” to gather data for analysis, Tableau to conduct social media analysis, and “instagrammatics,” analyzing Instagram.

FYI: The Digital Methods Initiative from Europe has tons of great information, including an amazing list of tools.

  2. Twitter API Is Also Very Popular

There were many Twitter studies, and they all used the Twitter API to download tweets for analysis. Although researchers are widely using the Twitter API, they expressed a lot of frustration over its limitations. For example, you can only download up to 1% of the total Twitter volume for free. If you’re studying something obscure, you are probably okay, but if you’re studying a topic like #jesuischarlie, you’ll have to pay to get the entire output. Many researchers don’t have the funds for that. One person pointed out that it would be ideal to have access to the Library of Congress’s Twitter archive. Yes, agreed!

  3. Social Media over Web Archives

Researchers presented conclusions and provided commentary on our social behavior through studies of social media such as Snapchat, Twitter, Facebook, and Instagram. There were only a handful of presentations using web archived materials. If a researcher used websites, they viewed them live or conducted “web scraping” with tools such as Outwit and Kimono. Many also used custom Python scripts to gather the data from the sites.

  4. Fair Use Needs a PR Movement

There’s still much misunderstanding about what researchers can and cannot do with digital materials. I attended a session where the presenter shared findings from surveys conducted with communication scholars about their knowledge of fair use. The results showed that there was (very!) limited understanding of fair use. Even worse, the findings showed that those scholars who had previously attended a fair use workshop were even less likely to understand fair use! Moreover, many admitted that they did not conduct particular studies because of a (misguided) fear of violating copyright. These findings were corroborated by the scholars from a variety of fields who were in the room.

  5. Opportunities for Collaboration

I asked many researchers if they were concerned that they were not saving a snapshot of websites or apps at the time of their studies. The answer was a resounding “yes!” They recognize that sites and tools change rapidly, but they are unaware of tools or services they can use, or that their librarians and archivists have solutions.

Clearly there is room for librarians/archivists to conduct more outreach to researchers to inform them about our rich web archive collections and to talk with them about preservation solutions, good data management practices and copyright.

Who knew?

Let me end with sharing one tidbit that really blew my mind. In her research on “Dead Online: Practices of Post-Mortem Digital Interaction,” Paula Kiel presented on the “digital platforms designed to enable post-mortem interactions.” Yes, she was talking about websites where you can send posthumous messages via Facebook and email! For example, https://www.safebeyond.com/, “Life continues when you pass… Ensure your presence – be there when it counts. Leave messages for your loved ones – for FREE!”



By Rosalie Lack, Product Manager, California Digital Library

How Well Are Arabic Websites Archived?

Arabic summary (translated)

Web archiving is the process of collecting data found on the World Wide Web in order to preserve it from loss and make it available to researchers in the future. We carried out this study to try to estimate how well Arabic websites are archived and indexed. We collected 15,092 links from three sites that serve as directories of Arabic websites: the Arabic DMOZ directory, the Raddadi directory, and the Star28 directory. We then applied language identification tools and kept only the Arabic-language sites, which left 7,976 links. The live sites among these were then crawled, yielding 300,646 links. From this sample we found the following:
1) 46% of Arabic websites have not been archived, and 31% of Arabic websites have not been indexed by Google.
2) 14.84% of Arabic websites have an Arabic country code domain such as (.sa), and 10.53% of the sites have an Arabic geographic location based on the Internet Protocol (IP) address location of the host.
3) Having either an Arabic geographic location or an Arabic country code domain negatively affects archiving.
4) Most archived pages are near the top level of a site, while pages deep within a site are not well archived.
5) A site’s presence in the Arabic DMOZ directory positively affects its archiving.

It is anecdotally known that archives favor content in English and from Western countries. In this blog post we summarize our JCDL 2015 paper “How Well are Arabic Websites Archived?”, where we provide an initial quantitative exploration of this well-known phenomenon. When comparing the number of mementos for English vs. Arabic websites, we found that English websites are archived more than Arabic websites. For example, when comparing a highly ranked English sports website (based on Alexa ranking), such as ESPN, with a highly ranked Arabic sports website, such as Kooora, we find that ESPN has almost 13,000 mementos, while Kooora has only 2,000 mementos.

Figure 1

We also compared the English vs Arabic encyclopedia and found that the English Wikipedia has 10,000 mementos vs. the Arabic Wikipedia with only around 500 mementos.

Figure 2

Arabic is the fourth most popular language on the Internet, trailing only English, Chinese, and Spanish. Based on the Internet World Stats, in 2009, only 17% of Arabic speakers used the Internet, but by the end of 2013 that had increased to almost 36% (over 135 million), approaching the world average of 39% of the population using the Internet.

Our initial step, collecting Arabic seed URIs, presented our first challenge. We found that Arabic websites could have:
1) Both Arabic geographic IP location (GeoIP) and an Arabic country code top level domain (ccTLD) such as www.uoh.edu.sa.
2) An Arabic GeoIP, but a non Arabic ccTLD such as www.al-watan.com.
3) An Arabic ccTLD, but a non Arabic GeoIP such as www.haraj.com.sa, with a GeoIP in Ireland.
4) Neither an Arabic GeoIP, nor an Arabic ccTLD such as www.alarabiyah.com, with a GeoIP in the US.
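The four cases above amount to a two-way classification on ccTLD and GeoIP. A minimal Python sketch follows; it is illustrative only and is not the code used in the study. The `ARABIC_CCTLDS` and `ARABIC_COUNTRIES` sets are partial, assumed lists, and the GeoIP country code is taken as an input rather than looked up.

```python
from urllib.parse import urlparse

# Partial, assumed lists for illustration; the study's actual lists are not given here.
ARABIC_CCTLDS = {"sa", "eg", "ae", "lb", "jo", "kw", "qa"}
ARABIC_COUNTRIES = {"SA", "EG", "AE", "LB", "JO", "KW", "QA"}  # ISO 3166 codes


def has_arabic_cctld(uri: str) -> bool:
    """Check whether the last label of the host name is an Arabic ccTLD."""
    host = urlparse(uri if "://" in uri else "http://" + uri).hostname or ""
    return host.rsplit(".", 1)[-1].lower() in ARABIC_CCTLDS


def classify(uri: str, geoip_country: str) -> str:
    """Place a URI into one of the four ccTLD/GeoIP categories."""
    cctld = has_arabic_cctld(uri)
    geo = geoip_country.upper() in ARABIC_COUNTRIES
    if cctld and geo:
        return "both"
    if geo:
        return "geoip_only"
    if cctld:
        return "cctld_only"
    return "neither"
```

For example, `classify("www.haraj.com.sa", "IE")` falls into the "cctld_only" category, matching case 3 above.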

So for collecting the seed URIs we first searched for Arabic website directories, and grabbed the top three based on Alexa ranking. We selected all live URIs (11,014) from the following resources:
1) Open Directory project (DMOZ) – registered in US in 1999.
2) Raddadi – a well known Arabic directory, registered in Saudi Arabia in 2000.
3) Star28 – an Arabic directory registered in Lebanon in 2004.

Although these URIs are listed in Arabic directories, it does not mean that the content is in Arabic. For example, www.arabnews.com is an Arab news website listed in Star28 but provides English language news about Arabic-related topics.

It was hard to find a reliable language test to determine the language of a page, so we employed four different methods: HTTP Content-Language, the HTML title tag, a trigram method, and a language detection API. As shown in Figure 3, the intersection between the four methods was only 8%. We made the decision that any page that passed any of these tests would be included as “in the Arabic web”. The resulting number of Arabic seed URIs was 7,976 out of 11,014.
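The "pass any test" rule can be sketched as a union of independent checks. The Python sketch below is an assumption-laden illustration, not the study's code: it implements only simplified stand-ins for two of the four tests (the Content-Language header and the title tag), and the function names are hypothetical. The trigram and detection-API tests would slot into the same `any(...)` call.

```python
import re

ARABIC_CHARS = re.compile(r"[\u0600-\u06FF]")  # basic Arabic Unicode block


def header_says_arabic(content_language):
    """True if the first tag of an HTTP Content-Language value is Arabic."""
    if not content_language:
        return False
    primary = content_language.split(",")[0].strip().lower()
    return primary.split("-")[0] == "ar"


def title_is_arabic(html):
    """True if the HTML <title> contains at least one Arabic character."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
    return bool(match and ARABIC_CHARS.search(match.group(1)))


def in_arabic_web(content_language, html):
    """Union rule: passing any single test is enough to be included."""
    tests = (header_says_arabic(content_language), title_is_arabic(html))
    return any(tests)
```

Taking the union rather than the intersection trades precision for recall, which fits the 8% overlap reported between the four methods.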

Figure 3

To increase the number of URIs, we crawled the live Arabic seed URIs and checked the language using the previously described methods. This increased our data set to 300,646 Arabic seed URIs.

Next we used the ODU Memento Aggregator (mementoproxy.cs.odu.edu) to verify if the URIs were archived in a public web archive. We found that 53.77% of the URIs are archived with a median of 16 mementos per URI. We also analyzed the timespan of the mementos (the number of days between the datetimes of the first memento and last memento) and found that the median archiving period was 48 days.
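As a rough illustration of these measures, the Python sketch below computes the archived fraction, the median memento count, and the median timespan from a mapping of URI to memento datetimes. The input structure and the `summarize` helper are assumptions for the example; the sketch does not query the aggregator itself.

```python
from datetime import datetime
from statistics import median


def timespan_days(memento_datetimes):
    """Days between the first and last memento of one URI."""
    return (max(memento_datetimes) - min(memento_datetimes)).days


def summarize(timemaps):
    """timemaps: dict mapping URI -> list of memento datetimes (empty if unarchived)."""
    archived = {uri: dts for uri, dts in timemaps.items() if dts}
    if not archived:
        return None
    return {
        "archived_fraction": len(archived) / len(timemaps),
        "median_mementos": median(len(dts) for dts in archived.values()),
        "median_timespan_days": median(timespan_days(dts) for dts in archived.values()),
    }
```

A URI with a single memento has a timespan of zero days under this definition, which pulls the median down when many URIs were captured only once.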

We also investigated seed source and archiving and found that DMOZ had an archiving rate of 96%, followed by 45% from Raddadi, and 42% from Star28.

In the data set we found that 14% of the URIs had an Arabic ccTLD. We also looked at the GeoIP location since it was an important factor to determine where the hosts of webpages might be located. Using MaxMind GeoLite2, we found 58% of the Arabic seed URIs are hosted in the US.

Figure 4 shows count detail for Arabic GeoIP and ccTLD. We found that: 1) only 2.5% of the URIs are located in an Arabic country, 2) only 7.7% had an Arabic ccTLD, 3) 8.6% are both located in an Arabic country and have an Arabic ccTLD, and 4) the rest of the URIs (81%) are neither located in Arabic country, nor had an Arabic ccTLD.

Figure 4

We also wanted to verify if the URI had been there long enough to be archived. We used the CarbonDate tool, developed by members of the WS-DL group, to analyze our archived Arabic data set. We found that 2013 was the most frequent creation date for archived Arabic webpages. We also wanted to investigate the gap between the creation date of Arabic websites and when they were first archived. We found that 19% of the URIs have an estimated creation date that is the same as the first memento date. Of the remaining URIs, 28% have a creation date more than one year before the first memento was archived.

We were also interested in whether the Arabic URIs are indexed in search engines. We used Google’s Custom Search API (which may produce different results than Google’s public web interface), and found that 31% of the Arabic URIs were not indexed by Google. When looking at the source of the URIs we found that 82% of the DMOZ URIs are indexed by Google, which was expected since DMOZ URIs are more likely to be found and archived.

In conclusion, when looking at the seed URIs we found that DMOZ URIs are more likely to be found and archived, and a website is more likely to be indexed if it is present in a directory. For right now, if you want your Arabic language webpage to be archived, host it outside of an Arabic country and get it listed in DMOZ.

I presented this work at JCDL 2015; the presentation slides can be found here.

by Lulwah M. Alkwai, PhD student, Computer Science Department, Old Dominion University, VA, USA

LANL’s Time Travel Portal, Part 2

Architecturally, the Time Travel portal operates in a manner similar to a distributed search. Hence, it faces challenges related to query routing, response time optimization, and response freshness. The new infrastructure includes some rule-based mechanisms for intelligent routing, but a thorough solution is being investigated in the IIPC-funded Web Archive Profiling project. A background cache continuously fetches TimeMap information from distributed archives that are compliant with the Memento protocol, either natively or by proxy. Its collection consists of a seed list of popular URIs augmented with URIs requested by Memento clients. Whenever possible, responses are delivered from a front-end cache that remains in sync with the background cache using the ResourceSync protocol. If a request cannot be delivered from cache, because cached content is unavailable or stale, realtime TimeGate requests are sent to Memento-compliant archives only. This setup achieves a satisfactory balance between response times, response completeness, and response freshness. If needed, the front-end cache can be bypassed and a realtime query can explicitly be initiated using the regular browser refresh approach, e.g. Shift-Reload in Chrome.
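The cache-first behavior with a realtime fallback can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the portal's implementation: the `FrontEndCache` class, its parameters, and the `fetch_realtime` callback are hypothetical, and the ResourceSync synchronization with the background cache is not modeled.

```python
import time


class FrontEndCache:
    """Serve from cache when fresh; otherwise fall back to a realtime fetch."""

    def __init__(self, fetch_realtime, max_age_seconds, clock=time.time):
        self._fetch = fetch_realtime      # hypothetical callback: uri -> response
        self._max_age = max_age_seconds   # assumed staleness threshold
        self._clock = clock
        self._store = {}                  # uri -> (stored_at, response)

    def get(self, uri, bypass=False):
        """Return (response, source); bypass=True mimics a forced browser reload."""
        entry = self._store.get(uri)
        fresh = entry is not None and self._clock() - entry[0] <= self._max_age
        if fresh and not bypass:
            return entry[1], "cache"
        response = self._fetch(uri)       # realtime query to the archives
        self._store[uri] = (self._clock(), response)
        return response, "realtime"
```

The `bypass` flag plays the role of the Shift-Reload path described above: it skips a fresh cache entry and refreshes it with the realtime result.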

The Time Travel logo that can be used to advertise the portal.

The development of the Time Travel portal was also strongly motivated by the desire to lower the barrier for developing Memento related functionality, especially at the browser side. Memento protocol information is – appropriately – communicated in HTTP headers. However, browser-side scripts typically do not have header access. Hence, we wanted to bring Memento capabilities within the realm of browser-side development. To that end, we introduced several RESTful APIs:

We are thrilled by the continuous growth in the usage of these APIs and would be interested to learn what kinds of applications people out there are building on top of our infrastructure. We know that the new version of the Mink browser extension uses the new APIs. Also, the Time Travel portal’s Reconstruct service, based on pywb, leverages our own APIs. Memento for Chrome now obtains its list of archives from the Archive Registry. Also, the Robust Links approach to combat reference rot is based on API calls, but that will be the subject of another blog post.

IIPC members that operate public web archives that are not yet Memento compliant are reminded that Open Wayback and pywb natively support Memento. From the perspective of the Time Travel portal, compliance means that we don’t have to operate a Memento proxy, that archive holdings can be included in realtime queries, and that both Original URIs and Memento URIs can be used to Find/Reconstruct. From a broader perspective, it means that the archive becomes a building block in a global, interoperable infrastructure that provides a time dimension to the web.

By Herbert Van de Sompel, Digital Library Researcher at Los Alamos National Laboratory

LANL’s Time Travel Portal, Part 1

In early February 2015, we launched the Time Travel portal, which provides cross-system discovery of Mementos.

The design and development of the Time Travel portal was a significant investment and took about a year from conception to release. It involved work directly related to the portal itself, but also a fundamental redesign of the Memento Aggregator, the introduction of several RESTful APIs, the transfer of the Memento infrastructure from LANL’s network to the Amazon cloud, and operating the new environment as an official service of the LANL Research Library.

The team that designed and implemented the Time Travel portal, from left to right: Luydmila Balakireva, Harihar Shankar, Martin Klein, Ilya Kremer, James Powell, and Herbert Van de Sompel

A major motivation for the development of the new portal was to lower the barrier for experiencing Memento’s web time travel. Our flagship Memento for Chrome extension remains the optimal way to experience cross-system time travel. But, we wanted some of the power of Memento to be accessible without the need for an extension.

The Time Travel portal has a basic interface that allows entering a URI and a datetime. It offers a Find and a Reconstruct service:

  • The Find service looks for the Mementos in systems covered by the Memento Aggregator. For each archive that holds Mementos for the requested URI, the Memento that is temporally closest to the submitted date-time is listed, with a clear indication of the archive’s name. Results are ordered by temporal proximity to the requested date-time. For each archive, the first/last/previous/next Memento are also shown when that information is available. For all listed Mementos, a link leads straight into the holding archive. A Find URI can also be constructed. Its syntax follows the convention introduced by Wayback software, e.g. http://timetravel.mementoweb.org/list/20081128230827/http://apple.com.
  • The Reconstruct service reassembles a page using the best Mementos from various Memento-compliant archives. Hereby, “best” means temporally closest to the requested date-time. Hence, in a Reconstruct result page, the archived HTML, images, style sheets, JavaScript, etc. can originate from different archives. Many times, the assembled pages look more complete and the temporal spread of components is smaller, when compared with corresponding pages in distinct archives. As such, the Reconstruct service provides a nice illustration of the cross-archive interoperability introduced by the Memento protocol. A Reconstruct URI is available using the same Wayback URI convention, e.g. http://timetravel.mementoweb.org/reconstruct/20081128230827/http://apple.com.
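The URI convention shown in both bullets (base, service name, 14-digit datetime, original URI) can be generated programmatically. Here is a small Python sketch; the `time_travel_uri` helper is a hypothetical name, and only the two path segments that appear in the examples above are assumed.

```python
from datetime import datetime

BASE = "http://timetravel.mementoweb.org"
SERVICES = ("list", "reconstruct")  # path segments taken from the examples above


def time_travel_uri(service: str, when: datetime, original_uri: str) -> str:
    """Build a Find ("list") or Reconstruct URI in the Wayback-style convention."""
    if service not in SERVICES:
        raise ValueError(f"unknown service: {service!r}")
    stamp = when.strftime("%Y%m%d%H%M%S")  # e.g. 20081128230827
    return f"{BASE}/{service}/{stamp}/{original_uri}"
```

For instance, `time_travel_uri("list", datetime(2008, 11, 28, 23, 8, 27), "http://apple.com")` reproduces the Find URI given above.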

While the Time Travel portal has been received enthusiastically, usage remains modest. Since its launch, we have seen about 4,000 unique visitors and 7,000 visits per month. We have capacity for much more and would appreciate some promotion of our service by IIPC members. Also, we are very open to suggestions about additional portal functionality. For example, we have reached out to IIPC members that operate dark archives because we are interested in including their holding information in Time Travel responses, in order to increase response completeness and to make the existence of these archives more visible. As a first step in that direction, we have proposed Memento-based access to dark archive holdings information as a new functionality for Open Wayback.

By Herbert Van de Sompel, Digital Library Researcher at Los Alamos National Laboratory

Results of the Web Archiving API Survey of IIPC Members

If you attended the recent GA, or read some of the many blog posts about it, you probably heard about the potential benefits of standardized web archiving APIs. This was a common theme that came up in multiple presentations and informal discussions. During a conversation over lunch mid-week, one person suggested that the IIPC form a new working group to focus on web archiving APIs. Clearly some institutions were interested in this, but how many? And were they interested enough to participate in a new working group? A group of us at Harvard decided to find out. We developed a short survey and advertised it on the IIPC mailing list.

The survey was open from May 14 through June 1 and was filled out 18 times, by 17 different institutions from 8 different countries.

Country | Institutions
Czech Republic | National Library of the Czech Republic
Denmark | Netarkivet.dk
France | Bibliothèque nationale de France
Iceland | National and University Library of Iceland
New Zealand | National Library of New Zealand
Spain | National Library of Spain
United Kingdom | The British Library, The National Archives
United States | Stanford University Libraries, Old Dominion University, Internet Archive, LANL, California Digital Library, Harvard Library, Library of Congress, UCLA, University of North Texas

Table 1: The institutions that responded to the survey

The survey asked “Is the topic of web archiving APIs of interest to your institution?” The answer was overwhelmingly “Yes”: all 17 institutions are interested in web archiving APIs. Personally, this is the first unanimous survey response I have ever seen.

Figure 1: A rare unanimous response

When asked “Why are web archiving APIs of interest to your institution?” the responses (see Figure 2) were varied but had common themes. Many of the reasons were from the perspective of an institution providing or maintaining web archiving programs or infrastructure, for example:

  • “The sustainability of our program depends on the web archiving community as a whole better aligning itself to collaboratively maintain and augment a core set of interoperable systems…”
  • “…appreciate how this would reduce our technical spend in the long term…”
  • “… APIs should ease the maintenance and evolution of the complex set of tools we are using to complete the document cycle: selection, collect, access and preservation.”

Another common response was from the perspective of providing a better service for researchers, for example:

  • “…a common/standard API would make it easier for researchers to work with multiple web archives with standard methodologies.”
  • “To help researchers explore our collection, including within our catalogue system, to link with other web archive collections and potentially to interface with different components of our infrastructure.”
  • “We often do aggregation and want to have a way to archive resources of interest with the help of scripts, in both of these cases an API would be ideal.”

Figure 2: A word cloud generated from the free text responses to why the institution is interested in web archiving APIs [Used Word Cloud Generator by Jason Davies]

The respondents were asked “If we organized a new working group within the IIPC to work on web archiving APIs would your institution be willing to participate?” All but one institution said “Yes”. The institution that said “No” said that they were interested but did not have enough staff resources currently to actively participate.

Figure 3: Most of the institutions are willing to participate in the new working group.

We asked “In what specific ways could your institution participate? Please select all that apply.” The results are shown in Table 2. Most of the respondents would like to help define the functional requirements, but a good number would also like to contribute use cases and help design the technical details. Importantly, there are institutions willing to help run the meetings.

Specific Way | % of Respondents | Count of Respondents
Help define the functional requirements for a web archiving API | 94% | 15
Contribute curatorial, researcher or management requirements and use cases | 81% | 13
Help design the technical details of a web archiving API | 69% | 11
Help schedule and run the working group meetings | 19% | 3
Other* | 6% | 1

Table 2: The specific ways institutions would participate in the working group
* One institution said that they would be willing to implement and test web archiving APIs where appropriate and aligned with local needs
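
The percentages in Table 2 appear to be taken over the 16 institutions that were willing to participate, rather than all 17 respondents. That denominator is an inference from the counts, not something the survey report states. A small Python sketch of the arithmetic, with shortened row labels chosen for this example:

```python
# Counts are copied from Table 2. The denominator of 16 is an assumption:
# the 17 responding institutions minus the one that declined to participate.
respondents = 16
counts = {
    "Define functional requirements": 15,
    "Contribute requirements and use cases": 13,
    "Design technical details": 11,
    "Schedule and run meetings": 3,
    "Other": 1,
}

# Percentage of respondents per row, rounded to the nearest whole percent.
percentages = {label: round(100 * n / respondents) for label, n in counts.items()}

for label, pct in percentages.items():
    print(f"{label}: {pct}% ({counts[label]} of {respondents})")
```

With a denominator of 16, the rounded values match the percentages listed in the table.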

So the answer to our original question is a clear YES! There are enough IIPC institutions that are interested and willing to participate in meaningful ways in this new working group. Stay tuned while we work through the logistics of how to start. One of the first steps will be to identify co-chairs for the group. If you are interested in this, please let me know! And thanks to everyone for taking the time to fill out this survey.

By Andrea Goethals, Manager of Digital Preservation and Repository Services, Harvard Library