What do the New York Times, Organizational Change, and Web Archiving all have in common?

The short answer is Matthew Weber. Matthew is an Assistant Professor at Rutgers in the School of Communication and Information. His research focuses on organizational change; in particular, he has been looking at how traditional organizations, such as companies in the newspaper business, have responded to major technological disruptions like the Internet and mobile applications.

To study this type of phenomenon, you need web archives. Using web archives as a source for research, however, can be challenging. This is where high performance computing (HPC) and big data come into the picture.

Rutgers HPC resources: https://oirt.rutgers.edu/research-computing/hpc-resources/

Luckily for Matthew, Rutgers has HPC, and lots of it. He is working with powerful computer clusters built on complex Java and Hadoop code to crack into Internet Archive (IA) data. Matthew first started working with the IA in 2008 through a summer research institute at Oxford University. More recently, working with colleagues at the Internet Archive and Northeastern University, he received funding from the National Science Foundation to build tools that enable research access to Internet Archive data.

When Matthew says he works with big data, he means really big data, like 80 terabytes big. Matthew works in close partnership with PhD students in the computer science department who maintain the backend that allows him to run complex queries. He is also training PhD students in Communication and other social science disciplines to work with the Rutgers HPC system. In addition, Matthew has taught himself basic Pig, or more exactly Pig Latin, the language for writing queries over data stored in Hadoop.
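To give a concrete, if miniature, sense of what such queries look like, here is a sketch that counts outbound hyperlinks per archived page in a single WARC file. It uses Python and the open-source warcio library on one machine; the file name is hypothetical, and Matthew's actual pipeline runs Pig Latin queries over Hadoop at a vastly larger scale.

```python
# Minimal single-machine sketch of hyperlink extraction from a WARC file.
# The file name is hypothetical; the real workflow described above uses
# Pig Latin queries over Hadoop rather than a script like this.
import re
from collections import Counter

from warcio.archiveiterator import ArchiveIterator  # pip install warcio

link_counts = Counter()

with open('example-crawl.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != 'response':
            continue
        content_type = ((record.http_headers.get_header('Content-Type') or '')
                        if record.http_headers else '')
        if 'html' not in content_type:
            continue
        page_uri = record.rec_headers.get_header('WARC-Target-URI')
        html = record.content_stream().read()
        # Crude regex for illustration; a production pipeline would parse the HTML.
        outlinks = re.findall(rb'href=["\'](https?://[^"\']+)', html)
        link_counts[page_uri] = len(outlinks)

# The ten most heavily linking archived pages in this file.
for uri, count in link_counts.most_common(10):
    print(count, uri)
```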

Intimidated yet? Matthew says don’t be. A researcher can learn some basic tech skills and do quite a bit on his or her own. In fact, Matthew would argue that researchers must learn these skills because we are a long way off from point-and-click systems where you can find exactly the data you want. But there is help out there.

For example, IA’s Senior Data Engineer, Vinay Goel, provided the materials from a recent workshop that walk you through setting up and doing your own data analysis. Also, Professors Ian Milligan and Jimmy Lin from the University of Waterloo have pulled together some useful code and commentary that is relatively easy to follow. Finally, a good basic starting point is Code Academy.

Challenges Abound

Even though Matthew has access to HPC and is handy with basic Pig, there are still plenty of challenges.

Metadata

One major challenge is metadata; namely, there isn’t enough of it. In order to draw valid conclusions from data, researchers need a wealth of contextual information, such as the scope of the crawl, how often it was run, why those sites were chosen and not others, etc. They also need the metadata to be complete and consistent across all of the collections they’re analyzing.

As a researcher conducting quantitative analysis, Matthew has to make sure he’s accounting for any and all statistical errors that might creep into the data. In his recent research, for example, he was seeing consistent error patterns in hyperlinks within the network of media websites. He now has to account for this statistical error in his analysis.

To begin to tackle this problem, Matthew is working with researchers and web curators from a group of institutions, including Columbia University Libraries & Information Service’s Web Resources Collection Program, the California Digital Library, the International Internet Preservation Consortium (IIPC), and the University of Waterloo, to create a survey that asks researchers across a broad spectrum of disciplines which metadata elements are essential to them. Matthew intends to share the results of this survey broadly with the web archiving community.

The Holes

Related to the metadata issues is the need for better documentation of missing data.

Matthew would love to have complete archives (along with complete descriptions). He recognizes, however, that there are holes in the data, just as there are with print archives. The difference is that holes in a print archive are easier to identify and define; with web archive data, the holes have to be inferred.

The Issue of Size

Matthew explained that for a recent study of news media between 1996 and 2000, you start by transferring the data, and one year of data from the Internet Archive took three days to transfer. You then need another two days to process it and run the code. That’s a five-day investment just to get the data for a single year. And then you discover that you need another data point, so the process starts all over again.

To help address this issue, and to provide training datasets that help graduate students get started, the team at Rutgers is creating and sharing derivative datasets. They have taken large web archive datasets, extracted small subsets (e.g., U.S. Senate data from the last five sessions), processed them, and produced smaller datasets that others can easily download to do their own analysis. This is essentially a public repository of data for reuse!
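As a hedged illustration of that subsetting step, the sketch below filters a large WARC file down to the records whose URLs match a single prefix and writes them to a smaller derivative WARC. The file names, the prefix, and the use of Python's warcio library are illustrative assumptions; the Rutgers team's actual derivative datasets may well be produced with different tooling.

```python
# Sketch: derive a smaller dataset by keeping only records under one URL prefix.
# File names and the prefix are hypothetical examples.
from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter  # pip install warcio

PREFIX = 'https://www.senate.gov/'  # hypothetical subset of interest

with open('full-crawl.warc.gz', 'rb') as source, \
     open('senate-subset.warc.gz', 'wb') as target:
    writer = WARCWriter(target, gzip=True)
    for record in ArchiveIterator(source):
        uri = record.rec_headers.get_header('WARC-Target-URI') or ''
        if uri.startswith(PREFIX):
            writer.write_record(record)
```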

A Cool Space to Be In

As tools and collections develop, more and more researchers are starting to realize that web archives are fertile ground for research. Even though challenges remain, there’s clearly a shift toward more research based on web archives.

As Matthew put it, “Eight years ago when I started nobody cared… and now so many scholars are looking to ask critical questions about the way the web permeates our day-to-day lives… people are realizing that web archives are a key way to get at those questions. As a researcher, it’s a cool space to be in right now.”


By Rosalie Lack, Product Manager, California Digital Library

This blog post is the first in an upcoming series of interviews with researchers about how they use web archives, and about the challenges and opportunities they encounter.

So You Want to Get Started in Web Archiving?

The web archiving community is a great one, but it can sometimes be a bit confusing to enter. Unlike the Digital Humanities community, which has developed aggregation services like DH Now, the web archiving community is a bit more dispersed. But fear not, there are a few places to visit to get a quick sense of what’s going on.

Social Media

A substantial amount of web archiving scholarship happens online. I use Twitter (I’m at @ianmilligan1), for example, as a key way to share research findings and ideas as my project comes together. I usually tag my tweets with #webarchiving, which means they show up in that hashtag’s timeline alongside everyone else’s. For best results, using a Twitter client like Tweetdeck, Tweetbot, or Echofon can help you keep apprised of things. There may be Facebook groups, but I actually don’t use Facebook (!) so I can’t provide much guidance there. On LinkedIn there are a few relevant groups: IIPC, Web Archiving, and the Portuguese Web Archive.

Blogs

I’m wary of listing blogs, because I will almost certainly leave some out. Please accept my apologies in advance and add your name in the comments below! But a few are on my recurring must-visit list (in addition to this one, of course!):

  • Web Archiving Roundtable: Every week, they have a “Weekly web archiving roundup.” I don’t always have time to keep completely caught up, but I visit roughly weekly and once in a while make sure to download all the linked resources. Being included here is an honour.
  • The UK Web Archive Blog: This blog is a must-have on my RSS feed, and it keeps me posted on what the UK team is doing with their web archive. They do great things, from inspiring outreach, to tools development (e.g. Shine), to researcher reflections. A lively cast of guest bloggers and regulars.
  • Web Science and Digital Libraries Research Group: If you use web archiving research tools, chances are you’ve used some stuff from the WebSciDL group! This fantastic blog has a lively group of contributors, showcasing conference reports, research findings, and beyond. Another must visit.
  • Web Archives for Historians: This blog, written by Peter Webster and myself, aims to bring together scholarship on how historians can use web archives. We have guest posts as well as cross-posts from our own sites.
  • Peter Webster’s Blog: Peter also has his own blog, which covers a diverse range of topics including web archives.
  • Ian Milligan’s Blog: It feels weird including my own blog here, but what the heck. I provide lots of technical background to my own investigations into web archives.
  • The Internet Archive Blog: Almost doesn’t need any more information! It’s actually quite a diverse blog, but a go-to place to find out about cool new collections (the million album covers for example) or datasets that are available.
  • The Signal: Digital Preservation Blog: A diverse blog that occasionally covers web archiving (you can actually find the subcategory here). Well worth reading – and citing, for that matter!
  • Kris’s Blog: Kristinn Sigurðsson runs a great technical blog here, very thought provoking and important for both those who create web archives as well as those who use them.
  • DSHR’s Blog: David Rosenthal’s blog on digital preservation has quite a bit about web archiving, and is always provocative and mind expanding.
  • Andy Jackson’s blog – Web Archiving Technical Lead at the British Library
  • BUDDAH project – Big UK Domain Data for the Arts and Humanities Research Project
  • Dépôt légal web BnF – the BnF’s web legal deposit blog
  • Stanford University Digital Library blog
  • Internet Memory Foundation blog
  • Toke Eskildsen’s blog – IT developer at the National Library of Denmark

Again, I am sure that I have missed some blogs so please accept my sincerest apologies.

In-Person Events

The best place to learn is at in-person events, of course, which are often announced on this blog or through many of the channels above! I hope that the IIPC blog can become a hub for these sorts of announcements.

Conclusions

I hope this is helpful for people who are starting out in this wonderful field. I’ve just provided a small slice: I hope that in the comments below people can give other suggestions which can help us all out!

By Ian Milligan (University of Waterloo)

LANL’s Time Travel Portal, Part 2

Architecturally, the Time Travel portal operates in a manner similar to a distributed search. Hence, it faces challenges related to query routing, response time optimization, and response freshness. The new infrastructure includes some rule-based mechanisms for intelligent routing, but a thorough solution is being investigated in the IIPC-funded Web Archive Profiling project. A background cache continuously fetches TimeMap information from distributed archives that are compliant with the Memento protocol, either natively or by proxy. Its collection consists of a seed list of popular URIs augmented with URIs requested by Memento clients. Whenever possible, responses are delivered from a front-end cache that remains in sync with the background cache using the ResourceSync protocol. If a request cannot be delivered from cache, because cached content is unavailable or stale, realtime TimeGate requests are sent to Memento-compliant archives only. This setup achieves a satisfactory balance between response times, response completeness, and response freshness. If needed, the front-end cache can be bypassed and a realtime query can explicitly be initiated using the regular browser refresh approach, e.g. Shift-Reload in Chrome.
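For readers who want to see what one of those TimeGate requests looks like on the wire, here is a minimal sketch of Memento datetime negotiation in Python. The aggregator TimeGate URL pattern is an assumption based on the public Time Travel service; any Memento-compliant TimeGate can be substituted.

```python
# Sketch of Memento datetime negotiation (RFC 7089) against a TimeGate.
# The TimeGate URL pattern below is an assumption; substitute your archive's.
import requests

target = 'http://example.com/'
timegate = 'http://timetravel.mementoweb.org/timegate/' + target

resp = requests.get(
    timegate,
    headers={'Accept-Datetime': 'Tue, 11 Sep 2001 08:30:00 GMT'},
    allow_redirects=False,  # inspect the negotiation rather than following it
    timeout=30,
)

print(resp.status_code)              # a compliant TimeGate typically returns 302
print(resp.headers.get('Location'))  # URI of the temporally closest Memento
print(resp.headers.get('Vary'))      # should include 'accept-datetime'
print(resp.headers.get('Link'))      # rel="original", rel="timemap", ...
```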

The Time Travel logo that can be used to advertise the portal.

The development of the Time Travel portal was also strongly motivated by the desire to lower the barrier for developing Memento-related functionality, especially on the browser side. Memento protocol information is – appropriately – communicated in HTTP headers. However, browser-side scripts typically do not have header access. Hence, we wanted to bring Memento capabilities within the realm of browser-side development. To that end, we introduced several RESTful APIs.

We are thrilled by the continuous growth in the usage of these APIs and would be interested to learn what kinds of applications people out there are building on top of our infrastructure. We know that the new version of the Mink browser extension uses the new APIs, and that the Time Travel portal’s own Reconstruct service, based on pywb, leverages them as well. Memento for Chrome now obtains its list of archives from the Archive Registry. The Robust Links approach to combating reference rot is also based on API calls, but that will be the subject of another blog post.
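As one hedged example of what building on these APIs can look like, the sketch below asks the portal's JSON API for the Memento of a page closest to a given datetime. The endpoint pattern and response field names are assumptions drawn from the public Time Travel API documentation, not a definitive client.

```python
# Sketch: query the Time Travel JSON API for the Memento closest to a datetime.
# Endpoint pattern and response field names are assumed, not guaranteed.
import requests

api = 'http://timetravel.mementoweb.org/api/json/20081128230827/http://apple.com'

data = requests.get(api, timeout=30).json()
closest = data.get('mementos', {}).get('closest', {})  # field names assumed
print(closest.get('datetime'))  # datetime of the closest Memento
print(closest.get('uri'))       # its URI (may be a list of URIs)
```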

IIPC members that operate public web archives that are not yet Memento compliant are reminded that Open Wayback and pywb natively support Memento. From the perspective of the Time Travel portal, compliance means that we don’t have to operate a Memento proxy, that the archive’s holdings can be included in realtime queries, and that both Original URIs and Memento URIs can be used with Find and Reconstruct. From a broader perspective, it means that the archive becomes a building block in a global, interoperable infrastructure that provides a time dimension to the web.

By Herbert Van de Sompel, Digital Library Researcher at Los Alamos National Laboratory

LANL’s Time Travel Portal, Part 1

In early February 2015, we launched the Time Travel portal, which provides cross-system discovery of Mementos.

The design and development of the Time Travel portal was a significant investment and took about a year from conception to release. It involved work directly related to the portal itself, but also a fundamental redesign of the Memento Aggregator, the introduction of several RESTful APIs, the transfer of the Memento infrastructure from LANL’s network to the Amazon cloud, and operating the new environment as an official service of the LANL Research Library.

The team that designed and implemented the Time Travel portal, from left to right: Luydmila Balakireva, Harihar Shankar, Martin Klein, Ilya Kremer, James Powell, and Herbert Van de Sompel

A major motivation for the development of the new portal was to lower the barrier for experiencing Memento’s web time travel. Our flagship Memento for Chrome extension remains the optimal way to experience cross-system time travel. But, we wanted some of the power of Memento to be accessible without the need for an extension.

The Time Travel portal has a basic interface that allows entering a URI and a datetime. It offers a Find and a Reconstruct service:

  • The Find service looks for Mementos in systems covered by the Memento Aggregator. For each archive that holds Mementos for the requested URI, the Memento that is temporally closest to the submitted datetime is listed, with a clear indication of the archive’s name. Results are ordered by temporal proximity to the requested datetime. For each archive, the first/last/previous/next Mementos are also shown when that information is available. For all listed Mementos, a link leads straight into the holding archive. A Find URI can also be constructed; its syntax follows the convention introduced by the Wayback software, e.g. http://timetravel.mementoweb.org/list/20081128230827/http://apple.com.
  • The Reconstruct service reassembles a page using the best Mementos from various Memento-compliant archives, where “best” means temporally closest to the requested datetime. Hence, in a Reconstruct result page, the archived HTML, images, style sheets, JavaScript, etc. can originate from different archives. The assembled pages often look more complete, and the temporal spread of their components is often smaller, than the corresponding pages in any single archive. As such, the Reconstruct service provides a nice illustration of the cross-archive interoperability introduced by the Memento protocol. A Reconstruct URI is available using the same Wayback URI convention, e.g. http://timetravel.mementoweb.org/reconstruct/20081128230827/http://apple.com (a short sketch of constructing these URIs follows this list).
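Here is that sketch: a minimal snippet that builds Find and Reconstruct URIs from a target URL and a 14-digit timestamp, following the convention shown above. The helper function names are illustrative.

```python
# Build Find and Reconstruct URIs following the Wayback-style convention
# described above (14-digit timestamp followed by the target URI).
from datetime import datetime

PORTAL = 'http://timetravel.mementoweb.org'

def find_uri(target: str, when: datetime) -> str:
    return f"{PORTAL}/list/{when:%Y%m%d%H%M%S}/{target}"

def reconstruct_uri(target: str, when: datetime) -> str:
    return f"{PORTAL}/reconstruct/{when:%Y%m%d%H%M%S}/{target}"

print(find_uri('http://apple.com', datetime(2008, 11, 28, 23, 8, 27)))
# -> http://timetravel.mementoweb.org/list/20081128230827/http://apple.com
```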

While the Time Travel portal has been received enthusiastically, usage remains modest. Since its launch, we have seen about 4,000 unique visitors and 7,000 visits per month. We have capacity for much more and would appreciate some promotion of our service by IIPC members. We are also very open to suggestions about additional portal functionality. For example, we have reached out to IIPC members that operate dark archives because we are interested in including their holdings information in Time Travel responses, in order to increase response completeness and to make the existence of these archives more visible. As a first step in that direction, we have proposed Memento-based access to dark archive holdings information as a new functionality for Open Wayback.

By Herbert Van de Sompel, Digital Library Researcher at Los Alamos National Laboratory