The short answer is Matthew Weber. Matthew is an Assistant Professor at Rutgers in the School of Communication and Information. His research focus is on organizational change; in particular he’s been looking at how traditional organizations such as companies in the newspaper business have responded to major technological disruption such as the Internet, or mobile phone applications.
To study this type of phenomenon, you need web archives. Unfortunately, using web archives as a source for research can be challenging. This is where high performance computing (HPC) and big data come into the picture.
Luckily for Matthew, at Rutgers they have HPC and lots of it. He’s working with powerful computer clusters built on complex java script and Hadoop code to crack into Internet Archive (IA) data. Matthew first started working with the IA in 2008 through a summer research institute at Oxford University. More recently, Matthew, working with colleagues at the Internet Archive and Northeastern University, received funding from the National Science Foundation to build tools that enable research access to Internet Archive data.
When Matthew says he works with big data, he means really big data, like 80 terabytes big. Matthew works in close partnership with PhD students in the computer science department who maintain the backend that allows him to run complex queries. He is also training PhD students in Communication and other social science disciplines to work with Rutgers’ HPC system. In addition, Matthew has taught himself basic Pig, or to be more exact Pig Latin, a programming language for running queries on data stored in Hadoop.
Intimidated yet? Matthew says don’t be. A researcher can learn some basic tech skills and do quite a bit on his or her own. In fact, Matthew would argue that researchers must learn these skills because we are a long way off from point-and-click systems where you can find exactly the data you want. But there is help out there.
For example, IA’s Senior Data Engineer, Vinay Goel, provided the materials from a recent workshop to walk you through setting up and doing your own data analysis. Also, Professors Ian Milligan and Jimmy Lin from the University of Waterloo have pulled together some useful code and commentary that is relatively easy to follow. Finally, a good basic starting point is Codecademy.
Even though Matthew has access to HPC and is handy with basic Pig, there are still plenty of challenges.
One major challenge is metadata; mainly, there isn’t enough of it. In order to draw valid conclusions from data, researchers need a wealth of contextual data, such as the scope of the crawl, how often it was run, why those sites were chosen and not others, etc. They also need the metadata to be complete and consistent across all of the collections they’re analyzing.
As a researcher conducting quantitative analysis, Matthew has to make sure he’s accounting for any and all statistical errors that might creep into the data. In his recent research, for example, he was seeing consistent error patterns in hyperlinks within the network of media websites. He now has to account for this statistical error in his analysis.
To begin to tackle this problem, Matthew is working with researchers and web curators from a group of institutions, including Columbia University Libraries & Information Service’s Web Resources Collection Program, California Digital Library, the International Internet Preservation Consortium (IIPC), and the University of Waterloo, to create a survey that asks researchers across a broad spectrum of disciplines which metadata elements are essential to their work. Matthew intends to share the results of this survey broadly with the web archiving community.
Related to the metadata issues is the need for better documentation for missing data.
Matthew would love to have complete archives (along with complete descriptions). He recognizes, however, that there are holes in the data, just as there are with print archives. The difference is that holes in a print archive are easier to identify and define than holes in web archive data, which have to be inferred.
The Issue of Size
Matthew explained that for a recent study of news media between 1996 and 2000, the first step was transferring the data, and one year of data from the Internet Archive took three days to transfer. Processing the data and running the code took another two days. That’s a five-day investment just to get data for a single year. And then you discover that you need another data point, so it starts all over again.
To help address this issue, and to provide training datasets that help graduate students get started, the team at Rutgers is creating and sharing derivative datasets. They have taken large web archive datasets, extracted small subsets (e.g., U.S. Senate data from the last five sessions), processed them, and produced smaller datasets that others can easily export to do their own analysis. This is essentially a public repository of data for reuse!
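For readers curious what "extracting a small subset" can look like in practice, here is a minimal Python sketch. It is not the Rutgers pipeline (which, as described above, runs on Hadoop); it only illustrates the general idea of filtering a large table of hyperlink records down to one domain. The column names (crawl_date, source, target), the sample rows, and the helper functions are all hypothetical and made up for illustration.

```python
import csv
import io
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercase hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def in_domain(url, domain):
    """True if the URL's host is `domain` itself or one of its subdomains."""
    host = host_of(url)
    return host == domain or host.endswith("." + domain)


def extract_subset(rows, domain):
    """Yield only the link records whose source page is on `domain`."""
    for row in rows:
        if in_domain(row["source"], domain):
            yield row


# Hypothetical sample data standing in for a much larger link dataset.
SAMPLE = """crawl_date,source,target
2000-01-15,http://www.senate.gov/index.html,http://www.house.gov/
2000-01-15,http://example.com/news,http://www.senate.gov/
2000-02-01,http://kennedy.senate.gov/bio,http://example.org/
"""

reader = csv.DictReader(io.StringIO(SAMPLE))
subset = list(extract_subset(reader, "senate.gov"))

for row in subset:
    print(row["crawl_date"], row["source"], "->", row["target"])
```

The filter matches on the hostname rather than on the raw URL string, so a site such as notsenate.gov would not be swept into the subset by accident. On real data the same filter would be applied to rows streamed from a file instead of an in-memory sample.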
A Cool Space to Be In
As tools and collections develop, more and more researchers are starting to realize that web archives are fertile ground for research. Even though challenges remain, there’s clearly a shift toward more research based on web archives.
As Matthew put it, “Eight years ago when I started nobody cared… and now so many scholars are looking to ask critical questions about the way the web permeates our day-to-day lives… people are realizing that web archives are a key way to get at those questions. As a researcher, it’s a cool space to be in right now.”
By Rosalie Lack, Product Manager, California Digital Library
This blog post is the first in an upcoming series of interviews with researchers to learn about their research using web archives, and the challenges and opportunities.