New Report on Web Archiving Available

By Andrea Goethals, Harvard Library

This is an expanded version of a post to the Library of Congress’ Signal blog.

Harvard Library recently released a report that is the result of a five-month environmental scan of the landscape of web archiving, made possible by the generous support of the Arcadia Fund. The purpose of the study was to explore and document current web archiving programs to identify common practices, needs, and expectations in the collection and provision of web archives to users; the provision and maintenance of web archiving infrastructure and services; and the use of web archives by researchers. The findings will inform Harvard Library’s strategy for scaling up its web archiving activities, and are also being shared broadly to help inform research and development priorities in the global web archiving community.

The heart of the study was a series of interviews with web archiving practitioners from archives, museums and libraries worldwide; web archiving service providers; and researchers who use web archives. The interviewees were selected from the membership of several organizations, including the IIPC of course, but also the Web Archiving Roundtable at the Society of American Archivists (SAA), the Internet Archive’s Archive-It Partner Community, the Ivy Plus institutions, the Working with Internet archives for REsearch group (Rutgers/WIRE Group), and the Research infrastructure for the Study of Archived Web materials (RESAW).

The interviews of web archiving practitioners covered a wide range of areas, from how institutions maintain their web archiving infrastructure (e.g. outsourcing, staffing, location within the organization) to how they are (or aren’t) integrating their web archives with their other collections. From this data, profiles were created for 23 institutions, and the data was aggregated and analyzed to look for common themes, challenges and opportunities.

Opportunities for Research & Development

In the end, the environmental scan revealed 22 opportunities for future research and development. These opportunities are listed below and described in more detail in the report. At a high level, these opportunities fall under four themes: (1) increase communication and collaboration, (2) focus on “smart” technical development, (3) focus on training and skills development, and (4) build local capacity.

22 Opportunities to Address Common Challenges

(the order has no significance)

  1. Dedicate full-time staff to work in web archiving so that institutions can stay abreast of the latest developments and best practices and fully engage in the web archiving community.
  2. Conduct outreach, training and professional development for existing staff, particularly those working with more traditional collections, such as print, who are being asked to collect web archives.
  3. Increase communication and collaboration across types of collectors since they might collect in different areas or for different reasons.
  4. Fund a collaboration program (a bursary award, for example) to support researcher use of web archives and to gather feedback on requirements and impediments to their use.
  5. Leverage the membership overlap between RESAW and European IIPC membership to facilitate formal researcher/librarian/archivist collaboration projects.
  6. Make institutional web archiving programs transparent about their holdings, indicating what material each has, terms of use, preservation commitments, and the curatorial decisions made for each capture.
  7. Develop a collection development tool (e.g. registry or directory) to expose holdings information to researchers and other collecting institutions even if the content is viewable only in on-site reading rooms.
  8. Conduct outreach and education to website developers to provide guidance on creating sites that can be more easily archived and described by web archiving practitioners.
  9. Have the IIPC, or a similar large international organization, work to educate and influence companies that host content (e.g. Google/YouTube) on the importance of supporting libraries and archives in their efforts to archive that content (even if the content cannot be made immediately available to researchers).
  10. Investigate Memento further (for example, by conducting user studies) to see if more web archiving institutions should adopt it as part of their discovery infrastructure.
  11. Fund a collection development nomination tool that enables rapid collection development decisions, possibly building on one or more of the current tools targeted for open source deployment.
  12. Gather requirements across institutions and among web researchers for next generation of tools that need to be developed.
  13. Develop specifications for a web archiving API that would allow web archiving tools and services to be used interchangeably (a rough sketch follows this list).
  14. Train researchers in the skills they need to analyze big data found in web archives.
  15. Provide tools to make researcher analysis of big data found in web archives easier, leveraging existing tools where possible.
  16. Establish a standard for describing the curatorial decisions behind collecting web archives so that there is consistent (and machine-actionable) information for researchers.
  17. Establish a feedback loop between researchers and librarians/archivists.
  18. Explore how institutions can augment the Archive-It service and provide local support to researchers, possibly using a collaborative model.
  19. Increase interaction with users, and develop deep collaborations with computer scientists.
  20. Explore what, and how, a service might support running computing and software tools and infrastructure for institutions that lack their own onsite infrastructure to do so.
  21. Have service providers develop more offerings around the available tools to lower the barrier to entry and make them accessible to those lacking programming skills and/or IT support.
  22. Work with service providers to help reduce any risks of reliance on them (e.g. support for APIs so that service providers could more easily be changed and content exported if needed).
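
Opportunity 13 deserves a concrete illustration. Purely as a hypothetical sketch (nothing here is a published specification, and every name below is invented), a common web archiving API might standardize a small set of operations that any tool or service provider could implement, so that institutions could swap providers and export their content (opportunity 22) without reworking their workflows:

    # Hypothetical sketch of a common web archiving API (Python).
    # Nothing here is a published standard; all names are invented.
    from abc import ABC, abstractmethod
    from typing import Dict, Iterator

    class WebArchiveService(ABC):
        """Interface any service provider could implement interchangeably."""

        @abstractmethod
        def submit_seed(self, url: str, scope: str) -> str:
            """Queue a URL for capture under a crawl scope; return a job id."""

        @abstractmethod
        def list_captures(self, url: str) -> Iterator[Dict]:
            """Enumerate captures of a URL (timestamp, checksum, WARC location)."""

        @abstractmethod
        def export_warcs(self, job_id: str, destination: str) -> None:
            """Export captured content as standard WARC files for migration."""

Because every provider would expose the same calls, curatorial tooling written against such an interface would keep working after a change of service provider.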

Communication & Collaboration are Key!

One of the biggest takeaways is that the first theme, the need to radically increase communication and collaboration among all individuals and organizations involved in web archiving, was the most prevalent theme found by the scan. Thirteen of the 22 opportunities fell under this theme. Clearly much more communication and collaboration is needed among those collecting web content, but also between those who collect it and the researchers who would like to use it.

This environmental scan has given us a great deal of insight into how other institutions are approaching web archiving, which will inform our own web archiving strategy at Harvard Library in the coming years. We hope that it has also highlighted key areas for research and development that need to be addressed if we are to build efficient and sustainable web archiving programs that result in complementary and rich collections that are truly useful to researchers.

A Note about the Tools

There is a section in the report (Appendix C) that lists all the current web archiving tools identified during the environmental scan. The IIPC Tools and Software web page was one of the resources used to construct this list, along with what was learned through interviews, conferences and independent research. The tools are organized according to the activities needed throughout the lifecycle of acquiring, processing, preserving and providing web archive collections. Many of the tools discovered are fairly new, especially those associated with the analysis of web archives. The state of the tools will continue to change rapidly, so this list will quickly become out of date unless a group like the IIPC decides to maintain it. I will be at the General Assembly in April if any IIPC member would like to talk about maintaining this list or other parts of the report.

Politics, Archaeology, and Swiss Cheese: Meghan Dougherty Shares Her Experiences with Web Archiving

Meghan Dougherty, Assistant Professor in Digital Communication at Loyola University Chicago, started our interview by warning me that she is the “odd man out” when it comes to web archiving and uses web archives differently than most. I was immediately intrigued!

Meghan’s research agenda centers on the nature of inquiry and the nature of evidence. All her research is conducted within the framework of questioning methodology; that is, she asks questions about how archives are built, how that process influences what is collected, and how that process then influences scientific inquiry.

Roots in Politics

Before closely examining methodology, Meghan spent “hands-on” time starting in the early 2000s working with a research group called webarchivist.org, co-founded by Kirsten Foot (University of Washington) and Steve Schneider (SUNYIT). The interdisciplinary nature of the work at this organization was evident in the project members, who included two political scientists and two communications scholars, focusing on both qualitative and quantitative analysis. Their big research question was “what is the impact of the Internet on politics?”

Meghan and the rest of the research group recognized that if you are going to look at how the Internet affects politics, then you need to analyze how it changes over time. To do that you need to slow it down, to essentially take snapshots so you can do an analytical comparison.

To achieve this goal, the team worked collaboratively with the Internet Archive and the Library of Congress to build an election web archive, specifically around U.S. House, Senate, presidential, and gubernatorial elections, with a focus on candidate websites.

The Tao of a Website

As they were doing rigorous quantitative content analysis of election websites, the team was also asked to take extensive field notes to document everything they noticed. This, in turn, is how Meghan became curious about studying methodology. Looking at these sites in such detail prompted many questions:

“What exactly has the crawler captured in an archive? What am I looking at? If a website is fluid and moving and constantly updated, then what is this thing we’ve captured? What is the nature of ‘being’ for objects on the web? If I capture a snapshot, am I really capturing anything, or is it just a resemblance of the thing that existed?”

Meghan admits she doesn’t have all of the answers, but she challenges her fellow scholars to ask these difficult questions and not to tie up their research neatly with a bow by simplifying the analysis. She cautions that before you can gain knowledge about social and behavioral change over time in the digital world, you need to have a sensibility about what it actually means. Without answering that question, the research methods are just practice, not actual knowledge-building systems.

The Big Secret

Meghan appreciates when archivists and librarians ask her how they can help to support her in her work. What she really needs, she says, is a long-term collaborator, because frankly she doesn’t know what she wants.

“What if I told you that we don’t know what we want to analyze. We really need to think about these things together. The big secret is that we don’t know what we want because we don’t know what we’re dealing with. We are still working through it and we need you [curators and librarians] to help us to think about what an archive is, what we can collect, and how it gets collected. So we can build knowledge together about this collection of evidence.”

In hearing Meghan discuss two small-scale research projects, it was evident that even within her own research portfolio she has very different requirements for web archives.

#AskACurator

Ask a Curator is a once-a-year event in which cultural heritage institutions across the world open up for anyone to engage with their curators via Twitter.

By analyzing tweets with the #AskACurator hashtag, Meghan is studying how groups of people come together and interact with institutions and how institutions reach out with digital media to connect with their public.

In this example, Meghan stresses that completeness and precision of data are critical. If the archive of tweets for this hashtag is incomplete, then big chunks of really interesting mini-conversations will be missing from Meghan’s data. In addition, missing data will skew her categorizations and must be accounted for.
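
To make the completeness point concrete, consider one hypothetical way to surface holes in an archived hashtag conversation: look for replies whose parent tweet was never captured. A minimal sketch in Python (the field names are assumptions for illustration, not any particular API’s schema):

    def find_missing_parents(tweets):
        """Return ids of tweets that are replied to but absent from the archive."""
        captured = {t["id"] for t in tweets}
        return {t["in_reply_to_id"] for t in tweets
                if t.get("in_reply_to_id") and t["in_reply_to_id"] not in captured}

    # Tweet 2 replies to tweet 1, which was never archived.
    archive = [{"id": "2", "in_reply_to_id": "1"}, {"id": "3", "in_reply_to_id": "2"}]
    print(find_missing_parents(archive))  # {'1'} -- a hole in the conversation

Each missing parent is exactly the kind of gap that would skew the categorizations if left unaccounted for.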

Taqwacore

By Eye Steel Film from Canada (Muslim Punks) [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons
Another project is more like an ethnographic study of an online community (Taqwacore) of young people gathered around their faith in Islam, interest in punk music, and political activism. Meghan is studying a wide variety of print and online materials, including a small press novel (which launched this subculture), materials distributed online, handed out at concerts, and passed along in person, and the online community pages joined by kids living all over the world.

In this study, the precision and completeness of the evidence don’t matter as much, because Meghan’s goal is to get a general gist of the subculture. She is conducting an ethnographic study, but in the past: instead of camping out in the scene in the moment, she is looking back in time at the conversations people had and trying to understand who they were.

Digging Web Archives

In her research, Meghan has come to use the term web archaeology because, regardless of her area of work, her research has felt like an archaeological dig in which she examines digital traces of past human behavior to understand her subject. Archaeology, not unlike web archiving, can be both destructive and constructive, and similarly archaeologists use very specific, specialized tools to find and uncover delicate remains of something that has been covered or even mostly lost over time.

At this year’s IIPC General Assembly (http://netpreserve.org/general-assembly/2015/overview), Meghan introduced her web archaeology idea, which is also the topic of her forthcoming book (“Virtual Digs: excavating, archiving, preserving and curating the web” from University of Toronto Press), through a tongue-in-cheek video from The Onion about uncovering the ruins of a Friendster civilization.

While the video is intended as satire, it raises a real question that we need to address: a hundred years from now, people are going to look back at our communication media, such as Facebook, but what will future scholars be able to dig up?

All about the Holes

In a presentation at IIPC 2011, Barbara Signori, Head of the Department e-Helvetica at the Swiss National Library, shared a wonderful analogy: the holes in our archives are like the holes in Swiss cheese – inevitable. When I asked Meghan to share something that surprised her about her research, she told a story about the holes.

"Emmentaler". Licensed under CC BY-SA 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Emmentaler.jpg#/media/File:Emmentaler.jpg
“Emmentaler”. Licensed under CC BY-SA 3.0 via Wikimedia Commons – https://commons.wikimedia.org/ wiki/File:Emmentaler.jpg#/media/File:Emmentaler.jpg

When working with the Library of Congress back in the early 2000s, Meghan’s research group provided a list of political candidates to Library of Congress staff for crawling. The staff created an index of the sites crawled, but they did not create an entry in cases where no website existed.

Meghan and her fellow researchers were surprised because it seemed obvious to them that you would document the candidates who had websites, as well as those who didn’t. Knowing that a candidate DID NOT have a website in the early 2000s was a big deal, and would have a huge impact on findings! Absence shows us something very interesting about the environment.

Meghan would go so far as to say that a quirk about web archives is that librarians and curators are so focused on the cheese, while researchers find the holes of equal interest.

This blog post is the third in a series of interviews with researchers to learn about their use of web archives.

By Rosalie Lack, Product Manager, California Digital Library

Web Archives: Preserving the Everyday Record

In talking with Ian Milligan, Assistant Professor of Digital and Canadian History at the University of Waterloo, you are immediately impressed by his excitement for web archives and how web archiving is fundamentally changing research.

Ian uses web archives for his historical research to demonstrate their relevance and importance. While he clearly sees the value of web archives, he also recognizes the need to improve access in order to increase usage. To that end, he recently launched Webarchives.ca, an archive dedicated to Canadian politics. Ian is also providing pedagogical support for students using digital materials, including web archives.

I interviewed Ian recently to get his thoughts about these and other web archiving topics.

Remembering Geocities: A Community on the Web

Among Ian’s research projects is the study of Geocities. Remember Geocities? It was a user-generated web-hosting community that flourished in the late 1990s and 2000s. Unlike other lost civilizations, we know the cause of Geocities’ demise – Yahoo shut it down in 2009. If it were not for the Internet Archive and Jason Scott’s Archive Team, Geocities would be lost forever.

For those who might ask if it was worth saving, Ian would offer a resounding YES! For Ian, Geocities provides a rich historical source for gaining insight into a pivotal moment in time. It is one of the first examples of democratized web access, when average people could reach bigger audiences than ever before. At its height, Geocities featured more than 38 million pages.

Source: Internet Archive’s Wayback Machine, December 1, 2009 capture

Some of the research questions Ian is asking about the Geocities corpus include:

  • How was community enacted?
  • How was community lived in a place like Geocities?
  • Was there actually a sense of community on the web?

While these questions might sound like standard research questions, they are only now being recast over “untraditional” sources, such as Geocities.

Archiving Politics

In an effort to improve access to web archives, Ian worked on a project to launch Webarchives.ca, a research corpus containing Canadian political party and political interest group sites collected since 2005 by the University of Toronto using the Internet Archive’s Archive-It service. Ian teamed up with researchers from the University of Maryland, York University in Toronto, and Western University in London, Ontario to build this massive collection of more than 14 million “documents.” To help users navigate this large collection, the team implemented the UK Web Archive’s Shine front end.

Once I started looking at Webarchives.ca, I couldn’t stop myself from digging further into such a wealth of information. I particularly liked the feature for graphing terms over time, which allows you to see when terms go in and out of use by political parties.

In sharing his takeaways from working with these data, Ian observed that it is equally interesting to see when terms do not appear as when they do.

A Pivotal Shift for Scholarship

Ian shared some concrete examples of how the rise of web archives represents a pivotal shift for scholarship. Let’s take, for instance, particular segments of the population, such as young people, who have traditionally been left out of the historical record.

When Ian was researching the 1960s in order to understand the voice of young activists, he found the sources to be scarce. Conversations among activists tended to happen in coffeehouses, bars, and other places where records were not kept. So a historian can only hope that a young activist back then kept a diary and that it has survived, or else must find and interview the activists themselves.

Contrast this to today’s world. With the explosion of social media, young people are writing things down and leaving records that we never would have had in the past. Web archiving tools can capture this information, which is a very rich and exciting development for historians, but only if these important records of daily life have been archived.

Is More Better?

The increase in information can be a double-edged sword. As Ian says, “there used to be such a scarcity of historical sources, now we have more information than we know what to do with.”

Ian is concerned that digital and digitized materials will be privileged as sources and/or misinterpreted. In a study he conducted when materials were first being digitized, he learned that scholars cited digital materials more often than analog ones. Basically, content that was more easily available online was getting used more.

Ian is also worried that there is not a deep understanding of how to critically use digital resources. Many are unaware, for example, of the limitations of simple keyword searching. Add web archives to the mix and the scale of the problem increases.

So Ian wrote a pedagogical book.

The Historian’s Macroscope: Exploring Big Historical Data, written with Shawn Graham and Scott Weingart, will be out later this year. The book is a sort of toolbox for upper-division history undergraduates, teaching them how to think critically about digital resources and how to avoid common pitfalls. It also includes “how to” information for analyzing data, such as basic data visualization and network analysis.

Always pushing the envelope, Ian and his co-authors wrote the first draft of their book online.

No “Do Overs”

Ian closed our interview by sharing a provocative statement that he made at the recent IIPC General Assembly. “You cannot study the history of the 90s unless you use web archives. It is a significant part of the record of the 1990s and 2000s for everyday people. When historians write the history of 9/11 or Occupy Wall Street, they are going to have to use web archives.”

As exciting as it is for historians to have access to these rich new resources, Ian also shared his biggest concern, which is that we need to ensure that we are saving websites. “Every day we are losing considerable amounts of our digital heritage. Gathering is critical. There are no ‘do overs.’”


This blog post is the second in a series of interviews with researchers to learn about their use of web archives.

By Rosalie Lack, Product Manager, California Digital Library

What do the New York Times, Organizational Change, and Web Archiving all have in common?

The short answer is Matthew Weber. Matthew is an Assistant Professor at Rutgers in the School of Communication and Information. His research focus is organizational change; in particular, he has been looking at how traditional organizations, such as companies in the newspaper business, have responded to major technological disruptions such as the Internet or mobile phone applications.

In order to study this type of phenomenon, you need web archives. Unfortunately, however, using web archives as a source for research can be challenging. This is where high performance computing (HPC) and big data come into the picture.

Source: https://oirt.rutgers.edu/research-computing/hpc-resources/

Luckily for Matthew, Rutgers has HPC and lots of it. He’s working with powerful computer clusters built on complex Java and Hadoop code to crack into Internet Archive (IA) data. Matthew first started working with the IA in 2008 through a summer research institute at Oxford University. More recently, Matthew, working with colleagues at the Internet Archive and Northeastern University, received funding from the National Science Foundation to build tools that enable research access to Internet Archive data.

When Matthew says he works with big data, he means really big data, like 80 terabytes big. Matthew works in close partnership with PhD students in the computer science department who maintain the backend that allows him to run complex queries. He is also training PhD students in Communication and other social science disciplines to work with the Rutgers HPC system. In addition, Matthew has taught himself basic Pig – to be more exact, Pig Latin, a language for writing queries over data stored in Hadoop.
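
Matthew’s actual queries are written in Pig Latin, but the shape of the work is easy to illustrate. As a rough sketch only (using Python with the mrjob library rather than Pig, and assuming a CDX-style index where the third space-separated field is the original URL), a grouping query over web archive index data might look like this:

    # Illustrative only: count archived captures per domain from CDX index lines.
    # The CDX field layout is an assumption; the real queries described here use Pig Latin.
    from urllib.parse import urlparse
    from mrjob.job import MRJob

    class CapturesPerDomain(MRJob):

        def mapper(self, _, line):
            fields = line.split(' ')
            if len(fields) > 2:                      # skip malformed lines
                domain = urlparse(fields[2]).netloc  # original-URL field (assumed position)
                if domain:
                    yield domain, 1

        def reducer(self, domain, counts):
            yield domain, sum(counts)

    if __name__ == '__main__':
        CapturesPerDomain.run()

The same job can run on a laptop against a sample file or on a Hadoop cluster against terabytes, which is precisely why frameworks like Pig and mrjob appeal to researchers.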

Intimidated yet? Matthew says don’t be. A researcher can learn some basic tech skills and do quite a bit on his or her own. In fact, Matthew would argue that researchers must learn these skills because we are a long way off from point-and-click systems where you can find exactly the data you want. But there is help out there.

For example, IA’s Senior Data Engineer, Vinay Goel, has provided materials from a recent workshop that walk you through setting up and doing your own data analysis. Also, Professors Ian Milligan and Jimmy Lin of the University of Waterloo have pulled together some useful code and commentary that is relatively easy to follow. Finally, Codecademy is a good basic starting point.

Challenges Abound

Even though Matthew has access to HPC and is handy with basic Pig, there are still plenty of challenges.

Metadata

One major challenge is metadata; mainly, there isn’t enough of it. In order to draw valid conclusions from data, researchers need a wealth of contextual information, such as the scope of the crawl, how often it was run, why those sites were chosen and not others, etc. They also need the metadata to be complete and consistent across all of the collections they’re analyzing.

As a researcher conducting quantitative analysis, Matthew has to make sure he’s accounting for any and all statistical errors that might creep into the data. In his recent research, for example, he was seeing consistent error patterns in hyperlinks within the network of media websites. He now has to account for this statistical error in his analysis.

To begin to tackle this problem, Matthew is working with researchers and web curators from a group of institutions, including Columbia University Libraries & Information Services’ Web Resources Collection Program, the California Digital Library, the International Internet Preservation Consortium (IIPC), and the University of Waterloo, to create a survey to learn from researchers across a broad spectrum of disciplines which metadata elements are essential to them. Matthew intends to share the results of this survey broadly with the web archiving community.

The Holes

Related to the metadata issues is the need for better documentation of missing data.

Matthew would love to have complete archives (along with complete descriptions). He recognizes, however, that there are holes in the data, just as there are with print archives. The difference is that the holes in a print archive are easier to identify and define than those in web archive data, which must be inferred.

The Issue of Size

Matthew explained that for a recent study of news media from 1996 to 2000, you start by transferring the data – and one year of data from the Internet Archive took three days to transfer. You then need another two days to process it and run the code. That’s a five-day investment just to get data for a single year. And then you discover that you need another data point, so it starts all over again.

To help address this issue at Rutgers, and to provide training datasets that help graduate students get started, they are creating and sharing derivative datasets. They have taken large web archive datasets, extracted small subsets (e.g., U.S. Senate data from the last five sessions), processed them, and produced smaller datasets that others can easily export to do their own analysis. This is essentially a public repository of data for reuse!
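
As a rough sketch of how such a derivative dataset might be produced (using the open source warcio library; the file names and domain filter are hypothetical), one could copy only the records for a given domain out of a large crawl:

    # Hypothetical sketch: extract a domain-specific subset from a large WARC file.
    from urllib.parse import urlparse
    from warcio.archiveiterator import ArchiveIterator
    from warcio.warcwriter import WARCWriter

    def extract_subset(in_path, out_path, domain_suffix):
        """Copy records whose target URL falls under a domain into a new WARC."""
        with open(in_path, 'rb') as infile, open(out_path, 'wb') as outfile:
            writer = WARCWriter(outfile, gzip=True)
            for record in ArchiveIterator(infile):
                url = record.rec_headers.get_header('WARC-Target-URI') or ''
                if urlparse(url).netloc.endswith(domain_suffix):
                    writer.write_record(record)

    extract_subset('large-crawl.warc.gz', 'senate-subset.warc.gz', 'senate.gov')

The resulting file is small enough to share and analyze on a laptop, yet remains a standard WARC that any web archiving tool can read.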

A Cool Space to Be In

As tools and collections develop, more and more researchers are starting to realize that web archives are fertile ground for research. Even though challenges remain, there’s clearly a shift toward more research based on web archives.

As Matthew put it, “Eight years ago when I started nobody cared… and now so many scholars are looking to ask critical questions about the way the web permeates our day-to-day lives… people are realizing that web archives are a key way to get at those questions. As a researcher, it’s a cool space to be in right now.”


By Rosalie Lack, Product Manager, California Digital Library

This blog post is the first in an upcoming series of interviews with researchers to learn about their research using web archives, and the challenges and opportunities involved.

IIPC Technical Training Workshop – 14–16 January 2015

The idea of running a training workshop focusing on technical matters was formed during the 2014 IIPC General Assembly in Paris. It became apparent that there is a great deal of transferable experience among the members and that some institutions are more advanced than others in using the key software for web archiving. A forum to exchange ideas and discuss common issues would be extremely useful and welcome.

Consortium of memory organisations

In his blog, Kristinn Sigurðsson gave an accurate account of how the idea developed from a thought, through exciting sessions of discussion, to a proposal supported by the IIPC Steering Committee. Staff development and training is one of the key areas of work for the IIPC. For a consortium of memory organisations sharing the mission of preserving the Internet for posterity, there is great advantage in collaborating, helping each other, and not reinventing the wheel. The IIPC has an Education and Training Programme and allocates a certain amount of funding each year for the purpose of collective learning and development. The National Library of France, for example, organised a week-long workshop in 2012 to offer training for organisations planning to embark on web archiving.


Joint expertise

The joint training workshop of the British Library and the National and University Library of Iceland was the first dedicated to technical issues, covering the three key applications for web archiving: Heritrix, OpenWayback and Solr. The speakers mainly came from both libraries’ capable technical teams, including Kristinn Sigurðsson, Andy Jackson, Roger Coram and Gil Hoggarth. Their expertise was strengthened by Toke Eskildsen of the State and University Library in Denmark, who has worked extensively on the Danish Web Archive’s large-scale Solr index. Toke also reported on his visit to the British Library in his blog, describing his experience of “being embedded in tech talk with intelligent people for 5 days” as “exhausting and very fulfilling”. The British Library also took advantage of Toke’s presence and picked his brain on performance issues related to Solr, a perfect example of the good things that can come out of putting techies together.
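
For a flavour of why Solr matters in this stack: once a collection is full-text indexed, researchers can query it through Solr’s standard HTTP API. A minimal sketch in Python (the host, core name and field names are assumptions, not the configuration used at the workshop):

    # Illustrative query against a hypothetical Solr index of a web archive.
    import requests

    params = {
        'q': 'content:"web archiving"',  # full-text search (field name assumed)
        'facet': 'true',
        'facet.field': 'crawl_year',     # break down captures by year (field assumed)
        'rows': 10,
        'wt': 'json',
    }
    resp = requests.get('http://localhost:8983/solr/webarchive/select', params=params)
    for doc in resp.json()['response']['docs']:
        print(doc.get('url'), doc.get('title'))

Heritrix captures the content, OpenWayback replays it, and a Solr index like this is what turns the archive into something researchers can search and count.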

For the future

Evaluation of the workshop indicates overall satisfaction among the attendees. More people seemed to favour the presentations on day one and desired more structure in the hands-on sessions on days two and three, with more real-world examples to be solved together. The presence of strong technical expertise and the opportunity to talk to peers were appreciated the most. From the organiser’s perspective, there are a few things we could have done better: software could have been pre-installed to avoid network congestion and save time, and as for the catering, we will remember for future occasions that brilliant minds need adequate and varied fuel to be kept well-oiled and running up to speed.

Training is vital for any organisation that aims to progress. It is not a cost but an investment which safeguards our continued capability to do our jobs. It is worth considering establishing technical training as a fixed element of the Education and Training Programme. The British Library’s Web Archiving crew are happy to contribute.

Helen Hockx-Yu, Head of Web Archiving, The British Library, 17th Feb 2015