New Report on Web Archiving Available

By Andrea Goethals, Harvard Library

This is an expanded version of a post to the Library of Congress’ Signal blog.

Harvard Library recently released a report that is the result of a five-month environmental scan of the landscape of web archiving, made possible by the generous support of the Arcadia Fund. The purpose of the study was to explore and document current web archiving programs to identify common practices, needs, and expectations in the collection and provision of web archives to users; the provision and maintenance of web archiving infrastructure and services; and the use of web archives by researchers. The findings will inform Harvard Library’s strategy for scaling up its web archiving activities, and are also being shared broadly to help inform research and development priorities in the global web archiving community.

The heart of the study was a series of interviews with web archiving practitioners from archives, museums and libraries worldwide; web archiving service providers; and researchers who use web archives. The interviewees were selected from the membership of several organizations, including the IIPC of course, but also the Web Archiving Roundtable at the Society of American Archivists (SAA), the Internet Archive’s Archive-It Partner Community, the Ivy Plus institutions, Working with Internet archives for REsearch (Reuters/WIRE Group), and the Research infrastructure for the Study of Archived Web materials (RESAW).

The interviews of web archiving practitioners covered a wide range of areas, from how institutions maintain their web archiving infrastructure (e.g. outsourcing, staffing, location in the organization) to how they are (or aren’t) integrating their web archives with their other collections. From this data, profiles were created for 23 institutions, and the data was aggregated and analyzed to look for common themes, challenges and opportunities.

Opportunities for Research & Development

In the end, the environmental scan revealed 22 opportunities for future research and development. These opportunities are listed below and described in more detail in the report. At a high level, these opportunities fall under four themes: (1) increase communication and collaboration, (2) focus on “smart” technical development, (3) focus on training and skills development, and (4) build local capacity.

22 Opportunities to Address Common Challenges

(the order has no significance)

  1. Dedicate full-time staff to work in web archiving so that institutions can stay abreast of the latest developments and best practices and fully engage in the web archiving community.
  2. Conduct outreach, training and professional development for existing staff, particularly those working with more traditional collections, such as print, who are being asked to collect web archives.
  3. Increase communication and collaboration across types of collectors since they might collect in different areas or for different reasons.
  4. Fund a collaboration program (a bursary award, for example) to support researcher use of web archives by gathering feedback on requirements and impediments to the use of web archives.
  5. Leverage the membership overlap between RESAW and European IIPC membership to facilitate formal researcher/librarian/archivist collaboration projects.
  6. Make institutional web archiving programs transparent about holdings, indicating what material each has, terms of use, preservation commitment, plus curatorial decisions made for each capture.
  7. Develop a collection development tool (e.g. registry or directory) to expose holdings information to researchers and other collecting institutions even if the content is viewable only in on-site reading rooms.
  8. Conduct outreach and education to website developers to provide guidance on creating sites that can be more easily archived and described by web archiving practitioners.
  9. Have the IIPC, or a similar large international organization, educate and influence tech companies that host content (e.g. Google/YouTube) on the importance of supporting libraries and archives in their efforts to archive that content (even if the content cannot be made immediately available to researchers).
  10. Investigate Memento further (for example, by conducting user studies) to see if more web archiving institutions should adopt it as part of their discovery infrastructure.
  11. Fund a collection development and nomination tool that can enable rapid collection development decisions, possibly building on one or more of the current tools that are targeted for open source deployment.
  12. Gather requirements across institutions and among web researchers for next generation of tools that need to be developed.
  13. Develop specifications for a web archiving API that would allow web archiving tools and services to be used interchangeably.
  14. Train researchers in the skills they need to analyze big data found in web archives.
  15. Provide tools to make researcher analysis of big data found in web archives easier, leveraging existing tools where possible.
  16. Establish a standard for describing the curatorial decisions behind collecting web archives so that there is consistent (and machine-actionable) information for researchers.
  17. Establish a feedback loop between researchers and the librarians/archivists.
  18. Explore how institutions can augment the Archive-It service and provide local support to researchers, possibly using a collaborative model.
  19. Increase interaction with users, and develop deep collaborations with computer scientists.
  20. Explore what, and how, a service might support running computing and software tools and infrastructure for institutions that lack their own onsite infrastructure to do so.
  21. Encourage service providers to develop more offerings around the available tools to lower the barrier to entry and make them accessible to those lacking programming skills and/or IT support.
  22. Work with service providers to help reduce any risks of reliance on them (e.g. support for APIs so that service providers could more easily be changed and content exported if needed).
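Opportunity 10 mentions Memento (RFC 7089), in which archives expose a “TimeMap”: a link-format listing of the captured versions (“mementos”) of a URL. As a rough, self-contained illustration of what a Memento client consumes, the sketch below parses a small made-up TimeMap excerpt rather than a live response; the URLs and dates are illustrative only.

```python
import re

# A made-up TimeMap excerpt in RFC 7089 link-format: one "original" entry
# plus two "memento" entries, each with a capture datetime.
timemap = (
    '<http://example.com/>; rel="original",\n'
    '<https://web.archive.org/web/20150101000000/http://example.com/>; '
    'rel="memento"; datetime="Thu, 01 Jan 2015 00:00:00 GMT",\n'
    '<https://web.archive.org/web/20151001000000/http://example.com/>; '
    'rel="memento"; datetime="Thu, 01 Oct 2015 00:00:00 GMT"'
)

def parse_mementos(text):
    """Return (url, datetime) pairs for entries whose rel is 'memento'."""
    mementos = []
    for entry in text.split(",\n"):
        url_match = re.match(r"<([^>]+)>", entry)
        if url_match and 'rel="memento"' in entry:
            dt_match = re.search(r'datetime="([^"]+)"', entry)
            mementos.append((url_match.group(1),
                             dt_match.group(1) if dt_match else None))
    return mementos

for url, dt in parse_mementos(timemap):
    print(dt, url)
```

A real client would fetch the TimeMap over HTTP from an archive’s TimeMap endpoint and would need a full link-format parser; this sketch only shows the shape of the data a discovery interface works with.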

Communication & Collaboration are Key!

One of the biggest takeaways is that the first theme, the need to radically increase communication and collaboration among all individuals and organizations involved in web archiving, was the most prevalent: thirteen of the 22 opportunities fell under it. Clearly, much more communication and collaboration is needed not only among those collecting web content, but also between collectors and the researchers who would like to use it.

This environmental scan has given us a great deal of insight into how other institutions are approaching web archiving, which will inform our own web archiving strategy at Harvard Library in the coming years. We hope that it has also highlighted key areas for research and development that need to be addressed if we are to build efficient and sustainable web archiving programs that result in complementary and rich collections that are truly useful to researchers.

A Note about the Tools

There is a section in the report (Appendix C) that lists all the current web archiving tools that were identified during the environmental scan. The IIPC Tools and Software web page was one of the resources used to construct this list, along with what was learned through interviews, conferences and independent research. The tools are organized according to the various activities needed throughout the lifecycle of acquiring, processing, preserving and providing web archive collections. Many of the tools discovered are fairly new, especially those associated with the analysis of web archives. The state of the tools will continue to change rapidly, so this list will quickly become out of date unless a group like the IIPC decides to maintain it. I will be at the GA in April if any IIPC member would like to talk about maintaining this list or other parts of the report.

2016 IIPC General Assembly & Web Archiving Conference

In 2016 the IIPC is organising two back-to-back events in the spring hosted by the Landsbókasafn Íslands – Háskólabókasafn (National and University Library of Iceland) in Reykjavík, Iceland:

  • IIPC General Assembly 2016, 11-12 April – Free (open for members only)
  • IIPC Web Archiving Conference 2016, 13-15 April – Free (open to anyone)

The IIPC is seeking proposals for presentations and workshops for the 2016 IIPC Web Archiving Conference (13 – 15 April 2016). Members of the IIPC are also encouraged to submit proposals for the IIPC General Assembly (11 & 12 April 2016).

Theme guidance

Proposals may cover any aspect of web archiving. The following is a non-exhaustive list of possible topics:

Policy and Practice

  • Harvesting, preservation, and/or access
  • Collection development
  • Copyright and privacy
  • Legal and ethical concerns
  • Programmatic organization and management

Research

  • Research using web archives
  • Tools and approaches
  • Initiatives, platforms, and collaborations

Tools

  • New/updated tools for any part of the lifecycle
  • Application programming interfaces (APIs)
  • Current and future landscape

Proposal guidance

Individual presentations can be a maximum of 20 mins. A panel session can be a maximum of 60 minutes with 2 or more presentations on a topic. A discussion session should include one or more introductory statements followed by a moderated discussion. Workshops can be up to a half-day in length; please include details on the proposed structure, content, and target audience.

Abstracts should include the name of the speaker(s), a title, and a theme, and should be no more than 300 words. All abstracts should be in English.

Please submit your proposals using this form. For questions, please e-mail iipc@bl.uk.

The deadline for submissions is 17 December 2015. All submissions will be reviewed by the Programme Committee and submitters will be notified by mid-January 2016.

Five Takeaways from AOIR 2015

I recently attended the annual Association of Internet Researchers (AOIR) conference in Phoenix, AZ. It was a great conference that I would highly recommend to anyone interested in learning firsthand about research questions, methods, and studies broadly related to the Internet.

Researchers presented on a wide range of topics, across a wide range of media, using both qualitative and quantitative methods. You can get an idea of the range of topics by looking at the conference schedule.

I’d like to briefly share some of my key takeaways. I apologize in advance for oversimplifying what was a rich and deep array of research work; my goal here is to provide a quick summary, not an in-depth review of the conference.

  1. Digital Methods Are Where It’s At

I attended an all-day, pre-conference digital methods workshop. As a testament to the interest in this subject, the workshop was so overbooked they had to run three concurrent sessions. The workshops were organized by Axel Bruns, Jean Burgess, Tim Highfield, Ben Light, and Patrik Wikstrom (Queensland University of Technology), and Tama Leaver (Curtin University).

Researchers are recognizing that digital research skills are essential. And, if you have some basic coding knowledge, all the better.

At the digital methods workshop, we learned about the “Walkthrough” method for studying software apps, tools for “web scraping” to gather data for analysis, Tableau to conduct social media analysis, and “instagrammatics,” analyzing Instagram.

FYI: The Digital Methods Initiative from Europe has tons of great information, including an amazing list of tools.

  2. The Twitter API Is Also Very Popular

There were many Twitter studies, and they all used the Twitter API to download tweets for analysis. Although researchers are widely using the Twitter API, they expressed a lot of frustration over its limitations. For example, you can only download for free up to 1% of the total Twitter volume. If you’re studying something obscure, you are probably okay, but if you’re studying a topic like #jesuischarlie, you’ll have to pay to get the entire output. Many researchers don’t have the funds for that. One person pointed out that it would be ideal to have access to the Library of Congress’s Twitter archive. Yes, agreed!
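As a minimal, hypothetical sketch (not any particular researcher’s pipeline) of the kind of analysis run on tweets once they have been downloaded via the API, here is a hashtag counter over illustrative stand-in tweet records:

```python
from collections import Counter

# Illustrative stand-ins for tweet records returned by the Twitter API;
# real API responses carry many more fields.
sample_tweets = [
    {"id": 1, "text": "Solidarity. #jesuischarlie"},
    {"id": 2, "text": "Archiving the web matters #webarchiving #jesuischarlie"},
    {"id": 3, "text": "Conference notes #webarchiving"},
]

def hashtag_counts(tweets):
    """Count hashtags (case-insensitive) across a list of tweet dicts."""
    counts = Counter()
    for tweet in tweets:
        for word in tweet["text"].split():
            if word.startswith("#"):
                counts[word.lower().rstrip(".,!?")] += 1
    return counts

counts = hashtag_counts(sample_tweets)
print(counts["#jesuischarlie"], counts["#webarchiving"])  # prints: 2 2
```

With a complete download, counts like these trace the volume of a conversation over time; with only a 1% sample, the same code runs but the counts understate the conversation, which is exactly the frustration researchers voiced.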

  3. Social Media over Web Archives

Researchers presented conclusions and provided commentary on our social behavior through studies of social media such as Snapchat, Twitter, Facebook, and Instagram. There were only a handful of presentations using web archived materials. If a researcher used websites, they viewed them live or conducted “web scraping” with tools such as Outwit and Kimono. Many also used custom Python scripts to gather the data from the sites.
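The custom scraping scripts mentioned above can be quite small. The following is a self-contained sketch using only Python’s standard library; it parses an inline HTML snippet rather than fetching a live page (a real script would download pages, e.g. with urllib, and respect robots.txt and site terms):

```python
from html.parser import HTMLParser

# A minimal link extractor of the kind used in "web scraping" scripts:
# collect the href of every <a> tag encountered while parsing.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Inline snippet so the example runs without network access.
page = """
<html><body>
  <a href="https://archive.org/">Internet Archive</a>
  <a href="http://netpreserve.org/">IIPC</a>
</body></html>
"""

extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # ['https://archive.org/', 'http://netpreserve.org/']
```

Note that scraping a live site captures only what the researcher happened to collect at that moment, which is part of why web archives matter for reproducibility.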

  4. Fair Use Needs a PR Movement

There’s still much misunderstanding about what researchers can and cannot do with digital materials. I attended a session where the presenter shared findings from surveys conducted with communication scholars about their knowledge of fair use. The results showed that there was (very!) limited understanding of fair use. Even worse, the findings showed that scholars who had previously attended a fair use workshop were even less likely to understand it! Moreover, many admitted that they did not conduct particular studies because of a (misguided) fear of violating copyright. These findings were corroborated by scholars from a variety of fields who were in the room.

  5. Opportunities for Collaboration

I asked many researchers if they were concerned that they were not saving a snapshot of websites or apps at the time of their studies. The answer was a resounding “yes!” They recognize that sites and tools change rapidly, but they are unaware of tools or services they could use, or that their librarians/archivists have solutions.

Clearly there is room for librarians/archivists to conduct more outreach to researchers to inform them about our rich web archive collections and to talk with them about preservation solutions, good data management practices and copyright.

Who knew?

Let me end by sharing one tidbit that really blew my mind. In her research on “Dead Online: Practices of Post-Mortem Digital Interaction,” Paula Kiel presented on the “digital platforms designed to enable post-mortem interactions.” Yes, she was talking about websites where you can send posthumous messages via Facebook and email! For example, https://www.safebeyond.com/, “Life continues when you pass… Ensure your presence – be there when it counts. Leave messages for your loved ones – for FREE!”

By Rosalie Lack, Product Manager, California Digital Library

Politics, Archaeology, and Swiss Cheese: Meghan Dougherty Shares Her Experiences with Web Archiving

Meghan Dougherty, Assistant Professor in Digital Communication at Loyola University, Chicago, started our interview by warning me that she is the “odd man out” when it comes to web archiving and uses web archives differently than most. I was immediately intrigued!

Meghan’s research agenda centers on the nature of inquiry and the nature of evidence. All her research is conducted within the framework of questioning methodology; that is, she asks questions about how archives are built, how that process influences what is collected, and how that process then influences scientific inquiry.

Roots in Politics

Before closely examining methodology, Meghan spent “hands-on” time starting in the early 2000s working with a research group called webarchivist.org, co-founded by Kirsten Foot (University of Washington) and Steve Schneider (SUNYIT). The interdisciplinary nature of the work at this organization was evident in the project members, which included two political scientists and two communications scholars, focusing on both qualitative and quantitative analysis. Their big research question was “what is the impact of the Internet on politics?”

Meghan and the rest of the research group recognized that if you are going to look at how the Internet affects politics, then you need to analyze how it changes over time. To do that you need to slow it down, to essentially take snapshots so you can do an analytical comparison.

To achieve this goal, the team worked collaboratively with the Internet Archive and the Library of Congress to build an election web archive, specifically around U.S. House, Senate, presidential, and gubernatorial elections, with a focus on candidate websites.

The Tao of a Website

As they were doing rigorous quantitative content analysis of election websites, the team was also asked to take extensive field notes to document everything they noticed. This, in turn, is how Meghan became curious about studying methodology. Looking at these sites in such detail prompted many questions:

“What exactly has the crawler captured in an archive? What am I looking at? If a website is fluid and moving and constantly updated, then what is this thing we’ve captured? What is the nature of ‘being’ for objects on the web? If I capture a snapshot, am I really capturing anything, or is it just a resemblance of the thing that existed?”

Meghan admits she doesn’t have all of the answers, but she challenges her fellow scholars to ask these difficult questions and not try to neatly tie up their research with a bow by simplifying the analysis. She cautions that before you can gain knowledge about social and behavioral change over time in the digital world, you need to have a sensibility about what it actually means. Without answering that question, the research methods are just practice and not actually knowledge-building systems.

The Big Secret

Meghan appreciates when archivists and librarians ask her how they can help to support her in her work. What she really needs, she says, is a long-term collaborator, because frankly she doesn’t know what she wants.

“What if I told you that we don’t know what we want to analyze. We really need to think about these things together. The big secret is that we don’t know what we want because we don’t know what we’re dealing with. We are still working through it and we need you [curators and librarians] to help us to think about what an archive is, what we can collect, and how it gets collected. So we can build knowledge together about this collection of evidence.”

In hearing Meghan discuss two small-scale research projects, it was evident that even within her own research portfolio she has very different requirements for web archives.

#AskACurator

Ask a Curator is a once-a-year event, when cultural heritage institutions across the world open up for anyone to engage with their curators via Twitter.

By analyzing tweets with the #AskACurator hashtag, Meghan is studying how groups of people come together and interact with institutions and how institutions reach out with digital media to connect with their public.

In this example, Meghan stresses that completeness and precision of data are critical. If the archive of tweets for this hashtag is incomplete, then big chunks of really interesting mini-conversations will be missing from Meghan’s data. In addition, missing data will skew her categorizations and must be accounted for.

Taqwacore

By Eye Steel Film from Canada (Muslim Punks) [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons
Another project is more like an ethnographic study of an online community (Taqwacore) of young people gathered around their faith in Islam, interest in punk music, and political activism. Meghan is studying a wide variety of print and online materials, including a small press novel (which launched this subculture), materials distributed online and handed out at concerts, and materials distributed in person and on the online community pages joined by kids living all over the world.

In this study, the precision and completeness of the evidence doesn’t matter as much because Meghan’s goal is to get a general gist of the subculture. She is conducting an ethnographic study, but in the past: instead of camping out in the scene in the moment, she is looking back in time at the conversations people had and trying to understand who they were.

Digging Web Archives

In her research, Meghan has come to use the term web archaeology because, regardless of her area of work, her research has felt like an archaeological dig in which she examines digital traces of past human behavior to understand her subject. Archaeology, not unlike web archiving, can be both destructive and constructive, and similarly archaeologists use very specific, specialized tools to find and uncover delicate remains of something that has been covered or even mostly lost over time.

At this year’s IIPC General Assembly (http://netpreserve.org/general-assembly/2015/overview), Meghan introduced her web archaeology idea, which is also the topic of her forthcoming book (“Virtual Digs: excavating, archiving, preserving and curating the web,” from University of Toronto Press), through a tongue-in-cheek video from The Onion about uncovering the ruins of a Friendster civilization.

While the video is intended as satire, it raises a real question that we need to address: a hundred years from now, when people look back at our communication media, such as Facebook, what will future scholars be able to dig up?

All about the Holes

In a presentation at IIPC 2011, Barbara Signori, Head of the Department e-Helvetica at Swiss National Library, shared a wonderful analogy about how the holes in our archives are like the holes in Swiss cheese – inevitable. When I asked Meghan to share something that surprised her about her research, she shared a story about the holes.

“Emmentaler”. Licensed under CC BY-SA 3.0 via Wikimedia Commons – https://commons.wikimedia.org/wiki/File:Emmentaler.jpg#/media/File:Emmentaler.jpg

When working with the Library of Congress back in the early 2000s, Meghan’s research group provided a list of political candidates to the Library of Congress staff for crawling. Library of Congress staff created an index of the sites crawled, but they did not create an entry in cases where no websites existed.

Meghan and her fellow researchers were surprised because it seemed obvious to them that you would document the candidates who had websites, as well as those who didn’t. Knowing that a candidate DID NOT have a website in the early 2000s was a big deal, and would have a huge impact on findings! Absence shows us something very interesting about the environment.

Meghan would go so far as to say that a quirk about web archives is that librarians and curators are so focused on the cheese, while researchers find the holes of equal interest.

This blog post is the third in a series of interviews with researchers to learn about their use of web archives.

By Rosalie Lack, Product Manager, California Digital Library