A Retrospective with the Archives Unleashed Project

At the 2016 IIPC Web Archiving Conference in Reykjavík, Ian Milligan and Matthew Weber talked about the importance of building communities around analysing web archives and bringing together interdisciplinary researchers, which is what Archives Unleashed 1.0, the first Web Archive Hackathon hosted by University of Toronto Libraries, attempted to do. At the same conference, Nick Ruest and Ian gave a workshop on the earliest version of the Archives Unleashed Toolkit (“Hands on with Warcbase”). Five years and seven datathons later, the Archives Unleashed Project has seen major technological developments (including the Cloud version of the Toolkit integrated with Archive-It collections), a growing community of researchers, an expanded team, new partnerships, and two major grants. At the project’s core, there is still a desire to engage the community. The most recent initiative, the Cohort Program, builds on the datathons and facilitates research engagement with web archives through year-long collaborations in which research teams receive mentorship and support from the Archives Unleashed Team.

In her blog post, Samantha Fritz, the Project Manager at the Archives Unleashed Project, reflects on the strategy and key milestones achieved between 2017 and 2020, as well as the new partnership with Archive-It and the plans for the next 3 years.


By Samantha Fritz, Project Manager, Archives Unleashed Project

The web archiving world blends the work and contributions of many institutions, groups, projects, and individuals. The field is witnessing work and progress in many areas, from policies, to professional development and learning resources, to the development of tools that address replay, acquisition, and analysis.

For over two decades, memory institutions and organizations around the world have engaged in web archiving to ensure the preservation of born-digital content that is vital to our understanding of post-1990s research topics. Increasingly, web archiving programs are adopted as part of institutional activities, because librarians, archivists, scholars, and others generally recognize that web archives are critical, and vulnerable, resources for stewarding our cultural heritage.

The National Digital Stewardship Alliance has conducted surveys to “understand the landscape of web archiving activities in the United States.” In the most recent survey, from 2017, respondents indicated that the area in which they perceived the least progress over the past two years was access, use, and reuse. The 2017 report indicates that this could suggest “a lack of clarity about how Web archives are to be used post-capture” (Farrell et al. 2017 Report, p. 13). This finding makes complete sense given that the focus has largely revolved around selection, appraisal, scoping, and capture.

Ultimately, the active use of web archives by researchers, and by extension the development of tools to explore web archives, has lagged. As a result, institutions and service providers such as librarians and archivists are tasked with figuring out how to “use” web archives.

We have petabytes of data, but we also have barriers

The amount of data captured is well into the petabyte range. Larger organizations like the Internet Archive, the British Library, the Bibliothèque Nationale de France, Denmark’s Netarchive, the National Library of Australia’s Trove platform, and Portugal’s Arquivo.pt have curated extensive web archive collections, but we still don’t see mainstream or heavy use of web archives as primary sources in research. This is in part due to access and usability barriers. Essentially, the technical experience needed to work with web archives, especially at scale, is beyond the reach of most scholars.

It is this barrier that offers an opportunity for discussion and work in and beyond the web archiving community. As such, we turn to a reflection on the contributions of the Archives Unleashed Project to lowering barriers to web archives.

About the Archives Unleashed Project

Archives Unleashed was established in 2017 with support from The Andrew W. Mellon Foundation. The project grew out of an earlier series of events which identified a collective need among researchers, scholars, librarians and archivists for analytics tools, community infrastructure, and accessible web archival interfaces.

In recognizing the vital role web archives play in studying topics from the 1990s forward, the team has focused on developing open-source tools to lower the barrier to working with and analyzing web archives at scale.

From 2017 to 2020, Archives Unleashed pursued a three-pronged strategy for tackling the computational woes of working with large data, and more specifically W/ARCs:

  1. Development of the Archives Unleashed Toolkit: to apply modern big data analytics infrastructure to scholarly analysis of web archives
  2. Deployment of the Archives Unleashed Cloud: to provide a one-stop, web-based portal for scholars to ingest their Archive-It collections and execute a number of analyses with the click of a mouse.
  3. Organization of Archives Unleashed Datathons: to build a sustainable user community around our open-source software.

Milestones + Achievements

If we look at how Archives Unleashed tools have developed, we have to reach back to 2013 when Warcbase was developed. It was the forerunner to the Toolkit and was built on Hadoop and HBase as an open-source platform to support temporal browsing and large-scale analytics of web archives (Ruest et al., 2020, p. 157).

The Toolkit moves beyond the foundations of Warcbase. Our first major transition was to replace Apache HBase with Apache Spark to modernize analytical functions. In developing the Toolkit, the team was able to draw on the needs of users to inform two significant development choices. The first was to create a Python interface that has functional parity with the Scala interface. Python is widely accepted, and more commonly known, among scholars in the digital humanities who engage in computational work. From a sustainability perspective, Python is stable, open-source, and ranked as one of the most popular programming languages.

Second, the Toolkit shifted from Spark’s resilient distributed datasets (RDDs), part of the Warcbase legacy, to support DataFrames. While this was part of the initial Toolkit roadmap, the team engaged with users to discuss the impact of alternatives to RDDs. Essentially, DataFrames offer the ability within Apache Spark to produce tabular output. The community unanimously accepted this approach, in large part because of its familiarity with pandas and because DataFrames made the data outputs easier to read visually (Fritz et al., 2018, Medium post).
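To make the readability point concrete, here is a minimal sketch in plain Python. It does not use Spark or the Toolkit itself. The records, the column names, and the `render_table` helper are all invented for illustration. It only contrasts a tuple-style listing, roughly how collected RDD records print, with a table that has named columns, roughly how a DataFrame is displayed.

```python
# Hypothetical (crawl_date, domain, count) records. The values are made up
# for illustration and do not come from any real collection.
records = [
    ("20170101", "example.org", 42),
    ("20170101", "archive.example", 7),
    ("20170201", "example.org", 13),
]
columns = ("crawl_date", "domain", "count")


def render_table(rows, columns):
    """Format rows as a plain-text table with a header row of column names."""
    cells = [tuple(str(value) for value in row) for row in rows]
    # Each column is as wide as its longest entry, header included.
    widths = [
        max([len(name)] + [len(row[i]) for row in cells])
        for i, name in enumerate(columns)
    ]

    def line(values):
        padded = (value.ljust(width) for value, width in zip(values, widths))
        return "| " + " | ".join(padded) + " |"

    rule = "+-" + "-+-".join("-" * width for width in widths) + "-+"
    return "\n".join([rule, line(columns), rule] + [line(row) for row in cells] + [rule])


# Tuple-style listing: the reader has to remember what each position means.
for record in records:
    print(record)

# Tabular listing: every value sits under a named column.
table = render_table(records, columns)
print(table)
```

The data is identical in both listings. The difference is only that the table carries its column names with it, which is the property users said made DataFrame output easier to read.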

Comparison between RDD and DataFrame outputs
 

The Toolkit is currently at a 0.90.0 release, and while the Toolkit offers powerful analytical functionality, it is still geared towards an advanced user. Recognizing that scholars often didn’t know where to start with analyzing W/ARC files, and the intimidating nature of the command line, we took a cookbook approach in developing our Toolkit user documentation. With it, researchers can modify dozens of example scripts for extracting and exploring information. Our team focused on designing documentation that presented possibilities and options, while at the same time guided and supported user learning.

 
Sparkshell for using the Archives Unleashed Toolkit

The work to develop the Toolkit provided the foundations for other platforms and experimental methods of working with web archives. The second large milestone reached by the project was the launch of the Archives Unleashed Cloud.

The Archives Unleashed Cloud, largely developed by project co-investigator Nick Ruest, is an open-source platform that provides a web-based front end for users to access the most recent version of the Archives Unleashed Toolkit. A core feature of the Cloud is that it uses the Archive-It WASAPI, which means that users are directly connected to their Archive-It collections and can proceed to analyze web archives without having to spend time delving into the technical world.

 

 

Archives Unleashed Cloud Interface for Analysis

Recognizing that the Toolkit, while flexible and powerful, may still be a little too advanced for some scholars, the Cloud offers a more user-friendly and familiar user interface for interacting with data. Users are presented with simple dashboards which provide insights into WARC collections, provide downloadable derivative files and offer simple in-browser visualizations.

In June of 2020, marking the end of our grant, the Cloud had analyzed just under a petabyte of data and had been used by individuals from 59 unique institutions across 10 countries. The Cloud remains an open-source project, with code available through a GitHub repository. The canonical instance will be deprecated as of June 30, 2021 and migrated into Archive-It, but more on that project in a bit.

Datathons + Community Engagement

Datathons provided an opportunity to build a sustainable community around Archives Unleashed tools, scholarly discussion, and training for scholars with limited technical expertise to explore archived web content.

Adapting the hackathon model, these events saw participants from over fifty institutions in seven countries engage in a hands-on learning environment – working directly with web archive data and new analytical tools to produce creative and inventive projects that explore W/ARCs. By collaborating with host institutions, the datathons also highlighted those institutions’ web archive collections, increasing the visibility of their curated collections and the range of use cases for them.

In a recently published article, “Fostering Community Engagement through Datathon Events: The Archives Unleashed Experience,” we reflected on the impact that our series of datathon events had on community engagement within the web archiving field, and on the professional practices of attendees. We conducted interviews with datathon participants to learn about their experiences and complemented this with an exploration of established models from the community engagement literature. Our article culminates in contextualizing a model for community building and engagement within the Archives Unleashed Project, with potential applications for the wider digital humanities field. 

Our team has also invested and participated in the wider web archival community through additional scholarly activities, such as institutional collaborations, conferences, and meetings. We recognize that these activities bring together many perspectives, and have been a great opportunity to listen to the needs of users and engage in conversations that impact adjacent disciplines and communities.

Archives Unleashed Datathon, Gelman Library, George Washington University

Lessons Learned

1. It takes a community

If there is one main takeaway we’ve learned as a team, and that all our activities point to, it’s that projects can’t live in silos! Whether in digital humanities, digital libraries, or any other discipline, projects need communities to function, survive, and thrive.

We’ve been fortunate and grateful to have been able to connect with various existing groups including being welcomed by the web archiving and digital humanities communities. Community development takes time and focused efforts, but it is certainly worthwhile! Ask yourself, if you don’t have a community, who are you building your tools, services, or platforms for? Who will engage with your work?

We have approached community building through a variety of avenues. First and foremost, we have developed relationships with people and organizations. This is clearly highlighted through our institutional collaborations in hosting datathon events, but we’ve also used platforms like Slack and Twitter to support discussion and connection opportunities among individuals. For instance, in creating both general and specific Slack channels, new users are able to connect with the project team and user community to share information and resources, ask for help, and engage in broader conversations on methods, tools, and data. 

Regardless of platform, successful community building relies on authentic interactions and an acknowledgment that each user brings unique perspectives and experiences to the group. In many cases we have connected with users who are new either to the field or to methods for analyzing web archives. This perspective has helped to inform an empathetic approach to the way we create learning materials, deliver reports and presentations, and share resources.

2. Interdisciplinary teams are important

So often we see projects and initiatives that highlight an interdisciplinary environment – and we’ve found it to be an important part of why our project has been successful. 

Each of our project investigators personifies a group of users that the Archives Unleashed Project aims to support, all of which converge around data, more specifically WARCs or web archive data. We have a historian who is broadly representative of digital humanists and researchers who analyze and explore web archives; a librarian who represents the curators and service providers of web archives; and a computer scientist who reflects tool builders.

A key strength of our team has been to look at the same problem from different perspectives, allowing each member to apply their unique skills and experiences in different ways. This has been especially valuable in developing underlying systems, processes and structures which now make up the Toolkit. For instance, triaging technical components offered a chance for team members to apply their unique skill sets, which often assisted in navigating issues and roadblocks.

We also recognized that each sector has its own language and jargon that can be jarring to new users. In identifying the wide range of technical skills within our team, we leveraged (and valued) those “I have no idea what this means / what this does” moments. If these types of statements were made by team members or close collaborators, chances are they would carry through to our user community.

Ultimately, the interdisciplinary nature and the wide range of technical expertise found within our team helped us to see and think like our users.

3. Sustainability planning is really hard

Sustainability has been part question, part riddle. This is the case for many digital humanities projects. These sustainability questions speak to the long-term lifecycle of the project, and our primary goal has always been to ensure the project’s survival and continued efforts once the grant cycle has ended.

As such, the Archives Unleashed team has developed tools and platforms with sustainability in mind, specifically by using widely adopted and stable programming languages and following best practices. We’ve also been committed to ensuring that all our platforms and tools are developed in the spirit of open access and are available in public GitHub repositories.

One overarching question remained as our project entered its final stages in the spring of 2020: how will the Toolkit live on? Three years of development and use cases demonstrated not only the need for, and adoption of, the tools created under the Archives Unleashed Project, but also that there are currently no simplified processes that could adequately replace them.

Where we are headed (2020-2023)

Our team was awarded a second grant from The Andrew W. Mellon Foundation, which started in 2020 and will secure the future of Archives Unleashed. The goal of this second phase is the integration of the Cloud with Archive-It, so that as a tool it can succeed in a sustainable, long-term environment. The collaboration between Archives Unleashed and Archive-It also aims to continue to widen and enhance the accessibility and usability of web archives.

Priorities of the Project

First, we will merge the Archives Unleashed analytical tools with the Internet Archive’s Archive-It service to provide an end-to-end process for collecting and studying web archives. This will be completed in three stages:

  1. Build. Our team will be setting up the physical infrastructure and computing environment needed to kick start the project. We will be purchasing dedicated infrastructure with the Internet Archive.
  2. Integrate. Here we will be migrating the back end of the Archives Unleashed Cloud to Archive-It and paying attention to how the Cloud can scale to work within its new infrastructure. This stage will also see the development of a new user interface that will provide a basic set of derivatives to users.
  3. Enhance. The team will incorporate consultation with users to develop an expanded and enhanced set of derivatives and implement new features.

Secondly, we will engage the community by facilitating opportunities to support web archives research and scholarly outputs. Building on our earlier successful datathons, we will be launching the Archives Unleashed Cohort program to engage with and support web archives research. The Cohorts will see research teams participate in year-long intensive collaborations and receive mentorship from Archives Unleashed with the intention of producing a full-length manuscript.

We’ve made tremendous progress, and the close of our first year is in sight. Our major milestone will be to complete the integration of the Archives Unleashed Cloud/Toolkit into Archive-It. Users will soon see a beta release of the new interface for conducting analysis with their web archive collections, which will let them download over a dozen derivatives for further analysis and access simple in-browser visualizations.

Our team looks forward to the road ahead, and would like to express our appreciation for the support and enthusiasm Archives Unleashed has received!

 

We would like to recognize that our 2017-2020 work was primarily supported by the Andrew W. Mellon Foundation, with financial and in-kind support from the Social Sciences and Humanities Research Council, Compute Canada, the Ontario Ministry of Research, Innovation, and Science, York University Libraries, Start Smart Labs, and the Faculty of Arts and David R. Cheriton School of Computer Science at the University of Waterloo.

 

References

Farrell, M., McCain, E., Praetzellis, M., Thomas, G., and Walker, P. 2018. Web Archiving in the United States: A 2017 Survey. National Digital Stewardship Alliance Report. DOI 10.17605/OSF.IO/3QH6N

Ruest, N., Lin, J., Milligan, I., and Fritz, S. 2020. The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL ’20). Association for Computing Machinery, New York, NY, USA, 157–166. DOI: https://doi.org/10.1145/3383583.3398513

Fritz, S., Milligan, I., Ruest, N., and Lin, J. 2018. To DataFrame or Not, that is the Questions: A PySpark DataFrames Discussion. Medium, May 29, 2018. https://news.archivesunleashed.org/to-dataframe-or-not-that-is-the-questions-a-pyspark-dataframes-discussion-600f761674c4

 

Resources

We’ve provided some additional reading materials and resources that have been written by our team, and shared with the community over the course of our project work.

For a full list please visit our publications page: https://archivesunleashed.org/publications/.

Shorter blog posts can be found on our Medium site: https://news.archivesunleashed.org

