Dark and Stormy Archives

By Shawn M. Jones, Ph.D. student and Graduate Research Assistant at Los Alamos National Laboratory (LANL), Martin Klein, Scientist in the Research Library at LANL, Michele C. Weigle, Professor in the Computer Science Department at Old Dominion University (ODU), and Michael L. Nelson, Professor in the Computer Science Department at ODU.

Individual web archive collections can contain thousands of documents. Seeds inform capture, but the documents in these collections are archived web pages (mementos) created from those seeds. The sheer size of these collections makes them challenging to understand and compare. Consider Archive-It as an example platform. Archive-It has many collections on the same topic. As of this writing, a search for the query “COVID” returns 215 collections. If a researcher wants to use one of these collections, which one best meets their information need? How does the researcher differentiate them? Archive-It allows its collection owners to apply metadata, but our 2019 study found that as a collection’s number of seeds rises, the amount of metadata per seed falls. This relationship is likely due to the increased effort required to maintain the metadata for a growing number of seeds. The result is paradoxical for those viewing the collection: the more seeds exist, the more metadata viewers need to understand it. Additionally, organizations add more collections each year, resulting in more than 15,000 Archive-It collections as of the end of 2020. Too many collections, too many documents, and not enough metadata make human review of these collections a costly proposition.

We use cards to summarize web documents all of the time. Here is the same document rendered as cards on different platforms.

An example of social media storytelling at Storify (now defunct) and Wakelet: cards created from individual pages, pictures, and short text describe a topic.

Ideally, a user would be able to glance at a visualization and gain an understanding of the collection, but existing visualizations impose a high cognitive load and require training even to convey one aspect of a collection. Social media storytelling provides us with an approach. We see social cards all of the time on social media. Each card summarizes a single web resource. If we group those cards together, we summarize a topic. Thus social media storytelling produces a summary of summaries. Tools like Storify and Wakelet already apply this technique for live web resources. We want to use this proven technique because readers already understand how to view these visualizations. The Dark and Stormy Archives (DSA) Project explores how to summarize web archive collections through these visualizations. We make our DSA Toolkit freely available to others so they can explore web archive collections through storytelling.

The Dark and Stormy Archives Toolkit

The Dark and Stormy Archives (DSA) Toolkit provides a solution for each stage of the storytelling lifecycle.

Telling a story with web archives consists of three steps. First, we select the mementos for our story. Next, we gather the information to summarize each memento. Finally, we summarize all mementos together and publish the story. We evaluated more than 60 platforms and determined that no platform could reliably tell stories with mementos. Many could not even create cards for mementos, and some mixed information from the archive with details from the underlying document, creating confusing visualizations.

Hypercane selects the mementos for a story. It is a rich solution that gives the storyteller many customization options. With Hypercane, we submit a collection of thousands of documents, and Hypercane reduces them to a manageable number. Hypercane provides commands that allow the archivist to cluster, filter, score, and order mementos automatically. The output from some Hypercane commands can be fed into others so that archivists can create recipes with the intelligent selection steps that work best for them. For those looking for an existing selection algorithm, we provide random selection, filtered random selection, and AlNoamany’s Algorithm as prebuilt intelligent sampling techniques. We are experimenting with new recipes. Hypercane also produces reports, helping us include named entities, gather collection metadata, and select an overall striking image for our story.
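The idea of chaining selection steps into a recipe can be sketched in a few lines of Python. This is an illustration only: the `Memento` record, the step functions, the scoring rule, and the sample URLs are all hypothetical and are not Hypercane's actual commands or data model. The sketch shows a filter step feeding a cluster step, followed by scoring and ordering, so that a collection is reduced to one memento per cluster.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Memento:
    """Hypothetical record for an archived page; not Hypercane's data model."""
    urim: str        # URI of the memento
    year: int        # capture year
    language: str    # detected language of the page
    word_count: int  # length of the extracted text


def filter_by_language(mementos, language):
    """Filter step: keep only mementos in the given language."""
    return [m for m in mementos if m.language == language]


def cluster_by_year(mementos):
    """Cluster step: group mementos by capture year."""
    ordered = sorted(mementos, key=lambda m: m.year)
    return {year: list(group)
            for year, group in groupby(ordered, key=lambda m: m.year)}


def score(memento):
    """Score step: a toy score that prefers pages with more text."""
    return memento.word_count


def select_story(mementos, language="en"):
    """Recipe: filter, cluster, pick the best of each cluster, order by year."""
    clusters = cluster_by_year(filter_by_language(mementos, language))
    winners = [max(group, key=score) for group in clusters.values()]
    return sorted(winners, key=lambda m: m.year)


# Made-up collection for illustration.
collection = [
    Memento("https://archive.example/1", 2019, "en", 120),
    Memento("https://archive.example/2", 2019, "en", 900),
    Memento("https://archive.example/3", 2020, "fr", 2000),
    Memento("https://archive.example/4", 2020, "en", 450),
    Memento("https://archive.example/5", 2021, "en", 30),
]

story = select_story(collection)
for memento in story:
    print(memento.urim)
```

Because each step takes a list of mementos and returns a list or grouping of mementos, the steps can be reordered or swapped, which is the property that makes recipes possible.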

To gather the information needed to summarize individual mementos, we required an archive-aware card service; thus, we created MementoEmbed. MementoEmbed can create summaries of individual mementos in the form of cards, browser screenshots, word clouds, and animated GIFs. If a web page author needs to summarize a single memento, we provide a graphical user interface that returns the proper HTML for them to embed in their page. MementoEmbed also provides an extensive API on top of which developers can build clients.

Raintale is one such client. Raintale summarizes all mementos together and publishes a story. An archivist can supply Raintale with a list of mementos. For more complex stories, including overall striking images and metadata, archivists can also provide output from Hypercane’s reports. Because we needed flexibility for our research, we incorporated templates into Raintale. These templates allow us to publish stories to Twitter, HTML, and other file formats and services. With these templates, an archivist can not only choose which elements to include in their cards but also brand the output for their institution.

Raintale uses templates to allow the storyteller to tell their story in different formats, with various options, including branding.
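To make the role of templates concrete, here is a minimal sketch using Python's standard `string.Template`. It is not Raintale's template syntax: the placeholder names, the HTML layout, the `render_story` helper, and the sample data are assumptions made for illustration. The sketch shows how one template controls which elements appear in each card while a second wraps the cards with institutional branding.

```python
from html import escape
from string import Template

# Hypothetical card template; Raintale's real template syntax may differ.
CARD_TEMPLATE = Template(
    '<div class="card">\n'
    '  <h2><a href="$urim">$title</a></h2>\n'
    '  <p>$snippet</p>\n'
    '  <p class="source">Preserved by $archive on $captured</p>\n'
    '</div>'
)

# Hypothetical story template that adds a title and institutional branding.
STORY_TEMPLATE = Template(
    '<section class="story">\n'
    '  <h1>$story_title</h1>\n'
    '  <p class="brand">$institution</p>\n'
    '$cards\n'
    '</section>'
)


def render_story(story_title, institution, mementos):
    """Render one card per memento, then wrap the cards in the story template."""
    cards = "\n".join(
        # Escape every value so text from the page cannot break the HTML.
        CARD_TEMPLATE.substitute({key: escape(value) for key, value in m.items()})
        for m in mementos
    )
    return STORY_TEMPLATE.substitute(
        story_title=escape(story_title),
        institution=escape(institution),
        cards=cards,
    )


# Made-up memento summaries for illustration.
sample_mementos = [
    {
        "urim": "https://archive.example/20200401000000/http://example.com/",
        "title": "Masks & distancing",
        "snippet": "Guidance published in the first weeks of the outbreak.",
        "archive": "Example Archive",
        "captured": "2020-04-01",
    },
    {
        "urim": "https://archive.example/20200615000000/http://example.org/",
        "title": "Reopening plans",
        "snippet": "A regional plan for reopening public spaces.",
        "archive": "Example Archive",
        "captured": "2020-06-15",
    },
]

page = render_story("A sample story", "Example Library", sample_mementos)
print(page)
```

Changing the output format or the branding then means editing only the template strings, not the code that gathers the memento information.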

The DSA Toolkit at work

The DSA Toolkit produced a story summarizing the IIPC’s COVID-19 Archive-It collection.

The DSA Toolkit produced stories from Archive-It collections about mass shootings (from left to right) at Virginia Tech, Norway, and El Paso.

Through these tools, we have produced a variety of stories from web archives. As shown above, we debuted with a story about the IIPC’s COVID-19 Archive-It collection, reducing a collection of 23,376 mementos to an intelligent sample of 36. Instead of seed URLs and metadata, our visualization displays people in masks, places that the virus has affected, text drawn from the underlying mementos, correct source attribution, and, of course, links back to the Archive-It collection so that people can explore the collection further. We recently generated stories that allow readers to view the differences between Archive-It collections about the mass shootings in Norway, El Paso, and Virginia Tech. Instead of facets and seed metadata, our stories show victims, places, survivors, and other information drawn from the sampled mementos. The reader can also follow the links back to the full collection page and get even more information using the tools provided by the archivists at Archive-It.

With help from StoryGraph, the DSA Toolkit produces daily news stories so that readers can compare the biggest story of the day across different years.

But our stories are not limited to Archive-It. We designed the tools to work with any Memento-compliant web archive. In collaboration with StoryGraph, we produce daily news stories built with mementos stored at Archive.Today and the Internet Archive. We are also experimenting with summarizing a scholar’s grey literature as stored in the web archive maintained by the Scholarly Orphans project.

We designed the DSA Toolkit to work with any Memento-compliant archive. Here we summarize Ian Milligan’s grey literature as captured by the web archive at the Scholarly Orphans Project.
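Memento compliance matters because the Memento protocol (RFC 7089) gives every compliant archive the same HTTP vocabulary: a client asks for a past version with an `Accept-Datetime` request header, and responses describe related resources in a `Link` header with relations such as `original`, `timemap`, and `memento`. The sketch below is not code from the DSA Toolkit. It is a minimal, offline illustration of building that request header and reading such a `Link` header; the helper names and the sample header value are made up, and the parser handles only the simple quoted-parameter form shown here.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

# Matches one link: <target> followed by ;name="value" parameters.
LINK_PATTERN = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_PATTERN = re.compile(r'([\w-]+)="([^"]*)"')


def accept_datetime_header(when):
    """Build the Accept-Datetime request header for datetime negotiation."""
    as_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(as_utc, usegmt=True)}


def parse_link_header(value):
    """Parse a Link header into dicts with uri, rel (a list), and datetime."""
    links = []
    for target, params in LINK_PATTERN.findall(value):
        attrs = dict(PARAM_PATTERN.findall(params))
        captured = attrs.get("datetime")
        links.append({
            "uri": target,
            "rel": attrs.get("rel", "").split(),
            "datetime": parsedate_to_datetime(captured) if captured else None,
        })
    return links


# Made-up header value in the style a Memento-compliant archive returns.
sample_link_header = (
    '<http://example.com/>; rel="original", '
    '<https://archive.example/timemap/http://example.com/>; '
    'rel="timemap"; type="application/link-format", '
    '<https://archive.example/20200401000000/http://example.com/>; '
    'rel="first memento"; datetime="Wed, 01 Apr 2020 00:00:00 GMT"'
)

request_headers = accept_datetime_header(datetime(2020, 4, 1, tzinfo=timezone.utc))
links = parse_link_header(sample_link_header)
mementos = [link for link in links if "memento" in link["rel"]]

print(request_headers)
for memento in mementos:
    print(memento["uri"], memento["datetime"].isoformat())
```

Because the header names and link relations are the same at every compliant archive, the same client logic can be pointed at different archives without archive-specific parsing.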

Our Thanks to the IIPC for Funding the DSA Toolkit

We are excited to say that, starting in 2021, as part of a recent IIPC grant, we will be working with the National Library of Australia to pilot the DSA Toolkit with their collections. In addition to solving potential integration problems with their archive, we look forward to improving the DSA Toolkit based on feedback and ideas from the archivists themselves. We will incorporate the lessons learned back into the DSA Toolkit so that all web archives may benefit, which is what the IIPC is all about.

Relevant URLs

DSA web site: http://oduwsdl.github.io/dsa

DSA Toolkit: https://oduwsdl.github.io/dsa/software.html

Raintale web site: https://oduwsdl.github.io/raintale/

Hypercane web site: https://oduwsdl.github.io/hypercane/