“Asking questions with web archives – introductory notebooks for historians” is one of three projects awarded a grant in the first round of the Discretionary Funding Programme (DFP), which aims to support the collaborative activities of IIPC members by providing funding to accelerate the preservation and accessibility of the web. The project was led by Dr Andy Jackson of the British Library. The project co-lead and developer was Dr Tim Sherratt, the creator of the GLAM Workbench, which provides researchers with examples, tools, and documentation to help them explore and use the online collections of libraries, archives, and museums. The notebooks were developed with the participation of the British Library (UK Web Archive), the National Library of Australia (Australian Web Archive), and the National Library of New Zealand (New Zealand Web Archive).
By Tim Sherratt, Associate Professor of Digital Heritage, University of Canberra & the creator of the GLAM Workbench
We tend to think of a web archive as a site we go to when links are broken – a useful fallback, rather than a source of new research data. But web archives don’t just store old web pages; they capture multiple versions of web resources over time. Using web archives we can observe change, and that means we can ask historical questions. The difficulty is that web archives store huge amounts of data, and access is often limited for legal reasons. Just knowing what data is available and how to get to it can be difficult.
Where do you start?
The GLAM Workbench’s new web archives section can help! Here you’ll find a collection of Jupyter notebooks that document web archive data sources and standards, and walk through methods of harvesting, analysing, and visualising that data. It’s a mix of examples, explorations, apps and tools. The notebooks use existing APIs to get data in manageable chunks, but many of the examples demonstrated can also be scaled up to build substantial datasets for research – you just have to be patient!
What can you do?
Have you ever wanted to find when a particular fragment of text first appeared in a web page? Or compare full-page screenshots of archived sites? Perhaps you want to explore how the text content of a page has changed over time, or create a side-by-side comparison of web archive captures. There are notebooks to help you with all of these. To dig deeper you might want to assemble a dataset of text extracted from archived web pages, construct your own database of archived PowerPoint files, or explore patterns within a whole domain.
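A number of the notebooks use Timegates and Timemaps to explore change over time. These come from the Memento protocol: a Timegate returns the capture of a URL closest to a requested date, while a Timemap lists all the captures an archive holds for that URL. For example, one notebook steps through the process of creating and compiling annual full-page screenshots into a time series. Because they rely on Memento, these notebooks could be easily adapted to work with any Memento-compliant system.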
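To give a feel for what working with a Timemap involves, here is a minimal sketch that parses a link-format Timemap and picks the first capture in each year, which is the sort of selection an annual screenshot series needs. It is not the notebook's own code: the function names parse_timemap and first_capture_per_year are hypothetical, and the sketch assumes the Timemap text has already been downloaded from the archive.

```python
import re
from email.utils import parsedate_to_datetime

# Each entry in a link-format Timemap looks like:
#   <url>; rel="memento"; datetime="Fri, 01 Jan 2010 00:00:00 GMT",
LINK_RE = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(link_text):
    """Return a sorted list of (datetime, url) pairs for every memento entry."""
    mementos = []
    for url, raw_params in LINK_RE.findall(link_text):
        params = dict(PARAM_RE.findall(raw_params))
        # rel can hold several values, e.g. "first memento"
        is_memento = "memento" in params.get("rel", "").split()
        if is_memento and "datetime" in params:
            captured = parsedate_to_datetime(params["datetime"])
            mementos.append((captured, url))
    return sorted(mementos)


def first_capture_per_year(mementos):
    """Map each year to the URL of its earliest capture."""
    by_year = {}
    for captured, url in mementos:
        # mementos are sorted, so the first one seen for a year is the earliest
        by_year.setdefault(captured.year, url)
    return by_year
```

Passing the result of parse_timemap to first_capture_per_year gives one archived URL per year, which could then be handed to whatever screenshot tool you prefer.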
Another walks forwards or backwards through a Timemap to find when a phrase first appears in (or disappears from) a page. You can also view the total number of occurrences over time as a chart.
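The idea of walking through captures can be sketched in a few lines. This is an illustration only, under some stated assumptions: captures are (timestamp, url) pairs with 14-digit timestamp strings, matching is a simple case-insensitive substring test, and fetch_text is a hypothetical function you supply that returns the text of an archived page. The notebook itself may well work differently.

```python
def first_capture_with_phrase(captures, phrase, fetch_text, newest_first=False):
    """Walk through captures in date order and return the first (timestamp, url)
    whose text contains the phrase, or None if no capture matches.

    With newest_first=True the walk runs backwards, so the result is the most
    recent capture that still contains the phrase.
    """
    needle = phrase.lower()
    # 14-digit timestamps sort chronologically as plain strings
    for timestamp, url in sorted(captures, reverse=newest_first):
        if needle in fetch_text(url).lower():
            return timestamp, url
    return None


def count_occurrences(captures, phrase, fetch_text):
    """Return (timestamp, count) pairs, oldest first, ready for charting."""
    needle = phrase.lower()
    return [
        (timestamp, fetch_text(url).lower().count(needle))
        for timestamp, url in sorted(captures)
    ]
```

Keeping fetch_text as a parameter means the same functions work with any archive: you could pass in a function that downloads and strips the HTML of a capture, or a simple dictionary lookup while testing.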
The notebooks document a number of possible workflows. One uses the Internet Archive’s CDX API to find all the PowerPoint files within a domain. It then downloads the files, converts them to PDFs, saves an image of the first slide, extracts the text, and compiles everything into an SQLite database. You end up with a searchable dataset that can be easily loaded into Datasette for exploration.
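The first and last steps of a workflow like this might look something like the sketch below: building a CDX query for presentation files in a domain, and saving the results to SQLite. The helper names, the table layout, and the MIME type pattern are all assumptions made for illustration, not taken from the notebook, and the download, conversion, and text extraction steps are left out.

```python
import sqlite3
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain, mime_pattern=r".*(powerpoint|presentationml).*"):
    """Build a CDX query URL for presentation files anywhere in a domain.

    The default mime_pattern is an assumption intended to match both older
    and newer PowerPoint MIME types; adjust it to suit your data.
    """
    params = [
        ("url", domain),
        ("matchType", "domain"),  # include all subdomains
        ("filter", f"mimetype:{mime_pattern}"),
        ("filter", "statuscode:200"),
        ("collapse", "digest"),  # skip adjacent captures with identical content
        ("fl", "timestamp,original,mimetype,digest"),
        ("output", "json"),
    ]
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def rows_to_dicts(cdx_json):
    """Convert CDX JSON output (a header row, then data rows) to dicts."""
    if not cdx_json:
        return []
    header, *rows = cdx_json
    return [dict(zip(header, row)) for row in rows]


def save_captures(conn, records):
    """Store capture records in a simple SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS captures "
        "(timestamp TEXT, original TEXT, mimetype TEXT, digest TEXT)"
    )
    conn.executemany(
        "INSERT INTO captures VALUES (:timestamp, :original, :mimetype, :digest)",
        records,
    )
    conn.commit()
```

You would fetch the URL returned by build_cdx_query with your preferred HTTP library, pass the parsed JSON to rows_to_dicts, and hand the records to save_captures. The resulting database file can then be opened in Datasette.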
While most of the notebooks work with small slices of web archive data, one harvests all the unique URLs from the gov.au domain and attempts to visualise the subdomains. The notebooks provide a range of approaches that can be extended or modified according to your research questions.
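As a rough sketch of how harvested URLs might be summarised before visualising them, the following counts unique URLs per hostname within a domain. The function name subdomain_counts is hypothetical, and the sketch assumes each URL includes a scheme such as http://; anything without a recognisable hostname is skipped.

```python
from collections import Counter
from urllib.parse import urlsplit


def subdomain_counts(urls, base_domain="gov.au"):
    """Count unique URLs per hostname within base_domain."""
    counts = Counter()
    # set() removes exact duplicate URLs before counting
    for url in set(urls):
        host = urlsplit(url).hostname  # lowercased, or None if it can't be parsed
        if host is None:
            continue
        if host == base_domain or host.endswith("." + base_domain):
            counts[host] += 1
    return counts
```

Calling subdomain_counts(urls).most_common(20) gives the twenty hostnames with the most unique URLs, which is a reasonable starting point for a chart.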
Thanks to everyone who contributed to the discussion on the IIPC Slack, in particular Alex Osborne, Ben O’Brien, and Andy Jackson, who helped with understanding how to use the NLA/NZNL/UKWA collections respectively.