Asking questions with web archives – introductory notebooks for historians

“Asking questions with web archives – introductory notebooks for historians” is one of three projects awarded a grant in the first round of the Discretionary Funding Programme (DFP) the aim of which is to support the collaborative activities of the IIPC members by providing funding to accelerate the preservation and accessibility of the web. The project was led by Dr Andy Jackson of the British Library. The project co-lead and developer was Dr Tim Sherratt, the creator of the GLAM Workbench, which provides researchers with examples, tools, and documentation to help them explore and use the online collections of libraries, archives, and museums. The notebooks were developed with the participation of the British Library (UK Web Archive), the National Library of Australia (Australian Web Archive), and the National Library of New Zealand (the New Zealand Web Archive).


By Tim Sherratt, Associate Professor of Digital Heritage, University of Canberra & the creator of the GLAM Workbench

We tend to think of a web archive as a site we go to when links are broken – a useful fallback, rather than a source of new research data. But web archives don’t just store old web pages, they capture multiple versions of web resources over time. Using web archives we can observe change – we can ask historical questions. But web archives store huge amounts of data, and access is often limited for legal reasons. Just knowing what data is available and how to get to it can be difficult.

Where do you start?

The GLAM Workbench’s new web archives section can help! Here you’ll find a collection of Jupyter notebooks that  document web archive data sources and standards, and walk through methods of harvesting, analysing, and visualising that data. It’s a mix of examples, explorations, apps and tools. The notebooks use existing APIs to get data in manageable chunks, but many of the examples demonstrated can also be scaled up to build substantial datasets for research – you just have to be patient!

What can you do?

Have you ever wanted to find when a particular fragment of text first appeared in a web page? Or compare full-page screenshots of archived sites? Perhaps you want to explore how the text content of a page has changed over time, or create a side-by-side comparison of web archive captures. There are notebooks to help you with all of these. To dig deeper you might want to assemble a dataset of text extracted from archived web pages, construct your own database of archived Powerpoint files, or explore patterns within a whole domain.

A number of the notebooks use Timegates and Timemaps to explore change over time. They could be easily adapted to work with any Memento compliant system. For example, one notebook steps through the process of creating and compiling annual full-page screenshots into a time series.

Using screenshots to visualise change in a page over time.

Another walks forwards or backwards through a Timemap to find when a phrase first appears in (or disappears from) a page. You can also view the total number of occurrences over time as a chart.

 

Find when a piece of text appears in an archived web page.

The notebooks document a number of possible workflows. One uses the Internet Archive’s CDX API to find all the Powerpoint files within a domain. It then downloads the files, converts them to PDFs, saves an image of the first slide, extracts the text, and compiles everything into an SQLite database. You end up with a searchable dataset that can be easily loaded into Datasette for exploration.

Find and explore Powerpoint presentations from a specific domain.

While most of the notebooks work with small slices of web archive data, one harvests all the unique urls from the gov.au domain and makes an attempt to visualise the subdomains. The notebooks provide a range of approaches that can be extended or modified according to your research questions.

Visualising subdomains in the gov.au domain as captured by the Internet Archive.

Acknowledgements

Thanks to everyone who contributed to the discussion on the IIPC Slack, in particular Alex Osborne, Ben O’Brien and Andy Jackson who helped out with understanding how to use NLA/NZNL/UKWA collections respectively.

Resources:

Reflecting on how we train new starters in web archiving

This blog post is a summary of a workshop that took place at the 2019 IIPC Web Archiving Conference in Zagreb, Croatia. The abstract and the final slides used during the workshop are available on the IIPC website.


By Helena Byrne, Web Curator and Carlos Rarugal, Assistant Web Archivist at the British Library

 

Most people when learning can relate to the Benjamin Franklin quote

tell me and I forget, teach me and I may remember, involve me and I learn.*

It can be very challenging to find the most effective way to involve a trainee in web archiving and transfer your specialist knowledge. Web archiving is a relatively new profession that is constantly changing and it is only in recent years that a body of work from practitioners and researchers has started to grow. In addition, each web archiving institution has its own collection policies and many use their own web archiving technology meaning that there is no one size fits all solution to providing training to people who work in this field.

However, before taking on new strategies it is important to understand our own beliefs on training and what actions we currently take when training new staff. Reflecting on these points can help us to become more aware of any biases we may have in terms of preferred training delivery style which could be contradictory to what the trainee really needs.
What we did

Before we started the workshop participants answered a series of questions about their own experience of training or receiving training on web archives via a Menti poll. We then reviewed the training practices of the curatorial web archive team at the British Library and in groups reviewed what methods participants felt worked well or not.

“Reflecting on how we train new starters in web archiving” at the Web Archiving Conference in Zagreb, 6 June 2019.
Photo: Tibor God.

Menti Poll Results

Menti Poll Results: Average Score for each question.

Overall, there were about 26 participants in the workshop who had varying degrees of experience training people on how to work with their web archive. As shown in Slide 3, only 31% of participants train people in web archiving on a regular basis while 50% of participants train people occasionally and the remaining 19% don’t train other people in web archiving. Some of the people in this final category work as solo web archivists and don’t have any resources for additional staff.

When asked if there was a structured training programme on web archiving at their organisation, 65% of participants responded “no” while only 35% of respondents had a programme in place. Not surprisingly, when asked ‘how were you trained in web archiving?’, hands-on training was the most popular method used to train participants at the workshop.

Results of this poll can be viewed here.

Training practices at the British Library

During this workshop we reviewed common training methods and reflected on the current practices of the curatorial team of the UK Web Archive based at the British Library as well as how we would like to change these practices in the future. (Slides 7-8)

Group Discussion

Participants in small groups discussed a series of questions about how they train people in their institutions:

Questions

1. Who do you train about web archiving?
2. How do you currently train them?
3. What web archiving training resources do you have available to your team?
4. What methods do you use for training? Computer based, documentation (handouts, user guides etc.), one to one learning, shadowing etc.

After discussing these questions participants then placed their current training methods onto a scale of what they felt works and doesn’t work.

Brainstorming

Overall there were 56 points filled in on the post-it notes by participants in 6 different groups. These can be loosely categorised into 10 categories:

Reading list, videos, hands on training, documentation, networking, case studies, examples/modelling, verbal training, forums and tutorials. A more detailed breakdown of these categories can be viewed here.

Most of the points noted (30/56) were in the ‘what works’ section, (10/56) were neutral while only (8/56) of the points were in the ‘what doesn’t work’ section. However, there was some overlap with the ‘what works’ and ‘what doesn’t work’ sections, with some methods like videos and reading lists appearing in both sections but in different groups.

Review

In the last workshop activity, participants voted, by using two coloured stickers, on what they considered most aspirational and most achievable training method.

As you can see from the votes below the most popular activity that could be achieved in the short term by the workshop participants was hands-on individual training with 9 votes. While there was a split between participants who felt that writing manuals was achievable with 7 votes and those they felt that this was aspirational with 6 votes.

How people voted

Conclusion

Overall participants were keen to see a training related event on the IIPC Web Archiving Conference programme. As the importance of web archiving grows, so too does the need for training in this field and it has become more evident that these responsibilities are falling on web archivists.

All the data collected during this workshop was shared with the IIPC Training Working Group and it is hoped that it will help inform the development of materials to support training within the field.

More information about the IIPC Training Working Group can be found here: http://netpreserve.org/about-us/working-groups/training-working-group/

References:

* Goodreads.com, ‘Benjamin Franklin > Quotes > Quotable Quote’, https://www.goodreads.com/quotes/21262-tell-me-and-i-forget-teach-me-and-i-may (accessed December 20, 2018).

Online Hours: Supporting Open Source

By Andrew Jackson, Web Archiving Technical Lead at the British Library

At the UK Web Archive, we believe in working in the open, and that organisations like ours can achieve more by working together and pooling our knowledge through shared practices and open source tools. However, we’ve come to realise that simply working in the open is not enough – it’s relatively easy to share the technical details, but less clear how to build real collaborations (particularly when not everyone is able to release their work as open source).

To help us work together (and maintain some momentum in the long gaps between conferences or workshops), we were keen to try something new, and hit upon the idea of Online Hours. It’s simply a regular web conference slot (organised and hosted by the IIPC, but open to all) which can act as a forum for anyone interested in collaborating on open source tools for web archiving. We’ve been running for a while now, and have settled on a rough agenda:

Full-text indexing:
– Mostly focussing on our Web Archive Discovery toolkit so far.

Heritrix3:
– including Heritrix3 release management, and the migration of Heritrix3 documentation to the GitHub wiki.

Playback:
– covering e.g. SolrWayback as well as OpenWayback and pywb.

AOB/SOS:
– for Any Other Business, and for anyone to ask for help if they need it.

This gives the meetings some structure, but is really just a starting point. If you look at the notes from the meetings, you’ll see we’ve talked about a wide range of technical topics, e.g.

  • OutbackCDX features and documentation, including its API;
  • web archive analysis, e.g. via the Archives Unleashed Toolkit;
  • summary of technologies so we can compare how we do things in our organisations, to find out which tools and approaches are shared and so might benefit from more collaboration;
  • coming up with ideas for possible new tools that meet a shared need in a modular, reusable way and identify potential collaborative projects.

The meeting is weekly, but we’ve attempted to make the meetings inclusive by alternating the specific time between 10am and 4pm (GMT). This doesn’t catch everyone who might like to attend, but at the moment I’m personally not able to run the call at a time that might tempt those of you on Pacific Standard Time. Of course, I’m more than happy to pass the baton if anyone else wants to run one or more calls at a more suitable time.

If you can’t make the calls, please consider:

My thanks go to everyone who as come along to the calls so far, and to IIPC for supporting us while still keeping it open to non-members.

Maybe see you online?

IIPC Technical Training Workshop – 14th – 16th January 2015

2015-Jan_IIPC Technical WorkshopThe idea of running a training workshop focusing on technical matters was formed during the 2014 IIPC General Assembly in Paris. It became apparent that there is so much transferrable experience among the members and that some institutions are more advanced than others in using the key software for web archiving. Having a forum to exchange ideas and discuss common issues would be extremely useful and welcomed.

Consortium of memory organisations

Kristinn Sigurðsson gave an accurate account of how the idea developed from a thought, to exciting sessions of discussion, and eventually a proposal supported by the IIPC Steering Committee in his blog. Staff development and training is one of the key areas of work for the IIPC. As a consortium of memory organisations sharing the mission of preserving the Internet for posterity, there is great advantage to collaborate, help each other and not to reinvent the wheel. The IIPC has an Education and Training Programme and allocates each year a certain amount of funding for the purpose of collective learning and development. The National Library of France for example organised a week-long workshop in 2012, to offer training for organisations planning to embark into web archiving.

AndyJackson

TokeEskilden

KristinnSigurdsonRogerCoram

Joint expertise

The British Library and the National and University Library of Iceland joint training workshop was the first one dedicated to technical issues, covering the three key applications for web archiving: Heritrix, OpenWayback and Solr. The speakers mainly came from both libraries’ capable technical teams, including Kristinn Sigurðsson, Andy Jackson, Roger Coram and Gil Hoggarth. Their expertise was strengthened by Toke Eskildsen of the State and University Library in Denmark, who has worked extensively on the Danish Web Archive’s large-scale Solr index. Toke also reported on his visit to the British Library in his blog, regarding his experience of “being embedded in tech talk with intelligent people for 5 days” as “exhausting and very fulfilling”. The British Library also took advantage of Toke’s presence and picked his brain on performance issues related to Solr, a perfect example of what other good things can come out of putting techies together.

For the future

Evaluation of the workshop indicates overall satisfaction from the attendees. More people seemed to favour the presentations on day one and desired more structure to the hands-on sessions on day two and three, with more real world examples to be solved together. The presence of strong technical expertise and the opportunity to talk to peers were appreciated the most. From the organiser’s perspective, there are a few things we could have done better: software could have been pre-installed to avoid network congestion and save time; and for the catering we will remember for future occasions that brilliant minds need adequate and varied fuels to be kept well-oiled and running up to speed.

Training is vital for any organisation that aims at progressing. It is not a cost but an investment which safeguards our continuous capability of doing our job. It is worth to consider establishing technical training as a fix element of the Education and Training Programme. The British Library’s Web Archiving crew are happy to contribute.

Helen Hockx-Yu, Head of Web Archiving, The British Library, 17th Feb 2015