Launching IIPC training programme

By Olga Holownia, IIPC Programme and Communications Officer

It is no longer uncommon for heritage institutions, particularly national and university libraries, to employ web archivists and web curators. It is, however, still rather unusual for librarians or archivists who hold these positions to have received any formal training in web archiving before joining web archiving teams. The majority of participants in a small poll organised during a workshop on training new starters in web archiving at last year’s Web Archiving Conference in Zagreb confirmed that they had received ‘on-the-job training’.

Varying approaches, similar needs

Discussions during the 2016 IIPC General Assembly in Ottawa led to the conclusion that while IIPC members have varying approaches to web archiving, which reflect their own institutional mandates, legal contexts and technical infrastructures, they all need technical and curatorial training, both for practitioners and for researchers. This inspired the creation of the IIPC Training Working Group (TWG), initiated by Tom Cramer, in which participation was open to the global web archiving community. The TWG, co-chaired by Tom, Abbie Grotke and Maria Praetzellis, was tasked with creating training materials.

The TWG’s first activities included a comprehensive overview of existing curricula as well as a survey to assess the current level of training needs. The results of both helped inform the decisions behind the content of the training modules that were introduced at last year’s conference. We are delighted to announce that the beginner’s training is now available on the IIPC website.

Who is the training for?

This training programme is aimed at practitioners (including new starters), curators, policy makers and managers, or anyone who would like to learn what web archives are, how they work, how to curate web archive collections, how to acquire basic skills in capturing web archive content, and how to plan and implement a web archiving programme.

The course contains eight sessions, each consisting of presentation slides and speakers’ notes. Each module starts with an introduction which outlines the learning objectives and target audience, explains how the slides can be customised, and provides a comprehensive list of related resources and tools. Published under a CC licence, the training materials can be fully customised and modified by users.

Video Case Studies

The TWG used the opportunity of the annual gathering of the web archiving community to complement the training material with video case studies. Alex Osborne, Jessica Cebra, Mark Phillips, Eléonore Alquier, Daniel Gomes, Mar Pérez Morillo, Ben Els and Yves Maurer, representing seven IIPC member institutions from around the world, speak about their experiences of becoming involved in web archiving. They also share their knowledge of organisational approaches, collaborations, collecting policies and access, as well as the evolution and challenges of web archiving.

Try it and tell us what you think!

This training is freely available to download and we encourage you to experiment with customising it for your trainees. We are also interested in how you use it, so please give us feedback by filling out this short form.

Acknowledgements

As with many other IIPC projects, the creation of the training material was a collaborative effort and we thank everyone who has been involved. The project was launched by Tom Cramer of Stanford University Libraries, Abbie Grotke of the Library of Congress and Maria Praetzellis of the California Digital Library (previously the Internet Archive). Maria Ryan of the National Library of Ireland and Claire Newing of the National Archives UK, who took over as co-chairs at the last General Assembly, led the project to completion together with Abbie. The 38 volunteers who form the TWG were involved at various stages of the project, starting with the surveys, followed by brainstorming sessions, many rounds of extensive feedback and the final phase of preparing the materials for publication. A special thank you to Samantha Abrams, Jefferson Bailey, Helena Byrne, Friedel Geeraert, Márton Németh, Anna Perricci, and the participants of the video case studies.

The beginner’s training materials were produced in partnership with the Digital Preservation Coalition (DPC) and we would particularly like to thank Sharon McMeekin, Head of Training and Skills, and Sara Day Thomson of the University of Edinburgh (previously DPC).

Resources

Reflecting on how we train new starters in web archiving

This blog post is a summary of a workshop that took place at the 2019 IIPC Web Archiving Conference in Zagreb, Croatia. The abstract and the final slides used during the workshop are available on the IIPC website.


By Helena Byrne, Web Curator and Carlos Rarugal, Assistant Web Archivist at the British Library

 

Most learners can relate to the Benjamin Franklin quote:

“Tell me and I forget, teach me and I may remember, involve me and I learn.”*

It can be very challenging to find the most effective way to involve a trainee in web archiving and transfer your specialist knowledge. Web archiving is a relatively new profession that is constantly changing, and it is only in recent years that a body of work from practitioners and researchers has started to grow. In addition, each web archiving institution has its own collection policies and many use their own web archiving technology, meaning that there is no one-size-fits-all solution to providing training to people who work in this field.

However, before taking on new strategies, it is important to understand our own beliefs about training and what actions we currently take when training new staff. Reflecting on these points can help us become more aware of any biases we may have towards a preferred training delivery style, which could run counter to what the trainee really needs.

What we did

Before the workshop started, participants answered a series of questions via a Menti poll about their own experience of delivering or receiving training on web archives. We then reviewed the training practices of the curatorial web archive team at the British Library and, in groups, reviewed which methods participants felt worked well and which did not.

“Reflecting on how we train new starters in web archiving” at the Web Archiving Conference in Zagreb, 6 June 2019.
Photo: Tibor God.

Menti Poll Results

Menti Poll Results: Average Score for each question.

Overall, there were about 26 participants in the workshop who had varying degrees of experience training people on how to work with their web archive. As shown in Slide 3, only 31% of participants train people in web archiving on a regular basis while 50% of participants train people occasionally and the remaining 19% don’t train other people in web archiving. Some of the people in this final category work as solo web archivists and don’t have any resources for additional staff.
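As a rough illustration of what these percentages mean in head counts, the short sketch below converts them back into approximate numbers of people. It assumes the group was exactly 26 (the text only says "about 26"), so the resulting counts are estimates rather than figures reported by the poll, and the category labels are shorthand introduced here.

```python
# Assumption: exactly 26 respondents; the post only says "about 26".
participants = 26

# Percentages as reported for Slide 3 (labels are shorthand, not from the poll).
shares = {
    "train regularly": 31,
    "train occasionally": 50,
    "do not train others": 19,
}

# Convert each percentage into an estimated head count.
estimated_counts = {
    label: round(participants * pct / 100) for label, pct in shares.items()
}

for label, count in estimated_counts.items():
    print(f"{label}: about {count} people")
```

Under that assumption, the three estimates add up to the full group of 26.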

When asked if there was a structured training programme on web archiving at their organisation, 65% of participants responded “no” while only 35% of respondents had a programme in place. Not surprisingly, when asked ‘how were you trained in web archiving?’, hands-on training was the most popular method used to train participants at the workshop.

Results of this poll can be viewed here.

Training practices at the British Library

During this workshop we reviewed common training methods and reflected on the current practices of the curatorial team of the UK Web Archive based at the British Library as well as how we would like to change these practices in the future. (Slides 7-8)

Group Discussion

Participants in small groups discussed a series of questions about how they train people in their institutions:

Questions

1. Who do you train about web archiving?
2. How do you currently train them?
3. What web archiving training resources do you have available to your team?
4. What methods do you use for training? Computer-based, documentation (handouts, user guides etc.), one-to-one learning, shadowing etc.

After discussing these questions, participants placed their current training methods on a scale of what they felt works and what doesn’t.

Brainstorming

Overall, participants in 6 different groups filled in 56 points on post-it notes. These can be loosely grouped into 10 categories:

Reading list, videos, hands-on training, documentation, networking, case studies, examples/modelling, verbal training, forums and tutorials. A more detailed breakdown of these categories can be viewed here.

Most of the points noted (30/56) were in the ‘what works’ section, 10/56 were neutral, and only 8/56 were in the ‘what doesn’t work’ section. However, there was some overlap between the ‘what works’ and ‘what doesn’t work’ sections, with some methods, like videos and reading lists, appearing in both sections but from different groups.
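For readers who prefer percentages, the tallies above can be restated as shares of the 56 points. This is only a sketch of the arithmetic using the numbers quoted in the text; the text does not say where any points outside these three sections were placed, so the remainder is reported as unaccounted for rather than assigned to a section.

```python
# Figures quoted in the text above.
total_points = 56
tallies = {
    "what works": 30,
    "neutral": 10,
    "what doesn't work": 8,
}

# Share of all points per section, as a percentage rounded to one decimal place.
section_shares = {
    section: round(100 * count / total_points, 1)
    for section, count in tallies.items()
}

# Points not covered by the three quoted tallies.
unaccounted = total_points - sum(tallies.values())

for section, pct in section_shares.items():
    print(f"{section}: {pct}%")
print(f"not covered by the quoted tallies: {unaccounted} points")
```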

Review

In the last workshop activity, participants used two coloured stickers to vote on the training methods they considered most aspirational and most achievable.

As you can see from the votes below, the most popular activity that workshop participants felt could be achieved in the short term was hands-on individual training, with 9 votes. There was a split, however, between participants who felt that writing manuals was achievable (7 votes) and those who felt that it was aspirational (6 votes).

How people voted

Conclusion

Overall, participants were keen to see a training-related event on the IIPC Web Archiving Conference programme. As the importance of web archiving grows, so too does the need for training in this field, and it has become more evident that these responsibilities are falling on web archivists.

All the data collected during this workshop was shared with the IIPC Training Working Group and it is hoped that it will help inform the development of materials to support training within the field.

More information about the IIPC Training Working Group can be found here: http://netpreserve.org/about-us/working-groups/training-working-group/

References:

* Goodreads.com, ‘Benjamin Franklin > Quotes > Quotable Quote’, https://www.goodreads.com/quotes/21262-tell-me-and-i-forget-teach-me-and-i-may (accessed December 20, 2018).

Archiving the Croatian web: has it been fourteen years already?

The National and University Library in Zagreb has been an IIPC member since 2008. The Croatian Web Archive (Hrvatski arhiv weba, HAW), established in 2004, is open access. Current projects include delivering metadata to Europeana, implementing the URN:NBN persistent identifier, migrating to OpenWayback, developing a new user interface and integrating with the Digital Library portal. The Web Archiving Team has also been involved in introducing librarians, archivists and researchers to web archiving and to using HAW resources.


By Ingeborg Rudomino, Croatian Web Archive, National and University Library in Zagreb and Karolina Holub, Croatian Digital Library Development Centre, Croatian Institute for Librarianship, National and University Library in Zagreb

About HAW

The National and University Library in Zagreb (NUL) in collaboration with the University Computing Centre in Zagreb (Srce) established the Croatian Web Archive (Hrvatski arhiv weba, HAW) in 2004 and started to acquire, catalogue and archive online publications according to the legal deposit provisions of the Library Act from 1997. Due to the well-known characteristics of web resources, the NUL started to archive selectively and established selection criteria.

Fig. 1. Croatian Web Archive Homepage.

We use several methods to identify a web resource for cataloguing and archiving: the HAW team searches and browses the web; website owners or content providers fill out the registration form; or we receive notifications from the ISSN Centre for Croatia.

After identification, every resource is catalogued in the library system and automatically transferred into our custom-built archiving system, where the archiving process starts. Our long-standing experience in cataloguing this type of resource has shown the process to be very challenging: describing such dynamic and variable content means that bibliographic records need daily intervention. Because of that, we created cataloguing guidelines with a variety of examples. Our goal has been to preserve the original websites (their look and feel) as much as possible. In order to achieve quality, each resource is approached individually during the archiving process. The DAMP software, developed by the University Computing Centre in Zagreb, was built especially for this purpose. The workflow of processing web resources is integrated within the organisational structure of the Library.

We are proud of the quantity and quality of web resources stored in the Croatian Web Archive, some of which are websites of institutions, associations, clubs, research projects, news media, portals, blogs, official websites of counties, cities, journals and books. Special attention is given to news media websites/portals, which are archived daily, weekly or monthly.

Access and the first full domain crawl

This selective approach ensures quality and provides full control over the management of web resources. So far, over 6,700 titles have been archived and almost all are publicly available. All content is full text searchable, and it’s possible to search by any word in the title, URL or keywords. Advanced search is available as well. Users can browse the HAW alphabetically and through subject categories, which are extracted from the UDC field in the catalogue.

Fig. 2. Screenshots of archived Croatian websites.

To secure permanent access to archived web resources, we have recently implemented the URN:NBN persistent identifier and have assigned it to archived titles and all archived instances (Fig. 3).

Fig. 3. Screenshot of archived instances with URN:NBN.
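To give a feel for how a URN:NBN identifier is put together, here is a minimal sketch that splits one into its parts, based on the general shape of the NBN namespace (a country code, optional sub-namespace codes separated by colons, then a hyphen and the locally assigned string). The identifier used below is hypothetical and is not a real HAW identifier, and the function and field names are introduced here for illustration only; the checks are deliberately loose.

```python
from typing import NamedTuple, Tuple


class NbnUrn(NamedTuple):
    country: str                    # two-letter country code, e.g. "hr"
    subnamespaces: Tuple[str, ...]  # optional codes between country and hyphen
    local_id: str                   # string assigned by the issuing library


def parse_urn_nbn(urn: str) -> NbnUrn:
    """Split a URN:NBN into its parts (simplified, illustrative check only)."""
    scheme, namespace, rest = urn.split(":", 2)
    if scheme.lower() != "urn" or namespace.lower() != "nbn":
        raise ValueError(f"not a URN:NBN: {urn!r}")
    # Everything before the first hyphen is the prefix; the rest is the local string.
    prefix, hyphen, local_id = rest.partition("-")
    if not hyphen or not local_id:
        raise ValueError(f"missing local identifier: {urn!r}")
    country, *subs = prefix.split(":")
    if len(country) != 2 or not country.isalpha():
        raise ValueError(f"unexpected country code: {country!r}")
    return NbnUrn(country.lower(), tuple(subs), local_id)


# Hypothetical identifier, used only to show the structure.
example = parse_urn_nbn("urn:nbn:hr:999-example123")
print(example)
```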

Since 2013, metadata from HAW has been delivered to Europeana through HAW’s OAI-PMH interface.
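For readers unfamiliar with OAI-PMH, the sketch below shows how a harvester typically forms a ListRecords request: a base URL plus a `verb` and a `metadataPrefix` parameter, optionally narrowed with `from` or `set`. The base URL is a placeholder and not HAW's actual endpoint, the helper name is made up for this example, and no request is sent.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    from_date: Optional[str] = None,
    set_spec: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL (nothing is fetched)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # selective harvesting by datestamp
    if set_spec:
        params["set"] = set_spec    # selective harvesting by set
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint: an assumption for illustration, not a real service.
request_url = list_records_url("https://example.org/oai", from_date="2013-01-01")
print(request_url)
```

`oai_dc` (unqualified Dublin Core) is the one metadata format every OAI-PMH repository is required to support, which is why it is used as the default here.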

To overcome the limitations of selective archiving, the first harvest of the whole .hr domain was conducted in 2011 with the Heritrix web crawler. Since then, we have been harvesting the .hr domain annually. The collected content is publicly available via HAW’s website through the OpenWayback access interface (Fig. 4). To date, we have conducted 7 .hr domain harvests.

Fig. 4. Screenshot of harvested website in OpenWayback.
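Wayback-style access interfaces conventionally address a capture by a 14-digit timestamp (YYYYMMDDhhmmss) followed by the original URL. The sketch below builds such a replay address; the base address and the example site are placeholders chosen for illustration, not HAW's actual URLs, and the function name is hypothetical.

```python
from datetime import datetime


def replay_url(base: str, captured_at: datetime, original_url: str) -> str:
    """Compose a Wayback-style replay URL: <base>/<timestamp>/<original URL>."""
    timestamp = captured_at.strftime("%Y%m%d%H%M%S")  # 14-digit capture time
    return f"{base.rstrip('/')}/{timestamp}/{original_url}"


# Placeholder base address and site, used only to show the URL pattern.
example_replay = replay_url(
    "https://example.org/wayback/",
    datetime(2011, 7, 15, 12, 0, 0),
    "http://www.example.hr/",
)
print(example_replay)
```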

Thematic crawls

In 2011, we also started to periodically harvest websites related to topics and events of national importance, again using Heritrix and OpenWayback. Nine thematic collections have been created, mainly related to themes such as presidential, parliamentary or local elections, accession to the EU and the flood in Croatia. Each collection is described by several metadata elements: title, size, number of seeds/URLs and description.

Training and outreach

Twice a year, we organize a workshop within the Centre of Continuing Education for Librarians. With the main goal of introducing web archiving to library professionals and students, the workshop focuses on learning how to recognize online materials that should be preserved according to existing criteria for cataloguing and archiving Croatian web resources. The participants are also introduced to the workflow of selective archiving, .hr harvests, the process of selecting materials for thematic collections and different ways of browsing the archived content.

With the experience we have gained over the years, we are happy to share our knowledge and expertise on web archiving and to support all those interested. To increase awareness of HAW and web archiving among librarians, archivists and the wider community, we make use of every opportunity to do so, such as presenting at national and international conferences and giving lectures to students and researchers.

A few thoughts for the future

The Croatian Web Archive currently has more than 40 TB of content. We are working on a web interface that will have new functionalities and features, including full-text search for the domain harvests and news sections for the web archiving community and researchers. We also plan to integrate HAW’s metadata into the Digital Library portal in order to have a single access point for all digital collections.

By combining all three approaches and using different software, the Library will attempt to cover, to the greatest extent possible, the contemporary part of Croatian cultural and scientific heritage.

Visit us: http://haw.nsk.hr/en

IIPC Training Survey

Recognizing the global need for practical training in web archiving, the IIPC chartered a new working group dedicated to training on October 31, 2017. While vital to preserving our common cultural heritage, web archiving remains a niche area, requiring specialized skills and knowledge to practice effectively.

The goal of this working group is to produce a quality curriculum, deliverable online or in person, to train the current and next generation of web archiving practitioners. By giving people the hands-on learning they need to preserve the Web, we will empower IIPC members and the field at large to capture more and better archives, and help elevate web archiving worldwide.

One of the first actions for our Training Working Group is to survey the web archiving world to assess the current level of training needs. How do web archivists currently get training, and how good is it? What gaps are there, and where should we prioritize our efforts?

We invite every web archiving stakeholder to reply to the survey; it will be available from now through the end of January 2018.

The results will help us identify learning modules and needed materials. And if you are interested in helping in this endeavor, including joining the Training Working Group, you can read more about us here: http://netpreserve.org/about-us/working-groups/training-working-group/

Tom Cramer
Stanford University

Chair, IIPC Training Working Group

IIPC Hackathon at the British Library: Laying a New Foundation

By Tom Cramer, Stanford University

This past week, 22-23 September 2016, members of the IIPC gathered at the British Library for a hackathon focused on web crawling technologies and techniques. The event brought together 14 technologists from 12 institutions near (the UK, Netherlands, France) and far (Denmark, Iceland, Estonia, the US and Australia). It provided a rare opportunity for an intensive, two-day, uninterrupted deep dive into how institutions are capturing web content, and to explore opportunities for advancing the state of the art.

I was struck by the breadth and depth of topics. In particular…

  • Heritrix nuts and bolts. Everything from small tricks and known issues for optimizing captures with Heritrix 3, to how people were innovating around its edges, to the history of the crawler, to a wishlist for improving it (including better documentation).
  • Brozzler and browser-based capture. Noah Levitt from the Internet Archive, and the engineer behind Brozzler, gave a mini-workshop on the latest developments, and how to get it up and running. This was one of the biggest points of interest as institutions look to enhance their ability to capture dynamic content and social media. About ⅓ of the workshop attendees went home with fresh installs on their laptops. (Also note, per Noah, pull requests welcome!)
  • Technical training. Web archiving is a relatively esoteric domain without a huge community; how have institutions trained new staff or fractionally assigned staff to engage effectively with web archiving systems? This appears to be a major, common need, and also one that is approachable. Watch this space for developments…
  • QA of web captures. As Andy Jackson of the British Library put it, how can we tip the scales from mostly manual QA with some automated processes to mostly automated QA with some manual training and intervention?
  • An up-to-date registry of web archiving tools. The IIPC currently maintains a list of web archiving tools, but it’s a bit dated (as these sites tend to become). Just to get the list in a place where tool users and developers can update it, a working copy of this list is now in the IIPC GitHub organization. Importantly, the group decided that it might be just as valuable to create a list of dead or deprecated tools, as these can often be dead ends for new adopters. See (and contribute to) https://github.com/iipc/iipc.github.io/wiki Updates welcome!
  • System & storage architectures for web archiving. How institutions are storing, preserving and computing on the bits. There was a great diversity of approaches here, and this is likely good fodder for a future event and more structured knowledge sharing.

The biggest outcome of the event may have been the energy and inherent value in having engineers and technical program managers spending lightly structured face time exchanging information and collaborating. The event was a significant step forward in building awareness of approaches and people doing web archiving.

IIPC Hackathon, Day 1.

This validates one of the main focal points for the IIPC’s portfolio on Tools Development, which is to foster more grassroots exchange among web archiving practitioners.

The participants committed to keeping the dialogue going, and to expanding the number of participants within and beyond IIPC. Slack is emerging as one of the main channels for technical communication; if you’d like to join in, let us know. We also expect to run multiple, smaller face-to-face events in the next year: 3 in Europe and another 2-3 in North America, with several delving into APIs, archiving time-based media, and access. (These are all in addition to the IIPC General Assembly and Web Archiving Conference on 27-30 March 2017 in Lisbon.) If you have an idea for a specific topic or would like to host an event, please let us know!

Many thanks to all the participants at last week’s hackathon, and to the British Library (especially Andy Jackson and Olga Holownia) for hosting it. It provided exactly the kind of forum needed by the web archiving community to share knowledge among practitioners and to advance the state of the art.