IIPC Content Development Group’s activities 2019-2020

By Nicola Bingham, Lead Curator Web Archives, British Library and Co-Chair of the IIPC Content Development Working Group

Introduction

I was delighted to present an update on the Content Development Group’s (CDG) activities at the 2020 IIPC General Assembly (GA) on behalf of myself, Alex and the curators that have worked so hard on collaborative collections over the past year.

Socks, not contributing to Web Archiving

Although it was disappointing not to have been in Montreal for the GA and Web Archiving Conference (WAC), there are many advantages in attending a conference remotely. Apart from cost and time savings, it meant that many more staff members from our organisations could attend. I liked the fact that I could see many “old” web archiving friends online and it did feel like the same friendly, enthusiastic, innovative environment that is normally fostered at IIPC events. I was also delighted to see some of the attendees’ pets on screen, although it did highlight that other people’s cats are generally much more affectionate than my own, who has, I have to say, contributed little to the field of web archiving over the years, although he did show a mild interest in Warcat.

Several things become clear when tasked with pre-recording a presentation with a time limit of two to three minutes. Firstly, it is extremely difficult to fit everything you need to say into such a short space of time; secondly, what you do want to say must be tightly scripted – although this does have the advantage that there is no room for the pauses or “errs” that can sometimes pepper my in-person presentations. Thirdly, recording even a two-minute video calls for a surprising number of retakes, taking many hours for no apparent reason. Fourthly, naively explaining these facts to the Programme and Communications Officer leads quite seamlessly to the suggestion of writing a blog post in order that one can be more expansive on the points bulleted in the two-minute presentation.

CDG Collection Update

Since our last General Assembly in Zagreb, in June 2019, the CDG has continued working on several established collections and two new ones:

  • The International Cooperation Organizations Collection was initiated in 2015 and is led by Alex Thurman of Columbia University Libraries. It previously consisted of all known active websites in the .int top-level domain (available only to organizations created by treaties), but was expanded to include a large group of similar organizations with .org domain hosts, and renamed Intergovernmental Organizations this year. This increased the collection from 163 to 403 intergovernmental organizations, all of which will continue to be crawled each year.
  • The National Olympic and Paralympic Committees collection, led by Helena Byrne of the British Library, was initiated in 2016 and consists of websites of national Olympic and Paralympic committees and associations, as identified from the official listings of these groups found on the official sites http://www.olympic.org and http://www.paralympic.org.
  • Online News Around the World is led by Sabine Schostag of the Royal Danish Library. This collection of seeds was first crawled in October 2018 to document a selection of online news from as many countries as possible. It was crawled again in November 2019. The collection was promoted at the Third RESAW Conference, “The web that was: archives, traces, reflections” in Amsterdam in June 2019 and at the IFLA News Media Conference at Universidad Nacional Autónoma de México, Mexico City in March 2020.
  • New in 2019, the CDG undertook a Climate Change Collection, led by Kees Teszelszky of the National Library of the Netherlands. The first crawl took place in June, with a final crawl shortly after the UN Climate summit in September 2019.
  • New in 2019, a collection on Artificial Intelligence was undertaken between May and December, led by Tiiu Daniel (National Library of Estonia), Liisi Esse (Stanford University Libraries) and Rashi Joshi (Library of Congress).

Coronavirus (Covid-19) Collection

The main collecting activity in 2020 has been around the global Covid-19 pandemic. This has involved a huge effort by IIPC members, with contributions from over 30 members as well as public nominations from over 100 individuals and institutions.

We have been very careful with scoping rules so that we are able to collect a diverse range of content within the data budget – and Archive-It generously increased the data limit for this collection to 5TB. Collecting will continue to run, budget permitting, while the event is of global significance.

Publicly available CDG collections can be viewed on the Archive-It website (https://archive-it.org/home/IIPC), and an overview of the collection statistics can be seen below.

CDG Collection statistics. Figures correct as of 15th June 2020. Slide presented at IIPC GA 17th June 2020.

Researcher-use of Collections

The CDG has worked closely with the Research Working Group co-chairs to promote and facilitate use of the CDG collections, which are now available through the Archives Unleashed Cloud thanks to the Archives Unleashed project. The collections have been analysed and a large number of derivatives are available to researchers at IIPC-led events and/or research projects. For more information about how to access these collections please refer to the guidelines.

Next Steps/Getting in touch

We would very much welcome new members to the CDG. We will be having an online meeting in the next couple of months which would be an excellent opportunity to find out more. In the meantime, any IIPC member is welcome to suggest and/or lead on possible 2021 collaborative collections. For more information please contact the co-chairs or the Programme and Communications Officer.

Nicola Bingham & Alex Thurman CDG co-chairs

The CDG Working Group at the 2019 IIPC General Assembly in Zagreb.

From pilot to portal: a year of web archiving in Hungary

National Széchényi Library started a web archiving pilot project in 2017. The aim of the pilot project was to identify the requirements of establishing the Hungarian Internet Archive. In the two years of the pilot phase, some hundred cultural and scientific websites were selected and published with the owners’ permission. The Hungarian Web Archive (MIA) was officially launched in 2017. The Library joined the IIPC in 2018 and the Hungarian Web Archive was first introduced at the General Assembly in Wellington in 2018. Last year, the achievements of the project were presented at the Web Archiving Conference (WAC) in Zagreb, in June 2019. This blog post offers a summary of some key developments since the 2019 conference.


By Márton Németh, Digital librarian at the National Széchényi Library, Hungary

In just about a year, we moved from a pilot project to officially launching our web archive, running a comprehensive crawl and creating special collections. In May 2020, the Hungarian parliament passed the modifications of the Cultural Law which allow the Library to run web archiving activities as a part of its basic service portfolio. Over the past year we have also organised training and participated in various collaborative initiatives.

Conferences and collaborations

In the summer just after the Zagreb conference, we were able to exchange experiences with our Czech and Slovak colleagues about the current status and major development points of web archiving projects in the Czech Republic, Slovakia and Hungary at the Visegrad 4 Library Conference in Bratislava. Our presentation is available from here. In the autumn, at the annual international conference of digital preservation in Bratislava, we were able to elaborate on our basic thoughts about the potential use of microdata in a library environment. The presentation can be downloaded from here.

At the Digital Humanities 2020 conference in Budapest, Hungary, we organized a whole web archiving session with presentations and panel discussions together with Marie Haskovcová from the Czech National Library, Kees Teszelszky from the National Library of the Netherlands, Balázs Indig from the Digital Humanities Research Centre of Loránd Eötvös University and Márton Németh from the National Széchényi Library. The main aim was to put a spotlight on Digital Humanities research activities in the web archiving context. Our presentation is available from here.

Training

Our annual workshop in the National Széchényi Library focused on the metadata enrichment of web archives, crawling and managing local web content in university library and city library environments, crawling and managing online newspaper articles and setting the limits of web archiving in research library environments.

We also ran several accredited training courses for Hungarian librarians and summarized our experiences in the field of web archiving education in an article published by Emerald. Membership in the IIPC Training Working Group has offered us valuable experience in this field.

Domain crawl and new portal

We ran our second comprehensive harvest of a large segment of the Hungarian web domain at the end of 2019. The crawler started from 246,819 seed addresses and crawled 110 million URLs in less than eight days, using 6.4 TB of storage.
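To put these figures in perspective, the short Python sketch below derives some back-of-the-envelope averages from them. It is only an illustration: the function name is hypothetical, eight days is used as an upper bound on the crawl duration (so the rate is a lower bound), and the storage figure is assumed to be in decimal terabytes and may reflect compressed data.

```python
# Figures quoted in the post. The crawl took "less than eight days",
# so eight days is used here as an upper bound on duration.
SEEDS = 246_819
URLS_CRAWLED = 110_000_000
STORAGE_TB = 6.4
MAX_DAYS = 8


def crawl_summary(urls, storage_tb, days, seeds):
    """Return rough averages for a crawl (hypothetical helper)."""
    seconds = days * 24 * 60 * 60
    storage_bytes = storage_tb * 10**12  # assumption: decimal terabytes
    return {
        "min_urls_per_second": urls / seconds,
        "avg_kb_per_url": storage_bytes / urls / 1000,
        "urls_per_seed": urls / seeds,
    }


summary = crawl_summary(URLS_CRAWLED, STORAGE_TB, MAX_DAYS, SEEDS)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions the crawl averaged at least about 159 URLs per second, stored roughly 58 KB per URL, and reached around 446 URLs per seed address.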

Our original project website was the first repository of resources related to web archiving in Hungarian. In 2019 we built a new portal. This new website serves as a knowledge base for the web archiving field in Hungary. Beyond the introduction to the web archive and to the project, separate groups of resources (information materials, documents etc.) are available for everyday users, for content owners, for professional experts and for journalists. It is available at https://webarchivum.oszk.hu.

webarchivum.oszk.hu

We created a new sub-collection in 2019-2020 on the Francis II Rákóczi Memorial Year at the National Széchényi Library (NSZL), within the framework of the Public Collection Digitization Strategy. Its primary goal was to present the technology of web archiving and the integration of the web archive with other digital collections through a demo application. The content focuses on the webpages and websites related to the Memorial Year, to the War of Independence, to the Prince and to his family. Furthermore, it contains born-digital or digitized books from the Hungarian Electronic Library, articles from the Electronic Periodical Archives, and photos, illustrations and other visual documents from the Digital Archive of Pictures. The service is available at the following address: http://rakoczi2019.webarchivum.oszk.hu.

rakoczi2019.webarchivum.oszk.hu

Legislation and new collections

In May 2020 the Hungarian parliament passed the modifications of the Cultural Law that entitle the National Széchényi Library to run web archiving activities as a part of its basic service portfolio. Legal deposit of web materials will also be established. The corresponding governmental and ministerial decrees will appear soon; all the law modifications and decrees will be in effect from 1 January 2021.

We made our first experiment in harvesting various materials from Instagram, covering 700 pages with more than 100,000 posts, using the Webrecorder software. We are also running event-based harvests on COVID-19, the Summer Olympic Games and the Paris Peace Conference (1919-1920), and we are joining the corresponding international IIPC collaborative collection development projects.

Next steps

Supported by the framework of the Public Collection Digitization Strategy, we were able to start developing a collaboration network with various regional libraries in Hungary in order to collect local materials for the Hungarian Web Archive. We hope to summarize our first experiences during our next annual workshop in the autumn and to develop our joint collection activities further.

Luxembourg Web Archive – Coronavirus Response

By Ben Els, Digital Curator, The National Library of Luxembourg

The National Library of Luxembourg has been harvesting the Luxembourg web under the digital legal deposit since 2016. In addition to the large-scale domain crawls, the Luxembourg Web Archive also operates targeted crawls, aimed at specific subjects or events. During the past weeks and months, the global Coronavirus pandemic has confronted society with unprecedented challenges. While large parts of our professional and social lives had to move even further online, the need to capture and document the implications of this crisis on the Internet has seen enormous support in all domains of society. While it is safe to admit that web archiving is still a relatively unknown concept to most people in Luxembourg (and probably in other countries too), it is also safe to say that we have never seen a better case to illustrate the necessity of web archiving and to ask for support in this overwhelming challenge.

webarchive.lu

Media and communities

At the National Library, we started our Coronavirus collection on March 16th, when there were 81 known cases in Luxembourg. Although we have been harvesting websites in several event crawls for the past 3 years, it was clear from the start that the amount of information to be captured would surpass any other subject by a great deal. Therefore, we decided to ask for support from the Luxembourg news media, by asking them to send us lists of related news articles from their websites. This appeal to editors quickly evolved into a call for participation to the general public, asking all communities, associations, and civil interest groups to share their responses and online information about the crisis. Addressing the news media in the first place gave us great support in spreading the word about the collection. Part of our approach to building an event collection is to follow the news and take in information about new developments and publications of different organisations and persons of interest. As the flow and high-paced rhythm of new public information and support was vital to many communities, we also had to try to keep up with new websites, support groups and solidarity platforms being launched every day. However, many of these initiatives are not covered equally in the news or social media, a situation which is further complicated by Luxembourg’s multilingual makeup. We learned about the challenges the government and administrations face in conveying important and urgent information in 4 or 5 languages at a time: Luxembourgish, French, German, English and Portuguese. The same goes for news and social media, and as a result, for the Luxembourg Web Archive. Therefore, we were grateful to receive contributions from organisations which we would not have thought of including ourselves, and which were not talked about as much in the news.

© The Luxembourg Government

Effort and resources

While the need and support for web archiving exploded during March and April, it was also clear that the standard resources allocated to the yearly operations of the web archive would not suffice in responding to the challenge in front of us. The National Library was able to increase its efforts by securing additional funding, which allowed us to launch an impromptu domain crawl and to expand the data budget on Archive-It crawls. We are all aware of the uphill battle in communicating the benefits of archiving the web. There is a feeling that, while people generally agree on the necessity of preserving websites, in most cases there is little sense of urgency or immediate requirement – since after all, most everyday changes are perceived as corrections of mistakes, or improvements on previous versions. In my opinion, the case of Coronavirus-related websites made the idea of web archiving as a service and obligation to society much clearer and easier to convey.

© Ministry of Health

Private and public

The Web offers many spaces and facets for personal expression and communication. While social media have played a crucial part in helping people to deal with the crisis, web archives face some of their biggest challenges in harvesting and preserving social media. Alongside the technical difficulties and enormous related costs, there is the question of ethics in collecting content which is not 100% private, but also not 100% public. For instance, in Luxembourg, many support groups launched on Facebook, where people could ask their questions about the current situation and new developments in terms of what is allowed, and find help and comfort amid their uncertainties. There are several active groups in every language, even some dedicated to districts of the city, with neighbours looking after each other. While it is important to try to capture all facets of an event (especially if this information is unique to the Internet), I am uncertain whether it is ethical to capture the questions, comments and conversations of people in vulnerable situations. Even though there are sometimes thousands of members per group and pretty much everyone can join, they are not fully open to the public.

Collecting and sharing

covidmemory.lu

Besides the large-scale crawls and Archive-It collection, we also contributed part of our seed list to the IIPC’s collaborative Novel Coronavirus collection, led by the Content Development Working Group. Of course, the National Library did not limit its response to archiving websites. With our call for participation, we also received a variety of physical and digital documents, mainly from municipalities and public administrations, which submitted numerous documents issued to the public in relation to the reorganisation of public services and the temporary restrictions on social life.

We also received some unexpected contributions, in the form of poems, essays and short diary entries written during confinement, describing and reflecting upon the current situation from a very personal angle. Likewise, a researcher shared his private bibliometric analysis of scientific literature about the Coronavirus. Furthermore, the University of Luxembourg’s Centre for Contemporary and Digital History has launched the sharing platform covidmemory.lu, enabling ordinary people living or working in Luxembourg to share their photos, videos, stories and interviews related to COVID-19.

Web Archiving Week 2021

Since the 2021 edition of the IIPC Web Archiving Conference will be part of the Web Archiving Week, in partnership with the University of Luxembourg and the RESAW network, I am not going to spoil too much about the program by saying that we will continue exploring these shared efforts and responses during the week of June 14th – 18th 2021. We are looking forward to welcoming you all to Luxembourg!

The Future of Playback

By Kristinn Sigurðsson, Head of IT at the National and University Library of Iceland and the Lead of the IIPC Tools Development Portfolio

It is difficult to overstate the importance of playback in web archiving. While it is possible to evaluate and make use of a web archive via data mining, text extraction and analysis, and so on, the ability to present the captured content in its original form, enabling human inspection of the pages, is essential. A good playback tool opens up a wide range of practical use cases for the general public, facilitates non-automated quality assurance efforts and (sometimes most importantly) creates a highly visible “face” for our efforts.

OpenWayback

Over the last decade or so, most IIPC members who operate their own web archives in-house have relied on OpenWayback, even before it acquired that name. Recognizing the need for a playback tool and the prevalence of OpenWayback, the IIPC has been supporting OpenWayback in a variety of ways over the last five or six years. Most recently, Lauren Ko (UNT), a co-lead of the IIPC’s Tools Development Portfolio, has shepherded work on OpenWayback and pushed out new releases (thanks Lauren!).

Unfortunately, it has been clear for some time that OpenWayback would require a ground-up rewrite if its development were to continue. The software, now almost a decade and a half old, is complicated and archaic. Adding features is nearly impossible and bug fixes often require exceptional effort. This has led to OpenWayback falling behind as web material evolves, with its replay fidelity fading.

As there was no prospect for the IIPC to fund a full rewrite, the Tools Development Portfolio, along with other interested IIPC members, began to consider alternatives. As far as we could see, there was only one viable contender on the market, Pywb.

Survey

Last fall the IIPC sent out a survey to our members to get some insights into the playback software currently being used, plans to transition to Pywb and the key roadblocks preventing IIPC members from adopting Pywb. The IIPC also organised online calls for members and got feedback from institutions that had already adopted Pywb.

Unsurprisingly, these consultations with the membership confirmed the – current – importance of OpenWayback. The results also showed a general interest in adopting Pywb whilst highlighting a number of likely hurdles our members face in that change. Consequently, we decided to endorse Pywb as a replay solution and to work to support IIPC members’ adoption of Pywb.

The members of the IIPC’s Tools Development Portfolio then analyzed the results of the survey and, in consultation with Ilya Kreymer, came up with a list of requirements that, once met, would make it much easier for IIPC members to adopt Pywb. These requirements were then divided into three work packages to be delivered over the next year.

Pywb

Over the last few years, Pywb has emerged as a capable alternative to OpenWayback. In some areas of playback it is better than, or at least comparable to, OpenWayback, having been updated to account for recent developments in web technology. Because it is more modern and still actively maintained, the gap between it and OpenWayback is only likely to grow. As it is also open source, it makes for a reasonable alternative for the IIPC to support as the new “go-to” replay tool.

However, while Pywb’s replay abilities are impressive, it is far from a drop-in replacement for OpenWayback. Notably, OpenWayback offers more customization and localization support than Pywb. There are also many differences between the two tools that make migration from one to the other difficult.

To address this, the IIPC has signed a contract with Ilya Kreymer, the maintainer of the web archive replay tool Pywb. The IIPC will be providing financial support for the development of key new features in Pywb.

Planned work

The first work package will focus on developing a detailed migration guide for existing OpenWayback users. This will include example configuration for common cases and cover diverse backend setups (e.g. CDX vs. ZipNum vs. OutbackCDX).

The second package will include Pywb improvements to make it more modular, extended support and documentation for localization, and extended access control options.

The third package will focus on customization and integration with existing services. It will also bring in some improvements to the Pywb “calendar page” and “banner”, bringing to them features now available in OpenWayback.

There is clearly more work that can be done on replay. The ever-fluid nature of the web means we will always be playing catch-up. As work progresses on the work packages mentioned above, we will be soliciting feedback from our community. Based on that, we will consider how best to meet those challenges.

Launching IIPC training programme

By Olga Holownia, IIPC Programme and Communications Officer

It is no longer uncommon for heritage institutions, particularly national and university libraries, to employ web archivists and web curators. It is, however, still rather unusual for librarians or archivists who hold these positions to have received any formal training in web archiving before they joined web archiving teams. The majority of participants in a small poll organised during a workshop on training new starters in web archiving at last year’s Web Archiving Conference in Zagreb confirmed that they received ‘on-the-job training’.

Varying approaches, similar needs

Discussions during the 2016 IIPC General Assembly in Ottawa led to the conclusion that while IIPC members have varying approaches to web archiving, which reflect their own institutional mandates, legal contexts and technical infrastructures, they all need technical and curatorial training – for practitioners and for researchers. This inspired the creation of the IIPC Training Working Group (TWG), initiated by Tom Cramer, where participation was open to the global web archiving community. The TWG, co-chaired by Tom, Abbie Grotke and Maria Praetzellis, was tasked with creating training materials.

The TWG’s first activities included a comprehensive overview of existing curricula as well as a survey to assess the current level of training needs. The results of both helped inform the decisions behind the content of the training modules that were introduced at last year’s conference. We are delighted to announce that the beginner’s training is now available on the IIPC website.

Who is the training for?

This training programme is aimed at practitioners (including new starters), curators, policy makers and managers, or anyone who would like to learn what web archives are, how they work, how to curate web archive collections, how to acquire basic skills in capturing web archive content, and how to plan and implement a web archiving programme.

This course contains eight sessions and comprises presentation slides and speakers’ notes. Each module starts with an introduction which outlines the learning objectives and target audience, and includes information about the way the slides can be customised as well as a comprehensive list of related resources and tools. Published under a CC licence, the training materials can be fully customised and modified by the users.

Video Case Studies

The TWG used the opportunity of the annual gathering of the web archiving community to complement the training material with video case studies. Alex Osborne, Jessica Cebra, Mark Phillips, Eléonore Alquier, Daniel Gomes, Mar Pérez Morillo, Ben Els and Yves Maurer, representing seven IIPC member institutions from around the world, speak about their experiences of becoming involved in web archiving. They also share their knowledge on organisational approaches, collaborations, collecting policies, access as well as the evolution and challenges of web archiving.

Try it and tell us what you think!

This training is freely available to download and we encourage you to experiment with customising it for your trainees. We are also interested in how you use it, so please give us feedback by filling out this short form.

Acknowledgements

As with many other IIPC projects, the creation of the training material was a collaborative effort and we thank everyone who has been involved. The project was launched by Tom Cramer of Stanford University Libraries, Abbie Grotke of the Library of Congress and Maria Praetzellis of the California Digital Library (previously the Internet Archive). Maria Ryan of the National Library of Ireland and Claire Newing of the National Archives UK, who took over as co-chairs at the last General Assembly, led the project to completion together with Abbie. A whole group of 38 volunteers who form the TWG were involved at various stages of the project, starting with the surveys, followed by brainstorming sessions, many rounds of extensive feedback and the final phase of preparing the materials for publication. A special thank you to Samantha Abrams, Jefferson Bailey, Helena Byrne, Friedel Geeraert, Márton Németh, Anna Perricci, and the participants of the video case studies.

The beginner’s training materials were produced in partnership with the Digital Preservation Coalition (DPC) and we would particularly like to thank Sharon McMeekin, Head of Training and Skills, and Sara Day Thomson of the University of Edinburgh (previously DPC).

Covid-19 Collecting at the National Library of New Zealand

By Gillian Lee, Coordinator, Web Archives at the Alexander Turnbull Library, National Library of New Zealand

The National Library of New Zealand reflects on their rapid response collecting of Covid-19 related websites since February 2020.

Collecting in response to the pandemic

Web archivists at the National Library of New Zealand are used to collecting websites relating to major events, but the Covid-19 pandemic has had such a global impact that it has affected every member of society. It has been heartbreaking to see the tragic loss of life and economic hardships that people are facing worldwide. The effects of this pandemic will be with us for a long time.

Collecting content relating to these events always produces mixed emotions as a web archivist. There’s the urgency of collecting content before it disappears, and in that regard we put on our hard hats and get on with it. At the same time, however, these events are raw and personal to each one of us and the websites we’ve collected reflect that.

IIPC Collaborative Collection

When the IIPC put out a call to contribute to the Novel Coronavirus Outbreak Collaborative Collection, we got involved. Initially, New Zealand sources were commenting on what was happening internationally, so the URLs we identified were mainly news stories. Once our first case of Coronavirus was reported in February, we started to see New Zealand websites created in response to Covid-19 here. We continued to contribute seed URLs to the IIPC collection, but our focus necessarily switched to the selective harvesting we undertake for the National Library’s collections.

Lockdown

The New Zealand government instituted a four-level alert system on March 21 and we quickly moved to a level 4 lockdown on March 24. The lockdown lasted a month, after which the country gradually moved down to level 1 on June 8.

The rapidly changing alert levels were reflected in the constantly changing webpages online. It seemed that most websites we regularly harvest had content relating to Covid-19. During our rapid response collecting phase, our selective web harvesting team focussed on identifying websites that had significant Covid-19 content or were created to cover Covid-19 events. Even then it was difficult to capture all the changes on a website as it responded to the different alert levels.

We were working from home during this time and connected to Web Curator Tool through our work computers. The harvesting was consistent, but our internet connections were not always stable, so we often got thrown out of the system! If we had technical issues with any particular website harvest, by the time we resolved them, the pages online had sometimes shifted to another alert level! We also used Webrecorder and Archive-It for some of our web harvests.

Due to the enormous amount of Covid-19 content being generated and because we are a very small team (along with the challenges of working from home), what we collected could really only be a very selective representation.

Unite against Covid-19 – Unite for the Recovery

Unite Against Covid-19 harvested 18 March 2020.

One prominent website captured during this time was the government website ‘Unite Against Covid-19’, which was the go-to place for anyone wanting to know what the current rules were. This website was updated constantly, sometimes several times a day.

When we entered alert level 1 the website changed to “Unite for the Recovery.” We expect to be collecting this site for some time. While we have completed our rapid response phase, we will be continuing to collect Covid-19 related material as part of our regular harvesting.

Unite for the Recovery harvested 9 June 2020.

Economic Impact
Apart from official government websites, we captured websites that reflected the economic impact on our society, such as event cancellations and business closures. We documented how some businesses responded to the pandemic, by changing production lines from clothing to making face masks and from alcohol production to making hand sanitiser. New products like respirators and PPE (personal protective equipment) gear were also being produced. Tourism is a major industry in New Zealand and with border lockdowns still in place, advertising is now targeting New Zealanders. There is talk about extending this to a “Trans-Tasman” bubble to include Australia and possibly some Pacific Islands in the near future.

Social impact
As in many countries, community responses during lockdown provided both unique and shared experiences. New Zealanders were able to walk locally (with social distancing), so people put bears and other soft toys in their windows for kids (and adults) to count as they walked by. The daily televised 1pm Covid-19 updates from Prime Minister Jacinda Ardern and Director General of Health Dr Ashley Bloomfield during lockdown were compulsive viewing and generated memorabilia such as T-shirts, bags and coasters. These were all reflected in the websites we collected. We also harvested personal blogs such as ‘lockdown diaries’.

Web archiving and beyond
During this rapid collecting phase, the web archivists focussed on collecting websites, and that’s reflected in this blog post. There was also a significant amount of content we wanted to collect from social media such as memes, digital posters and podcasts, New Zealand social commentary on Twitter and email from businesses and associations. This has required considerable effort from the Library’s Digital Collecting and Legal Deposit teams. You can find out more about this in an earlier National Library blog post by our Senior Digital Archivist Valerie Love. We are also working with our GLAM sector colleagues and donors to continue to build these collections.

Asking questions with web archives – introductory notebooks for historians

“Asking questions with web archives – introductory notebooks for historians” is one of three projects awarded a grant in the first round of the Discretionary Funding Programme (DFP), the aim of which is to support the collaborative activities of the IIPC members by providing funding to accelerate the preservation and accessibility of the web. The project was led by Dr Andy Jackson of the British Library. The project co-lead and developer was Dr Tim Sherratt, the creator of the GLAM Workbench, which provides researchers with examples, tools, and documentation to help them explore and use the online collections of libraries, archives, and museums. The notebooks were developed with the participation of the British Library (UK Web Archive), the National Library of Australia (Australian Web Archive), and the National Library of New Zealand (the New Zealand Web Archive).


By Tim Sherratt, Associate Professor of Digital Heritage, University of Canberra & the creator of the GLAM Workbench

We tend to think of a web archive as a site we go to when links are broken – a useful fallback, rather than a source of new research data. But web archives don’t just store old web pages, they capture multiple versions of web resources over time. Using web archives we can observe change – we can ask historical questions. But web archives store huge amounts of data, and access is often limited for legal reasons. Just knowing what data is available and how to get to it can be difficult.

Where do you start?

The GLAM Workbench’s new web archives section can help! Here you’ll find a collection of Jupyter notebooks that document web archive data sources and standards, and walk through methods of harvesting, analysing, and visualising that data. It’s a mix of examples, explorations, apps and tools. The notebooks use existing APIs to get data in manageable chunks, but many of the examples demonstrated can also be scaled up to build substantial datasets for research – you just have to be patient!

What can you do?

Have you ever wanted to find when a particular fragment of text first appeared in a web page? Or compare full-page screenshots of archived sites? Perhaps you want to explore how the text content of a page has changed over time, or create a side-by-side comparison of web archive captures. There are notebooks to help you with all of these. To dig deeper you might want to assemble a dataset of text extracted from archived web pages, construct your own database of archived Powerpoint files, or explore patterns within a whole domain.

A number of the notebooks use Timegates and Timemaps to explore change over time. They could easily be adapted to work with any Memento-compliant system. For example, one notebook steps through the process of creating annual full-page screenshots and compiling them into a time series.

Using screenshots to visualise change in a page over time.
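To make the Timemap idea concrete, here is a minimal sketch (not the notebook's own code) that parses a Timemap in the Memento link format and picks the first capture of each year, which is the kind of list an annual screenshot series would start from. The sample Timemap text, the function names and the assumption that each memento entry lists rel before datetime are all illustrative; a real Timemap would be fetched from a Memento-compliant archive.

```python
import re
from email.utils import parsedate_to_datetime

# Hypothetical sample of a link-format Timemap (normally fetched from an archive).
SAMPLE_TIMEMAP = """<http://example.com/>; rel="original",
<http://web.archive.org/web/timemap/link/http://example.com/>; rel="self"; type="application/link-format",
<http://web.archive.org/web/20010101000000/http://example.com/>; rel="first memento"; datetime="Mon, 01 Jan 2001 00:00:00 GMT",
<http://web.archive.org/web/20010704000000/http://example.com/>; rel="memento"; datetime="Wed, 04 Jul 2001 00:00:00 GMT",
<http://web.archive.org/web/20100615120000/http://example.com/>; rel="last memento"; datetime="Tue, 15 Jun 2010 12:00:00 GMT",
"""

# Assumption: each memento entry lists rel first, then datetime, as in the sample.
MEMENTO_RE = re.compile(
    r'<([^>]+)>;\s*rel="([^"]*\bmemento\b[^"]*)";\s*datetime="([^"]+)"'
)


def parse_timemap(link_text):
    """Return (datetime, url) pairs for every memento, oldest first."""
    mementos = [
        (parsedate_to_datetime(when), url)
        for url, _rel, when in MEMENTO_RE.findall(link_text)
    ]
    return sorted(mementos)


def first_capture_per_year(mementos):
    """Map each year to the URL of its earliest capture."""
    by_year = {}
    for when, url in mementos:
        by_year.setdefault(when.year, url)
    return by_year


if __name__ == "__main__":
    for year, url in first_capture_per_year(parse_timemap(SAMPLE_TIMEMAP)).items():
        print(year, url)
```

Each URL returned could then be passed to whatever screenshot tool you prefer; that step is left out here because it depends on a browser being available.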

Another walks forwards or backwards through a Timemap to find when a phrase first appears in (or disappears from) a page. You can also view the total number of occurrences over time as a chart.

 

Find when a piece of text appears in an archived web page.
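The walk through a Timemap described above can be sketched roughly as follows. This is an illustrative outline rather than the notebook's implementation: the captures are a small hypothetical in-memory dictionary keyed by 14-digit timestamps, and `get_text` stands in for whatever function would fetch and extract the text of an archived page.

```python
# Hypothetical captures: timestamp -> extracted page text.
ARCHIVED_TEXT = {
    "20200301000000": "Welcome to our website. Opening hours are unchanged.",
    "20200321000000": "We are following the new alert level system.",
    "20200324000000": "Alert level 4: we are closed. Check the alert level daily.",
}


def find_first_appearance(timestamps, phrase, get_text):
    """Walk captures oldest-first and return the first timestamp containing phrase."""
    needle = phrase.lower()
    for timestamp in sorted(timestamps):
        if needle in get_text(timestamp).lower():
            return timestamp
    return None  # the phrase never appears in these captures


def count_occurrences(timestamps, phrase, get_text):
    """Return {timestamp: number of occurrences}, ready to plot as a chart."""
    needle = phrase.lower()
    return {ts: get_text(ts).lower().count(needle) for ts in sorted(timestamps)}


if __name__ == "__main__":
    get_text = ARCHIVED_TEXT.__getitem__
    print(find_first_appearance(ARCHIVED_TEXT, "alert level", get_text))
    print(count_occurrences(ARCHIVED_TEXT, "alert level", get_text))
```

A linear walk is used because a phrase can appear, disappear and reappear, so a binary search over captures could miss the true first occurrence. Walking the sorted list in reverse would find the last appearance instead.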

The notebooks document a number of possible workflows. One uses the Internet Archive’s CDX API to find all the Powerpoint files within a domain. It then downloads the files, converts them to PDFs, saves an image of the first slide, extracts the text, and compiles everything into an SQLite database. You end up with a searchable dataset that can be easily loaded into Datasette for exploration.

Find and explore Powerpoint presentations from a specific domain.
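As a rough illustration of the first and last steps of that workflow, the sketch below builds a CDX query for Powerpoint captures within a domain and loads a CDX-style JSON response into SQLite. It is a simplified outline under stated assumptions: the domain and sample rows are invented, the table layout is hypothetical, only the legacy `.ppt` MIME type is filtered, and the download, PDF conversion and text extraction steps are omitted.

```python
import json
import sqlite3
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

# Hypothetical response in the CDX server's JSON shape: the first row names the fields.
SAMPLE_CDX_JSON = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["au,gov,example)/a.ppt", "20100101000000", "http://example.gov.au/a.ppt",
     "application/vnd.ms-powerpoint", "200", "AAAA", "1024"],
    ["au,gov,example)/b.ppt", "20110101000000", "http://example.gov.au/b.ppt",
     "application/vnd.ms-powerpoint", "200", "BBBB", "2048"],
])


def build_cdx_query(domain):
    """Build a query URL for unique Powerpoint captures anywhere in a domain."""
    params = {
        "url": domain,
        "matchType": "domain",  # include subdomains
        "filter": "mimetype:application/vnd.ms-powerpoint",
        "collapse": "digest",   # drop adjacent captures with identical content
        "output": "json",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def load_rows(cdx_json, db):
    """Insert capture rows into a (hypothetical) captures table; return the row count."""
    header, *rows = json.loads(cdx_json)
    wanted = [header.index(name) for name in ("timestamp", "original", "digest")]
    db.execute(
        "CREATE TABLE IF NOT EXISTS captures (timestamp TEXT, original TEXT, digest TEXT)"
    )
    db.executemany(
        "INSERT INTO captures VALUES (?, ?, ?)",
        [[row[i] for i in wanted] for row in rows],
    )
    db.commit()
    return len(rows)


if __name__ == "__main__":
    print(build_cdx_query("example.gov.au"))
    db = sqlite3.connect(":memory:")
    print(load_rows(SAMPLE_CDX_JSON, db), "captures loaded")
```

Writing to a file instead of `:memory:` would give a database that could then be opened in Datasette.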

While most of the notebooks work with small slices of web archive data, one harvests all the unique urls from the gov.au domain and makes an attempt to visualise the subdomains. The notebooks provide a range of approaches that can be extended or modified according to your research questions.

Visualising subdomains in the gov.au domain as captured by the Internet Archive.
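Before anything can be visualised, the harvested URLs have to be reduced to counts per subdomain. A minimal sketch of that step, assuming a plain list of URLs (the sample URLs and the function name are hypothetical), might look like this:

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical sample of harvested URLs.
SAMPLE_URLS = [
    "http://www.health.gov.au/page",
    "https://www.health.gov.au/other",
    "http://abs.gov.au/",
    "http://example.com/",  # outside the domain, so ignored
]


def count_subdomains(urls, parent="gov.au"):
    """Count URLs per host, keeping only hosts that fall under the parent domain."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname or ""
        if host == parent or host.endswith("." + parent):
            counts[host] += 1
    return counts


if __name__ == "__main__":
    for host, total in count_subdomains(SAMPLE_URLS).most_common():
        print(host, total)
```

The resulting counts could feed a treemap or circle-packing chart, with each host split on its dots to build the hierarchy.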

Acknowledgements

Thanks to everyone who contributed to the discussion on the IIPC Slack, in particular Alex Osborne, Ben O’Brien and Andy Jackson who helped out with understanding how to use NLA/NZNL/UKWA collections respectively.


The French coronavirus (COVID-19) web archive collection: focus on collaborative networks

BnF’s Covid-19 web archive collection has drawn considerable media attention in France, including coverage in Le Monde, 20 minutes and TV Channel France 3. The following blog post was first published in Web Corpora, BnF’s blog dedicated to web archives.


By Alexandre Faye, Digital Collection Manager, Bibliothèque nationale de France (BnF)
English translation by Alexandre Faye and Karine Delvert

The current global coronavirus pandemic (Covid-19) poses an unprecedented challenge for web archiving activities. The impact on society is such that the ongoing collection requires several levels of coordination and cooperation, both nationally and internationally.

Since it spread out of China and later developed in Europe, the coronavirus outbreak has become a pervasive theme on the web. This health crisis is being experienced in real time by populations that are both confined and largely connected, with a sense of emergency as well as underlying questions. Archived websites, blogs, and social media should make up a coherent, significant and representative collection. They will be primary sources for future research, and they are already the trace and memory of the event.

#jenesuispasunvirus

At the end of January 2020, while the megacity of Wuhan was under quarantine, the first hashtags #JeNeSuisPasUnVirus and #CORONAVIRUSENFRANCE appeared on Twitter. They denounced and exposed the stigma experienced by the Asian community in France. The Movement against racism and for friendship between peoples (Mouvement contre le racisme et pour l’amitié entre les peuples, MRAP) quickly published a page on its website entitled “a virus has no ethnic origin”. This was the first webpage related to coronavirus to be selected, crawled and preserved under French legal deposit.

Group dynamics

The coronavirus collection is not conceived as a project, in the sense that it is not programmed, has no precise calendar and is not limited to predetermined topics. It grows as part of both the National and local news media and the Ephemeral News Current Topics collections. The National and local news media collection brings together about a hundred national and local press websites, including editorial content such as headlines and related articles, as well as Twitter accounts, which are collected once a day. The News Current Topics collection, which requires both a technical and an organizational approach, relies on the coordination of an internal network of digital curators, each working in their own field. It facilitates dynamic and reactive identification of web content related to contemporary issues and important events. By documenting the evolution, spread and overall impact of the pandemic in France, the archiving policy embraces all facets of the public health crisis: medical, social, economic and political, and more broadly its scientific, cultural and moral aspects.

“A virus has no ethnic origin”. Movement Against Racism and for Friendship Between Peoples (MRAP) website. Archive of February 21, 2020.

Seventy selected seed URLs were crawled in January and February, while the spread of the virus out of China seemed to be limited and under control. Since March 17, the date of the French lockdown, 500 to 600 seed URLs per week have been selected and assigned a crawl frequency: several times a day for social networks, daily for national and local press sites, weekly for news sections dedicated to the coronavirus, and monthly for articles and for dedicated websites created ex nihilo. Thus the section of the economic review L’Usine nouvelle is crawled weekly, because it organizes a stream of articles. The less dynamic recommendation pages of the National Research and Security Institute (INRES) are assigned a monthly frequency.
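The frequency scheme described above amounts to a small lookup table from seed type to crawl interval. The sketch below illustrates the idea only; it is not BnF's actual crawler configuration. The category names are hypothetical, and "several times a day" is arbitrarily modelled as every eight hours.

```python
from datetime import datetime, timedelta

# Hypothetical category names; intervals follow the frequencies described in the text.
CRAWL_INTERVALS = {
    "social_network": timedelta(hours=8),  # "several times a day": 8 hours is an assumption
    "press_site": timedelta(days=1),       # daily
    "news_section": timedelta(weeks=1),    # weekly
    "dedicated_site": timedelta(days=30),  # monthly, approximated as 30 days
}


def next_crawl(last_crawl, seed_type):
    """Return when a seed of the given type is next due to be crawled."""
    return last_crawl + CRAWL_INTERVALS[seed_type]


if __name__ == "__main__":
    lockdown = datetime(2020, 3, 17)
    for seed_type in CRAWL_INTERVALS:
        print(seed_type, next_crawl(lockdown, seed_type))
```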

By mid-April 2020, more than 2,000 selections and settings had been created. This reactivity is all the more necessary because certain web pages selected in the first phase have already disappeared from the live web.

The regional dimension

The geographical approach is also at the core of the archiving dynamics. The web does not entirely do away with territorial dimensions, as research on this topic has shown. One may even think that they were reinforced when France was hit by the health crisis, since the crisis coincided with the campaign for the municipal elections.

The curators of partner institutions all over the French territory have spontaneously enriched the selections on the coronavirus health crisis by taking local and regional content into account. This network is a key element of the national cooperation framework. Initiated in 2004 by the BnF, it relies on 26 regional libraries and archive services, which share the mission of print and web legal deposit by participating in collaborative nominations. Its contribution proved to be significant, since over 50% of the websites nominated up to 15 April refer to local or regional content.

Simplified access to teleconsultation. ARS Guyana. Archived, April 5, 2020.

As a corollary, the crawl devoted to local elections was not suspended after the first round (which took place on March 15), although the second round (due to take place the following weekend) was postponed and the whole electoral process suspended because of the crisis. In particular, the Twitter and Facebook accounts of the mayors elected in the first round and of the candidates still in contention for the second round have continued to be collected. These archives, which record the statements of mayors and candidates on the web during the weeks before and after the first round, already appear to be a major source for both electoral history and the history of the coronavirus pandemic in France.

Historic abstention rate in the local elections in the Oise “cluster”. francetvinfo.fr. Capture of March 16, 2020.

International cooperation

At the international level, the BnF, and through it the other participating French libraries, contributes to the archiving project “Novel Coronavirus (2019-nCoV) outbreak”. This initiative, launched in February 2020, is supported by the IIPC Content Development Group (CDG) in association with the Internet Archive. It brings together about thirty libraries and institutions around the world collaborating on this web archive collection. By the end of May, more than 6,800 preserved websites representing 45 languages had been put online on Archive-it.org and indexed in full text.


The BnF has for many years pursued a policy of cooperation with the IIPC to promote the preservation and use of web archives on an international scale. One of the research challenges is to facilitate comparisons between national webs, in particular for global and transnational phenomena such as #MeToo and the current health crisis. A first contribution was sent to the IIPC at the end of February. It consisted of a selection of 80 seeds made during the first phase of the pandemic, just before Europe overtook China as the main active centre. Some of these pages have already disappeared from the live web.

Following the IIPC’s new recommendations and considering the evolution of the pandemic in France, the next contribution to the IIPC should be a tight selection (almost 5% of the French collection) linked to the high-priority subtopics, which include: information about the spread of infection; regional or local containment efforts; medical and scientific aspects; social aspects; economic aspects; and political aspects. A third of those websites report on the medical domain. A second third provides information about French territories that are remote from Europe: French Guiana and the West Indies, Réunion and Mayotte. The last part concerns citizens’ initiatives and debates during the lockdown.

For example, INED’s special website gives information on local excess mortality; articles from Madinin’art, Montray Kreyol and Free Pawol were selected by a local curator; and banlieues-sante.org is the website of an NGO that fights medical inequality and has created a YouTube channel explaining protection measures in 24 languages, including sign language.

Dr François Ehlinger on EHPAD. Nicole Bertin’s Blog. Website capture from the Charente-Maritime region. Capture on April 3, 2020

What’s next?

Some of the websites nominated by the BnF and its partners tend to constitute a collective memory of the event. Until mid-April, social networks accounted for 40% of the nominations, with a slight predominance of Twitter over Facebook. Although a large share is devoted to official accounts of institutions or associations (@AssembleeNat, @restosducoeur, @banlieuesante) or to accounts created ex nihilo (@CovidRennes, @CoronaVictimes, @InitiativeCovid), hashtags prevail in the set of selections.

The aim is to archive a representative part of individual and collective expression by capturing tweets around the most significant hashtags: multiple variations of the terms “coronavirus” and “confinement” (#coronavacances, #ConfinementJour29) and criticism of the way the crisis has been managed (#OuSontLesMasques, #OnOublieraPas). The dissemination of instructions and expressions of sympathy show a unique and characteristic mobilisation of citizens, which follows the pace of the news (#chloroquine, #Luxfer).

Daniel Bourrion, “The virus journals” on face-ecran.fr. Archived April 3, 2020.

Because archives relating to the coronavirus account for the effects of the health crisis and of the lockdown in various domains, they end up overlapping with the themes to which the BnF and its partners pay particular attention, or for which focused crawls have already been or will be conducted. Examples include digital literature and confinement diaries, relationships between the body and public health policies, epidemiology and artificial intelligence, and family life in confinement and feminism.

“Next” isn’t just a matter of finding a unique way of promoting this special archive collection, which remains a work in progress. It is neither a delimited project nor a closed one. It is documentation for many kinds of research projects and also heritage for all of us.

Guide for confined parents. The French Secretariat for Equality (Le Secrétariat d’Etat chargé de l’égalité entre les femmes et les hommes et de la lutte contre les discriminations). Capture of April 10.

COVID-19: Collecting so that we don’t forget

by Martine Renaud, Librarian, Bibliothèque et Archives nationales du Québec [1]

The COVID-19 pandemic has dominated the news for months because of its sheer scale and its impact on our economy and social life as well as our health. How will it be remembered in a few years? The Spanish flu epidemic of 1918-1919 is sometimes described as the forgotten pandemic[2]. This time, how can we make sure nothing is forgotten? Preserving the memory of this turbulent and exceptional time is crucially important for tomorrow’s researchers.

Capturing the Web

The Web and social media are playing a key role in the pandemic. They enable the instant spread of information (as well as fake news) and provide a space for exchange and communication in a context of social distancing. BAnQ has been collecting Québec websites on a selective basis since 2009. The result of this harvesting is largely available on the BAnQ portal. Sites for which BAnQ has not obtained permission are preserved but not made publicly available; they can be accessed for research purposes.

Collaborative Collection

In February 2020, the International Internet Preservation Consortium (IIPC) called on its members, including BAnQ, to create a collaborative collection of websites dealing with the emerging pandemic.

BAnQ’s contribution to this collection formed the basis of the Québec collection, which we decided to create once the scale of the crisis became apparent. BAnQ had already created several collections on special events, for example the 375th anniversary of the city of Montreal; the collection on the pandemic is part of this corpus of exceptional events.

The Québec collection includes Québec government websites, and sections of websites, dealing with the pandemic. It also includes the websites of public health authorities (Directions de la santé publique), Québec’s National Public Health Institute (INSPQ), as well as the CISSS and CIUSSS (Integrated Health and Social Services Centres). Web pages about the pandemic from a number of cities and towns are included, as well as universities, CEGEPs (senior high schools), and school boards. Websites of companies that are particularly affected by the pandemic, such as financial institutions and supermarket chains, are also included.

Articles dealing with COVID-19 from Québec-wide and regional papers are collected, as well as parts of the websites of professional orders and associations. Of course, sites that have emerged or been in the news since mid-March, such as Jebenevole.ca, are also harvested. At the time of writing, over 15,000 URL addresses have been collected, and new ones are added every week.

Capturing social media

As for social media, BAnQ collects the Twitter feeds and Facebook pages of personalities and public bodies involved in front-line management of the crisis, such as Premier François Legault, Québec’s health ministry (Santé Québec), and the City of Montréal’s police department (Service de police de la Ville de Montréal). All over the world, memory institutions are working to preserve traces of the pandemic. Thanks to these efforts, it is our hope that nothing will be forgotten.

References:

[1] This article will appear in the June 2020 issue of À rayons ouverts – Chroniques de Bibliothèque et Archives nationales du Québec, No. 106 (Spring/Summer 2020), p. 26.

[2] Alfred W. Crosby, America’s Forgotten Pandemic – The Influenza of 1918, 2nd edition, Cambridge, Cambridge University Press, 2003, https://books.google.ca/books?id=KYtAkAIHw24C&redir_esc=y&hl=en (accessed 4 May 2020).

Let’s time travel with the IIPC!

IIPC has been organising its annual meetings for over 15 years. The first full Steering Committee meeting and the meetings of working groups were held in Canberra in 2004. The most recent General Assembly (GA) and Web Archiving Conference (WAC) were held in Zagreb in June 2019. What started as a small get-together of web archiving enthusiasts from a dozen national libraries and the Internet Archive has gradually become an important fixture in the web archiving calendar. We have been very fortunate that our members have volunteered to host the events in Singapore, The Hague, Washington D.C., Ljubljana, Stanford, Reykjavík, London, Wellington, Zagreb and Ottawa. The GA also returned to Canberra in 2008.

 

Due to Covid-19, this year we will not meet in person but we can time travel! While preparing for the next annual event hosted by the National Library of Luxembourg (15-18 June 2021), we will be trawling through the history of the GA and the WAC. We will be collecting, publishing and archiving memories from past events in a variety of formats, ranging from tweets and blog posts to a GA and WAC digital repository and bibliography. All new and older posts will be available in the “GAWAC” archive.


We are starting from 2019, which was the first GA for Friedel Geeraert of KBR, The Royal Library of Belgium. This was also the first GA for the British Library web archivists Helena Byrne and Carlos Rarugal, the organisers of a workshop called “Reflecting on how we train new starters in web archiving”.

Abstracts from the 2019 presentations and slides are available on the conference website. You can also watch the keynote speeches and panel discussions on our YouTube Channel and browse through the photos on the IIPC Flickr. The 2019 GA and WAC were hosted by the National and University Library in Zagreb. The Croatian Web Archive (HAW), which last year celebrated its 15th anniversary, launched its new interface earlier this year. You can browse the archive and the thematic collections at https://haw.nsk.hr/en.

Photo: Tibor God.