Asking questions with web archives – introductory notebooks for historians

“Asking questions with web archives – introductory notebooks for historians” is one of three projects awarded a grant in the first round of the Discretionary Funding Programme (DFP), which aims to support the collaborative activities of IIPC members by providing funding to accelerate the preservation and accessibility of the web. The project was led by Dr Andy Jackson of the British Library. The project co-lead and developer was Dr Tim Sherratt, the creator of the GLAM Workbench, which provides researchers with examples, tools, and documentation to help them explore and use the online collections of libraries, archives, and museums. The notebooks were developed with the participation of the British Library (UK Web Archive), the National Library of Australia (Australian Web Archive), and the National Library of New Zealand (the New Zealand Web Archive).


By Tim Sherratt, Associate Professor of Digital Heritage, University of Canberra & the creator of the GLAM Workbench

We tend to think of a web archive as a site we go to when links are broken – a useful fallback, rather than a source of new research data. But web archives don’t just store old web pages, they capture multiple versions of web resources over time. Using web archives we can observe change – we can ask historical questions. But web archives store huge amounts of data, and access is often limited for legal reasons. Just knowing what data is available and how to get to it can be difficult.

Where do you start?

The GLAM Workbench’s new web archives section can help! Here you’ll find a collection of Jupyter notebooks that document web archive data sources and standards, and walk through methods of harvesting, analysing, and visualising that data. It’s a mix of examples, explorations, apps and tools. The notebooks use existing APIs to get data in manageable chunks, but many of the examples demonstrated can also be scaled up to build substantial datasets for research – you just have to be patient!

What can you do?

Have you ever wanted to find when a particular fragment of text first appeared in a web page? Or compare full-page screenshots of archived sites? Perhaps you want to explore how the text content of a page has changed over time, or create a side-by-side comparison of web archive captures. There are notebooks to help you with all of these. To dig deeper you might want to assemble a dataset of text extracted from archived web pages, construct your own database of archived Powerpoint files, or explore patterns within a whole domain.

A number of the notebooks use Timegates and Timemaps to explore change over time. They could be easily adapted to work with any Memento compliant system. For example, one notebook steps through the process of creating and compiling annual full-page screenshots into a time series.

Using screenshots to visualise change in a page over time.
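To make the Timemap idea concrete, here is a minimal sketch (not the notebook's own code) that parses a Timemap in the link format used by Memento and keeps the earliest capture from each year, which is the kind of selection an annual screenshot series needs. The sample Timemap, the `archive.example` host, and the function names are all made up for illustration. The regular expression also assumes that `rel` comes before `datetime` in each entry, which a real archive may not guarantee.

```python
import re
from email.utils import parsedate_to_datetime

# Matches link-format entries whose rel value includes "memento", e.g.
# <http://.../web/2013.../http://example.com/>; rel="memento"; datetime="..."
# Assumption: rel appears before datetime within each entry.
MEMENTO_ENTRY = re.compile(
    r'<([^>]+)>;\s*rel="[^"]*\bmemento\b[^"]*";\s*datetime="([^"]+)"'
)


def parse_timemap(text):
    """Return (datetime, url) pairs for every memento in a Timemap, oldest first."""
    mementos = [
        (parsedate_to_datetime(stamp), url)
        for url, stamp in MEMENTO_ENTRY.findall(text)
    ]
    return sorted(mementos)


def first_capture_per_year(mementos):
    """Keep the earliest capture from each calendar year."""
    by_year = {}
    for captured, url in mementos:
        # The input is sorted, so the first URL seen for a year is the earliest.
        by_year.setdefault(captured.year, url)
    return by_year


# Hypothetical Timemap excerpt, invented for this example.
SAMPLE_TIMEMAP = """<http://example.com/>; rel="original",
<http://archive.example/timegate/http://example.com/>; rel="timegate",
<http://archive.example/web/20020120142510/http://example.com/>; rel="first memento"; datetime="Sun, 20 Jan 2002 14:25:10 GMT",
<http://archive.example/web/20130101000000/http://example.com/>; rel="memento"; datetime="Tue, 01 Jan 2013 00:00:00 GMT",
<http://archive.example/web/20130615120000/http://example.com/>; rel="last memento"; datetime="Sat, 15 Jun 2013 12:00:00 GMT"
"""

annual = first_capture_per_year(parse_timemap(SAMPLE_TIMEMAP))
for year, url in annual.items():
    print(year, url)
```

Because the parsing relies only on the link format, the same functions should work on a Timemap from any Memento-compliant archive, provided its entries follow the attribute order assumed above.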

Another walks forwards or backwards through a Timemap to find when a phrase first appears in (or disappears from) a page. You can also view the total number of occurrences over time as a chart.

 

Find when a piece of text appears in an archived web page.
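The walk through a Timemap can be sketched as a simple loop over captures in date order. This is an illustrative outline under stated assumptions, not the notebook's implementation: `fetch_text` is a hypothetical callable that returns the text of a capture (in practice it would download and clean the archived page), and the capture URLs and page texts below are invented.

```python
def find_first_appearance(captures, phrase, fetch_text):
    """Walk captures oldest-first and return the first (timestamp, url)
    whose text contains the phrase, or None if it never appears."""
    needle = phrase.lower()
    for timestamp, url in captures:
        if needle in fetch_text(url).lower():
            return timestamp, url
    return None


def count_occurrences(captures, phrase, fetch_text):
    """Return (timestamp, count) pairs, ready to be plotted as a chart."""
    needle = phrase.lower()
    return [
        (timestamp, fetch_text(url).lower().count(needle))
        for timestamp, url in captures
    ]


# Invented captures standing in for archived versions of one page.
FAKE_PAGES = {
    "capture-2001": "Welcome to our library home page.",
    "capture-2005": "Welcome! Read about our new web archive.",
    "capture-2010": "The web archive grows. Search the web archive today.",
}
CAPTURES = [
    ("2001", "capture-2001"),
    ("2005", "capture-2005"),
    ("2010", "capture-2010"),
]

print(find_first_appearance(CAPTURES, "web archive", FAKE_PAGES.get))
print(count_occurrences(CAPTURES, "web archive", FAKE_PAGES.get))
```

Passing the captures in reverse order would find the last capture containing the phrase instead, which is one way to approximate when it disappeared.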

The notebooks document a number of possible workflows. One uses the Internet Archive’s CDX API to find all the Powerpoint files within a domain. It then downloads the files, converts them to PDFs, saves an image of the first slide, extracts the text, and compiles everything into an SQLite database. You end up with a searchable dataset that can be easily loaded into Datasette for exploration.

Find and explore Powerpoint presentations from a specific domain.
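The first and last steps of this workflow can be sketched as follows. This is a simplified outline rather than the notebook's code: it builds a query for the Internet Archive's CDX API and loads JSON results into SQLite, leaving out the download, PDF conversion, and text extraction. The table layout, function names, and sample rows are assumptions made for the example, and the MIME type shown covers only the older `.ppt` format.

```python
import sqlite3
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain, mimetype="application/vnd.ms-powerpoint"):
    """Build a CDX API URL asking for captures of one MIME type within a domain."""
    params = {
        "url": domain,
        "matchType": "domain",            # include subdomains
        "filter": f"mimetype:{mimetype}",  # keep only matching captures
        "collapse": "digest",             # skip adjacent captures with identical content
        "output": "json",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def load_captures(rows, db_path=":memory:"):
    """Store CDX JSON rows (first row is the header) in an SQLite table."""
    header, records = rows[0], rows[1:]
    wanted = ["timestamp", "original", "mimetype", "digest"]
    positions = [header.index(name) for name in wanted]
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS captures "
        "(timestamp TEXT, original TEXT, mimetype TEXT, digest TEXT)"
    )
    conn.executemany(
        "INSERT INTO captures VALUES (?, ?, ?, ?)",
        [tuple(record[i] for i in positions) for record in records],
    )
    conn.commit()
    return conn


# Invented rows in the shape of the CDX API's JSON output.
SAMPLE_ROWS = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["au,gov,example)/talk.ppt", "20100301000000", "http://example.gov.au/talk.ppt",
     "application/vnd.ms-powerpoint", "200", "AAAA", "1024"],
    ["au,gov,example)/plan.ppt", "20120515000000", "http://example.gov.au/plan.ppt",
     "application/vnd.ms-powerpoint", "200", "BBBB", "2048"],
]

print(build_cdx_query("example.gov.au"))
conn = load_captures(SAMPLE_ROWS)
print(conn.execute("SELECT COUNT(*) FROM captures").fetchone()[0])
```

Passing a file path instead of `:memory:` should produce a database file that a tool such as Datasette can open directly.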

While most of the notebooks work with small slices of web archive data, one harvests all the unique urls from the gov.au domain and makes an attempt to visualise the subdomains. The notebooks provide a range of approaches that can be extended or modified according to your research questions.

Visualising subdomains in the gov.au domain as captured by the Internet Archive.
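As a rough illustration of the first step behind such a visualisation, the sketch below counts how many harvested URLs fall under each subdomain of a base domain. It is an assumption-laden outline, not the notebook's method: the function name and sample URLs are invented, a leading `www.` is stripped so that both forms of a host count together, and URLs without a scheme are ignored because `urlsplit` cannot extract a hostname from them.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_subdomains(urls, base_domain="gov.au"):
    """Count URLs per host, keeping only hosts within the base domain."""
    counts = Counter()
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host == base_domain or host.endswith("." + base_domain):
            counts[host] += 1
    return counts


# Invented URLs for illustration.
SAMPLE_URLS = [
    "http://www.health.gov.au/a",
    "https://health.gov.au/b",
    "http://data.nsw.gov.au/",
    "http://example.com/",
]

subdomain_counts = count_subdomains(SAMPLE_URLS)
for host, total in subdomain_counts.most_common():
    print(host, total)
```

Splitting each host on dots and reversing the parts (`au`, `gov`, `nsw`, `data`) would give the hierarchy that a tree or circle-packing chart needs.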

Acknowledgements

Thanks to everyone who contributed to the discussion on the IIPC Slack, in particular Alex Osborne, Ben O’Brien and Andy Jackson who helped out with understanding how to use NLA/NZNL/UKWA collections respectively.


The French coronavirus (COVID-19) web archive collection: focus on collaborative networks

BnF’s Covid-19 web archive collection has drawn considerable media attention in France, including coverage in Le Monde, 20 minutes and TV Channel France 3. The following blog post was first published in Web Corpora, BnF’s blog dedicated to web archives.


By Alexandre Faye, Digital Collection Manager, Bibliothèque nationale de France (BnF)
English translation by Alexandre Faye and Karine Delvert

The current global coronavirus pandemic (Covid-19) poses an unprecedented challenge for web archiving activities. The impact on society is such that the ongoing collection requires several levels of coordination and cooperation at a national and international level.

Since spreading beyond China and later taking hold in Europe, the coronavirus outbreak has become a pervasive theme on the web. This health crisis is being experienced in real time by populations that are simultaneously confined and largely connected, with a sense of emergency as well as underlying questioning. Archived websites, blogs, and social media should make up a coherent, significant and representative collection. They will be primary sources for future research, and they are already the trace and memory of the event.

#jenesuispasunvirus

At the end of January 2020, while the Wuhan megapolis is quarantined, the first hashtags #JeNeSuisPasUnVirus and #CORONAVIRUSENFRANCE appear on Twitter. They denounce and show the stigma experienced by the Asian community in France. The Movement against racism and for friendship between peoples (Mouvement contre le racisme et pour l’amitié entre les peuples, MRAP) quickly published a page on its website entitled “a virus has no ethnic origin”. This is the first webpage related to coronavirus to have been selected, crawled and preserved under French legal deposit.

Group dynamics

The coronavirus collection is not conceived as a project, in the sense that it would be programmed, would have a precise calendar and would be limited to predetermined topics. It grows as part of both the National and local news media collection and the Ephemeral News Current Topics collection. The National and local news media collection brings together around a hundred national and local press websites, including editorial content such as headlines and related articles, as well as Twitter accounts, which are collected once a day. The News Current Topics collection, which requires both a technical and an organizational approach, relies on the coordination of an internal network of digital curators, each working in their own field. It facilitates dynamic and reactive identification of web content related to contemporary issues and important events. By documenting the evolution, spread and overall impact of the pandemic in France, the archiving policy embraces all facets of the public health crisis: medical, social, economic, political and, more broadly, scientific, cultural and moral aspects.

“A virus has no ethnic origin”. Movement Against Racism and for Friendship Between Peoples (MRAP) website. Archive of February 21, 2020.

70 selected seed URLs were crawled in January and February, while the spread of the virus out of China seemed to be limited and under control. Since March 17, the date of the French lockdown, 500 to 600 seed URLs have been selected each week and assigned a crawl frequency: several times a day for social networks, daily for national and local press sites, weekly for news sections dedicated to the coronavirus, and monthly for articles and dedicated websites created ex nihilo. Thus the coronavirus section of the economic review L’Usine nouvelle is crawled weekly, because it carries a stream of articles. The less dynamic recommendation pages of the National Research and Security Institute (INRES) are assigned a monthly frequency.

By mid-April 2020, more than 2,000 selections and settings had been created. This reactivity is all the more necessary because certain web pages selected in the first phase have already disappeared from the live web.

The regional dimension

The geographical approach is also at the core of the archiving dynamics. The web does not entirely do away with territorial dimensions, as research on this topic has shown. One may even think that they have been reinforced as France is hit by the health crisis, since the crisis coincides with the campaign for the municipal elections.

The curators of partner institutions all over the French territory have spontaneously enriched the selections on the coronavirus health crisis. They contributed by taking local and regional content into account. This network is a key element of the national cooperation framework. Initiated in 2004 by the BnF, it relies on a network of 26 regional libraries and archives services, which share the mission of print and web legal deposit by participating in collaborative nominations. Its contribution has proved significant, since over 50% of the websites nominated up to 15 April refer to local or regional content.

Simplified access to teleconsultation. ARS Guyana. Archived, April 5, 2020.

As a corollary, the crawl devoted to local elections was not suspended after the 1st poll (which took place on March 15th), although the second poll (due to take place the following weekend) was postponed and the whole electoral process suspended due to the crisis. In particular, the Twitter and Facebook accounts of the mayors elected in the 1st poll and those of the candidates still in contention for the 2nd poll have continued to be collected. These archives, as statements of mayors and candidates on the web during the weeks that preceded and followed the 1st poll of the local elections, already appear to be a major source for both electoral history and the history of the coronavirus pandemic in France.

Historic abstention rate in the local elections in the Oise “cluster”. francetvinfo.fr. Capture of March 16, 2020.

International cooperation

At the international level, the BnF, and through it the other participating French libraries, contributes to the archiving project “Novel Coronavirus (2019-nCoV) outbreak”. This initiative, launched in February 2020, is supported by the IIPC Content Development Group (CDG) in association with the Internet Archive. It brings together about thirty libraries and institutions around the world collaborating on this web archive collection. At the end of May, more than 6,800 preserved websites representing 45 languages had been put online on Archive-it.org and indexed in full text.


The BnF has for many years been pursuing a policy of cooperation with the IIPC to promote the preservation and use of web archives on an international scale. One of the research challenges is to facilitate comparisons between the different national webs, in particular for global and transnational phenomena such as #MeToo and the current health crisis. A first contribution was sent to the IIPC at the end of February. It consisted of a selection of 80 seeds made during the first phase of the pandemic, just before Europe overtook China as the main active center. Some of these pages have already disappeared from the live web.

In line with the IIPC’s new recommendations and considering the evolution of the pandemic in France, the next contribution to the IIPC should be a tight selection (almost 5% of the French collection) linked to high priority subtopics, including: information about the spread of infection; regional or local containment efforts; medical and scientific aspects; social aspects; economic aspects; and political aspects. A third of those websites report on the medical domain. A second third provides information about French territories that are remote from Europe: French Guiana and the West Indies, Reunion and Mayotte. The last part concerns citizens’ initiatives and debates during the lockdown.

For example, the special website hosted by INED gives information on local excess mortality; articles from Madinin’art, Montray Kreyol and Free Pawol were selected by a local curator; and banlieues-sante.org is the website of an NGO that works against medical inequality and has created a YouTube channel explaining protection measures in 24 languages, including sign language.

Dr François Ehlinger on EHPAD. Nicole Bertin’s Blog. Website capture from the Charente-Maritime region. Capture on April 3, 2020

What’s next?

Some of the websites nominated by the BnF and its partners tend to constitute a collective memory of the event. Until mid-April, the share of social networks represented 40% of the nominations, with a slight predominance of Twitter over Facebook. Although a large share is devoted to official accounts – namely, of institutions or associations (@AssembleeNat, @restosducoeur, @banlieuesante) or to accounts created ex nihilo (@CovidRennes, @CoronaVictimes, @InitiativeCovid), hashtags prevail in the set of selections.

The aim is to archive a representative part of individual and collective expressions by capturing tweets around the most significant hashtags: multiple variations of the terms “coronavirus” and “confinement” (#coronavacances, #ConfinementJour29), and criticism of the way the crisis has been managed (#OuSontLesMasques, #OnOublieraPas). The dissemination of instructions and expressions of sympathy show a unique and characteristic mobilisation of citizens, following the pace of the news (#chloroquine, #Luxfer).

Daniel Bourrion, “The virus journals” on face-ecran.fr. Archived April 3, 2020.

Because they document the effects of the health crisis and of the lockdown in various domains, the archives relating to the coronavirus end up overlapping with themes to which the BnF and its partners pay particular attention, or for which focused crawls have already been conducted or are planned. Examples include digital literature and confinement diaries, relationships between the body and public health policies, epidemiology and artificial intelligence, and family life in confinement and feminism.

“Next” isn’t just a matter of finding a unique way to promote this special archive collection, which remains a work-in-progress. It is neither a delimited project nor one that is already closed. It is documentation for many kinds of research projects and also heritage for all of us.

Guide for confined parents. The French Secretariat for Equality (Le Secrétariat d’Etat chargé de l’égalité entre les femmes et les hommes et de la lutte contre les discriminations). Capture of April 10.

COVID-19: Collecting so that we don’t forget

by Martine Renaud, Librarian, Bibliothèque et Archives nationales du Québec [1]

The COVID-19 pandemic has dominated the news for months because of its sheer scale and its impact on our economy and social life as well as our health. How will it be remembered in a few years? The Spanish flu epidemic of 1918-1919 is sometimes described as the forgotten pandemic[2]. This time, how can we make sure nothing is forgotten? Preserving the memory of this turbulent and exceptional time is crucially important for tomorrow’s researchers.

Capturing the Web

The Web and social media are playing a key role in the pandemic. They enable the instant spread of information (as well as fake news) and provide a space for exchange and communication in a context of social distancing. BAnQ has been collecting Québec websites on a selective basis since 2009. The result of this harvesting is largely available on the BAnQ portal. Sites for which BAnQ has not obtained permission are preserved, but not made publicly available. They can be accessed for research purposes.

Collaborative Collection

In February 2020, the International Internet Preservation Consortium (IIPC) called on its members, including BAnQ, to create a collaborative collection of websites dealing with the emerging pandemic.

BAnQ’s contribution to this collection formed the basis of the Québec collection, which we decided to create once the scale of the crisis became apparent. BAnQ had already created several collections on special events, for example the 375th anniversary of the city of Montreal. The collection on the pandemic is part of this corpus around exceptional events.

The Québec collection includes Québec government websites, and sections of websites, dealing with the pandemic. It also includes the websites of public health authorities (Directions de la santé publique), Québec’s National Public Health Institute (INSPQ), as well as the CISS and CIUSS (Integrated Health and Social Services Centres). Web pages about the pandemic from a number of cities and towns are included, as well as universities, CEGEPs (senior high schools), and school boards. Websites of companies that are particularly affected by the pandemic, such as financial institutions and supermarket chains, are also included.

Articles dealing with COVID-19 from Québec-wide and regional papers are collected, as well as parts of the websites of professional orders and associations. Of course, sites that have emerged or been in the news since mid-March, such as Jebenevole.ca, are also harvested. At the time of writing, over 15,000 URL addresses have been collected, and new ones are added every week.

Capturing social media

As for social media, BAnQ collects the Twitter feeds and Facebook pages of personalities and public bodies involved in front-line management of the crisis, such as Premier François Legault, Québec’s health ministry (Santé Québec), and the City of Montréal’s police department (Service de police de la Ville de Montréal). All over the world, memory institutions are working to preserve traces of the pandemic. Thanks to these efforts, it is our hope that nothing will be forgotten.

References:

[1] This article will appear in the June 2020 issue of À rayons ouverts – Chroniques de Bibliothèque et Archives nationales du Québec, No. 106 (Spring/Summer 2020), p. 26.

[2] Alfred W. Crosby, America’s Forgotten Pandemic – The Influenza of 1918, 2nd edition, Cambridge, Cambridge University Press, 2003, https://books.google.ca/books?id=KYtAkAIHw24C&redir_esc=y&hl=en (accessed May 4, 2020).

Let’s time travel with the IIPC!

IIPC has been organising its annual meetings for over 15 years. The first full Steering Committee meeting and the meetings of working groups were held in Canberra in 2004. The most recent General Assembly (GA) and Web Archiving Conference (WAC) were held in Zagreb in June 2019. What started as a small get-together of web archiving enthusiasts from a dozen national libraries and the Internet Archive has gradually become an important fixture in the web archiving calendar. We have been very fortunate that our members have volunteered to host the events in Singapore, The Hague, Washington D.C., Ljubljana, Stanford, Reykjavík, London, Wellington, Zagreb and Ottawa. The GA also returned to Canberra in 2008.

 

Due to Covid-19, this year we will not meet in person but we can time travel! While preparing for the next annual event hosted by the National Library of Luxembourg (15-18 June 2021), we will be trawling through the history of the GA and the WAC. We will be collecting, publishing and archiving memories from past events in a variety of formats, ranging from tweets and blog posts to a GA and WAC digital repository and bibliography. All new and older posts will be available in the “GAWAC” archive.


We are starting from 2019, which was the first GA for Friedel Geeraert of KBR, The Royal Library of Belgium. This was also the first GA for the British Library web archivists Helena Byrne and Carlos Rarugal, the organisers of a workshop called “Reflecting on how we train new starters in web archiving”.

Abstracts from the 2019 presentations and slides are available on the conference website. You can also watch the keynote speeches and panel discussions on our YouTube Channel and browse through the photos on the IIPC Flickr. The 2019 GA and WAC were hosted by the National and University Library in Zagreb. The Croatian Web Archive (HAW), which last year celebrated its 15th anniversary, launched its new interface earlier this year. You can browse the archive and the thematic collections at https://haw.nsk.hr/en.

Photo: Tibor God.

Discovering the web archiving community at the IIPC events in Zagreb

By Friedel Geeraert, Scientific Assistant Web Archiving, KBR – Royal Library of Belgium

Last year I had the privilege of participating in the IIPC General Assembly and Web Archiving Conference in Zagreb for the first time as the representative of KBR (the Belgian Royal Library), which was at that time the youngest IIPC member. Last year KBR was involved in a research project called PROMISE that studied the question of web archiving at the federal level in Belgium.

The General Assembly provided good insight into the working of IIPC as an organisation. It was very interesting to participate in the reflection about the future form of IIPC during the General Assembly. According to member institutions the top three priorities for the coming years should be: 1) community-led tools, 2) providing platforms for sharing knowledge and 3) networking and support for innovation in research on the archived web. Furthermore, the reports of the Treasurer and the Programme and Communications Officer indicated the different possibilities of engaging with the organisation and other IIPC members: TSS (Technical Speaker Series) and RSS (Research Speaker Series) Webinars, Online Hours, the different working groups (Content Development, Training Working Group, Preservation, Research Working Group) and the Discretionary Funding Programme. I took part in the workshops of the Preservation, Training and Research Working Groups which allowed me to discover different initiatives launched within web archiving institutions all over the world.

The Web Archiving Conference brought a plethora of developments within web archiving to light. A lot of focus was on outreach and on how to promote web archives (via library labs for example). Another theme was researcher interaction with web archives and opening up access to complementary files such as crawl and access logs, derivative files or documentation about curatorial decisions and Heritrix settings. The use of machine learning on archived web material was another recurring theme. From a curatorial perspective trending collection themes are minorities, emerging formats such as interactive fiction or retrospective web archiving. It was also stressed that divergent opinions should feature in a web archive in order to avoid curatorial bias. Furthermore, even though I don’t have a technical background, it was fascinating to discover new developments such as size reduction of indexes, Browsertrix or automated quality assurance.

On top of all that rich information, the networking possibilities were fantastic. Within the PROMISE project, we did an extensive literature review concerning web archiving initiatives in Europe and Canada. It was a wonderful opportunity to meet some of the web archivists and researchers I admire in person. It is safe to say that I came back inspired and with a head full of ideas for the Belgian web archive. I’m already looking forward to the next edition.


Reflecting on how we train new starters in web archiving

This blog post is a summary of a workshop that took place at the 2019 IIPC Web Archiving Conference in Zagreb, Croatia. The abstract and the final slides used during the workshop are available on the IIPC website.


By Helena Byrne, Web Curator and Carlos Rarugal, Assistant Web Archivist at the British Library

 

Most people when learning can relate to the Benjamin Franklin quote

tell me and I forget, teach me and I may remember, involve me and I learn.*

It can be very challenging to find the most effective way to involve a trainee in web archiving and transfer your specialist knowledge. Web archiving is a relatively new profession that is constantly changing and it is only in recent years that a body of work from practitioners and researchers has started to grow. In addition, each web archiving institution has its own collection policies and many use their own web archiving technology meaning that there is no one size fits all solution to providing training to people who work in this field.

However, before taking on new strategies it is important to understand our own beliefs on training and what actions we currently take when training new staff. Reflecting on these points can help us to become more aware of any biases we may have in terms of preferred training delivery style which could be contradictory to what the trainee really needs.

What we did

Before we started the workshop participants answered a series of questions about their own experience of training or receiving training on web archives via a Menti poll. We then reviewed the training practices of the curatorial web archive team at the British Library and in groups reviewed what methods participants felt worked well or not.

“Reflecting on how we train new starters in web archiving” at the Web Archiving Conference in Zagreb, 6 June 2019.
Photo: Tibor God.

Menti Poll Results

Menti Poll Results: Average Score for each question.

Overall, there were about 26 participants in the workshop who had varying degrees of experience training people on how to work with their web archive. As shown in Slide 3, only 31% of participants train people in web archiving on a regular basis while 50% of participants train people occasionally and the remaining 19% don’t train other people in web archiving. Some of the people in this final category work as solo web archivists and don’t have any resources for additional staff.

When asked if there was a structured training programme on web archiving at their organisation, 65% of participants responded “no” while only 35% of respondents had a programme in place. Not surprisingly, when asked ‘how were you trained in web archiving?’, hands-on training was the most popular method used to train participants at the workshop.

Results of this poll can be viewed here.

Training practices at the British Library

During this workshop we reviewed common training methods and reflected on the current practices of the curatorial team of the UK Web Archive based at the British Library as well as how we would like to change these practices in the future. (Slides 7-8)

Group Discussion

Participants in small groups discussed a series of questions about how they train people in their institutions:

Questions

1. Who do you train about web archiving?
2. How do you currently train them?
3. What web archiving training resources do you have available to your team?
4. What methods do you use for training? Computer based, documentation (handouts, user guides etc.), one to one learning, shadowing etc.

After discussing these questions participants then placed their current training methods onto a scale of what they felt works and doesn’t work.

Brainstorming

Overall there were 56 points filled in on the post-it notes by participants in 6 different groups. These can be loosely categorised into 10 categories:

Reading list, videos, hands on training, documentation, networking, case studies, examples/modelling, verbal training, forums and tutorials. A more detailed breakdown of these categories can be viewed here.

Most of the points noted (30/56) were in the ‘what works’ section, (10/56) were neutral while only (8/56) of the points were in the ‘what doesn’t work’ section. However, there was some overlap with the ‘what works’ and ‘what doesn’t work’ sections, with some methods like videos and reading lists appearing in both sections but in different groups.

Review

In the last workshop activity, participants voted, using two coloured stickers, on what they considered the most aspirational and the most achievable training methods.

As you can see from the votes below, the most popular activity that workshop participants felt could be achieved in the short term was hands-on individual training, with 9 votes. There was a split between participants who felt that writing manuals was achievable (7 votes) and those who felt that it was aspirational (6 votes).

How people voted

Conclusion

Overall participants were keen to see a training related event on the IIPC Web Archiving Conference programme. As the importance of web archiving grows, so too does the need for training in this field and it has become more evident that these responsibilities are falling on web archivists.

All the data collected during this workshop was shared with the IIPC Training Working Group and it is hoped that it will help inform the development of materials to support training within the field.

More information about the IIPC Training Working Group can be found here: http://netpreserve.org/about-us/working-groups/training-working-group/

References:

* Goodreads.com, ‘Benjamin Franklin > Quotes > Quotable Quote’, https://www.goodreads.com/quotes/21262-tell-me-and-i-forget-teach-me-and-i-may (accessed December 20, 2018).