2022 blog round-up

As we approach the end of 2022, we would like to thank our members and the general web archiving community for their support and engagement this year. Before we move forward into 2023, and return to an in-person General Assembly and Web Archiving Conference (for the first time since 2019!), we wanted to highlight some of this past year’s activities featured on our blog and to take this opportunity to thank all the contributors.

IIPC Governance

Thank you to the 2022 IIPC Chair, Vice-Chair and Treasurer for serving on the 2022 Executive Board. Thank you also to all the members who participated in the 2022 Steering Committee election. Many thanks to IIPC 2022 Chair Kristinn Sigurðsson for leading us through this past year, and reminding us that IIPC truly is an organization for all seasons.

Funded projects

2022 started off with a wrap-up of a project led by our Tools Development Portfolio and developed by Ilya Kreymer of Webrecorder. The goal of this project was to support migration from OpenWayback (a playback tool used by most of our members) to pywb by creating a Transition Guide.

This year also saw the launch of a new tools project “Browser-based crawling system for all.” Led by four IIPC members (the British Library, National Library of New Zealand, Royal Danish Library, and the University of North Texas), the Webrecorder-developed crawling system based on the Browsertrix Crawler is designed to allow curators to create, manage, and replay high-fidelity web archive crawls through an easy-to-use interface.

“Game Walkthroughs and Web Archiving” builds on research by Travis Reid, a PhD student at Old Dominion University (ODU), that looks at applying gaming concepts to the web archiving process. This collaboration between ODU and Los Alamos National Laboratory was supported by the IIPC through our Discretionary Funding Program (DFP).

Here’s a list of blog posts on the 2022 projects related to web archiving tools:

Collaborative Collections

IIPC also funds collaborative collections, which are curated and supported by volunteers from our community. While our Covid-19 collection continues, three new collections were initiated by the Content Development Working Group (CDG) in 2022. In the winter, Helena Byrne of the British Library encouraged everyone to web archive the Beijing 2022 Olympic & Paralympic Winter Games, adding to a decade-long collaborative effort of archiving the Olympics and Paralympics. Archiving the War in Ukraine was our second collaborative collection for 2022. Co-curated by Kees Teszelszky of KB, National Library of the Netherlands, and Vladimir Tybin and Anaïs Crinière-Boizet of the BnF, the National Library of France, it offers a comprehensive international perspective on the war. We closed 2022 with a call for nominations (due 20 January, 2023) for Web Archiving Street Art, co-led by Ricardo Basílio of Arquivo.pt and Miranda Siler of Ivy Plus Libraries Confederation.

Thank you to Alex Thurman (Columbia University Libraries) and Nicola Bingham (the British Library) for serving as CDG co-chairs, overseeing all new and ongoing collaborative collections:

Researching web archives

We also published blog posts related to researching web archives on topics spanning from a toolset for researchers to archiving social media to analysing Covid-19 web archive collections.

Yves Maurer of the National Library of Luxembourg wrote about CDX-summarize, his toolset aimed at anyone interested in researching web archives that are not fully accessible. It offers a possible way to provide a useful glimpse of “data that resides in-between the legal challenges of full access on the one hand and a textual description or rough single numbers on the other hand”.

Beatrice Cannelli, PhD candidate at the School of Advanced Study (University of London), summarised the results of an online survey mapping social media archiving initiatives, which is part of her research project “Archiving Social Media: a Comparative Study of the Practices, Obstacles, and Opportunities Related to the Development of Social Media Archives.”

We also published two blog posts by the AWAC2 (Analysing Web Archives of the COVID-19 Crisis) team of researchers, who are working with the IIPC Covid-19 collaborative collection using ARCH (Archives Research Compute Hub), a new interface for web archive analysis created by the Archives Unleashed Project Team and the Internet Archive. AWAC2 is supported by the Archives Unleashed Cohort Program, which facilitates research engagement with web archives. The researchers are members of the WARCnet (Web ARChive studies network researching web domains and events) Working Group 2, which focuses on analysing transnational events.

Covid-19 web archived content is also at the core of the Archive of Tomorrow (AoT) project that aims to explore and preserve online information and misinformation about health and the pandemic. Introduced earlier this year by Alice Austin (Centre for Research Collections, University of Edinburgh), AoT will form a ‘Talking about Health’ collection within the UK Web Archive. Cui Cui, PhD candidate at the University of Sheffield and also an AoT web archivist, shared her process of working with the ‘Talking about Health’ collection, using faceted 4D modelling to reconstruct web space in web archives.

Here are the 2022 blog posts on researching web archives:

Last but not least, we would also like to give a shoutout to the brilliant Web Archiving Team at the Library of Congress who worked with us on the online GA and WAC 2022 and took us down memory lane in Remembering Past Web Archiving Events With Library of Congress Staff.

Many thanks to everyone who has contributed to our blog and helped us promote it through their newsletters and social media posts and, of course, thank you to all our readers around the world. We look forward to showcasing your web archiving activities in the new year!

Studying Women and the COVID-19 Crisis through the IIPC Coronavirus Collection  

AWAC2 (Analysing Web Archives of the COVID-19 Crisis) is a project developed by the members of WARCnet (Web ARChive studies network researching web domains and events) Working Group 2 that focuses on analysing transnational events. This is one of the first research projects using an IIPC collaborative collection and ARCH (Archives Research Compute Hub), a new interface for web archive analysis created by the Archives Unleashed Project Team and the Internet Archive.

By the AWAC2 (Analysing Web Archives of the COVID-19 Crisis) team: Susan Aasman (University of Groningen, The Netherlands), Niels Brügger (Aarhus University, Denmark), Frédéric Clavert (University of Luxembourg, Luxembourg), Karin de Wild (Leiden University, The Netherlands), Sophie Gebeil (Aix-Marseille University, France), Valérie Schafer (University of Luxembourg, Luxembourg), Joshgun Sirajzade (University of Luxembourg, Luxembourg)

A year ago, a post on this very blog (“Analysing Web Archives of the COVID-19 Crisis through the IIPC collaborative collection: early findings and further research questions,” 2 November 2021) invited the IIPC community to vote for a more specific topic that the AWAC2 team could analyse within the vast IIPC collection devoted to the coronavirus, whose metadata and textual content were made available in the framework of a partnership.

After a first phase of collaboration around this prolific, multilingual, and international corpus, in direct cooperation with the Archives Unleashed team, which allowed us to access this abundant collection via ARCH and other tools they have developed, it seemed interesting to us not only to pursue the global analysis of the corpus, but also to test the feasibility of more specific studies on a precise topic. Following the vote, the selected theme “Women, gender and COVID” was the subject of several online and on-site meetings by the AWAC2 team, including an internal datathon in March 2022 at the University of Luxembourg (Figure 1).

Figure 1 – A datathon as a test bed and the tentative design of a workflow

The purpose of this blog post is to review some of the methodological elements learned during the exploration of this corpus.

Retrievability is a real challenge

The first salient point concerns the amount of data. We had already considered this at the global level of the corpus, but it remains considerable even for research focused specifically on women. Above all, data mining and corpus creation are complicated by multilingualism (see table 7 of our previous blog post), and a search for the term “woman” is not sufficient to create a satisfactory corpus (a woman can be qualified as a mother in the case of home-working or as a feminist in the case of activism and the fight against domestic violence, etc.).

The multidisciplinary team also had to define research priorities in view of the challenges of these massive corpora. Indeed, once they are constituted, the analysis is still far from beginning. The sub-corpora are full of noise, especially when it comes to news sites where the terms COVID and pregnancy or feminism may appear in newsfeeds in a very close way, but without any real thematic correlation (Figure 2). There are also many duplicates, and it must be determined whether or not they inform the study. Such a large amount of data also raises the question of more research-driven or data-driven approaches.

Figure 2 – Entry line 7867: https://flipboard.com/topic/women. The newsfeed mentions the COVID crisis as well as the MeToo movement but the news is unrelated, as visible on top of the capture when accessing full text.

In addition to the technical difficulties, there are also contextual difficulties. The data must also be put into a national context from a qualitative point of view if they are to be analysed properly. For example, lockdowns and school closures have varied from country to country and school organisation is also very different around the world, as is the legislative framework for work during lockdowns.

Topic modeling as a field of investigation

The AWAC2 team shared a strong interest in assessing the presence, retrievability, and asymmetries related to gender and COVID. Some colleagues were especially interested in understanding the issues related to transnational studies and gender studies, as well as reflecting on invisibility and inclusiveness, while others were more specifically interested in the computational and topic modeling part.

This second aspect has given rise to interesting developments, as three major algorithms were applied to be able to carry out more sophisticated and semantic search in the corpus: Latent Dirichlet Allocation (LDA), Word2vec, and Doc2vec.

LDA is an extension of Probabilistic Latent Semantic Analysis (PLSA) which is a probabilistic formulation of Latent Semantic Analysis (LSA). LSA is a dimensionality reduction technique where documents in a corpus (in our case, web pages) are compressed to a very small number of documents, which could be read by a human. These compressed documents are called topics. In essence, they carry the words which are shared by many documents and probabilistically more often occur together. In our experiment, we not only identified topics which contain keywords related to the situation of women, but also looked at how these topics are distributed across the web pages (Figure 3).

Figure 3 – Topics identified through LDA over the whole dataset (Covid-19 special collection) and their distribution through time

A few examples of topics are:

  • topics202002.txt:46 0.05 video news show man years police star film death family week weinstein comments day love stars top women fashion black
  • topics202002.txt:69 0.05 shop view accessories gifts price sale products delivery add cart free mens gift shoes brands bags womens clothing hair home
  • topics202003.txt:4 0.05 health children mental kids anxiety child family tips healthy parents social coronavirus stress find support time home women life news
  • topics202004.txt:53 0.05 gender development health policy working european countries international women economic equality work regional employment global world minnesota environment content overview
  • topics202005.txt:83 0.05 study risk patients years people blood disease
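To show how topic listings like these can be filtered for keywords related to women, here is a minimal sketch in plain Python. It is not the code used by the team. It assumes each line follows the layout seen above (a file name that appears to encode year and month, a topic number, a weight, then the top words), and the function names `parse_topic_line` and `topics_with_keywords` are hypothetical.

```python
import re

# Assumed layout: "topics<YYYYMM>.txt:<topic id> <weight> <top words...>"
TOPIC_LINE = re.compile(r"^topics(\d{6})\.txt:(\d+)\s+(\d+(?:\.\d+)?)\s+(.+)$")


def parse_topic_line(line):
    """Split one listing line into month, topic id, weight and top words."""
    match = TOPIC_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised topic line: {line!r}")
    month, topic_id, weight, words = match.groups()
    return {
        "month": month,
        "topic": int(topic_id),
        "weight": float(weight),
        "words": words.split(),
    }


def topics_with_keywords(lines, keywords):
    """Return parsed topics whose top words include at least one keyword."""
    wanted = set(keywords)
    parsed = (parse_topic_line(line) for line in lines)
    return [topic for topic in parsed if wanted & set(topic["words"])]


# Three of the example lines listed above, copied without the bullet.
EXAMPLE_LINES = [
    "topics202002.txt:69 0.05 shop view accessories gifts price sale products "
    "delivery add cart free mens gift shoes brands bags womens clothing hair home",
    "topics202003.txt:4 0.05 health children mental kids anxiety child family "
    "tips healthy parents social coronavirus stress find support time home "
    "women life news",
    "topics202004.txt:53 0.05 gender development health policy working european "
    "countries international women economic equality work regional employment "
    "global world minnesota environment content overview",
]

for topic in topics_with_keywords(EXAMPLE_LINES, {"women", "gender"}):
    print(topic["month"], topic["topic"], topic["words"][:5])
```

Because the match is on exact words, the shopping topic (which contains “womens” but not “women”) is not returned. This also illustrates the retrievability problem described earlier: a single keyword is rarely enough, and the keyword list has to be chosen with care.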

Word2vec and Doc2vec, in turn, are further developments of the previous algorithms. In the background they not only use newer techniques such as logistic regression (also called a shallow network), but also allow more flexible usage. Word2vec provides a dense vector for every word, and in this respect it is very similar to LSA. However, Word2vec creates its vectors from the so-called window, or neighborhood, of words (for example, 5 words to the left and to the right of the searched word), operating more on a syntactical level, while LSA is based purely on the document-term matrix. With Word2vec it is possible not only to find words semantically related to the searched word, but also to concatenate the vectors of all words in a document. By doing so, similar documents can be found. This in a way goes beyond the so-called bag-of-words approach, to which all the previous algorithms belong, because word order can also be taken into account. From an implementation point of view, this can be done with an additional algorithm such as a Long Short-Term Memory (LSTM) network, or with a ready-to-use implementation such as Doc2vec.
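To make the idea of a window concrete, here is a minimal sketch in plain Python of how (word, neighbour) pairs can be collected from a tokenised text. It only illustrates the neighborhood from which Word2vec-style models learn; it is not a Word2vec implementation and not the code used in our experiment. The function name `context_pairs` and the sample sentence are assumptions made for the example.

```python
def context_pairs(tokens, window=5):
    """Return (target, context) pairs for all words within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        # Clip the window at the beginning and end of the text.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own neighbour
                pairs.append((target, tokens[j]))
    return pairs


# Illustrative sentence, not taken from the corpus.
tokens = ["women", "work", "from", "home"]
for target, context in context_pairs(tokens, window=1):
    print(target, "->", context)
```

With `window=5`, as in the example given above, each word would be paired with up to five neighbours on each side. A smaller window tends to capture words that are used in the same way, while a larger one captures broader topical associations.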

In our experiment, we trained a Word2vec model on our corpus, which enabled us to find keywords related to women or feminism. We then took these keywords and searched where they occur. By doing so, we again not only investigated the situation of women in the pandemic, but also compared the results to those given by LDA. This not only helps us to analyze the workings, complementarity, and efficiency of the algorithms, but also allowed us to make sure that the search mines our corpus as thoroughly as possible (Figures 4a and 4b).

Figure 4a – Time series of a topic related to women and children
Figure 4b – Top 20 domains for this selected topic
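The two steps described above, finding related keywords and then searching where they occur, can be sketched in plain Python as follows. This is a toy illustration and not the team's pipeline. The three-dimensional vectors are invented for the example (a trained model would supply real ones with many more dimensions), the page texts and URLs are placeholders, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:topn]


def pages_mentioning(pages, keywords):
    """Return the URLs whose text contains at least one keyword as a word."""
    wanted = {k.lower() for k in keywords}
    return [url for url, text in pages.items() if wanted & set(text.lower().split())]


# Invented vectors, for illustration only.
TOY_VECTORS = {
    "women": [0.9, 0.8, 0.1],
    "mothers": [0.8, 0.9, 0.2],
    "feminism": [0.7, 0.6, 0.0],
    "delivery": [0.1, 0.0, 0.9],
}

# Placeholder pages, for illustration only.
TOY_PAGES = {
    "https://example.org/a": "Mothers and home schooling during lockdown",
    "https://example.org/b": "Free delivery on all orders",
}

related = [word for word, _ in most_similar("women", TOY_VECTORS, topn=2)]
print(related)
print(pages_mentioning(TOY_PAGES, related + ["women"]))
```

The list of pages found this way can then be compared with the pages assigned to the women-related LDA topics, which is the kind of cross-check between algorithms described above.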

What’s next?

This research is far from being completed after a year of collaboration with the Archives Unleashed team, whom we warmly thank for their technical and scientific expertise, as well as with IIPC, which provided an unprecedented corpus that can stimulate a multitude of research projects, whether thematic or oriented towards computer science and digital humanities. An article is currently being prepared on the second topic, “Deep Mining in Web archives,” while a more general and SSH-oriented chapter is being drafted for the final collective book of the WARCnet project. Furthermore, the team will be pleased to present results at the next IIPC Web Archiving Conference in 2023 and thus continue the dialogue with you around the collection.

Related resources

Archive of Tomorrow – Capturing online health (mis)information

By Alice Austin, Web Archivist, Archive of Tomorrow

Centre for Research Collections, Main Library, University of Edinburgh

Copyright ©2021 R. Stevens / CREST (CC BY-SA 4.0)

It goes without saying that the Covid-19 pandemic has cast a harsh light across our society and exposed fault lines in a number of areas, not least in the fragility of our information infrastructures. Over the last two years we have seen misinformation spread at a similar speed to the virus, with the consequence that any future attempts to try and examine the medical pandemic as an historical and social phenomenon will also have to reckon with the misinformation pandemic. Government and medical websites have changed on a daily basis as new information emerges, and there has been a massive proliferation of comment on social media and other online platforms about the virus and other health issues. Clinical advice, data and scientific evidence have been contested, revised, used and misused with dramatic and sometimes tragic consequences, and yet the digital record of this is fragile and difficult to access. There have been sustained and laudable efforts to ensure that inaccurate and potentially harmful information is taken down swiftly, with the result that a researcher exploring (e.g.) the emergence of ivermectin as a Covid ‘miracle cure’ might find they come up against a lot of dead ends and 404s.

Goals of the Archive of Tomorrow

In response, the Archive of Tomorrow project hopes to capture an accurate record of how people use the internet to find, share, and discuss health and health-related topics so that current and future researchers can understand public health practices in the digital age. We hope to capture 10,000 targets – ranging from official, ‘approved’ and verified sources, to unofficial, sometimes controversial publications – and to secure access permission for this content to produce a ‘research-ready’ collection. The project is ambitious, not just in its intention to build a useful evidence base of historical web resources but also in the attempt to develop an ethical and meaningful precedent for archiving possible mis- or dis-information. Because it crystallises so many of these issues, COVID is one subject that we’re focusing on in detail, but we’re also looking at capturing other health-related debates such as those that surround reproductive rights, ‘alternative’ medicines, assisted dying, and the use of medical cannabis.


Having launched in Feb 2022, the project is still in the early stages of development. It’s being led by the National Library of Scotland with web archivists based in university libraries in Edinburgh, Oxford and Cambridge, and invaluable input from the British Library’s web archiving team. This kind of collaborative working feels very much representative of the Covid-era – it’s hard to imagine a project like this emerging in the days when remote working and Zoom meetings were the exception rather than the norm! We’ll be talking more about the collaborative nature of the project at the IIPC WAC conference in May – and registration is open now!

Selecting ‘health information’

Thinking about how work practices have changed throughout the pandemic brings us to something that has been a challenge for the project team to unravel – how to define the boundaries around ‘health information’ – where it begins and ends, how health relates to other spheres like politics, law, employment and so on. We have to impose boundaries on our collecting, and while some boundaries are legislative or technological (such as the exclusion of broadcast media like podcasts and videos from the collection), some are cultural: for example, to what extent do protests against Covid measures such as masks and lockdowns count as health information? What about artistic responses to the pandemic? And how well are we able to represent health information-seeking behaviours in languages other than English?

Welsh COVID-19 Pandemic guide: what to do and not do. Copyright © 2020 G. Hegasy (CC BY-SA 4.0)

Archivists have long understood that we can’t collect everything – and we don’t try to! As with so much collecting, the challenge lies in how to communicate our selection decisions without dictating the way the archived material is used and encountered. In this case, we’re trying to capture public health discourse and not be part of the conversation ourselves, but we do have a degree of responsibility when considering health mis/dis/information – to what extent should inaccurate, refuted, or dangerous content be flagged in the UKWA interface? How do we make such content available responsibly without inserting our perspective into the debates?

Archive of Tomorrow workshop

At this stage we have more questions than answers, and we anticipate that this will continue. The project isn’t designed to solve these problems, but rather, to articulate them in a way that opens the door for future work and solutions. Our first activity towards this goal is the workshop that we’re hosting at the end of the month. We hope that by engaging with current and future researchers with an interest in online information-seeking behaviours or public health we can develop and produce a valuable, research-ready collection that will give real insight into how the internet has been used for health information during the pandemic and beyond.

Analysing Web Archives of the COVID-19 Crisis through the IIPC collaborative collection: early findings and further research questions

By the AWAC2 (Analysing Web Archives of the COVID-19 Crisis) team: Susan Aasman (University of Groningen, The Netherlands), Niels Brügger (Aarhus University, Denmark), Frédéric Clavert (University of Luxembourg, Luxembourg), Karin de Wild (Leiden University, The Netherlands), Sophie Gebeil (Aix-Marseille University, France), Valérie Schafer (University of Luxembourg, Luxembourg) 

As mentioned by Nicola Bingham in her blog post “IIPC Content Development Group’s activities 2019-2020” in July 2020, a huge effort has been made by the Content Development Group and IIPC members to create a unique collection of web material related to the pandemic, with contributions from over 30 members as well as public nominations from over 100 individuals/institutions.

This collection immediately attracted the interest of researchers because of the quantity of collected data, its transnational nature and the many possibilities it offers to explore web archives of this unprecedented period at an international level.

A strong interest in COVID-19 collections in WARCnet

The WARCnet project was launched at the beginning of 2020, just as the world was witnessing the first developments in the COVID-19 crisis.

WARCnet is a network of researchers and web archiving institutions (see WARCnet team) which aims to promote transnational research that will help us to understand the history of (trans)national web domains and transnational events on the web, drawing on the increasing volume of digital cultural heritage held in national web archives (Brügger, 2020). The network’s activities started in 2020 and will run until 2023, and they are funded by the Independent Research Fund Denmark | Humanities (grant no 9055-00005B). The network is organised into six working groups, with working group 2 (WG2) focusing on the study of transnational events through web archives.

WG2 decided to select the COVID-19 crisis as one of its first case studies and test beds, and it conducted a first distant reading of several collections related to the crisis, combining metadata from the special collections of national institutions like the British Library, the BnF (National Library of France) and INA (the French National Audiovisual Institute) in France, the BnL (National Library of Luxembourg), etc., with the IIPC collection. WG2 received metadata in the form of seed lists of these collections thanks to a special agreement with the web archiving institutions.

At the same time a series of oral interviews were carried out with web archivists and web curators to shed more light on the selection and curation processes and the scope of these special collections, including an interview by Friedel Geeraert with Nicola Bingham on the IIPC collection (Geeraert and Bingham, 2020). Transcriptions of this series of oral interviews are available online for free download. You will find interviews with web archivists and curators working at INA, the BnF, the BnL, the IIPC, Netarkivet (Denmark), the National Széchényi Library in Hungary, the UK Web Archive, the Swiss National Library, the National Library of the Netherlands (KB) and the Icelandic Web Archive. Other interviews are scheduled.

An internal WG2 datathon on the metadata received from web archiving institutions, held at the very beginning of 2021, enabled us, for example, to compare national collections with one another and with the selection they made for the IIPC coronavirus collection (table 1) and to measure the websites that emerged during the pandemic (table 2). Another goal of WG2 was to compare the metadata provided by institutions (table 3) and the way they could be intertwined.

Table 1: Presentation by Friedel Geeraert of her hypothesis related to overlaps between national collections and IIPC collection and results (final meeting of our datathon)
Table 2: Presentation by Katharina Schmid and Friedel Geeraert of their hypothesis related to “COVID-19 websites” and results (final meeting of our datathon)
Table 3: Overview and comparison of the data fields provided by each web archive, conducted by Karin de Wild and Niels Brügger

The data was provided by the following web archives: RDL (Royal Danish Library); BnF; NSL (National Széchényi Library, Hungary); IIPC; BnL; KB (Koninklijke Bibliotheek, the Netherlands); UKWA (UK Web Archive).

A new opportunity and a multi-partner collaboration

This first exploration led to the desire to go deeper into COVID-19 collections, and a unique chance was offered to us when the Archives Unleashed team launched its annual call for cohorts.


Some of the WG2 members therefore decided to submit a proposal entitled AWAC2, which stands for “Analysing Web Archives of the COVID-19 Crisis through the IIPC Novel Coronavirus Dataset”. With this application, we hoped to deepen our understanding of the IIPC COVID-19 collection at several levels:

First, it was a way to continue our initial distant exploration of (meta)data, in order to answer qualitative questions such as:

(1) Participation in the collection by web archiving institutions and web archive representatives
— how many URLs inside/outside ccTLDs?
— can we indirectly document some countries with no national web archives through this collection?
— comparison of national collections and their selection for IIPC (based on Danish and Luxembourgish collections)
(2) Categories of stakeholders, websites, representativeness and inclusiveness
(3) New event-specific websites
(4) MIME types and visual studies
(5) Hyperlink networks
What characterises the hyperlink network of websites included in the IIPC collection? Can any national website clusters be identified? (This analysis could be performed in Gephi).
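As an illustration of the first question in this list (how many URLs sit inside or outside ccTLDs), here is a minimal sketch in plain Python. It is not the analysis carried out by the team. It assumes that a country-code top-level domain can be recognised as a two-letter ASCII TLD, which is a simplifying heuristic, and the seed URLs and function names are invented for the example.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_level_domain(url):
    """Return the last label of the host name, or '' if there is none."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower() if "." in host else ""


def is_cctld(tld):
    # Heuristic assumption: country-code TLDs are the two-letter ASCII ones.
    return len(tld) == 2 and tld.isascii() and tld.isalpha()


def count_cctld_urls(urls):
    """Count how many URLs fall inside and outside country-code TLDs."""
    counts = Counter()
    for url in urls:
        tld = top_level_domain(url)
        counts["ccTLD" if is_cctld(tld) else "other"] += 1
    return counts


# Invented seed URLs, for illustration only.
seeds = [
    "https://www.example.dk/corona",
    "https://example.lu/covid-19/",
    "https://example.com/topic/women",
]
print(count_cctld_urls(seeds))
```

Grouping the same counts by individual TLD rather than by the two broad classes would be a first step towards the second question, whether countries without a national web archive are indirectly documented in the collection.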

Second, it was an opportunity to obtain complementary data and to combine research methods, especially distant and close reading, thanks to the possibility of accessing full text.

Third, it was also a unique chance to further explore the possibilities offered by the Archives Unleashed tools that the team has been developing for many years, which we were introduced to with the pre-workshop organised by Ian Milligan and Nick Ruest at the 2019 RESAW conference in Amsterdam and have been following ever since through reports about their activities and academic papers (Ruest et al., 2020). It was also an opportunity to benefit from regular discussions with the team, enrich our computational skills and create a research dynamic with an efficient and impressive team in Canada.

Finally, it was a way to explore new research questions that we had in mind from the beginning, and which are more related to topical approaches (e.g. Women, gender and COVID-19). We will come back to this in the last section.

First exciting steps

We are very thankful to the IIPC, Archive-It and the Archives Unleashed team for selecting our project and for the agreement that was signed to access this large dataset. Since the end of August, following the launch of the Cohort Programme in July 2021 that enabled us to meet all the participants and cohorts in the yearly programme, the AWAC2 team has been able to explore the new Archives Unleashed interface within Archive-It and the many datasets and visualisations that have been made available to us (figures 1 and 2).

Figure 1: An interface made available by the Archives Unleashed team to easily download data in a secure way and select between MIME types, domain names, full text, web graph, etc.
Figure 2: An interface to visualise some samples (here a top hosts sample)

The technical skills within the AWAC2 team are heterogeneous and while some members immediately began analysing data, others initially struggled to download some datasets, as the collection contains a huge amount of data and requires computer skills. However, the team is now on track and greatly benefits from the two regular monthly meetings with the Archives Unleashed team, whose availability to answer questions, explore technical issues with us and share (and explain) notebooks is amazing.

The AWAC2 team immediately started mapping data, framing the scope (table 4), discussing methods and creating samples to study several aspects related to multilingualism and stakeholders represented in web archives (tables 5 to 8).

Table 4: Extract of the overview of the dataset produced by Niels Brügger
Table 5: Analysis by Frédéric Clavert of count records by crawl date in the IIPC collection (30% sample)

Table 6: Extract of a visualisation of the 50 most archived domains from a diachronic perspective (F. Clavert)
Table 7: Visualisation of crawl frequency by language thanks to pandas and Altair libraries for Python, full sample (F. Clavert)
Table 8: A first distant reading of randomly selected French content (30% sample) using Iramuteq (F. Clavert)

Full text access also gives us an opportunity to combine methodologies and use tools like Iramuteq that allow text mining. This is the next step we are hoping to achieve…

Taking things further… with you!

To deepen collaboration with IIPC members and continue the fruitful dialogue that we began when we started by exploring the IIPC collection, we will of course continue to share our results and our insights into the crisis gleaned from web archives and this COVID-19 collection, which may in itself become an object of study as a mirror of web archiving practices, curation and methodologies. We also want to respond to the interest of the IIPC community by sharing our research questions with you, and we would like to invite you to vote for the research questions that we should investigate first.

The multidisciplinary nature and wide-ranging fields of expertise of the team have led to a long list of research interests, and the team is planning to meet for three days in March 2022 to conduct a test bed on one or two case studies. Our case studies will be the ones that you select:

  1. Research on Women, Gender and COVID-19 within this collection (e.g. domestic violence, care and homeschooling, etc.). We will probably use Iramuteq or Mallet on derivative files to perform text mining.
  2. Identify private journals of lockdowns, individual traces of daily life and different online expressions that offer insights into the ways people are dealing with COVID-19 in their everyday lives.
  3. Trace public support/opposition to lockdown. Can we conduct a sentiment analysis over time?
  4. How was the home schooling debate conducted on the web? How did the various stakeholders communicate about it?
  5. How to identify fake news, conspiracy theories and other COVID 19-related controversies within these big data?
  6. Is it possible to perform a visual analysis of what medical-scientific communication on COVID-19 looks like (and what type of visual communication is used, e.g. graphs, visuals, colours)?
  7. The pandemic seriously affected museums around the world and in some countries the web became a prominent channel for their communication. How did museum websites evolve during the COVID-19 pandemic?

Please select your two top case studies at https://www.surveymonkey.com/r/BRRX57T by 20 December 2021. Your choice will be ours! We are looking forward to discovering your selection.


Bingham Nicola, “IIPC content development Group’s activities 2019-2020”, Netpreserve Blog, 2020.

Brügger Niels, “Welcome to WARCnet”, Aarhus, WARCnet Paper, 2020.

Geeraert Friedel and Bingham Nicola, “Exploring special web archives collections related to COVID-19: The case of the IIPC Collaborative collection. An interview with Nicola Bingham (British Library) conducted by Friedel Geeraert (KBR)”, Aarhus, WARCnet Paper, 2020.

Ruest Nick, Fritz Samantha, Deschamps Ryan, Lin Jimmy, Milligan Ian, “From archive to analysis: accessing web archives at scale through a cloud-based interface”, International Journal of Digital Humanities, 2021.


The Danish Coronavirus web collection – Coronavirus on the curators’ minds

By Sabine Schostag, Web Curator, The Royal Danish Library

Introduction – a provoking cartoon

In a sense, the story of Corona and the national Danish Web Archive (Netarchive) starts at the end of January 2020 – about 6 weeks before Corona came to Denmark. A cartoon by Niels Bo Bojesen in the Danish newspaper “Jyllands-Posten” (2020-01-26), showing the Chinese flag with a circle of yellow coronaviruses instead of the stars, caused indignation in China and captured attention worldwide. We focused on collecting reactions on different social media and in the international news media. Particularly on Twitter, a seething discussion arose with vehement comments and memes about Denmark.

From epidemic to pandemic

After that, the curators again focused on the daily routines in web archiving, as we believed that Corona (Covid-19) was a closed chapter in Netarchive’s history. But this was not the case. When the IIPC Content Development Working Group launched the Covid-19 collection in February, the Royal Danish Library contributed the Danish seeds.

Suddenly, the coronavirus arrived in Europe and the first infected Dane came home from a skiing trip in Italy. The epidemic turned into a pandemic. On March 12, the Danish Government decided to lock down the country: all public employees were sent to their home offices and borders were closed. The shutdown was not limited to the public sector: trade and industry, shops, restaurants, bars etc. had to close too. Only supermarkets were still open and people in the health care sector had to work overtime.

While Denmark came to a standstill, so to speak, the Netarchive curators worked at full throttle on the coronavirus event collection. Zoom became the most important work tool for the following 2½ months. In daily Zoom meetings, we coordinated who worked on which facet of this collection. To put it briefly, we curators had coronavirus on our minds.

Event crawls in Netarchive

The Danish Web Archive crawls all Danish news media at frequencies ranging from several times a day to once a week, so there is no need to include news articles in an event crawl. Thus, with an event crawl we focus on increased activity on social media, blog articles, new sites emerging in connection with the event – and reactions in news media outside Denmark.

Coronavirus documentation in Denmark

The Danish Web collection on coronavirus in Denmark is part of a general documentation of the corona lockdown in Denmark in 2020. This documentation is a cooperation between several cultural institutions: the National Archives (Rigsarkivet), the National Museum (Nationalmuseet), the Workers Museum (Arbejdermuseet), local archives and, last but not least, the Royal Danish Library. The corona lockdown documentation was supposed to be done in two steps: the “here and now” collection of documentation during the corona lockdown and a more systematic follow-up collecting materials from authorities and public bodies.

“Days with Corona” – a call for help

All Danes were asked to contribute to the corona lockdown documentation, for instance by sending photos and narratives from their daily life under the lockdown. “Days with Corona” is the title of this part of the documentation of the Danish Folklore Archives run by the National Museum and the Royal Library.

Netarchive also asked the public for help by nominating URLs of web pages related to coronavirus, social media profiles, hashtags, memes and any other relevant material.

Help from colleagues

Web archiving is part of the Department for Digital Cultural Heritage at the Royal Library. Almost all colleagues from the department were able to continue with their everyday work from their home offices. Many colleagues from other departments were not able to do so. Some of them helped the Netarchive team by nominating URLs, as this event crawl could keep curators busy more than 7½ hours a day. We used a Google spreadsheet for all nominations (Fig. 1).

Fig. 1 Nomination sheet for curators and colleagues from other departments and a call for contributions.

The Queen’s 80th birthday

On April 16, Queen Margrethe II celebrated her 80th birthday. One of the first things she did after the Corona lockdown, on March 13, was to cancel all her birthday celebration events. In a way, she set a good example, as everybody was asked not to meet with more than ten people; ideally, we should only socialize with members of our own household.

As part of the Corona event crawl, we collected web activity related to the Queen’s birthday, which mainly consisted of reactions on social media.

The big challenge – capturing social media

Knowledge of the coronavirus Covid-19 changes continuously. Consequently, authorities, public bodies, private institutions, and companies frequently change the information and precaution rules on their webpages. We try to capture as many of these changes as possible. Companies and private individuals offering safety gear for protection against the virus were another facet of the collection. However, capturing all relevant activity on social media was much more challenging than the frequent updates on traditional web pages. Most social media platforms use technologies that Heritrix (used by Netarchive for event crawling) is not able to capture.

Fig. 2 The Queen’s speech to the Danes on how to cope with the corona crisis. This was the second time in history (the first time was during World War II) that a Royal Head of State addressed the nation, apart from the annual New Year’s Eve speech.

More or less successfully, we tried to capture content from Facebook, TikTok, Twitter, YouTube, Instagram, Reddit, Imgur, Soundcloud, and Pinterest. Twitter is the platform we are able to crawl with Heritrix with rather good results. We collect Facebook profiles with an account at Archive-It, as they have a better set of tools for capturing Facebook. With frequent Quality Assurance and follow-ups, we also get rather good results from Instagram, TikTok and Reddit. We capture YouTube videos by crawling the watch-URLs with a specific configuration using youtube-dl. One of the collected YouTube videos comes from the Royal family’s YouTube channel: the Queen’s address to the people on how to behave to prevent or limit the spreading of the coronavirus (https://www.youtube.com/watch?v=TZKVUQ-E-UI, Fig. 2).

As Heritrix has problems with dynamic web content and streaming, we also used Webrecorder.io, although we have not yet implemented this tool in our harvesting setup. However, captures with Webrecorder.io are only drops in the ocean. The use of Webrecorder.io is manual: a curator clicks on all the elements on a page we want to capture. An example is a page on the BBC website, with a video of the reopening of Danish primary schools after the total lockdown (https://www.bbc.com/news/av/world-europe-52649919/coronavirus-inside-a-reopened-primary-school-in-the-time-of-covid-19, Fig. 3). There is still an issue with ingesting the resulting WARC files from Webrecorder.io into our web archive.

Danes produced a range of podcasts on coronavirus issues. We crawled the podcasts we had identified. We get good results when we have a URL to an RSS feed, which we crawl with XML extraction.
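To illustrate why an RSS feed makes podcast harvesting easier, here is a minimal sketch of pulling episode pages and audio file URLs out of a feed so they can be used as crawl seeds. It is not Netarchive's configuration (that is done inside Heritrix); the sample feed, the example.org URLs and the function name extract_seed_urls are all hypothetical and only serve as an illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed; the titles and example.org URLs are made up.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Hypothetical corona podcast</title>
    <item>
      <title>Episode 1</title>
      <link>https://example.org/podcast/ep1</link>
      <enclosure url="https://example.org/audio/ep1.mp3" type="audio/mpeg" length="123"/>
    </item>
    <item>
      <title>Episode 2</title>
      <link>https://example.org/podcast/ep2</link>
      <enclosure url="https://example.org/audio/ep2.mp3" type="audio/mpeg" length="456"/>
    </item>
  </channel>
</rss>
"""


def extract_seed_urls(feed_xml):
    """Return episode page URLs and audio file URLs, in feed order, without duplicates."""
    root = ET.fromstring(feed_xml)
    seeds = []
    for item in root.iter("item"):
        # The <link> element points at the episode's web page.
        link = item.findtext("link")
        if link and link.strip() not in seeds:
            seeds.append(link.strip())
        # The <enclosure> element carries the audio file in its url attribute.
        for enclosure in item.findall("enclosure"):
            url = enclosure.get("url")
            if url and url not in seeds:
                seeds.append(url)
    return seeds


if __name__ == "__main__":
    for seed in extract_seed_urls(SAMPLE_FEED):
        print(seed)
```

The point of the sketch is that the feed lists every audio file explicitly, so a crawler does not have to discover the files by rendering a player page.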

Fig. 3 Crawled with Webrecorder.io to get the video.

Capture as much as possible – a broad crawl

Netarchive runs up to four broad crawls a year. We launched our first broad crawl for 2020 right at the beginning of the Danish Corona lockdown – on March 14. A broad crawl is an in-depth snapshot of all dk-domains and all other Top Level Domains (TLDs) where we have identified Danish content. A side benefit of this broad crawl might be getting Corona-related content into the archive – content which the curators do not find with their different methods. We identify content both with classic keyword searches and using a variety of link scraping tools.

Is the coronavirus related web collection of any value to anybody?

In accordance with the Danish personal data protection law, the public has no access to the archived web material. Only researchers affiliated with Danish research institutions can apply for access in connection with specific research projects. We have already received an application for one research project dealing with values in the Covid-19 communication. We hope that our collection will inspire more research projects.

The Croatian Web Archive – what’s new?

The Croatian Web Archive (Hrvatski arhiv weba, HAW), launched in 2004, is open access. To celebrate its 15th anniversary, the National and University Library in Zagreb hosted the IIPC General Assembly and the Web Archiving Conference in June 2019. HAW has been the central point in Croatia for researching website development (.hr domain) and the HAW Team has also been organising training for librarians. One of HAW’s most recent projects was the development of the new portal.

By Karolina Holub, Library Adviser at the Croatian Digital Library Development Centre, Croatian Institute for Librarianship, Ingeborg Rudomino, Senior Librarian at the Croatian Web Archive, & Marta Matijević, Librarian at the Croatian Web Archive (National and University Library in Zagreb)

June 2019 – June 2020

It’s been more than a year since the National and University Library in Zagreb (NSK) hosted the IIPC General Assembly and Web Archiving Conference, which we remember with nostalgia.

Last year was a very busy year for the Croatian Web Archive (HAW) and we would like to share some of the key projects that we have been working on.

New portal design

The highlight of the last period was the launch of the new HAW portal.

Croatian Web Archive (HAW)

It was a complex project that took two years – from the initial idea to the launch of the portal in February 2020. The portal was developed and is maintained by NSK website developers and the HAW team. It is built on a customized WordPress theme. Since the new portal had to be integrated with the database of the archived content, which is maintained by our partner, the University of Zagreb University Computing Centre (SRCE), a lot of coding was required to connect the portal with the archive database and ensure that everything works properly and smoothly.

Below you can see snippets of our previous portals, used from 2006 until 2020:

HAW’s website from 2006 until 2011

HAW’s website from 2011 until 2020

So, what’s new?

The most important objective was to put the search box in focus for all types of crawls and give users an easier way to find a resource. Because of the diverse ways of searching, our goal was to have a clear distinction between selective crawls (which are indexed and can be searched by keyword, by any word in the title or URL, or via advanced search) and domain crawls (which can only be searched by entering the full URL). A valuable addition to this version of the portal is the set of basic metadata elements that accompany each resource (which has a catalogue record) available in the portal.

Archived resource with the basic metadata elements (available also via library catalogue)

Additionally, the browsing of subject categories has been expanded with subject subcategories.

The visibility of the thematic collections has been improved by placing them on the title page. A new feature, In Focus, has also been added to highlight some of the most important or interesting events or anniversaries happening in the country, city or at the Library in the form of blog posts. This feature is available only in the Croatian version of the portal. The central part of the homepage features the New in HAW and Gone from the web sections, where users can browse all publications that are new or publications that are no longer available on the live web. The About HAW page features a timeline marking all the important dates related to the history of HAW.

Some parts of the new portal have largely remained the same with only slight improvements to make them more user-friendly and up to date. More information about Selection criteria, National .hr domain crawls, Statistics, Bibliography, FAQ etc. can be found in the footer.

The portal is also available in English.

New thematic collections

During this one-year period, we have been working on six thematic collections. Some of them are already available and others are still ongoing:

Elections for the President of the Republic of Croatia 2019-2020

At the end of 2019, Presidential Elections were held in Croatia. The thematic crawl was conducted in January and the content is publicly available as part of this thematic collection.

Rijeka – European Capital of Culture 2020

The Croatian city of Rijeka is the European Capital of Culture 2020. All content related to this event, during this challenging time, will be harvested. We are still collecting the content.

Croatian Presidency of the Council of the European Union

Croatia chaired the Council of the European Union from January to June 2020. We are finishing this thematic collection and it will soon be publicly available on HAW’s portal.


Our largest thematic collection so far is definitely COVID-19, which is still ongoing. We have involved the public in collecting the content by inviting nominations related to the coronavirus. In this thematic collection, we follow the events, beginning with the onset of coronavirus in the Republic of Croatia and the world, as featured on Croatian portals, blogs and articles – from the outbreak of coronavirus, through general lockdown, to the gradual normalization in which we are now.

Archived website (19.03.2020)

2020 Zagreb earthquake

On March 22, just a few days after the start of the coronavirus lockdown in Croatia, Zagreb was hit by the biggest earthquake in 140 years, causing numerous injuries and extensive damage. The Croatian Web Archive immediately started collecting content about this disaster. This thematic collection is publicly available on HAW’s portal.

Archived website (15.04.2020) (photo by HINA; Damir Senčar)

2020 Parliamentary Elections

When the spread of the coronavirus was believed to be under control, Croatia held the Parliamentary Elections on July 5. The content for this collection will be collected until the constitution of the new Croatian Parliament.

In May of this year, we started cataloguing thematic collections at the collection level. We have also contributed the Croatian content to the IIPC Coronavirus (Covid-19) Collection.

Annual .hr crawl

In December 2019 we conducted the 9th annual domain crawl and collected 119 million resources amounting to 9.3 TB.

HAW also started the installation and configuration of tools for indexing and enabling full-text search for domain and thematic crawls: Webarchive-Discovery for parsing and indexing WARC files, Apache Solr for indexing and searching text content, and the SHINE web interface for index search and analysis. We are still in the testing phase and only a part of the existing crawled content is indexed.

Testing Web Curator Tool for new collaborative processes – Local Web Crowd crawls

A new development phase is the collaboration with public libraries in crawling their local history collections, for which we are testing the Web Curator Tool. We expect the first results by the end of November this year.

What’s next?

In the next months, we will be working on enabling more advanced use of HAW’s content to better suit the researchers, starting with the creation of the data sets from HAW collections. We will also prepare guidelines for using archived content on HAW’s portal. In addition, we are planning to update our training material according to the new IIPC training material. In the meantime, we invite you to explore our new portal.

Documenting COVID-19 and the Great Confinement in Canada

By Sylvain Bélanger, Director General, Transition Team, Library and Archives Canada and Treasurer, International Internet Preservation Consortium

It seemed like it happened overnight: suddenly we were told to work from home and limit our physical interactions with people outside our household until further notice. The information was changing and evolving very rapidly and, as we started seeing the rise in COVID-19 related cases globally, the anxiety among colleagues and employees was rising as well. Business rapidly ground to an almost complete halt and only essential services would continue to operate, with strict controls and restrictions.

Spanish Flu and the Great Confinement of 2020

Even during these early days, in these times of uncertainty, a group of individuals saw a parallel between the current situation and the period of the Spanish Flu a century earlier. Thinking ahead to fifty years from now, this group was asking the question – how will future generations know about this period of time, the Great Confinement of 2020 as they may call it, or the time of great creativity, or perhaps the time the Internet became our lifeline? Turning the clock back one hundred years to the period of the Spanish Flu has given us hints. Let’s not forget that the tragedy of the early 1900s was documented through newspapers, diaries, photographs, and publications detailing the fight against the Spanish Flu and its aftermath.

In 2020, when social media and websites are the key means citizens use to document events and stay informed, how do we capture such ephemeral content? Does any country have the answer? Isn’t that the question we often ask ourselves?

The importance of web archiving

Screenshot from the Public Health Agency of Canada website.

This period has given all of us an opportunity to educate news publishers, citizens, and government decision makers about the work done by web archiving teams across Canada and around the world. The efforts of the IIPC have been pushed to the forefront in this crisis, and have helped us demonstrate the importance of preserving web content for future generations.

In Canada the work entails a coordination of efforts with other governmental institutions as well as with university libraries and provincial/territorial archives to limit duplication of efforts. At Library and Archives Canada (LAC), to ensure a proper reflection of Canadian society, we have captured over 662,000 Tweets with hashtags such as #covidcanada, #covid19canada, #canadalockdown, #canadacovid19, as part of over 38 million digital assets collected for COVID-19 in 2020. Of that a little over 87% of the content is non-governmental, from media and non-media web resources selected for the COVID-19 collection. This includes 33 sites on Canadian news and media collected daily, to ensure we capture a robust sample of the published news on COVID-19. Added to that are non-media web resources that create an overall LAC seed list of over 900 resources. Total data collected to date is a little more than 3.09 TB at LAC alone.

Documenting the Canadian response

In addition to our web archiving program, LAC librarians have noticed an increase in books being published about the crisis. That has been measured through our ISBN team observing an increase in authors requesting ISBNs for books about various aspects of the pandemic. In addition, LAC will document the Government of Canada’s response to the COVID-19 pandemic through our Government Records Disposition Program. In this way, the government’s decision-making on COVID-19 and its impact on Canadians will be acquired and preserved by LAC for present and future generations. Also, our Private Archives personnel are monitoring the activities, responses and reactions of individuals, communities, organizations and associations within their respective portfolios. LAC will endeavour to acquire documents about the pandemic when discussing possible acquisitions with current and potential donors and when evaluating offers. Descriptions in archival fonds will now highlight COVID-19 content where appropriate.

The efforts undertaken to date at LAC are meant to document the Canadian response. Are our efforts enough to help citizens 100 years from now to understand the times we were living in, and how we responded to and tackled the challenges of COVID-19? Only time will tell whether this is enough, or whether we need to do more work to truly document the historical times we live in.

Luxembourg Web Archive – Coronavirus Response

By Ben Els, Digital Curator, The National Library of Luxembourg

The National Library of Luxembourg has been harvesting the Luxembourg web under the digital legal deposit since 2016. In addition to the large-scale domain crawls, the Luxembourg Web Archive also operates targeted crawls, aimed at specific subjects or events. During the past weeks and months, the global coronavirus pandemic has confronted society with unprecedented challenges. While large parts of our professional and social lives had to move even further online, the need to capture and document the implications of this crisis on the Internet has seen enormous support in all domains of society. While it is safe to admit that web archiving is still a relatively unknown concept to most people in Luxembourg (probably also in other countries), it is also safe to say that we have never seen a better case to illustrate the necessity of web archiving and ask for support in this overwhelming challenge.


Media and communities

At the National Library, we started our Coronavirus collection on March 16th, when there were 81 known cases in Luxembourg. While we have been harvesting websites in several event crawls for the past 3 years, it was clear from the start that the amount of information to be captured would surpass any other subject by a great deal. Therefore, we decided to ask for support from the Luxembourg news media, by asking them to send us lists of related news articles from their websites. This appeal to editors quickly evolved into a call for participation to the general public, asking all communities, associations, and civil interest groups to share their responses and online information about the crisis. Addressing the news media in the first place gave us great support in spreading the word about the collection. Part of our approach to building an event collection is to follow the news and take in information about new developments and publications of different organisations and persons of interest. As the flow and high-paced rhythm of new public information and support was vital to many communities, we also had to try and keep up with new websites, support groups and solidarity platforms being launched every day. However, many of these initiatives are not covered equally in the news or social media, a situation which is made even more complicated by Luxembourg’s multilingual makeup. We learned about the challenges the government and administrations face in conveying important and urgent information in 4 or 5 languages at a time: Luxembourgish, French, German, English and Portuguese. The same goes for news and social media, and as a result, for the Luxembourg Web Archive. Therefore, we were grateful to receive contributions from organisations which we would not have thought of including ourselves, and which were not talked about as much in the news.

© The Luxembourg Government

Effort and resources

While the need and support for web archiving exploded during March and April, it was also clear that the standard resources allocated to the yearly operations of the web archive would not suffice in responding to the challenge in front of us. The National Library was able to increase its efforts by securing additional funding, which allowed us to launch an impromptu domain crawl and to expand the data budget on Archive-It crawls. We are all aware of the uphill battle in communicating the benefits of archiving the web. There is a feeling that, while people generally agree on the necessity of preserving websites, in most cases there is little sense of urgency or immediate requirement – since after all, most everyday changes are perceived as corrections of mistakes, or improvements on previous versions. In my opinion, the case of Coronavirus-related websites made the idea of web archiving as a service and obligation to society much clearer and easier to convey.

© Ministry of Health

Private and public

The Web offers many spaces and facets for personal expression and communication. While social media have played a crucial part in helping people to deal with the crisis, web archives face some of their biggest challenges in harvesting and preserving social media. Alongside the technical difficulties and enormous related costs, there is the question of ethics in collecting content which is not 100% private, but also not 100% public. For instance, in Luxembourg, many support groups launched on Facebook, where people could ask their questions about the current situation and new developments in terms of what is allowed, and find help and comfort in their uncertainties. There are several active groups in every language, even some dedicated to districts of the city, with neighbours looking after each other. While it is important to try to capture all facets of an event (especially if this information is unique to the Internet), I am uncertain whether it is ethical to capture the questions, comments and conversations of people in vulnerable situations. Even though there are sometimes thousands of members per group and pretty much everyone can join, they are not fully open to the public.

Collecting and sharing


Besides the large-scale crawls and Archive-It collection, we also contributed part of our seed list to the IIPC’s collaborative Novel Coronavirus collection, led by the Content Development Working Group. Of course, the National Library did not limit its response to archiving websites. With our call for participation, we also received a variety of physical and digital documents, mainly from municipalities and public administrations, who submitted numerous documents issued to the public in relation to the reorganisation of public services and the temporary restrictions on social life.

We also received some unexpected contributions, in the form of poems, essays and short diary entries written during confinement, describing and reflecting upon the current situation from a very personal angle. Likewise, a researcher shared his private bibliometric analysis of scientific literature about the Coronavirus. Furthermore, the University of Luxembourg’s Centre for Contemporary and Digital History has launched the sharing platform covidmemory.lu, enabling ordinary people living or working in Luxembourg to share their photos, videos, stories and interviews related to COVID-19.

Web Archiving Week 2021

Since the 2021 edition of the IIPC Web Archiving Conference will be part of the Web Archiving Week, in partnership with the University of Luxembourg and the RESAW network, I am not going to spoil too much about the program by saying that we will continue exploring these shared efforts and responses during the week of June 14th – 18th, 2021. We are looking forward to welcoming you all to Luxembourg!

Covid-19 Collecting at the National Library of New Zealand

By Gillian Lee, Coordinator, Web Archives at the Alexander Turnbull Library, National Library of New Zealand

The National Library of New Zealand reflects on their rapid response collecting of Covid-19 related websites since February 2020.

Collecting in response to the pandemic

Web archivists at the National Library of New Zealand are used to collecting websites relating to major events, but the Covid-19 pandemic has had such a global impact that it has affected every member of society. It has been heartbreaking to see the tragic loss of life and economic hardships that people are facing worldwide. The effects of this pandemic will be with us for a long time.

Collecting content relating to these events always produces mixed emotions as a web archivist. There’s the pressure to collect content before it disappears, and in that regard we put on our hard hats and get on with it. At the same time, however, these events are raw and personal to each one of us and the websites we’ve collected reflect that.

IIPC Collaborative Collection

When the IIPC put out a call to contribute to the Novel Coronavirus Outbreak Collaborative Collection, we got involved. Initially New Zealand sources were commenting on what was happening internationally, so URLs identified were mainly news stories, until our first reported case of Coronavirus occurred in February and then we started to see New Zealand websites created in response to Covid-19 here. We continued to contribute seed URLs to the IIPC collection, but our focus necessarily switched to the selective harvesting we undertake for the National Library’s collections.


The New Zealand government instituted a four-level alert system on March 21 and we quickly moved to a level 4 lockdown on March 24. The lockdown lasted a month, before the country gradually moved down to level 1 on June 8.

The rapidly changing alert levels were reflected in the constantly changing webpages online. It seemed that most websites we regularly harvest had content relating to Covid-19. Our selective web harvesting team focussed on identifying websites that had significant Covid-19 content or were created to cover Covid-19 events during our rapid response collecting phase. Even then it was difficult to capture all changes on a website as they responded to the different alert levels.

We were working from home during this time and connected to Web Curator Tool through our work computers. The harvesting was consistent, but our internet connections were not always stable, so we often got thrown out of the system! If we had technical issues with any particular website harvest, by the time we resolved them, the pages online had sometimes shifted to another alert level! We also used Webrecorder and Archive-It for some of our web harvests.

Due to the enormous amount of Covid-19 content being generated and because we are a very small team (along with the challenges of working from home), what we collected could really only be a very selective representation.

Unite against Covid-19 – Unite for the Recovery

Unite Against Covid-19 harvested 18 March 2020.

One prominent website captured during this time was the government website ‘Unite Against Covid-19’ which was the go-to place for anyone wanting to know what the current rules were. This website was updated constantly, sometimes several times a day.

When we entered alert level one the website changed to “Unite for the Recovery.” We expect to be collecting this site for some time. While we have completed our rapid response phase we will be continuing to collect Covid-19 related material as part of our regular harvesting.

Unite for the Recovery harvested 9 June 2020.

Economic impact

Apart from official government websites, we captured websites that reflected the economic impact on our society, such as event cancellations and business closures. We documented how some businesses responded to the pandemic by changing production lines from clothing to making face masks and from alcohol production to making hand sanitiser. New products like respirators and PPE (personal protective equipment) were also being produced. Tourism is a major industry in New Zealand and, with border lockdowns still in place, advertising is now targeting New Zealanders. There is talk about extending this to a “Trans-Tasman” bubble to include Australia and possibly some Pacific Islands in the near future.

Social impact

As in many countries, community responses during lockdown provided both unique and shared experiences. New Zealanders were able to walk locally (with social distancing), so people put bears and other soft toys in their windows for kids (and adults) to count as they walked by. The daily televised 1pm Covid-19 updates from Prime Minister Jacinda Ardern and Director General of Health Dr Ashley Bloomfield during lockdown were compulsive viewing and generated memorabilia such as T-shirts, bags and coasters. These were all reflected in the websites we collected. We also harvested personal blogs such as ‘lockdown diaries’.

Web archiving and beyond
During this rapid collecting phase, the web archivists focussed on collecting websites, and that’s reflected in this blog post. There was also a significant amount of content we wanted to collect from social media such as memes, digital posters and podcasts, New Zealand social commentary on Twitter and email from businesses and associations. This has required considerable effort from the Library’s Digital Collecting and Legal Deposit teams. You can find out more about this in an earlier National Library blog post by our Senior Digital Archivist Valerie Love. We are also working with our GLAM sector colleagues and donors to continue to build these collections.

The French coronavirus (COVID-19) web archive collection: focus on collaborative networks

BnF’s Covid-19 web archive collection has drawn considerable media attention in France, including coverage in Le Monde, 20 minutes and TV Channel France 3. The following blog post was first published in Web Corpora, BnF’s blog dedicated to web archives.

By Alexandre Faye, Digital Collection Manager, Bibliothèque nationale de France (BnF)
English translation by Alexandre Faye and Karine Delvert

The current global coronavirus pandemic (Covid-19) poses an unprecedented challenge for web archiving activities. The impact on society is such that the ongoing collection requires several levels of coordination and cooperation at a national and international level.

Since it spread out of China and later developed in Europe, the coronavirus outbreak has become a pervasive theme on the web. This health crisis is being experienced in real time by populations that are simultaneously confined and largely connected, with a sense of emergency as well as underlying questioning. Archived websites, blogs, and social media should make up a coherent, significant and representative collection. They will be primary sources for future research, and they are already the trace and memory of the event.


At the end of January 2020, while the Wuhan megapolis was quarantined, the first hashtags #JeNeSuisPasUnVirus and #CORONAVIRUSENFRANCE appeared on Twitter. They denounced and exposed the stigma experienced by the Asian community in France. The Movement against racism and for friendship between peoples (Mouvement contre le racisme et pour l’amitié entre les peuples, MRAP) quickly published a page on its website entitled “a virus has no ethnic origin”. This was the first webpage related to coronavirus to be selected, crawled and preserved under French legal deposit.

Group dynamics

The coronavirus collection is not conceived as a project, in the sense that it would be programmed, would have a precise calendar and would be limited to predetermined topics. It grows as part of both the National and local news media and the Ephemeral News Current Topics collections. The National and local news media collection brings together around a hundred national and local press websites, including editorial content such as headlines and related articles, as well as Twitter accounts, which are collected once a day. The News Current Topics collection, which requires both a technical and an organizational approach, relies on the coordination of an internal network of digital curators, each working in their own field. It facilitates dynamic and reactive identification of web content related to contemporary issues and important events. By documenting the evolution, spread and overall impact of the pandemic in France, the archiving policy embraces all facets of the public health crisis: medical, social, economic, political and, more broadly, scientific, cultural and moral aspects.

“A virus has no ethnic origin”. Movement Against Racism and for Friendship Between Peoples (MRAP) website. Archive of February 21, 2020.

70 selected seed URLs were crawled in January and February, while the spread of the virus out of China seemed to be limited and under control. Since March 17, the date of the French lockdown, 500 to 600 seed URLs per week have been selected and assigned a crawl frequency: several times a day for social networks, daily for national and local press sites, weekly for news sections dedicated to the coronavirus, and monthly for articles and dedicated websites created ex nihilo. Thus the dedicated section of the economic review L’Usine nouvelle is crawled weekly, because it organizes a stream of articles. The less dynamic recommendation pages of the National Research and Security Institute (INRES) are assigned a monthly frequency.

By mid-April 2020, more than 2,000 selections and settings had been created. This reactivity is all the more necessary because certain web pages selected in the first phase have already disappeared from the live web.

The regional dimension

The geographical approach is also at the core of the archiving dynamics. The web does not entirely do away with territorial dimensions, as shown by research on this topic. One may even think that they have been reinforced, since the health crisis hit France just as the campaign for the municipal elections was under way.

The curators of partner institutions all over the French territory have spontaneously enriched the selections on the coronavirus health crisis. They contributed by taking local and regional content into account. This network is a key element of the national cooperation framework. Initiated in 2004 by the BnF, it relies on 26 regional libraries and archive services, which share the mission of print and web legal deposit by participating in collaborative nominations. Its contribution has proved significant, since over 50% of the websites nominated up to 15 April refer to local or regional content.

Simplified access to teleconsultation. ARS Guyana. Archived, April 5, 2020.

As a corollary, the crawl devoted to local elections was not suspended after the 1st poll (which took place on March 15th), although the second poll (due to take place the following weekend) was postponed and the whole electoral process suspended due to the crisis. In particular, the Twitter and Facebook accounts of the mayors elected in the 1st poll and those of the candidates who are still in contention for the 2nd poll have continued to be collected. These archives, which record the statements of mayors and candidates on the web during the weeks that preceded and followed the 1st poll of the local elections, already appear to be a major source for both electoral history and the history of the coronavirus pandemic in France.

Historic abstention rate in the local elections in the Oise “cluster”. francetvinfo.fr. Capture of March 16, 2020.

International cooperation

At the international level, the BnF, and through it the other participating French libraries, contributes to the archiving project “Novel Coronavirus (2019-nCoV) outbreak”. This initiative, launched in February 2020, is supported by the IIPC Content Development Group (CDG) in association with the Internet Archive. It brings together about thirty libraries and institutions around the world collaborating on this web archive collection. At the end of May, more than 6,800 preserved websites representing 45 languages had been put online on Archive-it.org and indexed in full text.

The BnF has for many years been pursuing a policy of cooperation with the IIPC to promote the preservation and use of web archives on an international scale. One of the research challenges is to facilitate comparisons between the different national webs, in particular for global and transnational phenomena such as #MeToo and the current health crisis. A first contribution was sent to the IIPC at the end of February. It consisted of a selection of 80 seeds made during the first phase of the pandemic, just before Europe overtook China as the main active center. Some of these pages have already disappeared from the live web.

In line with the IIPC’s new recommendations and considering the evolution of the pandemic in France, the next contribution to the IIPC should be a tight selection (almost 5% of the French collection) linked to high-priority subtopics, including: information about the spread of infection; regional or local containment efforts; medical and scientific aspects; social aspects; economic aspects; and political aspects. A third of those websites report on the medical domain. A second third provides information about French territories that are remote from Europe: French Guiana and the West Indies, Reunion and Mayotte. The last third concerns citizens’ initiatives and debates during the lockdown.

For example, the special website hosted by INED gives information on local excess mortality; articles from Madinin’art, Montray Kreyol and Free Pawol were selected by a local curator; and banlieues-sante.org is the website of an NGO which acts against medical inequality and has created a YouTube channel explaining protection measures in 24 languages, including sign language.

Dr François Ehlinger on EHPAD. Nicole Bertin’s Blog. Website capture from the Charente-Maritime region. Capture on April 3, 2020

What’s next?

Some of the websites nominated by the BnF and its partners tend to constitute a collective memory of the event. Until mid-April, the share of social networks represented 40% of the nominations, with a slight predominance of Twitter over Facebook. Although a large share is devoted to official accounts – namely, of institutions or associations (@AssembleeNat, @restosducoeur, @banlieuesante) or to accounts created ex nihilo (@CovidRennes, @CoronaVictimes, @InitiativeCovid), hashtags prevail in the set of selections.

The aim is to archive a representative part of individual and collective expression by capturing tweets around the most significant hashtags: multiple variations of the terms “coronavirus” and “confinement” (#coronavacances, #ConfinementJour29), criticism of the way the crisis has been managed (#OuSontLesMasques, #OnOublieraPas), the dissemination of instructions, and expressions of sympathy. Together they show a unique and characteristic mobilisation of citizens that follows the pace of the news (#chloroquine, #Luxfer).

Daniel Bourrion, “The virus journals” on face-ecran.fr. Archived April 3, 2020.

Because archives relating to the coronavirus account for the effects of the health crisis and of the lockdown in various domains, they end up overlapping with the themes to which the BnF and its partners pay particular attention, or for which focused crawls have already been conducted or are planned. Examples include digital literature and confinement diaries, relationships between the body and public health policies, epidemiology and artificial intelligence, and family life in confinement and feminism.

“Next” isn’t just a matter of finding a single way to promote this special archive collection, which remains a work in progress. It is neither a delimited project nor an already closed one. It is documentation for many kinds of research projects and also heritage for all of us.

Guide for confined parents. The French Secretariat for Equality (Le Secrétariat d’Etat chargé de l’égalité entre les femmes et les hommes et de la lutte contre les discriminations). Capture of April 10.