25 Years Preserving UK Government Web History

By Web Archiving Team at The National Archives


Today is a doubly special day for the UK Government Web Archive (UKGWA): as well as celebrating World Digital Preservation Day, we mark our 25th Anniversary.

To mark the occasion, we are releasing blog posts and a series of social media posts with facts and statistics, along with a time-lapse video of the GOV.UK website that shows how the design of the site, and the way the Government communicates with the public, has evolved. See @UKNatArchives and https://www.facebook.com/TheNationalArchives/ for more.

Maintaining and preserving the government web estate and its interlinking network of resources was the driver for the original Web Continuity Initiative. The TNA Web Archiving team began by capturing 50 websites in 2003, but the collection dates from 1996 when we received copies of UK government websites from the Internet Archive. Since then, we have been expanding our collection and increasing our capacity to handle the growing volume and complexity of websites over time.

The UK Government’s use of the web is extensive: websites present information, act as document stores, and provide dynamic transactional services. Often, the web is the only place where a piece of information is published. There is a tension between providing up-to-date information and ensuring that published information remains available in its original context for future reference. This is where the UKGWA steps in.

The UKGWA is a comprehensive, cloud-based and freely accessible archive for everyone, including students, historians, researchers, government employees, businesses and journalists, and it has become a point of reference in web archiving.

The numbers are impressive: there are around 6,400 websites in our collection, and it has over 644 social media accounts across YouTube, Twitter, and Flickr, with over 1.9 million posts archived. Around 63% of the pages we link to from our A-Z list are no longer available on the live internet. If the UKGWA were not here, much of the government’s online content since 1996 would probably have been lost and would be unavailable to the public.

User experience

As important as it is to preserve our collection, it is vital to provide a great experience for our users. We are currently investing in more user-friendly guidance for website owners. The new documentation aims to provide a straightforward understanding of the archiving process and the requirements for successfully capturing and publishing websites.

We are continually upgrading the technologies used to capture and replay websites to ensure the highest fidelity possible – a website in the archive should look and function, as far as possible, as the original site. We are investing in auto-QA technologies to ensure our collection is of the highest possible quality. And we are focusing on opening the data in our collection for researchers. We have been working on the development of tools and methods for this purpose which you will hear more about in 2022!

One of the earliest pieces of our collection, the HM Treasury website, dates from 1996.

UKWA update for the 2021 World Digital Preservation Day

By Andrew Jackson, Web Archiving Technical Lead, UK Web Archive, British Library

It’s World Digital Preservation Day 2021 #WDPD2021 so this is a good opportunity to give an update on what is going on today at the DPC Award Winning UK Web Archive.

Domain and frequent crawls

We run two main crawl streams, ‘frequent’ and ‘domain’ crawls. The ‘frequent’ one gets fresh content from thousands of sites every day. It also takes screenshots as it goes, so in the future we will know what the site was supposed to look like!

We’ve been systematically taking thousands of screenshots every day since about 2018.

The frequent crawls also refresh site maps every day, to make sure we know when content is added or updated. This is how websites make sure search engine indexes are up-to-date, and we can take advantage of that!
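As a rough illustration of how a sitemap can signal new or updated content, here is a minimal Python sketch. It is not the UKWA crawler's code: the function name, the sample data and the cut-off date are assumptions made for the example, and only the sitemaps.org XML namespace and the `<loc>`/`<lastmod>` elements come from the sitemap protocol itself.

```python
import xml.etree.ElementTree as ET
from datetime import date

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, for illustration only.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/news</loc><lastmod>2021-11-03T10:00:00+00:00</lastmod></url>
  <url><loc>https://example.org/about</loc><lastmod>2019-05-01</lastmod></url>
  <url><loc>https://example.org/contact</loc></url>
</urlset>"""


def urls_changed_since(sitemap_xml: str, since: date) -> list[str]:
    """Return the URLs whose <lastmod> date is on or after `since`."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS).strip()
        if not loc or not lastmod:
            continue  # no date given, so nothing can be inferred
        try:
            # <lastmod> is a date or a full datetime; compare on the date part.
            modified = date.fromisoformat(lastmod[:10])
        except ValueError:
            continue  # coarser values such as "2021" are skipped in this sketch
        if modified >= since:
            changed.append(loc)
    return changed


print(urls_changed_since(SAMPLE, date(2021, 11, 1)))
```

A real crawler would also need to fetch the sitemap, follow sitemap index files and decide what to do with entries that carry no date; none of that is shown here.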

The ‘domain’ crawl is different – it runs once a year and attempts to crawl every known UK website once. It crawls around two billion URLs from millions of websites, so it’s a bit of a challenge to run.

The last two domain crawls have been run in the cloud, which has brought many new challenges, but also makes it much easier to experiment with different virtual hardware configurations.

The 2021 domain crawl is still running. There’s been a few ups and downs, and it’s not going as fast as we’d ideally like, but it’s chugging away at 200-250 URLs a second, on track to get to around two billion URLs by the end of the year.

UKWA domain crawl: URL totals over 100 days.
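To put the crawl rate in perspective, the short Python calculation below estimates how long two billion URLs take at the quoted 200-250 URLs a second. It assumes a constant rate with no downtime, which a real crawl will not achieve, so treat it as a back-of-the-envelope figure only.

```python
TARGET_URLS = 2_000_000_000      # "around two billion URLs"
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_crawl(total_urls: int, urls_per_second: float) -> float:
    """Days needed to fetch total_urls at a constant rate."""
    return total_urls / urls_per_second / SECONDS_PER_DAY


slow = days_to_crawl(TARGET_URLS, 200)
fast = days_to_crawl(TARGET_URLS, 250)
print(f"{fast:.0f} to {slow:.0f} days")  # prints "93 to 116 days"
```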

All in all, we gather around 150TB of web content each year, and we’re over a petabyte in total now. This graph shows how the crawls have grown over the years (although it doesn’t include the 2020/2021 domain crawls as they are still on the cloud).

The legal framework we operate under means we can’t make all of this available openly on our website, but the curatorial team work on getting agreements in place to allow this where we can, as well as telling the system what should be crawled.

This graph shows the growth in crawl targets over the last four weeks.
And this shows the growth in openly-accessible and curated web sites and pages over the same time period.

Open source tools

All that hard work means many millions of web pages are openly accessible via www.webarchive.org.uk/ – you can look up URLs and also do full-text faceted search. Not all web archives offer full-text search, and it’s been a challenge for us. Our website’s search indexes are not as up-to-date or complete as we’d like, but we’re working on it. At least we can be glad that we’re not working on it alone. Our search indexing tools are open source and are now in use at a number of web archives across the world. This makes me very happy, as I believe our under-funded sector can only sustain the custom tools we need if we find ways to share the load.

This is why almost the entire UK Web Archive software stack is open source, and all our custom components can be found at https://github.com/ukwa/. Where appropriate, we also support or fund developments in open source tools beyond our own, e.g. the Webrecorder tools – we want everyone to be able to make their own web archives!


We also collaborate through the IIPC and with their support we help run Online Hours to support open source users and collaborators. These are regular online videocalls to help us keep in touch with colleagues across the world: see here for more details – Join us!

We’re very fortunate to be able to work in the open, with almost all code on GitHub. Some of our work has been taken up and re-used by others. Among these collaborators, I’d like to highlight the work of the Danish Net Archive. They understand Apache Solr better than us, and are working on an excellent search front-end called SolrWayback.

As well as full-text search, we also use this index to support activities specific to digital preservation by performing format analysis and metadata extraction during the text extraction process, and indexing that metadata as additional facets. We’ve not had time to make all this information available to the public, but some of it is accessible.

Working with data and researchers

Our Shine search prototype covers older web archive material held by the Internet Archive and if you know the Secret Squirrel Syntax, you can poke around in there and look for format information e.g. by MIME type, by file extension or by first-four-bytes. We also generate datasets of information extracted from our collections, including format statistics, and make this available as open data via the new Shared Research Repository. But again, we don’t always have time to work on this, so keeping those up to date is a big challenge.
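To make the three format signals mentioned above concrete, here is a minimal Python sketch that derives them for a single resource. It is an illustration only, not the UKWA indexer: the function name and the dictionary keys are hypothetical and do not reflect the actual field names in the Shine index.

```python
import posixpath
from urllib.parse import urlparse


def format_facets(url: str, content_type: str, payload: bytes) -> dict[str, str]:
    """Derive three simple format signals for one archived resource."""
    # File extension taken from the URL path, ignoring any query string.
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower().lstrip(".")
    # MIME type as served, without parameters such as "; charset=utf-8".
    mime_type = content_type.split(";", 1)[0].strip().lower()
    # First four bytes of the payload, as hexadecimal.
    first_four = payload[:4].hex()
    return {
        "mime_type": mime_type,          # hypothetical key name
        "extension": extension,          # hypothetical key name
        "first_four_bytes": first_four,  # hypothetical key name
    }


facets = format_facets(
    "http://example.org/docs/report.PDF?download=1",
    "application/pdf; charset=binary",
    b"%PDF-1.4\n",
)
print(facets)
```

Comparing the three values is what makes them useful for preservation work: a resource served as one MIME type whose leading bytes suggest another format is a candidate for closer inspection.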

One way to alleviate this is to partner with researchers in projects that can fund the resources and bring in the people to do the data extraction and analysis, while we facilitate access to the data and work with our British Library Open Research colleagues to give the results a long-term home. This is what happened with the recent work on how words have changed meaning over time, a research project led by Barbara McGillivray, and we’d like to support these kinds of projects in the future too.

Another tactic is to open up as much metadata as we can, and provide that via APIs that others can then build on. This was the notion behind our recent IIPC-funded collaboration with Tim Sherratt to add a web archives section to the GLAM Workbench, a collection of tools and examples to help you work with data from galleries, libraries, archives, and museums. We hope the workbench will be a great starting point for anyone wanting to do research with and about web archives.

If you’d like to know more about any of this, feel free to get in touch with me or with the UK Web Archive.

Happy World Digital Preservation Day!


Analysing Web Archives of the COVID-19 Crisis through the IIPC collaborative collection: early findings and further research questions

By the AWAC2 (Analysing Web Archives of the COVID-19 Crisis) team: Susan Aasman (University of Groningen, The Netherlands), Niels Brügger (Aarhus University, Denmark), Frédéric Clavert (University of Luxembourg, Luxembourg), Karin de Wild (Leiden University, The Netherlands), Sophie Gebeil (Aix-Marseille University, France), Valérie Schafer (University of Luxembourg, Luxembourg) 

As mentioned by Nicola Bingham in her blog post “IIPC Content Development Group’s activities 2019-2020” in July 2020, a huge effort has been made by the Content Development Group and IIPC members to create a unique collection of web material related to the pandemic, with contributions from over 30 members as well as public nominations from over 100 individuals/institutions.

This collection immediately attracted the interest of researchers because of the quantity of collected data, its transnational nature and the many possibilities it offers to explore web archives of this unprecedented period at an international level.

A strong interest in COVID-19 collections in WARCnet

The WARCnet project was launched at the beginning of 2020, just as the world was witnessing the first developments in the COVID-19 crisis.

WARCnet is a network of researchers and web archiving institutions (see WARCnet team) which aims to promote transnational research that will help us to understand the history of (trans)national web domains and transnational events on the web, drawing on the increasing volume of digital cultural heritage held in national web archives (Brügger, 2020). The network’s activities started in 2020 and will run until 2023, and they are funded by the Independent Research Fund Denmark | Humanities (grant no 9055-00005B). The network is organised into six working groups, with working group 2 (WG2) focusing on the study of transnational events through web archives.

WG2 decided to select the COVID-19 crisis as one of its first case studies and test beds, and it conducted a first distant reading of several collections related to the crisis, combining metadata from the special collections of national institutions like the British Library, the BnF (National Library of France) and INA (the French National Audiovisual Institute) in France, the BnL (National Library of Luxembourg), etc., with the IIPC collection. The WG2 received metadata in the form of seed lists of these collections thanks to web archiving institutions and a special agreement with them.

At the same time a series of oral interviews were carried out with web archivists and web curators to shed more light on the selection and curation processes and the scope of these special collections, including an interview by Friedel Geeraert with Nicola Bingham on the IIPC collection (Geeraert and Bingham, 2020). Transcriptions of this series of oral interviews are available online for free download. You will find interviews with web archivists and curators working at INA, the BnF, the BnL, the IIPC, Netarkivet (Denmark), the National Széchényi Library in Hungary, the UK Web Archive, the Swiss National Library, the National Library of the Netherlands (KB) and the Icelandic Web Archive. Other interviews are scheduled.

An internal WG2 datathon on the metadata received from web archiving institutions held at the very beginning of 2021 enabled us for example to compare national collections with one another and with the selection they made for the IIPC coronavirus collection (table 1) and to measure the websites that emerged during the pandemic (table 2). Another goal of WG2 was to compare the metadata provided by institutions (table 3) and the way they could be intertwined.

Table 1: Presentation by Friedel Geeraert of her hypothesis related to overlaps between national collections and IIPC collection and results (final meeting of our datathon)

Table 2: Presentation by Katharina Schmid and Friedel Geeraert of their hypothesis related to “COVID-19 websites” and results (final meeting of our datathon)

Table 3: Overview and comparison of the data fields provided by each web archive, conducted by Karin de Wild and Niels Brügger

The data was provided by the following web archives: RDL (Royal Danish Library); BnF; NSL (National Széchényi Library, Hungary); IIPC; BnL; KB (Koninklijke Bibliotheek, the Netherlands); UKWA (UK Web Archive).

A new opportunity and a multi-partner collaboration

This first exploration led to the desire to go deeper into COVID-19 collections, and a unique chance was offered to us when the Archives Unleashed team launched its annual call for cohorts.


Some of the WG2 members therefore decided to submit a proposal entitled AWAC2, which stands for “Analysing Web Archives of the COVID-19 Crisis through the IIPC Novel Coronavirus Dataset”. With this application, we hoped to deepen our understanding of the IIPC COVID-19 collection at several levels:

First, it was a way to continue our initial distant exploration of (meta)data, in order to answer qualitative questions such as:

(1) Participation in the collection by web archiving institutions and web archive representatives
— how many URLs inside/outside ccTLDs?
— can we indirectly document some countries with no national web archives through this collection?
— comparison of national collections and their selection for IIPC (based on Danish and Luxembourgish collections)
(2) Categories of stakeholders, websites, representativeness and inclusiveness
(3) New event-specific websites
(4) MIME types and visual studies
(5) Hyperlink networks
What characterises the hyperlink network of websites included in the IIPC collection? Can any national website clusters be identified? (This analysis could be performed in Gephi).
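As an illustration of the first question in the list above (URLs inside and outside ccTLDs), the following Python sketch counts seed URLs by type of top-level domain. It is not the AWAC2 team's code: the function names and sample URLs are assumptions, and it relies on the convention that country-code TLDs are the two-letter ASCII ones.

```python
from collections import Counter
from urllib.parse import urlsplit


def tld_of(url: str) -> str:
    """Return the last label of the host name, or '' if there is none."""
    host = urlsplit(url).hostname or ""  # None for URLs without a scheme
    return host.rsplit(".", 1)[-1] if "." in host else ""


def cctld_split(urls) -> Counter:
    """Count URLs under a two-letter (country-code) TLD versus all others."""
    counts = Counter()
    for url in urls:
        tld = tld_of(url)
        # Internationalised ccTLDs (xn--...) fall under "other" in this sketch.
        is_cc = len(tld) == 2 and tld.isascii() and tld.isalpha()
        counts["ccTLD" if is_cc else "other"] += 1
    return counts


# Hypothetical seed list, for illustration only.
seeds = [
    "https://www.bl.uk/coronavirus",
    "https://www.bnf.fr/",
    "https://www.who.int/",
    "https://example.com/covid",
]
print(cctld_split(seeds))
```

Replacing the two-way split with `counts[tld] += 1` would give a per-TLD breakdown, which is one simple way to see which countries are represented in a seed list even when they have no national web archive.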

Second, it was an opportunity to obtain complementary data and to combine research methods, especially distant and close reading, thanks to the possibility of accessing full text.

Third, it was also a unique chance to further explore the possibilities offered by the Archives Unleashed tools that the team has been developing for many years, which we were introduced to with the pre-workshop organised by Ian Milligan and Nick Ruest at the 2019 RESAW conference in Amsterdam and have been following ever since through reports about their activities and academic papers (Ruest et al., 2020). It was also an opportunity to benefit from regular discussions with the team, enrich our computational skills and create a research dynamic with an efficient and impressive team in Canada.

Finally, it was a way to explore new research questions that we had in mind from the beginning, and which are more related to topical approaches (e.g. Women, gender and COVID-19). We will come back to this in the last section.

First exciting steps

We are very thankful to the IIPC, Archive-It and the Archives Unleashed team for selecting our project and for the agreement that was signed to access this large dataset. Since the end of August, following the launch of the Cohort Programme in July 2021 that enabled us to meet all the participants and cohorts in the yearly programme, the AWAC2 team has been able to explore the new Archives Unleashed interface within Archive-It and the many datasets and visualisations that have been made available to us (figures 1 and 2).

Figure 1: An interface made available by the Archives Unleashed team to easily download data in a secure way and select between MIME types, domain names, full text, web graph, etc.

Figure 2: An interface to visualise some samples (here a top hosts sample)

The technical skills within the AWAC2 team are heterogeneous and while some members immediately began analysing data, others initially struggled to download some datasets, as the collection contains a huge amount of data and requires computer skills. However, the team is now on track and greatly benefits from the two regular monthly meetings with the Archives Unleashed team, whose availability to answer questions, explore technical issues with us and share (and explain) notebooks is amazing.

The AWAC2 team immediately started mapping data, framing the scope (table 4), discussing methods and creating samples to study several aspects related to multilingualism and stakeholders represented in web archives (tables 5 to 8).

Table 4: Extract of the overview of the dataset produced by Niels Brügger

Table 5: Analysis by Frédéric Clavert of count records by crawl date in the IIPC collection (30% sample)

Table 6: Extract of a visualisation of the 50 most archived domains from a diachronic perspective (F. Clavert)

Table 7: Visualisation of crawl frequency by language thanks to pandas and Altair libraries for Python, full sample (F. Clavert)

Table 8: A first distant reading of randomly selected French content (30% sample) using Iramuteq (F. Clavert)

Full text access also gives us an opportunity to combine methodologies and use tools like Iramuteq that allow text mining. This is the next step we are hoping to achieve…

Taking things further… with you!

To deepen collaboration with IIPC members and continue the fruitful dialogue that we began when we started by exploring the IIPC collection, we will of course continue to share our results and our insights into the crisis gleaned from web archives and this COVID-19 collection, which may in itself become an object of study as a mirror of web archiving practices, curation and methodologies. We also want to respond to the interest of the IIPC community by sharing our research questions with you, and we would like to invite you to vote for the research questions that we should investigate first.

The multidisciplinary nature and wide-ranging fields of expertise of the team have led to a long list of research interests, and the team is planning to meet for three days in March 2022 to conduct a test bed on one or two case studies. Our case studies will be the ones that you select:

  1. Research on Women, Gender and COVID-19 within this collection (e.g. domestic violence, care and homeschooling, etc.). We will probably use Iramuteq or Mallet on derivative files to perform text mining.
  2. Identify private journals of lockdowns, individual traces of daily life and different online expressions that offer insights into the ways people are dealing with COVID-19 in their everyday lives.
  3. Trace public support/opposition to lockdown. Can we conduct a sentiment analysis over time?
  4. How was the home schooling debate conducted on the web? How did the various stakeholders communicate about it?
  5. How can we identify fake news, conspiracy theories and other COVID-19-related controversies within these big data?
  6. Is it possible to perform a visual analysis of what medical-scientific communication on COVID-19 looks like (and what type of visual communication is used, e.g. graphs, visuals, colours)?
  7. The pandemic seriously affected museums around the world and in some countries the web became a prominent channel for their communication. How did museum websites evolve during the COVID-19 pandemic?

Please select your two top case studies at https://www.surveymonkey.com/r/BRRX57T by 20 December 2021. Your choice will be ours! We are looking forward to discovering your selection.


Bingham Nicola, “IIPC content development Group’s activities 2019-2020”, Netpreserve Blog, 2020.

Brügger Niels, “Welcome to WARCnet”, Aarhus, WARCnet Paper, 2020.

Geeraert Friedel and Bingham Nicola, “Exploring special web archives collections related to COVID-19: The case of the IIPC Collaborative collection. An interview with Nicola Bingham (British Library) conducted by Friedel Geeraert (KBR)”, Aarhus, WARCnet Paper, 2020.

Ruest Nick, Fritz Samantha, Deschamps Ryan, Lin Jimmy, Milligan Ian, “From archive to analysis: accessing web archives at scale through a cloud-based interface”, International Journal of Digital Humanities, 2021.


IIPC Collaborative collection: “Afghanistan regime change (2021) and the international response”

By Nicola Bingham, Lead Curator, Web Archiving, the British Library; Co-chair, IIPC Content Development Working Group

On 4th October 2021 the Content Development Group (CDG) initiated a thematic website collection in response to recent developments in Afghanistan at the behest of several CDG members.


Recent events in Afghanistan have precipitated a humanitarian crisis which escalated markedly after foreign armed forces withdrew from the country in May 2021.1 As US and Allied troops retreated, the Taliban quickly gained ground, seizing cities across the country, increasing threats of a worsening civil war. The Taliban have now claimed control of all major cities in Afghanistan, including the capital Kabul, where fighters have seized the presidential palace, forcing the president to flee. The Afghan government which was supported by the US and the Allies has collapsed and there has been a transition of power to the Taliban.

As violence intensifies across large areas of the country, civilians are being caught up in the fighting and hundreds of Afghans have been killed in recent weeks, while thousands have been forced to flee their homes.

The Department of Defense is committed to supporting the U.S. State Department in the departure of U.S. and allied civilian personnel from Afghanistan, and to evacuate Afghan allies safely. (U.S. Air Force photo by Staff Sgt. Brandon Cribelar)
Public domain, via Wikimedia Commons: https://commons.wikimedia.org/wiki/File:Operation_Allies_Refuge_210819-F-DT970-0064.jpg

The humanitarian crisis is obviously of great concern internationally; however, the cultural heritage of Afghanistan is also under threat. As described by Richard Ovenden in an article in the Financial Times (24th September 2021), the global Library and Archive community has been trying to do what it can, from concerted efforts to help Afghans working in the cultural heritage sector to leave the country, to supporting the preservation of cultural artefacts including digital materials.2

It is likely that the new regime will want to bring the Internet under greater censorship and control,3 meaning that web content and the information it contains is at risk. Alongside the internal threat is the risk that foreign internet service providers, largely based in the US, could turn off cloud servers, social media platforms, etc. if America decided to act on the threat to impose sanctions on Afghanistan.4

Existing collecting efforts

Afghanistan Web Archive at the Library of Congress: https://www.loc.gov/collections/afghanistan-web-archive/ 

Rapid response collecting of at risk Afghan Internet content has already been undertaken by several archiving institutions, alongside ongoing Afghanistan collections curated by the Library of Congress. Examples include:

The CDG does not wish to duplicate these efforts but rather to complement them by focussing on the international aspects of events in Afghanistan, documenting transnational involvement and worldwide interest in the process of the change of regime, recording how the situation evolves over time.

Content Scope

With this in mind, the Afghanistan collection has been scoped so that it adheres to the broader content development policy of the CDG, namely that it meets the following criteria:

  • It is of high interest to IIPC members;
  • It does not map to any one member’s responsibility or mandate;
  • It is of higher value to research because it represents more perspectives than similar collections in only one member archive would do;
  • It is transnational in scope, but not necessarily “global”.

Taliban/IS content

The aim of any CDG collection is to reflect multiple viewpoints and to preserve a snapshot of society as it was at the time of archiving. It will be important to researchers that websites from across the spectrum of all human activity are collected in order to present a more accurate picture of the times.

Websites produced by the Taliban or IS, or that are pro-Taliban/IS, can be included in the collection. Most Government websites will begin to express pro-Taliban views in any case.

The Taliban/IS are likely to have used communication networks that cannot be archived for technical reasons, e.g. Facebook/WhatsApp and so this type of content will be excluded.

Daniel Wilkinson (U.S. Department of State), Public domain, via Wikimedia Commons https://commons.wikimedia.org/wiki/File:Afghan_females_using_internet_in_Herat.jpg

Sub-topics may include:

  • Military experience in Afghanistan; nations withdrawing armed forces from Afghanistan; statements of defence and military analysis
  • Analysis and policy of think tanks such as Chatham House (UK), Brookings Institute and RAND International Affairs (US), for example
  • Afghan refugees in Pakistan, Iran and elsewhere
  • International relief efforts (The Red Cross, United Nations etc.)
  • Diaspora communities – Afghan people around the world
  • Human rights/Women’s rights/LGBTQ+ rights
  • Foreign embassies and diplomatic relations
  • Sanctions imposed against Afghanistan by foreign powers
  • Transnational websites and social media (SoundCloud, Squarespace, Twitter, WordPress, YouTube, Facebook Group pages (not individual Facebook profiles)) about Afghanistan from any country and in any language.

The list is not exhaustive, and it is expected that contributors may wish to explore other sub-topics within their own areas of interest and expertise, providing they are within the general collection development scope.


The lead curator for this collection is Nicola Bingham. She will be responsible for developing the content strategy, overseeing the progression of the collection, and promoting the collection to potential users.

IIPC members together with a wide number of stakeholders in the Library and Archive community, including staff at the Bodleian Libraries, Oxford, as well as members of the public are expected to contribute to the collection (see below for details about how to contribute).

Crawls are being undertaken in Archive-It by Janko Klasinc (National and University Library, Slovenia) and Carlos Lelkes-Rarugal (Assistant Web Archivist, British Library).

Size of collection

The CDG’s full budget in 2021 is 4 TB, of which 1.8 TB had been used by the end of September. The CDG plans to undertake small crawls for our ongoing and new collections as follows:

  • 2020 Summer Olympics and Paralympics [held in 2021]
  • Novel Coronavirus (COVID-19)
  • Intergovernmental Organizations
  • National Olympic and Paralympic Committees.

At this stage, c. 400 GB of data has been allocated to the Afghanistan collection.


Nominations will be sought from IIPC Members and external agencies such as the UK Legal Deposit Libraries, University Libraries and the Library and Archive community. A Google form will be sent out to elicit nominations from non-IIPC members and members of the public. This form contains the relevant metadata fields which will populate a Google sheet. The aim of distributing the work of co-curation for the collection is to enable a diverse range of communities and individuals to contribute, including members of the Afghanistan community, helping to ensure that the collection is as representative as possible.

IIPC Members will be able to add their nominations directly to a Google sheet which will be reviewed by the lead curator against collection scope and marked for inclusion in the collection.


Access to the collection will be through the Internet Archive’s Archive-It interface. Metadata will be exposed as facets on the collection home page and will be browseable by users.

How to contribute:

  1. Please read the Collection Scoping Document. This goes into more detail about what is in and out of scope
  2. If you are an IIPC member, please nominate URLs and add basic metadata to this Google Sheet
  3. If you are not an IIPC member you may contribute nominations and a small amount of basic metadata on this Google form.


1 Kiely, E. and Farley, R. Timeline of U.S. Withdrawal from Afghanistan. August 17, 2021. FactCheck.org

2 Ovenden, R. The Battle for Afghanistan’s libraries. September 24, 2021. Financial Times. https://www.ft.com/content/82fffcc8-3631-48dc-829d-44f237549a59

3 Afghanistan’s Internet: who has control of what? Goman Web. September 20, 2021. https://gomanweb.net/2021/09/20/afghanistans-internet-who-has-control-of-what/ Digital oppression in Afghanistan. NordVPN Blog. August 20, 2021. https://nordvpn.com/blog/digital-oppression-in-afghanistan

Baibhawi, R. Taliban Shuts Internet In Panjshir To Stop Northern Alliance From Galvanizing Support. August 29, 2021. Republic. https://www.republicworld.com/world-news/rest-of-the-world-news/taliban-shuts-internet-in-panjshir-to-stop-northern-alliance-from-galvanizing-support.html

Vavra, S. and Falzone, D. This Is Why the Taliban Keeps F*cking Up the Internet. September 16, 2021. Daily Beast. https://www.thedailybeast.com/this-is-why-the-taliban-keeps-fcking-up-afghanistans-internet

Sorkin, A. R., Karaian, J., Kessler, S., Gandel, S., Hirsch, L., Livni, E. and Schaverien, A. Big Tech and the Taliban. August 19, 2021. The New York Times. https://www.nytimes.com/2021/08/19/business/dealbook/taliban-social-media.html

4 Stokel-Walker, C. The battle for control of Afghanistan’s internet. September 7, 2021. Wired. https://www.wired.co.uk/article/afghanistan-taliban-internet

5 Gomes, P. Automated seed selection to preserve Afghan sites (Arquivo.pt). IIPC Curating Special Collections Workshop, September 24, 2021. https://youtu.be/Aa_-BBnEr8I

IIPC Steering Committee Election 2021 Results

The 2021 Steering Committee Election closed on Friday, 15 October. The following IIPC member institutions have been elected to serve on the Steering Committee for a term commencing 1 January 2022:

We would like to thank all members who took part in the election either by nominating themselves or by taking the time to vote. Congratulations to the new and re-elected Steering Committee Members!

The Spanish Web Archive as a training field for Natural Language Processing models

By Alicia Pastrana García and José Carlos Cerdán Medina, National Library of Spain


In the last 20 years most web archives have been building their website collections. They will be very valuable as the years go by, as much of this information will no longer exist on the Internet. However, do we have to wait that long for our collections to be useful?

The huge amount of information the National Library of Spain (BNE) has collected since 2009 has emerged as one of the largest linguistic corpora of the current language. For this reason, BNE has collaborated with the Barcelona Supercomputing Center (BSC) to create the first massive AI model of the Spanish language. This collaboration takes place within the framework of the Language Technologies Plan of the State Secretariat for Digitization and Artificial Intelligence of the Ministry of Economic Affairs and Digital Agenda of Spain.

The players

The National Library of Spain has been harvesting information from the web for more than 10 years. The Spanish Web Archive is still young but it already contains more than a petabyte of information.

The Barcelona Supercomputing Center (BSC), for its part, is the leading supercomputing center in Spain. It offers infrastructure and supercomputing services to Spanish and European researchers, in order to generate knowledge and technology for society.

The data

The Spanish Web Archive, like most national library web archives, is based on a mixed model that combines broad and selective crawls. The broad crawls harvest as many Spanish domains as possible without going very deep into the navigation levels. Their scope is the .es domain. Selective crawls complement the broad crawls and harvest a smaller sample of websites, but in greater depth and with greater frequency. The sites are selected for their relevance to history, society and culture. Selective crawls also include other kinds of domains (.org, .com, etc.).

Web Curators, from the BNE and the regional libraries, select the seeds that will be part of these collections. They assess the relevance of the websites from the heritage point of view and the importance for research and knowledge in the future.

For this project we chose the content harvested on selective crawls, a collection of around 40,000 websites.

How to prepare WARC files

The harvested content is stored in WARC files (Web ARChive file format). The BSC only needed the text extracted from the WARC files to train the language models, so everything else was removed using a dedicated script. The script uses a parser to keep exclusively the HTML text tags (paragraphs, headlines, keywords, etc.) and discards everything that is not useful for the purpose of the project (e.g. images, audio, video).

The parser is an open source Python module called Selectolax. It is seven times faster than others and is easily customizable. Selectolax can be configured to keep the tags that contain text and to discard those that are not useful for the project. At the end of the process, the script generated JSON files organized according to the selected HTML tags, structuring the information into paragraphs, headlines, keywords, etc. These files are not only useful for the project, but will also help us improve the Spanish Web Archive full-text search.
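The tag-based extraction described above can be illustrated with a minimal sketch. It is not the script used in the project: it uses Python's standard html.parser module as a stand-in for Selectolax so that it runs without extra dependencies, and the tag-to-field mapping (KEEP), the class name TextByTag and the function extract_text are assumptions made for illustration. The HTML string is assumed to have already been read from a WARC response record.

```python
import json
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to output fields; the tag selection
# used by the real script is not given in the post.
KEEP = {"p": "paragraphs", "h1": "headlines", "h2": "headlines", "h3": "headlines"}
SKIP = {"script", "style"}  # content that is never useful as training text


class TextByTag(HTMLParser):
    """Collect text from selected tags and ignore everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.result = {"paragraphs": [], "headlines": [], "keywords": []}
        self._field = None      # output field of the kept tag currently open
        self._buffer = []       # text pieces seen inside that tag
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif tag in KEEP:
            self.flush()        # close any kept tag left open
            self._field = KEEP[tag]
        elif tag == "meta":
            attrs = dict(attrs)
            if (attrs.get("name") or "").lower() == "keywords":
                content = attrs.get("content") or ""
                self.result["keywords"] += [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag in SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in KEEP:
            self.flush()

    def handle_data(self, data):
        if self._field and not self._skip_depth:
            self._buffer.append(data)

    def flush(self):
        # Collapse whitespace and store the text under the current field.
        text = " ".join("".join(self._buffer).split())
        if self._field and text:
            self.result[self._field].append(text)
        self._field, self._buffer = None, []


def extract_text(html):
    """Return a dict of text grouped by field, ready to be dumped as JSON."""
    parser = TextByTag()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.result


if __name__ == "__main__":
    page = "<h1>Archivo de la Web</h1><p>Texto <b>de</b> ejemplo.</p><img src='a.png'>"
    print(json.dumps(extract_text(page), ensure_ascii=False, indent=2))
```

Grouping the output by field mirrors the JSON structure mentioned above (paragraphs, headlines, keywords); images and other media are dropped simply because their tags are never selected.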

All this work was done in the Library itself, in order to obtain more manageable files. The huge volume of information was a challenge: it was not easy to transfer the files to the BSC, where the supercomputer is located. Hence the importance of starting with this cleaning process in the Library.

Once the files were at the BSC, a second cleaning process was run. The BSC project team removed everything that was not well-formed text (unfinished or duplicated sentences, erroneous encodings, other languages, etc.). The result was only well-formed text in Spanish, as it is actually used.
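As a rough illustration of this kind of filtering, the sketch below drops duplicated, unfinished and badly encoded paragraphs. It is not the BSC pipeline: the function name clean_paragraphs and all of the heuristics (terminal punctuation as a sign of a finished sentence, the replacement character or "Ã" as a sign of a broken encoding) are assumptions chosen for simplicity, and language identification is left out entirely.

```python
import unicodedata

# Assumed heuristic: a finished sentence ends with one of these characters.
TERMINAL = (".", "!", "?", "…", '"', "»", ")")


def clean_paragraphs(paragraphs):
    """Keep only paragraphs that look complete, well encoded and not yet seen."""
    seen = set()
    kept = []
    for raw in paragraphs:
        text = " ".join(raw.split())  # collapse runs of whitespace
        if not text or not text.endswith(TERMINAL):
            continue  # empty or apparently unfinished
        if "\ufffd" in text or "Ã" in text:
            continue  # crude sign of an erroneous encoding (mojibake)
        # Compare case-insensitively on a normalized form to catch duplicates.
        key = unicodedata.normalize("NFKC", text).casefold()
        if key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept


if __name__ == "__main__":
    sample = [
        "La biblioteca conserva la web.",
        "la biblioteca conserva  la web.",   # duplicate after normalization
        "Frase sin terminar y",              # unfinished
        "CodificaciÃ³n rota.",               # broken encoding
    ]
    print(clean_paragraphs(sample))
```

A production pipeline would need far more robust rules, but the order of checks shows the idea: cheap filters first, deduplication last, on text that has already passed them.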

BSC used the supercomputer MareNostrum, the most powerful computer in Spain and the only one capable of processing such a volume of information in a short time frame.

The language model

Once the files were prepared, the BSC used a neural network technology based on the Transformer architecture, already proven with English, and trained it to learn to use the language. The result is an AI model that is able to understand the Spanish language, its vocabulary, and its mechanisms for expressing meaning and writing at an expert level. This model is also able to understand abstract concepts and to deduce the meaning of words according to the context in which they are used.

This model is larger and better than the other models of the Spanish language available today. It is called MarIA and is open access. This project represents a milestone both in the application of artificial intelligence to the Spanish language and in collaboration between national libraries and research centers. It is a good example of the value of collaboration between different institutions with common objectives.

MarIA has many possible uses: language correctors or predictors, auto-summarization apps, chatbots, smart searches, translation engines, auto-captioning, etc. These are all broad fields that promote the use of Spanish for technological applications, helping to increase its presence in the world. In this way, the BNE fulfils part of its mission, promoting scientific research and the dissemination of knowledge, and helping to transform information into technology accessible to all.

News from the IIPC Membership Engagement Portfolio

By Emmanuelle Bermès, Deputy Director for Services and Networks at BnF and Membership Engagement Portfolio Lead

During the refresh of our consortium agreement in 2016, three portfolios were created to lead on strategy in three areas: membership engagement, tools development, and partnerships and outreach. The first major project for the Membership Portfolio was to survey our members and find common ground for potential collaborations. While this remains our goal, in the new Strategic Action Plan, we would also like to focus on regular conversations with our members, in the spirit of the updates we have at our General Assembly (GA), and on supporting regional initiatives.

Following up on this, on Monday 13 September, the Membership Engagement Portfolio hosted two calls open to all members, scheduled for two slots to accommodate different time zones.

The past months have made it so much more complicated to share, engage and work together. Our yearly event, the GA/WAC, fully held online by the National Library of Luxembourg, was very successful and we all enjoyed this opportunity to feel the strength and vitality of our community. So, we thought that opening more options for informal communication online could also be a way to keep the momentum.

The Members Call was an opportunity to keep our members posted on what’s currently happening within the consortium. Olga Holownia, our Program Officer, reminded us that the Strategic Plan 2021-2025 is now online, as well as the videos from the WAC conference. Another highlight was the two calls that were just about to close at the time of the meeting: the call for projects under our funding program (DFP) and the call for hosts for the 2022 GA and WAC. There is still time, however, to cast your vote in the Steering Committee elections and to volunteer to host the 2022 events! Finally, a number of resources and events are at your disposal either on this blog or on the IIPC website, including the upcoming training workshop “Web archiving for beginners” on October 5 and 6, the webinar series (Research Speaker and Technical Speaker), and the Online Hours: Supporting Open Source. Join us on the IIPC Members mailing list to learn more!

We also wanted this Members Call to be a moment to share your updates on what’s happening in your organization. With short presentations from the Library of Congress, the National Archives UK, the ResPaDon project at the Bibliothèque nationale de France, the National Széchényi Library in Hungary, the National and University Library in Zagreb, the National Library of New Zealand, the National Library of Australia, the agenda was very rich. One of the major outcomes of these two calls may be a future workshop on capturing social media, as this was a topic of great interest in both meetings. The workshop will be organised by the IIPC and the National and University Library in Zagreb. If you’re interested in contributing to this specific topic, please let us know at events[at]netpreserve.org.

We hope this shared moment can become a regular meeting point between members, where you can share your institution’s hot topics and let us know your expectations about IIPC activities. So, we hope you will join us for this conversation during our next call, expected to take place on Wednesday, 15th of December (UTC). Abbie Grotke and Karolina Holub, Membership Engagement Portfolio co-leads, Olga Holownia and I are looking forward to meeting you there.

IIPC Steering Committee Election 2021: nomination statements

The Steering Committee, composed of no more than fifteen Member Institutions, provides oversight of the Consortium and defines and oversees its strategy. This year five seats are up for election/re-election. In response to the call for nominations to serve on the IIPC Steering Committee for a three-year term commencing 1 January 2022, six IIPC member organisations have put themselves forward:

An election is held from 15 September to 15 October. The IIPC designated representatives from all member organisations will receive an email with instructions on how to vote. Each member will be asked to cast five votes. The representatives should ensure that they read all the nomination statements before casting their votes. The results of the vote will be announced on the Netpreserve blog and Members mailing list on 18 October. The first Steering Committee meeting will be held online.

If you have any questions, please contact the IIPC Senior Program Officer.

Nomination statements in alphabetical order:

Bibliothèque nationale de France / National Library of France

The National Library of France (BnF) started its web archiving programme in the early 2000s and now holds an archive of nearly 1.5 petabytes. We develop national strategies for the growth and outreach of web archives and host several academic projects in our Data Lab. We use and share expertise about key tools for IIPC members (Heritrix 3, OpenWayback, NetarchiveSuite, webarchive-discovery) and contribute to the development of several of them. We have developed BCweb, an open source application for seed selection and curation, also shared with other national libraries in Europe.

The BnF has been involved in IIPC since its very beginning and remains committed to the development of a strong community, not only in order to sustain these open source tools but also to share experiences and practices. We have attended, and frequently actively contributed to, general assembly meetings, workshops and hackathons, and most IIPC working groups.

The BnF chaired the consortium in 2016-2017 and currently leads the Membership Engagement Portfolio. Our participation in the Steering Committee, if continued, will be focused as ever on making web archiving a thriving community, engaging researchers in the study of web archives and further developing access strategies.

The British Library

The British Library is an IIPC founding member and has enjoyed active engagement with the work of the IIPC. This has included leading technical workshops and hackathons; helping to co-ordinate and lead member calls and other resources for tools development; co-chairing the Collection Development Group; hosting the Web Archive Conference in 2017; and participating in the development of training materials. In 2020, the British Library, with Dr Tim Sherratt, the National Library of Australia and National Library of New Zealand, led the IIPC Discretionary Funding project to develop Jupyter notebooks for researchers using web archives. The British Library hosted the Programme and Communications Officer for the IIPC up until the end of March this year, and has continued to work closely on strategic direction for the IIPC. If elected, the British Library would continue to work on IIPC strategy, and collaborate on the strategic plan. The British Library benefits a great deal from being part of the IIPC, and places a high value on the continued support, professional engagement, and friendships that have resulted from membership. The nomination for membership of the Steering Committee forms part of the British Library’s ongoing commitment to the international community of web archiving.

Deutsche Nationalbibliothek / German National Library

The German National Library (DNB) has been doing web archiving since 2012. Legal deposit in Germany includes websites and all kinds of digital publications such as eBooks, eJournals and eTheses. The selective web archive currently includes about 5,000 sites with 30,000 crawls. It is planned to expand the collection to a larger scale. Crawling, quality assurance, storage and access are done together with a service provider and not with common tools like Heritrix and Wayback Machine.

Digital preservation has always been an important topic for the German National Library. In many international and national projects and co-operations, DNB has worked on concepts and solutions in this area. Nestor, the network of expertise in long-term storage of digital resources in Germany, has its office at the DNB. The Preservation Working Group of the IIPC was co-led for many years by the DNB.

On the IIPC Steering Committee, the German National Library would like to advance the joint preservation of the Web.

Det Kongelige Bibliotek / Royal Library of Denmark

The Royal Danish Library (in charge of the Danish national web archiving program Netarkivet) will serve the SC of IIPC with the expertise in web archiving it has built up since 2001. Netarkivet now holds a collection of more than 800 TB and is active in open source development of web archiving tools like NetarchiveSuite and SolrWayback. The representative from RDL will bring IIPC a lot of experience from working with web archiving for more than 20 years. RDL will bring both technical and strategic competences to the SC, as well as skills in financial management, budgeting and project portfolio management. The Royal Danish Library was among the founding members of IIPC, served on the SC of IIPC for a number of years and is now ready for another term.

Koninklijke Bibliotheek / National Library of the Netherlands

As the National Library of the Netherlands (KBNL), our work is fueled by the power of the written word. It preserves stories, essays and ideas, both printed and digital. When people come into contact with these words, whether through reading, studying or conducting research, it has an impact on their lives. With this perspective in mind we find it of vital importance to preserve web content for future generations.

We believe the IIPC is an important network organization which brings together ideas, knowledge and best practices on how to preserve the web and retain access to its information in all its diversity. In the past years, KBNL used its voice in the SC to raise awareness of the sustainability of tools (as we do by improving the Webcurator tool), pointed out the importance of quality assurance and co-organized the WAC 2021. Furthermore, we shared our insights and expertise on preservation in webinars and workshops. We have recently started taking part in the Partnerships & Outreach Portfolio.

We would like to continue this work and bring together more organizations, large and small across the world, to learn from each other and ensure web content remains findable, accessible and re-usable for generations to come.

The National Archives (UK)

The National Archives (UK) is an extremely active web archiving practitioner and runs two open access web archive services – the UK Government Web Archive (UKGWA), which also includes an extensive social media archive, and the EU Exit Web Archive (EEWA). While our scope is limited to information produced by the government of the UK, we have nonetheless built up our collections to over 200TB.

Our team has grown in capacity over the years and we are now increasingly becoming involved in research initiatives that will be relevant to the IIPC’s strategic interests.

With over 35 years’ collective team experience in the field, through building and running one of the largest and most used open access web archives in the world, we believe that we can provide valuable experience and we are extremely keen to actively contribute to the objectives of the IIPC through membership of the Steering Committee.


PYWB 2.6

By Ilya Kreymer, Webrecorder.net

After several betas and months of development, I’m excited to announce the release of pywb 2.6!

This release, supported in large part by the IIPC (International Internet Preservation Consortium), includes several new features and documentation as well as many replay fidelity improvements and optimizations.

The main new features of the release include improvements to the access control system and localization/multi-language support. The access control system has been expanded with a flexible date-range based embargo, allowing for automated exclusions of newer or older content. The release also includes the ability to configure pywb for different user access levels when running pywb behind an Nginx or Apache server. For more details on these features, see the Access Control Guide and Deployment Guide.
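To make the idea of a date-range based embargo concrete, here is a small standalone sketch of the underlying check. It is not pywb's implementation or configuration syntax (the Access Control Guide is the reference for that): the function names parse_ts and is_embargoed and the parameters newer_than and older_than are hypothetical, and captures are assumed to carry 14-digit timestamps of the form YYYYMMDDhhmmss.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(ts):
    """Parse a 14-digit capture timestamp (YYYYMMDDhhmmss) as UTC."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def is_embargoed(capture_ts, now, newer_than=None, older_than=None):
    """Return True if a capture should be excluded from replay.

    newer_than: captures younger than this timedelta are blocked.
    older_than: captures older than this timedelta are blocked.
    Both parameters are hypothetical names used only in this sketch.
    """
    age = now - parse_ts(capture_ts)
    if newer_than is not None and age < newer_than:
        return True
    if older_than is not None and age > older_than:
        return True
    return False


if __name__ == "__main__":
    now = datetime(2021, 8, 1, tzinfo=timezone.utc)
    one_year = timedelta(days=365)
    # A capture from last month falls inside a one-year embargo on new content.
    print(is_embargoed("20210701000000", now, newer_than=one_year))
    # A capture from 2019 is old enough to be served.
    print(is_embargoed("20190101000000", now, newer_than=one_year))
```

Because the check is relative to the current time, a capture leaves a "newer" embargo automatically once it is old enough, which is what makes the exclusion automated rather than a manually maintained block list.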

With this release, pywb also includes support for running in different languages and configuring the main UI to support switching between different languages. All text used is automatically populated into CSV files and imported back. For more details, see the Localization / Multi-Language Guide section of the documentation.

A complete list of changes is also available in the pywb Changelist on GitHub.

This work is a follow-up to the first package of work supported by the IIPC, which resulted in the creation of a transition guide for users of OpenWayback. Webrecorder wishes to thank the IIPC for their support of pywb development.

The next release of pywb, corresponding to the final batch of work sponsored by IIPC in this round, will include several improvements to the pywb user-interface and navigation.

For more discussion on this work, Webrecorder will be participating in an IIPC-hosted webinar on Tuesday, August 31st, 2021.

IIPC Steering Committee Election 2021: Call for nominations

The nomination process for IIPC Steering Committee is now open.

The Steering Committee (SC) is composed of no more than fifteen Member Institutions who provide oversight of the Consortium and define and oversee action on its strategy. This year five seats are up for election. 

What is at stake?

Serving on the Steering Committee is an opportunity for motivated members to help guide the IIPC’s mission of improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. Steering Committee members are expected to take an active role in leadership, contribute to SC and Portfolio activities, and help guide and administer the organisation.

Who can run for election?

Participation in the SC is open to any IIPC member in good standing. We strongly encourage any organisation interested in serving on the SC to nominate themselves for election. The SC members meet in person (if circumstances allow) at least once a year. Face-to-face meetings are supplemented by two teleconferences plus additional ones as required.

Please note that the nomination should be on behalf of an organisation, not an individual. Once elected, the member organisation designates a representative to serve on the Steering Committee. The list of current SC member organisations is available on the IIPC website.

How to run for election?

All nominee institutions, both new and existing members whose term is expiring but are interested in continuing to serve, are asked to write a short statement (max 200 words) outlining their vision for how they would contribute to IIPC via serving on the Steering Committee. Statements can point to past contributions to the IIPC or the SC, relevant experience or expertise, new ideas for advancing the organisation, or any other relevant information.

All statements will be posted online and emailed to members prior to the election with ample time for review by all membership. The results will be announced in October and the three-year term on the Steering Committee will start on 1 January.

Below you will find the election calendar. We are very much looking forward to receiving your nominations. If you have any questions, please contact the IIPC Senior Program Officer (SPO).

Election Calendar

14 June – 14 September 2021: Nomination period. IIPC Designated Representatives are invited to nominate their organisation by sending an email including a statement of up to 200 words to the IIPC SPO.

15 September 2021: Nominees statements are published on the Netpreserve Blog and Members mailing list. Nominees are encouraged to campaign through their own networks.

15 September – 15 October 2021: Members are invited to vote online. Each organisation votes only once for all nominated seats. The vote is cast by the Designated Representative.

18 October 2021: The results of the vote are announced on the Netpreserve blog and Members mailing list.

1 January 2022: The newly elected SC members start their term.