Investigate holdings of web archives through summaries: cdx-summarize

By Yves Maurer, Web Archiving technical lead at the National Library of Luxembourg


Introduction

When researchers want to access web archives, they have two main possibilities. They can either use one of the excellent web archives that are freely accessible online, such as web.archive.org, arquivo.pt, vefsafn.is, haw.nsk.hr, or Common Crawl, or they can travel to the libraries and archives whose web archives are only available in their respective reading rooms. In fact, most web archives have some restrictions on access. Copyright and other legal considerations often make it difficult for institutions to open up the web archive to the broader Internet. Closed web archives are hard to access for researchers, especially those who live far from the physical reading rooms. The overall effect is that closed web archives are less used, studied, and published about, and researchers only travel to the closest reading rooms, if at all.

However, web archiving institutions would like more researchers to use their archives and to popularize the use of web archives among users from contemporary history, sociology, linguistics, economics, law, and other disciplines. For closed web archives, usually little data is publicly available about their contents, so it is difficult to convince researchers to travel to the reading room when they don't know in advance what exactly the archive contains and whether it's pertinent to their research question. Web archives are also very large, which makes handling the raw WARC files difficult for all parties, so sending extracts of data from institution to research team is often not feasible.

It would certainly be preferable for researchers if those closed web archives simply opened their entire service to the Internet, but the wholesale lifting of legal restrictions is not easy. Therefore, if researchers cannot access the whole dataset, can they at least access some part that gives them an overview of the collection? Just a size indication (e.g. 340 TB) and the number of mementos (e.g. 3 billion) will not help much. A collection policy documenting the aims, scope and contents of the web archive (e.g. https://www.bnf.fr/fr/archives-de-linternet) is already more helpful, but does not hold any numbers or information about particular sites of interest. There is, however, a type of data that sits in-between the legal challenges of full access on the one hand and a textual description or rough single numbers on the other. This type of data must not be encumbered by any legal restrictions, nor should it be so massive that it becomes unwieldy.

Developed as part of working group 1 of the WARCnet network, the cdx-summarize (https://github.com/ymaurer/cdx-summarize) toolset proposes to generate and handle such a dataset. There are no legal restrictions on the data that it contains, since it is aggregated and contains neither copyrighted information nor personal data. Moreover, the file is of a manageable size. An institution that has a closed web archive can publish the summary file for the whole collection, and researchers can then investigate its contents or compare it to the summary files from other institutions. In this way, web archives can publicize their collections and make them available for a rough first level of data exploration.

Sample uses of summary files for researchers

The summary files produced by cdx-summarize are simple, but they still contain statistics about the different years in which mementos were harvested, as well as the number and sizes of the different file types included in the collection. None of the following samples requires direct access to a web archive, only to a summary file. It is not the aim of this blog post to investigate these examples in detail, just to give readers an idea of how rich this summary data still is and what can be done with it.

A very simple example is the chart comparing the evolution of the sizes of HTML files over the years.

Fig 1. Average size of HTML files in the Luxembourg Web Archive
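Computing such a chart is straightforward: sum the s_html and n_html values per year across all lines of the summary file (the line format is described in "The summary file format" section below). Here is a minimal sketch in Python, assuming a decompressed summary file named webarchive-lu.summary:

    import json
    from collections import defaultdict

    n_html = defaultdict(int)  # number of HTML mementos per year
    s_html = defaultdict(int)  # total HTML bytes per year

    with open("webarchive-lu.summary", encoding="utf-8") as f:
        for line in f:
            # each line is "host.tld {json}"
            host, _, payload = line.partition(" ")
            for year, stats in json.loads(payload).items():
                n_html[year] += stats.get("n_html", 0)
                s_html[year] += stats.get("s_html", 0)

    for year in sorted(n_html):
        if n_html[year]:
            print(year, s_html[year] // n_html[year], "bytes per HTML memento")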

Another example is to use the information about 2nd-level domains that is still present in the summary file to find out more about domain names in general, as in the following example:

Fig 2. First letter frequency in Internet Archive 2nd-level domains vs French dictionary for the TLD .fr

Here, you could, for example, explain the overall abundance of 2nd-level domains starting with the letter "L" by the fact that the French articles "le", "la" and "les" all start with "L", and so, probably, do quite a lot of domain names. Other deviations from the mean may need a deeper explanation.

Comparing web archives

Another nice thing about the summary files is that they can be produced for different web archives and then compared. At the time of writing, I do not have access to any other closed web archive's summary file apart from the one for the Luxembourg Web Archive (https://github.com/ymaurer/cdx-summarize/blob/main/summaries/webarchive-lu.summary.xz) (19.1 MB). However, there are open web archives with public APIs, like the Internet Archive's CDX server (https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server) or the Common Crawl (https://index.commoncrawl.org/). These can be used to generate a summary file, e.g. for a whole top-level domain.

A first comparison between web archives can be done on the 2nd-level domains. Do all of the web archives concerned hold data from all the domains? Or does one archive have a clear focus on just a small subset of the domains? The following chart compares the inclusion of domains from the TLD ".lu" in three web archives:

Fig 3. Inclusion of domains from the TLD .lu in three web archives

The graph clearly shows that the Luxembourg Web Archive started in 2016 and that it is collaborating with the Internet Archive, who have a second copy of the same data. It also shows that the Common Crawl is much less broad in terms of included domains.
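As a rough sketch of how such a comparison might start, the domain lists of two summary files can be intersected directly. The second file name here is a hypothetical example; any two files in the summary format described below will do:

    # Compare which 2nd-level domains two summary files cover.
    def domains(path):
        # each line starts with the domain, followed by a space and JSON
        with open(path, encoding="utf-8") as f:
            return {line.split(" ", 1)[0] for line in f if line.strip()}

    lu = domains("webarchive-lu.summary")    # Luxembourg Web Archive
    cc = domains("commoncrawl-lu.summary")   # hypothetical Common Crawl summary

    print("domains in both archives:", len(lu & cc))
    print("only in the Luxembourg Web Archive:", len(lu - cc))
    print("only in the Common Crawl:", len(cc - lu))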

A deeper comparison of the mementos held by different web archives is probably better done on the CDXJ index files themselves. There will still be some edge cases of mementos differing slightly because of embedded timestamps, sessions, etc., but it will give a more detailed picture of the overlaps.

The summary file format

The file consists of JSON lines, each prefixed by the domain name. This is inspired by the CDXJ format and simplifies using Unix tools such as "sort" or "join" on the summary files. In the JSON part, there are keys for each year and, inside the year, keys for each (simplified) MIME type giving the number of mementos (n_) and their sizes (s_):

host.tld {"year": {"n_html": A, … "s_html": B}}

A sample entry could be:

bnl.lu {"2003": {"n_audio": 0, "n_css": 8, "n_font": 0, "n_html": 639, "n_http": 728, "n_https": 0, "n_image": 44, "n_js": 0, "n_json": 0, "n_other": 7, "n_pdf": 30, "n_total": 728, "n_video": 0, "s_audio": 0, "s_css": 5268, "s_font": 0, "s_html": 1295481, "s_http": 4680354, "s_https": 0, "s_image": 295235, "s_js": 0, "s_json": 0, "s_other": 13156, "s_pdf": 3071214, "s_total": 4680354, "s_video": 0}}

The MIME types are simplified according to the following rules:

MIME type(s)                                   Category  Rationale
text/html, application/xhtml+xml, text/plain   HTML      these are counted as "web pages" by the Internet Archive
text/css                                       CSS       interesting for changing usage in formatting pages
image/*                                        IMAGE     all image types are grouped together
application/pdf                                PDF       interesting independently, although IA groups PDFs under "web pages" too
video/*                                        VIDEO     all video types
audio/*                                        AUDIO     all audio types
application/javascript, text/javascript,       JS        these three MIME types are common for JavaScript
application/x-javascript
application/json, text/json                    JSON      relatively common and indicates dynamic pages
font/*, application/vnd.ms-fontobject,         FONT      usage of custom fonts
application/font, application/x-font*
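In code, the mapping above amounts to simple equality and prefix tests on the lower-cased MIME type. The following is only an illustrative restatement of the table, not the toolset's actual implementation:

    def simplify_mime(mime):
        # normalize: drop parameters like "; charset=utf-8" and lower-case
        mime = mime.lower().split(";")[0].strip()
        if mime in ("text/html", "application/xhtml+xml", "text/plain"):
            return "html"
        if mime == "text/css":
            return "css"
        if mime.startswith("image/"):
            return "image"
        if mime == "application/pdf":
            return "pdf"
        if mime.startswith("video/"):
            return "video"
        if mime.startswith("audio/"):
            return "audio"
        if mime in ("application/javascript", "text/javascript",
                    "application/x-javascript"):
            return "js"
        if mime in ("application/json", "text/json"):
            return "json"
        if (mime.startswith("font/")
                or mime == "application/vnd.ms-fontobject"
                or mime == "application/font"
                or mime.startswith("application/x-font")):
            return "font"
        return "other"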

How do I generate the summary file for my web archive?

As the name cdx-summarize implies, the programs only need access to CDXJ files, not to the underlying WARC files. Just run cdx-summarize.py --compact *.cdx > mywebarchive.summary in your CDX directory and it will do the summarization.

If you are using WARC-indexer from the British Library and have a backend with a Solr index, it's even simpler, since there is a version contributed by Toke Eskildsen which pulls the data from Solr efficiently and directly (https://github.com/ymaurer/cdx-summarize-warc-indexer). All types of CDXJ files should be supported, as are different encodings.

Web Archiving the War in Ukraine

By Olga Holownia, Senior Program Officer, IIPC & Kelsey Socha, Administrative Officer, IIPC with contributions to the Collaborative Collection section by Nicola Bingham, Lead Curator, Web Archives, British Library; CDG co-chair


This month, the IIPC Content Development Working Group (CDG) launched a new collaborative collection to archive web content related to the war in Ukraine, aiming to map the impact of this conflict on digital history and culture. In this blog, we describe what is involved in creating a transnational collection and we also give an overview of web archiving efforts that started earlier this year: both collections by IIPC members and collaborative volunteer initiatives.

Collaborative Collection 2022

In line with the broader content development policy, CDG collections focus on topics that are transnational in scope and are considered of high interest to IIPC members. Each collection represents more perspectives than a similar collection by a single member archive could include. Nominations are submitted by IIPC members, some of whom have been archiving the conflict since as early as January 2022 (see below), as well as by the general public.

How do members contribute?

Topics for special collections are proposed by IIPC members, who submit their ideas to the IIPC CDG mailing list or contact the IIPC co-chairs directly at any time. Provided that the topic fits the CDG collecting scope, there is enough data budget to cover the collection, and a lead curator and volunteers to perform the archiving work are in place, the collection can go ahead. IIPC members are then canvassed widely to submit web content on a shared Google spreadsheet together with associated metadata such as title, language and description. The URLs are taken from the spreadsheet and crawled in Archive-It by the project team, formed of volunteers from IIPC members for each collection. Many IIPC members add a selection of seeds from their institutions' own collections, which helps to make CDG collections very diverse in terms of coverage and language.

There will be overlap between the seeds that members submit to CDG collections and those in their own institutions' collections, but there are differences. Selections for IIPC collections can be more geographically wide-ranging than those included in members' own collections, which may, as in the case of a national library, have to adhere to a regional scope. Selection decisions that are appropriate for members' own collections may also not be appropriate for CDG collections: for example, members may want to curate individual articles from an online newspaper by crawling each one separately, whereas, given the larger scope of CDG collections, it would be more appropriate to create the target at the level of the sub-section of the online newspaper. Public access to collections provided by Archive-It is a positive factor for those institutions that, for various reasons, can't provide access to their own collections. You can learn more about the War in Ukraine 2022 collection's scope and parameters here.

Public nominations

We encourage everyone to nominate relevant web content as defined by the collection's lead curators: Vladimir Tybin and Anaïs Crinière-Boizet of the BnF, National Library of France, and Kees Teszelszky of KB, National Library of the Netherlands. The first crawl is scheduled to take place on 27 July and will be followed by two additional crawls in September and October. We will be publishing updates on the collection at #Ukraine 2022 Collection. We are also planning to make this collection available to researchers.

Member collections

In Spring 2022, we compiled a survey of the work done by IIPC members. We asked about the collection start date, scope, frequency, type of collected websites, way of collecting (e.g. locally and/or via Archive-It), social media platforms and access.

IIPC members have collected content related to the war, ranging from news portals, to governmental websites, to embassies, charities, and cultural heritage sites. They have also selectively collected content from Ukrainian and Russian websites and social media, including Facebook, Reddit, Instagram, and, most prominently, Twitter. The CDG collection offers another chance for members without special collections to contribute seeds from their own country domains.

Many of our members are national libraries and archives, and legal deposit informs what these institutions are able to collect and how they provide access. In most cases, that means crawling country-level domains, offering a localized perspective on the war. Access varies from completely open (e.g. the Internet Archive, the National Library of Australia and the Croatian Web Archive), to onsite-only with published and browsable metadata such as collected URLs (e.g. the Hungarian Web Archive), to reading-room only (e.g. Netarkivet at the Royal Danish Library or the "Archives de l'internet" at the National Library of France). The UK Web Archive collection has a mixed model of access, where the full list of metadata and collected URLs is available, but access to individual websites depends on whether the website owner has granted permission for off-site open access. Some institutions, such as the Library of Congress, may have time-based embargoes in place for collection access.

Some of our members have also begun work preparing datasets and visualisations for researchers. The Internet Archive has been supporting multiple collections and volunteer projects and our members have provided valuable advice on capturing content that is difficult to archive (e.g. Telegram messages).

A map of IIPC members currently collecting content related to the war in Ukraine can be seen below. It includes Stanford University, which has been supporting SUCHO (Saving Ukrainian Cultural Heritage Online).

Survey results

Access

While many members have been collecting content related to the war, only a small number of collections are currently publicly available online. Some members provide access to browsable metadata or a list of URLs. The National Library of Australia has been collecting publicly available Australian websites related to the conflict, as has the National Library of the Czech Republic. A special event collection of 162 crowd-sourced URLs is now accessible at the Croatian Web Archive. The UK Web Archive's special collection of nearly 300 websites is only fully available on-site; however, information about the collected resources, which currently include websites of Russian Oligarchs in the UK, Commentators, Charities, Think Tanks and the UK Embassies of Ukraine and the surrounding nations, is publicly available online. Some websites from the UK Web Archive's collection are also fully available off-site, where website owners have granted permission. The National Library of Scotland has set up a special collection, 'Scottish Communities and the Ukraine', which contains nearly 100 websites and focuses on the local response to the Ukraine War. This collection will be viewable in the near future, pending QA checks. Most of the University Library of Bratislava's collection is only available on-site, but information about the sites collected is browsable on their web portal, with links to current versions of the archived pages.

The web archiving team at the National Széchényi Library in Hungary, which has been capturing content from 75 news portals, has created a SolrWayback-based public search interface which provides access to metadata and full-text search, though full pages cannot be viewed due to copyright. The web archiving team has also been collaborating with the library's Digital Humanities Center to create datasets and visualisations related to captured content.

Márton Nemeth of National Széchényi Library and Gyula Kalcsó of Digital Humanities Center, National Széchényi Library presented on this collection at the 2022 Web Archiving Conference.

Multiple institutions plan to make their content available online at a later date, after collecting has finished or after a specified period of time has passed. The Library of Congress has been capturing content in a number of collections within the scope of their collecting policies, including the ongoing East European Government Ministries Web Archive.

Frequency of Collection

Most institutions have been collecting with a variety of frequencies. Institutions rarely answered with just one of the frequency options, opting instead to pick multiple options or "Other." Of the answers in the "Other" category, some institutions were doing one-time collection, while others were collecting yearly, six-monthly or quarterly.

How the content is collected

Most IIPC members crawl the content locally, while a few have also been using Archive-It. SUCHO has mostly relied on the browser-based crawler Browsertrix, which was developed by Ilya Kreymer of Webrecorder and is in part funded by the IIPC, and on the Internet Archive's Wayback Machine.

Type of collected websites (your domain)

When asked about types of websites being collected within local domains, most institutions have been focusing on governmental and news-related sites, followed by embassies and official sites related to Ukraine and Russia as well as cultural heritage sites. Other websites included a variety of crisis relief organisations, non-profits, blogs, think tanks, charities, and research organisations.

Types of websites/social media collected

When asked more broadly, most members have been focusing on local websites from their home countries. Outside local websites, some institutions were collecting Ukrainian websites and social media, while a smaller number were collecting Russian websites and social media.

Specific social media platforms collected

The survey also asked specifically about the social media platforms our members were pulling from: Reddit, Instagram, TikTok, Tumblr, and YouTube. While many institutions were not collecting social media, Twitter was otherwise the most commonly collected social media platform.

Internet Archive

The Internet Archive (IA) has been instrumental in providing support for multiple initiatives related to the war in Ukraine. IA's initiatives have included:

  1. giving free Archive-It accounts, as well as general data storage, to a number of different community archiving efforts
  2. uploading files to SUCHO collection at archive.org
  3. supporting the extensive use of Save Page Now (especially via the Google Sheets interface) with the help of numerous SUCHO volunteers (many tens of terabytes have been archived this way)
  4. supporting the uploading of WACZ files to the Wayback Machine. This work has just started, but a significant number of files are expected to be archived and, as with other collections featured in the new "Collection Search" service, a full-text index will be available
  5. crawling the entire country code top level domain of the Ukrainian web (the crawl was launched in April and is still running)
  6. archiving Russian Independent Media (TV, TV Rain), Radio (Echo of Moscow) and web-based resources (see “Russian Independent Media” option in the “Collection Search” service at the bottom of the Wayback Machine).

IA's Television News Archive, the GDELT Project, and the Media-Data Research Consortium have collaborated to create the Television News Visual Explorer, which allows for greater research access to the Television News Archive, including channels from across Russia, Belarus, and Ukraine. This blog post by GDELT's Dr. Kalev H. Leetaru explains the significance of this collaboration and the importance of this new research collection of Belarusian, Russian and Ukrainian television news coverage.

Volunteer initiatives

SUCHO

One of the largest volunteer initiatives focusing on preserving Ukrainian web content has been SUCHO. Involving over 1,300 librarians, archivists, researchers and programmers, SUCHO is led by Stanford University's Quinn Dombrowski, Anna E. Kijas of Tufts University, and Sebastian Majstorovic of the Austrian Centre for Digital Humanities and Cultural Heritage. In its first phase, the project's primary goal was to archive at-risk sites, digital content, and data in Ukrainian cultural heritage institutions. So far, over 30 TB of content and 3,500+ websites of Ukrainian museums, libraries and archives have been preserved, and a subset of this collection is available at https://www.sucho.org/archives. The project is beginning its second phase, focusing on coordinating aid shipments of digitization hardware, exhibiting Ukrainian culture online and organizing training for Ukrainian cultural workers in digitization methods.

The SUCHO leads and Ilya Kreymer presented on their work at the 2022 Web Archiving Conference and participated in a Q&A session moderated by Abbie Grotke of the Library of Congress.

The Telegram Archive of the War

Screenshot from the Telegram Archive of the War, taken July 20, 2022.

Telegram has been the most widely used application in Ukraine since the onset of the war, but this messaging app is notoriously difficult to archive. A team of five archivists at the Center for Urban History in Lviv, led by Taras Nazaruk, has been archiving almost 1,000 Telegram channels since late February to create the Telegram Archive of the War. Each team member has been assigned a topic or a region in Ukraine to monitor and archive. They focus on capturing official announcements from different military administrative districts, ministries, local and regional news, volunteer groups helping with evacuation, searches for missing people, local channels for different towns, databases, cyberattacks, Russian propaganda, fake news, as well as personal diaries, artistic reflections, humour and memes. Russian government propaganda and pro-Russian channels and chats are also archived. The multimedia content is currently grouped into over 20 thematic collections. The project coordinators have also been working with universities interested in supporting this archive and are planning to set up a working group to provide guidance on future access to this invaluable archive.

Ukraine collections on Archive-It

New content has been gradually made available within the Ukraine collections on Archive-It, which provided free or heavily cost-shared accounts to its partners earlier this year. These collections also include websites documenting the Ukraine Crisis 2014-2015, curated by the University of California, Berkeley (UC Berkeley) and by Internet Archive Global Events. Four new collections have been created since February 2022, with over 2.5 TB of content. The largest publicly available one about the 2022 conflict (around 200 URLs) is curated by the Ukrainian Research Institute at Harvard University. Other collections that focus on Ukrainian content are curated by the Center for Urban History of East Central Europe, UC Berkeley and SUCHO. To learn more about the "War in Ukraine: 2022" collection, read this blog post by Liladhar R. Pendse, Librarian for East European, Central European, Central Asian and Armenian Studies Collections, UC Berkeley. New College, University of Oxford, has been archiving at-risk Russian cultural heritage on the web, as well as Russian opposition efforts against the war on Ukraine.

Ukrainian Research Institute at Harvard University’s collection at Archive-It.

Organisations interested in collecting web content related to the war in Ukraine can contact Mirage Berry, Business Development Manager at the Internet Archive.

How to get involved

  1. Nominate web content for the CDG collection
  2. Use the Internet Archive’s “Save Page Now”
  3. Check updates on the SUCHO Page for information on how you can contribute to the new phase of the project. SUCHO is currently accepting donations to pay for server costs and funding digitization equipment to send to Ukraine. Those interested in volunteering with SUCHO can sign up for the standby volunteer list here
  4. Help the Center for Urban History in Lviv by nominating Ukrainian Telegram channels that you think are worth archiving and participate in their events
  5. Submit information about your project: we are working to maintain a comprehensive and up-to-date list of web archiving efforts related to the war in Ukraine. If you are involved in a collection or a project and would like to see it included here, please use this form to contact us: https://bit.ly/archiving-the-war-in-Ukraine.

Many thanks to all of the institutions and projects featured on this list! We appreciate the time our members spent filling out our survey, and answering questions. Special thanks to Nicola Bingham of the British Library, Mark Graham and Mirage Berry of the Internet Archive, and Taras Nazaruk of the Center for Urban History in Lviv for providing supplementary information on their institutions’ collecting efforts.


Get Involved in Web Archiving the War in Ukraine 2022

By Kees Teszelszky, Curator Digital Collections, National Library of the Netherlands & Vladimir Tybin, Head of Digital Legal Deposit, National Library of France

On February 24, 2022, the armed forces of the Russian Federation invaded Ukrainian territory, annexing certain regions and cities and carrying out a series of military strikes throughout the country, thus triggering a war in Ukraine. Since then, the clashes between the Russian military and the Ukrainian population have had unprecedented repercussions on the situation of the civilian population and on international relations.

What we want to collect

This collaborative collection aims to collect web content related to this event in order to map the impact of this conflict on digital history and culture.

This collection will be built around the following themes:

  • General information about the military confrontations
  • Consequences of the war on the civilian population
  • Refugee crisis and international relief efforts
  • Political consequences
  • International relations
  • Diaspora communities – Ukrainian people around the world 
  • Human rights organisations 
  • Foreign embassies and diplomatic relations
  • Sanctions imposed against Russia by foreign powers
  • Consequences on energy and agri-food trade
  • Public opinion: blogs/protest sites/activists

The list is not exhaustive and it is expected that contributing partners may wish to explore other sub-topics within their own areas of interest and expertise, providing that they are within the general collection development scope.

Out of scope

The following types of content are out of scope for the collection:

  • Data-intensive audio/video content (e.g. YouTube channels)
  • Social media platforms
  • Private member forums, intranets, or email (non-published material)
  • Content identifying vulnerable people and compromising their safety

How to get involved

Once you have selected the web pages that you would like to see in the collection, it takes less than 5 minutes to fill in the submission form:

https://bit.ly/Ukraine-2022-collection-public-nominations 

For the first crawl, the call for nominations will close on July 20, 2022.

For more information and updates, you can contact the IIPC Content Development Working Group team at Collaborative-collections@iipc.simplelists.com or follow the collection hashtag on Twitter at #iipcCDG.

Resources

About IIPC collaborative collections
IIPC CDG updates on the IIPC Blog

IIPC Steering Committee Election 2022: call for nominations

The nomination process for the IIPC Steering Committee is now open.

The Steering Committee (SC) is composed of no more than fifteen Member Institutions who provide oversight of the Consortium and define and oversee action on its strategy. This year, five seats are up for election. 

What is at stake?

Serving on the Steering Committee is an opportunity for motivated members to help guide the IIPC’s mission of improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. Steering Committee members are expected to take an active role in leadership, contribute to SC and Portfolio activities, and help guide and administer the organisation. The elected SC members also lead IIPC Portfolios and thus have the opportunity to shape the Consortium’s strategic direction related to three main areas: tools development, membership engagement and partnerships. Every year, three SC members are designated as IIPC Officers (Chair, Vice-Chair and Treasurer) to serve on the IIPC Executive Board and are responsible for implementing the Strategic Plan.

Who can run for election?

Participation in the SC is open to any IIPC member in good standing. We strongly encourage any organisation interested in serving on the SC to nominate themselves for election. The SC members meet in person (if circumstances allow) at least once a year. Face-to-face meetings are supplemented by two teleconferences plus additional ones as required.

Please note that the nomination should be on behalf of an organisation, not an individual. Once elected, the member organisation designates a representative to serve on the Steering Committee. The list of current SC member organisations is available on the IIPC website.

How to run for election?

All nominee institutions, both new members and existing members whose term is expiring but who are interested in continuing to serve, are asked to write a short statement (max 200 words) outlining their vision for how they would contribute to the IIPC by serving on the Steering Committee. Statements can point to past contributions to the IIPC or the SC, relevant experience or expertise, new ideas for advancing the organisation, or any other relevant information.

All statements will be posted online and emailed to members prior to the election, with ample time for review by the full membership. The results will be announced in October and the three-year term on the Steering Committee will start on 1 January.

Below you will find the election calendar. We are very much looking forward to receiving your nominations. If you have any questions, please contact the IIPC Senior Program Officer (SPO).


Election Calendar

15 June – 14 September 2022: Nomination period. IIPC Designated Representatives are invited to nominate their organisation by sending an email including a statement of up to 200 words to the IIPC SPO.

15 September 2022: Nominee statements are published on the Netpreserve blog and Members mailing list. Nominees are encouraged to campaign through their own networks.

15 September – 15 October 2022: Members are invited to vote online. Each organisation votes only once for all nominated seats. The vote is cast by the Designated Representative.

18 October 2022: The results of the vote are announced on the Netpreserve blog and Members mailing list.

1 January 2023: The newly elected SC members start their three-year term.

Game Walkthroughs and Web Archiving Project: Integrating Gaming, Web Archiving, and Livestreaming 

“Game Walkthroughs and Web Archiving” was awarded a grant in the 2021-2022 round of the Discretionary Funding Programme (DFP), the aim of which is to support the collaborative activities of the IIPC members by providing funding to accelerate the preservation and accessibility of the web. The project lead is Michael L. Nelson from the Department of Computer Science at Old Dominion University. Los Alamos National Laboratory Research Library is a project partner.


By Travis Reid, Ph.D. student at Old Dominion University (ODU), Michael L. Nelson, Professor in the Computer Science Department at ODU, and Michele C. Weigle, Professor in the Computer Science Department at ODU

Introduction

Game walkthroughs are guides that show viewers the steps the player would take while playing a video game. Recording and streaming a user’s interactive web browsing session is similar to a game walkthrough, as it shows the steps the user would take while browsing different websites. The idea of having game walkthroughs for web archiving was first explored in 2013 (“Game Walkthroughs As A Metaphor for Web Preservation”). At that time, web archive crawlers were not ideal for web archiving walkthroughs because they did not allow the user to view the webpage as it was being archived. Recent advancements in web archive crawlers have made it possible to preserve the experience of dynamic web pages by recording a user’s interactive web browsing session. Now, we have several browser-based web archiving tools such as WARCreate, Squidwarc, Brozzler, Browsertrix Crawler, ArchiveWeb.page, and Browsertrix Cloud that allow the user to view a web page while it is being archived, enabling users to create a walkthrough of a web archiving session.

Figure 1: Different ways to participate in gaming (left), web archiving (center), and sport sessions (right)

Figure 1 applies the analogy of different types of video games and basketball scenarios to types of web archiving sessions. Practicing playing a sport like basketball by yourself, playing an offline single player game like Pac-Man, and archiving a web page with a browser extension such as WARCreate are all similar because only one user or player is participating in the session (Figure 1, top row). Playing team sports with a group of people, playing an online multiplayer game like Halo, and collaboratively archiving web pages with Browsertrix Cloud are similar since multiple invited users or players can participate in the sessions (Figure 1, center row). Watching a professional sport on ESPN+, streaming a video game on Twitch, and streaming a web archiving session on YouTube can all be similar because anyone can be a spectator and watch the sporting event, gameplay, or web archiving session (Figure 1, bottom row).

One of our goals in the Game Walkthroughs and Web Archiving project is to create a web archiving livestream like that shown in Figure 1. We want to make web archiving entertaining to a general audience so that it can be enjoyed like a spectator sport. To this end, we have applied a gaming concept to the web archiving process and integrated video games with web archiving. We have created automated web archiving livestreams (video playlist) where the gaming concept of a speedrun was applied to the web archiving process. Presenting the web archiving process in this way can be a general introduction to web archiving for some viewers. We have also created automated gaming livestreams (video playlist) where the capabilities for the in-game characters were determined by the web archiving performance from the web archiving livestream. The current process that we are using for the web archiving and gaming livestreams is shown in Figure 2.

Figure 2: The current process for running our web archiving livestream and gaming livestream.

Web Archiving Livestream

For the web archiving livestream (Figure 2, left side), we wanted to create a livestream where viewers could watch browser-based web crawlers archive web pages. To make the livestreams more entertaining, we made each web archiving livestream into a competition between crawlers to see which crawler performs better at archiving the set of seed URIs. The first step for the web archiving livestream is to use Selenium to set up the browsers that will be used to show information needed for the livestream such as the name and current progress for each crawler. The information currently displayed for a crawler’s progress is the current URL being archived and the number of web pages archived so far. The next step is to get a set of seed URIs from an existing text file and then let each crawler start archiving the URIs. The viewers can then watch the web archiving process in action.
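As an illustration of that setup step, here is a minimal Selenium sketch of the kind described above: one browser window per competing crawler, with the crawler's name, current URL and page count rendered into the page. This is an assumption-laden sketch rather than the project's actual code; update_overlay and its fields are invented for illustration.

    from selenium import webdriver

    def open_status_window(crawler_name):
        # one visible browser window per competing crawler
        driver = webdriver.Chrome()
        driver.get("about:blank")
        update_overlay(driver, crawler_name, "starting...", 0)
        return driver

    def update_overlay(driver, crawler_name, current_url, pages_archived):
        # render the crawler's progress into the page body so that the
        # livestream capture shows its name, current URL and page count
        driver.execute_script(
            "document.title = arguments[0];"
            "document.body.innerHTML ="
            "  '<h1>' + arguments[0] + '</h1>' +"
            "  '<p>Archiving: ' + arguments[1] + '</p>' +"
            "  '<p>Pages archived so far: ' + arguments[2] + '</p>';",
            crawler_name, current_url, pages_archived)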

Automated Gaming Livestream

The automated gaming livestream (Figure 2, right side) was created so that viewers can watch a game where the gameplay is influenced by the web archiving and replay performance results from a web archiving livestream or any crawling session. Before an in-game match starts, a game configuration file is needed; it contains the selections that will be made for the game's settings. The game configuration file is modified based on how well the crawlers performed during the web archiving livestream. If a crawler performs well during the web archiving livestream, then the in-game character associated with the crawler will have better items, perks, and other traits. If a crawler performs poorly, then its in-game character will have worse traits. At the beginning of the gaming livestream, an app automation tool like Selenium (for browser games) or Appium (for locally installed PC games) is used to select the settings for the in-game characters based on the performance of the web crawlers. After the settings are selected by the app automation tool, the match is started and the viewers of the livestream can watch the match between the crawlers' in-game characters. We have initially implemented this process for two video games, Gun Mayhem 2 More Mayhem and NFL Challenge. However, any game with a mode that does not require a human player could be used for an automated gaming livestream.
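A sketch of that performance-to-settings step follows. The schema, field names and output file name are illustrative assumptions; the real configuration depends on the game being automated:

    import json

    def build_game_config(pages_per_minute, path="game_config.json"):
        # pages_per_minute: {"crawler name": archiving speed} measured
        # during the web archiving livestream; faster crawlers get
        # better in-game traits
        ranked = sorted(pages_per_minute, key=pages_per_minute.get,
                        reverse=True)
        config = {}
        for rank, crawler in enumerate(ranked):
            config[crawler] = {
                "gun": "fastest" if rank == 0 else "slowest",
                "perk": "infinite_ammo" if rank == 0 else "none",
            }
        with open(path, "w") as f:
            json.dump(config, f, indent=2)
        return config

    # e.g. build_game_config({"crawler-a": 42.0, "crawler-b": 17.5})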

Gun Mayhem 2 More Mayhem Demo

Gun Mayhem 2 More Mayhem is similar to other fighting games like Super Smash Bros. and Brawlhalla, where the goal is to knock the opponent off the stage. When a player gets knocked off the stage, they lose a life. The winner of the match is the last player left on the stage. Gun Mayhem 2 More Mayhem is a Flash game that is played in a web browser, so Selenium was used to automate it. In Gun Mayhem 2 More Mayhem, each crawler's speed was used to determine which perk and which gun its character would use. Some example perks are infinite ammo, triple jump, and no recoil when firing a gun. The fastest crawler used the fastest gun and was given an infinite ammo perk (Figure 3, left side). The slowest crawler used the slowest gun and did not get a perk (Figure 3, right side).

Figure 3: The character selections made for the fastest and slowest web crawlers

NFL Challenge Demo

NFL Challenge is an NFL football simulator that was released in 1985 and was popular during the 1980s. The performance of a team is based on player attributes that are stored in editable text files. It is possible to change the stats for the players, like the speed, passing, and kicking ratings, and it is possible to change the name of the team and of the players on the team. This customization allowed us to rename each team to the name of a web crawler and to rename the players of the team to the names of the contributors to that tool. NFL Challenge is an MS-DOS game that can be played with an emulator named DOSBox. Appium was used to automate the game, since NFL Challenge is a locally installed game. In NFL Challenge, the fastest crawler got the team with the fastest players based on the players' speed attribute (Figure 4, left side), and the other crawler got the team with the slowest players (Figure 4, right side).

Figure 4: The player attributes for the teams associated with the fastest and slowest web crawlers. The speed ratings are the times for the 40-yard dash, so the lower numbers are faster.

Future Work

In future work, we plan to make more improvements to the livestreams. We will update the web archiving livestreams and the gaming livestreams so that they can run at the same time. The web archiving livestream will use more than the speed of a web archive crawler when determining the crawler's performance, such as metrics from Brunelle's memento damage algorithm, which is used to measure the replay quality of archived web pages. During future web archiving livestreams, we will also evaluate and compare the capture and playback of web pages archived by different web archives and archiving tools like the Internet Archive's Wayback Machine, archive.today, and Arquivo.pt.

We will update the gaming livestreams so that they can support more games and games from different genres. The games that we have supported so far are multiplayer games. We will also try to automate single-player games, where the in-game characters for the crawlers can compete to see which one gets the highest score on a level or finishes the level the fastest. For games that allow creating a level or game world, we would like to use what happens during a crawling session to determine how the level is created. If the crawler was not able to archive most of the resources, then more enemies or obstacles could be placed in the level to make it more difficult to complete. Some games that we will try to automate include Rocket League, Brawlhalla, Quake, and DOTA 2. When the scripts for the gaming livestream are released publicly, it will also be possible for anyone to add support for more games that can be automated. We will also have longer runs for the gaming livestreams so that a campaign or season in a game can be completed. A campaign is a game mode where the same characters play a continuing story until it is completed. A season for the gaming livestreams will be like a season for a sport, where each team must complete a certain number of matches during a simulated year, followed by a playoff tournament that ends with a championship match.

Summary

We are developing a proof of concept that integrates gaming and web archiving, so that web archiving can be more entertaining to watch and enjoyed like a spectator sport. We have applied the gaming concept of a speedrun to the web archiving process by holding a competition between two crawlers, where the crawler that finished archiving the set of seed URIs first was the winner. We have also created automated web archiving and gaming livestreams, where the web archiving performance of the crawlers in the web archiving livestreams was used to determine the capabilities of the characters inside the Gun Mayhem 2 More Mayhem and NFL Challenge video games played during the gaming livestreams. In the future, more nuanced evaluation of crawling and replay performance can be used to better influence in-game environments and capabilities.

If you have any questions or feedback, you can email Travis Reid at treid003@odu.edu.

Remembering Past Web Archiving Events With Library of Congress Staff

By Meghan Lyon, Digital Collection Specialist, Library of Congress and member of WAC 2022 Program Committee


Since joining the Library of Congress Web Archiving Program remotely in 2020, I have had the pleasure of participating in IIPC activities and getting to know the generous and hardworking members of this community. Although—due to Covid-19 restrictions—I have yet to meet many of my colleagues in person, I feel as though I’ve been wholeheartedly welcomed. It is a privilege to be a member of the Program Committee for the 2022 Web Archiving Conference and General Assembly, which will be hosted virtually by the Library of Congress.

Last year, I remember the tireless planning efforts of Senior Program Officer Olga Holownia as she, then-IIPC Chair (now Vice-Chair) Abbie Grotke, and staff members from the National Library of Luxembourg (2021's amazing virtual conference host) tested the virtual conference platform. They tested virtual tables and planned for post-session break-out chats where engaged members could continue discussions from the previous panel. The end result was engaging and exciting, especially for a virtual conference.

It was a pleasure to learn at that time about topics as diverse as the Frisian web (Kees Teszelszky, "Side fûn: mapping the Frisian web domain in the Netherlands"), Flash-capable browser emulation (Ilya Kreymer & Humbert Hardy, "Not gone in a Flash! Developing a Flash-capable remote browser emulation system"), and experimental methods of quality assurance for web archives (Brenda Reyes Ayala, James Sun, Jennifer McDevitt & Xiaohui Liu, "Detecting quality problems in archived websites using image similarity"). That presentation led me to Dr. Ayala's research, which has greatly impacted QA workflow development here at the LoC. Workflow development will be included in the panel "Advancing Quality Assurance for Web Archives: Putting Theory into Practice" at the upcoming 2022 Web Archiving Conference. If you missed the 2021 conference, you can still view selected talks and Q&A sessions on the IIPC YouTube channel.

IIPC 2021 Web Archiving Conference co-hosted with the National Library of Luxembourg.

With that, I’d like now to ask Abbie Grotke, my supervisor and Vice-Chair of the IIPC, as well as Grace Thomas, one of my teammates on the Web Archiving Team, some questions about their experience in the IIPC community:

Meghan Lyon: Give us a snapshot of your first experience — or of a memorable experience — at an IIPC WAC & GA of times past.

Grace Thomas (Senior Digital Collection Specialist, Web Archiving Team):

The first IIPC WAC I attended was Web Archiving Week 2017 in London. I had joined the Library of Congress Web Archiving Team less than a year earlier and I was still trying to figure out the extent of this new world. From what my seasoned coworkers said about the web archiving community, I knew it was small and geographically disparate – a modest group of faceless individuals shouldering the massive task of archiving the web – but the events in London showed me how kind, collaborative, and very real everyone is. We are all dealing with the exact same issues at different scales and, most importantly, I got the feeling that everyone was there because they wanted to carry on this work and find solutions to those problems together.

Archives Unleashed hackathon during the Web Archiving Week 2017 at the British Library.
Photo credit: Olga Holownia.

I also attended the 2018 WAC in Wellington, New Zealand, which provided me the opportunity of a grand adventure in a stunning locale! Even now, nearly four years later, I frequently recall Dr Rachael Ka'ai-Mahuta's keynote about the archiving of Indigenous Peoples' language, culture, and movement, which gave me an important framework for thinking about the ethics of cultural ownership. It was the farthest I had ever traveled from home, and being surrounded by Māori customs and artifacts that week further deepened these concepts; I'm grateful to have been in that place exactly at that time.

Although, I have to say the most memorable WAC experience was nearly missing my flight back to the US from London in 2017 and seeing Abbie’s face break into a relieved smile as I sprinted up to the gate at Heathrow! I guess I didn’t want to leave the WAC… and who would?

Keynote by Dr Rachael Ka’ai-Mahuta titled Te Māwhai – te reo Māori, the Internet, archiving, and trust issues. Photo credit: Mark Beatty.

Abbie Grotke: Besides reliving that moment of almost losing Grace in London (oh my!), my first IIPC memories are from way back at the beginning of the consortium. There was talk of this international group forming, and although I did not get to go to an early meeting in Rome, Italy, I attended another very early-days discussion called the "National Libraries Web Archiving Consortium", which was held at the Library of Congress in March 2003. It was there that (besides colleagues at the Internet Archive) I first met fellow web archiving colleagues from the British Library, the Bibliothèque nationale de France, the National and University Library of Iceland, and the National Library of Canada (now Library and Archives Canada). These institutions, along with LC, the Internet Archive and a number of others, were the early founders of the consortium, and many of those colleagues became good friends for many years. I couldn't have imagined then that I would still be involved in this community all these years later! A lot of those folks have moved on or retired, but our institutions still work closely together to this day.

One of my favorite memorable experiences was when I was the communications officer, supporting the Steering Committee of the IIPC. We had been in Oslo for a meeting of the Access Working Group, where we were hashing out requirements for an access tool. We all hopped on a plane to Trondheim, then a puddle jumper from Trondheim to Mo i Rana, where the other National Library buildings were, for a Steering Committee meeting up there. Gildas Illien (the IIPC technical lead at the time) and I were in the very back of the plane, which had a row entirely across the back, looking straight down the middle aisle. Most of the Steering Committee was on the plane, which was having some horrible turbulence. Even though we were terrified by the flight, I just remember laughing so much (coping mechanism!) with Gildas about the fact that if the plane went down, the consortium would be over. We also couldn't stop laughing at the "barf bags" in front of us, which said "uuf da" – which I now say ALL the time and always think back to that moment. We of course landed safely. That was also the meeting where a colleague from New Zealand called in for the entire two days of meetings despite the time difference, and at dinner we started talking to a plant in the middle of the table as if it were him. Good times!


Meghan Lyon: Tell us one thing you love or appreciate about being a part of the international web archiving community.

Abbie Grotke: It truly is the most supportive community, and I am forever grateful for the opportunity to meet and know so many helpful colleagues from across the globe. And there is nothing like a conference in a beautiful library in an unfamiliar city with the smartest experts in web archiving in the world. I've forged some wonderful friendships over the years. While a virtual meeting is not quite the same and I can't wait until we can meet again in person, I've been amazed at how we've adapted as a community to an entirely virtual event. In many ways it's allowed for a richer experience – more people who might not have been able to travel to the conference and meetings can participate, and in my mind that's always a benefit! I hope we can continue to offer a blend of in-person and virtual events in the future. Come join us!

WAC 2022

Registration is now open. Register separately for each day you plan to attend: May 23, 24, and 25 for the WAC, and May 17-19 for the General Assembly. View the schedule and abstracts, and learn more about the Conference and GA sessions on the IIPC website: 2022 Web Archiving Conference!


IIPC General Assembly & Web Archiving Conference 

IIPC GA & WAC collection at UNT Digital Library

2021 Web Archiving Conference: presentations & recordings

Celebrating the 2022 Winter Olympics and Paralympics Web Archive Collection

By Helena Byrne, Curator of Web Archives, British Library


The first IIPC collection focused just on the 2010 Winter Olympics in Vancouver. Since 2012, the IIPC has archived web content on both the Olympic and Paralympic Games. To date, the IIPC has archived seven Games. Beijing 2022 was also the 4th Winter Games collection.

Collection name                                       Data      Docs
2014 Winter Olympics                                  1.6 TB    57,145,052
2014 Winter Paralympics                               1.3 TB    42,542,659
2016 Summer Olympics and Paralympics                  3.1 TB    18,205,981
2018 Winter Olympics and Paralympics                  1.2 TB    12,218,514
2020 Summer Olympics and Paralympics [held in 2021]   610.9 GB  6,923,179
2022 Winter Olympics and Paralympics                  361.1 GB  14,410,542

You can view the 2022 Winter Olympics and Paralympics here:

https://archive-it.org/collections/18422

In this final blog post on the IIPC Content Development Group (CDG) Beijing 2022 Olympic and Paralympic Games web archive collection, we look back at what content was crawled. 

Social media was excluded from the collection policy, as these platforms update their code and design frequently and do not prioritise archivability. As a result, they present ever-changing technical challenges to archivists, requiring more labour-intensive scoping with less satisfactory results. For that reason, we have not accepted nominations of content from Facebook, Instagram, Twitter, and other similar social media platforms.

What we collected

Crawl dates

There were five crawl dates for this collection. The collection period started in January and finished towards the end of March. A sixth crawl of 32 seeds was conducted on April 26, as these had been missed in the first crawl. This issue was only noticed when preparing the metadata for publishing the collection.

  1. February 02, 2022 (308 seeds crawled)
  2. February 15, 2022 (264 seeds crawled)
  3. February 23, 2022 (65 seeds crawled)
  4. March 07, 2022 (29 seeds crawled)
  5. March 21, 2022 (198 seeds crawled)

We had a steady number of nominations for each crawl date. The only exception was the fourth crawl on March 7th, with only 29 seeds crawled. This figure also includes a number of URLs that had returned an error in the previous crawl. Nominations to this collection are made on a voluntary basis by members of the IIPC and the public from around the world.

Countries covered

Athletes from 91 National Olympic Committees (NOCs) competed in Beijing 2022. Haiti and Saudi Arabia made their Winter Olympic debuts at this edition of the Games. 

We received nominations from 38 countries for the IIPC CDG 2022 Winter Olympics and Paralympics collection. Some of these countries might have only one or two websites nominated, while many more countries that competed in multiple events have no content nominated at all.

Languages covered

We have 24 languages in the collection, including French (228 nominations), English (162 nominations) and Japanese (89 nominations). But many languages have only a few nominations, and many other languages are not represented in the collection at all.

Data size 


We archived 863 of the 889 seeds that were nominated. These seeds include full websites, subsections of websites and individual web pages in multiple languages from around the world. The 26 nominated seeds that were not archived were social media accounts, so they weren't added to the crawler. Roughly 54 seeds in total came up in the Archive-It crawl reports as URLs that, for technical reasons, the crawlers were unable to archive when they visited the seed. These seeds were assessed and added to the next crawl in the series, with some additional techniques used to try to capture them. However, not all of these recrawl attempts were successful. Quality assurance was carried out on these 54 seeds, and 36 of them were set to private as they displayed no content or only error messages.

We archived 361.1 GB of data and 14,410,542 documents at the end of five crawl cycles. We had initially set aside 1 TB for this collection, but as we weren't archiving any social media content and implemented a size cap on all seeds, we did not use as much data as expected.

We used the following policy when setting the scope of the crawl:

  1. Full seed host or directory (Example: team or athlete website)
    • These seeds will be capped at 3 GB
  2. Crawl one page only (Example: news article)
    • These seeds will be capped at 1 GB 
  3. Seed page plus 1 click of all links on seed page (Example: news page linking to multiple articles)
    • These seeds will be capped at 2 GB

In the 2018 Winter Games collection, we collected 1,413 seeds and used 1.2 TB of data with 12,218,514 documents. However, if we compare just the URL nominations, the 2022 and 2018 collections are quite similar once the 557 social media URLs tagged as Blogs & Social Media are excluded from the 2018 total.

Related blog posts

Get Involved in Web Archiving the Winter Games – Beijing 2022 

Steeze (Style & Ease) on the Slopes – Web Archiving Beijing 2022

Resources

About IIPC collaborative collections

IIPC CDG updates on the IIPC Blog

The Summer and Winter Olympics and Paralympics Collections in Archive-It

The Summer and Winter Olympics and Paralympics Collections 2010-2020 poster

Despite not collecting social media content, we did promote the call for nominations for this collection on social media channels (mostly Twitter) with the collection hashtag #WAGames2022.

For more information and updates on Content Development Group activities, you can contact the IIPC CDG team at Collaborative-collections@iipc.simplelists.com

Archive of Tomorrow – Capturing online health (mis)information

By Alice Austin, Web Archivist, Archive of Tomorrow

Centre for Research Collections, Main Library, University of Edinburgh


Copyright © 2021 R. Stevens / CREST (CC BY-SA 4.0)

It goes without saying that the Covid-19 pandemic has cast a harsh light across our society and exposed fault lines in a number of areas, not least in the fragility of our information infrastructures. Over the last two years we have seen misinformation spread at a similar speed to the virus, with the consequence that any future attempt to examine the medical pandemic as an historical and social phenomenon will also have to reckon with the misinformation pandemic. Government and medical websites have changed on a daily basis as new information emerges, and there has been a massive proliferation of comment on social media and other online platforms about the virus and other health issues. Clinical advice, data and scientific evidence have been contested, revised, used and misused with dramatic and sometimes tragic consequences, and yet the digital record of this is fragile and difficult to access. There have been sustained and laudable efforts to ensure that inaccurate and potentially harmful information is taken down swiftly, with the result that a researcher exploring, for example, the emergence of ivermectin as a Covid ‘miracle cure’ might find they come up against a lot of dead ends and 404s.

Goals of the Archive of Tomorrow

In response, the Archive of Tomorrow project hopes to capture an accurate record of how people use the internet to find, share, and discuss health and health-related topics so that current and future researchers can understand public health practices in the digital age. We hope to capture 10,000 targets – ranging from official, ‘approved’ and verified sources, to unofficial, sometimes controversial publications – and to secure access permission for this content to produce a ‘research-ready’ collection. The project is ambitious, not just in its intention to build a useful evidence base of historical web resources but also in its attempt to develop an ethical and meaningful precedent for archiving possible mis- or dis-information. Because it crystallises so many of these issues, Covid is one subject that we’re focusing on in detail, but we’re also looking at capturing other health-related debates such as those surrounding reproductive rights, ‘alternative’ medicines, assisted dying, and the use of medical cannabis.

Timeline

Having launched in February 2022, the project is still in the early stages of development. It’s being led by the National Library of Scotland with web archivists based in university libraries in Edinburgh, Oxford and Cambridge, and invaluable input from the British Library’s web archiving team. This kind of collaborative working feels very much representative of the Covid era – it’s hard to imagine a project like this emerging in the days when remote working and Zoom meetings were the exception rather than the norm! We’ll be talking more about the collaborative nature of the project at the IIPC WAC conference in May – and registration is open now!

Selecting ‘health information’

Thinking about how work practices have changed throughout the pandemic brings us to something that has been a challenge for the project team to unravel: how to define the boundaries around ‘health information’ – where it begins and ends, and how health relates to other spheres like politics, law and employment. We have to impose boundaries on our collecting, and while some boundaries are legislative or technological (such as the exclusion of broadcast media like podcasts and videos from the collection), some are cultural: for example, to what extent do protests against Covid measures such as masks and lockdowns count as health information? What about artistic responses to the pandemic? And how well are we able to represent health information-seeking behaviours in languages other than English?

Welsh Covid-19 pandemic guide: what to do and not do. Copyright © 2020 G. Hegasy (CC BY-SA 4.0)

Archivists have long understood that we can’t collect everything – and we don’t try to! As with so much collecting, the challenge lies in how to communicate our selection decisions without dictating the way the archived material is used and encountered. In this case, we’re trying to capture public health discourse without becoming part of the conversation ourselves, but we do have a degree of responsibility when considering health mis/dis/information: to what extent should inaccurate, refuted or dangerous content be flagged in the UKWA interface? How do we make such content available responsibly without inserting our perspective into the debates?

Archive of Tomorrow workshop

At this stage we have more questions than answers, and we anticipate that this will continue. The project isn’t designed to solve these problems, but rather, to articulate them in a way that opens the door for future work and solutions. Our first activity towards this goal is the workshop that we’re hosting at the end of the month. We hope that by engaging with current and future researchers with an interest in online information-seeking behaviours or public health we can develop and produce a valuable, research-ready collection that will give real insight into how the internet has been used for health information during the pandemic and beyond.

Migrating to pywb at the National and University Library of Iceland

By Kristinn Sigurðsson, Head of Digital Projects and Development at the National and University Library of Iceland, and Georg Perreiter, Software Developer at the National and University Library of Iceland.


Here at the National and University Library of Iceland (NULI) we have over the last couple of years eagerly awaited each new deliverable of the IIPC-funded pywb project, developed by Webrecorder’s Ilya Kreymer. Last year Kristinn wrote a blog post about our adoption of OutbackCDX, based on the recommendation from the OpenWayback to pywb transition guide that was part of the first deliverable. In that post he noted that we’d gotten pywb to run against the index, but that many issues remained which we expected to be addressed as the pywb project continued. Now that the project has been completed, we’d like to use this opportunity to share our experience of this transition.

As Kristinn is a member of the IIPC’s Tools Development Portfolio (TDP) – which oversees the project – this was partly an effort on our part to help the TDP evaluate the project deliverables. Primarily, however, it was motivated by the need to replace our aging OpenWayback installation.

It is worth noting that prior to this project, we had no experience with using Python based software beyond some personal hobby projects. We were (and are) primarily a “Java shop.” We note this as the same is likely true of many organizations considering this switch. As we’ll describe below, this proved quite manageable despite our limited familiarity with Python.

Get pywb Running

The first obstacle we encountered was the required Python version: pywb requires Python 3.8, but our production environment, running Red Hat Enterprise Linux (RHEL) 7, defaulted to Python 3.6. We therefore had to install Python 3.8 alongside it, learn how to use a Python virtual environment so we could run pywb in isolation, and learn how to resolve site-package conflicts with Python’s package manager (pip3) caused by differences between Ubuntu and RHEL.

Of course, all of that could be avoided if you deploy pywb on a machine with a compatible version of Python or use pywb’s Docker image. Indeed, when we first set up a test instance on a “throwaway” virtual machine, we were able to get pywb up and running against our OutbackCDX in a matter of minutes.

Access Control

Our web archive is open to the world. However, we do need to limit access to a small number of resources. With OpenWayback this was handled using a plain-text exclusion file. We were able to use pywb’s wb-manager command line tool to migrate this file to the JSON-based format that pywb uses. The only issue we ran into was that we needed to strip out empty lines and comments (i.e. lines starting with #) before passing the file to this utility.
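That cleanup step is easy to script. Below is a minimal sketch in Python (the file names are placeholders, not our actual paths) that drops blank lines and comments before the exclusion file is handed to wb-manager:

# Minimal sketch: prepare an OpenWayback exclusion file for migration by
# removing blank lines and '#' comment lines. File names are placeholders.
with open("excludes.txt") as src, open("excludes-clean.txt", "w") as dst:
    for line in src:
        entry = line.strip()
        if entry and not entry.startswith("#"):
            dst.write(entry + "\n")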

Making pywb Also Speak Icelandic

We want our web archive user interface to be available in both Icelandic and English. When adopting OpenWayback, we ran into issues with its internationalization (i18n) support and ultimately just translated the interface into Icelandic, abandoning the i18n effort. pywb already supported i18n, and improving that support and its documentation was one element of the IIPC pywb project. We therefore very much wanted to take advantage of it and fully support both languages in our pywb installation.

We found the documentation describing this process thorough and easy to follow: we installed pywb’s i18n tool, added an “is” locale and edited the provided CSV file to include Icelandic translations.

Along the way we had a few minor issues with hard-coded textual elements for which translations could not be provided. This was notably more common in newly added features, as one might expect; we were, in a sense, acting as beta testers of the software, picking up each new update as it came, so this isn’t all that surprising. We reported these omissions as we discovered them and they were quickly addressed.

The only issue that wasn’t (and couldn’t be) addressed turned out to be a limitation of Chrome. We noticed that our date formatting for Icelandic worked well in both Firefox and Edge but displayed incorrectly in Chrome. This is because Chrome does not support the Icelandic locale in JavaScript code like this: new Date().toLocaleDateString("is")

We were able to work around this by using a German locale instead – e.g. new Date().toLocaleDateString("de") – as none of our date formatting patterns relied on outputting the names of days or months.

Making pywb Fit In

Here at NULI we have a lot of websites. To help us maintain a “brand” identity, we – to the extent possible – like them to have a consistent look and feel. So, in addition to making pywb speak Icelandic, we wanted it to fit in.

Much like i18n, UI customizations were identified as being important to many IIPC members and additional support for and documentation of that was included in the IIPC pywb project. Following the documentation, we found the customization work to be very straightforward.

You can easily add your own templates and static files or copy and modify the existing ones. As you can always remove your added files, there is no chance of messing anything up.


As you can see on our website, we were able to bring our standard theme to pywb.

Additionally, we added about 20 lines of code to frontendapp.py to serve extra localized static content through one more template (including header and footer) that loads static HTML files as its content. This allowed us to add a few extra web pages for our FAQ and some other static material. It was our only “hack” and is, of course, only needed if you want static content served directly from pywb (as opposed to linking to another web host).

New Calendar & Sparkline and Performance

The final deliverable of the IIPC-funded pywb project introduced a new calendar-based search result page and a “sparkline” navigation element in the UI header. Both were features found in OpenWayback and, in our view, the last “missing” functionality in pywb. We were very happy to see these features arrive in pywb, but we also discovered a performance problem.


Our web archive is by no means the largest one in the world. It is, however, somewhat unusual in that it contains some pages with over one hundred thousand captures (yes, 100,000+ copies). These mostly come from our RSS-based crawls, which capture news sites’ front pages every time a new item appears in the RSS feed. The largest is likely the front page of our state broadcaster (RÚV), with 159,043 captures available as we write this (and probably another thousand or so waiting to be indexed).
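As an aside for readers who want to reproduce such counts: OutbackCDX answers plain CDX-style queries over HTTP, one result line per capture, so counting the captures of a URL amounts to counting lines. A minimal sketch in Python, where the host, port and index name are placeholders rather than our actual setup:

# Minimal sketch: count the captures of one URL in an OutbackCDX index
# by counting the CDX result lines. Host, port and index name are
# placeholders for illustration.
import urllib.request
from urllib.parse import urlencode

query = urlencode({"url": "ruv.is/"})
with urllib.request.urlopen(f"http://localhost:8080/myindex?{query}") as resp:
    captures = sum(1 for _ in resp)

print(f"{captures} captures")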

The initial version of the calendar and sparkline struggled with these URLs. After we reported the issue, some improvements were made involving additional caching and “loading” animations so users would know the system was busy instead of just seeing a blank screen. This improved matters notably, but pywb’s performance under these circumstances could stand further improvement.

We recognize that our web archive is somewhat unusual in having material like this. However, as time goes on, more archives will accumulate high numbers of captures of the same pages, so this is worth addressing in the future.

Final Thoughts

We’ve been very pleased with this migration process. In particular we’d like to commend the Webrecorder team for the excellent documentation now available for pywb. We’d also like to acknowledge all testing and vetting of the IIPC pywb project deliverables that Lauren Ko (UNT and member of the IIPC TDP) did alongside – and often ahead of – us.

We can also reaffirm Kristinn’s recommendation from last year to use OutbackCDX as a backend for pywb (and OpenWayback). Having a single OutbackCDX instance powering both our OpenWayback and pywb installations notably simplified the setup of pywb and ensured we only had one index to update.

We still have pywb in a public “beta” – in part due to the performance issues discussed above – while we continue to test it. But we expect it will replace OpenWayback as our main replay interface at some point this year.

Steeze (Style & Ease) on the Slopes – Web Archiving Beijing 2022

By Helena Byrne, Curator Web Archives, British Library


The Winter Olympics may be over but the IIPC Content Development Group Beijing 2022 collection is still running until March 20th, 2022. The Winter Paralympics got underway on March 4th and we would love to see your nominations for this edition of the Games. 

In our first blog post Get Involved in Web Archiving the Winter Games – Beijing 2022, we outlined details of what and how to nominate. Once you have selected the web pages you would like to see in the collection, it takes less than 5 minutes to fill in the submission form:

https://bit.ly/CDG-2022Games-collection-public-nominations 

What we have collected so far

We have archived 616 nominated seeds so far. These nominations include full websites, subsections of websites and individual web pages in multiple languages from around the world. We have archived 280.3GB of data and 13,197,402 documents at the end of three crawl cycles. 


Social media policy

Social media was excluded from the collection policy because these platforms update their code and design frequently and do not prioritise archivability. As a result, they present ever-changing technical challenges to archivists, requiring more labour-intensive scoping with less satisfactory results. For that reason, we have not accepted nominations of content from Facebook, Instagram, Twitter, and other similar social media platforms.

Map of the world

Qualified athletes from 91 National Olympic Committees (NOCs) competed in Beijing 2022. Only 29 of these NOCs received a medal. Athletes competed across 109 events over 15 disciplines in seven sports. Haiti and Saudi Arabia made their Winter Olympic debuts at this edition of the Games. So far, nominations have been received for the IIPC CDG 2022 Winter Olympics and Paralympics collection from 27 countries. Some of these countries might have only one or two websites nominated from them, and there are many more countries that competed and have no content nominated such as Austria (106 athletes), Sweden (116 athletes) and Ukraine (45 athletes).

Languages covered

We have 24 languages in the collection, including French (200 nominations), Japanese (91 nominations) and English (79 nominations). Many languages have only a few nominations, however, and many other languages are not represented in the collection at all. We would like to see more nominations in multiple languages, especially Chinese (4 nominations) and Russian (1 nomination).

How to get involved

Once you have selected the web pages you would like to see in the collection, add them to the submission form: https://bit.ly/CDG-2022Games-collection-public-nominations

If you know anyone who may be interested in contributing to this collection, please share the link with them! The call for nominations will close on March 20, 2022.

For more information and updates, you can contact the IIPC CDG team Collaborative-collections@iipc.simplelists.com or follow the collection hashtag #WAGames2022.

Resources