IIPC Steering Committee Election 2022: call for nominations

The nomination process for IIPC Steering Committee is now open.

The Steering Committee (SC) is composed of no more than fifteen Member Institutions who provide oversight of the Consortium and define and oversee action on its strategy. This year, five seats are up for election. 

What is at stake?

Serving on the Steering Committee is an opportunity for motivated members to help guide the IIPC’s mission of improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. Steering Committee members are expected to take an active role in leadership, contribute to SC and Portfolio activities, and help guide and administer the organisation. The elected SC members also lead IIPC Portfolios and thus have the opportunity to shape the Consortium’s strategic direction related to three main areas: tools development, membership engagement and partnerships. Every year, three SC members are designated as IIPC Officers (Chair, Vice-Chair and Treasurer) to serve on the IIPC Executive Board and are responsible for implementing the Strategic Plan.

Who can run for election?

Participation in the SC is open to any IIPC member in good standing. We strongly encourage any organisation interested in serving on the SC to nominate themselves for election. The SC members meet in person (if circumstances allow) at least once a year. Face-to-face meetings are supplemented by two teleconferences plus additional ones as required.

Please note that the nomination should be on behalf of an organisation, not an individual. Once elected, the member organisation designates a representative to serve on the Steering Committee. The list of current SC member organisations is available on the IIPC website.

How to run for election?

All nominee institutions, both new candidates and existing SC members whose term is expiring but who are interested in continuing to serve, are asked to write a short statement (max 200 words) outlining their vision for how they would contribute to IIPC via serving on the Steering Committee. Statements can point to past contributions to the IIPC or the SC, relevant experience or expertise, new ideas for advancing the organisation, or any other relevant information.

All statements will be posted online and emailed to members prior to the election with ample time for review by all membership. The results will be announced in October and the three-year term on the Steering Committee will start on 1 January.

Below you will find the election calendar. We are very much looking forward to receiving your nominations. If you have any questions, please contact the IIPC Senior Program Officer (SPO).


Election Calendar

15 June – 14 September 2022: Nomination period. IIPC Designated Representatives are invited to nominate their organisation by sending an email including a statement of up to 200 words to the IIPC SPO.

15 September 2022: Nominee statements are published on the Netpreserve blog and Members mailing list. Nominees are encouraged to campaign through their own networks.

15 September – 15 October 2022: Members are invited to vote online. Each organisation votes only once for all nominated seats. The vote is cast by the Designated Representative.

18 October 2022: The results of the vote are announced on the Netpreserve blog and Members mailing list.

1 January 2023: The newly elected SC members start their three-year term.

Game Walkthroughs and Web Archiving Project: Integrating Gaming, Web Archiving, and Livestreaming 

“Game Walkthroughs and Web Archiving” was awarded a grant in the 2021-2022 round of the Discretionary Funding Programme (DFP), the aim of which is to support the collaborative activities of the IIPC members by providing funding to accelerate the preservation and accessibility of the web. The project lead is Michael L. Nelson from the Department of Computer Science at Old Dominion University. Los Alamos National Laboratory Research Library is a project partner.


By Travis Reid, Ph.D. student at Old Dominion University (ODU), Michael L. Nelson, Professor in the Computer Science Department at ODU, and Michele C. Weigle, Professor in the Computer Science Department at ODU

Introduction

Game walkthroughs are guides that show viewers the steps the player would take while playing a video game. Recording and streaming a user’s interactive web browsing session is similar to a game walkthrough, as it shows the steps the user would take while browsing different websites. The idea of having game walkthroughs for web archiving was first explored in 2013 (“Game Walkthroughs As A Metaphor for Web Preservation”). At that time, web archive crawlers were not ideal for web archiving walkthroughs because they did not allow the user to view the webpage as it was being archived. Recent advancements in web archive crawlers have made it possible to preserve the experience of dynamic web pages by recording a user’s interactive web browsing session. Now, we have several browser-based web archiving tools such as WARCreate, Squidwarc, Brozzler, Browsertrix Crawler, ArchiveWeb.page, and Browsertrix Cloud that allow the user to view a web page while it is being archived, enabling users to create a walkthrough of a web archiving session.

Figure 1: Different ways to participate in gaming (left), web archiving (center), and sport sessions (right)

Figure 1 applies the analogy of different types of video games and basketball scenarios to types of web archiving sessions. Practicing playing a sport like basketball by yourself, playing an offline single player game like Pac-Man, and archiving a web page with a browser extension such as WARCreate are all similar because only one user or player is participating in the session (Figure 1, top row). Playing team sports with a group of people, playing an online multiplayer game like Halo, and collaboratively archiving web pages with Browsertrix Cloud are similar since multiple invited users or players can participate in the sessions (Figure 1, center row). Watching a professional sport on ESPN+, streaming a video game on Twitch, and streaming a web archiving session on YouTube can all be similar because anyone can be a spectator and watch the sporting event, gameplay, or web archiving session (Figure 1, bottom row).

One of our goals in the Game Walkthroughs and Web Archiving project is to create a web archiving livestream like that shown in Figure 1. We want to make web archiving entertaining to a general audience so that it can be enjoyed like a spectator sport. To this end, we have applied a gaming concept to the web archiving process and integrated video games with web archiving. We have created automated web archiving livestreams (video playlist) where the gaming concept of a speedrun was applied to the web archiving process. Presenting the web archiving process in this way can be a general introduction to web archiving for some viewers. We have also created automated gaming livestreams (video playlist) where the capabilities for the in-game characters were determined by the web archiving performance from the web archiving livestream. The current process that we are using for the web archiving and gaming livestreams is shown in Figure 2.

Figure 2: The current process for running our web archiving livestream and gaming livestream.

Web Archiving Livestream

For the web archiving livestream (Figure 2, left side), we wanted to create a livestream where viewers could watch browser-based web crawlers archive web pages. To make the livestreams more entertaining, we made each web archiving livestream into a competition between crawlers to see which crawler performs better at archiving the set of seed URIs. The first step for the web archiving livestream is to use Selenium to set up the browsers that will be used to show information needed for the livestream such as the name and current progress for each crawler. The information currently displayed for a crawler’s progress is the current URL being archived and the number of web pages archived so far. The next step is to get a set of seed URIs from an existing text file and then let each crawler start archiving the URIs. The viewers can then watch the web archiving process in action.
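The seed-loading and progress-display steps described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual scripts: the file layout (one URI per line), the `load_seeds` helper, and the `CrawlerProgress` class are all assumptions made for the example, and the Selenium browser setup is left out.

```python
from dataclasses import dataclass
from typing import Optional


def load_seeds(path: str) -> list[str]:
    """Read one seed URI per line, skipping blank lines and '#' comments."""
    with open(path, encoding="utf-8") as handle:
        lines = (line.strip() for line in handle)
        return [line for line in lines if line and not line.startswith("#")]


@dataclass
class CrawlerProgress:
    """Hypothetical holder for the two facts shown per crawler on stream."""
    name: str
    current_uri: Optional[str] = None
    pages_archived: int = 0

    def start(self, uri: str) -> None:
        # Called when the crawler begins archiving a URI.
        self.current_uri = uri

    def finish(self) -> None:
        # Called when the current URI has been archived.
        self.pages_archived += 1
        self.current_uri = None

    def status_line(self) -> str:
        current = self.current_uri or "(idle)"
        return f"{self.name}: {self.pages_archived} archived, now on {current}"
```

A display loop would call `status_line()` for each crawler and push the text to whatever overlay the stream uses.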

Automated Gaming Livestream

The automated gaming livestream (Figure 2, right side) was created so that viewers can watch a game where the gameplay is influenced by the web archiving and replay performance results from a web archiving livestream or any crawling session. Before an in-game match starts, a game configuration file is needed since it contains information about the selections that will be made in the game for the settings. The game configuration file is modified based on how well the crawlers performed during the web archiving livestream. If a crawler performs well during the web archiving livestream, then the in-game character associated with the crawler will have better items, perks, and other traits. If a crawler performs poorly, then their in-game character will have worse character traits. At the beginning of the gaming livestream, an app automation tool like Selenium (for browser games) or Appium (for locally installed PC games) is used to select the settings for the in-game characters based on the performance of the web crawlers. After the settings are selected by the app automation tool, the match is started and the viewers of the livestream can watch the match between the crawlers’ in-game characters. We have initially implemented this process for two video games, Gun Mayhem 2 More Mayhem and NFL Challenge. However, any game with a mode that does not require a human player could be used for an automated gaming livestream.

Gun Mayhem 2 More Mayhem Demo

Gun Mayhem 2 More Mayhem is similar to other fighting games like Super Smash Bros. and Brawlhalla where the goal is to knock the opponent off the stage. When a player gets knocked off the stage, they lose a life. The winner of the match will be the last player left on the stage. Gun Mayhem 2 More Mayhem is a Flash game that is played in a web browser. Selenium was used to automate this game since it is a browser game. In Gun Mayhem 2 More Mayhem, the crawler’s speed was used to determine which perk to use and the gun to use. Some example perks are infinite ammo, triple jump, and no recoil when firing a gun. The fastest crawler used the fastest gun and was given an infinite ammo perk (Figure 3, left side). The slowest crawler used the slowest gun and did not get a perk (Figure 3, right side).

Figure 3: The character selections made for the fastest and slowest web crawlers
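As a rough sketch of how crawl speed might be turned into a game configuration, the snippet below maps the quicker of two crawlers to the fastest gun with an infinite ammo perk and the other to the slowest gun with no perk. The loadout labels, the crawler names, and the `build_game_config` function are hypothetical; the post does not show the real configuration file format.

```python
import json

# Hypothetical labels; the actual in-game option names are not given in the post.
FASTEST_LOADOUT = {"gun": "fastest", "perk": "infinite ammo"}
SLOWEST_LOADOUT = {"gun": "slowest", "perk": None}


def build_game_config(crawl_seconds: dict[str, float]) -> dict[str, dict]:
    """Give the quicker crawler the better loadout in a two-crawler match."""
    if len(crawl_seconds) != 2:
        raise ValueError("this sketch only handles a two-crawler match")
    # Sort by elapsed crawl time; a lower time means a faster crawler.
    ordered = sorted(crawl_seconds, key=crawl_seconds.get)
    fastest, slowest = ordered[0], ordered[-1]
    return {fastest: dict(FASTEST_LOADOUT), slowest: dict(SLOWEST_LOADOUT)}


if __name__ == "__main__":
    config = build_game_config({"crawler-a": 655.5, "crawler-b": 412.0})
    print(json.dumps(config, indent=2))
```

An automation tool such as Selenium would then read a file like this and click through the matching character settings before the match starts.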

NFL Challenge Demo

NFL Challenge is a NFL Football simulator that was released in 1985 and was popular during the 1980s. The performance of a team is based on the player attributes that are stored in editable text files. It is possible to change the stats for the players, like the speed, passing, and kicking ratings, and it is possible to change the name of the team and the players on the team. This customization allows us to  rename the team to the name of the web crawler and to rename the players of the team to the names of the contributors of the tool. NFL Challenge is a MS-DOS game that can be played with an emulator named DOSBox. Appium was used to automate the game since NFL Challenge is a locally installed game. In NFL Challenge, the fastest crawler would get the team with the fastest players based on the players’ speed attribute (Figure 4, left side) and the other crawler would get the team with the slowest players (Figure 4, right side).

Figure 4: The player attributes for the teams associated with the fastest and slowest web crawlers. The speed ratings are the times for the 40-yard dash, so the lower numbers are faster.
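Because the speed rating is a 40-yard dash time, the "fastest" team is the one with the lowest numbers. A minimal sketch of that pairing, assuming each team is summarised by the mean of its players' times (the team names, crawler names, and both helper functions are illustrative, not part of the project's code):

```python
from statistics import mean


def rank_teams_by_speed(teams: dict[str, list[float]]) -> list[str]:
    """Order team names from fastest to slowest by mean 40-yard dash time."""
    # Lower dash times are faster, so an ascending sort puts the fastest first.
    return sorted(teams, key=lambda name: mean(teams[name]))


def assign_teams(crawl_seconds: dict[str, float],
                 teams: dict[str, list[float]]) -> dict[str, str]:
    """Pair crawlers (quickest first) with teams (fastest first)."""
    crawlers = sorted(crawl_seconds, key=crawl_seconds.get)
    return dict(zip(crawlers, rank_teams_by_speed(teams)))
```

Using the mean is only one possible summary; a real setup might weight particular positions or use the player attribute files directly.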

Future Work

In future work, we plan on making more improvements to the livestreams. We will update the web archiving livestreams and the gaming livestreams so that they can run at the same time. The web archiving livestream will use more than the speed of a web archive crawler when determining the crawler’s performance, such as using metrics from Brunelle’s memento damage algorithm, which is used to measure the replay quality of archived web pages. During future web archiving livestreams, we will also evaluate and compare the capture and playback of web pages archived by different web archives and archiving tools like the Internet Archive’s Wayback Machine, archive.today, and Arquivo.pt. We will update the gaming livestreams so that they can support more games and games from different genres. The games that we have supported so far are multiplayer games. We will also try to automate single player games where the in-game character for each crawler can compete to see which player gets the highest score on a level or which player finishes the level the fastest. For games that allow creating a level or game world, we would like to use what happens during a crawling session to determine how the level is created. If the crawler was not able to archive most of the resources, then more enemies or obstacles could be placed in the level to make it more difficult to complete the level. Some games that we will try to automate include Rocket League, Brawlhalla, Quake, and DOTA 2. When the scripts for the gaming livestream are ready to be released publicly, it will also be possible for anyone to add support for more games that can be automated. We will also have longer runs for the gaming livestreams so that a campaign or season in a game can be completed. A campaign is a game mode where the same characters can play a continuing story until it is completed. A season for the gaming livestreams will be like a season for a sport where there are a certain number of matches that must be completed for each team during a simulated year and a playoff tournament that ends with a championship match.

Summary

We are developing a proof of concept that involves integrating gaming and web archiving. We have integrated gaming and web archiving so that web archiving can be more entertaining to watch and enjoyed like a spectator sport. We have applied the gaming concept of a speedrun to the web archiving process by having a competition between two crawlers where the crawler that finished archiving the set of seed URIs first would be the winner of the competition. We have also created automated web archiving and gaming livestreams where the web archiving performance of web crawlers from the web archiving livestreams was used to determine the capabilities of the characters inside of the Gun Mayhem 2 More Mayhem and NFL Challenge video games that were played during the gaming livestreams. In the future, more nuanced evaluation of the crawling and replay performance can be used to better influence in-game environments and capabilities.

If you have any questions or feedback, you can email Travis Reid at treid003@odu.edu.

Remembering Past Web Archiving Events With Library of Congress Staff

By Meghan Lyon, Digital Collection Specialist, Library of Congress and member of WAC 2022 Program Committee


Since joining the Library of Congress Web Archiving Program remotely in 2020, I have had the pleasure of participating in IIPC activities and getting to know the generous and hardworking members of this community. Although—due to Covid-19 restrictions—I have yet to meet many of my colleagues in person, I feel as though I’ve been wholeheartedly welcomed. It is a privilege to be a member of the Program Committee for the 2022 Web Archiving Conference and General Assembly, which will be hosted virtually by the Library of Congress.

Last year, I remember the tireless planning efforts of Senior Program Officer Olga Holownia as she and then-IIPC Chair, now Vice-Chair, Abbie Grotke and staff members from the National Library of Luxembourg (2021’s amazing virtual conference host) tested the virtual conference platform. They tested virtual tables and planned post-session break-out chats where engaged members could continue discussions from the previous panel. The end result was engaging and exciting, especially for a virtual conference.

It was a pleasure to learn at that time about topics as diverse as the Frisian web (Kees Teszelszky, “Side fûn: mapping the Frisian web domain in the Netherlands”), Flash-capable browser emulation (Ilya Kreymer & Humbert Hardy, “Not gone in a Flash! Developing a Flash-capable remote browser emulation system”), and experimental methods of quality assurance for web archives (Brenda Reyes Ayala, James Sun, Jennifer McDevitt & Xiaohui Liu, “Detecting quality problems in archived website using image similarity”). Ayala et al.’s presentation led me to Dr. Ayala’s research, which has greatly impacted QA workflow development here at the LoC. Workflow development will be included in the panel “Advancing Quality Assurance for Web Archives: Putting Theory into Practice” in the upcoming 2022 Web Archiving Conference. If you missed the 2021 conference, you can still view selected talks and Q&A sessions on the IIPC YouTube channel.

IIPC 2021 Web Archiving Conference co-hosted with the National Library of Luxembourg.

With that, I’d like now to ask Abbie Grotke, my supervisor and Vice-Chair of the IIPC, as well as Grace Thomas, one of my teammates on the Web Archiving Team, some questions about their experience in the IIPC community:

Meghan Lyon: Give us a snapshot of your first experience — or of a memorable experience — at an IIPC WAC & GA of times past.

Grace Thomas (Senior Digital Collection Specialist, Web Archiving Team):

The first IIPC WAC I attended was Web Archiving Week 2017 in London. I had joined the Library of Congress Web Archiving Team less than a year earlier and I was still trying to figure out the extent of this new world. From what my seasoned coworkers said about the web archiving community, I knew it was small and geographically disparate – a modest group of faceless individuals shouldering the massive task of archiving the web – but the events in London showed me how kind, collaborative, and very real everyone is. We are all dealing with the exact same issues at different scales and, most importantly, I got the feeling that everyone was there because they wanted to carry on this work and find solutions to those problems together.

Archives Unleashed hackathon during the Web Archiving Week 2017 at the British Library.
Photo credit: Olga Holownia.

I also attended the 2018 WAC in Wellington, New Zealand which provided me the opportunity of a grand adventure in a stunning locale! Even now, nearly four years later, I frequently recall Dr Rachael Ka’ai-Mahuta’s keynote about the archiving of Indigenous Peoples’ language, culture, and movement, which gave me an important framework for thinking about the ethics of cultural ownership. The farthest I had ever traveled from home, having been surrounded by Māori customs and artifacts that week further deepened these concepts and I’m grateful to have been in that place exactly at that time.

Although, I have to say the most memorable WAC experience was nearly missing my flight back to the US from London in 2017 and seeing Abbie’s face break into a relieved smile as I sprinted up to the gate at Heathrow! I guess I didn’t want to leave the WAC… and who would?

Keynote by Dr Rachael Ka’ai-Mahuta titled Te Māwhai – te reo Māori, the Internet, archiving, and trust issues. Photo credit: Mark Beatty.

Abbie Grotke: Besides reliving that moment of almost losing Grace in London (oh my!) my first IIPC memories were from way back in the beginning of the consortium. There was talk of this international group forming, and although I did not get to go to an early meeting in Rome, Italy, I attended another very early-days discussion called “National Libraries Web Archiving Consortium” which was held at the Library of Congress in March 2003. It was there that (besides colleagues at Internet Archive) I first met fellow web archiving colleagues from the British Library, Bibliothèque nationale de France, National and University Library of Iceland, and National Library of Canada (now Library and Archives Canada). These, along with LC and Internet Archive and a number of other institutions, were the early founders of the consortium and many became good friends and colleagues for many years. I couldn’t have imagined then that I would still be involved in this community all these years later! A lot of those folks have moved on or retired, but our institutions still work closely together to this day.

One of my favorite memorable experiences was when I was communications officer, supporting the Steering Committee of the IIPC, who had been in Oslo for a meeting of the Access Working Group where we were hashing out requirements for an access tool. We all hopped on a plane to Trondheim, then a puddle jumper plane from Trondheim to Mo i Rana where the other National Library buildings were, for a Steering Committee meeting up there. Gildas Illian (the IIPC technical lead at the time) and I were in the very back of the plane which had a row entirely across the back, looking straight down the middle aisle. Most of the Steering Committee was on the plane, which was having some horrible turbulence. Even though we were terrified by the flight I just remember laughing so much (coping mechanism!) with Gildas about the fact that if the plane went down, the consortium would be over. We also couldn’t stop laughing at the “barf bags” in front of us, which said “uuf da” – which I now say ALL the time and always think back to that moment. We of course landed safely. That was also the meeting where a colleague from New Zealand was calling in the entire two days of meetings despite the time difference, and at dinner we started talking to a plant in the middle of the table as if it were him. Good times!


Meghan Lyon: Tell us one thing you love or appreciate about being a part of the international web archiving community.

Abbie Grotke: It truly is the most supportive community and I am forever grateful about the opportunity to meet and know so many helpful colleagues from across the globe. And there is nothing like a conference in a beautiful library in an unfamiliar city with the smartest experts in web archiving in the world. I’ve forged some wonderful friendships over the years. While a virtual meeting is not quite the same and I can’t wait until we can meet again in person, I’ve been amazed at how we’ve adapted as a community to an entirely virtual event. In many ways it’s allowed for a richer experience – more people who might not have been able to travel to the conference and meetings can participate, and in my mind that’s always a benefit! I hope we can continue to keep a blend of in person and virtual events in the future. Come join us!

WAC 2022

Registration is now open. Register separately for each day you plan to attend—May 23, 24, and 25 for the WAC, May 17-19 for the General Assembly. View the schedule and abstracts, and learn more about the Conference and GA sessions on the IIPC Website: 2022 Web Archiving Conference!


IIPC General Assembly & Web Archiving Conference 

IIPC GA & WAC collection at UNT Digital Library

2021 Web Archiving Conference: presentations & recordings

Celebrating the 2022 Winter Olympics and Paralympics Web Archive Collection

By Helena Byrne, Curator of Web Archives, British Library


The first IIPC collection focused just on the 2010 Winter Olympics in Vancouver. Since 2012, the IIPC has archived web content on both the Olympic and Paralympic Games. To date, the IIPC has archived seven Games. Beijing 2022 was also the 4th Winter Games collection.

| Collection name | Data archived | Documents |
|---|---|---|
| 2014 Winter Olympics | 1.6 TB | 57,145,052 |
| 2014 Winter Paralympics | 1.3 TB | 42,542,659 |
| 2016 Summer Olympics and Paralympics | 3.1 TB | 18,205,981 |
| 2018 Winter Olympics and Paralympics | 1.2 TB | 12,218,514 |
| 2020 Summer Olympics and Paralympics [held in 2021] | 610.9 GB | 6,923,179 |
| 2022 Winter Olympics and Paralympics | 361.1 GB | 14,410,542 |

You can view the 2022 Winter Olympics and Paralympics collection here:

https://archive-it.org/collections/18422

In this final blog post on the IIPC Content Development Group (CDG) Beijing 2022 Olympic and Paralympic Games web archive collection, we look back at what content was crawled. 

Social media was excluded from the collection policy as these platforms update their code and design frequently and do not prioritise archivability. As a result they present ever-changing technical challenges to archivists, requiring more labour-intensive scoping with less satisfactory results. For that reason, we have not accepted nominations of content from Facebook, Instagram, Twitter, and other similar social media platforms.

What we collected

Crawl dates

There were five crawl dates for this collection. The collection period started in January and finished towards the end of March. A sixth crawl, covering 32 seeds that were missed in the first crawl, was conducted on April 26. This issue was only noticed when preparing the metadata for publishing the collection.

  1. February 02, 2022 (308 seeds crawled)
  2. February 15, 2022 (264 seeds crawled)
  3. February 23, 2022 (65 seeds crawled)
  4. March 07, 2022 (29 seeds crawled)
  5. March 21, 2022 (198 seeds crawled)

We had a steady number of nominations for each crawl date. The only exception was the fourth crawl on March 7th with only 29 seeds crawled. This figure also includes a number of URLs that returned an error in the previous crawl. Nominations to this collection are done on a voluntary basis by members of the IIPC and the public from around the world. 

Countries covered

Athletes from 91 National Olympic Committees (NOCs) competed in Beijing 2022. Haiti and Saudi Arabia made their Winter Olympic debuts at this edition of the Games. 

We received nominations from 38 countries for the IIPC CDG 2022 Winter Olympics and Paralympics collection. Some of these countries might have only one or two websites nominated from them, and there are many more countries that competed in multiple events and have no content nominated. 

Languages covered

We have 24 languages in the collection including French (228 nominations), English (162 nominations) and Japanese (89 nominations). But many languages only have a few nominations and there are many other languages that haven’t been represented in the collection.

Data size 


We have archived 863 seeds out of the 889 seeds that were nominated. These seeds include full websites, subsections of websites and individual web pages in multiple languages from around the world. The 26 nominated seeds that were not archived were social media accounts, so they weren’t added to the crawler. Roughly 54 seeds in total were flagged in the Archive-It crawl reports. These were URLs that, for technical reasons, the crawlers were unable to archive when they visited the seed. These seeds were then assessed and added to the next crawl in the series with some additional techniques used to try and capture them. However, not all of these attempts to recrawl these seeds were successful. Quality assurance was carried out on these 54 seeds and 36 of them were set to private as they displayed no content or just error messages.

We archived 361.1 GB of data and 14,410,542 documents at the end of five crawl cycles. We had initially set aside 1 TB for this collection, but as we weren’t archiving any social media content and implemented a size cap on all seeds, we did not use as much data as expected.

We used the following policy when setting the scope of the crawl:

  1. Full seed host or directory (Example: team or athlete website)
    • These seeds will be capped at 3 GB
  2. Crawl one page only (Example: news article)
    • These seeds will be capped at 1 GB 
  3. Seed page plus 1 click of all links on seed page (Example: news page linking to multiple articles)
    • These seeds will be capped at 2 GB
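
The scoping policy above amounts to a small lookup from scope type to size cap. A minimal sketch of that mapping follows; the `Scope` enum, `SIZE_CAP_GB` table, and `exceeds_cap` helper are names invented for this illustration and are not Archive-It settings.

```python
from enum import Enum


class Scope(Enum):
    """The three seed scopes listed in the collection policy."""
    FULL_HOST = "full seed host or directory"
    ONE_PAGE = "one page only"
    PAGE_PLUS_LINKS = "seed page plus one click of all links"


# Size caps in gigabytes, taken from the policy list above.
SIZE_CAP_GB = {
    Scope.FULL_HOST: 3,
    Scope.ONE_PAGE: 1,
    Scope.PAGE_PLUS_LINKS: 2,
}


def exceeds_cap(scope: Scope, crawled_gb: float) -> bool:
    """Return True when a seed's crawled data is over its scope's cap."""
    return crawled_gb > SIZE_CAP_GB[scope]
```

For example, a news article crawled as `Scope.ONE_PAGE` that reached 1.2 GB would be over its cap, while a team website crawled as `Scope.FULL_HOST` at the same size would not.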

In the 2018 Winter Games collection, we collected 1,413 seeds and used 1.2 TB of data with 12,218,514 documents. However, if we just compare the URL nominations, the 2022 and 2018 collection are quite similar excluding the 557 social media URLs tagged as Blogs & Social Media from the 2018 total. 
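
The comparison can be checked with simple arithmetic using only the figures quoted in this post. The variable names are illustrative.

```python
# 2022 collection: nominated seeds minus the social media accounts not crawled.
nominated_2022 = 889
social_media_2022 = 26
archived_2022 = nominated_2022 - social_media_2022

# 2018 collection: all seeds minus those tagged as Blogs & Social Media.
seeds_2018 = 1413
social_media_2018 = 557
comparable_2018 = seeds_2018 - social_media_2018

difference = archived_2022 - comparable_2018

print(archived_2022)    # 863
print(comparable_2018)  # 856
print(difference)       # 7
```

On this basis the two collections differ by only seven seeds once social media is set aside.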

Related blog posts

Get Involved in Web Archiving the Winter Games – Beijing 2022 

Steeze (Style & Ease) on the Slopes – Web Archiving Beijing 2022

Resources

About IIPC collaborative collections

IIPC CDG updates on the IIPC Blog

The Summer and Winter Olympics and Paralympics Collections in Archive-It

The Summer and Winter Olympics and Paralympics Collections 2010-2020 poster

Despite not collecting social media content, we did promote the call for nominations for this collection on social media channels (mostly Twitter) with the collection hashtag #WAGames2022.

For more information and updates on Content Development Group activities, you can contact the IIPC CDG team at Collaborative-collections@iipc.simplelists.com

Archive of Tomorrow – Capturing online health (mis)information

By Alice Austin, Web Archivist, Archive of Tomorrow

Centre for Research Collections, Main Library, University of Edinburgh


Copyright ©2021 R. Stevens / CREST (CC BY-SA 4.0)

It goes without saying that the Covid-19 pandemic has cast a harsh light across our society and exposed fault lines in a number of areas, not least in the fragility of our information infrastructures. Over the last two years we have seen misinformation spread at a similar speed to the virus, with the consequence that any future attempts to try and examine the medical pandemic as an historical and social phenomenon will also have to reckon with the misinformation pandemic. Government and medical websites have changed on a daily basis as new information emerges, and there has been a massive proliferation of comment on social media and other online platforms about the virus and other health issues. Clinical advice, data and scientific evidence have been contested, revised, used and misused with dramatic and sometimes tragic consequences, and yet the digital record of this is fragile and difficult to access. There have been sustained and laudable efforts to ensure that inaccurate and potentially harmful information is taken down swiftly, with the result that a researcher exploring (e.g.) the emergence of ivermectin as a Covid ‘miracle cure’ might find they come up against a lot of dead ends and 404s.

Goals of the Archive of Tomorrow

In response, the Archive of Tomorrow project hopes to capture an accurate record of how people use the internet to find, share, and discuss health and health-related topics so that current and future researchers can understand public health practices in the digital age. We hope to capture 10,000 targets – ranging from official, ‘approved’ and verified sources, to unofficial, sometimes controversial publications – and to secure access permission for this content to produce a ‘research-ready’ collection. The project is ambitious, not just in its intention to build a useful evidence base of historical web resources but also in the attempt to develop an ethical and meaningful precedent for archiving possible mis- or dis-information. Because it crystallises so many of these issues, COVID is one subject that we’re focusing on in detail, but we’re also looking at capturing other health-related debates such as those that surround reproductive rights, ‘alternative’ medicines, assisted dying, and the use of medical cannabis.

Timeline

Having launched in Feb 2022, the project is still in the early stages of development. It’s being led by the National Library of Scotland with web archivists based in university libraries in Edinburgh, Oxford and Cambridge, and invaluable input from the British Library’s web archiving team. This kind of collaborative working feels very much representative of the Covid-era – it’s hard to imagine a project like this emerging in the days when remote working and Zoom meetings were the exception rather than the norm! We’ll be talking more about the collaborative nature of the project at the IIPC WAC conference in May – and registration is open now!

Selecting ‘health information’

Thinking about how work practices have changed throughout the pandemic brings us to something that has been a challenge for the project team to unravel – how to define the boundaries around ‘health information’ – where it begins and ends, how health relates to other spheres like politics, law, employment and so on. We have to impose boundaries on our collecting, and while some boundaries are legislative or technological (such as the exclusion of broadcast media like podcasts and videos from the collection), some are cultural: for example, to what extent do protests against Covid measures such as masks and lockdowns count as health information? What about artistic responses to the pandemic? And how well are we able to represent health information-seeking behaviours in languages other than English?

Welsh COVID-19 Pandemic guide: what to do and not do. Copyright © 2020 G. Hegasy (CCBY-SA 4.0)

Archivists have long understood that we can’t collect everything – and we don’t try to! As with so much collecting, the challenge lies in how to communicate our selection decisions without dictating the way the archived material is used and encountered. In this case, we’re trying to capture public health discourse and not be part of the conversation ourselves, but we do have a degree of responsibility when considering health mis/dis/information – to what extent should inaccurate, refuted or dangerous content be flagged in the UKWA interface? How do we make such content available responsibly without inserting our perspective into the debates?

Archive of Tomorrow workshop

At this stage we have more questions than answers, and we anticipate that this will continue. The project isn’t designed to solve these problems, but rather, to articulate them in a way that opens the door for future work and solutions. Our first activity towards this goal is the workshop that we’re hosting at the end of the month. We hope that by engaging with current and future researchers with an interest in online information-seeking behaviours or public health we can develop and produce a valuable, research-ready collection that will give real insight into how the internet has been used for health information during the pandemic and beyond.

Migrating to pywb at the National and University Library of Iceland

By Kristinn Sigurðsson, Head of Digital Projects and Development at the National and University Library of Iceland, and Georg Perreiter, Software Developer at the National and University Library of Iceland.


Here at the National and University Library of Iceland (NULI) we have over the last couple of years eagerly awaited each new deliverable of the IIPC funded pywb project, developed by Webrecorder’s Ilya Kreymer. Last year Kristinn wrote a blog post about our adoption of OutbackCDX based on the recommendation from the OpenWayback to pywb transition guide that was a part of the first deliverable. In that post he noted that we’d gotten pywb to run against the index but there were still many issues that were expected to be addressed as the pywb project continued. Now that the project has been completed, we’d like to use this opportunity to share our experience of this transition.

As Kristinn is a member of the IIPC’s Tools Development Portfolio (TDP) – which oversees the project – this was partly an effort on our behalf to help the TDP evaluate the project deliverables. Primarily, however, this was motivated by the need to be able to replace our aging OpenWayback installation.

It is worth noting that prior to this project, we had no experience with using Python based software beyond some personal hobby projects. We were (and are) primarily a “Java shop.” We note this as the same is likely true of many organizations considering this switch. As we’ll describe below, this proved quite manageable despite our limited familiarity with Python.

Get pywb Running

The first obstacle we encountered was related to the required Python version. pywb requires version 3.8 but our production environment, running Red Hat Enterprise Linux (RHEL) 7, defaulted to Python 3.6. So we had to additionally install Python 3.8. We also had to learn how to use a Python virtual environment so we could run pywb in isolation. Then we needed to learn how to resolve site-package conflicts using Python’s package manager (pip3) due to differences between Ubuntu and RHEL.
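A pre-flight check along these lines can catch the version mismatch before installation. This is a minimal sketch, not the procedure used at NULI: the function names meets_minimum and in_virtualenv are hypothetical, and the 3.8 minimum is simply the figure quoted above.

```python
import sys

# Minimum (major, minor) version, taken from the figure quoted in this post.
MINIMUM = (3, 8)


def meets_minimum(version=None, minimum=MINIMUM):
    """Return True if the given version tuple is at least `minimum`.

    `version` defaults to the running interpreter's version.
    Only the major and minor components are compared.
    """
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= tuple(minimum)


def in_virtualenv():
    """Return True when the interpreter is running inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the system installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("Python version OK:", meets_minimum())
    print("Inside a virtual environment:", in_virtualenv())
```

Running the script with the interpreter intended for pywb shows whether that interpreter is new enough and whether it is isolated from the system site-packages.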

Of course, all of that could be avoided if you deploy pywb on a machine with a compatible version of Python or use pywb’s Docker image. Indeed, when we first set up a test instance on a “throwaway” virtual machine, we were able to get pywb up and running against our OutbackCDX in a matter of minutes.

Access Control

Our web archive is open to the world. However, we do need to limit access to a small number of resources. With OpenWayback this has been handled using a plain text exclusion file. We were able to use pywb’s wb-manager command line tool to migrate this file to the JSON based file format that pywb uses. The only issue we ran into was that we needed to strip out empty lines and comments (i.e. lines starting with #) before passing it to this utility.
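The clean-up step described above can be sketched in a few lines of Python. This is an illustration under assumptions, not our actual script: the function name clean_exclusions and the sample entries are hypothetical, and it assumes one exclusion entry per line.

```python
def clean_exclusions(lines):
    """Return exclusion entries with blank lines and '#' comments removed.

    `lines` is any iterable of strings, for example an open text file.
    Surrounding whitespace is stripped from each entry that is kept.
    """
    cleaned = []
    for line in lines:
        entry = line.strip()
        # Skip empty lines and comment lines starting with '#'.
        if not entry or entry.startswith("#"):
            continue
        cleaned.append(entry)
    return cleaned


if __name__ == "__main__":
    # Hypothetical sample resembling a plain text exclusion file.
    sample = [
        "# pages excluded on request\n",
        "\n",
        "http://example.com/private/\n",
        "   \n",
        "example.org/removed.html\n",
    ]
    for entry in clean_exclusions(sample):
        print(entry)
```

The cleaned lines can then be written to a new file and that file passed to the migration utility, leaving the original exclusion file untouched.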

Making pywb Also Speak Icelandic

We want our web archive user interface to be available in both Icelandic and English. When adopting OpenWayback, we ran into issues with such internationalization (i18n) support and ultimately just translated it into Icelandic and abandoned the i18n effort. pywb already supported i18n and further support and documentation of this was one of the elements of the IIPC pywb project. So we very much wanted to take advantage of this and fully support both languages in our pywb installation.

We found the documentation describing this process to be very robust and easy to follow. Following it, we installed pywb’s i18n tool, added an “is” locale and edited the provided CSV file to include Icelandic translations.

Along the way we had a few minor issues with textual elements that were hard coded and translations could not be provided for. This was notably more common in new features being added, as one might expect. We were, in a sense, acting as beta testers of the software, picking up each new update as it came, so this isn’t all that surprising. We reported these omissions as we discovered them and they were quickly addressed.

The only issue that wasn’t (and couldn’t be) addressed ended up relating to a limitation of Chrome. We noticed that our date formatting for Icelandic was working well in both Firefox and Edge, but displayed incorrectly in Chrome. This turned out to be because Chrome does not support Icelandic in JavaScript code like this: new Date().toLocaleDateString("is")

We were able to work around this issue with Chrome by using a German locale as none of the date formatting patterns relied on outputting the names of days or months.

Making pywb Fit In

Here at NULI we have a lot of websites. To help us maintain a “brand” identity, we – to the extent possible – like them to have a consistent look and feel. So, in addition to making pywb speak Icelandic, we wanted it to fit in.

Much like i18n, UI customizations were identified as being important to many IIPC members and additional support for and documentation of that was included in the IIPC pywb project. Following the documentation, we found the customization work to be very straightforward.

You can easily add your own templates and static files or copy and modify the existing ones. As you can always remove your added files, there is no chance of messing anything up.

As you can see on our website, we were able to bring our standard theme to pywb.

Additionally, we added 20 lines of code to frontendapp.py to allow serving of additional, localized, static content fed by an additional template (incl. header and footer) that loads static html files as content. This allowed us to add a few extra web pages to serve our FAQ and some other static content. This was our only “hack” and is, of course, only needed if you want to add static content that is served directly from pywb (as opposed to linking to another web host).

New Calendar & Sparkline and Performance

The final deliverable of the IIPC funded pywb project included the introduction of a new calendar-based search result page and a “sparkline” navigation element into the UI header. These were both features found in OpenWayback and, in our view, the last “missing” functionality in pywb. We were very happy to see these features in pywb but also discovered a performance problem.

Our web archive is by no means the largest one in the world. It is, however, somewhat unique in that it contains some pages with over one hundred thousand copies (yes 100.000+ copies). These mostly come from our RSS-based crawls that capture news sites’ front pages every time there is a new item in the RSS feed. The largest is likely the front page of our state broadcaster (RÚV) with 159.043 captures available as we write this (with probably another thousand or so waiting to be indexed).

The initial version of the calendar and sparkline struggled with these URLs. After we reported the issue, some improvements were made involving additional caching and “loading” animations so users would know the system was busy instead of just seeing a blank screen. This improved matters notably, but pywb’s performance under these circumstances could stand further improvement.

We recognize that our web archive is somewhat unusual in having material like this. However, as time goes on, archives with a high number of captures of the same pages will only increase, so this is worth considering in the future.

Final Thoughts

We’ve been very pleased with this migration process. In particular we’d like to commend the Webrecorder team for the excellent documentation now available for pywb. We’d also like to acknowledge all testing and vetting of the IIPC pywb project deliverables that Lauren Ko (UNT and member of the IIPC TDP) did alongside – and often ahead of – us.

We can also reaffirm Kristinn’s recommendation from last year to use OutbackCDX as a backend for pywb (and OpenWayback). Having a single OutbackCDX instance powering both our OpenWayback and pywb installations notably simplified the setup of pywb and ensured we only had one index to update.

We still have pywb in a public “beta” – in part due to the performance issues discussed above – while we continue to test it. But we expect it will replace OpenWayback as our main replay interface at some point this year.

Steeze (Style & Ease) on the Slopes – Web Archiving Beijing 2022

By Helena Byrne, Curator Web Archives, British Library

The Winter Olympics may be over but the IIPC Content Development Group Beijing 2022 collection is still running until March 20th, 2022. The Winter Paralympics got underway on March 4th and we would love to see your nominations for this edition of the Games. 

In our first blog post Get Involved in Web Archiving the Winter Games – Beijing 2022, we outlined details of what and how to nominate. Once you have selected the web pages you would like to see in the collection, it takes less than 5 minutes to fill in the submission form:

https://bit.ly/CDG-2022Games-collection-public-nominations 

What we have collected so far

We have archived 616 nominated seeds so far. These nominations include full websites, subsections of websites and individual web pages in multiple languages from around the world. We have archived 280.3GB of data and 13,197,402 documents at the end of three crawl cycles. 

Screenshot of total data archived on Beijing 2022 collection. Figures in paragraph above.

Social media policy

Social media was excluded from the collection policy as these platforms update their code and design frequently and do not prioritise archivability. As a result they present ever-changing technical challenges to archivists, requiring more labour-intensive scoping with less satisfactory results. For that reason, we have not accepted nominations of content from Facebook, Instagram, Twitter, and other similar social media platforms.

Map of the world

Qualified athletes from 91 National Olympic Committees (NOCs) competed in Beijing 2022. Only 29 of these NOCs received a medal. Athletes competed across 109 events over 15 disciplines in seven sports. Haiti and Saudi Arabia made their Winter Olympic debuts at this edition of the Games. So far, nominations have been received for the IIPC CDG 2022 Winter Olympics and Paralympics collection from 27 countries. Some of these countries might have only one or two websites nominated from them, and there are many more countries that competed and have no content nominated such as Austria (106 athletes), Sweden (116 athletes) and Ukraine (45 athletes).

Languages covered

We have 24 languages in the collection including French (200 nominations), Japanese (91 nominations) and English (79 nominations). But many languages only have a few nominations and there are many other languages that haven’t been represented in the collection. We would like to see more nominations in multiple languages, especially Chinese (4 nominations) and Russian (1 nomination).

How to get involved

Once you have selected the web pages you would like to see in the collection, add them to the submission form: https://bit.ly/CDG-2022Games-collection-public-nominations

If you know anyone who may be interested in contributing to this collection, please share the link with them! The call for nominations will close on March 20, 2022.

For more information and updates, you can contact the IIPC CDG team at Collaborative-collections@iipc.simplelists.com or follow the collection hashtag #WAGames2022.

IIPC Chair Address 2022

By Kristinn Sigurðsson, Head of Digital Projects and Development
at the National and University Library of Iceland
and the IIPC Chair 2022-2023


Hi all,

It is my pleasure to serve as Chair of the IIPC Steering Committee and the Executive Board for 2022. I’ve been a part of this community in one way or another since the start. I’ve often stated that without our involvement in the IIPC, the Icelandic web archive here at the National and University Library of Iceland would be a shadow of its current self.

An Organization for All Seasons

Despite the challenges of the pandemic, I’m very happy that the IIPC has been able to maintain an extensive and ambitious program over the last couple of years. This was very important as, in the past, the IIPC’s in-person events – notably our General Assembly (GA) and Web Archiving Conference (WAC) – have been at the core of IIPC activity.

The IIPC community is incredibly positive, energetic and creative, and this is always on full display when we meet in person. Over the years we’ve tried very hard to sustain the energy and shared feeling of purpose throughout the year. Even before the pandemic, we made sure the community could meet online during our regular technical calls, webinars and workshops.

These earlier efforts left us with a foundation to build upon, and I’m very happy with what we have been able to deliver. We have had even more events these past two years, including our first online conference co-hosted by the National Library of Luxembourg. My deepest thanks to Olga and everyone who helped make that possible.

Unfortunately, it seems we face another year without any significant in-person events. Rest assured, however, that we will continue with our program of meetings, workshops, members updates, webinars featuring web archiving initiatives (including the IIPC funded projects)  and, of course, a virtual GA and conference in late May.

This transformation from an organization that sometimes seemed to disappear in the off season to one that is active year round has been very satisfying from my perspective. This “disappearance” was always a bit of an illusion, as there was always some work and collaboration going on, but the added visibility and opportunities for engagement have been crucial.

As we look forward to resuming in-person events next year (fingers crossed), it is important that we do not forget any of the lessons we have learned from this. It is important that we do not simply go back to things as they were, but that we retain this online aspect of the organization all year round. With that in mind, much of this year we will be looking to establish a more predictable and consistent schedule of events, allowing our progress to carry on into future years.

Of course, all of this has required a fair amount of work and will continue to do so, bringing me to our next topic.

Reinforcement

As the IIPC has expanded and matured, the duties and responsibilities of our sole employee have grown considerably. Recognizing that this had reached an unsustainable point, last year the Steering Committee authorized the hiring of one additional full-time staff member for the new role of Administrative Officer. The Administrative Officer will take some of the more routine administrative duties off our Senior Program Officer’s plate and support her as needed.

The position was advertised in November and prospective candidates were interviewed in late December and early January. After concluding our search, we hired Kelsey Socha as the new Administrative Officer for the IIPC. Kelsey Socha holds a BFA in Theatrical Design and Production from the University of Michigan, and an MS in Library and Information Science from Simmons University. She has served in a variety of library roles, most recently as Head of Adult Services for the Westfield Athenaeum in Westfield, Massachusetts. She began work for the IIPC on February 9th. Both the Administrative Officer and the Senior Program Officer roles are hosted by the Council on Library and Information Resources (CLIR).

With this reinforcement, I feel confident that we will be able to rise to the ambitious schedule I discussed earlier.

Executive Board

Two years ago, we revised and renewed our Consortium Agreement, which among other changes introduced an Executive Board (EB), composed of the Chair, Vice-Chair, Treasurer, Senior Program Officer and, optionally, up to two other Steering Committee members. Aside from the Senior Program Officer, appointments are for 1 year. The EB was set up to create a smaller and more responsive body to manage the practical aspects of running the IIPC and to liaise with CLIR, our financial and administrative host. The first Board started work in January 2021 and this new setup has made our governance more agile. I would like to take this opportunity to thank Sylvain Bélanger of Library and Archives Canada, who served as IIPC Treasurer, and to welcome Ian Cooke of the British Library who has taken on this role. My long-term IIPC colleague Abbie Grotke has volunteered to serve as the Vice-Chair. My thanks also go to Hansueli Locher of Swiss National Library, who served on the EB last year.

The Steering Committee will continue to focus on our longer-term policy as well as oversight, with members serving on our three Portfolios and being actively involved in the areas outlined in our Strategic Plan.

New Steering Committee Members

I would also like to take this opportunity to welcome the two newly elected SC members, Bjarne Andersen on behalf of the Royal Danish Library and Tobias Steinke on behalf of the German National Library.

Each year, about one-third of the fifteen SC seats are up for reelection. I’ve been pleased to observe over the last few years that these elections have gotten more competitive with more members seeking to serve. As the active involvement of our members is vital to the IIPC’s long-term success, I’m confident that this is a positive sign.

Tools

Just recently, a project commissioned by our Tools Development Portfolio (TDP) to improve the open source web archive replay tool PyWb was completed. I wrote about this project back when it was just starting (read post). Now, Ilya Kreymer, PyWb’s developer, has delivered the last of the work agreed upon. There will be more in-depth posts about this soon, and you can also find all the blog posts on interim work here.

This PyWb project is the first funded development project to be managed by the TDP and overall, I’m very pleased with the outcome. I would like to take this opportunity to thank the other members of the TDP, Lauren Ko, Alex Osborne and Youssef Eldakar, for all their hard work. Even our funded projects still depend on a fair amount of volunteer effort. Partly based on the experience from this project, the TDP is currently working on another project with Ilya Kreymer, this time focused on browser-based crawling.

Browser-based crawling was identified as a key capability that is largely lacking in our tool suite at an online Tools Workshop with IIPC members held in June 2021. Based on discussions there, several of our member organizations put together a project plan to address this. The project will involve notable extensions to the Webrecorder software to facilitate better browser-based crawling. Unlike the PyWb project, however, this project also includes considerable commitments on behalf of 4 member organizations (British Library, National Library of New Zealand, Royal Danish Library and University of North Texas) that are not funded by the IIPC.

This project is expected to last two years, and we plan to keep you informed throughout. Members interested in participating should keep an eye out for upcoming announcements and posts here.

20th Anniversary

Next year will be the IIPC’s 20th anniversary. A lot has changed since the original 12 members signed the first Consortium Agreement back in 2003. The fact that such a milestone looks to also align with a return to in-person events after three years without them gives us even more cause to strive for the best General Assembly and Web Archiving Conference ever (as if we ever aimed lower).

Even as we work on the substantial online slate of events for 2022 that I mentioned above, work has already begun on this return to normalcy. You can be part of this too. Keep an eye out for the call for a hosting institution, participation in our program committee, and the call for papers.

Lastly, please feel free to reach out to me if you have any questions or concerns during my time as Chair.

Kristinn Sigurðsson,
Head of Digital Projects and Development at the National and University Library of Iceland
IIPC Chair 2022-2023

IIPC – Meet the Officers, 2022

The IIPC is governed by the Steering Committee, formed of representatives from fifteen Member Institutions who are each elected for three-year terms. The Steering Committee designates the Chair, the Vice-Chair and the Treasurer of the Consortium. Together with the Senior Program Officer, based at Council on Library and Information Resources (CLIR), the Officers make up the Executive Board and are responsible for dealing with the day-to-day business of running the IIPC.

The Steering Committee has designated Kristinn Sigurðsson of National and University Library of Iceland to serve as Chair, Abbie Grotke of the Library of Congress to serve as Vice-Chair in 2022, and Ian Cooke of the British Library to serve as the IIPC Treasurer. Olga Holownia continues as Senior Programme Officer, and CLIR remains the Consortium’s financial and administrative host.


IIPC CHAIR

Kristinn Sigurðsson is Head of Digital Projects and Development at the National and University Library of Iceland. He joined the library in 2003 as a software developer. Over the years he has worked on a multitude of projects related to the acquisition, preservation and presentation of digital content, as well as the digital reproduction of physical media. This includes leading the buildup of the library’s legal deposit web archive – that now contains nearly 4 billion items – as well as its very popular newspaper/magazine website.

He has also been very active within the IIPC and related web archiving collaboration. This includes working on the first version of the Heritrix crawler in 2003-4 (and on and off since). In 2010 he joined the IIPC Steering Committee as well as taking over as co-lead of the Harvesting Working Group. More recently he has served as the Lead of the Tools Development Portfolio.

IIPC VICE-CHAIR

Abbie Grotke is Assistant Head, Digital Content Management Section, within the Digital Services Directorate of the Library of Congress, and leads the Web Archiving Team. Since 2002 she has been involved in the Library’s web archiving program. In her role, Grotke has helped develop policies, workflows, and tools to collect and preserve web content for the Library’s collections and provides overall program management for web archiving at the Library. She has been active in a number of collaborative web archive collections and initiatives, including the U.S. End of Term Government Web Archive, and the U.S. Federal Government Web Archiving Interest Group.

Since the Library of Congress joined the IIPC as a founding member in 2003, Abbie has served in a variety of roles and on a number of working groups, task forces, and committees. She spent a number of years as Communications Officer, and was a member of the Access Working Group. More recently, she has served as co-leader of the Content Development and Training Working Groups and of the Membership Engagement Portfolio, and served as Chair in 2021. She has been a member of the Steering Committee since 2013.

IIPC TREASURER

Ian Cooke leads the Contemporary British Publications team at the British Library, which is responsible for curation of 21st century publications from the UK and Ireland. This includes the curatorial team for the UK Web Archive, as well as digital maps, emerging formats and print and digital publications ranging from small press and artists books to the latest literary blockbusters. Ian joined the British Library’s Social Sciences team in 2007, having previously worked in academic and research libraries, taking up his current role in 2015.

Ian has been a member of the IIPC Steering Committee and has worked on strategy development for the IIPC. The British Library was the host for the Programmes and Communications role up to April 2021.

Get Involved in Web Archiving the Winter Games – Beijing 2022

By Helena Byrne, Curator Web Archives, British Library

It’s that time of year again when we all become winter sports experts while watching the Winter Olympic and Paralympic Games on TV. There are even helpful guides online to all the slang used on the slopes like steeze (style and ease), pow (fresh ski powder) and sendy (to ride it with full vigour).

The ongoing pandemic has caused havoc on sporting schedules since the start, but the Beijing 2022 Winter Games are going ahead as scheduled. This means that the IIPC Content Development Group (CDG) will again be collecting web content related to the Games from across the world in multiple languages.

10 years of collecting the Games

2020 marked ten years of archiving the Games. The first collection focused just on the 2010 Winter Olympics in Vancouver. But since 2012 the IIPC has archived web content on both the Olympic and Paralympic Games. To date the IIPC has archived six Games. Beijing 2022 will be the 4th Winter Games collection.

Helena Byrne: Going for Gold: Web Archiving the Olympics & Paralympics Games, poster presented at IIPC Web Archiving Conference 2021.

Previous CDG Olympic and Paralympic collections have focused on events both on and off the field/slopes. Key themes have included doping, corruption and Zika Virus as well as Covid-19 in the 2020 Summer Games. This year will be no different as Covid-19 is still a big issue and the human rights issues in China have meant that some nations like the USA, UK and Australia will not be sending any diplomatic representatives.

What we want to collect

Public platforms in various formats such as:

  • Websites
  • Subsections of websites with an Olympic tag
  • Individual Articles
  • News Reports
  • Blogs
  • Audio visual content

Social media platforms update their code and design frequently and do not prioritize archivability, and as a result present ever-changing technical challenges to archivists, requiring more labour-intensive scoping with less satisfactory results. For that reason we will not be accepting nominations from Facebook, Instagram, or Twitter, nor from other similar social media platforms.

The subjects covered on these sites can include but are not limited to:

  • Athletes/Teams
  • Computer Games (eGames)
  • Covid-19
  • Diplomatic Relations with China
  • Doping/Cheating and Corruption
  • Environmental Issues
  • Fandom
  • Gender Issues (Ex. media coverage, sexual harassment etc.)
  • General News/ Commentary
  • Human Rights issues
  • Olympic/Paralympic Venues
  • Security
  • Sports Events
  • Other

How to get involved

Once you have selected the web pages you would like to see in the collection, it takes less than 5 minutes to fill in the submission form:
https://bit.ly/CDG-2022Games-collection-public-nominations 

The call for nominations will close on March 20, 2022.

For more information and updates, you can contact the IIPC CDG team at Collaborative-collections@iipc.simplelists.com or follow the collection hashtag #WAGames2022.