Game Walkthroughs and Web Archiving Project Update: Adding Replay Sessions, Performance Results Summary, and Web Archiving Tournament Option to Web Archiving Livestreams

“Game Walkthroughs and Web Archiving” was awarded a grant in the 2021-2022 round of the Discretionary Funding Programme (DFP), the aim of which is to support the collaborative activities of the IIPC members by providing funding to accelerate the preservation and accessibility of the web. The project lead is Michael L. Nelson from the Department of Computer Science at Old Dominion University. Los Alamos National Laboratory Research Library is a project partner. You can learn more about this DFP-funded project at our Dec. 14th IIPC RSS Webinar: Game Walkthroughs and Web Archiving, where Travis Reid will be presenting on his research in greater detail. 


By Travis Reid, Ph.D. student at Old Dominion University (ODU), Michael L. Nelson, Professor in the Computer Science Department at ODU, and Michele C. Weigle, Professor in the Computer Science Department at ODU

The Game Walkthroughs and Web Archiving project focuses on integrating video games with web archiving and applying gaming concepts like speedruns to the web archiving process. We have recently updated the project by adding a replay mode and a results mode, and by making it possible to hold a web archiving tournament during a livestream.

Replay Mode

Replay mode (Figure 1) shows the web pages that were archived during the web archiving livestream and compares them to the live web page. During replay mode, the live web page is shown beside the archived web pages associated with each crawler. The web archiving livestream script scrolls the live web page and the archived web pages so that viewers can see the differences between the live web page and the recently archived copies. In the future, when the web archiving livestream supports WARC files from crawls that were not performed recently, we will compare the archived web pages from the WARC file with a memento from a web archive like the Wayback Machine or Arquivo.pt instead of comparing them against the live web page. For replay mode, we are currently using Webrecorder’s ReplayWeb.page.

Replay_Mode
Figure 1: Replay mode
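To illustrate the scrolling step, below is a minimal Selenium sketch that loads a live page and an archived copy side by side and scrolls both in lockstep. The URLs, window layout, and scroll step are assumptions for illustration, not the livestream script’s actual configuration.

```python
# Minimal sketch: scroll a live page and a replayed copy in lockstep (Selenium).
# The URLs, window sizes, and scroll step below are illustrative assumptions.
import time
from selenium import webdriver

live_url = "https://example.com/"                     # live web page (assumed)
replay_url = "http://localhost:8080/replay/example/"  # locally served archived copy (assumed)

live = webdriver.Chrome()
replay = webdriver.Chrome()
live.set_window_rect(x=0, y=0, width=960, height=1080)
replay.set_window_rect(x=960, y=0, width=960, height=1080)

live.get(live_url)
replay.get(replay_url)

# Scroll both pages together so viewers can spot differences visually.
for offset in range(0, 5000, 250):
    live.execute_script(f"window.scrollTo(0, {offset});")
    replay.execute_script(f"window.scrollTo(0, {offset});")
    time.sleep(1)

live.quit()
replay.quit()
```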

Replay mode will have an option for viewing annotations created by the web archiving livestream script for the missing resources that were detected (Figure 2). The annotation option makes the web archiving livestream more like a human-driven livestream, where the streamer would mention potential reasons why a certain embedded resource is not replaying properly. It also lets replay mode show more than web pages being scrolled, by providing information about the elements on a web page that are associated with missing embedded resources. There will also be an option for printing an output file that contains the annotation information created by the web archiving livestream script for the missing embedded resources. For each detected missing resource, this file will include the URI-R of the missing resource, the HTTP response status code, the element associated with the resource, and the HTML attribute from which the resource’s URI-R was extracted.

Annotation_option
Figure 2: During replay sessions, there will be an option for automated annotation for the missing resources found on the web page
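As a sketch of what that output file could look like, the snippet below writes one JSON line per detected missing resource with the four fields listed above. The field names, file name, and example values are assumptions for illustration, not the script’s actual format.

```python
# Sketch: write one annotation record per detected missing resource.
# Field names, the file name, and the example record are illustrative.
import json

missing_resources = [
    {
        "urir": "https://example.com/images/banner.png",  # hypothetical missing resource
        "status": 404,                                     # HTTP response status code
        "element": "img",                                  # element associated with the resource
        "attribute": "src",                                # attribute the URI-R was extracted from
    },
]

with open("missing_resource_annotations.jsonl", "w") as out:
    for record in missing_resources:
        out.write(json.dumps(record) + "\n")
```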

Results Mode

We have added a results mode (Figure 3) to the web archiving livestream so that viewers can see a summary of the web archiving and replay performance results. This mode is also used to compute the score for each crawler so that we can determine which crawler has won the current round based on archiving and replay performance. The performance metrics used during results mode are retrieved from the performance result file that is generated after the web archiving and replay sessions. Currently, this file includes the number of web pages archived by the crawler during the competition (the number of seed URIs), the crawler’s speedrun completion time, the number of resources in the CDXJ file with an HTTP response status code of 404, the number of archived resources categorized by file type (e.g., HTML, image, video, audio, CSS, JavaScript, JSON, XML, PDF, and fonts), and the number of missing resources categorized by file type. The metrics we are currently using for determining missing and archived resources are temporary and will be replaced with a replay performance metric calculated by the memento damage service. The temporary metrics are calculated by going through a CDXJ file and counting the resources with a 200 status code (archived resources) and the resources with a 404 status code (missing resources). Results mode will let viewers access the performance results file for the round by showing a link or QR code for a web page that dynamically generates the performance results from the current round and allows the viewers to download the file. That web page will also have a button that navigates viewers to the video timestamp URL for the start of the round, so that viewers who recently joined the livestream can go back and watch the archiving and replay sessions for the current round.

results_mode_SS
Figure 3: Results mode, showing the first performance metric, the speedrun time
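A minimal sketch of that temporary counting metric is shown below. It assumes a CDXJ layout of "urlkey timestamp {json}" with a "status" field in the JSON block; the file name is illustrative and real CDXJ output may need extra handling.

```python
# Sketch: count archived (200) and missing (404) resources in a CDXJ file.
import json

archived, missing = 0, 0
with open("crawl-index.cdxj") as cdxj:          # hypothetical index file
    for line in cdxj:
        try:
            record = json.loads(line.split(" ", 2)[2])
        except (IndexError, ValueError):
            continue                             # skip malformed or header lines
        status = str(record.get("status", ""))
        if status == "200":
            archived += 1
        elif status == "404":
            missing += 1

print(f"archived resources: {archived}, missing resources: {missing}")
```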

Web Archiving Tournaments

A concept that we recently applied to our web archiving livestreams is the web archiving tournament. A web archiving tournament is a competition between four or more crawlers. The web archiving tournaments are currently single elimination tournaments, similar to the NFL, NCAA college basketball, and MLS Cup playoffs, where a team is eliminated from the tournament after losing a single game. Figure 4 shows an example of teams progressing through our tournament bracket. In each match of a web archiving tournament, two crawlers compete against each other. Each crawler is given the same set of URIs to archive, and the set of URIs is different for each match. Viewers can watch the web archiving and replay sessions for each match. After the replay session is finished, the viewers see a summary of the web archiving and replay performance results and how the score for each crawler is computed. The crawler with the highest score wins the match and progresses further in the web archiving tournament. When a crawler loses a match, it cannot compete in any future matches in the current tournament. The winner of the web archiving tournament is the crawler that has won every match it participated in during the tournament. The web archiving tournament will be updated in the future to support other tournament formats, like a double elimination tournament, where a team must lose twice before it is eliminated; a round robin tournament, where each team plays every other team an equal number of times; or a combination like the FIFA World Cup, which uses round robin for the group stage and single elimination for the knockout phase.

Progressing_Through_Tournament_Bracket
Figure 4: Example of teams progressing through our tournament bracket (in this example, the scores are randomly generated)
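For illustration, here is a small sketch of how single-elimination progression works: each round pairs the remaining crawlers, the higher-scoring crawler advances, and the loser is out. The crawler names are just examples, score_match() is a hypothetical stand-in for a full archiving and replay round, and the field is assumed to be a power of two.

```python
# Sketch: single-elimination bracket progression between crawlers.
import random

def score_match(crawler_a, crawler_b):
    # Placeholder scores; in the livestream these come from the archiving
    # and replay performance results for the two crawlers.
    return random.random(), random.random()

def run_tournament(crawlers):
    round_no = 1
    while len(crawlers) > 1:
        winners = []
        for a, b in zip(crawlers[::2], crawlers[1::2]):  # assumes a power-of-two field
            score_a, score_b = score_match(a, b)
            winners.append(a if score_a >= score_b else b)
        print(f"Round {round_no} winners: {winners}")
        crawlers = winners
        round_no += 1
    return crawlers[0]

champion = run_tournament(["Brozzler", "Browsertrix Crawler", "Squidwarc", "WARCreate"])
print("Tournament winner:", champion)
```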

Future work

We will apply more gaming concepts to our web archiving livestreams, like having tag-team matches and a single player mode. For a tag-team match, we would have multiple crawlers working together on the same team when archiving a set of URIs. For a single player mode, we could allow the streamer or viewers to select one crawler to use when playing a level.

We are accepting suggestions for video games to integrate with our web archiving livestreams and show during our gaming livestreams. The game must have a mode where we can watch automated gameplay that uses bots (computer players), and there needs to be bot customization that can improve a bot’s skill level, stats, or abilities. Call of Duty: Vanguard is an example of a game that can be used during our gaming livestream. In a custom match for Call of Duty: Vanguard, the skill level can be changed individually for each bot, and we can change the number of players added to each team (Figure 5). This game also has other team customization options (Figure 6) that are recommended for games used during our gaming livestream but are not required, such as being able to change the name of a team and choose the team colors. Call of Duty: Vanguard also has a spectator mode named CoDCaster (Figure 7) where we can watch a match between the bots.

Player_Customization_Annotated
Figure 5: Player customization must allow bot skill levels to be changed individually or provide abilities that can give one bot an advantage over other bots

Ideal_Team_Customization_Settings_Annotated
Figure 6: Example of team customization that is preferred, but optional, for team-based games used during gaming livestreams
spectator_mode
Figure 7: The game must have a spectator option so that we can watch the automated gameplay

An example of a game that will not be used during our gaming livestream is Rocket League. When creating a custom match in Rocket League it is not possible to make one bot have better stats or skills than the other bots in a match. The skill level for the bots in Rocket League is applied to all bots and cannot be individually set for each bot (Figure 8).

Rocket_League_Bot_Difficulty
Figure 8: Rocket League will not be used during our automated gaming livestreams, because their “Bot Difficulty” setting applies the same skill level to all bots

A single player game like Pac-Man also cannot be played during our automated gaming livestream, because a human player is needed in order to play the game (Figure 9). If there are any games that you would like to see during our gaming livestream where we can spectate the gameplay of computer players, then you can use this Google Form to suggest the game.

Pacman_Reduced
Figure 9: Single player games that require a human player to play the game like Pac-Man cannot be used during our automated gaming livestream

Summary

Our recent updates to the web archiving livestreams add a replay mode, a results mode, and an option for holding a web archiving tournament. Replay mode allows viewers to watch the replay of the web pages that were archived during the web archiving livestream. Results mode shows a summary of the web archiving and replay performance results that were measured during the livestream and shows the match scores for the crawlers. The web archiving tournament option allows us to run a competition between four or more web archive crawlers and to determine which crawler performed the best during the livestream.

If you have any questions or feedback, you can email Travis Reid at treid003@odu.edu.

Browser-based crawling system for all

“Browser-based crawling for all” is one of IIPC’s 2022 funded projects. The lead developer is Ilya Kreymer of Webrecorder. The project is supported by the Tools Development Portfolio (TDP) and led by four member institutions (the British Library, the National Library of New Zealand, the Royal Danish Library, and the University of North Texas), who contribute their technical expertise and staff time towards the development of Browsertrix Cloud. The project, comprising four packages, commenced in January 2022. Much of Browsertrix’s early development was tested and refined thanks to Webrecorder’s collaboration with SUCHO in efforts to capture Ukrainian cultural heritage websites. IIPC members have been testing Browsertrix Cloud and offering feedback throughout the year, with curators attending early September workshops focused on improving the UI and meeting one-on-one with Browsertrix’s new UI/UX designer, Henry Wilkinson, to discuss feedback in greater detail. Ilya also gave a demonstration of Browsertrix Cloud to IIPC members during the 2022 General Assembly Tools Meeting.


By Ilya Kreymer, Webrecorder.net

As part of the IIPC-funded project “Browser-based crawling for all”, Webrecorder has been working in collaboration with IIPC Members, led by the British Library, National Library of New Zealand, Royal Danish Library, and University of North Texas to test Browsertrix Cloud as it is being developed.

Browsertrix Cloud provides a fully integrated system for Webrecorder’s open-source, high-fidelity, browser-based crawling system, Browsertrix Crawler. It is designed to allow curators to create, manage, and replay high-fidelity web archive crawls through an easy-to-use interface.

You can read more about Browsertrix Cloud at https://browsertrix.cloud/

At present, a dedicated IIPC cluster of Browsertrix Cloud has been deployed and made available to users from all member institutions. This cloud cluster is deployed using Digital Ocean in a European data center. Users are able to create and configure high-fidelity crawls and watch them archive web pages in real time. Browsertrix Cloud also allows users to create browser profiles and crawl sites that require logins, making it one of the only tools to date with this capability.

One of the key goals of this project is to enable institutions to deploy the system both locally and in the cloud. We are currently working on documentation outlining the procedure for deploying Browsertrix Cloud on Digital Ocean, AWS and a single machine.

Thus far, we have collected feedback from many institutions and are working on new features, including the ability to update the crawl queue once the crawl has started, and improvements to our logging capabilities. IIPC users have provided us with valuable feedback after testing the service and we hope to receive more as development continues.

We are focusing on further improving the UX of Browsertrix Cloud to make complex crawl-related tasks as simple as possible. After adding crawl exclusion management, we are looking at simplifying the crawl configuration and browser profile screens, adding additional logging information, and eventually adding support for additional organizational features.

Please reach out if you would like additional accounts to test Browsertrix Cloud or have additional questions or feedback.

IIPC Steering Committee Election 2022 Results

The 2022 Steering Committee Election closed on Saturday, 15 October. The following IIPC member institutions have been elected to serve on the Steering Committee for a term commencing 1 January 2023:

We would like to thank all members who took part in the election either by nominating themselves or by taking the time to vote. Congratulations to the re-elected Steering Committee Members!

IIPC Steering Committee Election 2022: nomination statements

The Steering Committee, composed of no more than fifteen Member Institutions, provides oversight of the Consortium and defines and oversees its strategy. This year five seats are up for election/re-election. In response to the call for nominations to serve on the IIPC Steering Committee for a three-year term commencing 1 January 2023, 6 IIPC member organisations have put themselves forward:

The election will be held from 15 September to 15 October. The IIPC designated representatives from all member organisations will receive an email with instructions on how to vote. Each member will be asked to cast five votes. The representatives should ensure that they read all of the nomination statements before casting their votes. The results of the vote will be announced on the Netpreserve blog and the Members mailing list on 18 October. The first Steering Committee meeting in 2023 will be held online.

If you have any questions, please contact the IIPC Senior Program Officer.


Nomination statements in alphabetical order:

Internet Archive

Internet Archive seeks to continue its role on the IIPC Steering Committee. As the oldest and largest publicly-available web archive in the world, a founding member of the IIPC, and a creator of many of the core technologies used in web archiving, Internet Archive plays a key role in fostering broad community participation in preserving and providing access to the web-published records that document our shared cultural heritage. Internet Archive has long served on the Steering Committee, including as Chair, and has helped establish IIPC’s relationship with CLIR, the Discretionary Funding Program, the IIPC Training Program, and other initiatives. By continuing on the Steering Committee, Internet Archive will advance these and similar programs to expand and diversify IIPC membership, further knowledge sharing and skills development, ensure the impact and sustainability of the organization, and help build collaborative frameworks that allow members to work together. The web can only be preserved through broad-based, multi-institutional efforts. Internet Archive looks to continue its role on the Steering Committee in order to bring our capacity and expertise to support both the mission of the IIPC and the shared mission of the larger community working to preserve and provide access to the archived web.

Landsbókasafn Íslands – Háskólabókasafn / National and University Library of Iceland

The National and University Library of Iceland is interested in serving another term on the IIPC Steering Committee. The library has had an active web archiving effort for nearly two decades. Our participation in the IIPC has been instrumental in its success.

As one of the IIPC‘s smaller members, we are keenly aware of the importance of collaboration to this specialized endeavor. The knowledge and tools that this community has given us access to are priceless.

We believe that in this community, active engagement ultimately brings the greatest rewards. As such, we have participated in projects, including Heritrix, OpenWayback and, most recently, PyWb. We have hosted several IIPC events, including the 2016 GA/WAC. We have also provided leadership in various areas including in working groups and the tools development portfolio, and our SC representative currently serves as the IIPC’s Steering Committee Chair.

If re-elected to the SC, we will aim to continue on in the same spirit.

Library of Congress

The Library of Congress (LC) has been involved in web archiving for over 22 years and is a founding member of the IIPC. LC has worked collaboratively with international organizations on collections, tools and workflows, while developing in-house expertise enabling the collection and management at scale of over 3.3 petabytes of web content. LC has served in a variety of IIPC leadership roles, currently as vice-chair of the IIPC, and as chair in 2021. Roles also include Membership Engagement portfolio lead and Training WG co-chair. Staff participate actively in a variety of technical discussions, workshops, working groups, and member calls, and in 2022 LC co-hosted the WAC/GA. As a Steering Committee member, LC helped secure a new fiscal agent and helped hire and onboard new IIPC staff. If re-elected, we will continue to focus on increasing engagement of all members, enabling use of member benefits by all members. We will continue to actively participate in discussions around the best use of IIPC funding to support staff, projects, and events that will enable us all to work more efficiently and collaboratively as a community and to help strengthen ties to the researcher and wider web archiving community in the coming years.

National Library of Australia

The National Library of Australia was a founding IIPC member and Steering Committee member until 2009, hosting the second general assembly in Canberra in 2008. The NLA was re-elected in 2019 and filled the vice-chair role in 2020. Long engagement with the international web archiving community includes organizing one of the first major international conferences on web archiving in 2004. The NLA is currently active within the IIPC Tools Portfolio and was involved with two recent IIPC discretionary funded projects.

The NLA’s strengths include experience, operational maturity and a pragmatic approach to web archiving. Its web archiving program, established in 1996, embraces selective, domain and bulk collecting methods and now holds around 700 TB, or 15 billion URL snapshots. With a self-described ‘radical incrementalism’ approach, the NLA has a record of agile innovation, from building the first selective web archiving workflow system to the ‘outbackCDX’ tool providing efficiency for managing CDX indexes. The NLA is committed to open access, maintaining the entire Australian Web Archive as fully accessible and searchable through the Trove discovery service. In seeking re-election, the NLA aims to offer the Steering Committee long web archiving experience, proven practical engagement, and a unique Australasian, Southern Hemisphere perspective.

National Library of New Zealand / Te Puna Mātauranga o Aotearoa

National Library of New Zealand has been an IIPC member since 2007, started web archiving in 1999, and appointed a dedicated web archiving role in 2017. The Library’s recent IIPC activities include:

  • Host of the 2018 IIPC GA/WAC and the ‘What do researchers want’ workshop; Member of the 2022 WAC Programme Committee
  • Co-chair of the Research WG; Participation in OHSOS sessions, the IIPC Strategic Direction Group, the CDG’s collaborative collections, and social media collecting webinars
  • Project partner on ‘Asking questions with web archives’ and ‘LinkGate’; A project lead on ‘Browser-based crawling system for all’

The Library also co-develops the open source Web Curator Tool with the National Library of the Netherlands and shares updates at IIPC conferences.

The Library’s current web archiving priorities align closely with the IIPC Strategic Plan 2021-2025:

  • Full-text search and improved access to our web archives
  • Policies that allow us to provide greater access to our web archives
  • Social media collecting

Our experimentation in these areas helps the IIPC achieve its strategic objectives, by demonstrating to other IIPC member organisations how to build capacity in these areas, and by collaborating with other IIPC members in these areas.

University of North Texas Libraries

The University of North Texas (UNT) Libraries expresses its interest in being elected to the IIPC Steering Committee. As a library that serves a population of 40,000+ students and faculty, we are committed to providing a wide range of resources and services to our users. Of these services we feel that the preservation of and access to Web archives is an important component.

The UNT Libraries has been a member of the IIPC since 2007 and has served in several capacities, including previous terms on the Steering Committee. Recently, members of the UNT Libraries have worked as co-chairs of the Tools Development Portfolio, on the Partnership and Outreach Portfolio, on the Discretionary Funding Program selection committee, on the WAC program committee, and as co-lead on the Browser-Based Crawler project.

The UNT Libraries is interested in helping the IIPC move forward into the future. We have an interest in representing the unique needs and concerns of research libraries as well as continuing to support the needs of other IIPC member institutions. If elected, the UNT Libraries will strive to represent the best interests of the IIPC community and to help move forward the preservation of the Web.

Investigate holdings of web archives through summaries: cdx-summarize

By Yves Maurer, Web Archiving technical lead at the National Library of Luxembourg


Introduction

When researchers want to access web archives, they have two main possibilities. They can either use one of the excellent web archives that are freely accessible online, such as web.archive.org, arquivo.pt, vefsafn.is, haw.nsk.hr, or Common Crawl, or they can travel to the libraries and archives whose web archives are only available in their respective reading rooms. In fact, most web archives have some restrictions on access. Copyright and other legal considerations often make it difficult for institutions to open up the web archive to the broader Internet. Closed web archives are hard to access for researchers, especially if they live far from the physical reading rooms. The overall effect is that closed web archives are less used, studied, and published about, and researchers only travel to the closest reading rooms, if at all.

However, web archiving institutions would like more researchers to use their archives and popularize the usage of web archives for all users from contemporary history, sociology, linguistics, economics, law, and other disciplines. For closed web archives, usually little data is publicly available about their contents, so it is difficult to convince researchers to travel to the reading room when they don’t know in advance what exactly the archive contains and whether it is pertinent to their research question. Web archives are also very large, which makes handling the raw WARC files difficult for all parties, so sending extracts of data from institution to research team is often not feasible.

It would certainly be preferable to researchers if those closed web archives would just open their entire service to the Internet, but the wholesale lifting of legal restrictions is not easy. Therefore, if researchers cannot access the whole dataset, can they at least access some part that allows them to have an overview of the collection? Just a size indication (e.g. 340 TB) and the number of mementos (e.g. 3 billion) will not help much. A collection policy documenting the aims, scope and contents of the web archive (e.g. https://www.bnf.fr/fr/archives-de-linternet) is already more helpful but does not hold any numbers or information about particular sites of interest. There is, however, some type of data that resides in-between the legal challenges of full access on the one hand and a textual description or rough single numbers on the other hand. This type of data must not be encumbered by any legal restrictions nor should it be so massive that it becomes unwieldy.

Developed as part of the WARCNET network’s working group 1, the cdx-summarize (https://github.com/ymaurer/cdx-summarize) toolset proposes to generate and handle such a dataset. There are no legal restrictions on the data that it contains, since it is aggregated and holds neither copyrighted information nor personal data. Moreover, the file is of a manageable size. An institution with a closed web archive can publish the summary file for the whole collection, and researchers can then investigate its contents or compare it to the summary files from other institutions. In this way, web archives can publicize their collections and make them available for a rough first level of data exploration.

Sample uses of summary files for researchers

The summary files produced by cdx-summarize are simple, but they still contain statistics about the different years when mementos were harvested as well as the number and sizes of the different file types included in the collection. None of the following samples requires direct access to a web archive, only to a summary file. It is not the aim of this blog post to investigate these examples in detail, just to give readers an idea of how rich this summary data still is and what can be done with it.

A very simple example is the chart comparing the evolution of the sizes of HTML files over the years.

Picture1
Fig 1. Average size of HTML files in the Luxembourg Web Archive

Another example is to use the information about 2nd-level domains that is still present in the summary file to find out more about domain names in general, as in the following example:

Picture2
Fig 2. First letter frequency in Internet Archive 2nd-level domains vs French dictionary for the TLD .fr

Here, you could, for example, explain the overall abundance of 2nd level domains starting with the letter “L” by the fact that the French articles “le, la, les,” all start with “L” and so do probably quite a lot of domain names. Other deviations from the mean may need a deeper explanation.

Comparing web archives

Another nice thing about the summary files is that they can be produced for different web archives and then compared. At the time of writing this, I do not have access to any other closed web archive summary file apart from the one for the Luxembourg Web Archive (https://github.com/ymaurer/cdx-summarize/blob/main/summaries/webarchive-lu.summary.xz) (19.1 MB). However, there are open web archives with public APIs like the Internet Archive’s CDX server (https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server) or the Common Crawl (https://index.commoncrawl.org/). These can be used to generate a summary file, e.g. a whole top-level domain.
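As an example of what such a query looks like, the sketch below pulls index rows for a single 2nd-level domain from the Internet Archive’s CDX server. The domain, field list, and row limit are illustrative; summarizing a whole TLD would mean paging through far more data.

```python
# Sketch: fetch a small sample of index rows from the Internet Archive CDX API.
import requests

params = {
    "url": "bnl.lu",
    "matchType": "domain",                        # include subdomains of bnl.lu
    "fl": "timestamp,mimetype,statuscode,length",
    "output": "json",
    "limit": 1000,                                # keep the example small
}
resp = requests.get("https://web.archive.org/cdx/search/cdx", params=params, timeout=60)
rows = resp.json()                                # first row is the header when output=json
header, records = (rows[0], rows[1:]) if rows else ([], [])
print(header)
print(f"{len(records)} index rows returned")
```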

A first comparison between web archives can be done on the 2nd-level domains. Do all concerned web archives hold data from all the domains? Or does one archive have a clear focus on just a small subset of the domains? The following chart shows a comparison of the inclusion of domains from the TLD “.lu” in three web archives:

Picture3

The graph clearly shows that the Luxembourg Web Archive started in 2016 and that it is collaborating with the Internet Archive, who have a second copy of the same data. It also shows that the Common Crawl is much less broad in terms of included domains.

A deep comparison of the mementos held between web archives is probably better done on CDXJ index files themselves. There will still be some edge cases of mementos just being slightly different because of embedded timestamps, sessions, etc. but it will give a more detailed picture of the overlaps.

The summary file format

The file consists of JSON lines prefixed by the domain name. This is inspired by the CDXJ format and simplifies using Unix tools such as “sort” or “join” on the summary files. In the JSON part, there are keys for each year and, inside the year, keys for each (simplified) MIME type for the number of mementos (n_) and their sizes (s_):

host.tld {"year": {"n_html": A, ..., "s_html": B}}

A sample entry could be:

bnl.lu {"2003": {"n_audio": 0, "n_css": 8, "n_font": 0, "n_html": 639, "n_http": 728, "n_https": 0, "n_image": 44, "n_js": 0, "n_json": 0, "n_other": 7, "n_pdf": 30, "n_total": 728, "n_video": 0, "s_audio": 0, "s_css": 5268, "s_font": 0, "s_html": 1295481, "s_http": 4680354, "s_https": 0, "s_image": 295235, "s_js": 0, "s_json": 0, "s_other": 13156, "s_pdf": 3071214, "s_total": 4680354, "s_video": 0}}
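Since each line is just a host name followed by a JSON object, a few lines of Python are enough to read it. The sketch below uses a shortened version of the entry above and computes a per-year average HTML size, as in Fig 1; the exact keys used are assumptions drawn from the sample entry.

```python
# Sketch: parse one summary line and derive a simple per-year statistic.
import json

line = 'bnl.lu {"2003": {"n_html": 639, "s_html": 1295481, "n_total": 728, "s_total": 4680354}}'
host, payload = line.split(" ", 1)
stats = json.loads(payload)

for year, counters in sorted(stats.items()):
    avg_html = counters["s_html"] / max(counters["n_html"], 1)
    print(f"{host} {year}: {counters['n_total']} mementos, average HTML size {avg_html:.0f} bytes")
```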

The MIME types are simplified according to the following rules:

  • HTML (text/html, application/xhtml+xml, text/plain): these are counted as “web pages” by the Internet Archive
  • CSS (text/css): interesting for changing usage in formatting pages
  • IMAGE (image/*): all image types are grouped together
  • PDF (application/pdf): interesting independently, although IA groups PDFs in “web pages” too
  • VIDEO (video/*): all videos
  • AUDIO (audio/*): all audio types
  • JS (application/javascript, text/javascript, application/x-javascript): these three MIME types are common for JavaScript
  • JSON (application/json, text/json): relatively common and indicates dynamic pages
  • FONT (font/*, application/vnd.ms-fontobject, application/font, application/x-font*): usage of custom fonts

How do I generate the summary file for my web archive?

As the name cdx-summarize implies, the programs only need access to CDXJ files, not the underlying WARC files. Just run cdx-summarize.py --compact *.cdx > mywebarchive.summary in your CDX directory and it will do the summarization.

If you are using the WARC-indexer from the British Library and have a backend with a Solr index, it is even simpler: there is a version contributed by Toke Eskildsen that pulls the data from Solr efficiently and directly (https://github.com/ymaurer/cdx-summarize-warc-indexer). All types of CDXJ files should be supported, and different encodings are handled as well.
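For readers curious about what the summarization boils down to, the sketch below groups CDXJ records by 2nd-level host and year and accumulates counts and sizes per simplified MIME category. It assumes a "urlkey timestamp {json}" CDXJ layout with "url", "mime", and "length" keys and implements only a few of the MIME rules above, so treat it as an outline of the idea rather than a substitute for cdx-summarize itself.

```python
# Sketch: aggregate CDXJ records into per-host, per-year counts and sizes.
import json
from collections import defaultdict
from urllib.parse import urlsplit

def mime_category(mime):
    mime = (mime or "").lower()
    if mime in ("text/html", "application/xhtml+xml", "text/plain"):
        return "html"
    if mime.startswith("image/"):
        return "image"
    if mime == "application/pdf":
        return "pdf"
    return "other"

summary = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
with open("crawl-index.cdxj") as cdxj:                 # hypothetical index file
    for line in cdxj:
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue
        try:
            record = json.loads(parts[2])
        except ValueError:
            continue                                   # skip malformed lines
        host = urlsplit(record["url"]).hostname or ""
        host = ".".join(host.split(".")[-2:])          # keep the 2nd-level domain
        year, cat = parts[1][:4], mime_category(record.get("mime"))
        summary[host][year]["n_" + cat] += 1
        summary[host][year]["s_" + cat] += int(record.get("length", 0))

with open("mywebarchive.summary", "w") as out:
    for host, years in sorted(summary.items()):
        out.write(f"{host} {json.dumps(years, sort_keys=True)}\n")
```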

Web Archiving the War in Ukraine

By Olga Holownia, Senior Program Officer, IIPC & Kelsey Socha, Administrative Officer, IIPC with contributions to the Collaborative Collection section by Nicola Bingham, Lead Curator, Web Archives, British Library; CDG co-chair


This month, the IIPC Content Development Working Group (CDG) launched a new collaborative collection to archive web content related to the war in Ukraine, aiming to map the impact of this conflict on digital history and culture. In this blog, we describe what is involved in creating a transnational collection and we also give an overview of web archiving efforts that started earlier this year: both collections by IIPC members and collaborative volunteer initiatives.

Collaborative Collection 2022

In line with the broader content development policy, CDG collections focus on topics that are transnational in scope and are considered of high interest to IIPC members. Each collection represents more perspectives than a similar collection by a single member archive might include. Nominations are submitted by IIPC members, who have been archiving the conflict since as early as January 2022 (see below), as well as by the general public.

How do members contribute?

Topics for special collections are proposed by IIPC members, who submit their ideas to the IIPC CDG mailing list or contact the IIPC co-chairs directly at any time. Provided that the topic fits the CDG collecting scope, there is enough data budget to cover the collection, and a lead curator and volunteers to perform the archiving work are in place, the collection can go ahead. IIPC members are then canvassed widely to submit web content on a shared Google spreadsheet together with associated metadata such as title, language and description. The URLs are taken from the spreadsheet and crawled in Archive-It by the project team, formed of volunteers from IIPC members for each collection. Many IIPC members add a selection of seeds from their institutions’ own collections, which helps to make CDG collections very diverse in terms of coverage and language.

There will be overlap between the seeds that members submit to CDG collections and their own institutions’ collections, but there are differences. Selections for IIPC collections can be more geographically wide-ranging than those included in members’ own collections, which may have to adhere to a regional scope, as in the case of a national library. Selection decisions that are appropriate for members’ own collections may not be appropriate for CDG collections. For example, members may want to curate individual articles from an online newspaper by crawling each one separately, whereas, given the larger scope of CDG collections, it would be more appropriate to create the target at the level of a sub-section of the online newspaper. Public access to collections provided by Archive-It is a positive factor for those institutions that, for various reasons, can’t provide access to their own collections. You can learn more about the War in Ukraine 2022 collection’s scope and parameters here.

Public nominations

We encourage everyone to nominate relevant web content as defined by the collection’s lead curators: Vladimir Tybin and Anaïs Crinière-Boizet of the BnF, National Library of France and Kees Teszelszky of KB, National Library of the Netherlands. The first crawl is scheduled to take place on 27 July and it will be followed by two additional crawls in September and October. We will be publishing updates on the collection at #Ukraine 2022 Collection. We are also planning to make this collection available to researchers.

Member collections

In Spring 2022, we compiled a survey of the work done by IIPC members. We asked about the collection start date, scope, frequency, type of collected websites, way of collecting (e.g. locally and/or via Archive-It), social media platforms and access.

IIPC members have collected content related to the war, ranging from news portals to governmental websites, embassies, charities, and cultural heritage sites. They have also selectively collected content from Ukrainian and Russian websites and social media, including Facebook, Reddit, Instagram, and, most prominently, Twitter. The CDG collection offers another chance for members without special collections to contribute seeds from their own country domains.

Many of our members are national libraries and archives, and legal deposit informs what these institutions are able to collect and how they provide access. In most cases, that means crawling country-level domains, offering a localized perspective on the war. Access varies from completely open (e.g. the Internet Archive, the National Library of Australia and the Croatian Web Archive), to onsite-only with published and browsable metadata such as collected URLs (e.g. the Hungarian Web Archive), to reading-room only (e.g. Netarkivet at the Royal Danish Library or the “Archives de l’internet” at the National Library of France). The UK Web Archive collection has a mixed model of access, where the full list of metadata and collected URLs is available, but access to individual websites depends on whether the website owner has granted permission for off-site open access. Some institutions, such as the Library of Congress, may have time-based embargoes in place for collection access.

Some of our members have also begun work preparing datasets and visualisations for researchers. The Internet Archive has been supporting multiple collections and volunteer projects and our members have provided valuable advice on capturing content that is difficult to archive (e.g. Telegram messages).

A map of IIPC members currently collecting content related to the war in Ukraine can be seen below. It includes Stanford University, which has been supporting SUCHO (Saving Ukrainian Cultural Heritage Online).

Survey results

Access

While many members have been collecting content related to the war, only a small number of collections are currently publicly available online. Some members provide access to browsable metadata or a list of URLs. The National Library of Australia has been collecting publicly available Australian websites related to the conflict, as is the case for the National Library of the Czech Republic. A special event collection of 162 crowd-sourced URLs is now accessible at the Croatian Web Archive. The UK Web Archive’s special collection of nearly 300 websites is fully available on-site; however, information about the collected resources, which currently include websites of Russian oligarchs in the UK, commentators, charities, think tanks, and the UK embassies of Ukraine and the surrounding nations, is publicly available online. Some websites from the UK Web Archive’s collection are also fully available off-site, where website owners have granted permission. The National Library of Scotland has set up a special collection, ‘Scottish Communities and the Ukraine’, which contains nearly 100 websites and focuses on the local response to the Ukraine War. This collection will be viewable in the near future, pending QA checks. Most of the University Library of Bratislava’s collection is only available on-site, but information about the collected sites is browsable on their web portal, with links to current versions of the archived pages.

The web archiving team at the National Széchényi Library in Hungary, which has been capturing content from 75 news portals, has created a SolrWayback-based public search interface which provides access to metadata and full-text search, though full pages cannot be viewed due to copyright. The web archiving team has also been collaborating with the library’s Digital Humanities Center to create datasets and visualisations related to captured content.

Hungarian-Web-Archive-word_cloud
Márton Nemeth of National Széchényi Library and Gyula Kalcsó of Digital Humanities Center, National Széchényi Library presented on this collection at the 2022 Web Archiving Conference.

Multiple institutions plan to make their content available online at a later date, after collecting has finished or after a specified period of time has passed. The Library of Congress has been capturing content in a number of collections within the scope of their collecting policies, including the ongoing East European Government Ministries Web Archive.

Frequency of Collection

Most institutions have been collecting with a variety of frequencies. Institutions rarely answered with just one of the frequency options, opting instead to pick multiple options or “Other.” Of answers in the “Other” category, some were doing one-time collection, while others were collecting yearly, six-monthly, and quarterly.

How the content is collected

Most IIPC members crawl the content locally, while a few have also been using Archive-It. SUCHO has mostly relied on the browser-based crawler Browsertrix, which was developed by Ilya Kreymer of Webrecorder and is in part funded by the IIPC, and on the Internet Archive’s Wayback Machine.

Type of collected websites (your domain)

When asked about types of websites being collected within local domains, most institutions have been focusing on governmental and news-related sites, followed by embassies and official sites related to Ukraine and Russia as well as cultural heritage sites. Other websites included a variety of crisis relief organisations, non-profits, blogs, think tanks, charities, and research organisations.

Types of websites/social media collected

When asked more broadly, most members have been focusing on local websites from their home countries. Outside local websites, some institutions were collecting Ukrainian websites and social media, while a smaller number were collecting Russian websites and social media.

Specific social media platforms collected

The survey also asked specifically about social media platforms our members were collecting from: Reddit, Instagram, TikTok, Tumblr, and YouTube. While many institutions were not collecting social media, Twitter was otherwise the most commonly collected social media platform.

Internet Archive

The Internet Archive (IA) has been instrumental in providing support for multiple initiatives related to the war in Ukraine. IA’s initiatives have included:

  1. giving free Archive-It accounts, as well as general data storage, to a number of different community archiving efforts
  2. uploading files to the SUCHO collection at archive.org
  3. supporting the extensive use of Save Page Now (especially via the Google Sheets interface) with the help of numerous SUCHO volunteers (many tens of TB have been archived this way)
  4. supporting the uploading of WACZ files to the Wayback Machine. This work has just started, but a significant number of files are expected to be archived and, similar to other collections featured in the new “Collection Search” service, a full-text index will be available
  5. crawling the entire country code top level domain of the Ukrainian web (the crawl was launched in April and is still running)
  6. archiving Russian Independent Media (TV, TV Rain), Radio (Echo of Moscow) and web-based resources (see “Russian Independent Media” option in the “Collection Search” service at the bottom of the Wayback Machine).

IA’s Television News Archive, the GDELT Project, and the Media-Data Research Consortium have all collaborated to create the Television News Visual Explorer, which allows for greater research access to the Television News Archive, including channels from across Russia, Belarus, and Ukraine. This blog post by GDELT’s Dr. Kalev H. Leetaru explains the significance of this collaboration and the importance of this new research collection of Belarusian, Russian and Ukrainian television news coverage.

Volunteer initiatives

SUCHO

One of the largest volunteer initiatives focusing on preserving Ukrainian web content has been SUCHO. Involving over 1,300 librarians, archivists, researchers and programmers, SUCHO is led by Stanford University’s Quinn Dombrowski, Anna E. Kijas of Tufts University, and Sebastian Majstorovic of the Austrian Centre for Digital Humanities and Cultural Heritage. In its first phase, the project’s primary goal was to archive at-risk sites, digital content, and data in Ukrainian cultural heritage institutions. So far, over 30 TB of content and 3,500+ websites of Ukrainian museums, libraries and archives have been preserved, and a subset of this collection is available at https://www.sucho.org/archives. The project is beginning its second phase, focusing on coordinating aid shipments of digitization hardware, exhibiting Ukrainian culture online and organizing training for Ukrainian cultural workers in digitization methods.

sucho-poster-landscape-medium
The SUCHO leads and Ilya Kreymer presented on their work at the 2022 Web Archiving Conference and participated in a Q&A session moderated by Abbie Grotke of the Library of Congress.

The Telegram Archive of the War

image2
Screenshot from the Telegram Archive of the War, taken July 20, 2022.

Telegram has been the most widely used application in Ukraine since the onset of the war, but this messaging app is notoriously difficult to archive. A team of five archivists at the Center for Urban History in Lviv, led by Taras Nazaruk, has been archiving almost 1,000 Telegram channels since late February to create the Telegram Archive of the War. Each team member has been assigned to monitor and archive a topic or a region in Ukraine. They focus on capturing official announcements from different military administrative districts, ministries, local and regional news, volunteer groups helping with evacuation, searches for missing people, local channels for different towns, databases, cyberattacks, Russian propaganda, and fake news, as well as personal diaries, artistic reflections, humour and memes. Russian government propaganda and pro-Russian channels and chats are also archived. The multi-media content is currently grouped into over 20 thematic collections. The project coordinators have also been working with universities interested in supporting this archive and are planning to set up a working group to provide guidance on future access to this invaluable archive.

Ukraine collections on Archive-It

New content has been gradually made available within the Ukraine collections on Archive-It, which provided free or heavily cost-shared accounts to its partners earlier this year. These collections also include websites documenting the Ukraine Crisis 2014-2015, curated by the University of California Berkeley (UC Berkeley) and by Internet Archive Global Events. Four new collections have been created since February 2022, with over 2.5 TB of content. The largest publicly available collection about the 2022 conflict (around 200 URLs) is curated by the Ukrainian Research Institute at Harvard University. Other collections that focus on Ukrainian content are curated by the Center for Urban History of East Central Europe, UC Berkeley, and SUCHO. To learn more about the “War in Ukraine: 2022” collection, read this blog post by Liladhar R. Pendse, Librarian for East European, Central European, Central Asian and Armenian Studies Collections, UC Berkeley. University of Oxford, New College has been archiving at-risk Russian cultural heritage on the web as well as Russian opposition efforts against the war on Ukraine.

HURI-at-Archive-It
Ukrainian Research Institute at Harvard University’s collection at Archive-It.

Organisations interested in collecting web content related to the war in Ukraine can contact Mirage Berry, Business Development Manager at the Internet Archive.

How to get involved

  1. Nominate web content for the CDG collection
  2. Use the Internet Archive’s “Save Page Now”
  3. Check updates on the SUCHO Page for information on how you can contribute to the new phase of the project. SUCHO is currently accepting donations to pay for server costs and funding digitization equipment to send to Ukraine. Those interested in volunteering with SUCHO can sign up for the standby volunteer list here
  4. Help the Center for Urban History in Lviv by nominating Ukrainian Telegram channels that you think are worth archiving and participate in their events
  5. Submit information about your project: we are working to maintain a comprehensive and up-to-date list of web archiving efforts related to the war in Ukraine. If you are involved in a collection or a project and would like to see it included here, please use this form to contact us: https://bit.ly/archiving-the-war-in-Ukraine.

Many thanks to all of the institutions and projects featured on this list! We appreciate the time our members spent filling out our survey, and answering questions. Special thanks to Nicola Bingham of the British Library, Mark Graham and Mirage Berry of the Internet Archive, and Taras Nazaruk of the Center for Urban History in Lviv for providing supplementary information on their institutions’ collecting efforts.

Resources

Get Involved in Web Archiving the War in Ukraine 2022

By Kees Teszelszky, Curator Digital Collections, National Library of the Netherlands & Vladimir Tybin, Head of Digital Legal Deposit, National Library of France

On February 24, 2022, the armed forces of the Russian Federation invaded Ukrainian territory, annexing certain regions and cities and carrying out a series of military strikes throughout the country, thus triggering a war in Ukraine. Since then, the clashes between the Russian military and the Ukrainian population have had unprecedented repercussions on the situation of the civilian population and on international relations.

What we want to collect

This collaborative collection aims to collect web content related to this event in order to map the impact of this conflict on digital history and culture.

This collection will be built through the following themes: 

  • General information about the military confrontations
  • Consequences of the war on the civilian population
  • Refugee crisis and international relief efforts
  • Political consequences
  • International relations
  • Diaspora communities – Ukrainian people around the world 
  • Human rights organisations 
  • Foreign embassies and diplomatic relations
  • Sanctions imposed against Russia by foreign powers
  • Consequences on energy and agri-food trade
  • Public opinion: blogs/protest sites/activists

The list is not exhaustive and it is expected that contributing partners may wish to explore other sub-topics within their own areas of interest and expertise, providing that they are within the general collection development scope.

Out of scope

The following types of content are out of scope for the collection:

  • Data-intensive audio/video content (e.g. YouTube channels)
  • Social media platforms
  • Private member forums, intranets, or email (non-published material)
  • Content identifying vulnerable people and compromising their safety

How to get involved

Once you have selected the web pages that you would like to see in the collection, it takes less than 5 minutes to fill in the submission form:

https://bit.ly/Ukraine-2022-collection-public-nominations 

For the first crawl, the call for nominations will close on July 20, 2022.

For more information and updates, you can contact the IIPC Content Development Working Group team at Collaborative-collections@iipc.simplelists.com or follow the collection hashtag on Twitter at #iipcCDG.

Resources

About IIPC collaborative collections
IIPC CDG updates on the IIPC Blog

IIPC Steering Committee Election 2022: call for nominations

The nomination process for IIPC Steering Committee is now open.

The Steering Committee (SC) is composed of no more than fifteen Member Institutions who provide oversight of the Consortium and define and oversee action on its strategy. This year, five seats are up for election. 

What is at stake?

Serving on the Steering Committee is an opportunity for motivated members to help guide the IIPC’s mission of improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. Steering Committee members are expected to take an active role in leadership, contribute to SC and Portfolio activities, and help guide and administer the organisation. The elected SC members also lead IIPC Portfolios and thus have the opportunity to shape the Consortium’s strategic direction related to three main areas: tools development, membership engagement and partnerships. Every year, three SC members are designated as IIPC Officers (Chair, Vice-Chair and Treasurer) to serve on the IIPC Executive Board and are responsible for implementing the Strategic Plan.

Who can run for election?

Participation in the SC is open to any IIPC member in good standing. We strongly encourage any organisation interested in serving on the SC to nominate themselves for election. The SC members meet in person (if circumstances allow) at least once a year. Face-to-face meetings are supplemented by two teleconferences plus additional ones as required.

Please note that the nomination should be on behalf of an organisation, not an individual. Once elected, the member organisation designates a representative to serve on the Steering Committee. The list of current SC member organisations is available on the IIPC website.

How to run for election?

All nominee institutions, both new and existing members whose term is expiring but are interested in continuing to serve, are asked to write a short statement (max 200 words) outlining their vision for how they would contribute to IIPC via serving on the Steering Committee. Statements can point to past contributions to the IIPC or the SC, relevant experience or expertise, new ideas for advancing the organisation, or any other relevant information.

All statements will be posted online and emailed to members prior to the election with ample time for review by all membership. The results will be announced in October and the three-year term on the Steering Committee will start on 1 January.

Below you will find the election calendar. We are very much looking forward to receiving your nominations. If you have any questions, please contact the IIPC Senior Program Officer (SPO).


Election Calendar

15 June – 14 September 2022: Nomination period. IIPC Designated Representatives are invited to nominate their organisation by sending an email including a statement of up to 200 words to the IIPC SPO.

15 September 2022: Nominee statements are published on the Netpreserve blog and Members mailing list. Nominees are encouraged to campaign through their own networks.

15 September – 15 October 2022: Members are invited to vote online. Each organisation votes only once for all nominated seats. The vote is cast by the Designated Representative.

18 October 2022: The results of the vote are announced on the Netpreserve blog and Members mailing list.

1 January 2023: The newly elected SC members start their three-year term.

Game Walkthroughs and Web Archiving Project: Integrating Gaming, Web Archiving, and Livestreaming 

“Game Walkthroughs and Web Archiving” was awarded a grant in the 2021-2022 round of the Discretionary Funding Programme (DFP), the aim of which is to support the collaborative activities of the IIPC members by providing funding to accelerate the preservation and accessibility of the web. The project lead is Michael L. Nelson from the Department of Computer Science at Old Dominion University. Los Alamos National Laboratory Research Library is a project partner.


By Travis Reid, Ph.D. student at Old Dominion University (ODU), Michael L. Nelson, Professor in the Computer Science Department at ODU, and Michele C. Weigle, Professor in the Computer Science Department at ODU

Introduction

Game walkthroughs are guides that show viewers the steps the player would take while playing a video game. Recording and streaming a user’s interactive web browsing session is similar to a game walkthrough, as it shows the steps the user would take while browsing different websites. The idea of having game walkthroughs for web archiving was first explored in 2013 (“Game Walkthroughs As A Metaphor for Web Preservation”). At that time, web archive crawlers were not ideal for web archiving walkthroughs because they did not allow the user to view the webpage as it was being archived. Recent advancements in web archive crawlers have made it possible to preserve the experience of dynamic web pages by recording a user’s interactive web browsing session. Now, we have several browser-based web archiving tools such as WARCreate, Squidwarc, Brozzler, Browsertrix Crawler, ArchiveWeb.page, and Browsertrix Cloud that allow the user to view a web page while it is being archived, enabling users to create a walkthrough of a web archiving session.

Figure 1
Figure 1: Different ways to participate in gaming (left), web archiving (center), and sport sessions (right)

Figure 1 applies the analogy of different types of video games and basketball scenarios to types of web archiving sessions. Practicing playing a sport like basketball by yourself, playing an offline single player game like Pac-Man, and archiving a web page with a browser extension such as WARCreate are all similar because only one user or player is participating in the session (Figure 1, top row). Playing team sports with a group of people, playing an online multiplayer game like Halo, and collaboratively archiving web pages with Browsertrix Cloud are similar since multiple invited users or players can participate in the sessions (Figure 1, center row). Watching a professional sport on ESPN+, streaming a video game on Twitch, and streaming a web archiving session on YouTube can all be similar because anyone can be a spectator and watch the sporting event, gameplay, or web archiving session (Figure 1, bottom row).

One of our goals in the Game Walkthroughs and Web Archiving project is to create a web archiving livestream like that shown in Figure 1. We want to make web archiving entertaining to a general audience so that it can be enjoyed like a spectator sport. To this end, we have applied a gaming concept to the web archiving process and integrated video games with web archiving. We have created automated web archiving livestreams (video playlist) where the gaming concept of a speedrun was applied to the web archiving process. Presenting the web archiving process in this way can be a general introduction to web archiving for some viewers. We have also created automated gaming livestreams (video playlist) where the capabilities for the in-game characters were determined by the web archiving performance from the web archiving livestream. The current process that we are using for the web archiving and gaming livestreams is shown in Figure 2.

Figure 2
Figure 2: The current process for running our web archiving livestream and gaming livestream.

Web Archiving Livestream

For the web archiving livestream (Figure 2, left side), we wanted to create a livestream where viewers could watch browser-based web crawlers archive web pages. To make the livestreams more entertaining, we made each web archiving livestream into a competition between crawlers to see which crawler performs better at archiving the set of seed URIs. The first step for the web archiving livestream is to use Selenium to set up the browsers that display the information needed for the livestream, such as the name and current progress of each crawler. The information currently displayed for a crawler’s progress is the URL being archived and the number of web pages archived so far. The next step is to read a set of seed URIs from a text file and then let each crawler start archiving those URIs. The viewers can then watch the web archiving process in action.
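As a rough illustration of this setup, the sketch below uses Selenium to open one status window per crawler and to read seed URIs from a text file. This is a minimal sketch, not the project’s actual livestream script: the crawler names, window layout, status page, and the seeds.txt filename are all illustrative assumptions.

```python
# Minimal sketch (assumptions noted below), not the project's actual script:
# open one Selenium-driven status window per crawler and read seed URIs.
from selenium import webdriver

CRAWLER_NAMES = ["crawler-a", "crawler-b"]  # hypothetical crawler labels


def read_seed_uris(path="seeds.txt"):
    """Read one seed URI per line, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def launch_status_browsers():
    """Open one browser window per crawler to show its name and progress."""
    drivers = {}
    for i, name in enumerate(CRAWLER_NAMES):
        driver = webdriver.Chrome()
        driver.set_window_size(800, 600)
        driver.set_window_position(i * 810, 0)  # place the windows side by side
        # A simple data: URL stands in for the livestream's status page.
        driver.get(f"data:text/html,<h1>{name}</h1><p id='progress'>0 pages archived</p>")
        drivers[name] = driver
    return drivers


def update_progress(driver, current_url, pages_done):
    """Show the URL being archived and the number of pages archived so far."""
    driver.execute_script(
        "document.getElementById('progress').textContent = arguments[0];",
        f"Archiving {current_url} ({pages_done} pages archived so far)",
    )


if __name__ == "__main__":
    seeds = read_seed_uris()
    status = launch_status_browsers()
    # The crawlers themselves (e.g. Browsertrix Crawler or Brozzler) are started
    # separately; here we only simulate progress updates for the status windows.
    for pages_done, url in enumerate(seeds, start=1):
        for driver in status.values():
            update_progress(driver, url, pages_done)
```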

Automated Gaming Livestream

The automated gaming livestream (Figure 2, right side) was created so that viewers can watch a game where the gameplay is influenced by the web archiving and replay performance results from a web archiving livestream or any crawling session. Before an in-game match starts, a game configuration file is needed; it specifies the selections that will be made for the in-game settings. The game configuration file is modified based on how well the crawlers performed during the web archiving livestream. If a crawler performs well during the web archiving livestream, then the in-game character associated with that crawler will have better items, perks, and other traits. If a crawler performs poorly, then its in-game character will have worse traits. At the beginning of the gaming livestream, an app automation tool like Selenium (for browser games) or Appium (for locally installed PC games) is used to select the settings for the in-game characters based on the performance of the web crawlers. After the settings are selected by the app automation tool, the match is started and the viewers of the livestream can watch the match between the crawlers’ in-game characters. We have initially implemented this process for two video games, Gun Mayhem 2 More Mayhem and NFL Challenge, but any game with a mode that does not require a human player could be used for an automated gaming livestream.
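The sketch below shows one way such a configuration file could be generated from crawl timings. The JSON layout, trait names, file names, and timing values are assumptions made for illustration and are not the project’s actual configuration format; the app automation tool would then read a file like this when selecting the in-game settings.

```python
# Minimal sketch, not the project's actual code: map crawler performance from a
# web archiving livestream onto a (hypothetical) game configuration file.
import json


def rank_crawlers(results):
    """Rank crawlers by time to finish the seed list (fastest first)."""
    return sorted(results, key=lambda name: results[name]["seconds_to_finish"])


def build_game_config(results, path="game_config.json"):
    """Write a config file that gives the faster crawler the better traits."""
    ranking = rank_crawlers(results)
    traits = [
        {"gun": "fastest", "perk": "infinite ammo"},  # best-performing crawler
        {"gun": "slowest", "perk": None},             # worst-performing crawler
    ]
    config = {
        "players": [
            {"name": name, **traits[min(i, len(traits) - 1)]}
            for i, name in enumerate(ranking)
        ]
    }
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
    return config


if __name__ == "__main__":
    # Hypothetical timings from a crawl session, for illustration only.
    example_results = {
        "crawler-a": {"seconds_to_finish": 310},
        "crawler-b": {"seconds_to_finish": 455},
    }
    build_game_config(example_results)
```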

Gun Mayhem 2 More Mayhem Demo

Gun Mayhem 2 More Mayhem is similar to other fighting games like Super Smash Bros. and Brawlhalla, where the goal is to knock the opponent off the stage. When a player gets knocked off the stage, they lose a life, and the winner of the match is the last player left on the stage. Gun Mayhem 2 More Mayhem is a Flash game that is played in a web browser, so Selenium was used to automate it. Each crawler’s speed was used to determine which perk and which gun its in-game character would use. Some example perks are infinite ammo, triple jump, and no recoil when firing a gun. The fastest crawler used the fastest gun and was given the infinite ammo perk (Figure 3, left side), while the slowest crawler used the slowest gun and did not get a perk (Figure 3, right side).

Figure 3
Figure 3: The character selections made for the fastest and slowest web crawlers
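Because a Flash game exposes no DOM for its menus, automating it with Selenium typically comes down to coordinate-based clicks on the game element. The sketch below illustrates that idea under stated assumptions: the hosting URL, element selector, and menu coordinates are hypothetical and are not taken from the project’s scripts.

```python
# Minimal sketch, assuming the Flash game is embedded on a page and its menus
# are driven by coordinate-based clicks; all coordinates and URLs are made up.
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

GAME_URL = "https://example.com/gun-mayhem-2-more-mayhem"  # hypothetical hosting page

# Hypothetical (x, y) offsets within the game area for each menu choice.
GUN_COORDS = {"fastest": (300, 180), "slowest": (300, 420)}
PERK_COORDS = {"infinite ammo": (120, 240), None: (120, 400)}


def select_character_settings(driver, gun, perk):
    """Click through the character-setup menus based on the crawler's performance."""
    game = driver.find_element(By.TAG_NAME, "embed")  # the embedded game (assumed)
    actions = ActionChains(driver)
    for x, y in (GUN_COORDS[gun], PERK_COORDS[perk]):
        actions.move_to_element_with_offset(game, x, y).click()
    actions.perform()


if __name__ == "__main__":
    driver = webdriver.Chrome()
    driver.get(GAME_URL)
    # Fastest crawler's character: fastest gun plus the infinite ammo perk.
    select_character_settings(driver, gun="fastest", perk="infinite ammo")
```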

NFL Challenge Demo

NFL Challenge is an NFL football simulator that was released in 1985 and was popular during the 1980s. The performance of a team is based on player attributes that are stored in editable text files. It is possible to change the stats for the players, like the speed, passing, and kicking ratings, as well as the names of the team and of the players on the team. This customization allows us to rename each team after a web crawler and to rename the team’s players after the contributors to that tool. NFL Challenge is an MS-DOS game that can be played in the DOSBox emulator, and Appium was used to automate it since it is a locally installed game. In NFL Challenge, the fastest crawler would get the team with the fastest players based on the players’ speed attribute (Figure 4, left side) and the other crawler would get the team with the slowest players (Figure 4, right side).

Figure 4
Figure 4: The player attributes for the teams associated with the fastest and slowest web crawlers. The speed ratings are the times for the 40-yard dash, so the lower numbers are faster.
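To give a sense of how such a text file could be edited programmatically, the sketch below renames a team and its players. The file layout assumed here (team name on the first line, one comma-separated player record per line) and the file name are illustrative assumptions, not the actual NFL Challenge format.

```python
# Minimal sketch under an assumed file layout: rename a team after a web crawler
# and its players after the crawler's contributors, keeping the attribute values.
from pathlib import Path


def rename_team(team_file, crawler_name, contributors):
    """Rewrite the (assumed) team file with new team and player names."""
    lines = Path(team_file).read_text().splitlines()
    lines[0] = crawler_name  # team name on the first line (assumption)
    for i, contributor in enumerate(contributors, start=1):
        if i < len(lines):
            stats = lines[i].split(",", 1)[1]  # keep the existing attribute values
            lines[i] = f"{contributor},{stats}"
    Path(team_file).write_text("\n".join(lines) + "\n")


# Example call with hypothetical names (the file path and format are assumed):
# rename_team("TEAM01.TXT", "Browsertrix Crawler", ["Contributor A", "Contributor B"])
```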

Future Work

In future work, we plan on making more improvements to the livestreams. We will update the web archiving livestreams and the gaming livestreams so that they can run at the same time. The web archiving livestream will use more than just the speed of a web crawler when determining its performance, for example by using metrics from Brunelle’s memento damage algorithm, which measures the replay quality of archived web pages. During future web archiving livestreams, we will also evaluate and compare the capture and playback of web pages archived by different web archives and archiving tools like the Internet Archive’s Wayback Machine, archive.today, and Arquivo.pt.

We will update the gaming livestreams so that they can support more games and games from different genres. The games that we have supported so far are multiplayer games. We will also try to automate single player games, where the in-game character for each crawler can compete to see which player gets the highest score on a level or which player finishes the level the fastest. For games that allow creating a level or game world, we would like to use what happens during a crawling session to determine how the level is created. If the crawler was not able to archive most of the resources, then more enemies or obstacles could be placed in the level to make it more difficult to complete. Some games that we will try to automate include Rocket League, Brawlhalla, Quake, and DOTA 2. When the scripts for the gaming livestream are ready to be released publicly, it will also be possible for anyone to add support for more games that can be automated.

We will also have longer runs for the gaming livestreams so that a campaign or season in a game can be completed. A campaign is a game mode where the same characters play a continuing story until it is completed. A season for the gaming livestreams will be like a season for a sport, where there are a certain number of matches that must be completed by each team during a simulated year and a playoff tournament that ends with a championship match.

Summary

We are developing a proof of concept that integrates gaming and web archiving so that web archiving can be more entertaining to watch and enjoyed like a spectator sport. We have applied the gaming concept of a speedrun to the web archiving process by having a competition between two crawlers in which the crawler that finished archiving the set of seed URIs first was the winner. We have also created automated web archiving and gaming livestreams, where the web archiving performance of the crawlers from the web archiving livestreams was used to determine the capabilities of the characters inside the Gun Mayhem 2 More Mayhem and NFL Challenge video games played during the gaming livestreams. In the future, more nuanced evaluation of crawling and replay performance can be used to better influence in-game environments and capabilities.

If you have any questions or feedback, you can email Travis Reid at treid003@odu.edu.

Remembering Past Web Archiving Events With Library of Congress Staff

By Meghan Lyon, Digital Collection Specialist, Library of Congress and member of WAC 2022 Program Committee


Since joining the Library of Congress Web Archiving Program remotely in 2020, I have had the pleasure of participating in IIPC activities and getting to know the generous and hardworking members of this community. Although—due to Covid-19 restrictions—I have yet to meet many of my colleagues in person, I feel as though I’ve been wholeheartedly welcomed. It is a privilege to be a member of the Program Committee for the 2022 Web Archiving Conference and General Assembly, which will be hosted virtually by the Library of Congress.

Last year, I remember the tireless planning efforts of Senior Program Officer Olga Holownia as she, then-IIPC Chair (now Vice-Chair) Abbie Grotke, and staff members from the National Library of Luxembourg (2021’s amazing virtual conference host) tested the virtual conference platform. They tested virtual tables and planned post-session break-out chats where engaged members could continue discussions from the previous panel. The end result was engaging and exciting, especially for a virtual conference.

It was a pleasure to learn at that time about topics as diverse as the Frisian web (Kees Teszelszky, “Side fûn: mapping the Frisian web domain in the Netherlands”), Flash-capable browser emulation (Ilya Kreymer & Humbert Hardy, “Not gone in a Flash! Developing a Flash-capable remote browser emulation system”), and experimental methods of quality assurance for web archives (Brenda Reyes Ayala, James Sun, Jennifer McDevitt & Xiaohui Liu, “Detecting quality problems in archived website using image similarity”). Ayala et al.’s presentation led me to Dr. Ayala’s research, which has greatly influenced QA workflow development here at the Library of Congress. That workflow development will be covered in the panel “Advancing Quality Assurance for Web Archives: Putting Theory into Practice” at the upcoming 2022 Web Archiving Conference. If you missed the 2021 conference, you can still view selected talks and Q&A sessions on the IIPC YouTube channel.

WAC2021
IIPC 2021 Web Archiving Conference co-hosted with the National Library of Luxembourg.

With that, I’d like now to ask Abbie Grotke, my supervisor and Vice-Chair of the IIPC, as well as Grace Thomas, one of my teammates on the Web Archiving Team, some questions about their experience in the IIPC community:

Meghan Lyon: Give us a snapshot of your first experience — or of a memorable experience — at an IIPC WAC & GA of times past.

Grace Thomas (Senior Digital Collection Specialist, Web Archiving Team):

The first IIPC WAC I attended was Web Archiving Week 2017 in London. I had joined the Library of Congress Web Archiving Team less than a year earlier and I was still trying to figure out the extent of this new world. From what my seasoned coworkers said about the web archiving community, I knew it was small and geographically disparate – a modest group of faceless individuals shouldering the massive task of archiving the web – but the events in London showed me how kind, collaborative, and very real everyone is. We are all dealing with the exact same issues at different scales and, most importantly, I got the feeling that everyone was there because they wanted to carry on this work and find solutions to those problems together.

WAW2017-ArchivesUnleashed
Archives Unleashed hackathon during the Web Archiving Week 2017 at the British Library.
Photo credit: Olga Holownia.

I also attended the 2018 WAC in Wellington, New Zealand, which provided me the opportunity for a grand adventure in a stunning locale! Even now, nearly four years later, I frequently recall Dr Rachael Ka’ai-Mahuta’s keynote about the archiving of Indigenous Peoples’ language, culture, and movement, which gave me an important framework for thinking about the ethics of cultural ownership. It was the farthest I had ever traveled from home, and being surrounded by Māori customs and artifacts that week deepened these concepts further; I’m grateful to have been in that place at exactly that time.

Although, I have to say the most memorable WAC experience was nearly missing my flight back to the US from London in 2017 and seeing Abbie’s face break into a relieved smile as I sprinted up to the gate at Heathrow! I guess I didn’t want to leave the WAC… and who would?

WAC2022-Dr_Rachael_Ka’ai-Mahuta
Keynote by Dr Rachael Ka’ai-Mahuta titled Te Māwhai – te reo Māori, the Internet, archiving, and trust issues. Photo credit: Mark Beatty.

Abbie Grotke: Besides reliving that moment of almost losing Grace in London (oh my!), my first IIPC memories are from way back at the beginning of the consortium. There was talk of this international group forming, and although I did not get to go to an early meeting in Rome, Italy, I attended another very early-days discussion called “National Libraries Web Archiving Consortium,” which was held at the Library of Congress in March 2003. It was there that (besides colleagues at the Internet Archive) I first met fellow web archiving colleagues from the British Library, Bibliothèque nationale de France, National and University Library of Iceland, and National Library of Canada (now Library and Archives Canada). These institutions, along with LC, the Internet Archive, and a number of others, were the early founders of the consortium, and many of those colleagues became good friends over the years. I couldn’t have imagined then that I would still be involved in this community all these years later! A lot of those folks have moved on or retired, but our institutions still work closely together to this day.

One of my favorite memorable experiences was when I was communications officer, supporting the Steering Committee of the IIPC, who had been in Oslo for a meeting of the Access Working Group where we were hashing out requirements for an access tool. We all hopped on a plane to Trondheim, then a puddle jumper plane from Trondheim to Mo i Rana where the other National Library buildings were, for a Steering Committee meeting up there. Gildas Illian (the IIPC technical lead at the time) and I were in the very back of the plane which had a row entirely across the back, looking straight down the middle aisle. Most of the Steering Committee was on the plane, which was having some horrible turbulence. Even though we were terrified by the flight I just remember laughing so much (coping mechanism!) with Gildas about the fact that if the plane went down, the consortium would be over. We also couldn’t stop laughing at the “barf bags” in front of us, which said “uuf da” – which I now say ALL the time and always think back to that moment. We of course landed safely. That was also the meeting where a colleague from New Zealand was calling in the entire two days of meetings despite the time difference, and at dinner we started talking to a plant in the middle of the table as if it were him. Good times!


Meghan Lyon: Tell us one thing you love or appreciate about being a part of the international web archiving community.

Abbie Grotke: It truly is the most supportive community, and I am forever grateful for the opportunity to meet and know so many helpful colleagues from across the globe. And there is nothing like a conference in a beautiful library in an unfamiliar city with the smartest web archiving experts in the world. I’ve forged some wonderful friendships over the years. While a virtual meeting is not quite the same and I can’t wait until we can meet again in person, I’ve been amazed at how we’ve adapted as a community to an entirely virtual event. In many ways it has allowed for a richer experience: more people who might not have been able to travel to the conference and meetings can participate, and in my mind that’s always a benefit! I hope we can continue to keep a blend of in-person and virtual events in the future. Come join us!

WAC 2022

Registration is now open. Register separately for each day you plan to attend—May 23, 24, and 25 for the WAC, May 17-19 for the General Assembly. View the schedule and abstracts, and learn more about the Conference and GA sessions on the IIPC Website: 2022 Web Archiving Conference!


IIPC General Assembly & Web Archiving Conference 

IIPC GA & WAC collection at UNT Digital Library

2021 Web Archiving Conference: presentations & recordings