The 2023 IIPC Web Archiving Conference Reflections

By Friedel Geeraert, Expert in web archiving at KBR | Royal Library of Belgium


The IIPC Web Archiving Conference 2023 took place in Hilversum, the Netherlands, in the beautiful building of the Netherlands Institute for Sound and Vision. The warm atmosphere of the web archiving community gathered there more than compensated for the cold rain outside. Over the two-day conference, presentations covered themes such as new initiatives and collections, COVID-19, collaborations, digital scholarship and research, tool development, quality assurance, outreach, inclusive representation, data management, preservation and infrastructure. The workshops organised on both days provided the opportunity to gain more hands-on experience. The programme was so interesting that it was often difficult to choose which track or workshop to follow.

Netherlands Institute for Sound and Vision in Hilversum
Photo: Olga Holownia | IIPC

Open Source Investigation and Public Values in the Digital Domain

The two keynote speakers, Eliot Higgins of Bellingcat and Marleen Stikker of Waag Futurelab, shared their expertise and vision. Higgins provided insight into Bellingcat, the independent group of investigative journalists, and its ethical digital investigations into conflicts such as the war in Ukraine to debunk misinformation. Bellingcat also runs programmes that teach students to think critically about online information and sources, helping them make better-informed decisions and formulate well-founded opinions, which is hopeful in light of the polarisation of society.

Eliot Higgins | Bellingcat & Johan Oomen | Sound & Vision
Photo: Olga Holownia | IIPC

Stikker presented her alternative history of the internet, focusing on its social roots rather than its military origins, and the role we can all play in managing the internet as a commons and governing it accordingly. She suggests assessing its foundations by asking critical questions about underlying assumptions and governance, guaranteeing human rights and ensuring a regenerative socio-economic model (as opposed to the current extractive one). Above all, she argues for taking action, for example by moving towards platforms that are not governed by big commercial corporations, such as Signal and Mastodon.

Marleen Stikker | Waag Futurelab
Photo: Olga Holownia | IIPC

Thoughts, tips and takeaways

As always, participants came away with their heads full of ideas and useful information. Armed with numerous pages of notes and a list of people to contact in the coming months for more information, I returned to KBR in Belgium. I will be using the coming year to look further into ARCH (the Archives Research Compute Hub), developed by the Archives Unleashed Project; Browsertrix Cloud, developed by the Webrecorder team; and SolrWayback, developed by the Royal Danish Library. Providing more descriptive information about web archive collections was another interesting idea, raised by the web archiving team of the BnF as well as by Emily Maemura and Helena Byrne in their ‘datasheets for datasets’ concept.

“Describing Collections with Datasheets for Datasets” workshop
Photo: Jacqueline van der Kort | Beeldstudio KB
Jefferson Bailey | Internet Archive, ARCH workshop
Photo: Jacqueline van der Kort | Beeldstudio KB

Other aspects that sparked my interest were preservation practices in the context of web archiving, for example the WARC validation presented by the team of the National Archives of the Netherlands, the need for consistent use of data repositories such as Zenodo, Software Heritage and the Internet Archive, and the use of the URN PWID to reference web archive sources. Other ideas that arose during the conference were linked to quality assurance and analysis: the use of tools such as Screaming Frog by the team at the UK Government Web Archive, the WAVA (Web Visualisation and Analysis) tool developed by the team behind the Web Curator Tool, and the use of rubrics as demonstrated by the speakers from the Library of Congress. The team behind the End of Term web archive also discussed tools used by Common Crawl that show promise for creating derivative datasets and enriched metadata.

Community coming together

These are only a few examples of the wealth of interesting ideas evoked at this conference. On top of that, it was wonderful to catch up with other members of the web archiving community during the breaks. Over cups of coffee, delicious ciabatta and sweet pastries, conversation topics ranged from planned changes to national legislation, evolving Twitter collection policies and public tenders on one end of the seriousness spectrum, to the sock affinity of an awfully cute puppy and the best gifts for six-month-old babies on the other. Many thanks to the organisers of this year’s conference for such a great edition of the IIPC WAC. Needless to say, I’m already looking forward to next year’s edition on April 25-26, 2024, hosted by the BnF.

Web Archiving Conference, May 2023
Photo: Jacqueline van der Kort | Beeldstudio KB

IIPC Chair Address 2023

Youssef Eldakar, Head of the International School of Information Science at Bibliotheca Alexandrina and IIPC Chair 2023-2024


We are already in the second quarter of a new IIPC year. 2023 marks the first year since 2019 that we are finally able to meet in person again at the General Assembly and the Web Archiving Conference, this time kindly hosted by the Netherlands Institute for Sound and Vision in Hilversum and co-organised by KB, National Library of the Netherlands. I have the pleasure of chairing the Consortium in 2023, a year that also marks 20 years of us working together to preserve content on the web. I would like to start by thanking Abbie Grotke of the Library of Congress (2021 Chair, 2022 Vice-Chair) and Kristinn Sigurðsson of the National and University Library of Iceland (2021 Vice-Chair, 2022 Chair) for their leadership in 2021 and 2022. I would also like to thank Ian Cooke of the British Library (2022 Treasurer) for continuing in his role as the IIPC Treasurer this year.

Anniversary

Time flies when we are having fun, so it is hard to believe that we have already spent 20 years working together as a consortium to advance the digital preservation of web content. Our anniversary year officially starts in July (IIPC was founded in Paris on July 24, 2003), but we would like to use our meeting in Hilversum to reflect on lessons learned and on what we have achieved as a community over the past two decades. Our shared endeavor of capturing and preserving the ever-changing web and creating sustainable programs and practices is aptly captured by both this year’s conference theme, “Resilience and renewal,” and the following statement: “Web archiving practice has needed to demonstrate resilience in the face of the challenges. It has also required sustained innovation and renewal to find novel and practical ways to try to overcome obstacles and to demonstrate (and add to) the value of web archiving programs.”

What cannot be overestimated is the amount of knowledge that has accumulated over the past 20 years, which both web archiving practitioners and researchers have so generously shared at various IIPC events, including our annual General Assembly and Web Archiving Conference, as well as through IIPC projects, working groups and task forces. The IIPC GA has now been held 18 times, across the world and online, and our members and IIPC staff have organised and contributed to many more workshops, webinars, training events, member updates, working group meetings, and technical calls.

I would like to thank the University of North Texas University Libraries and IIPC Staff for ensuring that these valuable resources are preserved, searchable and easily accessible through the IIPC collections at UNT. IIPC members and the wider web archiving community have also produced publicly available documentation of tools, including the Awesome Web Archiving List. Thanks to the efforts of the IIPC Training Working Group, we now also have training materials introducing a wide range of web archiving topics which have been used extensively.

Interestingly, this IIPC twentieth-anniversary year overlaps with that of Bibliotheca Alexandrina. In 2002, just a few months before the inauguration of the revived Library of Alexandria, Brewster Kahle (founder of the Internet Archive) was in Alexandria, Egypt on a mission during which the IA and BA teams worked together to install a copy of IA’s web archive collection of the time inside the new library’s building. In addition to putting web archiving in place as a function of the library before its opening, the initiative was quite symbolic, alluding to the idea that preserving the web in modern times is rooted in preserving human knowledge regardless of the medium, be it papyrus scrolls in the ancient library centuries ago or digital documents on computer storage in the 21st century.

Membership

IIPC started with 12 founding members: 11 national libraries and the Internet Archive. Over 20 years, the consortium has expanded to 54 members and now also includes university libraries, audiovisual institutes, service providers and an independent open-source project. I would like to take this opportunity to welcome our newest members, Smithsonian Libraries and Archives (joined in 2022) and Webrecorder (joined in 2023).

First In-Person Event in Almost 4 Years

The IIPC has made it through recent times during which all interaction was virtual and we have even expanded our activities. For three consecutive years, from 2020 to 2022, due to the Coronavirus pandemic, the IIPC held all its activities and events, including the annual General Assembly and Web Archiving Conference, online. While we are grateful that technology allowed us to go on uninterrupted to carry out the functions of the Consortium and even expand the audience of some of our events, we are also grateful for the opportunity to gather again as an in-person community this year, an experience which has proven over the years to be an opportunity for a truly dynamic interchange of ideas and experiences for many in the community.

Collection Building

One of our key IIPC-funded activities that continues this year is our Collaborative Collections, which are led by the Content Development Working Group (CDG) and supplemented by contributions from the community. Through this effort, the IIPC has been doing its fair share of building web archive collections that are transnational in scope and thus fitting for the consortium’s internationally diverse makeup. These collections offer members the chance to contribute to important global collections that may otherwise be outside their own organisations’ collecting scope. Last year, we notably launched a collection archiving the War in Ukraine. This and many other collaborative collections are available through both Archive-It and Bibliotheca Alexandrina. The new access point brings together web archive collections and tools developed by IIPC members: LinkGate and SolrWayback.

BA is currently in the process of moving the Solr index for the SolrWayback access interface to higher-end storage in order to achieve adequate search performance. This year, we will continue working with the IIPC Research Working Group on mapping possibilities for researcher use of our collections through these additional access points.


Tools

One of the IIPC’s key goals is to foster the development of tools for web archiving. While this year marks two decades since the IIPC’s founding, it also marks a major step forward toward a more complete set of tools covering the key aspects of the web archiving process: capture, playback and analysis.

Following our members’ interests and needs, since last year IIPC has been supporting the development of Webrecorder’s Browsertrix Cloud. This two-year project, led by the British Library, the National Library of New Zealand, the University of North Texas Libraries and the Royal Danish Library, is an excellent example of the IIPC’s collaborative approach to tools development. While work on this continues, with significant support from the community involved in testing the tool, our members have already presented on their use of Browsertrix at IIPC workshops, webinars and, most recently, at the WAC 2023 Online Day and the in-person conference.

In the past few years, IIPC has supported members in transitioning from OpenWayback, a tool developed and maintained by IIPC members, to Python Wayback, or pywb, developed by Ilya Kreymer of Webrecorder. As I mentioned earlier, one of the strengths of our community is its willingness to share knowledge. In addition to informal calls and exchanges on pywb deployment at IIPC member institutions, our members have been presenting their processes at IIPC webinars, a use-case series that will continue later in 2023.

This brings me to tools for analysis and research. As more and more excellent examples of research use are presented at our Research Webinar Series and at the WAC, we have also been following the development of tools for analysing web archives. I have already mentioned SolrWayback, which was developed at the Royal Danish Library and featured at multiple IIPC conferences, including this year, and I would also like to congratulate the Internet Archive and the Archives Unleashed team on creating ARCH (Archives Research Compute Hub), also presented at this year’s WAC. AWAC2 (Analysing Web Archives of the COVID-19 Crisis) is one of the projects that used the new research hub to analyse IIPC’s biggest collection, and we are hoping that the combination of access through Archive-It and the BA mentioned earlier will draw more researchers to our collaborative collections.

Members Survey 2023

There has been a growing interest in computational access to web archives and engagement with researchers, and many of our members have been working on making their collections available to researchers. In response to this trend, we have made sure that our new members survey includes a detailed section on the topic. The survey, based on one we conducted in 2017, also covers many other main areas of web archiving programs. The results of the previous survey helped us shape our current Strategic Plan and Consortium Agreement. Our hope is that the new survey will give us the opportunity to further shape IIPC strategic priorities and guide our funding programs. Created jointly by the Membership Engagement Portfolio and IIPC Staff, the survey is another key activity for this year which will allow us to better serve the web archiving community.

Tools Training

The IIPC has been investing significant time and resources over the years into developing tools for web archiving to serve the consortium’s primary goal of preserving Internet content. An investment in tools development is only truly rewarding when the developed tools find their way to the right hands that will put them to effective use. The Tools Development Portfolio and the IIPC Senior Program Officer have made a proposal to provide technical training to IIPC members and anyone else interested in learning the technology used by web archiving institutions. As part of the survey mentioned earlier, we will be gathering information about the types of training that are most needed and the format that would be most suitable for our global membership.

Online activities and partner events

We are meeting in Hilversum in person, but one of our lessons learned from the past few years of online programming is the need to offer the conference to those who are unable to travel. As Kristinn pointed out last year, the COVID-19 pandemic made us an organisation for all seasons, and while we are celebrating the return to in-person events, we want to ensure that we can also provide an online forum serving the needs of members and the wider community. This year, we have already delivered an entire online day of WAC, and we will continue organising webinars, technical calls and other online events for members.

Last year we partnered with the Open Preservation Foundation (OPF) and the Impact Centre of Competence in Digitisation (IMPACT) to deliver a widely popular online panel on the preservation of digitised and born-digital collections. Our Training Working Group also co-organised a workshop on advocacy with the Digital Preservation Coalition. The Partnerships and Outreach Portfolio and the SPO will continue to work on furthering our advocacy efforts, exploring future collaborations with organisations whose interests overlap with the IIPC’s strategic priorities. We are glad to see OPF, the Dutch Digital Heritage Network (DDHN) and IIPC’s administrative and financial host, the Council on Library and Information Resources (CLIR), at this year’s annual event, and we are looking forward to an August 2023 in-person conference in Germany organised by Nestor in partnership with the IIPC.

All of our activities are only possible thanks to our members, who contribute their time and expertise as well as offering their institutions as a home for our in-person events. We are very grateful to the Netherlands Institute for Sound and Vision for hosting this year’s GA & WAC, offering both a unique and beautiful conference location and staff time spent helping with on-site logistics and conference organization. We’re also extremely grateful to KB, National Library of the Netherlands for their work in co-organising the conference, including generous sponsorship, contributions to the conference program, and significant amounts of staff time. Our consortium’s work is truly built on collaboration, and this year’s co-organised conference is no exception. Thank you to all of our members who have volunteered their time and skills, whether by working in one of the Portfolios, Task Forces, or Working Groups, helping with the annual conference, reviewing conference proposals, showcasing their work in a webinar or workshop, or otherwise.

What we do is also made possible thanks to the IIPC staff, hosted at CLIR, who are responsible for driving all of the IIPC activities and supporting our collaborative efforts to preserve the web. One of our recent strategic goals has been to “strengthen organizational resilience via both increased engagement with the Consortium Financial and Administrative Host and additional support staff (workforce development for increased efficiency, productivity and continuity).” Since 2022, we have finally had two full-time staff members for the first time. This has made us a much more robust organisation and has allowed us to significantly reduce our dependence on the volunteer model. I would like to thank Olga Holownia, our Senior Program Officer, and Kelsey Socha-Bishop, our Administrative Officer, for all their hard work.

20 Years of the Web Archiving Project (WARP) at the National Diet Library, Japan

By SHIMURA Tsutomu, National Diet Library (NDL), Japan



2022 marked the 20th anniversary of the start of the Web Archiving Project (WARP) at the National Diet Library, Japan. The following article introduces the progress made over the 20 years since the project was first launched on an experimental basis in 2002.

The History of the Web Archiving Project

The National Diet Library’s Web Archiving Project, or WARP, was launched in 2002 as an experimental project to collect, preserve, and provide access to a small number of Japanese websites from both the public and the private sectors with the permission of the webmasters. In 2006, the project was expanded to include all government-related organizations.

In 2009, the National Diet Library Law was amended to enable us to comprehensively collect any website published by public institutions, including all national and municipal government agencies, without permission from the publisher. When this amendment came into force the following year, 2010, we started to archive at regular intervals, such as monthly or quarterly, depending on the type of institution. It was at this time that the basic framework of the current WARP was solidified.

In 2013, we updated the system and began providing access to curated content, such as the Monthly Special feature. In 2018, we developed an English-language user interface in the hope of further expanding our audience. In 2021, we improved the display of search results, and in 2022 we greatly improved the mechanism for following links within archived content.

Changes in the WARP website

The layout of the WARP website has been changed three times so far: at the start of the experimental project in 2002, at the start of comprehensive archiving in 2010, and at the time of the 2013 update.

WARP as an experimental project (screenshot from 20 Apr 2003)

The beginning of archiving based on the National Diet Library Law (screenshot from 13 Jul 2010)

Updated layout (screenshot from 1 Aug 2013)

Number of targets

Changes in number of targets

During the period from FY2002, when we began as an experimental project, until FY2009, we archived websites only when we had obtained permission from the webmasters, irrespective of whether the website was published by a public or a private institution.

With the start of comprehensive archiving in FY2010, we were able to archive the websites of all public institutions, which greatly increased the number of targets. The graph shows that between FY2009 and FY2011, the number of targets increased by more than 2,000.

In addition, we have continued to request permission to collect private websites on a daily basis, and the number of targets has increased each year. In 2015, the number of targets increased significantly due to intensive requests made to public interest foundations. Generally, we focus on requesting permission from specific types of institutions for a certain period of time. This has resulted in the number of private-sector targets increasing to about 8,000, which currently exceeds the number of public-sector targets.

In order to provide access via the Internet to archived websites, we need the permission of the owner of the public transmission rights granted under the Copyright Law of Japan. We therefore request permission from each webmaster before providing such access. As of FY2021, we were able to provide access via the Internet to 12,435 targets, or about 90% of those we have archived, making WARP one of the most internet-accessible web archives in the world.

Data size

Changes in data size

The size of the archived data has increased rapidly since the start of comprehensive archiving of the websites of public agencies in FY2010, nearly reaching 2,400 TB in FY2021. This is due to the increase in the number of targets collected as well as the growing size of the data published by each institution.

System configuration transition

Over the past 20 years, the system configuration and various technologies implemented in WARP have changed significantly. The three most important technologies for collecting, preserving, and providing access to web archives are harvest software, storage format, and replay software. In addition, we provide a full-text search function to make it easier for users to find content of interest from the vast amount of archives. Here is a brief summary of the transition of each system configuration.

Harvest software

At first, we used the open-source software Wget to store harvested websites in units of files. In 2010, we implemented Heritrix, the standard harvesting software used by web archiving organizations around the world and specialized for web archiving, and we have been using it ever since. In 2013, a duplication reduction function was added to reduce the volume of data to be stored: it saves only the files that have been updated since the previous crawl, thus reducing the total volume of data saved and saving storage space.
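The deduplication idea can be illustrated with a short sketch. This is not Heritrix’s actual implementation (Heritrix writes a small WARC “revisit” record rather than simply skipping the file); the function name and the digest store are hypothetical, shown only to make the logic concrete.

```python
import hashlib

def should_store(url, payload, previous_digests):
    """Decide whether a fetched payload needs to be stored again.

    previous_digests maps URL -> content digest from the prior crawl,
    a simplified stand-in for Heritrix's deduplication history.
    """
    digest = hashlib.sha1(payload).hexdigest()
    if previous_digests.get(url) == digest:
        # Unchanged since the last crawl: skip the body (Heritrix would
        # instead write a small "revisit" record pointing at the stored copy).
        return False
    previous_digests[url] = digest
    return True
```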

Storage format

Data was saved in units of files when using Wget but, with the implementation of Heritrix in 2010, the storage format was changed to the WARC format. The files that comprise each website, as well as metadata about those files, are stored together in a WARC file. The WARC format allows for the archiving of information that could not be captured when saving data in units of files. For example, responses that have no content, such as a redirect to a new URL when the URL of a website changes, can now be saved.
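As an illustration of how such a content-less redirect can be captured, here is a minimal sketch using the open-source warcio library (not mentioned in the article; Heritrix writes its WARC records internally). The URLs are made up.

```python
from io import BytesIO

from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter

with open('example.warc.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True)
    # A 301 redirect has no body, but the WARC response record still
    # preserves the status line and the Location header.
    http_headers = StatusAndHeaders(
        '301 Moved Permanently',
        [('Location', 'https://example.org/new-address')],
        protocol='HTTP/1.1')
    record = writer.create_warc_record(
        'https://example.org/old-address', 'response',
        payload=BytesIO(b''), http_headers=http_headers)
    writer.write_record(record)
```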

Replay software

In order to view a website saved in WARC files, either the files comprising the website must be extracted from the WARC files and copied to a general web server, or dedicated software is needed to replay the WARC files directly. Initially, we adopted the former method, which meant that, in addition to the original WARC files, storage capacity for the extracted data was required. We currently use OpenWayback, which allows users to browse WARC files directly, eliminating the need for storage space for data extracted into units of files.
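Replay tools such as OpenWayback read WARC records in place. The same idea, reading records without extracting files, can be sketched with warcio’s ArchiveIterator (again an illustration, not WARP’s actual code):

```python
from warcio.archiveiterator import ArchiveIterator

with open('example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            url = record.rec_headers.get_header('WARC-Target-URI')
            status = record.http_headers.get_statuscode()
            print(status, url)  # e.g. "301 https://example.org/old-address"
```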

Full-text search software

Full-text search software was introduced during the experimental project period, but at that time it was custom-built for WARP. In 2010, we adopted Solr, open-source full-text search software widely used around the world, to improve the search speed over large-scale data.
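A full-text query against a Solr index looks roughly like the sketch below; the core name (`warp`) and the field names are hypothetical, since the article does not describe WARP’s schema.

```python
import requests

params = {
    'q': 'content:"web archiving"',  # hypothetical full-text field
    'fl': 'url,title,crawl_date',    # hypothetical stored fields
    'rows': 10,
}
resp = requests.get('http://localhost:8983/solr/warp/select', params=params)
resp.raise_for_status()
for doc in resp.json()['response']['docs']:
    print(doc)
```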

What prompted us to launch Monthly Special?

At the time of the interface renewal in 2013, WARP had been steadily archiving websites, but web archiving itself was not yet well known in Japan. We wanted to attract the interest of more people, so we started publishing introductory articles on web archiving, such as Monthly Special, Mechanism of Web Archiving, and Web Archives of the World.

In particular, the Monthly Special features archived websites related to a topic of interest chosen by our staff or includes an article explaining WARP.

Currently this content is available only in Japanese.

In closing

Looking back over the 20 years since the start of the project, one can see that the number of targets and the size of the data archived have steadily increased.

We believe that the role of web archives will continue to grow in importance. We are committed to collecting and preserving websites on a regular basis as well as making them available to as many people as possible.

One year on: an update on the War in Ukraine CDG collection

By the lead curators: Anaïs Crinière-Boizet, Digital Curator (National Library of France), Kees Teszelszky, Curator Digital Collections (National Library of the Netherlands) & Vladimir Tybin, Head of Digital Legal Deposit (National Library of France).


This month, the IIPC Content Development Working Group (CDG) launched a new web crawl to archive web content related to the war in Ukraine, based on suggestions from curators, web archivists and members of the public worldwide. The aim of this effort is to map the impact of the conflict on digital history and culture on the web for future historians. The war is being fought on the battlefield, but it also takes place in cyberspace, and it has a tremendous influence on web culture and internet history.

We launched three crawls in 2022: the first on 20 July, the second in September and the third in October. Another crawl was launched on 16 March 2023. In this blog post, we describe what has been done so far in creating a transnational collection documenting this important historical event.

On 24 February 2022, the armed forces of the Russian Federation invaded the territory of Ukraine, annexing certain regions and cities and carrying out a series of military strikes throughout the country, thus triggering the first large-scale war in Europe since WWII. The war on the territory of Ukraine has different phases[1] which can be summed up as follows:

  • 0: Prelude of the war (up to 23 February 2022), when Russian troops were building up near the borders of Ukraine;
  • 1: Initial invasion (24 February – 7 April 2022), when Russian president Putin announced a ‘special military operation’ and Russian troops invaded Ukrainian territory;
  • 2: Southeastern front (8 April – 11 September 2022);
    • This is the phase during which we began archiving websites as part of the CDG collection.
  • 3: Ukrainian counter offensives (12 September – 9 November 2022);
  • 4: Second stalemate (10 November 2022 – present, March 2023)

Since February 2022, the clashes between the Russian military and the Ukrainian army and population have had an unprecedented impact on the situation in the region and on international relations. The aim of this collaborative project is to collect web content related to these events in order to map the impact of the conflict on digital history and culture. Identification of seed websites and initial web crawling began in July 2022. The archived websites have been preserved in a special web collection hosted on Archive-It, where most of the sites are already available to view. The collection will be expanded with new content as the conflict evolves and new developments occur.

The curators included high-priority subtopics in the call for nominations, such as: military confrontations; consequences of the war for the civilian population in Ukraine; the refugee crisis and international relief efforts in and outside Europe; political consequences; international relations; diaspora communities, such as Ukrainians around the world; human rights organizations; foreign embassies and diplomatic relations; sanctions imposed on Russia by foreign powers; consequences for energy and the agri-food trade; and public opinion, such as blogs, protest sites and the online writings of activists. Websites from countries all over the world and in all languages are in scope. Special attention has been devoted to websites that can serve as a source of internet culture, such as sites with internet memes.

Many institutions, as well as members of the public, responded to this call for contributions to document the conflict. No fewer than 1,137 proposals were received from members and 252 via the public nomination form, making 1,389 seeds in total. After cleaning up duplicates and invalid URLs, 1,358 seeds remained. All of these were crawled at least once between July 2022 and March 2023.

In March, we launched the fourth crawl of the War in Ukraine web collection on 1,060 seeds. 303 new seeds had been submitted between the last crawl in October and now, while 298 seeds have been deactivated since July 2022: pages that had not been updated since the last crawl or had gone offline. These “404 not found” errors also show why our collection work is important, as some sites have already disappeared. In total, 22 new jobs were launched. Of these, 19 crawls were done with the standard web crawler software and 3 with Brozzler.[2] Brozzler is a distributed web crawler that uses a real web browser (Chrome or Chromium) to fetch pages and embedded URLs and to extract links. It is especially valuable for its media capture capabilities. We had a total budget of 1 TB for the three crawls in 2022 and 500 GB for the fourth crawl in March 2023.

It is easy to see from the distribution of seeds by scope that this special web collection is the result of an event harvest. For most of the URLs, only one or two pages were selected for the crawl. These pages (mostly from news sites) contain important historic source information which may otherwise have been lost. Only 204 sites were selected to be fully crawled.

Figure: Distribution of seeds by scope

Looking at the distribution of sites by website type, it is noticeable that a large proportion of the sites are news sites, NGOs and government websites. The role of blogs in internet culture has diminished in recent years, as is also visible in this collection.[3] In contrast, NGO websites contain more and more information worth preserving for historians of the future, as they document their activities to their donors.

Figure: Distribution of sites by website type

We see a language shift in the distribution since the first crawl took place. Most of the sites selected during the first crawl were published in international languages such as English, French and German. Now we see more websites written in national languages, such as Ukrainian (122), Russian (31) and Belarusian (5). The impact of the conflict on the rest of Central, Eastern and Southern Europe around Ukraine can be seen in the collection of sites in Hungarian (45), Czech (44), Serbian (42) and Slovakian (23).

Figure: Distribution of sites by language

Heritage and culture are among the arenas most heavily touched by war. We have all seen the images of looted museums and libraries and of scattered books on the streets. As Erasmus of Rotterdam wrote in 1515: “If the laws are already silent amid the clash of arms, how much more are not the virgin muses silent when the world is full of noise, turmoil, confusion due to those raging storms?”[4] It is therefore perhaps a hopeful fact that no fewer than 14 websites have been selected that contain poetry from or about Ukraine.

In conclusion, it is worth recalling the value of this initiative, which aims to keep track of the very heterogeneous content disseminated on the web about this tragic event. We know that the living web is an extremely fragile publication space where content is ephemeral and often difficult to find some time after its publication; content can also disappear for technical reasons or be deleted by its producers. At a time when the web concentrates most of the output of the major media and the press, the reactions of the population, and those of institutional and non-governmental organizations, building a collection of web archives that is necessarily fragmentary and incomplete deserves to be carried out in order to provide primary sources for future historians of this conflict.


[1] https://en.wikipedia.org/wiki/Timeline_of_the_2022_Russian_invasion_of_Ukraine

[2] https://github.com/internetarchive/brozzler

[3] P. de Bode, I. Geldermans, & K. Teszelszky. (2021). Web collection NL-blogosfeer. Zenodo. https://doi.org/10.5281/zenodo.4593479

[4] Letter of Erasmus to Raffaele Riario, London, 15 May 1515. https://www.dbnl.org/tekst/eras001corr04_01/eras001corr04_01_0039.php

IIPC – Meet the Officers, 2023


The IIPC is governed by the Steering Committee, formed of representatives from fifteen Member Institutions who are each elected for three-year terms. The Steering Committee designates the Chair, Vice-Chair and Treasurer of the Consortium. Together with the Senior Program Officer, based at the Council on Library & Information Resources (CLIR), the Officers make up the Executive Board and are responsible for the day-to-day business of running the IIPC.

The Steering Committee has designated Youssef Eldakar of Bibliotheca Alexandrina to serve as Chair, and Jeffrey van der Hoeven of KB, National Library of the Netherlands to serve as Vice-Chair in 2023. Ian Cooke of the British Library will continue to serve as the IIPC Treasurer. Olga Holownia continues as Senior Programme Officer, Kelsey Socha-Bishop as Administrative Officer and CLIR remains the Consortium’s financial and administrative host.

The Members and the Steering Committee would like to thank Kristinn Sigurðsson of the National and University Library of Iceland and Abbie Grotke of the Library of Congress for leading the IIPC in 2021 and 2022.


IIPC CHAIR

Youssef Eldakar is Head of the International School of Information Science, a department of Information and Communication Technology at Bibliotheca Alexandrina (BA) in Egypt. Youssef entered the domain of web archiving as a software engineer in 2002, working with Brewster Kahle to deploy the reborn Library of Alexandria’s first web archiving computer cluster, a mirror of the Internet Archive’s collection at the time. In the years that followed, he went on to lead BA’s work in web archiving and has represented BA in the International Internet Preservation Consortium (IIPC) since 2011. Also at BA, he contributed to book digitization during the initial phase of the effort. In 2013, he was additionally assigned to take lead of the BA supercomputing service, providing a platform for High-Performance Computing (HPC) to researchers in diverse domains of science in Egypt, as well as regionally through European collaboration. At his present post, Youssef works to provide support to research through the technologies of parallel computing, big data, natural language processing, and visualization.

In the IIPC, Youssef has been the lead of Project LinkGate, started in 2020, for scalable temporal graph visualization, and he has more recently been working as part of a collaboration involving the Research Working Group and the Content Development Working Group to republish IIPC collections through alternative interfaces for researcher access. He has been a member of the Steering Committee since 2018 and has served as the lead of the Tools Development Portfolio.

IIPC VICE-CHAIR

Jeffrey van der Hoeven is head of the Digital Preservation department at the National Library of the Netherlands (KB). In this role he is responsible for defining the policies, strategies and organisational implementation of digital preservation at the library, with the goal of keeping the digital collections accessible to current users and generations to come. Jeffrey is also a director at the Open Preservation Foundation and a Steering Committee member at the IIPC. In previous roles, he has been involved in various national and international preservation projects, such as the European projects PLANETS, KEEP, PARSE.insight and APARSEN.

IIPC TREASURER


Ian Cooke leads the Contemporary British Publications team at the British Library, which is responsible for curation of 21st century publications from the UK and Ireland. This includes the curatorial team for the UK Web Archive, as well as digital maps, emerging formats and print and digital publications ranging from small press and artists books to the latest literary blockbusters. Ian joined the British Library’s Social Sciences team in 2007, having previously worked in academic and research libraries, taking up his current role in 2015. 

Ian has been a member of the IIPC Steering Committee and has worked on strategy development for the IIPC. The British Library was the host for the Programmes and Communications role up to April 2021.  

Building a Web-Archive Image Search Service at Arquivo.pt

By André Mourão, Senior Software Engineer, Arquivo.pt and Daniel Gomes, Head of Arquivo.pt


Arquivo.pt launched a service that enables search over 1.8 billion images archived from the web since the 1990s. Users can submit text queries and immediately receive a list of historical web-archived images through a web user interface or an API.

The goal was to develop a service that addressed the challenges raised by the inherent temporal properties of web-archived data, but at the same time provided a familiar look-and-feel to users of platforms such as Google Images.

Supporting image search using web archives raised new challenges: little research had been published on the subject, and the volume of data to be processed was large and heterogeneous, totalling over 530 TB of historical web data published since the early days of the Web.

The Arquivo.pt Image Search service has been running officially since March 2021 and is based on Apache Solr. All the software developed is available as open source, to be freely reused and improved.

Search images from the Past Web

The simplest way to access the search service is using the web interface. Users can, for example, search for GIF images published during the early days of the Web related to Christmas by defining the time span of the search.

Figure 1. Results from Advanced Image Search for GIF images archived between 6 August 1991 and 16 December 2005 (https://arquivo.pt/image/search?q=Christmas+type%3Agif&l=en&from=19910806&to=20051216).

There is also an Advanced Image Search interface available at https://arquivo.pt/advancedImages.jsp?l=en which allows users to do the following (a sketch of the corresponding query URL appears after the list):

  • search for terms
  • search for phrases
  • exclude certain words
  • limit search by dates
  • select the size of the images
  • select the file format of the images
  • enable/disable safe search filter
  • restrict the site where the image was found
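These options map onto query parameters of the search URL. The sketch below rebuilds the Figure 1 query; the parameter names (q, from, to, l) are taken from that URL, and type:gif is the file-format restriction embedded in the query string.

```python
from urllib.parse import urlencode

params = {
    'q': 'Christmas type:gif',  # search terms plus a file-format restriction
    'from': '19910806',         # start of the time span (YYYYMMDD)
    'to': '20051216',           # end of the time span (YYYYMMDD)
    'l': 'en',                  # interface language
}
print('https://arquivo.pt/image/search?' + urlencode(params))
```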

Figure 2. Details for an Image Search result.

Users can select a given result and consult metadata about the image (e.g. title, ALT text, original URL, resolution or media type) or about the web page that contained it (e.g. page title, original URL or crawl date). Quickly identifying the page that embedded the image enables the interpretation of its original context.

Figure 3. The web page that contained an image returned on the search results can be immediately visited by selecting the “Visit” button.

Automatic identification of Not Suitable For Work images

Arquivo.pt automatically performs broad crawls of web pages hosted under the .PT domain. Thus, some of the archived images may contain pornographic content that users do not want displayed by default, for instance while using Arquivo.pt in a classroom.

The Image Search service retrieves images based on the filename, the alternative text and the text surrounding an image on a web page. Images returned for a query may therefore include offensive content even for inoffensive queries, due to the prevalence of web spam.

Detecting NSFW (not suitable for work) content in archived web pages is challenging due to the scale (billions of images) and the diversity (small to very large images, graphics, colour photographs, among others) of the image content.

Currently, Arquivo.pt applies an NSFW image classifier trained with over 60 GB of images scraped from the web. Instead of labelling images as safe or not safe, the classifier returns the probability of an image belonging to each of five categories: drawing (SFW drawings), neutral (SFW photographic images), hentai (explicit drawings), porn (explicit photographic images) and sexy (potentially explicit images that are not pornographic, e.g. a woman in a bikini). The overall nsfw score is the sum of the hentai and porn probabilities.

By default, Arquivo.pt hides pornographic images from the search results if their NSFW score is higher than 0.5. This filter can be disabled by the user through the Advanced Image Search interface.
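In pseudocode terms, the default filter amounts to something like the sketch below, a simplified illustration only: the function name is hypothetical, the score keys follow the five categories named above, and the actual Arquivo.pt implementation differs.

```python
def is_displayed_by_default(scores, threshold=0.5, safe_search=True):
    """scores: category -> probability, as returned by the NSFW classifier.

    The nsfw score is the sum of the 'hentai' and 'porn' probabilities,
    as described above; images above the threshold are hidden unless the
    user disables safe search.
    """
    nsfw = scores.get('hentai', 0.0) + scores.get('porn', 0.0)
    return (not safe_search) or nsfw <= threshold

# Example: a borderline image is hidden (0.6 > 0.5)
print(is_displayed_by_default({'porn': 0.4, 'hentai': 0.2}))  # False
```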

Image Search API

Arquivo.pt developed a free and open Image Search API so that third-party software developers can integrate Arquivo.pt image search results into their applications and, for instance, apply for the annual Arquivo.pt Awards.

The ImageSearch API supports keyword-to-image search and provides access to preserved web content and related metadata. The API returns a JSON object containing the metadata elements that are also available through the “Details” button.
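A minimal client call might look like the sketch below. The endpoint path and parameter names are assumptions based on the documentation page linked in Figure 5; check the GitHub wiki for the authoritative list, which is why the code inspects the JSON rather than assuming a response schema.

```python
import requests

# Hypothetical parameters -- consult https://arquivo.pt/api/imagesearch
# for the authoritative endpoint and parameter names.
resp = requests.get('https://arquivo.pt/imagesearch',
                    params={'q': 'Christmas', 'maxItems': 5})
resp.raise_for_status()
data = resp.json()
print(list(data.keys()))  # inspect the top-level structure first
```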

Figure 4. All metadata about the image and its host web page is available through the “Details” button or the Image Search API.

Figure 5. GitHub Wiki page that documents the Arquivo.pt Image Search API (https://arquivo.pt/api/imagesearch).

Scientific and technical contributions

There are several services that enable image search over web collections (e.g. Google Images). However, the literature published about them is very limited, and even less research has been published about how to search images in web archives.

Moreover, supporting image search over the historical web data preserved by web archives raises new challenges that live-web search engines do not need to address, such as dealing with multiple versions of images and pages referenced by the same URLs, handling duplication of web-archived images over time, and ranking search results taking into account the temporal features of historical web data published over decades.

Developing and maintaining an image search engine over the Arquivo.pt web archive yielded scientific and technical contributions by addressing the following research questions:

  • How to extract relevant textual content in web pages that best describes images?
  • How to de-duplicate billions of archived images collected from the web over decades?
  • How to index and rank search results over web-archived images?

The main contributions of our work are:

  • A toolkit of algorithms that extract textual metadata to describe web-archived images
  • A system architecture and workflow to index large amounts of web-archived images considering their specific temporal features
  • A ranking algorithm to order image-search results

Learn more

2022 blog round-up

As we approach the end of 2022, we would like to thank our members and the general web archiving community for their support and engagement this year. Before we move forward into 2023, and return to an in-person General Assembly and Web Archiving Conference (for the first time since 2019!), we wanted to highlight some of this past year’s activities featured on our blog and to take this opportunity to thank all the contributors.

IIPC Governance

Thank you to the 2022 IIPC Chair, Vice-Chair and Treasurer for serving on the 2022 Executive Board. Thank you also to all the members who participated in the 2022 Steering Committee election. Many thanks to IIPC 2022 Chair Kristinn Sigurðsson for leading us through this past year, and reminding us that IIPC truly is an organization for all seasons.

Funded projects

2022 started off with a wrap-up of a project led by our Tools Development Portfolio and developed by Ilya Kreymer of Webrecorder. The goal of this project was to support migration from OpenWayback (a playback tool used by most of our members) to pywb by creating a Transition Guide.

This year also saw the launch of a new tools project “Browser-based crawling system for all.” Led by four IIPC members (the British Library, National Library of New Zealand, Royal Danish Library, and the University of North Texas), the Webrecorder-developed crawling system based on the Browsertrix Crawler is designed to allow curators to create, manage, and replay high-fidelity web archive crawls through an easy-to-use interface.

“Game Walkthroughs and Web Archiving” builds on research by Travis Reid, a PhD student at Old Dominion University (ODU), that looks at applying gaming concepts to the web archiving process. This collaboration between ODU and Los Alamos National Laboratory was supported by the IIPC through our Discretionary Funding Program (DFP).

Here’s a list of blog posts on the 2022 projects related to web archiving tools:

Collaborative Collections

IIPC also funds collaborative collections, which are curated and supported by volunteers from our community. While our Covid-19 collection continues, three new collections were initiated by the Content Development Working Group (CDG) in 2022. In the winter, Helena Byrne of the British Library encouraged everyone to archive the Beijing 2022 Olympic & Paralympic Winter Games, adding to a decade-long collaborative effort of archiving the Olympics and Paralympics. Archiving the War in Ukraine was our second collaborative collection for 2022. Co-curated by Kees Teszelszky of KB, National Library of the Netherlands, and Vladimir Tybin and Anaïs Crinière-Boizet of the BnF, the National Library of France, it offers a comprehensive international perspective on the war. We closed 2022 with a call for nominations (due 20 January 2023) for Web Archiving Street Art, co-led by Ricardo Basílio of Arquivo.pt and Miranda Siler of the Ivy Plus Libraries Confederation.

Thank you to Alex Thurman (Columbia University Libraries) and Nicola Bingham (the British Library) for serving as CDG co-chairs, overseeing all new and ongoing collaborative collections:

Researching web archives

We also published blog posts related to researching web archives on topics spanning from a toolset for researchers to archiving social media to analysing Covid-19 web archive collections.

Yves Maurer of the National Library of Luxembourg wrote about CDX-summarize, his toolset aimed at anyone interested in researching web archives that are not fully accessible. It offers a possible solution for providing a useful glimpse of “data that resides in-between the legal challenges of full access on the one hand and a textual description or rough single numbers on the other hand”.

Beatrice Cannelli, PhD candidate at the School of Advanced Study (University of London), summarised the results of an online survey mapping social media archiving initiatives, which is part of her research project “Archiving Social Media: a Comparative Study of the Practices, Obstacles, and Opportunities Related to the Development of Social Media Archives.”

We also published two blog posts by the AWAC2 (Analysing Web Archives of the COVID-19 Crisis) Team of researchers working with the IIPC Covid-19 collaborative collection by using ARCH (Archives Research Compute Hub), a new interface for web archive analysis created by the Archives Unleashed Project Team and the Internet Archive. AWAC2 is supported by the Archives Unleashed Cohort Program, which facilitates research engagement with web archives and the researchers are members of the WARCnet (Web ARChive studies network researching web domains and events) Working Group 2, focusing on analysing transnational events.

Covid-19 web archived content is also at the core of the Archive of Tomorrow (AoT) project that aims to explore and preserve online information and misinformation about health and the pandemic. Introduced earlier this year by Alice Austin (Centre for Research Collections, University of Edinburgh), AoT will form a ‘Talking about Health’ collection within the UK Web Archive. Cui Cui, PhD candidate at the University of Sheffield and also an AoT web archivist, shared her process of working with the ‘Talking about Health’ collection, using faceted 4D modelling to reconstruct web space in web archives.

Here are the 2022 blog posts on researching web archives:

Last but not least, we would also like to give a shoutout to the brilliant Web Archiving Team at the Library of Congress who worked with us on the online GA and WAC 2022 and took us down memory lane in Remembering Past Web Archiving Events With Library of Congress Staff.

Many thanks to everyone who has contributed to our blog and helped us promote it through their newsletters and social media posts and, of course, thank you to all our readers around the world. We look forward to showcasing your web archiving activities in the new year!

Studying Women and the COVID-19 Crisis through the IIPC Coronavirus Collection  

AWAC2 (Analysing Web Archives of the COVID-19 Crisis) is a project developed by the members of WARCnet (Web ARChive studies network researching web domains and events) Working Group 2 that focuses on analysing transnational events. This is one of the first research projects using an IIPC collaborative collection and ARCH (Archives Research Compute Hub), a new interface for web archive analysis created by the Archives Unleashed Project Team and the Internet Archive.


By the AWAC2 (Analysing Web Archives of the COVID-19 Crisis) team: Susan Aasman (University of Groningen, The Netherlands), Niels Brügger (Aarhus University, Denmark), Frédéric Clavert (University of Luxembourg, Luxembourg), Karin de Wild (Leiden University, The Netherlands), Sophie Gebeil (Aix-Marseille University, France), Valérie Schafer (University of Luxembourg, Luxembourg), Joshgun Sirajzade (University of Luxembourg, Luxembourg)

A year ago, a post on this very blog (“Analysing Web Archives of the COVID-19 Crisis through the IIPC collaborative collection: early findings and further research questions,” 2 November 2021) invited the IIPC community to vote for a more specific topic that the AWAC2 team could analyse within the vast IIPC collection devoted to the coronavirus, whose metadata and textual content were made available in the framework of a partnership.

After a first phase of collaboration around this prolific, multilingual and international corpus, in direct cooperation with the Archives Unleashed team, which gave us access to this abundant collection via ARCH and other tools they have developed, it seemed interesting to us to go beyond a global analysis of the corpus and test the feasibility of more specific studies on a precise topic. Following the vote, the selected theme, “Women, gender and COVID”, was the subject of several online and on-site meetings of the AWAC2 team, including an internal datathon in March 2022 at the University of Luxembourg (Figure 1).

Figure 1– A datathon as a test bed and the tentative design of a workflow

The purpose of this blog post is to review some of the methodological elements learned during the exploration of this corpus.

Retrievability is a real challenge

The first salient point concerns the amount of data: already considerable at the global level of the corpus, it remains substantial even for research focused specifically on women. Above all, data mining and corpus creation are complicated by multilingualism (see table 7 of our previous blog post), and by the fact that a search for the term “woman” is not sufficient to create a satisfactory corpus (a woman may be referred to as a mother in the case of home-working, or as a feminist in the case of activism and the fight against domestic violence, etc.).

The multidisciplinary team also had to define research priorities given the challenges of these massive corpora. Indeed, once the sub-corpora are constituted, the analysis is still far from beginning: they are full of noise, especially when it comes to news sites, where the terms COVID and pregnancy or feminism may appear close together in newsfeeds without any real thematic correlation (Figure 2). There are also many duplicates, and it must be determined whether or not they inform the study. Such a large amount of data also raises the question of research-driven versus data-driven approaches.

Figure 2 – Entry line 7867: https://flipboard.com/topic/women. The newsfeed mentions the COVID crisis as well as the MeToo movement but the news is unrelated, as visible on top of the capture when accessing full text.

In addition to the technical difficulties, there are also contextual ones. The data must be put into national context from a qualitative point of view if it is to be analysed properly. For example, lockdowns and school closures varied from country to country, school organisation differs widely around the world, and so do the legislative frameworks for working during lockdowns.

Topic modeling as a field of investigation

The AWAC2 team shared a strong interest in assessing the presence, retrievability and asymmetries related to gender and COVID, with some colleagues especially interested in transnational and gender studies, as well as in reflecting on invisibility and inclusiveness, while other colleagues were more specifically interested in the computational and topic modeling part.

This second aspect has given rise to interesting developments, as three major algorithms were applied to enable more sophisticated and semantic search of the corpus: Latent Dirichlet Allocation (LDA), Word2vec and Doc2vec.

LDA is an extension of Probabilistic Latent Semantic Analysis (PLSA), which is a probabilistic formulation of Latent Semantic Analysis (LSA). LSA is a dimensionality-reduction technique in which the documents in a corpus (in our case, web pages) are compressed into a very small number of synthetic documents that can be read by a human. These compressed documents are called topics. In essence, they carry the words that are shared by many documents and that probabilistically tend to occur together. In our experiment, we not only identified topics containing keywords related to the situation of women, but also looked at how these topics are distributed across the web pages (Figure 3).

AWAC2-Figure03
Figure 3 – Topics identified through LDA over the whole dataset (Covid-19 special collection) and their distribution through time

A few examples of topics are:

  • topics202002.txt:46 0.05 video news show man years police star film death family week weinstein comments day love stars top women fashion black
  • topics202002.txt:69 0.05 shop view accessories gifts price sale products delivery add cart free mens gift shoes brands bags womens clothing hair home
  • topics202003.txt:4 0.05 health children mental kids anxiety child family tips healthy parents social coronavirus stress find support time home women life news
  • topics202004.txt:53 0.05 gender development health policy working european countries international women economic equality work regional employment global world minnesota environment content overview
  • topics202005.txt:83 0.05 study risk patients years people blood disease
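
Gensim’s LdaModel is one common way to run such an analysis. The sketch below is purely illustrative and is not the team’s actual pipeline: the toy documents, parameter values, and variable names are all assumptions standing in for the extracted page texts.

```python
from gensim import corpora
from gensim.models import LdaModel

# `docs` is assumed: one string of extracted text per archived web page.
docs = ["health children mental anxiety parents coronavirus stress",
        "gender development policy women economic equality work"]
texts = [doc.lower().split() for doc in docs]

# Map tokens to integer ids and build bag-of-words vectors.
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(text) for text in texts]

# Train LDA; num_topics and passes are illustrative values only.
lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)

# Top words per topic, comparable to the topics###.txt lines above.
for topic_id, words in lda.print_topics(num_words=10):
    print(topic_id, words)

# Distribution of topics over one page, the basis for time series as in Figure 3.
print(lda.get_document_topics(bow[0]))
```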

Word2vec and Doc2vec, in turn, are further developments of the previous algorithms. Under the hood they not only use newer techniques such as logistic regression (also called a shallow network), but also allow more flexible usage. Word2vec provides a dense vector for every word and is in that respect very similar to LSA. However, Word2vec builds its vectors from a so-called window, the neighbourhood of a word (for example, five words to its left and right), so it operates more on a syntactic level, whereas LSA is based purely on a document-term matrix. With Word2vec it is not only possible to find words semantically related to a search term, but also to combine the vectors of all the words in a document; similar documents can then be found. This goes beyond the so-called bag-of-words approach, to which all the previous algorithms belong, because word order can also be taken into account. From an implementation point of view, this can be done with an additional algorithm such as a Long Short-Term Memory (LSTM) network, or a ready-to-use version can be used with Doc2vec.

In our experiment, we trained a Word2vec model on our corpus, which enabled us to find keywords related to women or feminism. We then searched for where these keywords occur. In doing so, we not only investigated the situation of women in the pandemic once more, but also compared the results with those given by LDA. This helps us to analyse the behaviour, complementarity and efficiency of the algorithms, and also ensures that the search covers, or mines, our corpus in as much detail as possible (Figures 4a and 4b).

AWAC2-Figure04a
Figure 4a – Time series of a topic related to women and children

AWAC2-Figure04b
Figure 4b – Top 20 domains for this selected topic
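
The team’s own code is not shown in this post; the following is a minimal gensim sketch of the training step described above, with toy tokenised sentences standing in for the extracted page texts.

```python
from gensim.models import Word2Vec

# Toy tokenised pages; in practice these are the extracted corpus texts.
sentences = [
    ["covid", "lockdown", "women", "home", "schooling", "children"],
    ["feminism", "activism", "women", "violence", "pandemic"],
]

# window=5 mirrors the five-words-left-and-right neighbourhood described above.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Nearest neighbours of a seed term suggest additional query keywords.
print(model.wv.most_similar("women", topn=5))
```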

What’s next?

After a year of collaboration, this research is far from complete. We warmly thank the Archives Unleashed team for their technical and scientific expertise, as well as the IIPC, which provided an unprecedented corpus capable of stimulating a multitude of research projects, whether thematic or oriented towards computer science and the digital humanities. An article is currently being prepared on the second topic, “Deep Mining in Web Archives,” while a more general, SSH-oriented chapter is being drafted for the final collective book of the WARCnet project. The team will also be pleased to present results at the next IIPC Web Archiving Conference in 2023 and thus continue the dialogue with you around the collection.

Get Involved in Web Archiving Street Art

By CDG Street Art Collection Co-Leads Ricardo Basílio, Web curator, Arquivo.pt & Miranda Siler, Web Collection Librarian, Ivy Plus Libraries Confederation


Street art is ephemeral, and so are the websites and web channels that document it. For this reason, the IIPC’s Content Development Working Group is taking up the challenge of preserving web content related to street art. Some institutions already do this locally, but a representative web collection of street art with a global scope is lacking.

Street art can be found all over the world and reflects social, political and cultural attitudes. The Web has become the primary means of disseminating these works beyond the street itself. We are therefore asking for nominations of web content from different parts of the globe, so that it can be preserved and serve study and research in the future.

image002
Mural. Author: Douglas Pereira (Bicicleta Sem Freio). Title: The Observatory. WOOL, Covilhã Urban Art, 2019 (Portugal). Photo credit: Ricardo Basílio.

What we want to collect

image001
Stencil. Author: Adres, WOOL, Covilhã Urban Art, 2017 (Portugal). Photo credit: Ricardo Basílio.

This collaborative collection aims to gather web content related to street art as a social, political and cultural manifestation that can be found all over the world.

The types of street art covered by this collection include but are not limited to:

  • Mural art
  • Graffiti
  • Stencil art
  • Fly-posting (gluing posters)
  • Stickering
  • Yarn-bombing
  • Mosaic

The collection will also include a number of different types of websites.

The list is not exhaustive and it is expected that contributing partners may wish to explore other sub-topics within their own areas of interest and expertise, providing that they are within the general collection development scope.

Out of scope

The following types of content are out of scope for the collection:

  • For data budget reasons, websites heavy with audio/video content, such as YouTube, will be deprioritised.
  • Social media (e.g. Facebook, YouTube channels, Instagram, TikTok), which is labour-intensive to archive and unlikely to be archived successfully.
  • Content which is in the form of a private members’ forum, intranet or email (non-published material).
  • Content which may identify or dox street artists who wish to remain anonymous or known only by their tagger name.
  • Artist websites where the artist works primarily in mediums other than street art.

Media websites (TV/radio and online newspapers) will be selected in moderation, as this type of content is generally archived elsewhere, although nominations at the level of individual news articles documenting specific debates around street art (as opposed to media landing or splash pages) may be considered. Independent news sources devoted specifically to street art are welcome.

How to get involved

Once you have looked over the collection scope document and selected the web pages that you would like to see in the collection, it takes less than 2 minutes to fill in the submission form:

https://bit.ly/CDG-street-art-public-nominations

For the first crawl, the call for nominations will close on January 20, 2023. 

For more information and updates, you can contact the IIPC Content Development Working Group team at Collaborative-collections@iipc.simplelists.com or follow the collection hashtag on Twitter at #iipcCDG.

Resources

About IIPC collaborative collections

IIPC CDG updates on the IIPC Blog

Game Walkthroughs and Web Archiving Project Update: Adding Replay Sessions, Performance Results Summary, and Web Archiving Tournament Option to Web Archiving Livestreams

“Game Walkthroughs and Web Archiving” was awarded a grant in the 2021-2022 round of the Discretionary Funding Programme (DFP), the aim of which is to support the collaborative activities of the IIPC members by providing funding to accelerate the preservation and accessibility of the web. The project lead is Michael L. Nelson from the Department of Computer Science at Old Dominion University. Los Alamos National Laboratory Research Library is a project partner. You can learn more about this DFP-funded project at our Dec. 14th IIPC RSS Webinar: Game Walkthroughs and Web Archiving, where Travis Reid will be presenting on his research in greater detail. 


By Travis Reid, Ph.D. student at Old Dominion University (ODU), Michael L. Nelson, Professor in the Computer Science Department at ODU, and Michele C. Weigle, Professor in the Computer Science Department at ODU

The Game Walkthroughs and Web Archiving project focuses on integrating video games with web archiving and applying gaming concepts like speedruns to the web archiving process. We have recently updated the project by adding a replay mode and a results mode, and by making it possible to hold a web archiving tournament during a livestream.

Replay Mode

Replay mode (Figure 1) is used to show the web pages that were archived during the web archiving livestream and to compare them to the live web page. During replay mode, the live web page is shown beside the archived web pages associated with each crawler. The web archiving livestream script scrolls the live and archived web pages together so that viewers can see the differences between the live web page and the recently archived copies. In the future, when the web archiving livestream supports WARC files from crawls that were not performed recently, we will compare the archived web pages from the WARC file against a memento from a web archive like the Wayback Machine or Arquivo.pt, instead of against the live web page. For replay mode, we are currently using Webrecorder’s ReplayWeb.page.

Replay_Mode
Figure 1: Replay mode
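
The livestream script itself is not published in this post. Purely as an illustration of the synchronised-scrolling idea, here is a minimal Selenium sketch; both URLs and the scroll parameters are placeholders, not the project’s actual configuration.

```python
import time
from selenium import webdriver

# Placeholder URLs: a live page and a replayed copy of it.
LIVE_URL = "https://example.com/"
REPLAY_URL = "http://localhost:8080/replay/https://example.com/"

# One browser window per page, shown side by side on the stream.
drivers = [webdriver.Chrome(), webdriver.Chrome()]
for driver, url in zip(drivers, (LIVE_URL, REPLAY_URL)):
    driver.get(url)

# Scroll both windows in lockstep so differences stay visually aligned.
for _ in range(20):
    for driver in drivers:
        driver.execute_script("window.scrollBy(0, 400);")
    time.sleep(0.5)  # pacing so viewers can follow the comparison

for driver in drivers:
    driver.quit()
```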

Replay mode will include an option for viewing annotations created by the web archiving livestream script for detected missing resources (Figure 2). The annotation option makes the web archiving livestream more like a human-driven livestream, where the streamer would mention potential reasons why a certain embedded resource is not replaying properly. It also lets replay mode show more than just scrolling web pages, by providing information about the page elements associated with missing embedded resources. There will also be an option to print an output file containing the annotation information created by the script. For each detected missing resource, this file will include the URI-R of the missing resource, the HTTP response status code, the element associated with the resource, and the HTML attribute from which the resource’s URI-R was extracted.

Annotation_option
Figure 2: During replay sessions, there will be an option for automated annotation for the missing resources found on the web page
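
To illustrate the kind of information the output file would carry, here is a hedged sketch, not the project’s code: it walks a page’s embedded resources and records those that return 404. The element/attribute pairs, the replay URL, and the function name are all assumptions.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# (element, attribute) pairs whose URLs count as embedded resources here;
# the real script may track a different set.
RESOURCE_ATTRS = [("img", "src"), ("script", "src"), ("link", "href")]

def annotate_missing(page_url):
    """Yield (URI-R, status, element, attribute) for resources that 404."""
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for element, attr in RESOURCE_ATTRS:
        for tag in soup.find_all(element):
            uri = tag.get(attr)
            if not uri or uri.startswith("data:"):
                continue
            uri = urljoin(page_url, uri)
            status = requests.head(uri, timeout=10,
                                   allow_redirects=True).status_code
            if status == 404:
                yield uri, status, element, attr

# Placeholder replay URL for demonstration.
for record in annotate_missing("http://localhost:8080/replay/https://example.com/"):
    print(record)
```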

Results Mode

We have added a results mode (Figure 3) to the web archiving livestream so that viewers can see a summary of the web archiving and replay performance results. This mode is also used to compute the score for each crawler, which determines the winner of the current round based on archiving and replay performance. The performance metrics used during results mode are retrieved from the performance result file generated after the web archiving and replay sessions. Currently, this file includes the number of web pages archived by the crawler during the competition (the number of seed URIs), the crawler’s speedrun completion time, the number of resources in the CDXJ file with an HTTP response status code of 404, the number of archived resources categorized by file type (e.g., HTML, image, video, audio, CSS, JavaScript, JSON, XML, PDF, and fonts), and the number of missing resources categorized by file type. The metrics we currently use for determining missing and archived resources are temporary and will be replaced with a replay performance metric calculated by the Memento Damage service. The temporary metrics are calculated by going through a CDXJ file and counting the resources with a 200 status code (archived) and those with a 404 status code (missing). Results mode will let viewers access the performance results file for the round via a link or QR code pointing to a web page that dynamically generates the results for the current round and allows the file to be downloaded. That web page will also have a button that navigates to the video timestamp URL for the start of the round, so that viewers who joined the livestream late can go back and watch the archiving and replay sessions.

results_mode_SS
Figure 3: Results mode, where the first performance metric is shown which is the speedrun time
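
To make the temporary scoring concrete, here is a minimal sketch of counting 200s and 404s in a CDXJ file. It is not the project’s actual code: the file name and the MIME-to-category mapping are placeholders.

```python
import json
from collections import Counter

# Rough MIME-to-category mapping; the livestream's categories are richer.
def category(mime):
    for key in ("html", "image", "video", "audio", "css", "javascript",
                "json", "xml", "pdf", "font"):
        if key in (mime or "").lower():
            return key
    return "other"

archived, missing = Counter(), Counter()
with open("crawl.cdxj") as cdxj:  # placeholder file name
    for line in cdxj:
        # A CDXJ line is "<SURT key> <timestamp> <JSON block>".
        record = json.loads(line.split(" ", 2)[2])
        cat = category(record.get("mime", ""))
        if record.get("status") == "200":
            archived[cat] += 1   # counted as archived
        elif record.get("status") == "404":
            missing[cat] += 1    # counted as missing

print("archived:", dict(archived))
print("missing:", dict(missing))
```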

Web Archiving Tournaments

A concept that we recently applied to our web archiving livestreams is the web archiving tournament: a competition between four or more crawlers. The tournaments are currently single elimination, similar to the NFL, NCAA College Basketball, and MLS Cup playoffs, where a team that loses a single game is eliminated. Figure 4 shows an example of teams progressing through our tournament bracket. In each match, two crawlers compete against each other: both are given the same set of URIs to archive, and the set of URIs differs from match to match. Viewers can watch the web archiving and replay sessions for each match. After the replay session, viewers see a summary of the web archiving and replay performance results and how each crawler’s score is computed. The crawler with the highest score wins the match and progresses further in the tournament; a crawler that loses a match cannot compete in any future matches of the current tournament. The winner of the tournament is the crawler that has won every match it participated in. In the future we will support other formats, such as double elimination tournaments where teams can lose more than once, round robin tournaments where teams play each other an equal number of times, or combinations like the FIFA World Cup, which uses round robin for the group stage and single elimination for the knockout phase.

Progressing_Through_Tournament_Bracket
Figure 4: Example of teams progressing through our tournament bracket (in this example, the scores are randomly generated)
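
For illustration only, a minimal single-elimination loop might look like the sketch below. The crawler names are arbitrary examples rather than the project’s line-up, and the match outcome is randomised, as in Figure 4.

```python
import random

def play_match(crawler_a, crawler_b):
    """Placeholder for a match: the real winner is decided by the computed
    archiving/replay score; here it is randomised, as in Figure 4."""
    return random.choice([crawler_a, crawler_b])

def single_elimination(crawlers):
    """Run pairwise rounds until one crawler remains (entrant count is
    assumed to be a power of two, e.g. four or eight crawlers)."""
    bracket = list(crawlers)
    while len(bracket) > 1:
        bracket = [play_match(bracket[i], bracket[i + 1])
                   for i in range(0, len(bracket), 2)]
    return bracket[0]

# Crawler names are arbitrary examples, not the project's line-up.
print(single_elimination(["Heritrix", "Brozzler", "Browsertrix", "wget"]))
```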

Future work

We will apply more gaming concepts to our web archiving livestreams, such as tag-team matches and a single-player mode. In a tag-team match, multiple crawlers on the same team would work together to archive a set of URIs. In single-player mode, the streamer or viewers could select one crawler to use when playing a level.

We are accepting suggestions for video games to integrate with our web archiving livestreams and show during our gaming livestreams. The game must have a mode in which we can watch automated gameplay using bots (computer players), and it must offer bot customization that can improve a bot’s skill level, stats, or abilities. Call of Duty: Vanguard is an example of a suitable game: in a custom match, the skill level can be changed individually for each bot, and we can change the number of players on each team (Figure 5). The game also has other team customization options (Figure 6) that are recommended, but not required, for games used during our gaming livestream, such as changing a team’s name and colors. Call of Duty: Vanguard also has a spectator mode named CoDCaster (Figure 7) in which we can watch a match between the bots.

Player_Customization_Annotated
Figure 5: Player customization must allow bots with skill levels that can be changed individually or have abilities that can give a bot an advantage over other bots

Ideal_Team_Customization_Settings_Annotated
Figure 6: Example of team customization that is preferred for team based games used during gaming livestreams, but is optional
spectator_mode
Figure 7: The game must have a spectator option so that we can watch the automated gameplay

An example of a game that will not be used during our gaming livestream is Rocket League. When creating a custom match in Rocket League, it is not possible to give one bot better stats or skills than the other bots in the match: the bot skill level applies to all bots and cannot be set individually (Figure 8).

Rocket_League_Bot_Difficulty
Figure 8: Rocket League will not be used during our automated gaming livestreams, because their “Bot Difficulty” setting applies the same skill level to all bots

A single-player game like Pac-Man also cannot be played during our automated gaming livestream, because a human player is needed to play the game (Figure 9). If there is a game that you would like to see during our gaming livestream, where we can spectate the gameplay of computer players, you can use this Google Form to suggest it.

Pacman_Reduced
Figure 9: Single player games that require a human player to play the game like Pac-Man cannot be used during our automated gaming livestream

Summary

Our recent updates to the web archiving livestreams add a replay mode, a results mode, and an option for holding a web archiving tournament. Replay mode lets viewers watch the replay of the web pages that were archived during the livestream. Results mode shows a summary of the web archiving and replay performance results measured during the livestream, along with the match scores for the crawlers. The web archiving tournament option allows us to run a competition between four or more web archive crawlers and determine which crawler performed best during the livestream.

If you have any questions or feedback, you can email Travis Reid at treid003@odu.edu.