IIPC Steering Committee Election 2024: Nomination Statements and Results

The call for nominations for the 2024 Steering Committee elections has closed. We had four vacant seats and received three nominations, so an election process is not required this year. We would like to congratulate and thank the British Library, the National Library of France, and the National Library of the Netherlands, who will be continuing for another term.

Please find the statements of all the nominees below. The new, three-year term starts on 1 January 2025.


Nomination statements:

Bibliothèque nationale de France | National Library of France

The National Library of France (BnF) started its web archiving programme in the early 2000s and now holds an archive of more than 2.2 petabytes. We develop national strategies for the growth and outreach of web archives and host several academic projects in our DataLab. We use and share our expertise on key tools for IIPC members (Heritrix 3, NetarchiveSuite, OpenWayback, SolrWayback, webarchive-discovery) and contribute to the development of several of them.

As one of the founding members of the IIPC, we have always actively contributed to the GA & WAC meetings, workshops and most of the working groups, and we remain committed to the development of a strong community sharing knowledge and practices. Recently, the BnF has been particularly active within the CDG, leading a collaborative collection on the War in Ukraine, participating in the work of various groups, and also hosting the 2024 GA & WAC in Paris. BnF is currently co-leading the Membership Engagement Portfolio. By drafting and implementing the main thrusts of the new Strategic Plan and Consortium Agreement, our participation in the steering committee will be focused on making web archiving a thriving community, engaging researchers in the study of web archives, and developing harvest and access strategies.

The British Library

This year, the British Library has had good cause to reflect on the importance of an international community in supporting resilience as well as development of capability in web archiving. The British Library became a founding member of the IIPC because we recognised the need to work collaboratively and share knowledge and experience in a field that was technologically challenging and rapidly changing. That remains the case today, as the technology of the web, web archiving tools and researcher needs have advanced.

The next 3 years will be a period of rapid change for the UK Web Archive as we restore our service following last year’s cyber attack. We remain engaged in how the community can support the development of tools on which we depend. Collaborative collecting remains a key part of how we work with the IIPC. We are excited by new research, including support for use of the archived web as data. We are a member of the Steering Committee and, as Treasurer for 2022 – 2023, gained a good understanding of how the organisation operates. As a member of the Steering Committee we would additionally take a role in the strategic direction of the IIPC.

Koninklijke Bibliotheek | National Library of the Netherlands

We believe the IIPC is an important network organization which brings together ideas, knowledge and best practices on how to preserve the web and retain access to its information in all its diversity. In recent years, KB National Library of the Netherlands (KBNL) has taken an active role in the IIPC: we co-hosted the 2023 Web Archiving Conference and our representatives have served in various leadership roles, including as portfolio leads, IIPC vice-chair (2023) and chair (2024).

We would like to continue our work and bring together more organizations, large and small across the world, to learn from each other and ensure web content remains findable, accessible and re-usable for generations to come. Our main focus will be to support the IIPC in reshaping the Consortium Agreement and developing the new strategic plan, taking input from our members and the wider web archiving community.

As a national library, our work is fueled by the power of the written word. It preserves stories, essays, and ideas, both printed and digital. When people come into contact with these words, whether through reading, studying, or conducting research, they impact their lives. With this perspective in mind, we find it vital to preserve web content for future generations.

WAC 2024 Student Travel Report

By Kayla Martin-Gant, IIPC Administrative Officer


The IIPC Web Archiving Conference (WAC) 2024 was held April 24-26 at the National Library of France (BnF) in Paris. Co-organized by the IIPC and the BnF in partnership with the French National Audiovisual Institute (INA), the conference brought together over two hundred members of the web archiving community from all over the world. Below are insights and experiences of a few of the attendees who received student bursaries from the IIPC, collected from their submitted travel reports.

Student Bursary Experience

Each year, the IIPC awards bursaries to up to ten applicants in good academic standing to cover their registration costs. We had nine student bursary recipients this year, four of whom also attended as presenters at the conference. While many were local students of the National School of Charters, others hailed from Belgium, Portugal, England, and the United States. Additionally, five of the twenty-four mentees in this year’s mentoring program – a fairly new but much appreciated element of the conference – were student bursary recipients.

Jonas Melo, a student of the University of Porto’s Information and Communication in Digital Platforms program, was a familiar face at WAC, having attended past conferences. As both an attendee and a speaker at this year’s conference, Melo expressed his gratitude for the assistance. “The student bursary program was a fantastic initiative by the IIPC, providing financial support that made it possible for students like me to attend the conference,” he says, noting the ease of the application process and that “the support from the IIPC team was exceptional. I am grateful for this opportunity and hope that the program continues to support future students.” The other recipients echoed his sentiments in their own reports.

Program Favorites

The conference featured a variety of presentations, short talks, panels, posters, and workshops on a range of diverse topics. Attendees had the challenging task of choosing between parallel tracks, each offering valuable insights and innovations. Though they found value in all the sessions they were able to attend, the bursary recipients made note of those they enjoyed the most.

Melo points to the very first session as a standout for him, saying it “sparked many ideas on how we can leverage AI to improve the efficiency and accuracy of our archiving efforts.” He went on to add that the second panel of the conference, Archiving Social Media in an Age of APIcalypse, was “particularly relevant as it underscored the importance of balancing technological advancements with ethical responsibilities.”

Archiving Social Media In An Age of APIcalypse
From the left: Anat Ben-David, Benjamin Ooghe-Tabanou, Frédéric Clavert, Beatrice Cannelli, and Jerôme Thièvre
Photo credit: Olga Holownia / IIPC

Beatrice Cannelli, a PhD student at the University of London School of Advanced Study and a panelist on Archiving Social Media in an Age of APIcalypse, had a tough time deciding between the simultaneous sessions. “I seriously wished I had a time-turner at some points!” she exclaims, describing the program as “incredibly rich” and praising the attention to legality and ethics as well as “archiving diverse communities while ensuring inclusivity.” She also found the Unusual Content session to be particularly engaging, especially Christopher Rauch’s Saving Ads: Assessing and Improving Web Archives’ Holdings of Online Advertisements and Valérie Schafer’s Put it Back! Archived Memes in Context.

Fellow attendee Lizzy Zarate, a student of New York University’s Archives and Public History program, agrees with Melo on the quality of both the second panel, which she describes as “an examination of the legal, ethical, and technical issues relating to the regulation of API access by tech platforms that are not incentivized to act in the public interest,” and the Artificial Intelligence & Machine Learning session. “As somebody who primarily performs quality assurance checks on archived websites, I was interested in learning about attempts to automate this process and other facets of web archiving using machine learning and artificial intelligence,” she explains.

She points to Benjamin Lee’s work with the End of Term Archive as “an interesting exploration of how preserving materials like PDFs can be accomplished using machine learning. Projects such as these seem particularly important for government accountability as well as potential uses for curation.” She goes on to add that, aside from AI, Alex Dempsey’s lightning talk on the Internet Archive’s deduplication work was “an introduction to a topic that I had never encountered in my work, and I am excited to track how IA continues to address this issue in the future.”

With this year’s conference, I confirmed my affinity for web archives, both as the object of my studies and as a field I would like to work in my future archivist career. I met many engaging professionals, ready to have a small talk around a coffee.

– Alice Guérin

The opening keynote panel, Here Ya Free! Crossed Views on Skyblog, the French Pioneer of Digital Social Networks, was mentioned as a favorite by nearly all of the student bursary recipients.

“After almost 20 years of providing users with a personal digital space, enabling them to connect with other users sharing the same interests, the platform announced its closure in 2023,” explains Cannelli, whose doctoral research focuses on the strategies employed by archiving initiatives in the preservation of social media platforms. “The BnF and INA – as France’s electronic deposit institutions – coordinated an emergency capture to preserve billions of URLs.”

The panel featured Pierre Bellanger, founder and CEO of Skyrock Radio; freelance journalist and former Skyblog user Pauline Ferrari; and Web Archiving Technical Leads Jerôme Thièvre of INA and Sara Aubry of the BnF. It was moderated by Emmanuelle Bermès, Educational Manager of the Digital Technologies Applied to History master’s program at the National School of Charters.

Opening Keynote Panel Here Ya Free! Crossed Views on Skyblog, the French Pioneer of Digital Social Networks
From the left: Pierre Bellanger, Pauline Ferrari, Emmanuelle Bermès, Sara Aubry, and Jerôme Thièvre
Photo credit: Nola N’Diaye / BnF

“This mix of voices underscored the important role that such platforms play in our daily lives and the vital function performed by web archiving institutions in ensuring the long-term preservation of such content even beyond the platforms’ lifespan,” says Cannelli. Zarate, who regularly works with student blogs and engages with university students in her research, says the panel “helped illuminate the value and challenges of preserving materials created by young people on relatively unregulated platforms.”

Alice Guérin, who is pursuing a master’s degree in Digital Technologies Applied to History at the National School of Charters under Bermès, cited her current thesis on the history of the Skyblog platform as the reason the keynote panel drew her so strongly. She also notes that the entirety of the Digital Preservation session, as well as Niels Brügger’s presentation on web history, The Form Of Websites: Studying The Formal Development Of Websites, The Case Of Professional Danish Football Clubs, “offered very interesting perspectives for researchers.”

Personal Insights and Takeaways

Despite the packed program, the conference provided ample time to mingle, from casual chats during session breaks to purposefully engineered networking opportunities. Attendees appreciated the chance to engage with such a diverse cross-section of the web archiving community.

As a student bursary recipient, I found the conference to be an invaluable learning experience. The sessions were not only informative but also thought-provoking, encouraging us to think critically about the future of web archiving. I appreciated the opportunity to engage with experts in the field and to gain insights that will undoubtedly shape my future research and career.

– Jonas Melo

“Something that really surprised me was the wide variety of disciplines represented across the conference,” notes Zarate, who learned that her future master’s degree in Archives is an uncommon one in many other countries. At the conference, she says, she was able to meet “archivists, librarians, and computer programmers from around the world…Hearing about the sheer number of different ongoing projects expanded my view of what I had previously thought was possible within web archives.”

Guérin had the opportunity to participate in the Early Scholars Spring School on Web Archives organized by Emmanuelle Bermès (National School of Charters, PSL University of Paris) and Valérie Schafer (Luxembourg Centre for Contemporary and Digital History, C2DH; Internet and Society Center, CNRS). While this did mean she was unable to attend any of the pre-WAC workshops on April 24th, the Spring School gave her a chance to prepare for the intensity of the conference and to have familiar faces to look for at the conference proper. She agrees that the diversity of both the careers and experiences of the attendees was a key component in the enrichment of their discussions, adding that this year’s mentoring program provided her and her fellow participants with “valuable insight on their career prospects and research subjects.”

Networking break in the Grand Auditorium Foyer of the BnF’s François Mitterrand site.
Photo credit: Olga Holownia / IIPC

Conclusion

Overall, the bursary recipients found immeasurable value in the 2024 Web Archiving Conference, leaving with a wealth of gained knowledge, new connections, and a renewed sense of purpose in their web archiving careers.

“This conference was instrumental in shaping my understanding of web archival work, and I hope to use this knowledge as I prepare to begin my career as an archivist.”

– Lizzy Zarate

“Although some panels were too technical for my understanding, I can’t wait to have more experience in the field to understand its subtleties,” says Guérin, emphasizing that the conference experience confirmed her desire for a future career in web archiving. Melo agrees that the interactions with his fellow attendees “were instrumental in expanding my understanding of the global web archiving community,” and that he hopes that the connections he formed will lead to future collaborations.

“I left the conference with so much food for thought, and I am looking forward to the 2025 IIPC Web Archiving Conference in Oslo,” says Cannelli, before offering a “special thanks to the organizers for putting together such a fantastic event, and to the IIPC for their invaluable support through the student bursary.”


Web Archiving Conference 2024 Travel Report

By Anastasia Nefeli Vidaki, law scholar and researcher at the Cyber and Data Security Lab, Vrije Universiteit Brussel


The IIPC Web Archiving Conference (WAC), organized by the International Internet Preservation Consortium (IIPC) and BnF (National Library of France) in partnership with the French National Audiovisual Institute (INA), united professionals and enthusiasts globally to explore the nuances of web archiving. Set in Paris, France, from 24 to 26 April 2024, this gathering served as a vital forum for participants to delve into groundbreaking research and good practices, network, and enjoy the French hospitality in the library, the exhibitions, and events. It was my first time participating in the conference, and in this report, I aim to depict my experience as a delegate and presenter.

Day 1: Arrival, Registration and Workshops

Entrance of the National Library of France | Photo credit: Anastasia Nefeli Vidaki, 2024

Upon arrival in Paris early on Wednesday morning, delegates were welcomed by the bright spring sunlight. After checking into our accommodations, we reached BnF’s François-Mitterrand site where the bulk of the conference was taking place. We were astonished by its modern architecture, rich collection, and exhibitions. Right after lunch, we registered for the conference and collected our badges and materials. In our conference bag we found many surprises, from a vintage map of Paris to a ticket for the library’s special exhibition, “La France sous leurs yeux.” The afternoon commenced with captivating workshops and ended with a warm welcome reception at the BnF’s Richelieu site. Among drinks and conversations, the attendees had the opportunity to get to know each other in a friendly environment and establish connections.

Day 2: Keynote Panel, Sessions, Workshops and Talks

With enthusiasm from the previous day’s experience, the second day commenced. The conference began with a keynote panel on Skyblog, which filled us with nostalgia and thoughts on the freedom of the web realm. Right after was a series of inspiring sessions delivered by esteemed practitioners in the fields of archiving, digitization, data science, AI, law, and web-crawling. From discussions on the future of archiving to good practices, each presentation ignited engaging conversations and fueled attendees’ interest in the development and the challenges of web archiving in the era of novel technologies. We delved deeper into specific areas with interactive workshops and panels and then sat in on lightning and drop-in talks, exchanging insights with fellow experts before walking across the foyer, which was filled with posters. The second day ended with dinner, giving participants a chance to relax from the active day and reflect upon all the information they received.

Photos from the first poster session, featuring Nola N’Diaye of the National Library of France (top right) and Helena Byrne of the British Library (bottom right). Photo credit: Guillaume Murat / BnF

Day 3: Panel Discussions, Sessions and Posters

Day three started with another poster session relevant to the previous day’s discussions. While having our morning coffee and typical French croissants, we toured around the poster presenters. That day of the conference was marked by dynamic panel discussions covering a diverse range of interdisciplinary topics. From exploring the role of communities in web archiving to the requests for accessibility and inclusivity, the panels sparked lively debates.

One of the two 20-foot globes commissioned by King Louis XIV in 1681, now residing at the François-Mitterrand site of the BnF, © Anastasia Nefeli Vidaki, 2024

I took advantage of the gifted ticket inside my conference bag and visited the photo exhibition during the lunch break. More relaxed, I returned to the last panel and sessions of the conference. During networking breaks, I was able to meet many experts across the globe, share ideas, and present my research and point of view on many topics of common interest.

With some closing remarks, takeaways, and a slide show of conference photos, WAC came to an end. Our stay in Paris ended along with it, and we prepared for our return with a feeling of fullness, valuable knowledge gained, and connections established throughout the last three days. Carrying with us a wealth of memories and insights, we waved goodbye to the National Library of France, WAC, and the people behind it, and promised to continue to be inspired and to strive for openness, preservation, and freedom in archiving, whether in physical or digital discourse.

Conclusion

The Web Archiving Conference in Paris offered us delegates an exceptional opportunity to delve into pioneering research, network with peers, and enjoy a few days exploring the vibrant city of Paris. From enlightening sessions to hands-on workshops and stimulating panel discussions and talks, the conference nurtured collaboration and innovation within the international web archiving community. Departing attendees left with a revitalized sense of purpose and passion, eager to apply the fresh insights and knowledge acquired toward tackling the field’s most significant challenges.


And we’re off – Get Involved in Web Archiving the Summer Games – Paris 2024

By Helena Byrne, Curator of Web Archives, British Library

The International Internet Preservation Consortium (IIPC)’s Content Development Group would like your help to archive websites from around the world related to the Olympic and Paralympic Games.

The IIPC has members in 33 countries, but there are over 200 countries competing in the Games, and we need your help to ensure that these countries are represented in the collection.

We want to collect websites in various formats such as:

  • Websites, or subsections of websites, related to the Olympic and Paralympic Games 2024
  • Individual articles, or documents on websites
  • News reports
  • Blogs

The subjects covered on these sites can include but are not limited to:

  • Athletes/Teams
  • Computer Games (eGames)
  • Doping/Cheating and Corruption
  • Environmental Issues
  • Fandom
  • Gender Issues (e.g. media coverage, sexual harassment, etc.)
  • General News/ Commentary
  • Health issues (COVID, bed bugs, etc.)
  • Human Rights issues
  • Olympic/Paralympic Venues
  • Security
  • Sports Events
  • Other

Social media policy

Social media platforms are difficult to archive due to technical issues and privacy concerns. For these reasons, we will not be accepting nominations of content on Facebook, Instagram, Twitter, or from other social media platforms.

How to get involved

Once you have selected the web pages you would like to see in the collection, it takes less than 5 minutes to fill in the submission form:

https://forms.gle/NyH2AnYSKFwS66K59

For more information and updates, you can contact the IIPC CDG team at Collaborative-collections@iipc.simplelists.com or follow the collection hashtag #WAGames2024.


IIPC Steering Committee Election 2024: Call for Nominations

The nomination process for the IIPC Steering Committee is now open.

The Steering Committee (SC) is composed of no more than fifteen Member Institutions. SC Members provide oversight of the Consortium and define and oversee action on its strategy. This year, four seats are up for election.

What is at stake?

Serving on the Steering Committee is an opportunity for motivated members to help guide the IIPC’s mission of improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. SC members are expected to take an active role in leadership and help guide and administer the organisation. The elected SC members also lead IIPC Portfolios and thus have the opportunity to shape the Consortium’s strategic direction related to three main areas: tools development, membership engagement and partnerships. Every year, three SC members are designated as IIPC Officers (Chair, Vice-Chair and Treasurer) to serve on the IIPC Executive Board and are responsible for implementing the Strategic Plan. The SC members meet in person (if circumstances allow) at least once a year. Face-to-face meetings are supplemented by two teleconferences plus additional ones as required. The key tasks for the upcoming term include drafting and overseeing the implementation of the new Strategic Plan and Consortium Agreement.

Who can run for election?

Participation in the SC is open to any IIPC member in good standing. We strongly encourage any organisation interested in serving on the SC to nominate themselves for election.

Please note that the nomination should be on behalf of an organisation, not an individual. Once elected, the member organisation designates a representative to serve on the Steering Committee. The list of current SC member organisations is available on the IIPC website.

How to run for election?

All nominee institutions, both new members and existing members whose term is expiring but who are interested in continuing to serve on the SC, are asked to write a short statement (no longer than 200 words) outlining their vision for how they would contribute to IIPC via serving on the SC. Statements can point to current and past contributions to the IIPC activities (e.g. through collaborative projects, conference hosting, participation in SC, Working Groups or taskforces), relevant experience or expertise, new ideas for advancing the organisation, or any other relevant information.

All statements will be posted online and emailed to members prior to the election, giving all members ample time to review them. The results will be announced in October, and the three-year term on the Steering Committee will start on 1 January.

Below is the election calendar. We are very much looking forward to receiving your nominations. If you have any questions, please contact the IIPC Senior Program Officer (SPO).


Election Calendar

13 June – 10 September 2024: Nomination period. IIPC Designated Representatives are invited to nominate their organisation by emailing the IIPC SPO. The nomination statement should be no longer than 200 words.

11 September 2024: Nominee statements are published on the Netpreserve blog and circulated to the Members mailing list. Nominees are encouraged to campaign through their own networks.

11 September – 9 October 2024: Members are invited to vote online. The vote is cast by the Designated Representative.

11 October 2024: The results of the vote are announced on the Netpreserve blog and Members mailing list.

1 January 2025: The newly elected SC members start their three-year term.

Reflections on the 2024 IIPC General Assembly and Web Archiving Conference

By Friedel Geeraert, Expert in web archiving at KBR | Royal Library of Belgium


This year’s IIPC General Assembly and Web Archiving Conference took place at the Bibliothèque nationale de France (BnF) in Paris. It was wonderful to be welcomed once again into the warm web archiving community, especially in the superb surroundings the BnF had to offer. The welcome reception in the oval reading room at the BnF Richelieu site was especially memorable in that respect. Other than the lovely encounters with web archiving colleagues from around the world, the General Assembly and the Web Archiving Conference program had a lot to offer.

Opening remarks by the President of the BnF, Gilles Pécout, in Salle Ovale.
Photo credit: Guillaume Murat, BnF

The General Assembly gave insight into the strategic plan for 2026-2031 and the reflections of the Steering Committee during their meeting that took place the day before. The transparency about their discussions and the active call for participation of members in determining the strategic priorities of the IIPC was greatly appreciated. The historical overview of the changes that have taken place in the Consortium Agreement was also fun to see, as it showed how the IIPC has grown as an organization over the decades. 

Workshops offered participants opportunities to gain hands-on experience in becoming confident trainers in the domain of web archiving, running your own full stack SolrWayback, and crawling using the Browsertrix Cloud, among others. Panel discussions and keynotes allowed for deepening one’s knowledge about Skyblog (a French pioneer in social networks), the archivability of websites, archiving social media, and training Large Language Models. Sessions focused on a myriad of subjects such as capturing unique content (ads, digital artworks, memes, etc.), digital preservation, and planning (tenders, sustainability of web archiving programs, training, etc.). The poster sessions and the drop-in and lightning talks allowed participants to gather information on a whole range of concepts very efficiently.

This is only a selection of themes that were covered during the conference. The program comprised three parallel sessions, all covering interesting topics, thereby inspiring a significant level of FOMO in participants.

Friedel Geeraert presenting a KBR drop-in talk. Photo credit: Olga Holownia.

At KBR, there are currently three projects in the pipeline:

  1. Setting up a web archive on a voluntary basis (via a public tender)

  2. Extending the legal deposit legislation to online content

  3. The BelgicaWeb research project. The project is funded by BELSPO, the Belgian Science Policy Office, through the BRAIN 2.0 program and aims to make the born-digital heritage of Belgium accessible and FAIR.

Bearing in mind this institutional context, a number of elements evoked during the General Assembly and Web Archiving Conference are particularly useful. Within the BelgicaWeb project, we will further look into SolrWayback and Browsertrix Cloud. APIs offered by organizations such as Arquivo.pt are also sources of inspiration. Initiatives such as Datasheets for Web Archives by Emily Maemura and Helena Byrne can also prove useful in describing the provenance of collections of archived web content. Using PWIDs to reference web sources archived in certain web archive collections has also been adopted as best practice within the BelgicaWeb project.

As a member of the Preservation Working Group at KBR, I found the session on Digital Preservation especially useful. The Danish Royal Library proved itself once again as one of the leading examples in Europe where digital preservation of born-digital content is concerned. Thanks to their presentations, we will be looking further into Bitrepository.org.

All in all, this was another great edition of the IIPC GA & WAC. I can’t wait for the next conference in Oslo!


Meet the Officers, 2024

The IIPC is governed by the Steering Committee, formed of representatives from fifteen Member Institutions who are each elected for three-year terms. The Steering Committee designates the Chair, Vice-Chair and Treasurer of the Consortium. Together with the Senior Program Officer, based at the Council on Library and Information Resources (CLIR), the Officers make up the Executive Board and are responsible for dealing with the day-to-day business of running the IIPC.

The Steering Committee has designated Jeffrey van der Hoeven of KB, the National Library of the Netherlands to serve as Chair and Andrea Goethals of the National Library of New Zealand to serve as Vice-Chair in 2024. Bjarne Andersen of the Royal Danish Library will serve as the IIPC Treasurer. Olga Holownia continues as Senior Program Officer, Kayla Martin-Gant is now Administrative Officer, and CLIR remains the Consortium’s financial and administrative host.

The Members and the Steering Committee would like to thank Youssef Eldakar of Bibliotheca Alexandrina for leading the IIPC in 2023 and Ian Cooke of the British Library for his two-year term as the IIPC Treasurer.


IIPC CHAIR

Jeffrey van der Hoeven, IIPC Chair

Jeffrey van der Hoeven is head of the Digital Preservation department at the National Library of the Netherlands (KB). In this role, he is responsible for defining the policies, strategies, and organizational implementation of digital preservation at the library, with the goal to keep the digital collections accessible to current users and generations to come. Jeffrey is also director at the Open Preservation Foundation (OPF) and has been a member of the IIPC Steering Committee since 2020. In previous roles, he was involved in various national and international preservation projects such as the European projects PLANETS, KEEP, PARSE.insight, and APARSEN.

IIPC VICE-CHAIR

Andrea Goethals, IIPC Vice-Chair

Andrea Goethals started her digital preservation career in 2003 as a computer scientist working on technical strategies for handling format obsolescence. Since then, she has worked in roles focusing on the policies, strategies, and people that make digital preservation programs possible. She is now Manager of Digital Preservation and Data Capability (Kaiwhakahaere Rokiroki ā-Matihiko me te Āheinga Raraunga) at the National Library of New Zealand, where she leads a team of specialists with expertise in digital preservation, web archiving, software development, and data analysis. She runs the NZ DOI Consortium and participates in many regional and international groups, including iPRES Steering Group, Australasia Preserves, Digital Preservation Storage Criteria WG, DataCite CESG, and NSLA Digital Preservation Network.

IIPC TREASURER

Bjarne Andersen, IIPC Treasurer

Bjarne Andersen is Head of the Data Department at the Royal Danish Library (KB). In this role, he is responsible for the IT development of systems handling ingest and digital preservation of digital cultural heritage. Apart from managing IT developers, Bjarne is also in charge of enterprise IT architecture at KB and is involved in setting up a new AI and Data initiative at the library. Bjarne started out more than 20 years ago building the national web archive Netarkivet and has held many different roles since. He was on the founding board of directors of OPF, following his involvement in the European projects DPE, PLANETS and SCAPE, all focused on digital preservation.

Communities of Digital Preservation

By Andy Jackson, Web Archiving Technical Lead at the British Library (until January 2024)

I joined the UK Web Archive early in 2012, during the build-up to our very first UK domain crawl. As I started to understand what the team did, it became very clear that the collaboration with the wider IIPC web archiving community had been crucial to the team’s success, and would be a vital part of our future work.

The knowledge sharing and socialising at the IIPC conferences provide the fundamental rhythm, but the web archiving community has arranged all sorts of beats over that bass drum. Not just special events, both online and in person (e.g. technical training and a hackathon held at the British Library), but also through the way we build our shared tools. My research career had often involved using open source software, but in web archiving I began to understand how those same approaches had been used to share the load of developing standard practices, embodied by specialist tools. I also began to see how this could empower people and organisations to run their own web archiving operations.

Buy or Build?

While the public awareness of web archiving has certainly risen over the years, it remains something of a niche concern. It has been over twenty years since a small group of cultural heritage organisations kicked things off, writing and sharing their own tools to archive the web. In the intervening years the heritage community has grown a great deal, but most of today’s archival web crawlers are still built on those first foundations. There seems to be a reasonable market for ‘medium-scale’ web archiving, with a few different vendors offering various services at that scale. But at the extremes, with personal web archiving at one end and Legal Deposit domain crawls at the other, there are all sorts of constraints that make it difficult to take advantage of those commercial offerings.

Sometimes, you have to build your own tools. But, if you must build your own, you can try to find others with similar needs and look for common ground to share. Open source licences and development practices have clearly been pivotal to helping this happen in web archiving, leading to the widespread use of Heritrix for web crawling and of the original Java Wayback playback engine. This was a success story I wanted to join in with, and a community I wanted to help grow.

Barriers to Collaboration

Seeing this historical success, I took it for granted that of course our institutions would understand and support this. That anyone using these tools would be able and keen to collaborate. Why keep fixing the same bugs alone when we could fix each one once by working together?

That was very naive of me. There are lots of reasons why the open source model of collaboration can be difficult to adopt. The relationships between organisational needs and Information Technology service delivery are incredibly varied and complex. It can be very difficult to get the space and permission to experiment. It can be extremely difficult to build up or pull in the skills we need.

Even where people would like to collaborate more, there are often perfectly understandable personal or professional constraints that mean they can’t just pitch in. I am very fortunate that my direct managers and colleagues at the British Library supported my strategy of working in the open. I am also fortunate that I risk very little by doing so. It took me a while to realise what a privilege that is.

The desire to overcome these barriers was part of the reason why I helped start up the regular Online Hours calls to support the teams and individuals who rely on our shared tools, and provide a safe and friendly forum for anyone who is interested in talking about them.

Investing in Open Source

I’ve also tried to support and encourage direct investment in shared tooling, both through IIPC and the British Library. I’ve been particularly pleased by the project to extend the GLAM Workbench to explore web archives, the project to help IIPC members make use of the Browsertrix Cloud crawl system, and the project to help everyone move from OpenWayback to pywb. It’s also been great to see the increased adoption of the webarchive-discovery WARC indexing toolkit, largely driven by the excellent SolrWayback search interface project.

In January, I left the British Library to work at the Digital Preservation Coalition. I suspect I’ll reconnect with web archiving at some point in the future, in one form or another, but for now, I’m looking forward to taking what I’ve learned and applying it anew. Because at some point I realised that open source isn’t just about making do with not-much money. It’s about digital preservation too.

Critical Dependencies

One of the core concepts in digital preservation is the idea of Representation Information, which provides a way to formally recognise the additional information we need to make our collections accessible. Crucially, this includes software. After all, the thing that makes digital objects digital is the fact that we need software to use them.

This is where proprietary systems can become a significant risk to digital preservation. Perhaps the most important part of digital preservation is identifying single points of failure within the chain of dependencies that access requires. If playback depends on a single service provider, it’s at risk. Long-term preservation demands interoperability, which is why the WARC standard exists in the first place.

The WARC standard is our foundation stone, but that alone is not enough to make those frozen fragments sing. We can’t grasp what landed in our ‘response’ records without being able to understand the mechanisms that put them there. And we can’t analyse and explore our petrified webs without the software tools that bring them to life. There is no ‘ISO standard for playback’ (and I doubt such a thing is even possible), so we must instead preserve the software that makes playback work. This is why having at least one open source playback system is a crucial concern for the members of the IIPC.

But this is not just true for web archiving. This same story plays out across the whole of digital preservation. The wider shift to open source, and the work that the global community has put into open source implementations of widespread formats, has become the backbone of every digital preservation programme. We’re not out here re-implementing libtiff, or writing PDF readers based on the ISO spec. We’re all re-using open source implementations that are being maintained by the wider community. We’re all in the business of preserving software, at least to some degree.

Communities of Practice

The success of the community-maintained Web Archiving Awesome List, the way organisations have transitioned to pywb and the growing support for Browsertrix Cloud show that the web archiving community understands this: that one way to sustainable, shared practices is through shared tools as well as common purpose. These tactics don’t only help established archives do their work, but also make it easier for ‘younger’ archives to join in and so grow the community around those tools.

My new role is all about helping digital preservation practitioners discover and build on the good practice of others. I will take what I’ve learned from web archiving with me, and come back to this community as an exemplar of what we can achieve when we work together.

Migrating to pywb at the National Library of Luxembourg

The National Library of Luxembourg (BnL) has been harvesting the Luxembourg web under the digital legal deposit since 2016. In addition to broad crawls of the entire .lu domain, the Luxembourg Web Archive conducts targeted crawls focusing on specific topics or events. Due to legal restrictions, the web archives of the BnL can only be viewed on the library premises in Luxembourg. BnL joined the IIPC in 2017 and co-organised the 2021 online General Assembly and Web Archiving Conference.


By László Tóth, Web Archiving Software Developer at the National Library of Luxembourg

During the course of 2023, the Bibliothèque Nationale du Luxembourg | National Library of Luxembourg (BnL) undertook the task of migrating its web archive into a new infrastructure. This migration affected all aspects of the archives:

  1. Hardware: the BnL invested in 4 new high-end servers for hosting the applications related to indexing and playback
  2. Software: the outdated OpenWayback application was upgraded to a modern pywb/OutbackCDX duo
  3. Web archive storage: the 339 TB of WARC files were migrated from NetApp NFS to high-performance IBM S3 object storage

In theory, such a migration is not a very complicated task; however, in our case, several additional factors had to be considered:

  • pywb has no module for reading WARC data from an IBM-based S3 bucket or communicating with a custom S3 endpoint
  • pywb does not know which bucket each resource is stored in
  • Our 339 TB of data had to be indexed in a “reasonable” amount of time

In this blog post, we will discuss each of the points mentioned above and provide details on how we overcame these difficulties.

The “before”

Up until December 2023, the BnL offered OpenWayback as a playback engine for users wishing to access the Luxembourg Web Archive. Simple but slow (and somewhat cumbersome to use), OpenWayback lacks a number of features required for an ergonomic user experience and efficient browsing of the archive. 

The WARC files were stored on locally mounted NFS shares, and the server handling the archive and serving OpenWayback to clients was a virtual machine with 8 cores and 24 GB of RAM.

One of the big drawbacks of this setup was the way indexes were handled. These were stored in CDX files at the rate of one file per collection, resulting in a loaded OpenWayback configuration file:

Figure 1: There has to be a better way of doing this…

In order for OpenWayback to access a given resource, it first had to find it; thus, the main source of loading times came from the size of the CDX files (3.1 TB in total) and the slowness of the NFS drive they were stored on. Furthermore, every time a new collection was added or a new WARC file indexed into the archive, the Tomcat web server had to be restarted, resulting in a few minutes of downtime.

The slow loading of the pages meant that users were not encouraged to visit our web archive.

Planning the update

In order to improve the BnL’s offerings to users, we decided to address these problems by switching to pywb, the popular playback engine developed by Webrecorder, and OutbackCDX, a high-performance CDX index server developed by the National Library of Australia.

For hosting these applications, and with a future SolrWayback setup in mind, the BnL purchased 4 high-end servers, boasting 96 CPUs and 768 GB RAM each.

Figure 2: Hardware comparison – before and after

The migration to S3 storage

Although it was not necessary for installing the new applications, we decided to migrate to S3 before installing pywb and SolrWayback. Our state IT service provider had already implemented an S3-based storage system and found that it performed better than NFS. Since the web archive consumes a sizable chunk of their storage infrastructure, we made a joint strategic decision to move to S3. Migrating the storage layer at the same time as the access systems made sense, so this was done first.

This entailed several additional tasks:

  1. Setting up a database to store the S3 location of each WARC file together with various metadata, such as integrity hashes and harvest details for each file
  2. Physically copying 339 TB of WARC files to S3 buckets
  3. Developing a pywb module to read WARC data directly from IBM S3 buckets and another module to get the S3 bucket characteristics from the database

We began by setting up a MariaDB database with a few tables for storing file and collection metadata. Here is an example entry in the table “file”:

Figure 3: Some WARC file metadata in our DB
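As a rough illustration of what such a catalogue might hold, here is a minimal sketch. It is not the BnL's actual schema: the table layout and the column names (`bucket`, `filename`, `size_bytes`, `sha256`) are assumptions, and Python's built-in `sqlite3` stands in for MariaDB so that the example runs without a database server.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE collection (
    id     INTEGER PRIMARY KEY,
    bucket TEXT NOT NULL UNIQUE          -- one S3 bucket per collection
);
CREATE TABLE file (
    id            INTEGER PRIMARY KEY,
    collection_id INTEGER NOT NULL REFERENCES collection(id),
    filename      TEXT NOT NULL UNIQUE,  -- WARC file name, used as the S3 key here
    size_bytes    INTEGER NOT NULL,
    sha256        TEXT NOT NULL          -- integrity hash computed at upload time
);
"""


def open_catalogue():
    """Create an in-memory catalogue with the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def register_file(conn, bucket, filename, size_bytes, sha256):
    """Record a WARC file and the bucket (collection) it was uploaded to."""
    conn.execute("INSERT OR IGNORE INTO collection (bucket) VALUES (?)", (bucket,))
    (collection_id,) = conn.execute(
        "SELECT id FROM collection WHERE bucket = ?", (bucket,)
    ).fetchone()
    conn.execute(
        "INSERT INTO file (collection_id, filename, size_bytes, sha256) "
        "VALUES (?, ?, ?, ?)",
        (collection_id, filename, size_bytes, sha256),
    )
    conn.commit()


def find_bucket(conn, filename):
    """Return the bucket holding a WARC file, or None if it is unknown."""
    row = conn.execute(
        "SELECT c.bucket FROM file f "
        "JOIN collection c ON c.id = f.collection_id "
        "WHERE f.filename = ?",
        (filename,),
    ).fetchone()
    return row[0] if row else None
```

The lookup in `find_bucket` is the question a playback engine needs answered before it can read a record: which bucket is this WARC file in?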

We then copied the files to S3 storage using a simple multithreaded Python script that used the ibm_boto3 module (see footnote 1) to upload files to S3 buckets and compute various pieces of metadata. We divided our web archive into separate collections, each stored in a single bucket and corresponding to a specific harvest made within a specific time period by a specific organization. Our naming scheme also includes internal or external identifiers if there are any. For instance, files that were harvested by the Internet Archive during the 2023 autumn broad crawl, having the ID #22, are stored in a bucket named “bnl-broad-022-2023-autumn”, while those collected during the second behind-the-paywall campaign of 2023 are stored in “bnl-internal-paywall-2023-2”. In total, we have 32 such buckets.
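A script of this kind could look roughly like the sketch below. It is not the BnL's script: the function names are hypothetical, and the actual upload call is passed in as the `upload` argument so that the example runs without S3 credentials. Only the threading and hashing parts use real standard-library calls.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so that large WARC files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def upload_one(path, bucket, upload):
    """Compute metadata for one file, then hand it to the upload callable."""
    path = Path(path)
    metadata = {
        "filename": path.name,
        "bucket": bucket,
        "size_bytes": path.stat().st_size,
        "sha256": sha256_of(path),
    }
    # `upload` is an assumption: any callable taking (filename, bucket, key).
    upload(str(path), bucket, path.name)
    return metadata


def upload_collection(paths, bucket, upload, workers=8):
    """Upload all files of one collection to its bucket using a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: upload_one(p, bucket, upload), paths))
```

With a boto3-style S3 client, the client's `upload_file(Filename, Bucket, Key)` method would be a natural candidate for `upload`, and the returned metadata dictionaries are the kind of rows that could then be written to the catalogue database.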

Finally, we developed pywb modules to ensure that every time the application requests WARC data, it first queries the aforementioned database to find out which bucket the resource is in, and then loads the data from the requested offset up to the requested length. Of course, we didn’t allow direct communication between pywb and our database, so we developed a small Java application with a REST API whose sole purpose is to facilitate and moderate such communication.

At this point, we’d like to note that pywb already has an S3Loader class; however, this is based on Amazon S3 technology and does not allow defining a custom endpoint for communicating with the S3 service itself. In order to adapt this to our needs, we modified pywb by implementing a BnlLoader class that extends BaseLoader, which does all the above and uses ibm_boto3 to get the WARC data. We then mapped it to a custom loader prefix in the loaders.py file:

Figure 4: Custom BnL loader reference (loaders.py)

Notice our special class on the last line in the figure above. Now in order for pywb to use this class, it has to be referenced in the config file. Here is what the relevant part of our config file looks like:

Figure 5: Our configuration file (config.yaml)

Note the “bnl://” prefix in “archive_paths”: this tells pywb to load the BnlLoader class as the WARC data handler for the corresponding collection. The value of this key is simply the URL linking to our database API server that we mentioned above, which allows controlled communication with the underlying MariaDB database. 
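The idea of choosing a loader from the prefix of an archive path can be illustrated in isolation. The sketch below is not pywb code and not the BnL's implementation: the class names, the `LOADERS` table, the example host name and the choice to re-add `https://` are all assumptions made for the sake of a runnable example.

```python
class BnlLoaderSketch:
    """Stand-in for a custom loader that talks to a catalogue API."""

    def __init__(self, rest):
        # Assumption: the part after "bnl://" is the host and path of the API.
        self.api_url = "https://" + rest


class LocalFileLoaderSketch:
    """Stand-in for a plain filesystem loader."""

    def __init__(self, rest):
        self.path = rest


# Hypothetical prefix table: maps the part before "://" to a loader class.
LOADERS = {
    "bnl": BnlLoaderSketch,
    "file": LocalFileLoaderSketch,
}


def loader_for(archive_path):
    """Pick a loader based on the prefix of an archive path."""
    scheme, sep, rest = archive_path.partition("://")
    if not sep:
        # No prefix at all: treat the value as a local path.
        scheme, rest = "file", archive_path
    try:
        loader_cls = LOADERS[scheme]
    except KeyError:
        raise ValueError(f"no loader registered for prefix {scheme!r}") from None
    return loader_cls(rest)
```

For example, `loader_for("bnl://catalogue.example/api/")` returns a `BnlLoaderSketch` pointing at the (made-up) API address, while a bare path falls back to the local file loader.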

So, in summary:

  1. pywb needs to load a resource from a WARC file
  2. The BnlLoader class’ overridden load() method is called
  3. In this method, pywb makes an API call to our REST API service in order to get the S3 bucket where the WARC is stored via the database
  4. The S3 path that is returned is then used together with the requested offset and length (provided by OutbackCDX) to make a call to the IBM S3 cluster using ibm_boto3
  5. pywb now has all the data it needs to display the page
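The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the BnlLoader class itself: `find_bucket` stands in for the REST API call, `get_object` stands in for the S3 request, and both are passed in as plain callables so that the example runs without pywb, a database or an S3 cluster.

```python
def byte_range(offset, length):
    """Build an HTTP Range header value for a record at offset/length.

    Range ends are inclusive, so a record of `length` bytes starting at
    `offset` ends at offset + length - 1. A non-positive length is read
    as "to the end of the file".
    """
    if length > 0:
        return f"bytes={offset}-{offset + length - 1}"
    return f"bytes={offset}-"


class RangeLoaderSketch:
    """Hypothetical loader: look up the bucket, then fetch only the record."""

    def __init__(self, find_bucket, get_object):
        self.find_bucket = find_bucket  # callable: warc_name -> bucket or None
        self.get_object = get_object    # callable: (bucket, key, range) -> bytes

    def load(self, warc_name, offset=0, length=-1):
        # Step 3: ask the catalogue which bucket holds this WARC file.
        bucket = self.find_bucket(warc_name)
        if bucket is None:
            raise FileNotFoundError(warc_name)
        # Step 4: request only the bytes of the record the index pointed to.
        return self.get_object(bucket, warc_name, byte_range(offset, length))
```

With a boto3-style client, `get_object` could wrap a call such as `client.get_object(Bucket=bucket, Key=key, Range=range_header)` and return the response body. Fetching a byte range matters here because a single record is usually a tiny slice of a much larger WARC file.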

The result can be summarized in the following workflow diagram:

Figure 6: The BnL’s pywb + OutbackCDX workflow

Note that OutbackCDX and pywb are set up on the same physical server, which has 96 CPUs and 768 GB of RAM; however, in the diagram above, we have shown them separately for clarity.

Our new access portal

Our access portal to the web archives was also completely redesigned using pywb’s templating engine, which allows us to fully customize the appearance of almost all elements in the interface. We decided to provide our users with two main sub-portals: one for accessing websites harvested behind paywalls, and another one for everything else. We had two main reasons for this:

  1. Several of our collections contained harvests of paywalled versions of websites. We did not want to mix these together with the un-paywalled versions.
  2. We wanted to offer our users a clear distinction between open (“freely accessible”) content and paywalled content.

The screenshot below shows our main access portal:

Figure 7: The BnL’s web archive portal

Next steps

2024 will see the installation of Netarkivet’s SolrWayback at the BnL, providing advanced search features such as full-text search, faceted search, file type search, and many more. Our hybrid solution will use pywb as the playback engine while SolrWayback will be responsible for the search aspects of our archive.

The BnL will also provide an additional NSFW-filter and virus scan in SolrWayback. During indexing, the content inside the WARC files will be scanned using artificial intelligence techniques in order to label each one as “safe” or “not safe for work” material. This way, we will be able to restrict certain content, such as pornography, in our reading rooms and block other harmful elements, such as viruses.

Finally, the BnL will develop an automated QA workflow, aiming at detecting and patching missing elements after harvests. Since this work is currently being done manually by a student helper, this workflow will likely greatly increase the quality and efficiency of our in-house harvests.

Footnotes
  1. Link to Python module and documentation

Resilience, renewal and creating future pasts through web archiving: interview with Dr Paul Koerbin

Paul Koerbin is the Assistant Director of web archiving at the National Library of Australia (NLA). He is one of the pioneers of web archiving and has been involved in the IIPC since its inception in 2003, including the early meetings hosted in Canberra in 2004 and 2008 and the first Researchers Requirements Working Group meeting in London in 2004. He has represented NLA on the IIPC Steering Committee (SC) since 2018, served as Vice-Chair in 2020, chaired the WAC Program Committee (PC) in 2018 and represented the SC on the PC in the following years. Paul is also one of the custodians of the oral history of web archiving.


“Resilience is really fundamental to a sustainable web archiving programme. And then renewal. You know what’s important now, more than ever, is innovative approaches to the challenges we face whether that’s how programmes are organised within institutions or collaborations that we can form.”

From serials to social media

Olga Holownia: The pretext for this interview is your retirement and the IIPC 20th anniversary. You are one of the “oral historians” of web archiving. You have been involved in the web archiving program at the National Library of Australia since its inception in the mid-1990s and in the IIPC since its early days. If I remember correctly, the first web archivists at NLA were curators of the serials collections and this was one of the approaches used for the earliest web archive collections. In our anniversary video spotlight, you say that in the early days, web archiving was “simpler because the materials we were dealing with were simpler, the web was simpler.” As you look back, what would you consider to be the most significant moments in this journey from serials to social media?

Paul Koerbin: Yes, the web archiving team (at first known as the ‘electronic unit’ and later as ‘digital archiving section’) was established within the serials cataloguing team, where I began at the NLA. That was the ‘Australian Serials and Electronic Unit’. This was established before we actually did any harvesting. It started by selecting, cataloguing, and scoping future harvests (when we had the infrastructure to do it). So, in this context, I would suggest as one of the most significant moments – although more than a single moment – being the first actual harvesting we did, getting that very early content. But, perhaps more important, since it goes to the whole idea of resilience and sustainability, was building the specifications and then the application that became our web archiving workflow management tool (PANDAS). This, I believe, was one of, if not the, first such bespoke workflow tools for web archiving. The significance of this was to turn web archiving, quite quickly, into an operational activity, not simply a project. A quarter of a century on, the renewal of our web archiving workflow system as the fourth generation PANDAS in the past three years I think is just as significant for our web archiving operation by renewing and making the tool ever more adaptable, sustainable and fit for purpose.

On the matter of archiving being ‘simpler’ in the early days. I think I mean it was less complex, not necessarily easier. We were starting from ground zero then, building workflows, policy, and infrastructure. And the harvesting tools were obviously not as sophisticated. So, that was not simple, but a website in the early days could in many practical respects be treated like publications we were already familiar with, so perhaps what we had to deal with was conceptually simpler. However, with Web 2.0 and, of course, the later emergence of social media, the target medium has become so much more complex and both conceptually and practically more challenging to archive. Add to that the issues of preserving, managing, and making accessible huge complex collections, I think in many ways the past does look ‘simpler’. I would add that these changes are what has kept many of us involved in this enterprise for so long. It is constantly challenging and never dull!

This is how PANDORA was described in a page from February 1998. https://webarchive.nla.gov.au/awa/19980205191902/http://www.nla.gov.au/pandora/

Resilience and renewal

OH: The Australian Web Archive is one of the oldest in the world and it’s listed in the Australian Register of the UNESCO Memory of the World Program. What would you consider to be the crucial decision in the development and renewal of the web archiving programme?

PK: I think a crucial decision made at the very beginning of the NLA’s web archiving initiative was to treat it as a programme rather than a project. There was a six-month period at the very beginning when it was considered a scoping project, but by the end of 1996, it was a programme incorporated into the operation of the NLA’s core business to comprehensively collect Australia’s cultural heritage. I think that was visionary for the time and a crucial decision. The other decision I would mention, perhaps more a way of working rather than a stated decision, was to take a radical incremental approach to the development of infrastructure. By that I mean we did not try and solve all the problems and issues before building the applications to get operational. They were not perfect but they gave us experience and, importantly, allowed us to collect material from as early as late 1996. I think this approach built renewal – by increments – and resilience – by experience – into our web archiving programme.

OH: What have been the most rewarding experiences for you in your tenure in web archiving?

PK: Of course so much of the experience of being part of a small team over the years building the programme and building the collection of web content has been immensely rewarding. But I will highlight a couple of things: firstly, I was very involved in the framing of legislation extending legal deposit to online materials that came into force in 2016. This was the culmination of 20 years of the NLA trying to get this amendment to the legislation. The final days of preparing the bill with drafting experts in our portfolio department were exhilarating. What we emerged with was, on the whole, a broadly applicable, workable, and very successful piece of legislation. The passing of that legislation led in turn to the NLA reviewing its management of risk and the opening of access to our entire web archiving collection in 2019 as the Australian Web Archive through the Trove discovery service.

The other personally rewarding experience I would highlight would be the opportunity and privilege I have had in representing the NLA’s achievements in web archiving to the international community, through the IIPC. The NLA was a founding member of the IIPC and, despite Australia’s remoteness from the majority of the web archiving activities in the northern hemisphere, we have tried to maintain an international presence. I have found it very rewarding to have been given the responsibility by the NLA over the past 20 years to explain, promote, and represent our web archiving achievements and experiences both internationally and in Australia.

“We can’t solve all the problems alone nor recognise all the opportunities by ourselves”

OH: You mentioned the importance for NLA to be part of the international community right from the start. The IIPC has also greatly benefited from NLA’s engagement, not least with respect to creating the model for a curator workflow tool in PANDAS in the early years and more recently with open source tools, notably OutbackCDX developed by Alex Osborne. In what ways did the international collaboration contribute to the advancement of the web archiving programme in Australia?

PK: I think being part of the international web archiving community is important for the renewal of our web archiving program. At one time the NLA was a world leader and at other times we could see that there were areas where we were, perhaps, not doing so well. We can’t solve all the problems alone nor recognise all the opportunities by ourselves. I think this is the greatness of the IIPC forum. It is perhaps less about adopting tools others have developed or contributing tools to the community, though there is that too; rather it is so much about the sharing of and awareness of the diversity of approaches and the opportunities and ideas we can bring back to our own organizations and programmes. Personally, I have felt that international engagement, and seeing the achievements of others, has helped to keep me enthusiastic and driven over a long period of time to improve and promote our activities.

IIPC Steering Committee meeting, Canberra, 15 November 2004. Clockwise from right: Julien Masanes (Bibliothèque nationale de France); Caroline Wiegandt (Bibliothèque nationale de France); Pam Gatenby (NLA); Margaret Phillips (NLA); Hans Kristian Mikkelsen (Royal Danish Library); Martha Anderson (Library of Congress); Yvette Hackett (Library and Archives Canada); Svein Arne Solbakk (National Library of Norway) and Mark Middleton (British Library).
Photo: Damian McDonald, National Library of Australia

“Web archiving and its processes can be understood as a methodically and purposefully constructed taphonomy of the web.”

Future pasts

OH: “Future pasts” was one of the topics you suggested for the 2022 Web Archiving Conference. Let me ask you your own question: how are web archives framing future perceptions of the past?

PK: Web archiving is very much an activity of taking a snapshot of the dynamic, ephemeral and relentless ‘now’ that is the world wide web, and undertaking to preserve that for a future we are yet to know. Those who will come after us will only have those fragments of the web that we have archived to understand that past. Web archiving and its processes can be understood as a methodically and purposefully constructed taphonomy of the web. The nature of the web is that it is not going to leave a trace of itself at any given time without this intervention. That is how important our work is. So much of our social discourse happens online and so much of the important ‘grey literature’ that forms the basis of social policy is published online. Without access to this, those who come after us will have no historical perspective. So, of course, this is also our tremendous responsibility, since what we choose to collect and what we are able to collect and preserve – let us remember the persistent technical, legal and resources constraints that continue to limit our activities – will frame how the future looks at the past through what we have collected and how we have preserved and provided access to it.

IIPC Web Archiving Conference at the National Library of New Zealand. Wellington, 13 November 2018. Keynote by Dr Rachael Ka’ai-Mahuta titled Te Māwhai – te reo Māori, the Internet, archiving, and trust issues.
Photo: Mark Beatty.

OH: Thank you for taking the time to answer my questions today and over the past 7 years. I have always regarded you as one of the oral historians of web archiving and you’ve always helped me fill the gaps in the early history of the IIPC. To finish off on a less sombre note, does your retirement mean that you will now have more time to dedicate to composing music for “Gumboots and Consequences”?

PK: I deny everything! (Note to readers: you had to be involved in organizing the 2018 GA-WAC in Wellington to understand the reference here. Vale the great John Clarke.) On a serious note, however, I do hope that in the near future a major IIPC event can again be hosted in Australia. The major international conference that the NLA organized in 2004 was an important milestone and the General Assembly and Conference in Wellington in 2018 (which the NLA co-sponsored and which I was involved with as co-chair of the programme committee) was a great success for the region bringing the international web archiving community to the southern hemisphere. I am hopeful that there will be another IIPC event in this region sooner rather than later.

Paul Koerbin’s closing remarks at the 2018 General Assembly hosted by the National Library of New Zealand. Wellington, 12 November 2018.
Photo: Olga Holownia, IIPC.

References: