Communities of Digital Preservation

By Andy Jackson, Web Archiving Technical Lead at the British Library (until January 2024)

I joined the UK Web Archive early in 2012, during the build-up to our very first UK domain crawl. As I started to understand what the team did, it became very clear that the collaboration with the wider IIPC web archiving community had been crucial to the team’s success, and would be a vital part of our future work.

The knowledge sharing and socialising at the IIPC conferences provide the fundamental rhythm, but the web archiving community has arranged all sorts of beats over that bass drum. Not just through special events, both online and in person (e.g. technical training and a hackathon held at the British Library), but also through the way we build our shared tools. My research career had often involved using open source software, but in web archiving I began to understand how those same approaches had been used to share the load of developing standard practices, embodied by specialist tools. I also began to see how this could empower people and organisations to run their own web archiving operations.

Buy or Build?

While the public awareness of web archiving has certainly risen over the years, it remains something of a niche concern. It has been over twenty years since a small group of cultural heritage organisations kicked things off, writing and sharing their own tools to archive the web. In the intervening years the heritage community has grown a great deal, but most of today’s archival web crawlers are still built on those first foundations. There seems to be a reasonable market for ‘medium-scale’ web archiving, with a few different vendors offering various services at that scale. But at the extremes, with personal web archiving at one end and Legal Deposit domain crawls at the other, there are all sorts of constraints that make it difficult to take advantage of those commercial offerings.

Sometimes, you have to build your own tools. But, if you must build your own, you can try to find others with similar needs and look for common ground to share. Open source licences and development practices have clearly been pivotal to helping this happen in web archiving, leading to the widespread use of Heritrix for web crawling and of the original Java Wayback playback engine. This was a success story I wanted to join in with, and a community I wanted to help grow.

Barriers to Collaboration

Seeing this historical success, I took it for granted that of course our institutions would understand and support this. That anyone using these tools would be able and keen to collaborate. Why keep fixing the same bugs alone when we could fix each one once by working together?

That was very naive of me. There are lots of reasons why the open source model of collaboration can be difficult to adopt. The relationships between organisational needs and Information Technology service delivery are incredibly varied and complex. It can be very difficult to get the space and permission to experiment. It can be extremely difficult to build up or pull in the skills we need.

Even where people would like to collaborate more, there are often perfectly understandable personal or professional constraints that mean they can’t just pitch in. I am very fortunate that my direct managers and colleagues at the British Library supported my strategy of working in the open. I am also fortunate that I risk very little by doing so. It took me a while to realise what a privilege that is.

The desire to overcome these barriers was part of the reason why I helped start up the regular Online Hours calls to support the teams and individuals who rely on our shared tools, and provide a safe and friendly forum for anyone who is interested in talking about them.

Investing in Open Source

I’ve also tried to support and encourage direct investment in shared tooling, both through IIPC and the British Library. I’ve been particularly pleased by the project to extend the GLAM Workbench to explore web archives, the project to help IIPC members make use of the Browsertrix Cloud crawl system, and the project to help everyone move from OpenWayback to pywb. It’s also been great to see the increased adoption of the webarchive-discovery WARC indexing toolkit, largely driven by the excellent SolrWayback search interface project.

In January, I left the British Library to work at the Digital Preservation Coalition. I suspect I’ll reconnect with web archiving at some point in the future, in one form or another, but for now, I’m looking forward to taking what I’ve learned and applying it anew. Because at some point I realised that open source isn’t just about making do with not-much money. It’s about digital preservation too.

Critical Dependencies

One of the core concepts in digital preservation is the idea of Representation Information, which provides a way to formally recognise the additional information we need to make our collections accessible. Crucially, this includes software. After all, the thing that makes digital objects digital is the fact that we need software to use them.

This is where proprietary systems can become a significant risk to digital preservation. Perhaps the most important part of digital preservation is identifying single points of failure within the chain of dependencies that access requires. If playback depends on a single service provider, it’s at risk. Long-term preservation demands interoperability, which is why the WARC standard exists in the first place.

The WARC standard is our foundation stone, but that alone is not enough to make those frozen fragments sing. We can’t grasp what landed in our ‘response’ records without being able to understand the mechanisms that put them there. And we can’t analyse and explore our petrified webs without the software tools that bring them to life. There is no ‘ISO standard for playback’ (and I doubt such a thing is even possible), so we must instead preserve the software that makes playback work. This is why having at least one open source playback system is a crucial concern for the members of the IIPC.
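To make that dependence on software concrete, here is a minimal sketch, assuming the open source warcio library and a hypothetical local WARC file, of how even the simplest inspection of ‘response’ records already leans on community-maintained code rather than on the standard alone:

```python
# A minimal sketch, assuming the open source `warcio` library; the file
# path is a placeholder, and record/header names follow the WARC standard.
from warcio.archiveiterator import ArchiveIterator

with open('example-crawl.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # Only 'response' records hold the payloads that playback must reconstruct.
        if record.rec_type == 'response':
            url = record.rec_headers.get_header('WARC-Target-URI')
            status = record.http_headers.get_statuscode() if record.http_headers else None
            payload = record.content_stream().read()
            print(url, status, len(payload))
```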

But this is not just true for web archiving. This same story plays out across the whole of digital preservation. The wider shift to open source, and the work that the global community has put into open source implementations of widespread formats, has become the backbone of every digital preservation programme. We’re not out here re-implementing libtiff, or writing PDF readers based on the ISO spec. We’re all re-using open source implementations that are being maintained by the wider community. We’re all in the business of preserving software, at least to some degree.

Communities of Practice

The success of the community-maintained Web Archiving Awesome List, the way organisations have transitioned to pywb, and the growing support for Browsertrix Cloud show that the web archiving community understands this: one way to sustainable, shared practices is through shared tools as well as common purpose. These tactics don’t only help established archives do their work, but also make it easier for ‘younger’ archives to join in and so grow the community around those tools.

My new role is all about helping digital preservation practitioners discover and build on the good practice of others. I will take what I’ve learned from web archiving with me, and come back to this community as an exemplar of what we can achieve when we work together.

IIPC – Meet the Officers, 2023


The IIPC is governed by the Steering Committee, formed of representatives from fifteen Member Institutions who are each elected for three-year terms. The Steering Committee designates the Chair, Vice-Chair and the Treasurer of the Consortium. Together with the Senior Program Officer, based at the Council on Library & Information Resources (CLIR), the Officers make up the Executive Board and are responsible for dealing with the day-to-day business of running the IIPC.

The Steering Committee has designated Youssef Eldakar of Bibliotheca Alexandrina to serve as Chair, and Jeffrey van der Hoeven of KB, National Library of the Netherlands to serve as Vice-Chair in 2023. Ian Cooke of the British Library will continue to serve as the IIPC Treasurer. Olga Holownia continues as Senior Programme Officer, Kelsey Socha-Bishop as Administrative Officer and CLIR remains the Consortium’s financial and administrative host.

The Members and the Steering Committee would like to thank Kristinn Sigurðsson of the National and University Library of Iceland and Abbie Grotke of the Library of Congress for leading the IIPC in 2021 and 2022.


IIPC CHAIR

Youssef Eldakar is Head of the International School of Information Science, a department of Information and Communication Technology at Bibliotheca Alexandrina (BA) in Egypt. Youssef entered the domain of web archiving as a software engineer in 2002, working with Brewster Kahle to deploy the reborn Library of Alexandria’s first web archiving computer cluster, a mirror of the Internet Archive’s collection at the time. In the years that followed, he went on to lead BA’s work in web archiving and has represented BA in the International Internet Preservation Consortium (IIPC) since 2011. Also at BA, he contributed to book digitization during the initial phase of the effort. In 2013, he was additionally assigned to take the lead of the BA supercomputing service, providing a platform for High-Performance Computing (HPC) to researchers in diverse domains of science in Egypt, as well as regionally through European collaboration. In his present post, Youssef works to provide support to research through the technologies of parallel computing, big data, natural language processing, and visualization.

In the IIPC, Youssef has been the lead of Project LinkGate, started in 2020, for scalable temporal graph visualization, and he has more recently been working as part of a collaboration involving the Research Working Group and the Content Development Working Group to republish IIPC collections through alternative interfaces for researcher access. He has been a member of the Steering Committee since 2018 and has served as the lead of the Tools Development Portfolio.

IIPC VICE-CHAIR

Jeffrey van der Hoeven is head of the Digital Preservation department at the National Library of the Netherlands (KB). In this role he is responsible for defining the policies, strategies and organisational implementation of digital preservation at the library, with the goal of keeping the digital collections accessible to current users and generations to come. Jeffrey is also director at the Open Preservation Foundation and steering committee member at the IIPC. In previous roles, he has been involved in various national and international preservation projects such as the European projects PLANETS, KEEP, PARSE.insight and APARSEN.

IIPC TREASURER


Ian Cooke leads the Contemporary British Publications team at the British Library, which is responsible for curation of 21st century publications from the UK and Ireland. This includes the curatorial team for the UK Web Archive, as well as digital maps, emerging formats and print and digital publications ranging from small press and artists’ books to the latest literary blockbusters. Ian joined the British Library’s Social Sciences team in 2007, having previously worked in academic and research libraries, taking up his current role in 2015.

Ian has been a member of the IIPC Steering Committee and has worked on strategy development for the IIPC. The British Library was the host for the Programmes and Communications role up to April 2021.  

Faceted 4D modelling to reconstruct web space in web archives: a pilot approach to develop a sub-collection in context

By Cui Cui, Web Archivist, Special Collections, Bodleian Libraries and PhD candidate, Information School, University of Sheffield


The Archive of Tomorrow project, funded by the Wellcome Trust, is designed to explore and preserve online information and misinformation about health and the Covid-19 pandemic. Started in February 2022, the project runs for 14 months and will form a “Talking about Health” collection within the UK Web Archive, giving researchers and members of the public access to a wide representation of diverse online sources. Earlier this year, Alice Austin presented an introduction to the project on the IIPC blog. As a web archivist, I work on various sub-collections within the “Talking about Health” collection on topics relating to cancer, food, diet, nutrition, and wellbeing. This blog post summarises some of the challenges I have encountered during the collecting process and the approaches I have taken to tackle them.

Challenges in Capturing Health Collections

The web space related to the topic of health is broad and exists in a complicated context. The subject of health calls for an interdisciplinary approach. There are multiple independent yet connected stakeholders in this web space who create content to promote policies, research outcomes, opinions, guidelines, services and products, all of which ultimately influence the behaviour of the general public regarding their own health. It is therefore essential for us to understand that we are developing the collection within a context that goes beyond the medical concerns.

It is a challenge to capture this context in the current fluid and dynamic environment. For example, within the sub-collection on food, diet and nutrition, research on cancer prevention related to dairy and meat consumption[1] is part of the Livestock, Environment and People project,[2] which is supported by the Wellcome Trust’s Our Planet Our Health Programme.[3] It suggests that research on health is very much entwined with wider scientific and social issues. Another related topic, “alternative protein”,[4] is also becoming part of the discourse: how information related to these products is distributed online will have an impact on our choice of diet. There was a recent case in which a commercial company making plant-based products misled customers in its ads, press materials and Twitter posts[5] by using data out of context. The topic of “alternative protein” is not limited to traditional plant-based products but also covers innovations such as cultivated meat. Cultivated meat, also called cultured protein/meat or lab-grown meat,[6] is labelled as affordable, nutritious and sustainable, although this has also drawn debate.[7] However, on the web, there is little discussion of what larger-scale consumption of cultivated meat will mean for one’s health. At the same time, traditional farmers are working to promote the health benefits of red meat and dairy consumption[8] as well as marketing their products as local and environmentally conscious.[9] Clearly, online information that may have an impact on our diet is not an isolated topic; it instead goes beyond nutrition and medical concerns.

The mismatches and gaps between content created online and health information needs raise a set of less visible challenges. Research has pointed out the complex needs of those who seek online health information.[10] Such needs are not always met. Research by Abu-Serriah and colleagues shows: “of the 156 OMFS units identified in the UK, only 51% had websites. None of the websites contained more than 50% of what patients expected. Interestingly, the study has shown considerable geographical variation across the UK. While almost 80% of the OMFS units in London had websites, there were none in Northern Ireland and Wales.”[11] Within this online information ecosystem, individuals are largely in a passive position; they have little control over what is available on the web. Therefore, the content within the collection does not always reflect end users’ needs. Coverage of the collection could easily be skewed due to the digital divide on the web.

Faceted 4D Modelling as a Collection Development Tool

Despite these complexities, I view the process of developing web archives as an attempt to reconstruct the web space in web archives. It does not mean that I seek to replicate this space. As shown in the chart, health information online is only part of the general health information ecosystem. Such content curated into the collection will be an even smaller proportion. Nevertheless, with this in mind, it does offer a roadmap that can guide the development of the collection.


To illustrate how I am developing a sub-collection on food, diet and nutrition for the Archive of Tomorrow project,[12] I have formulated a faceted 4D modelling approach. It defines the collection’s scope through four dimensions: content creator, content, audience, and geographical coverage, as shown in the following chart. The facets within each dimension are used to profile websites that could be included in this sub-collection. It offers different routes to narrow the topic down so that the boundary of the sub-collection can be defined and the collection process can be articulated.


I plotted the seeds in this sub-collection and visualised them using a 4D model, which can aid us in identifying collection gaps and refining search strategies with focused effort. According to this model, a large proportion of content in this sub-collection relates to healthy food and diets from commercial or media organisations for the general public and consumers. A much smaller number targets groups such as the elderly, children, young adults and professionals. Content that is relevant to policy and guidance aspects may not have been covered well (perhaps there are not many sources online). Most materials are at the national level. The model offers a direction for the next stage of collection development.
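To make the idea of profiling and tallying seeds concrete, here is a minimal sketch; the seed records and facet values are purely illustrative and are not the project’s actual controlled vocabulary:

```python
# A minimal sketch of profiling seeds against the four dimensions.
# The facet values and seed records are illustrative only.
from collections import Counter

seeds = [
    {"url": "https://example-news.co.uk/diet", "creator": "media",
     "content": "healthy eating", "audience": "general public", "coverage": "national"},
    {"url": "https://example-charity.org.uk/children", "creator": "charity",
     "content": "nutrition guidance", "audience": "children", "coverage": "regional"},
]

# Tally each dimension separately to see where the sub-collection is thin.
for dimension in ("creator", "content", "audience", "coverage"):
    print(dimension, Counter(seed[dimension] for seed in seeds))
```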


However, this modelling approach is an evolving process. The vocabularies can only be established as the collecting efforts progress, and this is often limited by my own knowledge and judgements. Since it is currently a manual and rather time-consuming process, it might only be useful in the development of a small, focused sub-collection; I am currently testing it on this sub-collection only. Its use for a large collection is probably only sustainable when sufficient resources are available or when other forms of technical support can automate the process, such as generating themes, topics, and keywords. While it could be very difficult to embed these facets into metadata, it does offer a different approach to refining a collection.
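As one example of what such automation might look like (a hypothetical sketch, not part of the project’s current workflow), TF-IDF weighting over extracted page text could suggest candidate keywords for each seed:

```python
# A rough sketch of automatic keyword suggestion, assuming page text has
# already been extracted from archived seeds; the documents are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "plant based protein products and healthy diets",
    "red meat and dairy consumption health benefits for farmers",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(documents)
terms = vectorizer.get_feature_names_out()

# Print the highest-weighted terms per document as candidate facet keywords.
for row in matrix.toarray():
    top = sorted(zip(terms, row), key=lambda pair: pair[1], reverse=True)[:3]
    print([term for term, weight in top if weight > 0])
```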

The model, as a concept, can be adapted flexibly in various collections by identifying different dimensions and facets that are more relevant to a particular topic. It offers a framework to track and review the progress of the collection development. It can also be used to assess the quality of the collection and identify gaps. If these facets could be embedded into metadata, it might offer opportunities for end-users to scope and refine a collection as datasets. This model is not an attempt to resolve those difficulties highlighted at the beginning of this blog, but it at least helps articulate some of the complexities during the collection development process.

For more information, visit the project discourse site https://ukwa.discourse.group/ and join the discussion. If you would like to make suggestions to improve the collection or are interested in using data from the project, please contact the project team at aot@nls.uk.


[1] https://oxford.shorthandstories.com/cancer-prevention/  (Nov 23. 2022.)

[2] https://www.leap.ox.ac.uk/about (Nov 23. 2022.)

[3] https://wellcome.org/what-we-do/climate-and-health (Nov 23. 2022.)

[4] https://www.ukri.org/what-we-offer/our-main-funds/industrial-strategy-challenge-fund/clean-growth/transforming-food-production-challenge/alternative-proteins-new-horizons-for-novel-and-traditional-food-production/ (Nov 23. 2022.)

[5] https://www.asa.org.uk/rulings/oatly-uk-ltd-g21-1096286-oatly-uk-ltd.html (Nov 29. 2022.)

[6] https://www.cellularagriculture.co.uk/ , https://roslintech.com/ ; https://cellag.uk/ (Nov 29. 2022.)

[7] https://www.tabledebates.org/letterbox/depolarising-future-protein (Nov 23. 2022.)

[8] https://www.farminguk.com/news/health-benefits-of-red-meat-and-dairy-needs-highlighting-to-sell-more-research-says_48459.html (Nov 29. 2022.)

[9] https://northmoormeat.co.uk/ (Nov 29. 2022.)

[10] Wollmann, K., Der Keylen, P., Tomandl, J., Meerpohl, J. J., Sofroniou, M., Maun, A., & Voigt-Radloff, S. (2021). The information needs of internet users and their requirements for online health information—A scoping review of qualitative and quantitative studies. Patient Education and Counseling, 104(8), 1904-1932.

[11] Abu-Serriah, M., Valiji Bharmal, R., Gallagher, J., & Ameerally, P.J. (2013). Patients’ expectations and online presence of Oral and Maxillofacial Surgery in the United Kingdom. British Journal of Oral & Maxillofacial Surgery, 52(2), 158-162.

[12] https://www.nls.uk/about-us/working-with-others/archive-of-tomorrow/ (Nov 29. 2022.)

Web Archiving the War in Ukraine

By Olga Holownia, Senior Program Officer, IIPC & Kelsey Socha, Administrative Officer, IIPC with contributions to the Collaborative Collection section by Nicola Bingham, Lead Curator, Web Archives, British Library; CDG co-chair


This month, the IIPC Content Development Working Group (CDG) launched a new collaborative collection to archive web content related to the war in Ukraine, aiming to map the impact of this conflict on digital history and culture. In this blog, we describe what is involved in creating a transnational collection and we also give an overview of web archiving efforts that started earlier this year: both collections by IIPC members and collaborative volunteer initiatives.

Collaborative Collection 2022

In line with the broader content development policy, CDG collections focus on topics that are transnational in scope and are considered of high interest to IIPC members. Each collection represents more perspectives than similar collections by a single member archive may include. Nominations are submitted by IIPC members, who have been archiving the conflict since as early as January 2022 (see below), as well as by the general public.

How do members contribute?

Topics for special collections are proposed by IIPC members, who submit their ideas to the IIPC CDG mailing list or contact the IIPC co-chairs directly at any time. Provided that the topic fits with the CDG collecting scope, there is enough data budget to cover the collection, and a lead curator and volunteers to perform the archiving work are in place, the collection can go ahead. IIPC members are then canvassed widely to submit web content on a shared Google spreadsheet together with associated metadata such as title, language and description. The URLs are taken from the spreadsheet and crawled in Archive-It by the project team, formed of volunteers from IIPC members for each collection. Many IIPC members add a selection of seeds from their institutions’ own collections, which helps to make CDG collections very diverse in terms of coverage and language.
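As a rough illustration of that workflow, the sketch below reads a CSV export of such a seed spreadsheet and weeds out blank, malformed and duplicate nominations before they reach the crawler; the file name and column names are hypothetical and may not match the actual CDG template:

```python
# A minimal sketch of tidying a CSV export of a shared seed spreadsheet
# before crawling; file and column names are hypothetical.
import csv
from urllib.parse import urlparse

seen = set()
with open("ukraine-2022-nominations.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        url = row.get("URL", "").strip()
        parsed = urlparse(url)
        # Skip blank rows, malformed URLs and duplicate nominations.
        if parsed.scheme not in ("http", "https") or url in seen:
            continue
        seen.add(url)
        print(url, row.get("Language", ""), row.get("Title", ""))
```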

There will be overlap between the seeds that members submit to CDG collections and their own institutions’ collections; however, there are differences. Selections for IIPC collections can be more geographically wide-ranging than those included in members’ own collections, which may, for example, have to adhere to a regional scope, as in the case of a national library. Selection decisions that are appropriate for members’ own collections may not be appropriate for CDG collections. For example, members may want to curate individual articles from an online newspaper by crawling each one separately, whereas, given the larger scope of CDG collections, it would be more appropriate to create the target at the level of a sub-section of the online newspaper. Public access to collections provided by Archive-It is a positive factor for those institutions that, for various reasons, can’t provide access to their collections. You can learn more about the War in Ukraine 2022 collection’s scope and parameters here.

Public nominations

We encourage everyone to nominate relevant web content as defined by the collection’s lead curators: Vladimir Tybin and Anaïs Crinière-Boizet of the BnF, National Library of France and Kees Teszelszky of KB, National Library of the Netherlands. The first crawl is scheduled to take place on 27 July and it will be followed by two additional crawls in September and October. We will be publishing updates on the collection at #Ukraine 2022 Collection. We are also planning to make this collection available to researchers.

Member collections

In Spring 2022, we compiled a survey of the work done by IIPC members. We asked about the collection start date, scope, frequency, type of collected websites, way of collecting (e.g. locally and/or via Archive-It), social media platforms and access.

IIPC members have collected content related to the war, ranging from news portals, to governmental websites, to embassies, charities, and cultural heritage sites. They have also selectively collected content from Ukrainian and Russian websites and social media, including Facebook, Reddit, Instagram, and, most prominently, Twitter. The CDG collection offers another chance for members without special collections to contribute seeds from their own country domains.

Many of our members are national libraries and archives, and legal deposit informs what these institutions are able to collect and how they provide access. In most cases, that would mean crawling country-level domains, offering a localized perspective on the war. Access varies from completely open (e.g. the Internet Archive, the National Library of Australia and the Croatian Web Archive), to onsite-only with published and browsable metadata such as collected URLs (e.g. the Hungarian Web Archive), to reading-room only (e.g. Netarkivet at the Royal Danish Library or the “Archives de l’internet” at the National Library of France). The UK Web Archive collection has a mixed model of access, where the full list of metadata and collected URLs is available, but access to individual websites depends on whether the website owner has granted permission for off-site open access. Some institutions, such as the Library of Congress, may have time-based embargoes in place for collection access.

Some of our members have also begun work preparing datasets and visualisations for researchers. The Internet Archive has been supporting multiple collections and volunteer projects and our members have provided valuable advice on capturing content that is difficult to archive (e.g. Telegram messages).

A map of IIPC members currently collecting content related to the war in Ukraine can be seen below. It includes Stanford University, which has been supporting SUCHO (Saving Ukrainian Cultural Heritage Online).

Survey results

Access

While many members have been collecting content related to the war, only a small number of collections are currently publicly available online. Some members provide access to browsable metadata or a list of URLs. The National Library of Australia has been collecting publicly available Australian websites related to the conflict, as is the case for the National Library of the Czech Republic. A special event collection of 162 crowd-sourced URLs is now accessible at the Croatian Web Archive. The UK Web Archive’s special collection of nearly 300 websites is fully available on-site; however, information about the collected resources, which currently include websites of Russian oligarchs in the UK, commentators, charities, think tanks and the UK embassies of Ukraine and the surrounding nations, is publicly available online. Some websites from the UK Web Archive’s collection are also fully available off-site, where website owners have granted permission. The National Library of Scotland has set up a special collection, ‘Scottish Communities and the Ukraine’, which contains nearly 100 websites and focuses on the local response to the Ukraine War. This collection will be viewable in the near future pending QA checks. Most of the University Library of Bratislava’s collection is only available on-site, but information about the sites collected is browsable on their web portal, with links to current versions of the archived pages.

The web archiving team at the National Széchényi Library in Hungary, which has been capturing content from 75 news portals, has created a SolrWayback-based public search interface which provides access to metadata and full-text search, though full pages cannot be viewed due to copyright. The web archiving team has also been collaborating with the library’s Digital Humanities Center to create datasets and visualisations related to captured content.

Márton Nemeth of National Széchényi Library and Gyula Kalcsó of Digital Humanities Center, National Széchényi Library presented on this collection at the 2022 Web Archiving Conference.

Multiple institutions plan to make their content available online at a later date, after collecting has finished or after a specified period of time has passed. The Library of Congress has been capturing content in a number of collections within the scope of their collecting policies, including the ongoing East European Government Ministries Web Archive.

Frequency of Collection

Most institutions have been collecting with a variety of frequencies. Institutions rarely answered with just one of the frequency options, opting instead to pick multiple options or “Other.” Of answers in the “Other” category, some were doing one-time collection, while others were collecting yearly, six-monthly, and quarterly.

How the content is collected

Most IIPC members crawl the content locally, while a few have also been using Archive-It. SUCHO has mostly relied on the browser-based crawler Browsertrix, which was developed by Ilya Kreymer of Webrecorder and is in part funded by the IIPC, and on the Internet Archive’s Wayback Machine.

Type of collected websites (your domain)

When asked about types of websites being collected within local domains, most institutions have been focusing on governmental and news-related sites, followed by embassies and official sites related to Ukraine and Russia as well as cultural heritage sites. Other websites included a variety of crisis relief organisations, non-profits, blogs, think tanks, charities, and research organisations.

Types of websites/social media collected

When asked more broadly, most members have been focusing on local websites from their home countries. Outside local websites, some institutions were collecting Ukrainian websites and social media, while a smaller number were collecting Russian websites and social media.

Specific social media platforms collected

The survey also asked specifically about social media platforms our members were collecting from: Reddit, Instagram, TikTok, Tumblr, and YouTube. While many institutions were not collecting social media, Twitter was otherwise the most commonly collected social media platform.

Internet Archive

The Internet Archive (IA) has been instrumental in providing support for multiple initiatives related to the war in Ukraine. IA’s initiatives have included:

  1. giving free Archive-It accounts, as well as general data storage, to a number of different community archiving efforts
  2. uploading files to SUCHO collection at archive.org
  3. supporting the extensive use of Save Page Now (especially via the Google Sheets interface) with the help of numerous SUCHO volunteers (many tens of terabytes have been archived this way)
  4. supporting the uploading of WACZ files to the Wayback Machine. This work has just started but a significant number of files are expected to be archived and, similar to other collections featured in the new “Collection Search” service, a full-text index will be available
  5. crawling the entire country code top level domain of the Ukrainian web (the crawl was launched in April and is still running)
  6. archiving Russian Independent Media (TV, TV Rain), Radio (Echo of Moscow) and web-based resources (see “Russian Independent Media” option in the “Collection Search” service at the bottom of the Wayback Machine).

IA’s Television News Archive, the GDELT Project, and the Media-Data Research Consortium have all collaborated to create the Television News Visual Explorer, which allows for greater research access to the Television News Archive, including channels from across Russia, Belarus, and Ukraine. This blog post by GDELT’s Dr. Kalev H. Leetaru explains more of the significance of this collaboration, and the importance of this new research collection of Belarusian, Russian and Ukrainian television news coverage.

Volunteer initiatives

SUCHO

One of the largest volunteer initiatives focusing on preserving Ukrainian web content has been SUCHO. Involving over 1,300 librarians, archivists, researchers and programmers, SUCHO is led by Stanford University’s Quinn Dombrowski, Anna E. Kijas of Tufts University, and Sebastian Majstorovic of the Austrian Centre for Digital Humanities and Cultural Heritage. In its first phase, the project’s primary goal was to archive at-risk sites, digital content, and data in Ukrainian cultural heritage institutions. So far over 30TB of content and 3,500+ websites of Ukrainian museums, libraries and archives have been preserved and a subset of this collection is available at https://www.sucho.org/archives. The project is beginning its second phase, focusing on coordinating aid shipments of digitization hardware, exhibiting Ukrainian culture online and organizing training for Ukrainian cultural workers in digitization methods.

The SUCHO leads and Ilya Kreymer presented on their work at the 2022 Web Archiving Conference and participated in a Q&A session moderated by Abbie Grotke of the Library of Congress.

The Telegram Archive of the War

Screenshot from the Telegram Archive of the War, taken July 20, 2022.

Telegram has been the most widely used application in Ukraine since the onset of the war, but this messaging app is notoriously difficult to archive. A team of five archivists at the Center for Urban History in Lviv, led by Taras Nazaruk, has been archiving almost 1,000 Telegram channels since late February to create the Telegram Archive of the War. Each team member has been assigned to monitor and archive a topic or a region in Ukraine. They focus on capturing official announcements from different military administrative districts, ministries, local and regional news, volunteer groups helping with evacuation, searches for missing people, local channels for different towns, databases, cyberattacks, Russian propaganda and fake news, as well as personal diaries, artistic reflections, humour and memes. Russian government propaganda and pro-Russian channels and chats are also archived. The multi-media content is currently grouped into over 20 thematic collections. The project coordinators have also been working with universities interested in supporting this archive and are planning to set up a working group to provide guidance for future access to this invaluable archive.

Ukraine collections on Archive-It

New content has been gradually made available within the Ukraine collections on Archive-It, which provided free or heavily cost-shared accounts to its partners earlier this year. These collections also include websites documenting the Ukraine Crisis 2014-2015, curated by the University of California, Berkeley (UC Berkeley) and by Internet Archive Global Events. Four new collections have been created since February 2022, with over 2.5TB of content. The largest publicly available collection about the 2022 conflict (around 200 URLs) is curated by the Ukrainian Research Institute at Harvard University. Other collections that focus on Ukrainian content are curated by the Center for Urban History of East Central Europe, UC Berkeley and SUCHO. To learn more about the “War in Ukraine: 2022” collection, read this blog post by Liladhar R. Pendse, Librarian for East European, Central European, Central Asian and Armenian Studies Collections, UC Berkeley. New College, University of Oxford, has been archiving at-risk Russian cultural heritage on the web as well as Russian opposition efforts against the war on Ukraine.

Ukrainian Research Institute at Harvard University’s collection at Archive-It.

Organisations interested in collecting web content related to the war in Ukraine can contact Mirage Berry, Business Development Manager at the Internet Archive.

How to get involved

  1. Nominate web content for the CDG collection
  2. Use the Internet Archive’s “Save Page Now”
  3. Check updates on the SUCHO Page for information on how you can contribute to the new phase of the project. SUCHO is currently accepting donations to pay for server costs and funding digitization equipment to send to Ukraine. Those interested in volunteering with SUCHO can sign up for the standby volunteer list here
  4. Help the Center for Urban History in Lviv by nominating Ukrainian Telegram channels that you think are worth archiving and participate in their events
  5. Submit information about your project: we are working to maintain a comprehensive and up-to-date list of web archiving efforts related to the war in Ukraine. If you are involved in a collection or a project and would like to see it included here, please use this form to contact us: https://bit.ly/archiving-the-war-in-Ukraine.

Many thanks to all of the institutions and projects featured on this list! We appreciate the time our members spent filling out our survey, and answering questions. Special thanks to Nicola Bingham of the British Library, Mark Graham and Mirage Berry of the Internet Archive, and Taras Nazaruk of the Center for Urban History in Lviv for providing supplementary information on their institutions’ collecting efforts.


Archive of Tomorrow – Capturing online health (mis)information

By Alice Austin, Web Archivist, Archive of Tomorrow

Centre for Research Collections, Main Library, University of Edinburgh


Copyright ©2021 R. Stevens / CREST (CC BY-SA 4.0)

It goes without saying that the Covid-19 pandemic has cast a harsh light across our society and exposed fault lines in a number of areas, not least in the fragility of our information infrastructures. Over the last two years we have seen misinformation spread at a similar speed to the virus, with the consequence that any future attempt to examine the medical pandemic as an historical and social phenomenon will also have to reckon with the misinformation pandemic. Government and medical websites have changed on a daily basis as new information emerges, and there has been a massive proliferation of comment on social media and other online platforms about the virus and other health issues. Clinical advice, data and scientific evidence have been contested, revised, used and misused with dramatic and sometimes tragic consequences, and yet the digital record of this is fragile and difficult to access. There have been sustained and laudable efforts to ensure that inaccurate and potentially harmful information is taken down swiftly, with the result that a researcher exploring (e.g.) the emergence of ivermectin as a Covid ‘miracle cure’ might find they come up against a lot of dead ends and 404s.

Goals of the Archive of Tomorrow

In response, the Archive of Tomorrow project hopes to capture an accurate record of how people use the internet to find, share, and discuss health and health-related topics so that current and future researchers can understand public health practices in the digital age. We hope to capture 10,000 targets – ranging from official, ‘approved’ and verified sources, to unofficial, sometimes controversial publications – and to secure access permission for this content to produce a ‘research-ready’ collection. The project is ambitious, not just in its intention to build a useful evidence base of historical web resources but also in the attempt to develop an ethical and meaningful precedent for archiving possible mis- or dis-information. Because it crystallises so many of these issues, COVID is one subject that we’re focusing on in detail, but we’re also looking at capturing other health-related debates such as those that surround reproductive rights, ‘alternative’ medicines, assisted dying, and the use of medical cannabis.

Timeline

Having launched in February 2022, the project is still in the early stages of development. It’s being led by the National Library of Scotland with web archivists based in university libraries in Edinburgh, Oxford and Cambridge, and invaluable input from the British Library’s web archiving team. This kind of collaborative working feels very much representative of the Covid era – it’s hard to imagine a project like this emerging in the days when remote working and Zoom meetings were the exception rather than the norm! We’ll be talking more about the collaborative nature of the project at the IIPC WAC conference in May – and registration is open now!

Selecting ‘health information’

Thinking about how work practices have changed throughout the pandemic brings us to something that has been a challenge for the project team to unravel – how to define the boundaries around ‘health information’: where it begins and ends, and how health relates to other spheres like politics, law, employment and so on. We have to impose boundaries on our collecting, and while some boundaries are legislative or technological (such as the exclusion of broadcast media like podcasts and videos from the collection), some are cultural: for example, to what extent do protests against Covid measures such as masks and lockdowns count as health information? What about artistic responses to the pandemic? And how well are we able to represent health information-seeking behaviours in languages other than English?

Welsh COVID-19 Pandemic guide: what to do and not do. Copyright © 2020 G. Hegasy (CC BY-SA 4.0)

Archivists have long understood that we can’t collect everything – and we don’t try to! As with so much collecting, the challenge lies in how to communicate our selection decisions without dictating the way the archived material is used and encountered. In this case, we’re trying to capture public health discourse and not be part of the conversation ourselves, but we do have a degree of responsibility when considering health mis/dis/information – to what extent should such inaccurate, refuted or dangerous content be flagged in the UKWA interface? How do we make such content available responsibly without inserting our perspective into the debates?

Archive of Tomorrow workshop

At this stage we have more questions than answers, and we anticipate that this will continue. The project isn’t designed to solve these problems, but rather, to articulate them in a way that opens the door for future work and solutions. Our first activity towards this goal is the workshop that we’re hosting at the end of the month. We hope that by engaging with current and future researchers with an interest in online information-seeking behaviours or public health we can develop and produce a valuable, research-ready collection that will give real insight into how the internet has been used for health information during the pandemic and beyond.

IIPC – Meet the Officers, 2022

The IIPC is governed by the Steering Committee, formed of representatives from fifteen Member Institutions who are each elected for three-year terms. The Steering Committee designates the Chair, the Vice-Chair and the Treasurer of the Consortium. Together with the Senior Program Officer, based at the Council on Library and Information Resources (CLIR), the Officers make up the Executive Board and are responsible for dealing with the day-to-day business of running the IIPC.

The Steering Committee has designated Kristinn Sigurðsson of the National and University Library of Iceland to serve as Chair, Abbie Grotke of the Library of Congress to serve as Vice-Chair in 2022, and Ian Cooke of the British Library to serve as the IIPC Treasurer. Olga Holownia continues as Senior Programme Officer, and CLIR remains the Consortium’s financial and administrative host.


IIPC CHAIR

Kristinn Sigurðsson is Head of Digital Projects and Development at the National and University Library of Iceland. He joined the library in 2003 as a software developer. Over the years he has worked on a multitude of projects related to the acquisition, preservation and presentation of digital content, as well as the digital reproduction of physical media. This includes leading the build-up of the library’s legal deposit web archive – which now contains nearly 4 billion items – as well as its very popular newspaper/magazine website.

He has also been very active within the IIPC and related web archiving collaborations. This includes working on the first version of the Heritrix crawler in 2003-4 (and on and off since). In 2010 he joined the IIPC Steering Committee and took over as co-lead of the Harvesting Working Group. More recently he has served as the Lead of the Tools Development Portfolio.

IIPC VICE-CHAIR

Abbie Grotke, IIPC Chair 2021

Abbie Grotke is Assistant Head, Digital Content Management Section, within the Digital Services Directorate of the Library of Congress, and leads the Web Archiving Team. Since 2002 she has been involved in the Library’s web archiving program. In her role, Grotke has helped develop policies, workflows, and tools to collect and preserve web content for the Library’s collections and provides overall program management for web archiving at the Library. She has been active in a number of collaborative web archive collections and initiatives, including the U.S. End of Term Government Web Archive, and the U.S. Federal Government Web Archiving Interest Group.

Since the Library of Congress joined the IIPC as a founding member in 2003, Abbie has served in a variety of roles and on a number of working groups, task forces, and committees. She spent a number of years as Communications Officer, and was a member of the Access Working Group. More recently, she has served as co-leader of the Content Development and Training Working Groups and of the Membership Engagement Portfolio, and served as Chair in 2021. She has been a member of the Steering Committee since 2013.

IIPC TREASURER


Ian Cooke leads the Contemporary British Publications team at the British Library, which is responsible for curation of 21st century publications from the UK and Ireland. This includes the curatorial team for the UK Web Archive, as well as digital maps, emerging formats and print and digital publications ranging from small press and artists’ books to the latest literary blockbusters. Ian joined the British Library’s Social Sciences team in 2007, having previously worked in academic and research libraries, taking up his current role in 2015.

Ian has been a member of the IIPC Steering Committee and has worked on strategy development for the IIPC. The British Library was the host for the Programmes and Communications role up to April 2021.

UKWA update for the 2021 World Digital Preservation Day

By Andrew Jackson, Web Archiving Technical Lead, UK Web Archive, British Library

It’s World Digital Preservation Day 2021 (#WDPD2021), so this is a good opportunity to give an update on what is going on today at the DPC Award-winning UK Web Archive.

Domain and frequent crawls

We run two main crawl streams, ‘frequent’ and ‘domain’ crawls. The ‘frequent’ one gets fresh content from thousands of sites every day. It also takes screenshots as it goes, so in the future we will know what the site was supposed to look like!

We’ve been systematically taking thousands of screenshots every day since about 2018.

The frequent crawls also refresh site maps every day, to make sure we know when content is added or updated. This is how websites make sure search engine indexes are up-to-date, and we can take advantage of that!
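For readers unfamiliar with site maps, here is a minimal sketch of the idea; the site map URL below is hypothetical, and the real crawl configuration lives in our crawl engine rather than in a standalone script:

```python
# A minimal sketch of checking a site map for recently changed pages.
import requests
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

resp = requests.get("https://www.example.co.uk/sitemap.xml", timeout=30)
root = ET.fromstring(resp.content)

for url in root.findall("sm:url", SITEMAP_NS):
    loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
    lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS)
    # 'lastmod' says when a page last changed, so frequently updated
    # pages can be re-queued for crawling.
    print(lastmod, loc)
```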

The ‘domain’ crawl is different – it runs once a year and attempts to crawl every known UK website once. It crawls around two billion URLs from millions of websites, so it’s a bit of a challenge to run.

The last two domain crawls have been run in the cloud, which has brought many new challenges, but also makes it much easier to experiment with different virtual hardware configurations.

The 2021 domain crawl is still running. There’s been a few ups and downs, and it’s not going as fast as we’d ideally like, but it’s chugging away at 200-250 URLs a second, on track to get to around two billion URLs by the end of the year.

UKWA domain crawl: URL totals over 100 days.

All in all, we gather around 150TB of web content each year, and we’re over a petabyte in total now. This graph shows how the crawls have grown over the years (although it doesn’t include the 2020/2021 domain crawls as they are still on the cloud).

The legal framework we operate under means we can’t make all of this available openly on our website, but the curatorial team work on getting agreements in place to allow this where we can, as well as telling the system what should be crawled.

This graph shows the growth in crawl targets over the last four weeks.
And this shows the growth in openly-accessible and curated web sites and pages over the same time period.

Open source tools

All that hard work means many millions of web pages are openly accessible via www.webarchive.org.uk/ – you can look up URLs and also do full-text faceted search. Not all web archives offer full-text search, and it’s been a challenge for us. Our website search indexes are not as up-to-date or complete as we’d like, but we’re working on it. At least we can be glad that we’re not working on it alone. Our search indexing tools are open source and are now in use at a number of web archives across the world. This makes me very happy, as I believe our under-funded sector can only sustain the custom tools we need if we find ways to share the load.

This is why almost the entire UK Web Archive software stack is open source, and all our custom components can be found at https://github.com/ukwa/. Where appropriate, we also support or fund developments in open source tools beyond our own, e.g. the Webrecorder tools – we want everyone to be able to make their own web archives!

Collaborations

We also collaborate through the IIPC and with their support we help run Online Hours to support open source users and collaborators. These are regular online videocalls to help us keep in touch with colleagues across the world: see here for more details – Join us!

We’re very fortunate to be able to work in the open, with almost all code on GitHub. Some of our work has been taken up and re-used by others. Among these collaborators, I’d like to highlight the work of the Danish Net Archive. They understand Apache Solr better than we do, and are working on an excellent search front-end called SolrWayback.

As well as full-text search, we also use this index to support activities specific to digital preservation by performing format analysis and metadata extraction during the text extraction process, and indexing that metadata as additional facets. We’ve not had time to make all this information available to the public, but some of it is accessible.
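To give a flavour of what querying such a faceted index looks like, here is a hedged sketch of a faceted Solr query; the endpoint and field names are illustrative only and may not match the live webarchive-discovery schema:

```python
# A minimal sketch of a faceted Solr query over an archive index.
# Endpoint and field names are illustrative, not the actual UKWA schema.
import requests

params = {
    "q": 'content_text:"digital preservation"',
    "rows": 0,                      # we only want the facet counts
    "facet": "true",
    "facet.field": ["content_type", "crawl_year"],
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/webarchive/select", params=params)
facets = resp.json()["facet_counts"]["facet_fields"]
print(facets.get("content_type"), facets.get("crawl_year"))
```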

Working with data and researchers

Our Shine search prototype covers older web archive material held by the Internet Archive, and if you know the Secret Squirrel Syntax, you can poke around in there and look for format information, e.g. by MIME type, by file extension or by first-four-bytes. We also generate datasets of information extracted from our collections, including format statistics, and make these available as open data via the new Shared Research Repository. But again, we don’t always have time to work on this, so keeping those datasets up to date is a big challenge.

One way to alleviate this is to partner with researchers in projects that can fund the resources and bring in the people to do the data extraction and analysis, while we facilitate access to the data and work with our British Library Open Research colleagues to give the results a long-term home. This is what happened with the recent work on how words have changed meaning over time, a research project led by Barbara McGillivray, and we’d like to support these kinds of projects in the future too.

Another tactic is to open up as much metadata as we can, and provide that via APIs that others can then build on. This was the notion behind our recent IIPC-funded collaboration with Tim Sherratt to add a web archives section to the GLAM Workbench, a collection of tools and examples to help you work with data from galleries, libraries, archives, and museums. We hope the workbench will be a great starting point for anyone wanting to do research with and about web archives.

If you’d like to know more about any of this, feel free to get in touch with me or with the UK Web Archive.

Happy World Digital Preservation Day!

 

Analysing Web Archives of the COVID-19 Crisis through the IIPC collaborative collection: early findings and further research questions

By the AWAC2 (Analysing Web Archives of the COVID-19 Crisis) team: Susan Aasman (University of Groningen, The Netherlands), Niels Brügger (Aarhus University, Denmark), Frédéric Clavert (University of Luxembourg, Luxembourg), Karin de Wild (Leiden University, The Netherlands), Sophie Gebeil (Aix-Marseille University, France), Valérie Schafer (University of Luxembourg, Luxembourg) 

As mentioned by Nicola Bingham in her blog post “IIPC Content Development Group’s activities 2019-2020” in July 2020, a huge effort has been made by the Content Development Group and IIPC members to create a unique collection of web material related to the pandemic, with contributions from over 30 members as well as public nominations from over 100 individuals/institutions.

This collection immediately attracted the interest of researchers because of the quantity of collected data, its transnational nature and the many possibilities it offers to explore web archives of this unprecedented period at an international level.

A strong interest in COVID-19 collections in WARCnet

The WARCnet project was launched at the beginning of 2020, just as the world was witnessing the first developments in the COVID-19 crisis.

WARCnet is a network of researchers and web archiving institutions (see WARCnet team) which aims to promote transnational research that will help us to understand the history of (trans)national web domains and transnational events on the web, drawing on the increasing volume of digital cultural heritage held in national web archives (Brügger, 2020). The network’s activities started in 2020 and will run until 2023, and they are funded by the Independent Research Fund Denmark | Humanities (grant no 9055-00005B). The network is organised into six working groups, with working group 2 (WG2) focusing on the study of transnational events through web archives.

WG2 decided to select the COVID-19 crisis as one of its first case studies and test beds, and it conducted a first distant reading of several collections related to the crisis, combining metadata from the special collections of national institutions like the British Library, the BnF (National Library of France) and INA (the French National Audiovisual Institute) in France, the BnL (National Library of Luxembourg), etc., with the IIPC collection. WG2 received metadata in the form of seed lists for these collections thanks to the web archiving institutions and a special agreement with them.

At the same time, a series of oral interviews was carried out with web archivists and web curators to shed more light on the selection and curation processes and the scope of these special collections, including an interview by Friedel Geeraert with Nicola Bingham on the IIPC collection (Geeraert and Bingham, 2020). Transcriptions of this series of oral interviews are available online for free download. You will find interviews with web archivists and curators working at INA, the BnF, the BnL, the IIPC, Netarkivet (Denmark), the National Széchényi Library in Hungary, the UK Web Archive, the Swiss National Library, the National Library of the Netherlands (KB) and the Icelandic Web Archive. Other interviews are scheduled.

An internal WG2 datathon on the metadata received from web archiving institutions, held at the very beginning of 2021, enabled us, for example, to compare national collections with one another and with the selections those institutions made for the IIPC coronavirus collection (table 1), and to measure the websites that emerged during the pandemic (table 2). Another goal of WG2 was to compare the metadata fields provided by the institutions (table 3) and the ways in which they could be intertwined.

Table 1: Presentation by Friedel Geeraert of her hypothesis about overlaps between the national collections and the IIPC collection, and the results (final meeting of our datathon)

Table 2: Presentation by Katharina Schmid and Friedel Geeraert of their hypothesis about “COVID-19 websites”, and the results (final meeting of our datathon)

Table 3: Overview and comparison of the data fields provided by each web archive, conducted by Karin de Wild and Niels Brügger

The data was provided by the following web archives: RDL (Royal Danish Library); BnF; NSL (National Széchényi Library, Hungary); IIPC; BnL; KB (Koninklijke Bibliotheek, the Netherlands); UKWA (UK Web Archive).

A new opportunity and a multi-partner collaboration

This first exploration led to the desire to go deeper into COVID-19 collections, and a unique chance was offered to us when the Archives Unleashed team launched its annual call for cohorts.


Some of the WG2 members therefore decided to submit a proposal entitled AWAC2, which stands for “Analysing Web Archives of the COVID-19 Crisis through the IIPC Novel Coronavirus Dataset”. With this application, we hoped to deepen our understanding of the IIPC COVID-19 collection at several levels:

First, it was a way to continue our initial distant exploration of (meta)data, in order to answer qualitative questions such as:

(1) Participation in the collection by web archiving institutions and web archive representatives
— how many URLs inside/outside ccTLDs?
— can we indirectly document some countries with no national web archives through this collection?
— comparison of national collections and their selection for IIPC (based on Danish and Luxembourgish collections)
(2) Categories of stakeholders, websites, representativeness and inclusiveness
(3) New event-specific websites
(4) MIME types and visual studies
(5) Hyperlink networks
What characterises the hyperlink network of websites included in the IIPC collection? Can any national website clusters be identified? (This analysis could be performed in Gephi; see the sketch below.)
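
To make that last question concrete, here is a minimal sketch of how such a hyperlink analysis might begin in Python, assuming a hypothetical domain-level link graph derivative (a CSV with src, dst and count columns, loosely modelled on the kind of graph derivatives the Archives Unleashed tools produce). The file name, column names and clustering method are illustrative rather than a description of our actual workflow, and the resulting graph can also be exported for Gephi:

```python
import networkx as nx
import pandas as pd
from networkx.algorithms import community

# Hypothetical domain-level link graph derivative:
# one row per (source domain, target domain) pair with a link count.
edges = pd.read_csv("iipc_covid_domain_graph.csv")  # columns: src, dst, count (illustrative)

# Build a weighted, directed graph of domains.
G = nx.from_pandas_edgelist(edges, source="src", target="dst",
                            edge_attr="count", create_using=nx.DiGraph)

# Which domains attract the most links? (a crude proxy for central actors)
top_targets = sorted(G.in_degree(weight="count"), key=lambda x: x[1], reverse=True)[:20]
print("Most linked-to domains:", top_targets)

# Look for clusters (possibly national ones) with a simple modularity-based method.
clusters = community.greedy_modularity_communities(G.to_undirected(), weight="count")
for i, cluster in enumerate(list(clusters)[:5]):
    print(f"Cluster {i}: {sorted(cluster)[:10]} ...")

# Export for visual exploration in Gephi.
nx.write_gexf(G, "iipc_covid_domain_graph.gexf")
```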

Second, it was an opportunity to obtain complementary data and to combine research methods, especially distant and close reading, thanks to the possibility of accessing full text.

Third, it was also a unique chance to further explore the possibilities offered by the Archives Unleashed tools, which the team has been developing for many years and which we were introduced to at the pre-conference workshop organised by Ian Milligan and Nick Ruest at the 2019 RESAW conference in Amsterdam; we have been following the tools ever since through reports about the team’s activities and academic papers (Ruest et al., 2021). It was also an opportunity to benefit from regular discussions with the team, enrich our computational skills and create a research dynamic with an efficient and impressive team in Canada.

Finally, it was a way to explore new research questions that we had in mind from the beginning, and which are more related to topical approaches (e.g. Women, gender and COVID-19). We will come back to this in the last section.

First exciting steps

We are very thankful to the IIPC, Archive-It and the Archives Unleashed team for selecting our project and for the agreement that was signed to give us access to this large dataset. Since the end of August 2021, following the July launch of the Cohort Programme, which enabled us to meet all the participants and cohorts in the yearly programme, the AWAC2 team has been able to explore the new Archives Unleashed interface within Archive-It and the many datasets and visualisations that have been made available to us (figures 1 and 2).

Figure 1: The interface made available by the Archives Unleashed team to download data easily and securely, and to select among MIME types, domain names, full text, the web graph, etc.

Figure 2: An interface for visualising samples (here, a top hosts sample)

The technical skills within the AWAC2 team are heterogeneous: while some members immediately began analysing data, others initially struggled to download some of the datasets, as the collection contains a huge amount of data and requires computational skills. However, the team is now on track and benefits greatly from the two regular monthly meetings with the Archives Unleashed team, whose availability to answer questions, explore technical issues with us and share (and explain) notebooks is amazing.

The AWAC2 team immediately started mapping data, framing the scope (table 4), discussing methods and creating samples to study several aspects related to multilingualism and stakeholders represented in web archives (tables 5 to 8).

Table 4: Extract from the overview of the dataset produced by Niels Brügger

Table 5: Analysis by Frédéric Clavert of record counts by crawl date in the IIPC collection (30% sample)

Table 6: Extract from a visualisation of the 50 most archived domains from a diachronic perspective (F. Clavert)

Table 7: Visualisation of crawl frequency by language, using the pandas and Altair libraries for Python, full sample (F. Clavert)

Table 8: A first distant reading of randomly selected French-language content (30% sample) using Iramuteq (F. Clavert)
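
Table 7, for instance, was produced with pandas and Altair. As a rough, non-authoritative illustration of that kind of chart, here is a minimal sketch assuming a hypothetical derivative CSV with crawl_date and language columns (the file and column names are ours, not part of the collection’s actual derivatives):

```python
import pandas as pd
import altair as alt

# Hypothetical derivative: one row per archived capture,
# with a crawl date and a detected language code.
df = pd.read_csv("iipc_covid_captures.csv", parse_dates=["crawl_date"])

# Count captures per month and per language.
monthly = (
    df.groupby([pd.Grouper(key="crawl_date", freq="M"), "language"])
      .size()
      .reset_index(name="captures")
)

# Stacked bar chart of crawl frequency by language over time.
chart = (
    alt.Chart(monthly)
    .mark_bar()
    .encode(
        x=alt.X("yearmonth(crawl_date):T", title="Crawl month"),
        y=alt.Y("captures:Q", title="Number of captures"),
        color=alt.Color("language:N", title="Language"),
    )
)
chart.save("crawl_frequency_by_language.html")
```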

Full text access also gives us an opportunity to combine methodologies and use tools like Iramuteq that allow text mining. This is the next step we are hoping to achieve…

Taking things further… with you!

To deepen collaboration with IIPC members and to continue the fruitful dialogue that began when we first explored the IIPC collection, we will of course continue to share our results and the insights into the crisis that we glean from web archives and from this COVID-19 collection, which may itself become an object of study as a mirror of web archiving practices, curation and methodologies. We also want to respond to the interest of the IIPC community by sharing our research questions with you, and we would like to invite you to vote for the research questions that we should investigate first.

The multidisciplinary nature and wide-ranging fields of expertise of the team have led to a long list of research interests, and the team is planning to meet for three days in March 2022 to conduct a test bed on one or two case studies. Our case studies will be the ones that you select:

  1. Research on Women, Gender and COVID-19 within this collection (e.g. domestic violence, care and homeschooling, etc.). We will probably use Iramuteq or Mallet on derivative files to perform text mining.
  2. Identify private journals of lockdowns, individual traces of daily life and different online expressions that offer insights into the ways people are dealing with COVID-19 in their everyday lives.
  3. Trace public support/opposition to lockdown. Can we conduct a sentiment analysis over time?
  4. How was the home schooling debate conducted on the web? How did the various stakeholders communicate about it?
  5. How can we identify fake news, conspiracy theories and other COVID-19-related controversies within such a large dataset?
  6. Is it possible to perform a visual analysis of what medical-scientific communication on COVID-19 looks like (and what type of visual communication is used, e.g. graphs, visuals, colours)?
  7. The pandemic seriously affected museums around the world and in some countries the web became a prominent channel for their communication. How did museum websites evolve during the COVID-19 pandemic?

Please select your two top case studies at https://www.surveymonkey.com/r/BRRX57T by 20 December 2021. Your choice will be ours! We are looking forward to discovering your selection.

References 

Bingham Nicola, “IIPC Content Development Group’s activities 2019-2020”, Netpreserve Blog, 2020.
https://netpreserveblog.wordpress.com/2020/07/01/iipc-content-development-groups-activities-2019-2020/

Brügger Niels, “Welcome to WARCnet”, Aarhus, WARCnet Paper, 2020.
https://cc.au.dk/fileadmin/user_upload/WARCnet/1.Bru__gger_Welcome_to_WARCnet.pdf

Geeraert Friedel and Bingham Nicola, “Exploring special web archives collections related to COVID-19: The case of the IIPC Collaborative collection. An interview with Nicola Bingham (British Library) conducted by Friedel Geeraert (KBR)”, Aarhus, WARCnet Paper, 2020.
https://cc.au.dk/fileadmin/user_upload/WARCnet/Geeraert_et_al_COVID-19_IIPC__1_.pdf

Ruest Nick, Fritz Samantha, Deschamps Ryan, Lin Jimmy, Milligan Ian, “From archive to analysis: accessing web archives at scale through a cloud-based interface”, International Journal of Digital Humanities, 2021.

Software:

IIPC Steering Committee Election 2021: nomination statements

The Steering Committee, composed of no more than fifteen Member Institutions, provides oversight of the Consortium and defines and oversees its strategy. This year five seats are up for election or re-election. In response to the call for nominations to serve on the IIPC Steering Committee for a three-year term commencing 1 January 2022, six IIPC member organisations have put themselves forward; their nomination statements appear below in alphabetical order.

The election will be held from 15 September to 15 October. The IIPC designated representatives of all member organisations will receive an email with instructions on how to vote. Each member will be asked to cast five votes. Representatives should ensure that they read all the nomination statements before casting their votes. The results of the vote will be announced on the Netpreserve blog and the Members mailing list on 18 October. The first Steering Committee meeting will be held online.

If you have any questions, please contact the IIPC Senior Program Officer.


Nomination statements in alphabetical order:

Bibliothèque nationale de France / National Library of France

The National Library of France (BnF) started its web archiving programme in the early 2000s and now holds an archive of nearly 1.5 petabytes. We develop national strategies for the growth and outreach of web archives and host several academic projects in our Data Lab. We use and share expertise about key tools for IIPC members (Heritrix 3, OpenWayback, NetarchiveSuite, webarchive-discovery) and contribute to the development of several of them. We have developed BCweb, an open source application for seed selection and curation, which is also shared with other national libraries in Europe.

The BnF has been involved in IIPC since its very beginning and remains committed to the development of a strong community, not only in order to sustain these open source tools but also to share experiences and practices. We have attended, and frequently actively contributed to, general assembly meetings, workshops and hackathons, and most IIPC working groups.

The BnF chaired the consortium in 2016-2017 and currently leads the Membership Engagement Portfolio. Our participation in the Steering Committee, if continued, will be focused as ever on making web archiving a thriving community, engaging researchers in the study of web archives and further developing access strategies.

The British Library

The British Library is an IIPC founding member and has enjoyed active engagement with the work of the IIPC. This has included leading technical workshops and hackathons; helping to coordinate and lead member calls and other resources for tools development; co-chairing the Content Development Group; hosting the Web Archiving Conference in 2017; and participating in the development of training materials. In 2020, the British Library, with Dr Tim Sherratt, the National Library of Australia and the National Library of New Zealand, led the IIPC Discretionary Funding project to develop Jupyter notebooks for researchers using web archives. The British Library hosted the Programme and Communications Officer for the IIPC until the end of March this year, and has continued to work closely on strategic direction for the IIPC. If elected, the British Library would continue to work on IIPC strategy and collaborate on the strategic plan. The British Library benefits a great deal from being part of the IIPC, and places a high value on the continued support, professional engagement and friendships that have resulted from membership. The nomination for membership of the Steering Committee forms part of the British Library’s ongoing commitment to the international community of web archiving.

Deutsche Nationalbibliothek / German National Library

The German National Library (DNB) has been archiving the web since 2012. Legal deposit in Germany covers websites and all kinds of digital publications, such as eBooks, eJournals and eTheses. The selective web archive currently includes about 5,000 sites with 30,000 crawls, and we plan to expand the collection to a larger scale. Crawling, quality assurance, storage and access are carried out together with a service provider rather than with common tools like Heritrix and the Wayback Machine.

Digital preservation has always been an important topic for the German National Library. In many national and international projects and co-operations, the DNB has worked on concepts and solutions in this area. Nestor, the network of expertise in long-term storage of digital resources in Germany, has its office at the DNB, and the Preservation Working Group of the IIPC was co-led for many years by the DNB.
On the IIPC Steering Committee, the German National Library would like to advance the joint preservation of the web.

Det Kongelige Bibliotek / Royal Library of Denmark

Royal Danish Library (in charge of the Danish national web archiving programme, Netarkivet) will serve the IIPC Steering Committee with expertise in web archiving built up since 2001. Netarkivet now holds a collection of more than 800 TB and is active in the open source development of web archiving tools such as NetarchiveSuite and SolrWayback. The RDL representative will bring the IIPC more than 20 years of experience of working with web archives, and RDL will contribute both technical and strategic competences to the SC, as well as skills in financial management, budgeting and project portfolio management. Royal Danish Library was among the founding members of the IIPC, served on the SC for a number of years, and is now ready for another term.

Koninklijke Bibliotheek / National Library of the Netherlands

At the National Library of the Netherlands (KBNL), our work is fuelled by the power of the written word. The library preserves stories, essays and ideas, both printed and digital. When people come into contact with these words, whether through reading, studying or conducting research, it has an impact on their lives. With this perspective in mind, we find it of vital importance to preserve web content for future generations.

We believe the IIPC is an important network organisation which brings together ideas, knowledge and best practices on how to preserve the web and retain access to its information in all its diversity. In recent years, KBNL has used its voice in the SC to raise awareness of the sustainability of tools (as we do by improving the Web Curator Tool), to point out the importance of quality assurance, and to co-organise WAC 2021. Furthermore, we have shared our insights and expertise on preservation in webinars and workshops. We have recently joined the Partnerships & Outreach Portfolio.

We would like to continue this work and bring together more organisations, large and small, from across the world, to learn from each other and to ensure that web content remains findable, accessible and re-usable for generations to come.

The National Archives (UK)

The National Archives (UK) is an extremely active web archiving practitioner and runs two open access web archive services – the UK Government Web Archive (UKGWA), which also includes an extensive social media archive, and the EU Exit Web Archive (EEWA). While our scope is limited to information produced by the government of the UK, we have nonetheless built up our collections to over 200 TB.

Our team has grown in capacity over the years and we are now increasingly becoming involved in research initiatives that will be relevant to the IIPC’s strategic interests.

With over 35 years’ collective team experience in the field, through building and running one of the largest and most used open access web archives in the world, we believe that we can provide valuable experience and we are extremely keen to actively contribute to the objectives of the IIPC through membership of the Steering Committee.

 

Asking questions with web archives – introductory notebooks for historians

“Asking questions with web archives – introductory notebooks for historians” is one of three projects awarded a grant in the first round of the Discretionary Funding Programme (DFP) the aim of which is to support the collaborative activities of the IIPC members by providing funding to accelerate the preservation and accessibility of the web. The project was led by Dr Andy Jackson of the British Library. The project co-lead and developer was Dr Tim Sherratt, the creator of the GLAM Workbench, which provides researchers with examples, tools, and documentation to help them explore and use the online collections of libraries, archives, and museums. The notebooks were developed with the participation of the British Library (UK Web Archive), the National Library of Australia (Australian Web Archive), and the National Library of New Zealand (the New Zealand Web Archive).


By Tim Sherratt, Associate Professor of Digital Heritage, University of Canberra & the creator of the GLAM Workbench

We tend to think of a web archive as a site we go to when links are broken – a useful fallback, rather than a source of new research data. But web archives don’t just store old web pages; they capture multiple versions of web resources over time. Using web archives we can observe change – we can ask historical questions. But web archives store huge amounts of data, and access is often limited for legal reasons. Just knowing what data is available and how to get to it can be difficult.

Where do you start?

The GLAM Workbench’s new web archives section can help! Here you’ll find a collection of Jupyter notebooks that document web archive data sources and standards, and walk through methods of harvesting, analysing, and visualising that data. It’s a mix of examples, explorations, apps and tools. The notebooks use existing APIs to get data in manageable chunks, but many of the examples demonstrated can also be scaled up to build substantial datasets for research – you just have to be patient!

What can you do?

Have you ever wanted to find when a particular fragment of text first appeared in a web page? Or compare full-page screenshots of archived sites? Perhaps you want to explore how the text content of a page has changed over time, or create a side-by-side comparison of web archive captures. There are notebooks to help you with all of these. To dig deeper, you might want to assemble a dataset of text extracted from archived web pages, construct your own database of archived PowerPoint files, or explore patterns within a whole domain.

A number of the notebooks use Timegates and Timemaps to explore change over time, and they could easily be adapted to work with any Memento-compliant system. For example, one notebook steps through the process of creating and compiling annual full-page screenshots into a time series.

Using screenshots to visualise change in a page over time.

Another walks forwards or backwards through a Timemap to find when a phrase first appears in (or disappears from) a page. You can also view the total number of occurrences over time as a chart.

 

Find when a piece of text appears in an archived web page.
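
As a rough illustration of the Timemap-walking idea (not the notebook’s actual code), the sketch below fetches a Memento Timemap from the Internet Archive and reports the first capture in which a phrase appears. The target URL and phrase are placeholders, the Timemap endpoint shown is the Internet Archive’s, and a real run should page politely and respect rate limits:

```python
import re
import requests

TARGET = "http://example.gov.au/"   # placeholder page to examine
PHRASE = "social distancing"        # placeholder phrase to look for

# Fetch the Timemap (link format) listing the Internet Archive's captures of the page.
timemap_url = f"http://web.archive.org/web/timemap/link/{TARGET}"
timemap = requests.get(timemap_url).text

# Pull out memento URLs and their datetimes with a simple regex.
mementos = re.findall(r'<([^>]+)>;\s*rel="[^"]*memento[^"]*";\s*datetime="([^"]+)"', timemap)

# Walk forwards through the captures until the phrase appears.
for url, datetime in mementos:
    html = requests.get(url).text
    if PHRASE.lower() in html.lower():
        print(f"'{PHRASE}' first appears in the capture dated {datetime}: {url}")
        break
else:
    print(f"'{PHRASE}' was not found in any capture of {TARGET}")
```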

The notebooks document a number of possible workflows. One uses the Internet Archive’s CDX API to find all the PowerPoint files within a domain. It then downloads the files, converts them to PDFs, saves an image of the first slide, extracts the text, and compiles everything into an SQLite database. You end up with a searchable dataset that can be easily loaded into Datasette for exploration.

Find and explore PowerPoint presentations from a specific domain.
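
A heavily simplified sketch of the first step of that workflow, querying the Internet Archive’s CDX API for PowerPoint captures within a domain, might look like the following. The domain is a placeholder, the filter shown only covers the older .ppt MIME type, and the conversion, slide-image and SQLite steps are left out:

```python
import requests

DOMAIN = "example.gov.au"  # placeholder domain

# Ask the Internet Archive CDX API for PowerPoint captures anywhere within the domain.
params = {
    "url": DOMAIN,
    "matchType": "domain",
    "filter": "mimetype:application/vnd.ms-powerpoint",
    "collapse": "urlkey",   # one row per unique URL
    "output": "json",
    "limit": 500,
}
rows = requests.get("http://web.archive.org/cdx/search/cdx", params=params).json()

if not rows:
    print("No PowerPoint captures found")
else:
    # The first row lists the field names; the rest are captures.
    fields, captures = rows[0], rows[1:]
    for capture in captures:
        record = dict(zip(fields, capture))
        # Each file can then be downloaded from the Wayback Machine ('id_' requests the raw file)
        # before being converted, imaged and indexed into SQLite.
        download_url = f"http://web.archive.org/web/{record['timestamp']}id_/{record['original']}"
        print(record["timestamp"], record["original"], download_url)
```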

While most of the notebooks work with small slices of web archive data, one harvests all the unique URLs from the gov.au domain and makes an attempt to visualise the subdomains. The notebooks provide a range of approaches that can be extended or modified according to your research questions.

Visualising subdomains in the gov.au domain as captured by the Internet Archive.
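
In the same spirit, a rough sketch of tallying subdomains from CDX results (again using the Internet Archive’s CDX API with illustrative parameters, rather than the notebook’s own harvesting code, which pages through far more data) could start like this:

```python
from collections import Counter
from urllib.parse import urlparse

import requests

# Ask the CDX API for unique URLs under gov.au (a small page of results, for illustration only).
params = {
    "url": "gov.au",
    "matchType": "domain",
    "collapse": "urlkey",
    "fl": "original",
    "output": "json",
    "limit": 10000,
}
rows = requests.get("http://web.archive.org/cdx/search/cdx", params=params).json()
urls = [row[0] for row in rows[1:]]  # skip the header row

# Tally unique URLs per subdomain (hostname).
subdomains = Counter(urlparse(u).hostname for u in urls)
for host, count in subdomains.most_common(20):
    print(f"{host}: {count} unique URLs")
```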

Acknowledgements

Thanks to everyone who contributed to the discussion on the IIPC Slack, in particular Alex Osborne, Ben O’Brien and Andy Jackson who helped out with understanding how to use NLA/NZNL/UKWA collections respectively.

Resources: