The IIPC is governed by the Steering Committee, formed of representatives from fifteen Member Institutions, each elected for a three-year term. The Steering Committee designates the Chair, Vice-Chair and Treasurer of the Consortium. Together with the Senior Program Officer, based at the Council on Library & Information Resources (CLIR), the Officers make up the Executive Board and are responsible for the day-to-day business of running the IIPC.
The Steering Committee has designated Youssef Eldakar of Bibliotheca Alexandrina to serve as Chair and Jeffrey van der Hoeven of KB, National Library of the Netherlands to serve as Vice-Chair in 2023. Ian Cooke of the British Library will continue to serve as IIPC Treasurer. Olga Holownia continues as Senior Programme Officer, Kelsey Socha-Bishop as Administrative Officer, and CLIR remains the Consortium’s financial and administrative host.
The Members and the Steering Committee would like to thank Kristinn Sigurðsson of the National and University Library of Iceland and Abbie Grotke of the Library of Congress for leading the IIPC in 2021 and 2022.
IIPC CHAIR
Youssef Eldakar is Head of the International School of Information Science, a department of Information and Communication Technology at Bibliotheca Alexandrina (BA) in Egypt. Youssef entered the domain of web archiving as a software engineer in 2002, working with Brewster Kahle to deploy the reborn Library of Alexandria’s first web archiving computer cluster, a mirror of the Internet Archive’s collection at the time. In the years that followed, he went on to lead BA’s work in web archiving and has represented BA in the International Internet Preservation Consortium (IIPC) since 2011. Also at BA, he contributed to book digitization during the initial phase of that effort. In 2013, he was additionally assigned to lead the BA supercomputing service, providing a platform for High-Performance Computing (HPC) to researchers in diverse domains of science in Egypt, as well as regionally through European collaboration. At his present post, Youssef works to support research through the technologies of parallel computing, big data, natural language processing, and visualization.
In the IIPC, Youssef has been the lead of Project LinkGate, started in 2020 to develop scalable temporal graph visualization, and he has more recently been working as part of a collaboration involving the Research Working Group and the Content Development Working Group to republish IIPC collections through alternative interfaces for researcher access. He has been a member of the Steering Committee since 2018 and has served as the lead of the Tools Development Portfolio.
IIPC VICE-CHAIR
Jeffrey van der Hoeven is head of the Digital Preservation department at the National Library of the Netherlands (KB). In this role he is responsible for defining the policies, strategies and organisational implementation of digital preservation at the library, with the goal to keep the digital collections accessible to current users and generations to come. Jeffrey is also director at the Open Preservation Foundation and steering committee member at the IIPC. In previous roles, he has been involved in various national and international preservation projects such as the European projects PLANETS, KEEP, PARSE.insight and APARSEN.
IIPC TREASURER
Ian Cooke leads the Contemporary British Publications team at the British Library, which is responsible for curation of 21st century publications from the UK and Ireland. This includes the curatorial team for the UK Web Archive, as well as digital maps, emerging formats and print and digital publications ranging from small press and artists books to the latest literary blockbusters. Ian joined the British Library’s Social Sciences team in 2007, having previously worked in academic and research libraries, taking up his current role in 2015.
Ian has been a member of the IIPC Steering Committee and has worked on strategy development for the IIPC. The British Library was the host for the Programmes and Communications role up to April 2021.
By André Mourão, Senior Software Engineer, Arquivo.pt and Daniel Gomes, Head of Arquivo.pt
Arquivo.pt launched a service that enables search over 1.8 billion images archived from the web since the 1990s. Users can submit text queries and immediately receive a list of historical web-archived images through a web user interface or an API.
The goal was to develop a service that addressed the challenges raised by the inherent temporal properties of web-archived data, but at the same time provided a familiar look-and-feel to users of platforms such as Google Images.
Supporting image search using web archives raised new challenges: little research had been published on the subject, and the volume of data to be processed was large and heterogeneous, totalling over 530 TB of historical web data published since the early days of the Web.
The Arquivo.pt Image Search service has been running officially since March 2021 and is based on Apache Solr. All of the software developed is available as open source to be freely reused and improved.
Search images from the Past Web
The simplest way to access the search service is using the web interface. Users can, for example, search for GIF images published during the early days of the Web related to Christmas by defining the time span of the search.
Users can select a given result and consult metadata about the image (e.g. title, ALT text, original URL, resolution or media type) or about the web page that contained it (e.g. page title, original URL or crawl date). Quickly identifying the page that embedded the image enables the interpretation of its original context.
Figure 3. The web page that contained an image returned on the search results can be immediately visited by selecting the “Visit” button.
Automatic identification of Not Suitable For Work images
Arquivo.pt automatically performs broad crawls of web pages hosted under the .PT domain. Thus, some of the images archived may contain pornographic content that users do not want to be immediately displayed by default, for instance while using Arquivo.pt in a classroom.
The Image Search service retrieves images based on the filename, alternative text and the surrounding text of an image contained on a web page. Images returned to answer a search query may include offensive content even for inoffensive queries due to the prevalence of web spam.
The detection of NSFW (not suitable for work) content on the archived Web pages from the Internet is challenging due to the scale (billions of images) and the diversity (small to very large images, graphic, colour images, among others) of image content.
Currently, Arquivo.pt applies an NSFW image classifier trained on over 60 GB of images scraped from the web. Instead of labelling images as safe or not safe, this classifier returns the probability of an image belonging to each of five categories: drawing (SFW drawings), neutral (SFW photographic images), hentai (including explicit drawings), porn (explicit photographic images), and sexy (potentially explicit images that are not pornographic, e.g. a woman in a bikini). The nsfw score is the sum of the hentai and porn probabilities.
By default, Arquivo.pt hides pornographic images from the search results if their nsfw score is higher than 0.5. This filter can be disabled by the user through the Advanced Image Search interface.
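For illustration, the sketch below shows how such a five-category output could be turned into the safe-search filter described above. This is not Arquivo.pt’s actual implementation: the category names follow the list in the previous paragraph, and the probability values in the example are made up.

```python
# Hedged sketch: thresholding a five-category NSFW classifier output.
# The 0.5 threshold mirrors the default described in the text.

NSFW_THRESHOLD = 0.5

def is_safe(probabilities: dict[str, float], threshold: float = NSFW_THRESHOLD) -> bool:
    """Return True if the image should be shown with safe search enabled.

    `probabilities` is assumed to map the five categories
    (drawing, neutral, hentai, porn, sexy) to values summing to 1.
    The nsfw score is the sum of the 'hentai' and 'porn' probabilities.
    """
    nsfw_score = probabilities.get("hentai", 0.0) + probabilities.get("porn", 0.0)
    return nsfw_score <= threshold

# Example usage with made-up classifier output:
example = {"drawing": 0.02, "neutral": 0.90, "hentai": 0.01, "porn": 0.03, "sexy": 0.04}
print(is_safe(example))  # True: nsfw score 0.04 is below the 0.5 threshold
```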
Image Search API
Arquivo.pt developed a free and open Image Search API so that third-party software developers can integrate Arquivo.pt image search results into their applications and, for instance, apply for the annual Arquivo.pt Awards.
The ImageSearch API supports keyword-to-image search and provides access to preserved web content and related metadata. The API returns a JSON object containing the metadata elements also available through the “Details” button.
Figure 4. All metadata about the image and its host web page is available through the “Details” button or the Image Search API.
Figure 5. GitHub Wiki page that documents the Arquivo.pt Image Search API (https://arquivo.pt/api/imagesearch).
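As a hedged illustration, the snippet below shows how a client might call the API with Python’s requests library. The endpoint and the q text parameter follow the documentation page cited above; the maxItems parameter name and the shape of the JSON response are assumptions that should be checked against the GitHub wiki.

```python
# Hedged sketch of querying the Arquivo.pt Image Search API with `requests`.
import requests

def search_images(query: str, max_items: int = 10) -> dict:
    """Run a keyword query and return the parsed JSON response."""
    response = requests.get(
        "https://arquivo.pt/imagesearch",            # assumed endpoint; see the API wiki
        params={"q": query, "maxItems": max_items},  # parameter names are assumptions
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    result = search_images("christmas", max_items=5)
    print(list(result))  # inspect the top-level fields documented on the wiki
```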
Scientific and technical contributions
There are several services that enable image search over web collections (e.g. Google Images). However, the literature published about them is very limited and even less research has been published about how to search images in web archives.
Moreover, supporting image search over the historical web data preserved by web archives raises new challenges that live-web search engines do not need to address, such as dealing with multiple versions of images and pages referenced by the same URLs, handling duplication of web-archived images over time, and ranking search results according to the temporal features of historical web data published over decades.
Developing and maintaining an Image Search engine over the Arquivo.pt web archive yielded scientific and technical contributions by addressing the following research questions:
How to extract relevant textual content in web pages that best describes images?
How to de-duplicate billions of archived images collected from the web over decades? (A sketch of one possible approach follows the list of contributions below.)
How to index and rank search results over web-archived images?
The main contributions of our work are:
A toolkit of algorithms that extract textual metadata to describe web-archived images
A system architecture and workflow to index large amounts of web-archived images considering their specific temporal features
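As a hedged illustration of the de-duplication question above (and not the algorithm actually deployed at Arquivo.pt), one common approach is to group captures by a content digest and keep the earliest capture of each distinct image payload. The field names in the sketch below are illustrative.

```python
# Illustrative sketch only: digest-based de-duplication of archived images.
import hashlib

def deduplicate(captures: list[dict]) -> list[dict]:
    """Keep one representative capture per distinct image payload.

    Each capture is assumed to be a dict with a sortable 'timestamp' string
    and the raw image bytes under 'payload'; both names are illustrative.
    """
    by_digest: dict[str, dict] = {}
    for capture in sorted(captures, key=lambda c: c["timestamp"]):
        digest = hashlib.sha1(capture["payload"]).hexdigest()
        by_digest.setdefault(digest, capture)  # the earliest capture wins
    return list(by_digest.values())

# Example with made-up captures of the same image archived twice:
captures = [
    {"timestamp": "20100305120000", "payload": b"\x89PNG..."},
    {"timestamp": "19990112090000", "payload": b"\x89PNG..."},
]
print(len(deduplicate(captures)))  # 1: identical payloads collapse to one capture
```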
As we approach the end of 2022, we would like to thank our members and the general web archiving community for their support and engagement this year. Before we move forward into 2023, and return to an in-person General Assembly and Web Archiving Conference (for the first time since 2019!), we wanted to highlight some of this past year’s activities featured on our blog and to take this opportunity to thank all the contributors.
2022 started off with a wrap-up of a project led by our Tools Development Portfolio and developed by Ilya Kreymer of Webrecorder. The goal of this project was to support migration from OpenWayback (a playback tool used by most of our members) to pywb by creating a Transition Guide.
This year also saw the launch of a new tools project “Browser-based crawling system for all.” Led by four IIPC members (the British Library, National Library of New Zealand, Royal Danish Library, and the University of North Texas), the Webrecorder-developed crawling system based on the Browsertrix Crawler is designed to allow curators to create, manage, and replay high-fidelity web archive crawls through an easy-to-use interface.
“Game Walkthroughs and Web Archiving” builds on research by Travis Reid, a PhD student at Old Dominion University (ODU), that looks at applying gaming concepts to the web archiving process. This collaboration between ODU and Los Alamos National Laboratory was supported by the IIPC through our Discretionary Funding Program (DFP).
Here’s a list of blog posts on the 2022 projects related to web archiving tools:
IIPC also funds collaborative collections, which are curated and supported by volunteers from our community. While our Covid-19 collection continues, three new collections were initiated by the Content Development Working Group (CDG) in 2022. In the winter, Helena Byrne of the British Library encouraged everyone to web archive the Beijing 2022 Olympic & Paralympic Winter Games, adding to a decade-long collaborative effort of archiving the Olympics and Paralympics. Archiving the War in Ukraine was our second collaborative collection for 2022. Co-curated by Kees Teszelszky of KB, National Library of the Netherlands, and Vladimir Tybin and Anaïs Crinière-Boizet of the BnF, the National Library of France, it offers a comprehensive international perspective on the war. We closed 2022 with a call for nominations (due 20 January, 2023) for Web Archiving Street Art, co-led by Ricardo Basílio of Arquivo.pt and Miranda Siler of Ivy Plus Libraries Confederation.
Thank you to Alex Thurman (Columbia University Libraries) and Nicola Bingham (the British Library) for serving as CDG co-chairs, overseeing all new and ongoing collaborative collections:
We also published blog posts related to researching web archives on topics spanning from a toolset for researchers to archiving social media to analysing Covid-19 web archive collections.
Beatrice Cannelli, PhD candidate at the School of Advanced Study (University of London), summarised the results of an online survey mapping social media archiving initiatives, which is part of her research project “Archiving Social Media: a Comparative Study of the Practices, Obstacles, and Opportunities Related to the Development of Social Media Archives.”
Covid-19 web archived content is also at the core of the Archive of Tomorrow (AoT) project that aims to explore and preserve online information and misinformation about health and the pandemic. Introduced earlier this year by Alice Austin (Centre for Research Collections, University of Edinburgh), AoT will form a ‘Talking about Health’ collection within the UK Web Archive. Cui Cui, PhD candidate at the University of Sheffield and also an AoT web archivist, shared her process of working with the ‘Talking about Health’ collection, using faceted 4D modelling to reconstruct web space in web archives.
Here are the 2022 blog posts on researching web archives:
Many thanks to everyone who has contributed to our blog and helped us promote it through their newsletters and social media posts and, of course, thank you to all our readers around the world. We look forward to showcasing your web archiving activities in the new year!
By the AWAC2 (Analysing Web Archives of the COVID-19 Crisis) team: Susan Aasman (University of Groningen, The Netherlands), Niels Brügger (Aarhus University, Denmark), Frédéric Clavert (University of Luxembourg, Luxembourg), Karin de Wild (Leiden University, The Netherlands), Sophie Gebeil (Aix-Marseille University, France), Valérie Schafer (University of Luxembourg, Luxembourg), Joshgun Sirajzade (University of Luxembourg, Luxembourg)
After a first phase of collaboration around this prolific, multilingual, and international corpus, carried out in direct cooperation with the Archives Unleashed team, which gave us access to this abundant collection via ARCH and the other tools they have developed, we wanted not only to continue the global analysis of the corpus but also to test the feasibility of more specific studies on a precise topic. Following the vote, the selected theme, “Women, gender and COVID”, was the subject of several online and on-site meetings of the AWAC2 team, including an internal datathon in March 2022 at the University of Luxembourg (Figure 1).
Figure 1– A datathon as a test bed and the tentative design of a workflow
The purpose of this blog post is to review some of the methodological elements learned during the exploration of this corpus.
Retrievability is a real challenge
The first salient point concerns the amount of data, already considerable at the global level of the corpus, which remains substantial even for research specifically focused on women. Above all, data mining and corpus creation are complicated by multilingualism (see table 7 of our previous blog post), and by the fact that a search for the term “woman” is not sufficient to create a satisfactory corpus (a woman may be referred to as a mother in the case of home-working, or as a feminist in the case of activism and the fight against domestic violence, etc.).
The multidisciplinary team also had to define research priorities in view of the challenges of these massive corpora. Indeed, once they are constituted, the analysis is still far from beginning. The sub-corpora are full of noise, especially when it comes to news sites where the terms COVID and pregnancy or feminism may appear in newsfeeds in a very close way, but without any real thematic correlation (Figure 2). There are also many duplicates, and it must be determined whether or not they inform the study. Such a large amount of data also raises the question of more research-driven or data-driven approaches.
Figure 2 – Entry line 7867: https://flipboard.com/topic/women. The newsfeed mentions the COVID crisis as well as the MeToo movement but the news is unrelated, as visible on top of the capture when accessing full text.
In addition to the technical difficulties, there are also contextual difficulties. The data must also be put into a national context from a qualitative point of view if they are to be analysed properly. For example, lockdowns and school closures have varied from country to country and school organisation is also very different around the world, as is the legislative framework for work during lockdowns.
Topic modeling as a field of investigation
The AWAC2 team shared a strong interest in assessing the presence, retrievability, and asymmetries related to gender and COVID, with some colleagues especially interested in the issues related to transnational studies and gender studies, as well as in reflecting on invisibility and inclusiveness, while other colleagues were more specifically interested in the computational and topic modeling part.
This second aspect has given rise to interesting developments, as three major algorithms were applied to enable more sophisticated, semantic search of the corpus: Latent Dirichlet Allocation (LDA), Word2vec, and Doc2vec.
LDA is an extension of Probabilistic Latent Semantic Analysis (PLSA) which is a probabilistic formulation of Latent Semantic Analysis (LSA). LSA is a dimensionality reduction technique where documents in a corpus (in our case, web pages) are compressed to a very small number of documents, which could be read by a human. These compressed documents are called topics. In essence, they carry the words which are shared by many documents and probabilistically more often occur together. In our experiment, we not only identified topics which contain keywords related to the situation of women, but also looked at how these topics are distributed across the web pages (Figure 3).
Figure 3 – Topics identified through LDA over the whole dataset (Covid-19 special collection) and their distribution through time
A few examples of topics are:
topics202002.txt:46 0.05 video news show man years police star film death family week weinstein comments day love stars top women fashion black
topics202002.txt:69 0.05 shop view accessories gifts price sale products delivery add cart free mens gift shoes brands bags womens clothing hair home
topics202003.txt:4 0.05 health children mental kids anxiety child family tips healthy parents social coronavirus stress find support time home women life news
topics202004.txt:53 0.05 gender development health policy working european countries international women economic equality work regional employment global world minnesota environment content overview
topics202005.txt:83 0.05 study risk patients years people blood disease
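For readers who want to reproduce this kind of analysis, the sketch below shows the LDA step with gensim, assuming the archived pages have already been reduced to plain text (for example via ARCH text derivatives). It illustrates the approach described above rather than the AWAC2 team’s exact pipeline: the toy documents, the filtering settings, and the small number of topics are placeholders.

```python
# Hedged sketch of LDA topic modelling with gensim over extracted page text.
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess

documents = [
    "women gender covid lockdown home working childcare",
    "health children mental anxiety parents support coronavirus",
    "shop sale products delivery womens clothing gifts",
]  # placeholder strings; in practice, the plain text of each archived web page

tokenised = [simple_preprocess(doc) for doc in documents]
dictionary = corpora.Dictionary(tokenised)
dictionary.filter_extremes(no_below=1, no_above=1.0)  # real settings would prune rare/ubiquitous terms
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenised]

# On the real corpus the number of topics would be much larger (cf. the listings above).
lda = LdaModel(bow_corpus, num_topics=3, id2word=dictionary, passes=5)

# Print the top words of each topic, mirroring the raw topic listings above.
for topic_id, words in lda.show_topics(num_topics=3, num_words=10, formatted=True):
    print(topic_id, words)

# Per-document topic distribution, used to see how topics spread across pages.
print(lda.get_document_topics(bow_corpus[0]))
```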
Word2vec and Doc2vec are, in turn, further developments of the previous algorithms. Under the hood they not only use newer techniques such as logistic regression (also called a shallow network), but also offer more flexible usage. Word2vec provides a dense vector for every word and is in this respect very similar to LSA. However, Word2vec builds its vectors from a so-called window, i.e. the neighbourhood of a word, such as the 5 words to its left and right, operating more on a syntactic level, while LSA is based purely on the document-term matrix. With Word2vec it is possible not only to find words semantically related to a search term, but also to concatenate the vectors of all words in a document; by doing so, similar documents can be found. This in a way goes beyond the so-called bag-of-words approach, to which all the previous algorithms belong, because word order can also be taken into account. From an implementation point of view, this can be done with an additional algorithm such as a Long Short-Term Memory (LSTM) network, or a ready-to-use version can be taken with Doc2vec.
In our experiment, we trained a Word2vec model on our corpus, which enabled us to find keywords related to women or feminism. We then took these keywords and searched for where they occur. By doing so, we not only investigated the situation of women in the pandemic again, but also compared the results to those given by LDA. This not only helps us analyse the behaviour, complementarity, and efficiency of the algorithms, but also ensured that the search covers, or mines, our corpus in as much detail as possible (Figures 4a and 4b).
Figure 4a – Time series of a topic related to women and children
Figure 4b – Top 20 domains for this selected topic
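The following sketch illustrates the Word2vec step with gensim: training on tokenised page text and querying for terms close to a seed keyword such as “woman”. The window of 5 words on each side mirrors the description above; the toy documents, the other parameters, and the averaged document vector (a simpler stand-in for the concatenation and Doc2vec options mentioned earlier) are illustrative rather than the team’s actual settings.

```python
# Hedged sketch of the Word2vec step with gensim.
import numpy as np
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

documents = [
    "woman mother home working childcare covid lockdown",
    "woman feminist activism domestic violence support",
    "women health pregnancy hospital covid risk",
]  # placeholder strings; in practice, the plain text of each archived web page
sentences = [simple_preprocess(doc) for doc in documents]

# window=5 mirrors the 5-words-left-and-right neighbourhood described above;
# vector_size and min_count would be larger on the real corpus.
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, workers=1)

# Expand the seed vocabulary: which terms does the corpus place near "woman"?
for word, similarity in model.wv.most_similar("woman", topn=5):
    print(f"{word}\t{similarity:.3f}")

def document_vector(tokens: list[str], model: Word2Vec) -> np.ndarray:
    """Crude document vector: average the word vectors of a page's tokens."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

print(document_vector(sentences[0], model)[:5])  # compare such vectors to find similar pages
```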
What’s next?
This research is far from being completed after a year of collaboration with the Archives Unleashed team, whom we warmly thank for their technical and scientific expertise, as well as with IIPC which provided an unprecedented corpus that can stimulate a multitude of research projects, whether thematic or oriented towards computer science and digital humanities. An article is currently being prepared on the second topic, “Deep Mining in Web archives,” while a more general and SSH oriented chapter is being drafted for the final collective book of the WARCnet project. Furthermore, the team will be pleased to present results at the next IIPC Web Archiving Conference in 2023 and thus continue the dialogue with you around the collection.
By CDG Street Art Collection Co-Leads Ricardo Basílio, Web curator, Arquivo.pt & Miranda Siler, Web Collection Librarian, Ivy Plus Libraries Confederation
Street art is ephemeral and so are the websites and web channels that document it. For this reason the IIPC’s Content Development Working Group is taking up the challenge of preserving web content related to street art. Some institutions already do this locally, but a representative web collection of street art with a global scope is lacking.
Street art can be found all over the world and reflects social, political and cultural attitudes. The Web has become the primary means of dissemination of these works beyond the street itself. Thus, we are asking for nominations for web content from different parts of the globe to be preserved in time and to serve for study and research in the future.
Mural. Author: Douglas Pereira (Bicicleta Sem Freio). Title: The Observatory. WOOL, Covilhã Urban Art, 2019 (Portugal). Photo credit: Ricardo Basílio.
This collaborative collection aims to collect web content related to street art as a social, political, and cultural manifestation that can be found all over the world.
The types of street art covered by this collection include but are not limited to:
Mural art
Graffiti
Stencil art
Fly-posting (gluing posters)
Stickering
Yarn-bombing
Mosaic
The collection will also include a number of different types of websites such as:
The list is not exhaustive and it is expected that contributing partners may wish to explore other sub-topics within their own areas of interest and expertise, providing that they are within the general collection development scope.
Out of scope
The following types of content are out of scope for the collection:
For data budget considerations, websites that are heavy with audio/video content, such as YouTube, will be deprioritised.
Social media platforms such as Facebook, YouTube channels, Instagram, and TikTok, which are labour-intensive to archive and unlikely to be captured successfully.
Content which is in the form of a private members’ forum, intranet or email (non-published material).
Content which may identify or dox street artists who wish to remain anonymous or known only by their tagger name.
Artist websites where the artist works primarily in mediums other than street art.
Media websites (tv/radio and online newspapers) will be selected in moderation, as generally this type of content is being archived elsewhere, although nominations at the level of the news article documenting specific debates around street art may be considered (as opposed to media landing pages or splash pages). Independent news sources devoted to street art specifically are welcome.
How to get involved
Once you have looked over the collection scope document and selected the web pages that you would like to see in the collection, it takes less than 2 minutes to fill in the submission form:
For the first crawl, the call for nominations will close on January 20, 2023.
For more information and updates, you can contact the IIPC Content Development Working Group team at Collaborative-collections@iipc.simplelists.com or follow the collection hashtag on Twitter at #iipcCDG.
By Cui Cui, Web Archivist, Special Collections, Bodleian Libraries and PhD candidate, Information School, University of Sheffield
The Archive of Tomorrow project, funded by the Wellcome Trust, is designed to explore and preserve online information and misinformation about health and the Covid-19 pandemic. Started in February 2022, the project runs for 14 months and will form a “Talking about Health” collection within the UK Web Archive, giving researchers and members of the public access to a wide representation of diverse online sources. Earlier this year, Alice Austin presented an introduction to the project on the IIPC blog. As a web archivist, I work on various sub-collections within the “Talking about Health” collection on topics relating to cancer, food, diet, nutrition, and wellbeing. This blog post is meant to summarise some challenges I have encountered during the collecting process and approaches I took to tackle them.
Challenges in Capturing Health Collections
The web space related to the topic of health is broad and exists in a complicated context. The subject of health calls for an interdisciplinary approach. There are multiple independent yet connected stakeholders in this web space who create content to promote policies, research outcomes, opinions, guidelines, services and products, all of which ultimately influence the behaviour of the general public regarding their own health. It is therefore essential for us to understand that we are developing the collection within a context that goes beyond the medical concerns.
It is a challenge to capture this context in the current fluid and dynamic environment. For example, within the sub-collection on food, diet and nutrition, research on cancer prevention related to dairy and meat consumption[1] is part of the Livestock, Environment and People project,[2] which is supported by the Wellcome Trust’s Our Planet Our Health Programme.[3] This suggests that research on health is very much entwined with wider scientific and social issues. Another related topic, “alternative protein”,[4] is also becoming part of the discourse: how information related to these products is distributed online will have an impact on our choice of diet. There was a recent case in which a commercial company making plant-based products misled customers in its ads, press releases and Twitter posts[5] by using data out of context. The topic of “alternative protein” is not limited to traditional plant-based products but also covers innovations such as cultivated meat. Cultivated meat is also called cultured protein/meat or lab-grown meat,[6] and is labelled as affordable, nutritious and sustainable. This has also drawn debate.[7] However, on the web, there is little discussion of what larger-scale consumption of cultivated meat will mean for one’s health. At the same time, traditional farmers are working to promote the health benefits of red meat and dairy consumption[8] as well as marketing their products as local and environmentally conscious.[9] Clearly, online information that may have an impact on our diet is not an isolated topic; it goes beyond nutrition and medical concerns.
The mismatches and gaps between content created online and health information needs raise a set of less visible challenges. Research has pointed out the complex needs of those who seek online health information.[10] Such needs are not always met. Research by Abu-Serriah and colleagues shows: “of the 156 OMFS units identified in the UK, only 51% had websites. None of the websites contained more than 50% of what patients expected. Interestingly, the study has shown considerable geographical variation across the UK. While almost 80% of the OMFS units in London had websites, there were none in Northern Ireland and Wales.”[11] Within this online information ecosystem, individuals are largely in a passive position; they have little control over what is available on the web. Therefore, the content within the collection does not always reflect end users’ needs. Coverage of the collection could easily be skewed by the digital divide on the web.
Faceted 4D Modelling as a Collection Development Tool
Despite these complexities, I view the process of developing web archives as an attempt to reconstruct the web space in web archives. It does not mean that I seek to replicate this space. As shown in the chart, health information online is only part of the general health information ecosystem. Such content curated into the collection will be an even smaller proportion. Nevertheless, with this in mind, it does offer a roadmap that can guide the development of the collection.
To illustrate how I am developing a sub-collection on food, diet and nutrition for the Archive of Tomorrow project,[12] I have formulated a faceted 4D modelling approach. It defines the collection’s scope through 4 dimensions of content creator, content, audience, and geographical coverage, as shown in the following chart. The facets within each dimension are used to profile websites that could be included in this sub-collection. It offers different routes to narrow the topic down so that the boundary of the sub-collection can be defined and the collection process can be articulated.
I plotted the seeds in this sub-collection and visualised them using a 4D model, which can aid us in identifying collection gaps and refining search strategies with focused effort. According to this model, a large proportion of content in this sub-collection relates to healthy food and diets from commercial or media organisations for the general public and consumers. A much smaller number of seeds target groups such as the elderly, children, young adults and professionals. Content relevant to policy and guidance may not have been covered well (perhaps there are not many sources online). Most materials are at the national level. The model offers a direction for the next stage of collection development.
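As a sketch of how such profiling could be recorded in practice, each seed can be described along the four dimensions and then counted per facet to spot gaps. The facet values and the example URL below are illustrative, drawn from the examples mentioned in the text, not the project’s actual controlled vocabulary.

```python
# Hedged sketch of faceted seed profiling along the four dimensions described above.
from dataclasses import dataclass
from collections import Counter

@dataclass
class SeedProfile:
    url: str
    creator: str   # e.g. "commercial", "media", "government", "charity"
    content: str   # e.g. "healthy food and diets", "policy and guidance"
    audience: str  # e.g. "general public", "children", "professionals"
    coverage: str  # e.g. "national", "regional", "international"

seeds = [
    SeedProfile("https://example.org/diet-advice", "media",
                "healthy food and diets", "general public", "national"),
    # ... one profile per seed in the sub-collection
]

# Count seeds per facet to spot gaps, e.g. under-represented audiences.
print(Counter(s.audience for s in seeds))
```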
However, this modelling approach is an evolving process. The vocabularies can only be established as the collecting efforts progress, and they are often limited by my own knowledge and judgement. Since it is currently a manual and rather time-consuming process, it might only be useful for developing a small, focused sub-collection; I am currently testing it on this small sub-collection only. Its use for a large collection is probably only sustainable when sufficient resources are available or when other forms of technical support can automate the process, such as generating themes, topics, and keywords. While it could be very difficult to embed these facets into metadata, it does offer a different approach to refining a collection.
The model, as a concept, can be adapted flexibly in various collections by identifying different dimensions and facets that are more relevant to a particular topic. It offers a framework to track and review the progress of the collection development. It can also be used to assess the quality of the collection and identify gaps. If these facets could be embedded into metadata, it might offer opportunities for end-users to scope and refine a collection as datasets. This model is not an attempt to resolve those difficulties highlighted at the beginning of this blog, but it at least helps articulate some of the complexities during the collection development process.
For more information, visit the project discourse site https://ukwa.discourse.group/ and join the discussion. If you would like to make suggestions to improve the collection or are interested in using data from the project, please contact the project team at aot@nls.uk.
[10] Wollmann, K., Der Keylen, P., Tomandl, J., Meerpohl, J. J., Sofroniou, M., Maun, A., & Voigt-Radloff, S. (2021). The information needs of internet users and their requirements for online health information—A scoping review of qualitative and quantitative studies. Patient Education and Counseling, 104(8), 1904-1932.
[11] Abu-Serriah, M., Valiji Bharmal, R., Gallagher, J., & Ameerally, P.J. (2013). Patients’ expectations and online presence of Oral and Maxillofacial Surgery in the United Kingdom. British Journal of Oral & Maxillofacial Surgery, 52(2), 158-162.
The unique and dynamic nature of social platforms, together with legal and technical challenges, makes social media content very difficult to capture in a comprehensive way. Although it can be assumed that many archiving institutions include, to a certain degree, this born-digital material in their web collections, only a small number of them capture social media consistently.
As part of my PhD research titled ‘Archiving Social Media: a Comparative Study of the Practices, Obstacles, and Opportunities Related to the Development of Social Media Archives’, I have conducted an online survey with the aim of locating the latest social media archiving initiatives that are either currently archiving or are planning to archive content from social platforms.
Recent papers (Vlassenroot et al., 2021) have offered an overview of social media initiatives developed within broader and pre-existing web archiving projects.
In this post, I will discuss the results from the survey carried out between 2021 and 2022, offering a worldwide review of memory institutions engaged in social media archiving activities, including museums and other organisations (e.g. universities, other government institutions and administrative agencies), in order to get a better understanding of the current social media archiving panorama, highlight imbalances, and outline developments we can expect to see in the years to come.
The research
The online survey was circulated via email using various mailing lists (e.g. the IIPC curators mailing list) and Twitter, receiving a total of 33 responses.
The map below (Figure 1) illustrates the location of the respondent institutions, arranged by country.
Figure 1: Location of social media archiving initiatives
Based on desk research, I have also added institutions located in Sweden, China, Ukraine and South Korea to the map as they mentioned social media archiving activities either on their websites or collection development policies. Only two institutions, based in Italy and Lithuania, completed the survey stating that they currently have no plans to collect this born-digital material in the near future.
Figure 2: Type of institutions | *includes data based on desk research
As expected, the majority of the initiatives originate in national libraries and archives. Yet a significant number of responses also came from different areas of the GLAM sector and from other institutions such as universities, government agencies, and archives related to political parties (Figure 2).
Figure 3: Type and stage of social media archiving initiatives
Through the survey, I also wanted to capture the stage these social media initiatives were at in order to help paint a much clearer image of the current social media archiving landscape while also envisioning future developments. As shown in Figure 3, more than half of the respondents are institutions that are running long-term projects, while the remainder are distributed between pilot (5, 16%) and planning phase (7, 23%).
Figure 4: Year in which institutions have started collecting social media
Responses to the question related to the year in which memory institutions started archiving social media revealed an uneven and fluctuating image of the history of social media archiving. Figure 4 illustrates how there has been a significant increase between 2017 and 2022 in the number of archiving institutions interested in including these platforms in their collections.
If this data is then compared with responses to the question about whether the participating social media archiving initiatives were part of a wider, pre-existing web archiving project (Figure 5), a new trend in the development of social media archives emerges.
Figure 5: Social Media Archiving initiatives that are part of a pre-existing web archive
About 23% of respondents declared that their social media archiving initiative had indeed been developed independently of any previous web-related archiving activities. Of these, many have started adding social media content to their collections (or begun planning to) only recently, between 2017 and 2022.
Figure 6: Social media platforms collected
As already reported in previous studies (Vlassenroot et al., 2021 and also in the IIPC blogpost “Web Archiving the War in Ukraine”), data from the survey confirmed the tendency for social media archiving initiatives to collect predominantly content from Twitter (28, 34%), followed by Facebook (19, 23%) and Instagram (17, 20%). As illustrated in Figure 6, a number of institutions preserve or plan to preserve material from WhatsApp and TikTok, especially after these platforms became very popular around 2020-2021. Furthermore, some respondents mentioned platforms such as YouTube, Vimeo, Flickr, Tumblr and Telegram. The latter is one of those messaging apps that are rarely archived due to a wide array of difficulties. However, it has recently become the subject of a very interesting archiving initiative – the Telegram Archive of the War – operated by the Center for Urban History in Lviv, which has been collecting Telegram channels since February 2022, given the key role this app has played in the circulation of official announcements during the war.
Conclusion and future perspectives
Data from the survey revealed that the geographical distribution of social media archiving initiatives seems to reflect imbalances mainly related to economic, political and ICT divides. Moreover, challenges related to the collection of certain platforms, boundaries set by national legal frameworks, and other limitations imposed by social platforms have affected the type and rate at which social media is currently being archived, leading to the formation of inevitable gaps in the material preserved. This further underlines the importance of documenting the criteria on which selection is based at an institutional level, in order to clearly understand curation choices, and why some things have been archived rather than others.
It is also worth noting that the most archived social media platforms do not always match the most popular ones in all countries, while other social platforms are only used in areas of the globe that are currently not capturing these sites. This has implications for the preservation of our collective memory and for the representation of silences, marginalised voices and content produced in areas such as the Global South, which is relevant to the study of events that transcend the Global North and national borders.
In this sense, the emergence of social media archiving initiatives outside legal deposit institutions is an important phenomenon which might represent an opportunity to preserve material that would otherwise be out of scope for most institutions, thus keeping a record of those silences, marginalised histories, or events unfolding on the less frequently archived platforms.
Findings from the survey revealed that social media archiving initiatives across the globe are steadily growing in number, especially in response to recent and historically significant events. As mentioned, an increasing number of these initiatives appear to be developing independently from pre-existing web archiving initiatives. This suggests that what started as an extension of archiving activities mainly focused on websites seems to be evolving and defining itself as a distinct phenomenon with its own challenges, requiring ad-hoc solutions and specific practices, and generating new scenarios that will be interesting to investigate further.
*This post summarizes results presented at the WARCnet closing conference in Aarhus, 17-18 October with the title “Mapping social media archiving initiatives: some considerations on imbalances and future directions”.
References
Vlassenroot, E., Chambers, S., Lieber, S., Michel, A., Geeraert, F., Pranger, J., Birkholz, J., & Mechant, P. (2021). Web-archiving and social media: An exploratory analysis. International Journal of Digital Humanities. https://doi.org/10.1007/s42803-021-00036-1
“Game Walkthroughs and Web Archiving” was awarded a grant in the 2021-2022 round of the Discretionary Funding Programme (DFP), the aim of which is to support the collaborative activities of the IIPC members by providing funding to accelerate the preservation and accessibility of the web. The project lead is Michael L. Nelson from the Department of Computer Science at Old Dominion University. Los Alamos National Laboratory Research Library is a project partner. You can learn more about this DFP-funded project at our Dec. 14th IIPC RSS Webinar: Game Walkthroughs and Web Archiving, where Travis Reid will be presenting on his research in greater detail.
By Travis Reid, Ph.D. student at Old Dominion University (ODU), Michael L. Nelson, Professor in the Computer Science Department at ODU, and Michele C. Weigle, Professor in the Computer Science Department at ODU
The Game Walkthroughs and Web Archiving project focuses on integrating video games with web archiving and applying gaming concepts like speedruns to the web archiving process. We have made some recent updates to this project: adding a replay mode and a results mode, and making it possible to have a web archiving tournament during a livestream.
Replay Mode
Replay mode (Figure 1) is used to show the web pages that were archived during the web archiving livestream and to compare the archived web pages to the live web page. During replay mode, the live web page is shown beside the archived web pages associated with each crawler. The web archiving livestream script scrolls the live web page and archived web pages so that viewers can see the differences between the live web page and the recently archived web pages. In the future when the web archiving livestream supports the use of WARC files from a crawl that was not performed recently, we will compare the archived web pages from the WARC file with a memento from a web archive like Wayback Machine or Arquivo.pt instead of comparing the archived web page against the live web page. For replay mode, we are currently using Webrecorder’s ReplayWeb.page.
Figure 1: Replay mode
Replay mode will have an option for viewing annotations created by the web archiving livestream script for the missing resources that were detected (Figure 2). The annotation option was created so that the web archiving livestream would be more like a human driven livestream where the streamer would mention potential reasons why a certain embedded resource is not being replayed properly during the livestream. Another reason for creating the annotation option is so that replay mode can show more than just web pages being scrolled and can provide some information about the elements on a web page that are associated with missing embedded resources. There will also be an option for printing an output file that contains the annotation information created by the web archiving livestream script for the missing embedded resources. For each detected missing resource, this file will include the URI-R for the missing resource, the HTTP response status code, the element that is associated with the resource, and the HTML attribute where the resource’s URI-R is extracted from.
Figure 2: During replay sessions, there will be an option for automated annotation for the missing resources found on the web page
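As a hedged illustration, a record in such an output file might look like the following. The JSON Lines format and the field names are assumptions based on the fields listed above, not the project’s final specification, and the URI is hypothetical.

```python
# Hedged sketch of one missing-resource record in the annotation output file.
import json

record = {
    "uri_r": "https://example.com/images/banner.png",  # hypothetical missing resource
    "status_code": 404,
    "element": "img",
    "attribute": "src",
}
print(json.dumps(record))
```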
Results Mode
We have added a results mode (Figure 3) to the web archiving livestream so that viewers can see a summary of the web archiving and replay performance results. This mode is also used to compute the score for each crawler so that we can determine which crawler has won the current round based on the archiving and replay performance. The performance metrics used during results mode are retrieved from the performance results file that is generated after the web archiving and replay sessions. Currently this file includes the number of web pages archived by the crawler during the competition (number of seed URIs), the speedrun completion time for the crawler, the number of resources in the CDXJ file with an HTTP response status code of 404, the number of archived resources categorized by file type (e.g., HTML, image, video, audio, CSS, JavaScript, JSON, XML, PDF, and fonts), and the number of missing resources categorized by file type.

The metrics we are currently using for determining missing and archived resources are temporary and will be replaced with a replay performance metric calculated by the memento damage service. The temporary metrics are calculated by going through a CDXJ file and counting the number of resources with a 200 status code (archived resources) and the number of resources with a 404 status code (missing resources). Results mode will allow viewers to access the performance results file for the round by showing a link or QR code for a web page that can dynamically generate the performance results from the current round and allow viewers to download the file. That web page will also have a button that navigates to the video timestamp URL for the start of the round, so that viewers who recently joined the livestream can go back and watch the archiving and replay sessions for the current round.
Figure 3: Results mode, where the first performance metric is shown which is the speedrun time
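The temporary counting metric described above can be sketched as follows, assuming pywb-style CDXJ lines in which everything after the second space is a JSON object with a "status" field. The file name in the example is hypothetical.

```python
# Hedged sketch: count archived (HTTP 200) and missing (HTTP 404) resources in a CDXJ file.
# Assumes lines of the form: <surt key> <timestamp> {"url": ..., "status": "200", ...}
import json

def count_statuses(cdxj_path: str) -> tuple[int, int]:
    archived, missing = 0, 0
    with open(cdxj_path, encoding="utf-8") as cdxj:
        for line in cdxj:
            try:
                # Everything after the second space is the JSON block.
                _, _, json_block = line.rstrip("\n").split(" ", 2)
                status = json.loads(json_block).get("status")
            except ValueError:
                continue  # skip malformed lines
            if status == "200":
                archived += 1
            elif status == "404":
                missing += 1
    return archived, missing

print(count_statuses("crawl-results.cdxj"))  # hypothetical file name
```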
Web Archiving Tournaments
A concept that we recently applied to our web archiving livestreams is web archiving tournaments. A web archiving tournament is a competition between four or more crawlers. The web archiving tournaments are currently single elimination tournaments similar to the NFL, NCAA College Basketball, and MLS Cup playoffs, where a team is eliminated from the tournament if it loses a single game. Figure 4 shows an example of teams progressing through our tournament bracket. For each match in a web archiving tournament, there will be two crawlers competing against each other. Each crawler is given the same set of URIs to archive, and the set of URIs will be different for each match. The viewers will be able to watch the web archiving and replay sessions for each match. After the replay session is finished, the viewers will see a summary of the web archiving and replay performance results and how the score for each crawler is computed. The crawler with the highest score will be the winner of the match and will progress further in the web archiving tournament. When a crawler loses a match it will not be able to compete in any future matches in the current tournament. The winner of the web archiving tournament will be the crawler that has won every match it has participated in during the tournament. The web archiving tournament will be updated in the future to support other types of tournaments, such as double elimination tournaments, where teams can lose more than once; round robin tournaments, where teams play each other an equal number of times; or a combination like the FIFA World Cup, which uses round robin for the group stage and single elimination for the knockout phase.
Figure 4: Example of teams progressing through our tournament bracket (in this example, the scores are randomly generated)
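The single-elimination bracket logic itself can be sketched as follows. The crawler names are illustrative, the scoring function is a random placeholder standing in for the archiving and replay score computed in results mode (matching the randomly generated scores in Figure 4), and the sketch assumes a power-of-two number of crawlers.

```python
# Hedged sketch of single-elimination pairing between crawlers.
import random

def play_match(crawler_a: str, crawler_b: str) -> str:
    """Return the winner of one match (placeholder: random scores)."""
    score_a, score_b = random.random(), random.random()
    return crawler_a if score_a >= score_b else crawler_b

def run_tournament(crawlers: list[str]) -> str:
    """Run rounds of pairwise matches until one crawler remains.

    Assumes the number of crawlers is a power of two, as in a four-crawler bracket.
    """
    round_participants = list(crawlers)
    while len(round_participants) > 1:
        next_round = []
        for i in range(0, len(round_participants), 2):
            next_round.append(play_match(round_participants[i], round_participants[i + 1]))
        round_participants = next_round
    return round_participants[0]

# Illustrative crawler names, not necessarily those used in the livestreams:
print(run_tournament(["Brozzler", "Browsertrix Crawler", "Heritrix", "Wget"]))
```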
Future work
We will apply more gaming concepts to our web archiving livestreams, like having tag-team matches and a single player mode. For a tag-team match, we would have multiple crawlers working together on the same team when archiving a set of URIs. For a single player mode, we could allow the streamer or viewers to select one crawler to use when playing a level.
We are accepting suggestions for video games to integrate with our web archiving livestreams and show during our gaming livestreams. A game must have a mode where we can watch automated gameplay that uses bots (computer players), and it needs to offer bot customization that can improve a bot’s skill level, stats, or abilities. Call of Duty: Vanguard is an example of a game that can be used during our gaming livestream. In a custom match for Call of Duty: Vanguard, the skill level of the bots can be changed individually for each bot and we can change the number of players added to each team (Figure 5). This game also has other team customization options (Figure 6) that are recommended for games used during our gaming livestream but are not required, such as being able to change the name of the team and choose the team colors. Call of Duty: Vanguard also has a spectator mode named CoDCaster (Figure 7) where we can watch a match between the bots.
Figure 5: Player customization must allow bots with skill levels that can be changed individually or have abilities that can give a bot an advantage over other bots
Figure 6: Example of team customization that is preferred for team-based games used during gaming livestreams, but is optional
Figure 7: The game must have a spectator option so that we can watch the automated gameplay
An example of a game that will not be used during our gaming livestream is Rocket League. When creating a custom match in Rocket League it is not possible to make one bot have better stats or skills than the other bots in a match. The skill level for the bots in Rocket League is applied to all bots and cannot be individually set for each bot (Figure 8).
Figure 8: Rocket League will not be used during our automated gaming livestreams, because their “Bot Difficulty” setting applies the same skill level to all bots
A single player game like Pac-Man also cannot be played during our automated gaming livestream, because a human player is needed in order to play the game (Figure 9). If there are any games that you would like to see during our gaming livestream where we can spectate the gameplay of computer players, then you can use this Google Form to suggest the game.
Figure 9: Single player games that require a human player to play the game like Pac-Man cannot be used during our automated gaming livestream
Summary
Our recent updates to the web archiving livestreams are adding a replay mode, results mode, and an option for having a web archiving tournament. Replay mode allows the viewers to watch the replay of the web pages that were archived during the web archiving livestream. Results mode shows a summary of the web archiving and replay performance results that were measured during the livestream and shows the match scores for the crawlers. The web archiving tournament option allows us to have a competition between four web archive crawlers and to determine which crawler performed the best during the livestream.
If you have any questions or feedback, you can email Travis Reid at treid003@odu.edu.
As part of the IIPC-funded project “Browser-based crawling for all”, Webrecorder has been working in collaboration with IIPC Members, led by the British Library, National Library of New Zealand, Royal Danish Library, and University of North Texas to test Browsertrix Cloud as it is being developed.
Browsertrix Cloud provides a fully-integrated system for Webrecorder’s open-source high-fidelity browser-based crawling system, Browsertrix Crawler, designed to allow curators to create, manage, and replay high-fidelity web archive crawls through an easy-to-use interface.
At present, a dedicated IIPC cluster of Browsertrix Cloud has been deployed and made available to users from all member institutions. This cloud cluster is deployed using Digital Ocean in a European data center. Users are able to create and configure high-fidelity crawls and watch them archive web pages in real time. Browsertrix Cloud also allows users to create browser profiles and crawl sites which require logins, one of the only tools to date that allows for this capability.
One of the key goals of this project is to enable institutions to deploy the system both locally and in the cloud. We are currently working on documentation outlining the procedure for deploying Browsertrix Cloud on Digital Ocean, AWS and a single machine.
Thus far, we have collected feedback from many institutions and are working on new features, including the ability to update the crawl queue once the crawl has started, and improvements to our logging capabilities. IIPC users have provided us with valuable feedback after testing the service and we hope to receive more as development continues.
We are focusing on further improving the UX for Browsertrix Cloud to make complex crawl related tasks as simple as possible. After adding crawl exclusion management, we are looking at simplifying the crawl configuration and browser profiles screens, adding additional logging information, and eventually adding support for additional organizational features.
Please reach out if you would like additional accounts to test Browsertrix Cloud or have additional questions or feedback.
The 2022 Steering Committee Election closed on Saturday, 15 October. The following IIPC member institutions have been elected to serve on the Steering Committee for a term commencing 1 January 2023:
We would like to thank all members who took part in the election either by nominating themselves or by taking the time to vote. Congratulations to the re-elected Steering Committee Members!