Quality Assurance in Web Archives: How to Automate Your Work with Command Line

By Kourosh Hassan Feissali, Web Archivist at The National Archives, UK

Both ‘automation’ and ‘command line’ can sound daunting to non-programmers. But in this post I’m going to describe how non-programmers can take baby steps towards automating time-consuming tasks, or parts of a bigger task.

Why Use Command Line?

I admit that I’m new to the command line myself, but the more I use it, the more I am amazed by its efficiency and by all the things it can do. Here are some of the reasons for using the command line:

  • You don’t need to be a programmer to write commands. Once you learn the basics you can just copy and paste commands.
  • A Command-Line Interface (CLI) is pre-installed on your computer for free!
  • You can copy useful commands from the Internet and simply paste them in your CLI.
  • You write a command for a task once and you use it as many times as you want without having to think about the steps involved. This will save a lot of time.
  • You can write one tiny command to automate step 1 of a bigger task with 10 steps. Then, if you want, you can add a second command to automate step 2. Therefore, you don’t have to write an entire programme.
  • Spreadsheet or text editors often struggle with very large files but this is not a problem in a CLI.

Case Study: Quality Assurance of Brexit Sites

At the UK Government Web Archive (UKGWA) we crawled a large number of Brexit-related websites. Due to the nature of the project, it was essential to carry out enhanced quality assurance (QA) on these websites. One technique that we used was checking the logs of the web crawler. The problem was that crawl logs can have millions of lines, and there are very few applications on the market that can easily handle these huge log files. Further, some of these apps only work on one operating system (OS), but we use multiple OSs in the team.

To illustrate how simple it is to speed up a multi-stage task I’m going to break down our enhanced QA into smaller tasks here and use some basic commands to drastically speed up the process.

The Steps

  1. Download all the log files.
  2. Merge them into one.
  3. Sort the lines by server response code.
  4. Remove all the lines where the server response code is 404, begins with 2 or 3, or is blank.
  5. Save the remaining URLs, which begin with server errors such as 500 and 403, into a new file.
  6. Remove duplicate URLs and save in a new file.
  7. Check the remaining URLs against the live site.
  8. Ignore the ones that are broken on the live site.
  9. Copy the ones that work correctly on the live site and save as a patch-list.
  10. Clean up Downloads folder.

The full process is a little longer than this, but I’ve omitted some of the steps for the purposes of this blog post. As you can see, we’ve broken down one fairly complex job into 10 simple steps that are easy to understand and easy to tackle on their own. Some of the steps above are quite simple, but when you’re dealing with very large files they can freeze your computer if you use generic applications such as MS Excel. Here, I’ll describe how we can use the CLI for some of the above steps.

Step 2: cat *.log >> final.log

Step 4: sed -i "" 's+^404.*++g' sorted.txt; sed -i "" 's+^[2-3].*++g' sorted.txt; (the -i "" form is for the macOS/BSD sed; on GNU sed, drop the empty quotes)

Step 5: cp sorted.txt sorted_errors.txt

Step 6: cat sorted_errors.txt | sort -u > sorted_errors_dedup.txt

Step 10: rm *.log
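Steps 3, 7 and 9 have no commands above. The sketches below show one possible way to fill them in with the same kind of one-liners; they assume that the response code is the first field on each line (as the patterns in Step 4 imply) and that sorted_errors_dedup.txt ends up containing one URL per line, so treat them as illustrations rather than our exact workflow. The 10-second timeout and the ‘200’ check are arbitrary choices.

    Step 3: sort final.log > sorted.txt

    Step 7: while read -r url; do
              code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$url")
              echo "$code $url"
            done < sorted_errors_dedup.txt > live_check.txt

    Step 9: grep '^200 ' live_check.txt | cut -d' ' -f2 > patch_list.txt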

What’s great about the CLI is that you don’t have to learn a whole new language before seeing the result of your work. As you can see in Step 2 above, the ‘cat’ command concatenates multiple files into one, no matter how many files you have or how large they are. MS Excel can give you a really hard time with this simple step, but this one command concatenates every file that ends with ‘.log’ in your Downloads folder into one file in the blink of an eye. Automating with the command line brings a lot of joy to your work life!



IIPC – Meet the Officers, 2021

The IIPC is governed by the Steering Committee, formed of representatives from fifteen Member Institutions who are each elected for three-year terms. The Steering Committee designates the Chair, the Vice-Chair and the Treasurer of the Consortium. Together with the Programme and Communications Officer based at the British Library, the Officers are responsible for the day-to-day business of running the IIPC.

The Steering Committee has designated Abbie Grotke of the Library of Congress to serve as Chair and Kristinn Sigurðsson of the National and University Library of Iceland to serve as Vice-Chair in 2021. Sylvain Bélanger of Library and Archives Canada continues in his role as Treasurer. Olga Holownia continues as Programme and Communications Officer, and CLIR (the Council on Library and Information Resources) remains the Consortium’s financial host.

The Officers make up the new Executive Board introduced in the Consortium Agreement 2021-2025. The additional Steering Committee members, who will serve on the Executive Board in 2021, will be named in the coming months.

The Members and the Steering Committee would like to thank Mark Phillips of the University of North Texas Libraries (IIPC Chair, 2020) and Paul Koerbin (IIPC Vice-Chair, 2020) of the National Library of Australia, for their contribution to the day-to-day running of the IIPC.


IIPC CHAIR

Abbie Grotke, IIPC Chair 2021
Photo: Denis Malloy.

Abbie Grotke is Assistant Head, Digital Content Management Section, within the Digital Services Directorate of the Library of Congress, and leads the Web Archiving Team. She joined the Library in 1997 to work on American Memory digitization projects, and since 2002 has been involved in the Library’s web archiving program, which celebrated its 20th anniversary in 2020. In her role, Grotke has helped develop policies, workflows, and tools to collect and preserve web content for the Library’s collections and provides overall program management for web archiving at the Library, managing over 2.3 petabytes of data. The team also supports and trains almost 100 recommending officers across the Library who select content for the archives in a wide range of event and thematic web archive collections. She has been active in a number of collaborative web archive collections and initiatives, including the U.S. End of Term Government Web Archive, and the U.S. Federal Government Web Archiving Interest Group.

Since the Library of Congress joined the IIPC as a founding member in 2003, Abbie has served in a variety of roles and on a number of working groups, task forces, and committees. She spent a number of years as Communications Officer and was a member of the Access Working Group. More recently, she has served as co-leader of the Content Development and Training Working Groups and of the Membership Engagement Portfolio. She has been a member of the Steering Committee since 2013.

IIPC VICE-CHAIR

Photo: Tibor God (General Assembly in Zagreb, 2019).

Kristinn Sigurðsson is Head of Digital Projects and Development at the National and University Library of Iceland. He joined the library in 2003 as a software developer. Over the years he has worked on a multitude of projects related to the acquisition, preservation and presentation of digital content, as well as the digital reproduction of physical media. This includes leading the build-up of the library’s legal deposit web archive, which now contains nearly 4 billion items, as well as its very popular newspaper/magazine website.

He has also been very active within the IIPC and related web archiving collaborations. This includes working on the first version of the Heritrix crawler in 2003-4 (and on and off since). In 2010 he joined the IIPC Steering Committee and took over as co-lead of the Harvesting Working Group. More recently he has served as the Lead of the Tools Development Portfolio.

IIPC TREASURER

Sylvain Bélanger is Director General of the Transition Team at Library and Archives Canada (LAC). He previously served as Director General of the Digital Operations and Preservation Branch from February 2014, where he was responsible for leading and supporting LAC’s digital business operations and all aspects of preservation for digital and analog collections, and led LAC’s digital transformation activities. Prior to that role, Sylvain had been Director of the Holdings Management Division since 2010, and before that Corporate Secretary and Chief of Staff for Library and Archives Canada.

OpenWayback to pywb Transition Guide and pywb update

By Ilya Kreymer, Lead Software Engineer at Webrecorder Software

Earlier this year, the IIPC, after an internal survey, recommended the adoption of Webrecorder pywb as the primary replay system for their members’ web archives. Webrecorder and IIPC established a multi-part collaboration to help with this transition and advance the development of pywb.

To meet these goals, I’m excited to announce the launch of an official guide for migrating from OpenWayback to Webrecorder pywb, available at:

https://pywb.readthedocs.io/en/latest/manual/owb-transition.html

The guide was created with input from IIPC members and marks the completion of the first package of the IIPC project on pywb. It is now part of the standard pywb documentation and provides examples of various OpenWayback configurations and how they can be adapted to analogous options in pywb. The guide covers updating the index, WARC storage and exclusion systems to run in pywb with minimal changes.

For best results, we recommend deploying OutbackCDX, an open-source standalone web archive indexing system developed by the National Library of Australia, alongside pywb to manage web archive indexes. See the guide for more details and additional options.
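As a rough illustration of what the end result looks like (the collection name, port and paths below are placeholders; the transition guide gives the exact syntax for each setup), a pywb collection backed by an OutbackCDX index can be declared in config.yaml along these lines:

    collections:
      my-web-archive:
        index: cdx+http://localhost:8080/my-web-archive
        archive_paths: ./archive/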

Sample Deployment Configurations

Alongside the guide, pywb now also includes a few working sample deployments (via Docker Compose) that run pywb with Nginx, Apache and OutbackCDX.

These deployments will be part of the upcoming pywb release and will be updated as pywb and configuration options evolve.

Next Steps

Next on the immediate roadmap for pywb is an upcoming release, which will feature numerous fixes in addition to the guide (See the pywb CHANGELIST for more details on upcoming and new features).

The next iteration of pywb, which will be released in the first half of 2021, will include improved support for access controls, including a time-based access ‘embargo’, location-based access controls, and improved support for localization, in line with the work outlined in pywb project Package B.

Feedback Wanted!

We hope the guide will be useful for those updating from OpenWayback to pywb. We are also looking for input from IIPC members about any use cases for improved access control and localization for the next iteration.

If you have any questions, run into issues, or find anything missing, please send feedback to pywb[at]iipc.simplelists.com or directly to Webrecorder, via email or via the forum.

Webarchiv: 20 Years of Web Archiving in the Czech Republic

By Marie Haškovcová, Illyria Brejchová, Luboš Svoboda, and Andrea Prokopová (Czech Web Archive of National Library of the Czech Republic)

An Introduction to Webarchiv

The idea of creating a national web archive that would preserve the growing amount of Czech born-digital media was conceived as early as 1999. In the year 2000, Webarchiv was founded as a joint project of the National Library of the Czech Republic, the Moravian Library and Masaryk University, making it one of the oldest web archives. The first websites were archived in 2001, regular harvesting began in 2005, and in 2007 Webarchiv joined the IIPC.

Webarchiv home page

Currently, Webarchiv is part of the National Library of the Czech Republic and holds approx. 400 TB of data. Webarchiv collects this data in a variety of ways. Through comprehensive harvests, second-order domains under *.cz are harvested once or twice a year thanks to cooperation with the Czech domain registry CZ.NIC (currently about 1.4 million URLs). Czech web resources with historical, scientific or cultural value are selectively harvested more frequently and in more depth than in the comprehensive harvests. Finally, resources connected to a specific event or topic are collected through topic harvests. Webarchiv currently has more than 30 topical collections, covering elections, the Olympics, climate change and more. Continuous harvesting (automated, several times a day) is currently being tested on some thematic collections, such as COVID-19 or Czech media.

A gif showing the development of the Webarchiv website created using Time Map Visualization

 

Data harvesting and accessibility

The big challenge for Webarchiv at the moment is ensuring the accessibility of the data in its collection, both in terms of maintaining the ability to display the archived websites and in terms of allowing access to the collection for researchers as well as the public. In terms of public access, only 0.4% of the whole collection is available freely online. This is due to current Czech legislation, which allows the National Library to make reproductions of a work for its own archiving and conservation purposes but does not entitle libraries to make them available. Online access is therefore provided only to resources in the selective harvests which are licensed under a Creative Commons licence or for which a contract has been signed with the publisher. Websites available to the public are catalogued in accordance with the RDA rules and integrated into the Czech national bibliography. The entire collection can be accessed by the public on the library premises.

Webarchiv catalogue

On the technical side, Webarchiv uses open source software, such as Heritrix 3.4 and OpenWayback 3.0, but also develops its own open source tools, such as Seeder for managing electronic resources, websites and harvests, or WA-KAT, an online resource cataloguing tool. We are testing the harvesting of politicians’ social media accounts, experimenting with UMBRA, and applying manual harvesting using Webrecorder 2.3, which allows curators to harvest web 2.0 or more technologically complex websites such as online exhibitions or multimedia magazines. We plan to replace the current OpenWayback 3.0 application with Python Wayback (pywb) to display this type of content. We consider the autonomy of curators in planning harvests and quality assurance to be key even in automated harvests, which is why we continue to improve Seeder, our tool for managing harvests and curating web resources. In the future, Seeder should allow curators to perform harvests without technical support, which will let them react more efficiently to the ephemeral online environment.

Seeder – tool for managing electronic resources, websites and harvests

 

Collaboration with key partners

Over the years, Webarchiv has developed a collaboration with various institutions. Notably, these include the aforementioned CZ.NIC, the Institute of Czech Literature of the Czech Academy of Sciences, for whom we are archiving the online Czech literary tradition from the beginning of the Czech Internet to the present day, or the Czech National Archive, for whom we are archiving websites of public agencies, such as ministries or other central administrative authorities. As for international collaboration, we worked with the University Library in Bratislava on a shared topic collection of online resources relating to the 30th anniversary of the Velvet Revolution, which led to the collapse of the communist regime in former Czechoslovakia, and regularly contribute to the IIPC collaborative collections, most recently to the COVID-19 collection.

Topical collections

 

As for making our collection more accessible to researchers, Webarchiv is involved in a research project titled “Development of a centralized interface for extracting big data from web archives”. For this project, which focuses on research use of Webarchiv’s data, the National Library of the Czech Republic partnered with the Department of Cybernetics of the Faculty of Applied Sciences at the University of West Bohemia and the Institute of Sociology of the Czech Academy of Sciences. The main aim of the project is to develop a centralized user interface which would allow researchers to search through data collected by Webarchiv and obtain datasets for further research. The outcome of the project will be a faceted full-text search engine for analyzing large quantities of web archive data with an integrated application for exporting selected datasets. The research project is expected to be completed in 2022.

Engaging the public

Website nomination form

 

Webarchiv also actively engages with the public. We accept suggestions for resources to include in selective harvests on our website, and we are also active on social media. We have a long-running campaign in which we share dead websites from our collection (websites that no longer exist but of which we hold archived copies) on Facebook, and we have recently started doing the same on Instagram. Through these activities, we hope to raise awareness of and interest in web archiving. We are also active on Twitter, where we recently participated in the #WarcnetChallenge.

A contribution of Webarchiv to the Warcnet Challenge

We also launched a new blog, where we post a series called 10 websites for eternity, in which personalities from various fields share a list of Czech websites they cannot imagine life without and would regret losing if they were discontinued without being preserved in an archive. It is an opportunity for them to share the top 10 treasures the Czech web has to offer, both forgotten websites from their bookmarks and accomplished veterans of the internet. Webarchiv then archives the websites on the list and adds them to a topic collection. We see opening Webarchiv to curatorship by external specialists from various fields as an important direction to head in and a great way to expand our current curatorship strategies. Another topic we are considering is link rot: we see it as an area in which we can be very beneficial, since in the future a catalogue of valuable web resources could be created, curated directly by scientists and students.

From series 10 websites for eternity

The future of Webarchiv

Similar to other web archives, Webarchiv is facing numerous challenges, not only in the fields of acquisition, preservation and access to data, but also in terms of legislative changes. The internet is an ever-changing environment, and we are therefore always a step behind in our efforts to preserve it. Numerous questions offer themselves up for debate: How should we approach ethical questions regarding the use of data collected during harvests? How can content on social media be preserved within its appropriate context when feeds are personalized? Should we preserve software along with the archived web pages? How should we approach valuable online content accessible only behind a paywall? We are excited to be part of Webarchiv’s journey and to witness the ways in which it tackles these questions and matures in the following decades!

Raiders of the lost Web: WARCnet 💽World Wide Web 🌐 archiving discovery challenge 🏃‍♀️

Launched in the spring of 2020, WARCnet (Web ARChive studies network researching web domains and events) aims to promote high-quality national and transnational research that will help to understand the history of (trans)national web domains and of transnational events on the web. Led by Niels Brügger (Aarhus University), Valérie Schafer (University of Luxembourg), and Jane Winters (School of Advanced Studies, University of London), WARCnet brings together the expertise of researchers in the field of web archiving as well as seven national web archives. The forthcoming WARCnet Autumn meeting (November 4-6 2020) is accompanied by the first WARCnet Challenge!


By Niels Brügger (Aarhus University), Valérie Schafer (University of Luxembourg), Jane Winters (School of Advanced Studies, University of London), and Kees Teszelszky (The National Library of the Netherlands)

As all web archivists 💾, scholars of the web 🔬 and all digital natives 👶 are aware, the World Wide Web 🌐 is full of old, rare, shocking and weird gems💎 waiting to be discovered 👩‍🔬🧑‍💻. Show us your skills as internet researchers 🔎, web archaeologists 🏺 and twenty-first century online Indiana Joneses 🔮 and uncover the most interesting 😍, thought-provoking 🤔 or downright tasteless treasures 🏴‍☠️ from the online mud. Let’s see if you 👈 can surprise us 🎁 with your finds and show us 🖥️ wonderful things!

You can join the WARCnet network’s challenge with text, image, video or code found on the live web, in a web archive or elsewhere (library, physical archive, computer museum, basement of a web collector).

Your web archiving discovery challenge entry must consist of three parts: 1. Your discovery (what have you found?), 2. Your method (how did you find it or discover it?), 3. Your story (why is your find so special?).

“Can you see anything?” “Yes, wonderful things!” Be smart and creative! Let us learn from how you have done it! Show us web archiving is more than web crawlers and WARC files! Have fun and make us laugh or shiver!

Need some ideas 💡 for your online treasure 👑 trove⚒️ — have a look at https://cc.au.dk/en/warcnet/warcnet-twitter-challenge/ or go straight to #WarcnetChallenge.

Theme of this WARCnet Challenge

The theme of this WARCnet challenge is ‘Web archaeology and history (trial version!) (Temple of ZOOM)’ — we’ll leave it up to participants to interpret the theme…

How to participate in the WARCnet challenge?

You participate in the challenge by tweeting your reply using the hashtag #WarcnetChallenge. It is OK to post more than one tweet; please mark them ‘1 of 5’ for five tweets, etc. Deadline ⏳ for tweets is Friday 6 November 09:00 CEST!

How is the winner found?

A jury composed of Niels Brügger, Valérie Schafer, Kees Teszelszky and Jane Winters will nominate 3-5 entries. Then, on Friday 6 November, the last day of the WARCnet meeting in Luxembourg, the members of the WARCnet network will vote for one of the nominees as the winner.

What will the winner get?

The winner will get a unique laptop sticker 🏆 and eternal fame on the wall of fame of the Raiders of the lost Web on this web page. The winning entry will also be included in the Grand Finale at the closing WARCnet conference in June 2022, to potentially win the Great Raider of the Lost Web Award.

Will there be more WARCnet challenges?

Yes, indeed there will. The WARCnet challenge is a play in four parts, each with a specific theme:

Part 1: Web archaeology and history (trial version!) (Temple of ZOOM): WARCnet Autumn meeting in Luxembourg, November 4-6 2020

Part 2: Web design and culture (Raiders of the lost WARC): WARCnet Spring meeting in Aarhus April 20-22 2021

Part 3: Offline internet culture, digital born time travellers and internet culture in old analogue history (Dr. Jones and the Wayback Machine): WARCnet Autumn meeting in London, November 3-5 2021

Part 4: Grand Finale and announcement of winner. (Back to the Future from the Digital Dark Age): The last WARCnet meeting and conference in Aarhus, June 13-15 2022.

Who to contact with questions?

Any questions, just email us at warcnet@cc.au.dk.

Rapid Response Twitter Collecting at NLNZ

By Gillian Lee, Coordinator, Web Archives at the Alexander Turnbull Library, National Library of New Zealand (NLNZ)

This blog post has been adapted from an IIPC RSS webinar held in August, where presenters shared their social media web archiving projects. Thanks to everyone who participated and for your feedback. It’s always encouraging to see the projects colleagues are working on.

Collecting content when you only have a short window of opportunity

The National Library responds quickly to collecting web content when unexpected events occur. Our focus in the past was to collect websites, and this worked well for us using the Web Curator Tool; however, collecting social media was much more difficult. We tried capturing social media using different web archiving tools, but none of them produced satisfactory results.

The Preservation, Research and Consultancy (PRC) team includes programmers and web technicians. They thought running Twitter crawls using the public Twitter API could be a good solution for capturing Twitter content. It has enabled us to capture commentary about significant New Zealand events, and we’ve been running these Twitter crawls since late 2016.

One such event was the Christchurch Mosque shootings, which took place on 15 March 2019. This terrorist attack by a lone gunman at two mosques in Christchurch, in which 51 people were killed, was the deadliest mass shooting in modern New Zealand history. The image you see here by Shaun Yeo was created in response to the tragic events and was shared widely via social media.

Shaun Yeo: Crying Kiwi
Crying Kiwi. Ref: DCDL-0038997. Alexander Turnbull Library, Wellington, New Zealand. http://natlib.govt.nz/records/42144570   (used with permission)

While the web archivists focussed on collecting web content relating to the attacks, and the IIPC community assisted us by providing links to international commentary for us to crawl using Archive-It, the PRC web technician was busy getting the Twitter harvest underway. He needed to work quickly because there were only a few days’ leeway to pick up Tweets using Twitter’s public API.

Search Criteria

Our web technician checked Twitter and found a wide range of hashtags and search terms that we needed to use to collect the tweets.

Hashtags: ‘#ChristchurchMosqueShooting’ ‘#ChristchurchMosqueShootings’ ‘#ChristchurchMosqueAttack’ ‘#ChristchurchTerrorAttack’ ‘#ChristchurchTerroristAttack’ ‘#KiaKahaChristchurch’ ‘#NewZealandMosqueShooting’ ‘#NewZealandShooting’ ‘#NewZealandTerroristAttack’ ‘#NewZealandMosqueAttacks’ ‘#PrayForChristchurch’ ‘#ThisIsNotNewZealand’ ‘#ThisIsNotUs’ ‘#TheyAreUs’

Keywords: ‘zealand AND (gun OR ban OR bans OR automatic OR assault OR weapon OR weapons OR rifle OR military)’ ‘zealand AND (terrorist OR Terrorism OR terror)’ ‘zealand AND mass AND shooting’ ‘Christchurch AND mosque’ ‘Auckland AND vigil’ ‘Wellington AND vigil’

The Dataset

The Twitter crawl ran from 15-29 March 2019. We captured 3.2 million tweets in JSON files. We also collected 30,000 media files that were found in the tweets and we crawled 27,000 seeds referenced in the tweets. The dataset in total was around 108GB in size.

Collecting the Twitter content

We used Twarc to capture the Tweets and we also used some in-house scripts that enabled us to merge and deduplicate the Tweets each time a crawl was run. The original set for each crawl was kept in case anything went wrong during the deduping process or if we needed to change search parameters.
We also used scripts to capture the media files referenced in the Tweets and a harvest was run using Heritrix to pick up webpages. These webpage URLs were run through a URL unshortening service prior to crawling to ensure we were collecting the original URL, and not a tiny URL that might become invalid within a few months. We felt that the Tweet text without accompanying images, media files and links might lose its context. We were also thinking about long term preservation of content that will be an important part of New Zealand’s documentary heritage.
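Our in-house scripts aren’t reproduced here, but to illustrate the general pattern (the hashtag, file names and field choices below are placeholders, not our production setup): twarc can write matching tweets as line-delimited JSON with a command like

    twarc search '#ChristchurchMosqueShootings' > crawl_2019-03-16.jsonl

and merging several crawl files while deduplicating on the tweet ID is then only a few lines of Python:

    import glob
    import json

    seen = set()
    with open("tweets_deduped.jsonl", "w", encoding="utf-8") as out:
        for path in sorted(glob.glob("crawl_*.jsonl")):   # one file per crawl run
            with open(path, encoding="utf-8") as f:
                for line in f:
                    tweet = json.loads(line)
                    if tweet["id_str"] not in seen:       # keep the first copy of each tweet
                        seen.add(tweet["id_str"])
                        out.write(line)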

Access copies

We created three access copies that provide different views of the dataset, namely Tweet IDs and hashed and non-hashed text files. This enables the Library to restrict access to content where necessary.

Tweet IDs

Tweet IDs (system numbers) will be available to the public online. When you rehydrate the Tweet IDs online, you only receive back the Tweets that are still publicly available – not any of the Tweets that have since been deleted.
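For anyone unfamiliar with rehydration: given a file of Tweet IDs, a tool such as twarc can fetch the still-available Tweets back from the Twitter API with a one-liner along these lines (file names are placeholders):

    twarc hydrate tweet_ids.txt > rehydrated_tweets.jsonl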

Hashed and non-hashed access copies

In 2018, Twitter released a series of election integrity datasets (https://transparency.twitter.com/en/information-operations.html), which contained access copies of Tweets. We have used their structure and format as a precedent for our own reading room copies. These provide access to all Tweets and the majority of their metadata, but with all identifying user details obfuscated by hashed values. You can see in the table below the Tweet ID highlighted in yellow, the user display name in red and its corresponding system number (instead of an actual name) and the tweet text highlighted in blue.

The non-hashed copy provides the actual names and full URL rather than system numbers.

 Shaping the SIP for ingest
National Digital Heritage Archive (NDHA)

We have had some technical challenges ingesting Twitter files into the National Digital Heritage Archive (NDHA). Some files were too large to ingest using Indigo, which is a tool the web and digital archivists use to deposit content into the NDHA, so we have had to use another tool called the SIP Factory, which enables the ingest of large files to the NDHA. This is being carried out by the PRC team.

We’ve shaped the SIPs (submission information packages) according to the files below and have chosen to use file naming conventions for each event. We thought it would be helpful to create a readme file that records some of the provenance and technical details of the dataset. Some of this information will be added to the descriptive record, but we felt that a readme file could include more information and would remain with the dataset.

chch_terror_attack_2019_twitter_tweet_IDs
chch_terror_attack_2019_Twitter_access_copy
chch_terror_attack_2019_Twitter_access_copy_hashed
chch_terror_attack_2019_Twitter_crawl
chch_terror_attack_2019_twitter_readme
chch_terror_attack_2019_twitter_media_files
chch_terror_attack_2019_twitter_warc_files

Description of the dataset

Even though the tweets are published, we have decided to describe them in Tiaki, our archival content management system. This is because we’re effectively creating the dataset and our archival system works better for describing this kind of content than our published catalogue does.
NLNZ, Tiaki archival content management system

Research interest in the dataset

A PhD student was keen to view the dataset as a possible research topic. This was a great opportunity to see what we could provide and the level of assistance that might be required.

Due to the sensitivity of the dataset and the fact that it wasn’t in our archive yet, we liaised with the Library’s Access and Use Committee about what data the Library was comfortable providing. The decision was that, at this initial stage, while the researcher was still determining the scope of her research study, the data should only come from Tweets that were still available online.

The Tweet IDs were put in Dropbox for the researcher to download. There were several complicating factors that meant she was unable to rehydrate the Tweet IDs, so we did what we could to assist her.

We determined that the researcher simply wanted to get a sense of what was in the dataset, so we extracted a random sample set of 2000 Tweets. This sample included only original tweets (no retweets) and had already been rehydrated, so any deleted tweets were removed. The data included the Tweet time, user location, likes, retweets, Tweet language and the Tweet text. She was pleased with what we were able to provide, because it gave her some idea of what was in the dataset even though it was a very small subset of the dataset itself.

Unfortunately, the research project has been put on hold due to Covid-19. If the research project does go ahead, we will need to work with the University to see what level of support they can provide the researcher and what kind of support we will need to provide.

The BESOCIAL project: towards a sustainable strategy for social media archiving in Belgium

By Jessica Pranger, Scientific Assistant at KBR / Royal Library of Belgium

In August, we had the opportunity to present the new BESOCIAL research project during the IIPC RSS webinar. Many thanks to all viewers who have shared their remarks, questions and enthusiasm with us!

The aim of the BESOCIAL project is to set up a sustainable strategy for social media archiving in Belgium. Some Belgian institutions are already archiving social media content related to their holdings or interests, but it is necessary to reflect on a national strategy. Launched in summer 2020, this project will run over two years and is divided into seven steps, called ‘Work packages’ (WP):

  • WP1: Review of existing social media archiving projects and corpora in Belgium and abroad (M1-M6). The aim of this WP is to analyse selection, access and preservation policies, existing foreign legal frameworks and existing technical solutions.
  • WP2: Preparation of a pilot for social media archiving (M4-M15) including the development of a methodology for selection and the technical and functional requirements. An analysis of the user requirements and the existing legal framework is also included.
  • WP3: Pilot for social media archiving (M7-M24) including harvesting, quality control and the development of a preservation plan.
  • WP4: Pilot for access to social media archive (M16-M21) focusing on legal considerations, the development of an access platform and evaluating the pilot.
  • WP5: Recommendations for sustainable social media archiving in Belgium on the legal, technical and operational level (M16-M24).
  • WP6: Coordination, dissemination and valorisation.
  • WP7: Helpdesk for legal enquiries throughout the project.

Figure 1 shows these seven stages and how they will unfold over the two years of the project.

Figure 1. Work Packages of the BESOCIAL project.

Review of existing projects

We are currently in the first stage of the project (Work Package 1). To this end, a survey has been sent to 18 international heritage institutions and 10 Belgian institutions to ask questions on various topics related to the management of their born-digital collections. To date, we have received 13 responses; a first analysis of these answers has been completed and submitted for publication. Another task currently being undertaken is an overview of the tools used for social media archiving; it is now important to dig deeper and check which kinds of metadata are supported by the tools. We are also working on an analysis of the digital preservation policies, strategies and plans of libraries and archives that already archive digital content, especially social media data. For the legal aspects, we are analysing the legal framework of social media archiving in other European and non-European countries.

Our team

The BESOCIAL project is coordinated by the Royal Library of Belgium (KBR) and is financed by the Belgian Science Policy Office’s (Belspo) BRAIN-be programme. KBR partnered with three universities for this project: CRIDS (University of Namur) works on legal issues related to the information society, CENTAL (University of Louvain) and IDLab (Ghent University) will contribute the necessary technical skills related to information and data science, whereas GhentCDH and MICT (both from Ghent University) have significant expertise in the field of communication studies and digital humanities.

The interdisciplinarity of the team and the thorough analyses of existing policies will ensure that the social media archiving strategy for Belgium will be based on existing best practices and that all involved stakeholders (heritage institutions, users, legislators, etc.) will be taken into account.

If you want to learn more about this project, feel free to follow our hashtag #BeSocialProject on social media platforms, and visit the BESOCIAL web page.

LinkGate: Initial web archive graph visualization demo

By Mohammed Elfarargy and Youssef Eldakar of Bibliotheca Alexandrina

LinkGate is an IIPC-funded project to develop a scalable web archive graph visualization environment and collect research use cases, led by Bibliotheca Alexandrina (BA) and the National Library of New Zealand (NLNZ). The project provides three modular components:

  • Link Service (link-serv) for the scalable temporal graph data service with an underlying graph data store and API
  • Link Indexer (link-indexer) for collecting inter-linking data from the web archive
  • Link Visualizer (link-viz) for the web-based frontend geared towards web archive graph data navigation and exploration

Research use cases are being documented to guide future development.

You can read more about our work in the blog post published in April.

During a webinar held at the end of July as part of the IIPC Research Speaker Series (RSS), we presented a demo of the tools being developed and a summary of feedback gathered so far from the community towards a research use case inventory. In this blog post, we give an update on progress of the technical development, focusing on the initial UI of link-viz.

Link Visualizer

LinkGate’s frontend visualization component, link-viz, has progressed on many fronts over the last four months. While the link-serv component is compatible with the Gephi streaming API, Gephi remains a desktop-only, general-purpose graph visualization tool. link-viz, on the other hand, is a web-based, scalable graph visualization tool made specifically to visualize web archive graph data. This makes it possible to produce more informative graphs for web archive users.

link-viz works in a similar manner to web-based map services like Google Maps. The user gets a graph based on the queried URL and the desired snapshot. Users can set the initial depth of the graph and then incrementally add more nodes as they explore deeper in the graph. This smart loading makes the exploration of such a dense graph run more smoothly.

The link-viz UI is designed to set the main focus on the graph. Users can click on any graph node to select it and perform actions using tools available in the UI. Graph nodes can be moved around and are, by default, distributed using a spring force model to help make a uniform distribution over 2D space. It’s possible to toggle this off to give users the option to organize nodes manually. Users can easily pan and zoom in/out the view using mouse controls or touch gestures. All other tools are located in four floating panels surrounding the main graph area:

The left-hand panel is used to search for a URL and to select the desired snapshot based on which the initial graph will be rendered. The snapshot selection widget is illustrated in Figure 1:

Figure 1: Snapshot selection widget

The bottom panel shows detailed information on the highlighted graph node. This includes a full URL and a listing of all the outlinks and inlinks. This can be seen in Figure 2:

Figure 2: Node details panel

The top panel contains a set of tools for graph navigation (zoom in/out and reset view), taking graph screenshots, setting graph depth, collapsing/expanding portions of the graph, and configuring the look of the graph (selection of color, size, and shape for both graph nodes and edges to represent different pieces of information). One nice feature of link-viz compared to standard graph visualization tools is the usage of website favicons for graph nodes instead of geometric shapes, which makes nodes instantly identifiable and results in a much more readable graph. Figures 3 and 4 show the top panel and favicon usage, respectively:

Figure 3: Top panel

 

Figure 4: Favicons for graph nodes

The right-hand panel contains two tabs reserved for two sets of tools, Vizors and Finders. Vizors are tools that display the same graph while highlighting additional information. Two vizors are currently planned. The GeoVizor will place graph nodes on top of a world map to show the physical hosting location. The FileTypeVizor will display file-type icons as graph nodes, making it very easy to identify the most common file types and their distribution over the web. Finders perform graph exploration functions, such as finding loops or paths between nodes.

Apart from Vizors and Finders, we are also working on other features, including smart graph loading and an animated graph timeline. We are also going to improve the UI styling.

Link Indexer

link-indexer is now integrated with link-serv via the API. We have been testing the process of inserting data extracted with link-indexer into link-serv to identify data and scalability problems to work on. link-indexer now accepts command-line options for specifying the target link-serv instance and controlling the insertion batch size to manage how often the API is invoked. More command-line options are being added to control various aspects of the tool, as well as the ability to load options from a configuration file. We are also working to enhance tolerance to data issues, such as very long URLs, and network issues, such as short service outages. Figure 5 shows a sample output from a link-indexer run:

Figure 5: Sample output from a link-indexer run

Link Service

link-serv implements an API for link-indexer and link-viz to communicate with the graph data store. The API is compatible with the Gephi streaming API, giving users the option to connect to link-serv using the popular graph visualization tool, Gephi, as an alternative to the project’s frontend, link-viz. Figure 6 shows a Gephi client streaming graph data from a link-serv instance:

Figure 6: Gephi client streaming from a link-serv instance
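For readers unfamiliar with the Gephi streaming format: it is a stream of line-delimited JSON events such as “an” (add node) and “ae” (add edge). The hand-written fragment below only illustrates the general shape of such a stream; link-serv’s actual node identifiers and its temporal extensions are not shown:

    {"an": {"example.com": {"label": "example.com"}}}
    {"an": {"iipc.simplelists.com": {"label": "iipc.simplelists.com"}}}
    {"ae": {"e1": {"source": "example.com", "target": "iipc.simplelists.com", "directed": true}}}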

A data schema customized for temporal, versioned web archive data is used in the underlying Neo4j graph data store, and link-serv defines extra API operations not defined in the Gephi streaming API to support temporal navigation functionality in link-viz.

As more data is added to link-serv, the underlying graph data store has difficulty scaling up when reliant on a single instance. Our primary focus in link-serv at the moment, therefore, is to implement clustering. Work is in progress on a customized dispatcher service for the Neo4j graph data store as a substitute to clustering functionality in the commercially licensed Neo4j Enterprise Edition. As a side track, we are also looking into ArangoDB as possibly an alternative deployment option for link-serv’s graph data store.

Robustify your links! A working solution to create persistently robust links

By Martin Klein, Scientist in the Research Library at Los Alamos National Laboratory (LANL), Shawn M. Jones, Ph.D. student and Graduate Research Assistant at LANL, Herbert Van de Sompel, Chief Innovation Officer at Data Archiving and Network Services (DANS), and Michael L. Nelson, Professor in the Computer Science Department at Old Dominion University (ODU).

Links on the web break all the time. We frequently experience the infamous “404 – Page not found” message, also known as “a broken link” or “link rot.” Sometimes we follow a link and discover that the linked page has significantly changed and its content no longer represents what was originally referenced, a scenario known as “content drift.” Both link rot and content drift are forms of “reference rot”, a significant detriment to our web experience. In the realm of scholarly communication where we increasingly reference web resources such as blog posts, source code, videos, social media posts, datasets, etc. in our manuscripts, we recognize that we are losing our scholarly record to reference rot.

Robust Links background

As part of The Andrew W. Mellon Foundation funded Hiberlink project, the Prototyping team of the Los Alamos National Laboratory’s Research Library together with colleagues from Edina and the Language Technology Group of the University of Edinburgh developed the Robust Links concept a few years ago to address the problem. Given the renewed interest in the digital preservation community, we have now collaborated with colleagues from DANS and the Web Science and Digital Libraries Research Group at Old Dominion University on a service that makes creating Robust Links straightforward. To create a Robust Link, we need to:

  1. Create an archival snapshot (memento) of the link URL and
  2. Robustify the link in our web page by adding a couple of attributes to the link.

Robust Links creation

The first step can be done by submitting a URL to a proactive web archiving service such as the Internet Archive’s “Save Page Now”, Perma.cc, or archive.today. The second step guarantees that the link retains the original URL, the URL of the archived snapshot (memento), and the datetime of linking. We detail this step in the Robust Links specification. With both done, we truly have robust links with multiple fallback options. If the original link on the live web is subject to reference rot, readers can access the memento from the web archive. If the memento itself is unavailable, for example because the web archive is temporarily out of service, we can use the original URL and the datetime of linking to locate another suitable memento in a different web archive. The Memento protocol and infrastructure provide a federated search that seamlessly enables this sort of lookup.
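Concretely, the second step amounts to adding two data attributes to the anchor element. The snippet below is only a hand-written illustration of the pattern described in the Robust Links specification; the URLs and date are made up:

    <a href="http://www.example.com/page"
       data-versionurl="https://web.archive.org/web/20200921120000/http://www.example.com/page"
       data-versiondate="2020-09-21">our example page</a>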

Robust Links web service.

To make Robust Links more accessible to everyone, we provide a web service to easily create Robust Links. To “robustify” your links, submit the URL of your HTML link to the web form, optionally specify a link text, and click “Robustify”. The Robust Links service creates a memento of the provided URL either with the Internet Archive or with archive.today (the selection is made randomly). To increase robustness, the service utilizes multiple publicly available web archives and we are working to include additional web archives in the future. From the result page after submitting the form, copy the HTML snippet for your robust link (shown as step 1 on the result page) and paste it into your web page. To make robust links actionable in a web browser, you need to include the Robust Links JavaScript and CSS in your page. We make this easy by providing an HTML snippet (step 2 on the result page) that you can copy and paste inside the HEAD section of your page.

Robust Links web service result page.

Robust Links sustainability

During the implementation of this service, we identified two main concerns regarding its sustainability. The first issue is the reliable inclusion of the Robust Links JavaScript and CSS to make Robust Links actionable. Specifically, we were looking for a feasible approach to improve the chances that both files remain available in the long term, can continuously be maintained, and that their URIs persistently resolve to the latest version. Our approach is two-fold:

  1. we moved the source files into the IIPC GitHub repository so they can be maintained (and versioned) by the community and served with the correct mime type via GitHub Pages and
  2. we minted two Digital Object Identifiers (DOIs) with DataCite, one to resolve to the latest version of the Robust Links JavaScript and the other to the CSS.

The other sustainability issue relates to the Memento infrastructure used to automatically access mementos across web archives (the second fallback mentioned above). Here, we rely on the fact that LANL and ODU, both IIPC member organizations, maintain the Memento infrastructure.

Because of limitations with the WordPress platform, we unfortunately cannot demonstrate robust links in this blog post. However, we created a copy with robustified links hosted at https://robustlinks.mementoweb.org/demo/IIPC/robust_links_blog.html. In addition, our Robust Links demo page showcases how robust links are actionable in a browser via the included CSS and JavaScript. We also created an API for machine access to our Robust Links service.

Robust Links in action.

Acknowledgements and feedback

Lastly, we would like to thank DataCite for granting two DOIs to the IIPC for this effort at no cost. We are also grateful to ODU’s Karen Vaughan for her help minting the DOIs.

For feedback/comments/questions, please do not hesitate to get in touch (martinklein0815[at]gmail.com)!

Relevant URIs

https://robustlinks.mementoweb.org/
https://robustlinks.mementoweb.org/about/
https://robustlinks.mementoweb.org/spec/
https://robustlinks.mementoweb.org/api-docs/

The Danish Coronavirus web collection – Coronavirus on the curators’ minds

By Sabine Schostag, Web Curator, The Royal Danish Library

Introduction – a provoking cartoon

In a sense, the story of Corona and the national Danish Web Archive (Netarchive) starts at the end of January 2020 – about six weeks before Corona came to Denmark. A cartoon by Niels Bo Bojesen in the Danish newspaper “Jyllands-Posten” (2020-01-26), showing the Chinese flag with a circle of yellow coronaviruses instead of the stars, caused indignation in China and captured attention worldwide. We focused on collecting reactions on different social media and in the international news media. Particularly on Twitter, a seething discussion arose, with vehement comments and memes about Denmark.

From epidemic to pandemic

After that, the curators again focused on the daily routines in web archiving, as we believed that Corona (Covid-19) was a closed chapter in Netarchive’s history. But this was not the case. When the IIPC Content Development Working Group launched the Covid-19 collection in February, the Royal Danish Library contributed the Danish seeds.

Suddenly, the Corona virus arrived in Europe, and the first infected Dane came home from a skiing trip in Italy. The epidemic turned into a pandemic. On March 12, the Danish Government decided to lock down the country: all public employees were sent to their home offices and the borders were closed. Not only did the public sector shut down; trade and industry, shops, restaurants, bars etc. had to close too. Only supermarkets were still open, and people in the health care sector had to work overtime.

While Denmark came to a standstill, so to speak, the Netarchive curators worked at full throttle on the coronavirus event collection. Zoom became the most important work tool for the following 2½ months. In daily Zoom meetings, we coordinated who worked on which facet of this collection. To put it briefly, we curators had coronavirus on our minds.

Event crawls in Netarchive

The Danish Web Archive crawls all Danish news media at frequencies ranging from several times daily to once a week, so there is no need to include news articles in an event crawl. Thus, with an event crawl we focus on increased activity on social media, blog articles, new sites emerging in connection with the event – and reactions in news media outside Denmark.

Coronavirus documentation in Denmark

The Danish Web collection on coronavirus in Denmark is part of a general documentation of the corona lockdown in Denmark in 2020. This documentation is a cooperation between several cultural institutions: the National Archives (Rigsarkivet), the National Museum (Nationalmuseet), the Workers Museum (Arbejdermuseet), local archives and, last but not least, the Royal Danish Library. The corona lockdown documentation was planned in two steps: the “here and now” collection of documentation during the corona lockdown, and a more systematic follow-up collecting materials from authorities and public bodies.

“Days with Corona” – a call for help

All Danes were asked to contribute to the corona lockdown documentation, for instance by sending photos and narratives from their daily life under the lockdown. “Days with Corona” is the title of this part of the documentation of the Danish Folklore Archives run by the National Museum and the Royal Library.

Netarchive also asked the public for help by nominating URLs of web pages related to coronavirus, social media profiles, hashtags, memes and any other relevant material.

Help from colleagues

Web archiving is part of the Department for Digital Cultural Heritage at the Royal Library. Almost all colleagues from the department were able to continue with their everyday work from their home offices. Many colleagues from other departments were not able to do so. Some of them helped the Netarchive team by nominating URLs, as this event crawl could keep curators busy for more than 7½ hours a day. We used a Google spreadsheet for all nominations (Fig. 1).

Fig. 1 Nomination sheet for curators and colleagues from other departments, and a call for contributions.

The Queen’s 80th birthday

On April 16, Queen Margrethe II celebrated her 80th birthday. One of the first things she did after the Corona lockdown, on March 13, was to cancel all her birthday celebration events. In a way, she set a good example, as everybody was asked not to meet in groups of more than ten people; ideally, we should only socialize with members of our own household.

As part of the Corona event crawl, we collected web activity related to the Queen’s birthday, which mainly consisted of reactions on social media.

The big challenge – capturing social media

Knowledge of the coronavirus Covid-19 changes continuously. Consequently, authorities, public bodies, private institutions and companies frequently change the information and precaution rules on their webpages. We try to capture as many of these changes as possible. Companies and private individuals offering safety gear for protection against the virus were another facet of the collection. However, capturing all relevant activity on social media was much more challenging than the frequent updates on traditional web pages. Most of the social media platforms use technologies that Heritrix (used by Netarchive for event crawling) is not able to capture.

Fig. 2 The Queen’s speech to the Danes on how to cope with the corona crisis. This was the second time in history (the first time was during World War II) that a royal head of state addressed the nation outside the annual New Year’s Eve speech.

More or less successfully, we tried to capture content from Facebook, TikTok, Twitter, YouTube, Instagram, Reddit, Imgur, Soundcloud and Pinterest. Twitter is the platform we are able to crawl with Heritrix with rather good results. We collect Facebook profiles with an account at Archive-It, as they have a better set of tools for capturing Facebook. With frequent quality assurance and follow-ups, we also get rather good results from Instagram, TikTok and Reddit. We capture YouTube videos by crawling the watch-URLs with a specific configuration using youtube-dl. One of the collected YouTube videos comes from the Royal family’s YouTube channel: the Queen’s address to the people on how to behave to prevent or limit the spreading of the coronavirus (https://www.youtube.com/watch?v=TZKVUQ-E-UI, Fig. 2).
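For illustration only (this is not our exact crawl configuration), fetching a single watch-URL standalone with its metadata and thumbnail using youtube-dl would look roughly like this; the output template is just one possible choice:

    youtube-dl --write-info-json --write-thumbnail \
        --output '%(id)s.%(ext)s' \
        'https://www.youtube.com/watch?v=TZKVUQ-E-UI'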

As Heritrix has problems with dynamic web content and streaming, we also used Webrecorder.io, although we have not yet implemented this tool in our harvesting setup. However, captures with Webrecorder.io are only drops in the ocean. The use of Webrecorder.io is manual: a curator clicks on all the elements on a page we want to capture. An example is a page on the BBC website, with a video of the reopening of Danish primary schools after the total lockdown (https://www.bbc.com/news/av/world-europe-52649919/coronavirus-inside-a-reopened-primary-school-in-the-time-of-covid-19, Fig. 3). There is still an issue with ingesting the resulting WARC files from Webrecorder.io in our web archive.

Danes produced a range of podcasts on coronavirus issues, and we crawled the podcasts we had identified. We get good results when we have a URL to an RSS feed, which we crawl with XML extraction.
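As a rough illustration of the idea (not our exact extraction setup, and the feed URL is a placeholder), pulling the audio enclosure URLs out of a podcast RSS feed so they can be fed to the crawler as seeds only takes a few lines of Python:

    import feedparser

    feed = feedparser.parse("https://example.dk/podcast/feed.xml")  # placeholder feed URL
    for entry in feed.entries:
        for enclosure in entry.get("enclosures", []):  # audio files are listed as enclosures
            print(enclosure.href)                      # one seed URL per line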

Fig. 3 Crawled with Webrecorder.io to get the video.

Capture as much as possible – a broad crawl

Netarchive runs up to four broad crawls a year. We launched our first broad crawl for 2020 right at the beginning of the Danish Corona lockdown, on March 14. A broad crawl is an in-depth snapshot of all .dk domains and all other top-level domains (TLDs) where we have identified Danish content. A side benefit of this broad crawl might be getting Corona-related content into the archive – content which the curators do not find with their different methods. We identify content both with common keyword searches and by using a variety of link-scraping tools.

Is the coronavirus related web collection of any value to anybody?

In accordance with the Danish personal data protection law, the public has no access to the archived web material. Only researchers affiliated with Danish research institutions can apply for access in connection with specific research projects. We have already received an application for one research project dealing with values in Covid-19 communication. We hope that our collection will inspire more research projects.