IIPC Steering Committee Election 2022 Results

The 2022 Steering Committee Election closed on Saturday, 15 October. The following IIPC member institutions have been elected to serve on the Steering Committee for a term commencing 1 January 2023:

We would like to thank all members who took part in the election either by nominating themselves or by taking the time to vote. Congratulations to the re-elected Steering Committee Members!

IIPC Steering Committee Election 2022: nomination statements

The Steering Committee, composed of no more than fifteen Member Institutions, provides oversight of the Consortium and defines and oversees its strategy. This year five seats are up for election or re-election. In response to the call for nominations to serve on the IIPC Steering Committee for a three-year term commencing 1 January 2023, six IIPC member organisations have put themselves forward:

The election will be held from 15 September to 15 October. The IIPC designated representatives from all member organisations will receive an email with instructions on how to vote. Each member will be asked to cast five votes. The representatives should ensure that they read all of the nomination statements before casting their votes. The results of the vote will be announced on the Netpreserve blog and the Members mailing list on 18 October. The first Steering Committee meeting in 2023 will be held online.

If you have any questions, please contact the IIPC Senior Program Officer.


Nomination statements in alphabetical order:

Internet Archive

Internet Archive seeks to continue its role on the IIPC Steering Committee. As the oldest and largest publicly-available web archive in the world, a founding member of the IIPC, and a creator of many of the core technologies used in web archiving, Internet Archive plays a key role in fostering broad community participation in preserving and providing access to the web-published records that document our shared cultural heritage. Internet Archive has long served on the Steering Committee, including as Chair, and has helped establish IIPC’s relationship with CLIR, the Discretionary Funding Program, the IIPC Training Program, and other initiatives. By continuing on the Steering Committee, Internet Archive will advance these and similar programs to expand and diversify IIPC membership, further knowledge sharing and skills development, ensure the impact and sustainability of the organization, and help build collaborative frameworks that allow members to work together. The web can only be preserved through broad-based, multi-institutional efforts. Internet Archive looks to continue its role on the Steering Committee in order to bring our capacity and expertise to support both the mission of the IIPC and the shared mission of the larger community working to preserve and provide access to the archived web.

Landsbókasafn Íslands – Háskólabókasafn / National and University Library of Iceland

The National and University Library of Iceland is interested in serving another term on the IIPC Steering Committee. The library has had an active web archiving effort for nearly two decades. Our participation in the IIPC has been instrumental in its success.

As one of the IIPC’s smaller members, we are keenly aware of the importance of collaboration to this specialized endeavor. The knowledge and tools that this community has given us access to are priceless.

We believe that in this community, active engagement ultimately brings the greatest rewards. As such, we have participated in projects, including Heritrix, OpenWayback and, most recently, PyWb. We have hosted several IIPC events, including the 2016 GA/WAC. We have also provided leadership in various areas including in working groups and the tools development portfolio, and our SC representative currently serves as the IIPC’s Steering Committee Chair.

If re-elected to the SC, we will aim to continue on in the same spirit.

Library of Congress

The Library of Congress (LC) has been involved in web archiving for over 22 years and is a founding member of the IIPC. LC has worked collaboratively with international organizations on collections, tools and workflows, while developing in-house expertise enabling the collection and management at scale of over 3.3 petabytes of web content. LC has served in a variety of IIPC leadership roles, currently as vice-chair of the IIPC, and as chair in 2021. Roles also include Membership Engagement portfolio lead and Training WG co-chair. Staff participate actively in a variety of technical discussions, workshops, working groups, and member calls, and in 2022 LC co-hosted the WAC/GA. As a Steering Committee member, LC helped secure a new fiscal agent and helped hire and onboard new IIPC staff. If re-elected, we will continue to focus on increasing engagement of all members, enabling use of member benefits by all members. We will continue to actively participate in discussions around the best use of IIPC funding to support staff, projects, and events that will enable us all to work more efficiently and collaboratively as a community and to help strengthen ties to the researcher and wider web archiving community in the coming years.

National Library of Australia

The National Library of Australia was a founding IIPC member and Steering Committee member until 2009, hosting the second general assembly in Canberra in 2008. The NLA was re-elected in 2019 and filled the vice-chair role in 2020. Long engagement with the international web archiving community includes organizing one of the first major international conferences on web archiving in 2004. The NLA is currently active within the IIPC Tools Portfolio and was involved with two recent IIPC discretionary funded projects.

The NLA’s strengths include experience, operational maturity and a pragmatic approach to web archiving. Its web archiving program, established in 1996, embraces selective, domain and bulk collecting methods and now holds around 700 TB, or 15 billion URL snapshots. With a self-described ‘radical incrementalism’ approach, the NLA has a record of agile innovation, from building the first selective web archiving workflow system to the ‘outbackCDX’ tool providing efficiency for managing CDX indexes. The NLA is committed to open access, maintaining the entire Australian Web Archive as fully accessible and searchable through the Trove discovery service. In seeking re-election, the NLA aims to offer the Steering Committee long web archiving experience, proven practical engagement, and a unique Australasian, Southern Hemisphere perspective.

National Library of New Zealand / Te Puna Mātauranga o Aotearoa

National Library of New Zealand has been an IIPC member since 2007, started web archiving in 1999, and appointed a dedicated web archiving role in 2017. The Library’s recent IIPC activities include:

  • Host of the 2018 IIPC GA/WAC and the ‘What do researchers want’ workshop; Member of the 2022 WAC Programme Committee
  • Co-chair of the Research WG; Participation in OHSOS sessions, the IIPC Strategic Direction Group, the CDG’s collaborative collections, and social media collecting webinars
  • Project partner on ‘Asking questions with web archives’ and ‘LinkGate’; A project lead on ‘Browser-based crawling system for all’

The Library also co-develops the open-source Web Curator Tool with the National Library of the Netherlands and shares updates at IIPC conferences.

The Library’s current web archiving priorities align closely with the IIPC Strategic Plan 2021-2025:

  • Full-text search and improved access to our web archives
  • Policies that allow us to provide greater access to our web archives
  • Social media collecting

Our experimentation in these areas helps the IIPC achieve its strategic objectives, by demonstrating to other IIPC member organisations how to build capacity in these areas, and by collaborating with other IIPC members in these areas.

University of North Texas Libraries

The University of North Texas (UNT) Libraries expresses its interest in being elected to the IIPC Steering Committee. As a library that serves a population of 40,000+ students and faculty, we are committed to providing a wide range of resources and services to our users. Of these services, we feel that the preservation of and access to Web archives is an important component.

The UNT Libraries has been a member of the IIPC since 2007 and has served in several capacities, including previous terms on the Steering Committee. Recently, members of the UNT Libraries have worked as co-chairs of the Tools Development Portfolio, on the Partnership and Outreach Portfolio, on the Discretionary Funding Program selection committee, on the WAC program committee, and as co-lead on the Browser-Based Crawler project.

The UNT Libraries is interested in helping the IIPC move forward into the future. We have an interest in representing the unique needs and concerns of research libraries as well as continuing to support the needs of other IIPC member institutions. If elected, the UNT Libraries will strive to represent the best interests of the IIPC community and to help move forward the preservation of the Web.

Launching LinkGate

By Youssef Eldakar of Bibliotheca Alexandrina

We are pleased to invite the web archiving community to visit LinkGate at linkgate.bibalex.org.

LinkGate is a scalable web archive graph visualization environment. The project was launched with funding from the IIPC in January 2020. During this round of funding, Bibliotheca Alexandrina (BA) and the National Library of New Zealand (NLNZ) partnered to develop core functionality for a scalable graph visualization solution geared towards web archiving and to compile an inventory of research use cases to guide future development of LinkGate.

What does LinkGate do?

LinkGate seeks to address the need to visualize data stored in a web archive. Fundamentally, the web is a graph, where nodes are webpages and other web resources, and edges are the hyperlinks that connect web resources together. A web archive introduces the time dimension to this pool of data and makes the graph a temporal graph, where each node has multiple versions according to the time of capture. Because the web is big, web archive graph data is big data, and scalability of a visualization solution is a key concern.
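To make the temporal graph idea concrete, here is a minimal sketch, in Python, of how such a graph could be represented. It is an illustration only, not LinkGate's actual data model: each node is keyed by URL and carries a list of capture timestamps, and each edge records the capture times at which the link was observed.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TemporalNode:
    """A web resource with one or more timestamped captures."""
    url: str
    captures: list = field(default_factory=list)  # e.g. "20200101120000" (CDX-style)


class TemporalWebGraph:
    """Toy temporal web graph: nodes are URLs, edges are hyperlinks
    observed at particular capture times."""

    def __init__(self):
        self.nodes = {}                 # url -> TemporalNode
        self.edges = defaultdict(list)  # (source_url, target_url) -> [timestamps]

    def add_capture(self, url, timestamp, outlinks):
        """Record one capture of `url` and the links it contained."""
        node = self.nodes.setdefault(url, TemporalNode(url))
        node.captures.append(timestamp)
        for target in outlinks:
            self.nodes.setdefault(target, TemporalNode(target))
            self.edges[(url, target)].append(timestamp)


graph = TemporalWebGraph()
graph.add_capture("http://example.org/", "20200101120000",
                  ["http://example.org/about", "http://iipc.example/"])
graph.add_capture("http://example.org/", "20210101120000",
                  ["http://example.org/contact"])  # a later version with different outlinks
```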

APIs and use cases

We developed a scalable graph data service that exposes temporal graph data via an API, a data collection tool for feeding interlinking data extracted from web archive data files into the data service, and a web-based frontend for visualizing web archive graph data streamed by the data service. Because this project was first conceived to fulfill a research need, we reached out to the web archive community and interviewed researchers to identify use cases to guide development beyond core functionality. Source code for the three software components, link-serv, link-indexer, and link-viz, respectively, as well as the use cases, are openly available on GitHub.

Using LinkGate

An instance of LinkGate is deployed on Bibliotheca Alexandrina’s infrastructure and accessible at linkgate.bibalex.org. Insertion of data into the backend data service is ongoing. The following are a few screenshots of the frontend:

  • Graph with nodes colorized by domain
  • Nodes being zoomed in
  • Settings dialog for customizing graph
  • Showing properties for a selected node
  • PathFinder for finding routes between any two nodes

Please see the project’s IIPC Discretionary Funding Program (DFP) 2020 final report for additional details.

We will be presenting the project at the upcoming IIPC Web Archiving Conference on Tuesday, 15 June 2021 and will also share the results of our work at a Research Speaker Series webinar on 28 July. If you have any questions or feedback, please contact the LinkGate Team at linkgate[at]iipc.simplelists.com.

Next steps

This development phase of Project LinkGate delivered the core functionality of a scalable, modular graph visualization environment for web archive data. Our team shares a common passion for this work and we remain committed to continuing to build up the components, including:

  • Improved scalability
  • Design and development of the plugin API to support the implementation of add-on finders and vizors (graph exploration tools)
  • Enriched metadata
  • Integration of alternative data stores (e.g., the Solr index in SolrWayback, so that data may be served by link-serv to visualize in link-viz or Gephi)
  • Improved implementation of the software in general.

BA intends to maintain and expand the deployment at linkgate.bibalex.org on a long-term basis.

Acknowledgements

The LinkGate team is grateful to the IIPC for providing the funding to get the project started and develop the core functionality. The team is passionate about this work and is eager to carry on with development.

LinkGate Team

  • Lana Alsabbagh, NLNZ, Research Use Cases
  • Youssef Eldakar, BA, Project Coordination
  • Mohammed Elfarargy, BA, Link Visualizer (link-viz) & Development Coordination
  • Mohamed Elsayed, BA, Link Indexer (link-indexer)
  • Andrea Goethals, NLNZ, Project Coordination
  • Amr Morad, BA, Link Service (link-serv)
  • Ben O’Brien, NLNZ, Research Use Cases
  • Amr Rizq, BA, Link Visualizer (link-viz)

Additional Thanks

  • Tasneem Allam, BA, link-viz development
  • Suzan Attia, BA, UI design
  • Dalia Elbadry, BA, UI design
  • Nada Eliba, BA, link-serv development
  • Mirona Gamil, BA, link-serv development
  • Olga Holownia, IIPC, project support
  • Andy Jackson, British Library, technical advice
  • Amged Magdey, BA, logo design
  • Liquaa Mahmoud, BA, logo design
  • Alex Osborne, National Library of Australia, technical advice

We would also like to thank the researchers who agreed to be interviewed for our Inventory of Use Cases.


WCT 3.0 Release

By Ben O’Brien, Web Archive Technical Lead, National Library of New Zealand

Let’s rewind 15 years, back to 2006. The Nintendo Wii is released, Google has just bought YouTube, Facebook switches to open registration, Italy has won the FIFA World Cup, and Borat is shocking cinema screens across the globe.

Java 6, Spring 1.2, Hibernate 3.1, Struts 1.2, Acegi-security are some of the technologies we’re using to deliver open source enterprise web applications. One application in particular, the Web Curator Tool (WCT) is starting its journey into the wide world of web archiving. WCT is an open source tool for managing the selective web harvesting process.

2018 Relaunch

Fast forward to 2018, and these technologies themselves belong inside an archive. Instead, they were still being used by the WCT to collect content for web archives. Twelve years is a long time in the world of the Internet and IT, so, needless to say, a fair amount of technical debt had caught up with the WCT and its users.

The collaborative development of the WCT between the National Library of the Netherlands and the National Library of New Zealand was full steam ahead after the release of the long-awaited Heritrix 3 integration in November 2018. With new features in mind, we knew we needed a modern, stable foundation within the WCT if we were to take it forward. Cue the Technical Uplift.

WCT 3.0

What followed was two years of development by teams in opposing time zones, battling resourcing, lockdowns and endless regression testing. Now at the beginning of 2021, we can at last announce the release of version 3.0 of the WCT.

While some of the names in the technology stack are the same (Java/Spring/Hibernate), the upgrade of these languages and frameworks represents a big milestone for the WCT. A launchpad to tackle the challenges of the next decade of web archiving!

For more information, see our recent blog post on webcuratortool.org. And check out a demo of v3.0 inside our VirtualBox image here.

WCT Team:

KB-NL

Jeffrey van der Hoeven
Sophie Ham
Trienka Rohrbach
Hanna Koppelaar

NLNZ

Ben O’Brien
Andrea Goethals
Steve Knight
Frank Lee
Charmaine Fajardo

Further reading on WCT:

WCT tutorial on IIPC
Documentation on WCT
WCT on GitHub
WCT on Slack
WCT on Twitter
Recent blogpost on WCT with links to old documentation

Rapid Response Twitter Collecting at NLNZ

By Gillian Lee, Coordinator, Web Archives at the Alexander Turnbull Library, National Library of New Zealand (NLNZ)

This blog post has been adapted from an IIPC Research Speaker Series (RSS) webinar held in August where presenters shared their social media web archiving projects. Thanks to everyone who participated and for your feedback. It’s always encouraging to see the projects colleagues are working on.

Collecting content when you only have a short window of opportunity

The National Library responds quickly to collecting web content when unexpected events occur. Our focus in the past was on collecting websites, which worked well for us using the Web Curator Tool; collecting social media, however, was much more difficult. We tried capturing social media using different web archiving tools, but none of them produced satisfactory results.

The Preservation, Research and Consultancy (PRC) team includes programmers and web technicians. They thought running Twitter crawls using the public Twitter API could be a good solution for capturing Twitter content. It has enabled us to capture commentary about significant New Zealand events and we’ve been running these Twitter crawls since late 2016.

One such event was the Christchurch Mosque shootings, which took place on 15 March 2019. This terrorist attack by a lone gunman at two mosques in Christchurch, in which 51 people were killed, was the deadliest mass shooting in modern New Zealand history. The image you see here, by Shaun Yeo, was created in response to the tragic events and was shared widely via social media.

Shaun Yeo: Crying Kiwi
Crying Kiwi. Ref: DCDL-0038997. Alexander Turnbull Library, Wellington, New Zealand. http://natlib.govt.nz/records/42144570   (used with permission)

While the web archivists focussed on collecting web content relating to the attacks, and the IIPC community assisted us by providing links to international commentary for us to crawl using Archive-It, the PRC web technician was busy getting the Twitter harvest underway. He needed to work quickly because there were only a few days’ leeway to pick up Tweets using Twitter’s public API.

Search Criteria

Our web technician checked Twitter and found a wide range of hashtags and search terms that we needed to use to collect the tweets.

Hashtags: ‘#ChristchurchMosqueShooting’ ‘#ChristchurchMosqueShootings’ ‘#ChristchurchMosqueAttack’ ‘#ChristchurchTerrorAttack’ ‘#ChristchurchTerroristAttack’ ‘#KiaKahaChristchurch’ ‘#NewZealandMosqueShooting’ ‘#NewZealandShooting’ ‘#NewZealandTerroristAttack’ ‘#NewZealandMosqueAttacks’ ‘#PrayForChristchurch’ ‘#ThisIsNotNewZealand’ ‘#ThisIsNotUs’ ‘#TheyAreUs’

Keywords: ‘zealand AND (gun OR ban OR bans OR automatic OR assault OR weapon OR weapons OR rifle OR military)’ ‘zealand AND (terrorist OR Terrorism OR terror)’ ‘zealand AND mass AND shooting’ ‘Christchurch AND mosque’ ‘Auckland AND vigil’ ‘Wellington AND vigil’

The Dataset

The Twitter crawl ran from 15-29 March 2019. We captured 3.2 million tweets in JSON files. We also collected 30,000 media files that were found in the tweets and we crawled 27,000 seeds referenced in the tweets. The dataset in total was around 108GB in size.

Collecting the Twitter content

We used twarc to capture the Tweets, along with some in-house scripts that enabled us to merge and deduplicate the Tweets each time a crawl was run. The original set for each crawl was kept in case anything went wrong during the deduping process or if we needed to change search parameters.
We also used scripts to capture the media files referenced in the Tweets, and a harvest was run using Heritrix to pick up webpages. These webpage URLs were run through a URL unshortening service prior to crawling to ensure we were collecting the original URL, and not a shortened URL that might become invalid within a few months. We felt that the Tweet text without accompanying images, media files and links might lose its context. We were also thinking about long-term preservation of content that will be an important part of New Zealand’s documentary heritage.
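The in-house scripts themselves are not published with this post, but the capture-and-deduplicate pattern is simple. Below is a minimal sketch assuming twarc v1 against Twitter's standard search API; the credentials, file names and query are placeholders, not the Library's actual configuration.

```python
import json
from twarc import Twarc  # pip install twarc (v1)

# Placeholder credentials; twarc can also read these from its config file.
t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

query = "#ChristchurchMosqueAttack OR (Christchurch AND mosque)"

# Capture one crawl's worth of results to a JSON-lines file.
with open("crawl_2019-03-16.jsonl", "w") as out:
    for tweet in t.search(query):
        out.write(json.dumps(tweet) + "\n")

# Merge several crawls, deduplicating on the tweet id.
seen = set()
with open("merged_deduped.jsonl", "w") as out:
    for path in ["crawl_2019-03-15.jsonl", "crawl_2019-03-16.jsonl"]:
        with open(path) as f:
            for line in f:
                tweet = json.loads(line)
                if tweet["id_str"] not in seen:
                    seen.add(tweet["id_str"])
                    out.write(line)
```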

Access copies

We created three access copies that provide different views of the dataset, namely Tweet IDs and hashed and non-hashed text files. This enables the Library to restrict access to content where necessary.

Tweet IDs

Tweet IDs (system numbers) will be available to the public online. When you rehydrate the Tweet IDs online, you only receive back the Tweets that are still publicly available – not any of the Tweets that have since been deleted.
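As an illustration, rehydration can be scripted with a tool such as twarc. This is a minimal sketch with placeholder credentials and a hypothetical input file, not the Library's own workflow; deleted or protected Tweets are simply absent from the results.

```python
from twarc import Twarc  # pip install twarc (v1)

t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

# One Tweet ID per line in a plain text file (file name is a placeholder).
with open("tweet_ids.txt") as f:
    tweet_ids = [line.strip() for line in f if line.strip()]

# hydrate() yields full Tweet objects only for Tweets that still exist.
for tweet in t.hydrate(tweet_ids):
    print(tweet["id_str"], tweet["full_text"][:80])
```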

Hashed and non-hashed access copies

In 2018, Twitter released a series of election integrity datasets (https://transparency.twitter.com/en/information-operations.html), which contained access copies of Tweets. We have used their structure and format as a precedent for our own reading room copies. These provide access to all Tweets and the majority of their metadata, but with all identifying user details obfuscated by hashed values. You can see in the table below the Tweet ID highlighted in yellow, the user display name in red and its corresponding system number (instead of an actual name) and the tweet text highlighted in blue.

The non-hashed copy provides the actual names and full URL rather than system numbers.
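A minimal sketch of this kind of obfuscation, not the Library's actual script: identifying fields are replaced with a salted SHA-256 digest, so the same user always maps to the same opaque value while the Tweet text and metrics are retained. The field selection and salt are illustrative assumptions.

```python
import hashlib
import json

SALT = b"replace-with-a-secret-salt"  # kept private so hashes cannot easily be reversed


def obfuscate(value):
    """Deterministically hash an identifying value."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()


def make_hashed_copy(tweet):
    """Return a reduced Tweet record with user details hashed."""
    return {
        "tweet_id": tweet["id_str"],
        "created_at": tweet["created_at"],
        "user_id_hashed": obfuscate(tweet["user"]["id_str"]),
        "user_screen_name_hashed": obfuscate(tweet["user"]["screen_name"]),
        "text": tweet.get("full_text", tweet.get("text", "")),
        "retweet_count": tweet["retweet_count"],
        "favorite_count": tweet["favorite_count"],
    }


with open("merged_deduped.jsonl") as f, open("access_copy_hashed.jsonl", "w") as out:
    for line in f:
        out.write(json.dumps(make_hashed_copy(json.loads(line))) + "\n")
```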

Shaping the SIP for ingest
National Digital Heritage Archive (NDHA)

We have had some technical challenges ingesting Twitter files into the National Digital Heritage Archive (NDHA). Some files were too large to ingest using Indigo, which is a tool the web and digital archivists use to deposit content into the NDHA, so we have had to use another tool called the SIP Factory, which enables the ingest of large files to the NDHA. This is being carried out by the PRC team.

We’ve shaped the SIPs (submission information packages) according to the files below and have chosen to use consistent file naming conventions for each event. We thought it would be helpful to create a readme file that shows some of the provenance and technical details of the dataset. Some of this information will be added to the descriptive record, but we felt that a readme file might include more information and it will remain with the dataset.

chch_terror_attack_2019_twitter_tweet_IDs
chch_terror_attack_2019_Twitter_access_copy
chch_terror_attack_2019_Twitter_access_copy_hashed
chch_terror_attack_2019_Twitter_crawl
chch_terror_attack_2019_twitter_readme
chch_terror_attack_2019_twitter_media_files
chch_terror_attack_2019_twitter_warc_files

Description of the dataset

Even though the tweets are published, we have decided to describe them in Tiaki, our archival content management system. This is because we’re effectively creating the dataset and our archival system works better for describing this kind of content than our published catalogue does.
NLNZ, Tiaki archival content management system

Research interest in the dataset

A PhD student was keen to view the dataset as a possible research topic. This was a great opportunity to see what we could provide and the level of assistance that might be required.

Due to the sensitivity of the dataset and the fact that it wasn’t in our archive yet, we liaised with the Library’s Access and Use Committee about what data the Library was comfortable providing. The decision was that, at this initial stage, while the researcher was still determining the scope of her research study, the data should only come from Tweets that were still available online.

The Tweet IDs were put in Dropbox for the researcher to download. There were several complicating factors that meant she was unable to rehydrate the Tweet IDs, so we did what we could to assist her.

We determined that the researcher simply wanted to get a sense of what was in the dataset, so we extracted a random sample set of 2000 Tweets. This sample included only original tweets (no retweets) and had been rehydrated so that any deleted tweets were removed. The data included the Tweet time, user location, likes, retweets, Tweet language and the Tweet text. She was pleased with what we were able to provide, because it gave her some idea of what was in the dataset even though it was a very small subset of the dataset itself.

Unfortunately, the research project has been put on hold due to Covid-19. If the research project does go ahead, we will need to work with the University to see what level of support they can provide the researcher and what kind of support we will need to provide.

LinkGate: Initial web archive graph visualization demo

By Mohammed Elfarargy and Youssef Eldakar of Bibliotheca Alexandrina

LinkGate is an IIPC-funded project to develop a scalable web archive graph visualization environment and collect research use cases, led by Bibliotheca Alexandrina (BA) and the National Library of New Zealand (NLNZ). The project provides three modular components:

  • Link Service (link-serv) for the scalable temporal graph data service with an underlying graph data store and API
  • Link Indexer (link-indexer) for collecting inter-linking data from the web archive
  • Link Visualizer (link-viz) for the web-based frontend geared towards web archive graph data navigation and exploration
Research use cases are being documented to guide future development.

You can read more about our work in the blog post published in April.

During a webinar held at the end of July as part of the IIPC Research Speaker Series (RSS), we presented a demo of the tools being developed and a summary of feedback gathered so far from the community towards a research use case inventory. In this blog post, we give an update on progress of the technical development, focusing on the initial UI of link-viz.

Link Visualizer

LinkGate’s frontend visualization component, link-viz, has developed on many fronts over the last four months. While the link-serv component is compatible with the Gephi streaming API, Gephi remains a desktop-only, general-purpose graph visualization tool. link-viz, on the other hand, is a web-based, scalable graph visualization tool made specifically to visualize web archive graph data. This makes it possible to produce more informative graphs for web archive users.

link-viz works in a similar manner to web-based map services like Google Maps. The user gets a graph based on the queried URL and the desired snapshot. Users can set the initial depth of the graph and then incrementally add more nodes as they explore deeper in the graph. This smart loading makes the exploration of such a dense graph run more smoothly.

The link-viz UI is designed to set the main focus on the graph. Users can click on any graph node to select it and perform actions using tools available in the UI. Graph nodes can be moved around and are, by default, distributed using a spring force model to help make a uniform distribution over 2D space. It’s possible to toggle this off to give users the option to organize nodes manually. Users can easily pan and zoom in/out the view using mouse controls or touch gestures. All other tools are located in four floating panels surrounding the main graph area:

The left-hand panel is used to search for a URL and to select the desired snapshot based on which the initial graph will be rendered. The snapshot selection widget is illustrated in Figure 1:

Figure 1: Snapshot selection widget

The bottom panel shows detailed information on the highlighted graph node. This includes a full URL and a listing of all the outlinks and inlinks. This can be seen in Figure 2:

Figure 2: Node details panel

The top panel contains a set of tools for graph navigation (zoom in/out and reset view), taking graph screenshots, setting graph depth, collapsing/expanding portions of the graph, and configuring the look of the graph (selection of color, size, and shape for both graph nodes and edges to represent different pieces of information). One nice feature of link-viz compared to standard graph visualization tools is the usage of website favicons for graph nodes instead of geometric shapes, which makes nodes instantly identifiable and results in a much more readable graph. Figures 3 and 4 show the top panel and favicon usage, respectively:

Figure 3: Top panel

 

Figure 4: Favicons for graph nodes

The right-hand panel contains two tabs reserved for two sets of tools, Vizors and Finders. Vizors are tools to display the same graph highlighting additional information. Two vizors are currently planned. The GeoVizor will put graph nodes on top of a world map to show the hosting physical location. The FileTypeVizor will display file-type icons as graph nodes, making it very easy to identify most common file types and their distribution over the web. Finders perform graph exploration functions, such as finding loops or paths between nodes.

Apart from Vizors and Finders, we are also working on other features, including smart graph loading and animated graph timeline. We are also going to improve UI styling.

Link Indexer

link-indexer is now integrated with link-serv via the API. We have been testing the process of inserting data extracted with link-indexer into link-serv to identify data and scalability problems to work on. link-indexer now accepts command-line options for specifying the target link-serv instance and controlling the insertion batch size to manage how often the API is invoked. More command-line options are being added to control various aspects of the tool, as well as the ability to load options from a configuration file. We are also working to enhance tolerance to data issues, such as very long URLs, and network issues, such as short service outages. Figure 5 shows a sample output from a link-indexer run:

Figure 5: Sample output from a link-indexer run

Link Service

link-serv implements an API for link-indexer and link-viz to communicate with the graph data store. The API is compatible with the Gephi streaming API, giving users the option to connect to link-serv using the popular graph visualization tool, Gephi, as an alternative to the project’s frontend, link-viz.  Figure 6 shows a Gephi client streaming graph data from a link-serv instance:

Figure 6: Gephi client streaming from a link-serv instance

A data schema customized for temporal, versioned web archive data is used in the underlying Neo4j graph data store, and link-serv defines extra API operations not defined in the Gephi streaming API to support temporal navigation functionality in link-viz.
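To illustrate what Gephi streaming compatibility means in practice, the sketch below posts a few add-node ("an") and add-edge ("ae") events in the Gephi graph streaming JSON format over HTTP. The endpoint path, workspace name and node attributes are placeholders for illustration, not link-serv's actual deployment details.

```python
import json
import urllib.request

# Placeholder endpoint for a Gephi-streaming-style server accepting updateGraph events.
ENDPOINT = "http://localhost:8080/workspace1?operation=updateGraph"

# One JSON event per line: "an" adds a node, "ae" adds an edge.
events = [
    {"an": {"example.org": {"label": "example.org", "timestamp": "20200101120000"}}},
    {"an": {"iipc.example": {"label": "iipc.example", "timestamp": "20200101120000"}}},
    {"ae": {"e1": {"source": "example.org", "target": "iipc.example", "directed": True}}},
]

body = "\n".join(json.dumps(event) for event in events).encode("utf-8")
request = urllib.request.Request(ENDPOINT, data=body,
                                 headers={"Content-Type": "application/json"})
with urllib.request.urlopen(request) as response:
    print(response.status, response.reason)
```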

As more data is added to link-serv, the underlying graph data store has difficulty scaling up when reliant on a single instance. Our primary focus in link-serv at the moment, therefore, is to implement clustering. Work is in progress on a customized dispatcher service for the Neo4j graph data store as a substitute for the clustering functionality in the commercially licensed Neo4j Enterprise Edition. As a side track, we are also looking into ArangoDB as a possible alternative deployment option for link-serv’s graph data store.

Covid-19 Collecting at the National Library of New Zealand

By Gillian Lee, Coordinator, Web Archives at the Alexander Turnbull Library, National Library of New Zealand

The National Library of New Zealand reflects on its rapid response collecting of Covid-19-related websites since February 2020.

Collecting in response to the pandemic

Web Archivists at the National Library of New Zealand are used to collecting websites relating to major events, but the Covid-19 pandemic has had such a global impact that it has affected every member of society. It has been heartbreaking to see the tragic loss of life and the economic hardship that people are facing worldwide. The effects of this pandemic will be with us for a long time.

Collecting content relating to these events always produces mixed emotions as a web archivist. There’s the tension between collecting content before it disappears, and in that regard, we put on our hard hats and get on with it. At the same time however, these events are raw and personal to each one of us and the websites we’ve collected reflect that.

IIPC Collaborative Collection

When the IIPC put out a call to contribute to the Novel Coronavirus Outbreak Collaborative Collection, we got involved. Initially, New Zealand sources were commenting on what was happening internationally, so the URLs we identified were mainly news stories. Once our first reported case of coronavirus occurred in February, we started to see New Zealand websites created in response to Covid-19 here. We continued to contribute seed URLs to the IIPC collection, but our focus necessarily switched to the selective harvesting we undertake for the National Library’s collections.

Lockdown

The New Zealand government instituted a four-level alert system on 21 March and we quickly moved to level 4 lockdown on 24 March. The lockdown lasted a month before we gradually moved down to level 1 on 8 June.

The rapidly changing alert levels were reflected in the constantly changing webpages online. It seemed that most websites we regularly harvest had content relating to Covid-19. Our selective web harvesting team focussed on identifying websites that had significant Covid-19 content or were created to cover Covid-19 events during our rapid response collecting phase. Even then it was difficult to capture all changes on a website as they responded to the different alert levels.

We were working from home during this time and connected to the Web Curator Tool through our work computers. The harvesting was consistent, but our internet connections were not always stable, so we often got thrown out of the system! If we had technical issues with any particular website harvest, by the time we resolved it, the pages online had sometimes shifted to another alert level! We also used Webrecorder and Archive-It for some of our web harvests.

Due to the enormous amount of Covid-19 content being generated and because we are a very small team (along with the challenges of working from home), what we collected could really only be a very selective representation.

Unite against Covid-19 – Unite for the Recovery

Unite Against Covid-19 harvested 18 March 2020.

One prominent website captured during this time was the government website ‘Unite Against Covid-19’ which was the go-to place for anyone wanting to know what the current rules were. This website was updated constantly, sometimes several times a day.

When we entered alert level 1, the website changed to “Unite for the Recovery.” We expect to be collecting this site for some time. While we have completed our rapid response phase, we will continue to collect Covid-19-related material as part of our regular harvesting.

Unite for the Recovery harvested 9 June 2020.

Economic Impact
Apart from official government websites, we captured websites that reflected the economic impact on our society, such as event cancellations and business closures. We documented how some businesses responded to the pandemic, by changing production lines from clothing to making face masks and from alcohol production to making hand sanitiser. New products like respirators and PPE (personal protective equipment) gear were also being produced. Tourism is a major industry in New Zealand and with border lockdowns still in place, advertising is now targeting New Zealanders. There is talk about extending this to a “Trans-Tasman” bubble to include Australia and possibly some Pacific Islands in the near future.

Social impact
As in many countries, community responses during lockdown provided both unique and shared experiences. New Zealanders were able to walk locally (with social distancing), so people put bears and other soft toys in their windows for kids (and adults) to count as they walked by. The daily televised 1pm Covid-19 updates from Prime Minister Jacinda Ardern and Director-General of Health Dr Ashley Bloomfield during lockdown were compulsive viewing and generated memorabilia such as T-shirts, bags and coasters. These were all reflected in the websites we collected. We also harvested personal blogs such as ‘lockdown diaries’.

Web archiving and beyond
During this rapid collecting phase, the web archivists focussed on collecting websites, and that’s reflected in this blog post. There was also a significant amount of content we wanted to collect from social media such as memes, digital posters and podcasts, New Zealand social commentary on Twitter and email from businesses and associations. This has required considerable effort from the Library’s Digital Collecting and Legal Deposit teams. You can find out more about this in an earlier National Library blog post by our Senior Digital Archivist Valerie Love. We are also working with our GLAM sector colleagues and donors to continue to build these collections.

Asking questions with web archives – introductory notebooks for historians

“Asking questions with web archives – introductory notebooks for historians” is one of three projects awarded a grant in the first round of the Discretionary Funding Programme (DFP) the aim of which is to support the collaborative activities of the IIPC members by providing funding to accelerate the preservation and accessibility of the web. The project was led by Dr Andy Jackson of the British Library. The project co-lead and developer was Dr Tim Sherratt, the creator of the GLAM Workbench, which provides researchers with examples, tools, and documentation to help them explore and use the online collections of libraries, archives, and museums. The notebooks were developed with the participation of the British Library (UK Web Archive), the National Library of Australia (Australian Web Archive), and the National Library of New Zealand (the New Zealand Web Archive).


By Tim Sherratt, Associate Professor of Digital Heritage, University of Canberra & the creator of the GLAM Workbench

We tend to think of a web archive as a site we go to when links are broken – a useful fallback, rather than a source of new research data. But web archives don’t just store old web pages; they capture multiple versions of web resources over time. Using web archives we can observe change – we can ask historical questions. But web archives store huge amounts of data, and access is often limited for legal reasons. Just knowing what data is available and how to get to it can be difficult.

Where do you start?

The GLAM Workbench’s new web archives section can help! Here you’ll find a collection of Jupyter notebooks that document web archive data sources and standards, and walk through methods of harvesting, analysing, and visualising that data. It’s a mix of examples, explorations, apps and tools. The notebooks use existing APIs to get data in manageable chunks, but many of the examples demonstrated can also be scaled up to build substantial datasets for research – you just have to be patient!

What can you do?

Have you ever wanted to find when a particular fragment of text first appeared in a web page? Or compare full-page screenshots of archived sites? Perhaps you want to explore how the text content of a page has changed over time, or create a side-by-side comparison of web archive captures. There are notebooks to help you with all of these. To dig deeper you might want to assemble a dataset of text extracted from archived web pages, construct your own database of archived Powerpoint files, or explore patterns within a whole domain.

A number of the notebooks use Timegates and Timemaps to explore change over time. They could be easily adapted to work with any Memento compliant system. For example, one notebook steps through the process of creating and compiling annual full-page screenshots into a time series.

Using screenshots to visualise change in a page over time.
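All of the Timemap-based notebooks rest on the same basic operation: fetch the Timemap for a URL and iterate over its mementos. A minimal sketch of that step, using the Internet Archive's Memento endpoint as an example (any Memento-compliant archive exposes an equivalent service; the target URL is arbitrary):

```python
import re
import requests  # pip install requests

target = "http://example.org/"
timemap_url = f"http://web.archive.org/web/timemap/link/{target}"

response = requests.get(timemap_url)
response.raise_for_status()

# The link-format response lists one memento per line, each with a datetime attribute.
for line in response.text.splitlines():
    if "datetime=" in line:
        memento_url = re.search(r"<(.+?)>", line).group(1)
        memento_datetime = re.search(r'datetime="(.+?)"', line).group(1)
        print(memento_datetime, memento_url)
```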

Another walks forwards or backwards through a Timemap to find when a phrase first appears in (or disappears from) a page. You can also view the total number of occurrences over time as a chart.

 

Find when a piece of text appears in an archived web page.

The notebooks document a number of possible workflows. One uses the Internet Archive’s CDX API to find all the Powerpoint files within a domain. It then downloads the files, converts them to PDFs, saves an image of the first slide, extracts the text, and compiles everything into an SQLite database. You end up with a searchable dataset that can be easily loaded into Datasette for exploration.

Find and explore Powerpoint presentations from a specific domain.
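The discovery step of that workflow can be sketched against the Internet Archive's CDX API as follows. The domain is an arbitrary example and the query is capped with a limit; the notebook itself goes further and downloads, converts and indexes the files.

```python
import requests  # pip install requests

# Ask the Internet Archive's CDX API for PowerPoint captures within a domain.
params = {
    "url": "example.gov.au",                              # placeholder domain
    "matchType": "domain",
    "filter": "mimetype:application/vnd.ms-powerpoint",
    "output": "json",
    "collapse": "urlkey",                                  # one row per unique URL
    "limit": 20,
}
rows = requests.get("http://web.archive.org/cdx/search/cdx", params=params).json()

if rows:
    header, captures = rows[0], rows[1:]
    for row in captures:
        record = dict(zip(header, row))
        # The raw file for each capture can be fetched from the Wayback Machine:
        download_url = (f"https://web.archive.org/web/"
                        f"{record['timestamp']}id_/{record['original']}")
        print(record["timestamp"], record["original"], download_url)
```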

While most of the notebooks work with small slices of web archive data, one harvests all the unique URLs from the gov.au domain and makes an attempt to visualise the subdomains. The notebooks provide a range of approaches that can be extended or modified according to your research questions.

Visualising subdomains in the gov.au domain as captured by the Internet Archive.

Acknowledgements

Thanks to everyone who contributed to the discussion on the IIPC Slack, in particular Alex Osborne, Ben O’Brien and Andy Jackson, who helped out with understanding how to use the NLA, NLNZ and UKWA collections respectively.

Web Archiving Down Under: Relaunch of the Web Curator Tool at the IIPC conference, Wellington, New Zealand

By Kees Teszelszky, Curator of Digital Collections at the National Library of the Netherlands/Koninklijke Bibliotheek (with input from Hanna Koppelaar and Jeffrey van der Hoeven – KB-NL, and Ben O’Brien, Steve Knight and Andrea Goethals – National Library of New Zealand)

Hanna Koppelaar, KB & Ben O’Brien, NLNZ. IIPC Web Archiving Conference 2018. Photo by Kees Teszelszky

The Web Curator Tool (WCT) is a globally used workflow management application designed for selective web archiving in digital heritage collecting organisations. Version 2.0 of the WCT is now available on GitHub. This release is the product of a collaborative development effort started in late 2017 between the National Library of New Zealand (NLNZ) and the National Library of the Netherlands (KB-NL). The new version was previewed during a tutorial at the IIPC Web Archiving Conference on 14 November 2018 at the National Library of New Zealand in Wellington, New Zealand. Ben O’Brien (NLNZ) and Hanna Koppelaar (KB-NL) presented the new features of the WCT and showed how to work collaboratively from opposite sides of the world in front of an audience of more than 25 attendees.

The tutorial highlighted that part of our road map for this version has been dedicated to improving the installation and support of WCT. We recognised that the majority of requests for support were related to database setup and application configuration. To improve this experience we consolidated and refactored the setup process, correcting ambiguities and misleading documentation. Another component to this improvement was the migration of our documentation to the readthedocs platform (found here), making the content more accessible and the process of updating it a lot simpler. This has replaced the PDF versions of the documentation, but not the Github wiki. The wiki content will be migrated where we see fit.

A guide on how to install WCT can be found here, and a video can be found here.

1) WCT Workflow

One of the objectives in upgrading the WCT was to raise it to a level where it could keep pace with the requirements of archiving the modern web. The first step in this process was decoupling the integration with the old Heritrix 1 web crawler and allowing the WCT to harvest using the more modern Heritrix 3 (H3) version. This work started as a proof-of-concept in 2017, which did not include any configuration of H3 from within the WCT UI. A single H3 profile was used in the backend to run H3 crawls. Today H3 crawls are fully configurable from within the WCT, mirroring the existing profile management that users had with Heritrix 1.

2) 2018 Work Plan Milestones

The second step in this process of raising the WCT up is a technical uplift. Over the past six or seven years, the software had fallen into a period of neglect, with mounting technical debt. The tool is sitting atop outdated and unsupported libraries and frameworks. Two of those frameworks are Spring and Hibernate. The feasibility of this upgrade has been explored through a successful proof-of-concept. We also want to make the WCT much more flexible and less tightly coupled by exposing each component via an API layer. In order to make that API development much easier, we are looking to migrate the existing SOAP API to REST and to change components so they are less dependent on each other.

Currently the Web Curator Tool is tightly coupled with the Heritrix crawler (H1 and H3). However, other crawl tools exist and the future will bring more. The third step is re-architecting WCT to be crawler agnostic. The abstracting out of all crawler-specific logic allows for minimal development effort to integrate new crawling tools. The path to this stage has already been started with the integration of Heritrix 3, and will be further developed during the technical uplift.

More detail about future milestones can be found in the Web Curator Tool Developer Guide in the appropriately titled section Future Milestones. This section will be updated as development work progresses.

3) Diagram showing the relationships between different Web Curator Tool components

We are conscious that there are long-time users on various old versions of WCT, as well as regular downloads of those older versions from the old Sourceforge repository (soon to be deactivated). We would like to encourage those users of older versions to start using WCT 2.0 and reaching out for support in upgrading. The primary channels for contact are the WCT Slack group and the Github repository. We hope that WCT will be widely used by the web archiving community in future and will have a large development and support base. Please contact us if you are interested in cooperating! See the Web Curator Tool Developer Guide for more information about how to become involved in the Web Curator Tool community.

WCT facts

The WCT is one of the most common open-source enterprise solutions for web archiving. It was developed in 2006 as a collaborative effort between the National Library of New Zealand and the British Library, initiated by the International Internet Preservation Consortium (IIPC), as can be read in the original documentation. Since January 2018 it has been upgraded in collaboration with the Koninklijke Bibliotheek – National Library of the Netherlands. The WCT is open source and available under the terms of the Apache Public License. The project was moved in 2014 from Sourceforge to GitHub. The latest release of the WCT, v2.0, is available now. It has an active user forum on GitHub and Slack.


Digging in Digital Dust: Internet Archaeology at KB-NL in the Netherlands

By Peter de Bode and Kees Teszelszky

The Dutch .nl ccTLD is the third-biggest national top-level domain in the world and consists of 5.68 million URLs, according to the Dutch SIDN. The first website of the Netherlands was published on the web in 1992: it was the third website on the World Wide Web. Web archiving in the Netherlands started in 2000 with the Archipol project in Groningen. The Koninklijke Bibliotheek | National Library of the Netherlands (KB-NL) started web archiving with a selection of Dutch websites in 2007. The KB not only selects and harvests these sites, but also develops a strategy to ensure their long-term usability. As the Netherlands lacks a legal deposit law, the KB cannot crawl the Dutch national domain. KB uses the Web Curator Tool (WCT) to conduct its harvests. From January 2018 onwards, the National Library of New Zealand (NLNZ) has been collaborating with KB-NL to upgrade this tool and add new features to make the application future-proof.

Since 2011, the Dutch web archive has been available in the KB reading rooms. In addition, researchers may request access to the data for specific projects. Between 2012 and 2016 the WebArt research project was carried out. As of November 2018, 15,000 websites had been selected. The Dutch web archive contains about 37 terabytes of data.

On the occasion of World Digital Preservation Day, KB unveiled a special internet archaeology collection, Euronet-Internet (1994-2017) [in Dutch: Webcollectie internetarcheologie Euronet]. It is made up of archived websites hosted by the internet provider Euronet-Internet between 1994 and 2017. The collection was started in 2017 and completed in 2018. Identification of websites for harvest was done by Peter de Bode and Kees Teszelszky as part of the larger KB web archiving project “internet archaeology.” Euronet is one of the oldest internet providers in the Netherlands (1994) and has since been bought by Online.nl. Priority is given to websites published in the early years of the Dutch web (1994-2000).

These sites can be considered as “web incunables” as these are among the first digital born publications on the Dutch web. Some of the digital treasures from this collection are the oldest website of a national political party, a virtual bank building and several sites of internet pioneers dating from 1995. Information about the collection and its heritage value can be found on a special dataset page of KB-Lab and in a collection description (in Dutch). The collection can be studied on the terminals in the reading room of KB with a valid library card. Researches can also use the dataset with URL’s and a link analysis.