IIPC Steering Committee Election 2021: nomination statements

The Steering Committee, composed of no more than fifteen Member Institutions, provides oversight of the Consortium and defines and oversees its strategy. This year, five seats are up for election or re-election. In response to the call for nominations to serve on the IIPC Steering Committee for a three-year term commencing 1 January 2022, six IIPC member organisations have put themselves forward.

The election will be held from 15 September to 15 October. The IIPC designated representatives from all member organisations will receive an email with instructions on how to vote. Each member will be asked to cast five votes. Representatives should ensure that they read all the nomination statements before casting their votes. The results of the vote will be announced on the Netpreserve blog and the Members mailing list on 18 October. The first Steering Committee meeting will be held online.

If you have any questions, please contact the IIPC Senior Program Officer.


Nomination statements in alphabetical order:

Bibliothèque nationale de France / National Library of France

The National Library of France (BnF) started its web archiving programme in the early 2000s and now holds an archive of nearly 1.5 petabytes. We develop national strategies for the growth and outreach of web archives and host several academic projects in our Data Lab. We use and share expertise about key tools for IIPC members (Heritrix 3, OpenWayback, NetarchiveSuite, webarchive-discovery) and contribute to the development of several of them. We have developed BCweb, an open source application for seed selection and curation, also shared with other national libraries in Europe.

The BnF has been involved in IIPC since its very beginning and remains committed to the development of a strong community, not only in order to sustain these open source tools but also to share experiences and practices. We have attended, and frequently actively contributed to, general assembly meetings, workshops and hackathons, and most IIPC working groups.

The BnF chaired the consortium in 2016-2017 and currently leads the Membership Engagement Portfolio. Our participation in the Steering Committee, if continued, will be focused as ever on making web archiving a thriving community, engaging researchers in the study of web archives and further developing access strategies.

The British Library

The British Library is an IIPC founding member and has enjoyed active engagement with the work of the IIPC. This has included leading technical workshops and hackathons; helping to co-ordinate and lead member calls and other resources for tools development; co-chairing the Collection Development Group; hosting the Web Archiving Conference in 2017; and participating in the development of training materials. In 2020, the British Library, with Dr Tim Sherratt, the National Library of Australia and the National Library of New Zealand, led the IIPC Discretionary Funding project to develop Jupyter notebooks for researchers using web archives. The British Library hosted the Programme and Communications Officer for the IIPC up until the end of March this year, and has continued to work closely on strategic direction for the IIPC. If elected, the British Library would continue to work on IIPC strategy and collaborate on the strategic plan. The British Library benefits a great deal from being part of the IIPC, and places a high value on the continued support, professional engagement, and friendships that have resulted from membership. The nomination for membership of the Steering Committee forms part of the British Library’s ongoing commitment to the international community of web archiving.

Deutsche Nationalbibliothek / German National Library

The German National Library (DNB) has been engaged in web archiving since 2012. Legal deposit in Germany covers websites and all kinds of digital publications, such as eBooks, eJournals and eTheses. The selective web archive currently includes about 5,000 sites with 30,000 crawls, and there are plans to expand the collection to a larger scale. Crawling, quality assurance, storage and access are handled together with a service provider rather than with common tools such as Heritrix and the Wayback Machine.

Digital preservation has always been an important topic for the German National Library. The DNB has worked on concepts and solutions in this area in many national and international projects and co-operations. Nestor, the network of expertise in long-term storage of digital resources in Germany, has its office at the DNB. The Preservation Working Group of the IIPC was co-led by the DNB for many years.
On the IIPC Steering Committee, the German National Library would like to advance the joint effort of preserving the web.

Det Kongelige Bibliotek / Royal Library of Denmark

Royal Danish Library (in charge of Netarkivet, the Danish national web archiving programme) will bring to the Steering Committee of the IIPC expertise built up through web archiving since 2001. Netarkivet now holds a collection of more than 800 TB and is active in the open source development of web archiving tools such as NetarchiveSuite and SolrWayback. The RDL representative will bring the IIPC more than 20 years of experience with web archiving, along with both technical and strategic competences as well as skills in financial management, budgeting and project portfolio management. Royal Danish Library was among the founding members of the IIPC, served on the Steering Committee for a number of years, and is now ready for another term.

Koninklijke Bibliotheek / National Library of the Netherlands

As the National Library of the Netherlands (KBNL), our work is fueled by the power of the written word. We preserve stories, essays and ideas, both printed and digital. When people come into contact with these words, whether through reading, studying or conducting research, it has an impact on their lives. With this perspective in mind, we find it of vital importance to preserve web content for future generations.

We believe the IIPC is an important network organization which brings together ideas, knowledge and best practices on how to preserve the web and retain access to its information in all its diversity. In the past years, KBNL has used its voice in the SC to raise awareness of the sustainability of tools (as we do by improving the WebCurator Tool), to point out the importance of quality assurance, and to co-organize WAC 2021. Furthermore, we have shared our insights and expertise on preservation in webinars and workshops, and we have recently joined the Partnerships & Outreach Portfolio.

We would like to continue this work and bring together more organizations, large and small, across the world, to learn from each other and ensure web content remains findable, accessible and re-usable for generations to come.

The National Archives (UK)

The National Archives (UK) is an extremely active web archiving practitioner and runs two open access web archive services – the UK Government Web Archive (UKGWA), which also includes an extensive social media archive, and the EU Exit Web Archive (EEWA). While our scope is limited to information produced by the government of the UK, we have nonetheless built up our collections to over 200 TB.

Our team has grown in capacity over the years and we are now increasingly becoming involved in research initiatives that will be relevant to the IIPC’s strategic interests.

With over 35 years’ collective team experience in the field, through building and running one of the largest and most used open access web archives in the world, we believe that we can provide valuable experience and we are extremely keen to actively contribute to the objectives of the IIPC through membership of the Steering Committee.

 

pywb 2.6

By Ilya Kreymer, Webrecorder.net

After several betas and months of development, I’m excited to announce the release of pywb 2.6!

This release, supported in large part by the IIPC (International Internet Preservation Consortium), includes several new features and documentation as well as many replay fidelity improvements and optimizations.

The main new features of the release include improvements to the access control system and localization/multi-language support. The access control system has been expanded with a flexible date-range based embargo, allowing for automated exclusion of newer or older content. The release also includes the ability to configure pywb for different user access levels when running pywb behind an Nginx or Apache server. For more details on these features, see the Access Control Guide and Deployment Guide.

With this release, pywb also includes support for running in different languages and for configuring the main UI to switch between languages. All UI text is automatically extracted into CSV files for translation and imported back. For more details, see the Localization / Multi-Language Guide section of the documentation.

A complete list of changes is also available in the pywb Changelist on GitHub.

This work is a follow-up to the first package of work supported by the IIPC, which resulted in the creation of a transition guide for users of OpenWayback. Webrecorder wishes to thank the IIPC for their support of pywb development.

The next release of pywb, corresponding to the final batch of work sponsored by IIPC in this round, will include several improvements to the pywb user-interface and navigation.

For more discussion on this work, Webrecorder will be participating in an IIPC-hosted webinar on Tuesday, August 31st, 2021.

IIPC Steering Committee Election 2021: Call for nominations

The nomination process for IIPC Steering Committee is now open.

The Steering Committee (SC) is composed of no more than fifteen Member Institutions, which provide oversight of the Consortium and define and oversee action on its strategy. This year, five seats are up for election.

What is at stake?

Serving on the Steering Committee is an opportunity for motivated members to help guide the IIPC’s mission of improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. Steering Committee members are expected to take an active role in leadership, contribute to SC and Portfolio activities, and help guide and administer the organisation.

Who can run for election?

Participation in the SC is open to any IIPC member in good standing. We strongly encourage any organisation interested in serving on the SC to nominate themselves for election. The SC members meet in person (if circumstances allow) at least once a year. Face-to-face meetings are supplemented by two teleconferences plus additional ones as required.

Please note that the nomination should be on behalf of an organisation, not an individual. Once elected, the member organisation designates a representative to serve on the Steering Committee. The list of current SC member organisations is available on the IIPC website.

How to run for election?

All nominee institutions, both new and existing members whose term is expiring but are interested in continuing to serve, are asked to write a short statement (max 200 words) outlining their vision for how they would contribute to IIPC via serving on the Steering Committee. Statements can point to past contributions to the IIPC or the SC, relevant experience or expertise, new ideas for advancing the organisation, or any other relevant information.

All statements will be posted online and emailed to members prior to the election with ample time for review by all membership. The results will be announced in October and the three-year term on the Steering Committee will start on 1 January.

Below you will find the election calendar. We are very much looking forward to receiving your nominations. If you have any questions, please contact the IIPC Senior Program Officer (SPO).


Election Calendar

14 June – 14 September 2021: Nomination period. IIPC Designated Representatives are invited to nominate their organisation by sending an email including a statement of up to 200 words to the IIPC SPO.

15 September 2021: Nominees’ statements are published on the Netpreserve Blog and Members mailing list. Nominees are encouraged to campaign through their own networks.

15 September – 15 October 2021: Members are invited to vote online. Each organisation votes only once for all nominated seats. The vote is cast by the Designated Representative.

18 October 2021: The results of the vote are announced on the Netpreserve blog and Members mailing list.

1 January 2022: The newly elected SC members start their term on 1 January.

The 2021 Web Archiving Conference in Luxembourg

By Ben Els, Digital Curator at the National Library of Luxembourg and Chair of the 2021 General Assembly (GA) and Web Archiving Conference (WAC) Organising Committee

In collaboration with the IIPC, the National Library of Luxembourg (BnL) had the honour of hosting the 2021 edition of the Web Archiving Conference. As a virtual event, this conference brought together experts and researchers from 39 countries to present and evaluate the latest developments in the world of web archiving. The last edition of this conference took place in 2019 in Zagreb, and since many web archiving institutions have not had the opportunity for local exchange with fellow practitioners since then, this international meeting plays a vital role in the concerted effort of Internet preservation.

From in-person to virtual

The preparations for this year’s conference started in June 2019 and, initially, the 2021 conference was meant to take place in our new library building. As you can imagine, the ups and downs of the past 1 ½ years had a significant influence on the preparations for this conference: Will people be able to travel? What will the safety measures be? Should we plan for a hybrid solution? Eventually, we decided that a hybrid solution would hardly be feasible from an organisational standpoint and would likely also disadvantage participants attending the live event. When the decision was made to go for an online event, we faced another set of questions: how to combine the advantages of a virtual meeting with the indispensable aspects of a physical conference? In other words, what are the most valuable experiences that people would like to take away from a real-life meeting?

Our conclusion was to aim our efforts at enabling lively discussions, to focus on Q&A sessions and networking, which would normally happen during coffee breaks or social events. We spent several months researching and testing different video-conference and virtual event platforms. Finally, we decided to abandon the idea of 8-10 hour Zoom calls and moved to a different format, using the relatively new platform Remo. We also asked the speakers and panelists to make their presentations available ahead of the conference as pre-recorded videos. This way, participants were able to watch the videos they were interested in beforehand, so that during the conference, we could jump into the Q&A part right away. This format allowed for a more lively experience, with more engaging discussions. This was illustrated by the fact that many participants stayed with the event from 08:00 in the morning until midnight!

Customising the online experience 

As Olga explained during the General Assembly: “online doesn’t mean less work”. We realised that Remo is not a perfect platform and that there is a certain adaptation phase for first-time users. Therefore, a lot of work went into organising training events during the three weeks before the conference. We made sure that all speakers, session chairs and panelists took part in at least one of these familiarisation sessions, which helped get a lot of technical and organisational questions out of the way ahead of time. Moreover, we had a number of volunteers on board, making sure that the programme of the conference would run smoothly and that all technical difficulties could be dealt with as quickly as possible. The team, composed of five core members of the organising committee and the conference “super elves”, operated like a well-oiled machine over the course of three days, from 08:00 in the morning until midnight.

Online and all time zones

The second challenge of a virtual event is the difference in time zones when people want to follow the discussions from their homes on the other side of the world. For this reason, we arranged the conference schedule in a way that would allow participants from all time zones to follow at least six hours of programme during their normal working hours. This inclusive approach proved successful, surpassing previous records for registrations and attendance. It is safe to say that the 2021 edition of the Web Archiving Conference reached more people than ever before.

Virtually in Luxembourg  

The third challenge in an online event: how to highlight the character of the hosting institution, since the conference venue doesn’t really matter on the Internet? In collaboration with our sponsors and partners, including the National Research Fund and the Luxembourg – Let’s make it happen initiative, we tried to represent the BnL and the country of Luxembourg in a virtual space. On the customised floorplan in Remo, we highlighted our partners and included hints at cultural, historical and culinary Luxembourg landmarks. If you would like to learn more about the Emoxie icons and their stories, we invite you to a virtual visit to Luxembourg. Please don’t forget to stop by at the National Library! 

Lessons learnt and raising awareness

Before WAC 2021, the BnL didn’t have a lot of experience with hosting larger conferences, and even less experience with online events. Although the time commitment should not be underestimated, the whole process was, at the same time, an incredibly valuable learning experience (not to mention how much fun we had during preparation calls and all throughout the conference). Hosting the Web Archiving Conference has also pushed the BnL to get to know all parts of the inner workings of the IIPC and to get in contact with many member institutions. Locally, we were able to draw attention to the role of the National Library as a frontrunner in digital preservation in Luxembourg (Mois des archives: Web Archive & Mir brauchen dréngend e Bachelor an den Informatiounswëssenschaften). We were also able to organise a shared panel with the University of Luxembourg to highlight local efforts in documenting the Covid-19 pandemic.


From the incredibly generous feedback, we also learned that the attention to detail and thoughtful planning did not go unnoticed by the participants. For that part, the BnL can only accept a fraction of the praise: without Olga’s and Robin’s tireless commitment and expertise, we could never have reached the goals that were set at the beginning. Therefore, next year’s hosts can be reassured to have both of them on board as they set the bar for 2022 even higher.

Launching LinkGate

By Youssef Eldakar of Bibliotheca Alexandrina

We are pleased to invite the web archiving community to visit LinkGate at linkgate.bibalex.org.

LinkGate is a scalable web archive graph visualization service. The project was launched with funding from the IIPC in January 2020. During this round of funding, Bibliotheca Alexandrina (BA) and the National Library of New Zealand (NLNZ) partnered to develop core functionality for a scalable graph visualization solution geared towards web archiving and to compile an inventory of research use cases to guide future development of LinkGate.

What does LinkGate do?

LinkGate seeks to address the need to visualize data stored in a web archive. Fundamentally, the web is a graph, where nodes are webpages and other web resources, and edges are the hyperlinks that connect web resources together. A web archive introduces the time dimension to this pool of data and makes the graph a temporal graph, where each node has multiple versions according to the time of capture. Because the web is big, web archive graph data is big data, and scalability of a visualization solution is a key concern.

APIs and use cases

We developed a scalable graph data service that exposes temporal graph data via an API, a data collection tool for feeding interlinking data extracted from web archive data files into the data service, and a web-based frontend for visualizing web archive graph data streamed by the data service. Because this project was first conceived to fulfill a research need, we reached out to the web archive community and interviewed researchers to identify use cases to guide development beyond core functionality. Source code for the three software components, link-serv, link-indexer, and link-viz, respectively, as well as the use cases, are openly available on GitHub.

Using LinkGate

An instance of LinkGate is deployed on Bibliotheca Alexandrina’s infrastructure and accessible at linkgate.bibalex.org. Insertion of data into the backend data service is ongoing. The following are a few screenshots of the frontend:

  • Graph with nodes colorized by domain
  • Nodes being zoomed in
  • Settings dialog for customizing graph
  • Showing properties for a selected node
  • PathFinder for finding routes between any two nodes

Please see the project’s IIPC Discretionary Funding Program (DFP) 2020 final report for additional details.

We will be presenting the project at the upcoming IIPC Web Archiving Conference on Tuesday, 15 June 2021 and will also share the results of our work in a Research Speakers Series webinar on 28 July. If you have any questions or feedback, please contact the LinkGate team at linkgate[at]iipc.simplelists.com.

Next steps

This development phase of Project LinkGate focused on the core functionality of a scalable, modular graph visualization environment for web archive data. Our team shares a common passion for this work, and we remain committed to continuing to build up the components, including:

  • Improved scalability
  • Design and development of the plugin API to support the implementation of add-on finders and vizors (graph exploration tools)
  • Enriched metadata
  • Integration of alternative data stores (e.g., the Solr index in SolrWayback, so that data may be served by link-serv to visualize in link-viz or Gephi)
  • Improved implementation of the software in general.

BA intends to maintain and expand the deployment at linkgate.bibalex.org on a long-term basis.

Acknowledgements

The LinkGate team is grateful to the IIPC for providing the funding to get the project started and develop the core functionality. The team is passionate about this work and is eager to carry on with development.

LinkGate Team

  • Lana Alsabbagh, NLNZ, Research Use Cases
  • Youssef Eldakar, BA, Project Coordination
  • Mohammed Elfarargy, BA, Link Visualizer (link-viz) & Development Coordination
  • Mohamed Elsayed, BA, Link Indexer (link-indexer)
  • Andrea Goethals, NLNZ, Project Coordination
  • Amr Morad, BA, Link Service (link-serv)
  • Ben O’Brien, NLNZ, Research Use Cases
  • Amr Rizq, BA, Link Visualizer (link-viz)

Additional Thanks

  • Tasneem Allam, BA, link-viz development
  • Suzan Attia, BA, UI design
  • Dalia Elbadry, BA, UI design
  • Nada Eliba, BA, link-serv development
  • Mirona Gamil, BA, link-serv development
  • Olga Holownia, IIPC, project support
  • Andy Jackson, British Library, technical advice
  • Amged Magdey, BA, logo design
  • Liquaa Mahmoud, BA, logo design
  • Alex Osborne, National Library of Australia, technical advice

We would also like to thank the researchers who agreed to be interviewed for our Inventory of Use Cases.



CLIR Becomes Administrative Home for the IIPC

The Council on Library and Information Resources (CLIR) has become the administrative home of the IIPC with the move of the IIPC’s senior program officer, Olga Holownia, to the CLIR staff.

Based in the Washington, DC, area, CLIR forges strategies to enhance research, teaching, and learning environments in collaboration with libraries, cultural institutions, and communities of higher learning. CLIR has a number of international affiliates, including IIIF and NDSA. These affiliations give organizations opportunities to engage meaningfully with new constituencies, and to work together toward integrating services, tools, platforms, research, and expertise across organizations in ways that will reduce costs, create greater efficiencies, and better serve our collective constituencies.

In 2017, IIPC became a CLIR Affiliate and CLIR agreed to serve as the organization’s fiscal agent, but IIPC staff were hosted by the British Library until Holownia’s move to CLIR. IIPC will remain independent, and its Steering Committee and Executive Board will continue to be responsible for setting the strategy, overseeing membership, tools development and outreach, as well as the Consortium’s key events.

“We warmly welcome Olga Holownia to the staff,” said CLIR president Charles Henry. “IIPC’s work is closely aligned with CLIR’s mission, and her presence will open new opportunities to enrich the work of both organizations.”

“We are thrilled that Olga has accepted the role of senior program officer with CLIR, after performing in a program officer role for many years through her position with the British Library,” said Abbie Grotke, IIPC Chair. “With CLIR now hosting this role in addition to other administrative host activities, the IIPC is well suited to serve its members and the broader web archiving community in the future.”


The IIPC community is encouraged to mark their calendars for CLIR’s Digital Library Federation (DLF) 2021 Forum, November 1-3. The annual Forum is a meeting place, marketplace, and congress for digital library practitioners, featuring panels, individual presentations, lightning talks, and birds of a feather sessions. The Forum program will be announced in late August, when registration opens. The 2021 Forum will be virtual and free of charge, as will its two affiliated events: Digital Preservation 2021, the annual conference of the National Digital Stewardship Alliance (NDSA) on November 4; and Learn@DLF, a workshop series, November 8-10.

A Retrospective with the Archives Unleashed Project

At the 2016 IIPC Web Archiving Conference in Reykjavík, Ian Milligan and Matthew Weber talked about the importance of building communities around analysing web archives and bringing together interdisciplinary researchers, which is what Archives Unleashed 1.0, the first Web Archive Hackathon, hosted by University of Toronto Libraries, attempted to do. At the same conference, Nick Ruest and Ian gave a workshop on the earliest version of the Archives Unleashed Toolkit (“Hands on with Warcbase“). Five years and seven datathons later, the Archives Unleashed Project has seen major technological developments (including the Cloud version of the Toolkit integrated with Archive-It collections), a growing community of researchers, an expanded team, new partnerships, and two major grants. At the project’s core there is still a desire to engage the community, and the most recent initiative, which builds on the datathons, is the Cohort Program, which facilitates research engagement with web archives through year-long collaborations with mentorship and support from the Archives Unleashed team.

In her blog post, Samantha Fritz, the Project Manager at the Archives Unleashed Project, reflects on the strategy and key milestones achieved between 2017 and 2020, as well as the new partnership with Archive-It and the plans for the next 3 years.


By Samantha Fritz, Project Manager, Archives Unleashed Project

The web archiving world blends the work and contributions of many institutions, groups, projects, and individuals. The field is witnessing work and progress in many areas, from policies, to professional development and learning resources, to the development of tools that address replay, acquisition, and analysis.

For over two decades, memory institutions and organizations around the world have engaged in web archiving to ensure the preservation of born-digital content that is vital to our understanding of post-1990s research topics. Increasingly, web archiving programs are adopted as part of institutional activities, because in general there is a recognition from librarians, archivists, scholars, and others that web archives are critical, and vulnerable, resources for stewarding our cultural heritage.

The National Digital Stewardship Alliance has conducted surveys to “understand the landscape of web archiving activities in the United States.” In the most recent 2017 survey, respondents indicated that the area in which they perceived the least progress over the previous two years was access, use, and reuse. The 2017 report indicates that this could suggest “a lack of clarity about how Web archives are to be used post-capture” (Farrell et al. 2018, p. 13). This finding makes complete sense given that focus has largely revolved around selection, appraisal, scoping and capture.

Ultimately, the active use of web archives by researchers, and by extension the development of tools to explore web archives, has lagged. As a result, institutions and service providers such as librarians and archivists are tasked with figuring out how to “use” web archives.

We have petabytes of data, but we also have barriers

The amount of data captured is well into the petabyte range. We can look at larger organizations like the Internet Archive, the British Library, the Bibliothèque nationale de France, Denmark’s Netarchive, the National Library of Australia’s Trove platform, and Portugal’s Arquivo.pt, which have curated extensive web archive collections, but we still don’t see mainstream or heavy use of web archives as primary sources in research. This is in part due to access and usability barriers. Essentially, the technical experience needed to work with web archives, especially at scale, is beyond the reach of most scholars.

It is this barrier that offers an opportunity for discussion and work in and beyond the web archiving community. As such, we turn to a reflection of contributions from the Archives Unleashed Project for lowering barriers to web archives.

About the Archives Unleashed Project

Archives Unleashed was established in 2017 with support from The Andrew W. Mellon Foundation. The project grew out of an earlier series of events which identified a collective need among researchers, scholars, librarians and archivists for analytics tools, community infrastructure, and accessible web archival interfaces.

In recognizing the vital role web archives play in studying topics from the 1990s forward, the team has focused on developing open-source tools to lower the barrier to working with and analyzing web archives at scale.

From 2017 to 2020, Archives Unleashed pursued a three-pronged strategy for tackling the computational woes of working with large data, and more specifically W/ARCs:

  1. Development of the Archives Unleashed Toolkit: to apply modern big data analytics infrastructure to scholarly analysis of web archives
  2. Deployment of the Archives Unleashed Cloud: provide a one-stop, web-based portal for scholars to ingest their Archive-It collections and execute a number of analyses with the click of a mouse.
  3. Organization of Archives Unleashed Datathons: build a sustainable user community around our open-source software. 

Milestones + Achievements

If we look at how Archives Unleashed tools have developed, we have to reach back to 2013, when Warcbase was created. It was the forerunner to the Toolkit and was built on Hadoop and HBase as an open-source platform to support temporal browsing and large-scale analytics of web archives (Ruest et al., 2020, p. 157).

The Toolkit moves beyond the foundations of Warcbase. Our first major transition was to replace Apache HBase with Apache Spark to modernize analytical functions. In developing the Toolkit, the team was able to leverage the needs of users to inform two significant development choices. First, we created a Python interface that has functional parity with the Scala interface. Python is widely accepted, and more commonly known, among scholars in the digital humanities who engage in computational work. From a sustainability perspective, Python is stable, open source, and ranked as one of the most popular programming languages.

Second, the Toolkit shifted from Spark’s resilient distributed datasets (RDDs), part of the Warcbase legacy, to DataFrames. While this was part of the initial Toolkit roadmap, the team engaged with users to discuss the impact of alternatives to RDDs. Essentially, DataFrames offer the ability within Apache Spark to produce tabular output. This approach was unanimously accepted by the community, in large part because of familiarity with pandas, and because DataFrames make it easier to visually read the data outputs (Fritz et al., 2018).
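To make the DataFrame approach concrete, here is a minimal sketch of what analysis with the Toolkit’s Python interface looks like. It assumes a PySpark shell launched with the aut package and JAR available (so that sc and sqlContext already exist); the WARC path is a placeholder, and the class and function names follow the aut documentation as we recall them and may differ between releases.

from aut import WebArchive, extract_domain

# Load a directory of W/ARC files (sc and sqlContext are provided by the PySpark shell).
archive = WebArchive(sc, sqlContext, "/path/to/warcs/")

# DataFrame of crawled web pages, one row per capture.
pages = archive.webpages()

# Tabular, pandas-like operations: count captures per domain.
pages.select(extract_domain("url").alias("domain")) \
     .groupBy("domain") \
     .count() \
     .orderBy("count", ascending=False) \
     .show(10)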

Comparison between RDD and DataFrame outputs
 

The Toolkit is currently at release 0.90.0, and while it offers powerful analytical functionality, it is still geared towards advanced users. Recognizing that scholars often didn’t know where to start with analyzing W/ARC files, and that the command line can be intimidating, we took a cookbook approach in developing our Toolkit user documentation. With it, researchers can modify dozens of example scripts for extracting and exploring information. Our team focused on designing documentation that presented possibilities and options, while at the same time guiding and supporting user learning.

 
Sparkshell for using the Archives Unleashed Toolkit

The work to develop the Toolkit provided the foundations for other platforms and experimental methods of working with web archives. The second large milestone reached by the project was the launch of the Archives Unleashed Cloud.

The Archives Unleashed Cloud, largely developed by project co-investigator Nick Ruest, is an open-source platform that provides a web-based front end for users to access the most recent version of the Archives Unleashed Toolkit. A core feature of the Cloud is that it uses the Archive-It WASAPI, which means that users are directly connected to their Archive-It collections and can proceed to analyze web archives without having to spend time delving into the technical world.
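As a rough illustration of what that integration involves, the sketch below lists the WARC files of an Archive-It collection via WASAPI. The collection ID and credentials are placeholders, and the response field names are assumptions on our part; consult the Archive-It WASAPI documentation for the exact parameters.

import requests

# Hypothetical example: list the WARC files of an Archive-It collection via WASAPI.
WASAPI_URL = "https://partner.archive-it.org/wasapi/v1/webdata"
AUTH = ("my-user", "my-password")  # placeholder Archive-It credentials

resp = requests.get(WASAPI_URL, params={"collection": 12345}, auth=AUTH)
resp.raise_for_status()

for warc in resp.json().get("files", []):
    # Each entry describes one WARC file and where it can be downloaded from.
    print(warc["filename"], warc["locations"][0])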

 

 

Archives Unleashed Cloud Interface for Analysis

Recognizing that the Toolkit, while flexible and powerful, may still be a little too advanced for some scholars, the Cloud offers a more user-friendly and familiar interface for interacting with data. Users are presented with simple dashboards that provide insights into WARC collections, downloadable derivative files, and simple in-browser visualizations.

By June 2020, marking the end of our grant, the Cloud had analyzed just under a petabyte of data and had been used by individuals from 59 unique institutions across 10 countries. The Cloud remains an open-source project, with code available through a GitHub repository. The canonical instance will be deprecated as of June 30, 2021 and migrated into Archive-It, but more on that project in a bit.

Datathons + Community Engagement

Datathons provided an opportunity to build a sustainable community around Archives Unleashed tools, scholarly discussion, and training for scholars with limited technical expertise to explore archived web content.

Adapting the hackathon model, these events saw participants from over fifty institutions in seven countries engage in a hands-on learning environment – working directly with web archive data and new analytical tools to produce creative and inventive projects that explore W/ARCs. In collaborating with host institutions, the datathons also highlighted the hosts’ web archive collections, increasing the visibility of, and use cases for, their curated collections.

In a recently published article, “Fostering Community Engagement through Datathon Events: The Archives Unleashed Experience,” we reflected on the impact that our series of datathon events had on community engagement within the web archiving field, and on the professional practices of attendees. We conducted interviews with datathon participants to learn about their experiences and complemented this with an exploration of established models from the community engagement literature. Our article culminates in contextualizing a model for community building and engagement within the Archives Unleashed Project, with potential applications for the wider digital humanities field. 

Our team has also invested and participated in the wider web archival community through additional scholarly activities, such as institutional collaborations, conferences, and meetings. We recognize that these activities bring together many perspectives, and have been a great opportunity to listen to the needs of users and engage in conversations that impact adjacent disciplines and communities.

Archives Unleashed Datathon, Gelman Library, George Washington University

Lessons Learned

1. It takes a community

If there is one main takeaway we’ve learned as a team, and that all our activities point to, it’s that projects can’t live in silos! Be they digital humanities, digital libraries, or any other discipline, projects need communities to function, survive, and thrive.

We’ve been fortunate and grateful to have been able to connect with various existing groups including being welcomed by the web archiving and digital humanities communities. Community development takes time and focused efforts, but it is certainly worthwhile! Ask yourself, if you don’t have a community, who are you building your tools, services, or platforms for? Who will engage with your work?

We have approached community building through a variety of avenues. First and foremost, we have developed relationships with people and organizations. This is clearly highlighted through our institutional collaborations in hosting datathon events, but we’ve also used platforms like Slack and Twitter to support discussion and connection opportunities among individuals. For instance, in creating both general and specific Slack channels, new users are able to connect with the project team and user community to share information and resources, ask for help, and engage in broader conversations on methods, tools, and data. 

Regardless of platform, successful community building relies on authentic interactions and an acknowledgment that each user brings unique perspectives and experiences to the group. In many cases we have connected with users who are either new to the field or new to methods of analyzing web archives. As such, this perspective has helped to inform an empathetic approach to the way we create learning materials, deliver reports and presentations, and share resources.

2. Interdisciplinary teams are important

So often we see projects and initiatives that highlight an interdisciplinary environment – and we’ve found it to be an important part of why our project has been successful. 

Each of our project investigators personifies a group of users that the Archives Unleashed Project aims to support, all of which converge around data, more specifically WARCs or web archive data. We have a historian who is broadly representative of digital humanists and researchers who analyze and explore web archives; a librarian who represents the curators and service providers of web archives; and a computer scientist who reflects tool builders.

A key strength of our team has been to look at the same problem from different perspectives, allowing each member to apply their unique skills and experiences in different ways. This has been especially valuable in developing underlying systems, processes and structures which now make up the Toolkit. For instance, triaging technical components offered a chance for team members to apply their unique skill sets, which often assisted in navigating issues and roadblocks.

We also recognized that each sector has its own language and jargon that can be jarring to new users. In identifying the wide range of technical skills within our team, we leveraged (and valued) those “I have no idea what this means / what this does” moments. If these types of statements were made by team members or close collaborators, chances are they would carry through to our user community.

Ultimately, the interdisciplinary nature and the wide range of technical expertise found within our team, helped us to see and think like our users.

3. Sustainability planning is really hard

Sustainability has been part question, part riddle. This is the case for many digital humanities projects. These sustainability questions speak to the long-term lifecycle of the project, and our primary goal has always been to ensure the project’s survival and continued efforts once the grant cycle has ended.

As such, the Archives Unleashed team has developed tools and platforms with sustainability in mind, specifically by adopting widely used and stable programming languages and best practices. We’ve also been committed to ensuring all our platforms and tools are developed in the spirit of open access and are available in public GitHub repositories.

One overarching question remained as our project entered its final stages in the spring of 2020: how will the Toolkit live on? Three years of development and use cases demonstrated not only the need for and adoption of the tools created under the Archives Unleashed Project, but also solidified the fact that without these tools, there are currently no simplified processes that could adequately replace them.

Where we are headed (2020-2023)

Our team was awarded a second grant from The Andrew W. Mellon Foundation, which started in 2020 and will secure the future of Archives Unleashed. The goal of this second phase is the integration of the Cloud with Archive-It, so that the tool can live on in a sustainable, long-term environment. The collaboration between Archives Unleashed and Archive-It also aims to continue to widen and enhance the accessibility and usability of web archives.

Priorities of the Project

First, we will merge the Archives Unleashed analytical tools with the Internet Archive’s Archive-It service to provide an end-to-end process for collecting and studying web archives. This will be completed in three stages:

  1. Build. Our team will be setting up the physical infrastructure and computing environment needed to kick start the project. We will be purchasing dedicated infrastructure with the Internet Archive.
  2. Integrate. Here we will be migrating the back end of the Archives Unleashed Cloud to Archive-It and paying attention to how the Cloud can scale to work within its new infrastructure. This stage will also see the development of a new user interface that will provide a basic set of derivatives to users.
  3. Enhance. The team will incorporate consultation with users to develop an expanded and enhanced set of derivatives and implement new features.

Secondly, we will engage the community by facilitating opportunities to support web archives research and scholarly outputs. Building on our earlier successful datathons, we will be launching the Archives Unleashed Cohort program to engage with and support web archives research. The Cohorts will see research teams participate in year-long intensive collaborations and receive mentorship from Archives Unleashed with the intention of producing a full-length manuscript.

We’ve made tremendous progress, and the close of our first year is in sight. Our major milestone will be to complete the integration of the Archives Unleashed Cloud/Toolkit into Archive-It. Users will soon see a beta release of the new interface for conducting analysis of their web archive collections, with over a dozen downloadable derivatives for further analysis and access to simple in-browser visualizations.

Our team looks forward to the road ahead, and would like to express our appreciation for the support and enthusiasm Archives Unleashed has received!

 

We would like to recognize that our 2017-2020 work was primarily supported by the Andrew W. Mellon Foundation, with financial and in-kind support from the Social Sciences and Humanities Research Council, Compute Canada, the Ontario Ministry of Research, Innovation, and Science, York University Libraries, Start Smart Labs, and the Faculty of Arts and the David R. Cheriton School of Computer Science at the University of Waterloo.

 

References

Farrell, M., McCain, E., Praetzellis, M., Thomas, G., and Walker, P. 2018. Web Archiving in the United States: A 2017 Survey. National Digital Stewardship Alliance Report. DOI 10.17605/OSF.IO/3QH6N

Ruest, N., Lin, J., Milligan, I., and Fritz, S. 2020. The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL ’20). Association for Computing Machinery, New York, NY, USA, 157–166. DOI: https://doi.org/10.1145/3383583.3398513

Fritz, S., Milligan, I., Ruest, N., and Lin, J. To DataFrame or Not, that is the Questions: A PySpark DataFrames Discussion. May 29, 2018. Medium. https://news.archivesunleashed.org/to-dataframe-or-not-that-is-the-questions-a-pyspark-dataframes-discussion-600f761674c4

 

Resources

We’ve provided some additional reading materials and resources that have been written by our team, and shared with the community over the course of our project work.

For a full list please visit our publications page: https://archivesunleashed.org/publications/.

Shorter blog posts can be found on our Medium site: https://news.archivesunleashed.org


Using OutbackCDX with OpenWayback

By Kristinn Sigurðsson, Head of Digital Projects and Development at the National and University Library of Iceland, IIPC Vice-Chair and Co-Lead of the Tools Development Portfolio

Last year I wrote about The Future of Playback and the work that the IIPC was funding to facilitate the migration from OpenWayback to PyWb for our members. That work is being done by Ilya Kreymer, and the first work package has now been delivered: an exhaustive transition guide detailing how OpenWayback configuration options can be translated into equivalent PyWb settings.

One thing I quickly noticed as I read through the guide is that it recommends that users use OutbackCDX as a backend for PyWb, rather than continuing to rely on “flat file”, sorted CDXs. PyWb does support “flat CDXs”, as long as they are in the 9 or 11 column format, but a convincing argument is made that using OutbackCDX for resolving URLs is preferable, whether you use PyWb or OpenWayback.

What is OutbackCDX?

OutbackCDX is a tool created by Alex Osborne, Web Archive Technical Lead at the National Library of Australia. It handles the fundamental task of indexing the contents of web archives: mapping URLs to content in WARC files.

A “traditional” CDX file (or set of files) accomplishes this by listing each and every URL, in order, in a simple text file along with information about them like in which WARC file they are stored. This has the benefit of simplicity and can be managed using simple GNU tools, such as sort. Plain CDXs, however, make inefficient use of disk space. And as they get larger, they become increasingly difficult to update because inserting even a small amount of data into the middle of a large file requires rewriting a large part of the file.

OutbackCDX improves on this by using RocksDB, a simple but powerful key-value store. The URLs are the keys and the remaining info from the CDX line is the stored value. RocksDB then does the heavy lifting of storing the data efficiently and providing speedy lookups and updates. Notably, OutbackCDX enables updates to the index without any disruption to the service.
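Lookups against the index work much like a standard CDX server query: request a URL and get back the matching CDX lines. A minimal sketch in Python, assuming the index name and port used in the ingestion example further down (exact query parameters may vary between OutbackCDX versions):

import requests

# Ask an OutbackCDX index for the captures of a given URL.
resp = requests.get(
    "http://localhost:8901/myindex",
    params={"url": "http://example.com/"},
)
resp.raise_for_status()

for line in resp.text.splitlines():
    # Each line is a CDX record: timestamp, original URL, WARC filename, offset, etc.
    print(line)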

The Mission

Given all this, transitioning to OutbackCDX for PyWb makes sense. But OutbackCDX also works with OpenWayback. If you aren’t quite ready to move to PyWb, adopting OutbackCDX first can serve as a stepping stone. It offers enough benefits all on its own to be worth it. And, once in place, it is fairly trivial to have it serve as a backend for both OpenWayback and PyWb at the same time.

So, this is what I decided to do. Our web archive, Vefsafn.is, has been running on OpenWayback with a flat file CDX index for a very long time. The index has grown to 4 billion URLs and takes up around 1.4 terabytes of disk space. Time for an upgrade.

Of course, there were a few bumps on that road, but more on that later.

Installing OutbackCDX

Installing OutbackCDX was entirely trivial. You get the latest release JAR, run it like any standalone Java application and it just works. It takes a few parameters to determine where the index should be, what port it should be on and so forth, but configuration really is minimal.

Unlike OpenWayback, OutbackCDX is not installed into a servlet container like Tomcat, but instead (like Heritrix) comes with its own built-in web server. End users do not need access to this, so it may be advisable to configure it to only be accessible internally.

Building the Index

Once it is running, you’ll need to feed your existing CDXs into it. OutbackCDX can ingest most commonly used CDX formats – certainly all that PyWb can read. CDX files can simply be posted to OutbackCDX using a command line tool like curl.

Example:

curl -X POST --data-binary @index.cdx http://localhost:8901/myindex

In our environment, we keep a gzipped CDX for each (W)ARC file, in addition to the merged, searchable CDX that powered OpenWayback. I initially just wrote a script that looped through the whole batch and posted them one at a time. I realized, though, that the number of URLs ingested per second was much higher for CDXs that contained a lot of URLs; there is an overhead to each post. On the other hand, you can’t just post your entire mega CDX in one go, as OutbackCDX will run out of memory.

Ultimately, I wrote a script that posted about 5 MB of my compressed CDXs at a time. Using it, I was able to add all ~4 billion URLs in our collection to OutbackCDX in about 2 days. I should note that our OutbackCDX is on high performance SSDs, the same as our regular CDX files have used.
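The batching approach can be sketched in a few lines of Python: read the per-WARC gzipped CDXs, accumulate roughly 5 MB of lines, and POST each chunk to the index. This is only an illustrative reconstruction (the paths and chunk size are placeholders), not the actual script used.

import glob
import gzip
import requests

OUTBACKCDX = "http://localhost:8901/myindex"
CHUNK_SIZE = 5 * 1024 * 1024  # roughly 5 MB of CDX lines per POST

buffer, buffered = [], 0

def flush():
    """POST the buffered CDX lines to OutbackCDX and empty the buffer."""
    global buffer, buffered
    if buffer:
        resp = requests.post(OUTBACKCDX, data="".join(buffer).encode("utf-8"))
        resp.raise_for_status()
        buffer, buffered = [], 0

for path in sorted(glob.glob("/data/cdx/*.cdx.gz")):
    with gzip.open(path, "rt", encoding="utf-8") as cdx:
        for line in cdx:
            buffer.append(line)
            buffered += len(line)
            if buffered >= CHUNK_SIZE:
                flush()

flush()  # post any remaining lines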

Configuring OpenWayback

Next up was to configure our OpenWayback instance to use OutbackCDX. This proved easy to do, but turned up some issues with OutbackCDX. First the configuration.

OpenWayback has a module called ‘RemoteResourceIndex’. This can be trivially enabled in the wayback.xml configuration file. Simply replace the existing `resourceIndex` with something like:

<property name="resourceIndex">
  <bean class="org.archive.wayback.resourceindex.RemoteResourceIndex">
    <property name="searchUrlBase" value="http://localhost:8080/myindex" />
  </bean>
</property>

And OpenWayback will use OutbackCDX to resolve URLs. Easy as that.

Those ‘bumps’

This is, of course, where I started running into those bumps I mentioned earlier. It turns out there were a number of edge cases where OutbackCDX and OpenWayback had different ideas. Luckily, Alex – the aforementioned creator of OutbackCDX – was happy to help resolve them. Thanks again, Alex.

The first issue I encountered was due to the age of some of our ARCs. The date fields had variable precision: rather than all being exactly 14 digits long, some had less precision and were only 10-12 characters long. This was resolved by having OutbackCDX pad those shorter dates with zeros.

I also discovered some inconsistencies in the metadata supplied along with the query results. OpenWayback expected some fields that were either missing or misnamed. These were a little tricky, as they only affected some aspects of OpenWayback, most notably the metadata in the banner inserted at the top of each page. All of this has been resolved.

Lastly, I ran into an issue related not to OpenWayback but to PyWb. It stemmed from the fact that my CDXs are not generated in the 11 column CDX format, which includes the compressed size of the WARC record holding the resource. OutbackCDX was recording this value as 0 when absent. Unfortunately, PyWb didn’t like this and would fail to load such resources. Again, Alex helped me resolve this.

OutbackCDX 0.9.1 is now the most recent release, and includes the fixes to all the issues I encountered.

Summary

Having gone through all of this, I feel fairly confident that swapping in OutbackCDX to replace a ‘regular’ CDX index for OpenWayback is very doable for most installations. And the benefits are considerable.

The size of the OutbackCDX index on disk ended up being about 270 GB. As noted before, the existing CDX index powering our OpenWayback was 1.4 TB. A reduction of more than 80%. OpenWayback also feels notably snappier after the upgrade. And updates are notably easier.

Our OpenWayback at https://vefsafn.is is now fully powered by OutbackCDX.

Next we will be looking at replacing it with PyWb. I’ll write more about that later, once we’ve made more progress, but I will say that having it run on the same OutbackCDX proved trivial to accomplish, and we now have a beta website up using PyWb at http://beta.vefsafn.is.

Search results in Vefsafn.is (beta), which uses PyWb.


SolrWayback 4.0 release! What’s it all about? Part 2

By Thomas Egense, Programmer at the Royal Danish Library and the Lead Developer on SolrWayback.

This blog post is republished from Software Development at Royal Danish Library.

In this blog post I will go into the more technical details of SolrWayback and the new 4.0 release. The whole frontend GUI was rewritten from scratch to live up to the expectations of 2020 web applications, and many new features were implemented in the backend. I recommend reading the frontend blog post first; it has beautiful animated GIFs demonstrating most of the features in SolrWayback.

Live demo of SolrWayback

You can access a live demo of SolrWayback here. Thanks to the National Széchényi Library of Hungary for providing the SolrWayback demo site!

Back in 2018…

The open source SolrWayback project was created in 2018 as an alternative to the netarchive frontend applications that existed at the time. At the Royal Danish Library we were already using Blacklight as a search frontend. Blacklight is an all-purpose Solr frontend application and is very easy to configure and install by defining a few properties such as the Solr server URL, fields and facet fields. But since Blacklight is a generic Solr frontend, it had no special handling of the rich data structure we had in Solr. Also, binary data such as images and videos are not stored in Solr, so integration with the WARC-file repository can enrich the experience and make playback possible, since Solr holds enough information to also work as a CDX server.

Another interesting frontend was Shine. It was custom tailored for the Solr index created with the warc-indexer and had features such as trend analysis (n-gram) visualization of search results over time. The showstopper was that Shine was built on an older version of the Play framework, and the latest version of the Play framework was not backwards compatible with the maintained branch. Upgrading was far from trivial and would have required a major rewrite of the application. Adding to that, our frontend developers had years of experience with the larger, more widely used pure JavaScript frameworks. The frontenders’ weapon of choice for SolrWayback was the Vue.js framework; both SolrWayback 3.0 and the newly rewritten SolrWayback 4.0 have frontends developed in Vue.js. If you have skills in Vue.js and an interest in SolrWayback, your collaboration will be appreciated.

WARC-Indexer. Where the magic happens!

WARC files are indexed into Solr using the WARC-Indexer. The WARC-Indexer reads every WARC record, extracts all kinds of information and splits it into up to 60 different fields. It uses Tika to parse all the different MIME types that can be encountered in WARC files. Tika extracts the text from HTML, PDF, Excel and Word documents, etc. It also extracts metadata from binary documents if present. The metadata can include created/modified time, title, description, author and so on. For images it can also extract width/height or EXIF information such as latitude/longitude. The binary data themselves are not stored in Solr, but for every record in the WARC file there is a record in Solr. This also includes empty records, such as HTTP 302 (MOVED) responses, with information about the new URL.

WARC-Indexer. Paying the price up front…

Indexing a large number of WARC files requires massive amounts of CPU, but the job is easily parallelised since the WARC-Indexer takes a single WARC file as input. To give an idea of the requirements: indexing 700 TB of WARC files (5.5 million files) took 3 months using 280 CPUs. Once the existing collection is indexed, it is easier to keep up with the incremental growth of the collection. This is the drawback of using SolrWayback on large collections: the WARC files have to be indexed first.
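Because each WARC file is an independent unit of work, the indexing fans out naturally. Below is a rough sketch of one way to parallelise it on a single machine; the jar name and arguments passed to the indexer are placeholders, not the real command line.

```python
# Fan indexing out over a pool of workers, one WARC file per task.
from multiprocessing import Pool
from pathlib import Path
import subprocess

def index_warc(path: str) -> None:
    # Hypothetical invocation of the Java WARC-Indexer; the jar name and
    # arguments are placeholders -- consult the warc-indexer docs for the real ones.
    subprocess.run(["java", "-jar", "warc-indexer.jar", path], check=True)

if __name__ == "__main__":
    warc_files = [str(p) for p in sorted(Path("/data/warcs").glob("*.warc.gz"))]
    with Pool(processes=16) as pool:   # e.g. one worker per available core
        pool.map(index_warc, warc_files)
```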

Solr provides multiple ways of aggregating data, moving common netarchive statistics tasks from slow batch processing to interactive requests. Based on input from researchers, the feature set is continuously expanding with aggregation, visualization and extraction of data.

Due to the amazing performance of Solr, a query is often answered in less than 2 seconds against a collection of 32 billion (32×10⁹) documents, facets included. The search results are not limited to the HTML pages where the free text is found; every document that matches the query is returned. When presenting the results, each document type has a custom display for its MIME type.
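As an illustration, a faceted free-text query of this kind can be sent straight to Solr; the core name and field names (content_text, content_type_norm, crawl_year, domain) follow the warc-indexer schema but should be checked against your own setup.

```python
# A free-text query with facets, of the kind issued behind the scenes.
import requests

SOLR = "http://localhost:8983/solr/netarchive/select"

params = {
    "q": 'content_text:"climate change"',
    "rows": 10,
    "facet": "true",
    "facet.field": ["content_type_norm", "crawl_year", "domain"],
    "facet.limit": 20,
    "wt": "json",
}
data = requests.get(SOLR, params=params).json()
print(data["response"]["numFound"], "matching documents")
print(data["facet_counts"]["facet_fields"]["crawl_year"])
```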

HTML results are enriched with thumbnail images from the page as part of the result, images are shown directly, and audio and video files can be played directly from the results list with an in-browser player, or downloaded if the browser does not support the format.

Solr. Reaping the benefits from the WARC-indexer

The SolrWayback Java backend offers a lot more than just sending queries to Solr and returning the results to the frontend. Methods can aggregate data from multiple Solr queries or read WARC entries directly and return the processed data to the frontend in a simple format. Instead of re-parsing the WARC files, which is a very tedious task, the information can be retrieved from Solr, and the task can be done in seconds or minutes instead of weeks.

See the frontend blog post for more feature examples.

Wordcloud
Generating a wordcloud image is done by extracting text from 1,000 random HTML pages from a domain and generating a word cloud from the extracted text.
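A minimal sketch of that workflow, assuming the schema defines a random sort field (as the warc-indexer schema does) and a content_text field; adjust the field names to your own index.

```python
# Word cloud sketch: sample ~1000 HTML pages from a domain and feed the
# extracted text to the wordcloud package. Field names are assumptions.
import requests
from wordcloud import WordCloud

SOLR = "http://localhost:8983/solr/netarchive/select"

params = {
    "q": "domain:example.org AND content_type_norm:html",
    "fl": "content_text",
    "rows": 1000,
    "sort": "random_1234 asc",   # pseudo-random sample via a RandomSortField
    "wt": "json",
}
docs = requests.get(SOLR, params=params).json()["response"]["docs"]

def as_text(value):
    return " ".join(value) if isinstance(value, list) else (value or "")

text = " ".join(as_text(d.get("content_text")) for d in docs)
WordCloud(width=800, height=600).generate(text).to_file("wordcloud.png")
```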

Interactive linkgraph
By extracting the domains that link to a given domain (A) and also extracting the outgoing links from that domain, you can build a link graph. Repeating this for the new domains found gives you a two-level local link graph for domain A. Even though this can require hundreds of separate Solr queries, it is still done in seconds on a large corpus. Clicking a domain will highlight its neighbours in the graph (try the demo: interactive linkgraph).
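A sketch of how those Solr queries could be combined, assuming the warc-indexer fields domain and links_domains; the SolrWayback backend does something similar server-side.

```python
# Two-level local link graph around domain A, built from facet queries.
import requests

SOLR = "http://localhost:8983/solr/netarchive/select"

def facet_values(query, field, limit=50):
    params = {"q": query, "rows": 0, "facet": "true",
              "facet.field": field, "facet.limit": limit, "wt": "json"}
    flat = requests.get(SOLR, params=params).json()["facet_counts"]["facet_fields"][field]
    return flat[::2]                      # Solr returns [value, count, value, count, ...]

def neighbours(domain):
    outgoing = facet_values(f"domain:{domain}", "links_domains")
    incoming = facet_values(f"links_domains:{domain}", "domain")
    return set(outgoing + incoming) - {domain}

edges = set()
level1 = neighbours("example.org")        # level 1: neighbours of A
for d in level1:
    for n in neighbours(d):               # level 2: neighbours of the neighbours
        edges.add((d, n))
print(len(level1), "first-level domains,", len(edges), "edges")
```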

Large scale linkgraph
Extraction of massive linkgraphs with up to 500K domains can be done in hours.

Link graph example from the Danish NetArchive.

The exported link-graph data was rendered in Gephi and made zoomable and interactive using Graph presenter. The link graphs can be exported quickly because all links (a href) for each HTML record are extracted and indexed as part of the corresponding Solr document.

Image search
Freetext search can be used to find HTML documents. The HTML documents in Solr are already enriched with the image links found on each page, so the HTML does not have to be parsed again. Instead of showing the HTML pages, SolrWayback collects all the images from the pages and shows them in a Google-like image search result. Under the assumption that the text on an HTML page relates to its images, you can find images that match the query: if you search for “cats” in the HTML pages, the results will most likely show pictures of cats. These pictures could not be found by searching the image documents alone if neither their metadata nor their file names contain “cats”.
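In outline, the image search could look like the sketch below; links_images is an assumed field name for the image links already stored on each HTML document.

```python
# Image search sketch: find HTML pages matching the text, then collect the
# image links already indexed on those pages.
import requests

SOLR = "http://localhost:8983/solr/netarchive/select"

params = {
    "q": "content_type_norm:html AND content_text:cats",
    "fl": "url,links_images",
    "rows": 50,
    "wt": "json",
}
docs = requests.get(SOLR, params=params).json()["response"]["docs"]

images = []
for doc in docs:
    images.extend(doc.get("links_images", []))

# Deduplicate while keeping order; each URL can then be resolved to the
# nearest harvested capture and shown as a thumbnail grid.
unique_images = list(dict.fromkeys(images))
print(len(unique_images), "candidate images for the query")
```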

CSV Stream export
You can export result sets with millions of documents to a CSV file. Instead of exporting all possible 60 Solr fields for each result, you can custom pick which fields to export. This CSV export has already been used by several researchers at the Royal Danish Library and gives them the opportunity to use other tools, such as RStudio, to perform analysis on the data. The National Széchényi Library demo site has disabled CSV export in the SolrWayback configuration, so it cannot be tested live.
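For large result sets a streaming approach is needed; the sketch below uses Solr’s cursorMark paging to walk millions of documents and write the chosen fields to CSV. Field names are illustrative.

```python
# Streaming CSV export with cursorMark, avoiding deep-paging problems.
import csv
import requests

SOLR = "http://localhost:8983/solr/netarchive/select"
FIELDS = ["url", "crawl_date", "content_type_norm", "domain"]

with open("export.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    cursor = "*"
    while True:
        params = {"q": "domain:example.org", "fl": ",".join(FIELDS),
                  "rows": 1000, "sort": "id asc",   # cursorMark requires a sort on the unique key
                  "cursorMark": cursor, "wt": "json"}
        data = requests.get(SOLR, params=params).json()
        for doc in data["response"]["docs"]:
            writer.writerow({f: doc.get(f, "") for f in FIELDS})
        if data["nextCursorMark"] == cursor:        # no more results
            break
        cursor = data["nextCursorMark"]
```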

WARC corpus extraction
Besides CSV export, you can also export a result set to a WARC file. The export reads the WARC entry for each document in the result set, copies the WARC header + HTTP header + payload, and creates a new WARC file with all results combined.

Extracting a sub-corpus is this easy, and it has already proven to be extremely useful for researchers. Examples include extraction of a domain for a given date range, or a query restricted to a list of defined domains. This export is a 1-to-1 mapping from the results in Solr to the entries in the WARC files.
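A minimal sketch of that 1-to-1 export, assuming the index stores the WARC filename and record offset in fields named source_file_path and source_file_offset (names may differ in your schema):

```python
# WARC sub-corpus export sketch: seek to each hit's offset in its source
# WARC file and copy the record into a new WARC file.
import requests
from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

SOLR = "http://localhost:8983/solr/netarchive/select"

params = {"q": "domain:example.org AND crawl_year:2020",
          "fl": "source_file_path,source_file_offset",
          "rows": 10000, "wt": "json"}
docs = requests.get(SOLR, params=params).json()["response"]["docs"]

with open("subcorpus.warc.gz", "wb") as out:
    writer = WARCWriter(out, gzip=True)
    for doc in docs:
        with open(doc["source_file_path"], "rb") as warc:
            warc.seek(int(doc["source_file_offset"]))
            record = next(iter(ArchiveIterator(warc)))   # the record at that offset
            writer.write_record(record)
```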

SolrWayback can also perform an extended WARC export which includes all resources (JS/CSS/images) for every HTML page in the export. The extended export ensures that playback will also work for the sub-corpus. Since the exported WARC file can become very large, you can use a WARC splitter tool or simply split the export into smaller batches by adding crawl year/month to the query. The National Széchényi Library demo site has disabled WARC export in the SolrWayback configuration, so it cannot be tested live.

SolrWayback playback engine

SolrWayback has a built-in playback engine, but it is optional; SolrWayback can be configured to use any other playback engine that follows the same playback URL API, “/server/<date>/<url>”, such as PyWb. It has been a common misunderstanding that SolrWayback forces you to use its own playback engine. The demo at the National Széchényi Library has PyWb configured as an alternative playback engine: clicking the icon next to the title of an HTML result will open playback in PyWb instead of SolrWayback.

Playback quality

The playback quality of SolrWayback is an improvement over OpenWayback for the Danish Netarchive, but not as good as PyWb. The technique used is URL rewriting, just as in PyWb: URLs are replaced according to the HTML specification for HTML pages and CSS files. However, SolrWayback does not yet rewrite links generated from JavaScript, though this is likely to be improved in a coming major release. It has not been a priority, since the content of the Danish Netarchive is harvested with Heritrix, which does not harvest dynamic JavaScript resources.

This is only a problem for absolute links, i.e. those starting with http://domain/…, since all relative URL paths are resolved automatically by the URL playback API. Relative links that refer to the root of the playback server are resolved by the SolrWaybackRootProxy application, which exists for this sole purpose: it calculates the correct URL from the HTTP referer header and redirects back into SolrWayback. Absolute URLs from JavaScript (or dynamically generated JavaScript) can result in live leaks. This can be avoided with an HTTP proxy or by adding a whitelist of URLs to the browser. In the Danish Citrix production environment, live leaks are blocked by sandboxing the environment. Improving playback is in the pipeline.

The SolrWayback playback has been designed to be as authentic as possible, without showing a fixed toolbar at the top of the browser. Only a small overlay is included in the top left corner; it can be removed with a click so that you see the page exactly as it was harvested. From the playback overlay you can open the calendar and an overview of the resources included in the HTML page, along with their timestamps compared to the main HTML page, similar to the feature provided by the archive.org playback engine.

The URL replacement is done up front and fully resolved to an exact WARC file and offset. An HTML page can have hundreds of different resources, and each of them requires a URL lookup for the version nearest to the crawl time of the HTML page. All resource lookups for a single HTML page are batched into a single Solr query, which improves both performance and scalability.
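A simplified sketch of that batched lookup, assuming warc-indexer field names (url_norm, crawl_date, source_file_path, source_file_offset) and ISO-format crawl dates:

```python
# Resolve all resources of one page in a single Solr query, then pick the
# capture closest in time to the page itself.
from datetime import datetime, timezone
import requests

SOLR = "http://localhost:8983/solr/netarchive/select"

def nearest_captures(resource_urls, page_time):
    clause = " OR ".join(f'"{u}"' for u in resource_urls)
    params = {"q": f"url_norm:({clause})",
              "fl": "url_norm,crawl_date,source_file_path,source_file_offset",
              "rows": 10 * len(resource_urls), "wt": "json"}
    docs = requests.get(SOLR, params=params).json()["response"]["docs"]

    best = {}
    for doc in docs:
        ts = datetime.fromisoformat(doc["crawl_date"].replace("Z", "+00:00"))
        delta = abs((ts - page_time).total_seconds())
        if doc["url_norm"] not in best or delta < best[doc["url_norm"]][0]:
            best[doc["url_norm"]] = (delta, doc)
    return {url: doc for url, (_, doc) in best.items()}

page_time = datetime(2020, 6, 1, 12, 0, tzinfo=timezone.utc)
captures = nearest_captures(["http://example.org/style.css",
                             "http://example.org/logo.png"], page_time)
```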

SolrWayback and Scalability

For scalability, it all comes down to the scalability of SolrCloud, which has proven without a doubt to be one of the leading search technologies and is still rapidly improving in each new version. Storing the indexes on SSD gives substantial performance boosts as well but can be costly. The Danish Netarchive has 126 Solr services running in a SolrCloud setup.

One of the servers is the master and the only one that receives requests. The Solr master has an empty index but is responsible for gathering the data from the other Solr services; if the master server also had an index, there would be extra overhead. 112 of the Solr servers each have a 900 GB index with an average of ~300M documents, while the remaining 13 servers currently have empty indexes, which makes expanding the collection easy without any configuration changes. Even with 32 billion documents, query response times are below 2 seconds. The result query and the facet query are separate, simultaneous calls; the advantage is that the results can be rendered very quickly while the facets finish loading later.

For very large results in the billions, the facets can take 10 seconds or more, but such queries are not realistic and the user should be more precise in limiting the results up front.

Building new shards
Building new shards (collection pieces) is done outside the production environment; a new index is moved onto one of the empty Solr servers when it reaches ~900 GB. The index is optimized before it is moved, since no more data will be written to it that would undo the optimization. This also gives a small improvement in query times. If the indexing were done directly into the production index, it would also impact response times. Separating the production and build environments has spared us from dealing with complex problems we would otherwise have faced. It also makes speeding up index building trivial: assign more machines/CPUs to the task and build multiple indexes at once.
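The explicit optimize can be triggered through Solr’s standard update handler; a small sketch (host and collection name are examples):

```python
# Trigger a force-merge (optimize) on a freshly built shard before it is
# moved into the production SolrCloud.
import requests

BUILD_SOLR = "http://build-host:8983/solr/netarchive_shard42/update"

resp = requests.get(BUILD_SOLR, params={"optimize": "true", "waitSearcher": "true"})
resp.raise_for_status()
print("Optimize finished with status", resp.json()["responseHeader"]["status"])
```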

You cannot keep indexing into the same shard forever, as this would cause other problems. We found the sweet spot at the time to be an index size of ~900 GB, which could fit on the 932 GB SSDs available to us when the servers were built. A larger index also requires more memory on each Solr server; we have allocated 8 GB to each. For our large-scale netarchive, we keep track of which WARC files have been indexed using Archon and Arctika.

Archon is the central server with a database; it keeps track of all WARC files, whether they have been indexed, and which shard number they went into.

Arctika is a small workflow application that starts WARC-Indexer jobs: it asks Archon for the next WARC file to process and reports back when the file has been completed.

SolrWayback – framework

SolrWayback is a single Java web application containing both the Vue frontend and the Java backend. The backend has two REST service interfaces written with JAX-RS: one is responsible for the services called by the Vue frontend, and the other handles the playback logic.

SolrWayback software bundle

SolrWayback comes with an out-of-the-box bundle release. The release contains a Tomcat server with SolrWayback, a Solr server, and a workflow for indexing, all preconfigured. All that is required is to unzip the zip file and copy the two property files to your home directory. Add some WARC files yourself and start the indexing job.

Try SolrWayback Software bundle!

IIPC Chair Address

By Abbie Grotke, Assistant Head, Digital Content Management Section
(Web Archiving Program), Library of Congress
and the IIPC Chair 2021-2022


Hello IIPC community!

I am thrilled to be the Chair of the IIPC in 2021. I’ve been involved in this organization since the very early days, so much so that somewhere buried in my folders in my office (which I have not been in for almost a year), are meeting notes from the very first discussions that led to the IIPC being formed back in 2003. Involvement in IIPC has been incredibly rewarding personally, and for our institution and all of our team members who have had the chance to interact with the community through working groups, projects, events, and informal discussions.

This year brings changes, challenges and opportunities for our community. Particularly during a time when many of us are isolated and working from home, both documenting web content about the pandemic and living it at the same time, connections to my friends and colleagues around the world seem more important than ever.

Here are a few key things to highlight for the coming year:

A Big Year for Organisation, Governance, and Strategic Planning Change

As a result of the fine work of the Strategic Direction Group led by Hansueli Locher of the Swiss National Library, the IIPC has a new Consortium Agreement for 2021-2025! This document is renewed every 4-5 years, and this time some key changes were made to strengthen our ability to manage the Consortium more efficiently and to reflect the organisational changes that have taken place since 2016. Feedback from IIPC members was used to create the new agreement, and you’ll notice a slight update of objectives, which now acknowledge the importance of collaborative collections and research. Many thanks to the Strategic Direction Group (Emmanuelle Bermès of the BnF, Steve Knight of the National Library of New Zealand, Hansueli Locher, Alex Thurman of the Columbia University Libraries, and the IIPC Programme and Communications Officer) for their work on this and their continued engagement.

Executive Board and the Steering Committee’s terms

The new agreement establishes a new Executive Board composed of the Chair, the Vice-Chair, the Treasurer and our new senior staff member, as well as additional members of the SC appointed as needed. While the Steering Committee is responsible for setting out the strategic direction for our organisation for the next 5 years, one of our key tasks for this year is to convert it into an Action Plan.

The new Consortium Agreement aligns the terms of the Steering Committee members and the Executive Board. What it means in practice is that the SC members’ 3-year term will start on January 1 instead of June 1. We will open a call for nominations to serve on the SC during our next General Assembly, but if you are interested in nominating your institution, you can contact the PCO.

For more information about the responsibilities of the new Executive Board please review section 2.5 of the new Consortium Agreement.

Administrative Host

Our ability to have and compensate our Administrative and Financial Host has been formalized in the new agreement. We are excited to collaborate more with the Council on Library and Information Resources (CLIR) this year through this arrangement, particularly in setting up some new staffing arrangements for us. More on this will be announced in the coming months.

Strategic Plan

One of our big tasks in 2021 will be working on the Strategic Plan. This work is led by the Strategic Direction Group, with input from the Steering Committee, Working Groups, and Portfolio Leads. Since this work is one of our most important activities for the year, Hansueli has joined the Executive Board to ensure close collaboration and support for the initiative.

Missing Your IIPC Colleagues? Join our Virtual Events!

A blast from the past: the IIPC General Assembly at the Library of Congress, May 2012.
From the left: Kristinn Sigurðsson (IIPC Vice-Chair, National and University Library of Iceland), Gildas Illien (BnF), and Abbie.

As anyone who has attended an IIPC event in person knows, it is one of the best parts about being a member. In my case, interacting with colleagues from around the world who have similar challenges, experiences, and new and exciting insights has been great for my own professional growth, and has only helped the Library of Congress web archiving program be more successful. While it’s sad that we cannot travel and meet in person together right now, there are opportunities to continue to connect virtually and to engage others in our institutions who may not have been able to travel to the in-person meetings. We’re already working on developing a more robust calendar of events for members (and some that will be more widely open to non-members).

As you’re aware, our big events, the General Assembly (June 14) and the Web Archiving Conference (June 15-16), have been moved online as part of Web Archiving Week, hosted virtually from Luxembourg. Many thanks to the National Library of Luxembourg for offering to host the online event!

Beyond the GA and WAC, due to the success of the well-received and well-attended webinars and calls with members in 2020, we will continue to deliver those over the course of the year. We are also working on additional training events and continuing report-outs of technical projects and member updates. Stay tuned for more soon and check our events page for updates!

Working Groups and funded projects

The IIPC continues to work collaboratively in 2021 on a number of initiatives through our Working Groups, including our transnational collections (the Covid-19 collection continues in 2021), training materials, and activities focusing on research use of web archives. 2021 also brings exciting funded project news, thanks to the continuation of the DFP, a funding programme launched in June 2019 and led by three former IIPC Chairs: Emmanuelle Bermès, Jefferson Bailey (Internet Archive), and Mark Phillips (University of North Texas Libraries). In 2020 the Jupyter Notebooks project, led by Andy Jackson of the British Library and created by Tim Sherratt, was successfully completed and won the British Library Labs award. This year, we are launching Developing Bloom Filters for Web Archives’ Holdings (a collaboration between Los Alamos National Laboratory (LANL) and the National and University Library in Zagreb) and Improving the Dark and Stormy Archives Framework by Summarizing the Collections of the National Library of Australia (a collaboration between Old Dominion University, the National Library of Australia and LANL), continuing LinkGate: Core Functionality and Future Use Cases (Bibliotheca Alexandrina & National Library of New Zealand), and hoping to hold the Archives Unleashed datathon, led by the BnF in partnership with KBR / Royal Library of Belgium and the National Library of Luxembourg, later in 2021.

We are also working with Webrecorder on pywb transition support for members. The migration guide, with input from IIPC members, is already available, and work continues on the next stages of the project. Look for more updates on these projects through our events and blog posts throughout the year. There will also be an opportunity in 2021 for more projects to be funded, so we encourage members to start thinking about other projects that could use support and would benefit the community.

Lastly, I want to remind you to continue to follow our activities on the IIPC website and Twitter (do tweet on #WebArchiveWednesday!). To subscribe to our mailing list, send an email to communications@iipc.simplelists.com.

I look forward to working with you all more closely this year. Please feel free to reach out to me if you have any questions or concerns during my time as Chair.

Happy Web Archiving to you all!

Abbie Grotke

Assistant Head, Digital Content Management Section (Web Archiving Program), Library of Congress

IIPC Chair 2021-2022