The Steering Committee, composed of no more than fifteen Member Institutions, provides oversight of the Consortium and defines and oversees its strategy. This year five seats are up for election/re-election. In response to the call for nominations to serve on the IIPC Steering Committee for a three-year term commencing 1 January 2022, six IIPC member organisations have put themselves forward:
The election will be held from 15 September to 15 October. The IIPC designated representatives from all member organisations will receive an email with instructions on how to vote. Each member will be asked to cast five votes. The representatives should ensure that they read all the nomination statements before casting their votes. The results of the vote will be announced on the Netpreserve blog and Members mailing list on 18 October. The first Steering Committee meeting will be held online.
If you have any questions, please contact the IIPC Senior Program Officer.
Nomination statements in alphabetical order:
Bibliothèque nationale de France / National Library of France
The National Library of France (BnF) started its web archiving programme in the early 2000s and now holds an archive of nearly 1.5 petabytes. We develop national strategies for the growth and outreach of web archives and host several academic projects in our Data Lab. We use and share expertise about key tools for IIPC members (Heritrix 3, OpenWayback, NetarchiveSuite, webarchive-discovery) and contribute to the development of several of them. We have developed BCweb, an open source application for seed selection and curation, also shared with other national libraries in Europe.
The BnF has been involved in IIPC since its very beginning and remains committed to the development of a strong community, not only in order to sustain these open source tools but also to share experiences and practices. We have attended, and frequently actively contributed to, general assembly meetings, workshops and hackathons, and most IIPC working groups.
The BnF chaired the consortium in 2016-2017 and currently leads the Membership Engagement Portfolio. Our participation in the Steering Committee, if continued, will be focused as ever on making web archiving a thriving community, engaging researchers in the study of web archives and further developing access strategies.
The British Library
The British Library is an IIPC founding member and has enjoyed active engagement with the work of the IIPC. This has included leading technical workshops and hackathons; helping to co-ordinate and lead member calls and other resources for tools development; co-chairing the Collection Development Group; hosting the Web Archive Conference in 2017; and participating in the development of training materials. In 2020, the British Library, with Dr Tim Sherratt, the National Library of Australia and National Library of New Zealand, led the IIPC Discretionary Funding project to develop Jupyter notebooks for researchers using web archives. The British Library hosted the Programme and Communications Officer for the IIPC up until the end of March this year, and has continued to work closely on strategic direction for the IIPC. If elected, the British Library would continue to work on IIPC strategy, and collaborate on the strategic plan. The British Library benefits a great deal from being part of the IIPC, and places a high value on the continued support, professional engagement, and friendships that have resulted from membership. The nomination for membership of the Steering Committee forms part of the British Library’s ongoing commitment to the international community of web archiving.
Deutsche Nationalbibliothek / German National Library
The German National Library (DNB) has been engaged in web archiving since 2012. Legal deposit in Germany covers websites and all kinds of digital publications, such as eBooks, eJournals and eTheses. The selective web archive currently includes about 5,000 sites with 30,000 crawls, and it is planned to expand the collection to a larger scale. Crawling, quality assurance, storage and access are handled together with a service provider rather than with common tools like Heritrix and the Wayback Machine.
Digital preservation has always been an important topic for the German National Library. The DNB has worked on concepts and solutions in this area in many international and national projects and co-operations. Nestor, the German network of expertise in the long-term storage of digital resources, has its office at the DNB, and the Preservation Working Group of the IIPC was co-led for many years by the DNB. On the IIPC Steering Committee, the German National Library would like to advance the joint preservation of the Web.
Det Kongelige Bibliotek / Royal Library of Denmark
Royal Danish Library (in charge of the Danish national web archiving programme, Netarkivet) will bring to the SC of the IIPC deep expertise in web archiving built up since 2001. Netarkivet now holds a collection of more than 800 TB and is active in the open source development of web archiving tools such as NetarchiveSuite and SolrWayback. The representative from RDL will bring the IIPC more than 20 years of web archiving experience, offering both technical and strategic competences as well as skills in financial management, budgeting and project portfolio management. Royal Danish Library was among the founding members of the IIPC; the institution served on the SC for a number of years and is now ready for another term.
Koninklijke Bibliotheek / National Library of the Netherlands
As the National Library of the Netherlands (KBNL), our work is fueled by the power of the written word. We preserve stories, essays and ideas, both printed and digital. When people come into contact with these words, whether through reading, studying or conducting research, it has an impact on their lives. With this perspective in mind, we find it of vital importance to preserve web content for future generations.
We believe the IIPC is an important network organization which brings together ideas, knowledge and best practices on how to preserve the web and retain access to its information in all its diversity. In past years, KBNL has used its voice in the SC to raise awareness of tool sustainability (as we do by improving the WebCurator Tool), to point out the importance of quality assurance, and to co-organize WAC 2021. Furthermore, we have shared our insights and expertise on preservation in webinars and workshops, and we recently joined the Partnerships & Outreach Portfolio.
We would like to continue this work and bring together more organizations, large and small, across the world, to learn from each other and ensure web content remains findable, accessible and re-usable for generations to come.
The National Archives (UK)
The National Archives (UK) is an extremely active web archiving practitioner and runs two open access web archive services – the UK Government Web Archive (UKGWA), which also includes an extensive social media archive, and the EU Exit Web Archive (EEWA). While our scope is limited to information produced by the government of the UK, we have nonetheless built up our collections to over 200TB.
Our team has grown in capacity over the years and we are now increasingly becoming involved in research initiatives that will be relevant to the IIPC’s strategic interests.
With over 35 years’ collective team experience in the field, through building and running one of the largest and most used open access web archives in the world, we believe that we can provide valuable experience and we are extremely keen to actively contribute to the objectives of the IIPC through membership of the Steering Committee.
After several betas and months of development, I’m excited to announce the release of pywb 2.6!
This release, supported in large part by the IIPC (International Internet Preservation Consortium), includes several new features and documentation as well as many replay fidelity improvements and optimizations.
The main new features of the release include improvements to the access control system and localization/multi-language support. The access control system has been expanded with a flexible date-range based embargo, allowing for automated exclusions of newer or older content. The release also includes the ability to configure pywb for different user access levels when running pywb behind an Nginx or Apache server. For more details on these features, see the Access Control Guide and Deployment Guide.
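The idea behind a date-range embargo can be sketched in a few lines of Python. This is an illustration of the concept only, not pywb's actual configuration syntax or code; the function name and parameters here are hypothetical (consult the Access Control Guide for the real options):

```python
from datetime import datetime, timedelta, timezone

def embargo_allows(capture_date, newer_than_days=None, older_than_days=None, now=None):
    """Illustrative date-range embargo check (not pywb's API).

    newer_than_days: hide captures made within the last N days.
    older_than_days: hide captures older than N days.
    """
    now = now or datetime.now(timezone.utc)
    age = now - capture_date
    if newer_than_days is not None and age < timedelta(days=newer_than_days):
        return False  # capture is too recent: embargoed
    if older_than_days is not None and age > timedelta(days=older_than_days):
        return False  # capture is too old: embargoed
    return True
```

Either bound on its own gives the "exclude newer" or "exclude older" behaviour; combining both yields a sliding window of accessible captures.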
With this release, pywb also includes support for running in different languages and for configuring the main UI to switch between languages. All UI text is automatically populated into CSV files and imported back after translation. For more details, see the Localization / Multi-Language Guide section of the documentation.
The Steering Committee (SC) is composed of no more than fifteen Member Institutions who provide oversight of the Consortium and define and oversee action on its strategy. This year five seats are up for election.
What is at stake?
Serving on the Steering Committee is an opportunity for motivated members to help guide the IIPC’s mission of improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. Steering Committee members are expected to take an active role in leadership, contribute to SC and Portfolio activities, and help guide and administer the organisation.
Who can run for election?
Participation in the SC is open to any IIPC member in good standing. We strongly encourage any organisation interested in serving on the SC to nominate themselves for election. The SC members meet in person (if circumstances allow) at least once a year. Face-to-face meetings are supplemented by two teleconferences plus additional ones as required.
Please note that the nomination should be on behalf of an organisation, not an individual. Once elected, the member organisation designates a representative to serve on the Steering Committee. The list of current SC member organisations is available on the IIPC website.
How to run for election?
All nominee institutions, both new and existing members whose term is expiring but are interested in continuing to serve, are asked to write a short statement (max 200 words) outlining their vision for how they would contribute to IIPC via serving on the Steering Committee. Statements can point to past contributions to the IIPC or the SC, relevant experience or expertise, new ideas for advancing the organisation, or any other relevant information.
All statements will be posted online and emailed to members prior to the election, with ample time for review by the full membership. The results will be announced in October and the three-year term on the Steering Committee will start on 1 January.
Below you will find the election calendar. We are very much looking forward to receiving your nominations. If you have any questions, please contact the IIPC Senior Program Officer (SPO).
14 June – 14 September 2021: Nomination period. IIPC Designated Representatives are invited to nominate their organisation by sending an email including a statement of up to 200 words to the IIPC SPO.
15 September 2021: Nominees statements are published on the Netpreserve Blog and Members mailing list. Nominees are encouraged to campaign through their own networks.
15 September – 15 October 2021: Members are invited to vote online. Each organisation votes only once for all nominated seats. The vote is cast by the Designated Representative.
18 October 2021: The results of the vote are announced on the Netpreserve blog and Members mailing list.
1 January 2022: The newly elected SC members start their term.
By Ben Els, Digital Curator at the National Library of Luxembourg and Chair of the 2021 General Assembly (GA) and Web Archiving Conference (WAC) Organising Committee
In collaboration with the IIPC, the National Library of Luxembourg (BnL) had the honour of hosting the 2021 edition of the Web Archiving Conference. As a virtual event, this conference brought together experts and researchers from 39 countries, to present and evaluate the latest developments in the world of web archiving. The last edition of this conference took place in 2019 in Zagreb and since many web archiving institutions haven’t had the opportunity for a local exchange with fellow practitioners, this international meeting plays a vital role in the concerted efforts in Internet preservation.
From in-person to virtual
The preparations for this year’s conference started in June 2019 and, initially, the 2021 conference was meant to take place in our new library building. As you can imagine, the ups and downs of the past 1 ½ years had a significant influence on the preparations for this conference: Will people be able to travel? What will the safety measures be? Should we plan for a hybrid solution? Eventually, we decided that a hybrid solution would hardly be feasible from an organisational standpoint and would likely also disadvantage participants attending the live event. When the decision was made to go for an online event, we faced another set of questions: how to combine the advantages of a virtual meeting with the indispensable aspects of a physical conference? In other words, what are the most valuable experiences that people would like to take away from a real-life meeting?
Our conclusion was to aim our efforts at enabling lively discussions, to focus on Q&A sessions and networking, which would normally happen during coffee breaks or social events. We spent several months researching and testing different video-conference and virtual event platforms. Finally, we decided to abandon the idea of 8-10 hour Zoom calls and moved to a different format, using the relatively new platform Remo. We also asked the speakers and panelists to make their presentations available ahead of the conference as pre-recorded videos. This way, participants were able to watch the videos they were interested in beforehand, so that during the conference, we could jump into the Q&A part right away. This format allowed for a more lively experience, with more engaging discussions. This was illustrated by the fact that many participants stayed with the event from 08:00 in the morning until midnight!
Customising the online experience
As Olga explained during the General Assembly: “online doesn’t mean less work”. We realised that Remo is not a perfect platform and that there would be a short adaptation phase for first-time users. Therefore, a lot of work went into organising training events during the three weeks before the conference. We made sure that all speakers, session chairs and panelists took part in at least one of these familiarisation sessions, which helped get a lot of technical and organisational questions out of the way ahead of time. Moreover, we had a number of volunteers on board, making sure that the programme of the conference would run smoothly and all technical difficulties could be dealt with as quickly as possible. The team, composed of five core members of the organising committee and the conference “super elves”, operated like a well-oiled machine – and that over the course of three days, from 08:00 in the morning until midnight.
Online and all time zones
The second challenge of a virtual event is the difference in time zones, when people want to follow the discussions from their homes on the other side of the world. For this reason, we arranged the conference schedule in a way that would allow participants from all time zones to follow at least six hours of the programme during their normal working hours. This inclusive approach proved successful, surpassing previous records for registrations and attendance. It is safe to say that the 2021 edition of the Web Archiving Conference reached more people than ever before.
Virtually in Luxembourg
The third challenge in an online event: how to highlight the character of the hosting institution, since the conference venue doesn’t really matter on the Internet? In collaboration with our sponsors and partners, including the National Research Fund and the Luxembourg – Let’s make it happen initiative, we tried to represent the BnL and the country of Luxembourg in a virtual space. On the customised floorplan in Remo, we highlighted our partners and included hints at cultural, historical and culinary Luxembourg landmarks. If you would like to learn more about the Emoxie icons and their stories, we invite you to a virtual visit to Luxembourg. Please don’t forget to stop by at the National Library!
Lessons learnt and raising awareness
Before WAC 2021, the BnL didn’t have a lot of experience with hosting larger conferences, and even less experience with online events. Although the time commitment should not be underestimated, the whole process was, at the same time, an incredibly valuable learning experience (not to mention how much fun we had during preparation calls and all throughout the conference). Hosting the Web Archiving Conference has also pushed the BnL to get to know all parts of the inner workings of the IIPC and to get in contact with many member institutions. Locally, we were able to draw attention to the role of the National Library as a frontrunner in digital preservation in Luxembourg (Mois des archives: Web Archive & Mir brauchen dréngend e Bachelor an den Informatiounswëssenschaften). We were also able to organise a shared panel with the University of Luxembourg, to highlight local efforts in documenting the Covid-19 pandemic.
From the incredibly generous feedback, we also learned that the attention to detail and thoughtful planning did not go unnoticed by the participants. For that part, the BnL can only accept a fraction of the praise: without Olga’s and Robin’s tireless commitment and expertise, we could never have reached the goals that were set at the beginning. Next year’s hosts should therefore be reassured to have both of them on board as they set the bar even higher for 2022.
LinkGate is a scalable web archive graph visualization tool. The project was launched with funding from the IIPC in January 2020. During this round of funding, Bibliotheca Alexandrina (BA) and the National Library of New Zealand (NLNZ) partnered to develop core functionality for a scalable graph visualization solution geared towards web archiving and to compile an inventory of research use cases to guide future development of LinkGate.
What does LinkGate do?
LinkGate seeks to address the need to visualize data stored in a web archive. Fundamentally, the web is a graph, where nodes are webpages and other web resources, and edges are the hyperlinks that connect web resources together. A web archive introduces the time dimension to this pool of data and makes the graph a temporal graph, where each node has multiple versions according to the time of capture. Because the web is big, web archive graph data is big data, and scalability of a visualization solution is a key concern.
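To make the temporal-graph idea concrete, here is a minimal, illustrative sketch in plain Python (not LinkGate's actual data model; all names are made up for illustration): each node keeps one set of outlinks per capture time, and querying the graph at a timestamp uses the most recent capture of each node.

```python
from collections import defaultdict

class TemporalWebGraph:
    """Illustrative temporal graph: one set of outlinks per (node, capture time)."""

    def __init__(self):
        # url -> {capture timestamp -> set of outlink urls}
        self.captures = defaultdict(dict)

    def add_capture(self, url, timestamp, outlinks):
        """Record one capture (version) of a node at a given time."""
        self.captures[url][timestamp] = set(outlinks)

    def edges_at(self, timestamp):
        """Edges implied by the most recent capture of each node at or before `timestamp`."""
        edges = []
        for url, versions in self.captures.items():
            usable = [t for t in versions if t <= timestamp]
            if usable:
                latest = max(usable)
                edges.extend((url, target) for target in versions[latest])
        return sorted(edges)
```

The scalability concern follows directly from this model: every node multiplies into as many versions as there are captures, so a production data service must index and stream this data rather than hold it in memory as this toy does.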
APIs and use cases
We developed a scalable graph data service that exposes temporal graph data via an API, a data collection tool for feeding interlinking data extracted from web archive data files into the data service, and a web-based frontend for visualizing web archive graph data streamed by the data service. Because this project was first conceived to fulfill a research need, we reached out to the web archive community and interviewed researchers to identify use cases to guide development beyond core functionality. Source code for the three software components, link-serv, link-indexer, and link-viz, respectively, as well as the use cases, are openly available on GitHub.
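Conceptually, the interlinking data that a collection tool like link-indexer extracts and feeds to the data service is a set of time-stamped edges: (capture time, source page, link target). The following stdlib-only sketch illustrates that idea on raw HTML; it is not the project's actual code, and the function names are illustrative:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets of <a> tags from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_edges(source_url, timestamp, html):
    """Return (timestamp, source, target) edges for one archived page capture."""
    parser = LinkExtractor()
    parser.feed(html)
    return [(timestamp, source_url, target) for target in parser.links]
```

A real pipeline would read these records out of WARC files and resolve relative URLs, but the output shape, time-stamped source/target pairs, is what a temporal graph service needs to ingest.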
An instance of LinkGate is deployed on Bibliotheca Alexandrina’s infrastructure and accessible at linkgate.bibalex.org. Insertion of data into the backend data service is ongoing. The following are a few screenshots of the frontend:
Please see the project’s IIPC Discretionary Funding Program (DFP) 2020 final report for additional details.
We will be presenting the project at the upcoming IIPC Web Archiving Conference on Tuesday, 15 June 2021, and will also share the results of our work in a Research Speaker Series webinar on 28 July. If you have any questions or feedback, please contact the LinkGate Team at linkgate[at]iipc.simplelists.com.
This development phase of Project LinkGate has been for the core functionality of a scalable, modular graph visualization environment for web archive data. Our team shares a common passion for this work and we remain committed to continuing to build up the components, including:
Design and development of the plugin API to support the implementation of add-on finders and vizors (graph exploration tools)
Integration of alternative data stores (e.g., the Solr index in SolrWayback, so that data may be served by link-serv to visualize in link-viz or Gephi)
Improved implementation of the software in general.
The LinkGate team is grateful to the IIPC for providing the funding to get the project started and develop the core functionality. The team is passionate about this work and is eager to carry on with development.
Lana Alsabbagh, NLNZ, Research Use Cases
Youssef Eldakar, BA, Project Coordination
Mohammed Elfarargy, BA, Link Visualizer (link-viz) & Development Coordination
Mohamed Elsayed, BA, Link Indexer (link-indexer)
Andrea Goethals, NLNZ, Project Coordination
Amr Morad, BA, Link Service (link-serv)
Ben O’Brien, NLNZ, Research Use Cases
Amr Rizq, BA, Link Visualizer (link-viz)
Tasneem Allam, BA, link-viz development
Suzan Attia, BA, UI design
Dalia Elbadry, BA, UI design
Nada Eliba, BA, link-serv development
Mirona Gamil, BA, link-serv development
Olga Holownia, IIPC, project support
Andy Jackson, British Library, technical advice
Amged Magdey, BA, logo design
Liquaa Mahmoud, BA, logo design
Alex Osborne, National Library of Australia, technical advice
We would also like to thank the researchers who agreed to be interviewed for our Inventory of Use Cases.
Based in the Washington, DC, area, CLIR forges strategies to enhance research, teaching, and learning environments in collaboration with libraries, cultural institutions, and communities of higher learning. CLIR has a number of international affiliates, including IIIF and NDSA. These affiliations give organizations opportunities to engage meaningfully with new constituencies, and to work together toward integrating services, tools, platforms, research, and expertise across organizations in ways that will reduce costs, create greater efficiencies, and better serve our collective constituencies.
In 2017, IIPC became a CLIR Affiliate and CLIR agreed to serve as the organization’s fiscal agent, but IIPC staff were hosted by the British Library until Holownia’s move to CLIR. IIPC will remain independent, and its Steering Committee and Executive Board will continue to be responsible for setting the strategy, overseeing membership, tools development and outreach, as well as the Consortium’s key events.
“We warmly welcome Olga Holownia to the staff,” said CLIR president Charles Henry. “IIPC’s work is closely aligned with CLIR’s mission, and her presence will open new opportunities to enrich the work of both organizations.”
“We are thrilled that Olga has accepted the role of senior program officer with CLIR, after performing in a program officer role for many years through her position with the British Library,” said Abbie Grotke, IIPC Chair. “With CLIR now hosting this role in addition to other administrative host activities, the IIPC is well suited to serve its members and the broader web archiving community in the future.”
The IIPC community is encouraged to mark their calendars for CLIR’s Digital Library Federation (DLF) 2021 Forum, November 1-3. The annual Forum is a meeting place, marketplace, and congress for digital library practitioners, featuring panels, individual presentations, lightning talks, and birds of a feather sessions. The Forum program will be announced in late August, when registration opens. The 2021 Forum will be virtual and free of charge, as will its two affiliated events: Digital Preservation 2021, the annual conference of the National Digital Stewardship Alliance (NDSA) on November 4; and Learn@DLF, a workshop series, November 8-10.
At the 2016 IIPC Web Archiving Conference in Reykjavík, Ian Milligan and Matthew Weber talked about the importance of building communities around analysing web archives and bringing together interdisciplinary researchers, which is what Archives Unleashed 1.0, the first Web Archive Hackathon, hosted by University of Toronto Libraries, attempted to do. At the same conference, Nick Ruest and Ian gave a workshop on the earliest version of the Archives Unleashed Toolkit (“Hands on with Warcbase”). Five years and seven datathons later, the Archives Unleashed Project has seen major technological developments (including the Cloud version of the Toolkit, integrated with Archive-It collections), a growing community of researchers, an expanded team, new partnerships, and two major grants. At the project’s core, there is still a desire to engage the community, and the most recent initiative, which builds on the datathons, is the Cohort Program, which facilitates research engagement with web archives through a year-long collaboration with mentorship and support from the Archives Unleashed team.
In her blog post, Samantha Fritz, the Project Manager at the Archives Unleashed Project, reflects on the strategy and key milestones achieved between 2017 and 2020, as well as the new partnership with Archive-It and the plans for the next 3 years.
The web archiving world blends the work and contributions of many institutions, groups, projects, and individuals. The field is witnessing work and progress in many areas, from policies, to professional development and learning resources, to the development of tools that address replay, acquisition, and analysis.
For over two decades, memory institutions and organizations around the world have engaged in web archiving to ensure the preservation of born-digital content that is vital to our understanding of post-1990s research topics. Web archiving programs are increasingly adopted as part of institutional activities, because there is a general recognition among librarians, archivists, scholars, and others that web archives are critical, yet vulnerable, resources for stewarding our cultural heritage.
The National Digital Stewardship Alliance has conducted surveys to “understand the landscape of web archiving activities in the United States.” Reflecting on the most recent 2017 survey, respondents indicated that the area in which they perceived the least progress over the previous two years was access, use, and reuse. The 2017 report notes that this could suggest “a lack of clarity about how Web archives are to be used post-capture” (Farrell et al., 2017 Report, p. 13). This finding makes complete sense given that focus has largely revolved around selection, appraisal, scoping and capture.
Ultimately, the active use of web archives by researchers, and by extension the development of tools to explore web archives, has lagged. As a result, institutions and service providers, such as librarians and archivists, are tasked with figuring out how to “use” web archives.
We have petabytes of data, but we also have barriers
The amount of data captured is well into the petabyte range. Larger organizations such as the Internet Archive, the British Library, the Bibliothèque nationale de France, Denmark’s Netarchive, the National Library of Australia’s Trove platform, and Portugal’s Arquivo.pt have curated extensive web archive collections, yet we still don’t see mainstream or heavy use of web archives as primary sources in research. This is in part due to access and usability barriers: the technical experience needed to work with web archives, especially at scale, is beyond the reach of most scholars.
It is this barrier that offers an opportunity for discussion and work in and beyond the web archiving community. As such, we turn to a reflection on the Archives Unleashed Project’s contributions to lowering barriers to web archives.
About the Archives Unleashed Project
Archives Unleashed was established in 2017 with support from The Andrew W. Mellon Foundation. The project grew out of an earlier series of events which identified a collective need among researchers, scholars, librarians and archivists for analytics tools, community infrastructure, and accessible web archival interfaces.
Recognizing the vital role web archives play in studying topics from the 1990s forward, the team has focused on developing open-source tools that lower the barrier to working with and analyzing web archives at scale.
From 2017 to 2020, Archives Unleashed pursued a three-pronged strategy for tackling the computational woes of working with large data, and more specifically W/ARCs:
Development of the Archives Unleashed Toolkit: to apply modern big data analytics infrastructure to scholarly analysis of web archives
Deployment of the Archives Unleashed Cloud: to provide a one-stop, web-based portal for scholars to ingest their Archive-It collections and execute a number of analyses with the click of a mouse.
Organization of Archives Unleashed Datathons: to build a sustainable user community around our open-source software.
Milestones + Achievements
If we look at how Archives Unleashed tools have developed, we have to reach back to 2013 when Warcbase was developed. It was the forerunner to the Toolkit and was built on Hadoop and HBase as an open-source platform to support temporal browsing and large-scale analytics of web archives (Ruest et al., 2020, p. 157).
The Toolkit moves beyond the foundations of Warcbase. Our first major transition was to replace Apache HBase with Apache Spark to modernize analytical functions. In developing the Toolkit, the team was able to leverage the needs of users to inform two significant development choices. The first was to create a Python interface with functional parity with the Scala interface. Python is widely accepted, and more commonly known, among scholars in the digital humanities who engage in computational work; from a sustainability perspective, it is stable, open source, and ranked as one of the most popular programming languages.
The second was the Toolkit’s shift from Spark’s resilient distributed datasets (RDDs), part of the Warcbase legacy, to supporting DataFrames. While this was part of the initial Toolkit roadmap, the team engaged with users to discuss the impact of alternative options to RDDs. Essentially, DataFrames offer the ability within Apache Spark to produce tabular output. The community accepted this approach unanimously, in large part because of familiarity with pandas and because DataFrames made it easier to visually read the data outputs (Fritz et al., 2018, Medium post).
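To illustrate why the tabular form is friendlier, here is the shape of a typical derivative, a domain frequency count, sketched in plain Python over DataFrame-like rows. The Toolkit itself computes this in Spark; this stand-in is stdlib-only, and the simplified row format and function name are assumptions for illustration:

```python
from collections import Counter
from urllib.parse import urlparse

def domain_frequency(rows):
    """rows: iterable of dicts with at least a 'url' key, one row per archived page,
    mimicking the tabular records a DataFrame-style interface produces.
    Returns (domain, count) pairs sorted by descending count."""
    counts = Counter(urlparse(row["url"]).netloc for row in rows)
    return counts.most_common()
```

The result reads like a small table of domains and counts, which is exactly the pandas-like, visually scannable output that made DataFrames an easy sell to the community.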
The Toolkit is currently at a 0.90.0 release, and while it offers powerful analytical functionality, it is still geared towards an advanced user. Recognizing that scholars often didn’t know where to start with analyzing W/ARC files, and that the command line can be intimidating, we took a cookbook approach in developing our Toolkit user documentation. With it, researchers can modify dozens of example scripts for extracting and exploring information. Our team focused on designing documentation that presented possibilities and options, while at the same time guiding and supporting user learning.
The work to develop the Toolkit provided the foundations for other platforms and experimental methods of working with web archives. The second large milestone reached by the project was the launch of the Archives Unleashed Cloud.
The Archives Unleashed Cloud, largely developed by project co-investigator Nick Ruest, is an open-source platform that provides a web-based front end for accessing the most recent version of the Archives Unleashed Toolkit. A core feature of the Cloud is that it uses the Archive-It WASAPI, which means that users are directly connected to their Archive-It collections and can analyze web archives without having to delve into the technical details.
Recognizing that the Toolkit, while flexible and powerful, may still be a little too advanced for some scholars, the Cloud offers a more user-friendly and familiar interface for interacting with data. Users are presented with simple dashboards that provide insights into WARC collections, downloadable derivative files, and simple in-browser visualizations.
In June of 2020, marking the end of our grant, the Cloud had analyzed just under a petabyte of data and had been used by individuals from 59 unique institutions across 10 countries. The Cloud remains an open-source project, with code available through a GitHub repository. The canonical instance will be deprecated as of June 30, 2021, and migrated into Archive-It, but more on that project in a bit.
Datathons + Community Engagement
Datathons provided an opportunity to build a sustainable community around Archives Unleashed tools, scholarly discussion, and training for scholars with limited technical expertise to explore archived web content.
Adapting the hackathon model, these events saw participants from over fifty institutions in seven countries engage in a hands-on learning environment – working directly with web archive data and new analytical tools to produce creative and inventive projects that explore W/ARCs. In collaborating with host institutions, the datathons also highlighted those institutions' web archive collections, increasing the visibility and use of their curated collections.
In a recently published article, “Fostering Community Engagement through Datathon Events: The Archives Unleashed Experience,” we reflected on the impact that our series of datathon events had on community engagement within the web archiving field, and on the professional practices of attendees. We conducted interviews with datathon participants to learn about their experiences and complemented this with an exploration of established models from the community engagement literature. Our article culminates in contextualizing a model for community building and engagement within the Archives Unleashed Project, with potential applications for the wider digital humanities field.
Our team has also invested and participated in the wider web archival community through additional scholarly activities, such as institutional collaborations, conferences, and meetings. We recognize that these activities bring together many perspectives, and have been a great opportunity to listen to the needs of users and engage in conversations that impact adjacent disciplines and communities.
1. It takes a community
If there is one main takeaway we’ve learned as a team, and that all our activities point to, it’s that projects can’t live in silos! Be they digital humanities, digital libraries, or any other discipline, projects need communities to function, survive, and thrive.
We’ve been fortunate and grateful to have been able to connect with various existing groups including being welcomed by the web archiving and digital humanities communities. Community development takes time and focused efforts, but it is certainly worthwhile! Ask yourself, if you don’t have a community, who are you building your tools, services, or platforms for? Who will engage with your work?
We have approached community building through a variety of avenues. First and foremost, we have developed relationships with people and organizations. This is clearly highlighted through our institutional collaborations in hosting datathon events, but we’ve also used platforms like Slack and Twitter to support discussion and connection opportunities among individuals. For instance, in creating both general and specific Slack channels, new users are able to connect with the project team and user community to share information and resources, ask for help, and engage in broader conversations on methods, tools, and data.
Regardless of platform, successful community building relies on authentic interactions and an acknowledgment that each user brings unique perspectives and experiences to the group. In many cases we have connected with users who are either new to the field or to methods of analyzing web archives. This perspective has helped to inform an empathetic approach to the way we create learning materials, deliver reports and presentations, and share resources.
2. Interdisciplinary teams are important
So often we see projects and initiatives that highlight an interdisciplinary environment – and we’ve found it to be an important part of why our project has been successful.
Each of our project investigators personifies a group of users that the Archives Unleashed Project aims to support, all of which converge around data, more specifically WARCs or web archive data. We have a historian who is broadly representative of digital humanists and researchers who analyze and explore web archives; a librarian who represents the curators and service providers of web archives; and a computer scientist who reflects tool builders.
A key strength of our team has been to look at the same problem from different perspectives, allowing each member to apply their unique skills and experiences in different ways. This has been especially valuable in developing underlying systems, processes and structures which now make up the Toolkit. For instance, triaging technical components offered a chance for team members to apply their unique skill sets, which often assisted in navigating issues and roadblocks.
We also recognized that each sector has its own language and jargon that can be jarring to new users. In identifying the wide range of technical skills within our team, we leveraged (and valued) those “I have no idea what this means / what this does” moments. If these types of statements were made by team members or close collaborators, chances are they would carry through to our user community.
Ultimately, the interdisciplinary nature and the wide range of technical expertise found within our team helped us to see and think like our users.
3. Sustainability planning is really hard
Sustainability has been part question, part riddle. This is the case for many digital humanities projects. These sustainability questions speak to the long-term lifecycle of the project, and our primary goal has always been to ensure the project’s survival and continued efforts once the grant cycle has ended.
As such, the Archives Unleashed team has developed tools and platforms with sustainability in mind, specifically by adopting widely used and stable programming languages and best practices. We’ve also been committed to ensuring all our platforms and tools are developed in the spirit of open access and are available in public GitHub repositories.
One overarching question remained as our project entered its final stages in the Spring of 2020: how will the Toolkit live on? Three years of development and use cases demonstrated not only the need for and adoption of the tools created under the Archives Unleashed Project, but also the fact that no simplified processes currently exist to adequately replace them.
Where we are headed (2020-2023)
Our team was awarded a second grant from The Andrew W. Mellon Foundation, which started in 2020 and will secure the future of Archives Unleashed. The goal of this second phase is the integration of the Cloud with Archive-It, so that the tool can succeed in a sustainable, long-term environment. The collaboration between Archives Unleashed and Archive-It also aims to continue to widen and enhance the accessibility and usability of web archives.
Priorities of the Project
First, we will merge the Archives Unleashed analytical tools with the Internet Archive’s Archive-It service to provide an end-to-end process for collecting and studying web archives. This will be completed in three stages:
Build. Our team will be setting up the physical infrastructure and computing environment needed to kick start the project. We will be purchasing dedicated infrastructure with the Internet Archive.
Integrate. Here we will be migrating the back end of the Archives Unleashed Cloud to Archive-It and paying attention to how the Cloud can scale to work within its new infrastructure. This stage will also see the development of a new user interface that will provide a basic set of derivatives to users.
Enhance. The team will incorporate consultation with users to develop an expanded and enhanced set of derivatives and implement new features.
Secondly, we will engage the community by facilitating opportunities to support web archives research and scholarly outputs. Building on our earlier successful datathons, we will be launching the Archives Unleashed Cohort program to engage with and support web archives research. The Cohorts will see research teams participate in year-long intensive collaborations and receive mentorship from Archives Unleashed with the intention of producing a full-length manuscript.
We’ve made tremendous progress as the close of our first year comes into sight. Our major milestone will be to complete the integration of the Archives Unleashed Cloud/Toolkit into Archive-It. Users will soon see a beta release of the new interface for conducting analysis with their web archive collections, which will let them download over a dozen derivatives for further analysis and access simple in-browser visualizations.
Our team looks forward to the road ahead, and would like to express our appreciation for the support and enthusiasm Archives Unleashed has received!
Farrell, M., McCain, E., Praetzellis, M., Thomas, G., and Walker, P. 2018. Web Archiving in the United States: A 2017 Survey. National Digital Stewardship Alliance Report. DOI 10.17605/OSF.IO/3QH6N
Ruest, N., Lin, J., Milligan, I., and Fritz, S. 2020. The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL ’20). Association for Computing Machinery, New York, NY, USA, 157–166. DOI: https://doi.org/10.1145/3383583.3398513
Lin, J., Milligan, I., Wiebe, J., and Zhou, A. 2017. Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives. J. Comput. Cult. Herit. 10, 4, Article 22 (October 2017), 30 pages. DOI: https://doi.org/10.1145/3097570
One thing I quickly noticed as I read through the guide is that it recommends using OutbackCDX as a backend for PyWb, rather than continuing to rely on “flat file”, sorted CDXs. PyWb does support flat CDXs, as long as they are in the 9- or 11-column format, but a convincing argument is made that using OutbackCDX for resolving URLs is preferable, whether you use PyWb or OpenWayback.
What is OutbackCDX?
OutbackCDX is a tool created by Alex Osborne, Web Archive Technical Lead at the National Library of Australia. It handles the fundamental task of indexing the contents of web archives: mapping URLs to content in WARC files.
A “traditional” CDX file (or set of files) accomplishes this by listing each and every URL, in order, in a simple text file, along with information such as which WARC file they are stored in. This has the benefit of simplicity and can be managed using simple GNU tools, such as sort. Plain CDXs, however, make inefficient use of disk space, and as they get larger they become increasingly difficult to update, because inserting even a small amount of data into the middle of a large file requires rewriting a large part of the file.
OutbackCDX improves on this by using a simple but powerful key-value store, RocksDB. The URLs are the keys and the remaining information from the CDX is the stored value. RocksDB then does the heavy lifting of storing the data efficiently and providing speedy lookups and updates. Notably, OutbackCDX enables updates to the index without any disruption to the service.
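The key-value idea can be sketched in a few lines of Python. Here a plain dict stands in for RocksDB, and the field layout loosely follows the classic CDX columns; the helper name and key shape are illustrative, not OutbackCDX's actual API:

```python
# A dict stands in for RocksDB: the canonicalized URL key (plus
# timestamp) maps to the rest of the CDX line. Real OutbackCDX handles
# canonicalization, sorting, and persistence internally.

def index_cdx_line(store, line):
    """Store one CDX line as key (urlkey, timestamp) -> remaining fields."""
    urlkey, timestamp, rest = line.split(" ", 2)
    store[(urlkey, timestamp)] = rest

store = {}
index_cdx_line(
    store,
    "is,vefsafn)/ 20200101120000 https://vefsafn.is/ text/html 200 SHA1X - - 512 1024 IA-2020.warc.gz",
)

# A lookup is now a single key fetch rather than a scan of a huge flat file.
print(store[("is,vefsafn)/", "20200101120000")])
```

Because the store supports random insertion, adding new captures does not require rewriting the whole index, which is exactly the pain point of large flat CDX files.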
Given all this, transitioning to OutbackCDX for PyWb makes sense. But OutbackCDX also works with OpenWayback. If you aren’t quite ready to move to PyWb, adopting OutbackCDX first can serve as a stepping stone. It offers enough benefits all on its own to be worth it. And, once in place, it is fairly trivial to have it serve as a backend for both OpenWayback and PyWb at the same time.
So, this is what I decided to do. Our web archive, Vefsafn.is, has been running on OpenWayback with a flat file CDX index for a very long time. The index has grown to 4 billion URLs and takes up around 1.4 terabytes of disk space. Time for an upgrade.
Of course, there were a few bumps on that road, but more on that later.
Installing OutbackCDX was entirely trivial. You get the latest release JAR, run it like any standalone Java application and it just works. It takes a few parameters to determine where the index should be, what port it should be on and so forth, but configuration really is minimal.
Unlike OpenWayback, OutbackCDX is not installed into a servlet container like Tomcat, but instead (like Heritrix) comes with its own built-in web server. End users do not need access to this, so it may be advisable to configure it to only be accessible internally.
Building the Index
Once running, you’ll need to feed your existing CDXs into it. OutbackCDX can ingest most commonly used CDX formats – certainly all that PyWb can read. CDX files can simply be “posted” to OutbackCDX using a command line tool like curl.
In our environment, we keep around a gzipped CDX for each (W)ARC file, in addition to the merged, searchable CDX that powered OpenWayback. I initially just wrote a script that looped through the whole batch and posted them, one at a time. I realized, though, that the number of URLs ingested per second was much higher in CDXs that contained a lot of URLs. There is an overhead to each post. On the other hand, you can’t just post your entire mega CDX in one go, as OutbackCDX will run out of memory.
Ultimately, I wrote a script that posted about 5MB of my compressed CDXs at a time. Using it, I was able to add all ~4 billion URLs in our collection to OutbackCDX in about 2 days. I should note that our OutbackCDX is on high-performance SSDs, the same as our regular CDX files have used.
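A sketch of that batching logic, assuming a list of per-WARC CDX files with known sizes (the 5 MB budget matches what worked for me; the actual HTTP POST to OutbackCDX is omitted here):

```python
# Group (path, size_bytes) pairs into batches of roughly `budget` bytes,
# preserving order. Each batch would then be concatenated and POSTed to
# OutbackCDX in one request (e.g. with curl or urllib).

def batch_by_size(files, budget=5 * 1024 * 1024):
    batches, current, used = [], [], 0
    for path, size in files:
        if current and used + size > budget:
            batches.append(current)
            current, used = [], 0
        current.append(path)
        used += size
    if current:
        batches.append(current)
    return batches

files = [("a.cdx.gz", 3_000_000), ("b.cdx.gz", 3_000_000), ("c.cdx.gz", 1_000_000)]
print(batch_by_size(files))  # [['a.cdx.gz'], ['b.cdx.gz', 'c.cdx.gz']]
```

Batching keeps the per-post overhead low without risking the out-of-memory failures that posting one giant CDX in a single request can cause.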
Next up was to configure our OpenWayback instance to use OutbackCDX. This proved easy to do, but turned up some issues with OutbackCDX. First the configuration.
OpenWayback has a module called ‘RemoteResourceIndex’. This can be trivially enabled in the wayback.xml configuration file. Simply replace the existing `resourceIndex` with something like:
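A sketch of the bean definition, based on OutbackCDX's documented OpenWayback integration (the host, port, and index name `myindex` are placeholders for your own instance):

```xml
<bean name="resourceIndex" class="org.archive.wayback.resourceindex.RemoteResourceIndex">
  <property name="searchUrlBase" value="http://localhost:8080/myindex" />
</bean>
```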
And OpenWayback will use OutbackCDX to resolve URLs. Easy as that.
This is, of course, where I started running into those bumps I mentioned earlier. It turns out there were a number of edge cases where OutbackCDX and OpenWayback had different ideas. Luckily, Alex – the aforementioned creator of OutbackCDX – was happy to help resolve them. Thanks again, Alex.
The first issue I encountered was due to the age of some of our ARCs. The date fields had variable precision: rather than all being exactly 14 digits long, some had less precision and were only 10–12 characters long. This was resolved by having OutbackCDX pad those shorter dates with zeros.
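The fix amounts to right-padding the timestamp to the full 14-digit YYYYMMDDHHMMSS form; a minimal Python sketch (the function name is mine, not OutbackCDX's):

```python
def pad_timestamp(ts):
    """Pad a 10-14 digit (W)ARC date with zeros to 14 digits."""
    return ts.ljust(14, "0")

print(pad_timestamp("2004070112"))      # 20040701120000
print(pad_timestamp("20040701120000"))  # already 14 digits, unchanged
```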
I also discovered some inconsistencies in the metadata supplied along with the query results. OpenWayback expected some fields that were either missing or misnamed. These were a little tricky, as they only affected some aspects of OpenWayback, most notably the metadata in the banner inserted at the top of each page. All of this has been resolved.
Lastly, I ran into an issue related not to OpenWayback but to PyWb. It stemmed from the fact that my CDXs are not generated in the 11-column CDX format. The 11-column format includes the compressed size of the WARC record holding the resource. OutbackCDX was recording this value as 0 when absent. Unfortunately, PyWb didn’t like this and would fail to load such resources. Again, Alex helped me resolve this.
OutbackCDX 0.9.1 is now the most recent release, and includes the fixes to all the issues I encountered.
Having gone through all of this, I feel fairly confident that swapping in OutbackCDX to replace a ‘regular’ CDX index for OpenWayback is very doable for most installations. And the benefits are considerable.
The size of the OutbackCDX index on disk ended up being about 270 GB. As noted before, the existing CDX index powering our OpenWayback was 1.4 TB – a reduction of more than 80%. OpenWayback also feels notably snappier after the upgrade, and updates are notably easier.
Next we will be looking at replacing OpenWayback with PyWb. I’ll write more about that later, once we’ve made more progress, but I will say that having it run on the same OutbackCDX proved trivial to accomplish, and we now have a beta website up using PyWb: http://beta.vefsafn.is.
In this blog post I will go into the more technical details of SolrWayback and the new version 4.0 release. The whole frontend GUI was rewritten from scratch to meet the expectations of 2020 web applications, and many new features were implemented in the backend. I recommend reading the frontend blog post first; it has beautiful animated GIFs demonstrating most of the features in SolrWayback.
Live demo of SolrWayback
You can access a live demo of SolrWayback here. Thanks to National Széchényi Library of Hungary for providing the SolrWayback demo site!
Back in 2018…
The open source SolrWayback project was created in 2018 as an alternative to the Netarchive frontend applications existing at that time. At the Royal Danish Library we were already using Blacklight as a search frontend. Blacklight is an all-purpose Solr frontend application and is very easy to configure and install by defining a few properties such as the Solr server URL, fields, and facet fields. But since Blacklight is a generic Solr frontend, it had no special handling of the rich data structure we had in Solr. Also, binary data such as images and videos are not in Solr, so integration with the WARC-file repository can enrich the experience and make playback possible, since Solr has enough information to work as a CDX server as well.
WARC-Indexer. Where the magic happens!
WARC files are indexed into Solr using the WARC-Indexer. The WARC-Indexer reads every WARC record, extracts all kinds of information, and splits this into up to 60 different fields. It uses Tika to parse all the different MIME types that can be encountered in WARC files. Tika extracts the text from HTML, PDF, Excel, Word documents, etc. It also extracts metadata from binary documents if present. The metadata can include created/modified time, title, description, author, etc. For images, it can also extract width/height or EXIF information such as latitude/longitude. The binary data themselves are not stored in Solr, but for every record in the WARC file there is a record in Solr. This also includes empty records such as HTTP 302 (MOVED) responses, with information about the new URL.
WARC-Indexer. Paying the price up front…
Indexing a large number of WARC files requires massive amounts of CPU, but it is easily parallelized as the WARC-Indexer takes a single WARC file as input. To give an idea of the requirements, indexing 700 TB (5.5M WARC files) took 3 months using 280 CPUs. Once the existing collection is indexed, it is easier to keep up with the incremental growth of the collection. So this is the drawback when using SolrWayback on large collections: the WARC files have to be indexed first.
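A back-of-envelope check of those figures, useful for capacity planning (all numbers are taken from the text; the 3 months are approximated as 90 days):

```python
tb_total = 700                 # total collection size in TB
days = 90                      # ~3 months of wall-clock time
cpus = 280                     # CPUs running the WARC-Indexer

gb_per_cpu_per_day = tb_total * 1024 / (days * cpus)
print(round(gb_per_cpu_per_day, 1))  # ~28 GB of WARC data per CPU per day
```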
Solr provides multiple ways of aggregating data, moving common netarchive statistics tasks from slow batch processing to interactive requests. Based on input from researchers, the feature set is continuously expanding with aggregation, visualization and extraction of data.
Due to the amazing performance of Solr, a query is often completed in less than 2 seconds in a collection with 32 billion (32×10⁹) documents, and this includes facets. The search results are not limited to HTML pages where the freetext is found, but include every document that matches the search query. When presenting the results, each document type has a custom display for its MIME type.
HTML results are enriched with thumbnail images from the page as part of the result, images are shown directly, and audio and video files can be played directly from the results list with an in-browser player or downloaded if the browser does not support the format.
Solr. Reaping the benefits from the WARC-indexer
The SolrWayback Java backend offers a lot more than just sending queries to Solr and returning them to the frontend. Methods can aggregate data from multiple Solr queries or directly read WARC entries and return the processed data in a simple format to the frontend. Instead of re-parsing the WARC files, which is a very tedious task, the information can be retrieved from Solr, and the task can be done in seconds or minutes instead of weeks.
Generating a wordcloud image is done by extracting text from 1,000 random HTML pages from the domain and generating a wordcloud from the extracted text.
By extracting the domains that link to a given domain (A) and also extracting the outgoing links from that domain (A), you can build a link graph. Repeating this for newly found domains gives you a two-level local link graph for domain A. Even though this can mean hundreds of separate Solr queries, it is still done in seconds on a large corpus. Clicking a domain will highlight its neighbors in the graph (try the demo: interactive linkgraph).
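The traversal itself is just a depth-2 breadth-first expansion; a sketch assuming the per-domain outlinks have already been fetched from Solr (inlinks would come from a mirrored reverse query):

```python
# `outlinks` maps a domain to the domains it links to, as would be
# returned by one Solr query per domain.

def two_level_graph(outlinks, start):
    """Collect all edges reachable within two hops of `start`."""
    edges, frontier = set(), {start}
    for _ in range(2):                      # two levels
        next_frontier = set()
        for domain in frontier:
            for target in outlinks.get(domain, ()):
                edges.add((domain, target))
                next_frontier.add(target)
        frontier = next_frontier
    return edges

outlinks = {"a.dk": ["b.dk", "c.dk"], "b.dk": ["d.dk"]}
print(sorted(two_level_graph(outlinks, "a.dk")))
```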
Large scale linkgraph
Extraction of massive linkgraphs with up to 500K domains can be done in hours.
The exported link-graph data was rendered in Gephi and made zoomable and interactive using Graph presenter. The link graphs can be exported fast because all links (a href) for each HTML record are extracted and indexed as part of the corresponding Solr document.
Freetext search can be used to find HTML documents. The HTML documents in Solr are already enriched with the image links on each page, without having to parse the HTML again. Instead of showing the HTML pages, SolrWayback collects all the images from the pages and shows them in a Google-like image search result. Under the assumption that the text on an HTML page relates to its images, you can find images that match the query. If you search for “cats” in the HTML pages, the results will most likely show pictures of cats. These pictures could not be found by just searching the image documents themselves if no metadata (or image name) contains “cats”.
CSV Stream export
You can export result sets with millions of documents to a CSV file. Instead of exporting all 60 possible Solr fields for each result, you can pick which fields to export. This CSV export has already been used by several researchers at the Royal Danish Library and gives them the opportunity to use other tools, such as RStudio, to perform analysis on the data. The National Széchényi Library demo site has disabled CSV export in the SolrWayback configuration, so it cannot be tested live.
WARC corpus extraction
Besides CSV export, you can also export a result set to a WARC file. The export will read the WARC entry for each document in the result set, copy the WARC header + HTTP header + payload, and create a new WARC file with all results combined.
Extracting a sub-corpus this easily has already proven to be extremely useful for researchers. Examples include the extraction of a domain for a given date range, or a query restricted to a list of defined domains. This export is a 1-1 mapping from the results in Solr to the entries in the WARC files.
SolrWayback can also perform an extended WARC export which includes all resources (JS/CSS/images) for every HTML page in the export. The extended export ensures that playback will also work for the sub-corpus. Since the exported WARC file can become very large, you can use a WARC splitter tool or just split up the export into smaller batches by adding crawl year/month to the query, etc. The National Széchényi Library demo site has disabled WARC export in the SolrWayback configuration, so it cannot be tested live.
SolrWayback playback engine
SolrWayback has a built-in playback engine, but it is optional: SolrWayback can be configured to use any other playback engine that uses the same URL API for playback, “/server/<date>/<url>”, such as PyWb. It has been a common misunderstanding that SolrWayback forces you to use the SolrWayback playback engine. The demo at the National Széchényi Library has PyWb configured as an alternative playback engine; clicking the icon next to the title of an HTML result will open playback in PyWb instead of SolrWayback.
The SolrWayback playback has been designed to be as authentic as possible, without showing a fixed toolbar at the top of the browser. Only a small overlay is included in the top left corner, which can be removed with a click, so that you see the page as it was harvested. From the playback overlay you can open the calendar and an overview of the resources included by the HTML page, along with their timestamps compared to the main HTML page, similar to the feature provided by the archive.org playback engine.
The URL replacement is done up front and fully resolved to an exact WARC file and offset. An HTML page can have hundreds of different resources, and each of them requires a URL lookup for the version nearest to the crawl time of the HTML page. All resource lookups for a single HTML page are batched as a single Solr query, which improves both performance and scalability.
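The batching trick can be sketched as building a single OR query over all resource URLs of a page. The field name `url_norm` follows SolrWayback's schema naming but should be treated as an assumption here; the real implementation also picks the capture nearest in time to the page:

```python
def batched_url_query(urls):
    """Build one Solr q-string matching any of the page's resource URLs."""
    return " OR ".join('url_norm:"%s"' % u for u in urls)

print(batched_url_query(["http://a.dk/style.css", "http://a.dk/logo.png"]))
# url_norm:"http://a.dk/style.css" OR url_norm:"http://a.dk/logo.png"
```

One round trip instead of one per resource is what keeps playback of resource-heavy pages fast at scale.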
SolrWayback and Scalability
For scalability, it all comes down to the scalability of SolrCloud, which has proven without a doubt to be one of the leading search technologies and is still rapidly improving with each new version. Storing the indexes on SSDs gives a substantial performance boost as well, but can be costly. The Danish Netarchive has 126 Solr services running in a SolrCloud setup.
One of the servers is the master and the only one that receives requests. The Solr master has an empty index but is responsible for gathering the data from the other Solr services. If the master server also had an index, there would be an overhead. 112 of the Solr servers have a 900 GB index with an average of ~300M documents each, while the last 13 servers currently have an empty index, which makes expanding the collection easy without any configuration changes. Even with 32 billion documents, query response times are under 2 seconds. The result query and the facet query are separate, simultaneous calls; the advantage is that the result can be rendered very fast while the facets finish loading later.
For very large results in the billions, the facets can take 10 seconds or more, but such queries are not realistic and the user should narrow the results up front.
Building new shards
Building new shards (collection pieces) is done outside the production environment; a shard is moved onto one of the empty Solr servers when the index reaches ~900 GB. The index is optimized before it is moved, since no more data will be written to it that would undo the optimization. This also gives a small performance improvement in query times. If the indexing were done directly into the production index, it would also impact response times. The separation of the production and building environments has spared us from dealing with complex problems we would otherwise have faced. It also makes speeding up index building trivial: assign more machines/CPUs to the task and create multiple indexes at once.
You cannot keep indexing into the same shard forever, as this would cause other problems. We found the sweet spot at the time to be an index size of ~900 GB, which fit on the 932 GB SSDs that were available to us when the servers were built. The size of the index also determines the memory requirements of each Solr server; we have allocated 8 GB of memory to each. For our large-scale netarchive, we keep track of which WARC files have been indexed using Archon and Arctika.
Archon is the central server with a database; it keeps track of all WARC files, whether they have been indexed, and into which shard number.
Arctika is a small workflow application that starts WARC-Indexer jobs: it queries Archon for the next WARC file to process and reports back when the file has been completed.
SolrWayback – framework
SolrWayback is a single Java web application containing both the VUE frontend and the Java backend. The backend has two REST service interfaces written with JAX-RS. One is responsible for services called by the VUE frontend and the other handles playback logic.
SolrWayback software bundle
SolrWayback comes with an out-of-the-box bundle release. The release contains a Tomcat server with SolrWayback, a Solr server, and a workflow for indexing. All products are preconfigured. All that is required is unzipping the zip file and copying the two property files to your home directory. Add some WARC files yourself and start the indexing job.
By Abbie Grotke, Assistant Head, Digital Content Management Section
(Web Archiving Program), Library of Congress
and the IIPC Chair 2021-2022
Hello IIPC community!
I am thrilled to be the Chair of the IIPC in 2021. I’ve been involved in this organization since the very early days, so much so that somewhere buried in my folders in my office (which I have not been in for almost a year), are meeting notes from the very first discussions that led to the IIPC being formed back in 2003. Involvement in IIPC has been incredibly rewarding personally, and for our institution and all of our team members who have had the chance to interact with the community through working groups, projects, events, and informal discussions.
This year brings changes, challenges and opportunities for our community. Particularly during a time when many of us are isolated and working from home, both documenting web content about the pandemic and living it at the same time, connections to my friends and colleagues around the world seem more important than ever.
Here are a few key things to highlight for the coming year:
A Big Year for Organisation, Governance, and Strategic Planning Change
As a result of the fine work of the Strategic Direction Group led by Hansueli Locher of the Swiss National Library, the IIPC has a new Consortium Agreement for 2021-2025! This document is renewed every 4-5 years, and this time some key changes were made to strengthen our ability to manage the Consortium more efficiently and to reflect the organisational changes that have taken place since 2016. Feedback from IIPC members was used to create the new agreement, and you’ll notice a slight update of objectives, which now acknowledge the importance of collaborative collections and research. Many thanks to the Strategic Direction Group (Emmanuelle Bermès of the BnF, Steve Knight of the National Library of New Zealand, Hansueli Locher, Alex Thurman of the Columbia University Libraries, and the IIPC Programme and Communications Officer) for their work on this and their continued engagement.
Executive Board and the Steering Committee’s terms
The new agreement establishes a new Executive Board composed of the Chair, the Vice-Chair, the Treasurer and our new senior staff member, as well as additional members of the SC appointed as needed. While the Steering Committee is responsible for setting out the strategic direction for our organisation for the next 5 years, one of our key tasks for this year is to convert it into an Action Plan.
The new Consortium Agreement aligns the terms of the Steering Committee members and the Executive Board. What it means in practice is that the SC members’ 3-year term will start on January 1 and not June 1. We will open a call for nominations to serve on the SC during our next General Assembly, but if you are interested in nominating your institution, you can contact the PCO.
For more information about the responsibilities of the new Executive Board please review section 2.5 of the new Consortium Agreement.
The arrangement for appointing and compensating our Administrative and Financial Host has been formalised in the new agreement. We are excited to collaborate more with the Council on Library and Information Resources (CLIR) this year through this arrangement, particularly in setting up some new staffing arrangements for us. More on this will be announced in the coming months.
One of our big tasks in 2021 will be working on the Strategic Plan. This work is led by the Strategic Direction Group, with input from the Steering Committee, Working Groups, and Portfolio Leads. Since this work is one of our most important activities for the year, Hansueli has joined the Executive Board to ensure close collaboration and support for the initiative.
Missing Your IIPC Colleagues? Join our Virtual Events!
As anyone who has attended an IIPC event in person knows, it is one of the best parts about being a member. In my case, interacting with colleagues from around the world who have similar challenges, experiences, and new and exciting insights has been great for my own professional growth, and has only helped the Library of Congress web archiving program be more successful. While it’s sad that we cannot travel and meet in person together right now, there are opportunities to continue to connect virtually and to engage others in our institutions who may not have been able to travel to the in-person meetings. We’re already working on developing a more robust calendar of events for members (and some that will be more widely open to non-members).
Beyond the GA and WAC, following the success of the well-attended webinars and member calls in 2020, we will continue to deliver these over the course of the year. We are also working on additional training events and continuing report-outs on technical projects and member updates. Stay tuned for more soon and check our events page for updates!
We are also working with Webrecorder on pywb transition support for members. The migration guide, prepared with input from IIPC members, is already available, and work continues on the next stages of the project. Look for more updates on these projects through our events and blog posts throughout the year. There will also be an opportunity in 2021 for more projects to be funded, so we encourage members to start thinking about other projects that could use support and that would benefit the community.