IIPC Steering Committee Election 2021: nomination statements

The Steering Committee, composed of no more than fifteen Member Institutions, provides oversight of the Consortium and defines and oversees its strategy. This year five seats are up for election/re-election. In response to the call for nominations to serve on the IIPC Steering Committee for a three-year term commencing 1 January 2022, six IIPC member organisations have put themselves forward:

The election will be held from 15 September to 15 October. The IIPC designated representatives from all member organisations will receive an email with instructions on how to vote. Each member will be asked to cast five votes. The representatives should ensure that they read all the nomination statements before casting their votes. The results of the vote will be announced on the Netpreserve blog and Members mailing list on 18 October. The first Steering Committee meeting will be held online.

If you have any questions, please contact the IIPC Senior Program Officer.


Nomination statements in alphabetical order:

Bibliothèque nationale de France / National Library of France

The National Library of France (BnF) started its web archiving programme in the early 2000s and now holds an archive of nearly 1.5 petabytes. We develop national strategies for the growth and outreach of web archives and host several academic projects in our Data Lab. We use and share expertise about key tools for IIPC members (Heritrix 3, OpenWayback, NetarchiveSuite, webarchive-discovery) and contribute to the development of several of them. We have developed BCweb, an open source application for seed selection and curation, also shared with other national libraries in Europe.

The BnF has been involved in IIPC since its very beginning and remains committed to the development of a strong community, not only in order to sustain these open source tools but also to share experiences and practices. We have attended, and frequently actively contributed to, general assembly meetings, workshops and hackathons, and most IIPC working groups.

The BnF chaired the consortium in 2016-2017 and currently leads the Membership Engagement Portfolio. Our participation in the Steering Committee, if continued, will be focused as ever on making web archiving a thriving community, engaging researchers in the study of web archives and further developing access strategies.

The British Library

The British Library is an IIPC founding member and has enjoyed active engagement with the work of the IIPC. This has included leading technical workshops and hackathons; helping to co-ordinate and lead member calls and other resources for tools development; co-chairing the Collection Development Group; hosting the Web Archive Conference in 2017; and participating in the development of training materials. In 2020, the British Library, with Dr Tim Sherratt, the National Library of Australia and National Library of New Zealand, led the IIPC Discretionary Funding project to develop Jupyter notebooks for researchers using web archives. The British Library hosted the Programme and Communications Officer for the IIPC up until the end of March this year, and has continued to work closely on strategic direction for the IIPC. If elected, the British Library would continue to work on IIPC strategy, and collaborate on the strategic plan. The British Library benefits a great deal from being part of the IIPC, and places a high value on the continued support, professional engagement, and friendships that have resulted from membership. The nomination for membership of the Steering Committee forms part of the British Library’s ongoing commitment to the international community of web archiving.

Deutsche Nationalbibliothek / German National Library

The German National Library (DNB) has been doing Web archiving since 2012. The legal deposit in Germany includes web sites and all kinds of digital publications like eBooks, eJournals and eTheses. The selective Web archive currently includes about 5,000 sites with 30,000 crawls. It is planned to expand the collection to a larger scale. Crawling, quality assurance, storage and access are done together with a service provider and not with common tools like Heritrix and Wayback Machine.

Digital preservation has always been an important topic for the German National Library. In many international and national projects and co-operations, DNB has worked on concepts and solutions in this area. Nestor, the network of expertise in long-term storage of digital resources in Germany, has its office at the DNB. The Preservation Working Group of the IIPC was co-led for many years by the DNB.
On the IIPC Steering Committee, the German National Library would like to advance the joint preservation of the Web.

Det Kongelige Bibliotek / Royal Library of Denmark

The Royal Danish Library (in charge of the Danish national web archiving program Netarkivet) will bring to the SC of IIPC expertise in web archiving built up since 2001. Netarkivet now holds a collection of more than 800 TB and is active in open source development of web archiving tools such as NetarchiveSuite and SolrWayback. The representative from RDL will bring IIPC a lot of experience from working with web archiving for more than 20 years. RDL will bring both technical and strategic competences to the SC, as well as skills in financial management, budgeting and project portfolio management. The Royal Danish Library was among the founding members of IIPC, served on the SC of IIPC for a number of years, and is now ready for another term.

Koninklijke Bibliotheek / National Library of the Netherlands

As the National Library of the Netherlands (KBNL), our work is fueled by the power of the written word. It preserves stories, essays and ideas, both printed and digital. When people come into contact with these words, whether through reading, studying or conducting research, it has an impact on their lives. With this perspective in mind we find it of vital importance to preserve web content for future generations.

We believe the IIPC is an important network organization which brings together ideas, knowledge and best practices on how to preserve the web and retain access to its information in all its diversity. In the past years, KBNL has used its voice in the SC to raise awareness of the sustainability of tools (as we do by improving the Webcurator tool) and to point out the importance of quality assurance, and it co-organized the WAC 2021. Furthermore, we shared our insights and expertise on preservation in webinars and workshops. More recently, we have started taking part in the Partnerships & Outreach Portfolio.

We would like to continue this work and bring together more organizations, large and small across the world, to learn from each other and ensure web content remains findable, accessible and re-usable for generations to come.

The National Archives (UK)

The National Archives (UK) is an extremely active web archiving practitioner and runs two open access web archive services – the UK Government Web Archive (UKGWA), which also includes an extensive social media archive, and the EU Exit Web Archive (EEWA). While our scope is limited to information produced by the government of the UK, we have nonetheless built up our collections to over 200TB.

Our team has grown in capacity over the years and we are now increasingly becoming involved in research initiatives that will be relevant to the IIPC’s strategic interests.

With over 35 years’ collective team experience in the field, through building and running one of the largest and most used open access web archives in the world, we believe that we can provide valuable experience and we are extremely keen to actively contribute to the objectives of the IIPC through membership of the Steering Committee.

 

PYWB 2.6

By Ilya Kreymer, Webrecorder.net

After several betas and months of development, I’m excited to announce the release of pywb 2.6!

This release, supported in large part by the IIPC (International Internet Preservation Consortium), includes several new features and documentation as well as many replay fidelity improvements and optimizations.

The main new features of the release include improvements to the access control system and localization/multi-language support. The access control system has been expanded with a flexible date-range based embargo, allowing for automated exclusions of newer or older content. The release also includes the ability to configure pywb for different user access levels, when running pywb behind an Nginx or Apache server. For more details on these features, see the Access Control Guide and Deployment Guide.
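To make the idea of a date-range embargo concrete, here is a minimal sketch in Python. It is not pywb's implementation or configuration syntax (see the Access Control Guide for that); the function name `is_embargoed` and its `newer_than` / `older_than` parameters are hypothetical and chosen only for illustration. The sketch assumes an embargo is expressed as an age threshold relative to the current time.

```python
from datetime import datetime, timedelta, timezone


def is_embargoed(capture_time, now, newer_than=None, older_than=None):
    """Return True if a capture should be excluded by the embargo.

    newer_than: exclude captures younger than this timedelta (hypothetical option)
    older_than: exclude captures older than this timedelta (hypothetical option)
    """
    age = now - capture_time
    if newer_than is not None and age < newer_than:
        return True
    if older_than is not None and age > older_than:
        return True
    return False


# Example: hide everything captured within the last year.
now = datetime(2021, 8, 1, tzinfo=timezone.utc)
recent = datetime(2021, 7, 1, tzinfo=timezone.utc)
old = datetime(2015, 1, 1, tzinfo=timezone.utc)
one_year = timedelta(days=365)

print(is_embargoed(recent, now, newer_than=one_year))  # True
print(is_embargoed(old, now, newer_than=one_year))     # False
```

Because the check depends only on the capture date and the current time, content leaves (or enters) the embargo automatically as time passes, with no need to maintain a list of excluded URLs by hand.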

With this release, pywb also includes support for running in different languages and configuring the main UI to support switching between different languages. All text used is automatically populated into CSV files and imported back. For more details, see the Localization / Multi-Language Guide section of the documentation.

A complete list of changes is also available in the pywb Changelist on GitHub.

This work is a follow-up to the first package of work supported by the IIPC, which resulted in the creation of a transition guide for users of OpenWayback. Webrecorder wishes to thank the IIPC for their support of pywb development.

The next release of pywb, corresponding to the final batch of work sponsored by IIPC in this round, will include several improvements to the pywb user-interface and navigation.

For more discussion on this work, Webrecorder will be participating in an IIPC-hosted webinar on Tuesday, August 31st, 2021.

IIPC Steering Committee Election 2021: Call for nominations

The nomination process for IIPC Steering Committee is now open.

The Steering Committee (SC) is composed of no more than fifteen Member Institutions who provide oversight of the Consortium and define and oversee action on its strategy. This year five seats are up for election. 

What is at stake?

Serving on the Steering Committee is an opportunity for motivated members to help guide the IIPC’s mission of improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. Steering Committee members are expected to take an active role in leadership, contribute to SC and Portfolio activities, and help guide and administer the organisation.

Who can run for election?

Participation in the SC is open to any IIPC member in good standing. We strongly encourage any organisation interested in serving on the SC to nominate themselves for election. The SC members meet in person (if circumstances allow) at least once a year. Face-to-face meetings are supplemented by two teleconferences plus additional ones as required.

Please note that the nomination should be on behalf of an organisation, not an individual. Once elected, the member organisation designates a representative to serve on the Steering Committee. The list of current SC member organisations is available on the IIPC website.

How to run for election?

All nominee institutions, both new and existing members whose term is expiring but are interested in continuing to serve, are asked to write a short statement (max 200 words) outlining their vision for how they would contribute to IIPC via serving on the Steering Committee. Statements can point to past contributions to the IIPC or the SC, relevant experience or expertise, new ideas for advancing the organisation, or any other relevant information.

All statements will be posted online and emailed to members prior to the election with ample time for review by all membership. The results will be announced in October and the three-year term on the Steering Committee will start on 1 January.

Below you will find the election calendar. We are very much looking forward to receiving your nominations. If you have any questions, please contact the IIPC Senior Program Officer (SPO).


Election Calendar

14 June – 14 September 2021: Nomination period. IIPC Designated Representatives are invited to nominate their organisation by sending an email including a statement of up to 200 words to the IIPC SPO.

15 September 2021: Nominees’ statements are published on the Netpreserve Blog and Members mailing list. Nominees are encouraged to campaign through their own networks.

15 September – 15 October 2021: Members are invited to vote online. Each organisation votes only once for all nominated seats. The vote is cast by the Designated Representative.

18 October 2021: The results of the vote are announced on the Netpreserve blog and Members mailing list.

1 January 2022: The newly elected SC members start their term.

The 2021 Web Archiving Conference in Luxembourg

By Ben Els, Digital Curator at the National Library of Luxembourg and Chair of the 2021 General Assembly (GA) and Web Archiving Conference (WAC) Organising Committee

In collaboration with the IIPC, the National Library of Luxembourg (BnL) had the honour of hosting the 2021 edition of the Web Archiving Conference. As a virtual event, this conference brought together experts and researchers from 39 countries, to present and evaluate the latest developments in the world of web archiving. The last edition of this conference took place in 2019 in Zagreb and since many web archiving institutions haven’t had the opportunity for a local exchange with fellow practitioners, this international meeting plays a vital role in the concerted efforts in Internet preservation.

From in-person to virtual

The preparations for this year’s conference started in June 2019 and, initially, the 2021 conference was meant to take place in our new library building. As you can imagine, the ups and downs of the past 1 ½ years had a significant influence on the preparations for this conference: Will people be able to travel? What will the safety measures be? Should we plan for a hybrid solution? Eventually, we decided that a hybrid solution would hardly be feasible from an organisational standpoint and would likely also disadvantage participants attending the live event. When the decision was made to go for an online event, we faced another set of questions: how to combine the advantages of a virtual meeting with the indispensable aspects of a physical conference? In other words, what are the most valuable experiences that people would like to take away from a real-life meeting?

Our conclusion was to aim our efforts at enabling lively discussions, to focus on Q&A sessions and networking, which would normally happen during coffee breaks or social events. We spent several months researching and testing different video-conference and virtual event platforms. Finally, we decided to abandon the idea of 8-10 hour Zoom calls and moved to a different format, using the relatively new platform Remo. We also asked the speakers and panelists to make their presentations available ahead of the conference as pre-recorded videos. This way, participants were able to watch the videos they were interested in beforehand, so that during the conference, we could jump into the Q&A part right away. This format allowed for a more lively experience, with more engaging discussions. This was illustrated by the fact that many participants stayed with the event from 08:00 in the morning until midnight!

Customising the online experience 

As Olga explained during the General Assembly: “online doesn’t mean less work”. We realised that Remo is not a perfect platform and that there was an adaptation phase for first-time users. Therefore, a lot of work went into organising training events during the three weeks before the conference. We made sure that all speakers, session chairs and panelists took part in at least one of these familiarisation sessions, which helped get a lot of technical and organisational questions out of the way ahead of time. Moreover, we had a number of volunteers on board, making sure that the program of the conference would run smoothly and all technical difficulties could be dealt with as quickly as possible. The team, composed of five core members of the organising committee and the conference “super elves”, operated like a well-oiled machine – and did so over the course of three days from 08:00 in the morning until midnight.

Online and all time zones

The second challenge of a virtual event is the differences in time zones, when people want to follow the discussions from their homes on the other side of the world. For this reason, we arranged the conference schedule in a way that would allow participants from all time zones to follow at least 6 hours of program during their normal working hours. This inclusive approach has proven to be successful, by surpassing the previous records for registrations and attendance. It is safe to say that the 2021 edition of the Web Archiving Conference has reached more people than ever before.

Virtually in Luxembourg  

The third challenge in an online event: how to highlight the character of the hosting institution, since the conference venue doesn’t really matter on the Internet? In collaboration with our sponsors and partners, including the National Research Fund and the Luxembourg – Let’s make it happen initiative, we tried to represent the BnL and the country of Luxembourg in a virtual space. On the customised floorplan in Remo, we highlighted our partners and included hints at cultural, historical and culinary Luxembourg landmarks. If you would like to learn more about the Emoxie icons and their stories, we invite you to a virtual visit to Luxembourg. Please don’t forget to stop by at the National Library! 

Lessons learnt and raising awareness

Before WAC 2021, the BnL didn’t have a lot of experience with hosting larger conferences and even less experience with online events. Although the time commitment should not be underestimated, the whole process was at the same time an incredibly valuable learning experience (not to mention how much fun we had during preparation calls and all throughout the conference). Hosting the Web Archiving Conference has also pushed the BnL to get to know all parts of the inner workings of the IIPC and to get in contact with many member institutions. Locally, we were able to draw attention to the role of the National Library as a frontrunner in digital preservation in Luxembourg (Mois des archives: Web Archive & Mir brauchen dréngend e Bachelor an den Informatiounswëssenschaften). We were also able to organise a shared panel with the University of Luxembourg, to highlight local efforts in documenting the Covid-19 pandemic.


From the incredibly generous feedback, we also learned that the attention to detail and thoughtful planning have not gone unnoticed by the participants. For that part, the BnL can only accept a fraction of the praise: without Olga’s and Robin’s tireless commitment and expertise, we never could have reached the goals that were set up at the beginning. Therefore, next year’s hosts should be reassured to have both of them on board and set the bar for 2022 even higher.

Launching LinkGate

By Youssef Eldakar of Bibliotheca Alexandrina

We are pleased to invite the web archiving community to visit LinkGate at linkgate.bibalex.org.

LinkGate is a scalable web archive graph visualization environment. The project was launched with funding by the IIPC in January 2020. During the term of this round of funding, Bibliotheca Alexandrina (BA) and the National Library of New Zealand (NLNZ) partnered to develop core functionality for a scalable graph visualization solution geared towards web archiving and to compile an inventory of research use cases to guide future development of LinkGate.

What does LinkGate do?

LinkGate seeks to address the need to visualize data stored in a web archive. Fundamentally, the web is a graph, where nodes are webpages and other web resources, and edges are the hyperlinks that connect web resources together. A web archive introduces the time dimension to this pool of data and makes the graph a temporal graph, where each node has multiple versions according to the time of capture. Because the web is big, web archive graph data is big data, and scalability of a visualization solution is a key concern.
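The temporal graph described above can be sketched as a small in-memory data structure. This is an illustration only, not LinkGate's actual data model or API: the class `TemporalGraph` and its methods are hypothetical, and the sketch assumes captures are identified by 14-digit timestamps (YYYYMMDDhhmmss), which sort correctly as plain strings.

```python
from bisect import bisect_right
from collections import defaultdict


class TemporalGraph:
    """Toy temporal graph: each node (URL) has one version per capture time."""

    def __init__(self):
        # url -> list of (timestamp, outlinks) kept sorted by timestamp
        self._versions = defaultdict(list)

    def add_capture(self, url, timestamp, outlinks):
        """Record one captured version of a node and its outgoing edges."""
        versions = self._versions[url]
        versions.append((timestamp, tuple(outlinks)))
        versions.sort(key=lambda version: version[0])

    def version_at(self, url, timestamp):
        """Return the latest capture at or before timestamp, or None."""
        versions = self._versions.get(url, [])
        times = [t for t, _ in versions]
        index = bisect_right(times, timestamp)
        return versions[index - 1] if index else None

    def outlinks_at(self, url, timestamp):
        """Return the edges leaving url as of the given time."""
        version = self.version_at(url, timestamp)
        return list(version[1]) if version else []


graph = TemporalGraph()
graph.add_capture("http://example.org/", "20190101000000",
                  ["http://example.org/a"])
graph.add_capture("http://example.org/", "20210101000000",
                  ["http://example.org/a", "http://example.org/b"])

print(graph.outlinks_at("http://example.org/", "20200601000000"))
# ['http://example.org/a']
```

Asking for the graph "as of" a point in time selects one version per node, which is what makes the same URL show different neighbours at different dates. A production system would keep this data in a scalable store rather than in memory.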

APIs and use cases

We developed a scalable graph data service that exposes temporal graph data via an API, a data collection tool for feeding interlinking data extracted from web archive data files into the data service, and a web-based frontend for visualizing web archive graph data streamed by the data service. Because this project was first conceived to fulfill a research need, we reached out to the web archive community and interviewed researchers to identify use cases to guide development beyond core functionality. Source code for the three software components, link-serv, link-indexer, and link-viz, respectively, as well as the use cases, are openly available on GitHub.

Using LinkGate

An instance of LinkGate is deployed on Bibliotheca Alexandrina’s infrastructure and accessible at linkgate.bibalex.org. Insertion of data into the backend data service is ongoing. The following are a few screenshots of the frontend:

  • Graph with nodes colorized by domain
  • Nodes being zoomed in
  • Settings dialog for customizing graph
  • Showing properties for a selected node
  • PathFinder for finding routes between any two nodes

Please see the project’s IIPC Discretionary Funding Program (DFP) 2020 final report for additional details.

We will be presenting the project at the upcoming IIPC Web Archiving Conference on Tuesday, 15 June 2021 and will also share the results of our work at a Research Speakers Series webinar on 28 July. If you have any questions or feedback, please contact the LinkGate Team at linkgate[at]iipc.simplelists.com.

Next steps

This development phase of Project LinkGate has been for the core functionality of a scalable, modular graph visualization environment for web archive data. Our team shares a common passion for this work and we remain committed to continuing to build up the components, including:

  • Improved scalability
  • Design and development of the plugin API to support the implementation of add-on finders and vizors (graph exploration tools)
  • Enriched metadata
  • Integration of alternative data stores (e.g., the Solr index in SolrWayback, so that data may be served by link-serv to visualize in link-viz or Gephi)
  • Improved implementation of the software in general.

BA intends to maintain and expand the deployment at linkgate.bibalex.org on a long-term basis.

Acknowledgements

The LinkGate team is grateful to the IIPC for providing the funding to get the project started and develop the core functionality. The team is passionate about this work and is eager to carry on with development.

LinkGate Team

  • Lana Alsabbagh, NLNZ, Research Use Cases
  • Youssef Eldakar, BA, Project Coordination
  • Mohammed Elfarargy, BA, Link Visualizer (link-viz) & Development Coordination
  • Mohamed Elsayed, BA, Link Indexer (link-indexer)
  • Andrea Goethals, NLNZ, Project Coordination
  • Amr Morad, BA, Link Service (link-serv)
  • Ben O’Brien, NLNZ, Research Use Cases
  • Amr Rizq, BA, Link Visualizer (link-viz)

Additional Thanks

  • Tasneem Allam, BA, link-viz development
  • Suzan Attia, BA, UI design
  • Dalia Elbadry, BA, UI design
  • Nada Eliba, BA, link-serv development
  • Mirona Gamil, BA, link-serv development
  • Olga Holownia, IIPC, project support
  • Andy Jackson, British Library, technical advice
  • Amged Magdey, BA, logo design
  • Liquaa Mahmoud, BA, logo design
  • Alex Osborne, National Library of Australia, technical advice

We would also like to thank the researchers who agreed to be interviewed for our Inventory of Use Cases.



CLIR Becomes Administrative Home for the IIPC

Council on Library and Information Resources (CLIR) has become the administrative home of the IIPC with the move of its senior program officer, Olga Holownia, to the CLIR staff.

Based in the Washington, DC, area, CLIR forges strategies to enhance research, teaching, and learning environments in collaboration with libraries, cultural institutions, and communities of higher learning. CLIR has a number of international affiliates, including IIIF and NDSA. These affiliations give organizations opportunities to engage meaningfully with new constituencies, and to work together toward integrating services, tools, platforms, research, and expertise across organizations in ways that will reduce costs, create greater efficiencies, and better serve our collective constituencies.

In 2017, IIPC became a CLIR Affiliate and CLIR agreed to serve as the organization’s fiscal agent, but IIPC staff were hosted by the British Library until Holownia’s move to CLIR. IIPC will remain independent, and its Steering Committee and Executive Board will continue to be responsible for setting the strategy, overseeing membership, tools development and outreach, as well as the Consortium’s key events.

“We warmly welcome Olga Holownia to the staff,” said CLIR president Charles Henry. “IIPC’s work is closely aligned with CLIR’s mission, and her presence will open new opportunities to enrich the work of both organizations.”

“We are thrilled that Olga has accepted the role of senior program officer with CLIR, after performing in a program officer role for many years through her position with the British Library,” said Abbie Grotke, IIPC Chair. “With CLIR now hosting this role in addition to other administrative host activities, the IIPC is well suited to serve its members and the broader web archiving community in the future.”


The IIPC community is encouraged to mark their calendars for CLIR’s Digital Library Federation (DLF) 2021 Forum, November 1-3. The annual Forum is a meeting place, marketplace, and congress for digital library practitioners, featuring panels, individual presentations, lightning talks, and birds of a feather sessions. The Forum program will be announced in late August, when registration opens. The 2021 Forum will be virtual and free of charge, as will its two affiliated events: Digital Preservation 2021, the annual conference of the National Digital Stewardship Alliance (NDSA) on November 4; and Learn@DLF, a workshop series, November 8-10.

IIPC Chair Address

By Abbie Grotke, Assistant Head, Digital Content Management Section
(Web Archiving Program), Library of Congress
and the IIPC Chair 2021-2022


Hello IIPC community!

I am thrilled to be the Chair of the IIPC in 2021. I’ve been involved in this organization since the very early days, so much so that somewhere buried in my folders in my office (which I have not been in for almost a year), are meeting notes from the very first discussions that led to the IIPC being formed back in 2003. Involvement in IIPC has been incredibly rewarding personally, and for our institution and all of our team members who have had the chance to interact with the community through working groups, projects, events, and informal discussions.

This year brings changes, challenges and opportunities for our community. Particularly during a time when many of us are isolated and working from home, both documenting web content about the pandemic and living it at the same time, connections to my friends and colleagues around the world seem more important than ever.

Here are a few key things to highlight for the coming year:

A Big Year for Organisation, Governance, and Strategic Planning Change

As a result of the fine work of the Strategic Direction Group led by Hansueli Locher of the Swiss National Library, the IIPC has a new Consortium Agreement for 2021-2025! This document is renewed every 4-5 years, and this time some key changes were made to strengthen our ability to manage the Consortium more efficiently and to reflect the organisational changes that have taken place since 2016. Feedback from IIPC members was used to create the new agreement, and you’ll notice a slight update of objectives, which now acknowledge the importance of collaborative collections and research. Many thanks to the Strategic Direction Group (Emmanuelle Bermès of the BnF, Steve Knight of the National Library of New Zealand, Hansueli Locher, Alex Thurman of the Columbia University Libraries, and the IIPC Programme and Communications Officer) for their work on this and continued engagement.

Executive Board and the Steering Committee’s terms

The new agreement establishes a new Executive Board composed of the Chair, the Vice-Chair, the Treasurer and our new senior staff member, as well as additional members of the SC appointed as needed. While the Steering Committee is responsible for setting out the strategic direction for our organisation for the next 5 years, one of our key tasks for this year is to convert it into an Action Plan.

The new Consortium Agreement aligns the terms of the Steering Committee members and the Executive Board. What it means in practice is that the SC members’ 3-year term will start on January 1 and not June 1. We will open a call for nominations to serve on the SC during our next General Assembly, but if you are interested in nominating your institution, you can contact the PCO.

For more information about the responsibilities of the new Executive Board please review section 2.5 of the new Consortium Agreement.

Administrative Host

Our ability to have and compensate our Administrative and Financial Host has been formalized in the new agreement. We are excited to collaborate more with the Council on Library and Information Resources (CLIR) this year through this arrangement, particularly in setting up some new staffing arrangements for us. More on this will be announced in the coming months.

Strategic Plan

One of our big tasks in 2021 will be working on the Strategic Plan. This work is led by the Strategic Direction Group, with inputs from the Steering Committee, Working Groups, and Portfolio Leads. Since this work is one of our important activities for the year, Hansueli has joined the Executive Board to ensure close collaboration and support for the initiative.

Missing Your IIPC Colleagues? Join our Virtual Events!

A blast from the past: the IIPC General Assembly at the Library of Congress, May 2012.
From the left: Kristinn Sigurðsson (IIPC Vice-Chair, National and University Library of Iceland), Gildas Illien (BnF), and Abbie.

As anyone who has attended an IIPC event in person knows, it is one of the best parts about being a member. In my case, interacting with colleagues from around the world who have similar challenges, experiences, and new and exciting insights has been great for my own professional growth, and has only helped the Library of Congress web archiving program be more successful. While it’s sad that we cannot travel and meet in person together right now, there are opportunities to continue to connect virtually and to engage others in our institutions who may not have been able to travel to the in-person meetings. We’re already working on developing a more robust calendar of events for members (and some that will be more widely open to non-members).

As you’re aware, our big events, the General Assembly (June 14) and the Web Archiving Conference (June 15-16), have been moved online as part of Web Archiving Week (virtually from Luxembourg). Many thanks to the National Library of Luxembourg for offering to host the online event!

Beyond the GA and WAC, due to the success of the well-received and well-attended webinars and calls with members in 2020, we will continue to deliver those over the course of the year. We are also working on additional training events and continuing report-outs of technical projects and member updates. Stay tuned for more soon and check our events page for updates!

Working Groups and funded projects

The IIPC continues to work collaboratively in 2021 on a number of initiatives through our Working Groups, including our transnational collections (the Covid-19 collection continues in 2021), training materials, and activities focusing on research use of the web archives. 2021 also brings exciting funded project news, thanks to the continuation of DFP, a funding programme launched in June 2019 and led by three former IIPC Chairs: Emmanuelle Bermès, Jefferson Bailey (Internet Archive), and Mark Phillips (University of North Texas Libraries). In 2020 the Jupyter Notebooks project led by Andy Jackson of the British Library and created by Tim Sherratt was successfully completed and won the British Library Labs award. This year, we are launching Developing Bloom Filters for Web Archives’ Holdings (a collaboration between Los Alamos National Laboratory (LANL) & National and University Library in Zagreb) and Improving the Dark and Stormy Archives Framework by Summarizing the Collections of the National Library of Australia (a collaboration between Old Dominion University, National Library of Australia and LANL), and continuing LinkGate: Core Functionality and Future Use Cases (Bibliotheca Alexandrina & National Library of New Zealand). We also hope to hold the Archives Unleashed datathon, led by the BnF in partnership with KBR / Royal Library of Belgium and the National Library of Luxembourg, later in 2021.

We are also working with Webrecorder on the pywb transition support for members. The migration guide, with inputs from the IIPC Members, is already available and the work continues on the next stages of the project. Look for more updates on these projects through our events and blog posts throughout the year. There will also be an opportunity in 2021 for more projects to be funded, so we encourage members to start thinking about other projects that could use support and that would benefit the community.

Lastly, I want to remind you to continue to follow our activities on the IIPC website and Twitter (do tweet on #WebArchiveWednesday!). To subscribe to our mailing list, send an email to communications@iipc.simplelists.com.

I look forward to working with you all more closely this year. Please feel free to reach out to me if you have any questions or concerns during my time as Chair.

Happy Web Archiving to you all!

Abbie Grotke

Assistant Head, Digital Content Management Section (Web Archiving Program), Library of Congress

IIPC Chair 2021-2022

WCT 3.0 Release

By Ben O’Brien, Web Archive Technical Lead, National Library of New Zealand

Let’s rewind 15 years, back to 2006. The Nintendo Wii is released, Google has just bought YouTube, Facebook switches to open registration, Italy has won the FIFA World Cup, and Borat is shocking cinema screens across the globe.

Java 6, Spring 1.2, Hibernate 3.1, Struts 1.2, Acegi-security are some of the technologies we’re using to deliver open source enterprise web applications. One application in particular, the Web Curator Tool (WCT) is starting its journey into the wide world of web archiving. WCT is an open source tool for managing the selective web harvesting process.

2018 Relaunch

Fast forward to 2018, and these technologies themselves belong inside an archive. Instead they were still being used by the WCT to collect content for web archives. Twelve years is a long time in the world of the Internet and IT, so needless to say a fair amount of technical debt had caught up with the WCT and its users.

The collaborative development of the WCT between the National Library of the Netherlands and the National Library of New Zealand was full steam ahead after the release of the long-awaited Heritrix 3 integration in November 2018. With new features in mind, we knew we needed a modern, stable foundation within the WCT if we were to take it forward. Cue the Technical Uplift.

WCT 3.0

What followed was two years of development by teams in opposing time zones, battling resourcing, lockdowns and endless regression testing. Now at the beginning of 2021, we can at last announce the release of version 3.0 of the WCT.

While some of the names in the technology stack are the same (Java/Spring/Hibernate), the upgrade of these languages and frameworks represents a big milestone for the WCT. A launchpad to tackle the challenges of the next decade of web archiving!

For more information, see our recent blog post on webcuratortool.org. And check out a demo of v3.0 inside our virtual box image here.

WCT Team:

KB-NL

Jeffrey van der Hoeven
Sophie Ham
Trienka Rohrbach
Hanna Koppelaar

NLNZ

Ben O’Brien
Andrea Goethals
Steve Knight
Frank Lee
Charmaine Fajardo

Further reading on WCT:

WCT tutorial on IIPC
Documentation on WCT
WCT on GitHub
WCT on Slack
WCT on Twitter
Recent blogpost on WCT with links to old documentation

Quality Assurance in Web Archives: How to Automate Your Work with Command Line

By Kourosh Hassan Feissali, Web Archivist at the National Archives, UK

Both ‘automation’ and ‘command line’ can sound daunting to non-programmers. But in this post I’m going to describe how easy it is for non-programmers to use baby steps to automate time consuming tasks, or parts of a big task.

Why Use Command Line?

I admit that I’m new to command line myself but the more I use it the more I am amazed by its efficiency and by all the things it can do. Here are some of the reasons for using command line:

  • You don’t need to be a programmer to write commands. Once you learn the basics you can just copy and paste commands.
  • A Command-Line Interface (CLI) is pre-installed on your computer for free!
  • You can copy useful commands from the Internet and simply paste them in your CLI.
  • You write a command for a task once and you use it as many times as you want without having to think about the steps involved. This will save a lot of time.
  • You can write one tiny command to automate step 1 of a bigger task with 10 steps. Then, if you want, you can add a second command to automate step 2. Therefore, you don’t have to write an entire programme.
  • Spreadsheet or text editors often struggle with very large files but this is not a problem in a CLI.

Case Study: Quality Assurance of Brexit Sites

At the UK Government Web Archive (UKGWA) we crawled a large number of Brexit-related websites. Due to the nature of the project it was essential to carry out enhanced quality assurance (QA) on these websites. One technique that we used was checking the logs of the web crawler. The problem was that crawl logs can have millions of lines and there are very few applications on the market that can easily handle these huge log files. Further, some of these apps only work on one operating system (OS) but we use multiple OS’s in the team.

To illustrate how simple it is to speed up a multi-stage task I’m going to break down our enhanced QA into smaller tasks here and use some basic commands to drastically speed up the process.

The Steps

  1. Download all the log files.
  2. Merge them into one.
  3. Sort the lines by server response code.
  4. Remove all the URLs where the server response code is 404, begins with 2 or 3, or is a blank space.
  5. Save the remaining URLs that begin with server errors 500, 403, etc. into a new file.
  6. Remove duplicate URLs and save in a new file.
  7. Check the remaining URLs against the live site.
  8. Ignore the ones that are broken on the live site.
  9. Copy the ones that work correctly on the live site and save as a patch-list.
  10. Clean up Downloads folder.

The process above is a little longer than this but I’ve omitted some of the steps for the purposes of this blog post. As you see, we’ve broken down one fairly complex job into 10 simple steps that are easy to understand and easy to tackle on their own. Some of the steps above are quite simple but when you’re dealing with very large files, they can freeze your computer if you use generic applications such as MS Excel. Here, I’ll describe how we can use CLI for some of the above steps.

Step 2: cat *.log >> final.log

Step 4: sed -i "" 's+^404.*++g' sorted.txt; sed -i "" 's+^[2-3].*++g' sorted.txt;

Step 5: cp sorted.txt sorted_errors.txt

Step 6: cat sorted_errors.txt | sort -u > sorted_errors_dedup.txt

Step 10: rm *.log
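
For readers who want to try this on a small scale first, here is a minimal sketch of Steps 2 to 6 run on made-up sample data. It is not the exact script we use at UKGWA: the file names crawl1.log, crawl2.log and merged.txt and the example.gov.uk URLs are hypothetical, and it assumes every log line starts with the server response code followed by a space and the URL. It also swaps the two in-place sed commands for a single grep filter, which writes the kept lines to a new file instead of editing sorted.txt.

```shell
#!/bin/sh
# Sketch of Steps 2-6 on tiny sample data.
# Assumption: each log line is "<response code> <URL>".
set -eu

# Work in a throwaway folder so nothing in Downloads is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Two hypothetical crawl logs standing in for the downloaded files (Step 1).
printf '200 https://example.gov.uk/\n500 https://example.gov.uk/a\n' > crawl1.log
printf '404 https://example.gov.uk/b\n403 https://example.gov.uk/c\n500 https://example.gov.uk/a\n301 https://example.gov.uk/d\n' > crawl2.log

# Step 2: merge all logs. The output name does not end in .log, so running
# the command again cannot feed the merged file back into itself.
cat ./*.log > merged.txt

# Step 3: sort numerically by the first field (the response code).
sort -n -k1,1 merged.txt > sorted.txt

# Steps 4-5: drop lines that start with 404, 2, 3 or a blank (or are empty)
# and save everything else into a new file.
grep -v -E '^(404|[23]|[[:space:]]|$)' sorted.txt > sorted_errors.txt

# Step 6: remove duplicate lines.
sort -u sorted_errors.txt > sorted_errors_dedup.txt

cat sorted_errors_dedup.txt
```

With the sample data above, the final file holds two lines: the 403 entry and a single copy of the 500 entry. Note that grep exits with an error when nothing is left after filtering, so with "set -e" the script stops at that point if a crawl has no such errors at all.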

What’s great about CLI is that you don’t have to learn a whole new language before seeing the result of your work. As you see in Step 2 above, the ‘cat’ command concatenates multiple files into one, no matter how many files you have or how large they are. MS Excel can give you a really hard time for this simple step but this one command concatenates any file that ends with ‘.log’ in your Downloads folder into one file in the blink of an eye. Automating with command line brings a lot of joy to your work life!


More on this topic:
How to automate web archiving quality assurance without a programmer

IIPC – Meet the Officers, 2021

The IIPC is governed by the Steering Committee, formed of representatives from fifteen Member Institutions who are each elected for three-year terms. The Steering Committee designates the Chair, Vice-Chair and the Treasurer of the Consortium. Together with the Programme and Communications Officer based at the British Library, the Officers are responsible for dealing with the day-to-day business of running the IIPC.

The Steering Committee has designated Abbie Grotke of the Library of Congress to serve as Chair and Kristinn Sigurðsson of the National and University Library of Iceland to serve as Vice-Chair in 2021. Sylvain Bélanger of Library and Archives Canada continues in his role as Treasurer. Olga Holownia continues as Programme and Communications Officer, and CLIR (the Council on Library and Information Resources) remains the Consortium’s financial host.

The Officers make up the new Executive Board introduced in the Consortium Agreement 2021-2025. The additional Steering Committee members, who will serve on the Executive Board in 2021, will be named in the coming months.

The Members and the Steering Committee would like to thank Mark Phillips of the University of North Texas Libraries (IIPC Chair, 2020) and Paul Koerbin (IIPC Vice-Chair, 2020) of the National Library of Australia, for their contribution to the day-to-day running of the IIPC.


IIPC CHAIR

Abbie Grotke, IIPC Chair 2021
Photo: Denis Malloy.

Abbie Grotke is Assistant Head, Digital Content Management Section, within the Digital Services Directorate of the Library of Congress, and leads the Web Archiving Team. She joined the Library in 1997 to work on American Memory digitization projects, and since 2002 has been involved in the Library’s web archiving program, which celebrated its 20th anniversary in 2020. In her role, Grotke has helped develop policies, workflows, and tools to collect and preserve web content for the Library’s collections and provides overall program management for web archiving at the Library, managing over 2.3 petabytes of data. The team also supports and trains almost 100 recommending officers across the Library who select content for the archives in a wide range of event and thematic web archive collections. She has been active in a number of collaborative web archive collections and initiatives, including the U.S. End of Term Government Web Archive, and the U.S. Federal Government Web Archiving Interest Group.

Since the Library of Congress joined the IIPC as a founding member in 2003, Abbie has served in a variety of roles and on a number of working groups, task forces, and committees. She spent a number of years as Communications Officer, and was a member of the Access Working Group. More recently, she has served as co-leader of the Content Development and Training Working Groups and of the Membership Engagement Portfolio. She has been a member of the Steering Committee since 2013.

IIPC VICE-CHAIR

Photo: Tibor God (General Assembly in Zagreb, 2019).

Kristinn Sigurðsson is Head of Digital Projects and Development at the National and University Library of Iceland. He joined the library in 2003 as a software developer. Over the years he has worked on a multitude of projects related to the acquisition, preservation and presentation of digital content, as well as the digital reproduction of physical media. This includes leading the buildup of the library’s legal deposit web archive – that now contains nearly 4 billion items – as well as its very popular newspaper/magazine website.

He has also been very active within the IIPC and related web archiving collaboration. This includes working on the first version of the Heritrix crawler in 2003-4 (and on and off since). In 2010 he joined the IIPC Steering Committee as well as taking over as co-lead of the Harvesting Working Group. More recently he has served as the Lead of the Tools Development Portfolio.

IIPC TREASURER

Sylvain Bélanger is Director General of the Transition Team at Library and Archives Canada (LAC). Sylvain was previously Director General of the Digital Operations and Preservation Branch for Library and Archives Canada since February 2014. In this role Sylvain is responsible for leading and supporting LAC’s digital business operations, and all aspects of preservation for digital and analog collections. Sylvain is also lead for LAC’s digital transformation activities. Prior to accepting this role, Sylvain had been Director of the Holdings Management Division since 2010, and previously Corporate Secretary and Chief of Staff for Library and Archives Canada.