A New Release of Heritrix 3

By Andy Jackson, Web Archiving Technical Lead at the British Library

One of the outcomes of the Online Hours meetings has been an increase in activity around Heritrix 3. Most of us rely on Heritrix to carry out our web crawls, but recognise that to keep this large, complex crawler framework sustainable we need to get more people to use the most recent versions, and make it easier for new users to get on board.

The most recent ‘formal’ release of Heritrix 3 was version 3.2.0 back in 2014, but a lot has happened since then. Numerous serious bugs have been discovered and resolved, and some new features added, but only those of us running the very latest code were able to take advantage of these changes.

Those of us who prefer to base our crawling on a software release rather than building from source have been relying on the stable releases built by Kristinn Sigurðsson, and hosted on the NetarchiveSuite Maven Repository. This worked well for ‘those in the know’, but did little to make things easier for new users.

In an attempt to resolve this, and in coordination with the Internet Archive, we have started releasing ‘formal’ versions of Heritrix, culminating in the 3.4.0-20190207 Interim Release. This new release is believed to be stable, and is recommended over previous releases of Heritrix 3. As well as being released on GitHub, it is also available through the Maven Central Repository, which should make it easier for others to re-use Heritrix.

You may notice we’ve added a date to the version tag. Traditionally, Heritrix 3 has used a tag of the form “X.X.X”, which gives the impression we are using a form of Semantic Versioning. However, that does not reflect how Heritrix is evolving. Heritrix is a broad framework of modules for building a crawler, and has lots of different components of different ages, at different levels of maturity and use. Given there are only a small number of developers working on Heritrix, we don’t have the resources to guarantee that a breaking change won’t slip into a minor release, so it’s best not to appear to be promising something we cannot deliver.
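
For anyone scripting around release tags, the dated form can be read as a conventional numeric part plus a release date. The following is a minimal sketch, assuming tags look like either “X.X.X” or “X.X.X-YYYYMMDD”; the names `ReleaseTag` and `parse_release_tag` are hypothetical helpers for illustration and are not part of Heritrix.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional, Tuple

# Assumed tag shapes: "3.2.0" (older style) or "3.4.0-20190207" (dated style).
_TAG_PATTERN = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-(\d{8}))?$")


class ReleaseTag(NamedTuple):
    """Hypothetical container: the numeric part and the optional release date."""
    numbers: Tuple[int, int, int]
    released: Optional[date]


def parse_release_tag(tag: str) -> ReleaseTag:
    """Split a release tag into its numeric part and, if present, its date."""
    match = _TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognised release tag: {tag!r}")
    major, minor, patch, stamp = match.groups()
    # The date suffix is optional, so older tags simply have no release date.
    released = datetime.strptime(stamp, "%Y%m%d").date() if stamp else None
    return ReleaseTag((int(major), int(minor), int(patch)), released)
```

Because the numeric part carries no compatibility promise, a script like this is probably best used for ordering and reporting releases, not for deciding whether an upgrade is safe.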

This means that, when you are upgrading your Heritrix 3 crawler, we recommend that you thoroughly test each release using your configuration (your ‘crawler beans’ in Heritrix 3 jargon) under a realistic workload. If you can, please let us know how this goes, to help us understand how reliable the different parts of Heritrix 3 are.

As well as making new releases, we have also moved the Heritrix 3 documentation over to GitHub to populate the Heritrix3 wiki, and shifted the API documentation to a more modern platform. We hope this will help those who have been frustrated by the available documentation, and we encourage you to get in touch with any ideas for improving the situation, particularly when it comes to helping new users get on board.

If you want to know more, please drop into the Online Hours calls or use the archive-crawler mailing list or IIPC Slack to get in touch. To join IIPC Slack, submit a request through this form.

Passing the Torch

By Jefferson Bailey, Internet Archive

Dear IIPC Community,

As of January 1, 2019, my term as Chair of the IIPC came to a close. Having served as Chair since September 2017 and, prior to that, as Vice Chair from April 2016 (during the excellent leadership of Emmanuelle Bermès of BNF), I have seen the IIPC continue to grow and evolve. It has been a privilege to serve in these roles during this exciting time. While I will continue to serve as a regular Steering Committee (SC) member, I wanted to take this transitional moment to reflect on the successes and ongoing work of both the SC and the IIPC. The centrality of the web as a communication and publication platform only increases by the day and the work of the IIPC and its members becomes ever more critical in documenting history, preserving knowledge, and interrogating privilege and power. There is always more work to be done.

Before reflecting on recent progress and future directions, I want to give a big thanks to my co-Officers. Vice Chair Sylvain Bélanger of Library and Archives Canada and Treasurer Tom Cramer of Stanford University Libraries both worked to advance IIPC’s mission and operations. As well, Program and Communications Officer Olga Holownia worked, and continues to work, tirelessly to support the overall activities of the consortium. Thanks go as well to the SC members who volunteer their time and to the many regular members who actively contribute to Working Groups (WGs), committees, portfolios, etc., and who keep the IIPC a dynamic forum for sharing knowledge and practices. Lastly, I look forward to the great team of new SC Officers, Chair Hansueli Locher of the Swiss National Library, Vice Chair Mark Phillips of University of North Texas, and Sylvain serving as Treasurer. The near-term future of IIPC is in good hands.

In my time as Vice Chair and Chair, IIPC has continued to add new members and expand its activities. Here is my reflection on areas of recent progress and further effort:

Areas of Recent Progress

A Maturing Organization

It is well known that IIPC faced many financial and operational difficulties related to the unforeseen inability of BNF to continue to provide financial and accounting support for IIPC in 2017, after many years of admirably providing this service for IIPC without recompense. We all owe thanks to British Library and to Olga for enabling the 2017 conference to happen, even in a moment of financial uncertainty. From crisis came positive change, as Abbie Grotke of Library of Congress and I were able to arrange an agreement with the Council on Library and Information Resources (CLIR) to provide professional fiscal sponsor services for IIPC. CLIR is a wonderful supporter of the library community and has proven an excellent fiscal agent; we are excited to establish this relationship and expect it to be a foundation for further collaborations.

Much work was also done by Officers to implement a suite of protocols and procedures around invoicing, member onboarding, financial tracking, vendor and expense payments, and other basic budgeting and organization management. Many of these processes were previously unenforced or nonexistent and caused a notable strain on IIPC’s limited staffing. Professionalization of finances and many operations should allow IIPC to focus more on its core mission – delivering member value and advancing preservation of the internet!

Premier Events

The past few years also featured improvements in the planning and management of the GA and WAC conference, including more seamless planning workflows, more budgetary autonomy for hosts, the exploration of sponsorships, registration fees, and event planning services, and other efficiency and sustainability approaches. The IIPC WAC continues to be the premier event for web archiving, and many attendees noted that the 2018 GA/WAC hosted by National Library of New Zealand was one of the best conferences so far. Proposal submissions, sessions, and attendance all continue to grow and the quality of the event remains superlative. The 2019 event at the National and University Library in Zagreb, Croatia will continue the trend. Other workshops, forums, and programming also continued IIPC’s essential role in providing the best venue for discussion of web preservation and access issues.

Member Activity

A number of new initiatives, as well as growth in existing projects, signaled that member engagement and contribution remain high. From the new Training Working Group, to an extensive Member Engagement Survey, to the growing collaborative collections of the Content Development Group, to many other formal and informal activities, IIPC members remain active in the organization. We are hoping the stability mechanisms of the past few years have enabled even more ways for members to participate and contribute.

Areas of Further Effort

Organizational Maturity

Though, per above, great strides were made in professionalizing many activities, other areas of operations also need to evolve to account for IIPC’s growth and strategic aspirations. The challenges related to the fiscal agent transition illuminated broader circumstances related to IIPC’s growth over the years – namely that critical operational and administrative functions can no longer be dependent on the unpredictable contributions or internal decisions of individual member institutions. The model of member-contributed operational support made sense when IIPC was one or two dozen members. With over 50 members, a growing portfolio of activities, and nearly 200,000 EUR in annual member dues, IIPC has outgrown such an arrangement. All core functions of IIPC – from finances to operations to staffing – need to operate autonomously and independent of individual members to ensure a successful, ongoing provision and continuation of services and obviate conflicts of interest. There are many arrangements that can be pursued to support this self-sufficiency and IIPC is blessed with a large financial reserve that can help advance this effort. Work to achieve this self-reliance will no doubt be a focus of the SC in the coming years.

Scaling Participation

As I noted in my Chair’s address at the General Assembly, IIPC is poised to pivot to focusing on resiliency, member benefits, and strategic investment. I had fantastic conversations over the years with members about ideas for IIPC to deliver value to members via new activities and investments. As part of these conversations, I devised, with feedback from the SC, a “Discretionary Funding Program” (see link above) to invest a significant portion of IIPC’s reserve funds to support member-proposed and member-managed projects. Expect more news about this program soon.

IIPC also needs to invest resources to encourage a broader involvement of members in leadership positions. There is very little turnover in institutional representation on the SC. As well, Officer roles have historically been held by an even smaller number of institutions, and there was no self-nomination for Chair during this year’s nomination period (thanks go to Hansueli for stepping up after the year started with the role vacant). To remain vibrant and reflective of its community, representation of more members is needed at the Steering, Officer, Working Group chairs, and other elected and self-nominated positions. Term limits, limitations on consecutive terms served by an institution, leadership stipends, more clearly defined expectations of service, or other formal or informal inducements are ideas that could bring fresh perspectives and new ideas to SC or WG leadership roles. As with operations, IIPC’s governance needs to evolve and adapt to introduce new voices and vibrancy to our growing organization.

Member Diversity

The web is a global and, in some ways, borderless phenomenon. Yet one need only look at the IIPC membership map to recognize that vast portions of the globe are underrepresented in IIPC and, likely, in the global web collection we are all working to build. As well, web preservation is increasingly a concern of institutions beyond just national libraries and research universities. There is surely momentum and engagement to be found in scaling IIPC membership and activities both vertically (inclusive of organizations of differing size, mandates, and missions) and horizontally (inclusive of underrepresented regions and nations). Building a truly global organization, as well as a diverse, inclusive preserved record of the web, will require participation far beyond North America and Europe. Subsidized membership rates, diversity scholarships or travel funding, and targeted partnerships, outreach, or participation tools are just some ideas or activities that can be started or scaled. I am excited to work with colleagues over the coming year to propose a number of ideas for improving diversity within IIPC.

In summary, I had four goals when assuming SC Officer roles: get IIPC’s house in order, improve operations, scale support of member-driven projects, and diversify membership and leadership. I think notable progress was made on three of those four and more time will allow diversity initiatives to gain traction. While some of this progress was behind-the-scenes or is soon-to-be-released, hopefully it has helped IIPC grow and thrive. The new leadership team will no doubt continue this trend.

Keep on crawlin’!

Jefferson Bailey
Internet Archive
IIPC Steering Committee

IIPC – Meet the Officers, 2019

The IIPC is governed by the Steering Committee, formed of representatives from fifteen Member Institutions who are each elected for three-year terms. The Steering Committee designates the Chair, Vice-Chair and the Treasurer of the Consortium. Together with the Programme and Communications Officer (PCO, based at the British Library), the Officers are responsible for dealing with the day-to-day business of running the IIPC.

The Steering Committee has designated Hansueli Locher, Swiss National Library, to serve as Chair, Mark Phillips, University of North Texas, to serve as Vice-Chair and Sylvain Bélanger, Library and Archives Canada, to serve as Treasurer in 2019. CLIR (the Council on Library and Information Resources) remains the Consortium’s fiscal host.

The Members and the Steering Committee of the IIPC would like to thank Jefferson Bailey (IIPC Chair, September 2017 – January 2019 and Vice-Chair, April 2016 – September 2017), Internet Archive, Sylvain Bélanger (IIPC Vice-Chair, September 2017 – January 2019) and Tom Cramer (IIPC Treasurer, September 2017 – January 2019), Stanford University Libraries, for contributing their time and expertise to support the Consortium during their extended terms of office.

The nomination process for the IIPC Steering Committee is still open and five seats will become available as of June 1st 2019. IIPC Members are invited to nominate themselves by sending an email including a statement to the IIPC Programme and Communications Officer by March 1st 2019.


Hansueli Locher

After being a teacher for several years, Hansueli Locher decided to turn his hobby – computer science – into his profession. He worked at the Swiss Federal Statistical Office where he was responsible for database supported evaluations of statistical data. He also developed a library system and supervised information projects with strong IT-links. Since 2000 he has been working at the National Library. As Project Manager “Archiving” he was responsible for the technical aspects of long-term preservation of digital objects. As Head of ICT Services he is now responsible for IT at the Swiss National Library and the Federal Office of Culture. Swiss National Library joined IIPC in 2007 and Hansueli has represented the Library in the Steering Committee since 2013. Hansueli has also been the Lead and is now one of the Co-Leads of the IIPC Partnerships and Outreach Portfolio and was on the Organising Committee for the General Assembly and the Web Archiving Conference in Wellington, New Zealand.



Mark Phillips

Mark Phillips is Associate Dean for Digital Libraries at the University of North Texas (UNT) in Denton, Texas. Mark has been involved with all stages in the development of the digital library access and preservation infrastructure at the UNT Libraries. The UNT Libraries’ Digital Collection manages over 2.5 million digital resources made available through the interfaces of The Portal to Texas History, the UNT Digital Library, and the Gateway to Oklahoma History. In addition to digital library infrastructure development, Mark has been involved in the web archiving activities at the UNT Libraries since 2004 including the 2008, 2012, and 2016 End of Term Web Archive activities and the development of the URL Nomination Tool. He has been active in IIPC since the UNT Libraries joined the Consortium in 2008 and has served as the UNT representative for the IIPC Steering Committee since 2015. Mark has been one of the Co-Leads of the IIPC Partnerships and Outreach Portfolio and is currently the Portfolio’s Lead.



Sylvain Bélanger

Sylvain Bélanger has been Director General of the Digital Operations and Preservation Branch for Library and Archives Canada since February 2014. In this role Sylvain is responsible for leading and supporting LAC’s digital business operations, and all aspects of preservation for digital and analog collections. Prior to accepting this role, Sylvain had been Director of the Holdings Management Division since 2010, and previously Corporate Secretary and Chief of Staff for Library and Archives Canada. Library and Archives Canada is one of the founding members of the IIPC.

Collaborate to develop web archive collections with Cobweb!

By Kathryn Stine, Manager, Digital Content Development and Strategy at the California Digital Library

Cobweb is a recently launched collaborative collection development platform for web archives, now available for anyone to use to establish and participate in web archiving collecting projects at https://cobwebarchive.org. A cross-institutional team from UCLA, the California Digital Library (CDL), and Harvard University has developed Cobweb, which was made possible in part by funding from the United States Institute of Museum and Library Services and initially hosted by CDL. We’ve been encouraged by the enthusiasm and engagement that’s met Cobweb and look forward to supporting a range of collaborative and coordinated web archiving collecting projects with this new platform.

Peter Broadwell & Kathryn Stine introducing Cobweb at the Web Archiving Conference in Wellington (slides).

At the 2018 IIPC Web Archiving Conference in New Zealand, Cobweb tutorial attendees played with Cobweb functionality and provided useful feedback and ideas for platform refinements and future feature options. Thank you to all who have shared their suggestions for advancing Cobweb! A number of demonstration projects are now on the platform that showcase how Cobweb supports web archiving collection development activities, including nominating web resources to a project and claiming intentions for, and following through with, archiving nominated web content. Additionally, the extensive Archive of the California Government Domain (CA.gov) has been established as a Cobweb collecting project and the CA.gov team is considering how to integrate Cobweb into its collection development workflows.

Cobweb centralizes the often distributed activities that go into developing web archive collections, allowing for multiple contributors and organizations to work together towards realizing common collecting goals. The coordinated activities that result in rich, useful web archive collections can draw upon distinct areas of expertise or capacity including subject specialization, technical facility with content capture, and resources for storing and managing content. The Cobweb platform is well-suited to supporting curated and crowdsourced collection building, from complex, multi-partner initiatives to local efforts that require coordination, such as that between digital archivists and library subject selectors.

If you have web archiving collecting goals that can benefit from engaging in collaborative and/or coordinated participation, learn more about getting started with Cobweb by visiting https://cobwebarchive.org/getting_started, checking out the Cobweb presentation from the IIPC WAC, or by emailing cobwebarchive[at]gmail.com.

Web Archiving Down Under: Relaunch of the Web Curator Tool at the IIPC conference, Wellington, New Zealand

Kees Teszelszky, Curator Digital Collections at the National Library of the Netherlands/Koninklijke Bibliotheek (with input from Hanna Koppelaar, Jeffrey van der Hoeven – KB-NL, Ben O’Brien, Steve Knight and Andrea Goethals – National Library of New Zealand)

Hanna Koppelaar, KB & Ben O’Brien, NLNZ. IIPC Web Archiving Conference 2018. Photo by Kees Teszelszky

The Web Curator Tool (WCT) is a globally used workflow management application designed for selective web archiving in digital heritage collecting organisations. Version 2.0 of the WCT is now available on GitHub. This release is the product of a collaborative development effort started in late 2017 between the National Library of New Zealand (NLNZ) and the National Library of the Netherlands (KB-NL). The new version was previewed during a tutorial at the IIPC Web Archiving Conference on 14 November 2018 at the National Library of New Zealand in Wellington, New Zealand. In front of an audience of more than 25, Ben O’Brien (NLNZ) and Hanna Koppelaar (KB-NL) presented the new features of the WCT and showed how teams on opposite sides of the world can work on it collaboratively.

The tutorial highlighted that part of our road map for this version has been dedicated to improving the installation and support of WCT. We recognised that the majority of requests for support were related to database setup and application configuration. To improve this experience we consolidated and refactored the setup process, correcting ambiguities and misleading documentation. Another component of this improvement was the migration of our documentation to the readthedocs platform (found here), making the content more accessible and the process of updating it a lot simpler. This has replaced the PDF versions of the documentation, but not the GitHub wiki. The wiki content will be migrated where we see fit.

A guide on how to install WCT can be found here, a video can be found here.

1) WCT Workflow

One of the objectives in upgrading the WCT was to raise it to a level where it could keep pace with the requirements of archiving the modern web. The first step in this process was decoupling the integration with the old Heritrix 1 web crawler, and allowing the WCT to harvest using the more modern Heritrix 3 (H3) version. This work started as a proof-of-concept in 2017, which did not include any configuration of H3 from within the WCT UI. A single H3 profile was used in the backend to run H3 crawls. Today H3 crawls are fully configurable from within the WCT, mirroring the existing profile management that users had with Heritrix 1.

2) 2018 Work Plan Milestones

The second step in this process of raising the WCT up is a technical uplift. For the past six or seven years, the software has fallen into a period of neglect, with mounting technical debt. The tool is sitting atop outdated and unsupported libraries and frameworks. Two of those frameworks are Spring and Hibernate. The feasibility of upgrading them has been explored through a proof-of-concept, which was successful. We also want to make the WCT much more flexible and less coupled by exposing each component via an API layer. To make that API development much easier, we are looking to migrate the existing SOAP API to REST and to change components so they are less dependent on each other.

Currently the Web Curator Tool is tightly coupled with the Heritrix crawler (H1 and H3). However, other crawl tools exist and the future will bring more. The third step is re-architecting WCT to be crawler agnostic. The abstracting out of all crawler-specific logic allows for minimal development effort to integrate new crawling tools. The path to this stage has already been started with the integration of Heritrix 3, and will be further developed during the technical uplift.
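
The idea of abstracting out crawler-specific logic can be illustrated with a small sketch. This is not WCT code (and it is written in Python purely for illustration): the `Crawler` interface, the `InMemoryCrawler` stand-in and the `HarvestCoordinator` class are all hypothetical names, assumed here only to show how workflow logic might depend on an interface instead of on a specific crawler.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Crawler(ABC):
    """Hypothetical crawler-neutral interface; not WCT's actual API."""

    @abstractmethod
    def start(self, job_name: str, seeds: List[str]) -> None:
        """Begin a crawl job for the given seed URLs."""

    @abstractmethod
    def status(self, job_name: str) -> str:
        """Report the state of a crawl job."""


class InMemoryCrawler(Crawler):
    """Stand-in adapter that only records jobs; a real adapter would drive a crawler."""

    def __init__(self) -> None:
        self._jobs: Dict[str, List[str]] = {}

    def start(self, job_name: str, seeds: List[str]) -> None:
        self._jobs[job_name] = list(seeds)

    def status(self, job_name: str) -> str:
        return "RUNNING" if job_name in self._jobs else "UNKNOWN"


class HarvestCoordinator:
    """Workflow logic that knows only the Crawler interface, never a concrete crawler."""

    def __init__(self, crawler: Crawler) -> None:
        self._crawler = crawler

    def harvest(self, job_name: str, seeds: List[str]) -> str:
        self._crawler.start(job_name, seeds)
        return self._crawler.status(job_name)
```

Under these assumptions, supporting a new crawling tool would mean writing one more adapter class, while the coordinator stays unchanged.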

More detail about future milestones can be found in the Web Curator Tool Developer Guide in the appropriately titled section Future Milestones. This section will be updated as development work progresses.

3) Diagram showing the relationships between different Web Curator Tool components

We are conscious that there are long-time users on various old versions of WCT, as well as regular downloads of those older versions from the old Sourceforge repository (soon to be deactivated). We would like to encourage those users of older versions to start using WCT 2.0 and to reach out for support in upgrading. The primary channels for contact are the WCT Slack group and the GitHub repository. We hope that WCT will be widely used by the web archiving community in future and will have a large development and support base. Please contact us if you are interested in cooperating! See the Web Curator Tool Developer Guide for more information about how to become involved in the Web Curator Tool community.

WCT facts

The WCT is one of the most common open-source enterprise solutions for web archiving. It was developed in 2006 as a collaborative effort between the National Library of New Zealand and the British Library, initiated by the International Internet Preservation Consortium (IIPC), as can be read in the original documentation. Since January 2018 it has been undergoing an upgrade through collaboration with the Koninklijke Bibliotheek – National Library of the Netherlands. The WCT is open-source and available under the terms of the Apache License. The project was moved in 2014 from Sourceforge to GitHub. The latest release of the WCT, v2.0, is available now. It has an active user forum on GitHub and Slack.


Digging in Digital Dust: Internet Archaeology at KB-NL in the Netherlands

By Peter de Bode and Kees Teszelszky

The Dutch .nl ccTLD is the third biggest national top level domain in the world and consists of 5.68 million URLs, according to the Dutch SIDN. The first website of the Netherlands was published on the web in 1992: it was the third website on the World Wide Web. Web archiving in the Netherlands started in 2000 with the project Archipol in Groningen. The Koninklijke Bibliotheek | National Library of The Netherlands (KB-NL) started web archiving with a selection of Dutch websites in 2007. The KB not only selects and harvests these sites, but also develops a strategy to ensure their long-term usability. As the Netherlands lacks a legal deposit law, the KB cannot crawl the Dutch national domain. KB uses the Web Curator Tool (WCT) to conduct its harvests. From January 2018 onwards, the National Library of New Zealand (NLNZ) has been collaborating with KB-NL to upgrade this tool and add new features to make the application future-proof.

As of 2011, the Dutch web archive is available in the KB reading rooms. In addition, researchers may request access to the data for specific projects. Between 2012 and 2016 the research project WebArt was carried out. As of November 2018, 15,000 websites have been selected. The Dutch web archive contains about 37 terabytes of data.

On the occasion of World Digital Preservation Day, KB unveiled a special internet archaeology collection, Euronet-Internet (1994-2017) [In Dutch: Webcollectie internetarcheologie Euronet]. It is made up of archived websites hosted by internet provider Euronet-Internet between 1994 and 2017. Collecting started in 2017 and ended in 2018. Identification of websites for harvest was done by Peter de Bode and Kees Teszelszky as part of the larger KB web archiving project “internet archaeology.” Euronet is one of the oldest internet providers in the Netherlands (1994) and has since been bought by Online.nl. Priority was given to websites published in the early years of the Dutch web (1994-2000).

These sites can be considered “web incunables”, as they are among the first born-digital publications on the Dutch web. Some of the digital treasures from this collection are the oldest website of a national political party, a virtual bank building and several sites of internet pioneers dating from 1995. Information about the collection and its heritage value can be found on a special dataset page of KB-Lab and in a collection description (in Dutch). The collection can be studied on the terminals in the reading room of KB with a valid library card. Researchers can also use the dataset with URLs and a link analysis.

Web Archiving at the National Library of Ireland

National Library of Ireland Reading Room © National Library of Ireland.

The National Library of Ireland has a long-standing tradition of collecting, preserving and making accessible the published and printed output of Ireland. The library is over 140 years old and we now also have rich digital collections concerning the political, cultural and creative life of Ireland. The NLI has been archiving the Irish web on a selective basis since 2011. We have over 17 TB of data in the selective web archive, openly available for research through our website. A particular strength of our web archive is the coverage of Irish politics, including a representation of every election and referendum since 2011. With the web archive no longer in its infancy, the NLI has made some exciting developments in recent years. This year we have begun working with Internet Archive for our selective web archive and are looking forward to the new opportunities that this partnership will bring. We have also begun working closely with an academic researcher from a Higher Education institute in Ireland, who is carrying out network analysis on a portion of our selective data.

In 2007 and 2017, the NLI undertook domain crawling projects and there is now over 43 TB of data archived from these crawls. The National Library of Ireland is a legal deposit library, entitling it to a copy of everything published in Ireland. However, unlike many countries in Europe, legal deposit legislation does not currently extend to online material so we cannot make these crawls available. Despite these barriers, the library remains committed to preserving the online story of Ireland in whatever way we can.

Revisions to the legislation are currently before the Irish parliament and if passed will result in the addition of e-publications, such as e-books, journals etc. The addition of websites to that list is currently being considered.

In 2017, the National Library of Ireland became a member of the IIPC and we are excited to be attending our first General Assembly in Wellington. While we had anticipated talking about our newly available domain web archive portal and how this had impacted our selective crawls, we are looking forward to discussing the challenges we continue to face, including with Legal Deposit, and how we are developing the web archive as a whole. We also hope to be able to give an update on progress with the legislative framework. We look forward to seeing you there in Wellington!