By Andy Jackson, Web Archiving Technical Lead at the British Library
One of the outcomes of the Online Hours meetings has been an increase in activity around Heritrix 3. Most of us rely on Heritrix to carry out our web crawls, but recognise that to keep this large, complex crawler framework sustainable we need to get more people using the most recent versions, and make it easier for new users to get on board.
The most recent ‘formal’ release of Heritrix 3 was version 3.2.0 back in 2014, but a lot has happened since then. Numerous serious bugs have been discovered and resolved, and some new features added, but only those of us running the very latest code were able to take advantage of these changes.
Those of us who prefer to base our crawling on a software release rather than build from source have been relying on the stable releases built by Kristinn Sigurðsson, and hosted on the NetarchiveSuite Maven Repository. This worked well for ‘those in the know’, but did little to make things easier for new users.
In an attempt to resolve this, and in coordination with the Internet Archive, we have started releasing ‘formal’ versions of Heritrix, culminating in the 3.4.0-20190207 Interim Release. This new release is believed to be stable, and is recommended over previous releases of Heritrix 3. As well as being released on GitHub, it is also available through the Maven Central Repository, which should make it easier for others to re-use Heritrix.
You may notice we’ve added a date to the version tag. Traditionally, Heritrix 3 has used a tag of the form “X.X.X”, which gives the impression we are using a form of Semantic Versioning. However, that does not reflect how Heritrix is evolving. Heritrix is a broad framework of modules for building a crawler, and has lots of different components of different ages, at different levels of maturity and use. Given there are only a small number of developers working on Heritrix, we don’t have the resources to guarantee that a breaking change won’t slip into a minor release, so it’s best not to appear to be promising something we cannot deliver.
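To make the tag format concrete, here is a minimal Python sketch of how a deployment script might read a date-suffixed tag such as 3.4.0-20190207. The “X.X.X-YYYYMMDD” pattern is inferred from that single example, and the names `ReleaseTag`, `parse_release_tag` and `needs_retest` are hypothetical helpers for illustration, not part of Heritrix.

```python
import re
from datetime import date, datetime
from typing import NamedTuple

# Assumed tag shape, inferred from "3.4.0-20190207": X.X.X-YYYYMMDD.
TAG_PATTERN = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d{8})$")


class ReleaseTag(NamedTuple):
    """Hypothetical container for the parts of a release tag."""
    major: int
    minor: int
    patch: int
    released: date


def parse_release_tag(tag: str) -> ReleaseTag:
    """Split a date-suffixed tag into its numeric parts and release date."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a date-suffixed release tag: {tag!r}")
    major, minor, patch, stamp = match.groups()
    # strptime also rejects impossible dates such as 20191340.
    released = datetime.strptime(stamp, "%Y%m%d").date()
    return ReleaseTag(int(major), int(minor), int(patch), released)


def needs_retest(current: str, candidate: str) -> bool:
    """Flag any change of tag for re-testing, however small it looks.

    The numeric parts carry no compatibility promise, so this does not
    try to tell 'safe' minor upgrades apart from breaking ones.
    """
    return parse_release_tag(current) != parse_release_tag(candidate)


tag = parse_release_tag("3.4.0-20190207")
print(tag.released.isoformat())  # prints 2019-02-07
```

The design choice worth noting is that `needs_retest` treats every tag change the same way, since the numbers in the tag should not be read as a Semantic Versioning guarantee.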
This means that, when you are upgrading your Heritrix 3 crawler, we recommend that you thoroughly test each release using your configuration (your ‘crawler beans’ in Heritrix 3 jargon) under a realistic workload. If you can, please let us know how this goes, to help us understand how reliable the different parts of Heritrix 3 are.
As well as making new releases, we have also moved the Heritrix 3 documentation over to GitHub to populate the Heritrix3 wiki, and shifted the API documentation to a more modern platform. We hope this will help those who have been frustrated by the available documentation, and we encourage you to get in touch with any ideas for improving the situation, particularly when it comes to helping new users get on board.