By Kristinn Sigurðsson, Head of Digital Projects and Development at the National and University Library of Iceland, IIPC Vice-Chair and Co-Lead of the Tools Development Portfolio
Last year I wrote about The Future of Playback and the work that the IIPC was funding to facilitate the migration from OpenWayback to PyWb for our members. That work is being done by Ilya Kreymer and the first work package has now been delivered. An exhaustive transition guide detailing how OpenWayback configuration options can be translated into equivalent PyWb settings.
One thing I quickly noticed as I read through the guide is that it recommends that users use OutbackCDX as a backend for PyWb, rather than continuing to rely on “flat file”, sorted CDXes. PyWb does support “flat CDXs”, as long as they are the 9 or 11 column format, but a convincing argument is made that using OutbackCDX for resolving URLs is preferable. Whether you use PyWb or OpenWayback.
What is OutbackCDX?
OutbackCDX is a tool created by Alex Osborne, Web Archive Technical Lead at National Library of Australia. It handles the fundamental task of indexing the contents of web archives. Mapping URLs to contents in WARC files.
A “traditional” CDX file (or set of files) accomplishes this by listing each and every URL, in order, in a simple text file along with information about them like in which WARC file they are stored. This has the benefit of simplicity and can be managed using simple GNU tools, such as sort. Plain CDXs, however, make inefficient use of disk space. And as they get larger, they become increasingly difficult to update because inserting even a small amount of data into the middle of a large file requires rewriting a large part of the file.
OutbackCDX improves on this by using a simple, but powerful, key-value store RocksDB. The URLs are the keys and remaining info from the CDX is the stored value. RocksDB then does the heavy lifting of storing the data efficiently and providing speedy lookups and updates to the data. Notably, OutbackCDX enables updates to the index without any disruption to the service.
Given all this, transitioning to OutbackCDX for PyWb makes sense. But OutbackCDX also works with OpenWayback. If you aren’t quite ready to move to PyWb, adopting OutbackCDX first can serve as a stepping stone. It offers enough benefits all on its own to be worth it. And, once in place, it is fairly trivial to have it serve as a backend for both OpenWayback and PyWb at the same time.
So, this is what I decided to do. Our web archive, Vefsafn.is, has been running on OpenWayback with a flat file CDX index for a very long time. The index has grown to 4 billion URLs and takes up around 1.4 terabytes of disk space. Time for an upgrade.
Of course, there were a few bumps on that road, but more on that later.
Installing OutbackCDX was entirely trivial. You get the latest release JAR, run it like any standalone Java application and it just works. It takes a few parameters to determine where the index should be, what port it should be on and so forth, but configuration really is minimal.
Unlike OpenWayback, OutbackCDX is not installed into a servlet container like Tomcat, but instead (like Heritrix) comes with its own, built in web server. End users do not need access to this, so it may be advisable to configure it to only be accessible internally.
Building the Index
Once running, you’ll need to feed your existing CDXs into it. OutbackCDX can ingest most commonly used CDX formats. Certainly all that PyWb can read. CDX files can simply be “posted” OutbackCDX using a command line tool like curl.
curl -X POST –data-binary @index.cdx http://localhost:8901/myindex
In our environment, we keep around a gzipped CDX for each (W)ARC file, in addition to the merged, searchable CDX that powered OpenWayback. I initially just wrote a script that looped through the whole batch and posted them, one at a time. I realized, though, that the number of URLs ingested per second was much higher in CDXs that contained a lot of URLs. There is an overhead to each post. On the other hand, you can’t just post your entire mega CDX in one go, as OutbackCDX will run out of memory.
Ultimately, I wrote a script that posted about 5MB of my compressed CDXs at a time. Using it, I was able to add all ~4 billion URLs in our collection to OutbackCDX in about 2 days. I should note that our OutbackCDX is on high performance SSDs. Same as our regular CDX files have used.
Next up was to configure our OpenWayback instance to use OutbackCDX. This proved easy to do, but turned up some issues with OutbackCDX. First the configuration.
OpenWayback has a module called ‘RemoteResourceIndex’. This can be trivially enabled in the wayback.xml configuration file. Simply replace the existing `resourceIndex` with something like:
<property name=”searchUrlBase” value=”http://localhost:8080/myindex” />
And OpenWayback will use OutbackCDX to resolve URLs. Easy as that.
This is, of course, where I started running into those bumps I mentioned earlier. Turns out there were a number of edge cases where OutbackCDX and OpenWayback had different ideas. Luckly, Alex – the aforementioned creator of OutbackCDX – was happy to help resolve this. Thanks again Alex.
The first issue I encountered was due to the age of some of our ARCs. The date fields had variable precision, rather than all being exactly 14 digits long some had less precision and were only 10-12 characters long. This was resolved by having OutbackCDX pad those shorter dates with zeros.
I also discovered some inconsistencies in the metadata supplied along with the query results. OpenWayback expected some fields that were either missing or miss-named. These were a little tricky, as it only affected some aspects of OpenWayback, most notably in the metadata in the banner inserted at the top of each page. All of this has been resolved.
Lastly, I ran into an issue, not related to OpenWayback, but PyWb. It stemmed from the fact that my CDXs are not generated in the 11 column CDX format. The 11 column includes the compressed size of the WARC holding the resource. OutbackCDX was recording this value as 0 when absent. Unfortunately, PyWb didn’t like this and would fail to load such resources. Again, Alex helped me resolve this.
OutbackCDX 0.9.1 is now the most recent release, and includes the fixes to all the issues I encountered.
Having gone through all of this, I feel fairly confident that swapping in OutbackCDX to replace a ‘regular’ CDX index for OpenWayback is very doable for most installations. And the benefits are considerable.
The size of the OutbackCDX index on disk ended up being about 270 GB. As noted before, the existing CDX index powering our OpenWayback was 1.4 TB. A reduction of more than 80%. OpenWayback also feels notably snappier after the upgrade. And updates are notably easier.
Our OpenWayback at https://vefsafn.is, is now fully powered by OutbackCDX.
Next we will be looking at replacing it with PyWb. I’ll write more about that later, once we’ve made more progress, but I will say that having it run on the same OutbackCDX proved trivial to accomplish, and we now have a beta website up, using PyWb, http://beta.vefsafn.is.