Using OutbackCDX with OpenWayback

By Kristinn Sigurðsson, Head of Digital Projects and Development at the National and University Library of Iceland, IIPC Vice-Chair and Co-Lead of the Tools Development Portfolio

Last year I wrote about The Future of Playback and the work that the IIPC was funding to facilitate the migration from OpenWayback to PyWb for our members. That work is being done by Ilya Kreymer, and the first work package has now been delivered: an exhaustive transition guide detailing how OpenWayback configuration options can be translated into equivalent PyWb settings.

One thing I quickly noticed as I read through the guide is that it recommends that users use OutbackCDX as a backend for PyWb, rather than continuing to rely on “flat file”, sorted CDXs. PyWb does support “flat CDXs”, as long as they are in the 9 or 11 column format, but a convincing argument is made that using OutbackCDX for resolving URLs is preferable, whether you use PyWb or OpenWayback.

What is OutbackCDX?

OutbackCDX is a tool created by Alex Osborne, Web Archive Technical Lead at the National Library of Australia. It handles the fundamental task of indexing the contents of web archives: mapping URLs to contents in WARC files.

A “traditional” CDX file (or set of files) accomplishes this by listing each and every URL, in sorted order, in a simple text file, along with information about each one, such as which WARC file it is stored in. This has the benefit of simplicity, and the files can be managed using simple GNU tools, such as sort. Plain CDXs, however, make inefficient use of disk space. And as they get larger, they become increasingly difficult to update, because inserting even a small amount of data into the middle of a large file requires rewriting a large part of the file.
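To make the trade-off concrete, here is a minimal sketch of how a sorted flat index behaves. The three-field lines and the `lookup` and `insert` helpers are hypothetical simplifications for illustration (real CDX lines carry more fields), and an in-memory list stands in for the file on disk.

```python
import bisect

# Hypothetical, heavily simplified CDX lines: "urlkey timestamp filename".
# The list must stay sorted, just like a flat CDX file.
cdx_lines = [
    "is,example)/ 20190101000000 a.warc.gz",
    "is,example)/about 20190101000500 a.warc.gz",
    "is,vefsafn)/ 20200101000000 b.warc.gz",
]


def lookup(lines, urlkey):
    """Return all lines for urlkey, found by binary search on the sorted list."""
    prefix = urlkey + " "
    start = bisect.bisect_left(lines, prefix)
    matches = []
    for line in lines[start:]:
        if not line.startswith(prefix):
            break
        matches.append(line)
    return matches


def insert(lines, new_line):
    """Insert a line in sorted position and report how many entries had to shift."""
    position = bisect.bisect_left(lines, new_line)
    lines.insert(position, new_line)
    return len(lines) - position - 1
```

Lookups stay cheap because the data is sorted, but every insert near the front pushes all later entries along. On disk, with billions of lines, that shift is the rewrite described above.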

OutbackCDX improves on this by using a simple, but powerful, key-value store, RocksDB. The URLs are the keys and the remaining info from the CDX is the stored value. RocksDB then does the heavy lifting of storing the data efficiently and providing speedy lookups and updates to the data. Notably, OutbackCDX enables updates to the index without any disruption to the service.

The Mission

Given all this, transitioning to OutbackCDX for PyWb makes sense. But OutbackCDX also works with OpenWayback. If you aren’t quite ready to move to PyWb, adopting OutbackCDX first can serve as a stepping stone. It offers enough benefits all on its own to be worth it. And, once in place, it is fairly trivial to have it serve as a backend for both OpenWayback and PyWb at the same time.

So, this is what I decided to do. Our web archive, Vefsafn.is, has been running on OpenWayback with a flat file CDX index for a very long time. The index has grown to 4 billion URLs and takes up around 1.4 terabytes of disk space. Time for an upgrade.

Of course, there were a few bumps on that road, but more on that later.

Installing OutbackCDX

Installing OutbackCDX was entirely trivial. You get the latest release JAR, run it like any standalone Java application and it just works. It takes a few parameters to determine where the index should be, what port it should be on and so forth, but configuration really is minimal.

Unlike OpenWayback, OutbackCDX is not installed into a servlet container like Tomcat, but instead (like Heritrix) comes with its own, built in web server. End users do not need access to this, so it may be advisable to configure it to only be accessible internally.

Building the Index

Once it is running, you’ll need to feed your existing CDXs into it. OutbackCDX can ingest most commonly used CDX formats, certainly all that PyWb can read. CDX files can simply be “posted” to OutbackCDX using a command line tool like curl.

Example:

curl -X POST --data-binary @index.cdx http://localhost:8901/myindex
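After posting, it can be reassuring to query the index for a URL you know is in the archive. The sketch below assumes OutbackCDX's query interface, which accepts the target address in a `url` parameter and an optional `limit`; the helper names are made up for this example, and the base address simply reuses the one from the curl command above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(index_url, target_url, limit=None):
    """Build a lookup URL for one archived address (assumed 'url' and 'limit' parameters)."""
    params = {"url": target_url}
    if limit is not None:
        params["limit"] = limit
    return index_url + "?" + urlencode(params)


def query_index(index_url, target_url, limit=None):
    """Fetch matching index lines as a list of strings (requires a running server)."""
    with urlopen(build_query_url(index_url, target_url, limit)) as response:
        return response.read().decode("utf-8").splitlines()
```

For example, `query_index("http://localhost:8901/myindex", "http://example.org/", limit=5)` should return at most five lines if the records were ingested.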

In our environment, we keep around a gzipped CDX for each (W)ARC file, in addition to the merged, searchable CDX that powered OpenWayback. I initially just wrote a script that looped through the whole batch and posted them, one at a time. I realized, though, that the number of URLs ingested per second was much higher in CDXs that contained a lot of URLs. There is an overhead to each post. On the other hand, you can’t just post your entire mega CDX in one go, as OutbackCDX will run out of memory.

Ultimately, I wrote a script that posted about 5MB of my compressed CDXs at a time. Using it, I was able to add all ~4 billion URLs in our collection to OutbackCDX in about 2 days. I should note that our OutbackCDX is on high performance SSDs, the same kind of storage our regular CDX files have used.
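A batching script along those lines could look roughly like the sketch below. It is not the script I actually used: the function names, the `.cdx.gz` naming and the choice to decompress each file before posting are all assumptions made for the example.

```python
import gzip
import os
from urllib.request import Request, urlopen

BATCH_LIMIT = 5 * 1024 * 1024  # roughly 5 MB of compressed CDX per POST


def make_batches(sized_paths, limit=BATCH_LIMIT):
    """Group (path, compressed_size) pairs into batches of at most `limit` bytes.

    A single file larger than the limit still gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for path, size in sized_paths:
        if current and current_size + size > limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


def post_batch(paths, index_url):
    """Decompress each gzipped CDX and send the combined text in one POST."""
    chunks = []
    for path in paths:
        with gzip.open(path, "rb") as handle:
            data = handle.read()
        if not data.endswith(b"\n"):
            data += b"\n"  # keep the last line of one file apart from the next
        chunks.append(data)
    request = Request(index_url, data=b"".join(chunks), method="POST")
    with urlopen(request) as response:
        return response.status


def ingest(cdx_dir, index_url):
    """Post every *.cdx.gz file in cdx_dir, one batch at a time."""
    names = sorted(n for n in os.listdir(cdx_dir) if n.endswith(".cdx.gz"))
    paths = [os.path.join(cdx_dir, n) for n in names]
    sized = [(p, os.path.getsize(p)) for p in paths]
    for batch in make_batches(sized):
        post_batch(batch, index_url)
```

Grouping by compressed size keeps each request small enough to avoid exhausting memory on the server, while amortizing the per-post overhead across many small files.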

Configuring OpenWayback

Next up was to configure our OpenWayback instance to use OutbackCDX. This proved easy to do, but turned up some issues with OutbackCDX. First, the configuration.

OpenWayback has a module called ‘RemoteResourceIndex’. This can be trivially enabled in the wayback.xml configuration file. Simply replace the existing `resourceIndex` with something like:

<property name="resourceIndex">
  <bean class="org.archive.wayback.resourceindex.RemoteResourceIndex">
    <property name="searchUrlBase" value="http://localhost:8080/myindex" />
  </bean>
</property>

And OpenWayback will use OutbackCDX to resolve URLs. Easy as that.

Those ‘bumps’

This is, of course, where I started running into those bumps I mentioned earlier. It turned out there were a number of edge cases where OutbackCDX and OpenWayback had different ideas. Luckily, Alex, the aforementioned creator of OutbackCDX, was happy to help resolve them. Thanks again, Alex.

The first issue I encountered was due to the age of some of our ARCs. The date fields had variable precision: rather than all being exactly 14 digits long, some had less precision and were only 10-12 characters long. This was resolved by having OutbackCDX pad those shorter dates with zeros.
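The idea behind that fix can be shown in a few lines. This is only an illustration, not OutbackCDX's actual code; the function name is made up, and padding on the right (treating the missing minutes or seconds as zero) is my assumption about what padding means here.

```python
def pad_timestamp(timestamp, width=14):
    """Right-pad a short capture timestamp with zeros up to `width` digits."""
    if not timestamp.isdigit():
        raise ValueError(f"not a numeric timestamp: {timestamp!r}")
    if len(timestamp) > width:
        raise ValueError(f"timestamp longer than {width} digits: {timestamp!r}")
    return timestamp.ljust(width, "0")
```

A 10 digit value such as `2001020304` (year, month, day, hour) becomes `20010203040000`, while a full 14 digit timestamp passes through unchanged.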

I also discovered some inconsistencies in the metadata supplied along with the query results. OpenWayback expected some fields that were either missing or misnamed. These were a little tricky to track down, as they only affected some aspects of OpenWayback, most notably the metadata in the banner inserted at the top of each page. All of this has been resolved.

Lastly, I ran into an issue related not to OpenWayback, but to PyWb. It stemmed from the fact that my CDXs are not generated in the 11 column CDX format. The 11 column format includes the compressed size of the WARC record holding the resource. OutbackCDX was recording this value as 0 when absent. Unfortunately, PyWb didn’t like this and would fail to load such resources. Again, Alex helped me resolve this.
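The difference between the two formats is easy to see when parsing a line. The sketch below assumes the commonly used field orders for 9 and 11 column CDX; the field names and the helper are my own, and it only illustrates how an absent size differs from a size of 0, not how the fix was actually made.

```python
# Assumed field orders for the two common CDX layouts.
FIELDS_9 = ["urlkey", "timestamp", "original", "mimetype", "status",
            "digest", "redirect", "offset", "filename"]
FIELDS_11 = ["urlkey", "timestamp", "original", "mimetype", "status",
             "digest", "redirect", "metatags", "length", "offset", "filename"]


def parse_cdx_line(line):
    """Split a CDX line into named fields; 'length' is None when the format lacks it."""
    values = line.split()
    if len(values) == 11:
        names = FIELDS_11
    elif len(values) == 9:
        names = FIELDS_9
    else:
        raise ValueError(f"expected 9 or 11 columns, got {len(values)}")
    record = dict(zip(names, values))
    record.setdefault("length", None)  # absent is not the same as zero
    return record
```

A consumer that sees `None` knows the size is unknown and can read the record without a length, whereas a literal 0 reads as a claim that the record is empty.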

OutbackCDX 0.9.1 is now the most recent release, and includes the fixes to all the issues I encountered.

Summary

Having gone through all of this, I feel fairly confident that swapping in OutbackCDX to replace a ‘regular’ CDX index for OpenWayback is very doable for most installations. And the benefits are considerable.

The size of the OutbackCDX index on disk ended up being about 270 GB. As noted before, the existing CDX index powering our OpenWayback was 1.4 TB, so this is a reduction of more than 80%. OpenWayback also feels notably snappier after the upgrade, and updates are much easier.
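For anyone who wants to check the arithmetic, here it is spelled out. It assumes decimal units (1 TB = 1000 GB) and uses the rounded figures quoted in this post, so the per-URL numbers are rough estimates only.

```python
flat_cdx_gb = 1400   # 1.4 TB flat CDX index, assuming 1 TB = 1000 GB
outback_gb = 270     # OutbackCDX index on disk
url_count = 4e9      # roughly 4 billion URLs

# Share of disk space saved by moving to OutbackCDX.
reduction_percent = (1 - outback_gb / flat_cdx_gb) * 100

# Approximate bytes of index per URL, before and after.
flat_bytes_per_url = flat_cdx_gb * 1e9 / url_count
outback_bytes_per_url = outback_gb * 1e9 / url_count

print(f"reduction: {reduction_percent:.1f}%")
print(f"bytes per URL: {flat_bytes_per_url:.1f} -> {outback_bytes_per_url:.1f}")
```

That works out to a reduction of about 80.7%, or from roughly 350 bytes per URL down to under 70.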

Our OpenWayback at https://vefsafn.is is now fully powered by OutbackCDX.

Next we will be looking at replacing it with PyWb. I’ll write more about that later, once we’ve made more progress, but I will say that having it run on the same OutbackCDX proved trivial to accomplish, and we now have a beta website up, using PyWb, at http://beta.vefsafn.is.

Search results in Vefsafn.is (beta) that uses PyWb.


OpenWayback to pywb Transition Guide and pywb update

By Ilya Kreymer, Lead Software Engineer at Webrecorder Software

Earlier this year, the IIPC, after an internal survey, recommended the adoption of Webrecorder pywb as the primary replay system for their members’ web archives. Webrecorder and IIPC established a multi-part collaboration to help with this transition and advance the development of pywb.

To meet these goals, I’m excited to announce the launch of an official guide for migrating from OpenWayback to Webrecorder pywb, available at:

https://pywb.readthedocs.io/en/latest/manual/owb-transition.html

This guide was created with input from IIPC members and marks the completion of the first package of the IIPC project on pywb. This guide is now part of the standard pywb documentation and provides examples of various OpenWayback configurations and how they can be adapted to analogous options in pywb. The guide covers updating the index, WARC storage and exclusion systems to run in pywb with minimal changes.

For best results, deployment of OutbackCDX, an open-source standalone web archive indexing system developed by the National Library of Australia, alongside pywb is the recommended setup for managing web archive indexes. See the guide for more details and additional options.

Sample Deployment Configurations

With the guide, pywb now also includes a few working deployments (via Docker Compose) of running pywb with Nginx, Apache and OutbackCDX.

These deployments will be part of the upcoming pywb release and will be updated as pywb and configuration options evolve.

Next Steps

Next on the immediate roadmap for pywb is an upcoming release, which will feature numerous fixes in addition to the guide (see the pywb CHANGELIST for more details on upcoming and new features).

The next iteration of pywb, which will be released in the first half of 2021, will include improved support for access controls, including a time-based access ‘embargo’, location-based access controls, and improved support for localization, in line with the work outlined in pywb project Package B.

Feedback Wanted!

We hope the guide will be useful for those updating from OpenWayback to pywb. We are also looking for input from IIPC members about any use cases for improved access control and localization for the next iteration.

If you have any questions, run into issues, or find anything missing, please send feedback to pywb[at]iipc.simplelists.com or directly to Webrecorder, via email or via the forum.

The Future of Playback

By Kristinn Sigurðsson, Head of IT at the National and University Library of Iceland and the Lead of the IIPC Tools Development Portfolio

It is difficult to overstate the importance of playback in web archiving. While it is possible to evaluate and make use of a web archive via data mining, text extraction and analysis, and so on, the ability to present the captured content in its original form, enabling human inspection of the pages, remains essential. A good playback tool opens up a wide range of practical use cases for the general public, facilitates non-automated quality assurance efforts and (sometimes most importantly) creates a highly visible “face” to our efforts.

OpenWayback

Over the last decade or so, most IIPC members who operate their own web archives in-house have relied on OpenWayback, even before it acquired that name. Recognizing the need for a playback tool and the prevalence of OpenWayback, the IIPC has been supporting OpenWayback in a variety of ways over the last five or six years. Most recently, Lauren Ko (UNT), a co-lead of the IIPC’s Tools Development Portfolio, has shepherded work on OpenWayback and pushed out new releases (thanks Lauren!).

Unfortunately, it has been clear for some time that OpenWayback would require a ground-up rewrite if it were to be continued. The software, now almost a decade and a half old, is complicated and archaic. Adding features is nearly impossible, and bug fixes often require exceptional effort. This has led to OpenWayback falling behind as web material evolves, with its replay fidelity fading.

As there was no prospect for the IIPC to fund a full rewrite, the Tools Development Portfolio, along with other interested IIPC members, began to consider alternatives. As far as we could see, there was only one viable contender on the market, Pywb.

Survey

Last fall the IIPC sent out a survey to our members to get some insights into the playback software that is currently being used, plans to transition to Pywb, and the key roadblocks preventing IIPC members from adopting Pywb. The IIPC also organised online calls for members and got feedback from institutions that had already adopted Pywb.

Unsurprisingly, these consultations with the membership confirmed the current importance of OpenWayback. The results also showed a general interest in adopting Pywb, whilst highlighting a number of likely hurdles our members faced in that change. Consequently, we decided to move ahead with the decision to endorse Pywb as a replay solution and work to support IIPC members’ adoption of Pywb.

The members of the IIPC’s Tools Development Portfolio then analyzed the results of the survey and, in consultation with Ilya Kreymer, came up with a list of requirements that, once met, would make it much easier for IIPC members to adopt Pywb. These requirements were then divided into three work packages to be delivered over the next year.

Pywb

Over the last few years, Pywb has emerged as a capable alternative to OpenWayback. In some areas of playback it is better than, or at least comparable to, OpenWayback, having been updated to account for recent developments in web technology. As it is more modern and still actively maintained, the gap between it and OpenWayback is only likely to grow. As it is also open source, it makes for a reasonable alternative for the IIPC to support as the new “go-to” replay tool.

However, while Pywb’s replay abilities are impressive, it is far from a drop-in replacement for OpenWayback. Notably, OpenWayback offers more customization and localization support than Pywb. There are also many differences between the two tools that make migration from one to the other difficult.

To address this, the IIPC has signed a contract with Ilya Kreymer, the maintainer of the web archive replay tool Pywb. The IIPC will be providing financial support for the development of key new features in Pywb.

Planned work

The first work package will focus on developing a detailed migration guide for existing OpenWayback users. This will include example configuration for common cases and cover diverse backend setups (e.g. CDX vs. ZipNum vs. OutbackCDX).

The second package will include Pywb improvements to make it more modular, extended support and documentation for localization, and extended access control options.

The third package will focus on customization and integration with existing services. It will also bring in some improvements to the Pywb “calendar page” and “banner”, bringing to them features now available in OpenWayback.

There is clearly more work that can be done on replay. The ever fluid nature of the web means we will always be playing catch-up. As work progresses on the work packages mentioned above, we will be soliciting feedback from our community. Based on that, we will consider how best to meet those challenges.
