Pywb at the Australian Web Archive: notes on the migration

Last year the National Library of Australia (NLA) launched the revamped Australian Web Archive (AWA), which expanded their older PANDORA selective archive with comprehensive snapshots of the .au domain. The AWA is full-text searchable through Trove, a single discovery service for the collections of Australia’s libraries, museums, archives and other cultural institutions. To replay archived pages, AWA used Java-based OpenWayback, but NLA’s technical team is now transitioning to Python Wayback (pywb).


By Alex Osborne, Web Archive Technical Lead at National Library of Australia and Co-Lead of the IIPC Tools Development Portfolio

We recently migrated the Australian Web Archive (AWA) from OpenWayback to Pywb in order to take advantage of Pywb’s better replay fidelity, particularly for JavaScript-heavy websites. First, I must give some background about the existing architecture. The Trove-branded user interface to the web archive is a separate web application which displays Wayback in an iframe.

The old architecture

We originally chose to write a separate application rather than customising OpenWayback’s banner templates for a couple of reasons:

  • We thought it would make updating wayback and the UI independently of each other easier. This is particularly important as the web archive backend and the Trove user interface are managed and developed by different teams with different development processes and release cycles.
  • It allows for replayed content to live on a different origin (domain) to the UI. This is an important security measure to prevent archived content from being able to interfere with the UI. While we didn’t take advantage of this in the original release, it’s something I’ve long wanted to implement.
  • We had in mind from the beginning that we may eventually want to swap OpenWayback out for another replay tool and it’d be nice not to have to rewrite our UI in order to do it.

While it has caused a few problems in the past (redirect loops, PDF plugins), this architecture made the transition to Pywb straightforward. Pywb out of the box renders its own UI with an iframe, so it was close to a drop-in replacement for us. There were a few small problems we encountered along the way, though most were of our own making rather than Pywb’s.

Problem 1: Notifying the UI when a navigation event happens

In the AWA’s initial release, archived content and UI were both served from the same domain name. This meant that the browser allowed the UI’s JavaScript to reach inside the iframe and access the archived page. The Trove-Web UI therefore was able to listen to the iframe’s load event and even intercept click events. When the iframe loads we can inspect the page’s title to update it in the UI and extract the current URL and timestamp from the iframe’s URL.

While this was convenient and would have worked with Pywb if we kept it on the same domain, it also means archived content could do the same to our UI! We never encountered anyone doing this in the wild, but we’ve always been a little worried that the web archive could be abused for attacks like phishing.

This means we needed a replacement way for the UI to get information about what was happening inside the replay iframe. Fortunately, Pywb has already solved this problem: it uses Window.postMessage() to send a message like this when the archived page loads.

{
    "wb_type":"load",
    "url":"http://www.example.com/",
    "ts":"20060821035730",
    "title":"Example Web Page",
    "icons":[],
    "request_ts":"20060821035730",
    "is_live":false,
    "readyState":"interactive"
}
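On the receiving side, the UI can pick these messages up with an ordinary message listener. The following is a minimal sketch rather than Trove’s actual code: the origin https://replay.example and the names REPLAY_ORIGIN, parseWaybackTimestamp and handleReplayMessage are assumptions made for illustration, and it relies only on the fields shown in the message above.

```javascript
// Hypothetical origin that serves the replayed content (an assumption).
const REPLAY_ORIGIN = 'https://replay.example';

// Convert a 14-digit Wayback timestamp (YYYYMMDDhhmmss, conventionally UTC) into a Date.
function parseWaybackTimestamp(ts) {
    const m = /^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/.exec(ts);
    if (!m) return null;
    const [year, month, day, hour, minute, second] = m.slice(1).map(Number);
    return new Date(Date.UTC(year, month - 1, day, hour, minute, second));
}

// Turn a MessageEvent-like object into UI state, or null if it should be ignored.
function handleReplayMessage(event, expectedOrigin = REPLAY_ORIGIN) {
    // Ignore messages that do not come from the replay origin.
    if (event.origin !== expectedOrigin) return null;
    const data = event.data;
    if (!data || data.wb_type !== 'load') return null;
    return {
        url: data.url,
        title: data.title,
        captured: parseWaybackTimestamp(data.ts),
    };
}

// Browser-only wiring; skipped when the file is loaded outside a browser.
if (typeof window !== 'undefined') {
    window.addEventListener('message', (event) => {
        const state = handleReplayMessage(event);
        if (state) document.title = state.title;
    });
}
```

Checking event.origin matters here because any window can post a message to the UI; without the check, the isolation gained by moving replay to its own origin would be partly undone.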

One gotcha I encountered though was that Pywb doesn’t send a postMessage when displaying error pages. Our custom UI intercepts OpenWayback’s not found errors in order to display a more detailed message suggesting alternative ways to find the content or an explanatory message about restricted content.

I worked around that by including a custom templates/not_found.html which sends the load message:

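<script>
    // Pywb sends no load message for error pages, so post one to the parent UI ourselves.
    parent.postMessage({
        'wb_type': 'load',
        'url': '{{ url }}',
        // The timestamp is the fifth '/'-separated piece of the replay URL; strip the 'mp_' modifier.
        'ts': location.href.split('/')[4].replace('mp_',''),
        'title': 'Webpage snapshot not found',
        // Extra field, absent from the load message shown earlier, so the UI can recognise the error.
        'status': 404
    }, '*');
</script>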
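The split('/')[4] expression depends on the timestamp being the fifth slash-separated piece of the address. A slightly more defensive variant, sketched below, looks for the timestamp pattern in the path instead. The function name extractTimestamp and the example address are hypothetical, and the two-letter modifier pattern (such as mp_) is an assumption based on the URL form used in the template above.

```javascript
// Return the timestamp from a replay URL such as
// https://replay.example/webarchive/20060821035730mp_/http://www.example.com/
// or null when the path contains no timestamp segment.
function extractTimestamp(href) {
    const path = new URL(href).pathname;
    // Leftmost path segment of 4-14 digits, optionally followed by a modifier like "mp_".
    const m = /\/(\d{4,14})(?:[a-z]{2}_)?\//.exec(path);
    return m ? m[1] : null;
}
```

Because the leftmost match wins, digits inside the archived URL itself are only considered when no timestamp segment precedes them.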

Problem 2: Accessing HTML meta tags

Trove’s user interface (and this is not specific to the web archive) has a tab which shows how to cite the item you’re looking at when referencing it in an academic context or on Wikipedia. The original implementation of this inspected the contents of the iframe on the client side for HTML meta tags to pull out information such as an author or publisher’s name. This of course also broke when we moved wayback to a separate security origin and, unlike the URL and page title, Pywb doesn’t include the page’s meta tags in its load message.

Rather than trying to provide JavaScript access to the page content and potentially risking undoing some of the isolation we were trying to introduce, I moved the meta tag extraction server side, translating the code from JavaScript/DOM to Java/JSoup.
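To give a rough idea of what this kind of extraction involves, here is an illustrative sketch. It is written in JavaScript to match the other snippets in this post and is not the Java/JSoup code described above; the function name extractMetaTags is invented, and the regular expressions are a stand-in for a real HTML parser (they do not decode entities and will miss unusual markup).

```javascript
// Collect <meta> name/property -> content pairs from an HTML string.
function extractMetaTags(html) {
    const meta = new Map();
    const tagPattern = /<meta\b[^>]*>/gi;
    // attribute="double quoted", attribute='single quoted' or attribute=bare
    const attrPattern = /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/g;
    for (const [tag] of html.matchAll(tagPattern)) {
        const attrs = {};
        for (const [, name, dq, sq, bare] of tag.matchAll(attrPattern)) {
            attrs[name.toLowerCase()] = [dq, sq, bare].find((v) => v !== undefined);
        }
        const key = (attrs.name || attrs.property || '').toLowerCase();
        // Keep the first occurrence of each key; skip tags without content.
        if (key && attrs.content !== undefined && !meta.has(key)) {
            meta.set(key, attrs.content);
        }
    }
    return meta;
}
```

A citation tab could then read values such as extractMetaTags(html).get('author') and fall back to other fields when a tag is missing.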

Problem 3: Multiple access points

The library has a take-down procedure for restricting access to content under certain circumstances. Restricted content can have several different policies applied to it. Content can be fully public, accessible only to staff, accessible on-site in the reading room or fully restricted.

OpenWayback also allows the URL used for routing incoming requests (accessPointPath) to differ from the one that’s generated when rewriting links (replayPrefix). So our original implementation simply configured three access points under paths like /public, /onsite and /staff but with the generated links all at /wayback.

<beans>
    <bean name="publicaccesspoint" parent="standardaccesspoint">
        <property name="accessPointPath" value="http://backend.example:8080/public/"/>
        <property name="replayPrefix" value="https://frontend.example/wayback/"/>
        <property name="collection" ref="publiccollection" />
    </bean>

    <bean name="onsiteaccesspoint" parent="standardaccesspoint">
        <property name="accessPointPath" value="http://backend.example:8080/onsite/"/>
        <property name="replayPrefix" value="https://frontend.example/wayback/"/>
        <property name="collection" ref="onsitecollection" />
    </bean>

    <bean name="staffaccesspoint" parent="standardaccesspoint">
        <property name="accessPointPath" value="http://backend.example:8080/staff/"/>
        <property name="replayPrefix" value="https://frontend.example/wayback/"/>
        <property name="collection" ref="staffcollection" />
    </bean>
</beans>

Our frontend web server (nginx) has a set of IP range rules which determine the access location of each request, and it rewrote the path accordingly:

rewrite /wayback/(.*) /$webarchive_access_point/$1 break;
proxy_pass http://openwayback;

I couldn’t find a way to replicate this configuration as Pywb appears to only have one parameter – the collection name – used for both routing and link generation. (Aside: perhaps it’s possible by using uwsgi and overriding SCRIPT_NAME but I couldn’t figure it out.) Therefore rather than running a single instance of Pywb with multiple collections configured, we ended up with a separate instance of Pywb for each access point and have nginx route requests to the appropriate port. It’s a little more complex to deploy than I’d like but works well enough.

upstream pywb-public { server backend.example:8080; }
upstream pywb-staff  { server backend.example:8081; }
upstream pywb-onsite { server backend.example:8082; }

proxy_pass http://pywb-$webarchive_access_point;
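The routing decision nginx makes here can be modelled in a few lines of JavaScript. This is only an illustration of the logic, not the actual nginx configuration: the IP ranges are placeholder documentation addresses, the names ACCESS_RULES, inCidr and upstreamFor are invented, and only the port numbers come from the upstream definitions above.

```javascript
// Upstream ports taken from the nginx snippet above.
const UPSTREAM_PORTS = { public: 8080, staff: 8081, onsite: 8082 };

// Placeholder ranges (RFC 5737 documentation addresses), checked in order.
const ACCESS_RULES = [
    { cidr: '192.0.2.0/24', accessPoint: 'staff' },
    { cidr: '198.51.100.0/24', accessPoint: 'onsite' },
];

// Convert a dotted-quad IPv4 address to an unsigned 32-bit number.
function ipv4ToInt(ip) {
    return ip.split('.').reduce((acc, octet) => acc * 256 + Number(octet), 0);
}

// True when the IPv4 address falls inside the CIDR block.
function inCidr(ip, cidr) {
    const [network, bitsText] = cidr.split('/');
    const bits = Number(bitsText);
    const mask = bits === 0 ? 0 : (~0 << (32 - bits)) >>> 0;
    return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(network) & mask) >>> 0);
}

// Pick the Pywb instance for a client, defaulting to the public one.
function upstreamFor(clientIp) {
    const rule = ACCESS_RULES.find((r) => inCidr(clientIp, r.cidr));
    const accessPoint = rule ? rule.accessPoint : 'public';
    return `http://backend.example:${UPSTREAM_PORTS[accessPoint]}`;
}
```

Defaulting to the public instance means an address that matches no rule only ever sees the least privileged view of the archive.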

Problem 4: Hiding the UI for thumbnails

Trove’s collection browse path displays thumbnails of the archived sites. These are generated by a web service that wraps Chromium’s headless mode. Obviously, we don’t want the UI of the archive to be visible in the thumbnails. But if Chromium loads the URL of Pywb directly, there’s a JavaScript redirect back to the Trove UI. This exists so that if a user opens an archived link in a new tab, they get the web archive’s UI in that new tab too rather than just the contents of the replay frame.

In our original implementation we kind of hacked around this for screenshots by passing a magic flag as a URL fragment that the JavaScript redirect looked for as an indicator not to redirect to the UI. This time, though, that redirect was being done by Pywb itself rather than our template customisations. Plus we don’t really want to expose a way of hiding the UI entirely, again due to the risk of the archive being abused for phishing.

I hoped to set up a second Pywb collection with framed_replay: false and a blank banner, but was thwarted as that can only be specified at the top level. So yes, you guessed it, we’re now up to four instances of Pywb. ¯\_(ツ)_/¯

Problem 5: The PDF workaround that broke

In the original iframe implementation we encountered problems with displaying PDFs in an iframe in some browsers. The developers ended up working around this by embedding PDF.js rather than relying on the browser’s rendering. This broke when switching to Pywb on an isolated domain, as the interception was based on an onClick handler injected into the iframe and the PDF.js viewer can’t load documents from a different origin without special configuration. Fortunately, it seems either Pywb does something differently or, more likely, browsers now handle PDFs in iframes better, so I was able to just disable the whole PDF interception thing.

Funnily enough, we do still use PDF.js for generating thumbnails though. Chromium’s PDF viewer doesn’t work in headless mode. While it probably would be more efficient to use some native PDF viewer, just using PDF.js lets us reuse the same thumbnail generation logic and also piggyback on the browser’s security sandbox.

Problem 6: Reusing OpenWayback’s manhattan graph

I discovered, a little to my surprise, that Trove’s visualisation of captures over time was actually using OpenWayback’s server-side manhattan graph renderer. There’s no exact equivalent of this in Pywb and I didn’t want to keep a running instance of OpenWayback for that alone. Fortunately the graph renderer is standalone and could be incorporated into Trove-Web directly.

Screenshot of manhattan graph

Conclusion

Overall the problems we encountered were relatively minor. Pywb’s built-in frames support allowed us to eliminate virtually all the modifications and custom templates we’d had to make for OpenWayback. If there’s one wishlist item I have for Pywb, it’s to allow more options to be overridden at a per-collection level, not just the top level. The migration fixed a large set of longstanding delivery problems for us, including for key websites like the Sydney 2000 Olympic Games. I encourage other archives looking to improve the quality of their replay to make the switch.

 
