The Future of Playback

By Kristinn Sigurðsson, Head of IT at the National and University Library of Iceland and the Lead of the IIPC Tools Development Portfolio

It is difficult to overstate the importance of playback in web archiving. While it is possible to evaluate and make use of a web archive via data mining, text extraction and analysis, and so on, the ability to present the captured content in its original form, enabling human inspection of the pages, remains essential. A good playback tool opens up a wide range of practical use cases for the general public, facilitates non-automated quality assurance efforts and (sometimes most importantly) creates a highly visible “face” for our efforts.

OpenWayback

Over the last decade or so, most IIPC members who operate their own web archives in-house have relied on OpenWayback, even before it acquired that name. Recognizing the need for a playback tool and the prevalence of OpenWayback, the IIPC has been supporting OpenWayback in a variety of ways over the last five or six years. Most recently, Lauren Ko (UNT), a co-lead of the IIPC’s Tools Development Portfolio, has shepherded work on OpenWayback and pushed out new releases (thanks Lauren!).

Unfortunately, it has been clear for some time that OpenWayback would require a ground-up rewrite if it were to be continued. The software, now almost a decade and a half old, is complicated and archaic. Adding features is nearly impossible and bug fixes often require exceptional effort. This has led to OpenWayback falling behind as web material evolves, its replay fidelity fading.

As there was no prospect for the IIPC to fund a full rewrite, the Tools Development Portfolio, along with other interested IIPC members, began to consider alternatives. As far as we could see, there was only one viable contender on the market, Pywb.

Survey

Last fall the IIPC sent out a survey to our members to gain insight into the playback software currently in use, plans to transition to Pywb, and the key roadblocks preventing IIPC members from adopting Pywb. The IIPC also organised online calls for members and got feedback from institutions that had already adopted Pywb.

Unsurprisingly, these consultations with the membership confirmed the – current – importance of OpenWayback. The results also showed a general interest in adopting Pywb whilst highlighting a number of likely hurdles our members would face in that change. Consequently, we decided to move ahead with the decision to endorse Pywb as a replay solution and work to support IIPC members’ adoption of Pywb.

The members of the IIPC’s Tools Development Portfolio then analyzed the results of the survey and, in consultation with Ilya Kreymer, came up with a list of requirements that, once met, would make it much easier for IIPC members to adopt Pywb. These requirements were then divided into three work packages to be delivered over the next year.

Pywb

Over the last few years, Pywb has emerged as a capable alternative to OpenWayback. In some areas of playback it is better than, or at least comparable to, OpenWayback, having been updated to account for recent developments in web technology. As Pywb is more modern and still actively maintained, the gap between it and OpenWayback is only likely to grow. As it is also open source, it makes for a reasonable alternative for the IIPC to support as the new “go-to” replay tool.

However, while Pywb’s replay abilities are impressive, it is far from a drop-in replacement for OpenWayback. Notably, OpenWayback offers more customization and localization support than Pywb. There are also many differences between the two tools that make migration from one to the other difficult.

To address this, the IIPC has signed a contract with Ilya Kreymer, the maintainer of the web archive replay tool Pywb. The IIPC will be providing financial support for the development of key new features in Pywb.

Planned work

The first work package will focus on developing a detailed migration guide for existing OpenWayback users. This will include example configuration for common cases and cover diverse backend setups (e.g. CDX vs. ZipNum vs. OutbackCDX).

The second package will deliver Pywb improvements to make it more modular, extended support and documentation for localization, and extended access control options.

The third package will focus on customization and integration with existing services. It will also bring in some improvements to the Pywb “calendar page” and “banner”, bringing to them features now available in OpenWayback.

There is clearly more work that can be done on replay. The ever fluid nature of the web means we will always be playing catch-up. As work progresses on the work packages mentioned above, we will be soliciting feedback from our community. Based on that, we will consider how best to meet those challenges.

Pywb at the Australian Web Archive: notes on the migration

Last year the National Library of Australia (NLA) launched the revamped Australian Web Archive (AWA) which expanded their older PANDORA selective archive with comprehensive snapshots of the .au domain. The AWA is full-text searchable through Trove, a single discovery service for the collections of Australia’s libraries, museums, archives and other cultural institutions. To replay archived pages, AWA used Java-based OpenWayback, but NLA’s technical team is now transitioning to Python Wayback (pywb).


By Alex Osborne, Web Archive Technical Lead at National Library of Australia and Co-Lead of the IIPC Tools Development Portfolio

We recently migrated the Australian Web Archive (AWA) from OpenWayback to Pywb in order to take advantage of Pywb’s better replay fidelity, particularly for JavaScript-heavy websites. First I must give some background about the existing architecture. The Trove-branded user interface to the web archive is a separate web application which displays Wayback in an iframe.

The old architecture

We originally chose to write a separate application rather than customising OpenWayback’s banner templates for a couple of reasons:

  • We thought it would make updating wayback and the UI independently of each other easier. This is particularly important as the web archive backend and the Trove user interface are managed and developed by different teams with different development processes and release cycles.
  • It allows replayed content to live on a different origin (domain) to the UI. This is an important security measure to prevent archived content from being able to interfere with the UI. While we didn’t take advantage of this in the original release, it’s something I’ve long wanted to implement.
  • We had in mind from the beginning that we may eventually want to swap OpenWayback out for another replay tool and it’d be nice not to have to rewrite our UI in order to do it.

While it has caused a few problems in the past (redirect loops, PDF plugins), this architecture made the transition to Pywb straightforward. Pywb out of the box renders its own UI with an iframe, so it was close to a drop-in replacement for us. There were a few small problems we encountered along the way, though most were of our own making rather than Pywb’s.

Problem 1: Notifying the UI when a navigation event happens

In the AWA’s initial release, archived content and UI were both served from the same domain name. This meant that the browser allowed the UI’s JavaScript to reach inside the iframe and access the archived page. The Trove-Web UI was therefore able to listen to the iframe’s load event and even intercept click events. When the iframe loaded, we could inspect the page’s title to update it in the UI and extract the current URL and timestamp from the iframe’s URL.

While this was convenient and would have worked with Pywb if we kept it on the same domain, it also meant archived content could do the same to our UI! We never encountered anyone doing this in the wild, but we’ve always been a little worried that the web archive could be abused for attacks like phishing.

This meant we needed a replacement way for the UI to get information about what was happening inside the replay iframe. Fortunately, Pywb has already solved this problem: it uses Window.postMessage() to send a message like this when the archived page loads.

{
    "wb_type":"load",
    "url":"http://www.example.com/",
    "ts":"20060821035730",
    "title":"Example Web Page",
    "icons":[],
    "request_ts":"20060821035730",
    "is_live":false,
    "readyState":"interactive"
}
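
A minimal sketch of how a UI served from a different origin might consume such a message follows. It is not the Trove code: the origin https://replay.example and the function names are assumptions made for illustration, and only fields that appear in the message above are used. Checking the sender's origin before trusting the data is general postMessage practice rather than something the post describes.

```javascript
// Hypothetical origin of the isolated replay domain; substitute your own.
const REPLAY_ORIGIN = 'https://replay.example';

// Convert a 14-digit Wayback timestamp (YYYYMMDDhhmmss, UTC) to a Date.
// Returns null if the string does not have that shape.
function parseWaybackTimestamp(ts) {
  const m = /^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/.exec(ts);
  if (!m) return null;
  const [year, month, day, hour, minute, second] = m.slice(1).map(Number);
  return new Date(Date.UTC(year, month - 1, day, hour, minute, second));
}

// Return the fields the UI needs, or null if the message should be ignored.
function readLoadMessage(event, trustedOrigin = REPLAY_ORIGIN) {
  // Ignore messages that do not come from the replay frame's origin.
  if (event.origin !== trustedOrigin) return null;
  const data = event.data;
  if (!data || typeof data !== 'object' || data.wb_type !== 'load') return null;
  return {
    url: data.url,
    title: data.title,
    captured: parseWaybackTimestamp(data.ts),
  };
}

// In a browser, the function would be wired up roughly like this.
if (typeof window !== 'undefined') {
  window.addEventListener('message', (event) => {
    const page = readLoadMessage(event);
    if (page) document.title = page.title;
  });
}
```

Keeping the parsing in a plain function, separate from the event listener, makes it possible to exercise the logic outside a browser.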

One gotcha I encountered though was that Pywb doesn’t send a postMessage when displaying error pages. Our custom UI intercepts OpenWayback’s not found errors in order to display a more detailed message suggesting alternative ways to find the content or an explanatory message about restricted content.

I worked around that by including a custom templates/not_found.html which sends the load message:

<script>
    parent.postMessage({
        'wb_type': 'load',
        'url': '{{ url }}',
        'ts': location.href.split('/')[4].replace('mp_',''),
        'title': 'Webpage snapshot not found',
        'status': 404
    }, '*');
</script>
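
The expression location.href.split('/')[4] relies on the replay URL having the shape https://host/collection/timestamp-plus-modifier/original-URL, so that the fifth slash-separated piece is the timestamp. A slightly more defensive variant is sketched here under that same assumption; the function name and the collection name awa are hypothetical, and the two-letter modifier pattern (such as mp_) is an assumption about the URL scheme rather than something taken from Pywb's code.

```javascript
// Extract the capture timestamp from a replay URL of the assumed form
//   https://<host>/<collection>/<timestamp><modifier>/<original URL>
// Returns null when the path carries no timestamp, instead of returning
// a piece of the original URL as a plain split('/')[4] would.
function timestampFromReplayUrl(href) {
  const { pathname } = new URL(href);
  // One path segment for the collection, then up to 14 digits,
  // optionally followed by a two-letter modifier such as "mp_".
  const m = /^\/[^/]+\/(\d{1,14})(?:[a-z]{2}_)?\//.exec(pathname);
  return m ? m[1] : null;
}

console.log(
  timestampFromReplayUrl('https://replay.example/awa/20060821035730mp_/http://www.example.com/')
); // 20060821035730
```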

Problem 2: Accessing HTML meta tags

Trove’s user interface (and this is not specific to the web archive) has a tab which shows how to cite the item you’re looking at when referencing it in an academic context or on Wikipedia. The original implementation inspected the contents of the iframe on the client side for HTML meta tags, pulling out information such as an author or publisher’s name. This of course also broke when we moved wayback to a separate security origin and, unlike the URL and page title, the page’s meta tags are not included in Pywb’s load message.

Rather than trying to provide JavaScript access to the page content and potentially undoing some of the isolation we were trying to introduce, I moved the meta tag extraction server side, translating the code from JavaScript/DOM to Java/JSoup.

Problem 3: Multiple access points

The library has a take-down procedure for restricting access to content under certain circumstances. Restricted content can have several different policies applied to it. Content can be fully public, accessible only to staff, accessible on-site in the reading room or fully restricted.

OpenWayback also allows the URL used for routing incoming requests (accessPointPath) to be configured separately from the one generated when rewriting links (replayPrefix). So our original implementation simply configured three access points under paths like /public, /onsite and /staff but with the generated links all at /wayback.

<beans>
    <bean name="publicaccesspoint" parent="standardaccesspoint">
        <property name="accessPointPath" value="http://backend.example:8080/public/"/>
        <property name="replayPrefix" value="https://frontend.example/wayback/"/>
        <property name="collection" ref="publiccollection"/>
    </bean>

    <bean name="onsiteaccesspoint" parent="standardaccesspoint">
        <property name="accessPointPath" value="http://backend.example:8080/onsite/"/>
        <property name="replayPrefix" value="https://frontend.example/wayback/"/>
        <property name="collection" ref="onsitecollection"/>
    </bean>

    <bean name="staffaccesspoint" parent="standardaccesspoint">
        <property name="accessPointPath" value="http://backend.example:8080/staff/"/>
        <property name="replayPrefix" value="https://frontend.example/wayback/"/>
        <property name="collection" ref="staffcollection"/>
    </bean>
</beans>

Our frontend webserver (nginx) had a set of IP range rules which mapped client addresses to the different access locations and rewrote the path accordingly:

rewrite /wayback/(.*) /$webarchive_access_point/$1 break;
proxy_pass http://openwayback;

I couldn’t find a way to replicate this configuration as Pywb appears to only have one parameter – the collection name – used for both routing and link generation. (Aside: perhaps it’s possible by using uwsgi and overriding SCRIPT_NAME but I couldn’t figure it out.) Therefore rather than running a single instance of Pywb with multiple collections configured, we ended up with a separate instance of Pywb for each access point and have nginx route requests to the appropriate port. It’s a little more complex to deploy than I’d like but works well enough.

upstream pywb-public { server backend.example:8080; }
upstream pywb-staff  { server backend.example:8081; }
upstream pywb-onsite { server backend.example:8082; }

proxy_pass http://pywb-$webarchive_access_point;
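
To make the routing decision explicit, the logic nginx performs here can be sketched in plain JavaScript. This is an illustration only, not part of the deployment: the IP ranges are invented (the real rules are not given), the function names are hypothetical, and only IPv4 is handled. The upstream addresses mirror the nginx upstream blocks above.

```javascript
// Hypothetical IP ranges standing in for the real access rules.
const ACCESS_RULES = [
  { cidr: '10.1.0.0/16', accessPoint: 'staff' },
  { cidr: '10.2.0.0/16', accessPoint: 'onsite' },
];

// One Pywb instance per access point, as in the upstream blocks above.
const UPSTREAMS = {
  public: 'http://backend.example:8080',
  staff: 'http://backend.example:8081',
  onsite: 'http://backend.example:8082',
};

// Convert a dotted-quad IPv4 address to a number.
// Plain arithmetic avoids the sign problems of 32-bit bitwise operators.
function ipv4ToInt(ip) {
  return ip.split('.').reduce((acc, octet) => acc * 256 + Number(octet), 0);
}

// True if the address falls inside the CIDR block.
function inCidr(ip, cidr) {
  const [base, bits] = cidr.split('/');
  const blockSize = 2 ** (32 - Number(bits));
  return Math.floor(ipv4ToInt(ip) / blockSize) === Math.floor(ipv4ToInt(base) / blockSize);
}

// First matching rule wins; anything unmatched is treated as public.
function accessPointFor(ip) {
  const rule = ACCESS_RULES.find((r) => inCidr(ip, r.cidr));
  return rule ? rule.accessPoint : 'public';
}

// Equivalent of proxy_pass http://pywb-$webarchive_access_point.
function upstreamFor(ip) {
  return UPSTREAMS[accessPointFor(ip)];
}

console.log(upstreamFor('10.1.5.9'));    // http://backend.example:8081
console.log(upstreamFor('203.0.113.7')); // http://backend.example:8080
```

Defaulting unmatched addresses to the public instance means a gap in the rules can only ever reduce access, never widen it.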

Problem 4: Hiding the UI for thumbnails

Trove’s collection browse path displays thumbnails of the archived sites. These are generated by a web service that wraps Chromium’s headless mode. Obviously, we don’t want the UI of the archive to be visible in the thumbnails. But if Chromium loads the URL of Pywb directly, there’s a JavaScript redirect back to the Trove UI. This exists so that if a user opens an archived link in a new tab, they get the web archive’s UI in that new tab too rather than just the contents of the replay frame.

In our original implementation we kind of hacked around this for screenshots by passing a magic flag as a URL fragment, which the JavaScript redirect looked for as an indicator not to redirect to the UI. This time, though, the redirect was being done by Pywb itself rather than our template customisations. Plus, we don’t really want to expose a way of hiding the UI entirely, again due to the risk of the archive being abused for phishing.

I hoped to set up a second Pywb collection with framed_replay: false and a blank banner, but was thwarted as that can only be specified at the top level. So yes, you guessed it, we’re now up to four instances of Pywb. ¯\_(ツ)_/¯

Problem 5: The PDF workaround that broke

In the original iframe implementation we encountered problems with displaying PDFs in an iframe in some browsers. The developers ended up working around this by embedding PDF.js rather than relying on the browser’s rendering. This broke when switching to Pywb on an isolated domain, as the interception was based on an onClick handler injected into the iframe and the PDF.js viewer can’t load documents from a different origin without special configuration. Fortunately, it seems either Pywb does something differently or, more likely, browsers now handle PDFs in iframes better, so I was able to just disable the whole PDF interception thing.

Funnily enough, we do still use PDF.js for generating thumbnails though. Chromium’s PDF viewer doesn’t work in headless mode. While it would probably be more efficient to use a native PDF renderer, just using PDF.js lets us reuse the same thumbnail generation logic and also piggyback on the browser’s security sandbox.

Problem 6: Reusing OpenWayback’s manhattan graph

I discovered, a little to my surprise, that Trove’s visualisation of captures over time was actually using OpenWayback’s server-side manhattan graph renderer. There’s no exact equivalent of this in Pywb and I didn’t want to keep a running instance of OpenWayback for that alone. Fortunately the graph renderer is standalone and could be incorporated into Trove-Web directly.

Screenshot of manhattan graph

Conclusion

Overall the problems we encountered were relatively minor. Pywb’s built-in frames support allowed us to eliminate virtually all the modifications and custom templates we’d had to make for OpenWayback. If there’s one wishlist item I have for Pywb, it’s to allow more options to be overridden at a per-collection level, not just the top level. The migration fixed a large set of longstanding delivery problems for us, including key websites like the Sydney 2000 Olympic Games. I encourage other archives looking to improve the quality of their replay to make the switch.

 

New OpenWayback lead

By Lauren Ko, University of North Texas Libraries

In response to IIPC’s call, I have volunteered to take on a leadership role in the OpenWayback project. Having been involved with web archives since 2008 as a programmer at the University of North Texas Libraries, I expect my experience working with OpenWayback, Heritrix, and WARC files, as well as writing code to support my institution’s broader digital library initiatives, to aid me in this endeavor.

Over the past few years, the web archiving community has seen much development in the area of access related projects such as pywb, Memento, ipwb, and OutbackCDX – to name a few. There is great value in a growing number of available tools written in different languages/running in different environments. In line with this, we would like to keep the OpenWayback project’s development moving forward while it remains of use. Further, we hope to facilitate development of access related standards and APIs, interoperability of components such as index servers, and compatibility of formats such as CDXJ.

Moving OpenWayback forward will take a community. With Kristinn Sigurðsson soon relinquishing his leadership position, we are seeking a co-leader for the OpenWayback project. We also continue to need people to contribute code, provide code review, and test deployments. I hope this community will continue not only to develop access tools, but also access to those tools, encouraging and supporting newcomers via mailing lists and Slack channels as they begin building and interacting with web archives.

If your institution uses OpenWayback, please consider:

If you are interested in taking a co-leadership role in this project or are otherwise interested in helping with OpenWayback and IIPC’s access related initiatives, even if you don’t know how that might be, I welcome you to contact me by the name lauren.ko via IIPC Slack or email me at lauren.ko@unt.edu.

Wanted: New Leaders for OpenWayback

By Kristinn Sigurðsson, National and University Library of Iceland

The IIPC is looking for one or two people to take on a leadership role in the OpenWayback project.

The OpenWayback project is responsible not only for the widely used OpenWayback software, but also for the underlying webarchive-commons library. In addition the OpenWayback project has been working to define access related APIs.

The OpenWayback project thus plays an important role in the IIPC’s efforts to foster the development and use of common tools and standards for web archives.

Why now?

The OpenWayback project is at a crossroads. The IIPC first took on this project three years ago with the initial objective to make the software easier to install, run and manage. This included cleaning up the code and improving documentation.

Originally this work was done by volunteers in our community. About two years ago the IIPC decided to fund a developer to work on it. The initial funding was for 16 months. With this we were able to complete the task of stabilizing the software as evidenced by the release of OpenWayback 2.0.0 through 2.3.0.

We then embarked on a somewhat more ambitious task to improve the core of the software. A significant milestone in that work is now being completed, as a new ‘CDX server’, or resource resolver, is introduced. You can read more about that here.

This marks the end of the paid position (at least for the time being). The original 16 months wound up being spread over a somewhat longer time frame, but they are now exhausted. Currently, the National Library of Norway (who hosted the paid developer) is contributing, for free, the work to finalize the new resource resolver.

I’ve been guiding the project over the last year since the previous project leader moved on. While I was happy to assume this role to ensure that our funded developer had a functioning community, I felt like I was never able to give the project the kind of attention that is needed to grow it. Now it seems to be a good time for a change.

With the end of the paid position we are now at a point where there either needs to be a significant transformation of the project or it will likely die away, bit by bit, which is a shame bearing in mind the significance of the project to the community and the time already invested in it.

Who are we looking for?

While a technical background is certainly useful it is not a primary requirement for this role. As you may have surmised from the above, building up this community will definitely be a part of the job. Being a good communicator, manager and organizer may be far more important at this stage.

Ideally, I’d like to see two leads with complementary skill sets, technical and communications/management. Ultimately, the most important requirement is a willingness and ability to take on this challenge.

You’ll not be alone. Aside from your prospective co-lead, there is an existing community to build on, notably when it comes to the technical aspects of the project. You can get a feel for the community on the OpenWayback Google Group and the IIPC GitHub page.

It would be simplest if the new leads were drawn from IIPC member institutions. We may, however, be willing to consider a non-member, especially as a co-lead, if they are uniquely suited for the position.

If you would like to take up this challenge and help move this project forward, please get in touch. My email is kristinn (at) landsbokasafn (dot) is.

There is no deadline, as such, but ideally I’d like the new leads to be in place prior to our next General Assembly in Lisbon next March.

Update on OpenWayback

OpenWayback 2.2.0 was recently released. This marks OpenWayback’s third release since becoming a ward of the IIPC in late 2013. This is a fairly modest update and reflects our desire to make frequent, modest sized releases. A few things are still worth pointing out.

First, as of this release, OpenWayback requires Java 7. Java 7 has been out for four years and Java 6 has not been publicly updated in over two years. It is time to move on.

Second, OpenWayback now officially supports internationalized domain names, i.e. domain names containing non-ASCII characters.
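
As a small illustration of what an internationalized domain name looks like in practice, the sketch below uses JavaScript's standard URL parser (not OpenWayback's Java code) to show that a non-ASCII hostname has an equivalent ASCII "Punycode" form. A replay tool presumably has to recognise both forms as the same host; how OpenWayback does so internally is not covered here. The hostname bücher.example and the function name are just examples.

```javascript
// Convert a URL's hostname to its ASCII (Punycode) form using the
// standard WHATWG URL parser available in browsers and Node.js.
function asciiHostname(url) {
  return new URL(url).hostname;
}

console.log(asciiHostname('http://bücher.example/')); // xn--bcher-kva.example
```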

Third, UI localization has been much improved. It should now be possible to translate the entire interface without having to mess with the JSP files and otherwise “go under the hood”.

And the last thing I’ll mention is the new WatchedCDXSource which removes the need to enumerate all the CDX files you wish to use. Simply designate a folder and OpenWayback will pick up all the CDX files in it.

The road to here hasn’t been easy, but it is encouraging to see that the number of people involved is slowly but surely rising. For the 2.2.0 release, we had code contributions from Roger Coram (BL), Lauren Ko (UNT), John Erik Halse (NLN), Sawood Alam (ODU), Mohamed Elsayed (BA) and myself in addition to the IIPC-funded work by Roger Mathisen (NLN). Even more people were involved in reporting issues, managing the project and testing the release candidate. My thanks to everyone who helped out.

And going forward, we are certainly going to need people to help out.

Version 2.3.0 of OpenWayback will be another modest bundle of fixes and minor features. We hope it will be ready in September (or so). There are already 10 issues open for it as I write this.

But, we also have larger ambitions. Enter version 3.0.0. It will be developed in parallel with 2.3.0 and aims to make some big changes. Breaking changes. OpenWayback is built on an aging codebase, almost a decade old at this point. To move forward, some big changes need to be made.

The exact features to be implemented will likely shift as work progresses but we are going to increase modularity by pushing the CDXServer front and center and removing the legacy resource stores. In addition to simplifying the codebase, this fits very nicely with the talk at the last GA about APIs.

We’ll also be looking at redoing the user interface using iFrames and providing insight into the temporal drift of the page being viewed. The planned issues are available on GitHub. The list is far from locked and we welcome additional input on which features to work on.

We welcome additional work on those features even more!

I’d like to wrap this up with a call to action. We need a reasonably large community around the project to sustain it. Whether it’s testing and bug reporting, occasional development work or taking on more by becoming one of our core developers, your help is both needed and appreciated.

If you’d like to become involved, you can simply join the conversation on the OpenWayback GitHub page. Anyone can open new issues and post comments on existing issues. You can also join the OpenWayback developers mailing list.

Kristinn Sigurðsson – Head of IT at the National and University Library of Iceland – x-posted from Kris’s blog

 

What’s Next for OpenWayback

By Kristinn Sigurðsson, Head of IT at the National and University Library of Iceland. Cross-posted from his own blog

About one month ago, OpenWayback 2.1.0 was released. This was mostly a bug-fix release with a few new features merged in from Internet Archive’s Wayback development fork. For the most part, the OpenWayback effort has focused on ‘fixing’ things: making sure everything builds and runs nicely and is better documented.

I think we’ve made some very positive strides.

Work is now ongoing for version 2.2.0. Finally, we are moving towards implementing new things! 2.2.0 still has some fixing to do. For example, localization support needs to be improved. But we’re also planning to implement something new: support for internationalized domain names.

We’ve tentatively scheduled the 2.2.0 release for “spring/early summer”.

After 2.2.0 is released, the question will be which features or improvements to focus on next. The OpenWayback issue tracker on GitHub has (at the time of writing) about 60 open issues in the backlog (i.e. not assigned to a specific release).

We’re currently in the process of trying to prioritize these. Our current resources are nowhere near sufficient to resolve them all. Prioritization will involve several aspects, including how difficult they are to implement, how popular they are and, not least, how clearly they are defined.

This is where you, dear reader, can help us out by reviewing the backlog and commenting on issues you believe to be relevant to your organization. We also invite you to submit new issues if needed.

It is enough to just leave a comment that this is relevant to your organization. Even better would be to explain why it is relevant (this helps frame the solution). Where appropriate we would also welcome suggestions for how to implement the feature, notably in issues like the one about surfacing metadata in the interface.

If you really want to see a feature happen, the best way to make it happen is, of course, to pitch in.

Some of the features and improvements we are currently reviewing are:

  • Enable users to ‘diff’ different captures of an HTML page. Issue 15.
  • Enable search results with a very large number of hits. Issue 19.
  • Surface more metadata. Issues 28 and 29.
  • Enable time ranged exclusions. Issue 212.
  • Create a revisit test dataset. Issue 117.
  • Using CDX indexing as the default instead of the BDB index. Issue 132.

As I said, these are just the ones currently being considered. We’re happy to look at others if there is someone championing them.

If you’d like to join the conversation, go to the OpenWayback issue tracker on GitHub and review issues without a milestone.

If you’d like to submit a new issue, please read the instructions on the wiki. The main thing to remember is to provide ample details.

We only have so many resources available. Your input is important to help us allocate them most effectively.