Pywb at the Australian Web Archive: notes on the migration

Last year the National Library of Australia (NLA) launched the revamped Australian Web Archive (AWA), which expanded their older PANDORA selective archive with comprehensive snapshots of the .au domain. The AWA is full-text searchable through Trove, a single discovery service for the collections of Australia’s libraries, museums, archives and other cultural institutions. To replay archived pages, AWA used Java-based OpenWayback, but NLA’s technical team is now transitioning to Python Wayback (pywb).


By Alex Osborne, Web Archive Technical Lead at National Library of Australia and Co-Lead of the IIPC Tools Development Portfolio

We recently migrated the Australian Web Archive (AWA) from OpenWayback to Pywb in order to take advantage of Pywb’s better replay fidelity, particularly for JavaScript-heavy websites. First I must give some background about the existing architecture. The Trove-branded user interface to the web archive is a separate web application which displays Wayback in an iframe.

The old architecture

We originally chose to write a separate application rather than customising OpenWayback’s banner templates for a couple of reasons:

  • We thought it would make updating wayback and the UI independently of each other easier. This is particularly important as the web archive backend and the Trove user interface are managed and developed by different teams with different development processes and release cycles.
  • It allows for replayed content to live on a different origin (domain) to the UI. This is an important security measure to prevent archived content from being able to interfere with the UI. While we didn’t take advantage of this in the original release, it’s something I’ve long wanted to implement.
  • We had in mind from the beginning that we may eventually want to swap OpenWayback out for another replay tool and it’d be nice not to have to rewrite our UI in order to do it.

While it has caused a few problems in the past (redirect loops, PDF plugins), this architecture made the transition to Pywb straightforward. Pywb out of the box renders its own UI with an iframe, so it was close to a drop-in replacement for us. We encountered a few small problems along the way, though most were of our own making rather than Pywb’s.

Problem 1: Notifying the UI when a navigation event happens

In the AWA’s initial release, archived content and UI were both served from the same domain name. This meant that the browser allowed the UI’s JavaScript to reach inside the iframe and access the archived page. The Trove-Web UI therefore was able to listen to the iframe’s load event and even intercept click events. When the iframe loads we can inspect the page’s title to update it in the UI and extract the current URL and timestamp from the iframe’s URL.

While this was convenient and would have worked with Pywb if we had kept it on the same domain, it also meant archived content could do the same to our UI! We never encountered anyone doing this in the wild but we’ve always been a little worried that the web archive could be abused for attacks like phishing.

This meant we needed a replacement way for the UI to get information about what was happening inside the replay iframe. Fortunately Pywb has already solved this problem: it uses Window.postMessage() to send a message like this when the archived page loads.

{
    "wb_type":"load",
    "url":"http://www.example.com/",
    "ts":"20060821035730",
    "title":"Example Web Page",
    "icons":[],
    "request_ts":"20060821035730",
    "is_live":false,
    "readyState":"interactive"
}
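For anyone consuming these messages outside the browser (in tests or log processing, say), the fields are easy to decode. The following is a minimal Python sketch, not part of Pywb: the field names come from the message above, while the function name `parse_load_message` and the decision to treat the 14-digit timestamp as UTC are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone


def parse_load_message(raw):
    """Return (url, capture_time, title) from a Pywb-style load message.

    Returns None for messages that are not load notifications.
    (Hypothetical helper, not a Pywb API.)
    """
    msg = json.loads(raw)
    if msg.get("wb_type") != "load":
        return None
    # "ts" is a 14-digit YYYYMMDDhhmmss timestamp; UTC is assumed here.
    captured = datetime.strptime(msg["ts"], "%Y%m%d%H%M%S")
    captured = captured.replace(tzinfo=timezone.utc)
    return msg["url"], captured, msg.get("title", "")


example = json.dumps({
    "wb_type": "load",
    "url": "http://www.example.com/",
    "ts": "20060821035730",
    "title": "Example Web Page",
})

url, captured, title = parse_load_message(example)
print(url, captured.isoformat(), title)
```

Messages with any other `wb_type` are ignored rather than raising, since a listener may receive unrelated messages as well.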

One gotcha I encountered though was that Pywb doesn’t send a postMessage when displaying error pages. Our custom UI intercepts OpenWayback’s not found errors in order to display a more detailed message suggesting alternative ways to find the content or an explanatory message about restricted content.

I worked around that by including a custom templates/not_found.html which sends the load message:

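<script>
    // Error pages don't trigger Pywb's usual load message, so send an equivalent one to the parent UI.
    parent.postMessage({
        'wb_type': 'load',
        'url': '{{ url }}',
        // The timestamp is the fifth '/'-separated segment of the replay URL,
        // e.g. https://host/collection/20060821035730mp_/...; strip the mp_ modifier.
        'ts': location.href.split('/')[4].replace('mp_',''),
        'title': 'Webpage snapshot not found',
        'status': 404
    }, '*');
</script>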
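The timestamp extraction in that template relies on the shape of the replay URL. As a rough sketch of the same parsing done outside the browser, assuming URLs of the form /collection/timestamp-plus-modifier/original-URL (the host `archive.example`, the collection name `public` and the function name `split_replay_url` are made up for illustration):

```python
import re
from urllib.parse import urlsplit

# /<collection>/<14-digit timestamp><optional two-letter modifier>/<original URL>
REPLAY_PATH = re.compile(
    r"^/(?P<coll>[^/]+)/(?P<ts>\d{14})(?P<mod>[a-z]{2}_)?/(?P<url>.+)$"
)


def split_replay_url(replay_url):
    """Return (timestamp, original_url), or None if the URL doesn't match."""
    parts = urlsplit(replay_url)
    path = parts.path
    if parts.query:
        # The original URL's query string ends up in the outer URL's query.
        path += "?" + parts.query
    match = REPLAY_PATH.match(path)
    if match is None:
        return None
    return match.group("ts"), match.group("url")


print(split_replay_url(
    "https://archive.example/public/20060821035730mp_/http://www.example.com/"
))
```

Matching on the pattern rather than a fixed segment index means a URL without a timestamp yields None instead of a wrong value.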

Problem 2: Accessing HTML meta tags

Trove’s user interface (and this is not specific to the web archive) has a tab which shows how to cite the item you’re looking at when referencing it in an academic context or on Wikipedia. The original implementation would inspect the contents of the iframe on the client side for HTML meta tags to pull out information such as an author or publisher’s name. This of course also broke when we moved wayback to a separate security origin and, unlike the URL and page title, Pywb doesn’t include the page’s meta tags in its load message.

Rather than trying to provide JavaScript access to the page content and potentially undoing some of the isolation we were trying to introduce, I moved the meta tag extraction server side, translating the code from JavaScript/DOM to Java/JSoup.
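To give a feel for what server-side meta tag extraction involves, here is a minimal sketch. It is not the Trove code, which uses Java and JSoup; this stand-in uses Python’s standard-library HTML parser, and the class name, function name and sample page are all assumptions for illustration.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content:
            # Keep the first value seen for each lower-cased name.
            self.meta.setdefault(name.lower(), content.strip())


def extract_meta(html):
    """Return a dict of meta tag names to content values."""
    collector = MetaTagCollector()
    collector.feed(html)
    return collector.meta


# Hypothetical archived page used only to exercise the sketch.
page = """<html><head>
<title>Example Web Page</title>
<meta name="author" content="Jane Citizen">
<meta name="DC.Publisher" content="Example Press" />
<meta charset="utf-8">
</head><body></body></html>"""

print(extract_meta(page))
```

Because the parsing happens on the server, nothing in the UI needs to reach into the replay iframe.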

Problem 3: Multiple access points

The library has a take-down procedure for restricting access to content under certain circumstances. Restricted content can have several different policies applied to it. Content can be fully public, accessible only to staff, accessible on-site in the reading room or fully restricted.

OpenWayback allows the URL used for routing incoming requests (accessPointPath) to differ from the one generated when rewriting links (replayPrefix). So our original implementation simply configured three access points under paths like /public, /onsite and /staff but with the generated links all at /wayback.

<beans>
    <bean name="publicaccesspoint" parent="standardaccesspoint">
        <property name="accessPointPath" value="http://backend.example:8080/public/"/>
        <property name="replayPrefix" value="https://frontend.example/wayback/"/>
        <property name="collection" ref="publiccollection" />
    </bean>

    <bean name="onsiteaccesspoint" parent="standardaccesspoint">
        <property name="accessPointPath" value="http://backend.example:8080/onsite/"/>
        <property name="replayPrefix" value="https://frontend.example/wayback/"/>
        <property name="collection" ref="onsitecollection" />
    </bean>

    <bean name="staffaccesspoint" parent="standardaccesspoint">
        <property name="accessPointPath" value="http://backend.example:8080/staff/"/>
        <property name="replayPrefix" value="https://frontend.example/wayback/"/>
        <property name="collection" ref="staffcollection" />
    </bean>
</beans>

Our frontend webserver (nginx) had a set of IP range rules identifying the different access locations and rewrote the path accordingly:

rewrite /wayback/(.*) /$webarchive_access_point/$1 break;
proxy_pass http://openwayback;

I couldn’t find a way to replicate this configuration as Pywb appears to only have one parameter – the collection name – used for both routing and link generation. (Aside: perhaps it’s possible by using uwsgi and overriding SCRIPT_NAME but I couldn’t figure it out.) Therefore rather than running a single instance of Pywb with multiple collections configured, we ended up with a separate instance of Pywb for each access point and have nginx route requests to the appropriate port. It’s a little more complex to deploy than I’d like but works well enough.

upstream pywb-public { server backend.example:8080; }
upstream pywb-staff  { server backend.example:8081; }
upstream pywb-onsite { server backend.example:8082; }

proxy_pass http://pywb-$webarchive_access_point;
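The routing decision itself is simple: pick an access point from the client’s IP address, then forward to the matching instance. The sketch below expresses that logic in Python purely as an illustration; in our deployment nginx does this work. The address ranges are documentation-only placeholders and the function names are hypothetical, while the ports mirror the upstream definitions above.

```python
import ipaddress

# Placeholder ranges (reserved documentation networks), not real NLA networks.
ACCESS_RULES = [
    (ipaddress.ip_network("192.0.2.0/24"), "staff"),
    (ipaddress.ip_network("198.51.100.0/24"), "onsite"),
]

# One Pywb instance per access point, as in the nginx upstream blocks.
UPSTREAM_PORTS = {"public": 8080, "staff": 8081, "onsite": 8082}


def access_point_for(client_ip):
    """Return the access point name for a client address."""
    address = ipaddress.ip_address(client_ip)
    for network, access_point in ACCESS_RULES:
        if address in network:
            return access_point
    # Anything outside the listed ranges gets the public view.
    return "public"


def upstream_for(client_ip):
    """Return the backend URL that should serve this client."""
    access_point = access_point_for(client_ip)
    return f"http://backend.example:{UPSTREAM_PORTS[access_point]}"


for ip in ("192.0.2.10", "198.51.100.7", "203.0.113.5"):
    print(ip, access_point_for(ip), upstream_for(ip))
```

Defaulting to the public access point means an unrecognised address never sees more than the public collection.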

Problem 4: Hiding the UI for thumbnails

Trove’s collection browse path displays thumbnails of the archived sites. These are generated by a web service that wraps Chromium’s headless mode. Obviously, we don’t want the UI of the archive to be visible in the thumbnails. But if Chromium loads the URL of Pywb directly, there’s a JavaScript redirect back to the Trove UI. This exists so that if a user opens an archived link in a new tab, they get the web archive’s UI in that new tab too rather than just the contents of the replay frame.

In our original implementation we kind of hacked around this for screenshots by passing a magic flag as a URL fragment that the JavaScript redirect looked for as an indicator not to redirect to the UI. This time though that redirect was being done by Pywb itself rather than our template customisations. Plus we don’t really want to expose a way of hiding the UI entirely, again due to the risk of the archive being abused for phishing.

I hoped to set up a second Pywb collection with framed_replay: false and a blank banner, but was thwarted as that can only be specified at the top level. So yes, you guessed it, we’re now up to four instances of Pywb. ¯\_(ツ)_/¯

Problem 5: The PDF workaround that broke

In the original iframe implementation we encountered problems with displaying PDFs in an iframe in some browsers. The developers ended up working around this by embedding PDF.js rather than relying on the browser’s rendering. This broke when switching to Pywb on an isolated domain, as the interception was based on an onClick handler injected into the iframe and the PDF.js viewer can’t load documents from a different origin without special configuration. Fortunately, it seems either Pywb does something differently or, more likely, browsers now handle PDFs in iframes better, so I was able to just disable the whole PDF interception thing.

Funnily enough, we do still use PDF.js for generating thumbnails though. Chromium’s PDF viewer doesn’t work in headless mode. While it would probably be more efficient to use some native PDF renderer, just using PDF.js lets us reuse the same thumbnail generation logic and also piggyback on the browser’s security sandbox.

Problem 6: Reusing OpenWayback’s manhattan graph

I discovered, a little to my surprise, that Trove’s visualisation of captures over time was actually using OpenWayback’s server-side manhattan graph renderer. There’s no exact equivalent of this in Pywb and I didn’t want to keep a running instance of OpenWayback for that alone. Fortunately the graph renderer is standalone and could be incorporated into Trove-Web directly.
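The data behind a graph like this is essentially a count of captures per time bucket. A minimal sketch of that aggregation follows, assuming monthly buckets and 14-digit capture timestamps; the sample timestamps and the function name are invented for illustration and this is not OpenWayback’s renderer.

```python
from collections import Counter


def captures_per_month(timestamps):
    """Count captures per YYYY-MM bucket from 14-digit timestamps."""
    return Counter(f"{ts[:4]}-{ts[4:6]}" for ts in timestamps)


# Invented sample timestamps (YYYYMMDDhhmmss).
sample = [
    "20000915120000",
    "20000915180000",
    "20001001000000",
    "20060821035730",
]

counts = captures_per_month(sample)
for month in sorted(counts):
    # Crude text rendering: one '#' per capture in that month.
    print(month, "#" * counts[month])
```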

Screenshot of manhattan graph

Conclusion

Overall the problems we encountered were relatively minor. Pywb’s built-in frames support allowed us to eliminate virtually all the modifications and custom templates we’d had to make for OpenWayback. If there’s one wishlist item I have for Pywb, it’s to allow more options to be overridden at the per-collection level, not just the top level. The migration fixed a large set of longstanding delivery problems for us, including for key websites like the Sydney 2000 Olympic Games. I encourage other archives looking to improve the quality of their replay to make the switch.

 

IIPC Steering Committee Election 2020 Results

The Steering Committee is the executive body of the IIPC, currently comprising 15 member organisations. The 2020 Steering Committee Election closed on Friday, 14 February. The following IIPC member institutions have been elected to serve on the Steering Committee for a term commencing 1 June 2020:

We would like to thank all members who took part in the election either by nominating themselves or by taking the time to vote. Congratulations to the new and re-elected Steering Committee Members!

Novel Coronavirus outbreak: help us collect websites

The International Internet Preservation Consortium’s Content Development Group and Archive-It are collaborating on a web archive collection preserving web content related to the ongoing Novel Coronavirus (Covid-19) outbreak. Due to the urgency of the outbreak, archiving of nominated web content will begin soon (mid-February 2020). Collection of new nominations and new crawls will continue as needed depending on the course of the outbreak and its containment.

What we are collecting

Web content from all countries and in any language is in scope. High priority subtopics include:

  • Coronavirus origins
  • Information about the spread of infection
  • Regional or local containment efforts
  • Medical/Scientific aspects
  • Social aspects
  • Economic aspects
  • Political aspects

Published information resources are a higher priority for seed nominations for this collection than social media feeds or hashtags (though the latter can be useful for finding examples of the former).

How to get involved

If you would like to participate in the collection, please nominate websites by using this submission form: https://forms.gle/zHgJK3DcfGpzAtCz5

The final collection is available at this link: https://archive-it.org/collections/13529

IIPC – Meet the Officers, 2020

The IIPC is governed by the Steering Committee, formed of representatives from fifteen Member Institutions who are each elected for three-year terms. The Steering Committee designates the Chair, Vice-Chair and the Treasurer of the Consortium. Together with the Programme and Communications Officer based at the British Library, the Officers are responsible for dealing with the day-to-day business of running the IIPC.

The Steering Committee has designated Mark Phillips of the University of North Texas Libraries to serve as Chair and Paul Koerbin of the National Library of Australia to serve as Vice-Chair in 2020. Sylvain Bélanger of Library and Archives Canada continues in his role as Treasurer. Olga Holownia continues as Programme and Communications Officer and CLIR (the Council on Library and Information Resources) remains the Consortium’s fiscal host.

The Members and the Steering Committee would like to thank Hansueli Locher of the Swiss National Library for chairing the Consortium in 2019, his contribution to the day-to-day running of the IIPC, supporting tools strategy, and leading the work on shaping the future direction of the IIPC.


Mark Phillips, IIPC CHAIR

Mark Phillips is Associate Dean for Digital Libraries at the University of North Texas (UNT) in Denton, Texas. Mark has been involved with all stages in the development of the digital library access and preservation infrastructure at the UNT Libraries. The UNT Libraries’ Digital Collection manages over 2.8 million digital resources made available through the interfaces of The Portal to Texas History, the UNT Digital Library, and the Gateway to Oklahoma History. In addition to digital library infrastructure development, Mark has been involved in the web archiving activities at the UNT Libraries since 2004 including the 2008, 2012, and 2016 End of Term Web Archive activities and the development of the URL Nomination Tool. He has been active in IIPC since the UNT Libraries joined the Consortium in 2008 and has served as the UNT representative for the IIPC Steering Committee since 2015. Mark was IIPC Vice-Chair in 2019 and has been one of the Co-Leads of the Partnerships and Outreach Portfolio since 2016.

Paul Koerbin, IIPC VICE-CHAIR

Paul Koerbin is the Assistant Director for Web Archiving and Government Publications at the National Library of Australia (NLA) in Canberra. In a variety of roles, he has been involved with the development, operation and management of the NLA’s web archiving program since its inception in the late 1990s. He was part of the team that developed one of the first workflow systems for selective web archiving (PANDAS – the PANDORA Digital Archiving System) first released in 2001 and later led the development of the Australian Government Web Archive initiative in 2014. He took a leading role in preparing the NLA’s input to the drafting of electronic legal deposit legislation which came into effect in February 2016. Paul has had a long association with the international web archiving community from presenting the Swiss Library Science Talk at CERN in 2004 to co-chairing the programme committee for the 2018 IIPC Web Archiving Conference in Wellington. Paul has a graduate qualification in library and information studies from the University of Tasmania and a PhD from Western Sydney University.

Sylvain Bélanger, IIPC TREASURER

Sylvain Bélanger has been Director General of the Digital Operations and Preservation Branch for Library and Archives Canada since February 2014. In this role Sylvain is responsible for leading and supporting LAC’s digital business operations, and all aspects of preservation for digital and analog collections. Prior to accepting this role, Sylvain was Director of the Holdings Management Division from 2010, and previously Corporate Secretary and Chief of Staff for Library and Archives Canada. Library and Archives Canada is one of the founding members of the IIPC.

The PROMISE of a Belgian web archive

By Friedel Geeraert, Researcher on the PROMISE project at the Royal Library of Belgium

It all began in 2016 when the State Archives and KBR (the Royal Library of Belgium) decided to join forces and set up a joint web archiving project at the federal level in Belgium. Belgium is, sadly, one of the few European countries without a national web archive. Together with the universities of Ghent and Namur and the university college Bruxelles-Brabant, they set themselves the task of developing a federal strategy for the preservation of the Belgian web. Funding was secured via the BRAIN.be programme of the Belgian Science Policy Office and in July 2017 the PROMISE project (Preserving Online Multiple Information: towards a Belgian strategy) kicked off.

Interdisciplinary team

Sally Chambers presenting the PROMISE Team’s work at the 2020 RESAW conference in Amsterdam. Photo: Olga Holownia.

One of the strengths of the PROMISE project is the interdisciplinarity of the research team. The State Archives and KBR provide expertise in collection curation and information and documentation management, while the University of Namur (Research Centre in Information, Law and Society) provides the legal expertise. The University of Ghent (Research Group for Media, Innovation and Communication Technologies; Ghent Centre for Digital Humanities) and the university college Bruxelles-Brabant (HE2B) collaborated on the technical aspects of the project. The former also worked on analysing the user requirements for web archives. This approach not only ensured the necessary expertise but also led to cross-fertilisation between the different research domains.

Our objectives and how we learned from others

The project team worked on four main objectives:

  1. Identify best practices in the field of web-archiving
  2. Develop a strategy for archiving the Belgian web
  3. Set up a pilot project for the archiving of the Belgian web and providing access to these collections
  4. Make recommendations for the implementation of a sustainable web archiving service

More than two years on, a lot has happened within the project. To achieve the first objective, the research team did an extensive literature review of web archiving practices. This was supplemented by in-depth interviews with representatives of 13 web archiving institutions in Europe and Canada. Operational, technical and legal aspects were covered in these interviews and it was a very instructive phase for all researchers involved. The research results were published in the International Journal of Digital Humanities.

Inspired by the first phase, a strategy was outlined by KBR and the State Archives that covers the entire web archiving workflow. The legal analysis done within the project also informed both institutions about what they are legally allowed or required to do. The results of a survey on user requirements were another important source of input, since it is the intention of KBR and the State Archives to focus on the user when developing a functional web archive.

Budgeting scenarios

The strategy also included elaborate cost calculations based on different scenarios that were linked to different selection strategies: limited selective collections only, elaborate selective collections in combination with a limited broad crawl and elaborate selective collections in combination with an extensive broad crawl. A list of tasks and necessary infrastructure was drafted for each of these scenarios, spanning the different functions of the OAIS-model with the addition of the functions selection and capture. An estimation was made of the time needed to accomplish each task per job profile involved in the task. The total number of hours was then multiplied by an average wage per profile to come to a total cost for each scenario. The purpose of this exercise was to allow the board of directors of State Archives and KBR to make informed decisions about which web archiving strategy is preferable and financially viable.
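The calculation described above amounts to summing, for each scenario, the hours per job profile multiplied by that profile’s average wage. The sketch below illustrates the arithmetic only: every profile name, hour count and wage in it is a made-up placeholder, not a figure from the PROMISE project.

```python
# Placeholder average hourly wages per job profile (invented values).
HOURLY_WAGE = {"curator": 40, "developer": 50}

# Placeholder hours per profile for each of the three scenarios (invented values).
SCENARIO_HOURS = {
    "limited selective": {"curator": 400, "developer": 200},
    "selective + limited broad crawl": {"curator": 600, "developer": 500},
    "selective + extensive broad crawl": {"curator": 600, "developer": 900},
}


def scenario_cost(hours_by_profile, wages):
    """Total cost = sum over profiles of hours x average hourly wage."""
    return sum(hours * wages[profile] for profile, hours in hours_by_profile.items())


costs = {
    name: scenario_cost(hours, HOURLY_WAGE)
    for name, hours in SCENARIO_HOURS.items()
}

for name, cost in costs.items():
    print(f"{name}: {cost}")
```

Laying the scenarios out side by side like this is what lets decision makers compare selection strategies on cost.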

Selection and metadata

The third research phase consisted of a number of elements: creating seed lists for selective collections in accordance with the collection development policies of KBR and the State Archives, creating descriptive metadata based on a recent study by the OCLC, doing a pilot broad crawl based on samples of 10,000 and 100,000 domain names, capturing these collections and providing access to them. The prototype for access is in its final stages of development, after which we aim to evaluate the entire pilot project.

Next steps

The project was completed at the end of December 2019 and the PROMISE project team is now working on making recommendations for the implementation of a sustainable web archiving service including legal considerations concerning access to web archives, operational procedures, a business model and technical and functional requirements for web archiving tools.

Niels Brügger, keynote speaker at the colloquium ‘Saving the web: the promise of a Belgian web archive’. Photo: KBR.

So how promising is the future of the Belgian web archive? As is the case with many new endeavours, structural financing plays a key role. It is the intention of KBR and the State Archives to approach the political level in Belgium and make a convincing case for the necessity of a Belgian web archive. During the concluding colloquium ‘Saving the web: the promise of a Belgian web archive’ that was held on 18 October 2019, Niels Brügger, Valérie Schafer and many others shared inspiring ideas with the PROMISE project team that can be used to make a very strong case. It is the sincere hope of both institutions that the results of the PROMISE project will live on in a sustainable web archive at the federal level in Belgium.

The end of the project also induces reflection. Over the course of the project, the team had the pleasure of being introduced to the (inter)national web archiving community, for which the IIPC and RESAW provide very important platforms. We feel that we owe a lot to the exchanges we had with other web archiving professionals and researchers and we would like to thank you all for the inspiration you have given us over the years and look forward to many exchanges to come.

Friedel Geeraert introducing KBR at the IIPC General Assembly in Zagreb (5 June 2019). Photo: Tibor God.

