By Kristinn Sigurðsson, Head of Digital Projects and Development at the National and University Library of Iceland, and Georg Perreiter, Software Developer at the National and University Library of Iceland.
Here at the National and University Library of Iceland (NULI), we have spent the last couple of years eagerly awaiting each new deliverable of the IIPC-funded pywb project, developed by Webrecorder’s Ilya Kreymer. Last year Kristinn wrote a blog post about our adoption of OutbackCDX, based on the recommendation in the OpenWayback to pywb transition guide that was part of the first deliverable. In that post he noted that we’d gotten pywb to run against the index, but that many issues remained that were expected to be addressed as the pywb project continued. Now that the project has been completed, we’d like to take this opportunity to share our experience of the transition.
As Kristinn is a member of the IIPC’s Tools Development Portfolio (TDP) – which oversees the project – this was partly an effort on our behalf to help the TDP evaluate the project deliverables. Primarily, however, this was motivated by the need to be able to replace our aging OpenWayback installation.
It is worth noting that prior to this project, we had no experience with Python-based software beyond some personal hobby projects. We were (and are) primarily a “Java shop.” We mention this because the same is likely true of many organizations considering this switch. As we’ll describe below, the transition proved quite manageable despite our limited familiarity with Python.
Get pywb Running
The first obstacle we encountered was the required Python version. pywb requires version 3.8, but our production environment, running Red Hat Enterprise Linux (RHEL) 7, defaulted to Python 3.6, so we had to install Python 3.8 alongside it. We also had to learn how to use a Python virtual environment so we could run pywb in isolation. Finally, we needed to learn how to use Python’s package manager (pip3) to resolve site-packages conflicts that arose from differences between Ubuntu and RHEL.
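For readers who, like us, are new to Python, the version check and the isolated environment can be sketched with the standard library alone. This is a minimal illustration and not part of pywb; the function names and the environment directory are our own placeholders.

```python
import sys
import venv
from pathlib import Path

# Minimum version taken from the requirement mentioned above.
MIN_VERSION = (3, 8)


def meets_minimum(version=None, minimum=MIN_VERSION):
    """Return True if `version` (default: the running interpreter) is new enough."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= tuple(minimum)


def create_env(env_dir):
    """Create an isolated virtual environment, including pip, in `env_dir`."""
    if not meets_minimum():
        raise RuntimeError("Python %d.%d or newer is required" % MIN_VERSION)
    venv.create(Path(env_dir), with_pip=True)
```

Calling `create_env("pywb-env")` (a hypothetical directory name) with a suitable interpreter gives an environment into which pywb can be installed without touching the system packages.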
Of course, all of that could be avoided if you deploy pywb on a machine with a compatible version of Python or use pywb’s Docker image. Indeed, when we first set up a test instance on a “throwaway” virtual machine, we were able to get pywb up and running against our OutbackCDX in a matter of minutes.
Our web archive is open to the world. However, we do need to limit access to a small number of resources. With OpenWayback this has been handled using a plain text exclusion file. We were able to use pywb’s wb-manager command line tool to migrate this file to the JSON-based file format that pywb uses. The only issue we ran into was that we needed to strip out empty lines and comments (i.e. lines starting with #) before passing the file to this utility.
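The clean-up step is simple enough to script. The following is a minimal sketch of how blank lines and comment lines could be removed; the function names are our own for illustration, and the import into pywb itself is still done with wb-manager afterwards.

```python
def clean_exclusion_lines(lines):
    """Drop blank lines and '#' comment lines, keeping the rest in order."""
    cleaned = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        cleaned.append(stripped)
    return cleaned


def clean_exclusion_file(src_path, dst_path):
    """Write a cleaned copy of the exclusion file at `src_path` to `dst_path`."""
    with open(src_path, encoding="utf-8") as src:
        cleaned = clean_exclusion_lines(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(line + "\n" for line in cleaned)
```

Writing to a separate file leaves the original OpenWayback exclusion list, comments included, untouched.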
Making pywb Also Speak Icelandic
We want our web archive user interface to be available in both Icelandic and English. When adopting OpenWayback, we ran into issues with such internationalization (i18n) support and ultimately just translated it into Icelandic and abandoned the i18n effort. pywb already supported i18n and further support and documentation of this was one of the elements of the IIPC pywb project. So we very much wanted to take advantage of this and fully support both languages in our pywb installation.
We found the documentation describing this process to be very robust and easy to follow. Following it, we installed pywb’s i18n tool, added an “is” locale and edited the provided CSV file to include Icelandic translations.
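When editing the translations, it helps to be able to list which strings still lack an Icelandic version. Here is a minimal sketch, assuming the CSV has a header row with `source` and `target` columns; those column names are an assumption on our part and should be adjusted to match the actual file.

```python
import csv
import io


def untranslated(csv_text, source_col="source", target_col="target"):
    """Return source strings whose translation column is empty.

    The default column names are assumptions, not taken from pywb.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row[source_col]
        for row in reader
        if row.get(source_col) and not (row.get(target_col) or "").strip()
    ]
```

Running this over the file after each pywb update is one way to spot newly added strings that need translating.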
Along the way we had a few minor issues with textual elements that were hard-coded, so translations could not be provided for them. This was notably more common in newly added features, as one might expect. We were, in a sense, acting as beta testers of the software, picking up each new update as it came, so this isn’t all that surprising. We reported these omissions as we discovered them and they were quickly addressed.
A separate issue involved date formatting in Chrome with the Icelandic locale. We were able to work around it by using a German locale instead, as none of the date formatting patterns relied on outputting the names of days or months.
Making pywb Fit In
Here at NULI we have a lot of websites. To help us maintain a “brand” identity, we – to the extent possible – like them to have a consistent look and feel. So, in addition to making pywb speak Icelandic, we wanted it to fit in.
Much like i18n, UI customizations were identified as being important to many IIPC members and additional support for and documentation of that was included in the IIPC pywb project. Following the documentation, we found the customization work to be very straightforward.
You can easily add your own templates and static files, or copy and modify the existing ones. As you can always remove the files you added, there is little risk of breaking anything.
As you can see on our website, we were able to bring our standard theme to pywb.
Additionally, we added 20 lines of code to frontendapp.py to allow serving of additional, localized, static content. This content is fed through an extra template (including header and footer) that loads static HTML files as its body. This allowed us to add a few extra web pages to serve our FAQ and some other static content. This was our only “hack” and is, of course, only needed if you want to add static content that is served directly from pywb (as opposed to linking to another web host).
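To give a rough idea of what such a change involves, here is a sketch of the lookup part only: mapping a language and a page name to a static HTML file. It is not our actual patch and does not use any pywb API; the function name, the directory layout and the list of languages are hypothetical.

```python
from pathlib import Path

# Hypothetical values for illustration; not pywb configuration.
SUPPORTED_LANGS = ("is", "en")


def resolve_static_page(base_dir, lang, page):
    """Return the path to <base_dir>/<lang>/<page>.html, or None if invalid."""
    if lang not in SUPPORTED_LANGS:
        return None
    # Accept only simple page names so a request cannot escape base_dir.
    if not page or not page.replace("-", "").replace("_", "").isalnum():
        return None
    candidate = Path(base_dir) / lang / (page + ".html")
    return candidate if candidate.is_file() else None
```

Restricting page names to letters, digits, hyphens and underscores is a deliberate choice here: it rules out path tricks such as `../` without needing any further checks.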
New Calendar, Sparkline and Performance
The final deliverable of the IIPC-funded pywb project introduced a new calendar-based search result page and a “sparkline” navigation element in the UI header. Both features exist in OpenWayback and were, in our view, the last “missing” functionality in pywb. We were very happy to see them in pywb, but we also discovered a performance problem.
Our web archive is by no means the largest in the world. It is, however, unusual in that it contains some pages with over one hundred thousand copies (yes, 100,000+ copies). These mostly come from our RSS-based crawls, which capture news sites’ front pages every time a new item appears in the RSS feed. The largest is likely the front page of our state broadcaster (RÚV), with 159,043 captures available as we write this (and probably another thousand or so waiting to be indexed).
The initial version of the calendar and sparkline struggled with these URLs. After we reported the issue, some improvements were made involving additional caching and “loading” animations so users would know the system was busy instead of just seeing a blank screen. This improved matters notably, but pywb’s performance under these circumstances could stand further improvement.
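To illustrate why such URLs are demanding: a sparkline is essentially a count of captures per time period, so every capture of the URL has to be looked at. The sketch below shows that kind of aggregation over 14-digit capture timestamps (YYYYMMDDhhmmss). It is our own simplified illustration, not pywb’s implementation.

```python
from collections import Counter


def captures_per_year(timestamps):
    """Count captures per year from 14-digit YYYYMMDDhhmmss timestamps."""
    counts = Counter(ts[:4] for ts in timestamps)
    return dict(sorted(counts.items()))
```

The result is tiny compared with the input, which is why caching the aggregated counts rather than the full list of captures is attractive for pages with very many captures.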
We recognize that our web archive is somewhat unusual in having material like this. However, as time goes on, the number of archives with a very high number of captures of the same pages will only increase, so this is worth considering in future development.
We’ve been very pleased with this migration process. In particular we’d like to commend the Webrecorder team for the excellent documentation now available for pywb. We’d also like to acknowledge all the testing and vetting of the IIPC pywb project deliverables that Lauren Ko (UNT and member of the IIPC TDP) did alongside – and often ahead of – us.
We can also reaffirm Kristinn’s recommendation from last year to use OutbackCDX as a backend for pywb (and OpenWayback). Having a single OutbackCDX instance powering both our OpenWayback and pywb installations notably simplified the setup of pywb and ensured we only had one index to update.
We still have pywb in a public “beta” – in part due to the performance issues discussed above – while we continue to test it. But we expect it will replace OpenWayback as our main replay interface at some point this year.