2022 blog round-up

Sat, 31 December 2022 | Wed, 04 January 2023 | IIPC Staff

As we approach the end of 2022, we would like to thank our members and the general web archiving community for their support and engagement this year. Before we move forward into 2023, and return to an in-person General Assembly and Web Archiving Conference (for the first time since 2019!), we wanted to highlight some of this past year’s activities featured on our blog and to take this opportunity to thank all the contributors.

IIPC Governance

Thank you to the 2022 IIPC Chair, Vice-Chair and Treasurer for serving on the 2022 Executive Board. Thank you also to all the members who participated in the 2022 Steering Committee election. Many thanks to IIPC 2022 Chair Kristinn Sigurðsson for leading us through this past year, and reminding us that IIPC truly is an organization for all seasons.

Funded projects

2022 started off with a wrap-up of a project led by our Tools Development Portfolio and developed by Ilya Kreymer of Webrecorder. The goal of this project was to support migration from OpenWayback (a playback tool used by most of our members) to pywb by creating a Transition Guide.

This year also saw the launch of a new tools project “Browser-based crawling system for all.” Led by four IIPC members (the British Library, National Library of New Zealand, Royal Danish Library, and the University of North Texas), the Webrecorder-developed crawling system based on the Browsertrix Crawler is designed to allow curators to create, manage, and replay high-fidelity web archive crawls through an easy-to-use interface.

“Game Walkthroughs and Web Archiving” builds on research by Travis Reid, a PhD student at Old Dominion University (ODU), that looks at applying gaming concepts to the web archiving process. This collaboration between ODU and Los Alamos National Laboratory was supported by the IIPC through our Discretionary Funding Program (DFP).

Here’s a list of blog posts on the 2022 projects related to web archiving tools:

Collaborative Collections

IIPC also funds collaborative collections, which are curated and supported by volunteers from our community. While our Covid-19 collection continues, three new collections were initiated by the Content Development Working Group (CDG) in 2022. In the winter, Helena Byrne of the British Library encouraged everyone to web archive the Beijing 2022 Olympic & Paralympic Winter Games, adding to a decade-long collaborative effort of archiving the Olympics and Paralympics. Archiving the War in Ukraine was our second collaborative collection for 2022. Co-curated by Kees Teszelszky of KB, National Library of the Netherlands, and Vladimir Tybin and Anaïs Crinière-Boizet of the BnF, the National Library of France, it offers a comprehensive international perspective on the war. We closed 2022 with a call for nominations (due 20 January, 2023) for Web Archiving Street Art, co-led by Ricardo Basílio of Arquivo.pt and Miranda Siler of Ivy Plus Libraries Confederation.

Thank you to Alex Thurman (Columbia University Libraries) and Nicola Bingham (the British Library) for serving as CDG co-chairs, overseeing all new and ongoing collaborative collections:

Researching web archives

We also published blog posts related to researching web archives on topics spanning from a toolset for researchers to archiving social media to analysing Covid-19 web archive collections.

Yves Maurer of the National Library of Luxembourg wrote about CDX-summarize, his toolset aimed at anyone interested in researching web archives that are not fully accessible. It offers a possible solution to provide a useful glimpse of “data that resides in-between the legal challenges of full access on the one hand and a textual description or rough single numbers on the other hand”.

Beatrice Cannelli, PhD candidate at the School of Advanced Study (University of London), summarised the results of an online survey mapping social media archiving initiatives, which is part of her research project “Archiving Social Media: a Comparative Study of the Practices, Obstacles, and Opportunities Related to the Development of Social Media Archives.”

We also published two blog posts by the AWAC2 (Analysing Web Archives of the COVID-19 Crisis) team, researchers who work with the IIPC Covid-19 collaborative collection using ARCH (Archives Research Compute Hub), a new interface for web archive analysis created by the Archives Unleashed Project Team and the Internet Archive. AWAC2 is supported by the Archives Unleashed Cohort Program, which facilitates research engagement with web archives. The researchers are members of WARCnet (Web ARChive studies network researching web domains and events) Working Group 2, which focuses on analysing transnational events.

Covid-19 web archived content is also at the core of the Archive of Tomorrow (AoT) project that aims to explore and preserve online information and misinformation about health and the pandemic. Introduced earlier this year by Alice Austin (Centre for Research Collections, University of Edinburgh), AoT will form a ‘Talking about Health’ collection within the UK Web Archive. Cui Cui, PhD candidate at the University of Sheffield and also an AoT web archivist, shared her process of working with the ‘Talking about Health’ collection, using faceted 4D modelling to reconstruct web space in web archives.

Here are the 2022 blog posts on researching web archives:

Last but not least, we would also like to give a shoutout to the brilliant Web Archiving Team at the Library of Congress who worked with us on the online GA and WAC 2022 and took us down memory lane in Remembering Past Web Archiving Events With Library of Congress Staff.

Many thanks to everyone who has contributed to our blog and helped us promote it through their newsletters and social media posts and, of course, thank you to all our readers around the world. We look forward to showcasing your web archiving activities in the new year!

Migrating to pywb at the National and University Library of Iceland

Tue, 08 March 2022 | Wed, 09 March 2022 | IIPC Senior Program Officer

By Kristinn Sigurðsson, Head of Digital Projects and Development at the National and University Library of Iceland, and Georg Perreiter, Software Developer at the National and University Library of Iceland.

Here at the National and University Library of Iceland (NULI) we have over the last couple of years eagerly awaited each new deliverable of the IIPC funded pywb project, developed by Webrecorder’s Ilya Kreymer. Last year Kristinn wrote a blog post about our adoption of OutbackCDX based on the recommendation from the OpenWayback to pywb transition guide that was a part of the first deliverable. In that post he noted that we’d gotten pywb to run against the index but there were still many issues that were expected to be addressed as the pywb project continued. Now that the project has been completed, we’d like to use this opportunity to share our experience of this transition.

As Kristinn is a member of the IIPC’s Tools Development Portfolio (TDP) – which oversees the project – this was partly an effort on our behalf to help the TDP evaluate the project deliverables. Primarily, however, this was motivated by the need to be able to replace our aging OpenWayback installation.

It is worth noting that prior to this project, we had no experience with using Python based software beyond some personal hobby projects. We were (and are) primarily a “Java shop.” We note this as the same is likely true of many organizations considering this switch. As we’ll describe below, this proved quite manageable despite our limited familiarity with Python.

Get pywb Running

The first obstacle we encountered was related to the required Python version. pywb requires version 3.8 but our production environment, running Red Hat Enterprise Linux (RHEL) 7, defaulted to Python 3.6. So we had to additionally install Python 3.8. We also had to learn how to use a Python virtual environment so we could run pywb in isolation. Then we needed to learn how to resolve site-package conflicts using Python’s package manager (pip3) due to differences between Ubuntu and RHEL.
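For readers in a similar position, the version check and isolated environment described above can be sketched with Python's standard library. This is only an illustration and not part of pywb: the helper names meets_minimum and create_isolated_env and the target path are hypothetical, and the 3.8 minimum is simply the figure quoted above.

```python
import sys
import venv
from pathlib import Path

# Minimum version as quoted in the post; adjust if pywb's requirement changes.
MINIMUM = (3, 8)


def meets_minimum(version, minimum=MINIMUM):
    """Return True if a (major, minor, ...) version tuple satisfies the minimum."""
    return tuple(version[:2]) >= tuple(minimum)


def create_isolated_env(target):
    """Create a virtual environment (with pip) at `target`, a hypothetical path."""
    if not meets_minimum(sys.version_info):
        raise RuntimeError("Python %d.%d or newer is required" % MINIMUM)
    venv.create(Path(target), with_pip=True)
```

Calling create_isolated_env("/opt/pywb-env") with a suitable interpreter would give pywb its own site-packages directory, which is one way to sidestep conflicts with system packages.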

Of course, all of that could be avoided if you deploy pywb on a machine with a compatible version of Python or use pywb’s Docker image. Indeed, when we first set up a test instance on a “throwaway” virtual machine, we were able to get pywb up and running against our OutbackCDX in a matter of minutes.

Access Control

Our web archive is open to the world. However, we do need to limit access to a small number of resources. With OpenWayback this has been handled using a plain text exclusion file. We were able to use pywb’s wb-manager command line tool to migrate this file to the JSON based file format that pywb uses. The only issue we ran into was that we needed to strip out empty lines and comments (i.e. lines starting with #) before passing it to this utility.
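The clean-up step mentioned above (dropping empty lines and comments before the migration) could look something like the following. It is a sketch under assumptions: the function names and file paths are hypothetical, and it treats any line whose first non-blank character is # as a comment.

```python
def clean_exclusion_lines(lines):
    """Drop blank lines and '#' comment lines; strip whitespace from the rest."""
    cleaned = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        cleaned.append(stripped)
    return cleaned


def clean_exclusion_file(source_path, target_path):
    """Write a cleaned copy of an exclusion file and return the number of rules kept."""
    with open(source_path, encoding="utf-8") as source:
        cleaned = clean_exclusion_lines(source)
    with open(target_path, "w", encoding="utf-8") as target:
        target.write("\n".join(cleaned) + ("\n" if cleaned else ""))
    return len(cleaned)
```

The cleaned copy, rather than the original file, would then be handed to wb-manager.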

Making pywb Also Speak Icelandic

We want our web archive user interface to be available in both Icelandic and English. When adopting OpenWayback, we ran into issues with such internationalization (i18n) support and ultimately just translated it into Icelandic and abandoned the i18n effort. pywb already supported i18n and further support and documentation of this was one of the elements of the IIPC pywb project. So we very much wanted to take advantage of this and fully support both languages in our pywb installation.

We found the documentation describing this process to be very robust and easy to follow. Following it, we installed pywb’s i18n tool, added an “is” locale and edited the provided CSV file to include Icelandic translations.

Along the way we had a few minor issues with textual elements that were hard coded and translations could not be provided for. This was notably more common in new features being added, as one might expect. We were, in a sense, acting as beta testers of the software, picking up each new update as it came, so this isn’t all that surprising. We reported these omissions as we discovered them and they were quickly addressed.

The only issue that wasn’t (and couldn’t be) addressed ended up relating to a limitation of Chrome. We noticed that our date formatting for Icelandic was working well in both Firefox and Edge, but displayed incorrectly in Chrome. This turned out to be because Chrome does not support Icelandic in JavaScript code like this: new Date().toLocaleDateString("is")

We were able to work around this issue with Chrome by using a German locale as none of the date formatting patterns relied on outputting the names of days or months.

Making pywb Fit In

Here at NULI we have a lot of websites. To help us maintain a “brand” identity, we – to the extent possible – like them to have a consistent look and feel. So, in addition to making pywb speak Icelandic, we wanted it to fit in.

Much like i18n, UI customizations were identified as being important to many IIPC members and additional support for and documentation of that was included in the IIPC pywb project. Following the documentation, we found the customization work to be very straightforward.

You can easily add your own templates and static files or copy and modify the existing ones. As you can always remove your added files, there is no chance of messing anything up.

As you can see on our website, we were able to bring our standard theme to pywb.

Additionally, we added 20 lines of code to frontendapp.py to allow serving of additional, localized, static content fed by an additional template (incl. header and footer) that loads static html files as content. This allowed us to add a few extra web pages to serve our FAQ and some other static content. This was our only “hack” and is, of course, only needed if you want to add static content that is served directly from pywb (as opposed to linking to another web host).

New Calendar & Sparkline and Performance

The final deliverable of the IIPC funded pywb project included the introduction of a new calendar-based search result page and a “sparkline” navigation element into the UI header. These were both features found in OpenWayback and, in our view, the last “missing” functionality in pywb. We were very happy to see these features in pywb but also discovered a performance problem.

Our web archive is by no means the largest one in the world. It is, however, somewhat unique in that it contains some pages with over one hundred thousand copies (yes 100.000+ copies). These mostly come from our RSS-based crawls that capture news sites’ front pages every time there is a new item in the RSS feed. The largest is likely the front page of our state broadcaster (RÚV) with 159.043 captures available as we write this (with probably another thousand or so waiting to be indexed).

The initial version of the calendar and sparkline struggled with these URLs. After we reported the issue, some improvements were made involving additional caching and “loading” animations so users would know the system was busy instead of just seeing a blank screen. This improved matters notably, but pywb’s performance under these circumstances could stand further improvement.

We recognize that our web archive is somewhat unusual in having material like this. However, as time goes on, archives with a high number of captures of the same pages will only increase, so this is worth considering in the future.

Final Thoughts

We’ve been very pleased with this migration process. In particular we’d like to commend the Webrecorder team for the excellent documentation now available for pywb. We’d also like to acknowledge all testing and vetting of the IIPC pywb project deliverables that Lauren Ko (UNT and member of the IIPC TDP) did alongside – and often ahead of – us.

We can also reaffirm Kristinn’s recommendation from last year to use OutbackCDX as a backend for pywb (and OpenWayback). Having a single OutbackCDX instance powering both our OpenWayback and pywb installations notably simplified the setup of pywb and ensured we only had one index to update.

We still have pywb in a public “beta” – in part due to the performance issues discussed above – while we continue to test it. But we expect it will replace OpenWayback as our main replay interface at some point this year.

IIPC Chair Address 2022

Tue, 22 February 2022 | Wed, 23 February 2022 | IIPC Senior Program Officer

By Kristinn Sigurðsson, Head of Digital Projects and Development
at the National and University Library of Iceland
and the IIPC Chair 2022-2023

Hi all,

It is my pleasure to serve as Chair of the IIPC Steering Committee and the Executive Board for 2022. I’ve been a part of this community in one way or another since the start. I’ve often stated that without our involvement in the IIPC, the Icelandic web archive here at the National and University Library of Iceland would be a shadow of its current self.

An Organization for All Seasons

Despite the challenges of the pandemic, I’m very happy that the IIPC has been able to maintain an extensive and ambitious program over the last couple of years. This was very important as, in the past, the IIPC’s in-person events – notably our General Assembly (GA) and Web Archiving Conference (WAC) – have been at the core of IIPC activity.

The IIPC community is incredibly positive, energetic and creative, and this is always on full display when we meet in person. Over the years we’ve tried very hard to sustain the energy and shared feeling of purpose throughout the year. Even before the pandemic, we made sure the community could meet online during our regular technical calls, webinars and workshops.

These earlier efforts left us with a foundation to build upon, and I’m very happy with what we have been able to deliver. We have had even more events these past two years, including our first online conference co-hosted by the National Library of Luxembourg. My deepest thanks to Olga and everyone who helped make that possible.

Unfortunately, it seems we face another year without any significant in-person events. Rest assured, however, that we will continue with our program of meetings, workshops, members updates, webinars featuring web archiving initiatives (including the IIPC funded projects) and, of course, a virtual GA and conference in late May.

This transformation from an organization that sometimes seemed to disappear in the off season to one that is active year round has been very satisfying from my perspective. This “disappearance” was always a bit of an illusion, as there was always some work and collaboration going on, but the added visibility and opportunities for engagement have been crucial.

As we look forward to resuming in-person events next year (fingers crossed), it is important that we do not forget any of the lessons we have learned from this. It is important that we do not simply go back to things as they were, but that we retain this online aspect of the organization all year round. With that in mind, much of this year we will be looking to establish a more predictable and consistent schedule of events, allowing our progress to carry on into future years.

Of course, all of this has required a fair amount of work and will continue to do so, bringing me to our next topic.

Reinforcement

As the IIPC has expanded and matured, the duties and responsibilities of our sole employee have grown considerably. Recognizing that this had reached an unsustainable point, last year the Steering Committee authorized the hiring of one additional full-time staff member for the new role of Administrative Officer. The Administrative Officer will take some of the more routine, administrative, duties off of our Senior Program Officer’s plate and support her as needed.

The position was advertised in November and prospective candidates were interviewed in late December and early January. After concluding our search, we hired Kelsey Socha as the new Administrative Officer for the IIPC. Kelsey Socha holds a BFA in Theatrical Design and Production from the University of Michigan, and an MS in Library and Information Science from Simmons University. She has served in a variety of library roles, most recently as Head of Adult Services for the Westfield Athenaeum in Westfield, Massachusetts. She began work for the IIPC on February 9th. Both the Administrative Officer and the Senior Program Officer roles are hosted by the Council on Library and Information Resources (CLIR).

With this reinforcement, I feel confident that we will be able to rise to the ambitious schedule I discussed earlier.

Executive Board

Two years ago, we revised and renewed our Consortium Agreement, which among other changes introduced an Executive Board (EB), composed of the Chair, Vice-Chair, Treasurer, Senior Program Officer and, optionally, up to two other Steering Committee members. Aside from the Senior Program Officer, appointments are for 1 year. The EB was set up to create a smaller and more responsive body to manage the practical aspects of running the IIPC and to liaise with CLIR, our financial and administrative host. The first Board started work in January 2021 and this new setup has made our governance more agile. I would like to take this opportunity to thank Sylvain Bélanger of Library and Archives Canada, who served as IIPC Treasurer, and to welcome Ian Cooke of the British Library who has taken on this role. My long-term IIPC colleague Abbie Grotke has volunteered to serve as the Vice-Chair. My thanks also go to Hansueli Locher of Swiss National Library, who served on the EB last year.

The Steering Committee will continue to focus on our longer-term policy as well as oversight, with members of our three Portfolios being actively involved in the areas outlined in our Strategic Plan.

New Steering Committee Members

I would also like to take this opportunity to welcome the two newly elected SC members, Bjarne Andersen on behalf of the Royal Danish Library and Tobias Steinke on behalf of the German National Library.

Each year, about one-third of the fifteen SC seats are up for reelection. I’ve been pleased to observe over the last few years that these elections have gotten more competitive with more members seeking to serve. As the active involvement of our members is vital to the IIPC’s long-term success, I’m confident that this is a positive sign.

Tools

Just recently, a project commissioned by our Tools Development Portfolio (TDP) to improve the open source web archive replay tool PyWb was completed. I wrote about this project back when it was just starting (read post). Now, Ilya Kreymer, PyWb’s developer, has delivered the last of the work agreed upon. There will be more in-depth posts about this soon, and you can also find all the blog posts on interim work here.

This PyWb project is the first funded development project to be managed by the TDP and overall, I’m very pleased with the outcome. I would like to take this opportunity to thank the other members of the TDP, Lauren Ko, Alex Osborne and Youssef Eldakar, for all their hard work. Even our funded projects still depend on a fair amount of volunteer effort. Partly based on the experience from this project, the TDP is currently working on another project with Ilya Kreymer, this time focused on browser-based crawling.

Browser-based crawling was identified as a key capability that is largely lacking in our tool suite at an online Tools Workshop with IIPC members held in June 2021. Based on discussions there, several of our member organizations put together a project plan to address this. The project will involve notable extensions to the Webrecorder software to facilitate better browser-based crawling. Unlike the PyWb project, however, this project also includes considerable commitments on behalf of 4 member organizations (British Library, National Library of New Zealand, Royal Danish Library and University of North Texas) that are not funded by the IIPC.

This project is expected to last two years, and we plan to keep you informed throughout. Members interested in participating should keep an eye out for upcoming announcements and posts here.

20th Anniversary

Next year will be the IIPC’s 20th anniversary. A lot has changed since the original 12 members signed the first Consortium Agreement back in 2003. The fact that such a milestone looks to also align with a return to in-person events after three years without them gives us even more cause to strive for the best General Assembly and Web Archiving Conference ever (as if we ever aimed lower).

Even as we work on the substantial online slate of events for 2022 that I mentioned above, work has already begun on this return to normalcy. You can be part of this too. Keep an eye out for the call for a hosting institution, participation in our program committee, and the call for papers.

Lastly, please feel free to reach out to me if you have any questions or concerns during my time as Chair.

Kristinn Sigurðsson,
Head of Digital Projects and Development at the National and University Library of Iceland
IIPC Chair 2022-2023

Using OutbackCDX with OpenWayback

Fri, 12 March 2021 | IIPC Senior Program Officer

By Kristinn Sigurðsson, Head of Digital Projects and Development at the National and University Library of Iceland, IIPC Vice-Chair and Co-Lead of the Tools Development Portfolio

Last year I wrote about The Future of Playback and the work that the IIPC was funding to facilitate the migration from OpenWayback to PyWb for our members. That work is being done by Ilya Kreymer and the first work package has now been delivered: an exhaustive transition guide detailing how OpenWayback configuration options can be translated into equivalent PyWb settings.

One thing I quickly noticed as I read through the guide is that it recommends that users use OutbackCDX as a backend for PyWb, rather than continuing to rely on “flat file”, sorted CDXes. PyWb does support “flat CDXs”, as long as they are in the 9 or 11 column format, but a convincing argument is made that using OutbackCDX for resolving URLs is preferable, whether you use PyWb or OpenWayback.

What is OutbackCDX?

OutbackCDX is a tool created by Alex Osborne, Web Archive Technical Lead at the National Library of Australia. It handles the fundamental task of indexing the contents of web archives: mapping URLs to contents in WARC files.

A “traditional” CDX file (or set of files) accomplishes this by listing each and every URL, in order, in a simple text file along with information about them like in which WARC file they are stored. This has the benefit of simplicity and can be managed using simple GNU tools, such as sort. Plain CDXs, however, make inefficient use of disk space. And as they get larger, they become increasingly difficult to update because inserting even a small amount of data into the middle of a large file requires rewriting a large part of the file.
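To make the trade-off concrete, here is a small sketch of how a sorted flat index behaves. It is an illustration only: the index lines are hypothetical and simplified to three fields (URL key, timestamp, file name) rather than the real 9 or 11 columns, and an in-memory list stands in for the file on disk.

```python
import bisect


def lookup(sorted_lines, urlkey):
    """Return all index lines whose first space-separated field equals urlkey."""
    prefix = urlkey + " "
    # Binary search works only because the lines are kept in sorted order.
    start = bisect.bisect_left(sorted_lines, prefix)
    matches = []
    for line in sorted_lines[start:]:
        if not line.startswith(prefix):
            break
        matches.append(line)
    return matches


def insert(sorted_lines, new_line):
    """Insert a line in order and return how many existing entries had to shift."""
    position = bisect.bisect_left(sorted_lines, new_line)
    sorted_lines.insert(position, new_line)
    return len(sorted_lines) - position - 1


# Hypothetical, simplified index entries.
index = [
    "is,example)/ 20100101000000 a.warc.gz",
    "is,example)/ 20200101000000 b.warc.gz",
    "is,example)/about 20150101000000 a.warc.gz",
    "is,ruv)/ 20210101000000 c.warc.gz",
]
captures = lookup(index, "is,example)/")
moved = insert(index, "is,example)/ 20150601000000 d.warc.gz")
```

Lookups stay fast thanks to the ordering, but every insertion pushes all later entries along. On disk, with billions of lines, that shifting becomes the expensive rewrite described above.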

OutbackCDX improves on this by using a simple but powerful key-value store, RocksDB. The URLs are the keys and the remaining info from the CDX is the stored value. RocksDB then does the heavy lifting of storing the data efficiently and providing speedy lookups and updates to the data. Notably, OutbackCDX enables updates to the index without any disruption to the service.

The Mission

Given all this, transitioning to OutbackCDX for PyWb makes sense. But OutbackCDX also works with OpenWayback. If you aren’t quite ready to move to PyWb, adopting OutbackCDX first can serve as a stepping stone. It offers enough benefits all on its own to be worth it. And, once in place, it is fairly trivial to have it serve as a backend for both OpenWayback and PyWb at the same time.

So, this is what I decided to do. Our web archive, Vefsafn.is, has been running on OpenWayback with a flat file CDX index for a very long time. The index has grown to 4 billion URLs and takes up around 1.4 terabytes of disk space. Time for an upgrade.

Of course, there were a few bumps on that road, but more on that later.

Installing OutbackCDX

Installing OutbackCDX was entirely trivial. You get the latest release JAR, run it like any standalone Java application and it just works. It takes a few parameters to determine where the index should be, what port it should be on and so forth, but configuration really is minimal.

Unlike OpenWayback, OutbackCDX is not installed into a servlet container like Tomcat, but instead (like Heritrix) comes with its own, built in web server. End users do not need access to this, so it may be advisable to configure it to only be accessible internally.

Building the Index

Once running, you’ll need to feed your existing CDXs into it. OutbackCDX can ingest most commonly used CDX formats, certainly all that PyWb can read. CDX files can simply be “posted” to OutbackCDX using a command line tool like curl.

Example:

curl -X POST --data-binary @index.cdx http://localhost:8901/myindex

In our environment, we keep around a gzipped CDX for each (W)ARC file, in addition to the merged, searchable CDX that powered OpenWayback. I initially just wrote a script that looped through the whole batch and posted them, one at a time. I realized, though, that the number of URLs ingested per second was much higher in CDXs that contained a lot of URLs. There is an overhead to each post. On the other hand, you can’t just post your entire mega CDX in one go, as OutbackCDX will run out of memory.

Ultimately, I wrote a script that posted about 5MB of my compressed CDXs at a time. Using it, I was able to add all ~4 billion URLs in our collection to OutbackCDX in about 2 days. I should note that our OutbackCDX is on high performance SSDs, the same as our regular CDX files have used.
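The batching idea can be sketched roughly as follows. This is not the script I used, just a minimal illustration under assumptions: the function names are hypothetical, each gzipped CDX file is assumed to end with a newline so the files can be concatenated safely, and the index URL is whatever your OutbackCDX instance listens on.

```python
import gzip
import urllib.request

# Roughly 5 MB of compressed CDX per request, the figure mentioned above.
BATCH_LIMIT = 5 * 1024 * 1024


def batch_by_size(sized_items, limit=BATCH_LIMIT):
    """Group (name, size) pairs into batches whose total size stays within limit.

    An item that is larger than the limit on its own gets a batch to itself.
    """
    batches, current, current_size = [], [], 0
    for name, size in sized_items:
        if current and current_size + size > limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


def post_batch(paths, index_url):
    """Decompress the gzipped CDX files of one batch and POST them in one request."""
    parts = []
    for path in paths:
        with gzip.open(path, "rb") as handle:
            parts.append(handle.read())
    request = urllib.request.Request(index_url, data=b"".join(parts), method="POST")
    with urllib.request.urlopen(request) as response:
        return response.status
```

Grouping by compressed size keeps the number of requests low (less per-post overhead) while capping how much data any single request carries.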

Configuring OpenWayback

Next up was to configure our OpenWayback instance to use OutbackCDX. This proved easy to do, but turned up some issues with OutbackCDX. First the configuration.

OpenWayback has a module called ‘RemoteResourceIndex’. This can be trivially enabled in the wayback.xml configuration file. Simply replace the existing `resourceIndex` with something like:

<property name="resourceIndex">
  <bean class="org.archive.wayback.resourceindex.RemoteResourceIndex">
    <property name="searchUrlBase" value="http://localhost:8080/myindex" />
  </bean>
</property>

And OpenWayback will use OutbackCDX to resolve URLs. Easy as that.

Those ‘bumps’

This is, of course, where I started running into those bumps I mentioned earlier. Turns out there were a number of edge cases where OutbackCDX and OpenWayback had different ideas. Luckily, Alex – the aforementioned creator of OutbackCDX – was happy to help resolve this. Thanks again Alex.

The first issue I encountered was due to the age of some of our ARCs. The date fields had variable precision: rather than all being exactly 14 digits long, some had less precision and were only 10-12 characters long. This was resolved by having OutbackCDX pad those shorter dates with zeros.
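The padding itself is simple to picture. The snippet below is a sketch of the idea in Python, not OutbackCDX's actual code (OutbackCDX is a Java application), and the function name is hypothetical.

```python
def pad_timestamp(timestamp, width=14):
    """Right-pad a shorter capture timestamp with zeros to `width` digits."""
    if not timestamp.isdigit():
        raise ValueError("timestamp must contain only digits: %r" % timestamp)
    if len(timestamp) > width:
        raise ValueError("timestamp is longer than %d digits: %r" % (width, timestamp))
    # A 10-digit value (down to the hour) gains "0000" for minutes and seconds.
    return timestamp.ljust(width, "0")
```

A 10-digit value such as 2003052112 becomes 20030521120000, while a full 14-digit timestamp passes through unchanged.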

I also discovered some inconsistencies in the metadata supplied along with the query results. OpenWayback expected some fields that were either missing or misnamed. These were a little tricky, as they only affected some aspects of OpenWayback, most notably the metadata in the banner inserted at the top of each page. All of this has been resolved.

Lastly, I ran into an issue related not to OpenWayback but to PyWb. It stemmed from the fact that my CDXs are not generated in the 11 column CDX format. The 11 column format includes the compressed size of the WARC record holding the resource. OutbackCDX was recording this value as 0 when absent. Unfortunately, PyWb didn’t like this and would fail to load such resources. Again, Alex helped me resolve this.

OutbackCDX 0.9.1 is now the most recent release, and includes the fixes to all the issues I encountered.

Summary

Having gone through all of this, I feel fairly confident that swapping in OutbackCDX to replace a ‘regular’ CDX index for OpenWayback is very doable for most installations. And the benefits are considerable.

The size of the OutbackCDX index on disk ended up being about 270 GB. As noted before, the existing CDX index powering our OpenWayback was 1.4 TB. A reduction of more than 80%. OpenWayback also feels notably snappier after the upgrade. And updates are notably easier.
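For anyone wanting to check the figure against their own numbers, the arithmetic is shown below. It assumes decimal units (1.4 TB taken as 1400 GB), and the function name is just for illustration.

```python
def reduction_percent(before, after):
    """Percentage by which `after` is smaller than `before` (same unit for both)."""
    return (1 - after / before) * 100


# 1.4 TB expressed as 1400 GB, compared with the 270 GB OutbackCDX index.
saving = reduction_percent(1400, 270)
print(f"{saving:.1f}%")  # prints 80.7%
```

Using binary units instead (1.4 TB as about 1434 GB) gives a slightly larger reduction, so the "more than 80%" claim holds either way.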

Our OpenWayback at https://vefsafn.is is now fully powered by OutbackCDX.

Next we will be looking at replacing it with PyWb. I’ll write more about that later, once we’ve made more progress, but I will say that having it run on the same OutbackCDX proved trivial to accomplish, and we now have a beta website up, using PyWb, http://beta.vefsafn.is.

Search results in Vefsafn.is (beta) that uses PyWb.

OpenWayback to pywb Transition Guide and pywb update

Wed, 16 December 2020 | IIPC Senior Program Officer

By Ilya Kreymer, Lead Software Engineer at Webrecorder Software

Earlier this year, the IIPC, after an internal survey, recommended the adoption of Webrecorder pywb as the primary replay system for their members’ web archives. Webrecorder and IIPC established a multi-part collaboration to help with this transition and advance the development of pywb.

To meet these goals, I’m excited to announce the launch of an official guide for migrating from OpenWayback to Webrecorder pywb, available at:

https://pywb.readthedocs.io/en/latest/manual/owb-transition.html

This guide was created with input from IIPC members and marks the completion of the first package of the IIPC project on pywb. This guide is now part of the standard pywb documentation and provides examples of various OpenWayback configurations and how they can be adapted to analogous options in pywb. The guide covers updating the index, WARC storage and exclusion systems to run in pywb with minimal changes.

For best results, deployment of OutbackCDX, an open-source standalone web archive indexing system developed by the National Library of Australia, alongside pywb is the recommended setup for managing web archive indexes. See the guide for more details and additional options.

Sample Deployment Configurations

With the guide, pywb now also includes a few working deployments (via Docker Compose) of running pywb with Nginx, Apache and OutbackCDX.

These deployments will be part of the upcoming pywb release and will be updated as pywb and configuration options evolve.

Next Steps

Next on the immediate roadmap for pywb is an upcoming release, which will feature numerous fixes in addition to the guide (see the pywb CHANGELIST for more details on upcoming and new features).

The next iteration of pywb, which will be released in the first half of 2021, will include improved support for access controls, including a time-based access ‘embargo’, location-based access controls, and improved support for localization, in line with the work outlined in pywb project Package B.
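
To make the idea of a time-based embargo concrete, here is a minimal sketch of the kind of check involved. It is not how pywb implements the feature: `is_embargoed` is a hypothetical function, and the one-year embargo and the sample timestamps are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def is_embargoed(capture_ts: str, embargo: timedelta, now: datetime) -> bool:
    """Return True if a capture is still too recent to be shown.

    capture_ts is a 14-digit UTC timestamp (YYYYMMDDhhmmss), the form
    commonly used in CDX indexes and replay URLs.
    """
    captured = datetime.strptime(capture_ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return now - captured < embargo


# Hypothetical policy: captures become accessible one year after capture.
now = datetime(2021, 1, 1, tzinfo=timezone.utc)
one_year = timedelta(days=365)

recent = is_embargoed("20201216120000", one_year, now)  # captured about two weeks earlier
older = is_embargoed("20190101000000", one_year, now)   # captured two years earlier
print(recent, older)
```

Passing `now` in explicitly keeps the check easy to test; a real deployment would compare against the current time and would need to decide what to show in place of embargoed captures.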

Feedback Wanted!

We hope the guide will be useful for those updating from OpenWayback to pywb. We are also looking for input from IIPC members about any use cases for improved access control and localization for the next iteration.

If you have any questions, run into issues, or find anything missing, please send feedback to pywb[at]iipc.simplelists.com or directly to Webrecorder, via email or via the forum.

The Future of Playback

Tue, 16 June 2020 / Wed, 17 June 2020, IIPC Senior Program Officer

By Kristinn Sigurðsson, Head of IT at the National and University Library of Iceland and the Lead of the IIPC Tools Development Portfolio

It is difficult to overstate the importance of playback in web archiving. While it is possible to evaluate and make use of a web archive via data mining, text extraction and analysis, and so on, the ability to present the captured content in its original form, enabling human inspection of the pages, remains essential. A good playback tool opens up a wide range of practical use cases for the general public, facilitates non-automated quality assurance efforts and (sometimes most importantly) creates a highly visible “face” to our efforts.

OpenWayback

Over the last decade or so, most IIPC members who operate their own web archives in-house have relied on OpenWayback, even before it acquired that name. Recognizing the need for a playback tool and the prevalence of OpenWayback, the IIPC has been supporting OpenWayback in a variety of ways over the last five or six years. Most recently, Lauren Ko (UNT), a co-lead of the IIPC’s Tools Development Portfolio, has shepherded work on OpenWayback and pushed out new releases (thanks Lauren!).

Unfortunately, it has been clear for some time that OpenWayback would require a ground-up rewrite if it were to be continued. The software, now almost a decade and a half old, is complicated and archaic. Adding features is nearly impossible and bug fixes often require exceptional effort. This has led to OpenWayback falling behind as web material evolves, its replay fidelity fading.

As there was no prospect for the IIPC to fund a full rewrite, the Tools Development Portfolio, along with other interested IIPC members, began to consider alternatives. As far as we could see, there was only one viable contender on the market, Pywb.

Survey

Last fall the IIPC sent out a survey to our members to get some insights into the playback software currently being used, plans to transition to Pywb, and the key roadblocks preventing IIPC members from adopting Pywb. The IIPC also organised online calls for members and got feedback from institutions that had already adopted Pywb.

Unsurprisingly, these consultations with the membership confirmed the – current – importance of OpenWayback. The results also showed a general interest in adopting Pywb whilst highlighting a number of likely hurdles our members faced in that change. Consequently, we decided to move ahead with the decision to endorse Pywb as a replay solution and work to support IIPC members’ adoption of Pywb.

The members of the IIPC’s Tools Development Portfolio then analyzed the results of the survey and, in consultation with Ilya Kreymer, came up with a list of requirements that, once met, would make it much easier for IIPC members to adopt Pywb. These requirements were then divided into three work packages to be delivered over the next year.

Pywb

Over the last few years, Pywb has emerged as a capable alternative to OpenWayback. In some areas of playback it is better than, or at least comparable to, OpenWayback, having been updated to account for recent developments in web technology. As it is more modern and still actively maintained, the gap between it and OpenWayback is only likely to grow. As it is also open source, it makes for a reasonable alternative for the IIPC to support as the new “go-to” replay tool.

However, while Pywb’s replay abilities are impressive, it is far from a drop-in replacement for OpenWayback. Notably, OpenWayback offers more customization and localization support than Pywb. There are also many differences between the two pieces of software that make migration from one to the other difficult.

To address this, the IIPC has signed a contract with Ilya Kreymer, the maintainer of the web archive replay tool Pywb. The IIPC will be providing financial support for the development of key new features in Pywb.

Planned work

The first work package will focus on developing a detailed migration guide for existing OpenWayback users. This will include example configurations for common cases and cover diverse backend setups (e.g. CDX vs. ZipNum vs. OutbackCDX).

The second package will include Pywb improvements to make it more modular, extended support and documentation for localization, and extended access control options.

The third package will focus on customization and integration with existing services. It will also bring in some improvements to the Pywb “calendar page” and “banner”, bringing to them features now available in OpenWayback.

There is clearly more work that can be done on replay. The ever-fluid nature of the web means we will always be playing catch-up. As work progresses on the work packages mentioned above, we will be soliciting feedback from our community. Based on that, we will consider how best to meet those challenges.

Tag: pywb project