SolrWayback 4.0 release! What’s it all about?

By Jesper Lauridsen, frontend developer at the Royal Danish Library.

This blog post is republished from Software Development at Royal Danish Library.

So, it’s finally here! SolrWayback 4.0 was released December 20th, after an intense development period. In this blog post, we’ll give you a nice little overview of the changes we made, some of the improvements and some of the added functionality that we’re very proud of having released. So let’s dig in!

A small intro – What is SolrWayback really?

As the name implies, SolrWayback is a fusion of discovery (Solr) and playback (Wayback) functionality. Besides full-text search, Solr provides multiple ways of aggregating data, moving common net archive statistics tasks from slow batch processing to interactive requests. Based on input from researchers, the feature set is continuously expanding with aggregation, visualization and extraction of data.

SolrWayback relies on real time access to WARC files and a Solr index populated by the UKWA webarchive-discovery tool. The basic workflow is:

Amass a collection of WARCs (using Heritrix, wget, ArchiveIT…) and put them on live storage
Analyze and process the WARCs using webarchive-discovery. Depending on the number of WARCs, this can be a fairly heavy job: processing ½ petabyte of WARCs at the Royal Danish Library took 40+ CPU-years
Index the result from webarchive-discovery into Solr. For non-small collections, this means SolrCloud and Solid State Drives. A rule of thumb is that the index takes up about 5-10% of the size of the compressed WARCs
Connect SolrWayback to the WARC storage and the Solr index
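To get a feel for the rule of thumb in the workflow above, here is a small back-of-the-envelope sketch. It is only an illustration: the function name is made up for this example, we assume 1 petabyte = 1000 terabytes, and we assume the ½ petabyte figure refers to compressed WARCs.

```python
from typing import Tuple


def estimate_index_size_tb(compressed_warc_tb: float) -> Tuple[float, float]:
    """Return a (low, high) estimate of the Solr index size in terabytes.

    Uses the 5-10% rule of thumb from the text. The function itself is a
    hypothetical helper, not part of SolrWayback or webarchive-discovery.
    """
    low_ratio, high_ratio = 0.05, 0.10
    return compressed_warc_tb * low_ratio, compressed_warc_tb * high_ratio


# Half a petabyte of compressed WARCs, assuming 1 PB = 1000 TB.
low, high = estimate_index_size_tb(500.0)
print(f"Estimated index size: {low:.0f} to {high:.0f} TB")
```

With these assumptions, a ½ petabyte collection lands somewhere between 25 and 50 TB of index, which is why SolrCloud and Solid State Drives come into the picture.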

A small visual illustration of the components used for SolrWayback.

Live demo

Try the live demo provided by the National Széchényi Library, Hungary (thanks!).

Helicopter view: What happened to SolrWayback?

We decided to give SolrWayback a complete makeover, making the interface more coherent, the design more stylish, and the information architecture better structured. At first glance, not much has changed apart from an update to the color scheme, but looking closely, we’ve added some new functionality and grouped some of the existing features in a new, and much improved, way.

The search page is still the same, and after searching, you’ll still see all the results lined up in a single column. We’ve added some more functionality up front, giving you the opportunity to see the WARC header for a single post, as well as to select an alternative playback engine for the post. Some of the more noticeable reworks and optimizations are highlighted in the sections below.

Faster load times

We’ve done some work under the hood too, to make the application run faster. Many of our calls to the backend have been reworked into individual calls that are only requested when needed. This means that facets are now fetched in a separate call to the backend instead of being requested together with every query. So when you’re paging through results, we only request the results, which gives a faster response, since the facets stay the same. The same principle has been applied to loading images and individual post data.
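To illustrate the idea of splitting facets from paging, here is a minimal sketch using standard Solr request parameters (q, start, rows, facet, facet.field). It is not the actual SolrWayback backend API: the helper names are made up, and the field names title, domain and crawl_year are assumed to exist in the index.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlencode


def results_params(query: str, page: int, page_size: int = 20) -> Dict[str, object]:
    """Parameters for one page of results, with faceting switched off."""
    return {"q": query, "start": page * page_size, "rows": page_size, "facet": "false"}


def facet_params(query: str, facet_fields: List[str]) -> List[Tuple[str, object]]:
    """Parameters for a facet-only request: rows=0 asks for counts, not documents."""
    params: List[Tuple[str, object]] = [("q", query), ("rows", 0), ("facet", "true")]
    params += [("facet.field", field) for field in facet_fields]
    return params


query = "title:denmark"  # assumed field name, for illustration only

# The facets are requested once per query ...
facets_qs = urlencode(facet_params(query, ["domain", "crawl_year"]))

# ... while each page change only asks for the next slice of results.
page_qs = [urlencode(results_params(query, page)) for page in range(3)]

print(facets_qs)
for qs in page_qs:
    print(qs)
```

The point of the split is that the facet request is sent a single time, while flipping through pages only repeats the cheaper results request.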

GUI polished

As mentioned, we’ve done some cleanup in the interface, making it easier to navigate. The search field has been reworked to serve many different needs. It expands when the query spans several lines (press SHIFT+Enter to add a line break), making large and complex queries much easier to manage. We’ve even added context-sensitive help, so if you’re making queries with boolean operators or similar, SolrWayback tells you whether the syntax is correct.

We’ve kept the most used features upfront, with image and URL search readily available from the get go. The same goes for the option to group the search results to avoid URL duplicates.

Below the line are some of the other features that are not directly linked to the query field but are nice to have up front: searching with an uploaded file, searching by GPS, and the toolbox. The toolbox contains many of the tools that can help you gain insight into the archive, for example by generating wordclouds or link graphs, searching through the Ngram interface and much more.

The nifty helper when making complex queries for SolrWayback.

Image searching by location rethought

We’ve reworked the way you search and look through the results when searching by GPS coordinates. We’ve made it easy to search for a specific location, and we’ve grouped the results so that they are easier to interpret.

The new and improved location search interface. Images intentionally blurred.

Zooming into the map will expand the places where images are clustered. Furthermore, we realize that sometimes the need is to look through all the images regardless of their exact position, so we’ve made a split screen that can expand either way, depending on your needs. It’s still possible to do a new search based on any of the found images in the list.

Elaborated export options

We’ve added more functionality to the export options. It’s possible to export both fields from the full search result and the raw WARC records for the search result, if enabled in the settings. You can even decide the format of your export and we’ve added an option to select exactly which fields in the search result you want exported – so if you want to leave out some stuff, that is now possible!
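As a rough illustration of what picking fields for an export means, here is a small sketch that writes only the chosen fields to CSV. This is not SolrWayback’s export code: the function name, the sample documents and the field names (url, crawl_date, content_type) are all made up for the example, and CSV is just one assumed output format.

```python
import csv
import io
from typing import Dict, Iterable, List


def export_selected_fields(docs: Iterable[Dict[str, str]], fields: List[str]) -> str:
    """Write only the chosen fields of each document to CSV text.

    Fields missing from a document are exported as empty cells.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for doc in docs:
        writer.writerow({field: doc.get(field, "") for field in fields})
    return buffer.getvalue()


# Hypothetical search results with more fields than we want to export.
docs = [
    {"url": "http://example.dk/", "crawl_date": "2019-05-01", "content_type": "text/html"},
    {"url": "http://example.dk/logo.png", "content_type": "image/png"},
]

csv_text = export_selected_fields(docs, ["url", "crawl_date"])
print(csv_text)
```

Anything not listed in the field selection, here content_type, is simply left out of the export.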

Quickly move through your archive

The standard search UI is pretty much what you are accustomed to, but we made an effort to keep things simple and clean while also facilitating in-depth research and tracking of subject interests. In the search results you get a basic outline of metadata for each post. You can narrow your search with the provided facet filters. When expanding a post you get access to all metadata, and every field has a link if you wish to explore a particular angle related to your post. So you can quickly navigate the archive by starting wide, filtering, and then drilling down to find related material.

Visualization of search result by domain

We’ve also made it very easy to quickly get an overview of the results. When clicking the icon in the results headline, you get a complete overview of the different domains in the results and how large a portion of the search result each of them accounts for in each year. This is a very neat way to see the relative distribution by year.
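The numbers behind such a view boil down to a per-year share for each domain. Here is a minimal sketch of that calculation, assuming the search results are available as (year, domain) pairs. The function name and the sample data are made up, and this is not how SolrWayback itself computes the chart.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def domain_share_by_year(results: Iterable[Tuple[int, str]]) -> Dict[int, Dict[str, float]]:
    """Return {year: {domain: percentage of that year's results}}."""
    counts = defaultdict(Counter)
    for year, domain in results:
        counts[year][domain] += 1

    shares: Dict[int, Dict[str, float]] = {}
    for year, domain_counts in counts.items():
        total = sum(domain_counts.values())
        shares[year] = {d: 100.0 * n / total for d, n in domain_counts.items()}
    return shares


# Hypothetical results: (harvest year, domain) for each hit.
sample = [
    (2019, "example.dk"),
    (2019, "example.dk"),
    (2019, "news.example"),
    (2020, "news.example"),
]

shares = domain_share_by_year(sample)
for year in sorted(shares):
    print(year, shares[year])
```

Because the shares are calculated within each year, a year with few results can still be compared with a year with many.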

The toolbox

With quick access from right under the search box, we have gathered a Toolbox with utilities for further data exploration. In the following, we will give you a quick tour of the updates and new functionality in this section.

Linkgraph, domain stats and wordcloud

We reworked the Linkgraph, the Wordcloud and the Domain stats components a little, adding more interaction to the graph and domain stats and polishing the interface for all of them. For the Linkgraph, it is now possible to highlight certain sections within the graph, making it much easier to navigate the sometimes rather large cluster and to look at the connections you find relevant. These tools now provide an easy and quick way to gain deeper insight into specific domains and the content they hold.

Ngram

We are so pleased to finally be able to supply a classical Ngram search tool complete with graphs and all. In this version you are able to search through the entire HTML content of your archive and see how the results are distributed over time (harvest time). You can even do comparisons by providing several queries sequentially and see how they compare. On every graph the datapoint at each year is clickable and will trigger a search for the underlying results which is a very handy feature for checking the context and further exploring underlying data. Oh and before we forget – if things get a little crowded in the graph area you can always click on the nicely colored labels at the top of the chart and deselect/select each query.

If the HTML content isn’t really your thing but your passion lies within the HTML tags themselves, we’ve got you covered. Just flip the radio button under the search box over to HTML-tags in HTML-pages and you will have all the same features listed above, but now the underlying data will be the HTML tags themselves. As easy as that, you will finally be able to get answers to important questions like ‘when did we actually start to frown upon the blink tag?’
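For the curious, the general idea behind an Ngram graph can be sketched in a few lines: for each query, count how many pages from each harvest year contain it, relative to all pages from that year. This is only an illustration under our own assumptions: the function name and sample pages are made up, a simple case-insensitive substring match stands in for the real search, and whether the Ngram tool normalizes in exactly this way is not something the sketch claims.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def ngram_by_year(pages: Iterable[Tuple[int, str]], queries: List[str]) -> Dict[str, Dict[int, float]]:
    """Return {query: {year: fraction of that year's pages containing the query}}."""
    totals: Counter = Counter()
    hits = {query: Counter() for query in queries}
    for year, text in pages:
        totals[year] += 1
        lowered = text.lower()
        for query in queries:
            if query.lower() in lowered:
                hits[query][year] += 1
    return {
        query: {year: hits[query][year] / totals[year] for year in sorted(totals)}
        for query in queries
    }


# Hypothetical pages: (harvest year, page text).
pages = [
    (2005, "The blink tag is fun"),
    (2005, "Marquee everywhere"),
    (2015, "Responsive design"),
    (2015, "Flexbox layout"),
]

series = ngram_by_year(pages, ["blink", "marquee"])
for query, by_year in series.items():
    print(query, by_year)
```

Each query gets its own series over the same years, which is what makes it possible to compare several queries in one graph.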

Gephi Export

The possibility to export a query in a format that can be used in Gephi is still present in the new version of SolrWayback. This allows you to create some very nice visual graphs that can help you explore how exactly a collection of results is tied together. If you’re interested in this, feel free to visit the labs website about Gephi graphs, where we’ve showcased some of the possibilities of using Gephi.

Tools for the playback

SolrWayback comes with a built-in playback engine, but can be configured to use another playback engine such as PyWb. The SolrWayback playback viewer shows a small toolbar overlay on the page that can be opened or hidden. When the toolbar is hidden, the page is displayed without any frame, top toolbar, etc., to show the page exactly as it was harvested.

The menu when you access the individual search results.

When you have clicked a specific result, you’re taken to the harvested resource. If it is a website, you will be shown a menu to the right, giving you some more ways to analyse the resource. This menu is hidden in the left upper corner when you enter, but can be expanded by clicking on it.

The harvest calendar will give you a very smooth overview of the harvest times of the resource, so you can easily see when, and how often, the resource has been harvested in the current index. This gives you an excellent opportunity to look at your index over time, and see how a website evolved.

The PWID option lets you export the harvested resource’s metadata, so you can share what’s in that particular resource in a nice and clean way. The PWID standard is an excellent way to keep track of and share resources between researchers, so that a list of the exact dataset is preserved, along with all the resources that go with it.

View page resources gives you a clean overview of the contents of the harvested page, along with all the resources. We’ve even added a way to quickly see the difference between the first and the last harvested resource on the page, giving you a quick hint of the contents and if they are all from the same period. You can even see a preview of the page here and download the individual resources from the page, if you wish.

Customization of your local SolrWayback instance

We’ve made it possible to customize your installation to fit your needs. The logo can be changed, the about text can be changed, and you can even customize your search guidelines if you need to. This gives you a chance to make the instance your own, so that people can recognize when they are using your instance of SolrWayback, and it can now reflect your organisation and the people who are contributing to it.

The future of the SolrWayback

This is just the beginning for SolrWayback. Further down the road, we hope to add even more functionality that can help you dig deeper into the archives. One of our main goals is to provide you with the tools necessary to understand and analyse the vast amounts of data that lie in most of the archives that SolrWayback is designed for. We already have a few ideas as to what could be useful, but if you have any suggestions for tools that might be helpful, feel free to reach out to us.
