Building a Web-Archive Image Search Service at Arquivo.pt

By André Mourão, Senior Software Engineer, Arquivo.pt and Daniel Gomes, Head of Arquivo.pt


Arquivo.pt launched a service that enables search over 1.8 billion images archived from the web since the 1990s. Users can submit text queries and immediately receive a list of historical web-archived images through a web user interface or an API.

The goal was to develop a service that addressed the challenges raised by the inherent temporal properties of web-archived data, but at the same time provided a familiar look-and-feel to users of platforms such as Google Images.

Supporting image search over web archives raised new challenges: little research had been published on the subject, and the data to be processed was large and heterogeneous, amounting to over 530 TB of historical web data published since the early days of the Web.

The Arquivo.pt Image Search service has been officially running since March 2021 and is based on Apache Solr. All the software developed is available as open source and can be freely reused and improved.

Search images from the Past Web

The simplest way to access the search service is using the web interface. Users can, for example, search for GIF images published during the early days of the Web related to Christmas by defining the time span of the search.

Figure 1. Results from Advanced Image Search for GIF images archived between 6 August 1991 and 16 December 2005 (https://arquivo.pt/image/search?q=Christmas+type%3Agif&l=en&from=19910806&to=20051216).
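The search shown in Figure 1 can also be reproduced by building the URL programmatically. The following is a minimal sketch that only mirrors the query parameters visible in that example URL (q, l, from, to); the function name build_search_url and its arguments are hypothetical, and other parameters supported by the web interface are not covered.

```python
from urllib.parse import urlencode

# Base path taken from the example URL in Figure 1.
SEARCH_BASE = "https://arquivo.pt/image/search"


def build_search_url(terms, image_type=None, date_from=None, date_to=None, lang="en"):
    """Return a web-interface search URL. Dates are YYYYMMDD strings."""
    # The example URL restricts the file format with a "type:" operator inside q.
    query = terms if image_type is None else f"{terms} type:{image_type}"
    params = {"q": query, "l": lang}
    if date_from:
        params["from"] = date_from
    if date_to:
        params["to"] = date_to
    # urlencode escapes the space as "+" and the colon as "%3A".
    return f"{SEARCH_BASE}?{urlencode(params)}"


url = build_search_url("Christmas", image_type="gif",
                       date_from="19910806", date_to="20051216")
print(url)
```

With these arguments the function rebuilds the URL given in the Figure 1 caption.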

There is also an Advanced Image Search interface available at https://arquivo.pt/advancedImages.jsp?l=en which allows users to:

  • search for terms
  • search for phrases
  • exclude certain words
  • limit search by dates
  • select the size of the images
  • select the file format of the images
  • enable/disable safe search filter
  • restrict the site where the image was found

Figure 2. Details for an Image Search result.

Users can select a given result and consult metadata about the image (e.g. title, ALT text, original URL, resolution or media type) or about the web page that contained it (e.g. page title, original URL or crawl date). Quickly identifying the page that embedded the image enables the interpretation of its original context.

Figure 3. The web page that contained an image returned in the search results can be immediately visited by selecting the “Visit” button.

Automatic identification of Not Suitable For Work images

Arquivo.pt automatically performs broad crawls of web pages hosted under the .PT domain. As a result, some of the archived images may contain pornographic content that users do not want displayed by default, for instance while using Arquivo.pt in a classroom.

The Image Search service retrieves images based on the filename, alternative text and the surrounding text of an image contained on a web page. Images returned to answer a search query may include offensive content even for inoffensive queries due to the prevalence of web spam.
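To illustrate the kind of textual evidence described above, here is a minimal sketch that collects the filename terms, the alternative text and a window of nearby words for each image in a page. It is not the extraction code used by Arquivo.pt: the class name ImageTextExtractor, the window parameter and the word-window heuristic are assumptions made for this example, and the sample HTML is made up.

```python
import re
from html.parser import HTMLParser
from posixpath import basename
from urllib.parse import unquote, urlsplit


class ImageTextExtractor(HTMLParser):
    """Collect filename terms, alt text and nearby words for each <img>."""

    def __init__(self, window=10):
        super().__init__()
        self.window = window  # hypothetical: words of context kept on each side
        self.words = []       # text words seen so far (script/style are not skipped)
        self.images = []      # one record per <img>, with its position in self.words

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src") or ""
        name = basename(unquote(urlsplit(src).path))
        stem = name.rsplit(".", 1)[0]  # drop the file extension, if any
        self.images.append({
            "src": src,
            "filename_terms": [t for t in re.split(r"[\W_]+", stem) if t],
            "alt": attributes.get("alt") or "",
            "position": len(self.words),
        })

    def handle_data(self, data):
        self.words.extend(data.split())

    def results(self):
        """Attach the surrounding words to each image record."""
        records = []
        for image in self.images:
            start = max(0, image["position"] - self.window)
            end = image["position"] + self.window
            records.append({**image, "surrounding": " ".join(self.words[start:end])})
        return records


page = ('<p>Merry Christmas from Lisbon</p>'
        '<img src="/img/xmas_tree-1999.gif" alt="Christmas tree">'
        '<p>Happy new year</p>')
parser = ImageTextExtractor(window=3)
parser.feed(page)
parser.close()
records = parser.results()
print(records[0]["filename_terms"], records[0]["alt"], records[0]["surrounding"])
```

Because retrieval relies on text like this rather than on the pixels themselves, a page stuffed with misleading words can attach an inoffensive query to an offensive image, which is why a separate image classifier is needed.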

Detecting NSFW (Not Suitable For Work) content in archived web pages is challenging due to the scale (billions of images) and the diversity (small to very large images, graphics, colour images, among others) of the image content.

Currently, Arquivo.pt applies an NSFW image classifier trained with over 60 GB of images scraped from the web. Instead of labelling images as safe or not safe, this classifier returns the probability of an image belonging to each of five categories: drawing (SFW drawings), neutral (SFW photographic images), hentai (including explicit drawings), porn (explicit photographic images) and sexy (potentially explicit images that are not pornographic, e.g. woman in bikini). An overall nsfw score is computed as the sum of the hentai and porn scores.

By default, Arquivo.pt hides images classified as pornographic from the search results, that is, images whose NSFW classification score is higher than 0.5. Users can disable this filter through the Advanced Image Search interface.
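The filtering rule can be sketched in a few lines. This example assumes that the 0.5 threshold applies to the combined nsfw score (hentai plus porn), which is one reading of the text; the function names, the "scores" key and the sample probabilities are all made up for illustration and do not come from the Arquivo.pt code base.

```python
# Category names as listed in the article; the threshold is the default filter value.
CATEGORIES = ("drawing", "neutral", "hentai", "porn", "sexy")
NSFW_THRESHOLD = 0.5


def nsfw_score(probabilities):
    """Combined score: sum of the hentai and porn probabilities."""
    return probabilities.get("hentai", 0.0) + probabilities.get("porn", 0.0)


def filter_results(results, safe_search=True):
    """Drop results whose combined score exceeds the threshold, unless disabled."""
    if not safe_search:
        return list(results)
    return [r for r in results if nsfw_score(r["scores"]) <= NSFW_THRESHOLD]


# Made-up classifier outputs for two hypothetical search results.
results = [
    {"id": "a", "scores": {"drawing": 0.05, "neutral": 0.90, "hentai": 0.01,
                           "porn": 0.02, "sexy": 0.02}},
    {"id": "b", "scores": {"drawing": 0.05, "neutral": 0.10, "hentai": 0.30,
                           "porn": 0.45, "sexy": 0.10}},
]
visible = [r["id"] for r in filter_results(results)]
unfiltered = [r["id"] for r in filter_results(results, safe_search=False)]
print(visible, unfiltered)
```

Note that in this sketch the sexy category does not contribute to the combined score, so such images remain visible with the filter enabled.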

Image Search API

Arquivo.pt developed a free and open Image Search API so that third-party software developers can integrate the Arquivo.pt image search results into their applications and, for instance, apply for the annual Arquivo.pt Awards.

The Image Search API supports keyword-to-image search and gives access to preserved web content and its related metadata. The API returns a JSON object containing the same metadata elements that are available through the “Details” button.
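A client for such an API could look like the sketch below. The endpoint URL, the parameter names (q, from, to, maxItems) and the response fields (responseItems, imgSrc, pageURL) are assumptions that should be checked against the API documentation before use; the sample payload is made up so that the parsing step can be shown without a network call.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: endpoint and parameter names must be confirmed in the API documentation.
API_ENDPOINT = "https://arquivo.pt/imagesearch"


def build_api_url(query, date_from=None, date_to=None, max_items=10):
    """Build a request URL for a text query, optionally bounded by dates."""
    params = {"q": query, "maxItems": max_items}
    if date_from:
        params["from"] = date_from
    if date_to:
        params["to"] = date_to
    return f"{API_ENDPOINT}?{urlencode(params)}"


def parse_items(payload_text):
    """Return the list of result objects from a JSON response body."""
    payload = json.loads(payload_text)
    return payload.get("responseItems", [])  # assumed name of the results field


def search_images(query, **kwargs):
    """Perform the HTTP request (needs network access; not called in this sketch)."""
    with urlopen(build_api_url(query, **kwargs), timeout=30) as response:
        return parse_items(response.read().decode("utf-8"))


# Made-up response body, used only to demonstrate parsing.
sample = ('{"responseItems": [{"imgSrc": "http://example.pt/tree.gif", '
          '"pageURL": "http://example.pt/"}]}')
items = parse_items(sample)
api_url = build_api_url("Christmas", date_from="19910806",
                        date_to="20051216", max_items=5)
print(api_url, items[0]["imgSrc"])
```

Keeping URL construction and parsing separate from the network call makes both easy to test offline and to adapt if the real parameter or field names differ.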

Figure 4. All metadata about the image and its host web page is available through the “Details” button or the Image Search API.

Figure 5. GitHub Wiki page that documents the Arquivo.pt Image Search API (https://arquivo.pt/api/imagesearch).

Scientific and technical contributions

There are several services that enable image search over web collections (e.g. Google Images). However, the literature published about them is very limited and even less research has been published about how to search images in web archives.

Moreover, supporting image search over the historical web data preserved by web archives raises challenges that live-web search engines do not need to address, such as dealing with multiple versions of images and pages referenced by the same URL, handling the duplication of web-archived images over time, and ranking search results according to the temporal features of web data published over decades.

Developing and maintaining an image search engine over the Arquivo.pt web archive led to scientific and technical contributions that address the following research questions:

  • How to extract relevant textual content in web pages that best describes images?
  • How to de-duplicate billions of archived images collected from the web over decades?
  • How to index and rank search results over web-archived images?

The main contributions of our work are:

  • A toolkit of algorithms that extract textual metadata to describe web-archived images
  • A system architecture and workflow to index large amounts of web-archived images considering their specific temporal features
  • A ranking algorithm to order image-search results
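To make the de-duplication question concrete, the sketch below groups captures by a digest of the image bytes and keeps the earliest capture of each distinct image. This is a generic technique and an assumption of this example, not the method documented for Arquivo.pt: it only detects byte-identical copies, and the function name, record layout and sample captures are hypothetical.

```python
import hashlib


def deduplicate(captures):
    """Group captures by content digest, keeping the earliest capture of each.

    captures: iterable of (timestamp, url, content_bytes), where timestamp is a
    YYYYMMDDhhmmss string, so lexicographic order matches chronological order.
    """
    unique = {}
    for timestamp, url, content in captures:
        digest = hashlib.sha256(content).hexdigest()
        entry = unique.get(digest)
        if entry is None:
            unique[digest] = {"first_seen": timestamp, "url": url, "captures": 1}
            continue
        entry["captures"] += 1
        if timestamp < entry["first_seen"]:
            # An older capture of the same bytes becomes the representative.
            entry["first_seen"] = timestamp
            entry["url"] = url
    return unique


# Made-up captures: the same bytes under two URLs, and one URL with two versions.
captures = [
    ("20051216120000", "http://example.pt/a.gif", b"GIF89a-tree"),
    ("19991224090000", "http://example.pt/tree.gif", b"GIF89a-tree"),
    ("20010101000000", "http://example.pt/a.gif", b"GIF89a-star"),
]
unique = deduplicate(captures)
tree = unique[hashlib.sha256(b"GIF89a-tree").hexdigest()]
print(len(unique), tree["first_seen"], tree["url"], tree["captures"])
```

Keying on content rather than on the URL handles both cases mentioned earlier: the same image archived under several URLs collapses into one entry, while different versions served from the same URL stay separate.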
