Navigating Through Archived Websites: From Text Matching to Generative AI-Enhanced Q&A

By Peter Chan, Web Archivist at Stanford Libraries


Navigating the vast ocean of websites archived by organizations such as Stanford University Libraries, the Internet Archive, national libraries, and other university libraries can be a daunting task. The purpose of this blog post is to provide a roadmap of possible strategies that Stanford Libraries can adopt to help researchers use our web archive effectively. The archive is captured with Archive-It, housed at Stanford, and displayed through pywb. I acknowledge that different institutions may have different set-ups and tools in place to help researchers. However, my hope is that this post will prove beneficial to the broader web archiving community.


Locating Specific Text on a Page

The simplest way to locate specific text on a webpage is to use your browser’s ‘Find’ function. You can open it by pressing Ctrl+F on a Windows PC, Chromebook, or Linux system, or Command+F on a Mac. This function is a standard feature across all major browsers. It’s such a basic tool that people often forget it’s there, so a quick reminder can help users make the most of it. If your exploration extends beyond a single webpage, you may need the more advanced tools discussed in the following sections.

Employing Full Text Search for Websites

If you have access to WARC files that you wish to analyze, deploying SolrWayback could be a worthwhile option to explore. This software, developed by the Royal Danish Library, is specifically designed to facilitate navigation through historical ARC/WARC files. It allows free-text searching across multiple resource types, such as HTML pages, PDFs, URLs, media metadata, and more. It also includes an interactive link graph for domains, giving insight into both incoming and outgoing links. More details can be found here: https://netpreserveblog.wordpress.com/2021/02/25/solrwayback-4-0-release-whats-it-all-about/.

SolrWayback is used by the British Library, the Royal Danish Library, the Bibliothèque nationale de France and others.
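Because SolrWayback sits on top of a Solr index, a full-text search can in principle also be expressed as a plain Solr query. The sketch below only builds such a query URL and does not contact any server. The host, port, core name ("netarchivebuilder"), and the "domain" field are assumptions about a typical local installation and should be checked against your own deployment; the function name is hypothetical.

```python
from urllib.parse import urlencode


def build_solr_search_url(base_url, core, text, domain=None, rows=10):
    """Build a Solr /select URL for a free-text search.

    base_url, core, and the 'domain' filter field are assumptions
    about the local setup, not values taken from SolrWayback itself.
    """
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    if domain:
        # Optional filter query restricting results to one domain.
        params.append(("fq", f'domain:"{domain}"'))
    return f"{base_url.rstrip('/')}/solr/{core}/select?{urlencode(params)}"


url = build_solr_search_url(
    "http://localhost:8983", "netarchivebuilder", "foia",
    domain="stanford.edu", rows=5,
)
print(url)
```

Most researchers will prefer the SolrWayback search interface itself; a URL like this is mainly useful for scripted or batch queries.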


If you don’t have the technical capabilities to implement SolrWayback and your archived sites have been indexed by Google, you could recommend that your users use Google Search as an alternative. To limit the search to a particular website, type “site:” into the Google search bar, followed immediately by the website’s address. Then, after a space, enter the search term. For instance, enter “site:https://swap.stanford.edu/was/ foia” into the search bar and press Return or Enter; the results will list pages from https://swap.stanford.edu/was/ that contain the word “foia”.
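If you want to offer users a ready-made link instead of asking them to type the query, the same “site:” search can be encoded in a Google search URL. The following is a minimal sketch; the function name is hypothetical, and it only constructs the link without sending any request.

```python
from urllib.parse import urlencode


def google_site_search_url(site, term):
    """Return a Google search URL restricted to one site via 'site:'."""
    query = f"site:{site} {term}"
    # urlencode percent-encodes the query so it is safe inside a URL.
    return "https://www.google.com/search?" + urlencode({"q": query})


link = google_site_search_url("https://swap.stanford.edu/was/", "foia")
print(link)
```

Opening the printed link in a browser should be equivalent to typing the example query into the search bar by hand.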

Navigating Through Entity Extraction, Domain Analysis, Network Diagram & More

For digital humanities scholars seeking a more in-depth exploration, the Internet Archive provides the Archives Research Compute Hub (ARCH). ARCH is designed to assist users in conducting and supporting computational research on a large scale with digital collections. This includes areas like text and data mining, data science, digital scholarship, machine learning, and others. Users have the opportunity to create bespoke research collections pertaining to an extensive range of subjects, produce and access datasets ready for research from these collections, and carry out analyses on these datasets. Consistent with the best practices for reproducibility, ARCH enables open publication and preservation of user-generated datasets.

Engaging with Generative AI-Powered Question and Answer Systems

Recent advancements in generative artificial intelligence have made it possible for users to interact with computer systems in natural languages such as English and Chinese. AI tools such as ChatGPT, Google Bard, and HuggingChat are increasingly being employed for tasks such as email composition, coding, content generation, tutoring in various subjects, language translation, and video game character simulation. These tools rely on large language models like GPT-4, PaLM 2, and Falcon 40B, which are trained on data collected up to a certain cutoff date, to generate responses. They are adept at interpreting user queries and delivering valuable insights.

However, they do have limitations, particularly when addressing questions that require information beyond their training cut-off date. In these situations, these tools may generate responses that are not backed by the available data, essentially inventing answers. To avoid this, these AI tools can be set up to constrain their responses to designated data sources via plugins like Web Request.

As an example, when I asked ChatGPT with the Web Request plugin, “What is this site https://wayback.archive-it.org/8751/20230513070138/https://stanfordhatesfun.com/ about,” the response was, “The website ‘stanfordhatesfun.com’ appears to be a channel for expressing dissatisfaction with certain Stanford University policies and actions. It is part of the Stanford Activism collection and has been archived by the Stanford University Archives…”
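The archived address in this example follows the usual replay pattern of collection number, 14-digit capture timestamp, and original URL. As a rough sketch, a script that prepares such addresses for a question-answering tool could split them apart as shown below. The function name is hypothetical, and the pattern deliberately ignores variants such as timestamps followed by a modifier suffix.

```python
import re
from datetime import datetime

# Pattern for Archive-It replay URLs of the form
# https://wayback.archive-it.org/<collection>/<YYYYMMDDhhmmss>/<original URL>
REPLAY_RE = re.compile(
    r"^https?://wayback\.archive-it\.org/"
    r"(?P<collection>\d+)/(?P<timestamp>\d{14})/(?P<original>.+)$"
)


def parse_replay_url(url):
    """Split a replay URL into collection, capture time, and original URL."""
    match = REPLAY_RE.match(url)
    if match is None:
        raise ValueError(f"Not a recognized replay URL: {url}")
    return {
        "collection": match["collection"],
        "captured_at": datetime.strptime(match["timestamp"], "%Y%m%d%H%M%S"),
        "original_url": match["original"],
    }


info = parse_replay_url(
    "https://wayback.archive-it.org/8751/20230513070138/https://stanfordhatesfun.com/"
)
print(info["collection"], info["captured_at"], info["original_url"])
```

Knowing the capture date and the original address makes it easier to judge whether an answer generated from the page reflects the site as it was archived or as it exists today.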

The Internet Archive has also introduced an experimental tool “IA Copilot” to interact with the web archive content in the Wayback Machine using ChatGPT.


Note: This blog post was created with the assistance of ChatGPT. I would like to express my gratitude to Josh Schneider, Olga Holownia, and Edward Summers for their valuable input and suggestions. If you are interested in studying any of Stanford’s web archive collections, please feel free to contact me at pchan3 AT stanford.edu.
