The Spanish Web Archive as a training field for Natural Language Processing models

By Alicia Pastrana García and José Carlos Cerdán Medina, National Library of Spain


Over the last 20 years, most web archives have been building their website collections. These will become very valuable as the years go by, since much of this information will no longer exist on the Internet. However, do we have to wait that long for our collections to be useful?

The huge amount of information that the National Library of Spain (BNE) has collected since 2009 has emerged as one of the largest linguistic corpora of the language as it is currently used. For this reason, the BNE has collaborated with the Barcelona Supercomputing Center (BSC) to create the first massive AI model of the Spanish language. This collaboration takes place within the framework of the Language Technologies Plan of the State Secretariat for Digitization and Artificial Intelligence of the Ministry of Economic Affairs and Digital Agenda of Spain.

The players

The National Library of Spain has been harvesting information from the web for more than 10 years. The Spanish Web Archive is still young, but it already contains more than a petabyte of information.

The Barcelona Supercomputing Center (BSC), for its part, is the leading supercomputing center in Spain. It offers infrastructure and supercomputing services to Spanish and European researchers in order to generate knowledge and technology for society.

The data

The Spanish Web Archive, like most national library web archives, is based on a mixed model that combines broad and selective crawls. The broad crawls harvest as many Spanish domains as possible without going very deep into the navigation levels; their scope is the .es domain. Selective crawls complement the broad crawls by harvesting a smaller sample of websites, but in greater depth and with greater frequency. These sites are selected for their relevance to history, society and culture. Selective crawls also include other kinds of domains (.org, .com, etc.).

Web curators from the BNE and the regional libraries select the seeds that will be part of these collections. They assess the relevance of the websites from a heritage point of view and their importance for future research and knowledge.

For this project we chose the content harvested in the selective crawls, a collection of around 40,000 websites.

How to prepare WARC files

The harvested content is stored in WARC (Web ARChive) files. The BSC only needed the text extracted from the WARC files to train the language models, so everything else was removed using a purpose-built script. The script uses a parser to keep only the HTML tags that carry text (paragraphs, headlines, keywords, etc.) and to discard everything that is not useful for the project (e.g. images, audio, video).
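
As a rough illustration of the first step, the sketch below walks through an uncompressed WARC stream and keeps only the HTML payloads of response records. It is not the script used in the project: the function names are hypothetical, the reader is a deliberately minimal one built on the standard library, and it assumes uncompressed input (real WARC files are usually gzip-compressed, and a dedicated WARC library would normally be used instead).

```python
import io


def iter_warc_records(stream):
    """Yield (headers, block) pairs from an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the WARC header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        block = stream.read(int(headers["content-length"]))
        yield headers, block


def html_payloads(stream):
    """Yield (uri, html_bytes) for response records served as text/html."""
    for headers, block in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue
        # A response block holds the HTTP headers, an empty line, then the body.
        http_head, _, body = block.partition(b"\r\n\r\n")
        if b"text/html" in http_head.lower():
            yield headers.get("warc-target-uri"), body


def _record(uri, content_type, body):
    """Build one synthetic response record (illustration data only)."""
    http = b"HTTP/1.1 200 OK\r\nContent-Type: " + content_type + b"\r\n\r\n" + body
    head = (b"WARC/1.0\r\nWARC-Type: response\r\nWARC-Target-URI: " + uri
            + b"\r\nContent-Length: " + str(len(http)).encode() + b"\r\n\r\n")
    return head + http + b"\r\n\r\n"


sample = (_record(b"http://example.es/", b"text/html", b"<p>Hola</p>")
          + _record(b"http://example.es/logo.png", b"image/png", b"\x89PNG"))

for uri, html in html_payloads(io.BytesIO(sample)):
    print(uri, html)
```

Filtering on the record type and the HTTP content type is what lets images, audio and video be dropped before any HTML parsing takes place.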

The parser is Selectolax, an open-source Python module. It is seven times faster than other parsers and easy to customize. Selectolax can be configured to keep the tags that contain text and to discard those that are not useful for the project. At the end of the process, the script generated JSON files organized according to the selected HTML tags, structuring the information into paragraphs, headlines, keywords, etc. These files are not only useful for the project; they will also help us improve the full-text search of the Spanish Web Archive.

All this work was done in the Library itself, in order to obtain more manageable files. The huge volume of information was a challenge: it was not easy to transfer the files to the BSC, where the supercomputer is located. Hence the importance of starting the cleaning process at the Library.
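
The tag-based extraction can be sketched as follows. To keep the example runnable without third-party packages, it swaps in Python's standard-library html.parser for Selectolax, so it shows the idea rather than the project's actual code. The class name, the tag-to-field mapping and the JSON field names are all assumptions made for illustration.

```python
import json
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to fields of the output JSON.
TEXT_TAGS = {"title": "title", "h1": "headlines", "h2": "headlines",
             "h3": "headlines", "p": "paragraphs"}
SKIP_TAGS = {"script", "style"}  # never treated as text


class TextTagExtractor(HTMLParser):
    """Collect the text of selected tags and ignore everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.result = {"title": [], "headlines": [], "paragraphs": [], "keywords": []}
        self._field = None      # output field currently being filled
        self._chunks = []       # text pieces of the current element
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "meta":
            attrs = dict(attrs)
            if (attrs.get("name") or "").lower() == "keywords":
                content = attrs.get("content") or ""
                self.result["keywords"] += [k.strip() for k in content.split(",") if k.strip()]
        elif tag in TEXT_TAGS:
            self.flush()
            self._field = TEXT_TAGS[tag]

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in TEXT_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._field and not self._skip_depth:
            self._chunks.append(data)

    def flush(self):
        """Store the text gathered for the current element, if any."""
        text = " ".join("".join(self._chunks).split())
        if text and self._field:
            self.result[self._field].append(text)
        self._field, self._chunks = None, []


def extract_text(html):
    parser = TextTagExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep text from a final element left unclosed
    return parser.result


page = """<html><head><title>Archivo de la Web</title>
<meta name="keywords" content="archivo, web, BNE">
<style>p {color: red}</style></head>
<body><h1>Colecciones</h1>
<p>Texto con <b>énfasis</b>.</p>
<img src="x.png"><script>var a = 1;</script>
<p>Segundo párrafo.</p></body></html>"""

record = extract_text(page)
print(json.dumps(record, ensure_ascii=False, indent=2))
```

Grouping the output by tag type keeps headlines, paragraphs and keywords separable later, which is the kind of structure that could also feed a full-text search index.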

Once the files were at the BSC, a second cleaning process was run. The BSC project team removed everything that was not well-formed text (unfinished or duplicated sentences, erroneous encodings, other languages, etc.). The result was exclusively well-formed text in Spanish, as the language is actually used.

The BSC used the MareNostrum supercomputer, the most powerful computer in Spain and the only one capable of processing such a volume of information in a short time frame.
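
A simplified picture of this kind of filtering is given below. The heuristics are assumptions chosen for illustration, not the BSC's actual pipeline: a sentence is kept only if it ends with terminal punctuation, shows no obvious sign of a broken encoding, and has not been seen before. Language identification is left out of the sketch.

```python
import re

# Assumed heuristics: which characters count as the end of a finished sentence.
TERMINAL = (".", "!", "?", "…")

# The Unicode replacement character, or the "Ã" + continuation-byte pattern
# produced when UTF-8 text is wrongly decoded as Latin-1.
BROKEN_ENCODING = re.compile(r"\ufffd|\u00c3[\u0080-\u00bf]")


def clean_sentences(sentences):
    """Return well-formed, unique sentences in their original order."""
    seen = set()
    kept = []
    for sentence in sentences:
        text = " ".join(sentence.split())  # normalize whitespace
        if not text.endswith(TERMINAL):
            continue  # unfinished (or empty) sentence
        if BROKEN_ENCODING.search(text):
            continue  # erroneous encoding
        key = text.casefold()
        if key in seen:
            continue  # duplicate
        seen.add(key)
        kept.append(text)
    return kept


raw = [
    "La biblioteca archiva la web.",
    "La biblioteca  archiva la web.",
    "Una frase sin terminar",
    "La colecciÃ³n es grande.",
    "¿Para qué sirve?",
]
cleaned = clean_sentences(raw)
print(cleaned)
```

At the scale described here, exact matching on a normalized key would only catch literal repeats; near-duplicate detection would need a different technique.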

The language model

Once the files were prepared, the BSC used a neural network technology based on the Transformer architecture, already proven with English, and trained it to learn to use the language. The result is an AI model that is able to understand the Spanish language, its vocabulary, and its mechanisms for expressing meaning, and to write at an expert level. The model is also able to understand abstract concepts, and it deduces the meaning of words from the context in which they are used.

This model is larger and better than the other Spanish language models available today. It is called MarIA and is open access. The project represents a milestone both in the application of artificial intelligence to the Spanish language and in collaboration between national libraries and research centers. It is a good example of the value of collaboration between different institutions with common objectives.

MarIA has many possible uses: language correctors and predictors, automatic summarization apps, chatbots, smart search, translation engines, automatic captioning, etc. These are all broad fields that promote the use of Spanish in technological applications, helping to increase its presence in the world. In this way, the BNE fulfils part of its mission: promoting scientific research and the dissemination of knowledge, and helping to transform information into technology that is accessible to all.
