Studying Women and the COVID-19 Crisis through the IIPC Coronavirus Collection  

AWAC2 (Analysing Web Archives of the COVID-19 Crisis) is a project developed by the members of WARCnet (Web ARChive studies network researching web domains and events) Working Group 2 that focuses on analysing transnational events. This is one of the first research projects using an IIPC collaborative collection and ARCH (Archives Research Compute Hub), a new interface for web archive analysis created by the Archives Unleashed Project Team and the Internet Archive.

By the AWAC2 (Analysing Web Archives of the COVID-19 Crisis) team: Susan Aasman (University of Groningen, The Netherlands), Niels Brügger (Aarhus University, Denmark), Frédéric Clavert (University of Luxembourg, Luxembourg), Karin de Wild (Leiden University, The Netherlands), Sophie Gebeil (Aix-Marseille University, France), Valérie Schafer (University of Luxembourg, Luxembourg), Joshgun Sirajzade (University of Luxembourg, Luxembourg)

A year ago, a post on this very blog (“Analysing Web Archives of the COVID-19 Crisis through the IIPC collaborative collection: early findings and further research questions,” 2 November 2021) invited the IIPC community to vote for a more specific topic that the AWAC2 team could analyse within the vast IIPC collection devoted to the coronavirus, whose metadata and textual content were made available to us through a partnership.

The first phase of our collaboration on this prolific, multilingual, and international corpus was carried out in direct cooperation with the Archives Unleashed team, who gave us access to this abundant collection via ARCH and other tools they have developed. After this phase, we wanted not only to analyse the corpus as a whole, but also to test the feasibility of more specific studies on a precise topic. Following the vote, the selected theme, “Women, gender and COVID,” was the subject of several online and on-site meetings of the AWAC2 team, including an internal datathon in March 2022 at the University of Luxembourg (Figure 1).

Figure 1 – A datathon as a test bed and the tentative design of a workflow

The purpose of this blog post is to review some of the methodological elements learned during the exploration of this corpus.

Retrievability is a real challenge

The first salient point concerns the amount of data. We had already noted it at the level of the whole corpus, but it remains considerable even for research focused specifically on women. Above all, data mining and corpus creation are complicated by multilingualism (see table 7 of our previous blog post). In addition, a search for the term “woman” alone is not sufficient to create a satisfactory corpus: a woman may be referred to as a mother in the context of home-working, or as a feminist in the context of activism and the fight against domestic violence, and so on.
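To make the retrieval problem concrete, here is a minimal sketch of a multilingual keyword filter. The term lists and sample pages are purely illustrative assumptions, not the lists or data used by the AWAC2 team, and a real corpus would need many more languages and inflected forms.

```python
import re

# Illustrative term variants only (assumption): not the AWAC2 team's actual list.
TERMS = {
    "en": ["woman", "women", "mother", "mothers", "feminist", "feminism"],
    "fr": ["femme", "femmes", "mère", "mères", "féministe", "féminisme"],
    "de": ["frau", "frauen", "mutter", "mütter", "feministin", "feminismus"],
}

def build_pattern(terms_by_language):
    """Compile one case-insensitive regex matching any listed term as a whole word."""
    all_terms = sorted(
        {term for terms in terms_by_language.values() for term in terms},
        key=len,
        reverse=True,
    )
    return re.compile(
        r"\b(?:" + "|".join(map(re.escape, all_terms)) + r")\b",
        re.IGNORECASE,
    )

PATTERN = build_pattern(TERMS)

def matched_terms(text):
    """Return the set of lower-cased terms found in a page's text."""
    return {match.group(0).lower() for match in PATTERN.finditer(text)}

# Hypothetical page texts, used only to show the behaviour of the filter.
pages = [
    "Working mothers struggle with home-schooling during lockdown",
    "Les femmes en première ligne face au coronavirus",
    "Stock markets fall as the pandemic spreads",
]
selected = [page for page in pages if matched_terms(page)]
```

A filter of this kind only widens the net. It does not say whether a matching page is really about women and COVID, which is the noise problem discussed next.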

The multidisciplinary team also had to define research priorities in view of the challenges posed by these massive corpora. Indeed, once the sub-corpora are constituted, the analysis is still far from beginning. They are full of noise, especially when it comes to news sites, where the terms COVID and pregnancy or feminism may appear close together in newsfeeds without any real thematic connection (Figure 2). There are also many duplicates, and it must be determined whether or not they inform the study. Such a large amount of data also raises the question of whether to favour a research-driven or a data-driven approach.
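As an illustration of the duplicate question, the following sketch groups pages whose text is identical once case and whitespace are normalised. The record layout (a URL paired with its text) and the example URLs are assumptions made for the example. Grouping rather than deleting keeps the decision about whether duplicates are informative in the researcher's hands.

```python
import hashlib
import re
from collections import defaultdict

def fingerprint(text):
    """Hash the text after lower-casing and collapsing runs of whitespace."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()

def group_duplicates(records):
    """Map each fingerprint to the list of URLs that share the same text."""
    groups = defaultdict(list)
    for url, text in records:
        groups[fingerprint(text)].append(url)
    return groups

# Hypothetical records: (url, extracted text).
records = [
    ("https://example.org/a", "COVID-19 and  domestic violence"),
    ("https://example.org/b", "covid-19 and domestic violence "),
    ("https://example.org/c", "School closures and working mothers"),
]
groups = group_duplicates(records)
duplicated = {h: urls for h, urls in groups.items() if len(urls) > 1}
```

This only catches exact duplicates after normalisation. Near-duplicates, such as the same article captured with a different newsfeed sidebar, would need a fuzzier comparison.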

Figure 2 – Entry line 7867: The newsfeed mentions the COVID crisis as well as the MeToo movement but the news is unrelated, as visible on top of the capture when accessing full text.

In addition to the technical difficulties, there are contextual ones. To be analysed properly, the data must be placed in their national context from a qualitative point of view. For example, lockdowns and school closures varied from country to country, and school organisation also differs widely around the world, as does the legislative framework for work during lockdowns.

Topic modeling as a field of investigation

The AWAC2 team shared a strong interest in assessing the presence, retrievability, and asymmetries of content related to gender and COVID. Some colleagues were especially interested in understanding the issues raised by transnational studies and gender studies, as well as in reflecting on invisibility and inclusiveness, while others were more specifically interested in the computational and topic modeling part.

This second aspect has given rise to interesting developments, as three major algorithms were applied in order to carry out more sophisticated, semantic searches in the corpus: Latent Dirichlet Allocation (LDA), Word2vec, and Doc2vec.

LDA is an extension of Probabilistic Latent Semantic Analysis (PLSA), which is itself a probabilistic formulation of Latent Semantic Analysis (LSA). LSA is a dimensionality reduction technique in which the documents of a corpus (in our case, web pages) are compressed into a very small number of condensed documents that a human can read. These compressed documents are called topics. In essence, they gather the words that are shared by many documents and that are more likely to occur together. In our experiment, we not only identified topics containing keywords related to the situation of women, but also looked at how these topics are distributed across the web pages (Figure 3).

Figure 3 – Topics identified through LDA over the whole dataset (Covid-19 special collection) and their distribution through time
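For readers who want to see the mechanics, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, written in plain Python. It is not the implementation used for this study: the function names, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions, and a real experiment would rely on an established library.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenised documents.

    Returns (topic_word counts, doc_topic counts, vocabulary).
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(n_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = word_id[word]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return topic_word, doc_topic, vocab

def top_words(topic_word, vocab, n=5):
    """Return the n most frequent words of each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]

# Toy, hand-written documents (assumption), already tokenised.
docs = [
    "women health children family support".split(),
    "health children parents stress support".split(),
    "shop sale price delivery cart".split(),
    "price sale brands shoes shop".split(),
]
topic_word, doc_topic, vocab = lda_gibbs(docs, n_topics=2)
topics = top_words(topic_word, vocab)
```

The `doc_topic` counts are what make it possible to follow a topic across pages, and hence across crawl dates: each row says how many tokens of one document were attributed to each topic.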

A few examples of topics are:

  • topics202002.txt:46 0.05 video news show man years police star film death family week weinstein comments day love stars top women fashion black
  • topics202002.txt:69 0.05 shop view accessories gifts price sale products delivery add cart free mens gift shoes brands bags womens clothing hair home
  • topics202003.txt:4 0.05 health children mental kids anxiety child family tips healthy parents social coronavirus stress find support time home women life news
  • topics202004.txt:53 0.05 gender development health policy working european countries international women economic equality work regional employment global world minnesota environment content overview
  • topics202005.txt:83 0.05 study risk patients years people blood disease
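The lines above appear to follow a regular layout: a file name that seems to encode year and month, a topic number, a weight, and the topic's top words. Assuming that reading is right, a small parser makes it easy to pick out the topics that mention women. The `Topic` type and the function name below are hypothetical helpers introduced for this sketch.

```python
import re
from typing import NamedTuple

class Topic(NamedTuple):
    month: str      # e.g. "202003", taken from the file name (assumed YYYYMM)
    topic_id: int
    weight: float   # second number on the line, assumed to be a topic weight
    words: list

LINE = re.compile(r"topics(\d{6})\.txt:(\d+)\s+([\d.]+)\s+(.*)")

def parse_topic_line(line):
    """Parse one 'topicsYYYYMM.txt:<id> <weight> <words...>' line, or return None."""
    match = LINE.search(line)
    if match is None:
        return None
    month, topic_id, weight, words = match.groups()
    return Topic(month, int(topic_id), float(weight), words.split())

# Two of the example lines listed above, copied as they appear.
lines = [
    "topics202003.txt:4 0.05 health children mental kids anxiety child family tips "
    "healthy parents social coronavirus stress find support time home women life news",
    "topics202005.txt:83 0.05 study risk patients years people blood disease",
]
topics = [topic for topic in map(parse_topic_line, lines) if topic]
about_women = [t for t in topics if {"women", "womens"} & set(t.words)]
```

Because `LINE.search` scans the whole line, the parser also works when a line still carries a leading bullet or indentation.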

Word2vec and Doc2vec, in turn, are further developments of these earlier algorithms. Under the hood they rely on newer techniques, such as a shallow neural network trained in a way that resembles logistic regression, and they are also more flexible to use. Word2vec provides a dense vector for every word, and in this respect it is very similar to LSA. However, Word2vec builds its vectors from the so-called window, or neighborhood, of each word (for example, the 5 words to the left and to the right of the searched word) and thus operates more on a syntactic level, whereas LSA is based purely on the document-term matrix. With Word2vec it is possible not only to find words that are semantically related to the searched word, but also to concatenate the vectors of all the words in a document. By doing so, similar documents can be found. In a way, this goes beyond the so-called bag-of-words approach to which all the previous algorithms belong, because word order can also be taken into account. From an implementation point of view, this can be done with an additional algorithm such as a Long Short-Term Memory (LSTM) network, or a ready-to-use version can be obtained with Doc2vec.
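The window idea can be illustrated without training a neural network. The sketch below generates the (word, neighbour) pairs that a window of 5 produces, then uses simple co-occurrence counts as a stand-in for learned vectors. This is deliberately not Word2vec itself: the count vectors are sparse rather than dense, and every function name and sentence here is an assumption made for the example.

```python
import math
from collections import Counter, defaultdict

def context_pairs(tokens, window=5):
    """Yield (center, context) pairs for words at most `window` positions apart."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield center, tokens[j]

def cooccurrence_vectors(sentences, window=5):
    """Count-based stand-in for embeddings: how often each context word is seen."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for center, context in context_pairs(tokens, window):
            vectors[center][context] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(word, vectors, n=3):
    """Rank the other words by similarity of their context vectors to `word`."""
    target = vectors.get(word, Counter())
    scores = [
        (other, cosine(target, vector))
        for other, vector in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: -pair[1])[:n]

# Toy sentences (assumption), already tokenised.
sentences = [
    "women face more unpaid care work during lockdown".split(),
    "mothers face more unpaid care work during school closures".split(),
    "shops offer free delivery during lockdown".split(),
]
vectors = cooccurrence_vectors(sentences, window=5)
neighbours = most_similar("women", vectors)
```

Words that share many neighbours end up with similar vectors, which is the intuition behind using a trained model to discover related keywords.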

In our experiment, we trained a Word2vec model on our corpus, which enabled us to find keywords related to women or feminism. We then took these keywords and searched for where they occur. By doing so, we again not only investigated the situation of women in the pandemic, but also compared the results with those given by LDA. This not only helps us to analyze the functioning, complementarity, and efficiency of the algorithms, but also allowed us to make sure that the search covers our corpus in as much detail as possible (Figures 4a and 4b).

Figure 4a – Time series of a topic related to women and children
Figure 4b – Top 20 domains for this selected topic
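Once a set of keywords is available, counting where and when they occur is straightforward. The sketch below builds a monthly time series and a ranking of domains from keyword matches. The record layout (a crawl date as a `YYYYMMDD` string, a URL, and the page text), the keyword set, and the sample records are assumptions for illustration, not the project's data.

```python
from collections import Counter
from urllib.parse import urlparse

# Illustrative keywords (assumption), e.g. as suggested by a trained model.
KEYWORDS = {"women", "mothers", "feminism"}

def monthly_and_domain_counts(records, keywords):
    """Count matching pages per month (YYYYMM) and per domain."""
    by_month, by_domain = Counter(), Counter()
    for crawl_date, url, text in records:
        tokens = set(text.lower().split())
        if tokens & keywords:
            by_month[crawl_date[:6]] += 1
            by_domain[urlparse(url).netloc] += 1
    return by_month, by_domain

# Hypothetical records: (crawl_date, url, text).
records = [
    ("20200315", "https://news.example.org/a", "Working mothers and school closures"),
    ("20200320", "https://news.example.org/b", "Women on the front line"),
    ("20200402", "https://health.example.com/c", "Vaccine trial results"),
    ("20200410", "https://health.example.com/d", "Feminism and the pandemic"),
]
by_month, by_domain = monthly_and_domain_counts(records, KEYWORDS)
time_series = sorted(by_month.items())   # chronological order
top_domains = by_domain.most_common(20)  # at most the 20 most frequent domains
```

Splitting on whitespace is a crude tokenisation: a keyword followed by punctuation would be missed, so a real pipeline would tokenise more carefully.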

What’s next?

This research is far from complete after a year of collaboration with the Archives Unleashed team, whom we warmly thank for their technical and scientific expertise, and with the IIPC, which provided an unprecedented corpus that can stimulate a multitude of research projects, whether thematic or oriented towards computer science and digital humanities. An article is currently being prepared on the second, computational aspect, “Deep Mining in Web archives,” while a more general chapter oriented towards the social sciences and humanities (SSH) is being drafted for the final collective book of the WARCnet project. Furthermore, the team will be pleased to present results at the next IIPC Web Archiving Conference in 2023 and thus continue the dialogue with you around the collection.
