World Wide Web archiving: Upgrading the Web Curator Tool

by Kees Teszelszky, Curator digital collections, National Library of the Netherlands

The Web Curator Tool (WCT) is a workflow management application designed for selective web archiving. It was created for use in libraries and other digital heritage collecting organisations, and supports collection by non-technical users while still allowing complete control of the web harvesting process. The WCT supports the selection, harvesting and quality assessment of online material when employed by collaborating users in a library environment. The application is integrated with the existing Heritrix web crawler and supports key processes such as permissions, job scheduling, harvesting, quality review, and the collection of descriptive metadata. The WCT allows institutions to capture almost any online resource. These artefacts are handled with all possible care, so that their integrity and authenticity are preserved.

The WCT was developed in 2006 as a collaborative effort by the National Library of New Zealand (NLNZ) and the British Library (BL), initiated by the International Internet Preservation Consortium (IIPC), as can be read in the original documentation. The WCT is open source and available under the terms of the Apache Public License. The project moved in 2014 from SourceForge to GitHub. The latest binary release of the WCT, v1.6.3, was published in July 2017 on the NLNZ GitHub page. Even after 12 years, the WCT remains one of the most common open-source enterprise solutions for web archiving, with an active user forum on GitHub and Slack.

From January 2018 onwards, the NLNZ has been collaborating with the Koninklijke Bibliotheek – National Library of the Netherlands (KB-NL) to upgrade the WCT and add new features to make the application future-proof. This involves learning the lessons from previous development and recognising the advancements and trends in the web archiving community. The objective is to bring the WCT to a platform from which it can keep pace with the requirements of archiving the modern web. In addition, the Permission Request module will be extended to fit the Dutch situation, which lacks a legal deposit for digital publications.

The first step in that process was decoupling the WCT from the old Heritrix 1.x web crawler, and allowing the WCT to harvest using the updated Heritrix 3.x version. A proof of concept for this change was successfully developed and deployed by the NLNZ, and has been the basis for a joint development work plan. The project will be extensively documented.
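To give a concrete sense of what harvesting with Heritrix 3 involves at the integration level, the sketch below drives a local Heritrix 3 engine through its REST API, which is the kind of interface a decoupled harvest agent talks to. It is a minimal illustration under stated assumptions: the host, credentials and job name are placeholders, the job's crawl configuration is assumed to exist already, and none of this reproduces actual WCT code.

```python
"""Minimal sketch: driving a Heritrix 3 engine through its REST API.

Assumes a Heritrix 3 instance on the default port (8443) with digest
authentication; the credentials and job name are placeholders, and the
job's crawler-beans.cxml (seeds, scope, politeness) is assumed to have
been configured separately. This is an illustration, not WCT code.
"""
import requests
from requests.auth import HTTPDigestAuth

ENGINE = "https://localhost:8443/engine"          # default Heritrix 3 engine URI
AUTH = HTTPDigestAuth("admin", "admin-password")  # placeholder credentials
JOB = "wct-demo-job"                              # placeholder job name

session = requests.Session()
session.auth = AUTH
session.verify = False            # Heritrix ships with a self-signed certificate
session.headers["Accept"] = "application/xml"

def engine_action(action, **params):
    """POST an action to the engine, e.g. creating a new job directory."""
    return session.post(ENGINE, data={"action": action, **params})

def job_action(action):
    """POST an action to one job: build, launch, unpause, terminate, teardown."""
    return session.post(f"{ENGINE}/job/{JOB}", data={"action": action})

# Create the job directory, build its configuration, then launch and un-pause it.
engine_action("create", createpath=JOB)
job_action("build")
job_action("launch")
job_action("unpause")

# A quality-review step would later terminate and tear down the job,
# then pick up the WARC files written under the job directory:
# job_action("terminate"); job_action("teardown")
```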

The NLNZ has been using the WCT for its selective web archiving programme since January 2007, and KB-NL since 2009. In 2008 the NLNZ published an article describing its experience using the WCT in a production environment. Since then, however, the software had fallen into a period of neglect, with mounting technical debt: most notably its tight integration with an outdated version of the Heritrix web crawler. While the last public release of the WCT is still used day-to-day in various institutions, it has essentially reached its end of life, falling further and further behind the requirements for harvesting the modern web. The community of users has echoed these sentiments over the last few years.

During 2016-2017 the NLNZ conducted a review of the WCT and how it fulfils business requirements, and compared the WCT to alternative software/services. The NLNZ concluded that the WCT was still the closest solution to meeting its requirements – provided the necessary upgrades could be done, namely a change to use the modern Heritrix 3 web crawler. Through a series of fortunate conversations the NLNZ discovered that another WCT user, KB-NL, was going through a similar review process and had reached the same conclusions. This led to collaborative development between the two institutions to uplift the WCT technically and functionally to be a fit for purpose tool within these institutions’ respective web archiving programmes.

Who is involved:

National Library of New Zealand:

Steve Knight
Andrea Goethals
Ben O’Brien
Gillian Lee
Susanna Joe
Sholto Duncan

Koninklijke Bibliotheek:

Peter de Bode
Jeffrey van der Hoeven
Hanna Koppelaar
Tymen Kwant
Barbara Sierman
René Voorburg
Kees Teszelszky

Further reading:


IIPC Steering Committee Election 2018: nominations and results

The 2018 IIPC Steering Committee (SC) elections featured three vacant seats: the KB (Netherlands), BnF (France), and UNT (United States) had all reached the end of their three-year terms. The period for IIPC members to nominate themselves for election to the SC opened on December 1, 2017 and ran until March 25, 2018. During the nomination period, three nominations were submitted, by KB, BnF, and UNT. Thus, unlike in prior years, no election process was necessary, since the expiring members were the only three to nominate themselves for the three vacancies. Congratulations and thanks to KB, BnF, and UNT for their long service on the SC and their willingness to serve another term. In 2019, the Steering Committee will have 5 (or potentially 6) seats open up for election, and we encourage any members interested in joining the SC for the first time and contributing to the management and strategic direction of the organization to nominate themselves. The SC meets in early April at the DNB (Germany); be on the lookout for reports on the outcomes of that meeting.

Jefferson Bailey (current Chair, IIPC SC)


Nomination statements:

Bibliothèque nationale de France / The National Library of France

The National Library of France (BnF) started its web archiving programme in the early 2000s and now holds an archive of nearly a petabyte. We use and share expertise about key tools for IIPC members (Heritrix 3, OpenWayback, NetarchiveSuite, webarchive-discovery) and contribute to the development of several of them. We have developed BCweb, an application for seed selection and curation by librarians, which is being open sourced.

The BnF has been involved in IIPC since its beginning and remains firmly committed to the development of a strong community, in order to sustain these open source tools and to share experiences and practices. We have attended, and frequently actively contributed to, general assembly meetings, workshops and hackathons, and most IIPC working groups, in particular Preservation and Collections Development. We are also involved in the new Training working group. Finally, we have invested effort in making the WARC format an ISO standard and will continue to work on its evolution. Our participation in the steering committee, if continued, will be focused on making web archiving a thriving community, engaging researchers in the study of web archives and developing strong archiving strategies for all kinds of web content, including social media.

Koninklijke Bibliotheek / National Library of the Netherlands

The KB is currently a member of the Steering Committee and chair of the Membership Engagement Portfolio Group, and would like to nominate itself for election to a new term on the Steering Committee.

The Netherlands was one of the early adopters of the Internet: in fact, the third website worldwide came from the Dutch National Institute for Subatomic Physics. The KB started collecting websites through selective harvesting in 2007. Currently we harvest around 13,000 websites. For copyright reasons, the websites can only be viewed on the premises. Collaboration with other Dutch organizations will improve the coverage of the preserved Dutch national web. Within the nationwide Dutch “Network Digital Heritage” we work on various projects with GLAM institutions as well as researchers and suppliers of web archiving services to improve the archiving of the Dutch web. The KB looks forward to bringing this experience to the IIPC and to developing plans for new connections between IIPC members and other organizations involved in creating web collections and web publications, research, tool development and digital preservation.

The University of North Texas Libraries 

The University of North Texas (UNT) Libraries is interested in serving another term on the IIPC Steering Committee. As a library that serves a Tier One university and a student population of 38,000, we are committed to providing a wide range of resources to researchers. Of these resources, we believe that the preservation of and access to web archives is an important component. We began capturing websites in 1997 and joined the IIPC in 2007. We find great benefit in participating in an international community dedicated to preserving the Web.

In the last decade, we have participated in working groups and served on the Steering Committee for a number of years. We actively participated in projects such as tool development and maintenance for OpenWayback and Heritrix, with UNT Libraries serving as project lead for the OpenWayback project. We participated in collaborative archiving projects, including development of the URL Nomination Tool, and served as Steering Committee officers when requested.

If elected, the UNT Libraries will strive to collaborate with our fellow members and represent the best interests of the IIPC community to continue to move forward the preservation of the Web.

Archiving the Croatian web: has it been fourteen years already?

The National and University Library in Zagreb has been an IIPC member since 2008. The Croatian Web Archive (Hrvatski arhiv weba, HAW), established in 2004, is open access. Current projects include delivering metadata to Europeana, implementation of the persistent identifier URN:NBN, migration to OpenWayback, development of a new user interface and integration with the Digital Library portal. The Web Archiving Team has also been involved in introducing librarians, archivists and researchers to web archiving and to using HAW resources.


By Ingeborg Rudomino, Croatian Web Archive, National and University Library in Zagreb and Karolina Holub, Croatian Digital Library Development Centre, Croatian Institute for Librarianship, National and University Library in Zagreb

About HAW

The National and University Library in Zagreb (NUL) in collaboration with the University Computing Centre in Zagreb (Srce) established the Croatian Web Archive (Hrvatski arhiv weba, HAW) in 2004 and started to acquire, catalogue and archive online publications according to the legal deposit provisions of the Library Act from 1997. Due to the well-known characteristics of web resources, the NUL started to archive selectively and established selection criteria.

Fig. 1. Croatian Web Archive Homepage.

We use several methods to identify a web resource for cataloguing and archiving: the HAW team searches and browses the web; website owners or content providers fill out the Registration form or we receive notifications from the ISSN Centre for Croatia.

After identification, every resource is catalogued in the library system and automatically transferred into our custom-built archiving system, where the archiving process starts. Our long-standing experience in cataloging this type of resource has shown the process to be very challenging, and describing this dynamic and variable content results in daily interventions in the bibliographic records. Because of that, we created cataloguing guidelines with a variety of examples. Our goal has been to preserve the original websites (their look and feel) as much as possible. In order to achieve quality, each resource is approached individually during the archiving process. The DAMP software, developed by the University Computing Centre in Zagreb, was built especially for this purpose. The workflow of processing web resources is integrated within the organisational structure of the Library.

We are proud of the quantity and quality of web resources stored in the Croatian Web Archive, some of which are websites of institutions, associations, clubs, research projects, news media, portals, blogs, official websites of counties, cities, journals and books. Special attention is given to news media websites/portals, which are archived daily, weekly or monthly.

Access and the first full domain crawl

This selective approach ensures quality and provides full control over the management of web resources. So far, over 6,700 titles have been archived and almost all are publicly available. All content is full text searchable, and it’s possible to search by any word in the title, URL or keywords. Advanced search is available as well. Users can browse the HAW alphabetically and through subject categories, which are extracted from the UDC field in the catalogue.

Fig. 2. Screenshots of archived Croatian websites.

To secure permanent access to archived web resources, we have recently implemented the persistent identifier URN:NBN and assigned it to archived titles and all archived instances (Fig. 3).

Fig. 3. Screenshot of archived instances with URN:NBN.

Since 2013, the metadata from HAW has been delivered to Europeana through HAW’s OAI-PMH interface.
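For readers unfamiliar with the mechanics, OAI-PMH is a plain HTTP and XML protocol: a harvester issues a ListRecords request and follows resumption tokens until the set is exhausted. The sketch below shows that general pattern; the base URL is a placeholder rather than HAW’s actual endpoint.

```python
"""Minimal sketch of an OAI-PMH harvest (the generic pattern used to pass
metadata to an aggregator such as Europeana). The base URL below is a
placeholder, not the actual HAW endpoint."""
import requests
import xml.etree.ElementTree as ET

BASE_URL = "https://example.org/oai"   # placeholder OAI-PMH endpoint
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def list_records(metadata_prefix="oai_dc"):
    """Yield (identifier, title) pairs, following resumptionTokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(requests.get(BASE_URL, params=params, timeout=30).content)
        for record in root.iter(f"{OAI}record"):
            header = record.find(f"{OAI}header")
            identifier = header.findtext(f"{OAI}identifier")
            title = record.findtext(f".//{DC}title")
            yield identifier, title
        token = root.findtext(f".//{OAI}resumptionToken")
        if not token:
            break
        # A resumptionToken replaces all other arguments on the next request.
        params = {"verb": "ListRecords", "resumptionToken": token}

if __name__ == "__main__":
    for oai_id, title in list_records():
        print(oai_id, "-", title)
```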

To overcome the limitations of selective archiving, the first harvest of the whole .hr domain was conducted in 2011 with the Heritrix web crawler. Since then, we have been harvesting the .hr domain annually. The collected content is publicly available via HAW’s website through the OpenWayback access interface (Fig. 4). To date, we have conducted 7 .hr domain harvests.

Fig. 4. Screenshot of harvested website in OpenWayback.

Thematic crawls

In 2011, we also started to periodically harvest websites related to topics and events of national importance, again using Heritrix and OpenWayback. Nine thematic collections have been created, mainly related to themes such as presidential, parliamentary or local elections, accession to the EU and the flood in Croatia. Each collection is described with metadata: title, size, number of seeds/URLs and a description.

Training and outreach

Twice a year, we organize a workshop within the Centre of Continuing Education for Librarians. With the main goal of introducing web archiving to library professionals and students, the workshop focuses on learning how to recognize online materials that should be preserved according to the existing criteria for cataloguing and archiving Croatian web resources. Participants are also introduced to the workflow of selective archiving, the .hr harvests, the process of selecting materials for thematic collections and the different ways of browsing the archived content.

With the experience we have gained over the years, we are happy to share our knowledge and expertise on web archiving and to support all those interested. To increase awareness of HAW and web archiving among librarians, archivists and the wider community, we make use of every opportunity to do so, such as presenting at national and international conferences and giving lectures to students, researchers and others.

A few thoughts for the future

The Croatian Web Archive currently holds more than 40 TB of content. We are working on a new web interface with additional functionalities and features, including full-text search for the domain harvests and news sections for the web archiving community and researchers. We also plan to integrate HAW’s metadata into the Digital Library portal in order to have a single access point for all digital collections.

By combining all three approaches and using different software, the Library will attempt to cover, to the greatest extent possible, the contemporary part of Croatian cultural and scientific heritage.

Visit us: http://haw.nsk.hr/en

Announcing the IIPC Technical Speaker Series

By Jefferson Bailey, Director, Web Archiving (Internet Archive) & IIPC Chair

The IIPC is excited to announce a call for presenters in a new online series, the IIPC Technical Speaker Series. The goal of the IIPC Technical Speaker Series (TSS) is to facilitate knowledge sharing and foster conversations and collaborations among IIPC members around web archiving technical work.

The TSS will feature 30-60 minute online presentations or demonstrations related to tool development, software engineering, infrastructure management, or other specific technology projects. Presentations can take any format, including prepared slides, open conversations, or live demonstrations via screen sharing. Presentations will be from employees at IIPC member organizations and attendance will be open to all IIPC members. The TSS is intended to be informational, not a formal training or education program, and to provide an open venue for knowledge exchange on technical issues. The series will also give IIPC members the chance to demo and discuss technical work (including R&D, prototype, or early-stage work) taking place in member institutions that may have no other venue for presentation or discussion.

If you are interested in presenting, please fill out the short application form.

Details on applying:

  • Applicants must be employed by an IIPC member institution in good standing
  • Access to an online webinar system (WebEx, Zoom, etc) will be provided
  • Presentations will be scheduled for 60 minutes, but can be shorter and should allow time for questions and discussion
  • Small stipends are available to presenters, if needed or if helpful in getting managerial approval to participate.

We aim to have 2-3 TSS events per quarter, scheduled at a time amenable to as many time zones as possible. Details on upcoming speakers and registration will be shared via the normal IIPC communication channels (listservs, blog, Slack, Twitter). This project is funded as part of IIPC’s 2018 suite of projects, which includes work by IIPC Portfolios and Working Groups as well as other forthcoming member services. The TSS is currently administered by the IIPC Steering Committee Chair (jefferson@archive.org) and the IIPC Program and Communications Officer (Olga.Holownia@bl.uk). Contact either or both with any questions.

Please apply and present to the IIPC community all the excellent technical work taking place at your organization!

How Can We Use Web Archives? A Brief Overview of WARP and How It Is Used

By Naotoshi Maeda, National Diet Library of Japan

As we all know, the use of web archives has recently become a hot topic in the web archiving community. At the 14th iPRES, held in Kyoto, the National Diet Library of Japan (NDL) took part in several sessions and presented examples of how web archives can be used. Here, I post the poster and revisit those topics.

Fig. 1: The poster about use cases of web archive presented in iPRES 2017 (pdf)

Overview of WARP

Since 2002, the NDL has been operating a web archive called WARP, which harvests websites under two different frameworks: the first is Japan’s legal deposit system, and the second is permission from the copyright holder. The National Diet Library Law allows the NDL to harvest the websites of public agencies, including those of the national government, municipal governments, public universities, and independent administrative agencies. Legal deposit does not, however, allow the NDL to harvest the websites of private organizations, so the NDL needs to obtain permission from the copyright holder beforehand. At present, WARP holds roughly 1 petabyte of data, comprising 5 billion files from 130,000 captures.

Fig. 2: Statistics of archived content in WARP

85% of the archived websites can be accessed via the Internet, based on the permissions of rights holders, and WARP provides a variety of search methods, including by URL, full text, metadata, and category.

WARP uses standard web archiving technologies: Heritrix for crawling, the WARC file format for storage, OpenWayback for playback, and Apache Solr (built on Lucene) for full-text search.
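As a small illustration of the WARC storage layer, the sketch below walks through a WARC file with the open-source warcio library and prints an inventory of response records by content type. The filename is a placeholder; this is a generic example, not part of WARP itself.

```python
"""Minimal sketch: inventorying a WARC file with the warcio library.
The filename is a placeholder; this is an illustration, not WARP code."""
from collections import Counter
from warcio.archiveiterator import ArchiveIterator

counts = Counter()

with open("example-capture.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # 'response' records carry the harvested payloads; 'request',
        # 'metadata' and 'warcinfo' records describe the crawl itself.
        if record.rec_type == "response" and record.http_headers is not None:
            ctype = record.http_headers.get_header("Content-Type") or "unknown"
            counts[ctype.split(";")[0].strip()] += 1

for content_type, n in counts.most_common():
    print(f"{n:8d}  {content_type}")
```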

Linking from live websites

Given this background, here I show some examples of how WARP can be used.

The first use case is linking from live websites. As mentioned above, WARP comprehensively harvests and archives the websites of public agencies under the legal deposit system. A significant quantity of content is posted, updated, and deleted on these websites every day, and many of these agencies use WARP as a backup database. Before deleting content from their own websites, they add a link to the copy archived in WARP. This keeps the content seamlessly available while also reducing the operating costs of their own web servers.

Fig. 3: Linking from live websites to WARP.

Analysis and Visualization

The graphs below present the results of some analyses of content archived in WARP. The first, circular graph illustrates the link relations between the websites of Japan’s 47 prefectures, showing the extent of their interconnection on the Web. The second graph shows the percentage of URLs on national government websites that were still live in 2015, indicating that 60% of the URLs that existed in 2010 returned 404 errors during 2015. The third, a bubble chart, shows the relative size of the data accumulated from each of the 10,000 websites archived in WARP. Thus, you can see at a glance which websites are archived in WARP and how much data each contributes.

Fig. 4: Link relations between websites in Japan’s 47 prefectures.
Fig. 5: The percent of URLs on websites of the national government that were live in 2015.
Fig. 6: Relative size of data accumulated from each of the 10,000 websites archived in WARP.
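The figure behind Fig. 5 rests on a simple method: take the URLs captured in an earlier crawl and ask the live web whether they still resolve. The sketch below shows that check in its most basic form; the input file is a placeholder and this is not the NDL’s actual analysis code.

```python
"""Minimal sketch of a link-rot check of the kind behind Fig. 5:
request previously archived URLs on the live web and count the failures.
The input file is a placeholder, not NDL data."""
import requests

def check_urls(urls, timeout=10):
    """Return a dict mapping each URL to an HTTP status code or 'error'."""
    results = {}
    for url in urls:
        try:
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
            results[url] = resp.status_code
        except requests.RequestException:
            results[url] = "error"   # DNS failure, timeout, connection refused...
    return results

if __name__ == "__main__":
    # One URL per line, e.g. extracted from an older crawl log or CDX index.
    with open("urls-2010.txt") as fh:
        urls = [line.strip() for line in fh if line.strip()]

    results = check_urls(urls)
    gone = sum(1 for status in results.values() if status == 404 or status == "error")
    print(f"{gone}/{len(results)} URLs no longer resolve "
          f"({100.0 * gone / max(len(results), 1):.1f}%)")
```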

Curation

The next use case shows how WARP can be used for curation. Curators can use a variety of search methods to find content of interest archived in WARP, but it is not easy for them to gauge the full extent of the archived content. The NDL curates archived content on a variety of subjects and provides visual representations that can offer curators unexpected discoveries. Here are two examples: a search by region for the obsolete websites of defunct municipalities, and the 3D wall for the collection on the Great East Japan Earthquake of 2011.

Fig.7: Search by region for obsolete websites of defunct municipalities.
Fig. 8: 3D wall for the collection of the Great East Japan Earthquake in 2011.

Extracting PDF Documents

The fourth use case is extracting PDF documents. The websites archived in WARP contain many PDF files of books and periodical articles. We search for these online publications and add metadata to those that are considered significant. These PDF files with metadata are then stored in the “NDL Digital Collections” as the “Online Publications” collection. Furthermore, the metadata are harvested via OAI-PMH by “NDL Search”, an integrated search service covering the catalogues of libraries, archives, museums and academic institutes in Japan, so that curators can find the PDF files using conventional search methods. 1,400,000 PDF files catalogued in 1,000,000 records are already available online. The NDL launched a new OPAC in January 2018 that implements a similar integrated search.

Fig. 9: Extracting PDF documents from archived websites.
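As a rough illustration of the first, automatable step of this pipeline, the sketch below pulls PDF payloads out of a WARC file by inspecting the HTTP Content-Type header, again using the warcio library. The filenames are placeholders, and the selection and cataloguing work described above is done by people, not by this script.

```python
"""Minimal sketch: extracting PDF documents from a WARC file.
Filenames are placeholders; selection and metadata creation in WARP
involve human cataloguers and are not reproduced here."""
import hashlib
import os
from warcio.archiveiterator import ArchiveIterator

OUT_DIR = "extracted_pdfs"
os.makedirs(OUT_DIR, exist_ok=True)

with open("example-capture.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response" or record.http_headers is None:
            continue
        ctype = (record.http_headers.get_header("Content-Type") or "").lower()
        if not ctype.startswith("application/pdf"):
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        payload = record.content_stream().read()
        # Use a content hash as a stable filename; a real workflow would also
        # record descriptive metadata (title, creator, date) for each file.
        name = hashlib.sha1(payload).hexdigest() + ".pdf"
        with open(os.path.join(OUT_DIR, name), "wb") as out:
            out.write(payload)
        print(f"saved {name}  <-  {url}")
```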

Future challenges

I want to conclude this post with a short discussion of future challenges, which have also been the subject of lively discussion among IIPC members.

Web archives have tremendous potential for use in big data analysis, which could be used to uncover how human history has been recorded in cyberspace. The NDL also needs to study how to make data sets suitable for data mining and how to promote engagement with researchers.

Another challenge is the development of even more robust search engine technology. WARP provides full-text search with Apache Solr and has already indexed 2.5 billion files, creating indexes totalling 17 terabytes. But we are not satisfied with the search results, which contain duplicate material archived at different times, as well as other noise. We need to develop a robust and accurate search engine specialized for web archives, one that takes temporal elements into account.
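One common way to tame such duplicates is to collapse results that share a payload digest at query time, using Solr’s result grouping. The sketch below shows the idea; the Solr URL and the field names (content, digest, url, date) are assumptions about an index schema, not WARP’s actual configuration.

```python
"""Minimal sketch: collapsing duplicate captures in Solr search results
with result grouping. The Solr URL and field names are assumptions about
the index schema, not WARP's actual configuration."""
import requests

SOLR_SELECT = "http://localhost:8983/solr/webarchive/select"  # placeholder core

def search(query, rows=10):
    """Return one representative document per payload digest."""
    params = {
        "q": f"content:({query})",   # assumed full-text field name
        "rows": rows,
        "wt": "json",
        # Result grouping: keep only one document per digest value, so the
        # same payload captured at different times appears once.
        "group": "true",
        "group.field": "digest",     # assumed field holding the payload hash
        "group.limit": 1,
    }
    response = requests.get(SOLR_SELECT, params=params, timeout=30)
    response.raise_for_status()
    groups = response.json()["grouped"]["digest"]["groups"]
    return [g["doclist"]["docs"][0] for g in groups]

if __name__ == "__main__":
    for doc in search("national diet library"):
        print(doc.get("date"), doc.get("url"))
```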

The Study of punk culture through the Portuguese Web Archive

In the third guest blog post presenting the results of Investiga XXI, DIOGO DUARTE introduces his study of the emergence of Straight Edge, a drug-free punk subculture, in Portugal, carried out through the web pages preserved by Arquivo.pt. As an international and informal suburban culture, Straight Edge found in the internet one of the drivers of its expansion in the second half of the nineties. This text presents a first approach to building the history of the Straight Edge culture.


Since its eruption in the second half of the 1970s, punk was characterized by a multiplicity of derived experiences and expressions that defied the simplistic and sensationalist picture often painted of it as a self-destructive movement (due to the drug and alcohol excesses of some of its members). One of those expressions, with significant growth and impact, was Straight Edge.

Sober punk: “I’ve got better things to do”

Born in the early 1980s in Washington D.C., U.S.A., and given voice by one of the most emblematic bands in punk-hardcore history, Minor Threat, Straight Edge was one of the answers to that self-destructive spiral. Besides the refusal to consume addictive substances, vegetarianism and animal rights became strongly associated with the Straight Edge lifestyle from its beginning.

Minor Threat’s lyrics quickly found an echo among individuals who identified with punk’s rebelliousness and the raw energy of its loud and fast music, but who did not feel attracted to some of its common behaviors. Within a short time, Straight Edge was reclaimed as an identity by a growing number of bands and individuals all over the United States.

The explosion of Straight Edge in Portugal

In Portugal, this punk subculture started to explode at the beginning of the 1990s, with X-Acto, the first Straight Edge band, appearing in 1991. Throughout the decade, Straight Edge never stopped growing, with more and more bands and individuals adopting its principles to guide their lives.

Fig. 1: X-Acto website preserved by Arquivo.pt

In the second half of the 1990s, the internet became one of the main platforms of communication within the Straight Edge community. By making it easier to spread its ideas and events to a larger audience, the internet created a new space of sociability, complementary to concerts and other meeting places.

The growth of the Straight Edge culture reflected some of the social and political dynamics that emerged in Portuguese society during the 1990s, but it also helped accelerate those changes, particularly through its interventionist and strongly politicized character.

Anti-consumption, anti-capitalism, anti-racism, feminism, ecology and, especially, veganism and animal rights were some of the causes most actively promoted by Straight Edge followers.

As a predominantly suburban culture, informal and lacking any institutional structure, based on the punk Do It Yourself ethic, Straight Edge remained underground, without media or public visibility. Information circulated through concerts, through independent distributors and, with the arrival of the internet, online through forums, websites and blogs.

The importance of web archives to the study of popular subcultures

Fig. 2: Founded in 1998, StraightEdge.pt was the most important website of the Straight Edge community in Portugal.

With the slowing down of the movement during the early 2000s, much of the information available online that documented the existence of this culture disappeared – in some cases irretrievably – without having been preserved in traditional archives or leaving a trace in institutional media.

Thus, the possibilities of studying Straight Edge culture and its impact on Portuguese society were severely reduced. Arquivo.pt recovered and archived many of those pages and re-opened the possibility of studying them.

The websites preserved by Arquivo.pt were the basis of this research. Through them, we observed Straight Edge’s eruption, expansion, consolidation and decline in Portugal, and analyzed the changes in its internal dynamics, in its main concerns and in the splits that ran through it (first in its relation to punk culture in general, and then within the Straight Edge scene itself).

This study provided a glimpse into the potential that web archives offer for the study of almost any contemporary culture, providing a new source of information for social groups and events that are usually underrepresented in traditional archives.

Without web archives, the study of the eruption of the Straight Edge culture in Portugal would have been impossible just a few years after it happened.

In the Internet age, the same applies to a lot of different phenomena, even to those widely studied. Undeniably, research using web archives implies new methodological and epistemological challenges, but the main challenge is also an opportunity to find new perspectives and new study objects.

Learn more

About the author:

Diogo Duarte is a researcher at the Institute of Contemporary History (FCSH-NOVA) and a doctoral student at the same institute, with a thesis on the history of anarchism in Portugal.

Today’s news to be forgotten tomorrow?

Research financed by Fundação para a Ciência e a Tecnologia, SFRH/BGCT/135017/2017

A study of the transformations of newspaper websites can only be carried out because web archives preserve materials that the newspapers themselves do not preserve or provide. In the second guest blog post in the series showcasing Investiga XXI, DIOGO SILVA DA CUNHA, University of Lisbon, presents the results of his project, which focuses on transformations of this kind in four Portuguese newspapers, using Arquivo.pt.


The transition to what is referred to as the Digital Age and the Information Society implied a great transformation that continues to take place on several levels. Professionals across the communication sectors are now confronted with new conditions in which to perform their work.

An important change occurred at the level of the medium carrying journalistic messages: since the 1990s, newspapers have been translating their printed press editions into online editions.

Fig. 1: Detail of preserved version of Diário de Notícias website in the Arquivo.pt graphical interface, October 13, 1996.

At the end of the 90s, great importance was given to online editions, with part of the newsroom workflow devoted to updating them 24/7, an approach known as “web-first” or “online first”. Something was happening. Born-digital content has become an integral part of today’s journalism, with some of this content published exclusively in the newspapers’ online editions.

The disappearance of born-digital newspaper materials

It is now common to consider, in the context of Communication, Media and Journalism Studies, that online newspaper websites can accumulate journalistic materials and be consulted in the long term by both journalists and readers, according to search filters specific to their structure.

By the same line of reasoning, the expectations of journalists and other professionals linked to newspapers and media companies seem to be similar. The existence of such expectations was confirmed in the present research on Portuguese newspaper websites.

But, as Web Archiving Studies have shown, there is a general trend for websites to be deeply modified, or to disappear, within a year. In the case of newspaper websites, the problem is aggravated by the fact that they are updated at least daily and that their structure as a whole, from URL to layout, also undergoes changes, albeit over a longer period of time. So, although the news content produced by journalists may remain on the newspaper websites for a while, these websites end up with missing elements or simply disappear.

The transformations of Portuguese newspaper websites: a case study

Web archives can be seen as an alternative means of public, direct and interactive access to born-digital journalistic materials that are not preserved, or not publicly provided, by newspapers and their media companies. In this sense, a web archive becomes an information technology structure that functions as a ‘source’ in the conventional, historiographical sense of the term.

The research on the transformations of Portuguese newspaper websites, carried out using Arquivo.pt, consisted of a longitudinal study (1996-2016) of the structure of the websites of four weekly and daily newspapers: Correio da Manhã, Diário de Notícias, Expresso and Público.

The process of describing and comparing the preserved versions of those newspapers’ homepages in Arquivo.pt enabled us to reconstruct the development trends across the different layouts and the different web addresses of these pages. From this work, we drew the following general conclusions:

  • Websites are increasingly extensive and vertically oriented;
  • Websites gradually become aesthetically cohesive, consolidating the newspaper’s visual identity;
  • Changes are increasingly less noticeable as they tend to occur at the “micro” rather than the “macro” level (see Fig. 2);
Fig. 2: Detail of preserved versions of Expresso website, 2008, 2011 and 2012, respectively.
  • More embedded images and videos are used, often framed in galleries; the number of links, buttons, menus and scroll bars has also increased over time;
  • The visual changes, along with the changes of web addresses, are sometimes shaped by the relationships of the media companies with audiovisual and telecommunications companies, e.g. in the different versions shown in Fig.3, the names, colors and/or symbols of these companies are present in the user interface of the newspapers (we see Clix logo on the top left and a pink button on the top right corner in 2007 and in the 2012 capture they are replaced by the AEIOU logo).
Fig. 3: Detail of preserved versions of Expresso website, 2007 and 2012, respectively.
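Groundwork for a comparison like this is a list of the preserved captures of each homepage across the study window. The sketch below shows one way such a list could be gathered programmatically, assuming a CDX-style query endpoint at Arquivo.pt; the URL, parameters and field names follow common CDX-server conventions and are assumptions here, not a description of the method actually used in the study.

```python
"""Minimal sketch: listing preserved captures of a newspaper homepage per year,
assuming a CDX-style query endpoint. The endpoint URL, parameters and field
names follow common CDX-server conventions and are assumptions here."""
import json
from collections import Counter

import requests

CDX_ENDPOINT = "https://arquivo.pt/wayback/cdx"   # assumed CDX-style endpoint

def captures_per_year(url, start="1996", end="2016"):
    """Count preserved captures of `url` per year between `start` and `end`."""
    params = {"url": url, "from": start, "to": end, "output": "json"}
    response = requests.get(CDX_ENDPOINT, params=params, timeout=60)
    response.raise_for_status()

    per_year = Counter()
    # CDX servers commonly return one JSON object per line.
    for line in response.text.splitlines():
        if not line.strip():
            continue
        capture = json.loads(line)
        timestamp = capture.get("timestamp", "")
        if len(timestamp) >= 4:
            per_year[timestamp[:4]] += 1
    return per_year

if __name__ == "__main__":
    # Two of the studied titles as examples; the others would be added similarly.
    for homepage in ["publico.pt", "dn.pt"]:
        print(homepage, dict(sorted(captures_per_year(homepage).items())))
```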

Future work

It is now possible to propose at least three ways of looking at the developments listed above:

  • using digital tools for detailed analysis of changes in layouts at the level of information design,
  • extending the scope of the study to the websites of other newspapers (e.g. other countries, other companies, other types of social institutions, etc.),
  • widening the scope of the study even more to confront the lines of development discovered with web publishing models beyond the spectrum of journalism (e.g. blogs).

It is also worth underlining that it is fundamental to develop a systematic reflection on web archives as such, perceiving them not only as informatic structures but also as ‘research infrastructures’ with their own professional and epistemic cultures. In terms of research on web archives, the work of Niels Brügger seems to offer an excellent starting point. However, it will be crucial to consider web archives in the context of Big Data discussions around reductionist and empiricist trends in the social sciences.

A reflection of this kind would integrate web archives into discussions about ontology, epistemology, methodology, culture, economy and politics. The point would be to think of web archives not only as instruments of access to the world, not only as windows onto the recent digital past, but as devices that take part in the constitution of the world, as mediating technologies with their own implications for retrospective placement, themselves part of the digitalization process.

As outlined above, it is equally important that there be a dialogue between researchers, journalists and newspaper editorial staff. The general problem of digital preservation, which is especially complicated in the field of media and journalism, makes clear the need to establish digital preservation guides for journalists and editors and to promote joint discussion of information curation initiatives, if we don’t want today’s news to be forgotten tomorrow.


Learn more


About the author:

Diogo Silva da Cunha is a PhD student in Philosophy of Science and Technology, Art and Society at the University of Lisbon. His major fields of interest are the epistemology of the social sciences and communication, media and journalism studies. He recently participated in a study on the digitalization process in Portuguese journalism promoted by the respective national regulatory entity. Last year, he participated in the Arquivo.pt research project, in the context of which he proposed, developed and applied a model for analysing journalistic material available in web archives.