By Sara Aubry, Web Archiving Project Manager at BnF
The WARC (Web ARChive) format is our web archives format. It defines a way of combining digital resources into an aggregate archival file along with related metadata. Today it is commonly used to store web crawls. For newcomers: a WARC file is made of one or more records. Each record consists of a header followed by a content block. The header has mandatory named fields that document, for instance, the URI, the date, the type and the length of the record. The content block may contain resources in any format, such as an HTML page, a binary image or a video file. WARC is an extension of the ARC file format designed by the Internet Archive in 1996. The WARC format was initially released as an ISO international standard 10 years ago, in May 2009, under the number 28500:2009 (we also call it WARC version 1.0). The standardization opened the path to wider use and implementation in a variety of applications for harvesting, accessing, mining, exchanging and preserving digital resources. While it is the only standard format for web archives, it has also been adopted beyond the web archiving community to store born-digital or digitized materials.
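To make the record layout described above concrete, here is a minimal sketch in Python that serializes and re-reads a single record using only the standard library. It assumes the WARC 1.0 layout (a version line, named fields separated by CRLF, a blank line, then the content block followed by two CRLFs). The function names `build_record` and `parse_record` and the example URI are illustrative and not part of the standard, and the sketch leaves out compression, digests and header line folding.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_record(target_uri: str, payload: bytes, content_type: str) -> bytes:
    """Serialize one 'resource' record: version line, named fields, blank line, block."""
    fields = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),  # illustrative value, see lead-in
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    header = b"WARC/1.0" + CRLF
    for name, value in fields:
        header += f"{name}: {value}".encode("utf-8") + CRLF
    # A blank line ends the header; two CRLFs close the record.
    return header + CRLF + payload + CRLF + CRLF


def parse_record(raw: bytes):
    """Split a single serialized record back into version, fields and content block."""
    head, _, rest = raw.partition(CRLF + CRLF)
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    fields = dict(line.split(": ", 1) for line in lines[1:])
    # Content-Length tells us how many bytes of the remainder are the block.
    length = int(fields["Content-Length"])
    return version, fields, rest[:length]


record = build_record("http://example.org/", b"<html>hello</html>", "text/html")
version, fields, block = parse_record(record)
print(version, fields["WARC-Type"], fields["Content-Length"])
```

In practice a WARC file is simply a sequence of such records, often individually compressed, which is why the length field matters: a reader can skip from one record to the next without interpreting the content block.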
As with all ISO standards, the WARC standard is periodically reviewed to ensure that it continues to meet the changing needs that emerge from our practice. The first revision, supported by an IIPC task force and the subcommittee in charge of technical interoperability within the ISO information and documentation technical committee (ISO/TC46/SC4), was published in August 2017 as ISO 28500:2017 (it is also known as WARC version 1.1). This revision mainly introduced new named fields for deduplication and the possibility of more precise timestamps (see the IIPC GitHub for more details).
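As a rough illustration of those two changes, the sketch below formats a capture date at second precision and at sub-second precision, and assembles the fields of a deduplicated ("revisit") record. It relies on my reading of WARC 1.1: that `WARC-Date` may carry fractional seconds, and that `WARC-Refers-To-Target-URI` and `WARC-Refers-To-Date` are the fields added for deduplication. The helper names `warc_date` and `revisit_fields` are hypothetical, and the specification remains the authority on the exact rules.

```python
from datetime import datetime, timezone


def warc_date(moment: datetime, fractional: bool = False) -> str:
    """Format an aware datetime in UTC, with or without microseconds."""
    moment = moment.astimezone(timezone.utc)
    pattern = "%Y-%m-%dT%H:%M:%S.%fZ" if fractional else "%Y-%m-%dT%H:%M:%SZ"
    return moment.strftime(pattern)


def revisit_fields(original_uri: str, original_capture: datetime) -> dict:
    """Named fields pointing a revisit record at the capture it duplicates."""
    return {
        "WARC-Type": "revisit",
        "WARC-Refers-To-Target-URI": original_uri,
        "WARC-Refers-To-Date": warc_date(original_capture),
    }


capture = datetime(2017, 8, 1, 12, 30, 45, 123456, tzinfo=timezone.utc)
print(warc_date(capture))                    # second precision
print(warc_date(capture, fractional=True))   # microsecond precision
print(revisit_fields("http://example.org/", capture))
```

Finer timestamps help when several captures of the same URL fall within one second, which a second-precision date cannot tell apart.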
During the last IIPC general assembly, which took place in November 2018 in Wellington, we started to discuss possible evolutions for the second revision. The ISO vote required to launch the revision process is currently scheduled for 2022. Alex Osborne from the National Library of Australia challenged the format to support the HTTP/2 protocol. Ilya Kreymer presented Rhizome’s current implementation for recording provenance headers, which indicate that a record has been created from another record and not from the original URL. Ilya also presented a need to keep track of the dynamic history of a web page display. Exchanges continued and are still alive on the IIPC GitHub and Slack (#warc channel). Hot topics currently include how to keep track of media conversion (in particular for video and audio files) and how to reference a “transcluded” video or audio file from another page.
All these topics need time for raising awareness, in-depth discussions, shared testing and tool implementation within our community before they can be drafted and included in the standard. If you want to join the current discussions or raise any other topic, please join the IIPC #warc channel on Slack.
The National and University Library in Zagreb has been an IIPC member since 2008. The Croatian Web Archive (Hrvatski arhiv weba, HAW), established in 2004, is open access. Current projects include delivering metadata to Europeana, implementing the persistent identifier URN:NBN, migrating to OpenWayback, developing a new user interface and integrating with the Digital Library portal. The Web Archiving Team has also been involved in introducing librarians, archivists and researchers to web archiving and to using HAW resources.
By Ingeborg Rudomino, Croatian Web Archive, National and University Library in Zagreb and Karolina Holub, Croatian Digital Library Development Centre, Croatian Institute for Librarianship, National and University Library in Zagreb
The National and University Library in Zagreb (NUL) in collaboration with the University Computing Centre in Zagreb (Srce) established the Croatian Web Archive (Hrvatski arhiv weba, HAW) in 2004 and started to acquire, catalogue and archive online publications according to the legal deposit provisions of the Library Act from 1997. Due to the well-known characteristics of web resources, the NUL started to archive selectively and established selection criteria.
We use several methods to identify a web resource for cataloguing and archiving: the HAW team searches and browses the web, website owners or content providers fill out the Registration form, or we receive notifications from the ISSN Centre for Croatia.
After identification, every resource is catalogued in the library system and automatically transferred into our custom-built archiving system, where the archiving process starts. Our long-standing experience in cataloging this type of resource has shown the process to be very challenging, and describing this dynamic and variable content results in daily interventions in the bibliographic records. Because of that, we created cataloguing guidelines with a variety of examples. Our goal has been to preserve the original websites (their look and feel) as much as possible. In order to achieve quality, each resource is approached individually during the archiving process. The DAMP software, developed by the University Computing Centre in Zagreb, was built especially for this purpose. The workflow of processing web resources is integrated within the organisational structure of the Library.
We are proud of the quantity and quality of web resources stored in the Croatian Web Archive, some of which are websites of institutions, associations, clubs, research projects, news media, portals, blogs, official websites of counties, cities, journals and books. Special attention is given to news media websites/portals, which are archived daily, weekly or monthly.
Access and the first full domain crawl
This selective approach ensures quality and provides full control over the management of web resources. So far, over 6,700 titles have been archived and almost all are publicly available. All content is full text searchable, and it’s possible to search by any word in the title, URL or keywords. Advanced search is available as well. Users can browse the HAW alphabetically and through subject categories, which are extracted from the UDC field in the catalogue.
To secure permanent access to archived web resources, we have recently implemented the persistent identifier URN:NBN and have assigned it to archived titles and to all archived instances (Fig. 3).
To overcome the limitations of selective archiving, the first harvest of the whole .hr domain was conducted in 2011 with the Heritrix web crawler. Since then, we have been harvesting the .hr domain annually. The collected content is publicly available via HAW’s website through the OpenWayback access interface (Fig. 4). To date, we have conducted 7 .hr domain harvests.
In 2011, we also started to periodically harvest websites related to topics and events of national importance, again using Heritrix and OpenWayback. Nine thematic collections have been created, mainly related to themes such as presidential, parliamentary or local elections, accession to the EU and the flood in Croatia. Each collection is described by several metadata elements: title, size, number of seeds/URLs and description.
Training and outreach
Twice a year, we organize a workshop within the Centre of Continuing Education for Librarians. With the main goal of introducing web archiving to library professionals and students, the workshop focuses on learning how to recognize online materials that should be preserved according to existing criteria for cataloguing and archiving Croatian web resources. The participants are also introduced to the workflow of selective archiving, .hr harvests, the process of selecting materials for thematic collections and different ways of browsing the archived content.
With the experience that we have gained over the years, we are happy to share our knowledge and expertise on web archiving and to support all those interested. To increase awareness of HAW and web archiving among librarians, archivists and the wider community, we make use of every opportunity to do so, such as presenting at national and international conferences and giving lectures to students and researchers.
A few thoughts for the future
The Croatian Web Archive currently has more than 40 TB of content. We are working on a web interface that will have new functionalities and features, including full-text search for the domain harvests and news sections for the web archiving community and researchers. The plan is also to integrate HAW’s metadata into the Digital Library portal in order to have a single access point for all digital collections.
By combining all three approaches and using different software, the Library will attempt to cover, to the greatest extent possible, the contemporary part of Croatian cultural and scientific heritage.
In the third guest blog post presenting the results of Investiga XXI, DIOGO DUARTE introduces his study of the emergence in Portugal of Straight Edge, a drug-free punk subculture, carried out using the web pages preserved by Arquivo.pt. As an international and informal suburban culture, Straight Edge found in the internet one of the drivers of its expansion in the second half of the nineties. This text presents a first approach to building the history of the Straight Edge culture.
Since its eruption in the second half of the 1970s, punk was characterized by a multiplicity of derived experiences and expressions that defied the simplistic and sensationalist picture often portrayed of a self-destructive movement (due to the drug and alcohol excesses of some of its members). One of those expressions with a significant growth and impact was Straight Edge.
Sober punk: “I’ve got better things to do”
Born at the beginning of the 1980s in Washington D.C., U.S.A., through the voice of one of the most emblematic bands in punk-hardcore history, Minor Threat, Straight Edge was one of the answers to that self-destructive spiral. Besides the refusal to consume addictive substances, vegetarianism and animal rights have been strongly associated with the Straight Edge lifestyle since its beginning.
Minor Threat’s lyrics quickly found an echo among a number of individuals who identified with punk rebelliousness and the raw energy of its loud and fast music but who were not attracted to some of its common behaviors. Before long, Straight Edge was claimed as an identity by a growing number of bands and individuals all over the United States.
The explosion of Straight Edge in Portugal
In Portugal, this punk subculture started to explode at the beginning of the 1990s, with X-Acto, the first Straight Edge band, appearing in 1991. Throughout this decade, Straight Edge never stopped growing, with more and more bands and individuals adopting its principles to guide their lives.
In the second half of the 1990s, the Internet became one of the main platforms of communication within the Straight Edge community. By making it easier to spread its ideas and events among a larger audience, the internet created a new space of sociability complementary to the concerts and other meeting spaces.
The growth of the Straight Edge culture reflected some of the social and political dynamics of the Portuguese society that emerged during the 1990s, but it also contributed to accelerate those changes, particularly through its interventional and strongly politicized characteristics.
Anti-consumption, anti-capitalism, anti-racism, feminism, ecology and, especially, veganism and animal rights were some of the causes more actively promoted by the Straight Edge followers.
As a predominantly suburban culture, informal and lacking any institutional structure, based on the punk Do It Yourself ethic, Straight Edge remained underground, without any media or public visibility. Information circulated through concerts, through independent distributors and, with the Internet, through online forums, websites and blogs.
The importance of web archives to the study of popular subcultures
With the slowing down of the movement during the early 2000s, much of the information available online that documented the existence of this culture disappeared – in some cases irretrievably – without having been preserved in traditional archives or leaving a trace in institutional media.
Thus, the possibilities of studying the Straight Edge culture and its impact on Portuguese society were severely reduced. Arquivo.pt recovered and archived many of those pages and re-opened the possibility of studying them.
The websites preserved by Arquivo.pt were the basis of this research. Through them, we observed Straight Edge’s eruption, expansion, consolidation and decline in Portugal and analyzed the changes that occurred in its internal dynamics, in its main concerns and the splits that traversed it (firstly, in its relation to punk culture in general, and then inside the Straight Edge scene itself).
This study provided a glimpse into the potential that web archives offer for the study of almost any contemporary culture, providing a new source of information for social groups and events that are usually underrepresented in traditional archives.
Without web archives, the study of the eruption of the Straight Edge culture in Portugal would have been impossible, just a few years after it happened.
In the Internet age, the same applies to a lot of different phenomena, even to those widely studied. Undeniably, research using web archives implies new methodological and epistemological challenges, but the main challenge is also an opportunity to find new perspectives and new study objects.
A study about the transformations of newspaper websites can only be carried out because there are web archives preserving materials that the newspapers themselves do not preserve or provide. In the second guest blog post in the series showcasing Investiga XXI, DIOGO SILVA DA CUNHA, University of Lisbon, presents the results of his project focusing on transformations of this kind in four Portuguese newspapers using Arquivo.pt.
The transition to what is referred to as the Digital Age and the Information Society implied a great transformation, which continues to take place at several levels. Professionals in the various communication sectors are now confronted head-on with new conditions in which to perform their work.
An important change occurred at the level of the support of journalistic messages. Since the 1990s, newspapers have begun to translate their printed press editions into online editions.
At the end of the 90s, great importance was given to online editions, focusing part of the newsroom workflow on their update 24/7, an approach known as “web-first” or “online first”. Something was happening. Born-digital content has become an integral part of today’s journalism with some of this content being published exclusively in the newspaper’s online editions.
The disappearance of born-digital newspaper materials
It is now common to consider in the context of Communication, Media and Journalism Studies that the structure of the online newspaper websites can accumulate journalistic materials and can be consulted in the long term by both journalists and readers, according to search filters specific to such structure.
In the same line of reasoning, it seems that the expectations of journalists and other professionals linked to newspapers and media companies are similar. The existence of such expectations was confirmed in the present research on the Portuguese newspaper websites.
But, as Web Archiving Studies have been showing, there is a general trend for websites to be deeply modified or disappear within a year. In the case of newspaper websites, the problem is aggravated by the fact that they are updated at least daily and their structure as a whole, from its URL to its layout, also undergoes changes, although this happens over a longer period of time. So, although the news content produced by journalists may remain on the newspaper websites for a while, these websites end up with missing elements or they just disappear.
The transformations of Portuguese newspaper websites: a case study
Web archives can be seen as an alternative in terms of public, direct and interactive access to born-digital journalistic materials that are not preserved or that are not publicly provided by newspapers and their media companies. In this sense, a web archive becomes an information technology structure which functions as a ‘source’ in the conventional, historiographical sense of the term.
The research on the transformations of Portuguese newspaper websites, that was carried out using Arquivo.pt, focused on a longitudinal study (1996-2016) of the structure of the websites of four weekly and daily newspapers: Correio da Manhã, Diário de Notícias, Expresso and Público.
The process of describing and comparing the preserved versions of those newspapers’ homepages in Arquivo.pt enabled us to reconstruct the development trends between the different layouts and the different web addresses of these pages. From this work, we drew the following general conclusions:
Websites are increasingly extensive and vertically oriented;
Websites gradually become aesthetically cohesive, consolidating the newspaper’s visual identity;
Changes are increasingly less noticeable as they tend to be on the “micro” rather than “macro” level (see Fig. 2);
More embedded images and videos are used, often framed in galleries; the number of links, buttons, menus and scroll bars has also increased over time;
The visual changes, along with the changes of web addresses, are sometimes shaped by the relationships of the media companies with audiovisual and telecommunications companies, e.g. in the different versions shown in Fig. 3, the names, colors and/or symbols of these companies are present in the user interface of the newspapers (we see the Clix logo on the top left and a pink button on the top right corner in 2007, and in the 2012 capture they are replaced by the AEIOU logo).
It is now possible to propose at least three ways of looking at the developments listed above:
using digital tools for detailed analysis of changes in layouts at the level of information design,
extending the scope of the study to the websites of other newspapers (e.g. other countries, other companies, other types of social institutions, etc.),
widening the scope of the study even more to confront the lines of development discovered with web publishing models beyond the spectrum of journalism (e.g. blogs).
It is also worth underlining that it is fundamental to develop a systematic reflection on web archives as such, perceiving them not only as information technology structures, but also as ‘research infrastructures’, with their own professional and epistemic cultures. In terms of research on web archives, the work of Niels Brügger seems to offer an excellent starting point. However, it will be crucial to consider web archives in the context of Big Data discussions around reductionist and empiricist trends in the social sciences.
A reflection of this kind would integrate web archives in discussions about ontology, epistemology, methodology, culture, economy and politics. The question would be to think of web archives not only as instruments of access to the world, not only as windows to the digital recent past, but as devices that are part of the constitution of the world, as mediating technologies with their own implications in retrospective placement, themselves part of the digitalization process.
As outlined above, it’s equally important that there is a dialogue between researchers, journalists and newspaper editorial staff. The general problem of digital preservation, especially complicated in the field of media and journalism, makes clear the need to establish digital preservation guides for journalists and editors and to promote the joint discussion of information curation initiatives, if we don’t want today’s news to be forgotten tomorrow.
Diogo Duarte, The Study of punk culture through the Portuguese Web Archive
Ricardo Basílio, Memory of the online presence of a Faculty: an exhibition
About the author:
Diogo Silva da Cunha is a PhD student of Philosophy of Science and Technology, Art and Society at the University of Lisbon. His major fields of interest are epistemology of the social sciences and communication, media and journalism studies. Diogo Silva da Cunha recently participated in a study on the digitalization process in Portuguese journalism promoted by the respective national regulatory entity. Last year, he participated in the research project of Arquivo.pt, in the context of which he proposed, developed and applied a model of analysis of journalistic material available in web archives.
FCSH was founded in 1977 and is part of Universidade Nova de Lisboa. Since 1997, FCSH websites have been used as communication interfaces with its community of teachers, researchers and students.
Arquivo.pt preserves web content published since 1996. Therefore, the time span of the web content preserved by Arquivo.pt covers 20 years of the institutional online memory of FCSH, that is half of the Faculty’s lifetime.
In the early years of the Web, the FCSH website mostly replicated printed information. However, it has gradually become a comprehensive portal to academic life at the Faculty, also including news, lists of researchers, research programs and access points to services.
Research centers are important entities of the Faculty’s ecosystem. In 1997 there were 30 small research centers, but in 2016 they were merged into 16 larger ones.
The research centers are autonomous, manage their own projects and organize specific events. This fact resulted in the creation of over 100 additional related websites serving various purposes, such as institutional communication, project descriptions and event promotions.
The online exhibition aimed to create an institutional memory through a chronological narrative built from past web pages preserved by Arquivo.pt.
Synthesizing 20 years of memories into a single page
The project began by inventorying a large number of current websites related to the Faculty activities. We subsequently narrowed our scope to include only the institutional websites leaving other ones for future work (e.g. projects and events). All the identified websites were targeted to be preserved by Arquivo.pt.
The data collection was performed manually through the Arquivo.pt search interfaces. We mainly searched for the hostname and analyzed the corresponding version history, noting its main content changes and references to external websites of events and projects. The data was collected, selected and registered on one page per organizational unit (see Fig. 2).
Some research centers adopted multiple hostnames over time. Conversely, the institutional identity may also have changed due to organizational mergers, name changes or different institutional frameworks. For example, CHAM “Centro de Humanidades” (in 2017) had two previous names: “Centro d’Além Mar” in 2002 and then “Centro d’Aquém e d’Além Mar” in 2013-2014, when it merged with “Centro de História da Cultura – CHC”, “Centro de Estudos Históricos – CEH” and “Instituto Oriental – IO”. However, the hostname of the website has never changed: cham.fcsh.unl.pt.
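Although this collection was done by hand, the same version-history review could be partly scripted. The sketch below groups a list of preserved-page addresses by year so that changes can be inspected chronologically. It assumes replay addresses of the form `https://arquivo.pt/wayback/<14-digit timestamp>/<original URL>`. The sample addresses and timestamps are made up for illustration, and the function name `versions_by_year` is hypothetical.

```python
import re
from collections import defaultdict

# Assumed replay address pattern: .../wayback/YYYYMMDDhhmmss/<original URL>
CAPTURE = re.compile(r"/wayback/(\d{14})/(.+)$")


def versions_by_year(capture_urls):
    """Group preserved versions by capture year, oldest first within each year."""
    history = defaultdict(list)
    for url in capture_urls:
        match = CAPTURE.search(url)
        if match is None:
            continue  # skip anything that is not a replay address
        timestamp, original = match.groups()
        history[int(timestamp[:4])].append((timestamp, original))
    return {year: sorted(items) for year, items in sorted(history.items())}


# Made-up sample data, for illustration only.
sample = [
    "https://arquivo.pt/wayback/20020601000000/http://cham.fcsh.unl.pt/",
    "https://arquivo.pt/wayback/19971010120000/http://www.fcsh.unl.pt/",
    "https://arquivo.pt/wayback/20020115083000/http://cham.fcsh.unl.pt/",
    "not a capture",
]

for year, versions in versions_by_year(sample).items():
    print(year, [timestamp for timestamp, _ in versions])
```

A per-year view like this makes it easier to spot when a unit first appears, when its address changes and when captures stop, which are the moments worth checking by hand.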
Sometimes it was not straightforward to conclude if we were facing the same organizational entity after a merge, even when the website remained with the same title, hostname and URL. It’s hard, however, to imagine that the entity changed if everything remained the same. Therefore, our conclusions were validated through interviews with current and previous staff of the Faculty and research centers. Hence, the importance of institutional support and direct interaction with the entities.
Designing a time travel to the past
The objective was to create a website with a clean look and that was easy to browse. We anchored its navigation on suggestive images extracted from preserved web pages, to reinforce that it is an exhibition about online memory, rather than about current information available on the live-Web.
Thus, the homepage of the online exhibition presents a collection of preserved web images from old websites of organizational units that belonged to FCSH.
The chosen publishing platform was the free version of WordPress.com, so that anyone can create a similar project, despite a potential lack of financial resources.
By clicking on each image, the user is taken to a page that describes the online memory of each entity of the Faculty. It presents the following elements: featured image, brief synopsis, list of addresses over time and a selection of mesmerizing moments.
The description of each entity has a maximum length of 150 words and includes links to versions preserved on Arquivo.pt. This interaction between the online exhibition and the web archive aims to provide the user experience of browsing an institutional memory.
The exhibition is complemented with frequently asked questions and tutorials related to digital preservation.
Future work, because a website is never finished
The next step is to promote this exhibition through the institutional communication channels of the Faculty (e.g. institutional website, mailing lists).
The exhibition still has plenty of room to be complemented with additional entities that could be aggregated in collections organized by topic or scientific area.
Direct interaction with research centers is essential, as is the organization of training courses on web preservation and research to raise awareness of the importance of web archiving.
This project was developed in just 3 months, between May and July 2017. This short time span forced us to focus and set priorities on the most important issues. Had we had more time, we might still be lost choosing plug-ins; but would the extra plug-ins actually have been needed to accomplish the objectives? The users don’t seem to miss them in the exhibition.
We aimed to demonstrate that anyone could develop a similar exhibition to preserve the online memory of an organization without requiring significant financial resources or technical skills.
We hope that this project will encourage librarians and archivists to create ways of preserving the online memory of their institutions.
Ricardo Basílio has a Master’s degree in Documentation and Information Sciences. He was a librarian at the Faculty of Social and Human Sciences of Universidade Nova de Lisboa and at the Art Library of Fundação Calouste Gulbenkian, where he worked on the digital collections about Portuguese tiles (the “DigiTile” project). His areas of interest are digital preservation, digital libraries and technologies that support information. He created and manages a website in Portuguese about digital preservation (Digital Preservation Guide).
A guest blog post by Lozana Rossenova, a collaborative doctoral student with the Centre for the Study of the Networked Image (London, UK) and Rhizome (New York, USA). Lozana’s PhD is supported by the AHRC Collaborative Doctoral Awards 2016.
The evolution of network environments and the development of new patterns of interaction between users and online interfaces create multiple challenges for the long-term provision of access to online artefacts of cultural value. In the case of internet art, curating and archiving activities are contingent upon addressing the question of what constitutes the art object. Internet artworks are not single digital objects, but rather assemblages, dependent on specific software and network environments to be executed and rendered.
My research project seeks to better understand problems associated with the archiving of internet art: how can the artworks be made accessible to the public in their native environment – online – while enabling users of the archive to gain an expanded understanding of the artworks’ context?
User experience and the ArtBase
In the fields of user experience design and human computer interaction (HCI), there has been substantial research done around issues of discoverability, accessibility and usability in digital archives, but the studies have focused primarily on archives with digitised born-analogue text- or image-based documents. Presentation and contextualisation in archives of complex born-digital artefacts, on the other hand, have been discussed much less, particularly from the point of view of the user’s experience.
Unlike digitised texts or images, internet art spans beyond the boundaries of a single object and oftentimes references external, dynamic and real-time data sources, or exists across multiple locations and platforms. Rhizome has recognised the inherent vulnerability of internet art since its inception as an organisation and community-building platform in 1996. The ArtBase was established in 1999 as an online space to present and archive internet art. Initial strategies towards presentation of artworks in the ArtBase reflected contemporaneous developments in the fields of interaction design and digital preservation. More recently the archival system has struggled to accommodate the growing number and variety of artworks in the ArtBase. Providing a consistent user experience in making artworks accessible brings additional challenges and requires further research into how users encounter and interact with archives of web-based artefacts.
Beyond preservation challenges – such as an artwork’s technical dependencies on specific network protocols, web standards, or browser plugins – various interface design elements and conventions change over time. These influence how users navigate, interact with and understand context within the networked artwork. Changes to interaction patterns and interface elements, such as frames, check-boxes and scrollbars, could all significantly alter, or even render defunct, the user experience of an artwork. Examples that illustrate this clearly include works such as Jan Robert Leegte’s Scrollbar Composition (2000) or Alexei Shulgin’s Form Art (1997) [see Figures 2, 3]. Given these circumstances, new preservation and presentation paradigms are needed in order for the online archive of internet art to be able to provide access not only to an artwork’s HTML and CSS code, but also to the contextualised experience of the work.
Web archiving and remote browsers
Recognising the limitations in the current archival framework to provide adequate access to a large number of historic artworks, increasingly the focus of preservation efforts at Rhizome has been on building tools to support the presentation of complex artworks with multiple dependencies. Recent developments in browser-based emulation and web archiving tools have been instrumental in facilitating the restoration and re-performance of important internet artworks, which have been presented as instalments in Rhizome’s major new curatorial project – Net Art Anthology.
The remote browsing technology, first introduced in Rhizome’s oldweb.today project to emulate old browser environments, has facilitated the online presentation of historic internet artworks in contemporaneous environments, such as Netscape Navigator or early versions of Internet Explorer. Furthermore, the capacity to create high-fidelity archives of the dynamic web with Rhizome’s browser-based archiving tool, Webrecorder, has enabled the preservation of artworks based on third-party web services, such as Instagram and Yelp.
Presenting artworks inside browsers running in Docker containers allows for the restaging of historic artworks in the original environments in which users encountered them, thereby providing oftentimes crucial contextual information to contemporary audiences (see reference to Form Art above). Meanwhile, the remote browsers in Webrecorder provide an environment for the recording and replaying of various internet artworks including ones that use Flash or Java, which are unsupported in the most recent versions of major browsers like Chrome, Firefox, Safari or Microsoft Edge.
Recent developments in Rhizome’s preservation practices indicate that the online archive of internet art is not accessible or sustainable if it remains a single centralised platform. Instead, it could be reconceptualised as a resource, connected with and linking out to various instantiations of the artworks. Remote browsers, in particular, could become a powerful tool allowing presentation of artworks either as a link out of the ArtBase page into a new page running the emulated browser, or as an embedded iframe within the ArtBase page of the artwork. In each of these cases, users would encounter a “browser within a browser” presentation paradigm. A potential challenge here would be users mistaking the remote browser environment for other secondary representations (a static screenshot, for instance, a device commonly used to present web-based artworks). Providing a consistent and contextualised user experience across the system used to present the artwork and the archival record of the work requires addressing such challenges. In the coming months, we will be conducting further research into interaction design patterns of ArtBase artworks and behaviour patterns of the archive’s users, which will inform a redevelopment of the ArtBase interaction design framework.
A presentation of recent developments in Rhizome’s Webrecorder tool, the remote browsers technology and strategies for augmenting web archives will take place at the IIPC/RESAW Conference (WAC) 2017 during Web Archiving Week, 12–16 June 2017.