The Study of punk culture through the Portuguese Web Archive

In the third guest blog post presenting the results of Investiga XXI, DIOGO DUARTE introduces his study of the emergence of Straight Edge, a drug-free punk subculture, in Portugal, which was carried out using the web pages preserved by Arquivo.pt. As an international and informal suburban culture, Straight Edge found in the internet one of the factors of its expansion in the second half of the nineties. This text presents a first approach to building the history of the Straight Edge culture.


Since its eruption in the second half of the 1970s, punk has been characterized by a multiplicity of derived experiences and expressions that defy the simplistic and sensationalist picture often painted of a self-destructive movement (due to the drug and alcohol excesses of some of its members). One of those expressions, with significant growth and impact, was Straight Edge.

Sober punk: “I’ve got better things to do”

Born at the beginning of the 1980s in Washington D.C., U.S.A., through the voice of one of the most emblematic bands in punk-hardcore history, Minor Threat, Straight Edge was one of the answers to that self-destructive spiral. Besides the refusal to consume addictive substances, vegetarianism and animal rights have been strongly associated with the Straight Edge lifestyle since its beginning.

Minor Threat's lyrics quickly found an echo among individuals who identified with punk rebelliousness and the raw energy of its loud and fast music but who were not attracted to some of its common behaviors. Before long, Straight Edge was claimed as an identity by a growing number of bands and individuals all over the United States.

The explosion of Straight Edge in Portugal

In Portugal, this punk subculture started to explode at the beginning of the 1990s, with X-Acto, the first Straight Edge band, appearing in 1991. Throughout this decade, Straight Edge never stopped growing, with more and more bands and individuals adopting its principles to guide their lives.

Fig. 1: X-Acto website preserved by Arquivo.pt

In the second half of the 1990s, the Internet became one of the main platforms of communication within the Straight Edge community. By making it easier to spread its ideas and events among a larger audience, the Internet created a new space of sociability complementary to concerts and other meeting spaces.

The growth of the Straight Edge culture reflected some of the social and political dynamics that emerged in Portuguese society during the 1990s, but it also helped to accelerate those changes, particularly through its interventionist and strongly politicized character.

Anti-consumption, anti-capitalism, anti-racism, feminism, ecology and, especially, veganism and animal rights were some of the causes most actively promoted by Straight Edge followers.

As a predominantly suburban culture, informal and lacking any institutional structure, and based on the punk Do It Yourself ethic, Straight Edge remained underground, without any media or public visibility. Information circulated through concerts, through independent distributors and, with the arrival of the Internet, through online forums, websites and blogs.

The importance of web archives to the study of popular subcultures

Fig. 2: Founded in 1998, StraightEdge.pt was the most important website of the Straight Edge community in Portugal.

With the slowing down of the movement during the early 2000s, much of the information available online that documented the existence of this culture disappeared – in some cases irretrievably – without having been preserved in traditional archives or leaving a trace in institutional media.

Thus, the possibilities of studying the Straight Edge culture and its impact on Portuguese society were severely reduced. Arquivo.pt recovered and archived many of those pages and re-opened the possibility of studying them.

The websites preserved by Arquivo.pt were the basis of this research. Through them, we observed Straight Edge’s eruption, expansion, consolidation and decline in Portugal and analyzed the changes that occurred in its internal dynamics, in its main concerns and the splits that traversed it (firstly, in its relation to punk culture in general, and then inside the Straight Edge scene itself).

This study provided a glimpse into the potential that web archives offer for the study of almost any contemporary culture, providing a new source of information for social groups and events that are usually underrepresented in traditional archives.

Without web archives, the study of the eruption of the Straight Edge culture in Portugal would have been impossible, just a few years after it happened.

In the Internet age, the same applies to a lot of different phenomena, even to those widely studied. Undeniably, research using web archives implies new methodological and epistemological challenges, but the main challenge is also an opportunity to find new perspectives and new study objects.

Learn more

About the author:

Diogo Duarte is a researcher at the Institute of Contemporary History (FCSH-NOVA) and a doctoral student at the same institute, with a thesis on the history of anarchism in Portugal.


Today’s news to be forgotten tomorrow?

Arquivo.pt
Research financed by Fundação para a Ciência e a Tecnologia, SFRH/BGCT/135017/2017

A study about the transformations of newspaper websites can only be carried out because there are web archives preserving materials that the newspapers themselves do not preserve or provide. In the second guest blog post in the series showcasing Investiga XXI, DIOGO SILVA DA CUNHA, University of Lisbon, presents the results of his project focusing on transformations of this kind in four Portuguese newspapers using Arquivo.pt.


The transition to what is referred to as the Digital Age and the Information Society implied a great transformation which continues to take place at several levels. Professionals in the various communication sectors are now confronted, at the forefront, with new conditions in which to perform their work.

An important change occurred in the medium that carries journalistic messages. Since the 1990s, newspapers have been translating their print editions into online editions.

Fig. 1: Detail of preserved version of Diário de Notícias website in the Arquivo.pt graphical interface, October 13, 1996.

At the end of the 90s, great importance was given to online editions, with part of the newsroom workflow focused on updating them 24/7, an approach known as “web-first” or “online first”. Something was happening. Born-digital content has become an integral part of today’s journalism, with some of this content being published exclusively in the newspapers’ online editions.

The disappearance of born-digital newspaper materials

It is now common to consider in the context of Communication, Media and Journalism Studies that the structure of the online newspaper websites can accumulate journalistic materials and can be consulted in the long term by both journalists and readers, according to search filters specific to such structure.

In the same line of reasoning, it seems that the expectations of journalists and other professionals linked to newspapers and media companies are similar. The existence of such expectations was confirmed in the present research on the Portuguese newspaper websites.

But, as Web Archiving Studies have been showing, there is a general trend for websites to be deeply modified or disappear within a year. In the case of newspaper websites, the problem is aggravated by the fact that they are updated at least daily and their structure as a whole, from its URL to its layout, also undergoes changes, although this happens over a longer period of time. So, although the news content produced by journalists may remain on the newspaper websites for a while, these websites end up with missing elements or they just disappear.

The transformations of Portuguese newspaper websites: a case study

Web archives can be seen as an alternative in terms of public, direct and interactive access to born-digital journalistic materials that are not preserved or that are not publicly provided by newspapers and their media companies. In this sense, a web archive becomes an information technology structure which functions as a ‘source’ in the conventional, historiographical sense of the term.

The research on the transformations of Portuguese newspaper websites, which was carried out using Arquivo.pt, consisted of a longitudinal study (1996-2016) of the structure of the websites of four weekly and daily newspapers: Correio da Manhã, Diário de Notícias, Expresso and Público.

The process of describing and comparing the preserved versions of those newspapers’ homepages in Arquivo.pt enabled us to reconstruct the development trends between the different layouts and the different web addresses of these pages. From this work, we drew the following general conclusions:

  • Websites are increasingly extensive and vertically oriented;
  • Websites gradually become aesthetically cohesive, consolidating the newspaper’s visual identity;
  • Changes are increasingly less noticeable as they tend to be on the “micro” rather than “macro” level (see Fig. 2);
Fig. 2: Detail of preserved versions of Expresso website, 2008, 2011 and 2012, respectively.
  • More embedded images and videos are used, often framed in galleries; the number of links, buttons, menus and scroll bars has also increased over time;
  • The visual changes, along with the changes of web addresses, are sometimes shaped by the relationships of the media companies with audiovisual and telecommunications companies, e.g. in the different versions shown in Fig. 3, the names, colors and/or symbols of these companies are present in the user interface of the newspapers (we see the Clix logo on the top left and a pink button on the top right corner in 2007, and in the 2012 capture they are replaced by the AEIOU logo).
Fig. 3: Detail of preserved versions of Expresso website, 2007 and 2012, respectively.

Future work

It is now possible to propose at least three ways of looking at the developments listed above:

  • using digital tools for detailed analysis of changes in layouts at the level of information design,
  • extending the scope of the study to the websites of other newspapers (e.g. other countries, other companies, other types of social institutions, etc.),
  • widening the scope of the study even more to confront the lines of development discovered with web publishing models beyond the spectrum of journalism (e.g. blogs).

It is also worth underlining that it is fundamental to develop a systematic reflection on web archives as such, perceiving them not only as informatic structures, but also as ‘research infrastructures’, with their own professional and epistemic cultures. In terms of research on web archives, the work of Niels Brügger seems to offer an excellent starting point. However, it will be crucial to consider web archives in the context of Big Data discussions around reductionist and empiricist trends in the social sciences.

A reflection of this kind would integrate web archives in discussions about ontology, epistemology, methodology, culture, economy and politics. The question would be to think of web archives not only as instruments of access to the world, not only as windows to the digital recent past, but as devices that are part of the constitution of the world, as mediating technologies with their own implications in retrospective placement, themselves part of the digitalization process.

As outlined above, it’s equally important that there is a dialogue between researchers, journalists and newspaper editorial staff. The general problem of digital preservation, especially complicated in the field of media and journalism, makes clear the need to establish digital preservation guides for journalists and editors and to promote the joint discussion of information curation initiatives, if we don’t want today’s news to be forgotten tomorrow.


Learn more


About the author:

Diogo Silva da Cunha is a PhD student of Philosophy of Science and Technology, Art and Society at the University of Lisbon. His major fields of interest are epistemology of the social sciences and communication, media and journalism studies. Diogo Silva da Cunha recently participated in a study on the digitalization process in Portuguese journalism promoted by the respective national regulatory entity. Last year, he participated in the research project of Arquivo.pt, in the context of which he proposed, developed and applied a model of analysis of journalistic material available in web archives.

Memory of the online presence of a Faculty: an exhibition

In 2017 Arquivo.pt launched Investiga XXI, a project that aims to promote the use of the Portuguese Web Archive as a research tool and resource. In this first guest blog post introducing the Portuguese initiative, RICARDO BASÍLIO presents a collaboration between Arquivo.pt and researchers from the Faculty of Social and Human Sciences of Universidade Nova de Lisboa (FCSH-UNL) which resulted in the creation of an online exhibition that illustrates a use case for the historical information preserved by Arquivo.pt. This exhibition highlights extracts of institutional online memories.


FCSH: 40 years of lifetime, 20 years online

FCSH was founded in 1977 and is part of Universidade Nova de Lisboa. Since 1997, FCSH websites have been used as communication interfaces with its community of teachers, researchers and students.

Arquivo.pt preserves web content published since 1996. Therefore, the time span of the web content preserved by Arquivo.pt covers 20 years of the institutional online memory of FCSH, that is, half of the Faculty’s lifetime.

Fig. 1: The first version of the FCSH website preserved by Arquivo.pt

In the early years of the Web, the FCSH website mostly replicated printed information. However, it has gradually become a comprehensive portal to academic life at the Faculty, also including news, lists of researchers, research programs and access points to services.

Research centers are important entities of the Faculty’s ecosystem. In 1997 there were 30 small research centers, but in 2016 they were merged into 16 larger ones.

The research centers are autonomous, manage their own projects and organize specific events. This fact resulted in the creation of over 100 additional related websites serving various purposes, such as institutional communication, project descriptions and event promotions.

The online exhibition aimed to create an institutional memory through a chronological narrative built from past web pages preserved by Arquivo.pt.

Synthesizing 20 years of memories into a single page

The project began by inventorying a large number of current websites related to the Faculty activities. We subsequently narrowed our scope to include only the institutional websites leaving other ones for future work (e.g. projects and events). All the identified websites were targeted to be preserved by Arquivo.pt.

Fig. 2: Table with elements for collecting information about the preserved websites of a given organizational entity.

The data collection was performed manually through the Arquivo.pt search interfaces. We mainly searched for the hostname and analyzed the corresponding version history, noting its main content changes and references to external websites of events and projects. The data was collected, selected and recorded on one page per organizational unit (see Fig. 2).
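The collection described above was done by hand. As an illustration only, the same bookkeeping could be sketched in Python, assuming a list of captures has already been obtained as pairs of a 14-digit capture timestamp (YYYYMMDDhhmmss, the usual web archive convention) and a URL. The function name `version_history` and the sample timestamps are hypothetical; only the hostname cham.fcsh.unl.pt comes from this post.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def version_history(captures):
    """Group (timestamp, url) captures by hostname, then by capture year.

    `captures` is an iterable of (timestamp, url) pairs, where the
    timestamp is a 14-digit string such as "20020115093000".
    """
    history = defaultdict(lambda: defaultdict(list))
    for timestamp, url in captures:
        host = urlsplit(url).hostname
        year = int(timestamp[:4])
        history[host][year].append(timestamp)
    # Return plain dicts with years and timestamps in chronological order.
    return {
        host: {year: sorted(stamps) for year, stamps in sorted(years.items())}
        for host, years in history.items()
    }


# Made-up capture timestamps, for illustration only.
sample_captures = [
    ("20131104120000", "http://cham.fcsh.unl.pt/"),
    ("20020115093000", "http://cham.fcsh.unl.pt/"),
    ("20020620101500", "http://cham.fcsh.unl.pt/index.html"),
]

for host, years in version_history(sample_captures).items():
    for year, stamps in years.items():
        print(host, year, len(stamps), "capture(s)")
```

A table like this only shows when a hostname was captured; deciding whether the organizational entity behind it stayed the same still requires the manual review and interviews described in this post.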

Some research centers adopted multiple hostnames over time. Conversely, the institutional identity may also have changed due to organizational mergers, name changes or different institutional frameworks. For example, CHAM “Centro de Humanidades” (in 2017) had two previous names: “Centro d’Além Mar” in 2002, which changed to “Centro d’Aquém e d’Além Mar” in 2013-2014, when it merged with “Centro de História da Cultura – CHC”, “Centro de Estudos Históricos – CEH” and “Instituto Oriental – IO”. The hostname of the website, however, has never changed: cham.fcsh.unl.pt.

Sometimes it was not straightforward to conclude whether we were facing the same organizational entity after a merger, even when the website kept the same title, hostname and URL. It is hard, however, to imagine that the entity changed if everything remained the same. Therefore, our conclusions were validated through interviews with current and former staff of the Faculty and research centers. Hence the importance of institutional support and direct interaction with the entities.

Designing a time travel to the past

Fig. 3: Homepage of the online exhibition with images taken from the websites.

The objective was to create a website with a clean look and that was easy to browse. We anchored its navigation on suggestive images extracted from preserved web pages, to reinforce that it is an exhibition about online memory, rather than about current information available on the live-Web.

Thus, the homepage of the online exhibition presents a collection of preserved web images from old websites of organizational units that belonged to FCSH.

The chosen publishing platform was the free version of WordPress.com, so that anyone can create a similar project, despite a potential lack of financial resources.

By clicking on each image, the user is taken to a page that describes the online memory of each entity of the Faculty. It presents the following elements: featured image, brief synopsis, list of addresses over time and a selection of notable moments.

The description of each entity has a maximum length of 150 words and includes links to versions preserved on Arquivo.pt. This interaction between the online exhibition and the web archive aims to provide the user experience of browsing an institutional memory.

Fig. 4: Description page of an entity of the Faculty.

The exhibition is complemented with frequently asked questions and tutorials related to digital preservation.

Future work, because a website is never finished

The next step is to promote this exhibition through the institutional communication channels of the Faculty (e.g. institutional website, mailing lists).

The exhibition still has plenty of room to be complemented with additional entities that could be aggregated in collections organized by topic or scientific area.

Direct interaction with research centers is mandatory, as is the organization of training courses on web preservation and research to raise awareness of the importance of web archiving.

Conclusions

This project was developed in just 3 months, between May and July 2017. This short time span forced us to focus and set priorities on the most important issues. Had we had more time, we might still be lost choosing plug-ins; but would the extra plug-ins actually have been needed to accomplish the objectives? Users do not seem to miss them in the exhibition.

We aimed to demonstrate that anyone could develop a similar exhibition to preserve the online memory of an organization without requiring significant financial resources or technical skills.

We hope that this project will encourage librarians and archivists to create ways of preserving the online memory of their institutions.

If we did it, you can also do it.


Learn more:

About the author:

Ricardo Basílio has a Master's degree in Documentation and Information Sciences. He was a librarian at the Faculty of Social and Human Sciences of Universidade Nova de Lisboa and at the Art Library of Fundação Calouste Gulbenkian, where he worked on the digital collections about Portuguese tiles (the “DigiTile” project). His areas of interest are digital preservation, digital libraries and technologies that support information. He created and manages a website in Portuguese about digital preservation (Digital Preservation Guide).

Archives Unleashed at the British Library: Study of gender distribution in National Olympic Committees

Who we are

Sara Aubry (National Library of France), Helena Byrne (British Library), Naomi Dushay (Stanford University), Pamela Graham (Columbia University), Andy Jackson (British Library), Gillian Lee (National Library of New Zealand) and Gethin Rees (British Library).

From the 11th to 13th of June 2017 a group of seven individuals from five institutions came together to analyse a web archive collection at a datathon held at the British Library as part of the Web Archiving Week. The aim of Archives Unleashed is for programmers and researchers to come together to develop new strategies to analyse web archive collections. Our team was a mix of technical and curatorial staff, and we were working with the IIPC Content Development Group (CDG) National Olympic Committees collection.

The IIPC Content Development Group

The IIPC is a membership organization dedicated to improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. The Content Development Group is a subgroup of the IIPC and specialises in building collaborative international web archive collections.

Previous CDG collections include:

  • Summer/Winter Olympics/Paralympics (2010-2016)
  • National Olympic & Paralympic Committees (2016- )
  • European Refugee Crisis (2015-2016)
  • World War One Commemoration (2015- )
  • International Cooperative Organizations (2015- )

All these collections can be viewed from here: https://archive-it.org/home/iipc

What we tried to do

We initially started with the idea of working with the London 2012 and Rio 2016 Olympic collaborative crawl collections, but both these data sets were too large for us to work with in the short time frame we had. This is why we decided to work with the National Olympic and Paralympic Committees collection.

Our research question was “What is the gender distribution of National Olympic Committees?”.

The data we had

The 2016 National Olympic and Paralympic Committees collection is a comprehensive collection of national Olympic/Paralympic committees drawn from IOC official sites. In 2016, 191 seeds were crawled, as not all International Olympic Committee member countries have a website.

The 191 seeds translated into 152 WARC files totalling 294 GB. However, there was an issue when the files were downloaded and a number of the files were corrupted. After programmatically separating the corrupted files from the good ones, there were 76 WARC files totalling 74 GB to work with. Although this was only 50% of the collection's WARC files, it was more than enough data to work with over the two days.
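The post does not say how the corrupted files were separated from the good ones. One possible approach, sketched below in Python under the assumption that the WARC files are gzip-compressed (`*.warc.gz`), is to try decompressing each file to the end and set aside any file that fails. The function names `is_readable_gzip` and `split_warcs` are hypothetical, and this check only detects damage at the compression level, not malformed WARC records.

```python
import gzip
import zlib
from pathlib import Path


def is_readable_gzip(path, chunk_size=1024 * 1024):
    """Return True if the whole gzip stream can be decompressed."""
    try:
        with gzip.open(path, "rb") as stream:
            # Read to the end in chunks; truncated or damaged files raise.
            while stream.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        return False


def split_warcs(directory):
    """Split the *.warc.gz files in `directory` into (good, corrupted)."""
    good, corrupted = [], []
    for path in sorted(Path(directory).glob("*.warc.gz")):
        (good if is_readable_gzip(path) else corrupted).append(path)
    return good, corrupted
```

Calling `split_warcs("/path/to/warcs")` returns two lists of paths, so the unreadable files can be moved aside before any analysis is run.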

After the technical team isolated the usable WARCs and had a look at the tools available to run our analysis, it was decided to scale down our research question to “What is the gender distribution of English-speaking National Olympic Committees?”. As the tools used to run this analysis were developed in North America, there is a bias towards English-language names. The curatorial team identified all the English-speaking countries that were represented in the full collection. We used this list to filter out non-English-speaking countries from the clean WARCs so that we would have a smaller subset on which to run our analyses. The usable WARCs covered seven English-speaking countries.

The 7 English speaking countries identified in the set.

How we worked on it

Several Linux virtual machines were prepared by the organizers specifically for the hackathon so that the WARC files were easily accessible and the participants didn’t have to transfer large amounts of data and also to ensure that there was enough processing capacity. We started by installing three tools that we had identified as being useful on a designated virtual machine:
– Warcbase, an open source platform to facilitate the analysis and processing of web archives with Hadoop and Apache Spark. It provides tools to extract content, filter it down, and then analyze, aggregate and visualize it. [Note that Warcbase has now been superseded by The Archives Unleashed Toolkit.]
– Warcbase also includes a key tool for our analysis: Stanford Named Entity Recognizer (NER) for named entity recognition. It gives the ability to identify and label sequences of words in a text which are the names of things, particularly for the 3 classes person, organization and location.
– finally, OpenGenderTracking, another open source tool, which gives a framework to identify the likely gender based on a person’s first name.

Step 1 of the analysis consisted of extracting all named entities from the WARCs using Warcbase and NER with a Scala script derived from the sample scripts provided with Warcbase. The output was a list (in JSON format) of domain records, each with its associated PERSON, ORGANIZATION and LOCATION entities and their frequency of occurrence.

In step 2, with a Python script, we matched the extracted PERSON names with a framework containing a large structured list of first names built from the US census and the probability of each being a male or a female first name. The output was a result list (in CSV format) of this association.

Snippet:
20160329, http://www.paralympic.org.au, Cochrane, 3, No Match
20160329, http://www.paralympic.org.au, Sam Carter, 2, Male
20160329, http://www.paralympic.org.au, Alistair, 1, Male
20160329, http://www.paralympic.org.au, Carlee Beattie, 4, Female
20160329, http://www.paralympic.org.au, Ernest Van Dyk, 1, Male
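The matching in step 2 can be illustrated with a minimal Python sketch. It is not the script used at the hackathon (that one is in the GitHub repository linked below) and it does not use OpenGenderTracking. It assumes a hypothetical lookup table `MALE_PROBABILITY` that maps a lowercased first name to the probability that it is a male name, with placeholder values rather than real census figures, and a hypothetical `threshold` for deciding when a name is too ambiguous to label.

```python
# Hypothetical lookup: first name -> probability that the name is male.
# Placeholder values for illustration only, not real census figures.
MALE_PROBABILITY = {
    "sam": 0.9,
    "alistair": 1.0,
    "carlee": 0.0,
    "ernest": 1.0,
    "jordan": 0.5,
}


def classify(full_name, table=MALE_PROBABILITY, threshold=0.8):
    """Label an extracted PERSON entity using its first token only."""
    parts = full_name.split()
    if not parts:
        return "No Match"
    first_name = parts[0].lower()
    if first_name not in table:
        # The name does not appear in the reference list at all.
        return "No Match"
    probability_male = table[first_name]
    if probability_male >= threshold:
        return "Male"
    if probability_male <= 1 - threshold:
        return "Female"
    # The name is listed, but the reference data is not decisive.
    return "Unknown"


for name in ["Cochrane", "Sam Carter", "Carlee Beattie", "Jordan Smith"]:
    print(name, "->", classify(name))
```

Because only the first token is looked up, a surname extracted on its own (such as “Cochrane” in the snippet above) falls through to “No Match”, which is one reason the results need careful interpretation.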

The analysis was run on two subsets of the data:
– the committee pages: 16 of them contained entities (which was small and fast to process);
– the entire collection: 1,251 pages contained entities (which was bigger and took a few hours to process).

Step 3 consisted of adapting JavaScript scripts to visualize the results of the named entity recognition and the gender distribution as web graphs.

All scripts developed during the hackathon can be accessed on GitHub:

https://github.com/ukwa/archives-unleashed-olympics

Results

The gender distribution within the subset of the collection.
Gender representation by country of the 7 English-speaking countries identified in the set.
  • ‘No match’ means the name didn’t appear in the reference source for identifying names.
  • ‘Unknown’ means the reference source couldn’t identify whether the name was male or female.
Male/female representation over the complete dataset of 76 WARC files regardless of language.

The gender distribution within the subset of the collection and the overall data set showed that males are more represented than females on National Olympic Committees.
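For readers who want to reproduce this kind of tally from a step 2 result list, here is a minimal Python sketch. It assumes rows in the five-column layout of the snippet shown earlier (date, site, name, a number taken here to be the entity's frequency, and the gender label); the function name `tally_gender` is hypothetical and the sample rows are the five lines of that snippet, not the full results.

```python
import csv
from collections import Counter


def tally_gender(lines, weighted=False):
    """Count gender labels in rows of: date, site, name, frequency, label.

    With weighted=False each row counts once; with weighted=True each
    row counts as many times as its frequency column says.
    """
    counts = Counter()
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) != 5:
            continue  # skip blank or malformed rows
        _date, _site, _name, frequency, label = row
        counts[label.strip()] += int(frequency) if weighted else 1
    return counts


SNIPPET = [
    "20160329, http://www.paralympic.org.au, Cochrane, 3, No Match",
    "20160329, http://www.paralympic.org.au, Sam Carter, 2, Male",
    "20160329, http://www.paralympic.org.au, Alistair, 1, Male",
    "20160329, http://www.paralympic.org.au, Carlee Beattie, 4, Female",
    "20160329, http://www.paralympic.org.au, Ernest Van Dyk, 1, Male",
]

print(dict(tally_gender(SNIPPET)))
print(dict(tally_gender(SNIPPET, weighted=True)))
```

Whether to count each name once or weight it by how often it appears is a design choice that can change the picture considerably, so any reported distribution should state which was used.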

Alternative research question?

Each National Committee has official partners that sponsor their participation in the Olympics. When we ran the entity extraction for corporations, it raised further questions about what percentage of the site is taken up with references to commercial sponsorship. The gender and corporation names are just two of many entities that could be extracted from the data set using this methodology.

What we got out of it  

Sara Aubry, Bibliothèque nationale de France

“My participation in the hackathon was linked to the BnF's current efforts to engage researchers in using web archives as data sets. We aimed at discussing research topic ideas, learning how to use available open source tools, tackling limitations and sharing practices among participants. What I liked most was the hackathon model itself, which challenged us to work collaboratively in a very short period of time. I guess a little more time would have been useful to explore and compare the results of the analysis we ran.”

Pamela Graham, Columbia University

“I enjoyed our sub-groupings into programmers/technical experts and curators (forgive this oversimplification). As a curator, I needed a better understanding of the process of working with web archive data. Since I don’t have programming skills, this was more of a conceptual exercise than a practical one. I gained a good, first-hand sense of the issues and challenges of analyzing web data. But even more helpful was the attempt the curators made to evaluate the collection–how and why were the sites selected and what’s missing? This is really important to interpreting the results and reinforced for me the importance of curation. I greatly benefited from talking with Helena and Gillian on these issues.”

Gethin Rees, British Library

“Having recently started as a curator working with digital collections at the British Library I was keen to learn about web archives. I was also intent on improving my use of Python for data science. I loved being introduced to new technologies like Hadoop and connecting to powerful computers in North America. Next time I would try to get stuck in to processing some WARC files independently.”

Gillian Lee, National Library of New Zealand

“I wanted to see what tools were available to help people analyse data in web archives. The collaborative aspect was great. I discovered you have to refine and reduce your data set quite substantially and that the scope and provenance of the collections is really important for researchers. I don’t feel I’m any closer to actually using Warcbase myself (yet), but I had more of an understanding of the kind of research that could be done using Warcbase and associated tools. Given the time frame we were working in and the amount of corrupted data we encountered, I would say the process was more valuable than the output!”

Helena Byrne, British Library

“For me as a curator my expectations of how things work were quite different from the reality but the overall experience was still good as it gave me a better understanding of the process. It was also useful to discuss the differences I had in expectation and reality with Pamela and Gillian as we were able to come up with ways we could assist the technical team.”


New Report on Web Archiving Available

By Andrea Goethals, Harvard Library

This is an expanded version of a post to the Library of Congress’ Signal blog.

Harvard Library recently released a report that is the result of a five-month environmental scan of the landscape of web archiving, made possible by the generous support of the Arcadia Fund. The purpose of the study was to explore and document current web archiving programs to identify common practices, needs, and expectations in the collection and provision of web archives to users; the provision and maintenance of web archiving infrastructure and services; and the use of web archives by researchers. The findings will inform Harvard Library’s strategy for scaling up its web archiving activities, and are also being shared broadly to help inform research and development priorities in the global web archiving community.

The heart of the study was a series of interviews with web archiving practitioners from archives, museums and libraries worldwide; web archiving service providers; and researchers who use web archives. The interviewees were selected from the membership of several organizations, including the IIPC of course, but also the Web Archiving Roundtable at the Society of American Archivists (SAA), the Internet Archive’s Archive-It Partner Community, the Ivy Plus institutions, Working with Internet archives for REsearch (Reuters/WIRE Group), and the Research infrastructure for the Study of Archived Web materials (RESAW).

The interviews of web archiving practitioners covered a wide range of areas, everything from how the institution is maintaining their web archiving infrastructure (e.g. outsourcing, staffing, location in the organization), to how they are (or aren’t) integrating their web archives with their other collections. From this data, profiles were created for 23 institutions, and the data was aggregated and analyzed to look for common themes, challenges and opportunities.

Opportunities for Research & Development

In the end, the environmental scan revealed 22 opportunities for future research and development. These opportunities are listed below and described in more detail in the report. At a high level, these opportunities fall under four themes: (1) increase communication and collaboration, (2) focus on “smart” technical development, (3) focus on training and skills development, and (4) build local capacity.

22 Opportunities to Address Common Challenges

(the order has no significance)

  1. Dedicate full-time staff to work in web archiving so that institutions can stay abreast of the latest developments and best practices and fully engage in the web archiving community.
  2. Conduct outreach, training and professional development for existing staff, particularly those working with more traditional collections, such as print, who are being asked to collect web archives.
  3. Increase communication and collaboration across types of collectors since they might collect in different areas or for different reasons.
  4. Fund a collaboration program (bursary award, for example) to support researcher use of web archives by gathering feedback on requirements and impediments to the use of web archives.
  5. Leverage the membership overlap between RESAW and European IIPC membership to facilitate formal researcher/librarian/archivist collaboration projects.
  6. Institutional web archiving programs become transparent about holdings, indicating what material each has, terms of use, preservation commitment, plus curatorial decisions made for each capture.
  7. Develop a collection development tool (e.g. registry or directory) to expose holdings information to researchers and other collecting institutions even if the content is viewable only in on-site reading rooms.
  8. Conduct outreach and education to website developers to provide guidance on creating sites that can be more easily archived and described by web archiving practitioners.
  9. IIPC, or similar large international organization, attempts to educate and influence tech company content hosting sites (e.g. Google/YouTube) on the importance of supporting libraries and archives in their efforts to archive their content (even if the content cannot be made immediately available to researchers).
  10. Investigate Memento further, for example conduct user studies, to see if more web archiving institutions should adopt it as part of their discovery infrastructure.
  11. Fund a collection development, nomination tool that can enable rapid collection development decisions, possibly building on one or more of the current tools that are targeted for open source deployment.
  12. Gather requirements across institutions and among web researchers for next generation of tools that need to be developed.
  13. Develop specifications for a web archiving API that would allow web archiving tools and services to be used interchangeably.
  14. Train researchers with the skills they need to be able to analyze big data found in web archives.
  15. Provide tools to make researcher analysis of big data found in web archives easier, leveraging existing tools where possible.
  16. Establish a standard for describing the curatorial decisions behind collecting web archives so that there is consistent (and machine-actionable) information for researchers.
  17. Establish a feedback loop between researchers and the librarians/archivists.
  18. Explore how institutions can augment the Archive-It service and provide local support to researchers, possibly using a collaborative model.
  19. Increase interaction with users, and develop deep collaborations with computer scientists.
  20. Explore what, and how, a service might support running computing and software tools and infrastructure for institutions that lack their own onsite infrastructure to do so.
  21. Service providers develop more offerings around the available tools to lower the barrier to entry and make them accessible to those lacking programming skills and/or IT support.
  22. Work with service providers to help reduce any risks of reliance on them (e.g. support for APIs so that service providers could more easily be changed and content exported if needed).

Communication & Collaboration are Key!

One of the biggest takeaways is that the first theme, the need to radically increase communication and collaboration among all individuals and organizations involved in web archiving, was the most prevalent theme found by the scan. Thirteen of the 22 opportunities fell under this theme. Clearly, much more communication and collaboration is needed among those collecting web content, and also between those collectors and the researchers who would like to use it.

This environmental scan has given us a great deal of insight into how other institutions are approaching web archiving, which will inform our own web archiving strategy at Harvard Library in the coming years. We hope that it has also highlighted key areas for research and development that need to be addressed if we are to build efficient and sustainable web archiving programs that result in complementary and rich collections that are truly useful to researchers.

A Note about the Tools

There is a section in the report (Appendix C) that lists all the current web archiving tools that were identified during the environmental scan. The IIPC Tools and Software web page was one of the resources used to construct this list, along with what was learned through interviews, conferences and independent research. The tools are organized according to the various activities needed throughout the lifecycle of acquiring, processing, preserving and providing web archive collections. Many of the tools discovered are fairly new, especially the ones associated with the analysis of web archives. The state of the tools will continue to change rapidly, so this list will quickly become out of date unless a group like the IIPC decides to maintain it. I will be at the GA in April if any IIPC member would like to talk about maintaining this list or other parts of the report.

2016 IIPC General Assembly & Web Archiving Conference

In 2016 the IIPC is organising two back-to-back events in the spring hosted by the Landsbókasafn Íslands – Háskólabókasafn (National and University Library of Iceland) in Reykjavík, Iceland:

  • IIPC General Assembly 2016, 11-12 April – Free (open for members only)
  • IIPC Web Archiving Conference 2016, 13-15 April – Free (open to anyone)

The IIPC is seeking proposals for presentations and workshops for the 2016 IIPC Web Archiving Conference (13 – 15 April 2016). Members of the IIPC are also encouraged to submit proposals for the IIPC General Assembly (11 & 12 April 2016).

Theme guidance

Proposals may cover any aspect of web archiving. The following is a non-exhaustive list of possible topics:

Policy and Practice

  • Harvesting, preservation, and/or access
  • Collection development
  • Copyright and privacy
  • Legal and ethical concerns
  • Programmatic organization and management

Research

  • Research using web archives
  • Tools and approaches
  • Initiatives, platforms, and collaborations

Tools

  • New/updated tools for any part of the lifecycle
  • Application programming interfaces (APIs)
  • Current and future landscape

Proposal guidance

Individual presentations can be a maximum of 20 minutes. A panel session can be a maximum of 60 minutes, with two or more presentations on a topic. A discussion session should include one or more introductory statements followed by a moderated discussion. Workshops can be up to a half-day in length; please include details on the proposed structure, content, and target audience.

Abstracts should include the name of the speaker(s), a title, and a theme, and be no more than 300 words. All abstracts should be in English.

Please submit your proposals using this form. For questions, please e-mail iipc@bl.uk.

The deadline for submissions is 17 December 2015. All submissions will be reviewed by the Programme Committee and submitters will be notified by mid-January 2016.

Five Takeaways from AOIR 2015

I recently attended the annual Association of Internet Researchers (AOIR) conference in Phoenix, AZ. It was a great conference that I would highly recommend to anyone interested in learning first hand about research questions, methods, and studies broadly related to the Internet.

Researchers presented on a wide range of topics, across a wide range of media, using both qualitative and quantitative methods. You can get an idea of the range of topics by looking at the conference schedule.

I’d like to briefly share some of my key takeaways. I apologize in advance for oversimplifying what was a rich and deep array of research work; my goal here is to provide a quick summary and not an in-depth review of the conference.

  1. Digital Methods Are Where It’s At

I attended an all-day, pre-conference digital methods workshop. As a testament to the interest in this subject, the workshop was so overbooked they had to run three concurrent sessions. The workshops were organized by Axel Bruns, Jean Burgess, Tim Highfield, Ben Light, and Patrik Wikstrom (Queensland University of Technology), and Tama Leaver (Curtin University).

Researchers are recognizing that digital research skills are essential. And, if you have some basic coding knowledge, all the better.

At the digital methods workshop, we learned about the “Walkthrough” method for studying software apps, tools for “web scraping” to gather data for analysis, Tableau to conduct social media analysis, and “instagrammatics,” analyzing Instagram.

FYI: The Digital Methods Initiative from Europe has tons of great information, including an amazing list of tools.

  2. Twitter API Is Also Very Popular

There were many Twitter studies, and they all used the Twitter API to download tweets for analysis. Although researchers use the Twitter API widely, they expressed a lot of frustration over its limitations. For example, you can download only up to 1% of the total Twitter volume for free. If you’re studying something obscure, you are probably okay, but if you’re studying a topic like #jesuischarlie, you’ll have to pay to get the entire output. Many researchers don’t have the funds for that. One person pointed out that it would be ideal to have access to the Library of Congress’s Twitter archive. Yes, agreed!

  3. Social Media over Web Archives

Researchers presented conclusions and provided commentary on our social behavior through studies of social media such as Snapchat, Twitter, Facebook, and Instagram. There were only a handful of presentations using web archived materials. If a researcher used websites, they viewed them live or conducted “web scraping” with tools such as Outwit and Kimono. Many also used custom Python scripts to gather the data from the sites.
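To give a flavor of the custom Python scripts mentioned above, here is a minimal sketch of one common scraping step: pulling the links out of a page. It is not taken from any presenter's work. The names `LinkCollector` and `extract_links` and the sample page are hypothetical, invented for illustration. The sketch uses only Python's standard library and parses an inline HTML string rather than fetching a live site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect (absolute URL, link text) for every <a href>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # (absolute_url, text) tuples, in page order
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url):
    """Return the links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample page standing in for a downloaded website.
SAMPLE_PAGE = """
<html><body>
  <p>See the <a href="/schedule">conference schedule</a> and
  <a href="https://example.org/tools">a list of tools</a>.</p>
</body></html>
"""

for url, text in extract_links(SAMPLE_PAGE, "https://example.org/conference/"):
    print(url, "|", text)
```

A script like this captures only what the live site shows at the moment it runs, which is part of why snapshots taken at the time of a study matter.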

  4. Fair Use Needs a PR Movement

There’s still much misunderstanding about what researchers can and cannot do with digital materials. I attended a session where the presenter shared findings from surveys conducted with communication scholars about their knowledge of fair use. The results showed that there was (very!) limited understanding of fair use. Even worse, the findings showed that those scholars who had previously attended a fair use workshop were even less likely to understand fair use! Moreover, many admitted that they did not conduct particular studies because of a (misguided) fear of violating copyright. These findings were corroborated by the scholars from a variety of fields who were in the room.

  5. Opportunities for Collaboration

I asked many researchers if they were concerned that they were not saving a snapshot of websites or apps at the time of their studies. The answer was a resounding “yes!” They recognize that sites and tools change rapidly, but they are unaware of the tools or services they could use, or that their librarians and archivists may have solutions.

Clearly there is room for librarians/archivists to conduct more outreach to researchers to inform them about our rich web archive collections and to talk with them about preservation solutions, good data management practices and copyright.

Who knew?

Let me end by sharing one tidbit that really blew my mind. In her research on “Dead Online: Practices of Post-Mortem Digital Interaction,” Paula Kiel presented on the “digital platforms designed to enable post-mortem interactions.” Yes, she was talking about websites where you can send posthumous messages via Facebook and email! For example, https://www.safebeyond.com/, “Life continues when you pass… Ensure your presence – be there when it counts. Leave messages for your loved ones – for FREE!”

By Rosalie Lack, Product Manager, California Digital Library