Memory of the online presence of a Faculty: an exhibition

In 2017 Arquivo.pt launched Investiga XXI, a project that aims to promote the use of the Portuguese Web Archive as a research tool and resource. In this first guest blog post introducing the Portuguese initiative, RICARDO BASÍLIO presents a collaboration between Arquivo.pt and researchers from the Faculty of Social and Human Sciences of Universidade Nova de Lisboa (FCSH-UNL), which resulted in the creation of an online exhibition that illustrates a use case for the historical information preserved by Arquivo.pt. This exhibition highlights extracts of institutional online memories.


FCSH: 40 years of lifetime, 20 years online

FCSH was founded in 1977 and is part of Universidade Nova de Lisboa. Since 1997, FCSH websites have been used as communication interfaces with its community of teachers, researchers and students.

Arquivo.pt preserves web content published since 1996. Therefore, the time span of the web content preserved by Arquivo.pt covers 20 years of the institutional online memory of FCSH, which is half of the Faculty’s lifetime.

Fig. 1: The first version of the FCSH website preserved by Arquivo.pt

In the early years of the Web, the FCSH website mostly replicated printed information. However, it has gradually become a comprehensive portal to academic life at the Faculty, which also includes news, lists of researchers, research programs and access points to services.

Research centers are important entities of the Faculty’s ecosystem. In 1997 there were 30 small research centers, but in 2016 they were merged into 16 larger ones.

The research centers are autonomous, manage their own projects and organize specific events. This autonomy resulted in the creation of over 100 additional related websites serving various purposes, such as institutional communication, project descriptions and event promotions.

The online exhibition aimed to create an institutional memory through a chronological narrative built from past web pages preserved by Arquivo.pt.

Synthesizing 20 years of memories into a single page

The project began by inventorying a large number of current websites related to the Faculty’s activities. We subsequently narrowed our scope to include only the institutional websites, leaving others (e.g. projects and events) for future work. All the identified websites were targeted for preservation by Arquivo.pt.

Fig. 2: Table with elements for collecting information about the preserved websites of a given organizational entity.

The data collection was performed manually through the Arquivo.pt search interfaces. We mainly searched for the hostname and analyzed the corresponding version history, noting its main content changes and references to external websites of events and projects. The data was collected, selected and registered on one page per organizational unit (see Fig. 2).
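Although the collection was done by hand, the bookkeeping it involves (which hostname was seen in which years) can be illustrated with a short script. The following is only a sketch under stated assumptions: the capture records are hypothetical sample values, not real Arquivo.pt data, and the function name `hostname_timeline` is invented for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical capture records: (14-digit timestamp, archived URL).
# The timestamps are illustrative placeholders, not real Arquivo.pt data.
CAPTURES = [
    ("20020115120000", "http://cham.fcsh.unl.pt/"),
    ("20130610090000", "http://cham.fcsh.unl.pt/pages/index.htm"),
    ("20170301080000", "http://www.cham.fcsh.unl.pt/"),
]


def hostname_timeline(captures):
    """Return {hostname: (first_year, last_year, distinct_years)}."""
    years_by_host = defaultdict(set)
    for timestamp, url in captures:
        host = urlsplit(url).hostname or ""
        # Treat "www.example.org" and "example.org" as the same address.
        if host.startswith("www."):
            host = host[4:]
        years_by_host[host].add(int(timestamp[:4]))
    return {
        host: (min(years), max(years), len(years))
        for host, years in years_by_host.items()
    }


if __name__ == "__main__":
    for host, (first, last, count) in hostname_timeline(CAPTURES).items():
        print(f"{host}: {first}-{last} ({count} distinct years)")
```

A table like this only records where and when an address was captured; deciding whether the entity behind it stayed the same still requires the human checks described in this post.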

Some research centers adopted multiple hostnames over time. On the other hand, the institutional identity may also have changed due to organizational mergers, name changes or different institutional frameworks. For example, CHAM “Centro de Humanidades” (in 2017) had two previous names: “Centro d’Além Mar” in 2002, which then changed to “Centro d’Aquém e d’Além Mar” in 2013-2014, when it merged with “Centro de História da Cultura – CHC”, “Centro de Estudos Históricos – CEH” and “Instituto Oriental – IO”. However, the hostname of the website has never changed: cham.fcsh.unl.pt.

Sometimes it was not straightforward to determine whether we were facing the same organizational entity after a merge, even when the website kept the same title, hostname and URL. It’s hard, however, to imagine that the entity changed if everything remained the same. Therefore, our conclusions were validated through interviews with current and previous staff of the Faculty and research centers. Hence the importance of institutional support and direct interaction with the entities.

Designing a time travel to the past

Fig. 3: Homepage of the online exhibition with images taken from the websites.

The objective was to create a website with a clean look that was easy to browse. We anchored its navigation on suggestive images extracted from preserved web pages, to reinforce that it is an exhibition about online memory, rather than about current information available on the live Web.

Thus, the homepage of the online exhibition presents a collection of preserved web images from old websites of organizational units that belonged to FCSH.

The chosen publishing platform was the free version of WordPress.com, so that anyone can create a similar project, despite a potential lack of financial resources.

By clicking on each image, the user is taken to a page that describes the online memory of the corresponding entity of the Faculty. It presents the following elements: featured image, brief synopsis, list of addresses over time and selection of mesmerizing moments.

The description of each entity has a maximum length of 150 words and includes links to versions preserved on Arquivo.pt. This interaction between the online exhibition and the web archive aims to give the user the experience of browsing an institutional memory.

Fig. 4: Description page of an entity of the Faculty.

The exhibition is complemented with frequently asked questions and tutorials related to digital preservation.

Future work, because a website is never finished

The next step is to promote this exhibition through the institutional communication channels of the Faculty (e.g. institutional website, mailing lists).

The exhibition still has plenty of room to be complemented with additional entities that could be aggregated in collections organized by topic or scientific area.

Direct interaction with research centers is mandatory, as is the organization of training courses on web preservation and research to raise awareness of the importance of web archiving.

Conclusions

This project was developed in just 3 months, between May and July 2017. This short time span forced us to focus and set priorities on the most important issues. If we had had more time, we would probably still be lost choosing plug-ins. But would the extra plug-ins actually have been needed to accomplish the objectives? The users don’t seem to miss them in the exhibition.

We aimed to demonstrate that anyone could develop a similar exhibition to preserve the online memory of an organization without requiring significant financial resources or technical skills.

We hope that this project will encourage librarians and archivists to create ways of preserving the online memory of their institutions.

If we did it, you can also do it.


About the author:

Ricardo Basílio has a Master’s degree in Documentation and Information Sciences. He was a librarian at the Faculty of Social and Human Sciences of Universidade Nova de Lisboa and at the Art Library of Fundação Calouste Gulbenkian, where he worked on the digital collections about Portuguese tiles (the “DigiTile” project). His areas of interest are digital preservation, digital libraries and technologies that support information. He created and manages a website in Portuguese about digital preservation (Digital Preservation Guide).

IIPC Going for Gold – Get involved in #WAOlympics2018

By Helena Byrne, Curator of Web Archives, The British Library

The IIPC Content Development Group (CDG) has been busy archiving the events of the 2018 Winter Olympics in Pyeongchang, South Korea since the start of February 2018. The IIPC CDG has been building web archive collections on the Olympic and the Paralympic Games since 2010. The IIPC has members in more than 30 countries but there are over 100 countries competing in the Games and we need your help to ensure that these countries are represented in the collection.

So far there have been over 1,360 nominations from at least 28 countries around the world. As you can see from the map of the world, there is a high concentration from Europe as many IIPC members are based there. However, as you zoom in on the map of European nominations, there are still many gaps.

This is your chance to get involved in the collection phase by nominating online content that you are reading or using for research, or content from a country whose language you simply happen to know. We are trying to get as many pins on the map from around the world as possible. However, some of the pins already there may represent just one website nomination so far. Even if you see a pin on your country or another country where you speak the language, we still want your nominations.

As a reminder, this is what we want to collect:
Public platforms in various formats such as:

  • Websites
  • Subsections of websites with an Olympic tag
  • Individual Articles
  • News Reports
  • Blogs and Social Media

The subjects covered on these sites can include but are not limited to:

  • Athletes/Teams
  • Computer Games (eGames)
  • Doping/Cheating and Corruption
  • Environmental Issues
  • Fandom
  • Gender Issues (e.g. media coverage, testosterone levels etc.)
  • General News/Commentary
  • Olympic/Paralympic Venues
  • Security
  • Sports Events
  • US/North Korean Relations
  • Other

How to get involved:
Once you have selected the web pages you would like to see in the collection, it takes less than 5 minutes to fill in the submission form.

https://goo.gl/forms/UwxiBg5klE6I7Z7g1

The call for nominations will close on the 20th of March 2018.

For more information and updates you can contact the IIPC CDG team via email (2018-winter-olympics [at] iipc.simplelists .com) or follow the collection hashtag #WAOlympics2018

 

Archives Unleashed at the British Library: Study of gender distribution in National Olympic Committees

Who we are

Sara Aubry (National Library of France), Helena Byrne (British Library), Naomi Dushay (Stanford University), Pamela Graham (Columbia University), Andy Jackson (British Library), Gillian Lee (National Library of New Zealand) and Gethin Rees (British Library).

From the 11th to 13th of June 2017 a group of seven individuals from five institutions came together to analyse a web archive collection at a datathon held at the British Library as part of the Web Archiving Week. The aim of Archives Unleashed is for programmers and researchers to come together to develop new strategies to analyse web archive collections. Our team was a mix of technical and curatorial staff, and we were working with the IIPC Content Development Group (CDG) National Olympic Committees collection.

The IIPC Content Development Group

The IIPC is a membership organization dedicated to improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. The Content Development Group is a subgroup of the IIPC and specialises in building collaborative international web archive collections.

Previous CDG collections include:

  • Summer/Winter Olympics/Paralympics (2010-2016)
  • National Olympic & Paralympic Committees (2016- )
  • European Refugee Crisis (2015-2016)
  • World War One Commemoration (2015- )
  • International Cooperative Organizations (2015- )

All these collections can be viewed from here: https://archive-it.org/home/iipc

What we tried to do

We initially started with the idea of working with the London 2012 and Rio 2016 Olympic collaborative crawl collections, but both of these data sets were too large for us to work with in the short time frame we had. This is why we decided to work with the National Olympic and Paralympic Committees collection.

Our research question was “What is the gender distribution of National Olympic Committees?”.

The data we had

The 2016 National Olympic and Paralympic Committees collection is a comprehensive collection of national Olympic/Paralympic committees drawn from IOC official sites. In 2016, 191 seeds were crawled, as not all International Olympic Committee member countries have a website.

The 191 seeds translated into 152 WARC files totalling 294 GB. However, there was an issue when the files were downloaded and a number of the files were corrupted. After programmatically separating the corrupted files from the good ones, there were 76 WARC files totalling 74 GB to work with. Although this was only 50% of the files in the collection, it was more than enough data to work with over the two days.
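The separation of corrupted files from good ones could be approached along the following lines. This is a minimal sketch and not the script the team used: it assumes the WARCs are gzip-compressed files named `*.warc.gz`, and it treats a file as corrupted if it cannot be decompressed through to the end. The function names are hypothetical.

```python
import gzip
import zlib
from pathlib import Path


def is_readable_gzip(path, chunk_size=1024 * 1024):
    """Return True if the whole gzip file decompresses without error."""
    try:
        with gzip.open(path, "rb") as stream:
            # Read to the end in chunks; the content itself is discarded.
            while stream.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers bad gzip headers, EOFError covers truncated
        # downloads, zlib.error covers damaged compressed data.
        return False
    return True


def split_warcs(directory):
    """Sort the *.warc.gz files in a directory into (good, corrupted)."""
    good, corrupted = [], []
    for path in sorted(Path(directory).glob("*.warc.gz")):
        (good if is_readable_gzip(path) else corrupted).append(path)
    return good, corrupted
```

A check like this only confirms that the compression layer is intact. It does not validate the WARC records inside, so a dedicated WARC tool would still be needed for a stricter test.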

After the technical team isolated the usable WARCs and had a look at the tools available to run our analysis, it was decided to scale down our research question to “What is the gender distribution of English speaking National Olympic Committees?”. As the tools used to run this analysis were developed in North America, there is a bias towards English language names. The curatorial team identified all the English speaking countries that were represented in the full collection. We used this list to filter out non-English speaking countries from the clean WARCs so that we would have a smaller subset on which to run our analyses. The usable WARCs covered seven English speaking countries.

The 7 English speaking countries identified in the set.

How we worked on it

Several Linux virtual machines were prepared by the organizers specifically for the hackathon, so that the WARC files were easily accessible, the participants didn’t have to transfer large amounts of data, and there was enough processing capacity. We started by installing three tools that we had identified as being useful on a designated virtual machine:
– Warcbase, an open source platform to facilitate the analysis and processing of web archives with Hadoop and Apache Spark. It provides tools to extract content, filter it down, and then analyze, aggregate and visualize it. [Note that Warcbase has now been superseded by The Archives Unleashed Toolkit.]
– Stanford Named Entity Recognizer (NER), a key tool for our analysis that Warcbase also includes. It gives the ability to identify and label sequences of words in a text which are the names of things, particularly for the 3 classes person, organization and location.
– finally, OpenGenderTracking, another open source tool, which gives a framework to identify the likely gender based on a person’s first name.

Step 1 of the analysis consisted of extracting all named entities from the WARCs using Warcbase and NER with a Scala script derived from sample scripts provided with Warcbase. The output was a list (in JSON format) of domain records, each with its associated PERSON, ORGANIZATION and LOCATION entities and their frequency of occurrence.

In step 2, with a Python script, we matched the extracted PERSON names against a framework containing a large structured list of first names built from the US census, together with the probability of each being a male or a female first name. The output was a list (in CSV format) of these associations.

Snippet:
20160329, http://www.paralympic.org.au, Cochrane, 3, No Match
20160329, http://www.paralympic.org.au, Sam Carter, 2, Male
20160329, http://www.paralympic.org.au, Alistair, 1, Male
20160329, http://www.paralympic.org.au, Carlee Beattie, 4, Female
20160329, http://www.paralympic.org.au, Ernest Van Dyk, 1, Male
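To make the matching step more concrete, here is a minimal sketch of how a first-name lookup can produce rows like those in the snippet. It is not the hackathon script and does not use the OpenGenderTracking code: the lookup table, its probability values, the `margin` threshold and the function names are all hypothetical and chosen purely for illustration.

```python
import csv
import io

# Hypothetical lookup: first name -> estimated probability that it is a male
# first name. The values are illustrative placeholders, not census figures.
MALE_PROBABILITY = {
    "sam": 0.9,
    "alistair": 0.99,
    "carlee": 0.01,
    "ernest": 0.99,
    "jordan": 0.5,
}


def guess_gender(full_name, table=MALE_PROBABILITY, margin=0.2):
    """Label a name as Male, Female, Unknown or No Match."""
    parts = full_name.split()
    if not parts:
        return "No Match"
    first_name = parts[0].lower()
    if first_name not in table:
        return "No Match"  # the name is absent from the reference list
    probability = table[first_name]
    if probability >= 0.5 + margin:
        return "Male"
    if probability <= 0.5 - margin:
        return "Female"
    return "Unknown"  # listed, but too ambiguous to decide


def to_csv(records):
    """Render (date, site, name, count) records as CSV text with a label."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for date, site, name, count in records:
        writer.writerow([date, site, name, count, guess_gender(name)])
    return buffer.getvalue()
```

Because only the first word of each extracted entity is looked up, a surname extracted on its own (such as "Cochrane" in the snippet) falls into "No Match" unless it also happens to be a listed first name.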

The analysis was run on two sub-datasets:
– the committee pages: 16 of them contained entities (which was small and fast to process);
– the entire collection: 1,251 pages contained entities (which was bigger and took a few hours to process).

Step 3 consisted of adapting JavaScript scripts to visualize the results of the named entity recognition and the gender distribution as web graphs.

All scripts developed during the hackathon can be accessed on GitHub:

https://github.com/ukwa/archives-unleashed-olympics

Results

The gender distribution within the subset of the collection.
Gender representation by country of the 7 English speaking countries identified in the set.
  • ‘No match’ means the name didn’t appear in the reference source for identifying names.
  • ‘Unknown’ means the reference source couldn’t identify whether the name was male or female.
Male/female representation over the complete dataset of 76 WARC files regardless of language.

The gender distribution within the subset of the collection and the overall data set showed that males are more represented than females on National Olympic Committees.
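As an illustration of how a distribution like this can be derived from the step 2 output, the sketch below tallies the five rows shown in the snippet earlier in this post. It assumes that the fourth column is the frequency of occurrence of the name and the fifth is the assigned label. The function name `tally` is hypothetical, and five rows are of course far too few to say anything about the real results.

```python
import csv
from collections import Counter

# The five sample rows shown in the snippet earlier in this post.
SNIPPET = """\
20160329, http://www.paralympic.org.au, Cochrane, 3, No Match
20160329, http://www.paralympic.org.au, Sam Carter, 2, Male
20160329, http://www.paralympic.org.au, Alistair, 1, Male
20160329, http://www.paralympic.org.au, Carlee Beattie, 4, Female
20160329, http://www.paralympic.org.au, Ernest Van Dyk, 1, Male
"""


def tally(csv_text):
    """Count labels per distinct name and per mention (weighted by count)."""
    by_name = Counter()
    by_mention = Counter()
    reader = csv.reader(csv_text.splitlines(), skipinitialspace=True)
    for row in reader:
        if len(row) != 5:
            continue  # skip blank or malformed lines
        _date, _site, _name, count, label = row
        by_name[label] += 1
        by_mention[label] += int(count)
    return by_name, by_mention


if __name__ == "__main__":
    names, mentions = tally(SNIPPET)
    print("per name:   ", dict(names))
    print("per mention:", dict(mentions))
```

The two tallies answer different questions: counting names describes who appears on a site, while counting mentions describes how much attention each group receives.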

Alternative research question?

Each National Committee has official partners that sponsor its participation in the Olympics. When we ran the entity extraction for corporations, it raised further questions about what percentage of each site is taken up with references to commercial sponsorship. Gender and corporation names are just two of the many kinds of entities that could be extracted from the data set using this methodology.

What we got out of it  

Sara Aubry, Bibliothèque nationale de France

“My participation in the hackathon was linked to BnF’s current efforts to engage researchers in using web archives as data sets. We aimed at discussing research topic ideas, learning how to use available open source tools, tackling limitations and sharing practices among participants. What I liked most was the hackathon model itself, which challenged us into collaborative work in a very short period of time. I guess a little more time would have been useful to explore and compare the results of the analysis we ran.”

Pamela Graham, Columbia University

“I enjoyed our sub-groupings into programmers/technical experts and curators (forgive this oversimplification). As a curator, I needed a better understanding of the process of working with web archive data. Since I don’t have programming skills, this was more of a conceptual exercise than a practical one. I gained a good, first-hand sense of the issues and challenges of analyzing web data. But even more helpful was the attempt the curators made to evaluate the collection–how and why were the sites selected and what’s missing? This is really important to interpreting the results and reinforced for me the importance of curation. I greatly benefited from talking with Helena and Gillian on these issues.”

Gethin Rees, British Library

“Having recently started as a curator working with digital collections at the British Library I was keen to learn about web archives. I was also intent on improving my use of Python for data science. I loved being introduced to new technologies like Hadoop and connecting to powerful computers in North America. Next time I would try to get stuck in to processing some WARC files independently.”

Gillian Lee, National Library of New Zealand

“I wanted to see what tools were available to help people analyse data in web archives. The collaborative aspect was great. I discovered you have to refine and reduce your data set quite substantially and that the scope and provenance of the collections is really important for researchers. I don’t feel I’m any closer to actually using Warcbase myself (yet), but I had more of an understanding of the kind of research that could be done using Warcbase and associated tools. Given the time frame we were working in and the amount of corrupted data we encountered, I would say the process was more valuable than the output!”

Helena Byrne, British Library

“For me as a curator my expectations of how things work were quite different from the reality but the overall experience was still good as it gave me a better understanding of the process. It was also useful to discuss the differences I had in expectation and reality with Pamela and Gillian as we were able to come up with ways we could assist the technical team.”

2018 Winter Olympics Collection Building – Get Involved!

By Helena Byrne, Curator of Web Archives, The British Library

The International Internet Preservation Consortium Content Development Group (IIPC CDG) would like your help to archive websites from around the world related to the 2018 Winter Olympic and Paralympic Games.

The IIPC has members in 33 countries but there are over 100 countries competing in the Games and we need your help to ensure that these countries are represented in the collection. The IIPC CDG has been building web archive collections on the Olympic and the Paralympic Games since 2010. The 2016 Summer Games was the first time they actively collected content related to activities both on and off the playing field.* The final 2018 Winter Games collection will be published here: https://archive-it.org/home/IIPC

What we want to collect:

Public platforms in various formats such as:

  • Websites
  • Subsections of websites with an Olympic tag
  • Individual Articles
  • News Reports
  • Blogs and Social Media

The subjects covered on these sites can include but are not limited to:

  • Athletes/Teams
  • Computer Games (eGames)
  • Doping/Cheating and Corruption
  • Environmental Issues
  • Fandom
  • Gender Issues (e.g. media coverage, testosterone levels etc.)
  • General News/Commentary
  • Olympic/Paralympic Venues
  • Security
  • Sports Events
  • US/North Korean Relations
  • Other

How to get involved:

Once you have selected the web pages you would like to see in the collection, it takes less than 5 minutes to fill in the submission form.

https://goo.gl/forms/UwxiBg5klE6I7Z7g1

For more information and updates you can contact the IIPC CDG team via email (2018-winter-olympics [at] iipc.simplelists .com) or follow the collection hashtag #WAOlympics2018


* 2016 Olympics collection round-up

IIPC Training Survey

Recognizing the global need for practical training in web archiving, the IIPC chartered a new working group dedicated to training on October 31, 2017. While vital to preserving our common cultural heritage, web archiving remains a niche area, requiring specialized skills and knowledge to practice effectively.

The goal of this working group is to produce a quality curriculum, deliverable online or in person, to train the current and next generation of web archiving practitioners. By giving people the hands-on learning they need to preserve the Web, we will empower IIPC members and the field at large to capture more and better archives, and help elevate web archiving worldwide.

One of the first actions for our Training Working Group is to survey the web archiving world to assess the current level of training needs. How do web archivists currently get training, and how good is it? What gaps are there, and where should we prioritize our efforts?

We invite every web archiving stakeholder to reply to the survey; it will be available from now through the end of January 2018.

The results will help us identify learning modules and needed materials. And if you are interested in helping in this endeavor, including joining the Training Working Group, you can read more about us here: http://netpreserve.org/about-us/working-groups/training-working-group/

Tom Cramer
Stanford University

Chair, IIPC Training Working Group

Call for nominations: the IIPC Steering Committee Election 2018

The nomination process for IIPC Steering Committee is now open.

The Steering Committee is the executive body of the IIPC. It currently comprises 15 member organisations, which take a leadership role in the high-level strategic planning, development and management of programs, policy creation, overall administration, and contribution to IIPC Portfolios and other activities.

In the last election three new members joined the Steering Committee: the National Library of Finland, Bibliotheca Alexandrina and Columbia University Libraries. Library and Archives Canada and the National Library of Spain were re-elected for another term.

Please note that the nomination should be on behalf of an organisation, not an individual. Once elected, the member organisation designates a representative to serve on the Steering Committee. The list of current SC member organisations and their representatives is available on the IIPC website.

What is at stake?

Serving on the Steering Committee is an opportunity for motivated members to help guide the IIPC’s mission of improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. Steering Committee members are expected to take an active role in leadership, contribute to SC and Portfolio activities, and help guide and administer the organisation.

Who can run for election?

Serving on the Steering Committee is open to any current IIPC member and we strongly encourage any organisation interested in serving on the Steering Committee to nominate itself for election. SC members are elected for 3 years and meet twice a year in person (once during the General Assembly and once in September), plus two or more additional times by teleconference.

How to run for election?

All nominee institutions, both new and existing members whose term is expiring but are interested in continuing to serve, are asked to write a short statement (max 200 words) outlining their vision for how they would contribute to IIPC via serving on the Steering Committee. Statements can point to past contributions to the IIPC or the SC, relevant experience or expertise, new ideas for advancing the organisation, or any other relevant information.

All statements will be posted online and emailed to members prior to the election with ample time for review by all membership. The results will be announced in mid-May and the three-year term on the Steering Committee will start on 1 June.

Below you will find the election calendar. We are very much looking forward to receiving your nominations. If you have any questions, please contact the IIPC PCO.


Election Calendar

  • 1 December to 25 March: Members are invited to nominate themselves by sending an email including a statement to the IIPC Programme and Communications Officer.
  • 4 April: Nominees’ statements are published on the Netpreserve Blog and Members mailing list. Nominees are encouraged to campaign through their own networks.
  • 11 April to 11 May: Members are invited to vote online. An online voting tool will be used to conduct the vote. The PCO will monitor the vote, ensuring that each organisation votes only once for all nominated seats and that the vote is cast by the organisation’s official representative. People will be encouraged to cast their vote before, during, and after the GA.
  • 11 May: Voting ends.
  • 15 May: The results of the vote are announced officially on the Netpreserve blog and Members mailing list.
  • 1 June: End/start of SC members’ terms. The newly elected SC members start their term on the 1st of June and are invited to attend a first meeting (by teleconference) by the end of June. The next face-to-face SC meeting will take place in Wellington in November 2018.

 

The Council on Library and Information Resources and Digital Library Federation to be Fiscal Host for IIPC

The Council on Library and Information Resources (CLIR) and Digital Library Federation (DLF) have agreed to serve as the new fiscal host for the International Internet Preservation Consortium (IIPC), a global organization that coordinates efforts across libraries and other institutions to preserve internet content for the future.

Chartered in 2003 with 12 founding members, IIPC currently has 54 members on six continents. A 15-member steering committee serves as an executive board and defines and oversees organizational strategy. The organization’s work focuses on enabling internet content from around the world to be archived, secured, and accessed over time; fostering the development and use of common tools, techniques, and standards that enable the creation of international archives; and encouraging and supporting cultural heritage institutions to address internet archiving and preservation.

As financial sponsor, CLIR will manage banking for the organization and will be responsible for processing payments, accounting, and financial reporting. IIPC will remain independent and be responsible for managing membership and member communications, setting strategic directions, and organizing activities and events. DLF will offer relevant programmatic and community connections through its working groups, conferences, and collaborating groups, including the National Digital Stewardship Alliance (NDSA), which also finds a home at CLIR/DLF.

“We are very excited to work with CLIR/DLF on bringing fiscal stability to the IIPC and to partner with a mission-aligned organization also dedicated to advancing preservation and access for our shared cultural heritage. Their support of the library and archives community and work to advance the field is invaluable and we know this is just the beginning of a fantastic collaboration,” said IIPC Vice-Chair Jefferson Bailey of the Internet Archive, who, with Steering Committee member Abbie Grotke of the Library of Congress, worked on the arrangements for IIPC.

DLF Director Bethany Nowviskie added, “We’re thrilled to offer support for an international organization so closely aligned with the digital preservation interests and goals of the CLIR/DLF community. IIPC is unique in its global reach and the level of expertise its members bring to the challenge of web archiving.”

The Council on Library and Information Resources (CLIR) is an independent, nonprofit organization that forges strategies to enhance research, teaching, and learning environments in collaboration with libraries, cultural institutions, and communities of higher learning.

The Digital Library Federation, founded in 1995, is a robust and diverse community of practitioners who advance research, learning, social justice, and the public good through the creative design and wise application of digital library technologies. DLF connects its parent organization, CLIR, to an active network of 162 member institutions, including colleges, universities, public libraries, museums, labs, agencies, and consortia.