Web Archiving Rio 2016: The Story So Far

By Helena Byrne, Assistant Web Archivist, The British Library

The IIPC Content Development Group (CDG) has been busy archiving the trials and tribulations of the Rio 2016 Summer Olympic and Paralympic Games. The Olympics might be over, but in just a few days the Paralympics will begin and fans will be glued to their screens again.

This project is collecting public platforms such as websites, articles, news reports, blogs and social media about Rio 2016. You can follow updates on the project on Twitter using the collection hashtag #Rio2016WA. The CDG has become more active on Twitter and on 10th August 2016 hosted a Twitter chat to give an insight into what’s involved in web archiving the Olympics. The chat was based on set questions published in an IIPC blog post, with a Q&A session and some time for live nominations. It was an international chat; even though it was small, it helped us make connections with a wider audience. The chat was added to Storify as well as to the final archived collection of the Games.

So far the Rio 2016 Collection has over 4,000 nominations from IIPC members and the general public, covering seventy-six countries across the world. However, as you can see from the Google Map, there are still many countries that have not been covered. Can you help fill the gaps?

The majority of the public nominations cover Ireland, the Pacific Islands and South Korea, and are in a range of languages including English, Korean, Dutch, Georgian and French, to name but a few. Some countries on the map have only one site nominated while others have many, and even if you see nominations from your country, the web pages you are looking at might not be in the collection. There is still time for you to get involved in web archiving the Olympics and Paralympics: the public nomination form will be open until 21st September 2016. If you would like to make a nomination, you can follow these guidelines. This is your chance to be part of the Games!

The Case for the Preservation of R&D Project Websites

By Daniel Gomes, Arquivo.pt – the Portuguese Web Archive

Although most current Research & Development (R&D) projects rely on their sites to publish valuable information about their activities and achievements, these sites and the information they provide typically disappear a few years after the end of the projects. Web archiving is a solution to this problem.

Why preserve websites of Research & Development projects?

During the FP7 work programme, the European Union invested billions of euros in R&D projects. Scientific outputs from this significant investment were disseminated online through R&D project sites. Moreover, part of the funding was invested in the development of the project sites themselves.


Sites of R&D projects must be preserved because they:

  • publish valuable scientific outputs;
  • are highly transient: they typically vanish shortly after project funding ends;
  • constitute a trans-national, multi-lingual and cross-field set of historical web data for researchers (e.g. social scientists);
  • are not being officially preserved by any institution.

The archivist’s dilemma has always been what to preserve for the future. R&D project sites are definitely worth preserving.

Open Data becomes obsolete due to project site ephemerality

There has been a growing effort by the European Union, and governments in general, to improve transparency by providing Open Data about their activities and the outputs of the funding they grant.

The European Union Open Data Portal is an example of this effort. It conveys information about European Union funded projects such as the project name, start and end dates, subject, budget and project URL. Almost all of this information remains persistent and usable after the project or funding instrument ends. The exception is the project URL.


A pilot experiment was performed on 27 November 2015 to test the project URLs of 100 random projects funded by the FP6 and FP7 work programmes. We automatically checked whether each project URL still referenced any content (i.e. returned an HTTP 200 OK response).
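
A minimal sketch of this kind of automated check, assuming a plain HTTP GET per URL rather than the script actually used in the experiment, could look like the following:

```python
# Hypothetical sketch of the link check described above, not the original
# Arquivo.pt script: request each project URL and record whether it still
# returns an HTTP 200 OK response.
import requests

project_urls = [
    "http://www.example-fp7-project.eu/",  # placeholder project URLs
    "http://www.example-fp6-project.org/",
]

def is_alive(url, timeout=30):
    """Return True if the URL still responds with 200 OK."""
    try:
        response = requests.get(url, timeout=timeout, allow_redirects=True)
        return response.status_code == 200
    except requests.RequestException:
        return False

results = {url: is_alive(url) for url in project_urls}
alive = sum(results.values())
print(f"{alive}/{len(results)} project URLs still return content")
```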

The results revealed that 19% of the sites of R&D projects financed by FP7 (2007-2013) were already unavailable. This percentage of data loss increased to 30% for the older project sites financed by FP6 (2002-2006).

Moreover, we observed that some of these URLs were referencing content that was no longer related to the R&D project, which suggests that the percentage of valid project URLs is in fact lower than the measured percentages. Attaining more accurate figures would require human validation. This is an interesting issue that could be further investigated by researchers.

Web archiving provides a solution

The constant deactivation of sites that publish and disseminate the scientific outputs originating from R&D projects causes a permanent loss of valuable information to human knowledge, from both a societal and a scientific perspective. As project sites inevitably close, the online information referenced in databases such as CORDIS, the EU research projects database, suffers irrecoverable degradation.


The good news is that web archives from IIPC members may have preserved sites from past R&D projects that became unavailable, and a trans-national research infrastructure such as RESAW could make them more widely accessible.

Funding management databases could be enhanced to also reference the preserved versions of project websites that have since disappeared from the live web. Project officers or reviewers could complement their analysis by retrieving missing online content about the funded projects, and researchers in general could mitigate the serious cross-field problem caused by scientific publications citing crucial online resources that became unavailable.

You can help to preserve R&D project sites right now!

About a year ago, Arquivo.pt performed an experiment to preserve the .EU domain.

We are now trying to focus on preserving the sites of R&D projects. To this end, our first idea was to use the EU Open Data Portal to identify these project URLs. The problem is that of the 25,608 R&D projects funded by FP7 and listed by the EU Open Data Portal, only 7.9% had an associated project URL.
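
For illustration, that share could be derived from a CSV export of the portal’s project list with a few lines of code; note that the file name and the projectUrl column below are assumptions for the sketch, not the portal’s actual schema:

```python
# Rough sketch: count how many listed projects have a non-empty project URL.
# "fp7_projects.csv" and the "projectUrl" column are hypothetical.
import csv

with open("fp7_projects.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

with_url = [r for r in rows if (r.get("projectUrl") or "").strip()]
share = 100 * len(with_url) / len(rows)
print(f"{len(with_url)} of {len(rows)} projects ({share:.1f}%) list a project URL")
```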

So, the first main challenge is to identify R&D project websites to be preserved.

And you can help!

You just need to add project sites to this document:

Collaborative list of Research and Development project websites

The resulting list will be used to:

  • make an experimental crawl of these sites that will be made publicly available;
  • research techniques to automatically identify R&D project URLs that will be published;
  • help other institutions interested in preserving these sites.

Who’s in?

Memento: Help Us Route URI Lookups to the Right Archives

More Memento-enabled web archives are coming online every day, enabling aggregating services such as Time Travel and oldweb.today. However, as the number of web archives grows, we must be able to better route URI lookups to the archives that are likely to hold the requested URIs. We need assistance from IIPC members to help us better model both what archives contain and what people are looking for.

In our TPDL 2015 paper we found that less than 5% of the queried URIs had mementos in any individual archive other than the Internet Archive. We created four different sample sets of one million URIs each and compared them against three different archives. The table below shows the percentage of the sample URIs found in each archive.

Sample (1M URIs each)    In Archive-It   In UKWA   In Stanford   Union of {AIT, UK, SU}
DMOZ                     4.097%          3.594%    0.034%        7.575%
Memento Proxy Logs       4.182%          0.408%    0.046%        4.527%
IA Wayback Logs          3.716%          0.519%    0.039%        4.165%
UKWA Wayback Logs        0.108%          0.034%    0.002%        0.134%

However, these small archives, when aggregated together, prove to be much more useful and complete than they are individually. We found that the intersection between these archives is small, so their union is large (see the last column in the table above). The figure below shows the overlap among the three archives for the one-million-URI sample from DMOZ.

[Figure: Venn diagram of the overlap among the Stanford, UKWA, and Archive-It holdings for the DMOZ sample]
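
To make the coverage and union figures concrete, here is a minimal sketch of the calculation, using tiny stand-in data rather than the real one-million-URI samples:

```python
# Per-archive and combined (union) coverage of a URI sample.
# The sample and the per-archive holdings below are small stand-ins.
sample = {"http://a.example/", "http://b.example/", "http://c.example/"}

holdings = {  # URIs from the sample that each archive was found to hold
    "Archive-It": {"http://a.example/"},
    "UKWA": {"http://b.example/"},
    "Stanford": set(),
}

for name, found in holdings.items():
    print(f"{name}: {100 * len(found & sample) / len(sample):.3f}%")

union = set().union(*holdings.values())
print(f"Union: {100 * len(union & sample) / len(sample):.3f}%")
```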

We are working on an IIPC-funded Archive Profiling project in which we are trying to create a high-level summary of the holdings of each archive. Among many other use cases, this will help us route Memento Aggregator queries only to those archives that are likely to return good results for a given URI.
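
As an illustration of the routing idea only (not the project’s actual profile format), the hypothetical sketch below summarises each archive’s holdings as a set of top-level domains and sends a lookup only to the archives whose profile matches the requested URI:

```python
# Hypothetical profile-based routing: real archive profiles are far richer
# than a set of TLDs, but the routing decision works the same way.
from urllib.parse import urlparse

profiles = {
    "UKWA": {"uk"},
    "Archive-It": {"com", "org", "edu", "uk"},
    "Stanford": {"edu"},
}

def candidate_archives(uri):
    """Return the archives likely to hold the URI according to the profiles."""
    tld = urlparse(uri).hostname.rsplit(".", 1)[-1]
    return [name for name, tlds in profiles.items() if tld in tlds]

print(candidate_archives("http://www.bl.uk/"))        # ['UKWA', 'Archive-It']
print(candidate_archives("http://example.edu/page"))  # ['Archive-It', 'Stanford']
```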

We learned during the recent surge of interest in oldweb.today (which uses MemGator to aggregate mementos from various archives) that some upstream archives had issues handling the sudden increase in traffic and had to be removed from the list of aggregated archives. Another issue when aggregating a large number of archives is that aggregators follow the “buffalo theory”: the slowest upstream archive determines the round-trip time of the aggregator, and a single malfunctioning (or down) upstream archive may delay every aggregator response for the full timeout period. There are ways to solve the latter issue, such as detecting continuously failing archives at runtime and temporarily disabling them from being aggregated. However, building archive profiles and predicting the probability of finding any mementos in each archive to route the requests solves both problems. Individual archives only get requests when they are likely to return good results, so routing saves their network and computing resources. Additionally, aggregators benefit from improved response times, because only a small subset of all the known archives is queried for any given URI.
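
On the aggregation side, a simplified sketch of querying upstream TimeMap endpoints concurrently with a per-archive timeout (the endpoints below are placeholders, and this is not MemGator’s actual code) might look like this:

```python
# Query each upstream archive concurrently; a slow or failing archive only
# costs its own timeout instead of delaying the whole aggregated response.
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

timemap_endpoints = {  # hypothetical TimeMap base URLs
    "archive-a": "https://archive-a.example/timemap/link/",
    "archive-b": "https://archive-b.example/timemap/link/",
}

def fetch_timemap(name, base, uri, timeout=5):
    try:
        r = requests.get(base + uri, timeout=timeout)
        r.raise_for_status()
        return name, r.text
    except requests.RequestException:
        # Treat as a miss; a real aggregator might also temporarily disable
        # archives that keep failing.
        return name, None

def aggregate(uri):
    with ThreadPoolExecutor(max_workers=len(timemap_endpoints)) as pool:
        futures = [pool.submit(fetch_timemap, n, b, uri)
                   for n, b in timemap_endpoints.items()]
        results = (f.result() for f in as_completed(futures))
        return {name: timemap for name, timemap in results if timemap is not None}

print(list(aggregate("http://example.com/").keys()))
```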

We thank Andy Jackson of the UK Web Archive for providing the anonymised Wayback access logs that we used for sampling one of the URI sets. We would like to extend this study to other archives’ access logs to learn what people are looking for when they visit those archives. This will help us build sampling-based profiling for archives that may not be able to share CDX files or generate/update full-coverage archive profiles.

We encourage all IIPC member archives to share enough of their access logs to yield at least one million unique URIs that people looked for in their archives. We are only interested in the log entries that have a URI-R in them (e.g., /wayback/14-digit-datetime/{URI}). We can handle all the cleanup and parsing tasks, or you can remove the requesting IP address from the logs (we don’t need it) if you would prefer. The logs can be one continuous run or many sparse segments. We promise not to publish those logs in raw form anywhere on the Web. Please feel free to discuss further details with me at salam@cs.odu.edu. Also contact me if you are interested in testing the software for profiling your archive.
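
For archives that prefer to do the extraction themselves, a rough sketch of pulling unique URI-Rs out of Wayback-style access logs, assuming the replay path pattern mentioned above, could look like this:

```python
# Extract unique URI-Rs from access log lines matching the replay pattern
# /wayback/<14-digit datetime>/<URI>. The log file name is a placeholder and
# the exact log format will differ per archive.
import re

pattern = re.compile(r"GET /wayback/\d{14}/(\S+)")
unique_uris = set()

with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            unique_uris.add(match.group(1))

print(f"{len(unique_uris)} unique URI-Rs found")
```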

by Sawood Alam
Department of Computer Science, Old Dominion University

IIPC Co-Chair Cathy Hartman Retires

By Abbie Grotke, United States Library of Congress, and Birgit Nordsmark Henriksen, The Royal Library, Denmark

The IIPC is saying a fond farewell this month to Cathy Hartman from member institution University of North Texas Libraries. She is retiring from UNT at the end of December. More about her illustrious career can be found in this tribute to Cathy on the UNT website.

Since her institution joined the IIPC in 2008, Cathy has been a strong leader and participant in the activities of the consortium, serving as Chair of the IIPC Steering Committee in 2008, and more recently as co-chair in 2015, helping to lead an effort to rethink the organizational structure and focus of the organization. She’s had a particular interest in education and training opportunities for web archivists, and led an effort to form an Education Working Group to review and fund training proposals, including support for a PhD candidate. She was also one of the first members representing the unique perspectives of universities and colleges on the Steering Committee.

“In my tenure as Chair of the International Internet Preservation Consortium,” commented Paul N. Wagner from the Library and Archives Canada, “I have been privileged to have Cathy Hartman as my Vice-Chair. Cathy brought a sage and pragmatic approach to her dealings with the Consortium. Her wealth of experience coupled with her ability to develop and nourish personal relationships will be sorely missed. I will personally miss the opportunity to call her up and get a ‘dose of reality’ as we worked through the complexities of overseeing an International organization. While it may be true that ‘things are bigger in Texas,’ I can tell you that the IIPC Steering Committee table will feel a lot smaller without Cathy’s presence. So on behalf of the entire IIPC membership, I wish you all the best in this next chapter of your life.”

Before she departs, we took a few moments to ask Cathy some questions about her career in Web Archiving and her experiences in IIPC:

Q: How did you get involved in Web Archiving?
As a librarian, I specialized in government information in the 1990s and watched with interest as governments began publishing content on their websites in the mid-90s rather than in print. We also noticed that the websites just disappeared when an agency or commission closed, so I began capturing the sites before they closed and preserving them for access. We captured the first site in 1997. As we added to the collection of websites, the project became known as the CyberCemetery, where dead agency websites went for perpetual care.

Q: What is your fondest memory of an IIPC meeting?
My first IIPC meeting was in Canberra, Australia, where I immediately connected to this group of people. I discovered a group who agreed with my passion for preserving the content published on the web. Also, the representatives from around the world were all so welcoming and immediately made us feel a part of the organization. The programs, the group discussions, the ongoing work – I was totally on board. IIPC has, since that meeting, been my favorite professional organization.

Q: Which changes do you look at as the biggest changes in Web archiving over the years you have been involved?
Everything has changed since 1997. Sites then were simple, straightforward and easy to capture. Websites now are complex, multimedia, and interactive. The growth and change of the web over that 18-year period is remarkable.

Q: What do you plan to do in retirement?
A little consulting, a bit of fundraising for an endowment for UNT’s digital programs, some travel, and anything else I find interesting along the way.

The authors of this post would like to add that we will miss Cathy very much at future IIPC meetings, both as a professional colleague working together to achieve IIPC goals and, personally, as a good friend. We have enjoyed our time working together, and particularly her way of getting things done: her calm demeanor, a heart as “big as Texas,” and her ability to always focus on the issues at hand by working collaboratively have all been an inspiration! While we doubt her travels will take her to our next meeting in Iceland, we all hope that our paths cross again in the near future.

IIPC Steering Committee - Paris

2016 IIPC General Assembly & Web Archiving Conference

In 2016 the IIPC is organising two back-to-back events in the spring hosted by the Landsbókasafn Íslands – Háskólabókasafn (National and University Library of Iceland) in Reykjavík, Iceland:

  • IIPC General Assembly 2016, 11-12 April – free, open to members only
  • IIPC Web Archiving Conference 2016, 13-15 April – free, open to anyone

The IIPC is seeking proposals for presentations and workshops for the 2016 IIPC Web Archiving Conference (13 – 15 April 2016). Members of the IIPC are also encouraged to submit proposals for the IIPC General Assembly (11 & 12 April 2016).

Theme guidance

Proposals may cover any aspect of web archiving. The following is a non-exhaustive list of possible topics:

Policy and Practice

  • Harvesting, preservation, and/or access
  • Collection development
  • Copyright and privacy
  • Legal and ethical concerns
  • Programmatic organization and management

Research

  • Research using web archives
  • Tools and approaches
  • Initiatives, platforms, and collaborations

Tools

  • New/updated tools for any part of the lifecycle
  • Application programming interfaces (APIs)
  • Current and future landscape

Proposal guidance

Individual presentations can be a maximum of 20 minutes. A panel session can be a maximum of 60 minutes, with two or more presentations on a topic. A discussion session should include one or more introductory statements followed by a moderated discussion. Workshops can be up to a half-day in length; please include details on the proposed structure, content, and target audience.

Abstracts should include the name of the speaker(s), a title, and a theme, and should be no more than 300 words. All abstracts should be in English.

Please submit your proposals using this form. For questions, please e-mail iipc@bl.uk.

The deadline for submissions is 17 December 2015. All submissions will be reviewed by the Programme Committee and submitters will be notified by mid-January 2016.

Five Takeaways from AOIR 2015

I recently attended the annual Association of Internet Researchers (AOIR) conference in Phoenix, AZ. It was a great conference that I would highly recommend to anyone interested in learning first-hand about research questions, methods, and studies broadly related to the Internet.

Researchers presented on a wide range of topics, across a wide range of media, using both qualitative and quantitative methods. You can get an idea of the range of topics by looking at the conference schedule.

I’d like to briefly share some of my key takeaways. I apologize in advance for oversimplifying what was a rich and deep array of research work; my goal here is to provide a quick summary rather than an in-depth review of the conference.

  1. Digital Methods Are Where It’s At

I attended an all-day, pre-conference digital methods workshop. As a testament to the interest in this subject, the workshop was so overbooked they had to run three concurrent sessions. The workshops were organized by Axel Bruns, Jean Burgess, Tim Highfield, Ben Light, and Patrik Wikstrom (Queensland University of Technology), and Tama Leaver (Curtin University).

Researchers are recognizing that digital research skills are essential. And, if you have some basic coding knowledge, all the better.

At the digital methods workshop, we learned about the “Walkthrough” method for studying software apps, tools for “web scraping” to gather data for analysis, Tableau to conduct social media analysis, and “instagrammatics,” analyzing Instagram.

FYI: The Digital Methods Initiative from Europe has tons of great information, including an amazing list of tools.

  2. The Twitter API Is Also Very Popular

There were many Twitter studies, and they all used the Twitter API to download tweets for analysis. Although researchers are widely using the Twitter API, they expressed a lot of frustration over its limitations. For example, you can download for free only up to 1% of the total Twitter volume. If you’re studying something obscure, you are probably okay, but if you’re studying a topic like #jesuischarlie, you’ll have to pay to get the entire output. Many researchers don’t have the funds for that. One person pointed out that it would be ideal to have access to the Library of Congress’s Twitter archive. Yes, agreed!

  3. Social Media over Web Archives

Researchers presented conclusions and provided commentary on our social behavior through studies of social media such as Snapchat, Twitter, Facebook, and Instagram. There were only a handful of presentations using web archived materials. If a researcher used websites, they viewed them live or conducted “web scraping” with tools such as Outwit and Kimono. Many also used custom Python scripts to gather the data from the sites.
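
As an illustration, a custom scraping script of the kind mentioned above is often little more than the following sketch (the URL and the extracted elements are placeholders):

```python
# Minimal scraping sketch using requests and BeautifulSoup: fetch a page and
# collect its outbound link targets for later analysis.
import requests
from bs4 import BeautifulSoup

response = requests.get("http://example.com/", timeout=30)
soup = BeautifulSoup(response.text, "html.parser")

links = [a["href"] for a in soup.find_all("a", href=True)]
print(links[:10])
```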

  4. Fair Use Needs a PR Movement

There’s still much misunderstanding about what researchers can and cannot do with digital materials. I attended a session where the presenter shared findings from surveys conducted with communication scholars about their knowledge of fair use. The results showed that there was (very!) limited understanding of fair use. Even worse, the findings showed that those scholars who had previously attended a fair use workshop were even less likely to understand fair use! Moreover, many admitted that they did not conduct particular studies because of a (misguided) fear of violating copyright. These findings were corroborated by the scholars from a variety of fields who were in the room.

  5. Opportunities for Collaboration

I asked many researchers if they were concerned that they were not saving a snapshot of websites or apps at the time of their studies. The answer was a resounding “yes!” They recognize that sites and tools change rapidly, but they are unaware of tools or services they can use, or that their librarians and archivists have solutions.

Clearly there is room for librarians and archivists to conduct more outreach to researchers, to inform them about our rich web archive collections and to talk with them about preservation solutions, good data management practices, and copyright.

Who knew?

Let me end with sharing one tidbit that really blew my mind. In her research on “Dead Online: Practices of Post-Mortem Digital Interaction,” Paula Kiel presented on the “digital platforms designed to enable post-mortem interactions.” Yes, she was talking about websites where you can send posthumous messages via Facebook and email! For example, https://www.safebeyond.com/, “Life continues when you pass… Ensure your presence – be there when it counts. Leave messages for your loved ones – for FREE!”


By Rosalie Lack, Product Manager, California Digital Library

Being a Small-Time Software Contributor–Non-Developers Included


At the IIPC General Assembly 2015, we heard a call for contributors to IIPC-relevant software projects (e.g. OpenWayback and Heritrix). We imagined what we could accomplish if every member institution could contribute half a developer’s time to work on these tools. As individuals, though, we are part of the IIPC because of the institutions for which we work. The tasks assigned by our employers come first, not always leaving an abundance of time for external projects. However, there are several ways to contribute on a smaller scale (not just by committing code).

How To Help

1. Provide user support for OpenWayback and Heritrix

Join the openwayback-dev list and/or the Heritrix list, and answer questions when you can.

2. Log issues for software problems

Any time you notice something isn’t working as expected in a piece of software, report the issue. For projects like OpenWayback and Heritrix that are hosted on GitHub, creating an account so you can report issues is easy. If you aren’t sure whether the problem warrants opening an issue, send a message to the relevant mailing list.

3. Follow issues on the OpenWayback and Heritrix GitHub repositories

Check issue trackers regularly or “Watch” GitHub repositories to receive issue updates via email. If you see an issue for a bug or new feature relevant to your institution, comment on it, even if only to say that it is relevant. This helps the developers prioritize which issues to work on.

https://github.com/iipc/openwayback

4. Test release candidates

When a new distribution of OpenWayback is about to be released, the development group sends out emails asking for people to test the release distribution candidates. Verify whether the deployment works in your environment and use cases. Then report back.

5. Contribute to documentation

For any web archiving project, if you find documentation that is lacking or unclear, report it to the maintainers, and if possible, volunteer to fix it.

6. Contribute to code

OpenWayback currently has several open issues for bugs and enhancements. If you find an issue of interest to you and/or your institution, notify others with a comment that you want to work on it. View the contribution guidelines, and start contributing. OpenWayback and Heritrix are happy to get pull requests.

7. Review code

When others submit code for potential inclusion into a project’s master code branch, volunteer to review the code and test it by deploying the software with the changes in place to verify everything works as expected.

8. Join the OpenWayback Developer calls

If you are interested in contributing to OpenWayback, these calls keep you informed of the current state of development. The group is always looking for help with testing release candidates, prioritizing issues, writing documentation, reviewing pull requests, and writing code. Calls take place approximately every three weeks at 4 PM London time. There is also a Google Groups list; email the IIPC PCO to join.

9. Solicit development support from your institution

Non-developers have a great role in the development effort. Encourage technical staff you work with to contribute to software projects and help them build time into their schedules for it. If you are not in a position to do this, lobby the people who can grant some of your institution’s developer time to web archiving projects.

What You Get Back

Collaborating on web archiving projects isn’t just about what you contribute. The more you follow mailing lists and issue trackers and the more you work with code and its deployment, the better your institution can utilize the software and keep current on the direction of its development.

If your institution doesn’t use OpenWayback or Heritrix, the above ways of helping apply to many other web archiving software projects. So get involved where you can; you don’t have to fix everything.

Lauren Ko
Programmer, Digital Libraries Division, UNT Libraries