By Daniel Gomes, Arquivo.pt – the Portuguese Web Archive
Although most current Research & Development (R&D) projects rely on their sites to publish valuable information about their activities and achievements, these sites and the information they provide typically disappear a few years after the end of the projects. Web archiving is a solution to this problem.
Why preserve websites of Research & Development projects?
During the FP7 work programme, the European Union invested billions of euros in R&D projects. Scientific outputs from this significant investment were disseminated online through R&D project sites. Moreover, part of the funding was spent on developing the project sites themselves.
Sites of R&D projects must be preserved because they:
- publish valuable scientific outputs;
- are highly transient, typically vanishing shortly after the project funding ends;
- constitute a trans-national, multi-lingual and cross-field set of historical web data for researchers (e.g. social scientists);
- are not being officially preserved by any institution.
The archivist’s dilemma has always been what to preserve for the future. R&D project sites are definitely worth preserving.
Open Data becomes obsolete as project sites disappear
There has been a growing effort by the European Union, and by governments in general, to improve transparency by providing Open Data about their activities and the outputs of the funding they grant.
The European Union Open Data Portal is an example of this effort. It provides information about European Union funded projects, such as the project name, start and end dates, subject, budget and project URL. Almost all of this information remains valid and usable after the project or funding instrument ends. The exception is the project URL.
A pilot experiment was performed on November 27th 2015 to test the project URLs of 100 random projects funded by the FP6 and FP7 work programmes. We automatically checked whether the project URLs still returned content (an OK response).
The results revealed that 19% of the sites of R&D projects financed by FP7 (2007-2013) were already unavailable. This percentage of data loss increased to 30% for the older project sites financed by FP6 (2002-2006).
Moreover, we observed that some of the URLs that did respond referenced content that was no longer related to the R&D project. This suggests that the percentage of valid project URLs is in fact lower than the automatic check indicates. Obtaining more accurate percentages would require human validation. This is an interesting issue that could be further investigated by researchers.
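The automatic check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used in the pilot experiment: the function names `fetch_status` and `unavailable_share`, the 10-second timeout, the User-Agent string and the rule that only a final HTTP 200 counts as available are all hypothetical choices.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the final HTTP status code for url, or None if no response arrived.

    urlopen follows redirects, so the code describes the last page reached.
    """
    request = urllib.request.Request(url, headers={"User-Agent": "link-check-sketch"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error code such as 404 or 500.
        return error.code
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # DNS failure, refused connection, timeout, malformed URL, etc.
        return None


def unavailable_share(
    urls: Iterable[str],
    get_status: Callable[[str], Optional[int]] = fetch_status,
) -> float:
    """Percentage of urls that did not answer with HTTP 200 (assumed criterion)."""
    url_list = list(urls)
    if not url_list:
        raise ValueError("no URLs to check")
    failed = sum(1 for url in url_list if get_status(url) != 200)
    return 100.0 * failed / len(url_list)
```

Calling `unavailable_share(project_urls)` on a list of project URLs returns the share of unavailable sites as a percentage. As noted above, an OK response does not guarantee that the page is still about the project, so a check like this can only give a lower bound on the loss.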
Web archiving provides a solution
The constant deactivation of sites that publish and disseminate the scientific outputs originating from R&D projects causes a permanent loss of valuable information for human knowledge, from both a societal and a scientific perspective. As project sites inevitably close, the online information referenced in databases such as CORDIS, the EU research projects database, suffers irrecoverable degradation.
The good news is that web archives from IIPC members may have preserved sites from past R&D projects that have become unavailable, and a trans-national research infrastructure such as RESAW could make them more widely accessible.
Funding management databases could be enhanced to also reference the preserved versions of project websites that have since disappeared from the live web. Project officers or reviewers could complement their analysis by retrieving missing online content about the funded projects. Researchers in general could mitigate the serious cross-field problem caused by scientific publications citing crucial online resources that have become unavailable.
You can help to preserve R&D project sites right now!
We are now trying to focus on preserving sites of R&D projects. To this end, our first idea was to use the EU Open Data Portal to identify these project URLs. The problem is that, of the 25 608 R&D projects funded by FP7 and listed by the EU Open Data Portal, only 7.9% had an associated project URL.
So, the first main challenge is to identify R&D project web sites to be preserved.
And you can help!
You just need to contribute project sites to this document:
The resulting list will be used to:
- make an experimental crawl of these sites that will be made publicly available;
- research techniques, to be published, for automatically identifying R&D project URLs;
- help other institutions interested in preserving these sites.