The .EU domain is commonly used to reference sites related to Europe. EURid is the organization appointed by the European Commission to operate the .EU domain and presents it under the slogan “Your European Identity”.
Preserving the information published on sites hosted under the .EU domain is therefore crucial to safeguarding European cultural heritage for future generations.
The strategy adopted so far to archive the World Wide Web has been to delegate responsibility for each national domain to the respective national archiving institution. However, the .EU domain does not fit this model because it covers multiple nations. As a result, the preservation of .EU sites has not yet been assigned to, or undertaken by, any institution.
RESAW is a European network that aims to create a Research Infrastructure for the Study of Archived Web Materials (resaw.eu). Within the scope of RESAW activities, the Portuguese Web Archive made a first attempt to crawl and preserve websites hosted under the .EU domain. This first crawl began on 21 November 2014 and finished on 16 December 2014.
Challenges crawling .EU
The first challenge was obtaining the seeds for the crawl, because our attempts to obtain the list of .EU domains from EURid failed. The crawl was launched with a total of 34 138 unique seeds obtained from several sources, such as Google.com, DomainTyper.com, DMOZ.org and Alexa Top Sites.
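Merging seed lists from several sources requires normalizing and deduplicating them. The following is a minimal sketch of that step, not the procedure actually used for this crawl: the function names `normalize_seed` and `merge_seeds` are hypothetical, and the normalization rules (lowercasing, dropping a leading `www.`, keeping only hosts that end in `.eu`) are assumptions.

```python
from urllib.parse import urlsplit


def normalize_seed(raw):
    """Return the lowercase host of a seed if it is under .eu, else None.

    Hypothetical helper: the normalization rules are assumptions.
    """
    raw = raw.strip()
    if not raw:
        return None
    # Bare host names such as "example.eu" need a scheme to be parsed.
    if "://" not in raw:
        raw = "http://" + raw
    try:
        host = urlsplit(raw).hostname
    except ValueError:
        return None  # malformed URL, skip it
    if not host:
        return None
    host = host.rstrip(".")
    if host.startswith("www."):
        host = host[4:]
    return host if host.endswith(".eu") else None


def merge_seeds(*sources):
    """Merge several iterables of raw seeds into a sorted list of unique hosts."""
    unique = set()
    for source in sources:
        for raw in source:
            host = normalize_seed(raw)
            if host:
                unique.add(host)
    return sorted(unique)
```

Each source (for example, one text file per provider) can be passed as a separate iterable of lines; seeds outside the .EU domain are silently dropped.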
During this first crawl we had to iteratively tune the crawl configuration to overcome problematic situations caused by web spam sites. The resulting set of spam filters will be useful to optimize future crawls.
We crawled 250 million documents from over 1 million hosts. The crawled documents were stored in 5.8 TB of disk space using the compressed ARC format. We extracted 135 907 unique domain URLs, which will be used as seeds for the next crawl.
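The text does not describe how the unique domains were extracted. One plausible approach, sketched below, is to scan the Heritrix crawl log. The sketch assumes that the fetched URI is the fourth whitespace-separated field of each log line, and it treats the last two labels of the host name as the domain; the function name `eu_domains_from_log` is hypothetical.

```python
from urllib.parse import urlsplit


def eu_domains_from_log(lines):
    """Collect unique second-level .eu domains from crawl log lines.

    Assumption: the URI is the 4th whitespace-separated field of each line.
    """
    domains = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or truncated line
        try:
            host = urlsplit(fields[3]).hostname
        except ValueError:
            continue  # malformed URI
        if not host:
            continue  # e.g. "dns:" entries carry no network location
        labels = host.rstrip(".").split(".")
        if len(labels) >= 2 and labels[-1] == "eu":
            domains.add(".".join(labels[-2:]))
    return domains
```

Because the function accepts any iterable of lines, a multi-gigabyte log can be streamed through it by passing an open file object, without loading the whole file into memory.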
Two more crawls of the .EU domain planned
As future work, we intend to perform two more crawls of the .EU domain, to be integrated into the Portuguese Web Archive collections. The next crawl is planned to start in November 2015. We estimate that 23 TB of disk space will be required for that crawl (without deduplication).
Each .EU crawl will be indexed and become searchable through www.arquivo.pt one year after its finish date.
Collaborations with researchers interested in studying the collected web data or crawl logs are welcome. If researchers explicitly express interest, we can create a prototype system with restricted access to enable search and processing of the .EU crawls.
This first experiment in archiving the .EU domain was performed mostly using resources from the Portuguese Web Archive. Collaborations with other institutions, for instance to identify relevant seeds, are crucial to improve the quality of the crawls. The results of this experiment are encouraging, but effectively archiving the .EU domain requires more resources and collaborations.
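The text does not state how the storage estimate was derived. One simple model that yields a figure of the same size is linear scaling of the first crawl's storage by the number of seeds; the sketch below makes that assumption explicit, and the function name `estimate_storage_tb` is hypothetical.

```python
def estimate_storage_tb(prev_storage_tb, prev_seeds, next_seeds):
    """Scale storage linearly with the seed count (an assumed model)."""
    return prev_storage_tb * next_seeds / prev_seeds


# Figures reported for the first crawl and for the next seed list.
estimate = estimate_storage_tb(5.8, 34_138, 135_907)
print(f"{estimate:.1f} TB")
```

With the reported figures this evaluates to roughly 23 TB. Real growth may well differ from linear, since spam filtering and the overlap between crawls both affect the volume collected.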
- A first attempt to archive the .EU domain, technical report
- Heritrix original crawl log (19.6 GB)
- Heritrix generated reports (21.5 MB)
- Analysis sheet generated using the Notebook Python library