IIPC Hackathon at the British Library: Laying a New Foundation

By Tom Cramer, Stanford University

This past week, 22-23 September 2016, members of the IIPC gathered at the British Library for a hackathon focused on web crawling technologies and techniques. The event brought together 14 technologists from 12 institutions near (the UK, the Netherlands, France) and far (Denmark, Iceland, Estonia, the US and Australia). It provided a rare, intensive, two-day, uninterrupted deep dive into how institutions are capturing web content, and a chance to explore ways of advancing the state of the art.

I was struck by the breadth and depth of topics. In particular…

  • Heritrix nuts and bolts. Everything from small tricks and known issues for optimizing captures with Heritrix 3, to how people were innovating around its edges, to the history of the crawler, to a wishlist for improving it (including better documentation).
  • Brozzler and browser-based capture. Noah Levitt of the Internet Archive, the engineer behind Brozzler, gave a mini-workshop on the latest developments and how to get it up and running. This was one of the biggest points of interest as institutions look to enhance their ability to capture dynamic content and social media. About a third of the workshop attendees went home with fresh installs on their laptops. (Also note, per Noah: pull requests welcome!)
  • Technical training. Web archiving is a relatively esoteric domain without a huge community; how have institutions trained new or fractionally assigned staff to engage effectively with web archiving systems? This appears to be a major, common need, and also one that is approachable. Watch this space for developments…
  • QA of web captures. As Andy Jackson of the British Library put it, how can we tip the scales from mostly manual QA with some automated processes to mostly automated QA with some manual training and intervention? (A minimal sketch of one such automated check follows this list.)
  • An up-to-date registry of web archiving tools. The IIPC currently maintains a list of web archiving tools, but it’s a bit dated (as these sites tend to become). To get the list somewhere that tool users and developers can update it, a working copy is now in the IIPC GitHub organization. Importantly, the group decided that it might be just as valuable to create a list of dead or deprecated tools, as these can often be dead ends for new adopters. See (and contribute to) https://github.com/iipc/iipc.github.io/wiki. Updates welcome!
  • System & storage architectures for web archiving. How institutions are storing, preserving and computing on the bits. There was a great diversity of approaches here, and this is likely good fodder for a future event and more structured knowledge sharing.
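
As an illustration of the automated-QA idea mentioned above, here is a minimal sketch that scans a WARC file and flags captures a human should look at. It assumes crawls are written to WARC and that the open-source warcio library is available; the file name, the size threshold and the report format are purely illustrative, not an agreed IIPC approach.

    # Minimal automated-QA sketch: flag captures that likely need manual review.
    # Assumes WARC output and the warcio library (pip install warcio).
    from warcio.archiveiterator import ArchiveIterator

    SUSPICIOUSLY_SMALL = 512  # illustrative threshold, in bytes

    def qa_report(warc_path):
        flagged = []
        with open(warc_path, 'rb') as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != 'response':
                    continue
                url = record.rec_headers.get_header('WARC-Target-URI')
                status = record.http_headers.get_statuscode() if record.http_headers else None
                payload = record.content_stream().read()
                # Flag non-200 responses and suspiciously small payloads for a human to check.
                if status != '200' or len(payload) < SUSPICIOUSLY_SMALL:
                    flagged.append((url, status, len(payload)))
        return flagged

    if __name__ == '__main__':
        for url, status, size in qa_report('example-crawl.warc.gz'):  # hypothetical file
            print(status, size, url, sep='\t')

Checks like this only tip the scales, of course: the flagged list still needs manual review, which is exactly the balance discussed at the event.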

The biggest outcome of the event may have been the energy generated, and the inherent value of having engineers and technical program managers spend lightly structured face time exchanging information and collaborating. The event was a significant step forward in building awareness of both the approaches and the people doing web archiving.

IIPC Hackathon, Day 1.

This validates one of the main focal points for the IIPC’s portfolio on Tools Development, which is to foster more grassroots exchange among web archiving practitioners.

The participants committed to keeping the dialogue going, and to expanding the number of participants within and beyond IIPC. Slack is emerging as one of the main channels for technical communication; if you’d like to join in, let us know. We also expect to run multiple, smaller face-to-face events in the next year: three in Europe and another two or three in North America, with several delving into APIs, archiving time-based media, and access. (These are all in addition to the IIPC General Assembly and Web Archiving Conference on 27-30 March 2017 in Lisbon.) If you have an idea for a specific topic or would like to host an event, please let us know!

Many thanks to all the participants, and to the British Library (especially Andy Jackson and Olga Holownia) for hosting last week’s hackathon. It provided exactly the kind of forum needed by the web archiving community to share knowledge among practitioners and to advance the state of the art.


Internet Memory Research launches MemoryBot

Internet Memory Research is pleased to announce that our new crawler, MemoryBot, has now reached full maturity.

With it, we have completed a first map of the Web, and we would like to share some early results from this experiment.

With only a few small servers and four weeks of time, we were able to crawl over 2 billion resources, with the objective of discovering as many domains as possible. Overall, more than 60 million domains were discovered, which represents about half of the active domains in the world (the rest consists mostly of parking sites and other kinds of empty domains).

In addition, we were able to run several types of analysis on this material thanks to the Hadoop- and Flink-based architecture of our archive. Among other things, we used machine learning to classify domains by type or genre (News, Forums, Blogs, E-commerce, etc.).
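
The announcement does not describe the classifier itself, so as a rough, generic illustration of this kind of genre classification (and not Internet Memory Research’s actual pipeline), here is a minimal scikit-learn sketch; the labels, training texts and model choice are all assumptions.

    # Generic sketch of domain genre classification; not IMR's actual pipeline.
    # Assumes a small labelled sample of (text sampled from a domain, genre) pairs.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "breaking news politics world economy headlines",        # hypothetical examples
        "add to cart checkout shipping returns best price",
        "reply to thread posted by member join the discussion",
        "my thoughts today a personal post about my week",
    ]
    train_labels = ["News", "E-commerce", "Forums", "Blogs"]

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
    model.fit(train_texts, train_labels)

    # Classify a newly crawled domain by text sampled from its pages.
    print(model.predict(["special offer free shipping buy now"]))

In practice one would train on far larger labelled samples and richer features, but the shape of the task is the same.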

The other good news is that, thanks to many improvements in both overall efficiency and stability, the cost of such crawls has been halved. Combined with falling storage costs, global crawls are becoming much more affordable, and we hope this will benefit more and more institutions.

More details on this will be published later, but we wanted to share this early update with the web archiving community.

By Chloé Martin (COO) of Internet Memory Research

10th anniversary of the Netarchive (Netarkivet), the Danish national web archive

The Royal Library in Copenhagen and the State and University Library in Aarhus are happy to announce the 10th anniversary of the Netarchive (Netarkivet), the Danish national web archive.


In July 2005, a new legal deposit law came into force: materials “published in electronic communication networks” became subject to legal deposit; that is to say, collecting and preserving “the Danish part of the Internet” was now mandated by law. In the same year, the Netarchive joined the IIPC.

In the early years of the Netarchive, we focused on collection building and strategies: how to manage four broad crawls a year and how to choose about 100 sites to be harvested selectively. At the end of 2005, we finished our first broad crawl – it took almost a year. In 2007, our first systematic set of selective crawls was in place, we had a first dialogue with Facebook about harvesting Danish open profiles, we released NetarchiveSuite as an open-source curation tool, and we gave the first researchers access to the archived material.

In 2008, we started harvesting e-books, and the French National Library and the Austrian National Library joined the NetarchiveSuite development project. In 2009, the first Ph.D. student graduated with a project based on the Netarchive. In 2010, we participated in the first IIPC collaborative collection (Winter Olympic Games). In 2011, we established access through the Wayback Machine and started a special collection on online games.

In 2012, we fulfilled our objective of carrying out four broad crawls a year and began a special collection of YouTube videos. In 2013, we established on-site access for eligible master’s students in their final year, and we developed a solution that makes selected electronic publications from ministries and official agencies accessible to the public via persistent links from The Administrative Library’s catalogue. In 2014, we started indexing the whole archive for full-text search and performed our largest event harvest ever, of the Eurovision Song Contest hosted in Denmark.

Erland Kolding Nielsen (director of the Royal Library) cutting the ribbon for the full-text search

The birthday gift to our users and to ourselves is the full-text searchable archive!

Netarkivet has celebrated its 10th birthday

Thank you for your cooperation and feedback during all these years.

On behalf of the Netarchive Team

Sabine Schostag, Web Curator, Netarchive, State and University Library

A first attempt to archive the .EU domain


The .EU domain is commonly used to reference sites related to Europe. EURid is the organization appointed by the European Commission to operate the .EU domain and presents it under the slogan “Your European Identity”.

Therefore, preserving online information published on sites hosted under the .EU domain is crucial to preserving European cultural heritage for future generations.

The strategy adopted to archive the World Wide Web has been to delegate responsibility for each domain to the respective national archiving institutions. However, the .EU domain does not fit this model because it covers multiple nations. Thus, the preservation of .EU sites has not yet been assigned to or undertaken by any institution.

RESAW is a European network that aims to create a Research Infrastructure for the Study of Archived Web Materials (resaw.eu). Within the scope of RESAW activities, the Portuguese Web Archive made a first attempt to crawl and preserve websites hosted under the .EU domain. This first crawl began on 21 November 2014 and finished on 16 December 2014.

Challenges crawling .EU

The first challenge was obtaining the seeds for the crawl, because our contacts with EURid to get the list of .EU domains failed. The crawl was launched using a total of 34 138 unique seeds obtained from several sources, such as Google.com, DomainTyper.com, DMOZ.org and Alexa Top Sites.

During this first crawl we had to iteratively tune the crawl configuration in order to overcome hazardous situations caused by web spam sites. The set of spam filters created will be useful for optimizing future crawls.

We crawled 250 million documents from over 1 million hosts. The crawled documents were stored in 5.8 TB of disk space using the compressed ARC format. 135 907 unique domain URLs were extracted, and these will be used as seeds for the next crawl.
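
As an aside, a simplified version of these two steps (filtering out spammy hosts and deriving a deduplicated seed list from the URLs of a finished crawl) can be sketched in a few lines of standard-library Python. This is a generic sketch, not the Portuguese Web Archive’s actual tooling; the input file name, its format and the spam patterns are assumptions.

    # Generic sketch: filter out spammy hosts and derive a deduplicated .eu seed list.
    import re
    from urllib.parse import urlparse

    # Illustrative spam patterns only; a real crawl would use a curated, much larger set.
    SPAM_HOST_PATTERNS = [re.compile(p) for p in (r'\d{5,}', r'(casino|pharma|viagra)')]

    def extract_eu_seeds(url_lines):
        seeds = set()
        for line in url_lines:
            host = urlparse(line.strip()).hostname or ''
            if not host.endswith('.eu'):
                continue
            if any(p.search(host) for p in SPAM_HOST_PATTERNS):
                continue  # skip hosts matching a spam filter
            # Seed the next crawl from the root URL of each remaining .eu host.
            seeds.add('http://' + host + '/')
        return sorted(seeds)

    if __name__ == '__main__':
        with open('crawled-urls.txt') as f:   # hypothetical one-URL-per-line export
            for seed in extract_eu_seeds(f):
                print(seed)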

Two more crawls of the .EU domain planned

As future work, we intend to perform two more crawls of the .EU domain, to be integrated into the Portuguese Web Archive collections. The next crawl is planned to start in November 2015. We estimate that 23 TB of disk space will be required for the next crawl of the .EU domain without deduplication (roughly four times the 5.8 TB used by the first crawl, consistent with the roughly fourfold growth of the seed list).

Each of the .EU crawls performed will be indexed and become searchable through www.arquivo.pt one year after its completion date.

 

Researchers wanted!

Collaborations with researchers interested in studying the collected web data or crawl logs are welcome. We can create a prototype system with restricted access to enable searching and processing of the .EU crawls if researchers express interest.

This first experiment in archiving the .EU domain was performed mostly using resources from the Portuguese Web Archive. Collaborations with other institutions, for instance to identify relevant seeds, are crucial to improving the quality of the crawls. The results obtained from this experiment are encouraging, but an effective archive of the .EU domain requires more resources and collaboration.

Learn more


Daniel Gomes and Daniel Bicho, RESAW / Portuguese Web Archive