Digging in Digital Dust: Internet Archaeology at KB-NL in the Netherlands

By Peter de Bode and Kees Teszelszky

The Dutch .nl ccTLD is the third biggest national top level domain in the world and consists of 5.68 million URLs, according to the Dutch SIDN. The first website of the Netherlands was published on the web in 1992: it was the third website on the World Wide Web. Web archiving in the Netherlands started in 2000 with the project Archipol in Groningen. The Koninklijke Bibliotheek | National Library of The Netherlands (KB-NL) started web archiving with a selection of Dutch websites in 2007. The KB not only selects and harvests these sites, but also develops a strategy to ensure their long-term usability. As the Netherlands lacks a legal deposit law, the KB cannot crawl the Dutch national domain. KB uses the Web Curator Tool (WCT) to conduct its harvests. Since January 2018, the National Library of New Zealand (NLNZ) has been collaborating with KB-NL to upgrade this tool and add new features to make the application future-proof.

Since 2011, the Dutch web archive has been available in the KB reading rooms. In addition, researchers may request access to the data for specific projects. Between 2012 and 2016 the research project WebArt was carried out. As of November 2018, 15,000 websites have been selected. The Dutch web archive contains about 37 terabytes of data.

On the occasion of World Digital Preservation Day, KB unveiled a special internet archaeology collection, Euronet-Internet (1994-2017) [In Dutch: Webcollectie internetarcheologie Euronet]. It is made up of archived websites hosted by internet provider Euronet-Internet between 1994 and 2017. Collecting started in 2017 and ended in 2018. Websites are identified for harvest by Peter de Bode and Kees Teszelszky as part of the larger KB web archiving project “internet archaeology.” Euronet is one of the oldest internet providers in the Netherlands (1994) and was later bought by Online.nl. Priority is given to websites published in the early years of the Dutch web (1994-2000).

These sites can be considered “web incunables” as they are among the first born-digital publications on the Dutch web. Some of the digital treasures from this collection are the oldest website of a national political party, a virtual bank building and several sites of internet pioneers dating from 1995. Information about the collection and its heritage value can be found on a special dataset page of KB-Lab and in a collection description (in Dutch). The collection can be studied on the terminals in the reading room of KB with a valid library card. Researchers can also use the dataset with URLs and a link analysis.
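To give a sense of what a simple link analysis over a list of URLs can look like, here is a minimal Python sketch. It does not describe the KB-Lab dataset or its format: the `(source, target)` pair layout, the helper names and the `.example` hosts are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def inbound_link_counts(links):
    """Count links per target host, ignoring links that stay on one host."""
    counts = Counter()
    for source, target in links:
        src, dst = host_of(source), host_of(target)
        if dst and src != dst:
            counts[dst] += 1
    return counts


# Hypothetical (source page, linked page) pairs; not taken from the real dataset.
links = [
    ("http://www.site-a.example/index.html", "http://www.site-b.example/"),
    ("http://www.site-a.example/links.html", "http://site-b.example/about.html"),
    ("http://www.site-b.example/", "http://www.site-b.example/contact.html"),
    ("http://www.site-c.example/", "http://www.site-a.example/"),
]

# List the most linked-to hosts first.
for host, count in inbound_link_counts(links).most_common():
    print(host, count)
```

Counting inbound links per host is only one possible measure; a fuller analysis would probably keep the whole graph rather than a tally.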


Web Archiving at the National Library of Ireland

National Library of Ireland Reading Room © National Library of Ireland.

The National Library of Ireland has a long-standing tradition of collecting, preserving and making accessible the published and printed output of Ireland. The library is over 140 years old and we now also have rich digital collections concerning the political, cultural and creative life of Ireland. The NLI has been archiving the Irish web on a selective basis since 2011. We have over 17 TB of data in the selective web archive, openly available for research through our website. A particular strength of our web archive is the coverage of Irish politics, including a representation of every election and referendum since 2011. The web archive is no longer in its infancy, and the NLI has made some exciting developments in recent years. This year we have begun working with the Internet Archive for our selective web archive and are looking forward to the new opportunities that this partnership will bring. We have also begun working closely with an academic researcher from a Higher Education institute in Ireland, who is carrying out network analysis on a portion of our selective data.

In 2007 and 2017, the NLI undertook domain crawling projects and there is now over 43 TB of data archived from these crawls. The National Library of Ireland is a legal deposit library, entitling it to a copy of everything published in Ireland. However, unlike in many countries in Europe, legal deposit legislation does not currently extend to online material, so we cannot make these crawls available. Despite these barriers, the library remains committed to preserving the online story of Ireland in whatever way we can.

Revisions to the legislation are currently before the Irish parliament and if passed will result in the addition of e-publications, such as e-books, journals etc. The addition of websites to that list is currently being considered.

In 2017, the National Library of Ireland became a member of the IIPC and we are excited to be attending our first General Assembly in Wellington. While we had anticipated talking about our newly available domain web archive portal and how this had impacted our selective crawls, we are looking forward to discussing the challenges we continue to face, including with Legal Deposit, and how we are developing the web archive as a whole. We also hope to be able to give an update on progress with the legislative framework. We look forward to seeing you there in Wellington!

Human scale web collecting for individuals and institutions (Webrecorder workshop)

By Anna Perricci, Rhizome

Web archiving ‘at scale’ is usually equated with collecting by automated software (a web crawler), but the assumption that more information means more value is not always right, especially with web archives. A massive scope or scale isn’t required to make meaningful, useful web archives. Collecting at a ‘human scale’ can be as good as or better than crawling for forming certain collections.

Webrecorder is a free, easy-to-use, browser-based web archiving tool set provided by Rhizome. Rhizome, an affiliate of the New Museum in New York City, champions born-digital art and culture through commissions, exhibitions, digital preservation, and software development. Webrecorder’s development has been generously supported by the Andrew W. Mellon Foundation.

With Webrecorder you can make high fidelity interactive captures of web content as you browse web pages. A “high fidelity capture” means that from a user’s perspective there is a complete or high level of similarity between the original web pages and the archived copies, including the retention of important characteristics and functionality such as: video or audio that requires a user to press ‘play’, or resources that require entry of login credentials for access (e.g. social media accounts). Webrecorder can capture most types of media files, JavaScript and user-triggered actions, which are things that most crawlers struggle with or are unable to obtain.

Workshop attendees will be given an overview of Webrecorder’s features, then engage in hands-on activities and discussions. Further instruction will alternate with opportunities for participants to use the tools introduced and share their thoughts or questions. Instructions on how to manage the collected materials, download them (as a WARC file), and open a local copy offline using Webrecorder Player will also be covered in this workshop.

Human scale web collecting with Webrecorder is not expected to meet all the requirements of a large web archiving program but can satisfy many needs of researchers or smaller web collecting initiatives. Webrecorder can be a great tool for personal digital archiving projects as well. Larger web archiving programs can benefit from using Webrecorder to capture dynamic content and user-triggered behaviors on websites. The WARC files created with Webrecorder can be downloaded and ingested to join WARCs that have been created using crawler-based systems.
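Before a downloaded WARC is ingested alongside crawler-produced files, it can be useful to check what it contains. The sketch below is a minimal, standard-library-only reader that tallies record types; it is not part of Webrecorder or any ingest system, and the function names are hypothetical. It assumes well-formed records with a `Content-Length` header, optionally gzip-compressed.

```python
import gzip
from collections import Counter


def iter_warc_headers(stream):
    """Yield the header dict of each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line[:40]!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header section
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Skip over the record body using its declared length.
        stream.read(int(headers.get("content-length", 0)))
        yield headers


def count_record_types(path):
    """Tally WARC-Type values (warcinfo, request, response, ...) in one file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as fh:
        return Counter(h.get("warc-type", "unknown") for h in iter_warc_headers(fh))
```

Calling `count_record_types` on a downloaded file (the path is whatever you saved it as) returns a counter of record types. For real ingest work, an established WARC library is a safer choice than a hand-rolled reader like this one.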

With a tool like Webrecorder anyone can get started with web archiving quickly at no cost, which is empowering for both information professionals and their stakeholders.

On November 14th you can also learn more about Webrecorder in an afternoon session entirely focused on Webrecorder and high fidelity web archiving. This time will start with a 30 minute presentation on Python Wayback (pywb), a core component of Webrecorder, by pywb’s creator and Webrecorder’s lead developer, Ilya Kreymer. Then there will be a 1 hour panel on capturing complex websites and publications using Webrecorder with Jasmine Mulliken, Sumitra Duncan, Nicole Coleman, and me (Anna Perricci).

Whether you are a seasoned expert or newer to web archiving I hope you will be able to join us for the session and this workshop on November 14th at the IIPC WAC. The limit on the number of workshop attendees has been removed so please feel welcome to register.

IIPC Steering Committee Election 2019

The nomination process for IIPC Steering Committee is now open.

The Steering Committee is the executive body of the IIPC, currently comprising 15 member organisations, that take a leadership role in the high-level strategic planning, development and management of programs, policy creation, overall administration, and contribution to IIPC Portfolios and other activities.

What is at stake?

Serving on the Steering Committee is an opportunity for motivated members to help guide the IIPC’s mission of improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. Steering Committee members are expected to take an active role in leadership, contribute to SC and Portfolio activities, and help guide and administer the organisation.

Who can run for election?

Serving on the Steering Committee is open to any current IIPC member and we strongly encourage any organisation interested in serving on the Steering Committee to nominate themselves for election. SC members are elected for 3 years and meet twice a year in person (once during the General Assembly and once in September) and two or more additional times by teleconference.

Please note that the nomination should be on behalf of an organisation, not an individual. Once elected, the member organisation designates a representative to serve on the Steering Committee. The list of current SC member organisations is available on the IIPC website.

How to run for election?

All nominee institutions, both new and existing members whose term is expiring but are interested in continuing to serve, are asked to write a short statement (max 200 words) outlining their vision for how they would contribute to IIPC via serving on the Steering Committee. Statements can point to past contributions to the IIPC or the SC, relevant experience or expertise, new ideas for advancing the organisation, or any other relevant information.

All statements will be posted online and emailed to members prior to the election with ample time for review by all membership. The results will be announced in mid-May and the three-year term on the Steering Committee will start on 1 June.

Below you will find the election calendar. We are very much looking forward to receiving your nominations. If you have any questions, please contact the IIPC PCO.


Election Calendar

  • 12 November to 1 March: Members are invited to nominate themselves by sending an email including a statement to the IIPC Programme and Communications Officer.
  • 1 April: Nominees’ statements are published on the Netpreserve Blog and Members mailing list. Nominees are encouraged to campaign through their own networks.
  • 1 April to 30 April: Members are invited to vote online. An online voting tool will be used to conduct the vote. The PCO will monitor the vote, ensuring that each organisation votes only once for all nominated seats and that the vote is cast by the organisation’s official representative. People will be encouraged to cast their vote before, during, and after the GA.
  • 30 April: Voting ends.
  • 1 May: The results of the vote are announced officially on the Netpreserve blog and Members mailing list.
  • 1 June: End/start of SC members’ terms. The newly elected SC members start their term on the 1st of June and are invited to attend a first meeting (by teleconference) by the end of June. The next face-to-face SC meeting will take place in Zagreb in June 2019.

 

Online Hours: Supporting Open Source

By Andrew Jackson, Web Archiving Technical Lead at the British Library

At the UK Web Archive, we believe in working in the open, and that organisations like ours can achieve more by working together and pooling our knowledge through shared practices and open source tools. However, we’ve come to realise that simply working in the open is not enough – it’s relatively easy to share the technical details, but less clear how to build real collaborations (particularly when not everyone is able to release their work as open source).

To help us work together (and maintain some momentum in the long gaps between conferences or workshops), we were keen to try something new, and hit upon the idea of Online Hours. It’s simply a regular web conference slot (organised and hosted by the IIPC, but open to all) which can act as a forum for anyone interested in collaborating on open source tools for web archiving. We’ve been running for a while now, and have settled on a rough agenda:

Full-text indexing:
– Mostly focussing on our Web Archive Discovery toolkit so far.

Heritrix3:
– including Heritrix3 release management, and the migration of Heritrix3 documentation to the GitHub wiki.

Playback:
– covering e.g. SolrWayback as well as OpenWayback and pywb.

AOB/SOS:
– for Any Other Business, and for anyone to ask for help if they need it.

This gives the meetings some structure, but is really just a starting point. If you look at the notes from the meetings, you’ll see we’ve talked about a wide range of technical topics, e.g.

  • OutbackCDX features and documentation, including its API;
  • web archive analysis, e.g. via the Archives Unleashed Toolkit;
  • summary of technologies so we can compare how we do things in our organisations, to find out which tools and approaches are shared and so might benefit from more collaboration;
  • coming up with ideas for possible new tools that meet a shared need in a modular, reusable way and identify potential collaborative projects.

The meeting is weekly, but we’ve attempted to make the meetings inclusive by alternating the specific time between 10am and 4pm (GMT). This doesn’t catch everyone who might like to attend, but at the moment I’m personally not able to run the call at a time that might tempt those of you on Pacific Standard Time. Of course, I’m more than happy to pass the baton if anyone else wants to run one or more calls at a more suitable time.

If you can’t make the calls, please consider:

My thanks go to everyone who has come along to the calls so far, and to IIPC for supporting us while still keeping it open to non-members.

Maybe see you online?

Web Archivists, Assemble!

By Alex Thurman, Columbia University Libraries, Member of the IIPC Steering Committee and the WAC Program Committee (2016-2018), Co-Chair of the Content Development Group

The IIPC General Assembly & Web Archiving Conference is the professional gathering I anticipate most eagerly each year. In an energizing atmosphere of international cooperation, web curators, librarians, archivists, tool developers, computer scientists, and academic researchers from member organizations and beyond meet to share experiences and best practices and plan projects to tackle the collective challenge of preserving web resources.

I’ve had the good fortune of attending each year since 2012, and for the past three years I’ve also had the rewarding experience of serving on the program committees planning these events. As we look forward to the exciting upcoming 2018 conference in Wellington, New Zealand, here is some background on the recent evolution of the GA/WAC and the work of the 2018 WAC Program Committee.

Recent background

2018 marks the fifteenth anniversary of the IIPC, and the twelfth consecutive year that members of the IIPC will come together in an annual General Assembly. The IIPC Steering Committee has striven to cycle (loosely, as dependent on members volunteering to host the event) the venue of the GA/WAC in alternate years between Europe, North America and Australasia. And from the start, the GA event programs have combined days reserved for IIPC members (focused on Consortium planning and working group activities) with one or more open days to welcome the perspectives and expertise of the wider web archiving community and of researchers.

To emphasize this aspect of outreach to researchers and promoting awareness of web archiving, the Steering Committee has in recent years opted to formalize the “open days” as a distinct event—the IIPC Web Archiving Conference. The 2016 event was the first to thus distinguish the General Assembly from the Web Archiving Conference, and thereafter, at the suggestion of that PC’s Chair (Kristinn Sigurðsson, National and University Library of Iceland), planning responsibility for the different event components became more distributed: the GA program would be determined by the Steering Committee Officers and Portfolio Leads and the Working Group Chairs; a mostly local Organizing Committee would see to the logistical planning of securing a venue and catering and possible sponsors; and the Web Archiving Conference program would be developed by a Program Committee. The 2017 Program Committee (chaired by Nicholas Taylor, Stanford University) was the first to include some non-IIPC members, and their CFP was the first to attract more relevant submissions than we had space to accept, a milestone in the maturation of the conference.

Work of the 2018 Program Committee

Co-chaired by Jan Hutař (Archives New Zealand) and Paul Koerbin (National Library of Australia), the 12-member 2018 Program Committee started work in November 2017. Our first task was drafting a call for papers, which involved first discussing whether the conference would have a stated theme and the types (presentations, panels, workshops, tutorials) of submission proposals we’d ask for and the nature of the submission (abstracts? full papers?). We needed a flexible theme that would acknowledge the IIPC’s milestone 15th anniversary and the value of our collective work preserving the web so far, while embracing creative new approaches to the evolving challenges we face. In his draft CFP, Paul Koerbin hit on “Web Archiving Histories and Futures” and we ran with that. And as the Wellington event will be the first GA/WAC held in Australasia in 10 years, we especially encouraged submissions related to Asia/Pacific web archiving activities.

To encourage submissions from all types of web archiving practitioners and users, in the CFP we further listed some suggested topics, under the rubrics of “building web archives,” “maintaining web archive content and operations,” “using and researching web archives,” and “web archive histories and futures.” And we opted to ask applicants to submit abstracts only rather than full papers, both to lower the barriers to application in order to get more submissions, and to allow all Program Committee members to consider (and vote on) all submissions, rather than assigning reviewers to specific papers. Once the CFP was ready, PC members worked hard to distribute it to a wide selection of mailing lists, reaching beyond IIPC members and other cultural memory institutions to also get submissions from independent researchers.

This strategy worked (boosted no doubt by the intrinsic appeal of visiting Wellington!), as we received a record number of submissions for the WAC, submitted through EasyChair. The breadth and depth of interesting submissions allowed us to build a strong program–while unfortunately having to reject some relevant proposals. Each committee member read all the submitted abstracts and rated each one on a 3-point scale, yielding cumulative point averages for each submission from which the committee could decide which submissions would be accepted for the conference. In order to know how many submissions could be accepted we first had to consider how much conference schedule time we had available, which would depend in part on whether we would have multiple tracks.
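The scoring step described above can be sketched in a few lines of Python. This is an illustration only, not the committee's actual procedure: the 1 to 3 value range, the fixed number of slots, the function name and the sample scores are assumptions, and in practice the averages informed a committee decision and were not applied as a mechanical cutoff.

```python
from statistics import mean


def rank_submissions(ratings, slots):
    """Split submissions into (accepted, remaining) by average rating.

    ratings maps a submission id to the list of scores it received,
    one score per committee member (assumed here to be 1, 2 or 3).
    """
    averages = {sid: mean(scores) for sid, scores in ratings.items()}
    # Highest average first; ties keep their original order.
    ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    return ranked[:slots], ranked[slots:]


# Hypothetical scores from three committee members.
ratings = {
    "A": [3, 3, 2],
    "B": [1, 2, 1],
    "C": [2, 2, 3],
}

accepted, remaining = rank_submissions(ratings, slots=2)
print([sid for sid, _ in accepted])
print([sid for sid, _ in remaining])
```

The `slots` value stands in for the schedule constraint mentioned above: how many submissions fit depends on the time available and the number of tracks.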

We decided the program would have a mix of plenary talks and usually two tracks of presentations or workshops, and Olga Holownia (IIPC Program & Communication Officer) provided a range of detailed schedule templates for us to use to figure out how many individual presentations, panels, and workshops we’d have room for. We then began grouping accepted proposals into thematic sessions, loosely conceived as more-technical and less-technical tracks, in order to reduce (though not eliminate) the frustration of attendees wishing they could be in both tracks at once. Committee members then divided up the responsibility of serving as session chairs, to introduce the speakers and keep the sessions running on time.

Between the tasks of preparing the CFP and evaluating the submissions and shaping them into a program, the committee had the additional enjoyable responsibility of brainstorming possible keynote speaker candidates. Committee members suggested over two dozen possible keynoters, voted on them, and eventually submitted a few outstanding candidates to the Organizing Committee for their consideration. The Organizing Committee took these suggestions and added others based on their familiarity with the Australasian digital library and academic scene and delivered two exciting keynote speakers – Wendy Seltzer (World Wide Web Consortium) and Rachael Ka’ai-Mahuta (Te Ipukarea, the National Maori Language Institute, Auckland Institute of Technology) – and an additional plenary talk from Vint Cerf (Google). With these and many other talented contributors from within and beyond IIPC member institutions, the 2018 IIPC Web Archiving Conference looks to be a rich and stimulating event.

Register now!

Serving on the WAC Program Committee is a great opportunity to work directly with IIPC colleagues and other web archiving enthusiasts. And the work continues – you can volunteer now to serve on the Program Committee and start shaping the 2019 IIPC WAC.

A personal reflection on the IIPC WAC

By Gillian Lee, Coordinator, Web Archives at the National Library of New Zealand, Member of the IIPC Steering Committee and the WAC Program Committee

This year I’ve had the privilege of being part of the programme committee for IIPC WAC. Reading through the abstracts that many of you sent in gave me a real sense of excitement about the work that we are all involved in. That caused me to reflect on the benefits of the IIPC conference and what it means to us as members. Some of you might attend these conferences on a regular basis, others may never have had that opportunity.

I’ve been web archiving for 11 years and have been fortunate to attend 3 IIPC conferences during that time. It’s rare for me to attend a conference that’s actually about the work I do, so I really value those times! It’s an opportunity to finally meet people who were formerly just names on mailing lists and blog posts. Getting together with other web archivists is invaluable, whether it’s talking to someone who is just starting out in the web archiving world, sharing the struggles of budget constraints, or learning more about what members are doing. You can’t beat that!

Even in this digital age it’s easy to feel isolated here in New Zealand when we hear so much about web archiving developments, especially in Europe and the States. There’s only so much you can learn from emails, blog posts and the odd webinar that’s not scheduled for 2am NZ time!!

Despite the distance we have collaborated with other IIPC members over the years. Back in 2006 the National Library of New Zealand worked with the British Library to build Web Curator Tool (WCT). The BL have moved on and developed other tools since then, and this year we’ve collaborated with the National Library of the Netherlands on a major upgrade to WCT. Kees Teszelszky blogged about this recently. You can find out more about it during the IIPC conference in Wellington in November.

We’ve also been involved with the Content Development Working Group by submitting seed lists to collaborative collections such as the Olympic Games, World War One Commemoration and the News around the World project. If you’re new to IIPC, do consider getting involved in one of the IIPC groups.

We’re really excited to be hosting IIPC this year and look forward to meeting you all in person! A number of my colleagues have never had the chance to attend an IIPC conference, so they’re in for a treat! See you soon!

National Library of New Zealand, Photo by Mark Beatty / CC BY-NC 3.0 NZ.