Collaborate to develop web archive collections with Cobweb!

By Kathryn Stine, Manager, Digital Content Development and Strategy at the California Digital Library

Cobweb is a recently launched collaborative collection development platform for web archives, now available for anyone to use to establish and participate in web archiving collecting projects at https://cobwebarchive.org. A cross-institutional team from UCLA, the California Digital Library (CDL), and Harvard University has developed Cobweb, which was made possible in part by funding from the United States Institute of Museum and Library Services and initially hosted by CDL. We’ve been encouraged by the enthusiasm and engagement that’s met Cobweb and look forward to supporting a range of collaborative and coordinated web archiving collecting projects with this new platform.

Peter Broadwell & Kathryn Stine introducing Cobweb at the Web Archiving Conference in Wellington (slides).

At the 2018 IIPC Web Archiving Conference in New Zealand, Cobweb tutorial attendees played with Cobweb functionality and provided useful feedback and ideas for platform refinements and future feature options. Thank you to all who have shared their suggestions for advancing Cobweb! A number of demonstration projects are now on the platform that showcase how Cobweb supports web archiving collection development activities, including nominating web resources to a project and claiming intentions for, and following through with, archiving nominated web content. Additionally, the extensive Archive of the California Government Domain (CA.gov) has been established as a Cobweb collecting project and the CA.gov team is considering how to integrate Cobweb into its collection development workflows.

Cobweb centralizes the often distributed activities that go into developing web archive collections, allowing for multiple contributors and organizations to work together towards realizing common collecting goals. The coordinated activities that result in rich, useful web archive collections can draw upon distinct areas of expertise or capacity including subject specialization, technical facility with content capture, and resources for storing and managing content. The Cobweb platform is well-suited to supporting curated and crowdsourced collection building, from complex, multi-partner initiatives to local efforts that require coordination, such as that between digital archivists and library subject selectors.

If you have web archiving collecting goals that can benefit from engaging in collaborative and/or coordinated participation, learn more about getting started with Cobweb by visiting https://cobwebarchive.org/getting_started, checking out the Cobweb presentation from the IIPC WAC, or by emailing cobwebarchive[at]gmail.com.

Web Archiving Down Under: Relaunch of the Web Curator Tool at the IIPC conference, Wellington, New Zealand

Kees Teszelszky, Curator Digital Collections at the National Library of the Netherlands/Koninklijke Bibliotheek (with input from Hanna Koppelaar, Jeffrey van der Hoeven – KB-NL, Ben O’Brien, Steve Knight and Andrea Goethals – National Library of New Zealand)

Hanna Koppelaar, KB & Ben O’Brien, NLNZ. IIPC Web Archiving Conference 2018. Photo by Kees Teszelszky

The Web Curator Tool (WCT) is a globally used workflow management application designed for selective web archiving in digital heritage collecting organisations. Version 2.0 of the WCT is now available on GitHub. This release is the product of a collaborative development effort started in late 2017 between the National Library of New Zealand (NLNZ) and the National Library of the Netherlands (KB-NL). The new version was previewed during a tutorial at the IIPC Web Archiving Conference on 14 November 2018 at the National Library of New Zealand in Wellington, New Zealand. Ben O’Brien (NLNZ) and Hanna Koppelaar (KB-NL) presented the new features of the WCT and showed how to work collaboratively on opposite sides of the world in front of an audience of more than 25 spectators.

The tutorial highlighted that part of our road map for this version has been dedicated to improving the installation and support of WCT. We recognised that the majority of requests for support were related to database setup and application configuration. To improve this experience we consolidated and refactored the setup process, correcting ambiguities and misleading documentation. Another component of this improvement was the migration of our documentation to the readthedocs platform (found here), making the content more accessible and the process of updating it a lot simpler. This has replaced the PDF versions of the documentation, but not the GitHub wiki. The wiki content will be migrated where we see fit.

A guide on how to install WCT can be found here; a video can be found here.

1) WCT Workflow

One of the objectives in upgrading the WCT was to raise it to a level where it could keep pace with the requirements of archiving the modern web. The first step in this process was decoupling the integration with the old Heritrix 1 web crawler, and allowing the WCT to harvest using the more modern Heritrix 3 (H3) version. This work started as a proof-of-concept in 2017, which did not include any configuration of H3 from within the WCT UI. A single H3 profile was used in the backend to run H3 crawls. Today H3 crawls are fully configurable from within the WCT, mirroring the existing profile management that users had with Heritrix 1.

2) 2018 Work Plan Milestones

The second step in this process of raising the WCT up is a technical uplift. Over the past six or seven years, the software has fallen into a period of neglect, with mounting technical debt. The tool is sitting atop outdated and unsupported libraries and frameworks. Two of those frameworks are Spring and Hibernate. The feasibility of this upgrade has been explored through a successful proof-of-concept. We also want to make the WCT much more flexible and less coupled by exposing each component via an API layer. In order to make that API development much easier we are looking to migrate the existing SOAP API to REST and to change components so they are less dependent on each other.

Currently the Web Curator Tool is tightly coupled with the Heritrix crawler (H1 and H3). However, other crawl tools exist and the future will bring more. The third step is re-architecting WCT to be crawler agnostic. The abstracting out of all crawler-specific logic allows for minimal development effort to integrate new crawling tools. The path to this stage has already been started with the integration of Heritrix 3, and will be further developed during the technical uplift.
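To make the crawler-agnostic idea concrete, here is a minimal sketch of what such an abstraction layer could look like. It is written in Python purely for illustration and is not taken from the WCT code base; every name in it (`Crawler`, `Heritrix3Crawler`, `NullCrawler`, `HarvestCoordinator`, `start_job`, `status`) is hypothetical, and the adapters only record jobs in memory rather than talking to a real crawler.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class Crawler(ABC):
    """Hypothetical interface: the only crawler API the workflow layer sees."""

    @abstractmethod
    def start_job(self, job_name: str, seeds: list[str]) -> None:
        """Begin harvesting the given seed URLs under job_name."""

    @abstractmethod
    def status(self, job_name: str) -> str:
        """Return a short status string for job_name."""


class Heritrix3Crawler(Crawler):
    """Stand-in adapter. A real one would talk to an H3 instance;
    this one only remembers which jobs were started."""

    def __init__(self) -> None:
        self._jobs: dict[str, list[str]] = {}

    def start_job(self, job_name: str, seeds: list[str]) -> None:
        self._jobs[job_name] = list(seeds)

    def status(self, job_name: str) -> str:
        return "RUNNING" if job_name in self._jobs else "UNKNOWN"


class NullCrawler(Crawler):
    """A second hypothetical adapter, showing that adding a crawler
    means adding one class and changing nothing in the coordinator."""

    def start_job(self, job_name: str, seeds: list[str]) -> None:
        pass  # deliberately does nothing

    def status(self, job_name: str) -> str:
        return "SKIPPED"


class HarvestCoordinator:
    """Workflow layer: contains no crawler-specific logic."""

    def __init__(self) -> None:
        self._crawlers: dict[str, Crawler] = {}

    def register(self, name: str, crawler: Crawler) -> None:
        self._crawlers[name] = crawler

    def harvest(self, crawler_name: str, job_name: str, seeds: list[str]) -> str:
        crawler = self._crawlers[crawler_name]
        crawler.start_job(job_name, seeds)
        return crawler.status(job_name)


coordinator = HarvestCoordinator()
coordinator.register("h3", Heritrix3Crawler())
coordinator.register("null", NullCrawler())
print(coordinator.harvest("h3", "demo-job", ["https://example.org/"]))
```

The point of the design is that the coordinator depends only on the abstract interface, so integrating a new crawling tool should require writing one adapter rather than touching the workflow code.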

More detail about future milestones can be found in the Web Curator Tool Developer Guide in the appropriately titled section Future Milestones. This section will be updated as development work progresses.

3) Diagram showing the relationships between different Web Curator Tool components

We are conscious that there are long-time users on various old versions of WCT, as well as regular downloads of those older versions from the old SourceForge repository (soon to be deactivated). We would like to encourage users of older versions to start using WCT 2.0 and to reach out for support in upgrading. The primary channels for contact are the WCT Slack group and the GitHub repository. We hope that WCT will be widely used by the web archiving community in future and will have a large development and support base. Please contact us if you are interested in cooperating! See the Web Curator Tool Developer Guide for more information about how to become involved in the Web Curator Tool community.

WCT facts

The WCT is one of the most commonly used open-source enterprise solutions for web archiving. It was developed in 2006 as a collaborative effort between the National Library of New Zealand and the British Library, initiated by the International Internet Preservation Consortium (IIPC), as can be read in the original documentation. Since January 2018 it has been upgraded through collaboration with the Koninklijke Bibliotheek – National Library of the Netherlands. The WCT is open-source and available under the terms of the Apache License. The project was moved in 2014 from SourceForge to GitHub. The latest release of the WCT, v2.0, is available now. It has an active user forum on GitHub and Slack.

Digging in Digital Dust: Internet Archaeology at KB-NL in the Netherlands

By Peter de Bode and Kees Teszelszky

The Dutch .nl ccTLD is the third biggest national top-level domain in the world and consists of 5.68 million URLs, according to the Dutch SIDN. The first website of the Netherlands was published on the web in 1992: it was the third website on the World Wide Web. Web archiving in the Netherlands started in 2000 with the project Archipol in Groningen. The Koninklijke Bibliotheek | National Library of The Netherlands (KB-NL) started web archiving with a selection of Dutch websites in 2007. The KB not only selects and harvests these sites, but also develops a strategy to ensure their long-term usability. As the Netherlands lacks a legal deposit law, the KB cannot crawl the Dutch national domain. KB uses the Web Curator Tool (WCT) to conduct its harvests. From January 2018 onwards, the National Library of New Zealand (NLNZ) has been collaborating with KB-NL to upgrade this tool and add new features to make the application future-proof.

As of 2011, the Dutch web archive is available in the KB reading rooms. In addition, researchers may request access to the data for specific projects. Between 2012 and 2016 the research project WebArt was carried out. As of November 2018, 15,000 websites have been selected. The Dutch web archive contains about 37 terabytes of data.

On the occasion of World Digital Preservation Day, KB unveiled a special internet archaeology collection, Euronet-Internet (1994-2017) [in Dutch: Webcollectie internetarcheologie Euronet]. It is made up of archived websites hosted by internet provider Euronet-Internet between 1994 and 2017. Work on the collection started in 2017 and ended in 2018. Identification of websites for harvest is done by Peter de Bode and Kees Teszelszky as part of the larger KB web archiving project “internet archaeology.” Euronet is one of the oldest internet providers in the Netherlands (1994) and has since been bought by Online.nl. Priority is given to websites published in the early years of the Dutch web (1994-2000).

These sites can be considered “web incunables”, as they are among the first born-digital publications on the Dutch web. Some of the digital treasures from this collection are the oldest website of a national political party, a virtual bank building and several sites of internet pioneers dating from 1995. Information about the collection and its heritage value can be found on a special dataset page of KB-Lab and in a collection description (in Dutch). The collection can be studied on the terminals in the reading room of KB with a valid library card. Researchers can also use the dataset with URLs and a link analysis.

IIPC Steering Committee Election 2019

The nomination process for IIPC Steering Committee is now open.

The Steering Committee is the executive body of the IIPC, currently comprising 15 member organisations, which take a leadership role in the high-level strategic planning, development and management of programs, policy creation, overall administration, and contribution to IIPC Portfolios and other activities.

What is at stake?

Serving on the Steering Committee is an opportunity for motivated members to help guide the IIPC’s mission of improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. Steering Committee members are expected to take an active role in leadership, contribute to SC and Portfolio activities, and help guide and administer the organisation.

Who can run for election?

Serving on the Steering Committee is open to any current IIPC member and we strongly encourage any organisation interested in serving on the Steering Committee to nominate themselves for election. SC members are elected for 3 years and meet twice a year in person (once during the General Assembly and once in September or October), plus two or more additional times by teleconference.

Please note that the nomination should be on behalf of an organisation, not an individual. Once elected, the member organisation designates a representative to serve on the Steering Committee. The list of current SC member organisations is available on the IIPC website.

How to run for election?

All nominee institutions, both new and existing members whose term is expiring but are interested in continuing to serve, are asked to write a short statement (max 200 words) outlining their vision for how they would contribute to IIPC via serving on the Steering Committee. Statements can point to past contributions to the IIPC or the SC, relevant experience or expertise, new ideas for advancing the organisation, or any other relevant information.

All statements will be posted online and emailed to members prior to the election with ample time for review by all membership. The results will be announced in April and the three-year term on the Steering Committee will start on 1 June.

Below you will find the election calendar. We are very much looking forward to receiving your nominations. If you have any questions, please contact the IIPC PCO.


Election Calendar

  • 12 November to 1 March: Members are invited to nominate themselves by sending an email including a statement to the IIPC Programme and Communications Officer.
  • 1 March: Nominees’ statements are published on the Netpreserve Blog and Members mailing list. Nominees are encouraged to campaign through their own networks.
  • 3 March to 31 March: Members are invited to vote online. An online voting tool will be used to conduct the vote. The PCO will monitor the vote, ensuring that each organisation votes only once for all nominated seats and that the vote is cast by the organisation’s official representative.
  • 31 March: Voting ends.
  • 1 April: The results of the vote are announced officially on the Netpreserve blog and Members mailing list.
  • 1 June: End/start of SC members’ terms. The newly elected SC members start their term on the 1st of June and are invited to attend their first SC meeting, either in person or by teleconference, before the GA in Zagreb on the 4th of June 2019.

 

Online Hours: Supporting Open Source

By Andrew Jackson, Web Archiving Technical Lead at the British Library

At the UK Web Archive, we believe in working in the open, and that organisations like ours can achieve more by working together and pooling our knowledge through shared practices and open source tools. However, we’ve come to realise that simply working in the open is not enough – it’s relatively easy to share the technical details, but less clear how to build real collaborations (particularly when not everyone is able to release their work as open source).

To help us work together (and maintain some momentum in the long gaps between conferences or workshops), we were keen to try something new, and hit upon the idea of Online Hours. It’s simply a regular web conference slot (organised and hosted by the IIPC, but open to all) which can act as a forum for anyone interested in collaborating on open source tools for web archiving. We’ve been running for a while now, and have settled on a rough agenda:

Full-text indexing:
– Mostly focussing on our Web Archive Discovery toolkit so far.

Heritrix3:
– including Heritrix3 release management, and the migration of Heritrix3 documentation to the GitHub wiki.

Playback:
– covering e.g. SolrWayback as well as OpenWayback and pywb.

AOB/SOS:
– for Any Other Business, and for anyone to ask for help if they need it.

This gives the meetings some structure, but is really just a starting point. If you look at the notes from the meetings, you’ll see we’ve talked about a wide range of technical topics, e.g.

  • OutbackCDX features and documentation, including its API;
  • web archive analysis, e.g. via the Archives Unleashed Toolkit;
  • summary of technologies so we can compare how we do things in our organisations, to find out which tools and approaches are shared and so might benefit from more collaboration;
  • coming up with ideas for possible new tools that meet a shared need in a modular, reusable way and identify potential collaborative projects.

The meeting is weekly, but we’ve attempted to make the meetings inclusive by alternating the specific time between 10am and 4pm (GMT). This doesn’t catch everyone who might like to attend, but at the moment I’m personally not able to run the call at a time that might tempt those of you on Pacific Standard Time. Of course, I’m more than happy to pass the baton if anyone else wants to run one or more calls at a more suitable time.

If you can’t make the calls, please consider:

My thanks go to everyone who has come along to the calls so far, and to IIPC for supporting us while still keeping it open to non-members.

Maybe see you online?

Web Archivists, Assemble!

By Alex Thurman, Columbia University Libraries, Member of the IIPC Steering Committee and the WAC Program Committee (2016-2018), Co-Chair of the Content Development Group

The IIPC General Assembly & Web Archiving Conference is the professional gathering I anticipate most eagerly each year. In an energizing atmosphere of international cooperation, web curators, librarians, archivists, tool developers, computer scientists, and academic researchers from member organizations and beyond meet to share experiences and best practices and plan projects to tackle the collective challenge of preserving web resources.

I’ve had the good fortune of attending each year since 2012, and for the past three years I’ve also had the rewarding experience of serving on the program committees planning these events. As we look forward to the exciting upcoming 2018 conference in Wellington, New Zealand, here is some background on the recent evolution of the GA/WAC and the work of the 2018 WAC Program Committee.

Recent background

2018 marks the fifteenth anniversary of the IIPC, and the twelfth consecutive year that members of the IIPC will come together in an annual General Assembly. The IIPC Steering Committee has striven to cycle (loosely, as dependent on members volunteering to host the event) the venue of the GA/WAC in alternate years between Europe, North America and Australasia. And from the start, the GA event programs have combined days reserved for IIPC members (focused on Consortium planning and working group activities) with one or more open days to welcome the perspectives and expertise of the wider web archiving community and of researchers.

To emphasize this aspect of outreach to researchers and promoting awareness of web archiving, the Steering Committee has in recent years opted to formalize the “open days” as a distinct event—the IIPC Web Archiving Conference. The 2016 event was the first to thus distinguish the General Assembly from the Web Archiving Conference, and thereafter, at the suggestion of that PC’s Chair (Kristinn Sigurðsson, National and University Library of Iceland), planning responsibility for the different event components became more distributed: the GA program would be determined by the Steering Committee Officers and Portfolio Leads and the Working Group Chairs; a mostly local Organizing Committee would see to the logistical planning of securing a venue and catering and possible sponsors; and the Web Archiving Conference program would be developed by a Program Committee. The 2017 Program Committee (chaired by Nicholas Taylor, Stanford University) was the first to include some non-IIPC members, and their CFP was the first to attract more relevant submissions than we had space to accept, a milestone in the maturation of the conference.

Work of the 2018 Program Committee

Co-chaired by Jan Hutař (Archives New Zealand) and Paul Koerbin (National Library of Australia), the 12-member 2018 Program Committee started work in November 2017. Our first task was drafting a call for papers, which involved first discussing whether the conference would have a stated theme and the types (presentations, panels, workshops, tutorials) of submission proposals we’d ask for and the nature of the submission (abstracts? full papers?). We needed a flexible theme that would acknowledge the IIPC’s milestone 15th anniversary and the value of our collective work preserving the web so far, while embracing creative new approaches to the evolving challenges we face. In his draft CFP, Paul Koerbin hit on “Web Archiving Histories and Futures” and we ran with that. And as the Wellington event will be the first GA/WAC held in Australasia in 10 years, we especially encouraged submissions related to Asia/Pacific web archiving activities.

To encourage submissions from all types of web archiving practitioners and users, in the CFP we further listed some suggested topics, under the rubrics of “building web archives,” “maintaining web archive content and operations,” “using and researching web archives,” and “web archive histories and futures.” And we opted to ask applicants to submit abstracts only rather than full papers, both to lower the barriers to application in order to get more submissions, and to allow all Program Committee members to consider (and vote on) all submissions, rather than assigning reviewers to specific papers. Once the CFP was ready, PC members worked hard to distribute it to a wide selection of mailing lists, reaching beyond IIPC members and other cultural memory institutions to also get submissions from independent researchers.

This strategy worked (boosted no doubt by the intrinsic appeal of visiting Wellington!), as we received a record number of submissions for the WAC, submitted through EasyChair. The breadth and depth of interesting submissions allowed us to build a strong program–while unfortunately having to reject some relevant proposals. Each committee member read all the submitted abstracts and rated each one on a 3-point scale, yielding cumulative point averages for each submission from which the committee could decide which submissions would be accepted for the conference. In order to know how many submissions could be accepted we first had to consider how much conference schedule time we had available, which would depend in part on whether we would have multiple tracks.
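For readers curious how such a tally works in practice, the sketch below averages each submission's ratings and keeps the top-scoring ones for the available slots. It is only an illustration under stated assumptions: the scale is taken to be 1 to 3 with 3 as the strongest rating, the submission IDs and scores are invented, and the helper names `rank_submissions` and `accept` are hypothetical. It does not describe the committee's actual tooling or cut-off rule.

```python
from statistics import mean


def rank_submissions(ratings):
    """Map {submission_id: [scores]} to a list of (id, average), best first.

    Assumes a 3-point scale where 1 is weakest and 3 is strongest.
    """
    averages = {}
    for submission_id, scores in ratings.items():
        if any(score not in (1, 2, 3) for score in scores):
            raise ValueError(f"score outside the 3-point scale for {submission_id}")
        averages[submission_id] = mean(scores)
    # Highest average first; sorted() is stable, so ties keep input order.
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


def accept(ranked, slots):
    """Return the IDs of the top `slots` submissions from a ranked list."""
    return [submission_id for submission_id, _ in ranked[:slots]]


# Invented example data: three submissions, each rated by three reviewers.
example_ratings = {
    "abstract-A": [3, 3, 2],
    "abstract-B": [1, 2, 2],
    "abstract-C": [3, 2, 2],
}
ranked = rank_submissions(example_ratings)
print(accept(ranked, slots=2))
```

The number of slots is the piece that depends on the schedule, which is why the available conference time had to be settled before the cut could be made.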

We decided the program would have a mix of plenary talks and usually two tracks of presentations or workshops, and Olga Holownia (IIPC Program & Communication Officer) provided a range of detailed schedule templates for us to use to figure out how many individual presentations, panels, and workshops we’d have room for. We then began grouping accepted proposals into thematic sessions, loosely conceived as more-technical and less-technical tracks, in order to reduce (though not eliminate) the frustration of attendees wishing they could be in both tracks at once. Committee members then divided up the responsibility of serving as session chairs, to introduce the speakers and keep the sessions running on time.

Between the tasks of preparing the CFP and evaluating the submissions and shaping them into a program, the committee had the additional enjoyable responsibility of brainstorming possible keynote speaker candidates. Committee members suggested over two dozen possible keynoters, voted on them, and eventually submitted a few outstanding candidates to the Organizing Committee for their consideration. The Organizing Committee took these suggestions and added others based on their familiarity with the Australasian digital library and academic scene and delivered two exciting keynote speakers – Wendy Seltzer (World Wide Web Consortium) and Rachael Ka’ai-Mahuta (Te Ipukarea, the National Māori Language Institute, Auckland University of Technology) – and an additional plenary talk from Vint Cerf (Google). With these and many other talented contributors from within and beyond IIPC member institutions, the 2018 IIPC Web Archiving Conference looks to be a rich and stimulating event.

Register now!

Serving on the WAC Program Committee is a great opportunity to work directly with IIPC colleagues and other web archiving enthusiasts. And the work continues – you can volunteer now to serve on the Program Committee and start shaping the 2019 IIPC WAC.

A personal reflection on the IIPC WAC

By Gillian Lee, Coordinator, Web Archives at the National Library of New Zealand, Member of the IIPC Steering Committee and the WAC Program Committee

This year I’ve had the privilege of being part of the programme committee for IIPC WAC. Reading through the abstracts that many of you sent in gave me a real sense of excitement about the work that we are all involved in. That caused me to reflect on the benefits of the IIPC conference and what it means to us as members. Some of you might attend these conferences on a regular basis, others may never have had that opportunity.

I’ve been web archiving for 11 years and have been fortunate to attend 3 IIPC conferences during that time. It’s rare for me to attend a conference that’s actually about the work I do, so I really value those times! It’s an opportunity to finally meet people who were formerly just names on mailing lists and blog posts. Getting together with other web archivists is invaluable, whether it’s talking to someone who is just starting out in the web archiving world, sharing the struggles of budget constraints, or learning more about what members are doing. You can’t beat that!

Even in this digital age it’s easy to feel isolated here in New Zealand when we hear so much about web archiving developments, especially in Europe and the States. There’s only so much you can learn from emails, blog posts and the odd webinar that’s not scheduled for 2am NZ time!!

Despite the distance, we have collaborated with other IIPC members over the years. Back in 2006 the National Library of New Zealand worked with the British Library to build the Web Curator Tool (WCT). The BL have moved on and developed other tools since then, and this year we’ve collaborated with the National Library of the Netherlands on a major upgrade to WCT. Kees Teszelszky blogged about this recently. You can find out more about it during the IIPC conference in Wellington in November.

We’ve also been involved with the Content Development Working Group by submitting seed lists to collaborative collections such as the Olympic Games, World War One Commemoration and the News around the World project. If you’re new to IIPC, do consider getting involved in one of the IIPC groups.

We’re really excited to be hosting IIPC this year and look forward to meeting you all in person! A number of my colleagues have never had the chance to attend an IIPC conference, so they’re in for a treat! See you soon!

National Library of New Zealand, Photo by Mark Beatty / CC BY-NC 3.0 NZ.