2018 Winter Olympics Collection Building – Get Involved!

By Helena Byrne, Curator of Web Archives, The British Library

The International Internet Preservation Consortium Content Development Group (IIPC CDG) would like your help to archive websites from around the world related to the 2018 Winter Olympic and Paralympic Games.

The IIPC has members in 33 countries but there are over 100 countries competing in the Games and we need your help to ensure that these countries are represented in the collection. The IIPC CDG has been building web archive collections on the Olympic and the Paralympic Games since 2010. The 2016 Summer Games was the first time they actively collected content related to activities both on and off the playing field.* The final 2018 Winter Games collection will be published here: https://archive-it.org/home/IIPC

What we want to collect:

Public platforms in various formats such as:

  • Websites
  • Subsections of websites with an Olympic tag
  • Individual Articles
  • News Reports
  • Blogs and Social Media

The subjects covered on these sites can include, but are not limited to:

  • Athletes/Teams
  • Computer Games (eGames)
  • Doping/Cheating and Corruption
  • Environmental Issues
  • Fandom
  • Gender Issues (e.g. media coverage, testosterone levels, etc.)
  • General News/Commentary
  • Olympic/Paralympic Venues
  • Security
  • Sports Events
  • US/North Korean Relations
  • Other

How to get involved:

Once you have selected the web pages you would like to see in the collection, it takes less than 5 minutes to fill in the submission form.

https://goo.gl/forms/UwxiBg5klE6I7Z7g1

For more information and updates you can contact the IIPC CDG team via email (2018-winter-olympics [at] iipc.simplelists .com) or follow the collection hashtag #WAOlympics2018


* 2016 Olympics collection round-up


User experience in the online archive of internet art

A guest blog post by Lozana Rossenova, a collaborative doctoral student with the Centre for the Study of the Networked Image (London, UK) and Rhizome (New York, USA). Lozana’s PhD is supported by the AHRC Collaborative Doctoral Awards 2016.

The evolution of network environments and the development of new patterns of interaction between users and online interfaces create multiple challenges for the long-term provision of access to online artefacts of cultural value. In the case of internet art, curating and archiving activities are contingent upon addressing the question of what constitutes the art object. Internet artworks are not single digital objects, but rather assemblages, dependent on specific software and network environments to be executed and rendered.

My research project seeks to better understand problems associated with the archiving of internet art: how can the artworks be made accessible to the public in their native environment – online – while enabling users of the archive to gain an expanded understanding of the artworks’ context?

User experience and the ArtBase

In the fields of user experience design and human computer interaction (HCI), there has been substantial research done around issues of discoverability, accessibility and usability in digital archives, but the studies have focused primarily on archives with digitised born-analogue text- or image-based documents. Presentation and contextualisation in archives of complex born-digital artefacts, on the other hand, have been discussed much less, particularly from the point of view of the user’s experience.

Unlike digitised texts or images, internet art spans beyond the boundaries of a single object and oftentimes references external, dynamic and real-time data sources, or exists across multiple locations and platforms. Rhizome has recognised the inherent vulnerability of internet art since its inception as an organisation and community-building platform in 1996. The ArtBase was established in 1999 as an online space to present and archive internet art. Initial strategies towards presentation of artworks in the ArtBase reflected contemporaneous developments in the fields of interaction design and digital preservation. More recently the archival system has struggled to accommodate the growing number and variety of artworks in the ArtBase. Providing a consistent user experience in making artworks accessible brings additional challenges and requires further research into how users encounter and interact with archives of web-based artefacts.

Figure 1: View of Rhizome’s ArtBase from 2011 when the archive already had over 2000 artworks.

Beyond preservation challenges – such as an artwork’s technical dependencies on specific network protocols, web standards, or browser plugins – various interface design elements and conventions change over time. These influence how users navigate, interact with and understand context within the networked artwork. Interaction patterns and interface elements, such as frames, check-boxes and scrollbars, could all significantly impact or potentially change, and even render defunct, the user experience of an artwork. Examples that illustrate this clearly include works such as Jan Robert Leegte’s Scrollbar Composition (2000) or Alexei Shulgin’s Form Art (1997) [See Figures 2, 3]. Given these circumstances, new preservation and presentation paradigms are needed in order for the online archive of internet art to be able to provide access not only to an artwork’s html and css code, but also to the contextualised experience of the work.

Figure 2: View of Alexei Shulgin, Form Art (1997) in Rhizome’s ArtBase.
Figure 3: Presentation of Alexei Shulgin, Form Art (1997) in a remote Netscape 3.0 browser. The contextual presentation is crucial for retaining the original look and capacity for interaction of the work.

Web archiving and remote browsers

Recognising the limitations in the current archival framework to provide adequate access to a large number of historic artworks, increasingly the focus of preservation efforts at Rhizome has been on building tools to support the presentation of complex artworks with multiple dependencies. Recent developments in browser-based emulation and web archiving tools have been instrumental in facilitating the restoration and re-performance of important internet artworks, which have been presented as instalments in Rhizome’s major new curatorial project – Net Art Anthology.

Figure 4: First chapter in the Net Art Anthology online exhibition.

The remote browsing technology, first introduced in Rhizome’s oldweb.today project to emulate old browser environments, has facilitated the online presentation of historic internet artworks in contemporaneous environments, such as Netscape Navigator or early versions of Internet Explorer. Furthermore, the capacity to create high-fidelity archives of the dynamic web with Rhizome’s browser-based archiving tool, Webrecorder, has enabled the preservation of artworks based on third-party web services, such as Instagram and Yelp.

Figure 5: Choice of browsers and operating systems in oldweb.today.

Presenting artworks inside browsers running in Docker containers allows for the restaging of historic artworks in the original environments in which users encountered them, thereby providing oftentimes crucial contextual information to contemporary audiences (see reference to Form Art above). Meanwhile, the remote browsers in Webrecorder provide an environment for the recording and replaying of various internet artworks including ones that use Flash or Java, which are unsupported in the most recent versions of major browsers like Chrome, Firefox, Safari or Microsoft Edge.

Figure 6: Webrecorder archive of Dragan Espenschied, Bridging the Digital Divide (2003). Website with Java Applet.

Next steps

Recent developments in Rhizome’s preservation practices indicate that the online archive of internet art is not accessible or sustainable if it remains a single centralised platform. Instead, it could be reconceptualised as a resource, connected with and linking out to various instantiations of the artworks. Remote browsers, in particular, could become a powerful tool allowing presentation of artworks either as a link out of the ArtBase page into a new page running the emulated browser, or as an embedded iframe within the ArtBase page of the artwork. In each of these cases, users would encounter a “browser within a browser” presentation paradigm. A potential challenge here would be users mistaking the remote browser environment for other secondary representations (a static screenshot, for instance, a device commonly used to present web-based artworks). Providing a consistent and contextualised user experience across the system used to present the artwork and the archival record of the work requires addressing such challenges. In the coming months, we will be conducting further research into interaction design patterns of ArtBase artworks and behaviour patterns of the archive’s users, which will inform a redevelopment of the ArtBase interaction design framework.

A presentation of recent developments in Rhizome’s Webrecorder tool, the remote browsers technology and strategies for augmenting web archives will take place at the IIPC/RESAW Conference (WAC) 2017 during Web Archiving Week, 12–16 June 2017.

New OpenWayback lead

By Lauren Ko, University of North Texas Libraries

In response to IIPC’s call, I have volunteered to take on a leadership role in the OpenWayback project. Having been involved with web archives since 2008 as a programmer at the University of North Texas Libraries, I expect my experience working with OpenWayback, Heritrix, and WARC files, as well as writing code to support my institution’s broader digital library initiatives, to aid me in this endeavor.

Over the past few years, the web archiving community has seen much development in the area of access related projects such as pywb, Memento, ipwb, and OutbackCDX – to name a few. There is great value in a growing number of available tools written in different languages/running in different environments. In line with this, we would like to keep the OpenWayback project’s development moving forward while it remains of use. Further, we hope to facilitate development of access related standards and APIs, interoperability of components such as index servers, and compatibility of formats such as CDXJ.
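To make the format question a little more concrete, here is a minimal sketch of reading one CDXJ index line. It assumes the layout commonly used by tools such as pywb: a sortable URL key, a 14-digit timestamp, and a JSON block of metadata, separated by single spaces. The sample line, its field values, and the helper name parse_cdxj_line are made up for illustration and are not taken from any OpenWayback deployment.

```python
import json
from typing import NamedTuple


class CdxjRecord(NamedTuple):
    key: str        # sortable URL key (SURT-style), assumed to be the first token
    timestamp: str  # 14-digit capture time, YYYYMMDDhhmmss
    fields: dict    # remaining metadata, decoded from the JSON block


def parse_cdxj_line(line: str) -> CdxjRecord:
    """Split one CDXJ line into its key, timestamp and JSON fields."""
    # Split on the first two spaces only, because the JSON block contains spaces.
    key, timestamp, json_block = line.strip().split(" ", 2)
    return CdxjRecord(key, timestamp, json.loads(json_block))


# Hypothetical sample line; every value here is invented for illustration.
sample = (
    'org,example)/ 20170612093000 '
    '{"url": "http://example.org/", "mime": "text/html", "status": "200", '
    '"filename": "example.warc.gz", "offset": "1024", "length": "2048"}'
)

record = parse_cdxj_line(sample)
print(record.key, record.timestamp, record.fields["url"])
```

Because the metadata sits in a JSON block, different tools can add their own fields without breaking readers that only need the key and timestamp, which is part of why such a format is attractive for interoperability between index servers.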

Moving OpenWayback forward will take a community. With Kristinn Sigurðsson soon relinquishing his leadership position, we are seeking a co-leader for the OpenWayback project. We also continue to need people to contribute code, provide code review, and test deployments. I hope this community will continue not only to develop access tools, but also access to those tools, encouraging and supporting newcomers via mailing lists and Slack channels as they begin building and interacting with web archives.

If your institution uses OpenWayback, please consider:

If you are interested in taking a co-leadership role in this project or are otherwise interested in helping with OpenWayback and IIPC’s access related initiatives, even if you don’t know how that might be, I welcome you to contact me by the name lauren.ko via IIPC Slack or email me at lauren.ko@unt.edu.

Rio 2016 Round Up

By Helena Byrne, Assistant Web Archivist, The British Library

The IIPC Content Development Group (CDG) 2016 Summer Olympic and Paralympic Games collection is now live http://archive-it.org/collections/7235.

The collection period ran from June to October 2016 and covered events on and off the playing field. The CDG used a combination of collaborative tools during this project as well as input from the general public.

Collection Fast Facts:

Final Number of Nominations:

In total 4,817 seeds were nominated, 4,642 from CDG members and 176 from the public nomination form.

Countries:

125 countries are covered in the collection but the number of nominations varies between the countries: it ranges from 1 to 5 seeds to a couple of hundred. The top 5 countries covered were France (681), Brazil (553), Japan (447), Great Britain (341) and Canada (327).

Languages:

34 different languages were recorded.


What’s Next?:

Quality Assurance:

Now that the collection phase of the project is over, it is hoped that we will be able to do some Quality Assurance (QA) on the archived nominations. Criteria on how to evaluate an archived website can be found here. There are two ways this will be done: the first is through the crawl reports generated by the Archive-It account, while the second is through a visual inspection of the website. The second option can be done by anyone using the collection, whether they are IIPC members or individuals interested in the web archiving process. As there are a large number of sites to look through, this would require input from people outside the CDG. Can you help us do QA on this collection?

Report an issue with the collection:

If you would like to flag any issues with the content while using the collection, you can fill in this Google Form: https://goo.gl/forms/utvyE8FztZdjFSaB3

Guidelines:

The CDG will publish a ‘Best Practice for Developing Collaborative Collections’ on the IIPC website. This will not only form the guidelines for future CDG collections but will hopefully be of use for anyone working on a collaborative project.

Target Audience:

This collection will be invaluable for web archives researchers in terms of data mining as well as researchers who focus on sports and Olympic events.

Thank you for contributing to this project. You can keep up to date with any further developments through the collection hashtag #Rio2016WA.


Collection timelines and updates:

Wanted: New Leaders for OpenWayback

By Kristinn Sigurðsson, National and University Library of Iceland

The IIPC is looking for one or two people to take on a leadership role in the OpenWayback project.

The OpenWayback project is responsible not only for the widely used OpenWayback software, but also for the underlying webarchive-commons library. In addition the OpenWayback project has been working to define access related APIs.

The OpenWayback project thus plays an important role in the IIPC’s efforts to foster the development and use of common tools and standards for web archives.

Why now?

The OpenWayback project is at a crossroads. The IIPC first took on this project three years ago with the initial objective to make the software easier to install, run and manage. This included cleaning up the code and improving documentation.

Originally this work was done by volunteers in our community. About two years ago the IIPC decided to fund a developer to work on it. The initial funding was for 16 months. With this we were able to complete the task of stabilizing the software, as evidenced by the releases of OpenWayback 2.0.0 through 2.3.0.

We then embarked on a somewhat more ambitious task to improve the core of the software. That work is now reaching a significant milestone as a new ‘CDX server’, or resource resolver, is being introduced. You can read more about that here.

This marks the end of the paid position (at least for the time being). The original 16 months wound up being spread over a somewhat longer time frame, but they are now exhausted. Currently, the National Library of Norway (who hosted the paid developer) is contributing, for free, the work to finalize the new resource resolver.

I’ve been guiding the project over the last year since the previous project leader moved on. While I was happy to assume this role to ensure that our funded developer had a functioning community, I felt like I was never able to give the project the kind of attention that is needed to grow it. Now it seems to be a good time for a change.

With the end of the paid position we are now at a point where there either needs to be a significant transformation of the project or it will likely die away, bit by bit, which is a shame bearing in mind the significance of the project to the community and the time already invested in it.

Who are we looking for?

While a technical background is certainly useful it is not a primary requirement for this role. As you may have surmised from the above, building up this community will definitely be a part of the job. Being a good communicator, manager and organizer may be far more important at this stage.

Ideally, I’d like to see two leads with complementary skill sets, technical and communications/management. Ultimately, the most important requirement is a willingness and ability to take on this challenge.

You’ll not be alone: aside from your prospective co-lead, there is an existing community to build on, notably when it comes to the technical aspects of the project. You can get a feel for the community on the OpenWayback Google Group and the IIPC GitHub page.

It would be simplest if the new leads were drawn from IIPC member institutions. We may, however, be willing to consider a non-member, especially as a co-lead, if they are uniquely suited for the position.

If you would like to take up this challenge and help move this project forward, please get in touch. My email is kristinn (at) landsbokasafn (dot) is.

There is no deadline, as such, but ideally I’d like the new leads to be in place prior to our next General Assembly in Lisbon next March.

IIPC Hackathon at the British Library: Laying a New Foundation

By Tom Cramer, Stanford University

This past week, 22-23 September 2016, members of the IIPC gathered at the British Library for a hackathon focused on web crawling technologies and techniques. The event saw 14 technologists from 12 institutions near (the UK, Netherlands, France) and far (Denmark, Iceland, Estonia, the US and Australia). The event provided a rare opportunity for an intensive, two-day, uninterrupted deep dive into how institutions are capturing web content, and to explore opportunities for advancing the state of the art.

I was struck by the breadth and depth of topics. In particular…

  • Heritrix nuts and bolts. Everything from small tricks and known issues for optimizing captures with Heritrix 3, to how people were innovating around its edges, to the history of the crawler, to a wishlist for improving it (including better documentation).
  • Brozzler and browser-based capture. Noah Levitt from the Internet Archive, and the engineer behind Brozzler, gave a mini-workshop on the latest developments, and how to get it up and running. This was one of the biggest points of interest as institutions look to enhance their ability to capture dynamic content and social media. About ⅓ of the workshop attendees went home with fresh installs on their laptops. (Also note, per Noah, pull requests welcome!)
  • Technical training. Web archiving is a relatively esoteric domain without a huge community; how have institutions trained new staff or fractionally assigned staff to engage effectively with web archiving systems? This appears to be a major, common need, and also one that is approachable. Watch this space for developments…
  • QA of web captures: as Andy Jackson of the British Library put it, how can we tip the scales of mostly manual QA with some automated processes, to mostly automated QA with some manual training and intervention?
  • An up-to-date registry of web archiving tools. The IIPC currently maintains a list of web archiving tools, but it’s a bit dated (as these sites tend to become). Just to get the list in a place where tool users and developers can update it, a working copy of this list is now in the IIPC GitHub organization. Importantly, the group decided that it might be just as valuable to create a list of dead or deprecated tools, as these can often be dead ends for new adopters. See (and contribute to) https://github.com/iipc/iipc.github.io/wiki. Updates welcome!
  • System & storage architectures for web archiving. How institutions are storing, preserving and computing on the bits. There was a great diversity of approaches here, and this is likely good fodder for a future event and more structured knowledge sharing.

The biggest outcome of the event may have been the energy and inherent value in having engineers and technical program managers spending lightly structured face time exchanging information and collaborating. The event was a significant step forward in building awareness of approaches and people doing web archiving.

IIPC Hackathon, Day 1.

This validates one of the main focal points for the IIPC’s portfolio on Tools Development, which is to foster more grassroots exchange among web archiving practitioners.

The participants committed to keeping the dialogue going, and to expanding the number of participants within and beyond IIPC. Slack is emerging as one of the main channels for technical communication; if you’d like to join in, let us know. We also expect to run multiple, smaller face-to-face events in the next year: 3 in Europe and another 2-3 in North America, with several delving into APIs, archiving time-based media, and access. (These are all in addition to the IIPC General Assembly and Web Archiving Conference on 27-30 March 2017, in Lisbon.) If you have an idea for a specific topic or would like to host an event, please let us know!

Many thanks to all the participants at the hackathon last week, and to the British Library (especially Andy Jackson and Olga Holownia) for hosting it. It provided exactly the kind of forum needed by the web archiving community to share knowledge among practitioners and to advance the state of the art.

What can IIPC do to advance tools development?

By Tom Cramer, Stanford University

The International Internet Preservation Consortium (IIPC) renewed its consortial agreement at the end of 2015. In the process, it affirmed its longstanding mission to work collaboratively to foster the implementation of solutions to collect, preserve and provide access to Internet content. To achieve this aim, the Consortium is committed to “facilitate the development of appropriate and interoperable, preferably Open Source, software and tools.”

As the IIPC sets its strategic direction for 2016 and beyond, Tools Development will feature as one of three main portfolios of activity (along with Member Engagement, and Partnerships & Outreach). At its General Assembly in Reykjavik, IIPC members held a series of break out meetings to discuss Tools Development. This blog post presents some of that discussion, and lays out the beginnings of a direction for IIPC, and perhaps the web archiving community at large, to pursue in order to build a richer toolscape.

The Current State of Tools Development within the IIPC

The IIPC has always emphasized tool development. Per its website, one of the main objectives “has been to develop a high-quality, easy-to-use open source tools for setting up a web archiving chain.” The registry of software lists an impressive array of tools for everything from acquisition and curation to storage and access. And coming from the 2016 General Assembly and Web Archiving Conference, it’s clear that there is actually quite a lot of development going on among and beyond member institutions. Despite all this, the reality may be slightly less rosy than the multitude of listings for tools for web archiving might indicate…

  • Many are deprecated, or worse, abandoned
  • Much of the local development is kept local, and not accessible to others for reuse or enhancement
  • There is a high degree of redundancy among development efforts, due to lack of visibility, lack of understanding, or lack of an effective collaborative framework for code exchange or coordinated development
  • Many of the tools are not interoperable with each other due to differences in approach in policy, data models or workflows (sometimes intentional, many times not)
  • Many of the big tools which serve as mainstays for the community (e.g., Heritrix for crawling, Open Wayback for replay) are large, monolithic, complex pieces of software that have multiple forks and less-than-optimal documentation

Given all this, one wonders if IIPC members really believe that coordinated tool development is important; perhaps instead it’s better to let a thousand flowers bloom? The answer to this last question was, refreshingly, a resounding NO. When discussed among members at Reykjavik, support for tools development as a top priority was unanimous, and enthusiastic. The world of the Web and web archiving is vast, yet the number of participants relatively small; the more we can foster a rich (and interoperable) tool environment, the more everyone can benefit in any part of the web archiving chain. Many members in fact said they joined IIPC expressly because they sought a collaboratively defined and community-supported set of tools to support their institutional programs.

In the words of Daniel Gomes from the Portuguese Web Archive: of course tool development is a priority for IIPC; if we don’t develop these tools, who will?

A Brighter Future for Collaborative Tool Development

Several possibilities and principles presented themselves as ways to enhance the way the web archiving community pursues tool development in the future. Interestingly, many of these were more about how the community can work together rather than specific projects. The main principles were:

  • Interoperability | modularity | APIs are key. The web archiving community needs a bigger suite of smaller, simpler tools that connect together. This promotes reuse of tools, as well as ease of maintenance; allows for institutions to converge on common flows but differentiate where it matters; enables smaller development projects which are more likely to be successful; and provides on ramps for new developers and institutions to take up (and add back to) code. Developing a consensus set of APIs for the web archiving chain is a clear priority and prerequisite here.
  • Design and development need to be driven by use cases. Many times, the biggest stumbling block to effective collaboration is differing goals or assumptions. Much of the lack of interoperability comes from differences in institutional models and workflows that make it difficult for code or data to connect with other systems. Doing the analysis work upfront to clarify not just what a tool might be doing but why, can bring institutional models and developers onto the same page, and facilitate collaborative development.
  • We need collaborative platforms & social engineering for the web archiving technical community. It’s clear from events like the IIPC Web Archiving Conference and reports such as Helen Hockx-Yu’s of the Internet Archive that a lot of uncoordinated and largely invisible development is happening locally at institutions. Why? Not because people don’t want to collaborate, but because it’s less expensive and more expedient. IIPC and its members need to reduce the friction of exchanging information and code to the point that, as Barbara Sierman of the National Library of the Netherlands said, “collaboration becomes a habit.” Or as Ian Milligan of the University of Waterloo put it, we need the right balance between “hacking” and “yacking”.
  • IIPC can better foster development of tools both large and small. Collaboration on small tools development is a clear opportunity; innovation is happening at the edges and by working together individual programs can advance their end-to-end workflows in compelling new ways (social media, browser-based capture and new forms of visualization and analysis are all striking examples here). But it’s also clear that there is appetite and need for collaboration on the traditional “big” things that are beyond any single member’s capacity to engineer unilaterally (e.g., Heritrix, WayBack, full text search). As IIPC hasn’t been as successful as anyone might like in terms of directed, top-down development of larger projects, what can be done to carve these larger efforts up into smaller pieces that have a greater chance of success? How can IIPC take on the role of facilitator and matchmaker rather than director & do-er?

Next Steps

The stage is set for revisiting and revitalizing how IIPC works together to build high quality, use case-driven, interoperable tools. Over the next few months (and years!) we will begin translating these needs and strategies into concrete actions. What can we do? Several possibilities suggested themselves in Reykjavik.

  1. Convene web archiving “hack fests”. The web archiving technical community needs face time. As Andy Jackson of the British Library opined in Reykjavik, “How can we collaborate with each other if we don’t know who we are, or what we’re doing?” Face time fuels collaboration in a way that no amount of WebEx’ing or GitHub comments can. Let’s begin to engineer the social ties that will lead to stronger software ties. A couple of three-day unconferences per year would go a long way to accelerating collaboration and diffusion of local innovation.
  2. Convene meetings on key technical topics. It’s clear that IIPC members are beginning to tackle major efforts that would benefit from some early and intensive coordination: Heritrix & browser-based crawling, elaborations on WARC, next steps for Open Wayback, full text search and visualization, use of proxies for enhanced capture, dashboards and metrics for curators and crawl engineers. All of these are likely to see significant development (sometimes at as many as 4-5 different institutions) in the next year. Bringing implementers together early offers the promise of coordinated activity.
  3. Coordinate on API identification and specification. There is clear interest in specifying APIs and more modular interactions across the entire web archiving tool chain. IIPC holds a privileged place as a coordinating body across the sites and players interested in this work. IIPC should structure some way to track, communicate, and help systematize this work, leading to a consortium-based reference architecture (based on APIs rather than specific tools) for the web archiving tool chain.
  4. Use cases. Reykjavik saw a number of excellent presentations on user centered design and use case-driven development. This work should be captured and exposed to the web archiving community to let participants learn from each other’s work, and to generate a consensus reference architecture based on demonstrated (not just theoretical) needs.

Note that all of these potential steps focus as much on how IIPC can work together as on any specific project, and they all seem to fall into the “small steps” category. In this they have the twin benefits of being feasible to accomplish in the next year and having a good chance to succeed. And if they do succeed, they promise to lay the groundwork for more and larger efforts in the coming years.

What do you think IIPC can do in the next year to advance tools development? Post a comment in this blog or send an email.