New OpenWayback lead

By Lauren Ko, University of North Texas Libraries

In response to IIPC’s call, I have volunteered to take on a leadership role in the OpenWayback project. Having been involved with web archives since 2008 as a programmer at the University of North Texas Libraries, I expect my experience working with OpenWayback, Heritrix, and WARC files, as well as writing code to support my institution’s broader digital library initiatives, to aid me in this endeavor.

Over the past few years, the web archiving community has seen much development in the area of access related projects such as pywb, Memento, ipwb, and OutbackCDX – to name a few. There is great value in a growing number of available tools written in different languages and running in different environments. In line with this, we would like to keep the OpenWayback project’s development moving forward while it remains of use. Further, we hope to facilitate development of access related standards and APIs, interoperability of components such as index servers, and compatibility of formats such as CDXJ.
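
As a concrete taste of one of those formats, here is a minimal sketch of parsing a CDXJ index line in Python. The sample line and field names follow the pywb-style convention and are illustrative only; exact fields vary between tools.

```python
import json

# A CDXJ line: SURT-form key, 14-digit timestamp, then a JSON block.
# Field names here follow the pywb-style convention; they vary by tool.
line = ('com,example)/ 20160101000000 '
        '{"url": "http://example.com/", "mime": "text/html", '
        '"status": "200", "filename": "example.warc.gz", "offset": "334"}')

def parse_cdxj(line):
    """Split a CDXJ line into (surt_key, timestamp, record_dict)."""
    surt, timestamp, json_block = line.split(' ', 2)
    return surt, timestamp, json.loads(json_block)

surt, ts, record = parse_cdxj(line)
print(surt, ts, record['status'])   # com,example)/ 20160101000000 200
```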

Moving OpenWayback forward will take a community. With Kristinn Sigurðsson soon relinquishing his leadership position, we are seeking a co-leader for the OpenWayback project. We also continue to need people to contribute code, provide code review, and test deployments. I hope this community will continue not only to develop access tools, but also access to those tools, encouraging and supporting newcomers via mailing lists and Slack channels as they begin building and interacting with web archives.

If your institution uses OpenWayback, please consider:

  • contributing code
  • reviewing code submitted by others
  • testing release candidates and deployments

If you are interested in taking a co-leadership role in this project or are otherwise interested in helping with OpenWayback and IIPC’s access related initiatives, even if you don’t know how that might be, I welcome you to contact me by the name lauren.ko via IIPC Slack or email me at lauren.ko@unt.edu.

Wanted: New Leaders for OpenWayback

By Kristinn Sigurðsson, National and University Library of Iceland

The IIPC is looking for one or two people to take on a leadership role in the OpenWayback project.

The OpenWayback project is responsible not only for the widely used OpenWayback software, but also for the underlying webarchive-commons library. In addition, the OpenWayback project has been working to define access related APIs.

The OpenWayback project thus plays an important role in the IIPC’s efforts to foster the development and use of common tools and standards for web archives.

Why now?

The OpenWayback project is at a crossroads. The IIPC first took on this project three years ago with the initial objective of making the software easier to install, run, and manage. This included cleaning up the code and improving documentation.

Originally this work was done by volunteers in our community. About two years ago the IIPC decided to fund a developer to work on it. The initial funding was for 16 months. With this we were able to complete the task of stabilizing the software, as evidenced by the releases of OpenWayback 2.0.0 through 2.3.0.

We then embarked on a somewhat more ambitious task: improving the core of the software. That work is now reaching a significant milestone as a new ‘CDX server’, or resource resolver, is being introduced. You can read more about that here.
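
For readers unfamiliar with the idea, a resource resolver answers “which captures do you have for this URL?” queries on behalf of the replay tool. Below is a hedged sketch of such a lookup in Python; the endpoint URL is a placeholder and the parameter names follow common CDX-server conventions, so consult your own deployment’s documentation.

```python
import requests

# Placeholder base URL; each deployment exposes its own CDX endpoint.
CDX_ENDPOINT = 'http://archive.example.org/cdx'

# The replay tool asks the index server which captures exist for a URL,
# instead of scanning sorted CDX files itself.
params = {
    'url': 'example.com/',
    'from': '20150101',
    'to': '20161231',
    'limit': 10,
}
resp = requests.get(CDX_ENDPOINT, params=params, timeout=30)
resp.raise_for_status()
for line in resp.text.splitlines():
    print(line)   # one capture per line; exact fields vary by server
```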

This marks the end of the paid position (at least for the time being). The original 16 months wound up being spread over a somewhat longer time frame, but they are now exhausted. Currently, the National Library of Norway (which hosted the paid developer) is contributing, for free, the work to finalize the new resource resolver.

I’ve been guiding the project over the last year since the previous project leader moved on. While I was happy to assume this role to ensure that our funded developer had a functioning community, I felt like I was never able to give the project the kind of attention that is needed to grow it. Now it seems to be a good time for a change.

With the end of the paid position we are now at a point where there either needs to be a significant transformation of the project or it will likely die away, bit by bit. That would be a shame, bearing in mind the significance of the project to the community and the time already invested in it.

Who are we looking for?

While a technical background is certainly useful, it is not a primary requirement for this role. As you may have surmised from the above, building up this community will definitely be a part of the job. Being a good communicator, manager, and organizer may be far more important at this stage.

Ideally, I’d like to see two leads with complementary skill sets, technical and communications/management. Ultimately, the most important requirement is a willingness and ability to take on this challenge.

You’ll not be alone: aside from your prospective co-lead, there is an existing community to build on, notably when it comes to the technical aspects of the project. You can get a feel for the community on the OpenWayback Google Group and the IIPC GitHub page.

It would be simplest if the new leads were drawn from IIPC member institutions. We may, however, be willing to consider a non-member, especially as a co-lead, if they are uniquely suited for the position.

If you would like to take up this challenge and help move this project forward, please get in touch. My email is kristinn (at) landsbokasafn (dot) is.

There is no deadline, as such, but ideally I’d like the new leads to be in place prior to our next General Assembly in Lisbon next March.

IIPC Hackathon at the British Library: Laying a New Foundation

By Tom Cramer, Stanford University

This past week, 22-23 September 2016, members of the IIPC gathered at the British Library for a hackathon focused on web crawling technologies and techniques. The event drew 14 technologists from 12 institutions near (the UK, the Netherlands, France) and far (Denmark, Iceland, Estonia, the US, and Australia), and provided a rare opportunity to take an intensive, two-day, uninterrupted deep dive into how institutions are capturing web content and to explore opportunities for advancing the state of the art.

I was struck by the breadth and depth of topics. In particular…

  • Heritrix nuts and bolts. Everything from small tricks and known issues for optimizing captures with Heritrix 3, to how people were innovating around its edges, to the history of the crawler, to a wishlist for improving it (including better documentation).
  • Brozzler and browser-based capture. Noah Levitt from the Internet Archive, and the engineer behind Brozzler, gave a mini-workshop on the latest developments, and how to get it up and running. This was one of the biggest points of interest as institutions look to enhance their ability to capture dynamic content and social media. About ⅓ of the workshop attendees went home with fresh installs on their laptops. (Also note, per Noah, pull requests welcome!)
  • Technical training. Web archiving is a relatively esoteric domain without a huge community; how have institutions trained new staff or fractionally assigned staff to engage effectively with web archiving systems? This appears to be a major, common need, and also one that is approachable. Watch this space for developments…
  • QA of web captures. As Andy Jackson of the British Library put it, how can we tip the scales from mostly manual QA with some automated processes to mostly automated QA with some manual training and intervention? (A sketch of one such automated check follows this list.)
  • An up-to-date registry of web archiving tools. The IIPC currently maintains a list of web archiving tools, but it’s a bit dated (as these sites tend to become). Just to get the list in a place where tool users and developers can update it, a working copy of this list is now in the IIPC GitHub organization. Importantly, the group decided that it might be just as valuable to create a list of dead or deprecated tools, as these can often be dead ends for new adopters. See (and contribute to) https://github.com/iipc/iipc.github.io/wiki. Updates welcome!
  • System & storage architectures for web archiving. How institutions are storing, preserving and computing on the bits. There was a great diversity of approaches here, and this is likely good fodder for a future event and more structured knowledge sharing.
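
Picking up the QA thread from the list above, here is a rough sketch of the kind of cheap automated check that could help tip those scales: sample an archive’s CDX index and flag captures that look broken, so that only flagged captures need manual review. The endpoint, field names, and thresholds are all illustrative assumptions, not a description of any existing QA tool.

```python
import json
import requests

# Hypothetical CDX endpoint for the archive being checked.
CDX = 'http://archive.example.org/cdx'

def flag_suspect_captures(url_prefix, limit=200):
    """Cheap automated triage: surface captures a human should review."""
    resp = requests.get(CDX, params={
        'url': url_prefix, 'matchType': 'prefix',
        'output': 'json', 'limit': limit}, timeout=60)
    resp.raise_for_status()
    suspects = []
    for line in resp.text.splitlines():
        rec = json.loads(line)   # assumes one JSON record per line
        status = str(rec.get('status', ''))
        # Non-2xx responses and tiny payloads often signal a broken capture.
        if not status.startswith('2') or int(rec.get('length', 0)) < 512:
            suspects.append((rec.get('timestamp'), rec.get('url'), status))
    return suspects

for ts, url, status in flag_suspect_captures('example.com/'):
    print(ts, status, url)
```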

The biggest outcome of the event may have been the energy and inherent value in having engineers and technical program managers spending lightly structured face time exchanging information and collaborating. The event was a significant step forward in building awareness of approaches and people doing web archiving.

IIPC Hackathon, Day 1.

This validates one of the main focal points for the IIPC’s portfolio on Tools Development, which is to foster more grassroots exchange among web archiving practitioners.

The participants committed to keeping the dialogue going, and to expanding the number of participants within and beyond IIPC. Slack is emerging as one of the main channels for technical communication; if you’d like to join in, let us know. We also expect to run multiple, smaller face-to-face events in the next year: 3 in Europe and another 2-3 in North America, with several delving into APIs, archiving time-based media, and access. (These are all in addition to the IIPC General Assembly and Web Archiving Conference on 27-30 March 2017 in Lisbon.) If you have an idea for a specific topic or would like to host an event, please let us know!

Many thanks to all the participants, and to the British Library (especially Andy Jackson and Olga Holownia) for hosting last week’s hackathon. It provided exactly the kind of forum needed by the web archiving community to share knowledge among practitioners and to advance the state of the art.

What can IIPC do to advance tools development?

By Tom Cramer, Stanford University

The International Internet Preservation Consortium (IIPC) renewed its consortial agreement at the end of 2015. In the process, it affirmed its longstanding mission to work collaboratively to foster the implementation of solutions to collect, preserve and provide access to Internet content. To achieve this aim, the Consortium is committed to “facilitate the development of appropriate and interoperable, preferably Open Source, software and tools.”

As the IIPC sets its strategic direction for 2016 and beyond, Tools Development will feature as one of three main portfolios of activity (along with Member Engagement, and Partnerships & Outreach). At its General Assembly in Reykjavik, IIPC members held a series of breakout meetings to discuss Tools Development. This blog post presents some of that discussion, and lays out the beginnings of a direction for IIPC, and perhaps the web archiving community at large, to pursue in order to build a richer toolscape.

The Current State of Tools Development within the IIPC

The IIPC has always emphasized tool development. Per its website, one of the main objectives “has been to develop a high-quality, easy-to-use open source tools for setting up a web archiving chain.” And the registry of software lists an impressive array of tools for everything from acquisition and curation to storage and access. And coming out of the 2016 General Assembly and Web Archiving Conference, it’s clear that there is actually quite a lot of development going on among and beyond member institutions. Despite all this, the reality may be slightly less rosy than the multitude of listings for web archiving tools might indicate…

  • Many are deprecated, or worse, abandoned
  • Much of the local development is kept local, and not accessible to others for reuse or enhancement
  • There is a high degree of redundancy among development efforts, due to lack of visibility, lack of understanding, or lack of an effective collaborative framework for code exchange or coordinated development
  • Many of the tools are not interoperable with each other due to differences in approach to policy, data models, or workflows (sometimes intentional, many times not)
  • Many of the big tools which serve as mainstays for the community (e.g., Heritrix for crawling, Open Wayback for replay) are large, monolithic, complex pieces of software that have multiple forks and less-than-optimal documentation

Given all this, one wonders if IIPC members really believe that coordinated tool development is important; perhaps instead it’s better to let a thousand flowers bloom? The answer to this last question was, refreshingly, a resounding NO. When discussed among members at Reykjavik, support for tools development as a top priority was unanimous, and enthusiastic. The world of the Web and web archiving is vast, yet the number of participants relatively small; the more we can foster a rich (and interoperable) tool environment, the more everyone can benefit in any part of the web archiving chain. Many members in fact said they joined IIPC expressly because they sought a collaboratively defined and community-supported set of tools to support their institutional programs.

In the words of Daniel Gomes from the Portuguese Web Archive: of course tool development is a priority for IIPC; if we don’t develop these tools, who will?

A Brighter Future for Collaborative Tool Development

Several possibilities and principles presented themselves as ways to enhance how the web archiving community pursues tool development in the future. Interestingly, many of these were more about how the community can work together than about specific projects. The main principles were:

  • Interoperability | modularity | APIs are key. The web archiving community needs a bigger suite of smaller, simpler tools that connect together. This promotes reuse of tools, as well as ease of maintenance; allows institutions to converge on common flows but differentiate where it matters; enables smaller development projects which are more likely to be successful; and provides on-ramps for new developers and institutions to take up (and add back to) code. Developing a consensus set of APIs for the web archiving chain is a clear priority and prerequisite here.
  • Design and development need to be driven by use cases. Many times, the biggest stumbling block to effective collaboration is differing goals or assumptions. Much of the lack of interoperability comes from differences in institutional models and workflows that make it difficult for code or data to connect with other systems. Doing the analysis work upfront to clarify not just what a tool might be doing but why can bring institutional models and developers onto the same page, and facilitate collaborative development.
  • We need collaborative platforms & social engineering for the web archiving technical community. It’s clear from events like the IIPC Web Archiving Conference and reports such as Helen Hockx-Yu’s of the Internet Archive that a lot of uncoordinated and largely invisible development is happening locally at institutions. Why? Not because people don’t want to collaborate, but because it’s less expensive and more expedient. IIPC and its members need to reduce the friction of exchanging information and code to the point that, as Barbara Sierman of the National Library of the Netherlands said, “collaboration becomes a habit.” Or as Ian Milligan of the University of Waterloo put it, we need the right balance between “hacking” and “yacking”.
  • IIPC should better support development of tools both large and small. Collaboration on small tools development is a clear opportunity; innovation is happening at the edges, and by working together individual programs can advance their end-to-end workflows in compelling new ways (social media, browser-based capture, and new forms of visualization and analysis are all striking examples here). But it’s also clear that there is appetite and need for collaboration on the traditional “big” things that are beyond any single member’s capacity to engineer unilaterally (e.g., Heritrix, WayBack, full text search). As IIPC hasn’t been as successful as anyone might like in terms of directed, top-down development of larger projects, what can be done to carve these larger efforts up into smaller pieces that have a greater chance of success? How can IIPC take on the role of facilitator and matchmaker rather than director and doer?

Next Steps

The stage is set for revisiting and revitalizing how IIPC works together to build high quality, use case-driven, interoperable tools. Over the next few months (and years!) we will begin translating these needs and strategies into concrete actions. What can we do? Several possibilities suggested themselves in Reykjavik.

  1. Convene web archiving “hack fests”. The web archiving technical community needs face time. As Andy Jackson of the British Library opined in Reykjavik, “How can we collaborate with each other if we don’t know who we are, or what we’re doing?” Face time fuels collaboration in a way that no amount of WebEx’ing or GitHub comments can. Let’s begin to engineer the social ties that will lead to stronger software ties. A couple of three-day unconferences per year would go a long way to accelerating collaboration and diffusion of local innovation.
  2. Convene meetings on key technical topics. It’s clear that IIPC members are beginning to tackle major efforts that would benefit from some early and intensive coordination: Heritrix & browser-based crawling, elaborations on WARC, next steps for Open Wayback, full text search and visualization, use of proxies for enhanced capture, dashboards and metrics for curators and crawl engineers. All of these are likely to see significant development (sometimes at as many as 4-5 different institutions) in the next year. Bringing implementers together early offers the promise of coordinated activity.
  3. Coordinate on API identification and specification. There is clear interest in specifying APIs and more modular interactions across the entire web archiving tool chain. IIPC holds a privileged place as a coordinating body across the sites and players interested in this work. IIPC should structure some way to track, communicate, and help systematize this work, leading to a consortium-based reference architecture (based on APIs rather than specific tools) for the web archiving tool chain.
  4. Use cases. Reykjavik saw a number of excellent presentations on user-centered design and use case-driven development. This work should be captured and exposed to the web archiving community to let participants learn from each other’s work, and to generate a consensus reference architecture based on demonstrated (not just theoretical) needs.

Note that all of these potential steps focus as much on how IIPC can work together as on any specific project, and they all seem to fall into the “small steps” category. In this they have the twin benefits of being both feasible to accomplish in the next year and having a good chance to succeed. And if they do succeed, they promise to lay the groundwork for more and larger efforts in the coming years.

What do you think IIPC can do in the next year to advance tools development? Post a comment in this blog or send an email.

Memento: Help Us Route URI Lookups to the Right Archives

More Memento-enabled web archives are coming online every day, enabling aggregating services such as Time Travel and OldWeb. However, as the number of web archives grows, we must be able to better route URI lookups to the archives that are likely to have the requested URIs. We need assistance from IIPC members to help us better model both what archives contain as well as what people are looking for.
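
To make the aggregation concrete, the sketch below asks the public Time Travel service for a TimeMap listing the known captures of a URI across aggregated archives. The endpoint pattern is the one the service documents at the time of writing; treat it as an assumption that may change.

```python
import requests

URI = 'http://example.com/'
resp = requests.get(
    'http://timetravel.mementoweb.org/timemap/link/' + URI, timeout=60)
resp.raise_for_status()

# In a link-format TimeMap, each capture's rel value ends in memento"
# (rel="memento", rel="first memento", ...), so a substring count gives a
# rough tally of captures held across the aggregated archives.
print(resp.text.count('memento"'), 'mementos across aggregated archives')
```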

In our TPDL 2015 paper we found that less than 5% of the queried URIs have mementos in any individual archive that is not the Internet Archive. We created four different sample sets of one million URIs each and compared them against three different archives. The table below shows the percentage of the sample URIs found in various archives.

Sample (1M URIs each)  | Archive-It | UKWA   | Stanford | Union of {AIT, UK, SU}
DMOZ                   | 4.097%     | 3.594% | 0.034%   | 7.575%
Memento Proxy Logs     | 4.182%     | 0.408% | 0.046%   | 4.527%
IA Wayback Logs        | 3.716%     | 0.519% | 0.039%   | 4.165%
UKWA Wayback Logs      | 0.108%     | 0.034% | 0.002%   | 0.134%

However, these small archives, when aggregated together, prove to be much more useful and complete than they are individually. We found that the intersection between these archives is small, so the union of them is large (see the last column in the table above). The figure below shows the overlap among the three archives for the sample of one million URIs from DMOZ.

[Figure: overlap among Archive-It, UKWA, and Stanford holdings for the one-million-URI DMOZ sample]
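
The set arithmetic behind the table and figure is easy to see with toy numbers (invented for illustration, not the study’s data):

```python
# Three small "archives" holding URIs from a pretend 100-URI sample.
archive_it = {'a', 'b', 'c', 'd'}
ukwa       = {'c', 'e', 'f'}
stanford   = {'g'}

sample_size = 100
union = archive_it | ukwa | stanford
print('best single archive: %d%%' % (100 * len(archive_it) // sample_size))
print('union of all three:  %d%%' % (100 * len(union) // sample_size))
print('three-way overlap:   %d URI(s)' % len(archive_it & ukwa & stanford))
```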

We are working on an IIPC-funded Archive Profiling project in which we are trying to create a high-level summary of the holdings of each archive. Apart from the many other use cases, this will help us route Memento Aggregator queries only to archives that are likely to return good results for a given URI.

We learned from the recent surge of oldweb.today (which uses MemGator to aggregate mementos from various archives) that some upstream archives had issues handling the sudden increase in traffic and had to be removed from the list of aggregated archives. Another issue when aggregating a large number of archives is that aggregators follow the buffalo theory: the slowest upstream archive determines the round-trip time of the aggregator. A single malfunctioning (or down) upstream archive may delay each aggregator response for the set timeout period. There are ways to solve the latter issue, such as detecting continuously failing archives at runtime and temporarily disabling them from being aggregated. However, building Archive Profiles and predicting the probability of finding any mementos in each archive to route the requests solves both problems. Individual archives only get requests when they are likely to return good results, so routing saves their network and computing resources. Additionally, aggregators benefit from improved response times, because only a small subset of all the known archives is queried for any given URI.
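
A minimal sketch of both mitigations, with hypothetical archive endpoints and made-up profile probabilities: each archive is queried concurrently with its own timeout, and archives the profile rates as unlikely to hold the URI are skipped entirely.

```python
import concurrent.futures as cf
import requests

# Hypothetical archive endpoints with profile-derived hit probabilities.
ARCHIVES = {
    'http://archive-a.example/timemap/link/': 0.45,
    'http://archive-b.example/timemap/link/': 0.02,
    'http://archive-c.example/timemap/link/': 0.30,
}
ROUTE_THRESHOLD = 0.05    # skip archives unlikely to hold the URI
PER_ARCHIVE_TIMEOUT = 5   # seconds; one slow archive cannot stall the rest

def fetch_timemap(base, uri):
    resp = requests.get(base + uri, timeout=PER_ARCHIVE_TIMEOUT)
    resp.raise_for_status()
    return resp.text

def aggregate(uri):
    # Routing: query only the archives the profile says are promising.
    routed = [base for base, p in ARCHIVES.items() if p >= ROUTE_THRESHOLD]
    results = {}
    with cf.ThreadPoolExecutor(max_workers=len(routed)) as pool:
        futures = {pool.submit(fetch_timemap, base, uri): base
                   for base in routed}
        for fut in cf.as_completed(futures):
            try:
                results[futures[fut]] = fut.result()
            except requests.RequestException:
                pass   # a failing or timed-out archive is skipped, not fatal
    return results

print(sorted(aggregate('http://example.com/')))
```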

We thank Andy Jackson of the UK Web Archive for providing the anonymised Wayback access logs that we used for sampling one of the URI sets. We would like to extend this study to other archives’ access logs to learn what people are looking for when they visit these archives. This will help us build sampling-based profiling for archives that may not be able to share CDX files or generate/update full-coverage archive profiles.

We encourage all IIPC member archives to share enough of their access logs to generate at least one million unique URIs that people looked for in their archives. We are only interested in log entries that contain a URI-R (e.g., /wayback/14-digit-datetime/{URI}). We can handle all the cleanup and parsing tasks, or you can remove the requesting IP address from the logs (we don’t need it) if you prefer. The logs can be continuous or consist of many sparse logs. We promise not to publish those logs in raw form anywhere on the Web. Please feel free to discuss further details with me at salam@cs.odu.edu. Also contact me if you are interested in testing the software for profiling your archive.
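
To show roughly what we would extract, here is a sketch that pulls unique URI-Rs from a wayback-style access log. The log file name and the path prefix in the regular expression are assumptions; adapt them to your installation.

```python
import re

# Replay requests look like /wayback/20160101123045/http://example.com/page
# (the /wayback/ prefix varies by installation; adjust the regex to match).
URI_R = re.compile(r'GET /wayback/\d{14}/(\S+)')

unique_uris = set()
with open('access.log') as log:   # assumed log location
    for line in log:
        match = URI_R.search(line)
        if match:
            unique_uris.add(match.group(1))

print(len(unique_uris), 'unique URI-Rs')
```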

by Sawood Alam
Department of Computer Science, Old Dominion University

Five Takeaways from AOIR 2015

I recently attended the annual Association of Internet Researchers (AOIR) conference in Phoenix, AZ. It was a great conference that I would highly recommend to anyone interested in learning first hand about research questions, methods, and studies broadly related to the Internet.

Researchers presented on a wide range of topics, across a wide range of media, using both qualitative and quantitative methods. You can get an idea of the range of topics by looking at the conference schedule.

I’d like to briefly share some of my key takeaways. I apologize in advance for oversimplifying what was a rich and deep array of research work; my goal here is to provide a quick summary, not an in-depth review of the conference.

  1. Digital Methods Are Where It’s At

I attended an all-day, pre-conference digital methods workshop. As a testament to the interest in this subject, the workshop was so overbooked they had to run three concurrent sessions. The workshops were organized by Axel Bruns, Jean Burgess, Tim Highfield, Ben Light, and Patrik Wikstrom (Queensland University of Technology), and Tama Leaver (Curtin University).

Researchers are recognizing that digital research skills are essential. And, if you have some basic coding knowledge, all the better.

At the digital methods workshop, we learned about the “Walkthrough” method for studying software apps, tools for “web scraping” to gather data for analysis, Tableau to conduct social media analysis, and “instagrammatics,” analyzing Instagram.

FYI: The Digital Methods Initiative from Europe has tons of great information, including an amazing list of tools.

  2. Twitter API Is also Very Popular

There were many Twitter studies, and they all used the Twitter API to download tweets for analysis. Although researchers are widely using the Twitter API, they expressed a lot of frustration over its limitations. For example, you can only download for free up to 1% of the total Twitter volume. If you’re studying something obscure, you are probably okay, but if you’re studying a topic like #jesuischarlie, you’ll have to pay to get the entire output. Many researchers don’t have the funds for that. One person pointed out that it would be ideal to have access to the Library of Congress’s Twitter archive. Yes, agreed!

  3. Social Media over Web Archives

Researchers presented conclusions and provided commentary on our social behavior through studies of social media such as Snapchat, Twitter, Facebook, and Instagram. There were only a handful of presentations using web archived materials. If a researcher used websites, they viewed them live or conducted “web scraping” with tools such as Outwit and Kimono. Many also used custom Python scripts to gather the data from the sites.
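
For the curious, such a script often looks roughly like the following (the URL and CSS selector are placeholders). Note that nothing here preserves the page itself, which is exactly the gap discussed under takeaway 5 below.

```python
import requests
from bs4 import BeautifulSoup

# Fetch a live page and pull out headline text -- placeholder URL/selector.
resp = requests.get('http://example.com/news', timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, 'html.parser')
for headline in soup.select('h2.headline'):
    print(headline.get_text(strip=True))
```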

  4. Fair Use Needs a PR Movement

There’s still much misunderstanding about what researchers can and cannot do with digital materials. I attended a session where the presenter shared findings from surveys conducted with communication scholars about their knowledge of fair use. The results showed that there was (very!) limited understanding of fair use. Even worse, the findings showed that those scholars who had previously attended a fair use workshop were even less likely to understand fair use! Moreover, many admitted that they did not conduct particular studies because of a (misguided) fear of violating copyright. These findings were corroborated by the scholars from a variety of fields who were in the room.

  5. Opportunities for Collaboration

I asked many researchers whether they were concerned that they were not saving a snapshot of websites or apps at the time of their studies. The answer was a resounding “yes!” They recognize that sites and tools change rapidly, but they are unaware of tools or services they could use, or that their librarians/archivists have solutions.

Clearly there is room for librarians/archivists to conduct more outreach to researchers to inform them about our rich web archive collections and to talk with them about preservation solutions, good data management practices and copyright.

Who knew?

Let me end with sharing one tidbit that really blew my mind. In her research on “Dead Online: Practices of Post-Mortem Digital Interaction,” Paula Kiel presented on the “digital platforms designed to enable post-mortem interactions.” Yes, she was talking about websites where you can send posthumous messages via Facebook and email! For example, https://www.safebeyond.com/, “Life continues when you pass… Ensure your presence – be there when it counts. Leave messages for your loved ones – for FREE!”

By Rosalie Lack, Product Manager, California Digital Library

Being a Small-Time Software Contributor–Non-Developers Included

At the IIPC General Assembly 2015, we
heard a call for contributors to IIPC-relevant software projects (e.g. OpenWayback and Heritrix). We imagined what we could accomplish if every member institution could contribute half a developer’s time to work on these tools. As individuals though, we are part of the IIPC because of the institutions for which we work. The tasks dealt out by our employers come first, not always leaving an abundance of time for external projects. However, there are several ways to contribute on a smaller scale (not just committing code).

How To Help

1. Provide user support for OpenWayback and Heritrix

Join the openwayback-dev list and/or the Heritrix list, and answer questions when you can.

2. Log issues for software problems

Anytime you notice something isn’t working as expected in a piece of software, report the issue. For projects like OpenWayback and Heritrix that are on GitHub, creating an account to enable reporting issues is easy. If you aren’t sure whether the problem warrants opening an issue, send a message to the relevant mailing list.

3. Follow issues on the OpenWayback and Heritrix GitHub repositories

Check issue trackers regularly or “Watch” GitHub repositories to receive issue updates via email. If you see an issue for a bug or new feature relevant to your institution, comment on it, even if only to say that it is relevant. This helps the developers prioritize which issues to work on.

https://github.com/iipc/openwayback
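
If you prefer scripting to the web UI, GitHub’s public REST API also makes a quick scan of open issues easy. This sketch uses the standard issues endpoint; unauthenticated requests are rate-limited, which is fine for occasional use.

```python
import requests

# List open issues for the OpenWayback repository via GitHub's REST API.
url = 'https://api.github.com/repos/iipc/openwayback/issues'
resp = requests.get(url, params={'state': 'open'}, timeout=30)
resp.raise_for_status()

for issue in resp.json():
    print('#%d %s' % (issue['number'], issue['title']))
```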

4. Test release candidates

When a new distribution of OpenWayback is about to be released, the development group sends out emails asking for people to test the release distribution candidates. Verify whether the deployment works in your environment and use cases. Then report back.
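
A simple smoke test can make this less daunting. The sketch below assumes a local test deployment and a couple of captures you know exist in your own collection; both the base URL and the sample paths are placeholders.

```python
import requests

BASE = 'http://localhost:8080/wayback'   # assumed test deployment

# Captures known to exist in your test collection (placeholders here).
SAMPLES = [
    '20160101000000/http://example.com/',
    '20150615120000/http://example.org/page',
]

for sample in SAMPLES:
    url = '%s/%s' % (BASE, sample)
    resp = requests.get(url, timeout=30, allow_redirects=True)
    ok = resp.status_code == 200 and len(resp.content) > 0
    print('OK  ' if ok else 'FAIL', url)
# Report any failures back to the release thread with the details.
```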

5. Contribute to documentation

For any web archiving project, if you find documentation that is lacking or unclear, report it to the maintainers, and if possible, volunteer to fix it.

6. Contribute to code

OpenWayback currently has several open issues for bugs and enhancements. If you find an issue of interest to you and/or your institution, notify others with a comment that you want to work on it. View the contribution guidelines, and start contributing. The OpenWayback and Heritrix projects are happy to get pull requests.

7. Review code

When others submit code for potential inclusion into a project’s master code branch, volunteer to review the code and test it by deploying the software with the changes in place to verify everything works as expected.

8. Join the OpenWayback Developer calls

If you are interested in contributing to OpenWayback, these calls keep you informed on the current state of development. The group is always looking for help with testing release candidates, prioritizing issues, writing documentation, reviewing pull requests, and writing code. Calls take place approximately every three weeks at 4 PM London time. There is also a Google Groups list; email the IIPC PCO to join.

9. Solicit development support from your institution

Non-developers have a great role in the development effort. Encourage technical staff you work with to contribute to software projects and help them build time into their schedules for it. If you are not in a position to do this, lobby the people who can grant some of your institution’s developer time to web archiving projects.

What You Get Back

Collaborating on web archiving projects isn’t just about what you contribute. The more you follow mailing lists and issue trackers and the more you work with code and its deployment, the better your institution can utilize the software and keep current on the direction of its development.

If your institution doesn’t use OpenWayback or Heritrix, the above ways of helping apply to many other web archiving software projects. So get involved where you can; you don’t have to fix everything.

Lauren Ko
Programmer, Digital Libraries Division, UNT Libraries