New OpenWayback lead

By Lauren Ko, University of North Texas Libraries

In response to IIPC’s call, I have volunteered to take on a leadership role in the OpenWayback project. Having been involved with web archives since 2008 as a programmer at the University of North Texas Libraries, I expect my experience working with OpenWayback, Heritrix, and WARC files, as well as writing code to support my institution’s broader digital library initiatives, to aid me in this endeavor.

Over the past few years, the web archiving community has seen much development in the area of access related projects such as pywb, Memento, ipwb, and OutbackCDX – to name a few. There is great value in a growing number of available tools written in different languages/running in different environments. In line with this, we would like to keep the OpenWayback project’s development moving forward while it remains of use. Further, we hope to facilitate development of access related standards and APIs, interoperability of components such as index servers, and compatibility of formats such as CDXJ.
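For readers who have not met CDXJ, here is a minimal sketch of reading one index line. It assumes the layout used by tools such as pywb (a sortable URL key, a 14-digit timestamp, then a JSON block); the function name and the sample line are made up for illustration and are not taken from any of the projects above.

```python
import json

def parse_cdxj_line(line):
    """Split one CDXJ-style line into (urlkey, timestamp, fields)."""
    # Split on the first two spaces only: the JSON block may contain spaces.
    urlkey, timestamp, json_block = line.strip().split(" ", 2)
    return urlkey, timestamp, json.loads(json_block)

# Hypothetical sample line, for illustration only.
sample = 'com,example)/ 20140127171200 {"url": "http://example.com/", "status": "200"}'
urlkey, timestamp, fields = parse_cdxj_line(sample)
print(urlkey, timestamp, fields["url"])
```

Because the key and timestamp come first, a plain lexical sort of such lines groups captures by URL and orders them by time, which is part of what makes the format easy to share between index servers.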

Moving OpenWayback forward will take a community. With Kristinn Sigurðsson soon relinquishing his leadership position, we are seeking a co-leader for the OpenWayback project. We also continue to need people to contribute code, provide code review, and test deployments. I hope this community will continue not only to develop access tools, but also access to those tools, encouraging and supporting newcomers via mailing lists and Slack channels as they begin building and interacting with web archives.

If your institution uses OpenWayback, please consider:

If you are interested in taking a co-leadership role in this project or are otherwise interested in helping with OpenWayback and IIPC’s access related initiatives, even if you don’t know how that might be, I welcome you to contact me by the name lauren.ko via IIPC Slack or email me at lauren.ko@unt.edu.

Wanted: New Leaders for OpenWayback

By Kristinn Sigurðsson, National and University Library of Iceland

The IIPC is looking for one or two people to take on a leadership role in the OpenWayback project.

The OpenWayback project is responsible not only for the widely used OpenWayback software, but also for the underlying webarchive-commons library. In addition, the OpenWayback project has been working to define access related APIs.

The OpenWayback project thus plays an important role in the IIPC’s efforts to foster the development and use of common tools and standards for web archives.

Why now?

The OpenWayback project is at a crossroads. The IIPC first took on this project three years ago with the initial objective to make the software easier to install, run and manage. This included cleaning up the code and improving documentation.

Originally this work was done by volunteers in our community. About two years ago the IIPC decided to fund a developer to work on it. The initial funding was for 16 months. With this we were able to complete the task of stabilizing the software as evidenced by the release of OpenWayback 2.0.0 through 2.3.0.

We then embarked on a somewhat more ambitious task: improving the core of the software. That significant milestone is now coming to an end as a new ‘CDX server’ or resource resolver is being introduced. You can read more about that here.

This marks the end of the paid position (at least for the time being). The original 16 months wound up being spread over a somewhat longer time frame, but they are now exhausted. Currently, the National Library of Norway (which hosted the paid developer) is contributing, for free, the work to finalize the new resource resolver.

I’ve been guiding the project over the last year since the previous project leader moved on. While I was happy to assume this role to ensure that our funded developer had a functioning community, I felt like I was never able to give the project the kind of attention that is needed to grow it. Now it seems to be a good time for a change.

With the end of the paid position we are now at a point where there either needs to be a significant transformation of the project or it will likely die away, bit by bit, which is a shame bearing in mind the significance of the project to the community and the time already invested in it.

Who are we looking for?

While a technical background is certainly useful it is not a primary requirement for this role. As you may have surmised from the above, building up this community will definitely be a part of the job. Being a good communicator, manager and organizer may be far more important at this stage.

Ideally, I’d like to see two leads with complementary skill sets, technical and communications/management. Ultimately, the most important requirement is a willingness and ability to take on this challenge.

You’ll not be alone. Aside from your prospective co-lead, there is an existing community to build on, notably when it comes to the technical aspects of the project. You can get a feel for the community on the OpenWayback Google Group and the IIPC GitHub page.

It would be simplest if the new leads were drawn from IIPC member institutions. We may, however, be willing to consider a non-member, especially as a co-lead, if they are uniquely suited for the position.

If you would like to take up this challenge and help move this project forward, please get in touch. My email is kristinn (at) landsbokasafn (dot) is.

There is no deadline, as such, but ideally I’d like the new leads to be in place prior to our next General Assembly in Lisbon next March.

What can IIPC do to advance tools development?

By Tom Cramer, Stanford University

The International Internet Preservation Consortium (IIPC) renewed its consortial agreement at the end of 2015. In the process, it affirmed its longstanding mission to work collaboratively to foster the implementation of solutions to collect, preserve and provide access to Internet content. To achieve this aim, the Consortium is committed to “facilitate the development of appropriate and interoperable, preferably Open Source, software and tools.”

As the IIPC sets its strategic direction for 2016 and beyond, Tools Development will feature as one of three main portfolios of activity (along with Member Engagement, and Partnerships & Outreach). At its General Assembly in Reykjavik, IIPC members held a series of break out meetings to discuss Tools Development. This blog post presents some of that discussion, and lays out the beginnings of a direction for IIPC, and perhaps the web archiving community at large, to pursue in order to build a richer toolscape.

The Current State of Tools Development within the IIPC

The IIPC has always emphasized tool development. Per its website, one of the main objectives “has been to develop a high-quality, easy-to-use open source tools for setting up a web archiving chain.” The registry of software lists an impressive array of tools for everything from acquisition and curation to storage and access. And coming from the 2016 General Assembly and Web Archiving conference, it’s clear that there is actually quite a lot of development going on among and beyond member institutions. Despite all this, the reality may be slightly less rosy than the multitude of listings for tools for web archiving might indicate…

  • Many are deprecated, or worse, abandoned
  • Much of the local development is kept local, and not accessible to others for reuse or enhancement
  • There is a high degree of redundancy among development efforts, due to lack of visibility, lack of understanding, or lack of an effective collaborative framework for code exchange or coordinated development
  • Many of the tools are not interoperable with each other due to differences in approach in policy, data models or workflows (sometimes intentional, many times not)
  • Many of the big tools which serve as mainstays for the community (e.g., Heritrix for crawling, Open Wayback for replay) are large, monolithic, complex pieces of software that have multiple forks and less-than-optimal documentation

Given all this, one wonders if IIPC members really believe that coordinated tool development is important; perhaps instead it’s better to let a thousand flowers bloom? The answer to this last question was, refreshingly, a resounding NO. When discussed among members at Reykjavik, support for tools development as a top priority was unanimous, and enthusiastic. The world of the Web and web archiving is vast, yet the number of participants relatively small; the more we can foster a rich (and interoperable) tool environment, the more everyone can benefit in any part of the web archiving chain. Many members in fact said they joined IIPC expressly because they sought a collaboratively defined and community-supported set of tools to support their institutional programs.

In the words of Daniel Gomes from the Portuguese Web Archive: of course tool development is a priority for IIPC; if we don’t develop these tools, who will?

A Brighter Future for Collaborative Tool Development

Several possibilities and principles presented themselves as ways to enhance the way the web archiving community pursues tool development in the future. Interestingly, many of these were more about how the community can work together rather than specific projects.  The main principles were:

  • Interoperability | modularity | APIs are key. The web archiving community needs a bigger suite of smaller, simpler tools that connect together. This promotes reuse of tools, as well as ease of maintenance; allows for institutions to converge on common flows but differentiate where it matters; enables smaller development projects which are more likely to be successful; and provides on ramps for new developers and institutions to take up (and add back to) code. Developing a consensus set of APIs for the web archiving chain is a clear priority and prerequisite here.
  • Design and development needs to be driven by use cases. Many times, the biggest stumbling block to effective collaboration is differing goals or assumptions. Much of the lack of interoperability comes from differences in institutional models and workflows that makes it difficult for code or data to connect with other systems. Doing the analysis work upfront to clarify not just what a tool might be doing but why, can bring institutional models and developers onto the same page, and facilitate collaborative development.
  • We need collaborative platforms & social engineering for the web archiving technical community. It’s clear from events like the IIPC Web Archiving Conference and reports such as Helen Hockx-Yu’s of the Internet Archive that a lot of uncoordinated and largely invisible development is happening locally at institutions. Why? Not because people don’t want to collaborate, but because it’s less expensive and more expedient. IIPC and its members need to reduce the friction of exchanging information and code to the point that, as Barbara Sierman of the National Library of the Netherlands said, “collaboration becomes a habit.” Or as Ian Milligan of the University of Waterloo put it, we need the right balance between “hacking” and “yacking”.
  • IIPC needs better development of tools both large and small. Collaboration on small tools development is a clear opportunity; innovation is happening at the edges and by working together individual programs can advance their end-to-end workflows in compelling new ways (social media, browser-based capture and new forms of visualization and analysis are all striking examples here). But it’s also clear that there is appetite and need for collaboration on the traditional “big” things that are beyond any single member’s capacity to engineer unilaterally (e.g., Heritrix, WayBack, full text search). As IIPC hasn’t been as successful as anyone might like in terms of directed, top-down development of larger projects, what can be done to carve these larger efforts up into smaller pieces that have a greater chance of success? How can IIPC take on the role of facilitator and matchmaker rather than director & do-er?

Next Steps

The stage is set for revisiting and revitalizing how IIPC works together to build high quality, use case-driven, interoperable tools. Over the next few months (and years!) we will begin translating these needs and strategies into concrete actions. What can we do? Several possibilities suggested themselves in Reykjavik.

  1. Convene web archiving “hack fests”. The web archiving technical community needs face time. As Andy Jackson of the British Library opined in Reykjavik, “How can we collaborate with each other if we don’t know who we are, or what we’re doing?” Face time fuels collaboration in a way that no amount of WebEx’ing or GitHub comments can. Let’s begin to engineer the social ties that will lead to stronger software ties. A couple of three-day unconferences per year would go a long way to accelerating collaboration and diffusion of local innovation.
  2. Convene meetings on key technical topics. It’s clear that IIPC members are beginning to tackle major efforts that would benefit from some early and intensive coordination: Heritrix & browser-based crawling, elaborations on WARC, next steps for Open Wayback, full text search and visualization, use of proxies for enhanced capture, dashboards and metrics for curators and crawl engineers. All of these are likely to see significant development (sometimes at as many as 4-5 different institutions) in the next year. Bringing implementers together early offers the promise of coordinated activity.
  3. Coordinate on API identification and specification. There is clear interest in specifying APIs and more modular interactions across the entire web archiving tool chain. IIPC holds a privileged place as a coordinating body across the sites and players interested in this work. IIPC should structure some way to track, communicate, and help systematize this work, leading to a consortium-based reference architecture (based on APIs rather than specific tools) for the web archiving tool chain.
  4. Use cases. Reykjavik saw a number of excellent presentations on user centered design and use case-driven development. This work should be captured and exposed to the web archiving community to let participants learn from each other’s work, and to generate a consensus reference architecture based on demonstrated (not just theoretical) needs.

Note that all of these potential steps focus as much on how IIPC can work together as on any specific project, and they all seem to fall into the “small steps” category. In this they have the twin benefits of being feasible to accomplish in the next year and having a good chance to succeed. And if they do succeed, they promise to lay the groundwork for more and larger efforts in the coming years.

What do you think IIPC can do in the next year to advance tools development? Post a comment in this blog or send an email.

Being a Small-Time Software Contributor–Non-Developers Included

At the IIPC General Assembly 2015, we heard a call for contributors to IIPC-relevant software projects (e.g. OpenWayback and Heritrix). We imagined what we could accomplish if every member institution could contribute half a developer’s time to work on these tools. As individuals though, we are part of the IIPC because of the institutions for which we work. The tasks dealt by our employers come first, not always leaving an abundance of time for external projects. However, there are several ways to contribute on a smaller scale (not just committing code).

How To Help

1. Provide user support for OpenWayback and Heritrix

Join the openwayback-dev list and/or the Heritrix list, and answer questions when you can.

2. Log issues for software problems

Anytime you notice something isn’t working as expected in a piece of software, report the issue. For projects like OpenWayback and Heritrix that are on GitHub, creating an account to enable reporting issues is easy. If you aren’t sure if the problem warrants opening an issue, send a message to the relevant mailing list.

3. Follow issues on the OpenWayback and Heritrix GitHub repositories

Check issue trackers regularly or “Watch” GitHub repositories to receive issue updates via email. If you see an issue for a bug or new feature relevant to your institution, comment on it, even if only to say that it is relevant. This helps the developers prioritize which issues to work on.

https://github.com/iipc/openwayback

4. Test release candidates

When a new distribution of OpenWayback is about to be released, the development group sends out emails asking for people to test the release distribution candidates. Verify whether the deployment works in your environment and use cases. Then report back.

5. Contribute to documentation

For any web archiving project, if you find documentation that is lacking or unclear, report it to the maintainers, and if possible, volunteer to fix it.

6. Contribute to code

OpenWayback currently has several open issues for bugs and enhancements. If you find an issue of interest to you and/or your institution, notify others with a comment that you want to work on it. View the contribution guidelines, and start contributing. OpenWayback and Heritrix are happy to get pull requests.

7. Review code

When others submit code for potential inclusion into a project’s master code branch, volunteer to review the code and test it by deploying the software with the changes in place to verify everything works as expected.

8. Join the OpenWayback Developer calls

If you are interested in contributing to OpenWayback, these calls keep you informed on the current state of development. The group is always looking for help with testing release candidates, prioritizing issues, writing documentation, reviewing pull requests, and writing code. Calls take place approximately every three weeks at 4PM London time. There is also a Google Groups list; email the IIPC PCO to join.

9. Solicit development support from your institution

Non-developers have a great role in the development effort. Encourage technical staff you work with to contribute to software projects and help them build time into their schedules for it. If you are not in a position to do this, lobby the people who can grant some of your institution’s developer time to web archiving projects.

What You Get Back

Collaborating on web archiving projects isn’t just about what you contribute. The more you follow mailing lists and issue trackers and the more you work with code and its deployment, the better your institution can utilize the software and keep current on the direction of its development.

If your institution doesn’t use OpenWayback or Heritrix, the above ways of helping apply to many other web archiving software projects. So get involved where you can; you don’t have to fix everything.

Lauren Ko
Programmer, Digital Libraries Division, UNT Libraries

Update on OpenWayback

OpenWayback 2.2.0 was recently released. This marks OpenWayback’s third release since becoming a ward of the IIPC in late 2013. This is a fairly modest update and reflects our desire to make frequent, modest-sized releases. A few things are still worth pointing out.

First, as of this release, OpenWayback requires Java 7. Java 7 has been out for four years and Java 6 has not been publicly updated in over two years. It is time to move on.

Second, OpenWayback now officially supports internationalized domain names, i.e. domain names containing non-ASCII characters.
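To make the idea concrete, here is a small sketch of the conversion between a Unicode hostname and its ASCII (Punycode) form, which is what lets such names be stored and looked up in an index. It uses Python's built-in `idna` codec (which follows the older IDNA 2003 rules) and is not OpenWayback's own Java code; the helper names are hypothetical.

```python
def to_ascii_hostname(hostname):
    """Convert a Unicode hostname to its ASCII (Punycode) form."""
    # Python's built-in "idna" codec implements the IDNA 2003 rules.
    return hostname.encode("idna").decode("ascii")

def to_unicode_hostname(hostname):
    """Convert an ASCII (Punycode) hostname back to Unicode."""
    return hostname.encode("ascii").decode("idna")

print(to_ascii_hostname("bücher.example"))           # xn--bcher-kva.example
print(to_unicode_hostname("xn--bcher-kva.example"))  # bücher.example
```

Plain ASCII hostnames pass through unchanged, so the same code path can be used for every domain.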

Third, UI localization has been much improved. It should now be possible to translate the entire interface without having to mess with the JSP files and otherwise “go under the hood”.

And the last thing I’ll mention is the new WatchedCDXSource which removes the need to enumerate all the CDX files you wish to use. Simply designate a folder and OpenWayback will pick up all the CDX files in it.
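The general idea of a watched folder can be sketched in a few lines. This is only an illustration under my own assumptions (a flat folder, files ending in `.cdx`, hypothetical function names); it is not how WatchedCDXSource is actually implemented in OpenWayback.

```python
import tempfile
from pathlib import Path

def find_cdx_files(folder):
    """Return the sorted paths of all *.cdx files directly inside folder."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix == ".cdx")

def detect_changes(folder, known):
    """Return (added, removed) relative to a previously seen set of paths."""
    current = set(find_cdx_files(folder))
    return current - known, known - current

# Demo in a throwaway directory, so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "crawl-2015.cdx").write_text("")
    (folder / "notes.txt").write_text("")      # ignored: not a CDX file
    first_scan = set(find_cdx_files(folder))
    (folder / "crawl-2016.cdx").write_text("")  # a new index file appears
    added, removed = detect_changes(folder, first_scan)
    first_names = sorted(p.name for p in first_scan)
    added_names = sorted(p.name for p in added)
    print(first_names, added_names, removed)
```

Rescanning on a timer and comparing against the last known set is the simplest approach; a real deployment might use file system notifications instead.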

The road to here hasn’t been easy, but it is encouraging to see that the number of people involved is slowly but surely rising. For the 2.2.0 release, we had code contributions from Roger Coram (BL), Lauren Ko (UNT), John Erik Halse (NLN), Sawood Alam (ODU), Mohamed Elsayed (BA) and myself in addition to the IIPC-paid-for work by Roger Mathisen (NLN). Even more people were involved in reporting issues, managing the project and testing the release candidate. My thanks to everyone who helped out.

And going forward, we are certainly going to need people to help out.

Version 2.3.0 of OpenWayback will be another modest bundle of fixes and minor features. We hope it will be ready in September (or so). There are already 10 issues open for it as I write this.

But, we also have larger ambitions. Enter version 3.0.0. It will be developed in parallel with 2.3.0 and aims to make some big changes. Breaking changes. OpenWayback is built on an aging codebase, almost a decade old at this point. To move forward, some big changes need to be made.

The exact features to be implemented will likely shift as work progresses but we are going to increase modularity by pushing the CDXServer front and center and removing the legacy resource stores. In addition to simplifying the codebase, this fits very nicely with the talk at the last GA about APIs.

We’ll also be looking at redoing the user interface using iFrames and providing insight into the temporal drift of the page being viewed. The planned issues are available on GitHub. The list is far from locked and we welcome additional input on which features to work on.

We welcome additional work on those features even more!
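To give a feel for what insight into temporal drift could mean, here is a rough sketch. It assumes one simple measure of my own choosing: the largest gap between the capture time of the page and the capture times of the resources embedded in it, using the 14-digit timestamps common in web archives. The function names and sample timestamps are hypothetical, and the planned feature may well measure this differently.

```python
from datetime import datetime, timedelta

def parse_ts(ts):
    """Parse a 14-digit archive timestamp (YYYYMMDDhhmmss)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")

def temporal_spread(page_ts, resource_ts_list):
    """Largest time gap between the page capture and any embedded resource."""
    page_time = parse_ts(page_ts)
    gaps = (abs(parse_ts(ts) - page_time) for ts in resource_ts_list)
    return max(gaps, default=timedelta(0))

# Hypothetical page captured on 1 June 2015 with two embedded resources.
spread = temporal_spread("20150601120000",
                         ["20150601120005", "20150715093000"])
print(spread)
```

A large spread suggests the replayed page mixes content captured at quite different times, which is exactly what a user would want to be told about.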

I’d like to wrap this up with a call to action. We need a reasonably large community around the project to sustain it. Whether it’s testing and bug reporting, occasional development work or taking on more by becoming one of our core developers, your help is both needed and appreciated.

If you’d like to become involved, you can simply join the conversation on the OpenWayback GitHub page. Anyone can open new issues and post comments on existing issues. You can also join the OpenWayback developers mailing list.

Kristinn Sigurðsson – Head of IT at the National and University Library of Iceland – x-posted from Kris’s blog

 

What’s Next for OpenWayback

By Kristinn Sigurðsson, Head of IT at the National and University Library of Iceland. Cross-posted from his own blog

About one month ago, OpenWayback 2.1.0 was released. This was mostly a bug-fix release with a few new features merged in from Internet Archive’s Wayback development fork. For the most part, the OpenWayback effort has focused on ‘fixing’ things: making sure everything builds and runs nicely and is better documented.

I think we’ve made some very positive strides.

Work is now ongoing for version 2.2.0. Finally, we are moving towards implementing new things! 2.2.0 still has some fixing to do. For example, localization support needs to be improved. But we’re also planning to implement something new: support for internationalized domain names.

We’ve tentatively scheduled the 2.2.0 release for “spring/early summer”.

After 2.2.0 is released, the question will be which features or improvements to focus on next. The OpenWayback issue tracker on GitHub has (at the time of writing) about 60 open issues in the backlog (i.e. not assigned to a specific release).

We’re currently in the process of trying to prioritize these. Our current resources are nowhere near sufficient to resolve them all. Prioritization will involve several aspects, including how difficult they are to implement, how popular they are and, not least, how clearly they are defined.

This is where you, dear reader, can help us out by reviewing the backlog and commenting on issues you believe to be relevant to your organization. We also invite you to submit new issues if needed.

It is enough to just leave a comment that this is relevant to your organization. Even better would be to explain why it is relevant (this helps frame the solution). Where appropriate we would also welcome suggestions for how to implement the feature, notably in issues like the one about surfacing metadata in the interface.

If you really want to see a feature happen, the best way to make it happen is, of course, to pitch in.

Some of the features and improvements we are currently reviewing are:

  • Enable users to ‘diff’ different captures of an HTML page. Issue 15.
  • Enable search results with a very large number of hits. Issue 19.
  • Surface more metadata. Issues 28 and 29.
  • Enable time ranged exclusions. Issue 212.
  • Create a revisit test dataset. Issue 117.
  • Using CDX indexing as the default instead of the BDB index. Issue 132.
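As a rough illustration of the first item in the list above, here is a minimal sketch of a line-based 'diff' between two captures using Python's standard `difflib`. The HTML snippets and timestamp labels are made up, and a real feature would probably want something more HTML-aware than a plain line diff.

```python
import difflib

def diff_captures(old_html, new_html, old_label, new_label):
    """Return a unified diff (list of lines) between two captures."""
    return list(difflib.unified_diff(
        old_html.splitlines(), new_html.splitlines(),
        fromfile=old_label, tofile=new_label, lineterm=""))

# Two hypothetical captures of the same page.
capture_2014 = "<html>\n<h1>Old title</h1>\n<p>Hello</p>\n</html>"
capture_2015 = "<html>\n<h1>New title</h1>\n<p>Hello</p>\n</html>"

result = diff_captures(capture_2014, capture_2015,
                       "20140101000000", "20150101000000")
print("\n".join(result))
```

Lines starting with `-` appear only in the older capture and lines starting with `+` only in the newer one.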

As I said, these are just the ones currently being considered. We’re happy to look at others if there is someone championing them.

If you’d like to join the conversation, go to the OpenWayback issue tracker on GitHub and review issues without a milestone.
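If you prefer to pull that list programmatically, the GitHub REST API accepts `milestone=none` on its issues endpoint. The sketch below only builds the request URL (the helper name is hypothetical and the `iipc/openwayback` repository path is the one used by the project on GitHub); fetching and paging through the results is left out.

```python
from urllib.parse import urlencode

def backlog_issues_url(owner, repo, per_page=30):
    """Build a GitHub API URL listing open issues that have no milestone."""
    query = urlencode({"state": "open", "milestone": "none",
                       "per_page": per_page})
    return f"https://api.github.com/repos/{owner}/{repo}/issues?{query}"

url = backlog_issues_url("iipc", "openwayback")
print(url)
```

Note that this endpoint also returns pull requests, which can be recognized by the `pull_request` key in each result.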

If you’d like to submit a new issue, please read the instructions on the wiki. The main thing to remember is to provide ample details.

We only have so many resources available. Your input is important to help us allocate them most effectively.