Help Identify News Sites for the IIPC Online News Around the World Project!

By Sabine Schostag, Statsbiblioteket, Aarhus

What?

The IIPC’s Content Development Working Group, which is leading an effort to build collaborative, global web archives on a variety of topics of interest to our members, is kicking off a new project that we are calling “Online News Around the World: A Snapshot in Time.” Our goal is to document online news websites during one week of the year from ALL of the countries in the world.


Why?

You read that right – ONLINE NEWS FROM ALL OF THE COUNTRIES IN THE WORLD. We never said IIPC members were entirely sane, did we? We know this is a lofty goal, but we have a few reasons for doing this:

  • raise global awareness of the critical need for web archiving;
  • raise awareness of the importance of preserving born-digital news;
  • create a cohesive and comprehensive collection that will engage researchers;
  • archive content from countries and regions not currently being archived by IIPC members.

When?

“Week 46”: November 14, 2017 – November 20, 2017. Strange idea? Maybe not… Week 46 was designated an “ordinary news week” back at the end of the 1990s by Anker Brink Lund, doctor of philosophy and professor at Copenhagen Business School. He wrote in 2014:

The project News week has its origin in an old dream I tried to realize for many years. My burning desire was not only to be able to analyze spectacular news cases, but also to map the journalistic feeding chain in general by registering ALL news in a specific period. This kind of project needed lots of money and many expert resources. In autumn 1999, I was given both, because the newly opened journalist studies in Odense needed trainee places for their students and because a parliamentary analysis group on the political power wanted to know more about the journalistic power in Denmark. Ever since, I have used week 46 for all kinds of media analyses, together with national and international research colleagues…1

The World Wide Web is more than twenty years old. The IIPC thinks it is time to include web news in this “extraordinary ordinary news week.”

Who?

We know we might not reach our lofty goal instantly, and that it will take some time to identify news sites from around the world. We plan to start gradually, aiming first for news sites from IIPC member countries. But here is where you fit in. The Content Development Group needs your help! Please nominate 10 news sites from your country to our nomination tool: http://digital2.library.unt.edu/nomination/iipc-news/. Once we receive nominations, the Content Development Group will review the list to determine what set will be archived.

For more information about the project and to find out more about how to help, please contact the Project Team at online-news-project@iipc.simplelists.com or reply to this blog post with your questions!

References:

1 Citation from: Anker Brink Lund: Analysis – An extraordinary ordinary news week. In: Journalisten.dk, 2014-11-14.

Wanted: New Leaders for OpenWayback

By Kristinn Sigurðsson, National and University Library of Iceland

The IIPC is looking for one or two people to take on a leadership role in the OpenWayback project.

The OpenWayback project is responsible not only for the widely used OpenWayback software, but also for the underlying webarchive-commons library. In addition, the OpenWayback project has been working to define access-related APIs.

The OpenWayback project thus plays an important role in the IIPC’s efforts to foster the development and use of common tools and standards for web archives.

Why now?

The OpenWayback project is at a crossroads. The IIPC first took on this project three years ago with the initial objective of making the software easier to install, run and manage. This included cleaning up the code and improving documentation.

Originally this work was done by volunteers in our community. About two years ago the IIPC decided to fund a developer to work on it. The initial funding was for 16 months. With this we were able to complete the task of stabilizing the software, as evidenced by the releases of OpenWayback 2.0.0 through 2.3.0.

We then embarked on a somewhat more ambitious task: improving the core of the software. That work is now reaching a significant milestone as a new ‘CDX server’, or resource resolver, is being introduced. You can read more about that here.

This marks the end of the paid position (at least for the time being). The original 16 months wound up being spread over a somewhat longer time frame, but they are now exhausted. Currently, the National Library of Norway (which hosted the paid developer) is contributing, for free, the work to finalize the new resource resolver.

I’ve been guiding the project over the last year since the previous project leader moved on. While I was happy to assume this role to ensure that our funded developer had a functioning community, I felt like I was never able to give the project the kind of attention that is needed to grow it. Now it seems to be a good time for a change.

With the end of the paid position we are now at a point where there either needs to be a significant transformation of the project or it will likely die away, bit by bit, which is a shame bearing in mind the significance of the project to the community and the time already invested in it.

Who are we looking for?

While a technical background is certainly useful it is not a primary requirement for this role. As you may have surmised from the above, building up this community will definitely be a part of the job. Being a good communicator, manager and organizer may be far more important at this stage.

Ideally, I’d like to see two leads with complementary skill sets, technical and communications/management. Ultimately, the most important requirement is a willingness and ability to take on this challenge.

You will not be alone: aside from your prospective co-lead, there is an existing community to build on, notably when it comes to the technical aspects of the project. You can get a feel for the community on the OpenWayback Google Group and the IIPC GitHub page.

It would be simplest if the new leads were drawn from IIPC member institutions. We may, however, be willing to consider a non-member, especially as a co-lead, if they are uniquely suited for the position.

If you would like to take up this challenge and help move this project forward, please get in touch. My email is kristinn (at) landsbokasafn (dot) is.

There is no deadline, as such, but ideally I’d like the new leads to be in place prior to our next General Assembly in Lisbon next March.

Web Archiving Rio 2016: The Story So Far

By Helena Byrne, Assistant Web Archivist, The British Library

The IIPC Content Development Group (CDG) has been busy archiving the trials and tribulations of the Rio 2016 Summer Olympic and Paralympic Games. The Olympics might be over but in just a few days the Paralympics will begin and fans will be glued to their screens again.

This project is collecting public platforms such as websites, articles, news reports, blogs and social media about Rio 2016. You can follow updates on this project on Twitter by using the collection hashtag #Rio2016WA. The CDG has been more active on Twitter and hosted a Twitter chat on 10th August 2016 to give an insight into what’s involved in web archiving the Olympics. The chat was based on set questions published in an IIPC blog post, with a Q&A session and some time for live nominations. Although small, this international chat helped us make connections with a wider audience. The chat was added to Storify as well as to the final archived collection of the Games.

So far the Rio 2016 Collection has over 4,000 nominations from IIPC members and the general public. The nominations up to now are from seventy-six countries across the world. However, as you can see from the Google Map, there are still many countries that have not been covered. Can you help fill the void?

The majority of the public nominations cover Ireland, the Pacific Islands & South Korea and are in a range of languages such as English, Korean, Dutch, Georgian & French, to name but a few. Some countries on the map have only one site nominated while others have many. Even if you see that there are nominations from your country, the web pages you are looking at might not be in the collection. There is still time for you to get involved in web archiving the Olympics and Paralympics. The public nomination form will be open till 21st September 2016. If you would like to make a nomination you can follow these guidelines. This is your chance to be part of the Games!

On Your Marks, Get Set, Go!

By Helena Byrne, Assistant Web Archivist, The British Library

The Rio 2016 Olympic and Paralympic Games are nearly underway and for the next few weeks sports fans will be glued to the events. As with all major sporting events so much happens on and off the playing field.

When we look back at these events, what do we look at? Archives play an essential role in collecting these snapshots of our lives. As we live in a digital world, web archives play a central role in this process. The IIPC Content Development Group curated three large Summer and Winter Olympics collections (2010, 2012 and 2014) and is now archiving the events both on and off the playing field in Rio.

Now it’s your opportunity to have your say about what goes into this collection. The IIPC CDG is calling on you to get involved through the public nomination form. As you can see from our map we still have large parts of the world that aren’t represented in the collection. Do you know of any Olympic or Paralympic websites from these countries?

If you want to find out more about what’s involved in documenting Rio 2016, why not join our Twitter chat and help us archive Rio 2016?

When: Wednesday 10th August at 3pm GMT
Where: At your desk
How: Using Twitter hashtag #Rio2016WA and our previous blog post
Audience: Librarians, Archivists, Sports Researchers and anyone with an interest in web archiving.
Contributors: Nicola Bingham and Helena Byrne, British Library; Eilidh MacGlone, National Library of Scotland

Chat Programme

  1. Introductions
  2. Questions on selecting websites
  3. Instructions on how you can select sites
  4. Add web selections to the public nomination form
  5. Wrap up

Chat Questions

  1. What Olympic collections are available online or in libraries and museums?
    • Are they physical or digital collections?
    • Do you have a favourite go to collection that you like to use?
  2. What’s involved in selecting websites or web pages for the collection?
    • Sourcing, appraising, selecting
  3. What types of resources do researchers like to use most when researching sport?
    • If you could only choose one resource what would it be?
  4. Questions and answers from the audience about the Rio 2016 Collection.

Don’t forget to use the collection hashtag #Rio2016WA when answering the questions. So on your marks, get set, go!

A map of the nominations so far. There are still some parts of the world not covered in this collection. However, all of the National Olympic and Paralympic Committees from around the world are archived in a separate collection.

2016 Rio Games Collection – How to Get Involved!

By Helena Byrne, Assistant Web Archivist, The British Library

The International Internet Preservation Consortium (IIPC) would like your help to archive websites from around the world related to the Olympic and Paralympic Games.

The IIPC has members in 33 countries, but there are over 200 countries competing in the Games and we need your help to ensure that these countries are represented in the collection.

IIPC World Map

What we want to collect:

Public platforms in various formats such as:

  • Websites
  • Articles
  • News Reports
  • Blogs
  • Facebook
  • Twitter

The subjects covered on these sites can vary from:

  • Sports Events
  • Athletes/Teams
  • Doping/Cheating and Corruption
  • Olympic/Paralympic Venues
  • Gender
  • Fandom
  • Environmental Issues
  • Zika Virus
  • General News/ Commentary
  • Computer Games (eGames)
  • Other

How to get involved:

Once you have selected the web pages you would like to see in the collection, it takes less than 5 minutes to fill in the submission form.

http://goo.gl/forms/n4M4XJKfg6STvosb2

 

Memento: Help Us Route URI Lookups to the Right Archives

More Memento-enabled web archives are coming online every day, enabling aggregating services such as Time Travel and OldWeb. However, as the number of web archives grows, we must be able to better route URI lookups to the archives that are likely to have the requested URIs. We need assistance from IIPC members to help us better model both what archives contain as well as what people are looking for.

In our TPDL 2015 paper we found that less than 5% of the queried URIs have mementos in any individual archive that is not the Internet Archive. We created four different sample sets of one million URIs each and compared them against three different archives. The table below shows the percentage of the sample URIs found in various archives.

Sample (1M URIs each)   In Archive-It   In UKWA   In Stanford   Union of {AIT, UK, SU}
DMOZ                    4.097%          3.594%    0.034%        7.575%
Memento Proxy Logs      4.182%          0.408%    0.046%        4.527%
IA Wayback Logs         3.716%          0.519%    0.039%        4.165%
UKWA Wayback Logs       0.108%          0.034%    0.002%        0.134%

However, these small archives, when aggregated together, prove to be much more useful and complete than they are individually. We found that the intersection between these archives is small, so their union is nearly as large as the sum of their individual holdings (see the last column in the table above). The figure below shows the overlap among three archives for the sample of one million URIs from DMOZ.

[Figure: overlap among the Stanford, UKWA and Archive-It archives for the DMOZ sample]
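The "small intersection, large union" observation can be checked with simple arithmetic on the table above. The following is a minimal sketch, not part of the original study: it copies the published percentages and, for each sample, compares the union with the sum of the three individual shares. The difference is the share that was counted more than once, which is a rough indicator of overlap. The function and key names are illustrative assumptions.

```python
# Percentages copied from the table above (share of each 1M-URI sample found).
TABLE = {
    "DMOZ": {"AIT": 4.097, "UKWA": 3.594, "SU": 0.034, "union": 7.575},
    "Memento Proxy Logs": {"AIT": 4.182, "UKWA": 0.408, "SU": 0.046, "union": 4.527},
    "IA Wayback Logs": {"AIT": 3.716, "UKWA": 0.519, "SU": 0.039, "union": 4.165},
    "UKWA Wayback Logs": {"AIT": 0.108, "UKWA": 0.034, "SU": 0.002, "union": 0.134},
}


def overlap_summary(row):
    """Compare the union with the individual archives for one sample row."""
    individual = [row["AIT"], row["UKWA"], row["SU"]]
    total = sum(individual)
    return {
        # Best coverage any single archive achieves on its own.
        "best_single": max(individual),
        # What the union would be if the archives did not overlap at all.
        "sum_of_singles": round(total, 3),
        # Percentage points counted more than once, i.e. lost to overlap.
        "double_counted": round(total - row["union"], 3),
    }


if __name__ == "__main__":
    for sample, row in TABLE.items():
        print(sample, overlap_summary(row))
```

For the DMOZ row, for example, the three shares add up to 7.725% while the union is 7.575%, so only about 0.15 percentage points are lost to overlap.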

We are working on an IIPC-funded Archive Profiling project in which we are trying to create a high-level summary of the holdings of each archive. Among many other use cases, this will help us route Memento Aggregator queries only to archives that are likely to return good results for a given URI.

We learned from the recent surge in traffic to oldweb.today (which uses MemGator to aggregate mementos from various archives) that some upstream archives had trouble handling the sudden increase and had to be removed from the list of aggregated archives. Another issue when aggregating a large number of archives is that aggregators follow the buffalo theory: the slowest upstream archive determines the round-trip time of the aggregator. A single malfunctioning (or down) upstream archive may delay each aggregator response for the full timeout period. There are ways to solve the latter issue, such as detecting continuously failing archives at runtime and temporarily excluding them from aggregation. However, building Archive Profiles and using them to predict the probability of finding any mementos in each archive solves both problems. Individual archives only get requests when they are likely to return good results, so the routing saves their network and computing resources. Additionally, aggregators benefit from improved response times, because only a small subset of all the known archives is queried for any given URI.
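To make the routing idea concrete, here is a minimal sketch of profile-based routing. It is an illustration under stated assumptions, not the Archive Profiling project's actual method or MemGator's implementation: the archive names, the choice of the top-level domain as the profile key, the probabilities and the threshold are all hypothetical. The `disabled` argument stands in for archives that have been temporarily excluded after repeated failures.

```python
from urllib.parse import urlsplit

# Hypothetical profiles: for each archive, an estimated probability of holding
# a memento for a URI under a given top-level domain. All numbers are made up.
PROFILES = {
    "archive-a": {"uk": 0.60, "com": 0.05},
    "archive-b": {"com": 0.40, "edu": 0.30},
    "archive-c": {"is": 0.70},
}


def profile_key(uri):
    """Reduce a URI to the key used in the profiles (here: its TLD)."""
    host = urlsplit(uri).hostname or ""
    return host.rsplit(".", 1)[-1]


def route(uri, profiles, threshold=0.10, disabled=()):
    """Return the archives worth querying for `uri`, most promising first.

    Archives listed in `disabled` (e.g. ones that keep timing out) are skipped,
    as are archives whose predicted probability is below `threshold`.
    """
    key = profile_key(uri)
    scored = [
        (profile.get(key, 0.0), name)
        for name, profile in profiles.items()
        if name not in disabled
    ]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]


if __name__ == "__main__":
    print(route("http://www.bl.uk/", PROFILES))
    print(route("http://example.com/page", PROFILES, threshold=0.01))
```

Lowering the threshold trades response time and upstream load for recall; a real profile would likely use a finer-grained key than the top-level domain.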

We thank Andy Jackson of the UK Web Archive for providing the anonymised Wayback access logs that we used for sampling one of the URI sets. We would like to extend this study to other archives’ access logs to learn what people are looking for when they visit these archives. This will help us build sampling-based profiling for archives that may not be able to share CDX files or generate/update full-coverage archive profiles.

We encourage all IIPC member archives to share enough of their access logs to yield at least one million unique URIs that people looked for in their archives. We are only interested in the log entries that have a URI-R in them (e.g., /wayback/14-digit-datetime/{URI}). We can handle all the cleanup and parsing tasks, or you can remove the requesting IP address from the logs (we don’t need it) if you would prefer. The logs can be continuous or consist of many sparse logs. We promise not to publish those logs in raw form anywhere on the Web. Please feel free to discuss further details with me at salam@cs.odu.edu. Also contact me if you are interested in testing the software for profiling your archive.
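For archives that would rather do the cleanup themselves, the extraction described above can be sketched in a few lines. This is a minimal illustration, assuming access-log lines in which the request path follows the /wayback/14-digit-datetime/{URI} pattern and the URI-R contains no spaces; the function name and the sample log lines are hypothetical. Only the URI-R is kept, so the requesting IP address never leaves the archive.

```python
import re

# Matches /wayback/<14-digit datetime>/<URI-R>; the URI-R ends at whitespace.
WAYBACK_PATH = re.compile(r"/wayback/(\d{14})/(\S+)")


def extract_uri_rs(lines):
    """Return the unique URI-Rs found in `lines`, in first-seen order.

    Everything else on the line (IP address, timestamp, status) is discarded.
    """
    seen = set()
    ordered = []
    for line in lines:
        match = WAYBACK_PATH.search(line)
        if match is None:
            continue  # not a request for an archived URI
        uri_r = match.group(2)
        if uri_r not in seen:
            seen.add(uri_r)
            ordered.append(uri_r)
    return ordered


if __name__ == "__main__":
    sample = [
        '192.0.2.1 - - [01/Dec/2015:10:00:00 +0000] "GET /wayback/20130101000000/http://example.org/ HTTP/1.1" 200 512',
        '192.0.2.7 - - [01/Dec/2015:10:00:05 +0000] "GET /static/logo.png HTTP/1.1" 200 90',
        '192.0.2.9 - - [01/Dec/2015:10:00:09 +0000] "GET /wayback/20140505120000/http://example.org/ HTTP/1.1" 200 512',
    ]
    for uri in extract_uri_rs(sample):
        print(uri)
```

Running the function over successive log files and stopping once the set reaches one million entries would produce the kind of sample requested above.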

by Sawood Alam
Department of Computer Science, Old Dominion University

IIPC Co-Chair Cathy Hartman Retires

By Abbie Grotke, United States Library of Congress, and Birgit Nordsmark Henriksen, The Royal Library, Denmark

The IIPC is saying a fond farewell this month to Cathy Hartman from member institution University of North Texas Libraries. She is retiring from UNT at the end of December. More about her illustrious career can be found in this tribute to Cathy on the UNT website.

Since her institution joined the IIPC in 2008, Cathy has been a strong leader and participant in the activities of the consortium, serving as Chair of the IIPC Steering Committee in 2008, and more recently as co-chair in 2015, helping to lead an effort to rethink the organizational structure and focus of the organization. She’s had a particular interest in education and training opportunities for web archivists, and led an effort to form an Education Working Group to review and fund training proposals, including support for a PhD candidate. She was also one of the first members representing the unique perspectives of universities and colleges on the Steering Committee.

“In my tenure as Chair of the International Internet Preservation Consortium,” commented Paul N. Wagner from the Library and Archives Canada, “I have been privileged to have Cathy Hartman as my Vice-Chair. Cathy brought a sage and pragmatic approach to her dealings with the Consortium. Her wealth of experience coupled with her ability to develop and nourish personal relationships will be sorely missed. I will personally miss the opportunity to call her up and get a ‘dose of reality’ as we worked through the complexities of overseeing an International organization. While it may be true that ‘things are bigger in Texas,’ I can tell you that the IIPC Steering Committee table will feel a lot smaller without Cathy’s presence. So on behalf of the entire IIPC membership, I wish you all the best in this next chapter of your life.”

Before she departs, we took a few moments to ask Cathy some questions about her career in Web Archiving and her experiences in IIPC:

Q: How did you get involved in Web Archiving?
As a librarian, I specialized in government information in the 1990s and watched with interest as governments began publishing content on their websites in the mid 90s rather than printing. We also noticed that the websites just disappeared when an agency or commission closed, so I began capturing the sites before they closed and preserving them for access. We captured the first site in 1997. As we added to the collection of websites, the project became known as the CyberCemetery, where dead agency websites went for perpetual care.

Q: What is your fondest memory of an IIPC meeting?
My first IIPC meeting was in Canberra, Australia, where I immediately connected to this group of people. I discovered a group who agreed with my passion for preserving the content published on the web. Also, the representatives from around the world were all so welcoming and immediately made us feel a part of the organization. The programs, the group discussions, the ongoing work – I was totally on board. IIPC has, since that meeting, been my favorite professional organization.

Q: Which changes do you look at as the biggest changes in Web archiving over the years you have been involved?
Everything has changed since 1997. Sites then were simple, straightforward and easy to capture. Websites now are complex, multimedia, interactive. The growth and change of the web over that 18 year period is remarkable.

Q: What do you plan to do in retirement?
A little consulting, a bit of fundraising for an endowment for UNT’s digital programs, some travel, and anything else I find interesting along the way.

The authors of this post would like to add that we will miss Cathy very much at future IIPC meetings, both as a professional colleague working together to achieve IIPC goals, and personally, as a good friend. We have enjoyed our time working together, and particularly her way of getting things done: her calm demeanor, a heart as “big as Texas,” and her ability to always focus on the issues at hand by working collaboratively. These have all been an inspiration! While we doubt her travels will take her to our next meeting in Iceland, we all hope that our paths cross again in the near future.

IIPC Steering Committee - Paris