IIPC – Meet the Officers, 2021

The IIPC is governed by the Steering Committee, formed of representatives from fifteen Member Institutions, each elected for a three-year term. The Steering Committee designates the Chair, Vice-Chair and Treasurer of the Consortium. Together with the Programme and Communications Officer based at the British Library, the Officers are responsible for the day-to-day business of running the IIPC.

The Steering Committee has designated Abbie Grotke of the Library of Congress to serve as Chair and Kristinn Sigurðsson of the National and University Library of Iceland to serve as Vice-Chair in 2021. Sylvain Bélanger of Library and Archives Canada continues in his role as Treasurer. Olga Holownia continues as Programme and Communications Officer, and CLIR (the Council on Library and Information Resources) remains the Consortium’s financial host.

The Officers make up the new Executive Board introduced in the Consortium Agreement 2021-2025. The additional Steering Committee members, who will serve on the Executive Board in 2021, will be named in the coming months.

The Members and the Steering Committee would like to thank Mark Phillips of the University of North Texas Libraries (IIPC Chair, 2020) and Paul Koerbin of the National Library of Australia (IIPC Vice-Chair, 2020) for their contribution to the day-to-day running of the IIPC.


Abbie Grotke, IIPC Chair 2021
Photo: Denis Malloy.

Abbie Grotke is Assistant Head, Digital Content Management Section, within the Digital Services Directorate of the Library of Congress, and leads the Web Archiving Team. She joined the Library in 1997 to work on American Memory digitization projects, and since 2002 has been involved in the Library’s web archiving program, which celebrated its 20th anniversary in 2020. In her role, Grotke has helped develop policies, workflows, and tools to collect and preserve web content for the Library’s collections and provides overall program management for web archiving at the Library, managing over 2.3 petabytes of data. The team also supports and trains almost 100 recommending officers across the Library who select content for the archives in a wide range of event and thematic web archive collections. She has been active in a number of collaborative web archive collections and initiatives, including the U.S. End of Term Government Web Archive, and the U.S. Federal Government Web Archiving Interest Group.

Since the Library of Congress joined the IIPC as a founding member in 2003, Abbie has served in a variety of roles and on a number of working groups, task forces, and committees. She spent a number of years as Communications Officer and was a member of the Access Working Group. More recently, she has served as co-leader of the Content Development and Training Working Groups and of the Membership Engagement Portfolio. She has been a member of the Steering Committee since 2013.


Photo: Tibor God (General Assembly in Zagreb, 2019).

Kristinn Sigurðsson is Head of Digital Projects and Development at the National and University Library of Iceland. He joined the library in 2003 as a software developer. Over the years he has worked on a multitude of projects related to the acquisition, preservation and presentation of digital content, as well as the digital reproduction of physical media. This includes leading the build-up of the library’s legal deposit web archive, which now contains nearly 4 billion items, as well as its very popular newspaper/magazine website.

He has also been very active within the IIPC and related web archiving collaborations. This includes working on the first version of the Heritrix crawler in 2003-4 (and on and off since). In 2010 he joined the IIPC Steering Committee and took over as co-lead of the Harvesting Working Group. More recently he has served as the Lead of the Tools Development Portfolio.


Sylvain Bélanger is Director General of the Transition Team at Library and Archives Canada (LAC). He was previously Director General of the Digital Operations and Preservation Branch, a role he held from February 2014, in which he was responsible for leading and supporting LAC’s digital business operations and all aspects of preservation for digital and analog collections, and led LAC’s digital transformation activities. Prior to that role, Sylvain had been Director of the Holdings Management Division since 2010, and before that Corporate Secretary and Chief of Staff for Library and Archives Canada.

Web Archiving Rio 2016: The Story So Far

By Helena Byrne, Assistant Web Archivist, The British Library

The IIPC Content Development Group (CDG) has been busy archiving the trials and tribulations of the Rio 2016 Summer Olympic and Paralympic Games. The Olympics might be over, but in just a few days the Paralympics will begin and fans will be glued to their screens again.

This project is collecting public platforms such as websites, articles, news reports, blogs and social media about Rio 2016. You can follow updates on this project on Twitter by using the collection hashtag #Rio2016WA. The CDG has been more active on Twitter and on 10th August 2016 hosted a Twitter chat to give an insight into what’s involved in web archiving the Olympics. The chat was based on set questions published in an IIPC blog post, with a Q&A session and some time for live nominations. It was an international chat; even though it was small, it helped us make connections with a wider audience. The chat was added to Storify as well as to the final archived collection of the Games.

So far the Rio 2016 Collection has over 4,000 nominations from IIPC members and the general public. The nominations so far come from seventy-six countries across the world. However, as you can see from the Google Map, there are still many countries that have not been covered. Can you help fill the void?

The majority of the public nominations cover Ireland, the Pacific Islands and South Korea, and are in a range of languages such as English, Korean, Dutch, Georgian and French, to name but a few. Some countries on the map have only one site nominated while others have many; even if you see that there are nominations from your country, the web pages you are looking for might not be in the collection. There is still time for you to get involved in web archiving the Olympics and Paralympics. The public nomination form will be open until 21st September 2016. If you would like to make a nomination, you can follow these guidelines. This is your chance to be part of the Games!

The Case for the Preservation of R&D Project Websites

By Daniel Gomes, Arquivo.pt – the Portuguese Web Archive

Although most current Research & Development (R&D) projects rely on their sites to publish valuable information about their activities and achievements, these sites and the information they provide typically disappear a few years after the end of the projects. Web archiving is a solution to this problem.

Why preserve websites of Research & Development projects?

During the FP7 work programme the European Union invested billions of euros in R&D projects. Scientific outputs from this significant investment were disseminated online through R&D project sites. Moreover, part of the funding was invested in the development of the project sites themselves.



Sites of R&D projects must be preserved because they:

  • publish valuable scientific outputs;
  • are highly transient: they typically vanish shortly after project funding ends;
  • constitute a trans-national, multi-lingual and cross-field set of historical web data for researchers (e.g. social scientists);
  • are not being officially preserved by any institution.

The archivist’s dilemma has always been what to preserve for the future. R&D project sites are definitely worth preserving.

Open Data becomes obsolete due to project site ephemerality

There has been a growing effort by the European Union, and governments in general, to improve transparency by providing Open Data about their activities and the outputs of the projects they fund.

The European Union Open Data Portal is an example of this effort. It conveys information about European Union funded projects, such as the project name, start and end dates, subject, budget and project URL. Almost all of this information remains persistent and usable after the project or funding instrument ends. The exception is the project URL.



A pilot experiment was performed on 27 November 2015 to test the project URLs of 100 random projects funded by the FP6 and FP7 work programmes. We automatically checked whether the project URLs were still referencing any content (an OK response).
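For illustration, a minimal sketch of this kind of automated availability check might look like the following (the URLs, timeout and use of Python's requests library are assumptions for the example, not the code used in the experiment):

```python
# Minimal sketch of the availability check described above (not the original
# experiment code): request each project URL and record whether it still
# returns an OK (HTTP 200) response. Assumes the `requests` library.
import requests

project_urls = [
    "http://www.example-fp7-project.eu/",   # hypothetical FP7 project site
    "http://www.example-fp6-project.org/",  # hypothetical FP6 project site
]

def is_available(url, timeout=30):
    """Return True if the URL still serves content with a 200 OK response."""
    try:
        response = requests.get(url, timeout=timeout, allow_redirects=True)
        return response.status_code == 200
    except requests.RequestException:
        return False

unavailable = [url for url in project_urls if not is_available(url)]
print(f"{len(unavailable)} of {len(project_urls)} project URLs are unavailable")
```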

The results revealed that 19% of the project sites of R&D projects financed by FP7 (2007-2013) were already unavailable. This percentage of data loss increased to 30% for the older project sites financed by FP6 (2002-2006).

Moreover, we observed that some of these URLs were referencing content that was no longer related to the R&D project. This suggests that the percentage of valid project URLs is in fact lower than the measured percentages. Attaining more accurate percentages would require human validation. This is an interesting issue that could be further investigated by researchers.

Web archiving provides a solution

The constant deactivation of sites that publish and disseminate the scientific outputs originating from R&D projects causes a permanent loss of valuable information to human knowledge, from both a societal and a scientific perspective. As project sites inevitably close, the online information referenced in databases such as CORDIS, the EU research projects database, suffers irrecoverable degradation.


The good news is that web archives from IIPC members may have preserved sites from past R&D projects that became unavailable, and a trans-national research infrastructure such as RESAW could make them more widely accessible.

Funding management databases could be enhanced to also reference the preserved versions of project websites that have meanwhile disappeared from the live web. Project officers or reviewers could complement their analyses by retrieving missing online content about the funded projects, and researchers in general could mitigate the serious cross-field problem caused by scientific publications citing crucial online resources that have become unavailable.

You can help to preserve R&D project sites right now!

About a year ago, Arquivo.pt performed an experiment to preserve the .EU domain.

We are now trying to focus on preserving sites of R&D projects. To this end, our first idea was to use the EU Open Data Portal to identify these project URLs. The problem is that of the 25,608 R&D projects funded by FP7 listed in the EU Open Data Portal, only 7.9% had an associated project URL.

So, the first main challenge is to identify the R&D project websites to be preserved.

And you can help!

You just need to contribute project sites to this document:

Collaborative list of Research and Development project websites

The resulting list will be used to:

  • make an experimental crawl of these sites that will be made publicly available;
  • research techniques to automatically identify R&D project URLs that will be published;
  • help other institutions interested in preserving these sites.

Who’s in?

Memento: Help Us Route URI Lookups to the Right Archives

More Memento-enabled web archives are coming online every day, enabling aggregating services such as Time Travel and OldWeb. However, as the number of web archives grows, we must be able to better route URI lookups to the archives that are likely to have the requested URIs. We need assistance from IIPC members to help us better model both what archives contain as well as what people are looking for.

In our TPDL 2015 paper we found that less than 5% of the queried URIs have mementos in any individual archive that is not the Internet Archive. We created four different sample sets of one million URIs each and compared them against three different archives. The table below shows the percentage of the sample URIs found in various archives.

Sample (1M URIs each)     In Archive-It   In UKWA   In Stanford   Union of {AIT, UK, SU}
DMOZ                      4.097%          3.594%    0.034%        7.575%
Memento Proxy Logs        4.182%          0.408%    0.046%        4.527%
IA Wayback Logs           3.716%          0.519%    0.039%        4.165%
UKWA Wayback Logs         0.108%          0.034%    0.002%        0.134%

However, these small archives, when aggregated together, prove to be much more useful and complete than they are individually. We found that the intersection between these archives is small, so the union of them is large (see the last column in the table above). The figure below shows the overlap among the three archives for the sample of one million URIs from DMOZ.


We are working on an IIPC-funded Archive Profiling project in which we are trying to create a high-level summary of the holdings of each archive. Apart from many other use cases, this will help us route Memento Aggregator queries only to the archives that are likely to return good results for a given URI.

We learned during the recent surge in traffic to oldweb.today (which uses MemGator to aggregate mementos from various archives) that some upstream archives had issues handling the sudden increase in traffic and had to be removed from the list of aggregated archives. Another issue when aggregating a large number of archives is that aggregators follow the “buffalo theory”: the slowest upstream archive determines the round-trip time of the aggregator. A single malfunctioning (or down) upstream archive may delay each aggregator response for the set timeout period. There are ways to solve the latter issue, such as detecting continuously failing archives at runtime and temporarily disabling them from being aggregated. However, building archive profiles and predicting the probability of finding any mementos in each archive solves both problems. Individual archives only get requests when they are likely to return good results, so the routing saves their network and computing resources. Additionally, aggregators benefit from improved response times, because only a small subset of all the known archives is queried for any given URI.
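As a rough illustration of the routing idea (not the Archive Profiling project's actual implementation), the sketch below keeps a toy per-archive profile keyed by top-level domain and asks, for a given URI, which archives are worth querying at all; the archive names and profile contents are hypothetical:

```python
# Rough illustration of profile-based routing (hypothetical data and names,
# not the Archive Profiling project's actual code): each archive profile
# records which top-level domains it is known to hold, and the aggregator
# only queries archives whose profile suggests a likely hit.
from urllib.parse import urlparse

# Hand-made toy profiles; real profiles would be generated from CDX files or
# sampled access logs and would capture far more than top-level domains.
ARCHIVE_PROFILES = {
    "archive-it": {"com", "org", "net", "edu"},
    "ukwa": {"uk"},
    "stanford": {"edu", "org"},
}

def candidate_archives(uri):
    """Return the archives whose profile suggests they may hold this URI."""
    host = urlparse(uri).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return [name for name, tlds in ARCHIVE_PROFILES.items() if tld in tlds]

print(candidate_archives("http://www.bbc.co.uk/news"))  # ['ukwa']
print(candidate_archives("http://example.org/page"))    # ['archive-it', 'stanford']
```

Whatever form the real profiles take, the routing decision has this shape: consult a compact summary first, and only send the lookup to archives that are likely to return mementos.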

We appreciate Andy Jackson of the UK Web Archive for providing the anonymised Wayback access logs that we used for sampling one of the URI sets. We would like to extend this study to other archives’ access logs to learn what people are looking for when they visit these archives. This will help us build sampling-based profiling for archives that may not be able to share CDX files or generate/update full-coverage archive profiles.

We encourage all IIPC member archives to share enough of their access logs to generate at least one million unique URIs that people looked for in their archives. We are only interested in the log entries that have a URI-R in them (e.g., /wayback/14-digit-datetime/{URI}). We can handle all the cleanup and parsing tasks, or you can remove the requesting IP address from the logs (we don’t need it) if you would prefer. The logs can be continuous or consist of many sparse logs. We promise not to publish those logs in raw form anywhere on the Web. Please feel free to discuss further details with me at salam@cs.odu.edu. Also contact me if you are interested in testing the software for profiling your archive.
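For illustration, a minimal sketch of how such entries could be extracted before sharing is given below; it assumes a Wayback-style access log in which replay requests look like /wayback/14-digit-datetime/{URI}, so the pattern and file name would need adjusting to your own replay URL scheme:

```python
# Minimal sketch of extracting unique URI-Rs from a Wayback-style access log
# (assumed log layout; adjust the pattern to your own replay URL scheme).
import re

# Matches e.g. "GET /wayback/20151123120000/http://example.com/page HTTP/1.1"
REPLAY_PATTERN = re.compile(r'"(?:GET|HEAD) /wayback/\d{14}/(\S+) HTTP')

def unique_uri_rs(log_path, limit=1_000_000):
    """Collect up to `limit` unique original URIs (URI-Rs) from an access log."""
    uris = set()
    with open(log_path, errors="replace") as log:
        for line in log:
            match = REPLAY_PATTERN.search(line)
            if match:
                uris.add(match.group(1))
                if len(uris) >= limit:
                    break
    return uris

# Example usage (hypothetical file name):
# print(len(unique_uri_rs("wayback-access.log")))
```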

by Sawood Alam
Department of Computer Science, Old Dominion University

IIPC Co-Chair Cathy Hartman Retires

By Abbie Grotke, United States Library of Congress, and Birgit Nordsmark Henriksen, The Royal Library, Denmark

The IIPC is saying a fond farewell this month to Cathy Hartman from member institution University of North Texas Libraries. She is retiring from UNT at the end of December. More about her illustrious career can be found in this tribute to Cathy on the UNT website.

Since her institution joined the IIPC in 2008, Cathy has been a strong leader and participant in the activities of the consortium, serving as Chair of the IIPC Steering Committee in 2008, and more recently as co-chair in 2015, helping to lead an effort to rethink the organizational structure and focus of the organization. She’s had a particular interest in education and training opportunities for web archivists, and led an effort to form an Education Working Group to review and fund training proposals, including support for a PhD candidate. She was also one of the first members representing the unique perspectives of universities and colleges on the Steering Committee.

“In my tenure as Chair of the International Internet Preservation Consortium,” commented Paul N. Wagner from the Library and Archives Canada, “I have been privileged to have Cathy Hartman as my Vice-Chair. Cathy brought a sage and pragmatic approach to her dealings with the Consortium. Her wealth of experience coupled with her ability to develop and nourish personal relationships will be sorely missed. I will personally miss the opportunity to call her up and get a ‘dose of reality’ as we worked through the complexities of overseeing an International organization. While it may be true that ‘things are bigger in Texas,’ I can tell you that the IIPC Steering Committee table will feel a lot smaller without Cathy’s presence. So on behalf of the entire IIPC membership, I wish you all the best in this next chapter of your life.”

Before she departs, we took a few moments to ask Cathy some questions about her career in Web Archiving and her experiences in IIPC:

Q: How did you get involved in Web Archiving?
As a librarian, I specialized in government information in the 1990s and watched with interest as governments began publishing content on their websites in the mid 90s rather than printing. We also noticed that the websites just disappeared when an agency or commission closed, so I began capturing the sites before they closed and preserving them for access. We captured the first site in 1997. As we added to the collection of websites, the project became known as the CyberCemetery, where dead agency websites went for perpetual care.

Q: What is your fondest memory of an IIPC meeting?
My first IIPC meeting was in Canberra, Australia, where I immediately connected to this group of people. I discovered a group who agreed with my passion for preserving the content published on the web. Also, the representatives from around the world were all so welcoming and immediately made us feel a part of the organization. The programs, the group discussions, the ongoing work – I was totally on board. IIPC has, since that meeting, been my favorite professional organization.

Q: Which changes do you look at as the biggest changes in Web archiving over the years you have been involved?
Everything has changed since 1997. Sites then were simple, straightforward and easy to capture. Websites now are complex, multimedia, interactive. The growth and change of the web over that 18 year period is remarkable.

Q: What do you plan to do in retirement?
A little consulting, a bit of fundraising for an endowment for UNT’s digital programs, some travel, and anything else I find interesting along the way.

The authors of this post would like to add that we will miss Cathy very much at future IIPC meetings, both as a professional colleague working together to achieve IIPC goals and, personally, as a good friend. We have enjoyed our time working together, and particularly her way of getting things done: her calm demeanor, a heart as “big as Texas,” and her ability to always focus on the issues at hand by working collaboratively have all been an inspiration! While we doubt her travels will take her to our next meeting in Iceland, we all hope that our paths cross again in the near future.

IIPC Steering Committee - Paris

2016 IIPC General Assembly & Web Archiving Conference

In 2016 the IIPC is organising two back-to-back events in the spring hosted by the Landsbókasafn Íslands – Háskólabókasafn (National and University Library of Iceland) in Reykjavík, Iceland:

  • IIPC General Assembly 2016, 11-12 April – Free (open to members only)
  • IIPC Web Archiving Conference 2016, 13-15 April – Free (open to anyone)

The IIPC is seeking proposals for presentations and workshops for the 2016 IIPC Web Archiving Conference (13 – 15 April 2016). Members of the IIPC are also encouraged to submit proposals for the IIPC General Assembly (11 & 12 April 2016).

Theme guidance

Proposals may cover any aspect of web archiving. The following is a non-exhaustive list of possible topics:

Policy and Practice

  • Harvesting, preservation, and/or access
  • Collection development
  • Copyright and privacy
  • Legal and ethical concerns
  • Programmatic organization and management


  • Research using web archives
  • Tools and approaches
  • Initiatives, platforms, and collaborations


  • New/updated tools for any part of the lifecycle
  • Application programming interfaces (APIs)
  • Current and future landscape

Proposal guidance

Individual presentations can be a maximum of 20 minutes. A panel session can be a maximum of 60 minutes, with two or more presentations on a topic. A discussion session should include one or more introductory statements followed by a moderated discussion. Workshops can be up to a half-day in length; please include details on the proposed structure, content, and target audience.

Abstracts should include the name of the speaker(s), a title, and a theme, and should be no more than 300 words. All abstracts should be in English.

Please submit your proposals using this form. For questions, please e-mail iipc@bl.uk.

The deadline for submissions is 17 December 2015. All submissions will be reviewed by the Programme Committee and submitters will be notified by mid-January 2016.

Five Takeaways from AOIR 2015

I recently attended the annual Association of Internet Researchers (AOIR) conference in Phoenix, AZ. It was a great conference that I would highly recommend to anyone interested in learning first-hand about research questions, methods, and studies broadly related to the Internet.

Researchers presented on a wide range of topics, across a wide range of media, using both qualitative and quantitative methods. You can get an idea of the range of topics by looking at the conference schedule.

I’d like to briefly share some of my key takeaways. I apologize in advance for oversimplifying what was a rich and deep array of research work; my goal here is to provide a quick summary, not an in-depth review of the conference.

  1. Digital Methods Are Where It’s At

I attended an all-day, pre-conference digital methods workshop. As a testament to the interest in this subject, the workshop was so overbooked they had to run three concurrent sessions. The workshops were organized by Axel Bruns, Jean Burgess, Tim Highfield, Ben Light, and Patrik Wikstrom (Queensland University of Technology), and Tama Leaver (Curtin University).

Researchers are recognizing that digital research skills are essential. And, if you have some basic coding knowledge, all the better.

At the digital methods workshop, we learned about the “Walkthrough” method for studying software apps, tools for “web scraping” to gather data for analysis, Tableau to conduct social media analysis, and “instagrammatics,” analyzing Instagram.

FYI: The Digital Methods Initiative from Europe has tons of great information, including an amazing list of tools.

  2. Twitter API Is also Very Popular

There were many Twitter studies, and they all used the Twitter API to download tweets for analysis. Although researchers are widely using the Twitter API, they expressed a lot of frustration over its limitations. For example, you can only download for free up to 1% of the total Twitter volume. If you’re studying something obscure, you are probably okay, but if you’re studying a topic like #jesuischarlie, you’ll have to pay to get the entire output. Many researchers don’t have the funds for that. One person pointed out that it would be ideal to have access to the Library of Congress’s Twitter archive. Yes, agreed!

  3. Social Media over Web Archives

Researchers presented conclusions and provided commentary on our social behavior through studies of social media such as Snapchat, Twitter, Facebook, and Instagram. There were only a handful of presentations using web archived materials. If a researcher used websites, they viewed them live or conducted “web scraping” with tools such as Outwit and Kimono. Many also used custom Python scripts to gather the data from the sites.

  4. Fair Use Needs a PR Movement

There’s still much misunderstanding about what researchers can and cannot do with digital materials. I attended a session where the presenter shared findings from surveys conducted with communication scholars about their knowledge of fair use. The results showed that there was (very!) limited understanding of fair use. Even worse, the findings showed that scholars who had previously attended a fair use workshop were even less likely to understand fair use! Moreover, many admitted that they did not conduct particular studies because of a (misguided) fear of violating copyright. These findings were corroborated by the scholars from a variety of fields who were in the room.

  5. Opportunities for Collaboration

I asked many researchers if they were concerned that they were not saving a snapshot of websites or apps at the time of their studies. The answer was a resounding “yes!” They recognize that sites and tools change rapidly, but they were unaware of tools or services they could use, or that their librarians and archivists have solutions.

Clearly there is room for librarians/archivists to conduct more outreach to researchers to inform them about our rich web archive collections and to talk with them about preservation solutions, good data management practices and copyright.

Who knew?

Let me end with sharing one tidbit that really blew my mind. In her research on “Dead Online: Practices of Post-Mortem Digital Interaction,” Paula Kiel presented on the “digital platforms designed to enable post-mortem interactions.” Yes, she was talking about websites where you can send posthumous messages via Facebook and email! For example, https://www.safebeyond.com/, “Life continues when you pass… Ensure your presence – be there when it counts. Leave messages for your loved ones – for FREE!”



By Rosalie Lack, Product Manager, California Digital Library

Being a Small-Time Software Contributor–Non-Developers Included


At the IIPC General Assembly 2015, we heard a call for contributors to IIPC-relevant software projects (e.g. OpenWayback and Heritrix). We imagined what we could accomplish if every member institution could contribute half a developer’s time to work on these tools. As individuals, though, we are part of the IIPC because of the institutions for which we work. The tasks assigned by our employers come first, not always leaving an abundance of time for external projects. However, there are several ways to contribute on a smaller scale (not just committing code).

How To Help

1. Provide user support for OpenWayback and Heritrix

Join the openwayback-dev list and/or the Heritrix list, and answer questions when you can.

2. Log issues for software problems

Anytime you notice that something isn’t working as expected in a piece of software, report the issue. For projects like OpenWayback and Heritrix that are on GitHub, creating an account to enable reporting issues is easy. If you aren’t sure whether the problem warrants opening an issue, send a message to the relevant mailing list.

3. Follow issues on the OpenWayback and Heritrix GitHub repositories

Check issue trackers regularly or “Watch” GitHub repositories to receive issue updates via email. If you see an issue for a bug or new feature relevant to your institution, comment on it, even if only to say that it is relevant. This helps the developers prioritize which issues to work on.


4. Test release candidates

When a new distribution of OpenWayback is about to be released, the development group sends out emails asking for people to test the release distribution candidates. Verify whether the deployment works in your environment and use cases. Then report back.

5. Contribute to documentation

For any web archiving project, if you find documentation that is lacking or unclear, report it to the maintainers, and if possible, volunteer to fix it.

6. Contribute to code

OpenWayback currently has several open issues for bugs and enhancements. If you find an issue of interest to you and/or your institution, notify others with a comment that you want to work on it. View the contribution guidelines, and start contributing. OpenWayback and Heritrix are happy to get pull requests.

7. Review code

When others submit code for potential inclusion into a project’s master code branch, volunteer to review the code and test it by deploying the software with the changes in place to verify everything works as expected.

8. Join the OpenWayback Developer calls

If you are interested in contributing to OpenWayback, these calls keep you informed on the current state of development. The group is always looking for help with testing release candidates, prioritizing issues, writing documentation, reviewing pull requests, and writing code. Calls take place approximately every three weeks at 4 PM London time; there is also a Google Groups list. Email the IIPC PCO to join.

9. Solicit development support from your institution

Non-developers have an important role to play in the development effort. Encourage technical staff you work with to contribute to software projects and help them build time into their schedules for it. If you are not in a position to do this, lobby the people who can grant some of your institution’s developer time to web archiving projects.

What You Get Back

Collaborating on web archiving projects isn’t just about what you contribute. The more you follow mailing lists and issue trackers and the more you work with code and its deployment, the better your institution can utilize the software and keep current on the direction of its development.

If your institution doesn’t use OpenWayback or Heritrix, the above ways of helping apply to many other web archiving software projects. So get involved where you can; you don’t have to fix everything.

Lauren Ko
Programmer, Digital Libraries Division, UNT Libraries

Politics, Archaeology, and Swiss Cheese: Meghan Dougherty Shares Her Experiences with Web Archiving

Meghan Dougherty, Assistant Professor in Digital Communication at Loyola University Chicago, started our interview by warning me that she is the “odd man out” when it comes to web archiving and uses web archives differently than most. I was immediately intrigued!

Meghan’s research agenda is the nature of inquiry and the nature of evidence. All her research is conducted within the framework of questioning methodology; that is, she asks questions about how archives are built, how that process influences what is collected, and how that process then influences scientific inquiry.

Roots in Politics

Before closely examining methodology, Meghan spent “hands-on” time starting in the early 2000s working with a research group called webarchivist.org, co-founded by Kirsten Foot (University of Washington) and Steve Schneider (SUNYIT). The interdisciplinary nature of the work at this organization was evident in the project members, which included two political scientists and two communications scholars, focusing on both qualitative and quantitative analysis. Their big research question was “what is the impact of the Internet on politics?”

Meghan and the rest of the research group recognized that if you are going to look at how the Internet affects politics, then you need to analyze how it changes over time. To do that you need to slow it down, to essentially take snapshots so you can do an analytical comparison.

To achieve this goal, the team worked collaboratively with the Internet Archive and the Library of Congress to build an election web archive, specifically around U.S. House, Senate, presidential, and gubernatorial elections, with a focus on candidate websites.

The Tao of a Website

As they were doing rigorous quantitative content analysis of election websites, the team was also asked to take extensive field notes to document everything they noticed. This, in turn, is how Meghan became curious about studying methodology. Looking at these sites in such detail prompted many questions:

“What exactly has the crawler captured in an archive? What am I looking at? If a website is fluid and moving and constantly updated, then what is this thing we’ve captured? What is the nature of ‘being’ for objects on the web? If I capture a snapshot, am I really capturing anything, or is it just a resemblance of the thing that existed?”

Meghan admits she doesn’t have all of the answers, but she challenges her fellow scholars to ask these difficult questions and not try to neatly tie up their research with a bow by simplifying the analysis. She cautions that before you can gain knowledge about social and behavioral change over time in the digital world, you need to have a sensibility about what it actually means. Without answering that question, the research methods are just practice and not actually knowledge-building systems.

The Big Secret

Meghan appreciates when archivists and librarians ask her how they can help to support her in her work. What she really needs, she says, is a long-term collaborator, because frankly she doesn’t know what she wants.

“What if I told you that we don’t know what we want to analyze? We really need to think about these things together. The big secret is that we don’t know what we want because we don’t know what we’re dealing with. We are still working through it and we need you [curators and librarians] to help us to think about what an archive is, what we can collect, and how it gets collected. So we can build knowledge together about this collection of evidence.”

In hearing Meghan discuss two small-scale research projects, it was evident that even within her own research portfolio she has very different requirements for web archives.


Ask a Curator is a once-a-year event, when cultural heritage institutions across the world open up for anyone to engage with their curators via Twitter.

By analyzing tweets with the #AskACurator hashtag, Meghan is studying how groups of people come together and interact with institutions and how institutions reach out with digital media to connect with their public.

In this example, Meghan stresses that completeness and precision of data are critical. If the archive of tweets for this hashtag is incomplete, then big chunks of really interesting mini-conversations will be missing from Meghan’s data. In addition, missing data will skew her categorizations and must be accounted for.


By Eye Steel Film from Canada (Muslim Punks) [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons
Another project, which is more like an ethnographic study, focuses on an online community (Taqwacore) of young people gathered around their faith in Islam, interest in punk music, and political activism. Meghan is studying a wide variety of print and online materials, including a small-press novel (which launched this sub-culture), materials distributed online and handed out at concerts, and materials distributed in person and on the online community pages joined by kids living all over the world.

In this study, the precision and completeness of the evidence doesn’t matter as much, because Meghan’s goal is to get a general gist of the subculture. She is conducting an ethnographic study, but in the past. So, instead of camping out in the scene in the moment, she is looking back in time at the conversations that they had and trying to understand who they were.

Digging Web Archives

In her research, Meghan has come to use the term web archaeology because she has found that regardless of her area of work, her research has felt like an archaeological dig in which she examines digital traces of past human behavior to understand her subject. Archaeology, not unlike web archiving, can be both destructive and constructive, and similarly archaeologists use very specific, specialized tools to find and uncover delicate remains of something that has been covered or even mostly lost over time.

At this year’s IIPC General Assembly (http://netpreserve.org/general-assembly/2015/overview), Meghan introduced her web archaeology idea, which is also the topic of her forthcoming book (“Virtual Digs: excavating, archiving, preserving and curating the web” from University of Toronto Press), through a tongue-in-cheek video from The Onion about uncovering the ruins of a Friendster civilization.

While the video is intended as satire, the topic raises a real question that we need to address: a hundred years from now, people are going to look back at our communication media, such as Facebook, but what will future scholars be able to dig up?

All about the Holes

In a presentation at IIPC 2011, Barbara Signori, Head of the Department e-Helvetica at the Swiss National Library, shared a wonderful analogy about how the holes in our archives are like the holes in Swiss cheese: inevitable. When I asked Meghan to share something that surprised her about her research, she shared a story about the holes.

"Emmentaler". Licensed under CC BY-SA 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Emmentaler.jpg#/media/File:Emmentaler.jpg

When working with the Library of Congress back in the early 2000s, Meghan’s research group provided a list of political candidates to the Library of Congress staff for crawling. Library of Congress staff created an index of the sites crawled, but they did not create an entry in cases where no websites existed.

Meghan and her fellow researchers were surprised because it seemed obvious to them that you would document the candidates who had websites, as well as those who didn’t. Knowing that a candidate DID NOT have a website in the early 2000s was a big deal, and would have a huge impact on findings! Absence shows us something very interesting about the environment.

Meghan would go so far as to say that a quirk about web archives is that librarians and curators are so focused on the cheese, while researchers find the holes of equal interest.

This blog post is the third in a series of interviews with researchers to learn about their use of web archives.

By Rosalie Lack, Product Manager, California Digital Library

How Well Are Arabic Websites Archived?

Arabic summary

Web archiving is the process of collecting data published on the web in order to preserve it from loss and make it available to researchers in the future. We carried out this study to estimate the extent to which Arabic websites are archived and indexed. We collected 15,092 links from three websites that serve as directories of Arabic sites: the Arabic DMOZ directory, the Raddadi directory, and the Star28 directory. We then used language detection tools and kept only the Arabic-language websites, leaving 7,976 links. We then crawled the live sites, producing 300,646 links. From this sample we found that:
1) 46% of Arabic websites are not archived, and 31% of Arabic websites are not indexed by Google.
2) 14.84% of Arabic websites have an Arabic country code top-level domain, such as .sa, and 10.53% have an Arabic geographic location based on the host’s IP address.
3) Having either an Arabic geographic location or an Arabic country code negatively affects a site’s archiving.
4) Most archived pages are near the top level of a site, while pages deep within a site are not well archived.
5) A site’s presence in the Arabic DMOZ positively affects its archiving.

It is anecdotally known that archives favor content in English and from Western countries. In this blog post we summarize our JCDL 2015 paper “How Well are Arabic Websites Archived?“, where we provide an initial quantitative exploration of this well-known phenomenon. When comparing the number of mementos for English vs. Arabic websites, we found that English websites are archived more than Arabic websites. For example, when comparing a highly ranked English sports website (based on Alexa ranking), such as ESPN, with a highly ranked Arabic sports website, such as Kooora, we find that ESPN has almost 13,000 mementos while Kooora has only 2,000.

Figure 1

We also compared the English and Arabic versions of Wikipedia and found that the English Wikipedia has 10,000 mementos, while the Arabic Wikipedia has only around 500.

Figure 2

Arabic is the fourth most popular language on the Internet, trailing only English, Chinese, and Spanish. Based on Internet World Stats, in 2009 only 17% of Arabic speakers used the Internet, but by the end of 2013 that had increased to almost 36% (over 135 million people), approaching the world average of 39% of the population using the Internet.

Our initial step, collecting Arabic seed URIs, presented our first challenge. We found that Arabic websites could have:
1) Both an Arabic geographic IP location (GeoIP) and an Arabic country code top-level domain (ccTLD), such as www.uoh.edu.sa.
2) An Arabic GeoIP but a non-Arabic ccTLD, such as www.al-watan.com.
3) An Arabic ccTLD but a non-Arabic GeoIP, such as www.haraj.com.sa, with a GeoIP in Ireland.
4) Neither an Arabic GeoIP nor an Arabic ccTLD, such as www.alarabiyah.com, with a GeoIP in the US.

So for collecting the seed URIs we first searched for Arabic website directories and grabbed the top three based on Alexa ranking. We selected all live URIs (11,014) from the following resources:
1) The Open Directory Project (DMOZ), registered in the US in 1999.
2) Raddadi, a well-known Arabic directory, registered in Saudi Arabia in 2000.
3) Star28, an Arabic directory registered in Lebanon in 2004.

Although these URIs are listed in Arabic directories, it does not mean that their content is in Arabic. For example, www.arabnews.com is an Arab news website listed in Star28, but it provides English-language news about Arabic-related topics.

It was hard to find a reliable language test to determine the language of a page, so we employed four different methods: the HTTP Content-Language header, the HTML title tag, a trigram method, and a language detection API. As shown in Figure 3, the intersection between the four methods was only 8%. We decided that any page that passed any one of these tests would be included as “in the Arabic web”. The resulting number of Arabic seed URIs was 7,976 out of 11,014.
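As a simplified illustration of this kind of language testing (not the exact tools used in the paper), the sketch below combines two of the checks, the HTTP Content-Language header and an automatic detector, using the requests and langdetect packages as stand-ins:

```python
# Simplified sketch of two of the four language tests (the HTTP
# Content-Language header and an automatic detector), using `requests` and
# the `langdetect` package as stand-ins for the tools used in the paper.
import requests
from langdetect import detect

def looks_arabic(url):
    """Return True if either test suggests the page is in Arabic."""
    try:
        response = requests.get(url, timeout=30)
    except requests.RequestException:
        return False
    # Test 1: the HTTP Content-Language response header.
    if response.headers.get("Content-Language", "").lower().startswith("ar"):
        return True
    # Test 2: automatic detection on the raw page text; a fuller pipeline
    # would strip the HTML markup and also check the <title> tag.
    try:
        return detect(response.text) == "ar"
    except Exception:  # detection can fail on empty or very short pages
        return False

print(looks_arabic("https://ar.wikipedia.org/"))
```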

Figure 3

To increase the number of URIs, we crawled the live Arabic seed URIs and checked the language using the previously described methods. This increased our data set to 300,646 Arabic seed URIs.

Next we used the ODU Memento Aggregator (mementoproxy.cs.odu.edu) to check whether the URIs were archived in a public web archive. We found that 53.77% of the URIs are archived, with a median of 16 mementos per URI. We also analyzed the timespan of the mementos (the number of days between the datetimes of the first and last mementos) and found that the median archiving period was 48 days.
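A sketch of how memento counts and timespans can be derived from a Memento TimeMap is shown below; it uses the public Time Travel aggregator endpoint for illustration (the study itself used the ODU Memento Aggregator), and the link-format parsing is deliberately simplified:

```python
# Sketch of counting mementos and measuring the archiving timespan for a URI
# from a Memento TimeMap. The public Time Travel aggregator endpoint is used
# here for illustration; the link-format parsing below is simplified.
import re
import requests
from email.utils import parsedate_to_datetime

TIMEMAP = "http://timetravel.mementoweb.org/timemap/link/{uri}"
DATETIME_RE = re.compile(r'rel="[^"]*memento[^"]*";\s*datetime="([^"]+)"')

def memento_stats(uri):
    """Return (memento count, timespan in days) for a URI, or (0, 0) if none."""
    response = requests.get(TIMEMAP.format(uri=uri), timeout=60)
    if response.status_code != 200:
        return 0, 0
    datetimes = [parsedate_to_datetime(value)
                 for value in DATETIME_RE.findall(response.text)]
    if not datetimes:
        return 0, 0
    return len(datetimes), (max(datetimes) - min(datetimes)).days

print(memento_stats("http://www.kooora.com/"))
```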

We also investigated the relationship between seed source and archiving, and found that DMOZ URIs had an archiving rate of 96%, followed by 45% for Raddadi and 42% for Star28.

In the data set we found that 14% of the URIs had an Arabic ccTLD. We also looked at the GeoIP location, since it indicates where the hosts of webpages are located. Using MaxMind GeoLite2, we found that 58% of the Arabic seed URIs are hosted in the US.
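For illustration, the ccTLD and GeoIP checks could be sketched as follows, assuming the geoip2 package and a locally downloaded copy of the free MaxMind GeoLite2 Country database; the country lists are abridged and the database file name is an assumption:

```python
# Sketch of the ccTLD and GeoIP checks, assuming the `geoip2` package and a
# local copy of the MaxMind GeoLite2 Country database (file name assumed;
# the database must be downloaded separately from MaxMind).
import socket
from urllib.parse import urlparse
import geoip2.database
import geoip2.errors

# Abridged lists for illustration; the study used the full set of Arabic
# countries and their country code top-level domains.
ARABIC_COUNTRIES = {"SA", "EG", "AE", "JO", "LB", "MA", "DZ", "TN", "IQ", "KW"}
ARABIC_CCTLDS = {"sa", "eg", "ae", "jo", "lb", "ma", "dz", "tn", "iq", "kw"}

reader = geoip2.database.Reader("GeoLite2-Country.mmdb")

def classify(url):
    """Return (has Arabic ccTLD, has Arabic GeoIP) for a seed URL."""
    host = urlparse(url).hostname or ""
    has_cctld = host.rsplit(".", 1)[-1].lower() in ARABIC_CCTLDS
    try:
        country = reader.country(socket.gethostbyname(host)).country.iso_code
    except (socket.gaierror, geoip2.errors.AddressNotFoundError):
        country = None
    return has_cctld, country in ARABIC_COUNTRIES

print(classify("http://www.uoh.edu.sa/"))
```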

Figure 4 shows the count detail for Arabic GeoIP and ccTLD. We found that: 1) only 2.5% of the URIs are located in an Arabic country, 2) only 7.7% have an Arabic ccTLD, 3) 8.6% are both located in an Arabic country and have an Arabic ccTLD, and 4) the rest of the URIs (81%) are neither located in an Arabic country nor have an Arabic ccTLD.

Figure 4

We also wanted to verify whether the URIs had existed long enough to be archived. We used the CarbonDate tool, developed by members of the WS-DL group, to analyze our archived Arabic data set. We found that 2013 was the most frequent creation date for archived Arabic webpages. We also wanted to investigate the gap between the creation date of Arabic websites and when they were first archived. We found that 19% of the URIs have an estimated creation date that is the same as the first memento date. Of the remaining URIs, 28% have a creation date more than one year before the first memento was archived.

It was interesting to find out whether the Arabic URIs are indexed in search engines. We used Google’s Custom Search API (which may produce different results than the public Google web interface) and found that 31% of the Arabic URIs were not indexed by Google. When looking at the source of the URIs, we found that 82% of the DMOZ URIs are indexed by Google, which was expected, since a URI listed in DMOZ is more likely to be found and archived.
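A sketch of such an indexing check against the Custom Search JSON API is given below; the API key and search engine ID are placeholders, and matching on the returned result links is a simplification of how indexing was actually measured:

```python
# Sketch of checking whether a URI appears among Google Custom Search results
# (API key and search engine ID are placeholders; results can differ from the
# public Google interface, as noted above).
import requests

API_KEY = "YOUR_API_KEY"          # placeholder
SEARCH_ENGINE_ID = "YOUR_CX_ID"   # placeholder

def is_indexed(uri):
    """Return True if the Custom Search API returns the URI among its results."""
    response = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": SEARCH_ENGINE_ID, "q": uri},
        timeout=30,
    )
    response.raise_for_status()
    items = response.json().get("items", [])
    return any(item.get("link", "").rstrip("/") == uri.rstrip("/") for item in items)

# Example usage (requires valid credentials):
# print(is_indexed("http://www.uoh.edu.sa/"))
```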

In conclusion, when looking at the seed URIs we found that DMOZ URIs are more likely to be found and archived, and a website is more likely to be indexed if it is present in a directory. For right now, if you want your Arabic-language webpage to be archived, host it outside of an Arabic country and get it listed in DMOZ.

I presented this work at JCDL 2015; the presentation slides can be found here.

by Lulwah M. Alkwai, PhD student, Computer Science Department, Old Dominion University, VA, USA