Politics, Archaeology, and Swiss Cheese: Meghan Dougherty Shares Her Experiences with Web Archiving

Meghan Dougherty, Assistant Professor in Digital Communication at Loyola University, Chicago, started our interview by warning me that she is the “odd man out” when it comes to web archiving and uses web archives differently than most. I was immediately intrigued!

Meghan’s research agenda centers on the nature of inquiry and the nature of evidence. All her research is conducted within a framework of questioning methodology; that is, she asks how archives are built, how that process influences what is collected, and how that process then influences scientific inquiry.

Roots in Politics

Before closely examining methodology, Meghan spent “hands-on” time starting in the early 2000s working with a research group called webarchivist.org, co-founded by Kirsten Foot (University of Washington) and Steve Schneider (SUNYIT). The interdisciplinary nature of the work at this organization was evident in the project members, which included two political scientists and two communications scholars, focusing on both qualitative and quantitative analysis. Their big research question was “what is the impact of the Internet on politics?”

Meghan and the rest of the research group recognized that if you are going to look at how the Internet affects politics, then you need to analyze how it changes over time. To do that you need to slow it down, to essentially take snapshots so you can do an analytical comparison.

To achieve this goal, the team worked collaboratively with the Internet Archive and the Library of Congress to build an election web archive, specifically around U.S. House, Senate, presidential, and gubernatorial elections, with a focus on candidate websites.

The Tao of a Website

As they were doing rigorous quantitative content analysis of election websites, the team was also asked to take extensive field notes to document everything they noticed. This, in turn, is how Meghan became curious about studying methodology. Looking at these sites in such detail prompted many questions:

“What exactly has the crawler captured in an archive? What am I looking at? If a website is fluid and moving and constantly updated, then what is this thing we’ve captured? What is the nature of ‘being’ for objects on the web? If I capture a snapshot, am I really capturing anything, or is it just a resemblance of the thing that existed?”

Meghan admits she doesn’t have all of the answers, but she challenges her fellow scholars to ask these difficult questions and not try to neatly tie up their research with a bow by simplifying the analysis. She cautions that before you can gain knowledge about social and behavioral change over time in the digital world, you need to have a sensibility about what it actually means. Without answering that question, the research methods are just practice, not actually knowledge-building systems.

The Big Secret

Meghan appreciates when archivists and librarians ask her how they can help to support her in her work. What she really needs, she says, is a long-term collaborator, because frankly she doesn’t know what she wants.

“What if I told you that we don’t know what we want to analyze. We really need to think about these things together. The big secret is that we don’t know what we want because we don’t know what we’re dealing with. We are still working through it and we need you [curators and librarians] to help us to think about what an archive is, what we can collect, and how it gets collected. So we can build knowledge together about this collection of evidence.”

In hearing Meghan discuss two small-scale research projects, it was evident that even within her own research portfolio she has very different requirements for web archives.


Ask a Curator is a once-a-year event, when cultural heritage institutions across the world open up for anyone to engage with their curators via Twitter.

By analyzing tweets with the #AskACurator hashtag, Meghan is studying how groups of people come together and interact with institutions and how institutions reach out with digital media to connect with their public.

In this example, Meghan stresses that completeness and precision of data are critical. If the archive of tweets for this hashtag is incomplete, then big chunks of really interesting mini-conversations will be missing from Meghan’s data. In addition, missing data will skew her categorizations and must be accounted for.


By Eye Steel Film from Canada (Muslim Punks) [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons
Another project is more like an ethnographic study of Taqwacore, an online community of young people gathered around their faith in Islam, interest in punk music, and political activism. Meghan is studying a wide variety of print and online materials, including the small-press novel that launched this subculture, materials distributed online and handed out at concerts, and the online community pages joined by kids living all over the world.

In this study, the precision and completeness of the evidence doesn’t matter as much because Meghan’s goal is to get a general gist of the subculture. She is conducting an ethnographic study, but in the past. So, instead of camping out in the scene in the moment, she is looking back in time at the conversations that community members had and trying to understand who they were.

Digging Web Archives

In her research, Meghan has come to use the term web archaeology because, regardless of her area of work, her research has felt like an archaeological dig in which she examines digital traces of past human behavior to understand her subject. Archaeology, not unlike web archiving, can be both destructive and constructive, and similarly archaeologists use very specific, specialized tools to find and uncover delicate remains of something that has been covered or even mostly lost over time.

At this year’s IIPC General Assembly (http://netpreserve.org/general-assembly/2015/overview), Meghan introduced her web archaeology idea, which is also the topic of her forthcoming book (“Virtual Digs: Excavating, Archiving, Preserving and Curating the Web” from University of Toronto Press), through a tongue-in-cheek video from The Onion about uncovering the ruins of a Friendster civilization.

While the video is intended as satire, it raises a real question that we need to address: a hundred years from now, people are going to look back at our communication media, such as Facebook, but what will future scholars actually be able to dig up?

All about the Holes

In a presentation at IIPC 2011, Barbara Signori, Head of the Department e-Helvetica at Swiss National Library, shared a wonderful analogy about how the holes in our archives are like the holes in Swiss cheese – inevitable. When I asked Meghan to share something that surprised her about her research, she shared a story about the holes.

“Emmentaler”. Licensed under CC BY-SA 3.0 via Wikimedia Commons – https://commons.wikimedia.org/wiki/File:Emmentaler.jpg#/media/File:Emmentaler.jpg

When working with the Library of Congress back in the early 2000s, Meghan’s research group provided a list of political candidates to the Library of Congress staff for crawling. Library of Congress staff created an index of the sites crawled, but they did not create an entry in cases where no websites existed.

Meghan and her fellow researchers were surprised because it seemed obvious to them that you would document the candidates who had websites, as well as those who didn’t. Knowing that a candidate DID NOT have a website in the early 2000s was a big deal, and would have a huge impact on findings! Absence shows us something very interesting about the environment.

Meghan would go so far as to say that a quirk about web archives is that librarians and curators are so focused on the cheese, while researchers find the holes of equal interest.

This blog post is the third in a series of interviews with researchers to learn about their use of web archives.

By Rosalie Lack, Product Manager, California Digital Library

How Well Are Arabic Websites Archived?

Summary (translated from the Arabic)

Web archiving is the process of collecting data from the web in order to preserve it from loss and make it available to researchers in the future. We conducted this study to estimate the extent to which Arabic websites are archived and indexed. We collected 15,092 links from three sites that serve as directories of the Arabic web: the Arabic DMOZ directory, the Raddadi directory, and the Star28 directory. We then applied language-detection tools and kept only the Arabic-language sites, leaving 7,976 links. Crawling the live sites among them produced 300,646 links. From this sample we found the following:
1) 46% of Arabic websites have not been archived, and 31% have not been indexed by Google.
2) 14.84% of Arabic websites have an Arabic country code top-level domain such as .sa, and 10.53% have an Arabic geographic location based on the host’s IP address.
3) Having either an Arabic geographic location or an Arabic country code negatively affects archiving.
4) Most archived pages are near the top level of a site; pages deep within a site are not well archived.
5) A site’s presence in the Arabic DMOZ directory positively affects its archiving.

It is anecdotally known that archives favor content in English and from Western countries. In this blog post we summarize our JCDL 2015 paper “How Well are Arabic Websites Archived?“, where we provide an initial quantitative exploration of this well-known phenomenon. When comparing the number of mementos for English vs. Arabic websites we found that English websites are archived more than Arabic websites. For example, when comparing a high ranked English sports website based on Alexa ranking, such as ESPN, with a high ranked Arabic sport website, such as Kooora, we find that ESPN has almost 13,000 mementos, and Kooora has only 2,000 mementos.

Figure 1

We also compared the English and Arabic Wikipedias and found that the English Wikipedia has 10,000 mementos vs. only around 500 for the Arabic Wikipedia.

Figure 2

Arabic is the fourth most popular language on the Internet, trailing only English, Chinese, and Spanish. According to Internet World Stats, in 2009 only 17% of Arabic speakers used the Internet, but by the end of 2013 that had increased to almost 36% (over 135 million people), approaching the world average of 39% of the population using the Internet.

Our initial step, collecting Arabic seed URIs, presented our first challenge. We found that Arabic websites could have:
1) Both Arabic geographic IP location (GeoIP) and an Arabic country code top level domain (ccTLD) such as www.uoh.edu.sa.
2) An Arabic GeoIP, but a non-Arabic ccTLD, such as www.al-watan.com.
3) An Arabic ccTLD, but a non-Arabic GeoIP, such as www.haraj.com.sa, with a GeoIP in Ireland.
4) Neither an Arabic GeoIP nor an Arabic ccTLD, such as www.alarabiyah.com, with a GeoIP in the US.
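The four cases above can be sketched as a small classifier. This is an illustrative sketch only: the country and ccTLD sets below are partial stand-ins for the full lists an actual study would use, and the GeoIP country is passed in directly rather than looked up in a real GeoIP database such as MaxMind GeoLite2.

```python
# Sketch: classifying a host into the four GeoIP/ccTLD cases.
# The sets below are illustrative and incomplete (assumptions, not the
# paper's actual lists).

ARABIC_CCTLDS = {"sa", "ae", "eg", "jo", "lb", "ma", "kw", "qa"}
ARABIC_COUNTRIES = {"Saudi Arabia", "United Arab Emirates", "Egypt",
                    "Jordan", "Lebanon", "Morocco", "Kuwait", "Qatar"}

def classify(host: str, geoip_country: str) -> int:
    """Return which of the four cases (1-4) a host falls into."""
    has_cctld = host.rsplit(".", 1)[-1] in ARABIC_CCTLDS
    has_geoip = geoip_country in ARABIC_COUNTRIES
    if has_geoip and has_cctld:
        return 1  # Arabic GeoIP and Arabic ccTLD
    if has_geoip:
        return 2  # Arabic GeoIP, non-Arabic ccTLD
    if has_cctld:
        return 3  # Arabic ccTLD, non-Arabic GeoIP
    return 4      # neither

print(classify("www.uoh.edu.sa", "Saudi Arabia"))      # → 1
print(classify("www.al-watan.com", "Saudi Arabia"))    # → 2
print(classify("www.haraj.com.sa", "Ireland"))         # → 3
print(classify("www.alarabiyah.com", "United States")) # → 4
```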

So for collecting the seed URIs we first searched for Arabic website directories, and grabbed the top three based on Alexa ranking. We selected all live URIs (11,014) from the following resources:
1) Open Directory project (DMOZ) – registered in US in 1999.
2) Raddadi – a well known Arabic directory, registered in Saudi Arabia in 2000.
3) Star28 – an Arabic directory registered in Lebanon in 2004.

Although these URIs are listed in Arabic directories, that does not mean their content is in Arabic. For example, www.arabnews.com is an Arabic news website listed in Star28 but provides English-language news about Arabic-related topics.

It was hard to find a reliable language test to determine the language of a page, so we employed four different methods: the HTTP Content-Language header, the HTML title tag, a trigram method, and a language-detection API. As shown in Figure 3, the intersection between the four methods was only 8%. We decided that any page that passed any of these tests would be included as “in the Arabic web”. The resulting number of Arabic seed URIs was 7,976 out of 11,014.

Figure 3
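The “any test passes” rule amounts to a union, not an intersection, of the four detectors. A minimal sketch follows; the two helper functions are crude stand-ins for the first two methods, and the trigram and API results are passed in as precomputed booleans (all of this is illustrative, not the paper’s actual code).

```python
def content_language_is_arabic(headers: dict) -> bool:
    # Method 1: HTTP Content-Language header (e.g. "ar" or "ar-SA")
    return "ar" in headers.get("Content-Language", "").lower()

def title_is_arabic(title: str) -> bool:
    # Method 2: crude check of the HTML <title> for Arabic-block characters
    return any("\u0600" <= ch <= "\u06FF" for ch in title)

def in_arabic_web(headers: dict, title: str,
                  trigram_says_arabic: bool, api_says_arabic: bool) -> bool:
    # A page is kept if ANY of the four tests says Arabic (a union;
    # the intersection of all four covered only 8% of pages).
    return (content_language_is_arabic(headers)
            or title_is_arabic(title)
            or trigram_says_arabic   # Method 3: trigram model (assumed result)
            or api_says_arabic)      # Method 4: detection API (assumed result)

print(in_arabic_web({"Content-Language": "ar-SA"}, "", False, False))  # → True
print(in_arabic_web({}, "Arab News", False, False))                    # → False
```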

To increase the number of URIs, we crawled the live Arabic seed URIs and checked the language using the previously described methods. This increased our data set to 300,646 Arabic seed URIs.

Next we used the ODU Memento Aggregator (mementoproxy.cs.odu.edu) to verify if the URIs were archived in a public web archive. We found that 53.77% of the URIs are archived with a median of 16 mementos per URI. We also analyzed the timespan of the mementos (the number of days between the datetimes of the first memento and last memento) and found that the median archiving period was 48 days.
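Given each URI’s memento datetimes (as a Memento TimeMap would return them), the per-URI counts and timespans, and their medians, fall out directly. The data below is made up for illustration; the real study queried the aggregator for every URI.

```python
from datetime import date
from statistics import median

# Hypothetical memento capture dates for three URIs (illustrative only)
mementos = {
    "http://example1.sa/":  [date(2013, 1, 1), date(2013, 1, 20), date(2013, 2, 18)],
    "http://example2.com/": [date(2013, 6, 1)],
    "http://example3.com/": [date(2012, 1, 1), date(2012, 3, 1)],
}

counts = [len(dates) for dates in mementos.values()]
# Timespan: days between a URI's first and last memento
timespans = [(max(dates) - min(dates)).days for dates in mementos.values()]

print(median(counts))     # → 2
print(median(timespans))  # → 48
```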

We also investigated seed source and archiving rates: DMOZ had an archiving rate of 96%, followed by Raddadi at 45% and Star28 at 42%.

In the data set we found that 14% of the URIs had an Arabic ccTLD. We also looked at the GeoIP location since it was an important factor to determine where the hosts of webpages might be located. Using MaxMind GeoLite2, we found 58% of the Arabic seed URIs are hosted in the US.

Figure 4 shows count detail for Arabic GeoIP and ccTLD. We found that: 1) only 2.5% of the URIs are located in an Arabic country, 2) only 7.7% had an Arabic ccTLD, 3) 8.6% are both located in an Arabic country and have an Arabic ccTLD, and 4) the rest of the URIs (81%) are neither located in Arabic country, nor had an Arabic ccTLD.

Figure 4

We also wanted to verify whether the URIs had been around long enough to be archived. We used the CarbonDate tool, developed by members of the WS-DL group, to analyze our archived Arabic data set and found that 2013 was the most frequent creation date for archived Arabic webpages. We also investigated the gap between the creation date of Arabic websites and when they were first archived: 19% of the URIs have an estimated creation date that is the same as the first memento date, while 28% have an estimated creation date more than one year before the first memento was archived.

It was interesting to find out whether the Arabic URIs are indexed in search engines. We used Google’s Custom Search API (which may produce different results than Google’s public web interface) and found that 31% of the Arabic URIs were not indexed by Google. Looking at the source of the URIs, we found that 82% of the DMOZ URIs are indexed by Google; this was expected, since DMOZ sites are more likely to be found and archived.

In conclusion, when looking at the seed URIs we found that DMOZ URIs are more likely to be found and archived, and that a website is more likely to be indexed if it is present in a directory. For now, if you want your Arabic-language webpage to be archived, host it outside of an Arabic country and get it listed in DMOZ.

I presented this work at JCDL 2015; the presentation slides can be found here.

by Lulwah M. Alkwai, PhD student, Computer Science Department, Old Dominion University, VA, USA

Web Archives: Preserving the Everyday Record

In talking with Ian Milligan, Assistant Professor of Digital and Canadian History at the University of Waterloo, you are immediately impressed by his excitement for web archives and how web archiving is fundamentally changing research.

Ian uses web archives for his historical research to demonstrate their relevance and importance. While he clearly sees the value of web archives, he also recognizes the need to improve access in order to increase usage. To that end, he recently launched Webarchives.ca, an archive dedicated to Canadian politics. Ian is also providing pedagogical support for students using digital materials, including web archives.

I interviewed Ian recently to get his thoughts about these and other web archiving topics.

Remembering Geocities: A Community on the Web

Among Ian’s research projects is the study of Geocities. Remember Geocities? It was a user-generated web-hosting community that flourished in the late 1990s and 2000s. Unlike other lost civilizations, we know the cause of Geocities’ demise: Yahoo shut it down in 2009. If it were not for the Internet Archive and Jason Scott’s Archive Team, Geocities would be lost forever.

For those who might ask if it was worth saving, Ian would offer a resounding YES! For Ian, Geocities provides a rich historical source for gaining insight into a pivotal moment in time. It is one of the first examples of democratized web access, when average people could reach bigger audiences than ever before. At its height, Geocities featured more than 38 million pages.

Source: Internet Archive’s Wayback Machine, December 1, 2009 capture

Some of the research questions Ian is asking about the Geocities corpus include:

  • How was community enacted?
  • How was community lived in a place like Geocities?
  • Was there actually a sense of community on the web?

While these questions might sound like standard research questions, they are only now being recast over “untraditional” sources, such as Geocities.

Archiving Politics

In an effort to improve access to web archives, Ian worked on a project to launch Webarchives.ca, a research corpus containing Canadian Political Parties and Political Interest Groups sites collected since 2005 by the University of Toronto using the Internet Archive’s Archive-It service. Ian teamed up with researchers from the University of Maryland, York University in Toronto, and Western University in London, Ontario to build this massive collection of more than 14 million “documents.”  To help navigate this large collection, UK Web Archive’s Shine front-end was implemented.

Once I got started looking at Webarchives.ca, I couldn’t stop myself from digging further into such a wealth of information. I particularly liked the graphing of terms over time feature, which allows you to see when terms go in and out of use by political parties.

In sharing his takeaways from working with these data, Ian observed that it is equally interesting to see when terms do not appear as when they do.

A Pivotal Shift for Scholarship

Ian shared some concrete examples of how the rise of web archives represents a pivotal shift for scholarship. Let’s take, for instance, particular segments of the population, such as young people, who have traditionally been left out of the historical record.

When Ian was researching the 1960s in order to understand the voice of young activists, he found the sources to be scarce. Conversations among activists tended to happen in coffeehouses, bars, and other places where records were not kept. So a historian can only hope that a young activist back then kept a diary that survived, or must track those activists down and interview them.

Contrast this to today’s world. With the explosion of social media, young people are writing things down and leaving records that we never would have had in the past. Web archiving tools can capture this information, which is a very rich and exciting development for historians, but only if these important records of daily life have been archived.

Is More Better?

The increase in information can be a double-edged sword. As Ian says, “there used to be such a scarcity of historical sources, now we have more information than we know what to do with.”

Ian is concerned that digital and digitized materials will be privileged as sources and/or misinterpreted. In a study conducted when materials were first being digitized, he found that scholars cited digital materials more often than analog ones. Basically, content that was more easily available online was getting used more.

Ian is also worried that there is not a deep understanding of how to critically use digital resources. Many are unaware, for example, of the limitations of simple keyword searching. Add web archives to the mix and the scale of the problem increases.

So Ian wrote a pedagogical book.

The Historian’s Macroscope: Exploring Big Historical Data, written with Shawn Graham and Scott Weingart, will be out later this year. The book is a sort of toolbox for upper-division history undergraduates, teaching them how to think critically about digital resources and how to avoid common pitfalls. It also includes “how to” information for analyzing data, such as basic data visualization and network analysis.

Always pushing the envelope, Ian and his co-authors wrote the first draft of their book online.

No “Do Overs”

Ian closed our interview by sharing a provocative statement that he made at the recent IIPC General Assembly. “You cannot study the history of the 90s unless you use web archives. It is a significant part of the record of the 1990s and 2000s for everyday people. When historians write the history of 9/11 or Occupy Wall Street, they are going to have to use web archives.”

As exciting as it is for historians to have access to these rich new resources, Ian also shared his biggest concern, which is that we need to ensure that we are saving websites. “Every day we are losing considerable amounts of our digital heritage. Gathering is critical. There are no ‘do overs.’”


This blog post is the second in a series of interviews with researchers to learn about their use of web archives.

By Rosalie Lack, Product Manager, California Digital Library

We want YOUR ideas for the IIPC General Assembly 2016

You will be pleased to hear that preparations for the IIPC General Assembly 2016 in Reykjavik, Iceland (11-15 April) are under way and we are aiming to make it the best one yet.

The program team have been hard at work looking at potential themes, topics and areas for discussion and debate. We would, however, love to have your input into this too!

So far, we’ve outlined the following areas:

  • Nuts and bolts of web archiving (management, metrics, organisation, programs)
  • De-duplication 
  • Researcher use cases (of web archives)
  • Big Data usage and potential
  • Web Archiving policies and frameworks / Preservation policies, Collection policies 
  • APIs
  • Web Archiving Tool development 
  • Legal deposit, copyright, data protection (EU wide perspective?)

What have we missed, what should we focus on, what would YOU like to see and hear about?

Please use the comments below to tell us what you would like from the conference. This will help frame the call for papers, due to go out at the end of October.

Thank you.

Jason Webber, IIPC Program and Communications Officer

Open letter by IIPC Chair

Greetings IIPC Members,

I hope that your summer is going very well and that you are all able to take some time off to recharge and spend time with family and friends. It is hard to believe that more than three months have passed since many of us were together at Stanford University in Palo Alto for our 2015 General Assembly (GA)!

I want to take this opportunity to once again say how impressed I was by the quality of the event. Everything from the organization of the entire event to the excellent interactions that our members engaged in brought significant value to the week.

I want to focus in on the Members’ Day that we had at the Internet Archive offices. At one point in the day, you were asked to break off into groups to discuss some of the important issues and challenges facing the IIPC in the near future. The Steering Committee met on the Saturday following the GA to discuss how we can better serve you – our members – and to ensure that we focus our limited resources on what brings the greatest value to the global web archiving community. I want to assure you that YOUR feedback was taken very seriously, and thanks to the leadership of Birgit Nordsmark Henriksen (Netarchive.dk) and Barbara Sierman (National Library of the Netherlands) the Steering Committee was able to distill your comments and input into four manageable work packages:

  1. Researcher Involvement
  2. Tools
  3. Connectedness
  4. Practicalities

Work on each of these elements has begun (thanks to dedicated teams looking at each individual area), and each group will come prepared to our upcoming in-person Steering Committee meeting in September. I will update you right after that meeting to let you know what you can expect from the IIPC in the coming year(s).

What I can tell you is that you can count on the IIPC continuing to be a robust and vibrant community, and that your contributions will become even more important as we move forward. Your Steering Committee remains committed to ensuring the value of your membership to the Consortium.

I welcome any comments or questions at paul.wagner@bac-lac.gc.ca

Stay tuned for more updates in September.

Paul N. Wagner, Chair, IIPC

Directeur général principal et DPI, Direction générale d’innovation et du Dirigeant principal de l’information – Senior Director General & CIO, Innovation and Chief Information Officer Branch

Bibliothèque et Archives Canada / Gouvernement du Canada – Library and Archives Canada / Government of Canada

What do the New York Times, Organizational Change, and Web Archiving all have in common?

The short answer is Matthew Weber. Matthew is an Assistant Professor at Rutgers in the School of Communication and Information. His research focus is on organizational change; in particular he’s been looking at how traditional organizations such as companies in the newspaper business have responded to major technological disruption such as the Internet, or mobile phone applications.

In order to study this type of phenomenon, you need web archives. Unfortunately, however, using web archives as a source for research can be challenging. This is where high performance computing (HPC) and big data come into the picture.


Luckily for Matthew, Rutgers has HPC, and lots of it. He’s working with powerful computer clusters running complex Java and Hadoop code to crack into Internet Archive (IA) data. Matthew first started working with the IA in 2008 through a summer research institute at Oxford University. More recently, working with colleagues at the Internet Archive and Northeastern University, Matthew received funding from the National Science Foundation to build tools that enable research access to Internet Archive data.

When Matthew says he works with big data, he means really big data, like 80 terabytes big. Matthew works in close partnership with PhD students in the computer science department who maintain the backend that allows him to run complex queries. He is also training PhD students in Communication and other social science disciplines to work with the Rutgers HPC system. In addition, Matthew has taught himself basic Pig (to be more exact, Pig Latin), a programming language for running queries on data stored in Hadoop.

Intimidated yet? Matthew says don’t be. A researcher can learn some basic tech skills and do quite a bit on his or her own. In fact, Matthew would argue that researchers must learn these skills because we are a long way off from point-and-click systems where you can find exactly the data you want. But there is help out there.

For example, IA’s Senior Data Engineer, Vinay Goel, has provided materials from a recent workshop that walk you through setting up and doing your own data analysis. Also, Professors Ian Milligan and Jimmy Lin from the University of Waterloo have pulled together some useful code and commentary that is relatively easy to follow. Finally, Codecademy is a good basic starting point.

Challenges Abound

Even though Matthew has access to HPC and is handy with basic Pig, there are still plenty of challenges.


One major challenge is metadata; mainly, there isn’t enough of it. In order to draw valid conclusions from data, researchers need a wealth of contextual data, such as the scope of the crawl, how often it was run, why those sites were chosen and not others, etc. They also need the metadata to be complete and consistent across all of the collections they’re analyzing.

As a researcher conducting quantitative analysis, Matthew has to make sure he’s accounting for any and all statistical errors that might creep into the data. In his recent research, for example, he was seeing consistent error patterns in hyperlinks within the network of media websites. He now has to account for this statistical error in his analysis.

To begin to tackle this problem, Matthew is working with researchers and web curators from a group of institutions, including Columbia University Libraries & Information Service’s Web Resources Collection Program, California Digital Library, International Internet Preservation Consortium (IIPC), and the University of Waterloo, to create a survey to learn from researchers across a broad spectrum of disciplines which metadata elements are essential to them. Matthew intends to share the results of this survey broadly with the web archiving community.

The Holes

Related to the metadata issues is the need for better documentation for missing data.

Matthew would love to have complete archives (along with complete descriptions). He recognizes, however, that there are holes in the data, just as there are with print archives. The difference is that the holes in a print archive are easier to know and define than the holes in web archive data, which you have to infer.

The Issue of Size

Matthew explained that for a recent study of news media from 1996 to 2000, you start by transferring the data, and one year of data from the Internet Archive took three days to transfer. You then need another two days to process it and run the code. That’s a five-day investment just to get data for a single year. And then you discover that you need another data point, so it starts all over again.

To help address this issue at Rutgers and to provide training datasets that help graduate students get started, they are creating and sharing derivative datasets. They have taken large web archive datasets, extracted small subsets (e.g., U.S. Senate data from the last five sessions), processed them, and produced smaller datasets that others can easily export to do their own analysis. This is essentially a public repository of data for reuse!
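The spirit of a derivative-dataset extraction can be sketched in a few lines. Everything here is assumed for illustration: the CSV layout, the field names, and the `.senate.gov` domain filter are hypothetical, not the actual Rutgers format.

```python
import csv
import io

# A tiny stand-in for a large CSV derived from a web archive
raw = io.StringIO(
    "url,crawl_date\n"
    "http://www.senate.gov/index.htm,2009-03-01\n"
    "http://mccain.senate.gov/press,2009-03-02\n"
    "http://www.example.com/,2009-03-01\n"
)

def extract_subset(infile, domain_suffix=".senate.gov"):
    """Keep only rows whose host falls under the given domain suffix."""
    rows = []
    for row in csv.DictReader(infile):
        host = row["url"].split("/")[2]
        if host.endswith(domain_suffix):
            rows.append(row)
    return rows

subset = extract_subset(raw)
print(len(subset))  # → 2
```

The same filter-then-share pattern scales: run it once over the 80-terabyte source on the cluster, publish the small output, and others never have to touch the full dataset.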

A Cool Space to Be In

As tools and collections develop, more and more researchers are starting to realize that web archives are fertile ground for research. Even though challenges remain, there’s clearly a shift toward more research based on web archives.

As Matthew put it, “Eight years ago when I started nobody cared… and now so many scholars are looking to ask critical questions about the way the web permeates our day-to-day lives… people are realizing that web archives are a key way to get at those questions. As a researcher, it’s a cool space to be in right now.”


By Rosalie Lack, Product Manager, California Digital Library

This blog post is the first in a series of interviews with researchers to learn about their research using web archives, and the attendant challenges and opportunities.

So You Want to Get Started in Web Archiving?

The web archiving community is a great one, but it can sometimes be a bit confusing to enter. Unlike communities such as the Digital Humanities, which has developed aggregation services like DH Now, the web archiving community is a bit more dispersed. But fear not, there are a few places to visit to get a quick sense of what’s going on.

Social Media

A substantial amount of web archiving scholarship happens online. I use Twitter (I’m at @ianmilligan1), for example, as a key way to share research findings and ideas that I have as my project comes together. I usually try to hashtag them with #webarchiving, which means that all tweets tagged with “#webarchiving” will show up in that specific timeline. For best results, using a Twitter client like Tweetdeck, Tweetbot, or Echofon can help you keep apprised of things. There may be Facebook groups, too; I actually don’t use Facebook (!) so I can’t provide much guidance there. On LinkedIn there are a few relevant groups: IIPC, Web Archiving, and Portuguese Web Archive.


Blogs

I’m wary of listing blogs, because I will almost certainly leave some out. Please accept my apologies in advance and add yours in the comments below! But a few are on my recurring must-visit list (in addition to this one, of course!):

  • Web Archiving Roundtable: Every week, they have a “Weekly web archiving roundup.” I don’t always have time to keep completely caught up, but I visit roughly weekly and once in a while make sure to download all the linked resources. Being included here is an honour.
  • The UK Web Archive Blog: This blog is a must-have on my RSS feed, and it keeps me posted on what the UK team is doing with their web archive. They do great things, from inspiring outreach, to tools development (i.e. Shine), to researcher reflections. A lively cast of guest bloggers and regulars.
  • Web Science and Digital Libraries Research Group: If you use web archiving research tools, chances are you’ve used some stuff from the WebSciDL group! This fantastic blog has a lively group of contributors, showcasing conference reports, research findings, and beyond. Another must visit.
  • Web Archives for Historians: This blog, written by Peter Webster and myself, aims to bring together scholarship on how historians can use web archives. We have guest posts as well as cross-posts from our own sites.
  • Peter Webster’s Blog: Peter also has his own blog, which covers a diverse range of topics including web archives.
  • Ian Milligan’s Blog: It feels weird including my own blog here, but what the heck. I provide lots of technical background to my own investigations into web archives.
  • The Internet Archive Blog: Almost doesn’t need any more information! It’s actually quite a diverse blog, but a go-to place to find out about cool new collections (the million album covers for example) or datasets that are available.
  • The Signal: Digital Preservation Blog: A diverse blog that occasionally covers web archiving (you can actually find the subcategory here). Well worth reading – and citing, for that matter!
  • Kris’s Blog: Kristinn Sigurðsson runs a great technical blog here, very thought provoking and important for both those who create web archives as well as those who use them.
  • DSHR’s Blog: David Rosenthal’s blog on digital preservation has quite a bit about web archiving, and is always provocative and mind expanding.
  • Andy Jackson’s blog  – Web Archiving Technical Lead at the British Library
  • BUDDAH project – Big UK Domain Data for the Arts and Humanities Research Project
  • Dépôt légal web BnF
  • Stanford University Digital Library blog
  • Internet Memory Foundation blog
  • Toke Eskildsen blog – IT developer at the National Library of Denmark.

Again, I am sure that I have missed some blogs so please accept my sincerest apologies.

In-Person Events

The best place to learn is at in-person events, of course, which are often announced at places like this blog or in many of the media above! I hope that the IIPC blog can become a hub for these sorts of things.


I hope this is helpful for people who are starting out in this wonderful field. I’ve just provided a small slice; I hope that in the comments below people can give other suggestions which can help us all out!

By Ian Milligan (University of Waterloo)