How Well Are Arabic Websites Archived?

Summary (translated from the Arabic)

Web archiving is the process of collecting data from the World Wide Web in order to preserve it from loss and make it available to future researchers. We conducted this study to estimate how well Arabic websites are archived and indexed. We collected 15,092 URIs from three Arabic website directories: the Arabic DMOZ directory, the Raddadi directory, and the Star28 directory. We then applied language identification tools and kept only the sites in Arabic, which left 7,976 URIs. Next, we crawled the live sites among them, which yielded 300,646 URIs. From this sample we found the following:
1) 46% of Arabic websites are not archived, and 31% of Arabic websites are not indexed by Google.
2) 14.84% of Arabic websites have an Arabic country code top-level domain such as (.sa), and 10.53% have an Arabic geographic location based on the IP address of the host.
3) Having either an Arabic geographic location or an Arabic country code top-level domain negatively affects archiving.
4) Most archived pages are near the top level of a site; pages deep within a site are not well archived.
5) A site’s presence in the Arabic DMOZ directory positively affects its archiving.

It is anecdotally known that archives favor content in English and from Western countries. In this blog post we summarize our JCDL 2015 paper “How Well are Arabic Websites Archived?”, where we provide an initial quantitative exploration of this well-known phenomenon. When comparing the number of mementos for English vs. Arabic websites, we found that English websites are archived more than Arabic websites. For example, when comparing a highly ranked English sports website (based on Alexa ranking), such as ESPN, with a highly ranked Arabic sports website, such as Kooora, we find that ESPN has almost 13,000 mementos, while Kooora has only 2,000.

Figure 1

We also compared the English and Arabic editions of Wikipedia and found that the English Wikipedia has 10,000 mementos, while the Arabic Wikipedia has only around 500.

Figure 2

Arabic is the fourth most popular language on the Internet, trailing only English, Chinese, and Spanish. According to Internet World Stats, only 17% of Arabic speakers used the Internet in 2009, but by the end of 2013 that share had increased to almost 36% (over 135 million), approaching the world average of 39% of the population using the Internet.

Our initial step, collecting Arabic seed URIs, presented our first challenge. We found that Arabic websites could have:
1) Both an Arabic geographic IP location (GeoIP) and an Arabic country code top-level domain (ccTLD), such as www.uoh.edu.sa.
2) An Arabic GeoIP, but a non-Arabic ccTLD, such as www.al-watan.com.
3) An Arabic ccTLD, but a non-Arabic GeoIP, such as www.haraj.com.sa, with a GeoIP in Ireland.
4) Neither an Arabic GeoIP nor an Arabic ccTLD, such as www.alarabiyah.com, with a GeoIP in the US.
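
The four cases above can be sketched in a few lines of Python. This is only an illustration, not the code used in the paper: the set of country codes is an assumption (Arab League member states), the function names are hypothetical, and the GeoIP country is taken as an input rather than looked up.

```python
from urllib.parse import urlparse

# Assumption: country codes of Arab League member states. The paper's own
# list of "Arabic" countries is not given in this post and may differ.
ARAB_COUNTRY_CODES = {
    "ae", "bh", "dj", "dz", "eg", "iq", "jo", "km", "kw", "lb", "ly",
    "ma", "mr", "om", "ps", "qa", "sa", "sd", "so", "sy", "tn", "ye",
}


def has_arabic_cctld(uri):
    """Return True if the host's last label is one of the assumed ccTLDs."""
    # urlparse only fills in the hostname when a scheme is present.
    parsed = urlparse(uri if "://" in uri else "http://" + uri)
    host = (parsed.hostname or "").rstrip(".")
    return host.rsplit(".", 1)[-1].lower() in ARAB_COUNTRY_CODES


def classify(uri, geoip_country):
    """Place a URI in one of the four cases.

    geoip_country is a two-letter ISO 3166-1 code produced by a GeoIP
    lookup performed elsewhere (hypothetical input, not looked up here).
    """
    arabic_geoip = geoip_country.lower() in ARAB_COUNTRY_CODES
    arabic_cctld = has_arabic_cctld(uri)
    if arabic_geoip and arabic_cctld:
        return "GeoIP and ccTLD"
    if arabic_geoip:
        return "GeoIP only"
    if arabic_cctld:
        return "ccTLD only"
    return "neither"
```

With the examples from the list, `classify("www.haraj.com.sa", "IE")` falls under "ccTLD only" and `classify("www.alarabiyah.com", "US")` under "neither".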

To collect the seed URIs, we first searched for Arabic website directories and took the top three based on Alexa ranking. We selected all live URIs (11,014) from the following resources:
1) Open Directory Project (DMOZ) – registered in the US in 1999.
2) Raddadi – a well-known Arabic directory, registered in Saudi Arabia in 2000.
3) Star28 – an Arabic directory registered in Lebanon in 2004.

Although these URIs are listed in Arabic directories, that does not mean their content is in Arabic. For example, www.arabnews.com is an Arab news website listed in Star28, but it provides English-language news about Arabic-related topics.

It was hard to find a single reliable test for determining the language of a page, so we employed four different methods: the HTTP Content-Language header, the HTML title tag, a trigram method, and a language detection API. As shown in Figure 3, the intersection between the four methods was only 8%. We decided that any page that passed any of these tests would be included as “in the Arabic web”. The resulting number of Arabic seed URIs was 7,976 out of 11,014.

Figure 3
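
The "pass any test" rule can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the header and title checks are our own minimal versions (the title check simply measures the share of Arabic-script letters), and the trigram and language detection API tests are represented only as optional callables passed in by the caller. All function names are hypothetical.

```python
import re

# Basic Arabic Unicode block; an assumption about what counts as Arabic script.
ARABIC_CHARS = re.compile(r"[\u0600-\u06FF]")


def header_says_arabic(content_language):
    """True if an HTTP Content-Language value lists Arabic ('ar', 'ar-SA', ...)."""
    if not content_language:
        return False
    tags = [tag.strip().lower() for tag in content_language.split(",")]
    return any(tag == "ar" or tag.startswith("ar-") for tag in tags)


def title_looks_arabic(html, threshold=0.5):
    """True if at least `threshold` of the letters in <title> are Arabic script."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    if not match:
        return False
    letters = [ch for ch in match.group(1) if ch.isalpha()]
    if not letters:
        return False
    arabic = sum(1 for ch in letters if ARABIC_CHARS.match(ch))
    return arabic / len(letters) >= threshold


def in_arabic_web(content_language, html, extra_tests=()):
    """A page is kept if ANY test passes (the union, not the intersection).

    extra_tests holds callables that take the HTML text and return a bool,
    e.g. wrappers around a trigram classifier or a language detection API.
    """
    if header_says_arabic(content_language) or title_looks_arabic(html):
        return True
    return any(test(html) for test in extra_tests)
```

Taking the union is deliberately permissive: with only 8% agreement across all four methods, requiring every test to pass would have discarded most of the seed set.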

To increase the number of URIs, we crawled the live Arabic seed URIs and checked the language using the previously described methods. This increased our data set to 300,646 Arabic seed URIs.

Next we used the ODU Memento Aggregator (mementoproxy.cs.odu.edu) to verify if the URIs were archived in a public web archive. We found that 53.77% of the URIs are archived with a median of 16 mementos per URI. We also analyzed the timespan of the mementos (the number of days between the datetimes of the first memento and last memento) and found that the median archiving period was 48 days.
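
As a rough sketch of how these figures can be derived, the snippet below parses Memento TimeMaps in link format and computes the archived share, the median memento count, and the median timespan. It assumes the TimeMap text has already been fetched from an aggregator; the function names are hypothetical, and taking the medians over archived URIs only is our assumption, since the post does not say how they were computed.

```python
import re
from email.utils import parsedate_to_datetime
from statistics import median

# One link-format entry: <uri>; param="value"; param="value"
LINK_ENTRY = re.compile(r"<([^>]+)>;([^<]*)")


def memento_datetimes(timemap_text):
    """Return the sorted datetimes of all memento entries in a TimeMap."""
    datetimes = []
    for _uri, params in LINK_ENTRY.findall(timemap_text):
        rel = re.search(r'rel="([^"]*)"', params)
        when = re.search(r'datetime="([^"]*)"', params)
        # rel may be "memento", "first memento", or "last memento".
        if rel and when and "memento" in rel.group(1).split():
            datetimes.append(parsedate_to_datetime(when.group(1)))
    return sorted(datetimes)


def timespan_days(datetimes):
    """Days between the first and last memento (0 for fewer than two)."""
    if len(datetimes) < 2:
        return 0
    return (datetimes[-1] - datetimes[0]).days


def summarize(timemaps):
    """timemaps maps each URI to its TimeMap text ('' if none was found).

    Assumes at least one URI is archived; medians cover archived URIs only.
    """
    per_uri = [memento_datetimes(text) for text in timemaps.values()]
    archived = [dts for dts in per_uri if dts]
    return {
        "archived_share": len(archived) / len(per_uri),
        "median_mementos": median(len(dts) for dts in archived),
        "median_timespan_days": median(timespan_days(dts) for dts in archived),
    }
```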

We also examined how the seed source relates to archiving and found that URIs from DMOZ had an archiving rate of 96%, compared with 45% for Raddadi and 42% for Star28.

In the data set we found that 14% of the URIs had an Arabic ccTLD. We also looked at GeoIP location, since it indicates where the hosts of the webpages are likely located. Using MaxMind GeoLite2, we found that 58% of the Arabic seed URIs are hosted in the US.

Figure 4 shows detailed counts for Arabic GeoIP and ccTLD. We found that: 1) only 2.5% of the URIs are located in an Arabic country but lack an Arabic ccTLD, 2) only 7.7% have an Arabic ccTLD but are located outside an Arabic country, 3) 8.6% are both located in an Arabic country and have an Arabic ccTLD, and 4) the rest of the URIs (81%) are neither located in an Arabic country nor have an Arabic ccTLD.

Figure 4

We also wanted to verify whether each URI had existed long enough to be archived. We used the CarbonDate tool, developed by members of the WS-DL group, to analyze our archived Arabic data set. We found that 2013 was the most frequent creation date for archived Arabic webpages. We also wanted to investigate the gap between the creation date of Arabic websites and when they were first archived. We found that 19% of the URIs have an estimated creation date that is the same as the first memento date. For the remaining URIs, 28% have a creation date over one year before the first memento was archived.
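
The gap analysis comes down to simple date arithmetic, sketched below. The bucket names, the 365-day definition of "one year", and the function names are assumptions made for illustration; the estimated creation dates would come from a tool such as CarbonDate and are simply passed in here.

```python
from collections import Counter
from datetime import date


def archiving_gap(created, first_memento):
    """Bucket the delay between estimated creation and first memento."""
    gap_days = (first_memento - created).days
    if gap_days <= 0:
        # Same date (or an estimate that falls after the first memento).
        return "no gap"
    if gap_days <= 365:  # assumption: one year counted as 365 days
        return "within one year"
    return "over one year"


def gap_shares(pairs):
    """pairs is a non-empty list of (created, first_memento) date tuples.

    Returns the fraction of URIs that falls into each bucket.
    """
    counts = Counter(archiving_gap(created, first) for created, first in pairs)
    return {bucket: n / len(pairs) for bucket, n in counts.items()}
```

For example, a page estimated to have been created on 1 January 2012 and first archived on 1 June 2013 falls in the "over one year" bucket.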

We were also interested in whether the Arabic URIs are indexed by search engines. We used Google’s Custom Search API (which may produce different results than Google’s public web interface) and found that 31% of the Arabic URIs were not indexed by Google. When looking at the source of the URIs, we found that 82% of the DMOZ URIs are indexed by Google, which was expected, since DMOZ URIs are more likely to be found and archived.

In conclusion, when looking at the seed URIs we found that DMOZ URIs are more likely to be found and archived, and a website is more likely to be indexed if it is present in a directory. For right now, if you want your Arabic language webpage to be archived, host it outside of an Arabic country and get it listed in DMOZ.

I presented this work at JCDL 2015; the presentation slides can be found here.

by Lulwah M. Alkwai, PhD student, Computer Science Department, Old Dominion University, VA, USA

Web Archives: Preserving the Everyday Record

In talking with Ian Milligan, Assistant Professor of Digital and Canadian History at the University of Waterloo, you are immediately impressed by his excitement for web archives and how web archiving is fundamentally changing research.

Ian uses web archives for his historical research to demonstrate their relevance and importance. While he clearly sees the value of web archives, he also recognizes the need to improve access in order to increase usage. To that end, he recently launched Webarchives.ca, an archive dedicated to Canadian politics. Ian is also providing pedagogical support for students using digital materials, including web archives.

I interviewed Ian recently to get his thoughts about these and other web archiving topics.

Remembering Geocities: A Community on the Web

Among Ian’s research projects is the study of Geocities. Remember Geocities? It was a user-generated web-hosting community that flourished in the late 1990s and 2000s. Unlike other lost civilizations, we know the cause of Geocities’s demise: Yahoo shut it down in 2009. If it were not for the Internet Archive and Jason Scott’s Archive Team, Geocities would be lost forever.

For those who might ask if it was worth saving, Ian would offer a resounding YES! For Ian, Geocities provides a rich historical source for gaining insight into a pivotal moment in time. It is one of the first examples of democratized web access, when average people could reach bigger audiences than ever before. At its height, Geocities featured more than 38 million pages.

Source: Internet Archive’s Wayback Machine, December 1, 2009 capture

Some of the research questions Ian is asking about the Geocities corpus include:

  • How was community enacted?
  • How was community lived in a place like Geocities?
  • Was there actually a sense of community on the web?

While these questions might sound like standard research questions, they are only now being recast over “untraditional” sources, such as Geocities.

Archiving Politics

In an effort to improve access to web archives, Ian worked on a project to launch Webarchives.ca, a research corpus containing Canadian Political Parties and Political Interest Groups sites collected since 2005 by the University of Toronto using the Internet Archive’s Archive-It service. Ian teamed up with researchers from the University of Maryland, York University in Toronto, and Western University in London, Ontario to build this massive collection of more than 14 million “documents.” To help users navigate this large collection, the team implemented the UK Web Archive’s Shine front end.

Once I got started looking at Webarchives.ca, I couldn’t stop myself from digging further into such a wealth of information. I particularly liked the graphing of terms over time feature, which allows you to see when terms go in and out of use by political parties.

In sharing his takeaways from working with these data, Ian observed that it is equally interesting to see when terms do not appear as when they do.

A Pivotal Shift for Scholarship

Ian shared some concrete examples of how the rise of web archives represents a pivotal shift for scholarship. Let’s take, for instance, particular segments of the population, such as young people, who have traditionally been left out of the historical record.

When Ian was researching the 1960s in order to understand the voice of young activists, he found the sources to be scarce. Conversations among activists tended to happen in coffeehouses, bars, and other places where records were not kept. So a historian can only hope that a young activist back then kept a diary and that it has survived, or else must find the activists and interview them.

Contrast this to today’s world. With the explosion of social media, young people are writing things down and leaving records that we never would have had in the past. Web archiving tools can capture this information, which is a very rich and exciting development for historians, but only if these important records of daily life have been archived.

Is More Better?

The increase in information can be a double-edged sword. As Ian says, “there used to be such a scarcity of historical sources, now we have more information than we know what to do with.”

Ian is concerned that digital and digitized materials will be privileged as sources and/or misinterpreted. He conducted a study of the period when materials were first digitized and learned that scholars cited digital materials more often than analog ones. Basically, content that was more easily available online was getting used more.

Ian is also worried that there is not a deep understanding of how to critically use digital resources. Many are unaware, for example, of the limitations of simple keyword searching. Add to the mix web archives and you have increased the scale of the problem.

So Ian wrote a pedagogical book.

The Historian’s Macroscope: Exploring Big Historical Data, written along with Shawn Graham and Scott Weingart, will be out later this year. The book is a sort of toolbox for upper-division history undergraduates to teach them how to think critically about digital resources and to avoid common pitfalls. It also includes “how to” information for analyzing data, such as basic data visualization and network analysis.

Always pushing the envelope, Ian and his co-authors wrote the first draft of their book online.

No “Do Overs”

Ian closed our interview by sharing a provocative statement that he made at the recent IIPC General Assembly. “You cannot study the history of the 90s unless you use web archives. It is a significant part of the record of the 1990s and 2000s for everyday people. When historians write the history of 9/11 or Occupy Wall Street, they are going to have to use web archives.”

As exciting as it is for historians to have access to these rich new resources, Ian also shared his biggest concern, which is that we need to ensure that we are saving websites. “Every day we are losing considerable amounts of our digital heritage. Gathering is critical. There are no ‘do overs.’”

This blog post is the second in a series of interviews with researchers to learn about their use of web archives.

By Rosalie Lack, Product Manager, California Digital Library

We want YOUR ideas for the IIPC General Assembly 2016

You will be pleased to hear that preparations for the IIPC General Assembly 2016 in Reykjavik, Iceland (11-15 April) are under way and we are aiming to make it the best one yet.

The program team have been hard at work looking at potential themes, topics and areas for discussion and debate. We would, however, love to have your input into this too!

So far, we’ve outlined the following areas:

  • Nuts and bolts of web archiving (management, metrics, organisation, programs)
  • De-duplication 
  • Researcher use cases (of web archives)
  • Big Data usage and potential
  • Web Archiving policies and frameworks / Preservation policies, Collection policies 
  • APIs
  • Web Archiving Tool development 
  • Legal deposit, copyright, data protection (EU wide perspective?)

What have we missed, what should we focus on, what would YOU like to see and hear about?

Please use the comments below to tell us what you would like from the conference. This will help frame the call for papers due to go out at the end of October.

Thank you.

Jason Webber, IIPC Program and Communications Officer

Open letter by IIPC Chair

Greetings IIPC Members,

I hope that your summer is going very well and that you are all able to take some time off to recharge and spend time with family and friends.  It is hard to believe that more than 3 months have passed since many of us were together at Stanford University in Palo Alto for our 2015 General Assembly (GA)!

I want to take this opportunity to once again say how impressed I was by the quality of the event.  Everything from the organization of the entire event to the excellent interactions that our members engaged in brought significant value to the week.

I want to focus on the Member’s Day that we had at the Internet Archive offices.  At one point in the day, you were asked to break off into groups to discuss some of the important issues and challenges facing the IIPC in the near future.  The Steering Committee met on the Saturday following the GA to discuss how we can better serve you – our members – and to ensure that we focus our limited resources on what brings the greatest value to the global Web Archiving community.  I want to assure you that YOUR feedback was taken very seriously and thanks to the leadership of Birgit Nordsmark Henriksen (Netarchive.dk) and Barbara Sierman (National Library of the Netherlands) the Steering Committee was able to distill your comments and input into 4 manageable work packages:

  1. Researcher Involvement
  2. Tools
  3. Connectedness
  4. Practicalities

Work on each of these elements has begun (thanks to dedicated teams looking at each individual area) and each group is coming prepared to our upcoming in-person Steering Committee meeting in September.  I will update you right after that meeting to let you know what you can expect from the IIPC in the coming year(s).

What I can tell you is that you can count on the IIPC continuing to be a robust and vibrant community and that your contributions will become even more important as we move forward.  Your Steering Committee remains committed to ensuring the value of your membership to the Consortium.

I welcome any comments or questions at paul.wagner@bac-lac.gc.ca

Stay tuned for more updates in September.

Paul N. Wagner, Chair, IIPC

Directeur général principal et DPI, Direction générale d’innovation et du Dirigeant principal de l’information – Senior Director General & CIO, Innovation and Chief Information Officer Branch

Bibliothèque et Archives Canada / Gouvernement du Canada – Library and Archives Canada / Government of Canada