How Well Are Arabic Websites Archived?

Summary (translated from Arabic)

Web archiving is the process of collecting data on the World Wide Web in order to preserve it from loss and make it available to future researchers. We conducted this study to estimate how well Arabic websites are archived and indexed. We collected 15,092 URIs from three directories of Arabic websites: the DMOZ Arabic directory, the Raddadi directory, and the Star28 directory. We then used language identification tools and kept only the websites in Arabic, which left 7,976 URIs. We then crawled the live websites among them, which produced 300,646 URIs. From this sample we found the following:
1) 46% of Arabic websites are not archived, and 31% of Arabic websites are not indexed by Google.
2) 14.84% of Arabic websites have an Arabic country code top-level domain such as .sa, and 10.53% have an Arabic geographic location based on the IP address of the host.
3) Having either an Arabic geographic location or an Arabic country code top-level domain has a negative effect on archiving.
4) Most archived pages are near the top level of the website, while pages deep in the website are not well archived.
5) Being listed in the Arabic DMOZ directory has a positive effect on archiving.

It is anecdotally known that web archives favor content in English and from Western countries. In this blog post we summarize our JCDL 2015 paper “How Well are Arabic Websites Archived?”, which provides an initial quantitative exploration of this well-known phenomenon. When we compared the number of mementos (archived copies of a page) for English and Arabic websites, we found that English websites are archived more than Arabic websites. For example, comparing ESPN, an English sports website that ranks highly on Alexa, with Kooora, a highly ranked Arabic sports website, we find that ESPN has almost 13,000 mementos while Kooora has only 2,000.

Figure 1

We also compared the English and Arabic editions of Wikipedia and found that the English Wikipedia has 10,000 mementos, while the Arabic Wikipedia has only around 500.

Figure 2

Arabic is the fourth most popular language on the Internet, trailing only English, Chinese, and Spanish. According to Internet World Stats, only 17% of Arabic speakers used the Internet in 2009, but by the end of 2013 that had increased to almost 36% (over 135 million), approaching the world average of 39% of the population using the Internet.

Our initial step, collecting Arabic seed URIs, presented our first challenge. We found that an Arabic website could have:
1) Both an Arabic geographic IP location (GeoIP) and an Arabic country code top-level domain (ccTLD), such as www.uoh.edu.sa.
2) An Arabic GeoIP but a non-Arabic ccTLD, such as www.al-watan.com.
3) An Arabic ccTLD but a non-Arabic GeoIP, such as www.haraj.com.sa, with a GeoIP in Ireland.
4) Neither an Arabic GeoIP nor an Arabic ccTLD, such as www.alarabiyah.com, with a GeoIP in the US.
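The four cases above can be sketched as a small classifier. This is only an illustration, not the code used in the paper: the function names are hypothetical, the set of "Arabic" country codes is assumed to be the 22 Arab League members (the post does not give its exact list), and the GeoIP country is passed in by the caller rather than looked up.

```python
from urllib.parse import urlsplit

# Assumption: ISO 3166-1 alpha-2 codes of the Arab League members stand in
# for "Arabic country"; the post does not state which list was used.
ARAB_COUNTRY_CODES = {
    "ae", "bh", "dj", "dz", "eg", "iq", "jo", "km", "kw", "lb", "ly",
    "ma", "mr", "om", "ps", "qa", "sa", "sd", "so", "sy", "tn", "ye",
}


def has_arabic_cctld(uri):
    """Return True if the last label of the host name is an Arabic ccTLD."""
    # Accept bare host names such as "www.uoh.edu.sa" as well as full URIs.
    host = urlsplit(uri if "//" in uri else "//" + uri).hostname or ""
    return host.rsplit(".", 1)[-1] in ARAB_COUNTRY_CODES


def classify(uri, geoip_country):
    """Place a URI in one of the four GeoIP/ccTLD cases.

    geoip_country is a two-letter country code supplied by the caller,
    for example the result of a GeoIP database lookup.
    """
    arabic_tld = has_arabic_cctld(uri)
    arabic_geo = geoip_country.lower() in ARAB_COUNTRY_CODES
    if arabic_geo and arabic_tld:
        return "both"
    if arabic_geo:
        return "geoip_only"
    if arabic_tld:
        return "cctld_only"
    return "neither"


# "IE" and "US" come from the post; the "SA" codes are placeholder inputs.
print(classify("www.uoh.edu.sa", "SA"))
print(classify("www.haraj.com.sa", "IE"))
print(classify("www.alarabiyah.com", "US"))
```

Keeping the GeoIP lookup outside the function makes the sketch testable without a GeoIP database.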

To collect the seed URIs, we first searched for Arabic website directories and took the top three based on Alexa ranking. We selected all live URIs (11,014) from the following resources:
1) Open Directory Project (DMOZ) – registered in the US in 1999.
2) Raddadi – a well-known Arabic directory, registered in Saudi Arabia in 2000.
3) Star28 – an Arabic directory registered in Lebanon in 2004.

Although these URIs are listed in Arabic directories, that does not mean their content is in Arabic. For example, www.arabnews.com is an Arab news website listed in Star28, but it provides English-language news about Arabic-related topics.

It was hard to find a single reliable test to determine the language of a page, so we employed four different methods: the HTTP Content-Language header, the HTML title tag, a trigram method, and a language detection API. As shown in Figure 3, the intersection between the four methods was only 8%. We decided that any page that passed any one of these tests would be included as “in the Arabic web”. The resulting number of Arabic seed URIs was 7,976 out of 11,014.
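The "passes any test" rule can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's implementation: the two concrete checks (a Content-Language header listing `ar`, and a title whose letters are mostly in the basic Arabic Unicode block) and the 0.5 threshold are guesses at how such tests might work, and the trigram and API-based detectors are left as caller-supplied functions.

```python
import re

# Assumption: the basic Arabic Unicode block is enough to spot Arabic script.
ARABIC_LETTER = re.compile(r"[\u0600-\u06FF]")
TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def header_says_arabic(headers):
    """True if the Content-Language header lists Arabic ('ar', 'ar-SA', ...).

    headers is assumed to be a plain dict keyed by 'Content-Language'.
    """
    value = headers.get("Content-Language", "")
    tags = [tag.strip().lower() for tag in value.split(",")]
    return any(tag == "ar" or tag.startswith("ar-") for tag in tags)


def title_looks_arabic(html, threshold=0.5):
    """True if at least `threshold` of the letters in <title> are Arabic."""
    match = TITLE.search(html)
    if not match:
        return False
    letters = [ch for ch in match.group(1) if ch.isalpha()]
    if not letters:
        return False
    arabic = sum(1 for ch in letters if ARABIC_LETTER.match(ch))
    return arabic / len(letters) >= threshold


def is_arabic_page(headers, html, extra_tests=()):
    """Union rule: a page counts as Arabic if any single test passes.

    extra_tests holds hypothetical callables (for example a trigram
    classifier or a wrapper around a detection API) that take the HTML
    and return True or False.
    """
    if header_says_arabic(headers) or title_looks_arabic(html):
        return True
    return any(test(html) for test in extra_tests)
```

Taking the union keeps recall high when the individual tests rarely agree, at the cost of admitting some pages that only one test considers Arabic.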

Figure 3

To increase the number of URIs, we crawled the live Arabic seed URIs and checked the language of the pages we found using the methods described above. This increased our data set to 300,646 Arabic URIs.

Next we used the ODU Memento Aggregator (mementoproxy.cs.odu.edu) to check whether the URIs were archived in a public web archive. We found that 53.77% of the URIs are archived, with a median of 16 mementos per URI. We also analyzed the timespan of the mementos (the number of days between the datetimes of the first and last memento) and found that the median archiving period was 48 days.
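To make the two measurements concrete, here is a minimal sketch that counts mementos and computes the timespan from a TimeMap in the link format defined by the Memento protocol (RFC 7089). The sample TimeMap, its URIs, and the function names are made up for illustration; fetching a real TimeMap from an aggregator is left out, and the regular expressions only handle simple, well-formed input.

```python
import re
from email.utils import parsedate_to_datetime

# Made-up TimeMap for illustration only (not real archive data).
SAMPLE_TIMEMAP = """
<http://example.com/>; rel="original",
<http://archive.example/timemap/http://example.com/>; rel="self"; type="application/link-format",
<http://archive.example/20130101000000/http://example.com/>; rel="first memento"; datetime="Tue, 01 Jan 2013 00:00:00 GMT",
<http://archive.example/20130120000000/http://example.com/>; rel="memento"; datetime="Sun, 20 Jan 2013 00:00:00 GMT",
<http://archive.example/20130218000000/http://example.com/>; rel="last memento"; datetime="Mon, 18 Feb 2013 00:00:00 GMT"
"""


def memento_datetimes(timemap_text):
    """Return the sorted datetimes of all entries whose rel includes 'memento'."""
    datetimes = []
    # Each entry is "<uri>" followed by its parameters, up to the next "<".
    for _uri, params in re.findall(r"<([^>]*)>([^<]*)", timemap_text):
        rel = re.search(r'rel="([^"]*)"', params)
        when = re.search(r'datetime="([^"]*)"', params)
        if rel and when and "memento" in rel.group(1).split():
            datetimes.append(parsedate_to_datetime(when.group(1)))
    return sorted(datetimes)


def archive_stats(timemap_text):
    """Return (memento count, days between first and last memento)."""
    datetimes = memento_datetimes(timemap_text)
    if not datetimes:
        return 0, 0
    return len(datetimes), (datetimes[-1] - datetimes[0]).days


count, timespan_days = archive_stats(SAMPLE_TIMEMAP)
print(count, timespan_days)
```

Applying `archive_stats` to every URI and taking the median of each value with `statistics.median` would give per-URI medians of the kind reported above.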

We also compared archiving rates by seed source and found that DMOZ had an archiving rate of 96%, followed by Raddadi with 45% and Star28 with 42%.

In the data set we found that 14% of the URIs had an Arabic ccTLD. We also looked at the GeoIP location, since it indicates where the hosts of the webpages might be located. Using MaxMind GeoLite2, we found that 58% of the Arabic seed URIs are hosted in the US.

Figure 4 shows the detailed counts for Arabic GeoIP and ccTLD. We found that: 1) only 2.5% of the URIs are located in an Arabic country but do not have an Arabic ccTLD, 2) only 7.7% have an Arabic ccTLD but are not located in an Arabic country, 3) 8.6% are both located in an Arabic country and have an Arabic ccTLD, and 4) the rest of the URIs (81%) are neither located in an Arabic country nor have an Arabic ccTLD.

Figure 4

We also wanted to check whether each URI had existed long enough to be archived. We used the CarbonDate tool, developed by members of the WS-DL group, to analyze our archived Arabic data set. We found that 2013 was the most frequent creation date for archived Arabic webpages. We also wanted to investigate the gap between the creation date of Arabic websites and when they were first archived. We found that 19% of the URIs have an estimated creation date that is the same as the date of their first memento. Of the remaining URIs, 28% have a creation date more than one year before the first memento was archived.
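The gap analysis amounts to simple date arithmetic. The sketch below assumes each URI already has an estimated creation date and a first-memento date as `datetime.date` values; the category names, the 365-day cutoff for "one year", and the function names are illustrative choices, not taken from the paper or from CarbonDate.

```python
from collections import Counter
from datetime import date


def gap_category(created, first_memento):
    """Bucket the gap between estimated creation and first memento."""
    gap_days = (first_memento - created).days
    if gap_days <= 0:
        # Assumption: an estimate on or after the first memento counts as no gap.
        return "same_date"
    if gap_days > 365:
        return "over_one_year"
    return "within_one_year"


def gap_shares(pairs):
    """Return the percentage of (created, first_memento) pairs per category."""
    counts = Counter(gap_category(created, first) for created, first in pairs)
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}


# Made-up dates for illustration only.
example_pairs = [
    (date(2013, 5, 1), date(2013, 5, 1)),
    (date(2011, 1, 1), date(2013, 5, 1)),
    (date(2013, 1, 1), date(2013, 5, 1)),
    (date(2010, 6, 1), date(2012, 6, 2)),
]
print(gap_shares(example_pairs))
```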

We were also interested in whether the Arabic URIs are indexed by search engines. Using Google’s Custom Search API (which may produce different results than Google’s public web interface), we found that 31% of the Arabic URIs were not indexed by Google. Broken down by source, 82% of the DMOZ URIs are indexed by Google, which was expected since DMOZ URIs are more likely to be found and archived.

In conclusion, when looking at the seed URIs we found that DMOZ URIs are more likely to be found and archived, and that a website is more likely to be indexed if it is present in a directory. For now, if you want your Arabic-language webpage to be archived, host it outside of an Arabic country and get it listed in DMOZ.

I presented this work at JCDL 2015; the presentation slides can be found here.

by Lulwah M. Alkwai, PhD student, Computer Science Department, Old Dominion University, VA, USA
