Memento: Help Us Route URI Lookups to the Right Archives

More Memento-enabled web archives are coming online every day, enabling aggregation services such as Time Travel and oldweb.today. However, as the number of web archives grows, we must be able to route URI lookups to the archives that are likely to hold the requested URIs. We need assistance from IIPC members to better model both what archives contain and what people are looking for.

In our TPDL 2015 paper we found that fewer than 5% of the queried URIs have mementos in any individual archive other than the Internet Archive. We created four sample sets of one million URIs each and looked each set up in three different archives. The table below shows the percentage of the sample URIs found in each archive and in their union.

Sample (1M URIs each)    In Archive-It   In UKWA   In Stanford   Union of {AIT, UKWA, SU}
DMOZ                          4.097%       3.594%      0.034%        7.575%
Memento Proxy Logs            4.182%       0.408%      0.046%        4.527%
IA Wayback Logs               3.716%       0.519%      0.039%        4.165%
UKWA Wayback Logs             0.108%       0.034%      0.002%        0.134%

However, when aggregated together, these smaller archives prove to be much more useful and complete than they are individually. We found that the intersection between their holdings is small, so their union is comparatively large (see the last column in the table above). The figure below shows the overlap among the three archives for the sample of one million URIs from DMOZ.

[Figure: overlap among the Archive-It, UKWA, and Stanford holdings for the one-million-URI DMOZ sample]

We are working on an IIPC-funded Archive Profiling project in which we are trying to create a high-level summary of the holdings of each archive. Among many other use cases, this will help us route Memento Aggregator queries only to archives that are likely to return good results for a given URI.
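To make the profile-based routing concrete, here is a minimal sketch, assuming a toy profile that maps a hostname to an estimated capture count per archive. The archive names, URI keys, and counts below are illustrative, and the actual project uses richer key schemes than bare hostnames.

```python
from urllib.parse import urlparse

# Hypothetical profiles: archive id -> {hostname -> estimated capture count}.
PROFILES = {
    "archive-it": {"example.com": 1200, "bbc.co.uk": 300},
    "ukwa":       {"bbc.co.uk": 95000, "parliament.uk": 40000},
    "stanford":   {"stanford.edu": 25000},
}

def candidate_archives(uri, threshold=1):
    """Return the archives whose profile suggests they may hold the URI."""
    host = urlparse(uri).netloc.lower().removeprefix("www.")
    return [archive for archive, profile in PROFILES.items()
            if profile.get(host, 0) >= threshold]

# Only Archive-It and UKWA would be queried for this URI:
print(candidate_archives("http://www.bbc.co.uk/news"))  # ['archive-it', 'ukwa']
```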

We learned during the recent surge of oldweb.today (which uses MemGator to aggregate mementos from various archives) that some upstream archives had issues handling the sudden increase in traffic and had to be removed from the list of aggregated archives. Another issue when aggregating a large number of archives is that aggregators follow the buffalo theory: the slowest upstream archive determines the round-trip time of the aggregator. A single malfunctioning (or down) upstream archive may delay every aggregator response for the full timeout period. There are ways to solve the latter issue, such as detecting continuously failing archives at runtime and temporarily excluding them from aggregation (a sketch of such a timeout-aware fan-out appears below). However, building archive profiles and predicting the probability of finding any mementos in each archive solves both problems. Individual archives only receive requests when they are likely to return good results, so routing saves their network and computing resources. Additionally, aggregators benefit from improved response times, because only a small subset of all the known archives is queried for any given URI.
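Below is a rough sketch of that timeout-aware fan-out, assuming each upstream archive exposes a Wayback-style replay endpoint; the URL templates are illustrative, and a production aggregator like MemGator does considerably more (TimeMap merging, per-archive health tracking, etc.).

```python
import concurrent.futures

import requests

# Illustrative replay URL templates; a real aggregator reads these from config.
ARCHIVES = {
    "archive-it": "https://wayback.archive-it.org/all/{ts}/{uri}",
    "ukwa":       "https://www.webarchive.org.uk/wayback/archive/{ts}/{uri}",
}

def query_archive(name, template, uri, ts="20150101000000", timeout=5):
    # A slow or down archive costs at most `timeout` seconds, not forever.
    try:
        r = requests.head(template.format(ts=ts, uri=uri),
                          timeout=timeout, allow_redirects=True)
        return name, r.status_code
    except requests.RequestException:
        return name, None  # repeated failures could temporarily disable this archive

def aggregate(uri):
    # Fan out to all candidate archives in parallel so the aggregator's
    # round-trip time is bounded by the slowest response or the timeout.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(query_archive, name, template, uri)
                   for name, template in ARCHIVES.items()]
        return dict(f.result() for f in concurrent.futures.as_completed(futures))

print(aggregate("http://example.com/"))
```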

We appreciate Andy Jackson of the UK Web Archive providing the anonymised Wayback access logs that we used for sampling one of the URI sets. We would like to extend this study to other archives’ access logs to learn what people are looking for when they visit those archives. This will help us build sampling-based profiling for archives that may not be able to share CDX files or generate/update full-coverage archive profiles.

We encourage all IIPC member archives to share enough of their access logs to yield at least one million unique URIs that people looked for in their archives. We are only interested in the log entries that contain a URI-R (e.g., /wayback/14-digit-datetime/{URI}); a sketch of the extraction we perform is shown below. We can handle all the cleanup and parsing tasks, or you can remove the requesting IP address from the logs (we don’t need it) if you prefer. The logs can be one continuous run or many sparse segments. We promise not to publish those logs in raw form anywhere on the Web. Please feel free to discuss further details with me at salam@cs.odu.edu. Also contact me if you are interested in testing the software for profiling your archive.
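For reference, the extraction we have in mind is no more complicated than the following sketch; the path layout and regular expression are illustrative, since every archive’s logs differ slightly.

```python
import re

# Matches replay paths like /wayback/20140604143652/http://example.com/page
REPLAY_PATTERN = re.compile(r"/\d{14}/(https?://\S+)")

def unique_uri_rs(log_path):
    """Collect the unique URI-Rs requested in a Wayback-style access log."""
    uris = set()
    with open(log_path, errors="replace") as log:
        for line in log:
            match = REPLAY_PATTERN.search(line)
            if match:
                uris.add(match.group(1))
    return uris

# Usage: unique_uri_rs("access.log") returns the set of requested original
# URIs; the requesting IP addresses in the log are never touched.
```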

by Sawood Alam
Department of Computer Science, Old Dominion University


How Well Are Arabic Websites Archived?

Arabic summary (translated):

Web archiving is the process of collecting data from the Web in order to preserve it from loss and make it available to researchers in the future. We conducted this research to estimate the extent to which Arabic websites are archived and indexed. We collected 15,092 links from three websites that serve as directories of Arabic websites: the Arabic DMOZ directory, the Raddadi directory, and the Star28 directory. We then applied language identification tools and kept only the Arabic-language websites, leaving 7,976 links. Crawling the live websites among them produced 300,646 links. From this sample we discovered the following:
1) 46% of Arabic websites have not been archived, and 31% of Arabic websites have not been indexed by Google.
2) 14.84% of Arabic websites have an Arabic country code top-level domain, such as .sa, and 10.53% of websites have an Arabic geographic location based on the IP address of the hosting computer.
3) Having either an Arabic geographic location or an Arabic country code negatively affects archiving.
4) Most archived pages are near the top level of a website, while pages deep within a website are not well archived.
5) A website’s presence in the Arabic DMOZ positively affects its archiving.

It is anecdotally known that archives favor content in English and from Western countries. In this blog post we summarize our JCDL 2015 paper “How Well Are Arabic Websites Archived?”, in which we provide an initial quantitative exploration of this well-known phenomenon. Comparing the number of mementos for English vs. Arabic websites, we found that English websites are archived more heavily. For example, comparing a highly ranked (per Alexa) English sports website such as ESPN with a highly ranked Arabic sports website such as Kooora, we find that ESPN has almost 13,000 mementos while Kooora has only about 2,000 (Figure 1).

[Figure 1: memento counts for ESPN vs. Kooora]

We also compared the English and Arabic Wikipedias and found that the English Wikipedia has 10,000 mementos vs. only around 500 for the Arabic Wikipedia (Figure 2).

[Figure 2: memento counts for the English vs. Arabic Wikipedia]

Arabic is the fourth most popular language on the Internet, trailing only English, Chinese, and Spanish. According to Internet World Stats, in 2009 only 17% of Arabic speakers used the Internet, but by the end of 2013 that share had increased to almost 36% (over 135 million people), approaching the world average of 39% of the population using the Internet.

Our initial step, collecting Arabic seed URIs, presented our first challenge. We found that an Arabic website could have any of the following combinations (a classification sketch appears after this list):
1) Both an Arabic geographic IP location (GeoIP) and an Arabic country code top-level domain (ccTLD), such as www.uoh.edu.sa.
2) An Arabic GeoIP but a non-Arabic ccTLD, such as www.al-watan.com.
3) An Arabic ccTLD but a non-Arabic GeoIP, such as www.haraj.com.sa, with a GeoIP in Ireland.
4) Neither an Arabic GeoIP nor an Arabic ccTLD, such as www.alarabiyah.com, with a GeoIP in the US.
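Here is a sketch of how such a classification can be automated, assuming the MaxMind geoip2 Python library and a local GeoLite2-Country database; the ccTLD list is partial and illustrative.

```python
import socket
from urllib.parse import urlparse

import geoip2.database  # pip install geoip2; needs a GeoLite2-Country.mmdb

# A partial, illustrative list of Arabic-country ccTLDs.
ARABIC_CCTLDS = {"sa", "eg", "ae", "jo", "kw", "qa", "bh", "om", "ye",
                 "iq", "sy", "lb", "ps", "ma", "dz", "tn", "ly", "sd"}
ARABIC_COUNTRIES = {cc.upper() for cc in ARABIC_CCTLDS}

def classify(uri, mmdb_path="GeoLite2-Country.mmdb"):
    """Return (has_arabic_cctld, has_arabic_geoip) for a URI."""
    host = urlparse(uri).netloc.lower()
    has_arabic_cctld = host.rsplit(".", 1)[-1] in ARABIC_CCTLDS
    with geoip2.database.Reader(mmdb_path) as reader:
        country = reader.country(socket.gethostbyname(host)).country.iso_code
    return has_arabic_cctld, country in ARABIC_COUNTRIES

# classify("http://www.haraj.com.sa/") -> (True, False): an Arabic ccTLD
# hosted outside the Arab world, i.e. case 3 above.
```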

To collect the seed URIs we first searched for Arabic website directories and selected the top three based on Alexa ranking. We took all live URIs (11,014) from the following resources:
1) The Open Directory Project (DMOZ) – registered in the US in 1999.
2) Raddadi – a well-known Arabic directory, registered in Saudi Arabia in 2000.
3) Star28 – an Arabic directory registered in Lebanon in 2004.

Although these URIs are listed in Arabic directories, that does not mean their content is in Arabic. For example, www.arabnews.com is an Arab news website listed in Star28 that provides English-language news about Arabic-related topics.

It was hard to find a single reliable test to determine the language of a page, so we employed four different methods: the HTTP Content-Language header, the HTML title tag, a character trigram method, and a language detection API. As shown in Figure 3, the intersection of the four methods was only 8%. We decided that any page that passed at least one of these tests would be counted as part of the Arabic web (this union rule is sketched below). The resulting number of Arabic seed URIs was 7,976 out of 11,014.
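A simplified sketch of the union rule follows, assuming the requests, beautifulsoup4, and langdetect packages; the trigram method from the paper is approximated here by checking for Arabic-block characters in the page text, so this is illustrative rather than our exact pipeline.

```python
import re

import requests
from bs4 import BeautifulSoup
from langdetect import detect

ARABIC_CHARS = re.compile(r"[\u0600-\u06FF]")  # Arabic Unicode block

def is_arabic(uri):
    r = requests.get(uri, timeout=10)
    soup = BeautifulSoup(r.text, "html.parser")
    title = (soup.title.string or "") if soup.title else ""
    text = soup.get_text(" ", strip=True)
    try:
        detected_arabic = bool(text) and detect(text) == "ar"
    except Exception:  # langdetect fails on feature-poor text
        detected_arabic = False
    return any([
        "ar" in r.headers.get("Content-Language", "").lower(),  # method 1
        bool(ARABIC_CHARS.search(title)),                       # method 2
        bool(ARABIC_CHARS.search(text[:2000])),                 # trigram stand-in
        detected_arabic,                                        # method 4
    ])  # union: passing any single test counts the page as Arabic
```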

[Figure 3: overlap among the four language identification methods]

To increase the number of URIs, we crawled the live Arabic seed URIs and checked the language using the previously described methods. This increased our data set to 300,646 Arabic seed URIs.

Next we used the ODU Memento Aggregator (mementoproxy.cs.odu.edu) to check whether the URIs were archived in a public web archive. We found that 53.77% of the URIs are archived, with a median of 16 mementos per URI. We also analyzed the timespan of the mementos (the number of days between the datetimes of the first and last memento) and found that the median archiving period was 48 days. A sketch of this kind of TimeMap analysis is shown below.
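The sketch below shows how memento counts and timespans can be computed from a TimeMap; it queries the current Time Travel aggregator rather than the ODU proxy named above, and the link-format parsing is deliberately simplistic.

```python
import re
from datetime import datetime

import requests

# Extracts the datetime of every memento link in a link-format TimeMap.
MEMENTO_LINE = re.compile(r'rel="[^"]*memento[^"]*";\s*datetime="([^"]+)"')

def memento_stats(uri):
    """Return (number of mementos, archiving timespan in days) for a URI."""
    timemap = requests.get(
        "http://timetravel.mementoweb.org/timemap/link/" + uri,
        timeout=30).text
    datetimes = [datetime.strptime(d, "%a, %d %b %Y %H:%M:%S GMT")
                 for d in MEMENTO_LINE.findall(timemap)]
    if not datetimes:
        return 0, 0  # the URI is not archived
    return len(datetimes), (max(datetimes) - min(datetimes)).days
```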

We also broke down archiving by seed source: 96% of the DMOZ URIs were archived, followed by 45% of the Raddadi URIs and 42% of the Star28 URIs.

In the data set we found that 14% of the URIs had an Arabic ccTLD. We also looked at GeoIP location, which helps determine where webpage hosts are physically located. Using MaxMind GeoLite2, we found that 58% of the Arabic seed URIs are hosted in the US.

Figure 4 shows the detailed counts for Arabic GeoIP and Arabic ccTLD. We found that: 1) only 2.5% of the URIs are located in an Arabic country (without an Arabic ccTLD), 2) only 7.7% have an Arabic ccTLD (without an Arabic GeoIP), 3) 8.6% are both located in an Arabic country and have an Arabic ccTLD, and 4) the rest (81%) are neither located in an Arabic country nor have an Arabic ccTLD.

[Figure 4: counts of Arabic seed URIs by Arabic GeoIP and Arabic ccTLD]

We also wanted to verify whether the URIs had existed long enough to be archived. We used the CarbonDate tool, developed by members of the WS-DL group, to estimate the creation dates of our archived Arabic data set. We found that 2013 was the most frequent creation year for archived Arabic webpages. We also investigated the gap between the creation date of Arabic websites and when they were first archived. We found that 19% of the URIs have an estimated creation date that is the same as the date of their first memento. Of the remaining URIs, 28% have a creation date more than one year before their first memento was captured.

It was also interesting to find out whether the Arabic URIs are indexed in search engines. Using Google’s Custom Search API (which may produce different results than Google’s public web interface; a sketch of the check appears below), we found that 31% of the Arabic URIs were not indexed by Google. Looking at the source of the URIs, we found that 82% of the DMOZ URIs are indexed by Google, which was expected since DMOZ URIs are more likely to be discovered and archived.
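The indexing check can be scripted along the following lines, assuming the Custom Search JSON API; API_KEY and CX are placeholders for your own credentials, and the query strategy (searching for the URI itself) is an illustrative simplification.

```python
import requests

API_KEY = "YOUR_API_KEY"        # placeholder credential
CX = "YOUR_SEARCH_ENGINE_ID"    # placeholder custom search engine id

def is_indexed(uri):
    """Heuristically test whether Google returns the URI for itself."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": uri},
        timeout=30).json()
    links = [item.get("link") for item in resp.get("items", [])]
    return uri in links
```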

In conclusion, looking at the seed URIs we found that DMOZ URIs are more likely to be found and archived, and that a website is more likely to be indexed if it is present in a directory. For now, if you want your Arabic-language webpage to be archived, host it outside of an Arabic country and get it listed in DMOZ.

I presented this work at JCDL 2015; the presentation slides can be found here.

by Lulwah M. Alkwai, PhD student, Computer Science Department, Old Dominion University, VA, USA

LANL’s Time Travel Portal, Part 2

Architecturally, the Time Travel portal operates in a manner similar to a distributed search. Hence, it faces challenges related to query routing, response time optimization, and response freshness. The new infrastructure includes some rule-based mechanisms for intelligent routing, but a thorough solution is being investigated in the IIPC-funded Web Archive Profiling project. A background cache continuously fetches TimeMap information from distributed archives that are compliant with the Memento protocol either natively or by proxy. Its collection consists of a seed list of popular URIs augmented with URIs requested by Memento clients. Whenever possible, responses are delivered from a front-end cache that remains in sync with the background cache using the ResourceSync protocol. If a request cannot be served from the cache, because cached content is unavailable or stale, real-time TimeGate requests are sent to Memento-compliant archives only. This setup achieves a satisfactory balance between response times, response completeness, and response freshness. If needed, the front-end cache can be bypassed and a real-time query explicitly initiated using the regular browser refresh approach, e.g. Shift-Reload in Chrome. The cache-or-realtime decision is sketched below.
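A highly simplified sketch of that decision follows; the in-memory cache, staleness window, and TimeGate call below are illustrative stand-ins for the actual portal infrastructure.

```python
import time

import requests

CACHE = {}           # uri -> (fetched_at, memento_uri); stands in for the front-end cache
MAX_AGE = 24 * 3600  # treat cached entries older than a day as stale

def lookup(uri, force_refresh=False,
           timegate="http://timetravel.mementoweb.org/timegate/"):
    entry = CACHE.get(uri)
    if entry and time.time() - entry[0] < MAX_AGE and not force_refresh:
        return entry[1]  # fresh cache hit; no archives are contacted
    # Cache miss, stale entry, or explicit refresh (think Shift-Reload):
    # issue a real-time TimeGate request and re-populate the cache.
    r = requests.get(timegate + uri, timeout=30, allow_redirects=False)
    memento_uri = r.headers.get("Location")  # URI of the selected memento
    CACHE[uri] = (time.time(), memento_uri)
    return memento_uri
```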

The Time Travel logo that can be used to advertise the portal.

The development of the Time Travel portal was also strongly motivated by the desire to lower the barrier for developing Memento-related functionality, especially at the browser side. Memento protocol information is – appropriately – communicated in HTTP headers. However, browser-side scripts typically do not have access to headers. Hence, we wanted to bring Memento capabilities within the realm of browser-side development. To that end, we introduced several RESTful APIs.
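For example, the portal’s Memento JSON endpoint at /api/json/{datetime}/{uri} can be called from any script; the sketch below uses Python’s requests, but a browser-side script would call the same URL with fetch(). The response fields used here follow the service’s documented JSON format.

```python
import requests

def closest_memento(uri, datetime14="20150101000000"):
    """Ask the Time Travel JSON API for the memento closest to a datetime."""
    resp = requests.get(
        f"http://timetravel.mementoweb.org/api/json/{datetime14}/{uri}",
        timeout=30)
    if resp.status_code == 404:
        return None  # no mementos known for this URI
    return resp.json()["mementos"]["closest"]["uri"]

# closest_memento("http://apple.com/") returns the URI(s) of the memento
# closest to the requested datetime, aggregated across archives.
```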

We are thrilled by the continuous growth in the usage of these APIs and would be interested to learn what kinds of applications people are building on top of our infrastructure. We know that the new version of the Mink browser extension uses the new APIs. Also, the Time Travel Reconstruct service, based on pywb, leverages our own APIs. Memento for Chrome now obtains its list of archives from the Archive Registry. The Robust Links approach to combating reference rot is also based on API calls, but that will be the subject of another blog post.

IIPC members that operate public web archives that are not yet Memento compliant are reminded that OpenWayback and pywb natively support Memento. From the perspective of the Time Travel portal, compliance means that we don’t have to operate a Memento proxy, that the archive’s holdings can be included in real-time queries, and that both original URIs and memento URIs can be used with Find/Reconstruct. From a broader perspective, it means that the archive becomes a building block in a global, interoperable infrastructure that provides a time dimension to the web.

By Herbert Van de Sompel, Digital Library Researcher at Los Alamos National Laboratory

LANL’s Time Travel Portal, Part 1

In early February 2015, we launched the Time Travel portal, which provides cross-system discovery of Mementos.

The design and development of the Time Travel portal was a significant investment and took about a year from conception to release. It involved work directly related to the portal itself, but also a fundamental redesign of the Memento Aggregator, the introduction of several RESTful APIs, the transfer of the Memento infrastructure from LANL’s network to the Amazon cloud, and operating the new environment as an official service of the LANL Research Library.

The team that designed and implemented the Time Travel portal, from left to right: Lyudmila Balakireva, Harihar Shankar, Martin Klein, Ilya Kremer, James Powell, and Herbert Van de Sompel

A major motivation for developing the new portal was to lower the barrier for experiencing Memento’s web time travel. Our flagship Memento for Chrome extension remains the optimal way to experience cross-system time travel, but we wanted some of the power of Memento to be accessible without the need for an extension.

The Time Travel portal has a basic interface that allows entering a URI and a datetime. It offers a Find and a Reconstruct service:

  • The Find service looks for Mementos in the systems covered by the Memento Aggregator. For each archive that holds Mementos for the requested URI, the Memento that is temporally closest to the submitted datetime is listed, with a clear indication of the archive’s name. Results are ordered by temporal proximity to the requested datetime. For each archive, the first/last/previous/next Mementos are also shown when that information is available. For all listed Mementos, a link leads straight into the holding archive. A Find URI can also be constructed; its syntax follows the convention introduced by the Wayback software, e.g. http://timetravel.mementoweb.org/list/20081128230827/http://apple.com (see the sketch after this list).
  • The Reconstruct service reassembles a page using the best Mementos from various Memento-compliant archives, where “best” means temporally closest to the requested datetime. Hence, in a Reconstruct result page, the archived HTML, images, style sheets, JavaScript, etc. can originate from different archives. The assembled pages often look more complete, and the temporal spread of their components is often smaller, than corresponding pages in individual archives. As such, the Reconstruct service provides a nice illustration of the cross-archive interoperability introduced by the Memento protocol. A Reconstruct URI is available using the same Wayback URI convention, e.g. http://timetravel.mementoweb.org/reconstruct/20081128230827/http://apple.com.
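Since both services share the Wayback URI convention, constructing these URIs is a one-liner; the following sketch simply mirrors the examples above.

```python
def time_travel_uri(service, datetime14, uri):
    """Build a Find ('list') or 'reconstruct' URI in the Wayback convention;
    datetime14 is a 14-digit YYYYMMDDhhmmss timestamp."""
    return f"http://timetravel.mementoweb.org/{service}/{datetime14}/{uri}"

print(time_travel_uri("list", "20081128230827", "http://apple.com"))
# -> http://timetravel.mementoweb.org/list/20081128230827/http://apple.com
```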

While the Time Travel portal has been received enthusiastically, usage remains modest. Since its launch, we have seen about 4,000 unique visitors and 7,000 visits per month. We have capacity for much more and would appreciate some promotion of our service by IIPC members. We are also very open to suggestions for additional portal functionality. For example, we have reached out to IIPC members that operate dark archives because we are interested in including their holdings information in Time Travel responses, in order to increase response completeness and to make the existence of these archives more visible. As a first step in that direction, we have proposed Memento-based access to dark archive holdings information as a new functionality for OpenWayback.

By Herbert Van de Sompel, Digital Library Researcher at Los Alamos National Laboratory