Internet Memory Research launches MemoryBot

Internet Memory Research is pleased to announce that our new crawler, MemoryBot, has now reached full maturity.

With it, we completed a first map of the Web, and we would like to share some early results from this experiment.

With only a few small servers and four weeks' time, we were able to crawl more than 2 billion resources, with the objective of discovering as many domains as possible. Overall, more than 60 million domains were discovered, which represents about half of the active domains in the world (the rest is mostly composed of parking sites and other types of empty domains).

In addition, we were able to run several types of analysis on this material thanks to the current Hadoop- and Flink-based architecture of our archive.
Among other things, we used machine learning to classify domains by type or genre (News, Forums, Blogs, E-commerce, etc.).

The other good news is that, thanks to many improvements in overall efficiency and stability, the cost of such crawls has been halved. Combined with falling storage costs, this makes global crawls much more affordable, and we hope this will benefit more and more institutions.

More details on this will be published later, but we wanted to share this early update with the web archiving community.

By Chloé Martin (COO) of Internet Memory Research
