With it, we completed a first map of the Web, and we would like to share some early results from this experiment.
With only a few small servers and four weeks of time, we were able to crawl more than 2 billion resources, with the objective of discovering as many domains as possible. Overall, more than 60 million domains were discovered, which represents about half of the active domains in the world (the rest is mostly composed of parking sites and other types of empty domains).
In addition, we were able to run several types of analysis on this material thanks to the Hadoop- and Flink-based architecture of our archive.
Among other things, we used machine learning to classify domains by type or genre (News, Forums, Blogs, E-commerce, etc.).
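To give a flavor of what such a genre classifier can look like, here is a minimal sketch: a multinomial Naive Bayes model over word counts from homepage text, written in plain Python. The training snippets, genre labels, and function names are purely illustrative assumptions, not our production pipeline, which operates at a much larger scale.

```python
import math
from collections import Counter

# Toy labeled "homepage text" per genre -- purely illustrative data.
TRAINING = {
    "News":       ["breaking headlines politics world report daily news"],
    "E-commerce": ["cart checkout price shipping buy discount order"],
    "Blogs":      ["posted by comments archive personal thoughts blog"],
    "Forums":     ["thread reply members topics board forum posts"],
}

def train(training):
    """Count word occurrences per genre and build a shared vocabulary."""
    counts = {genre: Counter() for genre in training}
    for genre, docs in training.items():
        for doc in docs:
            counts[genre].update(doc.split())
    vocab = {w for c in counts.values() for w in c}
    return counts, vocab

def classify(text, counts, vocab):
    """Return the genre with the highest log-likelihood, using
    add-one smoothing and a uniform prior over genres."""
    words = text.split()
    best_genre, best_score = None, float("-inf")
    for genre, c in counts.items():
        total = sum(c.values()) + len(vocab)
        score = sum(math.log((c[w] + 1) / total) for w in words)
        if score > best_score:
            best_genre, best_score = genre, score
    return best_genre

counts, vocab = train(TRAINING)
print(classify("buy now with free shipping and checkout", counts, vocab))
# -> E-commerce
```

In practice, a real classifier would use far richer features (page structure, link patterns, URL tokens) and a model trained on many labeled examples per genre, but the underlying idea of scoring each domain's text against per-genre statistics is the same.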
The other good news is that, thanks to many improvements in both overall efficiency and stability, the cost of such crawls has been cut in half. Combined with falling storage costs, global crawls are becoming much more affordable, and we hope this will benefit more and more institutions.
More details on this will be published later, but we wanted to share this early update with the web archiving community.
By Chloé Martin (COO) of Internet Memory Research