Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page, (b) Parse it to extract all linked URLs, (c) For all the URLs not seen before, repeat (a)–(c).
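As a concrete illustration, a minimal single-machine version of this loop might look like the sketch below (standard-library Python only; the helper class and the in-memory set and queue are our simplifications, and they are exactly the assumptions the rest of this section argues do not scale to the full web).

```python
# A minimal sketch of the basic crawl loop: fetch (a), parse (b), test membership (c).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, max_pages=100):
    seen = set(seed_urls)          # membership test (c): fine in RAM at this toy scale
    frontier = deque(seed_urls)    # URLs waiting to be fetched
    while frontier and max_pages > 0:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")  # step (a)
        except Exception:
            continue
        max_pages -= 1
        parser = LinkExtractor()
        parser.feed(html)                                    # step (b)
        for href in parser.links:
            link, _ = urldefrag(urljoin(url, href))          # resolve relative URLs
            if link.startswith("http") and link not in seen:  # step (c)
                seen.add(link)
                frontier.append(link)
    return seen
```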
However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge.
Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory.
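A quick back-of-the-envelope check of these figures is sketched below. The crawl-cycle length (about six weeks, enough to keep pace with ~7% weekly change) and the average of roughly ten out-links per fetched page are our assumptions, not stated explicitly in the text.

```python
# Back-of-the-envelope check of the throughput figures quoted above.
web_size = 4e9                      # pages (assumption from the estimate above)
crawl_cycle = 6 * 7 * 24 * 3600     # ~6 weeks, in seconds (assumed refresh period)
links_per_page = 10                 # assumed average out-links per fetched page

fetch_rate = web_size / crawl_cycle        # step (a): pages fetched per second
test_rate = fetch_rate * links_per_page    # step (c): URL membership tests per second
print(f"{fetch_rate:.0f} fetches/s, {test_rate:.0f} membership tests/s")
# -> roughly 1,100 fetches/s and 11,000 tests/s, consistent with the figures above
```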
This requires a distributed architecture, which further complicates the membership test.
A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs.
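One illustrative arrangement (a sketch of the general idea, not the specific scheme developed later) places an in-memory LRU cache of recently seen URLs in front of a slower, authoritative seen-set; here a SQLite table stands in for a disk- or network-resident structure. Because popular URLs recur often, most lookups are answered from the cache.

```python
# Illustrative sketch: an LRU cache of "seen" URLs in front of a slower store.
import sqlite3
from collections import OrderedDict


class CachedSeenSet:
    def __init__(self, capacity=1_000_000, path=":memory:"):
        self.capacity = capacity
        self.cache = OrderedDict()   # URL -> None, kept in least-recently-used order
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

    def add_if_new(self, url):
        """Return True if the URL was not seen before (and record it)."""
        if url in self.cache:                       # fast path: answered from memory
            self.cache.move_to_end(url)
            return False
        row = self.db.execute("SELECT 1 FROM seen WHERE url = ?", (url,)).fetchone()
        new = row is None
        if new:
            self.db.execute("INSERT INTO seen (url) VALUES (?)", (url,))
        self.cache[url] = None                      # remember the URL in the cache
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)          # evict the least recently used URL
        return new
```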