Original Foreign-Language Text

Efficient URL Caching for World Wide Web Crawling

Andrei Z. Broder
IBM TJ Watson Research Center
19 Skyline Dr.
Hawthorne, NY 10532
abroder@us.com

Marc Najork
Microsoft Research
1065 La Avenida
Mountain View, CA 94043
najork@microsoft.com

Janet L. Wiener
Hewlett Packard Labs
1501 Page Mill Road
Palo Alto, CA 94304
janet.wiener@hp.com

ABSTRACT

Crawling the web is deceptively simple: the basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a)–(c).
However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge.
Indeed, these two factors alone imply that for a reasonably fresh and