Foreign-Language Translation: Efficient URL Caching for World Wide Web Crawling

Original Foreign-Language Text

Efficient URL Caching for World Wide Web Crawling

Andrei Z. Broder
IBM TJ Watson Research Center
19 Skyline Dr
Hawthorne, NY 10532
abroder@us.ibm.com

Marc Najork
Microsoft Research
1065 La Avenida
Mountain View, CA 94043
najork@microsoft.com

Janet L. Wiener
Hewlett Packard Labs
1501 Page Mill Road
Palo Alto, CA 94304
janet.wiener@hp.com

ABSTRACT

Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.

A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the “seen” URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching i...
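
As an illustration of steps (a)–(c) and of the role a cache plays in the membership test, the following is a minimal sketch, not the authors' implementation. It assumes a small LRU cache (one of the practical algorithms named in the abstract) placed in front of the full "seen" set. The names LRUUrlCache, check_and_add, crawl, and fetch_links are hypothetical, and an ordinary in-memory Python set stands in for the large, possibly disk-resident or distributed set described above.

```python
from collections import OrderedDict, deque


class LRUUrlCache:
    """Fixed-capacity LRU cache holding a subset of the 'seen' URLs (hypothetical helper)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # keys ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def check_and_add(self, url):
        """Return True on a cache hit. On a miss, insert the URL and evict the LRU entry if full."""
        if url in self.entries:
            self.entries.move_to_end(url)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self.entries[url] = None
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop the least recently used URL
        return False


def crawl(seed, fetch_links, cache):
    """Breadth-first crawl; fetch_links(url) is an assumed callback returning the linked URLs."""
    seen = {seed}  # stand-in for the full seen set, assumed expensive to query in practice
    frontier = deque([seed])
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)              # (a) fetch the page
        for link in fetch_links(url):  # (b) parse it to extract linked URLs
            if cache.check_and_add(link):
                continue               # cache hit: already seen, skip the expensive test
            if link not in seen:       # (c) full membership test, needed only on a cache miss
                seen.add(link)
                frontier.append(link)
    return order
```

Every URL placed in the cache is also in the seen set by the end of the same loop iteration, so a cache hit can safely skip the full test. The hit rate (hits divided by hits plus misses) is the quantity a simulation over different cache sizes and replacement policies would compare.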
