外文翻译--基于网络爬虫的有效URL缓存

下载本文档

阅读 71
下载 7
格式 doc
大小 112.5 KB
约22页
2025-07-25 发布于天津市
收藏
评论
点赞(0)
海报
举报

1/22页

2/22页

3/22页

在线预览已结束，请下载后查看完整版，加入VIP享文档下载特权

/22

文本预览下载提示常见问题

外文原文Efficient URL Caching for World Wide Web CrawlingAndrei Z. BroderIBM TJ Watson Research Center19 Skyline DrHawthorne, NY 10532abroder@us.ibm.comMarc NajorkMicrosoft Research1065 La AvenidaMountain View, CA 94043najork@microsoft.comJanet L. WienerHewlett Packard Labs1501 Page Mill RoadPalo Alto, CA 94304janet.wiener@hp.comABSTRACTCrawling the web is deceptively simple: the basic algorithm is (a)Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the “seen” URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching i...

1、当您付费下载文档后，您只拥有了使用权限，并不意味着购买了版权，文档只能用于自身使用，不得用于其他商业用途（如 [转卖]进行直接盈利或[编辑后售卖]进行间接盈利）。
2、本站所有内容均由合作方或网友上传，本站不对文档的完整性、权威性及其观点立场正确性做任何保证或承诺！文档内容仅供研究参考，付费前请自行鉴别。
3、如文档内容存在违规，或者侵犯商业秘密、侵犯著作权等，请点击“违规举报”。

碎片内容