Foreign Literature Translation: Efficient URL Caching Based on Web Crawlers

[Excerpt] For the experiments described in this paper, we used the Mercator web crawler [22, 29]. Mercator uses a set of independent, communicating web crawler processes. Each crawler process is responsible for a subset of all web servers; crawling processes fetch pages.

Original foreign-language material

Efficient URL Caching for World Wide Web Crawling
Marc Najork
BMJ (International Edition) 2009

Crawling the web is deceptively simple: the basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a)-(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.

A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms (random replacement, static cache, LRU, and CLOCK) and theoretical limits (clairvoyant caching and infinite cache). We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33-day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective: in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective, while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.

1. INTRODUCTION

A recent Pew Foundation study [31] states that "Search engines have become an indispensable utility for Internet users" and estimates that, as of mid-2002, slightly over 50% of all Americans had used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.

Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from non-web sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a)-(c). Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or of URLs collected during previous crawls. Sometimes crawls are started from a single well-connected page, or a directory such as , but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure
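
The membership test in step (c) is the part that the in-memory cache is meant to accelerate. As a concrete illustration only (a minimal sketch, not the paper's Mercator implementation), the Python fragment below runs the basic (a)-(c) loop with a bounded LRU cache of recently seen URLs; fetch_page and extract_urls are hypothetical placeholders, and a real distributed crawler would still consult a full disk-based seen set whenever the cache misses.

    from collections import OrderedDict

    def crawl(seed_urls, fetch_page, extract_urls, cache_size=50_000):
        # Basic crawl loop (a)-(c) with a bounded LRU cache for the
        # "seen URL" membership test. fetch_page(url) and extract_urls(html)
        # are hypothetical placeholders supplied by the caller; a real
        # crawler would also check a disk-based table of all seen URLs
        # on every cache miss.
        cache = OrderedDict()   # URL -> None, ordered by recency of use

        def seen(url):
            if url in cache:
                cache.move_to_end(url)        # cache hit: refresh recency
                return True
            cache[url] = None                 # cache miss: record the URL
            if len(cache) > cache_size:
                cache.popitem(last=False)     # evict the least recently used entry
            return False

        frontier = [u for u in seed_urls if not seen(u)]  # URLs waiting to be fetched
        while frontier:
            url = frontier.pop()
            html = fetch_page(url)            # step (a): fetch a page
            for link in extract_urls(html):   # step (b): parse out linked URLs
                if not seen(link):            # step (c): membership test via the cache
                    frontier.append(link)

With the capacity set near the 50,000-entry figure reported above, most membership tests would be answered from memory, which is consistent with the roughly 80% hit rate the authors observe.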