Foreign Literature Translation: Efficient URL Caching Based on Web Crawlers

[Excerpt] For the experiments described in this paper, we used the Mercator web crawler [22, 29]. Mercator uses a set of independent, communicating web crawler processes. Each crawler process is responsible for a subset of all web servers; crawling processes fetch pages.

Original foreign-language material

Efficient URL Caching for World Wide Web Crawling
Marc Najork
BMJ (International Edition) 2009

Crawling the web is deceptively simple: the basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a)-(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.

A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms (random replacement, static cache, LRU, and CLOCK) and theoretical limits (clairvoyant caching and infinite cache). We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33-day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective: in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective, while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.

1. INTRODUCTION

A recent Pew Foundation study [31] states that "Search engines have become an indispensable utility for Internet users" and estimates that, as of mid-2002, slightly over 50% of all Americans had used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.

Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from non-web sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a)-(c). Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or of URLs collected during previous crawls. Sometimes crawls are started from a single well-connected page, or a directory such as , but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure
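
The membership test in step (c) is the part that the in-memory cache is meant to accelerate. As a concrete illustration only (a minimal sketch, not the paper's Mercator implementation), the Python fragment below runs the basic (a)-(c) loop with a bounded LRU cache of recently seen URLs; fetch_page and extract_urls are hypothetical placeholders, and a real distributed crawler would still consult a full disk-based seen set whenever the cache misses.

    from collections import OrderedDict

    def crawl(seed_urls, fetch_page, extract_urls, cache_size=50_000):
        # Basic crawl loop (a)-(c) with a bounded LRU cache for the
        # "seen URL" membership test. fetch_page(url) and extract_urls(html)
        # are hypothetical placeholders supplied by the caller; a real
        # crawler would also check a disk-based table of all seen URLs
        # on every cache miss.
        cache = OrderedDict()   # URL -> None, ordered by recency of use

        def seen(url):
            if url in cache:
                cache.move_to_end(url)        # cache hit: refresh recency
                return True
            cache[url] = None                 # cache miss: record the URL
            if len(cache) > cache_size:
                cache.popitem(last=False)     # evict the least recently used entry
            return False

        frontier = [u for u in seed_urls if not seen(u)]  # URLs waiting to be fetched
        while frontier:
            url = frontier.pop()
            html = fetch_page(url)            # step (a): fetch a page
            for link in extract_urls(html):   # step (b): parse out linked URLs
                if not seen(link):            # step (c): membership test via the cache
                    frontier.append(link)

With the capacity set near the 50,000-entry figure reported above, most membership tests would be answered from memory, which is consistent with the roughly 80% hit rate the authors observe.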