explore next. Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS); they are easy to implement and taught in many introductory algorithms classes (see, for instance, [34]). However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge, because of the following two factors.

1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, historically, the web has doubled every 9 to 12 months.

2. Web pages change rapidly. If "change" means "any change", then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].

These two factors imply that to obtain a reasonably fresh and complete snapshot of the web, a search engine must crawl at least 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally.

A crucial way to speed up the membership test is to cache a (dynamic) subset of the "seen" URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques: random replacement, static cache, LRU, and CLOCK, and compared them against two theoretical limits: clairvoyant caching and infinite cache, when run against a trace of a web crawl that issued over one billion HTTP requests. We found that simple caching techniques are extremely effective even at relatively small cache sizes such as 50,000 entries, and we show how these caches can be implemented very efficiently.

The paper is organized as follows: Section 2 discusses the various crawling solutions proposed in the literature and how caching fits in their model. Section 3 presents an introduction to caching techniques and describes several theoretical and practical algorithms for caching. We implemented these algorithms under the experimental setup described in Section 4. The results of our simulations are depicted and discussed in Section 5, and our recommendations for practical algorithms and data structures for URL caching are presented in Section 6. Section 7 contains our conclusions and directions for further research.

2. CRAWLING

Web crawlers are almost as old as the web itself, and numerous crawling systems have been described in the literature. In this section, we present a brief survey of these crawlers (in historical order) and then discuss why most of these crawlers could benefit from URL caching.

The crawler used by the Internet Archive [10] employs multiple crawling processes, each of which performs an exhaustive crawl of 64 hosts at a time. The crawling processes save non-local URLs to disk; at the end of a crawl, a batch job adds these URLs to the per-host seed sets of the next crawl.

The original Google crawler, described in [7], implements the different crawler components as different processes. A single URL server process maintains the set of URLs to download; crawling processes fetch pages.
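To make the crawl loop and its membership-test bottleneck concrete, the following minimal sketch (ours, not from the paper; the helper fetch_and_extract and all names are illustrative) shows a BFS frontier in which step (a) corresponds to dequeuing and fetching a URL and step (c) to the membership test against the set of already-seen URLs. At web scale, the `seen` set below is exactly the structure that no longer fits in main memory.

```python
from collections import deque

def bfs_crawl(seeds, fetch_and_extract, max_pages):
    """Illustrative BFS crawl loop (a sketch, not the paper's crawler).

    fetch_and_extract(url) is a hypothetical helper that downloads a
    page and returns the URLs linked from it.
    """
    frontier = deque(seeds)   # URLs waiting to be downloaded (FIFO = BFS)
    seen = set(seeds)         # in a real crawler, far too large for RAM
    pages = 0
    while frontier and pages < max_pages:
        url = frontier.popleft()                 # step (a): fetch next URL
        pages += 1
        for link in fetch_and_extract(url):      # extract linked URLs
            if link not in seen:                 # step (c): membership test
                seen.add(link)
                frontier.append(link)
    return pages
```

In this toy form the membership test is a single hash lookup; the paper's point is that once `seen` spills to disk or to peer nodes, an in-memory cache of recently seen URLs is needed to keep the test fast.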
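Of the four practical policies studied (random replacement, static cache, LRU, and CLOCK), CLOCK is perhaps the least widely known. The sketch below is a minimal illustration under our own assumptions (64-bit URL fingerprints obtained by truncating SHA-1, and a Python dict for slot lookup); it is not the paper's implementation. A hit means the URL was certainly seen before; a miss means the slower authoritative test, on disk or at a peer node, must still be performed.

```python
import hashlib

class ClockCache:
    """Fixed-size cache of URL fingerprints with CLOCK replacement (a sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slot = {}                   # fingerprint -> slot index
        self.keys = [None] * capacity    # fingerprint stored in each slot
        self.used = [False] * capacity   # CLOCK reference bits
        self.hand = 0

    @staticmethod
    def fingerprint(url):
        # Truncating to 64 bits leaves a negligible collision probability
        # for a sketch; any fast fingerprinting function would do.
        return int.from_bytes(hashlib.sha1(url.encode()).digest()[:8], "big")

    def contains(self, url):
        idx = self.slot.get(self.fingerprint(url))
        if idx is None:
            return False                 # miss: caller must do the full test
        self.used[idx] = True            # mark recently used
        return True

    def add(self, url):
        fp = self.fingerprint(url)
        if fp in self.slot:
            return
        # Advance the hand, clearing reference bits, until a victim is found.
        while self.used[self.hand]:
            self.used[self.hand] = False
            self.hand = (self.hand + 1) % self.capacity
        victim = self.keys[self.hand]
        if victim is not None:
            del self.slot[victim]
        self.keys[self.hand] = fp
        self.slot[fp] = self.hand
        self.hand = (self.hand + 1) % self.capacity
```

In use, a crawler thread would call contains(url) first; only on a miss does it fall back to the authoritative membership test, calling add(url) afterwards so that subsequent occurrences of the same URL hit the cache. CLOCK approximates LRU while touching only one reference bit per hit, which is why it is attractive at the request rates discussed above.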