[Main Text]
…execution performance: when one instance of this data changes, every other copy must be updated accordingly. On the one hand, every URL accepted by the filter satisfies the user-defined filtering rules, so all of these URLs point to data sources the user actually wants to obtain, which guarantees the accuracy of data crawling. On the other hand, because the large number of non-qualifying URLs is discarded and only qualifying URLs are added to the crawl queue, a great deal of valuable memory is saved.

Data Acquisition Module
The data acquisition module is an extensible, news-oriented web crawler implemented in Java with the open-source HttpClient library. Following a breadth-first crawling strategy, the crawler performs both comprehensive targeted crawls of news data and periodic incremental crawls. The result of the operation is shown in Figure 5-2.

Figure 5-2: The "Establish Connection" operation

The "Start Crawling" operation
After obtaining a connection to the database, the user can start crawling by selecting a news category (subdivided into china, world, society, and mil); the data source and the crawl-limit options are controlled at the top of the interface. When the input date is "20140608", the crawl returns 69 domestic news items, 29 world items, 23 society items, and 12 military items, so military news is still the smallest category.

The rules that serve the testing objectives are as follows: testing is the process of executing a program with the intent of finding errors. White-box testing, also called "structural testing", treats the object under test as an open box: it allows testers to use the program's internal logical structure and related information to design or select test cases, exercise all of the program's logical paths, and check whether the program's behaviour meets its functional requirements.

7 Conclusion
With the rapid development of the Internet industry, news crawler systems are playing an increasingly important role and have become an effective tool for supplying data to hot-news analysis systems. Only after becoming familiar with these topics did I begin to build the news crawler system; second, this graduation project made me deeply appreciate that a solid programming foundation is a basic requirement for every outstanding programmer. I hereby express my sincere thanks to all the scholars whose work I have drawn on.

Appendix 1: English Original

ABSTRACT
Crawling the web is deceptively simple: the basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test. A crucial way to speed up the test is to cache, that is, to store in main memory a dynamic subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms (random replacement, static cache, LRU, and CLOCK) and theoretical limits (clairvoyant caching and infinite cache). We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33-day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective: in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective, while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.

1. INTRODUCTION
A recent Pew Foundation study [31] states that "Search engines have become an indispensable utility for Internet users" and estimates that, as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.
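The fetch-parse-enqueue cycle of steps (a)–(c) is the same breadth-first, filter-driven loop that the data acquisition module described in the main text implements with HttpClient. The following minimal Java sketch is a hypothetical reconstruction of such a loop, not the thesis's actual code: the class name, seed URL, filter rule, and crawl limit are invented, link extraction is reduced to a crude regular expression over absolute links, and the HTTP calls assume the Apache HttpClient 4.x API.

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

// Hypothetical sketch of the fetch-parse-enqueue loop: a FIFO queue gives
// breadth-first order, the in-memory "seen" set is the membership test, and
// urlFilter stands in for the user-defined filtering rules (here: one news section).
public class NewsCrawlerSketch {

    private static final Pattern LINK =
            Pattern.compile("href=[\"'](https?://[^\"'#]+)", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        Queue<String> frontier = new ArrayDeque<>();     // crawl queue (BFS order)
        Set<String> seen = new HashSet<>();              // set of already-seen URLs
        String seed = "http://news.example.com/china/";  // hypothetical seed URL
        frontier.add(seed);
        seen.add(seed);

        try (CloseableHttpClient http = HttpClients.createDefault()) {
            int fetched = 0;
            while (!frontier.isEmpty() && fetched < 100) {   // hypothetical crawl limit
                String url = frontier.poll();
                String html = EntityUtils.toString(           // (a) fetch a page
                        http.execute(new HttpGet(url)).getEntity());
                fetched++;

                Matcher m = LINK.matcher(html);               // (b) parse out linked URLs
                while (m.find()) {
                    String link = m.group(1);
                    // (c) enqueue only unseen URLs that also pass the filter rules
                    if (urlFilter(link) && seen.add(link)) {
                        frontier.add(link);
                    }
                }
            }
        }
    }

    // Placeholder for the user-defined filter rules described above.
    private static boolean urlFilter(String url) {
        return url.contains("news.example.com/china/");
    }
}

Using an ArrayDeque as a FIFO queue is what makes the traversal breadth-first; replacing it with a stack would turn the same loop into a depth-first crawl, and the seen set is exactly the membership test whose cost the appendix paper analyses.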
Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from non-web sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a)–(c). Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or of URLs collected during previous crawls. Sometimes crawls are started from a single well-connected page or a directory, but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon.

If we view web pages as nodes in a graph and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node, among the nodes not yet explored, to explore next. Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS); they are easy to implement and taught in many introductory algorithms classes. See for instance [34].

However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge because of the following two factors.

1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, historically, the web has doubled every 9–12 months.

2. Web pages are changing rapidly. If "change" means "any change", then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].

These two factors imply that to obtain a reasonably fresh and complete snapshot of the web, a search engine must crawl at least 100 million pages per day; step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally. A crucial way to speed up the membership test is to cache a dynamic subset of the "seen" URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques (random replacement, static cache, LRU, and CLOCK) and compared them against two theoretical limits (clairvoyant caching and infinite cache) when run against a trace of a web crawl that issued over one billion HTTP requests.
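The four practical caching policies compared above are standard replacement strategies rather than anything specific to this paper's implementation. As a rough, hypothetical illustration of one of them, the Java sketch below builds an LRU front-end for the seen-URL membership test on LinkedHashMap's access-order mode; the class and method names are invented for this example, and a miss would still have to be resolved against the full URL set held on disk or on a peer node.

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration: an in-memory LRU cache used as a fast front-end
// for the "seen URL" membership test. A hit short-circuits the expensive check;
// a miss must still be verified against the complete (disk- or peer-resident) set.
public class UrlLruCache {

    private final Map<String, Boolean> cache;

    public UrlLruCache(final int capacity) {
        // accessOrder = true gives LRU ordering; the eldest entry is evicted
        // once the map grows beyond the configured capacity.
        this.cache = new LinkedHashMap<String, Boolean>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns true on a cache hit; get() also refreshes the entry's recency. */
    public boolean seenRecently(String url) {
        return cache.get(url) != null;
    }

    /** Records a URL as seen, possibly evicting the least recently used entry. */
    public void markSeen(String url) {
        cache.put(url, Boolean.TRUE);
    }
}

CLOCK, as discussed in the paper, approximates this behaviour more cheaply: instead of reordering entries on every access, it keeps a circular buffer with a single "recently used" bit per entry and evicts the first entry whose bit the sweeping hand finds cleared.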