[Main Text]
Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or URLs collected during previous crawls. Sometimes crawls are started from a single well-connected page, or a directory such as , but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon.

If we view web pages as nodes in a graph, and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node among the nodes not yet explored to explore next. Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS); they are easy to implement and are taught in many introductory algorithms classes (see for instance [34]). However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge, because of the following two factors.

1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, historically, the web has doubled every 9-12 months.

2. Web pages are changing rapidly. If "change" means "any change", then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].

These two factors imply that to obtain a reasonably fresh and complete snapshot of the web, a search engine must crawl at least 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally. A crucial way to speed up the membership test is to cache a (dynamic) subset of the "seen" URLs in main memory.

The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques: random replacement, static cache, LRU, and CLOCK, and compared them against two theoretical limits: clairvoyant caching and infinite cache, when run against a trace of a web crawl that issued over one billion HTTP requests. We found that simple caching techniques are extremely effective even at relatively small cache sizes such as 50,000 entries, and we show how these caches can be implemented very efficiently.

The paper is organized as follows: Section 2 discusses the various crawling solutions proposed in the literature and how caching fits in their model. Section 3 presents an introduction to caching techniques and describes several theoretical and practical algorithms for caching. We implemented these algorithms under the experimental setup described in Section 4. The results of our simulations are depicted and discussed in Section 5, and our recommendations for practical algorithms and data structures for URL caching are presented in Section 6. Section 7 contains our conclusions and directions for further research.
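To make the role of such an in-memory cache concrete, the following is a minimal sketch (not the implementation studied in the paper) of an LRU cache of URL fingerprints that a crawler could probe before falling back to the full, disk-resident or remote set of seen URLs. The class name UrlSeenCache, the method probeAndInsert, and the use of 64-bit fingerprints are assumptions made for this illustration.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch of an in-memory LRU cache for the URL membership test.
    // A hit means the URL was almost certainly seen before; on a miss the
    // caller must still consult the full seen set (on disk or on a peer node).
    public class UrlSeenCache {
        // 50,000 entries is one of the cache sizes discussed in the paper
        private static final int MAX_ENTRIES = 50000;

        // An access-ordered LinkedHashMap provides LRU eviction with little code.
        private final Map<Long, Boolean> cache =
            new LinkedHashMap<Long, Boolean>(MAX_ENTRIES, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Long, Boolean> eldest) {
                    return size() > MAX_ENTRIES;
                }
            };

        // Insert the fingerprint and report whether it was already cached.
        public boolean probeAndInsert(long urlFingerprint) {
            return cache.put(urlFingerprint, Boolean.TRUE) != null;
        }
    }

Caching fixed-size fingerprints (for example, 64-bit hashes of normalized URLs) rather than the URL strings themselves keeps each entry small, so even a cache of tens of thousands of entries fits comfortably in main memory.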
2. CRAWLING

Web crawlers are almost as old as the web itself, and numerous crawling systems have been described in the literature. In this section, we present a brief survey of these crawlers (in historical order) and then discuss why most of these crawlers could benefit from URL caching. The crawler used by the Internet Archive [10] employs multiple crawling processes, each of which performs an exhaustive crawl of 64 hosts at a time. The crawling processes save non-local URLs to disk.

Detailed design of the web crawler

Crawling pages. The main techniques used are as follows. The crawler subclasses HTMLEditorKit and overrides its getParser() method, changing its access from protected to public, and uses a class of the following form to crawl pages:

    public class XXXXX extends HTMLEditorKit {
        public HTMLEditorKit.Parser getParser() {
            return super.getParser();
        }
    }

Starting from a given link, the crawler visits all of the links on that page; once they have been visited, it descends to the next level by recursion and repeats the process. For each URL in the to-crawl list, the HTML source of the corresponding page is extracted: a URLConnection is opened for the page (URLConnection url_C = ();, using the value 10000), the page is read through an input stream wrapped in a BufferedReader, and the content is stored as a string.

Saving page information. Page information is stored in a class with fields and accessors such as:

    private String Url;
    private int ContentLength;
    private int Port;

    // the body text of the page
    protected String paragraphText = new String();
    protected String encode = new String();
    // all links found on the page
    protected Vector<String> links = new Vector<String>();

    public String getEncode() { return encode; }
    // return all links of this page
    public Vector<String> getLinks() { return links; }

Judging relevance. Each URL is analyzed and its relevance to the topic is judged. Algorithm steps and description:
1. Feature terms for the title and the body are selected by segmenting the text and matching the terms against the topic set; term frequencies are then counted to obtain a title vector and a body vector whose dimensions equal those of the topic vector.
Step 2: , while duplicate parts are removed at the same time.
Step 4: if the value is greater than the given threshold the page is relevant; if it is smaller the page is not relevant and the URL is discarded. The relevance threshold is set to 2: if the relevance A of a page to the topic satisfies A >= 2, the page is considered relevant to the topic.

Multithreading. The crawler is designed with 4 threads working concurrently. Thread i performs its work on every fourth URL of the overall URL list (the entries at positions i + 4k), and operations on the list storing all URLs are executed inside synchronized (all_URL) blocks.

Overall flow. The crawler's code files are organized as shown in Figure 4-1 (screenshot of the code structure). The getParser() method is public.

Chapter 5  Testing

The crawler was set to fetch only the first 5 pages; the interface after the program runs is shown in Figure 5-1 (test figure 1). The preset directory is D:test. After pressing START, the contents of the directory are as shown in Figure 5-2 (test figure 2), and the database contents are as shown in Figure 5-3 (test figure 3). The Ping function was tested by pinging both a correct URL and an incorrect URL, as shown in Figure 5-4 (test figure 4), Figure 5-5 (test figure 5), and Figure 5-6 (test figure 6).

Chapter 6  Summary and Outlook

In March 2011 I began work on my graduation thesis; as of today, the thesis is essentially complete. In early March, through discussions with my advisor, my topic was settled: a topic-focused web crawler. By early April the reference material had been gathered and I began writing the thesis. I collected material in the school library and the DUT library, and searched the web for all kinds of related material, recording this valuable material in my notebook and trying to make it as complete, accurate, and plentiful as possible, which benefited the writing. I told my advisor about the difficulty I had run into, and under my advisor's careful guidance I finally gained a grasp of the direction and method of my work. In May I began writing the related code; with everyone's help the difficulties were solved one by one, and the thesis gradually took shape.

In designing the platform, attention had to be paid to its feasibility and effectiveness: knowledge points that are both important and well suited to being presented in the form of learning software were chosen as material, excellent domestic and foreign learning-assistance platforms were consulted, and the special nature of the database course was taken into account. It was necessary to understand and master the fundamentals of databases in depth, to identify the difficult and key points of the database course, and, for the difficult points, to take students' learning ability fully into account and help them master the knowledge in the way that is easiest to absorb.

After several months of hard work, this intense but fulfilling graduation project has finally come to a close. A down-to-earth, rigorous, and truth-seeking attitude toward study, and the spirit of not fearing difficulties, persevering, and working hard, are the greatest gains I take from this project. Throughout the whole process I learned new things and broadened my horizons. From my respected advisors I learned not only solid and broad professional knowledge but also how to conduct myself; my studies and the research for this thesis are steeped in my teachers' hard work. This graduation project also brought my classmates and me closer together: we helped one another, discussed whatever we did not understand, and listening to different views helped us understand the material better, so here I sincerely thank the classmates who helped me.
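As a concrete illustration of the page-fetching and link-extraction technique described in the crawler design above (subclassing HTMLEditorKit to expose getParser() and reading the page through a URLConnection with a BufferedReader), here is a minimal, self-contained sketch. The class names ParserGetter, LinkCollector, and PageFetcher, the fetchLinks method, and the interpretation of 10000 as a connect timeout are assumptions made for this example.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.Vector;
    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;

    // Exposes the protected getParser() of HTMLEditorKit, as described above.
    class ParserGetter extends HTMLEditorKit {
        public HTMLEditorKit.Parser getParser() {
            return super.getParser();
        }
    }

    // Callback that collects the href attribute of every <a> tag it sees.
    class LinkCollector extends HTMLEditorKit.ParserCallback {
        final Vector<String> links = new Vector<String>();

        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            if (t == HTML.Tag.A) {
                Object href = a.getAttribute(HTML.Attribute.HREF);
                if (href != null) {
                    links.add(href.toString());
                }
            }
        }
    }

    public class PageFetcher {
        // Fetch one page and return the links found on it.
        public static Vector<String> fetchLinks(String pageUrl) throws Exception {
            URLConnection url_C = new URL(pageUrl).openConnection();
            url_C.setConnectTimeout(10000);   // assumed meaning of the "(10000)" setting above
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(url_C.getInputStream()));
            LinkCollector collector = new LinkCollector();
            new ParserGetter().getParser().parse(reader, collector, true);
            reader.close();
            return collector.links;
        }
    }

In the same way, the BufferedReader can be used to accumulate the page body into a string for the relevance test described above.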
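The following sketch shows one way the 4-thread division of work described above could look, with thread i taking every fourth entry of the shared list all_URL and synchronizing on that list. The class name CrawlWorker and the processUrl method are placeholders, and the sketch is simplified in that a thread stops when it reaches the current end of the list rather than waiting for newly discovered URLs.

    import java.util.Vector;

    // Sketch of the 4-thread work partition: thread i handles the URLs at
    // positions i, i+4, i+8, ... of the shared list all_URL.
    public class CrawlWorker extends Thread {
        private static final int THREAD_COUNT = 4;
        private final int id;                    // thread index 0..3
        private final Vector<String> all_URL;    // shared list of discovered URLs

        public CrawlWorker(int id, Vector<String> all_URL) {
            this.id = id;
            this.all_URL = all_URL;
        }

        public void run() {
            for (int next = id; ; next += THREAD_COUNT) {
                String url;
                synchronized (all_URL) {         // guard concurrent access to the shared list
                    if (next >= all_URL.size()) {
                        return;                  // simplified: stop at the current end of the list
                    }
                    url = all_URL.get(next);
                }
                processUrl(url);                 // fetch the page, judge relevance, collect links
            }
        }

        private void processUrl(String url) {
            // placeholder for the per-URL work described above
        }
    }

The four workers would then be started as, for example, new CrawlWorker(0, all_URL).start() through new CrawlWorker(3, all_URL).start().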