后期系統(tǒng)的使用過程中難免會出現(xiàn)新的問題,系統(tǒng)也將在不斷的調(diào)整和維護中日趨完善。 7 總結(jié)在互聯(lián)網(wǎng)產(chǎn)業(yè)高速發(fā)展的今天,新聞爬蟲系統(tǒng)正在扮演著越來越重要的角色,成為新聞熱點分析系統(tǒng)數(shù)據(jù)來源的有效工具。本次設(shè)計開發(fā)的新聞爬蟲系統(tǒng),目前還存在著許多方面的問題,例如:數(shù)據(jù)源的廣度不夠,目前主要的數(shù)據(jù)來源都是新浪新聞;系統(tǒng)功能較為單一,目前主要的功能只有新聞爬取并以不同的可視化形式來呈現(xiàn)新聞內(nèi)容以及新聞分析結(jié)果;對于新聞內(nèi)容的爬取也不是很完善,只是簡單的抽取了一些文本信息和圖片信息,對于音頻和視頻并不能很好地抽取下來。3) 將整個系統(tǒng)打造成一個多數(shù)據(jù)源、功能強大的智能新聞爬取分析平臺。當(dāng)對于這個項目所需要的知識和技術(shù)比較了解的時候才可以著手做。當(dāng)對了這些比較熟悉后才開始著手做新聞爬蟲系統(tǒng);其次,通過這次的畢業(yè)設(shè)計深刻的體會到扎實的編程功底是每一個出色的程序員都必須具備的基礎(chǔ)條件。在這個由面向?qū)ο蟮某绦蛟O(shè)計思想主導(dǎo)軟件行業(yè)的時代,只有學(xué)好一門面向?qū)ο蟮某绦蛟O(shè)計語言(如C、JAVA以及ActionScript)方能保持自己在激烈的社會競爭中立于不敗之地;最后,充分利用一些開源工具能夠在很大程度上提升編程效率,例如,爬蟲采集部分我使用HttpClient來進(jìn)行服務(wù)器訪問和響應(yīng)處理,信息處理部分我借助HTMLParser進(jìn)行文本抽取,數(shù)據(jù)可視化部分該系統(tǒng)使用了ExtJS開源AJAX框架結(jié)合Google Visualization API進(jìn)行數(shù)據(jù)顯示,這些開源工具用起來簡捷方便,幫助我在較短的時間實現(xiàn)了所需要的功能。在今后的學(xué)習(xí)過程中,我會繼續(xù)嚴(yán)格要求自己,以做產(chǎn)品的態(tài)度來對待所有的工作,力求完美。在此向幫助過我的老師學(xué)長學(xué)姐表示衷心的感謝。在此向各位學(xué)者表示衷心的感謝附錄1 英文原文ABSTRACT Crawling the web is deceptively simple: the basic algorithm i(a)Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before repeat (a)–(c).However the size of the web estimated at over 4 billion pages and its rate of change (estimated at 7 per week )move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed these two factors alone imply that for a reasonably fresh and plete crawl of the web step a must be executed about a thousand times per second and thus the membership test c must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture which further plicates the membership test. A crucial way to speed up the test is to cache that is to store in main memory adynamic subset of the “seen” URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement static cache LRU and CLOCK and theoretical limits: clairvoyant caching and infinite cache. We performed about 1800simulations using these algorithms with various cache sizes using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective – in our setup a cache of roughly 50000 entries can achieve a hit rate of almost 80. Interestingly this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.1. INTRODUCTIONA recent Pew Foundation study 31 states that “Search engines have bee an indispensable utility for Internet users” and estimates that as of mid2002 slightly over 50% of all Americans have used web search to find information. Hence the technology that powers web search is of enormous practical interest. In this paper we concentrate on one aspect of the search technology namely the process of collecting web pages that eventually constitute the search engine corpus. Search engines collect pages in many ways among them direct URL submission paid inclusion and URL extraction from non web sources but the bulk of the corpus is obtained by recursively exploring the web a process known as crawling or SPIDERing. 
The basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a)–(c). Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or of URLs collected during previous crawls. Sometimes crawls are started from a single well-connected page or a directory, but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon.

If we view web pages as nodes in a graph and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node, among the nodes not yet explored, to explore next. Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS); they are easy to implement and taught in many introductory algorithms classes. See, for instance, [34].

However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge because of the following two factors.

1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, historically, the web has doubled every 9-12 months.

2. Web pages are changing rapidly. If "change" means "any change", then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].

These two factors imply that, to obtain a reasonably fresh and complete snapshot of the web, a search engine must crawl at least 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally.

A crucial way to speed up the membership test is to cache a dynamic subset of the "seen" URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques (random replacement, static cache, LRU, and CLOCK) and compared them against two theoretical limits (clairvoyant caching and infinite cache) when run against a trace of a web crawl that issued over one billion HTTP requests.
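Of the four practical caching policies the excerpt compares, LRU is the simplest to prototype. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: the class and method names (UrlLruCache, seenRecently) are hypothetical, and the cache merely keeps a bounded in-memory set of recently seen URL strings on top of a LinkedHashMap in access order, so that a hit avoids consulting the full "seen" set, which in a real crawler would live on disk or on a peer node.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Minimal LRU cache for the "have I seen this URL?" membership test.
 * A capacity of roughly 50,000 entries is the region the excerpt reports
 * as achieving a hit rate of almost 80% in its setup.
 */
public class UrlLruCache {
    private final Map<String, Boolean> cache;

    public UrlLruCache(int capacity) {
        // accessOrder = true makes iteration order least-recently-accessed first;
        // removeEldestEntry then evicts once the map grows past the capacity.
        this.cache = new LinkedHashMap<String, Boolean>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns true on a cache hit (URL recently seen); records the URL on a miss. */
    public boolean seenRecently(String url) {
        if (cache.containsKey(url)) {
            return true;               // hit: no need to consult the full URL set
        }
        cache.put(url, Boolean.TRUE);  // miss: the caller must still check the full set
        return false;
    }
}
```

On a miss the crawler would still consult the authoritative, possibly remote, URL set before fetching; the cache only filters out the repeated URLs that, per the excerpt's reported hit rates, make up most of the membership queries.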