

Design and Implementation of a News Crawler System (Graduation Thesis)

 

Test item: data visualization module – bar chart of news subtype analysis
Preconditions: the Tomcat server is running
Input: the date to be queried
Steps: (1) enter the date; (2) click the "Search" button
Expected output: (1) news subtype analysis table; (2) news subtype analysis bar chart

Table 6-7 Test Case 7
Test case ID: Testing_DayTime
Test item: data visualization module – line chart of news volume over one day
Preconditions: the Tomcat server is running
Input: the date to be queried
Steps: (1) enter the date; (2) click the "Search" button
Expected output: line chart of news volume over time

Table 6-8 Test Case 8
Test case ID: Testing_yearTime
Test item: data visualization module – line chart of news volume over one year
Preconditions: the Tomcat server is running
Input: the date to be queried
Steps: (1) enter the date; (2) click the "Search" button
Expected output: (1) line chart of news volume by month; (2) analysis table of news volume by month

Test Results
After the software testing steps have been executed, the testing activity as a whole is not yet over; analysing the test results is the most important part. A detailed analysis and summary of the results provides valuable guidance for the next round of testing.

7 Conclusion
Today, with the Internet industry developing at high speed, news crawler systems are playing an increasingly important role and have become an effective data source for news hotspot analysis systems.
3) Build the whole system into a powerful, multi-source intelligent platform for news crawling and analysis.
Only after becoming reasonably familiar with these technologies did I start building the news crawler system. Secondly, this graduation project made me deeply aware that a solid programming foundation is a basic requirement for every good programmer. In my future studies I will continue to hold myself to strict standards, approach all my work with the attitude of building a product, and strive for perfection. I hereby express my sincere thanks to all of these scholars.

Appendix 1 English Original
ABSTRACT
Crawling the web is deceptively simple: the basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test in (c) must be done well over ten thousand times per second, against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test. A crucial way to speed up the test is to cache, that is, to store in main memory a dynamic subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms (random replacement, static cache, LRU, and CLOCK) and theoretical limits (clairvoyant caching and infinite cache). We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33-day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective: in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective, while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.

1. INTRODUCTION
A recent Pew Foundation study [31] states that "search engines have become an indispensable utility for Internet users" and estimates that, as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence the technology that powers web search is of enormous practical interest. In this paper we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.
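To make the three-step loop described in the abstract concrete, the following is a minimal sketch (not the paper's or the thesis's implementation) of a breadth-first crawl loop in Java, using an in-memory set as the "not seen before" membership test. The class name, the regex-based link extraction, and the page limit are illustrative assumptions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Minimal sketch of the basic crawl loop: fetch, parse links, enqueue unseen URLs. */
public class BasicCrawler {
    // Naive link extraction; a real crawler would use an HTML parser and resolve relative URLs.
    private static final Pattern LINK = Pattern.compile("href=\"(https?://[^\"]+)\"");

    public static void crawl(String seedUrl, int maxPages) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        Set<String> seen = new HashSet<>();          // membership test: URLs already scheduled
        Queue<String> frontier = new ArrayDeque<>(); // BFS frontier of URLs still to fetch
        seen.add(seedUrl);
        frontier.add(seedUrl);

        int fetched = 0;
        while (!frontier.isEmpty() && fetched < maxPages) {
            String url = frontier.poll();
            // (a) Fetch the page.
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            fetched++;

            // (b) Parse it to extract all linked URLs.
            Matcher m = LINK.matcher(body);
            while (m.find()) {
                String link = m.group(1);
                // (c) Schedule only URLs not seen before, then repeat (a)-(c).
                if (seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
    }
}
```

The HashSet here is exactly the "seen" structure that, at web scale, no longer fits in main memory; that is what motivates the caching study reproduced in this appendix.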
Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from non-web sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a)–(c). Crawling typically starts from a set of seed URLs made up of URLs obtained by other means as described above and/or of URLs collected during previous crawls. Sometimes crawls are started from a single well-connected page or a directory, but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon. If we view web pages as nodes in a graph and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node, among the nodes not yet explored, to explore next. Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS); they are easy to implement and taught in many introductory algorithms classes. See for instance [34].

However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge, because of the following two factors.
1. The web is very large. Currently Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that historically the web has doubled every 9-12 months.
2. Web pages are changing rapidly. If "change" means "any change", then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].

These two factors imply that to obtain a reasonably fresh and complete snapshot of the web, a search engine must crawl at least 100 million pages per day; thus step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally. A crucial way to speed up the membership test is to cache a dynamic subset of the "seen" URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques (random replacement, static cache, LRU, and CLOCK) and compared them against two theoretical limits (clairvoyant caching and infinite cache) when run against a trace of a web crawl that issued over one billion HTTP requests.
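The excerpt compares practical caching policies (random replacement, static cache, LRU, and CLOCK) for this membership test. As an illustration only, below is a minimal LRU cache of recently seen URLs built on Java's LinkedHashMap in access order; the class and method names are assumptions, not code from the paper or the thesis. A hit means the URL was encountered recently, so the expensive global check (a disk lookup or a query to a peer crawler node) can be skipped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Minimal sketch of an LRU cache over recently seen URLs, one of the practical
 * replacement policies compared in the paper. A hit avoids the slow global
 * membership test; a miss records the URL and may evict the least recently used one.
 */
public class LruUrlCache {
    private final Map<String, Boolean> cache;

    public LruUrlCache(int capacity) {
        // accessOrder = true makes iteration order follow recency of access,
        // so the eldest entry is the least recently used URL.
        this.cache = new LinkedHashMap<String, Boolean>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns true on a hit; on a miss, records the URL as recently seen. */
    public boolean checkAndAdd(String url) {
        boolean hit = cache.get(url) != null;  // get() refreshes recency in access-order mode
        if (!hit) {
            cache.put(url, Boolean.TRUE);      // may evict the least recently used URL
        }
        return hit;
    }
}
```

Per the abstract, a cache of roughly 50,000 entries achieved a hit rate of almost 80% in the authors' experiments, so a capacity in that range would be a natural starting point for a sketch like this.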