[Main Text]
…execution performance: when one instance of this data changes, every other copy must be updated accordingly. On the one hand, every URL accepted by the filter satisfies the user-defined filtering rules, so all of these URLs point to data sources the user actually wants to obtain, which guarantees the accuracy of data crawling. On the other hand, because the large number of non-qualifying URLs is discarded and only qualifying URLs are added to the crawl queue, a great deal of valuable memory is saved.

Data Acquisition Module
The data acquisition module is an extensible, news-oriented web crawler implemented in Java with the open-source HttpClient library. Following a breadth-first crawling strategy, the crawler performs both comprehensive targeted crawls of news data and periodic incremental crawls. The result of the operation is shown in Figure 5-2.

Figure 5-2: The "Establish Connection" operation

The "Start Crawling" operation
After obtaining a connection to the database, the user can start crawling by selecting a news category (subdivided into china, world, society, and mil); the data source and the crawl-limit options are controlled at the top of the interface. When the input date is "20140608", the crawl returns 69 domestic news items, 29 world items, 23 society items, and 12 military items, so military news is still the smallest category.

The rules that serve the testing objectives are as follows: testing is the process of executing a program with the intent of finding errors. White-box testing, also called "structural testing", treats the object under test as an open box: it allows testers to use the program's internal logical structure and related information to design or select test cases, exercise all of the program's logical paths, and check whether the program's behaviour meets its functional requirements.

7 Conclusion
With the rapid development of the Internet industry, news crawler systems are playing an increasingly important role and have become an effective tool for supplying data to hot-news analysis systems. Only after becoming familiar with these topics did I begin to build the news crawler system; second, this graduation project made me deeply appreciate that a solid programming foundation is a basic requirement for every outstanding programmer. I hereby express my sincere thanks to all the scholars whose work I have drawn on.

Appendix 1: English Original

ABSTRACT
Crawling the web is deceptively simple: the basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test. A crucial way to speed up the test is to cache, that is, to store in main memory a dynamic subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms (random replacement, static cache, LRU, and CLOCK) and theoretical limits (clairvoyant caching and infinite cache). We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33-day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective: in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective, while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.

1. INTRODUCTION
A recent Pew Foundation study [31] states that "Search engines have become an indispensable utility for Internet users" and estimates that, as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.
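The fetch-parse-enqueue cycle of steps (a)–(c) is the same breadth-first, filter-driven loop that the data acquisition module described in the main text implements with HttpClient. The following minimal Java sketch is a hypothetical reconstruction of such a loop, not the thesis's actual code: the class name, seed URL, filter rule, and crawl limit are invented, link extraction is reduced to a crude regular expression over absolute links, and the HTTP calls assume the Apache HttpClient 4.x API.

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

// Hypothetical sketch of the fetch-parse-enqueue loop: a FIFO queue gives
// breadth-first order, the in-memory "seen" set is the membership test, and
// urlFilter stands in for the user-defined filtering rules (here: one news section).
public class NewsCrawlerSketch {

    private static final Pattern LINK =
            Pattern.compile("href=[\"'](https?://[^\"'#]+)", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        Queue<String> frontier = new ArrayDeque<>();     // crawl queue (BFS order)
        Set<String> seen = new HashSet<>();              // set of already-seen URLs
        String seed = "http://news.example.com/china/";  // hypothetical seed URL
        frontier.add(seed);
        seen.add(seed);

        try (CloseableHttpClient http = HttpClients.createDefault()) {
            int fetched = 0;
            while (!frontier.isEmpty() && fetched < 100) {   // hypothetical crawl limit
                String url = frontier.poll();
                String html = EntityUtils.toString(           // (a) fetch a page
                        http.execute(new HttpGet(url)).getEntity());
                fetched++;

                Matcher m = LINK.matcher(html);               // (b) parse out linked URLs
                while (m.find()) {
                    String link = m.group(1);
                    // (c) enqueue only unseen URLs that also pass the filter rules
                    if (urlFilter(link) && seen.add(link)) {
                        frontier.add(link);
                    }
                }
            }
        }
    }

    // Placeholder for the user-defined filter rules described above.
    private static boolean urlFilter(String url) {
        return url.contains("news.example.com/china/");
    }
}

Using an ArrayDeque as a FIFO queue is what makes the traversal breadth-first; replacing it with a stack would turn the same loop into a depth-first crawl, and the seen set is exactly the membership test whose cost the appendix paper analyses.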
Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from non-web sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a)–(c). Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or of URLs collected during previous crawls. Sometimes crawls are started from a single well-connected page or a directory, but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon.

If we view web pages as nodes in a graph and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node, among the nodes not yet explored, to explore next. Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS); they are easy to implement and taught in many introductory algorithms classes. See for instance [34].

However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge because of the following two factors.

1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, historically, the web has doubled every 9–12 months.

2. Web pages are changing rapidly. If "change" means "any change", then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].

These two factors imply that to obtain a reasonably fresh and complete snapshot of the web, a search engine must crawl at least 100 million pages per day; step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally. A crucial way to speed up the membership test is to cache a dynamic subset of the "seen" URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques (random replacement, static cache, LRU, and CLOCK) and compared them against two theoretical limits (clairvoyant caching and infinite cache) when run against a trace of a web crawl that issued over one billion HTTP requests.
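The four practical caching policies compared above are standard replacement strategies rather than anything specific to this paper's implementation. As a rough, hypothetical illustration of one of them, the Java sketch below builds an LRU front-end for the seen-URL membership test on LinkedHashMap's access-order mode; the class and method names are invented for this example, and a miss would still have to be resolved against the full URL set held on disk or on a peer node.

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration: an in-memory LRU cache used as a fast front-end
// for the "seen URL" membership test. A hit short-circuits the expensive check;
// a miss must still be verified against the complete (disk- or peer-resident) set.
public class UrlLruCache {

    private final Map<String, Boolean> cache;

    public UrlLruCache(final int capacity) {
        // accessOrder = true gives LRU ordering; the eldest entry is evicted
        // once the map grows beyond the configured capacity.
        this.cache = new LinkedHashMap<String, Boolean>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns true on a cache hit; get() also refreshes the entry's recency. */
    public boolean seenRecently(String url) {
        return cache.get(url) != null;
    }

    /** Records a URL as seen, possibly evicting the least recently used entry. */
    public void markSeen(String url) {
        cache.put(url, Boolean.TRUE);
    }
}

CLOCK, as discussed in the paper, approximates this behaviour more cheaply: instead of reordering entries on every access, it keeps a circular buffer with a single "recently used" bit per entry and evicts the first entry whose bit the sweeping hand finds cleared.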