后期系統(tǒng)的使用過程中難免會出現(xiàn)新的問題,系統(tǒng)也將在不斷的調(diào)整和維護中日趨完善。 7 總結(jié)在互聯(lián)網(wǎng)產(chǎn)業(yè)高速發(fā)展的今天,新聞爬蟲系統(tǒng)正在扮演著越來越重要的角色,成為新聞熱點分析系統(tǒng)數(shù)據(jù)來源的有效工具。本次設(shè)計開發(fā)的新聞爬蟲系統(tǒng),目前還存在著許多方面的問題,例如:數(shù)據(jù)源的廣度不夠,目前主要的數(shù)據(jù)來源都是新浪新聞;系統(tǒng)功能較為單一,目前主要的功能只有新聞爬取并以不同的可視化形式來呈現(xiàn)新聞內(nèi)容以及新聞分析結(jié)果;對于新聞內(nèi)容的爬取也不是很完善,只是簡單的抽取了一些文本信息和圖片信息,對于音頻和視頻并不能很好地抽取下來。3) 將整個系統(tǒng)打造成一個多數(shù)據(jù)源、功能強大的智能新聞爬取分析平臺。當(dāng)對于這個項目所需要的知識和技術(shù)比較了解的時候才可以著手做。當(dāng)對了這些比較熟悉后才開始著手做新聞爬蟲系統(tǒng);其次,通過這次的畢業(yè)設(shè)計深刻的體會到扎實的編程功底是每一個出色的程序員都必須具備的基礎(chǔ)條件。在這個由面向?qū)ο蟮某绦蛟O(shè)計思想主導(dǎo)軟件行業(yè)的時代,只有學(xué)好一門面向?qū)ο蟮某绦蛟O(shè)計語言(如C、JAVA以及ActionScript)方能保持自己在激烈的社會競爭中立于不敗之地;最后,充分利用一些開源工具能夠在很大程度上提升編程效率,例如,爬蟲采集部分我使用HttpClient來進(jìn)行服務(wù)器訪問和響應(yīng)處理,信息處理部分我借助HTMLParser進(jìn)行文本抽取,數(shù)據(jù)可視化部分該系統(tǒng)使用了ExtJS開源AJAX框架結(jié)合Google Visualization API進(jìn)行數(shù)據(jù)顯示,這些開源工具用起來簡捷方便,幫助我在較短的時間實現(xiàn)了所需要的功能。在今后的學(xué)習(xí)過程中,我會繼續(xù)嚴(yán)格要求自己,以做產(chǎn)品的態(tài)度來對待所有的工作,力求完美。在此向幫助過我的老師學(xué)長學(xué)姐表示衷心的感謝。在此向各位學(xué)者表示衷心的感謝附錄1 英文原文ABSTRACT Crawling the web is deceptively simple: the basic algorithm i(a)Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before repeat (a)–(c).However the size of the web estimated at over 4 billion pages and its rate of change (estimated at 7 per week )move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed these two factors alone imply that for a reasonably fresh and plete crawl of the web step a must be executed about a thousand times per second and thus the membership test c must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture which further plicates the membership test. A crucial way to speed up the test is to cache that is to store in main memory adynamic subset of the “seen” URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement static cache LRU and CLOCK and theoretical limits: clairvoyant caching and infinite cache. We performed about 1800simulations using these algorithms with various cache sizes using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective – in our setup a cache of roughly 50000 entries can achieve a hit rate of almost 80. Interestingly this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.1. INTRODUCTIONA recent Pew Foundation study 31 states that “Search engines have bee an indispensable utility for Internet users” and estimates that as of mid2002 slightly over 50% of all Americans have used web search to find information. Hence the technology that powers web search is of enormous practical interest. In this paper we concentrate on one aspect of the search technology namely the process of collecting web pages that eventually constitute the search engine corpus. Search engines collect pages in many ways among them direct URL submission paid inclusion and URL extraction from non web sources but the bulk of the corpus is obtained by recursively exploring the web a process known as crawling or SPIDERing. 
The basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a)–(c). Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or of URLs collected during previous crawls. Sometimes crawls are started from a single well-connected page or a directory, but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon.

If we view web pages as nodes in a graph and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node, among the nodes not yet explored, to explore next. Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS); they are easy to implement and taught in many introductory algorithms classes. See, for instance, [34].

However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge because of the following two factors.

1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, historically, the web has doubled every 9-12 months.

2. Web pages are changing rapidly. If "change" means "any change", then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].

These two factors imply that, to obtain a reasonably fresh and complete snapshot of the web, a search engine must crawl at least 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally.

A crucial way to speed up the membership test is to cache a dynamic subset of the "seen" URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques (random replacement, static cache, LRU, and CLOCK) and compared them against two theoretical limits (clairvoyant caching and infinite cache) when run against a trace of a web crawl that issued over one billion HTTP requests.
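Of the four practical caching policies the excerpt compares, LRU is the simplest to prototype. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: the class and method names (UrlLruCache, seenRecently) are hypothetical, and the cache merely keeps a bounded in-memory set of recently seen URL strings on top of a LinkedHashMap in access order, so that a hit avoids consulting the full "seen" set, which in a real crawler would live on disk or on a peer node.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Minimal LRU cache for the "have I seen this URL?" membership test.
 * A capacity of roughly 50,000 entries is the region the excerpt reports
 * as achieving a hit rate of almost 80% in its setup.
 */
public class UrlLruCache {
    private final Map<String, Boolean> cache;

    public UrlLruCache(int capacity) {
        // accessOrder = true makes iteration order least-recently-accessed first;
        // removeEldestEntry then evicts once the map grows past the capacity.
        this.cache = new LinkedHashMap<String, Boolean>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns true on a cache hit (URL recently seen); records the URL on a miss. */
    public boolean seenRecently(String url) {
        if (cache.containsKey(url)) {
            return true;               // hit: no need to consult the full URL set
        }
        cache.put(url, Boolean.TRUE);  // miss: the caller must still check the full set
        return false;
    }
}
```

On a miss the crawler would still consult the authoritative, possibly remote, URL set before fetching; the cache only filters out the repeated URLs that, per the excerpt's reported hit rates, make up most of the membership queries.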