[Main text]
[...] directory EINet Galaxy (Tradewave Galaxy) went online. In addition to Web site search, it also supported Gopher and Telnet search.

In April 1994, two Stanford doctoral students, Chinese-American Jerry Yang and David Filo, co-founded Yahoo. As traffic and the number of recorded links grew, the Yahoo directory began to support simple database search, which significantly improved search efficiency. Because Yahoo!'s data was entered manually, it cannot really be classified as a search engine; it was in fact just a searchable directory. (Yahoo later used AltaVista, Inktomi, and Google in turn to provide its search engine services.)

In early 1994, Brian Pinkerton, a computer science student at the University of Washington, began his small project WebCrawler (see "Brian Pinkerton Announces the Availability of WebCrawler"). When WebCrawler officially debuted on April 20, 1994, it contained content from only 6,000 servers. WebCrawler was the Internet's [...] personal search preference settings. (HotBot was one of the most popular search engines in the following years and was later acquired by Lycos.)

The Northern Light company was founded in September 1995 in Cambridge, Massachusetts, and its search engine was officially launched in August 1997. It was once among the search engines with the largest databases. It did not use stop words, and it offered an excellent Current News service, a Special Collection of more than 7,100 publications, and a good advanced search syntax. It was also the first to support simple automatic classification of search results. (On January 16, 2002, Northern Light shut down its public search engine and was subsequently acquired by divine; however, through NLResearch, with "World Wide Web only" selected, the Northern Light search engine could still be used.)

Before October 1998, Google was just BackRub, a small project at Stanford University. Larry Page, who began his doctoral studies in 1995, started designing the search engine and registered the domain on September 15, 1997. By the end of 1997, with the joint participation of Sergey Brin, Scott Hassan, and Alan Steremberg, BackRub offered a demo.
In February 1999, Google completed the transition from its Alpha version to its Beta version. Google Inc. recognizes September 27, 1998 as its birthday. With innovations in PageRank, dynamic summaries, page snapshots, DailyRefresh, support for multiple document formats, integrated search for maps, stocks, and dictionaries, multi-language support, the user interface, and other features, Google, like AltaVista before it, once again forever changed the definition of a search engine. Before 2000, although Google's search accuracy had won acclaim, its database was smaller than those of other search engines and it lacked advanced search syntax, so it was not very widely used and did not spread quickly. Only after a database upgrade in mid-2000, and with the tailwind of being selected as Yahoo's search engine, did it take off.

Fast (AllTheWeb), founded in 1997, is one of Norway's [...]

AltaVista (starting in the summer of 2001, some users had to reach it through a proxy; without a proxy, AltaVista search via "radio qbseach" [sic] could display only the first page of search results) was a latecomer, making its public debut only in December 1995 (see "AltaVista Public Beta Press Release"). However, a large number of innovative features quickly took it to the top of the search engines of its time. AltaVista's most prominent advantage was its speed (see "search engine 9238: Comparison of edy" [sic]; AltaVista is said to have been designed purely to demonstrate the computing power of the DEC Alpha chip). AltaVista's other new features forever changed the definition of a search engine. AltaVista was the first search engine to support natural-language search and the first to implement advanced search syntax (such as AND, OR, and NOT). Users could use AltaVista to search newsgroups and the contents of articles retrieved from the Internet, and could search the text in picture names, titles, Java applets, and ActiveX objects.
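The AND, OR, and NOT operators mentioned above map naturally onto set operations over an inverted index (term to set of document IDs). The following is a minimal Java sketch under that assumption. It is not AltaVista's actual implementation, and the class name BooleanSearchSketch, its methods, and the sample documents are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only: not AltaVista's implementation.
class BooleanSearchSketch {
    // Inverted index: term -> sorted set of IDs of documents containing it.
    private final Map<String, Set<Integer>> index = new HashMap<>();
    // IDs of every document that has been indexed.
    private final Set<Integer> allDocs = new TreeSet<>();

    // Index one document by splitting its text on non-letter, non-digit characters.
    public void add(int docId, String text) {
        allDocs.add(docId);
        for (String term : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Documents containing the term (an empty set if the term is unknown).
    public Set<Integer> term(String t) {
        return new TreeSet<>(index.getOrDefault(t.toLowerCase(), new TreeSet<>()));
    }

    // a AND b: documents present in both sets (intersection).
    public static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // a OR b: documents present in either set (union).
    public static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // NOT a: all indexed documents that are not in the set (complement).
    public Set<Integer> not(Set<Integer> a) {
        Set<Integer> result = new TreeSet<>(allDocs);
        result.removeAll(a);
        return result;
    }

    public static void main(String[] args) {
        BooleanSearchSketch s = new BooleanSearchSketch();
        s.add(1, "Java applets on the Web");
        s.add(2, "ActiveX objects and Java");
        s.add(3, "Usenet newsgroups");
        // Query: java AND NOT activex
        System.out.println(and(s.term("java"), s.not(s.term("activex")))); // prints [1]
    }
}
```

Representing each posting list as a sorted set keeps the three operators to one library call each; a production engine would more likely merge sorted posting lists directly to avoid copying.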
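AltaVista also claims to be the first to support the user's [...] JumpStation, Oliver McBryan's World Wide Web Worm at the University of Colorado (see "First Mention of McBryan's [...]

[...] learned that the search engines we use require a very large information storage mechanism, and that the retrieved information must be ranked so that users can find higher-quality information as quickly as possible.

Testing can only find errors in a program; it cannot prove that a program contains no errors. Black-box testing treats the program as a black box and completely disregards its internal structure and processing.

[Figure: screen when indexing starts]
[Figure: indexing screen]

"Dedup: starting" indicates that duplicate pages are being removed; "merging indexes to: specialweb/index" indicates that the indexes are being merged.

[...], the Generator is invoked to generate a segment, Fetcher threads are started to actually download the pages, and CrawlDb update adds the links found in those pages to the database of pages waiting to be downloaded.

[...]( link: + urlLink );
} catch (ParserException e) { [...]();
[...](pageEncoding);
[...]ver=0&[...]
String dstfile = filename + [...];
import [...].*;
PreparedStatement pstmt = [...](sql);
[...]DatabaseName=SEARCH;
[...](2, power);
String power = [...](power);
[...]().newInstance();
[...](1, ManagerName);
[...]().newInstance();

[Figure: program flowchart for segmenting and filtering text]

The basic idea of the page-indexing program is to use document keywords as the index and to build linked lists organized by keyword, where each list is the set of documents that contain a particular keyword.

Database design

Database design is the process of designing the structure of a database and building it on a specific database management system according to the users' requirements.

[Figure: top-level data flow diagram]

The level-1 data flow diagram refines the analysis and processing of web pages: page information is extracted first, followed by a series of operations such as removing HTML tags, extracting links, and segmenting the text into words.

Once the page addresses of resources, or the directories and URLs of sites, have been found, Nutch can be used to download them in bulk; the downloaded page content is used for subsequent indexing and retrieval.

A subject-oriented search engine targets a specific domain and set of problems: it uses a web spider to collect relevant information automatically and build an index, providing users with useful information and related services.

The requirements analysis is described mainly with data flow diagrams and a data dictionary.

Depending on their structure and runtime environment, Java programs fall into two categories: Java Applications and Java Applets.

A JSP page consists of HTML code with Java code embedded in it.

There is also a library called MinGW, which can work together with the native Windows MSVCRT library (Windows API).

Tomcat is in effect an extension of the Apache server, but it runs independently: when you run Tomcat, it actually runs as a separate process, independent of Apache.

The Searcher mainly uses these indexes to look up the user's query keywords and produce the search results.

Nutch offers a different choice: compared with commercial search engines, Nutch, as an open-source search engine, is more transparent and therefore more trustworthy.

3. Database system: MS SQL Server 2000.

Well-known search engines such as Baidu, Google, and Yahoo are all general-purpose search engines; their pursuit of scale and breadth means they cannot serve the precise information needs of specific domains and particular groups of users.

Search engine technology is currently a subject of intense research and development in both the computer industry and academia.

This thesis first introduces the background and significance of developing a subject-oriented search engine system, analyzes the feasibility of the development, and briefly introduces the relevant theory involved. It then carries out requirements analysis, overall design, and detailed design to arrive at the main functions the system must implement, draws the system's functional module diagram, and uses program flowcharts to describe the processing in each module. Finally, it implements the system.

Keywords: search engine; Nutch; Tomcat; Cygwin

Subject-Oriented Search Engines
Author: ZhaoBei    Tutor: XunYaling

Abstract
Because the massive amount of information on the web is constantly changing, it has become difficult for search engines to provide users with high-quality, comprehensive, and timely information to update the s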