Xihua University Graduation Project Report

Abstract

With the rapid growth of information, search engines have become the primary tool people use to find information. In Chinese-language search, domestic search engines now perform about as well as foreign ones. One important reason for this is that English and Chinese are written differently, and the computing technique this difference calls for is Chinese word segmentation.

The main goal of this design is to take the web pages fetched by the crawler, split their content into individual terms using a word segmentation technique, and store the terms locally for later retrieval. The segmentation algorithm is a mechanical, dictionary-based method: it matches the Chinese sentence under analysis against the dictionary entries using forward maximum matching, and so splits the sentence into words.

With this segmentation software, Chinese sentences can be split into words automatically, fairly accurately, and quickly. Combining the forward maximum matching method with the reverse maximum matching method also splits sentences fairly correctly into the required terms.

Keywords: Chinese word segmentation; dictionary

English Abstract

With the rapid growth of information, search engines have become the preferred tool for finding information. In the field of Chinese search, domestic search engines are now close to foreign ones in effectiveness. One important reason for this situation is that Chinese and English are written differently, and the computing technique involved is Chinese word segmentation.

This design implements a Chinese word segmentation component. It analyzes Chinese sentences and splits them into terms, and it is applied in a search engine to enable Chinese-language search. The system's algorithm is a mechanical, dictionary-based segmentation method, which matches Chinese phrases against the dictionary entries using a forward maximum matching strategy and so splits a Chinese sentence into words.

With this segmentation component, Chinese sentences can be split into words automatically, fairly accurately, and quickly. Combining the forward maximum matching method with the reverse maximum matching method also splits sentences into the correct entries.

Keywords: Chinese word segmentation; dictionary

Contents

Preface
1 FTP Search Engine Crawler Module Overview
    Design approach
    Design steps
        Scanning sites
        Fetching data
        Classifying data
        Generating source files
        Generating the site list
        Building index files
2 FTP Search Engine Outline Design
    Working principle
    Workflow diagram
3 FTP Search Engine Crawler Module Detailed Design
    Design objective
    Functional module design
        Network segment scanning
        Fetching data
        Resolving encoding problems
        Server compatibility
        Generating data files
        Generating the site list
4 FTP Search Engine Index Module Detailed Design
    Formatting data
    Aggregating attribute files
    Building the two-letter index
    Index database
    Character encoding
5 Development Environment and Conclusions
    Hardware environment
    Software environment
    Runtime environment
    Run results
    Problems and shortcomings
Conclusion
Acknowledgments
References
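
The abstract describes dictionary-based mechanical segmentation with forward maximum matching, optionally combined with reverse maximum matching. The following is only a minimal sketch of that general technique, not the implementation used in this project. The function names `forward_max_match` and `backward_max_match`, the `max_len` parameter, and the tiny `DEMO_DICT` dictionary are all assumptions made for illustration.

```python
# Illustrative sketch of dictionary-based maximum matching segmentation.
# All names below (DEMO_DICT, forward_max_match, backward_max_match) are
# hypothetical and are not taken from the thesis.

DEMO_DICT = {"研究", "研究生", "生命", "命", "的", "起源"}


def forward_max_match(sentence, dictionary, max_len=None):
    """Scan left to right, always taking the longest dictionary word."""
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first and shrink until one matches.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def backward_max_match(sentence, dictionary, max_len=None):
    """Scan right to left, always taking the longest dictionary word."""
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    words = []
    j = len(sentence)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = sentence[j - size:j]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                j -= size
                break
    words.reverse()  # words were collected from the end of the sentence
    return words


if __name__ == "__main__":
    text = "研究生命的起源"
    print(forward_max_match(text, DEMO_DICT))
    print(backward_max_match(text, DEMO_DICT))
```

On this sample sentence the two directions disagree: the forward scan commits early to the three-character word "研究生", while the backward scan keeps "生命" intact. Comparing the two results is one common way to detect ambiguous spans, which may be why the abstract mentions combining both methods.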