information. The method for extracting semi-structured information is simple and effective: the algorithm matches the text against an attribute dictionary and then extracts the attribute values directly using simple rules. For unstructured information, this thesis proposes a rule-based algorithm. A dictionary of trigger words and a set of rules must be built for the extraction process. The trigger-word dictionary contains the basic person attributes and their trigger words, and hand-crafted rules are used to extract the attribute values.

Key words: information extraction, structuring, word segmentation, word frequency statistics, content extraction

Southwest Jiaotong University Master's Degree Thesis, Page IV

Contents

Abstract (in Chinese) .... I
Abstract .... II
Chapter 1 Introduction .... 1
    Project Background .... 1
    Purpose and Significance .... 1
    Analysis of the Current State of Research .... 1
    Main Research Content of This Thesis .... 3
Chapter 2 Collecting Person Web Page Data .... 4
    Introduction .... 4
    Crawler Overview .... 5
    Introduction to HttpClient .... 6
    Web Page Data Download .... 6
        Downloading Web Page Data in the Ordinary Way .... 6
        Downloading Web Page Data via a Proxy .... 7
        Downloading Dynamic Web Page Data .... 9
        Experimental Results .... 11
    Chapter Summary .... 11
Chapter 3 DOM-Based Extraction of Web Page Body Text .... 11
    Introduction .... 11
    Introduction to DOM .... 12
    HTML Parser .... 13
    DOM-Based Body Text Extraction Method .... 14
        Analysis of the Principle .... 14
        Description of the Algorithm Procedure .... 15
    Experimental Results .... 15
    Chapter Summary .... 16
Chapter 4 Word Segmentation of Web Page Body Text .... 17
    Introduction .... 17
    Introduction to the Word Segmentation System .... 18
    Organization Name Recognition .... 19
        Structure of Organization Names .... 19
        Word Frequency Statistics for the Constituent Words of Organization Names .... 19
        Sorting Words by Frequency Count .... 20
        Compiling Organization Suffix Words .... 20
        Building the Organization Name Dictionary .... 21
        Computing Organization Word Frequencies .... 21
        Organization Name Recognition Method .... 22
        Algorithm Description .... 22
    Experimental Results .... 23
        Organization Name Recognition Experiment .... 23
        Body Text Word Segmentation Experiment .... 24
    Chapter Summary .... 25
Chapter 5 Structuring Person Information .... 25
    Introduction .... 25
    Types of Person Information Structure .... 26
    Semi-structured Person Information Extraction .... 28
        Semantic Similarity Based on HowNet .... 28
        Attribute