
Digital Text Knowledge Mining: Construction and Application of a Chinese Thesaurus as an Example (Full Version)

2024-12-11 15:49

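If a document discusses a topic, it should mention certain specific strings several times.
• Objective and can be processed automatically
• The assumption is simple and applies across domains

Term association analysis: the new method [Tseng 2020]
• Step 1: term selection
  – Segment each document with a lexicon (longest-match-first).
  – Extract keywords with the keyword extraction algorithm (terms occurring at least 2 times, including new words).
  – Filter the extracted keywords with a stop-word list and sort them by term frequency.
  – Select the N most frequent terms for association analysis.
• Step 2: term association analysis
  – For the terms selected from each document, compute the weight wgt of each pair of terms:

$$\mathrm{wgt}(T_{ij}, T_{ik}) = \frac{2 \times NS(T_{ij} \cap T_{ik})}{NS(T_{ij}) + NS(T_{ik})} \times \ln(1.72 + NS_i)$$

    where $NS_i$ denotes the number of sentences in document $i$ and $NS(T_{ij})$ denotes the number of sentences in document $i$ in which term $T_j$ occurs.
  – Only term pairs whose weight exceeds the threshold have their weights accumulated.
  – The final similarity of a term pair is defined as:
    • Original method: simply sum the weights of each term pair

$$\mathrm{sim}(T_j, T_k) = \sum_{i=1}^{n} \mathrm{wgt}(T_{ij}, T_{ik})$$

    • New method: add IDF (inverse document frequency) and term length

$$\mathrm{sim}(T_j, T_k) = \frac{\log(w_k \times n / df_k)}{\log(n)} \times \sum_{i=1}^{n} \mathrm{wgt}(T_{ij}, T_{ik})$$

    where $n$ is the number of documents, $df_k$ is the document frequency of term $T_k$, and $w_k$ is its length.

Comparison of automatic keyword extraction methods:
• Lexicon matching: the lexicon needs continual maintenance and updating
• Statistical analysis: tends to miss terms with too little statistical evidence
• Grammatical parsing: requires a lexicon, part-of-speech tagging, and similar resources, plus the computation to use them
  – Fewer than 50% of noun phrases are suitable as keywords [Arppe 1995]

Automatic keyword extraction method [Tseng 97, 98, 99, 2020]
• An algorithm for finding maximally repeated patterns
• token: one Chinese character or one English word
• n-token: any n consecutive tokens in the input text (similar to an n-gram)
• The algorithm has three steps:
  Step 1: convert the input text into a list of 2-tokens
  Step 2: repeatedly merge n-tokens into (n+1)-tokens according to the merge rules, until no further merge is possible
  Step 3: filter out invalid terms according to the filtering rules

Example of the automatic keyword extraction process
• Input text: "BACDBCDABACD", assuming threshold = 1
• Step 1: generate L = (BA:2 AC:2 CD:3 DB:1 BC:1 CD:3 DA:1 AB:1 BA:2 AC:2 CD:3)
• Step 2: token merging
  First pass: merge L into L1 = (BAC:2 ACD:2 BAC:2 ACD:2)
    Dropped: (BA:2 AC:2 CD:3 DB:1 BC:1 DA:1 AB:1 BA:2 AC:2 CD:3)
    Kept: (CD:3)
  Second pass: merge L1 into L2 = (BACD:2 BACD:2)
    Dropped: (BAC:2 ACD:2 BAC:2 ACD:2)
    Kept: (CD:3)
  Third pass: merge L2 into L3 = ( )
    Dropped: ( )
    Kept: (CD:3 BACD:2)
• Step 3: no filtering needed

Automatic keyword extraction example [Tseng 2020]: English example
Web Document Clustering: A Feasibility Demonstration
Users of Web search engines are often forced to sift through the long ordered list of document returned by the engines.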
The IR community has explored document clustering as an alternative method of organizing retrieval results, but clustering has yet to be deployed on the major search engines. The paper articulates the unique requirements of Web document clustering and reports on the first evaluation of clustering methods in this domain. A key requirement is that the methods create their clusters based on the short snippets returned by Web search engines. Surprisingly, we find that clusters based on snippets are almost as good as clusters created using the full text of Web documents. To satisfy the stringent requirements of the Web domain, we introduce an incremental, linear time (in the document collection size) algorithm called Suffix Tree Clustering (STC), which creates clusters based on phrases shared between documents. We show that STC is faster than standard clustering methods in this domain, and argue that Web document clustering via STC is both feasible and potentially beneficial.

• Terms extracted before filtering
  1. clusters based on : 3
  2. document clustering : 3
  3. of Web : 3
  4. on the : 3
  5. search engines : 3
  6. STC is : 2
  7. Web document clustering : 2
  8. Web search engines : 2
  9. clustering methods in this domain : 2
  10. requirements of : 2
  11. returned by : 2

• Terms extracted after filtering
  1. clusters based : 3
  2. document clustering : 3
  3. Web : 3
  4. (nothing left after filtering)
  5. search engines : 3
  6. STC : 2
  7. Web document clustering : 2
  8. Web search engines : 2
  9. clustering methods in this domain : 2
  10. requirements : 2
  11. returned : 2

Automatic keyword extraction example [Tseng 2020]: Chinese example
Comparison of Three Metadata Related Standards
In this paper, we introduce three metadata-related standards: FGDC's Digital Geospatial Metadata, Dublin Core, and URC.
Among this autumn and winter's popular styles, boots of every kind lead the way! This autumn, shop windows along the streets of Tokyo are showing distinctive new styles never seen before, drawing in fashion-chasing men and women.
Consumers will be able to download music directly to a PC without having to buy a CD or a cassette tape. Cro
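The three-step extraction procedure described earlier (build 2-tokens, merge repeatedly, keep what can no longer grow) can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the function name `extract_repeated_patterns` is hypothetical, and the merge rule used here (two neighbouring n-tokens merge when they start at consecutive positions and both occur more often than the threshold) is an assumption chosen because it reproduces the "BACDBCDABACD" walk-through. Step 3 (filtering) is left out.

```python
from collections import Counter


def extract_repeated_patterns(tokens, threshold=1):
    """Return {pattern (tuple of tokens): frequency} for maximally repeated patterns.

    Assumed merge rule: neighbouring n-tokens are merged into an (n+1)-token
    when they start at consecutive positions in the input and both occur
    more than `threshold` times in the current list.
    """
    # Step 1: turn the input into a list of (start position, 2-token).
    current = [(i, tuple(tokens[i:i + 2])) for i in range(len(tokens) - 1)]
    kept = {}

    # Step 2: merge n-tokens into (n+1)-tokens until nothing can be merged.
    while current:
        freq = Counter(gram for _, gram in current)
        merged = []
        prev_merged = False  # was this element merged with its left neighbour?
        for idx, (start, gram) in enumerate(current):
            nxt = current[idx + 1] if idx + 1 < len(current) else None
            can_merge = (
                nxt is not None
                and nxt[0] == start + 1          # truly adjacent in the input
                and freq[gram] > threshold
                and freq[nxt[1]] > threshold
            )
            if can_merge:
                # Overlapping neighbours: append the last token of the right one.
                merged.append((start, gram + nxt[1][-1:]))
            elif freq[gram] > threshold and not prev_merged:
                # Frequent, but could not grow in either direction: keep it.
                kept[gram] = freq[gram]
            prev_merged = can_merge
        current = merged
    return kept


if __name__ == "__main__":
    result = extract_repeated_patterns(list("BACDBCDABACD"), threshold=1)
    for pattern, count in result.items():
        print("".join(pattern), count)
```

For Chinese text the tokens would be single characters, as in the example above; for English text they would be words, for instance `text.split()`.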
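The before/after lists in the English example suggest that the filtering step strips stop words from both ends of each extracted term ("clusters based on" becomes "clusters based", and "on the" disappears entirely). The slides do not spell out the filtering rules, so the sketch below is only one reading that is consistent with those lists. The function name `trim_stop_words` and the tiny stop-word set are assumptions made for illustration.

```python
# Illustrative stop-word list only; a real list would be far longer.
STOP_WORDS = {"on", "of", "the", "is", "by"}


def trim_stop_words(term, stop_words=STOP_WORDS):
    """Strip stop words from the start and end of a space-separated term.

    Stop words in the middle of a term are left alone.
    Returns an empty string if nothing but stop words remains.
    """
    words = term.split()
    while words and words[0].lower() in stop_words:
        words.pop(0)
    while words and words[-1].lower() in stop_words:
        words.pop()
    return " ".join(words)


if __name__ == "__main__":
    for term in ["clusters based on", "of Web", "on the", "Web search engines"]:
        print(repr(term), "->", repr(trim_stop_words(term)))
```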
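The two-step term association analysis described earlier can be sketched in the same way. The sketch assumes the pair weight is the Dice coefficient over sentence co-occurrence, scaled by ln(1.72 + NS_i), and that the final similarity multiplies the accumulated weight by log(w_k × n / df_k) / log(n). The function names, the input layout (each document is a list of sentences, each sentence a set of already selected terms), and the use of `len` for term length are all assumptions for illustration. The threshold is a required argument because the slides do not state its value.

```python
import math
from collections import defaultdict
from itertools import combinations


def pair_weights(sentences):
    """Weights wgt(T_ij, T_ik) for one document given as a list of term sets."""
    ns_i = len(sentences)                 # NS_i: sentences in the document
    ns = defaultdict(int)                 # NS(T_ij): sentences containing a term
    co = defaultdict(int)                 # NS(T_ij ∩ T_ik): sentences with both
    for sent in sentences:
        terms = sorted(set(sent))
        for t in terms:
            ns[t] += 1
        for pair in combinations(terms, 2):
            co[pair] += 1
    scale = math.log(1.72 + ns_i)
    return {(tj, tk): 2 * c / (ns[tj] + ns[tk]) * scale
            for (tj, tk), c in co.items()}


def term_similarities(documents, *, threshold, term_length=len):
    """sim(T_j, T_k) over all documents; note that it is not symmetric."""
    n = len(documents)
    if n < 2:
        raise ValueError("need at least two documents, since log(n) is a divisor")
    df = defaultdict(int)                 # document frequency of each term
    acc = defaultdict(float)              # accumulated weights above threshold
    for sentences in documents:
        for t in set().union(*sentences):
            df[t] += 1
        for pair, w in pair_weights(sentences).items():
            if w > threshold:
                acc[pair] += w
    sims = {}
    for (tj, tk), total in acc.items():
        for a, b in ((tj, tk), (tk, tj)):
            idf_len = math.log(term_length(b) * n / df[b]) / math.log(n)
            sims[(a, b)] = idf_len * total
    return sims


if __name__ == "__main__":
    docs = [
        [{"a", "b"}, {"a", "b"}, {"a"}],  # document 1: three sentences
        [{"a"}, {"c"}],                   # document 2: two sentences
    ]
    # threshold=1.0 is an arbitrary value chosen for this demonstration
    print(term_similarities(docs, threshold=1.0))
```

Because the IDF and length factor depends only on the second term, a term that appears in every document and has length 1 contributes a factor of zero, which is why `sim(b, a)` vanishes in the demonstration while `sim(a, b)` does not.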
