[Article Summary]
.java, Tokens, NutchAnalysis.java, NutchAnalysisConstants.java, Token.java

Word segmentation

Create a word segmentation system that:
- Can handle large-scale data (90G; ICTCLAS fails on this)
- Can recognize more new words (adaptive to domains)
- Can disambiguate based on context
- Favors information retrieval and feature selection

Word segmentation: BUAASEISEG

Word segmentation (cont.)

| No. | Chinese characters | Words | New words (excluding new words recognized identically by both systems) | BUAASEISEG accuracy (%) | ICTCLAS accuracy (%) |
|---|---|---|---|---|---|
| 1 | 467 | 218 | 14 | | |
| 2 | 514 | 267 | 8 | | |
| 3 | 859 | 383 | 8 | | |
| 4 | 598 | 306 | 5 | | |
| 5 | 538 | 216 | 19 | | |
| 6 | 3,926 | 2,097 | 200 | | |
| 7 | 5,239 | 2,407 | 313 | | |
| 8 | 4,003 | 1,923 | 246 | | |
| 9 | 2,309 | 1,423 | 51 | | |
| News | 2,976 | 1,390 | 54 | | |
| Papers | 15,477 | 7,850 | 810 | | |
| Overall | 18,453 | 9