ubstring is indicated by the deepest fork node that has both `...$...39.
• int validUTF8ChineseWord (const char*)
• Functions
• ….

2022/8/14 COSCUP 2022, NTU

Comparison of suffix/KTS
• corpuss (25,000 lines, 1 MB)
  • PATTree CPU time: 00:07:37 (457 sec)
  • Suffix CPU time: 00:00:33 (33 sec)
• corpusm (250,000 lines, )
  • PATTree CPU time: ???
  • Suffix CPU time: 03:26:57 (12,417 sec)

Program Flow
1. Record the address of every valid word (UTF-8) as a pointer (char *)
   1. Move forward until EOF
2. Quicksort by word prefix

• 我要去上學校來上學
  • 我要去上學校來上學
  • 要去上學校來上學
  • 去上學校來上學
  • 上學校來上學
  • 學校來上學
  • 校來上學
  • 來上學
  • 上學
  • 學

Program Flow
1. Record the address of every valid word (UTF-8) as a pointer (char *)
   1. Move forward until EOF
2. Quicksort by word prefix
3. Compare every pair of adjacent words; if they share a common prefix:
   1. Get the WORD_FREQ index from lookupWordIndex()
   2. Insert into WORD_FREQ
4. Quicksort by frequency
5. Output

• 我要去上學校來上學
  • 上學
  • 上學校來上學
  • 來上學
  • 去上學校來上學
  • 學
  • 學校來上學
  • 我要去上學校來上學
  • 校來上學
  • 要去上學校來上學

Features
• Prefix/Postfix filtering (may increase processing time)
  • 李遠哲院 ,4
  • 李遠哲院長 ,4
  • 遠哲院長 ,4

Suffix (Array) Results

corpus  # of lines  UTF-8 size (kb)  # of words  Process time (sec)  Qsort time (sec)  Insert word time  Memory used
1MB         15,000          913,034      48,679                   4                                      6,736K (6M)
5MB        124,999        9,021,705     482,847                  30                                      55,588K (55M)
10MB       249,999       17,309,906     912,978                  62                                      103M
20MB       499,999       34,336,644   1,789,545                 147                                      203M
40MB       967,382       66,135,317   3,402,792                 342                                      391M

Future plan
• Applications
  • libtabe?
  • Any apps that need term extraction
• Algorithm
  • Memory usage
  • Faster execution time

Demo Time
• Thank you

Questions?
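
Sketch of the Program Flow steps

The first three Program Flow steps (record a char * for every word start, quicksort the pointers, compare adjacent entries for a common prefix) can be sketched in C as below. This is a minimal illustration under stated assumptions, not the speaker's code. All function names are hypothetical. is_utf8_start() stands in for validUTF8ChineseWord(), whose body the slides do not show, so here every UTF-8 character start counts as a word start. The WORD_FREQ table and lookupWordIndex() are also not shown in the slides, so the sketch simply prints each shared prefix instead of counting it.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: a "word start" is any byte that is not a UTF-8
   continuation byte (10xxxxxx). The talk's validUTF8ChineseWord()
   may be stricter. */
int is_utf8_start(unsigned char b) { return (b & 0xC0) != 0x80; }

/* qsort comparator: order suffix pointers by the bytes they point at. */
int cmp_suffix(const void *a, const void *b) {
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Step 1: record the address of every character start in text.
   Returns a malloc'ed array of *n pointers (caller frees), or NULL. */
const char **collect_suffixes(const char *text, size_t *n) {
    size_t len = strlen(text), count = 0;
    const char **sa = malloc((len ? len : 1) * sizeof *sa);
    if (sa == NULL) return NULL;
    for (const char *p = text; *p != '\0'; p++)
        if (is_utf8_start((unsigned char)*p)) sa[count++] = p;
    *n = count;
    return sa;
}

/* Step 2: quicksort the suffix pointers (byte order). */
void sort_suffixes(const char **sa, size_t n) {
    qsort(sa, n, sizeof *sa, cmp_suffix);
}

/* Step 3 helper: byte length of the prefix shared by a and b,
   trimmed back to a whole UTF-8 character boundary. */
size_t common_prefix_bytes(const char *a, const char *b) {
    size_t i = 0;
    while (a[i] != '\0' && a[i] == b[i]) i++;
    /* Back up if the match stopped inside a multi-byte character. */
    while (i > 0 && !is_utf8_start((unsigned char)a[i])) i--;
    return i;
}

/* Steps 3 and 5, simplified: print the prefix shared by each adjacent
   pair. The slides insert it into WORD_FREQ instead (not shown). */
void print_repeated(const char **sa, size_t n) {
    for (size_t i = 0; i + 1 < n; i++) {
        size_t len = common_prefix_bytes(sa[i], sa[i + 1]);
        if (len > 0) printf("%.*s\n", (int)len, sa[i]);
    }
}
```

Called on the slide's example string 我要去上學校來上學, collect_suffixes() followed by sort_suffixes() yields the same nine-entry ordering as the sorted list on the second Program Flow slide, and print_repeated() prints 上學 and 學. Storing pointers into the original buffer rather than copies of each suffix keeps the array at one pointer per character.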