

Text Mining Lecture Slides (textmining)


…ons (grunts, shouts, etc.)?

Top Performing Systems
- Currently the best performing systems at TREC can answer approximately 60–80% of the questions, a pretty amazing performance!
- Approaches and successes have varied a fair deal:
  - Knowledge-rich approaches, using a vast array of NLP techniques, stole the show in the early TREC evaluations, notably Harabagiu, Moldovan et al. (SMU/UTD/LCC).
  - The AskMSR system stressed how much could be achieved by very simple methods with enough text (it now has various copycats).
  - A middle ground is to use a large collection of surface matching patterns (ISI).

AskMSR
- "Web Question Answering: Is More Always Better?" Dumais, Banko, Brill, Lin, Ng (Microsoft, MIT, Berkeley).
- Q: "Where is the Louvre located?"
- We want "Paris" or "France" or "75058 Paris Cedex 01" or a map; we don't just want URLs.

AskMSR: Shallow approach
- "In what year did Abraham Lincoln die?"
- Ignore hard documents and find easy ones.

AskMSR: Details
- The system is a pipeline of five steps (1–5), walked through below; code sketches for the individual steps follow the walkthrough.

Step 1: Rewrite queries
- Intuition: the user's question is often syntactically quite close to sentences that contain the answer.
  - "Where is the Louvre Museum located?" / "The Louvre Museum is located in Paris."
  - "Who created the character of Scrooge?" / "Charles Dickens created the character of Scrooge."

Query rewriting
- Classify the question into seven categories:
  - Who is/was/are/were…?
  - When is/did/will/are/were…?
  - Where is/are/were…?
  - …
- a. Category-specific transformation rules, e.g. "For Where questions, move 'is' to all possible locations":
  "Where is the Louvre Museum located" becomes
  "is the Louvre Museum located",
  "the is Louvre Museum located",
  "the Louvre is Museum located",
  "the Louvre Museum is located",
  "the Louvre Museum located is".
- b. Expected answer "datatype" (e.g. Date, Person, Location, …): "When was the French Revolution?" maps to DATE.
- The classification/rewrite/datatype rules are hand-crafted. (Could they be automatically learned?)
- Some rewrites are nonsense, but who cares? It's only a few more queries to Google.

Query rewriting: weights
- One wrinkle: some query rewrites are more reliable than others. For "Where is the Louvre Museum located?":
  - +"the Louvre Museum is located" (weight 5): if we get a match, it's probably right.
  - +Louvre +Museum +located (weight 1): lots of non-answers could come back too.

Step 2: Query the search engine
- Send all rewrites to a Web search engine.
- Retrieve the top N answers (100?).
- For speed, rely just on the search engine's "snippets", not the full text of the actual documents.

Step 3: Mining N-grams
- Unigram, bigram, trigram, … An N-gram is a list of N adjacent terms in a sequence.
- E.g., for "Web Question Answering: Is More Always Better":
  - Unigrams: Web, Question, Answering, Is, More, Always, Better
  - Bigrams: Web Question, Question Answering, Answering Is, Is More, More Always, Always Better
  - Trigrams: Web Question Answering, Question Answering Is, Answering Is More, Is More Always, More Always Better

Mining N-grams
- Simple: enumerate all N-grams (N = 1, 2, 3, say) in all retrieved snippets.
- Use a hash table and other fancy footwork to make this efficient.
- Weight of an N-gram: its occurrence count, with each occurrence weighted by the "reliability" (weight) of the rewrite that fetched the document.
- Example, "Who created the character of Scrooge?": Dickens 117, Christmas Carol 78, Charles Dickens 75, Disney 72, Carl Banks 54, A Christmas 41, Christmas Carol 45, Uncle 31.

Step 4: Filtering N-grams
- Each question type is associated with one or more "datatype filters" = regular expressions: When… maps to Date, Where… to Location, Who… to Person, What…, etc.
- Boost the score of N-grams that do match the regexp; lower the score of N-grams that don't.
- Details omitted from the paper….

Step 5: Tiling the Answers
- Take the highest-scoring N-gram and tile it with overlapping candidates, merging their scores and discarding the old N-grams; repeat until no more overlap remains.
- Example: Dickens (score 20), Charles Dickens (15), and Mr Charles (10) tile into "Mr Charles Dickens" with score 45.
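A minimal Python sketch of the Step 1 rewriting rule for Where-questions, assuming the "move 'is' to all possible positions" transformation and the weight-5 / weight-1 scheme described above; the function name and exact weights are illustrative, not the paper's actual rules:

```python
import re

def rewrite_where_question(question):
    """Generate (query, weight) rewrites for a 'Where is X ...?' question by
    moving 'is' to every possible position.  Exact-phrase rewrites get a high
    weight; the plain AND-of-keywords fallback gets a low weight.
    (Illustrative rule and weights, not AskMSR's actual values.)"""
    m = re.match(r'[Ww]here is (.+?)\??$', question.strip())
    if not m:
        return []
    words = m.group(1).split()   # e.g. ['the', 'Louvre', 'Museum', 'located']
    rewrites = []
    for i in range(len(words) + 1):
        candidate = words[:i] + ['is'] + words[i:]
        rewrites.append(('"' + ' '.join(candidate) + '"', 5))  # exact phrase, weight 5
    rewrites.append((' '.join('+' + w for w in words), 1))     # bag of words, weight 1
    return rewrites

for query, weight in rewrite_where_question("Where is the Louvre Museum located?"):
    print(weight, query)
```

Running it on "Where is the Louvre Museum located?" prints the five phrase rewrites listed above plus the low-weight +Louvre +Museum +located fallback.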
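Steps 3 and 4 can be sketched together. The snippet below assumes each retrieved snippet arrives paired with the weight of the rewrite that fetched it, and the two regular expressions are toy stand-ins for the hand-built datatype filters, not the filters actually used by AskMSR:

```python
from collections import Counter
import re

def mine_ngrams(snippets_with_weights, max_n=3):
    """Count 1-, 2- and 3-grams over all snippets; each occurrence is
    weighted by the reliability weight of the rewrite that fetched it."""
    scores = Counter()
    for snippet, weight in snippets_with_weights:
        tokens = snippet.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                scores[' '.join(tokens[i:i + n])] += weight
    return scores

# Toy datatype filters: question type -> regular expression over candidate answers.
DATATYPE_FILTERS = {
    'When': re.compile(r'\b(1[0-9]{3}|20[0-9]{2})\b'),    # something that looks like a year
    'Who':  re.compile(r'^[A-Z][a-z]+( [A-Z][a-z]+)*$'),  # capitalised, name-like strings
}

def filter_ngrams(scores, qtype):
    """Boost n-grams matching the filter for this question type, penalise the rest."""
    pattern = DATATYPE_FILTERS.get(qtype)
    if pattern is None:
        return scores
    adjusted = Counter()
    for ngram, score in scores.items():
        adjusted[ngram] = score * 2 if pattern.search(ngram) else score * 0.5
    return adjusted

snippets = [("Charles Dickens created the character of Scrooge", 5),
            ("Scrooge appears in A Christmas Carol by Dickens", 1)]
print(filter_ngrams(mine_ngrams(snippets), 'Who').most_common(5))
```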
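And a rough sketch of the Step 5 tiling procedure, using greedy word-level overlap merging to reproduce the Dickens example; the real system's scoring and stopping details are simplified:

```python
def overlap_merge(a, b):
    """If the end of a overlaps the start of b (word level), return the tiled string, else None."""
    wa, wb = a.split(), b.split()
    for k in range(min(len(wa), len(wb)), 0, -1):
        if wa[-k:] == wb[:k]:
            return ' '.join(wa + wb[k:])
    return None

def tile_answers(candidates):
    """candidates: dict of n-gram -> score.  Repeatedly take the highest-scoring
    n-gram, tile it with an overlapping candidate, sum their scores, and discard
    the old n-grams, until no more overlaps are found."""
    cands = dict(candidates)
    merged = True
    while merged:
        merged = False
        best = max(cands, key=cands.get)
        for other in list(cands):
            if other == best:
                continue
            tiled = overlap_merge(best, other) or overlap_merge(other, best)
            if tiled:
                score = cands.pop(best) + cands.pop(other)
                cands[tiled] = score          # keep the tile, drop the pieces
                merged = True
                break
    return max(cands.items(), key=lambda kv: kv[1])

print(tile_answers({'Dickens': 20, 'Charles Dickens': 15, 'Mr Charles': 10}))
```

On the slide's three candidates this returns ('Mr Charles Dickens', 45).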
Results
- Standard TREC contest testbed: ~1M documents.

Topic Detection and Tracking: example topics
- …policy changes due to the crash (new runway lights were installed at airports).
- Euro Introduced. On topic: stories about the preparation for the common currency (negotiations about exchange rates and financial standards to be shared among the member nations).

Text mining is the process of compiling, organizing, and analyzing large document collections to support the delivery of targeted types of information to analysts and decision makers, and to discover relationships between related facts that span wide domains of inquiry.

True Text Data Mining: Don Swanson's Medical Work
- Given: medical titles and abstracts, a problem (an incurable rare disease), and some medical expertise,
- find causal links among titles: symptoms, drugs, results.
- E.g.: magnesium deficiency related to migraine.
- This was found by extracting features from the medical literature on migraines and nutrition (a toy sketch of this idea appears at the end of this section).

Swanson Example (1991)
- Problem: migraine headaches (M).
- Stress is associated with migraines.
- High levels of magnesium inhibit SCD (spreading cortical depression).

Outline of Today
- Introduction
- Lexicon construction
- Topic Detection and Tracking
- Summarization
- Question Answering

Data Mining: Market Basket Analysis
- 80% of the people who buy milk also buy bread.
- On Fridays, 70% of the men who bought diapers also bought beer.
- What is the relationship between diapers and beer? Walmart could trace the reason after doing a small survey! (A toy confidence computation appears at the end of this section.)

The business opportunity in text mining?
- [Chart: data volume vs. market capitalization for unstructured vs. structured data, labeled "Corporate Knowledge 'Ore'"]
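Swanson's magnesium–migraine discovery came from connecting two otherwise unrelated literatures through shared intermediate terms. The sketch below is a toy version of that "bridging term" idea, not Swanson's actual procedure, and the example titles are invented:

```python
def bridging_terms(literature_a, literature_c, stopwords=frozenset()):
    """Find candidate 'B' terms that appear both in titles about problem A
    (e.g. migraine) and in titles from a second literature C (e.g. nutrition).
    Shared terms such as 'magnesium' are candidate hidden links to show an expert."""
    terms_a = {w.lower() for title in literature_a for w in title.split()} - stopwords
    terms_c = {w.lower() for title in literature_c for w in title.split()} - stopwords
    return terms_a & terms_c

migraine_titles = ["Stress and migraine attacks", "Magnesium deficiency in migraine patients"]
nutrition_titles = ["Dietary magnesium and stress response", "Calcium and magnesium intake"]
print(bridging_terms(migraine_titles, nutrition_titles, stopwords={'and', 'in', 'the'}))
# -> {'magnesium', 'stress'} in this invented example
```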
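Statements such as "80% of the people who buy milk also buy bread" are rule confidences from market basket analysis. A toy computation over invented baskets (the numbers here are illustrative, not Walmart's data):

```python
def confidence(transactions, lhs, rhs):
    """confidence(lhs -> rhs) = P(rhs | lhs) = count(lhs and rhs together) / count(lhs)."""
    lhs_count = sum(1 for t in transactions if lhs <= t)
    both_count = sum(1 for t in transactions if (lhs | rhs) <= t)
    return both_count / lhs_count if lhs_count else 0.0

baskets = [{'milk', 'bread'}, {'milk', 'bread', 'beer'}, {'milk', 'eggs'},
           {'diapers', 'beer'}, {'diapers', 'beer', 'bread'}]
print(confidence(baskets, {'milk'}, {'bread'}))     # 2/3 of milk baskets also contain bread
print(confidence(baskets, {'diapers'}, {'beer'}))   # all diaper baskets also contain beer
```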