ment yes/no?
- Example: Headline, scientific abstract

Why Automatic Summarization?
- Algorithm for reading in many domains is:
  - read summary
  - decide whether relevant or not
  - if relevant: read whole document
- Summary is gatekeeper for large number of documents.
- Information overload
  - Often the summary is all that is read.
- Human-generated summaries are expensive.

Summary Length (Reuters)
- Goldstein et al. 1999

Summarization Algorithms
- Keyword summaries
  - Display most significant keywords
  - Easy to do
  - Hard to read, poor representation of content
- Sentence extraction
  - Extract key sentences
  - Medium hard
  - Summaries often don't read well
  - Good representation of content
- Natural language understanding / generation
  - Build knowledge representation of text
  - Generate sentences summarizing content
  - Hard to do well
- Something between the last two methods?

Sentence Extraction
- Represent each sentence as a feature vector
- Compute score based on features
- Select n highest-ranking sentences
- Present in order in which they occur in text.
- Postprocessing to make summary more readable/concise
  - Eliminate redundant sentences
  - Anaphors/pronouns
  - Delete subordinate clauses, parentheticals

Sentence Extraction: Example
- SIGIR '95 paper on summarization by Kupiec, Pedersen, Chen
- Trainable sentence extraction
- Proposed algorithm is applied to its own description (the paper)

Feature Representation
- Fixed-phrase feature
  - Certain phrases indicate summary, e.g. "in summary", "in conclusion" etc.
- Paragraph feature
  - Paragraph initial/final more likely to be important.
- Thematic word feature
  - Repetition is an indicator of importance
  - Do any of the most frequent content words occur?
- Uppercase word feature
  - Uppercase often indicates named entities. (Taylor)
  - Is uppercase thematic word introduced?
- Sentence length cutoff
  - Summary sentence should be longer than 5 words.
  - Summary sentences have a minimum length.

Training
- Hand-label sentences in training set (good/bad summary sentences)
- Train classifier to distinguish good/bad summary sentences
- Model used: Naïve Bayes

- official introduction of the Euro
- reports on the investigation following the crash

- Migraine patients have high platelet aggregability.
- Spreading cortical depression (SCD) is implicated in some migraines.
- Stress can lead to a loss of magnesium.
- Discovery of knowledge previously unknown to the user in text.

Outline of Today
- Introduction
- Lexicon construction
- Topic Detection and Tracking
- Summarization
- Question Answering

Data Mining: Market Basket Analysis
- 80% of the people who buy milk also buy bread
- On Fridays, 70% of the men who bought diapers also bought beer.
- What is the relationship between diapers and beer?
- Walmart could trace the reason after doing a small survey!

The business opportunity in text mining
[Figure: bar chart on a 0 to 100 scale comparing unstructured and structured data by data volume and market cap]

Corporate Knowledge "Ore"
- Email
- Insurance claims
- News articles
- Web pages
- Patent portfolios
- IRC
- Scientific articles
- Customer complaint letters
- Contracts
- Transcripts of phone calls with customers
- Technical documents
Stuff not very accessible via standard data mining

Text Knowledge Extraction Tasks
- Small Stuff. Useful nuggets of information that a user wants:
  - Question Answering
  - Information Extraction (DB filling)
  - Thesaurus Generation
- Big Stuff. Overviews:
  - Summary Extraction (documents or collections)
  - Categorization (documents)
  - Clustering (collections)
- Text Data Mining: Interesting unknown correlations that one can discover

Text Mining
- The foundation of most commercial "text mining" products is all the stuff we have already covered:
  - Information Retrieval engine
  - Web spider/search
  - Text classification
  - Text clustering
  - Named entity recognition
  - Information extraction (only sometimes)
- Is this text mining? What else is needed?

One tool: Question Answering
- Goal: Use Encyclopedia/other source to answer "Trivial Pursuit-style" factoid questions
- Example:
  - "What famed English site is found on Salisbury Plain?"

Another tool: Summarizing
- High-level summary or survey of all main points?
- How to summarize a collection?
- Example:
  - sentence extraction from a single document

IBM Text Miner terminology: Example of Vocabulary found
- Certificate of deposit
- CMOs
- Commercial bank
- Commercial paper
- Commercial Union Assurance
- Commodity Futures Trading Commission
- Consul Restaurant
- Convertible bond
- Credit facility
- Credit line
- Debt security
- Debtor country
- Detroit Edison
- Digital Equipment
- Dollars of debt
- End-March
- Enserch
- Equity warrant
- Eurodollar

What is Text Data Mining?
- People's first thought:
  - Make it easier to find things on the Web.
  - But this is information retrieval!
- The metaphor of extracting ore from rock:
  - Does make sense for extracting documents of interest from a huge pile.
  - But does not reflect notions of DM in practice. Rather:
    - finding patterns across large collections
    - discovering heretofore unknown information

Definitions of Text Mining
- Text mining is mainly about somehow extracting information and knowledge from text.
- Two definitions:
  - Any operation related to gathering and analyzing text from external sources for business intelligence purposes.
  - Text mining is the process of compiling, organizing, and analyzing large document collections to support the delivery of targeted types of information to analysts and decision makers.
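
Sentence extraction: a sketch

The sentence-extraction steps from the summarization slides (represent each sentence as a feature vector, compute a score, select the n highest-ranking sentences, present them in document order) can be illustrated with a small Python sketch. It is not the trained classifier the slides describe: as a simplifying assumption, the score is an unweighted count of the features that fire. The function names, the stopword list, and the parameters `n` and `n_thematic` are all hypothetical choices made for this illustration.

```python
import re
from collections import Counter

# Assumptions for illustration only: a tiny phrase list and stopword list.
FIXED_PHRASES = ("in summary", "in conclusion")
STOPWORDS = {"a", "an", "the", "in", "is", "it", "at", "of", "to",
             "and", "who", "on", "for"}
MIN_WORDS = 5  # sentence length cutoff: keep sentences longer than this


def split_sentences(paragraph):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    return re.findall(r"[A-Za-z']+", sentence)


def extract_summary(paragraphs, n=2, n_thematic=5):
    # Collect sentences, remembering whether each opens or closes a paragraph.
    sentences = []
    for para in paragraphs:
        sents = split_sentences(para)
        for i, text in enumerate(sents):
            sentences.append((text, i == 0 or i == len(sents) - 1))

    # Thematic words: the most frequent non-stopword tokens in the document.
    counts = Counter(
        w.lower()
        for text, _ in sentences
        for w in tokenize(text)
        if w.lower() not in STOPWORDS
    )
    thematic = {w for w, _ in counts.most_common(n_thematic)}

    scored = []
    for idx, (text, at_edge) in enumerate(sentences):
        words = tokenize(text)
        if len(words) <= MIN_WORDS:
            continue  # too short to be a summary sentence
        lowered = text.lower()
        features = [
            any(p in lowered for p in FIXED_PHRASES),        # fixed phrase
            at_edge,                                         # paragraph position
            any(w.lower() in thematic for w in words),       # thematic word
            any(w[0].isupper() for w in words[1:]),          # uppercase word
        ]
        # Stand-in for a trained model: count how many features fire.
        scored.append((sum(features), idx, text))

    # Take the n best scores (ties go to the earlier sentence),
    # then restore document order.
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:n]
    return [text for _, _, text in sorted(top, key=lambda t: t[1])]
```

Calling `extract_summary(paragraphs, n=2)` on a list of paragraph strings returns at most two sentences in their original order. Replacing the feature count with a classifier trained on hand-labeled sentences would bring the sketch closer to the trainable approach on the Training slide.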