r large businesses that generate data from multiple divisions, possibly at multiple sites
• Data may also be purchased externally

©Silberschatz, Korth and Sudarshan, Database System Concepts, 6th Edition

Data Warehousing

More Warehouse Design Issues
• Data cleansing
    - E.g., correct mistakes in addresses (misspellings, zip code errors)
    - Merge address lists from different sources and purge duplicates
• How to propagate updates
    - Warehouse schema may be a (materialized) view of schema from data sources
• What data to summarize
    - Raw data may be too large to store online
    - Aggregate values (totals/subtotals) often suffice
    - Queries on raw data can often be transformed by the query optimizer to use aggregate values

Data Warehouse Schema

Data Mining (Cont.)
• Descriptive Patterns
    - Associations
        - Find books that are often bought by “similar” customers. If a new such customer buys one such book, suggest the others too.
        - Associations may be used as a first step in detecting causation
            - E.g., association between exposure to chemical X and cancer
    - Clusters
        - E.g., typhoid cases were clustered in an area surrounding a contaminated well
        - Detection of clusters remains important in detecting epidemics

Decision Tree

Best Splits
• Pick best attributes and conditions on which to partition
• The purity of a set S of training instances can be measured quantitatively in several ways.
    - Notation: number of classes = k, number of instances = |S|, fraction of instances in class i = p_i.
• The Gini measure of purity is defined as
    \[ \mathrm{Gini}(S) = 1 - \sum_{i=1}^{k} p_i^2 \]
    - When all instances are in a single class, the Gini value is 0
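As a rough illustration of the Gini measure defined above, the following sketch computes it from a list of class labels. The function name `gini` and the sample label lists are assumptions made for this example; they do not come from the slides.

```python
from collections import Counter


def gini(labels):
    """Gini measure of a list of class labels: 1 minus the sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        raise ValueError("the Gini measure is undefined for an empty set")
    counts = Counter(labels)  # number of instances per class
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Hypothetical training sets, used only to exercise the formula.
pure = ["yes", "yes", "yes", "yes"]  # every instance in one class
balanced = ["a", "b", "c", "d"]      # k = 4 classes of equal size

print(gini(pure))      # 0.0
print(gini(balanced))  # 0.75
```

The pure set evaluates to 0, and the evenly split four-class set evaluates to 0.75, so lower values indicate purer sets.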
    - The Gini measure \( 1 - \sum_{i=1}^{k} p_i^2 \) reaches its maximum (of 1 - 1/k) if each class has the same number of instances.

Best Splits (Cont.)
• Measure of “cost” of a split:
    \[ \text{Information-content}(S, \{S_1, S_2, \ldots, S_r\}) = -\sum_{i=1}^{r} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|} \]
• Information-gain ratio:
    \[ \text{Information-gain ratio} = \frac{\text{Information-gain}(S, \{S_1, S_2, \ldots, S_r\})}{\text{Information-content}(S, \{S_1, S_2, \ldots, S_r\})} \]
• The best split is the one that gives the maximum information gain ratio

Decision-Tree Construction Algorithm
Procedure GrowTree(S)
    Partition(S);
for each attribute A
    evaluate splits on attribute A;

• Naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate
    \[ p(d \mid c_j) = p(d_1 \mid c_j) \cdot p(d_2 \mid c_j) \cdots p(d_n \mid c_j) \]
    - Each of the p(d_i | c_j) can be estimated from a histogram on d_i values for each class c_j
        - The histogram is computed from the training instances
    - Histograms on multiple attributes are more expensive to compute and store

Association Rules
• Retail shops are often interested in associations between different items that people buy.
    - Someone who buys bread is quite likely also to buy milk
    - A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
• Association information can be used in several ways.
    - E.g., when a customer buys a particular book, an online shop may suggest associated books.
• Association rules: br