if 80 percent of the purchases that include bread also include milk.

©Silberschatz, Korth and Sudarshan, Database System Concepts, 6th Edition

- Naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate
    p(d | c_j) = p(d_1 | c_j) * p(d_2 | c_j) * ... * p(d_n | c_j)
- Each of the p(d_i | c_j) can be estimated from a histogram on d_i values for each class c_j
  - The histogram is computed from the training instances
- Histograms on multiple attributes are more expensive to compute and store

Data Mining (Cont.)
- Descriptive Patterns
  - Associations
    - Find books that are often bought by “similar” customers. If a new such customer buys one such book, suggest the others too.
  - Associations may be used as a first step in detecting causation
    - E.g., association between exposure to chemical X and cancer
  - Clusters
    - E.g., typhoid cases were clustered in an area surrounding a contaminated well
    - Detection of clusters remains important in detecting epidemics

Data Warehousing
- Data sources often store only current data, not historical data
- Corporate decision making requires a unified view of all organizational data, including historical data
- A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site
  - Greatly simplifies querying, permits study of historical trends
  - Shifts decision support query load away from transaction processing systems

Procedure Partition (S)
    if (purity(S) > δ_p or |S| < δ_s) then return;

Clustering
- Clustering: Intuitively, finding clusters of points in the given data such that similar points lie in the same cluster
- Can be formalized using distance metrics in several ways
  - Group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized
    - Centroid: point defined by taking average of coordinates in each dimension.
  - Another metric: minimize average distance between every pair of points in a cluster
- Has been studied extensively in statistics, but on small data sets
- Data mining systems aim at clustering techniques that can handle very large data sets
  - E.g., the Birch clustering algorithm (more shortly)

Clustering Algorithms
- Clustering algorithms have been designed to handle very large datasets
- E.g., the Birch algorithm
  - Main idea: use an in-memory R-tree to store points that are being clustered
  - Insert points one at a time into the R-tree, merging a new point with an existing cluster if it is less than some δ distance away
  - If there are more leaf nodes than fit in memory, merge existing clusters that are close to each other
  - At the end of first pass we get a large number of clusters at the leaves of the R-tree
    - Merge clusters to reduce the number of clusters

    Use best split found (across all attributes) to partition S into S_1, S_2, ..., S_r;
    for i = 1, 2, ..., r
        Partition (S_i);

Design Issues
- When and how to gather data
  - Source driven architecture: data sources transmit new information to warehouse, either continuously or periodically (e.g., at night)
  - Destination driven architecture: warehouse periodically requests new information from data sources
  - Keeping warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive
    - Usually OK to have slightly out-of-date data at warehouse
    - Data/updates a
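
The one-point-at-a-time insertion idea from the Clustering Algorithms slide, together with the centroid-based quality measure from the Clustering slide, can be illustrated with a small sketch. This is not the Birch algorithm itself: it assumes a flat list of clusters scanned linearly in place of the in-memory R-tree, and it omits the memory-driven merge step. The names `incremental_cluster`, `average_distance_to_centroid`, and `delta` are hypothetical, chosen only for illustration.

```python
import math

def centroid(points):
    """Point defined by averaging the coordinates in each dimension."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))

def incremental_cluster(points, delta):
    """Insert points one at a time (simplified stand-in for the R-tree).

    A point joins the nearest existing cluster if that cluster's centroid
    is less than `delta` away; otherwise it starts a new cluster.
    """
    clusters = []  # assumption: flat list of clusters, each a list of points
    for p in points:
        best, best_dist = None, None
        for c in clusters:
            dist = math.dist(p, centroid(c))
            if best_dist is None or dist < best_dist:
                best, best_dist = c, dist
        if best is not None and best_dist < delta:
            best.append(p)
        else:
            clusters.append([p])
    return clusters

def average_distance_to_centroid(clusters):
    """Average distance of points from the centroid of their assigned cluster."""
    total = sum(math.dist(p, centroid(c)) for c in clusters for p in c)
    count = sum(len(c) for c in clusters)
    return total / count

# Illustrative data only: two visibly separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (0.5, 0.5)]
clusters = incremental_cluster(points, delta=2.0)
quality = average_distance_to_centroid(clusters)
```

A smaller `delta` yields more, tighter clusters after the first pass, which is why a later merge step is needed to reduce their number.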