ets as:

$$\text{purity}(S_1, S_2, \ldots, S_r) = \sum_{i=1}^{r} \frac{|S_i|}{|S|}\,\text{purity}(S_i)$$

- The information gain due to a particular split of $S$ into $S_i$, $i = 1, 2, \ldots, r$:

$$\text{Information-gain}(S, \{S_1, S_2, \ldots, S_r\}) = \text{purity}(S) - \text{purity}(S_1, S_2, \ldots, S_r)$$

- Entropy measure:

$$\text{entropy}(S) = -\sum_{i=1}^{k} p_i \log_2 p_i$$

Procedure Partition(S)
    if (purity(S) > δp or |S| < δs) then return;

©Silberschatz, Korth and Sudarshan, Database System Concepts, 6th Edition

Other Types of Classifiers

- Neural classifiers are studied in artificial intelligence and are not covered here
- Bayesian classifiers use Bayes' theorem, which says

$$p(c_j \mid d) = \frac{p(d \mid c_j)\, p(c_j)}{p(d)}$$

  where
  $p(c_j \mid d)$ = probability of instance $d$ being in class $c_j$,
  $p(d \mid c_j)$ = probability of generating instance $d$ given class $c_j$,
  $p(c_j)$ = probability of occurrence of class $c_j$, and
  $p(d)$ = probability of instance $d$ occurring

Regression

- Regression deals with the prediction of a value, rather than a class.
- Given values for a set of variables, $X_1, X_2, \ldots, X_n$, we wish to predict the value of a variable $Y$.
- One way is to infer coefficients $a_0, a_1, a_2, \ldots, a_n$ such that
  $Y = a_0 + a_1 X_1 + a_2 X_2 + \cdots + a_n X_n$
- Finding such a linear polynomial is called linear regression.
- In general, the process of finding a curve that fits the data is also called curve fitting.
- The fit may only be approximate
  - because of noise in the data, or
  - because the relationship is not exactly a polynomial
- Regression aims to find coefficients that give the best possible fit.

Finding Association Rules

- We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater)
- Naï…

Clustering

- Clustering: Intuitively, finding clusters of points in the given data such that similar points lie in the same cluster
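As a rough illustration of this intuition, the sketch below groups 2-D points around k cluster centres. It alternates between assigning each point to its nearest centre and recomputing each centre as the per-dimension average of its points. This is a k-means-style heuristic chosen here only for illustration; it tends to reduce the distance of points to their centres but is not guaranteed to find the best grouping. The function names `centroid` and `cluster_points`, the choice of the first k points as starting centres, the fixed iteration count, and the sample points are all assumptions, not part of the slides.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def centroid(points):
    """Average the coordinates of the given points in each dimension."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def cluster_points(points, k, iterations=10):
    """Group points into k clusters with a k-means-style loop (illustrative)."""
    # Assumption: seed the centres with the first k points so runs are repeatable.
    centroids = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its cluster
        # (an empty cluster keeps its previous centre).
        centroids = [
            centroid(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical sample data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = cluster_points(points, k=2)
```

With these sample points the loop settles after a couple of passes: the three points near (1, 1.5) end up in one cluster and the three points around (8.5, 8.5) in the other.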
- Can be formalized using distance metrics in several ways
  - Group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized
    - Centroid: point defined by taking the average of coordinates in each dimension.
  - Another metric: minimize the average distance between every pair of points in a cluster
- Has been studied extensively in statistics, but on small data sets
  - Data mining systems aim at clustering techniques that can handle very large data sets
  - E.g., the Birch clustering algorithm (more shortly)

Other Types of Mining

- Text mining: application of data mining to textual documents
  - cluster Web pages to find related pages
  - cluster pages a user has visited to organize their visit history
  - classify Web pages automatically into a Web directory
- Data visualization systems help users examine large volumes of data and detect patterns visually
  - Can visually encode large amounts of information on a single screen
  - Humans are very good at detecting visual patterns

End of Chapter

Clustering Algorithms

- Clustering algorithms have been designed to handle very large datasets
- E.g., the Birch algorithm
  - Main idea: use an in-memory R-tree to store points that are being clustered
  - Insert points one at a time into the R-tree, merging a new point with an existing cluster if it is less than some δ distance aw