Database System Concepts, 6th Edition ©Silberschatz, Korth and Sudarshan

Other Types of Mining
• Text mining: application of data mining to textual documents
  • Cluster Web pages to find related pages
  • Cluster pages a user has visited to organize their visit history
  • Classify Web pages automatically into a Web directory
• Data visualization systems help users examine large volumes of data and detect patterns visually
  • Can visually encode large amounts of information on a single screen
  • Humans are very good at detecting visual patterns

End of Chapter

Clustering Algorithms
• Clustering algorithms have been designed to handle very large datasets
• E.g., the Birch algorithm
  • Main idea: use an in-memory R-tree to store points that are being clustered
  • Insert points one at a time into the R-tree, merging a new point with an existing cluster if it is less than some distance δ away
  • If there are more leaf nodes than fit in memory, merge existing clusters that are close to each other
  • At the end of the first pass we get a large number of clusters at the leaves of the R-tree
    • Merge clusters to reduce the number of clusters

Clustering
• Clustering: intuitively, finding clusters of points in the given data such that similar points lie in the same cluster
• Can be formalized using distance metrics in several ways
  • Group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized
    • Centroid: point defined by taking the average of coordinates in each dimension
  • Another metric: minimize the average distance between every pair of points in a cluster
• Has been studied extensively in statistics, but on small data sets
  • Data mining systems aim at clustering techniques that can handle very large data sets
  • E.g., the Birch clustering algorithm (more shortly)

Finding Support
• Determine support of itemsets via a single pass on the set of transactions
  • Large itemsets: sets with a high count at the end of the pass
• If memory is not enough to hold all counts for all itemsets, use multiple passes, considering only some itemsets in each pass
• Optimization: once an itemset is eliminated because its count (support) is too small, none of its supersets needs to be considered
• The a priori technique to find large itemsets:
  • Pass 1: count support of all sets with just 1 item. Eliminate those items with low support
  • Pass i: candidates: every set of i items such that all its (i−1)-item subsets are large
    • Count support of all candidates
    • Stop if there are no candidates

Finding Association Rules
• We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater)
• Naïve …

… the population consists of a set of instances
• E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population

Regression
• Regression deals with the prediction of a value, rather than a class
  • Given values for a set of variables, X1, X2, …, Xn, we wish to predict the value of a variable Y
• One way is to infer coefficients a0, a1, a2, …, an such that
    Y = a0 + a1 * X1 + a2 * X2 + … + an * Xn
• Finding such a linear polynomial is called linear regression
  • In general, the process of finding a curve that fits the data is also called