If a specific tuple is viewed as having attribute values for all attributes in the schema, then clustering algorithms can differ as to how the attribute values are examined. As is usually done with decision tree classification techniques, some algorithms examine attribute values one at a time; these are called monothetic. Polythetic algorithms consider all attribute values at one time. Finally, clustering algorithms can be labeled based on the mathematical formulation given to the algorithm: graph theoretic or matrix algebra. In this chapter we generally use the graph approach and describe the input to the clustering algorithm as an adjacency matrix labeled with distance measures. We discuss many clustering algorithms in the following sections. This is only a representative subset of the many algorithms that have been proposed in the literature. Before looking at these algorithms, we first examine possible similarity measures and examine the impact of outliers.

SIMILARITY AND DISTANCE MEASURES

There are many desirable properties for the clusters created by a solution to a specific clustering problem. The most important one is that a tuple within one cluster is more like the tuples within that cluster than it is like tuples outside it. As with classification, then, we assume the definition of a similarity measure, sim(t_i, t_l), defined between any two tuples, t_i, t_l ∈ D. This provides an alternative, stricter clustering definition, as found in Definition . Unless otherwise stated, we use the first definition rather than the second. Keep in mind that the similarity relationship stated within the second definition is a desirable, although not always obtainable, property. A distance measure, dis(t_i, t_j), as opposed to similarity, is often used in clustering. The clustering problem then has the desirable property that, given a cluster K_j, for all t_jl, t_jm ∈ K_j and t_i ∉ K_j, dis(t_jl, t_jm) ≤ dis(t_jl, t_i).
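The desirable property above can be made concrete with a small sketch. The Euclidean distance and the 1/(1 + d) similarity used here are illustrative assumptions, not the only possible measures:

```python
import math

def dis(t_i, t_j):
    # Euclidean distance between two numeric tuples: a metric,
    # so it satisfies the triangular inequality.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_i, t_j)))

def sim(t_i, t_j):
    # One illustrative way to derive a similarity from a distance:
    # closer tuples get values nearer to 1.
    return 1.0 / (1.0 + dis(t_i, t_j))

# The desirable clustering property: tuples inside a cluster are
# closer to one another than to a tuple outside the cluster.
cluster = [(0.0, 0.0), (1.0, 0.0)]
outside = (10.0, 10.0)
assert dis(cluster[0], cluster[1]) <= dis(cluster[0], outside)
```

Any metric (Manhattan distance, for example) could replace the Euclidean distance here without changing the property being checked.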
Some clustering algorithms look only at numeric data, usually assuming metric data points. Metric attributes satisfy the triangular inequality. A cluster can then be described by using several characteristic values. Given a cluster K_m of N points {t_m1, t_m2, ..., t_mN}, we make the following definitions [ZRL96]: Here the centroid is the "middle" of the cluster; it need not be an actual point in the cluster. Some clustering algorithms alternatively assume that the cluster is represented by one centrally located object in the cluster, called a medoid. The radius is the square root of the average mean squared distance from any point in the cluster to the centroid, and the diameter is the square root of the average mean squared distance between all pairs of points in the cluster. We use the notation M_m to indicate the medoid for cluster K_m.

Many clustering algorithms require that the distance between clusters (rather than elements) be determined. This is not an easy task given that there are many interpretations for distance between clusters. Given clusters K_i and K_j, there are several standard alternatives to calculate the distance between clusters. A representative list is:

- Single link: Smallest distance between an element in one cluster and an element in the other. We thus have dis(K_i, K_j) = min(dis(t_il, t_jm)) for all t_il ∈ K_i and t_jm ∈ K_j.
- Complete link: Largest distance between an element in one cluster and an element in the other. We thus have dis(K_i, K_j) = max(dis(t_il, t_jm)) for all t_il ∈ K_i and t_jm ∈ K_j.
- Average: Average distance between an element in one cluster and an element in the other. We thus have dis(K_i, K_j) = mean(dis(t_il, t_jm)) for all t_il ∈ K_i and t_jm ∈ K_j.
- Centroid: If clusters have a representative centroid, then the centroid distance is defined as the distance between the centroids. We thus have dis(K_i, K_j) = dis(C_i, C_j), where C_i is the centroid for K_i and C_j is the centroid for K_j.
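The characteristic values and the inter-cluster distance alternatives above might be sketched as follows; the helper names and the use of Euclidean distance are assumptions made for the sketch:

```python
import math
from itertools import product

def dis(a, b):
    # Euclidean distance between two numeric tuples (assumed metric).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(K):
    # The "middle" of the cluster; need not be an actual member point.
    n = len(K)
    return tuple(sum(p[i] for p in K) / n for i in range(len(K[0])))

def medoid(K):
    # A centrally located actual point: the member minimizing the
    # total distance to all other members.
    return min(K, key=lambda p: sum(dis(p, q) for q in K))

def radius(K):
    # Square root of the average mean squared distance from any
    # point in the cluster to the centroid.
    C = centroid(K)
    return math.sqrt(sum(dis(p, C) ** 2 for p in K) / len(K))

# Inter-cluster distance alternatives:
def single_link(Ki, Kj):
    return min(dis(a, b) for a, b in product(Ki, Kj))

def complete_link(Ki, Kj):
    return max(dis(a, b) for a, b in product(Ki, Kj))

def average_link(Ki, Kj):
    return sum(dis(a, b) for a, b in product(Ki, Kj)) / (len(Ki) * len(Kj))

def centroid_link(Ki, Kj):
    return dis(centroid(Ki), centroid(Kj))
```

Note that single link always yields a value no larger than complete link on the same pair of clusters, with the average falling between them; this ordering is what later makes single-link and complete-link hierarchical algorithms behave so differently.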