m is the number of objects), and thus are not practical for large data sets. However, instead of applying the algorithm to the entire data set, it can be applied to a reduced data set consisting only of cluster prototypes. Depending on the type of analysis, the number of prototypes, and the accuracy with which the prototypes represent the data, the results can be comparable to those that would have been obtained if all the data could have been used.

Compression. Cluster prototypes can also be used for data compression. In particular, a table is created that consists of the prototypes for each cluster; i.e., each prototype is assigned an integer value that is its position (index) in the table. Each object is represented by the index of the prototype associated with its cluster. This type of compression is known as vector quantization and is often applied to image, sound, and video data, where (1) many of the data objects are highly similar to one another, (2) some loss of information is acceptable, and (3) a substantial reduction in the data size is desired.

Efficiently Finding Nearest Neighbors. Finding nearest neighbors can require computing the pairwise distance between all points. Often clusters and their cluster prototypes can be found much more efficiently. If objects are relatively close to the prototype of their cluster, then we can use the prototypes to reduce the number of distance computations that are necessary to find the nearest neighbors of an object. Intuitively, if two cluster prototypes are far apart, then the objects in the corresponding clusters cannot be nearest neighbors of each other. Consequently, to find an object's nearest neighbors, it is only necessary to compute the distance to objects in nearby clusters, where the nearness of two clusters is measured by the distance between their prototypes.

This chapter provides an introduction to cluster analysis.
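As a concrete illustration of the prototype-based compression idea discussed above, here is a minimal sketch of vector quantization. It assumes 1-D data, finds prototypes with a naive k-means pass, and then represents each object by the index of its nearest prototype; the names (`build_prototypes`, `quantize`) and the hand-picked seeds are illustrative, not from the text.

```python
def build_prototypes(data, seeds, iters=10):
    """Naive k-means: each prototype is the mean of its cluster."""
    protos = list(seeds)
    for _ in range(iters):
        groups = [[] for _ in protos]
        for x in data:
            nearest = min(range(len(protos)), key=lambda i: abs(x - protos[i]))
            groups[nearest].append(x)
        # recompute each prototype as the mean of its assigned objects
        protos = [sum(g) / len(g) if g else protos[i]
                  for i, g in enumerate(groups)]
    return protos

def quantize(data, protos):
    """Replace each object by the index of its nearest prototype."""
    return [min(range(len(protos)), key=lambda i: abs(x - protos[i]))
            for x in data]

data = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8, 9.0, 9.2]
protos = build_prototypes(data, seeds=[1.0, 5.0, 9.0])  # the prototype table
codes = quantize(data, protos)            # compressed form: one small index per object
reconstructed = [protos[c] for c in codes]  # lossy decompression via the table
```

The same prototype table supports the nearest-neighbor pruning described above: because the nearness of two clusters can be judged by the distance between their prototypes, distances need only be computed to objects in nearby clusters.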
We begin with a high-level overview of clustering, including a discussion of the various approaches to dividing objects into sets of clusters and the different types of clusters. We then describe three specific clustering techniques that represent broad categories of algorithms and illustrate a variety of concepts: K-means, agglomerative hierarchical clustering, and DBSCAN. The final section of this chapter is devoted to cluster validity: methods for evaluating the goodness of the clusters produced by a clustering algorithm. More advanced clustering concepts and algorithms will be discussed in Chapter 9. Whenever possible, we discuss the strengths and weaknesses of different schemes. In addition, the bibliographic notes provide references to relevant books and papers that explore cluster analysis in greater depth.

Overview

Before discussing specific clustering techniques, we provide some necessary background. First, we further define cluster analysis, illustrating why it is difficult and explaining its relationship to other techniques that group data. Then we explore two important topics: (1) different ways to group a set of objects into a set of clusters, and (2) types of clusters.

What Is Cluster Analysis?

Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships. The goal is that the objects within a group be similar (or related) to one another and different from (or unrelated to) the objects in other groups. The greater the similarity (or homogeneity) within a group and the greater the difference between groups, the better or more distinct the clustering.

Cluster analysis is related to other techniques that are used to divide data objects into groups. For instance, clustering can be regarded as a form of classification in that it creates a labeling of objects with class (cluster) labels. However, it derives these labels only from the data.
In contrast, classification in the sense of Chapter 4 is supervised classification; i.e., new, unlabeled objects are assigned a class label using a model developed from objects with known class labels. For this reason, cluster analysis is sometimes referred to as unsupervised classification. When the term classification is used without any qualification within data mining, it typically refers to supervised classification.

Also, while the terms segmentation and partitioning are sometimes used as synonyms for clustering, these terms are frequently used for approaches outside the traditional bounds of cluster analysis. For example, the term partitioning is often used in connection with techniques that divide graphs into subgraphs and that are not strongly connected to clustering. Segmentation often refers to the division of data into groups using simple techniques; e.g., an image can be split into segments based only on pixel intensity and color, or people can be divided into groups based on their income. Nonetheless, some work in graph partitioning and in image and market segmentation is related to cluster analysis.

Different Types of Clusterings

An entire collection of clusters is commonly referred to as a clustering, and in this section, we distinguish various types of clusterings: hierarchical (nested) versus partitional (unnested), exclusive versus overlapping versus fuzzy, and complete versus partial.

Hierarchical versus Partitional

The most commonly discussed distinction among different types of clusterings is whether the set of clusters is nested or unnested, or in more traditional terminology, hierarchical or partitional. A partitional clustering is simply a division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. If we permit clusters to have subclusters, then we obtain a hierarchical clustering, which is a set of nested clusters that are organized as a tree.
Each node (cluster) in the tree (except for the leaf nodes) is the union of its children (subclusters), and the root of the tree is the cluster containing all the objects. Often, but not always, the leaves of the tree are singleton clusters of individual data objects.
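The nested-tree structure just described can be made concrete with a small agglomerative sketch. It assumes 1-D points and single-link distance (the distance between two clusters is the distance between their closest members), and represents each internal node of the tree as a pair of its two children; all names here are illustrative, not from the text.

```python
def leaves(node):
    """A leaf is a float (one object); an internal node is a pair of children."""
    return [node] if isinstance(node, float) else leaves(node[0]) + leaves(node[1])

def single_link(a, b):
    """Cluster distance = closest pair of members, one from each cluster."""
    return min(abs(x - y) for x in leaves(a) for y in leaves(b))

def agglomerate(points):
    clusters = [float(p) for p in points]  # start with singleton leaf clusters
    while len(clusters) > 1:
        # find and merge the closest pair of clusters into a new parent node
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        merged = (clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]  # the root: the cluster containing all objects

tree = agglomerate([1.0, 1.1, 5.0, 5.2, 9.0])
```

Reading the result top-down gives exactly the nesting described above: the root is the union of its two children, and every leaf is a singleton cluster holding one data object.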