[Table residue from the preceding slide: dissimilarity matrices for the Manhattan (L1), Euclidean (L2), and supremum distances; the values are not recoverable.]

Ordinal Variables
• An ordinal variable can be discrete or continuous.
• Order is important, e.g., rank.
• Ordinal variables can be treated like interval-scaled variables:
  - Replace each value $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$.
  - Map the range of each variable onto [0, 1] by replacing the $i$-th object in the $f$-th variable with
    $$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
  - Compute the dissimilarity using methods for interval-scaled variables.

Attributes of Mixed Type
• A database may contain all attribute types: nominal, symmetric binary, asymmetric binary, numeric, ordinal.
• A weighted formula can be used to combine their effects:
  $$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
  where the indicator $\delta_{ij}^{(f)}$ is 0 when attribute $f$ should not contribute to the comparison (for example, when a value is missing) and 1 otherwise.
• $f$ is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, and $d_{ij}^{(f)} = 1$ otherwise.
• $f$ is numeric: use the normalized distance.
• $f$ is ordinal:
  - Compute the ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$.
  - Treat $z_{if}$ as interval-scaled.

Cosine Similarity
• A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
• Other vector objects: gene features in microarrays, ...
• Applications: information retrieval, biological taxonomy, gene feature mapping, ...
• Cosine measure: if $d_1$ and $d_2$ are two vectors (e.g., term-frequency vectors), then
  $$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$$
  where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector $d$. Written out for vectors $x$ and $y$ with $p$ components:
  $$\cos(x, y) = \frac{\sum_{i=1}^{p} x_i y_i}{\sqrt{\sum_{i=1}^{p} x_i^2} \, \sqrt{\sum_{i=1}^{p} y_i^2}}$$

Example: Cosine Similarity
• $\cos(d_1, d_2) = (d_1 \cdot d_2) / (\|d_1\| \, \|d_2\|)$, where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector $d$.
• Ex: Find the similarity between documents 1 and 2.
  d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
  d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
  d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
  ||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 ≈ 6.481
  ||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 ≈ 4.123
  cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94

Summary
• Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled.
• Many types of data sets, e.g., numerical, text, graph, Web, image.
• Gain insight into the data by:
  - Basic statistical data description: central tendency, dispersion, graphical displays
  - Data visualization: map data onto graphical primitives
  - Measuring data similarity
• The above steps are the beginning of data preprocessing.
• Many methods have been developed, but this is still an active area of research.
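
The three measures on the slides above (the rank-to-[0, 1] mapping for ordinal variables, the weighted combination for mixed-type attributes, and the cosine measure) can be checked with a short Python sketch. This is only an illustration under stated assumptions, not the deck's own implementation: the function names `ordinal_to_unit_interval`, `mixed_dissimilarity`, and `cosine_similarity` are hypothetical, the per-attribute dissimilarities in the mixed-type usage example are made-up values, and the absolute difference |z_i - z_j| is assumed as the interval-scaled distance for a single ordinal attribute.

```python
import math


def cosine_similarity(x, y):
    """Return (x . y) / (||x|| * ||y||) for two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def ordinal_to_unit_interval(rank, num_states):
    """Map a rank r in {1, ..., M} onto [0, 1] using z = (r - 1) / (M - 1)."""
    if num_states < 2 or not 1 <= rank <= num_states:
        raise ValueError("need M >= 2 and 1 <= r <= M")
    return (rank - 1) / (num_states - 1)


def mixed_dissimilarity(terms):
    """Combine per-attribute dissimilarities into one value.

    `terms` is a list of (delta, d) pairs, one per attribute f, where d is the
    per-attribute dissimilarity and delta is the 0/1 indicator.
    Returns sum(delta * d) / sum(delta).
    """
    total_weight = sum(delta for delta, _ in terms)
    if total_weight == 0:
        raise ValueError("no attribute contributes to the comparison")
    return sum(delta * d for delta, d in terms) / total_weight


# Worked example from the slides: two term-frequency vectors.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94

# Hypothetical mixed-type comparison of two objects i and j (illustrative data):
#   nominal attribute, values differ      -> d = 1.0
#   ordinal attribute, ranks 2 and 3 of 3 -> d = |z_i - z_j| (assumed distance)
#   binary attribute, values equal        -> d = 0.0
z_i = ordinal_to_unit_interval(2, 3)
z_j = ordinal_to_unit_interval(3, 3)
terms = [(1, 1.0), (1, abs(z_i - z_j)), (1, 0.0)]
print(mixed_dissimilarity(terms))  # 0.5
```

Passing the indicator and the per-attribute dissimilarity as explicit pairs keeps `mixed_dissimilarity` independent of how each attribute type computes its own distance, which mirrors the way the weighted formula separates the two.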