Concept Hierarchy Generation for Categorical Data

• A user or expert explicitly specifies a partial ordering of attributes at the schema level
    • street < city < state < country
• A hierarchy is specified by explicit data grouping
    • {Urbana, Champaign, Chicago} < Illinois
• Only a set of attributes is specified
    • The system automatically generates the attribute ordering from the number of distinct values of each attribute
    • Heuristic: an attribute at a higher concept level usually has fewer distinct values than one at a lower level
    • E.g., street, city, state, country
• Only part of the attribute values is specified

Automatic Concept Hierarchy Generation

• Some concept hierarchies can be generated automatically by analyzing the number of distinct values per attribute in the given data set
    • The attribute with the most distinct values is placed at the lowest level of the hierarchy
    • Note: there are exceptions, e.g., weekday, month, quarter, year
• Example hierarchy, from top to bottom:
    country (15 distinct values)
    province_or_state (65 distinct values)
    city (3567 distinct values)
    street (674,339 distinct values)

Summary

• Data preparation is a major concern for both data warehousing and data mining
• Data preparation includes
    • Data cleaning and data integration
    • Data reduction and feature selection
    • Discretization
• Many methods have been developed, but data preparation remains an active area of research

Data Reduction, Transformation, Integration

• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning and Data Integration
    • Data Cleaning
        i. Missing Data and Misguided Missing Data
        ii. Noisy Data
        iii. Data Cleaning as a Process
    • Data Integration Methods
• Data Reduction
    • Data Reduction Strategies
    • Dimensionality Reduction
        i. Principal Component Analysis
        ii. Feature Subset Selection
        iii. Feature Creation
    • Numerosity Reduction
        i. Parametric Data Reduction: Regression and Log-Linear Models
        ii. Mapping Data to a New Space: Wavelet Transformation
        iii. Data Cube Aggregation
        iv. Data Compression
        v. Histogram Analysis
        vi. Clustering
        vii. Sampling: Sampling without Replacement, Stratified Sampling
• Data Transformation and Data Discretization
    • Data Transformation: Normalization
    • Data Discretization Methods
        i. Binning
        ii. Cluster Analysis
        iii. Discretization Using Class Labels: Entropy-Based Discretization
        iv. Discretization Without Using Class Labels: Interval Merge by χ² Analysis
    • Concept Hierarchy and Its Formation
        i. Concept Hierarchy Generation for Numerical Data
        ii. Concept Hierarchy Generation for Categorical Data
        iii. Automatic Concept Hierarchy Generation
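
The distinct-value heuristic for automatic concept hierarchy generation can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: the function names `count_distinct`, `order_by_distinct_counts`, and `generate_hierarchy`, the toy address records, and the count of 20 distinct years are all assumptions made for the example. The counts for country, province_or_state, city, and street are the ones given on the slide.

```python
from typing import Dict, Iterable, List, Mapping, Sequence


def count_distinct(rows: Iterable[Mapping[str, object]],
                   attributes: Sequence[str]) -> Dict[str, int]:
    """Count the distinct values observed for each attribute."""
    seen = {attr: set() for attr in attributes}
    for row in rows:
        for attr in attributes:
            seen[attr].add(row[attr])
    return {attr: len(values) for attr, values in seen.items()}


def order_by_distinct_counts(counts: Mapping[str, int]) -> List[str]:
    """Order attributes from the top level (fewest distinct values)
    to the bottom level (most distinct values)."""
    return sorted(counts, key=lambda attr: counts[attr])


def generate_hierarchy(rows: Iterable[Mapping[str, object]],
                       attributes: Sequence[str]) -> List[str]:
    """Apply the heuristic to raw records (hypothetical helper)."""
    return order_by_distinct_counts(count_distinct(rows, attributes))


# Toy records, made up for illustration.
rows = [
    {"street": "1 Green St", "city": "Urbana", "state": "Illinois", "country": "USA"},
    {"street": "2 Neil St", "city": "Champaign", "state": "Illinois", "country": "USA"},
    {"street": "3 Lake Shore Dr", "city": "Chicago", "state": "Illinois", "country": "USA"},
    {"street": "4 State St", "city": "Madison", "state": "Wisconsin", "country": "USA"},
    {"street": "5 King St", "city": "Toronto", "state": "Ontario", "country": "Canada"},
    {"street": "6 Queen St", "city": "Toronto", "state": "Ontario", "country": "Canada"},
]
toy_hierarchy = generate_hierarchy(rows, ["street", "city", "state", "country"])

# Distinct-value counts quoted on the slide.
slide_counts = {"street": 674339, "city": 3567, "province_or_state": 65, "country": 15}
slide_hierarchy = order_by_distinct_counts(slide_counts)

# The exception noted on the slide: for time attributes the heuristic
# yields an ordering that does not match year > quarter > month > weekday.
# The value 20 for "year" is an assumed count.
time_counts = {"weekday": 7, "month": 12, "quarter": 4, "year": 20}
time_hierarchy = order_by_distinct_counts(time_counts)

print(toy_hierarchy)
print(slide_hierarchy)
print(time_hierarchy)
```

With the slide's counts the ordering comes out as country, province_or_state, city, street, as expected. For the time attributes the same rule places quarter at the top and year at the bottom, which is why such hierarchies usually need to be specified explicitly or corrected by a user.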