ation, etc.
• Watch for the PRIVACY pitfall!
Other applications

• The selection and processing of data for:
  • the identification of novel, accurate, and useful patterns, and
  • the modeling of real-world phenomena.
• Data mining is a major component of the KDD process: automated discovery of patterns and the development of predictive and explanatory models.
What is KDD? A process!
Konstanz, 20 EDBT2023 tutorial Intro

[Figure: Data Sources → Data Consolidation → Consolidated Data (Warehouse) → Selection and Preprocessing → Prepared Data → Data Mining → Patterns and Models, e.g. p(x) = 0.02 → Interpretation and Evaluation → Knowledge]
The KDD process

The KDD Process: Core Problems and Approaches
• Problems:
  • identification of relevant data
  • representation of data
  • search for valid pattern or model
• Approaches:
  • top-down deduction by expert
  • interactive visualization of data/models
  • * bottom-up induction from data *
[Slide labels: Data Mining, OLAP]

• Learning the application domain:
  • relevant prior knowledge and goals of application
• Data consolidation: creating a target data set
• Selection and preprocessing
  • Data cleaning (may take 60% of effort!)
  • Data reduction and projection:
    • find useful features, dimensionality/variable reduction, invariant representation.
• Choosing functions of data mining
  • summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Interpretation and evaluation: analysis of results.
  • visualization, transformation, removing redundant patterns, …
• Use of discovered knowledge
The steps of the KDD process

[Figure (CogNova Technologies): the KDD process embedded in a cycle of Identify Problem or Opportunity, Act on Knowledge, and Measure Effect of Action; further labels: Problem, Knowledge, Strategy, Results]
The virtuous cycle

Applications, operations, techniques

Roles in the KDD process

[Figure: pyramid with increasing potential to support business decisions toward the top. Layers, top to bottom:
  Making Decisions
  Data Presentation: Visualization Techniques
  Data Mining: Information Discovery
  Data Exploration: Statistical Analysis, Querying and Reporting
  Data Warehouses / Data Marts: OLAP, MDA
  Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA]
Data mining and business intelligence

[Figure: a Graphical User Interface on top of the stages Data Sources, Data Consolidation, Warehouse, Selection and Preprocessing, Data Mining, Interpretation and Evaluation, Knowledge]
Architecture of a KDD system

A business intelligence environment

[Figure: the KDD process, from Data Sources through Data Consolidation, Selection and Preprocessing, Data Mining, and Interpretation and Evaluation to Knowledge]
The KDD process

Garbage in, garbage out
• The quality of results relates directly to the quality of the data
• 50%-70% of KDD process effort is spent on data consolidation and preparation
• Major justification for a corporate data warehouse
Data consolidation and preparation

[Figure: From data sources to consolidated data repository. Source labels: RDBMS, Legacy DBMS, Flat Files, External; process: Data Consolidation and Cleansing; repository labels: Warehouse, Object/Relation DBMS, Multidimensional DBMS, Deductive Database, Flat files]
Data consolidation

• Determine preliminary list of attributes
• Consolidate data into working database
  • Internal and External sources
• Eliminate or estimate missing values
• Remove outliers (obvious exceptions)
• Determine prior probabilities of categories and deal with volume bias
Data consolidation

[Figure: the KDD process; stages shown: Data Consolidation, Warehouse, Selection and Preprocessing, Data Mining, Interpretation and Evaluation, Knowledge]
The KDD process

• Generate