[Main Text]
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning:
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
  - There are no samples left

Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain
- The training set S contains s_i tuples of class C_i, for i = 1, ..., m
- The expected information needed to classify an arbitrary tuple:
  I(s_1, ..., s_m) = -sum_{i=1}^{m} (s_i/s) log2(s_i/s)
- The entropy of attribute A with values {a_1, a_2, ..., a_v}:
  E(A) = sum_{j=1}^{v} ((s_1j + ... + s_mj)/s) * I(s_1j, ..., s_mj)
- The information gained by branching on attribute A:
  Gain(A) = I(s_1, ..., s_m) - E(A)

Attribute Selection by Information Gain Computation
- Class P: buys_computer = "yes"
- Class N: buys_computer = "no"
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age: the term (5/14) * I(2, 3) means that "age <= 30" has 5 out of 14 samples, with 2 yes's and 3 no's. Summing the analogous terms for the other age ranges gives E(age), and Gain(age) = I(9, 5) - E(age).
- Similarly, the gains for income, student, and credit_rating are computed, and the attribute with the highest gain becomes the test attribute.
- (A small runnable sketch of this computation appears below, after the discussion of overfitting.)

Other Attribute Selection Measures
- Gini index (CART, IBM IntelligentMiner)
  - All attributes are assumed continuous-valued
  - Assume there exist several possible split values for each attribute
  - May need other tools, such as clustering, to get the possible split values
  - Can be modified for categorical attributes

Gini Index (IBM IntelligentMiner)
- If a data set T contains examples from n classes, the gini index gini(T) is defined as
  gini(T) = 1 - sum_{j=1}^{n} p_j^2,
  where p_j is the relative frequency of class j in T.
- If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as
  gini_split(T) = (N1/N) * gini(T1) + (N2/N) * gini(T2)
- The attribute that provides the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute).

Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example:
  IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
  IF age = "31...40" THEN buys_computer = "yes"
  IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
  IF age = "<=30" AND credit_rating = "fair" THEN buys_computer = "no"

Avoid Overfitting in Classification
- Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Poor accuracy for unseen samples
- Two approaches to avoid overfitting:
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the "best pruned tree"
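The following is a minimal Python sketch of the information-gain computation referenced above. The helper names (info, info_gain) are illustrative, and the reconstructed 14-tuple age/buys_computer sample assumes the standard AllElectronics counts (the slide only states the age <= 30 partition: 5 of 14, with 2 yes and 3 no); it is a sketch, not the slides' own code.

```python
import math
from collections import Counter

def info(class_counts):
    """Expected information I(s1, ..., sm) = -sum p_i * log2(p_i) from per-class counts."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

def info_gain(examples, attribute, label):
    """Gain(A) = I(s1, ..., sm) - E(A) for splitting `examples` (list of dicts) on `attribute`."""
    base = info(list(Counter(ex[label] for ex in examples).values()))
    n = len(examples)
    expected = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        expected += len(subset) / n * info(list(Counter(ex[label] for ex in subset).values()))
    return base - expected

# Age distribution: 5 tuples with age <= 30 (2 yes / 3 no), as stated on the slide;
# the 31...40 (4 yes) and >40 (3 yes / 2 no) counts are assumed here for illustration.
data = ([{"age": "<=30",    "buys_computer": "yes"}] * 2 +
        [{"age": "<=30",    "buys_computer": "no"}]  * 3 +
        [{"age": "31...40", "buys_computer": "yes"}] * 4 +
        [{"age": ">40",     "buys_computer": "yes"}] * 3 +
        [{"age": ">40",     "buys_computer": "no"}]  * 2)

print(info([9, 5]))                             # I(9, 5) is about 0.940
print(info_gain(data, "age", "buys_computer"))  # Gain(age) is about 0.247 (0.940 - 0.694 with rounded terms)
```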
Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets
- Use cross-validation, e.g., 10-fold cross-validation
- Use all the data for training,
  - but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
- Use the minimum description length (MDL) principle:
  - halt growth of the tree when the encoding is minimized

Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals (a split-point sketch follows at the end of this section)
- Handle missing attribute values
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
- Attribute construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation, repetition, and replication

Classification in Large Databases
- Classification: a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
- Why decision tree induction in data mining?
  - relatively faster learning speed (than other classification methods)
  - convertible to simple and easy-to-understand classification rules
  - can use SQL queries for accessing databases
  - comparable classification accuracy with other methods

Scalable Decision Tree Induction Methods in Data Mining Studies
- SLIQ (EDBT'96, Mehta et al.)
  - builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96, J. Shafer et al.)
  - constructs an attribute list data structure
- PUBLIC (VLDB'98, Rastogi & Shim)
  - integrates tree splitting and tree pruning: stops growing the tree earlier
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti)
  - separates the scalability aspects from the criteria that determine the quality of the tree
  - builds an AVC-list (attribute, value, class label)

Data Cube-Based Decisio
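As a companion to the Gini index formulas and the continuous-valued-attribute enhancement above, here is a minimal Python sketch (not the algorithm of any of the systems cited) that enumerates candidate split points of a continuous attribute and picks the threshold minimizing gini_split(T). The function names and the toy ages/labels data are made up for illustration.

```python
from collections import Counter

def gini(class_counts):
    """gini(T) = 1 - sum_j p_j^2, computed from a list of per-class counts."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def best_split_point(values, labels):
    """Enumerate candidate thresholds (midpoints between consecutive distinct sorted
    values) and return (gini_split, threshold) minimizing
    gini_split(T) = N1/N * gini(T1) + N2/N * gini(T2)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_gini, best_threshold = float("inf"), None
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold separates equal attribute values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = Counter(label for _, label in pairs[:i])
        right = Counter(label for _, label in pairs[i:])
        g = (i / n) * gini(list(left.values())) + ((n - i) / n) * gini(list(right.values()))
        if g < best_gini:
            best_gini, best_threshold = g, threshold
    return best_gini, best_threshold

# Toy, made-up data: a continuous age attribute and class labels.
ages   = [23, 25, 30, 35, 38, 42, 45, 50]
labels = ["no", "no", "no", "yes", "yes", "yes", "yes", "no"]
print(best_split_point(ages, labels))  # here the best threshold is 32.5
```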