… data
• Discretize / aggregate data
• Construct new attributes
• Reduce number of variables
• Reduce number of cases
• Balance skewed data
Source: Turban et al. (2023), Decision Support and Business Intelligence Systems

Data Mining Process: SEMMA
• Sample (Generate a representative sample of the data)
• Explore (Visualization and basic description of the data)
• Modify (Select variables, transform variable representations)
• Model (Use a variety of statistical and machine learning models)
• Assess (Evaluate the accuracy and usefulness of the models)
Source: Turban et al. (2023), Decision Support and Business Intelligence Systems

Data Mining Methods: Classification
• Most frequently used DM method
• Part of the machine learning family
• Employ supervised learning
• Learn from past data, classify new data
• The output variable is categorical (nominal or ordinal) in nature
• Classification versus regression?
• Classification versus clustering?
Source: Turban et al. (2023), Decision Support and Business Intelligence Systems

Assessment Methods for Classification
• Predictive accuracy – Hit rate
• Speed – Model building; predicting
• Robustness
• Scalability
• Interpretability – Transparency, explainability
Source: Turban et al. (2023), Decision Support and Business Intelligence Systems

Accuracy of Classification Models
• In classification problems, the primary source for accuracy estimation is the confusion matrix

  Predicted Class \ True Class |  Positive                   |  Negative
  Positive                     |  True Positive Count (TP)   |  False Positive Count (FP)
  Negative                     |  False Negative Count (FN)  |  True Negative Count (TN)

• True Positive Rate = TP / (TP + FN)
• True Negative Rate = TN / (TN + FP)
• Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
Source: Turban et al. (2023), Decision Support and Business Intelligence Systems

Estimation Methodologies for Classification
• Simple split (or holdout or test sample estimation)
  – Split the data into two mutually exclusive sets: training (~70%) and testing (~30%)
  – For ANN, the data is split into three subsets: training (~60%), validation (~20%), testing (~20%)

[Figure: the preprocessed data is split 2/3 into training data, used for model development (building the classifier), and 1/3 into testing data, used for model assessment (scoring) to report prediction accuracy]

Source: Turban et al. (2023), Decision Support and Business Intelligence Systems

Estimation Methodologies for Classification
• k-Fold Cross-Validation (rotation estimation)
  – Split the data into k mutually exclusive subsets
  – Use each subset as testing while using the rest of the subsets as training
  – Repeat the experimentation k times
  – Aggregate the test results for a true estimation of prediction accuracy (see the code sketch below)
• Other estimation methodologies
  – Leave-one-out, bootstrapping, jackknifing
  – Area under the ROC curve
Source: Turban et al. (2023), Decision Support and Business Intelligence Systems

Estimation Methodologies for Classification – ROC Curve

[Figure: ROC curves for three classifiers A, B, and C; x-axis: False Positive Rate (1 – Specificity), y-axis: True Positive Rate (Sensitivity)]

Source: Turban et al. (2023), Decision Support and Business Intelligence Systems
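To make the holdout split and k-fold cross-validation procedures above concrete, the following is a minimal Python sketch using scikit-learn. The synthetic dataset, the decision-tree classifier, and all parameter values (70/30 split, k = 10) are illustrative assumptions, not anything prescribed by the slides.

```python
# Minimal sketch of two estimation methodologies: a ~70/30 holdout split
# and 10-fold cross-validation, plus area under the ROC curve.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for a preprocessed dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Simple split (holdout): ~70% training, ~30% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("Holdout accuracy:", clf.score(X_test, y_test))

# Area under the ROC curve for the holdout model
y_score = clf.predict_proba(X_test)[:, 1]
print("Holdout ROC AUC:", roc_auc_score(y_test, y_score))

# k-fold cross-validation (k = 10): each fold is used once as the test
# set while the remaining k-1 folds are used for training; the k
# accuracy scores are then aggregated (averaged).
scores = cross_val_score(DecisionTreeClassifier(random_state=42),
                         X, y, cv=10, scoring="accuracy")
print("10-fold CV accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```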
  True Class (actual value) \ Predicted Class (prediction outcome) |  Positive             |  Negative             |  Total
  Positive                                                         |  True Positive (TP)   |  False Negative (FN)  |  P
  Negative                                                         |  False Positive (FP)  |  True Negative (TN)   |  N
  Total                                                            |  P'                   |  N'                   |

• Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• True Positive Rate (Sensitivity) = TP / (TP + FN)
• True Negative Rate (Specificity) = TN / (TN + FP)
• False Positive Rate (1 – Specificity) = FP / (FP + TN)
• Sensitivity = True Positive Rate = Recall = Hit rate

[Figure: ROC curves for classifiers A, B, and C; x-axis: False Positive Rate (1 – Specificity), y-axis: True Positive Rate (Sensitivity)]

Source: Turban et al. (2023), Decision Support and Business Intelligence Systems
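The metric definitions above can be checked with a few lines of Python. This is a minimal sketch computed directly from the four confusion-matrix counts; the counts in the example call are made-up numbers, not data from the text.

```python
# Minimal sketch of the confusion-matrix metrics defined above,
# computed from raw TP / FN / FP / TN counts.
def classification_metrics(tp, fn, fp, tn):
    """Return the metrics from the slides as a dict of floats."""
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "precision":   tp / (tp + fp),
        "recall":      tp / (tp + fn),   # = sensitivity = TP rate = hit rate
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "fp_rate":     fp / (fp + tn),   # = 1 - specificity
    }

# Hypothetical counts for a two-class problem
print(classification_metrics(tp=85, fn=15, fp=10, tn=90))
```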