Conclusion

• More people are doing Data Mining, and more different kinds of people are doing it.
• The company was thus able to rearrange its shelf layout so that orange-juice sales were maximized.

Algorithms: the basic methods

• Statistical modeling (Naïve Bayes)
  – Uses all attributes, assuming they are independent and equally important
• Divide and conquer: constructing decision trees
  – Repeatedly choose an attribute to split the instances on (algorithm: ID3)
• Covering algorithms: constructing rules (algorithm: Prism)
  – Take each class in turn and seek a way of covering all instances in it, at the same time excluding instances not in the class
  – The covering approach yields a rule set rather than a decision tree
• Mining association rules
  – Parameters: coverage (support), accuracy (confidence)
• Linear models (for reference)
  – Mainly used for numeric prediction and for classification (linear regression)
• Instance-based learning
  – Algorithms: nearest-neighbor, k-nearest-neighbor

Evaluation: credibility *

• Three data sets:
  – Training data: used to build the model; the more data, the better the model
  – Validation data: used to tune the model's parameters
  – Test data: used to estimate the final model's error rate; the more data, the more accurate the estimate
• Principle: test data must never, under any circumstances, be used to train the model
• Problem: if there are only a few instances, how should the data be split?
• Methods:
  – n-fold cross-validation (n = 3, 10)
  – Leave-one-out cross-validation
  – Bootstrap (the 0.632 bootstrap): best for very small datasets
• Counting the cost:
  – Lift charts (respondents / sample size), ROC curves
• The MDL principle (Minimum Description Length)
  – Occam's Razor: Other things being equal, simple theories are preferable to complex ones.
  – Einstein: Everything should be made as simple as possible, but no simpler.

Implementation: real machine learning schemes (omitted)

• Further reading:
  – Decision trees
  – Classification rules
  – Extending linear classification: support vector machines
  – Instance-based learning
  – Numeric prediction
  – Clustering

Improvement: engineering the input and output

• Data engineering
  – Attribute selection
  – Discretizing numeric attributes
  – Automatic d
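To make the Naïve Bayes bullet above concrete, here is a minimal sketch in Python. The weather-style data and function names are illustrative, not from the original notes: each attribute contributes an independent, equally weighted likelihood factor, with add-one (Laplace) smoothing so an unseen value never zeroes out a class.

```python
from collections import Counter, defaultdict

def nb_train(rows, labels):
    """Count class frequencies, per-class attribute-value frequencies,
    and the set of values seen for each attribute."""
    class_counts = Counter(labels)
    value_counts = Counter()                # (class, attr_index, value) -> count
    domains = defaultdict(set)              # attr_index -> distinct values seen
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(y, i, v)] += 1
            domains[i].add(v)
    return class_counts, value_counts, domains

def nb_classify(row, class_counts, value_counts, domains):
    """Pick the class c maximizing P(c) * prod_i P(attr_i = v_i | c),
    using add-one smoothing for unseen value/class pairs."""
    n = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n                      # the class prior P(c)
        for i, v in enumerate(row):
            score *= (value_counts[(c, i, v)] + 1) / (cc + len(domains[i]))
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical data: (outlook, windy) -> play?
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "yes"),
        ("rain", "no"), ("overcast", "no")]
labels = ["yes", "no", "no", "yes", "yes"]
cc, vc, dom = nb_train(rows, labels)
print(nb_classify(("sunny", "no"), cc, vc, dom))   # -> yes
```

Note how every attribute is used and weighted identically, which is exactly the "independent and equally important" assumption the notes mention.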
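The divide-and-conquer step in decision-tree construction — repeatedly choosing an attribute to split on — is driven in ID3 by information gain. A minimal sketch, with a made-up four-instance dataset:

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Entropy of a class distribution, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Expected reduction in entropy from splitting the instances
    on attribute `attr`; ID3 picks the attribute maximizing this."""
    n = len(labels)
    splits = defaultdict(list)
    for row, y in zip(rows, labels):
        splits[row[attr]].append(y)
    remainder = sum(len(part) / n * entropy(part) for part in splits.values())
    return entropy(labels) - remainder

rows = [("sunny", "hot"), ("sunny", "cool"), ("rain", "hot"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
print(info_gain(rows, labels, 0))   # attribute 0 separates the classes perfectly -> 1.0
print(info_gain(rows, labels, 1))   # attribute 1 tells us nothing -> 0.0
```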
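The covering idea behind Prism — take one class and grow a rule that covers its instances while excluding the others — can be sketched as a greedy loop. This is a simplified illustration (real Prism also breaks accuracy ties by coverage, which is omitted here), and the data is hypothetical:

```python
def prism_rule(rows, labels, target):
    """Greedily build one rule (a list of (attr_index, value) tests) for
    class `target`: repeatedly add the test with the highest accuracy p/t
    on the instances still covered, until only `target` instances remain."""
    covered = list(zip(rows, labels))
    rule, used = [], set()
    while any(y != target for _, y in covered):
        best, best_acc = None, -1.0
        for i in range(len(rows[0])):
            if i in used:
                continue
            for v in {r[i] for r, _ in covered}:
                sub = [(r, y) for r, y in covered if r[i] == v]
                acc = sum(y == target for _, y in sub) / len(sub)
                if acc > best_acc:
                    best, best_acc = (i, v), acc
        if best is None:
            break                            # no attributes left to test
        used.add(best[0])
        rule.append(best)
        covered = [(r, y) for r, y in covered if r[best[0]] == best[1]]
    return rule

rows = [("sunny", "hot"), ("sunny", "cool"), ("rain", "hot"), ("rain", "cool")]
labels = ["yes", "yes", "no", "no"]
print(prism_rule(rows, labels, "yes"))   # -> [(0, 'sunny')]
```

The output is a conjunction of tests, i.e. one rule of a rule set, rather than a tree, which is the contrast the notes draw between covering and divide-and-conquer.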
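The two association-rule parameters listed above, coverage (support) and accuracy (confidence), can be computed directly from a list of transactions. The basket data here is made up:

```python
def support(transactions, itemset):
    """Coverage: fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Accuracy of the rule antecedent -> consequent:
    support(antecedent + consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

baskets = [{"milk", "bread"}, {"milk", "bread", "juice"},
           {"bread"}, {"milk", "juice"}]
print(support(baskets, {"milk", "bread"}))        # 0.5  (2 of 4 baskets)
print(confidence(baskets, {"milk"}, {"bread"}))   # 2/3  (2 of the 3 milk baskets)
```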
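For the linear-models bullet, ordinary least squares in one dimension is enough to show the idea of fitting a linear model for numeric prediction. The data values are illustrative:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ~ a*x + b (one numeric attribute)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)   # 2.0 1.0 -- the data lie exactly on y = 2x + 1
```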
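Instance-based learning stores the training instances and defers all work to prediction time. A minimal k-nearest-neighbor classifier, on made-up numeric points:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training instances closest to `query`
    (squared Euclidean distance on numeric attributes)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda inst: sq_dist(inst[0], query))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

pts = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
       ((5, 5), "b"), ((5, 6), "b")]
print(knn_predict(pts, (0.5, 0.5), k=3))   # -> a
print(knn_predict(pts, (5, 5.5), k=3))     # -> b
```

With k = 1 this is plain nearest-neighbor; larger k makes the vote more robust to noisy instances.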
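When instances are scarce, n-fold cross-validation reuses every instance for both training and testing while still keeping the two roles separate within each fold. A sketch, with stratification omitted for brevity and a trivial majority-class learner standing in for a real one:

```python
def cross_val_error(rows, labels, n_folds, train_fn, predict_fn):
    """Average error rate over n folds: each fold is held out once for
    testing while the model is trained on the remaining n-1 folds."""
    errors = []
    for f in range(n_folds):
        test_idx = set(range(f, len(rows), n_folds))   # every n-th instance
        train = [(r, y) for i, (r, y) in enumerate(zip(rows, labels))
                 if i not in test_idx]
        test = [(r, y) for i, (r, y) in enumerate(zip(rows, labels))
                if i in test_idx]
        model = train_fn(train)
        wrong = sum(predict_fn(model, r) != y for r, y in test)
        errors.append(wrong / len(test))
    return sum(errors) / n_folds

def majority_train(train):
    """Stand-in learner: always predict the most common training class."""
    from collections import Counter
    return Counter(y for _, y in train).most_common(1)[0][0]

err = cross_val_error(list(range(10)), ["x"] * 8 + ["y"] * 2, 5,
                      majority_train, lambda model, row: model)
print(err)   # -> 0.2
```

Leave-one-out cross-validation is the special case n_folds = len(rows); the test data is never seen by `train_fn`, which is exactly the principle stated above.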