【文章內(nèi)容簡介】
in at least one of the partitions of DBn Scan 1: partition database and find local frequent patternsn Scan 2: consolidate global frequent patternsn A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association in large databases. In VLDB’952023/2/27 星期六 28Data Mining: Concepts and TechniquesSampling for Frequent Patternsn Select a sample of original database, mine frequent patterns within sample using Apriorin Scan database once to verify frequent itemsets found in sample, only borders of closure of frequent patterns are checkedn Example: check abcd instead of ab, ac, …, etc.n Scan database again to find missed frequent patternsn H. Toivonen. Sampling large databases for association rules. In VLDB’962023/2/27 星期六 29Data Mining: Concepts and TechniquesDHP: Reduce the Number of Candidatesn A kitemset whose corresponding hashing bucket count is below the threshold cannot be frequentn Candidates: a, b, c, d, en Hash entries: {ab, ad, ae} {bd, be, de} …n Frequent 1itemset: a, b, d, en ab is not a candidate 2itemset if the sum of count of {ab, ad, ae} is below support thresholdn J. Park, M. Chen, and P. Yu. An effective hashbased algorithm for mining association rules. In SIGMOD’952023/2/27 星期六 30Data Mining: Concepts and TechniquesEclat/MaxEclat and VIPER: Exploring Vertical Data Formatn Use tidlist, the list of transactionids containing an itemsetn Compression of tidlistsn Itemset A: t1, t2, t3, sup(A)=3n Itemset B: t2, t3, t4, sup(B)=3n Itemset AB: t2, t3, sup(AB)=2n Major operation: intersection of tidlistsn M. Zaki et al. New algorithms for fast discovery of association rules. In KDD’97n P. Shenoy et al. Turbocharging vertical mining of large databases. In SIGMOD’002023/2/27 星期六 31Data Mining: Concepts and TechniquesBottleneck of Frequentpattern Miningn Multiple database scans are costlyn Mining long patterns needs many passes of scanning and generates lots of candidatesn To find frequent itemset i1i2…i100n of scans: 100n of Candidates: (1001) + (1002) + … + (110000) = 21001 = *1030 !n Bottleneck: candidategenerationandtestn Can we avoid candidate generation?2023/2/27 星期六 32Data Mining: Concepts and TechniquesMining Frequent Patterns Without Candidate Generationn Grow long patterns from short ones using local frequent itemsn “abc” is a frequent patternn Get all transactions having “abc”: DB|abcn “d” is a local frequent item in DB|abc ? abcd is a frequent pattern2023/2/27 星期六 33Data Mining: Concepts and TechniquesConstruct FPtree from a Transaction Database{}f:4 c:1b:1p:1b:1c:3a:3b:1m:2p:2 m:1Header TableItem frequency head f 4c 4a 3b 3m 3p 3min_support = 3TID Items bought (ordered) frequent items100 {f, a, c, d, g, i, m, p} {f, c, a, m, p}200 {a, b, c, f, l, m, o} {f, c, a, b, m}300 {b, f, h, j, o, w} {f, b}400 {b, c, k, s, p} {c, b, p}500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}1. Scan DB once, find frequent 1itemset (single item pattern)2. Sort frequent items in frequency descending order, flist3. Scan DB again, construct FPtreeFlist=fcabmp2023/2/27 星期六 34Data Mining: Concepts and TechniquesBenefits of the FPtree Structuren Completeness n Preserve plete information for frequent pattern miningn Never break a long pattern of any transactionn Compactnessn Reduce irrelevant info—infrequent items are gonen Items in frequency descending order: the more frequently occurring, the more likely to be sharedn Never be larger than the original database (not count nodelinks and the count field)n For Connect4 DB, pression ratio could be over 1002023/2/27 星期六 35Data Mining: Concepts and TechniquesPartition Patterns and Databasesn Frequent patterns can be partitioned into subsets according to flistn Flist=fcabmpn Patterns containing pn Patterns having m but no pn …n Patterns having c but no a nor b, m, pn Pattern fn Completeness and nonredundency2023/2/27 星期六 36Data Mining: Concepts and TechniquesFind Patterns Having P From Pconditional Databasen Starting at the frequent item header table in the FPtreen Traverse the FPtree by following the link of each frequent item pn Accumulate all of transformed prefix paths of item p to form p’s conditional pattern baseConditional pattern basesitem cond. pattern basec f:3a fc:3b fca:1, f:1, c:1m fca:2, fcab:1p fcam:2, cb:1{}f:4 c:1b:1p:1b:1c:3a:3b:1m:2p:2 m:1Header TableItem frequency head f 4c 4a 3b 3m 3p 32023/2/27 星期六 37Data Mining: Concepts and TechniquesFrom Conditional Patternbases to Conditional FPtrees n For each patternbasen Accumulate the count for each item in the basen Construct the FPtree for the frequent items of the pattern basemconditional pattern base:fca:2, fcab:1{}f:3c:3a:3mconditional FPtreeAll frequent patterns relate to mm, fm, cm, am, fcm, fam, cam, fcam? ?{}f:4 c:1b:1p:1b:1c:3a:3b:1m:2p:2 m:1Header TableItem frequency head f 4c 4a 3b 3m 3p 32023/2/27 星期六 38Data Mining: Concepts and TechniquesRecursion: Mining Each Conditional FPtree{}f:3c:3a:3mconditional FPtreeCond. pattern base of “am”: (fc:3){}f:3c:3amconditional FPtreeCond. pattern base of “cm”: (f:3){}f:3cmconditional FPtreeCond. pattern base of “cam”: (f:3){}f:3camconditional FPtree2023/2/27 星期六 39Data Mining: Concepts and TechniquesA Special Case: Single Prefix Path in FPtreen Suppose a (conditional) FPtree T has a shared single prefixpath Pn Mining can be deposed into two partsn Reduction of the single prefix path into one noden Concatenation of the mining results of the two parts?a2:n2a3:n3a1:n1{}b1:m1 C1:k1C2:k2 C3:k3b1:m1 C1:k1C2:k2 C3:k3r1+a2:n2