e 3: determine the global counts for itemsets. How to find them? Find the frequent k-itemsets and replicate them in all nodes.

Parallel Association Rule Mining: Notation
• k-itemset: an itemset having k items
• Lk: the set of frequent k-itemsets (those with minimum support); each member has two fields, itemset and support count
• Ck: the set of candidate k-itemsets (potentially frequent itemsets); each member has two fields, itemset and support count
• Pi: the processor with id i
• Di: the dataset local to processor Pi
• DRi: the dataset local to processor Pi after repartitioning
• Ck^i: the candidate set maintained by processor Pi during the k-th pass (each candidate has k items)

Count Distribution (CD)
Objective: minimize communication.
Techniques:
• Straightforward parallelization of Apriori
• Carry out redundant, duplicate computations in parallel to avoid communication
• Only count values need to be communicated (no data tuples are exchanged)
• Processors can scan their local data asynchronously in parallel

Algorithm, pass 1:
(1) Each processor Pi generates its local candidate itemset C1^i, depending on the items present in its local data partition Di
(2) Develop and exchange the local C1^i counts
(3) Develop the global support counts for C1

Algorithm, pass k > 1 (see the sketch after the CD discussion below):
(1) Pi generates the complete Ck using the complete Lk-1 created at the end of pass k-1. Since every processor holds the identical Lk-1, all generate an identical Ck and place its count values in a common order in a count array
(2) Pi makes a pass over its data partition Di and develops local support counts for the candidates in Ck
(3) Pi exchanges its local Ck counts with all other processors to develop the global Ck counts; all processors must synchronize
(4) Pi computes Lk from Ck
(5) Pi independently decides whether to terminate or continue to the next pass

Disadvantages:
• CD does not exploit the aggregate memory of the system: with one processor, |M| of memory is available to hold Ck; with N processors, the total memory grows to N * |M|, yet each processor must still hold the entire Ck, so the memory usable for Ck remains |M|
• Processors must synchronize and develop the global counts at the end of every pass
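To make the CD counting and count-exchange pattern concrete, here is a minimal single-process sketch in Python. It simulates the processors as a list of data partitions and the count exchange as a sum over the local count arrays; the helper names, the toy transactions, and the use of frozensets are illustrative assumptions for this sketch, not part of the lecture material.

from itertools import combinations

def local_counts(partition, candidates):
    """Count each candidate itemset within one local data partition Di."""
    counts = [0] * len(candidates)
    for transaction in partition:
        items = set(transaction)
        for j, cand in enumerate(candidates):
            if cand <= items:          # candidate itemset is contained in the transaction
                counts[j] += 1
    return counts

def count_distribution_pass(partitions, Ck, min_support):
    """One CD pass: local counting plus a simulated global count exchange."""
    # (2) every processor scans only its own partition (asynchronously in principle)
    per_processor = [local_counts(Di, Ck) for Di in partitions]
    # (3) exchange only the count arrays: the global count is the sum of the
    #     local counts, kept in the same common order on every processor
    global_counts = [sum(column) for column in zip(*per_processor)]
    # (4) every processor derives the identical Lk from the identical global counts
    return [c for c, n in zip(Ck, global_counts) if n >= min_support]

# toy run: 2 "processors", candidate 2-itemsets over items a, b, c, min support 2
D = [[("a", "b"), ("a", "c")],                  # partition D1
     [("a", "b", "c"), ("b", "c")]]             # partition D2
C2 = [frozenset(pair) for pair in combinations("abc", 2)]
print(count_distribution_pass(D, C2, min_support=2))

The only data crossing processor boundaries here is the count array, which is the point of CD; a real distributed implementation would typically use an all-reduce style exchange for the count step.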
Data Distribution (DD)
Objective: utilize the aggregate main memory of the system effectively.
Technique:
• Partition the candidates into disjoint sets that are assigned to different processors; each processor works with the entire dataset but with only a portion of the candidate set
• Each processor counts mutually exclusive candidates; on an N-processor configuration, DD can count in a single pass a candidate set that would require N passes in CD

Basic idea (figure, 2-processor example): Ck is split between the processors into Ck^1 and Ck^2; processor 1 counts Ck^1 and derives Lk^1, processor 2 counts Ck^2 and derives Lk^2, and together they form Lk.
• Data Distribution has each processor process only a subset of Ck, to utilize the aggregate memory
• Data is exchanged to develop global counts for each Ck^i

Algorithm (a minimal sketch appears at the end of this section):
Pass 1: same as the CD algorithm.
Pass k > 1:
(1) Pi generates Ck from Lk-1 but retains only 1/N of the itemsets, forming Ck^i
(2) Pi develops support counts for the itemsets in Ck^i over ALL transactions (using local data pages and data pages received from the other processors)
(3) At the end of the data pass, Pi calculates Lk^i from its local Ck^i
(4) Processors exchange the Lk^i so that every processor has the complete Lk for generating Ck+1 in the next pass (this requires the processors to synchronize)
(5) Pi can independently decide whether to terminate or continue to the next pass

Disadvantages: heavy communication. Each processor must broadcast its local data and frequent itemsets to all other processors and synchronize in every pass.

Candidate Distribution (CaD)
Problem: CD and DD require the processors to synchronize at the end of each pass.
Basic idea: remove the dependences among processors.
• Data dependence: complete transactions are required to compute a support count (in CD)
• Frequent-itemset dependence: the global itemset Lk is needed during the pruning step of the Apriori candidate-generation algorithm (in DD)

Remove the data dependence:
• Each processor Pi works on Ck^i, a disjoint subset of Ck
• Pi derives global support counts for Ck^i from local data
• Data is replicated among the processors in order to achieve this

Reduce the frequent-itemset dependence:
• Do not wait for the complete pruning information to arrive from the other processors
• Prune the candidate set as much as possible
• Late-arriving pruning information is used in subsequent passes

Algorithm (a simplified sketch of the pass-l step appears at the end of this section):
Pass k < l: use either the CD or the DD algorithm.
Pass k = l:
(1) Partition Lk-1 among the N processors
(2) Pi generates Ck^i logically, using only its Lk-1^i partition (with standard pruning)
(3) Pi develops global counts for the candidates in Ck^i, and the database is repartitioned into DRi at the same time (this requires communicating local data)
(4) Pi receives the Lk^j from all other processors that are needed for pruning Ck+1^i
(5) Pi computes Lk^i from Ck^i and asynchronously sends it to the other N-1 processors
Pass k > l:
(1) Pi collects all frequent itemsets sent by the other processors
(2) Pi generates Ck^i using its local Lk-1^i, using the Lk-1^j received from the other processors for pruning
(3) Pi makes a pass over DRi and counts Ck^i
(4) Pi computes Lk^i from Ck^i and asynchronously sends it to the other N-1 processors

Summary:
• Count Distribution attempts to minimize communication by replicating the candidate sets in each processor's memory
• Data Distribution maximizes the use of aggregate memory by allowing each processor to work with the entire dataset
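For comparison with the CD sketch above, here is a minimal single-process sketch of the Data Distribution pass, under the same illustrative assumptions (simulated processors, toy data, helper names invented for this example). The round-robin candidate split, the full-data scan, and the final union of the Lk^i mirror steps (1) to (4) of the DD pass.

from itertools import combinations

def data_distribution_pass(partitions, Ck, min_support):
    """One DD pass: disjoint candidate subsets, each counted against ALL transactions."""
    n = len(partitions)
    # (1) each processor retains roughly 1/N of Ck as its Ck^i (round-robin split)
    Ck_parts = [Ck[i::n] for i in range(n)]
    # broadcasting every local partition to every processor is simulated by flattening
    all_transactions = [set(t) for part in partitions for t in part]
    Lk_parts = []
    for Ck_i in Ck_parts:
        # (2) count the local candidate subset over the entire dataset
        counts = {cand: 0 for cand in Ck_i}
        for items in all_transactions:
            for cand in Ck_i:
                if cand <= items:
                    counts[cand] += 1
        # (3) Lk^i is computed from the locally held Ck^i only
        Lk_parts.append([c for c, n_c in counts.items() if n_c >= min_support])
    # (4) exchange the Lk^i so that every processor holds the complete Lk
    return [c for part in Lk_parts for c in part]

D = [[("a", "b"), ("a", "c")], [("a", "b", "c"), ("b", "c")]]
C2 = [frozenset(pair) for pair in combinations("abc", 2)]
print(data_distribution_pass(D, C2, min_support=2))

Note that every processor touches every transaction, which is exactly the heavy-communication disadvantage listed above; what DD buys is that each processor only has to hold about |Ck|/N candidates in memory.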
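Finally, a much-simplified sketch of the pass-l step of Candidate Distribution. The assignment of Lk-1 itemsets by join prefix, the hash-based processor choice, and the repartitioning rule (keep only the transactions that can support one of the local candidates) are illustrative assumptions chosen so that candidate generation and counting become purely local; the lecture does not prescribe these details, and the prune step against the complete Lk-1 is omitted for brevity.

def partition_Lk1(Lk1, n_proc):
    """Assign each frequent (k-1)-itemset to a processor by its join prefix,
    so that both parents of any Apriori join land on the same processor."""
    parts = [[] for _ in range(n_proc)]
    for itemset in Lk1:
        prefix = tuple(sorted(itemset))[:-1]      # the first k-2 items
        parts[hash(prefix) % n_proc].append(itemset)
    return parts

def local_candidate_gen(Lk1_i):
    """Apriori join restricted to one processor's Lk-1 partition
    (the prune step against the complete Lk-1 is omitted here)."""
    Ck_i = set()
    tuples = [tuple(sorted(s)) for s in Lk1_i]
    for a in tuples:
        for b in tuples:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                Ck_i.add(frozenset(a) | frozenset(b))
    return Ck_i

def repartition(transactions, Ck_i):
    """Build DRi: keep only the transactions this processor needs
    in order to count its own candidates without further communication."""
    return [t for t in transactions if any(c <= set(t) for c in Ck_i)]

# toy run: 2 "processors" at pass l = 3, using the frequent 2-itemsets from above
L2 = [frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c")]]
data = [("a", "b"), ("a", "c"), ("a", "b", "c"), ("b", "c")]
for i, L2_i in enumerate(partition_Lk1(L2, 2)):
    C3_i = local_candidate_gen(L2_i)
    DR_i = repartition(data, C3_i)
    print(f"P{i}: candidates={C3_i}, local data DRi={DR_i}")

After this repartitioning, each processor can generate, count, and prune its own candidates from DRi and its local frequent sets alone, exchanging the Lk^i asynchronously only as pruning hints for later passes, which is what removes the per-pass synchronization of CD and DD.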