Parallel Mining of Association Rules in a Distributed Memory System∗

Xuehai Wang
Faculty of Computer Science
Dalhousie University
Halifax, Canada B3H 1W5
[email protected]
http://www.cs.dal.ca/~xwang

Yingbo Miao
Faculty of Computer Science
Dalhousie University
Halifax, Canada B3H 1W5
[email protected]
http://www.cs.dal.ca/~ymiao
Abstract

We consider the problem of mining association rules on a distributed memory system, the CGM1 cluster, which has 32 nodes. Furthermore, since each node of CGM1 consists of two processors that share the same resources within the node, we can exploit this feature and employ a shared memory Apriori algorithm inside each node.

1 Introduction
With the development of hardware, especially of large-capacity storage, many organizations have built large databases and collected huge volumes of data. These organizations want to extract useful information from this data, and traditional methods are no longer sufficient to handle it. Association rule mining, first proposed by Agrawal, Imielinski and Swami [2], tries to "find frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, etc." In other words, we want to find relations or dependencies between the occurrence of one item and the occurrences of other items. Many algorithms in this area, both sequential and parallel, have been proposed. Since association rule mining must handle very large amounts of data, time and resource complexity have to be considered carefully; hence parallel algorithms are desirable. In this paper, we explore several algorithms, especially parallel algorithms, and examine the trade-offs among them. We also implement a specific algorithm both sequentially and in parallel, and measure the speedup of the parallel version over the sequential one.
∗The course project for CSCI 6702, Parallel Computing.
The parallel algorithm will be implemented on the CGM1 cluster, which has 32 nodes with two processors on each node. The rest of the paper is organized as follows. Section 2 gives a brief review of the problem of mining association rules and some sequential algorithms. Section 3 describes some parallel algorithms. Section 4 presents our approach.
2 Overview of association rules

2.1 Association Rules
The association rule problem was introduced in [2]. It focuses on discovering relationships among items in a transactional database. Association rules can be defined formally as follows. Let I = {i1, i2, . . . , im} be a set of items, and let D be a set of transactions, where each transaction T is a set of items with T ⊆ I. Given a set X ⊆ I, we say T contains X if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y, and it holds with confidence c if c% of the transactions in D that contain X also contain Y.
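As a small illustration, the following Python fragment computes support and confidence directly from these definitions on a hypothetical five-transaction database (the data and variable names are ours, chosen only for the example):

# Tiny worked example of support and confidence (hypothetical data).
D = [{"bread", "milk"},
     {"bread", "diapers", "beer"},
     {"milk", "diapers", "beer"},
     {"bread", "milk", "diapers"},
     {"bread", "milk", "beer"}]

X, Y = {"bread"}, {"milk"}

n_xy = sum(1 for T in D if X | Y <= T)  # transactions containing X ∪ Y
n_x = sum(1 for T in D if X <= T)       # transactions containing X

support = n_xy / len(D)      # fraction of all transactions with X ∪ Y
confidence = n_xy / n_x      # fraction of X-transactions that also have Y

print(support, confidence)  # 0.6 and 0.75 for this data

Here the rule bread ⇒ milk has support 60% and confidence 75%.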
2.2 Sequential Algorithms

2.2.1 Apriori
One of the most popular algorithms for finding association rules is Apriori [3]. Apriori uses frequent k-itemsets, which contain k items from the set of items I, to generate candidate (k+1)-itemsets. The main idea of Apriori is that any subset of a frequent itemset must itself be frequent, so a candidate itemset with an infrequent subset need never be generated or tested. The pseudocode of Apriori follows:

L1 ← frequent 1-itemsets    // Lk is the set of frequent k-itemsets
k ← 2
while Lk−1 ≠ ∅ do
    generate Ck from Lk−1   // Ck is the set of candidate k-itemsets
    for all t ∈ D do
        increment the count of all candidates in Ck that are contained in t
    end for
    Lk ← all candidates in Ck with minimum support
    k ← k + 1
end while
return ∪k Lk

The subroutine generating Ck from Lk−1 is:
{Step 1: self-joining Lk−1}
insert into Ck
select p.item1, p.item2, . . ., p.itemk−1, q.itemk−1
from Lk−1 p, Lk−1 q
where p.item1 = q.item1, . . ., p.itemk−2 = q.itemk−2, p.itemk−1 < q.itemk−1

{Step 2: pruning}
for all itemsets c ∈ Ck do
    for all (k−1)-subsets s of c do
        if s ∉ Lk−1 then
            delete c from Ck
        end if
    end for
end for
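The pseudocode above translates almost directly into a program. The following is a minimal, unoptimized Python sketch of Apriori (the function name and the frozenset representation are our own choices for illustration, not prescribed by [3]):

from itertools import combinations

def apriori(D, minsup):
    # D: list of transactions (sets of items); minsup: absolute count.
    counts = {}
    for t in D:                     # first pass: count single items
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    L = {frozenset([i]) for i, c in counts.items() if c >= minsup}
    frequent, k = list(L), 2
    while L:
        # Self-join L_{k-1} with itself, pruning by the Apriori property:
        # keep a k-candidate only if all its (k-1)-subsets are frequent.
        Ck = set()
        for p in L:
            for q in L:
                u = p | q
                if len(u) == k and all(frozenset(s) in L
                                       for s in combinations(u, k - 1)):
                    Ck.add(u)
        counts = {c: 0 for c in Ck}  # support-counting pass over D
        for t in D:
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        L = {c for c, n in counts.items() if n >= minsup}
        frequent.extend(L)
        k += 1
    return frequent

A real implementation would use a hash tree or prefix tree for the subset tests during support counting; the nested loops here only keep the sketch short.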
2.2.2 Apriori-like algorithms
There are several algorithms based on Apriori that try to improve it or adapt it to certain conditions.

AprioriTID and AprioriHybrid. AprioriTID [3] uses the raw database for counting the support of candidate itemsets only once, in the first pass. In later passes, it uses an encoding of the candidate itemsets to count the support. AprioriTID saves much reading effort, since the encoding can become much smaller than the raw database. However, AprioriTID is slower than Apriori in the earlier passes because it uses more memory, so the two can be combined into AprioriHybrid [3]: Apriori is used in the initial passes, switching to AprioriTID once the encoding fits in memory. The switch does involve a cost.

SETM. SETM [10] uses general query languages such as SQL to mine association rules from large datasets in relational databases.

DIC. DIC [4], Dynamic Itemset Counting, begins counting k-itemsets at any appropriate point instead of always at the beginning of a pass, and stops counting once the itemsets have been counted over all transactions. It uses a prefix-tree.
Partition. Unlike Apriori, which counts the support of all candidates over the database to determine the frequent k-itemsets, Partition [12] intersects the tidlists of the (k−1)-candidates to generate the tidlists of the frequent k-itemsets. Since these intermediate results may take too much physical memory, Partition splits the raw dataset into several chunks and performs an extra scan to obtain the globally frequent itemsets.
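A sketch of the tidlist idea (our own toy illustration, not the exact data structures of [12]): the tidlist of an itemset is the intersection of the tidlists of its generating subsets, and its length is the support count, so no rescan of the raw database is needed.

# Hypothetical tidlists: item -> set of ids of transactions containing it.
tidlist = {
    "bread":   {1, 2, 4, 5},
    "milk":    {1, 3, 4, 5},
    "diapers": {2, 3, 4},
}

# Tidlist of the 2-itemset {bread, milk} by intersection; its size is
# the support count of the itemset.
t_bm = tidlist["bread"] & tidlist["milk"]
print(t_bm, len(t_bm))  # {1, 4, 5}, support 3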
2.2.3 Other algorithms
All algorithms introduced above traverse the search space using breadth-first search [9]. There are also algorithms that use depth-first search (DFS).

FP-growth. FP-growth [8] uses the frequent pattern tree (FP-tree), an extended prefix-tree structure that stores a highly condensed representation of the transaction data, and thus avoids the costly database scans in the subsequent mining process.

Eclat. Eclat [14] combines depth-first search with tidlist intersections by clustering itemsets using equivalence classes or maximal hypergraph cliques and then generating the true frequent itemsets using bottom-up, top-down or hybrid lattice traversal.

An important observation about all of these algorithms is that, although they employ different strategies, their runtime behaviors are quite similar; none of them fundamentally beats the others [9]. We will focus on parallel mining of association rules based on Apriori.
3 Parallel algorithms
There are two dominant approaches to parallel mining of association rules: on distributed memory systems and on shared memory systems. In a distributed memory system each processor has a private memory, while in a shared memory system all processors access a common memory.
3.1 Parallel mining of association rules on distributed memory systems
In a distributed memory system, each processor has its own local memory, which can be accessed directly only by that processor, and message passing is used for communication among processors. Thus, for parallel mining of association rules on a distributed memory system, we must consider the trade-offs between computation, communication, memory usage, synchronization and the use of problem-specific information in parallel data mining [1].

3.1.1 Apriori based
In [1] three algorithms are introduced: Count Distribution, Data Distribution and Candidate Distribution.

Count Distribution. In the Count Distribution algorithm, the raw dataset is distributed across the disks of all processors. Each processor generates the entire candidate k-itemset Ck from the frequent (k−1)-itemsets, and then performs a sum reduction to obtain the global counts by exchanging local counts with all other processors. It can then prune the candidate sets to get the frequent k-itemsets.
In the first pass (k = 1), each processor P^i dynamically generates its local candidate itemset C1^i depending on the items actually present in its local data partition D^i. The candidates counted by different processors may not be identical, so the local counts must be exchanged to determine the global C1. For passes k > 1:

1) Each processor P^i generates the complete Ck, using the complete frequent itemset Lk−1 created at the end of pass k−1.
2) Processor P^i makes a pass over its data partition D^i and develops local support counts for the candidates in Ck.
3) Processor P^i exchanges local Ck counts with all other processors to develop the global Ck counts. Processors are forced to synchronize in this step (see the sketch below).
4) Each processor P^i now computes Lk from Ck.
5) Each processor P^i independently makes the decision to terminate or continue to the next pass.
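The count exchange of step 3 maps naturally onto a single all-reduce. The fragment below is a sketch using mpi4py; it assumes every processor stores its local counts for Ck in the same (say, lexicographic) candidate order, which is our layout assumption for illustration, not a requirement stated in [1].

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def exchange_counts(local_counts):
    # local_counts: int64 array with one slot per candidate in Ck,
    # laid out in the same order on every processor.
    global_counts = np.zeros_like(local_counts)
    # Sum-reduction across all processors; every rank receives the totals.
    comm.Allreduce(local_counts, global_counts, op=MPI.SUM)
    return global_counts

# After counting its partition D^i, each rank calls exchange_counts and
# keeps the candidates whose global count reaches the minimum support.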
Data Distribution. One disadvantage of Count Distribution is that it does not exploit the aggregate memory of the system effectively, since the number of candidates that can be counted in one pass is determined by the memory size of each processor. In the Data Distribution algorithm each processor counts mutually exclusive candidates, so it exploits the total memory of the system better. However, it is a communication-happy algorithm, since it requires each processor to broadcast its local data to all other processors in every pass. According to the experiments in [1], the Data Distribution algorithm performs poorly compared to Count Distribution.

Candidate Distribution. The Candidate Distribution algorithm uses Count Distribution or Data Distribution until some pass l. In iteration l, it partitions the candidates so that each processor can generate disjoint candidates independently of the other processors. Thus each processor can work on a unique set of candidates without having to repeatedly broadcast the entire dataset. However, this algorithm also performs worse than Count Distribution, because it pays the cost of redistributing the dataset while scanning the local dataset partition repeatedly.

3.1.2 Other algorithms
Some other algorithms can be found in [11], [7], [5] and [6].
3.2 Parallel mining of association rules on shared memory systems

3.2.1 Shared memory Apriori-like algorithm
Recall that the sequential Apriori algorithm has two main steps, candidate generation and support counting. Let Ck denote the candidate itemsets of the kth pass and Lk the frequent itemsets of the kth pass. Ck can be generated by joining Lk−1 with itself; all infrequent itemsets in Ck are then eliminated to obtain Lk. Ck and Lk are kept lexicographically sorted. Lk−1 can be partitioned into equivalence classes according to common (k−2)-prefixes, and these classes can be distributed among the P processors and processed simultaneously (a sketch follows below). The partitioning also benefits the pruning step: instead of checking all k of the (k−1)-subsets, we now need to check only n−(k−2) subsets. The remaining problem to consider is load balancing. We can simply partition the classes in order, but this suffers from load imbalance; interleaved partitioning or bitonic partitioning can solve this problem.
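A minimal sketch of the prefix-based partitioning (our own illustration; the function name and representation are assumptions): itemsets of Lk−1 are grouped by their first k−2 items, and whole classes are assigned to processors round-robin (interleaved) to reduce imbalance.

from collections import defaultdict

def partition_classes(Lk1, P):
    # Lk1: list of (k-1)-itemsets, each a lexicographically sorted tuple.
    classes = defaultdict(list)
    for itemset in Lk1:
        classes[itemset[:-1]].append(itemset)  # group by (k-2)-prefix
    # Interleaved assignment of the classes to the P processors.
    assignment = [[] for _ in range(P)]
    for j, prefix in enumerate(sorted(classes)):
        assignment[j % P].append(classes[prefix])
    return assignment  # each processor self-joins only within its classes

For example, with L2 = [(a,b), (a,c), (a,d), (b,c)] there are two classes, prefix (a,) with three itemsets and prefix (b,) with one, and each class can generate its 3-candidates independently of the others.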
3.2.2 Multiple Local Parallel Tree (MLPT)
Noticing that the Apriori algorithm suffers from I/O problems, MLPT scans the database only twice to alleviate the I/O time. MLPT first scans the database to generate the ordered frequent-item list, which will be used to build a tree. Each processor is allocated the same number of transactions and computes the count of each item locally; the global counts are then computed by splitting the item list evenly among the processors, with each processor computing the global counts for the items allocated to it. Each processor is again assigned the same number of transactions for the second database scan. According to the item list generated in step one, a local FP-tree rooted at a null node is created, and a bottom-up traversal is then used to mine the association rules [13].
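For concreteness, the following is a compact sketch of the FP-tree construction that MLPT-style algorithms build per processor (a simplification for illustration; the header table, node links and the mining traversal of [13] are omitted):

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(D, minsup):
    # Pass 1: global item counts; keep frequent items, most frequent first.
    freq = {}
    for t in D:
        for i in t:
            freq[i] = freq.get(i, 0) + 1
    order = {i: r for r, i in enumerate(
        sorted((i for i in freq if freq[i] >= minsup),
               key=lambda i: -freq[i]))}
    # Pass 2: insert each transaction's frequent items, in that global
    # order, along a shared prefix path of the tree.
    root = Node(None, None)
    for t in D:
        node = root
        for i in sorted((i for i in t if i in order), key=order.get):
            if i in node.children:
                node.children[i].count += 1
            else:
                node.children[i] = Node(i, node)
            node = node.children[i]
    return root

Only two database scans are needed, matching the I/O behavior described above.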
4 Our approach
Having surveyed these algorithms, this paper mainly focuses on distributed memory parallelism. We will implement the Apriori-based Count Distribution algorithm on our CGM1 system. This cluster is configured with 32 nodes, and each node consists of two processors which share the same resources within the node. We choose the Count Distribution algorithm because it is simple to implement and requires the least communication among nodes. Because it splits the transactions according to the number of processors, it may suffer from an imbalanced workload, and as we have seen before, it also suffers from repeated I/O operations. Since in our CGM1 system the two processors of a node share the same resources, we can exploit this feature: in our approach, we also employ the shared memory Apriori algorithm within a node to compute Ck and Lk. Furthermore, the local counts can also be computed in parallel within a node, as sketched below.
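The following is a hypothetical sketch of this hybrid scheme, assuming mpi4py for message passing between nodes and a two-worker process pool for the two processors inside a node; the names count_chunk and hybrid_pass are ours, and in practice spawning worker processes under MPI needs some care.

from multiprocessing import Pool
from mpi4py import MPI
import numpy as np

def count_chunk(args):
    chunk, candidates = args
    counts = np.zeros(len(candidates), dtype=np.int64)
    for t in chunk:                 # count candidates over this half
        for j, c in enumerate(candidates):
            if c <= t:              # candidate itemset contained in t
                counts[j] += 1
    return counts

def hybrid_pass(local_D, candidates, comm):
    # Split this node's partition between its two processors.
    half = len(local_D) // 2
    with Pool(2) as pool:
        parts = pool.map(count_chunk, [(local_D[:half], candidates),
                                       (local_D[half:], candidates)])
    local = parts[0] + parts[1]
    # Count Distribution step: sum-reduce the counts across all nodes.
    glob = np.zeros_like(local)
    comm.Allreduce(local, glob, op=MPI.SUM)
    return glob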
References

[1] R. Agrawal and J. C. Shafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8:962–969, 1996.

[2] Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In Peter Buneman and Sushil Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207–216, Washington, D.C., 1993.

[3] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), pages 487–499. Morgan Kaufmann, 1994.

[4] Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. Dynamic itemset counting and implication rules for market basket data. In Joan Peckham, editor, Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, USA, pages 255–264. ACM Press, 1997.

[5] D. W. Cheung, J. Han, V. Ng, A. W. Fu, and Y. Fu. A fast distributed algorithm for mining association rules. In PDIS: International Conference on Parallel and Distributed Information Systems. IEEE Computer Society Technical Committee on Data Engineering, and ACM SIGMOD, 1996.

[6] David Wai-Lok Cheung and Yongqiao Xiao. Effect of data skewness in parallel mining of association rules. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 48–60, 1998.

[7] Eui-Hong Han, George Karypis, and Vipin Kumar. Scalable parallel data mining for association rules. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, pages 277–288, 1997.

[8] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. In Weidong Chen, Jeffrey Naughton, and Philip A. Bernstein, editors, Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 1–12. ACM Press, 2000.

[9] Jochen Hipp, Ulrich Güntzer, and Gholamreza Nakhaeizadeh. Algorithms for association rule mining – a general survey and comparison. SIGKDD Explorations, 2(1):58–64, July 2000.

[10] Maurice A. W. Houtsma and Arun N. Swami. Set-oriented mining for association rules in relational databases. In Philip S. Yu and Arbee L. P. Chen, editors, Proceedings of the Eleventh International Conference on Data Engineering, Taipei, Taiwan, pages 25–33. IEEE Computer Society, 1995.

[11] Jong Soo Park, Ming-Syan Chen, and Philip S. Yu. Efficient parallel data mining for association rules. In Proceedings of the Fourth International Conference on Information and Knowledge Management, pages 31–36. ACM Press, 1995.

[12] Ashoka Savasere, Edward Omiecinski, and Shamkant B. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB), pages 432–444, 1995.

[13] O. Zaïane, M. El-Hajj, and P. Lu. Fast parallel association rule mining without candidacy generation, 2001.

[14] Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, and Wei Li. New algorithms for fast discovery of association rules. Technical Report TR651, University of Rochester, 1997.