Parallel Mining of Association Rules in a Distributed Memory System∗

Xuehai Wang
Faculty of Computer Science
Dalhousie University
Halifax, Canada B3H 1W5
[email protected]
http://www.cs.dal.ca/~xwang

Yingbo Miao
Faculty of Computer Science
Dalhousie University
Halifax, Canada B3H 1W5
[email protected]
http://www.cs.dal.ca/~ymiao
Abstract

We consider the problem of mining association rules on a distributed memory system, the CGM1 cluster, which has 32 nodes. Furthermore, since each node of CGM1 consists of two processors that share the same resources within the node, we can exploit this feature and employ a shared memory Apriori algorithm inside each node.

1 Introduction
With the development of hardware, especially of large-capacity storage, many organizations have built large databases and collected huge volumes of data. These organizations want to extract useful information from this data, and traditional methods are no longer sufficient to handle it. Association rule mining, first proposed by Agrawal, Imielinski and Swami [2], tries to "find frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, etc." In other words, we want to find relations or dependencies between the occurrence of one item and the occurrences of other items. Many algorithms in this area, both sequential and parallel, have been proposed. Since association rule mining must handle very large amounts of data, time and resource complexity have to be considered carefully; hence parallel algorithms are desirable. In this paper, we explore several algorithms, especially parallel algorithms, and examine the trade-offs among them. We also implement a specific algorithm both sequentially and in parallel, and measure the speedup of the parallel version over the sequential one.
∗The course project for CSCI 6702, Parallel Computing.
The parallel algorithm will be implemented on the CGM1 cluster, which has 32 nodes with two processors on each node. The rest of the paper is organized as follows. Section 2 gives a brief review of the problem of mining association rules and some sequential algorithms. Section 3 describes some parallel algorithms. Section 4 presents our approach.
2 Overview of association rules

2.1 Association Rules
The association rule problem was introduced in [2]. It focuses on discovering relationships among items in a transactional database. Association rules can be defined formally as follows. Let I = {i1, i2, . . . , im} be a set of items, and let D be a set of transactions, where each transaction T is a set of items with T ⊆ I. Given a set X ⊆ I, we say T contains X if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y, and it holds with confidence c if c% of the transactions in D that contain X also contain Y.
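As a small illustration, the following Python fragment computes support and confidence directly from these definitions on a hypothetical five-transaction database (the data and variable names are ours, chosen only for the example):

# Tiny worked example of support and confidence (hypothetical data).
D = [{"bread", "milk"},
     {"bread", "diapers", "beer"},
     {"milk", "diapers", "beer"},
     {"bread", "milk", "diapers"},
     {"bread", "milk", "beer"}]

X, Y = {"bread"}, {"milk"}

n_xy = sum(1 for T in D if X | Y <= T)  # transactions containing X ∪ Y
n_x = sum(1 for T in D if X <= T)       # transactions containing X

support = n_xy / len(D)      # fraction of all transactions with X ∪ Y
confidence = n_xy / n_x      # fraction of X-transactions that also have Y

print(support, confidence)  # 0.6 and 0.75 for this data

Here the rule bread ⇒ milk has support 60% and confidence 75%.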
2.2 Sequential Algorithms

2.2.1 Apriori
One of the most popular algorithms for finding association rules is Apriori [3]. Apriori uses frequent k-itemsets, which contain k items from the set of items I, to generate candidate (k+1)-itemsets. The main idea of Apriori is that any subset of a frequent itemset must itself be frequent, so a candidate itemset with an infrequent subset need never be generated or tested. The pseudocode of Apriori follows:

L1 ← frequent 1-itemsets    // Lk is the set of frequent k-itemsets
k ← 2
while Lk−1 ≠ ∅ do
    generate Ck from Lk−1   // Ck is the set of candidate k-itemsets
    for all t ∈ D do
        increment the count of all candidates in Ck that are contained in t
    end for
    Lk ← all candidates in Ck with minimum support
    k ← k + 1
end while
return ∪k Lk

The subroutine generating Ck from Lk−1 is:
{Step 1: self-joining Lk−1}
insert into Ck
select p.item1, p.item2, . . ., p.itemk−1, q.itemk−1
from Lk−1 p, Lk−1 q
where p.item1 = q.item1, . . ., p.itemk−2 = q.itemk−2, p.itemk−1 < q.itemk−1

{Step 2: pruning}
for all itemsets c ∈ Ck do
    for all (k−1)-subsets s of c do
        if s ∉ Lk−1 then
            delete c from Ck
        end if
    end for
end for
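The pseudocode above translates almost directly into a program. The following is a minimal, unoptimized Python sketch of Apriori (the function name and the frozenset representation are our own choices for illustration, not prescribed by [3]):

from itertools import combinations

def apriori(D, minsup):
    # D: list of transactions (sets of items); minsup: absolute count.
    counts = {}
    for t in D:                     # first pass: count single items
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    L = {frozenset([i]) for i, c in counts.items() if c >= minsup}
    frequent, k = list(L), 2
    while L:
        # Self-join L_{k-1} with itself, pruning by the Apriori property:
        # keep a k-candidate only if all its (k-1)-subsets are frequent.
        Ck = set()
        for p in L:
            for q in L:
                u = p | q
                if len(u) == k and all(frozenset(s) in L
                                       for s in combinations(u, k - 1)):
                    Ck.add(u)
        counts = {c: 0 for c in Ck}  # support-counting pass over D
        for t in D:
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        L = {c for c, n in counts.items() if n >= minsup}
        frequent.extend(L)
        k += 1
    return frequent

A real implementation would use a hash tree or prefix tree for the subset tests during support counting; the nested loops here only keep the sketch short.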
2.2.2 Apriori-like algorithms
There are several algorithms based on Apriori that try to improve it or adapt it to certain conditions.

AprioriTID and AprioriHybrid. AprioriTID [3] uses the raw database for counting the support of candidate itemsets only once, in the first pass. In later passes, it uses an encoding of the candidate itemsets to count the support. AprioriTID saves much reading effort, since the encoding can become much smaller than the raw database. However, AprioriTID is slower than Apriori in the earlier passes because it uses more memory, so the two can be combined into AprioriHybrid [3]: Apriori is used in the initial passes, switching to AprioriTID once the encoding fits in memory. The switch does involve a cost.

SETM. SETM [10] uses general query languages such as SQL to mine association rules from large datasets in relational databases.

DIC. DIC [4], Dynamic Itemset Counting, begins counting k-itemsets at any appropriate point instead of always at the beginning of a pass, and stops counting once the itemsets have been counted over all transactions. It uses a prefix-tree.
Partition. Unlike Apriori, which counts the support of all candidates over the database to determine the frequent k-itemsets, Partition [12] intersects the tidlists of the (k−1)-candidates to generate the tidlists of the frequent k-itemsets. Since these intermediate results may take too much physical memory, Partition splits the raw dataset into several chunks and performs an extra scan to obtain the globally frequent itemsets.
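A sketch of the tidlist idea (our own toy illustration, not the exact data structures of [12]): the tidlist of an itemset is the intersection of the tidlists of its generating subsets, and its length is the support count, so no rescan of the raw database is needed.

# Hypothetical tidlists: item -> set of ids of transactions containing it.
tidlist = {
    "bread":   {1, 2, 4, 5},
    "milk":    {1, 3, 4, 5},
    "diapers": {2, 3, 4},
}

# Tidlist of the 2-itemset {bread, milk} by intersection; its size is
# the support count of the itemset.
t_bm = tidlist["bread"] & tidlist["milk"]
print(t_bm, len(t_bm))  # {1, 4, 5}, support 3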
2.2.3 Other algorithms
All algorithms introduced above traverse the search space using breadth-first search [9]. There are also algorithms that use depth-first search (DFS).

FP-growth. FP-growth [8] uses the frequent pattern tree (FP-tree), an extended prefix-tree structure that stores a highly condensed representation of the transaction data, and thus avoids the costly database scans in the subsequent mining process.

Eclat. Eclat [14] combines depth-first search with tidlist intersections by clustering itemsets using equivalence classes or maximal hypergraph cliques and then generating the true frequent itemsets using bottom-up, top-down or hybrid lattice traversal.

An important observation about all of these algorithms is that, although they employ different strategies, their runtime behaviors are quite similar; none of them fundamentally beats the others [9]. We will focus on parallel mining of association rules based on Apriori.
3 Parallel algorithms
There are two dominant approaches to parallel mining of association rules: on distributed memory systems and on shared memory systems. In a distributed memory system each processor has a private memory, while in a shared memory system all processors access a common memory.
3.1 Parallel mining of association rules on distributed memory systems
In a distributed memory system, each processor has its own local memory, which can be accessed directly only by that processor, and message passing is used for communication among processors. Thus, for parallel mining of association rules on a distributed memory system, we must consider the trade-offs between computation, communication, memory usage, synchronization and the use of problem-specific information in parallel data mining [1].

3.1.1 Apriori based
In [1] three algorithms are introduced: Count Distribution, Data Distribution and Candidate Distribution.

Count Distribution. In the Count Distribution algorithm, the raw dataset is distributed across the disks of all processors. Each processor generates the entire candidate k-itemset Ck from the frequent (k−1)-itemsets, and then performs a sum reduction to obtain the global counts by exchanging local counts with all other processors. It can then prune the candidate sets to get the frequent k-itemsets.
In the first pass (k = 1), each processor P^i dynamically generates its local candidate itemset C1^i depending on the items actually present in its local data partition D^i. The candidates counted by different processors may not be identical, so the local counts must be exchanged to determine the global C1. For passes k > 1:

1) Each processor P^i generates the complete Ck, using the complete frequent itemset Lk−1 created at the end of pass k−1.
2) Processor P^i makes a pass over its data partition D^i and develops local support counts for the candidates in Ck.
3) Processor P^i exchanges local Ck counts with all other processors to develop the global Ck counts. Processors are forced to synchronize in this step (see the sketch below).
4) Each processor P^i now computes Lk from Ck.
5) Each processor P^i independently makes the decision to terminate or continue to the next pass.
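The count exchange of step 3 maps naturally onto a single all-reduce. The fragment below is a sketch using mpi4py; it assumes every processor stores its local counts for Ck in the same (say, lexicographic) candidate order, which is our layout assumption for illustration, not a requirement stated in [1].

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def exchange_counts(local_counts):
    # local_counts: int64 array with one slot per candidate in Ck,
    # laid out in the same order on every processor.
    global_counts = np.zeros_like(local_counts)
    # Sum-reduction across all processors; every rank receives the totals.
    comm.Allreduce(local_counts, global_counts, op=MPI.SUM)
    return global_counts

# After counting its partition D^i, each rank calls exchange_counts and
# keeps the candidates whose global count reaches the minimum support.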
Data Distribution. One disadvantage of Count Distribution is that it does not exploit the aggregate memory of the system effectively, since the number of candidates that can be counted in one pass is determined by the memory size of each processor. In the Data Distribution algorithm each processor counts mutually exclusive candidates, so it exploits the total memory of the system better. However, it is a communication-happy algorithm, since it requires each processor to broadcast its local data to all other processors in every pass. According to the experiments in [1], the Data Distribution algorithm performs poorly compared to Count Distribution.

Candidate Distribution. The Candidate Distribution algorithm uses Count Distribution or Data Distribution until some pass l. In iteration l, it partitions the candidates so that each processor can generate disjoint candidates independently of the other processors. Thus each processor can work on a unique set of candidates without having to repeatedly broadcast the entire dataset. However, this algorithm also performs worse than Count Distribution, because it pays the cost of redistributing the dataset while scanning the local dataset partition repeatedly.

3.1.2 Other algorithms
Some other algorithms can be found in [11], [7], [5] and [6].
3.2 Parallel mining of association rules on shared memory systems

3.2.1 Shared memory Apriori-like algorithm
Recall that the sequential Apriori algorithm has two main steps, candidate generation and support counting. Let Ck denote the candidate itemsets of the kth pass and Lk the frequent itemsets of the kth pass. Ck can be generated by joining Lk−1 with itself; all infrequent itemsets in Ck are then eliminated to obtain Lk. Ck and Lk are kept lexicographically sorted. Lk−1 can be partitioned into equivalence classes according to common (k−2)-prefixes, and these classes can be distributed among the P processors and processed simultaneously (a sketch follows below). The partitioning also benefits the pruning step: instead of checking all k of the (k−1)-subsets, we now need to check only n−(k−2) subsets. The remaining problem to consider is load balancing. We can simply partition the classes in order, but this suffers from load imbalance; interleaved partitioning or bitonic partitioning can solve this problem.
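A minimal sketch of the prefix-based partitioning (our own illustration; the function name and representation are assumptions): itemsets of Lk−1 are grouped by their first k−2 items, and whole classes are assigned to processors round-robin (interleaved) to reduce imbalance.

from collections import defaultdict

def partition_classes(Lk1, P):
    # Lk1: list of (k-1)-itemsets, each a lexicographically sorted tuple.
    classes = defaultdict(list)
    for itemset in Lk1:
        classes[itemset[:-1]].append(itemset)  # group by (k-2)-prefix
    # Interleaved assignment of the classes to the P processors.
    assignment = [[] for _ in range(P)]
    for j, prefix in enumerate(sorted(classes)):
        assignment[j % P].append(classes[prefix])
    return assignment  # each processor self-joins only within its classes

For example, with L2 = [(a,b), (a,c), (a,d), (b,c)] there are two classes, prefix (a,) with three itemsets and prefix (b,) with one, and each class can generate its 3-candidates independently of the others.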
3.2.2 Multiple Local Parallel Tree (MLPT)
Noticing that the Apriori algorithm suffers from I/O problems, MLPT scans the database only twice to alleviate the I/O time. MLPT first scans the database to generate the ordered frequent-item list, which will be used to build a tree. Each processor is allocated the same number of transactions and computes the count of each item locally; the global counts are then computed by splitting the item list evenly among the processors, with each processor computing the global counts for the items allocated to it. Each processor is again assigned the same number of transactions for the second database scan. According to the item list generated in step one, a local FP-tree rooted at a null node is created, and a bottom-up traversal is then used to mine the association rules [13].
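For concreteness, the following is a compact sketch of the FP-tree construction that MLPT-style algorithms build per processor (a simplification for illustration; the header table, node links and the mining traversal of [13] are omitted):

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(D, minsup):
    # Pass 1: global item counts; keep frequent items, most frequent first.
    freq = {}
    for t in D:
        for i in t:
            freq[i] = freq.get(i, 0) + 1
    order = {i: r for r, i in enumerate(
        sorted((i for i in freq if freq[i] >= minsup),
               key=lambda i: -freq[i]))}
    # Pass 2: insert each transaction's frequent items, in that global
    # order, along a shared prefix path of the tree.
    root = Node(None, None)
    for t in D:
        node = root
        for i in sorted((i for i in t if i in order), key=order.get):
            if i in node.children:
                node.children[i].count += 1
            else:
                node.children[i] = Node(i, node)
            node = node.children[i]
    return root

Only two database scans are needed, matching the I/O behavior described above.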
4 Our approach
Having surveyed these algorithms, this paper mainly focuses on distributed memory parallelism. We will implement the Apriori-based Count Distribution algorithm on our CGM1 system. This cluster is configured with 32 nodes, and each node consists of two processors which share the same resources within the node. We choose the Count Distribution algorithm because it is simple to implement and requires the least communication among nodes. Because it splits the transactions according to the number of processors, it may suffer from an imbalanced workload, and as we have seen before, it also suffers from repeated I/O operations. Since in our CGM1 system the two processors of a node share the same resources, we can exploit this feature: in our approach, we also employ the shared memory Apriori algorithm within a node to compute Ck and Lk. Furthermore, the local counts can also be computed in parallel within a node, as sketched below.
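The following is a hypothetical sketch of this hybrid scheme, assuming mpi4py for message passing between nodes and a two-worker process pool for the two processors inside a node; the names count_chunk and hybrid_pass are ours, and in practice spawning worker processes under MPI needs some care.

from multiprocessing import Pool
from mpi4py import MPI
import numpy as np

def count_chunk(args):
    chunk, candidates = args
    counts = np.zeros(len(candidates), dtype=np.int64)
    for t in chunk:                 # count candidates over this half
        for j, c in enumerate(candidates):
            if c <= t:              # candidate itemset contained in t
                counts[j] += 1
    return counts

def hybrid_pass(local_D, candidates, comm):
    # Split this node's partition between its two processors.
    half = len(local_D) // 2
    with Pool(2) as pool:
        parts = pool.map(count_chunk, [(local_D[:half], candidates),
                                       (local_D[half:], candidates)])
    local = parts[0] + parts[1]
    # Count Distribution step: sum-reduce the counts across all nodes.
    glob = np.zeros_like(local)
    comm.Allreduce(local, glob, op=MPI.SUM)
    return glob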
References

[1] R. Agrawal and J. C. Shafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8:962–969, 1996.

[2] Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In Peter Buneman and Sushil Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207–216, Washington, D.C., 1993.

[3] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), pages 487–499. Morgan Kaufmann, 1994.

[4] Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. Dynamic itemset counting and implication rules for market basket data. In Joan Peckham, editor, Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, USA, pages 255–264. ACM Press, 1997.

[5] D. W. Cheung, J. Han, V. Ng, A. W. Fu, and Y. Fu. A fast distributed algorithm for mining association rules. In PDIS: International Conference on Parallel and Distributed Information Systems. IEEE Computer Society Technical Committee on Data Engineering, and ACM SIGMOD, 1996.

[6] David Wai-Lok Cheung and Yongqiao Xiao. Effect of data skewness in parallel mining of association rules. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 48–60, 1998.

[7] Eui-Hong Han, George Karypis, and Vipin Kumar. Scalable parallel data mining for association rules. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, pages 277–288, 1997.

[8] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. In Weidong Chen, Jeffrey Naughton, and Philip A. Bernstein, editors, Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 1–12. ACM Press, 2000.

[9] Jochen Hipp, Ulrich Güntzer, and Gholamreza Nakhaeizadeh. Algorithms for association rule mining – a general survey and comparison. SIGKDD Explorations, 2(1):58–64, July 2000.

[10] Maurice A. W. Houtsma and Arun N. Swami. Set-oriented mining for association rules in relational databases. In Philip S. Yu and Arbee L. P. Chen, editors, Proceedings of the Eleventh International Conference on Data Engineering, Taipei, Taiwan, pages 25–33. IEEE Computer Society, 1995.

[11] Jong Soo Park, Ming-Syan Chen, and Philip S. Yu. Efficient parallel data mining for association rules. In Proceedings of the Fourth International Conference on Information and Knowledge Management, pages 31–36. ACM Press, 1995.

[12] Ashoka Savasere, Edward Omiecinski, and Shamkant B. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB), pages 432–444, 1995.

[13] O. Zaïane, M. El-Hajj, and P. Lu. Fast parallel association rule mining without candidacy generation, 2001.

[14] Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, and Wei Li. New algorithms for fast discovery of association rules. Technical Report TR651, University of Rochester, 1997.