Efficiently Mining Frequent Patterns from Dense Datasets using a Cluster of Computers

Yudho Giri Sucahyo 1, Raj P. Gopalan 1, Amit Rudra 2

1 Department of Computing, Curtin University of Technology
2 School of Information Systems, Curtin University of Technology
Kent St, Bentley, Western Australia 6102

{sucahyoy, raj}@computing.edu.au, [email protected]
Abstract. Efficient mining of frequent patterns from large databases has been an active area of research, since it is the most expensive step in association rule mining. In this paper, we present an algorithm for finding complete frequent patterns from very large dense datasets in a cluster environment. The data needs to be distributed to the nodes of the cluster only once, and the mining can then be performed in parallel many times with different parameter settings for minimum support. The algorithm is based on a master-slave scheme in which a coordinator controls the data-parallel programs running on a number of nodes of the cluster. The parallel program was executed on a cluster of Alpha SMPs. The performance of the algorithm was studied on small and large dense datasets. We report the results of experiments that show both the speed up and scale up of our algorithm, along with our conclusions and pointers for further work.
1. Introduction

Discovering association rules from large databases has been a topic of considerable research interest for nearly a decade. Most of the research effort has focused on the computationally expensive step of mining frequent patterns that satisfy the support constraints specified by users. Though the problem is intractable in the worst case, efficient mining algorithms have been developed for many practical datasets, such as typical market basket data. However, several datasets of current interest, such as DNA sequences, proteins and sales transactions with large time windows, contain dense data with long patterns, and they significantly slow down the current algorithms. The processing time for dense data also increases almost exponentially as the support level of frequent patterns is reduced. Performing the mining tasks in parallel appears to be a logical and inexpensive way of extending the range of data that can be mined efficiently.

Zaki [1] has presented an excellent survey of parallel mining of association rules, though there has been relatively little work in this area so far. Parallel association mining algorithms have been developed for distributed memory, shared memory and hierarchical systems. Most of them are in fact extensions of sequential algorithms. For example, Count Distribution is based on Apriori, ParEclat on Eclat, and APM on DIC. Recent progress in parallel computing using clusters of PCs and workstations provides an alternative to the more expensive conventional parallel processors [2]. Besides the lower cost, a cluster may be chosen for other advantages such as performance, development tools, communication bandwidth, integration (clusters are easier to integrate into existing networks), price/performance and scaling (clusters can be easily enlarged). Research on parallel data mining using clusters, while not new, is quite limited. The previous work on the use of clusters for association rule mining has studied mainly the scale up and speed up of Apriori-based algorithms [3].

Han et al. proposed the pattern-growth approach for mining frequent patterns in [4] as a better alternative to Apriori-like algorithms. We improved the pattern-growth approach by using a compact data representation [5]. Our algorithm is significantly more efficient than FP-Growth [4], and scales up to very large datasets. It is an interesting challenge to develop a parallel pattern-growth algorithm using our compact data representation on the shared-nothing nodes of a cluster of workstations.

In this paper, we present such an algorithm for mining frequent patterns. We have developed a data distribution scheme that supports flexible interactive mining while minimizing the data transfer. The transaction database is partitioned by projections suitable for parallel mining, described later. Each projection is stored on the local disk of a node in the cluster. The projections are distributed in a round-robin fashion to balance the load on the nodes during mining. However, the size of individual projections will vary depending on the data characteristics of a given transaction database. The total number of projections corresponds to the number of different items present in the database. As the number of nodes is usually much smaller than the number of projections, several projections are stored at each node. The data is distributed only once, but mining can be carried out as many times as required using different minimum support levels and other relevant parameters.

The projections at each node of the cluster are mined independently. The mining algorithm uses the compact prefix tree representation originally described in [5] for storing data in memory. The details of data distribution and the mining algorithm are given in later sections.

The performance of our algorithm was studied in two stages. First, the algorithm that executes at each node was compared against other significant sequential algorithms, including OpportuneProject (OP) [6], FP-Growth [4] and the best available implementation of Apriori [7], using a number of widely used dense test datasets. Then the performance of parallel execution of our algorithm was studied on the same datasets at lower support levels. This approach is justified since mining is a highly computation-intensive process at low support levels on dense datasets. The scalability of the algorithm was further tested using very large dense datasets generated with the synthetic data generator [8]. These datasets had characteristics similar to Connect-4, which is a widely used small dense dataset. The performance results indicate that our algorithm is faster than other significant algorithms on dense datasets and is scalable to very large dense datasets.

The structure of the rest of this paper is as follows: In Section 2, we define relevant terms used in association rule mining, discuss the density measure and present the compact transaction tree. In Section 3, we present the data distribution scheme and the parallel algorithm. The experimental results on various datasets are presented in Section 4. Section 5 contains conclusions and pointers for further work.
2. Preliminaries

In this section, we define the terms used for describing association rule mining, discuss a simple density measure for datasets and describe the compact transaction tree used for grouping transactions with common subsets of items.

2.1 Definition of Terms

We give the basic terms needed for describing association rules using the formalism of [9]. Let I = {i1, i2, ..., in} be a set of items, and D be a set of transactions, where a transaction T is a subset of I (T ⊆ I). Each transaction is identified by a TID. An association rule is an expression of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. Note that each of X and Y is a set of one or more items, and the quantity of each item is not considered. X is referred to as the body of the rule and Y as the head. An example of an association rule is the statement that 80% of transactions that purchase A also purchase B, and 10% of all transactions contain both of them. Here, 10% is the support of the itemset {A, B} and 80% is the confidence of the rule A ⇒ B. An itemset is called a large itemset or frequent itemset if its support is at least the support threshold specified by the user; otherwise, the itemset is small or not frequent. An association whose confidence is at least the confidence threshold is considered a valid association rule.

2.2 A Simple Density Measure for Datasets

As mentioned in [9], the transactions in the database can be represented by a binary table as shown in Fig. 1. Counting the support of an item is the same as counting the number of 1s for that item across all the transactions. If a dataset contains more 1s than 0s, it can be considered a dense dataset; otherwise, it is a sparse dataset. In this paper, we use the percentage of 1s in the total of 1s and 0s as the density measure. Given two datasets, we can determine which of them is relatively denser from the ratio of 1s to the total of 1s and 0s in each.
Tid   Items          1  2  3  4  5
 1    1 2 3 5        1  1  1  0  1
 2    2 3 4 5        0  1  1  1  1
 3    1 2 4 5        1  1  0  1  1
 4    3 4 5          0  0  1  1  1
 5    3 4 5          0  0  1  1  1
 6    1 2 3 4 5      1  1  1  1  1
 7    1 2 3 5        1  1  1  0  1
 8    1 2 4 5        1  1  0  1  1

Fig. 1. The transaction database (left: transactions; right: binary representation with one column per item)
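Stated formally, the support and confidence just defined, together with the density measure of Section 2.2, can be written as follows. This is a standard rendering we add for clarity; the notation supp and conf is conventional rather than taken from [9]:

    supp(X) = |{T ∈ D : X ⊆ T}| / |D|
    conf(X ⇒ Y) = supp(X ∪ Y) / supp(X)
    density(D) = (Σ_{T ∈ D} |T|) / (|D| · |I|) × 100%

For the database of Fig. 1, the density is 31 / (8 × 5) = 77.5%, so it is a dense dataset by this measure.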
2.3 Compact Transaction Tree

A compact transaction tree was introduced in [5] to group transactions that contain either identical sets of items or some subsets of items in common. The compact tree has only about half the number of nodes of the prefix trees used by various well-known algorithms such as FP-Growth. Fig. 2a shows a complete prefix tree with 4 items, assuming the transaction count to be one at every node. The corresponding compact tree is in Fig. 2b. A complete prefix tree has many identical subtrees in it. In Fig. 2a, we can see three identical subtrees st1, st2, and st3. In the compact tree, we store the information of identical subtrees together. Given a set of n items, a prefix tree would have a maximum of 2^n nodes, while the corresponding compact tree will contain a maximum of 2^(n-1) nodes, which is half the maximum for a full tree. In practice, the number of nodes for a transaction database will need to be far less than the maximum for the problem of frequent itemset discovery to be tractable.

The nodes along the leftmost path of the compact tree have entries for all itemsets corresponding to the paths ending at each of the nodes. Other nodes only have itemsets that are not present in the leftmost path, to avoid duplicates. For example, node 4 in the leftmost path has itemsets 1234, 234, 34, and 4 (we omit the set notation for simplicity). Node 4 in the path 134 does not have itemsets 34 and 4, since they have been registered at node 4 in the leftmost path.

[Figure: panel (a) shows a full prefix tree over items 1-4 with the identical subtrees st1, st2 and st3 marked; panel (b) shows the corresponding compact prefix tree with the itemsets registered at each node.]
Fig. 2. Compact Transaction Tree: (a) Identical Subtrees in a Full Prefix Tree; (b) Compact Prefix Tree
Attached to each node is an array that represents the counts of transactions for different item subsets. The array index represents the level of the node containing the starting item of the subset in the tree, and the value in each cell is the transaction count. The dotted rectangles in Fig. 2b show the itemsets corresponding to the nodes in the compact tree for illustration only (they are not part of the data structure). Similarly, the index entries are implicit, and only the transaction count entries need to be explicitly stored.
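To make the structure concrete, the following is a minimal C++ sketch of a compact tree node, assuming a simple child map and a count array indexed by starting level. The type and member names are our own illustration, not the paper's implementation:

    #include <cstddef>
    #include <map>
    #include <vector>

    // One node of a compact transaction tree (illustrative sketch).
    // counts[level] is the number of transactions whose itemset starts
    // at the node found at 'level' on the path from the root and ends here.
    struct CompactTreeNode {
        int item;                               // item stored at this node
        std::vector<int> counts;                // counts indexed by starting level
        std::map<int, CompactTreeNode*> child;  // children keyed by item

        CompactTreeNode* getChild(int it) {
            auto pos = child.find(it);
            if (pos != child.end()) return pos->second;
            CompactTreeNode* n = new CompactTreeNode{it, {}, {}};
            child[it] = n;
            return n;
        }

        // Record one occurrence of an itemset that starts at 'startLevel'
        // on the path from the root and ends at this node.
        void addCount(std::size_t startLevel) {
            if (counts.size() <= startLevel) counts.resize(startLevel + 1, 0);
            counts[startLevel] += 1;
        }
    };

Only the count array is stored per node; as noted above, the starting level is implicit in the array index.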
3. Mining Large Dense Datasets on a Cluster of Computers

Our algorithm for mining large dense datasets in a cluster environment consists of two major components: (1) a partitioning scheme for distributing the large dataset to different nodes, and (2) an efficient main-memory based procedure for mining the partition at each node. In this section, we first describe the partitioning scheme. It is followed by the data structures and algorithm for in-memory mining at each node. Then the cluster algorithm, based on the master-slave scheme of a coordinator and a set of participant nodes, is described.

3.1 Partitioning the Transaction Database

The entire database is first partitioned and distributed to the nodes of the cluster. If there are a total of N distinct items in the database, N-1 projections will be created. These projections are allocated to the m nodes of the cluster in a round-robin fashion. For each item i in a transaction, we copy all items that occur after i (in lexicographic order) to projection i. Therefore, a given transaction could be present in many projections. For example, if we have items 1 2 3 5 in a transaction, we put the transaction 1 2 3 5 in the projection for item 1, the transaction 2 3 5 in the projection for item 2, and the transaction 3 5 in the projection for item 3.

Tid  Items
 1   1 2 3 5
 2   2 3 4 5
 3   1 2 4 5
 4   3 4 5
 5   3 4 5
 6   1 2 3 4 5
 7   1 2 3 5
 8   1 2 4 5

Projection 1: 1 2 3 5; 1 2 4 5; 1 2 3 4 5; 1 2 3 5; 1 2 4 5
Projection 2: 2 3 5; 2 3 4 5; 2 4 5; 2 3 4 5; 2 3 5; 2 4 5
Projection 3: 3 5; 3 4 5; 3 4 5; 3 4 5; 3 4 5; 3 5
Projection 4: 4 5; 4 5; 4 5; 4 5; 4 5; 4 5

/* Input: DB  Output: ItemTable */
Procedure ParseDB
    For each transaction in DB
        For each item in the transaction
            If item in ItemTable
                Increment count of item
            Else
                Insert item with count = 1
            End If
        End For
    End For
    Sort ItemTable by frequency ascending and assign an index number to each item
    Write ItemTable to frequency file
    For each transaction in DB
        Map each item in the transaction according to its index number
        Sort the transaction based on index
        For each index number in the transaction
            Write the suffix of the transaction starting from the current index to projection[index]
        End For
    End For

Fig. 3. Parallel Projection Example and Algorithm
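The projection-writing pass of ParseDB can be rendered compactly in C++. The following sketch assumes the items of each transaction have already been mapped to integer indexes and sorted; the file-naming scheme is our own illustration:

    #include <cstdio>
    #include <string>
    #include <vector>

    // Write each suffix of a (sorted, index-mapped) transaction to the
    // projection file of its starting item, as in Fig. 3. The suffix
    // consisting of the last item alone is not written, since a 1-item
    // transaction yields no new patterns (hence N-1 projections).
    void writeProjections(const std::vector<std::vector<int>>& transactions) {
        for (const auto& trans : transactions) {
            for (std::size_t i = 0; i + 1 < trans.size(); ++i) {
                std::string name = "projection_" + std::to_string(trans[i]) + ".dat";
                std::FILE* f = std::fopen(name.c_str(), "a");
                if (!f) continue;
                for (std::size_t j = i; j < trans.size(); ++j)
                    std::fprintf(f, "%d ", trans[j]);
                std::fprintf(f, "\n");
                std::fclose(f);
            }
        }
    }

A production version would keep the projection files open rather than reopening them for every suffix; the append-per-suffix form is shown only for brevity.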
Fig. 3 shows the contents of each projection for the sample database following the parallel projection step, together with the algorithm. The frequency of all items in the database is kept in a frequency file that is replicated at all the nodes. The frequency file is used to construct the compact tree along with the associated ItemTable for each projection (see Section 3.2). The main advantage of this scheme is that the entire database is partitioned into parallel projections and distributed to the nodes only once. The projections can be subsequently mined many times with different parameter settings of support and confidence, as desired by the user. However, the total size of the projections will be larger than the size of the original database.

3.2 Mining Frequent Itemsets from a Projection

The database partition at a node may consist of several projections, since the number of nodes in the cluster is usually much smaller than the number of items. Mining a partition involves mining all the projections of that partition. Each projection can be mined independently. Therefore, if a node has multiple processors, as in the case of Alpha SMPs, each processor can mine a different projection in parallel. Mining a projection consists of three steps: (1) constructing a compact tree for the projection; (2) mapping the compact tree to a special data structure named Item-TransLink (ITL); and (3) mining the frequent itemsets from ITL. These steps are described in detail in the following subsections.

3.2.1 Building a Compact Transaction Tree for a Projection

The transactions of each projection are represented in a compact tree as discussed in Section 2.3. As transactions containing common items are grouped together in the compact tree, the cost of traversing the transactions is reduced in the mining phase.
[Figure: the ItemTable for the projection (items 1-5 with the global counts 5, 6, 6, 6, 8 from the frequency file, each with a pointer to its subtree, PST) and the compact transaction tree of projection 1, with (level, count) entries attached to the nodes.]
Fig. 4. Compact Transaction Tree of a Projection
There are two steps in constructing the compact transaction tree: (1) read the frequency file (mentioned in Section 3.1) to identify the individually frequent (1-freq) items, since only 1-freq items are needed to mine frequent patterns; (2) using the output of the first step, read only the 1-freq items from the transactions in the projection, and insert the transactions into the compact transaction tree.

In the first step, the 1-freq items are entered in the ItemTable. After step 2, each node of the tree contains a 1-freq item and a set of counts indicating the number of transactions that contain subsets of the items in the path from the root, as mentioned in Section 2.3. Consider the first projection of the sample database in Fig. 3 and let the minimum support be 5 transactions. The ItemTable and compact tree for the projection are shown in Fig. 4. The ItemTable contains a pointer to the subtree (PST) of each item. Each node of the tree has additional entries to keep the count of transactions represented by the node. For example, the entry (0, 1) at the leaf node of the leftmost branch of the tree represents the itemset 12345 that occurs once in this projection. The first number, 0, indicates the starting level in the tree, which makes 1 its first item. The second number is the count of the transaction. Similarly, the entry (0, 2) at the leaf node of the rightmost branch of the tree means that there are two transactions of 1245. In the implementation of the tree, the counts at each node are stored in an array, and the level of an entry is the array index, which is not stored explicitly.
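As a concrete illustration of the two steps, the following C++ sketch reads a frequency file of item/count pairs and filters a transaction down to its 1-freq items before insertion into the tree. The file format and function names are assumptions for illustration only:

    #include <cstdio>
    #include <set>
    #include <vector>

    // Step 1: read the replicated frequency file and keep only the items
    // whose global count meets the minimum support.
    std::set<int> readFrequentItems(const char* path, int minSupport) {
        std::set<int> freq;
        std::FILE* f = std::fopen(path, "r");
        if (!f) return freq;
        int item, count;
        while (std::fscanf(f, "%d %d", &item, &count) == 2)
            if (count >= minSupport) freq.insert(item);
        std::fclose(f);
        return freq;
    }

    // Step 2 (filtering part): drop the non-frequent items from a
    // transaction before inserting it into the compact transaction tree.
    std::vector<int> filterTransaction(const std::vector<int>& trans,
                                       const std::set<int>& freq) {
        std::vector<int> kept;
        for (int it : trans)
            if (freq.count(it)) kept.push_back(it);
        return kept;
    }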
3.2.2 Mapping the Compact Tree to Item-TransLink (ITL)

The performance of mining can be improved by reducing the number of columns traversed in the conceptual binary representation of transactions when counting the support of itemsets. We use group intersection in our algorithm to compute the support of itemsets. To map groups in the tree to an array representation for faster group intersections, we had previously developed a data structure called Item-TransLink (ITL) that has features of both vertical and horizontal data layouts (see Fig. 5a).

[Figure: panel (a) shows the ITL structure, with the ItemTable (items 1-5, counts 5, 6, 6, 6, 8) linked to the transaction groups t1, t2, t3 in TransLink. Panel (b) tabulates the recursive mining for prefix 1:]

Prefix   TempList (count)              Patterns (count)
1        3 (3), 4 (3), 2 (5), 5 (5)    1 (5), 1 2 (5), 1 5 (5)
1 2      3 (3), 4 (3), 5 (5)           1 2 5 (5)

Fig. 5. The Item-TransLink (ITL) Data Structure and Mining Frequent Patterns: (a) the Item-TransLink (ITL) data structure; (b) mining frequent patterns recursively (support of each pattern shown in brackets)
The main features of ITL are described below:
1. Item identifiers are mapped to a range of integers, and transaction identifiers are ignored, as the items of each transaction are linked together.
2. ITL consists of the ItemTable and the transactions of the current projection linked to it (TransLink).
3. The ItemTable contains all frequent items of the database. Each item is stored with its support count and a link to the first occurrence of that item in TransLink.
4. A row in TransLink represents a group of transactions along with a count of occurrences of each transaction in the current projection. The items of a row are arranged in sorted order, and each item is linked to the next row where it occurs.
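A minimal C++ sketch of ITL under these four points might look as follows; the type and field names are our own illustration, not the paper's code:

    #include <vector>

    // A cell in a TransLink row: an item plus the index of the next row
    // in which the same item occurs (-1 if none).
    struct ItlCell {
        int item;
        int nextRow;
    };

    // One row of TransLink: a group of identical transactions together
    // with per-level occurrence counts (as mapped from the compact tree).
    struct TransLinkRow {
        std::vector<ItlCell> items;   // items in sorted order
        std::vector<int> counts;      // counts indexed by starting level
    };

    // ItemTable entry: a frequent item, its support count, and the index
    // of the first TransLink row in which it occurs (-1 if none).
    struct ItemEntry {
        int item;
        int support;
        int firstRow;
    };

    struct ITL {
        std::vector<ItemEntry> itemTable;
        std::vector<TransLinkRow> transLink;
    };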
Fig. 5a shows the result of mapping the compact tree of Fig. 4 to ITL. For example, transaction group t2 in the TransLink represents the transaction 1235, which occurs twice. In large datasets, the count of each transaction group is normally much higher, and a number of subgroups will be present too. In the implementation of ITL, the count entries of each row are stored in an array, so the array index is not stored explicitly. The dotted rectangles that show the index at each row are only for illustration and not part of the data structure.

Procedure Par-ITL
    Read the frequency file
    If there is any frequent item
        ConstructTree
        ConstructITL
        MineFI
    End If

/* Input: projection  Output: compact trans tree */
Procedure ConstructTree
    For each transaction in the projection
        For each frequent item in the transaction
            Insert the item into the tree
        End For
    End For

/* Input: trans tree  Output: ITL */
Procedure ConstructITL
    For each path in the tree traversed by depth-first search
        Map the path as an entry in TransLink
        Attach the count entries of the path
        Establish links in TransLink to previous occurrences of each item
    End For
/* Input: ITL  Output: Freq Patterns */
Procedure MineFI
    Initialize Freq_Patterns
    For each frequent item x ∈ ItemTable
        Freq_Patterns := Freq_Patterns ∪ x
        Prepare and fill tempList for x
        Sort tempList by count ascending
        For each frequent item y ∈ tempList
            Freq_Patterns := Freq_Patterns ∪ xy
            For each frequent item z ∈ tempList after y
                RecMine(xy, z)
            End For
        End For
    End For

Procedure RecMine(prefix, testItem)
    tlp := group-list of prefix
    tli := group-list of testItem
    tl_current := Intersect(tlp, tli)
    If size(tl_current) ≥ min_sup
        new_prefix := prefix ∪ testItem
        Freq_Patterns := Freq_Patterns ∪ new_prefix
        For each frequent item z ∈ tempList after testItem
            RecMine(new_prefix, z)
        End For
    End If
Fig. 6. Algorithm to Mine Frequent Patterns in a Projection
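The Intersect step in RecMine computes the transaction groups shared by the prefix and the test item. A minimal C++ sketch, under the simplifying assumption that each group-list is a list of (group id, count) pairs sorted by group id and that support is the sum of the surviving counts, is shown below; this representation is our own simplification of the paper's group lists:

    #include <cstddef>
    #include <utility>
    #include <vector>

    using GroupList = std::vector<std::pair<int, int>>;  // (group id, count)

    // Intersect two group-lists sorted by group id; a group survives only
    // if it appears in both lists, keeping its transaction count (both
    // lists reference the same underlying TransLink row counts).
    GroupList intersect(const GroupList& a, const GroupList& b) {
        GroupList out;
        std::size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i].first < b[j].first) ++i;
            else if (a[i].first > b[j].first) ++j;
            else { out.push_back(a[i]); ++i; ++j; }
        }
        return out;
    }

    // Support of a pattern is the total transaction count over its groups.
    int support(const GroupList& gl) {
        int s = 0;
        for (const auto& g : gl) s += g.second;
        return s;
    }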
3.2.3 Mining Frequent Patterns in a Projection

Each item in the ItemTable is used as a starting point to mine all longer frequent patterns for which it is a prefix. Fig. 6 shows the algorithm for mining individual projections. For example, starting with item 1, the vertical link is followed to get all other items that occur together with item 1. These items, with their support counts and transaction-count lists, are entered in a table named TempList, as in Fig. 5b (the transaction lists are not shown in the figure). For prefix 1, items {2, 5} are frequent (count ≥ 5). Therefore, 1 2 and 1 5 are frequent itemsets. After generating the 2-frequent-itemsets for prefix 1, the associated group lists of the items in the TempList are used to recursively generate frequent sets of 3 items, 4 items and so on, by intersecting the transaction-count lists. For example, the transaction-count list of 1 2 is intersected with that of 1 5 to generate the frequent itemset 1 2 5. At the end of the recursive calls with item 1, all frequent patterns that contain item 1 would be generated: 1 (5), 1 2 (5), 1 5 (5), 1 2 5 (5).

3.3 Parallel Mining in a Cluster Environment

We now describe the framework for mining frequent patterns using the multiple processors of a cluster of Alpha 21264C computers. It is based on a master-slave architecture using Message Passing Interface (MPI) [10] calls for distributing the tasks to the various nodes. The coordinator node (master) assigns mining tasks to the participant nodes (slaves). The parallel program running on each participant node performs the following tasks:
1. Perform disk I/O: It opens and reads the dataset projections allocated to it (in the earlier step described in Section 3.1). These projections are already present on the local disk of the node.
2. Mine dataset projections for frequent itemsets: It mines the allocated projections for frequent itemsets as described in Section 3.2. As the projections can be mined independently, the mining activity at each node of the cluster is not dependent on other nodes. Once a mining task has been allocated to a node by the coordinator, there is no need for further communication during the execution.
3. Create datasets of frequent itemsets: The frequent itemsets mined at each node are written to the local disk. They are copied to another specified directory as a post-step.

Fig. 7 outlines the algorithm used for parallel execution of the tasks at all nodes.

MPI_Initialize
Find my rank
If coordinator
    Start timing work
    Distribute work to participants
Else if participant
    Mine allocated dataset projections
End If
Time end of work
Finalize work
Fig. 7. Scheme for Distributing the Work among Nodes
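A minimal MPI rendering of this coordinator/participant scheme in C++ is given below. It assumes projections are identified by integer indexes allocated round-robin to the participants; the message layout and the mineProjection stub are our own illustration, not the paper's code:

    #include <mpi.h>

    // Stub for the per-projection mining of Section 3.2: open the
    // projection file on local disk, build the compact tree, map it to
    // ITL and run MineFI. Omitted here.
    void mineProjection(int projectionId) { (void)projectionId; }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int numProjections = 128;  // e.g., N-1 projections for N items

        double start = MPI_Wtime();
        if (rank == 0) {
            // Coordinator: tell each participant how many projections exist;
            // the round-robin allocation is implicit in the indexes.
            for (int p = 1; p < size; ++p)
                MPI_Send(&numProjections, 1, MPI_INT, p, 0, MPI_COMM_WORLD);
        } else {
            int n;
            MPI_Recv(&n, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            // Participant: mine the projections stored on this node's disk.
            for (int p = rank - 1; p < n; p += size - 1)
                mineProjection(p);
        }
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0) {
            double elapsed = MPI_Wtime() - start;
            (void)elapsed;  // report timing as in Fig. 7
        }
        MPI_Finalize();
        return 0;
    }

Note that, as stated above, no communication is needed during mining itself; the only messages are the initial work assignment and the final synchronization.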
4. Performance Study

In this section, we report the performance of our algorithm, which was studied in two stages. First, a sequential version of the algorithm was compared against other well-known efficient algorithms, including FP-Growth, OpportuneProject (OP), Eclat, and the fastest available implementation of Apriori, on small dense datasets. Then the performance of parallel execution of our algorithm was studied on the same datasets at lower support levels. We adopted this approach since mining is a highly computation-intensive process at low support levels on dense datasets, and therefore the performance gain from parallel processing should be apparent on these datasets. The scalability of the algorithm was further tested using very large dense datasets generated with the synthetic data generator [8].
[Figure: runtimes (seconds, logarithmic scale) versus support threshold (%) on Connect-4 and Chess for Apriori, Eclat, FP-Growth, OP and Par-ITL, with the density measure overlaid.]
Fig. 8. Comparison of Par-ITL with Sequential Algorithms on Connect-4 and Chess

[Figure: runtimes (seconds) versus number of CPUs (4-32) on Connect-4 (supports 40 and 50) and Chess (support 40).]
Fig. 9. Parallel Execution of Par-ITL on Connect-4 and Chess

[Figure: (a) runtime (seconds) versus number of CPUs (4-32) on a dataset of 3 million transactions; (b) runtime (seconds, logarithmic scale) versus number of transactions (1-11 million) at support 60 and density 67%.]
Fig. 10. Parallel Execution on Large Datasets: (a) 3 million transactions; (b) scalability at support 60, density 67%
The sequential programs in the performance study were all written in C++ and compared on an 866 MHz Pentium III PC with 512 MB RAM and a 30 GB hard disk, running MS Windows 2000. The clusters used for testing the parallel program had 4 to 32 Alpha 21264C processors running Tru64 UNIX, each node with a hard disk capacity of 36 GB, connected by an Elan3 fast interconnect. However, due to resource restrictions imposed by the facility, we performed the tests with a maximum of 32 processors only. For parallel mining, we used C++ along with MPI routines to perform message passing between the processors.

The small dense datasets used to test the performance were Chess (3196 transactions, 75 items, average transaction length 37) and Connect-4 (67557 transactions, 129 items, average transaction length 43) from the Irvine Machine Learning Database Repository [11]. To show the scalability of the algorithm, we generated large dense datasets ranging from 1 to 11 million transactions using the synthetic data generator [8]. The characteristics of these datasets are similar to Connect-4, with 129 distinct items and an average transaction length of 43. The density of the datasets was 67% at support 60.

As shown in Fig. 8, the sequential version of our program performs better than all the other sequential algorithms on Connect-4 and Chess at supports of 70-90. At lower support levels, its performance is better than all others except OP. In general, our algorithm performs best at a density of 50% or higher. As the support threshold gets lower, the relative performance of OP improves.

Fig. 9 shows the speed up achieved on the Connect-4 and Chess datasets by the different runs of our parallel mining program using various configurations of processors. For the Connect-4 dataset, we also see that the speed up is consistent for the support levels of 40 and 50 used in the performance study. We observe that a significant speed up was achieved when the configuration was changed from 4 processors to 8 and then to 12 processors. However, beyond 12 processors no further speed up is achieved on Connect-4. The time taken for mining on a processor includes the fixed time of setting up the task and the variable time of mining the given set of projections. By using additional processors, the variable time for mining is shared among more processors, but the fixed time remains unchanged. So adding more processors beyond a certain number does not reduce the run time.

Fig. 10a shows the performance of the parallel algorithm on a medium-sized dataset (3 million transactions, 129 items) and Fig. 10b shows the scalability of our algorithm to datasets of 1-11 million transactions. These results indicate that our algorithm is scalable to very large dense datasets.
5. Conclusions

In this paper, we have presented an algorithm for mining complete frequent patterns from very large dense datasets using a cluster of computers. The algorithm implements a parallel projection approach for partitioning the transaction data. We compared the sequential version of the algorithm against other well-known algorithms. Our algorithm outperformed algorithms such as OP, FP-Growth, Eclat and Apriori on small dense datasets at most support levels. We further showed the speed up and scale up of our algorithm on small and large dense datasets on clusters with varying numbers of computers. Significant speed up was achieved by parallel mining on a cluster of multiprocessors. However, for the test datasets we used, the speed up curve flattened beyond a certain number of processors, though the minimum number of processors that gave the maximum speed up varied with each dataset and support level. We plan to study this characteristic further to determine the factors that affect the speed up of mining algorithms on clusters.

The present implementation assumes that each projection will fit into the available memory when compressed using the compact tree. Very large projections will need to be partitioned further into smaller sub-projections. This is a necessary extension of the program for dealing with huge datasets. The parallel projection scheme we used is flexible but requires a large amount of disk space at each node, particularly for dense datasets. Alternative schemes that significantly reduce the disk space requirements are being investigated.
6. Acknowledgments This work was supported by a grant of machine time on the Australian Partnership for Advanced Computing (APAC) national facilities [12]. We especially thank APAC for the pro-active technical support they provide. We also thank Christian Borgelt for providing Apriori and Eclat, Jian Pei for the executable code of FP-Growth and Junqiang Liu for providing us the OpportuneProject program.
References

1. Zaki, M.J.: Parallel and Distributed Association Mining: A Survey. IEEE Concurrency (Special Issue on Data Mining) (Oct/Dec 1999) 14-25
2. Baker, M., Buyya, R.: Cluster Computing: The Commodity Supercomputing. Software-Practice and Experience 1(1) (Jan 1999) 1-26
3. Jin, R., Agrawal, G.: An Efficient Association Mining Implementation on a Cluster of SMPs. Proc. of the Workshop on Parallel and Distributed Data Mining (PDDM) (2001)
4. Han, J., Pei, J., Yin, Y., Mao, R.: Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. To appear in Data Mining and Knowledge Discovery: An International Journal, Kluwer Academic Publishers (2003)
5. Gopalan, R.P., Sucahyo, Y.G.: Improving the Efficiency of Frequent Pattern Mining by Compact Data Structure Design. Proc. of the 4th Int. Conf. on Intelligent Data Engineering and Automated Learning (IDEAL), Hong Kong (2003)
6. Liu, J., Pan, Y., Wang, K., Han, J.: Mining Frequent Item Sets by Opportunistic Projection. Proc. of ACM SIGKDD, Edmonton, Alberta, Canada (2002)
7. http://fuzzy.cs.uni-magdeburg.de/~borgelt/
8. http://www.almaden.ibm.com/cs/quest/syndata.html
9. Agrawal, R., Imielinski, T., Swami, A.: Mining Association Rules between Sets of Items in Large Databases. Proc. of ACM SIGMOD, Washington DC (1993)
10. Gropp, W., Lusk, E., Skjellum, A.: Using MPI: Portable Parallel Programming with the Message-Passing Interface, 2nd ed. MIT Press, Cambridge, MA (1999)
11. http://www.ics.uci.edu/~mlearn/MLRepository.html
12. APAC - Australian Partnership for Advanced Computing, http://nf.apac.edu.au/ (June 2003)