Parallel SQL Based Association Rule Mining on Large Scale PC Cluster: Performance Comparison with Directly Coded C Implementation
Iko Pramudiono, Takahiko Shintani, Takayuki Tamura? , Masaru Kitsuregawa Institute of Industrial Science, The University of Tokyo 7-22-1 Roppongi, Minato-ku, Tokyo 106, Japan
fiko,shintani,tamura,
[email protected]
Data mining is becoming increasingly important since the size of databases grows even larger and the need to explore hidden rules from the databases becomes widely recognized. Currently database systems are dominated by relational database and the ability to perform data mining using standard SQL queries will de nitely ease implementation of data mining. However the performance of SQL based data mining is known to fall behind specialized implementation. In this paper we present an evaluation of parallel SQL based data mining on large scale PC cluster. The performance achieved by parallelizing SQL query for mining association rule using 4 processing nodes is even with C based program. Keywords: data mining, parallel SQL, query optimization, PC cluster Abstract.
1
Introduction
Extracting valuable rules from a set of data has attracted lots of attention from both researcher and business community. This is particularly driven by explosion of the information amount stored in databases during recent years. One method of data mining is nding association rule that is a rule which implies certain association relationship such as "occur together" or "one implies the other" among a set of objects [1]. This kind of mining is known as CPU power demanding application. This fact has driven many initial researches in data mining to develop new ecient mining methods [1][2]. However they imply specialized systems separated from the database. Therefore there are some eorts recently to perform data mining using relational database system which oer advantages such as seamless integration with existing system and high portability. Methods examined are ranging from directly using SQL to some extensions like user de ned function (UDF) [4]. Unfortunately SQL approach is reported to have drawback in performance. ?
Currently at Information & Communication System Development Center, Mitsubishi Electric, Ohfuna 5-1-1 Kamakura-shi Kanagawa-ken 247-8501, Japan
However the increasing computing power trend continues and more can be expected by utilizing parallel processing architecture. More important, most major commercial database systems have included capabilities to support parallelization although no report available about how the parallelization aects the performance of complex query required by data mining. This fact motivated us to examine how eciently SQL based association rule mining can be parallelized using our shared-nothing large scale PC cluster pilot system. We have also compared it with well known Apriori algorithm based C program[2].
2
Association Rule Mining Based on SQL
An example of association rule mining is nding "if a customer buys A and B then 90% of them buy also C" in transaction databases of large retail organizations. This 90% value is called con dence of the rule. Another important parameter is support of an itemset, such as fA,B,Cg, which is de ned as the percentage of the itemset contained in the entire transactions. For above example, con dence can also be measured as support(fA,B,Cg) divided by support(fA,Bg). A common strategy to mine association rule is: 1. Find all itemsets that have transaction support above minimum support, usually called large itemsets. 2. Generate the desired rules using large itemsets. Since the rst step consumes most of processing time, development of mining algorithm has been concentrated on this step. In our experiment we employed ordinary standard SQL query that is similar to SETM algorithm [3].It is shown in gure 1. CREATE TABLE SALES (id int, item int); { PASS 1 CREATE TABLE C 1 (item 1 int, cnt int); CREATE TABLE R 1 (id int, item 1 int); INSERT INTO C 1 SELECT item AS item 1, COUNT(*) FROM SALES GROUP BY item HAVING COUNT(*) = :min support;
>
INSERT INTO R 1 SELECT p.id, p.item AS item 1 FROM SALES p, C 1 c WHERE p.item = c.item 1; { PASS k CREATE TABLE RTMP k (id int, item 1 int, item 2 int, ... , item k int) (item 1 int, CREATE TABLE C k item 2 int, ... , item k int, cnt int) (id int, item 1 int, CREATE TABLE R k item 2 int, ... , item k int) INSERT INTO RTMP k SELECT p.id, p.item 1, p.item 2, ... , p.item k-1, q.item k-1 FROM R k-1 p, R k-1 q
Fig.1.
WHERE AND AND
AND AND
p.id = q.id p.item 1 = q.item 1 p.item 2 = q.item 2 . . . p.item k-2 = q.item k-2 p.item k-1 < q.item k-1;
INSERT INTO C k SELECT item 1, item 2, ..., item k, COUNT(*) FROM RTMP k GROUP BY item 1, item 2, ..., item k HAVING COUNT(*) = :min support;
>
INSERT INTO R k SELECT p.id, p.item 1, p.item 2, ..., p.item k FROM RTMP k p, C k c WHERE p.item 1 = c.item 1 AND p.item 2 = c.item 2 . . . AND p.item k = c.item k; DROP TABLE R k-1; DROP TABLE RTMP k;
SQL query to mine association rule
Transaction data is normalized into the rst normal form (transaction ID, item) . In the rst pass we simply gather the count of each item. Items that satisfy the minimum support inserted into large itemsets table C 1 that takes form(it em, item count). Then transaction data that match large itemsets stored in R 1. In other passes for example pass k, we rst generate all lexicographically ordered candidate itemsets of length k into table RTMP k by joining k-1 length transaction data. Then we generate the count for those itemsets that meet minimum support and included them into large itemset table C k. Finally transaction data R k of length k generated by matching items in candidate itemset table RTMP k with items in large itemsets. Original SETM algorithm assumes execution using sort-merge join. Inside database server on our system, relational joins are executed using hash joins and tables are partitioned over nodes by hashing. As the result, parallelization eciency is much improved. This approach is very eective for large scale data mining. Inside execution plan for this query, join to generate candidate itemsets is executed at each node independently. And then after counting itemsets locally, nodes exchange the local counts by applying hash function on the itemsets in order to determine overall support count of each itemset. Another data exchange occurs when large itemsets are distributed to every nodes. Finally each node independently execute join again to create the transaction data.
3 3.1
Performance Evaluation Parallel Execution Environment
The experiment is conducted on a PC cluster developed at Institute of Industrial Science, The University of Tokyo. This pilot system consists of one hundred commodity PCs connected by ATM network named NEDO-100. We have also developed DBKernel database server for query processing on this system. Each PC has Intel Pentium Pro 200MHz CPU, 4.3GB SCSI hard disk and 64 MB RAM. The performance evaluation using TPC-D benchmark on 100 nodes cluster is reported[7]. The results showed it can achieve signi cantly higher performance especially for join intensive query such as query 9 compared to the current commercially available high end systems. 3.2
Dataset
We use synthetic transaction data generated with program described in Apriori algorithm paper[2] for experiment. The parameters used are : number of transactions 200000, average transaction length 10 and number of items 2000. Transaction data is partitioned uniformly correspond to transaction ID among processing nodes' local hard disk.
3.3
Results
The execution times for several minimum support is shown in gure 2(left). The result is surprisingly well compared even with directly coded Apriori-based C program on single processing node. On average, we can achieve the same level of execution time by parallelizing SQL based mining with around 4 processing nodes. The speedup ratio shown in gure 2(right) is also reasonably good, although the speedup seems to be saturated as the number of processing nodes increased. As the size of the dataset asssigned to each node is getting smaller, processing overhead and also synchronizing cost that depends on the number of nodes cancel the gain. 140
10
SQL 0.2% SQL 0.3% SQL 0.4% Apriori C 0.2% Apriori C 0.3% Apriori C 0.4%
time[s]
100 80
SQL 0.2% SQL 0.3% SQL 0.4% Ideal
9 8
speedup ratio
120
60 40
7 6 5 4 3
20
2
0
1 0
2
4
6
8
10
12
2
0
4
6
#nodes Fig. 2.
8
10
12
#nodes
Execution time(left) Speedup ratio(right)
Figure 3(left) shows the time percentage for each pass when the minimum support is 0.5%. Eight passes are necessary to process entire transaction database. It is well known that in most cases the second pass generates huge amount of candidate itemsets thus it is the most time consuming phase. Figure 3(right) shows the speedup ratio for each pass. The later passes, the smaller candidate itemsets. Thus non-negligible parallelization overhead become dominant especially in passes later than ve. Depending on the size of candidate itemsets, we could change the degree of parallelization. That is, we should reduce the number of nodes on later passes. Such extensions will need further investigations. 10
Ideal
9 8
speedup ratio
pass 1 7
pass 2
6
pass 3 pass 4
5 4
pass 5
3
pass 6 pass 7 pass 8
2 1 0
2
4
6
8
10
12
#nodes
Pass analysis (minimum support 5%). Contribution of each pass in execution time(left) Speedup ratio of each pass(right)
Fig.3.
4
Summary and Conclusion
In this paper, we have compared the parallel implementation of SQL based association rule mining with the directly coded C implementation. Since SQL is popular and it has already sophisticated optimizer, we believed that parallel SQL system could achieve reasonably sucient performance. Through real implementation, we have con rmed that parallization using 4 processing nodes for association rule mining can beat the performance of directly coded C implementation. We do not have to buy or write special data mining application codes, SQL query for association rule mining is extremely easy to implement. It is also very exible, we have extended the SQL query to handle generalized association rule mining with taxonomy. The report will be available soon. At present, parallel SQL is running on expensive massively parallel machines but not in the future. Instead it will run on inexpensive PC cluster system or WS cluster system. Thus we believe that SQL implementation based on sophisticated optimization would be one of reasonable approaches. There remains lots of further investigations. Current SQL query could be more optimized. In addition we plan to do large experiments. Since our system has 100 nodes, we could handle larger transaction database. In such experiments, data skew become a problem. Skew handling is one of the interesting research issue for parallel processing. We have reported some mining algorithms that eectively deal with this problem in [6].
References 1. R. Agrawal, T. Imielinski, A. Swami. Mining Association Rules between Sets of Items in Large Databases. In Proc. of the ACM SIGMOD Conference on Management of Data, 1993. 2. R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules. In Proc. of the VLDB Conference, 1994. 3. M. Houtsma, A. Swami. Set- oriented Mining of Association Rules. In Proc. of International Conference on Data Engineering, 1995. 4. S. Sarawagi, S. Thomas, R. Agrawal. Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications. In Proc. of the ACM SIGMOD Conference on Management of Data, 1998. 5. R. Srikant, R. Agrawal. Mining Generalized Association Rules. In Proc. of VLDB, 1995. 6. T. Shintani, M. Kitsuregawa. Parallel Mining Algorithms for Generalized Association Rules with Classi cation Hierarchy. In Proc. of the ACM SIGMOD Conference on Management of Data, 1998. 7. T. Tamura, M. Oguchi, M. Kitsuregawa. Parallel Database Processing on a 100 Node PC Cluster: Cases for Decision Support Query Processing and Data Mining. In Proc. of SC97: High Performance Networking and Computing, 1997. This article was processed using the LaTEX macro package with LLNCS style