IDMC'07, 20-21 Nov. 2007, Amir Kabir University
Parallel Mining of All Non-Derivable Frequent Itemsets

Mahmood Deypir, Mohammad Hadi Sadreddini
Department of Computer Science and Engineering, Shiraz University, Shiraz, Iran
e-mails: [email protected], [email protected]

Abstract

Mining non-derivable frequent itemsets (NDIs) is one of the successful approaches to constructing a concise representation of frequent patterns, which is useful for generating a smaller and more understandable rule set. Breadth-first and depth-first algorithms are the two main algorithms proposed so far in the literature for mining non-derivable frequent itemsets. In this study, parallel mining of all non-derivable frequent itemsets on shared-nothing parallel systems is investigated. A parallel algorithm called PNDI is proposed and implemented. This algorithm parallelizes not only the I/O cost but also the computation cost of deduction-rule evaluation. Experimental results on real-life datasets show that the parallel algorithm has good speed up, scale up and size up behavior.

Keywords: Association Rules, Non-derivable frequent itemsets, Parallel Data Mining
1. Introduction

Association rule mining (ARM) in large transactional databases is a central problem in the field of knowledge discovery and data mining, with wide application areas such as market basket analysis, document clustering, web management, and profiling high-frequency accident locations. The input of ARM is a database in which objects are grouped together in each transaction. ARM then requires us to find sets of objects that tend to associate with one another. Given two distinct sets of objects, X and Y, we say Y is associated with X if the appearance of X usually implies that of Y; we then say that the rule X ⇒ Y is confident in the database. X and Y are also called itemsets (sets of items) or patterns. We are usually not interested in an association rule unless it appears in more than a certain fraction of the database; if it does, we say that the rule is frequent. The thresholds of frequency (minimum support) and confidence (minimum confidence) are parameters of the problem and are usually determined by the user according to his needs.

The problem of association rule mining was first introduced by R. Agrawal et al. [1], who defined two major steps for this task: 1) finding frequent itemsets (frequent itemset mining) and 2) generating rules based on the frequent itemsets found in the first step. The main task of ARM is the first step, where we are interested in finding items that tend to appear together. Finding frequent itemsets in a large transactional database is a very time-consuming task, and various algorithms have been presented in the literature to improve its efficiency [2, 3, 4, 5, 6, 7, 8, 9]. Frequent itemset mining algorithms suffer from producing a prohibitively large number of frequent patterns when the data are highly correlated and/or the minimum support threshold is low.
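The support and confidence measures above can be made concrete with a small sketch. The transactions, item names and the rule below are illustrative choices of ours, not data from the paper:

```python
# Toy market-basket database: each transaction is a set of items
# (hypothetical data for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(x, y, db):
    """Confidence of the rule X => Y: supp(X ∪ Y) / supp(X)."""
    return support(x | y, db) / support(x, db)

# {bread} => {milk} appears in 2 of the 4 transactions...
print(support({"bread", "milk"}, transactions))        # 0.5
# ...and holds in 2 of the 3 transactions containing bread.
print(confidence({"bread"}, {"milk"}, transactions))
```

With minimum support 0.5 and minimum confidence 0.6, this rule would be both frequent and confident.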
To overcome this problem, recently several proposals have been made to construct a concise representation of the frequent itemsets, instead of mining all frequent itemsets [10, 11, 12, 13, 14, 15]. It was shown that the collection of non-derivable frequent itemsets [14] is in general much more concise
than the complete set of frequent itemsets, and even smaller than the other concise representations [16]. Non-derivable frequent itemsets are a lossless concise representation: from the NDIs it is possible to regenerate the complete set of frequent itemsets. The problem of mining non-derivable frequent itemsets was introduced by Calders et al., and a breadth-first algorithm, NDI, was proposed to solve it [14]. Mining non-derivable frequent itemsets is particularly suitable for datasets with long transactions, such as bioinformatics datasets. All frequent itemsets can be derived by first mining the non-derivable ones and then producing the derivable itemsets without any further database scan. Although the NDI algorithm is more efficient than the well-known Apriori algorithm, one would expect it to be even more efficient considering its relatively small output. To enhance the performance of non-derivable frequent itemset mining, a depth-first algorithm has been proposed [17], which is an Eclat-based [18] algorithm. The depth-first algorithm outperforms the breadth-first approach and needs less real memory. However, applying such a sequential algorithm to large, dense real-life datasets still takes much time. To overcome this problem it is possible to use a parallel algorithm, as in mining all frequent itemsets. The cost of the NDI algorithm is twofold: first, the I/O cost, and second, the evaluation of deduction rules to determine whether an itemset is derivable. In this study a parallel algorithm, PNDI, is proposed which parallelizes deduction-rule evaluation in addition to the I/O cost.

The remainder of the paper is as follows. In Section 2, related work is reviewed. In Section 3, the PNDI algorithm for parallel mining of non-derivable frequent itemsets is proposed. In Section 4, some implementation details are mentioned and the speed up, scale up and size up experimental results are presented and discussed.
Finally Section 5 concludes the paper.
2. Related Work

In [14], rules were given to derive bounds on the support of an itemset I when the supports of all strict subsets of I are known. The main principle behind the support derivation technique used for mining non-derivable sets is the inclusion-exclusion principle [19]. For any subset J ⊆ I, we obtain a lower or an upper bound on the support of I using one of the following formulas [16]. If |I \ J| is odd, then

(2.1)    supp(I) ≤ Σ_{J ⊆ X ⊂ I} (−1)^{|I \ X|+1} supp(X).

If |I \ J| is even, then

(2.2)    supp(I) ≥ Σ_{J ⊆ X ⊂ I} (−1)^{|I \ X|+1} supp(X).
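Rules (2.1) and (2.2) can be checked numerically. The sketch below uses a tiny hypothetical database of our own and evaluates the right-hand sides for two choices of J; the database and item names are assumptions, not from the paper:

```python
from itertools import combinations

# Toy database over items a, b, c (illustrative data);
# supports here are absolute counts, not fractions.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]

def supp(X):
    # Absolute support: number of transactions containing X.
    return sum(1 for t in db if X <= t)

def bound(I, J):
    # Right-hand side of (2.1)/(2.2): sum over J ⊆ X ⊂ I of (-1)^(|I\X|+1) supp(X).
    I, J = frozenset(I), frozenset(J)
    total = 0
    for r in range(len(I - J)):              # r < |I \ J| keeps X a strict subset of I
        for extra in combinations(I - J, r):
            X = J | frozenset(extra)
            total += (-1) ** (len(I - X) + 1) * supp(X)
    return total

I = {"a", "b", "c"}
print(supp(frozenset(I)))     # 2, the true support of abc
print(bound(I, set()))        # |I \ J| = 3 is odd  -> an upper bound on supp(abc)
print(bound(I, {"a"}))        # |I \ J| = 2 is even -> a lower bound on supp(abc)
```

For this database both bounds evaluate to the true support, so the inequalities hold with equality.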
For more details and the proofs of these formulas we refer the interested reader to [14].

Example 1: Fig. 1 shows all possible rules to derive the tight bounds on the support of the set {a, b, c, d} [16].

supp(abcd) ≥ supp(abc) + supp(abd) + supp(acd) + supp(bcd) − supp(ab) − supp(ac) − supp(ad) − supp(bc) − supp(bd) − supp(cd) + supp(a) + supp(b) + supp(c) + supp(d) − supp({})
supp(abcd) ≤ supp(a) − supp(ab) − supp(ac) − supp(ad) + supp(abc) + supp(abd) + supp(acd)
supp(abcd) ≤ supp(b) − supp(ab) − supp(bc) − supp(bd) + supp(abc) + supp(abd) + supp(bcd)
supp(abcd) ≤ supp(c) − supp(ac) − supp(bc) − supp(cd) + supp(abc) + supp(acd) + supp(bcd)
supp(abcd) ≤ supp(d) − supp(ad) − supp(bd) − supp(cd) + supp(abd) + supp(acd) + supp(bcd)
supp(abcd) ≥ supp(abc) + supp(abd) − supp(ab)
supp(abcd) ≥ supp(abc) + supp(acd) − supp(ac)
supp(abcd) ≥ supp(abd) + supp(acd) − supp(ad)
supp(abcd) ≥ supp(abc) + supp(bcd) − supp(bc)
supp(abcd) ≥ supp(abd) + supp(bcd) − supp(bd)
supp(abcd) ≥ supp(acd) + supp(bcd) − supp(cd)
supp(abcd) ≤ supp(abc)
supp(abcd) ≤ supp(abd)
supp(abcd) ≤ supp(acd)
supp(abcd) ≤ supp(bcd)
supp(abcd) ≥ 0

Figure 1: Deducing bounds on supp(abcd).

When, for an itemset I, the smallest upper bound (uI) equals the largest lower bound (lI), we have obtained the exact support of the set solely from the supports of its subsets. Such sets are called derivable and all other sets non-derivable. Based on the deduction rules, it is possible to generate a summary of the set of frequent itemsets. Indeed, if lI = uI, then supp(I, D) = lI = uI, and hence we do not need to store I in the representation. Such a set I is called a Derivable Itemset (DI); all other itemsets are called Non-Derivable Itemsets (NDIs). Based on this principle, for a database D with minimum support σ, the following condensed representation was introduced in [14]:
NDI(D, σ) := { I | supp(I, D) ≥ σ, lI ≠ uI }.
In the experiments presented in [16], it is shown that the collection of non-derivable itemsets is much more concise than the complete collection of frequent itemsets, and often even more concise than the other concise representations. A level-wise Apriori-like algorithm called NDI was given in [14]. In fact, the NDI algorithm corresponds largely to a constrained mining algorithm with non-derivability as an anti-monotone constraint. In the candidate generation phase of Apriori, in addition to the monotonicity check, the lower and upper bounds on the candidate itemsets are computed. Such a check is possible since in Apriori a set I can only become a candidate after all its strict subsets have been counted. Candidate itemsets whose upper bound is below the minimum support threshold are pruned, because they cannot be frequent. Itemsets whose lower bound equals their upper bound are pruned because they are derivable. As shown in [14], every superset of a derivable itemset is derivable as well, and hence a derivable itemset can be pruned in the same way as an infrequent itemset. Due to the relatively small number of non-derivable itemsets, the NDI algorithm almost always outperforms mining all frequent itemsets, independently of the algorithm used [16]. When we look at the time and space required by the NDI algorithm as a function of its output size, however, its performance is far below that of state-of-the-art frequent set mining algorithms. This is because computing non-derivable frequent itemsets requires, in addition to the I/O cost of scanning the dataset, evaluating for each candidate all deduction rules over the subsets of that candidate, which has exponential time complexity.

Parallel mining of frequent itemsets is one of the approaches used to achieve better performance.
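Before turning to the parallel setting, the sequential level-wise NDI idea just described can be sketched compactly. The database below is a hypothetical toy example of ours; unlike the real algorithm, this sketch recounts supports with a scan per call instead of one scan per level:

```python
from itertools import combinations

# Toy database (illustrative); supports are absolute counts.
db = [frozenset(t) for t in
      [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}, {"b"}]]

def supp(X):
    return sum(1 for t in db if X <= t)

def bound(I, J):
    # RHS of rules (2.1)/(2.2): sum over J ⊆ X ⊂ I of (-1)^(|I\X|+1) supp(X).
    total = 0
    for r in range(len(I - J)):
        for extra in combinations(I - J, r):
            X = J | frozenset(extra)
            total += (-1) ** (len(I - X) + 1) * supp(X)
    return total

def tight_bounds(I):
    lo, hi = float("-inf"), float("inf")
    for r in range(len(I) + 1):
        for J in combinations(I, r):
            b = bound(I, frozenset(J))
            if (len(I) - r) % 2:          # |I \ J| odd  -> upper bound
                hi = min(hi, b)
            else:                         # |I \ J| even -> lower bound
                lo = max(lo, b)
    return lo, hi

def ndi_frequent(minsup):
    # Frequent non-derivable itemsets of db, generated level by level.
    items = sorted({i for t in db for i in t})
    level = [frozenset([i]) for i in items if supp(frozenset([i])) >= minsup]
    ndis = list(level)
    k = 2
    while level:
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        nxt = []
        for c in candidates:
            if any(c - {i} not in ndis for i in c):
                continue                  # some subset was infrequent or derivable
            lo, hi = tight_bounds(c)
            if hi < minsup or lo == hi:   # cannot be frequent, or is derivable
                continue
            if supp(c) >= minsup:
                nxt.append(c)
        ndis.extend(nxt)
        level, k = nxt, k + 1
    return ndis

result = ndi_frequent(2)
print(len(result))   # abc turns out derivable here, so 6 sets remain
```

In this toy run the bounds on {a, b, c} coincide, so it is pruned as derivable even though it is frequent, exactly the saving the condensed representation exploits.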
A number of parallel algorithms have been proposed to mine frequent itemsets on shared-memory and distributed-memory parallel systems [20, 21, 22, 23, 24]. However, to the best of our knowledge, no parallel algorithm has been proposed for mining non-derivable frequent itemsets. The problem of parallel mining of association rules was introduced in [20], where two parallel algorithms, Count Distribution (CD) and Data Distribution (DD), were proposed. CD is a parallel version of Apriori. In CD the database is partitioned into D1, D2, …, Dn and distributed across n processors. The program fragment of CD at processor Pi, 1 ≤ i ≤ n, for the k-th iteration is outlined in Fig. 2. X.sup and X.supi are the global support and the local support at processor i of an itemset X, respectively.
1) Ck = apriori_gen(Lk-1);
2) Scan the local partition to find the local support counts X.supi for all X ∈ Ck.
3) Exchange {X.supi | X ∈ Ck} with all other processors to get the global support counts X.sup for all X ∈ Ck.
4) Lk = {X ∈ Ck | X.sup ≥ minsup × |D|}
Figure 2: The CD algorithm.

In step 1, every processor computes the same candidate set Ck by applying the apriori_gen function to Lk-1, the set of large itemsets found at the (k-1)-th iteration. In step 2, the local support counts of the candidates in Ck are found by scanning the local database partitions. In step 3, the local support counts are exchanged with all other processors to get the global support counts. Finally, in step 4, the globally large itemsets Lk are computed independently by each processor. CD repeats steps 1-4 until no more candidates are found. The CD algorithm scales linearly and has excellent speed up and size up behavior with respect to the number of transactions. In this paper, a similar idea is adopted to design a parallel algorithm for mining non-derivable frequent itemsets (PNDI). The experimental evaluation presented in Section 4 shows that PNDI has good speed up, scale up and size up behavior.
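One CD round can be simulated sequentially to see how local counts combine into global ones. Everything here is illustrative (made-up transactions, three simulated processors, a fixed candidate set C2); the real algorithm exchanges counts via message passing, which we replace by a plain sum:

```python
from collections import Counter
from itertools import combinations

# Hypothetical database, horizontally partitioned across 3 "processors".
db = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"c"}]
n_proc = 3
partitions = [db[i::n_proc] for i in range(n_proc)]

candidates = [frozenset(p) for p in combinations("abc", 2)]   # C2
minsup_frac = 0.5

# Step 2: each processor scans only its local partition.
local_counts = []
for part in partitions:
    cnt = Counter()
    for t in part:
        for c in candidates:
            if c <= t:
                cnt[c] += 1
    local_counts.append(cnt)

# Step 3: exchanging local counts amounts to a global sum (an all-reduce).
global_counts = sum(local_counts, Counter())

# Step 4: every processor derives the same L2 independently.
L2 = [c for c in candidates if global_counts[c] >= minsup_frac * len(db)]
print(sorted(map(sorted, L2)))   # [['a', 'b']]
```

Only the (small) count vectors cross processor boundaries, never the transactions themselves, which is what makes CD communication-light.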
3. Parallel algorithm for mining non-derivable frequent itemsets

In this section the PNDI (Parallel mining of Non-Derivable frequent Itemsets) algorithm is proposed; it is presented in Fig. 3. As can be seen in the figure, in each iteration candidate itemsets are generated as in the Apriori algorithm; each candidate is then tested for derivability and, if derivable, pruned from the candidate set. Each site can determine the derivability of an itemset autonomously, without communicating with other sites, because each site maintains a Trie data structure storing all non-derivable frequent itemsets found in previous iterations. The program fragment of PNDI at processor i, for the k-th iteration, is outlined in Fig. 3. The algorithm takes a subset Di of the data and a minimum support threshold (minSup) as input. PNDI consists of seven main steps. X.sup and X.supi are the global support and the local support at processor i of an itemset X, respectively. In step 1, every processor computes the same candidate set Ck by applying the apriori_gen function to NDILk-1, the set of large non-derivable itemsets found at the (k-1)-th iteration. In steps 2 through 4, derivable candidates are pruned away and the set NDICk is obtained, containing the candidates that may be large non-derivable frequent itemsets. In step 5, the local support counts of the candidates in NDICk are found by scanning the local database partitions. In step 6, the local support counts are exchanged with all other processors to get the global support counts, and finally, in step 7, the globally large non-derivable itemsets NDILk are computed independently by each processor. PNDI repeats steps 1-7 until no more candidates are found.

PNDI Algorithm
Input: subset of data Di (i = 1, 2, …, s); minimum support threshold (minSup)
(apriori_gen produces candidates in sorted order, identical at every processor)
Output: all non-derivable frequent itemsets in D
Method: each processor iteratively executes the following steps until no non-derivable frequent itemset is found (k > 1):

1) Ck = apriori_gen(NDILk-1);
2) for i in myFraction of Ck:
       myDerivability[i] = Determine(Ck[i]);
3) allGather(myDerivability, allDerivability);
4) for i in (0 … Ck.Size):
       if not allDerivability[i] then
           NDICk ← NDICk ∪ {Ck[i]}
5) if NDICk is not empty:
       scan the local partition to find the local support counts X.supi for all X ∈ NDICk
6) exchange {X.supi | X ∈ NDICk} with all other processors to get the global support counts X.sup for all X ∈ NDICk
7) NDILk = {X ∈ NDICk | X.sup ≥ minSup × |D|}

Figure 3: The PNDI algorithm.
In the figure, each processor is assigned an equal fraction of the candidate itemsets. The Determine function takes an itemset and returns true if the itemset is derivable. The allGather function is an MPI function which gathers the results of the Determine calls from all processors. A hash-tree structure is used for storing and retrieving the support counts of itemsets at each iteration. Another data structure used in the implementation is a Trie, which stores the previously found non-derivable frequent itemsets and their supports. The algorithm has been implemented in Visual C++ 6 using the MPI library.
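Steps 2-4 of the figure, where the derivability work itself is split across processors, can be simulated sequentially. The database, candidate list and processor count below are illustrative assumptions of ours, and the allGather is replaced by an in-memory merge that preserves candidate order:

```python
from itertools import combinations

# Hypothetical data and candidates for one PNDI iteration.
db = [frozenset(t) for t in
      [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]]
n_proc = 2

def supp(X):
    return sum(1 for t in db if X <= t)

def bound(I, J):
    # RHS of deduction rules (2.1)/(2.2).
    total = 0
    for r in range(len(I - J)):
        for extra in combinations(I - J, r):
            X = J | frozenset(extra)
            total += (-1) ** (len(I - X) + 1) * supp(X)
    return total

def determine(I):
    # True iff I is derivable: tightest lower bound equals tightest upper bound.
    lo, hi = float("-inf"), float("inf")
    for r in range(len(I) + 1):
        for J in combinations(I, r):
            b = bound(I, frozenset(J))
            if (len(I) - r) % 2:
                hi = min(hi, b)
            else:
                lo = max(lo, b)
    return lo == hi

Ck = [frozenset({"a", "b", "c"}), frozenset({"a", "b"}), frozenset({"a", "c"})]

# Step 2: processor p tests only its fraction Ck[p::n_proc].
my_derivability = [[determine(c) for c in Ck[p::n_proc]] for p in range(n_proc)]

# Step 3: allGather -- merge the per-processor slices back into Ck order.
all_derivability = [None] * len(Ck)
for p in range(n_proc):
    all_derivability[p::n_proc] = my_derivability[p]

# Step 4: keep only the non-derivable candidates.
NDICk = [c for c, d in zip(Ck, all_derivability) if not d]
print([sorted(c) for c in NDICk])   # [['a', 'b'], ['a', 'c']]
```

Because apriori_gen emits the same sorted candidate list on every processor, indexing by position is enough to reassemble a consistent derivability vector, which is the property the MPI allGather relies on.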
4. Experimental results

We have implemented our parallel algorithm for mining non-derivable frequent itemsets on a shared-nothing parallel system. The system consists of eight 1.2 GHz Pentium PC machines, each with 256 megabytes of main memory, interconnected via a 10/100 Mbps hub. We used the MPICH2 message passing interface implementation, and all programs were written in Microsoft Visual C++ 6.0. The run times reported here represent the total execution time, i.e., the period between input and output. We use two well-known real-life datasets with different characteristics, both available at the FIMI repository. The first is a sparse dataset obtained from a Belgian retail market. The second is the accidents dataset [26], with about 340K transactions and about 500 items; the results for the accidents dataset are presented here. In each experiment the dataset is horizontally partitioned among the machines, and speed up, scale up and size up are analyzed. In all experiments the maximum number of processors is 8.

In order to see how the response time of our parallel algorithm is reduced with respect to the number of processors, we perform speed up experiments where we keep the dataset constant and vary the
number of computers. In each speed up experiment the dataset is partitioned horizontally with respect to the number of processors. For example, the first speed up experiment is performed using two computers, each holding half of the accidents dataset. The minimum support threshold is set to 0.4. Fig. 4 shows the speed up results.
[Plot: response time (sec.) versus number of processors.]
Figure 4: Speed up.

As shown in the figure, response time improves as the number of processors increases. This is due to the fair distribution of the I/O cost and the deduction-rule evaluations among the machines. To see how well the PNDI algorithm handles large datasets when more computers are available, we perform scale up experiments where the data grows in direct proportion to the number of computers, so that every computer holds a copy of the accidents dataset. The scale up results are shown in Fig. 5.
[Plot: response time (sec.) versus number of processors.]
Figure 5: Scale up.

The figure shows that PNDI has an almost constant run time when the number of processors and the size of the problem are increased simultaneously. That is, the PNDI algorithm is scalable.
[Plot: response time (sec.) versus data size per processor (K transactions).]
Figure 6: Size up.

Finally, we conduct size up experiments, which show how the performance of PNDI changes as the size of the problem increases. We fix the number of computers at 8, while growing the number of transactions per computer from 10K to 60K. Fig. 6 shows the size up results. The results show sub-linear growth for our parallel algorithm: it actually becomes relatively more efficient as the dataset size increases, since a larger dataset simply makes the non-communication portion of the code take more time, due to more I/O and more transaction processing. This reduces the percentage of the overall time spent on communication.
5. Conclusion

In this study we have addressed the problem of parallel mining of non-derivable frequent itemsets; a parallel algorithm, PNDI, has been proposed and implemented. The two main costs of non-derivable frequent itemset mining are the I/O cost and the cost of deduction-rule evaluation, and PNDI parallelizes both. Empirical evaluations on real-life datasets show that PNDI has good speed up, scale up and size up behavior.
References:
[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, Washington, D.C., May 1993.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the VLDB, Santiago de Chile, September 1994, pp. 487-499.
[3] H. Toivonen. Sampling large databases for association rules. In T.M. Vijayaraman, A.P. Buchmann, C. Mohan, and N.L. Sarda, editors, Proceedings 22nd International Conference on Very Large Data Bases, pages 134-145. Morgan Kaufmann, 1996.
[4] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery, 2003.
[5] S. Brin, R. Motwani, J.D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, volume 26(2) of SIGMOD Record, pages 255-264. ACM Press, 1997.
[6] J.S. Park, M.-S. Chen, and P.S. Yu. An effective hash based algorithm for mining association
rules. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, volume 24(2) of SIGMOD Record, pages 175-186. ACM Press, 1995.
[7] F. Bodon. A fast apriori implementation. In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, 2003.
[8] C. Borgelt. Efficient implementations of apriori and eclat. In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03), volume 90 of CEUR Workshop Proceedings, Melbourne, Florida, USA, 19 November 2003.
[9] F. Bodon. Surprising results of trie-based FIM algorithms. In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'04), volume 126 of CEUR Workshop Proceedings, Brighton, UK, 2004.
[10] N. Pasquier et al. Discovering frequent closed itemsets for association rules. In Proc. ICDT, pp. 398-416, 1999.
[11] J.-F. Boulicaut et al. Approximation of frequency queries by means of free-sets. In Proc. PKDD, pp. 75-85, 2000.
[12] M. Kryszkiewicz. Concise representation of frequent patterns based on disjunction-free generators. In Proc. ICDM, pp. 305-312, 2001.
[13] A. Bykowski and C. Rigotti. A condensed representation to find frequent patterns. In Proc. PODS, 2001.
[14] T. Calders and B. Goethals. Mining all non-derivable frequent itemsets. In Proc. Principles and Practice of Knowledge Discovery in Databases PKDD'02, volume 2431 of LNAI, pp. 74-85, Helsinki, Finland, Aug. 2002. Springer-Verlag.
[15] T. Calders and B. Goethals. Mining k-free representations of frequent sets. In Proc. Principles and Practice of Knowledge Discovery in Databases PKDD'03, volume 2828 of LNAI, pp. 71-82, Cavtat-Dubrovnik, Croatia, Sept. 2003. Springer-Verlag.
[16] T. Calders. Deducing bounds on the support of itemsets. In Database Technologies for Data Mining - Discovering Knowledge with Inductive Queries, volume 2682 of LNCS, pages 214-233. Springer-Verlag, 2004.
[17] T. Calders and B. Goethals. Depth-first non-derivable itemset mining. In Proc. SIAM Int. Conf. on Data Mining SDM'05, Newport Beach, USA, Apr. 2005.
[18] M. Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3):372-390, May/June 2000.
[19] J. Galambos and I. Simonelli. Bonferroni-type Inequalities with Applications. Springer.
[20] R. Agrawal and J. Shafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(6):962-969, 1996.
[21] R. Agrawal and J.C. Shafer. Parallel mining of association rules: Design, implementation and experience. Special Issue on Data Mining, IEEE Transactions on Knowledge and Data Engineering, 8(6):962-969, December 1996.
[22] E. Han, G. Karypis and V. Kumar. Scalable parallel data mining for association rules. In Proc. of the 1997 ACM SIGMOD Int. Conf. on Management of Data, pages 277-288, Tucson, Arizona, 1997.
[23] T. Shintani and M. Kitsuregawa. Hash based parallel algorithms for mining association rules. In Proc. of the 4th Int. Conf. on Parallel and Distributed Information Systems, Miami Beach, Florida, 1996.
[24] M.J. Zaki, M. Ogihara, S. Parthasarathy, and W. Li. Parallel data mining for association rules on shared-memory multi-processors. Supercomputing'96, Pittsburgh, PA, Nov 17-22, 1996.
[25] Message Passing Interface Forum. MPI: A Message Passing Interface Standard, May 1994.
[26] K. Geurts, G. Wets, T. Brijs, and K. Vanhoof. Profiling high frequency accident locations using association rules. In Proc. of the 82nd Annual Transportation Research Board, page 18, 2003.