Mining Frequent Closed Itemsets from Highly Distributed Repositories Claudio Lucchese Ca’ Foscari University of Venice
[email protected]
Salvatore Orlando Ca’ Foscari University of Venice
[email protected]
Raffaele Perego ISTI-CNR of Pisa
[email protected]
Claudio Silvestri Ca’ Foscari University of Venice
[email protected]
Abstract
In this paper we address the problem of mining frequent closed itemsets in a highly distributed setting like a Grid. The extraction of frequent (closed) itemsets is an important problem in Data Mining, and is the most expensive phase needed to extract from a transactional database a reduced set of meaningful association rules, typically used for Market Basket Analysis. We consider an environment where a transactional dataset is horizontally partitioned and stored in different sites. We assume that, due to the huge size of the datasets and to privacy concerns, the dataset partitions cannot be moved to a centralized site where the whole dataset could be materialized and mined. It thus becomes mandatory to perform separate mining at each site, and then merge the local results to derive global knowledge. This paper shows how frequent closed itemsets, mined independently at each site, can be merged in order to derive globally frequent closed itemsets. Unfortunately, such merging might produce a superset of all the frequent closed itemsets, while the associated supports could be smaller than the exact ones, because some globally frequent closed itemsets might not be locally frequent in some partitions. In order to avoid an expensive post-processing phase, needed to compute exact global results, we employ a method to approximate the supports of closed itemsets. This approximation is only needed for those globally frequent (closed) itemsets which are locally infrequent on some dataset partitions, and are thus not returned at all by the corresponding sites.
1 Introduction
Association Rule Mining (ARM) is one of the most popular Data Mining topics. In this paper we are interested in the most computationally expensive phase of ARM, i.e., the Frequent Itemset Mining (FIM) phase, during which the set of all the frequent itemsets is extracted from a transactional database.
Unfortunately, FIM algorithms may fail to extract all the frequent itemsets from dense databases, which contain strongly correlated items and long frequent patterns: the complexity of the mining task rapidly becomes intractable. Moreover, the huge size of the output makes the task of an analyst very hard, since she/he has to extract useful knowledge from a very large number of frequent patterns and association rules. Closed itemsets [10, 11, 1, 9, 16, 15, 4] are a solution to the problems described above, since they are a lossless and condensed representation of all the frequent itemsets. Mining the frequent closed itemsets therefore guarantees at the same time better performance, non-redundant results, and a concise representation of the results. Frequent closed itemsets are the unique maximal elements of the equivalence classes that can be defined over the lattice of all the frequent itemsets, where each class includes a distinct group of frequent itemsets, all supported by the same set of transactions. When a dataset is dense, the frequent closed itemsets extracted can be orders of magnitude fewer than the corresponding frequent itemsets, since they implicitly benefit from data correlations. Nevertheless, they concisely represent exactly the same knowledge: from the closed itemsets it is in fact trivial to generate all the frequent itemsets along with their supports. More importantly, association rules extracted from closed itemsets have been proved to be more meaningful for analysts, because all redundancies are discarded. In this paper we address the problem of mining frequent closed itemsets in a highly distributed setting [7, 3], such as a Data Grid. While many papers address the problem of parallel/distributed FIM, to the best of our knowledge no proposal for distributed closed itemset mining exists. We consider a distributed framework in which one (virtually single) transactional dataset is horizontally partitioned among different sites.
We assume that, due to the huge size of the datasets and to privacy concerns, the dataset partitions cannot be moved to a centralized site where the whole dataset could be materialized and mined. It thus becomes mandatory to apply a loosely coupled meta-miner approach, according to which we perform separate mining at each site, and then merge the local results to derive global knowledge. All the issues that may occur when we separately extract all the frequent itemsets from non-overlapping horizontal partitions of a dataset were raised by a famous algorithm, Partition [12]. This algorithm can be straightforwardly implemented in a loosely-coupled distributed setting, i.e., with a limited number of communications/synchronizations (see for example [6]). We already proposed an approximate algorithm called AP [13], inspired by Partition, for solving the FIM problem. To avoid the second counting phase of Partition, which may be particularly expensive in a distributed setting, AP approximates the unknown supports of frequent itemsets by interpolation. In this paper we discuss APClosed (Approximate Partition for Closed Itemsets), which is based on AP. Unfortunately, focusing on frequent closed itemsets rather than on frequent itemsets makes the final phase, during which the locally computed results must be merged, particularly challenging. In order to address these additional issues, APClosed exploits the theoretical results described in [5]. The rest of the paper is organized as follows. Section 2 discusses the FIM problem and the concept of closed itemsets. Section 3 discusses the issues involved in realizing a loosely-coupled distributed algorithm, inspired by Partition, for extracting all the frequent closed itemsets from a horizontally partitioned transactional dataset. Finally, Section 4 introduces our APClosed algorithm, in particular the specific solution for the interpolation/approximation of unknown itemset supports, and the merging phase of the local results collected from the various distributed sites.
2 Frequent itemsets and closed ones
We introduce the concept of closed itemsets by first discussing the FIM problem, i.e., the problem of mining all the frequent itemsets from a transactional dataset D. Let I = {a1, ..., aM} be a finite set of items, and D a dataset containing N = |D| transactions, where each transaction t ∈ D is simply a list of distinct items t = {i1, ..., i|t|}, ij ∈ I. Let I = {i1, ..., ik}, ij ∈ I, be a k-itemset, i.e., a set of k distinct items, and let σ(I) be its support, defined as the number of transactions in D that include I. Mining all the frequent itemsets from D with respect to a threshold minsup, 0 ≤ minsup ≤ 1, requires discovering all the itemsets having a support higher than (or equal to) the absolute value σ = minsup · |D|.
To introduce the concept of closed itemset, let T ⊆ D and I ⊆ I be subsets of all the transactions and items appearing in D, respectively. We can finally define the two following functions f and g:
f(T) = {i ∈ I | ∀t ∈ T, i ∈ t}
g(I) = {t ∈ D | ∀i ∈ I, i ∈ t}.
Function f returns the set of items included in all the transactions belonging to T, while function g returns the set of transactions supporting a given itemset I.
Definition 1. An itemset I is said to be closed if and only if c(I) = f(g(I)) = f ◦ g(I) = I, where the composite function c = f ◦ g is called the Galois operator or closure operator.
The closure operator defines a set of equivalence classes over the lattice of frequent itemsets: two itemsets belong to the same equivalence class iff they have the same closure, i.e., they are supported by the same set of transactions. We can also show that an itemset I is closed iff no superset of I with the same support exists. Therefore, mining the maximal elements of all the equivalence classes corresponds to mining all the closed itemsets. Fig. 1(b) shows the lattice of frequent itemsets derived from the simple dataset reported in Fig. 1(a), mined with σ = 1. All the itemsets with the same closure are grouped in the same equivalence class; each equivalence class contains elements sharing the same supporting transactions, and the closed itemsets are their maximal elements. Note that the closed itemsets (six) are remarkably fewer than the frequent itemsets (sixteen). For example, only the closed itemset {A,B,C,D}, with σ({A,B,C,D}) = 1, is returned, while 5 further frequent itemsets are supported by the same transaction set (only transaction 1, in this simple case).
3 The Distributed Partition algorithm
The basic idea exploited by Partition is the following: given m disjoint horizontal partitions D1, ..., Dm of D, each one mined using the same relative support threshold minsup, every globally frequent pattern x (such that σ(x) ≥ minsup · |D|) must be locally frequent in at least one partition, i.e., σi(x) ≥ minsup · |Di| for some i. This guarantees that the union of all local solutions is a superset of the global solution. However, one further pass over the database is necessary to remove false positives, i.e., patterns that are locally frequent in some partition but globally infrequent.
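The functions f and g and the closure operator c = f ◦ g of Section 2 can be sketched in a few lines of Python. This is only an illustrative brute-force enumeration over a small hypothetical dataset (not the dataset of Figure 1, and not the mining algorithm used in the paper):

```python
from itertools import chain, combinations

def g(itemset, D):
    """Transactions of D that support (i.e., include) the itemset."""
    return [t for t in D if itemset <= t]

def f(transactions):
    """Items common to every transaction in the given list."""
    return set.intersection(*map(set, transactions)) if transactions else set()

def closure(itemset, D):
    """Galois closure operator c = f o g."""
    return f(g(itemset, D))

def closed_itemsets(D):
    """Brute force: an itemset is closed iff it equals its own closure."""
    items = sorted(set().union(*D))
    candidates = chain.from_iterable(
        combinations(items, k) for k in range(1, len(items) + 1))
    return {frozenset(x): len(g(set(x), D))
            for x in candidates if closure(set(x), D) == set(x)}

# Hypothetical 3-transaction dataset used only for this sketch.
D = [{'A', 'B'}, {'A', 'B', 'C'}, {'A', 'C'}]
print(closed_itemsets(D))
```

On this toy dataset the itemset {B} is not closed, since its closure is {A, B}: every closed itemset is the maximal element of its equivalence class and carries the support of the whole class.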
Figure 1: (a) The input transactional dataset, represented in its horizontal form. (b) Lattice of all the frequent itemsets (σ = 1), with closed itemsets and equivalence classes.
Obviously, Partition can be straightforwardly implemented in a distributed setting with a master/slave paradigm [6]. Each slave becomes responsible for a local partition, while the master performs the sum-reduction of the local counters (first phase), and orchestrates the slaves for computing the missing local supports of potentially globally frequent patterns (second phase), in order to remove patterns having global support less than minsup (false positives collected during the first phase). We call this straightforward distributed version of the algorithm Distributed Partition.
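The two phases of Distributed Partition can be sketched as follows. This is a minimal single-process simulation, under the assumption that the master/slave message passing is abstracted away and that the local miner is a deliberately naive brute-force one:

```python
from itertools import combinations

def local_fim(partition, minsup):
    """Slave side: brute-force mining of all locally frequent itemsets."""
    counts = {}
    for t in partition:
        for k in range(1, len(t) + 1):
            for x in combinations(sorted(t), k):
                counts[x] = counts.get(x, 0) + 1
    return {x: s for x, s in counts.items() if s >= minsup * len(partition)}

def distributed_partition(partitions, minsup):
    # Phase 1: every slave mines its own partition; the union of the
    # local solutions is guaranteed to be a superset of the global one.
    local_solutions = [local_fim(p, minsup) for p in partitions]
    candidates = set().union(*(sol.keys() for sol in local_solutions))
    # Phase 2: the slaves count the missing local supports of each
    # candidate, so the master can discard the false positives.
    n = sum(len(p) for p in partitions)
    totals = {x: sum(sum(1 for t in p if set(x) <= t) for p in partitions)
              for x in candidates}
    return {x: s for x, s in totals.items() if s >= minsup * n}

parts = [[{'A', 'B'}, {'A'}], [{'B'}, {'A', 'B'}]]
print(distributed_partition(parts, 0.75))
```

With minsup = 0.75, the pattern B is locally infrequent in the first partition and A in the second, yet both are globally frequent: it is the second counting pass that recovers their exact global supports.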
Frequent closed itemsets are a subset of the frequent itemsets; more precisely, they are those patterns all of whose supersets have supports smaller than their own. Hence, if a closed pattern x is globally frequent, then x (or a superset of x with the same support) must be locally frequent in at least one partition. Distributed Partition can thus be extended to the case of frequent closed itemsets. However, taking the union of the local solutions is not enough to obtain a superset of the frequent closed itemsets, as proved in [5]. In order to reach this goal it is also necessary to consider the intersections of every possible pair of locally frequent closed itemsets contained in different partitions. As an example, if C1 and C2 are the closed frequent itemsets making up two local solutions, where C1 = {(ABC : 6), (D : 1)}¹ and C2 = {(BCD : 4), (A : 1)}, then the global solution will be C = {(ABC : 6), (BCD : 4), (BC : 10), (A : 7), (D : 5)}. More formally, given two sets of closed itemsets C1 and C2, mined respectively from the two dataset partitions D1 and D2, the merging function ⊕ to obtain the global result C is defined as follows:
C = C1 ⊕ C2 = (C1 ∪ C2) ∪ {X1 ∩ X2 | (X1, X2) ∈ (C1 × C2)}.
The above result is generalized by the following theorem, whose proof can be found in [5]:
Theorem 3.1. Given the sets of closed itemsets C1, ..., Cm, mined respectively from the m disjoint horizontal partitions D1, ..., Dm of D, we have that:
¹ The notation (XYZ : n) stands for the closed itemset (XYZ) with support n.
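The ⊕ operator can be sketched in Python, reproducing the example above. Here the local support of an itemset in a partition is recovered as the support of its smallest closed superset in that partition's local solution (0 if none is known, which is exactly the source of approximation discussed later); the final filter, which drops candidates that are no longer closed globally, is our own assumption, since the formula only specifies the candidate set. This is an illustrative sketch, not the implementation used by APClosed:

```python
def merge(c1, c2):
    """The (+)-merge of two local collections of closed itemsets.
    c1, c2: dicts mapping frozenset -> local support."""
    def local_support(x, local):
        # Support of x in a partition equals the support of its smallest
        # closed superset there; 0 if x has no known closed superset.
        return max((s for y, s in local.items() if x <= y), default=0)

    # Candidates: the union of the two collections plus all the
    # non-empty pairwise intersections.
    candidates = set(c1) | set(c2)
    candidates |= {x1 & x2 for x1 in c1 for x2 in c2 if x1 & x2}
    supports = {x: local_support(x, c1) + local_support(x, c2)
                for x in candidates}
    # Keep only itemsets still closed globally, i.e. with no proper
    # superset of identical support among the candidates.
    return {x: s for x, s in supports.items()
            if not any(x < y and supports[y] == s for y in supports)}

C1 = {frozenset('ABC'): 6, frozenset('D'): 1}
C2 = {frozenset('BCD'): 4, frozenset('A'): 1}
# Yields ABC:6, BCD:4, BC:10, A:7 and D:5, matching the example above.
print(merge(C1, C2))
```

Note how BC = ABC ∩ BCD appears only in the merged result, with support 6 + 4 = 10, while the support of A combines the A-containing closed itemsets of the two partitions (6 + 1 = 7).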
C = (. . . ((C1 ⊕ C2) ⊕ . . . ⊕ Cm) . . .).
Unfortunately, this merging function works properly only if all the locally frequent closed itemsets extracted, i.e. C1, ..., Cm, are also globally frequent. In order to ensure this, a possible solution is to adopt an expensive method like the one used by the Distributed Partition algorithm, which requires a second global scan of the dataset partitions to check whether the locally frequent (closed) itemsets are also globally frequent. While the Distributed Partition algorithm is able to give the exact support values, it has pros and cons with respect to other distributed algorithms. The pros concern the number of communications/synchronizations: other methods, such as count-distribution [2, 14], require several communications/synchronizations, while the Distributed Partition algorithm only requires two communications from the slaves to the master, and a single one from the master to the slaves. The cons are the size of the messages exchanged, and the additional computation possibly performed by the slaves when the first phase of the algorithm produces false positives. Consider that, when low absolute minimum supports are used, many false positives are likely to be produced, due to the data skew present in the various dataset partitions [8]. This also has a large impact on the cost of the second phase of the algorithm: most of the slaves will participate in counting the local supports of these false positives, thus wasting a lot of time. Furthermore, in the frequent closed itemset case, a number of unnecessary intersections will also be performed. One naïve work-around, which we name Distributed One-pass Partition, consists in stopping Distributed Partition after the first pass. In Distributed One-pass Partition each slave independently computes the locally frequent patterns and sends them to the master, which sum-reduces the supports of each pattern and writes in the result set only the patterns for which the sum of the known supports is greater than (or equal to) minsup · |D|. In the frequent closed case, the sum-reduction is replaced by the merge operator ⊕, which also generates the closed patterns that are subsumed by other patterns in the local results but are no longer subsumed in the global result. Distributed One-pass Partition has obvious performance advantages over Distributed Partition. On the other hand, it yields a result that is approximate with respect to both the content of the result set and the supports of the discovered patterns. A globally frequent pattern x may be infrequent on some partition Di: since σi(x) < minsup · |Di|, x will not be returned as a frequent pattern by the i-th slave. As a consequence, the master of Distributed One-pass Partition cannot rely on the knowledge of σi(x), and thus cannot exactly compute the global support of x. Worse, the master might also erroneously deduce that x is not globally frequent, because Σj≠i σj(x) < minsup · |D|. When dealing with closed itemsets the global support is computed in a different way [5], but the issue of the inexact counting of globally frequent patterns remains the same.
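The miscount can be made concrete with a tiny numeric sketch (all numbers are hypothetical): suppose |D1| = |D2| = 10 and minsup = 0.4, so the global absolute threshold is 8. A pattern x with σ1(x) = 6 and σ2(x) = 3 is globally frequent (6 + 3 = 9 ≥ 8), but the second slave does not report it (3 < 4), so the one-pass master only sees a support of 6 and wrongly discards x:

```python
def one_pass_master(reported, minsup, n):
    """Master of Distributed One-pass Partition: sum-reduce only the
    supports actually reported by the slaves (locally frequent patterns)."""
    totals = {}
    for local in reported:
        for x, s in local.items():
            totals[x] = totals.get(x, 0) + s
    return {x: s for x, s in totals.items() if s >= minsup * n}

# Slave 1 reports x (6 >= 0.4 * 10); slave 2 stays silent about x (3 < 4).
reported = [{'x': 6}, {}]
print(one_pass_master(reported, 0.4, 20))  # -> {}: x is lost, although
                                           # its true global support is 9
```

Had slave 2 been forced to report σ2(x) = 3 as well (as in the second pass of Distributed Partition), the master would have correctly kept x with support 9.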
4 The APClosed algorithm
APClosed, the distributed algorithm we propose in this paper, tries to overcome some of the problems encountered by Distributed One-pass Partition in frequent closed itemset discovery by using the same interpolation method proposed by AP [13] for frequent itemsets. When the algorithm needs the support σi(x) of a pattern x in the i-th partition, where it is currently unknown, APClosed infers an approximate value σi(x)^interp by exploiting an interpolation method. The master bases its interpolation on the knowledge of: • the exact support of each single item on all the partitions, and • the supports of the subpatterns of x on all the partitions
where they actually resulted frequent.
Besides the issues arising from the need to approximate the unknown local supports of some closed itemsets, our algorithm faces the problem of efficiently implementing the operator ⊕ used to merge the local results (see Theorem 3.1). In particular, we need a fast method to determine whether a given closed itemset x is a subset of other itemsets already present in the global collection, or whether there exists any closed itemset whose set-intersection with x is not empty. To this end, we exploit a sort of inverted file, a very popular Information Retrieval index. In particular, we assign a unique identifier (CID) to every stored closed itemset, and then store, for each frequent single item, the list of CIDs associated with the closed itemsets that include it. Using this index it is possible not only to perform fast intersections of closed itemsets, but also to derive the various local supports of the new itemsets generated by the merging function. In the final paper we shall give all the details of the approximation function actually used to interpolate the unknown local supports of closed itemsets, and discuss in depth our implementation of the merging function based on the operator ⊕, also presenting preliminary results.
References
[1] Gosta Grahne and Jianfei Zhu. Efficiently using prefix-trees in mining frequent itemsets. In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, November 2003.
[2] E.-H. S. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. IEEE Transactions on Knowledge and Data Engineering, 2000.
[3] H. Kargupta and P. Chan (Eds.). Advances in Distributed and Parallel Knowledge Discovery. AAAI Press/The MIT Press, 2000.
[4] C. Lucchese, S. Orlando, and R. Perego. Fast and Memory Efficient Mining of Frequent Closed Itemsets. Technical Report TR-CS-2004, Ca' Foscari University of Venice, Italy, 2004.
[5] Claudio Lucchese, Salvatore Orlando, and Raffaele Perego. Distributed mining of frequent closed itemsets: Some preliminary results. In Proceedings of HPDM'05, 8th SIAM Workshop on High Performance Distributed Mining, 2005.
[6] A. Mueller. Fast sequential and parallel algorithms for association rules mining: A comparison. Technical Report CS-TR-3515, Univ. of Maryland, 1995.
[7] B. Park and H. Kargupta. Distributed Data Mining: Algorithms, Systems, and Applications. In Data Mining Handbook, pages 341–358. IEA, 2002.
[8] Srinivasan Parthasarathy. Efficient progressive sampling for association rules. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM'02), page 354. IEEE Computer Society, 2002.
[9] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Efficient mining of association rules using closed itemset lattices. Information Systems, 24(1):25–46, 1999.
[10] Jian Pei, Jiawei Han, and Runying Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. In SIGMOD International Workshop on Data Mining and Knowledge Discovery, May 2000.
[11] Jian Pei, Jiawei Han, and Jianyong Wang. CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In SIGKDD '03, August 2003.
[12] Ashoka Savasere, Edward Omiecinski, and Shamkant B. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB '95, Proceedings of the 21st International Conference on Very Large Data Bases, pages 432–444. Morgan Kaufmann, September 1995.
[13] C. Silvestri and S. Orlando. Distributed Approximate Mining of Frequent Patterns. In Proceedings of the 2005 ACM Symposium on Applied Computing (SAC 2005), special track on Data Mining, pages 529–536, 2005.
[14] M. J. Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12:372–390, May/June 2000.
[15] Mohammed J. Zaki. Mining non-redundant association rules. Data Mining and Knowledge Discovery, 9(3):223–248, 2004.
[16] Mohammed J. Zaki and Ching-Jui Hsiao. CHARM: An efficient algorithm for closed itemsets mining. In 2nd SIAM International Conference on Data Mining, April 2002.