ExMiner: An Efficient Algorithm for Mining Top-K Frequent Patterns
Tran Minh Quang, Shigeru Oyanagi, and Katsuhiro Yamazaki
Graduate School of Science and Engineering, Ritsumeikan University, Kusatsu City, Japan
{[email protected], oyanagi@cs, yamazaki@cs}.ritsumei.ac.jp
Abstract. Conventional frequent pattern mining algorithms require users to specify a minimum support threshold. If the specified value is large, users may lose interesting information. In contrast, a small minimum support threshold results in a huge set of frequent patterns that users may not be able to screen for useful knowledge. To solve this problem and make algorithms more user-friendly, the idea of mining the k-most interesting frequent patterns has been proposed. This idea is based upon an algorithm that mines frequent patterns without a minimum support threshold, given instead the number k of highest-frequency patterns to return. In this paper, we propose an explorative mining algorithm, called ExMiner, to mine the k-most interesting (i.e. top-k) frequent patterns from large scale datasets effectively and efficiently. ExMiner is then combined with the idea of “build once mine anytime” to mine top-k frequent patterns sequentially. Experiments on both synthetic and real data show that our proposed methods are more efficient than the existing ones.
X. Li, O.R. Zaiane, and Z. Li (Eds.): ADMA 2006, LNAI 4093, pp. 436–447, 2006. © Springer-Verlag Berlin Heidelberg 2006
1 Introduction
Frequent pattern mining is a fundamental problem in data mining and knowledge discovery. The discovered frequent patterns can be used as the input for analyzing association rules, mining sequential patterns, recognizing clusters, and so on. However, discovering frequent patterns in large scale datasets is an extremely time consuming task. Various efficient algorithms have been proposed for this problem in the last decade. These algorithms can be classified into two categories: the “candidate-generation-and-test” approach and the “pattern-growth” approach.
The Apriori algorithm [1] is the representative of the “candidate-generation-and-test” approach. This algorithm applies the monotonicity property of frequent patterns (every subset of a frequent pattern is frequent) to create candidate k-itemsets from the set of frequent (k-1)-itemsets. The candidates are then checked against the minimum support threshold by scanning the dataset. Following this approach, an extremely large number of candidates is generated and the dataset is scanned many times, which slows down the algorithm. The representative of the “pattern-growth” approach is the FP-growth algorithm [2], which scans the dataset only twice to compact the data into a special data structure (the FP-tree) that makes frequent patterns easier to mine. The FP-growth algorithm recognizes the shortest frequent patterns (frequent 1-itemsets) and then “grows” them to longer ones instead of generating and testing candidates. Owing to this graceful mining approach, the FP-growth algorithm reduces the mining time significantly.
Conventional frequent pattern mining algorithms require users to provide a support threshold, which is very difficult to identify without knowledge of the dataset in advance. A large minimum support threshold results in a small set of frequent patterns in which users may not discover any useful information. On the other hand, if the support threshold is small, users may not be able to screen the actually useful information from the huge resulting set of frequent patterns. In recent years, much research has been dedicated to mining frequent patterns based upon user-friendly concepts such as maximal pattern mining [3][4], closed pattern mining [5][6], and mining the most interesting frequent patterns (top-k mining) [7][8][9][10][11].
With regard to usability and user-friendliness, top-k mining permits users to mine the k-most interesting frequent patterns without providing a support threshold. In real applications, users may need only the k-most interesting frequent patterns to examine for useful information. If they fail to discover useful knowledge, they can continue to mine the next top-k frequent patterns, and so on.
The difficulty in mining top-k frequent patterns is that the minimum support threshold is not given in advance. The support threshold is initially set to 0, and the algorithm has to raise it gradually in order to prune the search space. A good algorithm is one that can raise the support threshold from 0 to its actual value effectively and efficiently.
To the best of our knowledge, top-k mining was first introduced in [9], which extended the Apriori approach to find the k-most interesting frequent patterns. Fu, A.W., et al. introduced the Itemset-Loop/Itemset-iLoop algorithms [7] to mine the n-most interesting patterns in each set of k-itemsets.
Even though these methods apply some optimizations, they suffer from the same disadvantages as the Apriori approach. Top-k FP-growth [11] is an extension of FP-growth that mines top-k frequent patterns by adopting a reduction array to raise the support value. However, the Top-k FP-growth algorithm is an exhaustive approach and raises the support value slowly. Besides that, this method has a problem of effectiveness: it presents exactly k frequent patterns, where k is a user-specified number. For a given k, however, the set of k-most interesting frequent patterns of a dataset may contain more than k patterns, since some patterns may have the same support.
This research proposes a new algorithm, called ExMiner, to mine top-k frequent patterns effectively and efficiently. The mining task in this algorithm is divided into two phases: the “explorative mining” phase and the “actual mining” phase. The explorative mining phase is performed first to recognize the optimal internal support threshold for the given number k. This value is provided as the support threshold parameter for the actual mining phase. With an optimal internal support threshold, the actual mining phase can mine top-k frequent patterns efficiently. ExMiner is then combined with the idea of “build once mine anytime” to mine top-k frequent patterns sequentially. This approach is valuable in real applications.
The rest of the paper is organized as follows. Section 2 gives the preliminary definitions. The explorative mining algorithm, ExMiner, is described in Section 3. The idea of mining top-k frequent patterns sequentially is explained in Section 4. Section 5 presents the experimental evaluation, and Section 6 concludes the paper.
2 Preliminary Definitions
Let I = {i_1, i_2, …, i_n} be a set of items. An itemset X is a non-empty subset of I, also called a pattern. A set of k items is called a k-itemset. A transaction T is a duple <Tid, X>, where X is an itemset and Tid is the identifier. A transactional database D is a set of transactions.
Definition 1. The support of an itemset X, denoted as Sup(X), is the number¹ of transactions in D that contain X.
Definition 2. Let θ be the minimum support threshold. An itemset X is called a frequent itemset or a frequent pattern if Sup(X) is greater than or equal to θ.
Definition 3. Let α be the support of the kth pattern in a set of frequent patterns sorted in descending order of their supports. The k-most interesting frequent patterns are the set of patterns whose support is not smaller than α.
Observation 1. A set of k-most interesting frequent patterns may contain more or fewer than k patterns, because some patterns may have the same support value or the dataset may be too small to generate k patterns.
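As a minimal illustration of Definitions 1 and 3 and of Observation 1, the following Python sketch computes supports by brute force on a toy database and returns the k-most interesting patterns, ties included. The toy transactions and the names `support` and `top_k_patterns` are assumptions made for this illustration and do not come from the paper. Exhaustive enumeration is used only for clarity; it is not how ExMiner works.

```python
from itertools import combinations

def support(itemset, database):
    """Definition 1: number of transactions that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for transaction in database if items <= transaction)

def top_k_patterns(database, k):
    """Definition 3: all patterns whose support is >= the support of the k-th pattern."""
    all_items = sorted(set().union(*database))
    supports = {}
    # Brute-force enumeration of every non-empty itemset (feasible for toy data only).
    for size in range(1, len(all_items) + 1):
        for itemset in combinations(all_items, size):
            count = support(itemset, database)
            if count > 0:
                supports[itemset] = count
    # Sort by descending support; ties are ordered by itemset for determinism.
    ranked = sorted(supports.items(), key=lambda pair: (-pair[1], pair[0]))
    if len(ranked) <= k:
        return ranked              # dataset too small: fewer than k patterns
    alpha = ranked[k - 1][1]       # support of the k-th pattern
    return [pair for pair in ranked if pair[1] >= alpha]

# Hypothetical toy database: each transaction is a set of items.
D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}, {"b", "c"}]
result = top_k_patterns(D, 2)
```

With k = 2, the call returns three patterns (a, b and c), because b and c tie at support 3. This is the situation described in Observation 1.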
¹ The support can also be defined as a relative value, that is, the occurrence frequency divided by the total number of transactions in the dataset under consideration.
3 ExMiner Algorithm
In traditional frequent pattern mining, the key parameter is the minimum support threshold. With a given minimum support threshold, algorithms can prune the search space efficiently to reduce the computation time. In contrast, top-k mining algorithms are not provided with such a useful parameter and have to identify that value automatically according to the given number k. The better top-k mining algorithms are those that can identify the support threshold not only faster but also more effectively and efficiently than others. Our purpose is to propose an algorithm that can recognize the final internal support threshold, which is then used directly to generate only the actual top-k frequent patterns (without candidate generation). This ability significantly reduces the additional computation time in the mining phase. Based on that idea, an algorithm called ExMiner is proposed.
The ExMiner algorithm is inspired by mineral mining activities in the real world, in which some explorative mining is performed before the actual mining. The term ExMiner stands for “explorative miner”. ExMiner extends FP-growth to mine top-k frequent patterns effectively and efficiently in the following two respects: a) setting the internal threshold border_sup, and b) performing an explorative mining to recognize an effective “final internal support threshold”, which is used in the actual mining phase for recognizing frequent patterns.
Setting border_sup: ExMiner scans the dataset once to count the supports of all items and sorts them in descending order of their supports. The border_sup is set to the support value of the kth item in the sorted list. The list of top-k frequent items, denoted as F-list, is the list of items whose support is not smaller than border_sup. ExMiner then performs a second scan of the dataset to construct an FP-tree according to the F-list.
Explorative mining: Instead of mining top-k frequent patterns immediately after the FP-tree has been built, ExMiner first performs a virtual, explorative mining. The purpose of this virtual mining routine, called VirtualGrowth, is to identify the final internal support threshold. The pseudo code of the ExMiner algorithm is described in Figure 1.
Input: Dataset D, number of patterns k
Output: top-k frequent patterns
Method:
1. Scan D to count the support of all 1-itemsets
2. According to k, set border_sup and generate F-list. Insert the support values of the first k items in F-list into a queue, say supQueue
3. Construct an FP-tree according to F-list
4. Call VirtualGrowth(multiset* supQueue, FP-tree, null) to explore the FP-tree and set the final internal support threshold θ to the smallest element, minq, of supQueue
5. Mine the FP-tree with support threshold θ to output top-k frequent patterns
Fig. 1. Pseudo code of the ExMiner algorithm
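Steps 1 and 2 of Fig. 1 can be sketched in Python as follows. This is only an illustrative sketch: the function name `init_top_k`, the toy dataset, the alphabetical tie-breaking, and the fallback used when the dataset has fewer than k distinct items are all assumptions, not details given in the paper. FP-tree construction, VirtualGrowth and the final mining (steps 3 to 5) are deliberately left out.

```python
from collections import Counter

def init_top_k(database, k):
    """Steps 1-2 of Fig. 1: count item supports, set border_sup, build F-list and supQueue.

    Assumes k >= 1.
    """
    # Step 1: one scan over the dataset to count the support of every 1-itemset.
    item_support = Counter(item for transaction in database for item in set(transaction))
    # Sort items by descending support (assumption: ties broken by item name).
    ranked = sorted(item_support.items(), key=lambda pair: (-pair[1], pair[0]))
    if not ranked:
        return 0, [], []
    # Step 2: border_sup is the support of the k-th item
    # (assumption: fall back to the last item if fewer than k items exist).
    border_sup = ranked[min(k, len(ranked)) - 1][1]
    # F-list keeps every item whose support is not smaller than border_sup, ties included.
    f_list = [item for item, sup in ranked if sup >= border_sup]
    # supQueue starts with the supports of the first k items, in descending order.
    sup_queue = [sup for _, sup in ranked[:k]]
    return border_sup, f_list, sup_queue

# Hypothetical toy dataset: each transaction is a list of items.
D = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["a", "d"], ["b", "c"]]
border_sup, f_list, sup_queue = init_top_k(D, 2)
```

For this toy dataset and k = 2, the item supports are a: 4, b: 3, c: 3, d: 1, so border_sup is 3, the F-list is [a, b, c] (c ties with b), and supQueue starts as [4, 3].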
The pseudo code of the VirtualGrowth routine, called in step 4 of the ExMiner algorithm, is depicted in Figure 2 and described below. Initially, supQueue contains the supports of the first k items in the F-list (1-itemsets), sorted in descending order of their values. If the tree is a single-path tree, the routine examines each node in the tree in the top-down direction (line 1). While the support of the node under consideration is greater than the last element in supQueue, say minq (line 2), it figures out the number of potential patterns (together with their support values) that can be generated from that node (lines 3 and 4). The condition α ≠ null is checked to avoid duplicating the support values of 1-itemsets in supQueue. Note that, for a node, say d, at position i (i = 1 for the first “real” node under Root), there are 2^(i-1) combinations of d with the other nodes in the path (d alone, or d together with any subset of the i-1 nodes above it). If α is not null, it serves as a real prefix for each of the above combinations, making all 2^(i-1) patterns long ones (i.e. containing more than one item). Lines 5 and 6 try to put the c recognized supports (i.e. δ) in place of the last c elements of supQueue (the smallest c values). If the tree is not a single-path tree (line 7), the routine traverses the tree from the top of the header table. Let a_i be the item under consideration. While sup(a_i) is greater than minq (line 8), the supports of the potential patterns generated by “growing” this item are eligible to replace minq. Line 9 replaces minq by the support of a potential long pattern. A new prefix β and its conditional FP-tree, Tree_β, are created (lines 10 and 11). If this conditional FP-tree is not empty, the routine is called recursively to deal with the new conditional FP-tree (line 12). When the routine terminates, supQueue has converged and its last element, minq, serves as the final internal support threshold for mining top-k frequent patterns.
Procedure VirtualGrowth(supQueue, Tree, α){
(1) If Tree contains a single path P {
      i = 1; δ = sup(n_i); minq = last element of supQueue;
(2)   While (δ > minq) {
(3)     If (α ≠ null) c = 2^(i-1);
(4)     Else c = 2^(i-1) - 1;
(5)     For each of the c values of δ
(6)       If (δ > minq) replace minq in supQueue by δ; // After replacing, minq is updated
        i++;
        If (n_i ≠ null) δ = sup(n_i) Else δ = 0;
    }}
(7) Else
(8)   While (sup(a_i) > minq){ // a_i in the header table
(9)     If (α ≠ null) replace minq by sup(a_i);
(10)    β = a_i ∪ α; ai