FAST FREQUENT ITEMSET MINING USING COMPRESSED DATA REPRESENTATION
Raj P. Gopalan, Yudho Giri Sucahyo
School of Computing, Curtin University of Technology
Kent St, Bentley, Western Australia 6102
{raj, sucahyoy}@computing.edu.au
ABSTRACT
Discovering association rules by identifying relationships among sets of items in a transaction database is an important problem in Data Mining. Finding frequent itemsets is computationally the most expensive step in association rule discovery and therefore it has attracted significant research attention. In this paper, we describe a more efficient algorithm for mining complete frequent itemsets from typical data sets. We use a compressed prefix tree and our algorithm extracts the frequent itemsets directly from the tree. We present performance comparisons of our algorithm against the fastest Apriori algorithm, Eclat, and FP-Growth. These results show that our algorithm outperforms other algorithms on several widely used test data sets.
KEY WORDS Knowledge Discovery, Data Mining, Association Rules, Frequent Itemsets
1. INTRODUCTION
Data mining is used to extract structured knowledge automatically from large data sets. Association Rules, Sequential Patterns, Classification, Similarity Analysis, Summarization and Clustering are major areas of interest in data mining. Among these, mining association rules [1] has been a very active research area. The process of mining association rules consists of two main steps: 1) Finding the frequent itemsets that have a minimum support; 2) Using the frequent itemsets to generate association rules that meet a confidence threshold. Step 1 is the more expensive of the two since the number of itemsets grows exponentially with the number of items. A large number of increasingly efficient algorithms to mine frequent itemsets have been developed over the years [2],[3],[4],[6],[7]. There are two main strategies for mining frequent itemsets: the candidate generation-and-test approach and the pattern growth approach. Apriori [2] and its several variations belong to the first approach, while FP-Growth [3] and H-Mine [4] are examples of the second. Apriori suffers from poor performance when mining dense datasets since it has to traverse the database many times to check the support of candidate itemsets. This problem is partly overcome by algorithms based on pattern growth.
Several data structures have been used for mining frequent itemsets. They can be divided into two main categories: array-based and tree-based. In array-based representation, transactions are stored as arrays of items in memory. For example, H-Mine uses an array structure called Hstruct [4]. In tree-based representation, variants of prefix trees are used. A prefix tree can be used to organize the transactions by grouping and storing transaction data in memory. In [6], transactions are first grouped using a compressed prefix tree and mapped to an array-based data structure, which is then used for the mining process. In this paper, we propose a new algorithm named CT-Mine for mining complete frequent itemsets directly from the compressed prefix tree introduced in [6]. FP-Growth also uses a tree-based data representation, with additional links that connect nodes containing the same item [3]. The performance of CT-Mine has been compared against the best available implementations of the Apriori algorithm [8], Eclat [7] and FP-Growth [3]. Our experiments show that compressing the prefix tree can significantly improve the performance of mining frequent itemsets on a number of typical data sets. The structure of the rest of this paper is as follows: In Section 2, we define the relevant terms and describe the compressed transaction tree. In Section 3, we present the CT-Mine data structure and algorithm. The experimental results of the algorithm on various datasets are given in Section 4. In Section 5, we compare our algorithm with other algorithms. Section 6 contains the conclusion and pointers for further work.
2. PRELIMINARIES
In this section, we define the terms used for describing association rule mining. We also describe the compressed transaction tree, which is a modified prefix tree first described in [6].
2.1 Definition Of Terms
We give the basic terms for describing association rules using the formalism of [1]. Let I = {i1, i2, ..., in} be a set of items, and D be a set of transactions, where a transaction T is a subset of I (T ⊆ I). Each transaction is identified by a TID. An association rule is an expression of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. X and Y can consist of one or more items. X is the body of the rule and Y is the head. The proportion of transactions that contain both X and Y to those that contain X is the confidence of the rule. The proportion of all transactions that contain both X and Y is the support of the itemset {X, Y}. An itemset is called a frequent itemset if its support is greater than or equal to a support threshold specified by the user; otherwise the itemset is not frequent. An association rule with confidence greater than or equal to a confidence threshold is considered a valid association rule.
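As a concrete illustration of these definitions, the following sketch computes support and confidence for a toy database (the transactions and item names here are illustrative, not from the paper):

```python
# Toy database of four transactions (illustrative only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, db):
    # Proportion of transactions that contain every item of `itemset`.
    return sum(itemset <= t for t in db) / len(db)

def confidence(body, head, db):
    # Proportion of transactions containing `body` that also contain `head`.
    return support(body | head, db) / support(body, db)

print(support({"bread", "milk"}, transactions))       # 0.5 (2 of 4 transactions)
print(confidence({"bread"}, {"milk"}, transactions))  # 2/3 for the rule bread => milk
```

With a support threshold of 0.5, the itemset {bread, milk} is frequent; with a confidence threshold of 0.7, the rule bread ⇒ milk would not be valid.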
2.2 Compressed Transaction Tree
Several transactions in a database may contain the same sets of items. Even if the items of two transactions are originally different, early pruning of infrequent items from them could make their sets of items the same. In addition, two transactions that do not contain an identical set of items may have a subset of their items in common. These observations led us to the design of a modified prefix tree that can group transactions containing common items. Figure 1a shows a full prefix tree for items 1-4. All siblings are lexicographically ordered from left to right. Each node represents the set consisting of the node's element and all the elements on nodes in the path (prefix) from the root. It can be seen that the set of paths from the root to the different nodes of the tree represents all possible subsets of items that could be present in any transaction. A complete prefix tree has many identical subtrees in it. In Figure 1a, we can see three identical subtrees st1, st2 and st3. We compress the tree by storing the information of identical subtrees together. For example, we can compress the full prefix tree in Figure 1a containing 16 nodes to the tree in Figure 1b that has only 8 nodes, by storing some additional information at the nodes of the smaller tree. Given a set of n items, a prefix tree would have a maximum of 2^n nodes. The corresponding compressed tree will contain a maximum of 2^(n-1) nodes, which is half the maximum for a full tree. In practice, the number of nodes for a transaction database will have to be far less than the maximum for the problem of frequent itemset discovery to be tractable. In the compressed tree, we record information from the pruned nodes of the complete tree. The nodes along the leftmost path have entries for all itemsets corresponding to the paths ending at the node. Other nodes only have itemsets that are not present in the leftmost path, to avoid duplicates.
For example, node 4 in the leftmost path has itemsets 1234, 234, 34, and 4 (we omit the set notation for simplicity). At node 4 in the path 134, itemsets 34 and 4 are not represented since both of them have been registered at node 4 in the leftmost path.
The count of transactions with different item subsets is also recorded at each node. Each count entry has two fields: the level of the node containing the starting item of the subset in the tree, and the transaction count. For example, in Figure 1b, node 3 in the leftmost branch of the tree has three entries: (0,1), (1,1) and (2,1). The entry (0,1) means there is one transaction with items starting at level 0 along the path from the root of the tree to this node; it corresponds to the itemset 123. Similarly, the next entry (1,1) represents one transaction with itemset 23, and (2,1) means one transaction with item 3. In this example, we have assumed the transaction count to be one at every node of the transaction tree in Figure 1a. The dotted rectangles show the itemsets corresponding to the nodes in the compressed tree of Figure 1b, for illustration (they are not part of the data structure).
[Figure 1 appears here: (a) the identical subtrees st1, st2 and st3 in a full prefix tree over items 1-4; (b) the compressed prefix tree with its level-annotated count entries.]
Figure 1. Prefix Tree and Compressed Prefix Tree
3. CT-MINE DATA STRUCTURE AND ALGORITHM
The data structure used in CT-Mine consists of two parts as follows:
1. ItemTable: It contains all the frequent items and the support of each item. It also has a pointer to the root of the subtree of each frequent item, as shown in Figure 2d.
2. Compressed Transaction Tree: It contains all transactions of the database that contain frequent items. Only frequent items are stored in the tree, using the scheme described in the previous section.
There are three steps in the CT-Mine algorithm as follows:
1. Identify the 1-frequent itemsets and initialize the ItemTable: Due to the anti-monotone property, only 1-frequent items are needed to mine frequent itemsets. In this step, we scan the transaction database to identify all 1-frequent items and put them in the ItemTable. To support the compression scheme used in the second step, all entries in the ItemTable are sorted in ascending order of their frequency and the items are mapped to new identifiers that form an ascending sequence of integers.
2. Construct the Compressed Transaction Tree: Using the output of the first step, we read only the 1-frequent items from the transaction database, map them to the new item identifiers and then insert the transactions into the compressed transaction tree. Each node of the tree contains a 1-frequent item and a set of counts indicating the number of transactions that contain subsets of items in the path from the root, as explained in Section 2.
3. Mining: All the frequent itemsets of two or more items are mined in this step using a recursive function. The pseudo code of this step is shown in Figure 3.
The CT-Mine algorithm is illustrated by the following example.
Example 1. Let the table in Figure 2a be the transaction database, and suppose the user wants to find the frequent itemsets with a minimum support of 2 transactions.
(a) Sample database:
Tid  Items
1    3 4 5 6 7 9
2    1 3 4 5 13
3    1 2 4 5 7 11
4    1 3 4 8
5    1 3 4 10

(b) Frequent items:
Tid  Items
1    3 4 5 7
2    1 3 4 5
3    1 4 5 7
4    1 3 4
5    1 3 4

(c) Mapped transactions:
Tid  Items
1    1 2 3 5
2    2 3 4 5
3    1 2 4 5
4    3 4 5
5    3 4 5
ItemTable:
Index  1  2  3  4  5
Item   7  5  3  1  4
Count  2  3  4  4  5
PST    (pointer to the subtree of each item)
[(d) The compressed transaction tree built from the mapped transactions, with its per-node count entries, appears here.]
Figure 2. Sample Database and CT-Mine Data Structure
In Step 1, all transactions in the database are read to identify the frequent items. For each item in a transaction, the existence of the item in the ItemTable is checked. If the item is not present in the ItemTable, it is entered with an initial count of one; otherwise the count of the item is incremented. On completing the database scan, the 1-frequent itemsets can be identified in the ItemTable as {1, 3, 4, 5, 7}. The entries in the ItemTable are then sorted in ascending order of item frequency. The item-ids are mapped to an ascending sequence of integers shown as the index row in Figure 2d. The index entries now represent the set of new item-ids.
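Step 1 can be sketched as follows. The function name is ours, and the tie-break between items of equal frequency (here, item 3 before item 1) is an assumption we make to reproduce the Index row of Figure 2d, since the paper does not specify a tie-breaking rule:

```python
from collections import Counter

def build_item_table(db, min_support):
    # Count item occurrences in a single database scan.
    counts = Counter(item for trans in db for item in trans)
    frequent = [i for i in counts if counts[i] >= min_support]
    # Sort by ascending frequency; the tie-break (descending original id)
    # is an assumption chosen to match Figure 2d.
    frequent.sort(key=lambda i: (counts[i], -i))
    # Map each frequent item to a new ascending integer id.
    return {item: new_id for new_id, item in enumerate(frequent, start=1)}

db = [[3, 4, 5, 6, 7, 9], [1, 3, 4, 5, 13], [1, 2, 4, 5, 7, 11],
      [1, 3, 4, 8], [1, 3, 4, 10]]
mapping = build_item_table(db, min_support=2)
print(mapping)  # {7: 1, 5: 2, 3: 3, 1: 4, 4: 5}, the Index row of Figure 2d
```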
As described in Section 2.2, each node in the tree has additional entries to keep the count of transactions represented by the node. When a transaction is inserted into the tree, all count entries along the path starting with the leftmost item are incremented. For example, when a transaction with itemset 1245 is inserted, the count entries at index 0 (since it starts with item 1) of itemsets 1, 12 and 124 will also be incremented. In the implementation of the tree, the counts at each node are stored in an array, so the level of an entry is its array index and need not be stored explicitly. As mentioned before, the dotted rectangles that show the different itemsets at a node are not part of the data structure.
In Step 2, we read only the 1-frequent items from the database (see Figure 2b) and each transaction is inserted into the tree using the mapping in the ItemTable (see Figure 2c). For example, 7 will be mapped to 1 in the tree, 5 to 2, and so on. In the ItemTable, there is also a pointer to the subtree (PST) of each item. This pointer will be used as a starting point to mine all frequent itemsets corresponding to the paths ending at nodes of the corresponding subtree. The result of this step is shown in Figure 2d.
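The construction of the compressed transaction tree can be sketched as below. The key point is that a transaction whose first mapped item is k enters the shared leftmost chain at the node for item k and records its count under start level k - 1, which is how identical subtrees are folded together. The class and method names are ours, not the paper's, and this is a minimal sketch rather than the authors' C++ implementation:

```python
class CTNode:
    def __init__(self, item):
        self.item = item
        self.children = {}  # mapped item id -> CTNode
        self.counts = {}    # start level -> number of transactions

class CompressedTree:
    def __init__(self, n_items):
        # Build the leftmost chain 1 -> 2 -> ... -> n up front; the identical
        # subtrees of a full prefix tree are folded onto it (Section 2.2).
        self.chain = [CTNode(i) for i in range(1, n_items + 1)]
        for a, b in zip(self.chain, self.chain[1:]):
            a.children[b.item] = b

    def insert(self, trans):
        # `trans` is an ascending list of mapped item ids, e.g. [1, 2, 4, 5].
        level = trans[0] - 1          # the 'level' field of the count entries
        node = self.chain[level]      # enter the shared chain at the first item
        node.counts[level] = node.counts.get(level, 0) + 1
        for item in trans[1:]:
            node = node.children.setdefault(item, CTNode(item))
            node.counts[level] = node.counts.get(level, 0) + 1

tree = CompressedTree(5)
for trans in [[1, 2, 3, 5], [2, 3, 4, 5], [1, 2, 4, 5], [3, 4, 5], [3, 4, 5]]:
    tree.insert(trans)       # the mapped transactions of Figure 2c
print(tree.chain[2].counts)  # {0: 1, 1: 1, 2: 2} at node 3, as in Figure 2d
```

After the five insertions, node 3 on the leftmost branch carries the entries (0,1), (1,1) and (2,2) for itemsets 123, 23 and 3, matching the walkthrough in Section 2.2.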
In the last step of the algorithm described by the pseudo code in Figure 3, each item of the ItemTable is used as the starting point to mine all frequent itemsets in the associated subtree. Starting with the subtree that has item 1 in the root (line 2), itemset counts at the node are checked for minimum support (line 3). The result is the frequent itemset 1(2), where the support of the itemset is shown in parenthesis. Then at line 4, the child subtrees of the current node are checked for occurrences of remaining items in
ItemTable, from the rightmost to the second item on the right of the current item. An array named sumArray (initialized in line 5), which has two fields to record the level and count, is used for storing the temporary results during traversal. Starting from the rightmost item, which is 5, we check the support of itemset 15 by visiting all the occurrences of item 5 under the subtrees of item 1 and keep adding the count at level 0 of the itemset counts in the sumArray. There are three occurrences of nodes with item 5: one node represents itemset 1245 with count 1, another node represents itemset 1235 with count 1, and the leaf of the leftmost branch has count 0. SumArray will have one entry with level 0 and count 2 at this point. Each entry of itemset count in the sumArray is checked for minimum support in line 7. Since 5 is the rightmost item in the ItemTable, itemset 15 cannot be extended further, so the process to mine longer frequent itemsets with 15 as the prefix will not continue (line 8). After finishing with 15, it continues to check the support of itemset 14. Since the support of 1 for 14 is below the minimum support, it will not continue to mine longer frequent itemsets beginning with 14 (line 8). Checking itemset 13 is similar to checking itemset 14, and the support count is 1. Item 3 is the second item to the right of item 1 in the ItemTable, so the process will not continue to item 2 (line 4). Itemsets with 12 as the prefix will be derived later while mining itemsets under the subtree of item 2. Since paths 15, 14 and 13 do not exist in the tree, line 12 is not executed. After finishing with the subtree of item 1, the program continues to mine frequent itemsets under the subtree of item 2. We get 12(2) and 2(3) from the root. Notice that the support of 2 is the sum of index 0 and index 1 of the itemset counts at node 2, and also that itemset 12 can be derived from there.
This is why we need to stop at the second item on the right of the current item (remember that we stopped at item 2 while mining the child subtrees of item 1). SumArray will contain two entries: entry 0 to keep the count of itemsets beginning with 1 and entry 1 to keep the count of itemsets beginning with 2. The program continues to mine 25(3) and 125(2) as well as 24(2) and 124(1). Every time we have finished visiting all occurrences of item 5 (or item 4), each itemset count in sumArray is checked to determine the count of 25 and 125 (or 24 and 124). Since 24 is frequent, the mining process continues to mine 245(2) as well as 1245(1) (line 9). Path 25 does not exist in the tree, so line 11 is not executed; but paths 24 and 245 exist, so we need to add their itemset counts to the leftmost branch of the tree. The total of itemset counts at node 4 in path 24 (which is 1) is added to the itemset count of node 4 representing itemset 4 (index 3) at the leftmost branch of the tree. The total of itemset counts at node 5 in path 245 (which is 1) is added to the itemset count of node 5
representing itemset 45 (index 3) at the leftmost branch of the tree.

/* Input : compressed transaction tree */
/* Output: Frequent Itemsets */
(1)  Procedure Mining
(2)    for each subtree root item i in ItemTable
(3)      check each count entry at the root of subtree item i and add to the set of FI if it is frequent
(4)      for each remaining item j in ItemTable after i, from the rightmost to the second item on the right of i
(5)        initialize sumArray
(6)        calculate support of all itemsets corresponding to the paths ending at node j and store the result into sumArray
(7)        check each count entry in sumArray and add to the set of FI if frequent
(8)        if there is some frequent itemset in sumArray and j is not the rightmost item in ItemTable
(9)          MineFI(ij)
(10)       endif
(11)       if there exists path ij in the tree
(12)         add each itemset count entry to the respective itemset count at the leftmost branch of the tree
(13)       endif
(14)     endfor
(15)   endfor
(16) Procedure MineFI(k)
(17)   for each remaining item l after the last item in k, from the rightmost to the next item on the right of k
(18)     initialize sumArray
(19)     calculate support of all itemsets corresponding to the paths ending at node l and store the result into sumArray
(20)     check each count entry in sumArray and add to the set of FI if frequent
(21)     if there is some frequent itemset in sumArray and l is not the rightmost item in ItemTable
(22)       MineFI(kl)
(23)     endif
(24)   endfor
Figure 3. Mining Step in CT-Mine Algorithm

From the root of the subtree for item 3, the counts of 123(1), 23(2) and 3(4) can be determined. When its child node 5 is visited, the counts of itemsets 35(4), 235(2) and 1235(1) are extracted. Path 35 exists in the tree, so the total of itemset counts at node 5 in path 35 (which is 1) is added to the itemset counts of node 5 representing itemset 5 (index 4) at the leftmost branch of the tree. Itemset 34 and other itemsets with 34 as their prefix need not be mined in this step since they will be mined when we mine the subtree of item 4. The root of the subtree of item 4 gives the counts of 1234(0), 234(1), 34(3) and 4(4). From the root of the subtree of item 5 we get itemsets 12345(0), 2345(1), 345(3), 45(4), 5(5), and the mining process stops.
As the final result, 17 frequent itemsets are extracted as follows (after mapping to their real item id): 4(5), 7(2), 74(2), 5(3), 75(2), 754(2), 54(2), 51(2), 514(2), 3(4), 53(2), 534(2), 34(4), 1(4), 31(3), 314(3), 14(4).
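The result above can be checked independently with a brute-force miner over the sample database. This naive enumeration is for verification only and has none of CT-Mine's efficiency:

```python
from itertools import combinations

# Sample database of Figure 2a.
db = [{3, 4, 5, 6, 7, 9}, {1, 3, 4, 5, 13}, {1, 2, 4, 5, 7, 11},
      {1, 3, 4, 8}, {1, 3, 4, 10}]
items = sorted({i for t in db for i in t})

# Enumerate every candidate itemset and keep those with support >= 2.
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        sup = sum(set(cand) <= t for t in db)
        if sup >= 2:
            frequent[cand] = sup

print(len(frequent))        # 17, matching the number of itemsets listed above
print(frequent[(4, 7)])     # 2, matching 74(2)
print(frequent[(1, 3, 4)])  # 3, matching 314(3)
```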
4. PERFORMANCE STUDY
Dataset (trans / items / avg trans)   Sup (%)  Freq Itemsets  Ranking
Connect-4 (67557 / 129 / 43)          95       2201           CT-Mine > Apriori > Eclat > FP-Growth
                                      90       27127          CT-Mine > Apriori > Eclat > FP-Growth
                                      85       142127         CT-Mine > Apriori > FP-Growth > Eclat
                                      80       533975         CT-Mine > FP-Growth > Apriori > Eclat
                                      75       1585551        CT-Mine > FP-Growth > Apriori > Eclat
                                      70       4129839        CT-Mine > FP-Growth > Apriori > Eclat
                                      65       9724531        CT-Mine > FP-Growth > Eclat > Apriori
                                      60       21252795       CT-Mine > FP-Growth > Eclat > Apriori
                                      55       44064031       CT-Mine > FP-Growth > Eclat
                                      50       88316367       CT-Mine > FP-Growth
Chess (3196 / 75 / 37)                90       622            CT-Mine > Eclat > Apriori > FP-Growth
                                      80       8227           CT-Mine > Eclat > Apriori > FP-Growth
                                      70       48731          CT-Mine > Eclat > FP-Growth > Apriori
                                      60       254944         Eclat > CT-Mine > FP-Growth > Apriori
                                      50       1272932        Eclat > FP-Growth > CT-Mine > Apriori
Pumsb* (49046 / 2087 / 50)            70       29             CT-Mine > Apriori > FP-Growth > Eclat
                                      60       167            CT-Mine > Apriori > FP-Growth > Eclat
                                      50       679            CT-Mine > Apriori > FP-Growth > Eclat
                                      40       27354          FP-Growth > Apriori > Eclat > CT-Mine

Table 1. Ranking of Algorithms
In this section, the performance of CT-Mine is compared with the fastest available implementation of Apriori (version 4.04), Eclat [7] and FP-Growth. Apriori 4.04 also uses a prefix tree to store the transactions in memory. FP-Growth is a tree-based algorithm for mining frequent itemsets, as mentioned earlier. Implementations of Eclat and Apriori were downloaded from [8]. All programs are written in MS Visual C++ 6.0. All testing was performed on an 866 MHz Pentium III PC with 512 MB RAM and a 30 GB hard disk, running MS Windows 2000. In this paper, the runtime includes both CPU time and I/O time.
Several datasets were used to test the performance, including Mushroom, Chess, Connect-4, Pumsb* and BMS-Web-View1. Due to space limitations, we show results for only Connect-4, Chess and Pumsb*. Both Connect-4 and Chess are dense datasets since they contain many long patterns of frequent itemsets for very high values of support [9]. Pumsb*, containing census data from PUMS (Public Use Microdata Samples) [10], is a relatively less dense data set. Table 1 shows the characteristics of each dataset, the number of frequent itemsets generated, and the ranking of algorithms. Performance comparisons of CT-Mine, Apriori, Eclat and FP-Growth on these datasets are shown in Figure 4, with runtime plotted on a logarithmic scale along the y-axis. It can be seen that the ranking of the algorithms varies for the different datasets and support levels we used (see Table 1).
[Figure 4 appears here: runtime (seconds, log scale) against support threshold (%) for the Connect-4, Chess and Pumsb* datasets, comparing Apriori, FP-Growth, Eclat and CT-Mine.]
Figure 4. Performance Comparisons on Connect-4, Chess and Pumsb*

On Connect-4, CT-Mine outperforms the other algorithms at all support levels we used. The runtime of CT-Mine is only about half that of FP-Growth. We did not continue testing below support 60% for Apriori and support 55% for Eclat since they took a very long time. From these results, it is clearly seen that mining frequent itemsets from the compressed tree significantly improves performance.
On Chess, CT-Mine performs better than other algorithms at support 70%-90% and it is better than FP-Growth and Apriori at support 60%. On Pumsb*, CT-Mine is faster than other algorithms at support 50%-70%.
5. COMPARISON WITH OTHER ALGORITHMS
In this section, we compare CT-Mine with Apriori, Eclat and FP-Growth to highlight the significant differences between our algorithm and the others.
Apriori. Apriori 4.04 uses a prefix tree to store the transactions and the frequent itemsets. The Apriori algorithm follows the candidate generation-and-test approach [1]. A set of candidate itemsets of length n + 1 is generated from itemsets of length n, and then each candidate itemset is checked to see if it is frequent. Apriori has to traverse the database many times to test the support of candidate itemsets. CT-Mine follows the pattern growth approach, where support is counted while extending the frequent itemsets by traversing the compressed transaction tree. As mentioned earlier, CT-Mine performs better than Apriori.
Eclat. Eclat is similar to Apriori, but represents the database in memory vertically and uses tid-intersection in its mining process. It creates a tid-list of all transactions in which an item occurs. There is no compression scheme in Eclat, so if the database contains, say, 1000 transactions with an identical set of items, Eclat creates tid-lists of length 1000 for their items, whereas CT-Mine stores those transactions on a single path of the prefix tree.
FP-Growth. The FP-Growth algorithm uses the FP-Tree, which is a variant of the prefix tree [3]. The FP-Tree has a header table that contains each frequent item in frequency-descending order and a pointer that links the occurrences of each item in the FP-Tree. The FP-Tree is used to group transactions with identical sets of items, but there is no compression scheme to combine identical subtrees as in CT-Mine. Therefore, depending on the characteristics of the dataset, the number of nodes in the compressed prefix tree of CT-Mine can be smaller than that of the corresponding FP-Tree by up to half. Both FP-Growth and CT-Mine follow the pattern growth approach.
However, in the mining process, FP-Growth projects a number of conditional FP-Trees to mine frequent itemsets that start with different items. In CT-Mine, once the compressed transaction tree is constructed, it alone is used throughout the mining process, with no need to build conditional trees. As mentioned earlier, CT-Mine shows significant performance gains over FP-Growth.
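For illustration, the vertical (tid-list) representation that Eclat relies on can be sketched as follows, using the frequent items of Example 1. The support of a longer itemset is the size of the intersection of its items' tid-lists; this is a hedged sketch of the idea, not Zaki's implementation:

```python
# Transactions of Figure 2b (tid -> frequent items only).
db = {1: [3, 4, 5, 7], 2: [1, 3, 4, 5], 3: [1, 4, 5, 7],
      4: [1, 3, 4], 5: [1, 3, 4]}

# Build the vertical representation: item -> set of tids containing it.
tidlists = {}
for tid, items in db.items():
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

# Support of an itemset = size of the intersection of its tid-lists.
print(len(tidlists[4] & tidlists[7]))                # 2, itemset {4, 7}
print(len(tidlists[1] & tidlists[3] & tidlists[4]))  # 3, itemset {1, 3, 4}
```

With many duplicate transactions, each duplicate adds a tid to every list, whereas in CT-Mine duplicates only increment the counts on one shared tree path.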
6. CONCLUSION
In this paper, we have described a new algorithm called CT-Mine for efficient discovery of frequent itemsets. A prefix tree is used to group transactions, and the tree is compressed for efficient use of memory. We have compared the performance of CT-Mine against Apriori, Eclat and FP-Growth on various datasets, and the results show that at most of the common support levels used in mining, CT-Mine outperforms the others. As the compressed transaction tree for huge databases will not fit into main memory, work to extend this algorithm for very large databases is currently in progress. The use of constraints can help to reduce the number of frequent itemsets generated [5]. It is planned to integrate the processing of constraints into CT-Mine.
7. ACKNOWLEDGEMENT
We are very grateful to Jian Pei for providing the FP-Growth program on his web site and to Christian Borgelt for the Apriori and Eclat programs. We thank the anonymous referees for their constructive comments that helped us to improve this paper.
REFERENCES
[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases", Proc. of ACM SIGMOD, Washington DC, 1993.
[2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules", Proc. of the 20th Int. Conf. on VLDB, Santiago, Chile, 1994.
[3] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation", Proc. of ACM SIGMOD, Dallas, TX, 2000.
[4] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang, "H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases", Proc. of IEEE ICDM, San Jose, California, 2001.
[5] J. Pei, J. Han, and L. V. S. Lakshmanan, "Mining Frequent Itemsets with Convertible Constraints", Proc. of 17th ICDE, Heidelberg, Germany, 2001.
[6] Y. G. Sucahyo and R. P. Gopalan, "CT-ITL: Efficient Frequent Item Set Mining Using a Compressed Prefix Tree with Pattern Growth", to appear in Proc. of 14th Australasian Database Conference, Adelaide, Australia, 2003.
[7] M. J. Zaki, "Scalable Algorithms for Association Mining", IEEE Transactions on Knowledge and Data Engineering, 12, May/June 2000, 372-390.
[8] http://fuzzy.cs.uni-magdeburg.de/~borgelt/
[9] Irvine Machine Learning Database Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html
[10] http://augustus.csscr.washington.edu/census/