A Fast Algorithm for Maintenance of Association Rules in Incremental Databases

Xin Li, Zhi-Hong Deng∗, and Shiwei Tang

National Laboratory on Machine Perception, School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
Abstract. In this paper, we propose an algorithm for maintaining the frequent itemsets discovered in a database with minimal re-computation when new transactions are added to, or old transactions are removed from, the transaction database. The algorithm, called EFPIM (Extending FP-tree for Incremental Mining), is based on the EFP-tree (extended FP-tree) structure. An important feature of our algorithm is that it requires no scan of the original database: the EFP-tree of the updated database can be obtained directly from the EFP-tree of the original database. We give two versions of the EFPIM algorithm, EFPIM1 (an easy version to implement) and EFPIM2 (a fast algorithm), both of which mine the frequent itemsets of the updated database based on the EFP-tree. Experimental results show that EFPIM outperforms existing algorithms in terms of execution time.
1 Introduction

Data mining, or knowledge discovery in databases, has attracted much attention in database research since it was first introduced in [1]. This is due to its wide applicability in many areas, including decision support, market strategy, and financial forecasting. Many algorithms have been proposed for this problem, such as Apriori [2] (and its modifications), the hash-based algorithm [3], FP-growth [4, 5], and the vertical method [6].

The rules discovered from a database only reflect the current state of the database. However, a database is not static, because updates are constantly being applied to it. Because of these updates, new association rules may appear and some existing association rules may become invalid at the same time. Re-mining the frequent itemsets of the whole updated database is clearly inefficient, because all the computation done in the previous mining is wasted. Thus, maintenance of discovered association rules is an important problem.

The FUP algorithm proposed in [7] and its successor FUP2 [8] are both Apriori-like algorithms, which have to generate a large number of candidates and repeatedly scan the database. In [9], the negative border is maintained along
Corresponding author.
X. Li, O.R. Zaiane, and Z. Li (Eds.): ADMA 2006, LNAI 4093, pp. 56 – 63, 2006. © Springer-Verlag Berlin Heidelberg 2006
with the frequent itemsets to perform incremental updates. This algorithm still requires a full scan of the whole database if an itemset outside the negative border gets added to the frequent itemsets or to the negative border. A recent work called AFPIM [10] is designed to efficiently find new frequent itemsets by adjusting the FP-tree structure. However, adjusting the FP-tree of the original database according to the changed transactions is costly.

In this paper, an algorithm called EFPIM (Extending FP-tree for Incremental Mining) is designed to efficiently find new frequent itemsets with minimum re-computation when new transactions are added to, or old transactions are removed from, the database. In our approach, we use the EFP-tree structure, which is equal to the FP-tree when the system runs the mining algorithm for the first time and is an expanded form of the FP-tree during subsequent incremental mining. The EFP-tree of the original database is maintained in addition to the frequent itemsets. Without re-scanning the original database, the EFP-tree of the updated database is obtained from the preserved EFP-tree. We give two versions of the EFPIM algorithm, EFPIM1 (an easy version to implement) and EFPIM2 (a fast algorithm), both of which mine the frequent itemsets of the updated database based on the EFP-tree.
2 Problem Description

Let DB be a database of original transactions, and let $I = \{i_1, i_2, \ldots, i_m\}$ be the set of items in DB. A set of items is called an itemset. The support count of an itemset X in DB, denoted $\mathrm{Sup}_{DB}(X)$, is the number of transactions in DB containing X. Given a minimum support threshold s%, an itemset X is called a frequent itemset in DB if $\mathrm{Sup}_{DB}(X) \ge |DB| \times s\%$. Let $L_{DB}$ denote the set of frequent itemsets in DB. Moreover, let db+ (db-) denote the set of added (deleted) transactions, and |db+| (|db-|) the number of added (deleted) transactions. The updated database, denoted UD, is obtained as (DB ∪ db+) - db-. Let $L_{UD}$ denote the new set of frequent itemsets in UD. The number of transactions in UD is $|UD| = |DB| + |db^+| - |db^-|$, and the support count of an itemset X in UD is $\mathrm{Sup}_{UD}(X) = \mathrm{Sup}_{DB}(X) + \mathrm{Sup}_{db^+}(X) - \mathrm{Sup}_{db^-}(X)$. The update problem of association rules is therefore to find $L_{UD}$ efficiently.
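The support-count identity above can be illustrated with a minimal Python sketch. The function names (`support`, `updated_support`, `is_frequent`) and the toy transactions are assumptions made for illustration and are not part of the original algorithm description; db- is assumed to contain only transactions that are present in DB.

```python
def support(transactions, itemset):
    """Count the transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t))


def updated_support(sup_db, db_plus, db_minus, itemset):
    """Sup_UD(X) = Sup_DB(X) + Sup_db+(X) - Sup_db-(X); DB itself is not rescanned."""
    return sup_db + support(db_plus, itemset) - support(db_minus, itemset)


def is_frequent(sup, n_transactions, s):
    """Frequency test Sup(X) >= |D| * s, with s given as a fraction (0.5 means 50%)."""
    return sup >= n_transactions * s


# Hypothetical toy data; each transaction is a string of one-letter items.
DB = ["AB", "A", "BC", "AB"]
db_plus = ["AB", "B"]
db_minus = ["A"]  # assumed to be a subset of DB

sup_db = support(DB, "AB")  # would be stored after the first mining run
sup_ud = updated_support(sup_db, db_plus, db_minus, "AB")
n_ud = len(DB) + len(db_plus) - len(db_minus)
print(sup_ud, n_ud, is_frequent(sup_ud, n_ud, 0.5))
```

Only db+ and db- are scanned in `updated_support`; the count for DB is reused from the earlier run, which is the saving the incremental formulation aims for.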
3 Extending FP-Tree for Incremental Mining (EFPIM)

3.1 Basic Concept

In this paper, a lower threshold, called the pre-minimum support, is specified in addition to the minimum support. For each item X in a database, if its support count is no less than the pre-minimum support, X is called a pre-frequent item; otherwise, X is an infrequent item. If the support count of a pre-frequent item X is also no less than the minimum support, X is called a frequent item. Our strategy extends the FP-tree structure to maintain the updated frequent itemsets efficiently. The following information needs to be maintained after mining the original database DB: 1) all the items in DB along with their support counts in DB, and 2) the FP-tree of DB for the pre-frequent items in DB.
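As a rough illustration of the three item classes, the following Python sketch labels each item of a small hypothetical database. The function name `classify_items`, the threshold values, and the transactions are assumptions chosen for the example; thresholds are given as fractions of the number of transactions.

```python
from collections import Counter


def classify_items(transactions, min_sup, pre_min_sup):
    """Label each item as 'frequent', 'pre-frequent', or 'infrequent'."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in set(t))
    labels = {}
    for item, count in counts.items():
        if count >= n * min_sup:
            # A frequent item is also pre-frequent, assuming pre_min_sup <= min_sup.
            labels[item] = "frequent"
        elif count >= n * pre_min_sup:
            labels[item] = "pre-frequent"
        else:
            labels[item] = "infrequent"
    return counts, labels


# Hypothetical transactions, each a string of one-letter items.
transactions = ["AB", "AB", "AC", "A", "B", "BC", "AD", "A", "B", "AB"]
counts, labels = classify_items(transactions, min_sup=0.3, pre_min_sup=0.15)
print(dict(counts))
print(labels)
```

In this toy run, item C is kept in the stored tree as a pre-frequent item even though it is not frequent, so it can become frequent after an update without forcing a rescan.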
When the system runs the mining algorithm for the first time, we use the algorithm introduced in [4] and obtain an FP-tree. In the FP-tree of DB, each path follows the frequency-descending order of the pre-frequent items in DB. After insertions or deletions occur in DB, some items may become infrequent in UD. These items have to be removed from the FP-tree. In addition, the paths of nodes may become unordered. Adjusting the paths of nodes so that they follow the order of the pre-frequent items in UD, as done in [10], is costly.

We propose a new structure called the EFP-tree, which is equal to the FP-tree when the mining algorithm is run for the first time. After the database is updated, we first remove from the EFP-tree those items that have become infrequent in UD. We then expand (shrink) the EFP-tree with the transactions of db+ (db-), whose items are arranged in the order of the pre-frequent items in DB. The EFP-tree is still a prefix-tree structure for storing compressed and crucial information about transactions, though it is not as compact as the FP-tree. It nevertheless preserves all the information in UD without loss. When the update is small in proportion to DB, using the EFP-tree is an effective and efficient method, as discussed later.

3.2 An Easy Update Algorithm EFPIM1

Here, we give an easy update algorithm, EFPIM1, based on the EFP-tree structure defined above and the FP-growth algorithm [4].

Algorithm EFPIM1
Input: the pre-maintained information of DB, db+, and db-.
Output: the frequent itemsets in UD.
1. Read in the items in DB and their support counts in DB.
2. Scan db+ and db- once. Compute the support count of each item X in UD.
3. Judge whether all the frequent items of UD are covered in the EFP-tree of DB.
   3.1 If there exists a frequent item of UD that is not in the EFP-tree, scan the whole UD to reconstruct an EFP-tree according to the pre-frequent items in UD.
   3.2 Otherwise, read in the stored EFP-tree of DB.
       3.2.1 Remove the items that have become infrequent from the EFP-tree.
       3.2.2 Arrange the items of each transaction in db+ according to the support-descending order in DB, and insert them into the EFP-tree. Similarly, remove the transactions in db- from the EFP-tree by decreasing the counts in the nodes.
4. Apply the FP-growth algorithm [4] on the updated EFP-tree.
5. Store the support counts of items and the EFP-tree of UD.
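Step 3.2.2 can be sketched as a prefix tree whose item order is frozen to the order computed on DB. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the class and method names (`Node`, `EFPTree`, `insert`, `delete`, `item_support`) are hypothetical, the header table and the removal of infrequent items (step 3.2.1) are omitted, and every deleted transaction is assumed to have been inserted earlier.

```python
class Node:
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}


class EFPTree:
    """Prefix tree whose item order is frozen to the order computed on DB."""

    def __init__(self, db_item_order):
        self.rank = {item: i for i, item in enumerate(db_item_order)}
        self.root = Node(None)

    def _ordered(self, transaction):
        # Keep only items tracked by the tree, arranged in the frozen DB order.
        return sorted((i for i in set(transaction) if i in self.rank),
                      key=self.rank.__getitem__)

    def insert(self, transaction):
        node = self.root
        for item in self._ordered(transaction):
            if item not in node.children:
                node.children[item] = Node(item)
            node = node.children[item]
            node.count += 1

    def delete(self, transaction):
        # Assumes the transaction was inserted earlier (db- is a subset of DB).
        node = self.root
        for item in self._ordered(transaction):
            child = node.children[item]
            child.count -= 1
            if child.count == 0:
                # The rest of this path was used only by the deleted transaction.
                del node.children[item]
                return
            node = child

    def item_support(self, item):
        # Sum the counts of all nodes labelled `item` (a header table would avoid this walk).
        total, stack = 0, [self.root]
        while stack:
            node = stack.pop()
            if node.item == item:
                total += node.count
            stack.extend(node.children.values())
        return total


tree = EFPTree(["F", "B", "C"])      # hypothetical support-descending order of DB
for t in ["FB", "FBC", "B", "C"]:    # first mining run on DB
    tree.insert(t)
for t in ["BC", "FB"]:               # db+
    tree.insert(t)
for t in ["B"]:                      # db-
    tree.delete(t)
print({i: tree.item_support(i) for i in "FBC"})
```

Because the order is never recomputed, existing nodes keep their positions and an update only touches the paths of the changed transactions, at the price of a tree that may be less compact than a freshly built FP-tree.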
Step 3 and step 3.1 explain the purpose of using pre-frequent items. Consider an item X that appears frequently in db+ but does not exist in the original EFP-tree (which means it is not pre-frequent in DB), and yet is frequent in UD ($\mathrm{Sup}_{UD}(X)$ > min_sup). If we omit step 3, the output will miss the itemsets containing X. If we did not construct the EFP-tree according to the pre-frequent items, the test in step 3 would often lead to step 3.1 (that is, re-mining from scratch).

[Example 1] Let the original DB be as shown in Table 1(a). The minimum support is 0.2 and the pre-minimum support is 0.15. Scan DB once. The items with support counts no less than 2 (13 × 0.15 = 1.95) are pre-frequent items. Thus, A, B, C, D, E, and F are pre-frequent items. Sorted in support-descending order, they are F:7, B:6, C:5, E:4, D:3, and A:2. The FP-tree of DB is constructed as shown in Fig. 1(a).

Table 1. Sample database

(a) DB
T_ID   Items   After ordering
1      BDEF    FBED
2      F       F
3      ABEF    FBEA
4      CH      CH
5      BF      FB
6      B       B
7      ABEF    FBEA
8      CG      CG
9      BF      FB
10     CDE     CED
11     F       F
12     CD      CD
13     C       C

(b) db+
T_ID   Items   After ordering
14     BCDEF   FBCED
15     BDEF    FBED
16     BCD     BCD
17     BD      BD
18     D       D
Then five transactions are inserted into DB, as shown in Table 1(b). In the new database UD, a pre-frequent item must have a support count no less than 3 (i.e., (13+5) × 0.15 = 2.7). The items of the tree, kept in the same order as before, have support counts F:9, B:10, C:7, E:6, D:8, and A:2 in UD; note that A, with a count of 2, falls below this threshold. Accordingly, the constructed EFP-tree of UD is shown in Fig. 1(b). The FP-growth algorithm is then applied to find all the frequent itemsets.
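The arithmetic of Example 1 can be checked with a short Python sketch over the transactions of Table 1. The variable and function names are hypothetical, and rounding the threshold up with `math.ceil` is an assumption that matches the values 2 and 3 used in the example.

```python
import math
from collections import Counter

# Transactions of Table 1(a) and Table 1(b), each a string of one-letter items.
DB = ["BDEF", "F", "ABEF", "CH", "BF", "B", "ABEF",
      "CG", "BF", "CDE", "F", "CD", "C"]
DB_PLUS = ["BCDEF", "BDEF", "BCD", "BD", "D"]
PRE_MIN_SUP = 0.15


def item_counts(transactions):
    """Support count of every single item."""
    return Counter(item for t in transactions for item in set(t))


# First mining run: counts and support-descending order of pre-frequent items in DB.
counts_db = item_counts(DB)
threshold_db = math.ceil(len(DB) * PRE_MIN_SUP)
order_db = sorted((i for i, c in counts_db.items() if c >= threshold_db),
                  key=lambda i: -counts_db[i])

# Incremental step: only db+ is scanned; the DB counts are reused.
counts_ud = counts_db + item_counts(DB_PLUS)
threshold_ud = math.ceil((len(DB) + len(DB_PLUS)) * PRE_MIN_SUP)
below = [i for i in order_db if counts_ud[i] < threshold_ud]

print(order_db, threshold_db)
print([(i, counts_ud[i]) for i in order_db], threshold_ud)
print(below)
```

The order computed on DB is kept for UD even though B (10) now has a higher count than F (9), which is exactly the relaxation that distinguishes the EFP-tree from a freshly built FP-tree. Under the definition of Section 3.1 and the stated threshold of 3, item A (count 2) ends up in `below`.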
Fig. 1. The EFP-tree of Example 1: (a) DB; (b) UD
3.3 A Fast Update Algorithm EFPIM2

The EFPIM1 algorithm works well and is much easier to implement than methods like AFPIM. Furthermore, analyzing the algorithm described above leads to the following observations.

First, due to the nature of the EFP-tree, the new EFP-tree constructed for an updated database is just an extension of the old one, which means that the existing nodes and their arrangement do not change. So we can reuse old trees to the utmost extent. Note that extending an existing tree is much faster than constructing a new tree, especially when the tree is big as a result of prolific or long patterns.

Second, based on its divide-and-conquer idea, FP-growth decomposes the mining task into a set of smaller tasks for mining confined patterns in conditional databases. Although the EFP-tree is rather compact, constructing it for the first time still needs two scans of a conditional database. So it is beneficial to extend an existing one instead. Materialization plays a very important role and provides a foundation for our work.

These features and considerations together form the core of our new algorithm, EFPIM2. This algorithm is very similar to EFPIM1, but calls EFP-growth instead of the FP-growth algorithm. Because of space limitations, we list only the steps that differ from EFPIM1.

Algorithm EFPIM2
3.2.1 Call EFP-tree_extension, which returns the EFP-tree of UD.
4. Apply the EFP-growth algorithm on the EFP-tree of UD.

The methods EFP-tree_extension and EFP-growth are outlined as follows.

Algorithm EFP-tree_extension
Input: the EFP-tree of DB, db+, db-
Output: the extended EFP-tree
1. For each item that is pre-frequent in DB and becomes infrequent in UD, remove the corresponding nodes from the EFP-tree and the header table.
2. Arrange the items of each transaction in db+ according to the support-descending order in DB, and insert them into the EFP-tree. If some new items appear in db+ but not in DB, arrange them in descending order of support, and put them behind the items that have appeared in DB. The new items will thus be at the bottom of the EFP-tree of UD.
3. Similarly, remove each transaction in db- from the EFP-tree by decreasing the counts in the nodes.

Algorithm EFP-growth
Input: EFP-tree, pattern α
Output: the frequent itemsets in UD.
1. For each item ai in the EFP-tree that appears in db+ but not in DB, do
   Generate pattern β = α ∪ ai with support = ai.support;
   Construct β's conditional pattern base and then construct β's conditional EFP-tree Treeβ;
   If Treeβ != NULL, then call FP-growth(Treeβ, β) and store the support counts of items and the EFP-tree Treeβ.
2. For each item ai in the EFP-tree that appeared in DB, do
   β = α ∪ ai;
   Read in the stored conditional EFP-tree of β as STreeβ;
   Construct β's conditional pattern base;
   Call EFP-tree_extension(STreeβ, db+, db-) and get β's new conditional EFP-tree Treeβ;
   Call EFP-growth(Treeβ, β) and store the support counts of items and the EFP-tree Treeβ.

Obviously, an item appearing in db+ but not in DB cannot have a stored conditional EFP-tree from the mining on DB. So we have to construct a new one and call the FP-growth method, since the recursions starting from this item have never been done before. For the items that have already appeared in DB, however, there is no need to reconstruct everything: we just extend the stored EFP-tree and call EFP-growth recursively. The correctness of this process is guaranteed by the fact that each path of the EFP-tree of UD is arranged in the order of the pre-frequent items in DB.

[Example 2] Consider constructing item E's conditional database and conditional EFP-tree in Example 1. In DB, E's conditional database is {{C:1}, {F:3, B:3}}, and its conditional EFP-tree is shown in Fig. 2(a). When the database is updated to UD, the conditional database of item E becomes {{C:1}, {F:3, B:3}, {F:1, B:1}, {F:1, B:1, C:1}}, and its conditional EFP-tree is shown in Fig. 2(b).
Fig. 2. The EFP-tree of Example 2: (a) E's conditional EFP-tree in DB; (b) E's conditional EFP-tree in UD
We can see that the new EFP-tree can be obtained by directly expanding the old structure.
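The conditional databases of Example 2 can be reproduced with a small Python sketch that collects, for each transaction containing E, the items preceding E in the DB order. The function name `prefix_paths` and the representation of a prefix path as a tuple are assumptions made for illustration; the transactions are those of Table 1.

```python
from collections import Counter

ORDER = ["F", "B", "C", "E", "D", "A"]  # support-descending order of DB (Example 1)
RANK = {item: i for i, item in enumerate(ORDER)}

DB = ["BDEF", "F", "ABEF", "CH", "BF", "B", "ABEF",
      "CG", "BF", "CDE", "F", "CD", "C"]
DB_PLUS = ["BCDEF", "BDEF", "BCD", "BD", "D"]


def prefix_paths(transactions, item):
    """Conditional pattern base of `item`: the ordered prefix of each transaction containing it."""
    base = Counter()
    for t in transactions:
        if item not in t:
            continue
        ordered = sorted((i for i in set(t) if i in RANK), key=RANK.__getitem__)
        prefix = tuple(ordered[:ordered.index(item)])
        if prefix:
            base[prefix] += 1
    return base


base_db = prefix_paths(DB, "E")         # would be stored with the conditional tree of E
base_plus = prefix_paths(DB_PLUS, "E")  # computed from db+ only
base_ud = base_db + base_plus           # extension of the stored base; DB is not rescanned
print(base_db)
print(base_plus)
print(base_ud)
```

The example in the text lists the two prefix paths contributed by db+ separately; summing the counters merges the shared path (F, B), which is what inserting the new paths into the stored conditional tree would do.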
4 Experiments

To evaluate the performance of the EFPIM algorithm, EFPIM1 and EFPIM2 were implemented on a personal computer. In this section, we present a performance comparison with re-mining by re-executing FP-growth.
The experiments are performed on synthetic data generated using the same technique as in [1]. The notation Tx.Iy.Dm.dn denotes that the average transaction size is |T| = x, the average size of the potentially frequent itemsets is |I| = y, the number of transactions is |D| = m × 1000, and the number of inserted/deleted transactions is |d+|/|d-| = n × 1000. The number of distinct items is 1000. We first compare the execution time of re-executing FP-growth, EFPIM1, and EFPIM2 on an updated database T10.I4.K10.d1. As shown in Fig. 3(a), in general, the smaller the minimum support, the larger the speed-up ratio of EFPIM2 over EFPIM1 and the re-executing method. The reason is that a small minimum support induces a large number of frequent itemsets, which greatly increases the computation cost. Fig. 3(b) shows the ratio of the execution time of the two EFPIM algorithms to that of the re-mining method.
Fig. 3. Experiment Results (panels (a) to (d))
We then evaluate the effect of the size of the updates on these algorithms. The minimum support is 3% in Fig. 3(c). Fig. 3(d) shows that the two EFPIM algorithms are much faster than the re-mining method.
5 Conclusion

In this paper, an efficient and general incremental updating technique is proposed for updating frequent itemsets when old transactions are removed from, or new transactions are added to, a transaction database. This approach uses the EFP-tree structure constructed in a previous mining run to reduce the possibility of re-scanning the updated database. The EFP-tree of the updated database is obtained by extending the previous EFP-tree and is then used to discover the corresponding frequent itemsets. Performance studies show that the proposed EFPIM1 and EFPIM2 algorithms are significantly faster than re-mining by re-executing the FP-growth algorithm. In particular, they work well for small minimum support settings.

Recently, there have been some interesting studies on mining maximal frequent itemsets [11] and closed frequent itemsets [12, 13]. Extending our technique to maintain these special frequent itemsets is an interesting topic for future research.
References

1. R. Agrawal, T. Imielinski, and A. Swami. “Mining Association Rules between Sets of Items in Large Databases,” Proceedings of ACM SIGMOD, May 1993, 207-216.
2. R. Agrawal and R. Srikant. “Fast Algorithms for Mining Association Rules,” In VLDB’94, 487-499.
3. J. S. Park et al. “An effective hash based algorithm for mining of association rules,” In Proceedings of ACM SIGMOD Conference on Management of Data, May 1995, 175-186.
4. J. Han, J. Pei, and Y. Yin. “Mining Frequent Patterns without Candidate Generation,” In Proceedings of the ACM SIGMOD Int. Conf. on Management of Data, 2000, 1-12.
5. J. Han and J. Pei. “Mining frequent patterns by pattern-growth: methodology and implications,” In SIGKDD’00, 14-20.
6. M. Zaki and K. Gouda. “Fast vertical mining using diffsets,” In SIGKDD'03, 326-335.
7. D.W. Cheung, J. Han, V.T. Ng, and C.Y. Wong. “Maintenance of Discovered Association Rules in Large Databases: An Incremental Update Technique,” In Proceedings of International Conference on Data Engineering, 1996, 106-114.
8. D.W. Cheung, S.D. Lee, and Benjamin Kao. “A General Incremental Technique for Maintaining Discovered Association Rules,” In Proc. of the 5th International Conference on Database Systems for Advanced Applications, 1997, 185-194.
9. S. Thomas, S. Bodagala, K. Alsabti, and S. Ranka. “An Efficient Algorithm for the Incremental Updation of Association Rules in Large Databases,” In Proc. of 3rd International Conference on Knowledge Discovery and Data Mining, 1997, 263-266.
10. Jia-Ling Koh and Shui-Feng Shieh. “An Efficient Approach for Maintaining Association Rules Based on Adjusting FP-Tree Structures,” DASFAA 2004, 417-424.
11. D. Burdick, M. Calimlim, and J. Gehrke. “MAFIA: A maximal frequent itemset algorithm for transactional databases,” In ICDE'01, 443-452.
12. M. Zaki and C. Hsiao. “CHARM: An efficient algorithm for closed itemset mining,” In SDM'02, 12-28.
13. J. Y. Wang, J. Han, and J. Pei. “CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets,” In SIGKDD'03, 236-245.