Int. J. of Computers, Communications & Control, ISSN 1841-9836, E-ISSN 1841-9844 Vol. III (2008), Suppl. issue: Proceedings of ICCCC 2008, pp. 437-441

Multi-Level Database Mining Using AFOPT Data Structure and Adaptive Support Constraints

Mirela Pater, Daniela E. Popescu

Abstract: Finding frequent itemsets is one of the most investigated fields of database mining. Classic association mining based on a uniform support either misses interesting patterns of low support or suffers from the bottleneck of itemset generation. A better solution is to exploit support constraints, which specify what minimum support is required for which itemsets, so that only the necessary itemsets are generated. In this paper, an algorithm for multi-level database mining is proposed. The algorithm is obtained by extending the AFOPT algorithm to multi-level databases.

Keywords: data mining, knowledge discovery in databases, support constraints, multi-level databases, multi-level association rules mining

1 Introduction

The aim of data mining is the discovery of patterns within data stored in large databases. Mining for association rules is a data mining method that lends itself to formulating conditional statements such as "if customers buy product x, then they also buy product y". Market basket analysis is a typical example among the various applications of association mining. Association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold [9]. Most studies on database mining have focused on mining rules at a single concept level [2], [9]. This approach may encounter difficulties in finding the desired knowledge in databases. Frequent pattern mining plays an essential role in mining associations, as Agrawal and Srikant [2] show for multi-level and multidimensional patterns, and in many other important database mining tasks. Mining frequent patterns in transaction databases, multi-level databases, and many other kinds of databases has been studied extensively in data mining research. For mining multiple-level association rules, a concept taxonomy should be provided for generalizing primitive-level concepts to high-level ones.

2 Multiple concept levels

The use of conceptual hierarchies facilitates the mining of multiple-level rules. Moreover, conceptual hierarchies can be adjusted dynamically to meet the needs of the current data mining task; for example, hierarchies for numerical attributes can be generated automatically based on the current data distribution. Compared to single-level frequent pattern mining, multi-level frequent pattern mining is fine-tuned to generate interesting frequent patterns spanning multiple concept levels in multi-level databases [3], and it serves business needs much better. Patterns generated with concept hierarchies included are called multi-level frequent patterns.

The classic AFOPT algorithm (Ascending Frequency Ordered Prefix-Tree) [10] uses a compact data structure to represent the conditional databases, and the tree is traversed top-down. The combination of the top-down traversal strategy and the ascending frequency order minimizes both the total number of conditional databases and the traversal cost of individual conditional databases. Based on the concept hierarchy, the AFOPT data structure [10], and existing algorithms for mining single-level association rules [11], an efficient algorithm for mining multi-level frequent patterns is proposed in this paper.

We assume that the database contains: 1) an item data set, which contains the description of each item in I in the form (Ai, description), where Ai is the product's code; 2) a transaction data set, T = {T1, ..., Tn}, which consists of a set of transactions Ti = {Ap, ..., Aq} in the format (TID, Ai), where TID is a transaction identifier and Ai is an item from the item data set [11].

Multi-level databases use a hierarchy-information encoded transaction table instead of the original transaction table [2]. This is useful when we are interested in only a portion of the transaction database, such as food, instead of all the items. This way we can first collect the relevant set of data and then work repeatedly on the task-relevant set. Thus, in the transaction table each item is encoded as a sequence of digits. Table 1 contains the category codes and a description for each code (category or item), which is only needed for the final display. The codes and descriptions in Table 1 are extracted from Figure 1.

Copyright © 2006-2008 by CCC Publications - Agora University Ed. House. All rights reserved.
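The digit-sequence encoding can be illustrated with a short sketch. This is not the paper's code: the helper name `ancestors` and the small excerpt of Table 1 are our own assumptions about how such codes would be used.

```python
# Each item code is a dot-separated digit sequence whose prefixes identify its
# higher-level concepts, e.g. "1.1.2" (Pentium) -> "1.1" (Desktop) -> "1" (Computer).
TAXONOMY = {  # a small excerpt of Table 1
    "1": "Computer",
    "1.1": "Desktop",
    "1.1.2": "Pentium",
}

def ancestors(code):
    """Return the codes of all higher-level concepts of an encoded item."""
    parts = code.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]

print(ancestors("1.1.2"))                        # ['1', '1.1']
print([TAXONOMY[c] for c in ancestors("1.1.2")])  # ['Computer', 'Desktop']
```

Because the hierarchy is encoded directly in the item ID, no extra lookup table is needed to move between concept levels.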


Figure 1: Hierarchy-based multi-level database

Table 1. Multi-level items

Code   | Description
0      | All Electronics
1      | Computer
2      | Printers
3      | Scanners
4      | Monitors
5      | Keyboards
1.1    | Desktop
1.2    | Laptop
2.1    | Compaq
2.2    | HP
3.1    | Mustek
4.1    | Philips
5.1    | Tech
1.1.1  | Athlon
1.1.2  | Pentium

The transactions in the database usually contain items from the lowest concept level. In order to get results for a certain concept level, we have to transform the transactions (or view them in a different way). This can be done by using the information from each item's ID.

Table 2. Transactions from the database (TID: T1, ..., T5)

Items: 3.1.1 1.2.2 2.2.3 2.1.1 1.1.4 1.1.2 2.1.2 4.1.1 1.2.2 1.1.2 3.1.1 2.1.2 4.1.2 5.1.3 1.1.2 1.2.3 1.1.2 3.1.1 4.1.1 1.1.2 3.1.1 1.2.3 1.2.2
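The transformation by item ID can be sketched as follows. This is a hedged illustration, not the paper's code: the function name `generalize` is ours, and the sample transaction is only an assumed grouping of the first few codes from Table 2.

```python
def generalize(transaction, level):
    """Project a transaction onto a concept level by truncating item codes.

    Keeping the first `level` dot-separated components of a code generalizes
    the item (e.g. "3.1.1" viewed at level 2 becomes "3.1"). Duplicates that
    appear after truncation are collapsed, since a generalized item should be
    counted once per transaction.
    """
    seen = []
    for item in transaction:
        code = ".".join(item.split(".")[:level])
        if code not in seen:
            seen.append(code)
    return seen

# An assumed lowest-level transaction, viewed at level 2 and at level 1:
t = ["3.1.1", "1.2.2", "2.2.3", "1.2.3"]
print(generalize(t, 2))  # ['3.1', '1.2', '2.2']
print(generalize(t, 1))  # ['3', '1', '2']
```

Running the mining algorithm on such projected transactions yields the frequent patterns of the chosen concept level without rebuilding the database.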

Multi-level frequent pattern mining is a very promising research topic and plays an invaluable role in real-life applications. The classic frequent pattern mining algorithms (Apriori [2], FP-Growth [9]) focus on mining knowledge at a single concept level. It is often desirable to discover interesting and useful knowledge at multiple concept levels. A method for mining multiple-level association rules uses a hierarchy-information encoded transaction table, instead of the original transaction table, in iterative data mining.

3 Adaptive AFOPT algorithm (ADA AFOPT)

In this paper, we propose an algorithm, named ADA AFOPT, which can be used to solve the multi-level frequent pattern mining problem. The algorithm is obtained by extending the AFOPT algorithm to multi-level databases [8]. The features of the AFOPT algorithm, i.e. the FP-tree [6], FP-tree-based pattern fragments, and the 1-item partition-based divide-and-conquer method, are well preserved. The algorithm uses flexible support constraints. To avoid the problems caused by a uniform support threshold, mining with various support constraints is used: a uniform threshold either generates uninteresting patterns at higher abstraction levels or misses potentially interesting patterns at lower abstraction levels. At each level, we classify individual items into two categories: normal items and exceptional items.


At each level, two types of thresholds are needed, for both normal and exceptional items: the "item passage threshold" and the "item printing threshold". Usually the item printing threshold is higher than the item passage threshold. An item is treated as a frequent 1-item if its occurrence passes the item printing threshold. Otherwise, as long as its support passes the item passage threshold, the item still has a chance to be held in the base to generate frequent 2-itemsets. This is based on the observation that longer patterns are likely associated with smaller thresholds than their subpatterns. The ADA AFOPT algorithm favours users by pushing these various support constraints deep into the mining process; the interestingness of the generated patterns is hence improved dramatically.

Being based on the AFOPT algorithm, this algorithm first traverses the original database to find the frequent items and sorts them in ascending frequency order. Then the original database is scanned a second time to construct an AFOPT structure representing the conditional databases of the frequent items. The conditional database of an item i includes all the transactions containing item i, with infrequent items and the items before i removed from each transaction. Arrays are used to store single branches in the AFOPT structure to save space and construction cost. Each node in the AFOPT structure contains three pieces of information: an item id, the count of the itemset corresponding to the path from the root to the node, and the pointers to the children of the node. In step 1 and step 2, the counts for each individual item are gathered, and the complete information about the transaction database is compressed into the AFOPT structure. From this point on, the pattern generation process proceeds by recursively visiting the AFOPT structure; no more costly database operations are involved.
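A minimal sketch of this two-threshold classification follows. The paper gives no code, so the function name, parameter layout, and sample counts are our own assumptions; only the passage/printing semantics come from the text above.

```python
def classify_items(counts, categories, passage, printing):
    """Split items into frequent 1-items and passage-only items.

    counts:     {item: support count}
    categories: {item: "normal" | "exceptional"} (default "normal")
    passage, printing: per-category thresholds, with
                       printing[cat] >= passage[cat] for each category.
    """
    frequent, passage_only = [], []
    for item, count in counts.items():
        cat = categories.get(item, "normal")
        if count >= printing[cat]:
            frequent.append(item)        # reported as a frequent 1-item
        elif count >= passage[cat]:
            passage_only.append(item)    # kept only to grow longer itemsets
    return frequent, passage_only

# Exceptional (rare but interesting) items get lower thresholds:
counts = {"1.1": 8, "3.1": 3, "5.1": 2}
categories = {"3.1": "exceptional", "5.1": "exceptional"}
frequent, passage_only = classify_items(
    counts, categories,
    passage={"normal": 4, "exceptional": 2},
    printing={"normal": 6, "exceptional": 3},
)
print(frequent, passage_only)  # ['1.1', '3.1'] ['5.1']
```

Items below their category's passage threshold are discarded entirely, which is how the various constraints are pushed into the mining process instead of being applied as a post-filter.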

4 Performance study

The bottleneck of the Apriori-like algorithms [2], [4], [6] is the repeated database scans. The pattern growth approach avoids the candidate generation-and-test cost by growing a frequent itemset from its prefix [5]. Frequent-pattern-based algorithms are thus shown to be significantly superior to the Apriori-based ones, especially on dense datasets. No matter how long the frequent patterns are, FP-Growth [9] takes a constant number of database scans (2 scans) to generate the complete set of frequent patterns, and it has been shown to be about an order of magnitude faster than the Apriori-like algorithms. Using the AFOPT algorithm directly on the lowest concept level (or the level we need) requires only 2 scans of the database, as opposed to k+1 scans with an Apriori-based multi-level algorithm.

We traverse the AFOPT structure in top-down depth-first order, and each node is visited exactly once. The total number of node visits of the FP-Growth algorithm is equal to the total length of its branches, whereas the total number of node visits of the AFOPT algorithm is equal to the size of the AFOPT structure, which is smaller than the total length of the branches. Therefore the AFOPT algorithm incurs less traversal cost than the FP-Growth algorithm. For the AFOPT algorithm, additional traversal cost is caused by the push-right step. Without this step we would get a conditional database consisting of multiple subtrees. In the worst case, the number of subtrees constituting the conditional database is exponential in the number of items before that item, while the number of merging operations needed is only equal to the number of items before that item. To save traversal cost, it is therefore better to perform the merging operation.

A set of experiments was conducted to compare the performance of the ADA AFOPT algorithm with the ADA FP-Growth algorithm. The standard versions of the FP-Growth and AFOPT algorithms were also part of the experiments.
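The node-visit argument above can be sketched briefly. The class and function names are our own (the paper gives no code); the node layout follows the three-field description from the previous section.

```python
# Each node holds an item id, the count of the itemset on the path from the
# root, and pointers to its children; a top-down depth-first pass visits each
# node exactly once, so traversal cost equals the size of the structure.
class Node:
    def __init__(self, item, count, children=None):
        self.item = item            # item id (None for the root)
        self.count = count          # support of the itemset root..this node
        self.children = children or []

def visit_count(node):
    """Number of node visits in one depth-first pass (one visit per node)."""
    return 1 + sum(visit_count(c) for c in node.children)

tree = Node(None, 0, [Node("a", 3, [Node("b", 2)]), Node("c", 1)])
print(visit_count(tree))  # 4
```

By contrast, FP-Growth's bottom-up traversal revisits shared prefix nodes once per branch, which is why its visit count equals the total branch length rather than the tree size.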
In the first test (Fig. 2), some items were considered rare special items, and special passage and printing thresholds were used for them. These thresholds are lower than the normal thresholds, because patterns containing these items are interesting even though they are less frequent. As a result, the algorithm had to deal with more items, and ADA AFOPT is slower than the standard version, which does not use special thresholds. In the second test (Fig. 3), some items were considered "common" special items, and "common" passage and printing thresholds were used on them. These thresholds are higher than the normal thresholds. Consequently, the algorithm had to deal with fewer items, and the adaptive AFOPT's performance is better than that of the standard version, which does not use such thresholds. No rare special items were declared. The same conditions were applied to the ADA FP-Growth algorithm and to standard FP-Growth. As stated earlier, the AFOPT algorithm is faster than FP-Growth; as a result, ADA AFOPT is faster than the ADA FP-Growth algorithm (Fig. 4). Note that, even though ADA AFOPT has more items to deal with, it is still faster than FP-Growth with standard thresholds. The experiments were conducted on a 1.6 GHz Athlon XP with 512 MB of memory running Microsoft Windows XP Professional, using a real database of 50000 entries in 6000 transactions. All code was written in Java and compiled using the Eclipse platform.


Figure 2: AFOPT vs. ADA AFOPT - Performance comparison with rare special items

Figure 3: AFOPT vs. ADA AFOPT - Performance comparison without rare special items

Figure 4: ADA AFOPT vs. FP-Growth vs. ADA FP-Growth - Performance comparison


5 Conclusions

Multi-level frequent pattern mining is fine-tuned to generate interesting frequent patterns spanning multiple concept levels. It serves business needs much better and can be used to facilitate decision making and boost sales. General frequent pattern mining algorithms focus on mining at a single level; this way only strong associations between items are discovered. For multi-level frequent pattern mining, the mining algorithms have to be extended. The AFOPT algorithm traverses the trees in top-down depth-first order, and the items in the prefix trees are sorted in ascending frequency order. The combination of these two methods is more efficient than the combination of the bottom-up traversal strategy and descending frequency order adopted by the FP-Growth algorithm. The value of the ADA AFOPT algorithm lies in how it exploits these potential support requirements: the algorithm favours users by pushing various support constraints deep into the mining process, and the interestingness of the generated patterns is hence improved dramatically.

References

[1] Agarwal, R.C., Aggarwal, C.C., and Prasad, V.V.V., A tree projection algorithm for finding frequent itemsets, Journal of Parallel and Distributed Computing, 2001.
[2] Agarwal, R., Aggarwal, C., and Prasad, V.V.V., A tree projection algorithm for generation of frequent itemsets, In J. Parallel and Distributed Computing, 2000.
[3] Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and Verkamo, A.I., Fast discovery of association rules, In Advances in Knowledge Discovery and Data Mining, pp. 307-328, 1996.
[4] Agrawal, R. and Srikant, R., Mining sequential patterns, In Proc. 1995 Int. Conf. Data Engineering, pp. 3-14, Taipei, Taiwan, 1995.
[5] Agrawal, R. and Srikant, R., Fast algorithms for mining association rules, In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB'94), Santiago, Chile, pp. 487-499, 1994.
[6] Bayardo, R.J., Efficiently mining long patterns from databases, In SIGMOD'98, pp. 85-93.
[7] Dong, G. and Li, J., Efficient mining of emerging patterns: Discovering trends and differences, In KDD'99, pp. 43-52.
[8] Gyorodi, R., Gyorodi, C., Pater, M., Boc, O., and David, Z., AFOPT Algorithm for multi-level databases, SYNASC 05, Timisoara, 2005.
[9] Han, J., Pei, J., and Yin, Y., Mining frequent patterns without candidate generation, In Proceedings of the 2000 ACM SIGMOD Conference, ACM Press, pp. 1-12, 2000.
[10] Liu, G., Lu, H., Lou, W., Xu, Y., and Xu Yu, J., Efficient Mining of Frequent Patterns Using Ascending Frequency Ordered Prefix-Tree, Data Mining and Knowledge Discovery, 9, pp. 249-274, 2004.
[11] Liu, G., Lu, H., Lou, W., Xu, Y., and Xu Yu, J., Ascending Frequency Ordered Prefix-tree: Efficient Mining of Frequent Patterns, In Proc. of KDD Conf., 2003.
[12] Zaki, M.J., Parthasarathy, S., Ogihara, M., and Li, W., New algorithms for fast discovery of association rules, In Proceedings of the Third KDD Conference, AAAI Press, pp. 283-286, 1997.
[13] Zheng, Z., Kohavi, R., and Mason, L., Real world performance of association rule algorithms, In Proceedings of the 7th KDD Conference, ACM Press, pp. 401-406, 2001.

Mirela Pater, Daniela E. Popescu
University of Oradea, Department of Computer Science
University Street, no. 1
E-mail: {mirelap,depopescu}@uoradea.ro
