AN APPROACH TO EXTRACT EFFICIENT FREQUENT PATTERNS FROM TRANSACTIONAL DATABASE

MAMTA DHANDA
Computer Science Department, RIMT-IET, Mandi GobindGarh,
Punjab Technical University Jalandhar, Punjab, India
E-Mail ID: [email protected]

Abstract

Frequent pattern mining is a heavily researched area in the field of data mining with a wide range of applications. Mining frequent patterns from large-scale databases has emerged as an important problem in data mining. A number of algorithms have been proposed to determine frequent patterns. The Apriori algorithm was the first algorithm proposed in this field, and over time a number of modifications to Apriori have been proposed to enhance its performance in terms of running time and number of database passes. This paper proposes an innovative utility-sentient approach for mining interesting association patterns from a transaction database and also illustrates the limitations of the Apriori algorithm. First, frequent patterns are discovered from the transactional database using the Apriori algorithm. From the frequent patterns mined, the approach extracts novel interesting association patterns with emphasis on significance, quantity, profit and confidence. A comparative analysis is also presented to illustrate the approach's effectiveness.
Keywords: Data Mining, KDD, Frequent Itemset, Apriori, Q-factor, PW-factor.
1. Introduction

Due to the wide availability of huge amounts of data and the imminent need to turn such data into useful information and knowledge, data mining has attracted a great deal of attention in the information industry in recent years. Data mining [2] is an important part of the process of knowledge discovery in databases (KDD) [16]. KDD is defined as "the non-trivial extraction of implicit, previously unknown, and potentially useful information from data". The objective of data mining [11] is both prediction and description: to predict unknown or future values of the attributes of interest using other attributes in the database, while describing the data in a manner understandable and interpretable to humans. Predicting the sales of a new product based on advertising expenditure, or predicting wind velocities as a function of temperature, humidity, air pressure, etc., are examples of tasks with a predictive [6] goal in data mining. Describing the different terrain groupings that emerge in a sampling of satellite imagery is an example of a descriptive goal. The relative importance of description and prediction can vary between applications. These two goals can be fulfilled by any of a number of data mining tasks, including classification, regression, clustering, summarization, dependency modeling, and deviation detection [10][15]. Association rule mining is an important data mining technique, and the Apriori algorithm is used to extract frequent patterns and association rules from a transactional database. The Apriori algorithm has several limitations:
• It needs several iterations over the data.
• It generates a large number of frequent itemsets, many of which are not useful.
• It has difficulty finding rarely occurring events.
• Some competing alternative approaches therefore focus on partitioning and sampling [16].
These limitations show that Apriori alone is not an efficient algorithm; its efficiency can be improved by exploiting item attributes. In this paper we show how frequent patterns can be refined through the use of such attributes. The problem of mining association rules can be distilled into two steps. The first step involves
ISSN : 0975-5462
Vol. 3 No. 7 July 2011
5652
Mamta Dhanda et al. / International Journal of Engineering Science and Technology (IJEST)
finding all frequent itemsets (also called large itemsets) in the database. The next step is the generation of association rules. Once the frequent itemsets are found, generating association rules is straightforward and can be accomplished in linear time [5]. The most important task of traditional association rule mining (ARM) is to identify frequent itemsets. Traditional ARM [14] algorithms treat all items equally by assuming that the weight of each item is always 1 (item is present) or 0 (item is absent). This is obviously unrealistic and leads to the loss of some useful patterns [7]. To overcome this weakness of traditional association rule mining, the utility mining model [9][10] and weighted association rule mining [16][18] have been proposed. Recently, researchers have become interested in incorporating both attributes (weight and utility [19][20]) for mining valuable association rules. This combination, weighted utility association rule mining (WUARM), can be considered an extension of weighted association rule mining in the sense that it treats item weights as their significance in the dataset and also deals with the frequency of occurrence of items in transactions. Thus, weighted utility association rule mining is concerned with both the frequency and the significance of itemsets, and is helpful in identifying the most valuable and high-selling items that contribute most to the company's profit. Here, we propose an efficient approach based on a weight factor and utility for mining high-utility patterns. Initially, the proposed approach uses the traditional Apriori algorithm to generate a set of association rules from the database.
2. Association rule discovery
Association rule discovery is an important data mining [9][10] model studied extensively by the database and data mining community. All data are assumed to be categorical. It was initially used for market basket analysis [14] to find how items purchased by customers are related.
2.1 The Apriori Algorithm: Basics

Key concepts:
• Let I = {I1, I2, ..., Im} be a set of items.
• Let D be a set of database transactions [16], and let T be a particular transaction.
• An association rule [3][4] is of the form A => B, where A, B ⊆ I and A ∩ B = ∅.
• Support: the support of a rule A => B is the percentage of transactions in D (the database) containing both A and B.
• Confidence: the percentage of transactions in D containing A that also contain B; Confidence(A => B) = Supp(A ∪ B) / Supp(A) (see the sketch after this list).
• Strong rules: rules that satisfy both a minimum support and a minimum confidence [13] are said to be strong.
• Itemset: simply a set of items.
• k-itemset: an itemset containing k items.
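As a quick illustration of the support and confidence measures defined above, the following minimal Python sketch computes them over a small hypothetical transaction list. The transactions are invented for illustration only and are not taken from the paper's dataset.

```python
# Minimal sketch of the support and confidence measures defined above.
# The toy transactions below are illustrative only.
transactions = [
    {"A", "B", "D"},
    {"B", "D"},
    {"A", "C", "D"},
    {"C"},
    {"A", "B", "D", "E"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent => consequent:
    Supp(antecedent ∪ consequent) / Supp(antecedent)."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

print(support({"A", "D"}, transactions))        # 0.6
print(confidence({"A"}, {"D"}, transactions))   # 1.0
```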
2.2 Apriori Algorithm: Pseudocode

• Join step: Ck is generated by joining Lk-1 with itself.
• Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
• Pseudo-code:
    Ck: candidate itemsets of size k; Lk: frequent itemsets of size k;
    L1 = {frequent 1-itemsets};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in the database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;
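For illustration, the pseudocode above can be rendered as the following minimal Python sketch. It assumes transactions are given as sets of item labels and that min_support is an absolute transaction count; it is a sketch of the join, prune, and counting steps, not an optimized implementation.

```python
# Minimal Python sketch of the Apriori pseudocode above.
# Transactions are sets of item labels; min_support is an absolute count.
from itertools import combinations

def apriori(transactions, min_support):
    # L1: frequent 1-itemsets and their counts.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    current = set(frequent)

    k = 1
    while current:
        # Join step: candidate (k+1)-itemsets built from Lk joined with itself.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune step: a candidate with an infrequent k-subset cannot be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current for sub in combinations(c, k))}
        # Scan the database and count the candidates contained in each transaction.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c for c, n in counts.items() if n >= min_support}
        frequent.update({c: counts[c] for c in current})
        k += 1
    return frequent  # maps each frequent itemset to its absolute support

# Example usage on a toy database (illustrative data only):
toy = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(toy, min_support=2))
```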
ISSN : 0975-5462
Vol. 3 No. 7 July 2011
5653
Mamta Dhanda et al. / International Journal of Engineering Science and Technology (IJEST)
The Apriori algorithm [13][16] is an influential algorithm for mining frequent itemsets for Boolean association rules.

2.2.1 Frequent itemsets: the sets of items that have minimum support (Li denotes the frequent i-itemsets).
2.2.2 Apriori property: any subset of a frequent itemset [1] must be frequent.
2.2.3 Join operation: to find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
2.2.4 Finding the frequent itemsets: Apriori finds the sets of items that have minimum support and then uses the frequent itemsets to generate association rules.
3. Efficient Frequent Itemset Generation Using Attributes

The first step of the proposed approach is the computation of the profit ratio (Q-factor) of each item from its profit:

    \mathrm{Q\text{-}factor}_i = \frac{P_i}{\sum_{j=1}^{n} P_j}    (1)

where i = 1 to n, n is the number of items, and P_i is the profit of item i.

3.1 Calculation of profit ratio

Table 1 shows five items A, B, C, D, E. Each item has a unique transactional ID and the profit gained when that item is purchased by a customer. The profit ratio (Q-factor) is calculated from the profit associated with each item, as given by equation (1).

Table 1: Calculation of Q-Factor
TID    ITEM    PROFIT    Q-FACTOR
1      A       60        0.28571428571428
2      B       10        0.04761904761904
3      C       30        0.1428571428571
4      D       90        0.4285714285714
5      E       20        0.0952380952380
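For clarity, the Q-factor column of Table 1 can be reproduced with a short sketch of equation (1); the item profits below are those listed in Table 1.

```python
# Sketch of the Q-factor (profit ratio) computation of equation (1),
# reproducing the last column of Table 1.
profits = {"A": 60, "B": 10, "C": 30, "D": 90, "E": 20}
total_profit = sum(profits.values())            # 210

q_factor = {item: p / total_profit for item, p in profits.items()}
print(q_factor["A"])   # 0.2857142857142857
print(q_factor["D"])   # 0.42857142857142855
```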
3.2 Applying the Apriori algorithm on the transactional database

One of the contemporary approaches for frequent itemset mining is the Apriori algorithm. Its major steps are:

3.2.1 The join step: To find Ck, the set of candidate k-itemsets, Lk-1 is joined with itself. The rules for joining are: order the items first so that they can be compared item by item; two (k-1)-itemsets in Lk-1 can be joined only if their first (k-2) items are in common.

3.2.2 The prune step: The join step produces all candidate k-itemsets, but not all of them are frequent. The database is scanned to see which candidates are indeed frequent [8][2] and the others are discarded. The algorithm stops when the join step produces an empty set.

Table 2 shows the transactional database with minsupp = 3, where minsupp denotes the minimum support assumed in order to determine the frequency of each itemset within the transactional database. As shown in Table 2, there are 10 transactions, indicating in which transactions and in what quantity each item has been purchased. From this transactional database we calculate the frequency of occurrence of each item within the database [11][12] using the support measure of the Apriori algorithm, as shown in Table 3.
Table 2: Given transactional database (minsupp = 3)

TID    A     B    C    D    E
1      10    1    4    1    0
2      0     1    0    3    0
3      2     0    0    1    0
4      0     0    1    0    0
5      1     2    0    1    3
6      1     1    1    1    1
7      0     2    3    0    1
8      0     0    0    1    2
9      7     0    1    1    0
10     0     1    1    1    1
Table 3: Determining frequency of patterns using the Apriori algorithm

Pattern    Frequency
AD         5
D          8
ACD        3
BD         5
ABD        3
CD         4
ED         4
CBD        3
EBD        3
A          5
AC         3
AB         3
C          6
ECB        3
CB         4
EC         3
EB         4
E          5
B          6
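The frequencies in Table 3 can be recovered from the quantity matrix of Table 2, assuming an item counts as present in a transaction whenever its purchased quantity is greater than zero. The sketch below illustrates this for a few of the listed patterns; the set of candidate patterns itself is taken as given from Table 3.

```python
# Sketch: counting the frequency (absolute support) of the patterns in Table 3
# from the quantity matrix of Table 2. An item is treated as present in a
# transaction when its purchased quantity is greater than zero.
quantities = {          # TID -> quantities of A, B, C, D, E (Table 2)
    1: [10, 1, 4, 1, 0], 2: [0, 1, 0, 3, 0], 3: [2, 0, 0, 1, 0],
    4: [0, 0, 1, 0, 0],  5: [1, 2, 0, 1, 3], 6: [1, 1, 1, 1, 1],
    7: [0, 2, 3, 0, 1],  8: [0, 0, 0, 1, 2], 9: [7, 0, 1, 1, 0],
    10: [0, 1, 1, 1, 1],
}
items = ["A", "B", "C", "D", "E"]
transactions = [{items[i] for i, q in enumerate(row) if q > 0}
                for row in quantities.values()]

def frequency(pattern):
    return sum(set(pattern) <= t for t in transactions)

print(frequency("AD"))   # 5
print(frequency("D"))    # 8
print(frequency("ACD"))  # 3
```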
3.3 Frequent pattern selection using the confidence measure

Table 4 shows the frequent patterns selected using the confidence measure of the Apriori algorithm. The minimum confidence [20] is assumed to be 60%: the frequent patterns are sorted and those with confidence ≥ 60% are selected.
Table 4: Frequent pattern selection based on confidence

Frequent pattern    Confidence
AD                  100%
D                   100%
ACD                 100%
BD                  83.3%
ABD                 100%
CD                  83.3%
ED                  80%
CBD                 75%
EBD                 75%
A                   100%
AC                  60%
AB                  60%
C                   100%
ECB                 100%
CB                  66.6%
EC                  60%
EB                  80%
E                   100%
B                   100%
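Most of the confidence values in Table 4 can be reproduced from the Table 3 frequencies by applying the confidence definition of Section 2.1, if a multi-item pattern such as BD is read as the rule B => D (antecedent = the pattern without its last item). This reading is an assumption made here for illustration; it matches most, though not all, entries in the table.

```python
# Sketch: reproducing Table 4 confidences from the Table 3 frequencies,
# assuming a pattern such as "BD" is read as the rule B => D
# (antecedent = pattern without its last item).
freq = {"AD": 5, "D": 8, "ACD": 3, "BD": 5, "ABD": 3, "CD": 4, "ED": 4,
        "CBD": 3, "EBD": 3, "A": 5, "AC": 3, "AB": 3, "C": 6, "ECB": 3,
        "CB": 4, "EC": 3, "EB": 4, "E": 5, "B": 6}

def pattern_confidence(pattern):
    if len(pattern) == 1:
        return 100.0
    return 100.0 * freq[pattern] / freq[pattern[:-1]]

print(round(pattern_confidence("BD"), 1))    # 83.3
print(round(pattern_confidence("EB"), 1))    # 80.0

# Keep only patterns meeting the 60% confidence threshold;
# under this reading all 19 patterns survive, as in Table 4.
selected = [p for p in freq if pattern_confidence(p) >= 60.0]
```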
3.4 Calculation of the profit and weighting factor (PW-factor)

The next step in frequent pattern mining is the calculation of the PW-factor for each frequent pattern selected on the basis of confidence. It is calculated using equation (2), where frequency is the support calculated for each itemset in Table 3 and the Q-factors are the profit ratios calculated in Table 1. Table 5 shows the values obtained by applying the PW-factor to each frequent itemset.

    \mathrm{PW} = \text{frequency} \times \sum_{i=1}^{n} \mathrm{Q\text{-}factor}_i    (2)

where the sum runs over the n items contained in the pattern.
Table 5: Calculating PW-factor

Frequent pattern    PW-factor
AD                  3.5714
D                   3.4285
ACD                 2.5714
BD                  2.3809
ABD                 2.2857
CD                  2.2857
ED                  2.0952
CBD                 1.8571
EBD                 1.7142
A                   1.2857
AC                  1.2855
AB                  1.0
C                   0.8571
ECB                 0.8571
CB                  0.7619
EC                  0.7142
EB                  0.5714
E                   0.4761
B                   0.2857
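The PW-factors in Table 5 follow from equation (2), taking the Q-factors of Table 1 and the frequencies of Table 3. The sketch below applies that reading of equation (2); the printed values match the corresponding Table 5 entries for the top patterns.

```python
# Sketch of equation (2): PW-factor = frequency × sum of the Q-factors of the
# items in the pattern, using the Table 1 profits and Table 3 frequencies.
q_factor = {"A": 60/210, "B": 10/210, "C": 30/210, "D": 90/210, "E": 20/210}
freq = {"AD": 5, "D": 8, "ACD": 3, "BD": 5, "ABD": 3, "CD": 4, "ED": 4,
        "CBD": 3, "EBD": 3, "A": 5, "AC": 3, "AB": 3, "C": 6, "ECB": 3,
        "CB": 4, "EC": 3, "EB": 4, "E": 5, "B": 6}

pw_factor = {p: f * sum(q_factor[item] for item in p) for p, f in freq.items()}
print(round(pw_factor["AD"], 4))    # 3.5714
print(round(pw_factor["ACD"], 4))   # 2.5714
```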
3.5 Efficient frequent pattern selection

Table 6 shows the sorted frequent patterns whose PW-factor is ≥ 2.0. This efficient frequent pattern selection step retains the frequent patterns that give the maximum profit to the business. A discussion of the results obtained with our approach follows. In general, standard association rule mining algorithms produce an enormous number of patterns, and users are expected to shortlist or select the patterns that are interesting to their own businesses. However, the results show that our approach, in contrast to traditional association rule mining algorithms, generates only a small number of interesting association patterns that are both statistically and semantically important for business development. This in turn affects business utility because, in most cases, frequency plays a vital part in business development, including sales backup and more. The high-utility [17] patterns discovered in historical buying patterns signify the importance of these items in the growth of the enterprise.

Table 6: Efficient frequent pattern selection
Frequent pattern    PW-factor
AD                  3.5714
D                   3.4285
ACD                 2.5714
BD                  2.3809
ABD                 2.2857
CD                  2.2857
ED                  2.0952
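The selection step of Section 3.5 is a simple threshold filter over the PW-factors. The sketch below applies the 2.0 threshold directly to the Table 5 values and recovers the patterns listed in Table 6.

```python
# Selection step of Section 3.5: keep frequent patterns with PW-factor >= 2.0,
# using the PW-factor values tabulated in Table 5.
pw_factor = {"AD": 3.5714, "D": 3.4285, "ACD": 2.5714, "BD": 2.3809,
             "ABD": 2.2857, "CD": 2.2857, "ED": 2.0952, "CBD": 1.8571,
             "EBD": 1.7142, "A": 1.2857, "AC": 1.2855, "AB": 1.0,
             "C": 0.8571, "ECB": 0.8571, "CB": 0.7619, "EC": 0.7142,
             "EB": 0.5714, "E": 0.4761, "B": 0.2857}

efficient = {p: v for p, v in pw_factor.items() if v >= 2.0}
print(sorted(efficient, key=efficient.get, reverse=True))
# ['AD', 'D', 'ACD', 'BD', 'ABD', 'CD', 'ED']  -- the patterns in Table 6
```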
4. Conclusion

This work shows that association rules can be obtained by extracting frequent patterns from a large transactional database. The Apriori algorithm is applied to the transactional database, and its support and confidence measures are used to generate frequent itemsets. The Apriori algorithm suffers from the limitation of repeated scans over a large database; its advantage is its easy implementation. Association rule mining is applicable in many areas, and this paper reviews existing association rule mining techniques, including the Apriori algorithm. Association rules have been used extensively to determine customer buying patterns from market basket data. In recent times, researchers have been greatly interested in incorporating utility considerations into association rule mining, and the data mining community has turned to mining interesting association rules that facilitate business development by increasing the utility of an enterprise. This scenario emphasizes the need to discover interesting and utility-sentient association patterns that are both statistically and semantically important to business development [20]. In this paper, we have presented a novel approach for mining high-utility and interesting association patterns from transaction databases. The approach is aimed at mining association patterns that facilitate improvement of business utility; it focuses on the utility, significance, and quantity of individual items for the mining of novel association patterns. The mined interesting association patterns can be used to offer valuable suggestions to an enterprise for intensifying its business utility.

REFERENCES
[1] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach", in Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'04), pp. 53-87, 2004.
[2] M. H. Marghny and A. A. Mitwaly, "Fast Algorithm for Mining Association Rules", in Proc. of the First ICGST International Conference on Artificial Intelligence and Machine Learning (AIML05), pp. 36-40, December 2005.
[3] R. Agrawal, T. Imielinski, and A. N. Swami, "Mining association rules between sets of items in large databases", in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pp. 207-216, 1993.
[4] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules", in Proc. 20th Int. Conf. on Very Large Data Bases, pp. 487-499, 1994.
[5] M. Hegland, "Algorithms for Association Rules", Lecture Notes in Computer Science, vol. 2600, pp. 226-234, January 2003.
[6] J. Han and M. Kamber, Data Mining, Morgan Kaufmann Publishers, San Francisco, CA, 2001.
[7] R. J. Hilderman and H. J. Hamilton, Knowledge Discovery and Interest Measures, Kluwer Academic, Boston, 2002.
[8] R. Natarajan and B. Shekar, "A relatedness-based data-driven approach to determination of interestingness of association rules" (poster paper).
[9] L. V. S. Lakshmanan, C. K.-S. Leung, and R. T. Ng, "The segment support map: scalable mining of frequent itemsets", ACM SIGKDD Explorations Newsletter, vol. 2, no. 2, December 2000.
[10] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation", in Proc. SIGMOD Conference 2000, pp. 1-12.
[11] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases", SIGMOD'93, pp. 207-216, Washington, D.C.
[12] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases", in Proceedings of the 1993 International Conference on Management of Data (SIGMOD 93), pp. 207-216, May 1993.
[13] J. Han, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000, ISBN 1-55860-489-8.
[14] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases", in Proc. of the ACM SIGMOD Conference on Management of Data, pp. 207-216, May 1993.
[15] J. Hu and A. Mojsilovic, "High-utility pattern mining: A method for discovery of high-utility item sets", Pattern Recognition, vol. 40, no. 11, pp. 3317-3324, November 2007.
[16] Y.-C. Li, J.-S. Yeh, and C.-C. Chang, "Isolated items discarding strategy for discovering high utility itemsets", Data & Knowledge Engineering, vol. 64, no. 1, pp. 198-217, January 2008.
[17] G. Yu, S. Shao, and X. Zeng, "Mining Long High Utility Itemsets in Transaction Databases", WSEAS Transactions on Information Science & Applications, vol. 5, no. 2, pp. 202-210, February 2008.
[18] S. Lu, H. Hu, and F. Li, "Mining Weighted Association Rules", Intelligent Data Analysis, vol. 5, no. 3, pp. 211-225, August 2001.
[19] C.-J. Chu, V. S. Tseng, and T. Liang, "Mining temporal rare utility itemsets in large databases using relative utility thresholds", International Journal of Innovative Computing, Information and Control, vol. 4, no. 11, November 2008.
[20] M. S. Khan, M. Muyeba, and F. Coenen, "Fuzzy Weighted Association Rule Mining with Weighted Support and Confidence Framework", International Workshops on New Frontiers in Applied Data Mining, Osaka, Japan, pp. 49-61, May 20-23, 2009.