Volume 3, Issue 10, October 2013
ISSN: 2277 128X
International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com
An Empirical Evaluation of Association Rule Mining Algorithms
Anil D. Pandya*, PG Student, Computer Department, Marwadi Education Foundation Group of Institutions, Rajkot, Gujarat, India
Prof. Glory H. Shah Assistant Professor, Computer Department Marwadi Education Foundation Group of Institutions Rajkot, Gujarat, India
Abstract— Association rule mining is a very popular data mining method. However, mining association rules often produces a very large number of rules, leaving the analyst with the task of going through all of them to find the interesting ones. Most historical studies use an Apriori-like candidate generation-and-test approach; however, candidate set generation is still expensive, especially when the patterns are large and long. Dynamic itemset counting (DIC), an extension of the Apriori algorithm, helps decrease the number of scans over the chosen dataset and can be seen as an alternative to Apriori's itemset generation: itemsets are dynamically added and deleted as transactions are read. It relies on the fact that for an itemset to be frequent, all of its subsets must also be frequent, so only itemsets whose subsets are all frequent are monitored. This paper compares the performance of the Apriori, Frequent Pattern Growth (FP-Growth) and DIC algorithms. Performance is analyzed based on execution time for different numbers of instances and confidence levels on different datasets. The performance study shows that the DIC method is efficient and scalable and is about an order of magnitude faster than the Apriori algorithm.

Keywords— Association rules, Apriori, Dynamic itemset counting, Frequent pattern growth, Support and Confidence.

I. INTRODUCTION
Data mining [13] can be seen as an outcome of the natural evolution of information technology. It is about utilizing information by acquiring underlying knowledge that is useful but hidden. Another interpretation is the practice of automatically searching over huge amounts of data to discover patterns and trends; thus, it goes beyond simple analysis. Among the different approaches, association rule mining is a very popular method for mining data.
In this technique, interrelations among different items in the data are uncovered by determining frequent itemsets, i.e. itemsets that are repeated more than a threshold number of times in the database. The data mining process can be centralized or distributed, based on the location of the data: in centralized mining the data is located at a single site, while in distributed mining the data resides at multiple sites. Data mining tasks are mainly categorized into two subcategories, as shown in Fig. 1.
Fig. 1 categorizes data mining tasks into two groups, descriptive and predictive, covering classification, clustering, regression, association, deviation detection and sequential pattern mining.

Fig. 1: Different tasks of data mining. © 2013, IJARCSSE All Rights Reserved
Pandya et al., International Journal of Advanced Research in Computer Science and Software Engineering 3(10), October 2013, pp. 1220-1225

II. OVERVIEW OF ASSOCIATION RULE MINING
In this section we introduce association rule mining in detail and expound the different Association Rule Mining (ARM) algorithms [22]. The association rule problem was first stated by Agrawal [22]: the conventional statement of the problem is discovering interesting association or correlation relationships among a large set of data items. ARM [22] has become one of the basic data mining tasks and has attracted extensive interest among data mining researchers. It can help extract knowledge in various applications such as advertising, bioinformatics, database marketing, fraud detection, e-commerce, health care, security, sports, telecommunications, the web, weather forecasting and financial forecasting. ARM is an unsupervised technique that works on variable-length data and leads to clear, understandable results. The typical example of association rule mining is market basket analysis. Frequent itemset mining leads to the discovery of associations and correlations among items in large transactional or relational datasets. In brief, an association rule is an expression A ⟹ B, where A and B are sets of items. Association rule mining is generally done in two steps:
• Find all frequent itemsets: the itemsets that are frequently or habitually purchased together in one transaction are called the frequent itemsets.
• Generate strong association rules from the frequent itemsets: the generated rules must satisfy both minimum support and minimum confidence.
Typically, association rules are considered interesting if they satisfy a minimum support threshold and a minimum confidence threshold. Let I = {i1, i2, …, im} be a set of items. Let D, the task-relevant data, be a set of transactions, where each transaction T is a set of items such that T ⊆ I. Each transaction has a unique identifier, TID. Let A be a set of items; a transaction T is said to contain A if and only if A ⊆ T. An association rule is an implication of the form A ⟹ B, where A ⊂ I, B ⊂ I and A ∩ B = ∅. The rule A ⟹ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B. The rule A ⟹ B has confidence c in D if c is the percentage of transactions in D containing A that also contain B.
Support: the support of an itemset is the percentage of all transactions that contain that itemset: support(A ⟹ B) = P(A ∪ B).
Confidence: the probability of items occurring (e.g. being bought) together: confidence(A ⟹ B) = P(B | A).
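These two measures can be computed directly from a transaction list. The sketch below is a minimal illustration of the definitions above (the function names and the toy basket data are our own, not from the paper):

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """P(consequent | antecedent) = support(A ∪ B) / support(A)."""
    a = set(antecedent)
    return support(transactions, a | set(consequent)) / support(transactions, a)

# toy market-basket data: 4 transactions
baskets = [["milk", "bread"], ["milk"], ["bread"], ["milk", "bread"]]
print(support(baskets, ["milk", "bread"]))       # 2 of 4 baskets -> 0.5
print(confidence(baskets, ["milk"], ["bread"]))  # 0.5 / 0.75 -> 0.666...
```

A rule such as milk ⟹ bread is then "strong" when both values clear the user-chosen minimum support and minimum confidence thresholds.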
Several methods improve the efficiency of association rule mining algorithms: reducing database scans, sampling the database, adding extra constraints on the structure of patterns, and using parallelization. Many algorithms for generating association rules have been presented in the past, with different mining efficiencies; any algorithm should find the same set of rules, though computational efficiency and memory requirements may differ. In this section we briefly discuss three well-known algorithms: Apriori, FP-Growth and DIC.

A. Apriori: The Apriori algorithm, developed by Rakesh Agrawal [21] and colleagues, is one of the best-known ARM algorithms and also serves as the base for most parallel algorithms [18]. Apriori works bottom-up with a horizontal data layout and enumerates all frequent itemsets. It is an iterative algorithm that counts itemsets of a specific length in each database pass. The process starts by scanning all transactions in the database and computing the frequent items. After one scan, a set of potentially frequent candidate 2-itemsets is formed from the frequent items; another database scan obtains their supports. The frequent 2-itemsets are retained for the next pass, and the process is repeated until all frequent itemsets have been enumerated. The algorithm has three main steps:
• Generate candidates of length k from the frequent (k−1)-itemsets by a self-join on F(k−1).
• Prune the candidates that have at least one infrequent subset.
• Scan all transactions to obtain the supports of the candidates.
Apriori stores candidates in a hash tree, which supports fast support counting: itemsets form the leaves, and internal nodes are hash tables (hashed by items) that direct the candidate search.
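The three steps above can be sketched as a compact level-wise implementation (a simplified illustration, with a plain dictionary standing in for the hash tree; function and variable names are our own):

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise Apriori: self-join, prune, then scan to count survivors."""
    transactions = [frozenset(t) for t in transactions]
    # pass 1: count individual items
    counts = {}
    for t in transactions:
        for item in t:
            fs = frozenset([item])
            counts[fs] = counts.get(fs, 0) + 1
    freq = {fs for fs, c in counts.items() if c >= minsup_count}
    all_frequent = set(freq)
    k = 2
    while freq:
        # self-join: unions of frequent (k-1)-itemsets that have size k
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        # prune: drop candidates with at least one infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq
                             for s in combinations(c, k - 1))}
        # one database scan counts all surviving candidates
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        freq = {c for c, n in counts.items() if n >= minsup_count}
        all_frequent |= freq
        k += 1
    return all_frequent
```

On a toy database of five transactions over items {a, b, c} with a minimum support count of 3, this returns the six frequent itemsets {a}, {b}, {c}, {a,b}, {a,c}, {b,c}; the candidate {a,b,c} survives the prune step but is eliminated by the counting scan.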
Limitations: The algorithm has low efficiency. First, it must repeatedly scan the database, which is expensive in I/O. Second, it creates a large number of candidate 2-itemsets while producing the frequent 2-itemsets. Third, it does not discard useless itemsets while producing the frequent k-itemsets. The candidate generation process takes considerable time, space and memory.
Methods to improve Apriori's efficiency:
• Hash-based itemset counting: a k-itemset whose hashing bucket count is below the threshold cannot be frequent.
• Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
• Partitioning: any itemset that is potentially frequent in the database must be frequent in at least one partition of the database.
• Sampling: mine a subset of the given data with a lowered support threshold, together with a method to determine completeness.
• Dynamic itemset counting: add a new candidate itemset only after all of its subsets are estimated to be frequent.

B. Dynamic Itemset Counting (DIC): DIC [18], proposed by Sergey Brin [20], is a generalization of Apriori that decreases the number of scans over the dataset. DIC strictly separates counting from candidate generation [16]. As an alternative to Apriori's itemset generation, itemsets are dynamically added and deleted as transactions are read. It relies on the fact that for an itemset to be frequent, all of its subsets must also be frequent, so only those itemsets whose subsets are all frequent are examined. The database is divided into blocks marked by start points; new candidate itemsets can be added at any start point, unlike in Apriori, which determines new candidate itemsets only immediately before each complete database scan. The technique estimates the support of all itemsets counted so far and adds a new candidate if all of its subsets are estimated to be frequent.
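The block-based counting just described can be sketched as follows. This is our own simplified rendering, not Brin et al.'s original implementation: the dashed/solid bookkeeping is collapsed into an `active` set (still being counted) and a `frequent` set (counting finished, above threshold), and the database is simply cycled until every started counter has seen all transactions.

```python
from itertools import combinations

def dic(transactions, minsup_count, block_size=100):
    """Simplified Dynamic Itemset Counting: transactions are read in a
    cycle; at each block boundary, supersets of the current 'box' itemsets
    (confirmed or suspected frequent) are started as new counters."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    counts, seen = {}, {}
    active = set()    # "dashed" itemsets: still being counted
    frequent = set()  # "solid box" itemsets: confirmed frequent
    for item in {i for t in transactions for i in t}:
        fs = frozenset([item])
        counts[fs] = seen[fs] = 0
        active.add(fs)

    def grow():
        # boxes = confirmed frequent plus suspected frequent (partial counts)
        boxes = frequent | {f for f in active if counts[f] >= minsup_count}
        for a, b in combinations(boxes, 2):
            cand = a | b
            if len(cand) != max(len(a), len(b)) + 1 or cand in counts:
                continue
            if all(frozenset(s) in boxes
                   for s in combinations(cand, len(cand) - 1)):
                counts[cand] = seen[cand] = 0  # start counting mid-pass
                active.add(cand)

    idx = 0
    while True:
        if idx % block_size == 0 or not active:
            grow()                 # block boundary: add new candidates
            if not active:
                break              # nothing left to count
        t = transactions[idx % n]  # cycle through the database
        idx += 1
        for fs in list(active):
            if fs <= t:
                counts[fs] += 1
            seen[fs] += 1
            if seen[fs] == n:      # counted over the whole database: "solid"
                active.discard(fs)
                if counts[fs] >= minsup_count:
                    frequent.add(fs)
    return frequent
```

Each itemset is counted over exactly one full pass of the database starting from the point where it was added, so longer itemsets begin counting before the scan for their subsets has finished; this overlap is what reduces the total number of passes compared with Apriori.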
While being counted, itemsets are marked in four different ways:
• Solid box (confirmed frequent itemset): counting has finished and the count exceeds the support threshold minsup.
• Solid circle (confirmed infrequent itemset): counting has finished and the count is below minsup.
• Dashed box (suspected frequent itemset): still being counted, currently above minsup.
• Dashed circle (suspected infrequent itemset): still being counted, currently below minsup.

C. FP-Growth: FP-Growth, from Jiawei Han's [9] research group at Simon Fraser University, generates the frequent itemsets used for producing association rules. The final version of the algorithm was released on February 5, 2001, as an improvement over earlier versions. FP-Growth (frequent pattern growth) uses a special structure, an extended prefix tree (FP-tree), to store the database in compressed form. It follows a divide-and-conquer approach to decompose both the mining tasks and the database, and uses a pattern-fragment-growth technique to avoid the costly candidate generation and testing used by the Apriori approach. FP-Growth proceeds as follows:
• Compress a large database into a compact frequent-pattern tree (FP-tree) structure.
• Apply a divide-and-conquer methodology: decompose mining tasks into smaller ones.
• Avoid candidate generation: test sub-databases only.

III. LITERATURE SURVEY
This literature survey focuses on previous work by researchers measuring and comparing the performance of recent association rule mining algorithms, which provides valid information on the working behaviour of the algorithms under various conditions. (Vanitha et al. 2011) worked with two association rule mining algorithms, Apriori and FP-Growth. They performed their experiment on a single supermarket dataset with 4627 instances and 217 attributes.
They measured performance by varying the number of instances and the confidence level, and evaluated the efficiency of the two algorithms based on the time required to generate association rules. They concluded that FP-Growth performed better than Apriori [4]. (Tannu et al. 2011) provided a detailed study of two association rule mining algorithms, DIC and FP-Growth, and proposed a new Dynamic FP algorithm that takes the best features of both. They performed their experiment on a single dataset, measured performance by varying the confidence level, and evaluated computational time. They concluded that Dynamic FP performs better on large datasets than DIC and FP-Growth [3]. (Arora et al. 2013) provided an overview of four Apriori-family algorithms: Apriori, AprioriTid, Apriori Hybrid and Tertius. They presented the drawbacks of and comparisons between these algorithms based on execution time, memory used, candidates generated and methodology used [1]. (Zheng et al. 2001) worked with five association rule mining algorithms: Apriori, FP-Growth, Closet, Charm and Magnum Opus. Their experiment used four datasets, of which three were real-world
datasets (two e-commerce datasets and one retail dataset), while the fourth was artificial, one of the IBM Almaden datasets. They found that the algorithms perform differently on real-world datasets than on the artificial dataset, and that some of the gains do not carry over to the real datasets, indicating that these algorithms overfit the IBM artificial dataset. They also concluded that the choice of algorithm only matters at support levels that, in practice, generate more rules than are useful [15]. (Hipp et al. 2000) worked with five association rule mining algorithms: Apriori, Eclat, DIC, Partition and FP-Growth. They performed their experiments on synthetic and real datasets: T10I4D100K, T20I4D100K, T20I6D100K, basket data and car equipment data. They measured performance by varying the confidence level and evaluated the efficiency of the algorithms based on the time required to generate association rules. They concluded that the algorithms show quite similar runtime behaviour; at least, no algorithm is clearly better than the others. They showed that the advantages and disadvantages they identified in determining the support values of the frequent itemsets nearly balance out on market-basket-like data [16]. (Kotsiantis et al. 2006) provided an overview of different association rule mining algorithms, discussed some techniques for generating rules, and explained different techniques used to reduce computational time. They concluded that many users of association rule mining tools encounter several well-known problems in practice: the algorithms do not provide results in a reasonable time.
From the available research it has been shown that the set of association rules can grow rapidly as the frequency requirement is lowered; the larger the set of frequent itemsets, the greater the number of rules presented to the user, which may also lead to redundancy among the itemsets [6]. (Duneja et al. 2012) provided a comprehensive survey of different association rule mining algorithms, namely Apriori Hybrid [21], DSM [8], Parallel Apriori [7], WAR [17], Apriori Brave [11], PRICE [10] and Combined [12], examining all of them on data size, itemset, computational time, confidence and memory expense [2]. No previous survey has compared the Apriori, DIC and FP-Growth algorithms together; this paper focuses on these techniques, so in the next section we compare all three in terms of computation time and confidence on different datasets, factors which are important in deciding the performance of association rule mining techniques.

IV. EVALUATION AND RESULTS
The machine on which the experiments were carried out had dual-core 1 GHz processors and 2 GB of RAM, running Windows 7. To make the time measurements more reliable, no other process or application was running on the machine at the same time. All of the above-mentioned algorithms were evaluated on a single system. The Apriori and FP-Growth association rule mining algorithms were tested in the WEKA data mining tool, version 3.6.1. WEKA is an open-source collection of many data mining and machine learning algorithms, including data pre-processing, classification, clustering and association rule extraction. The DIC algorithm was tested in the NetBeans IDE 6.8. The performance of Apriori, FP-Growth and DIC was evaluated based on execution time, measured for different numbers of instances and confidence levels on different datasets.
We analyzed the algorithms on different datasets. To evaluate the performance of the association rule mining algorithms, we performed experiments on two datasets taken from the UCI Repository: Supermarket [25] and SPECT Heart [24]. Dataset information is shown below.

TABLE 1. INFORMATION OF DATASETS
Dataset Name   Data Type   Instances   Attributes
Supermarket    Numerical   4627        217
SPECT Heart    Binary      267         22

The first dataset, Supermarket, was used to evaluate the Apriori and FP-Growth algorithms, with experiments performed in the WEKA and Tanagra tools; we imported the dataset in ARFF format. The second dataset, SPECT Heart, was used to evaluate the Apriori and DIC algorithms, with experiments performed in the Tanagra and NetBeans tools; we imported the dataset as binary data in a text file. Association rule mining algorithms are designed to improve efficiency through the parameters given below.
• Number of scans: the number of database scans affects memory and I/O operations; fewer scans mean better efficiency.
• Data structure: the structure used to store frequent items and their counts during execution of the algorithm, such as a trie, FP-tree, prefix tree or hash tree.
• Data layout: vertical or horizontal; an algorithm uses its layout for scanning the database.
• Optimization techniques: many algorithms improve efficiency through a specific optimization; for example, FP-Growth uses the FP-tree, while DIC reduces the number of database scans.
According to the survey and evaluation of the algorithms, some features of the various algorithms are shown in Table 2.

TABLE 2. FEATURES OF VARIOUS ALGORITHMS
Algorithm   Scans   Data Structure   Database Layout   Optimizations
Apriori     k       Hash tree        Horizontal        C2 array
FP-Growth   2       FP-tree          Horizontal        Data structure; does not generate candidate sets
DIC