2012 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology
Bees Swarm Optimization for Web Association Rule Mining
Y. Djenouri, H. Drias
LRIA, USTHB: University of Algiers, BP 32 El Alia, Bab Ezzouar, Algiers, Algeria
[email protected], [email protected]
Z. Habbas, H. Mosteghanemi
LITA, University of Lorraine, Ile du Saulcy, 57045, Metz Cedex
LRIA, USTHB: University of Algiers, BP 32 El Alia, Bab Ezzouar, Algiers, Algeria
[email protected],
[email protected]
978-0-7695-4880-7/12 $26.00 © 2012 IEEE DOI 10.1109/WI-IAT.2012.148
Abstract—This paper deals with Association Rules Mining algorithms for very large databases, especially those found on the web. The numerous exact polynomial algorithms already proposed in the literature handle data sets of reasonable size efficiently. However, they cannot cope with the huge amount of data in the web context, where the response time must be very short. This paper proposes two new Association Rules Mining algorithms, based on a genetic metaheuristic and on Bees Swarm Optimization respectively. Experimental results show that, with respect to both the fitness criterion and the CPU time, the IARMGA algorithm improves on AGA and ARMGA, two other genetic-algorithm-based approaches already proposed in the literature. Moreover, the same experiments show that, with respect to the fitness criterion, BSO-ARM performs slightly better than all the genetic approaches. On the other hand, BSO-ARM is more time consuming. In all cases, we observed that the developed approaches yield useful association rules in a short time compared with previous works.
Keywords-Association rule mining, Genetic metaheuristic, BSO metaheuristic, Solution Quality, Optimization Problem, Web Mining.
I. INTRODUCTION AND RELATED WORKS
Association Rules Mining (ARM) is one of the most important and well studied Data Mining techniques [1]. It aims to extract frequent patterns, associations or causal structures among sets of items from a given transactional database. Formally, the association rule problem is as follows: let T be a set of transactions {t1, t2, ..., tm} representing a transactional database, and I be a set of m different items or attributes {i1, i2, ..., im}; an association rule is an implication of the form X → Y where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The itemset X is called the antecedent while the itemset Y is called the consequent, and the rule means X implies Y. Association rule mining is concerned with discovering a set of rules covering a large percentage of the data, and tends to produce a large number of rules. However, since databases are increasingly large, the user no longer looks for all the rules but only for a subset of useful rules. Two basic parameters are commonly used for measuring the usefulness of association rules, namely the support and the confidence of a rule. The support of an itemset X ⊆ I is the number of transactions containing X. The support of a rule X → Y is the support of X ∪ Y, and the confidence of a rule is support(X ∪ Y) / support(X). Confidence is a measure of the strength of an association rule. An association rule X → Y with a confidence of 80% means that 80% of the transactions that contain X also contain Y. So, Association Rules Mining consists in extracting from a given database all interesting rules, that is, rules with support ≥ MinSup and confidence ≥ MinConf [7], where MinSup and MinConf are two thresholds predefined by the user. The main goal is to do so efficiently.
Many algorithms for generating association rules have been proposed in the literature. Some well known algorithms are AIS [18], Apriori [8], Eclat [9] and FP-Growth [6]. AIS is very space consuming and requires too many passes over the whole database. Apriori is the best known algorithm for association rules mining. It is based on a breadth-first search strategy to count the supports of itemsets and uses a candidate generation function that exploits the downward closure property of support. FP-Growth uses an FP-tree structure to compress the database and a divide-and-conquer approach to decompose both the mining tasks and the database. In [10] the author presents an interesting survey of different exact polynomial algorithms. However, because of the fast development of the web and the growth of databases, these algorithms have very quickly become inefficient. Indeed, even if these polynomial algorithms can still compute association rules in a very short time, they remain limited with respect to our goal, which is to extract all the rules from large databases in real time. In order to run association rules mining algorithms in real time, different directions have been explored: reducing the number of passes over the database, sampling the database, using parallelism, and adding constraints on the structure of rules.
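As an illustration of the support and confidence measures defined in the introduction, the following minimal sketch computes both on a toy transactional database. The function names `support` and `confidence` and the toy item names are assumptions made for this example only; they do not come from the paper. Support is taken as an absolute count, following the definition given in the text.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[Transaction]) -> float:
    """support(X u Y) / support(X); 0.0 when X never occurs."""
    x = frozenset(antecedent)
    sup_x = support(x, transactions)
    if sup_x == 0:
        return 0.0
    return support(x | frozenset(consequent), transactions) / sup_x


# Toy database (hypothetical data, for illustration only).
TRANSACTIONS = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
)]

# Rule bread -> milk: 3 of the 4 transactions with bread also contain milk.
print(support(["bread", "milk"], TRANSACTIONS))       # 3
print(confidence(["bread"], ["milk"], TRANSACTIONS))  # 0.75
```

With MinSup = 2 and MinConf = 0.7, for instance, the rule bread → milk would be kept as interesting.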
For our purposes, we use metaheuristics instead of exact methods. Different metaheuristics have already been proposed in the literature. Yan et al. [11] developed a genetic algorithm called ARMGA for finding association rules. The main drawback of this algorithm is that it generates non admissible solutions and hence erroneous rules that may not respect the minimum support and minimum confidence constraints. In [12], another GA is proposed for Mining Association rules. By using an adaptive mutation rate, this algorithm provides an important population variation. Nevertheless, the mutation probability of all chromosomes is recomputed at each iteration, which increases the execution time. In [13], an Adaptive Genetic Algorithm called AGA is developed for ARM. The two major differences between classical ARMGA and AGA are the mutation and crossover operators. The issue of creating unviable solutions remains.
All of the above mentioned algorithms have two main limits. Firstly, they may generate false rules, that is, rules that have a high fitness without respecting the minimum support and minimum confidence constraints. Secondly, their process creates new solutions that may not be admissible, and there is no special treatment to manage this issue. To overcome these drawbacks, we proposed in [2] an Improved Genetic Algorithm (IARMGA) which avoids the risk of generating false rules and which solves the admissibility problem by defining a new strategy for crossover and mutation operators. We tested IARMGA on a common synthetic standard database, generated by IBM QUEST [18], using three fitness functions. The first implementation results confirm that IARMGA generates neither non admissible solutions nor false rules. Moreover, it outperforms the AGA and ARMGA algorithms w.r.t. the computational time and the quality of the generated rules.
Motivated by the success and the power of the Bees Swarm Optimization (BSO) metaheuristic, we propose in this paper a new algorithm called BSO-ARM (BSO for Association Rules Mining). BSO has been widely used for solving difficult problems. It was successfully applied to Web Information Retrieval [5]. For Association Rules Mining, BSO-ARM avoids non admissible solutions by using a new strategy called the "delete and decomposition" strategy. Moreover, the determination of the search area and the neighborhood search operations included in BSO result in a good balance between intensification and diversification.
The rest of this paper is organized as follows. In section II, we recall the main idea behind our improved genetic algorithm. In section III, we introduce the Bees Swarm Optimization metaheuristic and in section IV, we present the BSO-ARM algorithm. Section V summarizes our experimental results by illustrating the performance of the proposed algorithm and comparing it to three other algorithms (AGA, ARMGA, IARMGA) with respect to solution quality and execution time. Section VI concludes this paper with some remarks and perspectives for future work.
II. IMPROVED GENETIC ALGORITHM FOR ARM
All of the above mentioned algorithms have two main limits. Firstly, they may generate false rules, that is, rules that have a high fitness without respecting the minimum support and minimum confidence constraints. Secondly, their process creates new solutions that may not be admissible, and there is no special treatment to manage this issue. To overcome these drawbacks, we proposed an improved Genetic Algorithm (IARMGA) which eliminates the risk of generating false rules and which solves the admissibility problem by defining a new strategy for crossover and mutation operators. First, a rule is represented by a chromosome, that is, a vector that contains the items of the rule. We also use a separator index between the antecedent and consequent parts. Then a classical GA is applied using a combination of crossover and mutation operators. We define the crossover/mutation operator using the "delete and decomposition" strategy as follows. First, two chromosomes are selected from the given population and the classical crossover is applied. Afterwards, we check the new chromosomes. If an item appears in both the antecedent and consequent parts, then the "delete and decomposition" strategy is performed. In the first phase, this item is removed from the antecedent part and in the second phase it is removed from the consequent part. This operation decomposes the non admissible solution into two syntactically valid solutions: the item remains in the antecedent part in one of them and in the consequent part in the other. To cope with the generation of false rules, for each resulting chromosome c we compute its support sup(c) and its confidence conf(c). If sup(c) is less than MinSup or conf(c) is less than MinConf, then the chromosome is rejected without calculating its fitness.
III. INTRODUCTION TO BEES SWARM OPTIMIZATION
The BSO metaheuristic proposed in [3] is inspired by the collective behavior of bees. It is based on a swarm of artificial bees cooperating to solve a problem. First, a bee named InitBee sets out to find a solution with good features. From this first solution, called Sref, we determine a set of other solutions of the search space by using a certain strategy. This set of solutions is called SearchArea. Then, every bee takes a solution from SearchArea as the starting point of its search. After completing its search, every bee communicates the best visited solution to all its neighbours through a table named Dance. One of the solutions stored in this table becomes the new reference solution for the next iteration. In order to avoid cycles, the reference solution is stored each time in a taboo list. The reference solution is chosen according to the quality criterion. However, if after a period of time the swarm observes that the solution is not improving, it introduces a diversification criterion that prevents it from being trapped in a local optimum. The diversification criterion consists in selecting, among the solutions stored in the taboo list, the most distant one. The algorithm stops when the optimal solution is found
generated by adding the flip value to, or subtracting it from, one bit (element) of Sref. This process is repeated for all bits in Sref. This method may generate non admissible solutions for the following reasons:
• Some solutions may contain the same item more than once
• Some solutions may contain a bit value out of the range [1, n]
• Some solutions may contain the same item more than once and also a bit value out of the range [1, n]
The first problem is solved using the delete and decomposition strategy, and the second one by replacing the offending bit by 0.
Example 2: Consider Flip = 2 and Sref = {3, 2, 4, 5, 3, 0, 0, 0, 6, 7, 9}.
1) Add the Flip value to the first bit of Sref: S1 = {5, 2, 4, 5, 3, 0, 0, 0, 6, 7, 9}
2) Subtract the Flip value from the first bit of Sref: S2 = {1, 2, 4, 5, 3, 0, 0, 0, 6, 7, 9}
3) Add the Flip value to the second bit of Sref: S3 = {3, 4, 4, 5, 3, 0, 0, 0, 6, 7, 9}
4) Subtract the Flip value from the second bit of Sref: S4 = {3, 0, 4, 5, 3, 0, 0, 0, 6, 7, 9}
S3 is a non admissible solution because it contains the item t4 twice. Hence it is decomposed into the two following admissible solutions: S5 = {3, 0, 4, 5, 3, 0, 0, 0, 6, 7, 9} and S6 = {3, 4, 0, 5, 3, 0, 0, 0, 6, 7, 9}.
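The flip move and the delete and decomposition repair of Example 2 can be sketched as follows. This is only an illustrative reading of the description, not the authors' implementation: the names `search_area` and `decompose` are hypothetical, the reference solution is assumed to be admissible, and two choices are assumptions of this sketch (a separator that leaves the valid index range is simply discarded, and duplicates of already generated solutions, or of Sref itself, are dropped). Calling the same routine with flip = 1 gives the ±1 move used by the neighborhood search.

```python
from typing import List


def decompose(sol: List[int], pos: int) -> List[List[int]]:
    """Delete and decomposition: if the item at `pos` also occurs at another
    item position, return two copies, each keeping only one occurrence."""
    item = sol[pos]
    if pos == 0 or item == 0:          # separator or empty slot: nothing to do
        return [sol]
    twins = [i for i in range(1, len(sol)) if i != pos and sol[i] == item]
    if not twins:
        return [sol]
    without_new, without_old = sol.copy(), sol.copy()
    without_new[pos] = 0               # drop the freshly created occurrence
    without_old[twins[0]] = 0          # drop the occurrence that was already there
    return [without_new, without_old]


def search_area(sref: List[int], flip: int, n: int) -> List[List[int]]:
    """Solutions obtained by adding/subtracting `flip` on each bit of `sref`."""
    area: List[List[int]] = []
    seen = {tuple(sref)}               # assumption: the reference itself is excluded
    for pos in range(len(sref)):
        for delta in (flip, -flip):
            cand = sref.copy()
            cand[pos] += delta
            if pos == 0:
                # Assumption: an out-of-range separator is discarded.
                if not 1 <= cand[0] <= len(sref) - 1:
                    continue
            elif not 1 <= cand[pos] <= n:
                cand[pos] = 0          # out-of-range item is replaced by 0
            for sol in decompose(cand, pos):
                if tuple(sol) not in seen:
                    seen.add(tuple(sol))
                    area.append(sol)
    return area


SREF = [3, 2, 4, 5, 3, 0, 0, 0, 6, 7, 9]
AREA = search_area(SREF, flip=2, n=10)
for solution in AREA[:4]:
    print(solution)                    # S1, S2, S5 and S6 of Example 2
```

In this sketch S3 never appears in the result: it is replaced by S5 and S6, and S4 is dropped as a duplicate of S5.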
or the maximum number of iterations is reached.
IV. BSO-ARM ALGORITHM
In this section we present BSO-ARM, the Bees Swarm Optimization algorithm designed for Association Rules Mining. To adapt BSO to Association Rules Mining we have to define the following components: the solution encoding, the strategy for determining SearchArea, the fitness function and the neighborhood search.
A. The encoding's solution
In Association Rule Mining [11] [12], two well-known representations can be cited, namely Binary encoding and Integer encoding. In Binary encoding, each solution (rule) is represented by a vector S of n elements, where n is the number of items; S[i] = 1 if the item i is in the rule and 0 otherwise. In Integer encoding, the solution (rule) is represented by a vector S of k + 1 elements, where k is the rule size. The first element is the separator index between the antecedent and consequent parts of the solution. For all other elements i in S, if S[i] = j then the item j appears in the ith position of the rule. In BSO-ARM, we combine both representations to facilitate the BSO operations, namely the determination of SearchArea and the neighborhood search. Consequently, each solution s is a vector of n + 1 elements (n is the number of all items, as in the examples below) where:
1) S[i] = the separator index between the antecedent and the consequent parts if i = 1.
2) S[i] = j where j > 0 if the item j appears in the ith position of s.
3) S[i] = 0 if there is no item in the ith position of the solution s.
Example 1: Let T = {t1, t2, ..., t10} be a set of items.
• S1 = {3, 2, 4, 5, 3, 0, 0, 0, 6, 7, 8} represents the rule R1: t2, t4, t5 ⇒ t3, t6, t7, t8.
• S2 = {2, 0, 0, 5, 3, 0, 0, 0, 1, 2, 9} represents the rule R2: ∅ ⇒ t5, t3, t1, t2, t9.
• S3 = {5, 1, 6, 8, 7, 0, 0, 0, 3, 2, 4} represents the rule R3: t1, t6, t8, t7 ⇒ t3, t2, t4.
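The encoding of Example 1 can be decoded with a few lines of code. The sketch below is an illustration under one assumption drawn from the three examples: the separator value counts how many positions after the first element belong to the antecedent, and zeros are skipped. The names `decode` and `format_rule` are hypothetical.

```python
from typing import List, Tuple


def decode(solution: List[int]) -> Tuple[List[int], List[int]]:
    """Split an encoded solution into (antecedent items, consequent items)."""
    sep = solution[0]                                  # separator index
    antecedent = [j for j in solution[1:sep + 1] if j != 0]
    consequent = [j for j in solution[sep + 1:] if j != 0]
    return antecedent, consequent


def format_rule(solution: List[int]) -> str:
    """Human-readable form of the rule, e.g. 't2, t4, t5 => t3, t6, t7, t8'."""
    antecedent, consequent = decode(solution)
    left = ", ".join(f"t{j}" for j in antecedent) or "(empty)"
    right = ", ".join(f"t{j}" for j in consequent) or "(empty)"
    return f"{left} => {right}"


S1 = [3, 2, 4, 5, 3, 0, 0, 0, 6, 7, 8]
S2 = [2, 0, 0, 5, 3, 0, 0, 0, 1, 2, 9]
S3 = [5, 1, 6, 8, 7, 0, 0, 0, 3, 2, 4]
for s in (S1, S2, S3):
    print(format_rule(s))
```

Applied to S1, S2 and S3, this reproduces the rules R1, R2 and R3 listed in Example 1.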
D. The neighborhood search
The neighborhood of a solution s is obtained by adding 1 to, or subtracting 1 from, one bit of s. Again, this operation can generate non admissible solutions. To handle them, we use the same process as for the determination of the search area.
Example 3: Consider the solution S = {3, 2, 4, 5, 3, 0, 0, 0, 6, 7, 9}.
1) Add 1 to the first bit of S: S1 = {4, 2, 4, 5, 3, 0, 0, 0, 6, 7, 9}
2) Subtract 1 from the first bit of S: S2 = {2, 2, 4, 5, 3, 0, 0, 0, 6, 7, 9}
3) Add 1 to the second bit of S: S3 = {3, 3, 4, 5, 3, 0, 0, 0, 6, 7, 9}
4) Subtract 1 from the second bit of S: S4 = {3, 1, 4, 5, 3, 0, 0, 0, 6, 7, 9}
Here S3 contains the item t3 twice (second and fifth bits), so it is non admissible and is handled by the delete and decomposition strategy.
B. Fitness function
As mentioned above, the ARM problem consists in finding all rules satisfying the MinSup and MinConf thresholds. Let α and β be two empirical parameters. The fitness function of a solution s is computed as follows:
Fmax(s) = α × confidence(s) + β × support(s) if confidence(s) > MinConf and support(s) > MinSup
Fmax(s) = −1 otherwise.
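A minimal sketch of this fitness function is given below. It assumes the vector encoding of Section IV.A (separator in the first element, zeros for empty positions), transactions stored as sets of item numbers, and support measured as an absolute count as in Section I. The function name `fitness` and its parameter names are illustrative, not taken from the paper.

```python
from typing import List, Set


def fitness(solution: List[int], transactions: List[Set[int]],
            alpha: float, beta: float,
            min_sup: float, min_conf: float) -> float:
    """alpha * confidence + beta * support if both thresholds are exceeded,
    -1 otherwise."""
    sep = solution[0]
    x = {j for j in solution[1:sep + 1] if j}      # antecedent items
    y = {j for j in solution[sep + 1:] if j}       # consequent items
    sup_x = sum(1 for t in transactions if x <= t)
    sup_xy = sum(1 for t in transactions if (x | y) <= t)
    conf = sup_xy / sup_x if sup_x else 0.0
    if conf > min_conf and sup_xy > min_sup:
        return alpha * conf + beta * sup_xy
    return -1.0


# Toy database over 3 items (hypothetical data).
DB = [{1, 2, 3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]
RULE = [1, 1, 2, 0]        # encodes t1 => t2 with n = 3 items

print(fitness(RULE, DB, alpha=1.0, beta=1.0, min_sup=2, min_conf=0.5))  # 3.75
print(fitness(RULE, DB, alpha=1.0, beta=1.0, min_sup=2, min_conf=0.8))  # -1.0
```

In the toy database the rule t1 ⇒ t2 has support 3 and confidence 3/4, so it passes the first pair of thresholds and is rejected by the second.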
E. BSO-ARM algorithm
The reference solution and the degree of diversity are chosen as described in [3].
F. Complexity of BSO-ARM
At each iteration, k solutions are exploited and each one explores L neighbors. The number of generated solutions is therefore equal to MaxIter × k × L. Since one solution is evaluated in O(n), where n here denotes the number of transactions in the database, the complexity of the BSO-ARM algorithm is O(n × MaxIter × k × L).
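To make the MaxIter × k × L count concrete, the sketch below runs a generic BSO-style loop on a toy one-dimensional problem and counts the neighbor evaluations. It is a simplified illustration, not the BSO-ARM implementation: the diversification criterion of [3] is omitted, the k starting solutions are simply the first k of the search area, and the function `bso` together with the toy `toy_fitness`, `toy_area` and `toy_neighbors` are all hypothetical.

```python
from typing import Callable, List, Tuple


def bso(initial: int,
        fitness: Callable[[int], float],
        search_area: Callable[[int], List[int]],
        neighbors: Callable[[int], List[int]],
        k: int, max_iter: int) -> Tuple[int, int]:
    """Return (best solution found, number of neighbor evaluations)."""
    sref = best = initial
    taboo: List[int] = []
    evaluations = 0
    for _ in range(max_iter):
        taboo.append(sref)
        starts = search_area(sref)[:k]          # one starting point per bee
        dance = []
        for start in starts:                    # each bee searches locally
            local_best = start
            for cand in neighbors(start):
                evaluations += 1
                if fitness(cand) > fitness(local_best):
                    local_best = cand
            dance.append(local_best)
        candidates = [s for s in dance if s not in taboo]
        if not candidates:                      # nothing new to move to
            break
        sref = max(candidates, key=fitness)     # new reference solution
        if fitness(sref) > fitness(best):
            best = sref
    return best, evaluations


def toy_fitness(x: int) -> float:
    return -float((x - 7) ** 2)                 # maximum at x = 7


def toy_area(x: int) -> List[int]:
    return [x + 3, x - 3, x + 6, x - 6]


def toy_neighbors(x: int) -> List[int]:
    return [x + 1, x - 1]                       # L = 2 neighbors per solution


BEST, EVALS = bso(0, toy_fitness, toy_area, toy_neighbors, k=2, max_iter=5)
print(BEST, EVALS)                              # 7 20
```

With MaxIter = 5, k = 2 and L = 2, the loop performs 5 × 2 × 2 = 20 neighbor evaluations, which matches the count given in the text; in BSO-ARM each of these evaluations additionally costs a pass over the transactions.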
C. The Determination of Search Area
Let Sref be the solution found by InitBee. The search area is explored as follows: first, an integer variable named flip is chosen in the range [1, n]. After that, the solutions are
Algorithm 1 BSO-ARM
Empirical parameters: k (number of bees), Flip, MaxIter, a, b.
Input: dataset of transactions, MinSup, MinConf
Output: set of association rules.
i = 0
Sref = the initial solution, generated randomly or via a heuristic
evaluate Fmax(Sref, a, b, MinSup, MinConf)
S* = Sref
while i < MaxIter do
    TabouList = TabouList + Sref
    SetSolutions = SearchArea(Sref, Flip)
    SetKSolutions = select k solutions from SetSolutions
    for each solution s in SetKSolutions do
        evaluate Fmax(s, a, b, MinSup, MinConf)
        assign s to one bee
    end for
    for each bee do
        neighborhood-search(bee)
        save the best solution found in the Dance table
    end for
    Sref = the best solution in the Dance table
    if Fmax(Sref) > Fmax(S*) then
        S* = Sref
    end if
    i = i + 1
end while
for each solution s in the Dance table do
    generate the rule from s
end for

Table I PARAMETER SETTING (average fitness)
Bees number (k)   Flip = 1   Flip = 2   Flip = 3   Flip = 4
4                 6.93       10.60      9.4        9.6
5                 4.16       4.47       11.81      4.71
6                 13.91      5.75       2.56       4.56
7                 6.5        3.90       8.33       3.34

Table II BSO-ARM RESULTS
MaxIter   Average Fitness   Execution Time (sec)
25        14                12
50        14                35
75        12                55
100       11                67
125       10                83
150       9                 107
175       8                 138
200       8                 176
V. EXPERIMENTATION AND RESULTS
We first tested BSO-ARM on a common synthetic standard database, generated by IBM QUEST [18], which contains 1000 transactions and 20 attributes. The implementation results confirm that BSO-ARM generates neither non admissible solutions nor false rules.
A. Parameters Setting
We vary the number k of bees and the value of flip, the two parameters associated with SearchArea and the neighborhood search. Table I is obtained by increasing the number of bees and the flip value while fixing the number of iterations to 200. The reported values are averages over 100 successive test runs. Table I shows that the average fitness of the final BSO-ARM solutions depends on the number of bees and the value of flip. The best value is observed when k = 6 and flip = 1.
B. Experiments and Comparison
In the following experiments we set the number of bees to 6 and the flip value to 1. Table II presents the average fitness and the execution time of the BSO-ARM algorithm when the number of iterations increases from 25 to 200. Fig 1 shows that the fitness decreases as the number of iterations increases for all the algorithms (AGA, ARMGA, IARMGA, BSO-ARM). This is due to the fact that, at each iteration, the size of the rules increases while their support decreases. Moreover, we observe that our new algorithm (BSO-ARM) outperforms the three other algorithms w.r.t. the quality of solutions, thanks to the two fundamental BSO operations we developed (determination of SearchArea and neighborhood search). In fact, the first one reinforces diversification while the second one improves intensification. The fitness average of BSO-ARM doesn't go below 10 when the number of iterations is 200 and reaches 25 when the number of iterations is 1. Fig 2 shows that, concerning the CPU time, even if all the algorithms are comparable, AGA can be considered the best one. This is explained by the fact that AGA uses very naive mutation and crossover operations. BSO-ARM is the most expensive in terms of CPU time because its two operations (determination of the search area and neighborhood search) have a high complexity compared with the AGA operations. Of course, the CPU time increases with the number of iterations. In the two following figures, IARMGA and BSO-
ARM performances are observed when the dataset grows in size. The next results are obtained using the IBM Artificial Benchmark [19], which contains 100000 transactions and 870 attributes. We executed IARMGA and BSO-ARM with 200 iterations. Fig 3 confirms that BSO-ARM is better with respect to the average fitness. Moreover, the fitness decreases as the number of transactions increases. Again, Fig 4 confirms that IARMGA is better in terms of CPU time. Of course, the CPU time increases with the number of transactions.
Figure 1. Comparing BSO-ARM and 3 other approaches w.r.t fitness
Figure 2. Comparing BSO-ARM and 3 other approaches w.r.t CPU time
Figure 3. Fitness
Figure 4. CPU time
VI. CONCLUSION
In this article, we proposed a new algorithm (BSO-ARM) for association rule mining. It is inspired by the behavior of bees and is based on the BSO algorithm. The two important operations provided by BSO (determination of the search area and neighborhood search) improve the solution quality but require considerable computation time. For this reason, a parallel BSO-ARM will be developed as future work.
REFERENCES
[1] Han, J., Kamber, M. and Pei, J.: Data Mining: Concepts and Techniques, 3rd edition, Elsevier, 2011.
[2] Djenouri, Y., Drias, H. and Habbas, Z.: IARMGA: An Improved Genetic Algorithm for Association Rules Mining, Proceedings of META, 2012.
[3] Drias, H., Sadeg, S. and Safa Yahi, S.: Cooperative Bees Swarm for Solving the Maximum Weighted Satisfiability Problem, in Proceedings of IWANN 2005, pp 318-325, 2005.
[4] Marwick, A.: Text Mining for associations using UIMA, technical lead, IBM, Feb 2006.
[5] Drias, H. and Mosteghanemi, H.: Bees Swarm Optimization based Approach for Web Information Retrieval, IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2010.
[6] Han, J., Pei, J., Yin, Y., Mao, R.: Mining frequent patterns without candidate generation, Data Mining and Knowledge Discovery, No 8, pp 53-87, 2004.
[7] Agrawal, R. and Shafer, J.: Parallel mining of association rules, IEEE Transactions on Knowledge and Data Engineering, Vol 8, No 6 (1996).
[8] Agrawal, R. and Srikant, R.: Fast algorithms for mining association rules in large databases (http://rakesh.agrawalfamily.com/papers/vldb94apriori.pdf), in Bocca, Jorge B.; Jarke, Matthias; and Zaniolo, Carlo; editors, Proc. of the 20th International Conference on Very Large Data Bases (VLDB), Santiago, Chile, pp 487-499, Sept 1994.
[9] Zaki, M. J.: Scalable algorithms for association mining, IEEE Transactions on Knowledge and Data Engineering, 12(3), pp 372-390, 2000.
[10] Zaki, M. J.: Parallel and distributed Association Mining: A survey, IEEE Concurrency (1999).
[11] Yan, X. and Zhang, C.: Genetic algorithm based strategy for identifying association rule without specifying minimum support, Expert Systems with Applications (2009).
[12] Hong, G. and Zhou, Y.: An algorithm for mining association rules based on improved genetic algorithm and its application, Third International Conference on Genetic and Evolutionary Computing, IEEE computer science (2009).
[13] Wang, M., Zou, Q. and Lin, C.: Multi dimensions association rules mining on adaptive genetic algorithm, International Conference on Uncertainty Reasoning and Knowledge Engineering, IEEE (2011).
[14] Liu, D.: Improved genetic algorithm based on simulated annealing and quantum computing strategy for association rule mining, Journal of Software (2010).
[15] Luke, S.: Essentials of Metaheuristics (2011).
[16] Gendreau, M. and Potvin, J. Y.: Handbook of Metaheuristics, Springer, second edition (2010).
[17] Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs, 3rd edn. Springer-Verlag, Berlin Heidelberg New York (1996).
[18] Agrawal, R., Imielinski, T. and Swami, A.: Mining association rules between sets of items in large databases, Proceedings of the ACM SIGMOD, Washington DC, pp 207-216, 1993.
[19] Zheng, Z., Kohavi, R. and Mason, L.: Real World Performance of Association Rule Algorithms, Knowledge Discovery in Database journal (2001).