A New Evolutionary Algorithm for Extracting a Reduced Set of Interesting Association Rules

Mir Md. Jahangir Kabir, Shuxiang Xu, Byeong Ho Kang, and Zongyuan Zhao
School of Engineering and ICT, University of Tasmania, Hobart, Australia
{mmjkabir,Shuxiang.Xu,Byeong.Kang,Zongyuan.Zhao}@utas.edu.au
Abstract. Data mining techniques involve extracting useful, novel and interesting patterns from large data sets. Traditional association rule mining algorithms generate a huge number of unnecessary rules because they use support and confidence values as the only constraints for measuring the quality of the generated rules. Recently, several studies have defined the process of extracting association rules as a multi-objective problem, allowing researchers to optimize different measures that may be present to different degrees depending on the data sets used. Applying evolutionary algorithms to the noisy data of a large data set is especially useful for automatic data processing and for discovering meaningful and significant association rules. Since the beginning of the last decade, multi-objective evolutionary algorithms have gradually become more useful in data mining research. In this paper, we propose a new multi-objective evolutionary algorithm, MBAREA, for mining useful Boolean association rules with low computational cost. To accomplish this, the proposed method extends a recent multi-objective evolutionary algorithm based on a decomposition technique to perform evolutionary learning of the fitness value of each rule, while introducing a best population and a class based mutation method to store all the best rules obtained at intermediate generations of a population and to improve the diversity of the obtained rules. Moreover, the approach maximizes two objectives, performance and interestingness, in order to obtain rules which are useful, easy to understand and interesting. The proposed algorithm is applied to different real world data sets to demonstrate its effectiveness, and the results are compared with existing evolutionary algorithm based approaches.

Keywords: Data mining · Association rules · Multi-objective evolutionary algorithms · Conditional probability · Interestingness
1 Introduction

Data mining techniques are the most important tools for discovering valid, novel, useful and interesting patterns from large data sets. Nowadays, a huge amount of data is collected, stored inexpensively and accessed easily due to the digital revolution. This limitless growth of data makes the knowledge extraction process more difficult, and in most cases the problems become complex [1]. Designing an efficient deterministic method is therefore not practical. Because of their inherent parallel structures,
evolutionary algorithms have been found to be effective for the automatic processing of large data sets through optimal operator settings and for extracting meaningful and important information [2]. Mining association rules is one of the most common data mining tasks for extracting interesting and hidden knowledge from big data sets [3]. Association rules are used to define and represent relationships between item sets in a data set. If X and Y are item sets where X ∩ Y = ∅, then an association rule between the item sets is represented by X → Y. A large number of research studies for mining association rules are based on a support-confidence framework [4, 5]. This framework consists of two sub-processes, and most of the existing association rule mining algorithms follow these factors to measure the interestingness of a rule: (1) finding all frequent item sets in a large database which satisfy a user defined support value, and (2) generating rules from those frequent item sets which satisfy a user defined confidence value. This framework raises two major issues. The first is that users need to specify a suitable threshold value even though they have no knowledge of the real world data sets. The second is an exponential search space of size 2^n, where n is the number of items [6]. Many evolutionary algorithms [7], particularly genetic algorithms, have been proposed in the literature for extracting a reduced set of Boolean association rules (BARs) [6, 8–10]. Genetic algorithm based methods are considered to be among the most effective search techniques for solving complex problems and have proved to be a successful approach, especially when the search spaces are too large for deterministic methods. These algorithms remove some of the limitations mentioned above and generate high quality rules, but they still produce some unnecessary rules because they use only one evaluation criterion [11]. In this paper, we propose MBAREA, a new genetic algorithm based approach which uses multiple evaluation criteria in order to mine a reduced set of high quality rules without generating unnecessary rules. In order to assess the performance of the proposed method, we present an experimental analysis using six real world data sets, with the number of attributes ranging from 23 to 118 and the number of records ranging from 267 to 12,960. We have performed the following studies: first, we have compared the performance of our results with two other evolutionary approaches, ARMGA [6, 10] and ARMMGA [8]; second, we have examined the scalability of the proposed approach; and finally, we have analyzed some of the rules obtained by our proposed method.
2 Preliminaries

In real world applications the data sets consist of nominal attributes. For mining Boolean association rules, nominal attributes are mapped into Boolean attributes on which association rule mining algorithms are applied. For instance, in a solar flare data set, the attributes named evolution and activity are categorized into {decay, no growth, growth} and {reduced, unchanged} respectively. These two attributes are mapped into a set of items I = {I1 = decay, I2 = no growth, I3 = growth, I4 = reduced, I5 = unchanged}. The aim of this study is to find interesting association rules from this mapped
data set in the form of Boolean association rules (BARs). For example, a Boolean association rule can be written as buys(X, "Data Mining Book") ∧ buys(X, "EA Book") → buys(X, "EARM Book"). Association rules are commonly used for solving a wide range of real world problems, for example in health and medical science [12]. The classical algorithms only generate association rules that satisfy user defined support and confidence values. A commonly used method is to generate frequent item sets from a data set based on a user defined support value, where supp(A ∪ B) = |(A ∪ B)|/|D|; here A and B are two item sets of a data set D, with A ⊆ I, B ⊆ I and A ∩ B = ∅ [3, 4]. After generating the frequent item sets, the next step is to generate rules from those frequent patterns. A rule is valid if it satisfies a user defined confidence value. The confidence value of a rule A → B is defined as supp(A ∪ B)/supp(A). Most of these algorithms focused only on rules which appear frequently in a data set, that is, high support rules [11, 13]. However, several authors have noticed drawbacks of this framework which lead to the generation of a huge number of misleading and trivial rules [8, 11]. A rule is misleading if supp(Y) > confidence(X → Y), i.e. the item set in the antecedent is negatively correlated with the item set in the consequent, since buying one of these items actually decreases the probability of purchasing the other [14]. Many researchers have used different measures for evaluating the quality of a rule, and those approaches significantly reduced the generation of misleading rules [6, 8, 15]. However, those approaches still generate misleading as well as trivial rules. A rule X → Y is trivial if supp(X) = 0 or supp(Y) = 0, since then supp(X ∪ Y) = 0 for any item set X or Y, respectively. In recent years, several studies have proposed other measures according to the interests of users [11]. We briefly describe those which will be used in the current study. The conditional probability [16] of a rule analyzes the dependence between X and Y and is defined as:

CP(X|Y) = {supp(X ∪ Y) − supp(X) · supp(Y)} / {supp(X) · (1 − supp(Y))}    (1)
Its domain range is [−∞, ∞], where −∞ < value < 0 represents misleading rules, 0 < value < ∞ represents positive association rules, and value = 0, −∞ or ∞ represents trivial rules. The ratio between the confidence and the expected confidence of a rule is measured by lift [11], defined as:

lift(X → Y) = supp(X ∪ Y) / {supp(X) · supp(Y)}    (2)
For finding interesting rules, new rules are generated based on each item present in the consequent part of a rule. Since the number of items present in the consequent part of a rule is not predefined, this approach may not be suitable for an association rule mining task. Recalling the definition of interestingness in [17], an expression for measuring the interestingness of a rule A → B is defined as follows:

I = [supp(A ∪ B) / supp(A)] · [supp(A ∪ B) / supp(B)] · [supp(A ∪ B) / |D|]    (3)
Here I is the interestingness measure of a rule A → B and |D| is the total number of records in the database. Its domain range is [0, ∞], where 0, ∞ and 0 < value < ∞ represent independence, trivial rules and positive dependence, respectively.
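To make the three measures concrete, the following sketch (in Python, not taken from the paper) computes CP, lift and interestingness from supports; the function names and the example values are illustrative assumptions, and supports are assumed to be given as relative frequencies.

```python
# Illustrative helpers for the measures of Sect. 2 (Eqs. 1-3); supports are assumed
# to be relative frequencies in [0, 1], and |D| is the number of records.

def conditional_probability(supp_xy, supp_x, supp_y):
    # Eq. (1): CP(X|Y); undefined when supp(X) = 0 or supp(Y) = 1.
    return (supp_xy - supp_x * supp_y) / (supp_x * (1.0 - supp_y))

def lift(supp_xy, supp_x, supp_y):
    # Eq. (2): ratio of confidence to expected confidence.
    return supp_xy / (supp_x * supp_y)

def interestingness(supp_ab, supp_a, supp_b, num_records):
    # Eq. (3): product of the two conditional supports and supp(A ∪ B)/|D|.
    return (supp_ab / supp_a) * (supp_ab / supp_b) * (supp_ab / num_records)

# Example: supp(X) = 0.4, supp(Y) = 0.5, supp(X ∪ Y) = 0.3, |D| = 1000 records.
print(conditional_probability(0.3, 0.4, 0.5))   # 0.5      -> positive dependence
print(lift(0.3, 0.4, 0.5))                      # 1.5      -> above independence
print(interestingness(0.3, 0.4, 0.5, 1000))     # 0.000135 -> small but positive
```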
3 Methods

This section describes our method for obtaining a reduced set of interesting association rules with a good trade-off between the coverage and the number of generated rules, considering three objectives: conditional probability, lift and interestingness. The proposal extends the existing ARMGA and ARMMGA algorithms for performing evolutionary learning and introduces two new components: a class based mutation operator and a best population. The characteristics of the algorithm are presented in the following sections.
3.1 Class Based Mutation and Best Population
In order to store all the non-dominated rules generated in the intermediate generations of a population, to provoke the diversity of the population, and to increase the coverage of the data sets, we introduce a class based mutation approach along with a best population method. The mutation operator is used to keep diversity from one generation of a population to the next. Mutation changes one or more genes of a chromosome with respect to a mutation probability mp. Existing GA based approaches such as ARMGA and ARMMGA use a fixed mutation probability and mutate chromosomes at random. Although these methods use a low mutation probability, the random selection may mutate a few high quality chromosomes, so some top quality chromosomes get less chance to take part in future generations of a population. To prevent this problem and to give the best chromosomes a greater chance in future generations, we classify the whole population into δ classes based on the fitness value of each chromosome. Top class chromosomes have higher fitness values and are assigned a low mutation probability, whereas low class chromosomes are mutated with a high mutation probability. Through this approach high class chromosomes take part in future generations of a population. The best population (BP) keeps all the non-dominated rules generated in the intermediate generations of a population. Moreover, BP is updated as each new population is generated, following the non-dominance criterion. This process helps to increase the coverage of a data set and to perform an enhanced exploration of the search space.
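The following Python sketch illustrates how these two components could be realized. It is an interpretation of the description above and of the Pmut formula in Table 2, not the authors' implementation; in particular, ITEM_IDS, the dominance test and the mapping from fitness class to n are assumptions.

```python
import random

ITEM_IDS = list(range(1, 120))   # hypothetical Boolean item alphabet
DELTA = 5                        # number of fitness classes (delta in Table 2)

def class_mutation_probability(cls, delta=DELTA):
    # Assumed reading of Pmut = [100 - (100/delta)*n] %: class 1 is the worst class
    # (mutated most) and class delta the best one (mutated least).
    return (100.0 - (100.0 / delta) * cls) / 100.0

def class_based_mutation(population, fitness, delta=DELTA):
    # Rank chromosomes by fitness (worst first) and mutate each one with the
    # probability of its class, so top chromosomes are rarely disturbed.
    ranked = sorted(population, key=fitness)
    class_size = max(1, len(ranked) // delta)
    for i, chrom in enumerate(ranked):
        cls = min(delta, i // class_size + 1)
        if random.random() < class_mutation_probability(cls, delta):
            pos = random.randrange(1, len(chrom))   # keep the separator gene at index 0
            chrom[pos] = random.choice(ITEM_IDS)
    return ranked

def dominates(a, b):
    # a dominates b if it is no worse in every objective and better in at least one.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def update_best_population(best_pop, new_pop, objectives):
    # BP archive: keep only the non-dominated rules seen so far.
    merged = best_pop + new_pop
    return [r for r in merged
            if not any(dominates(objectives(o), objectives(r)) for o in merged if o is not r)]
```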
3.2 Objectives and Genetic Operators
Three objectives are maximized for this problem: conditional probability (CP), lift and interestingness. We are only interested in mining very strong rules which have a positive dependence between items, thereby avoiding the problems of support-confidence framework based methods. Notice that positive association rules allow us to represent positive dependence; thus we are interested in those rules which have CP > 0. Thus, a rule
X → Y must satisfy the following conditions: (i) CP > 0; (ii) supp(X ∪ Y) > 0; (iii) supp(X) ≠ 0 and supp(Y) ≠ 0. In this study, CP acts as the fitness function of a valid rule and filters out misleading and trivial rules. A rule with a CP value near one has a high degree of positive dependence between its item sets and may be more important to the users. Interestingness is a measure through which we can say how interesting a rule is to the users. Here we have used the well-known interestingness measure (see Sect. 2). Since its range is not bounded, larger values better distinguish between rules and reduce the number of generations. A chromosome is a gene vector which represents the attributes together with an indicator that separates the item sets. An association rule of length k contains k items, as shown in Fig. 1. For example, for a rule A → B, the antecedent A contains item1 to itemn and the consequent B contains itemn+1 to itemk, where 0 < n < k. The first position of a chromosome is the indicator that separates the antecedent from the consequent. The selection operator chooses an individual chromosome from a given population; it acts as a filter that selects chromosomes based on the fitness function and the selection probability (sp). The crossover operator is applied to two chromosomes of a given population, called parent chromosomes, to reproduce two new offspring chromosomes by exchanging parts of the parents. An example of how this operator works is shown in Fig. 2. The mutation operator is explained in Sect. 3.1.
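As an illustration of this encoding (Fig. 1) and of the two-point crossover (Fig. 2), the following sketch represents a rule as [n, item_1, ..., item_k] with the separator n stored in the first position. The details are assumptions for illustration, not the paper's code.

```python
import random

def random_chromosome(items, k):
    # A rule of length k: separator n followed by k distinct items (Fig. 1).
    genes = random.sample(items, k)
    n = random.randint(1, k - 1)
    return [n] + genes

def decode(chrom):
    n, genes = chrom[0], chrom[1:]
    return genes[:n], genes[n:]            # (antecedent, consequent)

def two_point_crossover(parent1, parent2):
    # Swap the gene segment between two cut points; the separator gene stays in place.
    # Note: a real implementation would also repair duplicated items in a child.
    k = len(parent1) - 1
    p, q = sorted(random.sample(range(1, k + 1), 2))
    child1 = parent1[:p] + parent2[p:q] + parent1[q:]
    child2 = parent2[:p] + parent1[p:q] + parent2[q:]
    return child1, child2

items = list(range(1, 120))                # hypothetical Boolean item ids
a, b = random_chromosome(items, 3), random_chromosome(items, 3)
print(decode(a), decode(b))
print(two_point_crossover(a, b))
```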
3.3 MBAREA Algorithm
According to the above description, the MBAREA algorithm is summarized through the following structure.
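Since the original listing is not reproduced here, the following high-level sketch shows one way the loop described in Sects. 3.1–3.3 fits together. The operator functions are injected as parameters, and the stopping test encodes one possible reading of the second termination condition given below.

```python
def mbarea_loop(init_population, fitness_fn, select, crossover, class_mutation,
                update_bp, alpha, max_evaluations):
    # Sketch of the MBAREA main loop (not the original listing); the genetic
    # operators and the fitness function (CP, Sect. 3.2) are passed in.
    population = init_population()
    best_population = []                  # BP archive of non-dominated rules (Sect. 3.1)
    evaluations, prev_avg = 0, None
    while evaluations < max_evaluations:
        scores = [fitness_fn(c) for c in population]
        evaluations += len(population)
        best_population = update_bp(best_population, population)
        avg = sum(scores) / len(scores)
        # One reading of stopping condition (ii): the average fitness of the current
        # population has fallen below that of the previous population by alpha.
        if prev_avg is not None and avg < prev_avg - alpha:
            break
        prev_avg = avg
        parents = select(population, scores)          # fitness- and sp-based selection
        offspring = crossover(parents)                # two-point crossover (Fig. 2)
        population = class_mutation(offspring, fitness_fn)  # class based mutation (Sect. 3.1)
    return best_population
```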
Fig. 1. A chromosome of an association rule of k length
Fig. 2. Two-point crossover example
This process continues until either of the following conditions occurs: (i) the maximum number of evaluations is reached, or (ii) the average value of the fitness function of the current population is less than the value α of the previous population.
4 Experimental Analysis

Several experiments have been carried out on different data sets to analyze the performance of the proposed algorithm. For testing the proposed algorithm and comparing the results with the ARMGA and ARMMGA approaches, we have considered six real world data sets, which are available in the UCI machine learning repository (http://archive.ics.uci.edu/ml/datasets.html). Table 1 summarizes the specifications of those data sets, where Attributes (B) represents the number of Boolean attributes and Records is the number of records. The parameters used for running the algorithms are shown in Table 2. For ARMMGA and ARMGA, the parameters are selected according to the recommendations of each proposal. As described in Sect. 3.1, a class based mutation method is applied in our approach and the mutation probability decreases with the class of a chromosome in the population.

Table 1. Data sets considered for the experimental analysis

Data set          Attributes (B)  Records
Mushroom          118             8124
Balance scale     23              625
Nursery           32              12960
Monk's problems   19              431
Solar flare       50              1066
SPECT heart       46              267
Table 2. Parameters considered for running the algorithms

Algorithm  Parameters
ARMMGA     Popsize = 100, Psel = 0.95, Pcro = 0.85, Pmut = 0.01, db = 0.01, k = 3
ARMGA      Popsize = 100, Psel = 0.95, Pcro = 0.85, Pmut = 0.01, α = 0.01, k = 3
MBAREA     Popsize = 100, Psel = 0.95, Pcro = 0.85, Pmut = [100 − {(100/δ)·n}] %, δ = 5, n = 1, …, δ, k = 3, α = 0.01
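As a worked example of the MBAREA mutation setting (under the assumption that larger n corresponds to a better fitness class), δ = 5 gives Pmut(n) = [100 − (100/5)·n] % = (100 − 20n) %, i.e. 80 %, 60 %, 40 %, 20 % and 0 % for n = 1, …, 5, so the chromosomes in the best class are mutated least, as described in Sect. 3.1.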
Because it uses a weak constraint function [8], ARMMGA generates positive association rules together with misleading and trivial rules, as shown in Fig. 3. In this experiment we considered a single evaluation run for different population sizes. The performance of our approach against the other algorithms is shown in Table 3, where #Rules is the number of generated rules and Avsupp, Avconf, Avlift, Avint and CP are the average support, confidence, lift, interestingness and conditional probability, respectively. In order to develop the different experimental analyses, we have considered the average results of three runs for each data set.
Fig. 3. Different types of rules generated by ARMMGA for different data sets because of the use of a weak constraint
The rules obtained by our approach present better or similar values for the different measures than the rules obtained by the other algorithms. ARMMGA generates a smaller number of rules, but some of them are misleading or trivial. For some data sets, ARMGA obtains a good average support but low values for the rest of the measures.
Table 3. Results obtained by the evolutionary algorithms for different data sets

Data set         Algorithm  #Rules  Avsupp  Avconf  Avlift  Avint     CP
SPECT heart      ARMMGA     9       0.29    0.87    1.25    0.0005    0.2
                 ARMGA      37      0.25    0.6     1.5     0.0004    0.3
                 MBAREA     11      0.1     0.83    1.88    0.0005    0.68
Monk's problems  ARMMGA     7       0.04    0.64    1.88    8.84E-06  0.19
                 ARMGA      23      0.06    0.4     3.06    4.27E-05  0.24
                 MBAREA     15      0.06    0.8     6.46    0.0001    0.76
Balance scale    ARMMGA     5       0.02    0.72    1.56    2.24E-06  0.35
                 ARMGA      15      0.03    0.51    2.82    6.02E-06  0.38
                 MBAREA     8       0.03    0.77    1.678   2.60E-06  0.58
Solar flare      ARMMGA     7       0.005   1       236.88  5.86E-07  0.5
                 ARMGA      12      0.003   0.59    113.19  2.16E-06  0.6
                 MBAREA     10      0.005   0.947   237.31  4.10E-06  0.94
Mushroom         ARMMGA     10      0.028   0.38    12.09   3.75E-07  0.40
                 ARMGA      29      0.018   0.47    6.26    2.29E-07  0.37
                 MBAREA     18      0.002   0.97    12.74   1.47E-06  0.94
Nursery          ARMMGA     4       0.028   0.52    1.19    1.15E-07  0.2
                 ARMGA      14      0.02    0.5     3.54    3.28E-07  0.37
                 MBAREA     6       0.01    1       8       1.47E-11  1
For all data sets, the average confidence, lift, interest and conditional probability values of the rules generated by MBAREA are higher than or similar to those of the rules generated by the other algorithms. Moreover, the rules generated by the proposed approach are neither misleading nor trivial. To analyze the scalability of the proposed algorithm, several experiments were carried out on the Nursery data set. The experiments were performed on an Intel(R) Core i5-3210M CPU @ 2.50 GHz with 4 GB RAM running Windows 7 Enterprise. The average runtime required by the algorithms when the number of attributes and the number of examples are increased is shown in Tables 4 and 5, respectively. We can see that all the algorithms scale quite linearly; however, in most cases MBAREA takes less time than the other algorithms. Some useful and interesting BARs generated by MBAREA are shown in Table 6. These are rules with positive dependence among the item sets which have maximum values of the other objectives, such as lift and CP. For example, the rule 28 → 19, 12 of the Nursery data set in Table 6 can be interpreted as follows: the decision will be to "recommend" the application only if the financial condition of a parent is "convenient" and the number of children is "one".
Table 4. Runtime (in secs) needed for different numbers of attributes of the Nursery data set

Algorithm  Number of attributes
           8      12     20     25     32
ARMMGA     10.68  8.34   11.4   10.87  13.13
ARMGA      9.15   10.88  12.25  13.37  13.85
MBAREA     9.1    9.43   10.09  14.35  10.88
Table 5. Runtime (in secs) needed for an increasing number of examples of the Nursery data set

Algorithm  Number of examples
           20 %   40 %   60 %   80 %   100 %
ARMMGA     8      9.45   10     10.4   13.13
ARMGA      3.77   9.17   10.46  13.24  13.85
MBAREA     3.09   8.32   8.58   12.84  10.88
Table 6. Rules obtained by our proposal for different data sets

Data set     Rule          Conf  Lift  CP
SPECT Heart  37, 26 → 16   1     1.74  1
Mushroom     98, 86 → 34   1     1.19  1
Nursery      28 → 19, 12   1     8     1
5 Conclusion

We have proposed MBAREA, a new evolutionary algorithm for mining a reduced set of positive BARs. The generated rules are interesting, easy to understand and maximize two objectives, performance and interestingness. To accomplish this, the approach extends the existing ARMGA and ARMMGA algorithms for evolutionary learning and selection of the condition of each rule. The algorithm introduces a class based mutation method into the evolutionary model and a best population technique to improve the diversity of the generated rules and to store all the non-dominated rules generated in the intermediate generations of a population. Analyzing the results obtained over six real world data sets, it can be concluded that the generated rules maintain a good trade-off among the number of rules, confidence, conditional probability, interest and lift values in all the data sets. Moreover, the generated rules are very strong, which indicates a strong relationship between the item sets and overcomes the drawback of support dependent methods. Finally, the experimental results show that the proposed algorithm has a good computational cost and scales well when the problem size is increased.
References

1. Van Renesse, R., Birman, K.P., Vogels, W.: Astrolabe: a robust and scalable technology for distributed system monitoring, management, and data mining. ACM Trans. Comput. Syst. 21(2), 164–206 (2003)
2. Maulik, U., Bandyopadhyay, S., Mukhopadhyay, A.: Multiobjective Genetic Algorithms for Clustering: Applications in Data Mining and Bioinformatics. Springer, Berlin (2011)
3. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. Morgan Kaufmann, Burlington (2006)
4. Aggarwal, C.C., Yu, P.S.: A new framework for itemset generation. In: PODS Conference, pp. 18–24 (1998)
5. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. ACM SIGMOD 29(2), 1–12 (2000)
6. Yan, X., Zhang, C., Zhang, S.: Genetic algorithm-based strategy for identifying association rules without specifying actual minimum support. Expert Syst. Appl. 36(2), 3066–3076 (2009)
7. Eiben, A.E.: Introduction to Evolutionary Computing. Springer, Berlin (2003)
8. Qodmanan, H.R., Nasiri, M., Minaei-Bidgoli, B.: Multi objective association rule mining with genetic algorithm without specifying minimum support and minimum confidence. Expert Syst. Appl. 38(1), 288–298 (2011)
9. Kannimuthu, S., Premalatha, K.: Discovery of high utility itemsets using genetic algorithm with ranked mutation. Appl. Artif. Intell. 28(4), 337–359 (2014)
10. Yan, X., Zhang, C., Zhang, S.: ARMGA: identifying interesting association rules with genetic algorithms. Appl. Artif. Intell. Int. J. 19(7), 677–689 (2005)
11. Martin, D., Rosete, A., Alcala-Fdez, J., Herrera, F.: A new multiobjective evolutionary algorithm for mining a reduced set of interesting positive and negative quantitative association rules. IEEE Trans. Evol. Comput. 18(1), 54–69 (2014)
12. Ampan, A.C.: A programming interface for medical diagnosis prediction. Artif. Intell. LI(1), 21–30 (2006)
13. del Jesus, M.J., Gámez, J.A., González, P., Puerta, J.M.: On the discovery of association rules by means of evolutionary algorithms. Wiley Interdisc. Rev. Data Min. Knowl. Discov. 1(5), 397–415 (2011)
14. Zhou, L., Yau, S.: Efficient association rule mining among both frequent and infrequent items. Comput. Math. Appl. 54(6), 737–749 (2007)
15. Mukhopadhyay, A., Maulik, U., Bandyopadhyay, S., Coello, C.A.C.: A survey of multiobjective evolutionary algorithms for data mining: part I. IEEE Trans. Evol. Comput. 18(1), 4–19 (2014)
16. Piatetsky-Shapiro, G.: Discovery, analysis, and presentation of strong rules. In: Knowledge Discovery in Databases, pp. 229–248. AAAI/MIT Press, Menlo Park (1991)
17. Wakabi-Waiswa, P.P., Baryamureeba, V.: Extraction of interesting association rules using genetic algorithms. Int. J. Comput. ICT Res. 2(1), 101–110 (2008)