Combining Hybrid Rule Ordering Strategies based on Netconf and a novel satisfaction mechanism for CAR-based classifiers

R. Hernández-León1,∗, Jesús A. Carrasco-Ochoa2, José Fco. Martínez-Trinidad2, J. Hernández-Palancar1

1 Centro de Aplicaciones de Tecnologías de Avanzada, 7a #21406, Siboney, Playa, CP: 12200, Havana, Cuba. {rhernandez,jpalancar}@cenatav.co.cu
2 Instituto Nacional de Astrofísica, Óptica y Electrónica, Luis Enrique Erro #1, Sta. María Tonantzintla, Puebla, CP: 72840, México. {ariel,fmartine}@inaoep.mx
May 23, 2014
Abstract: In Associative Classification, building a classifier based on Class Association Rules (CARs) consists in finding an ordered CAR list by applying a rule ordering strategy and selecting a satisfaction mechanism to determine the class of unseen transactions. In this paper, we introduce four novel hybrid rule ordering strategies; the first three combine the Netconf measure with different Support-Confidence based rule ordering strategies, while the fourth combines the Netconf measure with a rule ordering strategy based on the CAR's size. Additionally, we combine the proposed strategies with a novel “Dynamic K” satisfaction mechanism. Experiments over several datasets show that the proposed rule ordering strategies, jointly with the “Dynamic K” satisfaction mechanism, improve the performance of CAR-based classifiers.

∗ Corresponding author
Key words: Data mining, Supervised Classification, Class Association Rules, Rule Ordering Strategies
1 Introduction
Associative Classification, also known as Classification Association Rule Mining (CARM) and introduced in [22], integrates Association Rule Mining (ARM) and Classification Rule Mining (CRM). This integration involves mining a special subset of association rules, called Class Association Rules (CARs), using some quality measure (QM) to evaluate them. In general, a CAR describes an implicative co-occurring relationship between a set of items (itemset) and a pre-defined class, expressed as “⟨item1 , . . . , itemn ⟩ ⇒ class”. Regardless of the particular methodology used to compute the set of CARs, a classifier based on this approach is usually presented as an ordered CAR list l, and a mechanism for classifying unseen transactions using l [21, 22, 32]. Associative classification has been applied to many tasks including anomaly detection [15], text classification [7], text segmentation [8], automatic image annotation [27], automatic error detection [23], determination of DNA splice junction types [6], determination of cotton yarn quality [2], discrimination of diseases [20], mammalian mesenchymal stem cell differentiation [28] and prediction of protein-protein interaction types [24], among others. In associative classification, similar to ARM, a set of items I = {i1 , . . . , in }, a set of classes C, and a set of labeled transactions D, are given. Each transaction in D is represented by a set of items X ⊆ I and a class c ∈ C. A lexicographic order among the items of I is assumed. The Support of an itemset X ⊆ I, denoted as Sup(X), is the fraction of transactions in D containing X (see Eq. 1). A CAR is an implication of the form X ⇒ c where X ⊆ I and c ∈ C. The rule X ⇒ c is held in D with certain Support s (see Eq. 2) and
Confidence α (see Eq. 3), where s is the fraction of transactions in D that contains X ∪ {c}, and α is the probability of finding c in transactions that also contain X, which represents how “strongly” the rule antecedent X implies the rule consequent c. A CAR X ⇒ c covers a transaction t if X ⊆ t, which means that t must contain all the items of X.
Sup(X) = |DX| / |D|    (1)

where DX is the set of transactions in D containing X and | · | represents the cardinality.

Sup(X ⇒ c) = Sup(X ∪ {c})    (2)

Conf(X ⇒ c) = Sup(X ⇒ c) / Sup(X)    (3)
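The definitions above (Eqs. 1–3, and the coverage test) can be sketched in Python. This is a minimal illustration, not part of any CARM implementation: each transaction is assumed to be represented as a set of items with its class label included, and all function names are ours.

```python
def support(itemset, transactions):
    """Sup(X): fraction of transactions containing every item of X (Eq. 1)."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rule_support(antecedent, cls, transactions):
    """Sup(X => c) = Sup(X union {c}) (Eq. 2)."""
    return support(set(antecedent) | {cls}, transactions)

def confidence(antecedent, cls, transactions):
    """Conf(X => c) = Sup(X => c) / Sup(X) (Eq. 3)."""
    return rule_support(antecedent, cls, transactions) / support(antecedent, transactions)

def covers(antecedent, transaction):
    """A CAR X => c covers a transaction t if X is a subset of t."""
    return set(antecedent) <= transaction
```

For instance, over four labeled transactions, Conf({a} ⇒ c1) is simply the fraction of transactions containing item a that also carry class label c1.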
Several classifiers based on CARs have been developed [12, 18, 19, 21, 22, 31]. Regardless of the CARM approach, a CAR-based classifier is usually presented as an ordered list of CARs and a mechanism for classifying new transactions. Therefore, in this paper, we investigate new rule ordering strategies and a better satisfaction mechanism that together allow obtaining a more accurate classifier. Preliminary results of this paper were published in [17]. The main differences of this paper w.r.t. the conference paper [17] are the following: (1) we introduce four novel hybrid rule ordering strategies, three of them by combining the Netconf measure with different Support-Confidence based rule ordering strategies, and the fourth by combining the Netconf measure with a rule ordering strategy based on the CAR's size; (2) we combine the proposed rule ordering strategies with a novel “Dynamic K” satisfaction mechanism, in
order to improve the accuracy of CAR-based classifiers; and (3) we include new experiments showing the impact of the new hybrid rule ordering strategies in the classification stage, both by themselves and in combination with the “Dynamic K” satisfaction mechanism. This paper is organized as follows: the next section describes the related work. In section three, the novel “Dynamic K” satisfaction mechanism is described and the new hybrid rule ordering strategies are introduced. In section four, experimental results show the performance of our proposal by comparing it against the best results reported in the literature. Finally, our conclusions are given in section five.
2 Related work
Associative Classification was first proposed in [1], where it was used for two specific tasks: detecting redundant medical tests and reducing telecommunication order failures. Later, several classifiers based on CARs were developed [12, 21, 22, 29, 30, 31]. Broadly, CAR-based classifiers can be categorized into two groups according to the way the CARs are generated:

1. Two-stage classifiers: In this type of classifier, all CARs satisfying the Support and Confidence thresholds are mined in a first stage; later, in a second stage, a classifier is built by selecting an ordered subset of CARs (CBA [22], CMAR [21]). The second stage involves a dataset coverage analysis, which preserves CARs that can classify correctly at least one training transaction and prunes the remaining CARs.

2. Integrated classifiers: In this type of classifier, a reduced set of CARs is built in a single step (TFPC [12]). These algorithms avoid the coverage process by directly generating a subset of CARs.
2.1 Rule ordering strategies
Once a subset of CARs has been obtained, regardless of the way they were generated, the CARs are ordered. In [4, 29, 30, 31] the authors established six main rule ordering strategies, divided into two groups:
Group 1: Based on Support-Confidence

a) CSA (Confidence - Support - Antecedent size): The CSA rule ordering strategy combines Confidence, Support and the size of the rule antecedent (or CAR's size). CSA sorts the rules in descending order according to their Confidence. Those CARs with the same Confidence value are sorted in descending order according to their Support and, if the tie persists, CSA sorts the rules in ascending order according to their antecedent size [21, 22]. If the tie still persists, this strategy, as well as the ACS and L3 strategies, sorts the rules according to the order in which they appeared during the mining process.

b) ACS (Antecedent size - Confidence - Support): The ACS rule ordering strategy is a variation of CSA that takes the antecedent size as the first ordering criterion, followed by the Confidence and Support criteria [12].

c) L3: This rule ordering strategy was proposed in [4] and is also a variation of CSA, but it sorts the rules in descending order according to their antecedent size, as third ordering criterion.

Group 2: Based on weighting

d) WRA (Weighted Relative Accuracy): The WRA rule ordering strategy, proposed in [12], assigns to each CAR a weight (based on its Support and
Confidence) and then sorts the set of CARs in descending order according to these weights. The weight for a CAR X ⇒ c is computed as

WRA(X ⇒ c) = Sup(X)(Conf(X ⇒ c) − Sup(X))

e) LAP (Laplace Expected Error Estimate): The LAP rule ordering strategy was introduced in [10] and has been used to sort the CARs in some classifiers [29]. LAP is similar to WRA, but it defines the weight of the CARs in a different way, also based on their Support and Confidence. Given a rule X ⇒ c, LAP is defined as
LAP(X ⇒ c) = (Sup(X ⇒ c) + 1) / (Sup(X) + |C|)
where C is the set of predefined classes.

f) χ2 (Chi-Square): The χ2 rule ordering strategy is based on a well-known statistical technique for determining whether two variables are independent or related. After computing an additive χ2 value for each CAR (also based on its Support and Confidence), these values are used to sort the rules in descending order [21]. Given a rule X ⇒ c, χ2 is computed as follows [21]:
χ2(X ⇒ c) = Sup(X)(|T| − Sup(c)) / (Sup(c)(|T| − Sup(X)))   if Sup(X) < Sup(c)
χ2(X ⇒ c) = Sup(c)(|T| − Sup(X)) / (Sup(X)(|T| − Sup(c)))   otherwise
where T is the training set. Additionally, some hybrid rule ordering strategies have been proposed in [29, 30, 31] by combining one rule ordering strategy taken from the Support-Confidence based group (group 1) and another one taken from the Weighting
based group (group 2). In general, let A, B be rule ordering strategies from groups 1 and 2, respectively; the rule ordering strategy “Hybrid A/B” is defined as follows:

• For each predefined class, this ordering strategy selects, from the original list, the Best K Rules in B manner; the remainder of the original list is sorted in A manner; then, this strategy reorders the selected K rules in A manner. Finally, these K rules are put at the front of the remaining original rule list, which has already been sorted in A manner.

The experiments in [29, 30, 31] showed that the classification accuracies obtained using these hybrid rule ordering strategies were better than those obtained using their “parents”; e.g., the accuracy obtained using the Hybrid CSA/WRA rule ordering strategy was better than the accuracies obtained using CSA and WRA separately. In [9], the authors proposed a hybrid rule ordering strategy, called MLRP (Multi-level Rule Priority), which sorts the CARs using rule priority to reduce the influence of rule dependence. The rule dependence problem occurs when (during the dataset coverage analysis) a training transaction O is covered by several CARs concurrently and one of them is used first for classifying O [9]. In this case, because the transaction O has been classified, the confidence of all other CARs covering O is recalculated excluding the transaction O, implying possible changes in the CARs' order. In a first step, MLRP adopts the L3 rule ordering strategy as the initial order; the L3 strategy gives higher priority to higher-confidence and longer rules. In a second step, rule dependencies are computed and stored in a matrix EWM (Effective Weight Matrix), and later these dependencies are used to obtain the final order of the CARs.

In [18], a novel measure called Netconf (see Eq. 4) was used to compute and sort the set of CARs; in this paper, we will refer to this approach as the Netconf rule ordering strategy (NF). Some useful properties of the Netconf measure are the following:

• Netconf holds the statistical independence property; therefore, Netconf(X ⇒ c) = 0 ⇔ Sup(X ⇒ c) = Sup(X)Sup(c).

• Netconf(X ⇒ c) ≠ Netconf(c ⇒ X) if Sup(X) ≠ Sup(c), which means that Netconf is not symmetric; therefore, it can indicate the strength of implication in both directions.

• Netconf takes values in [−1, 1]: positive values represent positive dependencies, negative values represent negative dependencies, and a zero value represents independence.

• Netconf satisfies the properties which, according to Piatetsky-Shapiro [25], should be satisfied by every good quality measure used for separating strong rules from weak rules.
Netconf(X ⇒ c) = (Sup(X ⇒ c) − Sup(X)Sup(c)) / (Sup(X)(1 − Sup(X)))    (4)
In addition, in [18] the authors showed that Netconf solves the drawbacks of the Support and Confidence measures, which have been pointed out in some works [5, 26]. According to the groups defined above, the Netconf measure can be considered a rule ordering strategy belonging to the Weighting based group. Therefore, in this paper, we investigate the use of the Netconf rule ordering strategy combined with different rule ordering strategies from the Support-Confidence group, in order to determine whether it is possible to obtain better hybrid rule ordering strategies than those proposed in the literature.
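The weighting-based measures reviewed in this section (WRA, LAP, the simplified χ2, and Netconf from Eq. 4) can be sketched as follows. This is an illustrative transcription of the formulas as given in the text, taking already-computed Support/Confidence values as inputs; the function names are ours, not from any cited implementation.

```python
def wra(sup_x, conf_rule):
    """WRA(X => c) = Sup(X) * (Conf(X => c) - Sup(X)), as given in the text."""
    return sup_x * (conf_rule - sup_x)

def lap(sup_rule, sup_x, n_classes):
    """LAP(X => c) = (Sup(X => c) + 1) / (Sup(X) + |C|)."""
    return (sup_rule + 1) / (sup_x + n_classes)

def chi2(sup_x, sup_c, n_t):
    """Simplified chi-square of the text; n_t is the training set size |T|,
    and the direction of the ratio depends on whether Sup(X) < Sup(c)."""
    if sup_x < sup_c:
        return (sup_x * (n_t - sup_c)) / (sup_c * (n_t - sup_x))
    return (sup_c * (n_t - sup_x)) / (sup_x * (n_t - sup_c))

def netconf(sup_rule, sup_x, sup_c):
    """Netconf(X => c) per Eq. 4; takes values in [-1, 1]."""
    return (sup_rule - sup_x * sup_c) / (sup_x * (1 - sup_x))
```

Note that netconf returns 0 exactly when Sup(X ⇒ c) = Sup(X)Sup(c), matching the statistical independence property listed above.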
2.2 Case satisfaction mechanisms
Once a CAR-based classifier has been built, we need to select a satisfaction mechanism for classifying unseen transactions. In [21, 22, 31], the authors summarize three case satisfaction mechanisms that have been employed for classifying “unseen” transactions in CAR-based classifiers.

1. Best Rule: This mechanism selects, according to an ordering imposed on the list of CARs [22], the first (the best) rule that satisfies the transaction to be classified, and assigns to this transaction the class associated with the best rule.

2. Best K Rules: For each pre-defined class, this mechanism selects the first (top) K rules satisfying the transaction to be classified and assigns the class for this transaction by applying an averaging process like those used in [31].

3. All Rules: This mechanism selects all rules satisfying the given transaction and assigns the class by applying an averaging process [21].

Classifiers following the “Best Rule” mechanism could suffer from biased classification or overfitting, since the classification is based on only one rule. On the other hand, the “All Rules” mechanism includes low-ranking rules for classification, which could affect the accuracy of the classifier. The “Best K Rules” mechanism could be affected when most of the best K rules were obtained by extending the same item, or when, across the classes, there is an imbalance among the numbers of CARs with high Confidence (or another quality measure) covering the new transaction.
3 Our proposal
In this section, we introduce four novel hybrid rule ordering strategies based on Netconf. Additionally, we describe the “Dynamic K” satisfaction mechanism, whose preliminary results were introduced in [17]. Finally, we propose combining the four novel hybrid rule ordering strategies with “Dynamic K” in order to obtain a more accurate classifier.
3.1 Novel hybrid rule ordering strategies
In this section, we introduce four novel hybrid rule ordering strategies. Following the results reported in [18], we propose combining the Netconf measure (NF) with the well-known Support-Confidence based rule ordering strategies (CSA, ACS and L3), which yields the following three hybrid rule ordering strategies:

• Hybrid CSA/NF
• Hybrid ACS/NF
• Hybrid L3/NF

Our hypothesis is that Netconf combined with CSA, ACS and L3 can outperform the best hybrid strategies reported in the literature [9, 29, 30, 31]. As a fourth hybrid rule ordering strategy, we propose combining Netconf and the CAR's size (Specific Rules strategy). As shown in [18, 19], the Specific Rules strategy (SR) considers the CAR's size in descending order, favoring specific (large) rules, since specific rules involve more items from the unseen transactions than general (short) rules [18]. Notice that all rule ordering strategies based on Support-Confidence are also based on the CAR's size, but this criterion is considered in ascending order (CSA and ACS), favoring general rules.
In the case of L3, the CAR's size is considered in descending order but it is applied after the Confidence and Support criteria (both applied in descending order); therefore, general rules are favored because of the Support downward closure [9]. Unlike the L3 strategy, the SR strategy considers the CAR's size as first (and unique) ordering criterion, thus favoring specific rules. This new hybrid rule ordering strategy is called Hybrid Specific Rules/Netconf (Hybrid SR/NF). Our hypothesis is that the new proposed hybrid rule ordering strategies can outperform the best hybrid strategies reported in the literature [9, 29, 30, 31]. The overall procedure of the above four hybrid rule ordering strategies is shown in Algorithm 1.

Algorithm 1: Hybrid X/NF
Input: A list of CARs L
Output: A re-ordered list of CARs LHybrid, ordered in a hybrid manner
 1   LNF ← ∅
 2   LtopK ← ∅
 3   LX ← ∅
 4   LHybrid ← ∅
 5
 6   forall r ∈ L do
 7       calculate the Netconf Γ of r
 8       add r jointly with its Γ value to LNF
 9   end
10   sort LNF in descending order according to Γ
11   LtopK ← select the top K CARs ∈ LNF
12   LX ← LNF − LtopK
13   sort LtopK in X manner
14   sort LX in X manner
15   LHybrid ← put LtopK at the front of LX
     return LHybrid
In lines 6–9, for all CARs belonging to the list L, their Netconf values are calculated and stored in the list LNF. In line 10, the list LNF is sorted in descending order according to the Netconf value. In line 11, the top K CARs of LNF are selected and stored in LtopK. Later, the remainder of the LNF list is stored in the LX
list (line 12). In lines 13–14, both lists, LtopK and LX, are sorted in X manner. Finally, in line 15, the list LtopK is placed at the front of the list LX.
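Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under our own assumptions: each CAR carries precomputed Netconf, Confidence, Support and size fields, and `csa_key` is our stand-in for one possible "X manner" ordering.

```python
def hybrid_x_nf(cars, netconf_of, x_key, k=5):
    """Hybrid X/NF (Algorithm 1 sketch): take the top-K CARs by Netconf,
    re-sort that block and the remainder in X manner, and concatenate,
    keeping the top-K block at the front."""
    l_nf = sorted(cars, key=netconf_of, reverse=True)   # lines 6-10
    l_topk, l_x = l_nf[:k], l_nf[k:]                    # lines 11-12
    l_topk.sort(key=x_key)                              # line 13
    l_x.sort(key=x_key)                                 # line 14
    return l_topk + l_x                                 # line 15

def csa_key(car):
    """Illustrative CSA manner: Confidence desc, Support desc, size asc."""
    return (-car["conf"], -car["sup"], car["size"])
```

Calling `hybrid_x_nf(cars, netconf_of=lambda r: r["nf"], x_key=csa_key, k=5)` would then produce a Hybrid CSA/NF order; swapping `x_key` switches the strategy.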
3.2 “Dynamic K” satisfaction mechanism
In order to overcome the drawbacks of the main satisfaction mechanisms (see section 2.2), we propose a novel satisfaction mechanism named “Dynamic K”, which works as follows. First, “Dynamic K” sorts the CARs in descending order according to their size and, in case of tie, the tied CARs are sorted in descending order according to their QM value. Later, “Dynamic K” selects, for each class c ∈ C, a set of rules X ⇒ c covering the new transaction t and satisfying the following conditions:

• X ⇒ c is a maximal rule.
• for all i ∈ I, with i lexicographically greater than all items of X, QM(X ∪ {i} ⇒ c) < QM(X ⇒ c).

Thereby, larger rules with high QM values are selected for classification, avoiding redundancy in the set of CARs and including more distinct items in the antecedents of the selected CARs. Finally, let Ni be the set of CARs selected for each class ci (i = 1 to |C|). For classifying a new transaction, “Dynamic K” assigns the class cj if the average of the quality measure values of the rules (satisfying the transaction to classify) in Nj is greater than the average of the quality measure values of the first (top) |Nj| rules of each Ni, with i ≠ j and |Ni| ≥ |Nj|. In case of tie among classes with different numbers of CARs, the class with fewer CARs is assigned; in case of tie among classes with equal numbers of CARs, the class with higher Support value is selected; if the tie persists, the class is selected randomly.
The “Dynamic K” mechanism does not have the drawbacks of the other satisfaction mechanisms since:

• It selects the maximal rules with high QM values, avoiding redundancies and allowing the inclusion of more distinct items in the antecedents of the selected CARs; thereby, low-quality CARs are not included for classifying.

• The result is not biased when there is an imbalance among classes in the number of CARs (covering the transaction to classify) with high quality measure values, since “Dynamic K” considers the average of the same number of CARs when classifying a new transaction.

• It considers all good-quality CARs that cover the new transaction and not only the best one; thereby, “Dynamic K” does not assume that the best rule will correctly classify all transactions that it covers.

Finally, in this paper, we investigate the use of the proposed rule ordering strategies combined with the “Dynamic K” satisfaction mechanism in order to improve the performance of CAR-based classifiers. In the next section, we show experiments that support this hypothesis.
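The class-assignment step of “Dynamic K” can be sketched as follows. This is a hedged, simplified sketch under our own assumptions: the per-class rule selection is assumed to be already done, `selected` maps each class to the QM values of its selected CARs sorted in descending order, and the Support-based and random tie-breaking described above is reduced to a fewest-CARs fallback.

```python
def dynamic_k_class(selected):
    """Return the class c_j whose selected CARs have the highest average QM,
    comparing c_j only against the top |N_j| rules of every other class
    c_i with |N_i| >= |N_j| (simplified sketch of the Dynamic K decision)."""
    def mean(values):
        return sum(values) / len(values)

    for cls_j, qms_j in selected.items():
        n_j = len(qms_j)
        rival_avgs = [
            mean(sorted(qms_i, reverse=True)[:n_j])   # top |N_j| rules of N_i
            for cls_i, qms_i in selected.items()
            if cls_i != cls_j and len(qms_i) >= n_j
        ]
        if all(mean(qms_j) > avg for avg in rival_avgs):
            return cls_j
    # simplified tie handling: favor the class with fewer selected CARs
    return min(selected, key=lambda c: len(selected[c]))
```

For example, with two selected rules for c1 averaging 0.85 and three for c2 whose top two average 0.825, c1 wins because the averages are compared over the same number of rules.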
4 Experimental results
In this section, we evaluate the novel hybrid rule ordering strategies introduced in this paper, comparing their classification accuracy against the best hybrid rule ordering strategies reported in the literature [9, 29, 30, 31]. We also evaluate the proposed hybrid rule ordering strategies combined with the “Dynamic K” satisfaction mechanism. For the comparison of the hybrid rule ordering strategies, we implemented the TFPC classifier [13] coupled with the “Dynamic K” satisfaction mechanism,
although any classifier coupled with the “Dynamic K” mechanism could be used. In the cases of the CSA/NF, ACS/NF, L3/NF and SR/NF strategies, we also need to compute the Netconf of the CARs. Our tests were run on a 1.86 GHz Intel(R) Core(TM)2 CPU with 1.00 GB of DDR2 RAM, running Windows 7. The experiments were conducted using the datasets reported in [29, 30, 31], all of them taken from the UCI Machine Learning Repository [3]; see their characteristics in Table 1. As in [29, 30, 31], the numerical attributes of these datasets were discretized using the LUCS-KDD discretized/normalized ARM and CARM Data Library, as suggested in [11]. It should be noticed that our experiments were done using ten-fold cross-validation, reporting the average over the ten folds. Additionally, we set the Netconf threshold to 0.5, as in other works [18, 19]. Finally, for Algorithm 1 (see Section 3.1), we set K equal to 5 (the same value used in other works [29, 30, 31]). In columns 2–4 of Table 2, we show the average accuracy obtained when applying the non-hybrid rule ordering strategies based on Support-Confidence [31]; additionally, in the last two columns, we show the average accuracy obtained when applying the NF and SR rule ordering strategies, respectively. From this set of results, it can be seen that NF is better than all other evaluated ordering strategies: as Table 2 shows, NF gets the maximum accuracy in 13 out of the 19 datasets. In terms of average accuracy, NF reaches 80.05 over the 19 datasets, whereas SR, L3, CSA and ACS get 79.81, 79.30, 79.19 and 65.96, respectively. As stated in section 3 of this paper, our hypothesis is that Netconf combined with CSA, ACS, L3 and SR can outperform the best hybrid strategies reported in the literature. Thus, in order to test this hypothesis, we performed an experiment comparing the results of our novel hybrid rule ordering strategies against the best rule ordering strategies reported in the literature. In the
last four columns of Table 3, we show the accuracies obtained by our proposed hybrid strategies CSA/NF, ACS/NF, L3/NF and SR/NF (see columns 5–8). In columns 2–4 of the same table, we show the results of the best rule ordering strategies reported in the literature (CSA/χ2 [29], ACS/LA [30] and MLRP [9]). The first interesting thing that we can observe from our experiments is that, as established in [29, 30, 31], a Hybrid A/B strategy obtains better classification accuracy than its “parents” A or B separately. For example, we can see in Table 3 that the classification accuracy obtained using Hybrid SR/NF (82.55) is greater than the accuracies obtained by its “parents” (see Table 2), SR (79.81) and NF (80.05). Analogously, the classification accuracies obtained using Hybrid CSA/NF, Hybrid ACS/NF and Hybrid L3/NF are greater than the accuracies obtained by their “parents”. From our results reported in Table 3, we can see that all the rule ordering strategies proposed in this paper, except ACS/NF, outperform the best rule ordering strategies reported in the literature; that is, our hypothesis holds for all the proposed strategies except ACS/NF. This occurs because the ACS strategy, unlike the CSA, L3 and SR strategies, favors short rules. As we also mentioned in section 3, another hypothesis is that combining the proposed hybrid rule ordering strategies with the “Dynamic K” satisfaction mechanism allows improving the performance of CAR-based classifiers. In order to test this hypothesis, we evaluated all hybrid rule ordering strategies applying the “Dynamic K” satisfaction mechanism. However, in order to show that these strategies give good results independently of the satisfaction mechanism used, we repeated the experiments of Table 3 applying the “Best Rule”, “All Rules” and “Best K Rules” satisfaction mechanisms; these results appear in Table 4. Something interesting that we can see from these results is that the proposed hybrid rule ordering strategies reach the first places in terms
of accuracy independently of the satisfaction mechanism used (see the last row in Table 4). Additionally, notice that “Dynamic K” obtains the best average accuracy independently of the rule ordering strategy used, the combination SR/NF + “Dynamic K” being the best one. As we can see in Table 1, the used datasets have between 2 and 19 classes. In order to show the accuracy on the different datasets, we grouped them according to the number of classes into three groups: (1) datasets with 2 classes; (2) datasets with 3, 4, 5 or 6 classes; and (3) datasets with more than 6 classes. In Table 5, we report the average accuracy of all strategies over these three groups of datasets. In this experiment, we can see that the Hybrid SR/NF strategy (and the other three proposed strategies) works well independently of the number of classes in the dataset. From our experiments, we can conclude that, in general, our hybrid rule ordering strategies outperform the best hybrid strategies reported in the literature, and that combining our strategies with the “Dynamic K” satisfaction mechanism yields the best performance. We can see in Table 3 that the Hybrid SR/NF strategy gets an average accuracy, over the 19 datasets, of 82.55, which represents an improvement of 1.12% in accuracy with respect to the second place (Hybrid L3/NF). Finally, in order to determine whether the results shown in Table 3 are statistically significant, we applied the Friedman test [14] and the Bergmann-Hommel dynamic post-hoc procedure [16]. Critical Difference diagrams (CD diagrams) [14] are used to show the post-hoc results because CD diagrams compactly present the rank of each algorithm according to the accuracy. CD diagrams also show the magnitude of rank differences between algorithms and the significance of those differences [14]. In a CD diagram, the rightmost algorithm is the best algorithm, while the algorithms sharing a thick line have statistically similar
behavior. Figure 1 shows that Hybrid SR/NF achieves the best results, having a statistically significant difference with almost all the other evaluated classifiers, except Hybrid L3/NF, which only achieves a statistically significant advantage over Hybrid ACS/LA.
5 Conclusions
In this paper, we have proposed four novel hybrid rule ordering strategies based on the Netconf measure. The first three strategies arise from combining Netconf with Support-Confidence based rule ordering strategies. The fourth, called Hybrid SR/NF, combines the Netconf ordering strategy (NF) with a rule ordering strategy based on the CAR's size (SR); unlike other existing strategies based on the CAR's size, Hybrid SR/NF sorts the CARs in descending order according to their size, favoring specific (large) rules. Additionally, we proposed a novel satisfaction mechanism named “Dynamic K”. Our experimental results show that the proposed hybrid strategies always reached better classification accuracies than their parent rule ordering strategies, as occurs with other hybrid strategies. From the experiments, we can conclude that our novel hybrid rule ordering strategies (except ACS/NF) perform better than the best hybrid rule ordering approaches reported in the literature; in particular, our Hybrid SR/NF strategy reached the best classification accuracy. Additionally, from the experiments we can conclude that the proposed hybrid rule ordering strategies reach the first places in terms of accuracy independently of the satisfaction mechanism used. Analogously, the “Dynamic K” satisfaction mechanism obtains the best results independently of the rule ordering strategy, the combination SR/NF + “Dynamic K” being the best one.
As future work, we will look for another way to combine ordering strategies, for example ordered list fusion, which would allow combining multiple ordering strategies to obtain a single one that would be better than any subset combination.
Acknowledgment This work was partly supported by the National Council of Science and Technology of Mexico (CONACyT) through the project grants CB2008-106443 and CB2008-106366.
References

[1] K. Ali, S. Manganaris, and R. Srikant. Partial classification using association rules. In Proceedings of the 3rd International Conference on Knowledge Discovery in Databases and Data Mining, pages 115–118, 1997.

[2] A. E. Amin. A novel classification model for cotton yarn quality based on trained neural network using genetic algorithm. Knowledge-Based Systems, 39(0):124–132, 2013.

[3] A. Asuncion and D. J. Newman. UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html, 2007.

[4] E. Baralis and P. Garza. A lazy approach to pruning classification rules. In Proceedings of the ICDM, pages 35–, 2002.

[5] F. Berzal, I. Blanco, D. Sánchez, and M. A. Vila. Measuring the accuracy and interest of association rules: A new framework. Intelligent Data Analysis, 6(3):221–235, 2002.
[6] F. Berzal, J. C. Cubero, D. Sánchez, and J. M. Serrano. ART: A Hybrid Classification Model. Machine Learning, 54(1):67–92, 2004.

[7] S. Buddeewong and W. Kreesuradej. A New Association Rule-Based Text Classifier Algorithm. pages 684–685, Washington, DC, USA, 2005. IEEE Computer Society.

[8] E. Cesario, F. Folino, A. Locane, G. Manco, and R. Ortale. Boosting text segmentation via progressive classification. Knowl. Inf. Syst., 15(3):285–320, 2008.

[9] C. Chun-Hao, C. Rui-Dong, L. Cho-Ming, and C. Chih-Yang. Improving the performance of association classifiers by rule prioritization. Knowledge-Based Systems, 36:59–67, 2012.

[10] P. Clark and R. Boswell. Rule Induction with CN2: Some Recent Improvements. In Proceedings of the European Working Session on Learning (EWSL'91), pages 151–163, 1991.

[11] F. Coenen. The LUCS-KDD discretised/normalised ARM and CARM Data Library. Department of Computer Science, The University of Liverpool, UK, 2003. http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN.

[12] F. Coenen and P. Leng. An Evaluation of Approaches to Classification Rule Selection. In Proceedings of the Fourth IEEE International Conference on Data Mining, pages 359–362, 2004.

[13] F. Coenen, P. Leng, and L. Zhang. Threshold Tuning for Improved Classification Association Rule Mining. Lecture Notes in Artificial Intelligence: Advances in Knowledge Discovery and Data Mining - PAKDD 2005, 3518:216–225, 2005.
[14] J. Demšar. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res., 7:1–30, 2006.

[15] G. Deshmeh and M. Rahmati. Distributed anomaly detection, using cooperative learners and association rule analysis. Intelligent Data Analysis, 12(4):339–357, 2008.

[16] S. García and F. Herrera. An extension on “Statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. Journal of Machine Learning Research, 9:2677–2694, 2008.

[17] R. Hernández. Dynamic K: A novel satisfaction mechanism for CAR-based classifiers. In Proceedings of XVIII CIARP, volume 8258 of the LNCS series, pages 141–148, 2013.

[18] R. Hernández, J. A. Carrasco, J. Fco. Martínez, and J. Hernández. CAR-NF: A classifier based on specific rules with high Netconf. Intelligent Data Analysis, 16(1):49–68, 2012.

[19] R. Hernández, J. Hernández, J. A. Carrasco, and J. Fco. Martínez. Algorithms for mining frequent itemsets in static and dynamic datasets. Intelligent Data Analysis, 14(3):419–435, 2010.

[20] M. Khashei, A. Z. Hamadani, and M. Bijari. A fuzzy intelligent approach to the classification problem in gene expression data analysis. Knowledge-Based Systems, 27(0):465–474, 2012.

[21] W. Li, J. Han, and J. Pei. CMAR: accurate and efficient classification based on multiple class-association rules. In Proceedings of the IEEE International Conference on Data Mining, ICDM 2001, pages 369–376, 2001.
[22] B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, pages 80–86, 1998.

[23] W. A. Malik and A. Unwin. Automated error detection using association rules. Intelligent Data Analysis, 15(5):749–761, 2011.

[24] S. H. Park, J. A. Reyes, D. R. Gilbert, J. W. Kim, and S. Kim. Prediction of protein-protein interaction types using association rule based classification. BMC Bioinformatics, 10(1), 2009.

[25] G. Piatetsky-Shapiro. Discovery, Analysis, and Presentation of Strong Rules, pages 229–238. AAAI/MIT Press, Cambridge, MA, 1991.

[26] M. Steinbach and V. Kumar. Generalizing the notion of confidence. Knowledge and Information Systems, 12(3):279–299, 2007.

[27] A. M. Teredesai, M. A. Ahmad, J. Kanodia, and R. S. Gaborski. CoMMA: A framework for integrated multimedia mining using multi-relational associations. Knowledge and Information Systems, 10(2):135–162, 2006.

[28] W. Wang, Y. J. Wang, R. Bañares-Alcántara, Z. Cui, and F. Coenen. Application of Classification Association Rule Mining for Mammalian Mesenchymal Stem Cell Differentiation. pages 51–61, Springer-Verlag, Berlin, Heidelberg, 2009.

[29] Y. J. Wang, Q. Xin, and F. Coenen. A Novel Rule Ordering Approach in Classification Association Rule Mining. In Proceedings of the 5th International Conference on Machine Learning and Data Mining in Pattern Recognition, MLDM 2007, pages 339–348, 2007.
[30] Y. J. Wang, Q. Xin, and F. Coenen. A Novel Rule Weighting Approach in Classification Association Rule Mining. In Proceedings of the IEEE International Conference on Data Mining Workshops, pages 271–276, 2007.

[31] Y. J. Wang, Q. Xin, and F. Coenen. Hybrid Rule Ordering in Classification Association Rule Mining. Transactions on MLDM, 1(1):1–15, 2008.

[32] X. Yin and J. Han. CPAR: Classification based on Predictive Association Rules. In Proceedings of the SIAM International Conference on Data Mining, 2003.
Fig. 1: CD diagram with a statistical comparison of the algorithms (CSA/χ2, ACS/LA, MLRP, CSA/NF, ACS/NF, L3/NF, SR/NF) according to Accuracy.
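A CD diagram such as Fig. 1 follows the procedure of Demšar [14]: each classifier is ranked on every dataset, the ranks are averaged, and the Nemenyi critical difference (CD) determines which average ranks differ significantly. A minimal sketch of that computation, using hypothetical accuracy values rather than the paper's data:

```python
import math

# Hypothetical accuracies: rows are datasets, columns are classifiers
# (the paper's real values appear in Table 3).
scores = [
    [80.0, 82.0, 84.0],
    [90.0, 89.0, 92.0],
    [70.0, 75.0, 73.0],
    [65.0, 64.0, 66.0],
]
k = len(scores[0])  # number of classifiers
n = len(scores)     # number of datasets

def ranks(row):
    """Rank classifiers on one dataset (rank 1 = best accuracy),
    assigning the average rank position to ties."""
    order = sorted(range(k), key=lambda j: -row[j])
    r = [0.0] * k
    i = 0
    while i < k:
        j = i
        while j + 1 < k and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the tied rank positions
        for t in range(i, j + 1):
            r[order[t]] = avg
        i = j + 1
    return r

# Average rank of each classifier across all datasets.
avg_rank = [sum(col) / n for col in zip(*map(ranks, scores))]

# Nemenyi critical difference (Demsar, 2006): two classifiers differ
# significantly if their average ranks differ by more than CD.
q_alpha = 2.343  # critical value q_0.05 for k = 3 classifiers
cd = q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))
print(avg_rank, round(cd, 3))
```

Classifiers whose average ranks lie within CD of each other are joined by a bar in the diagram, indicating no significant difference at the chosen significance level.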
Table 1: Dataset characteristics.

Dataset        # instances  # items  # classes
adult          48842        97       2
anneal         898          73       6
breast         699          20       2
connect4       67557        129      3
flare          1389         39       9
glass          214          48       7
heart          303          52       5
hepatitis      155          56       2
horseColic     368          85       2
ionosphere     351          157      2
iris           150          19       3
led7           3200         24       10
mushroom       8124         90       2
pageBlocks     5473         46       5
pima           768          38       2
nursery        12960        32       5
soybean-large  683          118      19
ticTacToe      958          29       2
wine           178          68       3
Table 2: Classification accuracy using the main non-hybrid rule ordering strategies based on Support-Confidence as well as using the NF and SR rule ordering strategies.

Dataset        CSA    ACS    L3     NF     SR
adult          80.80  74.70  80.80  81.68  81.43
anneal         88.29  75.58  88.51  92.21  92.21
breast         89.99  89.99  89.99  84.12  83.76
connect4       65.83  65.18  65.83  62.05  62.05
flare          84.30  84.30  84.52  85.87  86.21
glass          64.97  50.74  64.97  67.49  67.12
heart          51.42  39.76  51.42  54.45  54.45
hepatitis      81.83  48.50  82.03  84.16  84.16
horseColic     79.07  41.11  79.39  82.56  81.73
ionosphere     86.34  64.67  86.34  83.92  84.09
iris           95.33  95.33  95.33  95.76  95.21
led7           68.72  64.22  68.72  73.32  73.32
mushroom       99.04  64.92  99.04  99.40  98.36
pageBlocks     89.99  89.99  89.99  91.83  91.83
pima           74.37  73.85  75.06  76.79  76.01
nursery        77.75  55.08  77.75  78.12  77.96
soybean-large  88.01  86.10  88.01  89.35  88.73
ticTacToe      67.10  39.03  67.45  66.69  66.12
wine           71.51  50.28  71.51  71.23  71.72
Average        79.19  65.96  79.30  80.05  79.81
Table 3: Classification accuracy using hybrid rule ordering strategies.

Dataset        CSA/χ2  ACS/LA  MLRP   CSA/NF  ACS/NF  L3/NF  SR/NF
adult          81.12   84.89   83.49  83.78   85.30   83.81  84.35
anneal         90.88   81.70   94.84  94.48   89.46   94.61  94.74
breast         92.00   90.59   84.66  85.12   90.60   85.12  86.26
connect4       66.92   66.32   63.10  63.17   66.45   63.20  63.42
flare          85.56   85.44   87.24  87.21   86.81   87.31  87.48
glass          66.65   61.88   68.86  68.77   66.46   68.79  69.08
heart          52.08   51.60   55.03  55.46   54.25   55.48  57.45
hepatitis      81.51   77.82   85.48  85.39   84.95   85.66  87.44
horseColic     82.23   82.00   83.89  83.55   82.84   83.87  84.24
ionosphere     85.10   85.95   84.80  85.15   86.28   85.15  85.36
iris           96.28   96.28   96.77  96.82   96.28   96.86  97.88
led7           69.93   65.86   74.42  74.51   74.07   74.48  75.70
mushroom       98.51   98.81   98.95  99.39   98.51   99.39  99.51
pageBlocks     92.72   92.16   94.22  93.91   93.47   93.88  94.61
pima           75.65   75.52   78.20  77.86   76.89   78.25  78.67
nursery        79.50   67.81   80.48  80.23   79.39   80.17  81.10
soybean-large  89.26   78.69   90.96  91.01   90.06   90.97  91.59
ticTacToe      68.94   64.16   67.74  67.69   67.31   67.86  73.43
wine           75.49   73.28   72.18  72.41   72.92   72.41  76.11
Average        80.54   77.94   81.33  81.36   81.17   81.43  82.55
Table 4: Average classification accuracy throughout the 19 datasets using other Satisfaction Mechanisms (SM).

SM            CSA/χ2  ACS/LA  MLRP   CSA/NF  ACS/NF  L3/NF  SR/NF
Best Rule     79.46   76.86   80.26  80.29   80.12   80.37  81.48
All Rules     78.08   74.63   78.82  78.90   78.66   79.11  80.16
Best K Rules  79.63   76.98   80.38  80.41   80.25   80.49  81.58
Dynamic K     80.54   77.94   81.33  81.36   81.17   81.43  82.55
Ranking       6       7       4      3       5       2      1
Table 5: Average accuracy of all evaluated strategies over datasets grouped by number of classes.

# classes     CSA/χ2  ACS/LA  MLRP   CSA/NF  ACS/NF  L3/NF  SR/NF
2             82.25   81.58   82.53  82.61   83.20   82.75  84.03
3, 4, 5 or 6  78.00   74.47   78.38  78.37   77.77   78.39  79.63
more than 6   76.81   71.96   79.36  79.36   78.34   79.37  79.95