Comparing Rule Measures for Predictive Association Rules*

Paulo J. Azevedo¹ and Alípio M. Jorge²

¹ Departamento de Informática, Universidade do Minho, Portugal
[email protected]
² LIAAD, Faculdade de Economia, Universidade do Porto, Portugal
[email protected]
Abstract. In this paper we study the predictive ability of some association rule measures typically used to assess descriptive interest. Such measures, namely conviction, lift and χ², are compared with confidence, Laplace, mutual information, cosine, Jaccard and the φ-coefficient. As prediction models, we use sets of association rules generated as such. Classification is done by selecting the best rule or by weighted voting (according to each measure). We evaluated all measures on 17 datasets with different characteristics and conclude that conviction is, on average, the best predictive measure in this setting.
1 Introduction
Association rule mining is a technique primarily used for exploratory data mining. In such a setting it is useful to discover relations between sets of variables, which may represent products in an on-line store, disease symptoms, keywords, or demographic characteristics, to name a few. To guide the data analyst in identifying interesting rules, many objective interestingness measures have been proposed in the literature [17]. Although these measures have descriptive aims, we will evaluate their use in predictive tasks. One of these measures, conviction, will be shown to be particularly successful in classification.

Classification based on association rules has proved very competitive [12]. The general idea is to generate a set of association rules with a fixed consequent (involving the class attribute) and then use subsets of these rules to classify new examples. This approach has the advantage of searching a larger portion of the rule version space, since no search heuristics are employed, in contrast to decision tree and traditional classification rule induction. The extra search is done in a controlled manner, enabled by the good computational behavior of association rule discovery algorithms. Another advantage is that the rich rule set produced can be used in a variety of ways without relearning, which can improve classification accuracy [8].

In this work, we study the predictive power of many of the known interestingness measures. Although this is done in terms of association rule based classification, the results can be potentially useful in other classification settings.
* Supported by Fundação Ciência e Tecnologia, Project Site-o-matic, FEDER, and Programa de Financiamento Plurianual de Unidades de I&D.
We start by describing in detail the classification approach used, and define each of the measures, previously appraised for descriptive data mining tasks in [10, 17]. We perform a thorough experimental validation and study the results using Best Rule and Weighted Voting prediction as implemented in the CAREN system [1].

1.1 Classifying with Association Rules
The classification approach we describe in this paper consists in obtaining a classifier, or a discriminant model M, from a set of association rules. The rules are generated from a particular propositional dataset D, and involve categorical and numerical ⟨attribute = value⟩ pairs in the antecedent and a class value in the consequent. We want the model M to be successful in predicting the classes of unseen cases taken from the same distribution as D. A Bayesian view of the success of a classifier states that the optimal classifier M_Bayes maximizes the probability of predicting the correct class value for a given case x [7]. Previous work on classification from association rules has confirmed the predictive power of confidence [12]. In this paper we provide empirical indication that another measure, conviction, tends to obtain better results.
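To make the Best Rule strategy concrete, below is a minimal sketch in Python. The rule representation (dictionaries with illustrative ant and cls fields, and cases represented as sets of ⟨attribute = value⟩ items) is an assumption for illustration, not the actual CAREN data structures:

```python
def best_rule_predict(rules, case):
    # Rules are assumed pre-sorted from best to worst according to the
    # chosen measure (see Section 2.1); predict with the first rule
    # whose antecedent holds in the case.
    for rule in rules:
        if rule["ant"] <= case:   # all antecedent items present in the case
            return rule["cls"]
    return None                   # no rule fires; a default class could be used

# Hypothetical rule set, already ordered by confidence:
rules = [{"ant": {"a=1", "b=0"}, "cls": "yes"},
         {"ant": {"a=1"}, "cls": "no"}]
print(best_rule_predict(rules, {"a=1", "b=0", "c=2"}))  # -> yes
```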
2 The Measures
In this section we describe the measures used in this work. Let us first introduce some notation. Let r be a rule of the form A → C, where A and C are sets of items. In a classification setting, each item in A is a pair ⟨attribute = value⟩, and C has one single pair ⟨class attribute = class value⟩. We assume the rule was obtained from a dataset D of size N. In an association rule framework, Confidence is a standard measure, and is defined as:

conf(A → C) = sup(A ∪ C) / sup(A)    (1)
Confidence ranges from 0 to 1. Confidence is an estimate of Pr(C | A), the probability of observing C given A. After obtaining a rule set, one can immediately use confidence as a basis for classifying a new case x. Of all the rules that apply to x (i.e., the rules whose antecedent is true in x), we choose the one with the highest confidence. This loosely follows the optimal Bayes classifier. In the extreme case where there is exactly one rule covering each new case, the Best Rule-confidence classifier coincides with the optimal Bayes classifier. In general, the BRconf classifier approximates M_Bayes but ignores the combined effect of evidence. For rules with the same confidence, the rule with the highest support is preferred. The rationale is that its confidence estimate is more reliable.

Another measure sometimes used in classification is Laplace. It is a confidence estimator that takes support into account, becoming more pessimistic as the support of A decreases. It ranges within [0, 1[ and is defined as:

lapl(A → C) = (sup(A ∪ C) + 1) / (sup(A) + 2)    (2)
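As a concrete illustration, here is a small sketch of Eqs. (1) and (2); supports are assumed to be relative frequencies over a dataset of N cases, with the Laplace correction applied to the corresponding absolute counts:

```python
def confidence(sup_ac: float, sup_a: float) -> float:
    # Eq. (1): estimate of Pr(C | A).
    return sup_ac / sup_a

def laplace(sup_ac: float, sup_a: float, n: int) -> float:
    # Eq. (2) on absolute counts: grows more pessimistic as the
    # coverage of the antecedent A shrinks.
    return (sup_ac * n + 1) / (sup_a * n + 2)

# A rule whose antecedent covers 50 of 1000 cases, 40 of them with class C:
print(confidence(0.04, 0.05))     # ~0.8
print(laplace(0.04, 0.05, 1000))  # 41/52, ~0.788
```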
Confidence alone (or Laplace) may not be enough to assess the descriptive interest of a rule. Rules with high confidence may occur by chance. Such spurious rules can be detected by determining whether the antecedent and the consequent are statistically independent. This inspired a number of measures for association rule interest. One of them is Lift, defined as:

lift(A → C) = conf(A → C) / sup(C)    (3)
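Under the same conventions (relative supports), lift can be sketched as:

```python
def lift(sup_ac: float, sup_a: float, sup_c: float) -> float:
    # Eq. (3): confidence divided by the support of the consequent;
    # a value of 1 means A and C are statistically independent.
    return (sup_ac / sup_a) / sup_c
```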
Lift measures how far A and C are from independence. It ranges within [0, +∞[. Values close to 1 imply that A and C are independent and the rule is not interesting. Values far from 1 indicate that the evidence of A provides information about C. Lift measures co-occurrence only (not implication) and is symmetric with respect to antecedent and consequent. Lift is also used to characterize the classification rules derived by C4.5 [15].

Conviction is another measure, proposed in [4] to tackle some of the weaknesses of confidence and lift. Unlike lift, conviction is sensitive to rule direction (conv(A → C) ≠ conv(C → A)). Conviction is somewhat inspired by the logical definition of implication and attempts to measure the degree of implication of a rule. Conviction is infinite for logical implications (confidence 1), and is 1 if A and C are independent. Its values range from 0.5 through 1 to ∞. Like lift, conviction values far from 1 indicate interesting rules. It is defined as:

conv(A → C) = (1 − sup(C)) / (1 − conf(A → C))    (4)
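A sketch of Eq. (4), guarding the logical-implication case where the denominator vanishes:

```python
def conviction(sup_ac: float, sup_a: float, sup_c: float) -> float:
    # Eq. (4): infinite for logical implications (confidence 1),
    # 1 when A and C are independent.
    conf = sup_ac / sup_a
    return float("inf") if conf == 1.0 else (1.0 - sup_c) / (1.0 - conf)
```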
According to [4], conviction intuitively captures the notion of implication rules. Logically, A → C can be rewritten as ¬(A ∧ ¬C). One can then measure how far (A ∧ ¬C) deviates from independence and take care of the outside negation. To cope with this negation, the ratio between sup(A ∪ ¬C) and sup(A) × sup(¬C) is inverted. Unlike confidence, conviction takes the support of both antecedent and consequent into account.

Within the association rules framework, the Leverage measure was recovered by Webb for the Magnum Opus system [18]. It had been previously proposed by Piatetsky-Schapiro [14], and in [10] it is called novelty. The idea is to measure how much the observed co-occurrence of antecedent and consequent exceeds what would be expected under independence. It ranges within [−0.25, 0.25] and is defined as:

leve(A → C) = sup(A ∪ C) − sup(A) × sup(C)    (5)
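Eq. (5) translates directly:

```python
def leverage(sup_ac: float, sup_a: float, sup_c: float) -> float:
    # Eq. (5): observed joint support minus the joint support expected
    # if A and C were independent.
    return sup_ac - sup_a * sup_c
```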
The definitive way of measuring the statistical independence between antecedent and consequent is the χ² test. The test's statistic can be used as a rule measure:

χ²(A → C) = N × Σ_{X ∈ {A, ¬A}, Y ∈ {C, ¬C}} (sup(X ∪ Y) − sup(X) × sup(Y))² / (sup(X) × sup(Y))    (6)

where N is the database size.
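A sketch of Eq. (6), enumerating the four cells of the 2×2 contingency table of A and C explicitly:

```python
def chi_square(sup_ac: float, sup_a: float, sup_c: float, n: int) -> float:
    # Each cell: (joint support, marginal support of X, marginal support of Y).
    cells = [
        (sup_ac,                     sup_a,     sup_c),      # (A, C)
        (sup_a - sup_ac,             sup_a,     1 - sup_c),  # (A, not C)
        (sup_c - sup_ac,             1 - sup_a, sup_c),      # (not A, C)
        (1 - sup_a - sup_c + sup_ac, 1 - sup_a, 1 - sup_c),  # (not A, not C)
    ]
    # Eq. (6): N times the sum of (observed - expected)^2 / expected.
    return n * sum((j - x * y) ** 2 / (x * y) for j, x, y in cells)
```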
As stated in [3], χ² does not assess the strength of the correlation between antecedent and consequent. It only assists in deciding about the independence of these items, which suggests that the measure is not suitable for ranking purposes. Our results will corroborate these claims.

The following measures, rather than indicating the absence of statistical independence between A and C, measure the degree of overlap between the cases covered by each of them. The Jaccard coefficient takes values in [0, 1] and assesses the similarity between antecedent and consequent as the fraction of cases covered by both with respect to the fraction of cases covered by either of them. High values indicate that A and C tend to cover the same cases.

jacc(A → C) = sup(A ∪ C) / (sup(A) + sup(C) − sup(A ∪ C))    (7)

Cosine is another way of measuring the proximity between antecedent and consequent when these are viewed as two binary vectors. The value of one indicates that the vectors coincide. The value of zero only happens when the antecedent and the consequent have no overlap. It ranges along [0, 1] and is defined as:
cos(A → C) = sup(A ∪ C) / √(sup(A) × sup(C))    (8)
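Both overlap measures are straightforward to compute from the same supports:

```python
from math import sqrt

def jaccard(sup_ac: float, sup_a: float, sup_c: float) -> float:
    # Eq. (7): fraction of cases covered by both A and C relative to
    # the cases covered by at least one of them.
    return sup_ac / (sup_a + sup_c - sup_ac)

def cosine(sup_ac: float, sup_a: float, sup_c: float) -> float:
    # Eq. (8): cosine between the binary cover vectors of A and C.
    return sup_ac / sqrt(sup_a * sup_c)
```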
The φ-coefficient can also be used to measure the association between A and C. It is analogous to the Pearson correlation coefficient for the discrete case [17].

φ(A → C) = leve(A → C) / √(sup(A) × sup(C) × (1 − sup(A)) × (1 − sup(C)))    (9)
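A sketch of Eq. (9), reusing leverage as the numerator:

```python
from math import sqrt

def phi(sup_ac: float, sup_a: float, sup_c: float) -> float:
    # Eq. (9): leverage normalised so that the result lies in [-1, 1].
    leve = sup_ac - sup_a * sup_c
    return leve / sqrt(sup_a * sup_c * (1 - sup_a) * (1 - sup_c))
```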
φ ranges within [−1, 1]. It is 1 when the antecedent and the consequent cover the same cases and −1 when they cover opposite cases. In [16] the relation between φ and the χ² statistic is shown, namely that φ² = χ²/N, where N is the database size.

The last measure considered is information-based. Many classification algorithms use similar measures to assess rule predictiveness. This is the case of CN2 [5], which uses entropy. In this paper we make use of Mutual Information, which measures the amount of reduction in uncertainty about the consequent when the antecedent is known [16]:

MI(A → C) = [ Σᵢ Σⱼ sup(Aᵢ ∪ Cⱼ) × log( sup(Aᵢ ∪ Cⱼ) / (sup(Aᵢ) × sup(Cⱼ)) ) ] / min( −Σᵢ sup(Aᵢ) × log(sup(Aᵢ)), −Σⱼ sup(Cⱼ) × log(sup(Cⱼ)) )    (10)
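A sketch of Eq. (10) over the same 2×2 table; empty cells are skipped, following the convention 0 × log 0 = 0:

```python
from math import log

def mutual_information(sup_ac: float, sup_a: float, sup_c: float) -> float:
    cells = [
        (sup_ac,                     sup_a,     sup_c),
        (sup_a - sup_ac,             sup_a,     1 - sup_c),
        (sup_c - sup_ac,             1 - sup_a, sup_c),
        (1 - sup_a - sup_c + sup_ac, 1 - sup_a, 1 - sup_c),
    ]
    # Numerator of Eq. (10): mutual information of A and C.
    num = sum(j * log(j / (x * y)) for j, x, y in cells if j > 0)
    # Denominator: the smaller of the two marginal entropies.
    den = min(-sup_a * log(sup_a) - (1 - sup_a) * log(1 - sup_a),
              -sup_c * log(sup_c) - (1 - sup_c) * log(1 - sup_c))
    return num / den
```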
where Aᵢ ∈ {A, ¬A} and Cⱼ ∈ {C, ¬C}. MI ranges over [0, 1].

Notice that lift, leverage, χ², Jaccard, cosine, φ and MI are symmetric measures, whereas confidence, Laplace and conviction are asymmetric. We will see that this makes all the difference in terms of predictive performance. Other measures could have been considered, but this study focuses mainly on the ones most commonly used in association rule mining.

2.1 Ordering Rules
The prediction given by the best rule is the best guess we can have with one single rule. When the best rule is not unique, ties can be broken by maximizing support [12]. A kind of best rule strategy, combined with a coverage rule generation method, provided encouraging empirical results when compared with state of the art classifiers on some datasets from UCI [13]. Our implementation of Best Rule prediction follows closely the rule ordering described in CMAR [11]. Thus, R1 preceding R2 (R1 ≺ R2) is defined as:

metric(R1) > metric(R2), or
metric(R1) = metric(R2) ∧ sup(R1) > sup(R2), or
metric(R1) = metric(R2) ∧ sup(R1) = sup(R2) ∧ ant(R1) < ant(R2)
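The ordering maps naturally onto a sort key; the rule representation (metric, sup and ant fields) is illustrative, and the final tie-break on antecedent size follows the CMAR convention of preferring shorter antecedents:

```python
def order_rules(rules):
    # Higher metric first, then higher support, then shorter antecedent.
    return sorted(rules, key=lambda r: (-r["metric"], -r["sup"], len(r["ant"])))
```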