Calibration and Power of Probability Estimation Trees on Unbalanced Data

David A. Cieslak and Nitesh V. Chawla
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556
{dcieslak,nchawla}@nd.edu
Abstract
Probabilistic decision trees have found a wide array of applications, including but not limited to finance, medicine, and risk management. These applications, particularly medicine, can impose exacting requirements on the calibration of the classifier and the quality of the resulting estimates. It can be important for the classifier not only to produce an efficient rank-ordering, but also robust probability estimates. Moreover, cost-sensitive learning can also require that the estimates produced be sound. In this paper, we comprehensively assess and evaluate the quality of probability estimates produced by decision trees learned on unbalanced datasets, particularly within an ensemble setting. We use negative cross entropy, Brier score, and 0/1 loss to evaluate the accuracy of point predictions and the quality of estimates. We also juxtapose the losses with the rank-ordering produced by the classifiers. We further provide the decomposition of the Brier score into calibration and refinement losses to comment on the calibration and power of the ensemble of PETs. Lastly, we study the effect of sampling methods for countering class imbalance on PETs.

1 Introduction
Many methods are available for assessing classifier effectiveness. However, selecting an appropriate metric is highly dependent on the classifier's task. In many applications, traditional classifiers are transformed into probability estimators. The quality of the produced estimates is critical, particularly in cases with a large class imbalance. Consider the classification of pixels in mammogram images as possibly cancerous [1]. A typical mammography dataset might contain 98% normal pixels and 2% abnormal pixels. A simple default strategy of guessing the majority class would give a predictive accuracy of 98%. Ideally, a fairly high rate of correct cancerous predictions is required, while allowing for a small to moderate error rate in the majority class. It is also more costly to predict a cancerous case as non-cancerous than otherwise; thus, the quality of the probabilistic estimates can be decisive for the practitioner.

The specific needs of the probabilistic estimates can even dictate the choice of the learning algorithm. For instance, if the treatment costs are based on a strict cost matrix, then the calibration of the estimates is indispensable, since the use of estimates in calculating expected cost ensures optimal results. In other cases, a rank-ordering (as measured by AUC) may be of importance in lieu of estimate precision. A well-calibrated classifier does not always produce a good rank-ordering, and vice versa. We hasten to add that it can certainly be the case that a classifier only produces a good rank-ordering if the estimates are well-calibrated, which can be particularly true for PETs, as we will also demonstrate in this work. Hence, a consensus must be established on the choice of evaluation measure for assessing the calibration and quality of estimates. We also posit that the issues of calibration and cost-sensitive classification are more prevalent in datasets with class imbalance, datasets where the class of interest is usually rare.
We assess and evaluate the calibration of probability estimation decision trees (PETs) [3], with a particular focus on unbalanced datasets, allowing us to evaluate scenarios where the tree-building can be intrinsically biased towards the majority class. A classifier is considered to be well-calibrated if the predicted probability approaches the empirical probability as the number of predictions goes to infinity [2]. Probability estimates derived from leaf frequencies may not be appropriate for ranking test examples [4]. They are often biased due to small leaf sizes and carry high variance (as in an ensemble). Small leaf sizes can be particularly prevalent for highly unbalanced datasets. Hence, the classes assigned at the leaves of the decision trees have to be appropriately converted into reliable probabilistic estimates. Typically, the estimates produced at the leaves are improved by applying smoothing methods, thus resulting in improved "quality" of the estimates and of the rank-ordering [3, 5, 6, 7]. Recent work has evaluated the relationship between the probabilistic predictions of different classifiers and the true posterior probabilities [8, 9, 10].
Contributions
The main goal of our paper is to empirically characterize the calibration and power of PETs on unbalanced datasets. In addition to using the leaf frequencies as probability estimates for decision trees, we use popular smoothing methods such as Laplace and m-estimate [3] to improve the quality of estimates. Additionally, we use ensemble methods such as bagging [11] and random subspaces [12] with PETs. Here we juxtapose the quality of estimates, as measured by the appropriate loss function, with the resulting rank-ordering and error, as a function of ensemble size. Utilizing a decomposition of the Brier score (squared error) [13] into calibration and refinement losses, we shed light on the behavior of the ensemble methods with respect to the quality of estimates. We show this decomposition as a function of ensemble size. Finally, we perform an analysis of the interaction between losses, rank-ordering, and several popular sampling methods. Another key contribution of our work is examining the effect of sampling methods for countering class imbalance on the quality of estimates. Sampling methods, such as oversampling and undersampling, are typically exploited to improve the performance of classifiers on unbalanced datasets. Our goal is to provide insight into the relationship between posterior probability estimation by PETs, the sampling method, and the rank-ordering (as measured by AUC).
2 Classifiers
We used probability estimation decision trees (PETs) as the base classifiers in our study. The ensembles of these classifiers were generated using bagging and random subspaces. In addition, we considered three popular sampling methods for countering class imbalance in the data distribution before learning PETs.
2.1 Probability Estimation Trees
A decision tree is essentially in disjunctive-conjunctive form, wherein each path is a conjunction of attribute-value tests and the tree itself is a disjunction of all these conjunctions. An instance arriving at the root node takes the branch it matches based on the attribute-value test and moves down the tree following that branch. This continues until a path is established to a leaf node, providing the classification of the instance. Decision tree learning aims to produce "pure" leaves, that is, leaves in which all the examples belong to one particular class. This growing procedure becomes a potential weakness when constructing probability estimates. The leaf estimates, which are a natural calculation from the frequencies at the leaves, can be systematically skewed towards 0 and 1, as the leaves are essentially dominated by one class. Typically, the probabilistic (frequency-based) estimate at a decision tree leaf j is
$P_j(y|x)_{Freq} = \frac{n_{j,y}}{n_j}$, where y is the true class, $n_{j,y}$ is the number of examples belonging to class y at leaf j, and $n_j$ is the total number of examples at leaf j. However, simply using the frequency derived from the correct counts of classes at a leaf might not give sound probabilistic estimates [3, 4]. A small leaf can potentially give optimistic estimates for classification purposes. For instance, consider two leaves with the following distributions in the form $(n_j, n_{j,y})$: (5, 5) and (50, 49). The smaller leaf gets $P(y|x) = 1$, compared to $P(y|x) = 0.98$ for the larger leaf; however, the former estimate is not very robust given the limited coverage of the leaf. So, smoothing methods such as Laplace and m-estimate are typically employed to mitigate this problem.
2.1.1 Smoothing Leaf Frequencies
One way of improving the probability estimates given by an unpruned decision tree is to smooth them to make them less extreme. One can smooth these estimated probabilities using the Laplace estimate [3], which can be written as $P_j(y|x)_{Laplace} = \frac{n_{j,y} + 1}{n_j + C}$, where C is the total number of classes in the dataset. The Laplace estimate introduces a prior probability of 1/C for each class. Again considering the two pathological cases of $(n_j, n_{j,y}) = (5, 5)$ and (50, 49), the Laplace estimates are 0.86 and 0.96, respectively, which are more reliable given the evidence. However, Laplace estimates might not be very appropriate for highly unbalanced datasets. Thus, the m-estimate [14] incorporates the prior of the positive class into the smoothing, shifting the estimates towards the minority class base rate b, as follows [4]: $P_j(y|x)_{m\text{-}est} = \frac{n_{j,y} + bm}{n_j + m}$, where m is the parameter controlling the shift towards b. Zadrozny and Elkan (2001) suggest using m, given b, such that bm = 10.
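To make the arithmetic concrete, here is a minimal sketch (our illustration in Python; the function names and the use of the Mammography base rate are assumptions, not from the paper) that computes the raw frequency, Laplace, and m-estimate values for the two pathological leaves discussed above:

```python
def freq_estimate(n_y, n):
    """Raw leaf frequency: P(y|x) = n_y / n."""
    return n_y / n

def laplace_estimate(n_y, n, num_classes=2):
    """Laplace smoothing: (n_y + 1) / (n + C), a prior of 1/C per class."""
    return (n_y + 1) / (n + num_classes)

def m_estimate(n_y, n, b, m):
    """m-estimate: (n_y + b*m) / (n + m), shifting towards the base rate b."""
    return (n_y + b * m) / (n + m)

# The two pathological leaves from the text, as (n, n_y) pairs.
b = 0.023       # e.g., the Mammography minority base rate from Table 1 (our choice)
m = 10 / b      # Zadrozny and Elkan's suggestion: choose m so that b*m = 10
for n, n_y in [(5, 5), (50, 49)]:
    print(n, n_y,
          round(freq_estimate(n_y, n), 3),      # 1.0 and 0.98
          round(laplace_estimate(n_y, n), 3),   # 0.857 and 0.962
          round(m_estimate(n_y, n, b, m), 3))   # pulled towards b
```

For the small (5, 5) leaf, the m-estimate with such a low base rate pulls the estimate far closer to b than the Laplace estimate does, which is the intended behaviour on highly unbalanced data.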
2.2 Ensemble Methods: Bagging and Random Subspaces
Smoothing estimates may not completely mitigate the effects of overfit and overgrown trees. We use ensemble methods to "smooth" out the probability estimates at the leaves. It is evident that the estimates at the leaves can be biased towards 0 or 1 and carry high variance; ensemble methods can effectively smooth out the variations in these values. Ensemble methods aggregate predictions (by voting or averaging) from classifiers learned on samples of the data. We consider bagging [11] and random subspaces [12] for generating ensembles. These methods typically exploit the instability of the classifiers [11, 15], since perturbing the training set can produce different classifiers using the same learning algorithm, where the difference is in the resulting
predictions on the testing set and the structure of the classifier. The ensembles generated by both bagging and random subspaces comprised 100 classifiers each. For random subspaces, we randomly selected 25% of the features to train each classifier. PETs, given the unstable nature of decision trees, should benefit from bagging and random subspaces. Each tree has a potentially different representation of the original dataset, thus resulting in a different function for P(y|x) at each leaf. The classifications assigned by the individual decision trees can therefore differ for a given test example. The final classification can be obtained either by taking the most popular class assigned to the test example or by averaging the probability estimates computed from the individual classifiers. Averaging the estimates leads to a reduction in the variance component of the error, thus reducing the overall error. We examine this reduction in error and the improved quality of estimates by examining the calibration and refinement losses. Zadrozny and Elkan [4] claim that bagging does not always improve the probability estimates for large unbalanced datasets, particularly as a function of bias and estimate quality. However, we show that even for large and unbalanced datasets, there is an improvement in the quality of the probabilistic estimates, especially in terms of rank-order.

[Figure 1: Loss Measure Trends. Panels: (a) Loss_NCE vs. p_i; (b) Loss_BS (Brier Score) vs. p_i; (c) Loss_C (Calibration Loss) vs. r_j − p_j; (d) Loss_R (Refinement Loss) vs. r_j.]
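As a rough sketch of the averaging described above (our illustration, not the authors' implementation; it assumes scikit-learn's DecisionTreeClassifier and numpy arrays), the function below grows unpruned trees on bootstrap samples and averages their leaf-frequency estimates of P(y = 1|x). Setting feature_frac to 0.25 loosely mimics the random-subspace setup.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_pet_probabilities(X_train, y_train, X_test,
                             n_trees=100, feature_frac=None, seed=0):
    """Average P(y=1|x) over an ensemble of unpruned probability estimation trees.

    feature_frac=None gives plain bagging; feature_frac=0.25 trains each tree
    on a random 25% subset of the features (random-subspace style).
    """
    rng = np.random.RandomState(seed)
    n, d = X_train.shape
    probs = np.zeros(len(X_test))
    for _ in range(n_trees):
        rows = rng.randint(0, n, size=n)          # bootstrap sample of the training set
        if feature_frac is None:
            cols = np.arange(d)
        else:
            cols = rng.choice(d, size=max(1, int(feature_frac * d)), replace=False)
        tree = DecisionTreeClassifier()           # fully grown (unpruned) by default
        tree.fit(X_train[rows][:, cols], y_train[rows])
        # predict_proba returns the leaf frequencies; columns follow tree.classes_
        # (assumes both classes appear in every bootstrap sample)
        pos = list(tree.classes_).index(1)
        probs += tree.predict_proba(X_test[:, cols])[:, pos]
    return probs / n_trees
```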
2.3 Sampling Methods: Oversampling, Undersampling and SMOTE
When classifying in the environment of unbalanced data, several sampling methods are effective in improving minority class accuracy and rank-order [16, 17, 18]. These methods emphasize learning a model that is generalized to favor a given class by emphasizing different class boundaries than would be learned on the original sample of data. However, we are unaware of work in the literature that directly investigates the effects of sampling on the probability estimates (aside from AUC) produced by decision trees. Because of the emphasis on class borders, our hypothesis is that sampling methods may increase loss, thereby worsening the estimates produced, but may still improve rank-order, a commonly observed by-product of sampling. Various re-sampling strategies have been used, such as random over-sampling with replacement, random under-sampling, over-sampling with synthetic generation of new samples based on the known information (SMOTE) [1], and combinations of the above techniques. Over-sampling by replication can lead to similar but more specific regions in the feature space as the decision region for the minority class. This can potentially lead to overfitting on the multiple copies of minority class examples. On the other hand, synthetic examples cause the classifier to create larger and less specific decision regions, rather than the smaller and more specific regions typically caused by over-sampling with replication. More general regions are then learned for the minority class samples, rather than those regions being subsumed by the majority class samples around them. The effect is that decision trees generalize better. Thus, we want to develop an effective relationship between the nature of balancing the dataset and the posterior probability estimation, to develop additional insight into the reasons for the improvement in rank-ordering.
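The sketch below is a deliberately simplified version of the synthetic-generation idea (it is not the reference SMOTE implementation of [1]; the function name and the default k are our assumptions): each synthetic example is an interpolation between a minority-class example and one of its k nearest minority-class neighbours.

```python
import numpy as np

def smote_like(minority_X, n_synthetic, k=5, seed=0):
    """Generate synthetic minority examples by interpolating towards
    randomly chosen minority-class nearest neighbours (simplified SMOTE)."""
    rng = np.random.RandomState(seed)
    n = len(minority_X)
    # pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(minority_X[:, None, :] - minority_X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                # exclude self-matches
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest minority neighbours
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randint(n)                         # pick a minority seed example
        j = neighbours[i, rng.randint(k)]          # one of its k neighbours
        gap = rng.rand()                           # random point along the segment
        synthetic.append(minority_X[i] + gap * (minority_X[j] - minority_X[i]))
    return np.array(synthetic)
```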
3 Loss Measures
As mentioned in the Introduction, we used the following three loss measures to evaluate the quality of the probability estimates. We will assume a two-class case ($y \in \{0, 1\}$), where 1 is the positive or minority class and 0 is the negative or majority class. Furthermore, in our discussion we will assume that $p_i$ is the probability of predicting class 1 on example i.

Table 1: Datasets, in increasing order of class imbalance.

Dataset        Number of Features   Number of Examples   Proportion of Minority (Positive) Class Examples
Phoneme        5                    5,400                0.29
Adult          14                   48,840               0.24
Segment        19                   2,310                0.143
Pendigits      16                   10,992               0.104
Page           10                   5,473                0.102
Forest Cover   54                   38,500               0.071
Oil            49                   937                  0.044
Mammography    6                    11,183               0.023

• Loss_NCE: The Negative Cross Entropy (Loss_NCE) measure is the average negative cross entropy of predicting the true labels. Thus, it can be considered the measure that must be minimized to obtain maximum-likelihood probability estimates. One word of caution with NCE is that it is undefined for log(0). Thus, the minimum loss is 0, but the loss can be infinite if $p_i = 0$ or $1 - p_i = 0$. To avoid infinite values, we applied a floor of $10^{-6}$ to the probability estimates: all $p_i = 0$ were replaced by $p_i = 10^{-6}$, and the corresponding $1 - p_i$ was adjusted as well. Loss_NCE essentially measures the proximity of the predicted values to the actual class values; that is, class 1 predictions should have probabilities close to 1. As seen in Figure 1, a much higher loss is incurred when a misclassification occurs with high confidence. Thus, Loss_NCE should be an effective predictor of AUC.

$$Loss_{NCE} = -\frac{1}{n}\left\{\sum_{i\,|\,y_i=1}\log(p_i) + \sum_{i\,|\,y_i=0}\log(1 - p_i)\right\}$$

• Loss_BS: The Brier score measure (Loss_BS) is the average quadratic loss incurred on each instance in the test set. It rewards predictions that are the best estimates of the true probabilities. It accounts not only for the probability assigned to the actual class, but also for the probability assigned to the other possible class. Thus, the more confidence we have in predicting the actual class, the lower the loss. The quadratic loss is averaged over all n test instances. In the subsequent equation, the squared term sums over all the possible probability values assigned to the test instance, which is two in our case. For instance, the worst case occurs when $p_i = 0$ and the true label $y_i = 1$, giving $(1-0)^2 + (0-1)^2 = 2$; the best case occurs when $p_i = 1$ and $y_i = 1$, giving $(1-1)^2 + (0-0)^2 = 0$. Loss_BS is much more moderate than Loss_NCE in that high- and low-confidence errors incur more similar losses; therefore, Loss_BS should be fairly highly correlated with error (Loss_{01L}).

$$Loss_{BS} = \frac{1}{n}\sum_{i=1}^{n}\Bigl[(y_i - p_i)^2 + \bigl((1 - y_i) - (1 - p_i)\bigr)^2\Bigr]$$
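A brief computational sketch of the two losses as defined above (our helper functions, assuming numpy arrays y of 0/1 labels and p of predicted class-1 probabilities), including the $10^{-6}$ floor used to keep Loss_NCE finite:

```python
import numpy as np

def loss_nce(y, p, floor=1e-6):
    """Negative cross entropy, with a floor on p to avoid log(0)."""
    p = np.clip(p, floor, 1.0 - floor)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def loss_bs(y, p):
    """Brier score, summing the squared error over both class probabilities."""
    return np.mean((y - p) ** 2 + ((1 - y) - (1 - p)) ** 2)
```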
The Brier score can be effectively decomposed into calibration (Loss_C) and refinement (Loss_R) losses as follows [13]:

$$Loss_C = \frac{1}{n}\sum_{j=1}^{k} n_j (r_j - p_j)^2$$

$$Loss_R = \frac{1}{n}\sum_{j=1}^{k} n_j\, r_j (1 - r_j)$$

where k is the number of unique probabilities $p_j$, $j = 1, \ldots, k$, generated by the classifier on the testing set; $n_j$ is the number of test examples assigned the probability $p_j$; and $r_j$ is the fraction of those examples belonging to class 1 ($r_j = n_{j,y=1}/n_j$). Note that a testing set will have $i = 1, \ldots, n$ instances, but the classifier will not necessarily generate the same number of unique probability estimates; thus $k \le n$.
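The decomposition can be computed directly by grouping test examples on their unique predicted probabilities; below is a minimal sketch (our code, with the same assumptions on y and p as in the loss functions above):

```python
import numpy as np

def calibration_refinement(y, p):
    """Decompose the Brier score into calibration and refinement losses
    by grouping test examples on the unique predicted probabilities p_j."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    n = len(y)
    loss_c = 0.0
    loss_r = 0.0
    for p_j in np.unique(p):
        mask = (p == p_j)
        n_j = mask.sum()          # number of examples assigned probability p_j
        r_j = y[mask].mean()      # fraction of those examples that are class 1
        loss_c += n_j * (r_j - p_j) ** 2
        loss_r += n_j * r_j * (1 - r_j)
    return loss_c / n, loss_r / n
```

A useful sanity check is that loss_c + loss_r equals $\frac{1}{n}\sum_i (y_i - p_i)^2$, the squared error computed on the class-1 probabilities alone.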