Learning in the Class Imbalance Problem When Costs are Unknown for Errors and Rejects

Xiaowan Zhang and Bao-Gang Hu
NLPR/LIAMA, Institute of Automation, Chinese Academy of Sciences
Beijing, P.R. China
Email: {xwzhang, hubg}@nlpr.ia.ac.cn
Abstract—In the context of the class imbalance problem, most existing approaches require knowledge of the costs in order to reach reasonable classification results. If the costs are unknown, some approaches cannot work properly. Moreover, to the best of our knowledge, none of the cost-sensitive approaches is able to process abstaining classifications when costs are unknown for errors and rejects. This challenge forms the motivation of the present work. Based on information theory, we propose a novel cost-free learning approach which targets the maximization of normalized mutual information between the target outputs and the decision outputs of classifiers. Using the approach, we can deal with classifications with or without rejection when no cost terms are given. While the degree of class imbalance changes, the proposed approach is able to balance the errors and rejects accordingly and automatically. Another advantage of the approach is its ability to derive optimal reject thresholds for abstaining classification and the "equivalent" costs for binary probabilistic classification. Numerical investigation is made on several benchmark data sets, and the classification results confirm the unique feature of the approach for overcoming the challenge.

Keywords—imbalanced learning; classification; reject option; cost-free learning; mutual information

I. INTRODUCTION

Imbalanced data sets occur in many real-world applications where the numbers of instances in some classes are greater than those in others [1], [2]. In such cases, standard classification algorithms may tend toward the large classes and neglect the significance of the small ones, and unreasonable results are thus obtained. The failure is due to their assumption that the class distributions are balanced or that the misclassification costs are equal. Therefore, the goal of maximizing the overall accuracy is not rational in this situation, because accuracy does not distinguish between error types, and the 0-1 loss function should be modified with different costs so that the goal becomes minimizing the risk [3]. Against this background, various methods have been developed within a category of machine learning called "cost-sensitive learning". We list a few of them, such as costs to test [4], to relabel training instances [5], to sample [6], to weight instances [7], and to find a threshold for binary classification [8], [9]. In this category, the term "cost" plays an indispensable role and is usually assumed to be known. The values of costs are commonly given by domain experts. Sometimes the cost is simply assigned as the inverse of the class ratio, or learned through other means before being applied to classification. Generally, when the costs are not given, the methods above cannot work properly.

Another category in the study of machine learning or data mining is called "abstaining classification" [10]. In this category, a reject option is introduced to reduce the chance of a potential misclassification. Significant benefits have been obtained from abstaining classifications, particularly in very critical applications [11], [12]. In the context of abstaining classifications, cost-sensitive learning approaches require cost terms associated with the rejects. However, one often fails to provide such information, and, up to now, there seem to be no proper guidelines for giving it. Although a reject option adds another degree of complexity to classification over the standard approaches, we consider that the abstaining strategy will become a common option for most learning machines in the future.

In class imbalance problems, "cost-sensitive learning" is an important research direction. Based on the definition in [13], we extend it below by including the situations of abstaining classifications.

Definition 1: Cost-sensitive learning (CSL) is a type of learning that takes the misclassification costs and/or reject costs into consideration. The goal of this type of learning is to minimize the total cost.

Hence, CSL generally requires modelers or users to specify cost terms for reaching the goal. However, this work addresses one open issue which is mostly overlooked: "How to conduct learning in the class imbalance problem when costs are unknown for errors and rejects?" In fact, this issue is not unusual in real-world applications. Therefore, we propose another category of learning below to distinguish the present work from the existing studies in CSL.

Definition 2: Cost-term-free (or cost-free for short) learning (CFL) is a type of learning that does not require the cost terms associated with the misclassifications and/or rejects as inputs. The goal of this type of learning is to get optimal classification results without using any cost information.
It is understandable that CFL may face a bigger challenge than CSL. The challenge is shown by the fact that most existing approaches may fail to present a reasonable solution to the open issue. This work attempts to provide an applicable learning approach for CFL. We extend Hu's [14] study on mutual-information classifiers in binary classifications. While Hu presented the theoretical formulas, no learning approaches or results were shown for real-world data sets. Hence, this work focuses on learning and presents its main contributions as follows.

∙ We propose a CFL approach for the class imbalance problem. Using normalized mutual information as the learning target, the approach conducts the learning from cost-insensitive classifiers. Therefore, we are able to adopt standard classifiers for simple and direct implementations. The main advantage of this approach is its unique applicability in classification scenarios where one has no knowledge of costs.

∙ We conduct numerical studies on binary-class and multi-class problems, comparing the CFL approach with other existing approaches when they are applicable. Specific investigation is made on abstaining classifications, and we obtain several results on the benchmark data sets which have not been reported before in the literature. The results confirm the advantages of the approach and show the promising perspective of cost-free learning on imbalanced data sets. We also discuss the disadvantages of the approach.

The remainder of this paper is organized as follows. In Section II, related works on imbalanced learning and abstaining classification are listed. Section III introduces the related formulas of mutual information. We present our approach and the optimization algorithm in Sections IV and V, respectively. Section VI analyses the relations between the optimal parameter and the costs. Section VII presents the experimental results. The last section gives the conclusion.
Table I
THE m-CLASS AUGMENTED CONFUSION MATRIX C

                                Y
  T      ω1        ω2        ...      ωm        ω(m+1)
  ω1     c11       c12       ...      c1m       c1(m+1)
  ω2     c21       c22       ...      c2m       c2(m+1)
  ...    ...       ...       ...      ...       ...
  ωm     cm1       cm2       ...      cmm       cm(m+1)
II. RELATED WORKS

ROC curves have been used to show the performance of binary classification under different cost settings [15]. This can be viewed as comparing different classifiers rather than finding an optimal operating point. Moreover, ROC-based methods cannot be applied directly to multi-class problems or abstaining classifications. Cross validation has been used to learn from different cost settings [16], but the final decisions were made by users. It is desirable to allow interactions with users, but relying entirely on users' intentions seems more or less subjective. In [17], the values of costs for different examples could be derived through an estimation method. There also exist some CFL approaches in imbalanced learning, such as sampling [18], [19], active learning [20], recognition-based methods [21], and so on.

In abstaining classifications, a number of strategies have been proposed for defining optimal reject rules. Chow [10] not only provided the error-reject trade-off but also gave the relation between the reject threshold and the risk. The optimal reject threshold can be found by minimizing the loss function [22], [23]. The possibility of designing loss functions for classifiers with a reject option has also been explored in a cost-sensitive setting [24]. However, none of the methods above is applicable when cost information is unknown. The method in [25] aims at maximizing accuracy while keeping the reject rate below a given value, and [26] chooses the thresholds for different classes to keep the error rate of each class below its targeted value. Neither method can work when the targeted reject rate and error rates are not given.

III. RELATED FORMULAS

The results of a classification algorithm are typically displayed by a confusion matrix. Table I illustrates an augmented confusion matrix C for m-class abstaining classification, obtained by adding one column for a rejected class ω(m+1) to a conventional confusion matrix [27]. The rows correspond to the true classificatory states, i.e. the states of the targets T, T = {ω1, ..., ωm}. The columns correspond to the states of the decision outputs Y, Y = {ω1, ..., ωm, ω(m+1)}. The entry c_ij represents the number of instances that actually belong to class ωi but are classified as class ωj.

Normalized mutual information (NMI) has been used as an evaluation criterion to measure the degree of dependence between the targets and the decision outputs. Based on Table I, the NMI of an abstaining classification is calculated as [27]
$$
NI(T,Y) \;=\; \frac{I(T,Y)}{H(T)} \;=\;
\frac{\displaystyle\sum_{i=1}^{m}\sum_{j=1}^{m+1} c_{ij}\,
      \log_{2}\!\left(\frac{c_{ij}\,\sum_{i=1}^{m}\sum_{j=1}^{m+1} c_{ij}}
      {\bigl(\sum_{j=1}^{m+1} c_{ij}\bigr)\bigl(\sum_{i=1}^{m} c_{ij}\bigr)}\right)}
     {\displaystyle -\sum_{i=1}^{m}\Bigl(\sum_{j=1}^{m+1} c_{ij}\Bigr)
      \log_{2}\!\left(\frac{\sum_{j=1}^{m+1} c_{ij}}{\sum_{i=1}^{m}\sum_{j=1}^{m+1} c_{ij}}\right)}
\tag{1}
$$
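As a concrete illustration of Equation 1 (our sketch, not code from the paper; Python with NumPy is assumed), the following function computes NMI directly from an augmented confusion matrix such as the one in Table I.

```python
import numpy as np

def normalized_mutual_information(C):
    """NMI of Eq. (1): C is an m x (m+1) augmented confusion matrix
    (rows = true classes T, columns = decisions Y including the reject class)."""
    C = np.asarray(C, dtype=float)
    N = C.sum()                      # total number of instances
    p_t = C.sum(axis=1) / N          # empirical P(T = omega_i)
    p_y = C.sum(axis=0) / N          # empirical P(Y = omega_j)
    p_ty = C / N                     # empirical joint distribution

    I = 0.0                          # mutual information I(T, Y)
    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            if p_ty[i, j] > 0:       # 0 * log 0 is taken as 0
                I += p_ty[i, j] * np.log2(p_ty[i, j] / (p_t[i] * p_y[j]))

    H_T = -np.sum(p_t[p_t > 0] * np.log2(p_t[p_t > 0]))  # entropy of the targets
    return I / H_T

# A toy 2-class abstaining example (columns: decided omega_1, omega_2, reject):
C = np.array([[80, 10, 10],
              [ 5, 20,  5]])
print(normalized_mutual_information(C))
```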
For conventional classification, Y = {ω1, ..., ωm}, and the formula for calculating NMI is the same as Equation 1 except that j is counted from 1 to m. MacKay [29] recommended mutual information because its result is a single rankable number, which is more informative than an error rate.
Hu et al. [27] observed through numerical examples that misclassifying an instance from a smaller class leads to a greater loss of NMI than misclassifying one from a larger class, and that incorrectly assigning instances into a smaller class leads to a smaller loss of NMI than assigning them into a larger class. He considered information-theoretic measures most promising in providing "objectivity" to classification evaluations, and analytically derived the specific properties of NMI regarding error types and reject types in evaluating classification results. Principe et al. [28] presented a schematic diagram of information-theoretic learning (ITL) and mentioned that maximizing mutual information as the target function makes the decision outputs correlate with the targets as much as possible. Recently, a study [14] confirmed that ITL opens a new perspective for classifier design.

IV. NMI BASED CLASSIFICATION

Consider an m-class problem with each class denoted by ωi (i = 1, ..., m), the rejected class denoted by ω(m+1), and the new instances characterised by a vector x of dimension d. We define the maximization of NMI between the targets T and the decision outputs Y as the target function, where the decision outputs are derived from f(g(x), α). Here g(x) represents the internal output of a base classifier (e.g., distance measures or posterior probabilities), and α is the vector parameter added to the base classifier. The optimal parameter α* is then found from

$$
\alpha^{*} \;=\; \arg\max_{\alpha}\; NI\bigl(T = t,\; Y = f(\mathbf{g}(\mathbf{x}), \alpha)\bigr).
\tag{2}
$$

This can be regarded as a general approach to making standard learning algorithms information-theoretic based.

A. Dealing with k-Nearest-Neighbor classification

The k-nearest-neighbor (kNN) classification is the most basic instance-based machine learning algorithm. For a new instance, its label is decided by the majority vote among the labels of its k nearest neighbors in the training set according to a suitable distance measure.

1) Conventional classification: We set an m-dimensional vector α = (α1, ..., α(m-1), αm)^T as the weight parameter, where αi ≥ 0, i = 1, ..., m, and αm = 1. Then we alter the distance measure for all instances of class ωi with αi:

$$
d'(\mathbf{x}, \mathbf{x}_{ij}) \;=\; \alpha_i \, d(\mathbf{x}, \mathbf{x}_{ij}),
\tag{3}
$$

where d(x, x_ij) represents the distance measure between the new instance x and the training data from class ωi, and x_ij is the jth instance of class ωi. We take d(x, x_ij) as g(x) here, and g(x) is the vector that contains all the values of the distance measures between the new instance and the training set. Then we identify the k nearest neighbors of instance x, and a majority vote mechanism y = f(g(x), α) is performed over the classes of these neighbors:

$$
y \;=\; \arg\max_{\omega} \sum_{j=1}^{k} I\bigl(h(\mathbf{x}_j) = \omega\bigr),
\tag{4}
$$

where y is the label of the decision output, h(x_j) is the true class of the jth nearest neighbor x_j, ω ∈ {ω1, ..., ωm} is a class label, and I(·) is an indicator function that returns the value 1 if its argument is true and 0 otherwise. Therefore, we can manipulate the decision boundary of kNN with the weight parameter α through Equation 3.
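The following sketch (ours; the class name WeightedKNN and its interface are our own choices, not the authors' code) illustrates the weighted-distance kNN rule of Equations 3 and 4.

```python
import numpy as np

class WeightedKNN:
    """kNN in which the distance to every training instance of class omega_i is
    scaled by alpha_i (Eq. 3) before the k neighbors are found and the majority
    vote (Eq. 4) is taken; by convention the weight of the last class is 1."""

    def __init__(self, k=11):
        self.k = k

    def fit(self, X, y):
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        self.classes_ = np.unique(self.y)
        self._cls_idx = np.searchsorted(self.classes_, self.y)  # class index of each training point
        return self

    def predict(self, X_new, alpha):
        alpha = np.asarray(alpha, dtype=float)       # one weight per class
        preds = []
        for x in np.asarray(X_new, dtype=float):
            d = np.linalg.norm(self.X - x, axis=1)   # d(x, x_ij), Euclidean here
            d_weighted = d * alpha[self._cls_idx]    # Eq. (3): d' = alpha_i * d
            nn = np.argsort(d_weighted)[:self.k]     # k nearest under the altered distance
            votes = [(self.y[nn] == c).sum() for c in self.classes_]
            preds.append(self.classes_[int(np.argmax(votes))])  # Eq. (4)
        return np.array(preds)
```

Setting every αi = 1 recovers the standard kNN decision, so the weights only tilt the boundary between classes.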
2) Abstaining classification: We follow Arlandis et al. [30] in using a confidence measure built on the distance measures to the k nearest neighbors alone. The confidence that a new instance belongs to class ωi is

$$
Conf_i(\mathbf{x}) \;=\; \frac{\sum_{j \in s_i} \frac{1}{d(\mathbf{x}, \mathbf{x}_j)}}{\sum_{j=1}^{k} \frac{1}{d(\mathbf{x}, \mathbf{x}_j)}},
\tag{5}
$$

where s_i is the set of sub-indices of the prototypes from class ωi among the k nearest neighbors x_1, ..., x_k. Note that a suitably small value ε should be used for d(x, x_j) when the distance is zero. An m-dimensional vector α = T_r = (T_r1, ..., T_rm)^T is set as the reject threshold parameter, with each T_ri in the range [0, 1], i = 1, ..., m. The decision output y = f(Conf(x), α) lies within m+1 classes, y ∈ {ω1, ..., ωm, ω(m+1)}, and f is the decision rule for abstaining classification:

$$
\text{Decide } \omega_i \ \text{ if } \ Conf_i(\mathbf{x}) \ge 1 - T_{ri}; \quad \text{otherwise, decide } \omega_{m+1}; \qquad
\text{subject to } 0 \le \sum_{i=1}^{m} T_{ri} < m - 1.
$$
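A minimal sketch (our illustration, not the authors' implementation) of the confidence measure of Equation 5 and the thresholded reject rule; the helper names, the constant eps, and the tie-breaking by highest confidence when several classes pass their thresholds are our assumptions.

```python
import numpy as np

REJECT = -1  # stand-in label for the reject class omega_{m+1}

def knn_confidences(x, X_train, y_train, classes, k=11, eps=1e-12):
    """Conf_i(x) of Eq. (5): inverse-distance mass of class i among the k nearest neighbors."""
    X_train, y_train = np.asarray(X_train, float), np.asarray(y_train)
    d = np.linalg.norm(X_train - np.asarray(x, float), axis=1)
    nn = np.argsort(d)[:k]
    inv = 1.0 / np.maximum(d[nn], eps)      # guard against zero distances
    return np.array([inv[y_train[nn] == c].sum() for c in classes]) / inv.sum()

def abstaining_decision(conf, T_r, classes):
    """Decide omega_i if Conf_i(x) >= 1 - T_ri; otherwise reject."""
    conf, T_r = np.asarray(conf), np.asarray(T_r)
    ok = conf >= 1.0 - T_r
    if not ok.any():
        return REJECT
    # keep the most confident class among those passing their thresholds
    return classes[int(np.argmax(np.where(ok, conf, -np.inf)))]
```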
B. Dealing with probabilistic classification

A probabilistic classification algorithm outputs the posterior probabilities of the class memberships. In this section, we assume that the posterior probabilities p(ω|x) = (p(ω1|x), ..., p(ωm|x))^T have already been derived from a probabilistic classification algorithm, and we make use of their values in the present study.

1) Conventional classification: Each posterior output p(ωi|x) can be weighted by a scalar αi, αi ≥ 0, in order to control the trade-offs between the different error types. We set α = (α1, ..., α(m-1), αm)^T as the weight parameter, with αm = 1. The class assignment rule y = f(p(ω|x), α) is based on the highest weighted posterior probability:

$$
y \;=\; \arg\max_{i = 1, \ldots, m} \; \alpha_i \, p(\omega_i \mid \mathbf{x}).
$$

Therefore we can manipulate the decision boundary with α.

2) Abstaining classification: In this part, the reject threshold parameter is the same as that of the kNN abstaining classification in the previous subsection, and the decision output y = f(p(ω|x), α) has the same form once p(ω|x) is substituted for Conf(x).
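For completeness, a hedged sketch (ours) of the weighted posterior rule and its abstaining counterpart, assuming the posterior vector p(ω|x) is supplied by some probabilistic classifier.

```python
import numpy as np

REJECT = -1  # stand-in label for the reject class

def weighted_posterior_decision(posteriors, alpha):
    """Conventional rule: argmax_i alpha_i * p(omega_i | x); alpha[-1] is kept at 1."""
    return int(np.argmax(np.asarray(alpha) * np.asarray(posteriors)))

def abstaining_posterior_decision(posteriors, T_r):
    """Abstaining rule: decide omega_i if p(omega_i | x) >= 1 - T_ri, else reject."""
    p, T_r = np.asarray(posteriors), np.asarray(T_r)
    ok = p >= 1.0 - T_r
    if not ok.any():
        return REJECT
    return int(np.argmax(np.where(ok, p, -np.inf)))
```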
V. OPTIMIZATION ALGORITHM

The present framework is built on the confusion matrix, from which we compute the value of NMI; however, NMI is not differentiable with respect to the parameter.
We therefore apply a general optimization algorithm, the "Powell algorithm", which is a direct method for nonlinear optimization that does not require derivatives [31]. It is also widely used in image registration to find the optimal registration parameter. The procedure is given in Algorithm 1. We use the bracketing method to find three starting points and use Brent's method to realize the one-dimensional optimization in Step 3 and Step 8. K iterations of the basic procedure lead to K(n + 1) one-dimensional optimizations. The disadvantage of this algorithm is that it may find only a local extremum. Hence, we randomly choose the starting points several times and then pick the best result. We set n = m − 1 in conventional classification, because we need to change the weights of m − 1 classes, and n = m in abstaining classification so that each class can have its own reject threshold. Note that binary conventional classification is the case of n = 1; we only need to run Step 1 to Step 5 and return the value of α_K^(2) as α*.

Algorithm 1 Learning algorithm
Input: g(x): the vector of internal results of a base classifier; α_1: an n-dimensional random vector in the range of α; d_1, ..., d_n: linearly independent vectors; ε ≥ 0; K := 1.
Output: α*
 1: α_K^(1) = α_K.
 2: for i = 1 to n do
 3:   η̄^(i) = arg max_{η∈ℝ} NI(T, f(g(x), α_K^(i) + η d_i));
 4:   α_K^(i+1) = α_K^(i) + η̄^(i) d_i.
 5: end for
 6: d_i := d_{i+1}, i = 1, 2, ..., n − 1;
 7: d_n := α_K^(n+1) − α_K;
 8: η*_K = arg max_{η∈ℝ} NI(T, f(g(x), α_K + η d_n));
 9: α_{K+1} = α_K + η*_K d_n;
10: if ||α_{K+1} − α_K||_2 ≤ ε then
11:   stop;
12: else
13:   K := K + 1, turn to Step 1;
14: end if
15: α* = α_{K+1}
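As one possible realization of this search (our sketch, not the paper's implementation, which uses bracketing and Brent's method for the line searches), the direction-set optimization can be delegated to SciPy's Powell solver with the random restarts described above. Here decide, the bounds, and the option values are assumptions, augmented_confusion_matrix is a small helper we define, and normalized_mutual_information refers to the NMI sketch given earlier; the objective is negated because the solver minimizes.

```python
import numpy as np
from scipy.optimize import minimize   # Powell with bounds needs a reasonably recent SciPy

def augmented_confusion_matrix(t, y, m, reject_label=-1):
    """c_ij counts with one extra column for the reject class."""
    C = np.zeros((m, m + 1))
    for ti, yi in zip(t, y):
        C[ti, m if yi == reject_label else yi] += 1
    return C

def fit_alpha(decide, X_val, t_val, m, n_params, n_restarts=5, bounds=(0.0, 1.0), seed=0):
    """Maximize NI(T, f(g(x), alpha)) over alpha with Powell's derivative-free method.
    decide(X, alpha) must return decision labels in {0..m-1} or the reject label (-1)."""
    rng = np.random.default_rng(seed)

    def neg_nmi(alpha):
        C = augmented_confusion_matrix(t_val, decide(X_val, alpha), m)
        return -normalized_mutual_information(C)

    best = None
    for _ in range(n_restarts):                     # random restarts against local optima
        alpha0 = rng.uniform(*bounds, size=n_params)
        res = minimize(neg_nmi, alpha0, method='Powell',
                       bounds=[bounds] * n_params, options={'xtol': 1e-3})
        if best is None or res.fun < best.fun:
            best = res
    return best.x
```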
VI. RELATIONS OF PARAMETERS IN BINARY PROBABILISTIC CLASSIFICATION

The previous section completes the essence of the present framework, with which we can deal with the class imbalance problem, without changing the distribution of the data set, while no cost terms are given. Here we focus our study on binary probabilistic classification and analyse the relations between the optimal vector parameter α* and the costs.

A. Normalized Cost Matrix

We have already used C to denote the confusion matrix, so we use λ to denote the common cost matrix, where λ_ij is the element of λ that represents the cost of classifying an instance of class ωi as class ωj. It has been shown that the relative values of the costs are what matter, rather than their absolute values [23]. To obtain the normalized cost matrix based on [23], we subtract λ_ii from the elements of the ith row and divide all elements by (λ21 − λ22). Thus the normalized conventional cost matrix λ_c' is

$$
\lambda_c' = \begin{bmatrix} \lambda_{11}' & \lambda_{12}' \\ \lambda_{21}' & \lambda_{22}' \end{bmatrix}
           = \begin{bmatrix} 0 & \lambda_{12}' \\ 1 & 0 \end{bmatrix},
\tag{6}
$$

where λ12' = (λ12 − λ11)/(λ21 − λ22). We can derive the normalized abstaining cost matrix λ_a' in the same way:

$$
\lambda_a' = \begin{bmatrix} \lambda_{11}' & \lambda_{12}' & \lambda_{13}' \\ \lambda_{21}' & \lambda_{22}' & \lambda_{23}' \end{bmatrix}
           = \begin{bmatrix} 0 & \dfrac{\lambda_{12}-\lambda_{11}}{\lambda_{21}-\lambda_{22}} & \dfrac{\lambda_{13}-\lambda_{11}}{\lambda_{21}-\lambda_{22}} \\[2ex] 1 & 0 & \dfrac{\lambda_{23}-\lambda_{22}}{\lambda_{21}-\lambda_{22}} \end{bmatrix}.
\tag{7}
$$

It is reasonable to assume that the absolute values of the correct-classification costs and the misclassification costs in the common cost matrix λ are not affected by introducing a reject option. In this case, it is noteworthy that the relative misclassification cost λ12' stays the same.
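A small worked sketch (ours) of the normalization in Equations 6 and 7: subtract each row's correct-classification cost and divide by (λ21 − λ22); the example cost values are hypothetical.

```python
import numpy as np

def normalize_cost_matrix(cost):
    """Normalize a binary cost matrix (2x2 conventional or 2x3 abstaining) as in
    Eqs. (6)-(7): subtract lambda_ii from row i, then divide by (lambda_21 - lambda_22)."""
    cost = np.asarray(cost, dtype=float)
    shifted = cost - np.array([[cost[0, 0]], [cost[1, 1]]])   # row-wise subtraction
    return shifted / (cost[1, 0] - cost[1, 1])

# Hypothetical example: unit misclassification costs and a 0.3 reject cost for each class.
print(normalize_cost_matrix([[0.0, 1.0, 0.3],
                             [1.0, 0.0, 0.3]]))
# [[0.  1.  0.3]
#  [1.  0.  0.3]]
```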
B. Optimal Weight and Misclassification Cost

Elkan [8] derived the relation between the decision threshold and the misclassification costs by minimizing the conditional risk, which he called the expected cost. Putting the normalized form of Equation 6 into his formula, we get the decision threshold p1* of class ω1 as

$$
p_1^{*} = \frac{1}{1 + \lambda_{12}'}.
\tag{8}
$$

We predict an instance to be of class ω1 if p(ω1|x) ≥ p1*. Note that λ12' is a free parameter whose value should be given and be reasonable; otherwise the value of p1* cannot be derived, or is not proper, under this rule.

In the present work, for binary conventional classification, the optimal weight vector is α* = (α1*, 1)^T. According to the maximum weighted posterior probability rule, the optimal prediction is class ω1 if and only if α1* · p(ω1|x) ≥ p(ω2|x), which is equivalent to α1* · p1 ≥ 1 − p1 given p1 = p(ω1|x). Hence, the threshold p1** of class ω1 for making the optimal decision is

$$
p_1^{**} = \frac{1}{1 + \alpha_1^{*}},
\tag{9}
$$

and we predict an instance to be of class ω1 if p(ω1|x) ≥ p1**.
Suppose that the minimum conditional risk rule shares the same decision boundary with the maximum weighted posterior probability rule; then their decision thresholds derived from Equations 8 and 9 should be equal, i.e.,

$$
\frac{1}{1 + \alpha_1^{*}} = \frac{1}{1 + \lambda_{12}'}, \qquad \lambda_{12}' = \alpha_1^{*}.
\tag{10}
$$

In this case, the value of α1* is assigned to λ12'. In other words, α1* has the same effect on classification as λ12' does. Hence, we treat the value of the optimal weight α1* as the "equivalent" misclassification cost in the given context.
C. Optimal Reject Thresholds and Costs

According to [24], [14], the general relations between the reject thresholds and the costs for binary abstaining classification can be presented in the form of explicit formulas. With the optimal reject threshold vector T_r* = (T_r1*, T_r2*)^T and the normalized form of Equation 7, we get the following formulas based on [14]:

$$
T_{r1}^{*} = \frac{\lambda_{13}'}{1 + \lambda_{13}' - \lambda_{23}'}, \qquad
T_{r2}^{*} = \frac{\lambda_{23}'}{\lambda_{12}' - \lambda_{13}' + \lambda_{23}'}.
\tag{11}
$$

Moreover, the value of λ12' derived from Equation 10 can be introduced into Equation 11 as prior knowledge. The "equivalent" reject costs can then be calculated as

$$
\lambda_{13}' = \frac{T_{r1}^{*}(1 - T_{r2}^{*}) - T_{r1}^{*} T_{r2}^{*} \lambda_{12}'}{1 - T_{r1}^{*} - T_{r2}^{*}}, \qquad
\lambda_{23}' = \frac{-T_{r1}^{*} T_{r2}^{*} + (1 - T_{r1}^{*}) T_{r2}^{*} \lambda_{12}'}{1 - T_{r1}^{*} - T_{r2}^{*}}.
\tag{12}
$$
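To illustrate Equations 10-12 (our sketch, not the authors' code), the "equivalent" costs can be read directly from the learned weight and reject thresholds; the example values are the Diabetes, Bayes-based entries of Table IV.

```python
def equivalent_costs(alpha1_star, t_r1, t_r2):
    """Eq. (10): lambda_12' = alpha_1*; Eq. (12): reject costs from the optimal thresholds."""
    lam12 = alpha1_star
    denom = 1.0 - t_r1 - t_r2
    lam13 = (t_r1 * (1.0 - t_r2) - t_r1 * t_r2 * lam12) / denom
    lam23 = (-t_r1 * t_r2 + (1.0 - t_r1) * t_r2 * lam12) / denom
    return lam12, lam13, lam23

# Diabetes, Bayes-based values from Table IV: lambda_12' = 0.4342, T_r1* = 0.2618, T_r2* = 0.6571
print(equivalent_costs(0.4342, 0.2618, 0.6571))   # approximately (0.434, 0.186, 0.476)
```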
VII. EXPERIMENTS

A. Configuration

Table II lists six imbalanced data sets [32] with different class ratios. All these data sets have continuous attributes and are rescaled to the range [0, 1]. For Letter(A) and Heart Disease, we perform 3-fold cross validation; for the other data sets, we perform 10-fold cross validation. All experiments are repeated ten times to obtain average results. We call our NMI based conventional classification and NMI based abstaining classification "NMI con" and "NMI abs", respectively. We adopt kNN and the Bayes classifier as the base classifiers, and we compare our methods with SMOTE, cost-sensitive learning and Chow's reject [10] methods, besides the base classification algorithms. For kNN, we apply the Euclidean distance and, for brevity, list only the results of 11-NN for the binary class data sets and 5-NN for Heart Disease. For the Bayes classifier, we derive the estimated class-conditional density from Parzen-window estimation with a Gaussian kernel [33] and use the Bayes rule to classify an instance into the class with the highest posterior probability. The smoothing parameter is chosen as the average distance from one instance to its rth nearest neighbor over all data (r = 10 empirically), and the empirical probability of the occurrence of class ωi is chosen as the estimated prior probability. For SMOTE, the average results are presented with the amount from 1 to 5, and it is performed for the four minority classes in Heart Disease simultaneously. For cost-sensitive learning with the minimum conditional risk rule, we simply assign the inverse of the class ratio to the misclassification cost λij for i ≠ j, and λii = 0; we do not introduce reject costs because they cannot be derived easily, and we do not apply this method to kNN because kNN cannot be altered with costs directly. For Chow's reject method, we simply assign 0.3 to the reject threshold for all classes, and we apply it to kNN with the confidence values.

Table II
DESCRIPTION OF THE EXPERIMENTAL DATA SETS

Data set        # of examples   # of classes   Class ratio
Diabetes        768             2              500/268
German Credit   1000            2              700/300
Breast Cancer   569             2              357/212
Ionosphere      351             2              225/126
Letter(A)       20000           2              19211/789
Heart Disease   294             5              188/37/26/28/15

B. Evaluation Criterion

In order to show the change of the minority class clearly, err_i and rej_i are defined as the error rate and the reject rate within the ith class, respectively. The total error rate and the total reject rate are denoted as "err" and "rej", respectively. "acc" is short for the overall accuracy, and "G-m" is short for the geometric mean, G-mean = (∏_{i=1}^{m} accuracy_i)^{1/m}, where accuracy_i represents the accuracy within the ith class.
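A brief sketch (ours) of these evaluation criteria computed from an augmented confusion matrix; treating rejected instances as not correct in the per-class accuracies and the overall accuracy is our assumption.

```python
import numpy as np

def evaluation_metrics(C):
    """Per-class and total error/reject rates, overall accuracy, and G-mean from an
    m x (m+1) augmented confusion matrix (last column = rejects)."""
    C = np.asarray(C, dtype=float)
    m = C.shape[0]
    N = C.sum()
    n_i = C.sum(axis=1)                                  # instances per true class
    correct = np.diag(C[:, :m])
    err_i = (C[:, :m].sum(axis=1) - correct) / n_i       # error rate within class i
    rej_i = C[:, m] / n_i                                # reject rate within class i
    acc_i = correct / n_i                                # within-class accuracy
    return {
        "err_i": err_i, "rej_i": rej_i,
        "err": (C[:, :m].sum() - correct.sum()) / N,     # total error rate
        "rej": C[:, m].sum() / N,                        # total reject rate
        "acc": correct.sum() / N,                        # overall accuracy
        "G-mean": float(np.prod(acc_i) ** (1.0 / m)),    # (prod_i accuracy_i)^(1/m)
    }
```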
C. Results

We find that the two base classifiers have high overall accuracies and low error rates on the majority class, but high error rates on the minority class. Due to the improper value of the misclassification cost we assign, cost-sensitive learning does not perform well on Ionosphere for the minority class; the costs are so important and indispensable to cost-sensitive learning that their values should be known before applying such a method. SMOTE is a good method for biasing the performance towards the minority class, and its G-mean and NMI are improved except for Breast Cancer. Note that SMOTE on the Bayes classifier reduces the error rate of the minority class only slightly for Letter(A), as a result of the high imbalance and the small over-sampling amounts. On the whole, SMOTE is not the best, and it cannot solve problems with rejections. When a reject option is added to the classification, the error rate can be reduced and the accuracy increased, but it is difficult to judge which reject thresholds are reasonable.
Table III
EXPERIMENTAL RESULTS FOR BINARY CLASS DATA SETS. "—": NOT AVAILABLE. THE BEST PERFORMANCE IN EACH CELL IS BOLDED.

Data set        Method             err1(%)  err2(%)  err(%)  acc(%)  rej1(%)  rej2(%)  rej(%)  G-m(%)  NMI
Diabetes        kNN (k=11)         15.44    46.33    26.23   73.77   —        —        —       67.16   0.1332
                SMOTE              40.79    17.46    32.67   67.33   —        —        —       68.52   0.1566
                NMI con            41.16    14.29    31.81   68.19   —        —        —       70.75   0.1645
                Chow's reject       4.20    23.21    10.82   83.18   27.16    51.33    35.60   69.55   0.1611
                NMI abs            14.48    10.38    13.10   79.67   37.34    40.07    38.35   78.75   0.1834
                Bayes classifier    2.34    81.17    29.81   70.19   —        —        —       42.16   0.0743
                SMOTE              53.23    23.86    42.99   57.01   —        —        —       45.46   0.0885
                Cost-sensitive     18.90    42.09    27.04   72.96   —        —        —       68.23   0.1320
                NMI con            27.96    25.22    27.04   72.96   —        —        —       73.06   0.1754
                Chow's reject       0.16    23.78     8.43   84.75   29.08    75.61    45.25    6.32   0.0936
                NMI abs            21.52    15.09    19.30   76.37   18.54    22.62    20.03   75.65   0.2051
German Credit   kNN (k=11)         12.49    65.33    28.34   71.66   —        —        —       54.57   0.0583
                SMOTE              48.32    22.58    40.60   59.40   —        —        —       60.56   0.0810
                NMI con            42.31    21.77    36.15   63.85   —        —        —       66.74   0.1028
                Chow's reject       1.87    29.20    10.07   82.93   31.13    64.37    41.10   36.91   0.0757
                NMI abs            17.80    14.93    16.94   73.00   39.01    40.03    39.32   70.70   0.1080
                Bayes classifier    0.00   100.00    30.00   70.00   —        —        —        0.00   0.0000
                SMOTE              61.89    32.50    53.08   46.92   —        —        —       12.09   0.0184
                Cost-sensitive     48.61    15.27    38.61   61.39   —        —        —       65.84   0.1085
                NMI con            26.17    35.67    29.02   70.98   —        —        —       68.12   0.1169
                Chow's reject       0.00    30.27     9.08   84.74   28.49    69.73    40.86    0.00   0.0606
                NMI abs            17.53    11.80    15.81   73.81   40.64    35.30    39.04   72.74   0.1561
Breast Cancer   kNN (k=11)          0.90     6.85     3.12   96.88   —        —        —       96.04   0.8174
                SMOTE               6.37     3.26     5.21   94.79   —        —        —       95.11   0.7452
                NMI con             1.98     4.34     2.86   97.14   —        —        —       96.80   0.8322
                Chow's reject       0.28     4.13     1.72   98.18    3.54     7.64     5.07   97.56   0.8355
                NMI abs             0.90     3.62     1.91   98.04    5.55     3.09     4.63   97.64   0.8444
                Bayes classifier    0.17    23.83     8.88   91.12   —        —        —       87.01   0.5985
                SMOTE              11.96     7.19    10.18   89.82   —        —        —       89.95   0.6017
                Cost-sensitive      0.76    14.14     5.57   94.25   —        —        —       92.26   0.6999
                NMI con             4.08     8.15     5.59   94.41   —        —        —       93.77   0.7106
                Chow's reject       0.00     7.34     2.74   96.70    3.90    40.78    17.64   93.55   0.6007
                NMI abs             0.80     3.41     1.78   98.03    9.37    10.98     9.98   97.61   0.8046
Ionosphere      kNN (k=11)          2.17    39.67    15.59   84.41   —        —        —       76.45   0.3832
                SMOTE               4.78    18.16     9.55   90.45   —        —        —       87.91   0.5614
                NMI con             6.78     6.50     6.66   93.34   —        —        —       93.25   0.6966
                Chow's reject       1.47    32.28    12.48   86.34    0.83    24.03     9.11   74.67   0.3429
                NMI abs             3.26    19.78     9.22   90.49    2.69     9.17     4.98   86.64   0.5308
                Bayes classifier    0.00    97.00    34.69   65.31   —        —        —       10.21   0.0174
                SMOTE              45.11    26.54    38.46   61.54   —        —        —       47.40   0.1439
                Cost-sensitive      0.14    72.19    25.91   74.09   —        —        —       50.80   0.1747
                NMI con             5.17    28.86    13.65   86.35   —        —        —       81.77   0.4298
                Chow's reject       0.00    41.72    14.93   81.03    2.60    58.28    22.51    0.00   0.0880
                NMI abs             4.06    18.97     9.37   87.54   30.08    17.39    25.55   83.16   0.4560
Letter(A)       kNN (k=11)          0.03     5.20     0.24   99.76   —        —        —       97.35   0.9018
                SMOTE               0.19     1.37     0.23   99.77   —        —        —       99.22   0.9329
                NMI con             0.16     2.32     0.25   99.75   —        —        —       98.75   0.9189
                Chow's reject       0.01     0.94     0.04   99.96    0.14     6.87     0.41   99.49   0.9256
                NMI abs             0.05     0.28     0.06   99.94    0.48     2.99     0.58   99.83   0.9549
                Bayes classifier    0.00   100.00     3.94   96.06   —        —        —        0.00   0.0000
                SMOTE               0.02    84.58     3.36   96.64   —        —        —       24.83   0.1242
                Cost-sensitive     16.81     9.67    16.53   83.47   —        —        —       86.68   0.2945
                NMI con             0.77    14.89     1.33   98.67   —        —        —       91.89   0.6686
                Chow's reject       0.00   100.00     3.94   96.06    0.00     0.00     0.00    0.00   0.0000
                NMI abs             0.24    13.75     0.77   99.21    1.87    12.44     2.29   91.65   0.7015
Table IV
EXPERIMENTAL RESULTS: THE OPTIMAL WEIGHT, THE OPTIMAL REJECT THRESHOLDS AND THE "EQUIVALENT" COSTS FOR THE BAYES BASED CLASSIFIER IN THE GIVEN CONTEXT. OPTIMAL VALUES ARE IN THE FORM OF MEAN (STANDARD DEVIATION).

            ----------------- kNN based -----------------   ------------------------- Bayes classifier based -------------------------
Data set    α1*              T_r1*            T_r2*          λ12'(=α1*)       T_r1*            T_r2*            λ13'     λ23'
Diabetes    1.2234(0.0291)   0.1624(0.0582)   0.4726(0.1216)  0.4342(0.0452)   0.2618(0.0331)   0.6571(0.0612)   0.1859   0.4758
German      1.1490(0.0300)   0.1832(0.0912)   0.5469(0.1118)  0.4335(0.0120)   0.2780(0.0120)   0.6894(0.0074)   0.1002   0.7399
Breast      1.0517(0.0200)   0.2838(0.1583)   0.4936(0.0754)  0.4524(0.0869)   0.2413(0.0266)   0.6009(0.0306)   0.1946   0.3882
Iono        1.5081(0.0453)   0.1944(0.1748)   0.6801(0.2229)  0.3518(0.0215)   0.1962(0.0478)   0.7178(0.0288)   0.0677   0.7226
Letter(A)   1.1078(0.0336)   0.1338(0.0607)   0.5640(0.0823)  0.1161(0.0066)   0.0956(0.0156)   0.8567(0.0108)   0.0879   0.1688
Table V
EXPERIMENTAL RESULTS FOR THE HEART DISEASE DATA SET. "—": NOT AVAILABLE. THE BEST PERFORMANCE IN EACH CELL IS BOLDED.

Method             err1(%)  err3(%)  err5(%)  err(%)  acc(%)  rej1(%)  rej3(%)  rej5(%)  rej(%)  G-m(%)  NMI
kNN (k=5)           5.47    91.50    78.00    33.93   66.07   —        —        —        —       10.54   0.2138
SMOTE              23.13    82.20    78.80    42.83   57.17   —        —        —        —       13.82   0.2195
NMI con            25.64    89.72    49.63    44.18   55.82   —        —        —        —       12.78   0.2389
Chow's reject       0.58    24.00    17.33     8.91   84.69   23.56    76.00    80.00    42.40    0.00   0.0910
NMI abs            24.70    85.58    38.67    40.00   56.42    8.42     6.25     6.67     8.40    8.96   0.2720
Bayes classifier    2.71    99.17    99.33    32.77   67.23   —        —        —        —        0.00   0.1983
SMOTE               8.25   100.00   100.00    34.82   65.18   —        —        —        —        0.00   0.1614
Cost-sensitive    100.00   100.00     0.00    94.89    5.11   —        —        —        —        0.00   0.0000
NMI con            27.52    71.50   100.00    45.25   54.75   —        —        —        —        0.00   0.2352
Chow's reject       0.00    15.33     1.33     5.08   90.71   25.93    84.67    98.67    47.55    0.00   0.1063
NMI abs            29.77    79.50    58.00    44.01   51.83    8.88     9.00     3.33     8.71    8.18   0.2972
Table VI
EXPERIMENTAL RESULTS: THE OPTIMAL VALUES OF WEIGHTS AND REJECT THRESHOLDS FOR HEART DISEASE. VALUES ARE IN THE FORM OF MEAN (STANDARD DEVIATION), EXCEPT α5* WHICH IS SET TO 1.

                          α1*/T_r1*          α2*/T_r2*          α3*/T_r3*          α4*/T_r4*          α5*/T_r5*
kNN based          α*     2.1562(0.5671)     1.5238(0.3047)     1.3160(0.1971)     1.2615(0.1679)     1
                   T_r*   0.2112(0.1168)     0.7666(0.1199)     0.7678(0.1185)     0.6937(0.1162)     0.7915(0.0588)
Bayes classifier   α*     0.3540(0.0802)     1.9983(0.5210)     2.7306(0.7415)     2.3253(0.6827)     1
based              T_r*   0.2313(0.0854)     0.8672(0.0416)     0.8783(0.0388)     0.7806(0.0773)     0.8607(0.0309)

It is wasteful for Chow's reject method to reject many instances from the minority class under the improper reject threshold settings. From the G-mean and NMI results we can see that the performances of our NMI con and NMI abs are good and robust. They are able to reduce the error rate of the minority class, although at the expense of increasing the error rate of the majority class. Moreover, compared with our rough settings for Chow's reject method, NMI abs obtains reasonable reject thresholds, and its reject rate for the minority class seems more proper.

Table IV lists the optimal weight and the optimal reject thresholds derived from our approach. Moreover, for binary probabilistic classification, the optimal weight α1* is the value of the relative misclassification cost λ12'. The last two columns give the relative reject costs computed from the mean values of the relative misclassification cost and the optimal reject thresholds. We prefer to assume that the misclassification cost of the minority class is greater than that of the majority class, that the misclassification cost is greater than the reject cost within the same class, and that the reject cost of the minority class is greater than that of the majority class. We find that all our results support these assumptions and the analysis in [27]. Moreover, we can also compare the misclassification cost of the majority class with the reject cost of the minority class. Note that the "equivalent" costs depend on the data sets and the algorithms. They may provide users an "objective" reference for the errors and rejects when the exact values of the costs are unknown.

The detailed results on the Heart Disease data set are shown in Table V.
Some results for the 2nd class and the 4th class are omitted for brevity. We find that the two base classifiers have low error rates on the majority class but high error rates on the minority classes. Cost-sensitive learning performs worst: it classifies all instances into the 5th class, whose misclassification cost is the greatest, and both its G-mean and NMI are 0. Chow's reject method reduces the error rates considerably, but its reject rates are high. Although our methods may not be the best, they balance the errors and rejects well and achieve high NMI. The results in Table VI can also provide users an "objective" reference in classification. It can thus be seen that our approach is feasible when no cost information is available.

VIII. CONCLUSION

In this paper, we propose a CFL approach to deal with the class imbalance problem. Based on the specific property of mutual information that it can distinguish different error types and reject types, we set it as the learning target function between the targets and the decision outputs. Based on the approach, we can derive the reject thresholds for abstaining classification and the "equivalent" costs for binary probabilistic classification. The effectiveness of the approach is shown by the experimental results. Generally, the "equivalent" costs can be useful as an objective basis for CSL. The work, being preliminary at this stage, indicates a new direction for CFL. In future work, we face the challenge of improving the learning approach for fast and global optimization.

ACKNOWLEDGMENT
This work is supported in part by NSFC (No. 61075051) and the SPRP-CAS (No. XDA06030300).
REFERENCES

[1] N. Japkowicz and S. Stephen, "The Class Imbalance Problem: A Systematic Study," Intelligent Data Analysis, vol. 6, pp. 429-449, 2002.
[2] T. Raeder, G. Forman, and N. V. Chawla, "Learning from imbalanced data: Evaluation matters," in Data Mining: Foundations and Intelligent Paradigms, D. E. Holmes and L. C. Jain, Eds. New York: Springer, pp. 315-331, 2012.
[3] R. O. Duda, P. E. Hart, and D. Stork, Pattern Classification, 2nd ed. New York: John Wiley, 2001.
[4] X. Chai, L. Deng, Q. Yang, and C. X. Ling, "Test-cost sensitive naive Bayes classification," in ICDM, pp. 51-58, 2004.
[5] P. Domingos, "MetaCost: A General Method for Making Classifiers Cost-Sensitive," in KDD, pp. 155-164, 1999.
[6] B. Zadrozny, J. Langford, and N. Abe, "Cost-sensitive learning by cost-proportionate example weighting," in ICDM, pp. 435-442, 2003.
[7] K. M. Ting, "Inducing cost-sensitive trees via instance-weighting," Lecture Notes in Computer Science, vol. 1510, pp. 139-147, 1998.
[8] C. Elkan, "The Foundations of Cost-Sensitive Learning," in IJCAI, pp. 973-978, 2001.
[9] V. S. Sheng and C. X. Ling, "Thresholding for Making Classifiers Cost-sensitive," in AAAI, pp. 476-481, 2006.
[10] C. Chow, "On optimum recognition error and reject tradeoff," IEEE Transactions on Information Theory, vol. 16, pp. 41-46, 1970.
[11] T. Pietraszek, "Classification of intrusion detection alerts using abstaining classifiers," Intelligent Data Analysis, vol. 11, pp. 293-316, 2007.
[12] M.-R. Temanni, S. A. Nadeem, D. P. Berrar, and J.-D. Zucker, "Aggregating abstaining and delegating classifiers for improving classification performance: an application to lung cancer survival prediction," [Online]. Available: http://camda.bioinfo.cipf.es/camda07/agenda/detailed.html
[13] C. X. Ling and V. S. Sheng, "Cost-sensitive learning and the class imbalance problem," Encyclopedia of Machine Learning, 2008.
[14] B.-G. Hu, "What are the Differences between Bayesian Classifiers and Mutual-Information Classifiers?," arXiv:1105.0051v2 [cs.IT], 2012.
[15] M. A. Maloof, "Learning when data sets are imbalanced and when costs are unequal and unknown," in ICML Workshop on Learning from Imbalanced Data Sets II, 2003.
[16] Y. Zhang and Z.-H. Zhou, "Cost-Sensitive Face Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, pp. 1758-1769, 2010.
[17] B. Zadrozny and C. Elkan, "Learning and making decisions when costs and probabilities are both unknown," in KDD, pp. 204-213, 2001.
[18] N. V. Chawla, K. W. Bowyer, and W. P. Kegelmeyer, "SMOTE: Synthetic Minority Over-sampling Technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.
[19] H. Guo and H. L. Viktor, "Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach," SIGKDD Explorations Newsletter, vol. 6, pp. 30-39, 2004.
[20] S. Ertekin, J. Huang, L. Bottou, and L. Giles, "Learning on the border: active learning in imbalanced data classification," in CIKM, pp. 127-136, 2007.
[21] L. M. Manevitz and M. Yousef, "One-class SVMs for document classification," Journal of Machine Learning Research, vol. 2, pp. 139-154, 2002.
[22] T. C. Landgrebe, D. M. Tax, P. Paclík, and R. P. Duin, "The interaction between classification and reject performance for distance-based reject-option classifiers," Pattern Recognition Letters, vol. 27, pp. 908-917, 2006.
[23] C. C. Friedel, U. Rückert, and S. Kramer, "Cost Curves for Abstaining Classifiers," in ICML Workshop on ROC Analysis in Machine Learning, pp. 33-40, 2006.
[24] A. Guerrero-Curieses, J. Cid-Sueiro, R. Alaiz-Rodríguez, and A. R. Figueiras-Vidal, "Local Estimation of Posterior Class Probabilities to Minimize Classification Errors," IEEE Transactions on Neural Networks, vol. 15, pp. 309-317, 2004.
[25] G. Fumera, F. Roli, and G. Giacinto, "Reject option with multiple thresholds," Pattern Recognition, vol. 33, pp. 2099-2101, 2000.
[26] M. Li and I. K. Sethi, "Confidence-based classifier design," Pattern Recognition, vol. 39, pp. 1230-1240, 2006.
[27] B.-G. Hu, R. He, and X.-T. Yuan, "Information-Theoretic Measures for Objective Evaluation of Classifications," Acta Automatica Sinica, vol. 38, pp. 1170-1182, 2012.
[28] J. Principe, D. Xu, Q. Zhao, and J. Fisher, "Learning from examples with information-theoretic criteria," Journal of VLSI Signal Processing, vol. 26, pp. 61-77, 2000.
[29] D. MacKay, Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
[30] J. Arlandis, J. C. Perez-Cortes, and J. Cano, "Rejection strategies and confidence measures for a k-NN classifier in an OCR task," in ICPR, pp. 576-579, 2002.
[31] M. J. D. Powell, "An efficient method for finding the minimum of a function of several variables without calculating derivatives," Computer Journal, vol. 7, pp. 155-162, 1964.
[32] A. Frank and A. Asuncion, "UCI Machine Learning Repository," 2010. [Online]. Available: http://archive.ics.uci.edu/ml
[33] S. Tong, "Restricted Bayes Optimal Classifiers," in AAAI, pp. 658-664, 2000.