Pattern Recognition 42 (2009) 1572 -- 1581
Pattern Recognition journal homepage: www.elsevier.com/locate/pr
A closed-form reduction of multi-class cost-sensitive learning to weighted multi-class learning

Fen Xia a,∗, Yan-wu Yang a, Liang Zhou a, Fuxin Li a, Min Cai b, Daniel D. Zeng a,c

a The Key Lab of Complex Systems and Intelligence Science, Institute of Automation, Chinese Academy of Sciences, China
b Department of Computer Science, Beijing Jiaotong University, China
c Department of Management Information Systems, The University of Arizona, USA
ARTICLE INFO

Article history: Received 26 October 2007; received in revised form 16 December 2008; accepted 19 December 2008

Keywords: Cost-sensitive learning; Supervised learning; Statistical learning theory; Classification

ABSTRACT
In cost-sensitive learning, misclassification costs can vary for different classes. This paper investigates an approach that reduces multi-class cost-sensitive learning to a standard classification task, based on the data space expansion technique developed by Abe et al.; for binary classification tasks the approach coincides with Elkan's reduction. Using the proposed reduction, a cost-sensitive learning problem can be solved by considering a standard 0/1 loss classification problem on a new distribution determined by the cost matrix. We also propose a new weighting mechanism to solve the reduced standard classification problem, based on a theorem stating that the empirical loss on independently identically distributed samples from the new distribution is essentially the same as the loss on the expanded weighted training set. Experimental results on several synthetic and benchmark datasets show that our weighting approach is more effective than existing representative approaches for cost-sensitive learning. © 2008 Elsevier Ltd. All rights reserved.
1. Introduction

In standard classification, classifiers aim to minimize expected errors while weighing all misclassifications under the same cost structure. In many real-world applications such as fraud detection and medical diagnosis, however, the costs of different types of misclassification can be vastly different [1]. For example, when predicting whether a patient might have a heart attack, false negatives can lead to much bigger problems than false alarms. Learning problems involving such class-specific cost functions are often called cost-sensitive learning.

During the past few years, many researchers have investigated issues related to cost-sensitive learning. If one considers the level of abstraction concerning cost modeling, the existing work can be categorized into two classes: example-dependent costs [2–4,17,19,20] and class-dependent costs [5–11]. The former assumes that each example has its own cost if misclassified to a certain target class. The latter assumes that all examples belonging to a certain class share the same misclassification cost. In most real-world applications, it is much easier to specify class-dependent costs than a cost for each individual example. If one considers how the cost information is used in
∗ Corresponding author. Tel.: +86 10 82614560; fax: +86 10 62545229. E-mail address:
[email protected] (F. Xia). 0031-3203/$ - see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.patcog.2008.12.011
the learning process, the existing cost-sensitive learning approaches mainly fall into two categories. The first category uses Bayes risk theory to assign each example to the class with the lowest expected cost (e.g., the threshold-moving approach [10]). This requires classifiers to output class membership probabilities. The second category modifies the distribution of examples by sampling examples [9] or weighting examples [5,7,8,11] before applying a cost-insensitive learner. This second approach is more general since it does not require accurate probability estimates and can be combined with any existing learner. Empirically, it has attained similar or better performance when compared with the methods based on Bayes risk theory [4].

Although many computational approaches have been proposed recently for cost-sensitive learning, few have focused on its theoretical aspects. To the best of our knowledge, Elkan [7] might be the first to address this problem from a theoretical point of view. Under the assumption that class-conditional distributions are unchanged, Elkan presented a theorem that shows how class membership probabilities change according to class priors. His work clearly indicates the relationship between a cost-sensitive binary classification problem and a standard classification problem. This relationship provides a well-grounded explanation for binary cost-sensitive learning algorithms which modify the distribution of examples [5,7–9,11]. Unfortunately, Elkan's theory is limited to binary classification. In multi-class cases, misclassifications occur in more than one way, making the analysis of cost-sensitive learning problems more
F. Xia et al. / Pattern Recognition 42 (2009) 1572 -- 1581
complicated. A possible solution is to decompose a multi-class problem into a series of binary problems by pairwise coupling [18]. Multi-class cost-sensitive learning can then be conceptualized as the combination of results from applying Elkan's theory to each binary classification problem. Following this idea, Zhou and Liu [11] investigated a weighting mechanism for multi-class cost-sensitive learning, where class weights match Elkan's theory in each decomposed binary classification. Although this decomposition method provides a way to understand multi-class cost-sensitive problems, it relies heavily on the relationship between the multi-class problem and its derivative binary problems, which places an unnecessary burden on exploring the nature of cost-sensitive learning.

Taking a completely different approach, Abe et al. [4] proposed a data space expansion technique to take into account different costs associated with multiple ways of misclassifying examples. Each training example with K possible classes is mapped into K training examples. With a weight assigned to each of the new examples, there are enough degrees of freedom to capture all misclassification costs for the single original example. This clear and elegant technique was developed in the context of example-dependent cost-sensitive learning. However, it can be readily applied to the class-dependent case to tackle the multiple misclassification problem. Abe et al. [4] also showed a reduction from the cost-sensitive learning problem to the importance weighted classification problem. The reduction is derived from the fact that, for each original example, minimizing the importance weighted loss on its K expanded new examples also minimizes the cost on it.
Although the reduction works well for each original example, it does not take into account the different importance among original examples, as evidenced by the fact that their reduction is unchanged when adding arbitrary positive constants to the importance of different examples. In cost-sensitive multi-class learning, the different importance among original samples can be a crucial factor for the performance of learning algorithms, but how to deal with it remains unexplored.

To the best of our knowledge, the nature of cost-sensitive multi-class classification learning remains under-studied. One crucial and fundamental question is the essential relationship between cost-sensitive learning and standard classification in the multi-class context. In this paper, we investigate this question and propose a new reduction approach which reduces cost-sensitive learning to standard classification, based on the data space expansion technique proposed in [4]. A cost-sensitive learning problem can be viewed as equivalent to a standard classification problem on a new distribution determined by the cost matrix. In fact, many distributions produced from the cost matrix are possible candidates for the standard classification problem, but only very few of them really simplify the relationship between cost-sensitive learning and standard classification. In the case of binary problems, our proposed reduction approach coincides with Elkan's [7] on the basis of a carefully chosen distribution. When dealing with multi-class problems, a new distribution can be defined in such a way that class membership probabilities are linear combinations of the original ones while the feature distribution of examples is kept unchanged. Interestingly, this implies that by perturbing labels a cost-sensitive problem can be turned into a cost-insensitive one.
In this sense, this paper proposes a novel method to generate an expanded weighted training set to solve the reduced classification problem, which aims to deal with the subtle difficulty in obtaining the i.i.d. samples from the standard classification. This weighting mechanism is based on a theorem stating that the empirical risk on i.i.d. samples from the new distribution is essentially the same as that on the expanded weighted training set. To demonstrate the effectiveness of our reduction approach and the weighting mechanism, we have conducted experiments on several synthetic and benchmark datasets, with Classification
And Regression Tree (CART) as the learner. When our approach is compared with several existing cost-sensitive approaches, experimental results show that our weighting approach is more efficient and versatile than these benchmark methods.

The contributions of this paper are as follows.

• First, this paper presents a reduction approach from a multi-class cost-sensitive learning task to a standard classification task, and theoretically analyzes the error transformations between them. Compared with the reduction in [4], which also utilizes the data space expansion technique, our reduction differs in two aspects. First, in addition to considering the influence of cost on labels, our reduction further takes into account the influence of cost on features. Secondly, the cost-sensitive learning problem is reduced to a standard classification problem in our reduction approach, whereas the reduced form is an importance weighted classification problem in their approach. As such, our reduction approach provides a direct linkage between multi-class cost-sensitive learning and standard classification.
• Secondly, we present an approach to derive well-behaved reductions by carefully operating on the cost matrix. These well-behaved reductions can, to a large degree, simplify the relationship between cost-sensitive learning and standard classification, helping us gain an understanding of the nature of cost-sensitive multi-class classification.
• Thirdly, we propose a theory-driven weighting mechanism to generate an expanded weighted training set to solve the reduced classification problem, which overcomes the subtle difficulty in obtaining i.i.d. samples from the standard classification problem.
• Fourthly, we experimentally validate our reduction framework and the weighting mechanism through a comparison with several state-of-the-art approaches, with positive findings.

The remainder of this paper is organized as follows. In Section 2, the reduction framework is introduced.
Then we present how to learn the reduced standard classification problem in Section 3. Section 4 presents experimental results and Section 5 concludes the paper with a summary and discussions of future research.

2. A reduction framework

In this section, a formal definition of cost-sensitive learning is given first. We then present our reduction approach and discuss the relationship between cost-sensitive learning and its reduced form as a standard classification problem.

2.1. Definition of cost-sensitive learning

Let $X$ be the input vector space, $Y = \{1, 2, \ldots, K\}$ the set of $K$ class labels, and $l_{i,j}$ ($i, j \in \{1, \ldots, K\}$) the cost of misclassifying an example of class $i$ to class $j$, with $l_{i,j} = 0$ for $i = j$ and $l_{i,j} > 0$ for $i \neq j$. Encoding this cost information in a cost matrix $L$ with entries $l_{i,j}$, cost-sensitive learning can be defined as follows.

Definition 1. Let $P_{XY}$ be the unknown joint distribution of $X$ and $Y$. Given an i.i.d. sample set $S = \{(x_i, y_i)\}_{i=1}^{l} \sim P_{XY}$, and a set $H$ of mappings $h \in H$ from $X$ to $Y$, a cost-sensitive learning procedure is to select a mapping $h$ to minimize the risk functional $R(h)$, defined as

$$R(h) = E_{P_{XY}}\, l_{y,h(x)} = \int \left[ \sum_{y=1}^{K} l_{y,h(x)}\, P(y|x) \right] p(x)\, dx. \tag{1}$$

Note that if the cost matrix has entries $l_{i,j} = 1$ for every $i \neq j$, Definition 1 reduces to the standard classification.
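As a minimal illustration of Definition 1, the sample analogue of the risk in Eq. (1) is the average cost of the predictions. The sketch below assumes 0-based class indices and a hypothetical 3-class cost matrix; neither is taken from the paper.

```python
def empirical_cost(cost, y_true, y_pred):
    """Average misclassification cost, the sample analogue of R(h) in Eq. (1).

    cost[i][j] is the cost of predicting class j for an example of class i
    (0-based class indices here, unlike the 1-based labels in the text).
    """
    total = sum(cost[y][p] for y, p in zip(y_true, y_pred))
    return total / len(y_true)


# Hypothetical 3-class cost matrix with zero diagonal (illustrative values).
L = [[0, 1, 4],
     [2, 0, 1],
     [5, 3, 0]]
# 0/1 cost matrix: with it, the average cost is the ordinary error rate.
ZERO_ONE = [[0 if i == j else 1 for j in range(3)] for i in range(3)]

y_true = [0, 1, 2, 2]
y_pred = [0, 2, 0, 2]
print(empirical_cost(L, y_true, y_pred))         # (0 + 1 + 5 + 0) / 4
print(empirical_cost(ZERO_ONE, y_true, y_pred))  # 2 errors out of 4
```

The same two mistakes weigh very differently under the two matrices, which is the gap the reduction in Section 2.2 is meant to bridge.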
2.2. Reducing cost-sensitive learning to standard classification

Cost-sensitive learning differs from standard classification in that there is more than one way in which classifiers make mistakes. Most existing machine learning algorithms can only handle the single misclassification case, i.e., all misclassifications are assigned the same cost. As such, the key technical challenge in cost-sensitive learning is how to incorporate the various misclassification costs. In example-dependent cost-sensitive learning, Abe et al. [4] proposed a data space expansion technique to deal with multiple ways of misclassifying examples. This paper specializes this technique to the class-dependent case. Based on this technique, each example is expanded to K examples of different classes, with weights assigned according to the loss of the corresponding misclassifications. When an example is assigned to a certain class, the total loss on its K expanded examples is in proportion to the loss of classifying it to that class.

The details of the expansion technique are given as follows. Assume that $h(x)$ is a classifier. To keep the weights of expanded examples positive, let $l_y$ be a positive quantity which is not less than the largest misclassification loss on an example $(x, y)$. Then the loss of $h(x)$ on the example $(x, y)$ is given by
$$
\begin{aligned}
l_{y,h(x)} &= \sum_{i=1}^{K} l_{y,i} - \sum_{i=1}^{K} l_{y,i}\, I(h(x) \neq i) \\
&= \sum_{i=1}^{K} l_{y,i} - \sum_{i=1}^{K} l_{y,i}\, I(h(x) \neq i) + (K-1)\, l_y - (K-1)\, l_y \\
&= \sum_{i=1}^{K} (l_y - l_{y,i})\, I(h(x) \neq i) - \sum_{i \in \{1,\ldots,K\} \setminus y} (l_y - l_{y,i}),
\end{aligned}
\tag{2}
$$

where $I(x)$ is a step function with value 1 if the inner condition is true, and 0 otherwise. Note that the fact $l_{y,y} = 0$ is used in the right hand of the third equal sign. The expanded examples $(x^{(k)}, y^{(k)})$ with weights $w_{y,k}$ are defined as

$$x^{(k)} = x, \qquad y^{(k)} = k, \qquad w_{y,k} = (l_y - l_{y,k}). \tag{3}$$

Substituting (3) into (2) yields

$$l_{y,h(x)} = \sum_{k=1}^{K} w_{y,k}\, I(h(x) \neq k) - f(y), \tag{4}$$

where $f(y) = \sum_{k \in \{1,\ldots,K\} \setminus y} w_{y,k}$. Eq. (4) shows that the loss of $h(x)$ on the example $(x, y)$ equals a weighted 0/1 loss of $h(x)$ on expanded examples minus a variable irrelevant to $h(x)$. It also provides a way to transform the loss to weights of examples. The weights can be used to modify $P_{XY}$ to produce a new distribution on $(X, Y)$. In this way, cost-sensitive learning can be reduced to the standard classification, which is given as the following theorem.

Theorem 2. For the cost-sensitive learning problem in Definition 1 and $w_{i,j}$ ($i, j = 1, \ldots, K$) as defined in (3), let $g(x) = \sum_{k=1}^{K} \sum_{y=1}^{K} w_{y,k} P(y|x)$ and $C_1 = \sum_{k=1}^{K} \sum_{y=1}^{K} w_{y,k} P(y)$. There exists a distribution $\tilde P(x, k) = \sum_{y=1}^{K} w_{y,k} P(y|x)\, p(x) / C_1$ on $(X, Y)$, such that the minimizer of cost-sensitive learning on distribution $P$ is equivalent to the minimizer of standard classification on distribution $\tilde P$.

Proof. Substituting (4) into (1) yields

$$
\begin{aligned}
R(h) &= \int \left[ \sum_{y=1}^{K} \left( \sum_{k=1}^{K} w_{y,k}\, I(h(x) \neq k) - f(y) \right) P(y|x) \right] p(x)\, dx \\
&= \int \left[ \sum_{k=1}^{K} \left( \sum_{y=1}^{K} w_{y,k} P(y|x) \right) I(h(x) \neq k) \right] p(x)\, dx - \sum_{y=1}^{K} f(y) P(y) \\
&= \int \left[ \sum_{k=1}^{K} \frac{\sum_{y=1}^{K} w_{y,k} P(y|x)}{g(x)}\, I(h(x) \neq k) \right] g(x)\, p(x)\, dx - C_2 \\
&= C_1 \int \left[ \sum_{k=1}^{K} I(h(x) \neq k)\, \tilde P(k|x) \right] \tilde p(x)\, dx - C_2,
\end{aligned}
\tag{5}
$$

where $C_2 = \sum_{y=1}^{K} f(y) P(y)$, $\tilde P(k|x) = \sum_{y=1}^{K} w_{y,k} P(y|x) / g(x)$, and $\tilde p(x) = g(x)\, p(x) / C_1$. Meanwhile, the standard classification problem on the distribution $\tilde P(x, k)$ is to minimize

$$\tilde R(h) = E_{\tilde P_{XK}}\, l_{k,h(x)} = \int \left[ \sum_{k=1}^{K} I(h(x) \neq k)\, \tilde P(k|x) \right] \tilde p(x)\, dx. \tag{6}$$

Combining (5) and (6) results in

$$R(h) = C_1 \tilde R(h) - C_2. \tag{7}$$

It is easy to see that $C_1 > 0$. Thus minimizing $\tilde R(h)$ minimizes $R(h)$ as well.
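The identity in Eq. (4) and the risk relation $R(h) = C_1 \tilde R(h) - C_2$ of Eq. (7) can be checked numerically on a small discrete problem. The following is only a sketch under stated assumptions: the cost matrix, the two-valued feature space and the joint probabilities are invented for illustration, class indices are 0-based, and $l_y$ is simply set to the largest cost in row $y$.

```python
from itertools import product

K = 3
# Hypothetical cost matrix (0-based indices), zero diagonal.
cost = [[0, 1, 4],
        [2, 0, 1],
        [5, 3, 0]]
# l_y is chosen as the largest cost in row y, so all weights are non-negative.
l = [max(row) for row in cost]
# Eq. (3): w[y][k] = l_y - l_{y,k}
w = [[l[y] - cost[y][k] for k in range(K)] for y in range(K)]
# f(y) = sum of w[y][k] over k != y
f = [sum(w[y][k] for k in range(K) if k != y) for y in range(K)]

# Eq. (4): the cost equals the weighted 0/1 loss on the K expanded examples minus f(y).
for y, pred in product(range(K), repeat=2):
    expanded = sum(w[y][k] * (pred != k) for k in range(K)) - f[y]
    assert expanded == cost[y][pred]

# Toy joint distribution over two feature values x in {0, 1}: joint[x][y] = P(x, y).
joint = [[0.30, 0.15, 0.05],
         [0.05, 0.20, 0.25]]

C1 = sum(w[y][k] * joint[x][y] for x in range(2) for y in range(K) for k in range(K))
C2 = sum(f[y] * joint[x][y] for x in range(2) for y in range(K))
# New distribution of Theorem 2: new_joint[x][k] = sum_y w[y][k] * P(x, y) / C1
new_joint = [[sum(w[y][k] * joint[x][y] for y in range(K)) / C1 for k in range(K)]
             for x in range(2)]


def risk(h):
    """Cost-sensitive risk of Eq. (1); h is a tuple giving the label for each x."""
    return sum(cost[y][h[x]] * joint[x][y] for x in range(2) for y in range(K))


def risk_01(h):
    """0/1 risk on the new distribution, Eq. (6)."""
    return sum((h[x] != k) * new_joint[x][k] for x in range(2) for k in range(K))


# Eq. (7) holds for every one of the K^2 classifiers on this tiny feature space.
classifiers = list(product(range(K), repeat=2))
for h in classifiers:
    assert abs(risk(h) - (C1 * risk_01(h) - C2)) < 1e-12

best_cs = min(classifiers, key=risk)
best_01 = min(classifiers, key=risk_01)
print(best_cs, best_01)
```

Because the two risks differ only by a positive scaling and a constant shift, the same classifier minimizes both, which is what Theorem 2 asserts.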
Theorem 2 shows that a cost-sensitive learning problem can be solved by solving a standard classification problem on a new distribution determined by the cost matrix. Note that Abe et al. [4] proposed a reduction from the cost-sensitive learning problem to the importance weighted classification problem based on the data space expansion technique. Our reduction differs from their approach in that, besides considering the influence of cost on labels, it further takes into account the influence of cost on the features. Zadrozny [19] also proposed a similar reduction for the example-dependent multi-class cost-sensitive learning problem. In contrast to our proposed reduction, their method transforms the cost-sensitive learning problem to an importance weighted multi-class problem under the same distribution. Our reduction further employs the importance weights to modify the distribution and so produces a standard multi-class classification problem.

For a given cost-sensitive problem, $\{l_y\}_{y=1}^{K}$ are the only controllable variables in the reduction, thus playing a unique role in the construction of the new distribution. It is well known that the distribution is a crucial factor that determines how difficult a learning problem can be. In other words, the distribution is a dominating factor in the choice of classifier $h(x)$. By carefully choosing $\{l_y\}_{y=1}^{K}$, the new distribution not only benefits the choice of $h(x)$, but also facilitates the understanding of the relationship between cost-sensitive learning and its reduced form as a standard classification problem, as addressed in Section 2.3.

Eq. (7) indicates how the error transforms from the standard classification problem to the cost-sensitive learning problem. It is desirable to distinguish between errors due to environmental noise and errors due to base predictor mistakes. In this case, regret transform analysis applies, where regret is defined as the error rate minus the minimum error rate [20].
For the cost-sensitive learning problem, the regret is given by

$$r(h) = R(h) - \min_{h'} R(h'). \tag{8}$$

For the reduced standard classification problem, the regret is given by

$$\tilde r(h) = \tilde R(h) - \min_{h'} \tilde R(h'). \tag{9}$$

Note that the minimum is obtained over all classifiers $h'$ and thus the minimum loss classifier is also known as the "Bayes classifier." According to (7), it is easy to see the regret transformation, as given by

$$r(h) = C_1\, \tilde r(h). \tag{10}$$

When the cost matrix is the one for the standard classification ($l_{i,j} = 1$ for $i \neq j$) and $l_i = 1$, the results are $C_1 = 1$, $C_2 = 0$, $\tilde P(k|x) = P(k|x)$, $\tilde p(x) = p(x)$ and $R(h) = \tilde R(h)$. Thus the standard classification problem is kept unchanged.

2.3. Relationship between cost-sensitive learning and standard classification

This section shows some desired distributions of the reduced standard classification obtained by choosing $\{l_y\}_{y=1}^{K}$. All conclusions are easily derived from the proof of Theorem 2.

In the two-class case, when $l_1 = l_{1,2}$ and $l_2 = l_{2,1}$, class priors and class-conditional distributions have simple forms as follows:

$$\tilde P(y = 1) = \frac{l_{1,2} P(y = 1)}{l_{1,2} P(y = 1) + l_{2,1} P(y = 2)}, \qquad \tilde P(y = 2) = \frac{l_{2,1} P(y = 2)}{l_{1,2} P(y = 1) + l_{2,1} P(y = 2)},$$

$$\tilde p(x|y = 1) = p(x|y = 1), \qquad \tilde p(x|y = 2) = p(x|y = 2).$$

From the equations above, we can see that class-conditional distributions are unchanged while class priors are changed after the reduction. The ratio $\tilde P(y=2)/\tilde P(y=1)$ of the two class priors on the new distribution is $l_{2,1}/l_{1,2}$ times that on the original distribution. In other words, the solution for a standard classification problem with a given ratio of two priors can solve a cost-sensitive learning problem with costs $l_{1,2}$, $l_{2,1}$ and another ratio which is $l_{1,2}/l_{2,1}$ times the former ratio, as Elkan's theorem states [7]. (Following this theorem, many existing binary cost-sensitive learning methods such as weighting examples and sampling examples were proposed [5,7–9,11].)

On the other hand, assuming without loss of generality that $l_{2,1} > l_{1,2}$, we can set $l_1 = (l_{1,2} + l_{2,1})/2$ and $l_2 = l_{2,1}$. Class membership probabilities and the distribution of $x$ become

$$\tilde P(y = 1|x) = \frac{l_{2,1} + l_{1,2}}{2\, l_{2,1}}\, P(y = 1|x), \qquad \tilde P(y = 2|x) = 1 - \frac{l_{2,1} + l_{1,2}}{2\, l_{2,1}}\, P(y = 1|x), \qquad \tilde p(x) = p(x).$$

In this case, class membership probabilities are changed while the distribution of $x$ remains unchanged.

In the multi-class ($K \geq 3$) case, we can choose $\{l_y\}_{y=1}^{K}$ so that the sum of each row of the weight matrix is the same value, denoted by $C$. In this case, class membership probabilities and the distribution of $x$ take simple forms:

$$\tilde P(k|x) = \frac{\sum_{y=1}^{K} w_{y,k} P(y|x)}{C} \quad \forall k \in \{1, \ldots, K\}, \qquad \tilde p(x) = p(x). \tag{11}$$

An interesting interpretation of Eq. (11) is to regard $P_{y,k} = w_{y,k}/C$ as the transition probability from $y$ to $k$. Thus the reduction procedure from cost-sensitive classification to a standard one is just a perturbation on labels. More specifically, each item in the reduced standard classification can be considered as being generated by drawing the tuple $(x, y, k)$ from $P(x, y)$ and $P_{y,k}$. The example $(x, y)$ in the cost-sensitive learning is only the intermediary result of the whole sampling process, which serves as the input to generate the example $(x, k)$ of the reduced standard classification problem, according to $P_{y,k}$. After the entire sampling is completed, the intermediary variable $y$ is omitted. Note that $P_{y,k}$ is in proportion to $w_{y,k}$, which is negatively related to the misclassification loss $l_{y,k}$. That means that the sampling favors low-cost labels over high-cost labels.

2.4. Parameter C

So far we have discussed how to choose $\{l_y\}_{y=1}^{K}$ such that the relationship between cost-sensitive learning and its reduced standard classification version is favorable in the multi-class case. By its definition, $C$ determines all values of $\{l_y\}_{y=1}^{K}$, thus playing a crucial role in the reduction.

Let the sum of the $y$-th row of a $K \times K$ cost matrix be $S_y$ and the maximum of the $y$-th row be $l_{y,\max}$. From Eq. (3), we know the role of $\{l_y\}_{y=1}^{K}$ is to transform the loss to weights. Since weights must be non-negative, for each $y \in \{1, \ldots, K\}$ we have $l_y \geq l_{y,\max}$. According to the definition of $C$, $l_y = (C + S_y)/K$. Therefore, $C \geq K l_{y,\max} - S_y > 0$.

Consider the transition probabilities defined in Eq. (11). Since $P_{y,k} = w_{y,k}/C = (S_y - K l_{y,k})/(KC) + 1/K$, when $C$ approaches infinity, $P_{y,k}$ converges to $1/K$ for every $k$. For any given item $x_i$, although the probabilities $P(y|x)$ ($y \in \{1, \ldots, K\}$) vary, with sufficiently large $C$ the probabilities that $x_i$ belongs to every class become the same. In other words, there is no supervised information to learn in the reduced problem. Therefore, $C$ can be thought of as a parameter representing our belief in the training set. When there is less useful information in the training set, enlarging $C$ would probably make the error on the new distribution actually decrease, which, however, is not expected to occur often.

Using the definition of $C$, the error transformation formula (7) becomes clear. From the proof of Theorem 2, we can obtain

$$C_1 = C, \qquad C_2 = \frac{(K - 1)\, C - \sum_{y=1}^{K} S_y P(y)}{K}. \tag{12}$$

Substituting (12) into (7) yields

$$R(h) = C \left( \tilde R(h) - \frac{K - 1}{K} \right) + \frac{\sum_{y=1}^{K} S_y P(y)}{K}. \tag{13}$$

The corresponding regret transform formula (10) becomes

$$r(h) = C\, \tilde r(h). \tag{14}$$
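The construction of Sections 2.3 and 2.4 can be sketched in a few lines: given a cost matrix, take the lower bound $C \geq K l_{y,\max} - S_y$ over all rows, set $l_y = (C + S_y)/K$, and form the transition probabilities $P_{y,k} = w_{y,k}/C$. The function names, the 0-based indexing and the example matrices below are assumptions made for illustration only.

```python
def smallest_C(cost):
    """Smallest C keeping every weight non-negative: max over rows of K * l_{y,max} - S_y."""
    K = len(cost)
    return max(K * max(row) - sum(row) for row in cost)


def transition_matrix(cost, C=None):
    """Row-stochastic matrix P[y][k] = w_{y,k} / C, with l_y = (C + S_y) / K."""
    K = len(cost)
    if C is None:
        C = smallest_C(cost)
    P = []
    for row in cost:
        l_y = (C + sum(row)) / K
        P.append([(l_y - c) / C for c in row])
    return P


# Hypothetical 3-class cost matrix (illustrative values).
cost = [[0, 1, 4],
        [2, 0, 1],
        [5, 3, 0]]
C = smallest_C(cost)
P = transition_matrix(cost)
for row in P:
    print([round(p, 3) for p in row])

# Two-class check with l_{1,2} = 1 and l_{2,1} = 3: the first row starts with
# (l_{2,1} + l_{1,2}) / (2 * l_{2,1}) = 2/3, matching the two-class formulas.
P2 = transition_matrix([[0, 1], [3, 0]])

# A very large C washes out the label information: every entry tends to 1/K.
U = transition_matrix(cost, C=1e9)
```

In this example the most expensive mistake in a row receives transition probability zero under the smallest admissible $C$, while a very large $C$ makes every row nearly uniform, in line with the discussion of Section 2.4.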
The parameter C is crucial for the learner to output a good classifier h. It affects the distribution of the reduced standard classification problem and the regret transformation. When C increases, the reduced problem not only becomes more difficult to learn but also has a larger transformed regret as in (14). Therefore we suggest using the smallest C.

3. Learning from the reduced standard classification problem

As a solution for a cost-sensitive learning problem can be obtained by solving the corresponding standard classification problem, we now analyze the reduced standard classification problem. Usually we can draw a bootstrap [15] training set and use Eq. (11) to generate labels. We then apply any cost-insensitive classification
algorithm to produce a classifier. However, there are several problems with this method. First, it is hard to determine how many examples should be drawn. If there are too many examples, the subsequent learning process is time-consuming; if there are too few, the sample may not reflect the new distribution accurately. Second, it is undesirable that sampling fluctuations introduce random factors into an otherwise deterministic classifier. We therefore introduce a new weighting mechanism which bypasses the sampling process.

For a given training set $S = \{(x_i, y_i)\}_{i=1}^{N}$, each example $(x_i, y_i)$ is expanded to $K$ examples $\{(x_i, k)\}_{k=1}^{K}$ with weights $w_{y_i,k}/C$, respectively. Thus the training set $S$ is transformed to an expanded weighted training set $\tilde S = \{\{(x_i, k)\}_{k=1}^{K}\}_{i=1}^{N}$. The expanded training set contains at most $K \times N$ examples, with the exact number of examples depending on how many weights ($w_{i,j}$) are positive. According to Eq. (11), each element in $\tilde S$ is a possible outcome from $\tilde P(x, k)$. However, these elements are not all independent, so the entire expanded set is not an i.i.d. sample from $\tilde P(x, k)$. Fortunately, we can draw an independent $k_i$ according to $\tilde P(k_i|x_i)$ for each $(x_i, y_i)$ and produce a subset $T = \{(x_i, k_i)\}_{i=1}^{N}$ of $\tilde S$ which is i.i.d. from $\tilde P(x, k)$. The following theorem shows that the empirical risk on $T$ is essentially the same as that on the expanded weighted training set $\tilde S$, which bears similarity with Theorem 4 in [16].

Theorem 3. Let $b_i^{(k)}$ be 0 if a classifier correctly predicts an example $(x_i^{(k)}, y_i^{(k)})$, and 1 otherwise. Let $b_i$ be a boolean random variable introduced by applying the classifier to a random example $(x_i, k_i)$ extracted from $(x_i, y_i)$ according to $P_{y_i,k_i}$. When each $b_i$ is chosen independently, with probability at least $1 - \delta$ over the choice of $b_i$,

$$\frac{1}{N} \sum_{i=1}^{N} b_i \leq \frac{1}{NC} \sum_{i=1}^{N} \sum_{k=1}^{K} w_{y_i,k}\, b_i^{(k)} + O\!\left( \frac{1}{\sqrt{N}},\ \log \frac{1}{\delta} \right), \tag{15}$$

where $C = \sum_{k=1}^{K} w_{y_i,k}$ ($\forall y_i \in \{1, \ldots, K\}$).

Proof. For each example $(x_i, y_i)$ in the original training set, $k$ is drawn from $P_{y_i,k} = w_{y_i,k}/C$. Thus the random variable $b_i$ has mean $\sum_{k=1}^{K} w_{y_i,k}\, b_i^{(k)}/C$, which equals the probability of $b_i = 1$. Now $b_1, \ldots, b_N$ are independent $\{0, 1\}$-valued random variables with $P(b_i = 1) = \sum_{k=1}^{K} w_{y_i,k}\, b_i^{(k)}/C$. Using the Chernoff bounds [12], for $\epsilon > 0$, we have

$$P\!\left( \frac{1}{N} \sum_{i=1}^{N} b_i \geq \mu (1 + \epsilon) \right) \leq \exp(-\epsilon^2 \mu N / 3), \tag{16}$$

where $\mu = (1/NC) \sum_{i=1}^{N} \sum_{k=1}^{K} w_{y_i,k}\, b_i^{(k)}$. Let $\delta = \exp(-\epsilon^2 \mu N / 3)$. We have

$$\epsilon = \sqrt{\frac{3 \log(1/\delta)}{\mu N}}. \tag{17}$$

Combining (16) and (17) results in (15).
Theorem 3 shows that a classifier minimizing the weighted loss on the expanded training set $\tilde S$ approximately minimizes the 0/1 loss on an i.i.d. sample of $\tilde P(x, k)$ as well. Thus it is valid to train a classifier only on the expanded training set to solve the reduced standard classification problem. Though $\tilde S$ is larger than $S$ by at most a factor of $K$, this does not raise the computing time of classification substantially when the actual computation depends on the number of examples with different features (e.g., in the case of CART).
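The weighting mechanism of this section can be sketched as follows: each example is expanded into weighted copies, and the deterministic weighted loss on the expanded set is compared with the 0/1 loss on one randomly relabelled sample $T$. The transition matrix, the toy data and the fixed classifier are assumptions made purely for illustration, and classes are indexed from 0.

```python
import random

# Hypothetical row-stochastic transition matrix P[y][k] = w_{y,k} / C.
P = [[4 / 7, 3 / 7, 0.0],
     [4 / 21, 10 / 21, 7 / 21],
     [0.0, 2 / 7, 5 / 7]]


def expand(samples, P):
    """Expand each (x, y) into (x, k, weight) triples, keeping only positive weights."""
    return [(x, k, P[y][k]) for x, y in samples
            for k in range(len(P)) if P[y][k] > 0]


def weighted_loss(h, expanded, n):
    """Weighted 0/1 loss on the expanded set, normalised by the original size n."""
    return sum(wt * (h(x) != k) for x, k, wt in expanded) / n


def sampled_loss(h, samples, P, rng):
    """0/1 loss on a set T built by drawing one label k_i per example from P[y_i]."""
    K = len(P)
    errors = 0
    for x, y in samples:
        k = rng.choices(range(K), weights=P[y])[0]
        errors += h(x) != k
    return errors / len(samples)


rng = random.Random(0)
# Toy data: an integer feature x with a label loosely tied to x.
samples = []
for _ in range(20000):
    x = rng.randrange(30)
    y = (x + rng.choice([0, 0, 1])) % 3
    samples.append((x, y))


def h(x):
    """A fixed classifier to evaluate (not trained here)."""
    return x % 3


expanded = expand(samples, P)
det = weighted_loss(h, expanded, len(samples))
mc = sampled_loss(h, samples, P, rng)
print(len(expanded), det, mc)
```

The weighted loss involves no sampling noise, whereas the sampled loss changes with the random seed; for a sample of this size the two should agree closely, which is the behaviour Theorem 3 describes. In practice the triples could be passed to any learner that accepts example weights.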
4. Experimental studies

4.1. Existing cost-sensitive approaches

We have compared our proposed weighting mechanism against four representative approaches for cost-sensitive learning: a traditional weighting approach, Zhou's approach [11], MetaCost [6] and GBSE [4]. The traditional weighting approach and Zhou's approach are weighting approaches. MetaCost and GBSE are ensemble approaches.

In the traditional weighting approach [5–8], a weight $w_i$ is assigned to examples of class $i$, which is computed by

$$w_i = \frac{n S_i}{\sum_{k=1}^{K} n_k S_k}, \tag{18}$$

where $n$ is the size of the training set, $n_i$ is the number of examples in class $i$, and $S_i$ is the sum of the $i$-th row of the cost matrix, as in Section 2.4. Roughly speaking, $w_i$ is in proportion to the sum of the misclassification losses of class $i$. However, the traditional weighting approach is often not effective in dealing with cost-sensitive multi-class problems, as explained by Zhou and Liu [11]. They proposed another weighting approach generalized from Elkan's theorem [7]. Suppose each class can be assigned a weight $w_i$. According to Elkan's theorem, it is desirable that pairwise weights satisfy $w_i/w_j = l_{i,j}/l_{j,i}$, which implies the following $K(K-1)/2$ constraints:

$$\frac{w_i}{w_j} = \frac{l_{i,j}}{l_{j,i}}, \qquad i, j \in \{1, \ldots, K\}. \tag{19}$$
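Eqs. (18) and (19) can be illustrated with a short sketch. The consistency test below is a simple direct check written for this illustration (fix the first weight, derive the others from the first row and column, then verify the remaining pairs); it is not claimed to be the procedure used in [11]. The function names and example matrices are hypothetical, and classes are indexed from 0.

```python
def traditional_weights(cost, counts):
    """Eq. (18): w_i = n * S_i / sum_k n_k * S_k, with S_i the i-th row sum of the costs."""
    n = sum(counts)
    S = [sum(row) for row in cost]
    denom = sum(n_k * S_k for n_k, S_k in zip(counts, S))
    return [n * S_i / denom for S_i in S]


def consistent_weights(cost, tol=1e-9):
    """Return w satisfying Eq. (19) up to scale, or None if no such w exists.

    Fixes w_0 = 1, sets w_i = l_{i,0} / l_{0,i}, then checks the remaining pairs.
    Assumes all off-diagonal costs are positive, as in Definition 1.
    """
    K = len(cost)
    w = [1.0] + [cost[i][0] / cost[0][i] for i in range(1, K)]
    for i in range(K):
        for j in range(i + 1, K):
            # w_i / w_j = l_{i,j} / l_{j,i}  <=>  w_i * l_{j,i} = w_j * l_{i,j}
            if abs(w[i] * cost[j][i] - w[j] * cost[i][j]) > tol:
                return None
    return w


consistent = [[0, 1, 1],
              [2, 0, 2],
              [4, 4, 0]]
inconsistent = [[0, 1, 4],
                [2, 0, 1],
                [5, 3, 0]]
counts = [500, 1000, 2000]  # class sizes, here those of syn-j in Table 1

tw = traditional_weights(consistent, counts)
print(tw)
print(consistent_weights(consistent))
print(consistent_weights(inconsistent))
```

The normalisation in Eq. (18) keeps the total weight of the training set equal to its size $n$, so only the relative emphasis between classes changes.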
If (19) has a solution w = [w1, w2, . . . , wK]T, examples from each class can be weighted according to the solution w, and such a cost matrix is called a consistent matrix; otherwise, it is called an inconsistent matrix. In the inconsistent case, the multi-class problem is decomposed into K(K − 1)/2 binary problems by pairwise coupling. Each binary problem is rescaled according to Elkan's theorem, and the final prediction is made by voting over the class labels predicted by the binary classifiers.

As reported by Abe et al. [4], MetaCost and GBSE delivered better performance than other representative cost-sensitive methods in their experimental evaluation, and a variant of MetaCost could perform at least as well as MetaCost. Thus, the variant of MetaCost and GBSE were chosen in our experiments as the representatives of state-of-the-art cost-sensitive learning methods. The variant uses multiple sub-samples from the original training set, obtained with rejection sampling [17], to build an ensemble of hypotheses. These hypotheses are then combined to estimate class membership probabilities for each example. The label of a test example is assigned by minimizing the expected cost. GBSE casts stochastic multi-class cost-sensitive learning in the framework of gradient boosting [21], with the objective function defined as the expected cost of the stochastic ensemble, obtained as a mixture of individual hypotheses, on the expanded dataset. Readers can refer to [4] and [6] for the implementation details.

4.2. Experiment setup and results

4.2.1. Experiment setup

We have conducted experiments on two kinds of datasets to demonstrate the effectiveness of our proposed weighting mechanism: 12 synthetic datasets and 12 benchmark datasets. Besides the aforementioned cost-sensitive approaches, we also considered the standard multi-class classification method as the baseline.
For notational convenience, we denote our proposed weighting approach as Reduction, the traditional weighting approach as
Table 1
Synthetic datasets.

| Dataset | Size | A | C | Class distribution |
|---|---|---|---|---|
| syn-a | 1500 | 2 | 3 | [500*3] |
| syn-b | 3000 | 2 | 3 | [1000*3] |
| syn-c | 6000 | 2 | 3 | [2000*3] |
| syn-d | 2500 | 2 | 5 | [500*5] |
| syn-e | 5000 | 2 | 5 | [1000*5] |
| syn-f | 10 000 | 2 | 5 | [2000*5] |
| syn-g | 5000 | 2 | 10 | [500*10] |
| syn-h | 10 000 | 2 | 10 | [1000*10] |
| syn-i | 20 000 | 2 | 10 | [2000*10] |
| syn-j | 3500 | 2 | 3 | [500, 1000, 2000] |
| syn-k | 10 500 | 2 | 3 | [500, 2000, 8000] |
| syn-l | 36 500 | 2 | 3 | [500, 4000, 32 000] |

A: attributes, C: classes.
Table 2
Misclassification costs with consistent cost matrices.

| Dataset | Base | Tradition | Zhou | MetaCost | GBSE | Reduction |
|---|---|---|---|---|---|---|
| syn-a | 1.127 ± 0.250 | 0.725 ± 0.182 | 0.697 ± 0.153 | 0.836 ± 0.194 | 1.983 ± 0.877 | 0.717 ± 0.161 |
| syn-b | 1.941 ± 0.300 | 1.588 ± 0.370 | 1.480 ± 0.397 | 1.670 ± 0.346 | 1.468 ± 0.486 | 1.422 ± 0.433 |
| syn-c | 1.400 ± 0.397 | 0.760 ± 0.192 | 0.751 ± 0.191 | 0.981 ± 0.294 | 2.280 ± 0.826 | 0.727 ± 0.177 |
| syn-d | 2.029 ± 0.330 | 1.414 ± 0.336 | 1.323 ± 0.354 | 1.753 ± 0.310 | 1.787 ± 0.596 | 1.264 ± 0.342 |
| syn-e | 0.898 ± 0.155 | 0.781 ± 0.171 | 0.741 ± 0.167 | 0.736 ± 0.162 | 1.568 ± 0.449 | 0.737 ± 0.172 |
| syn-f | 1.794 ± 0.272 | 1.435 ± 0.261 | 1.367 ± 0.254 | 1.529 ± 0.231 | 1.917 ± 0.707 | 1.305 ± 0.269 |
| syn-g | 2.870 ± 0.288 | 2.546 ± 0.333 | 2.450 ± 0.339 | 2.501 ± 0.291 | 2.556 ± 0.413 | 2.138 ± 0.265 |
| syn-h | 2.305 ± 0.174 | 1.955 ± 0.235 | 1.825 ± 0.207 | 2.140 ± 0.150 | 2.434 ± 0.450 | 1.624 ± 0.227 |
| syn-i | 2.514 ± 0.215 | 2.151 ± 0.279 | 2.013 ± 0.251 | 2.332 ± 0.217 | 2.501 ± 0.465 | 1.823 ± 0.245 |
| syn-j | 2.947 ± 0.608 | 1.536 ± 0.488 | 1.346 ± 0.338 | 2.344 ± 0.389 | 1.392 ± 0.486 | 1.293 ± 0.380 |
| syn-k | 0.678 ± 0.108 | 0.533 ± 0.094 | 0.526 ± 0.100 | 0.564 ± 0.104 | 1.441 ± 0.328 | 0.531 ± 0.106 |
| syn-l | 0.690 ± 0.206 | 0.570 ± 0.146 | 0.562 ± 0.145 | 0.622 ± 0.171 | 0.912 ± 0.265 | 0.596 ± 0.158 |
Tradition, Zhou's weighting approach as Zhou, and the baseline method as Base. In the weighted rejection sampling process of MetaCost and GBSE, the original data were scanned once, and each example was accepted with probability proportional to its weight. The number of iterations for MetaCost and GBSE was set to 30, as in [4], for comparison purposes. CART was used as the standard multi-class classification learner since it can deal with weighted examples. The implementation of CART was based on the rpart package in R (http://www.r-project.org) [23]. The maximum depth of the learned trees was empirically set to 5.

4.2.2. Synthetic datasets

In our first experiment, we chose 12 multi-class synthetic datasets, generated in a way similar to [11]. Each synthetic dataset has two attributes and three, five or 10 classes. Examples were generated randomly from the normal distribution under the following constraints: the mean value and standard deviation of each attribute are random real values in [0, 10], and the correlation coefficients are random real values in [−1, +1]. The experimental datasets are summarized in Table 1. The first nine datasets are class-balanced, while the remaining three are increasingly class-imbalanced. In the weighted rejection sampling, the acceptance probability of all examples was set to 2/3 for MetaCost, and for GBSE the acceptance probability of each example was set to the ratio of its weight to the maximum weight.

Since Zhou's weighting approach has two options depending on the property of the cost matrix, as described in Section 4.1, we conducted two different experiments on the synthetic datasets to validate the performance of these cost-sensitive approaches.

Option 1 (in the case of consistent cost matrix): To obtain consistent cost matrices, the approach in [11] was used to generate the cost matrix, so that Eq. (19) has a solution.
A K-dimensional real vector was randomly generated and regarded as the w in (19); then a real value was randomly generated for l_{i,j} (i, j ∈ [1, K] and j > i) such that l_{j,i} can be solved from (19). All these real values are in [1, 10], and at least one l_{i,j} is 1.0. The cost of misclassifying the smallest class as the largest class is the largest, while the cost of misclassifying the largest class as the smallest class is the smallest.

Option 2 (in the case of inconsistent cost matrix): We generated another kind of cost matrix that does not satisfy constraints (19), using the same procedure as in [4]. Let P(i) and P(j) be the empirical probabilities of occurrence of classes i and j in the training data. The non-diagonal entries of the cost matrix l_{i,j}, i ≠ j, were chosen with uniform probability from the interval [0, 2000 P(j)/P(i)]. The diagonal entries were set to zero. This procedure is based on a model that gives higher costs for misclassifying a rare class as a frequent one, and lower costs for the reverse. As explained in [4], this mimics a situation found in many practical data mining applications, including direct marketing and fraud detection, where the rare classes are the most valuable to identify correctly [22]. Note that, as these cost matrices do not satisfy the constraints of (19) in almost all cases, the method of Zhou for inconsistent matrices was used to learn classification rules.

For each dataset, following [4], we randomly selected two-thirds of the examples for training and the remaining one-third for testing. For each training/test split, a different cost matrix was generated according to the aforementioned rules. Thus, variation in both the data and the cost matrix could affect the standard deviations.
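The data generation and the Option 2 cost-matrix procedure described above can be sketched as follows. This is only an illustrative sketch, assuming that each class is one bivariate normal whose parameters are drawn independently, that cost[i][j] is the cost of predicting class j for an example of true class i, and that every class appears in the training split; the function names, the seed, and the use of the syn-j class sizes are choices made for the example and are not taken from the original experiments.

```python
import math
import random
from collections import Counter


def sample_class(n, rng):
    """Draw n points from one bivariate normal with random parameters."""
    mean = [rng.uniform(0, 10), rng.uniform(0, 10)]
    std = [rng.uniform(0, 10), rng.uniform(0, 10)]
    rho = rng.uniform(-1, 1)  # correlation between the two attributes
    points = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x1 = mean[0] + std[0] * z1
        # Mixing z1 and z2 this way gives correlation rho between x1 and x2.
        x2 = mean[1] + std[1] * (rho * z1 + math.sqrt(1 - rho * rho) * z2)
        points.append((x1, x2))
    return points


def make_dataset(class_sizes, rng):
    """Return (points, labels) with one random Gaussian per class."""
    points, labels = [], []
    for label, n in enumerate(class_sizes):
        points.extend(sample_class(n, rng))
        labels.extend([label] * n)
    return points, labels


def inconsistent_cost_matrix(train_labels, num_classes, rng, scale=2000.0):
    """cost[i][j] ~ U[0, scale * P(j) / P(i)] for i != j, zero on the diagonal."""
    counts = Counter(train_labels)
    n = len(train_labels)
    freq = [counts[c] / n for c in range(num_classes)]  # assumes no empty class
    cost = [[0.0] * num_classes for _ in range(num_classes)]
    for i in range(num_classes):
        for j in range(num_classes):
            if i != j:
                cost[i][j] = rng.uniform(0.0, scale * freq[j] / freq[i])
    return cost


rng = random.Random(0)
points, labels = make_dataset([500, 1000, 2000], rng)  # class sizes of syn-j

# Random two-thirds / one-third split, then a cost matrix from the training labels.
order = list(range(len(labels)))
rng.shuffle(order)
cut = (2 * len(order)) // 3
train_labels = [labels[i] for i in order[:cut]]
cost = inconsistent_cost_matrix(train_labels, 3, rng)
```

Because the upper bound of each entry scales with P(j)/P(i), entries in the row of a rare class i can be much larger than entries in the row of a frequent class, which matches the stated intent of penalizing the misclassification of rare classes more heavily.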
All results were obtained by averaging over 20 runs and are given in the form of “mean ± standard deviation.” Table 2 records the average costs of the six approaches in the case of consistent cost matrices, and Table 3 in the case of inconsistent cost matrices. The best performance in terms of mean misclassification cost on each dataset is boldfaced. To illustrate the differences among these approaches, we also plotted the cost curves normalized by the cost
Table 3 Misclassification costs with inconsistent cost matrices.

| Dataset | Base | Tradition | Zhou | MetaCost | GBSE | Reduction |
|---|---|---|---|---|---|---|
| syn-a | 322.461 ± 129.204 | 311.563 ± 139.422 | 272.550 ± 119.833 | 275.904 ± 125.647 | 505.167 ± 268.474 | 264.973 ± 132.834 |
| syn-b | 116.541 ± 44.964 | 116.582 ± 48.689 | 115.719 ± 49.377 | 109.874 ± 48.012 | 115.374 ± 52.804 | 110.204 ± 48.724 |
| syn-c | 454.521 ± 155.039 | 410.537 ± 158.795 | 374.534 ± 148.272 | 404.830 ± 158.228 | 493.794 ± 202.676 | 339.223 ± 145.041 |
| syn-d | 211.896 ± 69.426 | 196.351 ± 66.950 | 156.908 ± 59.989 | 158.553 ± 68.236 | 408.859 ± 149.014 | 154.386 ± 71.647 |
| syn-e | 344.106 ± 86.679 | 337.875 ± 85.363 | 307.390 ± 85.953 | 308.383 ± 91.187 | 529.456 ± 167.438 | 265.255 ± 85.685 |
| syn-f | 305.886 ± 45.681 | 295.076 ± 62.504 | 233.063 ± 50.698 | 276.353 ± 43.777 | 451.758 ± 147.603 | 243.417 ± 59.852 |
| syn-g | 373.912 ± 35.575 | 364.451 ± 38.182 | 301.104 ± 35.071 | 313.139 ± 40.281 | 536.106 ± 89.215 | 283.960 ± 58.851 |
| syn-h | 538.424 ± 51.434 | 536.378 ± 49.979 | 471.525 ± 48.705 | 484.290 ± 53.840 | 618.064 ± 114.017 | 422.403 ± 56.992 |
| syn-i | 521.039 ± 54.873 | 515.795 ± 68.919 | 471.186 ± 51.338 | 480.743 ± 61.629 | 610.568 ± 106.254 | 411.336 ± 51.425 |
| syn-j | 388.057 ± 155.591 | 310.272 ± 114.442 | 224.154 ± 86.653 | 313.307 ± 123.431 | 227.565 ± 98.218 | 191.091 ± 71.973 |
| syn-k | 388.580 ± 162.365 | 144.475 ± 83.971 | 75.030 ± 27.706 | 302.064 ± 119.965 | 291.206 ± 172.238 | 68.484 ± 22.437 |
| syn-l | 790.706 ± 350.943 | 34.523 ± 19.751 | 34.679 ± 18.381 | 636.425 ± 316.048 | 30.695 ± 26.707 | 17.767 ± 8.027 |
Fig. 1. Average costs normalized by that of Base on the synthetic datasets: (a) in the case of consistent cost matrix; (b) in the case of inconsistent cost matrix.
of Base on the synthetic datasets. Fig. 1(a) shows the cost curves in the case of consistent cost matrices, and Fig. 1(b) in the case of inconsistent cost matrices. We can draw some initial observations from Tables 2 and 3 and Fig. 1.

1. The Reduction, Zhou, MetaCost and Tradition approaches are capable of dealing with cost-sensitive learning problems, since they achieve comparable or better performance than the Base approach on all synthetic datasets with both kinds of cost matrices.
2. The Reduction approach outperforms the other cost-sensitive approaches: it achieves the best performance on eight datasets with consistent cost matrices and on nine datasets with inconsistent cost matrices. On the other datasets, Reduction achieves the second best performance, except for the dataset syn-l with consistent cost matrix, where it still achieves comparable results.
3. The performance of GBSE is unstable: it performs worse than the Base approach on seven datasets with consistent cost matrices and on eight datasets with inconsistent cost matrices, while performing better than Base in the remaining cases. GBSE is also sensitive to cost matrices, especially on balanced datasets, since it has the largest standard deviation of costs on all balanced datasets and in three cases on imbalanced datasets. This phenomenon is discussed further in Section 4.2.4.
4. Compared with the MetaCost and Tradition approaches, Zhou achieves better performance on all datasets with both kinds of cost matrices, except for three cases, i.e., the dataset syn-e with consistent cost matrix and the datasets syn-b and syn-l with inconsistent cost matrices, where it still achieves the second best performance. The performance of the Zhou approach on datasets with consistent cost matrices is slightly better than on those with inconsistent cost matrices.
5. MetaCost performs better than Tradition with inconsistent cost matrices, but worse with consistent cost matrices.

4.2.3. Benchmark datasets

We also conducted experiments similar to those reported in [4], using real-world benchmark datasets, to evaluate the performance of these six approaches. Our experiment used 11 benchmark datasets available from the UCI machine learning repository [13] and one dataset from the UCI KDD archive [14], as summarized in Table 4. The class ratio is given as the class frequency of the least frequent class divided by that of the most frequent class. These datasets were selected with different sizes and different class ratios in order to examine the robustness of the approaches under study. KDD-99 is a fairly large dataset; for experimentation purposes, we used half of the dataset used in [4]. For MetaCost, the acceptance probability of all samples in the weighted rejection sampling was set to 2/3 for small datasets (i.e., Vowel, Segment, Waveform, Splice, Satellite, Balance, Solar flare, Ann, Page) and to 1/10 for large datasets (i.e., Letter, Connect4, KDD-99). For GBSE, the acceptance probability of each sample was set to the ratio of its weight to the maximum weight in the data. Except for the KDD-99 dataset, the benchmark datasets do not have misclassification cost matrices. We used the same procedure as in [4] to generate these cost matrices for comparison purposes, i.e., the second option (inconsistent cost matrix) in the synthetic
Table 4 Benchmark datasets.

| Dataset | Size | A | C | Class ratio |
|---|---|---|---|---|
| Vowel | 990 | 13 | 11 | 1.000 |
| Segment | 2310 | 19 | 7 | 1.000 |
| Waveform | 5000 | 40 | 3 | 0.9769 |
| Letter | 20 000 | 16 | 26 | 0.9028 |
| Splice | 3190 | 60 | 3 | 0.4634 |
| Satellite | 6435 | 36 | 6 | 0.4083 |
| Balance | 625 | 4 | 3 | 0.1701 |
| Connect4 | 67 557 | 42 | 3 | 0.1450 |
| Solar flare | 1389 | 12 | 6 | 0.1287 |
| Ann | 898 | 38 | 5 | 0.01169 |
| Page | 5473 | 10 | 5 | 0.005699 |
| KDD-99 | 98 855 | 42 | 5 | 0.0001277 |

A: attributes, C: classes.
Table 5 Average costs and standard deviations on the benchmark datasets.

| Dataset | Base | Tradition | Zhou | MetaCost | GBSE | Reduction |
|---|---|---|---|---|---|---|
| Vowel | 458.627 ± 74.243 | 436.427 ± 76.167 | 219.434 ± 40.243 | 293.178 ± 45.741 | 479.366 ± 117.550 | 351.672 ± 57.077 |
| Segment | 173.749 ± 84.373 | 131.127 ± 54.614 | 40.650 ± 12.407 | 112.923 ± 68.610 | 162.090 ± 65.083 | 70.571 ± 25.822 |
| Waveform | 224.846 ± 75.016 | 212.695 ± 75.721 | 188.0446 ± 71.588 | 149.703 ± 61.191 | 153.414 ± 63.727 | 195.121 ± 75.466 |
| Letter | 656.638 ± 49.459 | 646.987 ± 75.080 | 145.668 ± 9.048 | 409.012 ± 26.491 | 643.713 ± 69.928 | 461.806 ± 44.053 |
| Splice | 58.316 ± 16.088 | 54.230 ± 15.471 | 48.660 ± 18.314 | 48.099 ± 13.496 | 60.980 ± 19.843 | 53.963 ± 19.168 |
| Satellite | 312.861 ± 89.228 | 232.790 ± 53.889 | 142.549 ± 42.060 | 174.192 ± 53.978 | 175.928 ± 55.351 | 171.991 ± 63.185 |
| Balance | 629.973 ± 235.351 | 271.521 ± 122.780 | 378.690 ± 157.169 | 107.810 ± 99.635 | 127.846 ± 53.813 | 207.587 ± 120.271 |
| Connect4 | 942.757 ± 420.031 | 263.227 ± 104.980 | 209.321 ± 84.095 | 694.137 ± 274.141 | 170.209 ± 62.202 | 169.585 ± 63.168 |
| Solar flare | 570.028 ± 129.115 | 246.056 ± 88.497 | 254.761 ± 62.765 | 355.636 ± 107.455 | 128.529 ± 45.672 | 115.472 ± 41.824 |
| Ann | 367.770 ± 582.919 | 164.195 ± 187.839 | 256.916 ± 316.878 | 55.905 ± 121.502 | 39.990 ± 27.391 | 123.956 ± 185.963 |
| Page | 1078.024 ± 804.890 | 87.399 ± 73.719 | 167.589 ± 144.462 | 400.366 ± 201.002 | 43.299 ± 27.433 | 14.253 ± 5.176 |
| KDD-99 | 0.00476 ± 0.000415 | 0.00455 ± 0.000420 | 0.00186 ± 0.00217 | 0.00459 ± 0.000390 | 0.0131 ± 0.000671 | 0.00476 ± 0.000415 |
experiments. Note that, as the cost matrices do not satisfy the constraints of (19) in almost all cases, the method of Zhou for inconsistent matrices was used to learn classification rules. As in the synthetic experiment, we randomly split each dataset into training and test sets. For each training/test split, a different cost matrix was generated according to the aforementioned rules. All results were obtained by averaging over 20 runs and are given in the form of “mean ± standard deviation.” Table 5 records the average costs of the six approaches, which are also plotted in Fig. 2. The best performance in terms of mean misclassification cost on each dataset is boldfaced. From Table 5 and Fig. 2, we make observations similar to those from the synthetic experiments.

1. Reduction and Zhou achieve better performance than the other four approaches. Reduction tends to perform better on imbalanced datasets, while Zhou performs better on balanced datasets.
2. MetaCost achieves comparable performance among these methods, as it performs best on three datasets.
3. The GBSE approach is unstable, as it outperforms the Base approach on nine datasets while under-performing on three.
4. All cost-sensitive approaches except GBSE achieve better performance than Base on small or medium datasets. On large datasets such as KDD-99, however, the performance of three cost-sensitive approaches, i.e., Tradition, MetaCost and Reduction, is indistinguishable from that of the Base approach.

4.2.4. Experimental results discussion

Considering both the synthetic and benchmark experimental results and the relevant observations, we can draw the following conclusions.

First, Reduction achieves better overall performance on both the synthetic and benchmark datasets. Moreover, it performs better
Fig. 2. Average costs normalized by that of Base on the benchmark datasets.
on imbalanced datasets than on balanced datasets. Since cost-sensitive classification problems are usually imbalanced per se, our Reduction approach can be considered more effective for and better suited to such problems.

Secondly, Reduction has low computational complexity compared with these state-of-the-art cost-sensitive algorithms. For a K-class classification problem, ensemble methods such as MetaCost and GBSE use 30 tree learners to obtain the classification rules, while Zhou's approach uses K(K − 1)/2 in the case of an inconsistent cost
[Fig. 3 plots, panels (a) and (b): Training Curve and Test Curve; x-axis: Number of Weak Learner (0 to 30); y-axis: Average Cost.]
Fig. 3. The training and test cost curves of GBSE on the syn-b dataset in the case of two inconsistent cost matrices: (a) cost matrix L1; (b) cost matrix L2.
matrix. However, Reduction, Base and Tradition employ only one tree learner. In this sense, single-learner methods like our Reduction approach have the advantage of higher computational efficiency, making them suitable for large datasets.

Thirdly, the Zhou approach is also competitive for cost-sensitive learning. However, the generated cost matrices almost never satisfy constraints (19), which implies that it is hard to obtain a consistent cost matrix for Zhou's approach. Together with its higher computational complexity, this limits its application in practice.

Fourthly, GBSE behaves unstably on both the synthetic and benchmark datasets: it performs better than Base on some datasets but worse on others, and it usually has larger standard deviations of costs than the other approaches, especially as the class distribution becomes more balanced. In the implementation, we also observed that, even on the same dataset, the performance of GBSE relative to the Base approach varied dramatically depending on the generated cost matrices. This implies that GBSE is sensitive to the cost matrices, especially on balanced datasets. Note that in [4] the authors showed that a variant of GBSE has a theoretical performance guarantee: it can converge to the global optimum when a weak learning condition for the weak learner is satisfied. However, no theoretical performance guarantee is provided for GBSE itself. To illustrate the performance of GBSE, we constructed learning curves on the syn-b dataset for two inconsistent cost matrices to show its behavior in the training process. The two matrices are as follows:

\[
L_1 = \begin{pmatrix} 0 & 1000 & 195 \\ 1388 & 0 & 209 \\ 1676 & 216 & 0 \end{pmatrix}, \qquad
L_2 = \begin{pmatrix} 0 & 631 & 1608 \\ 1575 & 0 & 819 \\ 1345 & 1827 & 0 \end{pmatrix}.
\]
As shown in Fig. 3(a), in the case of cost matrix L1, the training cost decreases as more weak learners are added. The test cost drops rapidly to its optimum at the third update and then remains stable until the 30th update, with a slight increase. In Fig. 3(b), in the case of cost matrix L2, the training cost continues to fall before the fifth update and then climbs quickly until the 30th update. Accordingly, the test cost declines before the fifth update, then climbs sharply from the fourth update to the sixth update, and becomes stable after the seventh update. The final results of the six approaches for these two cost matrices are:

• For cost matrix L1, the average costs for Base, Tradition, Zhou, MetaCost, GBSE, and Reduction are 259.519, 244.204, 222.898, 199.679, 128.922, and 124.075, respectively.
• For cost matrix L2, the average costs for Base, Tradition, Zhou, MetaCost, GBSE, and Reduction are 224.939, 255.998, 243.137, 216.497, 830.092, and 219.162, respectively.

Comparing Fig. 3(a) with Fig. 3(b), GBSE is clearly sensitive to the cost matrix. With a good cost matrix such as L1, GBSE can achieve better performance. How to measure the goodness of a cost matrix for GBSE is a separate issue beyond the scope of this paper. This sensitivity to the cost matrix limits GBSE's application in practice.

Fifthly, the properties of a dataset affect the performance of classification algorithms (including cost-sensitive classification). On the one hand, the larger the dataset, the more training data are available, and thus the better the classification algorithms usually perform. When the dataset becomes large, as with KDD-99, cost-sensitive algorithms cannot achieve significantly better performance than general classification algorithms, as the performance of the latter is already good enough. On the other hand, the smaller the class ratio, the scarcer the samples of rare classes, and the more easily cost-sensitive approaches outperform general classification approaches. Since cost-sensitive classification approaches are mainly intended to obtain better classification rules on smaller datasets with imbalanced class distributions, our Reduction approach is more applicable to such cost-sensitive tasks.
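The average cost used throughout these comparisons can be computed from a cost matrix and a set of predictions as in the sketch below. It is a minimal illustration rather than the evaluation code behind the reported numbers: the function name, the toy cost matrix and the labels are hypothetical, and cost[t][p] is assumed to be the cost of predicting class p for an example whose true class is t.

```python
def average_cost(true_labels, predicted_labels, cost):
    """Mean misclassification cost over a test set."""
    total = sum(cost[t][p] for t, p in zip(true_labels, predicted_labels))
    return total / len(true_labels)


# Hypothetical 3-class cost matrix with zero cost for correct predictions.
cost = [
    [0, 4, 2],
    [6, 0, 3],
    [8, 5, 0],
]
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 2, 0, 2]

avg = average_cost(y_true, y_pred, cost)
print(avg)  # (0 + 4 + 0 + 3 + 8 + 0) / 6 = 2.5
```

Dividing such a value by the average cost of the baseline classifier on the same split gives a normalized cost of the kind plotted in Figs. 1 and 2.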
5. Conclusion

This paper proposes a reduction approach from cost-sensitive learning to standard classification. We prove that the solution of a cost-sensitive learning problem can be obtained by solving a cost-insensitive problem on a new distribution determined by the cost matrix. This new distribution can be defined by reassigning class membership probabilities of the original distribution. We also propose a new weighting mechanism to solve the reduced cost-insensitive problem, which is based on the theoretical guarantee that the empirical risk on the new distribution is essentially the same as that on the expanded weighted training set. Experimental results on several synthetic and benchmark datasets demonstrate the effectiveness of our proposed approach. We are currently conducting additional experimental studies to further evaluate our approach. Our current work focuses on class-dependent cost-sensitive learning. Future work will aim to develop a unified reduction framework dealing with both class-dependent and example-dependent cost-sensitive learning.
Acknowledgments

This work was supported by the Hi-tech Research and Development Program of China (863) (2008AA01Z121), the National Science Foundation of China (60835002), CAS #2F07C01, CAS #2F05N01, and MOST 2006AA010106.

References

[1] S. Yanmin, K.S. Mohamed, W.K.C. Andrew, W. Yang, Cost-sensitive boosting for classification of imbalanced data, Pattern Recognition 40 (12) (2007) 3358–3378.
[2] B. Zadrozny, C. Elkan, Learning and making decisions when costs and probabilities are both unknown, in: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001, pp. 204–213.
[3] B. Zadrozny, J. Langford, N. Abe, A simple method for cost-sensitive learning, Technical Report, IBM, 2002.
[4] N. Abe, B. Zadrozny, J. Langford, An iterative method for multi-class cost-sensitive learning, in: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 3–11.
[5] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth, Belmont, CA, 1984.
[6] P. Domingos, MetaCost: a general method for making classifiers cost-sensitive, in: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, pp. 155–164.
[7] C. Elkan, The foundations of cost-sensitive learning, in: Proceedings of the 17th International Joint Conference on Artificial Intelligence, 2001, pp. 973–978.
[8] K.M. Ting, An instance-weighting method to induce cost-sensitive trees, IEEE Transactions on Knowledge and Data Engineering 14 (3) (2002) 659–665.
[9] C. Drummond, R.C. Holte, C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling, in: Working Notes of the ICML'03 Workshop on Learning from Imbalanced Data Sets, 2003.
[10] Z.-H. Zhou, X.-Y. Liu, Training cost-sensitive neural networks with methods addressing the class imbalance problem, IEEE Transactions on Knowledge and Data Engineering 18 (1) (2006) 63–77.
[11] Z.-H. Zhou, X.-Y. Liu, On multi-class cost-sensitive learning, in: Proceedings of the 21st National Conference on Artificial Intelligence (AAAI'06), 2006, pp. 567–572.
[12] H. Chernoff, A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations, Annals of Mathematical Statistics 23 (1952) 493–509.
[13] D.J. Newman, S. Hettich, C.L. Blake, C.J. Merz, UCI repository of machine learning databases, Department of Information and Computer Science, University of California, Irvine, CA, 1998, http://www.ics.uci.edu/mlearn/MLRepository.html.
[14] S.D. Bay, UCI KDD archive, Department of Information and Computer Sciences, University of California, Irvine, 2000, http://kdd.ics.uci.edu/.
[15] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, 2001.
[16] L. Li, H.-T. Lin, Ordinal regression by extended binary classification, Advances in Neural Information Processing Systems, vol. 19, MIT Press, Cambridge, MA, 2007.
[17] B. Zadrozny, J. Langford, N. Abe, Cost-sensitive learning by cost-proportionate example weighting, in: Proceedings of the 2003 IEEE International Conference on Data Mining (ICDM'03), 2003.
[18] B. Zadrozny, C. Elkan, Transforming classifier scores into accurate multiclass probability estimates, in: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, July 23–26, 2002.
[19] B. Zadrozny, One-benefit learning: cost-sensitive learning with restricted cost information, in: Proceedings of the 1st International Workshop on Utility-based Data Mining, Chicago, IL, August 21, 2005, pp. 53–58.
[20] J. Langford, A. Beygelzimer, Sensitive error correcting output codes, in: The 18th Annual Conference on Learning Theory, June 27–30, 2005.
[21] L. Mason, J. Baxter, P. Bartlett, M. Frean, Boosting algorithms as gradient descent, Advances in Neural Information Processing Systems 12 (2000) 512–518.
[22] Q. Tao, G.-W. Wu, F.-Y. Wang, J. Wang, Posterior probability support vector machines for unbalanced data, IEEE Transactions on Neural Networks 16 (6) (2005) 1561–1573.
[23] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2007. ISBN 3-900051-07-0, http://www.R-project.org.
About the Author: FEN XIA is an assistant professor in the Key Lab of Complex System and Intelligent Science at the Institute of Automation, Chinese Academy of Sciences. He received his B.S. degree in automation from the University of Science and Technology of China (USTC) in 2003 and his Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences, in 2008. His research interests include statistical machine learning, ranking, regularization methods, efficient algorithms, and information retrieval.

About the Author: YANWU YANG received the Ph.D. degree in computer science from the Doctoral School of the Ecole Nationale Superieure d'Arts et Metiers (ENSAM), France, in 2006. He joined the Lab of Complex Systems and Intelligence Sciences in 2007. His current interests include user modeling, human–computer interaction and text mining.

About the Author: LIANG ZHOU was born in HangZhou, China, on October 4, 1979. He received the B.S. degree from Xi'an Jiaotong University, Xi'an, China, in 2002, and he is currently working towards the Ph.D. degree in machine learning at the CAS Laboratory of Complex Systems and Intelligence Science (LCSIS), Institute of Automation, Chinese Academy of Sciences. His current research interests include pattern recognition, statistical machine learning, and data mining.

About the Author: FUXIN LI is a Ph.D. candidate in the Key Lab of Complex System and Intelligent Science at the Institute of Automation, Chinese Academy of Sciences. He received his Bachelor degree in computer science and technology from Zhejiang University, China, in 2001. His research interests include metric learning, regularization methods, statistical machine learning, learning with existing knowledge, semi-supervised learning and bioinformatics.

About the Author: MIN CAI received her Bachelor degree in automation from Beijing Institute of Fashion Technology, China, in July 2008. She is currently an M.S. candidate in the Institute of Computer Science and Engineer, Beijing Jiaotong University, China. Her research interests include pattern recognition and machine learning.

About the Author: DANIEL DAJUN ZENG received the M.S. and Ph.D. degrees in industrial administration from Carnegie Mellon University, Pittsburgh, PA, and the B.S. degree in economics and operations research from the University of Science and Technology of China, Hefei, China. He is an associate professor and Honeywell Fellow in the Department of Management Information Systems at the University of Arizona, Tucson, AZ, USA. He is also affiliated with the Institute of Automation, the Chinese Academy of Sciences, as a research professor. His research interests include software agents and their applications, social computing, computational support for auctions and negotiations, recommender systems, spatio-temporal data analysis, and security informatics. He has co-edited 10 books and published more than 100 peer-reviewed articles in Information Systems and Computer Science journals, edited books, and conference proceedings. He has received multiple best conference paper awards and teaching awards. His research has been funded by the U.S. NSF, the Chinese Academy of Sciences, the NSFC, and the MOST. He serves on the editorial boards of 10 Information Technology-related journals and has co-edited five special topic issues with major technical journals on the topics of security informatics, e-commerce, and social computing. He is Vice President for Technical Activities for the IEEE Intelligent Transportation Systems Society and Chair of the INFORMS College on Artificial Intelligence.