Double Ramp Loss Based Reject Option Classifier

JMLR: Workshop and Conference Proceedings 1–21


arXiv:1311.6556v1 [cs.LG] 26 Nov 2013

Naresh Manwani Kalpit Desai Sanand Sasidharan Ramasubramanian Sundararajan

[email protected] [email protected] [email protected] [email protected] Data Mining Laboratory, Software Sciences & Analytics, GE Global Research John F. Welch Technology Centre, 122 EPIP, Whitefield Road, Bangalore 560066, India

Abstract

We consider the problem of building classifiers with the option to reject, i.e., not return a prediction on a given test example. Adding a reject option to a classifier is well known in practice; traditionally, this has been accomplished in two different ways. One is the decoupled method, where an optimal base classifier (without the reject option) is built first and the rejection boundary is then optimized, typically in terms of a band around the separating surface. The coupled method finds both the classifier and the rejection band at the same time. Existing coupled approaches are based on minimizing risk under an extension of the classical 0−1 loss function wherein a loss d ∈ (0, .5) is assigned to a rejected example. In this paper, we propose a double ramp loss function which gives a continuous upper bound for the (0−d−1) loss described above. Our coupled approach minimizes regularized risk under the double ramp loss using difference of convex (DC) programming. We show the effectiveness of our approach through experiments on synthetic and benchmark datasets.

1. Introduction

The primary focus of classification problems has been on algorithms that return a prediction on every example in the example space. However, in many real life situations, it may be prudent to reject an example, i.e., not return a prediction, rather than run the risk of a costly potential misclassification. Consider, for instance, a physician who has to return a diagnosis for a patient based on the observed symptoms and a preliminary examination. If the symptoms are either ambiguous, or rare enough to be unexplainable without further investigation, then the physician might choose not to risk misdiagnosing the patient (which might lead to further complications). He might instead ask for further medical tests to be performed, or refer the case to an appropriate specialist. Similarly, a banker, when faced with a loan application from a customer, may choose not to decide on the basis of the available information, and ask for a credit bureau score. These actions can be viewed as akin to a classifier refusing to return a prediction (in this case, a diagnosis) in order to avoid a potential misclassification. While the follow-up actions might vary (asking for more features to describe the example, or using a different classifier), the principal response in these cases is to “reject” the example.

This paper focuses on the manner in which this principal response is decided, i.e., which examples should a classifier reject, and why? From a geometric standpoint, we can view the classifier as being possessed of a decision surface


(which separates points of different classes) as well as a rejection surface (which determines in which regions of the example space to return a prediction, and in which regions to reject). The size of the rejection region affects the proportion of cases that are likely to be rejected by the classifier, as well as the proportion of predicted cases that are likely to be correctly classified. A well-optimized classifier with a reject option is one which minimizes the rejection rate as well as the misclassification rate on the predicted examples.

The analysis of classifiers without a reject option has often been performed from the standpoint of minimizing the expectation of an appropriately defined loss function (risk), the simplest of which is the 0−1 loss function defined below:

$$L_{0-1}(f(x), y) = \begin{cases} 0 & \text{if } yf(x) \ge 0 \\ 1 & \text{if } yf(x) < 0 \end{cases} \qquad (1)$$

where x ∈ Rp is the feature vector and y ∈ {−1, +1} is the class label. The expectation is taken with respect to the joint distribution D(x, y) from which these examples are generated. Since D(x, y) is generally assumed to be fixed but unknown, the empirical risk minimization principle (with its attendant caveats on minimizing complexity or structural risk) is used (Vapnik, 1998). If we assume that a rejection region classifier g(x) is to be built, which returns a value of 1 when a given example is to be rejected and 0 if it is to be classified, then the problem of learning with a reject option can be posed in one of two ways:

1. Minimize the cost of misclassification (as described in equation (1)) for the predicted examples, while keeping the rejection rate below a required rate.

2. Model the cost of rejection as a quantity d that is less than the cost of misclassification, thereby explicitly modeling the fact that one might choose to reject in order to avoid the risk of a costly potential misclassification. The loss function in this case is as below:

$$L^{g}_{0-d-1}(f(x), g(x), y) = \begin{cases} 0 & \text{if } g(x) = 0,\ yf(x) \ge 0 \\ d & \text{if } g(x) = 1 \\ 1 & \text{if } g(x) = 0,\ yf(x) < 0 \end{cases} \qquad (2)$$

If d = 0, we will always reject. When d > .5, we will never reject (because the expected loss of random labeling is 0.5). Thus, we always take d ∈ (0, .5).
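As an illustration, the loss of equation (2) can be written as a small function; this is our own sketch, and the names are not from the paper:

```python
def loss_0_d_1(fx: float, reject: bool, y: int, d: float) -> float:
    """Equation (2): 0 for a correct prediction, d for a rejection,
    1 for a misclassification."""
    if reject:                               # g(x) = 1
        return d
    return 0.0 if y * fx >= 0 else 1.0       # g(x) = 0

print(loss_0_d_1(0.8, False, +1, d=0.2))    # correct prediction: 0.0
print(loss_0_d_1(0.1, True,  +1, d=0.2))    # rejection: d = 0.2
print(loss_0_d_1(-0.8, False, +1, d=0.2))   # misclassification: 1.0
```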

A typical case of Lg0−d−1 is when g(x) (the rejection region classifier) is defined as a band of width ρ around f(x) = 0 (i.e., g(x) = 1 if |f(x)| ≤ ρ and 0 otherwise). Then a reject option classifier h(f(x), ρ) can be formed as follows:

$$h(f(x), \rho) = \begin{cases} 1 & \text{if } f(x) > \rho \\ 0 & \text{if } |f(x)| \le \rho \\ -1 & \text{if } f(x) < -\rho \end{cases} \qquad (3)$$
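Equation (3) is straightforward to implement; a minimal sketch (the function name is ours):

```python
def reject_option_classifier(fx: float, rho: float) -> int:
    """Equation (3): predict +1 or -1 outside the band, 0 (reject) inside it."""
    if fx > rho:
        return 1
    if fx < -rho:
        return -1
    return 0    # |f(x)| <= rho: reject

print(reject_option_classifier(0.9, 0.3))   # 1
print(reject_option_classifier(0.1, 0.3))   # 0 (reject)
```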


L0−d−1 in this case is described as follows (Wegkamp and Yuan, 2011; Herbei and Wegkamp, 2006):

$$L_{0-d-1}(f(x), \rho, y) = \begin{cases} 1 & \text{if } yf(x) < -\rho \\ d & \text{if } |f(x)| \le \rho \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

where ρ is the parameter which determines the rejection region. We shall focus on this loss function for the remainder of this paper. For 2-class problems, the risk under L0−d−1 is minimized by the generalized Bayes discriminant (Herbei and Wegkamp, 2006; Chow, 1970), given below:

$$f_d^{*}(x) = \begin{cases} -1 & \text{if } P(y=1|x) < d \\ 0 & \text{if } d \le P(y=1|x) \le 1-d \\ 1 & \text{if } P(y=1|x) > 1-d \end{cases} \qquad (5)$$
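The generalized Bayes rule of equation (5) simply thresholds the posterior; a hedged sketch, assuming the posterior P(y = 1|x) is available (in practice it must be estimated):

```python
def bayes_reject_rule(p_pos: float, d: float) -> int:
    """Equation (5): reject (return 0) when the posterior lies in [d, 1-d]."""
    if p_pos < d:
        return -1
    if p_pos > 1.0 - d:
        return 1
    return 0    # ambiguous posterior: reject

print(bayes_reject_rule(0.1, d=0.2))   # -1
print(bayes_reject_rule(0.5, d=0.2))   # 0 (reject)
print(bayes_reject_rule(0.9, d=0.2))   # 1
```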

If one wishes to build a classifier that minimizes the average of the loss function described in equation (4) for a given training sample, two approaches can be followed, namely the decoupled and coupled approaches. Details of these two approaches are as follows.

Decoupled Approach The decoupled approach separates the problem into two tasks: finding the classifier and finding the rejection boundary. That is, the decoupled scheme first finds a classifier f(x) under the assumption that nothing is to be rejected. Then it builds a rejection boundary around the base classifier. In general, a band ρ around f(x) = 0 is found so that the risk under L0−d−1 (described in (4)) is minimized. f(x) can be chosen as the minimizer of risk under any convex surrogate of L0−1. The classifier h(f(x), ρ) (equation (3)) is shown to be infinite sample consistent with respect to the generalized Bayes classifier fd∗(x) described in equation (5) (Yuan and Wegkamp, 2010).

Coupled Approach The coupled approach directly minimizes the loss function in equation (4). The coupled rejection scheme views the solution surface as a pair of parallel surfaces (with the rejection area in between), wherein f(x) as well as ρ are to be determined simultaneously. The most common approach used for the coupled rejection scheme in the literature is the risk minimization based framework. Table 1 lists some of the loss functions specifically designed for learning a reject option classifier. Fumera and Roli (2002); Sundararajan and Pal (2004) have also proposed learning algorithms for reject option classifiers in the coupled way. Analogous to the convex surrogates of L0−1, convex surrogates for L0−d−1 have also been proposed. Wegkamp and Yuan (2011); Wegkamp (2007); Bartlett and Wegkamp (2008) discuss risk minimization based on the generalized hinge loss LGH (see Table 1) for learning a reject option classifier.
It is shown that a minimizer of risk under LGH is consistent with the generalized Bayes classifier fd∗ (Bartlett and Wegkamp, 2008). Grandvalet et al. (2008) propose a risk minimization scheme under the double hinge loss LDH (see Table 1) and show that the resulting classifier is strongly universally consistent with the generalized Bayes classifier fd∗. We observe that these convex loss functions have some limitations. For example, LGH is a convex upper bound to L0−d−1 only if ρ < 1 − d, and LDH only if ρ ∈ ((1 − H(d))/(1 − d), (H(d) − d)/d) (see Figure 1).


Loss Function        Definition

Generalized Hinge    $$L_{GH}(f(x), y) = \begin{cases} 1 - \frac{1-d}{d}\,yf(x) & \text{if } yf(x) < 0 \\ 1 - yf(x) & \text{if } 0 \le yf(x) < 1 \\ 0 & \text{otherwise} \end{cases}$$

Double Hinge         $$L_{DH}(f(x), y) = \max\big[-y(1-d)f(x) + H(d),\ -y\,d\,f(x) + H(d),\ 0\big]$$
                     where $H(d) = -d\log(d) - (1-d)\log(1-d)$

Table 1: Existing loss functions for learning classifiers with reject option.


Figure 1: Generalized hinge (GH) and double hinge (DH) losses for reject option classification. In both figures the value of d is kept at 0.2. (a) For ρ = 0.7, both losses upper bound the L0−d−1 (0−d−1) loss. (b) For ρ = 2, both losses fail to upper bound the L0−d−1 loss. In both cases the losses increase linearly in the rejection region rather than remaining flat.

Also, both LGH and LDH increase linearly in the rejection region instead of remaining constant. These convex losses can also become unbounded for misclassified examples with the scaling of the parameters of f.

In this paper, we consider the coupled approach in the context of a support vector machine (SVM). SVM is based on risk minimization under the hinge loss function, which is a convex upper bound on L0−1. It is well known for its generalization ability on nonlinear problems. However, SVM and other convex loss based classification algorithms are not robust to label noise in the data (Manwani and Sastry, 2013). Recent results show that




ramp loss based risk minimization for classifier learning is more robust to noise (Wu and Liu, 2007) and gives sparser solutions compared to hinge loss based SVM. This makes it more suitable for scalability (Collobert et al., 2006; Ong and An, 2012). While learning a reject option classifier as well, one has to deal with the ambiguity in the classification due to overlapping class regions as well as the presence of outliers. This motivates us to generalize the ramp loss function so that it incorporates a different loss value for the rejection region.

To accomplish this, we propose a new loss function which we call the double ramp loss (LDR). LDR forms a continuous non-convex upper bound for L0−d−1 and addresses many of the drawbacks of the convex loss functions (LGH and LDH). To learn a reject option classifier, we minimize the regularized risk under the double ramp loss, which becomes an instance of a difference of convex (DC) functions. To minimize such a DC function, we use the difference of convex programming approach (An and Tao, 1997), which essentially solves a sequence of convex programs. Our approach can be easily kernelized for dealing with nonlinear problems.

The main contribution of our paper is a novel formulation for the problem of learning a classifier with a reject option. The proposed coupled formulation improves on the existing coupled approaches in the following ways: (1) the proposed loss function LDR gives a tighter upper bound to the 0-d-1 loss function; (2) LDR requires no constraint on ρ (the width of the rejection region), unlike LGH and LDH; (3) the proposed algorithm based on minimizing risk under LDR results in a smaller number of support vectors.

The rest of the paper is organized as follows. In Section 2 we define the double ramp loss function (LDR) and discuss its properties. We then present the proposed formulation based on risk minimization under LDR.
In Section 3 we derive the algorithm for learning a reject option classifier based on regularized risk minimization under LDR using DC programming. We present experimental results in Section 4. We conclude the paper with a discussion and insights for future work in Section 5.

2. Proposed Approach

Our approach for learning a classifier with reject option is based on minimizing regularized risk under the double ramp loss function.

2.1. Double Ramp Loss

We define the double ramp loss function as a continuous upper bound for L0−d−1. This loss function is defined as a sum of two ramp loss functions as follows:

$$L_{DR}(f(x), \rho, y) = \frac{d}{\mu}\Big(\big[\mu - yf(x) + \rho\big]_+ - \big[-\mu^2 - yf(x) + \rho\big]_+\Big) + \frac{1-d}{\mu}\Big(\big[\mu - yf(x) - \rho\big]_+ - \big[-\mu^2 - yf(x) - \rho\big]_+\Big) \qquad (6)$$

where [a]+ = max(0, a). Here µ > 0 defines the slope of the ramps in the loss function. In this paper, we take µ ∈ (0, 1]. d ∈ (0, .5) is the cost of rejection and ρ ≥ 0 is the parameter which defines the size of the rejection region around the classification boundary f(x) = 0.¹

1. While LDR is parametrized by µ and d as well, we omit them for the sake of notational consistency.
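As a numerical sanity check of equation (6), the following sketch (ours; the helper names are not from the paper) verifies on a grid that LDR upper bounds L0−d−1 and that it stays constant at d(1 + µ) in the rejection band:

```python
import numpy as np

def ramp_plus(a):
    """[a]_+ = max(0, a), elementwise."""
    return np.maximum(0.0, a)

def double_ramp_loss(z, rho, d, mu):
    """Equation (6), written in terms of the margin z = y f(x)."""
    first  = (d / mu) * (ramp_plus(mu - z + rho) - ramp_plus(-mu**2 - z + rho))
    second = ((1.0 - d) / mu) * (ramp_plus(mu - z - rho) - ramp_plus(-mu**2 - z - rho))
    return first + second

def loss_0_d_1(z, rho, d):
    """Equation (4), written in terms of the margin z = y f(x)."""
    return np.where(z < -rho, 1.0, np.where(np.abs(z) <= rho, d, 0.0))

d, rho, mu = 0.2, 2.0, 1.0
z = np.linspace(-5.0, 5.0, 2001)
# LDR upper bounds the 0-d-1 loss everywhere on the grid.
assert np.all(double_ramp_loss(z, rho, d, mu) >= loss_0_d_1(z, rho, d) - 1e-12)
print(double_ramp_loss(0.0, rho, d, mu))    # flat part of the band: d*(1+mu)
print(double_ramp_loss(-10.0, rho, d, mu))  # bounded above by 1+mu
```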



Figure 2: Double Ramp (DR) Loss vs. 0−d−1 Loss, for d = 0.2, ρ = 2, and µ ∈ {1, 0.5, 0.1}. For every µ > 0 and ρ ≥ 0, the DR loss forms an upper bound on the 0−d−1 loss.

As in L0−d−1, LDR also treats the region [−ρ, ρ] as the rejection region. Figure 2 shows the LDR (double ramp) loss for d = 0.2 and ρ = 2 for different values of µ.

Theorem 1
(i) LDR ≥ L0−d−1, ∀µ > 0, ρ ≥ 0.
(ii) limµ→0 LDR(f(x), ρ, y) = L0−d−1(f(x), ρ, y).
(iii) In the rejection region yf(x) ∈ (−ρ + µ, ρ − µ²), the loss remains constant, that is, LDR(f(x), ρ, y) = d(1 + µ).
(iv) For µ > 0, LDR ≤ (1 + µ), ∀ρ ≥ 0, ∀d ≥ 0.
(v) When ρ = 0, LDR is the same as the µ-ramp loss used for classification problems without a reject option.
(vi) LDR is a non-convex function of (yf(x), ρ).

The proof of Theorem 1 is provided in the supplementary material in Appendix A. We see that LDR does not put any restriction on ρ for it to be an upper bound of L0−d−1. Moreover, when ρ = 0, LDR behaves the same as the usual ramp loss used for classification without rejection. Thus, LDR is a general ramp loss function which also allows a reject option.

2.2. Formulation Using LDR

Let S = {(xn, yn), n = 1 . . . N} be the training dataset, where xn ∈ Rp and yn ∈ {−1, +1}, ∀n. As discussed, we minimize regularized risk under LDR to find a reject option classifier. In this paper, we use l2 regularization. Thus, for f(x) = wTφ(x) + b, the regularized risk under



the double ramp loss is

$$R(w, b, \rho) = \frac{1}{2}\|w\|^2 + C\sum_{n=1}^{N} L_{DR}\big(w^T\phi(x_n) + b,\ \rho,\ y_n\big)$$
$$= \frac{1}{2}\|w\|^2 + \frac{C}{\mu}\sum_{n=1}^{N}\Big\{ d\big[\mu - y_n(w^T\phi(x_n)+b) + \rho\big]_+ - d\big[-\mu^2 - y_n(w^T\phi(x_n)+b) + \rho\big]_+ + (1-d)\big[\mu - y_n(w^T\phi(x_n)+b) - \rho\big]_+ - (1-d)\big[-\mu^2 - y_n(w^T\phi(x_n)+b) - \rho\big]_+ \Big\} \qquad (7)$$

where C is the regularization parameter. In our approach we learn the parameter ρ along with (w, b); C, d, and µ are kept as user-defined parameters. The method for solving this formulation is described in Section 3.
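The regularized risk of equation (7) can be evaluated directly; a minimal sketch with a linear feature map φ(x) = x (the example numbers are ours, and the loss is restated inline so the snippet is self-contained):

```python
import numpy as np

def double_ramp_loss(z, rho, d, mu):
    """Equation (6) with margin z = y f(x)."""
    rp = lambda a: max(0.0, a)
    return (d / mu) * (rp(mu - z + rho) - rp(-mu**2 - z + rho)) \
         + ((1.0 - d) / mu) * (rp(mu - z - rho) - rp(-mu**2 - z - rho))

def regularized_risk(w, b, rho, X, y, C, d, mu):
    """Equation (7) with a linear feature map phi(x) = x."""
    margins = y * (X @ w + b)
    return 0.5 * float(w @ w) + C * sum(double_ramp_loss(z, rho, d, mu) for z in margins)

X = np.array([[2.0, 0.0], [0.0, 0.0]])
y = np.array([1.0, 1.0])
w = np.array([1.0, 0.0])
print(regularized_risk(w, b=0.0, rho=0.5, X=X, y=y, C=1.0, d=0.2, mu=1.0))  # approx 1.2
```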

3. Solution methodology

R(w, b, ρ) (equation (7)) is a nonconvex function of (w, b, ρ). However, R(w, b, ρ) can be decomposed as a difference of convex functions, R(w, b, ρ) = R1(w, b, ρ) − R2(w, b, ρ), where

$$R_1(w, b, \rho) = \frac{1}{2}\|w\|^2 + \frac{C}{\mu}\sum_{n=1}^{N}\Big[ d\big[\mu - y_n(w^T\phi(x_n)+b) + \rho\big]_+ + (1-d)\big[\mu - y_n(w^T\phi(x_n)+b) - \rho\big]_+ \Big]$$

$$R_2(w, b, \rho) = \frac{C}{\mu}\sum_{n=1}^{N}\Big[ d\big[-\mu^2 - y_n(w^T\phi(x_n)+b) + \rho\big]_+ + (1-d)\big[-\mu^2 - y_n(w^T\phi(x_n)+b) - \rho\big]_+ \Big]$$

We can easily verify that both R1 and R2 are convex functions of (w, b, ρ). Thus R is an instance of a difference of two convex (DC) functions. We develop our algorithm to exploit this structure of R(w, b, ρ) using DC programs, which we describe next.

3.1. Difference of Convex Programming

When a nonconvex function is represented as a difference of two convex functions, finding a local optimum of the nonconvex function becomes computationally simpler (An and Tao, 1997). A DC optimization problem is (An and Tao, 1997)

$$\min_{\Theta} R(\Theta) = \min_{\Theta}\ R_1(\Theta) - R_2(\Theta)$$


where R1(Θ) and R2(Θ) are convex functions of Θ. In the simplified DC algorithm (An and Tao, 1997), an upper bound on R(Θ) is found using the convexity of R2(Θ) as follows:

$$R(\Theta) = R_1(\Theta) - R_2(\Theta) \le R_1(\Theta) - R_2(\Theta^{(l)}) - (\Theta - \Theta^{(l)})^T \nabla R_2(\Theta^{(l)}) =: ub(\Theta, \Theta^{(l)})$$

where Θ(l) is the parameter vector after the l-th iteration, ∇R2(Θ(l)) is a subgradient of R2 at Θ(l), and ub(Θ, Θ(l)) is the resulting upper bound on R(Θ). Now Θ(l+1) is found by minimizing ub(Θ, Θ(l)). Thus,

$$R(\Theta^{(l+1)}) \le ub(\Theta^{(l+1)}, \Theta^{(l)}) \le ub(\Theta^{(l)}, \Theta^{(l)}) = R(\Theta^{(l)})$$

Algorithm 1: DC Algorithm for Minimizing Rreg(Θ)
  Initialize Θ(0);
  repeat
    Θ(l+1) = arg min_Θ R1(Θ) − ΘT ∇R2(Θ(l))
  until convergence of Θ(l);
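Algorithm 1 can be illustrated on a one-dimensional toy problem of our own (not the paper's): R1(θ) = θ², R2(θ) = |θ − 1|. Linearizing R2 at the current iterate, each subproblem arg min_θ θ² − gθ has the closed form θ = g/2:

```python
def dca(theta0: float, iters: int = 50) -> float:
    """Simplified DC algorithm on R(theta) = theta**2 - |theta - 1|."""
    theta = theta0
    for _ in range(iters):
        # Subgradient of R2(theta) = |theta - 1| at the current iterate.
        g = 1.0 if theta >= 1.0 else -1.0
        # Minimize the convex upper bound: argmin theta**2 - g*theta = g/2.
        theta = g / 2.0
    return theta

print(dca(0.0))   # converges to -0.5, which here is the global minimum
```

Note that, as the paper states, DCA only guarantees a monotone decrease of the objective; on this toy problem the fixed point happens to be the global minimizer.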

3.2. Learning Reject Option Classifier Using DC Programming

In this section, we derive a DC algorithm for learning a reject option classifier. The DC algorithm minimizes the regularized risk R(w, b, ρ) described in the previous section. Let Θ = [wT b ρ]T. We initialize with Θ = Θ(0). For any l ≥ 0, we find an upper bound ub(Θ, Θ(l)) for R(Θ) as follows:

$$ub(\Theta, \Theta^{(l)}) = R_1(\Theta) - R_2(\Theta^{(l)}) - (\Theta - \Theta^{(l)})^T \nabla R_2(\Theta^{(l)})$$

Given Θ(l), we find Θ(l+1) by minimizing the upper bound ub(Θ, Θ(l)):

$$\Theta^{(l+1)} \in \arg\min_{\Theta}\ R_1(\Theta) - \Theta^T \nabla R_2(\Theta^{(l)}) \qquad (8)$$

where ∇R2(Θ(l)) is the subgradient of R2(Θ) at Θ(l). Here, we choose ∇R2(Θ(l)) as follows:

$$\nabla R_2(\Theta^{(l)}) = \frac{C}{\mu}\Big( d \sum_{x_n \in V_1^{(l)}} \big[-y_n\phi(x_n)^T\ \ -y_n\ \ 1\big]^T + (1-d) \sum_{x_n \in V_2^{(l)}} \big[-y_n\phi(x_n)^T\ \ -y_n\ \ -1\big]^T \Big)$$

where

$$V_1^{(l)} = \{x_n \mid y_n(\phi(x_n)^T w^{(l)} + b^{(l)}) - \rho^{(l)} < -\mu^2\}$$
$$V_2^{(l)} = \{x_n \mid y_n(\phi(x_n)^T w^{(l)} + b^{(l)}) + \rho^{(l)} < -\mu^2\} \qquad (9)$$
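The sets V1 and V2 of equation (9) reduce to elementwise margin tests; a sketch (the names and example numbers are ours):

```python
import numpy as np

def active_sets(margins, rho, mu):
    """Boolean membership masks for V1 and V2 of equation (9),
    given the margins y_n f(x_n)."""
    V1 = margins - rho < -mu**2
    V2 = margins + rho < -mu**2
    return V1, V2

m = np.array([2.0, -0.6, -2.0])        # example margins y_n f(x_n)
V1, V2 = active_sets(m, rho=0.5, mu=1.0)
print(V1.tolist())   # [False, True, True]
print(V2.tolist())   # [False, False, True]
```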




We rewrite the upper bound minimization problem described in equation (8) as follows:

$$P^{(l+1)} = \min_{\Theta}\ R_1(\Theta) - \Theta^T \nabla R_2(\Theta^{(l)})$$
$$= \min_{w,b,\rho \ge 0}\ \frac{1}{2}\|w\|^2 + \frac{C}{\mu}\sum_{n=1}^{N}\Big[ d\big[\mu - y_n(w^T\phi(x_n)+b) + \rho\big]_+ + (1-d)\big[\mu - y_n(w^T\phi(x_n)+b) - \rho\big]_+ \Big] + \frac{C}{\mu}\Big[ d \sum_{x_n \in V_1^{(l)}} \big(y_n(w^T\phi(x_n)+b) - \rho\big) + (1-d) \sum_{x_n \in V_2^{(l)}} \big(y_n(w^T\phi(x_n)+b) + \rho\big) \Big]$$

Note that P(l+1) is a convex optimization problem whose optimization variables are (w, b, ρ). We can rewrite P(l+1) as

$$\min_{w,b,\xi',\xi'',\rho \ge 0}\ \frac{1}{2}\|w\|^2 + \frac{C}{\mu}\sum_{n=1}^{N}\big(d\,\xi_n' + (1-d)\,\xi_n''\big) + \frac{Cd}{\mu}\sum_{x_n \in V_1^{(l)}} \big(y_n(w^T\phi(x_n)+b) - \rho\big) + \frac{C(1-d)}{\mu}\sum_{x_n \in V_2^{(l)}} \big(y_n(w^T\phi(x_n)+b) + \rho\big)$$
$$\text{s.t.}\quad y_n(w^T\phi(x_n)+b) \ge \rho + \mu - \xi_n',\ \forall n; \qquad y_n(w^T\phi(x_n)+b) \ge -\rho + \mu - \xi_n'',\ \forall n; \qquad \xi_n' \ge 0,\ \xi_n'' \ge 0,\ \forall n$$

where ξ' = [ξ'1 ξ'2 . . . ξ'N]T and ξ'' = [ξ''1 ξ''2 . . . ξ''N]T. The dual optimization problem D(l+1) of P(l+1) is as follows:²

$$D^{(l+1)} = \min_{\beta',\beta''}\ \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} y_n y_m (\beta_n' + \beta_n'')(\beta_m' + \beta_m'')\, k(x_n, x_m) - \mu \sum_{n=1}^{N} (\beta_n' + \beta_n'')$$
$$\text{s.t.}\quad 0 \le \beta_n' \le \frac{Cd}{\mu}\ \forall x_n \in S \setminus V_1^{(l)}; \qquad \beta_n' = 0\ \forall x_n \in V_1^{(l)}$$
$$\phantom{\text{s.t.}}\quad 0 \le \beta_n'' \le \frac{C(1-d)}{\mu}\ \forall x_n \in S \setminus V_2^{(l)}; \qquad \beta_n'' = 0\ \forall x_n \in V_2^{(l)}$$
$$\phantom{\text{s.t.}}\quad \sum_{n=1}^{N} y_n(\beta_n' + \beta_n'') = 0; \qquad \sum_{n=1}^{N} \beta_n' \ge \sum_{n=1}^{N} \beta_n'' \qquad (10)$$

where β' = [β'1 β'2 . . . β'N]T and β'' = [β''1 β''2 . . . β''N]T are the dual variables. Using the KKT conditions of optimality for P(l+1), the normal vector w is represented as

$$w = \sum_{n=1}^{N} y_n(\beta_n' + \beta_n'')\phi(x_n)$$

Since P(l+1) is a convex optimization problem with a quadratic objective function and linear constraints, strong duality holds with its dual problem D(l+1). Solving the dual is more useful as it can be easily kernelized for nonlinear problems.

2. The KKT conditions for P(l+1) and the derivation of D(l+1) are provided in the supplementary material (Appendix B).


Condition                              β'n ∈              β''n ∈
yn(wTφ(xn)+b) ∈ (ρ+µ, ∞)              0                  0
yn(wTφ(xn)+b) = ρ+µ                   (0, Cd/µ)          0
yn(wTφ(xn)+b) ∈ [ρ−µ², ρ+µ)           Cd/µ               0
yn(wTφ(xn)+b) ∈ (−ρ+µ, ρ−µ²)          0                  0
yn(wTφ(xn)+b) = −ρ+µ                  0                  (0, C(1−d)/µ)
yn(wTφ(xn)+b) ∈ [−ρ−µ², −ρ+µ)         0                  C(1−d)/µ
yn(wTφ(xn)+b) ∈ (−∞, −ρ−µ²)           0                  0

Table 2: Behavior of β' and β''

3.3. Behavior of the Dual Variables β' and β''

The KKT conditions for optimality give the following insights about the dual variables. For any xn, only one of β'n and β''n can be nonzero. We observe that the orientation (w) and the distance between the two parallel hyperplanes (wTx + b = −ρ and wTx + b = ρ) are determined by the points close to these hyperplanes; in other words, the points whose margin (yf(x)) is in the range [ρ−µ², ρ+µ] ∪ [−ρ−µ², −ρ+µ]. We call these points support vectors. This is in line with our insight that the coupled scheme treats the solution as a pair of parallel hyperplanes rather than one hyperplane with a band around it.

We also see that for all points whose margin (yf(x)) falls in the region (ρ+µ, ∞) ∪ (−ρ+µ, ρ−µ²) ∪ (−∞, −ρ−µ²), both β'n and β''n are zero. Thus, points which are correctly classified with margin at least (ρ+µ), points falling close to the decision boundary with margin in the interval (−ρ+µ, ρ−µ²), and points which are misclassified with a high negative margin (less than −ρ−µ²) are not considered in the final classifier. Thus, our approach not only rejects points which fall in the overlapping region of the two classes, it is also unaffected by potential outliers.

We illustrate this insight through experiments on a synthetic dataset, shown in Figure 3. 400 points are uniformly sampled from the square region [0 1] × [0 1]. We consider the diagonal passing through the origin as the separating surface and assign labels {−1, +1} to all the points using it. We then changed the labels of 80 points inside a band (width = 0.225) around the separating surface. Figure 4(a) shows the reject option classifier learnt using the proposed method. We see that the proposed approach learns the rejection region accurately. We also observe that the number of support vectors is small (87) and all of them are near the two parallel hyperplanes.
This is in accordance with our discussion of the properties of the proposed method. On the other hand, the decoupled approach finds a rejection region that is less accurate, as shown in Figure 4(b). Also, the number of support vectors with the decoupled approach is much larger (192) than with the proposed method.

3.4. Finding b(l+1) and ρ(l+1)

The dual optimization problem above gives the dual variables β(l+1), using which the normal vector is found as w(l+1) = Σn (β'n(l+1) + β''n(l+1)) yn φ(xn). To find b(l+1) and ρ(l+1), we


Figure 3: A classification problem where examples are corrupted by label noise in a band near the classification boundary. The two classes are represented using empty circles and triangles.
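A dataset like the one in Figure 3 can be generated along these lines; the paper flips 80 labels in a band of width 0.225, and the exact flipping protocol below (random flips of roughly half the band) is our assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(400, 2))          # 400 points in the unit square
y = np.where(X[:, 1] > X[:, 0], 1, -1)            # label by the diagonal x2 = x1

# Perpendicular distance to the diagonal; the band half-width 0.225/2 is
# our interpretation of "band (width = 0.225)".
dist = np.abs(X[:, 1] - X[:, 0]) / np.sqrt(2.0)
in_band = dist < 0.225 / 2.0

flip = in_band & (rng.uniform(size=400) < 0.5)    # flip about half of the band
y[flip] = -y[flip]
print(X.shape, int(in_band.sum()), int(flip.sum()))
```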

consider xn ∈ SV1(l+1) ∪ SV2(l+1), where

$$SV_1^{(l+1)} = \{x_n \mid y_n(\phi(x_n)^T w^{(l+1)} + b^{(l+1)}) = \rho^{(l+1)} + \mu\}$$
$$SV_2^{(l+1)} = \{x_n \mid y_n(\phi(x_n)^T w^{(l+1)} + b^{(l+1)}) = -\rho^{(l+1)} + \mu\}$$

We also observe that:

1. If xn ∈ SV1(l+1), then β'n(l+1) ∈ (0, Cd/µ) and β''n(l+1) = 0.
2. If xn ∈ SV2(l+1), then β'n(l+1) = 0 and β''n(l+1) ∈ (0, C(1−d)/µ).

We solve the system of linear equations corresponding to the sets SV1(l+1) and SV2(l+1) to identify b(l+1) and ρ(l+1).
3.5. Summary of the Algorithm

Our algorithm for learning a classifier with reject option is as follows. We fix d ∈ (0, .5), µ ∈ (0, 1], and C, and initialize the parameter vector Θ as Θ(0). We then find the sets V1(0) and V2(0) (see equation (9)) using Θ(0), and use them to solve the optimization problem D(1). Using the dual variables, we find the normal vector w(1) = Σn yn(β'n(1) + β''n(1))φ(xn).


Figure 4: Results on the synthetic dataset. (a) shows the reject option classifier learnt using the proposed approach (with C = 64, µ = 1, and d = .2). (b) shows the classifier learnt using the SVM based decoupled approach (with C = 64 and d = .2). Filled circles and triangles represent the support vectors. The proposed algorithm finds the rejection region accurately with a smaller number of support vectors.

We find b(1) and ρ(1) as described in Section 3.4. This gives us the new parameter vector Θ(1). Using Θ(1), we then find V1(1) and V2(1). These two steps are repeated until there is no significant decrease in R(w, b, ρ). A more formal description of our algorithm is given in Algorithm 2.
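In the simplest case of one support vector on each margin surface, the step of Section 3.4 reduces to a 2 × 2 linear system in (b, ρ); a sketch with toy numbers of our own:

```python
import numpy as np

# One support vector on each margin surface (toy numbers, linear phi(x) = x):
w = np.array([1.0, 0.0])
x1, y1 = np.array([0.8, 0.0]), 1.0    # satisfies y1 (w.x1 + b) = rho + mu
x2, y2 = np.array([0.2, 0.0]), 1.0    # satisfies y2 (w.x2 + b) = -rho + mu
mu = 1.0

# Unknowns (b, rho):  y1*b - rho = mu - y1*(w.x1)
#                     y2*b + rho = mu - y2*(w.x2)
A = np.array([[y1, -1.0],
              [y2,  1.0]])
rhs = np.array([mu - y1 * (w @ x1),
                mu - y2 * (w @ x2)])
b, rho = np.linalg.solve(A, rhs)
print(b, rho)   # recovers b = 0.5, rho = 0.3 up to floating-point error
```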

4. Experimental Results

In this section we compare our approach with a decoupled approach in which the base classifier is learnt using SVM. We could not find sufficient experimental results in the literature for coupled reject option classifiers (see Wegkamp and Yuan (2011); Wegkamp (2007); Bartlett and Wegkamp (2008); Grandvalet et al. (2008)). Therefore, we have only been able to provide some illustrative results in comparison with the decoupled approach.

4.1. Dataset Description

We report experimental results on two datasets taken from the UCI ML repository (Bache and Lichman, 2013), which are described in Table 3.



Algorithm 2: Learning Classifier with Rejection Option
Input: d ∈ (0, .5), µ ∈ (0, 1], C > 0, S
Output: w∗, b∗, ρ∗
begin
  1. Initialize w(0), b(0), ρ(0), l = 0.
  2. Find V1(l) and V2(l) as
       V1(l) = {xn | yn(φ(xn)T w(l) + b(l)) − ρ(l) < −µ²}
       V2(l) = {xn | yn(φ(xn)T w(l) + b(l)) + ρ(l) < −µ²}
  3. Find β'(l+1), β''(l+1) ∈ arg min_{β',β''} D(l+1), where D(l+1) is described in Eq. (10).
  4. Find w(l+1) = Σn yn(β'n(l+1) + β''n(l+1))φ(xn).
  5. Find b(l+1) and ρ(l+1) by solving the system of linear equations corresponding to the sets SV1(l+1) and SV2(l+1), where
       SV1(l+1) = {xn | yn(φ(xn)T w(l+1) + b(l+1)) = ρ(l+1) + µ}
       SV2(l+1) = {xn | yn(φ(xn)T w(l+1) + b(l+1)) = −ρ(l+1) + µ}
  6. Repeat steps 2-5 until SV1(l) = SV1(l+1) and SV2(l) = SV2(l+1).
  7. Return (w∗, b∗, ρ∗) = (w(l+1), b(l+1), ρ(l+1)).
end

Dataset                  # Points   Dimension   Class Dist.
Pima Indians Diabetes    768        8           268/500
Heart                    269        13          150/119

Table 3: Description of Real Datasets Used from UCI ML Repository

4.2. Experimental Setup

We implement our approach in R. For solving the dual D(l) at every iteration, we use the kernlab package (Karatzoglou et al., 2004) in R. In our experiments, we use a linear kernel for all the datasets. Our approach has three user-defined parameters: µ ∈ (0, 1] for the slope of the ramps, d ∈ (0, 0.5) as the loss for rejection, and C as the regularization parameter. In all our experiments, we fix the value of µ to 1. We find the value of C by 10-fold cross-validation. Because the base SVM classifier for the decoupled approach is learnt without the influence of d, the value of C obtained by 10-fold CV is the same across all values



of d. However, for the coupled approach, we obtain the optimal value of C separately for each value of d. For every dataset, we report results for values of d in the interval [0.05, .5] with a step size of 0.05. For every value of d, we report the 10-fold cross-validation % error (under the L0−d−1 loss), % accuracy over the non-rejected examples (Acc), % rejection rate (RR), and number of support vectors (nSV).

4.3. Simulation Results

We now discuss the experimental results. Tables 4-5 show the experimental results on the two datasets. We observe the following:

1. The reject option classifier learnt using the proposed method typically requires a smaller number of support vectors compared to the decoupled approach. When there is label noise around the separating hyperplane (which can also happen due to overlapping class conditional densities), our approach tries to approximate this noisy region as the rejection region in between two parallel hyperplanes. Our approach ignores (1) points in the interval (−ρ+µ, ρ−µ²) (in the rejection band) and (2) points in the interval (−∞, −ρ−µ²) (points misclassified with a high negative margin) while forming the classifier. Thus, these points do not become support vectors, and hence the proposed approach is able to learn reject option classifiers with a sparse representation. On the other hand, these noisy points are hard to classify using a standard SVM classifier and hence become support vectors; the SVM based decoupled approach thus results in a larger number of support vectors.

2. Most of the time, the average rejection rate of the proposed method is smaller than that of the decoupled approach, while the average accuracy on the non-rejected examples is comparable. This is also expected because of the constant penalty that the double ramp loss puts on points misclassified with a high negative margin.

3. We see that the average loss achieved by our approach is comparable to the SVM based decoupled approach. In principle, it seems intuitive that optimizing the orientation and rejection bandwidth of a separating hyperplane together (i.e., the coupled approach) is likely to arrive at a lower empirical risk value than optimizing the orientation of the separating hyperplane (assuming no rejection) and then optimizing the rejection bandwidth given this orientation (i.e., the decoupled approach). However, it is also intuitive that this is a tougher optimization problem to solve, and results may be comparable in practice.

5. Conclusion and Future Work

In this paper, we have proposed a new loss function, LDR (double ramp loss), for learning a reject option classifier. Our approach learns the classifier by minimizing the regularized risk under the double ramp loss function, which becomes an instance of a DC optimization problem. Our approach can also learn nonlinear classifiers by using an appropriate kernel


          Coupled Algorithm                       De-coupled Algorithm
d     C      Acc    RR    Loss   nSV        Acc    RR    Loss   nSV
0.05  2^0    96.9   64.1  4.3    103.2      100.0  63.3  3.2    86.9
0.10  2^3    92.0   48.9  9.3    54.0       97.7   45.2  6.0    86.9
0.15  2^3    91.4   38.5  10.9   56.1       97.5   42.6  8.2    86.9
0.20  2^8    91.3   26.7  12.0   29.3       95.1   32.2  10.1   86.9
0.25  2^0    87.9   17.0  14.3   86.6       92.3   24.8  11.8   86.9
0.30  2^1    85.8   9.3   15.7   62.8       91.7   22.2  13.0   86.9
0.35  2^5    84.4   0.0   15.6   20.1       89.7   15.2  13.8   86.9
0.40  2^10   84.7   0.0   15.9   14.0       87.6   7.8   14.6   86.9
0.45  2^0    83.3   0.0   16.7   34.4       87.6   7.8   15.0   86.9
0.50  2^2    83.3   0.0   16.7   23.3       87.0   6.3   15.4   86.9

Table 4: Comparison Results on Heart Dataset (Optimal C = 64 for decoupled approach)

      Coupled Algorithm                       De-coupled Algorithm
d     Acc    RR    Loss   nSV          Acc    RR    Loss   nSV
0.05  95.4   78.3  5.1    60.8         88.3   87.5  4.8    359.5
0.10  93.0   68.7  9.1    72           90.2   80.1  8.9    359.5
0.15  87.5   52.0  14.1   205.5        91.5   61.2  12.6   359.5
0.20  87.1   42.7  16.1   128.5        89.9   45.4  15.3   359.5
0.25  84.9   30.4  18.0   104.8        86.9   34.6  17.2   359.5
0.30  84.8   31.9  20.1   102.6        85.8   28.7  18.9   359.5
0.35  83.8   23.0  20.7   131.8        84.4   22.8  20.2   359.5
0.40  79.9   13.8  22.8   172.1        84.0   21.4  21.3   359.5
0.45  78.2   6.1   23.3   150.1        81.3   12.2  22.2   359.5
0.50  75.8   0.0   24.2   114.0        79.6   7.5   22.8   359.5

Table 5: Comparison Results on Pima Indian Diabetes Dataset (Optimal C = 1 for both approaches)

function. Experimentally we have shown that our approach works comparable to SVM based decoupled approach for learning reject option classifier. We have seen that LDR is attractive because it gives a better upper bound for L0−d−1 compared to convex losses LDH and LGH . It would be useful to show that classifier learnt using LDR is Bayes consistent. Deriving the generalization bounds for the true risk based on (LDR ) is also another direction of research.

15

Manwani Desai Sasidharan Sundararajan

Appendix A. Proof of Theorem 1

Proof. Writing [t]_+ = max(t, 0), the double ramp loss is

    LDR(f(x), ρ, y) = (d/µ) ( [µ − yf(x) + ρ]_+ − [−µ² − yf(x) + ρ]_+ )
                    + ((1−d)/µ) ( [µ − yf(x) − ρ]_+ − [−µ² − yf(x) − ρ]_+ ).

• (i) We need to show that LDR ≥ L0−d−1 for all values of µ > 0 and ρ ≥ 0. This can be checked case by case using Table 6.

Table 6: Proof for Theorem 1.(ii).

 Interval                     LDR                              L0−d−1
 yf(x) ∈ [ρ+µ, ∞)             0                                0
 yf(x) ∈ (ρ, ρ+µ)             ∈ (0, d)                         0
 yf(x) ∈ (ρ−µ², ρ]            ∈ [d, (1+µ)d)                    d
 yf(x) ∈ [−ρ+µ, ρ−µ²]         (1+µ)d                           d
 yf(x) ∈ [−ρ, −ρ+µ)           ∈ ((1+µ)d, (1+µ)d + (1−d)]       d
 yf(x) ∈ (−ρ−µ², −ρ)          ∈ ((1+µ)d + (1−d), 1+µ)          1
 yf(x) ∈ (−∞, −ρ−µ²]          1+µ                              1

From Table 6, we see that LDR ≥ L0−d−1, ∀µ > 0, ∀ρ ≥ 0.

• (ii) We need to show that lim_{µ→0} LDR(f(x), ρ, y) = L0−d−1(f(x), ρ, y). We first record the functional form of LDR in the different intervals.

Table 7: LDR in different intervals (proof for Theorem 1.(iii)).

 Interval                     LDR
 yf(x) ∈ (ρ+µ, ∞)             0
 yf(x) ∈ [ρ−µ², ρ+µ]          (d/µ)(µ − yf(x) + ρ)
 yf(x) ∈ (−ρ+µ, ρ−µ²)         (1+µ)d
 yf(x) ∈ [−ρ−µ², −ρ+µ]        (1+µ)d + ((1−d)/µ)(µ − yf(x) − ρ)
 yf(x) ∈ (−∞, −ρ−µ²)          1+µ

Now we take the limit µ → 0, which is shown in Table 8. We see that lim_{µ→0} LDR = L0−d−1.

Table 8: lim_{µ→0} LDR in different intervals (proof for Theorem 1.(iii)).

 Interval                     lim_{µ→0} LDR
 yf(x) ∈ (ρ, ∞)               0
 yf(x) = ρ                    d
 yf(x) ∈ (−ρ, ρ)              d
 yf(x) = −ρ                   1
 yf(x) ∈ (−∞, −ρ)             1

• (iii) In the rejection region yf(x) ∈ (−ρ+µ, ρ−µ²), the loss remains constant, that is, LDR(f(x), ρ, y) = d(1+µ). This can be seen in Table 7.

• (iv) For µ > 0, LDR ≤ (1+µ), ∀ρ ≥ 0, ∀d ≥ 0. This can also be seen in Table 7.

• (v) When ρ = 0, LDR becomes

    LDR(f(x), 0, y) = (d/µ) ( [µ − yf(x)]_+ − [−µ² − yf(x)]_+ ) + ((1−d)/µ) ( [µ − yf(x)]_+ − [−µ² − yf(x)]_+ )
                    = (1/µ) ( [µ − yf(x)]_+ − [−µ² − yf(x)]_+ ),

which is the same as the µ-ramp loss function used for classification problems without the rejection option.

• (vi) We have to show that LDR is a non-convex function of (yf(x), ρ). From (iv), we know that LDR ≤ (1+µ); that is, LDR is bounded above. We show non-convexity of LDR by contradiction.

Suppose LDR is a convex function of (yf(x), ρ). Let z = (yf(x), ρ) and rewrite LDR(f(x), ρ, y) as LDR(z). Choose two points z1, z2 such that LDR(z1) > LDR(z2). Since z1 = λ · ((z1 − (1−λ)z2)/λ) + (1−λ)z2, the definition of convexity gives

    LDR(z1) ≤ λ LDR((z1 − (1−λ)z2)/λ) + (1−λ) LDR(z2),   ∀λ ∈ (0, 1).

Hence,

    (LDR(z1) − (1−λ)LDR(z2))/λ ≤ LDR((z1 − (1−λ)z2)/λ).

Now, since LDR(z1) > LDR(z2),

    (LDR(z1) − (1−λ)LDR(z2))/λ = (LDR(z1) − LDR(z2))/λ + LDR(z2) → ∞  as λ → 0+.

Thus lim_{λ→0+} LDR((z1 − (1−λ)z2)/λ) = ∞. But LDR is upper bounded by (1+µ). This contradicts the assumption that LDR is convex.
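The bound in part (i) is easy to verify numerically. The sketch below implements LDR and L0−d−1 directly from their definitions in this proof and checks LDR ≥ L0−d−1 and LDR ≤ 1 + µ on a grid of margins; it is an illustrative check, not the authors' code.

```python
import numpy as np

def l_0d1(margin, rho, d):
    """0-d-1 loss: 0 if yf(x) > rho, d if -rho <= yf(x) <= rho, 1 otherwise."""
    if margin > rho:
        return 0.0
    if margin >= -rho:
        return d
    return 1.0

def l_dr(margin, rho, d, mu):
    """Double ramp loss: two scaled ramps, shifted by +rho and -rho."""
    pos = lambda t: max(t, 0.0)
    ramp = lambda s: pos(mu - margin + s) - pos(-mu**2 - margin + s)
    return (d / mu) * ramp(rho) + ((1 - d) / mu) * ramp(-rho)

# L_DR upper-bounds L_{0-d-1} everywhere and is itself capped at 1 + mu.
rho, d, mu = 1.0, 0.2, 0.5
for m in np.linspace(-5, 5, 1001):
    assert l_dr(m, rho, d, mu) >= l_0d1(m, rho, d) - 1e-12
    assert l_dr(m, rho, d, mu) <= 1 + mu + 1e-12
```

In the rejection band the value is d(1 + µ), matching part (iii); shrinking µ tightens the bound toward L0−d−1, matching part (ii).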

Appendix B. Derivation of Dual Optimization Problem D^(l+1)

P^(l+1):  min_{w,b,ρ,ξ',ξ''}  (1/2)||w||² + (C/µ) Σ_{n=1}^N [d ξ'_n + (1−d) ξ''_n]
              + (C/µ) [ d Σ_{x_n ∈ V1^(l)} [y_n(w^T φ(x_n) + b) − ρ] + (1−d) Σ_{x_n ∈ V2^(l)} [y_n(w^T φ(x_n) + b) + ρ] ]
          s.t.  y_n(w^T φ(x_n) + b) ≥ ρ + µ − ξ'_n,   ξ'_n ≥ 0,   n = 1 … N
                y_n(w^T φ(x_n) + b) ≥ −ρ + µ − ξ''_n,  ξ''_n ≥ 0,  n = 1 … N
                ρ ≥ 0

The Lagrangian for the above problem is:

L = (1/2)||w||² + (C/µ) Σ_{n=1}^N [d ξ'_n + (1−d) ξ''_n]
    + (C/µ) [ d Σ_{x_n ∈ V1^(l)} [y_n(w^T φ(x_n) + b) − ρ] + (1−d) Σ_{x_n ∈ V2^(l)} [y_n(w^T φ(x_n) + b) + ρ] ]
    + Σ_{n=1}^N α'_n [ρ + µ − ξ'_n − y_n(w^T φ(x_n) + b)]
    + Σ_{n=1}^N α''_n [−ρ + µ − ξ''_n − y_n(w^T φ(x_n) + b)]
    − Σ_{n=1}^N η'_n ξ'_n − Σ_{n=1}^N η''_n ξ''_n − βρ

where α'_n is the dual variable corresponding to the constraint y_n(w^T φ(x_n) + b) ≥ ρ + µ − ξ'_n, α''_n corresponds to y_n(w^T φ(x_n) + b) ≥ −ρ + µ − ξ''_n, η'_n corresponds to ξ'_n ≥ 0, η''_n corresponds to ξ''_n ≥ 0, and β corresponds to ρ ≥ 0. We take the gradient of the Lagrangian with respect to the primal variables. Equating the gradient to zero gives the KKT conditions of optimality for this optimization problem:

    w = Σ_{n=1}^N y_n(α'_n + α''_n) φ(x_n) − (C/µ) [ d Σ_{x_n ∈ V1^(l)} y_n φ(x_n) + (1−d) Σ_{x_n ∈ V2^(l)} y_n φ(x_n) ]
    Σ_{n=1}^N y_n(α'_n + α''_n) − (C/µ) [ d Σ_{x_n ∈ V1^(l)} y_n + (1−d) Σ_{x_n ∈ V2^(l)} y_n ] = 0
    η'_n + α'_n = Cd/µ,           ∀ n = 1 … N
    η''_n + α''_n = C(1−d)/µ,     ∀ n = 1 … N
    (C/µ) [ −d|V1^(l)| + (1−d)|V2^(l)| ] + Σ_{n=1}^N (α'_n − α''_n) − β = 0
    η'_n ξ'_n = 0,  η'_n ≥ 0,     ∀ n = 1 … N
    η''_n ξ''_n = 0,  η''_n ≥ 0,  ∀ n = 1 … N
    α'_n [µ − ξ'_n − y_n(w^T φ(x_n) + b) + ρ] = 0,  α'_n ≥ 0,   ∀ n = 1 … N
    α''_n [µ − ξ''_n − y_n(w^T φ(x_n) + b) − ρ] = 0,  α''_n ≥ 0,  ∀ n = 1 … N
    βρ = 0,  β ≥ 0

where |V1^(l)| and |V2^(l)| denote the cardinalities of the sets V1^(l) and V2^(l). The KKT conditions of optimality described above give the following insights about the dual variables:

    y_n(w^T φ(x_n) + b) > ρ + µ           ⇒  α'_n = 0,  α''_n = 0
    y_n(w^T φ(x_n) + b) = ρ + µ           ⇒  α'_n ∈ (0, Cd/µ),  α''_n = 0
    −ρ + µ < y_n(w^T φ(x_n) + b) < ρ + µ  ⇒  α'_n = Cd/µ,  α''_n = 0
    y_n(w^T φ(x_n) + b) = −ρ + µ          ⇒  α'_n = Cd/µ,  α''_n ∈ (0, C(1−d)/µ)
    y_n(w^T φ(x_n) + b) < −ρ + µ          ⇒  α'_n = Cd/µ,  α''_n = C(1−d)/µ

We simplify the dual optimization problem by changing variables in the following way:

    β'_n = α'_n,                      ∀ x_n ∈ S \ V1^(l)
    β'_n = α'_n − Cd/µ = 0,           ∀ x_n ∈ V1^(l)
    β''_n = α''_n,                    ∀ x_n ∈ S \ V2^(l)
    β''_n = α''_n − C(1−d)/µ = 0,     ∀ x_n ∈ V2^(l)

With this change of variables, the KKT conditions in terms of β' and β'' are:

    w = Σ_{n=1}^N y_n(β'_n + β''_n) φ(x_n)
    Σ_{n=1}^N y_n(β'_n + β''_n) = 0
    η'_n + β'_n = Cd/µ,               ∀ x_n ∈ S \ V1^(l)
    β'_n = 0,                         ∀ x_n ∈ V1^(l)
    η''_n + β''_n = C(1−d)/µ,         ∀ x_n ∈ S \ V2^(l)
    β''_n = 0,                        ∀ x_n ∈ V2^(l)
    Σ_{n=1}^N (β'_n − β''_n) − β = 0
    η'_n ξ'_n = 0,  η'_n ≥ 0,         ∀ n = 1 … N
    η''_n ξ''_n = 0,  η''_n ≥ 0,      ∀ n = 1 … N
    β'_n [µ − ξ'_n − y_n(w^T φ(x_n) + b) + ρ] = 0,  β'_n ≥ 0,   ∀ x_n ∈ S \ V1^(l)
    µ − ξ'_n − y_n(w^T φ(x_n) + b) + ρ = 0,                      ∀ x_n ∈ V1^(l)
    β''_n [µ − ξ''_n − y_n(w^T φ(x_n) + b) − ρ] = 0,  β''_n ≥ 0,  ∀ x_n ∈ S \ V2^(l)
    µ − ξ''_n − y_n(w^T φ(x_n) + b) − ρ = 0,                      ∀ x_n ∈ V2^(l)
    βρ = 0,  β ≥ 0

The dual optimization problem D^(l+1) then becomes:

D^(l+1):  min_{β', β''}  (1/2) Σ_{n=1}^N Σ_{m=1}^N y_n y_m (β'_n + β''_n)(β'_m + β''_m) k(x_n, x_m) − µ Σ_{n=1}^N (β'_n + β''_n)
          s.t.  0 ≤ β'_n ≤ Cd/µ,  ∀ x_n ∈ S \ V1^(l);    β'_n = 0,  ∀ x_n ∈ V1^(l)
                0 ≤ β''_n ≤ C(1−d)/µ,  ∀ x_n ∈ S \ V2^(l);    β''_n = 0,  ∀ x_n ∈ V2^(l)
                Σ_{n=1}^N y_n(β'_n + β''_n) = 0
                Σ_{n=1}^N β'_n ≥ Σ_{n=1}^N β''_n

where β' = [β'_1 β'_2 … β'_N]^T and β'' = [β''_1 β''_2 … β''_N]^T. (The last constraint follows from Σ_n(β'_n − β''_n) = β ≥ 0 in the KKT conditions.)
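Between successive convex subproblems of the DC procedure, the sets V1^(l) and V2^(l) and the corresponding box constraints on β', β'' are refreshed from the current iterate. The sketch below assumes (consistently with the flat regions of the double ramp loss, though not spelled out in this excerpt) that V1^(l) collects points with y_n f(x_n) < ρ − µ² and V2^(l) those with y_n f(x_n) < −ρ − µ²; the function names are illustrative.

```python
import numpy as np

def update_active_sets(margins, rho, mu):
    """Recompute V1, V2 from the current iterate's margins y_n * f(x_n).

    Assumption (hypothetical here): V1 holds points past the lower knee of
    the +rho ramp, V2 those past the lower knee of the -rho ramp.
    """
    margins = np.asarray(margins, dtype=float)
    V1 = np.flatnonzero(margins < rho - mu**2)
    V2 = np.flatnonzero(margins < -rho - mu**2)
    return V1, V2

def dual_box_bounds(n, C, d, mu, V1, V2):
    """Box constraints of D^(l+1): beta' in [0, Cd/mu] off V1, pinned to 0
    on V1; beta'' in [0, C(1-d)/mu] off V2, pinned to 0 on V2."""
    u1 = np.full(n, C * d / mu)
    u1[V1] = 0.0
    u2 = np.full(n, C * (1 - d) / mu)
    u2[V2] = 0.0
    return np.zeros(2 * n), np.concatenate([u1, u2])
```

Each DCA iteration would then solve D^(l+1) under these bounds and repeat until V1 and V2 stop changing.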

Appendix C. D^(l+1) in Matrix Format

A compact representation of D^(l+1) in terms of matrices and vectors is as follows:

D^(l+1):  min_β  (1/2) β^T H β + β^T a
          s.t.   l ≤ β ≤ u
                 0 ≤ Aβ ≤ 0

where β = [β'^T β''^T]^T,

    H = [ K ⊙ YY^T   K ⊙ YY^T ]
        [ K ⊙ YY^T   K ⊙ YY^T ],

K is the kernel matrix (K_nm = k(x_n, x_m)), Y is the label vector, and ⊙ denotes elementwise multiplication. a = [−µ1_N^T  −µ1_N^T]^T, where 1_N is the N-dimensional column vector of ones, so that β^T a = −µ Σ_n (β'_n + β''_n) matches the linear term of D^(l+1). l = 0_{2N}, the 2N-dimensional column vector of zeros. u = [u_1^T u_2^T]^T, where u_1 is the N × 1 vector whose entries for x_n ∈ V1^(l) are 0 and all other entries are Cd/µ; similarly, u_2 is the N × 1 vector whose entries for x_n ∈ V2^(l) are 0 and all other entries are C(1−d)/µ. A = [Y^T Y^T].

References

Le Thi Hoai An and Pham Dinh Tao. Solving a class of linearly constrained indefinite quadratic problems by d.c. algorithms. Journal of Global Optimization, 11:253–285, 1997.

K. Bache and M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Peter L. Bartlett and Marten H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9:1823–1840, June 2008.

C. K. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41–46, January 1970.

Ronan Collobert, Fabian Sinz, Jason Weston, and Léon Bottou. Trading convexity for scalability. In Proceedings of the 23rd International Conference on Machine Learning, pages 201–208, 2006.

Giorgio Fumera and Fabio Roli. Support vector machines with embedded reject option. In Proceedings of the First International Workshop on Pattern Recognition with Support Vector Machines, SVM '02, pages 68–82, 2002.

Yves Grandvalet, Alain Rakotomamonjy, Joseph Keshet, and Stéphane Canu. Support vector machines with a reject option. In NIPS, pages 537–544, 2008.

Radu Herbei and Marten H. Wegkamp. Classification with reject option. The Canadian Journal of Statistics, 34(4):709–721, December 2006.

Alexandros Karatzoglou, Alex Smola, Kurt Hornik, and Achim Zeileis. kernlab – an S4 package for kernel methods in R. Journal of Statistical Software, 11(9):1–20, November 2004. URL http://www.jstatsoft.org/v11/i09/.

Naresh Manwani and P. S. Sastry. Noise tolerance under risk minimization. IEEE Transactions on Systems, Man and Cybernetics: Part–B, 43:1146–1151, March 2013.

Cheng Soon Ong and Le Thi Hoai An. Learning sparse classifiers with difference of convex functions algorithms. Optimization Methods and Software, (ahead-of-print):1–25, 2012.

Ramasubramanian Sundararajan and Asim K. Pal. A conservative approach to perceptron learning. WSEAS Transactions on Systems, 3, 2004.

Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.

Marten Wegkamp and Ming Yuan. Support vector machines with a reject option. Bernoulli, 17(4):1368–1385, 2011.

Marten H. Wegkamp. Lasso type classifiers with a reject option. Electronic Journal of Statistics, 1:155–168, 2007.

Yichao Wu and Yufeng Liu. Robust truncated hinge loss support vector machines. Journal of the American Statistical Association, pages 974–983, 2007.

Ming Yuan and Marten Wegkamp. Classification methods with reject option based on convex risk minimization. Journal of Machine Learning Research, 11:111–130, March 2010.