Cross Validation Through Two-dimensional Solution Surface for Cost ...

5 downloads 4579 Views 8MB Size Report
Finally, we obtain the CV error surface by superposing K validation error surfaces, which can find the .... apply cross validation (CV) to select the best choices.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2578326, IEEE Transactions on Pattern Analysis and Machine Intelligence JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014

1

Cross Validation Through Two-dimensional Solution Surface for Cost-Sensitive SVM Bin Gu, Victor S. Sheng, Keng Yeow Tay, Walter Romano, and Shuo Li Abstract— Model selection plays an important role in cost-sensitive SVM (CS-SVM). It has been proven that the global minimum cross validation (CV) error can be efficiently computed based on the solution path for one parameter learning problems. However, it is a challenge to obtain the global minimum CV error for CS-SVM based on one-dimensional solution path and traditional grid search, because CS-SVM is with two regularization parameters. In this paper, we propose a solution and error surfaces based CV approach (CV-SES). More specifically, we first compute a two-dimensional solution surface for CS-SVM based on a bi-parameter space partition algorithm, which can fit solutions of CS-SVM for all values of both regularization parameters. Then, we compute a two-dimensional validation error surface for each CV fold, which can fit validation errors of CS-SVM for all values of both regularization parameters. Finally, we obtain the CV error surface by superposing K validation error surfaces, which can find the global minimum CV error of CS-SVM. Experiments are conducted on seven datasets for cost sensitive learning and on four datasets for imbalanced learning. Experimental results not only show that our proposed CV-SES has a better generalization ability than CS-SVM with various hybrids between grid search and solution path methods, and than recent proposed cost-sensitive hinge loss SVM with three-dimensional grid search, but also show that CV-SES uses less running time. Index Terms—Solution surface, space partition, cost-sensitive support vector machine, cross validation, solution path.

F

NOMENCLATURE

sign(·)

To make notations easier to follow, we provide a summary of the notations in the following list.

ϕ

R, R+ R 0, 1

m×k

αi , gi ∆ M−1 , MT |E| QMj QMM Rt∗ , R∗t R\tt



• • • •

The sets of real numbers, and non-negative real numbers. The set of m × k real matrices. The vectors having all the elements equal to 0 and 1 respectively. The i-th element of the vector α and g . The amount of the change of each variable. The inverse and the transpose of matrix M. The cardinality of the set E . The subvector of the j -th column of a matrix Q with the rows indexed by M. The submatrix of a matrix Q with the rows and columns indexed by M. The row and the column of a matrix R corresponding to the sample (xt , yt ) respectively. The submatrix of R after deleting the row and the column corresponding to the sample (xt , yt ).

B. Gu (corresponding author) is with Jiangsu Engineering Center of Network Monitoring and School of Computer & Software, Nanjing University of Information Science & Technology, P.R.China, and with the Department of Medical Biophysics, University of Western Ontario, Canada (e-mail: [email protected]). V. S. Sheng is with Department of Computer Science, University of Central Arkansas, Conway, Arkansas, USA (e-mail: [email protected]). K. Y. Tay is with London Health Science Center, London, Ontario. W. Romano is with St. Joseph’s Health Care, London, Ontario. S. Li is with GE Health Care, and Department of Medical Biophysics, University of Western Ontario, Canada (e-mail: [email protected]).

1

A sign function determines whether the predicted classification comes out positive (+1) or negative (-1). A transformation function from an input space to a higher dimensional kernel feature space.

I NTRODUCTION

E

VER since Vapnik’s influential work in statistical learning theory [1], Support Vector Machines (SVMs) have been successfully applied to a lot of classification problems due to its good generalization performance. However, in many realworld classification problems such as medical diagnosis [2], object recognition [3], business decision making [4], and so on, the costs of different types of mistakes are naturally unequal. Cost sensitive learning [5] takes the unequal misclassification costs into consideration, which has also been deemed as a good solution to class-imbalance learning where the class distribution is highly imbalanced [32]. There have been several cost-sensitive SVMs, such as the boundary movement [9], biased penalty (2C -SVM [7] and 2ν -SVM [8]), cost-sensitive hinge loss (CSHL-SVM) [11], [31], and so on. As pointed in [11], the boundary movement would be the optimal strategy under Bayesian decision theory, if the class posterior probabilities were available. However, it is well known that SVMs do not predict these probabilities accurately. The boundary movement method is also flawed when data is nonseparable, in which case cost sensitive optimality is expected to require a modification of both the direction and the threshold of a separating plane. Thus, either the biased penalty method or the cost-sensitive hinge loss method is a better choice for cost sensitive learning. In this paper, we focus on the most popular one (2C SVM1 ).

1. Actually, 2ν -SVM is also equivalent to 2C -SVM as proved in [8]. For the sake of convenience, we do not distinguish the names of 2C -SVM and CS-SVM hereafter unless explicitly mentioned.

0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2578326, IEEE Transactions on Pattern Analysis and Machine Intelligence JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014

Given a training set S = {(x1 , y1 ), · · · , (xl , yl )} where xi ∈ Rd and yi ∈ {+1, −1}, 2C -SVM introduces two cost parameters C+ and C− to denote the costs of false negative and false positive respectively, and considers the following primal formulation: ∑ ∑ 1 ⟨w, w⟩ + C+ min ξi + C− ξi w,b,ξ 2 + − i∈S

2

2)

i∈S

yi (⟨w, ϕ(xi )⟩ + b) ≥ 1 − ξi , (1) ξi ≥ 0, i = 1, · · · , l where w and b are the normal vector and the threshold to the hyperplane of 2C -SVM respectively, S + = {(xi , yi ) ∈ S : yi = +1}, S − = {(xi , yi ) ∈ S : yi = −1}, ⟨·, ·⟩ denotes an inner product in a kernel feature space, and ξi is a non-negative slack variable measuring the degree of misclassification of the sample (xi , yi ). There are several articles about CS-SVM. For example, Bach et al. [18] proposed an efficient algorithm to build receiver operating characteristic curves by varying the training cost asymmetry C+ (i.e., C+ +C ). Davenport et al. [8] presented an equivalent for− mulation 2ν -SVM to 2C -SVM, and developed two grid search methods for determining values of both parameters in 2ν -SVM. Lee and Scott [10] proposed a new formulation for building a family of nested CS-SVM, which yields a family of nested classifiers indexed by cost asymmetry. Specially, they implemented the 1 solution path of CS-SVM w.r.t. regularization parameter C+ +C − and cost asymmetry respectively, to compare with nested CSSVM. In general, how one tunes the cost parameter pair (C+ , C− ) to achieve optimal generalization performance (it is also called the problem of model selection) is a central problem of CS-SVM. As mentioned in [8], a general approach to tackle model selection is to specify some candidate parameter values, and then apply cross validation (CV) to select the best choices. A typical implementation of this approach is grid search. However, extensive exploration for the optimal parameter values is seldom pursued, because there exist two difficulties. It requires: 1) to train the classifier many times under different parameter settings, 2) and to test the classifier on the validation dataset for each parameter setting. To overcome the first difficulty, solution path algorithms were proposed for many learning models [12], [13], [14], [15], [16], [17], [18], [20], to fit the entire solutions for every value of the parameter, which avoids training the classifier many times under different parameter settings. Specifically, Hastie et al. [12] proposed an solution path approach for C -SVM. Gunter and Zhu [13], and Wang et al. [15] proposed solution path algorithms for ε-Support Vector Regression (ε-SVR) to trace the solution with ε and the regularization parameter respectively. Rosset and Zhu [16] proposed a solution path for Lasso. Takeuchi et al. [17] proposed a solution path for kernel quantile regression. Gu et al. [20] proposed a solution path algorithm for ν -Support Vector Classification (ν SVC). It should be noted that there are several articles involving solution paths for two-parameter problems ε-SVR [15], 2C -SVM [14], and quantile regression [14]. Although they are computed in a twoparameter space, all are essentially one-parametric solution paths and none of them can fit solutions for all values of a parameter pair. The detailed explanation for this point is as follows. s.t.

1)

Wang et al. [15] discussed the solution path of ε-SVR for two parameters (λ and ε) respectively. Their solution path works with respect to only one parameter while the other parameter is fixed. In their conclusion, they pointed out

3)

that it is difficult to explore all possible solutions using their path-following approach when the dimensionality of the solution space is larger than one. Bach et al. [18] searched the space (C+ , C− ) of 2C SVM by using a large number of parallel one-parametric solution paths. It is essentially a semi-grid search method, and hard to explore the whole space. This method is implemented (denoted as SP+GS in our paper) to compare with our algorithm. Rosset [14] handled the problem (generating a set of models w.r.t. τ , by selecting the best regularization parameter λ(τ ) for every value of τ ) using a bi-level program with optimizing one parameter λ. Our cross validation on 2C -SVM handles a bi-level program with optimizing two parameters. As a result, our algorithm explores the entire bi-parameter space. However, Rosset’s model follows a large number of one-parametric solution paths simultaneously.

To sum up, the three articles essentially follow one-parametric solution paths in a two-parameter space, and cannot explore solutions for all values of the parameter pair. Therefore, it is highly desirable to design an approach to determine a complete solution surface as both parameters vary. To address the second difficulty, a global search strategy [19] was proposed to compute the minimum CV error based on the solution path. Specifically, Yang and Ong [19] first proposed the global search strategy to determine the global minima of some common validation functions in C -SVM. Gu et al. [20] used this strategy to find the global minimum CV error for ν -SVC based on their proposed regularization path of ν -SVC. The power of the global search method is proved by theoretical and empirical analysis for model selection. Therefore, it is desirable to design an error surface algorithm to compute the minimum CV error for the bi-parametric problem (e.g. CS-SVM), based on the twodimensional solution surface.

The original samples

1-th fold

Solution Surface

Validation Error Surface

2-th fold

Solution Surface

Validation Error Surface

K-th fold

Solution Surface

Validation Error Surface

Global Minimum K-fold CV Error

K equal parts

First step

Second step

Fig. 1: Structure flow chart of the proposed CV-SES method. In this paper, we propose a solution and error surfaces based CV approach (CV-SES). Fig. 1 shows the structure flow chart of our proposed CV-SES method, which includes two main parts. The first part is the two-dimensional solution surface of CSSVM, which is implemented by bi-parameter space partition based on critical convex polygon region (CCPR). CCPR is defined as follows. Definition 1. CCPR is a convex polygon region, where the solutions of CS-SVM share one same linear function w.r.t. both regularization parameters, in the regularization parameter space of CS-SVM. It is worth pointing out that the whole region R+ × R+ for (C+ , C− ) is explored by CCPRs in 1.5 square units as shown in

0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2578326, IEEE Transactions on Pattern Analysis and Machine Intelligence JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014

Fig. 2 based on two equivalent CS-SVM formulations. The second part is the two-dimensional validation error surface for each CV fold, which is computed based on the solution surface. The final K -fold CV error surface can be obtained by superposing K validation error surfaces, and the global minimum CV error can be found correspondingly. We validate our method on seven datasets for cost sensitive learning and on four datasets for imbalanced learning. Experimental results show that CV-SES has a better generalization ability than CS-SVM with various hybrids between grid search and solution path methods, and than recent proposed CSHL-SVM with three-dimensional grid search. Meanwhile, its running time is less. The main contributions of this paper can be summarized as follows. 1)

2)

We propose a bi-parameter space partition (BPSP) algorithm based on CCPR. BPSP can handle the overlap of CCPRs, and determine the complete two-dimensional solution surface of CS-SVM. To the best of our knowledge, there is no such contribution. Once the solution surfaces are available, we compute the two-dimensional K -fold CV error surface to find the values of the parameter pair with the global minimum CV error. Experimental results demonstrate that the method has a better generalization ability than various hybrids between grid search and solution path methods. Meanwhile, it uses less running time.

The rest of this paper is organized as follows. In Section 2, we present two equivalent formulations of CS-SVM, and their Karush-Kuhn-Tucker (KKT) conditions. Section 3 presents the bi-parameter space partition (BPSP) algorithm to compute the two-dimensional solution surface for CS-SVM. The method of computing two-dimensional CV error surface is presented in Section 4. Section 5 discusses why CV-SES adopts the combined compact parameter space with 1.5 square units, instead of others. Experimental setup and results are presented in Section 6 and 7 respectively. The last section provides some concluding remarks. Part of this paper has been presented at [21].

2 T WO F ORMULATIONS OF CS-SVM AND T HEIR KKT CONDITIONS In this section, we present two equivalent formulations of CSSVM (i.e., 2C -SVM and (λ, η )-SVM), and their Karush-KuhnTucker (KKT) conditions, where (λ, η )-SVM is formulated based C+ 1 and η = C+ +C . Fig. 2 on 2C -SVM with λ = C+ +C − − shows the corresponding relation between (C+ ,C− ) and (λ, η ) coordinate systems. Specifically, the open region of C+ ≥ 0, C− ≥ 0, and C+ + C− ≥ 1 corresponds the closed region of 0 ≤ λ ≤ 1 and 0 ≤ η ≤ 1. Thus, the whole region of (C+ , C− ) can be explored in 1.5 square units by searching the region [0, 1] × [0, 1] of the (λ, η ) coordinate system, and the lower triangle region of [0, 1]×[0, 1] in the (C+ , C− ) coordinate system, as shown in Fig. 2. The reason for adopting the compact parameter space is discussed in Section 5. In the following, we first present (λ, η )-SVM and its KKT conditions, and then present 2C -SVM and its KKT conditions.

3

Fig. 2: The corresponding relation between (C+ , C− ) and (λ, η ) coordinate systems. 2.1

(λ, η )-SVM and Its KKT Conditions C

1 + Letting λ = C+ +C , η = C+ +C , the primal formulation (1) of − − 2C -SVM can be reformulated as∑ follows. ∑ λ min ⟨w, w⟩ + η ξi + (1 − η) ξi w,b,ξ 2 + − i∈S

yi (⟨w, ϕ(xi )⟩ + b) ≥ 1 − ξi , ξi ≥ 0, i = 1, · · · , l The dual problem of (2) can be presented as: ∑ 1 T min α Qα − αi α 2λ i∈S ∑ s.t. yi αi = 0, s.t.

i∈S

(2)

(3)

i∈S

1 − yi + 2yi η 0 ≤ αi ≤ , i = 1, · · · , l 2 where Q is a positive semidefinite matrix with Qij = yi yj K(xi , xj ), and K(xi , xj ) = ⟨ϕ(xi ), ϕ(xj )⟩. From KKT theorem [22], the KKT conditions of the dual problem (3) are obtained as follows.   1 ∑ ∀i ∈ S : gi = αj Qij + yi b′′  − 1 λ j∈S   gi > 0 f or αi = 0 iη gi = 0 f or 0 ≤ αi ≤ 1−yi +2y (4) 2  1−yi +2yi η gi < 0 f or αi = 2 ∑ yi αi = 0 (5) i∈S

where b′ = λb′′ , and b′′ is the Lagrangian multiplier corresponding to the equality constraint in (3). According to the value of gi , a training sample set S is partitioned as π(λ, η) = (M(λ, η), E(λ, η), R(λ, η)), (see Fig. 3) where 1) 2) 3)

iη } M(λ, η) = {i : gi = 0, 0 ≤ αi ≤ 1−yi +2y 2 denotes the margin support vector set; iη E(λ, η) = {i : gi < 0, αi = 1−yi +2y } denotes the 2 error support vector set; R(λ, η) = {i : gi > 0, αi = 0} denotes the remaining vector set.

It is clear that M(λ, η), E(λ, η), and R(λ, η) are disjoint, and their union is S . 2.2 2C -SVM and Its KKT Conditions The primal formulation of 2C -SVM is (1), and its dual problem is presented as: ∑ 1 T min α Qα − αi α 2 i∈S ∑ s.t. yi αi = 0, (6) i∈S

0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2578326, IEEE Transactions on Pattern Analysis and Machine Intelligence JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014 (w,b − 1)

(w,b − 1)

(w,b + 1) (b) E

Theorem 1. The set CR(λ0 , η0 ) is a convex set and its closure is a convex polygon region.

(w,b − 1)

(w,b + 1)

(a) M

(w,b + 1) (c) R

Fig. 3: The partition of the training samples S into three independent sets by KKT-conditions.

C+ + C− + yi (C+ − C− ) 0 ≤ αi ≤ , i = 1, · · · , l 2 Similar to (4)-(5), we have the KKT conditions for (6) as follows. ∑ ∀i ∈ S : gi = αj Qij + yi b′′ − 1 j∈S

   g i > 0 f or αi = 0 g i = 0 f or 0 ≤ αi ≤ C+ +C− +y2i (C+ −C− ) (7)   g < 0 f or α = C+ +C− +yi (C+ −C− ) i i 2 ∑ yi αi = 0 (8) i∈S

According to the value of g i , a training sample set S also has the partition π(C+ , C− ) = (M(C+ , C− ), E(C+ , C− ), R(C+ , C− )) (see Fig. 3), where 1)

2)

3)

3

M(C+ , C− ) = {i : g i = 0, 0 ≤ αi ≤ C+ +C− +yi (C+ −C− ) } denotes the margin support vector 2 set; g i < 0, αi = E(C+ , C− ) = {i : C+ +C− +yi (C+ −C− ) } denotes the error support vector 2 set; R(C+ , C− ) = {i : g i > 0, αi = 0} denotes the remaining vector set.

T WO - DIMENSIONAL S OLUTION S URFACE

To determine the complete two-dimensional solution surface of CS-SVM, we first propose an approach to detect CCPRs for both (λ, η )-SVM and 2C -SVM. Based on the CCPRs, we propose the bi-parameter space partition algorithm in the region of [0, 1] × [0, 1] for the (λ, η ) parameter space and in the lower triangle region of [0, 1] × [0, 1] for the (C+ , C− ) parameter space respectively. This means that the complete two-dimensional solution surface of CS-SVM is determined accordingly. In other words, solutions of CS-SVM for all values of the parameter pair are fitted. 3.1

Detecting CCPRs

In the following, we present the approach of detecting the CCPR for both (λ, η )-SVM and 2C -SVM respectively. 3.1.1

4

CCPR of (λ, η )-SVM

As mentioned in Section 2.1, one solution of the dual problem (3) corresponds a partition π . Conversely, a partition π corresponds a set of parameter pairs (λ, η). We define this set as CR(λ0 , η0 ) = {(λ, η) ∈ R+ × [0, 1] : π(λ, η) = π(λ0 , η0 )} when a partition π(λ0 , η0 ) is given. Obviously, CR(λ0 , η0 ) is the set of all parameter pairs (λ, η) in the region R+ × [0, 1] sharing the same partition π(λ0 , η0 ). Theorem 1 shows that CR(λ0 , η0 ) is a convex set and its closure is a convex polygon region.

In the following, we provide a proof to Theorem 1. The proof shows that the solution of (3) is jointly piecewise linear w.r.t. both λ and η . Thus, CR(λ0 , η0 ) is a CCPR of (λ, η )-SVM. According to the KKT conditions (4)-(5), if we finely adjust λ and η around the parameter pair (λ0 , η0 ), the weights αi of the samples in M and the variable b′ should also be adjusted accordingly to keep all the samples satisfying the KKT conditions. Thus, letting gei = λ(gi + 1), we have the following linear system from (4)-(5): ∑ ∑ def ∆gei = Qij ∆αj + yi ∆b′ + yj Qij ∆η j∈M

j∈E

∆λ, ∀i ∈ M ∑ ∑ yj ∆αj + ∆η = 0

=

j∈M

(9) (10)

j∈E

If 1M is the column vector 1 with |M|-dimensional, and let yM = [y1 , · · · , y|M| ]T , the linear system (9)-(10) can be rewritten as: [ ][ ] ∆b′ 0 yTM (11) ∆αM yM QMM | {z } [

e Q

][ ] −|E| ∆λ ∑ = 1M − j∈E yj QMj ∆η T −1 e Let R = Q . The linear relationship between [∆b′ ∆αM ] T and[ [∆λ ∆η] as follows. ] can be obtained [ ][ ] ′ 0 ∆b ∆λ ∑ −|E| = R 1M − j∈E yj QMj ∆αM ∆η [ λ ] η ][ βb′ βb′ ∆λ def = (12) η λ ∆η βM βM Substituting (12) into (9), we can obtain the linear relationship T between ∆gei (∀i ∈∑ S ) and [∆λ ∆η] as follows. ( ) λ ∆gei = Qij βj ∆λ + βjη ∆η

0

j∈M

( ) ∑ +yi βbλ′ ∆λ + βbη′ ∆η + yj Qij ∆η j∈E

def

γiλ ∆λ

γiη ∆η

= + (13) When adjusting both λ and η , meanwhile keeping all the samples satisfying the KKT conditions (4)-(5), the following constraints should be kept: 0 ≤ α(λ0 , η0 )i + βiλ (λ − λ0 ) + βiη (η − η0 ) (14) 1 − yi + 2yi η , ∀i ∈ M ≤ 2 η λ (15) ge(λ0 , η0 )i + γi (λ − λ0 ) + γi (η − η0 ) < λ, ∀i ∈ E ge(λ0 , η0 )i + γiλ (λ − λ0 ) + γiη (η − η0 ) > λ, ∀i ∈ R (16) Thus, CR(λ0 , η0 ) is the set of feasible solutions to the system of inequalities (14)-(16). According to (14)-(16), CR(λ0 , η0 ) is a convex set and its closure is a convex polygon region. According to (14)-(16), it is also easy to see that the solution of (3) is jointly piecewise linear w.r.t. both λ and η . It should be noted that the compact representation of (14)-(16) can be obtained after removing redundant inequalities, which can be efficiently solved by the vertex enumeration algorithm [23]2 . 2. Removing redundant inequalities from a nondegenerate system of n inequalities with d dimensions (here d = 2) can be solved by the vertex enumeration algorithm [23] in time O(ndv), where v is the number of inequalities (or vertices) of the equivalent minimum subsystem.

0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2578326, IEEE Transactions on Pattern Analysis and Machine Intelligence JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014

3.1.2

5

CCPR of 2C -SVM

Similar to (λ, η )-SVM, a partition π of 2C -SVM corresponds a set of parameter pairs (C+ , C− ). We define this set as 0 0 CR(C+ , C− ) = {(C+ , C− ) ∈ R+ × R+ : π(C+ , C− ) = 0 0 0 0 π(C+ , C− )} when a partition π(C+ , C− ) is given. Obviously, 0 0 CR(C+ , C− ) is the set of all parameter pairs (C+ , C− ) in the lower triangle region of [0, 1] × [0, 1] sharing the same partition 0 0 0 0 π(C+ , C− ). Theorem 2 shows that CR(C+ , C− ) is a convex set and its closure is a convex polygon region. 0 0 Theorem 2. The set CR(C+ , C− ) is a convex set and its closure is a convex polygon region.

In the following, we prove Theorem 2. The proof also shows that the solution of (6) is jointly piecewise linear w.r.t. both C+ 0 0 and C− . Thus, CR(C+ , C− ) is a CCPR of 2C -SVM. According to the KKT conditions (7)-(8), if we finely adjust 0 0 C+ and C− around the parameter pair (C+ , C− ), the weights αi ′′ of the samples in M and the variable b should also be adjusted accordingly to keep all the samples satisfying the KKT conditions. Thus, from (7)-(8), linear system: ∑ we have the following∑ def ∆g i = Qij ∆αj + yi ∆b′′ + Qij ∆C+ j∈M



+

j∈E +

Qij ∆C− = 0, ∀i ∈ M

j∈E −





yj ∆αj +

∆C+ −

j∈E +

j∈M +



∆C− = 0

(17) (18)

j∈E − −

where E = {i ∈ E : yi = +1}, and E = {i ∈ E : yi = −1}. The linear system (17)-(18) can [ ] [ be rewritten ] as: ∆b′′ 0 yTM (19) ∆αM yM QMM | {z } e Q

] ][ |E + | −|E − | ∆C+ ∑ ∑ = − ∆C− j∈E + QMj j∈E − QMj T The linear relationship between [∆b′′ ∆αM ] and T [∆C+ ∆C[− ] can be obtained as follows. ] ∆b′′ ∆αM [ ][ ] − |E + | ∆C+ ∑ −|E | = −R ∑ ∆C− j∈E + QMj j∈E − QMj [ ][ ] C+ C− βb′′ βb′′ ∆C+ def = (20) C C ∆C− βM+ βM− Substituting (20) into (17), we can get the linear relationship T ) and [∆C between ∆g i (∀i ∈ S∑ ( + ∆C− ] as follows. ) C+ C ∆g i = Qij βj ∆C+ + βj − ∆C− [

j∈M

( ) C C +yi βb′′+ ∆C+ + βb′′− ∆C− ∑ ∑ + Qij ∆C+ + Qij ∆C− j∈E + def

C γi + ∆C+

j∈E − C γi − ∆C−

= + (21) When adjusting C+ and C− , meanwhile keeping all the samples satisfying the KKT conditions (7)-(8), the following constraints should be kept: C C 0 0 0 0 0 ≤ α(C+ , C− )i + βi + (C+ − C+ ) + βi − (C− − C− ) C+ + C− + yi (C+ − C− ) ≤ , ∀i ∈ M (22) 2

C γi + (C+

3.2

+



C γi − (C−

− < 0, ∀i ∈ E (23) C+ C− 0 0 0 0 g(C+ , C− )i + γi (C+ − C+ ) + γi (C− − C− ) > 0, ∀i ∈ R (24) 0 0 Thus, CR(C+ , C− ) is the set of feasible solutions to (22)-(24). 0 0 According to (22)-(24), CR(C+ , C− ) is a convex set and its closure is a convex polygon region. And the solution of (6) is jointly piecewise linear w.r.t. both C+ and C− . 0 0 g(C+ , C− )i

0 C+ )

+

0 C− )

Bi-parameter Space Partition Algorithm

As mentioned in Section 2, the entire region of (C+ , C− ) can be explored in 1.5 square units as shown in Fig. 2. In this section, we use the CCPRs of both (λ, η )-SVM and 2C -SVM to explore the region of [0, 1] × [0, 1] for the (λ, η ) parameter space and the lower triangle region of [0, 1] × [0, 1] for the (C+ , C− ) parameter space respectively. Once the 1.5 square units have been completely covered by the CCPRs, we can fit solutions of CS-SVM for all values of (C+ , C− ) according to (14)-(16) and (20)-(21). This means that the complete two-dimensional solution surface of CSSVM is determined. An intuitive idea to cover the entire parameter space of CSSVM based on the CCPRs is using a progressive construction method. Before designing this progressive construction algorithm, there are three problems which should be answered. (i) How do we give an initial solution of the first CCPR for both (λ, η )-SVM and 2C -SVM? (ii) How do we handle the issue of overlapped CCPRs? (iii) How do we find the next CCPRs based on the current one? Our answers to the three problems are as follows, which derive a recursive bi-parameter space partition algorithm (i.e., CCPRBPSP, see Algorithm 1). 3.2.1 Initialization A simple strategy for initialization is directly using the SMO technology [25] or other quadratic programming solvers to find the solution of (λ, η )-SVM for a parameter pair in the region of [0, 1] × [0, 1] and the solution of 2C -SVM for a parameter pair in the lower triangle region of [0, 1] × [0, 1] respectively. We present a method in Lemma 1 and 2, which does not require any numerical solver and can directly give the solutions for the parameter pairs of (λ, η )-SVM and 2C -SVM respectively, under some conditions. Specifically, Lemma 1 directly gives the solu|S − | and λ ≥ tion of (λ, η )-SVM when η = |S| ( ) ∑ ∑ 1 + j∈S αj Qij + maxi∈S − j∈S αj Qij . 2 maxi∈S Lemma 2 directly gives the solution of 2C -SVM when C+ |S − | 2 C− = |S + | and C+ + C− ≤ maxi∈S + hi +maxi∈S − hi , where − + ∑ ∑ hi = j∈S + |S|S| | Qij + j∈S − |S|S| | Qij . It should be pointed out that two or more samples will start at M, if the inequalities in Lemma 1 and 2 are all with equality. Lemma 1. When αi =

1−yi +2yi η , 2

the optimal solution |S − |

of the minimization problem (3) with η = and |S| ( ) ∑ ∑ 1 λ ≥ 2 maxi∈S + j∈S αj Qij + maxi∈S − j∈S αj Qij achieved. Further, we have that b′ ] ∈ [is ∑ ∑ maxi∈S − j∈S αj Qij − λ, λ − maxi∈S + j∈S αj Qij . C+ +C− +yi (C+ −C− ) , the optimal so2 C+ |S − | of the minimization problem (6) with C− = |S + | and 2 C− ≤ max + hi +max is achieved, where hi = − hi

Lemma 2. When αi = lution C+ +

i∈S

i∈S

0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2578326, IEEE Transactions on Pattern Analysis and Machine Intelligence JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014

+ ∑ ∑ |S − | Qij + j∈S − |S|S| | Qij . Further, we have that b′′ ∈ + j∈S |S| [ ] ∑ ∑ maxi∈S − j∈S αj Qij − 1, 1 − maxi∈S + j∈S αj Qij .

Although a similar conclusion to Lemma 1 and 2 was provided in Appendix C of [10], which considered CS-SVM without the offset b in the discrimination hyperplane, the proof of Lemma 1 is still provided in Appendix to make the contribution self-contained. Lemma 2 can be proved similarly to Lemma 1. 3.2.2

Handle the Overlapped Phenomenon

In many real-world problems, the minimization problem (3) or (6) can not be guaranteed to be strictly convex, which means that the solution of (3) or (6) is not unique. Given a parameter pair of (3) or (6), the set of the solutions would be a convex set, according to the convex optimization theory [22]. Thus, combined with Theorem 1 and 2, we can infer that two adjacent CCPRs may overlap (called the overlapped phenomenon). Fig. 4a presents an illustration of the overlapped phenomenon. More verifications on three benchmark datasets and a real world spine image dataset are provided in Section 7, to illustrate the widespread existence of the overlapped phenomenon. The overlapped phenomenon makes it difficult to exactly cover the entire parameter space of CS-SVM by exploring all CCPRs based on a progressive construction method. To tackle this issue, a parameter space partition method is introduced by Theorem 3, which was originally provided in [24]. Theorem 3 defines a partition procedure which considT ers the inequalities A [ρ ϱ] ≤ b, which is used to define T R0 = {(ρ, ϱ) ∈ X : A [ρ ϱ] ≤ b}, one by one. See Fig. 4b, the four inequalities of R0 induce four disjoint ∪ subregions of X (i.e., R1 , R2 , R3 , and R4 ) respectively, and 4i=0 Ri = X . Obviously, this partition method can be used to handle the overlapped phenomenon effectively. In addition to avoiding the overlap of CCPRs, the parameter space partition speeds up computing the two-dimensional CV error surface in Section 4.2, because it introduces a nested set structure for the parameter space naturally. Theorem 3. Let X ⊆ R2 be a convex polygon region, and R0 = T {(ρ, ϱ) ∈ X : A [ρ ϱ] ≤ b} be a convex polygon subregion of m×2 X , where A ∈{R , b ∈ R m×1 , R0 ̸= ∅. And let } T Ai [ρ ϱ] > bi Ri = (ρ, ϱ) ∈ X , Aj [ρ ϱ]T ≤ bj , ∀j < i

6

In this part, we mainly discuss how to find a CCPR for each subregion Ri , i = 1, · · · , m. A simple strategy is using the SMO technology [25] or other quadratic programming solvers to find the solution for a parameter pair (ρi , ϱi ) in Ri similar to the initialization (Section 3.2.1), and then compute the corresponding CR(ρi , ϱi ). Thus, a CCPR in Ri can be found as CR(ρi , ϱi )∩Ri . Obviously, an approach to compute the solution for a parameter pair in Ri without requiring any numerical solver will speed up the running of CCPR-BPSP. Theorem 4 allows us to directly compute α and ge (g ) for a parameter pair in the subregion of Ri adjacent to R0 according to (14)-(16) and (20)-(21). The detailed proof is provided in Appendix. Theorem 4. Supposing X ⊆ R2 is a convex polygon region, def def 0 0 CR(λ0 , η0 ) ∩ X = R0 or CR(C+ , C− ) ∩ X = R0 , R0 has the partition π , and {R0 , R1 , · · · , Rm } is a partition of X as Theorem 3. ∀i ∈ {1, · · · , m}, if Ri ̸= ∅, the i-th inequality T Ai [ρ ϱ] ≤ bi of CR only corresponds to the t-th sample of S , 1) from the left part of (14) or (22), there exists a subregion of Ri adjacent to R0 with the partition π = (M \ {t}, E, R ∪ {t}); 2) from the right part of (14) or (22), there exists a subregion of Ri adjacent to R0 with the partition π = (M \ {t}, E ∪ {t}, R); 3) from (15) or (23), there exists a subregion of Ri adjacent to R0 with the partition π = (M ∪ {t}, E \ {t}, R); 4) and from (16) or (24), there exists a subregion of Ri adjacent to R0 with the partition π = (M ∪ {t}, E, R \ {t}). According to Theorem 4, we can directly obtain the partition π for the subregion of Ri adjacent to R0 . Thus, the inverse matrix e can be updated R corresponding to the extended kernel matrix Q 2 in time O(|M| ) as the method described in [26], and the linear T , ∆ge (∆g ) and [∆ρ ∆ϱ] relationships between ∆b′ (∆b′′ ), ∆αM can be computed as (12)-(13), or (20)-(21). Further, both α(ρi , ϱi ) and ge(λi , ηi ) (g(C+ , C− )), where (ρi , ϱi ) is a parameter pair in the subregion of Ri adjacent to R0 with the partition π , can be computed directly according to (14)-(16) and (20)-(21). Algorithm 1: CCPR-BPSP(α, ge (g ), π , X ) (a CCPRs-based bi-parameter space partition algorithm)

∀i = 1, · · · , m. ∪m Then {R0 , R1 , · · · , Rm } is a partition of X . This is i=0 Ri = X , and Ri ∩ Rj = ∅, ∀i ̸= j , i, j ∈ {0, 1, · · · , m}.

0 , C 0 ), π(ρ , ϱ ), a convex polygon Input: α(ρ0 , ϱ0 ), ge(λ0 , η0 ) or g(C+ 0 0 − region X with (ρ0 , ϱ0 ) ∈ X . Output: P (a partition of X in a nested set structure). 1: Detect CR(ρ0 , ϱ0 ) according to (14)-(16); let R0 := CR(ρ0 , ϱ0 ) ∩ X , and P := {R0 }. 2: Partition the parameter space X with {R0 , R1 , · · · , Rm } (c.f. Theorem 3). 3: while i ≤ m & Ri ̸= ∅ {m is the number of partition defined in Theorem 3.} do 4: Update π for the subregion of Ri adjacent to R0 . 5: Compute α and ge for a parameter pair (ρi , ϱi ) in the subregion of Ri adjacent to R0 . 0 , C 0 )), 6: Pi :=CCPR-BPSP(α(ρi , ϱi ), g e(λ0 , η0 ) (g(C+ − π(ρi , ϱi ), Ri ). {Pi is the partition of Ri .} 7: Update P := P ∪ {Pi }, i := i + 1. 8: end while

In Theorem 3, both A and b are issued from the compact representation of inequalities (14)-(16) or (22)-(24), which can be computed by the vertex enumeration algorithm [23] as described in footnote 2. m is the number of inequalities in the compact representation, and has no relationship with the training set size. (ρ, ϱ) is the shorthand implying (λ, η ) and (C+ , C− ) hereafter. 3.2.3

Find the Next CCPRs

Given a CCPR R0 and a convex polygon region X with R0 ⊆ X , a partition {R0 , R1 , · · · , Rm } is produced by the above partition procedure, where only R0 is the CCPR. The next task is to find a CCPR for each subregion Ri , i = 1, · · · , m. Repeating the steps until the full parameter space is covered by CCPRs (see Fig. 4c and 4d). Thus, two-dimensional solution surface of CS-SVM is obtained.

4

T WO - DIMENSIONAL CV E RROR S URFACE

In this section, we first present the approach of computing twodimensional validation error surface for each CV fold, and then

0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2578326, IEEE Transactions on Pattern Analysis and Machine Intelligence JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014

7

0.56 0.52 0.54 0.51

CR(λ1 , η1 )

R3

0.52

1

1

0.9

0.9

0.8

0.8

0.7

0.7

0.6

0.6

0.5

0.5

R1

0.5

X

R0

0.49

R4

CR(λ2 , η2 )

0.48

R2

0.48

0.49

0.5

0.51

0.52 λ

0.53

0.54

(a)

0.55

0.56

0.4

0.4

0.3

0.3

0.2

0.2

0.1

0.46 0.47

η



C

η

η

0.5

0.48

0.49

0.5

0.51

λ

0.52

0.53

0.54

0.55

0.56

(b)

0

0.1 0

0.2

0.4

0.6 C+

(c)

0.8

1

0

0

0.2

0.4

λ

0.6

0.8

1

(d)

Fig. 4: (a): The overlapped phenomenon of CCPRs (λ1 = λ2 = 0.5, η1 = 0.51, and η2 = 0.49). (b): Partitioning the parameter space X based on Theorem 3. (c): Partitioning the lower triangle region of [0, 1] × [0, 1] for (C+ , C− ) through CCPR-BPSP. (d): Partitioning the parameter space [0, 1] × [0, 1] of (λ, η ) through CCPR-BPSP. present the method of computing the superposition of K validation error surfaces. The method produces the two-dimensional K -fold CV error surface, which can be used to find the parameter pairs with the global minimum K -fold CV error.

e (ρ, ϱ). Theorem 5 shows that IR(ρ0 , ϱ0 ) is a same partition π convex set and its closure is also a convex polygon region.

4.1 Two-dimensional Validation Error Surface

The detailed proof of Theorem 5 is provided in Appendix. e (ρ, ϱ) in (25), it is According to the definition of the partition π easy to find that the cost sensitive error E(ρ, ϱ) remains unchanged in an IR(ρ0 , ϱ0 ). Thus, IR(ρ0 , ϱ0 ) is an ICPR of CS-SVM. If all CCPRs are covered by ICPRs, we will have a validation error surface, which fits all the validation cost sensitive errors in the entire parameter space.

Based on the solution surfaces of both (λ, η )-SVM and 2C SVM, the decision function of CS-SVM can be obtained (∑ ) ′ as f (λ, η)(x) = λ1 η)y K(x , x) + b (λ, η) α (λ, j j j j∈S for all (λ, η) in [0, 1] × [0, 1] and f (C+ , C− )(x) = ∑ ′′ α (C , C )y K(x , + − j j x) + b (C+ , C− ) for all (C+ , C− ) j∈S j in the lower triangle region of [0, 1] × [0, 1]. Given a validation e1 , ye1 ), · · · , (x en , yen )}, and assuming both C(−, +) set V = {(x and C(+, −) are the misclassification costs of false negative and false positive respectively (no costs for the true positive and the true negative, i.e., C(−, −) = C(+, +) = 0), the cost sensitive∑error on the validation set can be computed as ei )) , yei ). E(ρ, ϱ) = n1 ni=1 C (sign (f (ρ, ϱ)(x In this part, we show that E(ρ, ϱ) is piecewise constant w.r.t. both the parameters. Specifically, E(ρ, ϱ) remains unchanged in invariant convex polygon region3 (ICPR). ICPR is defined as follows. Definition 2. ICPR is a convex polygon region in the regularization parameter space of CS-SVM, in which the cost sensitive error E(ρ, ϱ) remains unchanged. To select the parameter pairs with the lowest cost sensitive error, we cover the entire parameter space by ICPRs, which will produce a two-dimensional validation error surface. In the following, we first present an approach to detect ICPRs, and then present the method of computing the two-dimensional validation error surface. 4.1.1

Detecting ICPRs

ei ), the validation set V can be According to the sign of f (x partitioned as: e (ρ, ϱ) = {{i ∈ V : f (ρ, ϱ)(x ei )) ≥ 0}, π ei )) < 0}} {i ∈ V : f (ρ, ϱ)(x def

= {I+ (ρ, ϱ), I− (ρ, ϱ)} (25) e corresponds a Similar to π(λ, η) and π(C+ , C− ), a partition π set of parameter pairs (ρ, ϱ). We define this set as IR(ρ0 , ϱ0 ) = e (ρ, ϱ) = π e (ρ0 , ϱ0 )}. Obviously, {(ρ, ϱ) ∈ CR(ρ0 , ϱ0 ) : π IR(ρ0 , ϱ0 ) is the set of all parameter pairs (ρ, ϱ) sharing the 3. “Invariant” means that the cost sensitive error E(ρ, ϱ) remains unchanged.

Theorem 5. The set IR(ρ0 , ϱ0 ) is a convex set and its closure is a convex polygon region.

4.1.2 Partitioning Each CCPR with ICPRs To cover the entire parameter space by ICPRs, we use the strategy of divide and conquer, i.e., searching all ICPRs for each CCPR. Similar to CCPR-BPSP, the parameter space partition procedure introduced by Theorem 3 is also used to partition each CCPR, where an ICPR corresponds the subregion R0 . Thus, a recursive algorithm (i.e., ICPR-BPSP, see Algorithm 2) is proposed to find all ICPRs and compute the corresponding cost sensitive errors for each CCPR. It should be noted that the nested set structure for the output of ICPR-BPSP is still retained based on Theorem 3. The nested set structure will speed up computing the superposition of K validation error surfaces in Section 4.2. Combining all results of the CCPRs based on the framework of Algorithm 1, we can obtain the validation error surface for the region of [0, 1] × [0, 1] in the (λ, η ) coordinate system, as shown in Fig. 5a and 5b, and the validation error surface for the lower triangle region of [0, 1]×[0, 1] in the (C+ , C− ) coordinate system, as shown in Fig. 5d and 5e. 4.2 Computing the Superposition of K Validation Error Surfaces This section focuses on computing the two-dimensional K -fold CV error surface. Given the validation set V , we randomly partition it into K equal size subsets (i.e., V1 , · · · , VK ). For each k = 1, · · · , K , we fit the CS-SVM model with a parameter pair (ρ, ϱ) to the other K − 1 parts, which produces the decision function f (ρ, ϱ)(x) and computes ( ( error in predicting ) ) ∑its cost sensitive ei ) , yei . the k part E k (ρ, ϱ) = |V1k | i∈Vk C sign f k (ρ, ϱ)(x ∑ K 1 k This gives the K -fold CV error CVE(ρ, ϱ) = K k=1 E (ρ, ϱ). As mentioned in Section 4.1, the two-dimensional validation error surface for each fold can be obtained by the ICPR-BPSP algorithm. To find the parameter pairs with the global minimum K -fold CV error, we need to investigate the two-dimensional K fold CV error surface in the regions with 1.5 square units, which

0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2578326, IEEE Transactions on Pattern Analysis and Machine Intelligence JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014

8

ei ), π e , X ) (an ICPRs-based Algorithm 2: ICPR-BPSP(f (x bi-parameter space partition algorithm) Input: f (ρ0 , ϱ0 )(x ei ), π e(ρ0 , ϱ0 ), a convex polygon region X with (ρ0 , ϱ0 ) ∈ X . Output: P (a partition of X with ICPRs in a nested set structure, each ICPR has its cost sensitive error). 1: Compute IR(ρ0 , ϱ0 ) and E(ρ0 , ϱ0 ); then let R0 := IR(ρ0 , ϱ0 ) ∩ X , P := {R0 }. 2: Partition the parameter space X with {R0 , R1 , · · · , Rm } (c.f. Theorem 3). 3: while i ≤ m & Ri ̸= ∅ {m is the number of partition defined in Theorem 3.} do 4: Compute f (ρi , ϱi )(x ei ) and π e(ρi , ϱi ) for a parameter pair (ρi , ϱi ) in Ri . 5: Pi :=ICPR-BPSP(f (ρi , ϱi )(x ei ), π e(ρi , ϱi ), Ri ). 6: Update P = P ∪ {Pi }, i := i + 1. 7: end while

3)

can be achieved by superposing K validation error surfaces in one and the same two-dimensional space (see Fig. 5c and 5f). Algorithm 3 gives the intersection of an ICPR and a validation error surface, based on a recursive procedure. Obviously, we can obtain the two-dimensional K -fold CV error surface by calling Algorithm 3 multiple times. Algorithm 3: INTERSECTION(IR, P ) (the algorithm for computing the intersection of IR and P ) Input: IR(ρ0 , ϱ0 ), P = {P0 , P1 , · · · , Pm } (a parameter space partition in a nested set structure). Output: L (a set of IRs which are intersections of IR and P ). 1: while i ≤ m & IR(ρ0 , ϱ0 ) ∩ Pi ̸= ∅ {Initially, i = 0, L = ∅.} do 2: if Pi is a leaf node then 3: Compute the intersection region IR := IR(ρ0 , ϱ0 ) ∩ Pi ; then let L := L ∪ {IR}. 4: else 5: Li := INTERSECTION(IR(ρ0 , ϱ0 ), Pi ). 6: Update L := L ∪ Li , i := i + 1. 7: end if 8: end while

5

D ISCUSSION ON THE C OMPACT PARAMETER S-

PACE

As mentioned in Section 3 and 4, both the solution and error surfaces are explored in the region [0, 1] × [0, 1] of the (λ, η ) coordinate system and in the lower triangle region of [0, 1] × [0, 1] in the (C+ , C− ) coordinate system, as shown in Fig. 2. In this section, we will discuss why CV-SES adopts the combined compact parameter space with 1.5 square units, instead of others. 1)

2)

If the minimization problems (3) and (6) are strictly convex, all the CCPRs will be non-overlapping. Thus, the solution surface for 2C -SVM or (λ, η )-SVM can be directly obtained by merging all the CCPRs. Its complexity mainly depends on the number of CCPRs. Because the number of CCPRs of (λ, η )-SVM is the same as that of 2C -SVM, there is no obvious difference between exploring the solution surfaces for 2C -SVM and (λ, η )-SVM. This means that it is no need to explore the parameter spaces of 2C -SVM and (λ, η )-SVM simultaneously. However, in many real-world problems, the minimization problem (3) or (6) can not be guaranteed to be strictly convex. Thus, there exists the phenomenon of overlapped CCPRs as shown in Fig. 4a, which is also verified in Section 7.1. In this paper, the parameter space partition

4)

technology defined by Theorem 3 is used to handle the overlap of CCPRs. However, this way of partitioning defines new polyhedral regions Ri to be explored that are not related to the CCPRs which still need to be determined. This may split some of the CCPRs, due to the artificial cuts induced by Theorem 3 [24]. Fig. 6a and 6b present the results of space partition and error surface respectively in the region of [0, 10] × [0, 10] for (C+ , C− ). Fig. 6c and 6d present the results of space partition and error surface respectively in the region of [0, 10] × [0, 1] for (λ, η ). It is easy to find that so many regions are produced by the space partition algorithm, which burdens CV-SES, even though only exploring in an inner region beyond which the solution does not change (cf. Lemma 1 and Corollary 2, or Appendix C of [10]). To tackle this issue, we explore the region [0, 1]×[0, 1] of the (λ, η ) coordinate system and the lower triangle region of [0, 1] × [0, 1] in the (C+ , C− ) coordinate system. Searching in this compact parameter space reduces many artificial cuts, and speed up the running of CV-SES. Experimental results will show that this search strategy can achieve a fast speed, faster than various hybrids between grid search and solution path methods. In addition, it can be proved that it is the optimal combination to explore the entire space of (C+ , C− ) with 1.5 square units. Another possible compact parameter space would be 2ν -SVM (or equivalently, the (ν ,γ ) parameterization). However, as described in [8], the parameter space of the (ν ,γ ) formulation is [0, 1] × R+ . If directly exploring the parameter space [0, 1] × R+ by the parameter space partition technology, it will be inefficient as mentioned above. In addition, the parameter space of 2ν -SVM is [0, 1] × [0, 1], which is compact enough for exploring. However, 2ν -SVM is nonlinear w.r.t. ν+ and ν− . Thus, an approximate solution surface algorithm should be designed for tackling this parametric nonlinear optimization problem [30], which is beyond the scope of this paper.

To sum up, due to the overlapped phenomenon of CCPRs, exploring the compact parameter space with 1.5 square units is an efficient way to speed up the running of CV-SES.

6 6.1

E XPERIMENTAL S ETUP Design of Experiments

In order to demonstrate the effects and the advantages of our proposed CV-SES, we conduct a detailed experimental study as follows. Both CCPR-BPSP and ICPR-BPSP (the two main procedures of CV-SES) are essentially tree construction algorithms. To demonstrate the effects of CV-SES, we count the “depth”, “branches”, and “leaves” for both CCPR-BPSP and ICPR-BPSP, where “depth” stands for the maximum depth of a tree, “branches” represents the average child nodes for each inner node, and “leaves” stands for the number of leaves for a tree, where each leaf represents a CCPR or an ICPR. In addition, we also count the ICPRs of the final superposition of K two-dimensional validation error surface in a 5-fold CV (denoted as “regions”). To illustrate the widespread existence of the overlapped phenomenon of CCPRs, we randomly generate 200 pairs of CCPRs which are adjacent, and count the number of the overlapped cases. By observing the numbers of “depth”, “branches”, “leaves”, “overlaps”,

0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2578326, IEEE Transactions on Pattern Analysis and Machine Intelligence JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014

2

1.5

1

2

1.5 1.5 CV Error

Cost Sensitive Error

Cost Sensitive Error

2

9

1

0.5 1

0 1

0.5 1

1

1

0.8

0.5

0.5

0.6 0

0

η

λ

(a)

0.6 0.4

0.2 0

1

λ

3 2.5

3 2 CV Error

Cost Sensitive Error

2

0

(c)

4

1.5

0.2

0

η

λ

(b)

2.5 Cost Sensitive Error

0.8

0.5

0.6 0.4

0.2 0

1

0.8

0.4 η

1

0.5

2

1.5 1

1 0.5

0.5 1

0 1

0 1

1

1

0.8

0.5

0.5

0.6 0

0.6 0.4

0.4

0.2 0

0.8

0.5

0.6

0.4 C−

1

0.8

0

C−

C+

(d)

0.2 0

0.2

0

C−

C+

(e)

0

C+

(f)

10

1

9

0.9

8

2.5

7

2

0.7

6

1.5

0.6

1

0.5

4

2 1 0

0.3

0 10 5 0

2

4

6 C+

(a)

8

10

C−

0

0

2

6

4

8

10

0.4

0.2

0.4

0.5

3

0.6 CV Error

5

0.8

0.8

η

CV Error

C−

Fig. 5: Error surfaces in 2-fold CV. (a)-(c): 2-fold CV for all parameter pairs of (λ, η ) in [0, 1] × [0, 1]. (d)-(f): 2-fold CV for all parameter pairs of (C+ , C− ) in the lower triangle region of [0, 1] × [0, 1]. (a), (d): The results of the first fold. (b), (e): The results of the second fold. (c), (f): The results of 2-fold CV.

0 1

0.2

10 8

0.5

0.1

6 4

0

C

0

2

+

(b)

4

λ

(c)

6

8

10

η

0

2 0

λ

(d)

Fig. 6: (a)-(b): [0, 10]×[0, 10] of (C+ , C− ). (c)-(d): [0, 10]×[0, 1] of (λ, η ). (a), (c): The results of space partition through CCPR-BPSP. (b), (d): 2-fold CV error surface. and “regions” with the validation set size, we hope to show that CV-SES can effectively cover the entire region in a finite number of steps. In order to show the advantages of CV-SES, we compare the generalization ability and the runtime of CV-SES with other two typical model selection methods of CS-SVM and the most recent cost-sensitive SVM CSHL-SVM [11], [31]. To sum up, the four model selection methods of CS-SVM are: 1)

2)

grid search (GS): a two-step grid search strategy [8] is used for 2C -SVM. The initial search is done on a 20 × 20 coarse grid linearly spaced in the region {(log2 C+ , log2 C− )| − 9 ≤ log2 C+ ≤ 10, −9 ≤ log2 C− ≤ 10}, followed by a fine search on a 20 × 20 uniform grid linearly spaced by 0.1 in the (log2 C+ , log2 C− ) space; a hybrid method of one-parametric solution path searching on η and grid searching on λ (SPη +GSλ ): λ (i.e., 1 C+ +C− ) is selected by a two-step grid search in the region {log2 λ| − 9 ≤ log2 λ ≤ 10} with the granularity 1 and followed by 0.1. For each value of λ, the solution C+ path search [18], [19] is applied on η (i.e., C+ +C ) to − find the best parameter pair (C+ , C− );

3)

4)

6.2

a hybrid method of one-parametric solution path searching on λ and grid searching on η (SPλ +GSη ): η is selected by a two-step grid search in the region {η|0 ≤ η ≤ 1} with the granularity 0.1 and followed by 0.01. For each value of η , the one-parametric solution path searching is applied on λ to find the best parameter pair (C+ , C− ); our proposed solution and error surfaces based CV approach (CV-SES). Datasets

Table 1 summarizes eleven datasets used in our experiments. For each dataset, we choose the class with the fewer samples as positive. “ratio” denotes the ratio of its major class size against its minor class size. We randomly partition each dataset into 75% for training and 25% for testing, and select 30% as a validation set which is disjoint from the test set. The validation set is used in a 5fold CV procedure to determine the optimal values of parameters of CS-SVM and CSHL-SVM. The cost sensitive error on the test set is computed based on the optimal parameters. The datasets are grouped into two parts. The first part is for cost sensitive learning whose ratios are below 2, and the second part is for imbalanced learning whose ratios are above 3.

0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2578326, IEEE Transactions on Pattern Analysis and Machine Intelligence JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014

10

TABLE 1: Datasets used in the experiments. data set Sonar Ionosphere Diabetes Breast Cancer Heart Hill-Valley Spine Image Ecoli1 Ecoli3 Vowel0 Vehicle0

#attributes 60 34 8 10 13 100 5 7 7 13 18

#training set 156 264 576 510 202 454 262 252 252 741 635

#validation set 62 106 230 205 81 182 105 101 101 296 254

#test set 52 90 192 173 68 152 88 84 84 247 211

ratio 1.13 1.79 1.87 1.86 1.25 1.01 1.23 3.36 8.6 9.98 3.25

6.2.1 The Datasets for Cost Sensitive Learning Benchmark Datasets: The first six datasets are from the UCI benchmark repository [27]. The following are the brief descriptions for three medical datasets. It is easy to see that the costs of different types of mistakes are unequal. 1) 2) 3)

Diabetes: The dataset is to diagnose whether the patient of Pima Indian heritage shows signs of diabetes. Breast Cancer: The samples of this dataset arrive periodically as Dr. Wolberg reports his clinical cases. Heart: The goal field of this dataset refers to the presence of heart disease of patients.

A Real-World Dataset: The spine image dataset is collected from the London area of Canada by us. This dataset is related to diagnose a degenerative disc disease depending on five image texture features (including contrast, correlation, energy, homogeneity, and mean signal intensity) quantified from magnetic resonance imaging. It contains 350 records, where 157 records were marked normal and 193 records marked abnormal by an experienced radiologist. It is easy to see that the costs of different types of mistakes are unequal. 6.2.2 The Datasets for Imbalanced Learning Ecoli1, Ecoli3, Vowel0, and Vehicle0 are from the KEEL-dataset repository4 . Their class imbalance ratios are varying from 3.25 to 9.98. Cost sensitive learning methods (including CS-SVM and CSHL-SVM) are used for imbalanced learning on these four datasets, as mentioned in Section 1.

for training several types of SVMs, which was implemented in C++ and named as LIBSVM (it does not include 2C SVM). To compare the run-time in the same platform, we do not directly modify the LIBSVM software package as stated in [8], but implement the SMO-type algorithm of 2C SVM in MATLAB [29]. Lee and Scott [10] implemented the one-parametric solution path searching on λ and η respectively in MATLAB5 . We used their MATLAB implementation for achieving SPη +GSλ and SPλ +GSη . Specially, an λ solution |S − | path searching with η = |S| first prepares initial solutions of SP an η solution path searching)with (η +GSλ . Similarly, ∑ ∑ λ = 21 maxi∈S + j∈S αj Qij + maxi∈S − j∈S αj Qij first prepares initial solutions of SPλ +GSη . For CV-SES (SPη +GSλ and SPλ +GSη ), our implementation returns a center point from the region (or a line segment) with the minimum CV error. In addition, CSHL-SVM is implemented using the quadprog function of MATLAB with a three-dimensional grid search on the region {(log2 C, log2 C−1 , log2 C1 )| − 9 ≤ log2 C ≤ 10, 0 ≤ log2 C−1 ≤ 10, 0 ≤ log2 C1 ≤ 10, C1 ≥ 2C−1 − 1} as mentioned in [31], to determine the values of parameters C , C−1 , and C1 of CSHL-SVM. All experiments are performed on a 2.5-GHz Intel Core i5 machine with 8GB RAM, running MATLAB 7.10. To demonstrate the effects of CV-SES and compare the runtime of different methods, C(−, +) is set to 5 for the datasets in cost sensitive learning. For kernel, since our focus is on nonlinear kernel, we use the Gaussian kernel K(x1 , x2 ) = exp(−κ∥x1 − x2 ∥2 ) with κ = 10−3 and 103 . To compare the generalization ability of different methods, and to investigate how the performance of an approach changes with different settings in misclassification costs, C(−, +) is set to 2, 5, and 10 respectively for the datasets in cost sensitive learning. The Gaussian kernel is also used with κ ∈ {10−3 , 10−2 , 10−1 , 1, 10, 102 , 103 }, where the value of κ having the lowest CV error is adopted.

7 7.1

E XPERIMENTAL R ESULTS Effects of CV-SES

We implement our proposed CV-SES in MATLAB. Chang and Lin [28] proposed a recognized SMO-type algorithm designed

Both CCPR-BPSP and ICPR-BPSP are essentially tree construction algorithms. The numbers of “depth”, “branches”, and “leaves” are essential numerical values to depict the trees of both CCPRBPSP and ICPR-BPSP. And the number of “regions” is a numerical value showing how many ICPRs are produced in the final superposition of 5 validation error surfaces. The average results of (λ, η )-SVM in the region of [0, 1] × [0, 1] over 10 trials are presented in Table 2, and the ones of 2C -SVM in the lower triangle region of [0, 1] × [0, 1] over 10 trials are presented in Table 3. These experiments are carried out with the validation set size of 50, 100, 150, 200, and 250, when κ = 10−3 . From the tables, we find that the number of “branches” is from 3 to 4, which is consistent with the observation that both CCPR and ICPR normally are a triangle or a quadrilateral (see Fig. 4 and 5). The “depth”, “branches”, and “leaves” of ICPR-BPSP are not fewer than the corresponding ones of CCPR-BPSP. This is because the tree of ICPR-BPSP is constructed based on the tree of CCPRBPSP. The most important observation is that the numbers of “leaves” and “regions” are all finite. This means that CV-SES

4. The KEEL-dataset repository is available at http://sci2s.ugr.es/keel/ imbalanced.php.

5. The MATLAB software package is available at http://web.eecs.umich. edu/∼cscott/code.html#svmpath.

6.3

Performance Estimation

Given a testing set T = {(x1 , y 1 ), · · · , (xm , y m )}, the final performance on the testing set is measured by the cost sensitive 1 ∑m error E = m i=1 C (sign (f (xi )) , y i ) for both cost sensitive and imbalanced learning. The cost sensitive error is also used in the CV procedure [31]. In binary classification, a positive class is of the primary interest and with a higher misclassification cost C(−, +), while a negative class has a lower cost C(+, −). Since costs can be normalized with the optimal decision unchanged [32], C(+, −) is always set to 1, and C(−, +) is bigger than 1. Specially, C(−, +) is always set to the class imbalance ratio for each dataset in imbalanced learning. The setting of C(−, +) for datasets in cost sensitive learning is discussed in Section 6.4. 6.4

Implementation

0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2578326, IEEE Transactions on Pattern Analysis and Machine Intelligence JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014

(including CCPR-BPSP and ICPR-BPSP) can cover the entire region in a finite number of steps. CCPR-BPSP introduces a parameter space partition method to handle the overlapped cases of CCPRs. To show the necessity of introducing the parameter space partition method, we verify the widespread existence of the overlapped phenomenon on four datasets, i.e., Diabetes, Breast Cancer, Hill-Valley, and Spine Image. The average results over 10 trials for (λ, η )-SVM in the region of [0, 1] × [0, 1] and for 2C -SVM in the lower triangle region of [0, 1] × [0, 1] are presented in the column of O (abbreviation of “overlaps”) in Table 2 and 3 respectively. It is easy to observe that the overlapped phenomenon happens with a high probability in the 200 test cases generated randomly. This verifies the necessity of the parameter space partition method in CCPR-BPSP. 7.2 7.2.1

Comparison with Other Methods Accuracy

The optimal choices of parameters C+ , C− and κ and the corresponding CV error are presented in Table 4 for 5-fold CV of CSSVM with GS, SPη +GSλ , SPλ +GSη , and CV-SES, respectively, where C(−, +) is set to be 2, 5, and 10 respectively on the first seven datasets for cost sensitive learning, and C(−, +) = ratio on the last four imbalanced datasets. Note that C(+, −) is always set to 1. It is easily observed that CV-SES obtains the lowest CV errors for all datasets under different C(−, +) settings. This is because GS is a points-based grid search method, and both SPη +GSλ and SPλ +GSη are line-based grid search methods. However, CV-SES is a region-based method which finds the best choices among the infinite many candidates of the two-parameter space. The values of parameters C , C−1 , and C1 of CSHL-SVM were obtained from the 5-fold CV with three-dimensional grid search. And combining with the optimal parameters of CS-SVM with GS, SPη +GSλ , SPλ +GSη , and CV-SES, respectively, as presented in Table 4, we obtained the cost sensitive errors for both CSHL-SVM and CS-SVM on the test sets over 50 trials, as presented in Fig. 7. In each subfigure, the grouped boxes represent the results of CSHL-SVM with three-dimensional grid search, CS-SVM with GS, SPη +GSλ , SPλ +GSη , and CV-SES, from left to right at different datasets. Similar to Table 4, C(−, +) is set to be 2, 5, and 10 (corresponding to Fig. 7a, 7b, and 7c) respectively, on the first seven datasets for cost sensitive learning, and C(−, +) = ratio (corresponding to Fig. 7d) on the last four datasets for imbalanced learning. The results on CS-SVM show that CV-SES has a better generalization ability than GS, SPη +GSλ , and SPλ +GSη . Especially, CV-SES has the best stability, because it returns a center point from the optimal region with the minimum CV error. The results between CSHL-SVM and CS-SVM show that CS-SVM with CV-SES has a better generalization ability than CSHL-SVM with three-dimensional grid search, although CSHLSVM also performs well. Since there are more parameters in the formulation of CSHL-SVM, it is not easy to tune the values of the parameters for CSHL-SVM. On the flip side, it also encourages us to implement an approximate three-dimensional solution surface algorithm for CSHL-SVM, because CSHL-SVM is nonlinear w.r.t. C , C−1 , and C1 .

11

SPη +GSλ , SPλ +GSη , and CV-SES in terms of the size of the validation set is presented in Fig. 8 on each dataset, when κ = 10−3 and 103 . All results are averaged over 10 trials. It is easy to find that either CSHL-SVM with three-dimensional GS or CS-SVM with two-dimensional GS has the longest running time among them. This is because they need to train the model of CSHL-SVM or of CS-SVM for many times. It is also easy to find that the running time of CV-SES has no strong relationship with the size of the validation set. This is because the computational complexity of CV-SES mainly depends on the number of ICPRs. The experimental results in Table 2 and 3 show that the number of ICPRs has a weak relationship with the sizes of the training and validation sets when Gaussian kernel is used. However, both SPη +GSλ and SPλ +GSη need to search a large number of one-parametric solution paths. The time complexity of computing each solution path mainly depends on the number of iterations taken by the solution path algorithm. The experience shows that the total number of iterations is on average some small multiple of the size of the training set [12], [13], [15]. Thus, CV-SES has the less running time than both SPη +GSλ and SPλ +GSη . To sum up, the results demonstrate that our CV-SES is generally much faster than various hybrids between grid search and solution path methods.

8

C ONCLUSION

In this paper, we proposed a solution and error surfaces based CV approach (CV-SES) for CS-SVM. CV-SES mainly includes two steps: first computing the two-dimensional solution surface, and then computing the two-dimensional error surface based on the solution surface. The final K -fold CV error surface can be obtained by superposing K validation error surfaces, and the global minimum CV error can be found correspondingly. Experimental results on seven datasets for cost sensitive learning and on four datasets for imbalanced learning show that CV-SES has a better generalization ability than CS-SVM with various hybrids between grid search and solution path methods, and than recent proposed CSHL-SVM with three-dimensional grid search. Meanwhile, it uses less running time. In the future, we plan to extend the solution surface algorithm to a more general formulation which can cover other bi-parametric learning models (i.e., ε-SVR [15], ν -SVR [33], and quantile regression [14]), and even multi-parametric learning models. As mentioned in Section 5, 2ν -SVM [8] is nonlinear w.r.t. ν+ and ν− . Thus, an approximate two-dimensional solution surface algorithm should be designed for tackling this parametric nonlinear optimization problem [30]. Similarly, an approximate multidimensional solution surface algorithm should be designed for CSHL-SVM [11] which is nonlinear w.r.t. C , C−1 , and C1 . Normally, the assessment metric of imbalanced learning uses the receiver operating characteristic (ROC) curve (or AUC, the area under the curve) [34]. We will extend the error surface algorithm to be compatible with the AUC metric in the future.

A PPENDIX

7.2.2 Runtime

.1

Proof of Lemma 1

The empirical running time (in minutes) of CSHL-SVM with three-dimensional grid search, and of CS-SVM with GS,

Proof. If η = |S| , it will satisfy the equality constraint (5) of iη iη . As αi = 1−yi +2y , the KKT conditions when αi = 1−yi +2y 2 2

|S − |

0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2578326, IEEE Transactions on Pattern Analysis and Machine Intelligence JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014

12

TABLE 2: The average numbers of “depth” (D), “branches” (B), “leaves” (L), “overlaps” (O), and “regions” (R) of CV-SES in the region of [0, 1] × [0, 1] for (λ, η )-SVM, over 10 trials. Dataset

Size 50 100 150 200 250 50 100 150 200 250

Diabetes

HillValley

D 4.6 5.4 6 6.4 6.4 8 8.8 9.2 9.8 10.4

CR-BPSP B L 3.39 18.6 3.49 23 3.46 28.4 3.54 30.6 3.88 35.4 3.92 185.4 4 252 3.94 331.5 3.93 376.2 4.08 420.4

O 200 200 200 200 200 198 200 200 194 189

D 4.6 5.4 6 6.4 6.4 8 8.8 9.2 10.4 9.8

IR-BPSP B L 3.41 22.5 3.50 26.6 3.46 31.3 3.59 34.4 3.9 37.5 3.94 190.2 4.08 255.3 3.94 339.2 3.93 382.1 4.08 425.8

Final R 244 306 355 486 578 1844 2043 2162 2303 2795

Dataset

Breast Cancer

Spine Image

Size 50 100 150 200 250 50 100 150 200 250

D 5.6 6.4 7.8 8 8.4 6.8 7 7 7.3 7.7

CR-BPSP B L 3.33 24.4 3.53 42.4 3.54 110.4 3.67 142.6 3.79 164.2 3.69 71 3.87 120.3 3.93 139.3 3.98 153 4 160.7

O 200 199 200 187 173 32 43 45 58 42

D 5.6 6.4 7.8 8 8.4 6.8 7 7 7.3 7.7

IR-BPSP B L 3.36 26.6 3.53 45.3 3.59 118.5 3.67 151.2 3.8 168.2 3.71 74.3 3.89 125.7 3.95 144.5 4 157.3 4.01 167.4

Final R 233 693 1455 2003 2351 260 307 351 394 450

TABLE 3: The average numbers of “depth” (D), “branches” (B), “leaves” (L), “overlaps” (O), and “regions” (R) of CV-SES in the lower triangle region of [0, 1] × [0, 1] for 2C -SVM, over 10 trials. Dataset

Size 50 100 150 200 250 50 100 150 200 250

Diabetes

HillValley

D 4 4 4 4 4 6 6.4 6.6 6.8 7.1

CR-BPSP B L 3.25 9 3.25 9 3.25 9 3.25 9 3.25 9 3.37 25.2 3.61 36 3.72 44.2 3.78 45.1 3.76 49

O 200 200 200 200 200 200 196 200 200 197

D 4 4 4 4 4 6 6.4 6.6 6.8 7.1

IR-BPSP B L 3.25 9 3.25 9 3.25 9 3.25 9.1 3.25 9.1 3.39 27.1 3.64 38.3 3.78 47.6 3.81 48.4 3.8 53.2

Final R 51 56 72 73 104 196 214 235 263 276

Dataset

Breast Cancer

Spine Image

Size 50 100 150 200 250 50 100 150 200 250

D 3.6 3.8 4 4.8 6.2 6.7 7 7 7 7.1

CR-BPSP B L 3.1 9.6 3.2 13.8 3.64 15.8 3.55 19.4 3.77 24.6 3.65 39.3 3.65 60.7 3.84 61.3 3.85 67.7 3.85 73.4

O 195 200 200 189 186 23 37 34 29 30

D 3.6 3.8 4 4.8 6.2 6.7 7 7 7 7.1

IR-BPSP B L 3.2 11.2 3.3 15.3 3.67 18.4 3.55 22.6 3.77 28.6 3.67 42.2 3.65 62.1 3.84 65.6 3.85 70.3 3.85 76.7

Final R 42 59 88 123 168 115 136 167 199 231

TABLE 4: The results (i.e., optimal values of C+ , C− , κ and CV error) of 5-fold CV with GS, SPη +GSλ , SPλ +GSη , and CV-SES, respectively, for CS-SVM. Smallest CV error for each dataset is in bold. C(+, −)

2

5

10

ratio

GS

Dataset

C+ Sonar 2.297 Ionosphere 1.072 Diabetes 1.414 Breast Cancer 0.8706 Heart 0.6156 Hill-Valley 11.31 Spine Image 207.94 Sonar 445.7 Ionosphere 17.15 Diabetes 2.144 Breast Cancer 2.828 Heart 1.3195 Hill-Valley 9.189 Spine Image 776.2 Sonar 103.9 Ionosphere 1.072 Diabetes 4.925 Breast Cancer 1.414 Heart 0.5359 Hill-Valley 8.574 Spine Image 68.59 Ecoli1 5.657 Ecoli3 2.144 Vowel0 157.59 Vehicle0 2.8284

C− 1.414 0.8123 0.6598 0.4665 0.3789 7.464 128 207.9 12.99 0.25 1.741 0.3789 3.482 238.9 59.71 0.8123 13.93 1.741 0.4061 3.031 9.189 1.414 0.2872 29.86 2.639

SPη +GSλ κ CV error C+ C− κ CV error −1 10 0.4667 60.17 30.33 10−3 0.282 −1 −3 10 0.3623 2047 0.924 10 0.0725 −3 −3 10 0.6275 62.8 1.2 10 0.5948 10−2 0.6593 2047 0.473 10−3 0.6 10−3 0.52 0.464 1.015 0.401 10−3 10−3 0.463 0.45 245.8 116.3 10−3 10−3 0.5017 519.5 504.5 10−2 0.2754 −3 −3 52.02 51.95 10 10 0.4872 0.3167 10−1 0.3768 1.319 10−3 0.1159 2.68 10−3 0.6536 2047 0.316 10−3 0.632 10−2 0.6741 2047 0.448 10−3 0.6074 −3 −3 10 0.537 0.509 10 0.463 2047 10−3 0.6 1.456 0.75 1 0.55 10−2 0.524 0.383 41.52 18.19 10−2 254.1 10−3 0.564 83.6 10−3 0.4615 10−1 0.3823 2.6 10−1 0.2319 2045 10−3 0.6863 1.055 0.265 10−3 0.6601 10−3 0.6815 2047 0.45 10−3 0.6741 10−3 0.556 0.4367 0.223 10−3 0.556 10−3 0.5 2047 0.818 10−3 0.5 10−2 0.536 769.9 254.1 10−2 0.4783 1 0.1722 872.7 82.71 10−1 0.117 1 0.1905 924.9 251.3 10−1 0.0909 10−3 0.1586 14.59 1.407 10−3 0.101 10−3 0.472 2047 0.7802 10−3 0.1834

according to (4) of the KKT that   conditions, we require 1 ∑ ∀i ∈ S : gi = αj Qij + yi b′  − 1 ≤ 0 λ j∈S which means that ∀i ∈ S + ,



αj Qij + b′ ≤ λ

(26)

(27)

j∈S

∀i ∈ S − ,





αj Qij + b′ ≥ −λ

(28)

j∈S

C+ 47.37 0.842 53.8 60 79.2 0.934 1.212 2.285 0.5 1.937 1.93 63.06 1.578 4.786 42.49 2.623 1.948 43.04 1.953 1.392 5.519 210.3 28.75 2.309 3.62

SPλ +GSη C− κ CV error 0.48 10−1 0.271 −1 0.585 10 0.0857 −3 0.54 10 0.606 0.606 10−3 0.611 0.8 10−3 0.478 6.2 10−3 0.462 −1 1.076 10 0.278 −3 1.026 10 0.322 0.521 10−1 0.1324 0.0196 10−3 0.638 0.0195 10−3 0.6222 −3 0.637 10 0.485 0.485 10 0.493 2.357 10−2 0.3795 0.429 10−3 0.473 1.071 10−2 0.2425 −3 0.0197 10 0.672 0.435 10−2 0.6741 0.0197 10−2 0.562 0.622 102 0.5 0.415 1 0.464 113.2 10−1 0.124 4.29 10−3 0.1102 0.4075 10−3 0.095 0.273 10−3 0.2092

C+ 1.636 0.6083 4.794 2.563 1.596 4.303 1.899 1.636 2.683 13.43 1.266 5.883 0.742 9.293 3.511 1.858 0.043 30.01 2.115 2.148 0.849 1.6538 0.8764 19.44 1.9845

CV-SES C− κ CV error 0.544 10−1 0.2564 −3 0.5775 10 0.0435 0.665 10−3 0.5752 0.479 10−3 0.5642 0.266 10−3 0.444 0.3884 10−3 0.4417 −3 1.075 10 0.2650 0.544 10−1 0.3167 17.69 10−1 0.1159 0.192 10−3 0.632 0.483 10−3 0.6074 −2 0.656 10 0.463 0.446 10−3 0.4417 5.588 10−2 0.3562 1.403 10−1 0.4359 0.979 10−3 0.2319 0.9875 10−3 0.6601 0.294 10−3 0.6626 0.649 10−3 0.556 0.655 10−3 0.458 0.536 10−1 0.4493 0.5165 10−1 0.0833 0.1944 1 0.0595 8.143 10−2 0.0449 0.1545 10−3 0.1024

Obviously, the solutions of the linear inequality system (29) can be formulated asfollows.  ∑ ∑ 1 λ≥ max αj Qij + max αj Qij  (30) 2 i∈S + j∈S i∈S − j∈S   ∑ ∑ b′ ∈  max αj Qij − λ, λ − max αj Qij  (31) i∈S −

j∈S

i∈S +

j∈S

This completes the proof.

Thus, (27)-(28) system. ∑ can be transformed into the following ∑ max αj Qij − λ ≤ b′ ≤ λ − max αj Qij (29) i∈S −

j∈S

i∈S +

j∈S

0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2578326, IEEE Transactions on Pattern Analysis and Machine Intelligence JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014

13

3 1.2

Cost Sensitive Errors

Cost Sensitive Errors

2.5 1 0.8 0.6 0.4

1.5 1 0.5

0.2 0

2

Sonar

Ionosphere DiabetesBreast Cancer Heart

0

Hill−ValleySpine Image

Sonar

Ionosphere DiabetesBreast Cancer Heart

(a) 4

Cost Sensitive Errors

3.5

Cost Sensitive Errors

Hill−ValleySpine Image

(b) 1

3 2.5 2 1.5 1

0.8

0.6

0.4

0.2

0.5 0

0 Sonar

Ionosphere DiabetesBreast Cancer Heart

Hill−ValleySpine Image

Ecoli1

(c)

Ecoli3

Vowel0

Vehicle0

(d)

Fig. 7: The results of cost sensitive errors on the test sets, over 50 trials. The grouped boxes represent the results of CSHL-SVM with three-dimensional grid search, CS-SVM with GS, SPη +GSλ , SPλ +GSη , and CV-SES, from left to right on different datasets. The notched-boxes have lines at the lower, median, and upper quartile values. The whiskers are lines extended from each end of the box to the most extreme data value within 1.5×IQR (Interquartile Range) of the box. Outliers are data with values beyond the ends of the whiskers, which are displayed by plus signs. (a): C(−, +) = 2. (b): C(−, +) = 5. (c): C(−, +) = 10. (d): C(−, +) = ratio, for imbalanced learning.

.2

Proof of Theorem 4

Proof. Assuming (λ1 , η1 ) and (λ2 , η2 ) are two end points of the i-th line segment of the convex polygon region R0 , and (λ0 , η0 ) is a point which belongs to R0 , but not on the i-th line segment, we will prove that, for all points (θλ1 + (1 − θ)λ2 , θη1 + (1 − θ)η2 ) (abbreviated as (λ(θ), η(θ)), where θ ∈ (0, 1)), there always exists a non-zero-length line segment, which has the slope coefficient η(θ)−η0 , the end point (λ(θ), η(θ)), and the partition as ξ = λ(θ)−λ 0 described in Theorem 4 (denoted as Sub-Conclusion 1). Based on this sub-conclusion, it is easy to conclude that Theorem 4 holds. With the relationship ∆η = ξ∆λ (assuming ξ is finite here), the solution path of the parameter λ for CS-SVM can be obtained by calling the following three steps multiple times. First, the linear T relationship between [∆b′ ∆αM ] and ∆λ can be obtained as the following if ]∆η = ξ∆λ[. ] [ 0∑ − ξ|E| ∆b′ = R ∆λ 1M − ξ j∈E yj QMj ∆αM [ ] βbb′ def = ∆λ (32) b βM Second, substituting (32) into (9), we can get the linear relationship between ∆ge∑ i (∀i ∈ S ) and ∆λ as follows. ∑ ∆gei = βbj Qij ∆λ + yi βbb′ ∆λ + ξyj Qij ∆λ j∈M def

1 − yi + 2yi (ξ(λ − λ0 ) + η0 ) , ∀i ∈ M(λ0 , η0 ) 2 ge(λ0 , η0 )i + γbi (λ − λ0 ) ≤ λ, ∀i ∈ E(λ0 , η0 ) (35) ge(λ0 , η0 )i + γbi (λ − λ0 ) ≥ λ, ∀i ∈ R(λ0 , η0 ) (36) In order to draw Sub-Conclusion 1, we will prove that there always exists a nontrivial interval [λ(θ), λmax ] (or [λmin , λ(θ)]) of λ satisfying the inequalities (34)-(36), where λ0 = λ(θ), η0 = η(θ), and the partition π is described in Theorem 4 (denoted as Sub-Conclusion 2). T If the i-th inequality Ai [λ η] ≤ bi of CR(λ0 , η0 ) only corresponds to the t-th sample of S (it is the premise of Theorem 4), the inequality system (34)-(36) will imply the following strict inequalities. 1 − yi + 2yi η(θ) , (37) 0 < α(λ(θ), η(θ))i < 2 ∀i ∈ M(λ0 , η0 ) \ {t} ge(λ(θ), η(θ))i < λ(θ), ∀i ∈ E(λ0 , η0 )) \ {t} (38) ge(λ(θ), η(θ))i < λ(θ), ∀i ∈ R(λ0 , η0 )) \ {t} (39) Further, we can prove that ∀i ∈ M, βeiλ ̸= ±∞, and ∀i ∈ S , γeiλ ̸= ±∞, which similar to the proof of Lemma 4 in [20]. It means that Sub-Conclusion 2 holds if the inequalities corresponding to the t-th sample of S (i.e., (xt , yt )) are not considered in (34)-(36). In order to complete the proof, four cases must be considered to account for the inequalities corresponding to (xt , yt ) in (34)-(36). (1) If the i-th inequality of CR(λ0 , η0 ) is from the left part of (14), it means that (xt , yt ) migrates from M to R on the η(θ)−η solution path with the slope coefficient ξ = λ(θ)−λ00 and the starting point (λ0 , η0 ). Supposing that the solution adjustment on the interval [λ(θ), λ0 ] (or [λ0 , λmax ]) for λ is indexed by k , the ≤

j∈E

= γbi ∆λ (33) Finally, similar to (14)-(16), when adjusting λ, the following constraints should also be kept. This means that λ is limited to a certain interval. 0 ≤ α(λ0 , η0 )i + βbi (λ − λ0 ) (34)

0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2578326, IEEE Transactions on Pattern Analysis and Machine Intelligence JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014 900

700

λ

−3

λ

−3

CV−SES(κ=10 ) CSHL−SVM(κ=103) GS(κ=103)

30

SP +GS (κ=103)

20

SPλ+GSη(κ=10 )

η

λ

3

10

SPη+GSλ(κ=10 ) λ

500

η

η

3

CSHL−SVM(κ=10 ) 3

400

GS(κ=10 )

300

SPη+GSλ(κ=10 )

3

SP +GS (κ=103) λ

η

60

80

100

120

140

160

100

150

200

η

λ

3

200

SPλ+GSη(κ=10 )

λ

600

SPλ+GSη(κ=10−3)

0 50

300

CV−SES(κ=10 ) 3

CSHL−SVM(κ=10 ) 3

100

GS(κ=10 ) SP +GS (κ=103) λ

3

SPλ+GSη(κ=10 )

η

3

200

SPλ+GSη(κ=10 )

150

200

250

0 50

300

100

GS(κ=10 )

SPλ+GSη(κ=10 )

λ

3

300

250

300

−3

CSHL−SVM(κ=10 ) −3

GS(κ=10 )

1000

−3

SP +GS (κ=10 )

λ

η

SPλ+GSη(κ=10−3)

800

−3

CV−SES(κ=10 ) 3

CSHL−SVM(κ=10 )

600

3

GS(κ=10 ) SP +GS (κ=103) η

400

λ

3

SPλ+GSη(κ=10 )

3

−3

CV−SES(κ=10 ) 3

CSHL−SVM(κ=10 )

600

3

GS(κ=10 ) SP +GS (κ=103) η

400

λ

3

SPλ+GSη(κ=10 ) 3

CV−SES(κ=10 )

200

λ

SPλ+GSη(κ=10−3)

800

3

CV−SES(κ=10 )

250

1200

−3

η

CSHL−SVM(κ=10 )

200

(d)

SP +GS (κ=10 )

CV−SES(κ=10 )

200

150

Size of validation set

−3

−3

SP +GS (κ=103)

λ

3

GS(κ=10 )

λ

300

3

SP +GS (κ=103)

−3

3

CV−SES(κ=10 )

300

CSHL−SVM(κ=10 ) 1000

SP +GS (κ=10 )

η

3

GS(κ=10 )

CV−SES(κ=10 )

1200 −3

3

400

3

CSHL−SVM(κ=10 )

(c)

GS(κ=10 ) η

400

Size of validation set

SPλ+GSη(κ=10−3) 500

−3

CV−SES(κ=10 )

100

−3

Runtime (in minutes)

Runtime (in minutes)

250

−3

−3

200

SP +GS (κ=103)

CSHL−SVM(κ=10 ) 700

−3

SP +GS (κ=10 )

400

300

λ

SPλ+GSη(κ=10−3)

3

800 −3

−3

GS(κ=10 )

η

3

GS(κ=10 )

(b)

CSHL−SVM(κ=10 )

600

3

CSHL−SVM(κ=10 )

Size of validation set

1200

η

400

500

CV−SES(κ=10 )

(a)

800

η

−3

CV−SES(κ=10 )

100

Size of validation set

1000

SP +GS (κ=10 )

λ

SPλ+GSη(κ=10−3)

3

0 50

180

500

CV−SES(κ=10 )

100

40

−3

SP +GS (κ=10 )

SP +GS (κ=10−3) CV−SES(κ=10−3)

200

CV−SES(κ=103)

0 20

600

−3

GS(κ=10 )

600 −3

Runtime (in minutes)

40

η

−3

CSHL−SVM(κ=10 )

−3

GS(κ=10 )

600 −3

Runtime (in minutes)

η

Runtime (in minutes)

Runtime (in minutes)

SP +GS (κ=10 ) SP +GS (κ=10 )

50

700 −3

CSHL−SVM(κ=10 )

−3

GS(κ=10 )

Runtime (in minutes)

800

−3

GS(κ=10 ) −3

60

700

CSHL−SVM(κ=10−3)

−3

CSHL−SVM(κ=10 ) 70

Runtime (in minutes)

80

14

CV−SES(κ=10 )

200

100

100

150

200

0 50

250

100

150

Size of validation set

200

(e)

Runtime (in minutes)

Runtime (in minutes)

λ

−3

CV−SES(κ=10 ) 3

CSHL−SVM(κ=10 ) 3

GS(κ=10 ) 3

SP +GS (κ=10 ) λ

3

SPλ+GSη(κ=10 ) 500

0 50

250

400

100

SPλ+GSη(κ=10−3) 3

CSHL−SVM(κ=10 ) 3

SP +GS (κ=103) η

CSHL−SVM(κ=10−3)

λ

SPλ+GSη(κ=103)

−3

SPη+GSλ(κ=10 ) SP +GS (κ=10−3) λ

800

CV−SES(κ=10 ) CSHL−SVM(κ=103) 600

3

GS(κ=10 ) 3

SPη+GSλ(κ=10 )

400

SP +GS (κ=103) λ

150

200

250

0 50

300

100

150

η

3

CV−SES(κ=10 )

Size of validation set

η

−3

3

100

200

GS(κ=10−3)

GS(κ=10 )

3

CV−SES(κ=10 )

150

(h)

1000

−3

SPη+GSλ(κ=10 )

CV−SES(κ=10 )

200

100

Size of validation set

1200

−3

−3

300

0 50

300

(g) −3

SP +GS (κ=10−3)

η

200

GS(κ=10 ) 500

SPλ+GSη(κ=10−3)

1000

150

CSHL−SVM(κ=10 )

−3

GS(κ=10 )

1500

100

Size of validation set

600 −3

CSHL−SVM(κ=10 )

η

0 50

300

(f)

2500

2000

250

Size of validation set

Runtime (in minutes)

0 50

CV−SES(κ=10 )

200

200

250

0 50

300

100

150

Size of validation set

(i)

200

250

300

Size of validation set

(j)

(k)

Fig. 8: Runtime of CSHL-SVM with three-dimensional grid search, and of CS-SVM with GS, SPη +GSλ , SPλ +GSη , and CV-SES on different datasets, when κ = 10−3 and 103 , respectively. (a) Sonar datset. (b) Ionosphere datset. (c) Diabetes datset. (d) Breast Cancer datset. (e) Heart datset. (f) Hill-Valley datset. (g) Spine Image datset. (h) Ecoli1 datset. (i) Ecoli3 datset. (j) Vowel0 datset. (k) Vehicle0 datset.  immediate next∑round of adjustment for λ is indexed by k + 1, and c = 1 −[ ξ i∈E [k] y]i Qti( , we have ) [k+1] b 1 ( [k] [k] ) βb′ [k] = R\tt − [k] R∗t Rt∗ [k+1] \tt βbM[k+1] Rtt [ ] [k+1] 0∑ − ξ|E | · 1M[k+1] − ξ j∈E [k+1] yj QM[k+1] j [ ] [k] [k] βbb′ − cRb′ t = [k] [k] βbM[k+1] − cRM[k+1] t  ( )  [k] [k] [k] 1  Rb′ t βbt (− cRtt )  − [k] [k] [k] [k] R [k+1] βb − cR R M

tt [k+1] γbt



=

i∈M[k+1]



=

tt

[k+1] [k+1] βbi Qti + yt βbb′ +

(

i∈M[k+1]

+

t

t



i∈E [k+1]

) [k] [k] βbi − cRit Qti +



i∈E [k+1] ( [k] yt βbb′ −

[k] cRb′ t

)

) yt ( [k] b[k] [k] [k] R − cR ′ t βt ′ t Rtt b b [k] Rtt  ( ) [k] [k] [k] [k] Qti Rit βbt − cRit Rtt 

ξyi Qti −

 1  ∑ − [k] Rtt i∈M[k+1] ∑ [k] ∑ [k] = βbi Qti + yt βbb′ + ξyi Qti i∈M[k]

ξyi Qti

i∈E [k]

−c 





[k] Qti Rit

+

[k] yt Rb′ t 

i∈M[k]

  [k] βbt  ∑ [k] [k]  Qti Rit + yt Rb′ t − [k] Rtt i∈M[k]   [k] cRtt  ∑ [k] [k] + [k] Qti Rit + yt Rb′ t  Rtt [k] i∈M [k] = γbt − c −

[k] βbt [k]

[k]

+

cRtt

[k]

=−

[k] βbt [k]

Rtt Rtt Rtt [k] Further, we can prove that Rtt > 0, which is similar to the proof of Corollary 3 in [20]. It means that the inequalities corresponding to (xt , yt ) in (34) lead to a nontrivial interval. (2) If the i-th inequality of CR(λ0 , η0 ) is from the right part of (14), it means that (xt , yt ) migrates from M to E on the solution η(θ)−η path with the slope coefficient ξ = λ(θ)−λ00 and the starting point (λ0 , η0 ). [Then we have ] [k+1] βbb′ [k+1] βb [k+1] ) ( M 1 ( [k] [k] ) [k] = R\tt − [k] R∗t Rt∗ \tt Rtt ([ ] [k] 0∑ − ξ|E | · 1M[k+1] − ξ j∈E [k] yj QM[k+1] j

0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2578326, IEEE Transactions on Pattern Analysis and Machine Intelligence JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014

[

− [

ξyt yt ξyt QM[k+1] t

] [ ] [k] [k] [k] βbb′ − cRb′ t Rb′ t = + ξyt Qtt [k] [k] [k] βbM[k+1] − cRM[k+1] t RM[k+1] t  ( )  [k] [k] [k] 1  Rb′ t βbt (− cRtt )  − [k] [k] [k] [k] RM[k+1] t βbt − cRtt Rtt [ ] [k] ξyt (1 − Rtt Qtt ) Rb′ t + [k] [k] (1 − Rtt Qtt ) RM[k+1] t Rtt [k+1] γbt ∑ ∑ [k+1] [k+1] = βbi Qti + yt βbb′ + ξyi Qti i∈M[k+1]



=

(

[k+1]

i∈E ) ( ) [k] [k] [k] [k] b βi − cRit Qti + yt βbb′ − cRb′ t

i∈M[k+1]

+



ξyi Qti + ξyt Qtt

i∈E [k]

  ( ) 1  ∑ [k] b[k] [k] [k]  Qti Rit βt − cRit Rtt − [k] Rtt i∈M[k+1] ) yt ( [k] [k] [k] [k] − [k] Rb′ t βbt − cRb′ t Rtt Rtt   ∑ [k] [k] +ξyt Qtt  Qti Rit + yt R ′  bt

i∈M[k+1]

  ξyt  ∑ [k] [k]  Qti Rit + yt Rb′ t + [k] Rtt i∈M[k+1]   ∑ [k] [k] −ξyt Qtt  Qti Rit + yt Rb′ t 

=

i∈M[k+1]



[k] [k] βbi Qti + yt βbb′ +

i∈M[k]



−c 



ξyi Qti

=



Qti Rit + yt Rb′ t  [k]

[k]

=

i∈M[k]





ξyt  ∑ [k] [k] Qti Rit + yt Rb′ t  [k] Rtt i∈M[k]   ∑ [k] [k] −ξyt Qtt  Qti Rit + yt Rb′ t  [k] = γbt − c −

= −

[k] βbt − ξyt [k]

Rtt

[k]

cRtt

[k] Rtt

+ ξyt Qtt +

γt

[k+1] [k+1] βbi Qti + yt βbb′ +

i∈M[k+1] ∑ [k] βbi Qti [k] i∈M

[k] + yt βbb′ +





ξyi Qti

i∈E [k+1]

ξyi Qti − ξyt Qtt

i∈E [k]

i∈M[k]

+

i∈M[k] [k] βbt + [k] Rtt

[k+1] γbt ∑

  m ∑ t t + t β i Qti + yt β b′ + Qtt  + ξyt Qtt γt i∈M[k] m t [k] [k] = γbt + t · γ t = γbt + m = 0 γt [k] [k+1] b γ we have βbt = − γtt + ξyt . In addition, we have t ∑ t t γ tt = β i Qti + yt β b′ + Qtt

  [k] b ∑ β [k] [k] Qti Rit + yt Rb′ t  − t[k]  Rtt i∈M[k]   [k] cRtt  ∑ [k] [k]  Qti Rit + yt Rb′ t + [k] Rtt i∈M[k]   ∑ [k] [k] +ξyt Qtt  Qti Rit + yt Rb′ t  i∈M[k]

It means that the inequalities corresponding to (xt , yt ) in (34) lead to a nontrivial interval. (3) If the i-th inequality of CR(λ0 , η0 ) is from (15), it means that (xt , yt ) migrates from E to M on the solution η(θ)−η path with the slope coefficient ξ = λ(θ)−λ00 and the starting [ ]T t , and point (λ0 , η0 ). Supposing β = −R[k] yt QM[k] t ∑ t t t γ t = i∈M + yt β b′ + Qtt , we have that [ [k] β i Qti ] [k+1] βbb′ [k+1] βbM[k+1] ([ ] 0∑ − ξ|E [k] | = R[k+1] 1M[k+1] − ξ j∈E [k] yj QM[k+1] j [ ]) yt +ξyt QM[k+1] t  ][ ]T  [ [ [k] ] t t 1 R 0 β β  + t =  0T 0 γt 1 1 ] [ [ ] 0 0∑ − ξ|E [k] | + · ξyt 1M[k+1] − ξ j∈E [k] yj QM[k+1] j   [k] ] [ [ ] βb ′ t 1 0  b  · =  βb[k]  + + t β M ξyt γt 1 0 [ ]T [ ] t 0∑ − ξ|E [k] | β 1M[k+1] − ξ j∈E [k] yj QM[k+1] j 1 | {z } m   [k] ] [ [ ] βbb′ m βt 0  b[k]  =  β + + t M ξyt γt 1 0 [k+1] Therefore, we have βb = mt + ξyt . Because t

i∈E [k]



15

])

ξyt [k]

Rtt

− ξyt Qtt

2

∑ t

= y ϕ(x ) + β y ϕ(x ) t i i i

t

i∈M[k] ∑ t Supposing that yt ϕ(xt ) + i∈M[k] β i yi ϕ(xi ) = 0, i.e. ∑ t yt ϕ(xt ) = − i∈M[k] β i yi ϕ(xi ), and plugging it into the expression (33), we have [k] γbt ∑ [k] ∑ [k] = βb Qti + yt βb ′ + ξyi Qti i

b

[k]

i∈M ⟨

=

yt ϕ(xt ),

i∈E [k]

∑ i∈M[k]

[k] βbi yi ϕ(xi )

+





ξyi ϕ(xi )

i∈E [k]

0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2578326, IEEE Transactions on Pattern Analysis and Machine Intelligence JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014



[k] −βbb′

=

pletes the proof.

i∈E [k]





yt ϕ(xt ),

[k] βbi yi ϕ(xi ) +

i∈M[k]

− ⟨

=





t

βi 

i∈E [k]

∑ ∑





[k] βbi Qti +

ξyi ϕ(xi ) 

ξyi Qti 

[k+1] γbt ∑

= =

t

j∈E



def



ξyi ϕ(xi )

=0

i∈E [k] [k] γbt ̸= 0,

γt

[k+1] [k+1] βbi Qti + yt βbb′ + [k] + yt βbb′ +





ξyi Qti

i∈E [k+1]

ξyi Qti

i∈E [k]

  m ∑ t t β i Qti + yt β b′ + Qtt  + t γt [k] i∈M m t [k] [k] = γbt + t · γ t = γbt + m = 0 γt [k] [k+1] b γ we have βb = − tt . It means that the inequality correspondt

γt

j∈M

β i yi ϕ(xi ),

[k] βbi yi ϕ(xi ) +

i∈M[k+1] ∑ [k] βbi Qti [k] i∈M

Proof. ∀(λ, η) ∈ CR(λ0 , η0 ), according to (12), we can get the ei ) and linear relationship between ∆f (x ( [∆λ ∆η] as follows. ) ∑ ei ) = ei ) βjλ ∆λ + βjη ∆η ∆f (x yj K(xj , x ( ) ∑ ei )∆η + βbλ′ ∆λ + βbη′ ∆η + K(xj , x

which contradicts the premise when (xt , yt ) migrates ∑ t from E to M. Thus, we have yt ϕ(xt )+ i∈M[k] β i yi ϕ(xi ) ̸= 0. t This implies γ t > 0. It means that the inequality corresponding to (xt , yt ) in (35) lead to a nontrivial interval. (4) If the i-th inequality of CR(λ0 , η0 ) is from (16), it means that (xt , yt ) migrates from R to M on the solution path with the η(θ)−η slope coefficient ξ = λ(θ)−λ00 and the starting point (λ0 , η0 ). We have that [ ] [k+1] βbb′ [k+1] βbM[k+1] [ ] 0∑ − ξ|E [k+1] | = R[k+1] 1M[k+1] − ξ j∈E [k+1] yj QM[k+1] j  [ ][ ]T  [ [k] ] t t 1 R 0 β · =  + t β 0T 0 γt 1 1 [ ] 0∑ − ξ|E [k] | 1M[k+1] − ξ j∈E [k] yj QM[k+1] j   [k] ][ ]T [ βbb′ t t 1  b[k]  β β · =  β + t M γt 1 1 0 [ ] 0∑ − ξ|E [k] | 1M[k+1] − ξ j∈E [k] yj QM[k+1] j   [k] [ ] βbb′ t  b[k]  m β =  β + t M γt 1 0 [k] [k+1] b γ we have βb = − tt . Because t

.3 Proof of Theorem 5

i∈E [k]

i∈M[k]

i∈M[k]



i∈E [k]

i∈M[k]

yt ϕ(xt ) + ∑

16

t β i yi

ing to (xt , yt ) in (36) leads to a nontrivial interval. If ξ is infinite, we can also prove Sub-Conclusion 1 with the λ(θ)−λ relationship ∆λ = η(θ)−η00 ∆η similarly. The same analysis can be extended to 2C -SVM. This com-

γeiλ ∆λ

γeiη ∆η

= + (40) e (λ0 , η0 ), we can get Combining (40) with the constraint of π the following constraints. ∀i ∈ I+ (λ0 , η0 ) : ei ) + γ eiλ (λ − λ0 ) + γ eiη (η − η0 ) ≥ 0 f (λ0 , η0 )(x (41) ∀i ∈ I− (λ0 , η0 ) : ei ) + γ eiλ (λ − λ0 ) + γ eiη (η − η0 ) < 0 f (λ0 , η0 )(x (42) Obviously, IR(λ0 , η0 ) is the set of feasible solutions to the system of inequalities (41)-(42). According to (41)-(42), IR(λ0 , η0 ) is a convex set and its closure is a convex polygon region. The 0 0 same analysis can be extended to IR(C+ , C− ). This completes the proof.

ACKNOWLEDGMENTS The authors would like to thank the anonymous reviewers for their constructive comments and suggestions. This work was supported by the Project Funded by the Priority Academic Program Development (PAPD) of Jiangsu Higher Education Institutions, the U.S. National Science Foundation (IIS-1115417), and the National Natural Science Foundation of China (Nos: 61232016, 61573191 and 61573191).

R EFERENCES [1] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, Inc., New York, NY, 1998. [2] Park, Y. J., Chun, S. H., & Kim, B. C.. Cost-sensitive case-based reasoning using a genetic algorithm: Application to medical diagnosis. Artificial Intelligence in Medicine, 51(2):133 – 145, 2011. [3] Zhang, Y., & Zhou, Z. H.. Cost-sensitive face recognition. IEEE Trans. Pattern Anal. Mach. Intell., 32(10):1758–1769, 2010. [4] Cui, G., Wong, M. L., & Wan, X.. Cost-sensitive learning via priority sampling to improve the return on marketing and crm investment. J. of Management Information Systems, 29(1):341–374, 2012. [5] Sheng, V. S., & Ling, C. X. Thresholding for making classifiers costsensitive. In AAAI, pages 476–481, 2006. [6] Liu, X. Y., & Zhou, Z. H.. The influence of class imbalance on costsensitive learning: An empirical study. In ICDM, 970–974, 2006. [7] Sch¨olkopf, B., & Smola, A. J.. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001. [8] Davenport, M., Baraniuk, R. G., & Scott, C. D.. Tuning support vector machines for minimax and neyman-pearson classification. IEEE Trans. Pattern Anal. Mach. Intell., 32(10):1888–1898, 2010. [9] Karakoulas, G., & Shawe-Taylor, J.. Optimizing classifiers for imbalanced training sets. In NIPS, 253–259, 1999. [10] Lee, G., & Scott, C.. Nested support vector machines. IEEE Trans. Signal Process., 58(3):1648–1660, 2010. [11] Masnadi-Shirazi, H., & Vasconcelos, N.. Risk minimization, probability elicitation, and cost-sensitive svms. In ICML, 759–766, 2010. [12] Hastie, T., Rosset, S., Tibshirani, R., & Zhu, J.. The entire regularization path for the support vector machine. J. Mach. Learn. Res., 5:1391–1415, 2004. [13] L. Gunter, J. Zhu. Efficient computation and model selection for the support vector regression. Neural Comput., 19(6):1633–1655, 2007. [14] Rosset, S.. Bi-level path following for cross validated solution of kernel quantile regression. J. Mach. Learn. Res., 10:2473–2505, 2009.

0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2578326, IEEE Transactions on Pattern Analysis and Machine Intelligence JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014

[15] Wang, G., Yeung, D. Y., & Lochovsky, F. H.. A new solution path algorithm in support vector regression. IEEE Trans. Neural Netw., 19(10): 1753–1767, 2008. [16] Rosset, S., Zhu, J.: Piecewise linear regularized solution paths. The Annals of Statistics (2007) 1012–1030. [17] Takeuchi, I., Nomura, K., & Kanamori, T.. Nonparametric conditional density estimation using piecewise-linear solution path of kernel quantile regression. Neural Comput., 21(2):533–559, 2009. [18] F. R. Bach, D. Heckerman, E. Horvitz. Considering cost asymmetry in learning classifiers. J. Mach. Learn. Res., 7:1713–1741, 2006. [19] Yang, J. B., & Ong, C. J.. Determination of global minima of some common validation functions in support vector machine. IEEE Trans. Neural Netw., 22:654–659, 2011. [20] Gu, B., Wang, J. D., Zheng, G. S., & Yu, Y. C.. Regularization path for ν -support vector classification. IEEE Trans. Neural Netw. Learn. Syst., 23 (5):800–811, 2012. [21] Gu, B., Sheng, V. S., & Li, S.. Bi-parameter space partition for costsensitive SVM. In IJCAI, pages 3532–3539, 2015. [22] Boyd, S., & Vandenberghe, L.. Convex Optimization. Cambridge University Press, 2004. [23] Avis, D., & Fukuda, K.. A pivoting algorithm for convex hulls and vertex enumeration of arrangements and polyhedra. Discrete & Computational Geometry, 8(1):295–313, 1992. [24] Borrelli, F.. Constrained Optimal Control of Linear and Hybrid Systems. Springer, 2003. [25] Cai, F., & Cherkassky, V.. Generalized smo algorithm for svm-based multitask learning. IEEE Trans. Neural Netw. Learn. Syst., 23(6):997– 1003, 2012. [26] Laskov, P., Gehl, C., Krger, S., & M¨uller, K. R.. Incremental support vector learning: Analysis, implementation and applications. J. Mach. Learn. Res., 7:1909–1936, 2006. [27] Frank, A., & Asuncion, A.. UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml. [28] Chang, C. C., & Lin, C. J.. LIBSVM: A library for support vector machines. ACM T. INTEL. SYST. TEC., 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/∼cjlin/libsvm. [29] Chen, P. H., Fan, R. E., & Lin, C. J.. A study on smo-type decomposition methods for support vector machines. IEEE Trans. Neural Netw., 17(4): 893–908, 2006. [30] J. Giesen, J.K. Mueller, S. Laue and S. Swiercy. Approximating Concavely Parameterized Optimization Problems. In NIPS, pages 2114– 2122, 2012. [31] Masnadi-Shirazi, H., Vasconcelos, N., & Iranmehr, A.. Cost-sensitive support vector machines. arXiv preprint arXiv:1212.0975, 2012. [32] Charles Elkan. The foundations of cost-sensitive learning. In IJCAI, pages 973–978, 2001. [33] B. Sch¨olkopf, A. J. Smola, R. C. Williamson, & P. L. Bartlett. New support vector algorithms. Neural Comput., 12(5):1207–1245, 2000. [34] He, H., & Garcia, E.. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng., 21(9):1263–1284, 2009.

Bin Gu received the B.S. and Ph.D. degrees in computer science from the Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2005 and 2011, respectively. He joined the School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, in 2010, as a Lecturer. He was promoted to Associate Professor in 2014. He was a Post-Doctoral Fellow with the University of Western Ontario, London, ON, Canada, from 2013 to 2015. His research interests include machine learning, data mining, and medical image analysis.

17

Victor S. Sheng received the masters degree in computer science from the University of New Brunswick, Fredericton, NB, Canada, in 2003, and the Ph.D. degree in computer science from the Western University, London, ON, Canada, in 2007. He was an Associate Research Scientist and NSERC Post-Doctoral Fellow in information systems at Stern Business School, New York University, New York, NY, USA. He is an Associate Professor of Computer Science at the University of Central Arkansas, Conway, AR, USA, and the Founding Director of Data Analytics Laboratory. His current research interests include data mining, machine learning, and related applications. Prof. Sheng is a senior member of the IEEE and the IEEE Computer Society, and a Lifetime Member of ACM. He has published more than 80 research papers in conferences and journals of machine learning and data mining. He was the recipient of the Best Paper Award Runner-Up from KDD’08, and the Best Paper Award from ICDM’11. He is a PC Member for a number of international conferences and a reviewer for several international journals.

Keng Yeow Tay graduated with a medical degree from University of New South Wales, Sydney, Australia, and completed a clinical fellow in diagnostic neuroradiology at University of Toronto, Toronto, Ontario, Canada. He is presently Assistant Professor of Radiology at the Schulich School of Medicine, the University of Western Ontario. His current research interests include diagnostic neuroradiology and head & neck radiology.

Walter Romano graduated with a medical degree from the University of Western Ontario (UWO). He is an Associate Professor of Radiology at UWO. He is the Medical Director of Ultrasound at St. Josephs Health Care London and is Interim Chief of the Department of Diagnostic Imaging and has held numerous committee responsibilities at St. Josephs, as well as at the University of Western Ontario.

Shuo Li is a research scientist and project manager in general electric (GE) healthcare, Canada. He is also an adjunct research professor in the University of Western Ontario and adjunct scientist in Lawson health research institute. He is currently leading the digital imaging group of London (http://digitalimaginggroup.ca/) as the scientific director. He received his Ph.D. degree in computer science from Concordia University, where his PhD thesis won the doctoral prize, which gives to the most deserving graduating student in the faculty of engineering and computer science. He obtained his master and bachelor degrees all from China. He is the recipient of several GE internal awards. He serves as guest editor and associate editor in prestigious journals in the field. His current interest is in intelligent medical imaging systems with a main focus on automated medial image analysis and visualization.

0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Suggest Documents