Neural Networks 23 (2010) 60–73


New support vector algorithms with parametric insensitive/margin model

Pei-Yi Hao
Department of Information Management, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan, ROC

Article history: Received 8 September 2008; received in revised form 28 July 2009; accepted 7 August 2009.

Keywords: Support vector machines (SVMs); Regression estimation; Classification; Heteroscedastic noise model; Parametric-insensitive model; Parametric-margin model

Abstract

In this paper, a modification of v-support vector machines (v-SVM) for regression and classification is described, and the use of a parametric insensitive/margin model with an arbitrary shape is demonstrated. This can be useful in many cases, especially when the noise is heteroscedastic, that is, when the noise strongly depends on the input value x. Like the previous v-SVM, the proposed support vector algorithms have the advantage of using the parameter 0 ≤ v ≤ 1 for controlling the number of support vectors. To be more precise, v is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. The algorithms are analyzed theoretically and experimentally.

1. Introduction

Support vector machines (SVMs) are one of the leading techniques for pattern classification and function approximation (Cortes & Vapnik, 1995; Vapnik, 1995). The SVM was originally developed for pattern classification (for brevity, we call it C-SVC (Boser, Guyon, & Vapnik, 1992; Burges, 1998; Cortes & Vapnik, 1995; Vapnik, 1995)). Consider a set of N data vectors {x_i, y_i}, i = 1, ..., N, y_i ∈ {−1, 1}, x_i ∈ R^n, where x_i is the ith data vector that belongs to the binary class y_i. C-SVC seeks to estimate the hyperplane ⟨w · x⟩ + b = 0 that best separates the two classes of data with the widest margin, where ⟨·⟩ denotes the inner product. The task of C-SVC is therefore to minimize the regularized risk functional (1/2)‖w‖² + C · R_emp, where the constant C > 0 determines the trade-off between maximizing the margin and minimizing the empirical risk R_emp. C-SVC represents the decision boundary in terms of a typically small subset of all training examples, called the support vectors (Schölkopf, Burges, & Vapnik, 1995). In order for this sparseness property to carry over to the case of SV regression, Vapnik devised the so-called ε-insensitive loss function (Drucker, Burges, Kaufman, Smola, & Vapnik, 1996; Vapnik, 1995):

$L_\varepsilon(x, y, f) = |y - f(x)|_\varepsilon := \begin{cases} 0 & \text{if } |y - f(x)| \le \varepsilon \\ |y - f(x)| - \varepsilon & \text{otherwise,} \end{cases}$   (1)

which does not penalize errors below some ε > 0, chosen a priori. Only the points outside the ε-insensitive tube contribute to the cost.


Vapnik's algorithm, which we call ε-SVR, seeks to estimate functions f(x) = ⟨w · x⟩ + b, based on independent and identically distributed data (x_1, y_1), ..., (x_N, y_N) ∈ R^n × R, by minimizing the regularized risk functional (1/2)‖w‖² + C · R_emp^ε. Here, ‖w‖² is a term that characterizes the model complexity (Smola, Schölkopf, & Müller, 1998),

$R_{emp}^{\varepsilon}[f] := \frac{1}{N}\sum_{i=1}^{N} |y_i - f(x_i)|_\varepsilon$

measures the ε-insensitive training errors, and C > 0 is a constant determining the trade-off between minimizing the training errors and minimizing the model complexity. In short, minimizing the regularized risk functional captures the main insight of statistical learning theory: in order to obtain a small risk, we need to control both the training error and the model complexity, by explaining the data with a simple model (Vapnik & Chervonenkis, 1974; Vapnik, 1982, 1995).

The v-support vector machine (v-SVM) is an extension of the support vector machine for classification and regression (Schölkopf, Bartlett et al., 1999; Schölkopf, Smola, Williamson, & Bartlett, 2000). Schölkopf et al. introduced a new parameter v which can control the number of support vectors and training errors, and which also eliminates one of the other free parameters of the original support vector algorithms: for regression, v replaces ε, whereas for classification, v replaces C. The v-support vector regression (v-SVR) algorithm automatically trades off the insensitive-tube width ε against the model complexity ‖w‖² and the ε-insensitive training errors using the parameter v. Schölkopf et al. also presented a definition of a margin that both SV classification and regression algorithms maximize (Schölkopf et al., 2000), and formulated a v-support vector classification (v-SVC) algorithm as a modification of the C-SVC algorithm.
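For concreteness, the following is a minimal NumPy sketch (ours, not part of the original paper) of the ε-insensitive loss of Eq. (1) and the empirical risk R_emp^ε defined above; the function names are illustrative assumptions.

```python
import numpy as np

def eps_insensitive_loss(y, f_x, eps):
    """|y - f(x)|_eps of Eq. (1): zero inside the tube of width eps, linear outside."""
    return np.maximum(np.abs(y - f_x) - eps, 0.0)

def empirical_risk(y, f_x, eps):
    """R_emp^eps[f]: average eps-insensitive error over the training sample."""
    return np.mean(eps_insensitive_loss(y, f_x, eps))
```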


To motivate the new algorithms proposed here, note that v-SVR retains the assumption that the noise level is uniform throughout the domain, or at least that its functional dependency is known beforehand (Schölkopf et al., 2000). The assumption of a uniform noise model, however, is not always satisfied; the amount of noise may depend on location (i.e., a heteroscedastic noise structure). Consider the following heteroscedastic regression model (Yunn & Wahba, 2004):

$y_i = \mu(x_i) + \sigma(x_i)\, e_i,$

where μ and σ are unknown mean and variance functions to be estimated, x_i, i = 1, ..., N are univariate or multivariate covariates, and the e_i are i.i.d. noise with mean 0 and variance 1. Most regression methodologies focus on the conditional mean. However, understanding the local variability of the data, measured by the conditional variance, is very important in many scientific studies (Cawley, Janacek, Haylock, & Dorling, 2007; Cawley, Talbot, Foxall, Dorling, & Mandic, 2004; Williams, 1996; Yunn & Wahba, 2004). Those applications necessitate the development of new regression techniques that allow the modeling of varying variance. In this paper, a modification of the v-support vector regression (v-SVR) algorithm is described. It shows how to use a parametric-insensitive model with an arbitrary shape to estimate both the mean and the variance function simultaneously, without prior assumptions about either. The parametric-insensitive model is characterized by a learnable function g(x), which is estimated by a new constrained optimization problem. In view of the close connection between SV regression and classification algorithms, we also propose a modification of the v-support vector classification (v-SVC) algorithm that uses a parametric-margin model of arbitrary shape. It is shown how to best separate the two classes with a flexible parametric-margin model, including some theoretical analysis. This can be useful in many cases, especially when the data have a heteroscedastic error structure, that is, when the noise strongly depends on the input value x. Like the previous v-SVM, the proposed new support vector algorithms with a parametric insensitive/margin model have the advantage of using the parameter 0 ≤ v ≤ 1 to control the number of support vectors. To be more precise, v is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. Hence, the selection of v is more intuitive.

The rest of this paper is organized as follows. A brief overview of the v-support vector machine is given in Section 2. Section 3 describes a modification of the v-support vector regression (v-SVR) algorithm, called par-v-SVR, which uses a parametric-insensitive model. In Section 4, a modification of the v-support vector classification (v-SVC) algorithm, called par-v-SVC, is proposed. Experiments are presented in Section 5, and some concluding remarks are given in Section 6.

2. Overview of the v-support vector machine

The v-support vector machine (v-SVM) is a class of support vector machines that can handle both classification and regression problems, originally proposed by Schölkopf et al. This section gives a brief overview of the v-support vector machine. Interested readers may consult Schölkopf, Bartlett et al. (1999) and Schölkopf et al. (2000) for details.
2.1. v-support vector regression algorithm

As it is difficult to select an appropriate value of the insensitive-tube width ε in ε-SVR, the v-support vector regression (v-SVR) algorithm alleviates this problem by making ε part of the optimization problem (Schölkopf et al., 2000). Suppose we are


given a training data set {(x_1, y_1), ..., (x_N, y_N)} ⊂ R^n × R. The primal optimization problem of v-SVR is as follows:

$\min_{w,b,\varepsilon,\xi_i,\xi_i^*}\quad \frac{1}{2}\|w\|^2 + C\cdot\Big(v\varepsilon + \frac{1}{N}\sum_{i=1}^{N}(\xi_i + \xi_i^*)\Big)$   (2a)

subject to

$(\langle w \cdot x_i\rangle + b) - y_i \le \varepsilon + \xi_i,$   (2b)

$y_i - (\langle w \cdot x_i\rangle + b) \le \varepsilon + \xi_i^*,$   (2c)

$\varepsilon \ge 0, \quad \text{and} \quad \xi_i, \xi_i^* \ge 0 \ \text{for } i = 1, \ldots, N.$   (2d)

Consequently, the goal is not only to achieve a small training error (with respect to ε) but also to obtain a solution with small ε itself. By using the Lagrange multiplier techniques, one can show that this leads to the following dual optimization problem (Schölkopf et al., 2000):

$\max_{\alpha_i,\alpha_i^*}\quad -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\langle x_i \cdot x_j\rangle + \sum_{i=1}^{N}(\alpha_i - \alpha_i^*)y_i$   (3a)

subject to

$\sum_{i=1}^{N}(\alpha_i - \alpha_i^*) = 0,$   (3b)

$\sum_{i=1}^{N}(\alpha_i + \alpha_i^*) \le C\cdot v,$   (3c)

$\alpha_i, \alpha_i^* \in \big[0, \tfrac{C}{N}\big].$   (3d)

Nonlinearity of the algorithm is achieved by mapping x into a high-dimensional feature space F via the feature map Φ: x → F and computing a linear estimate in the feature space to obtain f(x) = ⟨w · Φ(x)⟩ + b. However, since F may have a very high dimension, this direct approach is often not computationally feasible. Hence kernels are used instead, i.e., the algorithm is rewritten in terms of dot products, which do not require explicit knowledge of Φ(x), and kernels are introduced by letting k(x, y) ≡ ⟨Φ(x) · Φ(y)⟩ (Schölkopf, Burges et al., 1999; Schölkopf, Mika et al., 1999; Vapnik, 1995). Since w can be written as a linear combination of the (mapped) training patterns x_i, this yields the following well-known kernel expansion of f:

$f(x) = \sum_{i=1}^{N}\alpha_i k(x, x_i) + b.$   (4)

The training algorithm itself can also be formulated in terms of k(x, y) such that the basic optimization problem remains unchanged (Schölkopf et al., 2000; Vapnik, 1995).
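As a usage illustration (not part of the original paper), the v-SVR formulation above is available in scikit-learn as NuSVR; the following sketch fits it to synthetic noisy sinc data, where the data-generation choices and parameter values are assumptions made only for this example (gamma corresponds to the RBF parameter q used later in the experiments).

```python
import numpy as np
from sklearn.svm import NuSVR

# Synthetic noisy sinc data (assumed setup for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-3 * np.pi, 3 * np.pi, size=(50, 1))
y = np.sinc(X.ravel() / np.pi) + 0.2 * rng.standard_normal(50)  # sin(x)/x plus noise

# nu plays the role of v: an upper bound on the fraction of errors and a
# lower bound on the fraction of support vectors.
model = NuSVR(nu=0.2, C=100.0, kernel="rbf", gamma=0.125)
model.fit(X, y)
print("fraction of SVs:", len(model.support_) / len(X))
```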

2.2. v-support vector classification algorithm

Schölkopf et al. also proposed the v-support vector classification (v-SVC) algorithm. Consider a set of N data vectors {x_i, y_i}, i = 1, ..., N, y_i ∈ {−1, 1}, x_i ∈ R^n, where x_i is the ith data vector that belongs to the binary class y_i. In v-SVC, the hyperplane f(x) = ⟨w · Φ(x)⟩ + b separates the data if and only if

$\langle w \cdot \Phi(x_i)\rangle + b \ge \rho \ \ \text{for } y_i = +1, \qquad \langle w \cdot \Phi(x_i)\rangle + b \le -\rho \ \ \text{for } y_i = -1.$   (5)

This changes the width of the margin in v-SVC to 2ρ/‖w‖, which is to be maximized while minimizing the margin errors (Schölkopf et al., 2000). By making ρ part of the optimization problem, the primal optimization problem of v-SVC is as follows:

$\min_{w,b,\rho,\xi_i}\quad \frac{1}{2}\|w\|^2 - v\rho + \frac{1}{N}\sum_{i=1}^{N}\xi_i$   (6a)

subject to

$y_i(\langle w \cdot \Phi(x_i)\rangle + b) \ge \rho - \xi_i,$   (6b)

$\rho \ge 0 \ \text{and} \ \xi_i \ge 0 \ \text{for } i = 1, \ldots, N.$   (6c)

By using the Lagrange multiplier techniques, it can be shown that this leads to the following dual optimization problem:

$\max_{\alpha_i}\quad -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j k(x_i, x_j)$   (7a)

subject to

$\sum_{i=1}^{N}\alpha_i y_i = 0,$   (7b)

$\sum_{i=1}^{N}\alpha_i \ge v,$   (7c)

$0 \le \alpha_i \le \frac{1}{N},$   (7d)

where k(x, y) ≡ ⟨Φ(x) · Φ(y)⟩ is the kernel function. The resulting decision function is $f^*(x) = \mathrm{sgn}\big(\sum_i \alpha_i y_i k(x_i, x) + b\big)$.
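For reference (again a sketch of ours rather than the paper's code), v-SVC is implemented in scikit-learn as NuSVC; the toy data below are an assumption for illustration only.

```python
import numpy as np
from sklearn.svm import NuSVC

# Two Gaussian blobs as an assumed binary toy problem.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

# nu corresponds to v in Eqs. (6)-(7): it bounds the fraction of margin errors
# from above and the fraction of support vectors from below.
clf = NuSVC(nu=0.2, kernel="rbf", gamma=0.5)
clf.fit(X, y)
print("fraction of SVs:", clf.support_.size / len(X))
```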

To understand why the new algorithms below are proposed, note that what v-SVM has so far retained is the assumption that the insensitive/margin model has a tube (or slab) shape. In the next section, we propose a modification of v-SVM that uses a new parametric insensitive/margin model of arbitrary shape. This can be useful in many cases, especially when the noise is heteroscedastic, that is, when it depends on the location x.

3. Support vector regression with parametric insensitive model

The v-support vector regression algorithm relies on the assumption that the noise level is uniform throughout the domain, or at least that its functional dependency is known beforehand (Schölkopf et al., 2000). The assumption of a uniform noise model, however, is not always satisfied. In many regression tasks, the amount of noise might depend on location (Cawley et al., 2004; Kersting, Plagemann, Pfaff, & Burgard, 2007; Le, Smola, & Canu, 2005). This section presents an algorithm, called par-v-SVR, that estimates both the conditional mean and the predictive variance of a regression problem simultaneously. Suppose we are given a set of training data {(x_1, y_1), ..., (x_N, y_N)}, where x_i ∈ R^n is an input and y_i ∈ R is a target output. By replacing ε, the width of the insensitive tube, with a parametric-insensitive function g(x), a new parametric-insensitive loss function L_g(x, y, f) is obtained.

Fig. 1. The par-v-SVR algorithm with parametric-insensitive model. The shaded region represents the parametric-insensitive zone f ± g.

Definition. The parametric-insensitive loss function L_g(x, y, f) is defined by

$L_g(x, y, f) = |y - f(x)|_g := \begin{cases} 0 & \text{if } |y - f(x)| \le g(x) \\ |y - f(x)| - g(x) & \text{otherwise,} \end{cases}$   (8)

where f and g are real-valued functions on the domain R^n, x ∈ R^n, and y ∈ R. In other words, we do not care about errors as long as they are inside the parametric-insensitive zone f ± g. Only the points outside the parametric-insensitive zone contribute to the cost, insofar as the deviations are penalized in a linear fashion.
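A minimal NumPy sketch of the loss in Eq. (8) (the function name is ours, for illustration only):

```python
import numpy as np

def parametric_insensitive_loss(y, f_x, g_x):
    """L_g(x, y, f) of Eq. (8): zero inside the zone f ± g, linear outside."""
    return np.maximum(np.abs(y - f_x) - g_x, 0.0)
```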

The goal of par-v-SVR is to estimate the regression model by automatically adjusting a parametric-insensitive zone f ± g of arbitrary shape and minimal size to include the data. Following the concept of kernel-based learning (Müller, Mika, Ratsch, Tsuda, & Schölkopf, 2001; Schölkopf, Burges et al., 1999), a non-linear function is learned by a linear learning machine in a kernel-induced feature space, via the feature map Φ: x → F, while the capacity of the system is controlled by a parameter that does not depend on the dimensionality of the space. Hence, the proposed par-v-SVR seeks to estimate the following two linear functions:

f(x) = ⟨w · Φ(x)⟩ + b,  where w ∈ F, x ∈ R^n, b ∈ R,
g(x) = ⟨c · Φ(x)⟩ + d,  where c ∈ F, x ∈ R^n, d ∈ R.

Fig. 1 depicts the situation graphically. The problem of finding the values for w, c, b, and d that minimize the parametric-insensitive training errors

$R_{emp}^{g}[f] = \frac{1}{N}\sum_{i=1}^{N} L_g(x_i, y_i, f)$   (9)

is equivalent to the following constrained optimization problem:

$\min_{w,c,b,d,\xi_i,\xi_i^*}\quad \frac{1}{2}\|w\|^2 + C\cdot\Big(v\cdot\big(\tfrac{1}{2}\|c\|^2 + d\big) + \frac{1}{N}\sum_{i=1}^{N}(\xi_i + \xi_i^*)\Big)$   (10a)

subject to

$(\langle w \cdot \Phi(x_i)\rangle + b) + (\langle c \cdot \Phi(x_i)\rangle + d) \ge y_i - \xi_i,$   (10b)

$(\langle w \cdot \Phi(x_i)\rangle + b) - (\langle c \cdot \Phi(x_i)\rangle + d) \le y_i + \xi_i^*,$   (10c)

$\xi_i, \xi_i^* \ge 0 \ \text{for } i = 1, \ldots, N.$   (10d)

At each point x_i, we allow an error of g(x_i) = ⟨c · Φ(x_i)⟩ + d. Everything above g(x_i) is captured in slack variables ξ_i and ξ_i^*, which are penalized in the objective function via a regularization constant C > 0, chosen a priori. The size of the parametric-insensitive zone, which is characterized by (1/2)‖c‖² + d, is traded off against the model complexity, which is characterized by (1/2)‖w‖², and the slack variables via a constant v > 0, chosen a priori.

We can find the solution of the optimization problem given by Eq. (10) in the dual variables by finding the saddle point of the Lagrangian:

$L = \frac{1}{2}\|w\|^2 + C\cdot\Big(v\cdot\big(\tfrac{1}{2}\|c\|^2 + d\big) + \frac{1}{N}\sum_{i=1}^{N}(\xi_i + \xi_i^*)\Big) - \sum_{i=1}^{N}\alpha_i\big[(\langle w \cdot \Phi(x_i)\rangle + b) + (\langle c \cdot \Phi(x_i)\rangle + d) - y_i + \xi_i\big] - \sum_{i=1}^{N}\alpha_i^*\big[-(\langle w \cdot \Phi(x_i)\rangle + b) + (\langle c \cdot \Phi(x_i)\rangle + d) + y_i + \xi_i^*\big] - \sum_{i=1}^{N}\beta_i\xi_i - \sum_{i=1}^{N}\beta_i^*\xi_i^*,$   (11)

where α_i, α_i^*, β_i, and β_i^* are the nonnegative Lagrange multipliers. Differentiating L with respect to w, c, b, d, ξ_i, ξ_i^* and setting the results to zero, we obtain:

$\partial L/\partial w = w - \sum_{i=1}^{N}\alpha_i\Phi(x_i) + \sum_{i=1}^{N}\alpha_i^*\Phi(x_i) = 0 \;\Rightarrow\; w = \sum_{i=1}^{N}(\alpha_i - \alpha_i^*)\Phi(x_i),$   (12)

$\partial L/\partial c = (C\cdot v)\,c - \sum_{i=1}^{N}(\alpha_i + \alpha_i^*)\Phi(x_i) = 0 \;\Rightarrow\; c = \frac{1}{C\cdot v}\sum_{i=1}^{N}(\alpha_i + \alpha_i^*)\Phi(x_i),$   (13)

$\partial L/\partial b = -\sum_{i=1}^{N}\alpha_i + \sum_{i=1}^{N}\alpha_i^* = 0 \;\Rightarrow\; \sum_{i=1}^{N}(\alpha_i - \alpha_i^*) = 0,$   (14)

$\partial L/\partial d = C\cdot v - \sum_{i=1}^{N}(\alpha_i + \alpha_i^*) = 0 \;\Rightarrow\; \sum_{i=1}^{N}(\alpha_i + \alpha_i^*) = C\cdot v,$   (15)

$\partial L/\partial \xi_i = \frac{C}{N} - \alpha_i - \beta_i = 0 \;\Rightarrow\; \alpha_i = \frac{C}{N} - \beta_i \ \text{and} \ \alpha_i \le \frac{C}{N},$   (16)

$\partial L/\partial \xi_i^* = \frac{C}{N} - \alpha_i^* - \beta_i^* = 0 \;\Rightarrow\; \alpha_i^* = \frac{C}{N} - \beta_i^* \ \text{and} \ \alpha_i^* \le \frac{C}{N}.$   (17)

Substituting Eqs. (12)–(17) into (11) and incorporating kernels for dot products yields the dual problem:

$\max_{\alpha_i,\alpha_i^*}\quad -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)k(x_i, x_j) - \frac{1}{2Cv}\sum_{i=1}^{N}\sum_{j=1}^{N}(\alpha_i + \alpha_i^*)(\alpha_j + \alpha_j^*)k(x_i, x_j) + \sum_{i=1}^{N}(\alpha_i - \alpha_i^*)y_i$   (18a)

subject to

$\sum_{i=1}^{N}(\alpha_i - \alpha_i^*) = 0,$   (18b)

$\sum_{i=1}^{N}(\alpha_i + \alpha_i^*) = C\cdot v,$   (18c)

$\alpha_i, \alpha_i^* \in \big[0, \tfrac{C}{N}\big],$   (18d)

where k(x, y) ≡ ⟨Φ(x) · Φ(y)⟩ is the kernel function. Solving the above dual quadratic programming problem, we obtain the Lagrange multipliers α_i and α_i^*, which give the weight vectors w and c as linear combinations of Φ(x_i):

$w = \sum_{i=1}^{N}(\alpha_i - \alpha_i^*)\Phi(x_i) \quad \text{and} \quad c = \frac{1}{C\cdot v}\sum_{i=1}^{N}(\alpha_i + \alpha_i^*)\Phi(x_i).$

Knowing w and c, we can subsequently determine the bias terms b and d by exploiting the Karush–Kuhn–Tucker (KKT) conditions:

$\alpha_i\big(\langle w \cdot \Phi(x_i)\rangle + b + \langle c \cdot \Phi(x_i)\rangle + d - y_i + \xi_i\big) = 0,$   (19)

$\alpha_i^*\big(-\langle w \cdot \Phi(x_i)\rangle - b + \langle c \cdot \Phi(x_i)\rangle + d + y_i + \xi_i^*\big) = 0,$   (20)

$\Big(\frac{C}{N} - \alpha_i\Big)\xi_i = 0,$   (21)

$\Big(\frac{C}{N} - \alpha_i^*\Big)\xi_i^* = 0.$   (22)

For some α_i, α_j^* ∈ (0, C/N), we have ξ_i = ξ_j^* = 0 and, moreover, the second factors in Eqs. (19) and (20) have to vanish. Hence, b and d can be computed as follows:

$b = \frac{-1}{2}\big(\langle w \cdot \Phi(x_i)\rangle + \langle w \cdot \Phi(x_j)\rangle + \langle c \cdot \Phi(x_i)\rangle - \langle c \cdot \Phi(x_j)\rangle - y_i - y_j\big),$   (23)

$d = \frac{-1}{2}\big(\langle w \cdot \Phi(x_i)\rangle - \langle w \cdot \Phi(x_j)\rangle + \langle c \cdot \Phi(x_i)\rangle + \langle c \cdot \Phi(x_j)\rangle - y_i + y_j\big),$   (24)

for some i, j such that α_i, α_j^* ∈ (0, C/N). Therefore, the estimated regression function f and the parametric-insensitive function g of the resulting par-v-SVR are given as follows:

$f(x) = \sum_{i=1}^{N}(\alpha_i - \alpha_i^*)k(x_i, x) + b,$   (25)

$g(x) = \frac{1}{C\cdot v}\Big(\sum_{i=1}^{N}(\alpha_i + \alpha_i^*)k(x_i, x) + d\Big).$   (26)

The Karush–Kuhn–Tucker conditions provide several useful conclusions. The training points x_i for which α_i^(*) > 0 (with (*) a shorthand implying both the variables with and without asterisks) are termed support vectors (SVs), since only those points determine the final regression result among all training points. Here we have to distinguish between the examples for which 0 < α_i^(*) < C/N and those for which α_i^(*) = C/N. In the first case, it follows from Eq. (21) (or Eq. (22)) that ξ_i^(*) = 0 and, moreover, the second factor in Eq. (19) (or Eq. (20)) has to vanish. In other words, those examples (x_i, y_i) with corresponding α_i^(*) ∈ (0, C/N) lie on the upper or lower boundary of the parametric-insensitive zone f ± g. In the second case, it follows from Eq. (21) (or Eq. (22)) that ξ_i^(*) > 0 is allowed; in other words, only examples (x_i, y_i) with corresponding α_i^(*) = C/N lie outside the parametric-insensitive zone f ± g.

Here, we will use the term errors to refer to training points lying outside the parametric-insensitive zone, and the term fraction of errors or SVs to denote the relative number of errors or SVs (i.e., divided by N).


Now we analyze the theoretical aspects of the new optimization problem given in Eq. (18). The core aspect can be captured in the proposition stated below.

Proposition 1. Suppose par-v-SVR is applied to a data set, and the minimum width of the resulting parametric-insensitive zone is nonzero. Then the following statements hold:
i. v is an upper bound on the fraction of errors.
ii. v is a lower bound on the fraction of SVs.

Proof. Ad (i). The constraints, Eqs. (18c) and (18d), imply that at most a fraction v of all training points can have α_i^(*) = C/N. All training points with ξ_i^(*) > 0 certainly satisfy α_i^(*) = C/N (if not, α_i^(*) could grow further to reduce ξ_i^(*)).
Ad (ii). Suppose the minimum width of the resulting parametric-insensitive zone is nonzero, i.e., g(x_i) > 0 for all i. According to the KKT conditions, we have α_i · α_i^* = 0 for all i; that is, there can never be a pair of dual variables α_i, α_i^* which are simultaneously nonzero, since this would require nonzero slacks in both directions. Since SVs are those training points for which 0 < α_i^(*) ≤ C/N, the constraints of Eqs. (18c) and (18d) imply that at least a fraction v of all training points have α_i^(*) > 0 (using α_i · α_i^* = 0 for all i). □

Hence, the parameter 0 < v ≤ 1 can be used to control the number of support vectors and errors. Moreover, since the constraint in Eq. (18b) implies that Eq. (18c) is equivalent to Σ_i α_i^(*) = Cv/2, it can be concluded that Proposition 1 actually holds for the upper and the lower sides of the insensitive zone separately, with v/2 each.
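To make the dual (18) and Proposition 1 concrete, the following sketch (ours, not the paper's Matlab code) solves Eq. (18) as a generic convex QP with the third-party cvxopt package on assumed noisy sinc data and checks the two bounds empirically; all function names, the random data, and the small diagonal jitter are assumptions made for illustration.

```python
import numpy as np
from cvxopt import matrix, solvers

def rbf_kernel(X, Z, q):
    """k(x, z) = exp(-q * ||x - z||^2), the kernel used in the experiments."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-q * d2)

def par_nu_svr_dual(X, y, C=100.0, v=0.2, q=0.125):
    """Solve the dual (18) over z = [alpha; alpha*] with a generic QP solver."""
    N = len(y)
    K = rbf_kernel(X, X, q)
    M1 = np.hstack([np.eye(N), -np.eye(N)])      # maps z to alpha - alpha*
    M2 = np.hstack([np.eye(N), np.eye(N)])       # maps z to alpha + alpha*
    # Maximizing (18a) equals minimizing 1/2 z'Pz + p'z with:
    P = M1.T @ K @ M1 + (1.0 / (C * v)) * M2.T @ K @ M2
    P += 1e-8 * np.eye(2 * N)                    # small jitter for numerical stability
    p = -(M1.T @ y)
    # Box constraints 0 <= z <= C/N, i.e. (18d):
    G = np.vstack([-np.eye(2 * N), np.eye(2 * N)])
    h = np.hstack([np.zeros(2 * N), np.full(2 * N, C / N)])
    # Equality constraints (18b) and (18c):
    A = np.vstack([np.hstack([np.ones(N), -np.ones(N)]),
                   np.hstack([np.ones(N), np.ones(N)])])
    b = np.array([0.0, C * v])
    solvers.options["show_progress"] = False
    sol = solvers.qp(matrix(P), matrix(p), matrix(G), matrix(h), matrix(A), matrix(b))
    z = np.asarray(sol["x"]).ravel()
    return z[:N], z[N:]                          # alpha, alpha*

# Noisy sinc data in the spirit of Section 5.1.1 (assumed regeneration).
rng = np.random.default_rng(0)
X = rng.uniform(-3 * np.pi, 3 * np.pi, size=(50, 1))
y = np.sinc(X.ravel() / np.pi) + 0.2 * rng.standard_normal(50)

C, v = 100.0, 0.2
alpha, alpha_star = par_nu_svr_dual(X, y, C=C, v=v)
tol = 1e-6
sv = (alpha > tol) | (alpha_star > tol)
at_bound = (alpha > C / len(y) - tol) | (alpha_star > C / len(y) - tol)
print("fraction of SVs      :", sv.mean())        # Proposition 1: at least v
print("fraction at bound C/N:", at_bound.mean())  # Proposition 1: at most v
```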

4. Support vector classification with parametric margin model

In the previous section, we saw that the proposed par-v-SVR differs from v-SVR in that it uses a parametric-insensitive model of arbitrary shape instead of a tube (or slab) shape. In many cases this can be useful, especially when the noise has a heteroscedastic structure. Thus, it is worthwhile to ask whether a similar change could be incorporated into the original v-SVC. In this section, we describe a modification of the v-SVC algorithm, called par-v-SVC, which uses a parametric-margin model of arbitrary shape.

Consider a set of N data vectors {x_i, y_i}, i = 1, ..., N, y_i ∈ {−1, 1}, x_i ∈ R^n, where x_i is the ith data vector that belongs to the binary class y_i. By replacing ρ in the original v-SVC algorithm with a parametric-margin model g(x) = ⟨c · Φ(x)⟩ + d, the hyperplane f(x) = ⟨w · Φ(x)⟩ + b in par-v-SVC separates the data if and only if

$\langle w \cdot \Phi(x_i)\rangle + b \ge \langle c \cdot \Phi(x_i)\rangle + d \ \ \text{for } y_i = +1; \qquad \langle w \cdot \Phi(x_i)\rangle + b \le -\langle c \cdot \Phi(x_i)\rangle - d \ \ \text{for } y_i = -1.$   (27)

Fig. 2. The par-v-SVC algorithm with parametric-margin model.

Fig. 2 depicts this situation graphically. As the primal problem for par-v-SVC, we consider the following constrained optimization problem:

$\min_{w,c,b,d,\xi_i}\quad \frac{1}{2}\|w\|^2 + C\cdot\Big(-v\cdot\big(\tfrac{1}{2}\|c\|^2 + d\big) + \frac{1}{N}\sum_{i=1}^{N}\xi_i\Big)$   (28a)

subject to

$y_i(\langle w \cdot \Phi(x_i)\rangle + b) \ge (\langle c \cdot \Phi(x_i)\rangle + d) - \xi_i,$   (28b)

$d \ge 0, \quad \xi_i \ge 0 \ \text{for } i = 1, \ldots, N,$   (28c)

where {ξ_i}_{i=1}^{N} is a set of slack variables that measure the amount of violation of the constraints (27), and C is a positive constant that sets the cost one is willing to pay for a misclassified data point. To derive the dual problem, we consider the Lagrangian:

$L = \frac{1}{2}\|w\|^2 + C\cdot\Big(-v\cdot\big(\tfrac{1}{2}\|c\|^2 + d\big) + \frac{1}{N}\sum_{i=1}^{N}\xi_i\Big) - \sum_{i=1}^{N}\alpha_i\big[y_i(\langle w \cdot \Phi(x_i)\rangle + b) - (\langle c \cdot \Phi(x_i)\rangle + d) + \xi_i\big] - \sum_{i=1}^{N}\beta_i\xi_i - \gamma d,$   (29)

where α_i, β_i, and γ are the nonnegative Lagrange multipliers. This function has to be minimized with respect to the primal variables w, c, b, d, ξ_i and maximized with respect to the dual variables α_i, β_i, and γ. To eliminate the former, we compute the corresponding partial derivatives and set them to zero, obtaining the following conditions:

$\partial L/\partial w = w - \sum_{i=1}^{N}\alpha_i y_i \Phi(x_i) = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} y_i\alpha_i \Phi(x_i),$   (30)

$\partial L/\partial c = -(C\cdot v)\,c + \sum_{i=1}^{N}\alpha_i \Phi(x_i) = 0 \;\Rightarrow\; c = \frac{1}{C\cdot v}\sum_{i=1}^{N}\alpha_i \Phi(x_i),$   (31)

$\partial L/\partial b = -\sum_{i=1}^{N} y_i\alpha_i = 0 \;\Rightarrow\; \sum_{i=1}^{N} y_i\alpha_i = 0,$   (32)

$\partial L/\partial d = -C\cdot v + \sum_{i=1}^{N}\alpha_i - \gamma = 0 \;\Rightarrow\; \sum_{i=1}^{N}\alpha_i = C\cdot v + \gamma,$   (33)

$\partial L/\partial \xi_i = \frac{C}{N} - \alpha_i - \beta_i = 0 \;\Rightarrow\; \alpha_i = \frac{C}{N} - \beta_i \ \text{and} \ \alpha_i \le \frac{C}{N}.$   (34)

Substituting Eqs. (30)–(34) into (29), using α_i, β_i, γ ≥ 0, and incorporating kernels for dot products yields the following quadratic optimization problem:

$\max_{\alpha_i}\quad -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} y_i y_j \alpha_i\alpha_j k(x_i, x_j) + \frac{1}{2Cv}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j k(x_i, x_j)$   (35a)

subject to

$\sum_{i=1}^{N} y_i\alpha_i = 0,$   (35b)

$\sum_{i=1}^{N}\alpha_i \ge C\cdot v,$   (35c)

$\alpha_i \in \big[0, \tfrac{C}{N}\big].$   (35d)

Solving the above dual quadratic programming problem, we obtain the Lagrange multipliers α_i, which give the weight vectors w and c as linear combinations of Φ(x_i):

$w = \sum_{i=1}^{N} y_i\alpha_i \Phi(x_i) \quad \text{and} \quad c = \frac{1}{C\cdot v}\sum_{i=1}^{N}\alpha_i \Phi(x_i).$

Knowing w and c, we can subsequently determine the bias terms b and d by exploiting the Karush–Kuhn–Tucker (KKT) conditions:

$\alpha_i\big[y_i(\langle w \cdot \Phi(x_i)\rangle + b) - (\langle c \cdot \Phi(x_i)\rangle + d) + \xi_i\big] = 0,$   (36)

$\Big(\frac{C}{N} - \alpha_i\Big)\xi_i = 0,$   (37)

$\gamma d = 0.$   (38)

For some α_i ∈ (0, C/N) with y_i = +1 and α_j ∈ (0, C/N) with y_j = −1, we have ξ_i = ξ_j = 0 and, moreover, the second factor in Eq. (36) must vanish. Hence, b and d can be computed as follows:

$b = \frac{-1}{2}\big(\langle w \cdot \Phi(x_i)\rangle + \langle w \cdot \Phi(x_j)\rangle - \langle c \cdot \Phi(x_i)\rangle + \langle c \cdot \Phi(x_j)\rangle\big),$   (39)

$d = \frac{1}{2}\big(\langle w \cdot \Phi(x_i)\rangle - \langle w \cdot \Phi(x_j)\rangle - \langle c \cdot \Phi(x_i)\rangle - \langle c \cdot \Phi(x_j)\rangle\big),$   (40)

for some i, j such that α_i ∈ (0, C/N), y_i = +1 and α_j ∈ (0, C/N), y_j = −1. Therefore, the decision function f* and the parametric-margin function g of the resulting par-v-SVC can be shown to take the form:

$f^*(x) = \mathrm{sgn}(f(x)), \quad \text{where } f(x) = \sum_{i=1}^{N}\alpha_i y_i k(x, x_i) + b,$   (41)

and

$g(x) = \frac{1}{C\cdot v}\Big(\sum_{i=1}^{N}\alpha_i k(x_i, x) + d\Big).$   (42)

Here, f is used to denote the argument of the sgn in the decision function, Eq. (41); that is, f* = sgn(f).

Comparing the dual problem of the proposed par-v-SVC (given in Eq. (35)) to the dual problem of v-SVC (given in Eq. (7)), there are two differences. First, there is an additional term (1/(2Cv)) Σ_i Σ_j α_i α_j k(x_i, x_j) in the objective function. Second, the original v-SVC eliminated the parameter C by setting C = 1. In v-SVC, due to homogeneity, the solution of the dual problem would be scaled by C; however, it is straightforward to see that the corresponding decision function does not change (Schölkopf et al., 2000). On the other hand, eliminating the parameter C by setting C = 1 might be improper: because the upper bound on α_i is 1/N (see Eq. (7d)), where N is the number of training data, α_i will vanish when the number of training data is huge.

As in the regression case, the parameter v in par-v-SVC has a more natural interpretation. To formulate it, we first define the term margin error. By this, we denote points with ξ_i > 0; that is, they are either errors or lie within the parametric-margin model. Formally, the fraction of margin errors is

$\frac{1}{N}\big|\{x_i \mid y_i f(x_i) < g(x_i)\}\big|.$   (43)

Now, let us analyze the theoretical aspects of the new optimization problem given in Eq. (35). The core aspect can be captured in the proposition stated below.

Proposition 2. Suppose par-v-SVC is applied to a data set, and the minimum width of the resulting parametric margin is nonzero. The following statements hold:
i. v is an upper bound on the fraction of margin errors.
ii. v is a lower bound on the fraction of SVs.

Proof. Ad (i). By the KKT condition, Eq. (38), d > 0 implies γ = 0. Hence, inequality (35c) becomes an equality (cf. Eq. (33)). Thus, at most a fraction v of all training points can have α_i = C/N. All training points with ξ_i > 0 certainly satisfy α_i = C/N (if not, α_i could grow further to reduce ξ_i).
Ad (ii). SVs can contribute at most C/N to the left-hand side of Eq. (35c); hence there must be at least vN of them. □

Hence, the parameter 0 < v ≤ 1 can be used to control the number of support vectors and margin errors. Moreover, since Eq. (35b) means that the sums over the coefficients of the positive and negative SVs are equal, we conclude that Proposition 2 actually holds for both classes separately, with v/2 each.

5. Experiments

In this section, several examples are used to verify the effectiveness of the proposed new parametric support vector algorithms on regression estimation and pattern classification tasks. These simulations were conducted in the Matlab environment. Throughout this experimental part, we used the radial basis function (RBF) kernel k(x, y) = exp(−q‖x − y‖²) for all algorithms. The optimal choice of the parameters C, v, and q was tuned using a grid search mechanism.

5.1. Regression estimation

5.1.1. Toy examples

The first task was to estimate a noisy sinc function, given N examples (x_i, y_i), with x_i drawn uniformly from [−3π, 3π] and y_i = sinc(x_i) + e_i, where

$\mathrm{sinc}(x) = \begin{cases} \sin(x)/x & \text{if } x \ne 0 \\ 1 & \text{if } x = 0, \end{cases}$

and the noise e_i was drawn from a Gaussian distribution with zero mean and variance σ². Unless stated otherwise, we set the sample size N = 50, error constant v = 0.2, regularization parameter C = 100, RBF kernel parameter q = 0.125, and σ = 0.2 on the noisy sinc data. The risk (or test error) of a regression estimate f was computed with respect to the sinc function without noise as

$\sqrt{\frac{1}{N}\sum_{i=1}^{N}\big(f(x_i) - \mathrm{sinc}(x_i)\big)^2}.$
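A small helper (ours, for illustration) matching the risk measure just defined, with sinc implemented exactly as above:

```python
import numpy as np

def sinc(x):
    """sinc(x) = sin(x)/x with sinc(0) = 1, as defined above."""
    return np.sinc(np.asarray(x) / np.pi)

def risk(f_values, x):
    """Root-mean-square deviation of an estimate from the noise-free sinc."""
    return np.sqrt(np.mean((f_values - sinc(x)) ** 2))
```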

Fig. 3 illustrates the results of the ε -SV regression (Vapnik, 1995) algorithm on the noisy sinc data with noise σ = 0 and σ = 1, respectively. In ε -SVR, we set the insensitive tube width ε = 0.2 for both cases. This choice of ε , which has to be specified a priori, is ideal for neither case. In the case of σ = 0, the regression estimate of ε -SVR is biased; while in the case of σ = 1, the tube width ε does not match the external noise. In ε -SVR, the risks (test errors) are 0.1322 and 0.3109 for σ = 0 and σ = 1, respectively. Fig. 4 illustrates the results of the proposed par-v-SV regression algorithm on the noisy sinc data with noise σ = 0 and σ = 1, respectively. In the proposed par-v-SVR, we set the error constant v = 0.2 for both cases. As seen from Fig. 4, the proposed


Fig. 3. ε -SV regression (Vapnik, 1995) on data with noise (a) σ = 0 and (b) σ = 1, respectively. In both cases, we set ε = 0.2. This choice of ε , which has to be specified a priori, is ideal for neither case. The sinc function is depicted by a dotted line; the regression f and the insensitive tube f ± ε are depicted by dashed lines. The support vectors (SVs) found by ε -SVR are marked by extra circles.

Fig. 4. par-v-SV regression on data with noise (a) σ = 0 and (b) σ = 1, respectively. In both cases, we set v = 0.2. The par-v-SVR algorithm automatically adjusts the size of the parametric-insensitive zone to include the data. The sinc function is depicted by a dotted line; the regression f and the parametric-insensitive zone f ± g are depicted by dashed lines. The support vectors (SVs) are marked by extra circles.

par-v-SVR algorithm automatically adjusts the size of the parametric-insensitive zone to include the data. The width of the parametric-insensitive zone increases automatically as σ is increased. In the proposed par-v-SVR, the risks are 0.0045 and 0.1343 for σ = 0 and σ = 1, respectively.

Fig. 5 gives an illustration of the proposed par-v-SVR algorithm on the noisy sinc data with v = 0.2 and v = 0.8, respectively. A larger v allows more points to lie outside the parametric-insensitive zone (cf. Proposition 1). In contrast to the original SV regression algorithm, the proposed par-v-SVR algorithm uses a flexible parametric-insensitive zone rather than a tube shape. The width of the parametric-insensitive zone is characterized by a learnable function g(x), and the algorithm automatically adjusts the parametric-insensitive zone, of arbitrary shape, to include the data.

Figs. 6 through 9 illustrate the results of the proposed par-v-SVR on the noisy sinc data for different settings of the following parameters: v (error constant), C (regularization parameter), σ (variance of the noise e_i), and N (sample size). Fig. 6 illustrates par-v-SVR for different values of the error constant v. The arithmetic means and standard-deviation error bars are estimated over 100 randomly generated datasets. Notice how the width of the parametric-insensitive zone, defined as ZoneWidth := (1/N) Σ_{i=1}^{N} |g(x_i)|, decreases when more errors are allowed (large v). Moreover, over a large range of v, the test error (risk) is insensitive to changes in v. The chosen v controls the upper bound on the fraction of errors and the lower bound on the fraction of support vectors (cf. Proposition 1). Hence, the selection of v is more intuitive.
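A one-line helper (ours) for the zone-width statistic used in Fig. 6:

```python
import numpy as np

def zone_width(g_values):
    """ZoneWidth := (1/N) * sum_i |g(x_i)|, the average half-width of f ± g."""
    return np.mean(np.abs(g_values))
```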

Fig. 7 gives an illustration of par-v-SVR for different values of the regularization parameter C (the horizontal axis represents the logarithm of C to base 2). The width of the parametric-insensitive zone decreases when the regularization is decreased (large C). The upper bound on the fraction of errors and the lower bound on the fraction of support vectors get looser as C increases, which corresponds to a smaller number of examples relative to the regularization parameter C (Schölkopf et al., 2000). Fig. 8 illustrates par-v-SVR for different values of the noise level σ. The width of the parametric-insensitive zone increases linearly with σ, largely due to the fact that both (1/2)‖c‖² + d and the ξ_i^(*) enter the cost function linearly. Because the proposed par-v-SVR automatically adjusts the width of the parametric-insensitive zone, the number of SVs and the number of points outside the parametric-insensitive zone (errors) are largely independent of σ, except for the noise-free case σ = 0. Fig. 9 illustrates par-v-SVR for different values of the sample size N. Notice that the width of the parametric-insensitive zone is largely independent of N. The fraction of support vectors and the fraction of errors approach v = 0.2 from above and below, respectively, as the number of training samples N increases (cf. Proposition 1).

5.1.2. Heteroscedastic examples

Although the previous experiments led to satisfactory results, they did not really address the actual task the new par-v-SVR algorithm was designed for. Therefore, we next focus on data with a heteroscedastic error structure, i.e., where the noise strongly depends on


Fig. 5. The proposed par-v-SV regression with (a) v = 0.2 and (b) v = 0.8, respectively. The proposed par-v-SVR algorithm automatically adjusts the parametric-insensitive zone of an arbitrary shape to include the data. The larger v allows more points to lie outside the parametric-insensitive zone. The sinc function is depicted by a dotted line; the regression f and the parametric-insensitive zone f ± g are depicted by dashed lines. The support vectors (SVs) are marked by extra circles.

Fig. 6. par-v -SVR for different values of the error constant v .

Fig. 7. par-v-SVR for different values of the regularization parameter C .

the input value x. Two heteroscedastic examples are used to verify the effectiveness of the proposed par-v-SVR algorithm. For the first example, the training data set is generated by

$y_k = 0.2\sin(2\pi x_k) + 0.2 x_k^2 + 0.3 + (0.1 x_k^2 + 0.05)\,e_k, \qquad x_k = 0.02(k-1), \quad k = 1, 2, \ldots, 51,$   (44)

where the noise e_k was drawn from a uniform distribution on the interval [−1, 1]. This example was also used in previous studies (Hwang, Hong, & Seok, 2006; Jeng, Chuang, & Su, 2003).
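A short NumPy sketch (ours) that regenerates a data set according to Eq. (44); the random seed is an arbitrary assumption.

```python
import numpy as np

# Heteroscedastic training set of Eq. (44); e_k ~ Uniform[-1, 1].
rng = np.random.default_rng(0)
x = 0.02 * np.arange(51)          # x_k = 0.02 * (k - 1), k = 1, ..., 51
e = rng.uniform(-1.0, 1.0, size=51)
y = 0.2 * np.sin(2 * np.pi * x) + 0.2 * x**2 + 0.3 + (0.1 * x**2 + 0.05) * e
```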

Fig. 10(a) shows the true mean function (dotted), the estimated mean function f (dashed), and the parametric-insensitive zone f ± g (dashed) obtained by the proposed par-v-SVR where the parameters (C , v, q) were chosen as (100, 0.1, 15.5). The support vectors (SVs) found by the par-v-SVR algorithm are marked by extra circles. Fig. 10(b) shows the true variance function (dotted) and the estimated variance function g (dashed). Without requiring prior knowledge about the heteroscedastic structure of noise, the proposed par-v-SVR algorithm automatically adjusts the


Fig. 8. par-v-SVR for different values of the noise σ .

Fig. 9. par-v-SVR for different values of the data size N.

Fig. 10. (a) The true mean function (dotted), the estimated mean function f (dashed), and the parametric-insensitive zone f ± g (dashed) obtained by the proposed par-v-SVR. (b) The true variance function (dotted) and the estimated variance function g (dashed).

parametric-insensitive zone f ± g of arbitrary shape and minimal size to include the data. The use of the parametric-insensitive loss function penalizes errors more harshly in low noise regions of the data, leading to qualitatively improved estimates of the conditional mean. In par-v-SVR, the risk (test error) is 0.0088. As seen from Fig. 10, the estimates of the upper and lower bounds of the parametric-insensitive zone effectively capture the characteristics of the heteroscedastic error structure, and the proposed par-v-SVR performs fairly well. Fig. 11(a) and (b) give an illustration of the v -SV regression (Schölkopf et al., 2000) on this data with v = 0.1 and v = 0.8,

respectively. In both cases, we set C = 100 and q = 15.5. The v -SVR algorithm automatically adjusts the insensitive-tube width ε to 0.104 and 0.0135 for v = 0.1 and v = 0.8, respectively. Due to the assumption that the ε -insensitive zone has a tube (or slab) shape, the test error (risk) in v -SVR is sensitive toward the changes in v on this heteroscedastic data. In the case of v = 0.1, the regression estimate is biased. In the case of v = 0.8, the resulting tube width ε does not match the external noise. Almost all training points become the support vectors to derive the satisfying regression estimation. This will increase the testing time on regression, and increase the storage requirement for saving


Fig. 11. v -SV regression (Schölkopf et al., 2000) on the heteroscedastic data with (a) v = 0.1 and (b) v = 0.8, respectively. The true function is depicted by dotted line; the regression f and the insensitive tube f ± ε are depicted by dashed lines.

Fig. 12. (a) The data set with heteroscedastic error structure. (b) The estimation regression model of the proposed par-v-SVR. The regression estimation function f and the parametric-insensitive zone f ± g are depicted by dashed lines. The support vectors (SVs) found by par-v -SVR are marked by extra circles.

the support vectors. In v-SVR, the risks are 0.0386 and 0.0301 for v = 0.1 and v = 0.8, respectively.

For the second example, the training data set is shown in Fig. 12(a). This example was also used in previous studies (Hwang et al., 2006; Jeng et al., 2003). As seen from Fig. 12(a), the data have a heteroscedastic error structure, i.e., the noise strongly depends on the input value x. Fig. 12(b) shows the regression estimation function f (dashed) and the parametric-insensitive zone f ± g (dashed) obtained by the proposed par-v-SVR, where the parameters (C, v, q) were chosen as (500, 0.1, 50). In par-v-SVR, the root mean square error (RMSE) is 0.0088. The experimental results show that par-v-SVR provides a satisfactory estimate of the insensitive bounds and captures the characteristics of the data set well. Fig. 13(a) and (b) give an illustration of v-SV regression on this data with v = 0.2 and v = 0.6, respectively. In both cases, we set C = 100 and q = 15.5. The v-SVR algorithm automatically adjusts the insensitive-tube width ε to 0.092698 and 0.007016 for v = 0.2 and v = 0.6, respectively. Due to the assumption that the ε-insensitive zone has a tube (or slab) shape, v-SVR does not provide an indication of the spread of the target distribution (i.e., the predictive variance). In v-SVR, the RMSEs are 0.0891 and 0.0774 for v = 0.2 and v = 0.6, respectively. As a whole, the proposed par-v-SVR works better on the heteroscedastic data than the original SV regression algorithms with a tube-shaped insensitive zone.

5.1.3. The benchmark datasets

For the first example, we use the Motorcycle benchmark dataset (Silverman, 1985), which is a real-life example of input-dependent

noise (Cawley et al., 2004; Kersting et al., 2007). Silverman's motorcycle dataset consists of a sequence of accelerometer readings through time following a simulated motorcycle crash, performed during experiments to determine the efficacy of crash helmets (Silverman, 1985). Fig. 14(a) and (b) show the estimated interval models constructed by v-SVR (Schölkopf et al., 2000) and the proposed par-v-SVR, respectively. Note that the regression performance depends on the values of the model parameters, so it would be unfair to use only one model-parameter set for comparing these approaches. The optimal model parameters for each algorithm were therefore tuned using the tenfold cross-validation method and a grid search mechanism, and the best model-parameter set was then used for constructing the interval model. Note that the width of the interval model for the proposed par-v-SVR is appropriately small where the variance of the dataset is least. The proposed par-v-SVR provides a quantitatively better description of the data set than the conventional v-SVR.

Then, the Boston Housing, Auto-Price, and Machine-CPU datasets from the UCI repository (Blake & Merz, 1998) were used for evaluation and comparison of the proposed par-v-SVR approach. In this experiment, both the target output and the numerical attributes were scaled to be in [−1, 1]. We compare the proposed par-v-SVR approach with the original v-SVR (Schölkopf et al., 2000). The optimal model parameters for each algorithm were selected to minimize a ten-fold cross-validation estimate of the root-mean-squared error by a grid search mechanism. Table 1 presents the result of comparing these methods; we report the optimal model parameters, the corresponding root-mean-squared error (RMSE), and the fraction of support vectors.

Fig. 13. v-SV regression (Schölkopf et al., 2000) on the heteroscedastic data with (a) v = 0.2 and (b) v = 0.6, respectively. The regression f and the insensitive tube f ± ε are depicted by dashed lines. The support vectors (SVs) found by v-SVR are marked by extra circles.

Fig. 14. The regression model obtained by (a) v-SVR and (b) the proposed par-v-SVR for the Motorcycle benchmark dataset.

Table 1. Comparison of regression performance.

Dataset         | v-SVR: (C, v, q)     RMSE    SVs (%) | par-v-SVR: (C, v, q)   RMSE    SVs (%)
Motorcycle      | (10^3, 0.2, 2^-5)    0.232   17.29   | (10^3, 0.2, 2^-5)      0.221   26.32
Boston Housing  | (10^5, 0.8, 2^-3)    0.172   85.37   | (10^4, 0.8, 2^-6)      0.158   79.24
Auto-Price      | (10^5, 0.7, 2^-7)    0.160   75.47   | (10^3, 0.7, 2^-7)      0.153   76.72
Machine-CPU     | (10^5, 0.1, 2^-4)    0.074   9.29    | (10^4, 0.1, 2^-6)      0.070   13.39

As seen from Table 1, the proposed par-v-SVR yields results comparable to the original v-SVR approach, demonstrating that it is suitable for real-world regression problems.

5.2. Pattern classification

We present two types of experiments to demonstrate the performance of the proposed par-v-SV classification algorithm. Fig. 15 illustrates the results of the par-v-SVC algorithm on a 2D artificial dataset. The middle line is the decision boundary; the outer lines precisely meet the constraint (27). Note that the support vectors (SVs) found by this algorithm (marked by extra circles) are examples which are critical for the given classification task. With larger values of v, more points are allowed to lie inside the margin (depicted by dotted lines). Fig. 16 illustrates the results of the par-v-SVC algorithm on a 2D heteroscedastic artificial dataset. This dataset has a heteroscedastic error structure, i.e., the noise along the x-axis strongly depends on the y-coordinate. As shown in Fig. 16,

the proposed par-v-SVC classified these two classes accurately and captured the distribution characteristics of this dataset well.

We then apply the proposed par-v-SVC algorithm to a set of benchmark problems. From PROBEN 1 (Prechelt, 1994) we choose the following datasets: cancer, card, diabetes, heart, and heartc. From the UCI Repository (Blake & Merz, 1998) we choose the following datasets: wdbc, ionosphere, sonar, iris, wine, new-thyroid, glass, and vowel. Furthermore, we choose german from the Statlog Collection (Michie, Spiegelhalter, & Taylor, 1994). The information on the collected datasets is given in Table 2. We scale all data to be in [−1, 1]. Some of these problems have more than two classes; for simplicity, we treat all data not in the first class as belonging to the second class. We compare the proposed par-v-SVC approach with the original C-SVC (Vapnik, 1995) and v-SVC (Schölkopf et al., 2000). The most important criterion for evaluating the performance of these methods is their accuracy rate. However, it is unfair to use only one parameter set for comparing these methods.

Fig. 15. Toy example solved by par-v-SVC, using parameter v = (a) 0.2, (b) 0.4, (c) 0.6, (d) 0.8, respectively. With larger values of v, more points are allowed to lie inside the margin (depicted by dotted lines).

Fig. 16. 2D heteroscedastic artificial example solved by par-v-SVC.

Table 2. Quick overview of the classification datasets.

Dataset      #examples  #attributes  #classes  Source
Cancer       699        9            2         PROBEN 1
Card         690        51           2         PROBEN 1
Diabetes     768        8            2         PROBEN 1
Heart        920        35           2         PROBEN 1
Heartc       303        35           2         PROBEN 1
wdbc         569        30           2         UCI Repository
Ionosphere   351        34           2         UCI Repository
Sonar        208        60           2         UCI Repository
Iris         150        4            3         UCI Repository
Wine         178        13           3         UCI Repository
New-thyroid  215        5            3         UCI Repository
Glass        214        10           7         UCI Repository
Vowel        990        10           11        UCI Repository
German       1000       24           2         Statlog Collection

Sources: PROBEN 1 (Prechelt, 1994), anonymous FTP on ftp.ira.uka.de; UCI Repository (Blake & Merz, 1998), http://kdd.ics.uci.edu/; Statlog Collection (Michie et al., 1994), http://www.liacc.up.pt/ML/old/statlog/datasets.html.

Note that the generalization errors of the classifiers depend on the values of the kernel parameter q, the regularization parameter C, and the parameter v. For any method, the best parameters are obtained by performing model-parameter selection. However, there is no explicit way to solve the problem of choosing multiple parameters for SVMs (Hsu & Lin, 2002). The use of a gradient descent algorithm over the set of model parameters, minimizing some estimate of the generalization error of the SVM, has been discussed in Chapelle, Vapnik, and Bousquet (2002). On the other hand, an exhaustive search is the most popular method for choosing the model parameters (Hsu & Lin, 2002). In the experiments, we estimate the generalized accuracy using kernel parameters q ∈ {2^4, 2^3, ..., 2^-10}, regularization parameters C ∈ {10^-1, 10^0, 10^1, ..., 10^6}, and v ∈ {0.1, 0.2, 0.3, ..., 0.8} for each problem. We apply the ten-fold cross-validation method on the whole training data to select the model parameters and estimate the generalized accuracy. Namely, for each problem we partition the available examples into ten disjoint subsets ("folds") of approximately equal size. The classifier is trained on all the subsets except for one, and the validation error is measured by testing it on the subset left out. This procedure is repeated for a total of ten trials, each time using a different subset for validation, and the performance of the model is assessed by averaging the validation error over all ten trials.
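As an illustration of this kind of grid search with ten-fold cross-validation (our sketch, not the paper's code), the following uses scikit-learn's NuSVC on the WDBC data. Note that NuSVC has no separate C parameter, so only nu and the RBF width are searched here, and the assumption that load_breast_cancer corresponds to the wdbc benchmark is ours.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import NuSVC

# WDBC-like data; attributes scaled to [-1, 1] as described in the text.
X, y = load_breast_cancer(return_X_y=True)
X = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

param_grid = {
    "gamma": [2.0**k for k in range(4, -11, -1)],  # q = 2^4, 2^3, ..., 2^-10
    "nu": [0.1 * k for k in range(1, 9)],          # v = 0.1, 0.2, ..., 0.8
}
# Ten-fold cross-validation over the grid; infeasible nu values simply score 0.
search = GridSearchCV(NuSVC(kernel="rbf"), param_grid, cv=10, error_score=0.0)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```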


According to the cross-validation rate, we try to infer the proper values of the model parameters and report the best cross-validation rate. Table 3 presents the result of comparing these methods; we report the optimal model parameters, the corresponding cross-validation accuracy rates, and the numbers of support vectors.

Table 3. Comparison of classification performance (the best rates were bold-faced in the original).

Dataset      | C-SVC (C, q)     Acc. rate  #SVs   | v-SVC (v, q)   Acc. rate  #SVs   | par-v-SVC (C, v, q)   Acc. rate  #SVs
Cancer       | (10, 2^-3)       96.828     70.5   | (0.1, 2^-3)    96.828     73.2   | (10^5, 0.1, 2^-4)     96.973     69.2
Card         | (10, 2^-5)       86.934     244.6  | (0.8, 2^-5)    87.394     506.9  | (10^3, 0.8, 2^-5)     87.983     506.9
Diabetes     | (10, 2^-3)       77.607     360.8  | (0.5, 2^-4)    77.619     360.6  | (10^4, 0.2, 2^-3)     77.619     363.9
Heart        | (10^2, 2^-8)     83.239     334.0  | (0.5, 2^-4)    82.426     446.3  | (10^6, 0.1, 2^-10)    83.239     337.6
Heartc       | (10^2, 2^-7)     85.230     107.3  | (0.3, 2^-9)    85.468     104.0  | (10^5, 0.1, 2^-8)     86.289     105.7
wdbc         | (10, 2^-2)       98.418     67.4   | (0.1, 2^-4)    98.264     62.6   | (10^6, 0.1, 2^-5)     98.418     61.0
Ionosphere   | (10, 2^-5)       95.510     79.4   | (0.1, 2^-5)    94.922     68.0   | (10^6, 0.2, 2^-3)     95.216     109.4
Sonar        | (10^4, 2^-8)     72.000     71.5   | (0.1, 2^-8)    72.000     71.6   | (10^6, 0.1, 2^-8)     72.000     71.6
Iris         | (10^3, 2^-7)     97.333     47.6   | (0.3, 2^-5)    97.333     44.3   | (10^2, 0.1, 2)        97.333     30.9
Wine         | (10^3, 2^-1)     98.235     46.9   | (0.2, 2^-2)    98.824     43.7   | (10^3, 0.2, 2^-2)     98.824     43.6
New-thyroid  | (10^0, 2^4)      96.465     88.8   | (0.1, 2^-1)    96.941     25.2   | (10^1, 0.5, 2^4)      97.326     107.2
Glass        | (10, 2)          99.524     59.7   | (0.1, 2)       99.524     63.6   | (10^3, 0.1, 2)        99.524     63.3
Vowel        | (10^2, 2^-6)     98.485     52.1   | (0.1, 2^-2)    98.686     106.1  | (10^4, 0.5, 2^-3)     98.787     124.1
German       | (10, 2^-6)       77.000     501.7  | (0.5, 2^-6)    76.800     500.7  | (10^4, 0.2, 2^-6)     77.000     504.3

It can be seen that the optimal model parameters C, v, q lie in various ranges for different problems, so it is essential to test this many parameter sets. As seen from Table 3, the proposed par-v-SVC yields results comparable to the original SVM classifiers, demonstrating that it is suitable for real-world classification problems. Note that the numbers of support vectors are not integers because they are averages over the ten folds of the cross-validation. The number of support vectors is the main factor that affects the testing time. We also observe that the numbers of support vectors obtained by the three SV algorithms are very similar; that is, none is statistically better than the others. The proposed par-v-SVC algorithm achieves a better (or the same) accuracy rate with fewer support vectors on cancer, wdbc, iris, and wine. The experimental results show that the proposed method performs fairly well on the benchmark datasets.

6. Conclusion

In this paper, an extension of the v-support vector algorithms is presented. We first described par-v-SVR, a new regression algorithm that uses a parametric-insensitive model. By devising a new parametric-insensitive loss function, the proposed par-v-SVR automatically adjusts a flexible parametric-insensitive zone of arbitrary shape and minimal radius to include the given data, where the parametric-insensitive zone f ± g is estimated by a new constrained optimization problem. The par-v-SVR has been shown to be useful in practice, especially when the noise is heteroscedastic, that is, when the noise strongly depends on the input value x. In addition,

we have applied the idea underlying par-v-SVR to develop a par-v-SVC algorithm, which is a new classification algorithm that uses a parametric-margin model of arbitrary shape. As in v-SVM, the proposed par-v-SVMs are parameterized by a quantity v that allows control of the number of errors and support vectors, and the latter can be used to give a leave-one-out generalization bound (Vapnik, 1995). Experimental results have demonstrated the simplicity and effectiveness of the proposed method.

References

Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science. [Online]. Available: http://kdd.ics.uci.edu/.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th annual ACM workshop on computational learning theory (pp. 144–152). Pittsburgh, PA: ACM Press.
Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 955–974.
Cawley, G. C., Janacek, G. J., Haylock, M. R., & Dorling, S. R. (2007). Predictive uncertainty in environmental modeling. Neural Networks, 20, 537–549.
Cawley, G. C., Talbot, N. L. C., Foxall, R. J., Dorling, S. R., & Mandic, D. P. (2004). Heteroscedastic kernel ridge regression. Neurocomputing, 57, 105–124.
Chapelle, O., Vapnik, V. N., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46, 131–159.
Cortes, C., & Vapnik, V. N. (1995). Support vector network. Machine Learning, 20, 1–25.
Drucker, H., Burges, C. J. C., Kaufman, L., Smola, A. J., & Vapnik, V. N. (1996). Support vector regression machines. In Advances in neural information processing systems: Vol. 9 (pp. 155–161).
Hsu, C. W., & Lin, C. J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13, 415–425.
Hwang, C., Hong, D. H., & Seok, K. H. (2006). Support vector interval regression machine for crisp input and output data. Fuzzy Sets and Systems, 157, 1114–1125.
Jeng, J.-T., Chuang, C.-C., & Su, S.-F. (2003). Support vector interval regression networks for interval regression analysis. Fuzzy Sets and Systems, 138, 283–300.
Kersting, K., Plagemann, C., Pfaff, P., & Burgard, W. (2007). Most likely heteroscedastic Gaussian process regression. In Proceedings of the 24th international conference on machine learning (pp. 393–400).
Le, Q. V., Smola, A. J., & Canu, S. (2005). Heteroscedastic Gaussian process regression. In Proceedings of the 22nd international conference on machine learning.
Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine learning, neural and statistical classification. Ellis Horwood. [Online]. Available: http://www.maths.leeds.ac.uk/~charles/statlog/.
Müller, K. R., Mika, S., Ratsch, G., Tsuda, K., & Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2), 181–201.
Prechelt, L. (1994). PROBEN 1 — A set of neural network benchmark problems and benchmarking rules. Technical report 21/94. D-76128 Karlsruhe, Germany: Fakultät für Informatik, Universität Karlsruhe. Anonymous FTP: /pub/papers/techreports/1994/1994-21.ps.Z on ftp.ira.uka.de.
Schölkopf, B., Burges, C. J. C., & Vapnik, V. N. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, first international conference on knowledge discovery and data mining. Menlo Park, CA: AAAI Press.
Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods — Support vector learning. Cambridge, MA: MIT Press.
Schölkopf, B., Mika, S., Burges, C. J. C., Knirsch, P., Müller, K. R., Rätsch, G., et al. (1999). Input space vs. feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5), 1000–1017.
Schölkopf, B., Bartlett, P. L., Smola, A., & Williamson, R. (1999). Shrinking the tube: A new support vector regression algorithm. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems: Vol. 11 (pp. 330–336). Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A. J., Williamson, R., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12(5), 1207–1245.
Silverman, B. W. (1985). Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society, 47, 1–52.
Smola, A. J., Schölkopf, B., & Müller, K. R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.
Vapnik, V. N., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka [in Russian].
Vapnik, V. N. (1982). Estimation of dependencies based on empirical data. New York: Springer-Verlag.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Williams, P. M. (1996). Using neural networks to model conditional multivariate densities. Neural Computation, 8, 843–854.
Yunn, M., & Wahba, D. (2004). Doubly penalized likelihood estimator in heteroscedastic regression. Statistics & Probability Letters, 69, 11–20.