OHC deals with the binary class pattern recognition problem. Let the feature space be the d-dimensional real vector space R^d. There are n training samples x_i ∈ R^d (i = 1, ..., n), whose class labels are described as y_i ∈ {−1, 1}. If y_i = 1 or −1, x_i belongs to class A or class B, respectively. It is assumed that these samples are drawn i.i.d. from an underlying distribution P(x, y). The discriminant function of OHC is linear, which implies that it assumes a linear teacher model as shown in (1). Here, we consider the noiseless case, where the training samples are linearly separable. The discriminant function is described as

f(x) = \sum_{i=1}^n w_i x^T x_i.   (3)

Here, a weight vector expanded by the training samples is used to enable the nonlinear extension described in the next section. There is another formulation that uses a bias term, but it is omitted here because the bias term does not deeply affect classification performance in a high dimensional space. To normalize the magnitude of w, we assume that

\min_i |f(x_i)| = 1.   (4)

The classification performance of f is measured by the 0-1 loss

l(f(x); y) = \begin{cases} 1 & (yf(x) \le 0) \\ 0 & (yf(x) > 0) \end{cases}

The loss function becomes 1 when f makes a misclassification on x. So, R(f) is the expected number of misclassifications over the underlying distribution P(x, y). However, since we have only finite training samples, we can only evaluate the average over the training samples,

R_{\rm emp}(f) = \frac{1}{n} \sum_{i=1}^n l(f(x_i); y_i).

The basic strategy of SVM is to find f that minimizes |R(f) − R_emp(f)| subject to the constraint that R_emp(f) = 0. Although we cannot evaluate the risk deviation |R(f) − R_emp(f)| directly, we can control it indirectly through a capacity measure of f. We do not go into the theoretical details of the relationship between the risk deviation and the capacity; those who are interested in the theory should consult [14]. Although there are many kinds of capacity measures, the one used in standard OHC is the VC bound

h(f) = w^T w.
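As a concrete illustration of these quantities, the following numpy sketch evaluates R_emp for a discriminant of the form (3) on placeholder data; the samples and weights are arbitrary, not learned values from the paper:

```python
import numpy as np

def discriminant(x, X_train, w):
    """f(x) = sum_i w_i * x^T x_i, the sample-expanded linear discriminant (3)."""
    return np.dot(X_train @ x, w)

def empirical_risk(X, y, X_train, w):
    """R_emp(f): fraction of samples with y * f(x) <= 0 (0-1 loss)."""
    f_vals = np.array([discriminant(x, X_train, w) for x in X])
    return np.mean(y * f_vals <= 0)

# Toy example (placeholder data, not from the paper)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(10, 2))
y_train = np.sign(X_train[:, 0])      # arbitrary labels in {-1, +1}
w = rng.normal(size=10)               # arbitrary expansion coefficients
print(empirical_risk(X_train, y_train, X_train, w))
```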
Then, the learning problem of OHC is formulated as follows: Find w that minimizes w^T w subject to the constraint that

y_i \sum_{j=1}^n w_j x_i^T x_j \ge 1.   (5)
Note that the constraint is derived from R_emp(f) = 0 and (4). The solution of the learning problem is given by the saddle point of the Lagrangian

L(w, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^n \alpha_i \left\{ y_i \sum_{j=1}^n w_j x_i^T x_j - 1 \right\},   (6)

where the α_i (≥ 0) are Lagrange multipliers. The Lagrangian has to be minimized with respect to w and maximized with respect to α. At the saddle point, we have

\frac{\partial L(w, \alpha)}{\partial w_i} = 0.   (7)
By substituting (7) into (6), the optimization problem with respect to the Lagrangian is reduced as follows: Find α that maximizes

\sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j   (8)

subject to the constraints α_i ≥ 0.
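As a concrete illustration, the dual (8) can be handled by a simple projected gradient ascent on α; the step size, iteration count, and the toy data below are illustrative choices, not values from the paper:

```python
import numpy as np

def hard_margin_dual(X, y, lr=0.01, n_iter=2000):
    """Maximize sum(a) - 0.5 * a^T Q a with Q_ij = y_i y_j x_i^T x_j, subject to a >= 0."""
    Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j x_i^T x_j
    a = np.zeros(len(y))
    for _ in range(n_iter):
        grad = 1.0 - Q @ a                       # gradient of the dual objective
        a = np.maximum(0.0, a + lr * grad)       # ascent step + projection onto a >= 0
    return a

# Toy usage on linearly separable points (placeholder data)
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -2.0], [-2.5, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = hard_margin_dual(X, y)
```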
This problem is a convex quadratic programming problem and the global optimal solution can be obtained by various iterative methods.

When we consider the extension of OHC to noisy cases, the problem is that the training samples may not be linearly separable. In nonseparable cases, the set of weight vectors which satisfy the constraint (5) is empty. To make the problem feasible, it is necessary to expand the weight vector set by allowing slight violations of the constraint. In the soft margin approach [14], the optimization problem for learning is modified as follows: Find w that minimizes

\frac{1}{2} w^T w + C \sum_{i=1}^n \xi_i

subject to the constraint that

y_i \sum_{j=1}^n w_j x_i^T x_j \ge 1 - \xi_i, \quad \xi_i \ge 0.

Here, C is a user-defined constant that controls the violation of the constraints. When C is large, violation becomes difficult and the weight vector set does not expand much. On the other hand, when C is small, violation becomes easy and the weight vector set expands largely. In the Lagrangian representation, the optimization problem is written as follows: Find α that maximizes

\sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j   (9)

subject to the constraints 0 ≤ α_i ≤ C. This leads to the following equation:

w_i = \alpha_i y_i.
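In the projected-gradient sketch above, the only change required for the soft margin dual (9) is that α is clipped to the box [0, C] rather than only at zero; C below is an arbitrary illustrative value:

```python
import numpy as np

def soft_margin_dual(X, y, C=1.0, lr=0.01, n_iter=2000):
    """Maximize the dual (9) subject to the box constraints 0 <= a_i <= C."""
    Q = (y[:, None] * y[None, :]) * (X @ X.T)
    a = np.zeros(len(y))
    for _ in range(n_iter):
        a = np.clip(a + lr * (1.0 - Q @ a), 0.0, C)   # ascent step + projection onto [0, C]
    return a
```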
The soft margin approach is an ad-hoc modification and has scarce theoretical justification. But it is a practical choice, because the optimal solution can surely be obtained by quadratic programming, without the problem of local minima.

2.2 Feature extraction with Kernel Function
OHC is usually used with feature extraction by a kernel function K(x, x'). When the kernel function is positive definite, the value of the kernel function can be considered as the similarity between samples. OHC is a classifier which adopts the inner product as a similarity measure, because all calculations are based on inner products. So, when the input space is mapped by φ so that the kernel function is embedded as an inner product,

K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j),

we can construct an OHC that uses the kernel function as a similarity measure. Here, the embedding φ is written as

\varphi(x) = K^{-1/2} k(x),   (10)

where K is the matrix whose (i, j)-element is K(x_i, x_j) and k(x) = (K(x, x_1), ..., K(x, x_n))^T.
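A minimal numpy sketch of this embedding, using the Gaussian kernel that appears later in the experiments; the bandwidth c and the small jitter added before taking the inverse matrix square root are illustrative choices:

```python
import numpy as np

def gaussian_kernel(a, b, c=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 c^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * c ** 2))

def gram_matrix(X, c=1.0):
    """K with (i, j)-element K(x_i, x_j)."""
    return np.array([[gaussian_kernel(xi, xj, c) for xj in X] for xi in X])

def embed(x, X_train, K, c=1.0, jitter=1e-10):
    """Embedding (10): phi(x) = K^{-1/2} k(x), computed via an eigendecomposition of K."""
    k_x = np.array([gaussian_kernel(x, xi, c) for xi in X_train])
    vals, vecs = np.linalg.eigh(K + jitter * np.eye(len(K)))
    K_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return K_inv_sqrt @ k_x

# With this embedding, phi(x_i)^T phi(x_j) reproduces K(x_i, x_j) on the training set.
```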
By the linear separation of the embedded samples, we can perform nonlinear separation in the input space. This nonlinear version of OHC is called SVM, and it can be trained in the same way as OHC except that the inner product is replaced by the kernel function. So, the discriminant function of SVM is described as

f(x) = \sum_{i=1}^n w_i K(x, x_i),
and the learning problem is described as follows: Find α that maximizes

\sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j K(x_i, x_j)   (11)

subject to the constraints 0 ≤ α_i ≤ C.
3 Learning by MAP estimate
In this section, we present the learning algorithm of SVM by MAP estimation. Here, the prior distributions of the weight vector and the random noise are assumed to be Gaussian. The learning problem is designed so that its MAP solution corresponds to hard-margin SVM in the noiseless case.

Let us assume that the training samples are embedded as in (10) using a positive definite kernel function K(x, x'). Let X denote the matrix whose columns are the embedded samples φ(x_i). Then, K = X^T X. The teacher model is defined as follows:

y_i = \theta(w^T \varphi(x_i) + n_i).

For consistency with SVM, we prepare three labels, y_i ∈ {−1, 1, ∗}. The uncertainty label ∗ means that the teacher cannot judge the class label. The thresholding function θ is defined as follows:

\theta(x) = \begin{cases} -1 & (x \le -1) \\ \ast & (-1 < x < 1) \\ 1 & (x \ge 1) \end{cases}

The teacher model is rewritten in a vector form as

y = \Theta(X^T w + n),

where y is the vector of the y_i's, Θ(x) is the vector-valued function whose i-th element is θ(x_i), and n is the vector of random noise. The noise is assumed to be Gaussian with zero mean and a diagonal covariance matrix Σ:

E(n) = 0,   (12)
E(nn^T) = \Sigma = {\rm diag}(\sigma_1^2, ..., \sigma_n^2).   (13)
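A small numpy sketch of this teacher model; the uncertainty label is encoded here as 0 purely for convenience, and the data, weights, and noise levels are placeholders:

```python
import numpy as np

def theta(x):
    """Three-valued threshold: -1 for x <= -1, +1 for x >= 1, uncertainty (encoded as 0) in between."""
    return np.where(x <= -1.0, -1.0, np.where(x >= 1.0, 1.0, 0.0))

def teacher(X_emb, w, sigma, rng):
    """y = Theta(X^T w + n) with n ~ N(0, diag(sigma^2)); X_emb holds embedded samples as columns."""
    n = rng.normal(scale=sigma, size=X_emb.shape[1])
    return theta(X_emb.T @ w + n)

# Toy usage (placeholder embedded samples and weights)
rng = np.random.default_rng(0)
X_emb = rng.normal(size=(3, 5))   # 3-dimensional feature space, 5 samples as columns
w = rng.normal(size=3)
y = teacher(X_emb, w, sigma=np.full(5, 0.1), rng=rng)
```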
The aim of learning is to obtain the MAP estimate of w based on the prior distribution p(w). Due to the random noise, the vector g := X^T w + n becomes a random vector which has a Gaussian distribution centered on θ = X^T w:

p(g|\theta) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(g - \theta)^T \Sigma^{-1} (g - \theta)\right).

Since it is straightforward to obtain w from θ (i.e. w = (X^T)^{-1} θ), it is sufficient to obtain the MAP estimate of θ. When we assume the prior distribution p(w) ∼ N(0, I), the prior distribution of θ is described as p(θ) ∼ N(0, K). On the other hand, the distribution of y given θ is described as

p(y|\theta) = \int p(g|\theta) V_y(g) dg,   (14)

where V_y describes the set of g's which is mapped to y via Θ:

V_y(g) = \prod_{i=1}^n v_i(g_i),

where

v_i(g_i) = \begin{cases} 1 & (\theta(g_i) = y_i) \\ 0 & ({\rm otherwise}) \end{cases}

When y contains no uncertainty labels (y ∈ {−1, 1}^n), p(y|θ) can be factorized as

p(y|\theta) = \prod_{i=1}^n q(\theta_i, y_i),   (15)

where the individual component q(θ_i, y_i) is described as follows:

q(\theta_i, y_i) = \frac{1}{(2\pi\sigma_i^2)^{1/2}} \int v_i(g) \exp\left(-\frac{(g - \theta_i)^2}{2\sigma_i^2}\right) dg
               = \frac{1}{2} - \frac{y_i}{2} \,{\rm erf}\!\left(\frac{y_i - \theta_i}{\sqrt{2}\,\sigma_i}\right),

where erf is the error function defined as

{\rm erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x \exp(-t^2) dt.   (16)
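A direct transcription of the closed form of q(θ_i, y_i) above, using the error function from the Python standard library:

```python
import math

def q(theta_i, y_i, sigma_i):
    """q(theta_i, y_i) = 1/2 - (y_i / 2) * erf((y_i - theta_i) / (sqrt(2) * sigma_i)).

    For y_i = +1 this is the Gaussian probability mass of {g >= 1},
    for y_i = -1 that of {g <= -1}, with the Gaussian centered at theta_i.
    """
    return 0.5 - 0.5 * y_i * math.erf((y_i - theta_i) / (math.sqrt(2.0) * sigma_i))

# Example: a sample whose mean theta_i already lies well inside its label's region
print(q(theta_i=2.0, y_i=1.0, sigma_i=0.5))   # close to 1
```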
The MAP solution is obtained by solving the following optimization problem with respect to θ: Find θ that maximizes log p(y|θ) + log p(θ). Since the given class labels do not contain uncertainty labels, this problem can be rewritten based on (15): Find θ that maximizes

z(\theta) = \sum_{i=1}^n \log q(\theta_i, y_i) - \frac{1}{2} \theta^T K^{-1} \theta.

This is a nonlinear optimization problem and can be solved by gradient ascent methods, where the gradient is described as follows:

\frac{\partial z}{\partial \theta_i} = \frac{y_i}{(2\pi\sigma_i^2)^{1/2}} \frac{\exp(-(y_i - \theta_i)^2 / (2\sigma_i^2))}{q(\theta_i, y_i)} - \sum_{j=1}^n K^{-1}_{ij} \theta_j.
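A minimal sketch of this gradient ascent; the step size, iteration count, initialization, and the jitter used when inverting K are illustrative choices, not values from the paper:

```python
import numpy as np
from math import erf, sqrt, pi

def q(theta_i, y_i, sigma_i):
    return 0.5 - 0.5 * y_i * erf((y_i - theta_i) / (sqrt(2.0) * sigma_i))

def map_theta(K, y, sigma, lr=0.01, n_iter=500, jitter=1e-8):
    """Gradient ascent on z(theta) = sum_i log q(theta_i, y_i) - 0.5 * theta^T K^{-1} theta."""
    K_inv = np.linalg.inv(K + jitter * np.eye(len(K)))
    theta = y.astype(float).copy()               # start from the labels themselves
    for _ in range(n_iter):
        qs = np.array([q(t, yi, s) for t, yi, s in zip(theta, y, sigma)])
        likelihood_grad = (y / np.sqrt(2.0 * pi * sigma ** 2)
                           * np.exp(-(y - theta) ** 2 / (2.0 * sigma ** 2)) / qs)
        theta = theta + lr * (likelihood_grad - K_inv @ theta)
    return theta
```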
But, for large scale problems, it is very difficult to give a reasonable solution in a limited time. When the noise magnitude converges to zero (i.e. σ_i → 0), this learning problem can be rewritten in terms of w as follows: Find w that minimizes w^T w subject to the constraint that V_y(X^T w) = 1. This is equivalent to the learning problem of hard margin SVM.
4 Approximated Learning
In this section, we consider an approximated approach to make the learning problem practically solvable. Here, the estimation of w from y is divided into two stages. First, the MAP estimate of g is obtained from y. Originally, g is a random vector, but it is approximated by the vector ĝ with maximum a posteriori probability. Then, the estimation of w reduces to linear regression. In the regression, the estimated vector ĝ is mapped back to w by a linear restoration filter.

4.1 Classification to Regression
The marginal distribution of g is derived as

p(g) = \int p(g|\theta) p(\theta) d\theta,

where p(g|θ) and p(θ) are assumed as in the previous section. Then, we have

p(g) \propto \exp\left(-\frac{1}{2} g^T M g\right),

where

M = (\Sigma^{-1})^T \left(\Sigma - (\Sigma^{-1} + K^{-1})^{-1}\right) \Sigma^{-1}.

Since the relation between g and y is deterministic,

p(y|g) = V_y(g).
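Since p(g|θ) = N(θ, Σ) and p(θ) = N(0, K), this M is the Woodbury form of (K + Σ)^{-1}; the sketch below evaluates the expression as written and checks this numerically on placeholder matrices:

```python
import numpy as np

def covariance_inverse(K, Sigma):
    """M = Sigma^{-1} (Sigma - (Sigma^{-1} + K^{-1})^{-1}) Sigma^{-1}, the inverse covariance of g."""
    S_inv = np.linalg.inv(Sigma)
    inner = Sigma - np.linalg.inv(S_inv + np.linalg.inv(K))
    return S_inv @ inner @ S_inv

# Placeholder positive definite K and diagonal Sigma
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
K = A @ A.T + 0.5 * np.eye(4)
Sigma = np.diag(rng.uniform(0.1, 1.0, size=4))

M = covariance_inverse(K, Sigma)
# Sanity check: M coincides with (K + Sigma)^{-1}
assert np.allclose(M, np.linalg.inv(K + Sigma))
```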
The MAP estimate of g is obtained by solving the following problem: Find g that maximizes log p(y|g) + log p(g). This problem can be rewritten as follows: Find g that minimizes g^T M g subject to the constraint that V_y(g) = 1.

We can solve this problem using Lagrange coefficients α_i (≥ 0). The Lagrange functional is written as

L(g, \alpha) = \frac{1}{2} g^T M g - \sum_{i=1}^n \alpha_i (y_i g_i - 1).

From the saddle point equations, the optimization problem is reformulated as follows: Find α that maximizes

\sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j M^{-1}_{ij}

subject to the constraints α_i ≥ 0, where M^{-1}_{ij} denotes the (i, j)-element of M^{-1}.

This problem is a quadratic programming problem and the optimal solution can be obtained by numerical methods [18]. The solution ĝ is obtained from the optimal solution α as follows:

\hat{g}_i = \sum_{j=1}^n M^{-1}_{ij} \alpha_j y_j.   (17)
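Putting this stage together, the sketch below applies the same projected-gradient treatment used for the SVM dual earlier, now with M^{-1} in place of the Gram matrix, and then recovers ĝ via (17); the step size and iteration count are again illustrative:

```python
import numpy as np

def map_estimate_g(M, y, lr=0.01, n_iter=2000):
    """Maximize sum(a) - 0.5 * sum_ij a_i a_j y_i y_j Minv_ij subject to a >= 0, then g_hat = Minv (a * y)."""
    M_inv = np.linalg.inv(M)
    Q = (y[:, None] * y[None, :]) * M_inv
    a = np.zeros(len(y))
    for _ in range(n_iter):
        a = np.maximum(0.0, a + lr * (1.0 - Q @ a))   # projected gradient ascent
    g_hat = M_inv @ (a * y)                           # equation (17)
    return g_hat, a
```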
4.2 Regression by Restoration Filter
After obtaining the MAP estimate ĝ, we have to solve the following regression problem with random noise:

\hat{g} = X^T w + n.

Any regression method can be used, but one effective solution is to assume a restoration filter B,

w = B \hat{g},

and determine B so that it recovers the true vector w before corruption by the noise [16]. For this purpose, the expected loss of B with respect to w and n is described as

R(B) = E_w E_n \| w - B X^T w - B n \|^2.

Since w and n are independent, the expected loss can be written as

R(B) = E_w \| w - B X^T w \|^2 + E_n \| B n \|^2.   (18)

The Wiener filter is defined as the operator B which minimizes the expected loss (18) [2]. When we assume the prior distributions as in the previous section, the optimal solution that minimizes R(B) is obtained as

B = X (K + \Sigma)^{-1}.
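Since w = Bĝ = X(K + Σ)^{-1}ĝ lies in the embedded feature space, the resulting discriminant can be evaluated through kernel values alone: f(x) = w^T φ(x) = k(x)^T (K + Σ)^{-1} ĝ, with k(x) as in (10). The sketch below follows that reading; the Gaussian kernel and its bandwidth are the same illustrative choices as before:

```python
import numpy as np

def gaussian_kernel_matrix(A, B, c=1.0):
    """Gram matrix of the Gaussian kernel exp(-||a - b||^2 / (2 c^2)) between rows of A and B."""
    d2 = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * c ** 2))

def wiener_discriminant(X_train, g_hat, Sigma, c=1.0):
    """Return f with f(x) = k(x)^T (K + Sigma)^{-1} g_hat, the discriminant after the Wiener step."""
    K = gaussian_kernel_matrix(X_train, X_train, c)
    coef = np.linalg.solve(K + Sigma, g_hat)          # (K + Sigma)^{-1} g_hat
    return lambda X_new: gaussian_kernel_matrix(X_new, X_train, c) @ coef

# Usage: predicted labels are the signs of f on new inputs
# f = wiener_discriminant(X_train, g_hat, np.diag(sigma ** 2))
# y_pred = np.sign(f(X_test))
```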
5 Experiments
In this section, we apply SVM with approximated MAP learning to an artificially made pattern recognition problem and observe its generalization ability. We used the BANANA benchmark for binary classification, which was produced by Ratsch et al. [19]¹. This benchmark contains 100 datasets (No.1, ..., No.100) derived from the same distribution. An example of this dataset is shown in Fig. 1. Each sample is a two dimensional vector, and each dataset contains 400 samples for training and 4900 samples for testing. We used only the first 20 datasets and 200 training samples (100 for each class) from each dataset. The Gaussian kernel function was used for feature extraction:

K(x, x') = \exp\left(-\frac{\| x - x' \|^2}{2c^2}\right).

Here, the kernel parameter c was set to 1.

The error rates of SVM with approximated MAP learning are shown in Fig. 2. Here, the kernel parameter was fixed to 1, all the noise variances σ_i^2 were set to the same value, and σ was changed from 0 to 10. The solid line shows the average error rate and the error bar shows the standard deviation. The result at σ = 0 corresponds to hard margin SVM, and the error rate was 16.9%. The best error rate over the whole range of σ was 11.54%, so a significant improvement is observed.

On the other hand, the error rates of soft margin SVM are shown in Fig. 3. The parameter C was changed from 0 to 10. Note that, when C = 0, all α_i's have the same value and the SVM becomes identical to a simple Parzen window classifier. The best error rate over the whole range of C was 11.24%. So, we could not observe a significant difference in performance between the two methods.

The main difference between the two methods lies in the resulting optimization problems. The soft margin is implemented by adding new constraints, but our method only scales the feature space and adds no constraints. So, it is an advantage of our method that the optimization problem has fewer constraints. In online learning methods, this can lead to a simpler updating procedure which contributes to fast learning [20].
¹ This dataset is available from http://www.first.gmd.de/~raetsch
Figure 1  Banana Dataset. o stands for the samples of class A and + stands for the samples of class B.
Figure 2  Error rate of SVM with approximated MAP learning against the parameter σ. The solid line shows the average error rate and the error bar shows the standard deviation.

6 Conclusion
In this paper, we formulated an SVM for noisy cases based on the MAP estimate. Since this leads to a complicated optimization problem, we proposed an approximated learning algorithm where the estimation is divided into two stages. In the first stage, the intermediate vector is estimated via the MAP approach. In the second stage, linear regression is performed by the Wiener filter. In experiments, our method showed good generalization ability, comparable to the soft margin approach.
Figure 3  Error rates of soft margin SVM against the parameter C.

References
[1] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[2] H. Ogawa and E. Oja, "Projection filter, Wiener filter and Karhunen-Loeve subspaces in digital image restoration," J. Math. Anal. Appl., vol. 114, no. 1, pp. 37-51, 1986.
[3] A. Albert, Regression and the Moore-Penrose Pseudoinverse. Academic Press, 1972.
[4] A. Bjorck, Numerical Methods for Least Squares Problems. Society for Industrial and Applied Mathematics, 1996.
[5] Y. Yamashita and H. Ogawa, "Image restoration by averaged projection filter," Trans. IEICE, vol. J74-D-II, no. 2, pp. 150-157, 1991 (in Japanese).
[6] M. Sugiyama and H. Ogawa, "Function analytic approach to model selection - subspace information criterion," Proc. Workshop on Information-Based Induction Sciences (IBIS 99), pp. 93-98, 1999.
[7] A. Hirabayashi and H. Ogawa, "A class of learning of optimal generalization," Proc. IJCNN'99, 1999, to appear.
[8] P. Sollich, "Learning in large linear perceptrons and why the thermodynamic limit is relevant to the real world," NIPS 7, pp. 207-214, 1995.
[9] D. Saad, On-Line Learning in Neural Networks. Cambridge University Press, 1998.
[10] C. P. Robert, The Bayesian Choice: A Decision Theoretic Motivation. Springer-Verlag, 1994.
[11] M. Schmidt and H. Gish, "Speaker identification via support vector classifiers," ICASSP'96, vol. 1, pp. 105-108, 1996.
[12] B. Schoelkopf, K.-K. Sung, C. J. C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik, "Comparing support vector machines with Gaussian kernels to radial basis function classifiers," IEEE Trans. Signal Processing, vol. 45, no. 11, pp. 2758-2765, 1997.
[13] M. Pontil and A. Verri, "Support vector machines for 3D object recognition," IEEE Trans. Patt. Anal. Mach. Intell., vol. 20, no. 6, pp. 637-646, 1998.
[14] V. Vapnik, Statistical Learning Theory. Wiley, 1998.
[15] P. Sollich, "Probabilistic methods for support vector machines," NIPS 12, 1999, to appear.
[16] E. Oja and H. Ogawa, "Parametric projection filter for image and signal restoration," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-34, no. 6, pp. 1643-1653, 1986.
[17] B. Scholkopf, A. Smola, and K.-R. Muller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, pp. 1299-1319, 1998.
[18] A. Smola, Learning with Kernels. PhD thesis, GMD First, 1998.
[19] G. Raetsch, T. Onoda, and K.-R. Mueller, "Soft margins for AdaBoost," Tech. Rep. NC-TR-1998-021, Royal Holloway College, University of London, 1998.
[20] T.-T. Friess, "Support vector neural networks: The kernel adatron with bias and soft margin," Technical Report 752, The University of Sheffield, Dept. of ACSE, 1998.