JMLR: Workshop and Conference Proceedings 20 (2011) 165–180
Asian Conference on Machine Learning
Approximate Model Selection for Large Scale LSSVM

Lizhong Ding    [email protected]
Shizhong Liao   [email protected]
School of Computer Science and Technology, Tianjin University, Tianjin 300072, P. R. China
Editors: Chun-Nan Hsu and Wee Sun Lee
Abstract

Model selection is critical to least squares support vector machine (LSSVM). A major problem of existing model selection approaches for LSSVM is that the inverse of the kernel matrix needs to be calculated with O(n^3) complexity for each iteration, where n is the number of training examples, which is prohibitive for large scale applications. In this paper, we propose an approximate approach to model selection of LSSVM. We use multilevel circulant matrices to approximate the kernel matrix so that the fast Fourier transform (FFT) can be applied to reduce the computational cost of matrix inversion. With such approximation, we first design an efficient LSSVM algorithm with O(n log(n)) complexity and theoretically analyze the effect of kernel matrix approximation on the decision function of LSSVM. We further show that the approximate optimal model produced with the multilevel circulant matrix is consistent with the accurate one produced with the original kernel matrix. Under the guarantee of consistency, we present an approximate model selection scheme whose complexity is significantly lower than that of previous approaches. Experimental results on benchmark datasets demonstrate the effectiveness of approximate model selection.

Keywords: Model Selection, Matrix Approximation, Multilevel Circulant Matrices, Least Squares Support Vector Machine
1. Introduction

Support vector machine (SVM) (Vapnik, 1998) is a learning system for training linear learning machines in kernel-induced feature spaces, while controlling the capacity to prevent overfitting by generalization theory. It can be formulated as a quadratic programming problem with linear inequality constraints. The least squares support vector machine (LSSVM) (Suykens and Vandewalle, 1999) is a least squares version of SVM, which considers equality constraints instead of the inequality constraints of classical SVM. As a result, the solution of LSSVM follows directly from solving a system of linear equations, instead of quadratic programming.

Model selection is an important issue in LSSVM research. It involves the selection of the kernel function and its associated kernel parameters, and the selection of the regularization parameter. Setting the regularization parameter aside, Micchelli and Pontil (2006) proposed a kernel selection method that obtains an optimal kernel by minimizing a cost functional over a set of kernels; Crammer et al. (2003) used the boosting paradigm to construct kernel functions from data. Typically, the form of the kernel function is restricted to a few standard types, such as the polynomial kernel and the radial basis function (RBF) kernel. In this situation, the selection of the kernel function amounts to tuning the kernel parameters.
© 2011 L. Ding & S. Liao.
Model selection can be reduced to the selection of the kernel parameters and the regularization parameter that minimize the expectation of the test error (Chapelle and Vapnik, 2000). We usually refer to these parameters collectively as hyperparameters. Common model selection approaches mainly adopt a nested two-layer inference (Guyon et al., 2010), where the inner layer trains the classifier for fixed hyperparameters and the outer layer tunes the hyperparameters to minimize the generalization error. The generalization error can be estimated either via testing on some unused data (hold-out testing or cross validation) or via a theoretical bound (Vapnik and Chapelle, 2000; Chapelle et al., 2002). The k-fold cross validation gives an excellent estimate of the generalization error (Duan et al., 2003), and the extreme form of cross validation, leave-one-out (LOO), provides an almost unbiased estimate of the generalization error (Luntz and Brailovsky, 1969). However, the naive model selection strategy based on cross validation, which adopts a grid search in the hyperparameter space, unavoidably brings high computational complexity, since it trains LSSVM for every possible value of the hyperparameter vector.

Minimizing estimated bounds of the generalization error is an alternative approach to model selection, which is usually realized by gradient descent techniques. The commonly used bounds include the span bound (Vapnik and Chapelle, 2000) and the radius margin bound (Chapelle et al., 2002). Generally, these bound-based methods reduce the whole hyperparameter space to a search trajectory in the direction of gradient descent, to accelerate the outer layer of model selection, but LSSVM still has to be trained many times in the inner layer to iteratively attain the minimum of the estimates. Training LSSVM is equivalent to computing the inverse of a full n × n matrix, so its complexity is O(n^3), where n is the number of training examples. Therefore, it is prohibitive for large scale problems to directly train LSSVM for every hyperparameter vector on the search trajectory. Consequently, efficient model selection approaches that accelerate the inner computation are imperative.

As pointed out in (Chapelle et al., 2002; Cawley and Talbot, 2010), a model selection criterion is not required to be an unbiased estimate of the generalization error; instead, the primary requirement is merely that the minimum of the model selection criterion provides a reliable indication of the minimum of the generalization error in hyperparameter space. Therefore, we argue that it is sufficient to calculate an approximate criterion that can discriminate the optimal model from the candidates. Such considerations motivate our approximate model selection approach for LSSVM. Since the high computational cost of calculating the inverse of the kernel matrix is a major problem, we consider approximating the kernel matrix by a "nice" matrix whose inverse can be calculated at significantly lower cost. The multilevel circulant matrix is a good choice, since its built-in periodicity allows the multi-dimensional fast Fourier transform (FFT) to be utilized in calculating its inverse with O(n log(n)) complexity (Song and Xu, 2010b; Song, 2010). Taking advantage of the computational virtue of such approximation, we propose an efficient algorithm for solving LSSVM and derive an upper error bound to measure the effect of such approximation on the decision function of LSSVM.
We further take a model selection criterion as an example to demonstrate the consistency of approximate model selection. With the guarantee of consistency, we present an efficient approximate model selection scheme. It conforms to the two-layer iterative procedure, but the inner computation can be realized in O(n log(n)) complexity instead of O(n^3).
By experiments on 10 benchmark datasets, we show that approximate model selection can significantly improve the efficiency of model selection while maintaining low generalization error.

The rest of the paper is organized as follows. In Section 2, we give a brief introduction to LSSVM and a reformulation of it. In Section 3, we present an efficient algorithm for solving LSSVM. In Section 4, we analyze the effect of kernel matrix approximation on the decision function of LSSVM. In Section 5, we demonstrate the consistency of approximate model selection. In Section 6, we propose an approximate model selection scheme for LSSVM. In Section 7, we report experimental results. The last section gives the conclusion.
2. Least Squares Support Vector Machine

We use $\mathcal{X}$ to denote the input space and $\mathcal{Y}$ the output domain. Usually we will have $\mathcal{X} \subseteq \mathbb{R}^d$, $\mathcal{Y} = \{-1, 1\}$ for binary classification. The training set is denoted by $S = ((x_1, y_1), \ldots, (x_l, y_l)) \in (\mathcal{X} \times \mathcal{Y})^l$. We seek to construct a linear classifier, $f(x) = w \cdot \phi(x) + b$, in a feature space $\mathcal{F}$, defined by a feature mapping of the input space, $\phi: \mathcal{X} \to \mathcal{F}$. The parameters of the linear classifier, $(w, b)$, are given by the minimizer of a regularized least-squares training function
$$ L = \frac{1}{2}\|w\|^2 + \frac{1}{2\mu}\sum_{i=1}^{l}\left[y_i - w \cdot \phi(x_i) - b\right]^2, \tag{1} $$
where $\mu > 0$ is called the regularization parameter. The basic training algorithm for LSSVM (Suykens and Vandewalle, 1999; Van Gestel et al., 2004) views the regularized loss function (1) as a constrained minimization problem
$$ \min\ \frac{1}{2}\|w\|^2 + \frac{1}{2\mu}\sum_{i=1}^{l}\varepsilon_i^2, \qquad \text{s.t.}\quad \varepsilon_i = y_i - w \cdot \phi(x_i) - b. \tag{2} $$
Further, we can obtain the dual form of Equation (2) as follows:
$$ \sum_{j=1}^{l} \alpha_j\, \phi(x_j)\cdot\phi(x_i) + b + \mu\alpha_i = y_i, \qquad i = 1, 2, \ldots, l, \tag{3} $$
where $\sum_{i=1}^{l}\alpha_i = 0$. Noting that $\phi(x_j)\cdot\phi(x_i)$ corresponds to the kernel function $K(x_i, x_j)$, we can write Equation (3) in matrix form
$$ \begin{bmatrix} K_l + \mu I_l & \mathbf{1} \\ \mathbf{1}^T & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ b \end{bmatrix} = \begin{bmatrix} y \\ 0 \end{bmatrix}, \tag{4} $$
where $K_l = [K(x_i, x_j)]_{i,j=1}^{l}$, $I_l$ is the $l \times l$ identity matrix, $\mathbf{1}$ is a column vector of $l$ ones, $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_l)^T \in \mathbb{R}^l$ is a vector of Lagrange multipliers, and $y \in \mathcal{Y}^l$ is the label vector.
If we let $K_{\mu,l} = K_l + \mu I_l$, Equation (4) is equivalent to
$$ \begin{bmatrix} K_{\mu,l} & \mathbf{1} \\ \mathbf{1}^T & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ b \end{bmatrix} = \begin{bmatrix} y \\ 0 \end{bmatrix}. \tag{5} $$
The matrix on the left-hand side is not positive definite, so Equation (5) cannot be solved directly. However, we can write the first row of Equation (5) as
$$ K_{\mu,l}\left(\alpha + K_{\mu,l}^{-1}\mathbf{1}\,b\right) = y. \tag{6} $$
Therefore, $\alpha = K_{\mu,l}^{-1}(y - \mathbf{1}b)$, and substituting $\alpha$ into the second row of Equation (5) we obtain
$$ \mathbf{1}^T K_{\mu,l}^{-1}\mathbf{1}\, b = \mathbf{1}^T K_{\mu,l}^{-1}\, y. \tag{7} $$
The system of linear equations (5) can then be re-written as
$$ \begin{bmatrix} K_{\mu,l} & 0 \\ 0^T & \mathbf{1}^T K_{\mu,l}^{-1}\mathbf{1} \end{bmatrix} \begin{bmatrix} \alpha + K_{\mu,l}^{-1}\mathbf{1}\,b \\ b \end{bmatrix} = \begin{bmatrix} y \\ \mathbf{1}^T K_{\mu,l}^{-1}\, y \end{bmatrix}. \tag{8} $$
Equation (8) can be solved as follows. First solve
$$ K_{\mu,l}\,\rho = \mathbf{1} \qquad\text{and}\qquad K_{\mu,l}\,\nu = y. \tag{9} $$
Since $K_{\mu,l} = K_l + \mu I_l$ is positive definite, the inverse of the matrix $K_{\mu,l}$ exists. The solution $(\alpha, b)$ of Equation (5) is then given by
$$ b = \frac{\mathbf{1}^T\nu}{\mathbf{1}^T\rho} \qquad\text{and}\qquad \alpha = \nu - \rho\, b. \tag{10} $$
The decision function of LSSVM can be written as $f(x) = \sum_{i=1}^{l} \alpha_i K(x_i, x) + b$. If Equation (9) is solved, we can easily obtain the solution of LSSVM. However, the complexity of calculating the inverse of the matrix $K_{\mu,l}$ is $O(l^3)$. In the following, we will demonstrate that multilevel circulant matrices can be used to speed up this process.
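To make the procedure concrete, the following is a minimal illustrative sketch (not the implementation used in the paper) of the direct training route through Equations (9) and (10), using an RBF kernel and a dense linear solve; the function names and parameters are introduced here only for exposition, and the $O(l^3)$ cost of this baseline is exactly what the rest of the paper seeks to avoid.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def lssvm_train_direct(X, y, gamma, mu):
    """Exact LSSVM training via Eqs. (9)-(10): an O(l^3) dense solve."""
    l = len(y)
    K_mu = rbf_kernel_matrix(X, gamma) + mu * np.eye(l)   # K_l + mu * I_l
    ones = np.ones(l)
    rho = np.linalg.solve(K_mu, ones)   # K_mu rho = 1
    nu = np.linalg.solve(K_mu, y)       # K_mu nu  = y
    b = ones @ nu / (ones @ rho)        # Eq. (10)
    alpha = nu - rho * b
    return alpha, b

def lssvm_predict(X_train, alpha, b, gamma, X_test):
    """f(x) = sum_i alpha_i K(x_i, x) + b."""
    sq_tr = np.sum(X_train**2, axis=1)
    sq_te = np.sum(X_test**2, axis=1)
    d2 = sq_te[:, None] + sq_tr[None, :] - 2.0 * X_test @ X_train.T
    return np.exp(-gamma * np.maximum(d2, 0.0)) @ alpha + b
```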
3. Approximating LSSVM using Multilevel Circulant Matrices

We first review the notion of multilevel matrices (Tyrtyshnikov, 1996). For $m, n \in \mathbb{N}$, let $\mathbb{N}_n := \{0, 1, \ldots, n-1\}$ and $\mathbb{R}^{m \times n}$ be the set of $m \times n$ real-valued matrices. For a fixed positive integer $p$ and $n := [n_0, n_1, \ldots, n_{p-1}] \in \mathbb{N}^p$, we set
$$ \Pi_n := n_0 n_1 \cdots n_{p-1}, \qquad \mathbb{N}_n := \mathbb{N}_{n_0} \times \mathbb{N}_{n_1} \times \cdots \times \mathbb{N}_{n_{p-1}}. $$
A multilevel matrix is defined recursively. According to (Tyrtyshnikov, 1996), a matrix $A_n$ is called a $p$-level matrix of level order $n$ if it consists of $n_0 \times n_0$ blocks and each block is a $(p-1)$-level matrix of level order $[n_1, n_2, \ldots, n_{p-1}]$. To point to the entries of the multilevel matrix $A_n$, we use multi-indices. For any $j := [j_s : s \in \mathbb{N}_p]$, $l := [l_s : s \in \mathbb{N}_p] \in \mathbb{N}_n$, we write $A_n := [a_{j,l} : j, l \in \mathbb{N}_n]$,
where $(j_s, l_s)$, $s \in \mathbb{N}_p$, is the location at level $s$.

In the following, we restrict the kernel to be a radial basis function (RBF) kernel. We assume that there exists a real-valued function $k \in L_1(\mathbb{R})$ on $\mathcal{X}$ such that $K(t, s) = k(\|t - s\|_2)$ for all $t, s \in \mathcal{X}$, where $\|\cdot\|_2$ denotes the Euclidean norm. We point out that $k$ is always an even function, since we have $K(t, s) = K(s, t)$, $s, t \in \mathcal{X}$, from the definition of kernels. We present the kernel matrix in multilevel notation. For a set $D \subseteq \mathcal{X}$, we relabel it as $D := \{x_j : j \in \mathbb{N}_n\}$ for some $n \in \mathbb{N}^p$. The number of elements of $D$ is $\Pi_n$. The kernel matrix is rewritten as $K_n = [k(\|x_j - x_l\|_2) : j, l \in \mathbb{N}_n]$.

We now describe the definition of multilevel circulant matrices (Davis, 1979). We begin with the definition of circulant matrices. A circulant matrix is an $n \times n$ matrix $C_n := [c_{j,l} : j, l \in \mathbb{N}_n]$, where $c_{j,l} = c_{l-j}$ for any $j, l \in \mathbb{N}_n$ and $c_{j-n} = c_j$ for $0 \le j \le n-1$. More specifically, it takes the form
$$ \begin{bmatrix} c_0 & c_1 & \cdots & c_{n-1} \\ c_{n-1} & c_0 & \cdots & c_{n-2} \\ \vdots & \vdots & \ddots & \vdots \\ c_1 & c_2 & \cdots & c_0 \end{bmatrix}. $$
Clearly, a circulant matrix is completely determined by its first row, so we write $C_n := \mathrm{circ}[c_j : j \in \mathbb{N}_n]$. A block circulant matrix of type $(m, n)$ is an $mn \times mn$ matrix of the form
$$ \begin{bmatrix} A_0 & A_1 & \cdots & A_{m-1} \\ A_{m-1} & A_0 & \cdots & A_{m-2} \\ \vdots & \vdots & \ddots & \vdots \\ A_1 & A_2 & \cdots & A_0 \end{bmatrix}, \tag{11} $$
where each block $A_j$, $j \in \mathbb{N}_m$, is an $n \times n$ matrix.

A multilevel circulant matrix is defined recursively (Davis, 1979). A multilevel circulant matrix of level 1 is an ordinary circulant matrix. For any $s \in \mathbb{N}$, an $(s+1)$-level circulant matrix is a block circulant matrix whose blocks are $s$-level circulant matrices. More specifically, for $n \in \mathbb{N}^p$, $A_n := [a_{j,l} : j, l \in \mathbb{N}_n]$ is called a $p$-level circulant matrix if, for any $j, l \in \mathbb{N}_n$,
$$ a_{j,l} = a_{\,l_0 - j_0\,(\mathrm{mod}\ n_0),\ \ldots,\ l_{p-1} - j_{p-1}\,(\mathrm{mod}\ n_{p-1})}. $$
Note that a $p$-level circulant matrix is completely determined by its first row $a_{0,l}$, where $0 := (0, \ldots, 0)^T \in \mathbb{R}^p$. We will write $A_n := \mathrm{circ}_n[a_l : l \in \mathbb{N}_n]$, where $a_l := a_{0,l}$, for $l \in \mathbb{N}_n$.
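As a small illustration of the index rule above, the following sketch (the 1-level case only; the helper name is introduced here for exposition, not taken from the paper) materializes an ordinary circulant matrix from its first row.

```python
import numpy as np

def circulant_from_first_row(c):
    """1-level circulant matrix: entry (j, l) equals c[(l - j) mod n]."""
    n = len(c)
    return np.array([[c[(l - j) % n] for l in range(n)] for j in range(n)])

# e.g. circulant_from_first_row([0, 1, 2, 3]) has first row [0, 1, 2, 3]
# and first column [0, 3, 2, 1], as the wrap-around rule dictates.
```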
We now construct a multilevel circulant matrix $U_n$ which approximates the given kernel matrix $K_n$ (Song and Xu, 2010b; Song, 2010). For an $n \in \mathbb{N}^p$, we choose a sequence of positive numbers $h_n := [h_{n,j} : j \in \mathbb{N}_p] \in \mathbb{R}^p$, and define
$$ t_j := k\!\left(\left\| [\,j_s h_{n,s} : s \in \mathbb{N}_p\,] \right\|_2\right), \qquad j \in \mathbb{N}_n. \tag{12} $$
For any $j \in \mathbb{N}_n$ and $s \in \mathbb{N}_p$, we introduce the sets $D_{j,s} := \{0\}$ if $j_s = 0$, and $D_{j,s} := \{j_s, n_s - j_s\}$ if $1 \le j_s \le n_s - 1$, and let $D_j := D_{j,0} \times D_{j,1} \times \cdots \times D_{j,p-1}$. We then define
$$ u_j := \sum_{l \in D_j} t_l, \qquad j \in \mathbb{N}_n, \tag{13} $$
and
$$ U_n := \mathrm{circ}_n[u_j : j \in \mathbb{N}_n]. \tag{14} $$
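The next sketch instantiates Equations (12)-(14) in the simplest 1-level case (p = 1), with a single user-chosen spacing h standing in for h_n; it is an illustrative sketch, not the authors' code, and it returns only the first row u, which completely determines U_n.

```python
import numpy as np

def circulant_first_row(k, n, h):
    """First row [u_j : j in N_n] of the 1-level circulant approximation U_n.

    k : even function with K(x, x') = k(||x - x'||), e.g. k(s) = exp(-gamma * s**2)
    n : order of the matrix
    h : grid spacing h_n (an assumed, user-chosen positive constant)
    """
    t = np.array([k(j * h) for j in range(n)])      # Eq. (12): t_j = k(|j * h|)
    u = np.empty(n)
    u[0] = t[0]                                     # D_0 = {0}
    for j in range(1, n):
        u[j] = t[j] + t[n - j]                      # Eq. (13): D_j = {j, n - j}
    return u                                        # Eq. (14): U_n = circ_n(u)
```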
For LSSVM, we need to compute the inverse of $K_n + \mu I_n$. To reduce the computational cost, we intend to use the inverse of $U_n + \mu I_n$ as an approximation of the inverse of $K_n + \mu I_n$. We know that $K_n + \mu I_n$ is invertible, but the invertibility of $U_n + \mu I_n$ is not so obvious. Actually, Song and Xu (2010b) have proved that when $n$ is large enough, which means all of its components are large enough, $U_n + \mu I_n$ is positive definite and invertible.

We further introduce two lemmas about the eigenvalues and eigenvectors of a multilevel circulant matrix (Tyrtyshnikov, 1996; Davis, 1979).

Lemma 1 The eigenvalues of a $p$-level circulant matrix $A_n := \mathrm{circ}_n[a_l : l \in \mathbb{N}_n]$ are given by
$$ \lambda_j = \sum_{l \in \mathbb{N}_n} a_l\, e^{2\pi i \sum_{s \in \mathbb{N}_p} \frac{j_s l_s}{n_s}}, \qquad j \in \mathbb{N}_n, $$
where $i := \sqrt{-1}$.
Lemma 2 Suppose that $A_n$ is a multilevel matrix of order $n$ and $a := [a_j : j \in \mathbb{N}_n]$ is the first column of $A_n$. Then $A_n$ is a $p$-level circulant matrix of level order $n$ if and only if
$$ A_n = \frac{1}{\Pi_n}\, \Phi^{*}\, \mathrm{diag}(\Phi a)\, \Phi, $$
where $\Phi := F_{n_0} \otimes F_{n_1} \otimes \cdots \otimes F_{n_{p-1}}$, $\otimes$ denotes the Kronecker product of matrices, and $F_m := \left[ e^{\frac{2\pi i s t}{m}} : s, t \in \mathbb{N}_m \right]$ for $m \in \mathbb{N}$.

From the above two lemmas, we can see that the eigenvalues and eigenvectors of multilevel circulant matrices can be expressed in a multi-dimensional discrete Fourier transform (DFT) form. Therefore, their calculation can be realized efficiently by using the multi-dimensional fast Fourier transform (FFT). In the following, we will show how this computational advantage can be applied to obtain an efficient algorithm for solving LSSVM.

From Lemma 2, the multilevel circulant matrix $U_n$ can be represented as
$$ U_n = \frac{1}{\Pi_n}\, \Phi^{*}\, \mathrm{diag}(v)\, \Phi, \tag{15} $$
where
$$ v = \Phi\,[u_j : j \in \mathbb{N}_n]. \tag{16} $$
It follows that
$$ (U_n + \mu I_n)^{-1} = \frac{1}{\Pi_n}\, \Phi^{*}\, \mathrm{diag}\!\left( \frac{1}{v_j + \mu} : j \in \mathbb{N}_n \right) \Phi. \tag{17} $$
We next present an algorithm for solving LSSVM.

Algorithm 1: Approximating LSSVM using Multilevel Circulant Matrices
Input: $y := \{y_j : j \in \mathbb{N}_n\}$, $[u_j : j \in \mathbb{N}_n]$, $k$, $\mu$;
Output: $(\alpha, b)$;
1: Calculate $v$ according to (16) by using the multi-dimensional FFT;
2: Calculate $\eta := \Phi\,[\mathbf{1}, y]$ using the multi-dimensional FFT, where $\mathbf{1}$ is a vector of all ones;
3: Calculate $\tau := \mathrm{diag}\!\left(\frac{1}{v_j + \mu} : j \in \mathbb{N}_n\right)\eta$;
4: Calculate $[\rho, \nu] = \frac{1}{\Pi_n}\Phi^{*}\tau$ according to (17) using the multi-dimensional FFT;
5: Calculate $b = \frac{\mathbf{1}^T\nu}{\mathbf{1}^T\rho}$, $\alpha = \nu - \rho\, b$;
return $(\alpha, b)$;
We estimate the computational complexity of Algorithm 1 in the next theorem.

Theorem 3 The computational complexity of Algorithm 1 is $O(\Pi_n \log(\Pi_n))$.

Proof The computational complexity of the multi-dimensional FFT is $O(l \log(l))$, where $l$ is the total number of data points. It follows that the computational complexity of steps 1, 2 and 4 in Algorithm 1 is $O(\Pi_n \log(\Pi_n))$, since each of them is a multi-dimensional FFT of $\Pi_n$ data points. The complexity of step 3 is $O(\Pi_n)$, since it is the product of a diagonal matrix with a vector. In step 5, multiplications and subtractions between vectors need to be done, so its complexity is $O(\Pi_n)$. Therefore, the total complexity is $O(\Pi_n \log(\Pi_n))$.
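The following is a minimal 1-level (p = 1) sketch of Algorithm 1 based on NumPy's FFT; it is an illustration under our own naming, not the authors' implementation. Because the first row u produced by Equation (13) is symmetric under index reversal, it coincides with the first column of U_n, so the standard circulant diagonalization applies directly; a p-level version would use the p-dimensional FFT (numpy.fft.fftn) on arrays of shape n_0 x ... x n_{p-1}.

```python
import numpy as np

def solve_circulant_lssvm(u, y, mu):
    """Approximate LSSVM solve (Algorithm 1, 1-level case) in O(n log n).

    u  : first row of the circulant approximation U_n (symmetric: u[j] = u[n-j])
    y  : label vector of length n
    mu : regularization parameter
    """
    v = np.fft.fft(u)                    # step 1: eigenvalues of U_n via FFT
    denom = v + mu                       # eigenvalues of U_n + mu * I_n
    ones = np.ones(len(y))
    # steps 2-4: apply (U_n + mu*I_n)^{-1} to 1 and y in the Fourier domain, Eq. (17)
    rho = np.real(np.fft.ifft(np.fft.fft(ones) / denom))
    nu = np.real(np.fft.ifft(np.fft.fft(y) / denom))
    # step 5: recover the bias and the dual coefficients, cf. Eq. (10)
    b = ones @ nu / (ones @ rho)
    alpha = nu - rho * b
    return alpha, b
```

In contrast to the dense baseline after Section 2, only the first row u and a few FFT-sized work vectors are stored, and each solve costs O(n log n).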
4. Error Analysis

In this section, we analyze the effect of kernel matrix approximation on the decision function generated by LSSVM. We assume that kernel approximation is only used in training; at testing time the true kernel function is used (Cortes et al., 2010). The decision function $f$ derived with the exact kernel matrix is defined by
$$ f(x) = \sum_{j \in \mathbb{N}_n} \alpha_j K(x, x_j) + b = \begin{bmatrix} \alpha \\ b \end{bmatrix}^{T} \begin{bmatrix} k_x \\ 1 \end{bmatrix}, $$
where $k_x = [K(x, x_j) : j \in \mathbb{N}_n]$. For simplicity, we assume the offset $b$ to be a constant $\varphi$. We define $\kappa > 0$ such that $K(x, x) \le \kappa$.
To analyze the effect of approximation, we need to introduce a class of matrices whose entries have an exponential decay property (Song and Xu, 2010b). We first define "distances" of an entry in a multilevel matrix to its diagonal, to its upper right corner and to its lower left corner at each level. For $t \in \mathbb{N}_2$, $m \in \mathbb{N}$, $j, l \in \mathbb{N}_m$, we set
$$ d_m(t, j, l) := t|j - l| + (1 - t)(m - |j - l| - 1), $$
and for $t \in \mathbb{N}_2^p$, $n \in \mathbb{N}^p$, $j, l \in \mathbb{N}_n$, let
$$ d_n(t, j, l) := [d_{n_s}(t_s, j_s, l_s) : s \in \mathbb{N}_p]. $$
We remark that, for any $s \in \mathbb{N}_p$, $d_{n_s}(1, j_s, l_s)$ is the distance of the entry at the position $(j_s, l_s)$ to the diagonal at level $s$ and $d_{n_s}(0, j_s, l_s)$ is the distance to the upper right and lower left corners at level $s$. We now give the definition of the class of matrices whose entries have an exponential decay as their distances defined above increase. For any $n \in \mathbb{N}^p$, $j, l \in \mathbb{N}_n$ and $r > 0$, let
$$ E_{r,p,n}(j, l) := \sum_{t \in \mathbb{N}_2^p} e^{-r\|d_n(t, j, l)\|_2}. \tag{18} $$
In what follows, we write $\mathcal{A} := \{A_n : n \in \mathbb{N}^p\}$.

Definition 4 A sequence of positive definite matrices $\mathcal{A}$ belongs to $\mathcal{E}_r$ for a positive constant $r$ if it satisfies the following conditions:
(i) there exists a positive constant $\kappa$ such that $\|A_n^{-1}\|_2 \le \kappa$ for any $n \in \mathbb{N}^p$;
(ii) there exists a positive constant $c$ such that for any $n \in \mathbb{N}^p$ and $j, l \in \mathbb{N}_n$, $|a_{j,l}| \le c\, E_{r,p,n}(j, l)$.

Since solving LSSVM is equivalent to solving the linear systems, we further introduce a weight function of the location of an entry in a multilevel vector, which is used to analyze the approximation behavior in solving the linear systems. For an $n \in \mathbb{N}^p$, $j \in \mathbb{N}_n$, let
$$ W_{\mathcal{E}_r, n}(j) := e^{r \nu_n(j)}, \qquad r > 0, \tag{19} $$
where $\nu_n(j) := \left\| \frac{n}{2} - j \right\|_2$. Furthermore, we define the associated weighted norm of a multilevel vector. For a multilevel vector $y_n$, $n \in \mathbb{N}^p$, we let
$$ \|y_n\|_{W_{\mathcal{E}_r}} := \sup_{j \in \mathbb{N}_n} W_{\mathcal{E}_r, n}(j)\, |(y_n)_j|, \qquad r > 0. \tag{20} $$
The "induced norm" of a multilevel matrix $A_n$, $n \in \mathbb{N}^p$, is defined by
$$ \|A_n\|_{W_{\mathcal{E}_r}} := \sup\{\|A_n y_n\|_{\infty} : \|y_n\|_{W_{\mathcal{E}_r}} = 1\}, \qquad r > 0. $$
Our analysis of the effect of kernel matrix approximation on the decision function of LSSVM is based on the convergence analysis of the approximation of kernel matrices by multilevel circulant matrices. The following theorem demonstrates the convergence of such approximation (Song and Xu, 2010b). We let $K_{\mu,n} := K_n + \mu I_n$ and $U_{\mu,n} := U_n + \mu I_n$.
Theorem 5 If the following assumptions
(H1) there exist positive constants $c_1$ and $c_2$ such that $|k(s)| \le c_1 e^{-c_2 |s|}$, $s \in \mathbb{R}$;
(H2) there exist constants $h_0 > 0$ and $c_0 \in \mathbb{R}$ such that $\|x_j - x_l\|_2 \ge h_0 \|j - l\|_2 + c_0$, $n \in \mathbb{N}^p$, $j, l \in \mathbb{N}_n$;
(H3) there exists a positive constant $h$ such that $h_{n,j} \ge h$, for all $n \in \mathbb{N}^p$ and $j \in \mathbb{N}_p$;
(H4) there exist positive constants $c_3$ and $\beta$ such that $|k(s) - k(t)| \le c_3 |s - t|^{\beta}$, $s, t \in \mathbb{R}$;
hold, then for each $r > 0$ there exists a positive constant $r_0$ such that, for any $0 < r' < r_0/4$ and all $n \in \mathbb{N}^p$, we have
$$ \left\| K_{\mu,n}^{-1} - U_{\mu,n}^{-1} \right\|_{W_{\mathcal{E}_{r'}}} \le c \left( \|X_n - M_n\|_{W_{\mathcal{E}_r}}^{\beta} + 1 \right) e^{-r' n_{\min}} $$
for some positive constant $c$, where $n_{\min} := \min\{n_s : s \in \mathbb{N}_p\}$, $X_n := [\|x_j - x_l\|_2 : j, l \in \mathbb{N}_n]$, and $M_n := [\,\|[(j_s - l_s) h_{n,s} : s \in \mathbb{N}_p]\|_2 : j, l \in \mathbb{N}_n\,]$, for any $n \in \mathbb{N}^p$.

RBF kernels, which have the form $k(x) = e^{-\gamma x^2}$, $x \in \mathbb{R}$, for $\gamma > 0$, satisfy assumptions (H1) and (H4).

We know that the key to solving LSSVM is to solve
$$ K_{\mu,n}\,\rho = \mathbf{1} \qquad\text{and}\qquad K_{\mu,n}\,\nu = y. $$
If we use the multilevel circulant matrix $U_n$ to replace $K_n$, the solution could be different. Let $\rho'$ denote the solution obtained using the approximate matrix $U_n$. We can write
$$ \rho - \rho' = K_{\mu,n}^{-1}\mathbf{1} - U_{\mu,n}^{-1}\mathbf{1} = \left(K_{\mu,n}^{-1} - U_{\mu,n}^{-1}\right)\mathbf{1}. \tag{21} $$
Thus, using Theorem 5, $\|\rho' - \rho\|$ can be bounded as follows:
$$ \|\rho' - \rho\|_{W_{\mathcal{E}_{r'}}} \le \left\| K_{\mu,n}^{-1} - U_{\mu,n}^{-1} \right\|_{W_{\mathcal{E}_{r'}}} \|\mathbf{1}\|_{W_{\mathcal{E}_{r'}}} \le c \left( \|X_n - M_n\|_{W_{\mathcal{E}_r}}^{\beta} + 1 \right) e^{-r' n_{\min}} \|\mathbf{1}\|_{W_{\mathcal{E}_{r'}}}. \tag{22} $$
From Equations (20) and (19),
$$ \|\mathbf{1}\|_{W_{\mathcal{E}_{r'}}} = \sup_{j \in \mathbb{N}_n} W_{\mathcal{E}_{r'}, n}(j) = \sup_{j \in \mathbb{N}_n} e^{r' \nu_n(j)} = e^{\frac{1}{2} r' \Omega}, $$
where $\Omega = \left( n_0^2 + n_1^2 + \cdots + n_{p-1}^2 \right)^{1/2}$. Therefore,
$$ \|\rho' - \rho\|_{W_{\mathcal{E}_{r'}}} \le c \left( \|X_n - M_n\|_{W_{\mathcal{E}_r}}^{\beta} + 1 \right) e^{-r' n_{\min}} e^{\frac{1}{2} r' \Omega} = c \left( \|X_n - M_n\|_{W_{\mathcal{E}_r}}^{\beta} + 1 \right) e^{\frac{1}{2} r' \Omega - r' n_{\min}}. \tag{23} $$
Replacing $\mathbf{1}$ with $y$, we can obtain a similar bound for $\|\nu' - \nu\|$:
$$ \|\nu' - \nu\|_{W_{\mathcal{E}_{r'}}} \le c \left( \|X_n - M_n\|_{W_{\mathcal{E}_r}}^{\beta} + 1 \right) e^{\frac{1}{2} r' \Omega - r' n_{\min}}. \tag{24} $$
By our assumptions, the approximation does not affect $k_x$ and the offset $b$ is a constant $\varphi$, so the approximate decision function $f'$ is given by $f'(x) = [\alpha'; \varphi]^T [k_x; 1]$. Therefore,
$$ f'(x) - f(x) = \left( \begin{bmatrix} \alpha' \\ \varphi \end{bmatrix}^{T} - \begin{bmatrix} \alpha \\ \varphi \end{bmatrix}^{T} \right) \begin{bmatrix} k_x \\ 1 \end{bmatrix} = \begin{bmatrix} \alpha' - \alpha \\ 0 \end{bmatrix}^{T} \begin{bmatrix} k_x \\ 1 \end{bmatrix} = (\alpha' - \alpha)^T k_x. $$
It is easy to see that the norm $\|\cdot\|_{W_{\mathcal{E}_{r'}}}$ satisfies the triangle inequality. We can obtain
$$ \|f'(x) - f(x)\|_{W_{\mathcal{E}_{r'}}} \le \|\alpha' - \alpha\|_{W_{\mathcal{E}_{r'}}}\, \|k_x\|_{W_{\mathcal{E}_{r'}}}. \tag{25} $$
For the RBF kernel, $K(x, x) \le \kappa = 1$, so $\|k_x\|_{W_{\mathcal{E}_{r'}}} \le e^{\frac{1}{2} r' \Omega}$. From Equation (10), we know that $\alpha = \nu - \rho b = \nu - \rho\varphi$, so
$$ \|\alpha' - \alpha\|_{W_{\mathcal{E}_{r'}}} \le \|\nu' - \nu\|_{W_{\mathcal{E}_{r'}}} + \varphi \|\rho - \rho'\|_{W_{\mathcal{E}_{r'}}} \le c(1 + \varphi) \left( \|X_n - M_n\|_{W_{\mathcal{E}_r}}^{\beta} + 1 \right) e^{\frac{1}{2} r' \Omega - r' n_{\min}}. $$
Therefore, we have
$$ \|f'(x) - f(x)\|_{W_{\mathcal{E}_{r'}}} \le c(1 + \varphi) \left( \|X_n - M_n\|_{W_{\mathcal{E}_r}}^{\beta} + 1 \right) e^{\frac{1}{2} r' \Omega - r' n_{\min}}\, e^{\frac{1}{2} r' \Omega} = c(1 + \varphi) \left( \|X_n - M_n\|_{W_{\mathcal{E}_r}}^{\beta} + 1 \right) e^{r' \Omega - r' n_{\min}}. \tag{26} $$
Equation (26) measures the effect of kernel matrix approximation on the decision function generated by LSSVM. It enables us to bound the relative performance of LSSVM when the multilevel circulant matrix is used to approximate the kernel matrix.
5. Consistency of Approximate Model Selection

In this section, we take a model selection approach proposed by Song and Xu (2010a) as an example to demonstrate the consistency of approximate model selection when multilevel circulant matrices are used. LSSVM is a regularized kernel-based learning algorithm, so we could define a cost functional as
$$ Q(f, K) := \frac{1}{\Pi_n} \sum_{j \in \mathbb{N}_n} (f(x_j) - y_j)^2 + \mu\, (f, f)_{\mathcal{H}_K}, $$
where $\mathcal{H}_K$ is the reproducing kernel Hilbert space (RKHS) with inner product $(\cdot, \cdot)_{\mathcal{H}_K}$. For a given kernel $K$, the target function $f_K$ is chosen to be the minimizer of the above functional in $\mathcal{H}_K$. That is,
$$ f_K := \arg\min_{f \in \mathcal{H}_K} Q(f, K). $$
The observed output $\tilde{y} := [\tilde{y}_j : j \in \mathbb{N}_n]$ is corrupted by some random noise, that is,
$$ \tilde{y}_j = y_j + \xi_j, \qquad j \in \mathbb{N}_n, $$
where $y := [y_j : j \in \mathbb{N}_n]$ is the unknown true output and $\xi_j$ is a random variable with mean $0$ and variance $\sigma^2$. Let
$$ R(K, \tilde{y}) := \min_{f \in \mathcal{H}_K} \frac{1}{\Pi_n} \sum_{j \in \mathbb{N}_n} (f(x_j) - \tilde{y}_j)^2 + \mu\, (f, f)_{\mathcal{H}_K}. \tag{27} $$
Since $\xi_j$ is a random variable, so are $\tilde{y}_j$ and $R(K, \tilde{y})$. A cost functional $R(K)$ is defined as the expectation of $R(K, \tilde{y})$, namely,
$$ R(K) := \mathbb{E}[R(K, \tilde{y})]. \tag{28} $$
$R(K)$ can be divided into two parts as follows (Song and Xu, 2010a):
$$ R(K) = M(K_n) + V(K_n) = \frac{\mu}{\Pi_n}\, y\, (K_n + \mu I_n)^{-1} y^T + \frac{\mu\sigma^2}{\Pi_n} \sum_{j \in \mathbb{N}_n} \frac{1}{\lambda_j + \mu}, \tag{29} $$
where $\lambda_j$, $j \in \mathbb{N}_n$, are the eigenvalues of $K_n$. Therefore, for a prescribed set of kernel functions $\mathcal{K}$, we can take $R(K)$ as a model selection criterion and select the optimal kernel function $K^*$ by minimizing it, i.e.,
$$ K^* = \arg\min_{K \in \mathcal{K}} R(K). $$
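As a sketch (introduced here for illustration, not the authors' code) of why Equation (29) becomes cheap under the approximation of Section 3: replacing K_n by U_n, the eigenvalues are obtained by an FFT of the first row u, and (U_n + mu I_n)^{-1} y is applied in the Fourier domain, so the criterion costs O(Pi_n log Pi_n). The noise variance sigma^2 is assumed to be known or estimated separately.

```python
import numpy as np

def approx_criterion(u, y, mu, sigma2):
    """Approximate criterion R~(K) = M(U_n) + V(U_n), cf. Eq. (29) with K_n -> U_n.

    u      : first row of the 1-level circulant approximation U_n
    y      : observed output vector
    mu     : regularization parameter
    sigma2 : noise variance (assumed known or estimated)
    """
    n = len(y)
    lam = np.real(np.fft.fft(u))                            # eigenvalues of U_n
    denom = lam + mu
    Uinv_y = np.real(np.fft.ifft(np.fft.fft(y) / denom))    # (U_n + mu I_n)^{-1} y
    M = mu / n * (y @ Uinv_y)
    V = mu * sigma2 / n * np.sum(1.0 / denom)
    return M + V
```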
For Equation (29), computing $M(K_n)$ requires finding the inverse of the full matrix $K_n + \mu I_n$ and computing $V(K_n)$ requires calculating all eigenvalues of $K_n$. Therefore, computing $R(K)$ is computationally expensive. We could also use a circulant matrix $U_n$ to replace the kernel matrix $K_n$ to reduce the computational cost. However, we need a theoretical guarantee to show the rationality of such approximation. To this end, we introduce the following theorem (Song, 2010).

Theorem 6 If the assumptions (H1), (H2), (H3) in Theorem 5 and
(H5) there exist positive constants $c_1$ and $c_2$ such that for any $n \in \mathbb{N}^p$, $j, l \in \mathbb{N}_n$,
$$ \left| \|x_j - x_l\|_2 - \left\| [(j_s - l_s) h_{n,s} : s \in \mathbb{N}_p] \right\|_2 \right| \le c_1 \sum_{s \in \mathbb{N}_p} \left( e^{-c_2 \delta_{n_s}(j_s)} + e^{-c_2 \delta_{n_s}(l_s)} \right), $$
where $\delta_n(j) = \frac{n}{2} - \left| \frac{n}{2} - j \right|$,
hold, then there exists a positive constant $c$ such that for any $n \in \mathbb{N}^p$,
$$ |V(U_n) - V(K_n)| \le c\, (n_{\min})^{-1}. $$
If in addition, there exist positive constants $c_3$ and $r_1$ such that for any $n \in \mathbb{N}^p$ and $j \in \mathbb{N}_n$, $|y_j| \le c_3 e^{-r_1 \nu_n(j)}$, then there exist positive constants $c$ and $r$ such that for any $n \in \mathbb{N}^p$,
$$ |M(U_n) - M(K_n)| \le c\, \Pi_n^{-1/2}\, e^{-r n_{\min}}. $$
We denote $\tilde{R}(K) = M(U_n) + V(U_n)$. By Theorem 6 and the triangle inequality, we can directly derive the following theorem.

Theorem 7 If the assumptions (H1), (H2), (H3), (H5) hold, and there exist positive constants $c$ and $r$ such that for any $n \in \mathbb{N}^p$ and $j \in \mathbb{N}_n$, $|y_j| \le c e^{-r \nu_n(j)}$, then
$$ \lim_{n \to \infty} |\tilde{R}(K) - R(K)| = 0, $$
where $n \to \infty$ means that all of its components go to infinity.

Theorem 7 shows that, for a regularized kernel-based learning algorithm such as LSSVM, if we use the approximate model selection criterion $\tilde{R}(K)$ produced with multilevel circulant matrices, we can obtain the same optimal kernel as the accurate model selection criterion $R(K)$ does, which shows the consistency of approximate model selection.
6. Approximate Model Selection for LSSVM

In the previous section, we introduced a model selection approach that focuses on kernel selection; the selection of the regularization parameter $\mu$ has not been discussed. Actually, there are many model selection approaches that can simultaneously select the kernel and the regularization parameter, such as cross validation, the radius margin bound (Chapelle et al., 2002), the PRESS criterion (Cawley and Talbot, 2004), and so on. However, when optimizing model selection criteria, all these approaches need to train LSSVM in the inner layer for every iteration. In this section, we discuss the problem of approximate model selection. We argue that for the model selection purpose, it is sufficient to calculate an approximate criterion that can discriminate the optimal models from the candidates. This argument is supported by Theorem 7. In the following, we present an approximate model selection scheme, as shown in Algorithm 2. We use the RBF kernel $K(x_j, x_l) = \exp(-\gamma \|x_j - x_l\|^2)$, $j, l \in \mathbb{N}_n$, to describe the scheme.

Algorithm 2: Approximate Model Selection Scheme for LSSVM
Input: $D := \{x_j : j \in \mathbb{N}_n\}$, $y := \{y_j : j \in \mathbb{N}_n\}$;
Output: $(\gamma, \mu)_{\mathrm{opt}}$;
Initialize: $(\gamma, \mu) = (\gamma^0, \mu^0)$;
repeat
1: Calculate $[u_j : j \in \mathbb{N}_n]$ according to (12) and (13);
2: Calculate $\alpha$ and $b$ for LSSVM with fixed $(\gamma, \mu)$ using Algorithm 1;
3: Calculate the model selection criterion $T$ using $\alpha$ and $b$;
4: Update $(\gamma, \mu)$ to minimize $T$;
until the criterion $T$ is minimized;
return $(\gamma, \mu)_{\mathrm{opt}}$;
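Below is a minimal grid-search sketch of Algorithm 2 for the 1-level case, where the criterion T is taken to be the approximate criterion of Section 5; this is one possible choice made here for illustration (the experiments in Section 7 use approximate cross validation instead), and the spacing h, the grids, and the function name are assumptions of the sketch rather than part of the paper.

```python
import numpy as np

def approximate_model_selection(y, h, gammas, mus, sigma2):
    """Grid-search sketch of Algorithm 2 (1-level case) with T = R~ of Section 5."""
    n = len(y)
    y_hat = np.fft.fft(y)
    best_params, best_T = None, np.inf
    for gamma in gammas:
        # step 1: first row u of U_n for this gamma, cf. Eqs. (12)-(13)
        t = np.exp(-gamma * (np.arange(n) * h) ** 2)
        u = np.empty(n)
        u[0] = t[0]
        u[1:] = t[1:] + t[:0:-1]          # u[j] = t[j] + t[n - j]
        lam = np.real(np.fft.fft(u))      # eigenvalues of U_n via FFT
        for mu in mus:
            # steps 2-3 folded together: criterion from the FFT eigen-system
            denom = lam + mu
            Uinv_y = np.real(np.fft.ifft(y_hat / denom))
            T = mu / n * (y @ Uinv_y) + mu * sigma2 / n * np.sum(1.0 / denom)
            # step 4: keep the grid point minimizing T
            if T < best_T:
                best_params, best_T = (gamma, mu), T
    return best_params
```

Note that u and its FFT eigenvalues depend only on gamma, so they are computed once per gamma and reused for every mu, which matches the complexity discussion below.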
Let $S$ denote the number of iteration steps for optimizing a model selection criterion. The complexity of solving LSSVM by calculating the inverse of the kernel matrix is $O(\Pi_n^3)$. For the radius margin bound or the span bound (Chapelle et al., 2002), a standard LSSVM needs to be solved in the inner layer for each iteration, so the total complexity of these two methods is $O(S \Pi_n^3)$.
For the PRESS criterion (Cawley and Talbot, 2004), the inverse of the kernel matrix still needs to be calculated for each iteration, so its complexity is also $O(S \Pi_n^3)$. From Theorem 3, we know that using Algorithm 1, we can solve LSSVM with $O(\Pi_n \log(\Pi_n))$ complexity. The complexity of calculating $[u_j : j \in \mathbb{N}_n]$ is $O(p 2^p \Pi_n)$, where $p$ is the number of levels of $U_n$; $p$ is a small constant and is usually set to 2 or 3. Therefore, if we use the above model selection criteria in the outer layer, the complexity of approximate model selection is $O(S \Pi_n (p 2^p + \log(\Pi_n)))$. For $t$-fold cross validation, let $S_\gamma$ and $S_\mu$ denote the numbers of grid steps for $\gamma$ and $\mu$. If LSSVM is solved directly, the complexity of $t$-fold cross validation is $O(t S_\gamma S_\mu \Pi_n^3)$. However, the complexity of approximate model selection using $t$-fold cross validation as the outer layer criterion is $O(t S_\gamma \Pi_n (p 2^p + S_\mu \log(\Pi_n)))$, since the calculation of $[u_j : j \in \mathbb{N}_n]$ only needs the kernel parameter $\gamma$.
7. Experiments

In this section, we present experiments on several benchmark datasets to demonstrate the effectiveness of approximate model selection.

7.1. Experimental Scheme

The benchmark datasets used in our experiments were introduced in Rätsch et al. (2001) and are widely used for the model selection purpose (Chapelle et al., 2002; Chen et al., 2009), as shown in Table 1. For each dataset, there are 100 random pre-defined training and test partitions¹ (except 20 for the Image dataset). The use of multiple benchmarks means that the evaluation is more robust, as the selection of data sets that provide a good match to the inductive bias of a particular classifier becomes less likely. Likewise, the use of multiple partitions provides robustness against sensitivity to the sampling of data to form training and test sets.

1. http://www.fml.tuebingen.mpg.de/Members/raetsch/benchmark

Table 1: Datasets used in experiments.

Dataset        Features   Training   Test   Replications
Thyroid        5          140        75     100
Titanic        3          150        2051   100
Heart          13         170        100    100
Breast         9          200        77     100
Banana         2          400        4900   100
Twonorm        20         400        7000   100
Diabetes       8          468        300    100
Flare solar    9          666        400    100
German         20         700        300    100
Image          18         1300       1010   20
In Rätsch et al. (2001), model selection is performed on the first five training sets of each dataset. The median values of the hyperparameters over these five sets are then determined and subsequently used to evaluate the error rates throughout all 100 partitions. However, for this experimental scheme, some of the test data is no longer statistically "pure", since it has
been used during model selection. Furthermore, the use of the median of the hyperparameters would introduce an optimistic bias (Cawley and Talbot, 2010). In our experiments, we perform model selection on the training set of each partition, then train the classifier with the obtained optimal hyperparameters on the same training set, and finally evaluate the classifier on the corresponding test set. Therefore, we obtain 100 test error rates for each dataset (except 20 for the Image dataset). The statistical analysis of these test error rates is conducted to evaluate the performance of the model selection approach. This experimental scheme is rigorous and can avoid the major flaws of the previous one (Cawley and Talbot, 2010). All experiments are performed on a Core2 Quad PC with 2.33GHz CPU and 4GB memory.

7.2. Effectiveness

Following the experimental setup in Section 7.1, we perform model selection using 5-fold cross validation (5-fold CV) and approximate 5-fold CV, that is, approximate model selection by minimizing the 5-fold CV error (as shown in Algorithm 2). The CV is performed on a $13 \times 11$ grid of $(\gamma, \mu)$, respectively varying in $[2^{-15}, 2^{9}]$ and $[2^{-15}, 2^{5}]$, both with step $2^{2}$. The number of levels of the multilevel circulant matrix approximation is 2.

We compare the effectiveness of the two model selection approaches. Effectiveness includes efficiency and generalization. Efficiency is measured by the average computation time for model selection. Generalization is measured by the mean test error rate (TER) of the classifiers trained with the optimal hyperparameters produced by the different model selection approaches. Results are shown in Table 2. We use the $z$ statistic of TER (Cawley and Talbot, 2007) to estimate the statistical significance of differences in performance. Let $\bar{x}$ and $\bar{y}$ represent the means of TER of the two approaches, and $e_x$ and $e_y$ the corresponding standard errors; then the $z$ statistic is computed as $z = (\bar{x} - \bar{y}) / \sqrt{e_x^2 + e_y^2}$, and $z = 1.64$ corresponds to a 95% significance level.

From Table 2, approximate 5-fold CV is significantly outperformed by 5-fold CV on none of the 10 datasets. Besides, according to the Wilcoxon signed rank test (Demšar, 2006), neither 5-fold CV nor approximate 5-fold CV is statistically superior at the 95% level of significance. However, Table 2 also shows that approximate 5-fold CV is more efficient than 5-fold CV on all datasets. It is worth noting that the larger the training set size is, the more obvious the efficiency gain, which is in accord with the results of the complexity analysis.
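For reference, a small helper (written here for illustration, not taken from the paper) computes the z statistic above from the per-partition test error rates of two approaches; each standard error is assumed to be the sample standard deviation divided by the square root of the number of partitions.

```python
import numpy as np

def z_statistic(ter_a, ter_b):
    """z = (mean(a) - mean(b)) / sqrt(se_a^2 + se_b^2); |z| > 1.64 ~ 95% level."""
    a, b = np.asarray(ter_a, float), np.asarray(ter_b, float)
    se_a = a.std(ddof=1) / np.sqrt(a.size)   # standard error of the mean
    se_b = b.std(ddof=1) / np.sqrt(b.size)
    return (a.mean() - b.mean()) / np.sqrt(se_a**2 + se_b**2)
```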
Table 2: Comparison of computation time and test error rate (TER) of 5-fold cross validation (5-fold CV) and approximate 5-fold CV.

                        5-fold CV                   Approximate 5-fold CV
Dataset        Time(s)    TER(%)           Time(s)    TER(%)
Thyroid        0.938      5.000±2.580      0.774      4.773±2.291
Titanic        0.854      22.534±0.688     0.850      22.897±1.427
Heart          0.986      16.200±3.259     0.889      18.920±4.576
Breast         1.475      26.831±4.578     1.010      27.831±5.569
Banana         6.287      10.781±0.721     1.858      11.283±0.992
Twonorm        6.362      2.560±0.310      1.937      2.791±0.566
Diabetes       10.042     23.493±1.663     2.104      26.386±4.501
Flare solar    18.172     34.172±1.863     4.200      36.440±2.752
German         23.058     23.793±2.283     4.239      25.080±2.375
Image          134.680    3.014±0.877      11.875     4.391±0.631

8. Conclusion

In this paper, multilevel circulant matrices were first introduced into the model selection problem. A new approximate model selection approach for LSSVM was proposed, which fully exploits the theoretical and computational virtues of multilevel circulant matrices. We designed an efficient algorithm for solving LSSVM and bounded the effect of kernel matrix approximation on the decision function of LSSVM. We demonstrated the consistency of approximate model selection. With consistency as a theoretical support, we presented an approximate model selection scheme and analyzed its complexity as compared with other classic model selection approaches. This complexity analysis shows the promise of the application of
approximate model selection to large scale problems. We finally verified the effectiveness of our approach on several benchmark datasets. Applying our theoretical results and approach to real large scale problems will be one of our major concerns in future work. Besides, a new efficient model selection criterion directly dependent on kernel matrix approximation will be proposed in the near future.
Acknowledgments

The authors would like to thank the anonymous reviewers for their detailed comments and suggestions that helped to improve the quality of the paper. The work was supported in part by the National Natural Science Foundation of China under grant No. 61170019 and the Natural Science Foundation of Tianjin under grant No. 11JCYBJC00700.
References

G.C. Cawley and N.L.C. Talbot. Fast exact leave-one-out cross-validation of sparse least-squares support vector machines. Neural Networks, 17:1467–1475, 2004.

G.C. Cawley and N.L.C. Talbot. Preventing over-fitting during model selection via Bayesian regularisation of the hyper-parameters. Journal of Machine Learning Research, 8:841–861, 2007.

G.C. Cawley and N.L.C. Talbot. On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research, 11:2079–2107, 2010.

O. Chapelle and V. Vapnik. Model selection for support vector machines. In Advances in Neural Information Processing Systems 12, pages 230–236. MIT Press, Cambridge, MA, 2000.

O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1):131–159, 2002.

H. Chen, P. Tino, and X. Yao. Probabilistic classification vector machines. IEEE Transactions on Neural Networks, 20(6):901–914, 2009.

C. Cortes, M. Mohri, and A. Talwalkar. On the impact of kernel approximation on learning accuracy. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), pages 113–120, Sardinia, Italy, 2010.

K. Crammer, J. Keshet, and Y. Singer. Kernel design using boosting. In Advances in Neural Information Processing Systems 15, pages 553–560. MIT Press, Cambridge, MA, 2003.

P.J. Davis. Circulant Matrices. John Wiley and Sons, New York, NY, 1979.

J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.

K. Duan, S.S. Keerthi, and A.N. Poo. Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing, 51:41–59, 2003.

I. Guyon, A. Saffari, G. Dror, and G. Cawley. Model selection: Beyond the Bayesian/frequentist divide. Journal of Machine Learning Research, 11:61–87, 2010.

A. Luntz and V. Brailovsky. On estimation of characters obtained in statistical procedure of recognition (in Russian). Technicheskaya Kibernetica, 3, 1969.

C.A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099–1125, 2006.

G. Rätsch, T. Onoda, and K.R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001.

G. Song. Approximation of kernel matrices in machine learning. PhD thesis, Syracuse University, 2010.

G. Song and Y. Xu. Approximation of kernel matrices by circulant matrices and its application in kernel selection methods. Frontiers of Mathematics in China, 5(1):123–160, 2010a.

G. Song and Y. Xu. Approximation of high-dimensional kernel matrices by multilevel circulant matrices. Journal of Complexity, 26(4):375–405, 2010b.

J.A.K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.

E.E. Tyrtyshnikov. A unifying approach to some old and new theorems on distribution and clustering. Linear Algebra and its Applications, 232:1–43, 1996.

T. Van Gestel, J.A.K. Suykens, B. Baesens, S. Viaene, J. Vanthienen, G. Dedene, B. De Moor, and J. Vandewalle. Benchmarking least squares support vector machine classifiers. Machine Learning, 54(1):5–32, 2004.

V. Vapnik and O. Chapelle. Bounds on error expectation for support vector machines. Neural Computation, 12(9):2013–2036, 2000.

V.N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, NY, 1998.