Parallel support vector machine training with nonlinear kernels

Kristian Woodsend and Jacek Gondzio

School of Mathematics, University of Edinburgh, The King's Buildings, Edinburgh, EH9 3JZ, UK

November 30, 2007
Abstract

A parallel implementation of Support Vector Machine training for problems involving nonlinear kernels has been developed. The kernel matrix is approximated by a partial Cholesky decomposition. Theoretical issues associated with constructing the best possible (and yet computationally efficient) approximation are discussed in detail, and the approach is implemented in OOPS. The structure of the augmented system matrix is exploited to partition data and computations amongst parallel processors efficiently. The new implementation has been applied to solve problems which involve very large data sets. Excellent parallel efficiency was observed on such problems.
1 Introduction
Support Vector Machines (SVMs) are powerful machine learning techniques for classification and regression, and they offer state-of-the-art performance. The training of an SVM is computationally expensive and relies on optimization. The core of the approach is a dense convex quadratic optimization problem (QP), which for a general-purpose QP solver scales cubically with the number of data points ($O(n^3)$). This complexity result makes applying SVMs to large-scale data sets challenging, and in practice the optimization problem is intractable for general-purpose optimization solvers.
The standard approach to handle this problem is to build a solution by solving a sequence of small-scale problems, e.g. Decomposition (Osuna et al., 1997) or Sequential Minimal Optimization (Platt, 1999). State-of-the-art software such as SVMlight (Joachims, 1999) and SVMTorch (Collobert and Bengio, 2001) use these techniques. These are basically active-set techniques, which work well when the separation into active and non-active variables is clear, in other words when the data is separable by a hyperplane. Empirical computation time measurements on SVMTorch show that training time grows much closer to $O(n^2)$ than the $O(n^3)$ of a general-purpose QP solver, but with noisy data the set of support vectors is not so clear, and the performance of these algorithms deteriorates (Woodsend and Gondzio, 2007).

The standard active-set technique is essentially sequential, choosing a small subset of variables to form the active set at each iteration, based upon the results of the previous selection. Improving the computation time through parallelization of the algorithm is difficult due to dependencies between subsequent iterations, and it is not clear how to implement this efficiently.

Parallelization schemes proposed so far have involved splitting the training data to give smaller, separable optimization sub-problems which can be distributed amongst the processors. Dong et al. (2003) used a block-diagonal approximation of the kernel matrix to derive independent optimization problems. The resulting SVMs were used to filter out samples that were unlikely to be support vectors. An SVM was then trained on the remaining samples, using the standard serial algorithm. Collobert et al. (2002) proposed a mixture of multiple SVMs where single SVMs are trained on subsets of the training set and a neural network is used to assign samples to different subsets.

Another approach is to use a variation of the standard SVM algorithm that is better suited to a parallel architecture. Tveit and Engum (2003) developed an exact parallel implementation of the Proximal SVM, which modifies the standard SVM formulation to remove the single constraint in the dual and give an unconstrained QP. It is not clear how applicable this formulation is to real-world data sets.

There have only been a few parallel methods in the literature which train a standard SVM on the whole of the data set. We briefly survey the methods of Zanghirati and Zanni (2003), Graf et al. (2005) and Chang et al. (2007).

In an approach similar to SVMlight, the algorithm of Zanghirati and Zanni (2003) decomposes the SVM training problem into a sequence of smaller, though still dense, QP sub-problems.
Zanghirati and Zanni implement the inner solver using a technique called the variable projection method, which is able to work efficiently on relatively large dense inner problems and is suitable for implementation in parallel. The performance of the inner QP solver was improved in Zanni et al. (2006). Scalar performance was competitive with SVMlight, and parallel efficiency was reasonable.

In the cascade algorithm introduced by Graf et al. (2005), the SVMs are layered. The support vectors given by the SVMs of one layer are combined to form the training sets of the next layer. The support vectors of the final layer are re-inserted into the training sets of the first layer at the next iteration, until the global KKT conditions are met. The authors show that this feedback loop corresponds to standard SVM training.

Another family of approaches to QP optimization is based on Interior Point Method (IPM) technology, which works by delaying the split between active and inactive variables for as long as possible. IPMs generally work well on large-scale problems. A straightforward implementation of the standard SVM dual formulation would have complexity $O(n^3)$, and would be unusable for anything but the smallest problems. Chang et al. (2007) use IPM technology for the optimizer, and avoid the problem of inverting the dense Hessian matrix by generating a low-rank approximation of the kernel matrix using partial Cholesky decomposition with pivoting. The dense Hessian matrix can then be inverted implicitly and efficiently using the low-rank approximation and the Sherman-Morrison-Woodbury (SMW) formula. Moreover, a large part of the calculations at each iteration can be distributed amongst the processors effectively. The SMW formula has been widely used in interior point methods; however, it sometimes runs into numerical difficulties. Goldfarb and Scheinberg (2005) constructed data sets where an SMW-based algorithm required many more iterations to terminate, and in some cases stalled before achieving an accurate solution. They also showed that this situation arises in real-world data sets.

This paper has two essential contributions:

• For non-linear SVMs, the data is preprocessed to give a partial Cholesky decomposition. We propose to use the approximation $LL^T + D$ rather than the standard $LL^T$, allowing outliers to be approximated more efficiently.

• We propose a parallel SVM algorithm that trains the SVM using the full data set, using an interior point method to give efficient optimization, and Cholesky decomposition to give good numerical stability.
For the partial Cholesky decomposition of non-linear kernel matrices, error bounds resulting from the approximation are analyzed. From our observation that choosing outliers to form columns of $L$ does not reduce the error bounds substantially, we construct a greedy heuristic for choosing pivots. We provide evidence that approximations using explicit diagonal terms ($LL^T + D$) require significantly fewer vectors to be included in the matrix $L$ than approximations which do not add diagonal terms ($LL^T$). Since the complexity of IPM-based SVM training depends on the square of the number of columns included in $L$ (but does not depend on the number of non-zero elements included in the $D$ part), we observe spectacular improvements resulting from the use of $LL^T + D$ approximations. In addition, we describe an algorithm for performing this decomposition without requiring explicit storage of the kernel matrix.

Our parallel SVM algorithm distributes the data evenly amongst the processors. We use an interior point method to give efficient optimization of the QP that is at the core of SVM training, but unlike previous approaches we use Cholesky decomposition to give good numerical stability and efficient memory caching. Our approach directly tackles the most computationally expensive part of the optimization, namely the inversion of the dense Hessian matrix, by providing an efficient implicit inverse representation of an approximation to the Hessian. By exploiting the structure of the problem, we show how this can be parallelized with excellent parallel efficiency. The resulting implementation is significantly faster at SVM training than can be achieved using state-of-the-art active-set methods implemented serially.

The structure of the rest of this paper is as follows. Section 2 provides a short description of support vector machines. Section 3 gives an outline of the interior point method for optimizing quadratic programs. In Section 4 we describe our approach to training linear and non-linear SVMs efficiently. Our algorithm for choosing pivots for the partial Cholesky decomposition is described in Section 5. The parallel algorithm that exploits the structure of the QP is given in Section 6, along with numerical performance results.

We now briefly describe the notation used in this paper. $x_i$ is the attribute vector for the $i$th data point, and it consists of the observation values directly. There are $n$ observations in the training set, and $k$ attributes in each vector $x_i$. $X$ is the $n \times k$ matrix whose rows are the attribute row vectors $x_i^T$ associated with each point. The classification label for each data point is denoted by $y_i \in \{-1, 1\}$. The variables $w \in \mathbb{R}^k$ and $z \in \mathbb{R}^n$ are used for the primal variables ("weights") and dual variables ($\alpha$ in the SVM literature) respectively, and $w_0 \in \mathbb{R}$ for the bias of the hyperplane. Scalars and column vectors are denoted using lower-case letters, while upper-case letters denote matrices. $D$, $S$, $U$, $V$, $Y$ and $Z$ are the diagonal matrices of the corresponding lower-case vectors.
2 Support vector machines
In this section we briefly outline the formulations of Support Vector Machines used for classification problems. In particular, we draw attention to the relationship between the primal weight variables $w \in \mathbb{R}^k$ and the dual variables $z \in \mathbb{R}^n$, as we will use it later to develop new formulations for each of the problems described here.

A Support Vector Machine (SVM) is a classification learning machine that learns a mapping between the features and the target label of a set of data points known as the training set, and then uses a hyperplane $w^T x + w_0 = 0$ to separate the data set and predict the class of further data points. The labels are the binary values "yes" or "no", which we represent using the values $+1$ and $-1$. The objective is based on the Structural Risk Minimization (SRM) principle, which aims to minimize the risk functional in terms of both the empirical risk (the quality of the approximation to the given data, obtained by minimizing the misclassification error) and the confidence interval (the complexity of the approximating function, controlled by maximizing the separation margin) (Vapnik, 1998, 1999). A fuller description is also given in Cristianini and Shawe-Taylor (2000).

For a linear kernel, the attributes in the vector $x_i$ for the $i$th data point are the observation values directly, while for a non-linear kernel the observation values are transformed by means of a (possibly infinite-dimensional) non-linear mapping $\Phi$. Training an SVM has at its core a convex quadratic optimization problem. For a linear SVM classifier, using a 2-norm for the hyperplane weights $w$ and a 1-norm for the misclassification errors $\xi \in \mathbb{R}^n$, this takes the following form:
\[
\begin{aligned}
\min_{w, w_0, \xi} \quad & \tfrac{1}{2} w^T w + \tau e^T \xi \\
\text{s.t.} \quad & Y(Xw + w_0 e) \geq e - \xi \\
& \xi \geq 0
\end{aligned}
\tag{1}
\]

where $e$ is the vector of all ones, and $\tau$ is a positive constant that parameterizes the problem.

Due to the convex nature of the problem, a Lagrangian function associated with (1) can be formulated,
\[
L(w, w_0, \xi, z, \mu) = \tfrac{1}{2} w^T w + \tau e^T \xi - \sum_{i=1}^{n} z_i \left[ y_i (w^T x_i + w_0) - 1 + \xi_i \right] - \mu^T \xi,
\]
where $z \in \mathbb{R}^n$ is the vector of Lagrange multipliers associated with the inequality constraints, and $\mu \in \mathbb{R}^n$ is the vector of Lagrange multipliers associated with the non-negativity constraint on $\xi$. The solution to (1) will be at the saddle point of the Lagrangian. Partially differentiating the Lagrangian function gives relationships between the primal variables $w$, $w_0$ and $\xi$, and the dual variables $z$ at optimality:
\[
w = (YX)^T z, \qquad y^T z = 0, \qquad 0 \leq z \leq \tau e.
\tag{2}
\]
Substituting these relationships back into the Lagrangian function gives the dual problem formulation
\[
\begin{aligned}
\min_{z} \quad & \tfrac{1}{2} z^T Y X X^T Y z - e^T z \\
\text{s.t.} \quad & y^T z = 0 \\
& 0 \leq z \leq \tau e.
\end{aligned}
\tag{3}
\]

Non-linear kernels are a powerful extension to the Support Vector Machine technique. Attribute vectors $x$ are transformed into some feature space through a non-linear mapping $x \to \Phi(x)$. This allows SVMs to handle data sets that are not linearly separable, but which can be separated by a polynomial curve or by clustering. One of the main advantages of the dual formulation is that the mapping can be represented by a kernel matrix $K$, where each element is given by $K_{ij} = \Phi(x_i)^T \Phi(x_j)$, resulting in the QP
\[
\begin{aligned}
\min_{z} \quad & \tfrac{1}{2} z^T Y K Y z - e^T z \\
\text{s.t.} \quad & y^T z = 0 \\
& 0 \leq z \leq \tau e.
\end{aligned}
\tag{4}
\]
Kernel functions allow the matrix $K$ to be calculated without knowing $\Phi(x)$ explicitly. The original attribute vectors appear only in terms of inner products. As an example, a commonly used kernel function is the radial basis function (RBF) kernel
\[
K_{ij} = e^{-\gamma \|x_i - x_j\|^2}.
\]
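For concreteness, the RBF kernel can be evaluated with a few lines of NumPy. The sketch below is our own illustration (the function name and vectorised form are not from the paper); it builds the full $n \times n$ matrix, which is exactly what becomes impractical for large $n$.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Dense RBF kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2).

    X is the n x k attribute matrix; this forms the full n x n matrix,
    so it is only practical for small n."""
    sq_norms = np.sum(X**2, axis=1)                      # ||x_i||^2 for each row
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))    # clip tiny negatives from round-off
```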
3 Interior point methods
Interior point methods represent state-of-the-art techniques for solving linear, quadratic and non-linear optimization programmes. In this section the key issues of implementation for QPs are discussed very briefly; for more details, see Wright (1997). We are interested in solving the general convex quadratic problem
\[
\begin{aligned}
\min_{z} \quad & \tfrac{1}{2} z^T Q z - e^T z \\
\text{s.t.} \quad & Az = b \\
& 0 \leq z \leq u,
\end{aligned}
\tag{5}
\]
where $u$ is a vector of upper bounds, and the constraint matrix $A$ is assumed to have full row rank. Dual feasibility requires that $A^T \lambda + s - v - Qz = -e$, where $\lambda$ is the Lagrange multiplier associated with the linear constraint $Az = b$ and $s, v > 0$ are the Lagrange multipliers associated with the lower and upper bounds of $z$ respectively.

An interior point method progresses towards satisfying the KKT conditions over a series of iterations, by reducing primal and dual infeasibilities and controlling the complementarity products
\[
ZSe = \mu e, \qquad (U - Z)Ve = \mu e,
\]
where $\mu$ is a strictly positive parameter. At each iteration, the method makes a damped Newton step towards satisfying the primal feasibility, dual feasibility and complementarity product conditions for a given $\mu$. Then the algorithm decreases $\mu$ before making another iteration. The algorithm continues until both infeasibilities and the duality gap (which is proportional to $\mu$) fall below required tolerances.

The Newton system to be solved at each iteration can be transformed into the augmented system equations
\[
\begin{bmatrix} -(Q + \Theta^{-1}) & A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} \Delta z \\ \Delta \lambda \end{bmatrix}
=
\begin{bmatrix} r_c \\ r_b \end{bmatrix},
\tag{6}
\]
where $\Delta z$, $\Delta\lambda$ are components of the Newton direction in the primal and dual spaces respectively, $\Theta^{-1} \equiv Z^{-1} S + (U - Z)^{-1} V$, and $r_c$ and $r_b$ are appropriately defined residuals. If the block $(Q + \Theta^{-1})$ is diagonal, an efficient method to solve such a system is to form the Schur complement $C = A (Q + \Theta^{-1})^{-1} A^T$, solve the smaller system $C \Delta\lambda = \hat{r}_b$ for $\Delta\lambda$, and back-substitute into (6) to calculate $\Delta z$. Unfortunately, in the case of SVM training the kernel matrix, and therefore $Q$, is completely dense.
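As a minimal illustration of the Schur complement step just described, assuming the block $Q + \Theta^{-1}$ is diagonal and stored as a vector (the function and variable names below are ours, not from OOPS):

```python
import numpy as np

def solve_augmented(q_theta_diag, A, r_c, r_b):
    """Solve the augmented system (6) when Q + Theta^{-1} is diagonal.

    q_theta_diag: diagonal of (Q + Theta^{-1}) as a 1-D array (length n),
    A: m x n constraint matrix, r_c, r_b: residual vectors."""
    inv_d = 1.0 / q_theta_diag                   # cheap inverse of the diagonal block
    C = (A * inv_d) @ A.T                        # Schur complement A (Q+Theta^{-1})^{-1} A^T
    r_hat = r_b + A @ (inv_d * r_c)              # eliminate delta_z from the first block row
    d_lambda = np.linalg.solve(C, r_hat)         # small m x m solve
    d_z = inv_d * (A.T @ d_lambda - r_c)         # back-substitute for delta_z
    return d_z, d_lambda
```

Because the second block row of (6) has a zero diagonal block, eliminating $\Delta z$ first leaves a small positive definite system in $\Delta\lambda$; the SVM formulations of the next section are designed so that this small system has dimension $k+1$ or $r+1$ rather than $n$.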
4 Efficient training of SVMs using IPM
In this section we describe a formulation for training an SVM that allows the efficient use of IPM technology to solve the QP. We use Cholesky decomposition to give good numerical stability, applied to all features at once, resulting in a more efficient implementation in terms of memory caching. For linear SVMs, the feature matrix is used directly. Our approach can be applied to the case of non-linear SVMs once the data has been preprocessed using partial Cholesky decomposition with pivoting.
4.1 Linear SVMs
Using the form $Q = (YX)(YX)^T$ enabled by the linear kernel, we can rewrite the quadratic objective in terms of $w$, and ensure that the relationship (2) between $w$ and $z$ holds at optimality by introducing it into the constraints. Consequently, we can state the classification problem (3) as the following separable QP:
\[
\begin{aligned}
\min_{w, z} \quad & \tfrac{1}{2} w^T w - e^T z \\
\text{s.t.} \quad & w - (YX)^T z = 0 \\
& y^T z = 0 \\
& 0 \leq z \leq \tau e.
\end{aligned}
\tag{7}
\]
The quadratic matrix in the objective is no longer dense, but is simplified to the diagonal matrix
\[
Q = \begin{bmatrix} I_k & 0 \\ 0 & 0_n \end{bmatrix} \in \mathbb{R}^{(n+k) \times (n+k)},
\]
while the constraint matrix takes the form
\[
A = \begin{bmatrix} I_k & -(YX)^T \\ 0 & y^T \end{bmatrix} \in \mathbb{R}^{(k+1) \times (k+n)}.
\]
Determining the Newton step requires calculating the matrix product
\[
C \equiv A (Q + \Theta^{-1})^{-1} A^T
= \begin{bmatrix}
(I_k + \Theta_w^{-1})^{-1} + X^T Y \Theta_z Y X & -X^T Y \Theta_z y \\
-y^T \Theta_z Y X & y^T \Theta_z y
\end{bmatrix}
\in \mathbb{R}^{(k+1) \times (k+1)}.
\tag{8}
\]
We need to solve $A (Q + \Theta^{-1})^{-1} A^T \Delta\lambda = r$ for $\Delta\lambda$. Building the matrix (8) is the most expensive operation, of order $O(n(k+1)^2)$, while inverting the resulting matrix is of order $O((k+1)^3)$.
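The following sketch (hypothetical names, not the paper's code) forms the matrix $C$ of (8) explicitly, which makes the $O(n(k+1)^2)$ cost visible: the dominant work is the product $X^T Y \Theta_z Y X$.

```python
import numpy as np

def build_schur_linear(X, y, theta_z, theta_w_inv):
    """Form C = A (Q + Theta^{-1})^{-1} A^T of equation (8) for the linear SVM QP (7).

    X: n x k attribute matrix, y: labels in {-1, +1},
    theta_z: length-n positive diagonal Theta_z for the z variables,
    theta_w_inv: length-k diagonal Theta_w^{-1} for the w variables."""
    n, k = X.shape
    YX = y[:, None] * X                          # rows y_i * x_i^T
    C = np.empty((k + 1, k + 1))
    C[:k, :k] = np.diag(1.0 / (1.0 + theta_w_inv)) + YX.T @ (theta_z[:, None] * YX)
    C[:k, k] = -YX.T @ (theta_z * y)
    C[k, :k] = C[:k, k]                          # C is symmetric
    C[k, k] = y @ (theta_z * y)
    return C
```

Solving $C\,\Delta\lambda = r$ with a dense factorization then costs $O((k+1)^3)$, independently of $n$.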
4.2 Non-linear SVMs
The formulation in the above section can be extended to the non-linear SVM training problem (4). The matrix resulting from a non-linear kernel is normally dense, which results in a dense QP. It has been noted by several researchers (see Fine and Scheinberg, 2002) that it is possible to make a good low-rank approximation of the kernel matrix. Fine and Scheinberg (2002) use the approximation $K \approx LL^T$ based on partial Cholesky decomposition with pivoting. In Woodsend and Gondzio (2007) we showed that the approximation $K \approx LL^T + \mathrm{diag}(d)$ can be determined at no extra computational expense. By applying a similar reformulation we can derive the QP:
\[
\begin{aligned}
\min_{w, z} \quad & \tfrac{1}{2} (w^T w + z^T D z) - e^T z \\
\text{s.t.} \quad & w - (YL)^T z = 0 \\
& y^T z = 0 \\
& 0 \leq z \leq \tau e.
\end{aligned}
\tag{9}
\]
Note that we can still exploit the separability of the objective, as the Hessian is again diagonal, so the computational complexity using the $LL^T + D$ approximation with $L$ of rank $r$ is $O(n(r+1)^2 + nkr + (r+1)^3)$.
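Under our own naming assumptions, the sketch below assembles the diagonal Hessian and the constraint matrix of (9); the Newton system can then be solved with a Schur complement of size $(r+1) \times (r+1)$, exactly as described in Section 3.

```python
import numpy as np

def separable_qp_blocks(L, d, y):
    """Assemble the diagonal Hessian and constraint matrix of QP (9).

    L: n x r partial Cholesky factor, d: length-n residual diagonal,
    y: labels in {-1, +1}.  Variables are ordered (w, z)."""
    n, r = L.shape
    q_diag = np.concatenate([np.ones(r), d])     # Hessian diagonal: I_r for w, D for z
    A = np.zeros((r + 1, r + n))
    A[:r, :r] = np.eye(r)                        # w block
    A[:r, r:] = -(y[:, None] * L).T              # -(YL)^T acting on z
    A[r, r:] = y                                 # the y^T z = 0 row
    return q_diag, A
```

In practice the barrier terms $\Theta^{-1}$ are added to this diagonal at each IPM iteration before the Schur complement is formed.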
5 Choosing pivots for partial Cholesky decomposition of the kernel matrix
To form the approximation $K \approx LL^T$ using partial Cholesky decomposition, Fine and Scheinberg (2002) chose pivots by applying the greedy heuristic algorithm of selecting the largest diagonal element each time, and this approach has been used subsequently (Chang et al., 2007; Woodsend and Gondzio, 2007). In this section we show that this algorithm is not the best for approximating clustered data using a low-rank matrix, particularly if the residual diagonal forms part of the approximation.
5.1 Approximating clusters and outliers
Consider a data set that consists of two clusters $A$ and $B$, both with a large number ($n_A$ and $n_B$ respectively, with $n_A \approx n_B$) of very closely situated points, and $n_C$ outlying points $C$ which are far from either cluster (see Figure 1).

Figure 1: Data set consisting of two clusters plus outliers.

If the radial basis kernel is used for this data set, it will generate a kernel matrix with a large block of elements equal to or very close to 1 corresponding to cluster $A$, a similar block corresponding to cluster $B$, and a diagonal block very close to the identity matrix corresponding to the points $C$. Using the notation $e_{n_A} = (1, 1, \ldots, 1)^T \in \mathbb{R}^{n_A}$ and $e_{n_B} = (1, 1, \ldots, 1)^T \in \mathbb{R}^{n_B}$, the kernel matrix can be closely approximated by
\[
K^{(0)} = \begin{bmatrix}
e_{n_A} e_{n_A}^T & \alpha\, e_{n_A} e_{n_B}^T & \\
\alpha\, e_{n_B} e_{n_A}^T & e_{n_B} e_{n_B}^T & \\
& & I_{n_C}
\end{bmatrix}
\begin{matrix} (A) \\ (B) \\ (C) \end{matrix}
\]
if $A$ and $B$ are reasonably close, while the points of $C$ are far from both the $A$ and $B$ clusters. In other words, the large blocks corresponding to clusters $A$ and $B$ can be very well approximated by rank-one matrices. Note that all the pivots have the same value at this stage.

Next we show by example that such a kernel matrix, resulting from natural clusters in the data set, cannot be well approximated by a partial Cholesky decomposition $LL^T$ with pivoting based on the largest diagonal element, unless the rank of $L$ is at least $n_C + 2$. Suppose a small rank, say $r = 2$, is used. Proposition 5.1 gives the error in the approximation of $K^{(0)}$ using such an approximation, when the rank is $r = 2$. Alternatively, Proposition 5.2 shows that an approximation using $LL^T + D$, with $L$ built of merely two columns whose pivots are chosen on some measure based on clustering,
has minimal error. Let the notation $O(\epsilon)$ indicate a number very close to zero, and blocks of the form $O(\epsilon)_{n_A \times n_A}$ indicate sub-matrices whose diagonal elements are zero and whose off-diagonal elements are very close to zero.

Proposition 5.1. If a low-rank approximation $LL^T$ of matrix $K^{(0)}$ is constructed, with the rank $r = 2$ and at each iteration the largest pivot chosen, the lowest Frobenius norm of the residual matrix is
\[
\|K - LL^T\|_F^2 \approx (1-\alpha^2)^2 n_B^2 + n_C - 1.
\]

Proof. If we assume that the first pivot, chosen at random among the equal diagonal entries, is taken from cluster $A$ (if $n_A \approx n_B$ then clusters $A$ and $B$ are interchangeable), the first approximation of the kernel matrix is
\[
K^{(0)} = \begin{bmatrix} e_{n_A} \\ \alpha\, e_{n_B} \\ 0_{n_C} \end{bmatrix}
\begin{bmatrix} e_{n_A}^T & \alpha\, e_{n_B}^T & 0_{n_C}^T \end{bmatrix} + K^{(1)},
\quad\text{where}\quad
K^{(1)} = \begin{bmatrix}
O(\epsilon)_{n_A \times n_A} & O(\epsilon)_{n_A \times n_B} & \\
O(\epsilon)_{n_B \times n_A} & (1-\alpha^2)\, e_{n_B} e_{n_B}^T & \\
& & I_{n_C}
\end{bmatrix}.
\]
At the next iteration, the largest pivot will be found among the diagonal elements corresponding to points in $C$, giving
\[
K^{(0)} = \begin{bmatrix} e_{n_A} \\ \alpha\, e_{n_B} \\ 0_{n_C} \end{bmatrix}
\begin{bmatrix} e_{n_A}^T & \alpha\, e_{n_B}^T & 0_{n_C}^T \end{bmatrix}
+ \begin{bmatrix} 0_{n_A+n_B} \\ 1 \\ 0_{n_C-1} \end{bmatrix}
\begin{bmatrix} 0_{n_A+n_B}^T & 1 & 0_{n_C-1}^T \end{bmatrix}
+ K^{(2)},
\]
where $K^{(2)}$ equals $K^{(1)}$ except that the diagonal entry corresponding to the chosen point of $C$ is now zero, so that its lower-right block is $\mathrm{diag}(0, I_{n_C-1})$. The Frobenius norm of this residual matrix is $\|K - LL^T\|_F^2 \approx (1-\alpha^2)^2 n_B^2 + n_C - 1$. By a similar argument we can prove that unless all pivots corresponding to points in $C$ are eliminated, no pivot from cluster $B$ can be chosen. Hence we need a rank of at least $n_C + 2$ to reduce the error of the approximation to an $O(\epsilon)$ level.
Proposition 5.2. If an approximation $LL^T + D$ of matrix $K^{(0)}$ is constructed, where a point in cluster $A$ and a point in cluster $B$ are chosen to form the two columns of $L$, and $D$ is the diagonal of the residual matrix, then the Frobenius norm of the residual matrix is
\[
\|K - (LL^T + D)\|_F = \|O(\epsilon)_{(n_A+n_B) \times (n_A+n_B)}\|_F .
\]

Proof. If, instead of choosing the pivots based on the largest elements of the diagonal, we choose a pivot from cluster $A$ and a pivot from cluster $B$, then we get
\[
K^{(0)} = \begin{bmatrix} e_{n_A} \\ \alpha\, e_{n_B} \\ 0_{n_C} \end{bmatrix}
\begin{bmatrix} e_{n_A}^T & \alpha\, e_{n_B}^T & 0_{n_C}^T \end{bmatrix}
+ \begin{bmatrix} 0_{n_A} \\ \sqrt{1-\alpha^2}\, e_{n_B} \\ 0_{n_C} \end{bmatrix}
\begin{bmatrix} 0_{n_A}^T & \sqrt{1-\alpha^2}\, e_{n_B}^T & 0_{n_C}^T \end{bmatrix}
+ K^{(2)},
\]
where
\[
K^{(2)} = \begin{bmatrix} O(\epsilon)_{(n_A+n_B) \times (n_A+n_B)} & \\ & I_{n_C} \end{bmatrix},
\]
with a Frobenius norm $\|K - LL^T\|_F^2 \approx n_C$. If cluster $B$ is large and $C$ small, then this is obviously preferable. Additionally, if we include a residual diagonal in our approximation ($LL^T + D$), the norm of the residual error matrix can be reduced further, to $\|K - (LL^T + D)\|_F = \|O(\epsilon)_{(n_A+n_B) \times (n_A+n_B)}\|_F$.

We can see from this example that the algorithm that chooses the largest pivot will have to choose all the outliers $C$ before it will select any point from cluster $B$. The approximation will need to be of higher rank than the number of outliers. We can remedy this by using the form $K \approx LL^T + D$ and packing all elements corresponding to the outliers $C$ into the matrix $D$, without increasing the rank of $L$.
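The two propositions are easy to check numerically. The sketch below is our own construction (cluster sizes and $\alpha$ are arbitrary choices): it builds the idealised matrix $K^{(0)}$, eliminates two pivots under each strategy, and prints the squared Frobenius norms of the residuals next to the predicted values.

```python
import numpy as np

# Numerical check of Propositions 5.1 and 5.2 on the idealised kernel K0.
nA, nB, nC, alpha = 40, 40, 10, 0.5
eA, eB = np.ones(nA), np.ones(nB)
K0 = np.zeros((nA + nB + nC, nA + nB + nC))
K0[:nA, :nA] = np.outer(eA, eA)
K0[:nA, nA:nA+nB] = alpha * np.outer(eA, eB)
K0[nA:nA+nB, :nA] = alpha * np.outer(eB, eA)
K0[nA:nA+nB, nA:nA+nB] = np.outer(eB, eB)
K0[nA+nB:, nA+nB:] = np.eye(nC)

def residual(K, pivots):
    """Partial Cholesky on the listed pivots; returns the residual K - L L^T."""
    R = K.copy()
    for p in pivots:
        col = R[:, p] / np.sqrt(R[p, p])
        R -= np.outer(col, col)
    return R

R_diag = residual(K0, [0, nA + nB])   # largest-diagonal rule: cluster A, then an outlier
R_cost = residual(K0, [0, nA])        # clustering-based rule: cluster A, then cluster B
print(np.sum(R_diag**2), (1 - alpha**2)**2 * nB**2 + nC - 1)   # matches Proposition 5.1
print(np.sum(R_cost**2), nC)                                   # matches Proposition 5.2
```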
5.2 Bounding the error in the approximation
If we approximate the original kernel matrix using the matrix $LL^T + D$, and use it to form the Hessian of the SVM training QP (9), how close is the perturbed problem to the original one? One method of judging is to measure the error in the objective value. To do this, we consider the original non-linear SVM dual formulation (4), with the kernel matrix forming part of the Hessian, and the approximate QP in which the new (approximate) Hessian $\tilde{K}$ is used. Let $f(z) = \tfrac{1}{2} z^T Y K Y z - e^T z$ and $\tilde{f}(z) = \tfrac{1}{2} z^T Y \tilde{K} Y z - e^T z$ denote the objective functions in the original problem (4) and in its approximation respectively. Bounds for the error in the approximated objective function are given in Theorem 5.3.

Theorem 5.3. Let $z^*$ be the optimal solution of (4) and $\tilde{z}^*$ be the optimal solution of the approximate QP. Let $K = \tilde{K} + \Delta K$. Then the error in the objective value $|f(z^*) - \tilde{f}(\tilde{z}^*)|$ is bounded by
\[
\tfrac{1}{2} z^{*T} Y \Delta K Y z^* \;\leq\; f(z^*) - \tilde{f}(\tilde{z}^*) \;\leq\; \tfrac{1}{2} \tilde{z}^{*T} Y \Delta K Y \tilde{z}^*.
\]

Proof. First we observe that the two QP problems differ only in their objective functions; the constraints are identical. Hence any feasible solution of (4) is also a feasible solution of the approximate problem. From the definitions of $f(z)$ and $\tilde{f}(z)$, for any point $z$,
\[
f(z) - \tilde{f}(z) = \tfrac{1}{2} z^T Y K Y z - \tfrac{1}{2} z^T Y \tilde{K} Y z = \tfrac{1}{2} z^T Y \Delta K Y z.
\tag{10}
\]
For the upper bound, consider the point $z^*$. It is the minimizer of $f(\cdot)$, therefore $f(z^*) \leq f(\tilde{z}^*)$. Applying (10) at the point $\tilde{z}^*$ and substituting into this inequality gives
\[
f(z^*) - \tilde{f}(\tilde{z}^*) \leq \tfrac{1}{2} \tilde{z}^{*T} Y \Delta K Y \tilde{z}^*.
\]
Similarly, the point $\tilde{z}^*$ minimizes $\tilde{f}(\cdot)$, so $\tilde{f}(\tilde{z}^*) \leq \tilde{f}(z^*)$. Substituting this relationship for $\tilde{f}(z^*)$ into (10) at the point $z^*$ gives
\[
f(z^*) - \tilde{f}(\tilde{z}^*) \geq \tfrac{1}{2} z^{*T} Y \Delta K Y z^*.
\]

A good approximation will keep these bounds as tight as possible. Below we attempt to quantify this. Let us define the matrix norm $\|\Delta K\|_e := \sum_i \sum_j |\Delta K_{ij}|$, the sum of the absolute values of all elements of $\Delta K$.

Corollary 5.4. If the matrix $K$ in QP (4) is approximated by the matrix $\tilde{K}$, where $K = \tilde{K} + \Delta K$, the error in the objective due to the approximation is bounded by
\[
|f(z^*) - \tilde{f}(\tilde{z}^*)| \leq \tfrac{1}{2} \tau^2 \|\Delta K\|_e.
\]

Proof. In (4), $z$ is bounded by $0 \leq z \leq \tau e$, and only the support vectors themselves will have non-zero values, but obviously we can only know the values of $z^*$ after the approximation is made. Following Theorem 5.3, a worst case for the bounds of the error $|f(z^*) - \tilde{f}(\tilde{z}^*)|$ can be constructed by taking all elements of $z$ to their bounds:
\[
|f(z^*) - \tilde{f}(\tilde{z}^*)|
\;\leq\; \tfrac{1}{2} \sum_i \sum_j \max(\tau^2 y_i y_j \Delta K_{ij},\, 0)
\;\leq\; \tfrac{1}{2} \tau^2 \sum_i \sum_j |y_i y_j \Delta K_{ij}|
\;=\; \tfrac{1}{2} \tau^2 \sum_i \sum_j |\Delta K_{ij}|,
\]
as $y_i, y_j \in \{-1, +1\}$. Using the definition of $\|\Delta K\|_e$, the maximum error in the objective is then
\[
|f(z^*) - \tilde{f}(\tilde{z}^*)| \leq \tfrac{1}{2} \tau^2 \|\Delta K\|_e,
\]
and clearly minimizing $\|\Delta K\|_e$ will reduce the error.
5.3 A cost measure for selecting pivots
In Section 5.1 we observed that the diagonal elements of $K$ corresponding to outlying data points will remain large during partial Cholesky decomposition, yet the other elements in those columns will be small. An algorithm that selects columns by choosing the largest diagonal element will construct columns of $L$ based on such points, yet they can be adequately represented by a diagonal matrix $D$. In that example, the decision whether a pivot should contribute to a diagonal term of $D$ or a column of $L$ was straightforward. In real-life examples this decision is less obvious. Moreover, we wish to keep the rank of $L$ as small as possible, hence we would like to encourage an early choice of the essential columns which indeed should contribute to $L$.

We start our discussion from an analysis of a simplified example in which two attractive pivot candidates are available. Let the kernel submatrix at an iteration be
\[
K = \begin{bmatrix}
u_1 & 0 & \bar{u}^T \\
0 & v_1 & \bar{v}^T \\
\bar{u} & \bar{v} & \bar{K}
\end{bmatrix}.
\tag{11}
\]
The columns containing $u$ and $v$ are the candidates for pivoting. For simplicity, we consider them after being permuted to the top of the matrix. The elements $u_1$ and $v_1$ are pivot elements on the diagonal, while $\bar{u}$ and $\bar{v}$ are vectors containing the rest of each column. $\bar{K}$ contains the rest of the matrix. To describe the effect on the residual matrix after a candidate is chosen, we make the following assumption.

Assumption 5.5. The kernel matrix $K \geq 0$, and for all pivot column candidates $K - \frac{1}{u_1} u u^T \geq 0$.

Assumption 5.5 corresponds to assuming that all elements of the Schur complement will be reduced in absolute value as the result of the pivoting. First let us observe that, by construction, kernel matrices are positive semidefinite. For some types of kernel it is possible to say more: for instance $K \geq 0$ for a polynomial kernel with an even power, and $K > 0$ for the commonly used RBF kernel. Unfortunately, for later iterations this is only an approximation: nonnegativity of the residual matrix cannot be guaranteed, even for an initial kernel matrix built with the radial basis kernel. However, we have noted empirically that the vast majority of elements of the Schur complement remain positive during at least the early iterations of the Cholesky decomposition. We exploit these observations for the purposes of developing a heuristic, by taking Assumption 5.5 to hold at least for the early iterations where this heuristic will be used.

Remark 5.6. If Assumption 5.5 holds, the residual matrix after choosing column $u$ will have the norm
\[
\|K - ll^T\|_e = v_1 + 2\|\bar{v}\|_1 + \|\bar{K}\|_e - \frac{1}{u_1}\|\bar{u}\|_1^2.
\]
Proof. If $u$ is chosen for the pivot, the corresponding column $l$ of the Cholesky decomposition will be
\[
l = \begin{bmatrix} l_1 \\ 0 \\ \bar{l} \end{bmatrix},
\quad\text{where } l_1 = \sqrt{u_1} \text{ and } \bar{l} = \tfrac{1}{l_1}\bar{u}.
\]
The residual matrix $K - ll^T$ resulting when column $u$ is chosen will be
\[
K - ll^T = \begin{bmatrix}
0 & 0 & 0 \\
0 & v_1 & \bar{v}^T \\
0 & \bar{v} & \bar{K} - \bar{l}\bar{l}^T
\end{bmatrix}.
\]
The entry-wise norm defined before Corollary 5.4 will be
\[
\|K - ll^T\|_e = |v_1| + 2\|\bar{v}\|_1 + \|\bar{K} - \bar{l}\bar{l}^T\|_e.
\]
Under the (restrictive) Assumption 5.5,
\[
\|\bar{K} - \bar{l}\bar{l}^T\|_e = \|\bar{K}\|_e - \|\bar{l}\bar{l}^T\|_e.
\]
Furthermore, using the relationships $\bar{l} = \tfrac{1}{l_1}\bar{u}$ and $l_1 = \sqrt{u_1}$, the norm of the sub-matrix is
\[
\|\bar{K} - \bar{l}\bar{l}^T\|_e = \|\bar{K}\|_e - \frac{1}{u_1}\|\bar{u}\bar{u}^T\|_e.
\]
Using $\|\bar{u}\bar{u}^T\|_e = \sum_i \sum_j |\bar{u}_i||\bar{u}_j| = \sum_i |\bar{u}_i| \sum_j |\bar{u}_j| = \|\bar{u}\|_1^2$, we conclude that
\[
\|\bar{K} - \bar{l}\bar{l}^T\|_e = \|\bar{K}\|_e - \frac{1}{u_1}\|\bar{u}\|_1^2,
\]
and
\[
\|K - ll^T\|_e = v_1 + 2\|\bar{v}\|_1 + \|\bar{K}\|_e - \frac{1}{u_1}\|\bar{u}\|_1^2.
\]

There will be an analogous result if column $v$ is used. It is now possible to say which column should be chosen as the pivot: we will prefer column $u$ to $v$ if it gives a smaller error in the residual matrix when compared to the error resulting from choosing column $v$; in other words, if $\|K - (ll^T)^{(u)}\|_e < \|K - (ll^T)^{(v)}\|_e$.

Remark 5.7. If Assumption 5.5 holds, an approximation $LL^T \approx K$ with minimum residual is constructed by choosing column $u$ as the pivot in preference to column $v$ if
\[
\frac{1}{v_1}\|v\|_1^2 < \frac{1}{u_1}\|u\|_1^2.
\]

Proof. Using Remark 5.6 for $\|K - ll^T\|_e$, we get
\[
\begin{aligned}
\|K - (ll^T)^{(u)}\|_e &< \|K - (ll^T)^{(v)}\|_e \\
v_1 + 2\|\bar{v}\|_1 + \|\bar{K}\|_e - \tfrac{1}{u_1}\|\bar{u}\|_1^2 &< u_1 + 2\|\bar{u}\|_1 + \|\bar{K}\|_e - \tfrac{1}{v_1}\|\bar{v}\|_1^2 \\
v_1 + 2\|\bar{v}\|_1 + \tfrac{1}{v_1}\|\bar{v}\|_1^2 &< u_1 + 2\|\bar{u}\|_1 + \tfrac{1}{u_1}\|\bar{u}\|_1^2 \\
\tfrac{1}{v_1}\|v\|_1^2 &< \tfrac{1}{u_1}\|u\|_1^2.
\end{aligned}
\]
We can see from this inequality that whether column $u$ is the best choice of pivot depends in a non-linear way on both the pivot value $u_1$ and the norm of the column $\|u\|_1$. It is not always best to choose the largest pivot.

The measure changes if we use the residual diagonal to improve our approximation, as shown in Proposition 5.8. Let the matrix $K$ be approximated using $LL^T + \bar{D} \approx K$, where $\bar{D}$ is the diagonal of $\bar{K}$. The residual error matrix is therefore redefined as $K - (LL^T + \bar{D})$. We consider again matrix (11) and the pivot selection that minimizes the error in $\Delta K = K - (LL^T + \bar{D})$.

Proposition 5.8. The pivot corresponding to column $u$ is preferable to that of column $v$ if
\[
2\|\bar{v}\|_1 + \frac{1}{v_1}\|\bar{v}\|_1^2 < 2\|\bar{u}\|_1 + \frac{1}{u_1}\|\bar{u}\|_1^2.
\]

Proof. Using the redefined residual error matrix, when column $u$ is chosen as the pivot,
\[
\|\Delta_u\|_e = v_1 + 2\|\bar{v}\|_1 + \|\bar{K}\|_e - \frac{1}{u_1}\|\bar{u}\|_1^2 - v_1 - \|\bar{D}\|_e.
\]
Column $u$ is the better pivot if
\[
2\|\bar{v}\|_1 + \frac{1}{v_1}\|\bar{v}\|_1^2 < 2\|\bar{u}\|_1 + \frac{1}{u_1}\|\bar{u}\|_1^2.
\]
We can use the measure
\[
m_u = 2\|\bar{u}\|_1 + \frac{1}{u_1}\|\bar{u}\|_1^2
\tag{12}
\]
as the basis for selecting a pivot column. $m_u$ can be thought of as a cost measure for approximating the column with a diagonal term rather than providing an exact representation through a column in $L$.
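In code, evaluating the measure for a candidate column of the current residual is a one-liner; the helper below uses hypothetical names and is not taken from the paper's implementation.

```python
import numpy as np

def pivot_cost(residual_col, pivot_index):
    """Cost measure (12): m_u = 2*||u_bar||_1 + ||u_bar||_1^2 / u_1,
    where u_1 is the diagonal (pivot) entry of the residual column and
    u_bar contains the remaining entries."""
    u1 = residual_col[pivot_index]
    u_bar_norm = np.sum(np.abs(residual_col)) - np.abs(u1)   # ||u_bar||_1
    return 2.0 * u_bar_norm + u_bar_norm**2 / u1
```

A large $m_u$ indicates that relegating the column to the diagonal $D$ would leave a large amount of off-diagonal mass unexplained, so such columns should enter $L$.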
5.4 An algorithm for Cholesky decomposition using column norms
We use (12) as the basis of a greedy heuristic for selecting columns to form a Cholesky decomposition of the kernel matrix. At each iteration, we choose the column with the highest measure, while rejecting columns with very small pivots that would increase numerical errors during the decomposition. A simplified version is given as Algorithm 1. In this algorithm, $n$ is the number of samples, $r$ the maximum allowed number of columns of the Cholesky factor $L$, and $d$ is the diagonal of the residual matrix $K - LL^T$. The full set of columns $\mathcal{J}$ of the matrix $K$ is partitioned into four disjoint sets: $\mathcal{L}$ (columns allocated to $L$), $\mathcal{D}$ (allocated to $D$), $\mathcal{P}$ (columns postponed from consideration in the current iteration) and $\mathcal{U}$ (columns available for consideration, and not yet allocated to either $L$ or $D$), where $\mathcal{J} = \mathcal{L} \cup \mathcal{D} \cup \mathcal{P} \cup \mathcal{U}$.
Algorithm 1 A simplified serial algorithm to construct a Cholesky decomposition with partial pivoting, $\tilde{K} = LL^T + \mathrm{diag}(d)$, using the cost measure $m_u$ to minimize the residual $\bar{K} - \mathrm{diag}(d)$.

1: $\mathcal{J} := \{1 \ldots n\}$
2: $\mathcal{L} := \emptyset$, $\mathcal{D} := \emptyset$, $\mathcal{P} := \emptyset$, $\mathcal{U} := \mathcal{J}$
3: $d_j := K_{jj} \;\; \forall j \in \mathcal{J}$   // Initialise the diagonal
   // Calculate a maximum of r columns
4: for $i = 1:r$ do
5:   Choose $j^*$: $d_{j^*} = \max_{j \in \mathcal{U}} d_j$
6:   Compute column $j^*$ of the Schur complement: $S_{kj^*} := K_{kj^*} - \sum_{l=1}^{i-1} L_{kl} L_{j^*l} \;\; \forall k = i+1, \ldots, n$
7:   Compute the "cost" of approximating this column using just the diagonal, and not as a full column of $L$: $\bar{s}_{j^*} := \sum_{l=i+1}^{n} |S_{lj^*}|$, $\;\; m_{j^*} := \frac{1}{d_{j^*}} \bar{s}_{j^*}^2 + 2 \bar{s}_{j^*}$
8:   if $d_{j^*} < \epsilon_{small}$ or $m_{j^*} < \epsilon_D$ then
9:     $\mathcal{D} := \mathcal{D} \cup \{j^*\}$, $\mathcal{U} := \mathcal{U} \setminus \{j^*\}$
10:  else if $m_{j^*} > \epsilon_L$ then
11:    $\mathcal{L} := \mathcal{L} \cup \{j^*\}$, $\mathcal{U} := \mathcal{U} \setminus \{j^*\}$; form new column $i$ of $L$ using $L_{\cdot i} := S_{\cdot j^*} / \sqrt{d_{j^*}}$; update the diagonal: $d_j := d_j - (L_{ji})^2 \;\; \forall j \in \mathcal{U} \cup \mathcal{P}$
12:  else
13:    For a column where the allocation is not clear, temporarily remove it from consideration: $\mathcal{P} := \mathcal{P} \cup \{j^*\}$, $\mathcal{U} := \mathcal{U} \setminus \{j^*\}$
14:  end if
15:  Return columns $j$ to $\mathcal{U}$ that have been in $\mathcal{P}$ for the required number of iterations: $\mathcal{U} := \mathcal{U} \cup \{j\}$, $\mathcal{P} := \mathcal{P} \setminus \{j\}$
16:  if $\mathcal{U} = \emptyset$ then
17:    $r := i$
18:    return
19:  end if
20: end for
In addition, we use a number of algorithm parameters as the basis of decisions: $\epsilon_L$ is a threshold value for including a column in $L$, while $\epsilon_D$ is a similar threshold value for inclusion in $D$. We set $\epsilon_D \ll \epsilon_L$. If $\epsilon_D \leq m_u \leq \epsilon_L$ then we postpone the decision regarding column $u$. To aid numerical stability, columns are not chosen for pivoting if their diagonal element is below a minimum acceptable value $\epsilon_{small}$.

To aid readability, we present the algorithm as a serial procedure. Initially all columns are in $\mathcal{U}$. At each iteration, the column with the largest diagonal element is selected, and the corresponding column of the Schur complement is calculated. From this we calculate the cost measure (12). Using this information, a number of options are available: we allocate the column to $\mathcal{D}$ if either the cost measure $m_u$ is low or pivoting on this column would be numerically unstable; if the cost measure is high we allocate the column to $\mathcal{L}$; or we can postpone making a decision on this column by allocating it to $\mathcal{P}$. Columns remain in this set for a number of iterations, before being returned to $\mathcal{U}$. The algorithm terminates when $r$ columns of $L$ have been constructed, or there are no more columns in $\mathcal{U}$. In this presentation we assume that the columns are already permuted into the correct order; in the actual implementation, column permutations are handled implicitly. Kernel functions are expensive to evaluate, so individual elements of the matrix $K$ are calculated as required in steps 3 and 6. Computing column $j^*$ of the Schur complement can be performed in parallel, with each processor calculating the elements corresponding to samples held locally; this requires the broadcast of the attribute vector $x_{j^*}$ and the first $i$ rows of $L$.

In practice it is hard to define a suitable value for $\epsilon_L$, as it depends on the matrix $K$. We used a cache of best columns, rather than the single column $j^*$ shown in Algorithm 1, and chose the column with the highest cost measure to enter $L$. When a column was added to $L$, the Schur complements for the candidate columns held in the cache were updated with the newly added column $L_{\cdot i}$, rather than fully recalculated as shown in step 6 of Algorithm 1.
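A compact serial sketch of Algorithm 1 in Python is given below. It makes several simplifying assumptions of our own: the kernel matrix is passed in explicitly rather than evaluated on demand, the postponement set $\mathcal{P}$ and the threshold $\epsilon_L$ are omitted (any column not sent to $\mathcal{D}$ is accepted immediately), and the threshold values are arbitrary. It illustrates the selection logic, not the memory-efficient parallel implementation described above.

```python
import numpy as np

def partial_cholesky_with_diag(K, max_rank, eps_D=0.1, eps_small=1e-12):
    """Simplified sketch of Algorithm 1: build K ~ L L^T + diag(d)."""
    n = K.shape[0]
    d = np.diag(K).astype(float).copy()      # diagonal of the residual K - L L^T
    L = np.zeros((n, max_rank))
    avail = set(range(n))                    # the set U of unallocated columns
    i = 0
    while i < max_rank and avail:
        j = max(avail, key=lambda c: d[c])   # candidate with largest residual diagonal
        avail.remove(j)
        if d[j] < eps_small:
            continue                          # numerically unsafe pivot: leave it in d
        s = K[:, j] - L[:, :i] @ L[j, :i]     # column j of the Schur complement
        s_bar = np.sum(np.abs(s)) - np.abs(s[j])
        m = 2.0 * s_bar + s_bar ** 2 / d[j]   # cost measure (12)
        if m < eps_D:
            continue                          # cheap to represent by its diagonal: allocate to D
        L[:, i] = s / np.sqrt(d[j])           # accept: new column i of L
        d -= L[:, i] ** 2                     # update the residual diagonal
        i += 1
    return L[:, :i], d
```

On data like that of Figure 2, the outlier columns produce small values of $m$ and are absorbed into $d$, which is the behaviour reported in Figure 3.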
5.5 Numerical examples
To investigate how this greedy algorithm performed, we created a data set containing 4 clusters of 25 points each, arranged in an XOR pattern. 10 outliers were added. This is shown in Figure 2. The radial basis function was used to create the kernel matrix, with parameter $\gamma = (2 \sum_{i=1}^{k} \sigma_i^2)^{-1}$, where $\sigma_i$ is the standard deviation of attribute $i$ of the original data $X$.
Figure 2: Data set containing 4 clusters plus outliers. No points are misclassified within the clusters.
Figure 3: Rank of $L$ in the approximation $LL^T + D$ against the norm of the residual error matrix $\|K - (LL^T + D)\|_e$ achieved, using the data set shown in Figure 2. Choosing each pivot column based on the cost measure $m_u$ (12) is compared against the standard heuristic of choosing the largest diagonal element.
Figure 4: Performance of using a cache of 20 columns gives an approximation that is comparable to knowing the whole matrix.
Figure 5: Accuracy of SVM training based on the two approximation techniques in classifying an unseen test data set.
We compared the two greedy heuristics for constructing the partial Cholesky decomposition: choosing pivots based on the largest diagonal element, and choosing them based on the largest cost measure $m_u$ (12). The performance of the pivot selection methods is shown in Figure 3. The experiments show that the largest-diagonal algorithm does not make fast progress in reducing the error in the approximation until most of the outliers have been chosen. In contrast, using the measure $m_u$ gives the best approximation. For example, for an error of 1% of the original kernel matrix, 5 columns are required when choosing based on $m_u$, compared to 16 columns using the largest diagonal element. As the optimization problem associated with SVM training scales with the square of the number of columns, this represents an order of magnitude difference in the time required for the optimization. For $L$ with more than 20 columns there is little difference between the methods, and for many columns the algorithm using $m_u$ suffers a little from numerical errors.

As described above, it is computationally too expensive to assess all columns. To avoid this, we maintain a cache of Schur complements of attractive columns. Figure 4 shows that, in this particular example, the algorithm using a cache of 20 columns performs almost as well, in terms of the approximation matrices it generates, as an algorithm that knows the full Schur complement matrix.

We also assessed the accuracy of the resulting SVMs in classifying a previously unseen test set with a similar distribution of data points. The results (Figure 5) show that an error of less than approximately 1% of the original kernel matrix is enough to obtain a very high classification accuracy with this particular data set, and the algorithm choosing pivots based on the cost measure achieves this with far fewer columns in $L$ than the one using the largest diagonal elements.

We also applied the algorithm to the United States Postal Service (USPS) data set of hand-written digits (available from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), and compared the classification accuracy of the cost measure $m_u$ to that of using the largest diagonal elements as pivots. This data set comprises 7291 training and 2007 test patterns, represented as dense vectors of dimension 257 with entries between 0 and 255. The digits have been classified by hand and the labelling is highly accurate. The classification task set was to discriminate the digit 4 from the others. Results are shown in Figure 6. For up to 200 columns of $L$, the cost measure considerably reduces the number of columns required to achieve a certain level of accuracy. After this point, there is little difference in the approximations. It is also apparent from the figure that a very small number of columns (10-20) gives good results, but subsequently adding columns increases the accuracy only very slowly.
Figure 6: Accuracy of predicting the USPS test set, for approximations $K \approx LL^T + D$. Columns were chosen to enter $L$ either based on the largest diagonal element or on our cost measure.
6 Implementing the QP for parallel computation
To apply formulations (7) and (9) to truly large-scale data sets, it is necessary to employ linear algebra operations that exploit the block structure of the formulations (Gondzio and Grothey, 2004).
6.1 Linear algebra operations
We use the augmented system matrix from (6),
\[
\Phi = \begin{bmatrix} -Q - \Theta^{-1} & A^T \\ A & 0 \end{bmatrix},
\]
where $Q$, $\Theta$ and $A$ were described in Section 3.

For the formulations (7) and (9), this results in $\Phi$ having a symmetric bordered block-diagonal structure. We can break $\Phi$ into blocks:
\[
\Phi = \begin{bmatrix}
\Phi_1 & & & & A_1^T \\
& \Phi_2 & & & A_2^T \\
& & \ddots & & \vdots \\
& & & \Phi_p & A_p^T \\
A_1 & A_2 & \cdots & A_p & 0
\end{bmatrix},
\]
where $\Phi_i$ and $A_i$ result from partitioning the data set evenly across the $p$ processors. A block-based Cholesky decomposition of the matrix $\Phi = LDL^T$ results in the structure
\[
\Phi =
\begin{bmatrix}
L_1 & & & \\
& \ddots & & \\
& & L_p & \\
L_{A_1} & \cdots & L_{A_p} & L_C
\end{bmatrix}
\begin{bmatrix}
D_1 & & & \\
& \ddots & & \\
& & D_p & \\
& & & D_C
\end{bmatrix}
\begin{bmatrix}
L_1^T & & & L_{A_1}^T \\
& \ddots & & \vdots \\
& & L_p^T & L_{A_p}^T \\
& & & L_C^T
\end{bmatrix}.
\]
We can employ the Schur-complement mechanism for the blocks $L_C$ and $D_C$. The Cholesky decomposition can be computed by performing the series of computations outlined below:
\[
\begin{aligned}
\Phi_i &= L_i D_i L_i^T \;\Rightarrow\; D_i = \Phi_i,\ L_i = I & (13)\\
L_{A_i} &= A_i L_i^{-T} D_i^{-1} = A_i \Phi_i^{-1} & (14)\\
C &= -\textstyle\sum_{i=1}^{p} A_i \Phi_i^{-1} A_i^T & (15)\\
&= L_C D_C L_C^T & (16)
\end{aligned}
\]
Matrix $C$ is a dense matrix of relatively small size, $(r+1) \times (r+1)$, and the Cholesky decomposition $C = L_C D_C L_C^T$ is performed in the normal way on a single processor.

Once the representation $\Phi = LDL^T$ above is known, we can use it to compute the solution of system (6),
\[
\Phi \begin{bmatrix} \Delta z \\ \Delta \lambda \end{bmatrix} = \begin{bmatrix} r_c \\ r_b \end{bmatrix},
\]
through back-substitution. $\Delta z'$, $\Delta\lambda'$ and $\Delta\lambda''$ are vectors used for intermediate calculations, with the same dimensions as $\Delta z$ and $\Delta\lambda$:
\[
\begin{aligned}
\Delta\lambda'' &= L_C^{-1}\Bigl(r_b - \textstyle\sum_{i=1}^{p} L_{A_i} r_{c_i}\Bigr) & (17)\\
\Delta\lambda' &= D_C^{-1} \Delta\lambda'' & (18)\\
\Delta\lambda &= L_C^{-T} \Delta\lambda' & (19)\\
\Delta z_i' &= D_i^{-1} r_{c_i} & (20)\\
\Delta z_i &= \Delta z_i' - L_{A_i}^T \Delta\lambda & (21)
\end{aligned}
\]
For the formation of $LDL^T$, equations (13) and (14) can be calculated on each processor individually. The outer products (15) can be calculated, and the results gathered onto a single master processor to form $C$; this requires each processor to transfer approximately $\frac{1}{2}(r+1)^2$ elements. The master processor performs the Cholesky decomposition of $C$ (16). Each processor needs to calculate $L_{A_i} r_{c_i}$, which again can be performed without any inter-processor communication, and the results are gathered onto the master processor. The master processor then performs the calculations in equations (17), (18) and (19) of the back-substitution. Vector $\Delta\lambda$ is broadcast to all processors for them to calculate equations (20) and (21) locally.
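A serial sketch of the complete sequence (13)-(21) is shown below, under our own naming assumptions; in the parallel implementation each index $i$ of the loops lives on its own processor, and only the $(r+1)$-sized quantities (the contributions to $C$, the vectors $L_{A_i} r_{c_i}$ and the broadcast of $\Delta\lambda$) are communicated.

```python
import numpy as np

def solve_block_augmented(Phi_blocks, A_blocks, rc_blocks, rb):
    """Solve the bordered block-diagonal system of Section 6.1, eqs. (13)-(21).

    Phi_blocks[i]: diagonal block Phi_i, A_blocks[i]: bordering block A_i,
    rc_blocks[i] and rb: pieces of the right-hand side."""
    p = len(Phi_blocks)
    # (14)-(15): local contributions, independent per processor
    LA = [np.linalg.solve(Phi_blocks[i].T, A_blocks[i].T).T for i in range(p)]   # A_i Phi_i^{-1}
    C = -sum(LA[i] @ A_blocks[i].T for i in range(p))                            # Schur complement
    # (16)-(19): small dense solve on the master processor
    rhs = rb - sum(LA[i] @ rc_blocks[i] for i in range(p))
    dlam = np.linalg.solve(C, rhs)
    # (20)-(21): local back-substitution, again independent per processor
    dz = [np.linalg.solve(Phi_blocks[i], rc_blocks[i]) - LA[i].T @ dlam for i in range(p)]
    return dz, dlam
```

Here the small system for $\Delta\lambda$ is solved with a single dense solve instead of the explicit $L_C D_C L_C^T$ factors of (16)-(19), and the blocks $\Phi_i$ are treated as general dense matrices; in the SVM formulations they are diagonal, so the per-processor solves are trivial.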
6.2 Performance
We used the SensIT data set (available from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), collected on types of moving vehicles using a wireless distributed sensor network. It consists of 100 dense attributes, combining acoustic and seismic data. There were 78,823 samples in the training set and 19,705 in the test set. The classification task set was to discriminate class 3 from the other two. This is a relatively noisy data set: benchmark accuracy is around 85%. By partitioning the data evenly across the processors, and exploiting the structure as outlined above, we get very good parallel efficiency results, as shown in Figure 7. Training times are competitive: our implementation on 8 processors was 7.5 times faster than LIBSVM running serially ($\tau = 100$).

Figure 7: Parallel efficiency, training on the SensIT data set.
7 Conclusions
In this paper, we have shown how to develop a parallel implementation of Support Vector Machine training for problems involving nonlinear kernels, one that allowed the entire data set to be used. It consisted of the following steps:

1. Forming a partial Cholesky decomposition of the kernel matrix, taking into account that some columns are adequately approximated using diagonal elements alone.

2. Reformulating the problem to give an implicit inverse of the kernel matrix.

3. Using an interior point method to solve the optimization problem in a predictable time, and Cholesky decomposition to give good numerical stability of the implicit inverses.

4. Exploiting the block structure of the augmented system matrix, to partition the data and linear algebra computations amongst parallel processors efficiently.

The above steps were implemented in OOPS, and the implementation was applied to solve problems involving very large data sets. Excellent parallel efficiency was observed on such problems.
References

Edward Chang, Kaihua Zhu, Hao Wang, Jongjie Bai, Jian Li, Zhihuan Qiu, and Hang Cui. Parallelizing support vector machines on distributed computers. In NIPS, 2007. To be published.

Ronan Collobert and Samy Bengio. SVMTorch: support vector machines for large-scale regression problems. Journal of Machine Learning Research, 1:143–160, 2001.

Ronan Collobert, Samy Bengio, and Yoshua Bengio. A parallel mixture of SVMs for very large scale problems. Neural Computation, 14(5):1105–1114, 2002.

Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

Jian-xiong Dong, Adam Krzyzak, and Ching Suen. A fast parallel optimization for training support vector machine. In P. Perner and A. Rosenfeld, editors, Proceedings of 3rd International Conference on Machine Learning and Data Mining, pages 96–105, Leipzig, Germany, 2003. Springer Lecture Notes in Artificial Intelligence.

Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2002.

Donald Goldfarb and Katya Scheinberg. Solving structured convex quadratic programs by interior point methods with application to support vector machines and portfolio optimization. Submitted for publication, 2005.

J. Gondzio and A. Grothey. Exploiting structure in parallel implementation of interior point method for optimization. Technical Report MS-04-004, School of Mathematics, University of Edinburgh, 2004.

Hans Peter Graf, Eric Cosatto, Léon Bottou, Igor Dourdanovic, and Vladimir Vapnik. Parallel support vector machines: the Cascade SVM. In Lawrence Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems, volume 17. MIT Press, 2005.

Thorsten Joachims. Making large-scale support vector machine learning practical. In Bernhard Scholkopf, Christopher J. C. Burges, and Alexander J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 169–184. MIT Press, 1999.

Edgar Osuna, Robert Freund, and Federico Girosi. An improved training algorithm for support vector machines. In J. Principe, L. Gile, N. Morgan, and E. Wilson, editors, Neural Networks for Signal Processing VII — Proceedings of the 1997 IEEE Workshop, pages 276–285. IEEE, 1997.

John Platt. Fast training of support vector machines using sequential minimal optimization. In Bernhard Scholkopf, Christopher J. C. Burges, and Alexander J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 185–208. MIT Press, 1999.

Amund Tveit and Havard Engum. Parallelization of the incremental proximal support vector machine classifier using a heap-based tree topology. Technical report, IDI, NTNU, Trondheim, 2003.

Vladimir Vapnik. Statistical Learning Theory. Wiley, 1998.

Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer, 2nd edition, 1999.

Kristian Woodsend and Jacek Gondzio. Exploiting separability in large-scale support vector machine training. Technical Report MS-07-002, School of Mathematics, University of Edinburgh, August 2007. Submitted for publication. Available at http://www.maths.ed.ac.uk/~gondzio/reports/wgSVM.html.

Stephen J. Wright. Primal-Dual Interior-Point Methods. SIAM, 1997.

Gaetano Zanghirati and Luca Zanni. A parallel solver for large quadratic programs in training support vector machines. Parallel Computing, 29(4):535–551, 2003.

Luca Zanni, Thomas Serafini, and Gaetano Zanghirati. Parallel software for training large scale support vector machines on multiprocessor systems. Journal of Machine Learning Research, 7:1467–1492, 2006. ISSN 1533-7928.