Linear SVM training using separability and interior point methods

Kristian Woodsend [email protected]

Jacek Gondzio [email protected]

School of Mathematics, University of Edinburgh, The Kings Buildings, Edinburgh, EH9 3JZ, UK

June 25, 2008

Abstract

Support vector machine training can be represented as a large quadratic program. We present an efficient and numerically stable algorithm for this problem using primal-dual interior point methods. Reformulating the problem to exploit separability of the Hessian eliminates the main source of computational complexity, resulting in an algorithm which requires only O(n) operations per iteration. Extensive use of Level 3 BLAS functions enables good parallel efficiency on shared-memory processors. As the algorithm works in primal and dual spaces simultaneously, our approach has the advantage of obtaining the hyperplane weights and bias directly from the solver.

1 Introduction

This paper is a summary of Woodsend and Gondzio (2007), for the purpose of participating in the Pascal Large Scale Learning Challenge. Support Vector Machines (SVMs) are a powerful machine learning technique, and they offer state-of-the-art performance. The training stage for Support Vector Machines involves at its core a dense convex quadratic optimization problem (QP). Solving this optimization problem is computationally expensive, primarily due to the dense Hessian matrix. With n data points and m features, solving the QP with a general-purpose interior point method (IPM) QP solver would result in the time taken scaling cubically with the number of data points (O(n^3)). Such a complexity result means that, in practice, the SVM training problem cannot be solved by such general-purpose solvers. Yet in other applications, IPM technology generally works well on large-scale problems, as the number of outer iterations required grows very slowly with problem size (see Wright, 1997).

Several approaches applying IPM technology to SVM training have been researched, based on different formulations (Ferris and Munson, 2003; Gertz and Griffin, 2005; Goldfarb and Scheinberg, 2005). They have in common an aim to exploit the low-rank structure of the kernel matrix. Such an approach means that the only matrix to be inverted has dimension m × m, and the overall effort per iteration associated with computing its implicit inverse representation scales linearly with n and quadratically with m. This gives a significant improvement if n ≫ m. These approaches, however, inherently suffer from either numerical instability of the implicit inverse or memory caching inefficiencies.

In our paper Woodsend and Gondzio (2007), we presented a set of efficient and numerically stable IPM-based formulations which unify, from an optimization perspective, 1-norm classification, 2-norm classification, Universum classification, ordinal regression and ε-insensitive regression. All these problems can be equivalently reformulated as very large, yet structured, separable QPs. Exploiting separability has been investigated for general sparse convex QPs (Mészáros, 1998), but not for the SVM problem. The formulation for 1-norm classification is described here. Further, we show how an IPM can be specialized to exploit separability in all these problems efficiently.

We now describe the notation used in this paper. x_i is the attribute vector for the i-th data point, and it consists of the observation values directly. There are n observations in the training set, and m attributes in each vector x_i. We assume that n ≫ m. X is the m × n matrix whose columns are the attribute vectors x_i associated with each point. The classification label for each data point is denoted by y_i ∈ {−1, 1}. The variables w ∈ R^m and α ∈ R^n are used for the primal variables ("weights") and dual variables respectively, and w_0 for the bias of the hyperplane. τ is used as the misclassification penalty. Scalars are denoted using lower case letters, column vectors are underlined, while upper case letters denote matrices. D, S, U, V and Y are the diagonal matrices of the corresponding lower case vectors.
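To make this notation concrete, the following small snippet (our own illustration; the data, dimensions and variable names are not from the paper) sets up X, y and related quantities as described above:

```python
import numpy as np

# Illustrative data layout following the notation above.
rng = np.random.default_rng(0)
n, m = 1000, 20                       # n observations, m attributes, n >> m
X = rng.standard_normal((m, n))       # X is m x n; column i is the attribute vector x_i
y = rng.choice([-1.0, 1.0], size=n)   # class labels y_i in {-1, +1}
tau = 1.0                             # misclassification penalty

Y = np.diag(y)                        # diagonal matrix built from the vector y
assert np.allclose(X @ Y, X * y)      # X @ diag(y) simply scales each column by its label
```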

2 Interior Point Methods

Interior point methods represent state-of-the-art techniques for solving linear, quadratic and non-linear optimization programmes. In this section the key issues of implementation for QPs are discussed very briefly to highlight areas of computational cost; for more details, see Wright (1997). We are interested in solving the general convex quadratic problem

    min_z    (1/2) z^T Q z + c^T z
    s.t.     A z = b,
             0 ≤ z ≤ u,

where u is a vector of upper bounds, and the constraint matrix A is assumed to have full rank. Dual feasibility requires that A^T λ + s − v − Qz = c, where s, v ≥ 0 are the dual variables associated with the lower and upper bounds of z respectively. An interior point method (outlined in Algorithm 1) moves towards satisfying the KKT conditions over a series of iterations, by monitoring primal and dual feasibility and controlling the complementarity products

    ZSe = µe,    (U − Z)Ve = µe,

where µ is a strictly positive parameter. At each iteration (steps 2–7), the method makes a damped Newton step towards satisfying the primal feasibility, dual feasibility and complementarity product conditions for a given µ. Then the algorithm decreases µ before making another iteration. The algorithm continues until both infeasibilities and the duality gap (which is proportional to µ) fall below required tolerances.

The Newton system to be solved at each iteration (step 4) can be transformed into the augmented system equations:

    [ −(Q + Θ^{−1})   A^T ] [ ∆z ]   [ r_c ]
    [       A          0  ] [ ∆λ ] = [ r_b ],

where ∆z, ∆λ are components of the Newton direction in the primal and dual spaces respectively, Θ^{−1} ≡ Z^{−1}S + (U − Z)^{−1}V, and r_c and r_b are appropriately defined residuals. Furthermore, the normal equations are a set of equations found by eliminating ∆z from the augmented system. Solving them requires calculating M ≡ A(Q + Θ^{−1})^{−1}A^T (step 3) and then solving M∆λ = −r̂_b for ∆λ (step 4). Calculating and factorizing M (step 3) are the most computationally expensive operations of the algorithm.

Algorithm 1 Outline of Interior Point Method
Require: Initial point (z^0, s^0, v^0, λ^0)
1: (z, s, v, λ) := (z^0, s^0, v^0, λ^0)
2: while stopping criteria are not fulfilled do
3:   Calculate and factorize matrix M
4:   Calculate search direction (∆z, ∆s, ∆v, ∆λ) by solving M∆λ = −r̂_b and back-solving for the other variables
5:   Determine step size to ensure positivity of the bounded variables, and calculate new iterates (z, s, v, λ)
6:   Correct (z, s, v, λ) to obtain a more central point
7: end while
8: return (z, s, v, λ)

Interior point methods are efficient for solving quadratic programmes when the matrix Q is easily invertible; however, if the matrix is dense, the time taken to invert M can become prohibitive.
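As an illustration of how steps 3 and 4 of Algorithm 1 fit together, the following sketch forms Θ^{−1}, builds the normal-equations matrix M and recovers the full Newton direction for a small dense QP. The residual algebra shown is the standard primal-dual choice and is an assumption of this sketch, not taken from the paper; variable names follow the paper where possible.

```python
import numpy as np

def ipm_newton_step(Q, A, b, c, u, z, lam, s, v, mu):
    """One Newton direction via the normal equations (a minimal dense sketch).

    Problem:  min 0.5 z'Qz + c'z  s.t.  Az = b, 0 <= z <= u,
    with s, v > 0 the multipliers of the lower and upper bounds and lam the
    equality multipliers.  The residual definitions below are standard
    primal-dual IPM choices and are assumptions of this sketch.
    """
    p = u - z                                      # slack of the upper bound, p = u - z > 0
    theta_inv = s / z + v / p                      # diagonal of Theta^{-1} = Z^{-1}S + (U-Z)^{-1}V
    r_b = b - A @ z                                # primal residual
    r_c = c + Q @ z - A.T @ lam - mu / z + mu / p  # dual residual with perturbed complementarity folded in
    H = Q + np.diag(theta_inv)                     # Q + Theta^{-1}
    M = A @ np.linalg.solve(H, A.T)                # normal-equations matrix (the expensive step 3)
    dlam = np.linalg.solve(M, r_b + A @ np.linalg.solve(H, r_c))   # step 4
    dz = np.linalg.solve(H, A.T @ dlam - r_c)      # back-solve for the primal direction
    ds = mu / z - s - (s / z) * dz                 # from the perturbed condition ZSe = mu e
    dv = mu / p - v + (v / p) * dz                 # from the perturbed condition (U-Z)Ve = mu e
    return dz, dlam, ds, dv
```

In the separable SVM formulation of the next section, Q + Θ^{−1} is diagonal, so the two solves with H above reduce to element-wise divisions and only the product forming M remains expensive.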

3 Application to SVMs

In this section we briefly outline how the standard SVM binary classification problem can be reformulated as a separable QP (for more details, see Woodsend and Gondzio, 2007). Using the form Q = (XY)^T(XY) enabled by the linear kernel, we can rewrite the quadratic objective in terms of w, and ensure that the relationship between w and α holds at optimality by introducing it into the constraints. Consequently, we can state the 1-norm classification problem as the following separable QP:

    min_{w,α}   (1/2) w^T w − e^T α
    s.t.        w − XY α = 0,
                y^T α = 0,
                0 ≤ α ≤ τ e.
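As a quick sanity check of this reformulation (our own illustration, not code from the paper), substituting the constraint w = XYα back into the objective recovers the familiar dual objective (1/2) α^T Y X^T X Y α − e^T α, so the separable QP and the standard dual agree:

```python
import numpy as np

# Verify that the separable objective with w = X Y alpha equals the standard
# 1-norm SVM dual objective for an arbitrary alpha.
rng = np.random.default_rng(1)
m, n = 5, 50
X = rng.standard_normal((m, n))
y = rng.choice([-1.0, 1.0], size=n)
alpha = rng.uniform(0.0, 1.0, size=n)

w = (X * y) @ alpha                       # the constraint w = X Y alpha
separable_obj = 0.5 * w @ w - alpha.sum()

v = y * alpha                             # Y alpha
dual_obj = 0.5 * v @ (X.T @ (X @ v)) - alpha.sum()

assert np.isclose(separable_obj, dual_obj)
```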

Compared to the standard SVM dual formulation, the quadratic matrix Q in the objective is no longer dense, but simplified to a diagonal matrix, while the constraint matrix A is increased in both rows and columns:

    Q = [ I_m   0  ] ∈ R^{(n+m)×(n+m)},     A = [ I_m  −XY ] ∈ R^{(m+1)×(m+n)}.
        [  0   0_n ]                             [  0   y^T ]

Determining the Newton step requires calculating the matrix product:

    M ≡ A(Q + Θ^{−1})^{−1}A^T = [ (I_m + Θ_w^{−1})^{−1} + XY Θ_α Y X^T    −XY Θ_α y ]
                                [ −y^T Θ_α Y X^T                           y^T Θ_α y ]  ∈ R^{(m+1)×(m+1)}.

We need to solve M∆λ = r for ∆λ. Building the matrix M is the most expensive operation, of order O(n(m + 1)^2), while inverting the resulting matrix is of order O((m + 1)^3).
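The block structure above makes M cheap to assemble: every block involves either a diagonal matrix or the product of the m × n matrix XY with a diagonally scaled copy of itself. A minimal sketch of this assembly is shown below (the names build_M, theta_w and theta_a are ours; they denote the diagonals of Θ_w and Θ_α):

```python
import numpy as np

def build_M(X, y, theta_w, theta_a):
    """Assemble M = A (Q + Theta^{-1})^{-1} A^T for the separable SVM QP.

    X: (m, n) attribute matrix, y: (n,) labels in {-1, +1},
    theta_w: (m,) diagonal of Theta_w, theta_a: (n,) diagonal of Theta_alpha.
    Forming M costs O(n (m + 1)^2); factorizing it costs O((m + 1)^3).
    """
    m, n = X.shape
    XY = X * y                                # X @ diag(y): scale each column by its label
    d_w = 1.0 / (1.0 + 1.0 / theta_w)         # diagonal of (I_m + Theta_w^{-1})^{-1}

    M = np.empty((m + 1, m + 1))
    M[:m, :m] = np.diag(d_w) + (XY * theta_a) @ XY.T   # (I_m + Theta_w^{-1})^{-1} + X Y Theta_a Y X^T
    M[:m, m] = -(XY * theta_a) @ y                     # -X Y Theta_a y
    M[m, :m] = M[:m, m]                                # symmetric block: -y^T Theta_a Y X^T
    M[m, m] = y @ (theta_a * y)                        # y^T Theta_a y
    return M
```

The single dense product (XY * theta_a) @ XY.T dominates the cost, and it is this operation that the Level 3 BLAS routines mentioned in the next section accelerate.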

4 Implementation issues

There remain some implementation issues of our submission to the Challenge to describe.

The algorithm spends the majority of its time (typically over 95%) forming the matrix M. Level 3 functions from the BLAS library are used extensively for this part, minimizing Level 2 cache read misses and enabling good parallel efficiency on shared-memory parallel processors.

To determine the hyperplane, we also require the value of the bias w_0. Note that as we are using a primal-dual interior point method, and because the dual of the dual (on which our formulation is built) is the primal, we can obtain primal variables such as w_0 directly from the solver (w_0 is the element of λ corresponding to the constraint y^T α = 0). This is in contrast to other approaches where w_0 has to be estimated from a subset of α values.

A property of this algorithm is the early identification of the optimal hyperplane, possibly due to incorporating the primal variables w directly in the formulation (Chapelle, 2007). The algorithm spends further iterations reducing infeasibilities associated with α. In this version of our algorithm, we exploited this behaviour to provide an earlier stopping criterion. The w values are monitored, and the algorithm measures the change in the angle φ of the normal to the hyperplane between iterations i − 1 and i:

    cos φ = (w^{(i−1)})^T w^{(i)} / (‖w^{(i−1)}‖ ‖w^{(i)}‖).

The algorithm terminates if the hyperplane is stable (sin φ < ε for some small tolerance ε) and the residual errors are also small. In the data sets of the Challenge, this stopping criterion was always met before the standard one based on the relative size of the duality gap, and with no loss in prediction accuracy.
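A minimal sketch of this angle-based test is given below (our illustration; the tolerance value is arbitrary, and in the actual solver the test is combined with the check on residual errors):

```python
import numpy as np

def hyperplane_stable(w_prev, w_curr, eps=1e-3):
    """Return True when the hyperplane normal has stopped rotating, i.e. sin(phi) < eps."""
    cos_phi = (w_prev @ w_curr) / (np.linalg.norm(w_prev) * np.linalg.norm(w_curr))
    cos_phi = min(1.0, max(-1.0, cos_phi))   # guard against round-off pushing |cos| above 1
    sin_phi = np.sqrt(1.0 - cos_phi ** 2)
    return sin_phi < eps
```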

References

O. Chapelle. Training a support vector machine in the primal. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large-Scale Kernel Machines, chapter 2, pages 29–50. MIT Press, 2007.

M. Ferris and T. Munson. Interior point methods for massive support vector machines. SIAM Journal on Optimization, 13(3):783–804, 2003.

E. M. Gertz and J. D. Griffin. Support vector machine classifiers for large data sets. Technical Memorandum ANL/MCS-TM-289, Argonne National Laboratory, October 2005.

D. Goldfarb and K. Scheinberg. Solving structured convex quadratic programs by interior point methods with application to support vector machines and portfolio optimization. Submitted for publication, 2005.

C. Mészáros. The separable and non-separable formulations of convex quadratic problems in interior point methods. Technical Report WP 98-3, Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, 1998.

K. Woodsend and J. Gondzio. Exploiting separability in large-scale support vector machine training. Technical Report MS-07-002, School of Mathematics, University of Edinburgh, August 2007. Submitted for publication. Available at http://www.maths.ed.ac.uk/~gondzio/reports/wgSVM.html.

S. J. Wright. Primal-Dual Interior-Point Methods. SIAM, 1997.
