A Comparison of Pruning Algorithms for Sparse Least Squares Support Vector Machines

L. Hoegaerts, J.A.K. Suykens, J. Vandewalle, B. De Moor
Katholieke Universiteit Leuven, ESAT-SCD-SISTA
Kasteelpark Arenberg 10, B-3001 Leuven (Heverlee), Belgium
Email: {luc.hoegaerts,johan.suykens}@esat.kuleuven.ac.be

Abstract. The Least Squares Support Vector Machine (LS-SVM) is a proven method for classification and function approximation. In comparison to the standard Support Vector Machine (SVM) it only requires solving a linear system, but it lacks sparseness in the number of solution terms. Pruning can therefore be applied. Standard ways of pruning the LS-SVM consist of recursively solving the approximation problem and subsequently omitting data that had a small error in the previous pass; they are based on support values. We suggest a slightly adapted variant that improves the performance significantly. We assess the relative regression performance of these pruning schemes in a comparison with two subset selection schemes adapted for pruning, one based on the QR decomposition (supervised) and one that searches for the most representative feature vector span (unsupervised), as well as with random omission and backward selection, on independent test sets in several benchmark experiments.

1 Introduction

In kernel based classification and function approximation the sparseness (i.e. a limited number of kernel terms) of the approximator is an important issue, since it allows faster evaluation of new data points. The remaining points are often called support vectors. In Vapnik's SVM [1] the sparseness is built in due to the ε-insensitive loss function, which ignores errors of points inside a 'tube' around the approximated function. This results in a quadratic programming problem. In the LS-SVM [2] a quadratic loss function is used instead, and the optimization problem reduces to solving a linear set of equations. But at the same time the sparseness is lost and must be imposed. A simple approach to introducing sparseness is based on the sorted support value spectrum (the solution of the set of equations) [3]: the LS-SVM solution equations suggest pruning away points with a low error contribution to the dual optimization objective. Another recent paper [4] refines the pruning mechanism by weighting the support values.

This research work was carried out at the ESAT laboratory of the KUL, supported by grants from several funding agencies and sources: Research Council KUL: GOA-Mefisto 666, GOA-Ambiorics, several PhD/postdoc & fellow grants; FWO: PhD/postdoc grants, projects G.0240.99, G.0407.02, G.0197.02, G.0141.03, G.0491.03, G.0120.03, G.0452.04, G.0499.04, research communities (ICCoS, ANMMM, MLDM); AWI: Bil. Int. Collaboration Hungary/Poland; IWT: PhD Grants, GBOU (McKnow); Belgian Federal Government: Belgian Federal Science Policy Office: IUAP V-22 (2002-2006), PODO-II (CP/01/40: TMS and Sustainability); EU: FP5-Quprodis, ERNSI, Eureka 2063-IMPACT, Eureka 2419-FliTE; Contract Research/agreements: ISMC/IPCOS, Data4s, TML, Elia, LMS, IPCOS, Mastercard; BOF OT/03/12, Tournesol 2004 - Project T2004.13. LH is a PhD student with IWT. JS is an associate professor at KUL. BDM and JVDW are full professors at KUL.


The datapoint with the smallest error introduced after its omission is then selected. This pruning method is claimed to outperform the standard scheme of [3], but the extent of the comparison was limited to one example in which noise was filtered out. We additionally suggest here an improved selection of the pruning point based on their derived criterion. Another route to sparse LS-SVMs is via Fixed-Size approaches [5], which employ entropy based subset selection in relation to kernel PCA density estimation; this has been successfully applied to a wide class of problems for subspace regression in feature space [6]. For a general overview of pruning we refer to [7].

Pruning is closely related to subset selection, i.e. choosing relevant datapoints or variables in order to build a sparse model. Pruning assumes that the model on the full set is iteratively downsized, in a backward manner, while subset selection usually proceeds in a forward manner. Many subset selection schemes can be distinguished that organise their search in a greedy fashion [8]. In particular we focus on two such schemes. A supervised method is based on the QR decomposition [9, 10]; we employ it for pruning by omitting the points whose orthogonalized components have the least correlation with the output. An unsupervised approach is based on a best fitting span [11]; we employ it by omitting the points that have the least similarity to that span.

In this paper we aim at a comparison of the regression performance of the two pruning procedures and the two subset selection procedures by performing a set of experiments with (i) evaluation on an independent test set and (ii) random and backward pruning included as objective reference measures.

2 LS-SVM for function estimation

Assume training data $\{(x_i, y_i)\}_{i=1}^n \subset \mathbb{R}^p \times \mathbb{R}$ have been given, where the $x_i$ are the input data and the $y_i$ the target or output values for sample $i$. The goal of function approximation is to find the underlying relation between input and target values. The LS-SVM [5] assumes an underlying model that is linear in the weight parameters $w$, with a bias term $b$: $y = w^T \varphi(x) + b$, where the feature map $\varphi : \mathbb{R}^p \to \mathcal{H}_k$ maps into the $r$-dimensional Reproducing Kernel Hilbert Space (RKHS) [12] $\mathcal{H}_k$ with an associated kernel $k : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R} : (x_i, x_j) \mapsto k(x_i, x_j)$. A common choice is $k(x_i, x_j) = \exp(-\|x_i - x_j\|_2^2 / h^2)$, where $h$ is a kernel width parameter. The kernel $k$ provides a similarity measure between pairs of data points, should fulfill the Mercer condition of positive definiteness, and is supposed to capture the nonlinearity, while the model remains linear in the parameters [13]. The weights $w$ and bias $b$ are estimated by minimizing a primal space error cost function
$$\min_{w,b,e} J(w, b, e) = w^T w + \gamma \sum_{i=1}^n e_i^2 \quad \text{s.t.} \quad y_i = w^T \varphi(x_i) + b + e_i, \quad i = 1, \ldots, n.$$
The objective consists of a sum of squared errors term (to fit the training data) and a regularization term that smoothens the approximation (to counteract overfitting). Working with the explicit expression for $\varphi$ is avoided by considering the dual formulation of this cost function in feature space $\mathcal{H}_k$. The optimization objective becomes the Lagrangian $L(w, b, e; \alpha) = J(w, b, e) - \sum_{i=1}^n \alpha_i (w^T \varphi(x_i) + b + e_i - y_i)$, where the $\alpha_i$ are Lagrange multipliers. One solves by deriving the optimality conditions


$\partial L/\partial w = \partial L/\partial b = \partial L/\partial e_i = \partial L/\partial \alpha_i = 0$. Elimination of the variables $e$ and $w$ through substitution naturally leads to a solution expressed solely in terms of inner products $\varphi(x_i)^T \varphi(x_j) = k(x_i, x_j)$, which results in a linear system [2]:
$$\begin{bmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & K + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}, \qquad (1)$$
where $\mathbf{1}$ is a column vector of ones and $y$ a vector of target values. The entries of the symmetric positive definite kernel matrix $K$ equal $K_{ij} = k(x_i, x_j)$. The role of the (potentially infinite-dimensional) $r \times 1$ weight vector $w$ in primal space is conveniently taken over by a directly related $n \times 1$ weight vector $\alpha$ in the dual space. Typically a model selection procedure (e.g. cross-validation) is required to determine the two hyperparameters $(\gamma, h^2)$. Once these are fixed, the LS-SVM approximator can be evaluated at any point $x$ by $\hat{y}(x) = w^T \varphi(x) + b = \sum_{i=1}^n \alpha_i \varphi(x_i)^T \varphi(x) + b = \sum_{i=1}^n \alpha_i k(x_i, x) + b$. Related models of regularization networks and Gaussian processes have been considered without the bias term $b$. LS-SVMs have been proposed as a class of kernel machines with a primal-dual formulation for KFDA, KRR, KPCA, KPLS, KCCA, recurrent networks and optimal control [5].
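To make the dual formulation concrete, the following is a minimal sketch of LS-SVM training and evaluation that solves the linear system (1) directly with a Gaussian kernel. The function and variable names are our own illustrative choices, not taken from any code accompanying the paper.

```python
import numpy as np

def gaussian_kernel(X1, X2, h2):
    """K_ij = exp(-||x_i - x_j||^2 / h^2), evaluated for all pairs of rows."""
    sq = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-sq / h2)

def lssvm_train(X, y, gamma, h2):
    """Solve system (1): [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = X.shape[0]
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = gaussian_kernel(X, X, h2) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]          # bias b, dual weights alpha

def lssvm_predict(X_train, alpha, b, X_test, h2):
    """y_hat(x) = sum_i alpha_i k(x_i, x) + b."""
    return gaussian_kernel(X_test, X_train, h2) @ alpha + b
```

For the data set sizes used in the experiments below (a few hundred training points) a direct dense solve is adequate; larger problems would call for iterative or low-rank methods.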

3 Pruning methods

The pruning methods compared in this paper are:

1. Support values. A simple way of imposing sparseness on the LS-SVM is to prune those terms of the kernel expansion that have the smallest absolute value [3]. The motivation comes from the fact that the LS-SVM support values are proportional to the errors at the datapoints, namely $\alpha_i = \gamma e_i$. Omitting the points that contribute least to the training error is a direct and cheap way to impose sparseness (denoted 'sv' in our experiments).

2. Weighted support values (γ = ∞). Recently a refinement of the above pruning scheme has been reported [4] that omits the sample that itself bears the least error after it is omitted. The derivation yields a distinct criterion depending on the value of γ. When no regularization is applied, i.e. γ = ∞, [4] proposes to omit the sample with the smallest absolute value of $\alpha_i$ divided by the corresponding diagonal element $(i, i)$ of the inverted kernel matrix. Compared to [3], the extension of [4] comes with a more expensive computation, since the kernel matrix needs to be inverted. The authors also claim it outperforms the standard method, and an example is given in which the training error is indeed systematically lower.

3. Weighted support values (γ ≠ ∞). In case γ ≠ ∞, [4] proposes to omit the sample with the smallest absolute value of the $i$th component of $A A_\gamma^{-1} e_i e_i^T A_\gamma^{-1} c \,/\, e_i^T A_\gamma^{-1} e_i$, where $A = \begin{bmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & K \end{bmatrix}$, $c = [0; y]$, $A_\gamma = A + \gamma^{-1} I_{n+1}$, and $e_i$ is a column vector with value 1 at element $i+1$ and zeros elsewhere. Both cases are included in the experiments. The $\alpha$'s thus need to be weighted, which resembles the formula of the optimal brain surgeon, in which the inverse of the Hessian of the error surface of the model parameters appears in the denominator [14]. In [4], however, no examples or comparisons were given for the case γ ≠ ∞. In this paper we complement this result with experiments.


4. Weighted support value sum (γ ≠ ∞). As an extension to the work of [4] we propose, in the case γ ≠ ∞, to omit the sample for which the sum of the errors introduced at all points is smallest. Because omitting a point introduces error at all points, it makes sense to look at the global increase in error over all points and exclude the point that minimizes this measure, at no extra cost.

5. Orthogonal error-minimizing regressors. A subset selection procedure [9, 10] is motivated by the expression for the sum of squared errors on the training data obtained at the optimal LS solution. If the regressors are made orthogonal through e.g. a Gram-Schmidt or QR-like decomposition, their impact on reducing the error can be expressed termwise. It then turns out that the orthogonalized regressors that are most coincident with the output contribute most to the error reduction. This ranks the points; for pruning, the least error-influencing point is omitted here.

6. Approximating representers. A second subset selection procedure [11] is unsupervised in nature and aims at finding a span of feature vectors that are most similar to the remaining ones. The similarity is measured by the distance between a remaining vector and an arbitrary linear combination of the previously found vectors (gathered in the set $S$): $\min_\beta \|\varphi(x_i) - \Phi_S^T \beta\|^2 / \|\varphi(x_i)\|^2$. This criterion selects points by which all the remaining features can best be approximated. The datapoints are again ranked in importance and for pruning the last one is omitted here (denoted 'span' in the experiments). A code sketch of criteria 1, 4 and 6 is given after this list.
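As a rough illustration of how these ranking criteria can be computed from the dual solution, the sketch below covers criterion 1 (support values), criterion 4 (the proposed sum of introduced errors, following the formula of [4] as written above; summing absolute values is our own assumption), and the normalized span residual underlying criterion 6, rewritten with the kernel trick. All helper names are hypothetical.

```python
import numpy as np

def sv_criterion(alpha):
    """Criterion 1: rank training points by |alpha_i|; the smallest is pruned."""
    return np.abs(alpha)

def weighted_sv_sum_criterion(K, y, gamma):
    """Criterion 4 (gamma != inf): for each point i, form the error vector
    introduced by its omission,
        d_i = A A_g^{-1} e_i (e_i^T A_g^{-1} c) / (e_i^T A_g^{-1} e_i),
    and score the point by the sum of the absolute entries of d_i."""
    n = K.shape[0]
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K
    c = np.concatenate(([0.0], y))
    Ag_inv = np.linalg.inv(A + np.eye(n + 1) / gamma)
    scores = np.empty(n)
    for i in range(n):
        col = Ag_inv[:, i + 1]                    # A_g^{-1} e_i (e_i selects entry i+1)
        d = A @ col * (col @ c) / col[i + 1]      # error vector introduced by omitting i
        scores[i] = np.abs(d).sum()
    return scores                                 # prune the point with the smallest score

def span_residual(K, S, i):
    """Criterion 6 helper: normalized residual of feature vector i projected onto
    the span of the vectors indexed by S,
        min_beta ||phi(x_i) - Phi_S^T beta||^2 / ||phi(x_i)||^2
        = 1 - k_{S,i}^T K_{SS}^{-1} k_{S,i} / K_{ii}."""
    kSi = K[np.ix_(S, [i])].ravel()
    return 1.0 - kSi @ np.linalg.solve(K[np.ix_(S, S)], kSi) / K[i, i]
```

Note that the weighted criteria require inverting an (n+1) x (n+1) matrix, which matches the remark above that they are computationally more expensive than plain support-value pruning.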

4 Experiments

In this section we describe the experimental comparison of the six pruning schemes above. The prediction performance on an independent test set serves as the measure. For reference we include pruning of a randomly selected point (as an upper bound) and pruning of a backward selected point, i.e. the point that yields the model with the least error after omitting it and retraining on the remaining points (as a lower bound). Backward selection can be expected to perform well, but it is an overly expensive method. We applied the algorithms to an artificial sinc function estimation problem and several benchmarks from the UCI machine learning repository [15]. In all experiments we standardized the data to zero mean and unit variance, used the common Gaussian kernel, and determined the tuning parameters h² and γ with standard 10-fold cross-validation. In every experiment we measured the mean square error on an independent test set, while each time one training sample is left out according to the criterion of the algorithm (a sketch of this evaluation loop is given below). For the sinc ('mexican hat') function we drew 25 times a sample of 100 uniformly spaced points with added Gaussian noise of standard deviation σ = 0.2. The (prototypical) figure shows the averaged mean square error on an independent test set versus the training set reduction (in %). From UCI we used Boston (housing prices), Machine (rel. cpu performance), Servo (rise times) and Auto (fuel consumption). All data set characteristics, selected tuning parameters and pruning results are reported in the overview table for several pruning-reduced training sets. The best results are typeset in bold face (excluding the performance of backward selection).
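The evaluation protocol can be summarized in a few lines. The sketch below, which reuses the hypothetical lssvm_train, lssvm_predict and sv_criterion helpers from the earlier sketches, retrains after each single-point removal and records the test-set mean square error; the 90% reduction default is our own illustrative choice.

```python
import numpy as np

def pruning_curve(X, y, X_test, y_test, gamma, h2, reduction=0.9):
    """Prune one training sample at a time (here with the plain support-value
    criterion), retrain on the remaining points and record the test-set MSE."""
    idx = np.arange(len(y))
    mse_curve = []
    for _ in range(int(reduction * len(y))):
        b, alpha = lssvm_train(X[idx], y[idx], gamma, h2)
        y_hat = lssvm_predict(X[idx], alpha, b, X_test, h2)
        mse_curve.append(np.mean((y_test - y_hat) ** 2))
        idx = np.delete(idx, np.argmin(sv_criterion(alpha)))   # drop the least important point
    return np.array(mse_curve)
```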


5 Conclusions

We compared different pruning methods for sparse LS-SVMs: four support value based pruning schemes and two subset selection algorithms (adapted for pruning) against random pruning and backward selection, on independent test sets in a set of benchmark experiments. From these results we conclude that omitting a point based upon its weighted support value in the case γ ≠ ∞ rarely yields satisfactory pruning results. We suggest instead using as criterion the sum of these values over all training datapoints, which yields significant improvements. Pruning in which the omitted point has the smallest support value, weighted by the corresponding diagonal element of the inverted kernel matrix, achieved excellent overall pruning results, although in theory that formula is not intended for the case γ ≠ ∞. The subset selection based pruning algorithms perform second best overall. If we also take the computational cost into account, standard pruning based on the support values remains the cheapest and simplest option.

References

1. V. N. Vapnik, Statistical Learning Theory. John Wiley & Sons, 1998.
2. J. A. K. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Processing Letters, vol. 9, no. 3, pp. 293–300, June 1999.
3. J. A. K. Suykens, J. De Brabanter, L. Lukas, and J. Vandewalle, "Weighted least squares support vector machines: robustness and sparse approximation," Neurocomputing, special issue on fundamental and information processing aspects of neurocomputing, vol. 48, no. 1-4, pp. 85–105, Oct. 2002.
4. B. J. de Kruif and T. J. A. de Vries, "Pruning error minimization in least squares support vector machines," IEEE Transactions on Neural Networks, vol. 14, no. 3, pp. 696–702, May 2003.
5. J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle, Least Squares Support Vector Machines. World Scientific, Singapore, 2002.
6. L. Hoegaerts, J. A. K. Suykens, J. Vandewalle, and B. De Moor, "Subset based least squares subspace regression in RKHS," accepted for publication in Neurocomputing, special issue, 2004.
7. R. Reed, "Pruning algorithms – a survey," IEEE Transactions on Neural Networks, vol. 4, no. 5, pp. 740–747, 1993.
8. A. J. Smola and B. Schölkopf, "Sparse greedy matrix approximation for machine learning," in Proc. 17th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA, 2000, pp. 911–918.
9. A. J. Miller, Subset Selection in Regression. Chapman & Hall, 1990.
10. S. Chen, C. Cowan, and P. Grant, "Orthogonal least squares learning algorithm for radial basis function networks," IEEE Transactions on Neural Networks, vol. 2, pp. 302–309, March 1991.
11. G. Baudat and F. Anouar, "Kernel based methods and function approximation," in Proc. IJCNN, Washington DC, July 2001, pp. 1244–1249.
12. N. Aronszajn, "Theory of reproducing kernels," Transactions of the American Mathematical Society, vol. 68, pp. 337–404, 1950.
13. B. Schölkopf and A. Smola, Learning with Kernels. MIT Press, 2002.
14. B. Hassibi, D. G. Stork, and G. Wolff, "Optimal brain surgeon and general network pruning," in Proceedings of the 1993 IEEE International Conference on Neural Networks, San Francisco, CA, Apr. 1993, pp. 293–300.
15. C. Blake and C. Merz, "UCI repository of machine learning databases," 1998. [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html


Overview of data set characteristics and selected tuning parameters:

data set   dim   train   test   h^2     γ
Boston      13   400     106    15.96   32.35
Machine      6   200     109    15.03    1.74
Servo        4   100      67    16.20   91.07
Auto         7   300      92    56.57   58.44

[Figure: averaged test set mean square error versus training set reduction (%) for the sinc experiment; curves shown for random pruning, sv, weighted sv (γ = ∞) and weighted sv (γ ≠ ∞).]