SPARSE FEATURE EXTRACTION USING GENERALISED PARTIAL LEAST SQUARES

Charanpal Dhanjal, Steve R. Gunn and John Shawe-Taylor
{cd04r, srg, jst}@ecs.soton.ac.uk
ISIS, School of Electronics and Computer Science
University of Southampton
Southampton, UK, SO17 1BJ

ABSTRACT

We describe a general framework for feature extraction based on the deflation scheme used in Partial Least Squares (PLS). The framework provides many desirable properties, such as conjugacy and efficient computation of the resulting features. When the projection vectors are constrained in a certain way, the resulting features have dual representations. Using the framework, we derive two new sparse feature extraction algorithms, Sparse Maximal Covariance (SMC) and Sparse Maximal Alignment (SMA). These algorithms produce features which are competitive with those extracted by Kernel Boosting, Boosted Latent Features (BLF) and sparse kernel PLS on several UCI datasets. Furthermore, the sparse algorithms are shown to improve the performance of an SVM on a sample of the Reuters Corpus Volume 1 dataset.

1. INTRODUCTION

Modern applications of machine learning increasingly deal with large amounts of complex data. By identifying relevant features one can reduce computation time, improve understanding of the data and often increase prediction performance. There are many approaches to feature selection; however, we focus on those which project the data onto a subspace, known as subspace methods. Popular subspace methods include Principal Components Analysis (PCA) [1], which has applications in image compression and face recognition, and Partial Least Squares (PLS) [2], which is often used in chemometrics.

In this paper we propose a general framework for feature extraction in which projection directions can be generated according to an arbitrary criterion. The framework is a generalisation of primal and dual versions of PCA, PLS, and a subspace algorithm based on boosting called Boosted Latent Features (BLF) [3]. It is based on PLS and retains many of its desirable properties, such as conjugacy of the projection directions with respect to the data and efficient computation of the resulting features.

By using such a framework, features can be targeted towards a given problem whilst providing a low-dimensional approximation of the data. Although the projection vectors can be chosen in any manner, under certain constraints the extracted features have dual (kernel) representations. This allows for non-linear feature mappings. The advantages of using kernels are threefold: one can use feature spaces which may not have explicit forms, one can often reduce computation time compared to explicit representations, and the approach is inherently modular. Today many popular and state-of-the-art algorithms have kernel variants.

One of the drawbacks of many subspace algorithms, including PCA and PLS, is that they do not scale well to large amounts of data. However, scalability is a useful property for many real datasets. To this end, we derive two scalable feature extraction algorithms which enforce a sparse representation on the set of projection vectors, i.e. the projection directions are represented using only a few training examples. The first algorithm, Sparse Maximal Covariance (SMC), maximises the covariance between the examples and the corresponding labels. The second, Sparse Maximal Alignment (SMA), maximises the kernel alignment [4] between the examples and labels. We empirically compare PLS and the sparse algorithms to several related subspace methods such as Kernel Boosting [5], BLF and a variant of PLS called sparse kernel PLS (sparse KPLS) [6]. Experimental results on several UCI datasets show that the features extracted by our sparse algorithms are competitive with those extracted by Kernel Boosting, BLF and sparse KPLS when using Support Vector Machine (SVM) and k-Nearest Neighbour (KNN) classifiers. Of the related algorithms, sparse KPLS scales well with the number of examples, so we compare its scalability with that of our sparse algorithms on a sample of the Reuters Corpus Volume 1 dataset.

This document is organised as follows: we start with an overview of several related subspace algorithms. Section 3 introduces the primal version of the feature extraction framework, followed by Section 4 on kernel feature extraction. Section 5 introduces two new sparse algorithms under the dual form of the framework. Computational results comparing the algorithms to related methods are presented in Section 6, and Section 7 concludes the paper.

2. RELATED ALGORITHMS

In this section we review some recently proposed subspace methods. Kernel Boosting and BLF use a boosting paradigm to generate projection directions, and we show how they are related to PLS. Sparse KPLS and sparse KBLF generate dual projection directions which are sparse in the number of training examples.

Kernel Boosting [5] is an iterative process used to learn a set of projection vectors using a boosting model. At each iteration a distribution matrix is computed on the kernel matrix entries, which indicates how difficult it is to predict whether two examples have the same or different labels. The distribution matrix weights the kernel matrix entries so that additional projection directions focus more on those that are difficult to predict. Projection directions are computed using a "base kernel learner", which allows for modularity in the approach. The base kernel learner given in [5] is one which solves the following generalised eigenvalue problem:

$$XX'YDYXX'v = \lambda XX'v, \qquad (1)$$

where $\lambda$, $v$ are an eigenvalue-eigenvector pair, $X$ is a matrix whose rows are the data examples and $Y$ is a diagonal matrix with its diagonal entries equal to the labels. The initial distribution matrix $D_1$ is made up of a single repeated value, so we can write $YD_1Y = cyy'$ for some constant $c$, where $y$ is the vector of labels. Using a Rayleigh quotient formulation and $w = X'v$, equation (1) can be phrased as an optimisation problem. The weight vector $w$ is invariant to scaling, hence one can introduce the constraint $\|w\| = 1$, and equation (1) becomes $\max_{\|w\|=1} w'X'yy'Xw = (w'X'y)^2$, which is the same as the PLS optimisation problem. Hence, the first Kernel Boosting direction is the same as the first PLS direction, although additional directions are not computed in the same way as PLS and will in general be different.

BLF [3] is an algorithm for constructing orthogonal features targeted toward a loss function. It finds a set of linear hypotheses $t = Xu$, where $u$ is a projection vector, which minimise a given loss, and then computes the corresponding regression coefficients using the same loss function. Additional directions are generated using the PLS deflation, hence BLF is equivalent to PLS when using a least squares loss. Whereas BLF can be seen as a generalisation of PLS with an arbitrary loss, the general feature extraction framework can be regarded as a generalisation of PLS feature extraction with arbitrary projection vectors. Sections 3 and 4 show how the framework is a generalisation of BLF and kernel BLF (KBLF) feature extraction respectively.
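As a quick numerical check of the equivalence noted in the Kernel Boosting discussion above, the following NumPy sketch compares the top eigenvector of $X'yy'X$ (the Rayleigh quotient form of equation (1) under a uniform initial distribution) with the normalised $X'y$, i.e. the first PLS direction. The data, seed and variable names are illustrative only and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))       # 50 toy examples, 8 features
y = np.sign(rng.standard_normal(50))   # +/-1 labels

# First PLS direction: maximiser of (w'X'y)^2 subject to ||w|| = 1,
# which is X'y normalised (Cauchy-Schwarz).
w_pls = X.T @ y
w_pls /= np.linalg.norm(w_pls)

# Top eigenvector of X'yy'X, the matrix appearing in the Rayleigh quotient
# formulation of equation (1) with a uniform initial distribution.
_, eigvecs = np.linalg.eigh(np.outer(X.T @ y, X.T @ y))
w_kb = eigvecs[:, -1]                  # eigenvector of the largest eigenvalue

print(abs(w_pls @ w_kb))               # ~1.0: the directions agree up to sign
```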

Sparse KPLS [6] is a variant of kernel PLS (KPLS) which preserves sparsity on the training set. Data centring, computation of projection vectors and deflation are all non-sparse operations in PLS. Sparse KPLS replaces them with sparse equivalents; in particular, the deflation step on the data is performed on the rows instead of the columns. This implies that conjugacy of the projection directions with respect to the data is lost, although the projection directions remain orthogonal. Dual sparsity on the projection vectors is achieved using an ε-insensitive loss function similar to that used in ν-support vector regression. A refinement of sparse KPLS is given in [7] whereby projection vectors can be computed without solving a nonlinear optimisation problem. Instead, a heuristic is used to approximate the sparse solutions, one which can be applied to the dual form without requiring the full kernel matrix in memory. This alternative approach to sparse KPLS is generalised in sparse kernel BLF (sparse KBLF) [7] to allow for the use of arbitrary loss functions.

3. PRIMAL FEATURE EXTRACTION

Here we begin with a description of the primal general feature extraction framework and go on to illustrate how it is a generalisation of PCA, PLS and BLF. First of all, consider Algorithm 1, which shows the general feature extraction method. Essentially, the method operates iteratively, selecting a new feature direction $u_j$ at iteration $j$ and then deflating the data matrix $X_j$ by projecting its columns into the space orthogonal to $X_j u_j$. The projection matrix for a normalised vector $w$ is given by $(I - ww')$. Hence, the deflation of $X_j$ is defined as follows:

$$X_{j+1} = \left(I - \frac{X_j u_j u_j' X_j'}{u_j' X_j' X_j u_j}\right) X_j. \qquad (2)$$
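For concreteness, here is a minimal NumPy sketch of the deflation step of equation (2); the function name is ours and the snippet assumes plain dense matrices.

```python
import numpy as np

def deflate(Xj, uj):
    """One deflation step of equation (2): project the columns of Xj
    into the space orthogonal to the extracted feature Xj @ uj."""
    t = Xj @ uj                      # feature values on the training set
    return Xj - np.outer(t, t) @ Xj / (t @ t)
```

Applying this step repeatedly with the chosen directions $u_1, u_2, \dots$ produces the sequence of deflated matrices $X_1, X_2, \dots$ used throughout the framework.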

Algorithm 1 Pseudo code for primal general feature extraction.
Inputs: Data matrix $X \in \mathbb{R}^{\ell \times n}$, dimension $k$, target vectors $Y \in \mathbb{R}^{\ell \times m}$
Process:
  1. $X_1 = X$
  2. for $j = 1, \dots, k$
     (a) select $u_j$ from the span of the rows of $X$
     (b) $X_{j+1} = \left(I - \frac{X_j u_j u_j' X_j'}{u_j' X_j' X_j u_j}\right) X_j$
  3. end
Output: Directions $u_j$ and features $X_j u_j$, $j = 1, \dots, k$.

There is one requirement that we impose on the choice of $u_j$: it should be in the row space of $X$, i.e. $u_j = X'\beta_j$ for some $\beta_j$. This ensures a dual representation of the features $X_j u_j$. These features are orthogonal, since they are a linear combination of the columns of $X_j$, which have been repeatedly projected into the orthogonal complement of the previous $X_i u_i$, for $i < j$. The apparent disadvantage of this simple description is that the feature directions $u_j$ are defined relative to the deflated matrix $X_j$. We would, however, like to be able to compute the extracted features directly from the original feature vectors, so that one can extract features from a test set, for example. The features of a test point with feature vector $\phi(x)$ are given by $\hat{\phi}(x)' = \phi(x)' U (P'U)^{-1}$, where the matrices $U$ and $P$ have as their columns the vectors $u_j$ and $p_j = X_j' X_j u_j / u_j' X_j' X_j u_j$, $j = 1, \dots, k$, respectively.

The derivation of this result is identical to that of PLS; see [8] for details. The vectors of feature values across the training set, $X_j u_j$, are orthogonal and can be written as $XU(P'U)^{-1}$, hence $(U'P)^{-1} U' X' X U (P'U)^{-1}$ is a diagonal matrix. This conjugacy of the projection directions with respect to the deflated matrices ensures that the resulting features are as dissimilar as possible and that regression coefficients can be computed efficiently.

3.1. Specialisations

The application of the general framework to PCA [1] is straightforward. If solving iteratively, one simply chooses $u_j$ to be the first eigenvector of $X_j' X_j$. The vectors $u_1, \dots, u_k$ extracted by this process are exactly the first $k$ eigenvectors of $X'X$, and hence those needed for projecting the data as in the PCA method. In this case, the effect of the deflation of $X_j$ at each iteration can be seen as shrinking the largest eigenvalue of $X_j' X_j$ to zero. Clearly the resulting features are the same as those found by PCA.

If we consider the PLS algorithm introduced in [2], then this is also easily placed within the framework. In this case the vector $u_j$ at each iteration is chosen to be the first singular vector of $X_j'Y$, where $Y$ is the matrix whose rows are the output labels. For the PLS case, where one wishes not just to select features but to compute the overall regression coefficients for the primal PLS problem, these can be computed as $W = U(P'U)^{-1} C'$, where $C$ is the matrix with columns $c_j = Y' X_j u_j / u_j' X_j' X_j u_j$.

To position BLF within the feature extraction framework we require $u_j = X_j' w_j / \sqrt{w_j' X_j X_j' X_j X_j' w_j}$, where $w_j$ is the negative loss gradient. The scaling of $u_j$ is required so that the resulting features have unit norm. Since the BLF deflation strategy is identical to that of the general framework, the resulting features are also the same. BLF additionally computes the regression coefficients according to the loss function as $\alpha = U(P'U)^{-1} c$, where $c$ is a vector of regression coefficients of the deflated matrices.
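The following NumPy sketch puts Algorithm 1 together with the test-point projection $\hat{\phi}(x)' = \phi(x)' U (P'U)^{-1}$ described above. It is an illustrative implementation under our own naming (select_u and pls_direction are not identifiers from the paper); the PLS choice of $u_j$ is included only as one example of the arbitrary selection criterion.

```python
import numpy as np

def primal_feature_extraction(X, Y, k, select_u):
    """Sketch of Algorithm 1: pick a direction u_j, record p_j, then deflate."""
    Xj = X.astype(float).copy()
    U, P, T = [], [], []
    for _ in range(k):
        u = select_u(Xj, Y)                        # arbitrary criterion (PCA, PLS, ...)
        t = Xj @ u                                 # extracted feature X_j u_j
        p = Xj.T @ t / (t @ t)                     # p_j = X_j'X_j u_j / u_j'X_j'X_j u_j
        Xj = Xj - np.outer(t, t) @ Xj / (t @ t)    # deflation, equation (2)
        U.append(u); P.append(p); T.append(t)
    U, P, T = np.column_stack(U), np.column_stack(P), np.column_stack(T)
    proj = U @ np.linalg.inv(P.T @ U)              # a test point x maps to x @ proj
    return proj, T

def pls_direction(Xj, Y):
    """PLS specialisation: first left singular vector of X_j'Y."""
    Ymat = np.asarray(Y, dtype=float).reshape(len(Y), -1)
    u, _, _ = np.linalg.svd(Xj.T @ Ymat, full_matrices=False)
    return u[:, 0]

# Example usage (illustrative):
#   proj, T = primal_feature_extraction(X, y, k=3, select_u=pls_direction)
#   new_features = X_test @ proj
```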

4. KERNEL FEATURE EXTRACTION

In this section we give a dual variable formulation of the framework, which allows feature extraction methods to be used in conjunction with kernels. Furthermore, we demonstrate how the dual framework specialises to KPCA, KPLS and KBLF.

As we are using kernels, we do not want to compute the feature vectors explicitly. Therefore we must work with the kernel matrix $K_j$ at each stage. Given a choice of dual variables $\beta_j$, let $\tau_j = X_j u_j = K_j \beta_j$. The deflation of $X_j$ is then given by $X_{j+1} = \left(I - \tau_j \tau_j' / \tau_j' \tau_j\right) X_j$, hence the kernel matrix can be deflated in the following way:

$$K_{j+1} = \left(I - \frac{\tau_j \tau_j'}{\tau_j' \tau_j}\right) K_j, \qquad (3)$$

which is computable without explicit feature vectors.

In the primal case we were able to give a closed-form expression for the projection of a new test point, and we would like to obtain a similar result for the dual variable case. The derivation of the projection of a new point starts with the primal representation and closely resembles that of kernel PLS given in [8]. In our case $U = X'B$ and $P = X'T(T'T)^{-1}$, where $B$ is the matrix with columns $\beta_j$ and $T$ is the matrix with columns $\tau_j$, $j = 1, \dots, k$. Hence the final features are given by

$$\phi(x)' U (P'U)^{-1} = k' B \left((T'T)^{-1} T' K B\right)^{-1}, \qquad (4)$$

where $k$ is a vector of inner products between the test point and the training examples. The pseudo code for the general kernel feature extraction framework is given in Algorithm 2.

Algorithm 2 Pseudo code for general kernel feature extraction.
Input: Kernel matrix $K \in \mathbb{R}^{\ell \times \ell}$, dimension $k$, target vectors $Y \in \mathbb{R}^{\ell \times m}$
Process:
  1. $K_1 = K$
  2. for $j = 1, \dots, k$
     (a) choose weighting coefficients $\beta_j \in \mathbb{R}^{\ell}$
     (b) let $\tau_j = K_j \beta_j$
     (c) $K_{j+1} = \left(I - \frac{\tau_j \tau_j'}{\tau_j' \tau_j}\right) K_j$
  3. end
Output: Weightings $\beta_j$, and $\tau_j$, $j = 1, \dots, k$.

4.1. Specialisations

Kernel PCA was introduced in [9] and is a well-known method for feature extraction. In this case, one usually performs a full eigen-decomposition of the kernel matrix, $K = V\Lambda V'$, and then uses the first $k$ eigenvectors to project the data onto the first $k$ principal components. If at each iteration we choose $\beta_j$ to be $v_1/\sqrt{\lambda_1}$, where $v_1, \lambda_1$ are the first eigenvector-eigenvalue pair of $K_j$, and then deflate, it can be shown that the matrix $B$ corresponds to the first $k$ columns of $V\Lambda^{-\frac{1}{2}}$, and equation (4) gives the required result $\phi(x)' U (P'U)^{-1} = k'B$. As we are extracting the features iteratively, one does not need to perform a full eigen-decomposition of the kernel matrix; other techniques such as the Power method can be used to efficiently extract the first eigenvector at each iteration.
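Below is a small NumPy sketch of Algorithm 2 together with the dual projection of equation (4), using the KPCA choice of $\beta_j$ described above as an example. The function names are ours and the code assumes a dense positive semi-definite kernel matrix.

```python
import numpy as np

def kernel_feature_extraction(K, k, choose_beta):
    """Sketch of Algorithm 2: pick dual directions beta_j and deflate K
    with equation (3), keeping beta_j and tau_j = K_j beta_j."""
    Kj = K.astype(float).copy()
    B, T = [], []
    for _ in range(k):
        beta = choose_beta(Kj)                              # arbitrary dual direction
        tau = Kj @ beta                                     # tau_j = K_j beta_j
        Kj = Kj - np.outer(tau, tau) @ Kj / (tau @ tau)     # deflation, equation (3)
        B.append(beta); T.append(tau)
    return np.column_stack(B), np.column_stack(T)

def project_new_points(k_test, K, B, T):
    """Equation (4): phi_hat(x)' = k' B ((T'T)^{-1} T' K B)^{-1}."""
    M = np.linalg.inv(np.linalg.inv(T.T @ T) @ T.T @ K @ B)
    return k_test @ B @ M

def kpca_beta(Kj):
    """KPCA specialisation: beta_j = v_1 / sqrt(lambda_1) of K_j."""
    lam, V = np.linalg.eigh(Kj)                             # eigenvalues in ascending order
    return V[:, -1] / np.sqrt(lam[-1])
```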

The kernel variant of PLS was presented in [10]. In this case, we need to deflate both the $X$ and $Y$ matrices. The deflation of $Y$ in the primal version is redundant (see [8]), however in the dual case it is required in order to obtain the dual representations. The kernel matrix $K_j$ is deflated as per the general framework, and the same deflation is also applied to the columns of $Y_j$, i.e. $Y_{j+1} = \left(I - \tau_j \tau_j' / \tau_j' \tau_j\right) Y_j$, with $Y_1 = Y$. At each iteration $\beta_j$ is chosen to be the first eigenvector of $Y_j Y_j' K_j$, scaled so that $\beta_j' K_j \beta_j = 1$. We can compute the regression coefficients for KPLS using the orthogonality of $\tau_j$ to the earlier $\tau_i$, $i < j$: we obtain $c_j = Y_j' X_j u_j / u_j' X_j' X_j u_j = Y' \tau_j / \tau_j' \tau_j$, making the matrix of coefficients $C = Y' T (T'T)^{-1}$. Putting this together with equation (4), the dual regression variables are given by $\alpha = B(T'KB)^{-1} T' Y$. This expression is identical to that given in [10]. To place KPLS within the general framework, we therefore need to modify Algorithm 2 as follows. At the start we let $Y_1 = Y$, and at each iteration we choose $\beta_j$ to be the scaled first eigenvector of $Y_j Y_j' K_j$. After deflating $K_j$, we must also deflate $Y_j$ by the process outlined above.

As one might expect, kernel BLF is closely related to the kernel general feature extraction framework. Since the kernel matrix in KBLF is deflated on both sides, one must use $u_j = X_j' w_j = X' \left(I - T_j (T_j' T_j)^{-1} T_j'\right) w_j$, where $w_j$ is the negative loss gradient and $T_j$ is the matrix composed of the first $j$ columns of $T$. Clearly we have that $\beta_j = \left(I - T_j (T_j' T_j)^{-1} T_j'\right) w_j$ at each iteration, and features are scaled so that $\|\tau_j\|^2 = \beta_j' K_j' K_j \beta_j = 1$. Hence, the resulting features are computed as per equation (4), with $(T'T)^{-1} = I$, so that $\hat{\phi}(x)' = k' B (T'KB)^{-1}$.
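As a companion to the sketch above, the KPLS specialisation can be written as follows. This is our own illustrative rendering, not the authors' code; it assumes a positive semi-definite kernel matrix and uses a dense eigen-solver where the description only requires the leading eigenvector.

```python
import numpy as np

def kernel_pls_features(K, Y, k):
    """KPLS via the framework (sketch): beta_j is the leading eigenvector of
    Y_j Y_j' K_j, scaled so beta_j' K_j beta_j = 1, with Y deflated alongside K."""
    Ymat = np.asarray(Y, dtype=float).reshape(len(Y), -1)
    Kj, Yj = K.astype(float).copy(), Ymat.copy()
    B, T = [], []
    for _ in range(k):
        lam, V = np.linalg.eig(Yj @ Yj.T @ Kj)              # non-symmetric matrix
        beta = np.real(V[:, np.argmax(np.real(lam))])
        beta = beta / np.sqrt(beta @ Kj @ beta)             # beta_j' K_j beta_j = 1
        tau = Kj @ beta
        Kj = Kj - np.outer(tau, tau) @ Kj / (tau @ tau)     # deflate K, equation (3)
        Yj = Yj - np.outer(tau, tau) @ Yj / (tau @ tau)     # deflate Y alongside
        B.append(beta); T.append(tau)
    B, T = np.column_stack(B), np.column_stack(T)
    alpha = B @ np.linalg.inv(T.T @ K @ B) @ T.T @ Ymat     # alpha = B (T'KB)^{-1} T'Y
    return B, T, alpha
```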

5. SPARSE FEATURE EXTRACTION

One drawback of the kernel-based feature extraction methods considered so far is that they are not sparse: to obtain projections for new test points, all training examples are often needed. Sparseness, however, is a desirable property, providing computation and efficiency benefits, and as such one is often willing to accept a small reduction in performance if a high degree of sparseness is achieved. Here, we outline how sparsity can be achieved using a further constraint on the projection directions of the dual general framework, and derive two new algorithms using this approach.

To achieve a sparse representation of the projection vectors, the dual vector $\beta_j$ is chosen so that it has only one non-zero entry, hence after $k$ iterations only $k$ training examples will contribute to the projections. If $\beta_j$ has a non-zero entry at its $i$th position, then $\tau_j = X_j u_j = K_j \beta_j$ is a scalar multiple of the $i$th column of $K_j$, and the selection of $i$ can be computed without requiring the full kernel matrix in memory. Although all training examples are used to compute the final projection matrix, from equation (4) it is evident that the evaluation of a new test point will require only $k$ kernel evaluations and the precomputed matrix $\left((T'T)^{-1} T' K B\right)^{-1}$.

We now derive two new sparse feature extraction algorithms, Sparse Maximal Alignment (SMA) and Sparse Maximal Covariance (SMC), using the given approach to sparsity. These algorithms target features towards the labels and hence are useful for dimensionality reduction for prediction.

Our first sparse algorithm, SMA, is based on the notion of kernel target alignment [4], given by $A(K, yy') = y'Ky/(y'y \, \|K\|_F)$, where $\|\cdot\|_F$ is the Frobenius norm. Intuitively, this alignment score is a measure of similarity between the kernel matrix $K$ and the "ideal" kernel matrix $yy'$. SMA is derived by maximising the kernel target alignment of the kernel matrix given by projecting $X_j$ onto $u_j$, subject to the sparsity constraint described earlier. This kernel matrix is given by $X_j u_j u_j' X_j' = K_j \beta_j \beta_j' K_j'$, and has an alignment of $(y' K_j \beta_j)^2 / (y'y \, \beta_j' K_j' K_j \beta_j)$. At each iteration, this quantity is maximised whilst constraining $\beta_j$ to one non-zero element, scaled so that $u_j' u_j = \beta_j' K \beta_j = 1$.

The second sparse algorithm, SMC, maximises the empirical expectation of the covariance between the examples and their labels, subject to the sparsity constraint. This quantity is given by $\hat{E}[y \, \phi_j(x)' u_j] = \frac{1}{\ell} u_j' X_j' y = \frac{1}{\ell} \beta_j' K_j y$, where $\hat{E}[\cdot]$ denotes the empirical expectation of a random variable. The non-zero element of $\beta_j$ corresponds to the maximum element of $\mathrm{diag}(K)^{-\frac{1}{2}} K_j' y$, where $\mathrm{diag}(K)$ is the diagonal matrix with entries $\mathrm{diag}(K)_{ii} = K_{ii}$, $i = 1, \dots, \ell$. Note that the above definition of covariance assumes that the data is centred; however, sparsity is often lost through centring. Chapter 6 of [8] provides an explanation of why centring is not required, and the same argument applies in our case.
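The selection rules for SMC and SMA reduce to scoring the columns of $K_j$, which is what makes them cheap: only the candidate columns and the diagonal of $K$ are needed. The sketch below (our own naming and structure, not the authors' implementation) shows the per-iteration choice of the single non-zero entry of $\beta_j$; it plugs into the kernel framework sketch given earlier.

```python
import numpy as np

def smc_index(Kj, K_diag, y):
    """SMC: maximise the covariance score diag(K)^(-1/2) K_j' y over columns."""
    return int(np.argmax((Kj.T @ y) / np.sqrt(K_diag)))

def sma_index(Kj, y):
    """SMA: maximise the alignment (y'K_j e_i)^2 / (y'y ||K_j e_i||^2);
    the scale of beta_j cancels in this ratio."""
    num = (Kj.T @ y) ** 2
    den = (y @ y) * np.sum(Kj ** 2, axis=0)
    return int(np.argmax(num / den))

def sparse_beta(Kj, K_diag, y, use_alignment=True):
    """Single non-zero dual direction, scaled so that beta_j' K beta_j = 1."""
    i = sma_index(Kj, y) if use_alignment else smc_index(Kj, K_diag, y)
    beta = np.zeros(Kj.shape[0])
    beta[i] = 1.0 / np.sqrt(K_diag[i])
    return beta
```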

6. EXPERIMENTAL RESULTS

As subspace methods are often used as a precursor to prediction, we compare the prediction performance of the features extracted by our sparse algorithms to those generated by PLS, Kernel Boosting, BLF and sparse KPLS. We start by generating features from several UCI datasets and use those features to train KNN and SVM classifiers. This is followed by a performance evaluation of our sparse algorithms and sparse KPLS on a sample of the Reuters Corpus Volume 1 dataset [11].

Features are extracted by each subspace method on the Ionosphere, Wisconsin Diagnostic Breast Cancer (WDBC), SPECTF heart, and Sonar datasets and used to train KNN and linear SVM classifiers. The datasets are first preprocessed by centring and normalising their attributes. With PLS, SMC and SMA we also use the Radial Basis Function (RBF) kernel, given by $\kappa(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / 2\sigma\right)$, with kernel width $\sigma$ selected from 0.5 to 128. Each subspace method is run with a classifier using two repetitions of 5-fold cross validation, with an inner 5-fold cross validation loop used to select both classifier and subspace method parameters. The number of iterations of each method is varied from 1 to the rank of the data, except for Kernel Boosting, which can iterate further and was allowed up to 1000 iterations. Sparse KPLS has an additional sparsity parameter ν, which is varied between 0 and 1. Kernel Boosting and Least Absolute Deviation (LAD) loss BLF were considerably slower than the other algorithms and are run using 3-fold cross validation, with an inner 3-fold cross validation used for parameter selection. Experimental code is implemented in Matlab and the LIBSVM package is used for computing SVM models.
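For reference, the RBF kernel used in these experiments has the simple form below; this sketch is only meant to pin down the convention $\exp(-\|x_i - x_j\|^2 / 2\sigma)$, with $\sigma$ the width parameter selected by cross validation.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma):
    """k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * sigma)), as in the text."""
    sq_dists = (np.sum(X1 ** 2, axis=1)[:, None]
                + np.sum(X2 ** 2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return np.exp(-sq_dists / (2.0 * sigma))
```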

Method        | Ionosphere   | Sonar        | SPECTF       | WDBC
KNN           | .139 ±.021   | .156 ±.059   | .245 ±.048   | .038 ±.009
PLS           | .104 ±.026   | .156 ±.046   | .211 ±.044   | .029 ±.009
SMC           | .099 ±.020   | .124 ±.063   | .221 ±.056   | .033 ±.008
SMA           | .111 ±.020   | .149 ±.067   | .238 ±.050   | .032 ±.010
LS BLF        | .090 ±.030   | .227 ±.050   | .236 ±.040   | .038 ±.014
ExpLoss BLF   | .090 ±.018   | .232 ±.066   | .242 ±.046   | .036 ±.016
LogLoss BLF   | .096 ±.029   | .220 ±.066   | .215 ±.053   | .035 ±.014
LAD BLF       | .114 ±.018   | .246 ±.063   | .247 ±.059   | .028 ±.008
ExpLoss KB    | .131 ±.041   | .237 ±.087   | .202 ±.045   | .045 ±.019
LogLoss KB    | .114 ±.031   | .224 ±.046   | .217 ±.057   | .025 ±.012
Sparse KPLS   | .081 ±.041   | .173 ±.048   | .232 ±.053   | .035 ±.018
RBF PLS       | .097 ±.047   | .144 ±.039   | .211 ±.033   | .030 ±.012
RBF SMC       | .107 ±.025   | .166 ±.044   | .198 ±.034   | .035 ±.013
RBF SMA       | .107 ±.025   | .173 ±.047   | .213 ±.022   | .033 ±.016

Table 1. Errors and standard deviations using a KNN classifier on the extracted features. KB is the Kernel Boosting algorithm and LS is Least Squares loss.

Tables 1 and 2 summarise the results. SMC compares well with PLS in both the primal and dual cases, despite its sparsity. In the primal space, SMA and SMC attain lower error rates than sparse KPLS on average. Kernel Boosting gives good results, particularly with the SVM, although this is often at the expense of a dimensionality which is higher than that of the original space. Amongst the algorithms using the RBF kernel, PLS is superior to the other methods in many cases; however, RBF SMC also gives good results.

Method        | Ionosphere   | Sonar        | SPECTF       | WDBC
SVM           | .136 ±.032   | .244 ±.053   | .270 ±.044   | .021 ±.007
PLS           | .124 ±.021   | .249 ±.050   | .274 ±.052   | .027 ±.010
SMC           | .117 ±.027   | .232 ±.071   | .283 ±.051   | .028 ±.010
SMA           | .123 ±.024   | .227 ±.080   | .309 ±.048   | .027 ±.013
LS BLF        | .131 ±.039   | .261 ±.065   | .294 ±.079   | .029 ±.014
ExpLoss BLF   | .134 ±.026   | .285 ±.046   | .289 ±.065   | .035 ±.019
LogLoss BLF   | .119 ±.026   | .268 ±.033   | .289 ±.060   | .032 ±.017
LAD BLF       | .128 ±.009   | .319 ±.014   | .206 ±.006   | .026 ±.005
ExpLoss KB    | .126 ±.028   | .207 ±.084   | .217 ±.017   | .030 ±.003
LogLoss KB    | .111 ±.029   | .232 ±.089   | .199 ±.039   | .028 ±.011
Sparse KPLS   | .139 ±.029   | .251 ±.055   | .342 ±.029   | .027 ±.014
RBF PLS       | .083 ±.039   | .180 ±.062   | .242 ±.061   | .020 ±.011
RBF SMC       | .090 ±.036   | .190 ±.060   | .260 ±.033   | .027 ±.010
RBF SMA       | .084 ±.030   | .217 ±.070   | .272 ±.052   | .024 ±.013

Table 2. Errors and standard deviations using a linear SVM on the extracted features.

6.1. Scalability

On large datasets, PLS, Kernel Boosting and BLF are inefficient since they operate on the entire kernel matrix at each iteration. SMC, SMA and sparse KPLS only require part of the kernel matrix in memory, and here we demonstrate their scalability by running them on the Reuters Corpus Volume 1 news database. The full database consists of about 800,000 news articles (covering a whole year), but only 5000 examples are considered, drawn from the first three months of the Economics branch. As a pre-processing step, the Porter stemmer has been used to reduce all words to their stems, which yields 136,469 distinct terms. Labels are assigned according to whether the articles are about the topic "Government Finance", with approximately 37% of the examples positively labelled. Outliers are removed and all features are normalised.

Features extracted on this dataset by sparse KPLS, SMC and SMA are compared using a linear SVM. Sparse KBLF is not included in the comparison since [7] shows that it performs worse than an SVM. Each subspace algorithm is run using 100 to 300 iterations in steps of 100, and the SVM penalty parameter is chosen from 9 values. Sparse KPLS is run using the Maximal Residual (MR) heuristic [7], a kernel cache size of 1000 and three values of ν. Each method is run using 3-fold cross validation with an inner 2-fold cross validation loop for model selection, except for sparse KPLS, which was slower and was run with only 2 outer folds. Being standard in information retrieval, the average precision was recorded for each method. It is defined as the cumulative precision after each relevant document is retrieved, divided by the number of relevant documents, where precision is the proportion of relevant documents among all the documents retrieved.

With large datasets $K_j$ often cannot be stored in memory, and our sparse algorithms are implemented with this in mind. In the case of SMA, the 1500 columns of $K_j$ with the highest alignment are stored. If at any iteration the maximum alignment drops below that of the original 1500th column, the columns are reselected from the entire kernel matrix. This heuristic is intuitive since the alignment of the columns of the kernel matrix drops in general, and good performance is demonstrated in practice. A similar strategy and argument applies for SMC.
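The average precision used here can be computed directly from the definition above; a short sketch (our own helper, with hypothetical argument names) follows.

```python
import numpy as np

def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each relevant
    (positive) document, as defined in the text."""
    order = np.argsort(-np.asarray(scores))
    relevant = np.asarray(labels)[order] > 0
    ranks = np.arange(1, len(relevant) + 1)
    precision_at_hits = np.cumsum(relevant)[relevant] / ranks[relevant]
    return precision_at_hits.mean() if precision_at_hits.size else 0.0
```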

Method        | Average Precision | Projections | Sparsity
SVM           | .822 ±.015        | -           | -
SMA           | .847 ±.009        | 233         | 233
SMC           | .823 ±.009        | 300         | 300
Sparse KPLS   | .787 ±.020        | 100         | 2500

Table 3. Results on the Reuters dataset. Sparsity is the number of kernel evaluations required for the projection of a new test example.

Table 3 shows that our sparse methods outperform sparse KPLS, both in terms of average precision and in the number of kernel evaluations required for the projection of a new example. Both SMC and SMA achieve a higher average precision than the SVM using a much smaller dimensionality, with SMA demonstrating a more significant gain of 0.025.

7. CONCLUSIONS

This paper has proposed a general method of feature extraction and derived two sparse feature extraction algorithms based on maximising covariance and alignment. The prediction performance of the features generated by these sparse algorithms compares well with that of features extracted by related methods on several UCI datasets. On the Reuters Corpus Volume 1 dataset, SMA and SMC exceed the performance of sparse KPLS, and the features extracted by SMA improve the average precision of an SVM whilst requiring only a few kernel evaluations for the projection of a new test point.

8. REFERENCES

[1] Harold Hotelling, "Analysis of a complex of statistical variables into principal components," Journal of Educational Psychology, vol. 24, pp. 417–441 and 498–520, 1933.

[2] Herman Wold, "Estimation of principal components and related models by iterative least squares," in Multivariate Analysis, pp. 391–420, 1966.

[3] Michinari Momma and Kristin P. Bennett, "Constructing orthogonal latent features for arbitrary loss," in Feature Extraction, Foundations and Applications, Isabelle Guyon, Steve Gunn, Masoud Nikravesh and Lotfi Zadeh, Eds. Springer, 2005.

[4] Nello Cristianini, John Shawe-Taylor, André Elisseeff, and Jaz S. Kandola, "On kernel-target alignment," in Advances in Neural Information Processing Systems, 2001, pp. 367–373.

[5] Koby Crammer, Joseph Keshet, and Yoram Singer, "Kernel design using boosting," in Advances in Neural Information Processing Systems, 2002, pp. 537–544.

[6] Michinari Momma and Kristin P. Bennett, "Sparse kernel partial least squares regression," in Proceedings of the Sixteenth Annual Conference on Learning Theory, 2003, pp. 216–230.

[7] Michinari Momma, "Efficient computations via scalable sparse kernel partial least squares and boosted latent features," in Proceedings of the ACM International Conference on Knowledge Discovery in Data Mining, New York, NY, USA, 2005, pp. 654–659, ACM Press.

[8] John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, New York, NY, USA, 2004.

[9] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.

[10] Roman Rosipal and Leonard J. Trejo, "Kernel partial least squares regression in reproducing kernel Hilbert space," Journal of Machine Learning Research, vol. 2, pp. 97–123, 2001.

[11] Tony Rose, Mark Stevenson, and Miles Whitehead, "The Reuters Corpus Volume 1 - from yesterday's news to tomorrow's language resources," in Proceedings of LREC-02, 3rd International Conference on Language Resources and Evaluation, 2002, pp. 827–832.