arXiv:1601.04366v1 [cs.LG] 17 Jan 2016

Learning the kernel matrix via predictive low-rank approximations

Martin Stražar
Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, 1000 Ljubljana, Slovenia
Email: [email protected]

Tomaž Curk
Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, 1000 Ljubljana, Slovenia
Email: [email protected]

Abstract—Efficient and accurate low-rank approximations of multiple data sources are essential in the era of big data. The scaling of kernel-based learning algorithms to large datasets is limited by the O(n²) cost of computing and storing the kernel matrix, which is assumed to be available in most recent multiple kernel learning algorithms. We propose a method to learn simultaneous low-rank approximations of a set of base kernels in regression tasks. We present the mklaren algorithm, which approximates multiple kernel matrices with least-angle regression in the low-dimensional feature space. The idea is based on geometrical concepts and does not assume access to full kernel matrices. The algorithm achieves linear complexity in both the number of data points and the number of kernels, while accounting for correlations between kernel matrices. When an explicit feature space representation is available for a kernel, we use the relation between primal and dual regression weights to gain model interpretation. Our approach outperforms contemporary kernel matrix approximation approaches when learning with multiple kernels on standard regression datasets, and improves the selection of relevant kernels in comparison to multiple kernel learning methods.

INTRODUCTION

Kernel methods are popular in machine learning as they model relations between objects in feature spaces of arbitrary, even infinite, dimension [1]. To avoid working with explicit representations, algorithms are phrased to depend only on inner products in such feature spaces. Any symmetric positive semidefinite matrix can be associated with inner products between training data in some feature space. Such a kernel matrix is used in learning algorithms via the kernel trick. Besides the gain in accuracy from the richer representation, this is useful for learning with objects not representable as vectors, such as structured objects, strings, or trees. The availability of the inner product values for all pairs of data points presents a substantial drawback, as the computation and storage of the kernel matrix typically scale as O(n²) in the number of data instances. Kernel approximations are thus indispensable when learning on large datasets and can be roughly classified into two families: approximations of the kernel function and approximations of the kernel matrix. Direct approximation of the kernel function can achieve significant performance gains. A large body of work relies on approximating the frequently used Gaussian kernel, which has a rapidly decaying eigenspectrum,

as shown via Bochner's theorem and the concept of random features [2]. Recently, properties of matrices generated by the Gaussian kernel were further exploited to achieve sublinear complexity in the approximation rank [3]–[5]. In another line of work, low-dimensional features can be derived for translation-invariant kernels based on their Fourier transform [6]. This family of methods currently provides the most space- and time-efficient approximations, but it is limited to kernels of particular functional forms. Approximations of the kernel matrix are applicable to any symmetric positive-definite matrix, even if the underlying kernel function is unknown. They are termed data dependent and can be obtained via eigenvalue decomposition, selection of a subset of data points (the Nyström method), or Cholesky decomposition, in order to minimize the difference between the original matrix and its low-rank approximation [7]–[12]. However, side information such as target variables is commonly disregarded. Predictive decompositions use the target variables to obtain a supervised low-rank approximation via the Incomplete Cholesky Decomposition [13] or via minimization of Bregman divergence measures [14]. The choice of the optimal kernel for a given learning task is often non-trivial. Multiple kernel learning (MKL) methods learn the optimal weighted sum of kernel matrices with respect to the target variables, such as class labels [15]. Different kernels can thus be used to model the data, and their relative importance is assessed via predictive accuracy, offering insight into the domain of interest. Depending on the user-defined constraints, the resulting optimization problems are quadratic (QP) or semidefinite programs (SDP), assuming complete knowledge of the kernel matrices. Low-rank approximations have been used for MKL, e.g., by performing Gram-Schmidt orthogonalization followed by MKL [16]. In a recent attempt, the combined kernel matrix is learned via an efficient generalized forward-backward algorithm, however assuming that low-rank approximations are available beforehand [17]. Gönen et al. develop a Bayesian treatment for joint dimensionality reduction and classification, solved via gradient descent [18] or variational approximation [19], while assuming access to full kernel matrices and not exploiting their symmetric structure. In this work, we propose a joint treatment of low-rank kernel approximations and MKL.

We design mklaren, a greedy algorithm that couples the Incomplete Cholesky Decomposition and least-angle regression to simultaneously learn low-rank approximations and regression coefficients. Multiple kernel learning is performed implicitly by iteratively selecting the appropriate portions of the approximations to the feature space induced by the set of kernels. Finally, when an explicit feature space representation is available for a kernel, we use the relation between primal and dual regression weights to achieve model interpretation. In contrast to contemporary approaches, which rely on convex optimization or Bayesian methods to learn the combination of kernels, our approach relies solely on geometrical principles, which improves both simplicity and computational complexity. A common assumption when applying matrix approximation or MKL is that the resulting decomposition can only be applied in transductive learning [20], [21], i.e., prediction is possible only if test examples are available at the training phase. Here, we show that, by relating the Incomplete Cholesky decomposition and the Nyström method, it is possible to circumvent this apparent limitation and predict the value of any data point, even if it was not part of the training set. The resulting method has linear computational complexity in the number of data points as well as kernels, and compares favorably against related low-rank kernel matrix approximation and state-of-the-art MKL approaches. The predictive performance and model interpretation are evaluated empirically on multiple benchmark regression datasets. Additionally, we isolate the effect of low-rank approximation and compare the method with full kernel matrix MKL methods on a very large set of rank-one kernels.

MULTIPLE KERNEL LEARNING WITH LEAST-ANGLE REGRESSION

Let {x1, x2, ..., xn} be a set of points in a Hilbert space X, associated with targets y ∈ Rⁿ. Let the Hilbert spaces X1, X2, ..., Xp be isomorphic to X and endowed with respective inner product (kernel) functions k1, k2, ..., kp. The kernels kq are positive definite and map from Xq × Xq to R. Hence, a data point x ∈ Xq can be represented in multiple inner product spaces, which may correspond to different data views or representations. The goal of a supervised approximation algorithm is to learn the corresponding low-rank approximations G1, G2, ..., Gp and the regression line µ ∈ span{G1, G2, ..., Gp} such that ‖y − µ‖ is minimal. The mklaren algorithm simultaneously learns low-rank approximations to the kernel matrices Kq associated with each kernel function kq using a variant of the well-known Incomplete Cholesky Decomposition (ICD). This family of methods produces a finite sequence of low-rank approximations G⁽¹⁾, G⁽²⁾, ..., G⁽ⁱ⁾, such that

$$G^{(i)} G^{(i)T} \to K \quad \text{as} \quad i \to n. \tag{1}$$

We propose an innovative view of the pivot column selection criterion, which is closely related to the selection of variables in least-angle regression (LAR) [22]. At each step, the method keeps a current estimate of the regression line µ, obtained by moving along the direction of the vectors within the current approximation(s). In comparison to existing methods, it has the following advantages. First, the method is aware of multiple kernel functions: in each step, the next pivot is chosen greedily from all remaining pivots of all kernels. Accordingly, kernels that give more information about the current residual are more likely to be selected. In contrast to methods that assume access to the complete kernel matrices, the importance of a kernel is determined already at the time of approximation. Second, the explicit notion of approximation error is completely abolished from the pivot selection criterion. Since in the limit the approximations are exact (Eq. 1), we consider only the gain with respect to the current residual y − µ. Although an accurate approximation yields a model close to the one obtained with the full kernel function [23], it was recently shown that i) the expected generalization error is related to the maximal marginal degrees of freedom rather than the approximation error, and ii) empirically, low-rank approximation can result in a regularization-like effect [24]. The high-level pseudo code of the mklaren algorithm is shown in Algorithm 1. Each step is described in detail in the following.

Simultaneous Incomplete Cholesky decompositions

Each kernel function kq is associated with its corresponding kernel matrix Kq, which is in turn associated with its low-rank approximation Gq. Let G denote any of the matrices G1, G2, ..., Gp. Initially, G is set to 0, the diagonal vector d = diag(K), and the active set A = ∅. At step j, a pivot i is selected from the remaining set J = {1, 2, ..., n} \ A, and its corresponding pivot column gj = G(:, j) is computed as follows:

$$G(i, j) = \sqrt{d(i)}, \qquad G(\mathcal{J}, j) = \frac{1}{G(i, j)} \Big( K(\mathcal{J}, i) - \sum_{l=1}^{j-1} G(\mathcal{J}, l)\, G(i, l) \Big). \tag{2}$$

The selected pivot is added to the active set, and the counter j and the diagonal vector are updated:

$$d = d - g_j^2, \qquad \mathcal{A} = \mathcal{A} \cup \{i\}, \qquad j = j + 1. \tag{3}$$

Importantly, only the information in one column of K is required at each step, and GGᵀ is never computed explicitly. The quality of the approximation is determined by the subset of pivot columns used. The pivots are selected greedily, based on a heuristic; a minimal sketch of a single Cholesky step is given below.
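For concreteness, the following is a minimal numpy sketch of one incomplete Cholesky step (Eqs. 2–3), assuming access to a routine `kernel_column(i)` that returns the single column K(:, i); the function and variable names are hypothetical.

```python
import numpy as np

def cholesky_step(G, d, active, i, kernel_column):
    """One incomplete Cholesky pivot step: appends column g to G and updates d, active."""
    n = len(d)
    g = np.zeros(n)
    g[i] = np.sqrt(d[i])                               # G(i, j) = sqrt(d(i))
    J = [t for t in range(n) if t not in active and t != i]
    col = kernel_column(i)                             # K(:, i), the only kernel access needed
    g[J] = (col[J] - G[J] @ G[i]) / g[i]               # Eq. (2)
    d[:] = d - g ** 2                                  # Eq. (3)
    active.append(i)
    return np.hstack([G, g[:, None]]), d, active
```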

Various options for this heuristic exist, such as a lower bound on the gain in reduction of the approximation error [12], or a weighted cost function considering both the approximation error and the targets [13]. In the method proposed by Fine & Scheinberg [12], i is selected according to the maximum value in d, which represents a lower bound on the gain in approximation error. One could perform the decomposition for each kernel kq independently and subsequently weight the kernels in the spirit of MKL (e.g., see [16]). Instead, our pivot selection criterion takes into account the current residual as well as all other kernels. This is achieved as follows; consider the subspace spanned by the union of columns within all p approximations:

$$H = \begin{pmatrix} G_1 & G_2 & \cdots & G_p \end{pmatrix} = \begin{pmatrix} g_{1,1} \cdots g_{1,j_1} & g_{2,1} \cdots g_{2,j_2} & \cdots & g_{p,1} \cdots g_{p,j_p} \end{pmatrix}. \tag{4}$$

Here span(H) ⊂ Rⁿ and H ∈ R^{n × Σ_q j_q}. Treating H as a feature space in the context of regression enables greedy selection of further columns, to be added using the least-angle regression criterion.

Least-angle regression based pivot selection

Least-angle regression (LAR) is an active set method, originally designed for feature subset selection in linear regression [22]. The original method assumes access to all candidate columns to be added to the feature space, which makes combining ICD and LAR without a significant increase in complexity non-trivial. A thorough description of LAR is skipped here for brevity and can be found in the Appendix. The following is an adaptation of the LAR method to the Cholesky decomposition, where the exact values of the pivot columns to be added are unknown. Consider the matrix H in Eq. 4 and the set Jq = {1, 2, ..., n} \ Aq containing candidate pivot columns yet to be included in the decomposition of kq. The regression line µ is initialized to 0. For all columns gi in H, define the vectors

$$h_i = s_i P g_i, \tag{5}$$

where $P = I - \frac{\mathbf{1}\mathbf{1}^T}{n}$ is the centering projection matrix and si is the sign of the correlation (P gi)ᵀ(y − µ). The active matrix HA representing the modified feature space equals

$$H_A = \begin{pmatrix} \cdots & h_i & \cdots \end{pmatrix}. \tag{6}$$

By elementary linear algebra, there exists an equiangular vector uA with ‖uA‖ = 1, spanning equal angles, less than 90 degrees, between the residual y − µ and the vectors in HA. Moving the regression line µ along uA,

$$\mu^{\text{new}} = \mu + \gamma u_A, \tag{7}$$

causes the correlations ci = hiᵀ(y − µ)/‖hi‖ to change equally for all hi. Let q be the index of any kernel with a non-empty remaining set Jq. Define the quantities $C = \max_i \{c_i\}$ and $A = \big(\mathbf{1}^T (H_A^T H_A)^{-1} \mathbf{1}\big)^{-1/2}$. The gradient γq is then selected as follows:



$$\gamma_q = \min_{l \in \mathcal{J}_q}^{+} \left\{ \frac{C - c_l}{A - a_l}, \; \frac{C + c_l}{A + a_l} \right\}, \quad \text{where} \quad c_l = \frac{(P \hat{g}_{q,l})^T (y - \mu)}{\| P \hat{g}_{q,l} \|}, \quad a_l = \frac{(P \hat{g}_{q,l})^T u_A}{\| P \hat{g}_{q,l} \|}. \tag{8}$$

Here, ĝq,l is the approximation to the pivot column gq,l if pivot l were chosen to generate a new column in the decomposition of kq using Eq. 2, and min⁺ is the minimum over positive arguments for each choice of l. The selected pivot is the minimizer l of Eq. 8. Finally, the gradient γ and the next kernel q are selected by γ = min_q {γq}, and the pivot index l is the solution of Eq. 8 for the selected q. The approximation ĝq,l to gq,l is necessary to compute Eq. 8 efficiently and is described further below. The exact value of gq,l, selected as described above, remains unknown until the actual step of the Cholesky decomposition is computed using Eq. 2. This renders the value of γ invalid, i.e., the change in the correlations of the active set would not be the same after applying Eq. 7. To obtain the correct γ at this point, we recompute Eq. 8 with fixed kernel and pivot indices q and l, respectively, using the true value of gq,l. In the last step of Algorithm 1, as Σ_q jq reaches the maximum allowed rank K, the same selection procedure is followed. After the true value is computed for the last gq,l to be added to Gq (and consequently H), the gradient simplifies to γ = C/A, yielding the least-squares solution for HA and y.

Look-ahead decompositions

Eq. 8 assumes the availability of the correlations cl for each of the candidate pivots in the sets Jq. To avoid recomputing the Cholesky steps, we adapt the look-ahead decompositions proposed by Bach & Jordan [13]. By the definition of ICD in Eq. 2, the value of a newly added pivot column at step j and pivot i is

$$g_j = \frac{\big( K - \sum_{l=1}^{j-1} G(:, l)\, G(:, l)^T \big)(:, i)}{\sqrt{d(i)}}. \tag{9}$$

The computation of cl is made more efficient by replacing K in Eq. 9 with a look-ahead approximation G′ ∈ R^{n×(j+δ)}, using δ additional columns:

$$\hat{g}_i = \frac{\big( G' G'^T - \sum_{l=1}^{j-1} G(:, l)\, G(:, l)^T \big)(:, i)}{\sqrt{d(i)}} = \frac{G'(:, j{+}1{:}j{+}\delta)\, G'^T(j{+}1{:}j{+}\delta, i)}{\sqrt{d(i)}}. \tag{10}$$
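As an illustration, here is a minimal numpy sketch of the look-ahead estimate in Eq. 10 and of the centered correlation used in Eq. 8; the function names are hypothetical, and `G_prime` is assumed to hold the current factors plus the δ look-ahead columns.

```python
import numpy as np

def lookahead_column(G_prime, d, i, j, delta):
    # Eq. (10): estimate pivot column i from the delta look-ahead columns,
    # without touching the kernel matrix K (0-indexed slice of columns j .. j+delta-1).
    return G_prime[:, j:j + delta] @ G_prime[i, j:j + delta] / np.sqrt(d[i])

def centered_correlation(g_hat, r):
    # Correlation with the residual r = y - mu after applying P = I - (1/n) 1 1^T.
    Pg = g_hat - g_hat.mean()
    return (Pg @ r) / (np.linalg.norm(Pg) + 1e-12)
```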

Replacing K with G′ significantly reduces the cost of computing the quantities cl and al in Eq. 8. To see this, consider the computation of the correlation with the equiangular vector, al, for any (normalized) candidate pivot column ĝl. The LAR algorithm assumes that the predictor variables have mean zero and unit norm, which is not the case with pivot columns. This is solved by using the centering projection matrix $P = I - \frac{\mathbf{1}\mathbf{1}^T}{n}$ and letting

$$a_l = \frac{(P \hat{g}_l)^T u_A}{\| P \hat{g}_l \|}. \tag{11}$$

Inserting ĝl as in Eq. 10, the denominator 1/√d(i) cancels out. The norm ‖Pĝl‖ can be computed as

$$\| P \hat{g}_l \|^2 = \big( P\, G'(:, j{+}1{:}j{+}\delta)\, G'^T(j{+}1{:}j{+}\delta, l) \big)^T \big( P\, G'(:, j{+}1{:}j{+}\delta)\, G'^T(j{+}1{:}j{+}\delta, l) \big) = G'(l, j{+}1{:}j{+}\delta) \big( G'^T G' - G'^T \mathbf{1}\mathbf{1}^T G' \big)(j{+}1{:}j{+}\delta,\, j{+}1{:}j{+}\delta)\, G'(l, j{+}1{:}j{+}\delta)^T. \tag{12}$$

Similarly, the dot product with a vector is computed as

$$(P \hat{g}_l)^T u_A = u_A^T P\, G'(:, j{+}1{:}j{+}\delta)\, G'^T(j{+}1{:}j{+}\delta, l) = \big( u_A^T G'(:, j{+}1{:}j{+}\delta) - u_A^T \mathbf{1}\mathbf{1}^T G'(:, j{+}1{:}j{+}\delta) \big)\, G'(j{+}1{:}j{+}\delta, l). \tag{13}$$

Ordering the computation correctly yields a computational complexity of O(δ²) per column. Note that the matrices G′ᵀG′, G′ᵀ11ᵀG′, uAᵀG′, and uAᵀ11ᵀG′ are the same for all columns (independent of l) and need to be computed only once. Correlations with the residual r = y − µ are computed in an analogous manner.

Computing the regression weights

The regression weights β ∈ Rᴷ are computed from the regression line µ and the active feature space HA as defined in Eq. 4, using the relation

$$H_A \beta = \mu \implies \beta = R^{-1} Q^T \mu, \tag{14}$$

where HA = QR is the thin QR decomposition [25].

Computing the dual coefficients

Kernel ridge regression is often stated in terms of dual coefficients α ∈ Rⁿ, satisfying the relation

$$H_A^T \alpha = \beta. \tag{15}$$

This is an underdetermined system of equations. The vector α with minimal norm can be obtained by solving the following least-norm problem:

$$\begin{aligned} \text{minimize} \quad & \|\alpha\|_2 \\ \text{subject to} \quad & H_A^T \alpha = \beta. \end{aligned} \tag{16}$$

The problem has the analytical solution

$$\alpha = H_A (H_A^T H_A)^{-1} \beta. \tag{17}$$
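For reference, a small numpy sketch of Eqs. 14 and 17 follows (the function name is hypothetical; H_A and mu are assumed to be the active feature matrix and the fitted regression line).

```python
import numpy as np

def primal_and_dual_weights(H_A, mu):
    Q, R = np.linalg.qr(H_A)                          # thin QR of the active features, Eq. (14)
    beta = np.linalg.solve(R, Q.T @ mu)               # solves H_A beta = mu
    alpha = H_A @ np.linalg.solve(H_A.T @ H_A, beta)  # minimal-norm dual coefficients, Eq. (17)
    return beta, alpha
```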

Algorithm 1: The mklaren algorithm.

Input: {x1, x2, ..., xn} training set of objects in X; k1, k2, ..., kp kernel functions on X × X; y ∈ Rⁿ regression targets; λ regularization parameter; δ number of look-ahead columns; K maximum total rank.
Result: G1 ∈ R^{n×k1}, G2 ∈ R^{n×k2}, ..., Gp ∈ R^{n×kp} low-rank approximations; A1, A2, ..., Ap active sets of object indices; µ ∈ Rⁿ regression vector on the training set; β ∈ Rᴷ regression weights.

Initialize: residual r = y − ȳ, bisector b = 0, regression line µ = 0, active sets Aq = ∅ and counters jq = 0 for q ∈ {1, ..., p}. Compute standard Cholesky decompositions with δ look-ahead columns for G′1, G′2, ..., G′p.

while Σ_{q=1}^{p} jq < K do
    Select the (kernel, pivot index) pair (kq, j) that maximizes the LAR criterion (Eq. 8).
    Compute gqj with one Cholesky step for pivot j w.r.t. Gq and kq(·, ·). Add gqj to Gq.
    Recompute Cholesky steps for G′q at columns kq ... kq + δ.
    Compute the bisector uA of all active columns in G1, G2, ..., Gp (Eq. 11).
    Compute the gradient step γ using the LAR method (Eq. 8). The step is computed so that the resulting correlations remain all equal.
    Update the regression line µ = µ + γb.
    Increase the column counter jq = jq + 1. Add index j to the active set Aq.

Compute β using the QR decomposition with Eq. 14.
Compute the linear transform T using the Nyström approximation with Eq. 20.
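To convey the overall flow, the listing below is a schematic, simplified variant of Algorithm 1 rather than the algorithm itself: it evaluates every remaining candidate column exactly (no δ-column look-ahead, hence O(n²) work per iteration), scores (kernel, pivot) pairs by their absolute correlation with the current residual instead of the exact LAR criterion of Eq. 8, and refits the coefficients by ridge least squares after each step instead of taking the equiangular gradient step. All names are hypothetical; `kernels` is a list of callables kq(x, x′).

```python
import numpy as np

def mklaren_simplified(X, y, kernels, rank, lam=1e-3):
    """Greedy multi-kernel low-rank regression, loosely following Algorithm 1."""
    n = len(y)
    G = [np.zeros((n, 0)) for _ in kernels]                    # per-kernel Cholesky factors
    d = [np.array([k(x, x) for x in X]) for k in kernels]      # residual kernel diagonals
    active = [[] for _ in kernels]
    mu = np.zeros(n)
    for _ in range(rank):
        r = y - mu
        best = (-1.0, None, None, None)
        for q, k in enumerate(kernels):
            for i in set(range(n)) - set(active[q]):
                if d[q][i] <= 1e-10:
                    continue
                col = np.array([k(x, X[i]) for x in X])         # K_q(:, i)
                g = (col - G[q] @ G[q][i]) / np.sqrt(d[q][i])   # Eq. (2), exact candidate column
                score = abs(g @ r) / (np.linalg.norm(g) + 1e-12)
                if score > best[0]:
                    best = (score, q, i, g)
        _, q, i, g = best
        if q is None:
            break
        G[q] = np.hstack([G[q], g[:, None]])                    # add pivot column to G_q
        d[q] = d[q] - g ** 2                                    # Eq. (3)
        active[q].append(i)
        H = np.hstack(G)                                        # combined feature space, Eq. (4)
        beta = np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)
        mu = H @ beta                                           # refit instead of a LAR step
    return G, active, mu
```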

Obtaining the dual coefficients α can be useful if the explicit feature space Φq is available for a kernel q, which is the case for linear, polynomial, string, and other kernels. Regression coefficients can then be interpreted in that feature space by considering

$$\beta_{\Phi_q} = \Phi_q \alpha. \tag{18}$$

Moreover, if the vector α is sparse, only the relevant portions of Φq need to be computed. This condition can be additionally enforced by using techniques like matching pursuit when solving for α [26].

Prediction of unseen samples

We show a way to infer the Cholesky factors for examples not used during training, without explicitly repeating the Cholesky steps, so that the learned weights β can predict responses for test samples. To simplify notation, we show the principle for one kernel matrix and its corresponding Cholesky factors, but note that the idea trivially extends to the case of multiple kernels.

Nyström approximation. Let A = {i1, i2, ..., ik} be the active set of pivot indices. The Nyström approximation [11] of the kernel matrix K is defined as follows:

$$\tilde{K} = K(:, \mathcal{A})\, K(\mathcal{A}, \mathcal{A})^{-1}\, K(:, \mathcal{A})^T. \tag{19}$$

The selection of the indices constituting A crucially influences the performance of learning. Note that mklaren defines a method to construct this set. Let GGᵀ be an (incomplete) Cholesky decomposition using the same pivot indices in the set A. Consider the following proposition.

Proposition. The incomplete Cholesky decomposition with pivots A = {i1, i2, ..., ik} yields the same approximation as the Nyström approximation using the active set A.

The proof follows directly from [13]. There is a unique matrix L that: (i) is symmetric, (ii) has its column space spanned by K(:, A), and (iii) satisfies L(:, A) = K(:, A). It follows that both the incomplete Cholesky decomposition and the Nyström approximation result in the same approximation matrix L.

Corollary. Let G be the Cholesky factors obtained on the training set. The Cholesky factors G* for unseen examples x*1, x*2, ..., x*nt can be inferred using the linear transform

$$G^* = (G^T G)^{-1} G^T K(:, \mathcal{A})\, K(\mathcal{A}, \mathcal{A})^{-1}\, K(\mathcal{A}, *) = T\, K(\mathcal{A}, *). \tag{20}$$

The matrix T ∈ R^{K×K} is inexpensive to compute and can be stored permanently after training. Hence, the Cholesky factors G* can be computed from the inner products between the test and active sets, K(A, *), and the existing regression weights β can be applied to G* to obtain predictions.
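A small numpy sketch of Eq. 20 follows (hypothetical helper names; G holds the training Cholesky factors and the K(·, ·) arguments are kernel evaluations involving the active set A).

```python
import numpy as np

def nystrom_transform(G, K_train_active, K_active_active):
    # T = (G^T G)^{-1} G^T K(:, A) K(A, A)^{-1}, Eq. (20)
    return np.linalg.solve(G.T @ G, G.T @ K_train_active) @ np.linalg.inv(K_active_active)

def test_factors(T, K_active_test):
    # Cholesky factors for test points, transposed so that rows index test examples
    return (T @ K_active_test).T

# Predictions for test points then follow as: y_hat = test_factors(T, K_active_test) @ beta
```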

ℓ2-norm regularization

Regularization is achieved by constraining the norm of the weights ‖β‖. Zou and Hastie [27] prove the following lemma, which shows that the ℓ2-regularized regression problem can be stated as ordinary least squares via an appropriate augmentation of the dataset X, y.

Lemma. Define the augmented data set X*, y* as

$$X^* = (1 + \lambda)^{-1/2} \begin{pmatrix} X \\ \sqrt{\lambda}\, I \end{pmatrix}, \qquad y^* = \begin{pmatrix} y \\ 0 \end{pmatrix}.$$

The least-squares solution of X*, y* is then equivalent to ridge regression of the original data X, y with parameter λ.

For the proof, see Zou & Hastie [27]. The augmented dataset can be included in LAR to achieve the ℓ2-regularized solution.

Computational complexity: It is straightforward to analyze the exact upper bound on the computational complexity of the mklaren algorithm, described in Algorithm 1. The look-ahead Cholesky decompositions are standard Cholesky decompositions with δ pivots and complexity O(nδ²). The main loop is executed K times. The selection of kernel and pivot pairs is based on the LAR criterion, which includes inverting HAᵀHA of size K × K, thus having a complexity of O(K³). However, as each step is a rank-one modification to HAᵀHA, the Sherman-Morrison-Woodbury lemma [28] can be used to achieve complexity O(K²) per update. The computation of correlations with the bisector in Eq. 11 and with the residuals is performed for p kernels in O(npδ²). Recomputing δ Cholesky factors requires standard Cholesky steps of complexity O(nδ²). The computation of the gradient step has the same complexity as the selection step. Updating the regression line is O(n). The QR decomposition in Eq. 14 takes O(nK²) and the computation of the linear transform T in Eq. 20 is of O(K³ + nK²) complexity. Together, the final complexity is

$$O\big(n\delta^2 + K(K^2 + np\delta^2 + n\delta^2) + nK^2 + K^3\big) = O(K^3 + npK\delta^2), \tag{21}$$

which is linear in the number of data points and kernels.

EXPERIMENTS

In this section, we provide an empirical evaluation of the proposed methodology on well-known regression datasets. We first compare mklaren with several benchmark low-rank matrix approximation methods: the Incomplete Cholesky Decomposition (icd, [12]), Cholesky with Side Information (csi, [13]), and the Nyström method (Nyström, [11]). In the interest of generality, we omit comparisons with kernel function approximation methods limited to kernels of a particular functional form [2], [6] and with approximations of kernel matrices limited to the Gaussian kernel [3]–[5]. Finally, a comparison is provided with a family of state-of-the-art multiple kernel learning methods using the full kernel matrix on a sentiment analysis data set with a large number of rank-one kernels [29].

TABLE I: Comparison of regression performance (RMSE) on test sets via 5-fold cross-validation for different values of rank (K). Shown in bold in the original is the low-rank approximation method with lowest RMSE. Top: K=14. Middle: K=28. Bottom: K=42.

K = 14
Dataset     | mklaren        | csi            | icd            | Nyström        | uniform
boston      | 4.393 ± 0.432  | 4.762 ± 0.598  | 6.703 ± 0.354  | 6.611 ± 1.272  | 3.109 ± 0.274
kin         | 0.018 ± 0.000  | 0.025 ± 0.003  | 0.067 ± 0.006  | 0.065 ± 0.006  | 0.013 ± 0.000
pumadyn     | 1.252 ± 0.032  | 1.650 ± 0.169  | 4.024 ± 0.655  | 3.882 ± 0.803  | 1.210 ± 0.070
abalone     | 2.638 ± 0.116  | 2.768 ± 0.187  | 2.906 ± 0.222  | 2.939 ± 0.197  | 2.499 ± 0.118
comp        | 5.288 ± 0.461  | 7.520 ± 1.852  | 14.111 ± 1.123 | 13.763 ± 0.580 | 0.750 ± 0.203
ionosphere  | 0.283 ± 0.017  | 0.310 ± 0.022  | 0.380 ± 0.010  | 0.377 ± 0.015  | 0.292 ± 0.025
bank        | 0.036 ± 0.001  | 0.046 ± 0.005  | 0.101 ± 0.011  | 0.128 ± 0.010  | 0.034 ± 0.001
diabetes    | 54.680 ± 3.61  | 54.953 ± 3.018 | 63.715 ± 5.970 | 68.117 ± 3.947 | 62.142 ± 3.991

K = 28
Dataset     | mklaren        | csi            | icd            | Nyström        | uniform
boston      | 3.792 ± 0.454  | 4.481 ± 0.689  | 5.499 ± 0.680  | 5.677 ± 0.609  | 3.109 ± 0.274
kin         | 0.016 ± 0.001  | 0.018 ± 0.000  | 0.059 ± 0.008  | 0.054 ± 0.009  | 0.013 ± 0.000
pumadyn     | 1.257 ± 0.032  | 1.268 ± 0.035  | 3.552 ± 0.767  | 3.581 ± 0.660  | 1.210 ± 0.070
abalone     | 2.526 ± 0.097  | 2.591 ± 0.111  | 2.777 ± 0.194  | 2.820 ± 0.220  | 2.499 ± 0.118
comp        | 3.100 ± 0.942  | 5.318 ± 1.298  | 12.646 ± 0.548 | 11.288 ± 2.365 | 0.750 ± 0.203
ionosphere  | 0.234 ± 0.028  | 0.254 ± 0.030  | 0.341 ± 0.012  | 0.331 ± 0.016  | 0.292 ± 0.025
bank        | 0.035 ± 0.001  | 0.036 ± 0.001  | 0.067 ± 0.005  | 0.110 ± 0.012  | 0.034 ± 0.001
diabetes    | 55.580 ± 3.634 | 55.220 ± 3.56  | 58.793 ± 5.606 | 60.747 ± 2.377 | 62.142 ± 3.991

K = 42
Dataset     | mklaren        | csi            | icd            | Nyström        | uniform
boston      | 3.493 ± 0.489  | 4.191 ± 0.878  | 4.657 ± 0.664  | 5.220 ± 0.751  | 3.109 ± 0.274
kin         | 0.014 ± 0.000  | 0.018 ± 0.000  | 0.043 ± 0.018  | 0.040 ± 0.014  | 0.013 ± 0.000
pumadyn     | 1.255 ± 0.038  | 1.251 ± 0.027  | 3.015 ± 0.702  | 2.448 ± 0.742  | 1.210 ± 0.070
abalone     | 2.500 ± 0.110  | 2.545 ± 0.095  | 2.597 ± 0.128  | 2.702 ± 0.159  | 2.499 ± 0.118
comp        | 1.330 ± 0.409  | 4.791 ± 2.805  | 9.845 ± 2.085  | 9.744 ± 2.005  | 0.750 ± 0.203
ionosphere  | 0.221 ± 0.012  | 0.228 ± 0.018  | 0.304 ± 0.024  | 0.266 ± 0.033  | 0.292 ± 0.025
bank        | 0.034 ± 0.002  | 0.035 ± 0.001  | 0.042 ± 0.009  | 0.101 ± 0.020  | 0.034 ± 0.001
diabetes    | 55.628 ± 3.597 | 55.214 ± 4.03  | 56.608 ± 4.488 | 57.560 ± 2.425 | 62.142 ± 3.991

Comparison with low-rank approximations

The main proposed advantage of mklaren over established kernel matrix approximation methods is the simultaneous approximation of multiple kernels, which considers the current approximation to the regression line and greedily selects the next kernel and pivot to be included in the decomposition. To elucidate this, we performed a comparison on eight well-known regression datasets: abalone, bank, boston, comp-active, diabetes, ionosphere, kinematics, and pumadyn (available from the UCI repository, http://archive.ics.uci.edu/ml/, and the Delve collection, http://www.cs.toronto.edu/~delve/data/datasets.html). Similar to Cortes et al. [29], seven Gaussian kernels with different lengthscale parameters were used. The Gaussian kernel function is defined as k(x, y) = exp{−γ‖x − y‖²}, where the lengthscale parameter γ ranges over 2⁻³, 2⁻², ..., 2⁰, ..., 2³. For the approximation methods icd, csi, and Nyström, each kernel matrix was approximated independently using a fixed maximum rank K. The combined feature space of the seven kernel matrices was used with ridge regression. For mklaren, the approximation is defined simultaneously for all kernels and the maximum rank was set to 7K, i.e., seven times the maximum rank of the individual kernels used in icd, csi, and Nyström. Thus, the low-rank feature space of all four methods had exactly the same dimension. The uniform kernel combination (uniform) using the full kernel matrix was included as an empirical lower bound; a sketch of this pipeline is given below.
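The following is an illustrative sketch of the pipeline under stated simplifications, not the exact experimental code: seven Gaussian kernels with lengthscales 2⁻³ to 2³ are each reduced to rank-K features, concatenated, and used in ridge regression. For brevity, randomly chosen Nyström pivots stand in for the icd/csi/mklaren pivot selection, and the targets y are assumed to be centered; all names are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def lowrank_features(X, pivots, gamma):
    # Nystrom-style features G with G G^T ~ K(:, A) K(A, A)^{-1} K(:, A)^T
    K_np = gaussian_kernel(X, X[pivots], gamma)
    K_pp = gaussian_kernel(X[pivots], X[pivots], gamma)
    w, V = np.linalg.eigh(K_pp)
    W = V / np.sqrt(np.maximum(w, 1e-10))        # K(A, A)^{-1/2}
    return K_np @ W

def fit_multi_kernel_ridge(X, y, K=14, lam=1.0, seed=0):
    rng = np.random.default_rng(seed)
    feats = []
    for gamma in 2.0 ** np.arange(-3, 4):        # lengthscales 2^-3 ... 2^3
        pivots = rng.choice(len(X), size=K, replace=False)
        feats.append(lowrank_features(X, pivots, gamma))
    H = np.hstack(feats)                         # combined low-rank feature space
    beta = np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)
    return beta, H @ beta
```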

The performance was assessed using 5-fold cross-validation as follows. Initially, up to 1000 data points were selected randomly from each dataset. For each random split of the data set, a training set containing 60% of the data was used for kernel matrix approximation and fitting the regression model. A validation set containing 20% of the data was used to select the regularization parameter λ from the range 10⁻³, 10⁻², ..., 10⁰, ..., 10³. The final performance, reported as root mean squared error (RMSE), was obtained on the test set with the remaining 20% of the data. All variables were standardized and the targets y were centered to the mean. The look-ahead parameter δ was set to 10 for mklaren and csi. The results for different settings of K are shown in Table I. Not surprisingly, the performance of the supervised mklaren and csi is consistently superior to the unsupervised icd and Nyström. Moreover, mklaren outperforms csi on the majority of regression tasks, especially at lower settings of K. At higher settings of K, the difference begins to vanish, as all approximation methods obtain sufficient information about the feature space induced by the kernels. It is interesting to compare the utilization of the vectors in the low-dimensional feature space. Table II shows the minimal setting of K at which the performance is at most one standard deviation away from the performance obtained by uniform. On seven out of eight datasets, mklaren reaches performance equivalent to uniform at the smallest setting of K.

TABLE II: Comparison of the minimal rank K for which the RMSE differs by at most one standard deviation from the RMSE obtained with the full kernel matrices using uniform alignment. n, number of examples. Shown in bold in the original is the method with the lowest such rank K.

Dataset     | n    | mklaren | csi   | icd   | Nyström
boston      | 506  | 42      | 63    | > 140 | 119
kin         | 1000 | 63      | > 140 | > 140 | > 140
pumadyn     | 1000 | 49      | > 140 | 56    | 98
abalone     | 1000 | 21      | 28    | 35    | 49
comp        | 1000 | 49      | 63    | > 140 | > 140
ionosphere  | 351  | 14      | 14    | 42    | 35
bank        | 1000 | 21      | 42    | 42    | 112
diabetes    | 442  | 14      | 14    | 14    | 21

Only on the diabetes dataset do the unsupervised icd and Nyström appear to outperform the supervised methods at low ranks. However, at higher settings of K the performance of csi and mklaren overtakes that of icd and Nyström, as can be seen in Table I. Overall, the results confirm the utility of the greedy approach to selecting not only the pivots but also the kernels to be approximated, and suggest mklaren as the method of choice when competitive performance with very low-rank feature spaces is desired. Moreover, irrelevant kernels that are never selected for the decompositions never need to be evaluated at the test stage. This point is discussed further in the next subsection.

Comparison with MKL methods on rank-one kernels

Comparing mklaren to multiple kernel learning methods that use the full kernel matrix is challenging, as it is unrealistic to expect improved performance from low-rank approximation methods. Although the restriction to low-rank feature spaces may result in implicit regularization and consequently improved performance [24], the difference in the implicit dimension of the feature space makes the comparison difficult. We therefore focus on the ability of mklaren to select from a set of kernels in a way that takes into account the implicit correlations between the kernels. To this end, we built on the empirical analysis of Cortes et al. [29], which used four well-known sentiment analysis datasets compiled by Blitzer et al. [30]. In each dataset, the examples are user reviews of products and the target is the product rating on a discrete scale from 1 to 5. The features are counts of the 4000 most frequent bigrams in each dataset. Each bigram was represented by a rank-one kernel, thus enabling the use of multiple kernel learning for feature selection and explicit control over the feature space dimension. The datasets contain a moderate number of examples: books (n = 5501), electronics (n = 5901), kitchen (n = 5149), and dvd (n = 5118). The splits into training and test parts were readily included as part of the data set. Three state-of-the-art multiple kernel learning methods were used for comparison, all based on maximizing centered kernel alignment [29]: align infers the kernel weights independently, while alignf and alignfc consider the between-kernel correlations when maximizing the alignment (a sketch of the rank-one kernels and of the align weighting is given below).
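To make this setup concrete, here is a small numpy sketch (hypothetical names) of the rank-one kernel construction and of one common form of the independent centered-alignment weighting used by align [29]; the exact normalization is an assumption, and alignf/alignfc, which additionally account for between-kernel correlations, are not shown.

```python
import numpy as np

def align_weights(B, y):
    """B: n x p bigram count matrix, one rank-one kernel K_q = b_q b_q^T per column; y: targets."""
    Bc = B - B.mean(axis=0)                 # centering b_q also centers K_q = b_q b_q^T
    yc = y - y.mean()
    num = (Bc.T @ yc) ** 2                  # <K_q^c, y y^T>_F since K_q^c = (H b_q)(H b_q)^T
    den = (Bc ** 2).sum(axis=0)             # ||K_q^c||_F = ||H b_q||^2
    return num / np.maximum(den, 1e-12)     # per-kernel alignment score (up to normalization)
```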

The combined kernel learned by all three methods was used with a kernel ridge regression model. The align method is linear in the number of kernels (p), while alignf and alignfc are cubic, as they include solving a full p × p system of equations (alignf) or a QP (alignfc). When testing different ranks K, the bigrams were first filtered according to the descending centered alignment metric for align, alignf, and alignfc prior to optimization. When using mklaren, the K pivot columns were selected from the complete set of 4000 bigrams and δ was set to 1. Note that in this scenario mklaren is equivalent to the original LAR algorithm, which excludes the effect of low-rank approximation and compares only the kernel selection part. This way, the same dimension of the feature space was ensured. The performance was measured via 5-fold cross-validation. At each step, 80% of the training set was used for kernel matrix approximation (mklaren) or for determining the kernel weights (align, alignf, alignfc). The remaining 20% of the training set was used for selecting the regularization parameter λ from the range 10⁻³, 10⁻², ..., 10³, and the final performance was reported on the test set using RMSE.

The results for different settings of K are shown in Figure 1. For low settings of K, mklaren outperforms all three MKL methods that assume full kernel matrices. At higher settings of K and larger numbers of kernels considered, mklaren becomes equivalent to alignf and alignfc in terms of performance, showing that the greedy kernel and pivot selection criterion considers implicit correlations between kernels. However, there is an important difference in computational complexity: mklaren is linear in the number of kernels p, which can make it preferable over alignf and alignfc. Thus, the same effect of accounting for between-kernel correlations is achieved at a significantly lower cost.

Fig. 1. RMSE on the test set for the MKL methods on the books, dvd, electronics, and kitchen datasets. The rank K is equal to the number of kernels included. [Figure not reproduced; each panel plots RMSE against rank for align, Mklaren, uniform, alignf, and alignfc.]

We finally examine the utility of mklaren for model interpretation. We compare the kernels (features) selected by the greedy criterion and the corresponding regression weights for the books dataset in Figure 2. In comparison with the features selected by align, the bigrams selected by mklaren appear to represent explicit expressions of sentiment related to reviews. For example, the generally positive words great, wonderful, and excellent were detected by mklaren, as were the negatively sounding disappointing, ridiculous, and sorry. All of these words were missed by align, explaining the possible gap in performance seen in Figure 1.

Fig. 2. Weights of the top 10 positive and top 10 negative features for mklaren (top) and align (bottom) on the sentiment analysis task for the books dataset. [Figure not reproduced; the mklaren panel shows the bigrams great, classic, story_of, excellent, wonderful, highly, a_must, easy_to, covers, each, say_about, waste_of, sorry, money, ridiculous, disappointed, unfortunately, poorly, disappointing, and waste; the align panel shows how_can, of_both, by_his, references_to, i_finished, economics, firm, saying_that, occurred, one_who, more_and, and_information, gripping, a_year, first_two, we_know, had_this, statement, need_for, and write_a.]


CONCLUSION

Subquadratic complexity in the number of training examples is essential for large-scale application of kernel methods. Learning the kernel matrix efficiently from the data and selecting the relevant portions of the data early can reduce time and storage requirements further up the machine learning pipeline. The complexity with respect to the number of kernels should also not be disregarded when the number of kernels is large. Using a greedy low-rank approximation to multiple kernels, we achieve linear complexity in the number of kernels and data points without sacrificing the consideration of between-kernel correlations. Moreover, the approach learns a regression model, but is nevertheless applicable to any kernel-based model. The extension to classification or ranking tasks is an interesting subject for future work. Contrary to recent kernel matrix approximations, we present an idea based entirely on geometric principles, which is also not limited to transductive learning. With the abundance of different data representations, we expect kernels to remain essential in machine learning applications.

APPENDIX

Least-angle regression: We briefly describe least-angle regression (LAR, see refs. [31], [32]) for completeness. We use LAR as an alternative to the pivot selection criterion in the ICD and to simultaneously learn the regression coefficients. In the LAR method, a new column is chosen from the set of candidates such that the correlations with the residual are equal for all active variables. This is possible because all variables (columns) are known a priori, which clearly does not hold for candidate pivot columns. The monotonically decreasing maximal correlation in the active set is therefore not guaranteed. Moreover, the addition of a column to the active set potentially affects the values in all further columns. Naively recomputing these values at each iteration would yield a computational complexity of order O(n²). Let the predictor variables x1, x2, ..., xp be vectors in Rⁿ, arranged in a matrix X ∈ R^{n×p}. The associated response vector is y ∈ Rⁿ. The LAR method iteratively selects predictor variables xj, and the corresponding coefficients βj are updated at the same time as they are moved towards their least-squares values. In the last step, the method reaches the least-squares solution Xβ = y.

The high-level pseudo code is as follows:
1) Start with the residual r = y − ȳ and regression coefficients β1, β2, ..., βp = 0.
2) Find the variable xj most correlated with r.
3) Move βj towards its least-squares coefficient until another xk has as much correlation with r.
4) Move βj and βk in the direction of their joint least-squares coefficients, until some new xl has as much correlation with r.
5) Repeat until all variables have been entered, reaching the least-squares solution.

Note that the method is easily modified to include early stopping, after a maximum number of predictor variables has been selected. Importantly, the method can be viewed as a version of a supervised Incomplete Cholesky Decomposition of the linear kernel K = XXᵀ, which corresponds to the usual inner product in Rᵖ. Assume that the predictor variables are standardized and that the response has had its mean subtracted:

$$\mathbf{1}^T x_j = 0, \quad \|x_j\|_2 = 1 \quad \text{for } j = 1, 2, ..., p, \qquad \mathbf{1}^T y = 0. \tag{22}$$

Initialize the regression line µ, the residual r, and the active set A:

$$\mu = 0, \quad r = y, \quad \mathcal{A} = \emptyset. \tag{23}$$

The LAR algorithm estimates µ = Xβ in successive steps. Say the predictor xi has the largest correlation with r. Then, the index i is added to the active set A, and the regression line and residual are updated:

$$\mu^{\text{new}} = \mu + \gamma x_i, \qquad r^{\text{new}} = r - \gamma x_i. \tag{24}$$

The gradient γ is set such that a new predictor xj will enter the model after µ is updated, and all predictors in the active set, as well as xj, will be equally correlated with r. The key parts are the selection of predictors added to the model and the calculation of the gradient. The active matrix for a subset of indices j with signs sj is defined as

$$X_A = \begin{pmatrix} \cdots & s_j x_j & \cdots \end{pmatrix} \ \text{for } j \in \mathcal{A}, \qquad s_j = \operatorname{sign}\{ x_j^T r \}. \tag{25}$$

By elementary linear algebra, there exists a bisector uA, an equiangular vector with ‖uA‖₂ = 1, spanning equal angles, less than 90 degrees, with the vectors in XA. Define the following quantities: XA the active matrix, A the normalization scalar, uA the bisector, and ω the vector making equal angles with the columns of XA. The bisector is obtained as follows:

$$T_A = X_A^T X_A, \qquad A = (\mathbf{1}_A^T T_A^{-1} \mathbf{1}_A)^{-1/2}, \qquad \omega = A\, T_A^{-1} \mathbf{1}_A, \qquad u_A = X_A \omega. \tag{26}$$

The calculation of the gradient proceeds as follows. Compute the vector of correlations and its maximum; the active set contains the variables with the highest absolute correlations:

$$c_j = x_j^T r, \qquad C = \max_j \{ c_j \}, \qquad a = X_A^T u_A, \qquad \gamma = \min_{j \in \mathcal{A}^c}^{+} \left\{ \frac{C - c_j}{A - a_j}, \; \frac{C + c_j}{A + a_j} \right\}, \tag{27}$$

where min⁺ is the minimum over positive components. By Eq. 24, the change in correlations within the active set can be expressed as

$$c_j^{\text{new}} = x_j^T (y - \mu^{\text{new}}) = c_j - \gamma a_j. \tag{28}$$

For the predictors in the active set, we have

$$|c_j^{\text{new}}| = C - \gamma A, \quad \text{for } j \in \mathcal{A}. \tag{29}$$

A variable is selected from the remaining variables in Aᶜ such that c_j^new is maximal. Equating Eq. 28 and Eq. 29 and maximizing yields γ = (C − cj)/(A − aj). Similarly, −c_j^new for the reverse covariate is maximal at γ = (C + cj)/(A + aj). Hence, γ is chosen in Eq. 27 as the minimal value for which a variable joins the active set.
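For completeness, a compact, self-contained sketch of the procedure above follows; it is a plain implementation of Eqs. 22–29, not the paper's optimized code, and assumes that the columns of X are centered and unit-norm and that y is centered.

```python
import numpy as np

def lar(X, y, n_steps=None):
    """Least-angle regression following Eqs. (22)-(29)."""
    n, p = X.shape
    n_steps = p if n_steps is None else min(n_steps, p)
    mu = np.zeros(n)                                   # regression line, Eq. (23)
    active = []
    next_var = int(np.argmax(np.abs(X.T @ y)))         # first entering variable
    for _ in range(n_steps):
        active.append(next_var)
        c = X.T @ (y - mu)                             # correlations with the residual
        C = np.max(np.abs(c[active]))
        XA = X[:, active] * np.sign(c[active])         # active matrix, Eq. (25)
        T = XA.T @ XA                                  # T_A
        Tinv1 = np.linalg.solve(T, np.ones(len(active)))
        A = float(np.ones(len(active)) @ Tinv1) ** -0.5
        uA = XA @ (A * Tinv1)                          # bisector, Eq. (26)
        a = X.T @ uA
        inactive = [k for k in range(p) if k not in active]
        if not inactive:
            gamma = C / A                              # last step: least-squares solution
        else:
            gamma, next_var = np.inf, None
            for k in inactive:                         # min+ over candidates, Eq. (27)
                for g in ((C - c[k]) / (A - a[k]), (C + c[k]) / (A + a[k])):
                    if 1e-12 < g < gamma:
                        gamma, next_var = g, k
        mu = mu + gamma * uA                           # Eq. (24)
        if not inactive:
            break
    beta, *_ = np.linalg.lstsq(X[:, active], mu, rcond=None)
    return active, beta, mu
```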

REFERENCES

[1] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[2] A. Rahimi and B. Recht, "Random features for large-scale kernel machines," in Advances in Neural Information Processing Systems, 2007, pp. 1177–1184.
[3] Z. Yang, A. J. Smola, L. Song, and A. G. Wilson, "A la Carte - Learning Fast Kernels," Dec. 2014.
[4] S. Si, C. Hsieh, and I. Dhillon, "Memory Efficient Kernel Approximation," Proceedings of the 31st International Conference on Machine Learning, vol. 32, 2014.
[5] Q. Le, T. Sarlós, and A. Smola, "Fastfood—approximating kernel expansions in loglinear time," Proceedings of the International Conference on Machine Learning, vol. 28, 2013.
[6] A. Vedaldi and A. Zisserman, "Efficient additive kernels via explicit feature maps," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 34, no. 3, pp. 480–492, Mar. 2012.
[7] A. Rudi, R. Camoriano, and L. Rosasco, "Less is More: Nyström Computational Regularization," arXiv preprint arXiv:1507.04717, 2015.
[8] Z. Xu, R. Jin, B. Shen, and S. Zhu, "Nyström Approximation for Sparse Kernel Methods: Theoretical Analysis and Empirical Evaluation," in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015, pp. 3115–3121.
[9] M. Li, W. Bi, J. Kwok, and B. Lu, "Large-Scale Nyström Kernel Matrix Approximation Using Randomized SVD," IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 1, pp. 152–164, 2015.

[10] A. Gittens and M. W. Mahoney, "Revisiting the Nyström Method for Improved Large-Scale Machine Learning," arXiv preprint arXiv:1303.1849, Mar. 2013.
[11] C. Williams and M. Seeger, "Using the Nyström method to speed up kernel machines," in Proceedings of the 14th Annual Conference on Neural Information Processing Systems, 2001, pp. 682–688.
[12] S. Fine and K. Scheinberg, "Efficient SVM Training Using Low-Rank Kernel Representations," Journal of Machine Learning Research, vol. 2, pp. 243–264, 2001.
[13] F. R. Bach and M. I. Jordan, "Predictive low-rank decomposition for kernel methods," New York, NY, USA, pp. 33–40, 2005.
[14] B. Kulis, M. Sustik, and I. Dhillon, "Low-rank kernel learning with Bregman matrix divergences," The Journal of Machine Learning Research, vol. 10, pp. 341–376, 2009.
[15] M. Gönen and E. Alpaydın, "Multiple kernel learning algorithms," The Journal of Machine Learning Research, vol. 12, pp. 2211–2268, 2011.
[16] J. Kandola, J. Shawe-Taylor, and N. Cristianini, "Optimizing Kernel Alignment over Combination of Kernels," Advances in Neural Information Processing Systems, 2002.
[17] A. Rakotomamonjy and S. Chanda, "ℓp-Norm Multiple Kernel Learning With Low-Rank Kernels," Neurocomputing, vol. 143, pp. 68–79, Nov. 2014.
[18] M. Gönen and E. Alpaydın, "Supervised learning of local projection kernels," Neurocomputing, vol. 73, no. 10-12, pp. 1694–1703, Jun. 2010.
[19] M. Gönen and S. Kaski, "Kernelized Bayesian matrix factorization," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 36, no. 10, pp. 2047–2060, 2014.
[20] K. Zhang, Z. Wang, and F. Moerchen, "Scaling up Kernel SVM on Limited Resources: A Low-rank Linearization Approach," pp. 1425–1434, 2012.
[21] G. Lanckriet and N. Cristianini, "Learning the kernel matrix with semidefinite programming," Journal of Machine Learning Research, vol. 5, pp. 27–72, 2004.
[22] B. Efron and T. Hastie, "Least angle regression," The Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.
[23] C. Cortes, M. Mohri, and A. Rostamizadeh, "Two-stage learning kernel algorithms," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 239–246.
[24] F. Bach, "Sharp analysis of low-rank kernel matrix approximations," arXiv:1208.2015, vol. 30, pp. 1–25, Aug. 2012.
[25] G. H. Golub and C. F. Van Loan, Matrix Computations. JHU Press, 2012, vol. 3.
[26] F. Bach, J. Mairal, J. Ponce, and G. Sapiro, "Sparse coding and dictionary learning for image analysis," in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2010.
[27] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, pp. 301–320, Apr. 2005.
[28] C. D. Meyer, Matrix Analysis and Applied Linear Algebra. SIAM, 2000.
[29] C. Cortes, M. Mohri, and A. Rostamizadeh, "Algorithms for Learning Kernels Based on Centered Alignment," Journal of Machine Learning Research, vol. 13, pp. 795–828, Mar. 2012.
[30] J. Blitzer, M. Dredze, F. Pereira, and others, "Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification," in ACL, vol. 7, 2007, pp. 440–447.
[31] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning. Springer Series in Statistics, Springer, Berlin, 2001, vol. 1.
[32] T. Hesterberg, N. H. Choi, L. Meier, and C. Fraley, "Least angle and ℓ1 penalized regression: A review," Statistics Surveys, vol. 2, pp. 61–93, 2008.
