Statistics and Computing 5 (1995), 227-235

The efficient cross-validation of principal components applied to principal component regression

BART MERTENS,1 TOM FEARN2 and MICHAEL THOMPSON1

1Department of Chemistry, Birkbeck College, Gordon House, 29 Gordon Square, London WC1H 0PP, UK
2Department of Statistical Science, University College London, Gower Street, London WC1E 6BT, UK

Received March 1993 and accepted November 1994

The cross-validation of principal components is a problem that occurs in many applications of statistics. The naive approach of omitting each observation in turn and repeating the principal component calculations is computationally costly. In this paper we present an efficient approach to leave-one-out cross-validation of principal components. This approach exploits the regular nature of leave-one-out principal component eigenvalue downdating. We derive influence statistics and consider the application to principal component regression.

Keywords: Cross-validation, downdating, eigenproblem, eigenvalue, influence statistics, principal components, regression.

1. Introduction

Applications of principal component analysis in statistics often involve the choice of the number of principal components to be used in the analysis. Stone and Brooks (1990) as well as Martens and Naes (1989) consider applications in regression. Wold (1976; 1978) examines the choice of the number of components for an adequate description of the variability within data. Cross-validation (Stone, 1974) is often used to address such problems. The computational cost of a full leave-one-out cross-validatory analysis may be extremely high. An algorithm proposed by Bunch et al. (1978), based on work by Golub (1973) and further investigated by DeGroat (1990), provides an elegant solution to the problem of leave-one-out calculations. By applying this algorithm to the cross-validation of principal components, one can calculate a full leave-one-out cross-validation from a single principal component analysis on the whole data. Rather than recomputing the principal components from scratch after leaving out an observation, we consider the change in the principal components that results from a leave-one-out perturbation. This computation is based

on the downdating of the eigenvalues that is associated with the deletion of an observation. As this approach avoids recomputing the cross-product matrix, we achieve an order of magnitude reduction in the computational cost of cross-validating the principal component decomposition. The method is conceptually elegant, as we only calculate the decomposition that we are actually interested in. We can store this decomposition for use in further analysis. In what follows, we first set up a framework for principal component analysis and cross-validatory algebra. We concentrate from the outset on principal component regression as an application. We then consider the algebra of leave-one-out perturbations. After an explanation of the nature of the calculations, we describe in more detail the application of this approach to cross-validatory calculations for principal component regression. Finally, we consider the performance of our method for principal component regression and some examples. The cross-validatory algebra described does not apply to the decomposition of the correlation matrix. We are therefore restricted to the analysis of the covariance matrix. This has also been observed by Sundberg (1993) and Stone and Brooks (1990; 1992) in the cross-validatory algebra for continuum regression.

2. Principal component regression and cross-validation

Consider a sample of $n$ observations $(\tilde{y}_i, \tilde{x}_{i1}, \ldots, \tilde{x}_{ip})$, $i = 1, \ldots, n$. We want to construct a prediction equation for the measurements $\tilde{y}_i$, $i = 1, \ldots, n$, using the explanatory measurements $\tilde{\mathbf{x}}_i = (\tilde{x}_{i1}, \ldots, \tilde{x}_{ip})$, $i = 1, \ldots, n$. We write

$$\tilde{Y} = \begin{pmatrix} \tilde{y}_1 \\ \vdots \\ \tilde{y}_n \end{pmatrix}, \qquad \tilde{X} = \begin{pmatrix} \tilde{\mathbf{x}}_1 \\ \vdots \\ \tilde{\mathbf{x}}_n \end{pmatrix}$$

for the raw data, and

$$Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \tilde{Y} - \bar{Y}\mathbf{1}, \qquad X = \begin{pmatrix} \mathbf{x}_1 \\ \vdots \\ \mathbf{x}_n \end{pmatrix} = \tilde{X} - \mathbf{1}\bar{X}$$

for the mean-centred data, where $\bar{Y}$ and $\bar{X}$ are sample means. We will consider the fitting of a linear predictor $\hat{Y} = \bar{Y}\mathbf{1} + X\mathbf{a}$ for $\tilde{Y}$ from the predictor data $X$ using principal component regression, as described by Jolliffe (1986, p. 130), for example. $\mathbf{1}$ is a column vector of 1s.

2.1. Principal component regression

Principal component regression derives the eigendecomposition

$$S = Q \Lambda Q^{\mathrm{T}}$$

of the sample covariance matrix $S = C/(n-1)$, where $C = X^{\mathrm{T}}X$ is the cross-product matrix, $Q = (\mathbf{q}_1, \ldots, \mathbf{q}_r)$ is the matrix of principal component coefficients and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_r)$ is a diagonal matrix with the principal component variances $\lambda_1 > \cdots > \lambda_r > 0$ on the diagonal, where $r = \min(n-1, p)$. The linear predictor is obtained from a least squares regression of $Y$ on a subset of the principal component scores $U = (\mathbf{u}_1, \ldots, \mathbf{u}_r)$, where $U = XQ$. We consider the restrictive implementation of principal component regression in which only the equations

$$\hat{Y}(\gamma) = \bar{Y}\mathbf{1} + b_1\mathbf{u}_1 + \cdots + b_\gamma\mathbf{u}_\gamma,$$

where $\gamma$ is a positive integer with $1 \le \gamma \le r$, are considered. However, this implementation is of particular interest from a computational point of view as it emerges as one limit of continuum regression (Stone and Brooks, 1990), a procedure which poses even greater computational problems. It is a straightforward exercise to change an implementation of the subsequent cross-validatory algebra to allow for a different ordering of the principal component scores in regression, using a permutation of the principal component indices. For these reasons, and to simplify notation, we restrict ourselves to the previously described implementation.
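As a concrete illustration of this implementation, the following sketch (ours, not the authors' code) fits the restrictive predictor with NumPy. The helper names `pcr_fit` and `pcr_predict` are illustrative choices, and we assume $\gamma \le r$ so that every retained variance is positive. The simple form of the coefficients, $b_j = \mathbf{u}_j^{\mathrm{T}} Y / \{(n-1)\lambda_j\}$, follows because the scores are mutually orthogonal with $\mathbf{u}_j^{\mathrm{T}}\mathbf{u}_j = (n-1)\lambda_j$.

```python
# A minimal sketch of the restrictive PCR implementation described above.
import numpy as np

def pcr_fit(Xt, Yt, gamma):
    """Fit Y-hat(gamma) = Ybar*1 + b_1 u_1 + ... + b_gamma u_gamma."""
    n = Xt.shape[0]
    Ybar, Xbar = Yt.mean(), Xt.mean(axis=0)
    X = Xt - Xbar                          # mean-centred predictors
    Y = Yt - Ybar                          # mean-centred response
    S = (X.T @ X) / (n - 1)                # sample covariance matrix S = C/(n-1)
    lam, Q = np.linalg.eigh(S)             # eigendecomposition S = Q Lam Q^T
    idx = np.argsort(lam)[::-1][:gamma]    # keep the gamma largest variances
    lam, Q = lam[idx], Q[:, idx]
    U = X @ Q                              # principal component scores U = XQ
    b = (U.T @ Y) / ((n - 1) * lam)        # least squares; u_j'u_j = (n-1) lam_j
    return Ybar, Xbar, Q, b

def pcr_predict(model, x_new):
    """Predict at a raw (uncentred) observation x_new."""
    Ybar, Xbar, Q, b = model
    return Ybar + ((x_new - Xbar) @ Q) @ b
```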

2.2. Principal component regression and cross-validation

The number of components $\gamma$ remains to be determined in the above procedure. We use full leave-one-out cross-validation to assess the number of components needed to optimize the prediction of future samples, as described by Stone (1974) and Stone and Brooks (1990). The computation of a full leave-one-out cross-validatory analysis for principal component regression involves, for each datum $(\tilde{y}_i, \tilde{\mathbf{x}}_i)$, $i = 1, \ldots, n$, the computation of the leave-one-out cross-validation prediction equation

$$\hat{Y}_{(i,\gamma)} = \bar{Y}_{(i)}\mathbf{1} + b_{1(i)}\mathbf{u}_{1(i)} + \cdots + b_{\gamma(i)}\mathbf{u}_{\gamma(i)},$$

where the notation $(i)$ indicates quantities that have been recalculated with the omission of the $i$th datum. Applying this prediction equation to the left-out datum, we obtain the cross-validated predicted value

$$\hat{y}_{i(i,\gamma)} = \bar{Y}_{(i)} + b_{1(i)} u_{i1(i)} + \cdots + b_{\gamma(i)} u_{i\gamma(i)}$$

for $\tilde{\mathbf{x}}_i$. We have

$$u_{ij(i)} = (\tilde{\mathbf{x}}_i - \bar{X}_{(i)})\,\mathbf{q}_{j(i)},$$

with $\mathbf{q}_{j(i)}$ the $j$th vector of principal component coefficients when the $i$th datum is removed. The PRESS statistic may now be obtained for each $\gamma$ as the sum of squared cross-validated residuals

$$\mathrm{PRESS}_\gamma = \sum_{i=1}^{n} \left( \tilde{y}_i - \hat{y}_{i(i,\gamma)} \right)^2.$$

The only difficult step in this cross-validatory procedure is the calculation of the principal component coefficients $\mathbf{q}_{j(i)}$, $j = 1, \ldots, r$, for leave-one-out perturbations of the data.
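The naive cross-validation that this remark alludes to is sketched below for reference: $n$ complete refits, one per left-out datum, each with its own eigendecomposition. It reuses the illustrative `pcr_fit` and `pcr_predict` helpers above.

```python
# The naive leave-one-out PRESS_gamma computation that the paper's
# downdating scheme avoids: n full refits from scratch.
import numpy as np

def press_naive(Xt, Yt, gamma):
    n = Xt.shape[0]
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i                    # leave observation i out
        model = pcr_fit(Xt[keep], Yt[keep], gamma)  # fresh eigendecomposition
        press += (Yt[i] - pcr_predict(model, Xt[i])) ** 2
    return press
```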

3. Downdating the principal component decomposition

The downdated eigenvalues are $\Lambda_{(i)} = \mathrm{diag}(\lambda_{1(i)}, \ldots, \lambda_{r(i)})$ with $\lambda_{1(i)} \ge \cdots \ge \lambda_{r(i)}$, and the downdated principal component coefficients $Q_{(i)}$ may be obtained from

$$Q_{(i)} = QV.$$

This provides a framework for the downdating of the principal component decomposition. We will show that the downdating matrix $V = (\mathbf{v}_1, \ldots, \mathbf{v}_r)$ can be conveniently calculated in two stages. First we calculate the downdated eigenvalues. Once these measures have been derived, we can obtain $V$ in a closed-form expression. Thus, the efficient cross-validation of the principal component decomposition is based on the efficient downdating of the principal component eigenvalues. We describe these two steps in reverse order.
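To see why such a framework exists, note that for mean-centred rows $\mathbf{x}_i$ the deletion of the $i$th observation changes the cross-product matrix by a rank-one term, $C_{(i)} = C - \{n/(n-1)\}\,\mathbf{x}_i\mathbf{x}_i^{\mathrm{T}}$ (a standard identity, stated here for illustration rather than quoted from the paper). In the eigenbasis of $S$ the downdated problem is therefore an $r \times r$ rank-one modification of a diagonal matrix. The sketch below simply diagonalizes that small matrix with `numpy.linalg.eigh`; the paper instead obtains the downdated eigenvalues through the cheaper algorithm of Bunch et al. (1978). The unnormalized $\mathbf{z} = Q^{\mathrm{T}}\mathbf{x}_i$ and $\rho = n/(n-1)$ used here are one convenient convention and need not match the paper's scaling.

```python
# Illustrative downdating (not the authors' algorithm): the leave-one-out
# change C_(i) = C - (n/(n-1)) x_i x_i^T becomes an r x r eigenproblem
# in the eigenbasis of S.
import numpy as np

def downdate_pca(lam, Q, x_i, n):
    """Downdated variances Lam_(i) and coefficients Q_(i) = QV after
    deleting the mean-centred row x_i from a sample of size n."""
    rho = n / (n - 1)
    z = Q.T @ x_i                                      # x_i in the eigenbasis
    M = np.diag((n - 1) * lam) - rho * np.outer(z, z)  # rank-one modification
    d, V = np.linalg.eigh(M)                           # small r x r eigenproblem
    idx = np.argsort(d)[::-1]
    lam_i = d[idx] / (n - 2)                           # divisor n-2 for n-1 rows
    return lam_i, Q @ V[:, idx]                        # Q_(i) = QV
```

Checking `downdate_pca` against a full recomputation on the recentred, deleted data gives a quick verification of the algebra.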

3.1. Deflation

There are a few situations in which the downdating problem becomes trivial. The first is that in which $\mathbf{z} = \mathbf{0}$, or equivalently $\rho = 0$, which corresponds to the trivial downdating problem for which none of the eigenvalues and principal component coefficients change when the $i$th datum is omitted. The next situation is that in which $z_t = 0$ for some $1 \le t \le r$; in this case the $t$th eigenvalue and coefficient vector are unaffected by the downdate and may be deflated from the problem.
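A minimal sketch of this deflation test, under our earlier convention for $\mathbf{z}$: coordinates with $z_t = 0$ cannot be moved by the rank-one term, so their eigenpairs carry over unchanged and only the remaining coordinates enter the reduced problem. The tolerance `tol` is our assumption; any practical implementation needs some such threshold.

```python
# Deflation sketch: eigenpairs with z_t == 0 are unchanged by the downdate.
import numpy as np

def split_deflatable(z, tol=1e-12):
    """Indices of eigenpairs unchanged by the downdate (z_t == 0)
    and of those entering the reduced eigenproblem."""
    fixed = np.flatnonzero(np.abs(z) <= tol)    # lam_t, q_t carry over
    active = np.flatnonzero(np.abs(z) > tol)    # solve the smaller problem here
    return fixed, active
```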