An Overview on the Shrinkage Properties of Partial Least Squares Regression

Nicole Krämer
TU Berlin, Department of Computer Science and Electrical Engineering, Franklinstr. 28/29, D-10587 Berlin

Summary

The aim of this paper is twofold. In the first part, we recapitulate the main results regarding the shrinkage properties of Partial Least Squares (PLS) regression. In particular, we give an alternative proof of the shape of the PLS shrinkage factors. It is well known that some of the factors are > 1. We discuss in detail the effect of shrinkage factors on the Mean Squared Error of linear estimators and argue that we cannot extend the results to PLS directly, as it is nonlinear. In the second part, we investigate the effect of shrinkage factors empirically. In particular, we point out that experiments on simulated and real world data show that bounding the absolute value of the PLS shrinkage factors by 1 seems to lead to a lower Mean Squared Error.
Keywords: linear regression, biased estimators, mean squared error
1 Introduction

In this paper, we want to give a detailed overview on the shrinkage properties of Partial Least Squares (PLS) regression. It is well known (Frank & Friedman 1993) that we can express the PLS estimator obtained after m steps in the following way:
β̂_PLS^(m) = ∑_i f^(m)(λ_i) z_i,
where z_i is the component of the Ordinary Least Squares (OLS) estimator along the ith principal component of the covariance matrix X^t X and λ_i is the corresponding eigenvalue. The quantities f^(m)(λ_i) are called shrinkage factors. We show that these factors are determined by a tridiagonal matrix (which depends on the input-output matrix (X, y)) and can be calculated in a recursive way. Combining the results of Butler & Denham (2000) and Phatak & de Hoog (2002), we give a simpler and clearer proof of the shape of the shrinkage factors of PLS and derive some of their properties. In particular, we reproduce the fact that some of the values f^(m)(λ_i) are greater than 1. This was first proved by Butler & Denham (2000). We argue that these peculiar shrinkage properties (Butler & Denham 2000) do not necessarily imply that the Mean Squared Error (MSE) of the PLS estimator is worse compared to the MSE of the OLS estimator. In the case of deterministic shrinkage factors, i.e. factors that do not depend on the output y, any value |f^(m)(λ_i)| > 1 is of course undesirable. But in the case of PLS, the shrinkage factors are stochastic: they also depend on y. In particular, bounding the absolute value of the shrinkage factors by 1 might not automatically yield a lower MSE, in disagreement with what was conjectured e.g. in Frank & Friedman (1993). Having issued this warning, we explore whether bounding the shrinkage factors leads to a lower MSE or not. It is very difficult to derive theoretical results, as the quantities of interest, β̂_PLS^(m) and f^(m)(λ_i) respectively, depend on y in a complicated, nonlinear way. As a substitute, we study the problem on several artificial data sets and one real world example. It turns out that in most cases, the MSE of the bounded version of PLS is indeed smaller than the one of PLS.
2 Preliminaries

We consider the multivariate linear regression model
y = Xβ + ε   (1)
with
cov(y) = σ² · I_n.   (2)
Here, I_n is the identity matrix of dimension n. The number of variables is p, the number of examples is n. For simplicity, we assume that X and y are scaled to have zero mean, so we do not have to worry about intercepts. We have
X ∈ R^{n×p},  A := X^t X ∼ cov(X) ∈ R^{p×p},  y ∈ R^n,  b := X^t y ∼ cov(X, y) ∈ R^p.
We set p* = rk(A) = rk(X). The singular value decomposition of X is of the form
X = V Σ U^t  with  V ∈ R^{n×p},  Σ = diag(σ_1, ..., σ_p) ∈ R^{p×p},  U ∈ R^{p×p}.
The columns of U and V are mutually orthogonal, that is we have U^t U = I_p and V^t V = I_p. We set λ_i = σ_i² and Λ = Σ². The eigendecomposition of A is
A = U Λ U^t = ∑_{i=1}^p λ_i u_i u_i^t.
The eigenvalues λ_i of A (and of any other matrix) are ordered in the following way:
λ_1 ≥ λ_2 ≥ ... ≥ λ_p ≥ 0.
The Moore-Penrose inverse of a matrix M is denoted by M^-.
The Ordinary Least Squares (OLS) estimator β̂_OLS is the solution of the optimization problem
argmin_β ‖y − Xβ‖.
If there is no unique solution (which is in general the case for p > n), the OLS estimator is the solution with minimal norm. Set
s = Σ V^t y.   (3)
The OLS estimator is given by the formula
β̂_OLS = (X^t X)^- X^t y = U Λ^- Σ V^t y = U Λ^- s = ∑_{i=1}^{p*} (v_i^t y / √λ_i) u_i.
We define
z_i = (v_i^t y / √λ_i) u_i.
This implies
β̂_OLS = ∑_{i=1}^{p*} z_i.
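To make the decomposition concrete, the following minimal numpy sketch (our own illustration, not code from the paper; the function name and the tolerance for dropping zero singular values are arbitrary choices) computes the eigenvalues λ_i and the components z_i from the SVD of X.

```python
import numpy as np

def ols_components(X, y, tol=1e-12):
    """Eigenvalues lambda_i of A = X^t X and OLS components z_i with
    beta_OLS = sum_i z_i (notation as in the text)."""
    V, sigma, Ut = np.linalg.svd(X, full_matrices=False)  # X = V diag(sigma) U^t
    U = Ut.T
    keep = sigma > tol * sigma[0]          # the p* non-zero singular values
    V, sigma, U = V[:, keep], sigma[keep], U[:, keep]
    lam = sigma ** 2                       # eigenvalues of X^t X
    coef = (V.T @ y) / sigma               # v_i^t y / sqrt(lambda_i)
    Z = U * coef                           # column i of Z is z_i
    return lam, Z

# beta_OLS as the sum of its components:
# lam, Z = ols_components(X, y); beta_ols = Z.sum(axis=1)
```

Summing the columns of Z recovers β̂_OLS, which is the representation used throughout the remainder of the paper.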
Set
K^(m) := (A^0 b, Ab, ..., A^{m−1} b) ∈ R^{p×m}.
The columns of K^(m) are called the Krylov sequence of A and b. The space spanned by the columns of K^(m) is called the Krylov space of A and b and is denoted by K^(m). Krylov spaces are closely related to the Lanczos algorithm (Lanczos 1950), a method for approximating eigenvalues of the matrix A. We exploit the relationship between PLS and Krylov spaces in the subsequent sections. An excellent overview of the connections of PLS to the Lanczos method (and the conjugate gradient algorithm) can be found in Phatak & de Hoog (2002). Set
M := {λ_i | s_i ≠ 0} = {λ_i ≠ 0 | v_i^t y ≠ 0}
(the vector s is defined in (3)) and
m* := |M|.
It follows easily that m* ≤ p* = rk(X). The inequality is strict if A has non-zero eigenvalues of multiplicity > 1 or if there is a principal component v_i that is not correlated to y, i.e. v_i^t y = 0. The quantity m* is also called the grade of b with respect to A. We state a standard result on the dimension of the Krylov spaces associated to A and b.
Proposition 1. We have
dim K^(m) = m  if m ≤ m*,  and  dim K^(m) = m*  if m > m*.
In particular,
dim K^(m*) = dim K^(m*+1) = ... = dim K^(p) = m*.   (4)
Finally, let us introduce the following notation. For any set S of vectors, we denote by P_S the projection onto the space that is spanned by S. It follows that
P_S = S (S^t S)^- S^t.
3 Partial Least Squares

We only give a sketchy introduction to the PLS method. More details can be found e.g. in Höskuldsson (1988) or Rosipal & Krämer (2006). The main idea is to extract m orthogonal components from the predictor space X and fit the response y to these components. In this sense, PLS is similar to Principal Components Regression (PCR). The difference is that PCR extracts components that explain the variance in the predictor space, whereas PLS extracts components that have a high covariance with y. The quantity m is called the number of PLS steps or the number of PLS components.
We now formalize this idea. The first latent component t_1 is a linear combination t_1 = Xw_1 of the predictor variables. The vector w_1 is usually called the weight vector. We want to find a component with maximal covariance to y, that is we want to compute
w_1 = argmax_{‖w‖=1} cov(Xw, y) = argmax_{‖w‖=1} w^t b.   (5)
Using Lagrangian multipliers, we conclude that the solution w_1 is, up to a factor, equal to X^t y = b. Subsequent components t_2, t_3, ... are chosen such that they maximize (5) and that all components are mutually orthogonal. We ensure orthogonality by deflating the original predictor variables X. That is, we only consider the part of X that is orthogonal on all components t_j for j < i:
X_i = X − P_{t_1,...,t_{i−1}} X.
We then replace X by X_i in (5). This version of PLS is called the NIPALS algorithm (Wold 1975).
Algorithm 2 (NIPALS algorithm). After setting X_1 = X, the weight vectors w_i and the components t_i of PLS are determined by iteratively computing
weight vector:  w_i = X_i^t y
component:      t_i = X_i w_i
deflation:      X_{i+1} = X_i − P_{t_i} X_i
The final estimator ŷ for Xβ is
ŷ = P_{t_1,...,t_m} y = ∑_{j=1}^m P_{t_j} y.
The last equality follows as the PLS components t_1, ..., t_m are mutually orthogonal. We denote by W^(m) the matrix that consists of the weight vectors w_1, ..., w_m defined in algorithm 2:
W^(m) = (w_1, ..., w_m).   (6)
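Algorithm 2 translates directly into a few lines of numpy. The sketch below is our own illustration (it assumes centred X and y and m ≤ m*); it returns the fitted values P_{t_1,...,t_m} y together with the components and weight vectors.

```python
import numpy as np

def pls_nipals(X, y, m):
    """NIPALS-style PLS with m components, following Algorithm 2."""
    Xi = X.copy()
    T, W = [], []                                 # components t_i and weights w_i
    for _ in range(m):
        w = Xi.T @ y                              # weight vector  w_i = X_i^t y
        w = w / np.linalg.norm(w)                 # normalisation (does not change P_{t_i})
        t = Xi @ w                                # component      t_i = X_i w_i
        Xi = Xi - np.outer(t, t @ Xi) / (t @ t)   # deflation      X_{i+1} = X_i - P_{t_i} X_i
        T.append(t)
        W.append(w)
    T, W = np.column_stack(T), np.column_stack(W)
    # fitted values: projection of y onto span(t_1, ..., t_m)
    y_hat = T @ np.linalg.lstsq(T, y, rcond=None)[0]
    return y_hat, T, W
```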
The PLS components t_j and the weight vectors are linked in the following way (see e.g. Höskuldsson (1988)):
(t_1, ..., t_m) = X W^(m) R^(m)
with an invertible bidiagonal matrix R^(m). Plugging this into the formula for ŷ above, we obtain
ŷ = X W^(m) ((W^(m))^t X^t X W^(m))^- (W^(m))^t X^t y.
It can be shown (Helland 1988) that the space spanned by the vectors w_i (i = 1, ..., m) equals the Krylov space K^(m) that is defined in section 2. More precisely, W^(m) is an orthogonal basis of K^(m) that is obtained by a Gram-Schmidt procedure. This implies:
Proposition 3 (Helland 1988). The PLS estimator obtained after m steps can be expressed in the following way:
β̂_PLS^(m) = K^(m) [(K^(m))^t A K^(m)]^- (K^(m))^t b.   (7)
An equivalent expression is that the PLS estimator β̂_PLS^(m) for β is the solution of the constrained minimization problem
argmin_β ‖y − Xβ‖  subject to  β ∈ K^(m).
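For small examples, expression (7) can also be evaluated directly, which is convenient for cross-checking other implementations. The sketch below is ours, not the paper's code; note that the raw Krylov sequence is numerically ill-conditioned, so this is only meant for illustration.

```python
import numpy as np

def pls_krylov(X, y, m):
    """PLS estimator after m steps via (7): beta = K ((K^t A K))^- K^t b."""
    A = X.T @ X
    b = X.T @ y
    # Krylov matrix K^(m) = (b, Ab, ..., A^{m-1} b)
    K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(m)])
    return K @ np.linalg.pinv(K.T @ A @ K) @ K.T @ b   # Moore-Penrose inverse
```

Up to numerical error, its output agrees with the estimator obtained from the NIPALS sketch above.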
Of course, the orthogonal basis W^(m) of the Krylov space only exists if dim K^(m) = m, which might not be true for all m ≤ p. The maximal number for which this holds is m* (see proposition 1). Note however that it follows from (4) that
K^(m*−1) ⊂ K^(m*) = K^(m*+1) = ... = K^(p),
and the solution of the optimization problem does not change anymore. Hence there is no loss in generality if we make the assumption that
dim K^(m) = m.   (8)
Remark 4. We have
β̂_PLS^(m*) = β̂_OLS.
Proof. This result is well known and it is usually proven using the fact that after the maximal number of steps the vectors t_1, ..., t_m span the same space as the columns of X. Here we present an algebraic proof that exploits the relationship between PLS and Krylov spaces. We show that the OLS estimator is an element of K^(m*), that is β̂_OLS = π_OLS(A)b for a polynomial π_OLS of degree ≤ m* − 1. We define this polynomial via the m* equations
π_OLS(λ_i) = 1/λ_i,  λ_i ∈ M.
In matrix notation, this equals
π_OLS(Λ) s = Λ^- s.   (9)
Using the eigendecomposition of A and (9), we conclude that
π_OLS(A) b = U π_OLS(Λ) Σ V^t y = U π_OLS(Λ) s = U Λ^- s = β̂_OLS.
Set
D^(m) = (W^(m))^t A W^(m) ∈ R^{m×m}.
Proposition 5. The matrix D^(m) is symmetric and positive semidefinite. Furthermore, D^(m) is tridiagonal: d_ij = 0 for |i − j| ≥ 2.
Proof. The first two statements are obvious. Let i ≤ j − 2. As w_i ∈ K^(i), the vector Aw_i lies in the subspace K^(i+1). As j > i + 1, the vector w_j is orthogonal on K^(i+1), in other words
d_ji = ⟨w_j, Aw_i⟩ = 0.
As D^(m) is symmetric, we also have d_ij = 0, which proves the assertion.
4 Tridiagonal matrices

We see in section 6 that the matrices D^(m) and their eigenvalues determine the shrinkage factors of the PLS estimator. To prove this, we now list some properties of D^(m).
Definition 6. A symmetric tridiagonal matrix D is called unreduced if all subdiagonal entries are non-zero, i.e. d_{i,i+1} ≠ 0 for all i.

Theorem 7 (Parlett 1998). All eigenvalues of an unreduced matrix are distinct.

Set
D^(m) =
( a_1   b_1                           )
( b_1   a_2   b_2                     )
(       b_2   ...   ...               )
(             ...   a_{m−1}  b_{m−1}  )
(                   b_{m−1}  a_m      ),
i.e. D^(m) has diagonal entries a_1, ..., a_m and sub- and superdiagonal entries b_1, ..., b_{m−1}; all other entries are zero.
Proposition 8. If dim K^(m) = m, the matrix D^(m) is unreduced. More precisely, b_i > 0 for all i ∈ {1, ..., m − 1}.
Proof. Set p_i = A^{i−1} b and denote by w_1, ..., w_m the basis (6) obtained by Gram-Schmidt. Its existence is guaranteed as we assume that dim K^(m) = m. We have to show that
b_i = ⟨w_i, Aw_{i−1}⟩ > 0.
As the length of w_i does not change the sign of b_i, we can assume that the vectors w_i are not normalized to have length 1. By definition,
w_i = p_i − ∑_{k=1}^{i−1} (⟨p_i, w_k⟩ / ⟨w_k, w_k⟩) · w_k.
As the vectors w_i are pairwise orthogonal, it follows that
⟨w_i, p_i⟩ = ⟨w_i, w_i⟩ > 0.   (10)
We conclude that
b_i = ⟨w_i, Aw_{i−1}⟩
    = ⟨w_i, A (p_{i−1} − ∑_{k=1}^{i−2} (⟨p_{i−1}, w_k⟩ / ⟨w_k, w_k⟩) · w_k)⟩
    = ⟨w_i, p_i⟩ − ∑_{k=1}^{i−2} (⟨p_{i−1}, w_k⟩ / ⟨w_k, w_k⟩) ⟨w_i, Aw_k⟩   (using Ap_{i−1} = p_i)
    = ⟨w_i, p_i⟩   (as ⟨w_i, Aw_k⟩ = 0 for k ≤ i − 2, cf. the proof of Proposition 5)
    > 0   (by (10)).
Note that the matrix D^(m−1) is obtained from D^(m) by deleting the last column and row of D^(m). It follows that we can give a recursive formula for the characteristic polynomials χ^(m) := χ_{D^(m)} of D^(m). We have
χ^(m)(λ) = (a_m − λ) · χ^(m−1)(λ) − b_{m−1}² χ^(m−2)(λ).   (11)
We want to deduce properties of the eigenvalues of D^(m) and A and explore their relationship. Denote the eigenvalues of D^(m) by
µ_1^(m) > ... > µ_m^(m) ≥ 0.   (12)
Remark 9. All eigenvalues of D^(m*) are eigenvalues of A.
Proof. First note that
A|_{K^(m*)} : K^(m*) → K^(m*+1) = K^(m*).
As the columns of the matrix W^(m*) form an orthonormal basis of K^(m*),
D^(m*) = (W^(m*))^t A W^(m*)
is the matrix that represents A|_{K^(m*)} with respect to this basis. As any eigenvalue of A|_{K^(m*)} is obviously an eigenvalue of A, the proof is complete.
The following theorem is a special form of the Cauchy Interlace Theorem. In this version, we use a general result from Parlett (1998) and exploit the tridiagonal structure of D^(m).
Theorem 10. Each interval [µ_{m−j}^(m), µ_{m−(j+1)}^(m)] (j = 0, ..., m − 2) contains a different eigenvalue of D^(m+k) (k ≥ 1). In addition, there is a different eigenvalue of D^(m+k) outside the open interval (µ_m^(m), µ_1^(m)).
This theorem ensures in particular that there is a different eigenvalue of A in the interval [µ_k^(m), µ_{k−1}^(m)]. Theorem 10 holds independently of assumption (8).
Proof. By definition, for k ≥ 1,
D^(m+k) = ( D^(m−1)   •^t   0 )
          ( •         a_m   ∗ )
          ( 0         ∗     ∗ ).
Here • = (0, ..., 0, b_{m−1}), so
D^(m) = ( D^(m−1)   •^t )
        ( •         a_m ).
An application of theorem 10.4.1 in Parlett (1998) gives the desired result.
Lemma 11. If D(m) is unreduced, the eigenvalues of D(m) and the eigenvalues of D (m−1) are distinct.
Proof. Suppose the two matrices have a common eigenvalue λ. It follows from (11) and the fact that D^(m) is unreduced that λ is an eigenvalue of D^(m−2). Repeating this argument, we deduce that λ = a_1 is a common eigenvalue of D^(2) and D^(1) = (a_1), a contradiction, as
0 = χ^(2)(a_1) = −b_1² < 0.
Remark 12. In general it is not true that D^(m) and a submatrix D^(k) have distinct eigenvalues. Consider the case where a_i = c for all i. Using equation (11) we conclude that c is an eigenvalue for all submatrices D^(m) with m odd.

Proposition 13. If dim K^(m) = m, we have det(D^(m−1)) ≠ 0.
Proof. The matrices D^(m) and D^(m−1) are positive semidefinite, hence all their eigenvalues are ≥ 0. In other words, det(D^(m−1)) ≠ 0 if and only if its smallest eigenvalue µ_{m−1}^(m−1) is > 0. Using Theorem 10 we have
µ_{m−1}^(m−1) ≥ µ_m^(m) ≥ 0.
As dim K^(m) = m, the matrix D^(m) is unreduced, which implies that D^(m) and D^(m−1) have no common eigenvalues (see Lemma 11). We can therefore replace the first ≥ by >, i.e. the smallest eigenvalue of D^(m−1) is > 0.
It is well known that the matrices D^(m) are closely related to the so-called Rayleigh-Ritz procedure, a method that is used to approximate eigenvalues. For details consult e.g. Parlett (1998).
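The objects of this section are easy to compute explicitly. The following sketch (our own; it assumes dim K^(m) = m) builds an orthonormal basis W^(m) of the Krylov space by Gram-Schmidt, equivalently m steps of the Lanczos process, so that D^(m) = (W^(m))^t A W^(m) and its eigenvalues µ_i^(m) can be inspected numerically.

```python
import numpy as np

def krylov_basis(A, b, m):
    """Orthonormal basis W^(m) of the Krylov space K^(m) via Gram-Schmidt."""
    W = np.empty((len(b), m))
    W[:, 0] = b / np.linalg.norm(b)
    for i in range(1, m):
        v = A @ W[:, i - 1]
        v = v - W[:, :i] @ (W[:, :i].T @ v)   # orthogonalise against the previous w_k
        W[:, i] = v / np.linalg.norm(v)
    return W

# W = krylov_basis(A, b, m); D = W.T @ A @ W is symmetric and tridiagonal,
# and np.linalg.eigvalsh(D) returns the eigenvalues mu_i^(m) used below.
```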
5 What is shrinkage?

We presented two estimators for the regression parameter β, OLS and PLS, which also define estimators for Xβ via
ŷ_• = X β̂_•.
One possibility to evaluate the quality of an estimator is to determine its Mean Squared Error (MSE). In general, the MSE of an estimator θ̂ for a vector-valued parameter θ is defined as
MSE(θ̂) = E[trace((θ̂ − θ)(θ̂ − θ)^t)]
        = E[(θ̂ − θ)^t (θ̂ − θ)]
        = (E[θ̂] − θ)^t (E[θ̂] − θ) + E[(θ̂ − E[θ̂])^t (θ̂ − E[θ̂])].
This is the well-known bias-variance decomposition of the MSE. The first part is the squared bias and the second part is the variance term.
We start by investigating the class of linear estimators, i.e. estimators that are of the form θ̂ = Sy for a matrix S that does not depend on y. It follows immediately from the regression model (1) and (2) that for a linear estimator,
E[θ̂] = SXβ,   var(θ̂) = σ² trace(SS^t).
The OLS estimators are linear:
β̂_OLS = (X^t X)^- X^t y,
ŷ_OLS = X (X^t X)^- X^t y.
Note that the estimator ŷ_OLS is simply the projection P_X of y onto the space that is spanned by the columns of X. The estimator ŷ_OLS is unbiased, as
E[ŷ_OLS] = P_X Xβ = Xβ.
The estimator β̂_OLS is only unbiased if β ∈ range(X^t X):
E[β̂_OLS] = E[(X^t X)^- X^t y] = (X^t X)^- X^t E[y] = (X^t X)^- X^t Xβ = β.
Let us now have a closer look at the variance term. It follows directly from trace(P_X P_X^t) = rk(X) = p* that
var(ŷ_OLS) = σ² p*.
For β̂_OLS we have
(X^t X)^- X^t ((X^t X)^- X^t)^t = (X^t X)^- = U Λ^- U^t,
hence
var(β̂_OLS) = σ² ∑_{i=1}^{p*} 1/λ_i.   (13)
We conclude that the MSE of the estimator β̂_OLS depends on the eigenvalues λ_1, ..., λ_{p*} of A = X^t X. Small eigenvalues of A correspond to directions in X that have a low variance. Equation (13) shows that if some eigenvalues are small, the variance of β̂_OLS is very high, which leads to a high MSE. One possibility to (hopefully) decrease the MSE is to modify the OLS estimator by shrinking the directions of the OLS estimator that are responsible for a high variance. This of course introduces bias. We shrink the OLS estimator in the hope that the increase in bias is small compared to the decrease in variance. In general, a shrinkage estimator for β is of the form
β̂_shr = ∑_{i=1}^{p*} f(λ_i) z_i,
where f is some real-valued function. The values f(λ_i) are called shrinkage factors. Examples are
• Principal Component Regression:
  f(λ_i) = 1 if the ith principal component is included, and f(λ_i) = 0 otherwise, and
• Ridge Regression:
  f(λ_i) = λ_i / (λ_i + λ)
  with λ > 0 the Ridge parameter.
We illustrate in section 6 that PLS is a shrinkage estimator as well. It turns out that the shrinkage behavior of PLS regression is rather complicated.
Let us investigate in which way the MSE of the estimator is influenced by the shrinkage factors. If the shrinkage estimators are linear, i.e. the shrinkage factors do not depend on y, this is an easy task. Let us first write the shrinkage estimator in matrix notation. We have
β̂_shr = S_shr y = U Σ^- D_shr V^t y.
The diagonal matrix D_shr has entries f(λ_i). The shrinkage estimator for y is
ŷ_shr = X S_shr y = V Σ Σ^- D_shr V^t y.
We calculate the variance of these estimators:
trace(S_shr S_shr^t) = trace(U Σ^- D_shr Σ^- D_shr U^t) = trace(Σ^- D_shr Σ^- D_shr) = ∑_{i=1}^{p*} f(λ_i)² / λ_i
and
trace(X S_shr S_shr^t X^t) = trace(V Σ Σ^- D_shr Σ Σ^- D_shr V^t) = trace(Σ Σ^- D_shr Σ Σ^- D_shr) = ∑_{i=1}^{p*} f(λ_i)².
Next, we calculate the bias of the two shrinkage estimators. We have
E[S_shr y] = S_shr Xβ = U Σ D_shr Σ^- U^t β.
It follows that
bias²(β̂_shr) = (E[S_shr y] − β)^t (E[S_shr y] − β)
             = (U^t β)^t (Σ D_shr Σ^- − I_p)^t (Σ D_shr Σ^- − I_p) (U^t β)
             = ∑_{i=1}^{p*} (f(λ_i) − 1)² (u_i^t β)².
Replacing S_shr by X S_shr, it is easy to show that
bias²(ŷ_shr) = ∑_{i=1}^{p*} λ_i (f(λ_i) − 1)² (u_i^t β)².
Proposition 14. For the shrinkage estimators β̂_shr and ŷ_shr defined above we have
MSE(β̂_shr) = ∑_{i=1}^{p*} (f(λ_i) − 1)² (u_i^t β)² + σ² ∑_{i=1}^{p*} f(λ_i)² / λ_i,
MSE(ŷ_shr)  = ∑_{i=1}^{p*} λ_i (f(λ_i) − 1)² (u_i^t β)² + σ² ∑_{i=1}^{p*} f(λ_i)².
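For deterministic shrinkage factors, Proposition 14 can be transcribed directly into code. The helper below is our own sketch (argument names follow the notation above; U is assumed to contain the eigenvectors u_i belonging to the non-zero eigenvalues collected in lam).

```python
import numpy as np

def shrinkage_mse(factors, lam, U, beta, sigma2):
    """MSE of a linear shrinkage estimator with deterministic factors f(lambda_i),
    following the first formula of Proposition 14."""
    proj = U.T @ beta                                   # u_i^t beta
    bias2 = np.sum((factors - 1.0) ** 2 * proj ** 2)    # squared bias
    variance = sigma2 * np.sum(factors ** 2 / lam)      # variance term
    return bias2 + variance

# Ridge regression with (hypothetical) parameter `ridge`:
# mse_ridge = shrinkage_mse(lam / (lam + ridge), lam, U, beta, sigma2)
```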
If the shrinkage factors are deterministic, i.e. they do not depend on y, any value f(λ_i) ≠ 1 increases the bias. Values |f(λ_i)| < 1 decrease the variance, whereas values |f(λ_i)| > 1 increase the variance. Hence an absolute value > 1 is always undesirable. The situation might be different for stochastic shrinkage factors. We discuss this in the following section.
Note that there is a different notion of shrinkage, namely that the L2-norm of an estimator is smaller than the L2-norm of the OLS estimator. Why is this a desirable property? Let us again consider the case of linear estimators. Set θ̂_i = S_i y for i = 1, 2. We have
‖θ̂_i‖²_2 = y^t S_i^t S_i y.
The property that for all y ∈ R^n
‖θ̂_1‖_2 ≤ ‖θ̂_2‖_2
is equivalent to the condition that S_1^t S_1 − S_2^t S_2 is negative semidefinite. The trace of a negative semidefinite matrix is ≤ 0. Furthermore trace(S_i^t S_i) = trace(S_i S_i^t), so we conclude that
var(θ̂_1) ≤ var(θ̂_2).
It is known (de Jong (1995)) that
‖β̂_PLS^(1)‖_2 ≤ ‖β̂_PLS^(2)‖_2 ≤ ... ≤ ‖β̂_PLS^(m*)‖_2 = ‖β̂_OLS‖_2.
6 The shrinkage factors of PLS

In this section, we give a simpler and clearer proof of the shape of the shrinkage factors of PLS. Basically, we combine the results of Butler & Denham (2000) and Phatak & de Hoog (2002). In the rest of the section, we assume that m < m*, as the shrinkage factors for β̂_PLS^(m*) = β̂_OLS are trivial, i.e. f^(m*)(λ_i) = 1.
By definition of the PLS estimator, β̂_PLS^(m) ∈ K^(m). Hence there is a polynomial π of degree ≤ m − 1 with β̂_PLS^(m) = π(A)b. Recall that the eigenvalues of D^(m) are denoted by µ_i^(m). Set
f^(m)(λ) := 1 − ∏_{i=1}^m (1 − λ/µ_i^(m)) = 1 − χ^(m)(λ)/χ^(m)(0).
As f^(m)(0) = 0, there is a polynomial π^(m) of degree m − 1 such that
f^(m)(λ) = λ · π^(m)(λ).   (14)
Proposition 15 (Phatak & de Hoog 2002). Suppose that m < m*. We have
β̂_PLS^(m) = π^(m)(A) · b.

Proof (Phatak & de Hoog 2002). Using either equation (14) or the Cayley-Hamilton theorem (recall proposition 13), it is easy to prove that
(D^(m))^{−1} = π^(m)(D^(m)).
We plug this into equation (7) and obtain
β̂_PLS^(m) = W^(m) π^(m)((W^(m))^t A W^(m)) (W^(m))^t b.
Recall that the columns of W^(m) form an orthonormal basis of K^(m). It follows that W^(m) (W^(m))^t is the operator that projects on the space K^(m). In particular,
W^(m) (W^(m))^t A^j b = A^j b
for j = 0, ..., m − 1. This implies that
β̂_PLS^(m) = π^(m)(A) b.
Using (14), we can immediately conclude the following corollary.
Corollary 16 (Phatak & de Hoog 2002). Suppose that dim K^(m) = m. If we denote by z_i the component of β̂_OLS along the ith eigenvector of A, then
β̂_PLS^(m) = ∑_{i=1}^{p*} f^(m)(λ_i) · z_i,
with f^(m)(λ) defined in (14).
We now show that some of the shrinkage factors of PLS are ≠ 1.
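Before doing so, we note that Corollary 16 also gives a direct numerical recipe for the shrinkage factors: compute the eigenvalues µ_j^(m) of D^(m) and evaluate f^(m) at the λ_i. The sketch below combines the helpers ols_components and krylov_basis from the earlier code blocks (all of this is our own illustration, not code from the paper, and it assumes m < m*).

```python
import numpy as np

def pls_shrinkage_factors(X, y, m):
    """Shrinkage factors f^(m)(lambda_i) of PLS via the eigenvalues of D^(m)."""
    A, b = X.T @ X, X.T @ y
    lam, Z = ols_components(X, y)          # eigenvalues lambda_i and OLS components z_i
    W = krylov_basis(A, b, m)
    mu = np.linalg.eigvalsh(W.T @ A @ W)   # eigenvalues mu_j^(m) of D^(m)
    f = 1.0 - np.prod(1.0 - lam[:, None] / mu[None, :], axis=1)
    return lam, f, Z

# sanity check: Z @ f reproduces the PLS estimator with m components
```

On typical data, Z @ f reproduces β̂_PLS^(m), and some of the computed factors exceed 1; the following theorem describes where this can happen.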
Theorem 17 (Butler & Denham 2000). For each m ≤ m* − 1, we can decompose the interval [λ_p, λ_1] into m + 1 disjoint intervals¹
I_1 ≤ I_2 ≤ ... ≤ I_{m+1}
such that
f^(m)(λ_i) ≤ 1 if λ_i ∈ I_j and j is odd,
f^(m)(λ_i) ≥ 1 if λ_i ∈ I_j and j is even.

¹ We say that I_j ≤ I_k if sup I_j ≤ inf I_k.
Proof. Set g^(m)(λ) = 1 − f^(m)(λ). It follows from equation (14) that the zeros of g^(m)(λ) are µ_m^(m), ..., µ_1^(m). As D^(m) is unreduced, all eigenvalues are distinct. Set µ_0^(m) = λ_1 and µ_{m+1}^(m) = λ_p, and define I_j = ]µ_{m+2−j}^(m), µ_{m+1−j}^(m)[ for j = 1, ..., m + 1. By definition, g^(m)(0) = 1. Hence g^(m)(λ) is non-negative on the intervals I_j if j is odd and g^(m) is non-positive on the intervals I_j if j is even. It follows from theorem 10 that all intervals I_j contain at least one eigenvalue λ_i of A.

In general it is not true that f^(m)(λ_i) ≠ 1 for all λ_i and m = 1, ..., m*. Using the example in remark 12 and the fact that f^(m)(λ_i) = 1 is equivalent to the condition that λ_i is an eigenvalue of D^(m), it is easy to construct a counterexample.
Using some of the results of section 4, we can however deduce that some factors are indeed ≠ 1. As all eigenvalues of D^(m*−1) and D^(m*) are distinct (Lemma 11), we see that f^(m*−1)(λ_i) ≠ 1 for all i. In particular,
f^(m*−1)(λ_1) > 1 if m* is even,  f^(m*−1)(λ_1) < 1 if m* is odd.
More generally, using Lemma 11, we conclude that f^(m−1)(λ_i) = 1 and f^(m)(λ_i) = 1 is not possible. In practice, i.e. calculated on a data set, the shrinkage factors seem to be ≠ 1 all of the time. Furthermore,
0 ≤ f^(m)(λ_p) < 1.
To prove this, we set g^(m)(λ) = 1 − f^(m)(λ). We have g^(m)(0) = 1. Furthermore, the smallest positive zero of g^(m)(λ) is µ_m^(m), and it follows from theorem 10 and Lemma 11 that λ_p < µ_m^(m). Hence g^(m)(λ_p) ∈ ]0, 1]. Using theorem 10, more precisely
λ_p ≤ µ_i^(m) ≤ λ_i,
it is possible to bound the terms
1 − λ_i / µ_i^(m).
From this we can derive bounds on the shrinkage factors. We do not pursue this further. Readers who are interested in the bounds should consult Lingjaerde & Christopherson (2000).
Instead, we have a closer look at the MSE of the PLS estimator. In section 5, we showed that a value |f^(m)(λ_i)| > 1 is not desirable, as both the bias and the variance of the estimator increase. Note however that in the case of PLS, the factors f^(m)(λ_i) are stochastic; they depend on y in a nonlinear way. The variance of the PLS estimator along the ith principal component is
var( f^(m)(λ_i) · v_i^t y / √λ_i )
with both f^(m)(λ_i) and v_i^t y / √λ_i depending on y.
Among others, Frank & Friedman (1993) propose to truncate the shrinkage factors of the PLS estimator in the following way. Set
f̃^(m)(λ_i) = +1 if f^(m)(λ_i) > +1,
f̃^(m)(λ_i) = −1 if f^(m)(λ_i) < −1,
f̃^(m)(λ_i) = f^(m)(λ_i) otherwise,
and define a new estimator
β̂_TRN^(m) := ∑_{i=1}^{p*} f̃^(m)(λ_i) z_i.   (15)
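A minimal sketch of the truncated estimator (15), built on the pls_shrinkage_factors helper above (our own code; np.clip performs the truncation of the factors):

```python
import numpy as np

def pls_truncated(X, y, m):
    """PLS and truncated PLS (TRN) estimators with m components."""
    lam, f, Z = pls_shrinkage_factors(X, y, m)
    f_trn = np.clip(f, -1.0, 1.0)      # bound the factors by 1 in absolute value
    return Z @ f, Z @ f_trn            # (beta_PLS, beta_TRN)
```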
If the shrinkage factors are deterministic numbers, this will improve the MSE (cf. section 5). But in the case of stochastic shrinkage factors, the situation might be different. Let us suppose for a moment that f^(m)(λ_i) = √λ_i / (v_i^t y). It follows that
0 = var( f^(m)(λ_i) · v_i^t y / √λ_i ) ≤ var( f̃^(m)(λ_i) · v_i^t y / √λ_i ),
so it is not clear whether the truncated estimator TRN leads to a lower MSE, which is conjectured e.g. in Frank & Friedman (1993).
The assumption that f^(m)(λ_i) = √λ_i / (v_i^t y) is of course purely hypothetical. It is not clear whether the shrinkage factors behave this way. It is hard, if not infeasible, to derive statistical properties of the PLS estimator or its shrinkage factors, as they depend on y in a complicated, nonlinear way. As an alternative, we compare the two different estimators on different data sets.
7 Experiments

In this section, we explore the difference between the methods PLS and TRN. We investigate several artificial datasets and one real world example.
Simulation

We compare the MSE of the two methods, PLS and truncated PLS, on 27 different artificial data sets. We use a setting similar to the one in Frank & Friedman (1993). For each data set, the number of examples is n = 50. We consider three different numbers of predictor variables:
p = 5, 40, 100.
The input data X is chosen according to a multivariate normal distribution with zero mean and covariance matrix C. We consider three different covariance matrices:
C_1 = I_p,
(C_2)_ij = 1 / (|i − j| + 1),
(C_3)_ij = 1 if i = j and 0.7 if i ≠ j.
The matrices C_1, C_2 and C_3 correspond to no, moderate and high collinearity respectively. The regression vector β is a randomly chosen vector β ∈ {0, 1}^p. In addition, we consider three different signal-to-noise ratios:
stnr = sqrt( var(Xβ) / σ² ) = 1, 3, 7.
This yields 3 · 3 · 3 = 27 different parameter settings. For each setting, we estimate the MSE of the two methods: For k = 1, ..., K = 200, we generate y according to (1) and (2). We determine for each method and each m the respective estimator β̂_k and estimate the MSE by
MSE(β̂) = (1/K) ∑_{k=1}^K (β̂_k − β)^t (β̂_k − β).
If there are more predictor variables than examples, this approach is not sensible, as the true regression vector β is not identifiable. This implies that different regression vectors β_1 ≠ β_2 can lead to Xβ_1 = Xβ_2. Hence for p = 100, we estimate the MSE of ŷ for the two methods. We display the estimated MSE of the method TRN as a fraction of the estimated MSE of the method PLS, i.e. for each m we display
MSE-RATIO = MSE(β̂_TRN^(m)) / MSE(β̂_PLS^(m)).
(As already mentioned, we display the MSE-RATIO for ŷ in the case p = 100.) The results are displayed in figures 1, 2 and 3. In order to have a compact representation, we consider the averaged MSE-RATIOS for different parameter settings. E.g. we fix a degree of collinearity (say high collinearity) and display the averaged MSE-RATIO over the three different signal-to-noise ratios. The results for all 27 data sets are shown in the tables in the appendix.
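The simulation design can be summarised in a few lines. The sketch below (our own; it relies on the pls_truncated helper from the previous section, and details such as the seed handling are arbitrary) estimates the MSE-RATIO for a single parameter setting and a fixed number of components m.

```python
import numpy as np

def estimate_mse_ratio(p, C, stnr, m, n=50, K=200, seed=0):
    """Monte-Carlo estimate of MSE(beta_TRN) / MSE(beta_PLS) for one setting."""
    rng = np.random.default_rng(seed)
    X = rng.multivariate_normal(np.zeros(p), C, size=n)
    X = X - X.mean(axis=0)                           # centre the predictors
    beta = rng.integers(0, 2, size=p).astype(float)  # beta in {0, 1}^p
    signal = X @ beta
    sigma = np.sqrt(signal.var() / stnr ** 2)        # stnr = sqrt(var(X beta) / sigma^2)
    err_pls = err_trn = 0.0
    for _ in range(K):
        y = signal + sigma * rng.standard_normal(n)
        y = y - y.mean()
        b_pls, b_trn = pls_truncated(X, y, m)
        err_pls += np.sum((b_pls - beta) ** 2) / K
        err_trn += np.sum((b_trn - beta) ** 2) / K
    return err_trn / err_pls
```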
Figure 1: MSE-RATIO for p = 5. The figures show the averaged MSE-RATIO for different parameter settings. Left: Comparison for high (straight line), moderate (dotted line) and no (dashed line) collinearity. Right: Comparison for stnr 1 (straight line), 3 (dotted line) and 7 (dashed line).
Figure 2: MSE-RATIO for p = 40.

There are several observations: The MSE of TRN is lower almost all of the time. The decrease of the MSE is particularly large if the number of components m is small, but > 1. For larger m, the difference decreases. This is not surprising, as for large m, the difference between the PLS estimator and the OLS estimator decreases. Hence we expect the difference between TRN and PLS to become smaller. The reduction of the MSE is particularly prominent in complex situations, i.e. in situations with collinearity in X or with a low signal-to-noise ratio.
Figure 3: MSE-RATIO for p = 100. In this case, we display the MSE-RATIO for ŷ instead of β̂. Only the first 20 components are displayed.

Another feature which cannot be deduced from figures 1, 2 and 3 but from the tables in the appendix is the fact that the optimal numbers of components
m_PLS^opt = argmin_m MSE(β̂_PLS^(m)),   m_TRN^opt = argmin_m MSE(β̂_TRN^(m))
are equal almost all of the time. This is also true if we consider the MSE of ŷ. We can benefit from this if we want to select an optimal model for truncated PLS. We return to this subject in section 8.
Real world data

In this example, we consider the near infrared spectra (NIR) of n = 171 meat samples that are measured at p = 100 different wavelengths from 850 to 1050 nm. This data set is taken from the StatLib datasets archive and can be downloaded from http://lib.stat.cmu.edu/datasets/tecator. The task is to predict the fat content of a meat sample on the basis of its NIR spectrum. We choose this dataset as PLS is widely used in the chemometrics field. In this type of application, we usually observe a lot of predictor variables which are highly correlated. We estimate the MSE of the two methods PLS and truncated PLS by computing the 10-fold cross-validated error of the two estimators. The results are displayed in figure 4. Again, TRN is better almost all of the time, although the difference is small. Note furthermore that the optimal numbers of components are almost identical for the two methods: We have m_PLS^opt = 15 and m_TRN^opt = 16.
Figure 4: 10-fold cross-validated test error for the Tecator data set. The straight line corresponds to PLS, the dashed line corresponds to truncated PLS.
8 Discussion

We saw in section 7 that bounding the absolute value of the PLS shrinkage factors by one seems to improve the MSE of the estimator. So should we now discard PLS and always use TRN instead? There might be (at least) two objections. Firstly, it would be somewhat rash to rely on the results of a small-scale simulation study. Secondly, TRN is computationally more expensive than PLS. We need the full singular value decomposition of X. In each step, we have to compute the PLS estimator and adjust its shrinkage factors by hand. However, the experiments suggest that it can be worthwhile to compare PLS and truncated PLS. We pointed out in section 7 that the two methods do not seem to differ much in terms of the optimal number of components. In order to reduce the computational costs of truncated PLS, we therefore suggest the following strategy. We first compute the PLS models on a training set and choose the optimal model with the help of a model selection criterion. In a second step, we truncate the shrinkage factors of the optimal model. We then use a validation set in order to quantify the difference between PLS and TRN and choose the method with the lower validation error.
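A compact sketch of this strategy (ours; select_m stands for any model selection routine, e.g. cross-validation, and pls_truncated is the helper sketched in section 6):

```python
import numpy as np

def pls_vs_trn(X_train, y_train, X_val, y_val, select_m):
    """Pick the number of PLS components on the training data, truncate the
    shrinkage factors of that model, and let a validation set choose PLS or TRN."""
    m_opt = select_m(X_train, y_train)                      # step 1: optimal PLS model
    b_pls, b_trn = pls_truncated(X_train, y_train, m_opt)   # step 2: truncate
    err_pls = np.mean((y_val - X_val @ b_pls) ** 2)         # step 3: compare on
    err_trn = np.mean((y_val - X_val @ b_trn) ** 2)         #         validation data
    return (b_trn, m_opt) if err_trn < err_pls else (b_pls, m_opt)
```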
References

Butler, N. & Denham, M. (2000), 'The Peculiar Shrinkage Properties of Partial Least Squares Regression', Journal of the Royal Statistical Society Series B 62(3), 585-593.

de Jong, S. (1995), 'PLS shrinks', Journal of Chemometrics 9, 323-326.

Frank, I. & Friedman, J. (1993), 'A Statistical View of some Chemometrics Regression Tools', Technometrics 35, 109-135.

Helland, I. (1988), 'On the Structure of Partial Least Squares Regression', Communications in Statistics, Simulation and Computation 17(2), 581-607.

Höskuldsson, A. (1988), 'PLS Regression Methods', Journal of Chemometrics 2, 211-228.

Lanczos, C. (1950), 'An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators', Journal of Research of the National Bureau of Standards 45, 225-280.

Lingjaerde, O. & Christopherson, N. (2000), 'Shrinkage Structures of Partial Least Squares', Scandinavian Journal of Statistics 27, 459-473.

Parlett, B. (1998), The Symmetric Eigenvalue Problem, Society for Industrial and Applied Mathematics.

Phatak, A. & de Hoog, F. (2002), 'Exploiting the Connection between PLS, Lanczos, and Conjugate Gradients: Alternative Proofs of some Properties of PLS', Journal of Chemometrics 16, 361-367.

Rosipal, R. & Krämer, N. (2006), Overview and Recent Advances in Partial Least Squares, in 'Subspace, Latent Structure and Feature Selection Techniques', Lecture Notes in Computer Science, Springer, 34-51.

Wold, H. (1975), Path models with Latent Variables: The NIPALS Approach, in H. B. et al., ed., 'Quantitative Sociology: International Perspectives on Mathematical and Statistical Model Building', Academic Press, 307-357.
A Appendix: Results of the simulation study

We display the results of the simulation study that is described in section 7. The following tables show the MSE-RATIO for β̂ as well as for ŷ. In addition to the MSE ratio, we display the optimal numbers of components for each method. It is interesting to see that the two quantities are the same almost all of the time.

collinearity  stnr  MSE-RATIO for m = 1, 2, 3, 4
no 1 0.833 0.980 1.000 1.000
no 3 0.861 0.976 0.993 1.001
no 7 0.676 0.975 1.001 0.999
med. 1 0.958 0.995 0.969 0.988
med. 3 1.000 0.938 0.960 1.000
med. 7 0.993 0.864 0.993 1.002
high 1 1.000 0.847 0.954 0.997
high 2 0.999 0.965 0.980 0.993
high 7 1.000 0.866 0.967 0.992
m_PLS^opt / m_TRN^opt per setting (same order as the rows above): 2/2, 5/5, 2/2, 2/2, 4/4, 3/3, 1/1, 2/2, 5/5
Table 1: MSE-RATIO of β̂ for p = 5. Each row above corresponds to one parameter setting: the first two entries give the setting of the parameters (collinearity, stnr), the remaining entries the MSE ratio for m = 1, ..., 4 components.
collinearity  stnr  MSE-RATIO for m = 1, 2, 3, 4
no 1 0.775 0.978 1.001 1.000
no 3 0.780 0.972 0.990 1.001
no 7 0.570 0.9697 1.001 0.999
med. 1 0.919 0.994 0.969 0.990
med. 3 1.000 0.882 0.967 1.000
med. 7 0.970 0.786 0.992 1.001
high 1 1.004 0.828 0.960 0.997
high 2 0.995 0.951 0.977 0.996
high 7 0.999 0.823 0.973 0.993
m_PLS^opt / m_TRN^opt per setting (same order as the rows above): 3/2, 5/5, 3/2, 2/2, 4/4, 4/4, 1/1, 2/2, 5/3
Table 2: MSE-RATIO of ŷ for p = 5 (same layout as Table 1).
collinearity  stnr  MSE-RATIO for m = 1, ..., 39
no 1 0.929 0.938 0.907 0.905 0.901 0.898 0.892 0.899 0.908 0.913 0.913 0.917 0.924 0.933 0.94 0.949 0.956 0.961 0.968 0.973 0.977 0.98 0.984 0.987 0.989 0.992 0.994 0.995 0.996 0.997 0.998 0.998 0.999 0.999 0.999 1.000 1.000 1.000 1.000
no 3 0.963 0.959 0.952 0.933 0.942 0.942 0.926 0.926 0.933 0.938 0.937 0.931 0.932 0.939 0.945 0.945 0.945 0.944 0.946 0.951 0.958 0.965 0.97 0.976 0.98 0.985 0.989 0.991 0.993 0.994 0.995 0.996 0.997 0.998 0.999 1.000 1.000 1.000 1.000
no 7 0.972 0.977 0.981 0.971 0.954 0.945 0.949 0.956 0.955 0.951 0.947 0.944 0.946 0.946 0.95 0.951 0.954 0.959 0.964 0.973 0.977 0.981 0.984 0.988 0.99 0.993 0.996 0.997 0.998 0.999 0.999 0.999 1.000 1.000 1.000 1.000 1.000 1.000 1.000
med. 1 0.98 0.922 0.875 0.879 0.879 0.878 0.887 0.892 0.897 0.900 0.902 0.909 0.919 0.927 0.933 0.935 0.939 0.943 0.946 0.949 0.954 0.961 0.968 0.975 0.978 0.982 0.986 0.988 0.99 0.992 0.994 0.996 0.996 0.997 0.999 0.999 0.999 1.000 1.000
med. 3 0.998 0.91 0.91 0.913 0.924 0.915 0.906 0.904 0.910 0.916 0.919 0.919 0.92 0.917 0.916 0.918 0.922 0.931 0.934 0.935 0.936 0.94 0.945 0.948 0.953 0.959 0.966 0.973 0.978 0.982 0.985 0.99 0.991 0.993 0.994 0.996 0.998 0.999 0.999
med. 7 0.989 0.978 0.945 0.912 0.898 0.891 0.891 0.895 0.895 0.898 0.907 0.917 0.925 0.936 0.936 0.941 0.945 0.95 0.958 0.962 0.966 0.973 0.98 0.983 0.987 0.991 0.992 0.994 0.995 0.996 0.997 0.998 0.999 0.999 0.999 0.999 1.000 1.000 1.000
high 1 1.000 0.789 0.849 0.857 0.870 0.882 0.891 0.897 0.903 0.902 0.906 0.908 0.913 0.921 0.928 0.938 0.944 0.946 0.953 0.961 0.968 0.972 0.976 0.98 0.981 0.984 0.987 0.99 0.993 0.995 0.996 0.996 0.997 0.998 0.998 0.999 0.999 0.999 1.000
high 2 1.000 0.793 0.843 0.864 0.883 0.890 0.895 0.897 0.902 0.899 0.901 0.904 0.907 0.911 0.916 0.922 0.926 0.930 0.939 0.947 0.955 0.962 0.967 0.970 0.973 0.977 0.981 0.985 0.988 0.991 0.992 0.994 0.995 0.996 0.997 0.998 0.998 0.999 0.999
high 7 1.000 0.792 0.849 0.868 0.879 0.893 0.898 0.903 0.904 0.901 0.902 0.906 0.914 0.922 0.931 0.936 0.936 0.935 0.936 0.939 0.943 0.948 0.950 0.953 0.959 0.966 0.975 0.984 0.988 0.99 0.993 0.995 0.996 0.997 0.998 0.998 0.999 0.999 1.000
m_PLS^opt / m_TRN^opt per setting (same order as the rows above): 1/1, 3/3, 5/5, 1/1, 2/2, 2/2, 1/1, 1/1, 1/1
Table 3: MSE-RATIO of β̂ for p = 40 (same layout as Table 1, with m = 1, ..., 39).
collinearity  stnr  MSE-RATIO for m = 1, ..., 39
no 1 0.781 0.870 0.853 0.874 0.889 0.898 0.897 0.923 0.924 0.935 0.943 0.954 0.959 0.961 0.97 0.975 0.979 0.982 0.986 0.989 0.991 0.993 0.995 0.996 0.996 0.997 0.998 0.999 0.999 0.999 0.999 0.999 1.000 1.000 1.000 1.000 1.000 1.000 1.000
no 3 0.797 0.857 0.899 0.891 0.92 0.942 0.938 0.941 0.944 0.958 0.967 0.967 0.967 0.961 0.969 0.971 0.976 0.981 0.985 0.987 0.99 0.99 0.992 0.993 0.995 0.996 0.997 0.997 0.998 0.999 0.999 0.999 0.999 0.999 1.000 1.000 1.000 1.000 1.000
no 7 0.791 0.853 0.914 0.896 0.893 0.921 0.929 0.943 0.960 0.961 0.959 0.958 0.965 0.966 0.977 0.976 0.983 0.985 0.988 0.991 0.992 0.994 0.996 0.997 0.997 0.998 0.999 0.999 0.999 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
med. 1 0.877 0.785 0.776 0.818 0.839 0.844 0.876 0.886 0.904 0.913 0.922 0.929 0.941 0.948 0.954 0.964 0.968 0.972 0.976 0.977 0.980 0.984 0.987 0.989 0.99 0.992 0.994 0.995 0.996 0.997 0.998 0.998 0.998 0.999 0.999 0.999 1.000 1.000 1.000
med. 3 0.983 0.702 0.818 0.836 0.891 0.884 0.902 0.898 0.916 0.93 0.937 0.942 0.95 0.949 0.953 0.962 0.962 0.966 0.969 0.970 0.973 0.979 0.98 0.982 0.983 0.986 0.990 0.991 0.992 0.993 0.994 0.995 0.996 0.997 0.998 0.998 0.998 0.999 0.999
med. 7 0.924 0.868 0.853 0.838 0.846 0.859 0.88 0.896 0.901 0.915 0.916 0.938 0.942 0.954 0.96 0.967 0.974 0.979 0.980 0.983 0.985 0.988 0.991 0.993 0.994 0.996 0.997 0.998 0.998 0.999 0.999 0.999 0.999 1.000 1.000 1.000 1.000 1.000 1.000
high 1 1.013 0.673 0.778 0.81 0.835 0.862 0.886 0.9 0.915 0.915 0.924 0.932 0.939 0.947 0.953 0.961 0.967 0.968 0.974 0.979 0.984 0.988 0.990 0.992 0.993 0.994 0.995 0.996 0.997 0.998 0.998 0.999 0.999 0.999 0.999 1.000 1.000 1.000 1.000
high 2 1.004 0.684 0.772 0.818 0.855 0.881 0.898 0.906 0.92 0.914 0.92 0.926 0.933 0.942 0.948 0.954 0.957 0.960 0.965 0.972 0.977 0.982 0.986 0.987 0.989 0.991 0.993 0.994 0.996 0.997 0.997 0.998 0.998 0.999 0.999 0.999 0.999 0.999 1.000
high 7 1.001 0.680 0.778 0.822 0.856 0.881 0.898 0.914 0.917 0.921 0.927 0.931 0.937 0.942 0.949 0.957 0.957 0.966 0.970 0.974 0.978 0.981 0.983 0.984 0.985 0.987 0.989 0.991 0.994 0.994 0.996 0.997 0.998 0.998 0.999 0.999 0.999 1.000 1.000
m_PLS^opt / m_TRN^opt per setting (same order as the rows above): 1/1, 3/3, 9/4, 1/1, 1/2, 2/2, 1/1, 1/1, 1/1
Table 4: MSE-RATIO of ŷ for p = 40 (same layout as Table 1, with m = 1, ..., 39).
collinearity  stnr  MSE-RATIO for m = 1, ..., 20
no 1 0.845 0.813 0.863 0.908 0.933 0.954 0.964 0.976 0.984 0.99 0.994 0.996 0.997 0.998 0.999 0.999 1.000 1.000 1.000 1.000
no 3 0.763 0.892 0.884 0.898 0.940 0.953 0.967 0.972 0.982 0.990 0.991 0.994 0.995 0.997 0.998 0.999 0.999 0.999 1.000 1.000
no 7 0.825 0.864 0.881 0.918 0.954 0.960 0.979 0.988 0.993 0.995 0.998 0.998 0.999 0.999 1.000 1.000 1.000 1.000 1.000 1.000
med. 1 0.862 0.753 0.788 0.852 0.889 0.900 0.927 0.942 0.959 0.969 0.976 0.982 0.988 0.991 0.994 0.996 0.997 0.998 0.999 0.999
med. 3 0.839 0.832 0.837 0.852 0.888 0.905 0.935 0.942 0.963 0.970 0.979 0.987 0.991 0.994 0.995 0.997 0.998 0.999 0.999 0.999
med. 7 0.906 0.847 0.859 0.864 0.900 0.915 0.930 0.950 0.961 0.970 0.978 0.984 0.990 0.992 0.994 0.996 0.997 0.998 0.999 0.999
high 1 1.017 0.695 0.806 0.866 0.900 0.926 0.951 0.968 0.979 0.980 0.987 0.991 0.994 0.996 0.997 0.998 0.999 0.999 0.999 1.000
high 2 1.002 0.695 0.808 0.861 0.903 0.931 0.953 0.968 0.968 0.979 0.987 0.992 0.994 0.997 0.998 0.998 0.999 0.999 1.000 1.000
high 7 1.000 0.694 0.812 0.865 0.903 0.931 0.953 0.970 0.979 0.981 0.988 0.993 0.995 0.997 0.998 0.999 0.999 0.999 1.000 1.000
m_PLS^opt / m_TRN^opt per setting (same order as the rows above): 1/1, 2/1, 5/3, 1/1, 1/1, 2/2, 1/1, 1/1, 1/1
Table 5: MSE-RATIO of ŷ for p = 100 (same layout as Table 1, with m = 1, ..., 20). We only display the results for the first 20 components, as the MSE-RATIO equals 1 (up to 4 digits after the decimal point) for the remaining components.