Penalized Maximum Likelihood Principle for Choosing Ridge Parameter

Minh Ngoc Tran
National University of Singapore & Vietnam National University
6 Science Drive 2, Singapore 117546
[email protected]

May 13, 2009

Abstract

We consider the problem of choosing the ridge parameter. Two penalized maximum likelihood (PML) criteria, based on a distribution-free and a data-dependent penalty function respectively, are proposed. These PML criteria can be considered "continuous" versions of AIC. A systematic simulation is conducted to compare the suggested criteria with several existing methods. The simulation results strongly support the use of our method. The method is also applied to two real data sets.

Keywords: data-dependent penalty, loss rank principle, model selection, penalized ML, ridge regression, ridge parameter

1 Introduction

Ridge regression. Consider the standard linear regression model

    y = Xβ + ε,    (1)

where y and ε are n-vectors, X is an (n×p) matrix standardized so that X^T X has the form of a correlation matrix, E(ε) = 0, cov(ε) = σ² I_n, and β = (β_1,...,β_p)^T is the vector of regression coefficients. When X has full rank, it is well known (see, e.g., Hastie et al., 2001) that the unbiased least squares estimator β̂ ≡ β̂(0) of β is β̂ = (X^T X)^{−1} X^T y. When X^T X is nearly singular, however, the expected distance E||β̂ − β||² = σ² tr(X^T X)^{−1} will be very large and β̂ is not stable: a small change in y may lead to a large change in β̂, even in the signs of its components, and some components may be extremely large in absolute value.

The fact that the estimate β̂ may explode when X^T X is ill-conditioned naturally leads to the idea of restricting the coefficients β to a sphere, by minimizing

    Σ_{j=1}^n (y_j − Σ_{i=1}^p β_i x_{ji})²    (2)

subject to

    Σ_{i=1}^p β_i² ≤ s,    (3)

where s ≥ 0 is a complexity parameter of the model. This optimization problem is equivalent to the penalized least squares problem of minimizing

    Σ_{j=1}^n (y_j − Σ_{i=1}^p β_i x_{ji})² + λ Σ_{i=1}^p β_i²,    (4)

where λ > 0 is called the ridge parameter and controls the amount of shrinkage of the regression coefficients. There is a one-to-one correspondence between s and λ (see, e.g., Hastie et al. (2001), Chapter 3): an increase in s corresponds to a decrease in λ, and vice versa. The solution of (4) for a given λ is

    β̂(λ) = (X^T X + λ I_p)^{−1} X^T y.

This is often called the ridge estimator; it was originally introduced by Hoerl and Kennard (1970) to deal with the ill-conditioning of X^T X. Although β̂(λ) is biased when λ > 0, there is a trade-off between bias and variance. Let d_1² ≥ ... ≥ d_p² be the eigenvalues of X^T X; the expected distance between β̂(λ) and β (see Hoerl and Kennard, 1970b) is

    E||β̂(λ) − β||² = λ² β^T (X^T X + λ I_p)^{−2} β + σ² Σ_{i=1}^p d_i²/(d_i² + λ)².    (5)

The first term is the squared bias; it equals 0 when λ = 0. The second term is the sum of variances, tr[var(β̂(λ))], where tr(A) denotes the trace of a matrix A. Hoerl and Kennard (1970) showed that there exists a λ > 0 such that E||β̂(λ) − β||² < E||β̂(0) − β||².

Selection of the ridge parameter. The remaining problem is how to choose a good ridge parameter. Several methods have been proposed: the ridge trace (Hoerl and Kennard, 1970a, 1970b), the Hoerl-Kennard-Baldwin (HKB) estimator (Hoerl et al., 1975), PRESS, cross-validation and its variants (Allen, 1974; Stone, 1974; Geisser, 1975; Craven and Wahba, 1979; Golub et al., 1979), the bootstrap (Delaney and Chatterjee, 1986) and others (Khalaf and Shukur, 2005; Alkhamisi et al., 2006; Alkhamisi and Shukur, 2007). Vinod (1978) gave a detailed survey of ridge regression and related techniques for improvements over OLS. A good book on ridge regression is Draper and Smith (1981).

In this paper, we propose two penalized maximum likelihood (PML) criteria for choosing the optimal ridge parameter. The criteria are of the form

    − sup(log-likelihood) + penalty on the complexity of the model.

Two penalty functions are proposed: one is distribution-free and the other is data-dependent. These PML criteria can be considered "continuous" versions of AIC (Akaike, 1973), whose penalty on model complexity is the number of coefficients, a discrete quantity.

In Section 2 we review (with some new results) the loss rank principle (LoRP), a novel method for model selection recently introduced by Hutter (2007) (see also Hutter and Tran, 2008). Two PML criteria are derived in Section 3 by virtue of the LoRP. A simulation study is carried out in Section 4 to compare the suggested method with several competitors. In Section 5, the method is applied to two real data sets. Section 6 contains the conclusions. Some notes on the simulation are relegated to the Appendix.
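As a concrete illustration (the paper's own software, mentioned in the Appendix, is written in Matlab), a minimal Python sketch of the ridge estimator and of the right-hand side of (5) might look as follows; the function names and the use of numpy are illustrative choices, not part of the paper.

```python
import numpy as np

def ridge_estimator(X, y, lam):
    """beta_hat(lambda) = (X^T X + lambda I_p)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def expected_sq_distance(X, beta, sigma2, lam):
    """Right-hand side of (5): squared bias plus sum of variances."""
    p = X.shape[1]
    XtX = X.T @ X
    d2 = np.linalg.eigvalsh(XtX)                    # eigenvalues d_i^2 of X^T X
    inv = np.linalg.inv(XtX + lam * np.eye(p))
    bias2 = lam**2 * beta @ inv @ inv @ beta        # lambda^2 beta^T (X^T X + lambda I)^{-2} beta
    variance = sigma2 * np.sum(d2 / (d2 + lam)**2)  # sigma^2 sum_i d_i^2/(d_i^2 + lambda)^2
    return bias2 + variance
```

For lam = 0 the second function reduces to σ² tr(X^T X)^{−1}, the expected squared distance of the OLS estimator.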

2 The loss rank principle

Hutter (2007) recently introduced a novel method for model selection, called the Loss Rank Principle (LoRP). The method is based on the so-called loss rank of a model, namely the number of other (fictitious) data sets that fit the model better than the training data; LoRP then selects the model with the smallest loss rank as the best one. More precisely, consider a training data set D = (x, y) = {(x_1, y_1),...,(x_n, y_n)} from a regression model y_i = f(x_i) + ε_i. Suppose we use a model M to fit the data D; for example, M may be the classical regression model with p predictors, the k-nearest-neighbors regression, or the ridge regression with ridge parameter λ.

Imagine that, in an experimental setting, we could repeat the experiment many times with the design points x fixed; we would then obtain many other (fictitious) outputs y′. If the model M is complex/flexible (large p, small k, small λ), then M fits the training data (x, y) well and it also fits (x, y′) well, i.e. both loss functions Loss_M(y|x) and Loss_M(y′|x) are small. Let t denote a regressor in model M and γ(t)(x, y) a contrast/loss function, e.g. γ(t)(x, y) = |y − t(x)|²; the general form of the loss function is Loss_M(y|x) = inf_{t∈M} Σ_{i=1}^n γ(t)(x_i, y_i). Here, for simplicity, we consider only the squared loss γ(t)(x, y) = |y − t(x)|², so that Loss_M(y|x) = ||y − ŷ||² = Σ_{i=1}^n (y_i − ŷ_i)², where ŷ is the fitted vector under model M. Then the loss rank of M, defined by

    Rank_M(D) = #{y′ : Loss_M(y′|x) ≤ Loss_M(y|x)},

will be large. In the opposite case, where M is small/inflexible, both Loss_M(y|x) and Loss_M(y′|x) are large, which also leads to a large loss rank. Thus it is natural to choose the model with the smallest loss rank as a good model; by doing so we trade off fit against model complexity. For continuous spaces such as ℝⁿ, it is natural to use volume instead of a count in the definition of the loss rank, i.e.

    Rank_M(D) = Vol{y′ ∈ ℝⁿ : Loss_M(y′|x) ≤ Loss_M(y|x)}.

Consider the case of linear models, where the fitted vector ŷ is linear in y, i.e. ŷ = M(x) y with the regression matrix M = M(x) depending only on x. Then Loss_M(y|x) = ||y − M y||² = y^T A y with A = (I_n − M)^T (I_n − M), and the loss rank is

    Rank_M(D) = Vol{y′ ∈ ℝⁿ : y′^T A y′ ≤ L},  where L := y^T A y.

Suppose for the moment that det(A) ≠ 0, which is equivalent to assuming that all eigenvalues of the regression matrix M are different from 1. Then the set {y′ ∈ ℝⁿ : y′^T A y′ ≤ L} is an ellipsoid in ℝⁿ, so its volume is

    Rank_M(D) = v_n L^{n/2} / √(det A),

where v_n = π^{n/2}/Γ(n/2 + 1) is the volume of the unit sphere in ℝⁿ. Because the logarithm is strictly increasing, instead of comparing the loss ranks of candidate models directly we may work with LR_M(D) := log Rank_M(D), which is more convenient. Taking the logarithm of Rank_M(D) and neglecting the constant log v_n, which does not depend on the model M, we finally get

    LR_M(D) = (n/2) log(y^T A y) − (1/2) log det A.    (6)

Principle 1. Given a class of linear models M = {M}, assume that all eigenvalues of the regression matrix M of each model M are different from 1. Then the best model in M is the one with the smallest loss rank,

    M_best = argmin_{M∈M} { (n/2) log(y^T A y) − (1/2) log det A },    (7)

where A = (I_n − M)^T (I_n − M).
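A direct numerical transcription of (6)-(7) for a generic linear smoother could look like the following sketch; the regression matrix M is assumed to be supplied explicitly as an (n×n) array with no eigenvalue equal to 1, and the function name is illustrative.

```python
import numpy as np

def loss_rank(M, y):
    """Loss rank (6) of a linear model with regression matrix M (y_hat = M y)."""
    n = len(y)
    A = (np.eye(n) - M).T @ (np.eye(n) - M)     # A = (I - M)^T (I - M)
    _, logdet_A = np.linalg.slogdet(A)          # log det A, computed stably
    return 0.5 * n * np.log(y @ A @ y) - 0.5 * logdet_A
```

Among the candidate models, Principle 1 selects the one whose regression matrix gives the smallest value of this quantity.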

Now consider the case where det A = 0 (e.g. projective regression) or A is nearly singular (e.g. ridge regression with λ very close to 0). In such cases Rank_M(D) is infinite or extremely large. As in ridge regression, to prevent the loss rank from becoming infinite we add a small penalty α||y||² to the loss:

    Loss_M(y|x) := ||y − ŷ||² + α||y||² = y^T S_α y,  with α > 0 small,

where S_α = A + α I_n, which is nonsingular. Similarly to the derivation above,

    Rank^α_M(D) = Vol{y′ ∈ ℝⁿ : y′^T S_α y′ ≤ L} = v_n L^{n/2} / √(det S_α),  where L := y^T S_α y.

Taking the logarithm and neglecting a constant, we define the loss rank of model M (which depends on α) as

    LR^α_M(D) = (n/2) log(y^T S_α y) − (1/2) log det S_α.    (8)

How do we deal with the extra parameter α? We are seeking a model with the smallest loss rank, so it is natural to minimize LR^α_M(D) over α (see Hutter (2007) for a more detailed interpretation); the loss rank of model M is finally defined as

    LR_M(D) = inf_{α>0} LR^α_M(D) = inf_{α>0} { (n/2) log(y^T S_α y) − (1/2) log det S_α }.    (9)

Principle 2. (Hutter, 2007) Given a class of linear models M = {M}, the best model in M is the one with the smallest loss rank,

    M_best = argmin_{M∈M} LR_M(D) = argmin_{M∈M} inf_{α>0} { (n/2) log(y^T S_α y) − (1/2) log det S_α },    (10)

where S_α = A + α I_n = (I_n − M)^T (I_n − M) + α I_n and M is the regression matrix of the linear model M.

LoRP is an original and attractive idea. At the time the present paper was written, several important properties of LoRP had been established (see Hutter and Tran, 2008): LoRP consistently identifies the true model as n → ∞ if the collection of candidate models contains the true model; the loss rank is an asymptotically unbiased estimator of the Kullback-Leibler discrepancy between the candidate and the true model; in a Bayesian framework with Gaussian priors the loss rank is asymptotically proportional to the negative logarithm of the posterior model probability; and LoRP has an interpretation in terms of the minimum description length (MDL) principle. The interested reader is referred to Hutter and Tran (2008) and Tran (2009) for details. In the next section, two PML criteria for choosing the ridge parameter are derived by virtue of the LoRP.
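The infimum over α in (9)-(10) can be approximated numerically; the sketch below evaluates LR^α on a logarithmic grid of α values. The grid is an illustrative choice, not taken from the paper, which instead derives a closed-form minimizer for the ridge case in Section 3.

```python
import numpy as np

def regularized_loss_rank(M, y, alphas=np.logspace(-8, 2, 200)):
    """Loss rank (9): LR^alpha minimized over a grid of alpha > 0."""
    n = len(y)
    A = (np.eye(n) - M).T @ (np.eye(n) - M)
    eig = np.linalg.eigvalsh(A)                 # eigenvalues of A (all >= 0)
    loss = y @ A @ y                            # ||y - M y||^2
    yy = y @ y
    lr_alpha = [0.5 * n * np.log(loss + a * yy)      # n/2 log(y^T S_alpha y)
                - 0.5 * np.sum(np.log(eig + a))      # -1/2 log det S_alpha
                for a in alphas]
    return min(lr_alpha)
```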

3 Penalized ML for choosing λ

We now return to ridge regression. Denote by M_λ the ridge regression model with parameter λ; M = {M_λ, λ > 0} is the class of candidate models. The regression matrix of model M_λ is M_λ = X(X^T X + λ I_p)^{−1} X^T. The fitted vector ŷ_λ = M_λ y is linear in y, so M_λ is a linear model. The matrices A and S_α of the previous section now become

    A_λ = (I_n − M_λ)^T (I_n − M_λ),  S_λα = A_λ + α I_n.

Distribution-free penalized ML criterion. Consider the singular value decomposition (SVD) X = U D V^T, where U is an (n×n) orthogonal matrix, V is a (p×p) orthogonal matrix, and D is an (n×p) diagonal matrix with principal diagonal elements d_1 ≥ ... ≥ d_p ≥ 0. Using the SVD of X, one sees that the eigenvalues of A_λ are (λ/(d_1²+λ))², ..., (λ/(d_p²+λ))², 1, ..., 1 (with n−p eigenvalues equal to 1), so det(A_λ) = Π_{i=1}^p (λ/(d_i²+λ))². From (6), the best λ chosen by LoRP is the one that minimizes

    LR_λ(D) ≡ LR_{M_λ}(D) = (n/2) log(||y − ŷ_λ||²) + Σ_{i=1}^p log(1 + d_i²/λ).    (11)
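A minimal sketch of criterion (11) (referred to as LR1 in Section 4), computed via the SVD of X, could look as follows; the implementation details are illustrative, not the author's Matlab code.

```python
import numpy as np

def lr1(X, y, lam):
    """Criterion (11): n/2 log||y - y_hat_lambda||^2 + sum_i log(1 + d_i^2/lambda)."""
    n, p = X.shape
    d = np.linalg.svd(X, compute_uv=False)       # singular values d_1 >= ... >= d_p
    beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    rss = np.sum((y - X @ beta_hat) ** 2)        # ||y - y_hat_lambda||^2
    return 0.5 * n * np.log(rss) + np.sum(np.log(1.0 + d**2 / lam))
```

The selected λ is then simply the minimizer of this quantity over a grid, for example np.arange(0.001, 1.001, 0.001), which matches the grid described in Section 4.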

Assume that the noise ε is Gaussian N(0, σ² I_n). The log-likelihood of the observations from model (1) (neglecting the constant −(n/2) log(2π)) is then

    l_n(β, σ²) = −(n/2) log σ² − (1/(2σ²)) ||y − Xβ||².

Because of the equivalence of (2)-(3) and (4), the set Θ_λ = {θ = (β_1,...,β_p, σ²) : ||β||² ≤ s, σ² > 0} (note that s = s(λ), by the correspondence between s and λ) can be seen as the parameter space of the regression model M_λ. Since ε is Gaussian,

    sup_{θ∈Θ_λ} l_n(β, σ²) = −(n/2) log((1/n) ||y − ŷ_λ||²) − n/2.

So

    LR_λ(D) = − sup_{θ∈Θ_λ} l_n(β, σ²) + Σ_{i=1}^p log(1 + d_i²/λ) + (n/2) log n − n/2.

Neglecting (n/2) log n − n/2, which does not depend on M_λ, we can write the loss rank of M_λ as

    LR_λ(D) = − sup_{θ∈Θ_λ} l_n(β, σ²) + Σ_{i=1}^p log(1 + d_i²/λ).

This criterion is of the penalized maximum likelihood form

    − sup(log-likelihood) + "penalty on the model complexity",    (12)

where the penalty is

    pen_1(n, λ) = Σ_{i=1}^p log(1 + d_i²/λ).    (13)

We can see that c = c(λ) = 1/λ is a measure of the complexity of model M_λ: a large c (small λ) gives a big Θ_λ, so M_λ is complex/flexible; a small c (large λ) gives a small Θ_λ, so M_λ is simple/inflexible. The function pen_1(n, λ) has the usual properties of a penalty function:

• pen_1(n, λ) is a monotonically increasing function of the complexity c;
• pen_1(n, λ) → 0 as c → 0, and pen_1(n, λ) → ∞ as c → ∞.

Akaike's information criterion (Akaike, 1973),

    AIC = − sup(log-likelihood) + "number of parameters of the model",

uses the number of parameters, a discrete number, as the penalty on model complexity. When the parameter space is restricted to Θ_λ, LoRP uses pen_1(n, λ), a continuous quantity, as the penalty. Therefore, (12) can be seen as a "continuous" version of AIC. Note that the penalty function (13) is distribution-free, because it does not depend on y. Distribution-free PML sometimes performs poorly for specific distributions; in the next paragraph we obtain a data-dependent penalty function.

Data-dependent penalized ML criterion. The matrix A_λ is nearly singular when λ is close to 0. As a result, LR_{M_λ}(D) = log((y^T A_λ y)^{n/2} / √(det A_λ)) would be extremely large and unstable (a small change in y would lead to a large change in the loss rank). As with the original idea of ridge regression, adding a small penalty α||y||² to the loss function and working with S_λα instead of A_λ helps stabilize the problem. In this section we consider the loss rank defined in (9) and estimate the minimizer α = α_m. We have shown earlier that (λ/(d_1²+λ))², ..., (λ/(d_p²+λ))², 1, ..., 1 are the eigenvalues of A_λ; hence the eigenvalues of S_λα = A_λ + α I_n are

    α + (λ/(d_1²+λ))², ..., α + (λ/(d_p²+λ))², 1 + α, ..., 1 + α.

Suppose for the moment that α = O_P(1/n) (this will be justified later; here α may be a random variable), where a_n = O_P(b_n) means |a_n/b_n| ≤ C for some bounded constant C as n → ∞ with probability 1 (w.p.1). Then we get, w.p.1,

    det S_λα = (1 + α)^{n−p} Π_{i=1}^p (α + (λ/(d_i²+λ))²)
             = (1 + α)^{n−p} [Π_{i=1}^p (1 + α((d_i²+λ)/λ)²)] [Π_{i=1}^p (λ/(d_i²+λ))²]
             ≈ [1 + (n−p)α] [1 + α Σ_{i=1}^p ((d_i²+λ)/λ)²] [Π_{i=1}^p (λ/(d_i²+λ))²]
             ≈ [1 + α(n − p + Σ_{i=1}^p ((d_i²+λ)/λ)²)] [Π_{i=1}^p (λ/(d_i²+λ))²]
             = [1 + αν] Π_{i=1}^p (λ/(d_i²+λ))²,

where

    ν := n − p + Σ_{i=1}^p ((d_i²+λ)/λ)² = n + 2p/λ + (1/λ²) Σ_{i=1}^p d_i^4,

the last equality using Σ_{i=1}^p d_i² = tr(X^T X) = p, since X^T X is in correlation form.

Let ρ_λ = ||y − ŷ_λ||² / ||y||². Then (8) becomes

    LR^α_{M_λ}(D) = (n/2) log(y^T S_λα y) − (1/2) log det S_λα
                 ≈ (n/2) log ||y||² + (n/2) log(ρ_λ + α) − (1/2) log(1 + αν) − (1/2) log[Π_{i=1}^p (λ/(d_i²+λ))²].

Solving ∂LR^α_λ(D)/∂α = 0 with respect to α, we get a minimum at

    α = α_m = (νρ_λ − n) / ((n−1)ν),  provided νρ_λ > n w.p.1.    (14)

The quantity ρ_λ can be considered a measure of fit; clearly, in the case of overfitting, ρ_λ will be very close to 0. The main point of LoRP is to avoid overfitting, so it is reasonable to consider only λ such that ρ_λ is not too small, in the sense of the following condition.

Condition (C): νρ_λ = (n + 2p/λ + (1/λ²) Σ_{i=1}^p d_i^4) ρ_λ > n w.p.1.

Our experience to date shows that this condition is mostly satisfied in practice. Under condition (C), α_m = O_P(1/n), which justifies the assumption made above about α. We then also have α_m/ρ_λ = O_P(1/n), which leads to

    (n/2) log(ρ_λ + α_m) = (n/2) [log ρ_λ + α_m/ρ_λ − (1/2)(α_m/ρ_λ)² + (1/3)(α_m/ρ_λ)³ − ...]
                         = (n/2) log ρ_λ + 1/2 + o_P(1),

where a_n = o_P(1) means |a_n| → 0 as n → ∞ w.p.1. Combining the last equalities and neglecting constants that do not depend on the model M_λ, we can finally write the loss rank of model M_λ as

    LR_λ(D) ≡ LR^{α_m}_{M_λ}(D) = (n/2) log(||y − ŷ_λ||²) − (1/2) log[ ((νρ_λ − 1)/(n−1)) Π_{i=1}^p (λ/(d_i²+λ))² ].    (15)
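Analogously to the sketch of (11), criterion (15) (LR2 in Section 4) can be written directly in terms of ρ_λ and ν as defined above; the sketch returns +∞ when condition (C) fails, so such λ's are never selected. The implementation choices are illustrative, not the author's.

```python
import numpy as np

def lr2(X, y, lam):
    """Criterion (15), using rho = ||y - y_hat||^2/||y||^2 and nu from the text."""
    n, p = X.shape
    d = np.linalg.svd(X, compute_uv=False)
    beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    rss = np.sum((y - X @ beta_hat) ** 2)
    rho = rss / np.sum(y ** 2)
    nu = n + 2 * p / lam + np.sum(d ** 4) / lam ** 2
    if nu * rho <= n:                            # condition (C) violated
        return np.inf
    return (0.5 * n * np.log(rss)
            + np.sum(np.log(1.0 + d**2 / lam))   # sum_i log(1 + d_i^2/lambda)
            - 0.5 * np.log((nu * rho - 1) / (n - 1)))
```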

Under the assumption ε ∼ N(0, σ² I_n), we again obtain a penalized maximum likelihood criterion

    − sup(log-likelihood) + "penalty on the model complexity",

where the penalty is

    pen_2(n, λ) = −(1/2) log[ ((νρ_λ − 1)/(n−1)) Π_{i=1}^p (λ/(d_i²+λ))² ]
                = Σ_{i=1}^p log(1 + d_i²/λ) − (1/2) log((νρ_λ − 1)/(n−1)).    (16)

Noting that, for sufficiently large n,

    pen_2(n, λ) ≈ Σ_{i=1}^p log(1 + d_i²/λ) − (1/2) log ρ_λ,

and that ρ_λ increases with λ, with ρ_λ ↑ 1 as λ ↑ ∞, we have w.p.1 that

• pen_2(n, λ) is a monotonically increasing function of the complexity c;
• pen_2(n, λ) → 0 as c → 0, and pen_2(n, λ) → ∞ as c → ∞.

Therefore pen_2(n, λ) also has the usual properties of a penalty function. This penalty depends on ρ_λ, so it is data-dependent. It has been widely noted that PML criteria based on distribution-free penalties sometimes work poorly for specific distributions, which leads us to the idea of using data-dependent penalties; PML based on a data-dependent penalty may be expected to perform better than PML based on a distribution-free one. In the next section we will see, by means of simulation, that PML based on pen_2(n, λ) works slightly better than PML based on pen_1(n, λ).

4 Simulation

In this section a systematic simulation study is conducted to evaluate the performance of the suggested criteria for choosing the optimal λ and to compare them with other competitors, including GCV (Golub et al., 1979),

    GCV(λ) = (1/n) ||(I_n − M_λ) y||² / [(1/n) tr(I_n − M_λ)]²,  where M_λ = X(X^T X + λ I_p)^{−1} X^T,

the HKB estimator (Hoerl et al., 1975),

    λ_HKB = p s² / ||β̂(0)||²,  s² = ||y − X β̂(0)||² / (n − p),

and ordinary least squares (OLS). Many methods for choosing λ have been proposed, and a complete comparison is beyond the scope of the present paper. HKB was introduced by the authors of the original papers on ridge regression, while GCV is the most widely used method; that is why we consider these two alongside our criteria.
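For completeness, hedged sketches of the two competitors as defined above; the code is illustrative, not the original authors' implementations.

```python
import numpy as np

def gcv(X, y, lam):
    """GCV(lambda) = (1/n)||(I - M_lambda)y||^2 / [(1/n) tr(I - M_lambda)]^2."""
    n, p = X.shape
    M = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # regression matrix M_lambda
    resid = y - M @ y
    return (np.sum(resid ** 2) / n) / (np.trace(np.eye(n) - M) / n) ** 2

def hkb(X, y):
    """lambda_HKB = p s^2 / ||beta_hat(0)||^2, with s^2 the OLS residual variance."""
    n, p = X.shape
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]           # beta_hat(0)
    s2 = np.sum((y - X @ beta_ols) ** 2) / (n - p)
    return p * s2 / np.sum(beta_ols ** 2)
```

GCV, like LR1 and LR2, is minimized over a grid of λ values, while HKB gives a closed-form choice of λ.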

A complete comparison of all proposed methods is worth investigating, and we intend to do so in the future. The criteria (11) and (15) will be referred to as LR1 and LR2, respectively.

Simulation setup. The two factors that affect ridge regression the most are the degree of correlation between the explanatory variables and the signal-to-noise ratio (SNR). The degree of correlation is often measured by the condition number (Belsley et al., 1980), defined as d_1/d_p ≥ 1, where d_1 ≥ ... ≥ d_p > 0 are the singular values of the design matrix X. The larger the condition number, the stronger the dependencies between the explanatory variables. The SNR is defined as ||β||²/σ². In our study, four levels of correlation (very weak, weak, strong and very strong), corresponding to condition numbers 5, 10, 50 and 100 (according to Belsley et al. (1980)), were studied. We considered three levels of SNR, namely 1, 10 and 100, which can be regarded as large, medium and small errors, respectively. Therefore 12 ridge regression models, representing various situations one would face in the real world, were studied. For each model, a design matrix of size (50×4) and a response vector were generated. The way of generating design matrices, regression parameters, noises and response vectors for given condition numbers and SNRs is described in detail in the Appendix. To search for the optimal ridge parameter, 1000 values of λ ranging from 0.001 to 1 in increments of 0.001 were used.

The performance of the methods was measured in terms of the average MSE of the regression coefficients. For each of the 12 regression models, 100 replications were generated, and the MSEs and the chosen λ's were averaged over the 100 replications. For a method δ, its average MSE was computed as

    MSE(δ) = (1/100) Σ_{j=1}^{100} ||β^(j) − β̂^(j)(δ)||,

where β^(j) is the vector of true coefficients in the j-th replication and β̂^(j)(δ) is the ridge estimator of β^(j) with λ chosen by method δ. Along with the average MSE(δ), the standard deviation sd(δ) was also computed.

The results. Table 1 presents the average and standard deviation of the MSEs over 100 replications for each of the 12 ridge regression models. The numbers in brackets are the means and standard deviations of the chosen λ's for each criterion. LR2 outperformed the others, especially when there were at least weak dependencies (condition number ≥ 10) between the explanatory variables. We used the paired-samples test for comparing means (see, e.g., Rice (1995), Chapter 11) to test the hypothesis H0: LR2 = δ, i.e. that the overall average MSE of method δ is the same as that of LR2, against the alternative H1: LR2 > δ (LR2 is better than δ), i.e. that the overall average MSE of δ is larger than that of LR2, where δ is each of LR1, GCV, HKB and OLS. Table 2 shows the P-values of the tests, in which P-values smaller than 0.01 were rounded down to 0.

As shown, when there are dependencies between the explanatory variables, most of the P-values are smaller than the 0.05 significance level. Thus, we can conclude that the improvement of LR2 over the others is statistically significant. In general, the performance of the criteria can be ranked as LR2 > LR1 > GCV > HKB > OLS, and the chosen ridge parameters as λ_LR1 > λ_LR2 > λ_GCV > λ_HKB > λ_OLS ≡ 0. So we can conclude that GCV, HKB and OLS are overfitting, while LR1 is slightly too parsimonious, and LR2 does the best job. Recently, the overfitting of GCV for choosing the lasso parameter was shown theoretically by Wang et al. (2007); Tran (2009) showed that LoRP can consistently identify the true covariates by selecting the optimal lasso parameter.

Looking again at Table 1, we see that as the condition number increases, the performance of both LR1 and LR2 improves, while that of GCV and HKB deteriorates. In general, the chosen λ's increase as the condition number increases (except for HKB) and decrease as the SNR increases. Interestingly, the standard deviations of LR2 are often larger than those of LR1; this is consistent with what we expect, because LR2 is based on a data-dependent penalty. In summary, the simulation results strongly support the use of LR2 and LR1 when there are at least weak dependencies (condition number at least 10) between the explanatory variables.

5 Real data applications

Example 1: Hald cement data. We consider the well-known Hald data, analyzed by Draper and Smith (1981) and Delaney and Chatterjee (1986). The data set contains a design matrix X of size (13×4) and the corresponding observations of the response variable Y. Using HKB, Draper and Smith proposed λ̂_HKB = 0.0131, and using the bootstrap, Delaney and Chatterjee proposed λ̂_boot = 0.028. Figure 1(a) presents the curves of LR1, LR2 and GCV versus λ; the selected λ's were 0.135, 0.095 and 0.025, respectively. Table 3 lists the selected values in increasing order. The condition number d_1/d_4 = 20.58 indicates moderately strong dependencies among the columns of X. In view of the results of the previous section, we strongly believe that λ̂_LR2 = 0.095 is the optimal choice.

Example 2: Prostate cancer data. Stamey et al. (1989) studied the correlation between the level of prostate-specific antigen (lpsa) and 8 clinical measures in men: lcavol, lweight, age, lbph, svi, lcp, gleason and pgg45. The data set is freely available online. Figure 1(b) shows the curves of LR1, LR2 and GCV versus the effective number of parameters (Hastie et al., 2001), defined as df(λ) = tr(M_λ). As λ varies from 0 to ∞, df(λ) varies from 8 to 0. The selected df's are listed in Table 4. Obviously, the stronger the dependencies between covariates, the smaller the df. The condition number of 243.3 indicates very strong dependencies among the 8 explanatory variables.

Table 1: Average MSEs and SDs over 100 replications (numbers in brackets are the means and SDs of the chosen λ's)

SNR  CN   LR1                     LR2                      GCV                       HKB                       OLS
1    5    1.96±0.53 (0.84±0.19)   1.95±0.54 (0.80±0.21)    2.36±1.51 (0.39±0.31)     2.26±1.31 (0.21±0.17)     3.18±2.15 (0)
1    10   1.95±0.67 (0.83±0.19)   1.94±0.68 (0.79±0.21)    2.77±2.66 (0.38±0.32)     2.87±2.23 (0.13±0.15)     6.05±4.06 (0)
1    50   2.07±0.87 (0.84±0.19)   2.06±0.88 (0.81±0.21)    6.52±11.52 (0.36±0.32)    10.12±13.91 (0.05±0.14)   29.31±23.37 (0)
1    100  2.10±0.70 (0.87±0.16)   2.09±0.72 (0.83±0.18)    4.95±9.41 (0.38±0.31)     17.86±22.99 (0.02±0.08)   58.41±39.57 (0)
10   5    1.33±0.64 (0.24±0.14)   1.24±0.61 (0.20±0.13)    0.99±0.58 (0.05±0.06)     0.95±0.57 (0.03±0.01)     1.01±0.60 (0)
10   10   1.62±0.87 (0.24±0.12)   1.57±0.88 (0.20±0.12)    1.61±0.97 (0.05±0.07)     1.71±0.87 (0.04±0.01)     1.94±1.25 (0)
10   50   1.46±0.93 (0.25±0.14)   1.44±0.95 (0.21±0.14)    3.47±4.03 (0.04±0.08)     4.26±5.61 (0.01±0.01)     9.91±8.38 (0)
10   100  1.44±0.82 (0.24±0.13)   1.42±0.83 (0.20±0.13)    2.95±2.96 (0.03±0.07)     6.27±8.06 (0.01±0.01)     18.85±13.39 (0)
100  5    0.54±0.33 (0.05±0.01)   0.49±0.31 (0.04±0.01)    0.32±0.20 (0.001±0.003)   0.32±0.20 (0.003±0.001)   0.31±0.19 (0)
100  10   1.36±0.91 (0.06±0.01)   1.328±0.88 (0.05±0.01)   1.327±0.95 (0.006±0.005)  1.40±0.95 (0.002±0.001)   2.02±1.37 (0)
100  50   1.377±0.92 (0.07±0.03)  1.371±0.92 (0.06±0.03)   1.47±0.96 (0.007±0.006)   1.66±1.12 (0.002±0.002)   2.78±2.28 (0)
100  100  1.46±0.90 (0.07±0.03)   1.45±0.91 (0.05±0.03)    1.59±1.19 (0.005±0.004)   2.69±3.13 (0.001±0.001)   6.02±4.78 (0)

Thus, df_GCV = 7.20 and df_HKB = 7.46 seem to be too large, producing overfitted models, while df_LR2 = 5.22 appears to be a reasonable choice. Using tenfold cross-validation, Hastie et al. (2001, p. 61) reported a selected df_CV = 4.16, which implies that λ_CV > λ_LR1 > λ_LR2. The ranking in Section 4 leads us to conclude that df_CV = 4.16 is too parsimonious, producing an underfitted model. In summary, we strongly believe that df_LR2 = 5.22 is the optimal effective number of parameters.
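For reference, df(λ) = tr(M_λ) can be computed from the singular values of X; the closed form below (a standard identity following from the SVD used in Section 3, not stated explicitly in the paper) is a minimal illustrative sketch.

```python
import numpy as np

def effective_df(X, lam):
    """df(lambda) = tr(M_lambda) = sum_i d_i^2 / (d_i^2 + lambda)."""
    d = np.linalg.svd(X, compute_uv=False)
    return np.sum(d ** 2 / (d ** 2 + lam))
```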

6 Conclusions

We proposed two penalized maximum likelihood criteria (LR1 and LR2), based on a distribution-free penalty and a data-dependent penalty respectively, for choosing the optimal ridge parameter. By means of a systematic simulation, the suggested criteria were compared with several competitors, including GCV, the HKB estimator and OLS. The simulation strongly supports the use of LR2 and LR1, which outperform the others, especially when there are dependencies among the explanatory variables. Two real data applications were also studied, in which our selected parameters are surprisingly different from the existing results.

Table 2: P-values for testing H0: LR2 = δ against H1: LR2 > δ

SNR  CN   LR2 > LR1   LR2 > GCV   LR2 > HKB   LR2 > OLS
1    5    0.30        0           0.01        0
1    10   0.37        0           0           0
1    50   0           0           0           0
1    100  0           0           0           0
10   5    0           1           1           0.99
10   10   0           0.35        0.02        0.01
10   50   0           0           0           0
10   100  0           0           0           0
100  5    0           1           1           1
100  10   0.07        0.50        0.27        0
100  50   0           0.14        0           0
100  100  0           0.03        0           0

Table 3: Selected λ's by different criteria (Hald cement data)

     HKB      GCV     Bootstrap   LR2     LR1
λ    0.0131   0.025   0.028       0.095   0.135

Appendix

As described in Section 4, each ridge regression model consists of a design matrix X, regression coefficients β, noise ε and a response vector y. Similar to the simulation in Delaney and Chatterjee (1986), the set (X, β, ε, y) for a given condition number and signal-to-noise ratio SNR was generated as follows (with n = 50 and p = 4).

Design matrix X. First, a matrix X_0 of size (n×p) with entries sampled from a uniform distribution on [−1, 1] was generated. Let X_0 = U D_0 V^T be the SVD of X_0, where d_01 ≥ ... ≥ d_0p are the principal diagonal elements of the (n×p) diagonal matrix D_0. For a given condition number CN, let D be an (n×p) diagonal matrix with principal diagonal elements d_1 ≥ ... ≥ d_p, where d_p = d_01/CN and d_i = d_0i for i = 1,...,p−1. Then X = U D V^T is the design matrix with the desired condition number.

Figure 1: (a) Hald cement data: LR1, LR2 and GCV curves versus λ for choosing the optimal ridge parameter. (b) Prostate cancer data: LR1, LR2 and GCV curves versus the effective number of parameters.

Table 4: Selected df's by different criteria (prostate cancer data)

     LR1    LR2    GCV    HKB
df   5.10   5.22   7.20   7.46

Coefficient vector β. For a coefficient vector β, ||β|| = √(β^T β) is called the length of the signal. In our study we fixed ||β|| = 10. To generate β, we first created a vector u = (u_1,...,u_p)^T whose components were sampled from a uniform distribution on [−1, 1]; the components of β were then computed as β_i = 10 u_i/||u||.

Noise ε. For a given signal-to-noise ratio SNR, n samples ε_1,...,ε_n were drawn from a normal distribution with mean 0 and variance σ² = 10²/SNR. Then ε = (ε_1,...,ε_n)^T is the noise vector with the desired signal-to-noise ratio ||β||²/σ² = SNR.

Response vector y. Finally, the response vector was computed as y = Xβ + ε.

Software. The software implementing the simulation was written in Matlab. The source code is freely available from the author upon request.
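Purely as an illustration of the recipe above (the original code is Matlab and available from the author), a Python transcription might look as follows; the function name, argument defaults and random-number generator are illustrative choices, not part of the paper.

```python
import numpy as np

def generate_data(n=50, p=4, cond_number=50, snr=10, seed=0):
    """Generate (X, beta, y) following the Appendix's recipe."""
    rng = np.random.default_rng(seed)
    # Design matrix with the desired condition number d_1/d_p
    X0 = rng.uniform(-1, 1, size=(n, p))
    U, d0, Vt = np.linalg.svd(X0, full_matrices=False)
    d = d0.copy()
    d[-1] = d0[0] / cond_number            # d_p = d_01/CN, the other d_i unchanged
    X = U @ np.diag(d) @ Vt
    # Coefficient vector with fixed length ||beta|| = 10
    u = rng.uniform(-1, 1, size=p)
    beta = 10 * u / np.linalg.norm(u)
    # Gaussian noise with variance sigma^2 = ||beta||^2/SNR = 100/SNR
    eps = rng.normal(0.0, np.sqrt(100.0 / snr), size=n)
    return X, beta, X @ beta + eps
```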

Acknowledgement

The author would like to thank the associate editor and the reviewer for their careful reading and helpful comments.

References

H. Akaike. Information theory and an extension of the maximum likelihood principle. In Proc. 2nd International Symposium on Information Theory, pages 267–281, Budapest, Hungary, 1973. Akademiai Kiado.

M. Alkhamisi, G. Khalaf, and G. Shukur. Some modifications for choosing ridge parameters. Commun. Statist. Theory and Methods, 35:2005–2020, 2006.

M. Alkhamisi and G. Shukur. A Monte Carlo study of recent ridge parameters. Commun. Statist. Simulation and Computation, 36:535–547, 2007.

D. Allen. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16:125–127, 1974.

D. A. Belsley, E. Kuh, and R. E. Welsch. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley, New York, 1980.

P. Craven and G. Wahba. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31:377–403, 1979.

N. J. Delaney and S. Chatterjee. Use of the bootstrap and cross-validation in ridge regression. Journal of Business and Economic Statistics, 4(2):225–262, 1986.

N. R. Draper and H. Smith. Applied Regression Analysis. John Wiley, New York, 1981.

S. Geisser. The predictive sample reuse method with applications. J. Amer. Statist. Assoc., 70:320–328, 1975.

G. H. Golub, M. Heath, and G. Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21:215–223, 1979.

T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, 2001.

A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

A. E. Hoerl, R. W. Kennard, and K. F. Baldwin. Ridge regression: some simulations. Communications in Statistics, 4:105–123, 1975.

M. Hutter. The loss rank principle for model selection. In Proc. 20th Annual Conf. on Learning Theory (COLT'07), volume 4539 of LNAI, pages 589–603, San Diego, 2007. Springer, Berlin. URL http://arxiv.org/abs/math.ST/0702804.

M. Hutter and M. N. Tran. The loss rank principle for model selection. 2008. Submitted.

G. Khalaf and G. Shukur. Choosing ridge parameter for regression problems. Commun. Statist. Theory and Methods, 34:1177–1182, 2005.

J. A. Rice. Mathematical Statistics and Data Analysis. Duxbury Press, California, 1995.

T. Stamey, J. Kabalin, J. McNeal, I. Johnstone, F. Freiha, E. Redwine, and N. Yang. Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients. Journal of Urology, 16:1076–1083, 1989.

M. Stone. Cross-validatory choice and assessment of statistical predictions (with discussion). J. Roy. Stat. Soc. Ser. B, 36:111–147, 1974.

M. N. Tran. The loss rank criteria for variable selection. 2009. Submitted.

H. D. Vinod. A survey of ridge regression and related techniques for improvements over ordinary least squares. The Review of Economics and Statistics, 60(1):121–131, 1978.

H. Wang, R. Li, and C. L. Tsai. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94(3):553–568, 2007.
