Statistica Neerlandica (2014) Vol. 68, nr. 4, pp. 276–292, doi:10.1111/stan.12033
Tuning parameter selection in penalized generalized linear models for discrete data

E. Androulakis, C. Koukouvinos, F. Vonta

Department of Mathematics, National Technical University of Athens, Zografou 15773, Athens, Greece
In recent years, we have seen an increased interest in the penalized likelihood methodology, which can be used efficiently for shrinkage and selection purposes. This strategy can also result in unbiased, sparse, and continuous estimators. However, the performance of the penalized likelihood approach depends on the proper choice of the regularization parameter, so it is important to select it appropriately. To this end, the generalized cross-validation method is commonly used. In this article, we first propose new estimates of the norm of the error in the generalized linear models framework, through the use of Kantorovich inequalities. These estimates are then used to derive a tuning parameter selector for penalized generalized linear models. The proposed method does not depend on resampling, as the standard methods do, and therefore results in a considerable gain in computational time while producing improved results. A thorough simulation study is conducted to support the theoretical findings, and the penalized methods with the L1, hard thresholding, and smoothly clipped absolute deviation penalty functions are compared for the cases of penalized logistic regression and penalized Poisson regression. A real data example is analyzed, and a discussion follows.

Keywords and Phrases: penalized likelihood, logistic regression, Poisson regression, tuning parameter, error estimation, generalized cross-validation.
1 Introduction

The penalized approach for variable selection has been studied intensively for a number of years. Frank and Friedman (1993) and Fu (1998) have studied different penalized methods, although not necessarily in the context of generalized linear models (GLMs). The methodology developed in Fan and Li (2001) has the advantage that it selects the significant variables and estimates regression coefficients simultaneously. In addition, insignificant variables are deleted by estimating their coefficients as 0. Recent related studies include Fan and Li (2006), Leng et al. (2006), and Zou and Li (2008). The efficiency of this methodology depends strongly on appropriately choosing the tuning parameter that is involved in the penalty functions. Several methods are useful
for the tuning parameter selection. They are established through the use of an appropriate criterion whose minimization leads to the desired selector. The most well-known existing methods are cross-validation (CV) and generalized CV (GCV) (Craven and Wahba, 1979). As parameter tuning in penalized likelihood modeling with standard resampling methods (e.g., CV) can be very time consuming (see, e.g., Park et al. (2014)), the development of error estimate formulas that could be used for parameter tuning without the need for resampling is a very important issue. In this work, we propose new error estimates for GLMs, through the use of Kantorovich inequalities. We then use these estimates as a regularization parameter selector for penalized likelihood-based models. Although we focus in this paper on generalized linear models for discrete data, our method also applies to continuous data cases where a dispersion parameter might be present. This will be the subject of another paper. In addition, we have applied our approach to a class of very general models, namely, frailty models for clustered data, and the results illustrate that our method is a competitor of the GCV (Androulakis et al. 2012, 2013). We should stress here that our method is easily implemented and less time consuming than GCV.

This paper is organized as follows: in Section 2, we discuss the GLMs framework, with emphasis on logistic and Poisson regression. In Section 3, we derive our new error estimates via Kantorovich inequalities. These estimates were explored in the context of numerical analysis for non-linear systems by Galantai (2001). The penalized likelihood approach is briefly discussed in Section 4, where we also propose our new approach for the tuning parameter selection. In Section 5, we provide simulations for different sample sizes, so as to illustrate our method and to compare it with the GCV. A real dataset is analyzed in Section 6, and in Section 7, some general conclusions are presented.
2 Formulation
The concept of GLMs unifies different approaches to explaining variation in data in terms of a linear combination of covariates. Examples of approaches to non-normal data that fall into this category of models, which in general belong to the exponential family, are logistic regression and Poisson regression (McCullagh and Nelder, 1989; Myers et al., 2002). Our data consist of the random variables (y_i, x_i), i = 1, …, n, where x_i is a vector of predictor variables of dimension r. Assume that at the ith data point, i = 1, 2, …, n, the response y_i follows a distribution that belongs to the one-parameter exponential family, with parameter p_i and with support that does not depend on any unknown parameters. The likelihood function in its natural form (Bickel and Doksum, 1977) is written as

\[ \operatorname{Lik}\left(y_1,\dots,y_n,\eta_1,\dots,\eta_n\right)=e^{\sum_{i=1}^{n} y_i\eta_i-\sum_{i=1}^{n} d_0(\eta_i)+\sum_{i=1}^{n} S(y_i)} \qquad (1) \]

where (η_1, …, η_n) ∈ H, with H the collection of all (η_1, …, η_n) such that d_0(η_i) is finite, for real-valued functions d_0 and S.
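For concreteness, and as a standard identification that is used implicitly in the two subsections below (it is not spelled out in the original text), the Bernoulli and Poisson cases fit this form with

\[ \text{Bernoulli: } d_0(\eta)=\log\!\left(1+e^{\eta}\right),\ S(y)=0; \qquad \text{Poisson: } d_0(\eta)=e^{\eta},\ S(y)=-\log(y!), \]

where η = log(p/(1 − p)) and η = log μ are the respective natural parameters.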
Assuming that the natural parameter η_i is written as η_i = x_i^T β, the derivative of the log-likelihood with respect to β is given as

\[ \frac{\partial \ln \operatorname{Lik}}{\partial \beta_j}=\sum_{i=1}^{n}\left(y_i-d_0'\!\left(x_i^T\boldsymbol{\beta}\right)\right)x_{ij},\qquad j=1,\dots,r. \qquad (2) \]

Therefore, the maximum likelihood estimator (MLE) of β is the solution to the score equations, written in matrix notation as

\[ X^T\!\left(\mathbf{y}-d_0'(X,\boldsymbol{\beta})\right)=X^T(\mathbf{y}-\boldsymbol{\mu})=\mathbf{0} \qquad (3) \]

as it is known that d_0'(η) = E_η(y) = μ, where y = (y_1, …, y_n)^T, d_0'(X, β) = (d_0'(x_1^T β), …, d_0'(x_n^T β))^T, μ = (μ_1, …, μ_n)^T, and η = (η_1, …, η_n)^T. Equation 3 is known as the 'score equation' in GLMs, and the MLE β̂ is its solution. The negative second derivative of the aforementioned log-likelihood with respect to β is given as

\[ \sum_{i=1}^{n} x_{ij}\,d_0''\!\left(x_i^T\boldsymbol{\beta}\right)x_{ik},\qquad j=1,\dots,r,\; k=1,\dots,r, \qquad (4) \]

or in matrix notation

\[ X^T W X \qquad (5) \]

where W is the variance–covariance matrix of y, because it is known that d_0''(η) = Var_η(y). Therefore, W is an n × n diagonal matrix with diagonal elements Var(y_i), i = 1, …, n.

2.1 Logistic regression model

Assume now that at the ith data point, i = 1, 2, …, n, the response is a Bernoulli random variable y_i, so y_i ∈ {0, 1}, with E(y_i) = p_i = p(x_i) and Var(y_i) = p_i(1 − p_i), which obviously is a function of the mean. Here, p_i denotes the probability of success in a Bernoulli process. It is well known (McCullagh and Nelder, 1989; Myers et al., 2002) that the MLE of β, with the use of a canonical link function, is the solution to the score equations

\[ \sum_{i=1}^{n} x_{ij}\!\left(y_i-\frac{e^{x_i^T\boldsymbol{\beta}}}{1+e^{x_i^T\boldsymbol{\beta}}}\right)=0,\qquad j=1,\dots,r, \qquad (6) \]

or in matrix notation

\[ X^T(\mathbf{y}-\boldsymbol{\mu})=\mathbf{0} \qquad (7) \]

where μ_i = E_η(y_i) = d_0'(η_i) = e^{η_i}/(1 + e^{η_i}) = p_i. Moreover, the matrix X^T W X takes the form

\[ \sum_{i=1}^{n} x_{ij}\,\frac{e^{x_i^T\boldsymbol{\beta}}}{1+e^{x_i^T\boldsymbol{\beta}}}\left(1-\frac{e^{x_i^T\boldsymbol{\beta}}}{1+e^{x_i^T\boldsymbol{\beta}}}\right)x_{ik},\qquad j=1,\dots,r,\; k=1,\dots,r, \qquad (8) \]
where W is an n × n diagonal matrix with diagonal elements Var(y_i) = p_i(1 − p_i), i = 1, …, n.

2.2 Poisson regression model

Assume now that the responses y_1, …, y_n are counts that follow independent Poisson distributions with E(y_i) = p_i = μ_i and Var(y_i) = μ_i, as the variance is equal to the mean. For the Poisson regression model and with the use of a canonical link function, the MLE of β is the solution to the score equations (McCullagh and Nelder, 1989; Myers et al., 2002)

\[ \sum_{i=1}^{n} x_{ij}\left(y_i-e^{x_i^T\boldsymbol{\beta}}\right)=0,\qquad j=1,\dots,r, \qquad (9) \]

or in matrix notation

\[ X^T(\mathbf{y}-\boldsymbol{\mu})=\mathbf{0}. \qquad (10) \]

Moreover, the matrix X^T W X takes the form

\[ \sum_{i=1}^{n} x_{ij}\,e^{x_i^T\boldsymbol{\beta}}x_{ik},\qquad j=1,\dots,r,\; k=1,\dots,r, \qquad (11) \]

where W is an n × n diagonal matrix with diagonal elements μ_i = e^{x_i^T β}, i = 1, …, n.
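To make the quantities of Equations 3–11 concrete, the following is a minimal sketch of the score vector X^T(y − μ) and the matrix X^T W X for the two canonical-link models. The paper reports using MATLAB; the Python function below and its name/interface are ours, not the authors' code.

```python
import numpy as np

def glm_quantities(X, y, beta, family="logistic"):
    """Score vector X^T(y - mu) and matrix X^T W X for canonical-link
    logistic and Poisson models (cf. Equations 3, 5, 6, 8, 9, 11)."""
    eta = X @ beta                        # linear predictor x_i^T beta
    if family == "logistic":
        mu = 1.0 / (1.0 + np.exp(-eta))   # mu_i = e^eta / (1 + e^eta) = p_i
        w = mu * (1.0 - mu)               # Var(y_i) = p_i (1 - p_i)
    elif family == "poisson":
        mu = np.exp(eta)                  # mu_i = e^eta
        w = mu                            # Var(y_i) = mu_i
    else:
        raise ValueError("unknown family")
    score = X.T @ (y - mu)                # Equation (3): X^T (y - mu)
    info = X.T @ (X * w[:, None])         # Equation (5): X^T W X with W = diag(w)
    return score, info
```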
3 Error estimates for generalized linear models
We now propose new estimates of the norm of the error in the GLMs framework through the use of the Kantorovich inequality, given in the following form (Greub and Rheinboldt, 1959; Horn and Johnson, 1985; Marcus and Minc, 1992):

\[ \|z\|_2^4 \le \left(z^T B z\right)\left(z^T B^{-1} z\right) \le \frac{1}{4}\,\frac{\left(w_1+w_r\right)^2}{w_1 w_r}\,\|z\|_2^4 \qquad (12) \]

where B ∈ R^{r×r} is a symmetric positive definite matrix with eigenvalues w_1 ≥ w_2 ≥ ⋯ ≥ w_r > 0, z ∈ R^r is an arbitrary vector, and the norm ‖·‖_2 is the Euclidean norm for vectors and the Frobenius norm for matrices.

Galantai (2001) generalized Auchmuty's error estimate (Auchmuty, 1992) in numerical analysis to non-linear systems. This result can actually be used in conjunction with the Newton–Raphson method. Following his steps, we adapt the error estimate that he derived to our situation, that is, the GLMs framework. Consider the non-linear score equations X^T(y − μ) = 0 multiplied by −1. We denote them as F(β) = 0 with F : R^r → R^r, and let β be any approximate solution and β* the exact solution of F(β) = 0.
The matrix X^T W X given in Equation 5 is also equal to the Jacobian matrix F'(β). Assume that F'(β*) is invertible, that F' ∈ C^1(S(β*, δ)), and that

\[ \|F'(\boldsymbol{\beta}_1)-F'(\boldsymbol{\beta}_2)\|_2\le M\|\boldsymbol{\beta}_1-\boldsymbol{\beta}_2\|_2,\qquad \forall\,\boldsymbol{\beta}_1,\boldsymbol{\beta}_2\in S\!\left(\boldsymbol{\beta}^*,\delta\right), \]

where

\[ S\!\left(\boldsymbol{\beta}^*,\delta\right)=\left\{\boldsymbol{\beta}:\|\boldsymbol{\beta}^*-\boldsymbol{\beta}\|_2\le\delta\right\} \qquad (13) \]

for δ > 0. Now, assume that β is close to β*, and let

\[ B=F'(\boldsymbol{\beta})\,F'(\boldsymbol{\beta})^T=\left(X^T W X\right)\left(X^T W X\right)^T. \qquad (14) \]

Let also UΣV^T be the singular value decomposition of F', where Σ = diag(σ_1, …, σ_r) with σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_r > 0. If z ∈ R^r is an arbitrary vector, we have that

\[ z^T B z=\|F'(\boldsymbol{\beta})^T z\|_2^2 \qquad (15) \]

\[ z^T B^{-1} z=\|F'(\boldsymbol{\beta})^{-1} z\|_2^2 \qquad (16) \]

and w_i = w_i(B) = σ_i^2(F') = σ_i^2, where w_i and σ_i are the ith eigenvalues and singular values of B and F', respectively. We can now apply the Kantorovich inequality 12, which gives

\[ \|z\|_2^2\le\|F'(\boldsymbol{\beta})^T z\|_2\,\|F'(\boldsymbol{\beta})^{-1} z\|_2\le\frac{1}{2}\,\frac{\sigma_1^2+\sigma_r^2}{\sigma_1\sigma_r}\,\|z\|_2^2,\qquad z\in\mathbb{R}^r. \qquad (17) \]

Let z = F(β). From the Lipschitz continuity, it follows that

\[ F(\boldsymbol{\beta})=F'(\boldsymbol{\beta})\left(\boldsymbol{\beta}-\boldsymbol{\beta}^*\right)+O\!\left(\|e\|_2^2\right) \qquad (18) \]

and

\[ F'(\boldsymbol{\beta})^{-1}F(\boldsymbol{\beta})=\boldsymbol{\beta}-\boldsymbol{\beta}^*+O\!\left(\|e\|_2^2\right). \qquad (19) \]

Hence, we have that the approximate absolute error estimate is given as (Galantai, 2001)

\[ \|\boldsymbol{\beta}-\boldsymbol{\beta}^*\|_2=c\,\frac{\|F(\boldsymbol{\beta})\|_2^2}{\|F'(\boldsymbol{\beta})^T F(\boldsymbol{\beta})\|_2} \qquad (20) \]

or

\[ \|\boldsymbol{\beta}-\boldsymbol{\beta}^*\|_2=c\,\frac{\|X^T(\mathbf{y}-\boldsymbol{\mu})\|_2^2}{\|\left(X^T W X\right)^T X^T(\mathbf{y}-\boldsymbol{\mu})\|_2} \qquad (21) \]

where 1 ⪅ c ⪅ C_2(F'(β)) = (1/2)(σ_1^2 + σ_r^2)/(σ_1 σ_r). We must note that in order to apply the Kantorovich inequality to obtain Equation 21 in the GLMs framework, the following assumptions must hold:
A. The regressors x take finite values, are non-degenerate random variables, and are linearly independent.
B. The function d_0 ∈ C^3 defined in Equation 1 has a uniformly bounded third derivative for β ∈ S(β*, δ).

The continuity of X^T W X with respect to β and the Lipschitz continuity follow from assumptions A and B for β ∈ S(β*, δ). Also, the matrix X^T W X is symmetric, positive definite, and invertible at the true point β*, because of its form and our assumptions. As a consequence, the matrix B is symmetric and positive definite. Owing to the aforementioned assumptions, the conditions needed for the application of the Kantorovich inequality are satisfied in our setup.
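As an illustration of how Equation 21 and the bound C_2(F'(β)) can be evaluated in practice, the following sketch builds on the `glm_quantities` helper shown in Section 2. It is our own illustration under the assumptions above, not the authors' MATLAB implementation.

```python
import numpy as np

def error_estimate(X, y, beta, family="logistic"):
    """Ratio ||X^T(y - mu)||_2^2 / ||(X^T W X)^T X^T(y - mu)||_2 from
    Equation (21), together with the upper bound C2(F'(beta)) on c."""
    score, info = glm_quantities(X, y, beta, family)   # F(beta) = -score, F'(beta) = info
    ratio = np.linalg.norm(score) ** 2 / np.linalg.norm(info.T @ score)
    sigma = np.linalg.svd(info, compute_uv=False)      # singular values of F'(beta) = X^T W X
    C2 = 0.5 * (sigma[0] ** 2 + sigma[-1] ** 2) / (sigma[0] * sigma[-1])
    return ratio, C2
```

Multiplying the returned ratio by C2 gives the upper bound on ‖β − β*‖_2 used in the selection procedure of Section 4.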
4 Penalized likelihood methodology and tuning parameter selection
Fan and Li (2001) proposed a variable selection methodology for likelihood-based GLMs. Assume that the data (x_i, y_i) are collected independently. Conditioning on x_i, y_i has a density f_i(g(x_i^T β), y_i), where g is a known link function. In this paper, a form of the penalized likelihood is defined as

\[ Q(\boldsymbol{\beta})\equiv\sum_{i=1}^{n} l_i\!\left(g\!\left(x_i^T\boldsymbol{\beta}\right),y_i\right)-n\sum_{j=1}^{r} p_\lambda\!\left(|\beta_j|\right) \qquad (22) \]
where l_i = log f_i denotes the conditional log-likelihood of y_i, p_λ(·) is a penalty function, and λ is a tuning parameter, which can be chosen by data-driven approaches such as CV and GCV (Craven and Wahba, 1979). Maximizing the penalized likelihood function is equivalent to minimizing −Q(β) with respect to β. To obtain a penalized MLE of β, we minimize −Q(β) for some thresholding parameter λ. The most common penalties are the L1 penalty p_λ(|β|) = λ|β|, which results in the least absolute shrinkage and selection operator (LASSO) method (Tibshirani, 1996), and the hard thresholding penalty p_λ(|β|) = λ² − (|β| − λ)² I(|β| < λ) (Antoniadis, 1997), where I(·) is an indicator function. However, these penalties do not simultaneously satisfy the necessary mathematical conditions for unbiasedness, sparsity, and continuity. Therefore, Fan and Li (2001) proposed the continuously differentiable penalty function known as the smoothly clipped absolute deviation (SCAD) penalty, whose first derivative is defined by

\[ p_\lambda'(\beta)=\lambda\left\{I(\beta\le\lambda)+\frac{(\alpha\lambda-\beta)_+}{(\alpha-1)\lambda}\,I(\beta>\lambda)\right\},\qquad \text{for } \beta>0 \text{ and } \alpha>2, \]

with p_λ(0) = 0. For the choice of α, according to the relevant literature (Fan and Li, 2001), the value α ≈ 3.7 appears to perform quite satisfactorily in numerous variable selection problems. Fan and Li (2001) also proposed a unified algorithm for the minimization of −Q(β), starting from an initial value of the unknown coefficients. The first term in −Q(β) may be regarded as a loss function of β, denoted by l(β). Then, −Q(β) can be written in the unified form

\[ l(\boldsymbol{\beta})+n\sum_{j=1}^{r} p_\lambda\!\left(|\beta_j|\right). \qquad (23) \]
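For reference, the first derivatives of the three penalties just described can be coded directly from the formulas above. The vectorized form below is a sketch of ours (with the convention that it is evaluated at |β| > 0); it is not the authors' code.

```python
import numpy as np

def penalty_derivative(beta_abs, lam, kind="scad", a=3.7):
    """First derivative p'_lambda(|beta|) of the L1, hard thresholding,
    and SCAD penalties, evaluated elementwise at |beta| > 0."""
    b = np.asarray(beta_abs, dtype=float)
    if kind == "l1":      # p_lambda(|b|) = lambda * |b|
        return lam * np.ones_like(b)
    if kind == "hard":    # p_lambda(|b|) = lambda^2 - (|b| - lambda)^2 I(|b| < lambda)
        return 2.0 * (lam - b) * (b < lam)
    if kind == "scad":    # Fan and Li (2001), with a = alpha = 3.7 by default
        return lam * ((b <= lam) +
                      np.maximum(a * lam - b, 0.0) / ((a - 1.0) * lam) * (b > lam))
    raise ValueError("unknown penalty")
```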
Moreover, they suggested approximating the penalty function locally by a quadratic function and referred to such an approximation as the local quadratic approximation (LQA). Based on the LQA, the optimization of the penalized likelihood function can be achieved using a modified Newton–Raphson algorithm. For more details on the minimization of Equation 23 with respect to β and the derivation of standard errors for the estimated parameters, the interested reader is referred to Fan and Li (2001), to Fan and Li (2002) for the cases of the Cox proportional hazards model and the Gamma frailty model, and to Androulakis et al. (2012) for a general class of frailty models with penalized likelihood.

We now propose an alternative method for selecting the tuning parameter λ in penalized likelihood-based generalized linear models, through the proper use of the estimates of the norm of the error 21. The procedure we suggest is as follows (a code sketch of this selection loop is given at the end of this section):

1. Start with a grid containing the initial values of the tuning parameter λ.
2. For each λ in the grid, compute a penalized estimator β̂.
3. Then evaluate the norm of the error by formula 21, with the constant c taking the value of its upper bound, using the appropriate expressions for F(β̂) and F'(β̂), for example, based on Equations 6 and 8 for the penalized logistic regression model or on Equations 9 and 11 for the penalized Poisson regression model.
4. The value of the parameter λ in the grid that minimizes Equation 21 for the aforementioned constant c is the desired tuning parameter, and the resulting penalized estimator β̂ is our final estimator of β.

Remark 1. It is easily observed that C_2(F'(β)) is essentially a simple function of the condition number k_2(F'(β)) = ‖F'(β)‖_2 ‖F'(β)^{-1}‖_2 of the matrix F'(β). Therefore, we should stress here that the non-singularity of the matrix F'(β) = X^T W X, along with the fact that C_2(F'(β)) is less than or equal to k_2(F'(β)) (see also Galantai, 2001), ensures that the upper bound of the constant c is finite.

Remark 2. For the case of a linear system Xβ = y, Galantai (2001) investigated through simulations the behavior of the constant c, which actually depends on the dimension of the matrix X, or equivalently on the dimension r of the regression parameter β. He concluded that the average of the error constant c in Auchmuty's estimate for that case increases slowly with r and that, with high probability,
\[ \|\boldsymbol{\beta}-\boldsymbol{\beta}^*\|_2\lessapprox\frac{0.5\,\dim(X)\,\|\operatorname{res}(\boldsymbol{\beta})\|_2^2}{\|X^T\operatorname{res}(\boldsymbol{\beta})\|_2} \qquad (24) \]
where res(β) = Xβ − y is the residual error for any approximate solution β of the linear system. However, further work is needed for the formal derivation of the average of c in our case, especially for high-dimensional settings, and this will be considered in a subsequent paper. See, however, the comments in Remark 3 in the sequel.
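The selection procedure of steps 1–4 amounts to a simple grid search. The sketch below is our own illustration of it: `fit_penalized` stands for any maximum penalized likelihood solver (for example, the LQA algorithm of Fan and Li, 2001) and is deliberately left undefined here, while `error_estimate` is the sketch given in Section 3.

```python
import numpy as np

def select_lambda(X, y, lam_grid, family="logistic", penalty="scad"):
    """Grid search over lambda: for each value, fit a penalized estimator,
    evaluate the bound C2 * estimate from Equation (21), and keep the
    lambda (and fit) that minimizes it."""
    best_bound, best_lam, best_beta = np.inf, None, None
    for lam in lam_grid:
        # hypothetical penalized-likelihood solver, not defined in this sketch
        beta_hat = fit_penalized(X, y, lam, family=family, penalty=penalty)
        ratio, C2 = error_estimate(X, y, beta_hat, family)
        bound = C2 * ratio              # upper bound on ||beta_hat - beta*||_2
        if bound < best_bound:
            best_bound, best_lam, best_beta = bound, lam, beta_hat
    return best_lam, best_beta
```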
5 Simulation results
In this section, we perform simulation experiments in order to compare numerically the proposed method with the GCV for the derivation of the tuning parameter. As the penalized likelihood estimate of β in step 2 of the procedure described earlier, we considered the maximum penalized likelihood estimator (Fan and Li, 2001), and we followed their simulation scheme. Therefore, we simulated 100 datasets consisting of n observations from the following models:

1. Y ∼ Bernoulli(p(x^T β)), where p(u) = 1/(1 + e^{−u}).
2. Y ∼ Poisson(μ(x^T β)), where μ(u) = e^u.

The values of n were taken to be 100, 200, and 500. The first six components of x were taken as standard normal, with the correlation between x_i and x_j equal to ρ^{|i−j|} with ρ = 0.5. The last two components of x were taken as independent identically distributed random variables following a Bernoulli distribution with probability of success 0.5. All covariates were standardized (see the code sketch below). The true value of β was (3, 1.5, 0, 0, 2, 0, 0, 0)^T for the logistic regression model and (1.2, 0.6, 0, 0, 0.8, 0, 0, 0)^T for the Poisson regression model.

In our simulations, the performance of the penalized procedures with the SCAD penalty (SCAD), the L1 penalty (LASSO), and the hard thresholding penalty (Hard) is compared in terms of their model errors (Fan and Li, 2001), model complexity, and accuracy, using the proposed method for the derivation of the tuning parameter and GCV as an alternative. Model errors of the penalized procedures are compared to those of the maximum likelihood estimates, and they are computed via 1000 Monte Carlo simulations. Specifically, we present the median of relative model errors (MRME) over the 100 simulated datasets for the aforementioned values of n. Moreover, the average number of zero coefficients is also reported in the following tables, in which the column labeled 'correct' presents the average restricted only to the true zero coefficients, while the column labeled 'incorrect' depicts the average number of coefficients erroneously set to 0. We also test the accuracy of the standard error formula proposed by Fan and Li (2001) (see eq. (3.10) in their paper), which can be used as an estimator for the covariance of the non-vanishing components of β̂. The median absolute deviation divided by 0.6745, denoted by SD in the tables, of the 100 estimated coefficients in the 100 simulations can be regarded as the true standard error. Note that, in general, the median absolute deviation (MAD) of a set t_i, i = 1, …, n, is calculated as the median of the absolute value of each value t_i minus the median of the data, that is, MAD = median(|t_i − median(t̃)|), where t̃ denotes the data. Moreover, the median of the 100 estimated standard errors, resulting from the 100 simulations, denoted by SD_m, and the median absolute deviation of the 100 estimated standard errors, divided by 0.6745, denoted by SD_mad, gauge the overall performance of the standard error formula. In our simulations, we present the results concerning the standard errors only for the non-zero coefficients. We have also calculated for each method the mean values of the 100 estimated coefficients of x1, x2, and x5.
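A sketch of the data-generating step described above (our own illustration of the design, not the authors' MATLAB code) follows; for instance, β = (3, 1.5, 0, 0, 2, 0, 0, 0)^T with family="logistic" reproduces the logistic setting.

```python
import numpy as np

def simulate_dataset(n, beta, family="logistic", rho=0.5, rng=None):
    """One simulated dataset: six AR(rho)-correlated standard normal
    covariates, two independent Bernoulli(0.5) covariates, standardized
    columns, and a Bernoulli or Poisson response."""
    rng = np.random.default_rng(rng)
    cov = rho ** np.abs(np.subtract.outer(np.arange(6), np.arange(6)))
    X = np.hstack([rng.multivariate_normal(np.zeros(6), cov, size=n),
                   rng.binomial(1, 0.5, size=(n, 2))])
    X = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize all covariates
    eta = X @ beta
    if family == "logistic":
        y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    else:
        y = rng.poisson(np.exp(eta))
    return X, y
```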
Furthermore, the mean values of λ selected by the compared methods are also reported. We also present the results for the Oracle estimator, obtained by fitting the ideal model consisting of the variables x1, x2, and x5. We note that, because GCV also uses some initial values of λ in its routine, the same values were given to both methods, in order to make a strict and fair comparison between them. In addition, as we noted in the previous section, it is preferable and more efficient to minimize the maximum of the error; therefore, we used the upper bound of c when evaluating Equation 21. All computations were conducted using MATLAB codes. The results are given in Tables 1–5.

From the aforementioned tables, we observe that with the use of our method for the derivation of the tuning parameter, there is a significant improvement in the general performance of the penalized methods. More specifically, the MRME is ameliorated for all methods, regardless of the sample size or the GLM used (Table 1). Moreover, observe from the same table that the penalized methods, and especially SCAD and Hard in the penalized logistic regression case, have a higher chance of correctly setting the non-significant coefficients to zero when using our method. At the same time, in general, there is a lower chance through our method of erroneously setting significant coefficients to 0. In the penalized Poisson regression case, our method obviously improves the performance of all the penalized methods. The SCAD and Hard methods outperform the LASSO in general. However, the performance of the LASSO method is improved with the use of our tuning parameter selector. The standard error formula proposed by Fan and Li (2001) performs well, or even better in many cases, when we apply our approach as compared to GCV (Tables 2 and 4). Note that a good and accurate performance of this formula should result in close values of the true and estimated standard deviations of the estimators, while the penalized methods must perform as well as the oracle estimator. Concerning the mean values of the non-zero coefficients, it can be seen that in all cases (Tables 3 and 5), the estimated and real values are much closer when implementing the new method, resulting in a better performance of the penalized methods. As far as the mean values of the selected λ are concerned, a general remark is that in many cases our method yields higher or slightly higher values of λ.

Remark 3. Galantai (2001) pointed out that in practice, in the case of non-random matrices, the constant c seems to take values usually less than 10. For random matrices, though, it could take much higher values, and it certainly increases with the value of r. Therefore, further simulation experiments were performed in order to investigate the possible values of c and especially its upper bound constant C_2(F'(β)). We note that, as our findings concerning the MRME and model identifications for these extra setups are similar to those analyzed before, we do not report the results here for brevity. In addition, we found that for the logistic regression model, the average value of C_2(F'(β)) is actually almost always less than 10 for regression parameters with dimension up to 20 and for n = 200 or 500, for all three penalized methods. For the Poisson regression model, we found much higher average values of C_2(F'(β)) in some cases.
Table 1. Simulation results for the penalized logistic regression and Poisson regression (in parentheses). The 'Correct' and 'Incorrect' columns give the average number of zero coefficients.

| Method | GCV: MRME(%) | GCV: Correct | GCV: Incorrect | New M.: MRME(%) | New M.: Correct | New M.: Incorrect |
| n = 100 | | | | | | |
| SCAD   | 0.7279 (0.6975) | 4.0400 (3.6100) | 0.1300 (0.0800) | 0.5923 (0.5306) | 4.4500 (3.8700) | 0.0900 (0.0200) |
| LASSO  | 0.7445 (0.8303) | 4.0000 (3.4500) | 0.0900 (0.1100) | 0.6704 (0.6370) | 4.3200 (3.7200) | 0.1100 (0.0500) |
| Hard   | 0.8341 (0.8614) | 3.5600 (3.5300) | 0.0400 (0.0500) | 0.7187 (0.6511) | 4.1700 (3.7800) | 0.0200 (0.0500) |
| Oracle | 0.3314 (0.3536) | 5 (5) | 0 (0) | 0.3314 (0.3536) | 5 (5) | 0 (0) |
| n = 200 | | | | | | |
| SCAD   | 0.3394 (0.8252) | 4.5400 (3.5200) | 0.2000 (0) | 0.2838 (0.6376) | 5.0000 (3.8700) | 0.0500 (0) |
| LASSO  | 0.7894 (0.8383) | 3.1300 (3.4700) | 0 (0) | 0.6540 (0.6374) | 3.2800 (3.8600) | 0 (0) |
| Hard   | 0.3507 (0.9431) | 4.6900 (3.5000) | 0.0200 (0) | 0.2890 (0.7567) | 4.9200 (3.8200) | 0.0200 (0) |
| Oracle | 0.2420 (0.3595) | 5 (5) | 0 (0) | 0.2420 (0.3595) | 5 (5) | 0 (0) |
| n = 500 | | | | | | |
| SCAD   | 0.3952 (0.8838) | 4.6600 (3.5700) | 0 (0) | 0.2503 (0.5934) | 5.0000 (3.8800) | 0 (0) |
| LASSO  | 0.8119 (0.8913) | 3.0200 (3.5200) | 0.02 (0) | 0.5898 (0.6335) | 3.1700 (3.8400) | 0 (0) |
| Hard   | 0.4306 (0.9277) | 4.3800 (3.5700) | 0 (0) | 0.3010 (0.7421) | 4.9000 (3.8100) | 0 (0) |
| Oracle | 0.2668 (0.3985) | 5 (5) | 0 (0) | 0.2668 (0.3985) | 5 (5) | 0 (0) |
Table 2. Standard deviations for the penalized logistic regression using the new method and GCV (in parentheses)

| | | SCAD | LASSO | Hard | Oracle |
| n = 100 | | | | | |
| β̂1 | SD    | 0.8570 (0.9962) | 0.3973 (0.2658) | 0.8607 (0.9200) | 0.8447 |
| β̂1 | SDm   | 0.8021 (0.8802) | 0.4090 (0.4589) | 0.8371 (0.8961) | 0.7465 |
| β̂1 | SDmad | 0.2136 (0.3335) | 0.1274 (0.1347) | 0.2926 (0.2612) | 0.2315 |
| β̂2 | SD    | 0.6795 (0.7220) | 0.3910 (0.2543) | 0.6509 (0.6322) | 0.4601 |
| β̂2 | SDm   | 0.5688 (0.6030) | 0.3329 (0.3474) | 0.6185 (0.6290) | 0.5310 |
| β̂2 | SDmad | 0.1813 (0.1895) | 0.1112 (0.1342) | 0.2017 (0.1901) | 0.1047 |
| β̂5 | SD    | 0.7445 (0.8253) | 0.4606 (0.3108) | 0.7884 (0.9251) | 0.6334 |
| β̂5 | SDm   | 0.6018 (0.6427) | 0.3300 (0.3517) | 0.6735 (0.6900) | 0.5686 |
| β̂5 | SDmad | 0.2337 (0.2603) | 0.1032 (0.1356) | 0.2357 (0.2519) | 0.1725 |
| n = 200 | | | | | |
| β̂1 | SD    | 0.5989 (0.4474) | 0.6262 (0.4817) | 0.6023 (0.4387) | 0.5751 |
| β̂1 | SDm   | 0.5126 (0.5420) | 0.5122 (0.5376) | 0.5300 (0.5468) | 0.5271 |
| β̂1 | SDmad | 0.0990 (0.0736) | 0.0970 (0.0894) | 0.1031 (0.0783) | 0.0946 |
| β̂2 | SD    | 0.3783 (0.4885) | 0.3511 (0.4497) | 0.3479 (0.4457) | 0.3263 |
| β̂2 | SDm   | 0.3665 (0.3587) | 0.3680 (0.3670) | 0.3795 (0.3699) | 0.3794 |
| β̂2 | SDmad | 0.0649 (0.0751) | 0.0480 (0.0562) | 0.0567 (0.0546) | 0.0483 |
| β̂5 | SD    | 0.3477 (0.3132) | 0.4102 (0.4297) | 0.4235 (0.3534) | 0.3341 |
| β̂5 | SDm   | 0.3708 (0.3867) | 0.3856 (0.4034) | 0.3880 (0.3974) | 0.3883 |
| β̂5 | SDmad | 0.0574 (0.0560) | 0.0728 (0.0706) | 0.0651 (0.0577) | 0.0624 |
| n = 500 | | | | | |
| β̂1 | SD    | 0.2958 (0.2778) | 0.3033 (0.2698) | 0.2954 (0.2589) | 0.3415 |
| β̂1 | SDm   | 0.3246 (0.3297) | 0.3206 (0.3254) | 0.3264 (0.3312) | 0.3328 |
| β̂1 | SDmad | 0.0347 (0.0326) | 0.0325 (0.0317) | 0.0334 (0.0322) | 0.0405 |
| β̂2 | SD    | 0.2173 (0.2443) | 0.2365 (0.2577) | 0.2099 (0.2552) | 0.2350 |
| β̂2 | SDm   | 0.2319 (0.2281) | 0.2309 (0.2287) | 0.2323 (0.2283) | 0.2307 |
| β̂2 | SDmad | 0.0199 (0.0191) | 0.0225 (0.0177) | 0.0198 (0.0199) | 0.0251 |
| β̂5 | SD    | 0.2361 (0.2628) | 0.2568 (0.2635) | 0.2633 (0.2632) | 0.2565 |
| β̂5 | SDm   | 0.2452 (0.2447) | 0.2457 (0.2205) | 0.2464 (0.2450) | 0.2437 |
| β̂5 | SDmad | 0.0253 (0.0261) | 0.0294 (0.0256) | 0.0266 (0.0283) | 0.0266 |
Table 3. Mean values of the non-zero coefficients and λ, using the new method and GCV (in parentheses), for the penalized logistic regression

| Method | β̂1 | β̂2 | β̂5 | λ̂ |
| n = 100 | | | | |
| SCAD   | 3.2655 (3.5394) | 1.6271 (1.8601) | 2.2506 (2.5467) | 0.2212 (0.2094) |
| LASSO  | 2.7264 (2.2912) | 1.3041 (1.0556) | 1.8238 (1.4431) | 0.0194 (0.0138) |
| Hard   | 3.3662 (3.8886) | 1.6013 (1.9754) | 2.3054 (2.6431) | 0.2401 (0.2295) |
| Oracle | 3.3144 | 1.6140 | 2.2040 | — |
| n = 200 | | | | |
| SCAD   | 3.0753 (3.2004) | 1.4882 (1.3401) | 1.9761 (2.0918) | 0.3351 (0.3319) |
| LASSO  | 3.1151 (3.2227) | 1.5650 (1.5658) | 2.0359 (2.1330) | 0.0015 (0.0009) |
| Hard   | 3.0962 (3.1923) | 1.5446 (1.5452) | 2.0474 (2.1276) | 0.8735 (0.8278) |
| Oracle | 3.0749 | 1.5353 | 2.0873 | — |
| n = 500 | | | | |
| SCAD   | 3.0532 (3.0963) | 1.5415 (1.3995) | 2.0397 (2.0969) | 0.1959 (0.1964) |
| LASSO  | 3.0821 (3.1142) | 1.5610 (1.4165) | 2.0834 (2.0931) | 0.0008 (0.0005) |
| Hard   | 3.0615 (3.0888) | 1.5379 (1.5513) | 2.0364 (2.0451) | 0.4652 (0.4731) |
| Oracle | 3.0641 | 1.5236 | 2.0501 | — |

Note: True values: 3, 1.5, and 2.
For example, for r = 20 and n = 500, we obtained an average C_2(F'(β)) of around 275 for all three penalized methods. At the same time, however, the minimum upper bound obtained by (21) is very low, around 0.24, for the three penalized methods, while the mean bias of β̂ is approximately equal to 0.02. Moreover, Galantai (2001) pointed out that C_2(F'(β)) becomes approximately equal to k_2(F'(β))/2 if k_2(F'(β)) is large enough. This applies to our results, for example, in the penalized logistic and Poisson regression cases for r = 20 and n = 200 or 500.
Table 4. Standard deviations for the penalized Poisson regression using the new method and GCV (in parentheses)

| | | SCAD | LASSO | Hard | Oracle |
| n = 100 | | | | | |
| β̂1 | SD    | 0.0566 (0.0609) | 0.0563 (0.0670) | 0.0566 (0.0665) | 0.0566 |
| β̂1 | SDm   | 0.0576 (0.0548) | 0.0560 (0.0551) | 0.0596 (0.0556) | 0.0552 |
| β̂1 | SDmad | 0.0109 (0.0113) | 0.0090 (0.0104) | 0.0105 (0.0107) | 0.0094 |
| β̂2 | SD    | 0.0658 (0.0703) | 0.0649 (0.0769) | 0.0659 (0.0740) | 0.0626 |
| β̂2 | SDm   | 0.0611 (0.0604) | 0.0600 (0.0614) | 0.0629 (0.0626) | 0.0613 |
| β̂2 | SDmad | 0.0096 (0.0105) | 0.0090 (0.0110) | 0.0108 (0.0110) | 0.0099 |
| β̂5 | SD    | 0.0627 (0.0552) | 0.0668 (0.0652) | 0.0625 (0.0623) | 0.0542 |
| β̂5 | SDm   | 0.0527 (0.0532) | 0.0541 (0.0562) | 0.0547 (0.0557) | 0.0500 |
| β̂5 | SDmad | 0.0103 (0.0082) | 0.0112 (0.0098) | 0.0117 (0.0096) | 0.0087 |
| n = 200 | | | | | |
| β̂1 | SD    | 0.0357 (0.0282) | 0.0368 (0.0268) | 0.0345 (0.0272) | 0.0315 |
| β̂1 | SDm   | 0.0363 (0.0353) | 0.0354 (0.0357) | 0.0364 (0.0357) | 0.0333 |
| β̂1 | SDmad | 0.0048 (0.0052) | 0.0052 (0.0053) | 0.0050 (0.0053) | 0.0061 |
| β̂2 | SD    | 0.0460 (0.0333) | 0.0451 (0.0362) | 0.0440 (0.0374) | 0.0394 |
| β̂2 | SDm   | 0.0406 (0.0399) | 0.0407 (0.0403) | 0.0410 (0.0408) | 0.0385 |
| β̂2 | SDmad | 0.0060 (0.0058) | 0.0052 (0.0057) | 0.0068 (0.0063) | 0.0052 |
| β̂5 | SD    | 0.0291 (0.0429) | 0.0325 (0.0429) | 0.0347 (0.0442) | 0.0359 |
| β̂5 | SDm   | 0.0348 (0.0349) | 0.0348 (0.0367) | 0.0359 (0.0365) | 0.0337 |
| β̂5 | SDmad | 0.0063 (0.0062) | 0.0062 (0.0067) | 0.0066 (0.0066) | 0.0041 |
| n = 500 | | | | | |
| β̂1 | SD    | 0.0238 (0.0194) | 0.0233 (0.0188) | 0.0245 (0.0188) | 0.0216 |
| β̂1 | SDm   | 0.0215 (0.0215) | 0.0212 (0.0215) | 0.0216 (0.0216) | 0.0210 |
| β̂1 | SDmad | 0.0022 (0.0018) | 0.0021 (0.0018) | 0.0023 (0.0017) | 0.0027 |
| β̂2 | SD    | 0.0252 (0.0249) | 0.0249 (0.0254) | 0.0271 (0.0252) | 0.0227 |
| β̂2 | SDm   | 0.0242 (0.0243) | 0.0239 (0.0246) | 0.0246 (0.0246) | 0.0235 |
| β̂2 | SDmad | 0.0028 (0.0021) | 0.0024 (0.0024) | 0.0027 (0.0023) | 0.0026 |
| β̂5 | SD    | 0.0237 (0.0216) | 0.0235 (0.0221) | 0.0234 (0.0225) | 0.0214 |
| β̂5 | SDm   | 0.0202 (0.0204) | 0.0202 (0.0210) | 0.0204 (0.0210) | 0.0196 |
| β̂5 | SDmad | 0.0027 (0.0035) | 0.0028 (0.0038) | 0.0030 (0.0039) | 0.0022 |
Table 5. Mean values of the non-zero coefficients and λ, using the new method and GCV (in parentheses), for the penalized Poisson regression

| Method | β̂1 | β̂2 | β̂5 | λ̂ |
| n = 100 | | | | |
| SCAD   | 1.2027 (1.1846) | 0.5958 (0.6116) | 0.7956 (0.7803) | 0.1083 (0.0697) |
| LASSO  | 1.1968 (1.1931) | 0.5970 (0.6109) | 0.7832 (0.7859) | 0.0547 (0.0089) |
| Hard   | 1.1960 (1.1933) | 0.6008 (0.6129) | 0.7949 (0.7904) | 0.0826 (0.0739) |
| Oracle | 1.2032 | 0.5914 | 0.7898 | — |
| n = 200 | | | | |
| SCAD   | 1.2023 (1.1937) | 0.5989 (0.6170) | 0.7952 (0.7831) | 0.0507 (0.0393) |
| LASSO  | 1.2026 (1.1878) | 0.5934 (0.6257) | 0.7859 (0.7910) | 0.0456 (0.0052) |
| Hard   | 1.2022 (1.1954) | 0.5986 (0.6166) | 0.7939 (0.7927) | 0.0550 (0.0452) |
| Oracle | 1.2009 | 0.5991 | 0.8003 | — |
| n = 500 | | | | |
| SCAD   | 1.2003 (1.1961) | 0.6007 (0.5977) | 0.8046 (0.7980) | 0.0340 (0.0240) |
| LASSO  | 1.2011 (1.1956) | 0.5982 (0.5971) | 0.7997 (0.7974) | 0.0294 (0.0031) |
| Hard   | 1.2009 (1.1987) | 0.6012 (0.5981) | 0.8047 (0.7981) | 0.0348 (0.0264) |
| Oracle | 1.1968 | 0.6019 | 0.8020 | — |

Note: True values: 1.2, 0.6, and 0.8.
Remark 4. We must stress that, in general, the solution of the system F(β) = 0 is the maximum likelihood estimate, with non-zero components. However, this may be misleading for the purposes of this paper. Specifically, as can be seen in this section concerning
the simulation experiments, they were conducted using a true value of β that contained some zero components, corresponding to the non-significant effects. This β is obviously the optimal solution β* when an efficient variable selection method is to be used. Therefore, the optimal solution β* should contain non-zero components but probably some zero components too. This solution could be considered a 'restricted MLE'. As a result, the goal is to minimize the error between the penalized estimator (obtained by using the proposed method for the selection of λ) and the 'restricted MLE', and not the usual MLE, to which a value of λ = 0 would correspond. The aforementioned discussion indicates that our method works very well for the selection of the tuning parameter λ and, as a result, improves the estimation of the penalized regression parameter β.
6 Real data example

In this example, we apply the penalized likelihood methodology, using our proposed method for the selection of the tuning parameter, to the burns data, collected by the General Hospital Burn Center at the University of Southern California. The same data were also analyzed in Fan and Li (2001). The dataset consists of 981 observations. The binary response variable Y is 1 for the victims who survived their burns and 0 otherwise. The considered covariates are X1 = age, X2 = sex, X3 = log(burn area + 1), and a binary variable X4 = oxygen (0 normal, 1 abnormal). Quadratic terms of X1 and X3 and all interaction terms were also included. The intercept term was added, and the logistic regression model was fitted. We also applied best subset variable selection with the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) to this dataset. The unknown parameter λ chosen by our method was 0.6932, 0.0014, and 0.8062 for the penalized likelihood estimates with the SCAD, L1, and hard thresholding penalties, respectively, while the parameter λ chosen by the GCV method was 0.6932, 0.0015, and 0.8062, respectively. The constant α in the SCAD was taken to be 3.7. We should point out that our method led to almost the same values of λ as GCV, which we also used as an alternative, yielding almost the same estimates; however, it was less time consuming (in relation to the convergence of the penalized methods). The execution time of our method was 53.7108 s (CPU time), while the execution time of GCV was 64.1944 s. The estimated coefficients and standard errors concerning our method are depicted in Table 6. We refer the reader to the paper of Fan and Li (2001), Table 7, for a direct comparison concerning the estimated coefficients and standard errors. Moreover, we added fivefold CV as an alternative to the proposed approach and GCV, using the same grid. However, the use of CV led to serious estimability problems, producing overestimated values of the coefficients owing to the very low values of λ that were selected; therefore, we do not present those results here. Note that the execution time of CV was 452.2157 s. Thus, our method and GCV are both much faster and yield better quality solutions.
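For concreteness, the design matrix implied by the description above can be assembled as in the following sketch (our own illustration; the column order follows Table 6 and the variable names are ours).

```python
import numpy as np

def burns_design(age, sex, burn_area, oxygen):
    """Design matrix for the burns example: intercept, X1 = age, X2 = sex,
    X3 = log(burn area + 1), X4 = oxygen, the quadratic terms of X1 and X3,
    and all pairwise interactions, in the order of Table 6."""
    x1 = np.asarray(age, float)
    x2 = np.asarray(sex, float)
    x3 = np.log(np.asarray(burn_area, float) + 1.0)
    x4 = np.asarray(oxygen, float)
    cols = [np.ones_like(x1), x1, x2, x3, x4, x1**2, x3**2,
            x1*x2, x1*x3, x1*x4, x2*x3, x2*x4, x3*x4]
    return np.column_stack(cols)
```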
Table 6. Estimated coefficients and standard errors (in parentheses) for the burns data

| Factors | MLE | SCAD | LASSO | Hard | Best subset (AIC) | Best subset (BIC) |
| Intercept | 5.5050 (0.7519) | 6.0846 (0.2901) | 3.7241 (0.2580) | 5.8709 (0.4093) | 4.8139 (0.4466) | 6.1211 (0.5683) |
| X1   | −8.8339 (2.9680) | −12.2334 (0.0761) | 0 (—) | −11.3059 (1.0968) | −6.4928 (1.7543) | −12.1506 (1.8100) |
| X2   | 2.2970 (2.0049) | 0 (—) | 0 (—) | 2.2077 (1.4089) | 0 (—) | 0 (—) |
| X3   | −2.7705 (3.4286) | −6.9966 (0.2112) | 0 (—) | −4.2186 (0.6367) | 0 (—) | −6.9254 (0.7894) |
| X4   | −1.7401 (1.4086) | 0 (—) | −0.2858 (0.1001) | −1.1573 (1.0423) | −0.2999 (0.1085) | −0.2868 (0.1064) |
| X1²  | −0.7544 (0.6086) | 0 (—) | −1.7601 (0.2559) | 0 (—) | −1.0446 (0.5447) | 0 (—) |
| X3²  | −2.6997 (2.4459) | 0 (—) | −2.7019 (0.2240) | −1.9319 (0.9532) | −4.5536 (0.5460) | 0 (—) |
| X1X2 | 0.0324 (0.3425) | 0 (—) | 0 (—) | 0 (—) | 0 (—) | 0 (—) |
| X1X3 | 7.4644 (2.3372) | 9.8330 (0.1437) | 0.4075 (0.2434) | 9.0503 (0.9740) | 5.6784 (1.2892) | 9.8281 (1.6284) |
| X1X4 | 0.2429 (0.3166) | 0 (—) | 0 (—) | 0 (—) | 0 (—) | 0 (—) |
| X2X3 | −2.1537 (1.6099) | 0 (—) | −0.1006 (0.1011) | −2.1290 (1.2669) | 0 (—) | 0 (—) |
| X2X4 | −0.1204 (0.1576) | 0 (—) | 0 (—) | 0 (—) | 0 (—) | 0 (—) |
| X3X4 | 1.2343 (1.2092) | 0 (—) | 0 (—) | 0.8177 (1.0131) | 0 (—) | 0 (—) |
Table 7. Absolute standardized coefficients and the corresponding factors

| Factors | SCAD | LASSO | Hard | Best subset (AIC) | Best subset (BIC) |
| Intercept | 20.9741 | 14.4345 | 14.3437 | 10.7789 | 10.7709 |
| X1   | 160.7542 | — | 10.3080 | 3.7010 | 6.7130 |
| X2   | — | — | 1.5669 | — | — |
| X3   | 33.1278 | — | 6.6257 | — | 8.7729 |
| X4   | — | 2.8551 | 1.1103 | 2.7640 | 2.6954 |
| X1²  | — | 6.8780 | — | 1.9177 | — |
| X3²  | — | 12.0620 | 2.0267 | 8.3399 | — |
| X1X2 | — | — | — | — | — |
| X1X3 | 68.4272 | 1.6742 | 9.2918 | 4.4046 | 6.0354 |
| X1X4 | — | — | — | — | — |
| X2X3 | — | 0.9950 | 1.6804 | — | — |
| X2X4 | — | — | — | — | — |
| X3X4 | — | — | 0.8071 | — | — |
From Table 6, we observe that the SCAD method chooses three out of the 12 covariates, namely, the age (X1), the burn area (X3), and their interaction, whereas best subset (BIC) chooses the same covariates along with X4, the oxygen. The quadratic terms of the age and burn area are also kept as significant factors by LASSO and best subset (AIC); however, the corresponding linear terms are not selected by LASSO, and AIC keeps only the age. The hard thresholding penalty is more conservative, as it keeps more factors as statistically significant compared with the other methods. For the SCAD, LASSO, Hard, AIC, and BIC methods, we have also calculated the absolute values of the standardized coefficients for all the selected factors, in order to prioritize them, and we present the results in Table 7. From the medical point of view, all of the selected factors should be taken into consideration so as to acquire integrated knowledge about this set of data. However, excluding the intercept and based on the values depicted in Table 7, the covariates age and burn area, as well as their interaction, are the most significant for the majority of the applied methods, except for the LASSO and AIC methods, which display minor differences from the other methods. When the age or burn area variable is not selected by a method, its quadratic term is selected in turn as significant. Therefore, these covariates greatly affect the survival of burn victims.
7 Concluding remarks

In this paper, we propose new estimates of the norm of the error in the GLMs framework based on previously existing results for non-linear systems in numerical analysis. We also propose a new procedure for the selection of the tuning parameter in penalized GLMs for discrete data, based on these estimates. The choice of this parameter is crucial because it controls the extent of penalization. The GCV method is commonly used for this purpose. Despite the popularity and effectiveness of GCV, our simulations highlight that our method is a quite reliable procedure, as it produces improved results. Moreover, it is easier to implement and computationally less intensive because it does not depend on resampling methods. In addition, using our method, we not only derive
the tuning parameter but also simultaneously estimate the regression coefficients by minimizing the maximum of the error. In conclusion, our method could be considered an effective alternative approach for the tuning parameter selection.

Acknowledgements

We would like to thank Professor Claude Brezinski for his insightful comments and suggestions, and Professors Dennis K. J. Lin and Runze Li for sending the MATLAB codes for the procedures proposed in their papers. We also thank Professor Runze Li for sending us the burns data, Professor A. Galantai for fruitful discussions, and two anonymous referees for comments that led to a significant improvement of the paper.

References

Androulakis, E., C. Koukouvinos and F. Vonta (2012), Estimation and variable selection via frailty models with penalized likelihood, Statistics in Medicine 31, 2223–2239.
Androulakis, E., C. Koukouvinos and F. Vonta (2013), Tuning parameter selection in penalized frailty models (submitted).
Antoniadis, A. (1997), Wavelets in statistics: a review (with discussion), Journal of the American Statistical Association 6, 97–144.
Auchmuty, G. (1992), A posteriori error estimates for linear equations, Numerische Mathematik 61, 1–6.
Bickel, P. J. and K. A. Doksum (1977), Mathematical Statistics: Basic Ideas and Selected Topics, Holden-Day Inc., San Francisco.
Craven, P. and G. Wahba (1979), Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation, Numerische Mathematik 31, 377–403.
Fan, J. and R. Li (2001), Variable selection via nonconcave penalized likelihood and its oracle properties, Journal of the American Statistical Association 96, 1348–1360.
Fan, J. and R. Li (2002), Variable selection for Cox's proportional hazards model and frailty model, The Annals of Statistics 30, 74–99.
Fan, J. and R. Li (2006), Statistical challenges with high dimensionality: feature selection in knowledge discovery, in: M. Sanz-Sole, J. Soria, J. L. Varona and J. Verdera (eds), Proceedings of the International Congress of Mathematicians, vol. III, European Mathematical Society, Zurich, 595–622.
Frank, I. E. and J. H. Friedman (1993), A statistical view of some chemometrics regression tools, Technometrics 35, 109–135.
Fu, W. (1998), Penalized regression: the bridge versus the lasso, Journal of Computational and Graphical Statistics 7, 397–416.
Galantai, A. (2001), A study of Auchmuty's error estimate, Computers and Mathematics with Applications 42, 1093–1102.
Greub, W. and W. Rheinboldt (1959), On a generalization of an inequality of L. V. Kantorovich, Proceedings of the American Mathematical Society 10, 407–415.
Horn, R. A. and C. R. Johnson (1985), Matrix Analysis, Cambridge University Press, Cambridge.
Leng, C., Y. Lin and G. Wahba (2006), A note on the LASSO and related procedures in model selection, Statistica Sinica 16, 1273–1284.
Marcus, M. and M. Minc (1992), A Survey of Matrix Theory and Matrix Inequalities, Dover, New York.
McCullagh, P. and J. A. Nelder (1989), Generalized Linear Models, Chapman and Hall, London.
Myers, R. H., D. C. Montgomery and G. G. Vining (2002), Generalized Linear Models. With Applications in Engineering and the Sciences, John Wiley and Sons, New York.
Park, H., F. Sakaori and S. Konishi (2014), Robust sparse regression and tuning parameter selection via the efficient bootstrap information criteria, Journal of Statistical Computation and Simulation 84, 1596–1607.
Tibshirani, R. J. (1996), Regression shrinkage and selection via the LASSO, Journal of the Royal Statistical Society – Series B 58, 267–288.
Zou, H. and R. Li (2008), One-step sparse estimates in nonconcave penalized likelihood models, The Annals of Statistics 36, 1509–1533.

Received: April 2013. Revised: July 2014.