Chinese Journal of Electronics Vol.23, No.4, Oct. 2014
Variable Selection in Logistic Regression Model∗

ZHANG Shangli1,3, ZHANG Lili2, QIU Kuanmin3, LU Ying1, CAI Baigen3
(1. School of Science, Beijing Jiaotong University, Beijing 100044, China)
(2. Department of Statistics, Chonnam National University, Gwangju 500757, South Korea)
(3. State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing 100044, China)

∗ Manuscript Received June 2013; Accepted July 2013. This work was partially supported by the National Natural Science Foundation of China (61070236, U1334211) and the Project of the State Key Laboratory of Rail Traffic Control and Safety (No. RCS2012ZT004), Beijing Jiaotong University.

Abstract — Variable selection is one of the most important problems in pattern recognition. For the linear regression model there are many methods that can solve it, such as the lasso and its many refinements, but there are few variable selection methods for generalized linear models. In this paper we study the variable selection problem in the logistic regression model. We propose a new variable selection method, the logistic elastic net, and prove that it has the grouping effect, which means that strongly correlated predictors tend to be in or out of the model together. The logistic elastic net is particularly useful when the number of predictors (p) is much larger than the number of observations (n). By contrast, the lasso is not a satisfactory variable selection method when p is much larger than n. The advantage and effectiveness of the method are demonstrated on real leukemia data and in a simulation study.

Key words — Variable selection, Lasso, Linear regression model, Logistic elastic net, Logistic regression model.
I. Introduction

Consider the linear regression model

\[ Y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n) \tag{1} \]

where Y is the n × 1 response vector and X is the n × p design matrix with rank(X) = p. We normally use the least squares estimate (LSE) of the parameter β because it has many good properties; however, it also has two drawbacks. One is that the LSE has a very large variance when the predictor variables are collinear, which makes the prediction accuracy relatively poor and the model unstable. The other is that the LSE retains all variables, so the resulting model is hard to interpret. Two standard techniques, ridge regression and subset selection, can improve on the LSE, but they also have defects. Ridge regression, proposed by Hoerl and Kennard [1], is a continuous process that shrinks coefficients and hence makes the model stable; however, it does not set any coefficients to 0 and so does not give an easily interpretable model. Subset selection makes the model interpretable, but as
a discrete process, the variables in a model are either retained or dropped, so small changes in the data may lead to different selected models, and this can reduce prediction accuracy. In 1996 a promising technique, the least absolute shrinkage and selection operator (lasso), was proposed by Tibshirani [2]. It shrinks some coefficients and sets others to 0, and hence retains the good features of both subset selection and ridge regression. The lasso has injected great vitality into the area of variable selection, especially after the least angle regression (LARS) algorithm was proposed by Efron et al. [3], which largely solved the computational problems of the lasso. For the variable selection problem, many alternative regularization estimates have been proposed, such as SCAD [4], the fused lasso [5], the group lasso [6] and the relaxed lasso [7]. The lasso has been applied not only to the simple linear regression model but also to other models, such as the Cox model [8], graphical models [6,7], semi-parametric models [9] and generalized linear models (GLM) [10]. For the variable selection problem in GLM, Cai and Gao [11] gave two variable selection methods based on the Wald criterion and the likelihood ratio criterion, respectively, and proved, under certain conditions, that the two methods are weakly consistent. Park and Hastie [10] studied coefficient estimation with an L1 penalty in GLM, adopting a predictor-corrector algorithm to estimate the coefficients while performing variable selection. Wang and Zhang [12] studied variable selection for the mean in joint generalized linear models, using the generalized quasi-likelihood function to propose a new criterion for model selection. As analyzed in Zou and Hastie [13], although the lasso behaves well in many cases, it has limitations in the following situations:
(a) When the LARS algorithm is used with an n × p design matrix, the method can select at most min(n, p) variables (see [3]). So, when p > n, the lasso can select only n variables, which is a limiting feature for a variable selection method (a small numerical illustration is given after this paragraph).
(b) If there is a group of variables among which the pairwise correlations are very high, the lasso usually tends to select only one variable from the group and does not care which one is selected.
(c) In the usual case n > p, if there are high correlations between predictors, it has been empirically observed that the prediction performance of the lasso is dominated by ridge regression [2].
Cases (a) and (b) make the lasso an inappropriate variable selection method in some situations, especially for gene selection in micro-array data analysis. Concerning prediction performance, case (c) is also common in regression problems, so it is necessary to strengthen the predictive ability of the lasso. The elastic net proposed by Zou and Hastie [13] is a new regularization technique which can overcome these drawbacks of the lasso.
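To make limitation (a) concrete, the following minimal sketch (an illustration added here, not from the original paper; it assumes NumPy and scikit-learn's LARS-based lasso) fits the lasso on data with p much larger than n and counts the selected variables, which cannot exceed n even though more predictors are truly relevant:

```python
# Minimal sketch (not part of the paper) illustrating limitation (a):
# with p >> n, a LARS-based lasso fit has at most n nonzero coefficients.
import numpy as np
from sklearn.linear_model import LassoLars

rng = np.random.default_rng(0)
n, p = 30, 500                       # far more predictors than observations
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:60] = 1.0                 # 60 relevant predictors, more than n
y = X @ beta_true + 0.1 * rng.standard_normal(n)

lasso = LassoLars(alpha=0.01).fit(X, y)
n_selected = np.count_nonzero(lasso.coef_)
print(f"nonzero lasso coefficients: {n_selected} (cannot exceed n = {n})")
```

Under this setup the lasso can keep at most 30 of the 60 truly relevant predictors, which is exactly the situation the elastic net is designed to handle.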
For the linear regression model (1), the elastic net estimate is defined as

\[ \hat\beta(\text{elastic net}) = \arg\min_{\beta}\Big\{ \|Y - X\beta\|^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \Big\} \tag{2} \]

where the non-negative λ1 and λ2 are tuning parameters. Let α = λ1/(λ1 + λ2) and λ = λ1 + λ2; then formula (2) can be written as

\[ \hat\beta(\text{elastic net}) = \arg\min_{\beta}\Big\{ \|Y - X\beta\|^2 + \lambda\Big[\alpha \sum_{j=1}^{p} |\beta_j| + (1-\alpha) \sum_{j=1}^{p} \beta_j^2\Big] \Big\} \tag{3} \]

We call the function $\alpha\sum_{j=1}^{p}|\beta_j| + (1-\alpha)\sum_{j=1}^{p}\beta_j^2$ the elastic net penalty. When α = 0 the elastic net reduces to ridge regression; when α = 1 it reduces to the lasso. The elastic net has the grouping effect, so it can overcome the limitations of the lasso in scenarios (a) and (b).
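As a quick numerical sanity check of this reparameterization (an illustration added here, not part of the original derivation; it assumes NumPy), the objectives in (2) and (3) coincide for any β once α = λ1/(λ1 + λ2) and λ = λ1 + λ2:

```python
# Minimal sketch: the penalties in (2) and (3) coincide when
# alpha = lambda1 / (lambda1 + lambda2) and lam = lambda1 + lambda2.
import numpy as np

def objective_v2(beta, X, y, lam1, lam2):
    resid = y - X @ beta
    return resid @ resid + lam1 * np.sum(np.abs(beta)) + lam2 * np.sum(beta**2)

def objective_v3(beta, X, y, lam, alpha):
    resid = y - X @ beta
    penalty = alpha * np.sum(np.abs(beta)) + (1 - alpha) * np.sum(beta**2)
    return resid @ resid + lam * penalty

rng = np.random.default_rng(1)
X, y, beta = rng.standard_normal((20, 5)), rng.standard_normal(20), rng.standard_normal(5)
lam1, lam2 = 0.7, 0.3
lam, alpha = lam1 + lam2, lam1 / (lam1 + lam2)
assert np.isclose(objective_v2(beta, X, y, lam1, lam2),
                  objective_v3(beta, X, y, lam, alpha))
```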
In this paper we study the variable selection problem in the logistic regression model. The rest of the paper is organized as follows. In Section II we propose the logistic elastic net. In Section III we prove that it has the grouping effect property. Real leukemia data and a simulation study are presented in Section IV. Section V gives a discussion.

II. The Logistic Elastic Net

Let Y follow an exponential dispersion family with density function

\[ f(y, \theta, \varphi) = \exp\Big\{ \frac{y\theta - b(\theta)}{a(\varphi)} + c(y, \varphi) \Big\} \tag{4} \]

where θ is the natural parameter and φ is the scale parameter. Under certain regularity conditions, θ is connected with E(Y|X) through b(·); that is, µ = E(Y|X) = b′(θ). Given X, the variance of Y is a function of the mean and φ, namely Var(Y|X) = a(φ)b″(θ).

When the response variable is a 0–1 variable the linear regression model is inappropriate, and we can use the logistic regression model instead. For a response variable Y and p explanatory variables X1, X2, · · · , Xp, let p = P{Y = 1|X1, X2, · · · , Xp}; the logistic regression model is

\[ p = \frac{\exp\{\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p\}}{1 + \exp\{\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p\}} \tag{5} \]

where β1, β2, · · · , βp are regression coefficients. Applying the logit transformation to formula (5), the logistic regression model takes the linear form

\[ \operatorname{logit}(p) = \ln\Big(\frac{p}{1-p}\Big) = \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p \tag{6} \]

so the parameters can be estimated within the linear-model framework; the logistic regression model is a generalized linear model. For the logistic regression model the response variable Y has a binomial distribution, Y ∼ B(1, p), that is,

\[ P(Y = y) = p^{y}(1-p)^{1-y} = (1-p)\Big(\frac{p}{1-p}\Big)^{y} = (1-p)\exp\Big\{ y\log\frac{p}{1-p} \Big\} \tag{7} \]

Let θ = log(p/(1 − p)); then the right-hand side of formula (7) can be written as exp{yθ − log(1 + e^θ)}. Comparing with the exponential family form (4), we have µ = b′(θ) = e^θ/(1 + e^θ) and

\[ g(\mu) = \log\frac{\mu}{1-\mu} = \theta = X\beta \]

where g(µ) = g(E(Y|X)) = Xβ is the link function. Let xi = (xi1, xi2, · · · , xip) denote the ith observation of the explanatory variables. The log-likelihood of θ(β) is

\[ \ln L(\theta(\beta); y) = \ln \prod_{i=1}^{n} f(y_i; \theta(\beta)_i) = \ln \prod_{i=1}^{n} \exp\{ y_i\theta(\beta)_i - \ln(1 + e^{\theta(\beta)_i}) \} = \sum_{i=1}^{n} \big[ y_i\theta(\beta)_i - \ln(1 + e^{\theta(\beta)_i}) \big] = \sum_{i=1}^{n} \big[ y_i(x_i^{T}\beta) - \ln(1 + e^{x_i^{T}\beta}) \big] \tag{8} \]

The regression parameter β is estimated by maximizing formula (8); by minimizing its negative and adding a suitable penalty term we can define the logistic elastic net estimate.

Definition 1  For given data (X, Y) and penalty parameters (λ1, λ2), the logistic elastic net estimate in the logistic regression model is defined as

\[ \hat\beta(\text{logistic elastic net}) = \arg\min_{\beta}\Big\{ \sum_{i=1}^{n} \big[-y_i(x_i^{T}\beta) + \log(1 + \exp\{x_i^{T}\beta\})\big] + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \Big\} \tag{9} \]

where β = (β1, β2, · · · , βp) are the regression coefficients and the non-negative λ1 and λ2 are tuning parameters.
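Definition 1 can be computed with any solver for penalized logistic regression. The following is a minimal sketch, not the authors' implementation; it uses scikit-learn's saga solver, whose elastic-net parameterization (C, l1_ratio) corresponds to (9) only up to constant scaling factors:

```python
# Minimal sketch (not the authors' algorithm): elastic-net penalized logistic
# regression with scikit-learn's saga solver.  l1_ratio plays the role of alpha
# in (3), and 1/C controls the overall penalty strength (the exact mapping to
# (lambda1, lambda2) in (9) differs by constant factors).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.standard_normal((n, p))
eta = X[:, 0] - 2.0 * X[:, 1]                 # only the first two predictors matter
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.2, C=0.5, max_iter=10000)
model.fit(X, y)
print("number of nonzero coefficients:", np.count_nonzero(model.coef_))
```

In practice the overall penalty strength (here 1/C) and α would be chosen by cross-validation, as in Section IV.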
III. The Property of the Logistic Elastic Net

The logistic elastic net has the grouping effect, which means that strongly correlated predictors tend to be in or out of the model together. This is shown by the following theorem.

Theorem 1  For the logistic regression model, given data (X, Y) and parameters (λ1, λ2), suppose the response variable Y has been centered and the design matrix X has been standardized. Let β̂(λ1, λ2) denote the logistic elastic net estimate and assume β̂k(λ1, λ2)β̂l(λ1, λ2) > 0. Define

\[ D_{\lambda_1,\lambda_2}(k, l) = \big| \hat\beta_k(\lambda_1, \lambda_2) - \hat\beta_l(\lambda_1, \lambda_2) \big| \]

Then

\[ D_{\lambda_1,\lambda_2}(k, l) \le \frac{1}{2\lambda_2} \sum_{i=1}^{n} |x_{il} - x_{ik}|\, |\hat r_i| \]

where

\[ \hat r_i = y_i - \frac{\exp\{x_i^{T}\hat\beta\}}{1 + \exp\{x_i^{T}\hat\beta\}} \]

is the prediction residual.

Proof  From β̂k(λ1, λ2)β̂l(λ1, λ2) > 0 it follows that β̂k(λ1, λ2) ≠ 0, β̂l(λ1, λ2) ≠ 0 and sgn{β̂k(λ1, λ2)} = sgn{β̂l(λ1, λ2)}. Write L(λ1, λ2, β) for the objective function in (9). If β̂m(λ1, λ2) ≠ 0, then β̂(λ1, λ2) satisfies

\[ \frac{\partial L(\lambda_1, \lambda_2, \beta)}{\partial \beta_m}\Big|_{\beta = \hat\beta(\lambda_1, \lambda_2)} = 0 \]

so we have

\[ \sum_{i=1}^{n} \Big[ -y_i x_{ik} + \frac{\exp\{x_i^{T}\hat\beta\}}{1 + \exp\{x_i^{T}\hat\beta\}}\, x_{ik} \Big] + \lambda_1 \operatorname{sgn}\{\hat\beta_k(\lambda_1, \lambda_2)\} + 2\lambda_2 \hat\beta_k(\lambda_1, \lambda_2) = 0 \tag{10} \]

\[ \sum_{i=1}^{n} \Big[ -y_i x_{il} + \frac{\exp\{x_i^{T}\hat\beta\}}{1 + \exp\{x_i^{T}\hat\beta\}}\, x_{il} \Big] + \lambda_1 \operatorname{sgn}\{\hat\beta_l(\lambda_1, \lambda_2)\} + 2\lambda_2 \hat\beta_l(\lambda_1, \lambda_2) = 0 \tag{11} \]

Subtracting (11) from (10), and using sgn{β̂k(λ1, λ2)} = sgn{β̂l(λ1, λ2)}, we get

\[ \sum_{i=1}^{n} (x_{il} - x_{ik})\Big[ y_i - \frac{\exp\{x_i^{T}\hat\beta\}}{1 + \exp\{x_i^{T}\hat\beta\}} \Big] + 2\lambda_2 \big\{ \hat\beta_k(\lambda_1, \lambda_2) - \hat\beta_l(\lambda_1, \lambda_2) \big\} = 0 \]

and therefore, by the triangle inequality,

\[ D_{\lambda_1,\lambda_2}(k, l) = \big| \hat\beta_k(\lambda_1, \lambda_2) - \hat\beta_l(\lambda_1, \lambda_2) \big| = \frac{1}{2\lambda_2} \Big| \sum_{i=1}^{n} (x_{il} - x_{ik})\, \hat r_i \Big| \le \frac{1}{2\lambda_2} \sum_{i=1}^{n} |x_{il} - x_{ik}|\, |\hat r_i| \]

which completes the proof.

Dλ1,λ2(k, l) describes the difference between the coefficient paths of variables k and l. If xk and xl are highly correlated, Theorem 1 shows that this difference is almost equal to 0, which demonstrates that when variables are strongly related, the logistic elastic net can select all of them. The computational algorithm and the tuning parameter selection method are similar to those in Zou and Hastie [13].
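As an illustration of Theorem 1 (a sketch added here, not part of the paper; scikit-learn's (C, l1_ratio) parameterization is only a stand-in for (λ1, λ2)), an elastic-net penalized logistic fit assigns essentially identical coefficients to two identical columns, whereas a pure L1 fit need not, since its solution is not unique for duplicated predictors:

```python
# Minimal illustration of the grouping effect: with a duplicated column,
# the elastic-net logistic fit gives the two copies (almost) equal coefficients;
# a pure L1 fit has no such guarantee because duplicates make it non-unique.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 200
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
X = np.column_stack([x1, x2, x2])             # third column duplicates the second
eta = -2.0 * x1 + 1.0 * x2
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.2, C=1.0, max_iter=10000).fit(X, y)
lasso = LogisticRegression(penalty="l1", solver="saga",
                           C=1.0, max_iter=10000).fit(X, y)
print("elastic net coefficients:", np.round(enet.coef_.ravel(), 3))
print("lasso coefficients:      ", np.round(lasso.coef_.ravel(), 3))
```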
IV. Data Example

As mentioned in Zou and Hastie [13], typical micro-array data contain thousands of genes, while the sample size is usually less than 100. Because of this particular data structure, we believe that a good classification method should have the following properties: (1) it should carry out gene selection; (2) it should not be limited by the fact that p ≫ n; (3) for genes sharing the same biological pathway, it should be able to include the whole group in the model once one gene of the group is selected.

Judging from published results in this area, many classification methods can achieve a low misclassification rate, but many of them are not satisfactory gene selection methods; most popular classification methods lack at least one of the three properties above. The lasso does not meet conditions (2) and (3). The support vector machine [14] and penalized logistic regression [15] are very successful classification methods, but they are not automatic gene selection methods: they rely on external procedures such as univariate ranking [16] or recursive feature elimination [14] to reduce the number of variables entering the final model.

As an automatic variable selection method, the logistic elastic net naturally overcomes the difficulty of p ≫ n and has the capacity to select grouped variables. Here we use real leukemia data and a simulated data set to illustrate its good performance.

1. Real leukemia data example

The leukemia data set [16] contains 72 observations, each consisting of 3571 gene variables, which is a typical p ≫ n problem.

We take α = 0.2. Table 1 lists the number of explanatory variables selected by the models corresponding to the two tuning parameter values. From this table we can see that the lasso is limited by the number of observations n and selects only 23 variables, the ridge method retains almost all of the variables, while the logistic elastic net overcomes the drawbacks of these two methods.

Table 1. The number of variables selected by the three methods
Estimation method          λmin   λ1se
Elastic net (α = 0.2)       185    147
Lasso (α = 1)                23     17
Ridge (α = 0)              2779   2647

Figure 1 shows the 10-fold cross-validation error curves of the logistic elastic net, lasso and ridge estimates. In Figure 1, the error curve of each method has two vertical dotted lines: the left line corresponds to the tuning parameter value λmin at which the cross-validation average error attains its minimum, and the right line corresponds to λ1se, the largest value of λ whose error is within one standard error of the minimum.

Fig. 1. Cross-validation error curves (binomial deviance versus log(λ)) of the logistic elastic net (α = 0.2), lasso and ridge estimates
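A cross-validation curve of this kind could be produced along the following lines (a sketch under assumed tooling, not the authors' code, which follows the algorithm of Zou and Hastie [13]); the λ grid is assumed to be sorted in increasing order, and 1/C in scikit-learn plays the role of the overall penalty strength only approximately:

```python
# Sketch (not the authors' code): 10-fold cross-validation curve for the
# logistic elastic net, from which lambda_min and the one-standard-error
# choice lambda_1se are read off.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold

def cv_deviance_curve(X, y, lambdas, alpha=0.2, n_folds=10, seed=0):
    """lambdas must be sorted in increasing order."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    deviance = np.zeros((len(lambdas), n_folds))
    for j, lam in enumerate(lambdas):
        for k, (tr, te) in enumerate(kf.split(X)):
            model = LogisticRegression(penalty="elasticnet", solver="saga",
                                       l1_ratio=alpha, C=1.0 / lam,
                                       max_iter=10000)
            model.fit(X[tr], y[tr])
            p = model.predict_proba(X[te])[:, 1]
            # twice the mean negative log-likelihood: binomial deviance per observation
            deviance[j, k] = 2.0 * log_loss(y[te], p, labels=[0, 1])
    mean, se = deviance.mean(axis=1), deviance.std(axis=1) / np.sqrt(n_folds)
    j_min = int(np.argmin(mean))
    # largest lambda whose mean deviance is within one SE of the minimum
    j_1se = max(j for j in range(len(lambdas)) if mean[j] <= mean[j_min] + se[j_min])
    return lambdas[j_min], lambdas[j_1se], mean, se

# Example usage (synthetic or real data):
#   lam_grid = np.logspace(-3, 1, 30)
#   lam_min, lam_1se, mean_dev, se_dev = cv_deviance_curve(X, y, lam_grid)
```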
Figure 2 shows the solution paths of the logistic elastic net under four different values of α. As α decreases, the solution path of the logistic elastic net becomes denser and denser, which means that more variables are selected. In fact, when α is large (α = 0.8) the logistic elastic net estimate is sparse, and when α is small (α = 0.2) it is dense.

Fig. 2. Solution paths (coefficients versus L1 norm) of the logistic elastic net method under four values of α (α = 0.8, 0.6, 0.4, 0.2)

Figure 3 gives the first 10-step solution paths of the three methods. From Figure 3 we can see that the solution path given by the lasso is too sparse, while the solution path given by the ridge method is too dense. Since the logistic elastic net combines the advantages of the lasso and the ridge method, its solution path is moderate: neither too sparse (because of its grouping effect) nor too dense (as it abandons some non-significant variables).

Fig. 3. The first 10-step solution paths (coefficients versus step) of the elastic net (α = 0.5), lasso and ridge methods

2. A simulation study

As mentioned above, when there are grouped variables in the data set, the lasso cannot select the whole group. In particular, when two predictor variables are identical, the lasso cannot select both of them at the same time, while the result of Theorem 1 shows that the logistic elastic net can. We test these results on a simulated data set. Assume X1, X2, e and ε all have the standard normal distribution N(0, 1), and let

\[ X_3 = X_2, \qquad X_4 = \tfrac{2}{3}X_1 + \tfrac{1}{3}X_2 + \tfrac{1}{3}X_3 + \tfrac{1}{3}e \]

The true model is

\[ y = \frac{\exp\{-2X_1 + X_2 + X_3 + \varepsilon\}}{1 + \exp\{-2X_1 + X_2 + X_3 + \varepsilon\}} \]

We simulate this data set 1000 times and select variables by the logistic elastic net, lasso and ridge methods, respectively. The solution paths are given in Figure 4, and the coefficient estimates obtained by the three methods are listed in Table 2.

Fig. 4. The solution paths (coefficients versus L1 norm) of the logistic elastic net (α = 0.2), lasso and ridge methods

Table 2. Coefficient estimates
Variables    Lasso           Elastic Net      Ridge
Intercept    -0.05594212     -0.05549803      -0.05284973
X1           -1.46855106     -1.45022001      -1.13725235
X2            1.29688521      0.66503690       0.65207945
X3            6.370543e-05    0.64045268       0.64833306
X4            -               -                -0.19196671
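One replicate of this design could be generated and fitted as in the following sketch (not the authors' code; the mapping from α to scikit-learn's (C, l1_ratio) is approximate, and the true model is interpreted here as the success probability of a Bernoulli response):

```python
# Sketch (not the authors' code) of one replicate of the simulation design:
# X3 duplicates X2, X4 is a noisy combination of X1, X2, X3, and only
# X1, X2, X3 enter the true logistic model.
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_replicate(n=200, seed=0):
    rng = np.random.default_rng(seed)
    x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
    e, eps = rng.standard_normal(n), rng.standard_normal(n)
    x3 = x2.copy()                                   # duplicated predictor
    x4 = (2 * x1 + x2 + x3 + e) / 3                  # correlated but irrelevant
    X = np.column_stack([x1, x2, x3, x4])
    prob = 1.0 / (1.0 + np.exp(-(-2 * x1 + x2 + x3 + eps)))
    y = rng.binomial(1, prob)                        # Bernoulli response
    return X, y

X, y = one_replicate()
fits = {
    "elastic net (alpha=0.2)": LogisticRegression(penalty="elasticnet", solver="saga",
                                                  l1_ratio=0.2, C=1.0, max_iter=10000),
    "lasso (alpha=1)": LogisticRegression(penalty="l1", solver="saga",
                                          C=1.0, max_iter=10000),
    "ridge (alpha=0)": LogisticRegression(penalty="l2", solver="saga",
                                          C=1.0, max_iter=10000),
}
for name, model in fits.items():
    model.fit(X, y)
    print(name, np.round(model.coef_.ravel(), 3))
```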
From the solution paths and the parameter estimates of the three methods we can draw the following conclusions:
(1) Only the logistic elastic net selects the true model.
(2) Compared with the logistic elastic net, the lasso coefficient of X3 (6.370543e-05) is so small that the lasso effectively selects only X1 and X2, whereas the logistic elastic net selects X1, X2 and X3; that is, the grouped variables X2 and X3 are selected at the same time.
(3) Comparing the ridge estimate with the logistic elastic net estimate, both of them select the variable X1 and the grouped variables X2 and X3, but the ridge estimate also selects the unnecessary variable X4, which is not included in the true model.

The conclusions above show that the logistic elastic net estimate performs best for data containing grouped variables.
V. Discussion

For variable selection in the linear regression model, the lasso minimizes the residual sum of squares plus an L1 penalty on the coefficients, the ridge estimate minimizes the residual sum of squares plus an L2 penalty, and the elastic net combines the merits of these two methods by including both L1 and L2 penalty terms. In this paper we apply the idea of the elastic net to the logistic regression model and propose a new variable selection method, the logistic elastic net. The logistic elastic net works well for micro-array data, in which the explanatory variables have strong multi-collinearity (these variables usually describe different characteristics of some genes, and there are grouped variables among them) and we wish to explore their influence on the response variable; the theoretical and empirical results show that the logistic elastic net can be applied well to this type of data. When the number of variables is far larger than the sample size, the lasso estimate is limited by the sample size and selects too few variables, and the ridge estimate retains all variables, while the logistic elastic net estimate overcomes these disadvantages and selects the right variables, as verified on the real leukemia data. In addition, when the data set contains grouped variables, we find that the logistic elastic net estimate can select the true model while the lasso and ridge estimates cannot; these results are also illustrated with the simulated data.

References
[1] A.E. Hoerl and R.W. Kennard, "Ridge regression: biased estimation for nonorthogonal problems", Technometrics, Vol.12, No.1, pp.55-67, 1970.
[2] R. Tibshirani, "Regression shrinkage and selection via the lasso", Journal of the Royal Statistical Society, Series B, Vol.58, No.1, pp.267-288, 1996.
[3] B. Efron, T. Hastie, I. Johnstone and R. Tibshirani, "Least angle regression", The Annals of Statistics, Vol.32, No.2, pp.407-499, 2004.
[4] J. Fan and R.Z. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties", Journal of the American Statistical Association, Vol.96, No.456, pp.1348-1360, 2001.
[5] R. Tibshirani and M. Saunders, "Sparsity and smoothness via the fused lasso", Journal of the Royal Statistical Society, Series B, Vol.67, No.1, pp.91-108, 2005.
[6] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables", Journal of the Royal Statistical Society, Series B, Vol.68, No.1, pp.49-67, 2006.
[7] N. Meinshausen, "Relaxed lasso", Computational Statistics and Data Analysis, Vol.52, No.1, pp.374-393, 2007.
[8] R. Tibshirani, "The lasso method for variable selection in the Cox model", Statistics in Medicine, Vol.16, No.4, pp.385-395, 1997.
[9] R. Li and H. Liang, "Variable selection in semiparametric regression modeling", The Annals of Statistics, Vol.36, No.1, pp.261-286, 2008.
[10] M.Y. Park and T. Hastie, "An L1 regularization-path algorithm for generalized linear models", Journal of the Royal Statistical Society, Series B, Vol.69, No.4, pp.659-677, 2007.
[11] P. Cai and Q. Gao, "Variable selection in generalized linear models", Journal of University of Science and Technology of China, Vol.36, No.9, pp.927-931, 2006. (In Chinese)
[12] D.R. Wang and Z.Z. Zhang, "Variable selection in joint generalized linear models", Chinese Journal of Applied Probability and Statistics, Vol.25, No.3, pp.245-256, 2009.
[13] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net", Journal of the Royal Statistical Society, Series B, Vol.67, No.2, pp.301-320, 2005.
[14] I. Guyon, J. Weston, S. Barnhill and V. Vapnik, "Gene selection for cancer classification using support vector machines", Machine Learning, Vol.46, No.1-3, pp.389-422, 2002.
[15] J. Zhu and T. Hastie, "Classification of gene microarrays by penalized logistic regression", Biostatistics, Vol.5, No.3, pp.427-443, 2004.
[16] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing and M. Caligiuri, "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring", Science, Vol.286, No.5439, pp.531-537, 1999.
[17] M. Segal, K. Dahlquist and B. Conklin, "Regression approaches for microarray data analysis", Journal of Computational Biology, Vol.10, No.6, pp.961-980, 2003.
[18] S. Van de Geer and H. Van Houwelingen, "High-dimensional data: p > n in mathematical statistics and bio-medical applications", Bernoulli, Vol.10, No.6, pp.939-943, 2004.
[19] E. Candes and T. Tao, "The Dantzig selector: statistical estimation when p is much larger than n", The Annals of Statistics, Vol.35, No.6, pp.2313-2351, 2007.
[20] S.L. Zhang, Z.L. Ke, G.D. Wei and L.L. Zhang, "The random simulation algorithm for variable selection", Journal of Information and Computational Science, Vol.9, No.17, pp.5119-5125, 2012.
[21] S.L. Zhang, L.C. Wang and H. Lian, "Estimation by polynomial splines with variable selection in additive Cox models", Statistics, Vol.48, No.1, pp.67-80, 2014.
[22] H. Zou, T. Hastie and R. Tibshirani, "Sparse principal component analysis", Journal of Computational and Graphical Statistics, Vol.15, No.2, pp.265-286, 2006.
ZHANG Shangli was born in Hunan. He is a master's supervisor and associate professor at Beijing Jiaotong University. His research interests include statistics and the interdisciplinary areas of statistics and information. (Email: [email protected])

ZHANG Lili (corresponding author) was born in Inner Mongolia. She received the B.E. degree in mathematics from Inner Mongolia University and is now a doctoral candidate at Chonnam National University. Her research interests include pattern recognition and biostatistics. (Email: [email protected])