Variable Selection for the Multinomial Logit Model

Gerhard Tutz, Wolfgang Pößnecker

Department of Statistics, LMU Munich, Germany
E-mail for correspondence: [email protected]

Abstract: Common variable selection for the multinomial logit model is based on forward/backward strategies, which are known to be rather unstable. We propose selection by regularization using an L1-type penalization term. The difference from existing methods is that all the parameters that are linked to one variable are penalized simultaneously. The method does not select single contributions of variables but whole variables. An application to data about party choice in Germany demonstrates the advantages of the proposed method.

Keywords: Logistic regression, Multinomial logit model, Variable selection, L1 penalty.
1 Introduction
The use of the multinomial logit model is typically restricted to applications with few predictors, because in high-dimensional settings maximum likelihood estimates tend to deteriorate. Variable selection, which reduces the number of parameters to the relevant ones, is therefore important in a parameter-intensive model like the multinomial logit model. The main feature of variable selection in the multinomial logit model is that the effect of one predictor variable is represented by several parameters. Therefore, one has to distinguish between variable selection and parameter selection. The available methods (Krishnapuram et al., 2005; Friedman et al., 2010) that are based on L1-type penalties shrink all the parameters simultaneously without exploiting the fact that the parameters are structured in groups, with one group collecting all the parameters that refer to one variable. In the present paper a penalty is proposed which explicitly uses the grouping of parameters that are linked to one predictor. The effect is selection of predictors rather than selection of parameters.

For linear and generalized linear models (GLMs) a variety of penalty approaches for regularized variable selection has been proposed. The most prominent example is the lasso (Tibshirani, 1996) and its extensions, the fused lasso (Tibshirani et al., 2005) and the group lasso (Yuan and Lin, 2006). Alternative regularized estimators that enforce variable selection are the elastic net (Zou and Hastie, 2005), SCAD (Fan and Li, 2001), the Dantzig selector (Candes and Tao, 2007), and boosting approaches (Bühlmann and Yu, 2003; Bühlmann and Hothorn, 2007; Tutz and Binder, 2006).
2 Model and Selection Procedure

2.1 The Multinomial Logit Model
For data (Y_i, x_i), i = 1, ..., n, with Y_i ∈ {1, ..., k} denoting the categorical response variable and x_i the predictor, the multinomial logit model has the form

P(Y_i = r \mid x_i) = \frac{\exp(\beta_{r0} + x_i^T \beta_r)}{\sum_{s=1}^{k} \exp(\beta_{s0} + x_i^T \beta_s)} = \frac{\exp(\eta_{ir})}{\sum_{s=1}^{k} \exp(\eta_{is})},    (1)

where β_r^T = (β_{r1}, ..., β_{rp}). Since the parameters β_{10}, ..., β_{k0}, β_1^T, ..., β_k^T are not identifiable, additional constraints are needed. Typically one of the response categories is chosen as reference, for example by setting β_{k0} = 0, β_k = 0, so that category k serves as the reference category. An extensive discussion of the multinomial logit model as a multivariate GLM is given, for example, in Agresti (2002) or Tutz (2012).
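To make the parameterization concrete, the following minimal Python sketch (names and numbers are illustrative, not taken from the paper) evaluates the probabilities in (1) under the reference-category constraint β_{k0} = 0, β_k = 0.

```python
import numpy as np

def multinomial_probs(x, b0, B):
    """Evaluate the response probabilities of model (1).
    x  : predictor vector of length p
    b0 : the k-1 intercepts beta_r0
    B  : (k-1) x p matrix with rows beta_r^T
    Category k is the reference, i.e. beta_k0 = 0 and beta_k = 0."""
    eta = np.append(b0 + B @ x, 0.0)   # eta_ir for r < k, and eta_ik = 0
    eta -= eta.max()                   # guard against overflow in exp()
    w = np.exp(eta)
    return w / w.sum()                 # the k probabilities sum to one

# Illustrative values: k = 3 categories, p = 2 predictors
B = np.array([[0.8, -0.4],
              [0.2,  0.6]])
print(multinomial_probs(np.array([1.0, 2.0]), np.array([0.1, -0.3]), B))
```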
2.2 Penalized Estimation
As an alternative to forward/backward procedures we propose a penalized likelihood approach that enforces variable selection. The basic concept is to maximize a penalized log-likelihood

l_p(\beta) = l(\beta) - \lambda J(\beta),

where l(β) is the usual log-likelihood, λ is a tuning parameter, and J(β) is a functional that penalizes the size of the parameters. While the tuning parameter determines the strength of the regularization, the functional determines the properties of the penalized estimation. The most widely used penalty that enforces variable selection is the lasso (Tibshirani, 1996). It has been used in models with a one-dimensional response, such as the classical linear model and univariate generalized linear models (GLMs). For the multinomial logit model, direct application of the lasso corresponds to the penalty

J(\beta) = \sum_{r=1}^{k-1} \|\beta_r\|_1 = \sum_{r=1}^{k-1} \sum_{j=1}^{p} |\beta_{rj}|,

where β^T = (β_1^T, ..., β_{k-1}^T) collects all the parameters to be estimated. Friedman et al. (2010) used the slightly more general elastic net penalty,
which also has the drawback that selection focuses on parameters but not on variables. By contrast, the penalty we propose penalizes the group of parameters that are linked to one variable. For simplicity, let the j-th predictor be metric and let the parameters in η_ir, r = 1, ..., k−1, that are linked to variable j be collected in β_{.j}^T = (β_{1j}, ..., β_{k-1,j}). If no category-specific predictors are included, we use the penalty

J(\beta) = \sum_{j=1}^{p} \|\beta_{.j}\|_2 = \sum_{j=1}^{p} \left(\beta_{1j}^2 + \cdots + \beta_{k-1,j}^2\right)^{1/2}.
The penalty enforces variable selection, that is, all the parameters in β_{.j} are simultaneously shrunk toward zero. It is strongly related to the group lasso (Yuan and Lin, 2006; Meier et al., 2008). However, in the group lasso the grouping refers to the parameters that are linked to a categorical predictor within a univariate regression model, whereas in the present model the grouping arises from the multivariate response structure.
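The contrast between parameter-wise and variable-wise penalization can be made explicit in a few lines of Python; this is an illustrative sketch with invented coefficient values, in which the columns of the matrix B play the role of the groups β_{.j}.

```python
import numpy as np

def lasso_penalty(B):
    """Plain L1 penalty: sums |beta_rj| over all parameters, so single
    parameters rather than variables are selected."""
    return np.abs(B).sum()

def group_penalty(B):
    """Proposed penalty: sums the Euclidean norms of the columns
    beta_.j = (beta_1j, ..., beta_{k-1,j}) linked to each variable."""
    return np.linalg.norm(B, axis=0).sum()

# Rows index the k-1 non-reference categories, columns the p variables;
# the first column is zero as a whole group, i.e. variable 1 is excluded.
B = np.array([[0.0, 1.2, -0.5],
              [0.0, 0.3,  0.8]])
print(lasso_penalty(B), group_penalty(B))
```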
2.3 Computation of Estimates
Let the data be given by (Y_i, x_i), i = 1, ..., n, with Y_i ∈ {1, ..., k} denoting the response variable and x_i the predictors. The coded response vector y_i^T = (y_{i1}, ..., y_{i,k-1}), with y_{ir} = 1 if Y_i = r and y_{ir} = 0 otherwise, follows a multinomial distribution M(1, π_i), where π_i^T = (π_{i1}, ..., π_{i,k-1}) is the vector of response probabilities π_{ir} = P(Y_i = r | x_i). For the multinomial logit model it is assumed that the linear predictor η_i^T = (η_{i1}, ..., η_{i,k-1}) has components

η_{ir} = β_{r0} + x_i^T β_r = x_{ir}^T β,

where the constant is separated and x_{ir} is the corresponding design vector of length m composed of the intercept and x_i. In closed form the model can be written as g(π_i) = X_i β, where X_i^T = (x_{i1}, ..., x_{i,k-1}) is the design matrix for the i-th observation and g(.) is a (k−1)-dimensional link function. The log-likelihood of the multinomial logit model has the form

l(\beta) = \sum_{i=1}^{n} \left[ \sum_{r=1}^{k-1} y_{ir} \eta_{ir} - \log\Big(1 + \sum_{s=1}^{k-1} \exp(\eta_{is})\Big) \right].
There are several ways to give the log-likelihood in matrix form; see, for example, Tutz (2012). We consider a form that is especially useful for the grouping of variables, namely the form with global (metric or binary) variables of length p. Let the observations be collected in the matrix

X = [1 \,|\, x_{.1} \,|\, \ldots \,|\, x_{.p}],

where x_{.j}^T = (x_{1j}, ..., x_{nj}) and 1^T = (1, 1, ..., 1), and let the parameters be collected in the matrix B = [β_{.0} | β_{.1} | ... | β_{.p}]. Then one obtains B X^T = [η_1 | ... | η_n]. With Y^T = [y_1 | ... | y_n], the log-likelihood can be given in the form

l(\beta) = \mathrm{tr}(Y B X^T) - \sum_{i=1}^{n} \log\Big(1 + \sum_{s=1}^{k-1} \exp(\eta_{is})\Big).
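As a quick numerical check, the following Python sketch (with randomly generated data, purely for illustration) confirms that the trace form coincides with the elementwise form of the log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 50, 3, 4

# Design matrix X = [1 | x.1 | ... | x.p] and parameter matrix
# B = [beta.0 | beta.1 | ... | beta.p], as defined in the text.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
B = rng.normal(size=(k - 1, p + 1))
Eta = X @ B.T                              # n x (k-1) matrix with rows eta_i^T
Y = np.zeros((n, k - 1))                   # dummy-coded responses y_ir
Y[np.arange(n), rng.integers(0, k - 1, size=n)] = 1.0
# (an observation in the reference category k would have an all-zero row)

log_term = np.log1p(np.exp(Eta).sum(axis=1)).sum()
ll_sum = (Y * Eta).sum() - log_term          # elementwise form of l(beta)
ll_trace = np.trace(Y @ B @ X.T) - log_term  # matrix form with the trace
print(np.isclose(ll_sum, ll_trace))          # True: the two forms agree
```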
For maximization we use a blockwise coordinate ascent algorithm that is an adaptation of the algorithm proposed by Meier et al. (2008) for the binary model. It cycles through the groups of parameters, keeping all but the current group fixed.
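The groupwise update idea can be sketched as follows. This is a simplified proximal gradient step per group, with step size and log-likelihood gradient assumed as inputs; it illustrates the mechanism behind groupwise selection rather than reproducing the exact algorithm of Meier et al. (2008).

```python
import numpy as np

def group_soft_threshold(v, c):
    """Shrink the whole group v toward zero; the group becomes exactly
    zero when its Euclidean norm does not exceed the threshold c."""
    nrm = np.linalg.norm(v)
    return np.zeros_like(v) if nrm <= c else (1.0 - c / nrm) * v

def blockwise_cycle(B, grad, lam, step):
    """One cycle over the variable groups: for each predictor j, take a
    gradient ascent step on l(beta) for the column beta_.j, then apply
    the groupwise shrinkage induced by the L2 penalty term. B and grad
    are (k-1) x p arrays; the unpenalized intercepts beta_.0 would be
    updated separately without shrinkage."""
    for j in range(B.shape[1]):                        # cycle through variables
        v = B[:, j] + step * grad[:, j]                # ascent step on l(beta)
        B[:, j] = group_soft_threshold(v, step * lam)  # groupwise shrinkage
    return B
```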
2.4 Application
We consider the modelling of party preference in Germany, using data from the German Longitudinal Election Study with the five response categories Christian Democratic Union (CDU: 1), Social Democratic Party (SPD: 2), Green Party (3), Liberal Party (FDP: 4), and Left Party (Die Linke: 5). There are nine potential predictors: age, political interest (1: less interested, 0: very interested), religion (1: evangelical, 2: catholic, 3: otherwise), regional provenance (west; 1: former West Germany, 0: otherwise), gender (1: male, 0: female), union (1: member of a union, 0: otherwise), satisfaction with the functioning of democracy (democracy; 1: not satisfied, 0: satisfied), unemployment (1: currently unemployed, 0: otherwise), and high school degree (1: yes, 0: no).

Figure 1 shows the coefficient buildups of the global variables resulting from lasso-type regularization. Only the variables that turned out to be influential are shown; political interest, gender, unemployment, and high school degree were set to zero by the method. The vertical line marks the tuning parameter selected by cross-validation. The grouped selection behavior can clearly be seen from Figure 1: the k − 1 coefficients that belong to the same predictor always enter or leave the model simultaneously. Thus, in contrast to previous approaches (Krishnapuram et al., 2005; Friedman et al., 2010), the method proposed in this paper performs actual variable selection in multinomial logit models.
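The cross-validation step used to pick the tuning parameter can be sketched generically as below; the `fit` argument is a hypothetical stand-in for the penalized estimator, and the predictive deviance is one common selection criterion, not necessarily the exact one used in the paper.

```python
import numpy as np

def choose_lambda(X, y, lam_grid, fit, n_folds=5, seed=0):
    """Pick the tuning parameter by K-fold cross-validation.
    y holds integer-coded categories 0, ..., k-1. `fit(X, y, lam)` is
    a placeholder for the penalized estimator and is assumed to return
    a callable mapping predictors to an n x k matrix of probabilities."""
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=len(y))
    cv_deviance = []
    for lam in lam_grid:
        dev = 0.0
        for f in range(n_folds):
            model = fit(X[folds != f], y[folds != f], lam)
            probs = model(X[folds == f])
            y_test = y[folds == f]
            # predictive deviance: -2 * log-probability of the observed class
            dev += -2.0 * np.log(probs[np.arange(len(y_test)), y_test]).sum()
        cv_deviance.append(dev)
    return lam_grid[int(np.argmin(cv_deviance))]
```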
[Figure 1: coefficient buildups plotted against the tuning parameter λ, with one panel per selected variable: Union, West, Age, Democracy, Religion_2, and Religion_3; the curve labels 2-5 indicate the response categories.]

FIGURE 1. Coefficient buildups for selected global variables of party choice data.
References

Agresti, A. (2002). Categorical Data Analysis. New York: Wiley.

Bühlmann, P. and Yu, B. (2003). Boosting with the L2 loss: Regression and classification. Journal of the American Statistical Association, 98, 324–339.

Bühlmann, P. and Hothorn, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22, 477–505.

Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35, 2313–2351.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.

Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1–22.

Krishnapuram, B., Carin, L., Figueiredo, M. and Hartemink, A. (2005). Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 957–968.

Meier, L., van de Geer, S. and Bühlmann, P. (2008). The group lasso for logistic regression. Journal of the Royal Statistical Society B, 70, 53–71.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58, 267–288.

Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society B, 67, 91–108.

Tutz, G. and Binder, H. (2006). Generalized additive modeling with implicit variable selection by likelihood-based boosting. Biometrics, 62, 961–971.

Tutz, G. (2012). Regression for Categorical Data. Cambridge: Cambridge University Press.

Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society B, 68, 49–67.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B, 67, 301–320.