Regularization Parameter Selection in the Group Lasso*

Teppei SHIMAMURA¹, Hiroyuki MINAMI², Masahiro MIZUTA²

¹ Graduate School of Information Science and Technology, Hokkaido University, e-mail: [email protected]
² Information Initiative Center, Hokkaido University

* Work partially supported by JSPS research fellowships for young scientists.

Abstract: This article discusses the problem of choosing a regularization parameter in the group Lasso proposed by Yuan and Lin (2006), an l1-regularization approach for producing a block-wise sparse model that has attracted much interest in statistics, machine learning, and data mining. It is important to choose an appropriate regularization parameter from a set of candidate values, because it affects the predictive performance of the fitted model. However, traditional model selection criteria, such as AIC and BIC, cover only models estimated by maximum likelihood and cannot be applied directly to regularization parameter selection. We propose an information criterion for regularization parameter selection in the group Lasso within the framework of maximum penalized likelihood estimation.

Keywords: L1-regularization approach, model selection problem, information-theoretic approach
1. Introduction

The group Lasso proposed by Yuan and Lin (2006) produces a block-wise sparse model while keeping high prediction accuracy in linear regression problems with p explanatory variables partitioned into J blocks, which implies that some blocks of the regression coefficients are exactly zero. It achieves variable selection at the group level and is useful in the special cases of linear regression where not only continuous but also categorical variables are present, or where there are meaningful blocks of variables, as in polynomial regression. This article deals with the problem of choosing a better value from a set of regularization parameter values in the group Lasso problem. The least regularized group Lasso, with the regularization parameter λ equal to zero, corresponds to the ordinary least squares estimates, while the most regularized group Lasso, with λ = ∞, yields a constant fit. The choice of the regularization parameter λ therefore affects the predictive performance and the sparseness of the fitted model simultaneously. Thus regularization parameter selection is one of the most important practical issues for the group Lasso.

The purpose of this article is to derive an information criterion (Akaike, 1974; Konishi and Kitagawa, 1996) for regularization parameter selection in the group Lasso. A challenging task in this derivation is to calculate influence functions for the group Lasso estimators. As with the Lasso, the penalized likelihood function corresponding to the group Lasso is not differentiable at the origin, which makes it difficult to obtain the influence functions. To avoid this problem, we consider only the nonzero components of the group Lasso estimator satisfying the Karush-Kuhn-Tucker conditions and accomplish the derivation under some assumptions. Section 2 describes the definition of the group Lasso in the context of linear regression models. Section 3 presents information criteria for model selection and regularization parameter selection in the group Lasso.
2. Group Lasso

Consider the regression model with J factors:

$$ y = \sum_{j=1}^{J} X_j \beta_j + \varepsilon = X\beta + \varepsilon \qquad (1) $$
where y is an n × 1 response vector, ε ∼ N_n(0, σ²I), X_j is an n × p_j design matrix corresponding to the j-th factor, β_j is a coefficient vector of size p_j, j = 1, …, J, X = (X_1, …, X_J), and β = (β_1', …, β_J')'. For simplicity of explanation, assume that the response vector and each input variable are centered, and that each X_j is orthonormalized through Gram-Schmidt orthonormalization, that is, X_j'X_j = I_{p_j}. The group Lasso estimator is defined as a solution that minimizes the functional

$$ \frac{1}{2}\Bigl\| y - \sum_{j=1}^{J} X_j\beta_j \Bigr\|^2 + \lambda \sum_{j=1}^{J} (\beta_j' K_j \beta_j)^{1/2} \qquad (2) $$

where λ ≥ 0 is a regularization parameter and K_1, …, K_J are symmetric p_j × p_j positive definite matrices. In this article, we set K_j = p_j I_{p_j}, which is equivalent to the setting of Yuan and Lin (2006). Let G = {j ∈ {1, …, J} : β_j ≠ 0} be the subset of factor indices {1, …, J} where the coefficient vectors are nonzero. From the Karush-Kuhn-Tucker conditions, a necessary and sufficient condition for β to be a solution to (2) when K_j = p_j I_{p_j} is

$$ -X_j'(y - X\beta) + \lambda\sqrt{p_j}\,(\beta_j'\beta_j)^{-1/2}\beta_j = 0 \quad \forall \beta_j \neq 0,\ j \in G, \qquad (3) $$
$$ \| -X_j'(y - X\beta) \| \le \lambda\sqrt{p_j} \quad \forall \beta_j = 0,\ j \notin G. \qquad (4) $$
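Conditions (3)-(4) are easy to verify numerically for a candidate solution. The following is a minimal NumPy sketch of such a check with K_j = p_j I_{p_j}; the function name, the `groups` list of column-index arrays, and the tolerance are our own illustrative choices.

```python
import numpy as np

def kkt_check(y, X, groups, beta, lam, tol=1e-6):
    """Verify the optimality conditions (3)-(4) for a candidate group Lasso
    solution `beta` with K_j = p_j * I_{p_j}.
    `groups` is a list of column-index arrays, one per factor j."""
    resid = y - X @ beta
    for idx in groups:
        g = -X[:, idx].T @ resid            # -X_j'(y - X beta)
        b = beta[idx]
        p_j = len(idx)
        if np.linalg.norm(b) > 0:
            # condition (3): -X_j'(y - X beta) + lam*sqrt(p_j)*b/||b|| = 0
            if np.linalg.norm(g + lam * np.sqrt(p_j) * b / np.linalg.norm(b)) > tol:
                return False
        else:
            # condition (4): ||X_j'(y - X beta)|| <= lam*sqrt(p_j)
            if np.linalg.norm(g) > lam * np.sqrt(p_j) + tol:
                return False
    return True
```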
Yuan and Lin (2006) extended the shooting algorithm for the Lasso proposed by Fu (1999) to calculate a solution to (2). We use Yuan and Lin’s algorithm directly.
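With orthonormalized blocks, each group-wise subproblem of (2) has the closed-form solution β_j = (1 − λ√p_j/‖S_j‖)₊ S_j, where S_j = X_j'(y − Σ_{l≠j} X_l β_l), and the algorithm cycles through the factors until convergence. Below is a minimal NumPy sketch of this block-wise update; the function name, iteration cap, and stopping rule are ours and not taken from the original implementation.

```python
import numpy as np

def group_lasso(y, X, groups, lam, n_iter=200, tol=1e-8):
    """Block coordinate descent for criterion (2) with K_j = p_j * I_{p_j},
    assuming each block X_j is orthonormalized (X_j'X_j = I_{p_j}).
    `groups` is a list of column-index arrays, one per factor j."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta_old = beta.copy()
        for idx in groups:
            p_j = len(idx)
            # partial residual with the j-th block removed
            r_j = y - X @ beta + X[:, idx] @ beta[idx]
            s_j = X[:, idx].T @ r_j
            norm_s = np.linalg.norm(s_j)
            # group-wise soft-thresholding update
            shrink = max(0.0, 1.0 - lam * np.sqrt(p_j) / norm_s) if norm_s > 0 else 0.0
            beta[idx] = shrink * s_j
        if np.linalg.norm(beta - beta_old) < tol:
            break
    return beta
```

For a λ large enough that ‖X_j'y‖ ≤ λ√p_j for every block, the update returns the all-zero vector, in agreement with condition (4).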
3. Regularization Parameter Selection in the Group Lasso

We start by giving an overview of model selection via the information-theoretic approach. Then we move on to discuss the regularization parameter selection problem in the group Lasso and present the derivation of information criteria.

3.1. Information-Theoretic Approach for Model Selection

The problem of model selection is to choose one from a series of r candidate models. Assume that the observations y are independently distributed according to an unknown "true" distribution function G(y) with density function g(y). One usually models g(y) by a member of a parametric family, F = {f(y|θ), θ ∈ Θ}, where θ is a p-dimensional unknown parameter vector and Θ is an open subset of R^p, and estimates θ̂ from the observations y. We also denote by z a future observation from the density g. Information criteria have been proposed as estimators of the expected log-likelihood and are given by

$$ \mathrm{IC} = -2\sum_{i=1}^{n} \log f(y_i|\hat\theta) + 2\hat b(G) \qquad (5) $$
where b̂(G) is a properly defined approximation to the bias term given by

$$ b(G) = E_{G(y)}\left[ \int \log f(z|\hat\theta)\, d\hat G(z) - \int \log f(z|\hat\theta)\, dG(z) \right] \qquad (6) $$

with the expectation taken over the joint distribution of y, that is, $\prod_{i=1}^{n} dG(y_i)$.

In general, it is difficult to obtain the bias b(G) in a closed form. Thus the bias estimate b̂(G) is usually given as a consistent estimator of the asymptotic bias, b_1(G), in the expansion

$$ b(G) = \frac{1}{n} b_1(G) + O(n^{-2}). \qquad (7) $$
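As a familiar special case, when θ̂ is the maximum likelihood estimator and the specified family contains the true density g, the bias correction term in (5) reduces to twice the number of estimated parameters, 2p, and (5) recovers AIC (Akaike, 1974). The criterion derived below plays the same role when θ̂ is obtained by maximum penalized likelihood.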
For a general statistical functional estimator θ̂ = T(Ĝ), where T(·) is a p-dimensional functional vector on the space of distribution functions, Konishi and Kitagawa (1996) derived the asymptotic bias as a function of the empirical influence function of the estimator θ̂ and the score function of the hypothesized statistical model f(·|θ̂). The asymptotic bias in (6) is then given by

$$ b_1(G) = \mathrm{tr}\left\{ \int T^{(1)}(z; G)\, \frac{\partial \log f(z|\theta)}{\partial \theta'}\bigg|_{\theta=T(G)} dG(z) \right\} \qquad (8) $$

where $T^{(1)}(z; G) = (T_1^{(1)}(z; G), \ldots, T_p^{(1)}(z; G))'$, $\partial/\partial\theta' = (\partial/\partial\theta_1, \ldots, \partial/\partial\theta_p)$, and $T_i^{(1)}(z; G)$ is the influence function defined by

$$ T_i^{(1)}(z; G) = \lim_{\epsilon \to 0} \frac{T_i[(1-\epsilon)G + \epsilon\delta_z] - T_i(G)}{\epsilon} \qquad (9) $$
with δ_z being a point mass at z. By replacing the asymptotic bias b_1(G) by b_1(Ĝ), we obtain an information criterion. Model selection proceeds by choosing the model that gives the smallest value of the information criterion among a set of candidate models {f(y|θ̂_1), f(y|θ̂_2), …, f(y|θ̂_r)}.

3.2. Proposal for Regularization Parameter Selection

We consider a setting of regularization parameter selection where one chooses a better value among a sequence of regularization parameter values {λ_1, λ_2, …, λ_r} in the framework of the group Lasso. Suppose that each model M depends on the regularization parameter λ and is specified by the normal linear model

$$ f(y|\hat\theta) = \frac{1}{\sqrt{2\pi\hat\sigma^2}} \exp\left\{ -\frac{(y - \hat\beta' x)^2}{2\hat\sigma^2} \right\} \qquad (10) $$
where β̂ = (β̂_1', …, β̂_J')' and θ̂ = (β̂', σ̂²)' are given by the maximizer of the penalized log-likelihood

$$ l_\lambda(y|\theta) = \sum_{i=1}^{n}\log f(y_i|\theta) - \frac{\lambda}{\sigma^2}\sum_{j=1}^{J}(\beta_j' K_j\beta_j)^{1/2} \qquad (11) $$
where λ > 0 is a regularization parameter. Multiplying (11) by σ² shows that the problem of maximizing the penalized log-likelihood is precisely equivalent to that of maximizing

$$ \sigma^2\sum_{i=1}^{n}\log f(y_i|\theta) - \lambda\sum_{j=1}^{J}(\beta_j' K_j\beta_j)^{1/2}. \qquad (12) $$
Since, in the normal linear case,

$$ \sigma^2\sum_{i=1}^{n}\log f(y_i|\theta) = -\frac{1}{2}\sum_{i=1}^{n}(y_i - \beta' x_i)^2 - \frac{n\sigma^2}{2}\log(2\pi\sigma^2) \qquad (13) $$

and the term involving β does not depend on σ², the maximization of (12) with respect to β is equivalent to the minimization of

$$ \frac{1}{2}\sum_{i=1}^{n}(y_i - \beta' x_i)^2 + \lambda\sum_{j=1}^{J}(\beta_j' K_j\beta_j)^{1/2}, \qquad (14) $$

which eliminates the dependence on σ² and is exactly the group Lasso criterion used in (2). After β̂ is obtained, σ̂² is given by the solution of the equation

$$ \frac{\partial}{\partial\sigma^2}\left\{ -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \hat\beta' x_i)^2 - \frac{n}{2}\log(2\pi\sigma^2) - \frac{\lambda}{\sigma^2}\sum_{j=1}^{J}(\hat\beta_j' K_j\hat\beta_j)^{1/2} \right\} = 0. \qquad (15) $$
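Solving (15) explicitly, i.e., differentiating and multiplying through by 2σ⁴, gives the closed-form variance estimate that reappears in the criterion below:

$$ \hat\sigma^2 = \frac{1}{n}\left\{ \sum_{i=1}^{n}(y_i - \hat\beta' x_i)^2 + 2\lambda\sum_{j=1}^{J}(\hat\beta_j' K_j\hat\beta_j)^{1/2} \right\}. $$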
Our task is to calculate the influence function of the group Lasso estimator θ̂. Note that the penalized log-likelihood (11) is not differentiable in θ when some block-wise components of θ are exactly zero in the solution. Let G_k = {j ∈ {1, …, J} : β̂_j ≠ 0} be the subset of factor indices {1, …, J} where the block-wise components of β̂ are nonzero for the regularization parameter λ_k, k = 1, …, r, and denote the group Lasso estimator for λ_k by θ̂_k. Suppose that G_k is locally constant with respect to y, that is, when a small perturbation is imposed on the observations y, the zero block-wise components of β̂_k, namely β̂_j, j ∉ G_k, stay at 0. Then the penalized log-likelihood function is twice differentiable in θ_{G_k}, where θ_{G_k} = (β_{G_k}', σ_k²)' and β_{G_k} is the coefficient vector corresponding to G_k. Denote by X_{G_k} = [⋯, x_j, ⋯]_{j∈G_k} the sub design matrix for G_k, where x_j is the j-th column of X, and by p_{G_k} the number of elements of θ̂_{G_k}.
Let T_{G_k}(G) be the p_{G_k}-dimensional functional defined by

$$ \int \psi_{G_k}(y, \theta_{G_k})\Big|_{\theta_{G_k}=T_{G_k}(G)}\, dG = 0 \qquad (16) $$
where

$$ \psi_{G_k}(y, \theta_{G_k}) = \frac{\partial}{\partial\theta_{G_k}}\left\{ -\frac{1}{2\sigma^2}(y - \beta' x)^2 - \frac{1}{2}\log(2\pi\sigma^2) - \frac{\lambda_k}{\sigma^2}\sum_{j=1}^{J}(\beta_j' K_j\beta_j)^{1/2} \right\} $$
with G being the true distribution of y. By replacing G by the empirical distribution Ĝ based on the observations y, we have

$$ \frac{1}{n}\sum_{i=1}^{n} \psi_{G_k}(y_i, \theta_{G_k})\Big|_{\theta_{G_k}=T_{G_k}(\hat G)} = 0. \qquad (17) $$

This implies that the nonzero components of the group Lasso estimator θ̂_{G_k} can be written implicitly as θ̂_{G_k} = T_{G_k}(Ĝ) for the functional T_{G_k}(G) defined by (16).
By replacing G in (16) by G_ε = (1 − ε)G + εδ_y, with δ_y being a point mass at y, we obtain

$$ \int \psi_{G_k}\bigl(y, T_{G_k}((1-\epsilon)G + \epsilon\delta_y)\bigr)\, d\{(1-\epsilon)G(y) + \epsilon\delta_y(y)\} = 0. \qquad (18) $$
By differentiating with respect to ε and setting ε = 0, we have

$$ \int \psi_{G_k}(y, T_{G_k}(G))\, d\{\delta_y(y) - G(y)\} + \int \frac{\partial}{\partial\theta_{G_k}'}\psi_{G_k}(y, \theta_{G_k})\Big|_{\theta_{G_k}=T_{G_k}(G)} dG(y) \cdot \frac{\partial}{\partial\epsilon}\bigl\{ T_{G_k}((1-\epsilon)G + \epsilon\delta_y) \bigr\}\Big|_{\epsilon=0} = 0. \qquad (19) $$

From the above equation, the influence function of the group Lasso estimator θ̂ is given by

$$ T_{G_k}^{(1)}(y, G) \equiv \frac{\partial}{\partial\epsilon}\bigl\{ T_{G_k}((1-\epsilon)G + \epsilon\delta_y) \bigr\}\Big|_{\epsilon=0} = -\left\{ \int \frac{\partial}{\partial\theta_{G_k}'}\psi_{G_k}(y, \theta_{G_k})\Big|_{\theta_{G_k}=T_{G_k}(G)} dG(y) \right\}^{-1} \psi_{G_k}(y, T_{G_k}(G)). \qquad (20) $$
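Note that, combining (8) with (20), the asymptotic bias takes the sandwich form b_1(G) = tr{I(G) J(G)⁻¹}, where J(G) = −∫ ∂ψ_{G_k}(y, θ_{G_k})/∂θ_{G_k}'|_{θ_{G_k}=T_{G_k}(G)} dG(y) and I(G) = ∫ ψ_{G_k}(y, T_{G_k}(G)) ∂log f(y|θ)/∂θ_{G_k}'|_{θ_{G_k}=T_{G_k}(G)} dG(y); the matrices Î(Ĝ) and Ĵ(Ĝ) below are their empirical counterparts.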
By replacing G in (20) by Ĝ and using the result in (8), we have the following information criterion for the group Lasso model f(y|θ̂):

$$ \begin{aligned} \mathrm{GLIC} &= -2\sum_{i=1}^{n} \log f(y_i|\hat\theta) + 2\hat b_1(\hat G) \\ &= -2\sum_{i=1}^{n} \log f(y_i|\hat\theta) + \frac{2}{n}\sum_{i=1}^{n} \mathrm{tr}\left\{ T_{G_k}^{(1)}(y_i, \hat G)\, \frac{\partial \log f(y_i|\theta)}{\partial\theta'}\bigg|_{\theta=\hat\theta} \right\} \\ &= n\log(2\pi\hat\sigma^2) + \|y - X\hat\beta\|^2/\hat\sigma^2 + 2\,\mathrm{tr}\bigl\{ \hat I(\hat G)\,\hat J(\hat G)^{-1} \bigr\} \end{aligned} \qquad (21) $$
where

$$ \hat J(\hat G) = \frac{1}{n\hat\sigma^2} \begin{pmatrix} X_{G_k}'X_{G_k} + \lambda_k D & \dfrac{1}{\hat\sigma^2}X_{G_k}'\Lambda 1_n - \dfrac{\lambda_k}{\hat\sigma^2}\, d \\[6pt] \dfrac{1}{\hat\sigma^2}1_n'\Lambda X_{G_k} - \dfrac{\lambda_k}{\hat\sigma^2}\, d' & \dfrac{1}{\hat\sigma^4}1_n'\Lambda^2 1_n - \dfrac{n}{2\hat\sigma^2} + \dfrac{2\lambda_k}{\hat\sigma^4}(\hat\beta'K\hat\beta)^{1/2} \end{pmatrix}, $$

$$ \hat I(\hat G) = \frac{1}{n\hat\sigma^4} \begin{pmatrix} X_{G_k}'\Lambda - \lambda_k d\, 1_n' \\[6pt] \dfrac{1}{2\hat\sigma^2}1_n'\Lambda^2 - \dfrac{1}{2}1_n' + \dfrac{\lambda_k}{\hat\sigma^2}(\hat\beta'K\hat\beta)^{1/2} 1_n' \end{pmatrix} \begin{pmatrix} \Lambda X_{G_k} & \dfrac{1}{2\hat\sigma^2}\Lambda^2 1_n - \dfrac{1}{2}1_n \end{pmatrix}, $$

with

1_n = (1, 1, …, 1)'  (an n × 1 vector),
σ̂² = {(y − Xβ̂)'(y − Xβ̂) + 2λ_k(β̂'Kβ̂)^{1/2}}/n,
D = bdiag[D_j]_{j∈G_k},
D_j = (β̂_j'β̂_j)^{-1/2}√p_j I_{p_j} − (β̂_j'β̂_j)^{-3/2}√p_j β̂_jβ̂_j',
Λ = diag[y_1 − β̂'x_1, y_2 − β̂'x_2, …, y_n − β̂'x_n],
d = (d_j)_{j∈G_k},
d_j = (β̂_j'β̂_j)^{-1/2}√p_j β̂_j,

where (β̂'Kβ̂)^{1/2} is shorthand for the penalty term Σ_j(β̂_j'K_jβ̂_j)^{1/2} in (15). We choose the regularization parameter value that minimizes GLIC among the sequence of regularization parameter values {λ_1, λ_2, …, λ_r}.
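As a computational illustration, the following NumPy sketch evaluates GLIC for a fitted coefficient vector, building Λ, d, D, σ̂², Ĵ(Ĝ), and Î(Ĝ) following the expressions displayed above; the function name, data layout, and helper choices are ours, and the sketch assumes at least one active group.

```python
import numpy as np

def glic(y, X, groups, beta, lam):
    """Compute GLIC (21) for a fitted group Lasso solution `beta` obtained
    with regularization parameter `lam` and K_j = p_j * I_{p_j}.
    `groups` is a list of column-index arrays defining the J factors."""
    n = len(y)
    active = [idx for idx in groups if np.linalg.norm(beta[idx]) > 0]   # G_k
    cols = np.concatenate(active)                                       # columns of X_{G_k}
    XG = X[:, cols]
    res = y - X @ beta
    Lam = np.diag(res)                                                  # Λ
    pen = sum(np.sqrt(len(idx)) * np.linalg.norm(beta[idx]) for idx in groups)
    sig2 = (res @ res + 2 * lam * pen) / n                              # σ̂² from (15)

    # d and D restricted to the active groups, in the order of `cols`
    d = np.concatenate([np.sqrt(len(idx)) * beta[idx] / np.linalg.norm(beta[idx])
                        for idx in active])
    D = np.zeros((len(cols), len(cols)))
    pos = 0
    for idx in active:
        b, nb, k = beta[idx], np.linalg.norm(beta[idx]), len(idx)
        D[pos:pos + k, pos:pos + k] = np.sqrt(k) * (np.eye(k) / nb - np.outer(b, b) / nb**3)
        pos += k

    one = np.ones(n)
    # J-hat as displayed above (symmetric 2x2 block matrix)
    J11 = XG.T @ XG + lam * D
    J12 = (XG.T @ Lam @ one - lam * d) / sig2
    J22 = (res @ res) / sig2**2 - n / (2 * sig2) + 2 * lam * pen / sig2**2
    J_hat = np.block([[J11, J12[:, None]],
                      [J12[None, :], np.array([[J22]])]]) / (n * sig2)

    # I-hat as displayed above (product of the two stacked factors)
    psi_top = XG.T @ Lam - lam * np.outer(d, one)
    psi_bot = (res**2) / (2 * sig2) - 0.5 + lam * pen / sig2
    s_left = Lam @ XG
    s_right = (res**2) / (2 * sig2) - 0.5
    Psi = np.vstack([psi_top, psi_bot[None, :]])
    S = np.hstack([s_left, s_right[:, None]])
    I_hat = (Psi @ S) / (n * sig2**2)

    bias = 2 * np.trace(I_hat @ np.linalg.inv(J_hat))
    return n * np.log(2 * np.pi * sig2) + (res @ res) / sig2 + bias
```

GLIC can then be evaluated along a grid of candidate values such as np.logspace(-2, 1, 30), reusing the fitted solutions from the algorithm in Section 2, and the value of λ attaining the smallest GLIC is selected.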
4. Conclusion

In this article, we discussed the problem of choosing a better regularization parameter from a series of candidate values in the group Lasso. We derived the influence function of the group Lasso estimator and proposed an information criterion for regularization parameter selection in the normal linear regression problem with the group Lasso. Further work remains to be done on extending the derived criterion to the framework of penalized likelihood-based models from exponential families.
References

[1] Akaike, H. (1974), A new look at the statistical model identification, IEEE Transactions on Automatic Control, AC-19, 716-723.
[2] Konishi, S. and Kitagawa, G. (1996), Generalised information criteria in model selection, Biometrika, 83, 875-890.
[3] Yuan, M. and Lin, Y. (2006), Model selection and estimation in regression with grouped variables, Journal of the Royal Statistical Society, Series B, 68, 49-67.