
SCIENCE CHINA Mathematics

. ARTICLES .

January 2017, Vol. 60, No. 1, doi: 10.1007/s11425-016-0071-x

A Group Adaptive Elastic-Net Approach for Variable Selection in High-Dimensional Linear Regression

HU Jianhua1,4,*, HUANG Jian2 & QIU Feng1,3

1 School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China;
2 Department of Biostatistics, University of Iowa, Iowa City, Iowa 52242, USA;
3 Science College, Zhejiang Agriculture and Forestry University, Linan, Zhejiang 311300, China;
4 Key Laboratory of Mathematical Economics (SUFE), Ministry of Education, Shanghai 200433, China.
Email: [email protected], [email protected], [email protected]

Received September 7, 2016; accepted November 14, 2016; published online January 22, 2017

Abstract  In practice, predictors often possess a natural grouping structure, and incorporating such information can improve statistical modeling and inference. In addition, high dimensionality often leads to collinearity, which the elastic net is well suited to handle because it tends to produce a grouping effect. In this paper, we consider the problem of group selection and estimation in a sparse linear regression model in which the predictors are grouped. We investigate a group adaptive elastic-net and derive oracle inequalities and model selection consistency for the case where the number of groups is larger than the sample size. The oracle property is established for the case of a fixed group number. We adapt the locally approximated coordinate descent algorithm to carry out the computation. Simulation and real data studies indicate that the group adaptive elastic-net is an alternative and competitive method for model selection in high-dimensional problems when the number of groups is larger than the sample size.

Keywords  High-dimensional regression, Group variable selection, Group adaptive elastic-net, Oracle inequalities, Oracle property

MSC(2010)  62J05, 62J99

Citation: Hu J, Huang J, Qiu F. A group adaptive elastic-net approach for variable selection in high-dimensional linear regression. Sci China Math, 2017, 60, doi: 10.1007/s11425-016-0071-x

1 Introduction

In many applications of regression, some predictors can be grouped naturally. The most common example is the multi-factor analysis of variance problem, in which each factor with several levels can be expressed through a group of dummy variables. Predictor variables within a group usually have more similar characteristics and higher correlations. In practice, grouping structures can be determined by an analysis approach or by field knowledge. Incorporating such grouping information improves statistical modeling and inference. Consider the linear regression model
\[ y = X\beta + \varepsilon, \qquad (1.1) \]
where y = (y_1, . . . , y_n)^⊤ is the response vector, X = (x_1, . . . , x_p) is the predictor matrix, and β is the unknown regression coefficient vector. Assume that the predictors can be grouped without overlap and that the errors are independently and identically distributed with zero mean and finite variance σ².

*Corresponding author



Our interest is to select the important groups, namely those whose coefficients are all nonzero, when the number of parameters p or the number of groups J is much larger than the sample size n.


Many authors have considered the problem of group selection in high-dimensional statistical modeling. Yuan and Lin (2006) proposed the group lasso, based on an l2-norm penalty, as an extension of the lasso (Tibshirani 1996). Huang et al. (2009) considered the problem of simultaneous group and individual variable selection and proposed the group bridge, or sparse group bridge. For further work on bi-level selection, see Breheny and Huang (2009) and Jiang and Huang (2015). Friedman et al. (2010) and Simon et al. (2013) proposed the sparse group lasso, a bi-level selection method whose penalty blends the lasso with the group lasso. Huang et al. (2012) gave a selective review of group selection, including several l2-norm concave group selection methods such as the group SCAD and the group MCP (Zhang 2010). The elastic net, proposed by Zou and Hastie (2005), is known to produce a grouping effect, in which strongly correlated covariates tend to be in or out of the model together. Building on this advantage, Zou and Zhang (2009) proposed the adaptive elastic-net, which improves the lasso in two directions: the adaptive lasso part has the oracle property of the SCAD (Fan and Li 2001), and the elastic-net part alleviates collinearity.


For variable selection and estimation in high-dimensional settings where p is much larger than n, there have also been many theoretical developments for diverse methods. Candes and Tao (2005) considered the l1-minimization problem under the uniform uncertainty principle. Later, Candes and Tao (2007) introduced an estimator called the Dantzig selector and studied it under the uniform uncertainty principle when p is much larger than n. Bickel et al. (2009) gave oracle inequalities for the simultaneous analysis of the lasso and the Dantzig selector under a rather weak condition, the restricted eigenvalue assumption, in a sparsity scenario. Lounici et al. (2011) gave oracle inequalities under a restricted eigenvalue condition on the covariate matrix for models with group sparsity. However, no existing work establishes oracle inequalities for the adaptive elastic net, with or without a group effect.


Motivated by Lounici et al. (2011) and by the grouping effect, we consider oracle inequalities for the adaptive elastic net with or without a group effect and propose an alternative and competitive method for model selection in high-dimensional settings where p is much larger than n and a group effect is present. Our contributions include model selection consistency and oracle inequalities for the group adaptive elastic-net estimator when the number of predictor groups is larger than the sample size, and the oracle property of the corresponding estimator for a fixed-dimensional parameter space or a fixed group number. Our construction of a framework for the group adaptive elastic-net approach and our methods of proof are based on non-trivial extensions and generalizations of existing conditions and results for group variable selection procedures in the literature.

The rest of this paper is organized as follows. In Section 2, we present the group adaptive elastic-net methodology and its algorithm. In Section 3, we establish model selection consistency and oracle inequalities for the proposed estimator under mild regularity conditions, including the restricted eigenvalue assumption, when p or J is much larger than n. In Section 4, we study the oracle property of the group adaptive elastic-net estimator for a fixed-dimensional parameter space or a fixed group number. In Section 5, simulation studies for the case where the number of groups is larger than the sample size show that our method is competitive. In Section 6, a real dataset from a glioblastoma microarray gene expression study is used to illustrate the performance of the various methods. Brief concluding remarks are given in Section 7. Proofs of the main results are deferred to the Appendix.

2 Group adaptive elastic-net estimation

Assume that the classical linear regression model (1.1) holds with group sparsity; that is, the groups corresponding to predictors with nonzero coefficients are sparse, and the coefficients within a group are either all zero or all nonzero. Without loss of generality, the response vector y is centered and the predictor matrix X is standardized so that x_i^⊤x_i/n = 1 for every i. Once a model has been fitted, the estimates can be transformed back to the original scale.

2.1 Definition and estimation


We obtain the group adaptive elastic-net estimator by solving the optimization problem
\[ \hat\beta = (1+\lambda_2)\,\arg\min_{\beta}\Big\{\frac{1}{2n}\|y - X\beta\|_2^2 + \lambda_1\sum_{j=1}^{J}\hat\omega_j\|\beta_j\|_2 + \lambda_2\sum_{j=1}^{J}\|\beta_j\|_2^2\Big\}, \qquad (2.1) \]
where β_j is the coefficient vector of the jth group, ‖β_j‖_2 is the l2 norm of β_j, λ1 and λ2 are tuning parameters, and ω̂_j is an adaptive weight based on an initial estimate β̂^init, defined as
\[ \hat\omega_j = \begin{cases} k_j^{\delta}\,\|\hat\beta_j^{\mathrm{init}}\|_2^{-\gamma}, & \text{if } \|\hat\beta_j^{\mathrm{init}}\|_2 > 0,\\ n \text{ or } \infty, & \text{if } \|\hat\beta_j^{\mathrm{init}}\|_2 = 0, \end{cases} \qquad (2.2) \]
for δ > 0 and γ > 0, where k_j is the cardinality of the jth group. If all k_j = 1 and δ = 0, the estimator reduces to the adaptive elastic-net of Zou and Zhang (2009). As the numerical studies below show, the parameter δ has a great impact on the behavior of the BIC function used to select the tuning parameters and prevents large groups from overwhelming small ones. In this article we always set γ = 0.5 and specify δ through the simulation studies below.

2.2 Computation

For the optimization problem (2.1) defining the group adaptive elastic-net estimate, the Karush-Kuhn-Tucker conditions give
\[ -\frac{1}{n}X_j^{\top}(y - X\beta) + 2\lambda_2\beta_j + \lambda_1\hat\omega_j s_j = 0, \qquad (2.3) \]
where s_j is the sub-gradient, given by
\[ s_j = \begin{cases} \dfrac{\beta_j}{\|\beta_j\|_2}, & \text{if } \|\beta_j\|_2 \neq 0,\\ \text{any vector with } \|s_j\|_2 \le 1, & \text{if } \|\beta_j\|_2 = 0. \end{cases} \]

We use an iterative algorithm, the locally approximated coordinate descent (LCD) algorithm, to solve (2.3). For the details of the LCD algorithm and its convergence, we refer to Breheny and Huang (2009); an R package called grpreg for group selection has also been contributed by Breheny. In the sequel, β_{jk} denotes the kth element of the jth group of β, and x_{jk} denotes the corresponding column of X. For completeness, we give a revised LCD algorithm below.

1. Provide an initial estimate β^{(0)}.

2. Update β^{(m)} ← β^{(m−1)} as follows until convergence. For the jth group, compute s_j = X_j^⊤R_j/(nλ1ω̂_j), where R_j is the vector of partial residuals.

(a) If ‖s_j‖_2 ≤ 1, set β_j^{(m)} = 0; otherwise

(b) update the elements of β_j^{(m)} individually by
\[ \beta_{jk}^{(m)} = \frac{\frac{1}{n}x_{jk}^{\top}r + \beta_{jk}^{(m-1)}}{1 + 2\lambda_2 + \lambda_{jk}}, \]
where r is the vector of current residuals, λ_{jk} = λ1ω̂_j/‖β_j^{(m−1)}‖_2 and ω̂_j = k_j^{δ}‖β_j^{(m−1)}‖_2^{−γ}.

Iterate until ‖β^{(m)} − β^{(m−1)}‖_2 < ε for a given tolerance ε > 0.
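As an illustration of the algorithm above, the following R sketch implements the revised LCD iteration for a single pair (λ1, λ2). It is a minimal sketch under the stated standardization assumptions; the function name gaenet_lcd, its defaults, and the requirement of a user-supplied nonzero initial estimate are our own illustrative choices, not part of the grpreg package.

```r
## Minimal sketch of the revised LCD algorithm for the group adaptive elastic-net.
## Assumes y is centered and the columns of X are standardized; beta0 is a nonzero
## initial estimate (e.g. a ridge or group-lasso fit). Names are illustrative only.
gaenet_lcd <- function(X, y, group, lambda1, lambda2, beta0,
                       delta = 1, gamma = 0.5, max_iter = 500, eps = 1e-6) {
  n <- nrow(X)
  beta <- as.numeric(beta0)
  idx_list <- split(seq_len(ncol(X)), group)
  for (m in seq_len(max_iter)) {
    beta_old <- beta
    for (idx in idx_list) {
      kj <- length(idx)
      bj_norm <- sqrt(sum(beta[idx]^2))
      wj <- if (bj_norm > 0) kj^delta * bj_norm^(-gamma) else Inf   # adaptive weight (2.2)
      Rj <- y - X[, -idx, drop = FALSE] %*% beta[-idx]              # partial residuals
      sj <- crossprod(X[, idx, drop = FALSE], Rj) / (n * lambda1 * wj)
      if (sqrt(sum(sj^2)) <= 1) {
        beta[idx] <- 0                                              # step (a): drop the group
      } else {
        lambda_jk <- lambda1 * wj / bj_norm
        for (k in idx) {                                            # step (b): coordinate updates
          r <- y - X %*% beta                                       # current residuals
          beta[k] <- (sum(X[, k] * r) / n + beta[k]) / (1 + 2 * lambda2 + lambda_jk)
        }
      }
    }
    if (sqrt(sum((beta - beta_old)^2)) < eps) break                 # convergence check
  }
  (1 + lambda2) * beta                                              # rescaling factor in (2.1)
}
```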

In this paper, we use the following BIC-type criterion to select the tuning parameters:
\[ \mathrm{BIC}(\lambda_1, \lambda_2) = \log\Big(\frac{1}{n}\|y - X\hat\beta_n(\lambda_1, \lambda_2)\|_2^2\Big) + \frac{1}{n}\log(n)\,\mathrm{df}(\lambda_1, \lambda_2), \]
where df(λ1, λ2) denotes the number of nonzero coefficients of the estimator for given λ1 and λ2. To avoid the intensive computation of selecting λ1 and λ2 by a grid search, we use the same device as Breheny and Huang (2009) and set λ2 = 0.001λ1, which does not violate condition (3.6) in Theorem 3.2; the numerical studies confirm that this works well. In practical computations, for group selection in ultrahigh-dimensional regression models, we can first adopt a screening strategy. Specifically, DC-SIS, a sure independence screening procedure based on the distance correlation (Li et al. 2012), can be used to reduce the model dimension by screening predictors, after which the group adaptive elastic-net performs better and more efficiently.
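To illustrate the tuning-parameter selection, the sketch below evaluates the BIC-type criterion over a grid of λ1 with λ2 = 0.001λ1, reusing the hypothetical gaenet_lcd() fitter from the previous sketch; the function name and grid are our own choices.

```r
## Illustrative BIC-based selection of (lambda1, lambda2) with lambda2 = 0.001 * lambda1,
## using the hypothetical gaenet_lcd() sketched above.
select_by_bic <- function(X, y, group, lambda1_grid, beta0, ...) {
  n <- nrow(X)
  bic <- sapply(lambda1_grid, function(l1) {
    fit <- gaenet_lcd(X, y, group, lambda1 = l1, lambda2 = 0.001 * l1,
                      beta0 = beta0, ...)
    df <- sum(fit != 0)                                 # number of nonzero coefficients
    log(sum((y - X %*% fit)^2) / n) + log(n) * df / n   # BIC-type criterion above
  })
  best <- lambda1_grid[which.min(bic)]
  list(lambda1 = best, lambda2 = 0.001 * best, bic = bic)
}
```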

3 Oracle inequalities and model consistency

To explore the statistical properties of the group adaptive elastic-net estimator, we consider the objective function
\[ L_n(\beta) = \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda_1\sum_{j=1}^{J}\hat\omega_j\|\beta_j\|_2 + \lambda_2\sum_{j=1}^{J}\|\beta_j\|_2^2, \qquad (3.1) \]
where max_j ‖β_j‖_2 ≤ m for a positive constant m. Throughout this paper we use the following notation. Let J_0 = {j : ‖β_j‖_2 ≠ 0} be the set of nonzero groups of the true parameter β of the model, and let J_0^c be its complement. Set N_J = {1, . . . , J}, m = max_j ‖β_j‖_2, ∆ = β̂ − β, ∆_{J_0} = β̂_{J_0} − β_{J_0} and ∆_j = β̂_j − β_j. Let X_j be the n × k_j submatrix of X formed by the columns of X whose indices lie in the jth group. Define G = X^⊤X/n and G_j = X_j^⊤X_j/n, the Gram matrices of X and X_j, respectively. We use (·)_j to denote the subvector or submatrix of (·) formed by the rows whose indices lie in the jth group. Our theoretical analysis also requires the following regularity conditions throughout.

Assumption A. For some integer 1 ≤ s ≤ p, let
\[ B = \Big\{\Delta\in\mathbb{R}^p : \lambda_1\sum_{j\in J_0^c}\hat\omega_j\|\Delta_j\|_2 + 2\lambda_2\sum_{j\in J_0^c}\|\Delta_j\|_2^2 \le 3\lambda_1\sum_{j\in J_0}\hat\omega_j\|\Delta_j\|_2 + 6\lambda_2\sum_{j\in J_0}\|\Delta_j\|_2^2\Big\}. \]
Then the following conditions hold:
\[ \kappa = \min_{\Delta\in B,\ \|\Delta\|_2\neq 0}\frac{\|X\Delta\|_2}{\sqrt{n}\,\|\Delta\|_2} > 0, \qquad (3.2) \]
\[ \kappa(s) = \min_{|J_0|\le s}\ \min_{\Delta\in B,\ \|\Delta\|_2\neq 0}\frac{\|X\Delta\|_2}{\sqrt{n}\,\|\Delta_{J_0}\|_2} > 0. \qquad (3.3) \]

Clearly κ ≤ κ(s). Assumption A is the restricted eigenvalue assumption, which resembles Assumption 3.1 in Lounici et al. (2011); it extends the restricted eigenvalue assumption for group sparsity of Lounici et al. (2011) to our setting. In particular, setting λ2 = 0 and all ω̂_j = 1, κ(s) is nothing but κ(s, 3) in Bickel et al. (2009).

Assumption B. The errors ε follow a normal N(0, σ²I) distribution.


Assumption C. The number of predictor groups J tends to infinity as the sample size n tends to infinity at such a rate that
\[ \lim_{n\to\infty}\frac{\log(J)}{n} = 0. \]

Assumption C implies that p or J may diverge with the sample size n. It also allows p or J to be much larger than n and determines the admissible order of J, namely J = O(e^{n^{α}}) with α < 1.

Lemma 3.1. On the event A_j = {2‖(X^⊤ε)_j‖_2/n ≤ λ1ω̂_j − 4λ2m}, we have
\[ \frac{1}{2n}\|X\Delta\|_2^2 + \frac{\lambda_1}{2}\sum_{j=1}^{J}\hat\omega_j\|\Delta_j\|_2 + \lambda_2\sum_{j=1}^{J}\|\Delta_j\|_2^2 \le 2\lambda_1\sum_{j\in J_0}\hat\omega_j\|\Delta_j\|_2 + 4\lambda_2\sum_{j\in J_0}\|\Delta_j\|_2^2. \qquad (3.4) \]
In particular,
\[ \lambda_1\sum_{j\in J_0^c}\hat\omega_j\|\Delta_j\|_2 + 2\lambda_2\sum_{j\in J_0^c}\|\Delta_j\|_2^2 \le 3\lambda_1\sum_{j\in J_0}\hat\omega_j\|\Delta_j\|_2 + 6\lambda_2\sum_{j\in J_0}\|\Delta_j\|_2^2. \qquad (3.5) \]

Proof. See Appendix.

The inequality (3.4) is called the basic inequality, where ω̂_j is the adaptive weight. For the group adaptive elastic-net estimator β̂, we have the following theorem.

Theorem 3.2. For every j ∈ N_J, assume that
\[ \lambda_1\hat\omega_j - 4\lambda_2 m \ge \frac{2\sigma}{\sqrt{n}}\sqrt{\operatorname{tr}(G_j) + 2\rho\log(J)\,|||G_j||| + 2\sqrt{\rho\log(J)\,\|G_j\|_F^2 + |||G_j|||^2\,\rho^2\log^2(J)}} \qquad (3.6) \]
with ρ > 1, where tr(G_j), ‖G_j‖_F and |||G_j||| denote the trace, Frobenius norm and spectral norm of G_j, respectively. Then, under Assumptions A and B, with probability at least 1 − 2J^{1−ρ}, we have
\[ \|X\Delta\|_2^2 \le \frac{16 n\lambda_1^2\kappa^2(s)}{(\kappa^2(s) - 8\lambda_2)^2}\sum_{j\in J_0}\hat\omega_j^2, \qquad (3.7) \]
\[ \|\Delta_{J_0}\|_2 \le \frac{4\lambda_1}{\kappa^2(s) - 8\lambda_2}\sqrt{\sum_{j\in J_0}\hat\omega_j^2}, \qquad (3.8) \]
\[ \|\Delta\|_2 \le \frac{4\lambda_1\kappa(s)}{\kappa\,(\kappa^2(s) - 8\lambda_2)}\sqrt{\sum_{j\in J_0}\hat\omega_j^2}, \quad\text{and} \qquad (3.9) \]
\[ N(\hat\beta) \le \frac{64\kappa^2(s)}{\hat\omega_{\min}^2(\kappa^2(s) - 8\lambda_2)^2}\Big(\sqrt{\phi_{\max}} + \frac{2\lambda_2}{\kappa}\Big)^2\sum_{j\in J_0}\hat\omega_j^2, \qquad (3.10) \]
where N(β̂) is the cardinality of the set J(β̂) = {j : ‖β̂_j‖_2 ≠ 0, j ∈ N_J}, ω̂_min = min{ω̂_1, . . . , ω̂_J} and φ_max is the largest eigenvalue of the Gram matrix G.

Proof. See Appendix.

The inequalities (3.7)-(3.10) in Theorem 3.2 are called oracle inequalities. From condition (3.6), the orders of λ1 and λ2 are
\[ \lambda_1 = O\big(n^{-1/2}\{\log(J)\}^{1/2}\big) \quad\text{and}\quad \lambda_2 = O\big(n^{-1/2}\{\log(J)\}^{1/2}\big). \]
Here, p and J may grow with n, even with p ≫ n or J > n. Introducing the ridge penalty into the objective function (3.1) increases the error bound of the estimator, as can be seen from Theorem 3.2. For the purpose of variable selection, λ2 is a small tuning parameter satisfying λ2 < κ²(s)/8 and tending to zero faster than λ1, as can be seen from (3.6).
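The quantities appearing in condition (3.6) are all computable from the data. The sketch below evaluates, for each group, the right-hand side of (3.6), which lower-bounds λ1ω̂_j − 4λ2m; here σ is taken as known or replaced by an estimate, and the function name is ours.

```r
## Sketch: right-hand side of condition (3.6) for each group, computed from the
## group Gram matrices G_j = t(X_j) %*% X_j / n (sigma assumed known or estimated).
rhs_condition_3_6 <- function(X, group, sigma, rho = 1.01) {
  n <- nrow(X); J <- length(unique(group)); logJ <- log(J)
  sapply(split(seq_len(ncol(X)), group), function(idx) {
    Gj   <- crossprod(X[, idx, drop = FALSE]) / n
    tr   <- sum(diag(Gj))
    fro  <- norm(Gj, type = "F")          # Frobenius norm
    spec <- norm(Gj, type = "2")          # spectral norm
    inner <- tr + 2 * rho * logJ * spec +
      2 * sqrt(rho * logJ * fro^2 + spec^2 * rho^2 * logJ^2)
    2 * sigma / sqrt(n) * sqrt(inner)
  })
}
```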


When λ2 = 0, condition (3.6) reduces to
\[ \lambda_1\hat\omega_j \ge \frac{2\sigma}{\sqrt{n}}\sqrt{\operatorname{tr}(G_j) + 2\rho\log(J)\,|||G_j||| + 2\sqrt{\rho\log(J)\,\|G_j\|_F^2 + |||G_j|||^2\,\rho^2\log^2(J)}}, \]
from which inequality (3.1) of Lounici et al. (2011) can be derived after suitable adjustments. Theorem 3.2 is thus an extension of Theorem 3.1 in Lounici et al. (2011). If we set λ2 = 0 and let every group consist of a single element, the group adaptive elastic-net degenerates to an adaptive lasso problem. For the adaptive lasso estimator β̂_alasso, we obtain the following corollary immediately.

Corollary 3.3. For every j ∈ N_p, assume that
\[ \lambda_1\hat\omega_j \ge \frac{2\sigma}{\sqrt{n}}\sqrt{\frac{X_j^{\top}X_j}{n}\Big(1 + 2\rho\log(p) + 2\sqrt{\rho\log(p) + \rho^2\log^2(p)}\Big)} \]
with ρ > 1. Then, under Assumptions A and B, with probability at least 1 − 2p^{1−ρ}, we have
\[ \|X\Delta\|_2^2 \le \frac{16 n\lambda_1^2}{\kappa^2(s)}\sum_{j\in J_0}\hat\omega_j^2, \qquad \|\Delta_{J_0}\|_2 \le \frac{4\lambda_1}{\kappa^2(s)}\sqrt{\sum_{j\in J_0}\hat\omega_j^2}, \]
\[ \|\Delta\|_2 \le \frac{4\lambda_1}{\kappa\,\kappa(s)}\sqrt{\sum_{j\in J_0}\hat\omega_j^2}, \qquad\text{and}\qquad N(\hat\beta) \le \frac{64\phi_{\max}}{\hat\omega_{\min}^2\kappa^2(s)}\sum_{j\in J_0}\hat\omega_j^2, \]
where N(β̂) is the cardinality of the set J(β̂) = {j : β̂_j ≠ 0, j ∈ N_p}, ω̂_min = min{ω̂_1, . . . , ω̂_p} and φ_max is the largest eigenvalue of the Gram matrix G.

Furthermore, setting all ω̂_j = 1, the group adaptive elastic-net degenerates to a lasso problem, and the result of Corollary 3.3 is nothing but Theorem 7.2 in Bickel et al. (2009).

For the group adaptive elastic-net estimator β̂, we have the following main results.

Theorem 3.4. Suppose that {J_0, β_{J_0}} are fixed but unknown. Choose the tuning parameters λ1 and λ2 according to Theorem 3.2. Then, under Assumptions A-C, we have
(i) consistency in selection: Pr(β̂_{J_0^c} = 0) → 1 as n, J → ∞;
(ii) rate of convergence: E‖β̂ − β‖_2^2 = O(n^{-1}\log(J)).

Proof. See Appendix.

It follows from the rate of convergence that ∑_{j∈J_0} ω̂_j^2 is bounded by a constant for a sufficiently large sample size.

4 Oracle property for a fixed group number

In this section we investigate the model selection consistency and the asymptotic distribution of the group adaptive elastic-net estimator β̂ when the sample size n increases while the number of parameter groups J remains fixed. To distinguish it from (2.1), the group adaptive elastic-net estimator is written as
\[ \hat\beta = (1+\lambda_4)\,\arg\min_{\beta}\Big\{\frac{1}{2n}\|y - X\beta\|_2^2 + \lambda_3\sum_{j=1}^{J}\hat\omega_j\|\beta_j\|_2 + \lambda_4\sum_{j=1}^{J}\|\beta_j\|_2^2\Big\}, \qquad (4.1) \]


where the notation is the same as in (2.1) except that the group number J is fixed.

Assumption D. Assume that lim_{n→∞} n^{-1}X_{J_0}^⊤X_{J_0} = Ω, where Ω is a positive definite constant matrix, and that lim_{n→∞} n^{-1}X_{J_0}^⊤X_{J_0^c} and lim_{n→∞} n^{-1}X_{J_0^c}^⊤X_{J_0^c} are bounded.

Assumption D requires the covariate matrix to behave reasonably well. If the row vectors of the design submatrix X_{J_0} are an i.i.d. sample from a distribution with covariance Ω, then n^{-1}X_{J_0}^⊤X_{J_0} converges almost surely to Ω by Kolmogorov's strong law of large numbers.

Assumption E. Suppose J is fixed. The tuning parameters λ3 and λ4 depend on the sample size n at the following rates:
\[ \lim_{n\to\infty}\sqrt{n}\,\lambda_3 = 0, \qquad \lim_{n\to\infty} n^{\frac{1+\gamma}{2}}\lambda_3 = \infty \ \text{for some } \gamma > 0, \qquad \lim_{n\to\infty}\sqrt{n}\,\lambda_4 = 0. \]

Assumption E determines the rates at which the tuning parameters λ3 and λ4 tend to zero, namely λ3 = O(n^{-(1+γ_1)/2}) with 0 < γ_1 < γ and λ4 = o(n^{-1/2}). Based on the above assumptions, we have the following properties of the group adaptive elastic-net estimator.

Theorem 4.1. Suppose that J is fixed and {J_0, β_{J_0}} are fixed but unknown. Then, under Assumptions D and E, we have
(i) consistency in selection: Pr(β̂_{J_0^c} = 0) → 1 as n → ∞, and
(ii) asymptotic normality: √n(β̂_{J_0} − β_{J_0}) converges in distribution to N(0, σ²Ω^{-1}) as n → ∞.

Proof. See Appendix.

Theorem 4.1 is usually referred to as the oracle property. The techniques of Knight and Fu (2000) are used in its proof. For the adaptive lasso estimator β̂_alasso, the following Corollary 4.2 is immediate.

Corollary 4.2. Suppose that p is fixed and {J_0, β_{J_0}} are fixed but unknown. Under Assumptions D and E, we have
(i) consistency in selection: Pr(β̂_{alasso,J_0^c} = 0) → 1 as n → ∞, and
(ii) asymptotic normality: √n(β̂_{alasso,J_0} − β_{J_0}) converges in distribution to N(0, σ²Ω^{-1}) as n → ∞.

Corollary 4.2 is nothing but the result stated in Theorem 2 of Zou (2006).

5 Numerical studies

In this section, we use simulation studies to illustrate the finite-sample performance of the group adaptive elastic-net estimator β̂. Data are generated from the linear regression model
\[ y_i = x_{i1}^{\top}\beta_1 + \cdots + x_{i,257}^{\top}\beta_{257} + \varepsilon_i, \qquad \varepsilon_i \overset{\mathrm{iid}}{\sim} N(0, \sigma^2), \]
where the parameter vector β_j represents the coefficients of the jth group, j = 1, . . . , 257. These coefficient vectors β_1, . . . , β_257 contain 2000 individual unknown coefficients in total, say β_1, . . . , β_2000.


Their relations are as follows: β_1 = (β_1, . . . , β_6)^⊤, β_2 = (β_7, . . . , β_16)^⊤, β_3 = (β_17, . . . , β_22)^⊤, β_4 = (β_23, . . . , β_32)^⊤, β_5 = β_33, . . . , β_12 = β_40, while β_13 = β_14 = · · · = β_257 = 0, each of which contains eight zero components. To be general, we set the unknown coefficients as follows: β_1, . . . , β_6 are generated independently from a normal N(2, 0.5) distribution, β_7, . . . , β_16 independently from N(1, 1), β_17, . . . , β_22 independently from N(3, 1), β_23, . . . , β_32 independently from N(2, 2), and β_33, . . . , β_40 independently from a uniform U(−4, 4) distribution. Since the coefficients are generated from continuous distributions, Pr(β_i ≠ 0, i ∈ {1, . . . , 40}) = 1. So in our setting there are 12 nonzero groups associated with 40 nonzero individual parameters and 245 zero groups associated with 1960 zero individual parameters.

Setting p = 2000, the predictor matrix X is partitioned as (X_1, . . . , X_5)_{n×2000}, where X_1, . . . , X_5 are n×6, n×10, n×6, n×10 and n×1968 sub-matrices, respectively. The sub-matrices X_1, . . . , X_4 are generated independently from multivariate normal distributions with zero means and covariance matrices Σ_1 of order 6, Σ_2 of order 10, Σ_3 of order 6 and Σ_4 of order 10, whose diagonal entries are 1 and whose off-diagonal entries are 0.8, 0.8^{|i−j|}, 0.6 and 0.6^{|i−j|}, respectively. The entries of X_5 are generated as i.i.d. N(0, 1) random variables.

Firstly, we illustrate the instability of model selection via BIC for the group adaptive elastic-net in the high-dimensional case as the parameter τ in (2.2) varies. Set n = 100 and σ = 8, and generate data as described above. Typical plots of the BIC function against λ1 from 0.08 to 1 for the group adaptive elastic-net estimate are shown in Figure 1. We use RTOv to denote the ratio of correctly selected to all selected nonzero components. The corresponding results are given in Table 1. We also examine the importance of the parameter τ for the group adaptive lasso by further inspecting the BIC curves for λ1 from 0.1 to 0.5; a code sketch of the simulation design described above is given below.
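The following R sketch generates data from the simulation design described above, using base R only. It assumes the second parameter of each normal distribution is a variance; that reading, the helper names, and the fixed seed are our own choices.

```r
## Sketch of the simulation design above (base R only); second normal parameters
## are read as variances, which is an assumption.
set.seed(1)
n <- 100; sigma <- 8
ar_cov   <- function(p, r) r^abs(outer(1:p, 1:p, "-"))          # 0.8^|i-j| or 0.6^|i-j|
exch_cov <- function(p, r) { S <- matrix(r, p, p); diag(S) <- 1; S }
rmvnorm0 <- function(n, Sigma) matrix(rnorm(n * ncol(Sigma)), n) %*% chol(Sigma)

X <- cbind(rmvnorm0(n, exch_cov(6, 0.8)),     # X1: constant correlation 0.8
           rmvnorm0(n, ar_cov(10, 0.8)),      # X2: correlation 0.8^|i-j|
           rmvnorm0(n, exch_cov(6, 0.6)),     # X3: constant correlation 0.6
           rmvnorm0(n, ar_cov(10, 0.6)),      # X4: correlation 0.6^|i-j|
           matrix(rnorm(n * 1968), n))        # X5: independent N(0, 1) entries

beta <- c(rnorm(6, 2, sqrt(0.5)), rnorm(10, 1, 1), rnorm(6, 3, 1),
          rnorm(10, 2, sqrt(2)), runif(8, -4, 4), rep(0, 1960))
group <- c(rep(1, 6), rep(2, 10), rep(3, 6), rep(4, 10), 5:12,
           rep(13:257, each = 8))             # 12 nonzero groups, 245 zero groups of size 8
y <- drop(X %*% beta) + rnorm(n, 0, sigma)    # center y and standardize X before fitting
```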

0

0.5

0.8

1

1.2

1.6

RTOv

32/120

33/121

35/35

35/35

36/36

29/29

MSE

66.19

60.71

13.04

13.79

13.76

49.20

RTOv

32/128

34/98

34/96

35/35

35/35

34/34

MSE

59.58

41.06

40.16

19.33

23.64

30.14

Ac

GALasso

τ

From Table 1, Figure 1 and 2, we can get some information:

(1) When τ is zero or a small positive number, the curve of BIC function is almost monotonically increasing with λ1 . When τ is close to 1, the curve is almost a convex. But when τ is larger than 1 and far from it, the curve seems to be monotone. (2) When τ is zero or a small positive number, the performance of MSE and ratio of correct selection are poor. When τ is about 1, the performance can be improved significantly. But when τ is larger than 1 and far from it, the performance becomes poor again. (3) It seems that τ has great impacts on both MSE and the behavior of the BIC function.

Next, we shall compare the developed group adaptive elastic-net (GAEnet) with other popular selection methods: adaptive lasso (aLasso), adaptive elastic-net (AEnet), group lasso (grLasso), group adaptive lasso (GALasso), group SCAD (grSCAD), group MCP (grMCP) and group bridge (gBridge). Two points are mentioned that (i) Adaptive lasso and adaptive elastic-net are variable selection methods but not group selection ones, and (ii) group bridge is a penalized method for bi-level selection and differs from other group select methods. We use BIC criterion in selecting the tuning parameters. Downloaded to IP: 61.165.232.226 On: 2017-02-16 01:20:46 http://engine.scichina.com/doi/10.1007/s11425-016-0071-x

Sci China Math

January 2017

Vol. 60

9

No. 1

0.4

0.6 λ1

1.0

0.2

0.4

0.6

0.8

1.0

λ1

(b)

6.2

BIC

ce

5.8

5.8

6.0

6.0

6.2

BIC

6.4

6.4

6.6

6.6

(a)

0.8

pte

0.2

d

6.0

BIC 3.5

5.4

5.6

4.0

5.8

4.5

BIC

5.0

6.2

5.5

6.4

6.6

6.0

Hu J et al.

0.2

0.4

0.6

0.8

1.0

0.2

0.4

λ1

(c)

0.8

1.0

0.6

0.8

1.0

(d)

6.2

BIC

6.2 5.8

6.0

6.0

Ac

BIC

6.4

6.4

6.6

6.6

0.6 λ1

0.2

0.4

0.6

λ1

(e)

0.8

1.0

0.2

0.4 λ1

(f)

Figure 1. Plots of BIC function of λ1 from 0.08 to 1 for group adaptive elastic-net estimate with (a) τ = 0, (b) τ = 0.5, (c) τ = 0.8, (d) τ = 1, (e) τ = 1.2 and (f) τ = 1.6, where τ is a tuning parameter in the adaptive weight.

The mean of real signal-to-noise ratio (SNR) takes two values, approximately 4 and 9, by setting the standard deviation σ to 8 and 12, respectively. We set τ close to 1 to alleviate the instability of model selection via BIC. Here, we define MSE and SNR as follows: MSE =

kXβk22 1 . kX(βˆ − β)k22 and SNR = n kεk22

The performance of model selection is measured by (C, IC, PRTOv , PRTOg ), where C is the number of zero coefficients or groups that are correctly estimated by zero, IC is the number of nonzero coefficients Downloaded to IP: 61.165.232.226 On: 2017-02-16 01:20:46 http://engine.scichina.com/doi/10.1007/s11425-016-0071-x

10

Sci China Math

January 2017

Vol. 60

No. 1

0.2

0.3 λ1

0.5

0.1

0.2

0.3

0.4

0.5

λ1

(b)

BIC

ce

5.4

6.0

5.6

6.1

5.8

6.2

BIC

6.3

6.0

6.4

6.2

6.5

6.6

(a)

0.4

pte

0.1

d

BIC 5.0

4.5

5.2

5.0

5.4

BIC

5.6

5.5

5.8

6.0

6.0

Hu J et al.

0.1

0.2

0.3

0.4

0.5

0.1

0.2

λ1

0.3

0.4

0.5

0.4

0.5

λ1

(d)

BIC

6.0

5.8 5.5

5.8

5.6

5.9

5.7

Ac

BIC

6.1

5.9

6.2

6.0

6.3

(c)

0.1

0.2

0.3 λ1

(e)

0.4

0.5

0.1

0.2

0.3 λ1

Figure 2  Plots of the BIC function of λ1 from 0.1 to 0.5 for the group adaptive lasso estimate with (a) τ = 0, (b) τ = 0.5, (c) τ = 0.8, (d) τ = 1, (e) τ = 1.2 and (f) τ = 1.6, where τ is a tuning parameter in the adaptive weight.

R packages gcdnet and grpreg are used for the various methods other than the group adaptive lasso and the group adaptive elastic-net in the simulation. The results are reported in Table 2.
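For completeness, a sketch of how the selection metrics and the MSE could be computed for a fitted coefficient vector is given below; the function and variable names are ours.

```r
## Sketch: selection metrics for an estimated coefficient vector beta_hat, given the
## true beta and the group labels (names are illustrative).
selection_metrics <- function(beta_hat, beta, group, X) {
  nz_true_v <- beta != 0;           nz_est_v <- beta_hat != 0
  nz_true_g <- tapply(beta != 0, group, any)
  nz_est_g  <- tapply(beta_hat != 0, group, any)
  c(C_v   = sum(!nz_true_v & !nz_est_v),            # zero coefficients kept at zero
    IC_v  = sum(nz_true_v & !nz_est_v),             # nonzero coefficients set to zero
    C_g   = sum(!nz_true_g & !nz_est_g),            # zero groups kept at zero
    IC_g  = sum(nz_true_g & !nz_est_g),             # nonzero groups set to zero
    PRTOv = 100 * sum(nz_true_v & nz_est_v) / max(sum(nz_est_v), 1),
    PRTOg = 100 * sum(nz_true_g & nz_est_g) / max(sum(nz_est_g), 1),
    MSE   = sum((X %*% (beta_hat - beta))^2) / nrow(X))
}
```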


Table 2 Results of model selection based on 500 replications Variables MSE

Oracle

C

IC

PRTOv

C

IC

PRTOg

1960

0

100

245

0

100

n = 100, p = 2000, σ = 8

d

Method

Groups

54.88(12.70)

1613.56

9.85

10.22

-

-

-

AEnet

61.45(14.88)

1816.25

12.38

17.91

-

-

-

grLasso

96.61(34.21)

1899.01

7.69

34.63

237.38

4.69

48.94

grSCAD

34.37(11.70)

1916.19

4.83

44.53

239.52

3.55

60.68

grMCP

38.45(12.29)

1934.85

5.46

57.86

241.86

3.81

72.25

gBridge

36.79(8.43)

1949.50

14.82

70.57

242.17

6.31

66.78

GALasso

30.47(14.56)

1956.32

6.91

91.10

244.79

4.89

94.63

GAEnet

28.18(10.53)

1956.82

6.22

91.39

244.60

4.56

94.92

104.88(16.84)

1544.46

9.66

8.77

-

-

-

pte

aLasso

n = 100, p = 2000, σ = 12 aLasso

93.39(15.59)

1751.80

12.52

13.35

-

-

-

165.61(48.53)

1911.73

11.06

37.48

238.97

5.70

51.06

grSCAD

93.44(37.63)

1905.06

9.52

35.68

238.13

5.14

49.98

grMCP

116.88(27.71)

1923.28

10.12

44.86

240.41

5.38

59.05

gBridge

87.47(20.05)

1933.57

17.64

45.83

238.76

6.92

44.86

GALasso

63.77(27.76)

1947.36

7.68

77.62

243.42

4.98

84.44

GAEnet

58.34(26.37)

1951.60

8.26

79.07

243.95

4.93

87.07

ce

AEnet grLasso

n = 200, p = 2000, σ = 8 53.04(10.14)

1680.85

6.26

12.99

-

-

-

AEnet

64.10(14.82)

1844.48

7.91

23.50

-

-

-

grLasso

39.15(8.27)

1883.33

2.44

32.88

235.42

2.32

50.25

grSCAD

16.26(4.28)

1936.40

2.21

61.56

242.05

2.16

76.94

grMCP

15.22(3.57)

1950.82

2.42

80.36

243.85

2.33

89.39

gBridge

22.02(6.42)

1955.09

10.17

85.88

243.44

4.72

82.31

GALasso

14.67(5.70)

1959.76

3.66

99.45

244.97

2.94

99.66

GAEnet

14.64(6.12)

1959.92

3.78

99.78

244.99

3.18

99.89

Ac

aLasso

n = 200, p = 2000, σ = 12 aLasso

83.39(13.82)

1649.86

6.95

12.29

-

-

-

AEnet

78.43(11.75)

1757.89

7.84

14.93

-

-

-

grLasso

81.05(18.10)

1900.70

4.06

37.74

237.59

3.36

53.82

grSCAD

43.78(13.77)

1926.72

3.84

52.08

240.84

3.21

67.88

grMCP

38.98(11.53)

1950.77

4.39

79.41

243.85

3.45

88.11

gBridge

52.34(12.65)

1938.59

13.53

55.29

238.96

5.82

50.59

GALasso

27.84(12.51)

1959.84

5.25

99.74

244.96

3.52

99.84

GAEnet

27.67(11.93)

1959.92

5.26

99.77

244.99

3.54

99.88

Downloaded to IP: 61.165.232.226 On: 2017-02-16 01:20:46 http://engine.scichina.com/doi/10.1007/s11425-016-0071-x

12

Hu J et al.

Sci China Math

January 2017

Vol. 60

No. 1

We summarize the results of the simulation studies as follows:

(a) Since the information on group structure is not incorporated, the variable selection methods are inferior to the group selection methods. The adaptive elastic-net is competitive with the adaptive lasso in terms of the percentage of correct selection.

(b) The group adaptive elastic-net performs better than the other methods in terms of MSE. When the sample size is larger (n = 200), the MSEs of group SCAD and group MCP are smaller than that of group bridge, while for the smaller sample size (n = 100) group bridge performs better than group SCAD and group MCP.

(c) The developed group adaptive elastic-net estimator outperforms group SCAD and group MCP in identifying the zero variables and groups, while it is slightly inferior to them in selecting the nonzero variables and groups. Group bridge and group lasso perform worse than the other group selection methods.

(d) The group adaptive elastic-net is superior to the other group selection methods in terms of (MSE, PRTOv, PRTOg).

(e) The group adaptive elastic-net improves the MSE and PRTO of the group adaptive lasso by introducing the ridge penalty. This improvement is not surprising, although the ridge penalty does not appear to be crucial.

6 A glioblastoma microarray gene expression study


In this section, we analyze the real data from the glioblastoma microarray gene expression study of Horvath et al. (2006) using the proposed group adaptive elastic-net, in comparison with the usual existing group selection approaches: group lasso, group SCAD, group MCP and group bridge. Glioblastoma is a common malignant brain tumor of adults and one of the most lethal of all cancers; patients with this disease have a median survival of 15 months from the time of diagnosis despite therapies. Expression data of 3600 genes from two independent sets of 120 clinical tumor samples (55 in dataset I, 65 in dataset II) are available for analysis. In our analysis, we exclude nine censored samples (5 in dataset I, 4 in dataset II) and use the logarithm of survival time as the response vector. The two datasets are used as the training set (n = 50) and the test set (n = 61), respectively. We first choose the 800 genes with the smallest p-values by screening on the training set. Then we use the following data-driven strategy: (a) fix the number of groups, e.g. 50, 100, 150 and 200, and determine the group membership by hierarchical clustering; (b) for a given grouping structure, fit GAEnet on the training set; (c) choose the number of groups by comparing the mean prediction errors on the test set. In this way the 800 genes are partitioned into 100 groups. Based on these 800 genes divided into 100 groups, we fit a linear regression model by the group adaptive elastic-net method and select 15 groups consisting of 54 genes. We then use the fitted model to predict the logarithm of survival time for the samples in the test set. We also model the training set using the other group selection approaches and evaluate the models on the test set by mean prediction error. In the analysis of the real data we again use the BIC criterion to select the tuning parameters. The results are reported in Table 3, where the mean prediction error (MPE) is defined as MPE = ‖y − Xβ̂‖₂²/n.
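The grouping strategy (a)-(c) can be sketched in a few lines of base R; the dissimilarity 1 − |correlation|, the crude marginal initial estimate, and the reuse of the illustrative gaenet_lcd()/select_by_bic() helpers from Section 2 are all our own assumptions rather than the authors' exact implementation.

```r
## Sketch of the data-driven grouping strategy (a)-(c) for the 800 screened genes.
## expr_train (50 x 800), expr_test (61 x 800), y_train, y_test are assumed given.
hc <- hclust(as.dist(1 - abs(cor(expr_train))))       # cluster genes by absolute correlation
b0 <- drop(crossprod(expr_train, y_train)) / nrow(expr_train)   # crude initial estimate
mpe <- sapply(c(50, 100, 150, 200), function(J) {
  group <- cutree(hc, k = J)                          # (a) group membership for J groups
  sel <- select_by_bic(expr_train, y_train, group,    # (b) GAEnet fit on the training set
                       lambda1_grid = seq(0.05, 1, by = 0.05), beta0 = b0)
  fit <- gaenet_lcd(expr_train, y_train, group, sel$lambda1, sel$lambda2, beta0 = b0)
  mean((y_test - expr_test %*% fit)^2)                # (c) mean prediction error on the test set
})
J_best <- c(50, 100, 150, 200)[which.min(mpe)]        # number of groups with the smallest MPE
```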


Table 3 presents the results of the analysis of the glioblastoma dataset and shows considerable differences among the methods. The results of the group adaptive elastic-net are no worse than those of the group adaptive lasso. Compared with the group adaptive elastic-net, group lasso selects fewer genes with only a slight increase in mean prediction error on the test set; group SCAD and group MCP select 18 and 14 genes, respectively, with a noticeable rise in mean prediction error; and group bridge selects 61 genes with poor performance on both the training and the test set. In brief, the developed group adaptive elastic-net is competitive among the group selection methods. To our surprise, group lasso is stable and performs well on the real data despite its slightly worse performance in the numerical studies. The performance of group SCAD and group MCP is very close. Group bridge, an approach to bi-level selection, shows no advantage in the analysis of this real dataset.

Table 3  Analysis of the glioblastoma dataset

Method     No. of selected groups   No. of selected genes   MPE (training set)   MPE (test set)
GAEnet     15                       54                      0.0063               1.0605
GALasso    15                       54                      0.0063               1.0616
grLasso    9                        31                      0.3824               1.0776
grSCAD     6                        18                      0.3981               1.3088
grMCP      4                        14                      0.4103               1.3731
gBridge    27                       61                      1.8398               1.6587

7 Concluding remarks


In this paper, we develop the methodology of the group adaptive elastic-net for the case where the number of predictors or the number of groups is much larger than the sample size. We establish oracle inequalities and model selection consistency under appropriate assumptions and derive the oracle property for the case of a fixed group number. We also provide a slightly revised LCD algorithm with a modified adaptive weight, which is shown to be effective. Simulation and real data studies indicate that the developed group adaptive elastic-net is a competitive method for group selection in high-dimensional linear regression models. However, we are not able to prove the asymptotic normality of the group adaptive elastic-net estimator β̂ when the group number J is (much) larger than the sample size n. Whether the group adaptive elastic-net estimator, including the group adaptive lasso, has asymptotic normality when the group number is larger than the sample size remains an open problem.

Acknowledgements  The authors sincerely thank the Editor-in-Chief, an Associate Editor and the anonymous referees for their constructive comments, which led to the current improved version. Hu's research is supported by NSFC (Grant No. 11571219) and the Open Research Fund Program (No. 201309KF02) of the Key Laboratory of Mathematical Economics (SUFE), Ministry of Education. The work is also partially supported by the Program (No. IRT13077) for Changjiang Scholars and Innovative Research Team in University.

Appendix

Lemma A.1. For any j ∈ N_J, we have the following inequality:
\[ \|\beta_j\|_2^2 - \|\hat\beta_j\|_2^2 \le 3\|\beta_j - \hat\beta_j\|_2^2 + 2m\|\beta_j - \hat\beta_j\|_2. \qquad (A.1) \]

Proof. Using the triangle inequality of norm, we obtain kβj k22 − kβˆj k22 = (kβj k2 − kβˆj k2 )(kβj k2 + kβˆj k2 ) 6 kβj − βˆj k2 (kβj − βˆj + βˆj k2 + kβˆj k2 )

6 kβj − βˆj k2 (kβj − βˆj k2 + 2kβˆj k2 ) = kβj − βˆj k2 (kβj − βˆj k2 + 2kβˆj k2 − 2kβj k) + 2kβj k2 kβj − βˆj k2

Downloaded to IP: 61.165.232.226 On: 2017-02-16 01:20:46 http://engine.scichina.com/doi/10.1007/s11425-016-0071-x

14

Hu J et al.

Sci China Math

January 2017

Vol. 60

No. 1

6 3kβj − βˆj k22 + 2mkβj − βˆj k2 , completing the proof of (A.1). Proof of Lemma 3.1. By the definition (2.1), we obtain

Since ε⊤ X∆ 6

PJ

j=1

d

J J X X   1 1 ⊤ 2 2 2 ˆ ω ˆ j kβj k2 − kβˆj k2 . kβj k2 − kβj k2 + λ1 kX∆k2 6 ε X∆ + λ2 2n n j=1 j=1

 k X ⊤ ε j k2 k∆j k2 , to the event Aj , we have

6

pte

J J X 1 λ1 X k∆j k22 ω ˆ j k∆j k2 + λ2 kX∆k22 + 2n 2 j=1 j=1

J J J J X X  1X λ1 X kβj k22 − kβˆj k22 k∆j k22 + λ2 (λ1 ω ˆ j − 4λ2 m)k∆j k2 + ω ˆ j k∆j k2 + λ2 2 j=1 2 j=1 j=1 j=1

+ λ1

J X j=1

6 λ2

J X j=1

6 λ2



(k∆j k22 + kβj k22 − kβˆj k22 ) + λ1

X

(4k∆j k22 + 2mk∆j k2 ) + 2λ1

J X j=1

X

j∈J0

ω ˆ j (k∆j k2 + kβj k2 − kβˆj k2 ) − 2λ2 m

ω ˆ j k∆j k2 − 2λ2 m

ce

j∈J0

6 4λ2

ω ˆ j kβj k2 − kβˆj k2

X

j∈J0

k∆j k22 + 2λ1

X

j∈J0

J X j=1

J X j=1

k∆j k2

k∆j k2

ω ˆ j k∆j k2 ,

where the equation (A.1) in Lemma A.1 is used. So Lemma 3.1 follows. Proof of Theorem 3.2. For every j ∈ NJ , consider the following random event A=

Ac

where

Aj =

and

n1

n

Pr(Aj ) = Pr

We rewrite Pr(Aj ) as

Pr(Aj ) = Pr



J \

j=1

k(X ⊤ ε)j k2 6

n1

n

Aj , o 1 (λ1 ω ˆ j − 4λ2 m) 2

ε⊤ Xj Xj⊤ ε 6

o n (λ1 ω ˆ j − 4λ2 m)2 . 4

1 ⊤ n ε Xj Xj⊤ ε 6 (λ1 ω ˆ j − 4λ2 m)2 n 4



= Pr

(P

n k=1

νjk (ξk2 − 1) √ 6 dj 2kνj k

)

where ξ1 , · · · , ξn are i.i.d. standard normal, νjk denote the eigenvalues of the matrix Xj Xj⊤ /n, among which the positive ones are the same as those of Gj , and dj is defined as dj =

n(λ1 ω ˆ j − 4λ2 m)2 /(4σ 2 ) − tr(Gj ) √ . 2kGj kF

Applying Lemma B.1 in Lounici et al. (2011) to the event Aj , we have ( ) d2j c √ Pr(Aj ) 6 2 exp − . 2(1 + 2dj k|Gj k|/kGj kF ) Downloaded to IP: 61.165.232.226 On: 2017-02-16 01:20:46 http://engine.scichina.com/doi/10.1007/s11425-016-0071-x

Hu J et al.

Sci China Math

January 2017

Vol. 60

15

No. 1

We now choose dj so that the right-hand side of the above inequality is smaller than 2J 1−ρ for ρ > 1. A direct computation yields that q √ 2 dj > 2ρ log(J)k|Gj k|/kGj kF + 2ρ log(J) + 2 [ρ log(J)k|Gj k|/kGj kF ] ,

d

implying that the inequality (3.6) holds. We conclude, by a union bound, under the above condition on the parameters λj , that Pr(Acj ) 6 2J 1−ρ . To prove the inequality (3.7), by (3.3) and Lemma 3.1, we have

(A.2)

4λ1 (κ2 (s) − 8λ2 )

(A.3)

pte

X X 1 1 2 ω ˆ j k∆j k2 k∆i k22 + 2λ1 κ (s)k∆J0 k22 6 kX∆k22 6 4λ2 2 2n j∈J0 j∈J0 sX 2 6 4λ2 k∆J0 k2 + 2λ1 ω ˆ j2 k∆J0 k2 . j∈J0

Then we obtain

k∆J0 k2 6

sX

ω ˆ j2 ,

j∈J0

which is (3.8). And the inequality (3.7) follows from (A.2) and (A.3) immediately. To prove the inequality (3.9), using (3.2) yields X 1 8κ2 (s)λ21 1 2 κ k∆k22 6 kX∆k22 6 2 ω ˆ j2 . 2 2n (κ (s) − 8λ2 )2

ce

j∈J0

ˆ we have To prove the inequality (3.10), from the Karush-Kuhn-Tucker conditions, for any j ∈ J(β),  1 ˆ − 2λ2 βˆj k2 X ⊤ (y − X β) j n   1 1 6 k X ⊤ ε j k2 + k X ⊤ X∆ j k2 + 2λ2 k∆j k2 + 2λ2 kβj k2 n n  1 1 ⊤ ˆ j + k X X∆ j k2 + 2λ2 k∆j k2 . 6 λ1 ω 2 n  So we obtain λ1 ω ˆ j /2 6 k X ⊤ X∆ j k2 /n + 2λ2 k∆j k2 . Then

Ac

λ1 ωj = k

 X 1 2 ⊤ k(X X∆)j k2 + 2λ2 k∆j k2 λ1 ω ˆ min n ˆ j∈J(β) q ! ˆ rφ 2 N (β) max 6 kX∆k2 + 2λ2 k∆k2 . λ1 ω ˆ min n

ˆ 6 N (β)

Applying (3.7) and (3.9) results in the inequality (3.10) immediately.

Proof of Theorem 3.4. We will prove the consistency in selection at the case of J → ∞. Applying oracle inequality (3.9) and Assumption C yields sX 4κ(s) (λ1 ω ˆ j )2 → 0 as n, J → ∞, k∆J0 k2 6 κ(κ2 (s) − 8λ2 ) j∈J0

with probability tending to 1. Since k∆J0c k2 6 k∆J0 k2 , so we have k∆J0c k2 → 0 as n, J → ∞ with probability tending to 1. Downloaded to IP: 61.165.232.226 On: 2017-02-16 01:20:46 http://engine.scichina.com/doi/10.1007/s11425-016-0071-x

16

Hu J et al.

Sci China Math

January 2017

Vol. 60

No. 1

(ii) From the inequality (3.9), we get 

 2 X 16κ (s) Ekβˆ − βk22 6 E  2 2 (λ1 ω ˆ j )2  κ (κ (s) − 8λ2 )2

(A.4)

j∈J0

d

With the inequality (A.4) and the fact that the order of λ1 ω ˆ j is O(n−1/2 log1/2 (J)) from (3.6), for such a small λ2 that κ2 (s) − 8λ2 > 0, we obtain Ekβˆ − βk22 = O(n−1 log(J)). We complete the proof of the desired results.

(A.5)

pte

ˆ = β + µ/√n for µ = Proof of Theorem 4.1. Let us first prove asymptotic normality part. Set β ⊤ ⊤ p (µ⊤ 1 , · · · , µJ ) ∈ R , where µj have the same dimension as βj for j = 1, 2, · · · , J, and



2   2 J J X X



1

βj + √1 µj + nλ4

βj + √1 µj .

+ nλ3 √ ω ˆ Qn (µ) = y − X β + µ j



n n n 2 2 2 j=1 j=1

Consider Vn (µ) = Qn (µ) − Qn (0) as follows. A routine decomposition yields the following decomposition Vn (µ) =µ⊤



   J X

1 1 ⊤ 2

√ β + ω ˆj µ − kβ k X X µ − √ ε⊤ Xµ + nλ3 j j 2

j n n n 2 j=1

 J  X √ 1 ⊤ √ 2µ⊤ β + nλ4 µ µ j . j j n j j=1

ce +

For βj = 0, if Pr(µj 6= 0) does not tend to zero, then by Slutsky’s theorem, we have 1+γ

 

n 2 λ3 kjδ kµj k2 √ 1

ˆ j kµj k2 = →∞ nλ3 ω ˆj √

βj + √n µj − kβj k2 = nλ3 ω ( nkβˆj k2 )γ 2

Ac

√ with a positive probability, where nkβˆj k2 = OP (1) is used. So Pr({µj = 0, j ∈ J0c }) tends to one so that Vn (µ) is a well-defined function. ⊤ ⊤ ⊤ ⊤ ⊤ Without loss generality to consider µ = (µ⊤ J0 , 0 ) with µJ0 = (µ1 , · · · , µJ0 ) . Then we have     X 1 ⊤ 2 ⊤ 1 ⊤ ⊤ Vn (µJ0 , 0 ) =µJ0 X XJ µJ0 − √ ε XJ0 µJ0 + nλ3 ω ˆ j kβj + √ µj k2 − kβj k2 n J0 0 n n j∈J0   X √ 1 ⊤ + nλ4 2µ⊤ j βj + √ µj µj . n j∈J 0

For βj 6= 0, then ω ˆ j →P kjδ /kβj kγ . By Slutsky’s theorem and Assumption E, we have

  √

nλ3 ω ˆ j βj⊤ µj P 1

nλ3 ω ˆ j βj + √ µj − kβj k2 = → 0. n kβj k2 2

Also by Assumption E, the third item tend to zero. Following the epi-convergence results of Knight and Fu (2000), we have 2 ⊤ ⊤ ⊤ d ⊤ Vn (µ⊤ J0 , 0 ) → µJ0 ΩµJ0 − √ ε XJ0 µJ0 . n Assumption D tells us that XJ⊤0 XJ0 /n → Ω as n → ∞. It follows from the Lindeberg central limit theorem, see Van der Vaart (1998), that XJ⊤0 ε/n converges in distribution to a normal N (0, σ 2 Ω) distribution. Downloaded to IP: 61.165.232.226 On: 2017-02-16 01:20:46 http://engine.scichina.com/doi/10.1007/s11425-016-0071-x

Hu J et al.

Sci China Math

January 2017

Vol. 60

17

No. 1

We obtain that √ 2 ⊤ ⊤ d −1 W, n(βˆJ0 − βJ0 ) → arg min{µ⊤ J0 ΩµJ0 − √ ε XJ0 µJ0 } = Ω µ1 n

d

where W ∼ N (0, σ 2 Ω). So we prove the asymptotic normality part (ii). Next, we will prove the consistency part (i). Suppose that, for enough large n, there exists some j ∈ J0c such that βˆj 6= 0 with a positive probability. Then from the sub-gradient equation (2.3), we have √ √ 1 βˆj ˆ = 2 nλ4 βˆj + nλ3 ω √ Xj⊤ (y − X β) ˆj . n kβˆj k2

(A.6)

pte

On one hand, the second item of the right-side of (A.6) converges to infinite with a positive probability, namely,

γ

√ n1+ 2 λ3 kjδ βˆj

ˆj → ∞.

= √

nλ3 ω

kβˆj k2 2 k nβˆj k1+γ 2 On the other hand, the left-side of (A.6) is bounded in probability since

√ 1 ˆ = OP (1). ˆ = 1 X ⊤ X( n(β − β)) √ Xj⊤ (y − X β) n n j

We get a contradiction: the two sides of equation (A.6) have different orders as n tends to infinity. Hence Pr(β̂_{J_0^c} = 0) → 1 as n → ∞, which completes the proof of the desired result.

References


1 Bickel P J, Ritov Y, Tsybakov A B. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 2009, 37(4): 1705-1732
2 Breheny P, Huang J. Penalized methods for bi-level variable selection. Statistics and Its Interface, 2009, 2(3): 369-380
3 Candes E, Tao T. Decoding by linear programming. IEEE Transactions on Information Theory, 2005, 51(12): 4203-4215
4 Candes E, Tao T. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 2007, 35(6): 2313-2351
5 Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 2001, 96(456): 1348-1360
6 Friedman J, Hastie T, Tibshirani R. A note on the group lasso and a sparse group lasso. Technical report, Stanford University, 2010
7 Horvath S, Zhang B, Carlson M, et al. Analysis of oncogenic signaling networks in glioblastoma identifies ASPM as a molecular target. Proceedings of the National Academy of Sciences of the United States of America, 2006, 103(46): 17402-17407
8 Huang J, Breheny P, Ma S. A selective review of group selection in high-dimensional models. Statistical Science, 2012, 27(4): 481-499
9 Huang J, Ma S, Xie H, et al. A group bridge approach for variable selection. Biometrika, 2009, 96(2): 339-355
10 Jiang D, Huang J. Concave 1-norm group selection. Biostatistics, 2015, 16(2): 252-267
11 Knight K, Fu W. Asymptotics for lasso-type estimators. The Annals of Statistics, 2000, 28(5): 1356-1378
12 Li R, Zhong W, Zhu L. Feature screening via distance correlation learning. Journal of the American Statistical Association, 2012, 107(499): 1129-1139
13 Lounici K, Pontil M, Van De Geer S, et al. Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, 2011, 39(4): 2164-2204
14 Simon N, Friedman J, Hastie T, et al. A sparse-group lasso. Journal of Computational and Graphical Statistics, 2013, 22(2): 231-245
15 Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 1996, 58(1): 267-288
16 Van der Vaart A W. Asymptotic Statistics. Cambridge: Cambridge University Press, 1998
17 Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2006, 68(1): 49-67



18 Zhang C-H. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 2010, 38(2): 894-942
19 Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 2006, 101(476): 1418-1429
20 Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2005, 67(2): 301-320
21 Zou H, Zhang H H. On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics, 2009, 37(4): 1733-1751
