Communications in Statistics - Simulation and Computation
Yongxiu Cao, Jian Huang, Yuling Jiao & Yanyan Liu (2017). A lower bound based smoothed quasi-Newton algorithm for group bridge penalized regression. Communications in Statistics - Simulation and Computation, 46(6), 4694-4707. DOI: 10.1080/03610918.2015.1129409
Accepted author version posted online: 13 Jan 2016. Published online: 13 Jan 2016.
A lower bound based smoothed quasi-Newton algorithm for group bridge penalized regression

Yongxiu Cao (a,b), Jian Huang (c,d), Yuling Jiao (a), and Yanyan Liu (b)

a School of Statistics and Mathematics, Zhongnan University of Economics and Law, Hubei, China; b School of Mathematics and Statistics, Wuhan University, Hubei, China; c School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China; d Department of Statistics and Actuarial Science, University of Iowa, IA, USA
ABSTRACT
In this paper, we propose a lower bound based smoothed quasi-Newton algorithm for computing the solution paths of the group bridge estimator in linear regression models. Our method is based on the quasi-Newton algorithm with a smoothed group bridge penalty, in combination with a novel data-driven thresholding rule for the regression coefficients. This rule is derived from a necessary KKT condition of the group bridge optimization problem. It is easy to implement and can be used to eliminate groups with zero coefficients, thus reducing the dimension of the optimization problem. The proposed algorithm removes the group-wise orthogonality condition required by the coordinate descent and LARS algorithms for group variable selection. Numerical results show that the proposed algorithm outperforms the coordinate descent based algorithms in both efficiency and accuracy.
ARTICLE HISTORY
Received October; Accepted December

KEYWORDS
Lower bound rule; Group bridge; Penalized least squares; Quasi-Newton algorithm

MATHEMATICS SUBJECT CLASSIFICATION
C; J
1. Introduction

In many applications, high-dimensional data have become commonplace, and there is now a large body of work on statistical and computational methods for variable selection in high-dimensional models. Several important penalized methods have been proposed for the analysis of high-dimensional data. Examples include the bridge (Frank and Friedman, 1993; Huang et al., 2008; Huang and Ma, 2010), the least absolute shrinkage and selection operator (LASSO; Tibshirani, 1996), the smoothly clipped absolute deviation (SCAD; Fan and Li, 2001), the adaptive LASSO (Zou, 2006), and the minimax concave penalty (MCP; Zhang, 2010). These methods automatically remove unimportant variables by shrinking some of the regression coefficients to zero and are designed for selecting individual variables. However, in many problems, predictors are naturally grouped. For example, in genetic studies, genes or genetic markers are usually grouped by biological pathways. Other examples include multi-factor analysis of variance (ANOVA), where each factor is represented by a group of dummy variables, and the additive model with nonparametric components, where each component of an original variable can be expressed as a linear combination of polynomial basis functions (e.g., Huang et al., 2010, 2012). These examples motivate the need for developing variable selection methods
at the group level. Over the past decade, there has been a growing literature on group-wise variable selection. Several authors have proposed methods for group selection under a group-wise orthogonality assumption. For example, Yuan and Lin (2006) proposed the group LASSO method for group selection. Kim et al. (2006) studied the group LASSO for logistic regression models. Meier, van de Geer, and Bühlmann (2008) studied the group LASSO for logistic regression with high-dimensional data. Zhao et al. (2006) proposed a composite absolute penalty for group selection. These methods, however, focus only on group selection without considering variable selection within each identified group; that is, if one variable in a group is selected, all the other variables in the same group are selected as well. In many applications, however, it is of interest to select individual variables as well as groups. Huang et al. (2009) proposed a group bridge method with an L_γ (0 < γ < 1) penalty, which not only effectively removes unimportant groups but also maintains the flexibility of selecting variables within the identified groups. Even though the group bridge method enjoys the powerful oracle group selection property (Huang et al., 2009), it is computationally difficult because the group bridge penalty is neither convex nor differentiable at the (group-wise) origin.

To solve the computational problem, Huang et al. (2009) first transformed the penalized group bridge problem into a penalized adaptive LASSO problem, to which the LARS algorithm (Efron et al., 2004) or the coordinate descent algorithm (Fu, 1998) can then be applied. To implement this algorithm, an updated weight for the adaptive LASSO solver is needed at each iteration, which can be time-consuming. For small or moderately sized group regression problems this algorithm works well, but when the number of groups is large it is computationally expensive. To ease the computational burden, it is desirable to develop alternative algorithms that update the vector of regression coefficients as a whole, such as the classical Newton-Raphson (NR) algorithm. However, the NR algorithm requires the calculation of the inverse of the Hessian matrix, which is computationally prohibitive in high-dimensional settings. The quasi-Newton (QN) algorithm avoids calculating the inverse of the Hessian matrix by replacing it with an easily computed approximation; how to approximate the inverse of the Hessian matrix properly is therefore critical for speeding up the QN algorithm. Another challenge in using QN is that it only provides an approximation of a local optimal solution, so that zero entries can rarely be identified numerically. To eliminate zero terms in a QN solver, it is natural to adopt a fixed constant as a threshold: the groups whose estimated group norms are smaller than the threshold are then set to zero. Different thresholding values, however, lead to different conclusions, which makes the choice of the threshold critical.

In this article, we develop a lower bound based smoothed quasi-Newton (LSQN) algorithm to compute group bridge solutions efficiently. First, we smooth the group bridge penalty. Then, we use the DFP formula (Nocedal and Wright, 1999) to approximate the inverse of the Hessian matrix. Finally, we extend the results of Chen et al. (2010) to derive lower bounds for the estimates corresponding to the nonzero groups. These lower bound results can be used to detect the zero terms of the estimator.
The LSQN algorithm allows us to work directly with general design matrices, without requiring any within-group orthogonalization. An important advantage of the proposed algorithm is that it can efficiently handle problems in which both the number of groups and the sample size are large. The remainder of this article is organized as follows. In Section 2, we derive two lower bounds for the nonzero groups of the group bridge estimator. In Section 3, we describe the proposed LSQN algorithm. In Section 4, simulation studies are presented to evaluate the finite-sample performance of the proposed algorithm. We conclude with a few remarks in Section 5.
2. The lower bounds for the nonzero groups of the group bridge estimator

Consider the linear regression model y = Xβ + ε, where y is an n-dimensional vector of response variables, X is an n × p design matrix, β is a p-dimensional vector of unknown regression coefficients, and ε is a vector of error terms. To remove the intercept from the regression model, we center the response variable so that its sample mean is zero. Without loss of generality, we further assume that X has unit-normalized columns; this can be achieved simply by normalizing the columns of X and rescaling β. Suppose that the predictors are divided into I nonoverlapping groups A_1, ..., A_I such that {1, ..., p} = ∪_{i=1}^{I} A_i, where A_i ∩ A_{i'} = ∅ for i ≠ i', 1 ≤ i, i' ≤ I. Let |A_i|_0 denote the cardinality of A_i. If all of the |A_i|_0's equal one, group selection reduces to individual variable selection. Let β_{A_i} = (β_j, j ∈ A_i)^T. The L1 and L2 norms of a column vector a = (a_1, ..., a_m)^T are denoted by ‖a‖_1 = |a_1| + ... + |a_m| and ‖a‖_2 = (a^T a)^{1/2}, respectively. The group bridge penalized least-squares criterion (Yuan and Lin, 2006; Huang and Ma, 2010) is defined as

$$Q(\beta) = \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\sum_{i=1}^{I}\|\beta_{A_i}\|_1^{\gamma}, \qquad (1)$$
where γ ∈ (0, 1) is a fixed constant and λ ≥ 0 is the tuning parameter. The group bridge estimator is a minimizer of Q(β), i.e., β̂ = arg min_{β ∈ R^p} Q(β). Since Q(β) ≥ λ Σ_{i=1}^{I} ‖β_{A_i}‖_1^γ, the objective function is bounded below, and Q(β) → ∞ if there exists some i ∈ {1, ..., I} such that ‖β_{A_i}‖_1 → ∞. Thus, by the continuity of the objective function, there exists at least one global minimizer of Q(β). Due to the nonconvexity of the group bridge penalty with γ ∈ (0, 1), Q(β) may have more than one global minimizer. Let β* be one of the global minimizers of (1). From the first-order necessary optimality condition, we obtain a lower bound for the nonzero groups of β*, referred to as the first-order lower bound. The result is presented in the following theorem.

Theorem 2.1. (The first-order lower bound) Let ρ be the maximum eigenvalue of X^T X and let β̃ ∈ R^p be an arbitrary vector. If there exists some i ∈ {1, ..., I} such that

$$\|\beta^*_{A_i}\|_1 < \left(\frac{\lambda\gamma}{\sqrt{2\rho\,Q(\tilde{\beta})}}\right)^{1/(1-\gamma)},$$

then β*_{A_i} = 0.

The first-order lower bound involves an arbitrary vector β̃. In practice, we can take β̃ to be the zero vector, in which case the lower bound for the active groups reduces to (λγ/(√ρ ‖y‖_2))^{1/(1−γ)}. This lower bound depends on n and p, since ρ depends on n and p. The following theorem gives another lower bound for the nonzero groups of the group bridge estimator, based on the second-order necessary optimality condition.

Theorem 2.2. (The second-order lower bound) For i = 1, ..., I, if ‖β*_{A_i}‖_1 < [λγ(1 − γ)]^{1/(2−γ)}, then β*_{A_i} = 0.
These two theoretical results can be used to eliminate the zero groups of the group bridge estimator.
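For illustration, both thresholds can be computed directly from the data. The following is a minimal NumPy sketch of ours (not the authors' code), taking β̃ = 0 in Theorem 2.1 so that Q(β̃) = ‖y‖_2²/2 and the first bound reduces to (λγ/(√ρ ‖y‖_2))^{1/(1−γ)}; the function name is hypothetical.

```python
import numpy as np

def lower_bounds(X, y, lam, gamma):
    """Thresholds from Theorem 2.1 (with beta_tilde = 0) and Theorem 2.2.

    Any group of a group bridge minimizer whose l1 norm falls below
    either bound must be exactly zero.
    """
    rho = np.linalg.eigvalsh(X.T @ X).max()  # largest eigenvalue of X'X
    C1 = (lam * gamma / (np.sqrt(rho) * np.linalg.norm(y))) ** (1.0 / (1.0 - gamma))
    C2 = (lam * gamma * (1.0 - gamma)) ** (1.0 / (2.0 - gamma))
    return C1, C2
```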
3. The LSQN algorithm

3.1. Smoothed group bridge objective function

Many existing algorithms for group selection are based on the idea of coordinate descent, in the sense that the regression coefficients are updated group-wise. Examples include the blockwise descent algorithm of Yuan and Lin (2006), the block coordinate gradient descent (BCGD) algorithm of Meier et al. (2008), and the blockwise majorization descent (BMD) algorithm of Yang and Zou (2014). All of the aforementioned algorithms are designed for solving group LASSO regression problems. For the group bridge penalized objective function, minimizing Q(β) is challenging since this function is nonconvex and not differentiable. Huang et al. (2009) proposed to transform the group bridge regression problem into an adaptive LASSO problem. In the high-dimensional regression case, this method can be time-consuming, since it involves solving a sequence of LASSO regression problems. To overcome the difficulty caused by the nondifferentiability of the penalty function, approximation techniques have been proposed in the literature to smooth the objective function. Examples include the local quadratic approximation (LQA) of Fan and Li (2001) and the local linear approximation (LLA) of Zou and Li (2008). Applying the LQA to the group bridge penalty turns the original minimization problem into a ridge regression, which cannot shrink small coefficients to zero. Applying the LLA transforms the group bridge regression into an adaptive LASSO penalized problem, which is designed for selecting significant variables individually rather than for identifying important groups. Foucart and Lai (2009) proposed to smooth the bridge penalty by |β_j|^γ ≈ (|β_j| + ε_0)^γ, with ε_0 > 0 a fixed smoothing parameter. Inspired by their method, we propose to approximate ‖β_{A_i}‖_1^γ by (‖β_{A_i}‖_1 + ε_0)^γ. Obviously, (‖β_{A_i}‖_1 + ε_0)^γ → ‖β_{A_i}‖_1^γ as ε_0 → 0, but the smaller ε_0 is, the more unstable the computation becomes. With this smooth approximation to the penalty function, we obtain the smoothed objective function

$$\tilde{Q}(\beta) = \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\sum_{i=1}^{I}(\|\beta_{A_i}\|_1 + \epsilon_0)^{\gamma}. \qquad (2)$$

The corresponding optimal solution is defined as β̃_{ε_0} = arg min_β Q̃(β).
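To make the smoothed criterion concrete, the sketch below evaluates Q̃(β) and its gradient in NumPy. It is only a sketch under our own conventions (groups passed as a list of index arrays, sign(0) taken as 0, and the smoothing parameter written as eps0); it is not taken from the authors' implementation.

```python
import numpy as np

def smoothed_objective(beta, X, y, groups, lam, gamma, eps0):
    """Smoothed group bridge criterion (2)."""
    r = y - X @ beta
    pen = sum((np.abs(beta[g]).sum() + eps0) ** gamma for g in groups)
    return 0.5 * r @ r + lam * pen

def smoothed_gradient(beta, X, y, groups, lam, gamma, eps0):
    """Gradient of (2): least-squares part plus the group-wise penalty part."""
    grad = X.T @ (X @ beta - y)
    for g in groups:
        w = (np.abs(beta[g]).sum() + eps0) ** (gamma - 1.0)
        grad[g] += lam * gamma * w * np.sign(beta[g])  # sign(0) treated as 0
    return grad
```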
3.2. Description of the LSQN algorithm

Since (2) is a smooth function, the NR algorithm can be used directly. When β_{A_i} = 0, this smoothing technique is similar to that used in the modified NR algorithm of Fu (1998). The iteration formula for the smoothed objective function is

$$\beta^{(k+1)} = \beta^{(k)} - \alpha\,[\nabla^2\tilde{Q}(\beta^{(k)})]^{-1}\nabla\tilde{Q}(\beta^{(k)}),$$
where α is a fixed step factor (we fix α = 0.1 in our simulation studies), and ∇Q̃(β^{(k)}) and ∇²Q̃(β^{(k)}) are the gradient and Hessian matrix of Q̃(β) at β^{(k)}, respectively. The calculation of the inverse of the Hessian matrix is computationally costly, especially when the dimension of the covariates is high. Furthermore, the Hessian matrix may be nearly singular for high-dimensional regression problems. Therefore, we employ the QN algorithm to implement the group bridge method, wherein the inverse of the Hessian matrix [∇²Q̃(β^{(k)})]^{-1} is replaced by a suitable approximation matrix H^{(k)}. Generally, to keep the nice properties of the NR algorithm, H^{(k)} should satisfy the following requirements:
(i) H^{(k)} is easy to calculate;
(ii) {H^{(k)}}_{k≥1} must be positive definite;
(iii) the corresponding pseudo-Newton direction h^{(k)} = −H^{(k)}∇Q̃(β^{(k)}) should guarantee that the objective function decreases as k increases.
A possible choice of H^{(k)} satisfying the above requirements is obtained by the DFP formula (Nocedal and Wright, 1999):

$$H^{(k)} = H^{(k-1)} + \frac{p^{(k-1)}p^{(k-1)T}}{p^{(k-1)T}q^{(k-1)}} - \frac{H^{(k-1)}q^{(k-1)}q^{(k-1)T}H^{(k-1)}}{q^{(k-1)T}H^{(k-1)}q^{(k-1)}},$$
where p^{(k−1)} = β^{(k)} − β^{(k−1)} and q^{(k−1)} = ∇Q̃(β^{(k)}) − ∇Q̃(β^{(k−1)}). Usually, the initial matrix H^{(0)} is set to the p × p identity matrix. To guarantee the convergence of this QN algorithm, the step factor α should be small; however, a small α slows down the convergence. As mentioned above, this QN algorithm cannot identify the zero components of the group bridge estimator, so a thresholding value is needed to eliminate them. We use the lower bound results derived in Theorems 2.1 and 2.2 as the thresholding values to carry out the group reduction. After this reduction, we only need to consider the groups remaining in the active set. We refer to this group reduction rule as the lower bound rule and to the corresponding algorithm as the LSQN algorithm. For fixed constants K, ε_0, α, η, and λ, the LSQN algorithm is summarized as follows (a code sketch of the iteration is given after the algorithm).

LSQN Algorithm
Step 1. Compute C_1 = (λγ/(√ρ ‖y‖_2))^{1/(1−γ)} and C_2 = [λγ(1 − γ)]^{1/(2−γ)}.
Step 2. Initialize β̂ by β^{(0)} and set H^{(0)} = I_{p×p}.
Step 3. For k = 0, 1, ..., K, carry out the following updates (a)-(c) until convergence:
(a) Compute β^{(k+1)} = β^{(k)} − αH^{(k)}∇Q̃(β^{(k)}).
(b) Compute ∇Q̃(β^{(k+1)}); if ‖∇Q̃(β^{(k+1)})‖_2 ≤ η, output β̂ = β^{(k+1)} and go to Step 4; otherwise, go to substep (c).
(c) Compute p^{(k)} = β^{(k+1)} − β^{(k)} and q^{(k)} = ∇Q̃(β^{(k+1)}) − ∇Q̃(β^{(k)}). If p^{(k)T}q^{(k)} > 0, update H^{(k+1)} with

$$H^{(k+1)} = H^{(k)} + \frac{p^{(k)}p^{(k)T}}{p^{(k)T}q^{(k)}} - \frac{H^{(k)}q^{(k)}q^{(k)T}H^{(k)}}{q^{(k)T}H^{(k)}q^{(k)}},$$

otherwise set H^{(k+1)} = H^{(k)}. Then let k = k + 1 and go to substep (a) with the updated H^{(k+1)}.
Step 4. For i = 1, ..., I, if ‖β̂_{A_i}‖_1 ≤ C_1 or ‖β̂_{A_i}‖_1 ≤ C_2, set β̂_{A_i} = 0. Output the updated β̂.
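The steps above translate directly into code. The sketch below is our own rendering of the iteration in NumPy, reusing the hypothetical lower_bounds and smoothed_gradient helpers from the earlier sketches; the default constants mirror the values mentioned in the text (α = 0.1, η = 0.001, K = 100), but everything else is an assumption rather than the authors' implementation.

```python
import numpy as np

def lsqn(X, y, groups, lam, gamma, eps0=1e-4, alpha=0.1, eta=1e-3, K=100):
    """LSQN sketch: damped quasi-Newton steps with a DFP update of the
    approximate inverse Hessian, then the lower bound thresholding of Step 4.
    Assumes lower_bounds and smoothed_gradient (sketched earlier) are in scope.
    """
    p = X.shape[1]
    C1, C2 = lower_bounds(X, y, lam, gamma)          # Step 1
    beta, H = np.zeros(p), np.eye(p)                 # Step 2
    g = smoothed_gradient(beta, X, y, groups, lam, gamma, eps0)
    for _ in range(K):                               # Step 3
        beta_new = beta - alpha * H @ g              # (a) quasi-Newton step
        g_new = smoothed_gradient(beta_new, X, y, groups, lam, gamma, eps0)
        if np.linalg.norm(g_new) <= eta:             # (b) stopping rule
            beta = beta_new
            break
        s, q = beta_new - beta, g_new - g            # (c) DFP update
        if s @ q > 0:
            Hq = H @ q
            H = H + np.outer(s, s) / (s @ q) - np.outer(Hq, Hq) / (q @ Hq)
        beta, g = beta_new, g_new
    for idx in groups:                               # Step 4: lower bound rule
        if np.abs(beta[idx]).sum() <= max(C1, C2):
            beta[idx] = 0.0
    return beta
```

Note that the DFP update is applied only when p^{(k)T}q^{(k)} > 0, which keeps the approximate inverse Hessian positive definite, in line with requirement (ii).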
An important feature of this algorithm is the use of the lower bound results in Theorems 2.1 and 2.2 to carry out the group reduction in a data-driven manner in Step 4; the thresholding constants C_1 and C_2 computed in Step 1 are based on these theorems. Tibshirani et al. (2012) proposed to use the KKT condition or the strong rule as a screening device for LASSO-type group regression problems, where the estimation is carried out restricted to the detected active set and the coefficients of the other groups are set to zero. However, there is no corresponding KKT condition for our smoothed group bridge penalty, so their method cannot be used for our problem.
3.3. Computation of the solution paths

In practice, we are interested in solving the group bridge regression problem over a grid of tuning parameter values λ_1 ≥ λ_2 ≥ ... ≥ λ_M. To speed up the implementation, we use the idea of warm starts, i.e., the initial value for the λ_{k+1} optimization problem is taken to be β̂(λ_k), k = 1, ..., M − 1. Associated with the fixed λ_{k+1} are the two lower bounds C_1(λ_{k+1}) and C_2(λ_{k+1}), which can be used to reduce the active set: the ith group is detected as inactive if its initial value satisfies ‖β̂_{A_i}(λ_k)‖_1 < max{C_1(λ_{k+1}), C_2(λ_{k+1})}; otherwise it is included in the active set Â(λ_{k+1}). After Â(λ_{k+1}) is determined, the LSQN algorithm is implemented only on the active set. We have found that using the lower bound rule and warm starts dramatically decreases the computing time. The optimal value of λ is chosen by minimizing the BIC over λ_1, ..., λ_M; see Yuan and Lin (2006) for details.
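A compact sketch of the path computation with warm starts and the lower bound screening is given below, again reusing the hypothetical helpers above. The fallback to the full set of groups when the screen would leave no active group, and the omission of the BIC step, are our own simplifications and are not specified in the paper.

```python
import numpy as np

def lsqn_path(X, y, groups, lambdas, gamma):
    """Warm-started solution path with lower bound screening (sketch).

    Assumes lsqn and lower_bounds from the earlier sketches are in scope.
    """
    beta = np.zeros(X.shape[1])
    path = []
    for lam in sorted(lambdas, reverse=True):        # decreasing lambda grid
        thr = max(lower_bounds(X, y, lam, gamma))
        # screen out groups whose warm-start l1 norm falls below the bound
        active = [g for g in groups if np.abs(beta[g]).sum() >= thr] or groups
        cols = np.concatenate(active)
        # re-index the surviving groups relative to the reduced design matrix
        ends = np.cumsum([len(g) for g in active])
        sub_groups = [np.arange(s, e) for s, e in zip(np.r_[0, ends[:-1]], ends)]
        beta_sub = lsqn(X[:, cols], y, sub_groups, lam, gamma)
        beta = np.zeros(X.shape[1])
        beta[cols] = beta_sub                        # warm start for the next lambda
        path.append(beta.copy())
    return path
```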
4. Simulation studies

In this section, we compare the proposed LSQN algorithm with some existing algorithms for group bridge regression in terms of timing performance and solution accuracy. The algorithms used for comparison are the method proposed by Huang et al. (2009) (referred to as CDH below) and the LLA technique of Zou and Li (2008) (referred to as LLA below). Define a^0 = 0 for all a ∈ R. By the LLA, the group bridge penalty is approximated by

$$\|\beta_{A_i}\|_1^{\gamma} \approx \|\tilde{\beta}_{A_i}\|_1^{\gamma} + \gamma\|\tilde{\beta}_{A_i}\|_1^{\gamma-1}\big(\|\beta_{A_i}\|_1 - \|\tilde{\beta}_{A_i}\|_1\big) = \|\tilde{\beta}_{A_i}\|_1^{\gamma} + \gamma\|\tilde{\beta}_{A_i}\|_1^{\gamma-1}\sum_{j\in A_i}\big(|\beta_j| - |\tilde{\beta}_j|\big), \qquad (3)$$
with β̃_{A_i} close enough to β_{A_i}. With this approximation, the penalized group bridge problem is transformed into an adaptive LASSO regression, and the coordinate descent method can then be applied to obtain the minimizer. Datasets are generated independently from the linear regression model y = Xβ + ε. The predictors are generated from a multivariate normal distribution with mean 0 and correlation matrix Σ = (σ_{ij}) with σ_{ij} = 0.5^{|i−j|}. The random error ε ~ N(0, σ²). For all the algorithms considered here, 0 is set as the initial value of β. To calculate the lower bound in Theorem 2.1, we set β̃ = 0.
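As a concrete illustration, the simulated datasets can be generated along the following lines. This is a minimal sketch of ours: the error standard deviation and the random seed are left as inputs, and in practice the response would subsequently be centered and the columns of X normalized as described in Section 2.

```python
import numpy as np

def simulate(n, beta_true, sigma=1.0, rho=0.5, seed=0):
    """Draw (X, y) from y = X beta + eps with Gaussian predictors whose
    correlation matrix has entries rho ** |i - j|."""
    rng = np.random.default_rng(seed)
    p = len(beta_true)
    corr = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), corr, size=n)
    y = X @ beta_true + sigma * rng.normal(size=n)
    return X, y
```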
Table 1. Measurement comparison results for p = 20 based on 200 independent runs. For each γ ∈ {0.1, 0.3, 0.5, 0.7, 0.9} and each method (CDH, LLA, LSQN), the table reports Time, ME, |Â|, and Corr under n = 50 and n = 100.
Time: the average running time (in seconds) for computing the solution at 20 λ values; ME: mean of the model error; |Â|: the average estimated group size; Corr: the proportion of times the correct model was selected over the 200 realizations; CDH: the algorithm proposed by Huang et al. (2009) as mentioned in Section 4; LLA: the coordinate descent algorithm with the penalty approximated by the LLA method; LSQN: the proposed LSQN algorithm combined with the recommended lower bound rule.
Table 2. Measurement comparison results for n = 200 based on independent runs. For each method (CDH, LLA, LSQN) and each value of γ considered, the table reports Time, ME, |Â|, and Corr under p = 50, 100, 150, and 200.
Time: mean of the running time (in seconds) for computing the solution at 20 λ values; ME: mean of the model error; Corr: the proportion of times the true regression model was exactly selected; |Â|: the average estimated group size; CDH: the iteration based algorithm proposed by Huang et al. (2009); LLA: the coordinate descent algorithm with the group bridge penalty approximated by the LLA method; LSQN: the proposed LSQN algorithm combined with the recommended lower bound rule.
The groups of coefficients are determined in the following way. (S1) The first five group sizes are set as (3, 4, 3, 5, 5), and the corresponding true coefficients of β for the first five groups are (1, −0.8, 0, 1.3, 0, 0, 0, 0, ..., 0, 0, ..., 0, 0, ..., 0)^T. (S2) Each of the remaining groups has size 5; this results in a total of p = 20 + 5t covariates, where t is the number of remaining groups. When solving the group bridge regression over a grid of tuning parameter values λ_1 ≥ λ_2 ≥ ... ≥ λ_M, we apply the lower bound rule to speed up the computation. For the CDH and LLA algorithms, the original minimization problem is transformed into an adaptive LASSO one, which can be solved using the coordinate descent algorithm.

4.1. Accuracy of the LSQN algorithm

To evaluate the accuracy of the proposed LSQN method, we compare it with the CDH and LLA algorithms in terms of selection accuracy at the group level, selected group size, and model error, defined as

$$\mathrm{ME} = (\hat{\beta} - \beta_0)^T E[X^T X](\hat{\beta} - \beta_0).$$

Additionally, the CPU running time for computing the solution paths at 20 λ values is also recorded. Since the main purpose of this subsection is to check the accuracy of the proposed LSQN algorithm, we consider the data setting with t = 0 for simplicity. This results in a linear regression model with five groups, with the first two groups being the significant ones. We consider different values of γ in the group bridge estimator: γ = 0.1, 0.3, 0.5, 0.7, 0.9.
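Because the predictors are simulated with a known covariance matrix, E[X^T X] is available in closed form and the model error can be evaluated exactly. Below is a small sketch of ours, under the assumption that E[X^T X] = nΣ with Σ_ij = 0.5^{|i−j|} for the design described above.

```python
import numpy as np

def model_error(beta_hat, beta_true, n, rho=0.5):
    """ME = (beta_hat - beta0)' E[X'X] (beta_hat - beta0), with E[X'X] = n * Sigma."""
    p = len(beta_true)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    d = beta_hat - beta_true
    return float(d @ (n * Sigma) @ d)
```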
Figure 1. I = 5 + t represents the number of groups considered in the model, with dimension p = 20 + 5(t − 6). |Â| is the average estimated nonzero group size obtained by the LSQN method with γ = 0.5, based on independent realizations. The square and diamond lines are the results with sample sizes n = 200 and n = 100, respectively. The star line represents the true value of this ratio.
For all of the simulations in this subsection, the stopping tolerance η and the maximum number of iterations K in the LSQN algorithm are fixed at η = 0.001 and K = 100. We summarize the results in Table 1 with averages of the considered measures across the 200 simulations. The table shows that the LSQN algorithm is faster than the other two methods. In terms of accuracy, LSQN performs best for γ = 0.1 and 0.9 with respect to the percentage of selecting the correct groups and the size of the estimated model. When n increases to 100, all three methods are comparable in both efficiency and accuracy. To further investigate the proposed method, we conduct additional simulation studies for n = 200 and p = 50, 100, 150, and 200. The results are summarized in Table 2. In all cases, LSQN has the best timing and true-model selection performance. When p = 50 or 100, the model errors of CDH are slightly smaller than those of the other two methods, but this difference disappears as p increases. When γ = 0.1, CDH overestimates the group size, which results in a low proportion of correctly selecting the true significant groups. LSQN and LLA are comparable in the accuracy of model size estimation and the proportion of correctly selected true groups. Figure 1 presents the ratio of the average estimated group size |Â| over the actual group size for different p. The relation between p and the true number of groups I is I = 5 + (p − 20)/5.
Figure 2. The percentage of the zero group A_6 being selected as a zero group, where p = 20 + |A_6|_0 and I = 6. The star and diamond lines are the results with n = 100 and n = 200, respectively.
We fix γ = 0.5 and the sample size n = 100 or 200. As shown in this figure, when p is moderately large (p ≤ 2n), the LSQN method can exactly select the true nonzero groups. When p is large, the proposed LSQN method can eliminate most of the zero groups in the regression model. Next, we investigate the influence of the group cardinality on the performance of LSQN. We consider a model with six groups. The first five groups are set as in (S1), and the sixth, nonsignificant group A_6 has a number of variables (denoted by |A_6|_0) increasing from 10 to 80 in steps of 10. Two values of γ are considered: γ = 0.1 and 0.5. Figure 2 gives the percentage of group A_6 being correctly detected as a nonsignificant group over the 100 runs. We see that the proposed LSQN works well when γ = 0.1, n = 200, or |A_6|_0 ≤ 50. When |A_6|_0 > 50, the chance of group A_6 being wrongly detected as a nonzero group increases as the sample size n decreases or as the group cardinality increases. This phenomenon suggests that the group cardinality is related to the two bounds; Chen et al. (2010) showed that this conjecture is true.
4.2. Timing comparison

We compare the CPU running times of the CDH, LLA, and LSQN methods. Data are generated from the models presented before. We fix γ = 0.5. The time (in seconds) for computing the solution paths at 20 λ values is recorded for each algorithm, and the results are averaged over 10 independent runs. We consider the following combinations of (n, p):
(A) (n, p) = (100, 20), (100, 50), (100, 100), (100, 200), (100, 400), (100, 800);
(B) (n, p) = (200, 200), (200, 400), (200, 800), (200, 1000), (200, 1600), (200, 2000), (200, 4000).
For the first scenario, the timing results are summarized in Table 3. We can see that CDH always takes the longest time to obtain the solution path, while the LSQN algorithm is by far the most time-efficient.

Table 3. Timing comparison (in seconds) for (n, p) = (100, 20), (100, 50), (100, 100), (100, 200), (100, 400), and (100, 800). For each value of γ considered, the table reports the running times of the CDH, LLA, and LSQN methods.
CDH: the iteration based algorithm proposed by Huang et al. (2009) as mentioned in Section 4; LLA: the coordinate descent algorithm with the penalty approximated by the LLA method; LSQN: the proposed LSQN algorithm combined with the recommended lower bound rule.
Figure 3. The average running time over 10 independent runs (on the natural logarithm scale) for 20 λ values. The diamond line represents the running time of the CDH algorithm. The plus and star lines are the logarithms of the running times obtained by the LLA and LSQN algorithms, respectively. For all cases, the sample size is 200 and γ = 0.5.
In the second scenario, we examine the impact of the dimension on the timing of the three considered methods. We fix n = 200. Figure 3 shows the running time (on a logarithmic scale) against p for the three methods. We see that the LSQN algorithm saves a substantial amount of time when p is large.
5. Concluding remarks

In this article, we propose a fast LSQN algorithm, based on the QN algorithm and a novel thresholding rule, for computing the solution paths of the group bridge estimator in linear regression models. The DFP formula is employed to generate an approximation of the inverse Hessian matrix to speed up the computation. The proposed algorithm allows us to work directly with general design matrices, without requiring any group-wise orthogonality assumptions. The proposed LSQN algorithm is the fastest among the methods considered in our simulation studies and performs well in group selection accuracy.
Appendix

Here, we prove the theoretical results.

Proof of Theorem 2.1. Let K be the number of estimated nonzero groups of β*. Without loss of generality, assume that the first K groups A_1, ..., A_K are the estimated active groups of β*. For i = 1, ..., K, let Ã_i = {k ∈ A_i : β*_k ≠ 0} be the index set of the nonzero terms of β*_{A_i}, let Ã = ∪_{i=1}^{K} Ã_i, and let q = Σ_{i=1}^{K} |Ã_i|_0. Denote the nonzero elements of β* by

$$\beta^*_{\tilde{A}} = \big(\beta^{*T}_{\tilde{A}_1}, \ldots, \beta^{*T}_{\tilde{A}_K}\big)^T.$$
Let W be the submatrix of X corresponding to the nonzero elements of β*. Let L(z) be the function defined on R^q by

$$L(z) = \frac{1}{2}\|y - Wz\|_2^2 + \lambda\sum_{i=1}^{K}\|z_{\tilde{A}_i}\|_1^{\gamma}.$$

By a simple calculation, we have

$$Q(\beta^*) = \frac{1}{2}\|y - X\beta^*\|_2^2 + \lambda\sum_{i=1}^{I}\|\beta^*_{A_i}\|_1^{\gamma} = \frac{1}{2}\|y - W\beta^*_{\tilde{A}}\|_2^2 + \lambda\sum_{i=1}^{K}\|\beta^*_{\tilde{A}_i}\|_1^{\gamma} = L(\beta^*_{\tilde{A}}).$$

Thereby

$$L(\beta^*_{\tilde{A}}) = Q(\beta^*) \le \min\big\{Q(\beta) : \beta_j = 0 \text{ for } j \in \{1, \ldots, p\}\setminus(\cup_{i=1}^{K}\tilde{A}_i)\big\} = \min\{L(z) : z \in \mathbb{R}^q\} \le L(\beta^*_{\tilde{A}}),$$

which implies that β*_{Ã} is a global minimizer of L(z). Let I_{|Ã_i|_0} be the identity matrix of rank |Ã_i|_0 and let Λ = diag{Λ_1, ..., Λ_K} be a diagonal matrix with Λ_i = ‖β*_{Ã_i}‖_1^{γ−1} × I_{|Ã_i|_0}. By the first-order necessary optimality condition for L(z), we have

$$W^T(W\beta^*_{\tilde{A}} - y) + \lambda\gamma\,\Lambda\,\mathrm{sign}(\beta^*_{\tilde{A}}) = 0,$$

where sign(z) is the sign function of z. Then, for every i ∈ {1, ..., K},

$$\lambda\gamma\|\beta^*_{\tilde{A}_i}\|_1^{\gamma-1} \le \|W^T(y - W\beta^*_{\tilde{A}})\|_2 = \|W^T(y - X\beta^*)\|_2 \le \|X^T(y - X\beta^*)\|_2 \le \sqrt{\rho}\,\|y - X\beta^*\|_2 \le \sqrt{2\rho\,Q(\beta^*)} \le \sqrt{2\rho\,Q(\tilde{\beta})}.$$

Thus, for the nonzero groups of β*, we have

$$\min_{1\le i\le K}\|\beta^*_{A_i}\|_1 = \min_{1\le i\le K}\|\beta^*_{\tilde{A}_i}\|_1 \ge \left(\frac{\lambda\gamma}{\sqrt{2\rho\,Q(\tilde{\beta})}}\right)^{1/(1-\gamma)}.$$

It follows that the regression coefficients of the ith group, β*_{A_i}, must be zero if ‖β*_{A_i}‖_1 < (λγ/√(2ρ Q(β̃)))^{1/(1−γ)}. This completes the proof of Theorem 2.1.

Proof of Theorem 2.2. Let Λ̇ = diag{‖β*_{Ã_1}‖_1^{γ−2} I_{|Ã_1|_0}, ..., ‖β*_{Ã_K}‖_1^{γ−2} I_{|Ã_K|_0}} be the first-order derivative of Λ with respect to β*_{Ã}. By the second-order necessary condition for L(z), we conclude that the matrix W^T W + λγ(γ − 1)Λ̇
is positive semi-definite. For any j ∈ Ã_i, let W_j be the corresponding column of W; then

$$\|W_j\|_2^2 + \lambda\gamma(\gamma - 1)\|\beta^*_{\tilde{A}_i}\|_1^{\gamma-2} \ge 0.$$

Since the columns of X are unit-normalized, ‖W_j‖_2^2 = 1, and it follows that

$$\|\beta^*_{\tilde{A}_i}\|_1^{2-\gamma} \ge \lambda\gamma(1 - \gamma).$$

Therefore, for every nonzero group of β*,

$$\|\beta^*_{A_i}\|_1 = \|\beta^*_{\tilde{A}_i}\|_1 \ge [\lambda\gamma(1 - \gamma)]^{1/(2-\gamma)},$$

which means that β*_{A_i} = 0 if ‖β*_{A_i}‖_1 < [λγ(1 − γ)]^{1/(2−γ)}. This completes the proof of Theorem 2.2.
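As a sanity check on Theorem 2.2, the bound can be verified numerically in the simplest one-dimensional case (a single group containing one unit-normalized predictor): a grid search over the coefficient confirms that the global minimizer is either exactly zero or at least [λγ(1 − γ)]^{1/(2−γ)}. The following snippet is an illustration of ours, not part of the original proof; the chosen constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam, gamma = 50, 2.0, 0.5
bound = (lam * gamma * (1.0 - gamma)) ** (1.0 / (2.0 - gamma))

x = rng.normal(size=n)
x /= np.linalg.norm(x)                      # unit-normalized design column
grid = np.linspace(-5.0, 5.0, 20001)        # candidate coefficient values

for signal in (0.0, 0.5, 3.0):              # weak to strong true effects
    y = signal * x + rng.normal(size=n)
    Q = 0.5 * ((y[:, None] - np.outer(x, grid)) ** 2).sum(axis=0) \
        + lam * np.abs(grid) ** gamma
    b_star = grid[Q.argmin()]
    # the grid minimizer is (numerically) zero or no smaller than the bound
    assert abs(b_star) < 1e-8 or abs(b_star) >= bound
    print(f"signal={signal:.1f}: minimizer={b_star:.3f}, bound={bound:.3f}")
```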
Acknowledgment

The authors thank the editor and referees for their helpful comments, which greatly improved the article.
Funding

Yongxiu Cao's research is partially supported by the National Natural Science Foundation of China No. 11401443 and the National Natural Science Foundation of China No. 11501578. Yuling Jiao is partially supported by the National Natural Science Foundation of China No. 11501579 and partially supported by the Fundamental Research Funds for the Central Universities of China No. 31541411224. Yanyan Liu's work is partially supported by the National Natural Science Foundation of China Nos. 11571263 and 11371299.
References

Chen, X., Xu, F., Ye, Y. (2010). Lower bound theory of nonzero entries in solutions of l2-lp minimization. SIAM Journal on Scientific Computing 32:2832–2852.
Dicker, L., Huang, B., Lin, X. (2012). Variable selection and estimation with the seamless-l0 penalty. Statistica Sinica 23:929–962.
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. (2004). Least angle regression. The Annals of Statistics 32:407–499.
Fan, J., Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96:1348–1360.
Fan, J., Lv, J. (2010). A selective overview of variable selection in high-dimensional feature space. Statistica Sinica 20:101–148.
Frank, L. E., Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics 35:109–135.
Foucart, S., Lai, M. J. (2009). Sparsest solutions of under-determined linear systems via lq-minimization for 0 < q < 1. Applied and Computational Harmonic Analysis 26:395–407.
Fu, W. J. (1998). Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics 7:397–416.
George, E. I. (2000). The variable selection problem. Journal of the American Statistical Association 95:1304–1308.
Hocking, R. R., Leslie, R. N. (1967). Selection of the best subset in regression analysis. Technometrics 9:531–540.
Huang, J., Horowitz, J. L., Ma, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics 36:587–613.
Huang, J., Horowitz, J. L., Wei, F. (2010). Variable selection in nonparametric additive models. The Annals of Statistics 38:2282–2313.
Huang, J., Ma, S. (2010). Variable selection in the accelerated failure time model via the bridge method. Lifetime Data Analysis 16:176–195.
Huang, J., Liu, L., Liu, Y., Zhao, X. (2014). Group selection in the Cox model with a diverging number of covariates. Statistica Sinica 24:1787–1810.
Huang, J., Ma, S., Xie, H., Zhang, C. (2009). A group bridge approach for variable selection. Biometrika 96:339–355.
Huang, J., Wei, F., Ma, S. (2012). Semiparametric regression pursuit. Statistica Sinica 22:1403–1426.
Kim, Y., Kim, J., Kim, Y. (2006). Blockwise sparse regression. Statistica Sinica 16:375–390.
Mallows, C. (1973). Some comments on Cp. Technometrics 15:661–675.
Meier, L., van de Geer, S., Bühlmann, P. (2008). The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B 70:53–71.
Nocedal, J., Wright, S. J. (1999). Numerical Optimization. New York: Springer-Verlag.
Li, B., Yu, Q. (2009). Robust and sparse bridge regression. Statistics and Its Interface 2:481–491.
Park, C., Yoon, Y. J. (2011). Bridge regression: Adaptivity and group selection. Journal of Statistical Planning and Inference 141:3506–3519.
Sun, W., Yuan, Y. (2006). Optimization Theory and Methods: Nonlinear Programming. New York: Springer-Verlag.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58:267–288.
Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., Tibshirani, R. (2012). Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society, Series B 74:245–266.
Wang, H., Li, B., Tsai, C. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika 94:553–568.
Wang, H., Li, B., Leng, C. (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society, Series B 71:671–683.
Xie, H., Huang, J. (2009). SCAD-penalized regression in high-dimensional partially linear models. The Annals of Statistics 37:673–696.
Yang, Y., Zou, H. (2014). A fast unified algorithm for solving group-lasso penalized learning problems. Statistics and Computing, DOI: 10.1007/s11222-014-9498-5.
Yuan, M., Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B 68:49–67.
Zhao, P., Rocha, G., Yu, B. (2009). The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics 37:3468–3497.
Zou, H. (2006). The adaptive LASSO and its oracle properties. Journal of the American Statistical Association 101:1418–1429.
Zou, H., Hastie, T., Tibshirani, R. (2007). On the "degrees of freedom" of the lasso. The Annals of Statistics 35:2173–2192.
Zou, H., Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics 36:1509–1533.