SparseNet: Coordinate Descent with Non-Convex Penalties Rahul Mazumder Department of Statistics, Stanford University
[email protected]
Jerome Friedman Department of Statistics, Stanford University
[email protected]
Trevor Hastie Department of Statistics, Department of Health, Research and Policy Stanford University
[email protected] December 2, 2009 Abstract We address the problem of sparse selection in linear models. A number of non-convex penalties have been proposed for this purpose, along with a variety of convex-relaxation algorithms for finding good solutions. In this paper we pursue the coordinate-descent approach for optimization, and study its convergence properties. We characterize the properties of penalties suitable for this approach, study their corresponding threshold functions, and describe a df -standardizing reparametrization that assists our pathwise algorithm. The MC+ penalty (Zhang 2010) is ideally suited to this task, and we use it to demonstrate the performance of our algorithm.
1
Introduction
Consider the usual linear regression set-up y = Xβ +
(1)
with n observations and p features X = [x1 , . . . , xp ], response y, coefficient vector β and (stochastic) error . In many modern statistical applications with p n, the true β vector is sparse, with many redundant predictors having coefficient 0. We would like to identify the useful predictors and also obtain good estimates of their coefficients. Identifying a set of relevant features from a list of many thousand is in general combinatorially hard and statistically troublesome. In this context, convex relaxation techniques such as the LASSO (Tibshirani 1996, Chen et al. 1998) have 1
been effectively used for simultaneously producing accurate and parsimonious models. The LASSO solves (2) min 12 y − Xβ2 + λβ1 . β
The 1 penalty shrinks coefficients towards zero, and can also set many coefficients to be exactly zero. In the context of variable selection, the LASSO is often thought of as a convex surrogate for best-subset selection: min β
1 y 2
− Xβ2 + λβ0 .
(3)
The 0 penalty β0 = pi=1 I(|βi| > 0) penalizes the number of non-zero coefficients in the model. The LASSO enjoys attractive statistical properties (Zhao & Yu 2006, Donoho 2006, Knight & Fu 2000, Meinshausen & B¨ uhlmann 2006). Under certain regularity conditions on the model matrix and parameters, it produces models with good prediction accuracy when the underlying model is reasonably sparse. In addition, Zhao & Yu (2006) established that the LASSO is model selection consistent: Pr(Aˆ = A0 ) → 1, where A0 corresponds to the set of nonzero coefficients (active) in the true model and Aˆ those recovered by the LASSO. Their assumptions include the strong irrepresentable condition on the sample covariance matrix of the features (essentially limiting the pairwise correlations), and some additional regularity conditions on (n, p, β, ) However, when these regularity conditions are violated (Zhang 2010, Zhang & Huang 2008, Friedman 2008, Zou & Li 2008, Zou 2006), the LASSO can be suboptimal in model selection. In addition, since the LASSO produces shrunken coefficient estimates, in order to minimize prediction error it often selects a model which is overly dense in its effort to relax the penalty on the relevant coefficients. Typically in such situations greedier methods like subset regression and the non-convex methods we discuss here achieve sparser models than the LASSO for the same or better prediction accuracy and superior variable selection properties. There are computationally attractive algorithms for the LASSO. The piecewiselinear LASSO coefficient paths can be computed efficiently via the LARS (homotopy) algorithm (Efron et al. 2004, Osborne et al. 2000). The recent coordinate-wise optimization algorithms of Friedman et al. (2007) appear to be the fastest for computing the regularization paths (for a variety of loss functions) and are readily scalable to very large problems. One-at-a time coordinate-wise methods for the LASSO make repeated use of the univariate soft-thresholding operator ˜ λ) = arg min 1 (β − β) ˜ 2 + λ|β| S(β, 2 β
˜ β| ˜ − λ)+ . = sgn(β)(|
(4)
In solving (2), the one-at-a-time coordinate-wise updates are given by β˜j = S(
n
(yi − y˜ij )xij , λ), j = 1, . . . , p, 1, . . . , p, . . .
i=1
2
(5)
where y˜ij = k=j xik β˜k (assuming each xj is standardized to have mean zero and ˜ we cyclically update the parameters using (5) until unit 2 norm). Starting with β, convergence. The LASSO can fall short as a variable selector : • to induce sparsity, the LASSO ends up shrinking the coefficients for the “good” variables; • if these “good” variables are strongly correlated with redundant features, this effect is exacerbated, and extra variables are included in the model. As an illustration, Figure 1 shows the regularization path of the LASSO coefficients for a situation where it is sub-optimal for model selection. The LASSO is somewhat LASSO
Hard Thresholding
MC+(γ = 6.6)
0
10
20
λ
30
40
50
20 −40
−20
0
20 0 −20 −40
−40
−20
0
20
Coefficient Profiles
40
Estimates True
40
Estimates True
40
Estimates True
0
10
20
λ
30
40
50
0
10
20
λ
30
40
50
Figure 1: Regularization path for LASSO (left), MC+ (non-convex) penalized least-squares for γ = 6.6 (centre) and γ = 1+ (right), corresponding to the hard-thresholding operator. The true coefficients are shown as horizontal dotted lines. The LASSO in this example is sub-optimal for model selection, as it can never recover the true model. The other two penalized criteria are able to select the correct model (vertical lines), with the middle one having smoother coefficient profiles than best subset on the right.
indifferent to correlated variables, and applies uniform shrinkage to their coefficients as in (4). This is in contrast to best-subset regression; once a strong variable is included, it drains the effect of its correlated cousins. This motivates going beyond the 1 regime to more aggressive non-convex penalties (see the left-hand plots for each of the four penalty families in Figure 2), bridging the gap between 1 and 0 (Fan & Li 2001, Zou 2006, Zou & Li 2008, Friedman 2008, Zhang 2010). Similar to (2), we minimize Q(β) =
1 y 2
− Xβ2 + λ
p i=1
3
P (|βi|; λ; γ),
(6)
where P (|β|; λ; γ) defines a family of penalty functions concave in |β|, and λ and γ control the degrees of regularization and concavity of the penalty respectively. The main challenge here is in the minimization of the possibly non-convex objective Q(β) (6) (the optimization problems become non-convex when the non-convexity of the penalty is no longer dominated by the convexity of the squared error loss). As one moves along the continuum of penalties beyond the 1 towards the 0 , the optimization potentially becomes combinatorially hard. Our contributions in this paper are as follows. 1. We propose coordinate-wise optimization methods for finding minima of Q(β), motivated by their success with the LASSO. Our algorithms cycle through both λ and γ, giving solution paths (surfaces) for all families simultaneously. For each value of λ, we start at the LASSO solution, and then update the solutions as γ changes, moving us toward best-subset regression. 2. We study the generalized thresholding functions that arise from different nonconvex penalties, and map out a set of properties that make them more suitable for coordinate descent. In particular, we seek continuity in these threshold functions for every λ and γ. 3. We prove convergence of coordinate descent for a useful subclass of non-convex penalties, generalizing the results of Tseng & Yun (2009) to nonconvex problems. Our results stand as a contrast to Zou & Li (2008); they study stationarity properties of limit points (and not convergence) of the sequence produced by the algorithms. 4. We propose a re-parametrization of the penalty families that makes them even more suitable for coordinate-descent. Our re-parametrization constrains the effective degrees of freedom at any value of λ to be constant as γ varies. This in turn allows for a smooth transition of the solutions as we move through values of γ from the convex LASSO towards best-subset selection, with the size of the active set decreasing along the way. 5. We compare our algorithm to the state of the art for this class of problems, and show how our approach leads to improvements. The paper is organized as follows. In Section 2 we study four families of nonconvex penalties, and their induced thresholding operators. We study their properties, particularly from the point of view of coordinate descent. We propose a df calibration, and lay out a list of desirable properties of penalties for our purposes. In Section 3 we describe our SparseNet algorithm for finding a surface of solutions for all values of the tuning parameters. In Section 4 we illustrate and implement our approach using calibrated versions of MC+ penalty (Zhang 2010) or the firm shrinkage threshold operator (Gao & Bruce 1997). In Section 5 we study the convergence properties of our SparseNet algorithm. Section 6 presents simulations under a variety of conditions to demonstrate the performance of SparseNet. Sections 7 and 8 investigate other approaches, and makes comparisons with SparseNet and multi-stage Local Linear Approximation (LLA) (Zou & Li 2008, Zhang 2009, Candes et al. 2008). The proofs of lemmas and theorems are gathered in the appendix. 4
2
Generalized Thresholding Operators
In this section we examine the one-dimensional optimization problems that arise in coordinate descent minimization of Q(β). With squared-error loss, this reduces to minimizing ˜ 2 + λP (|β|, λ, γ). Q(1) (β) = 12 (β − β) (7) We study (7) for different non-convex penalties, as well as the associated generalized threshold operator ˜ λ) = arg min Q(1) (β). (8) Sγ (β, β
As γ varies, this generates a family of threshold operators Sγ (∗, λ) : → . The soft-threshold operator (4) of the LASSO is a member of this family. The hardthresholding operator (10) can also be represented in this form 2 1 ˜ ˜ H(β, λ) = arg min 2 (β − β) + λ I(|β| > 0) (9) β
˜ ≥ λ), = β˜ I (|β|
(10)
Our interest in thresholding operators arose from the work of She (2009), who also uses them in the context of sparse variable selection, and studies their properties for this purpose. Our approaches differ, however, in that our implementation uses coordinate descent, and exploits the structure of the problem in this context.
2.1
Examples of non-convex penalties and threshold operators
For a better understanding of non-convex penalties and the associated threshold operators, it is helpful to look at some examples. (a) The γ penalty given by λP (t; λ; γ) = λ|t|γ for γ ∈ [0, 1], also referred to as the bridge or power family (Frank & Friedman 1993, Friedman 2008). See Figure 2(a) for the penalty and associated threshold functions. (b) The log-penalty is shown in Figure 2(b), and is a generalization of the elastic net family (Friedman 2008) to cover the non-convex penalties from LASSO down to best subset. λP (t; λ; γ) =
λ log(γ|t| + 1), γ > 0, log(γ + 1)
(11)
where for each value of λ we get the entire continuum of penalties from 1 (γ → 0+) to 0 (γ → ∞). (c) The SCAD penalty (Fan & Li 2001) is shown in Figure 2(c), and is defined via d (γλ − t)+ P (t; λ; γ) = I(t ≤ λ) + I(t > λ) for t > 0, γ > 2 dt (γ − 1)λ P (t; λ; γ) = P (−t; λ; γ) P (0; λ; γ) = 0 5
(12)
(a) γ
(b) Log Threshold Functions
Threshold Functions
1.0
2.0
1.5 0.2
0.4
0.6
1.0
0.0
0.5
1.0
β
1.5
2.0
2.5
0.0
0.2
0.4
β˜
= 0.01 =1 = 10 = 30000
0.6
β
0.8
1.5
2.0
2.5
0.5
1.0
Threshold Functions γ γ γ γ
3.0
= 1.01 = 1.7 =3 = 100
= 1.01 = 1.7 =3 = 100
2.5
2
γ γ γ γ
1.5
1.5
2.0
1
3.0
= 2.01 = 2.3 = 2.7 = 200
2.0
γ γ γ γ
2.5
5
1.0
0
β
3
4
5
0
3
4
5
1.0 0.5 0.0
0.0
0
0
0.5
1
1
1.0
2
2
0.5
Penalty
3 2
4
5 4
1
0.0
(d) MC+ Threshold Functions
3
= 2.01 = 2.3 = 2.7 = 200
1.0
β˜
(c) SCAD Penalty γ γ γ γ
0.0
0.0
γ γ γ γ
0.5
1.0
0.4
0.5 0.0
= 0.001 = 0.3 = 0.7 = 1.0 0.8
0.2
1.0 0.0
γ γ γ γ
= 0.01 =1 = 10 = 30000
1.5
0.6
0.6 0.4 0.2 0.0
γ , γ , γ , γ ,
γ γ γ γ
2.5
= 0.001 = 0.3 = 0.7 = 1.0
2.0
0.8
γ γ γ γ
Penalty
0.8
γ , γ , γ , γ ,
2.5
1.0
3.0
Penalty
0.0
β˜
1.5
β
2.0
2.5
0.0
0.5
1.0
1.5
2.0
2.5
β˜
Figure 2: Non-Convex penalty families and their corresponding threshold functions. All are shown with λ = 1 and different values for γ.
(d) The MC+ family of penalties (Zhang 2010) is shown in Figure 2(d), and is given by λP (t; λ; γ) = λ
|t|
(1 −
0
= λ(|t| −
x )+ dx γλ
λ2 γ t2 ) I(|t| < λγ) + I(|t| ≥ λγ). 2λγ 2
(13)
For each value of λ > 0 there is a continuum of penalties and threshold operators, varying from γ → ∞ (soft threshold operator) to γ → 1+ (hard threshold operator). The MC+ is a reparametrization of the firm shrinkage operator introduced by Gao & Bruce (1997) in the context of wavelet shrinkage. Though P (t; λ; γ) → λt1 as γ → ∞, in order to get the penalty λt0 it does not suffice to consider the limit γ → 0 (with λ fixed), as pointed out by Zhang (2010). In fact the parameter λ has to be taken to ∞ at some “optimal” rate. We will settle this issue in Section 4; by concentrating on the induced threshold operators, we will justify the restriction of the family to γ ∈ (1, ∞). From now on, unless mentioned otherwise, the MC+ penalty is to be understood as the family restricted to γ > 1.
6
Other examples of non-convex penalties include the transformed 1 penalty (Nikolova 2000), clipped 1 penalty (Zhang 2009) to name a few. Although each of these families (as in Figure 2) bridge 1 and 0 , they have different properties. The two in the top row, for example, have discontinuous univariate threshold functions, which would cause instability in coordinate descent. The threshold operators for the γ , log-penalty and the MC+ form a continuum between the soft and hard-thresholding functions. The family of SCAD threshold operators, although continuous, do not include H(·, λ). We now study some of these properties in more detail. Univariate log-penalised least squares
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
β˜ = −0.1 β˜ = 0.7 β˜ = 0.99 β˜ = 1.2 β˜ = 1.5
−0.5
* * *
0.0
* * 0.5
β
1.0
* 1.5
2.0
Figure 3: Figure shows the penalized least-squares criterion (7) with the log-penalty (11) ˜ The “*” denotes the global minima of the for (γ, λ) = (500, 0.5) for different values of β. functions. The “transition” of the minimizers, creates discontinuity in the induced threshold operators.
2.2
Effect of multiple minima in univariate penalized least squares criteria
Figure 3 shows the univariate penalized criterion (7) for the log-penalty (11) for certain choices of β˜ and a (λ, γ) combination. Here we see the non-convexity of Q1 (β), ˜ of the global minimizers of the univariate functions. This and the transition (with β) ˜ λ), as shown in causes the discontinuity of the induced threshold operators β˜ → Sγ (β, Figure 2(b). Multiple minima and consequent discontinuity appears in the γ , γ < 1 penalty as well, as seen in Figure 2(a). Discontinuity of the threshold operators will lead to added variance (and hence poor risk properties) of the estimates (Breiman 1996, Fan & Li 2001). We observe that for the log-penalty (for example), coordinate-descent can produce multiple limit points — creating statistical instability in the optimization procedure.
7
We believe that this discontinuity will naturally affect other optimization algorithms; for example those based on sequential convex relaxation. An example is the LLA of Zou & Li (2008); we see in Section 8.2 that LLA gets stuck in a suboptimal local minimum, even for the univariate log-penalty problem. This phenomenon is aggravated for the γ penalty. The multiple minima or discontinuity problem is not an issue in the case of the MC+ penalty or the SCAD (Figures 2(c,d)). Our study of these phenomena leads us to conclude that if the functions Q(1) (β) (7) are (strictly) convex then the coordinate-wise updates converge (unique limit point) to a stationary point. This turns out to be the case for the MC+ and the SCAD penalties. Furthermore we see in Section 4 in (19) that this restriction on the MC+ penalty for γ still gives us the entire continuum of threshold operators from soft to hard thresholding. Strict convexity is also true for the log-penalty for some choices of λ, γ, but not enough to cover the whole family. The γ , γ < 1, does not qualify here at all. This is not surprising given that the γ , γ < 1 penalty with an unbounded derivative at zero is well known to be unstable in optimization.
2.3
Effective degrees of freedom and nesting of shrinkage thresholds
For a LASSO fit μ ˆλ (x), λ controls the extent to which we (over)fit the data. This can be expressed in terms of the effective degrees of freedom of the estimator. For our non-convex penalty families, the effective degrees of freedom (or df for short) are influenced by γ as well. We propose to recalibrate our penalty families so that for fixed λ, the df do not change with γ. This has important consequences for our pathwise optimization strategy. For a linear model, df is simply the number of parameters fit. More generally, under an additive error model, df is defined by (Stein 1981, Efron et al. 2004) df (ˆ μλ,γ ) =
n
Cov(ˆ μλ,γ (xi ), yi)/σ 2 ,
(14)
i=1
where {(xi , yi)}n1 is the training sample and σ 2 is the noise variance. For a LASSO fit, the df is estimated by the number of nonzero coefficients (Zou et al. 2007). Suppose we compare the LASSO fit with k nonzero coefficients to an unrestricted least-squares fit in those k variables. The LASSO coefficients are shrunken towards zero, yet have the same df ? The reason is the LASSO is “charged” for identifying the nonzero variables. Likewise a model with k coefficients chosen by best-subset regression has df greater than k (here we do not have exact formulas). Unlike the LASSO, these are not shrunk, but are being charged the extra df for the search. Hence for both the LASSO and best subset regression we can think of the effective df as a function of λ. More generally, penalties corresponding to a smaller degree of concavity shrink more than those with a higher degree of concavity, and are hence charged less in df per nonzero coefficient. For the family of penalized regressions from the 1 to the 0 the effective df are controlled by both λ and γ. 8
In this paper we provide a re-calibration of the family of penalties {P (·; λ; γ)}λ,γ such that for every value of λ the effective df of the fit across the entire continuum of γ values are approximately the same. Since we rely on coordinate-wise updates in the optimization procedures, we will ensure this by calibrating the df in the univariate thresholding operators. Details are given in Section 4.1.1. We define the shrinkage-threshold λS of a thresholding operator as the largest (absolute) value that is set to zero. For the soft-thresholding operator of the LASSO, this is λ itself. However, the LASSO also shrinks the non-zero values toward zero. The hard-thresholding operator, on the other hand, leaves all values that survive its shrinkage threshold λH alone; in other words it fits the data more aggressively. So for the df of the soft and hard thresholding operators to be the same, the shrinkagethreshold λH for hard thresholding should be larger than the λ for soft thresholding. More generally, to maintain a constant df there should be a monotonicity (increase) in the shrinkage-thresholds λS as one moves across the family of threshold operators from the soft to the hard threshold operator. Figure 4 illustrates these phenomena for the MC+ penalty, before and after calibration.
1
2
3
4
5
3.0
= 1.01 = 1.7 =3 = 100
1.5 0.0
0.5
1.0
1.5 1.0 0.5 0.0
0.00
0
γ γ γ γ
2.5
= 1.01 = 1.7 =3 = 100
2.0
2.5
γ γ γ γ
2.0
0.20 0.15 0.10 0.05
effective degrees of freedom
0.25
Uncalibrated Calibrated
MC+ threshold functions before calibration after calibration 3.0
0.30
df for MC+ before/after calibration
0.0
0.5
1.0
1.5
β˜
log(γ)
2.0
2.5
3.0
0.0
0.5
1.0
1.5
2.0
2.5
3.0
β˜
Figure 4: [Left] The df (solid line) for the (uncalibrated) MC+ threshold operators as a function of γ, for a fixed λ = 1. The dotted line shows the df after calibration. [Middle] Family of MC+ threshold functions, for different values of γ, before calibration. All have shrinkage thresholds at 1. [Right] Calibrated versions of the same. The shrinkage threshold of the soft-thresholding operator is λ = 1, but as γ decreases, λS increases, forming a continuum between soft and hard thresholding.
Why the need for calibration? Because of the possibility of multiple stationary points in a non-convex optimization problem, good warm-starts are essential for avoiding sub-optimal solutions. The calibration assists in efficiently computing the doubly-regularized (γ, λ) paths for the coefficient profiles. For fixed λ we start with 9
the exact LASSO solution (large γ). This provides an excellent warm start for the problem with a slightly decreased value for γ. This is continued in a gradual fashion till γ approaches 1+, the best-subset problem. The df calibration ensures that the solutions to this series of problems vary smoothly with γ: • λS increases slowly, decreasing the size of the active set; • at the same time, the active coefficients are shrunk successively less. (Strictly speaking, the preceding few statements are to be interepreted marginally for each coordinate.) We find that the calibration keeps the algorithm away from sub-optimal stationary points and accelerates the speed of convergence.
2.4
Desirable properties for a family of threshold operators
Based on our observations on the properties and irregularities of the different penalties and their associated threshold operators, we propose some recommendations. A family of threshold operators given by Sγ (·, λ) : → γ ∈ (γ0 , γ1 ) should have the following properties: 1. γ ∈ (γ0 , γ1) bridges the gap from the soft thresholding operator on one end to the hard thresholding operator on the other, with the following continuity at the end points ˜ λ) = sgn(β)( ˜ β˜ − λ)+ and Sγ (β; ˜ λ) = βI(| ˜ β| ˜ ≥ λH ) Sγ1 (β, 0
(15)
where λ and λH correspond to the shrinkage thresholds of the soft and hard threshold operators respectively. 2. λ controls the effective df in the family Sγ (·, λ); for fixed λ, the effective df of Sγ (·, λ) for all values of γ should be the same. 3. For every fixed λ, there is a strict nesting (increase) of the shrinkage thresholds λS , as γ decreases from γ1 to γ0 . ˜ λ) is continuous. 4. The map β˜ → Sγ (β, 5. The univariate penalized least squares function Q(1) (β) (7) is (strictly) convex, ˜ This ensures that coordinate-wise procedures converge to a stationfor every β. ˜ λ) in the previous ary point. In addition this implies continuity of β˜ → Sγ (β, item. 6. The function γ → Sγ (·, λ) is continuous on γ ∈ (γ0 , γ1 ). This assures a smooth transition as one moves across the family of penalized regressions, in constructing the family of regularization paths.
10
The properties listed above are desirable, and in fact necessary for meaningful analysis for any generic non-convex penalty. Enforcing these properties will require re-parametrization and some restrictions in the family of penalties (and threshold operators) considered. • The threshold operator induced by the SCAD penalty does not encompass the entire continuum from the soft to hard thresholding operators (all the others do). • None of the penalties satisfy the nesting property of the shrinkage thresholds (item 3) or the degree of freedom calibration (item 2). • The threshold operators induced by γ and Log-penalties are not continuous (item 4); they are for MC+ and SCAD. • Q(1) (β) is strictly convex for both the SCAD and the MC+ penalty. Q(1) (β) is non-convex for every choice of (λ > 0, γ) for the γ penalty and some choices of (λ > 0, γ) for the log-penalty. In this paper we explore the details of our approach through the study of the MC+ penalty; indeed, the calibrated MC+ penalty (Section 4) achieves all the properties recommended above.
3
SparseNet: coordinate-wise algorithm for constructing the entire regularization surface
ˆ to We now present our SparseNet algorithm for obtaining a family of solutions β γ,λ (6). The X matrix is assumed to be standardized with each column having zero mean and unit 2 norm. The basic idea of the coordinate-wise algorithm is as follows. For every fixed λ, we optimize the (convex) criterion Q(β) with γ = ∞. This minimizer is used as a warm-start start for the minimization of Q(β) at a smaller (correspoding to a more concave) value of γ. We continue in this fashion, decreasing γ, till we have the path of solutions across the grid of values for γ (for a fixed λ). The details are given in Algorithm 1. Ideally the penalty has been calibrated for df, and the threshold functions satisfy the properties outlined in Section 2.4. In this paper we implement the algorithm using the calibrated MC+ penalty, which enjoys all the properties ˜ λ) outlined in Section 2.4. This means using the calibrated threshold operator S∗γ (β, (28) in (16). The algorithm converges (Section 5) to a stationary point of Q(β) for every (λ, γ). For a fixed λ, for large values of γ, the function Q(β) is convex, hence the coordinate-wise descent converges to the global minimizer. As γ is slowly relaxed to γ < γ, and the minimizer at (λ, γ) is passed to obtain the minimizer at (λ, γ ), the df calibration ensures that the “pass” is smooth. As a reasonable alternative to Algorithm 1 one might be tempted to obtain a ˆ family of solutions β γ,λ by moving across the λ space for every fixed γ. However, with increasing non-convexity we observe that it leads to inferior solutions. The superior performance of the SparseNet is not surprising because of the novel way in which it “borrows strength” across the neighboring γ space. 11
Algorithm 1 Coordinate-wise algorithm for computing the regularization solution surface for a family of non-convex penalties
1. Input a grid of increasing λ values Λ = {λ1 , . . . , λL }, and a grid of increasing γ values Γ = {γ1 , . . . , γK }, where γK indexes the LASSO penalty. Define λL+1 ˆ K , λL+1 ) = 0. such that β(γ 2. For each value of ∈ {L, L − 1, . . . , 1} repeat the following steps: ˜ = β(γ ˆ K , λ+1 ) (previous LASSO solution). (a) Initialize β (b) For each value of k ∈ {K, K − 1, . . . , 1} repeat the following i. Cycle through the 1, . . . , p, 1, . . . , p, . . .
following
one-at-a-time
n
β˜j = Sγk
updates
(yi − y˜ij )xij , λ ,
j
=
(16)
i=1
where y˜ij = k=j xik β˜k , ii. Repeat till updates converge to β ∗ . ˆ k , λ ) ← β ∗ iii. Assign β(γ iv. Decrement k. (c) Decrement . ˆ λ), (λ, γ) ∈ Λ × Γ 3. Return the two-dimensional solution surface β(γ,
4
Illustration via the MC+ penalty
Here we elaborate on our calibration scheme, and the resulting algorithm using the MC+ penalty. We will see eventually that the calibrated MC+ penalty satisfies all the desirable properties. The MC+ penalized univariate least-squares objective criterion is |β| x 1 2 1 ˜ +λ Q (β) = 2 (β − β) )+ dx. (1 − (17) γλ 0 This can be shown to be convex for γ ≥ 1, and non-convex for γ < 1. The minimizer of (17) for γ > 1 is given by ⎧ ˜ 0 ⎪ ⎪
if |β| ≤ λ ⎨ ˜ ˜ (|β|−λ ˜ λ) = ˜ ≤ λγ , sgn(β) if λ < |β| (18) Sγ (β, 1 1− ⎪ γ ⎪ ⎩ ˜ ˜ > λγ β if |β| Observe that in (18), for fixed λ > 0, as γ → 1+, as γ → ∞,
˜ λ) → H(β, ˜ λ), Sγ (β, ˜ λ) → S(β, ˜ λ). Sγ (β, 12
(19)
Hence γ0 = 1+ and γ1 = ∞. It is interesting to note here that the hard-threshold operator, which is conventionally understood to arise from a highly non-convex 0 penalized criterion (10), can be equivalently obtained as the limit of a sequence ˜ λ)}γ>1 where each threshold operator is the solution of a convex criterion (17) {Sγ (β, ˜ λ)}γ with the soft and hard for γ > 1. For a fixed λ, this gives a family {Sγ (β, threshold operators as its two extremes. Since we will be concerned with the one-dimensional threshold operators, it suffices to restrict to the continuum of penalties for γ ∈ (1, ∞).
4.1
Calibration of the MC+ penalty for df
˜ λ) have the same For a fixed λ, we would like all the thresholding functions Sγ (β, df. Achieving this will require a re-parametrization of the MC+ family. An attractive property of the MC+ penalty is that it is possible to obtain “nice” analytical expressions for the effective df. 4.1.1
Expressions for effective df
Consider the following univariate regression model iid
yi = xi β + i , i = 1, . . . , n, i ∼ N(0, σ 2 ).
(20)
Without loss of generality, assume the xi have sample mean zero and variance one. The df for the threshold operator Sγ (·, λ)(λ,γ) is given by (14) with μ ˆ λ,γ (xi ) = 2 ˜ λ) and β˜ = xi . Let xi Sγ (β, i xi yi / ˜ λ). ˆ γ,λ = x · Sγ (β, μ
(21)
Theorem 1. ˆ γ,λ ) = df (μ
λγ ˜ < λγ) + Pr(|β| ˜ > λγ) Pr(λ ≤ |β| λγ − λ
(22)
where these probabilities are to be calculated under the law β˜ ∼ N(β, σ 2 /n) Proof. We make use of Stein’s unbiased risk estimation (Stein 1981, SURE). Suppose ˆ : n → n is an almost differentiable function of y, and y ∼ N(μ, σ 2 In ). Then that μ ˆ formula (14) there is a simplification to the df (μ) ˆ = df (μ)
n
Cov(ˆ μi , yi)/σ 2
i=1
ˆ = E(∇ · μ),
(23)
n ˆ = where ∇ · μ ˆi /∂yi . Proceeding along the lines of proof in Zou et al. i=1 ∂ μ ˆ γ,λ : → defined in (21) is uniformly Lipschitz (for every (2007), the function μ λ > 0, γ > 1), and hence almost everywhere differentiable, so (23), is applicable. The ˜ which leads to (22). result follows easily from the definition (18) and that of β,
13
ˆ γ,λ → μ As γ → ∞, μ ˆλ , and from (22), we see that ˜ > λ) ˆ γ,λ ) −→ Pr(|β| df (μ ˜ > 0)) = E (I(|β|
(24)
which is precisely the expression obtained by Zou et al. (2007) for the df of the LASSO. ˜ λ) is given by Corollary 1. The df for the hard-thresholding function H(β, ˜ > λ) ˆ 1+,λ ) = λφ∗ (λ) + Pr(|β| df (μ
(25)
where φ∗ is taken to be the p.d.f. of the absolute value of a normal random variable with mean β and variance σ 2 /n. For the hard threshold operator, Stein’s simplified formula does not work, since ˜ λ) is not almost differentiable. But observing the corresponding function y → H(β, from (22) that ˜ > λ), and Sγ (β, ˜ λ) → H(β, ˜ λ) as γ → 1+ ˆ γ,λ ) → λφ∗ (λ) + Pr(|β| df (μ
(26)
we get an expression for df as stated. These expressions are consistent with simulation studies based on Monte Carlo estimates of df. Figure 4[Left] shows the df as a function of γ for a fixed value λ = 1 for the uncalibrated MC+ threshold operators Sγ (·, λ). For the figure we used β = 0 and σ 2 = 1. 4.1.2
Re-parametrization of the MC+ penalty
We see from Section 2.3 that for the df to remain constant for a fixed λ and varying γ, the shrinkage threshold λS (λ, γ) should increase as γ moves from γ1 to γ0 . For a formal proof of this argument see Theorem 3. Hence for the purpose of df calibration, we re-parametrize the family of penalties as follows: |t| x ∗ )+ dx. λP (|t|; λ; γ) = λS (λ, γ) (1 − (27) γλS (λ, γ) 0 With the re-parametrized penalty, the thresholding function is the obvious modification of (18): ˜ λ) = arg min { 1 (β − β) ˜ 2 + λP ∗ (|β|; λ; γ)} S∗γ (β, 2 β
˜ λS (λ, γ)) = Sγ (β,
(28)
Similarly for the df we simply plug into the formula (22) γλS (λ, γ) ˜ < γλS (λ, γ)) + Pr(|β| ˜ > γλS (λ, γ)) Pr(λS (λ, γ) ≤ |β| γλS (λ, γ) − λS (λ, γ) (29) ∗ ˆ γ,λ = μ ˆ γ,λS (λ,γ) . where μ To achieve a constant df, we require the following to hold for λ > 0:
ˆ ∗γ,λ ) = df (μ
14
• The shrinkage-threshold for the soft threshold operator is λS (λ, γ = ∞) = λ. This will be a boundary condition in the calibration. ˆ ∗γ=∞,λ ) = df (μ ˆ ∗γ,λ ) ∀ γ > 1. • df (μ The definitions for df depend on β and σ 2 /n. Since the notion of df centers around variance, we use the null model with β = 0. We also assume without loss of generality that σ 2 /n = 1, since this gets absorbed into λ. Theorem 2. For the calibrated MC+ penalty to achieve constant df, the shrinkagethreshold λS = λS (λ, γ) must satisfy the following functional relationship Φ(γλS ) − γΦ(λS ) = −(γ − 1)Φ(λ),
(30)
where Φ is the standard normal cdf and φ the pdf of the same. Theorem 2 is proved in Appendix A.1, using (29) and the boundary constraint. The next theorem establishes some properties of the map (λ, γ) → (λS (λ, γ), γλS (λ, γ)), including the important nesting of shrinkage thresholds. Theorem 3. For a fixed λ, (a) γλS (λ, γ) is increasing as a function of γ. (b) λS (λ, γ) is decreasing as a function of γ [Note: as γ increases, we approach the LASSO]. Both these relationships can be derived from the functional equation (30). Theorem 3 is proved in Appendix A.1. 4.1.3
Efficient computation of shrinkage threshold
In order to implement the calibration for MC+, we need an efficient method for evaluating λS (λ, γ)—i.e. solving equation (30). For this purpose we propose a simple parametric form for λS based on some of its required properties: the monotonicity properties just described, and the df calibration. We simply give the expressions here, and leave the details for Appendix A.2: λS (λ, γ) = λH ( where
1 − α∗ + α∗ ) γ
−1 α∗ = λ−1 H Φ (Φ(λH ) − λH φ(λH )),
λ = λH α ∗
(31)
(32)
The above approximation turns out to be a reasonably good estimator for all practical purposes, achieving a calibration of df within an accuracy of five percent, uniformly over all (γ, λ). The approximation can be improved further, if we take this estimator as the starting point, and obtain recursive updates for λS (λ, γ). Details of this approximation along with the algorithm are explained in Appendix A.2. Numerical studies show that a few iterations of this recursive computation can improve the degree of accuracy of calibration up to an error of 0.3 percent; Figure 4 [Left] was produced using this approximation. 15
5
Convergence Analysis
In this section we study the convergence of Algorithm 1. We will see that under certain conditions, it always converges to a minimum of the objective function; conditions that are met, for example, by the SCAD and MC+ penalties. We introduce the following notation. Let {β k }k ; k = 1, 2, . . . denote the sequence obtained via the sequence of p coordinate-wise updates k+1 k , u, βj+1 , . . . , βpk ), βjk+1 = arg min Q(β1k+1, . . . , βj−1 u
j = 1, . . . , p
(33)
The induced map Scw is given by β k+1 = Scw (β k )
(34)
Tseng & Yun (2009), in a very recent paper, proves the convergence of fairly generalized versions of the coordinate-wise algorithm for functions that can be written as the sum of a smooth function (loss) and a separable non-smooth convex function (penalty). This is not directly applicable to our case as the penalty P (t; λ; γ) is non-convex. Theorem 4 proves convergence of Algorithm 1, for the squared error loss and under certain regularity conditions on P (t; λ; γ). Lemma 3 in Appendix A.3, under milder regularity conditions, establishes that every limit point of the sequence {β k } is a stationary point of Q(β). Theorem 4. Suppose the penalty function λP (t; λ; γ) ≡ P (t) satisfies P (t) = P (−t), and the map t → P (t) is twice differentiable on t ≥ 0. In addition, suppose P (|t|) is non-negative and uniformly bounded and inf t P (|t|) > −1; where P (|t|) and P (|t|) are the first and second derivatives of P (|t|) wrt |t|. Then the univariate maps β → Q(1) (β) are strictly convex and the sequence of coordinate-updates {β k }k converge to a minimum of the function Q(β). Theorem 4 is proved with the aid of Lemmas 3 and 4 in Appendix A.3. Remark 1. Note that Theorem 4 includes the case where the penalty function P (t) is convex in t.
6
Simulation Studies
This section explores the empirical performance of non-convex penalized regression using the calibrated MC+ penalty, and compares it with the LASSO, forward stagewise regression (as in LARS), and best-subset selection, on different types of examples. For each example we examine prediction error, the number of variables in the model, and the zero-one loss based on variable selection. In all the examples we assume the usual linear model set-up with standardized predictors, and Gaussian errors in the response. The signal-to-noise ratio is defined as β T Σβ SNR = (35) σ 16
where rows of the matrix Xn×p are distributed as MVN(0, Σ). In our examples below, we consider both SNR=1 and 3. Our examples differ according to the true coefficient patterns, and the correlations among the predictors. These correlations are controlled through the structured covariance Σ = Σ(ρ; m), an m × m matrix with 1’s on the diagonal, and ρ’s on the off-diagonal. The first two examples have p = 30 (small). We compare best-subset selection, ˆ λ) for each problem forward stepwise and the entire family of MC+ estimates β(γ, instance. For each model, the optimal tuning parameters—λ for the family penalized regressions, subset size for best-subset selection and stepwise regression—are chosen based on minimization of the prediction error on a separate large validation set of size 10K. This is repeated 50 times for each problem instance. In each case, a separate large test set of 10K observations is used to measure the prediction performance. In the last 4 examples with p = 200 (large), we consider all strategies except best-subset (since it is computationally infeasible). Figures 5–7 show the results. In the first column, we report prediction error over the test set, in terms of the standardized MSE Prediction Error =
ˆ2 E(xβ − xβ) σ2
(36)
(strictly speaking prediction error minus 1). The second column shows the percentage of non-zero coefficients in the model. The third column shows the mis-identification of the true non-zero coefficients, reported as a misclassification error. In each figure we show the average performance over the 50 simulations, as well as an upper and lower standard error. Below is a description of how the data was generated for the different problem instances, and some observations on the results. Examples with small p — Figure 5 p1: n = 35, p = 30, Σ(0.4; p) and β = (−0.033, 0.067, −0.1, 01×23, −0.9, 0.93, −0.97, 0). For small SNR(= 1), the LASSO has the similar prediction error as some other (weaker) non-convex penalties, but it achieves this with a larger number of nonzero coefficients and poor 0−1 error. Members in the interior of the non-convex family have best prediction error along with good variable selection properties. The most aggressive non-convex selection procedures and best-subset have high prediction error because of added variance due to the high noise. Step-wise has good prediction error but lags behind in terms of variable selection properties. For larger SNR, there is a role-reversal — more aggressive non-convex penalized regression procedures perform much better in terms of prediction error and variable selection properties. It is interesting to note that MC+ penalized estimates for γ ≈ 1 perform almost identically to the best-subset procedure. Step-wise performs worse here. p2: n = 35, p = 30, Σ(0.4; p) and β = (0.033, 0.067, 0.1, 01×23, 0.9, 0.93, 0.97, 0), as above but with same signs. 17
Small p — Example p1 0−1 Error
0.35
% of Non−zero coefficients
0.55
40
Prediction Error
−
SNR=1
0.50
30
0.30
*
bs
1 2
4
las
0.25 st
Prediction Error
bs
1 2
4
las
% of Non−zero coefficients −
−
−
st
bs
1 2
4
las
4
las
4
las
4
las
0−1 Error
0.25
− * −
+− bs
1 2
4
las
st
+− bs
1 2
4
las
0.10
st
10
0.2
0.15
20
0.3
0.20
30
SNR=3
0.4
40
0.30
*
*
−
+
0.35
* st
0.15
10
0.40
+−
−
0.5
* −
0.20
20
0.45
+−
−
st
+− bs
1 2
Small p — Example p2 % of Non−zero coefficients
0−1 Error
0.18 0.20 0.22 0.24 0.26 0.28
0.55
Prediction Error −
25
0.45
30
*
+−
−
*
bs
1 2
4
las
st
Prediction Error
1 2
4
las
% of Non−zero coefficients −
−
*
4
las
1 2 0−1 Error
−
* −
0.15 st
+− bs
1 2
4
las
0.10
15 10 1 2
−
bs
0.20
30 20
25
0.25 +− bs
−
+
0.25
35
0.30
SNR=3
0.20 st
* −
st
40
*
0.15
bs
0.30
st
5
0.25
10
15
0.35
20
SNR=1
+−
−
st
+−− bs
1 2
Figure 5: Small p: examples p1 and p2. The three columns of plots represent prediction error (standardized mean-squared error), recovered sparsity and zero-one loss (with standard errors). The first row in each pair has SNR=1, the second SNR=3. Each plot represents the MC+ family for different values of γ (on the log-scale), the LASSO (las), Step-wise (st) and Best-subset (bs)
18
Here the shape of the performance curves are pretty similar to those in the previous example, the difference lying in the larger number of coefficients selected by all the selectors. This is expected given the same sign of the coefficients, as compared to the previous example. In the high SNR example, the most aggressive penalties get the best prediction error though they select far fewer number of variables, even when compared to the ground-truth.
Examples with large p — Figures 6 and 7 P1: n = 100, p = 200, Σ(0.004; p) and β = (−1, 01×199 ). Here the underlying β is extremely sparse, hence the most aggressive penalties perform the best. The results are quite similar for the two SNR cases, the only notable difference being the kinks in the sparsity of the estimates and the 0-1 error — possibly because of the instability created by high noise. The performance of the step-wise procedure is very similar to that of the LASSO — both being suboptimal. P2: n = 100, p = 200; Σρp×p = ((0.7|i−j|))1≤i,j≤p and β has 10 non-zeros such that β20i+1 = 1, i = 0, 1, . . . , 9;and βi = 0 otherwise. Due to the neighborhood “banded” structure of the correlation among the features, the LASSO and the less aggressive penalties estimate a considerable proportion of coefficients (which are actually zero in the population) as nonzero. This accounts for their poor variable selection properties. Stepwise also has poor variable selection properties. In the small SNR situation, LASSO and the less aggressive non-convex penalties and step-wise have good prediction error. The situation changes with increase in SNR, where the more aggressive penalties have better predictive power — because they nail the right variables. P3: n = 100, p = 200, Σ(0.5; p) and β = (β1 , β2 , . . . , β10 , 01×190 ) is such that the first ten coefficients namely β1 , β2 , . . . , β10 form an equi-spaced grid on [0, 0.5]. The patterns of the performance measures seem to remain stable for the different SNR values. The most aggressive penalized regressions have very sparse solutions but end up performing worse (in prediction accuracy) than the less aggressive counterparts. For the higher SNR examples, prediction accuracy appears to be better towards the interior of the γ space. The LASSO gives them tough competition in predictive power at the cost of worse selection properties. P4: n = 100, p = 200, Σ(0; p) and β = (11×20 , 01×180 ). As is the case with some of the previous examples, the LASSO and the less aggressive selectors perform favorably in the high-noise situation. For the less noisy version, the more aggressive penalties dominate in predictive accuracy and variable selection properties. In summary, in all the small-p situations the SparseNet algorithm for the calibrated MC+ family is able to mimic the prediction performance of the best procedure, which in the high SNR situation is best-subset regression. In the high-p 19
Large p — Example P1 0−1 Error
6
% of Non−zero coefficients
−
*
SNR=1
4
0.05
0.04
5
0.07
Prediction Error
*
0.02
3
− *
3
4
las
st
1
3
4
4
las
st
las
1
3
4
las
4
las
4
las
4
las
0−1 Error
−
*
0.020
− *
0.010
0.07 0.01
0.03
SNR=3
0.05
*
st
3
% of Non−zero coefficients
−
0.5 1.0 1.5 2.0 2.5 3.0 3.5
Prediction Error
1
0.030
1
st
1
3
4
las
0.000
st
0.00
1
2
0.03 0.01
−
st
1
3
Large p — Example P2 0.14
0−1 Error
*
− − *
0.12
0.8
0.08 0.06
5 3
4
las
st
3
4
las
st
% of Non−zero coefficients
30
Prediction Error
1
−
1
3
0−1 Error
−
*
− − *
st
1
3
4
las
0.10 st
1
3
4
las
0.00
5
10
0.05
15
SNR=3
20
0.15
25
*
0.20
1
0.04
0.7
10
0.10
SNR=1
0.6 − * st
0.2 0.3 0.4 0.5 0.6 0.7 0.8
% of Non−zero coefficients −
15
0.9
20
Prediction Error
st
1
3
Figure 6: Large p: examples P1 and P2. The three columns of plots represent prediction error (fraction unexplained variance), recovered sparsity and zero-one loss (with standard errors). The first row in each pair has SNR=1, the second SNR=3. Each plot represents the MC+ family for different values of γ (on the log-scale), the LASSO (las), and Step-wise (st)
20
large p — Example P3 % of Non−zero coefficients
0.12
Prediction Error −
0−1 Error
0.10
− * −
0.06
2
0.20
4
0.25
0.08
6
0.30
SNR=1
8
0.35
10
0.40
*
− * st
1
3
4
las
st
3
4
las
st
% of Non−zero coefficients
1
3
4
las
4
las
4
las
4
las
0−1 Error
−
− * −
*
0.08
0.45
10
SNR=3
0.55
15
0.12
20
0.65
Prediction Error
1
−
st
5 1
3
4
las
0.04
0.35
*
st
1
3
4
las
st
1
3
large p — Example P4 1.00
Prediction Error
% of Non−zero coefficients
0−1 Error
−
−
1
3
4
las
0.13
Prediction Error
1
3
4
las
st
% of Non−zero coefficients
3
− − *
*
40
0.20
50
*
1
0−1 Error
−
−
st
1
3
4
las
st
1
3
4
las
0.00
10
0.5
20
0.10
1.0
30
SNR=3
1.5
2.0
* −
0.11 st
0.09
− * st
0
0.85
5
0.90
10
SNR=1
0.95
15
0.15
*
st
1
3
Figure 7: Large p: examples P3 and P4. The three columns of plots represent prediction error (fraction unexplained variance), recovered sparsity and zero-one loss (with standard errors). The first row in each pair has SNR=1, the second SNR=3. Each plot represents the MC+ family for different values of γ (on the log-scale), the LASSO (las), and Step-wise (st)
21
settings, it repeats this behavior, mimicking the best, and is the out-right winner in several situations since best-subset regression is not available. In situations where the LASSO does well, MC+ will often be able to reduce the number of variables for the same prediction performance, by picking a γ in the interior of its range.
7
Other methods of optimization
We shall now look into some state-of-the art methods proposed for optimization with general non-convex penalties.
7.1
Local Quadratic Approximation
Fan & Li (2001) used a local quadratic approximation (LQA) of the non-convex SCAD penalty. For a generic non-convex penalty, this amounts to the approximation P (|t|; λ; γ) ≈ P (|t0|; λ; γ) +
1 2
P (|t0 |; λ; γ) 2 (t − t20 ), for t ≈ t0 |t0 |
(37)
and consequently minimization of the following convex surrogate of Q(β): C+
1 y 2
− Xβ + 2
1 λ 2
p P (|β 0 |; λ; γ) i
i=1
|βi0 |
βi2 .
(38)
Minimization of (38) is equivalent to performing an adaptive/weighted ridge regression. The LQA gets rid of the nonsingularity of the penalty at zero, and depends heavily on the initial choice β 0 ; it is hence suboptimal (Zou & Li 2008) in hunting for sparsity in the coefficient estimates.
7.2
MC+ algorithm and path-seeking
The MC+ algorithm of Zhang (2010) cleverly seeks multiple local minima of the objective function, keeping track of the local convexity of the problem, and then seeks the sparsest solution among them. The algorithm is complex, and since it uses a LARS-type update, is potentially slower than our coordinate-wise algorithms. Our algorithm is very different from theirs. Friedman (2008) proposes a path-seeking algorithm for general non-convex penalties, drawing ideas from boosting. However, apart from a few specific cases (for example orthogonal predictors) it is unknown as to what criterion of the form “Loss+penalty”, it optimizes.
7.3
Multi-stage Local Linear Approximation
Zou & Li (2008), Zhang (2009) and Candes et al. (2008) propose a majorize-minimize (MM) algorithm for minimization of Q(β). They propose a multi-stage LLA of the nonconvex penalties and optimize the re-weighted 1 criterion Q(β; β k ) = Constant + 12 y − Xβ2 + λ
p i=1
22
P (|βˆik |; λ; γ)|βi|
(39)
for some starting point β 1 . There is a currently a lot of attention payed to LLA in the literature, so we analyze and compare it to our coordinate-wise procedure in more detail. Our empirical evidence (see Section 8.2.1) suggests that the starting point is critical; see also Candes et al. (2008). Zou & Li (2008) suggest stopping at k = 2 after starting with an “optimal” estimator, for example, the least-squares estimator when p < n. In fact, choosing an optimal initial estimator β 1 is essentially equivalent to knowing a-priori the zero and non-zero coefficients of β. The adaptive LASSO of Zou (2006) is very similar in spirit to this formulation, though the weights (depending upon a good initial estimator) are allowed to be more general than derivatives of the penalty (39). The convergence properties of the multi-stage LLA (or simply LLA) are outlined below. Since P (|β|; λ; γ) is concave in |β|, P (|β|; λ; γ) ≤ P (|β k |; λ; γ) + P (|βik |; λ; γ)(|β| − |β k |).
(40)
k
Q(β; β ) is a majorization (upper bound) of Q(β). The algorithm produces a sequence β k+1 ≡ β k+1 (λ, γ) such that Q(β k+1 ) ≤ Q(β k ) where β k+1 = arg min Q(β; β k ) β
(41)
Although the sequence {Q(β k )}k≥1 converges, this does not address the convergence properties of {β k }k≥1; understanding the properties of every limit point of the sequence, for example, requires further investigation. Define the map Mlla by β k+1 := Mlla (β k ).
(42)
Zou & Li (2008) point out that if Q(β) = Q(Mlla (β)) is satisfied only by the limit points of the sequence {β k }k≥1, then the limit points are stationary points of the function Q(β). However, the limitations of this analysis are as follows • It appears that solutions of the fixed-point equation Q(β) = Q(Mlla (β)) are not straightforward to characterize, for example by explicit regularity conditions on P (·). • The analysis does not address the fate of all the limit points of the sequence {β k }k≥1 , in case the sufficient conditions of Proposition 1 in Zou & Li (2008) fail to hold true. • The convergence of the sequence {β k }k≥1 is not addressed. Our analysis in Section 5 for SparseNet addresses all the concerns raised above. Empirical evidence suggests that our procedure does a better job in minimization of Q(β) over its counterpart — multi-stage LLA. Motivated by this, we devote the next section to a detailed comparison of LLA with coordinate-wise optimization.
8
Multi-Stage LLA revisited
Here we compare the performance of multi-stage LLA with SparseNet optimization algorithm for the calibrated MC+ penalty. Since our algorithm relies on suitably calibrating the tuning parameters based on df, we use the same calibration for LLA in our comparisons. We refer to it as modified (multi-stage) LLA. 23
Empirical comparison of LLA and SparseNet
8.1
In all our experimental studies, the coordinate-wise algorithm consistently reaches a smaller objective value of Q(β) than does LLA. The sub-optimality is more severe for larger values of λ and smaller γ, when the non-convexity of the problem becomes pronounced. Figure 8[Left] shows a problem instance where the objective value attained by the coordinate-wise method is much smaller than that attained by the LLA method Our “Improvement” criterion is defined as Q(β lla ) − Q(β cw ) Improvement = (43) Q(β cw ) The problem data X, y, β correspond to the usual linear model set-up. Here, (n, p) = (40, 8), SNR = 9.67, β = (3, 3, 0, −5, −5, 0, 4, 1) and X ∼ MVN(0, Σ), where Σ = diag{Σ(0.85; 4), Σ(0.75; 4)} using the notation in Section 6.
1.5
Function Convergence Point Phase Transition
8
λ = 0.273 λ = 0.535 λ = 0.798 λ = 1.06 λ = 1.85 λ = 2.37
4
1.0
6
2 4 3
1 5
2
0.5
Fraction Improvement of C−W
2.0
1 2 3 4 5 6
Phase Transition for Log−penalty 10
Optimization performance: C−W vs LLA
0
0.0
6
0.0
0.5
1.0
1.5
2.0
2.5
3.0
−4
log(γ)
−2
0
log(β)
2
4
Figure 8: [Left]“Improvement” of SparseNet over LLA (for the calibrated MC+ penalty), as defined in (43) for a problem instance, (as a function of γ, for different values of λ) where the x-axis represents the log(γ) axis, and the different lines correspond to different λ values. [Right] Mlla (β) for the univariate log-penalized least squares problem. The starting point β 1 determines which stationary point the LLA sequence will converge to. If the starting point is on the right side of the phase transition line (vertical dotted line), then the LLA will converge to a strictly sub-optimal value of the objective function. In this example, the global minimum of the one-dimensional criterion is at zero.
8.2
Understanding the sub-optimality of LLA through examples
In this section we explore the reasons behind the sub-optimality of LLA through two univariate problem instances. 24
8.2.1
Example 1: log-penalty and the phase-transition phenomenon
Consider the log-penalized least-squares criterion (11) ˜ 2+ Q(1) (β) = 12 (β − β)
λ log(γ|β| + 1) log(γ + 1)
(44)
for some large value of γ such that Q(1) (β) has multiple local minima, but a unique global minimizer. The corresponding (multi-stage) LLA for this problem leads to a sequence {β k }k≥1 defined by β k+1 = arg min
1 (β 2
β
˜ w k ), = S(β,
˜ 2 + w k |β| − β)
with w k =
(γ|β k |
λγ + 1) log(γ + 1)
(45)
Convergence of the sequence of estimates β k corresponds to fixed points of the function λ∗ λγ ˜ β| ˜ − Mlla : β → sgn(β)(| )+ with λ∗ = . (46) (γ|β| + 1) log(γ + 1) ˜ Mlla (β) on β ≥ 0, has multiple fixed points for certain choices of λ, β. ˜ For example, let γ = 500, λ = 74.6 and β = 10; here Q(β), β ≥ 0 has two minima, with the global minimum at zero. Figure 8[Right] shows the iteration map Mlla (β). It has two fixed points, corresponding to the two local minima 0 = β1 < β2 of Q(β), β ≥ 0. There exists a critical value r > 0 such that β 1 < r =⇒ β k → 0 = β1 ,
β 1 > r =⇒ β k → β2 as k → ∞
(47)
This phase transition phenomenon is responsible for the sub-optimality of the LLA in the one-dimensional case. Note that the coordinate-wise procedure applied in this context will give the global minimum. The importance of the initial choice in obtaining “optimal” estimates via the re-weighted 1 minimization has also been pointed out in an asymptotic setting by Huang et al. (2008); our observations are non-asymptotic. 8.2.2
Example 2: MC+ penalty and rate of convergence
Here we study the rate of convergence to the global minimum for the MC+ penalized least squares problem (17). The multi-stage LLA for this problem gives the following updates β k+1 = arg min β
˜ w k ), = S(β,
1 (β 2
˜ 2 + w k |β| − β)
with w k = λ(1 −
k
|β k | )+ γλ
(48)
˜ λ(1 − |β | )+ ) need not be a contraction mapping, hence The function β˜ → S(β, γλ one cannot readily appeal to Banach’s Fixed Point Theorem to obtain a proof of the convergence and a bound on its rate. However, in this example we can justify convergence (see Lemma 1) 25
˜ γ > 1, λ > 0 the sequence {β k+1 }k defined in (48) converges to Lemma 1. Given β, the minimum of Q(1) (β) for the MC+ penalized criterion at a (worst-case) geometric rate |β k − arg min Q(1) (β)| = O(1/γ k ) Sketch of proof. Assume, for the sake of simplicity β˜ ≥ 0, β k ≥ 0. The update β k+1 is given explicitly by ⎧ ˜ 1 k ˜ ≤ β k ≤ λγ ⎨ β − λ + γ β if γ(λ − β) k+1 ˜ > β k ≤ λγ = β 0 if γ(λ − β) ⎩ ˜ β if β k > λγ On iteratively expanding a particular case of the above expression, we obtain β k+1 = (β˜ − λ)(
k
˜ ≤ β i ≤ λγ ∀, 1 ≤ i ≤ k γ −i ) + γ −k−1β 1 , if γ(λ − β)
(49)
i=0
˜ The proof essentially considers different cases based on where β k lies wrt λγ, γ(λ − β) and derives expressions for the sequence β k . The convergence of β k follows in a fairly straightforward way analyzing these expressions. We omit the details for the sake of brevity. The following corollary is a consequence of Lemma 1. Corollary 2. The number of iterations required for the multi-stage LLA algorithm of the univariate MC+ penalized least squares to converge within an tolerance of log() the minimizer of the objective Q(1) (β) is of the order of − log(γ) . The coordinate-wise algorithm applied in this setting converges in exactly one iteration.
8.3
Coordinate-wise descent versus multi-stage LLA: a summary
We give a summary of the main observations from these examples: 1. In the case of the log-penalty, one-dimensional multi-stage LLA optimization can potentially lead to a sub-optimal stationary point if the starting point is not chosen appropriately. The coordinate-wise procedure computes the minimum very efficiently.1 2. For the MC+ penalized problem the multi-stage LLA takes a large number of iterations to converge to the minimum. The coordinate-wise method evaluates the same via a “closed form” update. 1
The stationarity condition of the function shows that the minimum is given by the solution of a simple quadratic equation.
26
Updates of the coordinate-wise algorithm and multi-stage LLA can be analyzed via properties of fixed points of the maps Scw (·) and Mlla (·). If the sequences produced by Mlla (·) or Scw (·) converge, they correspond to the stationary points of the function Q. Fixed points of the maps Mlla (·) and Scw (·) correspond to convergence of the sequence of updates. Some main comparisons of the two methods are listed below 1. Fixed points of the map Mcw (·) are explicitly characterized through some regularity conditions of the penalty functions as outlined in Lemma 3 and Lemma 4. Explicit convergence can be shown under additional conditions as in Theorem 4. For the map Mlla (·), the only formal result appears to be: stationarity of the limit points of the sequence should correspond to the fixed points of Q(β) = Q(Mcw (β)) (see Section 7.3). 2. Suppose the multi-stage LLA procedure returns β ∗ such that Q(β ∗ ) = limk Q(β k ). A one step coordinate-wise cycle on β ∗ , will return Scw (β ∗ ) such that Q(Scw (β ∗ )) ≤ Q(β ∗ ). The inequality is seen to be strict in the example of Section 8.2.1. Hence a fixed point of the map Mlla (·) need not be a fixed point of Scw (·). 3. On the other hand, every stationary point of the coordinate-wise procedure which is necessarily a stationary point of the function Q(·), is also a fixed point of Mlla (·).
A Appendix

A.1 Calibration of MC+ and Monotonicity Properties of λ_S, γλ_S
Proof of Theorem 2 on page 15

Proof. For convenience we rewrite λ_S(λ, γ) in terms of a function f_λ(γ) as

λ_S(λ, γ) = λ f_λ(γ).   (50)

Using (29) and the equality of effective df across all γ for every λ, f_λ(γ) can be defined as a solution of the functional equation

(γ/(γ−1)) Pr(λ f_λ(γ) ≤ |β̃| < γλ f_λ(γ)) + Pr(|β̃| > γλ f_λ(γ)) = Pr(|β̃| > λ),   (51)
for all γ > 1, with f_λ(γ) > 1.   (52)

Since we have assumed σ²/n = 1 and a null generative model, this becomes

(γ/(γ−1)) (Φ(γλ f_λ(γ)) − Φ(λ f_λ(γ))) + (1 − Φ(γλ f_λ(γ))) = 1 − Φ(λ),   (53)

or equivalently

Φ(γλ f_λ(γ)) − γ Φ(λ f_λ(γ)) = −(γ − 1) Φ(λ),   (54)

where Φ is the standard normal cdf and φ the corresponding density. Equation (30) is obtained from (54) and (50).
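As a numerical illustration of this calibration (ours, not part of the proof; scipy assumed), λ_S = λ f_λ(γ) can be obtained directly by solving (54) with a bracketing root finder:

    from scipy.stats import norm
    from scipy.optimize import brentq

    def calibrated_lambda_S(lam, gam):
        # Solve (54): Phi(gam*z) - gam*Phi(z) = -(gam - 1)*Phi(lam) for z = lambda_S.
        g = lambda z: norm.cdf(gam * z) - gam * norm.cdf(z) + (gam - 1.0) * norm.cdf(lam)
        # g(lam) > 0 and g(z) < 0 for large z, so the root lies in the bracket below.
        return brentq(g, lam, 10.0 * lam + 10.0)

    lam_S = calibrated_lambda_S(1.0, 3.0)     # shrinkage threshold for (lambda, gamma) = (1, 3)
    lam_max = 3.0 * lam_S                     # corresponding bias threshold gamma * lambda_S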
Proof of Theorem 3 on page 15
Before we begin the proof, we need a lemma.
Lemma 2. Let λ_H = λ_S(λ, γ = 1+). The function

h(δ) = Φ(δ) − Φ(λ) − δ φ(δ)   (55)

satisfies h(δ) < 0 for all λ < δ < λ_H.

Proof. Observe that for every fixed λ > 0, h(λ) = −λ φ(λ) < 0. In addition, the derivative of h(·) satisfies

h′(δ) = −δ φ′(δ) > 0,   (56)

since φ is the standard normal density and δ > 0. Also, from equation (53), taking γ → 1+, it follows that λ_H φ(λ_H) + 1 − Φ(λ_H) = 1 − Φ(λ), which implies Φ(λ_H) − Φ(λ) − λ_H φ(λ_H) = 0, that is, h(λ_H) = 0. Since h(·) is strictly increasing on [λ, λ_H], with h(λ) < 0 and h(λ_H) = 0, it is necessarily negative on [λ, λ_H). This completes the proof.

Proof of Theorem 3. Proof of part [a]. Let λ′_max denote the partial derivative of λ_max ≡ γλ_S(λ, γ) with respect to γ. Differentiating both sides of equation (30) with respect to γ, we get

λ′_max (φ(γλ_S) − φ(λ_S)) = Φ(λ_S) − Φ(λ) − λ_S φ(λ_S).   (57)
Observe that φ(γλ_S) < φ(λ_S), since γλ_S > λ_S. To complete the proof it suffices to show that Φ(λ_S) − Φ(λ) − λ_S φ(λ_S) < 0. This follows from Lemma 2, which gives h(λ_S) < 0 since λ < λ_S < λ_H.

Proof of part [b]. Let λ′_min denote the partial derivative of λ_min ≡ λ_S(λ, γ) with respect to γ. Differentiating both sides of equation (30) with respect to γ, we get

λ′_min γ (φ(γλ_S) − φ(λ_S)) = Φ(λ_S) − Φ(λ) − λ_S φ(γλ_S).   (58)

Since φ(γλ_S) < φ(λ_S), it suffices to show that the right-hand side of (58) is greater than zero. For a fixed λ > 0 define the function

g(λ_S, γ) = Φ(λ_S) − λ_S φ(γλ_S).   (59)

From (30) it follows that

Φ(λ_S) = ((γ−1)/γ) Φ(λ) + Φ(γλ_S)/γ.   (60)

Using (60), g(λ_S, γ) becomes

g(λ_S, γ) = Φ(λ) − Φ(λ)/γ + Φ(γλ_S)/γ − λ_S φ(γλ_S).   (61)

Consider the function g(λ_S, γ) − Φ(λ) for fixed λ > 0. By Theorem 3 (part [a]), the map γ ↦ λ_max = γλ_S is increasing in γ for fixed λ, so u = γλ_S > λ_H for all γ > 1. Since g(λ_S, γ) − Φ(λ) = h(u)/γ, with h(u) = Φ(u) − u φ(u) − Φ(λ), it is positive iff h(u) > 0 for u > λ_H. Observe that h′(u) = −u φ′(u) > 0 for u > 0, and, taking limits as γ → 1+ in (53), h(λ_H) = 0. Hence h(u) > 0 for u > λ_H, and consequently g(λ_S, γ) > Φ(λ) for all γ > 1. This proves that the right-hand side of (58) is greater than zero, completing the proof of part [b].
A.2 Parametric functional form for λ_S(λ, γ)
We detail the steps leading to the expressions (31, 32) given in Section 4.1.3. For every fixed λ > 0, the function f_λ(γ) (see Appendix A.1) should have the following properties:
1. Monotone increase of the shrinkage thresholds λ_S from soft to hard thresholding: λ f_λ(γ) should increase as γ decreases.
2. Monotone decrease of the bias thresholds γλ_S from soft to hard thresholding: γλ f_λ(γ) should increase as γ increases.
3. The sandwiching property:

λ_S(γ) > λ ∀γ ∈ (1, ∞)  ⟹  f_λ(γ) > 1 ∀γ ∈ (1, ∞).   (62)

4. The df calibration at the two extremes of the family: df(μ̂_{γ=1+,λ}) = df(μ̂_{γ=∞,λ}).
The above considerations suggest a simple parametric form for f_λ(γ):

f_λ(γ) = 1 + τ/γ  for some τ > 0.   (63)

Let us denote λ_S(λ, 1+) = λ_H. Equating the df's for soft and hard thresholding, τ is chosen so that

1 − Φ(λ) = λ_H φ(λ_H) + (1 − Φ(λ_H)),

which implies

(1 + τ)^{−1} = f_λ(γ = 1+)^{−1} = λ_H^{−1} Φ^{−1}(Φ(λ_H) − λ_H φ(λ_H)).

Hence we get

λ_S = λ_H ( 1/(1 + τ) + τ/(γ(1 + τ)) ).

In addition, observe that as γ → ∞, λ_S(λ, γ) → λ_H/(1 + τ) = λ. This gives expression (32).
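The following sketch (ours; scipy assumed) computes λ_H from the hard-thresholding df equation above and then evaluates the closed-form approximation for λ_S(λ, γ):

    from scipy.stats import norm
    from scipy.optimize import brentq

    def lambda_H(lam):
        # Hard-thresholding df calibration: solve Phi(l_H) - l_H * phi(l_H) = Phi(lam) for l_H.
        g = lambda z: norm.cdf(z) - z * norm.pdf(z) - norm.cdf(lam)
        return brentq(g, lam, 10.0 * lam + 10.0)

    def lambda_S_parametric(lam, gam):
        # Closed-form approximation (32): lambda_S = lambda_H * (1/(1+tau) + tau/(gam*(1+tau))).
        lH = lambda_H(lam)
        tau = lH / lam - 1.0            # from lambda_H = lam * (1 + tau)
        return lH * (1.0 / (1.0 + tau) + tau / (gam * (1.0 + tau)))

    # Sanity checks: the value tends to lam as gam -> infinity and to lambda_H(lam) as gam -> 1+.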
A.2.1 Tightening up the initial approximation of λ_S
In order to get a better approximation of the mapping (λ, γ) → (λ_S, γλ_S), we can define a sequence of recursive updates based on equation (30). Theorem 5 establishes the convergence of these updates; the algorithm for computing the improved approximations is given in Section A.2.2.

Theorem 5. With reference to (30), define a sequence of estimates {z_k}_{k≥0} recursively through

(1/(γ−1)) Φ(γ z_k) − (γ/(γ−1)) Φ(z_{k+1}) = −Φ(λ),   (64)
where the iterates {z_k}_{k≥0} are successive approximations of λ_S(λ, γ) for a fixed (λ, γ). Then the sequence {z_k} converges to the solution λ_S of (30).

Sketch of proof. Writing (64) as z_{k+1} = Φ^{−1}((Φ(γ z_k) + (γ−1) Φ(λ))/γ) and applying the mean value theorem, |z_{k+1} − z_k| = (φ(γ ξ_k)/φ(ξ*_k)) |z_k − z_{k−1}| for intermediate points ξ_k, ξ*_k. Under the condition φ(γ ξ_k)/φ(ξ*_k) < 1, which holds here since γ > 1 and the density φ(·) decreases on the positive reals, the updates define a contraction and the sequence converges. This completes the proof.
A.2.2 Algorithm for tightening up the approximation of shrinkage thresholds
For a given (λ, γ) the re-parametrized MC+ penalty is of the form given in (27), where λ_S and γλ_S, defined through (30), are obtained by the following steps:
1. Obtain f_λ(γ) from the starting proposal function (the parametric form of Appendix A.2). This gives initial estimates λ̂_min and λ̂_max = γλ_S via (50).
2. With z_0 = λ̂_min, obtain recursively the sequence of estimates {z_k}_{k≥1} using Theorem 5, until the convergence criterion |df(μ̂_{γ,λ}) − df(μ̂_{γ=1+}(y, λ))| < tol is met.
3. If the sequence z_k → z*, assign λ_S = z* and γλ_S = γ z*.
Theorem 5 gives a formal proof of the convergence of the above algorithm.
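A minimal sketch of steps 2 and 3 (ours; scipy assumed, using the rearranged form of (64) and the simpler stopping rule |z_{k+1} − z_k| < tol in place of the df criterion above):

    from scipy.stats import norm

    def tighten_lambda_S(lam, gam, z0, tol=1e-10, max_iter=200):
        # Iterate (64), rearranged as Phi(z_{k+1}) = (Phi(gam*z_k) + (gam-1)*Phi(lam)) / gam,
        # starting from z0, an initial estimate of lambda_S (e.g. the parametric form (32)).
        z = z0
        for _ in range(max_iter):
            z_next = norm.ppf((norm.cdf(gam * z) + (gam - 1.0) * norm.cdf(lam)) / gam)
            converged = abs(z_next - z) < tol
            z = z_next
            if converged:
                break
        return z, gam * z               # (lambda_S, gam * lambda_S) = (lambda_min, lambda_max)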
A.3 Convergence proofs for Algorithm 1
Proof of Theorem 4 on page 16

Observe that the sequence β^k is bounded; we will require this at several points in the proofs.

Lemma 3. Suppose the following assumptions hold:
1. The penalty function P(t) (symmetric about 0) is differentiable on t ≥ 0, and P′(|t|) is non-negative, continuous and uniformly bounded.
2. For every convergent subsequence {β^{n_k}}_k ⊂ {β^k}_k the successive differences converge to zero: β^{n_k} − β^{n_k−1} → 0.
If β^∞ is any limit point of the sequence {β^k}_k, then β^∞ is a minimum for the function Q(β); that is, for any δ = (δ_1, ..., δ_p) ∈ ℝ^p,

lim inf_{α↓0+} [Q(β^∞ + αδ) − Q(β^∞)]/α ≥ 0.   (69)

Proof. For any β = (β_1, ..., β_p) and δ^i = (0, ..., δ_i, ..., 0) ∈ ℝ^p,

lim inf_{α↓0+} [Q(β + αδ^i) − Q(β)]/α = ∇_i f(β) δ_i + lim inf_{α↓0+} [P(|β_i + αδ_i|) − P(|β_i|)]/α,   (70)

where f(β) = ½‖y − Xβ‖²_2 and ∇_i denotes the partial derivative with respect to the ith coordinate. The second term in (70) simplifies to

∂P(β_i; δ_i) := lim inf_{α↓0+} [P(|β_i + αδ_i|) − P(|β_i|)]/α = P′(|β_i|) sgn(β_i) δ_i  if |β_i| > 0;  P′(0) |δ_i|  if β_i = 0,   (71)

where

sgn(x) ∈ {1} if x > 0;  {−1} if x < 0;  [−1, 1] if x = 0.

Assume β^{n_k} → β^∞ = (β_1^∞, ..., β_p^∞). Using (71) and Assumption 2, as k → ∞,

β_i^{n_k−1} := (β_1^{n_k}, ..., β_{i−1}^{n_k}, β_i^{n_k}, β_{i+1}^{n_k−1}, ..., β_p^{n_k−1}) → (β_1^∞, ..., β_{i−1}^∞, β_i^∞, β_{i+1}^∞, ..., β_p^∞).   (72)

If β_i^∞ ≠ 0, then ∂P(β_i^{n_k}; δ_i) → ∂P(β_i^∞; δ_i); if β_i^∞ = 0, then ∂P(β_i^∞; δ_i) ≥ lim inf_k ∂P(β_i^{n_k}; δ_i).
By the definition of a coordinate-wise minimum, using (70) and (71) we have

∇_i f(β_i^{n_k−1}) δ_i + ∂P(β_i^{n_k}; δ_i) ≥ 0  for all k.   (73)

The above imply, for every i ∈ {1, ..., p},

∇_i f(β^∞) δ_i + ∂P(β_i^∞; δ_i) ≥ lim inf_k { ∇_i f(β_i^{n_k−1}) δ_i + ∂P(β_i^{n_k}; δ_i) } ≥ 0   (74)

by (73). Using (70) and (71), the left-hand side of (74) gives

lim inf_{α↓0+} [Q(β^∞ + αδ^i) − Q(β^∞)]/α ≥ 0.   (75)

For δ = (δ_1, ..., δ_p) ∈ ℝ^p, by the differentiability of f(β) = ½‖y − Xβ‖²_2,

lim inf_{α↓0+} [Q(β^∞ + αδ) − Q(β^∞)]/α
= Σ_{j=1}^p ∇_j f(β^∞) δ_j + lim inf_{α↓0+} Σ_{j=1}^p [P(|β_j^∞ + αδ_j|) − P(|β_j^∞|)]/α
= Σ_{j=1}^p { ∇_j f(β^∞) δ_j + lim_{α↓0+} [P(|β_j^∞ + αδ_j|) − P(|β_j^∞|)]/α }
= Σ_{j=1}^p lim inf_{α↓0+} [Q(β^∞ + αδ^j) − Q(β^∞)]/α ≥ 0   (76)

by (75). This completes the proof.
Lemma 4 gives a sufficient condition under which Assumption 2 of Lemma 3 holds.

Lemma 4. Suppose that, in addition to Assumption 1 of Lemma 3, the following assumptions hold:
1. For every i = 1, ..., p and (u_1, ..., u_{i−1}, u_{i+1}, ..., u_p) ∈ ℝ^{p−1}, the univariate function

χ^i_{(u_1,...,u_{i−1},u_{i+1},...,u_p)}(·) : u ↦ Q(u_1, ..., u_{i−1}, u, u_{i+1}, ..., u_p)   (77)

has a unique global minimum.
2. If u_G ≡ u_G(u_1, ..., u_{i−1}, u_{i+1}, ..., u_p) is the global minimum and u_L ≡ u_L(u_1, ..., u_{i−1}, u_{i+1}, ..., u_p) is any other local minimum of the function (77), then there exists ε > 0, independent of the choice of (u_1, ..., u_{i−1}, u_{i+1}, ..., u_p), such that |u_G(u_1, ..., u_{i−1}, u_{i+1}, ..., u_p) − u_L(u_1, ..., u_{i−1}, u_{i+1}, ..., u_p)| ≥ ε; that is, u_G and u_L are uniformly separated.
Then for every convergent subsequence {β^{n_k}}_k ⊂ {β^k}_k, the successive differences β^{n_k} − β^{n_k−1} → 0; that is, Assumption 2 of Lemma 3 holds.

Proof. Borrowing notation from (72), consider the sequence {β_{p−1}^{n_k−1}},

β_{p−1}^{n_k−1} := (β_1^{n_k}, ..., β_{p−2}^{n_k}, β_{p−1}^{n_k}, β_p^{n_k−1}),   (78)

which differs from β^{n_k} in the pth coordinate. We will show that β_{p−1}^{n_k−1} − β^{n_k} → 0 by contradiction. Writing β^{n_k} − β_{p−1}^{n_k−1} = Δ_p^{n_k} := (0, ..., 0, Δ_p^{n_k}), we have

Q(β^{n_k} − Δ_p^{n_k}) − Q(β^{n_k}) ↓ 0  as  n_k → ∞.

As k → ∞, let β^{n_k} → β^+. Suppose Δ_p^{n_k} ↛ 0; passing to a subsequence if necessary, we have Δ_p^{n_k} → Δ_p^+ ≠ 0 and β^{n_k} − Δ_p^{n_k} → β^+ − Δ_p^+. Furthermore, the continuity of the function Q(·) implies

Q(β^{n_k} − Δ_p^{n_k}) − Q(β^{n_k}) ↓ Q(β^+ − Δ_p^+) − Q(β^+) = 0.   (79)

By the definition of β^{n_k} := (β_1^{n_k}, ..., β_p^{n_k}) we have

∇_p f(β^{n_k}) δ_p + ∂P(β_p^{n_k}; δ_p) ≥ 0  for all k.

Passing to the limit k → ∞ we get (using arguments as in the proof of Lemma 3)

∇_p f(β^+) δ_p + ∂P(β_p^+; δ_p) ≥ 0,

where β^+ = (β_1^+, ..., β_p^+). This implies that χ^p_{(β_1^+,...,β_{p−1}^+)}(u) has a minimum at u = β_p^+. By Assumption 2 of this lemma the global minimum u_G is uniformly separated from the other local minima u_L; hence β_p^+ is the global minimum. But (79) implies that χ^p_{(β_1^+,...,β_{p−1}^+)}(u) has two distinct global minima, at u = β_p^+ and u = β_p^+ − Δ_p^+, a contradiction to Assumption 1. Hence Δ_p^{n_k} → 0 and β_p^{n_k−1} − β_p^{n_k} → 0.

In the same spirit, consider the sequence (80), differing from β_{p−1}^{n_k−1} in the (p−1)th coordinate,

β_{p−2}^{n_k−1} := (β_1^{n_k}, ..., β_{p−3}^{n_k}, β_{p−2}^{n_k}, β_{p−1}^{n_k−1}, β_p^{n_k−1}).   (80)

We will show that β_{p−2}^{n_k−1} − β_{p−1}^{n_k−1} → 0, which, along with β_{p−1}^{n_k−1} − β^{n_k} → 0 (shown above), will imply β_{p−2}^{n_k−1} − β^{n_k} → 0. The proof of this part is similar to the above. Write β_{p−1}^{n_k−1} − β_{p−2}^{n_k−1} = Δ_{p−1}^{n_k} := (0, ..., 0, Δ_{p−1}^{n_k}, 0) and suppose Δ_{p−1}^{n_k} ↛ 0; we will arrive at a contradiction. By hypothesis there is a subsequence of Δ_{p−1}^{n_k} converging to Δ_{p−1}^+ ≠ 0. Passing to a further subsequence if necessary, and using Δ_{p−1}^{n_k} → Δ_{p−1}^+, we have

Q(β_{p−1}^{n_k−1} − Δ_{p−1}^{n_k}) − Q(β_{p−1}^{n_k−1}) ↓ Q(β^+ − Δ_{p−1}^+) − Q(β^+) = 0.   (81)

This implies that the map χ^{p−1}_{(β_1^+,...,β_{p−2}^+, β_p^+)}(u) has two distinct global minima, at u = β_{p−1}^+ and u = β_{p−1}^+ − Δ_{p−1}^+, which is a contradiction; hence Δ_{p−1}^{n_k} → 0.
We can follow the above arguments inductively for the finitely many remaining steps i = p−3, p−4, ..., 1 and show that Δ_i^{n_k} → 0 for every i. This shows that β^{n_k−1} − β^{n_k} → 0, thereby completing the proof.
Proof of Theorem 4. Lemmas 3 and 4 characterize the limit points of the sequence produced by the coordinate-wise updates under milder conditions than those required by Theorem 4; the uniform separation condition (Assumption 2 of Lemma 4) is a trivial consequence of the strict convexity of the univariate functions assumed in Theorem 4, so the lemmas are more general.

Proof. For fixed i and (β_1, ..., β_{i−1}, β_{i+1}, ..., β_p) we write χ^i_{(β_1,...,β_{i−1},β_{i+1},...,β_p)}(u) ≡ χ(u) for notational convenience. Recall that ∂χ(u) is a sub-gradient (Borwein & Lewis 2006) of χ at u if

χ(u + δ) − χ(u) ≥ ∂χ(u) δ.   (82)

The sub-gradient(s) are given by

∂χ(u) = ∇_i f((β_1, ..., β_{i−1}, u, β_{i+1}, ..., β_p)) + P′(|u|) sgn(u).   (83)

Also observe that

χ(u + δ) − χ(u) = {f((β_1, ..., β_{i−1}, u + δ, β_{i+1}, ..., β_p)) − f((β_1, ..., β_{i−1}, u, β_{i+1}, ..., β_p))} + {P(|u + δ|) − P(|u|)}   (84)
= {∇_i f((β_1, ..., β_{i−1}, u, β_{i+1}, ..., β_p)) δ + ½ ∇²_i f δ²} + {P′(|u|)(|u + δ| − |u|) + ½ P″(|u*|)(|u + δ| − |u|)²}.   (85)

Line (85) is obtained from (84) by a Taylor series expansion of f with respect to the ith coordinate and of the twice differentiable function |t| ↦ P(|t|); ∇²_i f is the second derivative of f with respect to the ith coordinate and equals 1 since the features are standardized, and |u*| is some number between |u + δ| and |u|. The right-hand side of (85) can be simplified further by rearranging terms:

∇_i f((β_1, ..., β_{i−1}, u, β_{i+1}, ..., β_p)) δ + ½ ∇²_i f δ² + P′(|u|)(|u + δ| − |u|) + ½ P″(|u*|)(|u + δ| − |u|)²
= ½ ∇²_i f δ² + {∇_i f((β_1, ..., β_{i−1}, u, β_{i+1}, ..., β_p)) δ + P′(|u|) sgn(u) δ}
+ {P′(|u|)(|u + δ| − |u|) − P′(|u|) sgn(u) δ} + ½ P″(|u*|)(|u + δ| − |u|)².   (86)

Since χ(u) is minimized at u_0, we have 0 ∈ ∂χ(u_0). Thus, using (83),

∇_i f((β_1, ..., β_{i−1}, u_0, β_{i+1}, ..., β_p)) δ + P′(|u_0|) sgn(u_0) δ = 0;   (87)

if u_0 = 0 the above holds for some value of sgn(u_0) ∈ [−1, 1]. By the definition of the sub-gradient of x ↦ |x| and since P′(|x|) ≥ 0, we have

P′(|u|)(|u + δ| − |u|) − P′(|u|) sgn(u) δ = P′(|u|)((|u + δ| − |u|) − sgn(u) δ) ≥ 0.   (88)

Using (87) and (88) in (86) and (85) at u = u_0, we have

χ(u_0 + δ) − χ(u_0) ≥ ½ ∇²_i f δ² + ½ P″(|u*|)(|u_0 + δ| − |u_0|)².   (89)

Observe that (|u_0 + δ| − |u_0|)² ≤ δ². If P″(|u*|) ≤ 0, then ½ P″(|u*|)(|u_0 + δ| − |u_0|)² ≥ ½ P″(|u*|) δ²; if P″(|u*|) ≥ 0, then ½ P″(|u*|)(|u_0 + δ| − |u_0|)² ≥ 0. Combining these two cases we get

χ(u_0 + δ) − χ(u_0) ≥ ½ δ² (∇²_i f + min{P″(|u*|), 0}).   (90)

By the conditions of this theorem, ∇²_i f + inf_x P″(|x|) > 0. Hence there exists a θ > 0,

θ = ½ {∇²_i f + min{inf_x P″(|x|), 0}} = ½ {1 + min{inf_x P″(|x|), 0}} > 0,

independent of (β_1, ..., β_{i−1}, β_{i+1}, ..., β_p), such that

χ(u_0 + δ) − χ(u_0) ≥ θ δ².   (91)

(For the MC+ penalty, inf_x P″(|x|) = −1/γ with γ > 1; for the SCAD, inf_x P″(|x|) = −1/(γ − 1) with γ > 2.)
In fact the dependence on u_0 in (91) can be relaxed further. Following the above arguments carefully, we see that the function χ(u) is strictly convex (Borwein & Lewis 2006), as it satisfies

χ(u + δ) − χ(u) − ∂χ(u) δ ≥ θ δ².   (92)

We use this result to show the convergence of {β^k}. Consider the sequence of coordinate-wise updates; using (91) we have

Q(β_i^{m−1}) − Q(β_{i+1}^{m−1}) ≥ θ (β_{i+1}^{m−1} − β_{i+1}^m)² = θ ‖β_i^{m−1} − β_{i+1}^{m−1}‖²_2,   (93)

where β_i^{m−1} := (β_1^m, ..., β_i^m, β_{i+1}^{m−1}, ..., β_p^{m−1}). Using (93) repeatedly across every coordinate, we have for every m

Q(β^m) − Q(β^{m+1}) ≥ θ ‖β^{m+1} − β^m‖²_2.   (94)

Since the (decreasing) sequence Q(β^m) converges, it follows easily from (94) that the sequence β^k cannot cycle without converging, i.e. it must have a unique limit point. This completes the proof of convergence of β^k.
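As an empirical illustration of the sufficient-decrease inequality (94) (our sketch, not from the paper; numpy assumed, columns standardized so that ∇²_i f = 1, and θ = (1 − 1/γ)/2 for the MC+ penalty), full cycles of coordinate-wise MC+ updates on a small random problem can be checked against the bound:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, lam, gam = 50, 10, 0.5, 3.0
    X = rng.standard_normal((n, p))
    X /= np.linalg.norm(X, axis=0)            # standardize: each column has unit norm
    y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + 0.1 * rng.standard_normal(n)

    def mc_penalty(beta):
        # MC+ penalty: lam*|t| - t^2/(2*gam) for |t| <= gam*lam, constant gam*lam^2/2 beyond
        a = np.abs(beta)
        return np.where(a <= gam * lam, lam * a - a ** 2 / (2 * gam), gam * lam ** 2 / 2).sum()

    def Q(beta):
        return 0.5 * np.sum((y - X @ beta) ** 2) + mc_penalty(beta)

    def mcplus_threshold(b):
        # univariate MC+ minimizer of 0.5*(b - u)^2 + penalty (gam > 1)
        if abs(b) <= gam * lam:
            return np.sign(b) * max(abs(b) - lam, 0.0) / (1.0 - 1.0 / gam)
        return b

    beta, theta = np.zeros(p), 0.5 * (1.0 - 1.0 / gam)    # theta = (1 + inf P'')/2 for MC+
    for m in range(20):
        beta_old, Q_old = beta.copy(), Q(beta)
        for i in range(p):
            r = y - X @ beta + X[:, i] * beta[i]          # partial residual for coordinate i
            beta[i] = mcplus_threshold(X[:, i] @ r)
        # inequality (94): per-cycle decrease dominates theta * squared step length
        assert Q_old - Q(beta) >= theta * np.sum((beta - beta_old) ** 2) - 1e-10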
References

Borwein, J. & Lewis, A. (2006), Convex Analysis and Nonlinear Optimization, Springer.
Breiman, L. (1996), 'Heuristics of instability and stabilization in model selection', Annals of Statistics 24, 2350–2383.
Candes, E. J., Wakin, M. B. & Boyd, S. (2008), 'Enhancing sparsity by reweighted l1 minimization', Journal of Fourier Analysis and Applications 14(5), 877–905.
Chen, S. S., Donoho, D. & Saunders, M. (1998), 'Atomic decomposition by basis pursuit', SIAM Journal on Scientific Computing 20(1), 33–61.
Donoho, D. (2006), 'For most large underdetermined systems of equations, the minimal l1-norm solution is the sparsest solution', Communications on Pure and Applied Mathematics 59, 797–829.
Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. (2004), 'Least angle regression (with discussion)', Annals of Statistics 32(2), 407–499.
Fan, J. & Li, R. (2001), 'Variable selection via nonconcave penalized likelihood and its oracle properties', Journal of the American Statistical Association 96(456), 1348–1360.
Frank, I. & Friedman, J. (1993), 'A statistical view of some chemometrics regression tools (with discussion)', Technometrics 35(2), 109–148.
Friedman, J. (2008), Fast sparse regression and classification, Technical report, Department of Statistics, Stanford University.
Friedman, J., Hastie, T., Hoefling, H. & Tibshirani, R. (2007), 'Pathwise coordinate optimization', Annals of Applied Statistics 2(1), 302–332.
Gao, H.-Y. & Bruce, A. G. (1997), 'Waveshrink with firm shrinkage', Statistica Sinica 7, 855–874.
Huang, J., Ma, S. & Zhang, C.-H. (2008), 'Adaptive lasso for sparse high-dimensional regression models', Statistica Sinica 18, 1603–1618.
Knight, K. & Fu, W. (2000), 'Asymptotics for lasso-type estimators', Annals of Statistics 28(5), 1356–1378.
Meinshausen, N. & Bühlmann, P. (2006), 'High-dimensional graphs and variable selection with the lasso', Annals of Statistics 34, 1436–1462.
Nikolova, M. (2000), 'Local strong homogeneity of a regularized estimator', SIAM Journal on Applied Mathematics 61, 633–658.
Osborne, M., Presnell, B. & Turlach, B. (2000), 'A new approach to variable selection in least squares problems', IMA Journal of Numerical Analysis 20, 389–404.
She, Y. (2009), 'Thresholding-based iterative selection procedures for model selection and shrinkage', Electronic Journal of Statistics.
Stein, C. (1981), 'Estimation of the mean of a multivariate normal distribution', Annals of Statistics 9, 1131–1151.
Tibshirani, R. (1996), 'Regression shrinkage and selection via the lasso', Journal of the Royal Statistical Society, Series B 58, 267–288.
Tseng, P. & Yun, S. (2009), 'A coordinate gradient descent method for nonsmooth separable minimization', Mathematical Programming B 117, 387–423.
Zhang, C.-H. (2010), 'Nearly unbiased variable selection under minimax concave penalty', Annals of Statistics (to appear).
Zhang, C.-H. & Huang, J. (2008), 'The sparsity and bias of the lasso selection in high-dimensional linear regression', Annals of Statistics 36(4), 1567–1594.
Zhang, T. (2009), Multi-stage convex relaxation for non-convex optimization, Technical report, Rutgers University.
Zhao, P. & Yu, B. (2006), 'On model selection consistency of lasso', Journal of Machine Learning Research 7, 2541–2563.
Zou, H. (2006), 'The adaptive lasso and its oracle properties', Journal of the American Statistical Association 101, 1418–1429.
Zou, H., Hastie, T. & Tibshirani, R. (2007), 'On the degrees of freedom of the lasso', Annals of Statistics 35(5), 2173–2192.
Zou, H. & Li, R. (2008), 'One-step sparse estimates in nonconcave penalized likelihood models', The Annals of Statistics 36(4), 1509–1533.