Gradient LASSO algorithm

Jinseog Kim, Seoul National University, Seoul, Korea.
Yuwon Kim, University of Minnesota, Minneapolis, USA.
Yongdai Kim, Seoul National University, Seoul, Korea.

Address for correspondence: Yuwon Kim, School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA. E-mail: [email protected]

Summary. LASSO is a useful method for achieving shrinkage and variable selection simultaneously. The main idea of LASSO is to use an L1 constraint in the regularization step. Starting from linear models, this idea has been applied to various models such as wavelets, kernel machines, smoothing splines and multiclass logistic models; we call such models generalized LASSO models. In this paper, we propose a new computational algorithm for generalized LASSO, called the gradient LASSO algorithm, which is computationally much simpler and more stable than QP-based algorithms and hence easily applicable to large dimensional data. We also derive the convergence rate of the algorithm, which does not depend on the dimension of the inputs. Simulations show that the proposed algorithm is fast enough for practical purposes and provides reliable results. To illustrate its computing power with high dimensional data, we analyze two real data sets using the structural sparse kernel machine and the multiclass logistic regression model.

Keywords: Gradient descent, LASSO, Multiclass logistic model, Sparse kernel machines, Variable selection
1. Introduction

Tibshirani (1996) introduced an interesting method for shrinkage and variable selection in linear models, called the "Least Absolute Shrinkage and Selection Operator" (LASSO). It achieves better prediction accuracy by shrinkage, as ridge regression does, but at the same time it gives a sparse solution, meaning that some regression coefficients are exactly zero. Hence, LASSO can be thought of as achieving shrinkage and variable selection simultaneously. The main idea of LASSO is to use an L1 constraint in the regularization step: the estimator is obtained by minimizing the empirical risk subject to the L1 norm of the regression coefficients being bounded by a given positive number. Starting from linear models, the idea of LASSO, namely the use of the L1 constraint, has been applied to various models such as wavelets (Chen et al., 1999; Bakin, 1999), kernel machines (Gunn and Kandola, 2002; Roth, 2004), smoothing splines (Zhang et al., 2003) and multiclass logistic regressions (Krishnapuram et al., 2004). These models extend the standard LASSO in two ways: (i) the loss function can be a general convex function other than the squared error loss, and
(ii) there is more than one L1 constraint in the regularization step (Gunn and Kandola, 2002; Zhang et al., 2003). We call such models generalized LASSO models and give several examples in Section 2.

One important issue in generalized LASSO is that the objective function is not differentiable because of the L1 constraints, and hence special optimization techniques are needed. Tibshirani (1996) used quadratic programming (QP) for least squares regressions and the iteratively reweighted least squares (IRLS) procedure with QP for logistic regressions. Osborne et al. (2000a) proposed a faster QP algorithm for linear regressions, which was extended to logistic regressions by Lokhorst et al. (1999) and Roth (2004). Recently, Efron et al. (2004) developed an algorithm called LARS for linear regressions, which is shown to be closely related to Osborne et al. (2000a)'s algorithm.

The aforementioned algorithms, however, may not be easily applicable to data with large dimensional inputs. Examples are the structural sparse kernel machine (see Example 3 in Section 2 for details) and microarray data. This computational drawback is mainly due to the fact that these algorithms need repeated matrix inversions, whose computing cost increases very quickly as the dimension of the inputs increases. Moreover, when the design matrix is singular, the matrix constructed inside the algorithms can be singular, in which case the algorithm fails to converge. See Section 3.1 for details. Also, these algorithms are not directly applicable to generalized LASSO models with more than one L1 constraint.

There have been some attempts to resolve these computational issues in generalized LASSO. Perkins et al. (2003) developed a stagewise gradient descent algorithm called grafting. However, the grafting algorithm requires high dimensional nonlinear constrained optimization inside each iteration, which can be computationally demanding. Grandvalet and Canu (1999) implemented a fixed point algorithm using the equivalence between adaptive ridge regression and LASSO. This algorithm, however, may not guarantee global convergence.

In this paper, we propose a gradient descent algorithm for generalized LASSO called the gradient LASSO algorithm. The proposed algorithm does not require matrix inversions, so it can be applied easily to data sets with large dimensional inputs. Moreover, the algorithm always converges to the global optimum.

The paper is organized as follows. In Section 2, we define the set-up of generalized LASSO and give four examples. In Section 3, we review various algorithms for generalized LASSO. In Section 4, we propose the gradient LASSO algorithm and study its theoretical properties. Section 5 compares the proposed algorithm with Osborne et al. (2000a)'s QP algorithm in the standard LASSO framework by simulation. The proposed algorithm is applied to a structural sparse kernel machine and a multiclass logistic regression in Sections 6 and 7, respectively. Concluding remarks follow in Section 8.

2. The set-up of generalized LASSO
We first present the set-up of generalized LASSO and give four examples. Let $(y_1, x_1), \ldots, (y_n, x_n)$ be $n$ output/input pairs, where $y_i \in \mathcal{Y}$ and $x_i = (x_{i1}, \ldots, x_{ip})' \in \mathcal{X}$, with $\mathcal{X}$ a subspace of $\mathbb{R}^p$, the $p$-dimensional Euclidean space. Let $\beta = (\beta_1, \ldots, \beta_p)$ be the corresponding regression coefficients. Suppose that $\beta$ is composed of $d$ subvectors $\beta^1, \ldots, \beta^d$. That is, $\beta' = (\beta^{1\prime}, \ldots, \beta^{d\prime})$, where $\beta^l = (\beta^l_1, \ldots, \beta^l_{p_l})'$ is a $p_l$-dimensional vector for $l = 1, \ldots, d$ with $\sum_{l=1}^{d} p_l = p$. For a given loss function $L : \mathcal{Y} \times \mathcal{X} \times \mathbb{R}^p \to \mathbb{R}$, the objective of generalized LASSO is to
find $\beta$ which minimizes the empirical risk
\[
R(\beta) = \sum_{i=1}^{n} L(y_i, x_i, \beta) \tag{1}
\]
subject to $\|\beta^l\|_1 \le \lambda_l$ for $l = 1, \ldots, d$, where $\|\beta^l\|_1 = \sum_{k=1}^{p_l} |\beta^l_k|$.
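For concreteness, the block structure and the constraints above can be written down in code; the following is a minimal Python sketch with hypothetical names (`empirical_risk`, `is_feasible`, `blocks`), where `loss` stands for a generic convex loss $L(y, x, \beta)$:

    import numpy as np

    def empirical_risk(beta, X, y, loss):
        # R(beta) = sum_{i=1}^n L(y_i, x_i, beta), cf. (1)
        return sum(loss(y[i], X[i], beta) for i in range(len(y)))

    def is_feasible(beta, blocks, lambdas):
        # blocks[l] holds the indices of the subvector beta^l;
        # the constraint is ||beta^l||_1 <= lambda_l for every l = 1, ..., d
        return all(np.abs(beta[idx]).sum() <= lam
                   for idx, lam in zip(blocks, lambdas))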
Example 1. Multiple linear regression
Let $(y_1, z_1), \ldots, (y_n, z_n)$ be $n$ observations, where the $y_i$ are outputs and the $z_i$ are $p$-dimensional inputs. A multiple linear regression model is given by
\[
y_i = \beta_1 z_{i1} + \cdots + \beta_p z_{ip} + \epsilon_i, \tag{2}
\]
where the $\epsilon_i$ are assumed to be zero-mean random quantities. The LASSO estimate $\hat{\beta}_1, \ldots, \hat{\beta}_p$ for the linear regression model (2) is the minimizer of
\[
\sum_{i=1}^{n} \Big( y_i - \sum_{k=1}^{p} \beta_k z_{ik} \Big)^2
\]
subject to $\sum_{k=1}^{p} |\beta_k| \le \lambda$. This problem can be embedded into the generalized LASSO set-up with $d = 1$, $x = z$ and $L(y, x, \beta) = (y - \sum_{k=1}^{p} \beta_k x_k)^2$.

Remark. The model (2) is not standard since there is no intercept term. The standard multiple linear regression model is
\[
y_i = \beta_0 + \beta_1 z_{i1} + \cdots + \beta_p z_{ip} + \epsilon_i. \tag{3}
\]
The LASSO estimates $\hat{\beta}_0$ and $\hat{\beta}_1, \ldots, \hat{\beta}_p$ can then be obtained by minimizing
\[
\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{k=1}^{p} \beta_k z_{ik} \Big)^2
\]
subject to $\sum_{k=1}^{p} |\beta_k| \le \lambda$. Note that $\beta_0$ is unconstrained. This LASSO problem can be embedded into the generalized LASSO set-up by letting $d = 2$, $x_i = (1, z_i)$, $\beta^1 = (\beta_0)$, $\beta^2 = (\beta_1, \ldots, \beta_p)$ and $\lambda_1 = \infty$, $\lambda_2 = \lambda$.

Example 2. Multiple logistic regression
A multiple logistic regression model for two-class classification problems is given by
\[
\log \frac{\Pr(y_i = 1 \mid z_i)}{1 - \Pr(y_i = 1 \mid z_i)} = f(z_i), \tag{4}
\]
where $f(z_i) = \beta_1 z_{i1} + \cdots + \beta_p z_{ip}$. Here, $y_i \in \{0, 1\}$. The LASSO estimate $\hat{\beta}_1, \ldots, \hat{\beta}_p$ is the minimizer of the negative binomial log-likelihood
\[
\sum_{i=1}^{n} \Big[ -y_i f(z_i) + \log\big( 1 + e^{f(z_i)} \big) \Big]
\]
subject to $\sum_{k=1}^{p} |\beta_k| \le \lambda$. The generalized LASSO setting of a multiple logistic regression is the same as that of the multiple linear regression except that the loss function is given by
\[
L(y, x, \beta) = -y \Big( \sum_{k=1}^{p} \beta_k x_k \Big) + \log\Big\{ 1 + \exp\Big( \sum_{k=1}^{p} \beta_k x_k \Big) \Big\}. \tag{5}
\]
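The loss functions of Examples 1 and 2 can be coded directly; the following is a small illustrative Python sketch (the function names are ours, and in practice the exponential in the logistic loss would be guarded against overflow):

    import numpy as np

    def squared_error_loss(y, x, beta):
        # Example 1: L(y, x, beta) = (y - sum_k beta_k x_k)^2
        return (y - x @ beta) ** 2

    def logistic_loss(y, x, beta):
        # Example 2, loss (5): -y * f + log(1 + exp(f)) with f = x' beta
        f = x @ beta
        return -y * f + np.log1p(np.exp(f))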
Example 3. Structural sparse kernel machine
The structural sparse kernel machine with second-order interaction terms for classification is basically the same as (4) except that $f(z)$ is modelled via the functional ANOVA (analysis of variance) decomposition
\[
f(z) = \sum_{j=1}^{p} f_j(z_j) + \sum_{j<k} f_{jk}(z_j, z_k). \tag{6}
\]

[...]

Advantages of the stagewise forward selection algorithm are that the computation is simple and that it gives a regularized solution path as a by-product. However, it may not converge to the global optimum.
4. The gradient LASSO algorithm
In this section, we present a gradient descent algorithm to solve the optimization problem of generalized LASSO. Recall that this problem is to minimize the empirical risk $R(\beta)$ subject to $\|\beta^l\|_1 \le \lambda_l$ for $l = 1, \ldots, d$, where $\beta = (\beta^1, \ldots, \beta^d)$. We first present the coordinatewise gradient descent (CGD) algorithm and prove its convergence, and then present the gradient LASSO algorithm, which speeds up the CGD algorithm by adding a deletion step.
4.1. Coordinatewise gradient descent algorithm
Let $w^l = \beta^l / \lambda_l$. The optimization problem of generalized LASSO is equivalent to minimizing $C(w) = R(\lambda_1 w^1, \ldots, \lambda_d w^d)$ subject to $\|w^l\|_1 \le 1$, $l = 1, \ldots, d$. To describe the CGD algorithm, we introduce some notation. Let $S^l = \{w^l : \|w^l\|_1 \le 1\}$ for $l = 1, \ldots, d$ and $S = \{w = (w^{1\prime}, \ldots, w^{d\prime})' : w^l \in S^l\}$. Then, the optimization problem of generalized LASSO is to find $\hat{w}$ in $S$ such that
\[
\hat{w} = \arg\min_{w \in S} C(w).
\]
The main idea of the CGD algorithm is to find $\hat{w}$ sequentially as follows. For a given $v^l \in S^l$ and $\alpha \in [0, 1]$, let $w[\alpha, v^l] = (w^1, \ldots, w^{l-1}, w^l + \alpha(v^l - w^l), w^{l+1}, \ldots, w^d)$. Suppose that $w$ is the current solution. Then, for each $l$, the proposed algorithm searches for a direction vector $v^l \in S^l$ such that $C(w[\alpha, v^l])$ decreases most rapidly, and updates $w$ to $w[\alpha, v^l]$. Note that $w[\alpha, v^l]$ is still in $S$. A Taylor expansion implies
\[
C(w[\alpha, v^l]) \approx C(w) + \alpha \langle \nabla(w)^l, v^l - w^l \rangle,
\]
where $\nabla(w)^l = (\nabla(w)^l_1, \ldots, \nabla(w)^l_{p_l})$, $\nabla(w)^l_k = \partial C(w) / \partial w^l_k$ and $\langle \cdot, \cdot \rangle$ is the inner product. Moreover, it can easily be shown that
\[
\min_{v^l \in S^l} \langle \nabla(w)^l, v^l \rangle = \min_{k = 1, \ldots, p_l} \min\{\nabla(w)^l_k, -\nabla(w)^l_k\}.
\]
Hence, the desired direction is the vector in $\mathbb{R}^{p_l}$ whose $\hat{k}$-th element is $-\mathrm{sign}(\nabla(w)^l_{\hat{k}})$ and whose other elements are 0, where
\[
\hat{k} = \arg\min_{k = 1, \ldots, p_l} \min\{\nabla(w)^l_k, -\nabla(w)^l_k\}.
\]
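In code, this direction search amounts to picking the entry of the block gradient with the largest absolute value; a minimal Python sketch (our own helper name) for a single block is:

    import numpy as np

    def steepest_coordinate(grad_l):
        # grad_l = nabla(w)^l; returns (k_hat, gamma_hat) with
        # gamma_hat = -sign(nabla(w)^l_{k_hat}), i.e. the minimizer of
        # min_k min{nabla(w)^l_k, -nabla(w)^l_k} over k and the sign
        k_hat = int(np.argmax(np.abs(grad_l)))
        gamma_hat = -np.sign(grad_l[k_hat]) if grad_l[k_hat] != 0 else 1.0
        return k_hat, gamma_hat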
Summing up these arguments, we propose the CGD algorithm for generalized LASSO given in Figure 1.

4.2. Convergence analysis
We assume that $C$ is convex and that its gradient $\nabla$ satisfies the Lipschitz condition with Lipschitz constant $L$ on $S$; that is, for any two vectors $w$ and $v$ in $\mathbb{R}^p$,
\[
\|\nabla(w) - \nabla(v)\| \le L \|w - v\|, \tag{12}
\]
where $\|\cdot\|$ is the Euclidean norm. Let $C^* = \inf_{w \in S} C(w)$ and $\Delta C(w) = C(w) - C^*$. Let $M_1 = L \sup_{w, v \in S} \|w - v\|^2$, $M_2 = \sup_{w \in S} C(w) - C^*$ and $M = \max\{M_1, M_2\}$. The following theorem gives the convergence rate of the CGD algorithm.

Theorem 4.1. Let $w_m$ be the solution of generalized LASSO obtained after $m$ iterations of the CGD algorithm. Then
\[
\Delta C(w_m) \le \frac{2 d^2 M}{m}. \tag{13}
\]
To prove Theorem 4.1, we begin with the following two lemmas.
(a) Initialize: $w = 0$ and $m = 0$.
(b) While not converged do:
  (i) $m = m + 1$.
  (ii) Compute the gradient $\nabla(w) = (\nabla(w)^1, \ldots, \nabla(w)^d)$.
  (iii) Find $(\hat{l}, \hat{k}, \hat{\gamma})$ which minimizes $\gamma \nabla(w)^l_k$ over $l = 1, \ldots, d$, $k = 1, \ldots, p_l$, $\gamma = \pm 1$.
  (iv) Let $v^{\hat{l}}$ be the $p_{\hat{l}}$-dimensional vector whose $\hat{k}$-th element is $\hat{\gamma}$ and whose other elements are 0.
  (v) Find $\hat{\alpha} = \arg\min_{\alpha \in [0,1]} C(w[\alpha, v^{\hat{l}}])$.
  (vi) Update $w$:
\[
w^l_k =
\begin{cases}
(1 - \hat{\alpha}) w^l_k + \hat{\gamma} \hat{\alpha}, & l = \hat{l},\ k = \hat{k}, \\
(1 - \hat{\alpha}) w^l_k, & l = \hat{l},\ k \ne \hat{k}, \\
w^l_k, & \text{otherwise.}
\end{cases}
\]
(c) Return $w$.

Fig. 1. Coordinatewise gradient descent algorithm for generalized LASSO
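To make the iteration concrete, the following is a hedged Python sketch of the CGD loop of Figure 1 for the single-constraint case $d = 1$; `cost` and `grad` are user-supplied functions for $C(w)$ and $\nabla(w)$, and the exact line search over $\alpha$ is replaced, as an approximation of our own, by a search on a fixed grid:

    import numpy as np

    def cgd(cost, grad, p, n_iter=1000):
        # Coordinatewise gradient descent on {w : ||w||_1 <= 1} (d = 1 case).
        w = np.zeros(p)
        alphas = np.linspace(0.0, 1.0, 101)
        for _ in range(n_iter):
            g = grad(w)
            k = int(np.argmax(np.abs(g)))              # steepest coordinate
            gamma = -np.sign(g[k]) if g[k] != 0 else 1.0
            v = np.zeros(p)
            v[k] = gamma                               # extreme point of the L1 ball
            # approximate line search over w[alpha, v] = w + alpha * (v - w)
            alpha = min(alphas, key=lambda a: cost(w + a * (v - w)))
            w = w + alpha * (v - w)
        return w

For the standard LASSO, for instance, one would take $\mathrm{cost}(w) = \sum_i (y_i - \lambda x_i' w)^2$ and $\mathrm{grad}(w) = -2\lambda \sum_i x_i (y_i - \lambda x_i' w)$, as in the remark following Theorem 4.1.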
Lemma 4.2. For any $w \in S$, $v^l \in S^l$ and $\alpha \in [0, 1]$,
\[
C(w[\alpha, v^l]) \le C(w) + \alpha \langle \nabla(w)^l, v^l - w^l \rangle + \frac{M \alpha^2}{2}.
\]

Proof. Define $\Phi : \mathbb{R} \to \mathbb{R}$ by $\Phi(\alpha) = C(w[\alpha, v^l])$ and let $\Phi'(\alpha) = d\Phi(\alpha)/d\alpha$. Then, we have $\Phi'(\alpha) = \langle \nabla(w[\alpha, v^l])^l, v^l - w^l \rangle$. Hence,
\[
|\Phi'(\alpha) - \Phi'(0)| = \big| \langle \nabla(w[\alpha, v^l])^l - \nabla(w)^l, v^l - w^l \rangle \big|
\le \|\nabla(w[\alpha, v^l])^l - \nabla(w)^l\| \, \|v^l - w^l\|
\le L \alpha \|v^l - w^l\|^2
\]
for $\alpha \in [0, 1]$, where the first inequality is by the Cauchy-Schwarz inequality. Thus,
\[
\Phi'(\alpha) \le \Phi'(0) + L \alpha \|v^l - w^l\|^2 = \langle \nabla(w)^l, v^l - w^l \rangle + L \alpha \|v^l - w^l\|^2.
\]
Hence,
\[
\Phi(\alpha) - \Phi(0) = \int_0^{\alpha} \Phi'(s) \, ds
\le \int_0^{\alpha} \big( \langle \nabla(w)^l, v^l - w^l \rangle + L s \|v^l - w^l\|^2 \big) \, ds
= \alpha \langle \nabla(w)^l, v^l - w^l \rangle + \frac{L}{2} \|v^l - w^l\|^2 \alpha^2.
\]
Since $v^l$ and $w^l$ are in $S^l$, $L \|v^l - w^l\|^2 \le M_1 \le M$, which completes the proof. $\Box$
Lemma 4.3. For any given $w \in S$, let $v^l$ be a vector in $S^l$ which minimizes $\langle \nabla(w)^l, v^l - w^l \rangle$ and let $\hat{l} = \arg\min_{l = 1, \ldots, d} \langle \nabla(w)^l, v^l - w^l \rangle$. Let
\[
\hat{\alpha} = \arg\min_{\alpha \in [0,1]} C(w[\alpha, v^{\hat{l}}]).
\]
Then
\[
\Delta C(w[\hat{\alpha}, v^{\hat{l}}]) \le \Delta C(w) - \frac{\Delta C(w)^2}{2 d^2 M}.
\]

Proof. For a given $w \in S$, define the Bregman divergence $B(u, w)$ on $u \in S$ by
\[
B(u, w) = C(u) - C(w) - \langle \nabla(w), u - w \rangle. \tag{14}
\]
Note that $\inf_{u \in S} B(u, w) \ge 0$ for all $w \in S$ since $C$ is convex. Now, let $u$ be a vector in $S$. Since
\[
\langle \nabla(w), u - w \rangle = \sum_{l=1}^{d} \langle \nabla(w)^l, u^l - w^l \rangle,
\]
we have
\[
\langle \nabla(w), u - w \rangle \ge \sum_{l=1}^{d} \langle \nabla(w)^l, v^l - w^l \rangle.
\]
So,
\[
\inf_{u \in S} \langle \nabla(w), u - w \rangle \ge d \, \langle \nabla(w)^{\hat{l}}, v^{\hat{l}} - w^{\hat{l}} \rangle. \tag{15}
\]
Moreover, (14) and the nonnegativity of $B$ imply $\inf_{u \in S} \langle \nabla(w), u - w \rangle \le \inf_{u \in S} \{ C(u) - C(w) \} = -\Delta C(w)$. Combining this with (15), we have
\[
\langle \nabla(w)^{\hat{l}}, v^{\hat{l}} - w^{\hat{l}} \rangle \le -\frac{\Delta C(w)}{d}. \tag{16}
\]
Lemma 4.2 then implies that
\[
\Delta C(w[\alpha, v^{\hat{l}}]) \le \Delta C(w) - \alpha \frac{\Delta C(w)}{d} + \frac{M}{2} \alpha^2. \tag{17}
\]
By taking the minimum of the right-hand side of (17) with respect to $\alpha$ on $[0, 1]$, we get the desired result, since $\Delta C(w)/(dM) \le 1$ for $d \ge 1$. $\Box$
Proof of Theorem 4.1. We use mathematical induction. First, consider the case $m = 1$. Since
\[
\Delta C(w_1) \le M \le d^2 M \le \frac{2 d^2 M}{m},
\]
Theorem 4.1 holds.

For $m = 2$, Lemma 4.3 implies that
\[
\Delta C(w_2) \le \Delta C(w_1) - \frac{\Delta C(w_1)^2}{2 d^2 M}.
\]
Note that $x - x^2/(2 d^2 M)$ is increasing on $[0, d^2 M]$. Since $\Delta C(w_1) \le d^2 M$, we have
\[
\Delta C(w_2) \le d^2 M - \frac{d^4 M^2}{2 d^2 M} = \frac{d^2 M}{2} \le d^2 M,
\]
and hence Theorem 4.1 holds.

Next, suppose $\Delta C(w_m) \le 2 d^2 M / m$ for some $m \ge 2$. Since $x - x^2/(2 d^2 M)$ is increasing on $[0, d^2 M]$ and $2 d^2 M / m \le d^2 M$ for $m \ge 2$, Lemma 4.3 implies
\[
\Delta C(w_{m+1}) \le \frac{2 d^2 M}{m} - \frac{(2 d^2 M)^2}{2 d^2 M \, m^2} = \frac{2 d^2 M}{m} - \frac{2 d^2 M}{m^2} \le \frac{2 d^2 M}{m + 1},
\]
which completes the proof. $\Box$
Remark. When $d = 1$, the convergence rates of sequentially updated algorithms for generalized LASSO have been studied by many authors, including Jones (1992), Barron (1993) and Zhang (2003). They derived the same rate of convergence as the CGD algorithm. However, the previous algorithms find the optimal $v^l$ and $\alpha$ simultaneously at each iteration, which is not easy since $C(w[\alpha, v^l])$ is not jointly convex in $\alpha$ and $v^l$. In contrast, the CGD algorithm finds $v^l$ and $\alpha$ sequentially.

Remark. The bound in Theorem 4.1 implies that the convergence rate of the proposed algorithm does not depend on the dimension of the inputs. Instead, it depends on the magnitudes of the $\lambda_l$ and of the $x_i$ as well as the $y_i$. To see this, consider the standard LASSO problem, that is, $L(y, x, \beta) = (y - x'\beta)^2$ and $d = 1$. In this case, $C(w) = \sum_{i=1}^{n} (y_i - \lambda x_i' w)^2$ and $\nabla(w) = -2\lambda \sum_{i=1}^{n} x_i (y_i - \lambda x_i' w)$. So $\nabla(w)$ satisfies (12) with $L = 4\lambda^2 \eta$, where $\eta$ is the largest eigenvalue of $\sum_{i=1}^{n} x_i x_i'$. Since $\sup_{w, v \in S} \|w - v\|^2 \le 2$, we let $M_1 = 2L$. For $M_2$, let $w^* = \arg\min_{w \in S} C(w)$. Since
\[
C(w) - C(w^*) \le -\langle \nabla(w), w^* - w \rangle \le \|\nabla(w)\| \, \|w^* - w\| \le L \|w\| \, \|w^* - w\| \le 2L,
\]
we can let $M_2 = 2L$, and so we have $\Delta C(w_m) \le 4L/m$. Suppose that the column vectors of the design matrix are standardized. Then $\eta$ is bounded by the sample size $n$. Hence, the convergence rate depends on $n$ and $\lambda$, but not on $p$. So, we expect that the CGD algorithm works well when the dimension of the inputs is large.
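As a rough numerical illustration of this remark (a sketch under the same standard LASSO assumptions, using the constants exactly as stated above), the quantities in the bound can be computed directly from the data:

    import numpy as np

    def cgd_iteration_bound(X, lam, eps):
        # In the standard LASSO setting of the remark, Delta C(w_m) <= 4L/m
        # with L = 4 * lam^2 * eta and eta the largest eigenvalue of X'X,
        # so Delta C(w_m) <= eps is guaranteed after ceil(4L/eps) iterations.
        eta = np.linalg.eigvalsh(X.T @ X).max()
        L = 4.0 * lam ** 2 * eta
        return int(np.ceil(4.0 * L / eps))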
4.3. Gradient LASSO algorithm
A problem of the CGD algorithm is that it converges slowly near the optimum. This is mainly because the CGD algorithm has only an addition step, whereas Osborne et al. (2000a)'s algorithm has a deletion step as well as an addition step. In this section, we explain the reasons for the slow convergence of the CGD algorithm and propose a method to improve its convergence speed by introducing a deletion step. The final algorithm, which is the CGD algorithm with the deletion step, is called the gradient LASSO algorithm.

Consider the problem of minimizing $C(w) = \sum_{i=1}^{n} L(y_i, x_i, w)$ subject to $\|w\|_1 \le 1$. For a given $w$, let $\sigma(w) = \{j : w_j \ne 0,\ j = 1, \ldots, p\}$. Let $\hat{w} = \arg\min_{w \in S} C(w)$, where $S = \{w : \|w\|_1 \le 1\}$, and let $w$ be the current solution of the CGD algorithm. Suppose that $\|\hat{w}\|_1 = 1$. There are two cases where the convergence of the CGD algorithm is slow. The first case is $\sigma(w) - \sigma(\hat{w}) \ne \emptyset$, and the second case is $\|w\|_1 < 1$ when $\sigma(w) - \sigma(\hat{w}) = \emptyset$.

In the first case, where $\sigma(w) - \sigma(\hat{w}) \ne \emptyset$, the current solution has some coefficients which are not zero currently but should be zero in $\hat{w}$. Hence, we need to delete some coefficients from the current solution. The CGD algorithm can delete such coefficients by adding other coefficients sequentially, but its convergence is then very slow. For example, suppose $p = 3$, $\sigma(\hat{w}) = \{1, 2\}$ and $\sigma(w) = \{1, 2, 3\}$. Let $v_1 = (1, 0, 0)'$, $v_2 = (0, 1, 0)'$ and $v_3 = (0, 0, 1)'$. Then, the final solution is a convex combination of $v_1$ and $v_2$, while the current solution is a convex combination of $v_1$, $v_2$ and $v_3$. Figure 2-(a) depicts this situation, where the filled dot is the current solution and the star indicates the final solution. The CGD algorithm has to move the filled dot to the star by adding $v_1$ and $v_2$ sequentially, which needs a large number of iterations. A typical path of the solutions of the CGD algorithm is illustrated in Figure 2-(b). A way of improving the convergence speed in this case is to move all the coordinates simultaneously, which is illustrated in Figure 2-(c).
Fig. 2. The illustration of the first case of slow convergence of the CGD algorithm
Now, the question is in which direction and by how much the solution should move. For a given vector $v \in \mathbb{R}^p$, let $v_\sigma$ be the subvector of $v$ defined by $v_\sigma = (v_k, k \in \sigma(w))$. Conversely, for a given $v_\sigma$, let
\[
v = P^{T} \begin{pmatrix} v_\sigma \\ 0 \end{pmatrix},
\]
where $P$ is the permutation matrix that collects the non-zero elements of $w$ in the first $|\sigma(w)|$ elements. For the direction, a natural choice is the negative gradient; that is, we consider $w_\sigma - \delta \nabla(w)_\sigma$ for some $\delta > 0$. A problem with using the gradient as the direction is that $w_\sigma - \delta \nabla(w)_\sigma$ may not be a feasible solution for any $\delta > 0$; that is, $\|w_\sigma - \delta \nabla(w)_\sigma\|_1 > 1$ for all $\delta > 0$. Also, it is easy to show that such a case occurs when $\langle \theta(w)_\sigma, \nabla(w)_\sigma \rangle < 0$, where $\theta(w) = (\mathrm{sign}(w_1), \ldots, \mathrm{sign}(w_p))'$. In this case, instead of using the gradient itself, we project the negative gradient $-\nabla(w)_\sigma$ onto the hyperplane $W_\sigma = \{v \in \mathbb{R}^{|\sigma(w)|} : \sum_{k=1}^{|\sigma(w)|} v_k = 0\}$. Let $h_\sigma$ be the projection of $-\nabla(w)_\sigma$ onto $W_\sigma$, defined by $h_\sigma = \arg\min_{v \in W_\sigma} \| -\nabla(w)_\sigma - v \|^2$. It turns out that
\[
h_\sigma = -\nabla(w)_\sigma + \frac{\langle \theta(w)_\sigma, \nabla(w)_\sigma \rangle}{|\sigma(w)|} \, \theta(w)_\sigma.
\]
Also, $\|w + \delta h\|_1 = 1$ for $\delta \in [0, L]$, where
\[
h = P^{T} \begin{pmatrix} h_\sigma \\ 0 \end{pmatrix} \tag{18}
\]
and $L = \min_j \{-w_j / h_j : w_j h_j < 0,\ j \in \sigma(w)\}$. So, we update $w$ by $w + \hat{\delta} h$, where $\hat{\delta} = \arg\min_{\delta \in [0, L]} C(w + \delta h)$.
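A hedged Python sketch of this projected direction and of the step bound $L$ (our own helper names; only the case $\|w\|_1 = 1$ just described is covered, and the refinements for $\|w\|_1 < 1$ discussed below are omitted):

    import numpy as np

    def deletion_direction(w, grad):
        # Deletion-step direction for a single block, following (18).
        sigma = np.flatnonzero(w)                 # active set sigma(w)
        g = grad[sigma]
        theta = np.sign(w[sigma])                 # theta(w) restricted to sigma(w)
        h = np.zeros_like(w)
        if np.dot(theta, g) < 0 and np.isclose(np.abs(w).sum(), 1.0):
            # remove the component of -grad along theta(w)_sigma so that the
            # L1 norm is preserved along the move
            h[sigma] = -g + (np.dot(theta, g) / sigma.size) * theta
        else:
            h[sigma] = -g
        # largest admissible step: the first active coordinate driven to zero
        shrinking = [-w[j] / h[j] for j in sigma if w[j] * h[j] < 0]
        step_max = min(shrinking) if shrinking else 0.0
        return h, step_max

The update is then $w \leftarrow w + \hat{\delta} h$, with $\hat{\delta}$ chosen by a line search over $[0, \texttt{step\_max}]$, as described in the text.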
ˆ where δˆ = and L = minj {−wj /hj : wj hj < 0, j ∈ σ(w)}. So, we update w by w + δh argminδ∈[0,L] C(w + δh). When hθ(w)σ , ∇(w)σ i ≥ 0, there exists a constant L > 0 such that kw + δhk1 ≤ 1 for δ ∈ [0, L], where hσ = −∇(w)σ and h is defined by (18). However, the maximum value of L is difficult to find. Hence, instead of finding the maximum value of L, we let L = minj {−wj /hj : wj hj < 0, j ∈ σ(w)}, which is the smallest value where one ˆ where of the coordinates of w + Lh in σ(w) becomes zero. And we update w by w + δh ˆ ˆ ˆ δ = argminδ∈[0,L] C(w + δh). Note that, when δ = L, then one of the coordinates of w + δh in σ(w) becomes zero. Hence, we can consider this process of updating the speed as the deletion step. But, if δˆ < L, the deletion does not occur. For the second case where kwk1 < 1, the convergence is also slow as the first case. Figures 3-(a), (b) and (c) illustrate this case for p = 2. Here v 0 = (0, 0), v 1 = (1, 0) and v 2 = (0, 1). Hence, we can improve the convergence speed similarly to the first case. However, the situation is simpler since there always exists a positive constant L such that kw+δhk1 ≤ 1 for δ ∈ [0, L] where h is defined by (18) with hσ = −∇(w)σ . If hθ(w)σ , hσ i ≤ 0, we can choose L = minj {−wj /hj : wj hj < 0, j ∈ σ(w)} similarly to the first case. If hθ(w)σ , hσ i > 0, we need an additional condition L ≤ (1 − kwk1 )/hθ(w)σ , hσ i. Finally, we ˆ where δˆ = argmin update w by w + δh δ∈[0,L] C(w + δh). By combining the CGD algorithm and the proposed deletion step, we propose the gradient LASSO algorithm, which is given in Figure 4. Since the deletion step always decreases the cost function, the convergence rate of the gradient LASSO algorithm is not slower than that of the CGD algorithm given in Theorem 4.1. Note that the gradient LASSO algorithm does not require matrix inversion. In some cases, there are coefficients which are not constrained. That is, for some l, λl = ∞. For examples, the intercept terms in (generalized) linear models are usually unconstrained. There are two methods for dealing with this situation. The first method is to let ˆ l k1 ≤ λl and apply the gradient LASSO algorithm. A difficulty λl large but finite so that kβ of this method is to choose a reasonable value of λl . If λl is too large, the convergence speed ˆ l k1 , the algorithm does not converge to the could be slow while if λl is smaller than kβ global optimum. The second method is to apply the gradient LASSO algorithm only to the
Gradient LASSO algorithm
(a)
(b)
15
(c)
Fig. 3. The illustration of the second case of slow convergence of the CGD algorithm
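Putting the two steps together, the following is a hedged Python sketch of one iteration of the gradient LASSO algorithm for the single-constraint case $d = 1$. It reuses the hypothetical `deletion_direction` helper sketched above; `cost` and `grad` are user-supplied, and the exact line searches of Figure 4 are replaced here by grid searches:

    import numpy as np

    def gradient_lasso_step(w, cost, grad, n_grid=101):
        # One iteration: addition step (CGD move) followed by the deletion step.
        g = grad(w)
        k = int(np.argmax(np.abs(g)))
        v = np.zeros_like(w)
        v[k] = -np.sign(g[k]) if g[k] != 0 else 1.0
        alphas = np.linspace(0.0, 1.0, n_grid)
        alpha = min(alphas, key=lambda a: cost(w + a * (v - w)))
        w = w + alpha * (v - w)
        # Deletion step: move along the (projected) negative gradient on the
        # active set, at most until one active coordinate reaches zero.
        h, step_max = deletion_direction(w, grad(w))
        if step_max > 0:
            deltas = np.linspace(0.0, step_max, n_grid)
            delta = min(deltas, key=lambda d: cost(w + d * h))
            w = w + delta * h
        return w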
In some cases there are coefficients which are not constrained; that is, $\lambda_l = \infty$ for some $l$. For example, the intercept terms in (generalized) linear models are usually unconstrained. There are two methods for dealing with this situation. The first method is to let $\lambda_l$ be large but finite, so that $\|\hat{\beta}^l\|_1 \le \lambda_l$, and to apply the gradient LASSO algorithm as it stands. A difficulty with this method is choosing a reasonable value of $\lambda_l$: if $\lambda_l$ is too large, the convergence can be slow, whereas if $\lambda_l$ is smaller than $\|\hat{\beta}^l\|_1$, the algorithm does not converge to the global optimum. The second method is to apply the gradient LASSO algorithm only to the constrained coefficients with the unconstrained coefficients held fixed; then, at the end of each iteration, the unconstrained coefficients are optimized by a nonlinear optimization technique such as the Newton-Raphson algorithm with the constrained coefficients held fixed. When only the intercept terms are unconstrained, which is the most common case, the second method works well since the unconstrained optimization of the intercept terms is easy. We use the second method in what follows.

Figure 5 shows the trajectories of the training error (the empirical risk divided by the sample size) and of the number of zero coefficients over the iterations of the gradient LASSO and CGD algorithms. The model used for this illustration is the multiple regression model described in Section 5.1 with $p = 50$ and the constraint $\sum_{i=1}^{50} |\beta_i| \le 3$. We apply the algorithms with two different initial models: the null model, in which all of the coefficients are set to 0, and the full model, in which all of the coefficients are set to $3/50$. In Figure 5, we can see that the gradient LASSO algorithm converges faster than the CGD algorithm regardless of the initial model. However, the number of zero coefficients depends strongly on the initial model. With the null initial model, the gradient LASSO solution is slightly sparser than the CGD solution, but the difference seems negligible. In contrast, with the full initial model, the gradient LASSO algorithm successfully deletes insignificant coefficients while the CGD algorithm fails to do so. That is, the deletion step in the gradient LASSO algorithm is necessary for obtaining a sparse solution. Moreover, note that the numbers of zero coefficients can be quite different even when the training errors are similar. This observation suggests that, for sparse learning, approximating the training error alone is not enough for variable selection.

Along with finding the solution of generalized LASSO for fixed values of the $\lambda_l$, we are interested in finding the set of solutions for various values of the $\lambda_l$. Consider the case $d = 1$ and let $\hat{\beta}(\lambda)$ be the solution of the generalized LASSO problem with $\|\beta\|_1 \le \lambda$. Many researchers, including Efron et al. (2004), Rosset and Zhu (2003), Rosset et al. (2004) and Zhao and Yu (2004), have studied methods of finding $\hat{\beta}(\lambda)$ for all $\lambda$, which is called the regularized solution path. Unfortunately, the gradient LASSO algorithm does not give the regularized solution path automatically. A way of finding an approximate regularized solution path is to evaluate $\hat{\beta}(\lambda)$ at $\lambda = \varepsilon, 2\varepsilon, 3\varepsilon, \ldots$ for sufficiently small $\varepsilon > 0$. For improving the convergence speed,
Fig. 4. The gradient LASSO algorithm

(a) Initialize: $w = 0$ and $m = 0$.
(b) While not converged do:
  (i) Addition step (the CGD algorithm):
    A. Compute the gradient $\nabla(w) = (\nabla(w)^1, \ldots, \nabla(w)^d)$.
    B. Find $(\hat{l}, \hat{k}, \hat{\gamma})$ which minimizes $\gamma \nabla(w)^l_k$ over $l = 1, \ldots, d$, $k = 1, \ldots, p_l$, $\gamma = \pm 1$.
    C. Let $v^{\hat{l}}$ be the $p_{\hat{l}}$-dimensional vector whose $\hat{k}$-th element is $\hat{\gamma}$ and whose other elements are 0.
    D. Find $\hat{\alpha} = \arg\min_{\alpha \in [0,1]} C(w[\alpha, v^{\hat{l}}])$.
    E. Update $w$:
\[
w^l_k =
\begin{cases}
(1 - \hat{\alpha}) w^l_k + \hat{\gamma} \hat{\alpha}, & l = \hat{l},\ k = \hat{k}, \\
(1 - \hat{\alpha}) w^l_k, & l = \hat{l},\ k \ne \hat{k}, \\
w^l_k, & \text{otherwise.}
\end{cases}
\]
  (ii) Deletion step:
    A. Let $\nabla(w)^l_\sigma = (\nabla(w)^l_k, k \in \sigma(w^l))$ and let $o = \hat{l}$.
    B. If $\langle \theta(w^o)_\sigma, \nabla(w)^o_\sigma \rangle < 0$ and $\langle \theta(w^o), w^o \rangle = 1$, set
\[
h^o_\sigma = -\nabla(w)^o_\sigma + \frac{\langle \theta(w^o)_\sigma, \nabla(w)^o_\sigma \rangle}{|\sigma(w^o)|} \, \theta(w^o)_\sigma.
\]
    C. Else, set $h^o_\sigma = -\nabla(w)^o_\sigma$.
    D. Let $h = (h^1, \ldots, h^d)$, where $h^l = 0$ for $l \ne o$ and $h^o$ is defined by (18) with $h^o_\sigma$ and the corresponding permutation matrix $P$.
    E. Compute $\hat{\delta} = \arg\min_{0 \le \delta \le L} C(w + \delta h)$.