S2 Algorithm Details

We describe the algorithm used to obtain the solutions to (7). Cyclic group-wise coordinate descent works well when we have a small number of groups (18 ROIs in our case). We compute a family of solutions varying $\lambda$ from $\lambda_{\max}$ down toward zero over a grid of values (on the exponential scale), with $\lambda_{\max}$ being the smallest value of $\lambda$ for which all the estimated $\tilde{\beta}_i$ are zero [see equation (19)]. As we move down the $\lambda$ sequence, we use “warm starts” from the previous value of $\lambda$ to start the iterations. The idea of coordinate descent is to update the coefficients for a single group while holding the coefficients for all other groups fixed. If we cycle through all the groups repeatedly, we will converge to the solution of the strictly convex optimization problem at each $\lambda$ [1]. Let $Y$ be the matrix of observations, and let $X = [X_1, \ldots, X_p]$ be the matrix of features that constitute the $p$ groups.
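As a concrete illustration of this setup, the following Python/NumPy sketch builds the exponentially spaced $\lambda$ grid and threads warm starts through it. It assumes $\lambda_{\max}$ can be obtained by applying the zero condition (S2-9) below at $\beta = 0$, i.e. $\lambda_{\max} = \max_k \|X_k^t Y\|_F/\gamma_k$ for column-centered data (equation (19) of the main text is not reproduced here); `fit_at_lambda` is a hypothetical stand-in for the coordinate-descent solver of Algorithm 1, not code from the paper.

```python
import numpy as np

def lambda_grid(Y, X_blocks, gammas, n_lambda=100, eps=1e-3):
    """Exponentially spaced lambda sequence from lambda_max down toward zero.

    lambda_max = max_k ||X_k^t Y||_F / gamma_k is the smallest lambda at which
    every group estimate is zero (zero condition (S2-9) evaluated at beta = 0).
    """
    lam_max = max(np.linalg.norm(Xk.T @ Y) / gk for Xk, gk in zip(X_blocks, gammas))
    return np.exp(np.linspace(np.log(lam_max), np.log(eps * lam_max), n_lambda))

def solution_path(Y, X_blocks, gammas, alpha, fit_at_lambda):
    """Fit along the lambda sequence, warm-starting each fit from the previous one."""
    betas = [np.zeros((Xk.shape[1], Y.shape[1])) for Xk in X_blocks]  # Y assumed N x T
    path = []
    for lam in lambda_grid(Y, X_blocks, gammas):
        betas = fit_at_lambda(Y, X_blocks, gammas, lam, alpha, init=betas)  # warm start
        path.append([b.copy() for b in betas])
    return path
```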

S2.1 Cyclic group-wise coordinate descent

We consider the generic problem

$$\operatorname*{argmin}_{\mu,\,\beta}\;\; \frac{1}{2}\Big\|\, Y - \mathbf{1}\mu^t - \sum_{i=1}^{p} X_i\beta_i \,\Big\|_F^2 \;+\; \lambda\sum_{i=1}^{p}\gamma_i\|\beta_i\|_F \;+\; \alpha\|\beta\|_F^2. \tag{S2-1}$$

We have introduced penalty modifiers $\gamma_i$, so in fact the regularization multiplier for group $i$ is $\lambda\gamma_i$. These are needed to allow for potentially different group sizes and scales, and are discussed in S2.2. We first eliminate $\mu$ by partially optimizing (S2-1) with respect to $\mu$. It is easy to show that centering each of the columns of $X_i$ is a simple re-parametrization of the problem that leaves $\beta_i$ alone, but changes $\mu$, and importantly leads to exactly the same optimized fit. But with these transformations, $\hat{\mu}$ is given by the column means of $Y$. With this $\hat{\mu}$ we can replace $Y$ by its centered version and remove $\mu$ from (S2-1), as long as we use the centered $X_i$.

Let $\lambda$ and $\alpha$ be fixed, and suppose we want to perform the update for group $k$. Let $\hat{\beta}_1, \ldots, \hat{\beta}_p$ be the current estimates, and define the partial residual for group $k$ to be $r_k = Y - \sum_{i \neq k} X_i\hat{\beta}_i$ (i.e. in (S2-1), we have created a uni-block problem). Using standard results in convex optimization, we write the subgradient equation for $\beta_k$ in (S2-1):

$$-X_k^t r_k + X_k^t X_k\beta_k + \lambda\gamma_k s_k + 2\alpha\beta_k = 0, \tag{S2-2}$$

where $s_k \in \{z : \|z\|_F \le 1\}$, and $s_k = \beta_k/\|\beta_k\|_F$ if $\beta_k \neq 0$. It follows that $\hat{\beta}_k = 0$ if $\|X_k^t r_k\|_F < \lambda\gamma_k$. If $\hat{\beta}_k \neq 0$, from (S2-2) we have

$$\left( X_k^t X_k + \Big(\frac{\lambda\gamma_k}{\|\hat{\beta}_k\|_F} + 2\alpha\Big) I \right)\hat{\beta}_k = X_k^t r_k \tag{S2-3}$$
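In code, this zero condition is a one-line screening test; a minimal sketch assuming NumPy arrays, with `r_k` the partial residual for group $k$ (names are illustrative, not from the paper):

```python
import numpy as np

def group_is_zero(Xk, r_k, lam, gamma_k):
    """beta_k is set exactly to zero iff ||X_k^t r_k||_F < lam * gamma_k (see (S2-2))."""
    return np.linalg.norm(Xk.T @ r_k) < lam * gamma_k
```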

We can solve for $\hat{\beta}_k$ by first solving for the scalar $\|\hat{\beta}_k\|_F$ and then plugging the result into (S2-3) to get a closed-form solution. We compute the singular value decomposition (a one-time computation) $X_k = U_k D_k V_k^t$ and rewrite (S2-3) as

$$\left[\, D_k^2\|\hat{\beta}_k\|_F + \big(\lambda\gamma_k + 2\alpha\|\hat{\beta}_k\|_F\big) I \,\right]^{-1} D_k U_k^t r_k = V_k^t \frac{\hat{\beta}_k}{\|\hat{\beta}_k\|_F}. \tag{S2-4}$$

Take the Frobenius norm on both sides to obtain

$$\left\|\, \left[ D_k^2\|\hat{\beta}_k\|_F + \big(\lambda\gamma_k + 2\alpha\|\hat{\beta}_k\|_F\big) I \right]^{-1} D_k U_k^t r_k \,\right\|_F^2 = 1. \tag{S2-5}$$

Let $f(\theta) = \left\| \left[ D_k^2\theta + (\lambda\gamma_k + 2\alpha\theta) I \right]^{-1} D_k U_k^t r_k \right\|_F^2 - 1$. To find $\|\hat{\beta}_k\|_F$, we need only find $\theta_0$ such that $f(\theta_0) = 0$. We do this with Newton-Raphson by iterating

$$\theta \longleftarrow \theta - \eta\,\frac{f(\theta)}{f'(\theta)}, \tag{S2-6}$$

where η is the step size and

$$f'(\theta) = -2\left\|\, \left[ D_k^2\theta + (\lambda\gamma_k + 2\alpha\theta) I \right]^{-3/2} \big(D_k^2 + 2\alpha I\big)^{1/2} D_k U_k^t r_k \,\right\|_F^2. \tag{S2-7}$$
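This root-finding step can be implemented directly on the singular values. The sketch below, assuming NumPy, takes `d` (the diagonal of $D_k$) and `b` $= D_k U_k^t r_k$ as inputs and runs the damped Newton iteration (S2-6) with $f$ and $f'$ from (S2-5) and (S2-7); all names are illustrative rather than from the paper's code.

```python
import numpy as np

def solve_theta(d, b, lam_gamma, alpha, theta0=1.0, eta=1.0, tol=1e-10, max_iter=50):
    """Newton-Raphson for theta = ||beta_k||_F, the root of f(theta) = 0.

    d         : singular values of X_k (length m)
    b         : D_k U_k^t r_k, shape (m,) or (m, T)
    lam_gamma : lambda * gamma_k
    alpha     : ridge parameter
    """
    b2 = np.sum(b ** 2, axis=1) if b.ndim == 2 else b ** 2  # row-wise squared norms of b
    theta = theta0
    for _ in range(max_iter):
        denom = d ** 2 * theta + lam_gamma + 2.0 * alpha * theta
        f = np.sum(b2 / denom ** 2) - 1.0                                  # f(theta), (S2-5)
        fprime = -2.0 * np.sum(b2 * (d ** 2 + 2.0 * alpha) / denom ** 3)   # f'(theta), (S2-7)
        step = eta * f / fprime
        theta -= step
        if abs(step) < tol:
            break
    return theta
```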

In our experience, $f(\theta)$ tends to be quite linear around $\theta_0$, so that very few Newton iterations are required for convergence. Having obtained $\hat{\theta}_0$, we update $\hat{\beta}_k$ with

$$\hat{\beta}_k \longleftarrow \left[ X_k^t X_k + \Big(\frac{\lambda\gamma_k}{\hat{\theta}_0} + 2\alpha\Big) I \right]^{-1} X_k^t r_k. \tag{S2-8}$$

We now cycle through all the groups until convergence. The full algorithm is presented in Algorithm 1.

Algorithm 1: Cyclic group-wise coordinate descent

input : $Y, X_1, \ldots, X_p, \lambda, \gamma_1, \ldots, \gamma_p, \alpha$. All $X_j$ and $Y$ column centered.
output: $\hat{\beta}_1, \ldots, \hat{\beta}_p$

Initialize $r = Y$, $\hat{\beta}_1 = 0, \ldots, \hat{\beta}_p = 0$. Let $X_k = U_k D_k V_k^t$ be the singular value decomposition of $X_k$.
Iterate until convergence:
  for $k \leftarrow 1$ to $p$ do
    $r_k = r + X_k\hat{\beta}_k$;
    if $\|X_k^t r_k\|_F < \lambda\gamma_k$ then
      $\hat{\beta}_k \longleftarrow 0$;
    else
      $\hat{\beta}_k \longleftarrow \left[ X_k^t X_k + (\frac{\lambda\gamma_k}{\theta} + 2\alpha) I \right]^{-1} X_k^t r_k$,
      where $\theta$ is the root of $\left\| \left[ D_k^2\theta + (\lambda\gamma_k + 2\alpha\theta) I \right]^{-1} D_k U_k^t r_k \right\|_F = 1$;
    end
    $r \longleftarrow r_k - X_k\hat{\beta}_k$;
  end
return $\hat{\beta}_1, \ldots, \hat{\beta}_p$

In practice, because we are fitting the group-lasso along a sequence of $\lambda$, we will initialize $\hat{\beta}_1, \ldots, \hat{\beta}_p$ with the estimates from the previous $\lambda$ in the sequence. These “warm starts” give a significant speed advantage in our experience.
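Putting the pieces together, one possible NumPy implementation of Algorithm 1 is sketched below. It precomputes the one-time SVDs, maintains the full residual $r$, and uses the Newton routine sketched after (S2-7), passed in as `solve_theta`; function and variable names are illustrative and not taken from the paper's code.

```python
import numpy as np

def group_coordinate_descent(Y, X_blocks, gammas, lam, alpha, solve_theta,
                             init=None, max_sweeps=100, tol=1e-6):
    """Cyclic group-wise coordinate descent for (S2-1); Y and X_blocks column-centered."""
    p, T = len(X_blocks), Y.shape[1]
    betas = init if init is not None else [np.zeros((Xk.shape[1], T)) for Xk in X_blocks]
    svds = [np.linalg.svd(Xk, full_matrices=False) for Xk in X_blocks]  # one-time SVDs
    r = Y - sum(Xk @ bk for Xk, bk in zip(X_blocks, betas))             # full residual

    for _ in range(max_sweeps):
        max_change = 0.0
        for k in range(p):
            Xk = X_blocks[k]
            Uk, dk, _ = svds[k]
            rk = r + Xk @ betas[k]                        # partial residual for group k
            grad = Xk.T @ rk
            if np.linalg.norm(grad) < lam * gammas[k]:    # zero condition (S2-2)
                new_beta = np.zeros_like(betas[k])
            else:
                b = dk[:, None] * (Uk.T @ rk)             # D_k U_k^t r_k
                theta = solve_theta(dk, b, lam * gammas[k], alpha)
                A = Xk.T @ Xk + (lam * gammas[k] / theta + 2.0 * alpha) * np.eye(Xk.shape[1])
                new_beta = np.linalg.solve(A, grad)       # update (S2-8)
            max_change = max(max_change, np.linalg.norm(new_beta - betas[k]))
            betas[k] = new_beta
            r = rk - Xk @ new_beta                        # restore full residual
        if max_change < tol:
            break
    return betas
```

Under these assumptions, this routine would play the role of the hypothetical `fit_at_lambda` in the warm-start sketch at the start of this section.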

S2.2 Determining the group penalty modifiers $\gamma_i$

The $\gamma_i$ in (7) allow us to have different penalties for different groups. This is useful because a larger group can be more likely to have a stronger correlation with the response than a small group, just by random chance. Having different penalties thus allows us to put different-sized groups on the same scale. Recall that $\hat{\beta}_k = 0$ if the following gradient condition is met:

$$\|X_k^t r_k\|_F < \lambda\gamma_k. \tag{S2-9}$$

It follows that we can determine an appropriate group penalty modifier by computing the expected value of the LHS if the signal were pure noise. Let $\epsilon \sim (0, I_N)$, an $N$-vector of white noise (we need only deal with the $T = 1$ case, since the case $T > 1$ would be the same). Then we have

$$\begin{aligned}
\gamma_k^2 &= E\,\|X_k^t\epsilon\|_F^2 & \text{(S2-10)}\\
&= E\,\mathrm{tr}(\epsilon^t X_k X_k^t \epsilon) & \text{(S2-11)}\\
&= E\,\mathrm{tr}(X_k X_k^t \epsilon\epsilon^t) \quad \text{(cyclic invariance of trace)} & \text{(S2-12)}\\
&= \mathrm{tr}\big(X_k X_k^t\, E[\epsilon\epsilon^t]\big) \quad \text{(linearity of trace and expectation)} & \text{(S2-13)}\\
&= \mathrm{tr}(X_k^t X_k) & \text{(S2-14)}\\
&= \|X_k\|_F^2. & \text{(S2-15)}
\end{aligned}$$

Therefore we take $\gamma_k = \|X_k\|_F$, the Frobenius norm of $X_k$. Note that if $X_k$ is orthonormal, then $\gamma_k = \sqrt{p_k}$, which is the penalty modifier proposed in [2].
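A quick numerical sanity check of this choice, assuming NumPy: the Monte Carlo average of $\|X_k^t\epsilon\|_F^2$ over white-noise draws should approach $\|X_k\|_F^2 = \gamma_k^2$ (the data below are synthetic and purely illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
N, pk = 200, 5
Xk = rng.standard_normal((N, pk))            # synthetic group of pk features

gamma_k = np.linalg.norm(Xk)                 # gamma_k = ||X_k||_F (Frobenius norm)

# Monte Carlo estimate of E ||X_k^t eps||_F^2 with eps ~ (0, I_N), T = 1
mc = np.mean([np.sum((Xk.T @ rng.standard_normal(N)) ** 2) for _ in range(5000)])
print(gamma_k ** 2, mc)                      # the two values should be close
```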

References

1. Tseng P (2001) Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications 109: 474-494.

2. Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B: Statistical Methodology 68: 49-67.
