Appendix to “A Sparse Structured Shrinkage Estimator for Nonparametric Varying-Coefficient Model with an Application in Genomics,” published in the Journal of Computational and Graphical Statistics

Z. John Daye, Jichun Xie, and Hongzhe Li∗

University of Pennsylvania, School of Medicine

January 11, 2011

A Appendix - Proofs

In this section, we provide proofs for model selection consistency of the SSS estimator and the estimation bounds presented in Section 3, extending results of Bach (2008) for the group lasso. We further use the notation $\Sigma_{YY} = E(YY^T) - E(Y)E(Y)^T$ and $\Sigma_{U^*Y} = E(U^*Y^T) - E(U^*)E(Y)^T$.

A.1 Proof of Theorem 3.1

Consider the optimization problem (5) in Section 2.2,
$$\hat\gamma^* = \arg\min_{\gamma^*} \left\{ \frac{1}{2n}\|y - U^*\gamma^*\|^2 + \lambda_1\sum_{g=1}^p \|\gamma_g^*\| + \frac{\lambda_2}{2}(\gamma^*)^T\Omega^*\gamma^* \right\}.$$
By the Karush-Kuhn-Tucker conditions, $\gamma^*$ is an optimal solution for (5) if and only if
$$\frac{1}{n}(U_g^*)^T(U^*\gamma^* - y) + \lambda_2\Omega_g^*\gamma^* = -\frac{\lambda_1}{\|\gamma_g^*\|}\,\gamma_g^*, \qquad \forall\,\gamma_g^* \neq 0, \qquad\qquad (17)$$
$$\left\|\frac{1}{n}(U_g^*)^T(U^*\gamma^* - y) + \lambda_2\Omega_g^*\gamma^*\right\| \leq \lambda_1, \qquad \forall\,\gamma_g^* = 0. \qquad\qquad (18)$$
This can be easily verified as in Proposition 1 of Bach (2008) and Yuan and Lin (2006).
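As a concrete illustration of (17) and (18), the following minimal numpy sketch checks the group-wise KKT conditions for a candidate solution of problem (5) on generic inputs. The function name, the group-index representation, and the numerical tolerances are illustrative assumptions, not part of the paper.

```python
import numpy as np

def check_kkt(U_star, y, Omega_star, gamma, groups, lam1, lam2, tol=1e-6):
    """Check the group-wise KKT conditions (17)-(18) for problem (5).

    U_star     : (n, q) design matrix U*
    y          : (n,) response vector
    Omega_star : (q, q) symmetric penalty matrix Omega*
    gamma      : (q,) candidate solution gamma*
    groups     : list of index arrays, one per covariate group g
    """
    n = U_star.shape[0]
    # (1/n) U*^T (U* gamma* - y) + lam2 Omega* gamma*, read off block-row-wise below
    grad = U_star.T @ (U_star @ gamma - y) / n + lam2 * (Omega_star @ gamma)
    ok = True
    for Ig in groups:
        gg, sg = gamma[Ig], grad[Ig]
        if np.linalg.norm(gg) > tol:
            # condition (17): equality with -lam1 * gamma_g* / ||gamma_g*||
            ok &= np.allclose(sg, -lam1 * gg / np.linalg.norm(gg), atol=1e-4)
        else:
            # condition (18): norm bounded by lam1 (up to numerical slack)
            ok &= bool(np.linalg.norm(sg) <= lam1 + 1e-4)
    return bool(ok)
```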

∗ Z. John Daye is a Postdoctoral Researcher, Jichun Xie is a Ph.D. student, and Hongzhe Li is Professor, Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Blockley Hall, 423 Guardian Drive, Philadelphia, PA 19104-6021. (E-mail: [email protected])


Define
$$\tilde\gamma_{(1)} = \arg\min_{\gamma^*_{(1)}}\; \frac{1}{2n}\|y - U^*_{(1)}\gamma^*_{(1)}\|^2 + \lambda_1\sum_{\{g:\,\beta^0_g \neq 0\}} \|\gamma_g^*\| + \frac{\lambda_2}{2}(\gamma^*_{(1)})^T\Omega^*_{11}\gamma^*_{(1)}, \qquad\qquad (19)$$
and let $\tilde\gamma = (\tilde\gamma_{(1)}^T, 0^T)^T$. As $\lambda_1 \to 0$ and $\lambda_2 \to 0$, the objective function in (19) converges to $\Sigma_{YY} - 2\Sigma_{YU^*_{(1)}}\gamma^*_{(1)} + \gamma^{*T}_{(1)}\Sigma_{U^*_{(1)}U^*_{(1)}}\gamma^*_{(1)}$, whose unique minimum is $\gamma^{0*}_{(1)}$ by regularity condition A2. Thus, $\tilde\gamma_{(1)}$ converges to $\gamma^{0*}_{(1)}$ in probability.

Using regularity condition A1 and letting $\epsilon = y - (\gamma^{0*})^T U^*$, we have
$$\frac{1}{n}(U^*)^T y = \frac{1}{n}(U^*)^T U^*\gamma^{0*} + \frac{1}{n}(U^*)^T\epsilon = \Sigma_{U^*U^*_{(1)}}\gamma^{0*}_{(1)} + O_p(n^{-1/2}).$$
This gives
$$\frac{1}{n}(U^*)^T(U^*_{(1)}\tilde\gamma_{(1)} - y) = \Sigma_{U^*U^*_{(1)}}(\tilde\gamma_{(1)} - \gamma^{0*}_{(1)}) + O_p(n^{-1/2}), \qquad\qquad (20)$$
which by (17) gives
$$\begin{aligned}
\tilde\gamma_{(1)} - \gamma^{0*}_{(1)} &= -\lambda_1\Sigma^{-1}_{U^*_{(1)}U^*_{(1)}}\mathrm{Diag}(1/\|\tilde\gamma_i\|)\,\tilde\gamma_{(1)} - \lambda_2\Sigma^{-1}_{U^*_{(1)}U^*_{(1)}}\Omega^*_{11}\tilde\gamma_{(1)} + O_p(n^{-1/2}) \\
&= -\lambda_1\Sigma^{-1}_{U^*_{(1)}U^*_{(1)}}\left[\mathrm{Diag}(1/\|\tilde\gamma_i\|) + \frac{\lambda_2}{\lambda_1}\Omega^*_{11}\right]\tilde\gamma_{(1)} + O_p(n^{-1/2}). \qquad\qquad (21)
\end{aligned}$$
From equations (20) and (21), we have
$$\begin{aligned}
\frac{1}{n}(U^*_{(2)})^T(U^*\tilde\gamma_{(1)} - y) + \lambda_2\Omega^*_{21}\tilde\gamma_{(1)} &= \Sigma_{U^*_{(2)}U^*_{(1)}}(\tilde\gamma_{(1)} - \gamma^{0*}_{(1)}) + \lambda_2\Omega^*_{21}\tilde\gamma_{(1)} + O_p(n^{-1/2}) \\
&= -\lambda_1\Sigma_{U^*_{(2)}U^*_{(1)}}\Sigma^{-1}_{U^*_{(1)}U^*_{(1)}}\left[\mathrm{Diag}(1/\|\tilde\gamma_i\|) + \frac{\lambda_2}{\lambda_1}\Omega^*_{11}\right]\tilde\gamma_{(1)} + \lambda_2\Omega^*_{21}\tilde\gamma_{(1)} + O_p(n^{-1/2}).
\end{aligned}$$
Dividing the above by $\lambda_1$, we see that $(U^*_g)^T(U^*\tilde\gamma_{(1)} - y)/(n\lambda_1) + (\lambda_2/\lambda_1)\Omega^*_{g1}\tilde\gamma_{(1)}$ converges in probability to
$$-\Sigma_{U^*_gU^*_{(1)}}\Sigma^{-1}_{U^*_{(1)}U^*_{(1)}}\left[\mathrm{Diag}(1/\|\gamma^{0*}_i\|) + \alpha\Omega^*_{11}\right]\gamma^{0*}_{(1)} + \alpha\Omega^*_{g1}\gamma^{0*}_{(1)} \qquad\qquad (22)$$
for $g \in \{g : \beta^0_g = 0\}$, as $\lambda_1\sqrt{n} \to \infty$ and $\lambda_2/\lambda_1 \to \alpha$. We note that, by the definitions $U^*_g = U_gV_g^{-1}$ and $\Omega^* = \mathrm{diag}(V_1^{-1},\ldots,V_p^{-1})^T\,\Omega\,\mathrm{diag}(V_1^{-1},\ldots,V_p^{-1})$, (22) equals
$$V_g^{-1}\left\{-\Sigma_{U_gU_{(1)}}\Sigma^{-1}_{U_{(1)}U_{(1)}}\left[\mathrm{Diag}(1/\|\gamma^0_g\|R_{gg}) + \alpha\Omega_{11}\right]\gamma^0_{(1)} + \alpha\Omega_{g1}\gamma^0_{(1)}\right\}.$$
Thus, by condition (12), $\|(U^*_g)^T(U^*\tilde\gamma_{(1)} - y)/(n\lambda_1) + (\lambda_2/\lambda_1)\Omega^*_{g1}\tilde\gamma_{(1)}\| \leq 1$ holds with probability 1 for all $g \in \{g : \beta^0_g = 0\}$. This verifies (18), and in addition (17) is satisfied for $\tilde\gamma_{(1)}$ by definition. Equations (17) and (18) in turn imply that $\tilde\gamma$ is an optimal solution for (5), and this completes the proof of Theorem 3.1.

A.2 Proof of Theorem 3.2

Assume that condition (13) does not hold. That is,
$$\|v_g\| > 1 \qquad\qquad (23)$$
for some $g \in \{g : \beta^0_g = 0\}$, where $v_g = V_g^{-1}\left(\Sigma_{U_gU_{(1)}}\Sigma^{-1}_{U_{(1)}U_{(1)}}[D_{(1)} + \alpha\Omega_{11}]\gamma^0_{(1)} - \alpha\Omega_{21}\gamma^0_{(1)}\right)$.
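To make (23) concrete, the quantity $v_g$ can be evaluated numerically from the covariance and penalty blocks it involves. The sketch below is a minimal illustration under the assumption that $D_{(1)}$ and the $\Omega$ blocks are supplied as matrices of conforming dimensions; the function name and shapes are illustrative, not the paper's notation.

```python
import numpy as np

def v_g(V_g_inv, Sig_Ug_U1, Sig_U1_U1, D1, Omega11, Omega21, gamma1_0, alpha):
    """Evaluate v_g from (23) for one irrelevant group g (illustrative sketch).

    V_g_inv    : (K, K) scaling matrix V_g^{-1}
    Sig_Ug_U1  : (K, s) covariance block Sigma_{U_g U_(1)}
    Sig_U1_U1  : (s, s) covariance block Sigma_{U_(1) U_(1)}
    D1         : (s, s) matrix D_(1)
    Omega11    : (s, s) penalty block for the relevant groups
    Omega21    : (K, s) penalty block linking group g to the relevant groups
    gamma1_0   : (s,) true coefficients on the relevant groups
    alpha      : limit of lambda_2 / lambda_1
    """
    inner = Sig_Ug_U1 @ np.linalg.solve(Sig_U1_U1, (D1 + alpha * Omega11) @ gamma1_0)
    return V_g_inv @ (inner - alpha * Omega21 @ gamma1_0)

# Condition (13) amounts to ||v_g|| <= 1 for every irrelevant group g;
# Theorem 3.2 concerns the case where some ||v_g|| exceeds 1, e.g.
# np.linalg.norm(v_g(...)) > 1.
```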

From (17), we have
$$\hat\gamma_{(1)} = \left(\frac{1}{n}(U^*_{(1)})^TU^*_{(1)}\right)^{-1}\left(\frac{1}{n}(U^*_{(1)})^Ty - \lambda_2\Omega^*_1\hat\gamma - \lambda_1\mathrm{Diag}(1/\|\hat\gamma_g\|)\,\hat\gamma_{(1)}\right),$$
which gives
$$\begin{aligned}
&\frac{1}{n}(U^*_g)^T(y - U^*_{(1)}\hat\gamma_{(1)}) - \lambda_2\Omega^*_{g1}\hat\gamma_{(1)} \\
&\quad= \left[\frac{1}{n}(U^*_g)^Ty - \frac{1}{n}(U^*_g)^TU^*_{(1)}\left(\frac{1}{n}(U^*_{(1)})^TU^*_{(1)}\right)^{-1}\frac{1}{n}(U^*_{(1)})^Ty\right] \\
&\qquad+ \lambda_1\left[\frac{1}{n}(U^*_g)^TU^*_{(1)}\left(\frac{1}{n}(U^*_{(1)})^TU^*_{(1)}\right)^{-1}\left(\mathrm{Diag}(\|\hat\gamma_i\|^{-1})\hat\gamma_{(1)} + \frac{\lambda_2}{\lambda_1}\Omega^*_1\hat\gamma\right)\right] - \lambda_2\Omega^*_{g1}\hat\gamma_{(1)} \qquad\qquad (24)\\
&\quad= A_n + B_n.
\end{aligned}$$

Now, since $P(\{g : \hat\beta_g \neq 0\} = \{g : \beta^0_g \neq 0\}) \to 1$ by the assumption of model selection consistency, and $\tilde\gamma_{(1)}$ converges to $\gamma^{0*}_{(1)}$ in probability as $\lambda_1 \to 0$ and $\lambda_2 \to 0$ by the argument following equation (19), we see that $\hat\gamma_{(1)}$ converges to $\gamma^{0*}_{(1)}$ and $B_n/\lambda_1$ converges to $v_g$. Furthermore, by assumption (23), $P\big((v_g/\|v_g\|)^T(B_n/\lambda_1) \geq (\|v_g\|+1)/2\big) \to 1$. By the arguments in the proof of Theorem 3 of Bach (2008) and regularity condition A3, $P(\sqrt{n}\,v_g^TA_n > 0)$ converges to a constant $a \in (0,1)$. Thus, $P\big((v_g/\|v_g\|)^T(A_n+B_n)/\lambda_1 \geq (\|v_g\|+1)/2\big) \geq a$ asymptotically. Since $\|(A_n+B_n)/\lambda_1\| \geq (v_g/\|v_g\|)^T(A_n+B_n)/\lambda_1$ and $(\|v_g\|+1)/2 > 1$, this implies that (18) fails for $\hat\gamma$ with asymptotic probability at least $a$. Hence, $\hat\gamma$ is not an optimal solution for (5) with positive probability asymptotically, which is a contradiction, and Theorem 3.2 follows.

A.3 Proof of Theorem 3.3

Since the solution of equation (4) is simply a reparameterization of the solution of equation (5), we first establish the estimation bounds for $\hat\gamma^*$,
$$\hat\gamma^* = \arg\min_{\gamma^*}\; \frac{1}{2n}\|y - U^*\gamma^*\|^2 + \lambda_1\sum_{g=1}^p\|\gamma^*_g\| + \frac{\lambda_2}{2}(\gamma^*)^T\Omega^*\gamma^*,$$
which can then be converted to bounds for $\hat\gamma$. Our proof extends that of Hebiri and van de Geer (2010) for $L_1 + L_2$ penalized estimation methods. Define $m = nK$ and $\Omega^* = J^TJ$. We can reparameterize $U^*$, $y$, and $\varepsilon$ as
$$\tilde U = \begin{pmatrix} \sqrt{\tfrac{K}{2}}\,U^* \\[4pt] \sqrt{\tfrac{m\lambda_2}{2}}\,J \end{pmatrix}, \qquad \tilde y = \begin{pmatrix} \sqrt{\tfrac{K}{2}}\,y \\[4pt] 0 \end{pmatrix}, \qquad \text{and} \qquad \tilde\varepsilon = \begin{pmatrix} \sqrt{\tfrac{K}{2}}\,\varepsilon \\[4pt] -\sqrt{\tfrac{m\lambda_2}{2}}\,J\gamma^{*0} \end{pmatrix}. \qquad\qquad (25)$$
Then the original optimization problem can be reformulated as
$$\hat\gamma^* = \arg\min_{\gamma^*}\; \frac{1}{m}\|\tilde y - \tilde U\gamma^*\|^2 + \lambda_1\sum_{g=1}^p\|\gamma^*_g\|. \qquad\qquad (26)$$
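The reformulation (25)-(26) can be checked numerically: for any $\gamma^*$, the quadratic part of (26) built from the augmented $\tilde U$ and $\tilde y$ reproduces the quadratic and smoothness-penalty parts of (5), while the group-lasso penalty is unchanged. Below is a minimal numpy sketch of this identity; the dimensions, the random $J$, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, K = 50, 12, 3                      # illustrative sizes; K basis functions per group
m = n * K
U_star = rng.normal(size=(n, q))         # plays the role of U*
y = rng.normal(size=n)
J = rng.normal(size=(q, q))              # any J with Omega* = J^T J
Omega_star = J.T @ J
lam2 = 0.7
gamma = rng.normal(size=q)               # an arbitrary gamma*

# Augmented design and response from (25)
U_tilde = np.vstack([np.sqrt(K / 2) * U_star,
                     np.sqrt(m * lam2 / 2) * J])
y_tilde = np.concatenate([np.sqrt(K / 2) * y, np.zeros(q)])

# Quadratic part of (26) equals quadratic-plus-smoothness part of (5)
lhs = np.sum((y_tilde - U_tilde @ gamma) ** 2) / m
rhs = (np.sum((y - U_star @ gamma) ** 2) / (2 * n)
       + 0.5 * lam2 * gamma @ Omega_star @ gamma)
assert np.isclose(lhs, rhs)
```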

Let $I_g$ be the set of indices of $\gamma$ that correspond to the $g$th covariate. We first prove the following useful lemma.

Lemma 1. Define $\sigma_j^2 = U_j^{*T}U_j^*$ and assume $\sigma_j^2$ is bounded away from 0 and $\infty$, i.e., there exist $\sigma_{\min}$ and $\sigma_{\max}$ such that
$$0 < \sigma_{\min} \leq \sigma_j \leq \sigma_{\max} < \infty, \qquad \forall\, j = 1, \ldots, p.$$
Define $Z_j = \frac{1}{m}(U^{*T}\varepsilon)_j$ and $Z_{I_g} = (Z_i : i \in I_g)$. Choose $0 < \tau < 1/2$, $\kappa > \sqrt{2}K\iota\sigma_{\max}/(\tau\sigma_{\min})$, and $\lambda_1 = \kappa\sigma_{\min}\sqrt{m^{-1}\log(q)}$. Then
$$P\left(\max_{g=1,\ldots,p} K\|Z_{I_g}\| \leq \tau\lambda_1\right) \geq 1 - p^{\,1-\kappa^2\tau^2\sigma_{\min}^2/(2K^2\iota^2\sigma_{\max}^2)}.$$

Proof. For all $i \in 1, \ldots, p$, $Z_i \sim N(0, m^{-1}\sigma_i^2)$; then
$$\begin{aligned}
P\left(\max_{g=1,\ldots,p} K\|Z_{I_g}\| \geq \tau\lambda_1\right)
&\leq P\left(\max_{j=1,\ldots,q}|Z_j| \geq \tau\lambda_1/(K\iota)\right) \\
&\leq p\,\exp\left(-\frac{m}{2\sigma_{\max}^2}\left(\frac{\tau\lambda_1}{K\iota}\right)^2\right) \\
&\leq p^{\,1-\kappa^2\tau^2\sigma_{\min}^2/(2K^2\iota^2\sigma_{\max}^2)}. \qquad\Box
\end{aligned}$$
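As a quick sanity check of Lemma 1, one can sample the $Z_{I_g}$ directly from the normal distribution asserted in the proof and compare the empirical frequency of the event $\{\max_g K\|Z_{I_g}\| \leq \tau\lambda_1\}$ with the stated lower bound. The sketch below assumes each of the $p$ groups contributes $K$ coordinates (so $q = pK$) and takes $\iota = \sqrt{K}$; these choices, like all constants here, are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
p, K, m = 40, 3, 600                     # p groups of K coordinates each
q = p * K
iota = np.sqrt(K)                        # illustrative value for the constant iota
sigma = rng.uniform(0.8, 1.2, size=q)    # sigma_j bounded away from 0 and infinity
tau = 0.4
kappa = 1.5 * np.sqrt(2) * K * iota * sigma.max() / (tau * sigma.min())
lam1 = kappa * sigma.min() * np.sqrt(np.log(q) / m)

reps, hits = 2000, 0
for _ in range(reps):
    Z = rng.normal(scale=sigma / np.sqrt(m))         # Z_j ~ N(0, sigma_j^2 / m)
    Zg = Z.reshape(p, K)                             # row g holds Z_{I_g}
    if K * np.linalg.norm(Zg, axis=1).max() <= tau * lam1:
        hits += 1

exponent = 1 - kappa**2 * tau**2 * sigma.min()**2 / (2 * K**2 * iota**2 * sigma.max()**2)
print(f"empirical frequency {hits / reps:.3f}  vs  stated bound {1 - p**exponent:.3f}")
```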

Proof of Theorem 3.3. Starting from the minimization problem (26), we have the following equivalent statements,
$$\frac{1}{m}\|\tilde y - \tilde U\hat\gamma^*\|^2 + \lambda_1\sum_g\|\hat\gamma^*_g\| \leq \frac{1}{m}\|\tilde y - \tilde U\gamma^{*0}\|^2 + \lambda_1\sum_g\|\gamma^{*0}_g\|$$
$$\iff \frac{1}{m}\|\tilde U\gamma^{*0} - \tilde U\hat\gamma^* + \tilde\varepsilon\|^2 - \frac{1}{m}\|\tilde\varepsilon\|^2 \leq \lambda_1\sum_g\|\gamma^{*0}_g\| - \lambda_1\sum_g\|\hat\gamma^*_g\|$$
$$\iff \frac{1}{m}\|\tilde U\gamma^{*0} - \tilde U\hat\gamma^*\|^2 \leq \lambda_1\sum_g\left[\|\gamma^{*0}_g\| - \|\hat\gamma^*_g\|\right] - \frac{2}{m}\tilde\varepsilon^T\tilde U(\gamma^{*0} - \hat\gamma^*).$$

Note that
$$\frac{2}{m}\tilde\varepsilon^T\tilde U(\gamma^{*0} - \hat\gamma^*) = \frac{K}{m}\varepsilon^TU^*(\gamma^{*0} - \hat\gamma^*) - \lambda_2(\gamma^{*0})^T\Omega^*(\gamma^{*0} - \hat\gamma^*) = (A) + (B),$$
where
$$|(A)| = \frac{K}{m}\big|\varepsilon^TU^*(\gamma^{*0} - \hat\gamma^*)\big| = K\Big|\sum_g Z_{I_g}^T(\gamma^{*0}_g - \hat\gamma^*_g)\Big| \leq K\sum_g\|Z_{I_g}\|\,\|\gamma^{*0}_g - \hat\gamma^*_g\|,$$
and similarly,
$$|(B)| \leq \lambda_2\sum_g\|\Omega^*_g\gamma^{*0}\|\,\|\gamma^{*0}_g - \hat\gamma^*_g\| \leq r^*\lambda_2\sum_g\|\gamma^{*0}_g - \hat\gamma^*_g\|,$$
where $r^* = \|\Omega^*\gamma^{*0}\|_\infty$. Define $\lambda_2 = \tau\lambda_1/r^*$.
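The decomposition into (A) and (B) can also be verified numerically using the same augmented construction as in (25). The sketch below is self-contained and illustrative only; it checks the identity above and the group-wise Cauchy-Schwarz bound on |(A)|, assuming groups of $K$ consecutive indices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, q, K = 50, 12, 3
m = n * K
U_star, J = rng.normal(size=(n, q)), rng.normal(size=(q, q))
Omega_star, lam2 = J.T @ J, 0.7
eps = rng.normal(size=n)
gamma0, gamma_hat = rng.normal(size=q), rng.normal(size=q)

U_tilde = np.vstack([np.sqrt(K / 2) * U_star, np.sqrt(m * lam2 / 2) * J])
eps_tilde = np.concatenate([np.sqrt(K / 2) * eps, -np.sqrt(m * lam2 / 2) * J @ gamma0])

diff = gamma0 - gamma_hat
lhs = 2 / m * eps_tilde @ U_tilde @ diff
A = K / m * eps @ U_star @ diff
B = -lam2 * gamma0 @ Omega_star @ diff
assert np.isclose(lhs, A + B)            # the (A) + (B) decomposition

# |(A)| <= K * sum_g ||Z_{I_g}|| * ||gamma0_g - gamma_hat_g||
Z = U_star.T @ eps / m
groups = np.arange(q).reshape(-1, K)     # K consecutive indices per group (assumption)
bound = K * sum(np.linalg.norm(Z[Ig]) * np.linalg.norm(diff[Ig]) for Ig in groups)
assert abs(A) <= bound + 1e-12
```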

On the event $\Lambda_{n,p} = \{\max_{g=1,\ldots,p} K\|Z_{I_g}\| \leq \tau\lambda_1\}$, we have
$$\frac{1}{m}\|\tilde U\gamma^{*0} - \tilde U\hat\gamma^*\|^2 \leq \lambda_1\sum_g\left[\|\gamma^{*0}_g\| - \|\hat\gamma^*_g\|\right] + 2\tau\lambda_1\sum_g\|\gamma^{*0}_g - \hat\gamma^*_g\|. \qquad\qquad (27)$$
Adding $(1-2\tau)\lambda_1\sum_g\|\gamma^{*0}_g - \hat\gamma^*_g\|$ to both sides of (27) and noting the fact that $\|\gamma^{*0}_g - \hat\gamma^*_g\| + \|\gamma^{*0}_g\| - \|\hat\gamma^*_g\| = 0$ for any $g \notin A^0$, we can further simplify (27) to
$$\frac{1}{m}\|\tilde U\gamma^{*0} - \tilde U\hat\gamma^*\|^2 + (1-2\tau)\lambda_1\sum_g\|\gamma^{*0}_g - \hat\gamma^*_g\| \leq 2\lambda_1\sum_{g\in A^0}\|\gamma^{*0}_g - \hat\gamma^*_g\|, \qquad\qquad (28)$$

using the triangle inequality. From (28), we obtain
$$\sum_g\|\gamma^{*0}_g - \hat\gamma^*_g\| \leq \frac{2}{1-2\tau}\sum_{g\in A^0}\|\gamma^{*0}_g - \hat\gamma^*_g\|.$$

By Condition B($A^0$, $\tau$) and the norm inequality
$$\Big(\sum_{g\in A^0}\|\gamma^{*0}_g - \hat\gamma^*_g\|\Big)^2 \leq |A^0|\sum_{g\in A^0}\|\gamma^{*0}_g - \hat\gamma^*_g\|^2,$$
we have
$$\sum_{g\in A^0}\|\gamma^{*0}_g - \hat\gamma^*_g\| \leq \frac{\sqrt{|A^0|}}{\sqrt{m\phi}}\,\|\tilde U\gamma^{*0} - \tilde U\hat\gamma^*\|. \qquad\qquad (29)$$

Combining (29) with (28) leads to
$$\frac{1}{m}\|\tilde U\gamma^{*0} - \tilde U\hat\gamma^*\|^2 \leq 2\lambda_1\sum_{g\in A^0}\|\gamma^{*0}_g - \hat\gamma^*_g\| \leq 2\lambda_1\cdot\frac{\sqrt{|A^0|}}{\sqrt{m\phi}}\,\|\tilde U\gamma^{*0} - \tilde U\hat\gamma^*\|.$$
Thanks to the inequality $2ab \leq a^2/2 + 2b^2$, we have
$$\frac{1}{m}\|\tilde U\gamma^{*0} - \tilde U\hat\gamma^*\|^2 \leq \frac{4\lambda_1^2|A^0|}{\phi},$$

which leads to
$$\frac{1}{n}\|U\gamma^0 - U\hat\gamma\|^2 \leq \frac{8\lambda_1^2|A^0|}{\phi}, \qquad (\gamma^0 - \hat\gamma)^T\Omega(\gamma^0 - \hat\gamma) \leq \frac{8r^*\lambda_1|A^0|}{\tau\phi}.$$
On the other hand, Condition B($A^0$, $\tau$) and (29) imply
$$\sum_g\|\gamma^{*0}_g - \hat\gamma^*_g\| \leq \frac{2}{1-2\tau}\cdot\frac{\sqrt{|A^0|}}{\sqrt{m\phi}}\,\|\tilde U\gamma^{*0} - \tilde U\hat\gamma^*\| \leq \frac{4\lambda_1|A^0|}{(1-2\tau)\phi}.$$
This implies
$$\sum_g\|\gamma^0_g - \hat\gamma_g\| \leq \sum_g\frac{\|\gamma^{*0}_g - \hat\gamma^*_g\|}{\mu_{\min}} \leq \frac{4\lambda_1|A^0|}{(1-2\tau)\phi\,\mu_{\min}}. \qquad\Box$$

References

Bach, F. R. (2008), “Consistency of the Group Lasso and Multiple Kernel Learning,” Journal of Machine Learning Research, 9, 1179–1225.

Hebiri, M. and van de Geer, S. (2010), “The Smooth-Lasso and Other ℓ1 + ℓ2-Penalized Methods,” arXiv, 1003.4885v1.

Yuan, M. and Lin, Y. (2006), “Model Selection and Estimation in Regression with Grouped Variables,” Journal of the Royal Statistical Society, Ser. B, 68, 49–67.
