Adaptive Covariance Estimation with model selection
arXiv:1203.0107v1 [math.ST] 1 Mar 2012
Rolando Biscay, Hélène Lescornel and Jean-Michel Loubes
Abstract. We provide in this paper a fully adaptive penalized procedure to select a covariance estimator among a collection of models, based on i.i.d. replications of the process observed at fixed points. For this we generalize the results of [3] and propose to use a data-driven penalty to obtain an oracle inequality for the estimator. We prove that this method is an extension to the matricial regression model of the work by Baraud in [1].
Keywords: covariance estimation, model selection, adaptive procedure.
1 Introduction
Estimating the covariance function of a stochastic process is a fundamental issue in statistics, with applications ranging from geostatistics and financial time series to epidemiology (we refer to [10], [8] or [5] for general references). While parametric methods have been extensively studied in the statistical literature (see [5] for a review), nonparametric procedures have only recently received attention, see for instance [6, 3, 4, 2] and references therein. In [3], a model selection procedure is proposed to construct a nonparametric estimator of the covariance function of a stochastic process under mild assumptions. However, their method heavily relies on prior knowledge of the variance. In this paper, we extend this procedure and propose a fully data-driven penalty which leads to selecting the best covariance among a collection of models. This result constitutes a generalization to the matricial regression model of the selection methodology provided in [1].

Consider a stochastic process (X(t))_{t∈T} taking its values in R and indexed by T ⊂ R^d, d ∈ N. We assume that E[X(t)] = 0 for all t ∈ T and we aim at estimating its covariance function σ(s, t) = E[X(s)X(t)] < ∞ for all s, t ∈ T. We assume we observe X_i(t_j) for i ∈ {1, . . . , n} and j ∈ {1, . . . , p}. Note that the observation points t_j are fixed and that the X_i's are independent copies of the process X. Set x_i = (X_i(t_1), . . . , X_i(t_p)) for all i ∈ {1, . . . , n} and denote by Σ the covariance matrix of X at the observation points,

Σ = E[x_i x_i^⊤] = (σ(t_j, t_k))_{1≤j,k≤p}.

Following the methodology presented in [3], we approximate the process X by its projection onto some finite dimensional model. For this, consider a countable set of functions (g_λ)_{λ∈Λ}, which may be for instance a basis of L²(T), and choose a collection of models M ⊂ P(Λ). For m ∈ M, a finite set of indices, the process can be approximated by

X(t) ≈ ∑_{λ∈m} a_λ g_λ(t).
Such an approximation leads to an estimator, denoted by Σ̂_m, which depends on the collection of functions indexed by m. Our objective is to select, in a data-driven way, the best model, i.e. the one close to an oracle m_0 defined as a minimizer of the quadratic risk, namely

m_0 ∈ arg min_{m∈M} R(m) = arg min_{m∈M} E‖Σ − Σ̂_m‖².
This result is achieved using a model selection procedure. The paper falls into the following parts. The description of the statistical framework of the matrix regression model is given in Section 2. Section 3 is devoted to the main statistical results: we recall the results on the estimator given in [3] and prove an oracle inequality with a fully data-driven penalty. Section 4 states technical results which are used throughout the paper, while the proofs are postponed to the Appendix.
2 Statistical model and notations
We consider an R-valued process X(t) indexed by T, a subset of R^d, with expectation equal to 0. We are interested in its covariance function, denoted by σ(s, t) = E[X(s)X(t)]. We have at hand the observations x_i = (X_i(t_1), . . . , X_i(t_p)) for 1 ≤ i ≤ n, where the X_i are independent copies of the process and the t_j are deterministic points. We note Σ ∈ R^{p×p} the covariance matrix of the vector x_i. Hence we observe

x_i x_i^⊤ = Σ + U_i,   1 ≤ i ≤ n,   (1)

where the U_i are i.i.d. error matrices with expectation 0. We denote by S the empirical covariance of the sample: S = (1/n) ∑_{i=1}^n x_i x_i^⊤. We use the Frobenius norm ‖·‖ defined by ‖A‖² = Tr(AA^⊤) for any matrix A. Recall that for a given matrix A ∈ R^{p×q}, vec(A) is the vector in R^{pq} obtained by stacking the columns of A on top of one another. We denote by A^- a reflexive generalized inverse of the matrix A, see for instance [9] or [7].

The idea is to consider that we have a quite good approximation of the process of the following form

X(t) ≈ ∑_{λ∈m} a_λ g_λ(t),   (2)
where m is a finite subset of a countable set Λ, (a_λ)_{λ∈Λ} are random coefficients in R and (g_λ)_{λ∈Λ} are real valued functions. We will consider models m among a finite collection denoted by M. We note G_m ∈ R^{p×|m|} the matrix with entries (G_m)_{jλ} = g_λ(t_j), and a_m the random vector of R^{|m|} with coefficients (a_λ)_{λ∈m}. Hence, we obtain the following approximations:

x = (X(t_1), . . . , X(t_p))^⊤ ≈ G_m a_m,
xx^⊤ ≈ G_m a_m a_m^⊤ G_m^⊤,
Σ ≈ G_m E[a_m a_m^⊤] G_m^⊤.
Thus, this point of view leads us to approximate Σ by a matrix in the subset

S(G_m) = { G_m Ψ G_m^⊤ : Ψ symmetric in R^{|m|×|m|} } ⊂ R^{p×p}.   (3)
Hence, for a model m, a natural estimator for Σ is given by the projection of S onto S(G_m). We can prove using standard algebra (see [3] for a general proof) that it has the following form:

Σ̂_m = Π_m S Π_m ∈ R^{p×p},   m ∈ M,   (4)

where

Π_m = G_m (G_m^⊤ G_m)^- G_m^⊤ ∈ R^{p×p}   (5)

are orthogonal projection matrices. Set D_m = Tr(Π_m ⊗ Π_m), which is the dimension of S(G_m), assumed to be positive, and Σ_m = Π_m Σ Π_m the projection of Σ onto this subspace. Hence we obtain the model selection procedure defined in [3]. The estimation error for a model m ∈ M is given by
E‖Σ − Σ̂_m‖² = ‖Σ − Π_m Σ Π_m‖² + δ_m² D_m / n,   (6)

where

δ_m² = Tr((Π_m ⊗ Π_m) Φ) / D_m,   Φ = V[vec(x_1 x_1^⊤)].
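As an illustration of how the estimator (4)-(5) can be computed in practice, here is a minimal NumPy sketch (not part of the original development); the matrix G below plays the role of G_m evaluated at the observation points and X stacks the observed vectors x_i, both hypothetical names.

import numpy as np

def projection_matrix(G):
    # Pi_m = G (G^T G)^- G^T; the Moore-Penrose pseudo-inverse is a reflexive generalized inverse
    return G @ np.linalg.pinv(G.T @ G) @ G.T

def covariance_estimator(X, G):
    # X: n x p array whose rows are the observed vectors x_i (the process is centred)
    S = X.T @ X / X.shape[0]          # empirical covariance S = (1/n) sum_i x_i x_i^T
    Pi = projection_matrix(G)
    Sigma_hat = Pi @ S @ Pi           # estimator (4)
    D_m = np.trace(Pi) ** 2           # D_m = Tr(Pi_m (x) Pi_m) = Tr(Pi_m)^2
    return Sigma_hat, D_m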
Given θ > 0, it is thus natural to define the penalized covariance estimator Σ̂ = Σ̂_m̂ by

m̂ = arg min_{m∈M} { (1/n) ∑_{i=1}^n ‖x_i x_i^⊤ − Σ̂_m‖² + pen(m) },

where

pen(m) = (1 + θ) δ_m² D_m / n.   (7)
The following result, proved in [3], states an oracle inequality for the estimator Σ̂.
Theorem 2.1. Let q > 0 be given such that there exists κ > 2(1 + q) satisfying E‖x_1 x_1^⊤‖^κ < ∞. Then, for some constants K(θ) > 1 and C′(θ, κ, q) > 0, we have

( E‖Σ − Σ̂‖^{2q} )^{1/q} ≤ 2^{(1/q − 1)} [ K(θ) inf_{m∈M} ( ‖Σ − Π_m Σ Π_m‖² + δ_m² D_m / n ) + (Δ_κ^q / n) δ_sup² ],

where

Δ_κ^q = C′(θ, κ, q) E‖x_1 x_1^⊤‖^κ ∑_{m∈M} δ_m^{−κ} D_m^{−(κ/2 − 1 − q)}
and δ_sup² = max{ δ_m² : m ∈ M }.

However the penalty defined here depends on the quantity δ_m², which is unknown in practice since it relies on the matrix Φ = V[vec(xx^⊤)]. Our objective is to study a covariance estimator built with a new penalty involving an estimator of Φ. More precisely, we will replace pen(m) by an empirical version p̂en(m), where

p̂en(m) = (1 + θ) δ̂_m² D_m / n,   (8)

and

δ̂_m² = Tr((Π_m ⊗ Π_m) Φ̂) / D_m,

with Φ̂ an estimator of Φ. The objective is to generalize Theorem 2.1 and to construct a fully adaptive penalized procedure to estimate the covariance function.
3 Main result: adaptive penalized covariance estimation
Here we state the oracle inequality obtained for the new covariance estimator introduced previously. Set y_i = vec(x_i x_i^⊤), 1 ≤ i ≤ n, which are vectors in R^{p²}, and denote by S_vec = (1/n) ∑_{i=1}^n y_i their empirical mean. Consider the constant C_inf = inf_{m∈M} Tr((Π_m ⊗ Π_m) Φ), and assume that the collection of models is chosen such that C_inf > 0. Set

Φ̂ = (1/n) ∑_{i=1}^n y_i y_i^⊤ − S_vec S_vec^⊤,

δ̂_m² = Tr((Π_m ⊗ Π_m) Φ̂) / D_m.
Given θ > 0, we consider the covariance estimator Σ̃ = Σ̂_m̃ with

m̃ = arg min_{m∈M} { (1/n) ∑_{i=1}^n ‖x_i x_i^⊤ − Σ̂_m‖² + p̂en(m) },

where

p̂en(m) = (1 + θ) δ̂_m² D_m / n.   (9)

Theorem 3.1. Let 1 ≥ q > 0 be given such that there exists β > max(2(1 + 2q), 3 + 2q) satisfying E‖xx^⊤‖^β < ∞. Then, for a constant C depending on θ, β and q, we have for n ≥ n(β, θ, C_inf, Σ) and for all κ ∈ ]2(1 + 2q); min(β, 2β − 4)[:
( E‖Σ − Σ̃‖^{2q} )^{1/q} ≤ C inf_{m∈M} ( ‖Σ − Σ_m‖² + δ_m² D_m / n ) + (C/n) [ Δ̃_β ( E‖xx^⊤‖^β + ‖Σ‖² ) + δ_sup² Δ_κ^q ],   (10)

where

Δ̃_β = ( c(θ, β, q) E‖xx^⊤‖^β ∑_{m∈M} δ_m^{−β} D_m^{−β/2} )^{(1 − 2q/κ)/q},   (11)

Δ_κ^q = C(θ, κ, q) E‖xx^⊤‖^κ ∑_{m∈M} δ_m^{−κ} D_m^{−(κ/2 − 1 − q)}

and δ_sup² = max{ δ_m² : m ∈ M }.

We have obtained in Theorem 3.1 an oracle inequality, since the estimator Σ̃ has the same quadratic risk as the "oracle" estimator except for an additive term of order O(1/n) and a constant factor. Hence, the selection procedure is optimal in the sense that it behaves as if the true model were at hand.

The proof of this theorem is divided into two parts. First, as in the proof of Theorem 2.1 given in [3], we consider a vectorized version of the model (1). In this technical part we obtain an oracle inequality under some particular assumptions for a general penalty. In a second part, we prove that our particular penalty satisfies these assumptions by using properties of the estimator Φ̂.
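Before turning to the technical results, here is a minimal sketch (illustrative only, not from the paper) of the adaptive procedure of this section: it computes Φ̂, δ̂_m², the data-driven penalty (8) and the selected model m̃ of (9). The names `models` (a dictionary mapping each model to its design matrix G_m), `X` and `theta` are hypothetical.

import numpy as np

def select_covariance(X, models, theta=1.0):
    n, p = X.shape
    Y = np.array([np.outer(x, x).ravel() for x in X])      # y_i = vec(x_i x_i^T), shape n x p^2
    S_vec = Y.mean(axis=0)                                  # empirical mean of the y_i
    Phi_hat = Y.T @ Y / n - np.outer(S_vec, S_vec)          # estimator of Phi = V[vec(x x^T)]
    S = S_vec.reshape(p, p)                                 # empirical covariance S
    best = (np.inf, None, None)
    for m, G in models.items():
        Pi = G @ np.linalg.pinv(G.T @ G) @ G.T              # projection (5)
        Sigma_m = Pi @ S @ Pi                               # estimator (4)
        D_m = np.trace(Pi) ** 2                             # D_m = Tr(Pi_m (x) Pi_m)
        delta2_hat = np.trace(np.kron(Pi, Pi) @ Phi_hat) / D_m   # \hat delta_m^2
        crit = np.mean([np.linalg.norm(np.outer(x, x) - Sigma_m, 'fro') ** 2 for x in X])
        crit += (1 + theta) * delta2_hat * D_m / n          # empirical contrast + \hat pen(m), cf. (9)
        if crit < best[0]:
            best = (crit, m, Sigma_m)
    return best[1], best[2]                                 # selected model and covariance estimator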
4 Technical results

4.1 Vectorized model
Here we consider the vectorized version of model (1). In this case, we observe the following vectors in R^{p²}:

y_i = f_i + ε_i,   1 ≤ i ≤ n.   (12)

Here y_i corresponds to vec(x_i x_i^⊤) in the model (1), f_i to vec(Σ) and ε_i to vec(U_i). We set f = (f_1^⊤, . . . , f_n^⊤)^⊤, y = (y_1^⊤, . . . , y_n^⊤)^⊤ and ε = (ε_1^⊤, . . . , ε_n^⊤)^⊤, which are vectors in R^{np²}. We estimate f by an estimator of the form

f̂_m = P_m y,   m ∈ M,

where P_m is the orthogonal projection onto a subspace S_m of dimension D_m. We note f_m = P_m f and we consider the empirical norm ‖f‖_n² = (1/n) ∑_{i=1}^n f_i^⊤ f_i, with the corresponding scalar product ⟨·, ·⟩_n. First we state the vectorized form of Theorem 2.1. Write

δ_m² = Tr(P_m (I_n ⊗ Φ)) / D_m,   δ_sup² = max{ δ_m² : m ∈ M }.
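The link between the matrix model (1) and its vectorized version (12) rests on the identity vec(Π S Π) = (Π ⊗ Π) vec(S), valid for any symmetric matrix Π. The following short NumPy check (illustrative only, with column-major vec) verifies it numerically.

import numpy as np

rng = np.random.default_rng(0)
p, k = 6, 3
G = rng.standard_normal((p, k))
Pi = G @ np.linalg.pinv(G.T @ G) @ G.T          # orthogonal projection Pi_m
M = rng.standard_normal((p, p))
S = M @ M.T                                     # a symmetric matrix playing the role of S
vec = lambda A: A.reshape(-1, order="F")        # column-stacking vec
lhs = vec(Pi @ S @ Pi)
rhs = np.kron(Pi, Pi) @ vec(S)
assert np.allclose(lhs, rhs)                    # vec(Pi S Pi) = (Pi (x) Pi) vec(S)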
Given θ > 0, define the penalized estimator f̂ = f̂_m̂, where

m̂ = arg min_{m∈M} { ‖y − f̂_m‖_n² + pen(m) },

with

pen(m) = (1 + θ) δ_m² D_m / n.

Then the proof of Theorem 2.1 relies on the following proposition, proved in [3].
Proposition 4.1. Let q > 0 be given such that there exists κ > 2(1 + q) satisfying E‖ε_1‖^κ < ∞. Then, for some constants K(θ) > 1 and C(θ, κ, q) > 0, we have

( E‖f − f̂‖_n^{2q} )^{1/q} ≤ 2^{(1/q − 1)} [ K(θ) inf_{m∈M} ( ‖f − P_m f‖_n² + δ_m² D_m / n ) + (Δ_κ^q / n) δ_sup² ],   (13)

where

Δ_κ^q = C(θ, κ, q) E‖ε_1‖^κ ∑_{m∈M} δ_m^{−κ} D_m^{−(κ/2 − 1 − q)}.
The new estimator Σ̃ defined previously corresponds here to the estimator f̃ = f̂_m̃, where

m̃ = arg min_{m∈M} { ‖y − f̂_m‖_n² + p̂en(m) },

with

p̂en(m) = (1 + θ) δ̂_m² D_m / n,

and δ̂_m² is some estimator of δ_m². The next Proposition gives an oracle inequality for this estimator under new assumptions on the model. As Proposition 4.1, it is inspired by the paper [1].
Proposition 4.2. Let 1 ≥ q > 0 be given such that there exists κ > 2(1 + 2q) satisfying E‖ε_1‖^κ < ∞. For α ∈ ]0; 1[, set Ω = ∩_{m∈M} { δ̂_m² > (1 − α) δ_m² }. Assume that

A1. E[δ̂_m²] ≤ δ_m²;

A2. P(Ω^c) ≤ C̃(α) / n^γ for some γ > q / (1 − 2q/κ).

Then, for a constant C depending on κ, θ and q, we have

( E‖f − f̃‖_n^{2q} )^{1/q} ≤ C inf_{m∈M} ( ‖f − P_m f‖_n² + δ_m² D_m / n ) + (C/n) [ Δ̃_κ ( E[‖ε_1‖^κ] + ‖f‖_n² ) + δ_sup² Δ_κ^q ],   (14)

where

Δ̃_κ = C̃(α)^{(1 − 2q/κ)/q}, with α = α(θ) fixed in ]0; 1[,   (15)

and

Δ_κ^q = C(θ, κ, q) E‖ε_1‖^κ ∑_{m∈M} δ_m^{−κ} D_m^{−(κ/2 − 1 − q)}.
Theorem 3.1 is thus a direct application of Proposition 4.2. Hence only the two assumptions A1 and A2 remain to be checked.
4.2 Auxiliary concentration type lemmas
Here we state some propositions required in the proofs of the previous results. To our knowledge, the first is due to von Bahr and Esseen in [11].

Lemma 4.3. Let U_1, . . . , U_n be independent centred variables with values in R. For any 1 ≤ κ ≤ 2 we have

E| ∑_{i=1}^n U_i |^κ ≤ 8 ∑_{i=1}^n E[|U_i|^κ].
The next proposition is proved in [3].

Proposition 4.4. Given N, k ∈ N, let Ã ∈ R^{Nk×Nk} \ {0} be a non-negative definite and symmetric matrix, and let ε_1, . . . , ε_N be i.i.d. random vectors in R^k with E(ε_1) = 0 and V(ε_1) = Φ. Write ε = (ε_1^⊤, . . . , ε_N^⊤)^⊤, ζ(ε) = √(ε^⊤ Ã ε), and δ² = Tr(Ã (I_N ⊗ Φ)) / Tr(Ã). For all β ≥ 2 such that E‖ε_1‖^β < ∞ it holds that, for all x > 0,

P( ζ²(ε) ≥ δ² Tr(Ã) + 2 δ² √( Tr(Ã) ρ(Ã) x ) + δ² ρ(Ã) x ) ≤ C_2(β) ( E‖ε_1‖^β Tr(Ã) ) / ( x^{β/2} δ^β ρ(Ã) ),   (16)

where the constant C_2(β) depends only on β.
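As a purely illustrative sanity check of Proposition 4.4 (not part of the proofs), one can simulate errors and compare the empirical tail of ζ²(ε) with the deviation threshold. The Gaussian distribution, the particular projection matrix Ã and all names below are assumptions made only for this example; the constant C_2(β) is left implicit.

import numpy as np

rng = np.random.default_rng(1)
N, k, x = 50, 4, 9.0
Phi = np.diag(rng.uniform(0.5, 2.0, size=k))               # V(eps_1)
A = np.kron(np.eye(N), np.full((k, k), 1.0 / k))           # an orthogonal projection, so rho(A) = 1
trA = np.trace(A)
delta2 = np.trace(A @ np.kron(np.eye(N), Phi)) / trA       # delta^2 = Tr(A (I_N (x) Phi)) / Tr(A)
thr = delta2 * trA + 2 * delta2 * np.sqrt(trA * x) + delta2 * x
reps, count = 2000, 0
for _ in range(reps):
    eps = rng.multivariate_normal(np.zeros(k), Phi, size=N).ravel()   # stacked (eps_1,...,eps_N)
    count += eps @ A @ eps >= thr                                     # event {zeta(eps)^2 >= threshold}
print("empirical tail:", count / reps)
# Proposition 4.4 bounds this tail by C_2(beta) E||eps_1||^beta Tr(A) / (x^{beta/2} delta^beta rho(A))
# for any beta >= 2 with a finite moment; the constant C_2(beta) is not made explicit here.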
5 Appendix

5.1 Proof of Proposition 4.2
This proof follows the guidelines of the proof of Theorem 6.1 in [1]. The following lemma will be helpful for the proof of this proposition.

Lemma 5.1. Choose η = η(θ) > 0 and α = α(θ) ∈ ]0; 1[ such that (1 + θ)(1 − α) > (1 + 2η). Set

H_m(f) = [ ‖f − f̃‖_n² − κ̃(θ) ( ‖f − f_m‖_n² + (D_m/n) δ̂_m² ) ]_+,

where κ̃(θ) = (2 + 4/η)(1 + θ). Then, for m_0 minimizing m ↦ ‖f − f_m‖_n² + (D_m/n) δ_m² over m ∈ M,

E[ H_{m_0}(f)^q 1_Ω ] ≤ Δ_κ^q δ_sup^{2q} / n^q,   (17)
where Δ_κ^q was defined in Proposition 4.2.

Proof of Lemma 5.1. First, remark that on the set Ω, for all m ∈ M,

p̂en(m) > (1 − α)(1 + θ) δ_m² D_m / n > (1 + 2η) δ_m² D_m / n.

Set pen(m) = (1 + 2η) δ_m² D_m / n, which corresponds to the penalty of Proposition 4.1 (with θ replaced by 2η). The proof of this lemma is based on the proof of Proposition 4.1 in [3]. In fact, it is sufficient to prove that for each x > 0 and κ ≥ 2,

P( H(f) 1_Ω ≥ (1 + 1/η) (x/n) δ_{m_0}² ) ≤ c(κ, η) E‖ε_1‖^κ ∑_{m∈M} (D_m ∨ 1) / ( δ_m^κ (η D_m + x)^{κ/2} ),   (18)

where we have set

H(f) = [ ‖f − f̃‖_n² − (2 + 4/η) ( ‖f − f_{m_0}‖_n² + p̂en(m_0) ) ]_+.

Indeed,

‖f − f_{m_0}‖_n² + p̂en(m_0) = ‖f − f_{m_0}‖_n² + (1 + θ) δ̂_{m_0}² D_{m_0} / n ≤ (1 + θ) ( ‖f − f_{m_0}‖_n² + δ̂_{m_0}² D_{m_0} / n ),

so that for all q > 0,

H(f)^q 1_Ω ≥ H_{m_0}(f)^q 1_Ω.

Using the equality

E[ H(f)^q 1_Ω ] = ∫_0^∞ q u^{q−1} P( H(f)^q 1_Ω > u ) du   (19)

and following the proof of Proposition 4.1 in [3], we obtain the upper bound (17) of Lemma 5.1.

Now we turn to the proof of (18). For any g ∈ R^{np²} we define the empirical quadratic loss function by γ_n(g) = ‖y − g‖_n².
Using the definition of γ_n we have that, for all g ∈ R^{np²},

‖f − g‖_n² = γ_n(g) + 2⟨g − y, ε⟩_n + ‖ε‖_n²,

and therefore

‖f − f̃‖_n² − ‖f − P_{m_0} f‖_n² = γ_n(f̃) − γ_n(P_{m_0} f) + 2⟨f̃ − P_{m_0} f, ε⟩_n.   (20)

Using the definition of f̃, we know that γ_n(f̃) + p̂en(m̃) ≤ γ_n(g) + p̂en(m_0) for all g ∈ S_{m_0}. Then

γ_n(f̃) − γ_n(P_{m_0} f) ≤ p̂en(m_0) − p̂en(m̃).   (21)

So we get from (20) and (21) that

‖f − f̃‖_n² ≤ ‖f − P_{m_0} f‖_n² + p̂en(m_0) − p̂en(m̃) + 2⟨f − P_{m_0} f, ε⟩_n + 2⟨P_{m̃} f − f, ε⟩_n + 2⟨f̃ − P_{m̃} f, ε⟩_n.   (22)
In the following we set, for each m′ ∈ M,

B_{m′} = { g ∈ S_{m′} : ‖g‖_n ≤ 1 },

G_{m′} = sup_{g∈B_{m′}} ⟨g, ε⟩_n = ‖P_{m′} ε‖_n,

u_{m′} = (P_{m′} f − f) / ‖P_{m′} f − f‖_n   if ‖P_{m′} f − f‖_n ≠ 0,   and u_{m′} = 0 otherwise.

Since f̃ = P_{m̃} f + P_{m̃} ε, (22) gives

‖f − f̃‖_n² ≤ ‖f − P_{m_0} f‖_n² + p̂en(m_0) − p̂en(m̃) + 2 ‖f − P_{m_0} f‖_n |⟨u_{m_0}, ε⟩_n| + 2 ‖f − P_{m̃} f‖_n |⟨u_{m̃}, ε⟩_n| + 2 G_{m̃}².   (23)
Using repeatedly the following elementary inequality, which holds for all positive numbers ν, x, z,

2xz ≤ ν x² + (1/ν) z²,   (24)

we get for any m′ ∈ M

2 ‖f − P_{m′} f‖_n |⟨u_{m′}, ε⟩_n| ≤ ν ‖f − P_{m′} f‖_n² + (1/ν) |⟨u_{m′}, ε⟩_n|².   (25)
By Pythagoras' theorem we have

‖f − f̃‖_n² = ‖f − P_{m̃} f‖_n² + ‖P_{m̃} f − f̃‖_n² = ‖f − P_{m̃} f‖_n² + G_{m̃}².   (26)
We derive from (23) and (25) that for any ν > 0

‖f − f̃‖_n² ≤ ‖f − P_{m_0} f‖_n² + ν ‖f − P_{m_0} f‖_n² + (1/ν) ⟨u_{m_0}, ε⟩_n² + ν ‖f − P_{m̃} f‖_n² + (1/ν) ⟨u_{m̃}, ε⟩_n² + 2 G_{m̃}² + p̂en(m_0) − p̂en(m̃).

Now taking into account that, by equation (26), ‖f − P_{m̃} f‖_n² = ‖f − f̃‖_n² − G_{m̃}², the above inequality is equivalent to

(1 − ν) ‖f − f̃‖_n² ≤ (1 + ν) ‖f − P_{m_0} f‖_n² + (1/ν) ⟨u_{m_0}, ε⟩_n² + (1/ν) ⟨u_{m̃}, ε⟩_n² + (2 − ν) G_{m̃}² + p̂en(m_0) − p̂en(m̃).   (27)

We choose ν = 2/(2 + η) ∈ ]0, 1[, but for the sake of simplicity we keep using the notation ν. Let p̃_1 and p̃_2 be two functions, depending on ν, mapping M into R_+. They will be specified as in [3] to satisfy

pen(m′) ≥ (2 − ν) p̃_1(m′) + (1/ν) p̃_2(m′)   for all m′ ∈ M.   (28)
Remember that on Ω, p̂en(m) > pen(m) for all m ∈ M. Since (1/ν) p̃_2(m_0) ≤ pen(m_0) and 1 + ν ≤ 2, we get from (27) and (28) that, on the set Ω,

(1 − ν) ‖f − f̃‖_n² ≤ (1 + ν) ‖f − P_{m_0} f‖_n² + p̂en(m_0) + p̃_2(m_0) + (2 − ν) G_{m̃}² − p̃_1(m̃) + (1/ν) ⟨u_{m̃}, ε⟩_n² − p̃_2(m̃) + (1/ν) ⟨u_{m_0}, ε⟩_n² − p̃_2(m_0)
  ≤ 2 ‖f − P_{m_0} f‖_n² + p̂en(m_0) + (2 − ν) G_{m̃}² − p̃_1(m̃) + (1/ν) ⟨u_{m̃}, ε⟩_n² − p̃_2(m̃) + (1/ν) ⟨u_{m_0}, ε⟩_n² − p̃_2(m_0).   (29)

As 2/(1 − ν) = 2 + 4/η, we obtain that

(1 − ν) H(f) 1_Ω = (1 − ν) [ ‖f − f̃‖_n² − (2 + 4/η) ( ‖f − P_{m_0} f‖_n² + p̂en(m_0) ) ]_+ 1_Ω
  = [ (1 − ν) ‖f − f̃‖_n² − 2 ( ‖f − P_{m_0} f‖_n² + p̂en(m_0) ) ]_+ 1_Ω
  ≤ [ (2 − ν) G_{m̃}² − p̃_1(m̃) + (1/ν) ⟨u_{m̃}, ε⟩_n² − p̃_2(m̃) + (1/ν) ⟨u_{m_0}, ε⟩_n² − p̃_2(m_0) ]_+.
For any x > 0,

P( (1 − ν) H(f) 1_Ω ≥ x δ_{m_0}² / n )
  ≤ P( ∃ m′ ∈ M : (2 − ν) G_{m′}² − p̃_1(m′) ≥ x δ_{m′}² / (3n) ) + P( ∃ m′ ∈ M : (1/ν) ⟨u_{m′}, ε⟩_n² − p̃_2(m′) ≥ x δ_{m′}² / (3n) )
  ≤ ∑_{m′∈M} P( (2 − ν) ‖P_{m′} ε‖_n² − p̃_1(m′) ≥ x δ_{m′}² / (3n) ) + ∑_{m′∈M} P( (1/ν) ⟨u_{m′}, ε⟩_n² − p̃_2(m′) ≥ x δ_{m′}² / (3n) )
  := ∑_{m′∈M} P_{1,m′}(x) + ∑_{m′∈M} P_{2,m′}(x).   (30)
From now on, the proof of Lemma 5.1 is exactly the same as the end of the proof of Proposition 4.1 in [3], with L_m = ν.

Proof of Proposition 4.2.
We first provide an upper bound for E[ ‖f − f̃‖_n^{2q} 1_Ω ], where the set Ω depends on α chosen as in Lemma 5.1. As q ≤ 1, we have (a + b)^q ≤ a^q + b^q. Together with Lemma 5.1 we deduce that

E[ ‖f − f̃‖_n^{2q} 1_Ω ] ≤ Δ_κ^q δ_sup^{2q} / n^q + E[ ( κ̃(θ) ( ‖f − f_{m_0}‖_n² + (D_{m_0}/n) δ̂_{m_0}² ) )^q ].

Using the convexity of x ↦ x^{1/q} together with the Jensen inequality, we obtain

( E[ ‖f − f̃‖_n^{2q} 1_Ω ] )^{1/q} ≤ 2^{1/q − 1} Δ_κ δ_sup² / n + 2^{1/q − 1} E[ κ̃(θ) ( ‖f − f_{m_0}‖_n² + (D_{m_0}/n) δ̂_{m_0}² ) ],

and by using the assumption A1 we have that

( E[ ‖f − f̃‖_n^{2q} 1_Ω ] )^{1/q} ≤ 2^{1/q − 1} Δ_κ δ_sup² / n + 2^{1/q − 1} κ̃(θ) ( ‖f − f_{m_0}‖_n² + (D_{m_0}/n) δ_{m_0}² ).

Now we need to find an upper bound for the quantity E[ ‖f − f̃‖_n^{2q} 1_{Ω^c} ].
First, remark that

‖f − f̃‖_n² = ‖f − P_{m̃} y‖_n² = ‖f − P_{m̃} f‖_n² + ‖P_{m̃}(f − y)‖_n²
  ≤ ‖f − P_{m̃} f‖_n² + ‖ε‖_n² = ‖f‖_n² − ‖P_{m̃} f‖_n² + ‖ε‖_n²,

and thus

‖f − f̃‖_n² ≤ ‖f‖_n² + ‖ε‖_n².   (31)
So we have

E[ ‖f − f̃‖_n^{2q} 1_{Ω^c} ] ≤ ‖f‖_n^{2q} P(Ω^c) + E[ ‖ε‖_n^{2q} 1_{Ω^c} ].

Using Hölder's inequality with κ/(2q) > 1 we obtain

E[ ‖ε‖_n^{2q} 1_{Ω^c} ] ≤ ( E[‖ε‖_n^κ] )^{2q/κ} P(Ω^c)^{1 − 2q/κ}.

But

E[ ‖ε‖_n^κ ] = (1/n^{κ/2}) E[ ( ∑_{i=1}^n ‖ε_i‖² )^{κ/2} ],

and as κ > 2, we can use Minkowski's inequality to obtain

E[ ‖ε‖_n^κ ] ≤ (1/n^{κ/2}) ( ∑_{i=1}^n ( E[‖ε_i‖^κ] )^{2/κ} )^{κ/2} = (1/n^{κ/2}) n^{κ/2} E[‖ε_1‖^κ],

that is, E[ ‖ε‖_n^κ ] ≤ E[‖ε_1‖^κ]. So we have
E[ ‖f − f̃‖_n^{2q} 1_{Ω^c} ] ≤ ( E[‖ε_1‖^κ]^{2q/κ} + ‖f‖_n^{2q} ) P(Ω^c)^{1 − 2q/κ},

and with assumption A2

E[ ‖f − f̃‖_n^{2q} 1_{Ω^c} ] ≤ ( E[‖ε_1‖^κ]^{2q/κ} + ‖f‖_n^{2q} ) ( C̃(α) / n^γ )^{1 − 2q/κ}.

As γ > q / (1 − 2q/κ), we deduce that

( E[ ‖f − f̃‖_n^{2q} 1_{Ω^c} ] )^{1/q} ≤ 2^{1/q − 1} ( E[‖ε_1‖^κ]^{2/κ} + ‖f‖_n² ) C̃(α)^{(1 − 2q/κ)/q} / n.

To conclude, we use again the convexity of x ↦ x^{1/q} and the inequality (31) to get

( E[ ‖f − f̃‖_n^{2q} ] )^{1/q} ≤ 4^{1/q − 1} ( E[‖ε_1‖^κ]^{2/κ} + ‖f‖_n² ) C̃(α)^{(1 − 2q/κ)/q} / n
  + 4^{1/q − 1} Δ_κ δ_sup² / n
  + 4^{1/q − 1} κ̃(θ) ( ‖f − f_{m_0}‖_n² + (D_{m_0}/n) δ_{m_0}² ).   (32)
5.2 Proof of Theorem 3.1
Recall that β > max(2(1 + 2q), 3 + 2q) and κ ∈ ]2(1 + 2q); min(β, 2β − 4)[. In order to use Proposition 4.2, we need to prove the following inequalities:

A1. E[δ̂_m²] ≤ δ_m²;

A2. P(Ω^c) ≤ C̃(α) / n^γ for γ > q / (1 − 2q/κ).

First we prove A1. Remember that δ̂_m² = Tr((Π_m ⊗ Π_m) Φ̂) / D_m. By using the linearity of the trace and the equality E[Φ̂] = ((n − 1)/n) Φ (which follows from E[(1/n) ∑_{i=1}^n y_i y_i^⊤] = Φ + µµ^⊤ and E[S_vec S_vec^⊤] = Φ/n + µµ^⊤, where µ = E[y_1] = vec(Σ)), we obtain that E[δ̂_m²] = ((n − 1)/n) δ_m² ≤ δ_m², which proves the result.

For the second, write Ω^c = ∪_{m∈M} { δ̂_m² ≤ (1 − α) δ_m² }. We bound the quantity P( δ̂_m² ≤ (1 − α) δ_m² ) in the following Proposition.

Proposition 5.2. For all m ∈ M, α ∈ ]0; 1[ and n ≥ n(κ, β, α, C_inf, Σ), we have for some constants C_1(β), C_2(β):

P( δ̂_m² ≤ (1 − α) δ_m² ) ≤ (1/n^γ) [ C_2(β) 2^{β+1} + C_1(β) ] (2/α)^β E‖xx^⊤‖^β δ_m^{−β} D_m^{−β/2},

for some γ > q / (1 − 2q/κ).
This Proposition concludes the proof of A2, with

C̃(α) = [ C_2(β) 2^{β+1} + C_1(β) ] (2/α)^β E‖xx^⊤‖^β ∑_{m∈M} δ_m^{−β} D_m^{−β/2}.

Proof of Proposition 5.2. We start by splitting P_m = P( δ̂_m² ≤ (1 − α) δ_m² ) into two parts, one of which involves a sum of independent variables with expectation equal to 0:

P_m = P( Tr((Π_m ⊗ Π_m) Φ̂) ≤ (1 − α) Tr((Π_m ⊗ Π_m) Φ) )
    = P( Tr( (Π_m ⊗ Π_m)( Φ̂ − Φ ) ) ≤ − α Tr((Π_m ⊗ Π_m) Φ) )
    = P( Tr( (Π_m ⊗ Π_m)( (1/n) ∑_{i=1}^n y_i y_i^⊤ − Φ − µµ^⊤ + µµ^⊤ − S_vec S_vec^⊤ ) ) ≤ − α Tr((Π_m ⊗ Π_m) Φ) ),

where µ = E[y_1] = vec(Σ), so that

P_m ≤ P( | Tr( (Π_m ⊗ Π_m)( (1/n) ∑_{i=1}^n y_i y_i^⊤ − Φ − µµ^⊤ ) ) | ≥ (α/2) Tr((Π_m ⊗ Π_m) Φ) )
   + P( | Tr( (Π_m ⊗ Π_m)( µµ^⊤ − S_vec S_vec^⊤ ) ) | ≥ (α/2) Tr((Π_m ⊗ Π_m) Φ) ).

Set

Q_1 = P( | Tr( (Π_m ⊗ Π_m)( (1/n) ∑_{i=1}^n y_i y_i^⊤ − Φ − µµ^⊤ ) ) | ≥ (α/2) Tr((Π_m ⊗ Π_m) Φ) )

and

Q_2 = P( | Tr( (Π_m ⊗ Π_m)( µµ^⊤ − S_vec S_vec^⊤ ) ) | ≥ (α/2) Tr((Π_m ⊗ Π_m) Φ) ).

Study of Q_1. First we use Markov's inequality to obtain
Q_1 ≤ E[ | Tr( (Π_m ⊗ Π_m)( (1/n) ∑_{i=1}^n y_i y_i^⊤ − Φ − µµ^⊤ ) ) |^{β/2} ] / ( (α/2) Tr((Π_m ⊗ Π_m) Φ) )^{β/2}.
We must consider the two following cases.

• If β/2 ≥ 2, Rosenthal's inequality gives

E| Tr( (Π_m ⊗ Π_m)( (1/n) ∑_{i=1}^n y_i y_i^⊤ − Φ − µµ^⊤ ) ) |^{β/2}
  ≤ C_{β/2} (1/n^{β/2 − 1}) E| Tr( (Π_m ⊗ Π_m)( y_1 y_1^⊤ − Φ − µµ^⊤ ) ) |^{β/2}
  + C_{β/2} (1/n^{β/4}) ( E[ Tr( (Π_m ⊗ Π_m)( y_1 y_1^⊤ − Φ − µµ^⊤ ) )² ] )^{β/4}.

As β/2 ≥ 2, 1/n^{β/2 − 1} ≤ 1/n^{β/4}, and we can use Jensen's inequality on the second term to obtain

E| Tr( (Π_m ⊗ Π_m)( (1/n) ∑_{i=1}^n y_i y_i^⊤ − Φ − µµ^⊤ ) ) |^{β/2} ≤ C_{β/2} (2/n^{β/4}) E| Tr( (Π_m ⊗ Π_m)( y_1 y_1^⊤ − Φ − µµ^⊤ ) ) |^{β/2}.

• If 1 ≤ β/2 ≤ 2, we use Lemma 4.3 of Subsection 4.2 to get

E| Tr( (Π_m ⊗ Π_m)( (1/n) ∑_{i=1}^n y_i y_i^⊤ − Φ − µµ^⊤ ) ) |^{β/2} ≤ (8/n^{β/2 − 1}) E| Tr( (Π_m ⊗ Π_m)( y_1 y_1^⊤ − Φ − µµ^⊤ ) ) |^{β/2}.
In both cases, we can use the fact that x ↦ x^{β/2} is a convex and increasing function to obtain

E| Tr( (Π_m ⊗ Π_m)( y_1 y_1^⊤ − Φ − µµ^⊤ ) ) |^{β/2} ≤ 2^{β/2 − 1} E[ | Tr( (Π_m ⊗ Π_m) y_1 y_1^⊤ ) |^{β/2} + | Tr( (Π_m ⊗ Π_m)( Φ + µµ^⊤ ) ) |^{β/2} ],

and by using Jensen's inequality on the second term we have that

E| Tr( (Π_m ⊗ Π_m)( y_1 y_1^⊤ − Φ − µµ^⊤ ) ) |^{β/2} ≤ 2^{β/2} E[ | Tr( (Π_m ⊗ Π_m) y_1 y_1^⊤ ) |^{β/2} ].

Now consider the following lemma.

Lemma 5.3. If Ψ is symmetric non-negative definite, then

Tr((Π_m ⊗ Π_m) Ψ) ∈ [0; Tr(Ψ)].   (33)

From this fact we get that

| Tr( (Π_m ⊗ Π_m) y_1 y_1^⊤ ) |^{β/2} ≤ | Tr( y_1 y_1^⊤ ) |^{β/2} = ‖y_1‖^β = ‖xx^⊤‖^β.

In conclusion, we have

Q_1 ≤ C_1(β) E‖xx^⊤‖^β / ( (α/2)^{β/2} δ_m^β D_m^{β/2} n^γ ),   (34)

with γ = min(β/4, β/2 − 1), C_1(β) = 2 C_{β/2} if β ≥ 4, where C_{β/2} is the constant in Rosenthal's inequality, and C_1(β) = 8 if 2 ≤ β ≤ 4. Remark that β/4 > κ/4 and β/2 − 1 > κ/4, so γ > κ/4.
Study of Q_2. Recall that

Q_2 = P( | Tr( (Π_m ⊗ Π_m)( µµ^⊤ − S_vec S_vec^⊤ ) ) | ≥ (α/2) Tr((Π_m ⊗ Π_m) Φ) ).

Set B_2 = Tr( (Π_m ⊗ Π_m)( µµ^⊤ − S_vec S_vec^⊤ ) ). Using the properties of the trace, we can write

B_2 = Tr( (Π_m ⊗ Π_m) µµ^⊤ ) − Tr( (Π_m ⊗ Π_m) S_vec S_vec^⊤ ) = µ^⊤ (Π_m ⊗ Π_m) µ − S_vec^⊤ (Π_m ⊗ Π_m) S_vec.

But Π_m ⊗ Π_m is an orthogonal projection matrix, hence

B_2 = µ^⊤ (Π_m ⊗ Π_m)^⊤ (Π_m ⊗ Π_m) µ − S_vec^⊤ (Π_m ⊗ Π_m)^⊤ (Π_m ⊗ Π_m) S_vec
    = ‖(Π_m ⊗ Π_m) µ‖² − ‖(Π_m ⊗ Π_m) S_vec‖²
    = ( ‖(Π_m ⊗ Π_m) µ‖ − ‖(Π_m ⊗ Π_m) S_vec‖ )( ‖(Π_m ⊗ Π_m) µ‖ + ‖(Π_m ⊗ Π_m) S_vec‖ ).

Hence

|B_2| ≤ ‖(Π_m ⊗ Π_m)(µ − S_vec)‖ ( ‖(Π_m ⊗ Π_m) µ‖ + ‖(Π_m ⊗ Π_m) S_vec‖ )
     ≤ ‖(Π_m ⊗ Π_m)(µ − S_vec)‖² + 2 ‖(Π_m ⊗ Π_m)(µ − S_vec)‖ ‖(Π_m ⊗ Π_m) µ‖
     ≤ ‖(Π_m ⊗ Π_m)(µ − S_vec)‖² + 2 ‖(Π_m ⊗ Π_m)(µ − S_vec)‖ ‖µ‖.

Finally

Q_2 ≤ P( ‖(Π_m ⊗ Π_m)(µ − S_vec)‖² ≥ (α/4) Tr((Π_m ⊗ Π_m) Φ) ) + P( ‖(Π_m ⊗ Π_m)(µ − S_vec)‖ ≥ (α / (8 ‖µ‖)) Tr((Π_m ⊗ Π_m) Φ) ).   (35)
Now we need to provide an upper bound for quantities of the form P( ‖(Π_m ⊗ Π_m)(µ − S_vec)‖² > t ). For this we will use the deviation bound provided by Proposition 4.4, stated in Subsection 4.2. Set G_n ∈ R^{p²n × p²n} the block matrix whose n × n blocks are all equal to I_{p²}, scaled by 1/n. Then

G_n (y − f) = 1_n ⊗ (S_vec − µ).

Now, if

H_m = I_n ⊗ (Π_m ⊗ Π_m) = diag( Π_m ⊗ Π_m, . . . , Π_m ⊗ Π_m ) ∈ R^{p²n × p²n},

we have

H_m ( 1_n ⊗ (S_vec − µ) ) = 1_n ⊗ ( (Π_m ⊗ Π_m)(S_vec − µ) ).
In conclusion, with A_m = H_m G_n ∈ R^{p²n × p²n}, the block matrix whose n × n blocks are all equal to Π_m ⊗ Π_m, scaled by 1/n, we have that

A_m (y − f) = 1_n ⊗ ( (Π_m ⊗ Π_m)(S_vec − µ) ).

Moreover, A_m is an orthogonal projection matrix and we have the following equalities:

‖A_m (y − f)‖² = n ‖(Π_m ⊗ Π_m)(S_vec − µ)‖² = (y − f)^⊤ A_m (y − f),

Tr(A_m) = (n/n) Tr(Π_m ⊗ Π_m) = D_m,

Tr( A_m (I_n ⊗ Φ) ) = (n/n) Tr((Π_m ⊗ Π_m) Φ) = Tr((Π_m ⊗ Π_m) Φ).

Now we can use Proposition 4.4 with Ã = A_m, ε_i = y_i − µ, Tr(A_m) = D_m, ρ(A_m) = 1, δ² = δ_m², and β ≥ 2. This gives, for all x > 0,

P( (y − f)^⊤ A_m (y − f) ≥ Tr((Π_m ⊗ Π_m) Φ) ( 1 + √(x / D_m) )² ) ≤ C_2(β) E[‖y_1 − µ‖^β] D_m^{β/2 + 1} / ( Tr((Π_m ⊗ Π_m) Φ)^{β/2} x^{β/2} ),
that is,

P( ‖(Π_m ⊗ Π_m)(S_vec − µ)‖² ≥ (1/n) Tr((Π_m ⊗ Π_m) Φ) ( 1 + √(x / D_m) )² ) ≤ C_2(β) E[‖y_1 − µ‖^β] D_m^{β/2 + 1} / ( Tr((Π_m ⊗ Π_m) Φ)^{β/2} x^{β/2} ).

In order to use this deviation bound to obtain the inequalities

P( ‖(Π_m ⊗ Π_m)(µ − S_vec)‖² > (α/4) Tr((Π_m ⊗ Π_m) Φ) ) ≤ C̃ / n^γ   for all m ∈ M

and

P( ‖(Π_m ⊗ Π_m)(µ − S_vec)‖ > (α / (8 ‖µ‖)) Tr((Π_m ⊗ Π_m) Φ) ) ≤ C̃ / n^γ   for all m ∈ M,

with γ > q / (1 − 2q/κ), we need to find x > 0 satisfying the three following facts for all m ∈ M:

(1/n) ( 1 + √(x / D_m) )² ≤ α/4,   (36)

(1/n) ( 1 + √(x / D_m) )² Tr((Π_m ⊗ Π_m) Φ) ≤ ( (α / (8 ‖µ‖)) Tr((Π_m ⊗ Π_m) Φ) )²,   (37)

C_2(β) E[‖y_1 − µ‖^β] D_m^{β/2 + 1} / ( Tr((Π_m ⊗ Π_m) Φ)^{β/2} x^{β/2} ) = C_2(β) E[‖y_1 − µ‖^β] D_m / ( δ_m^β x^{β/2} ) ≤ C / n^γ.   (38)

(36) and (37) hold for the choice x = D_m n^r with r < 1, provided n is large enough to have

(1/n) ( 1 + √(n^r) )² ≤ α/4   (39)

and

(1/n) ( 1 + √(n^r) )² ≤ ( α / (8 ‖µ‖) )² C_inf.   (40)
In order to obtain (38) with x = D_m n^r, we use the inequality D_m ≤ n, which gives

D_m / ( δ_m^β x^{β/2} ) = 1 / ( δ_m^β D_m^{β/2 − 1} n^{rβ/2} ) ≤ 1 / ( δ_m^β D_m^{β/2} n^{rβ/2 − 1} ).

Moreover,

E[ ‖y_1 − µ‖^β ] ≤ E[ ( ‖y_1‖ + ‖µ‖ )^β ],

and by using properties of convexity we obtain

E[ ‖y_1 − µ‖^β ] ≤ 2^{β − 1} E[ ‖y_1‖^β + ‖µ‖^β ].

With the Jensen inequality we get

E[ ‖y_1 − µ‖^β ] ≤ 2^β E[ ‖y_1‖^β ].
In conclusion, with r = (κ/2 + 2)/β < 1, we obtain for n ≥ n(κ, β, α, C_inf, Σ)

Q_2 ≤ 2^{β + 1} C_2(β) E‖xx^⊤‖^β / ( n^{κ/4} δ_m^β D_m^{β/2} ),   (41)

where C_2(β) is the constant which appears in Proposition 4.4.

In conclusion, combining (34) and (41),

P( δ̂_m² ≤ (1 − α) δ_m² ) ≤ (1/n^{κ/4}) [ C_2(β) 2^{β + 1} + C_1(β) ] (2/α)^β E‖xx^⊤‖^β δ_m^{−β} D_m^{−β/2}

for n ≥ n(κ, β, α, C_inf, Σ). To conclude, remark that (κ/4)(1 − 2q/κ) = (κ − 2q)/4 > (2 + 2q)/4 > q, as q ≤ 1.
Proof of Lemma 5.3. Recall that Π_m ⊗ Π_m is an orthogonal projection matrix. Hence there exists an orthogonal matrix P_m such that P_m^⊤ (Π_m ⊗ Π_m) P_m = D, with D a diagonal matrix such that D_ii = 1 if i ≤ D_m and D_ii = 0 otherwise. Then, if Ψ is symmetric non-negative definite, we have

Tr((Π_m ⊗ Π_m) Ψ) = Tr( D P_m^⊤ Ψ P_m )
  = ∑_{l=1}^{p²} ∑_{k=1}^{p²} D_{kl} ( P_m^⊤ Ψ P_m )_{kl}
  = ∑_{l=1}^{p²} D_{ll} ( P_m^⊤ Ψ P_m )_{ll}
  = ∑_{l=1}^{D_m} ( P_m^⊤ Ψ P_m )_{ll} ∈ [0; Tr(Ψ)].

Indeed, P_m^⊤ Ψ P_m is non-negative definite, so all its diagonal entries are non-negative, and their total sum equals Tr(P_m^⊤ Ψ P_m) = Tr(Ψ).
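A direct numerical check of Lemma 5.3 (illustrative only), with a random projection Π_m and a random non-negative definite Ψ, can be written as follows.

import numpy as np

rng = np.random.default_rng(2)
p, k = 5, 2
G = rng.standard_normal((p, k))
Pi = G @ np.linalg.pinv(G.T @ G) @ G.T        # orthogonal projection Pi_m
B = rng.standard_normal((p * p, p * p))
Psi = B @ B.T                                 # symmetric non-negative definite
val = np.trace(np.kron(Pi, Pi) @ Psi)
assert 0.0 <= val <= np.trace(Psi) + 1e-9     # Tr((Pi_m (x) Pi_m) Psi) lies in [0, Tr(Psi)]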
References

[1] Y. Baraud. Model selection for regression on a fixed design. Probability Theory and Related Fields, 117(4):467-493, 2000.

[2] J. Bigot, R. Biscay, J.-M. Loubes, and L. M. Alvarez. Group lasso estimation of high-dimensional covariance matrices. Journal of Machine Learning Research, 2011.

[3] J. Bigot, R. Biscay, J.-M. Loubes, and L. Muñiz-Alvarez. Nonparametric estimation of covariance functions by model selection. Electron. J. Stat., 4:822-855, 2010.

[4] J. Bigot, R. Biscay Lirio, J.-M. Loubes, and L. Muniz Alvarez. Adaptive estimation of spectral densities via wavelet thresholding and information projection. Preprint hal-00440424, May 2010.

[5] N. A. C. Cressie. Statistics for Spatial Data. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. John Wiley & Sons Inc., New York, 1993. Revised reprint of the 1991 edition, a Wiley-Interscience Publication.

[6] S. N. Elogne, O. Perrin, and C. Thomas-Agnan. Non parametric estimation of smooth stationary covariance functions by interpolation methods. Stat. Inference Stoch. Process., 11(2):177-205, 2008.

[7] H. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems, volume 375 of Mathematics and its Applications. Kluwer Academic Publishers Group, Dordrecht, 1996.

[8] A. G. Journel. Kriging in terms of projections. J. Internat. Assoc. Mathematical Geol., 9(6):563-586, 1977.

[9] G. A. F. Seber. A Matrix Handbook for Statisticians. Wiley Series in Probability and Statistics. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, 2008.

[10] M. L. Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer Series in Statistics. Springer-Verlag, New York, 1999.

[11] B. von Bahr and C.-G. Esseen. Inequalities for the rth absolute moment of a sum of random variables, 1 ≤ r ≤ 2. Ann. Math. Statist., 36:299-303, 1965.