On the Sample Complexity of Graphical Model Selection for Non-Stationary Processes

Nguyen Tran Quang and Alexander Jung

Abstract—We formulate and analyze a graphical model selection (GMS) method for inferring the conditional independence graph of a high-dimensional non-stationary Gaussian random process (time series) from a finite-length observation. The observed process samples are assumed uncorrelated over time but may have different covariance matrices. We characterize the sample complexity of graphical model selection for such processes by analyzing a particular selection method, which is based on sparse neighborhood regression. Our results indicate that, similar to the case of i.i.d. samples, accurate GMS is possible even in the high-dimensional regime if the underlying conditional independence graph is sufficiently sparse.

Index Terms—sparsity, graphical model selection, neighborhood regression, high-dimensional statistics.
I. INTRODUCTION

Consider a complex system which is represented by a large number of random variables $x_1,\ldots,x_p$. For ease of exposition we model these random variables as zero-mean jointly Gaussian. We are interested in inferring the conditional independence graph (CIG) of these random variables based on observing samples $x[n] \in \mathbb{R}^p$, for $n = 0,\ldots,N-1$, which are uncorrelated but not identically distributed. The learning method shall cope with the high-dimensional regime, where the system size $p$ is (much) larger than the sample size $N$, i.e., $N \ll p$ [1]–[7]. This problem is relevant, e.g., in the analysis of medical diagnostic data (EEG) [4], climatology [8] and genetics [9].

Most existing approaches to graphical model selection (GMS) for collections of random variables model the observed data either as i.i.d. or as samples of a stationary random process [3], [6], [10], [11]. By contrast, we model the observed data as an uncorrelated non-stationary process $x[n]$ with a varying covariance matrix $C[n]$.

Contributions: Our main conceptual contribution is the formulation of a sparse neighborhood regression GMS method for high-dimensional non-stationary processes. By analyzing this method, we derive upper bounds on the sample size required for accurate GMS. In particular, our analysis reveals that the crucial parameter determining the required sample size is the minimum average partial correlation between the process components. If this quantity is not too small, accurate GMS is feasible even in the high-dimensional regime where $N \ll p$.

Outline: The remainder of this paper is organized as follows. In Section II we formalize the considered process model and the notion of a CIG. Section III presents a GMS method based on neighborhood regression along with an upper bound on the sample size such that GMS is accurate.
The detailed derivation of this upper bound can be found in Section IV.

Notation: The set of non-negative real numbers is denoted $\mathbb{R}_+$. Given a $p$-dimensional process $x[0],\ldots,x[N-1] \in \mathbb{R}^p$ of length $N$, we denote its $i$-th scalar component process as $x_i := (x_i[0],\ldots,x_i[N-1])^T \in \mathbb{R}^N$. Given a vector $x = (x_1,\ldots,x_d)^T$, we denote its Euclidean and $\infty$-norm by $\|x\|_2 := \sqrt{\sum_i x_i^2}$ and $\|x\|_\infty := \max_i |x_i|$, respectively. The minimum and maximum eigenvalues of a positive semidefinite (psd) matrix $C$ are denoted $\lambda_{\min}(C)$ and $\lambda_{\max}(C)$. Given a matrix $Q$, we denote its transpose, spectral norm and Frobenius norm by $Q^T$, $\|Q\|_2$ and $\|Q\|_F$, respectively. It will be handy to define, for a given finite sequence of matrices $Q_l$, the block diagonal matrix $\mathrm{blkdiag}\{Q_l\}$ with $l$-th diagonal block given by $Q_l$. The identity matrix of size $d \times d$ is $I_d$.

II. PROBLEM FORMULATION

We model the observed data samples $x[n]$, for $n \in \{0,\ldots,N-1\}$, as zero-mean Gaussian random vectors. Moreover, we assume the vector samples $x[n]$ to be uncorrelated. Thus, the probability distribution of the observed samples is fully specified by the covariance matrices $C[n] := \mathrm{E}\{x[n]x^T[n]\}$. By contrast to the widely used i.i.d. assumption (where $C[n] = C$), we allow the covariance matrix $C[n]$ to vary with the sample index $n$. However, we impose a smoothness constraint on the dynamics of the covariance. In particular, we assume the covariance matrix $C[n]$ to be constant over evenly spaced length-$L$ blocks of consecutive vector samples. Thus, our sample process model can be summarized as

$\underbrace{x[0],\ldots,x[L-1]}_{\text{i.i.d.}\,\sim\,\mathcal{N}(0,\,C\{b=0\})},\ \underbrace{x[L],\ldots,x[2L-1]}_{\text{i.i.d.}\,\sim\,\mathcal{N}(0,\,C\{b=1\})},\ \ldots$  (1)

where $C\{b\}$ denotes the covariance matrix of the samples in block $b \in \{0,\ldots,B-1\}$ and $B := N/L$ is the number of blocks.
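To make the block-wise model (1) concrete, the following sketch draws samples from it; the dimensions and the block covariances $C\{b\}$ are arbitrary toy choices made only for this illustration and are not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, L, B = 10, 20, 5            # dimension, block length, number of blocks (assumed toy values)
N = B * L                      # total number of vector samples

def random_covariance(p, rng):
    # A generic well-conditioned psd matrix standing in for C{b}.
    A = rng.standard_normal((p, p))
    return A @ A.T / p + np.eye(p)

C = [random_covariance(p, rng) for _ in range(B)]

# Within block b the samples x[bL], ..., x[(b+1)L-1] are i.i.d. N(0, C{b}), cf. (1).
x = np.concatenate(
    [rng.multivariate_normal(np.zeros(p), C[b], size=L) for b in range(B)],
    axis=0,
)                              # shape (N, p); row n holds the sample x[n]
```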
The model (1) accommodates the case where the observed samples form a stationary process (cf. [10]–[13]) in the following way: Consider a Gaussian zero-mean stationary process $z[n]$ with auto-covariance function $R_z[m] := \mathrm{E}\{z[n+m]z^T[n]\}$ and spectral density matrix (SDM) $S_z(\theta) := \sum_m R_z[m]\exp(-j2\pi\theta m)$ [14]. Let $x[k] := \sum_{n=0}^{N-1} z[n]\exp(-j2\pi nk/N)$ denote the discrete Fourier transform (DFT) of the stationary process $z[n]$. Then, by well-known properties of the DFT [15], the vectors $x[k]$ are approximately uncorrelated multivariate normal vectors with zero mean and covariance matrix $C[k] \approx S_z(k/N)$. Moreover, if the effective correlation width $W$ of the process $z[n]$ is small, i.e., $W \ll N$, the SDM is nearly constant over a frequency interval of length $1/W$. Thus, the DFT vectors $x[k]$ approximately conform to the model (1) with block length $L \approx N/W$.
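As a small numerical illustration of this DFT argument (not part of the paper; the scalar AR(1)-type process below and the unitary $1/\sqrt{N}$ normalization of the DFT are arbitrary choices made for this sketch), one can compare the predicted spectral value with the constructed vectors $x[k]$:

```python
import numpy as np

rng = np.random.default_rng(1)
p, N, a = 3, 256, 0.3

# Toy stationary process: each component follows z_i[n] = a*z_i[n-1] + w_i[n], w_i[n] ~ N(0,1).
z = np.zeros((N, p))
for n in range(1, N):
    z[n] = a * z[n - 1] + rng.standard_normal(p)

# DFT vectors x[k] = (1/sqrt(N)) * sum_n z[n] exp(-j*2*pi*n*k/N); with this normalization
# the x[k] are approximately uncorrelated with covariance C[k] ~ S_z(k/N).
x = np.fft.fft(z, axis=0) / np.sqrt(N)     # shape (N, p); row k corresponds to x[k]

# For this toy process the SDM is S_z(theta) = I_p / |1 - a*exp(-j*2*pi*theta)|^2.
k = 10
S_k = np.eye(p) / np.abs(1.0 - a * np.exp(-2j * np.pi * k / N)) ** 2
print(S_k[0, 0], np.abs(x[k, 0]) ** 2)     # predicted variance vs. one noisy realization
```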
The model (1) is also useful for an important class of non-stationary processes, which are referred to as underspread processes [16]–[19]. A continuous-time random process $z(t)$ is underspread if its expected ambiguity function (EAF) $\bar{A}(\tau,\nu) := \int_t \mathrm{E}\{z(t+\tau/2)z^T(t-\tau/2)\}\exp(-j2\pi t\nu)\,dt$ is well concentrated around the origin in the $(\tau,\nu)$ plane. In particular, if the EAF of $z(t)$ is exactly supported on the rectangle $[-\tau_0/2,\tau_0/2]\times[-\nu_0/2,\nu_0/2]$, then the process $z(t)$ is underspread if $\tau_0\nu_0 \ll 1$. One of the most striking properties of an underspread process is that its Wigner-Ville spectrum (which can be loosely interpreted as a time-varying power spectral density) $\bar{W}(t,f) := \int_{\tau,\nu}\bar{A}(\tau,\nu)\exp(-j2\pi(f\tau-\nu t))\,d\tau\,d\nu$ is approximately constant over a rectangle of area $1/(\tau_0\nu_0)$. Moreover, it can be shown that for a suitably chosen prototype function $g(t)$ (e.g., a Gaussian pulse) and grid constants $T$, $F$, the Weyl-Heisenberg set $\{g^{(n,k)}(t) := g(t-nT)e^{-j2\pi kFt}\}_{n,k\in\mathbb{Z}}$ [20] yields zero-mean expansion coefficients $x[n,k] = \int_t z(t)g^{(n,k)}(t)\,dt$ which are approximately uncorrelated. The covariance matrix of the vector $x[n,k]$ is approximately given by $\bar{W}(nT,kF)$. Thus, the vector samples $x[n,k]$ approximately conform to the process model (1) with block length $L \approx 1/(TF\tau_0\nu_0)$.

We now define the CIG of a $p$-dimensional Gaussian process $x[n]\in\mathbb{R}^p$ conforming to the model (1) as an undirected simple graph $\mathcal{G} = (\mathcal{V},\mathcal{E})$ with node set $\mathcal{V} = \{1,2,\ldots,p\}$. Node $i\in\mathcal{V}$ represents the process component $x_i = (x_i[0],\ldots,x_i[N-1])^T$. An edge is absent between nodes $i$ and $j$, i.e., $\{i,j\}\notin\mathcal{E}$, if the corresponding process components $x_i$ and $x_j$ are conditionally independent, given the remaining components $\{x_r\}_{r\in\mathcal{V}\setminus\{i,j\}}$. Since we model the process $x[n]$ as Gaussian (cf. (1)), the conditional independence among the individual process components can be read off conveniently from the inverse covariance (precision) matrices $K[n] := C[n]^{-1}$. In particular, $x_i$ and $x_j$ are conditionally independent, given $\{x_r\}_{r\in\mathcal{V}\setminus\{i,j\}}$, if and only if $K_{i,j}[n] = 0$ for all $n = 0,\ldots,N-1$ [15, Prop. 1.6.6.]. Thus, we have the following characterization of the CIG $\mathcal{G}$ associated with the sample process $x[n]$ in (1):

$\{i,j\}\notin\mathcal{E}$ if and only if $K_{i,j}[n] = 0$ for all $n$.  (2)
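The characterization (2) translates directly into a simple routine that reads the edge set off the block precision matrices (a minimal sketch; the numerical tolerance is an implementation detail not discussed in the paper):

```python
import numpy as np

def cig_edges(C_blocks, tol=1e-10):
    """Edge set of the CIG: {i,j} is an edge iff K_{i,j}{b} != 0 for some block b, cf. (2)."""
    K = [np.linalg.inv(C) for C in C_blocks]   # precision matrices K{b} = C{b}^{-1}
    p = K[0].shape[0]
    edges = set()
    for i in range(p):
        for j in range(i + 1, p):
            if any(abs(Kb[i, j]) > tol for Kb in K):
                edges.add((i, j))
    return edges
```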
We highlight the coupling in the CIG characterization (2): an edge is absent, i.e., $\{i,j\}\notin\mathcal{E}$, only if the precision matrix entry $K_{i,j}[n]$ is zero for all $n\in\{0,\ldots,N-1\}$. We will also need a measure for the strength of a connection between process components $x_i$ and $x_j$ for $\{i,j\}\in\mathcal{E}$. To this end, we define the average partial correlation between $x_i$ and $x_j$ as

$\rho_{i,j} := (1/N)\sum_{n=0}^{N-1} K_{i,j}^2[n]\big/\big(K_{i,i}[n]K_{j,j}[n]\big) = (1/B)\sum_{b=0}^{B-1} K_{i,j}^2\{b\}\big/\big(K_{i,i}\{b\}K_{j,j}\{b\}\big),$  (3)

where $K\{b\} := C\{b\}^{-1}$ denotes the precision matrix within block $b$ of (1). By (2) and (3), $\{i,j\}\in\mathcal{E}$ if and only if $\rho_{i,j}\neq 0$. Accurate estimation of the CIG for finite sample size $N$ (incurring unavoidable sampling noise) is only possible for sufficiently large partial correlations $\rho_{i,j}$ for $\{i,j\}\in\mathcal{E}$.

Assumption 1. There is a constant $\rho_{\min} > 0$ such that

$\rho_{i,j} \geq \rho_{\min}$ for any $\{i,j\}\in\mathcal{E}$.  (4)

Moreover, we also assume the CIG underlying $x[n]$ to be sparse in the sense of having small maximum degree.

Assumption 2. There is a constant $s < p/2$ such that

$|\mathcal{N}(i)| \leq s$, for any node $i\in\mathcal{V}$,  (5)

where $\mathcal{N}(i)$ denotes the neighborhood of node $i$ in $\mathcal{G}$.
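In the same spirit, the average partial correlation (3) and the constants appearing in Assumptions 1 and 2 can be computed from the block precision matrices (again a sketch; it assumes the true covariances $C\{b\}$ are available):

```python
import numpy as np

def average_partial_correlation(C_blocks, i, j):
    """rho_{i,j} = (1/B) * sum_b K_{i,j}{b}^2 / (K_{i,i}{b} * K_{j,j}{b}), cf. (3)."""
    K = [np.linalg.inv(C) for C in C_blocks]
    return float(np.mean([Kb[i, j] ** 2 / (Kb[i, i] * Kb[j, j]) for Kb in K]))

def rho_min_and_max_degree(C_blocks, tol=1e-10):
    """Smallest average partial correlation over the edges (Asspt. 1) and the maximum
    node degree of the CIG (Asspt. 2)."""
    p = C_blocks[0].shape[0]
    rho_edges, degree = [], np.zeros(p, dtype=int)
    for i in range(p):
        for j in range(i + 1, p):
            r = average_partial_correlation(C_blocks, i, j)
            if r > tol:                      # {i,j} is an edge iff rho_{i,j} != 0
                rho_edges.append(r)
                degree[i] += 1
                degree[j] += 1
    return min(rho_edges), int(degree.max())
```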
III. SPARSE NEIGHBOURHOOD REGRESSION

The CIG $\mathcal{G}$ of the process $x[n]$ in (1) is fully specified by the neighbourhoods $\mathcal{N}(i) := \{j\in\mathcal{V}\setminus\{i\}: \{i,j\}\in\mathcal{E}\}$, i.e., once we have found all neighbourhoods, we can reconstruct the full CIG. In what follows, we focus on the sub-problem of learning the neighborhood $\mathcal{N}(i)$ of a particular node $i\in\mathcal{V}$. Let us denote, for each block $b\in\{0,\ldots,B-1\}$ (with $B = N/L$) in (1), the $i$-th process component as $x_i\{b\} := (x_i[bL],\ldots,x_i[(b+1)L-1])^T \in\mathbb{R}^L$. According to Lemma IV.3,

$x_i\{b\} = \sum_{j\in\mathcal{N}(i)} a_j x_j\{b\} + \varepsilon_i\{b\},$  (6)

with the "error term"

$\varepsilon_i\{b\} \sim \mathcal{N}\big(0,\,(1/K_{i,i}\{b\})\,I_L\big).$  (7)

Moreover, for an index set $\mathcal{T}$ with $\mathcal{N}(i)\setminus\mathcal{T}\neq\emptyset$,

$x_i\{b\} = \sum_{j\in\mathcal{T}} a_j x_j\{b\} + \tilde{x}_i\{b\} + \varepsilon_i\{b\},$  (8)

with the component $\tilde{x}_i\{b\} \sim \mathcal{N}(0,\tilde{\sigma}_b^2 I_L)$ being uncorrelated with $\{x_j\{b\}\}_{j\in\mathcal{T}}$ and $\varepsilon_i\{b\}$. The variance $\tilde{\sigma}_b^2$ of $\tilde{x}_i\{b\}$ is $\tilde{\sigma}_b^2 = a^T\tilde{K}^{-1}a$ with $\tilde{K} = \big(C\{b\}^{-1}\big)_{\mathcal{N}(i)\cup\mathcal{T}}$ (cf. Lemma IV.3).
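The decompositions (6)–(8) are instances of standard Gaussian conditioning: the regression weights and the error variance are determined by the precision matrix (cf. Lemma IV.3 in the Appendix). The following sanity check (a sketch with an arbitrary random covariance; it uses the usual sign convention $a_j = -K_{i,j}/K_{i,i}$ for the regression weights) verifies these identities numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
p, i = 5, 0

A = rng.standard_normal((p, p))
C = A @ A.T + p * np.eye(p)          # arbitrary positive definite covariance
K = np.linalg.inv(C)                 # precision matrix

rest = [j for j in range(p) if j != i]
C_rr_inv = np.linalg.inv(C[np.ix_(rest, rest)])

# Conditional distribution of x_i given the remaining components (standard Gaussian algebra):
w = C[i, rest] @ C_rr_inv                                  # regression weights on {x_j}_{j != i}
cond_var = C[i, i] - C[i, rest] @ C_rr_inv @ C[rest, i]    # conditional (error) variance

# These coincide with -K_{i,j}/K_{i,i} and 1/K_{i,i}, matching the error term in (6)/(7).
print(np.allclose(w, -K[i, rest] / K[i, i]))   # True
print(np.isclose(cond_var, 1.0 / K[i, i]))     # True
```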
In view of the decompositions (6) and (8), it is reasonable to estimate $\mathcal{N}(i)$ via a least-squares search:

$\widehat{\mathcal{N}}(i) := \arg\min_{|\mathcal{T}|=s}\ (1/N)\sum_{b=0}^{B-1}\big\|x_i\{b\} - P_{\mathcal{T}\{b\}}x_i\{b\}\big\|_2^2 = \arg\min_{|\mathcal{T}|=s}\ (1/N)\sum_{b=0}^{B-1}\big\|P^{\perp}_{\mathcal{T}\{b\}}x_i\{b\}\big\|_2^2.$  (9)

For a subset $\mathcal{T} = \{i_1,\ldots,i_s\}$, the matrix $P_{\mathcal{T}\{b\}}\in\mathbb{R}^{L\times L}$ represents the orthogonal projection onto $\mathrm{span}\{x_j\{b\}\}_{j\in\mathcal{T}}$, i.e.,

$P_{\mathcal{T}\{b\}} := X_{\mathcal{T}}\big(X_{\mathcal{T}}^T X_{\mathcal{T}}\big)^{-1}X_{\mathcal{T}}^T,$  (10)

with data matrix $X_{\mathcal{T}} := \big(x_{i_1}\{b\},\ldots,x_{i_s}\{b\}\big)\in\mathbb{R}^{L\times s}$. The complementary orthogonal projection is

$P^{\perp}_{\mathcal{T}\{b\}} := I_L - P_{\mathcal{T}\{b\}}.$  (11)
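A direct, brute-force implementation of the estimator (9) is sketched below (illustrative only; the data layout follows the block model (1), and the exhaustive search over all size-$s$ subsets is feasible only for small $p$ and $s$):

```python
import numpy as np
from itertools import combinations

def neighborhood_estimate(x_blocks, i, s):
    """Sparse neighborhood regression (9): x_blocks is a list of B arrays of shape (L, p),
    i is the node whose neighborhood is estimated, and s is the assumed maximum degree."""
    p = x_blocks[0].shape[1]
    N = sum(block.shape[0] for block in x_blocks)
    candidates = [j for j in range(p) if j != i]
    best_T, best_cost = None, np.inf
    for T in combinations(candidates, s):
        cost = 0.0
        for block in x_blocks:
            X_T = block[:, list(T)]          # data matrix (x_{i_1}{b}, ..., x_{i_s}{b})
            xi = block[:, i]
            # Residual of projecting x_i{b} onto span{x_j{b}}_{j in T}, i.e. ||P^perp_{T{b}} x_i{b}||^2.
            resid = xi - X_T @ np.linalg.lstsq(X_T, xi, rcond=None)[0]
            cost += resid @ resid
        if cost / N < best_cost:
            best_T, best_cost = set(T), cost / N
    return best_T
```

With data generated as in the earlier sketch, the call would be, e.g., `neighborhood_estimate(np.split(x, B), i=0, s=2)`.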
We can interpret (9) as performing sparse neighborhood regression, since we aim at approximating the ith component xi in a sparse manner (by allowing only s active components) using the remaining process components. For the analysis of the estimator (9) we require an assumption on the eigenvalues of the covariance matrices C[n].
Assumption 3. The eigenvalues of the psd covariance matrices $C[n]$ are bounded for all $n$ as

$0 < \alpha \leq \lambda_{\min}(C[n]) \leq \lambda_{\max}(C[n]) \leq \beta$  (12)

with fixed constants $\beta\geq\alpha>0$. We refer to the ratio $\kappa := \beta/\alpha$ as the condition number of the process.
Our main analytical result is an upper bound on the probability that the estimator (9) delivers a subset $\widehat{\mathcal{N}}(i)$ which does not contain the true neighborhood, i.e., $\mathcal{N}(i)\setminus\widehat{\mathcal{N}}(i)\neq\emptyset$. We denote this event by

$E_i := \big\{|\mathcal{N}(i)\setminus\widehat{\mathcal{N}}(i)| > 0\big\}.$  (13)

Theorem III.1. Consider observed samples $x[n]$ conforming to the process model (1) such that Asspt. 1, 2 and 3 are valid. Then, if the minimum average partial correlation satisfies

$\rho_{\min} \geq 8\kappa/L,$  (14)

and moreover

$N \geq \frac{L}{L-s}\,\frac{1800\kappa^2}{\rho_{\min}}\,\log\big(18(p-s)/\eta\big),$  (15)

the probability of $\widehat{\mathcal{N}}(i)$ not containing $\mathcal{N}(i)$ is bounded as

$P\{E_i\} \leq \eta.$  (16)

We highlight the fact that Asspt. 1 and 3 are only required for the analysis of the estimator (9); they are not required for its practical implementation. In particular, we can use the estimator (9) also for processes $x[n]$ violating Asspt. 1 and 3. However, the guarantee of Theorem III.1 then no longer applies.
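For a feeling of the numbers, the sufficient sample size (15) can be evaluated directly; the parameter values below are arbitrary and only illustrate that the bound grows logarithmically in $p$:

```python
import numpy as np

def sufficient_sample_size(p, s, L, kappa, rho_min, eta):
    """Right-hand side of (15): N >= L/(L-s) * 1800*kappa^2/rho_min * log(18*(p-s)/eta)."""
    return L / (L - s) * 1800.0 * kappa**2 / rho_min * np.log(18.0 * (p - s) / eta)

# Example: p = 1000 nodes, maximum degree s = 3, block length L = 50, condition number
# kappa = 2, rho_min = 0.5 (which also satisfies (14): rho_min >= 8*kappa/L = 0.32), eta = 0.1.
print(sufficient_sample_size(p=1000, s=3, L=50, kappa=2.0, rho_min=0.5, eta=0.1))
```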
IV. PROOF OF THE MAIN RESULT

Our derivation of Theorem III.1 draws heavily on the techniques used in [21]. It will be convenient to rewrite (9) as $\widehat{\mathcal{N}}(i) = \arg\min_{|\mathcal{S}|=s} Z(\mathcal{S})$ with

$Z(\mathcal{S}) := (1/N)\sum_{b=0}^{B-1}\big\|P^{\perp}_{\mathcal{S}\{b\}}x_i\{b\}\big\|_2^2.$  (17)

The error event $E_{\mathcal{T}}$ when (9) delivers an index set $\mathcal{T}$ with $|\mathcal{N}(i)\setminus\mathcal{T}| = \ell > 0$, i.e., $\mathcal{T}$ does not contain $\mathcal{N}(i)$, is given as

$E_{\mathcal{T}} = \{Z(\mathcal{N}(i)) > Z(\mathcal{T})\}.$  (18)

In what follows, we will derive an upper bound $M(\ell)$ on $P\{E_{\mathcal{T}}\}$ which depends on the index set $\mathcal{T}$ only via $\ell = |\mathcal{N}(i)\setminus\mathcal{T}|$. By a union bound, $P\{E_i\} \leq \sum_{\ell=1}^{s} N(\ell)M(\ell)$, where $N(\ell)$ denotes the number of size-$s$ subsets of $\mathcal{V}$ which differ from $\mathcal{N}(i)$ in exactly $\ell$ indices. Since $N(\ell)\leq\binom{s}{\ell}\binom{p-s}{\ell}$ [22] and $s\leq p/2$ (cf. Asspt. 2), both $\binom{s}{\ell}$ and the number $s$ of summands are upper bounded by $\binom{p-s}{\ell}$, and therefore

$\log P\{E_i\} \leq \max_{\ell=1,\ldots,s}\ 3\log\binom{p-s}{\ell} + \log M(\ell).$  (19)

Let us now detail the derivation of the above mentioned upper bound $M(\ell)$ on the probability $P\{E_{\mathcal{T}}\}$. Consider the decomposition (6) and let us stack the noise vectors $\varepsilon_i\{b\}$ into a big vector $\varepsilon_i = (\varepsilon_i\{0\}^T,\ldots,\varepsilon_i\{B-1\}^T)^T\in\mathbb{R}^N$. The covariance matrix of $\varepsilon_i$ is $C_{\varepsilon_i} = \mathrm{blkdiag}\{(1/K_{i,i}\{0\})I_L,\ldots,(1/K_{i,i}\{B-1\})I_L\}$. We also define the stacked projection matrix $P^{\perp}_{\mathcal{T}} := \mathrm{blkdiag}\{P^{\perp}_{\mathcal{T}\{b\}}\}_{b=0}^{B-1}$ (cf. (11)). Rather trivially, we have

$E_{\mathcal{T}} = \big\{Z(\mathcal{N}(i)) - (1/N)\|P^{\perp}_{\mathcal{T}}\varepsilon_i\|_2^2 > Z(\mathcal{T}) - (1/N)\|P^{\perp}_{\mathcal{T}}\varepsilon_i\|_2^2\big\}.$  (20)

Let us now define, for some number $\delta > 0$ whose precise value is to be chosen later, the two events

$E_1(\delta) := \big\{Z(\mathcal{N}(i)) - (1/N)\|P^{\perp}_{\mathcal{T}}\varepsilon_i\|_2^2 \geq \delta\big\},$  (21a)
$E_2(\delta) := \big\{Z(\mathcal{T}) - (1/N)\|P^{\perp}_{\mathcal{T}}\varepsilon_i\|_2^2 \leq 2\delta\big\}.$  (21b)

In view of (20), the event $E_{\mathcal{T}}$ can only occur if at least one of the events $E_1(\delta)$ or $E_2(\delta)$ occurs. Therefore, by a union bound,

$P\{E_{\mathcal{T}}\} \leq P\{E_1(\delta)\} + P\{E_2(\delta)\}.$  (22)

In order to analyze the event $E_1(\delta)$, we observe

$Z(\mathcal{N}(i)) \overset{(17)}{=} (1/N)\sum_{b=0}^{B-1}\big\|P^{\perp}_{\mathcal{N}(i)\{b\}}x_i\{b\}\big\|_2^2 \overset{(6)}{=} (1/N)\sum_{b=0}^{B-1}\Big\|P^{\perp}_{\mathcal{N}(i)\{b\}}\Big(\sum_{j\in\mathcal{N}(i)}a_j x_j\{b\} + \varepsilon_i\{b\}\Big)\Big\|_2^2 = (1/N)\big\|P^{\perp}_{\mathcal{N}(i)}\varepsilon_i\big\|_2^2,$  (23)

since $P^{\perp}_{\mathcal{N}(i)\{b\}}$ annihilates each $x_j\{b\}$ with $j\in\mathcal{N}(i)$. Hence,

$P\{E_1(\delta)\} \overset{(21a)}{=} \mathrm{E}\big\{P\big\{Z(\mathcal{N}(i)) - (1/N)\|P^{\perp}_{\mathcal{T}}\varepsilon_i\|_2^2 \geq \delta \,\big|\, x_{\mathcal{T}}\big\}\big\} \overset{(23)}{=} \mathrm{E}\big\{P\big\{(1/N)\big(\|P^{\perp}_{\mathcal{N}(i)}\varepsilon_i\|_2^2 - \|P^{\perp}_{\mathcal{T}}\varepsilon_i\|_2^2\big) \geq \delta \,\big|\, x_{\mathcal{T}}\big\}\big\} \overset{(7)}{=} \mathrm{E}\big\{P\big\{(1/N)\,w_i^T P w_i \geq \delta \,\big|\, x_{\mathcal{T}}\big\}\big\},$  (24)

with $P := \mathrm{blkdiag}\{(1/K_{i,i}\{b\})(P^{\perp}_{\mathcal{N}(i)\{b\}} - P^{\perp}_{\mathcal{T}\{b\}})\}\in\mathbb{R}^{N\times N}$ and a standard normal vector $w_i\sim\mathcal{N}(0,I)$. By similar arguments as used in [21], the matrix $P$ can be shown to satisfy

$\|P\|_2 \leq \beta$ and $\mathrm{rank}\{P\} \leq 2B\ell.$  (25)

Thus, we have

$\|P\|_F^2 \leq \mathrm{rank}\{P\}\,\|P\|_2^2 \overset{(25)}{\leq} 2B\ell\beta^2.$  (26)

Inserting (25), (26) into (57) of Lemma IV.2 yields

$P\{E_1(\delta)\} \leq 2\exp\Big(-\frac{N^2\delta^2}{8(2\ell B\beta^2 + N\delta\beta)}\Big).$  (27)

Thus, whenever

$N \geq 2B\ell\beta/\delta,$  (28)

we have

$P\{E_1(\delta)\} \leq 2\exp\Big(-\frac{N\delta}{16\beta}\Big).$  (29)

Let us now upper bound the probability of $E_2(\delta)$ (cf. (21b)). To this end, in view of the decomposition (8), note

$E_2(\delta) \subseteq E_3(\delta)\cup E_4(\delta)$  (30)

with the events

$E_3(\delta) := \big\{(1/N)\|P^{\perp}_{\mathcal{T}}\tilde{x}_i\|_2^2 \leq 3\delta\big\}$  (31)

and

$E_4(\delta) := \big\{(2/N)\,|\tilde{x}_i^T P^{\perp}_{\mathcal{T}}\varepsilon_i| \geq \delta\big\},$  (32)

where $\tilde{x}_i = (\tilde{x}_i\{0\}^T,\ldots,\tilde{x}_i\{B-1\}^T)^T$. By a union bound, (30) implies

$P\{E_2(\delta)\} \leq P\{E_3(\delta)\} + P\{E_4(\delta)\},$  (33)

such that $P\{E_2(\delta)\}$ can be upper bounded by separately bounding $P\{E_3(\delta)\}$ and $P\{E_4(\delta)\}$. Let us define

$m_3 := \mathrm{E}\big\{(1/N)\|P^{\perp}_{\mathcal{T}}\tilde{x}_i\|_2^2\big\}.$  (34)

According to Corollary IV.4,

$\tilde{x}_i\{b\} \sim \mathcal{N}(0,\tilde{\sigma}_b^2 I_L)$ with variance $\tilde{\sigma}_b^2 \geq \ell\rho_{\min}(\alpha/\kappa).$  (35)

Therefore,

$m_3 \overset{(34)}{=} (1/N)\,\mathrm{Tr}\big\{C_{\tilde{x}_i}^{1/2}P^{\perp}_{\mathcal{T}}C_{\tilde{x}_i}^{1/2}\big\} = (1/N)\sum_{b=0}^{B-1}\tilde{\sigma}_b^2\,\mathrm{Tr}\big\{P^{\perp}_{\mathcal{T}\{b\}}\big\} \overset{(35)}{\geq} \big(B(L-\ell)/N\big)\,\ell\rho_{\min}\,(\alpha/\kappa).$  (36)

Let us now choose

$\delta := m_3/4.$  (37)

Observe

$P\{E_3(\delta)\} \overset{(31)}{=} \mathrm{E}\big\{P\{(1/N)\|P^{\perp}_{\mathcal{T}}\tilde{x}_i\|_2^2 \leq 3\delta \mid x_{\mathcal{T}}\}\big\} \overset{(37)}{\leq} \mathrm{E}\big\{P\{(1/N)\|P^{\perp}_{\mathcal{T}}\tilde{x}_i\|_2^2 - m_3 \leq -\delta \mid x_{\mathcal{T}}\}\big\} \leq \mathrm{E}\big\{P\{|(1/N)\|P^{\perp}_{\mathcal{T}}\tilde{x}_i\|_2^2 - m_3| \geq \delta \mid x_{\mathcal{T}}\}\big\}.$  (38)

Applying Lemma IV.2 to (38) gets us to

$P\{E_3(\delta)\} \leq 2\exp\Big(-\frac{(N/\beta)\delta^2}{8(m_3+\delta)}\Big) \overset{(37)}{=} 2\exp\Big(-\frac{Nm_3}{160\beta}\Big).$  (39)

In order to control $P\{E_4(\delta)\}$ (cf. (32)), note

$P\{E_4(\delta)\} = \mathrm{E}\big\{P\{|w^T Q w| \geq \delta \mid x_{\mathcal{T}}\}\big\},$  (40)

with $w\sim\mathcal{N}(0,I)$ and the symmetric matrix

$Q = \frac{1}{N}\begin{pmatrix} 0 & C_{\tilde{x}_i}^{1/2}P^{\perp}_{\mathcal{T}}C_{\varepsilon_i}^{1/2} \\ C_{\varepsilon_i}^{1/2}P^{\perp}_{\mathcal{T}}C_{\tilde{x}_i}^{1/2} & 0\end{pmatrix}.$

Applying Lemma IV.2 to (40) gets us to

$P\{E_4(\delta)\} \leq 2\exp\Big(-\frac{N(\delta/2)^2}{8(\beta m_3+\beta(\delta/2))}\Big) \overset{(37)}{=} 2\exp\Big(-\frac{(N/\beta)m_3}{16\cdot 36}\Big).$  (41)

For the choice $\delta = m_3/4$ (cf. (37)), the condition (14) implies validity of (28). Therefore, (29) is in force and can be reformulated using (37) as

$P\{E_1(\delta)\} \leq 2\exp\Big(-\frac{Nm_3}{64\beta}\Big).$  (42)

Combining (42), (41), (39) with (33), (22) and (36) yields

$P\{E_{\mathcal{T}}\} \leq 6\exp\big(-\rho_{\min}\ell B(L-\ell)/(576\kappa^2)\big).$  (43)

For the final step in the proof of Theorem III.1, we plug in the RHS of (43) as the upper bound $M(\ell)$ used in (19). Thus, $P\{E_i\}\leq\eta$ if

$\max_{\ell=1,\ldots,s}\ 3\log\Big(6\binom{p-s}{\ell}\Big) - \rho_{\min}\ell B(L-\ell)/(576\kappa^2) \leq \log\eta.$  (44)

The validity of (44), in turn, is guaranteed if

$B(L-\ell) \geq \frac{1800\kappa^2}{\rho_{\min}\ell}\Big(\log\Big(6\binom{p-s}{\ell}\Big) - \log\eta\Big)$ for $\ell = 1,\ldots,s.$  (45)

We obtain the final sufficient condition (15) of Theorem III.1 by applying the bound $\binom{p}{q}\leq\big(\frac{pe}{q}\big)^q$ [21] to (45).
APPENDIX

In order to make this paper somewhat self-contained, we list here some well-known facts about Gaussian random vectors, which are instrumental for the derivations in Section IV.

Lemma IV.1. Consider $N$ i.i.d. Gaussian random variables $z_i\sim\mathcal{N}(0,1)$ and the weighted sum of squares

$y = \sum_{i=1}^{N} a_i z_i^2$  (46)

with weight vector $a = (a_1,\ldots,a_N)^T\in\mathbb{R}^N$. We have

$P\{|y-\mathrm{E}\{y\}| \geq \delta\} \leq 2\exp\Big(-\frac{\delta^2/8}{\|a\|_2^2 + \|a\|_\infty\delta}\Big).$  (47)

Proof. Consider some fixed $\lambda\in[0,1/(4\|a\|_\infty)]$, such that

$\mathrm{E}\{\exp(\lambda a_i z_i^2)\} = 1/\sqrt{1-2\lambda a_i},$  (48)

and hence,

$\log\mathrm{E}\{\exp(\lambda a_i(z_i^2-1))\} = (-1/2)\log(1-2\lambda a_i) - \lambda a_i \leq (-1/2)\log(1-2\lambda|a_i|) - \lambda|a_i|.$  (49)

Applying the elementary inequality

$-\log(1-u) \leq u + \frac{u^2}{2(1-u)},\quad 0\leq u<1,$  (50)

to the RHS of (49) yields

$\log\mathrm{E}\{\exp(\lambda a_i(z_i^2-1))\} \leq \frac{\lambda^2 a_i^2}{1-2\lambda|a_i|} \leq 2\lambda^2 a_i^2.$  (51)

Summing (51) over $i$ yields

$\log\mathrm{E}\{\exp(\lambda(y-\mathrm{E}\{y\}))\} \leq 2\lambda^2\|a\|_2^2.$  (52)

Now, consider the tail bound

$P\{y-\mathrm{E}\{y\}\geq\delta\} \leq \exp(-\lambda\delta)\,\mathrm{E}\{\exp(\lambda(y-\mathrm{E}\{y\}))\} \overset{(52)}{\leq} \exp(-\lambda\delta + 2\lambda^2\|a\|_2^2).$  (53)

Minimizing the RHS of (53) for $\lambda\in[0,1/(4\|a\|_\infty)]$,

$P\{y-\mathrm{E}\{y\}\geq\delta\} \leq \exp\big(-(\delta^2/8)\min\{1/\|a\|_2^2,\,1/(\delta\|a\|_\infty)\}\big) \leq \exp\Big(-\frac{\delta^2/8}{\|a\|_2^2+\|a\|_\infty\delta}\Big),$  (54)

where the last inequality is due to the fact that $\min\{1/x,1/y\}\geq 1/(x+y)$ for $x,y\in\mathbb{R}_+$. Similar to (53), one can also verify

$P\{y-\mathrm{E}\{y\}\leq-\delta\} \leq \exp\Big(-\frac{\delta^2/8}{\|a\|_2^2+\|a\|_\infty\delta}\Big).$  (55)

Adding (54) and (55) yields the result (47).

Lemma IV.2. Consider a standard normal random vector $z\sim\mathcal{N}(0,I)$ and a symmetric matrix $Q\in\mathbb{R}^{N\times N}$, i.e., $Q^T = Q$. Then, the quadratic form

$y = z^T Q z$  (56)

satisfies

$P\{|y-\mathrm{E}\{y\}| \geq \delta\} \leq 2\exp\Big(-\frac{\delta^2}{8(\|Q\|_F^2+\|Q\|_2\delta)}\Big)$  (57)

with $\mathrm{E}\{y\} = \mathrm{Tr}\{Q\}$.

Proof. The spectral decomposition of $Q$ yields [23]

$Q = \sum_{i=1}^{N}\lambda_i u_i u_i^T$  (58)

with eigenvalues $\lambda_i\in\mathbb{R}$ and orthonormal eigenvectors $\{u_i\}_{i=1}^N$. Inserting (58) into (56) yields

$y = \sum_{i=1}^{N}\lambda_i(u_i^T z)^2.$  (59)

Note that $\|Q\|_2 = \max_{i=1,\ldots,N}|\lambda_i|$ and $\|Q\|_F^2 = \sum_{i=1}^N\lambda_i^2$. Since the random variables $\{u_i^T z\}_{i=1}^N$ are i.i.d. standard normal, we can apply Lemma IV.1 to (59), yielding (57).

Lemma IV.3. Consider a multivariate normal random vector $x = (x_1,\ldots,x_p)^T\sim\mathcal{N}(0,C)$ with precision matrix $K = C^{-1}$. Let us denote the neighbourhood $\mathcal{N}(i) := \{j\in[p]\setminus\{i\}: K_{i,j}\neq 0\}$. For an arbitrary index set $\mathcal{T}\subseteq[p]$, we have

$x_i = \sum_{j\in\mathcal{T}} c_j x_j + a^T\tilde{x} + \varepsilon_i$  (60)

with weight vector $a = \{K_{i,j}/K_{i,i}\}_{j\in\mathcal{N}(i)\setminus\mathcal{T}}$ and the random vector $\tilde{x}\sim\mathcal{N}(0,\tilde{C})$, which is independent of $\{x_j\}_{j\in\mathcal{T}}$. The error term $\varepsilon_i$ is Gaussian, $\varepsilon_i\sim\mathcal{N}(0,K_{i,i}^{-1})$, and independent of $\tilde{x}$ and $\{x_j\}_{j\in\mathcal{T}}$. The covariance matrix of $\tilde{x}$ is given via $\tilde{C}^{-1} = \big(C_{\mathcal{T}\cup\mathcal{N}(i)}^{-1}\big)_{\mathcal{N}(i)\setminus\mathcal{T}}$.

Corollary IV.4. Consider a vector-valued process $x[n]$ of the form (1) such that Asspt. 1 and 3 are valid. The $i$-th component $x_i\{b\}$ can be decomposed as

$x_i\{b\} = \sum_{j\in\mathcal{T}} c_j x_j\{b\} + \tilde{x}_i\{b\} + \varepsilon_i\{b\}$  (61)

with $\tilde{x}_i\{b\}\sim\mathcal{N}(0,\tilde{\sigma}^2 I_L)$, which is independent of $\{x_j\{b\}\}_{j\in\mathcal{T}}$ and has variance

$\tilde{\sigma}^2 \geq |\mathcal{N}(i)\setminus\mathcal{T}|\,\rho_{\min}\,(\alpha/\kappa).$  (62)

The error term $\varepsilon_i\{b\}\sim\mathcal{N}(0,(1/K_{i,i}\{b\})I_L)$ is independent of $\tilde{x}_i\{b\}$ and $\{x_j\{b\}\}_{j\in\mathcal{T}}$.
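As a quick numerical plausibility check of Lemma IV.2 (not part of the paper; the matrix size, the random choice of $Q$, and the threshold are arbitrary), one can compare the empirical tail probability of the quadratic form with the bound (57):

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 40, 20000

A = rng.standard_normal((n, n))
Q = (A + A.T) / 2                               # an arbitrary fixed symmetric matrix

Z = rng.standard_normal((trials, n))            # rows are independent draws of z ~ N(0, I)
y = np.einsum('ti,ij,tj->t', Z, Q, Z)           # y = z^T Q z, one value per trial

delta = 3.0 * np.sqrt(2.0) * np.linalg.norm(Q, 'fro')   # roughly three standard deviations of y
empirical = np.mean(np.abs(y - np.trace(Q)) >= delta)
bound = 2.0 * np.exp(-delta**2 / (8.0 * (np.linalg.norm(Q, 'fro')**2
                                         + np.linalg.norm(Q, 2) * delta)))
print(empirical, bound)   # the (conservative) bound (57) should dominate the empirical tail
```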
REFERENCES

[1] N. E. Karoui, "Operator norm consistent estimation of large-dimensional sparse covariance matrices," Ann. Statist., vol. 36, no. 6, pp. 2717–2756, 2008.
[2] N. P. Santhanam and M. J. Wainwright, "Information-theoretic limits of selecting binary graphical models in high dimensions," IEEE Trans. Inf. Theory, vol. 58, no. 7, pp. 4117–4134, Jul. 2012.
[3] P. Ravikumar, M. J. Wainwright, and J. Lafferty, "High-dimensional Ising model selection using ℓ1-regularized logistic regression," Ann. Stat., vol. 38, no. 3, pp. 1287–1319, 2010.
[4] A. Bolstad, B. D. van Veen, and R. Nowak, "Causal network inference via group sparse regularization," IEEE Trans. Signal Processing, vol. 59, no. 6, pp. 2628–2641, Jun. 2011.
[5] J. Bento, M. Ibrahimi, and A. Montanari, "Learning networks of stochastic differential equations," in Advances in Neural Information Processing Systems 23, Vancouver, CN, 2010, pp. 172–180.
[6] N. Meinshausen and P. Bühlmann, "High-dimensional graphs and variable selection with the Lasso," Ann. Stat., vol. 34, no. 3, pp. 1436–1462, 2006.
[7] J. H. Friedman, T. Hastie, and R. Tibshirani, "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, vol. 9, no. 3, pp. 432–441, Jul. 2008.
[8] I. Ebert-Uphoff and Y. Deng, "A new type of climate network based on probabilistic graphical models: Results of boreal winter versus summer," Geophysical Research Letters, vol. 39, no. 19, 2012. [Online]. Available: http://dx.doi.org/10.1029/2012GL053269
[9] E. Davidson and M. Levin, "Gene regulatory networks," Proc. Natl. Acad. Sci., vol. 102, no. 14, Apr. 2005.
[10] A. Jung, G. Hannak, and N. Görtz, "Graphical LASSO based model selection for time series," to appear in IEEE Sig. Proc. Letters, 2015.
[11] A. Jung, "Learning the conditional independence structure of stationary time series: A multitask learning approach," submitted to IEEE Trans. Sig. Proc., also available at http://arxiv.org/pdf/1404.1361v3.pdf, 2014.
[12] A. Jung, R. Heckel, H. Bölcskei, and F. Hlawatsch, "Compressive nonparametric graphical model selection for time series," in Proc. IEEE ICASSP-2014, Florence, Italy, May 2014.
[13] G. Hannak, A. Jung, and N. Görtz, "On the information-theoretic limits of graphical model selection for Gaussian time series," in Proc. EUSIPCO 2014, Lisbon, Portugal, 2014.
[14] R. Dahlhaus, "Graphical interaction models for multivariate time series," Metrika, vol. 51, pp. 151–172, 2000.
[15] P. J. Brockwell and R. A. Davis, Time Series: Theory and Methods. New York: Springer, 1991.
[16] A. Jung, G. Tauböck, and F. Hlawatsch, "Compressive nonstationary spectral estimation using parsimonious random sampling of the ambiguity function," in Proc. IEEE-SSP 2009, Cardiff, Wales, UK, Aug.–Sep. 2009, pp. 642–645.
[17] ——, "Compressive spectral estimation for nonstationary random processes," in Proc. IEEE ICASSP-2009, Taipei, Taiwan, R.O.C., April 2009, pp. 3029–3032.
[18] A. Jung, G. Tauböck, and F. Hlawatsch, "Compressive spectral estimation for nonstationary random processes," IEEE Trans. Inf. Theory, vol. 59, no. 5, pp. 3117–3138, May 2013.
[19] G. Matz and F. Hlawatsch, "Nonstationary spectral analysis based on time-frequency operator symbols and underspread approximations," IEEE Trans. Inf. Theory, vol. 52, no. 3, pp. 1067–1086, March 2006.
[20] G. Durisi, U. G. Schuster, H. Bölcskei, and S. Shamai, "Noncoherent capacity of underspread fading channels," IEEE Trans. Inf. Theory, vol. 56, no. 1, pp. 367–395, Jan. 2010.
[21] M. J. Wainwright, "Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting," IEEE Trans. Inf. Theory, vol. 55, no. 12, pp. 5728–5741, Dec. 2009.
[22] ——, "Information-theoretic bounds on sparsity recovery in the high-dimensional and noisy setting," in Proc. IEEE ISIT-2007, June 2007, pp. 961–965.
[23] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge, UK: Cambridge Univ. Press, 1985.