arXiv:1803.01188v3 [math.ST] 18 Apr 2018
ESTIMATION AND INFERENCE FOR PRECISION MATRICES OF NON-STATIONARY TIME SERIES By Xiucai Ding∗ , and Zhou Zhou† University of Toronto
∗ †
We consider the estimation and inference of precision matrices of a rich class of locally stationary linear and nonlinear time series assuming that only one realization of the time series is observed. Using a Cholesky decomposition technique, we show that the precision matrices can be directly estimated via a series of least squares linear regressions with smoothly time-varying coefficients. The method of sieves is utilized for the estimation and is shown to be optimally adaptive in terms of estimation accuracy and efficient in terms of computational complexity. We establish an asymptotic theory for a class of L2 tests based on the nonparametric sieve estimators. The latter are used for testing whether the precision matrices are diagonal or banded. A Gaussian approximation result is established for a wide class of quadratic forms of non-stationary and possibly nonlinear processes of diverging dimensions, which is of interest by itself.
1. Introduction. Consider a centered non-stationary time series x1,n , · · · , xn,n ∈ R. Denote by Ωn := [Cov(x1,n , · · · , xn,n )]−1 the precision matrix of the series. Modelling, estimation and inference of Ωn are of fundamental importance in a wide range of problems in time series analysis. For example, the L2 optimal linear forecast of xn+1,n based on x1,n , · · · , xn,n is determined by Ωn and the covariance between xn,n+1 and (x1,n , · · · , xn,n ) [3]. In time series regression with fixed regressors, the best linear unbiased estimator of the regression coefficient is a weighted least squares estimator with weights proportional to the square root of the precision matrix of the errors [20]. Furthermore, the precision matrix is a key part in Gaussian likelihood and quasi likelihood estimation and inference of time series [3, 25]. We shall omit the subscript n in the sequel if no confusions arise. Observe that Ω is an n × n matrix. When the time series length n is at least moderately large, it is generally not a good idea to first estimate the covariance matrix of (x1 , · · · , xn ) and then invert it to obtain an estimate of Ω. One main reason is that small errors in the covariance matrix estimation may be amplified through inversion when n is large, especially when the Keywords and phrases: Non-stationary time series, Precision matrices, Cholesky decomposition, Sieve estimation, High dimensional Gaussian approximation, Random matrices, White noise and bandedness tests
1
2
X. DING AND Z. ZHOU
condition number of the covariance matrix is large. Also matrix inversion is not computationally efficient for large n. As a result it is desirable to directly estimate Ω. In this paper, we utilize a Cholesky decomposition technique to directly estimate Ω through a series of least squares linear regressions. Specifically, write (1.1)
xi =
i−1 X j=1
φij xi−j + ǫi , i = 2, · · · , n
P where i−1 ˆi is the best linear forecast of xi based on x1 , · · · , xi−1 j=1 φij xi−j := x and ǫi is the forecast error. Let ǫ1 := x1 and denote by σi2 the variance of ǫi , i = 1, 2, · · · , n. Observe that ǫi are uncorrelated random variables. As a result it is straightforward to show that [28] (1.2)
˜ Ω = Φ∗ DΦ,
˜ = diag{σ −2 , · · · , σ −2 }, Φ is a lower triangular where the diagonal matrix D n 1 matrix having ones on its diagonal and −φij at its (i, i − j)−th element for j < i and ∗ denotes matrix or vector transpose. The most significant advantage of the above Cholesky decomposition is structural simplification that transfers the difficult problem of precision matrix estimation to that of estimating a series of least squares regression coefficients and error variances. However, the Cholesky decomposition idea is not directly applicable to precision matrix estimation of non-stationary time series. The reason is that there are in total n(n + 1)/2 regression coefficients and error variances to be estimated in the Cholesky decomposition of Ω. Meanwhile, observe that there are also n(n + 1)/2 parameters to be estimated for the precision matrix of a general non-stationary time series. Hence Cholesky decomposition, though performs structural simplification, does not reduce the dimensionality of the parameter space. On the other hand, we only observe one realization of the time series with n observations. As a result dimension reduction techniques with natural assumptions in non-stationary time series analysis are needed for the estimation of Ω. We adopt two natural and widely used assumptions in non-stationary time series for the dimension reduction. First such assumption is local stationarity which refers to slowly or smoothly time-varying underlying data generating mechanisms of the series. Utilizing the locally stationary framework in Zhou and Wu [47], we show that, for a wide class of locally stationary nonlinear processes, each off-diagonal of the Φ matrix as well as the error variance series σi2 can be well approximated by smooth functions on [0,1]. Specifically, we show that there exist smooth functions φj (·)
PRECISION MATRIX ESTIMATION
3
and g(·) such that supi>b |φij − φj (i/n)| = o(n−1/2 ), j = 1, 2, · · · , n and supi>b |σi2 − g(i/n)| = o(n−1/2 ), where b = bn diverges to infinity with b/n → 0 whose specific value will be determined later in the article. To our knowledge, the latter is the first result on smooth approximation to general non-stationary precision matrices. From classic approximation theory [31], a d times continuously differentiable function can be well approximated by a basis expansion with O((n/ log n)1/(2d+1) ) parameters. Thanks to the local stationarity assumption, the number of parameters needed for estimating σi2 is reduced from n to O((n/ log n)1/(2d+1) ). Similar conclusion holds for each off-diagonal of Φ. The second assumption we adopt is short range dependence which refers to fast decay of the dependence between xi and xi+j as j diverges. Using the physical dependence measures introduced in Zhou and Wu [47], modern operator spectral theory and approximation theory [12, 33], we show, as a theoretical contribution of the paper, that the off-diagonals of Φ decays fast to zero for a general class of locally stationary short range dependent processes. Specifically, we show that φij can be effectively treated as 0 whenever j > b. Hence the total number of parameters one needs to estimate is reduced to the order b[b + (n/ log n)1/(2d+1) ] which is typically much smaller than the sample size n. Now we utilize the method of sieves to estimate the smooth functions φj (·) and g(·) mentioned above. The method of sieves refers to approximating an infinite dimensional space with a sequence of finer and finer finite dimensional subspaces. Typical examples include Fourier, wavelet and orthogonal polynomial approximations to smooth functions on compact intervals. We refer to [8] by Chen for a thorough review of the subject. There are two major advantages of the sieve method when used for precision matrix estimation. First, many sieve estimators, such as the three mentioned above, do not have inferior performances at the boundary of the estimating interval. This is important as inaccurate estimates at the boundary may drastically lower the accuracy of the whole precision matrix estimation even though entries are well estimated in the interior. Second, the computation complexities of many sieve methods are both adaptive (to the smoothness of the functions of interest) and efficient. When estimating one smooth function of time, local methods such as the kernel estimation perform one regression at each time point. This could be computational inefficient when n is large. On the contrary, the above mentioned three sieve methods only need to perform a single regression at the whole time interval with the number of covariates determined by the smoothness of the function of interest. In many cases this yields a much faster estimation. For instance, in the extreme case where the
4
X. DING AND Z. ZHOU
time series dependence is exponentially decaying and the functions are infinitely differentiable, the sieve method only needs O(n log5 n) computation complexity to estimate Ω. Under the same scenario, the kernel method is of O(n2 kn log2 n) computation complexity where kn is the bandwidth used for the regression and is typically of the order n−1/5 . In this paper, we show that the sieve estimates of the functions φj (·) achieve, uniformly over time and j, minimax rate for nonparametric function estimation [31]. This extends previous convergence rate results on nonparametric sieve regression to the case of diverging number of covariates and non-stationary predictors and errors. Combining the latter result with modern random matrix theory [34], we show that the operator norm of the estimated precision matrix converges at a fast rate which is determined by the strength of time series dependence and smoothness of the underlying data generating mechanism. In the best scenario where the dependence is exponentially decaying and φj (·) and g(·) are infinitely differentiable, the √ convergence rate is shown to be of the order log3 n/ n, which is almost as fast as parametrically estimating a single parameter from i.i.d. samples. The sieve estimators have already been used to estimate the smooth conditional mean function in various settings. For instance, in [1], the authors proved that the sieve least square estimators could achieve minimax rate in the sense of sup-norm loss for a fixed number of i.i.d regressors and errors with a general class of sieve basis functions; later Chen and Christensen [9] showed that the spline and wavelet sieve regression estimators attain the above global uniform convergence rate for a fixed number of weakly dependent and stationary regressors. In this article, we study nonparametric sieve estimates for locally stationary time series with diverging number of covariates under physical dependence and obtain the same minimax rate for the functions φj (·). After estimating Ω, one may want to perform various tests on its structure. In this paper, we focus on two such tests, one on whether {xi }ni=1 is a non-stationary white noise and the other on whether Ω is banded. Two test statistics based on the L2 distances between the estimated and hypothesized Φ are proposed. These tests boil down to quadratic forms of the estimated sieve regression coefficients which are quadratic forms of non-stationary, dependent vectors of diverging dimensionality. To our knowledge, there have been no previous works on L2 inference of nonparametric sieve estimators as well as the inference of high dimensional quadratic forms of non-stationary nonlinear time series. Here we utilize Stein’s method together with an mdependence approximation technique and prove that the laws of a large class of quadratic forms of non-stationary nonlinear processes with diverg-
PRECISION MATRIX ESTIMATION
5
ing dimensionality can be well approximated by those of quadratic forms of diverging dimensional Gaussian processes. Consequently asymptotic normality can be established for those high dimensional quadratic forms. The latter Gaussian approximation result is of separate interest and may of wider applicability in non-stationary time series analysis. In [41], Xu, Zhang and ∗ Wu derived the L2 asymptotics for the quadratic form X X, where X is the sample mean of n i.i.d. random vectors. In the present paper, we prove ∗ new and much more general L2 asymptotics for quadratic forms Z EZ for any bounded positive semi-definite matrix E using Stein’s method [30, 43], where Z is the sample mean of a high dimensional, non-stationary and dependent process. It is very interesting that similar ideas have been used in proving the universality of random matrix theory [14, 16, 22, 35]. We point out that the idea of Cholesky decomposition has been used in time series analysis under some different settings when multiple replicates of the vector of interest are available. Assuming a longitudinal setup where multiple realizations can be observed, Wu and Pourahmadi [39] studied the estimation of covariance matrices using nonparametric smoothing techniques. Bickel and Levina [2] considered estimating large covariance and precision matrices by either banding or tapering the sample covariance matrix and its inverse assuming that multiple independent samples can be observed. On the other hand, we assume that only one realization of the time series is observed which is the case in many real applications. Hence none of the aforementioned results can be applied under this scenario. Finally, we mention that estimating large dimensional covariance and precision matrices has attracted much attention in the last two decades. One main research line is to assume that we can observe n i.i.d copies of a pdimensional random vector. When p is comparable or larger than n, it is well-known that sample covariance and precision matrices are inconsistent estimators [13, 27, 36]. To overcome the difficulty from high dimensionality, researchers usually impose two main structural assumptions in order to consistently estimate the covariance and precision matrices: sparsity structure and factor model structure. Various families of covariance matrices have been introduced assuming some types of sparsity, this includes the bandable covariance matrices [2, 5, 39] and sparse covariance matrices [7, 23, 42]. Many regularization methods have been developed to exploit the sparsity structure for the estimation of covariance and precision matrices. For a comprehensive review, we refer to [6]. On the other hand, factor models in the high dimensional setting have been used in a range of applications in finance and economics. Under this structural assumption, many thresholding techniques can be used to estimate the covariance and precision matrices
6
X. DING AND Z. ZHOU
[17, 18]. For a comprehensive review on factor model based methods, we refer to [19]. Most recently, researchers have focused on providing an oracle estimator without the structural assumptions, for instance the optimal rotation invariant estimator [4, 15, 26]. Even though the estimators are inconsistent, they are better choices than simply using the sample covariance and precision matrices. Although high dimensional covariance and precision matrix estimation has witnessed unprecedented development, statistical inference of high dimensional and non-stationary time series remains largely untouched so far. Under stationarity, [40] considers thresholding and banding techniques for estimating the covariance matrix with only one realization of the series. Under sparsity assumptions, [10] estimates marginal covariance and precision matrices of high-dimensional stationary and locally stationary time series using thresholding and Lasso techniques. Note that when estimating marginal covariance or precision matrices of a p dimensional time series of length n, the series can be viewed as n dependent replicates of the vector of interest which is completely different than the situation considered in this article. The rest of the paper is organized as follows. In Section 2, we introduce a rich class of non-stationary (locally stationary) and nonlinear time series and study the theoretical properties of its covariance and precision matrices. In Section 3, we consistently estimate the precision matrices and provide convergent rates for these estimators. In Section 4, we propose two efficient testings using some simple statistics from our estimation procedure. In Section 5, we give Monte Carlo simulations to illustrate our results. In Section 6, we give the essential proofs of the theorems in our paper. In Appendix A, additional technical proofs are provided. 2. Locally stationary time series. Consider a locally stationary time series [45, 47, 48] (2.1)
i xi = G( , Fi ), n
where Fi = (· · · , ηi−1 , ηi ) and ηi , i ∈ Z are i.i.d centered random variables, and G : [0, 1] × R∞ → R is a measurable function such that ξi (t) := G(t, Fi ) is a properly defined random variable for all t ∈ [0, 1]. The above represents a wide class of locally stationary linear and nonlinear processes. We refer to Zhou and Wu [38, 47, 48] for detailed discussions and examples. And following [38, 47, 48], we introduce the following dependence measure to quantify the temporal dependence of (2.1). Definition 2.1.
Let {ηi′ } be an i.i.d. copy of {ηi }. Assuming that for
PRECISION MATRIX ESTIMATION
7
some q > 0, ||xi ||q < ∞, where || · ||q = [E| · |q ]1/q is the Lq norm of a random variable. For j ≥ 0, we define the physical dependence measure by (2.2)
δ(j, q) := sup max ||G(t, Fi ) − G(t, Fi,j )||q , i
t∈[0,1]
′ ,η where Fi,j := (Fi−j−1 , ηi−j i−j+1 , · · · , ηi ).
The measure δ(j, q) quantifies the changes in the filter’s output when the input of the system j steps ahead is changed to an i.i.d. copy. If the change is small, then we have short-range dependence. It is notable that δ(j, q) is related to the data generating mechanism and can be easily computed. We refer the readers to [47, Section 4] for examples of such computation. In the present paper, we impose the following assumptions on (2.1) and the physical dependence measure to control the temporal dependence of the non-stationary time series. Assumption 2.2. There exists a constant τ > 10 and q > 4, for some constant C > 0, we have that δ(j, q) ≤ Cj −τ , j ≥ 1.
(2.3)
Furthermore, G defined in (2.1) satisfies the property of stochastic Lipschitz continuity, for any t1 , t2 ∈ [0, 1], we have (2.4)
||G(t1 , Fi ) − G(t2 , Fi )||q ≤ C|t1 − t2 |.
We also assume that (2.5)
sup max ||G(t, Fi )||q < ∞. t
i
(2.3) indicates that the time series has short-range dependence. (2.4) implies that G(·, ·) changes smoothly over time and ensures local stationarity. Furthermore, for each fixed t ∈ [0, 1], denote (2.6)
γ(t, j) = E(G(t, F0 ), G(t, Fj )),
(2.4) and (2.5) imply that γ(t, j) is Lipschitz continuous in t. Furthermore, we need the following mild assumption on the smoothness of γ(t, j). Assumption 2.3. For any j ≥ 0, we assume that γ(t, j) ∈ C d ([0, 1]), d > 0 is some integer, where C d ([0, 1]) is the function space on [0, 1] of continuous functions that have continuous first d derivatives.
8
X. DING AND Z. ZHOU
Many important consequences can be derived due to Assumption 2.2 and 2.3. We list the most useful ones and put their proofs into Appendix A. The first one is the following control on γ(t, j). Lemma 2.4. Under Assumption 2.2 and 2.3, there exists some constant C > 0, such that sup |γ(t, j)| ≤ Cj −τ , j ≥ 1. t
Another important conclusion is that the coefficients defined in (1.1) is of polynomial decay. Hence, when i > b is large, where b = O(n2/τ ), we only need to focus on autoregressive fit of order b instead of i − 1. Recall (1.1), denote φi = (φi1 , · · · , φi,i−1 )∗ . Then we have (2.7)
φi = Ω i γ i ,
where Ωi and γi are defined as Ωi = [Cov(xii−1 , xii−1 )]−1 , γi = Cov(xii−1 , xi ), with xii−1 = (xi−1 , · · · , x1 )∗ . Proposition 2.5. Under Assumption 2.2 and letting b = O(n2/τ ), there exists some constant C > 0, such that ( max{Cn−4+5/τ , Cj −τ }, i ≥ b2 ; (2.8) sup |φij | ≤ max{Cn−2+3/τ , Cj −τ }, b < i < b2 . i>b ˜ b = Ωb γ b Furthermore, when i > b, denote φbi = (φi1 , · · · , φib ), and φ i i i with entries (φ˜i1 , · · · , φ˜ib ), where Ωbi = [Cov(xi , xi )]−1 , γib = E(xi xi ), xi = (xi−1 , · · · , xi−b ), we have ˜b ≤ Cn−2+1/τ . sup φbi − φ i i
The above proposition provides the first dimension reduction for our parameter space. It states that we can treat φij = 0 for j > b. Hence, the number of coefficients needed for the Cholesky decomposition reduces from O(n2 ) to O(nb). Finally, denote φb ( ni ) := (φ1 ( ni ), · · · , φb ( ni )) by (2.9)
i ˜ bi γ ˜ib , φb ( ) = Ω n
˜ b and γ ˜ib are defined as where Ω i ˜ bi = [Cov(˜ ˜ i )]−1 , γ ˜i = Cov(˜ Ω xi , x xi , xi ), ˜ i,k = G( ni , Fi−k ), k = 1, 2, · · · , b. The following lemma shows that φbi with x can be well approximated by φb ( ni ) when i > b.
PRECISION MATRIX ESTIMATION
Lemma 2.6. such that
9
Under Assumption 2.2, there exists some constant C > 0, i sup φij − φj ( ) ≤ Cn−1+2/τ , for all j ≤ b. n i>b
We will see from Section 3 that the above lemma provides the second dimension reduction. Due to the smoothness of φj (·), we only need a linear combination of smooth basis functions to estimate. This will reduce the number of estimation from O(nb) to O(bc), where c is the number of basis functions. Till the end of this paper, we will always use b = O(n2/τ ). Recall that ǫi is the prediction error with variance σi2 , ǫ i = xi − x ˆi .
(2.10)
We summarize the basic properties of ǫi . Lemma 2.7. First of all, we have supi σi2 < ∞. Furthermore, denote the physical dependence measure of {ǫi } as δǫ (j, q), then there exists some constant C > 0, such that δǫ (j, q) ≤ Cj −τ , j ≥ 1. Remark 2.8. In this paper, we focus on the discussion when the physical dependence measure is of polynomial decay, i.e. (2.3) holds true. However, all our results can be extended to the case when the short-range dependence is of exponential decay, i.e. δ(j, q) ≤ Caj , 0 < a < 1.
(2.11)
In detail, Lemma 2.4 can be changed to sup |γ(t, j)| < Caj , j ≥ 1. t
Therefore, we only need to choose b = O(log n). As a consequence, Proposition 2.5 can be updated to ˜b ≤ Cn−C , sup |φij | ≤ max{Cn−C , Caj }, sup φbi − φ i i
i>b
where C > 1 is some constant. Similarly, Lemma 2.6 can be modified to i C log n sup φij − φj ( ) ≤ , for all j ≤ b. n n i>b Finally, the analog of Lemma 2.7 is
δǫ (j, q) ≤ Caj .
10
X. DING AND Z. ZHOU
3. Estimation of precision matrices. As shown in (1.2), Proposition 2.5 and Lemma 2.6, it suffices to estimate φij , i ≤ b, φj ( ni ), i > b ≥ j and the variances of the residuals. When i > b, by (2.5) and Proposition 2.5, it is easy to see that X i−1 sup φij xi−j = o(n−1 ) in probability. i j=b+1
Therefore, we now simply write (3.1)
xi =
b X j=1
φij xi−j + ǫi + ou (n−1 ), i = b + 1, · · · , n,
where Xi = ou (n−1 ) means nXi converges to zero in probability uniformly for i > b. 3.1. Time-varying coefficients for i > b. We first estimate the timevarying coefficients φj ( ni ) using the method of sieves [1, 8, 9] when i > b. We need to treat i ≤ b and i > b separately as when i > b, we only need to do one least square regression to estimate φj (·). We firstly observe the following fact, whose proof will be put into Appendix A. Lemma 3.1. Under Assumption 2.2 and 2.3, for any j ≤ b, we have that φj (t) ∈ C d ([0, 1]). Based on the above lemma, we use c
(3.2)
X i i ajk αk ( ), j ≤ b, θj ( ) := n n k=1
to approximate φj ( ni ), where {αk ( ni )} is a set of pre-chosen orthogonal basis functions on [0, 1] and c ≡ c(n) stands for the number of basis functions which will be specified later. We impose the following regularity condition on the regressors and the basis functions. Assumption 3.2. For any k = 1, 2, · · · , b, denote Σk (t) ∈ Rk×k via = γ(t, |i − j|), we assume that the eigenvalues of Z 1 Σk (t) ⊗ (b(t)b∗ (t)) dt,
Σkij (t)
0
are bounded above and also away from zero by a constant κ > 0, where b(t) = (α1 (t), · · · , αc (t))∗ ∈ Rc .
PRECISION MATRIX ESTIMATION
11
The results of the convergent rate on the approximation (3.2) can be found in [8, Section 2.3] for the commonly used basis functions, where we summarize it as the following lemma. Lemma 3.3.
Denote the sup-norm with respect to Lebesgue measure as L∞ := sup |φj (t) − θj (t)| . t∈[0,1]
We then have that L∞ = O(c−d ) for the orthogonal polynomials, trigonometric polynomials, spline series when r ≥ d + 1, and orthogonal wavelets when m > d. Using the Cholesky decomposition (3.1) and Assumption 3.5, we can now write (3.3)
xi =
c b X X j=1 k=1
i ajk zkj ( ) + ǫi + ou (n−1 ), i = b + 1, · · · , n, n
with zkj ( ni ) defined as
i i zkj ( ) := αk ( )xi−j . n n We can use the ordinary least square (OLS) method to estimate the coefficients ajk . Denote the vector β ∈ Rbc with βs = ajs ,ks , where js = ⌊ sc ⌋ + 1, ks = s − ⌊ sc ⌋ × c. Similarly, we define yi ∈ Rbc by letting yis = zks ,js ( ni ). Furthermore, we denote Y ∗ as the bc × (n − b) rectangular matrix whose columns are yi , i = b + 1, · · · , n. We also define by x ∈ Rn−b containing xb+1 , · · · , xn . Hence, the OLS estimator for β can be written as βˆ = (Y ∗ Y )−1 Y ∗ x. Moreover, recall xi = (xi−1 , · · · , xi−b )∗ ∈ Rb , denote X = (xb+1 , · · · , xn ) ∈ Rb×(n−b) and the matrices Ei ∈ R(n−b)×(n−b) satisfies that (Ei )st = 1, when s = t = i − b and zeros otherwise. As a consequence, we can write n X i ∗ X ⊗ b( ) Ei , (3.4) Y = n i=b+1
where ⊗ stands for the Kronecker product. It is well-known that the OLS estimator satisfies ∗ −1 ∗ Y Y Y ǫ ˆ (3.5) β=β+ + oP (n−1 ), n n
12
X. DING AND Z. ZHOU
where ǫ ∈ Rn−b contains ǫb+1 , · · · , ǫn and the error is entrywise. We decompose β into b blocks by denoting β = (β1∗ , · · · , βb∗ )∗ , where each βi ∈ Rc . ˆ Therefore, our sieve estimator can be written Similarly, we can decompose β. i i ∗ ˆ ˆ as φj ( n ) = βj b( n ) and it satisfies that i i i φj ( ) − φˆj ( ) = (βj − βˆj )∗ b( ). n n n
(3.6)
By carefully choosing c = O(nα1 ), we show that φˆj ( ni ) are consistent estimators for φj ( ni ) uniformly in i for all j ≤ b. Denote ζc := supi ||b( ni )||, as discussed in [1, Section 3], we can write ∗
ζc = nα1 , where α∗1 = 12 α1 for trigonometric polynomials, spline series, orthogonal wavelets and weighted orthogonal Chebyshev polynomials. And α∗1 = α1 for Legendre orthogonal polynomials. Finally, to control the sup-norm, we need the following assumption on the derivative of the basis functions, which is also used in [9, Assumption 4]. Assumption 3.4. have
There exist ω1 , ω2 ≥ 0, for some constant C > 0, we sup ||∇b(t)|| ≤ Cnω1 cω2 . t
The above assumption is a mild regularity condition on the sieve basis functions and is satisfied by many of the widely used basis functions. For instance, we can choose ω1 = 0, ω2 = 21 for trigonometric polynomials, spline series, orthogonal wavelets and weighted Chebyshev polynomials. For more examples of basis functions satisfying this assumption, we refer to [9, Section 2.1]. Finally, we impose the following mild assumption on the parameters. Assumption 3.5. We assume that for τ defined in (2.3), d defined in Assumption 2.3 and α1 , there exists a constant C > 4, such that C + dα1 < 1. τ Theorem 3.6.
Under Assumption 2.2, 2.3, 3.2, 3.4 and 3.5, we have ! r i i log n + n−dα1 . sup φj ( ) − φˆj ( ) = OP ζc n n n i>b,j≤b
PRECISION MATRIX ESTIMATION
13
Furthermore, by using basis functions with α∗1 = 12 α1 , we can show that our estimators attain the optimal minimax convergent rate (n/(log n))−d/(2d+1) for the nonparametric regression proposed by Stone in [31]. Corollary 3.7. Under Assumption 2.2, 2.3, 3.2, 3.4 and 3.5, using the trigonometric polynomials, spline series, orthogonal wavelets and weighted orthogonal Chebyshev polynomials, when c = O((n/(log n))1/2d+1 ), we have i i ˆ sup φj ( ) − φj ( ) = OP (n/(log n))−d/(2d+1) . n n i>b,j≤b
3.2. Time-varying coefficients for i ≤ b. It is notable that by Lemma 2.6, when i, j are less or equal to b, we cannot use the estimators derived from Section 3.1. Instead, a different series of least squares linear regressions should be used. For instance, in order to estimate φ21 , we use the following regression equations xk = φk1 xk−1 + ξk,2 , k = 2, 3, · · · , n. Note that ξ2,2 = ǫ2 . Due to the local stationarity assumption, there exists a smooth function f21 , such that φk1 ≈ f21 ( nk ), k = 2, 3, · · · , n. Here f21 can be efficiently estimated using the sieve method as described by the previous discussions and φ21 can be estimated by fˆ21 (2/n). Generally, for each fixed i ≤ b, to estimate φi , we make use of the following predictions: (3.7)
xk =
i−1 X j=1
λkj xk−j + ξk,i , k = i, i + 1, · · · , n,
where λki = (λk1 , · · · , λk,i−1 ) are the coefficients of the best linear prediction using the i − 1 predecessors and ξi,i = ǫi . Note that λii = φi . Using YuleWalker’s equation, we find λki = Ωki γik , where Ωki = [Cov(xki , xki )]−1 , γik = Cov(xki , xk ) and xki = (xk−1 , · · · , xk−i+1 ). i ( k )) by Due to Assumption 2.3, we can denote fik = (f1i ( nk ), · · · , fi−1 n ˜ k γ˜ k , fik = Ω i i
(3.8) k ˜ k, γ with Ω i ˜i defined by
˜ ki = [Cov(˜ ˜ ki )]−1 , γ ˜ik = Cov(˜ Ω xki , x xki , x ˜k ), ˜ ki,j = G( nk , Fi−j ). In order to estimate φij via fji ( ni ), we will make where x use of the following lemma whose proof can be found in Appendix A.
14
X. DING AND Z. ZHOU
Lemma 3.8. Under Assumption 2.2 and 2.3, for each fixed i ≤ b and for any j ≤ i − 1, fji (t) are C d functions on [0, 1]. Furthermore, for some constant C > 0, we have i i φij − fj ( ) ≤ C n−1+2/τ + n−dα1 , j < i ≤ b. (3.9) n
Remark 3.9. Under the exponential decay assumption (2.11), Lemma 3.8 can be changed to i i φij − fj ( ) ≤ C log n + n−dα1 , j < i ≤ b. n n Therefore, the rest of the work leaves to estimate the functions fji ( ni ), j < i ≤ b using sieve approximation by denoting c
X i i djk αk ( ). fji ( ) = n n k=1
The estimation is similar to Section 3.1 except that the P length b is replaced by i − 1. We denote the OLS estimation as fˆji ( ni ) = ck=1 dˆjk αk ( ni ). Theorem 3.10.
Under Assumption 2.2, 2.3, 3.2, 3.4 and 3.5, we have ! r i i i log n + n−dα1 . sup fj ( ) − fˆji ( ) = OP ζc n n n i≤b,j b b and i ≤ b separately. For i > b, denote ǫi = xi − bj=1 φij xi−j and (σib )2 = E(ǫbi )2 . σi can be well approximated using σib by the following lemma, whose proof will be put into Appendix A. Lemma 3.11. Under Assumption 2.2 and 2.3, for i > b and some constant C > 0, we have sup σi2 − (σib )2 ≤ Cn−2+2/τ . i>b
2 P Furthermore, denote g( ni ) = E xi − bj=1 φij G( ni , Fi−j ) , we then have b 2 i sup (σi ) − g( ) ≤ Cn−1+4/τ . n i>b
Finally, g( ni ) ∈ C d ([0, 1]).
PRECISION MATRIX ESTIMATION
15
Denote rib = (ǫbi )2 , it is notable that rib can not be observed directly. Instead, we use rˆib = ǫˆ2i , where (3.10)
ǫˆi = xi −
c b X X j=1 k=1
i a ˆjk zkj ( ), i = b + 1, · · · , n. n
By Theorem 3.6 and Assumption 3.5, we conclude that r log n 2/τ b b ζc (3.11) sup |ri − rˆi | = OP n + n−dα1 . n i>b Denote the centered random variables ωib = rib − (σib )2 , by Lemma 3.11 and (3.11), we have r log n i (3.12) rˆib = g( ) + ωib + OP n2/τ ζc + n−dα1 . n n
Invoking Lemma 3.3 and Assumption 3.5, for i > b, we can therefore write our regression equations as c r log n X i + n−dα1 . (3.13) rˆib = dk αk ( ) + ωib + OP n2/τ ζc n n k=1
Similar to Lemma 2.7, we can show that the physical dependence measure of ωib is also of polynomial decay. Therefore, the OLS estimator for α = (d1 , · · · , dc )∗ can be written as ˆ = (W ∗ W )−1 W ∗ ˆr, α i+b ∗ where W ∗ is an c×(n−b) matrix whose i-th column is (α1 ( i+b n ), · · · , αc ( n )) , n−b b b i = 1, 2, · · · , n − b, and ˆ r is an R containing rˆb+1 , · · · , rˆn . Furthermore, by the property of OLS, we have ∗ −1 ∗ r log n W W W ω 2/τ ˆ =α+ (3.14) α ζc + OP n + n−dα1 , n n n
where ω = (ωb+1 , · · · , ωn )∗ and the error is entrywise. Correspondingly, we have the following consistency result. Under Assumption 2.2, 2.3, 3.2, 3.4 and 3.5, we have r log n i i 2/τ sup gˆ( ) − g( ) = OP n ζc + n−dα1 . n n n i>b
Theorem 3.12. (3.15)
16
X. DING AND Z. ZHOU
Finally, we study the estimation of σi2 , i = 1, 2, · · · , b, which enjoys the same discussion as in Section 3.2. Recall ξk,i defined in (3.7), denote (σk,i (ξ))2 = E(ξk,i )2 , using a similar discussion to Lemma 3.11, we can find a smooth function gi , such that supi≤b |(σi,i (ξ))2 − g i ( ni )| ≤ O(n−1+4τ ), especially we can use gi ( ni ) to estimate σi2 . When i = 1, we need to estimate the variance function of xi . The rest of the work leaves to estimate gi (t) using sieve method similar to (3.13) for i ≤ b, where we replace the residual with rˆki , k = i, · · · , n. Here rˆki is defined as (3.16)
rˆki := xi −
i−1 X j=1
2
k fˆji ( )xi−j , k = i, i + 1, · · · , n. n
Then for i ≤ b, we can estimate gˆi ( ni ) similarly, except that the dimension of W ∗ is c × (n + 1 − i). The results can be summarized as the following theorem. Theorem 3.13. Under Assumption 2.2, 2.3, 3.2, 3.4 and 3.5, we have r log n i i i i 2/τ ζc + n−dα1 . sup gˆ ( ) − g ( ) = OP n n n n i≤b 3.4. Precision matrix estimation. From (1.2), it is natural to choose ˆ ˆ ˆ := Φ ˆ ∗D ˜Φ Ω as our estimator for the precision matrix. As we discussed in the previous ˆ is a lower triangular matrix whose diagonal entries are all sections, here Φ ones. For the off-diagonal entries, when i > b and j ≤ b, its (i, i − j)-th entry is −φˆj ( ni ) defined in Section 3.1. And when i ≤ b, Φi,i−j is estimated using ˆ ˆ are set to be zeros. Finally, D ˜ −fˆi ( i ) from Section 3.2. All other entries of Φ j n
is a diagonal matrix with entries {ˆ σi−2 } estimated from Section 3.3. Observe ˆ is always positive semi-definite. It is easy to see that when i > b, the that Ω number of regressors is bc and length of observation is n − b. Hence the computational complexity of the least squares regression is O(n(bc)2 ). Similar discussion can be applied for i ≤ b and we hence conclude that the computaˆ is of the order O(nb3 c2 ). As a result the tional complexity for estimating Ω computation complexity of our estimation is adaptive to the smoothness of the underlying data generating mechanism and the decay rate of temporal dependence. In the best scenario when assumption (2.11) holds and
PRECISION MATRIX ESTIMATION
17
γ(t, j) ∈ C ∞ ([0, 1]), our procedure only requires O(n log5 n) computation complexity. In this subsection, we shall control the estimation error between Ω and ˆ ˆΦ ˆ ∗ ) = 1, combining with Ω. We first observe that, as det(ΦΦ∗ ) = det(Φ Assumption 2.2, there exist some constants C1 , C2 > 0, such that C1 ≤ λmin (ΦΦ∗ ) ≤ λmax (ΦΦ∗ ) ≤ C2 . ˆΦ ˆ ∗. Similar results hold for Φ Theorem 3.14. (3.17)
Under Assumption 2.2, 2.3, 3.2, 3.4 and 3.5, we have r log n −dα1 4/τ ˆ . n + ζc Ω − Ω = OP n n
Recall that ||·|| denotes the operator norm of a matrix. It can be seen from the above theorem that the estimation accuracy of precision matrices depends on the decay rate of dependence and the smoothness of the covariance functions. The estimation accuracy gets higher for time series with more smooth covariance functions and faster decay speed of dependence. Remark 3.15. Under assumption (2.11), when we apply the Gershgorin’s circle theorem for our proof, we only need O(log n) matrix entries to bound the error terms. Hence, we can change (3.17) to ! r log n 2 −dα 1 ˆ = OP log n n + ζc . Ω − Ω n
In the best scenario where the dependence is exponentially decaying and φj (·) and g(·) are infinitely differentiable, following the same arguments as those in the proof of Theorem 3.14, it is easy to show the convergence rate ˆ is of the order log3 n/√n, which is almost as fast as parametrically of Ω estimating a single parameter from i.i.d. samples. 4. Testing the structure of the precision matrices. An important advantage of our methodology is that we can test many structural assumptions of the precision matrices using some simple statistics in terms of the ˆ entries of Φ. 4.1. Test statistics. In this subsection, we focus on discussing two fundamental tests in non-stationary time series analysis. One of those is to test
18
X. DING AND Z. ZHOU
whether the observed samples are from a non-stationary white noise process {xi } in the sense that ( 0, i 6= j, Cov(xi , xj ) = 2 σi , i = j. Note that we allow heteroscedasticity by assuming that the variance of xi changes over time. Formally, we would like to test H10 : {xi } is a non-stationary white noise process. Under H10 , recall (2.9), we shall have that φj ( ni ) are all zeros. Therefore, our estimation φˆj ( ni ) should be small enough for all pairs i, j, i 6= j. We hence use the following statistic: T1∗ =
(4.1)
b Z X j=1
1 0
φˆ2j (t)dt.
The second hypothesis of interest is whether the precision matrices are banded. In our setup, the Cholesky decomposition provides a convenient way to test the bandedness. Formally, for any k0 ≡ k0 (n) < b, we are interested in testing the following hypothesis: H20 : The precision matrix of {xi } is k0 -banded. Due to (1.2), as Ω is strictly positive, the Cholesky decomposition is unique. Therefore, we conclude that Φ is also k0 -banded using the discussion in [29, Section 2]. Furthermore, under H20 , we have that φj ( ni ) = 0, for j > k0 . Therefore, it is natural for us to use the following statistic T2∗
=
Z b X
j=k0 +1 0
1
φˆ2j (t)dt.
It is notable that both of the test statistics T1∗ and T2∗ can be written into summations of quadratic forms under the null hypothesis. For T1∗ under H10 and T2∗ under H20 , we have that 2 φˆ2j (t) = φˆj (t) − φj (t) .
For any fixed j ≤ b, we have Z 1 c 2 X ˆ (ˆ ajk − ajk )2 + O(n−dα1 ). φj (t) − φj (t) dt = 0
k=1
PRECISION MATRIX ESTIMATION
19
It can be seen from the above equation that the order of smoothness and number of basis functions are important to our distribution. Under Assumption 3.5, we can see that the error O(n−dα1 ) is negligible. We refer the readers to [47, Section 4] for examples satisfying this property. Then recall (3.5), it is easy to see that for Σ defined in (6.3), we have for all j ≤ b, (4.2)
Z
0
1
2 ǫ∗ Y −1 ∗ Y ∗ǫ φj (t) − φˆj (t) dt = Σ Aj Aj Σ−1 + oP (1), n n
where Aj ∈ Rbc is a diagonal block matrix whose j-th diagonal block being the identity matrix and zeros otherwise. Therefore, all the above quantities can be computed using (4.2), which are quadratic forms of a high dimensional locally stationary time series. 4.2. Diverging dimensional Gaussian approximation. As we have seen from the previous subsection, both test statistics are involved with high dimensional quadratic forms. Observe that the distribution of quadratic forms of Gaussian vectors can be derived using Lindeberg’s central limit theorem. Hence our case can be tackled if we could establish a Gaussian approximation of the quadratic form (4.2) of general distributions. In this subsection, we will prove a Gaussian approximation result for the quadratic form Z∗ EZ, ∗ where Z := ǫ√Yn ∈ Rbc and E is a bounded positive semi-definite matrix. Denote p = bc and zi = (zi1 , · · · , zip )∗ , where (4.3)
i s zis = xi−¯s−1 ǫi αs′ ( ), s¯ = ⌊ ⌋, s′ = s − s¯c, i ≥ b + 1. n c
As a consequence, we have n 1 X zi . Z := (Z1 , · · · , Zp ) = √ n i=b+1
Pn
Denote U = √1n i=b+1 ui , where {ui }ni=b+1 are centered Gaussian random vectors independent of {zi }ni=b+1 and preserve their covariance structure. Our task is to control the following Kolmogorov distance (4.4)
ρ := sup |P (Rz ≤ x) − P (Ru ≤ x)| , x∈R
with the definitions Rz = Z∗ EZ, Ru = U∗ EU.
20
X. DING AND Z. ZHOU
We have the following result on the high dimensional Gaussian approximation. Recall we define p = bc and denote ξc := sup |αi (t)|. i,t
It is notable that ξc can be well-controlled for the commonly used basis functions. For instance, for the trigonometric polynomials and the weighted Chebyshev polynomials of the first kind, ξc = O(1); and for orthogonal √ wavelet, ξc = O( c). The following theorem establishes the Gaussian approximation for high dimensional quadratic forms under physical dependence. Theorem 4.1. Under Assumption 2.2, 2.3, 3.2, 3.4 and 3.5, for some constant C > 0, we have ρ ≤ Cl(n), where l(n) is defined as q
l(n) = ψ −1/2 +ξc pψ q+1 M + pψ
ξ 1/2 c
5/6
Mx
M2 + ξc Mx−1 ψ 2 p4 + √ ψ 3 p6 n √ r p M log + γ, + Mx3 γ
q(−τ +1) q+1
where Mx , ψ, M → ∞ and γ → 0 when n → ∞. 4.3. Asymptotic normality of test statistics. With the above preparation, we now derive the distributions for the test statistics T1∗ and T2∗ defined in Section 4.1. First of all, under H10 , we have (4.5)
nT1∗
=
b X c X j=1 k=1
ǫ∗ Y Y ∗ǫ a ˆ2jk = βˆ∗ βˆ = √ Σ−2 √ + oP (1), n n
where we recall (3.5). We can analyze T2∗ in the same way using nT2∗ =
b c X X
ˆ ∗ (Aβ) ˆ a ˆ2jk = (Aβ)
j=k0 +1 k=1
ǫ∗ Y Y ∗ǫ = √ Σ−1 A∗ AΣ−1 √ + oP (1), n n
where A ∈ Rbc×bc is a block diagonal matrix with the non-zero block being the lower (b − k0 )c × (b − k0 )c major part.
PRECISION MATRIX ESTIMATION
21
√1 Y ∗ ǫ ∈ Rp is a block vector with size c, where the j-th entry n P i-block is √1n nk=b+1 xk−i ǫk αj ( nk ). We can therefore rewrite it as
Note that of the
n 1 X 1 i √ Y ∗ǫ = √ hi ⊗ b( ), n n n i=b+1
where hi = xi ǫi . For i > b, zi can be regarded as a locally stationary time series, i.e. hi = U( ni , Fi ). Denote the long-run covariance matrix of {hi } as ¯ ∆(t) =
∞ X
j=−∞
Cov U(t, Fj ), U(t, F0 ) ,
and we further define (4.6)
∆=
Z
0
1
¯ ∆(t) ⊗ (b(t)b∗ (t))dt.
For k ∈ N, denote 1/k 1/k fk = Tr[(∆1/2 Σ−2 ∆1/2 )k ] , gk = Tr[(∆1/2 Σ−1 A∗ AΣ−1 ∆1/2 )k ] .
The limiting distributions of T1∗ and T2∗ are summarized in the following theorem. Theorem 4.2. 0, we have
Under Assumption 2.2, 2.3, 3.2, 3.4 and 3.5, when ln →
(1). Under H10 , we have
nT1∗ − f1 ⇒ N (0, 2). f2
Furthermore, there exist some positive constants ci , Ci , i = 1, 2, such that f1 f2 c1 ≤ ≤ C1 , c2 ≤ √ ≤ C2 . bc bc (2). Under H20 , we have
nT2∗ − g1 ⇒ N (0, 2). g2
Furthermore, there exist some positive constants wi , Wi , i = 1, 2, such that g1 g2 ≤ W2 . w1 ≤ ≤ W 1 , w2 ≤ p (b − k0 )c (b − k0 )c
22
X. DING AND Z. ZHOU
Finally, we discuss the local power of our tests. We will only focus on the white noise test and similar discussion can be applied to the bandedness test. Consider the alternative hypothesis R1 2 P n ∞ j=1 0 γ (t, j)dt √ → ∞. Ha : bc The following proposition states that under Ha , the power of our test will asymptotically be 1. Proposition 4.3. Under Assumption 2.2, 2.3, 3.2, 3.4 and 3.5, when the alternative hypothesis Ha holds true, for any given significant level α, we have ∗ nT1 − f1 √ P ≥ 2Z1−α → 1, n → ∞, f2 where Z1−α is the (1 − α)% quantile of the standard normal distribution.
that the Proposition √ white noise test has asymptotic power 1 R 1 states P 4.3 2 (t, j)dt ≫ bc/n. In an interesting special case when whenever ∞ γ j=1 √ 0 R1 2 bc/(nk), i = 1, 2, · · · , k, T1∗ achieves asymptotic power 1. 0 γ (t, ji )dt ≫ Note that if k here is large, then we conclude that alternatives consists of many very small deviations from the null can be picked up by the L2 test T1∗ . On the contrary, maximum deviation or L∞ norm based tests will not be sensitive to such alternatives. √ Remark 4.4. We need the assumption bc > C log n, C > 0 is some constant to hold true. It will be satisfied automatically under the polynomial decay assumption (2.3) of the physical dependence measure as b = O(n2/τ ). However, in the exponential decay case (2.11), we need c > C log n to make Proposition 4.3 valid. 4.4. Practical implementation. It can been seen from Theorem 4.2 that the key to implement the tests is to estimate the covariance matrix of a high dimensional vector. A disadvantage of using (4.5) is that the basis functions are mixed with the time series. In the present subsection, we provide a practical implementation by representing nT1∗ and nT2∗ into different forms in order to separate the data and the basis functions. We focus our discussion on nT1∗ . For i > b, j ≤ b, denote the vector Bj ( ni ) ∈ Rbc with b-blocks, where the j-th block is the basis b( ni ) and zeros otherwise. Therefore, for all j ≤ b, b
b. We will put its proof in Appendix A.
24
X. DING AND Z. ZHOU
Lemma 4.5.
Under Assumption 2.2 and 2.3, we have k sup Λ( , 0) − Λkk = O(n−1+4/τ ). n k>b
Next we consider the upper-off-diagonal blocks. For any b < k ≤ n − b + 1, we find that for j > b + k, for some constant C > 0, we have (4.9)
||Λkj || ≤ C(j − b)−τ +1 ,
where we use a similar discussion to Lemma 2.4 and Gershgorin’s circle theorem. As a consequence, we only need to estimate the blocks Λkj for k < j ≤ k + b. Similar to Lemma 4.5, we have k Λ( , j) − Λkj = O(n−1+4/τ ). n
Hence, we need to estimate Λ(t, j), 0 ≤ j ≤ b using the kernel estimators. For a smooth symmetric density function Kh defined on R supported on [−1, 1], where h ≡ hn is the bandwidth such that h → 0, nh → ∞. We write n−j X k/n − t 1 ˆ j) = hk h∗k+j , 0 ≤ j ≤ b. K Λ(t, nh h k=b+1
ˆ L as the estimator by setting its blocks Finally we define Σ (4.10)
ˆ L )kj = Λ( ˆ k + b , j), ˆ L )kk = Λ( ˆ b + k , 0), (Σ (Σ n n
and zeros otherwise, where k = 1, 2, · · · , n − b, k < j ≤ k + b. We can prove that our estimators are consistent under mild assumptions. Theorem 4.6. Under Assumption 2.2 and 2.3, let h → 0 and nh → ∞, for j = 0, 1, 2, · · · , b, we have ˆ j) = OP b √1 + h2 . (4.11) sup Λ(t, j) − Λ(t, t nh As a consequence, we have 1 2 2 ˆ √ b = O Σ − Σ (4.12) . + h L P L nh
PRECISION MATRIX ESTIMATION
25
In practice, the true ǫi is unknown and we have to use ǫˆi defined in (3.10). We then define n−j 1 X k/n − t ˆ ˆ∗ ˜ Λ(t, j) = hk hk+j , 0 ≤ j ≤ b. K nh h k=b+1
˜ L . The analog ˆk := xk ǫˆk . Similarly, we can define the estimation Σ where h of Theorem 4.6 is the following result.
Theorem 4.7. Under the assumptions of Theorem 4.6 and Assumption 3.2, 3.4 and 3.5, we have ˜ j) = OP b √1 + h2 + θn , sup Λ(t, j) − Λ(t, t nh where θn is defined as ! r r b log n θn = ζc + n−dα1 . nh n As a consequence, we have ˜ L = OP b2 √1 + h2 + θn . ΣL − Σ nh
By Theorem 4.2, 4.6 and 4.7, we now propose the following practical procedure to test H10 (the implementation for H20 is similar):
1. For j = 1, 2, · · · , b, i = b + 1, · · · , n, estimate Σ−1 using n(Y ∗ Y )−1 and calculate Qij by the definitions. ˆk }n 2. Estimate ΣL using (4.10) from the samples {h k=b+1 . 3. Generate B (say 2000) i.i.d copies of Gaussian random vectors zi , i = 1, 2, · · · , B. Here zi ∼ N (0, I). For each k = 1, 2, · · · , B, calculate the following Riemann summation Tk1 =
n b 1 X X ˆ ˆ L zk ). (ΣL zk )∗ Qij (Σ n2 j=1 i=b+1
1 T(1)
1 ≤ T(2) Reject H10
1 4. Let ≤ · · · ≤ T(B) be the order statistics of Tk1 , k = 1 , where ⌊x⌋ stands at the level α if T1∗∗ > T(⌊B(1−α)⌋) 1, 2, · · · , B. ∗ for the largest integer smaller or equal to x. Let B = max{k : Tk1 ≤ T1∗∗ }, the p-value can be denoted as 1 − B ∗ /B.
26
X. DING AND Z. ZHOU
5. Simulation Studies. In this section, we design Monte Carlo experiments to study the finite sample accuracy and sensitivity of our estimation and testing procedure. 5.1. Choices of tuning parameters. In this subsection, we briefly discuss the practical choices of the key parameters, i.e. the lag b of the autoregression in Cholesky decomposition, the number of basis functions in sieve estimation, choice of k0 in the bandedness test and the bandwith selection in the nonparametric estimation of covariance matrix. Similar to the discussion in Section 4.1, by Proposition 2.5, Lemma 2.6 and Theorem 3.6, for any given sufficiently large b0 ≡ b0 (n), the following statistic should be small enough Tb =
b0 X
φˆ2j (t)dt, b < b1 < b0 .
k=b1
By Theorem 4.2, Tb is normally distributed. Hence, we can follow the procedure described in the end of Section 4.4. For each fixed b1 < b0 , we can formulate the null hypothesis as Hb0 : b1 > b. Given the level α, denote b∗ = max{b1 < b0 : Hb0 is rejected}. b1
Then we can choose b = b∗ . Note that b∗ + 1 is the first off diagonal where all its entries are effectively zeros in terms of statistical significance. The number of basis functions can be chosen using model selection methods for nonparametric sieve estimation. However, due to non-stationarity, the classic Akaike information criterion (AIC) may fail under heteroskedasticity. In the present paper, we use the cross-validation method described in [21, Section 8] where the cross-validation criterion is defined as n
CV(c) =
ǫˆ2ic 1X , n (1 − υic )2 i=2
where {ˆ ǫic } are the estimation residuals using sieve method with order of c and υic is the leverage defined as υic = yi∗ (Y ∗ Y )yi ,
PRECISION MATRIX ESTIMATION
27
where we recall (3.4). Hence, we can choose cˆ = argmin CV(c), 1≤c≤c0
where c0 is a pre-chosen large value. Finally the bandwidth can be chosen using the standard leave-one-out cross-validation criterion for nonparametric estimation. Denote Z n 1 X 2 ˜ ˜ ˆ ˜ Λ(t, j) ◦ Λ(t, j)dt − J(h) := sup Λ−k (tk , j) , n j 0 k=b+1
where ti = ni , ◦ is the Hadamard (entrywise) product for matrices and ˆkh ˆ ∗ . Therefore, the selected ˜ −k is the estimation excluding the sample h Λ k+j bandwidth is ˆ = argmin Jˆ(h). h h
5.2. Accuracy of precision matrix estimation. In this subsection, we show by simulations the finite sample performance of our estimation. We investigate the following four non-stationary processes: • Non-stationary MA(1) process xi = 0.6 cos(
2πi )ǫi−1 + ǫi , n
where ǫi are i.i.d N (0, 1) random variables. • Non-stationary MA(2) process xi = 0.6 cos(
2πi 2πi )ǫi−1 + 0.3 sin( )ǫi−2 + ǫi . n n
• Time-varying AR(1) process xi = 0.6 cos(
2πi )xi−1 + ǫi . n
• Time-varying AR(2) process xi = 0.6 cos(
2iπ 2πi )xi−1 + 0.3 sin( )xi−2 + ǫi . n n
It is easy to compute the true precision matrices of the above models. In the following simulations, we report the average estimation errors in terms of operator norm and their standard deviations based on 1000 repetitions.
28
X. DING AND Z. ZHOU
We use the methods from Section 5.1 to choose the parameters and the Epanechnikov kernel [37, Section 4.2] for the nonparametric estimation. We compare the results for three different types of sieves, the Fourier basis functions (i.e. trigonometric polynomials), the Legendre polynomials and Daubechies orthogonal wavelet basis functions of order 16 [11]. We observe from Table 1 that our estimators for the precision matrices are reasonably accurate. Due to the consistency of our estimators, they are more accurate when n becomes larger. Furthermore, as we can see from the estimation of MA(1) and MA(2) processes, our estimators can still be quite accurate even when the underlying precision matrices are not sparse. n=200
n=500
n=800
MA(1)
Fourier Basis Polynomial Basis Wavelet Basis
1.17 (0.18) 1.48 (0.12) 1.5 (0.21)
1.09 (0.18) 1.46 (0.19) 1.31 (0.21)
0.96 (0.14) 1.37 (0.21) 1.1 (0.23)
MA(2)
Fourier Basis Polynomial Basis Wavelet Basis
1.35 (0.12) 1.47 (0.13) 1.55 (0.21)
1.28 (0.16) 1.43 (0.14) 1.37 (0.19)
1.18 (0.18) 1.32 (0.13) 1.19 (0.22)
AR(1)
Fourier Basis Polynomial Basis Wavelet Basis
0.53 (0.18) 0.61 (0.13) 0.68 (0.21)
0.46 (0.17) 0.56 (0.12) 0.62 (0.23)
0.4 (0.18) 0.54 (0.14) 0.57 (0.24)
AR(2)
Fourier Basis Polynomial Basis Wavelet Basis
0.78 (0.21) 0.82 (0.15) 0.9 (0.22)
0.71 (0.24) 0.76 (0.11) 0.83 (0.24)
0.64 (0.24) 0.75 (0.1) 0.78 (0.24)
Table 1 Operator norm error for estimation of precision matrices. The standard deviations are recorded in the bracket. We use the trigonometric polynomials for Fourier basis, the Legendre polynomials for Polynomial basis and Daubechies wavelet of order 16 for Wavelet basis.
5.3. Accuracy and power of tests. In this subsection, we design simulations to study the finite sample performance for the white noise and bandedness tests of precision matrices using the procedure described in the end of Section 4.4. At the nominal levels 0.01, 0.05 and 0.1, the simulated Type I error rates are listed below for the null hypothesis of H10 and H20 based on 1000 simulations, where for H20 we use the time varying AR(2) model (i.e. k0 = 2). From Table 2 and 3, we see that the performance of our proposed tests are reasonably accurate for all the above basis functions. Next we consider the statistical power of our tests under some given alternatives. For the test of white noise, we choose the four examples considered
29
PRECISION MATRIX ESTIMATION n=200
n=500
n=800
α = 0.01
Fourier Basis Polynomial Basis Wavelet Basis
0.008 0.009 0.008
0.01 0.0098 0.008
0.009 0.011 0.01
α = 0.05
Fourier Basis Polynomial Basis Wavelet Basis
0.057 0.059 0.053
0.046 0.048 0.048
0.045 0.052 0.047
α = 0.1
Fourier Basis Polynomial Basis Wavelet Basis
0.11 0.087 0.091
0.097 0.093 0.087
0.1 0.12 0.088
Table 2 Simulated type I error rates under H10 . n=200
n=500
n=800
α = 0.01
Fourier Basis Polynomial Basis Wavelet Basis
0.009 0.008 0.011
0.013 0.011 0.014
0.009 0.009 0.008
α = 0.05
Fourier Basis Polynomial Basis Wavelet Basis
0.052 0.051 0.052
0.05 0.048 0.05
0.049 0.052 0.05
α = 0.1
Fourier Basis Polynomial Basis Wavelet Basis
0.096 0.089 0.091
0.097 0.098 0.101
0.11 0.092 0.095
Table 3 Simulated type I error rates under H20 for k0 = 2.
in Section 5.2 as our alternatives. For the testing of bandedness of the precision matrices, for the null hypothesis, we choose k0 = 2 and consider the following two types of alternatives • Time-varying AR(3) process xi = 0.6 cos(
2iπ 2iπ 2πi )xi−1 + 0.3 sin( )xi−2 + δ sin( )xi−3 + ǫi , n n n
where ǫi are i.i.d standard normal random variables and δ ∈ (0, 0.3). • Non-stationary MA(3) process xi = 0.6 cos(
2πi 2iπ i )ǫi−1 + 0.3 sin( )ǫi−2 + ǫi−3 + ǫi . n n n
In all of our simulations, we choose the Daubechies wavelet basis functions of order 16 as our sieve basis functions and the Epanechnikov kernel for
30
X. DING AND Z. ZHOU
0.85 0.80
AR(1) AR(2) MA(1) MA(2)
0.70
0.75
Statistica Power
0.90
0.95
1.00
the nonparametric estimation (4.10). For the choices of the parameters, we follow the discussion of Section 5.1. Figure 1 and 2 show that our testing procedures are quite robust and have strong statistical power for both tests.
50
100
150
200
250
300
Sample size
0.85 0.80
AR(3) MA(3)
0.70
0.75
Statistica Power
0.90
0.95
1.00
Fig 1. Statistical power of White noise testing under nominal level 0.05.
50
100
150
200
250
300
Sample size
Fig 2. Statistical power of bandedness testing under nominal level 0.05. For the AR(3) process we choose δ = 0.2.
Finally, we simulate the statistical power for various choices of δ in the
31
PRECISION MATRIX ESTIMATION
AR(3) process for the sample size n = 200, 300 respectively in Figure 3, we find that our method is quite robust. 1
0.9
0.8
Statistical power
0.7
0.6
0.5
n=200 0.4
n=300 0.3
0.2
0.1
0
0
0.05
0.1
0.15
0.2
0.25
0.3
Values of thresholding
Fig 3. Statistical power of bandedness testing under nominal level 0.05 for different choices of δ.
6. Proofs. In this section, we prove the main theorems of this paper. Proof of Theorem 3.6. We will follow the proof strategy P of [9, Lemma 2.3]. The key difference is that our projection matrix S = n1 ni=b+1 yi yi∗ will converge to some deterministic matrix other than identity. We therefore need to provide an analog of [9, Lemma 2.2], where they derive the convergence rate for β-mixing processes and use Berbee’s lemma. Here, in our paper, we will use the trick of m-dependent sequence to prove our results. The proof contains three main steps: (i). Find the convergent limit Σ for S; (ii). Find the optimal rate for the norm of Σ − S; (iii). Follow the proof of [9, Lemma 2.3] to conclude our proof. We start with the first step. For any 1 ≤ i ≤ bc, denote i i′ := i − ci′′ , i′′ := ⌊ ⌋ + 1. c With the above definitions, for 1 ≤ i, j ≤ bc, we can write Sij as Sij =
n k k 1 X xk−i′′ xk−j ′′ αi′ ( )αj ′ ( ). n n n k=b+1
32
X. DING AND Z. ZHOU
For 1 ≤ k1 , k2 ≤ b, denote Snk1 ,k2 =
n X
i=b+1
(xi−k1 xi−k2 − E(xi−k1 xi−k2 )) ,
and Uik1 ,k2 = xi−k1 xi−k2 − E(xi−k1 xi−k2 ), it is easy to check that under the assumption of (2.3), δu (j, q) ≤ Cj −τ by Cauchy-Schwarz inequality. As a consequence, by [44, Lemma 6], we have (6.1)
∞ X k1 ,k2 2 u (ζj+n − ζju )2 , Sn ≤ Cq q
where ζju =
Pj
k=0 δ
u (k, q).
j=−n
Therefore, we have k1 ,k2 Sn = O(n1/2 ). q
Under the assumption (2.4), as |k1 − k2 | ≤ b (for instance, we assume k1 ≤ k2 ), we then have k2 − k1 i − k 1 E(xi−k xi−k ) − γ( , k2 − k1 ) ≤ , 1 2 n n where we use (2.5) and Jensen’s inequality. This implies that
n n 1 X 1 X i − k1 , k2 − k1 ) + O (k2 − k1 )n−2 . E(xi−k1 xi−k2 ) = γ( n n n i=b+1
i=b+1
Under Assumption 2.3, using [32, Theorem 1.1], we have Z 1 n 1 X γ(t, k2 − k1 )dt + O((k2 − k1 )n−2 ). E(xi−k1 xi−k2 ) = n 0 i=b+1
Therefore, combine with (6.1), for some α3 ∈ (0, 1), with 1 − O(n2α3 −1 ) probability, we have Z 1 n 1 X γ(t, k2 − k1 )dt + O(n−α3 ). xi−k1 xi−k2 = n 0 i=b+1
Similarly, we can show that for 1 ≤ c1 , c2 ≤ c, (6.2)
Z 1 n−1 i i 1 X γ˜ (t, k2 − k1 , c1 , c2 ) + O(n−α3 ), xi−k1 xi−k2 αc1 ( )αc2 ( ) = n n n 0 i=b+1
PRECISION MATRIX ESTIMATION
33
where γ˜(t, k2 − k1 , c1 , c2 ) is defined as γ˜ (
i − k1 i − k1 i − k1 i − k1 , k2 − k1 , c1 , c2 ) = γ( , k2 − k1 )αc1 ( )αc2 ( ). n n n n
Denote the matrix Σb (t) ∈ Rb×b with the entries Σbkl (t) = γ(t, |k − l|), 1 ≤ k, l ≤ b and further denote Z 1 Σb (t) ⊗ (b(t)b∗ (t)) dt. (6.3) Σ= 0
Using Gershgorin’s circle theorem, with 1 − O(n4/τ +2α1 +2α3 −1 ) probability, we have λmax (S − Σ) ≤ Cn2/τ +α1 −α3 .
Next, we derive the optimal rate using a concentration inequality for random matrices and the m-dependent sequence trick. For completeness, we state Bernstein's inequality for random matrices; its proof can be found in [34, Theorem 6.1.1].

Lemma 6.1. Let $Y_i$, $i = 1, 2, \cdots, n$, be a sequence of centered independent random matrices with dimensions $d_1, d_2$. Assume that for each $i$ we have $\max_i\|Y_i\| \le R_n$, and define
$$\sigma_n^2 = \max\left\{ \left\| \sum_{i=1}^{n}\mathbb{E}Y_iY_i^* \right\|,\ \left\| \sum_{i=1}^{n}\mathbb{E}Y_i^*Y_i \right\| \right\},$$
where the norm stands for the largest singular value. Then for all $t \ge 0$ we have
$$\mathbb{P}\left( \left\| \sum_{i=1}^{n}Y_i \right\| \ge t \right) \le (d_1 + d_2)\exp\!\left( \frac{-t^2/2}{\sigma_n^2 + R_nt/3} \right).$$
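The inequality of Lemma 6.1 can be sanity-checked numerically. The sketch below is an illustration only: it uses Rademacher multiples of a fixed symmetric matrix so that $\sigma_n^2$ and $R_n$ are available in closed form.

```python
import numpy as np

def bernstein_tail(t, sigma2, R, d1, d2):
    """Right-hand side of the matrix Bernstein inequality of Lemma 6.1."""
    return (d1 + d2) * np.exp(-0.5 * t ** 2 / (sigma2 + R * t / 3.0))

rng = np.random.default_rng(1)
n, d = 400, 5

# Y_i = eps_i * B with Rademacher eps_i: centered, independent, ||Y_i|| = ||B||.
B = rng.normal(size=(d, d))
B = (B + B.T) / (2.0 * np.sqrt(n))
R = np.linalg.norm(B, 2)                   # uniform bound R_n on ||Y_i||
sigma2 = n * np.linalg.norm(B @ B, 2)      # ||sum_i E Y_i Y_i^*|| = n ||B^2||

t = 3.0 * np.sqrt(sigma2)
hits = 0
n_rep = 2000
for _ in range(n_rep):
    eps = rng.choice([-1.0, 1.0], size=n)
    S = eps.sum() * B                      # sum_i Y_i = (sum_i eps_i) * B
    hits += np.linalg.norm(S, 2) >= t

print("empirical tail:", hits / n_rep, "Bernstein bound:", bernstein_tail(t, sigma2, R, d, d))
```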
To deal with the issue of independence, we use an m-dependent sequence approximation. For convenience, for $x_i$, $i = b+1, \cdots, n$, we denote
$$\mathbf{x}_i = \mathbf{G}\!\left(\frac{i}{n}, \mathcal{F}_i\right) \in \mathbb{R}^b.$$
We then denote the m-dependent approximation sequence by
$$\mathbf{x}_i^M = \mathbb{E}(\mathbf{x}_i \mid \eta_{i-M}, \cdots, \eta_i), \quad i = b+1, \cdots, n, \tag{6.4}$$
where $\mathbf{x}_i^M$ and $\mathbf{x}_j^M$ are independent when $|i-j| > M$. Under Assumption 2.2, by the discussion of [7, Remark 2.3], we conclude that for any $i$,
$$\mathbb{P}\left( \sup_j\left| x_{ij} - x_{ij}^M \right| \ge t \right) \le C\frac{(\log b)^{q/2}M^{-q\tau}}{t^q}. \tag{6.5}$$
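For a concrete linear special case, if $x_i = \sum_{k\ge 0}a_k\eta_{i-k}$ with i.i.d. mean-zero innovations, the conditional expectation in (6.4) amounts to truncating the moving-average sum at lag $M$. The following sketch illustrates this under that linear assumption (it is not the paper's general nonlinear construction):

```python
import numpy as np

def linear_process_and_truncation(eta, coef, M):
    """For the linear process x_i = sum_{k>=0} coef[k] * eta_{i-k} with mean-zero
    innovations, x_i^M = E(x_i | eta_{i-M}, ..., eta_i) keeps only lags k <= M."""
    n, K = len(eta), len(coef)
    x = np.zeros(n)
    x_M = np.zeros(n)
    for i in range(n):
        for k in range(min(K, i + 1)):
            x[i] += coef[k] * eta[i - k]
            if k <= M:
                x_M[i] += coef[k] * eta[i - k]
    return x, x_M

rng = np.random.default_rng(2)
eta = rng.normal(size=2000)
coef = np.arange(1, 41, dtype=float) ** (-3.0)   # polynomially decaying weights
x, x_M = linear_process_and_truncation(eta, coef, M=10)
# x_i^M and x_j^M are independent whenever |i - j| > M by construction.
print("max truncation error:", np.max(np.abs(x - x_M)))
```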
Recall that $y_i = \mathbf{x}_i\otimes\mathbf{b}(\frac{i}{n})$; we now denote $y_i^M = \mathbf{x}_i^M\otimes\mathbf{b}(\frac{i}{n})$. By choosing $t = M^{-\tau+2}$, we conclude that with probability $1 - \frac{n(\log b)^{q/2}}{M^{2q}}$ we have
$$\left\| \frac{1}{n}\sum_{i=b+1}^{n}y_iy_i^* - \frac{1}{n}\sum_{i=b+1}^{n}y_i^M(y_i^M)^* \right\| \le C\zeta_cM^{-\tau/2+1}.$$
By Assumption 3.5 and the fact that $\tau > 10$, we can choose $M$ such that $\zeta_cM^{-\tau/2+1} = O(n^{-1})$. Therefore, it suffices to control the m-dependent approximation. Observe that for some constant $C > 0$,
$$R_n = \sup_i\left\| \frac{1}{n}\left( y_i^M(y_i^M)^* - \Sigma \right) \right\| = \sup_i\left\| \frac{1}{n}\left( \mathbf{x}_i^M(\mathbf{x}_i^M)^*\otimes\mathbf{b}\!\left(\frac{i}{n}\right)\mathbf{b}^*\!\left(\frac{i}{n}\right) - \Sigma \right) \right\| \le C\frac{\zeta_c^2}{n},$$
where we use Assumption 2.2 and [44, Lemma 6]. Define $k_0 = \lfloor\frac{n-b}{M}\rfloor$ and the index set sequences by
$$I_i = \begin{cases} \{b+i+kM : k = 0, 1, \cdots, k_0\}, & \text{if } b+i+k_0M \le n,\\ \{b+i+kM : k = 0, 1, \cdots, k_0-1\}, & \text{otherwise}, \end{cases}\qquad i = 1, 2, \cdots, M.$$
By the triangle inequality, for some constant $C > 0$ we have
$$\mathbb{P}\left( \left\| \frac{1}{n}\sum_{i=b+1}^{n}y_i^M(y_i^M)^* - \Sigma \right\| \ge t \right) \le CM\sup_i\mathbb{P}\left( \left\| \frac{1}{n}\sum_{k\in I_i}y_k^M(y_k^M)^* - \Sigma \right\| \ge t/M \right) \le CMbc\exp\!\left( \frac{-t^2/(2M^2)}{\sigma_M^2 + R_Mt/(3M)} \right),$$
where we apply Lemma 6.1. To conclude our proof, we need to choose $M$ properly. By definition, it is elementary to see that
$$\sigma_M^2 \le C\frac{\zeta_c^2M}{n^2}.$$
Now we choose $M = O(n^{1/3})$; then $\sigma_M^2 = O(\frac{\zeta_c^2}{n}n^{-2/3})$. Hence, by choosing $t = O(\frac{\zeta_c}{\sqrt{n}}\sqrt{\log n})$, we conclude that
$$\|S - \Sigma\| = O_P\!\left( \frac{\zeta_c}{\sqrt{n}}\sqrt{\log n} \right). \tag{6.6}$$
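The blocking sets $I_i$ simply interleave the indices $b+1,\dots,n$ into $M$ subsequences whose elements are $M$ apart, so that M-dependent summands within the same subsequence are independent and Lemma 6.1 applies to each subsequence separately. A short bookkeeping sketch (illustrative only):

```python
import numpy as np

def blocking_index_sets(n, b, M):
    """Index sets I_i = {b+i+kM : k >= 0} intersected with {b+1, ..., n}, i = 1, ..., M.
    Within each I_i consecutive indices are M apart, so M-dependent summands sharing
    one I_i are mutually independent."""
    return [np.arange(b + i, n + 1, M) for i in range(1, M + 1)]

sets = blocking_index_sets(n=100, b=10, M=7)
all_idx = np.sort(np.concatenate(sets))
assert np.array_equal(all_idx, np.arange(11, 101))   # the I_i cover {b+1, ..., n}
assert all(np.all(np.diff(s) == 7) for s in sets)    # gap of size M inside each set
```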
Once we have (6.6), we can follow the proof of [9, Lemma 2.3] almost verbatim to finish proving our theorem. Denote $\check\phi_j(t) = \phi_j(t) - \hat\phi_j(t)$.
For $t, t^* \in [0,1]$, by the mean value theorem and Assumption 3.4, for some constant $C$ we have
$$\left| \check\phi_j(t) - \check\phi_j(t^*) \right| = \left| \left( B_j(t) - B_j(t^*) \right)^*\left( Y^*Y/n \right)^{-1}\frac{Y^*\epsilon}{n} \right| \le Cn^{\omega_1}c^{\omega_2}|t - t^*|\left\| \frac{Y^*\epsilon}{n} \right\|,$$
where $B_j(t)$ is defined in (4.7). For some constant $M > 0$, denote by $\mathcal{B}_n$ the event on which $C\|\frac{Y^*\epsilon}{n}\| \le M$; it can easily be checked that $\mathbb{P}(\mathcal{B}_n^c) = o(1)$. By Assumptions 2.2 and 3.2, [44, Lemma 6], (6.6) and a discussion similar to [9, equations (42) and (43)], we conclude that on $\mathcal{B}_n$, for some constant $C_1 > 0$, there exist positive $\eta_1, \eta_2$ such that
$$Cn^{\omega_1}c^{\omega_2}|t - t^*|\left\| \frac{Y^*\epsilon}{n} \right\| \le C_1\zeta_c\sqrt{\frac{\log n}{n}},$$
whenever $|t - t^*| \le \eta_1n^{-\eta_2}$. Denote by $S_n$ the smallest subset of $[0,1]$ such that for each $t \in [0,1]$ there exists a $t_n \in S_n$ with $|t_n - t| \le \eta_1n^{-\eta_2}$, and for any $t \in [0,1]$ let $t_n(t)$ denote the element of $S_n$ closest to $t$. Then, using a discussion similar to equations (44)–(47) of [9], we conclude that
$$\mathbb{P}\left\{ \sup_t|\check\phi_j(t)| \ge 4C\zeta_c\sqrt{(\log n)/n} \right\} \le \mathbb{P}\left( \left\{ \max_{t_n\in S_n}|\check\phi_j(t_n)| \ge 2C\zeta_c\sqrt{(\log n)/n} \right\}\cap\mathcal{B}_n \right) + o(1).$$
It remains to control the above probability using (6.6), Assumption 2.2 and [44, Lemma 6]. We first observe that
$$\mathbb{P}\left( \left\{ \max_{t_n\in S_n}|\check\phi_j(t_n)| \ge 2C\zeta_c\sqrt{(\log n)/n} \right\}\cap\mathcal{B}_n \right) \le \mathbb{P}\left( \max_{t_n\in S_n}\left| B_j^*(t_n)(S^{-1} - \Sigma^{-1})Y^*\epsilon/n \right| \ge C\zeta_c\sqrt{(\log n)/n} \right) \tag{6.7}$$
$$+\ \mathbb{P}\left( \max_{t_n\in S_n}\left| B_j^*(t_n)\Sigma^{-1}Y^*\epsilon/n \right| \ge C\zeta_c\sqrt{(\log n)/n} \right). \tag{6.8}$$
By (6.6), (6.7) can be controlled easily using the fact that $\|\frac{Y^*\epsilon}{n}\| = O_P(\sqrt{bc/n})$. To control (6.8), we adopt the truncation argument from [9]. Denote by $A_n$ the event on which $\|S - \Sigma\| \le \frac{1}{2}$; (6.6) implies that $\mathbb{P}(A_n^c) = o(1)$. Let $\{M_n : n \ge 1\}$ be an increasing sequence diverging to $+\infty$ and define
$$\epsilon_{1,i,n} := \epsilon_i\mathbf{1}(|\epsilon_i| \le M_n) - \mathbb{E}\!\left[ \epsilon_i\mathbf{1}(|\epsilon_i| \le M_n)\mid\mathcal{F}_{i-1} \right],\quad \epsilon_{2,i,n} = \epsilon_i - \epsilon_{1,i,n},\quad g_{i,n}(t_n) = B_j(t_n)^*\Sigma^{-1}B_j\!\left(\frac{i}{n}\right)\mathbf{1}(A_n).$$
As a consequence, we have
$$\mathbb{P}\left( \max_{t_n\in S_n}\left| B_j(t_n)^*\Sigma^{-1}Y^*\epsilon/n \right| \ge C\zeta_c\sqrt{(\log n)/n} \right) \le (\#S_n)\max_{t_n\in S_n}\mathbb{P}\left( \left\{ \left| \frac{1}{n}\sum_{i=b+1}^{n}g_{i,n}\epsilon_{1,i,n} \right| > \frac{C}{2}\zeta_c\sqrt{(\log n)/n} \right\}\cap A_n \right) \tag{6.9}$$
$$+\ \mathbb{P}\left( \max_{t_n\in S_n}\left| \frac{1}{n}\sum_{i=b+1}^{n}g_{i,n}\epsilon_{2,i,n} \right| \ge \frac{C}{2}\zeta_c\sqrt{(\log n)/n} \right) + o(1).$$
By the discussion of equations (65b) and (70) of [9], we can choose $M_n = O\!\left( \zeta_c^{-1}\sqrt{n/(\log n)} \right)$; then the above bound is $o(1)$. We conclude the proof using Assumption 3.5. □

Proof of Theorem 3.10. For each fixed $i \le b$, denote
$$\mathbf{d}_i = (d_{11}, \cdots, d_{1c}, d_{21}, \cdots, d_{2c}, \cdots, d_{i-1,1}, \cdots, d_{i-1,c}) \in \mathbb{R}^{(i-1)c}. \tag{6.10}$$
Hence, the OLS estimator of $\mathbf{d}_i$ can be written as
$$\hat{\mathbf{d}}_i = (Y_i^*Y_i)^{-1}Y_i^*\mathbf{x}_i, \quad i \le b, \tag{6.11}$$
where $Y_i^*$ is an $(i-1)c\times(n+1-i)$ rectangular matrix whose columns are $y_{ik} := (x_{k-1}, \cdots, x_{k-i+1})^*\otimes\mathbf{b}(\frac{k}{n}) \in \mathbb{R}^{(i-1)c}$, $k = i, \cdots, n$, and $\mathbf{x}_i = (x_i, \cdots, x_n)$. Therefore, similarly to (3.6), we have
$$f_j^i\!\left(\frac{i}{n}\right) - \hat f_j^i\!\left(\frac{i}{n}\right) = \left( \mathbf{d}_{i,j} - \hat{\mathbf{d}}_{i,j} \right)^*\mathbf{b}\!\left(\frac{i}{n}\right),$$
where $\mathbf{d}_{i,j} \in \mathbb{R}^c$ is the $j$-th block of $\mathbf{d}_i$, and similarly for $\hat{\mathbf{d}}_{i,j}$. The rest of the proof relies on Assumption 3.2 and the equation
$$\mathbf{d}_i - \hat{\mathbf{d}}_i = (Y_i^*Y_i)^{-1}Y_i^*\boldsymbol{\epsilon}_i,$$
where $\boldsymbol{\epsilon}_i = (\epsilon_i, \cdots, \epsilon_n)$. For the rest of the proof, we can proceed almost verbatim as in the proof of Theorem 3.6. □
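Computationally, (6.11) is an ordinary least-squares fit with each lagged regressor expanded in the sieve basis. The sketch below uses a Legendre polynomial basis purely for illustration (the paper does not fix this particular basis here) and synthetic data; the function and variable names are our own.

```python
import numpy as np
from numpy.polynomial import legendre

def sieve_basis(t, c):
    """First c Legendre polynomials rescaled to [0, 1] (an illustrative sieve choice)."""
    return np.array([legendre.Legendre.basis(k)(2.0 * t - 1.0) for k in range(c)])

def fit_d_i(x, i, c):
    """OLS estimator of d_i in (6.11) for 2 <= i <= b: regress x_k on the
    sieve-expanded lags (x_{k-1}, ..., x_{k-i+1}) kron b(k/n), k = i, ..., n."""
    n = len(x)
    rows, resp = [], []
    for k in range(i, n):                      # 0-based index k plays the role of time k+1
        lags = x[k - 1:k - i:-1]               # x_{k-1}, ..., x_{k-i+1}
        rows.append(np.kron(lags, sieve_basis((k + 1) / n, c)))
        resp.append(x[k])
    design = np.asarray(rows)                  # (n - i) x (i - 1)c design matrix
    d_hat, *_ = np.linalg.lstsq(design, np.asarray(resp), rcond=None)
    return d_hat

def f_hat(d_hat, j, t, c):
    """Estimated coefficient function f_j^i(t) = d_{i,j}' b(t), read from the j-th block."""
    return d_hat[(j - 1) * c:j * c] @ sieve_basis(t, c)

rng = np.random.default_rng(3)
x = rng.normal(size=500)                       # synthetic data, illustration only
d3 = fit_d_i(x, i=3, c=4)
print([f_hat(d3, j=1, t=u, c=4) for u in (0.25, 0.5, 0.75)])
```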
Proof of Theorem 3.12. Recall (3.14); the only difference from the proof of Theorem 3.6 is that $W$ is a deterministic matrix. We omit further details here. □
Proof of Theorem 3.13. The proof is similar to that of Theorem 3.6, except that we need to analyze the residual (3.16). However, it is easy to see that $\hat r_k^i$ is a locally stationary time series with polynomially decaying physical dependence measure. Hence, the argument carries over almost verbatim, up to constants. □

Proof of Theorem 3.14. Using the fact that, for any two compatible matrices $A, B$, the products $AB$ and $BA$ have the same non-zero eigenvalues, for some constant $C > 0$ we have
$$\left\| \hat\Omega - \Omega \right\| \le C\|\mathcal{E}\|,$$
where $\mathcal{E}$ admits the decomposition
$$\mathcal{E} = \mathcal{E}_1 + \mathcal{E}_2 + \mathcal{E}_3 = \hat\Phi^{-1}\Big[ \hat{\tilde D}^{-1} - \tilde D^{-1} \Big](\hat\Phi^{-1})^* + \Big[ \big( \hat\Phi^{-1} - \Phi^{-1} \big)\tilde D^{-1}(\Phi^{-1})^* + \Phi^{-1}\tilde D^{-1}\big( \hat\Phi^{-1} - \Phi^{-1} \big)^* \Big] + \big( \hat\Phi^{-1} - \Phi^{-1} \big)\tilde D^{-1}\big( \hat\Phi^{-1} - \Phi^{-1} \big)^*.$$
Writing $\mathcal{E}_2 = B + B^*$ with $B := \Phi^{-1}\tilde D^{-1}(\Phi^{-1})^*\big( \Phi - \hat\Phi \big)^*(\hat\Phi^{-1})^*$, we therefore have $\|\mathcal{E}_2\| \le 2\|B\|$. We further denote $R_\Phi := \Phi - \hat\Phi$ and first observe that $(R_\Phi)_{ij} = 0$ for $i \le j$. Then by Lemma 2.6 and Theorems 3.6 and 3.10, for $i \le b$ or $j \le b \le i$,
$$(R_\Phi)_{ij} = O_P\!\left( \zeta_c\sqrt{(\log n)/n} + n^{-d\alpha_1} \right),$$
and for $i > b$, $j > b$, by Lemma 2.5, $|(R_\Phi)_{ij}| \le j^{-\tau}$. This implies that
$$\lambda_{\max}\!\left( (\Phi - \hat\Phi)(\Phi - \hat\Phi)^* \right) = O_P\!\left( n^{4/\tau}\zeta_c^2n^{-1}\log n + n^{-2d\alpha_1 + 4/\tau} \right),$$
where we use Gershgorin's circle theorem. As a consequence, by submultiplicativity, for some constant $C > 0$ we have that
$$\|\mathcal{E}_2\| = O_P\!\left( \zeta_c\sqrt{\frac{\log n}{n}} + n^{-d\alpha_1} \right).$$
Similarly, we can show that
$$\|\mathcal{E}_3\| = O_P\!\left( n^{4/\tau}\zeta_c^2n^{-1}\log n + n^{-2d\alpha_1 + 4/\tau} \right).$$
By (3.12), Theorems 3.12 and 3.13 and Assumption 3.5, we conclude that
$$\|\mathcal{E}_1\| = O_P\!\left( \zeta_cn^{4/\tau}\sqrt{\frac{\log n}{n}} + n^{-d\alpha_1 + 4/\tau} \right).$$
Hence we have finished our proof. □
Proof of Theorem 4.1. By [41, Lemma 7.2], we find that
$$\sup_{x\in\mathbb{R}}\mathbb{P}\!\left( x \le R^u \le x + \psi^{-1} \right) = O(\psi^{-1/2}). \tag{6.12}$$
Denote $g_0(u) = \left( 1 - \min(1, \max(u,0))^4 \right)^4$. It is easy to check that (see the proof of [41, Proposition 2.1 and Theorem 2.2])
$$\mathbf{1}(y \le x) \le g_{\psi,x}(y) \le \mathbf{1}(y \le x + \psi^{-1}), \tag{6.13}$$
$$\sup_{y,x}\left| g'_{\psi,x}(y) \right| \le g_*\psi,\qquad \sup_{y,x}\left| g''_{\psi,x}(y) \right| \le g_*\psi^2,\qquad \sup_{y,x}\left| g'''_{\psi,x}(y) \right| \le g_*\psi^3, \tag{6.14}$$
where $g_{\psi,x}(y) := g_0(\psi(y - x))$ and $g_* = \max_y\left( |g_0'(y)| + |g_0''(y)| + |g_0'''(y)| \right) < \infty$.
By (6.12), (6.13) and a discussion similar to equations (7.5) and (7.6) of [41], $\rho$ can be well controlled if we let $\psi\to\infty$ and bound
$$\sup_x\left| \mathbb{E}g_{\psi,x}(R^u) - \mathbb{E}g_{\psi,x}(R^z) \right|. \tag{6.15}$$
It remains to control (6.15). The proof relies on two main steps: (i) an m-dependent sequence approximation for the locally stationary time series; (ii) a leave-one-block-out argument to control the bounded m-dependent time series. We start with step (i) and control the error between the m-dependent sequence approximation and the original time series. Recalling (6.4), we denote by
$$\mathbf{z}_i^M = (z_{i1}^M, \cdots, z_{ip}^M) = \mathbb{E}\!\left[ \mathbf{z}_i\mid\eta_{i-M}, \cdots, \eta_i \right] = \mathbf{x}_i^M\epsilon_i^M\otimes\mathbf{b}\!\left(\frac{i}{n}\right) \tag{6.16}$$
an m-dependent approximation of $\mathbf{z}_i$, where $\epsilon_i^M$ are defined using $\mathbf{x}_i^M$. Similarly, we can define $R^{z^M}$ by replacing $Z$ with $Z^M$. Therefore, by (6.14) and the definition of $g_{\psi,x}$, there exists a constant $C > 0$ such that for some small $\Delta_M > 0$ we have
$$\left| \mathbb{E}\!\left[ g_{\psi,x}(R^z) - g_{\psi,x}(R^{z^M}) \right] \right| \le Cp\psi\Delta_M + C\mathbb{E}[1 - I_M], \tag{6.17}$$
where $I_M := \mathbf{1}\{\max_{1\le j\le p}|Z_j - Z_j^M| \le \Delta_M\}$. Here we use [44, Lemma 6], the mean value theorem and the Cauchy–Schwarz inequality. We use the following lemma to control the right-hand side of (6.17) by suitably choosing $\Delta_M$. Recalling (4.3), we denote the physical dependence measure of $z_{kl}$ by $\delta^z_{kl}(s,q)$ and set
$$\theta_{s,l,q} := \sup_k\delta^z_{kl}(s,q),\qquad \Theta_{s,l,q} = \sum_{o=s}^{\infty}\theta_{o,l,q}.$$
By Assumption 2.2 and Lemma 2.7, the above physical dependence measures satisfy
$$\sup_{1\le l\le p}\Theta_{s,l,q} < \xi_c,\qquad \sum_{s=1}^{\infty}\sup_{1\le l\le p}s\,\theta_{s,l,3} < \xi_c. \tag{6.18}$$
Armed with the above preparation, we now control the right-hand side of (6.17) using the following lemma, whose proof is given in Appendix A.

Lemma 6.2. Under Assumption 2.2, for some constant $C_1 > 0$, we have
$$p\psi\Delta_M + \mathbb{E}[1 - I_M] \le C_1\left( p\psi \right)^{\frac{q}{q+1}}\left( \sum_{k=1}^{p}\Theta^q_{M,k,q} \right)^{\frac{1}{q+1}}. \tag{6.19}$$
By choosing a sufficiently large $M$, the right-hand side of (6.19) is of order $o(1)$. Next we use the leave-one-block-out argument to show that the difference between two m-dependent sequences can be well controlled. Its proof relies on Stein's method. We now introduce the dependency graph strictly following [43, Section 2.1]. For the sequence of $p$-dimensional random vectors $\{\mathbf{z}_k\}_{k=b+1}^n$, its dependency graph is $G_n = (V_n, E_n)$, where $V_n = \{b+1, \cdots, n\}$ is a set of vertices and $E_n$ is the corresponding set of undirected edges. For any two disjoint subsets of vertices $S, T \subset V_n$, if there is no edge from any vertex in $S$ to any vertex in $T$, the collections of the corresponding $\mathbf{z}_k$ are independent. We further denote by $D_{\max,n}$ the maximum degree of $G_n$ and set $D_n = 1 + D_{\max,n}$. Next we provide a rough bound for $R^z, R^u$ in terms of the maximum degree of $G_n$. Denote
$$F(\mathbf{x}) = g_{\psi,x}\circ f(\mathbf{x}),\quad \text{where } f(\mathbf{x}) = \mathbf{x}^*E\mathbf{x},\ \mathbf{x}\in\mathbb{R}^p. \tag{6.20}$$
We further define the bounded random variables $\tilde z_{kl} = (z_{kl}\wedge M_x)\vee(-M_x) - \mathbb{E}[(z_{kl}\wedge M_x)\vee(-M_x)]$ and $\tilde u_{kl} = (u_{kl}\wedge M_y)\vee(-M_y) - \mathbb{E}[(u_{kl}\wedge M_y)\vee(-M_y)]$ for some $M_x, M_y > 0$. For some small $\Delta > 0$, denote
$$I := I_\Delta = \mathbf{1}\Big\{ \max_{1\le j\le p}|Z_j - \tilde Z_j| < \Delta,\ \max_{1\le j\le p}|U_j - \tilde U_j| < \Delta \Big\}, \tag{6.21}$$
where $\tilde Z_j, \tilde U_j$ are defined using $\tilde z_{kl}, \tilde u_{kl}$. We next denote $N_k = \{l : \{k,l\}\in E_n\}$ and $\tilde N_k = \{k\}\cup N_k$. Let $\phi(M_x)$ be a constant depending on the threshold parameter $M_x$ such that
$$\max_{1\le\alpha,\beta\le p}\frac{1}{n}\left| \sum_{k=b+1}^{n}\sum_{s\in\tilde N_k}\left( \mathbb{E}z_{k\alpha}z_{s\beta} - \mathbb{E}\tilde z_{k\alpha}\tilde z_{s\beta} \right) \right| \le \phi(M_x).$$
The analogous quantity $\phi(M_y)$ can be defined for $\{\mathbf{u}_i\}$. Set $\phi(M_x, M_y) = \phi(M_x) + \phi(M_y)$ and define
$$m_{z,k} = \left( \bar{\mathbb{E}}\max_{1\le l\le p}|z_{sl}|^k \right)^{1/k},\quad m_{u,k} = \left( \bar{\mathbb{E}}\max_{1\le l\le p}|u_{sl}|^k \right)^{1/k},\quad \bar m_{z,k} = \left( \max_{1\le l\le p}\bar{\mathbb{E}}|z_{sl}|^k \right)^{1/k},\quad \bar m_{u,k} = \left( \max_{1\le l\le p}\bar{\mathbb{E}}|u_{sl}|^k \right)^{1/k},$$
where $\bar{\mathbb{E}}(z_{sl}) = \frac{1}{n}\sum_{s=b+1}^{n}\mathbb{E}z_{sl}$. The following lemma provides a rough bound for (6.15) in terms of $\Delta$, which will be improved later for m-dependent sequences. Its proof can be found in Appendix A.

Lemma 6.3. For any $\Delta > 0$ as in (6.21), denote $M_{xy} = \max\{M_x, M_y\}$. For some constant $C > 0$, we have
$$\sup_x\left| \mathbb{E}\!\left[ g_{\psi,x}(R^u) - g_{\psi,x}(R^z) \right] \right| \le Cp\psi\Delta + C\mathbb{E}(1 - I) + C\phi(M_x, M_y)\psi^2p^4 + C\frac{D_n^3}{\sqrt{n}}\left( m_{x,3}^3 + m_{y,3}^3 \right)\psi^3p^6.$$
For m-dependent sequences, we can easily obtain the following result, whose proof is also given in Appendix A.

Corollary 6.4. Assuming that $\{\mathbf{z}_i\}$ and $\{\mathbf{u}_i\}$ are m-dependent sequences, for some constant $C > 0$ we have
$$\sup_x\left| \mathbb{E}\!\left[ g_{\psi,x}(R^u) - g_{\psi,x}(R^z) \right] \right| \le Cp\psi\Delta + C\mathbb{E}(1 - I) + C\phi(M_x, M_y)\psi^2p^4 + C\frac{(2M+1)^2}{\sqrt{n}}\left( \bar m_{x,3}^3 + \bar m_{y,3}^3 \right)\psi^3p^6.$$
As we can see from the above corollary, we need to control the first two terms. Let $\varphi(M_x)$ be the smallest finite constant such that, uniformly for $i$ and $j$,
$$\mathbb{E}\!\left( A_{ij} - \check A_{ij} \right)^2 \le N\varphi^2(M_x),\qquad \mathbb{E}\!\left( B_{ij} - \check B_{ij} \right)^2 \le M\varphi^2(M_x),$$
where $\check A_{ij}, \check B_{ij}$ are the truncated blocked summations of $A_{ij}, B_{ij}$, defined as
$$\check A_{ij} = \sum_{l=iN+(i-1)M-N+1}^{iN+(i-1)M}(z_{lj}\wedge M_x)\vee(-M_x),\qquad \check B_{ij} = \sum_{l=i(N+M)-M+1}^{i(N+M)}(z_{lj}\wedge M_x)\vee(-M_x).$$
Similarly, we can define $\varphi(M_y)$ for the Gaussian sequence $\{\mathbf{u}_i\}$. Set $\varphi(M_x, M_y) = \varphi(M_x)\vee\varphi(M_y)$. Furthermore, we let $u_x(\gamma)$ and $u_y(\gamma)$ be the smallest quantities such that
$$\mathbb{P}\left( \max_{b+1\le i\le n}\max_{1\le j\le p}|z_{ij}| \le u_x(\gamma) \right) \ge 1 - \gamma,\qquad \mathbb{P}\left( \max_{b+1\le i\le n}\max_{1\le j\le p}|u_{ij}| \le u_y(\gamma) \right) \ge 1 - \gamma.$$
Then we have the following control of the first two terms of Corollary 6.4, whose proof can be found in Appendix A; its analog is [43, Proposition 4.1].

Lemma 6.5. Assume that $M_x > u_x(\gamma)$ and $M_y > u_y(\gamma)$ for some $\gamma\in(0,1)$. Then for some constant $C > 0$ we have
$$p\psi\Delta + \mathbb{E}[1 - I] \le C\left( p\psi\varphi(M_x, M_y)\sqrt{\log(p/\gamma)} + \gamma \right).$$
In a final step, we control $\rho$ defined in (4.4) by quantifying the parameters in the above bounds. Recalling (6.16), by (6.19), Corollary 6.4 and Lemma 6.5, for some constant $C > 0$ we have
$$\left| \mathbb{E}g_{\psi,x}(R^z) - \mathbb{E}g_{\psi,x}(R^u) \right| \le C(p\psi)^{\frac{q}{q+1}}\left( \sum_{k=1}^{p}\Theta^q_{M,k,q} \right)^{\frac{1}{q+1}} + C\phi(M_x, M_y)\psi^2p^4 + C\frac{(2M+1)^2}{\sqrt{n}}\left( \bar m_{x,3}^3 + \bar m_{y,3}^3 \right)\psi^3p^6 + C\left( p\psi\varphi(M_x, M_y)\sqrt{\log(p/\gamma)} + \gamma \right).$$
By (6.18) and the proof of [43, Theorem 2.1], we conclude that
$$\varphi(M_x) = C\left( \xi_c^{1/2}/M_x^{5/6} + \sqrt{N}/M_x^3 \right),\qquad \varphi(M_y) = C\xi_c/M_y^2,\qquad \phi(M_x, M_y) = C\xi_c\left( 1/M_x + 1/M_y \right).$$
By Assumption 3.5, we should choose $M_x, M_y \gg \xi_c$. Furthermore, following the arguments of the proof of [43, Theorem 2.1] and the orthogonality of the basis functions, we find that $\bar m_{x,3}^3 + \bar m_{y,3}^3 < \infty$; therefore we may also ignore all such constants. By (6.18) and Assumption 2.2, we have that
$$\left( \sum_{k=1}^{p}\Theta^q_{M,k,q} \right)^{\frac{1}{q+1}} \le C\xi_c\,p^{\frac{1}{q+1}}\left( M^{-\tau+1} \right)^{\frac{q}{q+1}}.$$
We impose the conditions $M_x = M_y$ and $N = M$. This concludes our proof. □

Proof of Theorem 4.2. We first observe that $\Delta$ is the limiting covariance of $\frac{Y^*\epsilon}{\sqrt{n}}$. Denote the eigenvalues of $\Delta^{1/2}\Sigma^{-2}\Delta^{1/2}$ by $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$. Using Theorem 4.1 with $E = \Sigma^{-2}$ and Lindeberg's central limit theorem, we conclude that
$$\frac{nT_1^* - \sum_{i=1}^{p}d_i}{\sqrt{\sum_{i=1}^{p}d_i^2}} \Rightarrow \mathcal{N}(0, 2),$$
provided that
$$\frac{d_1}{\sqrt{\sum_{i=1}^{p}d_i^2}} \to 0. \tag{6.22}$$
Therefore, the rest of the proof is devoted to analyzing the spectrum of $\Delta^{1/2}\Sigma^{-2}\Delta^{1/2}$. Using a discussion similar to the proof of Theorem 3.6 (i.e. equations (6.2) and (6.3)), we can show that $\lambda_{\min}(\Delta) = O(1)$, and we conclude that $\lambda_{\max}(\Delta) = O(1)$ using Lemma 2.7 and Gershgorin's circle theorem. Therefore, we conclude that $d_p = O(1)$ and $d_1 = O(1)$ using the fact that
$$\lambda_{\min}(A)\lambda_{\min}(B) \le \lambda_{\min}(AB) \le \lambda_{\max}(AB) \le \lambda_{\max}(A)\lambda_{\max}(B)$$
for any positive definite matrices $A, B$. Hence, (6.22) holds immediately. Similarly, denote the eigenvalues of $\Delta^{1/2}\Sigma^{-1}A^*A\Sigma^{-1}\Delta^{1/2}$ by $\rho_1 \ge \rho_2 \ge \cdots \ge \rho_{(b-k_0)c}$. It can be shown in the same way that $\rho_k = O(1)$, $k = 1, 2, \cdots, (b-k_0)c$. It is notable that in this case we use Theorem 4.1 by setting $E = \Sigma^{-1}A^*A\Sigma^{-1}$. □
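In practice, once plug-in surrogates for the eigenvalues $d_i$ are available, the normal limit of Theorem 4.2 is applied by standardizing the statistic; a minimal sketch of this last step is given below (the construction of the plug-in estimators themselves is not repeated here, and the numbers in the usage line are hypothetical).

```python
import numpy as np
from math import erf

def standardized_statistic(T1, n, d):
    """(n*T1 - sum d_i) / sqrt(sum d_i^2), approximately N(0, 2) by Theorem 4.2;
    d holds (estimates of) the eigenvalues of Delta^{1/2} Sigma^{-2} Delta^{1/2}."""
    d = np.asarray(d, dtype=float)
    return (n * T1 - d.sum()) / np.sqrt((d ** 2).sum())

def p_value(z):
    """One-sided p-value against the N(0, 2) limit (large values reject the null)."""
    return 1.0 - 0.5 * (1.0 + erf(z / 2.0))   # the limit has standard deviation sqrt(2)

# hypothetical usage:
# z = standardized_statistic(T1=0.031, n=300, d=np.ones(40)); print(p_value(z))
```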
Proof of Proposition 4.3. By the smoothness of $\gamma(t,j)$ and Lemma 2.4, we find that $\mathbf{H}_a$ is equivalent to
$$\mathbf{H}_a':\ \frac{n\sum_{j=1}^{b}\int_0^1\gamma^2(t,j)\,dt}{\sqrt{bc}} \to \infty.$$
Using (2.7), Assumption 2.2 and Lemmas 2.5 and 2.6, we have
$$\frac{n\sum_{j=1}^{b}\int_0^1\gamma^2(t,j)\,dt}{\sqrt{bc}} \to \infty \iff \frac{n\sum_{j=1}^{b}\int_0^1\phi_j^2(t)\,dt}{\sqrt{bc}} \to \infty.$$
As a consequence, we will consider the following alternative:
$$\mathbf{H}_a^*:\ \frac{n\sum_{j=1}^{b}\int_0^1\phi_j^2(t)\,dt}{\sqrt{bc}} \to \infty.$$
We first observe that
$$nT_1^* = n\sum_{j=1}^{b}\int_0^1\left( \hat\phi_j(t) - \phi_j(t) \right)^2dt + n\sum_{j=1}^{b}\int_0^1\phi_j^2(t)\,dt + 2n\sum_{j=1}^{b}\int_0^1\left( \hat\phi_j(t) - \phi_j(t) \right)\phi_j(t)\,dt.$$
By Theorem 4.2, we have that
$$\frac{nT_1^* - f_1 - n\sum_{j=1}^{b}\int_0^1\phi_j^2(t)\,dt - 2n\sum_{j=1}^{b}\int_0^1\left( \hat\phi_j(t) - \phi_j(t) \right)\phi_j(t)\,dt}{f_2} \Rightarrow \mathcal{N}(0, 2).$$
On the one hand, we have that
$$\sum_{j=1}^{b}\int_0^1\left( \hat\phi_j(t) - \phi_j(t) \right)\phi_j(t)\,dt = o\!\left( \frac{\sqrt{bc}}{n} \right) \text{ in probability.}$$
This is because, by Lemma 3.3, we can write
$$\sum_{j=1}^{b}\int_0^1\left( \hat\phi_j(t) - \phi_j(t) \right)\phi_j(t)\,dt = \sum_{j=1}^{b}\sum_{k=1}^{c}(\hat a_{jk} - a_{jk})a_{jk} + O(c^{-d}) = \beta^*(\hat\beta - \beta) + O(c^{-d}).$$
Recalling (3.5), from the proof of Theorem 3.6 (see for instance (6.9)), we find that
$$\beta^*(\hat\beta - \beta) = O_P\!\left( \|\beta\|\sqrt{\frac{\log n}{n}} \right).$$
Hence, under $\mathbf{H}_a$, we have
$$\sum_{j=1}^{b}\int_0^1\left( \hat\phi_j(t) - \phi_j(t) \right)\phi_j(t)\,dt = O_P\!\left( \frac{(bc)^{1/4}\sqrt{\log n}}{n} \right).$$
On the other hand, as $\mathbf{H}_a$ is equivalent to $\mathbf{H}_a^*$, we can conclude our proof using Assumption 3.5. □

Proof of Theorem 4.6. We only discuss the case $j = 0$; the other cases can be proved similarly. Denote $R_i := \hat\Lambda(t_i, 0) - \Lambda(t_i, 0)$, where $t_i = \frac{i+b}{n}$. We focus on the case $i = 1$ and analyze the entry $(R_1)_{11}$; the other entries can be analyzed similarly. By definition, we have that
$$(R_1)_{11} = \frac{1}{nh}\sum_{k=b+1}^{n}K\!\left( \frac{k/n - t_1}{h} \right)(x_{k-1}\epsilon_k)^2 - \mathbb{E}\!\left( U_1^2(t_1, \mathcal{F}_0) \right),$$
where $U_1$ is the first entry of $\mathbf{U}(t_1, \mathcal{F}_0)$. On the one hand, using [44, Lemma 6], we have that
$$\left\| \frac{1}{nh}\sum_{k=b+1}^{n}K\!\left( \frac{k/n - t_1}{h} \right)\left[ (x_{k-1}\epsilon_k)^2 - \mathbb{E}(x_{k-1}\epsilon_k)^2 \right] \right\| \le C(nh)^{-1/2},$$
where we recall that $K$ is supported on $[-1,1]$. On the other hand, using stochastic Lipschitz continuity and elementary properties of kernel estimation, we conclude that
$$\left| \frac{1}{nh}\sum_{k=b+1}^{n}K\!\left( \frac{k/n - t_1}{h} \right)\mathbb{E}(x_{k-1}\epsilon_k)^2 - \mathbb{E}\!\left( U_1^2(t_1, \mathcal{F}_0) \right) \right| \le C\left( (nh)^{-1/2} + h^2 + n^{-1} \right).$$
This implies that
$$\left| (R_1)_{11} \right| \le C\left( \frac{1}{\sqrt{nh}} + h^2 + n^{-1} \right).$$
The other entries can be treated in the same way. As a consequence, using Gershgorin's circle theorem, we conclude that
$$\|R_1\| \le Cb\left( \frac{1}{\sqrt{nh}} + h^2 + n^{-1} \right).$$
Hence, we can conclude the proof of (4.11). For the proof of (4.12), we use a corollary of Gershgorin's circle theorem for block matrices [33, Section 1.13], which we list here for the reader's reference.
Lemma 6.6. For a $b(n-b)\times b(n-b)$ block matrix $A$ with each diagonal block $A_{ii}$, $i = 1, 2, \cdots, n-b$, symmetric, denote by $G_i$ the region containing the eigenvalues of $A$ associated with the $i$-th block. We then have that
$$G_i = \sigma(A_{ii}) \cup \bigcup_{k=1}^{b}R\!\left( \lambda_k(A_{ii}),\ \sum_{j=1, j\ne i}^{n-b}\|A_{ij}\| \right),\quad i = 1, 2, \cdots, n-b,$$
where $R(\cdot,\cdot)$ denotes the disk $R(c, r) = \{\lambda : |\lambda - c| \le r\}$.

Therefore, our proof follows from (4.11) and Lemma 6.6. □
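Numerically, Lemma 6.6 produces inclusion discs centred at the eigenvalues of the diagonal blocks with radii given by the summed spectral norms of the off-diagonal blocks. The sketch below illustrates the block Gershgorin regions on a small symmetric matrix (our own illustration, not code from [33]):

```python
import numpy as np

def block_gershgorin_regions(blocks):
    """blocks[i][j]: square sub-blocks of a symmetric block matrix A.
    For every block row i, return the disc centres (eigenvalues of A_ii) and
    the common radius sum_{j != i} ||A_ij||_2, as in Lemma 6.6."""
    m = len(blocks)
    regions = []
    for i in range(m):
        centres = np.linalg.eigvalsh(blocks[i][i])
        radius = sum(np.linalg.norm(blocks[i][j], 2) for j in range(m) if j != i)
        regions.append((centres, radius))
    return regions

# quick check that every eigenvalue of A lies in one of the discs
rng = np.random.default_rng(4)
A = rng.normal(size=(9, 9)); A = (A + A.T) / 2
blocks = [[A[3 * i:3 * i + 3, 3 * j:3 * j + 3] for j in range(3)] for i in range(3)]
regions = block_gershgorin_regions(blocks)
for lam in np.linalg.eigvalsh(A):
    assert any(np.min(np.abs(lam - c)) <= r + 1e-12 for c, r in regions)
```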
By Lemma 2.6 and Theorem 3.6, we conclude that ! r r b log n ˜ ˆ 1 , j) = OP ζc . + n−dα1 Λ(t1 , j) − Λ(t nh n 11
This finishes our proof.
APPENDIX A: ADDITIONAL PROOFS This appendix is devoted to providing the additional technical proofs of the lemmas and theorems of this paper. Proof of Lemma 2.4. Denote Pj (·) = E(·|Fj )−E(·|Fj−1 ), we can write (A.1)
xi =
i X
k=−∞
Pk (xi ).
Denote $t_i = \frac{i}{n}$. With the help of (A.1), we have
$$|\gamma(t_i, j)| \le \sum_{k=-\infty}^{\infty}\delta(i-k, 2)\,\delta(i+j-k, 2), \tag{A.2}$$
where we use [38, Theorem 1] (see also the proof of [46, Proposition 4]). Therefore, there exists a constant $C > 0$ such that $|\gamma(t_i, j)| \le Cj^{-\tau}$ by assumption (2.3). This concludes our proof. □
Proof of Proposition 2.5. We start with the proof of (2.8). For $i > b_2$, denote the $(i-1)\times(i-1)$ symmetric banded matrix $\Gamma_i^T$ by
$$(\Gamma_i^T)_{kl} = \begin{cases} (\Gamma_i)_{kl}, & |k-l| \le b_2;\\ 0, & \text{otherwise}, \end{cases}$$
where $\Gamma_i = \Omega_i^{-1}$ is the covariance matrix of $\mathbf{x}^i_{i-1}$. Using Assumption 2.2 and a simple extension of [3, Proposition 5.1.1], we find that $\lambda_{\min}(\Gamma_i) > C$ for some constant $C > 0$. Therefore, by Weyl's inequality and Lemma 2.4, we have $\lambda_{\min}(\Gamma_i^T) \ge C - n^{-4+\frac{4}{\tau}}$. A direct application of the Cauchy–Schwarz inequality yields that
$$\left\| \Omega_i\gamma_i - \Omega_i^T\gamma_i \right\| \le n^{-4+5/\tau}. \tag{A.3}$$
When $n$ is large enough, $\Gamma_i^T$ is a $b_2$-banded, positive definite, bounded matrix. Then by [12, Proposition 2.2], we conclude that for some $\kappa\in(0,1)$ and some constant $C > 0$ we have
$$\left| (\Omega_i^T)_{kl} \right| \le C\kappa^{2|k-l|/b_2}. \tag{A.4}$$
Therefore, by (A.3), (A.4) and Lemma 2.4, we conclude our proof when $i \ge b_2$. Similarly we can prove the case $b < i \le b_2$. The second part follows from the Yule–Walker equation and the following lemma from numerical analysis.

Lemma A.1. Let $A\mathbf{x} = \mathbf{w}$, $\mathbf{x}, \mathbf{w}\in\mathbb{R}^n$, be the original linear system. For the perturbed system $(A + \Delta A)\tilde{\mathbf{x}} = \mathbf{w} + \Delta\mathbf{w}$, we have the following control:
$$\frac{\|\tilde{\mathbf{x}} - \mathbf{x}\|}{\|\mathbf{x}\|} \le \|A\|\,\|A^{-1}\|\,\frac{\|\Delta\mathbf{w}\|}{\|\mathbf{w}\|}.$$
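Lemma A.1 is the classical conditioning bound; the sketch below verifies it numerically on a random well-conditioned system, with the perturbation placed in w only (an illustration, not part of the proof).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
A = rng.normal(size=(n, n)) + n * np.eye(n)    # well-conditioned system
x = rng.normal(size=n)
w = A @ x

dw = 1e-6 * rng.normal(size=n)                 # perturb the right-hand side only
x_tilde = np.linalg.solve(A, w + dw)

lhs = np.linalg.norm(x_tilde - x) / np.linalg.norm(x)
rhs = np.linalg.cond(A, 2) * np.linalg.norm(dw) / np.linalg.norm(w)
assert lhs <= rhs + 1e-15                      # ||x~ - x||/||x|| <= ||A|| ||A^{-1}|| ||dw||/||w||
```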
Proof of Lemma 2.6. For any fixed $i$ and $j = 1, 2, \cdots, b$, we have
$$\tilde\phi_{ij} - \phi_j\!\left(\frac{i}{n}\right) = e_j^*\Omega_i^b\left( \gamma_i^b - \tilde\gamma_i^b \right) + e_j^*\Omega_i^b\left( \tilde\Gamma_i^b - \Gamma_i^b \right)\tilde\Omega_i^b\tilde\gamma_i^b, \tag{A.5}$$
where $\Gamma_i^b, \tilde\Gamma_i^b$ are the covariance matrices of the corresponding vectors. For the first part of the right-hand side of (A.5), using the Cauchy–Schwarz inequality, we have
$$\left| e_j^*\Omega_i^b\left( \gamma_i^b - \tilde\gamma_i^b \right) \right|^2 \le \lambda_{\max}\!\left( (\Omega_i^b)^*\Omega_i^b \right)\left\| \gamma_i^b - \tilde\gamma_i^b \right\|_2^2.$$
On the one hand, by Lemma 2.4, it is easy to check that $\lambda_{\max}((\Omega_i^b)^*\Omega_i^b) = 1/\lambda_{\min}((\Gamma_i^b)^*\Gamma_i^b) \le \kappa$, where $\kappa > 0$ is some constant; on the other hand, similarly to (A.2), we have
$$\left\| \gamma_i^b - \tilde\gamma_i^b \right\|_2^2 = \sum_{k=1}^{b}\left( \gamma_i(k) - \tilde\gamma_i(k) \right)^2 \le n^{-2+3/\tau},$$
where we also use assumption (2.4). For the second term, a similar discussion yields that, for some constant $C > 0$,
$$\left| e_j^*\Omega_i^b\left( \tilde\Gamma_i^b - \Gamma_i^b \right)\tilde\Omega_i^b\tilde\gamma_i^b \right|^2 \le C\lambda_{\max}\!\left( \left( \tilde\Gamma_i^b - \Gamma_i^b \right)^*\left( \tilde\Gamma_i^b - \Gamma_i^b \right) \right) \le Cn^{-2+4/\tau},$$
where we use Gershgorin's circle theorem. Hence, the proof follows from Lemma 2.5. □
where we use Lemma 2.4 and 2.5, and Gershgorin’s circle theorem. For the control of physical dependence measure, j i X i i−k i−k ǫ G( δ (j, q) = sup , Fi ) − G( , Fi,j ) − , Fi−k ) − G( , Fi−k,j−k )) φik (G( n n n n i k=1
≤ j −τ + Cj −τ +
q
j−1 X k=1
(k(j − k))−τ ≤ C 2 j −τ ,
where in the last inequality we use the fact that
$$\sum_{k=1}^{j-1}\left( k(j-k) \right)^{-\tau} \sim 2\int_1^{j/2}\left( x(j-x) \right)^{-\tau}dx \le 2(j/2)^{-\tau}\int_1^{j/2}x^{-\tau}\,dx.$$
This concludes our proof. □
Proof of Lemma 3.1. Recall that we use $\tilde\Gamma_i^b$ to denote the covariance matrix of $\tilde{\mathbf{x}}_i$. By definition, we have $\tilde\Gamma_i^b\boldsymbol{\phi}^b(t) = \tilde\gamma_i^b$. Using Cramer's rule, we have
$$\phi_j(t) = \frac{\det(\tilde\Gamma_i^b)_j}{\det\tilde\Gamma_i^b},\quad j = 1, 2, \cdots, b,$$
where $(\tilde\Gamma_i^b)_j$ is the matrix formed by replacing the $j$-th column of $\tilde\Gamma_i^b$ by the column vector $\tilde\gamma_i^b$. As $\tilde\Gamma_i^b$ is non-singular, it suffices to show that $\det(\tilde\Gamma_i^b)_j$ and $\det\tilde\Gamma_i^b$ are $C^d$ functions of $t$ on $[0,1]$. We employ the definition of the determinant,
$$\det\tilde\Gamma_i^b = \sum_{\sigma\in S_b}\operatorname{sign}(\sigma)\left( \prod_{k=1}^{b}(\tilde\Gamma_i^b)_{k,\sigma(k)} \right),$$
where $S_b$ is the permutation group; similarly for $(\tilde\Gamma_i^b)_j$. Due to Lemma 2.5, the possible maximal term in the expansion of $\det\tilde\Gamma_i^b$ is the non-zero term $\Delta = \prod_{k=1}^{b}(\tilde\Gamma_i^b)_{k,k}$. Therefore, it suffices to show that $\det(\tilde\Gamma_i^b)_j/\Delta$ and $\det\tilde\Gamma_i^b/\Delta$ are in $C^d$. Now we reorder the terms to write
$$\det\tilde\Gamma_i^b = \sum_{j=1}^{b!}\omega_j\gamma_j,$$
where $\gamma_j$ is some entry of $\tilde\Gamma_i^b$ and $\omega_j$ contains the terms $\operatorname{sign}(\sigma)$ and $\prod_{k=1}^{b-1}(\tilde\Gamma_i^b)_{k,\sigma(k)}$. It is easy to check that $\sum_{j=1}^{b!}|\omega_j| < \infty$; we can therefore conclude our proof using Lemma 2.4. □
where we use Assumption 2.2 and Lemma 2.5. Similarly, by (2.4), we can show that b 2 i (σi ) − g( ) ≤ Cn−1+4/τ . n
g(t) ∈ C d ([0, 1]) is due to Assumption 2.3 and the fact that φij is absolutely summable.
Proof of Lemma 4.5. Similarly to the proof of Lemma 2.6, we have
$$\left\| \Lambda\!\left(\frac{k}{n}\right) - \Lambda_{kk} \right\|_\infty \le n^{-1+2/\tau}.$$
Hence, the proof follows from Gershgorin's circle theorem. □
Proof of Lemma 6.2. By [24, Lemma A.1] (see the discussion below equation (14) of [43]), we find that
$$\left( \mathbb{E}\left| Z_k - Z_k^M \right|^q \right)^{2/q} \le C\Theta^2_{M,k,q},$$
where $C$ is some positive constant. As a consequence, we have
$$\mathbb{E}[1 - I_M] \le \sum_{j=1}^{p}\mathbb{P}\!\left( |Z_j - Z_j^M| \ge \Delta_M \right) \le \frac{1}{\Delta_M^q}\sum_{j=1}^{p}\mathbb{E}\left| Z_j - Z_j^M \right|^q \le \frac{C}{\Delta_M^q}\sum_{j=1}^{p}\Theta^q_{M,j,q}. \tag{A.6}$$
Optimizing the bound with respect to $\Delta_M$, we finish the proof. □

Proof of Lemma A.2. By a direct computation, we have
$$\partial_jf(\mathbf{z}) = 2\sum_{i=1}^{p}e_{ij}z_i,\qquad \partial^2_{jk}f(\mathbf{z}) = 2e_{jk},\qquad \partial^3_{jkl}f(\mathbf{z}) = 0. \tag{A.7}$$
Using (6.14) and the chain rule, there exists some constant $C > 0$ such that
$$\left| \partial^2_{jk}F(\mathbf{z}) \right| \le C\left( \psi^2|\partial_jf\,\partial_kf| + \psi|\partial^2_{jk}f| \right),$$
$$\left| \partial^3_{jkl}F(\mathbf{z}) \right| \le C\left[ \psi^3|\partial_jf\,\partial_kf\,\partial_lf| + \psi^2\left( |\partial^2_{jk}f\,\partial_lf| + |\partial^2_{jl}f\,\partial_kf| + |\partial^2_{kl}f\,\partial_jf| \right) \right].$$
We then conclude our proof using (A.7). □
Proof of Lemma 6.3. Similarly to (6.17), for some constant $C > 0$ we have
$$\sup_x\left| \mathbb{E}\!\left[ g_{\psi,x}(R^z) - g_{\psi,x}(\tilde R^z) \right] \right| \le \psi p\Delta + C\mathbb{E}(1 - I).$$
Hence, it suffices to control $\sup_x\left| \mathbb{E}\!\left[ g_{\psi,x}(\tilde R^z) - g_{\psi,x}(\tilde R^u) \right] \right|$. Denote
$$\Psi(t) = \mathbb{E}F(Z(t)),\qquad Z(t) = \sum_{i=b+1}^{n}Z_i(t),$$
where $F$ is defined in (6.20) and
$$Z_i(t) = \frac{\sqrt{t}\,\tilde{\mathbf{z}}_i + \sqrt{1-t}\,\tilde{\mathbf{u}}_i}{\sqrt{n}}.$$
Similarly to the discussion of equations (25) and (26) in [43], we have
$$\mathbb{E}\!\left[ g_{\psi,x}(\tilde R^z) - g_{\psi,x}(\tilde R^u) \right] = \int_0^1\Psi'(t)\,dt = \frac{1}{2}\sum_{i=b+1}^{n}\sum_{j=1}^{p}\int_0^1\mathbb{E}\!\left[ \partial_jF(Z(t))\dot Z_{ij}(t) \right]dt = \frac{1}{2}\left( I_1 + I_2 + I_3 \right),$$
where
$$\dot Z_{ij}(t) = \frac{t^{-1/2}\tilde z_{ij} - (1-t)^{-1/2}\tilde u_{ij}}{\sqrt{n}}$$
and $I_k$, $k = 1, 2, 3$, are defined as
$$I_1 = \sum_{i=b+1}^{n}\sum_{j=1}^{p}\int_0^1\mathbb{E}\!\left[ \partial_jF(Z^{(i)}(t))\dot Z_{ij}(t) \right]dt,$$
$$I_2 = \sum_{i=b+1}^{n}\sum_{k,j=1}^{p}\int_0^1\mathbb{E}\!\left[ \partial_k\partial_jF(Z^{(i)}(t))\dot Z_{ij}(t)V_k^{(i)}(t) \right]dt,$$
$$I_3 = \sum_{i=b+1}^{n}\sum_{k,l,j=1}^{p}\int_0^1\int_0^1(1-\tau)\mathbb{E}\!\left[ \partial_l\partial_k\partial_jF\big( Z^{(i)}(t) + \tau V^{(i)}(t) \big)\dot Z_{ij}(t)V_k^{(i)}(t)V_l^{(i)}(t) \right]dt\,d\tau,$$
where $V^{(i)}(t) = \sum_{j\in\tilde N_i}Z_j(t)$ and $Z^{(i)}(t) = Z(t) - V^{(i)}(t)$. We first use the following lemma to control the derivatives of $F$; this is one of the key differences between the max norm in [43] and the $L^2$ norm in the present paper. We will prove it later.

Lemma A.2. For $\mathbf{z} = (z_1, \cdots, z_p)\in\mathbb{R}^p$ and some constant $C > 0$, we have
$$\left| \partial^2_{jk}F(\mathbf{z}) \right| \le C\left( \psi^2\left| \sum_{s=1}^{p}e_{js}z_s \right|\left| \sum_{s=1}^{p}e_{ks}z_s \right| + \psi \right),$$
$$\left| \partial^3_{jkl}F(\mathbf{z}) \right| \le C\left[ \psi^3\left| \sum_{s=1}^{p}e_{js}z_s \right|\left| \sum_{s=1}^{p}e_{ks}z_s \right|\left| \sum_{s=1}^{p}e_{ls}z_s \right| + \psi^2\left( \left| \sum_{s=1}^{p}e_{js}z_s \right| + \left| \sum_{s=1}^{p}e_{ks}z_s \right| + \left| \sum_{s=1}^{p}e_{ls}z_s \right| \right) \right],$$
where $\{e_{kl}\}$ are the entries of the matrix $E$.
Next we follow the strategy of the proof of [43, Proposition 2.1] to control $I_k$, $k = 1, 2, 3$. Using the fact that $Z^{(i)}(t)$ and $\dot Z_{ij}(t)$ are independent and $\mathbb{E}(\dot Z_{ij}(t)) = 0$, we conclude that $I_1 = 0$. For the control of $I_2$, define the expanded neighborhood around $N_i$ by $\bar N_i := \{j : \{j,k\}\in E_n \text{ for some } k\in N_i\}$, and
$$\bar Z^{(i)}(t) = Z(t) - \sum_{l\in\bar N_i\cup\tilde N_i}Z_l(t) = Z^{(i)}(t) - \bar V^{(i)}(t),\quad \text{where } \bar V^{(i)}(t) = \sum_{l\in\bar N_i/\tilde N_i}Z_l(t),\ \ \bar N_i/\tilde N_i := \{k\in\bar N_i : k\notin\tilde N_i\}.$$
Using the decomposition in [43] (see the discussion below equation (26) of [43]), we can write $I_2 = I_{21} + I_{22}$, where $I_{21}, I_{22}$ are defined as
$$I_{21} = \sum_{i=b+1}^{n}\sum_{k,j=1}^{p}\int_0^1\mathbb{E}\!\left[ \partial_k\partial_jF(\bar Z^{(i)}(t)) \right]\mathbb{E}\!\left[ \dot Z_{ij}(t)V_k^{(i)}(t) \right]dt,$$
$$I_{22} = \sum_{i=b+1}^{n}\sum_{k,j,l=1}^{p}\int_0^1\int_0^1\mathbb{E}\!\left[ \partial_k\partial_j\partial_lF\big( \bar Z^{(i)}(t) + \tau\bar V^{(i)}(t) \big)\dot Z_{ij}(t)V_k^{(i)}(t)\bar V_l^{(i)}(t) \right]dt\,d\tau.$$
We start with the control of $I_{21}$. Using Lemma A.2, [44, Lemma 6] and Assumption 3.2, we conclude that
$$\sup_t\mathbb{E}\left| \partial_k\partial_jF(\bar Z^{(i)}(t)) \right| \le C\psi^2p^2.$$
By equation (28) of [43], we have
$$\max_{1\le j,k\le p}\left| \sum_{i=b+1}^{n}\mathbb{E}\dot Z_{ij}(t)V_k^{(i)}(t) \right| \le \phi(M_x, M_y).$$
As a consequence, we can control $I_{21}$ using
$$|I_{21}| \le \sum_{i=b+1}^{n}\sum_{k,j=1}^{p}\int_0^1\mathbb{E}\left| \partial^2_{jk}F(\bar Z^{(i)}(t)) \right|\left| \mathbb{E}\!\left[ \dot Z_{ij}(t)V_k^{(i)}(t) \right] \right|dt \le C\phi(M_x, M_y)\psi^2p^4.$$
For the control of $I_{22}$, by Lemma A.2, for some constant $C > 0$ we have that
$$\sup_t\mathbb{E}\left| \partial_k\partial_j\partial_lF\big( \bar Z^{(i)}(t) + \tau\bar V^{(i)}(t) \big) \right| \le C\psi^3p^3.$$
Next, we can bound
$$\mathbb{E}\max_{1\le k,j,l\le p}\sum_{i=b+1}^{n}\int_0^1\left| \dot Z_{ij}(t)V_k^{(i)}(t)\bar V_l^{(i)}(t) \right|dt \le \int_0^1w(t)\left( \mathbb{E}\max_{1\le j\le p}\sum_{i=b+1}^{n}\left| \dot Z_{ij}(t)/w(t) \right|^3\ \mathbb{E}\max_{1\le k\le p}\sum_{i=b+1}^{n}\left| V_k^{(i)}(t) \right|^3\ \mathbb{E}\max_{1\le l\le p}\sum_{i=b+1}^{n}\left| \bar V_l^{(i)}(t) \right|^3 \right)^{1/3}dt \le C\frac{D_n^3}{\sqrt{n}}\left( m_{x,3}^3 + m_{y,3}^3 \right),$$
where $w(t) := 1/(\sqrt{t}\wedge\sqrt{1-t})$ and we use the following bounds (see the equations below (28) of [43]):
$$\mathbb{E}\max_{1\le j\le p}\sum_{i=b+1}^{n}\left| \dot Z_{ij}(t)/w(t) \right|^3 \le \frac{C}{\sqrt{n}}\left( m_{x,3}^3 + m_{y,3}^3 \right), \tag{A.8}$$
$$\mathbb{E}\max_{1\le k\le p}\sum_{i=b+1}^{n}\left| V_k^{(i)}(t) \right|^3 \le \frac{CD_n^3}{\sqrt{n}}\left( m_{x,3}^3 + m_{y,3}^3 \right), \tag{A.9}$$
$$\mathbb{E}\max_{1\le l\le p}\sum_{i=1}^{n}\left| \bar V_l^{(i)}(t) \right|^3 \le \frac{CD_n^6}{\sqrt{n}}\left( m_{x,3}^3 + m_{y,3}^3 \right). \tag{A.10}$$
For the control of $I_3$, using a similar argument (see the discussion below equation (28) in [43]), we find that $|I_3| \le C\psi^3p^6I_{31}$, where $I_{31}$ satisfies
$$|I_{31}| \le \int_0^1w(t)\left( \mathbb{E}\max_{1\le j\le p}\sum_{i=b+1}^{n}\left| \dot Z_{ij}(t)/w(t) \right|^3\ \mathbb{E}\max_{1\le k\le p}\sum_{i=b+1}^{n}\left| V_k^{(i)}(t) \right|^3\ \mathbb{E}\max_{1\le l\le p}\sum_{i=b+1}^{n}\left| V_l^{(i)}(t) \right|^3 \right)^{1/3}dt.$$
Combining with (A.8) and (A.9), we conclude our proof. □

Proof of Corollary 6.4. We first notice that $D_n = 2M+1$, $|\tilde N_i| \le 2M+1$ and $|N_i\cup\tilde N_i| \le 4M+1$. Define $\Upsilon_i = \{j : \{j,k\}\in E_n \text{ for some } k\in N_i\}$; then we have $|\Upsilon_i\cup N_i\cup\tilde N_i| \le 6M+1$. Following the arguments of the proof of Lemma 6.3, it can be shown (see the equations above (29) of [43]) that $I_3$ can be bounded by
$$Cn\psi^3p^6\times\int_0^1w(t)\left( \bar{\mathbb{E}}\max_{1\le j\le p}\sum_{i=1}^{n}\left| \dot Z_{ij}(t)/w(t) \right|^3\ \bar{\mathbb{E}}\max_{1\le k\le p}\sum_{i=1}^{n}\left| V_k^{(i)}(t) \right|^3\ \bar{\mathbb{E}}\max_{1\le l\le p}\sum_{i=1}^{n}\left| V_l^{(i)}(t) \right|^3 \right)^{1/3}dt.$$
Using a discussion similar to (A.8) and (A.9), we conclude that
$$|I_3| \le C\psi^3p^6\frac{(2M+1)^2}{\sqrt{n}}\left( \bar m_{x,3}^3 + \bar m_{y,3}^3 \right).$$
Similarly, we can bound $I_{22}$ by slightly modifying (A.10). We omit further details and refer to the proof of [43, Proposition 2.1]. This concludes our proof. □

Proof of Lemma 6.5. First of all, by Assumption 3.2 and a discussion similar to (6.3), we find that there exist constants $0 < c_1 < c_2$ such that $c_1 < \min_{1\le j\le p}\sigma_{j,j} \le \max_{1\le j\le p}\sigma_{j,j} < c_2$ uniformly, where $\sigma_{j,k} = \operatorname{Cov}(Z_j, Z_k)$. Under the assumptions that $M_x > u_x(\gamma)$ and $M_y > u_y(\gamma)$, we can directly use the bounds from the proofs of [43, Corollary 2.1 and Proposition 4.1], which give that with probability $1 - \gamma$,
$$\sup_j\left| Z_j - \tilde Z_j \right| \le C\varphi(M_x)\sqrt{8\log(p/\gamma)}.$$
As a consequence, we can choose
$$\Delta = C\varphi(M_x, M_y)\sqrt{\log(p/\gamma)}.$$
This finishes our proof. □

References
[1] A. Belloni, V. Chernozhukov, D. Chetverikov, and K. Kato. Some new asymptotic theory for least squares series: Pointwise and uniform results. Journal of Econometrics, 186:345–366, 2015.
[2] P. Bickel and E. Levina. Regularized estimation of large covariance matrices. Ann. Stat., 36:199–227, 2008.
[3] P. Brockwell and R. Davis. Time Series: Theory and Methods. Springer-Verlag, 1987.
[4] J. Bun, J.-P. Bouchaud, and M. Potters. Cleaning large correlation matrices: Tools from random matrix theory. Physics Reports, 666:1–109, 2017.
[5] T. Cai, Z. Ren, and H. Zhou. Optimal rates of convergence for estimating Toeplitz covariance matrices. Probab. Theory Relat. Fields, 156:101–143, 2013.
[6] T. Cai, Z. Ren, and H. Zhou. Estimating structured high-dimensional covariance and precision matrices. Electron. J. Statist., 10:1–59, 2016.
[7] T. Cai and H. Zhou. Optimal rates of convergence for sparse covariance matrix estimation. Ann. Stat., 40:2389–2420, 2012.
[8] X. Chen. Large Sample Sieve Estimation of Semi-nonparametric Models. Chapter 76 in Handbook of Econometrics, Vol. 6B, James J. Heckman and Edward E. Leamer, 2007.
[9] X. Chen and T. Christensen. Optimal uniform convergence rates and asymptotic normality for series estimators under weak dependence and weak conditions. Journal of Econometrics, 188:447–465, 2015.
[10] X. Chen, M. Xu, and W. Wu. Covariance and precision matrix estimation for high-dimensional time series. Ann. Stat., 41:2994–3021, 2013.
[11] I. Daubechies. Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics, 1992.
[12] S. Demko, W. Moss, and P. Smith. Decay rates for inverses of band matrices. Math. Comput., 43:491–499, 1984.
[13] X. Ding. Asymptotics of empirical eigen-structure for high dimensional sample covariance matrices of general form. arXiv:1708.06296, 2017.
[14] X. Ding and F. Yang. A necessary and sufficient condition for edge universality at the largest singular values of covariance matrices. Ann. Appl. Probab. (to appear), 2017.
[15] X. Ding, W. Kong, and G. Valiant. Norm consistent oracle estimator for general covariance matrices. In preparation, 2017.
[16] L. Erdős, H.-T. Yau, and J. Yin. Rigidity of eigenvalues of generalized Wigner matrices. Advances in Mathematics, 229:1435–1515, 2012.
[17] J. Fan, Y. Fan, and J. Lv. High dimensional covariance matrix estimation using approximate factor models. Ann. Statist., 39:3320–3356, 2011.
[18] J. Fan, Y. Liao, and M. Mincheva. Large covariance estimation by thresholding principal orthogonal complements. J. R. Stat. Soc. (B), 75:603–680, 2013.
[19] J. Fan, Y. Liao, and W. Wang. An overview of the estimation of large covariance and precision matrices. Econom. J., 19:1–32, 2016.
[20] D. Hamilton. Time Series Analysis (Vol. 2). Princeton University Press, 1994.
[21] B. Hansen. Nonparametric Sieve Regression: Least Squares, Averaging Least Squares, and Cross-Validation. Chapter 8 in The Oxford Handbook of Applied Nonparametric and Semiparametric Econometrics and Statistics, The Oxford University Press, 2014.
[22] A. Knowles and J. Yin. Anisotropic local laws for random matrices. Prob. Theor. Rel. Fields, pages 1–96, 2016.
[23] C. Lam and J. Fan. Sparsistency and rates of convergence in large covariance matrix estimation. Ann. Stat., 37:4254–4278, 2009.
[24] W. Liu and Z. Lin. Strong approximation for a class of stationary processes. Stoch. Proc. Appl., 119:249–280, 2009.
[25] P. McCullagh and J. Nelder. Generalized Linear Models. Chapman and Hall, 2nd edition, 1989.
[26] O. Ledoit and M. Wolf. Nonlinear shrinkage estimation of large-dimensional covariance matrices. Ann. Statist., 40:1024–1060, 2012.
[27] D. Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, 17:1617–1642, 2007.
[28] M. Pourahmadi. Joint mean-covariance models with applications to longitudinal data: unconstrained parameterisation. Biometrika, 3:677–690, 1999.
[29] R. Ran and T. Huang. An inversion algorithm for a banded matrix. Computers and Mathematics with Applications, 58:1699–1710, 2009.
[30] A. Röllin. Stein's method in high dimensions with applications. Ann. Inst. H. Poincaré Probab. Statist., 49:529–549, 2011.
[31] C. Stone. Optimal global rates of convergence for nonparametric regression. Ann. Statist., 10:1040–1053, 1982.
[32] H. Tasaki. Convergence rates of approximate sums of Riemann integrals. J. Approx. Theory, 161:477–490, 2009.
[33] C. Tretter. Spectral Theory of Block Operator Matrices and Applications. Imperial College Press, 2008.
[34] J. Tropp. An Introduction to Matrix Concentration Inequalities. Foundations and Trends in Machine Learning, Now Publishers Inc, 2015.
[35] T. Tao and V. Vu. Random matrices: universality of local eigenvalue statistics. Acta Math., 206:127–204, 2011.
[36] W. Wang and J. Fan. Asymptotics of empirical eigen-structure for high dimensional spiked covariance. Ann. Stat., 45:1342–1374, 2017.
[37] L. Wasserman. All of Nonparametric Statistics. Springer Texts in Statistics, 2006.
[38] W. Wu. Nonlinear system theory: Another look at dependence. Proc. Natl. Acad. Sci. USA, 40:14150–14151, 2005.
[39] W. Wu and M. Pourahmadi. Nonparametric estimation of large covariance matrices of longitudinal data. Biometrika, 90:831–844, 2003.
[40] H. Xiao and W. Wu. Covariance matrix estimation for stationary time series. Ann. Statist., 40:466–493, 2012.
[41] M. Xu, D. Zhang, and W. Wu. L2 asymptotics for high-dimensional data. arXiv:1405.7244, 2015.
[42] C. Zhang and T. Zhang. Optimal rates of convergence for sparse covariance matrix estimation. Stat. Sci., 27:576–593, 2012.
[43] X. Zhang and G. Cheng. Gaussian approximation for high dimensional vector under physical dependence. Bernoulli (to appear), 2017.
[44] Z. Zhou. Heteroscedasticity and autocorrelation robust structural change detection. J. Am. Stat. Assoc., 108:726–740, 2013.
[45] Z. Zhou. Inference for non-stationary time series autoregression. Journal of Time Series Analysis, 34:508–516, 2013.
[46] Z. Zhou. Inference of weighted V-statistics for nonstationary time series and its applications. Ann. Stat., 1:87–114, 2014.
[47] Z. Zhou and W. Wu. Local linear quantile estimation for non-stationary time series. Ann. Stat., 37:2696–2729, 2009.
[48] Z. Zhou and W. Wu. Simultaneous inference of linear models with time varying coefficients. J. R. Statist. Soc. B, 72:513–531, 2010.

Department of Statistics, University of Toronto, 100 St. George St., Toronto, Ontario M5S 3G3, Canada. E-mail:
[email protected]
Department of Statistics, University of Toronto, 100 St. George St., Toronto, Ontario M5S 3G3, Canada. E-mail:
[email protected]