Stat Methods Appl (2014) 23:149–174 DOI 10.1007/s10260-013-0250-7

Estimation of covariance functions by a fully data-driven model selection procedure and its application to Kriging spatial interpolation of real rainfall data

Rolando Biscay Lirio · Dunia Giniebra Camejo · Jean-Michel Loubes · Lilian Muñiz Alvarez

Received: 7 January 2013 / Accepted: 26 October 2013 / Published online: 6 December 2013
© Springer-Verlag Berlin Heidelberg 2013

Abstract In this paper, we propose a data-driven model selection approach for the nonparametric estimation of covariance functions under very general moment assumptions on the stochastic process. Observing i.i.d. replications of the process at fixed observation points, we select the best estimator among a set of candidates using a penalized least squares estimation procedure with a fully data-driven penalty function, extending the work in Bigot et al. (Electron J Stat 4:822–855, 2010). We then provide a practical application of this estimate for a Kriging interpolation procedure to forecast rainfall data.

Keywords Model selection · Covariance estimation · Kriging method

Mathematics Subject Classification (2000) 62G05 · 62G20

R. Biscay Lirio
Facultad de Ingeniería, CIMFAV, Universidad de Valparaíso, Valparaíso, Chile
e-mail: [email protected]

D. G. Camejo
Instituto de Cibernética, Matemática y Física, Havana, Cuba
e-mail: [email protected]

J.-M. Loubes (B)
Institut de Mathématiques de Toulouse, Université Toulouse 3, Toulouse, France
e-mail: [email protected]

L. Muñiz Alvarez
Facultad de Matemática y Computación, Universidad de La Habana, Havana, Cuba
e-mail: [email protected]


1 Introduction

Covariance estimation is a fundamental problem in inference for stochastic processes, with applications ranging from geology and meteorology to financial time series and epidemiology. We refer for instance to Cressie (1993), Ripley (2004) or Stein (1999) for general references. Among all these applications, forecasting a spatial process has received particular attention in the statistical literature.

For example, assume that we are given observations X(t_1), ..., X(t_n) of a real-valued stochastic process {X(t) : t ∈ T}, where T ⊂ R^d with d ∈ N, which models the rain level. We wish to predict the process over the whole area T. The sample points t_1, ..., t_n are usually points at which the data are available, for instance a network of rain gauges corresponding to locations where observations are collected.

Linear inference for one position t_0 ∈ T is given by the pair (X̂(t_0), ν̂(t_0)), where the predicted value X̂(t_0) of X(t_0) is defined as a weighted linear combination of the available observations, and ν̂(t_0) is the prediction variability associated to X̂(t_0). The linear method that provides the optimal predictor X̂(t_0) is called Kriging, defined in Krige (1951). The predicted value X̂(t_0) is defined by

\[
\hat X(t_0) = \sum_{i=1}^{n} \omega_i \, X(t_i),
\]

with ω_1, ..., ω_n ∈ R, and its variance is given by

\[
V\big(\hat X(t_0)\big) = \sum_{i=1}^{n} \sum_{j=1}^{n} \omega_i \, \omega_j \, \mathrm{Cov}\big( X(t_i), X(t_j) \big).
\]

The computation of the weights ω_i and of the prediction variance ν̂(t_0) depends on the covariance function, which is unknown in practice and has to be estimated by an estimator which is itself a covariance function, so that the variance V(X̂(t_0)) is non-negative. For this reason, we build an estimator of the covariance function of a stochastic process which satisfies the non-negative definiteness property.

Throughout the paper, we consider a stochastic process X(t) with values in R, indexed by t ∈ T, a subset of R^d with d ∈ N, and assume that X has finite moments up to order 4. Therefore, its covariance function is finite, i.e. |σ(s, t)| = |Cov(X(s), X(t))| < +∞ for all s, t ∈ T, and, for the sake of simplicity, we suppose that E(X(t)) = 0 for all t ∈ T. The observations are X_i(t_j) for i = 1, ..., N and j = 1, ..., n, where the observation points t_1, ..., t_n ∈ T are fixed, and X_1, ..., X_N are independent copies of the process X. Let x_i = (X_i(t_1), ..., X_i(t_n))^⊤ be the vector of observations at the points t_1, ..., t_n, with i ∈ {1, ..., N}. The matrix Σ denotes the covariance matrix of the stochastic process X at the observation points, that is, Σ = E(x_i x_i^⊤) = (σ(t_j, t_k))_{1≤j≤n, 1≤k≤n}.

Several estimation methods have been developed to obtain non-negative definite estimators of the covariance function. This is the case of parametric methods (see Cressie 1993), which have the known drawback that the true covariance function may fail to belong to the chosen parametric family.


Nonparametric approaches provide more flexibility when constructing the estimator of the covariance function (see Guillot et al. 2000; Hall et al. 1994; Matsuo et al. 2011; Sampson and Guttorp 1992; Shapiro and Botha 1991, among others, for a review). In Bigot et al. (2010), a model selection procedure is provided for the estimation of covariance functions, under very general moment assumptions on the process, without additional conditions such as stationarity or Gaussianity. More specifically, the authors proved an oracle inequality for a penalized least squares covariance estimator. Their result relies on the use of a penalty function depending on the unknown quantity Φ = V(vec(x_1 x_1^⊤)), which reflects the correlation structure of the data. However, they proposed a consistent estimator Φ̂ of Φ and showed its good behaviour in some simulated examples. In Biscay et al. (2012), the same problem of covariance estimation via model selection is studied, and an oracle inequality is obtained when the matrix Φ is replaced in the penalty function by the estimator Φ̂ proposed in Bigot et al. (2010), so that their method is completely derived from the data.

In this paper, we extend the previous results in the sense that we build a fully data-driven model selection procedure for the estimation of covariance functions, using in the penalty term a more general estimator Φ̂_{V_N} of Φ, based on a subspace V_N ⊂ R^N such that dim(V_N) ≤ [N/2]. The class of estimators Φ̂_{V_N} that can be constructed with the previous properties includes the estimator proposed in Bigot et al. (2010) and in Biscay et al. (2012), which is defined using a particular unidimensional subspace of R^N. Furthermore, we obtain an oracle inequality for the proposed covariance estimator, which warrants that the best estimator is selected among a collection of candidates using any estimator Φ̂_{V_N} of the class, giving more flexibility to our method. We also show the behaviour of our estimator in some simulated examples, and we compare its performance using different subspaces V_N. Finally, we use the covariance estimator proposed in this paper in an application to Kriging spatial interpolation of real rainfall data registered in the Cuban province of Pinar del Río.

The paper is organized as follows. In Sect. 2 we describe the statistical framework of the method, and we define our estimator by model selection. The main theoretical results are presented in Sects. 3 and 4, in which the oracle inequalities are given from a nonasymptotic point of view. Numerical experiments are given in Sect. 5 and the application to spatial interpolation is described in Sect. 6. The most technical proofs are given in Sect. 7, while some of the inequalities used in the proofs are recalled in the "Appendix".

2 Model selection approach for covariance estimation

Our work is a modification of the model selection approach proposed in Bigot et al. (2010, 2011). Hence we provide a brief description of their estimation procedure. To estimate the covariance function σ, an approximation to the process X is considered:

\[
\tilde X(t) = \sum_{\lambda \in m} a_\lambda \, g_\lambda(t),
\]


where {g_λ}_λ is a collection of possibly linearly independent functions g_λ : T → R, m is a subset of indices of size |m| ∈ N, and {a_λ}_λ are suitable random coefficients. The covariance function ρ of X̃ can be regarded as an approximation of σ, and it can be written as ρ(s, t) = G(s)^⊤ Ψ G(t), where, after reindexing the functions if necessary, G(t) = (g_λ(t), λ ∈ m)^⊤ ∈ R^{|m|} and Ψ ∈ R^{|m|×|m|} is given by Ψ = E(a_λ a_μ), (λ, μ) ∈ m × m. Therefore, the estimator σ̂ of σ is taken in the class of functions of the form G(s)^⊤ Ψ G(t), where Ψ ∈ R^{|m|×|m|} is some symmetric matrix. The empirical contrast function used in Bigot et al. (2010) is given by

\[
L_N(\Psi) = \frac{1}{N} \sum_{i=1}^{N} \big\| x_i x_i^\top - G \Psi G^\top \big\|_F^2, \tag{1}
\]

where G is the n × |m| matrix with entries g_{jλ} = g_λ(t_j), j = 1, ..., n, λ ∈ m, and ‖A‖²_F = Tr(A A^⊤) denotes the Frobenius norm of the matrix A. Note that L_N is exactly the sum of the squares of the residuals corresponding to the matrix linear regression model

\[
x_i x_i^\top = G \Psi G^\top + U_i, \quad i = 1, \ldots, N, \tag{2}
\]

with i.i.d. matrix errors U_i such that E(U_i) = 0. This remark provides a natural framework to study the covariance estimation problem using a matrix regression model. The least squares estimator of the covariance matrix Σ is computed by minimizing L_N(Ψ) over the space of symmetric matrices S_{|m|} ⊂ R^{|m|×|m|}, and it is defined by Σ̂ = G Ψ̂ G^⊤ with Ψ̂ = arg min_{Ψ ∈ S_{|m|}} L_N(Ψ). This optimization problem has the explicit solution Ψ̂ = (G^⊤G)^- G^⊤ S G (G^⊤G)^-, which leads to Σ̂ = Π S Π, where S = (1/N) Σ_{i=1}^N x_i x_i^⊤ is the sample covariance matrix of the data x_1, ..., x_N, and Π = G(G^⊤G)^- G^⊤. In Bigot et al. (2010) it is shown that Ψ̂ is a non-negative definite matrix, thus Σ̂ also has this property. Hence, the resulting estimator of the covariance function, given by σ̂(s, t) = G(s)^⊤ Ψ̂ G(t), is a non-negative definite function.
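The numerical work reported later in the paper was implemented in MATLAB (Sect. 5); purely as an illustration, the following NumPy sketch shows how the projection-type least squares estimator Σ̂ = Π S Π above can be computed from the replicated observations and a given design matrix G. All function and variable names are ours, not taken from the paper or its code.

```python
import numpy as np

def ls_covariance_estimator(X, G):
    """Least squares covariance estimator Sigma_hat = Pi S Pi.

    X : (N, n) array, row i holds the observations x_i = (X_i(t_1), ..., X_i(t_n)).
    G : (n, |m|) design matrix with entries G[j, lam] = g_lam(t_j).
    Returns (Sigma_hat, Psi_hat, S).
    """
    N, n = X.shape
    # Sample covariance S = (1/N) sum_i x_i x_i^T (the process is assumed centered).
    S = X.T @ X / N
    # Projection Pi = G (G^T G)^- G^T onto the column space of G; the pseudo-inverse
    # handles a possibly rank-deficient G^T G.
    GtG_pinv = np.linalg.pinv(G.T @ G)
    Pi = G @ GtG_pinv @ G.T
    # Explicit solution Psi_hat = (G^T G)^- G^T S G (G^T G)^-, hence Sigma_hat = Pi S Pi.
    Psi_hat = GtG_pinv @ G.T @ S @ G @ GtG_pinv
    Sigma_hat = Pi @ S @ Pi
    return Sigma_hat, Psi_hat, S
```

Since Σ̂ = Π S Π with S non-negative definite, non-negative definiteness of the estimate is automatic in this sketch, mirroring the property discussed above.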


The role of G, and therefore the choice of the subset of indices m, is crucial, since it determines the behaviour of the estimator. Therefore, the problem of interest lies in selecting the best design matrix G = G_m among a collection of candidates {G_m ∈ R^{n×|m|}, m ∈ M}, where M denotes a generic countable set given by M = {m : m is a set of indices of cardinality |m|}. Let Σ̂_m be the least squares covariance estimator corresponding to the matrix G_m, with m ∈ M. The aim is to select the best of these estimators in the sense of the minimal quadratic risk, which is given by

\[
\mathbb{E}\, \| \Sigma - \hat\Sigma_m \|_F^2 = \| \Sigma - \Pi_m \Sigma \Pi_m \|_F^2 + \frac{\operatorname{Tr}\big( (\Pi_m \otimes \Pi_m)\, \Phi \big)}{N},
\]

where Π_m = G_m (G_m^⊤ G_m)^- G_m^⊤, with m ∈ M, and Φ = V(vec(x_1 x_1^⊤)), where for any matrix A, vec(A) denotes the vector obtained by stacking the columns of A on top of one another. We recall that for any vector z = (Z_i)_{1≤i≤n²}, the covariance matrix is denoted by V(z) = (Cov(Z_i, Z_j))_{1≤i,j≤n²}, and the symbol ⊗ denotes the Kronecker product of two matrices (see Seber 2008 and references therein).

The methodology proposed in Bigot et al. (2010) for the selection of the best estimator consists in taking the penalized covariance estimator Σ̂ = Π_{m̂} S Π_{m̂}, where

\[
\hat m = \arg\min_{m \in \mathcal M} \left\{ \frac{1}{N} \sum_{i=1}^{N} \| x_i x_i^\top - \hat\Sigma_m \|_F^2 + \mathrm{pen}(m) \right\}, \tag{3}
\]

and the penalty function pen : M → R is given by pen(m) = (1 + θ) Tr((Π_m ⊗ Π_m)Φ)/N, with θ > 0. The optimality of this model selection procedure is proved in Bigot et al. (2010) via an oracle inequality, which states that the quadratic risk E‖Σ − Σ̂_{m̂}‖²_F is bounded by the quadratic risk of the oracle, given by inf_{m∈M} E‖Σ − Σ̂_m‖²_F, except for a constant factor and an additive term of order O(1/N).

Let D_m = Tr(Π_m ⊗ Π_m) ≥ 1 for all m ∈ M and let δ²_m = Tr((Π_m ⊗ Π_m)Φ)/D_m; then

\[
\mathrm{pen}(m) = (1 + \theta)\, \frac{\delta_m^2\, D_m}{N}. \tag{4}
\]

The penalty function (4) depends on the quantity δ²_m, which is unknown in practice. Indeed, it relies on Φ = V(vec(x_1 x_1^⊤)), which reflects the correlation structure of the data. To overcome this issue, in Bigot et al. (2010) Φ is replaced in the penalty term by the consistent estimator

\[
\hat\Phi = \frac{1}{N} \sum_{i=1}^{N} \big( y_i - \operatorname{vec}(S) \big)\big( y_i - \operatorname{vec}(S) \big)^\top, \tag{5}
\]

where y_i = vec(x_i x_i^⊤) ∈ R^{n²}. The good practical behaviour of the covariance estimator obtained using Φ̂ is shown in Bigot et al. (2010) in some simulated examples, while the theoretical behaviour of this plug-in method has been investigated in Biscay et al. (2012). In this paper we extend the previous results to a wide class of estimators of Φ. The estimation procedure is described below.

Let Y = (y_1, y_2, ..., y_N) ∈ R^{n²×N}, and let y^j = (y^j_1, y^j_2, ..., y^j_N)^⊤ ∈ R^N be the transpose of the jth row of Y. Let V_N be a linear subspace of R^N such that dim(V_N) ≤ [N/2]. We propose the following estimator for Φ:

\[
\hat\Phi_{V_N} = \frac{N}{N - \operatorname{Tr}(\Pi_{V_N})} \left[ \frac{1}{N} \sum_{i=1}^{N} \big( y_i - Y\,\Pi^i_{V_N} \big)\big( y_i - Y\,\Pi^i_{V_N} \big)^\top \right], \tag{6}
\]

where Π_{V_N} ∈ R^{N×N} is the orthogonal projection matrix of R^N onto the space V_N, Π^i_{V_N} ∈ R^N is the ith column of Π_{V_N}, and Tr(Π_{V_N}) = dim(V_N).
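As an illustration only (this is not the authors' MATLAB code, and the helper name build_phi_hat is ours), the next sketch builds the estimator Φ̂_{V_N} of (6) from the matrix Y of vectorized products and an arbitrary basis of the chosen subspace V_N.

```python
import numpy as np

def build_phi_hat(Y, B):
    """Estimator Phi_hat_{V_N} of (6).

    Y : (n*n, N) matrix whose columns are y_i = vec(x_i x_i^T).
    B : (N, d) matrix whose columns span the subspace V_N, with d <= N // 2.
    """
    N = Y.shape[1]
    # Orthogonal projection matrix Pi_{V_N} of R^N onto V_N.
    Pi_V = B @ np.linalg.pinv(B.T @ B) @ B.T
    tr_Pi = np.trace(Pi_V)          # equals dim(V_N) when B has full column rank
    # Column i of R is the residual y_i - Y Pi_V^i, with Pi_V^i the i-th column of Pi_V.
    R = Y - Y @ Pi_V
    # (6): Phi_hat = [N / (N - Tr(Pi_V))] * (1/N) * sum_i r_i r_i^T = R R^T / (N - Tr(Pi_V)).
    Phi_hat = (R @ R.T) / (N - tr_Pi)
    return Phi_hat
```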

Replacing Φ by Φ̂_{V_N} in the penalty function (4) allows us to obtain an estimator of the covariance matrix Σ, denoted by Σ̂_{V_N} and defined as Σ̂_{V_N} = Π_{m̂_{V_N}} S Π_{m̂_{V_N}}, where

\[
\hat m_{V_N} = \arg\min_{m \in \mathcal M} \left\{ \frac{1}{N} \sum_{i=1}^{N} \| x_i x_i^\top - \hat\Sigma_m \|_F^2 + \widehat{\mathrm{pen}}_{V_N}(m) \right\},
\qquad
\widehat{\mathrm{pen}}_{V_N}(m) = (1 + \theta)\, \frac{\hat\delta^2_{m,V_N}\, D_m}{N}, \tag{7}
\]

and

\[
\hat\delta^2_{m,V_N}
= \frac{\operatorname{Tr}\big( (\Pi_m \otimes \Pi_m)\, \hat\Phi_{V_N} \big)}{D_m}
= \frac{N}{(N - \operatorname{Tr}(\Pi_{V_N}))\, D_m}
\operatorname{Tr}\!\left( (\Pi_m \otimes \Pi_m)\, \frac{1}{N} \sum_{i=1}^{N} \big( y_i - Y\,\Pi^i_{V_N} \big)\big( y_i - Y\,\Pi^i_{V_N} \big)^\top \right). \tag{8}
\]

The corresponding estimator of the covariance function σ is then given by the non-negative definite function

\[
\hat\sigma_{V_N}(s, t) = \hat\sigma_{\hat m_{V_N}}(s, t) = G_{\hat m_{V_N}}(s)^\top\, \hat\Psi_{\hat m_{V_N}}\, G_{\hat m_{V_N}}(t). \tag{9}
\]
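To make the selection rule (7)–(8) concrete, here is a schematic NumPy loop, written by us for illustration (the paper's own experiments used MATLAB). It evaluates the fully data-driven penalty for each candidate design matrix G_m and returns the selected covariance estimate; the quantities are computed exactly as in (6)–(8) under the stated conventions.

```python
import numpy as np

def select_model(X, designs, B, theta=1.0):
    """Penalized model selection of (7)-(8).

    X       : (N, n) replicated observations (rows are the x_i).
    designs : list of candidate design matrices G_m, each of shape (n, |m|).
    B       : (N, d) basis of the subspace V_N used in Phi_hat_{V_N}, d <= N // 2.
    """
    N, n = X.shape
    # y_i = vec(x_i x_i^T), stacked as columns of Y (shape n^2 x N).
    Y = np.column_stack([np.outer(x, x).ravel(order="F") for x in X])
    S = X.T @ X / N
    # Estimator (6): Phi_hat_{V_N} = R R^T / (N - Tr(Pi_{V_N})).
    Pi_V = B @ np.linalg.pinv(B.T @ B) @ B.T
    R = Y - Y @ Pi_V
    Phi_hat = (R @ R.T) / (N - np.trace(Pi_V))

    best = None
    for G in designs:
        Pi = G @ np.linalg.pinv(G.T @ G) @ G.T            # projection Pi_m
        Sigma_m = Pi @ S @ Pi                             # least squares estimate for model m
        D_m = np.trace(Pi) ** 2                           # Tr(Pi_m x Pi_m) = Tr(Pi_m)^2
        delta2_m = np.trace(np.kron(Pi, Pi) @ Phi_hat) / D_m   # data-driven term of (8)
        pen = (1.0 + theta) * delta2_m * D_m / N          # penalty of (7)
        contrast = np.mean([np.linalg.norm(np.outer(x, x) - Sigma_m, "fro") ** 2 for x in X])
        crit = contrast + pen
        if best is None or crit < best[0]:
            best = (crit, Sigma_m)
    return best[1]
```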

Note that vec(S) = (1/N) Σ_{i=1}^N y_i = (1/N) Y 1^i_N, where 1^i_N is the ith column of the matrix 1_N ∈ R^{N×N}, which has the number 1 in every position. Using this notation, we can rewrite the estimator (5) as

\[
\hat\Phi = \frac{1}{N} \sum_{i=1}^{N} \Big( y_i - \tfrac{1}{N}\, Y\, \mathbf 1^i_N \Big)\Big( y_i - \tfrac{1}{N}\, Y\, \mathbf 1^i_N \Big)^\top .
\]

The matrix Π_{V_N} = (1/N) 1_N ∈ R^{N×N} is symmetric and idempotent, therefore it is the orthogonal projection matrix of R^N onto a subspace V_N ⊂ R^N of dimension dim(V_N) = Tr((1/N) 1_N) = 1. Hence, the estimator Φ̂ studied in Bigot et al. (2010) and Biscay et al. (2012) can be derived from Φ̂_{V_N} in (6) by considering 1/(N−1) ≈ 1/N and a subspace V_N ⊂ R^N such that dim(V_N) = 1 ≤ [N/2]. The subspace V_N in our definition of Φ̂_{V_N} is quite arbitrary and is only restricted to have dimension at most half the dimension of R^N. The advantage of taking V_N as some "large" space (with dimension bigger than 1) is that it can lead to a better estimation of Φ, and therefore of σ. The influence of the estimator Φ̂_{V_N} is studied in Sect. 5 in some simulated examples, where different subspaces V_N are considered, including the particular case dim(V_N) = 1.

3 Oracle inequality for the fully data-driven covariance estimator

In this section we prove the optimality of the procedure explained above via an oracle inequality, which warrants that a suitable model is selected for any subspace V_N ⊂ R^N such that dim(V_N) ≤ [N/2].


Theorem 1 Let q ∈ (0, 1] be a given constant such that there exists p > 2(1 + 2q) satisfying E‖x_1 x_1^⊤‖_F^p < ∞. Then, for a given subspace V_N ⊂ R^N such that dim(V_N) ≤ [N/2], the following inequality holds:

\[
\Big( \mathbb{E}\, \| \Sigma - \hat\Sigma_{V_N} \|_F^{2q} \Big)^{1/q}
\le C(\theta, q) \left\{ \inf_{m \in \mathcal M} \Big[ \| \Sigma - \Pi_m \Sigma \Pi_m \|_F^2 + \frac{\delta_m^2\, D_m}{N} \Big]
+ d_N^2\big( \mathbf 1_N \otimes \operatorname{vec}(\Sigma), \mathcal V \big) + \Delta_p \, \frac{\delta_{\sup}^2}{N} \right\},
\]

where

\[
\Delta_p^{\,q} = C'(p, q, \theta)\; \mathbb{E}\,\| x_1 x_1^\top \|_F^p
\left( \sum_{m \in \mathcal M} \delta_m^{-p}\, D_m^{-(p/2 - 1 - q)} + \frac{\| \Sigma \|_F^2}{\delta_{\sup}^2} \right),
\]

d_N²(1_N ⊗ vec(Σ), V) = ‖1_N ⊗ vec(Σ) − (Π_{V_N} ⊗ I_{n²})(1_N ⊗ vec(Σ))‖²_N, 1_N ∈ R^N is the vector (1, ..., 1)^⊤, I_{n²} is the identity matrix of size n² × n², V is the subspace of R^{Nn²} such that the orthogonal projection matrix of R^{Nn²} onto V is given by Π_{V_N} ⊗ I_{n²}, δ²_sup = max{δ²_m : m ∈ M}, and C(θ, q) and C'(p, q, θ) are constants depending only on the quantities inside the brackets. In particular, for q = 1 we have

\[
\mathbb{E}\, \| \Sigma - \hat\Sigma_{V_N} \|_F^{2}
\le C(\theta) \left\{ \inf_{m \in \mathcal M} \mathbb{E}\, \| \Sigma - \hat\Sigma_m \|_F^2
+ d_N^2\big( \mathbf 1_N \otimes \operatorname{vec}(\Sigma), \mathcal V \big) + \Delta_p \, \frac{\delta_{\sup}^2}{N} \right\}.
\]

For the proof of this result, we first restate the theorem in a vectorized form, which turns out to be an extension of Theorem 6.1 of Baraud (2000) to the multivariate case.

4 Oracle inequality for multidimensional regression models

In this section we follow the ideas in Bigot et al. (2010) for the vectorization of the matrix linear regression model (2). First, we recall that for any symmetric matrix A = (a_{ij})_{1≤i≤n, 1≤j≤n}, vech(A) is defined as the column vector of size n(n+1)/2 × 1 obtained by vectorizing only the lower triangular part of A. There exists a unique linear transformation which transforms the half-vectorization of a matrix into its vectorization and vice versa, called, respectively, the duplication matrix and the elimination matrix. For any n ∈ N, the n² × n(n+1)/2 duplication matrix is denoted by D_n. According to these definitions, the matrix linear regression model (2) can be rewritten in the following vectorized form:

\[
y_i = A\beta + u_i, \quad i = 1, \ldots, N, \tag{10}
\]

where y_i = vec(x_i x_i^⊤) ∈ R^{n²} as before, A = (G ⊗ G) D_m ∈ R^{n² × |m|(|m|+1)/2}, where D_m ∈ R^{|m|² × |m|(|m|+1)/2} is the duplication matrix for β, β = vech(Ψ) ∈ R^{|m|(|m|+1)/2}, and u_i = vec(U_i) ∈ R^{n²}. According to this notation, model (10) is a particular case of the more general multidimensional regression model

\[
y_i = f_i + \varepsilon_i, \quad i = 1, \ldots, N, \tag{11}
\]

where y_i ∈ R^k, with i = 1, ..., N, are observed random vectors of dimension k ≥ 1 (k = n² in (10)), f_i ∈ R^k are nonrandom vectors and ε_1, ..., ε_N are i.i.d. random vectors in R^k with E(ε_1) = 0 and V(ε_1) = Φ ∈ R^{k×k}. For simplicity, we identify a function g : X → R^k with the vector (g(x_1), ..., g(x_N))^⊤ ∈ R^{Nk}, and we denote by ⟨a, b⟩_N = (1/N) Σ_{i=1}^N a_i^⊤ b_i the inner product of R^{Nk} associated to the norm ‖·‖_N, where a = ((a_1)^⊤, ..., (a_N)^⊤)^⊤ and b = ((b_1)^⊤, ..., (b_N)^⊤)^⊤ with a_i, b_i ∈ R^k for all i = 1, ..., N.

Given N, k ∈ N, let (L_m)_{m∈M} be a finite family of linear subspaces of R^{Nk}. For each m ∈ M assume L_m has dimension D_m. Let f̂_m be the least squares estimator of f = vec(f_1, ..., f_N) ∈ R^{Nk} based on the data y = vec(y_1, ..., y_N) ∈ R^{Nk} under the model L_m, i.e.

\[
\hat{\mathbf f}_m = \arg\min_{v \in L_m} \| \mathbf y - v \|_N^2 = P_m \mathbf y,
\]

where P_m ∈ R^{Nk×Nk} is the orthogonal projection matrix of R^{Nk} onto L_m. Write

\[
\delta_m^2 = \frac{\operatorname{Tr}\big( P_m (I_N \otimes \Phi) \big)}{D_m}, \quad \text{with } D_m = \operatorname{Tr}(P_m) \text{ and } \delta_{\sup}^2 = \max\{\delta_m^2 : m \in \mathcal M\}.
\]

The matrix Φ is unknown in practice; therefore, we can proceed as in Sect. 2 to construct an estimator Φ̂_{V_N} of Φ of the form

\[
\hat\Phi_{V_N} = \frac{N}{N - \operatorname{Tr}(\Pi_{V_N})} \left[ \frac{1}{N} \sum_{i=1}^{N} \big( y_i - Y\,\Pi^i_{V_N} \big)\big( y_i - Y\,\Pi^i_{V_N} \big)^\top \right],
\]

where Y = (y_1, y_2, ..., y_N) ∈ R^{k×N}. Replacing Φ by Φ̂_{V_N} in the definition of δ²_m we obtain

\[
\hat\delta^2_{m,V_N} = \frac{N}{(N - \operatorname{Tr}(\Pi_{V_N}))\operatorname{Tr}(P_m)}
\operatorname{Tr}\!\left( P_m \Big( I_N \otimes \frac{1}{N} \sum_{i=1}^{N} \big( y_i - Y\,\Pi^i_{V_N} \big)\big( y_i - Y\,\Pi^i_{V_N} \big)^\top \Big) \right). \tag{12}
\]

For θ > 0 we define the penalized estimator f̂ = f̂_{m̂_{V_N}} = P_{m̂_{V_N}} y of f by taking

\[
\hat m_{V_N} = \arg\min_{m \in \mathcal M} \big\{ \| \mathbf y - \hat{\mathbf f}_m \|_N^2 + \widehat{\mathrm{pen}}_{V_N}(m) \big\},
\]


where the fully data-driven penalty function pen̂_{V_N} is given by pen̂_{V_N}(m) = (1 + θ) δ̂²_{m,V_N} D_m / N, with δ̂²_{m,V_N} as in (12). The following result provides an oracle inequality for the estimator f̂.

Theorem 2 Let q ∈ (0, 1] be a given constant such that E[‖ε_1‖_{l_2}^p] < ∞ for some p > 2(1 + 2q). Then, for some constant C(θ, q) that depends only on θ and q, the following inequality holds:

\[
\Big( \mathbb{E}\, \| \mathbf f - \hat{\mathbf f} \|_N^{2q} \Big)^{1/q}
\le C(\theta, q) \left\{ \inf_{m \in \mathcal M} \Big[ d_N^2(\mathbf f, L_m) + \frac{D_m}{N}\, \delta_m^2 \Big]
+ d_N^2(\mathbf f, \mathcal V_N) + \Delta_p \, \frac{\delta_{\sup}^2}{N} \right\},
\]

where d_N²(f, L_m) = ‖f − P_m f‖²_N, d_N²(f, V_N) = ‖f − (Π_{V_N} ⊗ I_k) f‖²_N, V_N is the subspace of R^{Nk} such that the orthogonal projection matrix of R^{Nk} onto V_N is given by Π_{V_N} ⊗ I_k, and

\[
\Delta_p^{\,q} = C'(p, q, \theta)\; \mathbb{E}\big[ \| \varepsilon_1 \|_{l_2}^p \big]
\left( \sum_{m \in \mathcal M} \delta_m^{-p}\, D_m^{-(p/2 - 1 - q)} + \frac{\| \mathbf f \|_N^2}{\delta_{\sup}^2} \right),
\]

with the constant C'(p, q, θ) depending only on p, q and θ. This theorem is equivalent to Theorem 1 applied to the vectorized version of model (11), and it turns out to be a k-variate extension of the results in Theorem 6.1 of Baraud (2000) (which are recovered when k = 1).

5 Numerical experiments

In this section we illustrate the behaviour of the fully data-driven covariance estimator by model selection proposed in this paper in some simulated examples. We study its performance when computing the criterion with the estimated penalty (7) defined in Sect. 2. We compare the behaviour of our estimator Σ̂_{V_N} using two different subspaces V_N, one such that dim(V_N) > 1 and another with dim(V_N) = 1. The programs for our simulations were implemented in MATLAB and the code is available on request.

We consider i.i.d. copies X_1, ..., X_N of a Gaussian process X on T = [0, 1] with values in R, observed at fixed equispaced points t_1, ..., t_n in [0, 1] for a fixed n, generated according to

\[
X(t_j) = \sum_{\lambda = 1}^{m^*} a_\lambda \, g^*_\lambda(t_j), \quad j = 1, \ldots, n,
\]

where m ∗ denotes the true model dimension, (gλ∗ )λ=1,...,m ∗ are orthonormal functions on [0, 1], and the coefficients a1 , . . . , am ∗ are independent and identically distributed


Gaussian variables with zero mean. Note that E(X(t_j)) = 0 for all j = 1, ..., n and that the covariance function of the process X at the points t_1, ..., t_n is given by

\[
\sigma(t_j, t_k) = \operatorname{cov}\big( X(t_j), X(t_k) \big) = \sum_{\lambda = 1}^{m^*} V(a_\lambda)\, g^*_\lambda(t_j)\, g^*_\lambda(t_k), \quad \forall\, 1 \le j, k \le n.
\]
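For concreteness, a minimal NumPy sketch of this simulation design is given below. It is our own illustration (the paper's experiments were run in MATLAB); the basis is passed in as a generic function and all names are hypothetical.

```python
import numpy as np

def simulate_process(basis, m_star, n, N, var_a=1.0, seed=0):
    """Draw N i.i.d. curves X_i(t_j) = sum_lambda a_lambda g*_lambda(t_j).

    basis(lam, t) : returns the values of g*_lam at the points in t (array-valued).
    Returns the (N, n) data matrix, the true covariance sigma(t_j, t_k), and the grid t.
    """
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n)                                  # fixed equispaced design points
    Gstar = np.column_stack([basis(lam, t) for lam in range(1, m_star + 1)])   # (n, m*)
    A = rng.normal(0.0, np.sqrt(var_a), size=(N, m_star))         # i.i.d. N(0, var_a) coefficients
    X = A @ Gstar.T                                               # X[i, j] = X_i(t_j)
    Sigma_true = var_a * Gstar @ Gstar.T                          # sum_lambda V(a_lam) g* g*
    return X, Sigma_true, t
```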

The covariance estimation by model selection is computed as follows. Let (g_λ)_{λ∈N} be an orthonormal basis of [0, 1], which may differ from the original basis functions (g^*_λ)_{λ=1,...,m^*}. For a given M > 0, candidate models are chosen among the collection M = {{1, ..., m} : m = 1, ..., M}. To each set indexed by m we associate the matrix (model) G_m ∈ R^{n×m} with entries (g_λ(t_j))_{1≤j≤n, 1≤λ≤m}, which corresponds to the number m of basis functions g_1, ..., g_m used in the expansion that approximates the process X. We aim at choosing a good model among the family of models {G_m : m ∈ M} in the sense of achieving the minimum of the quadratic risk

\[
R(m) = \mathbb{E}\, \| \Sigma - \hat\Sigma_m \|_F^2 = \| \Sigma - \Pi_m \Sigma \Pi_m \|_F^2 + \frac{\delta_m^2\, D_m}{N},
\]

where Σ̂_m is the least squares covariance estimator corresponding to the matrix G_m. The ideal model m_0 is the minimizer of the risk function m ↦ R(m).

Note that for all m = 1, ..., M, the empirical contrast function (1) satisfies

\[
L_N(m) = \frac{1}{N} \sum_{i=1}^{N} \| x_i x_i^\top - \hat\Sigma_m \|_F^2 = \tilde L_N(m) + C,
\]

where

\[
\tilde L_N(m) = \| S - \hat\Sigma_m \|_F^2 = \| S - \Pi_m S \Pi_m \|_F^2
\]

and C is a constant that does not depend on m. Then, the penalized criterion minimized in (3) to compute the optimal model can be written as L_N(m) + pen(m) = PC(m) + C, where

\[
PC(m) = \tilde L_N(m) + \mathrm{pen}(m) = \| S - \Pi_m S \Pi_m \|_F^2 + (1 + \theta)\, \frac{\delta_m^2\, D_m}{N}.
\]

Thus, up to the constant C, \tilde L_N(m) can be regarded as the empirical contrast function and PC(m) as the penalized criterion. These functions are used for the visual presentation of the results. For each model m = 1, ..., M, we evaluate the penalized criterion PC(m) with θ = 1 and expect that the minimum of PC(m) is attained at a value close to m_0. We study the influence of using δ̂²_{m,V_N} defined in (8), rather than δ²_m, on the model selection procedure. First, we compute an approximation of the risk R given by

\[
\hat R_V(m) = \| \Sigma - \Pi_m \Sigma \Pi_m \|_F^2 + \frac{\hat\delta^2_{m,V_N}\, D_m}{N},
\]

and an estimator of the penalized criterion PC:

\[
\widehat{PC}_{V_N}(m) = \| S - \Pi_m S \Pi_m \|_F^2 + (1 + \theta)\, \frac{\hat\delta^2_{m,V_N}\, D_m}{N}.
\]
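The two curves just defined can be tabulated with a loop of the following shape (an illustrative sketch of ours; the penalty-related quantities are obtained as in the earlier sketches, and the argument names are hypothetical).

```python
import numpy as np

def criterion_curves(S, Sigma_true, projections, delta2_hat, N, theta=1.0):
    """Evaluate PC_hat_{V_N}(m) and R_hat_V(m) for m = 1, ..., M.

    projections : list of projection matrices Pi_m (one per candidate model).
    delta2_hat  : array of data-driven terms delta^2_{m,V_N} computed as in (8).
    """
    PC_hat, R_hat = [], []
    for Pi, d2 in zip(projections, delta2_hat):
        D_m = np.trace(Pi) ** 2
        PC_hat.append(np.linalg.norm(S - Pi @ S @ Pi, "fro") ** 2 + (1 + theta) * d2 * D_m / N)
        R_hat.append(np.linalg.norm(Sigma_true - Pi @ Sigma_true @ Pi, "fro") ** 2 + d2 * D_m / N)
    return np.array(PC_hat), np.array(R_hat)
```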

We denote by m̂_{V_N} the point at which the estimated penalized criterion PĈ_{V_N} achieves its minimum value, i.e., the model selected by minimizing PĈ_{V_N}.

In the following example we plot the empirical contrast function L_N, the risk function R, the approximate risk function R̂_V, the penalized criterion PC and the estimated penalized criterion PĈ_{V_N}. We also show the true covariance function σ(s, t) for s, t ∈ [0, 1] and the penalized covariance estimator based on PĈ_{V_N}, which is given by σ̂_{m̂_{V_N}}(s, t) = G_{m̂_{V_N}}(s)^⊤ Ψ̂_{m̂_{V_N}} G_{m̂_{V_N}}(t), as in (9).

Example: Let g^*_1, ..., g^*_{m^*} be the Fourier basis functions

\[
g^*_\lambda(t) =
\begin{cases}
\dfrac{1}{\sqrt{n}} & \text{if } \lambda = 1,\\[6pt]
\sqrt{2}\,\dfrac{1}{\sqrt{n}} \cos\!\big( 2\pi \tfrac{\lambda}{2}\, t \big) & \text{if } \tfrac{\lambda}{2} \in \mathbb{Z},\\[6pt]
\sqrt{2}\,\dfrac{1}{\sqrt{n}} \sin\!\big( 2\pi \tfrac{\lambda - 1}{2}\, t \big) & \text{if } \tfrac{\lambda - 1}{2} \in \mathbb{Z}^*.
\end{cases}
\tag{13}
\]
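A direct Python transcription of (13), which can be passed to the simulation sketch above, is the following (our own illustrative code):

```python
import numpy as np

def fourier_basis(lam, t):
    """Orthonormal Fourier basis g*_lambda of (13), evaluated at the n = t.size design points."""
    n = t.size
    if lam == 1:
        return np.full_like(t, 1.0 / np.sqrt(n))
    if lam % 2 == 0:                                   # lambda/2 is an integer
        return np.sqrt(2.0 / n) * np.cos(2 * np.pi * (lam / 2) * t)
    return np.sqrt(2.0 / n) * np.sin(2 * np.pi * ((lam - 1) / 2) * t)   # (lambda-1)/2 integer
```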

We simulate a sample of size N = 50 with n = m^* = 35, M = 34 and V(a_λ) = 1 for all λ = 1, ..., m^*, and consider the models obtained by choosing the first m Fourier basis functions. In this setting, the minimum of the quadratic risk R is attained at m_0 = N/2 − 1, which for N = 50 gives m_0 = 24. We show the behaviour of our selection procedure using two different subspaces V_N of R^N in the computation of δ̂²_{m,V_N} defined in (8). More specifically, we consider the following two cases:

• Case 1: the estimator Φ̂_{V_N} defined in (6) is used for the computation of δ̂²_{m,V_N}, taking a subspace V^1_N of dimension 1.
• Case 2: the estimator Φ̂_{V_N} defined in (6) is used for the computation of δ̂²_{m,V_N}, taking a subspace V^2_N of dimension 20.

Note that both subspaces V^1_N and V^2_N satisfy the assumption of having dimension less than or equal to [N/2] = 25, which is required for Theorem 1 to hold.

Figure 1a–d presents the results obtained for a simulated sample in Case 1. Figure 1a shows that the approximate risk function R̂_{V^1_N} reproduces the shape of the risk function R, suggesting that replacing δ²_m by δ̂²_{m,V^1_N} in the risk does not have too drastic an effect. Figure 1b shows that, in spite of the small sample size N = 50, a quite good approximation to the true covariance function σ is achieved by its penalized covariance estimate σ̂_{V^1_N} based on PĈ_{V^1_N}. Figure 1c shows, as expected, that the empirical contrast function L_N is strictly decreasing over the whole range of possible models, hence its minimization would lead to choosing the largest model M = 34. On the contrary, both the minimization of the penalized criterion PC and of its estimate PĈ_{V^1_N} lead to the selection of a good model, m̂ = 24 and m̂_{V^1_N} = 24 respectively, as shown in Fig. 1d.


Fig. 1 Results corresponding to Case 1. a Risk function R and approximated risk R̂_{V^1_N}. b Covariance function σ and estimated covariance function σ̂_{m̂_{V^1_N}}. c Empirical contrast function L_N. d Penalized criterion PC and its estimate PĈ_{V^1_N}

This also demonstrates that replacing δ²_m by δ̂²_{m,V^1_N} in the penalized criterion does not notably deteriorate the performance of the model selection procedure in this example. We can conclude that the selected model m̂_{V^1_N} corresponds to the ideal model m_0 in this simulation, and that the risk R evaluated at m̂_{V^1_N} is much smaller than that of the largest model M = 34.

Figure 2a–d shows the results for Case 2. Note that the estimated penalized criterion PĈ_{V^2_N} also leads to the good selection m̂_{V^2_N} = 24. Figure 2a, b shows that the function R̂_{V^2_N} approximates R well, and that the covariance function σ̂_{m̂_{V^2_N}} gives a good estimation of the true covariance σ. Hence, δ²_m can be replaced in practice by δ̂²_{m,V^2_N} in the model selection procedure.


Fig. 2 Results corresponding to Case 2. a Risk function R and approximated risk R̂_{V^2_N}. b Covariance function σ and estimated covariance function σ̂_{m̂_{V^2_N}}. c Empirical contrast function L_N. d Penalized criterion PC and its estimate PĈ_{V^2_N}

Table 1 Variability of the procedure

          P_{m0±3} (%)   P_{m0±4} (%)   R̄
Case 1        65              78        23.4281
Case 2        71              80        23.1517

It is clear that the selected models m̂_{V^i_N}, i = 1, 2, depend on the observed sample through the penalized criterion estimate PĈ_{V^i_N}. To illustrate this variability, we compute the percentage P^i_{m_0±3} of models chosen in the range [m_0 − 3; m_0 + 3] and the percentage P^i_{m_0±4} of models chosen in the range [m_0 − 4; m_0 + 4] over 100 simulated samples. We also compute the average of the values of the approximated risk at the selected model, given by R̄_i = (1/100) Σ_{l=1}^{100} R̂_{V_N}(m̂^l_{V^i_N}), where m̂^l_{V^i_N} denotes the model selected in simulation l of case i, with l = 1, ..., 100 and i = 1, 2. It can be observed in Table 1 that the percentage of suitable selections is higher and the averaged risk is lower in Case 2, which corresponds to the use of a subspace V_N such that dim(V_N) = 20.

According to the simulation results shown in this section, we can conclude that the use of the estimator Φ̂_{V_N} in the penalty function of our method is appropriate, since it selects a model close to the oracle both when the dimension of V_N is small (equal to 1) and when it is large (equal to 20). Note that, as expected, the selection procedure performs slightly better in the case dim(V_N) = 20.
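The replication study behind Table 1 amounts to repeating the whole pipeline over independent samples. Schematically (our own sketch, assuming a user-supplied run_once helper that simulates one sample and returns the selected model index):

```python
import numpy as np

def replication_study(run_once, n_rep=100, m0=24):
    """run_once(seed) must return the selected model index m_hat for one simulated sample."""
    selected = np.array([run_once(seed) for seed in range(n_rep)])
    p3 = 100.0 * np.mean(np.abs(selected - m0) <= 3)   # percentage in [m0 - 3, m0 + 3]
    p4 = 100.0 * np.mean(np.abs(selected - m0) <= 4)   # percentage in [m0 - 4, m0 + 4]
    return p3, p4
```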


6 Application to Kriging spatial interpolation of real data

The estimator that we propose in this paper is a genuine covariance function, i.e., a non-negative definite function, therefore it can be plugged into other procedures which require working with a covariance function. In this section, we consider the Kriging spatial interpolation of recent rainfall data registered in the Cuban province of Pinar del Río, and plug the previous model selection type covariance estimator into the Kriging procedure.

Kriging is very commonly used in practice for the estimation of the value of a stochastic process {X(t) : t ∈ T} at an unobserved location t_0 ∈ T from observations X(t_1), ..., X(t_n) of the process at known locations t_1, ..., t_n of T ⊂ R^d, with d ∈ N. It provides an optimal linear estimator X̂(t_0) of X, in the sense that X̂(t_0) is unbiased and of minimum prediction variance. The Kriging estimator X̂(t_0) depends on the spatial dependence of the process, which can be quantified by the expectation function μ(t) = E(X(t)) and the covariance function σ(s, t) = Cov(X(s), X(t)), with s, t ∈ T, of the stochastic process X. The predicted value X̂(t_0) is defined by

\[
\hat X(t_0) = \sum_{i=1}^{n} \omega_i(t_0)\, X(t_i),
\]

where the weights ω_1(t_0), ..., ω_n(t_0) ∈ R are computed by minimizing the prediction error variance, given by

\[
V\big( \hat X(t_0) - X(t_0) \big)
= \sum_{i=1}^{n} \sum_{j=1}^{n} \omega_i(t_0)\,\omega_j(t_0)\,\sigma(t_i, t_j)
+ V\big( X(t_0) \big)
- 2 \sum_{i=1}^{n} \omega_i(t_0)\,\sigma(t_i, t_0),
\]

subject to the unbiasedness condition

\[
\mathbb{E}\big( \hat X(t_0) - X(t_0) \big) = \sum_{i=1}^{n} \omega_i(t_0)\,\mu(t_i) - \mu(t_0) = 0.
\]
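For a zero-mean process (Simple Kriging, used below), the standard closed-form solution of this minimization is ω(t_0) = Σ^{-1} σ_0, with Σ = (σ(t_i, t_j))_{ij} and σ_0 = (σ(t_i, t_0))_i. The following small sketch (ours, with hypothetical names) takes any covariance function, for instance the model-selection estimate σ̂_{m̂_{V_N}}, and returns the prediction and its variance:

```python
import numpy as np

def simple_kriging(cov, locations, values, t0):
    """Simple Kriging predictor at t0 for a zero-mean process.

    cov(s, t)  : covariance function (e.g. the estimated sigma_hat).
    locations  : observation sites t_1, ..., t_n.
    values     : observed values X(t_1), ..., X(t_n).
    """
    Sigma = np.array([[cov(si, sj) for sj in locations] for si in locations])
    sigma0 = np.array([cov(si, t0) for si in locations])
    w = np.linalg.solve(Sigma, sigma0)          # Kriging weights omega(t0)
    pred = float(w @ np.asarray(values))        # X_hat(t0)
    var = float(cov(t0, t0) - w @ sigma0)       # prediction variance nu_hat(t0)
    return pred, var
```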

Depending on the properties of the stochastic process X, different types of Kriging apply. We use an implementation of Simple Kriging, since it assumes a known constant trend μ(t) = 0 for all t ∈ T, which is also an assumption of our covariance estimation procedure.

Our data consist of N = 12 rainfall values registered every month at n = 7 different meteorological stations of the Cuban province of Pinar del Río during the year 2011. More specifically, X_i(t_j, s_j) is the cumulative rainfall value of month i registered at the meteorological station located at the fixed position (t_j, s_j) ∈ T ⊂ R², with i = 1, ..., 12 and j = 1, ..., 7. We write X_i(S_j) = X_i(t_j, s_j) for i = 1, ..., 12 and j = 1, ..., 7. The mean is assumed to be zero, so we transform the data in order to satisfy the constraint

\[
\frac{1}{Nn} \sum_{i=1}^{12} \sum_{j=1}^{7} X_i(S_j) = 0.
\]

Fig. 3 Predicted cumulative rainfall values using Simple Kriging. a Location of meteorological stations. b Predicted annual rainfall average (2011). c Predicted cumulative rainfall in January. d Predicted cumulative rainfall in June

That is, we replace X_i(S) by X_i(S) − (1/(Nn)) Σ_{i=1}^{12} Σ_{j=1}^{7} X_i(S_j). Then, we apply the model selection procedure for the estimation of the covariance function σ of the underlying stochastic process. Next, we use the resulting estimator σ̂_{m̂_{V_N}}, with dim(V_N) = 6, in the Kriging interpolation procedure described above to obtain the predicted cumulative rainfall values over the whole region. This application shows the large flexibility of our method compared to the ones in Bigot et al. (2010) and Biscay et al. (2012), in the sense that, in order to obtain the covariance estimator σ̂_{m̂_{V_N}}, we can use any subspace V_N ⊂ R^{12} such that dim(V_N) ≤ 6.

Figure 3a shows the location of the meteorological stations in the studied region. Figure 3b shows the predicted annual average of the cumulative rainfall values over the whole province. It reflects the usual behaviour of rainfall in this region, since the higher values of cumulative rainfall occur in the mountain range called Sierra del Rosario, located in the north of the province. Figure 3c, d shows the predicted cumulative rainfall values in January (winter) and June (summer), respectively. Note that the cumulative rainfall values are higher in June, which is consistent with the fact that, in the summer, daytime heating leads to larger precipitation values. Note also that, since we consider cumulative rainfall, the assumption of independence between the replications of the process makes some sense, even if it is difficult to test in practice. In conclusion, the results obtained here seem to provide a good description of the real meteorological phenomenon under study, respecting the geographical conditions of the area.

7 Proofs of the main results

Throughout this section C, C', C'', ... denote constants that may vary from line to line. The notation C(·) specifies the dependency of C on some quantities.

7.1 Proof of Theorem 2

The proof of Theorem 2 follows the guidelines of the proof of Theorem 6.1 of Baraud (2000). We start with the following claims:

Claim 3 The quantity δ̂²_{m,V_N} defined in (12) is such that E[δ̂²_{m,V_N}] ≤ δ²_m + 2 (N/Tr(P_m)) d²_N(f, V_N).


that since we consider cumulative rainfall, the assumption of independence between the replications of the process makes some sense, even if this assumption is difficult to be tested in practice. In conclusion, the results obtained here seem to be a good description of the real meteorological phenomena which is studied, respecting the geological conditions of these area. 7 Proofs of the main results Throughout this section C, C , C , . . . denote constants that may vary from line to line. The notation C(.) specifies the dependency of C on some quantities. 7.1 Proof of Theorem 2 The proof of Theorem 2 follows the guidelines of the proof of Theorem 6.1 of Baraud (2000). We start with the following claims: 2 2 2 Claim 3 The quantity  δm, V N defined in (12) is such that E[δm,V N ] ≤ δm + N 2 2 Tr(Pm ) d N (f, V N ).

Claim 4 Let 0 < δ < 1/2; then

\[
P\big[\, \hat\delta^2_{m,V_N} \le (1 - 2\delta)\, \delta_m^2 \,\big]
\le C(p, \delta)\, \frac{\mathbb{E}\big[ \| \varepsilon_1 \|_{l_2}^p \big]}{N^{\beta}},
\qquad \text{where } \beta = \Big( \tfrac{p}{2} - 1 \Big) \wedge \tfrac{p}{4}.
\]

As in the proof of Corollary 4.2 of Bigot et al. (2010) (see Theorem 6 in the “Appendix”), we have that for all η > 0 and any sequence of positive numbers L m , if 2 the penalty function pen : M → R+ is chosen to satisfy pen(m) ≥ (1+η+L m ) DNm δm for all m ∈ M, then for each x > 0 and p > 2, 



2 P H(f) ≥ 1 + η



x 2 δ N m

 p

≤ C( p, η)Eεi l2

 1 Tr(Pm ) ∨ 1 p , (14) p δ 2 m∈M m (L m Tr(Pm ) + x)

  

 where H (f) = f −  f2N − 2 + η4 inf m∈M d N2 (f, Lm ) + pen(m) . Let us first assume that our claims are true. Given θ > 0 one can find two positive numbers δ = δ(θ ) ≤ 21 and η = η(θ ) such that (1 + θ )(1 − 2δ) ≥ (1 + 2η). For such a δ , let  Nm be defined as: 2 2 δm,  Nm = { V N ≥ (1 − 2δ)δm }. 2q pen(m) ≥ We start by bounding E[f −  f N 1 Nm ]. On  Nm we know that  m) 2 (1 + 2η) Tr(P δ for all m ∈ M. Then, if we consider η = L m , we can m N . Hence, for some m ∈ M that minimizes M∗ = use the result (14) on  Nm 0  δ2 , and for p > 2(1 + q) we have that ∀x > 0 inf m∈M f − Pm f2 + Tr(Pm )

and p > 2:

123

N

N

m,V N

Estimation of covariance functions

165

2 δsup 1  q p, E[Hm 0 (f)1 Nm ] q ≤ N

6 of the “Appendix” and Hm 0 (f) = where p is defined in the Theorem    4 2 ∗  f − f N − κ(θ )M , with κ(θ ) = 2 + η (1 + θ ).  m) 2 Denoting m = inf m∈M f − Pm f2N + Tr(P N δm . If q ≤ 1 by a convexity argument and Jensen’s inequality we deduce that: 

E[f −  f N 1 Nm ] 2q

1 q

  q ! q1 2 δsup Tr(Pm 0 ) 2  δm 0 ,V N ≤ C (θ, q) E f − Pm 0 f2N + + p N N   ! 2 δsup Tr(Pm ) 2 ≤ C (θ, q) f − Pm f2N + ] + p E[ δm, . V N N N Using Claim 3 we obtain:  1 q 2q E[f −  f N 1 Nm ] 

≤ C (θ, q) f

− Pm f2N

2 δsup Tr(Pm ) 2 δm + 2d N2 (f, V N ) + p + N N

 .

 1   q 2q . By Pythagoras Theorem and the definition Now we bound E f −  f N 1cN m 2 2 2   f = Pm  y we have that f − f ≤ f + ε . Then: N

N

N

    2q 2q 2q E f −  f N 1cN ≤ f N P[cNm ] + E ε N 1cN . m

Let h =

p 2q

m

> 1, by Holder’s inequality we have that:     1 2q p  1 E ε N 1cN ≤ E ε N h P1− h [ Nmc ] m

   p p Since E ε N ≤ E εi l2 for p ≥ 2, we use Claim 4 to obtain 

E f

2q − f N 1cN m



 ≤

C( p, δ(θ )) Nβ

 (1− 2q )   2q   p p 2q p f N + E εi l2 .

123

166

R. Biscay Lirio et al.

Assume that β(1 −

2q p )

≥ q, then

1     1  2q q   ( p, q, θ ) C q p 2q 2q p E f −  f N 1cN ≤ f N + E εi l2 m N ⎛  2 ⎞   p

2 ⎜ 2 E εi l2 δsup ⎜ f N + ≤ C ( p, q, θ ) 2 2 N ⎝ δsup δsup

p



 Therefore it remains to check if β 1 − p 2

2q p



⎟ ⎟. ⎠

≥ q:

p 4

1. When 2 ≤ p ≤ 4 then − 1 ≤ and β = 2p − 1: since by the hypotheses of Theorem 2. q ≤ 1 and p satisfies p ≥ 2(1 + 2q), we can deduceβ ≥ 2q and  then p  2q 2q 1 1 − p ≥ 2 , therefore we have that β 1 − p = 2 − 1 1 − 2q p ≥ q. 2. When p ≥ 4 then 2p −1 ≥

p 4

p of Theorem 4 : since by the hypotheses   2q p 6q, which imply that β(1− p ) = 4 1 − 2q p ≥ q.

therefore β =

2. p satisfies p ≥ 2(1+2q) ≥

Hence, by the Hermite–Hadamard’s inequality (see Proposition 7 in the “Appendix”), it holds that:      1  1    1  q q q 2q 2q 2q ( q1 −1)    E f − f N E f − f N 1 Nm ≤2 + E f − f N 1cN m ⎡   ⎢ Tr(Pm ) 2 2 ⎢ δm + 2d N2 (f, V N ) ≤ C(θ, q) ⎣ inf f − Pm f N + N m∈M ⎛ +



2 ⎜ 1 ⎢  − p −( p −1−q) δsup ⎜ E[εi  p ] q ⎢C (θ, p, q) δm D m 2 l2 ⎣ N ⎝ m∈M



+ C (θ, p, q)

p E[εi l2 ]

2−1

2 δsup

p

q



⎞⎤

2 ⎟⎥ ⎥ ⎥ + C (θ, p, q) f N ⎟⎥ . ⎦ δ 2 ⎠⎦ sup

2−1 p q p E[εi l ]

 Tr(Pm (I N ⊗)) Tr(Pm )

()Tr(Pm ) 2 Since δsup = ≤ λmaxTr(P = 1 then m) 1   p − p −( 2 −1−q) q , then f ( p, q) m∈M δm Dm

 1   q 2q E f −  f N  

2

2 δsup

  2 δ ) Tr(P sup m 2 + 2d N2 (f, V N ) + δm p , ≤ C inf f − Pm f2N + N N m∈M

123



Estimation of covariance functions

where

167



( p ) = C (θ, p, q) q

p E[εi l2 ]



p −p δm Tr(Pm )−( 2 −1−q)

m∈M

f2 + 2N δsup

 .

7.2 Proof of Claim 3 Let V N be the orthogonal projection matrix of R N onto V N ⊂ R N . Let F = (f1 , . . . , f N ) and E = (ε1 , . . . , ε N ) be matrices such that F, E ∈ Rk×N , we have the following decomposition:     N  N 1 2  δm, (fi − F iV N )(fi − F iV N ) V N = (N −Tr( ))Tr(P ) Tr Pm I N ⊗ N m VN i=1





+Tr Pm

N 1  (εi − E iV N )(εi − E iV N ) IN ⊗ N



i=1





+ Tr Pm

N 1  (fi − F iV N )(εi − E iV N ) IN ⊗ N

 .

i=1

Taking the expectation on both sides of the previous equality, we have that:

2 E[ δm, VN ] =

    N N 1  i i  Tr Pm I N ⊗ (fi −F V N )(fi −F V N ) (N −Tr( V N ))Tr(Pm ) N i=1





+Tr Pm



N  1   IN ⊗ E (εi − E iV N )(εi − E iV N ) N

.

(15)

i=1

Note that the first term of the right side of (15) can be bounded by:    N N 1  i i  Tr Pm I N ⊗ (fi − F V N )(fi − F V N ) (N − Tr( V N ))Tr(Pm ) N i=1



N ≤2 Tr Tr(Pm ) =2



N 1  (fi − F iV N )(fi − F iV N ) N



i=1

N N f − vec(F V N )2N = 2 d N (f, V N ), Tr(Pm ) Tr(Pm )

123

168

R. Biscay Lirio et al.

where V N is a subspace of R N k , with dim(V N ) ≤ [ N2 ]k. Hence:    N N 1  Tr Pm I N ⊗ (fi − F iV N )(fi − F iV N ) (N − Tr( V N ))Tr(Pm ) N i=1

≤2

N d N (f, V N ). Tr(Pm )

(16)

Now we bound the second termof the right side of (15). First, we compute the com N E (εi − E iV N )(εi − E iV N ) . Note that: ponent h j of matrix N1 i=1 

 N    1   i i  E (εi −E V N )(εi − E V N ) = E < ε h − V N ε h ; ε j − V N ε j > N , N i=1

hj

where h, j = 1, . . . , k and ε h = (ε1h , . . . , ε hN ) ∈ R N . Using that ε h − V N ε h and V N ε j are orthogonals, we obtain that:   E < ε h − V N ε h ; ε j − V N ε j > N  N  N  N  j  1 1   h j = E εi εi − E εi it εth N N i=1

t=1

i=1

N    1  j j = E ε1h ε1 − E ε1h ε1 ii N i=1

    Tr( V N ) j E ε1h ε1 , = 1− N  therefore

1 N

  i )(ε − E i ) E (ε − E i i i=1 VN VN

N

hj

 = 1−

Tr( V N ) N

   j E ε1h ε1 .

Finally:    N  N 1 1   i i  Tr Pm I N ⊗ E (εi − E V N )(εi − E V N ) (N − Tr( V N )) Tr(Pm ) N i=1

1 = Tr(Pm )

  Tr Pm I N ⊗

   Tr( V N ) N 2 1−  = δm . (N − Tr( V N )) N

(17)

N 2 2 2 Hence, using (16) and (17) we obtain that E[ δm, V N ] ≤ 2 Tr(Pm ) d N (f, V N ) + δm . This concludes the proof of Claim 3.

123

Estimation of covariance functions

169

7.3 Proof of Claim 4 Let δy = 2  δm, VN =

   N i i  Tr Pm I N ⊗ N1 i=1 (yi −Y V )(yi −Y V ) N

2  δm, VN =

2 , then we can write  δm, V N as Therefore N

Tr(Pm )y−( V N ⊗Ik )y2N N 2 N −Tr( V N ) δy y − ( V N ⊗ Ik )y N .

 N δy f − ( V N ⊗ Ik )f2N + ε − ( V N ⊗ Ik )ε2N N − Tr( V N )  +2 f − ( V N ⊗ Ik )f; ε . (18)

2 For a N ∈ V⊥ N such that a N = 1 define

μN =

f−( V N ⊗Ik )f f−( V N ⊗Ik )f2N

if

f − ( V N ⊗ Ik )f2N = 0

aN

if

f − ( V N ⊗ Ik )f2N = 0.

Since 2| f − ( V N ⊗ Ik )f; ε N | = 2f − ( V N ⊗ Ik )f N | μ N ; ε N | ≤ f − ( V N ⊗ Ik )f2N + μ N ; ε2N ,

(19)

then from (18) and (19) we have that   N δy ε − ( V N ⊗ Ik )ε2N − μ N ; ε2N N − Tr( V N )    N δy ε2N − ( V N ⊗ Ik )ε2N + μ N ; ε2N = N − Tr( V N )   N  δy ε2N − ( V = ⊗ Ik )ε2N , N N − Tr( V N )

2  δm, VN ≥

 where V ⊗ Ik is the orthogonal projector of R N k onto the space V N ⊕ Rμ N . Hence, N we have that: 2 2 P[ δm, V N ≤ (1 − 2δ)δm ]     N 2 2  δy ε2N − ( V ≤ (1 − 2δ)δ ⊗ I )ε ≤P k m N N N − Tr( V N )    Tr( V N ) N Tr() N Tr() −δ 1− ≤ P δy ε2N ≤ Tr(Pm ) Tr(Pm ) N    Tr( V N ) Tr( V N )Tr() N Tr() 2  +δ 1− +P δy ( V N ⊗ Ik )ε N ≥ Tr(Pm ) Tr(Pm ) N

= P1 + P2 .

123

170

R. Biscay Lirio et al.

Now we bound P1 : . . N  . 2 Tr() . 2 Tr()  Tr( ) N N . . V N 1− εi l22 − P1 ≤ P . . .≥δ . Tr(Pm )δy . Tr(Pm )δy N i=1

Taking into account that

δy ≤

λmax (Pm )N Tr



1 N

N

i=1 (yi

− Y iV N )(yi − Y iV N )

Tr(Pm )y − ( V N

⊗ Ik )y2N

 ≤

N , (20) Tr(Pm )

and that 

Tr( V N ) 1− N

 ≥

1 , 2

(21)

then, by Markov’s inequality, . . N  . . N . . 2 P1 ≤ P . εi l2 − N Tr(). ≥ δTr() . . 2 i=1 ⎡. .p⎤ N  . . 2 p p p . . εi l22 − Tr() . ⎦ . ≤ C( p)δ − 2 N − 2 Tr()− 2 E ⎣. . . i=1

When p ≥ 4, using Rosenthal’s inequality (see Theorem 8 in the “Appendix”), we obtain: ⎡. .p⎤ N  . . 2 p . . p εi l22 − Tr() . ⎦ ≤ C ( p)N 4 E[ε1 l2 ]. E ⎣. . . i=1

On the other hand, when 2 ≤ p ≤ 4 using Theorem 9 in the “Appendix” we obtain: ⎡. .p⎤  . N  N . . 2 . p !  . . . .2 p εi l22 − Tr() . ⎦ ≤ 8 E . εi l22 − Tr() . ≤ C ( p)N E[ε1 l2 ]. E ⎣. . . i=1

i=1

Finally, p



P1 ≤ C ( p)δ where β = min

123

p

p 4, 2

− 2p

 −1 .

Tr()

− 2p

E[ε1 l2 ] Nβ

p

≤ C( p)δ

− 2p

E[ε1 l2 ] Nβ

,

(22)

Estimation of covariance functions

171

Now we bound P2 . Taking into account (20) and ( 21) we have that:    Tr( V N ⊗ Ik ) Tr( V N ⊗ Ik ) Tr()  1− . P2 ≤ P ε ( V ⊗ I )ε ≥ Tr()+δ N k k N k k Nk By Proposition 4.3 in Bigot et al. (2010) (see Theorem 5 in the “Appendix”), for 2 =    A = V ⊗ Ik , ρ( V ⊗ Ik ) = 1 and  δm N N

Tr((  V N ⊗Ik )(I N ⊗)) Tr(  V N ⊗Ik )

, we have that for all

x > 0:  / 2 2       P ε  ( V ⊗ I )ε ≥ δ Tr( ⊗ I ) + 2 δ k k VN m m Tr( V N ⊗ Ik )ρ( V N ⊗ Ik )x N   p   Eε1 l2 Tr V ⊗ Ik N 2   p .  +2x δm ρ( V ⊗ Ik ) ≤ C ( p) N p   δm ρ V x2 ⊗ I k N Note that / 2    Tr(( V ⊗ I )(I ⊗ )) + 2 δ k N m Tr( V N ⊗ Ik )x N     δ 2 2 2    Tr(( V N ⊗ Ik )(I N ⊗ )) + 2 + δ x +2x δm ≤ 1 + 2 δ m  ⊗ Ik = ( V N ⊗ Ik ) + Q, where Q is the orthogonal projector of R N k onto and V N Rμ N , therefore Tr(Q) = 1. Using that δ < 1 and taking   Tr( V N ⊗ Ik ) Tr( V N ⊗ Ik ) Tr() Tr() + δ N k 1− k k Nk   ! δ Tr((( V N ⊗ Ik ) + Q)(I N ⊗ )) − 1+ 2 δ[Tr( V N ⊗ Ik ) + 1] × 2(δ + 1)Tr((( V N ⊗ Ik ) + Q)(I N ⊗ )) 

x=



δ2 N k , 18

we obtain that:   p p  Eε1 l2 Tr V ⊗ Ik Eε1 l2 (kTr( V N ) + 1) N P2 ≤ .    p ≤ C ( p)   p  p p Nk 2 2 2 δ p Nk 2    δm ρ V δ ⊗ I k m N 18 18 Since 2  δm =

Tr((( V N ⊗ Ik ) + Q)(I N ⊗ )) Tr() ≥ Tr(( V N ⊗ Ik ) + Q) k + Tr( 1V

≥ N

)

Tr() , k+1

123

172

R. Biscay Lirio et al.

then we have that p

Eε1 l (kTr( V N ) + 1) p p P2 ≤ C ( p)  2  p ≤ C ( p)δ − p Eε1 l2 N 1− 2 p   Tr() 2 p N k 2 δ 18 k+1

p

≤ C( p)δ − p

Eε1 l2 Nβ

.

(23)

Inequalities (22) and (23) allow us to conclude that Claim 4 holds. Acknowledgments

The authors would like to thank the referees for their valuable comments.

Appendix In this section we recall some of the inequalities used in the proof of our results. The next theorem is the Proposition 4.3 given in Bigot et al. (2010), which is a k-variate extension of the Corollary 5.1 in Baraud (2000) (which is recovered for the particular case k = 1). Theorem 5 Given N , k ∈ N , let  A ∈ R N k×N k \{0} be a non-negative definite and symmetric matrix, and let ε1 , . . . , ε N be i.i.d random vectors in Rk , with E(ε1 ) = 0 √ Tr ( A(I N ⊗))  2  . and V(ε1 ) = . Denote ε = (ε1 , . . . , ε N ) , ζ (ε) = ε Aε, and δ∗ = Tr ( A) p Then, for all p ≥ 2, such that Eε1 l2 < ∞ it holds that for all x > 0,   p   /     Eε1 l Tr  A     2 P ζ (ε) ≥ δ∗2 Tr  A + 2δ∗2 Tr  A ρ  A x + δ∗2 ρ  A x ≤ C( p) p ,   p δ∗ ρ  A x2

  where ρ  A is the spectral norm of  A. The following result is the Corollary 4.2 that appears in Bigot et al. (2010), which constitutes also a natural extension of Corollary 3.1 in Baraud (2000), providing a similar bound as in Gendre (2008). Theorem 6 Let q > 0 be given such that there exists p > 2(1 + q) satisfying p Eεi l2 < ∞. Then, for some constants K (θ ) > 1 we have that  1 2q q ≤2 Ef −  f N





1 q −1 +





δ 2 Dm f − Pm f2N + m N m∈M

K (θ ) inf

 +

 p 2 δsup , N

where  q p

123

=

p C( p, q, θ )Eεi l2



m∈M

p − p −( −1−q ) δm D m 2

 .

Estimation of covariance functions

173

Proposition 7 (Hermite Hadamard’s Inequality) For all convex functions f : [a, b] → R is known that:  f

a+b 2



1 ≤ b−a

0b f (x)d x ≤

f (a) + f (b) . 2

a

Now we recall two moment inequalities for sum of independent centered random variables, which are repeatedly used throughout this paper. Theorem 8 (Rosenthal’s Inequality) Let U1 , U2 , . . . Un be independent centered random variables with values in R. Then for any p ≥ 2 we have: ⎛ .   n p⎞ . n n 2 . . p   . . Ui . ≤ C( p) ⎝ E[|Ui | p ] + E[Ui2 ] ⎠ . E . . . i=1

i=1

i=1

For the proof of this inequality, we refer to Petrov (1995). The next result explores the case where p ∈ [1, 2]. To our knowledge the result is due to Bahr and Esseen (1965). Theorem 9 Let U1 , U2 , . . . , Un be independent centered random variables with values R. For any p with p ∈ [1, 2] it holds that: .  . n n . . p  . . E . Ui . ≤ 8 E[|Ui | p ]. . . i=1

i=1

References

Baraud Y (2000) Model selection for regression on a fixed design. Probab Theory Relat Fields 117(4):467–493
Bigot J, Biscay R, Loubes J-M, Muñiz-Alvarez L (2010) Nonparametric estimation of covariance functions by model selection. Electron J Stat 4:822–855
Bigot J, Biscay R, Loubes J-M, Muñiz-Alvarez L (2011) Group lasso estimation of high-dimensional covariance matrices. J Mach Learn Res 12:3187–3225
Biscay R, Lescornel H, Loubes J-M (2012) Adaptive covariance estimation with model selection. Math Methods Stat 21:283–297
Cressie NAC (1993) Statistics for spatial data. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. Wiley, New York. Revised reprint of the 1991 edition, a Wiley-Interscience Publication
Gendre X (2008) Simultaneous estimation of the mean and the variance in heteroscedastic Gaussian regression. Electron J Stat 2:1345–1372
Guillot G, Senoussi R, Monestiez P (2000) A positive definite estimator of the non stationary covariance of random fields. In: Monestiez P, Allard D, Froidevaux R (eds) GeoENV2000: third European conference on geostatistics for environmental applications. Kluwer, Dordrecht
Hall P, Fisher N, Hoffmann B (1994) On the nonparametric estimation of covariance functions. Ann Statist 22(4):2115–2134
Krige DG (1951) A statistical approach to some basic mine valuation problems on the Witwatersrand. J Chem Metall Min Soc S Afr 52(6):119–139


Matsuo T, Nychka D, Paul D (2011) Nonstationary covariance modeling for incomplete data: Monte Carlo EM approach. Comput Stat Data Anal 55:2059–2073
Petrov VV (1995) Limit theorems of probability theory: sequences of independent random variables. Oxford Studies in Probability 4, Oxford Science Publications. The Clarendon Press, Oxford University Press, New York
Ripley BD (2004) Spatial statistics. Wiley, Hoboken, NJ
Sampson PD, Guttorp P (1992) Nonparametric representation of nonstationary spatial covariance structure. J Am Stat Assoc 87:108–119
Seber GAF (2008) A matrix handbook for statisticians. Wiley Series in Probability and Statistics. Wiley-Interscience, Hoboken, NJ
Shapiro A, Botha JD (1991) Variogram fitting with a general class of conditionally nonnegative definite functions. Comput Stat Data Anal 11:87–96
Stein ML (1999) Interpolation of spatial data: some theory for kriging. Springer Series in Statistics. Springer, New York, NY
von Bahr B, Esseen CG (1965) Inequalities for the rth absolute moment of a sum of random variables, 1 ≤ r ≤ 2. Ann Math Stat 36:299–303
