The Annals of Statistics 2007, Vol. 35, No. 3, 1052–1079. DOI: 10.1214/009053606000001505. © Institute of Mathematical Statistics, 2007

arXiv:0708.2207v1 [math.ST] 16 Aug 2007

STATISTICAL INFERENCES FOR FUNCTIONAL DATA

By Jin-Ting Zhang and Jianwei Chen

National University of Singapore and University of Rochester

With modern technology development, functional data are observed frequently in many scientific fields. A popular method for analyzing such functional data is "smoothing first, then estimation." That is, statistical inference such as estimation and hypothesis testing about functional data is conducted based on the substitution of the underlying individual functions by their reconstructions, obtained by one smoothing technique or another. However, little is known about this substitution effect on functional data analysis. In this paper this problem is investigated when the local polynomial kernel (LPK) smoothing technique is used for individual function reconstructions. We find that under some mild conditions, the substitution effect can be ignored asymptotically. Based on this, we construct LPK reconstruction-based estimators for the mean, covariance and noise variance functions of a functional data set and derive their asymptotics. We also propose a GCV rule for selecting good bandwidths for the LPK reconstructions. When the mean function also depends on some time-independent covariates, we consider a functional linear model where the mean function is linearly related to the covariates but the covariate effects are functions of time. The LPK reconstruction-based estimators for the covariate effects and the covariance function are also constructed and their asymptotics are derived. Moreover, we propose an L2-norm-based global test statistic for a general hypothesis testing problem about the covariate effects and derive its asymptotic random expression. The effect of the bandwidths selected by the proposed GCV rule on the accuracy of the LPK reconstructions and the mean function estimator is investigated via a simulation study. The proposed methodologies are illustrated via an application to a real functional data set collected in climatology.

Received April 2004; revised May 2006.
Supported by the National University of Singapore Academic Research Grant R-155000-038-112.
AMS 2000 subject classifications. Primary 62G07; secondary 62G10, 62J12.
Key words and phrases. Asymptotic Gaussian process, asymptotic normal distribution, functional data, hypothesis test, local polynomial smoothing, nonparametric estimation, reconstructed individual functions, root-n consistent.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2007, Vol. 35, No. 3, 1052–1079. This reprint differs from the original in pagination and typographic detail.


1. Introduction. Functional data consist of functions which are often smooth but usually corrupted with noise. With modern technology development, such functional data are observed frequently in many scientific fields; see Besse and Ramsay [3], Ramsay [20] and Ramsay and Dalzell [21], among others, for good examples and analyses. Comprehensive surveys of functional data analysis (FDA) can be found in [23, 24]. Mathematically, the above-mentioned functional data may be modeled as independent realizations of an underlying stochastic process,

(1.1)    y_i(t) = η(t) + v_i(t) + ε_i(t),    i = 1, 2, . . . , n,

where η(t) models the population mean function of the stochastic process, v_i(t) is the ith individual variation (subject-effect) from η(t), ε_i(t) is the ith measurement error process and y_i(t) is the ith response process. Without loss of generality, throughout this paper we assume the stochastic process has finite support, that is, t ∈ T = [a, b], −∞ < a < b < ∞. Moreover, we assume v_i(t) and ε_i(t) are independent, and are independent copies of v(t) ∼ SP(0, γ) and ε(t) ∼ SP(0, γ_ε) with γ_ε(s, t) = σ²(t) 1_{s=t}, respectively, where and throughout SP(η, γ) denotes a stochastic process with mean function η(t) and covariance function γ(s, t). It follows that the underlying individual functions (trajectories) f_i(t) = E{y_i(t)|v_i(t)} = η(t) + v_i(t) are i.i.d. copies of the underlying stochastic process f(t) = η(t) + v(t) ∼ SP(η, γ). In practice, functional data are observed discretely. Let t_ij, j = 1, 2, . . . , n_i, be the design time points of the ith subject. Then by (1.1), letting y_ij = y_i(t_ij) and ε_ij = ε_i(t_ij), we have

(1.2)    y_ij = η(t_ij) + v_i(t_ij) + ε_ij,    j = 1, 2, . . . , n_i; i = 1, 2, . . . , n.
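To fix ideas, the following Python sketch simulates discrete functional data of the form (1.2). The mean function, subject-effect basis and all parameter values are illustrative assumptions, not those used in the simulation study of Section 4.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_functional_data(n=30, n_i=40, sigma_eps=0.3):
    """Simulate y_ij = eta(t_ij) + v_i(t_ij) + eps_ij as in model (1.2).

    Returns a list of (t_i, y_i) pairs, one per subject, with random
    design time points on [0, 1].  All choices below are illustrative.
    """
    data = []
    for _ in range(n):
        t = np.sort(rng.uniform(0.0, 1.0, size=n_i))      # design points t_ij
        eta = 1.0 + np.sin(2 * np.pi * t)                 # assumed mean function
        # subject effect v_i(t): random intercept plus two smooth basis terms
        b = rng.normal(0.0, [0.8, 1.0, 0.5])
        v = b[0] + b[1] * np.cos(2 * np.pi * t) + b[2] * np.sin(2 * np.pi * t)
        eps = rng.normal(0.0, sigma_eps, size=n_i)        # measurement errors
        data.append((t, eta + v + eps))
    return data

data = simulate_functional_data()
```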

In many practical situations, the above discrete functional data (1.2) have to be first registered before any statistical inference can be conducted. Methods for curve registration can be found in Kneip and Gasser [19], Kneip and Engel [18], Silverman [26], Ramsay and Silverman ([23], Chapter 5), Ramsay and Li [22] and Ramsay and Silverman ([24], Chapter 7), among others. In this paper, for convenience, we assume that the functional data (1.2) do not need registration or have been registered. Estimation of the population characteristics η(t), γ(s, t) and σ²(t) of the model (1.1) has been the focus of FDA in the literature. Most of the existing approaches involve one smoothing method or another. For example, Besse and Ramsay [3], Ramsay [20] and Ramsay and Dalzell [21] made use of reproducing kernel Hilbert space decomposition; Rice and Silverman [25] and Brumback and Rice [4] employed smoothing splines; Besse, Cardot and Ferraty [2] used B-splines; Hart and Wehrly [16] employed kernel smoothing; and Kneip [17] studied a principal components-based approach. Development of significance tests about η(t) and other population characteristics of


the model (1.1) is more important and challenging. Faraway [13] discussed the difficulties in extending some multivariate hypothesis testing procedures to FDA. Ramsay and Silverman [23] suggested a pointwise t-test or F-test but they did not discuss global tests. For curve data from stationary Gaussian processes, Fan and Lin [11] developed an adaptive Neyman test. In this paper we adopt the method of "smoothing first, then estimation" for functional data. That is, we construct the estimators for η(t), γ(s, t) and σ²(t) using the reconstructed individual functions f̂_i(t), i = 1, 2, . . . , n, obtained using one smoothing method or another; in particular, in this paper we use the local polynomial kernel (LPK) smoothing technique as described in [10], among others. The idea of "smoothing first, then estimation" itself is hardly new since it has been used in the literature; see [23, 24] and the references therein. What is new here is that we investigate the effect of the substitution of the underlying individual functions f_i(t), i = 1, 2, . . . , n, by their LPK reconstructions in FDA. We show that, under some mild conditions, the effect of such a substitution is asymptotically ignorable in FDA. Based on this, we derive the asymptotics of the estimators η̂(t), γ̂(s, t) and σ̂²(t). In particular, under some mild conditions, we show that: (1) η̂(t) and γ̂(s, t) are √n-consistent and asymptotically Gaussian; (2) the asymptotic efficiency of η̂(t) will not be affected by a better choice of the bandwidth than the bandwidth selected by a GCV rule; and (3) the convergence rate of σ̂²(t) is affected by the convergence rate of the LPK reconstructions. More details about these results are given in Section 2.

In the model (1.1) the only covariate for the mean function η(t) is time. In many applications η(t) may also depend on some time-independent covariates and can be written as η(t; x) = x^T β(t), where the covariate vector x = [x_1, . . . , x_q]^T and the unknown but smooth coefficient function vector β(t) = [β_1(t), . . . , β_q(t)]^T. A replacement of η(t) by η(t; x_i) = x_i^T β(t) in (1.1) leads to the so-called functional linear model

(1.3)    y_i(t) = x_i^T β(t) + v_i(t) + ε_i(t),    i = 1, 2, . . . , n,

where y_i(t), v_i(t) and ε_i(t) are the same as those defined in (1.1). The ignorability of the substitution effect also applies to the LPK reconstructions f̂_i(t) of the individual functions f_i(t) = x_i^T β(t) + v_i(t) of the above model. Based on this, we construct the estimators β̂(t) and γ̂(s, t) and investigate their asymptotics; in particular, we show that β̂(t) is √n-consistent and asymptotically Gaussian. Moreover, we propose a global L2-norm-based test statistic T_n for a general hypothesis testing problem about the covariate effects β(t); its asymptotic random expression is derived. More details about these results are given in Section 3. The rest of the paper is organized as follows. In Section 4 we present a simulation study which aims to investigate the effect of the bandwidth choice


on the accuracy of the LPK reconstructions f̂_i(t) and the mean function estimator η̂(t). In Section 5 we illustrate the proposed methodologies by applying them to a real functional data set collected in climatology. Finally, in Section 6 technical proofs of some asymptotic results are outlined.

2. Basic methodologies.

2.1. LPK reconstruction of individual functions. First of all, we describe how to reconstruct the individual functions f_i(t), i = 1, 2, . . . , n, using the LPK smoothing technique based on the standard nonparametric regression model

(2.1)    y_ij = f_i(t_ij) + ε_ij,    j = 1, 2, . . . , n_i; i = 1, 2, . . . , n.

For any fixed time point t, assume f_i(t) has a (p + 1)th continuous derivative in a neighborhood of t for some positive integer p. Then by Taylor's expansion, f_i(t_ij) can be locally approximated by a p-order polynomial in the neighborhood of t, that is,

f_i(t_ij) ≈ f_i(t) + (t_ij − t) f_i^{(1)}(t) + · · · + (t_ij − t)^p f_i^{(p)}(t)/p! = z_ij^T α_i,

where α_i = [α_i0, α_i1, . . . , α_ip]^T with α_ir = f_i^{(r)}(t)/r!, and z_ij = [1, t_ij − t, . . . , (t_ij − t)^p]^T. Then the p-order LPK reconstructions of f_i(t) are defined as f̂_i(t) = α̂_i0 = e_{1,p+1}^T α̂_i, where and throughout e_{r,s} denotes the s-dimensional unit vector whose rth component is 1 and others are 0, and the α̂_i are the minimizers of the weighted least squares criterion

(2.2)    ∑_{i=1}^n ∑_{j=1}^{n_i} [y_ij − z_ij^T α_i]² K_h(t_ij − t) = ∑_{i=1}^n (y_i − Z_i α_i)^T K_ih (y_i − Z_i α_i),

where y_i = [y_i1, . . . , y_in_i]^T, Z_i = [z_i1, . . . , z_in_i]^T and K_ih = diag(K_h(t_i1 − t), . . . , K_h(t_in_i − t)), with K_h(·) = K(·/h)/h, obtained by rescaling a kernel function K(·) (often a symmetric p.d.f.) with a bandwidth h > 0 that controls the size of the associated neighborhood. Minimizing (2.2) with respect to α_i, i = 1, 2, . . . , n, is equivalent to minimizing the ith term in the summation on the right-hand side of (2.2) with respect to α_i for each i = 1, 2, . . . , n. It follows that for i = 1, 2, . . . , n,

(2.3)    f̂_i(t) = e_{1,p+1}^T (Z_i^T K_ih Z_i)^{−1} Z_i^T K_ih y_i = ∑_{j=1}^{n_i} K_h^{n_i}(t_ij − t) y_ij,
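As a concrete illustration of (2.2) and (2.3), the following Python sketch computes the p-order LPK reconstruction of a single curve by solving the weighted least squares problem at each evaluation point. The Gaussian kernel is an illustrative choice, and no numerical refinements (e.g., for sparse neighborhoods near the boundary) are attempted.

```python
import numpy as np

def lpk_reconstruct(t_obs, y_obs, t_grid, h, p=1):
    """p-order local polynomial kernel reconstruction f_hat_i(t) of one curve.

    Solves the weighted least squares problem (2.2) at every point of
    t_grid and returns the intercept alpha_hat_{i0}, as in (2.3).
    """
    t_grid = np.asarray(t_grid, dtype=float)
    f_hat = np.empty_like(t_grid)
    for k, t0 in enumerate(t_grid):
        d = t_obs - t0
        w = np.exp(-0.5 * (d / h) ** 2) / h          # Gaussian kernel K_h(d)
        Z = np.vander(d, N=p + 1, increasing=True)   # columns 1, d, ..., d^p
        WZ = Z * w[:, None]
        alpha = np.linalg.solve(Z.T @ WZ, WZ.T @ y_obs)
        f_hat[k] = alpha[0]                          # f_hat_i(t0) = alpha_i0
    return f_hat
```

Applied subject by subject with a common bandwidth, this yields the reconstructions f̂_i(t) used throughout.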


where the K_h^{n_i}(·) are known as the empirical equivalent kernels for the p-order LPK; see Fan and Gijbels [10]. In (2.2) different bandwidths may be used for different individual functions. However, the individual functions in a functional data set are i.i.d. realizations of a stochastic process, and hence often admit similar smoothness properties and sometimes similar shapes [17, 18]; it is then reasonable to treat them in the same way, for example, using a common bandwidth for all of them. The advantages of using a common bandwidth include at least the following: (a) it reduces the computational effort for bandwidth selection; and (b) it simplifies the asymptotic results of the estimators. For convenience, we define the following widely used functionals of a kernel K:

(2.4)    B_r(K) = ∫ K(t) t^r dt,    V(K) = ∫ K(t)² dt,    K^{(1)}(t) = ∫ K(s) K(s + t) ds.

For estimating a function instead of its derivatives, Fan and Gijbels [10] pointed out that even orders are not appealing. Therefore, throughout this paper, we assume p is an odd integer; moreover, we denote by γ_{k,l}(s, t) the (k, l)-times partial derivative of γ(s, t), that is, γ_{k,l}(s, t) = ∂^{k+l} γ(s, t)/(∂s^k ∂t^l), and denote by D the set of all the design time points t_ij, j = 1, 2, . . . , n_i; i = 1, 2, . . . , n. In addition, we write O_{UP}(1) [resp. o_{UP}(1)] for "bounded (resp. tends to 0) in probability uniformly for any t within the interior of T and all i = 1, 2, . . . , n." Finally, the following regularity conditions are imposed.

Condition A.

1. The design time points t_ij, j = 1, 2, . . . , n_i; i = 1, 2, . . . , n, are i.i.d. with p.d.f. π(·) which has the bounded support T = [a, b]. For any given t within the interior of T, π′(t) exists and is continuous over T.
2. Let s and t be any two interior time points of T. The individual functions f_i(t), i = 1, 2, . . . , n, and their mean function η(t) have up to (p + 1)-times continuous derivatives. Their covariance function γ(s, t) has up to (p + 1)-times continuous derivatives in both s and t. The variance function of the measurement errors, σ²(t), is continuous at t.
3. The kernel K is a bounded symmetric p.d.f. with bounded support [−1, 1].
4. There are two positive constants C and δ such that n_i ≥ C n^δ for all i = 1, 2, . . . , n. As n → ∞, we have h → 0 and n^δ h → ∞.


Remark 1. For some practical functional data sets, Condition A4 may be too restrictive. For example, a functional data set with a few individual functions having n_i < C n^δ does not satisfy Condition A4. However, such a functional data set can often be slightly modified to satisfy Condition A4. A simple way of doing so is to drop those individual functions having n_i < C n^δ so that the remaining individual functions form a new functional data set which satisfies Condition A4. This procedure will not result in less efficient estimators when ñ/n → 0 and will not affect the consistency of the estimators when (n − ñ) → ∞, where ñ is the number of dropped individual functions, which may be bounded or tend to ∞ as n → ∞.

Using Lemma 3 in Section 6, it is easy to show the following.

Theorem 1. Assume Condition A is satisfied. Then the average conditional MSE (mean squared error) of the p-order LPK reconstructions f̂_i(t), i = 1, 2, . . . , n, is

(2.5)    E{ n^{−1} ∑_{i=1}^n [f̂_i(t) − f_i(t)]² | D }
         = { B²_{p+1}(K*) [(η^{(p+1)}(t))² + γ_{p+1,p+1}(t, t)] / [(p + 1)!]² · h^{2(p+1)}
           + V(K*) σ²(t) / π(t) · (m̃h)^{−1} } [1 + o_P(1)],

where K* is the equivalent kernel of the p-order LPK ([10], page 64), and m̃ = (n^{−1} ∑_{i=1}^n n_i^{−1})^{−1}.

Remark 2. On the left-hand side of (2.5) the notation E{·|D} denotes the conditional expectation when all the design time points t_ij, j = 1, 2, . . . , n_i; i = 1, 2, . . . , n, are given. Nevertheless, the leading term on the right-hand side of (2.5) is independent of D and hence the left-hand side is nearly unconditional. For technical convenience and following the literature tradition (e.g., [10]), we keep using the "conditional expectation" notation E{·|D} here and throughout. This remark applies to all other statistical operations conditional on D given in this paper.

Theorem 1 indicates that the optimal bandwidth of the p-order LPK reconstructions f̂_i(t) is h = O_P(m̃^{−1/(2p+3)}) = O_P(n^{−δ/(2p+3)}). Using Lemma 3 again, we can show the following.

Theorem 2. Assume Condition A is satisfied. Then for the p-order LPK reconstructions f̂_i(t) using the bandwidth h = O(n^{−δ/(2p+3)}), we have

(2.6)    f̂_i(t) = f_i(t) + n^{−(p+1)δ/(2p+3)} O_{UP}(1),    i = 1, 2, . . . , n.


Theorem 2 implies that, under the given conditions, the LPK reconstructions f̂_i(t) are asymptotically uniformly little different from the underlying individual functions f_i(t). We expect that this is true not only for LPK but also for any other linear smoothers, for example, smoothing splines [14, 27], regression splines or orthogonal series [7], among others.

2.2. Estimation of the mean and covariance functions. It is then natural to estimate the mean function η(t) and the covariance function γ(s, t) by the sample mean and sample covariance functions of the p-order LPK reconstructions f̂_i(t),

(2.7)    η̂(t) = n^{−1} ∑_{i=1}^n f̂_i(t),
         γ̂(s, t) = (n − 1)^{−1} ∑_{i=1}^n {f̂_i(s) − η̂(s)}{f̂_i(t) − η̂(t)}.
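On a common grid, (2.7) reduces to a row mean and a sample covariance of the reconstructed curves. A minimal sketch, assuming the lpk_reconstruct function from the earlier snippet and a list data of (t_i, y_i) pairs:

```python
import numpy as np

def mean_cov_estimates(data, t_grid, h, p=1):
    """Sample mean and covariance functions (2.7) of the LPK reconstructions.

    data is a list of (t_obs, y_obs) pairs; every curve is reconstructed on
    the common grid t_grid with a common bandwidth h.
    """
    F = np.vstack([lpk_reconstruct(t, y, t_grid, h, p) for t, y in data])
    eta_hat = F.mean(axis=0)                 # eta_hat(t) on the grid
    R = F - eta_hat                          # centered reconstructions
    gamma_hat = R.T @ R / (len(data) - 1)    # gamma_hat(s, t) on the grid
    return eta_hat, gamma_hat
```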

The asymptotic conditional bias, covariance and variance of η̂(t) are given below.

Theorem 3. Assume Condition A is satisfied. Then as n → ∞, the asymptotic conditional bias, covariance and variance of η̂(t) are

Bias{η̂(t)|D} = B_{p+1}(K*) η^{(p+1)}(t) / (p + 1)! · h^{p+1} [1 + o_P(1)],

Cov{η̂(s), η̂(t)|D} = γ(s, t)/n + { K^{*(1)}[(s − t)/h] σ²(s) / π(t) · (nm̃h)^{−1}
                   + B_{p+1}(K*) [γ_{p+1,0}(s, t) + γ_{0,p+1}(s, t)] / (p + 1)! · n^{−1} h^{p+1} } [1 + o_P(1)],

Var{η̂(t)|D} = γ(t, t)/n + { V(K*) σ²(t) / π(t) · (nm̃h)^{−1}
             + B_{p+1}(K*) [γ_{p+1,0}(t, t) + γ_{0,p+1}(t, t)] / (p + 1)! · n^{−1} h^{p+1} } [1 + o_P(1)].

Remark 3. Under Condition A and by Theorem 3, we have

(2.8)    MSE{η̂(t)|D} = γ(t, t)/n + O_{UP}{h^{2(p+1)} + (nm̃h)^{−1} + n^{−1} h^{p+1}}.


We then always have MSE{η̂(t)|D} = γ(t, t)/n + o_{UP}(1/n), provided that

(2.9)    m̃h → ∞,    nh^{2(p+1)} → 0.

Remark 4. Condition (2.9) is satisfied by any bandwidth h = O(n^{−δ*}) with 1/[2(p + 1)] < δ* < δ. In particular, it is satisfied by the optimal bandwidth, h = O(n^{−δ/(2p+3)}), for the p-order LPK reconstructions f̂_i(t) when δ > 1 + 1/[2(p + 1)]. In this case, the p-order LPK reconstruction optimal bandwidth is sufficiently small to guarantee the √n-consistency of η̂(t). Condition (2.9) is also satisfied by the optimal bandwidth for η̂(t), h = O(n^{−(1+δ)/(2p+3)}) [when 1 + 1/[2(p + 1)] < δ < 1 + 1/(p + 1)] or h = O(n^{−δ/(p+2)}) [when δ > 1 + 1/(p + 1)]. It follows that, in both cases, the optimal bandwidth admits the same asymptotic efficiency for estimating η(t) because MSE{√n η̂(t)|D} → γ(t, t) as n → ∞.

By pretending all the underlying individual functions f_i(t) were observed, the "ideal" estimators of η(t) and γ(s, t) are

(2.10)    η̃(t) = n^{−1} ∑_{i=1}^n f_i(t),
          γ̃(s, t) = (n − 1)^{−1} ∑_{i=1}^n {f_i(s) − η̃(s)}{f_i(t) − η̃(t)}.

Theorem 4. Assume Condition A is satisfied, and the bandwidth h = O(n^{−δ/(2p+3)}) is used for the p-order LPK reconstructions f̂_i(t). Then as n → ∞, we have

(2.11)    η̂(t) = η̃(t) + n^{−(p+1)δ/(2p+3)} O_{UP}(1),
          γ̂(s, t) = γ̃(s, t) + n^{−(p+1)δ/(2p+3)} O_{UP}(1).

In addition, assume δ > 1 + 1/[2(p + 1)]. Then as n → ∞, we have

(2.12)    √n {η̂(t) − η(t)} ∼ AGP(0, γ),
          √n {γ̂(s, t) − γ(s, t)} ∼ AGP(0, γ*),

where AGP(η, γ) denotes an asymptotic Gaussian process with mean function η(t) and covariance function γ(s, t), and

(2.13)    γ*{(s_1, t_1), (s_2, t_2)} = E{v_1(s_1) v_1(t_1) v_1(s_2) v_1(t_2)} − γ(s_1, t_1) γ(s_2, t_2),

with v_1(t) denoting the subject effect of the first individual function f_1(t) as defined in (1.1). When the subject-effect process v(t) is Gaussian, γ*{(s_1, t_1), (s_2, t_2)} = γ(s_1, t_2) γ(s_2, t_1) + γ(s_1, s_2) γ(t_1, t_2).


Theorem 4 indicates that, under some mild conditions, the proposed estimators (2.7) are asymptotically identical to the "ideal" estimators (2.10). The required key condition is δ > 1 + 1/[2(p + 1)]. It follows that, to make the measurement errors ignorable via LPK smoothing, we need the number of measurements, n_i, for all the subjects (or a large number of subjects; see Remark 1 for discussion) to tend to infinity slightly faster than the number of subjects, n.

2.3. Estimation of the noise variance function. The noise variance function σ²(t) measures the variation of the measurement errors ε_ij of the model (1.2). Following Hall and Marron [15] and Fan and Yao [12], we can construct a p̃-order LPK estimator of σ²(t) based on the p-order LPK residuals ε̂_ij = y_ij − f̂_i(t_ij), although our setting is more complicated. As expected, the resulting p̃-order LPK estimator of σ²(t) will be consistent, but its convergence rate will be affected by that of the p-order LPK reconstructions f̂_i(t), i = 1, 2, . . . , n. As an illustration, let us consider the simplest LPK estimator, that is, the kernel estimator for σ²(t) based on the ε̂_ij,

(2.14)    σ̂²(t) = [ ∑_{i=1}^n ∑_{j=1}^{n_i} H_b(t_ij − t) ε̂²_ij ] / [ ∑_{i=1}^n ∑_{j=1}^{n_i} H_b(t_ij − t) ],

where H_b(·) = H(·/b)/b with the kernel function H and the bandwidth b. Pretending ε̂_ij ≡ ε_ij, by standard kernel estimation theory (Wand and Jones [28] and Fan and Gijbels [10], among others), the optimal bandwidth for σ̂²(t) is b = O_P(N^{−1/5}), where N = ∑_{i=1}^n n_i denotes the total number of measurements for all the subjects, and the associated convergence rate of σ̂²(t) is O_P(N^{−2/5}). However, for the current setup, this convergence rate will be affected by the convergence rate of the p-order LPK reconstructions f̂_i(t), i = 1, 2, . . . , n, since under Condition A and by Theorem 2, we actually only have ε̂_ij = ε_ij + n^{−(p+1)δ/(2p+3)} O_{UP}(1). For convenience, let ν_1(t) = E[ε²_i(t)] = σ²(t) and ν_2(t) = Var[ε²_i(t)].

Theorem 5. Assume Condition A is satisfied and the p-order LPK reconstructions f̂_i(t) use a bandwidth h = O(n^{−δ/(2p+3)}). In addition, assume ν_1′(t) and ν_2(t) exist and are continuous at t ∈ T, and the kernel estimator σ̂²(t) uses a bandwidth b = O(N^{−1/5}). Then we have

(2.15)    σ̂²(t) = σ²(t) + O_{UP}(n^{−2(1+δ)/5} + n^{−(p+1)δ/(2p+3)}).

By the above theorem, it is seen that when δ < 2(2p + 3)/(p − 1), the second order term dominates the first order term; in particular, when p = 1, we have σ̂²(t) = σ²(t) + O_{UP}(n^{−2δ/5}). In this case the optimal convergence rate of σ̂²(t) is not attainable. It is attainable only when δ > 2(2p + 3)/(p − 1), so that the first order term in (2.15) dominates the second order term. This is the case only when p ≥ 3. When p = 3, δ > 9 is required; and when p = 2k + 1 → ∞, δ > 4 is required. Therefore, it is usually difficult to make the convergence rate of σ̂²(t) unaffected by the convergence rate of the p-order LPK reconstructions f̂_i(t).
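A direct transcription of the kernel estimator (2.14) in Python might look as follows; the Gaussian kernel H is an illustrative choice, and f_hats is assumed to hold each curve's reconstruction evaluated at its own design time points.

```python
import numpy as np

def noise_variance_estimate(data, f_hats, t_grid, b):
    """Kernel estimator (2.14) of sigma^2(t) from the LPK residuals.

    data is a list of (t_obs, y_obs) pairs and f_hats the matching list of
    reconstructed values f_hat_i(t_obs); b is the bandwidth of the kernel H.
    """
    t_all = np.concatenate([t for t, _ in data])
    # squared residuals eps_hat_ij^2 = (y_ij - f_hat_i(t_ij))^2, pooled
    r2_all = np.concatenate([(y - f) ** 2 for (_, y), f in zip(data, f_hats)])
    sigma2 = np.empty(len(t_grid))
    for k, t0 in enumerate(t_grid):
        w = np.exp(-0.5 * ((t_all - t0) / b) ** 2)   # H_b up to a 1/b factor
        sigma2[k] = np.sum(w * r2_all) / np.sum(w)   # the 1/b factors cancel
    return sigma2
```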


2.4. Bandwidth selection. Theorem 1 suggests that we can choose a good bandwidth for the p-order LPK reconstructions f̂_i(t), i = 1, 2, . . . , n, using the generalized cross-validation (GCV) score

(2.16)    GCV(h) = n^{−1} ∑_{i=1}^n GCV_i(h),

where GCV_i(h) is the GCV score of the ith p-order LPK reconstruction f̂_i(t). Let A_i be the smoother matrix of the ith subject constructed using (2.3). Then we have ŷ_i = A_i y_i and GCV_i(h) = y_i^T (I_{n_i} − A_i)^T (I_{n_i} − A_i) y_i / [1 − tr(A_i)/n_i]², where y_i = [y_i1, . . . , y_in_i]^T, ŷ_i = [ŷ_i1, . . . , ŷ_in_i]^T and tr(S) denotes the trace of the matrix S. In practice, the optimal bandwidth h* can be obtained by minimizing GCV(h) over a number of bandwidth candidates of interest.
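A sketch of this selection rule, assuming the local polynomial machinery from Section 2.1; the smoother matrix A_i is assembled row by row from the equivalent-kernel weights in (2.3), and the Gaussian kernel is again an illustrative choice.

```python
import numpy as np

def smoother_matrix(t_obs, h, p=1):
    """Smoother matrix A_i of one subject, so that y_hat_i = A_i y_i."""
    n_i = len(t_obs)
    A = np.empty((n_i, n_i))
    for k, t0 in enumerate(t_obs):
        d = t_obs - t0
        w = np.exp(-0.5 * (d / h) ** 2) / h
        Z = np.vander(d, N=p + 1, increasing=True)
        WZ = Z * w[:, None]
        # row of equivalent-kernel weights: e_1^T (Z^T W Z)^{-1} Z^T W
        A[k] = np.linalg.solve(Z.T @ WZ, WZ.T)[0]
    return A

def gcv_score(data, h, p=1):
    """GCV(h) = n^{-1} sum_i GCV_i(h), as in (2.16)."""
    scores = []
    for t_obs, y in data:
        A = smoother_matrix(np.asarray(t_obs), h, p)
        resid = y - A @ y
        scores.append(np.sum(resid ** 2) / (1 - np.trace(A) / len(y)) ** 2)
    return np.mean(scores)

# h_star = min(candidates, key=lambda h: gcv_score(data, h))
```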

Theoretically, it is expected that h* = O_P(n^{−δ/(2p+3)}). Remark 4 states that, under the required conditions, the optimal bandwidth for η̂(t) and the optimal bandwidth for the p-order LPK reconstructions f̂_i(t) admit the same asymptotic efficiency for estimating η(t). Therefore, it is generally sufficient to use h* for estimating η(t), although, for finite samples, better bandwidth choices for η̂(t) are possible.

3. Functional linear models. Notice that Theorem 2 also applies to the p-order LPK reconstructions f̂_i(t) of the underlying individual functions f_i(t) = x_i^T β(t) + v_i(t), i = 1, 2, . . . , n, of the functional linear model (1.3). This property can be used to do inference about the model (1.3). In this section we focus on the estimation and significance tests of the coefficient function vector (covariate effects) β(t) of the model.

3.1. Coefficient function estimation. Let f̂(t) = [f̂_1(t), . . . , f̂_n(t)]^T and X = [x_1, . . . , x_n]^T. Throughout this paper we assume X has full rank. Then the least-squares estimator of β(t) is

(3.1)    β̂(t) = ( ∑_{i=1}^n x_i x_i^T )^{−1} ∑_{i=1}^n x_i f̂_i(t) = (X^T X)^{−1} X^T f̂(t),

which minimizes Q(β) = ∑_{i=1}^n ∫ [f̂_i(t) − x_i^T β(t)]² dt. It follows that the subject-effects v_i(t) can be estimated by v̂_i(t) = f̂_i(t) − x_i^T β̂(t), and their


covariance function γ(s, t) can be estimated by

(3.2)    γ̂(s, t) = (n − q)^{−1} ∑_{i=1}^n v̂_i(s) v̂_i(t) = (n − q)^{−1} v̂(s)^T v̂(t),

where v̂(t) = [v̂_1(t), v̂_2(t), . . . , v̂_n(t)]^T = f̂(t) − X(X^T X)^{−1} X^T f̂(t) = (I_n − P) f̂(t) and P = X(X^T X)^{−1} X^T is a projection matrix with P^T = P, P² = P and tr(P) = q. Pretending f_i(t), i = 1, 2, . . . , n, are known, the "ideal" estimators of β(t) and γ(s, t) are

(3.3)    β̃(t) = (X^T X)^{−1} X^T f(t),
         γ̃(s, t) = (n − q)^{−1} ṽ(s)^T ṽ(t),

where f(t) = [f_1(t), . . . , f_n(t)]^T and ṽ(t) = (I_n − P) f(t). It is easy to show that Eβ̃(t) = β(t) and Eγ̃(s, t) = γ(s, t). For further investigation, we impose the following conditions.

Condition B.

1. The covariate vectors x_i, i = 1, 2, . . . , n, are i.i.d. with finite and invertible second moment E x_1 x_1^T = Ω; moreover, they are uniformly bounded in probability, that is, x_i = O_{UP}(1).
2. The subject-effects v_i(t) are uniformly bounded in probability, that is, v_i(t) = O_{UP}(1).

Theorem 6. Assume Conditions A and B are satisfied, and the p-order LPK reconstructions f̂_i(t) use a bandwidth h = O(n^{−δ/(2p+3)}). Then as n → ∞, we have

(3.4)    β̂(t) = β̃(t) + n^{−(p+1)δ/(2p+3)} O_{UP}(1),
         γ̂(s, t) = γ̃(s, t) + n^{−(p+1)δ/(2p+3)} O_{UP}(1).

In addition, assume δ > 1 + 1/[2(p + 1)]. Then as n → ∞, we have

(3.5)    √n {β̂(t) − β(t)} ∼ AGP(0, γ_β),

where γ_β(s, t) = γ(s, t) Ω^{−1}.
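On a common grid, the estimators (3.1) and (3.2) amount to a single least squares solve. A minimal sketch, assuming F is the n × m matrix of reconstructed curves (e.g., stacked outputs of lpk_reconstruct):

```python
import numpy as np

def flm_estimates(F, X):
    """Least-squares estimators (3.1) and (3.2) of beta(t) and gamma(s, t).

    F is the n x m matrix of reconstructed curves f_hat_i on a common grid,
    X the n x q covariate matrix (assumed of full rank q).
    """
    n, q = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ F)   # q x m: beta_hat(t) rowwise
    V = F - X @ beta_hat                           # v_hat_i(t) = (I - P) f_hat(t)
    gamma_hat = V.T @ V / (n - q)                  # m x m covariance estimate
    return beta_hat, gamma_hat
```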

Theorem 6 implies that, under the given conditions, the proposed estimators β̂(t) and γ̂(s, t) are asymptotically identical to the "ideal" estimators β̃(t) and γ̃(s, t), respectively. Therefore, in FDA it seems reasonable to directly assume the underlying individual functions are "observed," as is done in [23, 24]. The asymptotic result stated in (3.5) is a foundation for significance tests of the covariate effects.


3.2. Significance tests of the covariate effects. Consider the general hypothesis testing problem

(3.6)    H_0: Cβ(t) = c(t)    vs.    H_1: Cβ(t) ≠ c(t),    t ∈ T = [a, b],

where C is a given k × q full rank matrix and c(t) = [c_1(t), . . . , c_k(t)]^T is a given vector of functions. In order to check the significance of the rth covariate effect, one takes C = e_{r,q}^T = [0, . . . , 0, 1, 0, . . . , 0] and c(t) = 0; in order to check if the first two coefficient functions are the same, that is, β_1(t) = β_2(t), one takes C = (e_{1,q} − e_{2,q})^T = [1, −1, 0, . . . , 0] and c(t) = 0.

It is natural to estimate Cβ(t) by Cβ̂(t). By Theorem 6, we have

(3.7)    √n [Cβ̂(t) − c(t)] ∼ AGP(η_c, γ_c),

where η_c(t) = √n [Cβ(t) − c(t)] and γ_c(s, t) = γ(s, t) CΩ^{−1}C^T. Let

(3.8)    w(t) = {C(X^T X)^{−1} C^T}^{−1/2} [Cβ̂(t) − c(t)] = [w_1(t), . . . , w_k(t)]^T.

Since X^T X/n → Ω as n → ∞, using (3.7), we can show that w(t) ∼ AGP(η_w, γ_w), where

(3.9)    η_w(t) = √n (CΩ^{−1}C^T)^{−1/2} [Cβ(t) − c(t)] = [η_{w1}(t), . . . , η_{wk}(t)]^T,
         γ_w(s, t) = γ(s, t) I_k,

where I_k denotes the identity matrix of size k. It follows that the components w_1(t), . . . , w_k(t) are independent asymptotic Gaussian processes with mean functions η_{w1}(t), . . . , η_{wk}(t), respectively, and a common covariance function γ(s, t). That is,

(3.10)    w_l(t) ∼ AGP(η_{wl}, γ),    l = 1, 2, . . . , k.

Based on these results and with C and c(t) properly specified, pointwise t- and F-tests for the coefficient functions β_1(t), . . . , β_q(t) can easily be conducted ([23], Chapter 9). We here propose the following global test statistic for the general hypothesis testing problem (3.6):

(3.11)    T_n = ∫_a^b ‖w(t)‖² dt = ∑_{l=1}^k ∫_a^b w_l²(t) dt,

where ‖·‖ denotes the usual L2-norm. Let T̃_n be the associated "ideal" global test statistic, obtained by replacing β̂(t) by the "ideal" estimator β̃(t) as defined in (3.3).
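On a grid, T_n can be approximated by simple quadrature. A sketch under the assumption of an equally spaced grid t_grid on [a, b]; the matrix inverse square root follows (3.8):

```python
import numpy as np
from scipy.linalg import sqrtm, inv

def global_test_statistic(beta_hat, X, C, c, t_grid):
    """Approximate T_n in (3.11) by the trapezoidal rule on t_grid.

    beta_hat is q x m, C is k x q, and c is a k x m matrix holding the
    hypothesized values c(t) on the grid.
    """
    # {C (X^T X)^{-1} C^T}^{-1/2}; real part guards against round-off
    M = inv(np.real(sqrtm(C @ inv(X.T @ X) @ C.T)))
    w = M @ (C @ beta_hat - c)               # k x m process w(t), as in (3.8)
    return np.trapz(np.sum(w ** 2, axis=0), t_grid)
```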


To derive the asymptotic random expression of T_n, we assume that γ(s, t) has finite trace, that is, tr(γ) = ∫_a^b γ(t, t) dt < ∞. Let λ_1, λ_2, . . . be the eigenvalues, in decreasing order, and φ_1(t), φ_2(t), . . . be the associated orthonormal eigenfunctions of γ(s, t). Let m denote the number of positive eigenvalues; when all the eigenvalues are positive, we let m = ∞. Then λ_r > 0 for r ≤ m and λ_r = 0 for all r > m. Since tr(γ) < ∞ implies ∫_a^b ∫_a^b γ²(s, t) ds dt < ∞ by the Cauchy–Schwarz inequality, the covariance function γ(s, t) has the singular value decomposition ([27], page 3)

(3.12)    γ(s, t) = ∑_{r=1}^m λ_r φ_r(s) φ_r(t),    s, t ∈ T = [a, b].
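Numerically, estimated eigenvalues λ̂_r can be obtained from the grid discretization of γ̂(s, t), with the matrix eigenvalues rescaled by the grid spacing so that they approximate those of the integral operator. A sketch, assuming an equally spaced grid:

```python
import numpy as np

def covariance_eigenvalues(gamma_hat, t_grid):
    """Eigenvalues of the covariance operator with kernel gamma_hat(s, t).

    The integral operator is approximated by (grid spacing) x (matrix),
    so the matrix eigenvalues are rescaled by dt.
    """
    dt = t_grid[1] - t_grid[0]
    lam = np.linalg.eigvalsh(gamma_hat)[::-1] * dt   # decreasing order
    return np.clip(lam, 0.0, None)                   # clip tiny negatives to 0
```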

Theorem 7. Assume the conditions of Theorem 6 are satisfied. Then as n → ∞, we have

(3.13)    T_n = T̃_n + n^{1/2−(p+1)δ/(2p+3)} O_P(1).

In addition, assume δ > 1 + 1/[2(p + 1)] and γ(s, t) has finite trace so that it has the singular value decomposition (3.12). Then as n → ∞, we have

(3.14)    T_n =^d ∑_{r=1}^m λ_r A_r + o_P(1),    A_r ∼ χ²_k(u²_r),

where X =^d Y means the random variables X and Y have the same distribution, χ²_k denotes a χ²-distribution with k degrees of freedom, and the noncentrality parameters are

(3.15)    u²_r = λ_r^{−1} ‖ ∫_a^b η_w(t) φ_r(t) dt ‖².

Under H_0, η_w(t) ≡ 0 so that all the u²_r are 0.

Theorem 7 suggests that the distribution of T_n is asymptotically the same as that of a χ²-type mixture. There are three possible methods that can be used to approximate the null distribution of T_n: χ²-approximation, simulation and bootstrapping. In the first two methods, we approximate the null distribution of T_n by that of the χ²-type mixture S = ∑_{r=1}^{m̂} λ̂_r A_r, where A_r ∼ χ²_k, the λ̂_r are the eigenvalues of γ̂(s, t) and m̂ is some well-chosen integer such that the eigenvalues λ̂_r, r = 1, 2, . . . , m̂, explain a sufficiently large portion of the total variation tr(γ̂) = ∑_{r=1}^∞ λ̂_r while λ̂_r, r = m̂ + 1, m̂ + 2, . . . , are essentially 0. Besse [1] proposed a simple method for selecting such an m̂. A simple and natural choice of m̂ is the number of positive eigenvalues of γ̂(s, t). We found that the second method worked well in our simulation study and in the real data application presented in the next two sections.
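The simulation method is straightforward to code: repeatedly draw the χ²_k variables and form the mixture. A sketch, assuming lam holds the leading eigenvalues λ̂_r (e.g., from covariance_eigenvalues above):

```python
import numpy as np

def simulate_null_distribution(lam, k, n_rep=10_000, seed=0):
    """Sample from the chi^2-type mixture S = sum_r lam_r * A_r, A_r ~ chi^2_k."""
    rng = np.random.default_rng(seed)
    A = rng.chisquare(df=k, size=(n_rep, len(lam)))  # independent chi^2_k draws
    return A @ lam                                   # n_rep samples of S

# p_value = np.mean(simulate_null_distribution(lam, k) >= T_n)
```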


In the χ²-approximation method, the distribution of S is approximated by that of a random variable R = αχ²_d + β via matching the first three cumulants of R and S to determine the unknown parameters α, d and β [5, 29]. In the simulation method, the sampling distribution of S is computed based on a sample of S obtained via repeatedly generating (A_1, A_2, . . . , A_m̂). The bootstrap method is slightly more complicated. In the bootstrap method, we generate a sample of subject effects v*_i(t), i = 1, 2, . . . , n, from the estimated subject effects v̂_{i,1}(t), i = 1, 2, . . . , n, under H_1 and then construct a bootstrap sample, f*_i(t) = x_i^T β̂_0(t) + v*_i(t), i = 1, 2, . . . , n, where β̂_0(t) is the estimator of β(t) under H_0 so that Cβ̂_0(t) = c(t). Let β̂*(t) be the bootstrap estimator of β(t) based on the above bootstrap sample. We then use it to compute

T*_n = ∫_a^b ‖w*(t)‖² dt = ∑_{l=1}^k ∫_a^b w*_l²(t) dt,

where w*(t) can be obtained by replacing β̂(t) with β̂*(t) in the definition (3.8) of w(t). The bootstrap null distribution of T_n is obtained by the sampling distribution of T*_n via B replications of the above bootstrap process for some large B, for example, B = 10,000.

4. A simulation study. In this section we aim to investigate the effect of the bandwidth selected by the GCV rule (2.16) on the average MSE (2.5) of the p-order LPK reconstructions f̂_i(t), i = 1, 2, . . . , n, and the MSE of the mean function estimator η̂(t) via a simulation study. We generated simulation samples from the model

y_i(t) = η(t) + v_i(t) + ε_i(t),    i = 1, 2, . . . , n,
η(t) = a_0 + a_1 φ_1(t) + a_2 φ_2(t),
v_i(t) = b_i0 + b_i1 ψ_1(t) + b_i2 ψ_2(t),
b_i = [b_i0, b_i1, b_i2]^T ∼ N[0, diag(σ²_0, σ²_1, σ²_2)],
ε_i(t) ∼ N[0, σ²_ε (1 + t)],

where n is the number of subjects and b_i and ε_i(t) are independent. The scheduled design time points are t_j = j/(m + 1), j = 1, 2, . . . , m. To obtain an unbalanced design which is more realistic, we randomly removed some responses on a subject at a rate r_miss so that on average there are about m(1 − r_miss) measurements on a subject, and nm(1 − r_miss) measurements in a whole simulated sample. For simplicity, in this simulation the parameters we actually used are [a_0, a_1, a_2] = [1.2, 2.3, 4.2], [σ²_0, σ²_1, σ²_2, σ²_ε] = [1, 2, 3, 0.1], φ_1(t) = ψ_1(t) = cos(2πt), φ_2(t) = ψ_2(t) = sin(2πt), r_miss = 10%, m = 40 and n = 20, 30 and 40.


For a simulated sample, the p-order LPK reconstructions f̂_i(t) were obtained using a local linear (i.e., p = 1) smoother [8, 9] with the well-known Gaussian kernel. We considered five bandwidth choices, 0.5h*, 0.8h*, h*, 1.25h* and 2h*, where h* is the bandwidth selected by the GCV rule (2.16). For a simulated sample, the average MSE for the f̂_i(t) and the MSE for the mean function estimator η̂(t) were computed, respectively, as

MSE_f = (nM)^{−1} ∑_{i=1}^n ∑_{j=1}^M {f̂_i(τ_j) − f_i(τ_j)}²,    MSE_η = M^{−1} ∑_{j=1}^M {η̂(τ_j) − η(τ_j)}²,

where τ_1, . . . , τ_M are M time points equally spaced in [0, 1], for some large M, for example, M = 400.

Figure 1 presents the simulation results. The boxplots were based on 200 simulated samples. From left to right, panels are, respectively, for GCV, MSE_f and MSE_η; from top to bottom, panels are, respectively, for n = 20, 30 and 40. In each of the panels, the first five boxplots are associated with the five bandwidth choices 0.5h*, 0.8h*, h*, 1.25h* and 2h*, respectively; the sixth boxplot in each of the MSE_η panels is associated with the "ideal" estimator η̃(t); see (2.10) for its definition. From Figure 1, we may conclude that (a) overall, the GCV rule (2.16) performed well in the sense of choosing proper bandwidths to minimize the average MSE (2.5); (b) bandwidths smaller than h* reduce MSE_η, but not by much, while bandwidths larger than h* do enlarge MSE_η substantially; and (c) the MSE_η based on η̂(t) and those based on the "ideal" estimator η̃(t) are nearly the same unless the bandwidths are substantially larger than h*.

Fig. 1. Simulation results. From left to right, panels are, respectively, for GCV, MSE_f and MSE_η; from top to bottom, panels are, respectively, for n = 20, 30 and 40. In each of the panels, the first five boxplots are associated with the five bandwidth choices 0.5h*, 0.8h*, h*, 1.25h* and 2h*, where h* is the GCV bandwidth; the sixth boxplot in an MSE_η panel is associated with the "ideal" estimator η̃(t).

5. Application to the Canadian temperature data. The Canadian temperature data (Canadian Climate Program [6]) were downloaded from ftp://ego.psych.mcgill.ca/pub/ramsay/FDAfuns/Matlab/ at the book website of Ramsay and Silverman [23, 24]. The data are the daily temperature records of 35 Canadian weather stations over a year (365 days), among which 15 are in Eastern, another 15 in Western and the remaining five in Northern Canada. This is a typical functional data set with the number of measurements per subject (n_i = 365) being much larger than the number of subjects (n = 35). We shall use this functional data set only to illustrate the methodologies developed in this paper. For a more formal analysis, this functional data set should first be registered using either the parametric curve registration method proposed by Silverman [26] or the more flexible nonparametric curve registration method developed by Ramsay and Li [22]. Our methodologies can then be applied similarly to the resulting registered functional data set.

Figure 2 presents the individual curve reconstructions of the Canadian temperature data. These reconstructions were obtained by applying the local linear (p = 1) kernel fit [8, 9] with the well-known Gaussian kernel to the

individual temperature records of each of the 35 weather stations, but with a common bandwidth h∗ = 2.79, selected by the GCV rule (2.16). It can be seen that the Eastern weather station temperature curves (solid) mix up with the Western weather station temperature curves (dot-dashed), but most of the Eastern and Western weather station temperature curves stay higher than the Northern weather station temperature curves (dashed). This is reasonable since the Eastern and Western weather stations are located at about the same latitudes, while the Northern weather stations are located at higher latitudes.


Fig. 2. Local linear (p = 1) individual curve reconstructions of the Canadian temperature data with the bandwidth h∗ = 2.79, selected by GCV. Eastern weather stations: solid curves; Western weather stations: dot-dashed curves; and Northern weather stations: dashed curves.

We then modeled the Canadian temperature data set by the functional linear model (1.3) with the covariates

x_i = [1, 0, 0]^T if weather station i is located in Eastern Canada,
x_i = [0, 1, 0]^T if weather station i is located in Western Canada,
x_i = [0, 0, 1]^T if weather station i is located in Northern Canada,
i = 1, 2, . . . , 35,

and the coefficient function vector β(t) = [β_1(t), β_2(t), β_3(t)]^T, where β_1(t), β_2(t) and β_3(t) are the covariate effect (mean temperature) functions of the Eastern, Western and Northern weather stations, respectively. Figure 3 superimposes the estimated mean temperature functions of the Eastern, Western and Northern weather stations, together with their 95% standard deviation bands.

Fig. 3. Estimated mean temperature functions of the Eastern, Western and Northern weather stations with 95% standard deviation bands (Eastern weather stations, solid; Western weather stations, dot-dashed; and Northern weather stations, dashed).

Based on the 95% standard deviation bands, some informal conclusions can be made. First of all, over the whole year

([a, b] = [1, 365]), the differences between the mean temperature functions of the Eastern and the Western weather stations are much less significant than the differences between the mean temperature functions of the Eastern and the Northern weather stations, or between the Western and the Northern weather stations. This is because the 95% standard deviation band of the Eastern weather station mean temperature function covers (before Day 151) or stays close (after Day 151) to the mean temperature function of the Western weather stations; however, the 95% standard deviation bands of the Eastern and Western weather station mean temperature functions are far away from the mean temperature function of the Northern weather stations. Second, the significance of the differences between the mean temperature functions of the Eastern and the Western weather stations differs across seasons. During the Spring (usually defined as the months of March, April and May, or [a, b] = [60, 151]), the mean temperature functions are nearly the same, but this is not the case during the Summer (June, July and August, or [a, b] = [152, 243]) or during the Autumn (September, October and November, or [a, b] = [244, 334]).

Table 1
Significance test results for the differences of the mean temperature functions of the Eastern and Western weather stations based on 10,000 replications

                                                  P-values
[a, b]                  h      T_n     χ²-approximation   Simulation   Bootstrapping
[1, 365] (Whole year)   h*/2   59954   0.179              0.179        0.166
                        h*     58248   0.185              0.181        0.180
                        2h*    56868   0.189              0.185        0.184
[60, 151] (Spring)      h*/2   945     0.842              0.836        0.834
                        h*     656     0.940              0.874        0.877
                        2h*    378     1.000              0.923        0.922
[152, 243] (Summer)     h*/2   6625    0.078              0.075        0.068
                        h*     6432    0.082              0.084        0.083
                        2h*    6322    0.085              0.086        0.075
[244, 334] (Autumn)     h*/2   28748   0.011              0.011        0.009
                        h*     28303   0.012              0.013        0.008
                        2h*    27526   0.014              0.015        0.010

These conclusions can be made clearer via the hypothesis testing problem (3.6) with t ∈ T = [a, b], using the global test statistic T_n (3.11) and with a, b, c and C properly specified. For example, to test whether the mean temperature functions of the Eastern and Western weather stations during the Spring are the same, we take a = 60, b = 151, c = 0 and C = [1, −1, 0]; and to test whether the mean temperature functions of the Eastern, Western and Northern weather stations during the Autumn are the same, we take a = 244, b = 334, c = [0, 0]^T and

C = [ 1  0  −1 ]
    [ 0  1  −1 ].
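In code, the two tests above differ only in the contrast matrix and the integration range. A hypothetical call, reusing the global_test_statistic sketch from Section 3.2 (all names below are illustrative):

```python
import numpy as np

# Spring test of East = West: a = 60, b = 151, C = [1, -1, 0], c(t) = 0
C_spring = np.array([[1.0, -1.0, 0.0]])

# Autumn test of East = West = North: a = 244, b = 334, c(t) = [0, 0]^T
C_autumn = np.array([[1.0, 0.0, -1.0],
                     [0.0, 1.0, -1.0]])

# mask = (t_grid >= 244) & (t_grid <= 334)
# T_n = global_test_statistic(beta_hat[:, mask], X, C_autumn,
#                             np.zeros((2, mask.sum())), t_grid[mask])
```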

We first tested the differences of the mean temperature functions of the Eastern and Western Canadian weather stations for the whole year, and during the Spring, Summer and Autumn. Table 1 shows the significance test results, where the simulation and bootstrap P-values were computed based on 10,000 replications. For each choice of the seasonal period [a, b], we used three different bandwidth choices, h*/2, h* and 2h*, where h* = 2.79 was selected by the GCV rule (2.16). For each bandwidth choice, the associated test statistics T_n were computed using (3.11). For each T_n, we computed its P-value using the χ²-approximation, simulation and bootstrap methods, which were described briefly in Section 3.2. Figure 4 displays the null probability density function (p.d.f.) approximations obtained using the three methods. It seems that all three approximations perform reasonably well except at the left boundary, where the χ²-approximations seem problematic. Nevertheless, from the table, we can see that the significance test results are not strongly affected by the bandwidths used; moreover, we can see that the differences between the mean temperature functions of the Eastern and Western weather stations over the whole year (P-values ≥ 0.166) are larger than their differences during the Spring (P-values ≥ 0.834), but much smaller than their differences during the Summer (P-values ≤ 0.086) or during the Autumn (P-values ≤ 0.015). These results are consistent with those observed from Figure 3.

Fig. 4. Null p.d.f. approximations (χ²-approximation, solid; simulation, dashed; bootstrap, dotted) of the global test statistic T_n (3.11) when h* = 2.79. (a) [a, b] = [1, 365]; (b) [a, b] = [60, 151]; (c) [a, b] = [152, 243]; and (d) [a, b] = [244, 334].

Following the same procedure, we also tested the following null hypotheses: the mean temperature functions are the same between (1) the Eastern


and Northern; (2) the Western and Northern; and (3) the Eastern, Western and Northern weather stations, for the following periods: (1) the whole year; (2) the Spring; (3) the Summer; and (4) the Autumn. As expected, we rejected all these null hypotheses with P-values of 0. These results are also consistent with those observed from Figure 3.

6. Technical proofs. In this section we outline the technical proofs of some of the asymptotic results. Before we proceed, we list the following useful lemmas. A proof of the first lemma can be found in [10], page 64. Notice that, under Condition A4, "n → ∞" implies "n_i → ∞."

Lemma 1. Assume Condition A is satisfied. Then as n → ∞, we have

K_h^{n_i}(t_ij − t) = 1/(n_i h π(t)) · K*((t_ij − t)/h) [1 + o_P(1)],

where K*(·) is the LPK equivalent kernel ([10], page 64).

Lemma 2. We always have

∑_{j=1}^{n_i} K_h^{n_i}(t_ij − t)(t_ij − t)^r = 1 when r = 0, and 0 otherwise.

Assume Condition A is satisfied. Then as n → ∞, we have

∑_{j=1}^{n_i} K_h^{n_i}(t_ij − t)(t_ij − t)^{p+1} = B_{p+1}(K*) h^{p+1} / π(t) · [1 + o_P(1)],

∑_{j=1}^{n_i} {K_h^{n_i}(t_ij − t)}² = V(K*) / π(t) · (n_i h)^{−1} [1 + o_P(1)],

∑_{j=1}^{n_i} K_h^{n_i}(t_ij − s) K_h^{n_i}(t_ij − t) = K^{*(1)}((s − t)/h) / π(t) · (n_i h)^{−1} [1 + o_P(1)],

where B_r(·) and V(·) are defined in (2.4).

Let r_i(t) = f̂_i(t) − f_i(t), i = 1, 2, . . . , n, where the f̂_i(t) are the p-order LPK reconstructions of the f_i(t) given in Section 2.1. Let r̄(t) = n^{−1} ∑_{i=1}^n r_i(t) and f̄(t) = n^{−1} ∑_{i=1}^n f_i(t). Using Lemmas 1 and 2, we can prove the following useful lemma.

Lemma 3. Assume Condition A is satisfied. Then as n → ∞, we have

E{r_i(t)|D} = B_{p+1}(K*) η^{(p+1)}(t) / (p + 1)! · h^{p+1} [1 + o_P(1)],

Cov{r_i(s), r_i(t)|D} = { K^{*(1)}((s − t)/h) σ²(s) / π(t) · (n_i h)^{−1}
                       + B²_{p+1}(K*) γ_{p+1,p+1}(s, t) / [(p + 1)!]² · h^{2(p+1)} } [1 + o_P(1)],

Cov{r_i(s), f_i(t)|D} = B_{p+1}(K*) γ_{p+1,0}(s, t) / (p + 1)! · h^{p+1} [1 + o_P(1)].

Proof. By (2.3) and Lemma 1, we have

r_i(t) = ∑_{j=1}^{n_i} K_h^{n_i}(t_ij − t) ε_ij + ∑_{j=1}^{n_i} K_h^{n_i}(t_ij − t){f_i(t_ij) − f_i(t)}.

It follows that

E(r_i(t)|D) = ∑_{j=1}^{n_i} K_h^{n_i}(t_ij − t){η(t_ij) − η(t)}.

Applying Taylor's expansion and Lemmas 1 and 2, we have

(6.1)    E(r_i(t)|D) = ∑_{j=1}^{n_i} K_h^{n_i}(t_ij − t) { ∑_{l=1}^{p+1} (t_ij − t)^l η^{(l)}(t)/l! + o[(t_ij − t)^{p+1}] }
                     = B_{p+1}(K*) η^{(p+1)}(t) / (p + 1)! · h^{p+1} [1 + o_P(1)].

Similarly, by the independence of f_i(t) and ε_i(t), we have

(6.2)    Cov(r_i(s), r_i(t)|D) = ∑_{j=1}^{n_i} K_h^{n_i}(t_ij − s) K_h^{n_i}(t_ij − t) σ²(t_ij)
                               + ∑_{j=1}^{n_i} ∑_{l=1}^{n_i} K_h^{n_i}(t_ij − s) K_h^{n_i}(t_il − t)
                                 × {γ(t_ij, t_il) − γ(t_ij, t) − γ(s, t_il) + γ(s, t)}
                               = { K^{*(1)}[(s − t)/h] σ²(s) / π(t) · (n_i h)^{−1}
                                 + B²_{p+1}(K*) γ_{p+1,p+1}(s, t) / [(p + 1)!]² · h^{2(p+1)} } [1 + o_P(1)].

In particular, letting s = t, we obtain

(6.3)    Var(r_i(t)|D) = { V(K*) σ²(t) / π(t) · (n_i h)^{−1}
                         + B²_{p+1}(K*) γ_{p+1,p+1}(t, t) / [(p + 1)!]² · h^{2(p+1)} } [1 + o_P(1)],

as desired. Lemma 3 is proved. □

Direct application of Lemma 3 leads to the following.

Lemma 4.

Assume Condition A is satisfied. Then as n → ∞, we have

E(r̄(t)|D) = B_{p+1}(K*) η^{(p+1)}(t) / (p + 1)! · h^{p+1} [1 + o_P(1)],

Cov(r̄(s), r̄(t)|D) = n^{−1} { K^{*(1)}((s − t)/h) σ²(s) / π(t) · (m̃h)^{−1}
                            + B²_{p+1}(K*) γ_{p+1,p+1}(s, t) / [(p + 1)!]² · h^{2(p+1)} } [1 + o_P(1)],

Cov(r̄(s), f̄(t)|D) = B_{p+1}(K*) γ_{p+1,0}(s, t) / (p + 1)! · n^{−1} h^{p+1} [1 + o_P(1)],

where m̃ = (n^{−1} ∑_{i=1}^n n_i^{−1})^{−1}, as defined in Theorem 1.

Proof of Theorem 1. For each i = 1, 2, . . . , n, by (6.1) and (6.3), we have

(6.4)    E{[f̂_i(t) − f_i(t)]²|D} = E{r_i²(t)|D} = {E(r_i(t)|D)}² + Var(r_i(t)|D)
         = { B²_{p+1}(K*) [(η^{(p+1)}(t))² + γ_{p+1,p+1}(t, t)] / [(p + 1)!]² · h^{2(p+1)}
           + V(K*) σ²(t) / π(t) · (n_i h)^{−1} } [1 + o_P(1)].

Theorem 1, that is, the expression (2.5), then follows directly. □

24

J.-T. ZHANG AND J. CHEN

Cov(f¯(s), f¯(t)) + Cov(f¯(s), r¯(t)) + Cov(¯ r(s), f¯(t)) + Cov(¯ r (s), r¯(t)). The results of Theorem 3 follow directly from Lemma 4.  Proof of Theorem 4. Since ηˆ(t) = f¯(t) + r¯(t) = η˜(t) + r¯(t), in order to r 2 (t)|D} = show the first expression in (2.11), it is sufficient to prove that E{¯ −2(p+1)δ/(2p+3) 2 n OUP (1). This result follows directly from E{¯ r (t)|D} = {E(¯ r (t)| 2 D)} + Var(¯ r (t)|D) and Lemma 4. To show the second expression in (2.11), notice that the covariance estimator γˆ (s, t) can be expressed as γˆ (s, t) =

n 1X {fi (s) − f¯(s)}{fi (t) − f¯(t)} n i=1

+ + +

n 1X {fi (s) − f¯(s)}{ri (t) − r¯(t)} n i=1 n 1X {ri (s) − r¯(s)}{fi (t) − f¯(t)} n i=1 n 1X {ri (s) − r¯(s)}{ri (t) − r¯(t)} n i=1

≡ γ˜ (s, t) + I1 + I2 + I3 , where ri (t) = fˆi (t) − fi (t), i = 1, 2, . . . , n, are independent and asymptotically have the same variance. By the law of large numbers and by Lemma 3, we have (

I1 = E n−1

n X i=1

)

E[(fi (s) − f¯(s))(ri (t) − r¯(t))|D)] OP (1)

= E{Cov(f1 (s), r1 (t)|D)}OP (1) = n−(p+1)δ/(2p+3) OUP (1). Similarly, we can show that I2 = n−(p+1)δ/(2p+3) OUP (1)

and

I3 = n−2(p+1)δ/(2p+3) OUP (1).

The second expression in (2.11) then follows. When δ > 1 + 1/[2(p + 1)], we have n1/2 {˜ η (t) − ηˆ(t)} = oUP (1),

n1/2 {˜ γ (s, t) − γˆ (s, t)} = oUP (1).

By the definition of η˜(t) and γ˜ (s, t), we have η˜(t) = η(t) + v¯(t),

γ˜ (s, t) = n−1

n X i=1

vi (s)vi (t) − v¯(s)¯ v (t).

25

STATISTICAL INFERENCES IN FDA

By the law of large numbers and the central limit theorem, it is easy to show that n1/2 {˜ η (t) − η(t)} ∼ AGP(0, γ),

n1/2 {˜ γ (s, t) − γ(s, t)} ∼ AGP(0, γ ∗ ),

where γ ∗ {(s1 , t1 ), (s2 , t2 )} = Cov{v1 (s1 )v1 (t1 ), v1 (s2 )v1 (t2 )} = E{v1 (s1 )v1 (t1 )v1 (s2 )v1 (t2 )} − γ(s1 , t1 )γ(s2 , t2 ). In particular, when v(t) is a Gaussian process, we have E{v1 (s1 )v1 (t1 )v1 (s2 )v1 (t2 )} = γ(s1 , t1 )γ(s2 , t2 ) + γ(s1 , t2 )γ(s2 , t1 ) + γ(s1 , s2 )γ(t1 , t2 ). Thus, γ ∗ {(s1 , t1 ), (s2 , t2 )} = γ(s1 , t2 )γ(s2 , t1 ) + γ(s1 , s2 )γ(t1 , t2 ). The proof of Theorem 4 is finished.  Proof of Theorem 5. Under Condition A and by Theorem 2, we have fˆi (tij ) = fi (tij ) + n−(p+1)δ/(2p+3) OUP (1). It follows that εˆ2ij = {yij − fˆi (tij )}2 = {εij + n−(p+1)δ/(2p+3) OUP (1))}2 = ε2ij + 2n−(p+1)δ/(2p+3) εij OUP (1) + n−2(p+1)δ/(2p+3) OUP (1). ˆ 2 (t) = I1 + I2 + I3 , Plugging this into (2.14) with b = O(N −1/5 ), we have σ where under the given conditions and by standard kernel estimation theory, I1 =

P ni 2 i=1 j=1 Hb (tij − t)εij Pn Pni i=1 j=1 Hb (tij − t)

Pn

I2 = 2

Pn

i=1

= σ 2 (t) + N −2/5 OUP (1),

Pni

−(p+1)δ/(2p+3) ε O ij UP (1) j=1 Hb (tij − t)n Pn Pni i=1 j=1 Hb (tij − t)

= n−(p+1)δ/(2p+3) OUP (1), I3 =

Pn

i=1

P ni

−2(p+1)δ/(2p+3) O UP (1) j=1 Hb (tij − t)n Pn Pni i=1 j=1 Hb (tij − t)

= n−2(p+1)δ/(2p+3) OUP (1).

Under Condition A4, ni ≥ Cnδ . This implies that N = ni=1 ni > Cn1+δ . Thus, N −2/5 = O(n−2(1+δ)/5 ). It follows that σ ˆ 2 (t) = σ 2 (t)+OUP (n−2(1+δ)/5 + n−(p+1)δ/(2p+3) ), as desired. The proof of the theorem is completed.  P

Proof of Theorem 6. Under the conditions of Theorem 2, we have |ri (t)| = |fˆi (t) − fi (t)| ≤ n−(p+1)δ/(2p+3) C for some C > 0 for all i and t. Let

26

J.-T. ZHANG AND J. CHEN

∆(t) = [∆1 (t), . . . , ∆q (t)]T = (XT X)−1 XT (ˆf (t)−f (t)). Then for r = 1, 2, . . . , q, we have !−1 n n X −1 X T −1 T |∆r (t)| = n er,q n xj xj xi ri (t) i=1

j=1

!−1 n n X X xi |ri (t)| xj xTj ≤ n−1 eTr,q n−1 i=1

j=1

≤ Cn−(p+1)δ/(2p+3) E|eTr,q Ω−1 x1 |[1 + op (1)].

It follows that ∆(t) = n−(p+1)δ/(2p+3) OUP (1). The first expression in (3.4) ˆ − β(t) ˜ = ∆(t). follows directly from the fact β(t) To show the second expression in (3.4), notice that vˆi (t) = v˜i (t) + ri (t) + ˆ − β(t)] ˜ = v˜i (t) + n−(p+1)δ/(2p+3) OUP (1) because under the given conxTi [β(t) ˆ ditions, we have xi = OUP (1), ri (t) = n−(p+1)δ/(2p+3) OUP (1), and β(t) − −(p+1)δ/(2p+3) ˜ β(t) = n OUP (1). Further, by Condition B, we have vi (t) = OUP (1), therefore, vˆi (s)ˆ vi (t) = v˜i (s)˜ vi (t) + n−(p+1)δ/(2p+3) OUP (1). The second expression in (3.4) follows immediately. When δ > 1 + 1/[2(p + 1)], we have (p + 1)δ/(2p + 3) > 1/2. Therefore, √ ˆ ˜ n[β(t) − β(t)] = n1/2−(p+1)δ/(2p+3) OUP (1) = oUP (1). Moreover, it is easy to show that √ ˜ − β(t)] ∼ AGP(0, γβ ), (6.5) n[β(t) where γβ (s, t) = γ(s, t)Ω−1 . The result in (3.5) follows immediately. The proof of the theorem is completed.  ˆ Proof of Theorem 7. Recall that w(t) = [C(XT X)−1 CT ]−1/2 [Cβ(t)− ˆ ˜ ˜ similarly by replacing β(t) with β(t). c(t)], as defined in (3.8). Define w(t) R Rb b 2 dt. ˜ Then by (3.11), we have Tn = a kw(t)k2 dt and similarly, T˜n = a kw(t)k T −1 T −1/2 ˆ − β(t)]. ˜ ˜ Let ∆(t) = w(t) − w(t) = [C(X X) C ] C[β(t) Then under the given conditions and by Theorem 6, we can show that ∆(t) = 1/2−(p+1)δ/(2p+3) O ˜ n1/2−(p+1)δ/(2p+3) ×OUP (1). It follows that w(t) = w(t)+n UP (1) R Rb b 2 1/2−(p+1)δ/(2p+3) T ˜ ˜ ˜ Op (1), ∆(t) dt+ a k∆(t)k dt = Tn +n and, hence, Tn = Tn +2 a w(t) as desired. When δ > 1 + 1/[2(p + 1)], we have Tn = T˜n + oP (1) as n → ∞. Thus, to d P show (3.14), it is sufficient to show T˜n = m r=1 λr Ar + oP (1). Using (6.5) in ˜ the proof of Theorem 6 above, it is easy to show that w(t) ∼ AGP(η w , γ w ), √ −1 T −1/2 where η w (t) = n(CΩ C ) [Cβ(t) − c(t)] and γ w (s, t) = γ(s, t)Ik , as ˜ are independent defined in (3.9). It follows that the k components of w(t) of each other, and the lth component w ˜l (t) ∼ AGP(ηwl , γ), where ηwl (t) is

STATISTICAL INFERENCES IN FDA

27

the lth component of η w (t) as defined in (3.9). Since γ(s, t) has the singular P ˜l (t) = m value decomposition (3.12), we have w r=1 ξlr φr (t), where ξlr =

(6.6)

with µlr =

Rb a

b

Z

w ˜l (t)φr (t) dt ∼ AN(µlr , λr ),

a

ηwl (t)φr (t) dt. It follows that T˜n =

=

Z

b

a

2 ˜ kw(t)k dt =

k X m X

l=1 r=1

2 ξlr =

k Z X

b

l=1 a

m X k X

w ˜l2 (t) dt

2 ξlr

r=1 l=1

because the eigenfunctions φr (t) are orthonormal over T = [a, b] and the 2 . By (6.6), we have summation is exchangeable due to the nonnegativity of ξlr

R 2 d 2 2 2 −1 Pk µ2 = λ−1 k b η (t)× r l=1 ξlr = λr Ar , where Ar ∼ χk (ur ) with ur = λr l=1 lr a w d P λ A + o (1), as deφr (t) dtk2 , as given in (3.15). It follows that T˜n = m P r=1 r r

Pk

sired. The proof of the theorem is completed. 

Acknowledgments. The authors thank the Editor, the Associate Editor and two reviewers for their helpful comments and invaluable suggestions that helped improve the paper substantially.

REFERENCES

[1] Besse, P. (1992). PCA stability and choice of dimensionality. Statist. Probab. Lett. 13 405–410. MR1175167
[2] Besse, P., Cardot, H. and Ferraty, F. (1997). Simultaneous nonparametric regressions of unbalanced longitudinal data. Comput. Statist. Data Anal. 24 255–270. MR1455696
[3] Besse, P. and Ramsay, J. O. (1986). Principal components analysis of sampled functions. Psychometrika 51 285–311. MR0848110
[4] Brumback, B. and Rice, J. A. (1998). Smoothing spline models for the analysis of nested and crossed samples of curves (with discussion). J. Amer. Statist. Assoc. 93 961–994. MR1649194
[5] Buckley, M. J. and Eagleson, G. K. (1988). An approximation to the distribution of quadratic forms in normal random variables. Austral. J. Statist. 30A 150–159.
[6] Canadian Climate Program (1982). Canadian Climate Normals 1951–1980. Environment Canada, Ottawa.
[7] Eubank, R. L. (1999). Nonparametric Regression and Spline Smoothing, 2nd ed. Dekker, New York. MR1680784
[8] Fan, J. (1992). Design-adaptive nonparametric regression. J. Amer. Statist. Assoc. 87 998–1004. MR1209561
[9] Fan, J. (1993). Local linear regression smoothers and their minimax efficiencies. Ann. Statist. 21 196–216. MR1212173
[10] Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall, London. MR1383587
[11] Fan, J. and Lin, S.-K. (1998). Test of significance when data are curves. J. Amer. Statist. Assoc. 93 1007–1021. MR1649196
[12] Fan, J. and Yao, Q. (1998). Efficient estimation of conditional variance functions in stochastic regression. Biometrika 85 645–660. MR1665822
[13] Faraway, J. J. (1997). Regression analysis for a functional response. Technometrics 39 254–261. MR1462586
[14] Green, P. J. and Silverman, B. W. (1994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall, London. MR1270012
[15] Hall, P. and Marron, J. S. (1990). On variance estimation in nonparametric regression. Biometrika 77 415–419. MR1064818
[16] Hart, J. D. and Wehrly, T. E. (1986). Kernel regression estimation using repeated measurements data. J. Amer. Statist. Assoc. 81 1080–1088. MR0867635
[17] Kneip, A. (1994). Nonparametric estimation of common regressors for similar curve data. Ann. Statist. 22 1386–1427. MR1311981
[18] Kneip, A. and Engel, J. (1995). Model estimation in nonlinear regression under shape invariance. Ann. Statist. 23 551–570. MR1332581
[19] Kneip, A. and Gasser, T. (1992). Statistical tools to analyze data representing a sample of curves. Ann. Statist. 20 1266–1305. MR1186250
[20] Ramsay, J. O. (1995). Some tools for the multivariate analysis of functional data. In Recent Advances in Descriptive Multivariate Analysis (W. Krzanowski, ed.) 269–282. Oxford Univ. Press, New York. MR1380322
[21] Ramsay, J. O. and Dalzell, C. J. (1991). Some tools for functional data analysis (with discussion). J. Roy. Statist. Soc. Ser. B 53 539–572. MR1125714
[22] Ramsay, J. O. and Li, X. (1998). Curve registration. J. R. Stat. Soc. Ser. B Stat. Methodol. 60 351–363. MR1616045
[23] Ramsay, J. O. and Silverman, B. W. (1997). Functional Data Analysis. Springer, New York.
[24] Ramsay, J. O. and Silverman, B. W. (2002). Applied Functional Data Analysis: Methods and Case Studies. Springer, New York. MR1910407
[25] Rice, J. A. and Silverman, B. W. (1991). Estimating the mean and covariance structure nonparametrically when the data are curves. J. Roy. Statist. Soc. Ser. B 53 233–243. MR1094283
[26] Silverman, B. W. (1995). Incorporating parametric effects into principal components analysis. J. Roy. Statist. Soc. Ser. B 57 673–689. MR1354074
[27] Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia. MR1045442
[28] Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman and Hall, London. MR1319818
[29] Zhang, J.-T. (2005). Approximate and asymptotic distributions of chi-squared-type mixtures with applications. J. Amer. Statist. Assoc. 100 273–285. MR2156837

Department of Statistics and Applied Probability, National University of Singapore, Lower Kent Ridge Road, 3 Science Drive 2, Singapore 119260, Singapore. E-mail: [email protected]

Department of Mathematics and Statistics, San Diego State University, 5500 Campanile Drive, San Diego, CA 92182, USA. E-mail: [email protected]
