Local M-Estimation of Regression Function for Censored Time Series Data

Zongwu Cai
Department of Mathematics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
E-mail: [email protected]

Elias Ould-Saïd
L.M.P.A. J. Liouville, Université du Littoral Côte d'Opale, BP 699, 62228 Calais, France
E-mail: [email protected]

June 29, 1999
Abstract
In this article, we investigate a robust version of local linear regression smoothers for stationary and censored stochastic processes by using M-type local polynomial techniques and transformations. Under some regularity conditions, we establish the weak and strong consistency as well as the asymptotic normality of the proposed estimators. We propose an easily implemented bandwidth selection criterion based on a modified multifold cross-validation, and a consistent estimator of the asymptotic covariance matrix is also provided.
Keywords: α-mixing; asymptotic properties; bandwidth selection; censored data; local smoothers; robustness; prediction; time series analysis; variable bandwidth.
AMS 1991 subject classification: Primary 62G07; secondary 60G10, 62G35, 62F35, 62E20.
This work was supported, in part, by funds provided by the University
of North Carolina at Charlotte.
1 Introduction

In various statistical problems, regression techniques are commonly used for modeling the relationship between a response and covariates, for both censored and non-censored cases and for both independent and time series data. The main purpose of this article is to study an easily implemented and robustified local smoothing method for exploring the association between the response and covariates when the observed values of the response variable are censored and are not necessarily independent (see Section 3 for the kind of dependence stipulated). Although we focus only on the univariate case, we would like to mention that the basic ideas of our methodology carry over to multivariate situations. One of the key applications of regression estimation, this time involving dependent data, is the construction of prediction intervals for the next value of a stationary time series {Y_1, ..., Y_n}. If the time series is Markovian, then we may easily solve the prediction problem by estimating the expectation of Y_{n+1} conditional on Y_n = x.

In the non-censored case, there is a vast literature on modeling the relationship between two variables. Many linear smoothers have been proposed to estimate nonparametric regression functions: kernel, spline, local polynomial, and orthogonal series methods. For the available methods and results on both theory and applications, see the books by Eubank (1988), Müller (1988), Härdle (1990), Wahba (1990), Hastie and Tibshirani (1990), Green and Silverman (1994), Wand and Jones (1995), Simonoff (1996), and Fan and Gijbels (1996), among others. Among these linear smoothers, the local polynomial method has become popular in recent years due to its attractive mathematical efficiency, bias reduction, and adaptation to edge effects (see, for example, the book by Fan and Gijbels (1996) for detailed methods and results). Recently, Fan and Jiang (1999) studied a robust version of local linear regression smoothers augmented with variable bandwidth, and they showed that local M-regression inherits many nice statistical properties from local least-squares regression, although nonparametric M-type estimators of the regression function have been investigated by many authors; see, for example, Cleveland (1979), Cox (1983), Tsybakov (1986), Härdle and Tsybakov (1988), Cunningham, Eubank and Hsing (1991), Fan, Hu and Truong (1994), and Welsh (1994), among others. Locally weighted scatter plot smoothing (LOWESS), first introduced by Cleveland (1979), robustifies the locally weighted least squares method; it was developed further by Tsybakov (1986), Fan, Hu and Truong (1994) and Welsh (1994) using local polynomial techniques, by Härdle and Tsybakov (1988) using the kernel method, and by Cox (1983) and Cunningham, Eubank and Hsing (1991) through smoothing spline approaches.

For censored data, there are also many methods available to build a relationship between the response and covariates. Among them, we particularly mention the transformation procedures. The basic idea of the transformation methodology is to transform the observed data in an appropriate way, say unbiasedly, and then to apply regression techniques such as linear smoothers to analyze the transformed data.
Buckley and James (1979) first introduced an appealing idea for the censored regression problem when the regression function is linear; Koul, Susarla and Van Ryzin (1981) considered a transformation depending only on the censoring distribution; and Zheng (1987) proposed a class of transformations of this type. Recently, Fan and Gijbels (1994) systematically studied a more general class of transformations and proposed data-driven methods for transforming the censored data.
The literature mentioned above focuses on the iid case. In the time series context, nonparametric estimates of the regression function have been investigated by many authors in the case where the observations exhibit some kind of dependence, such as Markov chains, ergodic processes, mixing sequences, long-range memory processes, associated random variables, and so on. Here we mention only a few references; see, for example, Robinson (1983) for kernel methods and Masry and Fan (1997) for local polynomial approaches to estimating regression curves. However, a drawback of those kernel or local polynomial methods is their lack of robustness. To attenuate this difficulty, M-type regression approaches have been used to achieve the desirable robustness properties. Robinson (1984) investigated the M-estimator with kernel weights for stationary time series and established a central limit theorem under α-mixing conditions; Collomb and Härdle (1986) obtained uniform convergence with rates and some other asymptotic results for the family of kernel M-estimators under φ-mixing when the data belong to a fixed compact set and the derivative of the loss function is bounded; Boente and Fraiman (1989, 1990) derived the asymptotic normality of the kernel M-estimator for some mixing processes; and a class of tests useful for checking the goodness-of-fit of a robust regression function when the errors have long-range memory was studied by Koul and Stute (1998). Recently, Laïb and Ould-Saïd (1999) considered the strong uniform consistency of the kernel M-estimator of autoregression functions under stationary ergodic time series assumptions. In an unpublished technical report, Ould-Saïd (1997) discussed the strong consistency of a kernel-type regression estimator for censored α-mixing processes. To our knowledge, there has not been any attempt to use local M-type regression techniques to tackle these problems for censored time series data, or even for fully observed data.

The aim of this paper is to adapt local polynomial techniques and M-type methods to estimate regression curves with variable bandwidth for censored time series data, which are assumed to be α-mixing. More precisely, in the first step, we use transformation techniques to transform the observed data in an appropriate way, say unbiasedly, to account for the censoring. At the second stage, we apply robustified local polynomial procedures to estimate the target functions by using the transformed data.

The article is organized as follows. In Section 2, we discuss data-driven transformation methods for the censored data, present the local M-type linear regression smoothers with variable bandwidth, and propose an easily implemented modified multifold cross-validation criterion for selecting the optimal bandwidth. Section 3 is devoted to the asymptotic properties of the proposed estimators, including the pointwise weak and strong consistency as well as the asymptotic normality. Associated with inferences on regression curves are the standard errors of the estimated curves, and consistent estimates of these are derived. The assumptions needed for the proofs of our results are also collected in this section for easy reference. Finally, Section 4 contains the technical proofs of our results.
2 Local M-Estimation

We consider the regression function m(x) = E(Y_j | X_j = x), where Y_j is the survival time, X_j is the associated covariate, and m(·) is the unknown regression curve. In many applications, the survival times of the studied subjects are not always fully observed but are subject to right censoring, due to the termination of the study or early withdrawal from the study. Consider the bivariate data {(X_j, Y_j)}_{j=1}^n, which form a stationary sequence. Under the independent censoring model, in which the iid censoring times C_1, ..., C_n are independent of the survival times given the covariates, one observes only the censored data Z_j = min(C_j, Y_j) and δ_j = 1{Y_j ≤ C_j}, as well as the associated covariate X_j. The observations are {(X_j, Z_j, δ_j)}_{j=1}^n, a sample from the population (X, Z, δ). Purely for notational simplicity, we assume throughout this paper that the random variables Y and C are nonnegative and continuous. The covariate X is assumed to remain constant over time. Let F(· | x) and G(· | x) be the conditional survival functions of Y and C given X = x, and let f_X(·) denote the density of X.
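To fix ideas, the following is a minimal simulation sketch of this observation scheme, assuming an AR(1) covariate process, nonnegative survival times driven by a hypothetical curve m, and iid exponential censoring times; all of these distributional choices are illustrative and are not prescribed by the paper.

import numpy as np

rng = np.random.default_rng(0)
n = 500

# AR(1) covariate process (a standard example of an alpha-mixing sequence)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + rng.normal()

def m(u):                        # hypothetical regression curve (illustrative)
    return np.sin(u) + 0.5 * u

# nonnegative survival times whose conditional distribution depends on m(X_j)
y = np.exp(0.25 * m(x) + 0.3 * rng.normal(size=n))

# iid censoring times C_j, independent of (X_j, Y_j)
c = rng.exponential(scale=float(np.quantile(y, 0.8)), size=n)

z = np.minimum(y, c)             # observed Z_j = min(Y_j, C_j)
delta = (y <= c).astype(int)     # censoring indicator delta_j = 1{Y_j <= C_j}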
2.1 Transformation of the Data
Denote by φ_1(·,·) and φ_2(·,·) the transformations applied to the uncensored and censored observations, respectively, and consider the transformed response

Y^* = \begin{cases} \varphi_1(X, Y) & \text{if uncensored,} \\ \varphi_2(X, C) & \text{if censored,} \end{cases} \qquad\text{that is,}\qquad Y^* = \delta\,\varphi_1(X, Z) + (1 - \delta)\,\varphi_2(X, Z).    (2.1)

Then the data point (X, Z, δ) becomes the transformed data point (X, Y*). This transformation is only conceptual, since the transformation functions φ_1(·,·) and φ_2(·,·) are unknown. Since the estimation of the regression function m(·) is based on the transformed data {(X_j, Y*_j)}, the basic requirement is that E(Y* | X = x) = m(x), which ensures that we estimate the right target function. This requirement is equivalent to

\int_0^\infty \left\{ \varphi_1(x, y)\, G(y \mid x) - \int_0^y \varphi_2(x, c)\, dG(c \mid x) - y \right\} dF(y \mid x) = 0    (2.2)
for all x. An intuitive and appealing transformation is given by Buckley and James (1979):

\varphi_1(x, y) = y \qquad\text{and}\qquad \varphi_2(x, y) = E(Y \mid Y > y,\, X = x).    (2.3)
This means that the uncensored observations are kept unchanged and the censored observations are restored in the best possible way. It is clear that the Buckley-James transformation depends on the unknown regression function. To overcome this shortcoming, Fan and Gijbels (1994) proposed a data-driven method. The basic idea of the data-based approach is as follows. Suppose that the data point (X_i, Z_i, δ_i) is censored. In a neighborhood of X_i, the censored observation Z_i is replaced by a weighted average of all uncensored data points larger than Z_i in that neighborhood. Formally, let k be an integer determining the neighborhood of X_i, and denote by W(·) a nonnegative weight function. We assume that all observations {(X_j, Z_j, δ_j)} are ordered according to the X_j's. Define
\hat Y_i = \frac{\sum_{j:\, Z_j > Z_i} \delta_j\, Z_j\, W\!\left( \dfrac{X_i - X_j}{(X_{i+k} - X_{i-k})/2} \right)}{\sum_{j:\, Z_j > Z_i} \delta_j\, W\!\left( \dfrac{X_i - X_j}{(X_{i+k} - X_{i-k})/2} \right)},    (2.4)
with the usual convention of truncating the neighborhood when the indices i + k or i - k are out of the index range. The above transformation, as pointed out by Fan and Gijbels (1994), is a nonparametric estimate of the conditional expectation in (2.3). Uncensored data points remain unchanged, and the transformed data are {(X_j, \hat Y_j)}. Since the Buckley-James transformation defined in (2.3) is the best restoration, the estimated transformation (2.4) is favorable and appealing, although some other methods are available. In practice, we recommend using transformation (2.4) together with the optimal smoothing parameter k selected by, say, a cross-validation criterion; see Fan and Gijbels (1994) for details.
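The following Python sketch implements the estimated transformation (2.4), under the assumption that the observations have already been ordered by the covariate; the quadratic weight W and the function name transform_censored are illustrative choices rather than part of the paper.

import numpy as np

def transform_censored(x, z, delta, k, weight=lambda u: np.maximum(0.0, 1.0 - u ** 2)):
    """Replace each censored Z_i by a local weighted average of the uncensored
    Z_j exceeding Z_i, as in (2.4); uncensored points are left unchanged."""
    x = np.asarray(x, float)
    z = np.asarray(z, float)
    delta = np.asarray(delta, int)
    n = len(x)
    y_hat = z.copy()
    for i in range(n):
        if delta[i] == 1:
            continue                                   # keep uncensored observations
        lo, hi = max(0, i - k), min(n - 1, i + k)      # truncate the window at the ends
        half_width = (x[hi] - x[lo]) / 2.0             # (X_{i+k} - X_{i-k}) / 2
        if half_width <= 0:
            half_width = 1.0
        w = weight((x[i] - x) / half_width)            # W((X_i - X_j) / half_width)
        mask = (delta == 1) & (z > z[i])               # uncensored points larger than Z_i
        denom = np.sum(w[mask])
        if denom > 0:
            y_hat[i] = np.sum(w[mask] * z[mask]) / denom
    return y_hat

# illustrative usage with data ordered by the covariate:
# order = np.argsort(x); y_star = transform_censored(x[order], z[order], delta[order], k=15)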
2.2 Local M-Estimator
With the transformed data {(X_j, Y*_j)}_{j=1}^n, we apply a robust version of the local linear regression approach with variable bandwidth to estimate the regression function m(x) and its derivative m'(x) at a point x_0. Namely, find β_1 and β_2 to minimize

\sum_{j=1}^{n} \rho\!\left( Y_j^* - \beta_1 - \beta_2\, \frac{X_j - x_0}{h_n} \right) b(X_j)\, K\!\left( \frac{x_0 - X_j}{h_n / b(X_j)} \right),    (2.5)
where K(·) is a kernel function, h_n is a sequence of positive numbers tending to zero, ρ(·) is a given nonnegative loss function, and b(·) is a positive function reflecting the variable amount of smoothing at each data point. The quantity h_n/b(X_j) is called the variable bandwidth in the literature. Studies of the variable bandwidth can be found in Müller and Stadtmüller (1987), Hall and Marron (1988), Hall, Hu and Marron (1995), Fan and Gijbels (1996), and Fan and Jiang (1999), among others. The main advantage of using a variable bandwidth is to cope well with spatially inhomogeneous curves, heteroscedastic errors, and highly nonuniform designs.

Let β̂_1 and β̂_2 be the minimizers of (2.5). Then β̂_1 and β̂_2 satisfy the equation

\sum_{j=1}^{n} \psi\!\left( Y_j^* - \hat\beta_1 - \hat\beta_2\, \frac{X_j - x_0}{h_n} \right) b(X_j)\, K\!\left( \frac{x_0 - X_j}{h_n / b(X_j)} \right) \binom{1}{(X_j - x_0)/h_n} = 0,    (2.6)

where ψ(·) is the derivative of ρ(·). The local M-estimators of m(x_0) and m'(x_0) are defined as β̂_1 and β̂_2/h_n; we denote them by m̂(x_0) and m̂'(x_0).

Note that solving the nonlinear system (2.6) involves intensive iteration, so the computational burden is a serious issue. To save computing cost, Fan and Jiang (1999) proposed a one-step estimation approach, which was also discussed in Cai, Fan and Li (1999). The basic idea is first to find a good initial estimator and then to iterate one step of the Newton-Raphson formula. Fan and Jiang (1999) showed that the one-step estimator has the same asymptotic performance as the fully iterative one, provided that the initial estimator is good enough. We refer to Fan and Jiang (1999) for a discussion of the theory and to Cai, Fan and Li (1999) for a detailed description of the computational algorithm.
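A minimal sketch of such an estimator is given below, assuming Huber's ψ as the score function and a local linear least-squares fit as the initial estimator for the one Newton-Raphson step; the function names, the Epanechnikov kernel, and the cutoff value are illustrative choices, and no attempt is made to reproduce the exact algorithm of Cai, Fan and Li (1999).

import numpy as np

def huber_psi(t, c=1.345):
    # Huber score function psi(t) = max(-c, min(c, t)); c is a tuning constant
    return np.clip(t, -c, c)

def huber_dpsi(t, c=1.345):
    # derivative psi'(t) = 1{|t| <= c}
    return (np.abs(t) <= c).astype(float)

def one_step_local_m(x, y, x0, h, b=None,
                     kernel=lambda u: 0.75 * np.maximum(0.0, 1.0 - u ** 2)):
    """One-step local linear M-estimate of (m(x0), m'(x0)) from data (x, y),
    with variable-bandwidth factor b(.); b = 1 gives a constant bandwidth h."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b = np.ones_like(x) if b is None else np.asarray(b, float)
    u = (x - x0) / h
    w = b * kernel((x0 - x) * b / h)           # b(X_j) K((x0 - X_j) / (h / b(X_j)))
    X = np.column_stack([np.ones_like(u), u])  # local linear design in (beta1, beta2)
    # initial estimator: local linear least squares
    WX = X * w[:, None]
    beta = np.linalg.solve(X.T @ WX, WX.T @ y)
    # one Newton-Raphson step on the M-estimating equation (2.6)
    r = y - X @ beta
    score = X.T @ (w * huber_psi(r))
    hess = X.T @ (X * (w * huber_dpsi(r))[:, None])
    beta = beta + np.linalg.solve(hess, score)
    return beta[0], beta[1] / h                # (m_hat(x0), m'_hat(x0))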
2.3 Bandwidth Choice

Particularly in the time series case, deriving asymptotically optimal bandwidths is a tedious matter. Here we propose an easily implemented and quick method to select the bandwidth h_n, which can be regarded as a modified multifold cross-validation criterion that accounts for the structure of time series data. Let m and Q be two given positive integers such that n > mQ. The basic idea is first to use Q sub-series of lengths n - qm (q = 1, ..., Q) to estimate the regression function and then to compute the one-step forecasting errors of the next section of the time series of length m based on the estimated model. More precisely, we choose h_n to minimize

AMS(h_n) = \sum_{q=1}^{Q} \sum_{j=n-qm+1}^{n-qm+m} \left\{ Y_j^* - \hat m(X_j) \right\}^2,    (2.7)

where m̂(·) is computed from the sample {(X_j, Y*_j)}_{j=1}^{n-qm} with bandwidth equal to h_n (n/(n - qm))^{1/5}. Note that for different sample sizes we rescale the bandwidth according to its optimal rate, i.e., h_n ∝ n^{-1/5}. In practice, we recommend using m = [0.1n] and Q = 4; we take m = [0.1n] rather than m = 1 simply for computational expediency. The optimal choice of the function b(·) is given by
"
2 00 bopt(x) = fX (x) fm (2x(x)g) "
" (x) 2
#=
1 5
where with " = Y ? m(X ), " (x) = IE
0(") j X = x ;
and
" 2 (x) = IE
;
h
(2.8)
2
i
(" ) j X = x :
For the detailed derivation of (2.8), see Fan and Jiang (1999). Note that bopt(x) involves unknown functions. In practice, we may use its consistent estimator. From (2.8), it is easy to see that the more design points or the larger curvature near x0 , the larger should b(x0) be, or the smaller should hn =b(x0) be. Intuitively, a smaller bandwidth should be used at the dense design regions or high curvature regions, whereas at low dense areas or low curvature regions, a larger bandwidth might be needed.
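A sketch of the criterion (2.7) follows, written for a generic smoother passed in as fit_predict (a placeholder for the local M-smoother of Section 2.2); the defaults m = [0.1 n] and Q = 4 follow the recommendation in the text, while the grid of candidate bandwidths is left to the user.

import numpy as np

def select_bandwidth(x, y, h_grid, fit_predict, m_frac=0.1, Q=4):
    """Modified multifold cross-validation: choose h minimizing AMS(h) in (2.7).

    fit_predict(x_train, y_train, x_eval, h) must return the fitted values
    m_hat(x_eval) computed from the training sub-series with bandwidth h."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    m = max(1, int(m_frac * n))
    assert n > m * Q, "need n > m Q"
    best_h, best_ams = None, np.inf
    for h in h_grid:
        ams = 0.0
        for q in range(1, Q + 1):
            n_train = n - q * m
            h_q = h * (n / n_train) ** 0.2      # rescale by the optimal rate h ~ n^(-1/5)
            pred = fit_predict(x[:n_train], y[:n_train],
                               x[n_train:n_train + m], h_q)
            ams += np.sum((y[n_train:n_train + m] - pred) ** 2)
        if ams < best_ams:
            best_h, best_ams = h, ams
    return best_h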
3 Sampling Properties

Before we state our main asymptotic results, we first introduce the mixing coefficient. Let \mathcal{F}_a^b be the σ-algebra generated by {(X_j, Y*_j)}_{j=a}^{b}. Denote by

\alpha(n) = \sup_{A \in \mathcal{F}_{-\infty}^{0},\; B \in \mathcal{F}_{n}^{\infty}} \left| P(AB) - P(A)\, P(B) \right|,

which is called the strong mixing coefficient of the stationary process {(X_t, Y*_t)}_{t=-\infty}^{\infty}. If α(n) → 0 as n → ∞, the process is called strongly mixing (α-mixing). Among the various mixing conditions used in the literature, α-mixing is reasonably weak and is known to be fulfilled for many stochastic processes, including many time series models. Gorodetskii (1977) and Withers (1981) derived conditions under which a linear process is α-mixing. In fact, under very mild assumptions, linear autoregressive and, more generally, bilinear time series models are strongly mixing with mixing coefficients decaying exponentially. Auestad and Tjøstheim (1990) provided illuminating discussions of the role of α-mixing (including geometric ergodicity) for model identification in nonlinear time series analysis. Chen and Tsay (1993) showed that the functional autoregressive process is geometrically ergodic under certain conditions. Furthermore, Masry and Tjøstheim (1995, 1997) demonstrated that under some mild conditions both ARCH processes and nonlinear additive autoregressive models with exogenous variables, which are particularly popular in finance and econometrics, are stationary and α-mixing.
3.1 Assumptions

We now list the regularity conditions needed in the proofs of the following theorems, although some of them are not the weakest possible. We assume throughout the paper that {(X_j, Y*_j)} is a stationary α-mixing process with mixing coefficient α(n), and we write ε_j = Y*_j - m(X_j).

(A1) The kernel function K(·) has a compact support, say [-1, 1].

(A2) \underline{b} := \min_x b(x) > 0 and b(·) is continuous at the point x_0.

(A3) The regression function m(·) has a continuous second derivative at the point x_0.

(A4) The bandwidths {h_n} satisfy h_n > 0 and n h_n → ∞.

(A5) E[ψ(ε) | X = x] = 0.

(A6) f_X(·) is continuous at the point x_0 and f_X(x_0) > 0.

(A7) The function ψ(·) is continuous and has a derivative ψ'(·) almost everywhere. Furthermore, the functions σ_{ψ'}(x), σ²_ψ(x), and E[{ψ'(ε)}² | X = x] are positive and continuous at the point x_0, and there exists a constant δ > 0 such that, for l = 0 and 1, E[ |ψ^{(l)}(ε)|^{2+δ} | X = x ] is bounded in a neighborhood of x_0.

(A8) The function ψ'(·) satisfies

E\!\left[ \sup_{|z| \le \eta} |\psi'(\varepsilon + z) - \psi'(\varepsilon)| \,\Big|\, X = x \right] = o(1)

and

E\!\left[ \sup_{|z| \le \eta} |\psi(\varepsilon + z) - \psi(\varepsilon) - \psi'(\varepsilon)\, z| \,\Big|\, X = x \right] = o(\eta)

as η → 0, uniformly in x in a neighborhood of x_0.

(A9) The mixing coefficients {α(k)} satisfy \sum_k k^a [\alpha(k)]^{\delta/(2+\delta)} < \infty for some a > δ/(2+δ), where δ is given in Condition (A7).

(A10) For l = 0 and 1 and all j ≥ 2, E[ |\psi^{(l)}(\varepsilon_1)\, \psi^{(l)}(\varepsilon_j)| \mid X_1 = u, X_j = v ] \le M < \infty for all (u, v) in a neighborhood of (x_0, x_0).

With s_n = (n h_n / \log n)^{1/2} and α(n) = O(n^{-d}) for some d > 0, Condition (A9) is satisfied for d > 2(δ + 1)/δ and Condition (A13) is satisfied if d > (1 + λ_0)/(1 - λ_0). Hence both conditions are satisfied if
\alpha(n) = O(n^{-d}), \qquad d > \max\left\{ \frac{1 + \lambda_0}{1 - \lambda_0},\; \frac{2(1+\delta)}{\delta} \right\}.

Note that there is a trade-off between the moment order 2 + δ of ψ(ε) and the decay rate of the mixing coefficient: the larger the order δ, the weaker the required decay rate of α(n).
Remark 3. The conditions imposed here on ψ(·) are mild and are satisfied in many applications. In particular, they are fulfilled for Huber's ψ(·) function. It should be stressed that we require neither the monotonicity and boundedness of ψ(·) nor the convexity of the loss function ρ(·), which are required by Fan, Hu and Truong (1994). Also, we dispense with the symmetry assumption on the conditional distribution of ε given X, which is needed in Härdle and Tsybakov (1988).
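For concreteness, Huber's loss and score functions take the following form, where the cutoff c > 0 is a user-chosen tuning constant (its value is not prescribed here):

\rho_c(t) = \begin{cases} t^2/2, & |t| \le c, \\ c\,|t| - c^2/2, & |t| > c, \end{cases}
\qquad
\psi_c(t) = \rho_c'(t) = \max\{-c,\, \min(c, t)\}.

With this choice ψ_c is bounded and continuous, and ψ'_c(t) = 1{|t| ≤ c} exists for all |t| ≠ c, so the smoothness requirements in (A7)-(A8) are easy to verify; Condition (A5) then amounts to a centering requirement on the conditional distribution of ε given X.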
3.2 Asymptotic Properties

For simplicity, we introduce some notation. Define

\mu_l = \int u^l K(u)\, du \qquad\text{and}\qquad \nu_l = \int u^l K^2(u)\, du.

Set

B_1 = \begin{pmatrix} \mu_0 & -\mu_1/b(x_0) \\ -\mu_1/b(x_0) & \mu_2/b^2(x_0) \end{pmatrix}
\qquad\text{and}\qquad
B_2 = \begin{pmatrix} \nu_0 & -\nu_1/b(x_0) \\ -\nu_1/b(x_0) & \nu_2/b^2(x_0) \end{pmatrix}.
THEOREM 1. Under Conditions (A1)-(A10), there exists a solution θ̂ = (β̂_1, β̂_2)^T to equation (2.6) such that

\hat\theta - \theta_0 \overset{P}{\longrightarrow} 0,

where θ_0 = (θ_{10}, θ_{20})^T = (m(x_0), h_n m'(x_0))^T. In other words,

\hat m(x_0) - m(x_0) \overset{P}{\longrightarrow} 0 \qquad\text{and}\qquad h_n\, [\hat m'(x_0) - m'(x_0)] \overset{P}{\longrightarrow} 0.
THEOREM 2. Under the assumptions of Theorem 1 and Conditions (A11) and (A12), if there exists some 0 < λ < 1 such that λ p_0 > 2 and n^{1-λ} h_n/(m_n \log n) → ∞, where p_0 and m_n are given in Conditions (A12) and (A11), respectively, then

\hat m(x_0) - m(x_0) \overset{a.s.}{\longrightarrow} 0 \qquad\text{and}\qquad h_n\, [\hat m'(x_0) - m'(x_0)] \overset{a.s.}{\longrightarrow} 0.
THEOREM 3. Under Conditions (A1)-(A10) and (A13)-(A15), we have

\sqrt{n h_n}\left[ \hat\theta - \theta_0 - \frac{h_n^2\, m''(x_0)}{2(\mu_0\mu_2 - \mu_1^2)\, b^2(x_0)} \begin{pmatrix} \mu_2^2 - \mu_1\mu_3 \\ (\mu_1\mu_2 - \mu_0\mu_3)\, b(x_0) \end{pmatrix} + o_p(h_n^2) \right] \overset{D}{\longrightarrow} N(0, \Sigma(x_0)),

where

\Sigma(x_0) = \frac{\sigma_\psi^2(x_0)\, b(x_0)}{\sigma_{\psi'}^2(x_0)\, f_X(x_0)}\, B_1^{-1} B_2 B_1^{-1}
            = \frac{\sigma_\psi^2(x_0)\, b(x_0)}{\sigma_{\psi'}^2(x_0)\, f_X(x_0)}\, \frac{1}{(\mu_0\mu_2 - \mu_1^2)^2} \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}

with b_{11} = \mu_2^2\nu_0 - 2\mu_1\mu_2\nu_1 + \mu_1^2\nu_2, b_{12} = b_{21} = [\mu_1\mu_2\nu_0 - \mu_0\mu_2\nu_1 + \mu_0\mu_1\nu_2 - \mu_1^2\nu_1]\, b(x_0), and b_{22} = [\mu_1^2\nu_0 - 2\mu_0\mu_1\nu_1 + \mu_0^2\nu_2]\, b^2(x_0). In particular, when K(·) is a density function with mean zero,

\sqrt{n h_n}\left[ \hat m(x_0) - m(x_0) - \frac{h_n^2\, m''(x_0)\, \mu_2}{2\, b^2(x_0)} + o_p(h_n^2) \right] \overset{D}{\longrightarrow} N(0, \sigma^2(x_0)),

where

\sigma^2(x_0) = \frac{\nu_0\, \sigma_\psi^2(x_0)\, b(x_0)}{\sigma_{\psi'}^2(x_0)\, f_X(x_0)}.
3.3 Standard Errors

Standard errors are very useful for assessing sampling variability, and they are frequently used in constructing pointwise confidence intervals. Note that the local M-criterion (2.5) is really a weighted M-type likelihood of a corresponding parametric counterpart. Therefore, the covariance matrix can be estimated by the conventional technique. To this end, define

\hat T_{nl} = \frac{1}{n h_n} \sum_{j=1}^{n} \psi'\!\left( Y_j^* - \hat m(x_0) - \hat m'(x_0)(X_j - x_0) \right) b(X_j)\, (X_j - x_0)^l\, K\!\left( \frac{x_0 - X_j}{h_n/b(X_j)} \right)

and

\hat\Lambda_{n,2}^{l} = \frac{1}{\sqrt{n h_n}} \sum_{j=1}^{n} \psi\!\left( Y_j^* - \hat m(x_0) - \hat m'(x_0)(X_j - x_0) \right) b(X_j)\, (X_j - x_0)^l\, K\!\left( \frac{x_0 - X_j}{h_n/b(X_j)} \right).

Then a consistent estimator of Σ(x_0) is given by

\hat\Sigma(x_0) = \begin{pmatrix} \hat T_{n0} & \hat T_{n1}/h_n \\ \hat T_{n1}/h_n & \hat T_{n2}/h_n^2 \end{pmatrix}^{-1} \begin{pmatrix} \hat\Lambda_{n,2}^{0} \\ \hat\Lambda_{n,2}^{1}/h_n \end{pmatrix}^{\otimes 2} \begin{pmatrix} \hat T_{n0} & \hat T_{n1}/h_n \\ \hat T_{n1}/h_n & \hat T_{n2}/h_n^2 \end{pmatrix}^{-1} \equiv \big( \hat\sigma_{ij}(x_0) \big)_{2\times 2},

where A^{\otimes 2} denotes A A^T for any vector or matrix A. Clearly, a consistent estimator of σ²(x_0) is \hat\sigma^2(x_0) = \hat\sigma_{11}(x_0).
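A plug-in sketch of these estimators follows, assuming the local M-estimates m̂(x_0), m̂'(x_0) and the functions ψ, ψ' (e.g., Huber's) are supplied by the user; the helper name sigma_hat and the Epanechnikov kernel default are illustrative.

import numpy as np

def sigma_hat(x, y, x0, h, m_hat, m1_hat, psi, dpsi, b=None,
              kernel=lambda u: 0.75 * np.maximum(0.0, 1.0 - u ** 2)):
    """Estimate Sigma(x0) by T_hat^{-1} (lambda_hat lambda_hat^T) T_hat^{-1},
    as in Section 3.3; the (0, 0) entry estimates sigma^2(x0)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    b = np.ones_like(x) if b is None else np.asarray(b, float)
    w = b * kernel((x0 - x) * b / h)                  # b(X_j) K((x0 - X_j)/(h/b(X_j)))
    r = y - m_hat - m1_hat * (x - x0)                 # residuals of the local linear fit
    d = x - x0
    T = np.array([np.sum(dpsi(r) * w * d ** l) for l in range(3)]) / (n * h)
    lam = np.array([np.sum(psi(r) * w * d ** l) for l in range(2)]) / np.sqrt(n * h)
    Tmat = np.array([[T[0], T[1] / h], [T[1] / h, T[2] / h ** 2]])
    lvec = np.array([lam[0], lam[1] / h])
    Tinv = np.linalg.inv(Tmat)
    return Tinv @ np.outer(lvec, lvec) @ Tinv

Ignoring the bias term in Theorem 3, an approximate pointwise confidence interval for m(x_0) is then m̂(x_0) ± z_{1-α/2} \hat\sigma(x_0)/\sqrt{n h_n}.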
4 Proofs

In this section, we give the proofs of the weak consistency (Theorem 1), the strong consistency (Theorem 2), and the asymptotic normality (Theorem 3). To this end, the following lemmas are needed for technical convenience.
Lemma 1. Suppose that {(X_j, Y_j)} is a stationary α-mixing sequence and that {C_j} is an iid sequence independent of {(X_j, Y_j)}. Then {(X_j, Z_j, δ_j)} is also a stationary α-mixing process with the same mixing coefficients, where Z_j = min(C_j, Y_j) and δ_j = 1{Y_j ≤ C_j}. Furthermore, so is {(X_j, Y*_j)}, where Y*_j = δ_j φ_1(X_j, Z_j) + (1 - δ_j) φ_2(X_j, Z_j).
Proof. See Lemma 1 in Cai (1998).

Lemma 2. Under Conditions (A1)-(A10), let

T_{nl} = \sum_{j=1}^{n} \psi'(\varepsilon_j)\, b(X_j)\, (X_j - x_0)^l\, K\!\left( \frac{x_0 - X_j}{h_n/b(X_j)} \right).

Then

E(T_{nl}) = (-1)^l \mu_l\, \sigma_{\psi'}(x_0)\, n h_n^{l+1}\, f_X(x_0)\, b^{-l}(x_0)\, [1 + o(1)]    (4.1)

and

Var(T_{nl}) = O(n h_n^{2l+1}).    (4.2)

Furthermore,

T_{nl} = (-1)^l \mu_l\, \sigma_{\psi'}(x_0)\, n h_n^{l+1}\, f_X(x_0)\, b^{-l}(x_0)\, [1 + o_p(1)].
Proof. By stationarity, conditioning on X, and the continuity of b(·), f_X(·), and σ_{ψ'}(·), we have

E(T_{nl}) = n\, E\!\left[ \sigma_{\psi'}(X)\, b(X)\, (X - x_0)^l\, K\!\left( \frac{x_0 - X}{h_n/b(X)} \right) \right] = (-1)^l \mu_l\, \sigma_{\psi'}(x_0)\, n h_n^{l+1}\, f_X(x_0)\, b^{-l}(x_0)\, [1 + o(1)].    (4.3)

Therefore, (4.1) holds. To prove (4.2), let ξ_j = ψ'(ε_j) b(X_j)(X_j - x_0)^l K((x_0 - X_j)/(h_n/b(X_j))). Then {ξ_j} is stationary and α-mixing by Lemma 1, and stationarity gives

Var(T_{nl}) = n\, Var(\xi_1) + 2 \sum_{j=2}^{n} (n - j + 1)\, Cov(\xi_1, \xi_j).    (4.4)

Similar to (4.3),

E(\xi_1^2) = O(h_n^{2l+1}).    (4.5)

To bound the second term on the right-hand side of (4.4), we split it into two parts:

\sum_{j=2}^{n} (n - j + 1)\, |Cov(\xi_1, \xi_j)| \le n \sum_{j=2}^{d_n} |Cov(\xi_1, \xi_j)| + n \sum_{j=d_n+1}^{n} |Cov(\xi_1, \xi_j)| \equiv J_{n,1} + J_{n,2},    (4.6)

where d_n is a sequence of positive integers such that d_n h_n → 0 as n → ∞. By conditioning on X_1 and X_j, Condition (A10), and the continuity of b(·), one has

|E(\xi_1 \xi_j)| \le C\, E\!\left[ b(X_1)\, b(X_j)\, |X_1 - x_0|^l\, |X_j - x_0|^l\, K\!\left( \frac{x_0 - X_1}{h_n/b(X_1)} \right) K\!\left( \frac{x_0 - X_j}{h_n/b(X_j)} \right) \right] = O(h_n^{2l+2}),

and, by (4.3), |E(\xi_1)\, E(\xi_j)| = O(h_n^{2l+2}) as well, which implies that

J_{n,1} = o(n h_n^{2l+1}),    (4.7)

since d_n h_n → 0. For J_{n,2}, by Davydov's inequality (see Hall and Heyde 1980, Corollary A.2), we have

|Cov(\xi_1, \xi_j)| \le C\, [\alpha(j-1)]^{\delta/(2+\delta)} \left[ E|\xi_1|^{2+\delta} \right]^{2/(2+\delta)}.    (4.8)

By conditioning on X_1 and using Condition (A7), one has

E|\xi_1|^{2+\delta} \le C\, E\!\left[ b^{2+\delta}(X_1)\, |X_1 - x_0|^{(2+\delta)l}\, K^{2+\delta}\!\left( \frac{x_0 - X_1}{h_n/b(X_1)} \right) \right] = O\!\left( h_n^{(2+\delta)l + 1} \right),    (4.9)

which, in conjunction with (4.8) and (4.6), implies that

J_{n,2} \le C\, n\, h_n^{2l + 2/(2+\delta)} \sum_{k \ge d_n} [\alpha(k)]^{\delta/(2+\delta)} \le C\, n\, d_n^{-a}\, h_n^{2l + 2/(2+\delta)} \sum_{k \ge d_n} k^a\, [\alpha(k)]^{\delta/(2+\delta)} = o(n h_n^{2l+1})    (4.10)

by choosing d_n such that d_n^a\, h_n^{\delta/(2+\delta)} \asymp 1, so that the requirement d_n h_n → 0 is also satisfied, since a > δ/(2+δ). A combination of (4.4)-(4.7) and (4.10) completes the proof of (4.2).
Before stating the next lemma, we define R(X_j) = m(X_j) - m(x_0) - m'(x_0)(X_j - x_0),

\Lambda_{n,1}^{l} = \sum_{j=1}^{n} \psi'(\varepsilon_j)\, R(X_j)\, b(X_j)\, (X_j - x_0)^l\, K\!\left( \frac{x_0 - X_j}{h_n/b(X_j)} \right)

and

\Lambda_{n,2}^{l} = \sum_{j=1}^{n} \psi(\varepsilon_j)\, b(X_j)\, (X_j - x_0)^l\, K\!\left( \frac{x_0 - X_j}{h_n/b(X_j)} \right).

Lemma 3. Under Conditions (A1)-(A10), we have

\Lambda_{n,1}^{l} = \tfrac{1}{2} (-1)^l \mu_{l+2}\, \sigma_{\psi'}(x_0)\, n h_n^{l+3}\, m''(x_0)\, f_X(x_0)\, b^{-l-2}(x_0)\, [1 + o_p(1)],

E(\Lambda_{n,2}^{l}) = 0, and

Var(\Lambda_{n,2}^{l}) = \nu_{2l}\, \sigma_\psi^2(x_0)\, n h_n^{2l+1}\, f_X(x_0)\, b^{1-2l}(x_0)\, [1 + o(1)].
Proof. The proof is similar to that of Lemma 2 and is omitted.

Let

\Lambda_{n}^{l} = \sum_{j=1}^{n} \psi\!\left( Y_j^* - m(x_0) - m'(x_0)(X_j - x_0) \right) b(X_j)\, (X_j - x_0)^l\, K\!\left( \frac{x_0 - X_j}{h_n/b(X_j)} \right).

Then

\Lambda_{n}^{l} = \Lambda_{n,1}^{l} + \Lambda_{n,2}^{l} + \Lambda_{n,3}^{l},    (4.11)

where

\Lambda_{n,3}^{l} = \sum_{j=1}^{n} \left[ \psi(\varepsilon_j + R(X_j)) - \psi(\varepsilon_j) - \psi'(\varepsilon_j)\, R(X_j) \right] b(X_j)\, (X_j - x_0)^l\, K\!\left( \frac{x_0 - X_j}{h_n/b(X_j)} \right).

Next we show that \Lambda_{n,3}^{l} is negligible compared to \Lambda_{n,1}^{l}. By Condition (A3) and Taylor's expansion, for |X_j - x_0| < h_n/\underline{b},

\max_{1 \le j \le n} |R(X_j)| \le \tfrac{1}{2} \sup_{|x - x_0| \le h_n/\underline{b}} |m''(x)|\, \max_{1 \le j \le n} |X_j - x_0|^2 \le C h_n^2.    (4.12)

By stationarity, conditioning on X, Condition (A8), and (4.12), we have

E|\Lambda_{n,3}^{l}| \le o(n h_n^2)\, E\!\left[ b(X)\, |X - x_0|^l\, K\!\left( \frac{x_0 - X}{h_n/b(X)} \right) \right] = o(n h_n^{l+3}),

which leads to

\Lambda_{n,3}^{l} = o_p(n h_n^{l+3}).    (4.13)

By Lemmas 2 and 3, we have

\Lambda_{n}^{l} = O_p\!\left( n h_n^{l+3} + (n h_n^{2l+1})^{1/2} \right).    (4.14)
To establish the strong consistency, we need a Bernstein-type inequality for α-mixing processes, due to Carbon (1988), which is stated here as the following lemma without proof.

Lemma 4. Assume that {U_j} is an α-mixing process satisfying E(U_j) = 0 for all j and, for any n ≥ 1,

\max_{1 \le j \le n} |U_j| \le d_n \qquad\text{and}\qquad \max_{1 \le j \le n} E(U_j^2) \le D_n.

Then, for any ε > 0 and n ≥ 3,

P\!\left\{ \Big| \sum_{j=1}^{n} U_j \Big| \ge n\varepsilon \right\} \le 2 \exp\!\left\{ -\beta n \varepsilon + 4 e \beta^2 n k_n \Big( D_n + 8 d_n \sum_{j=1}^{k_n} \alpha(j) \Big) \right\} + 2 e \sqrt{n}\, k_n\, \alpha(k_n)

for any integer k_n and constant β such that k_n ≤ n/2 and 0 < β < (4 e k_n d_n)^{-1}.
Proof of Theorem 1. Let θ = (β_1, β_2)^T. The problem is equivalent to showing that there exists a solution θ̂ minimizing

\ell_n(\theta) = \frac{1}{n h_n} \sum_{j=1}^{n} \rho\!\left( Y_j^* - \beta_1 - \beta_2\, \frac{X_j - x_0}{h_n} \right) b(X_j)\, K\!\left( \frac{x_0 - X_j}{h_n/b(X_j)} \right)    (4.15)

such that θ̂ → θ_0 in probability. Denote by S_η the sphere centered at θ_0 with radius η. We will show that, for any sufficiently small η, the probability of the event

\inf_{\theta \in \partial S_\eta} \ell_n(\theta) \ge \ell_n(\theta_0)    (4.16)

tends to one as n → ∞. Hence \ell_n(θ) has a local minimum in the interior of S_η, at which equation (2.6) must be satisfied. It follows that, for any η > 0, with probability tending to one as n → ∞, (2.6) has a solution θ̂ within S_η. Let m̂(x_0) and m̂'(x_0) be the roots closest to m(x_0) and m'(x_0), respectively. Then

P\!\left\{ \big( \hat m(x_0) - m(x_0) \big)^2 + h_n^2 \big( \hat m'(x_0) - m'(x_0) \big)^2 \le \eta^2 \right\} \to 1.

This in turn implies that

\hat m(x_0) - m(x_0) \overset{P}{\longrightarrow} 0 \qquad\text{and}\qquad h_n\, [\hat m'(x_0) - m'(x_0)] \overset{P}{\longrightarrow} 0.

Now we establish (4.16). By Taylor's expansion,

\ell_n(\theta) = \ell_n(\theta_0) + \ell_n'(\theta_0)^T (\theta - \theta_0) + \tfrac{1}{2} (\theta - \theta_0)^T \ell_n''(\theta^*)\, (\theta - \theta_0),    (4.17)
where θ* lies between θ_0 and θ. Clearly,

\ell_n'(\theta_0) = -\frac{1}{n h_n} \sum_{j=1}^{n} \psi(\varepsilon_j + R(X_j))\, b(X_j)\, \binom{1}{(X_j - x_0)/h_n}\, K\!\left( \frac{x_0 - X_j}{h_n/b(X_j)} \right) = -\frac{1}{n h_n} \binom{\Lambda_n^0}{\Lambda_n^1 / h_n}.    (4.18)

An application of (4.14) yields

\ell_n'(\theta_0) = o_p(1).    (4.19)

The Hessian matrix of \ell_n(θ) is given by

\ell_n''(\theta) = \frac{1}{n h_n} \sum_{j=1}^{n} \psi'(\varepsilon_j + \zeta_j)\, b(X_j)\, \binom{1}{(X_j - x_0)/h_n}^{\otimes 2} K\!\left( \frac{x_0 - X_j}{h_n/b(X_j)} \right),

where \zeta_j = R(X_j) + (\theta_{10} - \theta_1) + (\theta_{20} - \theta_2)(X_j - x_0)/h_n. It can be decomposed as

\ell_n''(\theta) = \frac{1}{n h_n} \begin{pmatrix} T_{n0} & T_{n1}/h_n \\ T_{n1}/h_n & T_{n2}/h_n^2 \end{pmatrix} + \frac{1}{n h_n} \sum_{j=1}^{n} \left[ \psi'(\varepsilon_j + \zeta_j) - \psi'(\varepsilon_j) \right] b(X_j)\, \binom{1}{(X_j - x_0)/h_n}^{\otimes 2} K\!\left( \frac{x_0 - X_j}{h_n/b(X_j)} \right).    (4.20)

It follows from (4.12) that, for |X_j - x_0| < h_n/\underline{b} and θ ∈ S_η,

\max_{1 \le j \le n} |\zeta_j| \le \max_{1 \le j \le n} |R(X_j)| + (1 + 1/\underline{b})\, \eta \to 0    (4.21)

as η → 0. By Condition (A8), (4.21), and the same arguments as those employed in the proof of (4.13), the second term on the right-hand side of (4.20) is o_p(1), which, in conjunction with (4.20) and Lemma 2, implies that

\ell_n''(\theta) = \sigma_{\psi'}(x_0)\, f_X(x_0)\, B_1 + o_p(1).

Let γ_1 be the smallest eigenvalue of the positive definite matrix B_1. Then

\lim_{n \to \infty} P\!\left\{ \inf_{\theta \in \partial S_\eta} (\theta - \theta_0)^T \ell_n''(\theta^*)\, (\theta - \theta_0) > 0.5\, \gamma_1\, f_X(x_0)\, \sigma_{\psi'}(x_0)\, \eta^2 \right\} = 1.

It follows from this, (4.17), and (4.19) that (4.16) holds, which completes the proof of Theorem 1.
Proof of Theorem 2. To establish the strong consistency, by the same arguments as those used in the proof of Theorem 1, it suffices to show that

\ell_n'(\theta_0) \overset{a.s.}{=} o(1) \qquad\text{and}\qquad \ell_n''(\theta) \overset{a.s.}{\ge} C    (4.22)

as n → ∞, for some positive definite matrix C. We derive only the first assertion in (4.22); the proof of the second assertion is similar. In view of (4.11), we first show that

\Lambda_{n,2}^{l} \overset{a.s.}{=} o(n h_n^{1+l}).    (4.23)

To this end, we define

V_{j,1}^{l} = \frac{1}{n h_n}\, \psi(\varepsilon_j)\, 1\{ |\psi(\varepsilon_j)| \le n^{\lambda} \}\, b(X_j) \left( \frac{X_j - x_0}{h_n} \right)^{l} K\!\left( \frac{x_0 - X_j}{h_n/b(X_j)} \right)

and

V_{j,2}^{l} = \frac{1}{n h_n}\, \psi(\varepsilon_j)\, 1\{ |\psi(\varepsilon_j)| > n^{\lambda} \}\, b(X_j) \left( \frac{X_j - x_0}{h_n} \right)^{l} K\!\left( \frac{x_0 - X_j}{h_n/b(X_j)} \right).

Clearly,

\frac{1}{n h_n^{1+l}}\, \Lambda_{n,2}^{l} = \sum_{j=1}^{n} V_{j,1}^{l} + \sum_{j=1}^{n} V_{j,2}^{l} \equiv A_{n,1}^{l} + A_{n,2}^{l}.

Then we need to show that A_{n,1}^{l} \overset{a.s.}{=} o(1) and A_{n,2}^{l} \overset{a.s.}{=} o(1) for l = 0 and 1. Since K(·) has compact support, only indices j with |X_j - x_0| < h_n/\underline{b} contribute and, by the boundedness of b(·), we have

\{ A_{n,2}^{l} \ne 0 \} \subseteq \bigcup_{j=1}^{n} \{ |\psi(\varepsilon_j)| > n^{\lambda} \}.

Condition (A12), stationarity, and Markov's inequality yield

P\{ A_{n,2}^{l} \ne 0 \} \le \sum_{j=1}^{n} P\{ |\psi(\varepsilon_j)| > n^{\lambda} \} = O(n^{1 - \lambda p_0}).

Hence \sum_n P\{ A_{n,2}^{l} \ne 0 \} < \infty, since λ p_0 > 2, which implies by the Borel-Cantelli lemma that A_{n,2}^{l} \overset{a.s.}{=} o(1). It is easy to see that \{V_{j,1}^{l}\} is stationary and α-mixing, with

| V_{j,1}^{l} - E(V_{j,1}^{l}) | \le C_1\, n^{\lambda - 1} h_n^{-1} \equiv d_n \qquad\text{and}\qquad E(V_{j,1}^{l})^2 \le C_2\, n^{-2} h_n^{-1} \equiv D_n.

By applying the Bernstein inequality in Lemma 4 to \{V_{j,1}^{l} - E(V_{j,1}^{l})\} with k_n = m_n, we have, for any ε > 0,

P\!\left\{ | A_{n,1}^{l} - E(A_{n,1}^{l}) | > \varepsilon \right\} \le C_7 \exp\!\left( -C_8\, \varepsilon\, \frac{n^{1-\lambda} h_n}{m_n} \right) + C_9\, \sqrt{n}\, m_n\, \alpha(m_n),

which is summable in n under the assumed conditions, since n^{1-λ} h_n/(m_n \log n) → ∞ and by Condition (A13). Hence, by the Borel-Cantelli lemma, A_{n,1}^{l} - E(A_{n,1}^{l}) \overset{a.s.}{=} o(1), and (4.23) holds true by the fact that E(A_{n,1}^{l}) → 0. Similarly, it can be shown that \Lambda_{n,1}^{l} \overset{a.s.}{=} o(n h_n^{1+l}) and \Lambda_{n,3}^{l} \overset{a.s.}{=} o(n h_n^{1+l}) by using Condition (A12) and the fact that \max_{1 \le j \le n} |R(X_j)| = O(h_n^2). This implies that \Lambda_{n}^{l} \overset{a.s.}{=} o(n h_n^{1+l}). Hence \ell_n'(\theta_0) \overset{a.s.}{=} o(1) by (4.18). This completes the proof of Theorem 2.
Proof of Theorem 3. Recall that θ̂ is the local M-estimator of θ_0. Let

\hat\varepsilon_j = m(X_j) - \hat m(x_0) - \hat m'(x_0)(X_j - x_0) = R(X_j) - [\hat m(x_0) - m(x_0)] - [\hat m'(x_0) - m'(x_0)](X_j - x_0).

Then Y_j^* - \hat m(x_0) - \hat m'(x_0)(X_j - x_0) = \varepsilon_j + \hat\varepsilon_j, and

\ell_n'(\hat\theta) = -\frac{1}{n h_n} \sum_{j=1}^{n} \psi(\varepsilon_j + \hat\varepsilon_j)\, b(X_j)\, \binom{1}{(X_j - x_0)/h_n}\, K\!\left( \frac{x_0 - X_j}{h_n/b(X_j)} \right) = 0.

Simple algebra leads to

\begin{pmatrix} T_{n0} & T_{n1}/h_n \\ T_{n1}/h_n & T_{n2}/h_n^2 \end{pmatrix} (\hat\theta - \theta_0) - \binom{\Lambda_{n,1}^0}{\Lambda_{n,1}^1 / h_n} = \binom{\Lambda_{n,2}^0}{\Lambda_{n,2}^1 / h_n} + \binom{\Lambda_{n,4}^0}{\Lambda_{n,4}^1 / h_n},    (4.24)

where

\Lambda_{n,4}^{l} = \sum_{j=1}^{n} \left[ \psi(\varepsilon_j + \hat\varepsilon_j) - \psi(\varepsilon_j) - \psi'(\varepsilon_j)\, \hat\varepsilon_j \right] b(X_j)\, (X_j - x_0)^l\, K\!\left( \frac{x_0 - X_j}{h_n/b(X_j)} \right).

By Lemmas 2 and 3, the left-hand side of (4.24) equals

n h_n\, \sigma_{\psi'}(x_0)\, f_X(x_0) \left[ B_1 (\hat\theta - \theta_0) - \frac{h_n^2\, m''(x_0)}{2\, b^2(x_0)} \binom{\mu_2}{-\mu_3/b(x_0)} + o_p(h_n^2) \right],    (4.25)

which implies that the left-hand side of (4.24) is of order

O_p(n h_n) \left[ h_n^2 + \| \hat\theta - \theta_0 \| \right].    (4.26)

By the consistency of m̂(x_0) and m̂'(x_0), for |X_j - x_0| ≤ h_n/\underline{b},

\max_{1 \le j \le n} |\hat\varepsilon_j| \le \max_{1 \le j \le n} |R(X_j)| + |\hat m(x_0) - m(x_0)| + \frac{h_n}{\underline{b}}\, |\hat m'(x_0) - m'(x_0)| = O_p\!\left( h_n^2 + |\hat m(x_0) - m(x_0)| + h_n |\hat m'(x_0) - m'(x_0)| \right) = o_p(1).

By Condition (A8) and the same arguments as those used in the proof of (4.13),

\binom{\Lambda_{n,4}^0}{\Lambda_{n,4}^1 / h_n} = o_p\!\left( (n h_n) \left[ h_n^2 + \| \hat\theta - \theta_0 \| \right] \right),

which is of smaller order than the left-hand side of (4.24), by (4.26). Therefore, by (4.25),

\sqrt{n h_n} \left[ \hat\theta - \theta_0 - \frac{h_n^2\, m''(x_0)}{2(\mu_0\mu_2 - \mu_1^2)\, b^2(x_0)} \binom{\mu_2^2 - \mu_1\mu_3}{(\mu_1\mu_2 - \mu_0\mu_3)\, b(x_0)} \right] = \frac{B_1^{-1}}{\sigma_{\psi'}(x_0)\, f_X(x_0)}\, \frac{1}{\sqrt{n h_n}} \binom{\Lambda_{n,2}^0}{\Lambda_{n,2}^1 / h_n} + o_p(1).
Hence, to establish the asymptotic normality of \sqrt{n h_n}\,(\hat\theta - \theta_0), it suffices to show that, for any real numbers a_1 and a_2 with a = (a_1, a_2)^T ≠ 0, the asymptotic distribution of

I_n = \frac{1}{\sqrt{n h_n}} \left( a_1 \Lambda_{n,2}^0 + a_2 \Lambda_{n,2}^1 / h_n \right)

is normal. To this end, let

\eta_j = \frac{1}{\sqrt{h_n}}\, \psi(\varepsilon_j) \left[ a_1 + a_2 (X_j - x_0)/h_n \right] b(X_j)\, K\!\left( \frac{x_0 - X_j}{h_n/b(X_j)} \right).

Then \{\eta_j\} is stationary and α-mixing by Lemma 1, E(\eta_j) = 0, and I_n = n^{-1/2} \sum_{j=1}^{n} \eta_j, with

Var(I_n) = E(\eta_1^2) + 2 \sum_{j=2}^{n} \left( 1 - \frac{j-1}{n} \right) Cov(\eta_1, \eta_j).

By the same arguments as those employed in the proofs of Lemmas 2 and 3, we obtain

E(\eta_1^2) \to \sigma_\psi^2(x_0)\, b(x_0)\, f_X(x_0)\, a^T B_2 a \equiv \vartheta^2(x_0)

and

\sum_{j=2}^{n} |Cov(\eta_1, \eta_j)| = o(1).

Therefore,

Var(I_n) \to \vartheta^2(x_0).    (4.27)
To establish the asymptotic normality of I_n, we employ Doob's small-block and large-block technique. Namely, partition {1, ..., n} into 2q_n + 1 subsets with large blocks of size r = r_n and small blocks of size s = s_n, and set

q = q_n = \left\lfloor \frac{n}{r_n + s_n} \right\rfloor.    (4.28)

Define the random variables, for 0 ≤ j ≤ q - 1,

U_j = \sum_{i=j(r+s)+1}^{j(r+s)+r} \eta_i, \qquad V_j = \sum_{i=j(r+s)+r+1}^{(j+1)(r+s)} \eta_i, \qquad W_q = \sum_{i=q(r+s)+1}^{n} \eta_i.

Then

I_n = \frac{1}{\sqrt{n}} \left\{ \sum_{j=0}^{q-1} U_j + \sum_{j=0}^{q-1} V_j + W_q \right\} \equiv Q_{n,1} + Q_{n,2} + Q_{n,3}.

It suffices to show that, as n → ∞,

E[Q_{n,2}]^2 \to 0, \qquad E[Q_{n,3}]^2 \to 0,    (4.29)

\left| E\!\left[ \exp(i t Q_{n,1}) \right] - \prod_{j=0}^{q-1} E\!\left[ \exp(i t U_j/\sqrt{n}) \right] \right| \to 0,    (4.30)

\frac{1}{n} \sum_{j=0}^{q-1} E(U_j^2) \to \vartheta^2(x_0),    (4.31)

and

\frac{1}{n} \sum_{j=0}^{q-1} E\!\left[ U_j^2\, 1\{ |U_j| \ge \varepsilon\, \vartheta(x_0) \sqrt{n} \} \right] \to 0    (4.32)

for every ε > 0. Indeed, (4.29) implies that Q_{n,2} and Q_{n,3} are asymptotically negligible in probability; (4.30) shows that the summands U_j in Q_{n,1} are asymptotically independent; and (4.31) and (4.32) are the standard Lindeberg-Feller conditions for the asymptotic normality of Q_{n,1} in the independent setup.

Let us first establish (4.29). To this effect, we choose the large-block size. Condition (A13) implies that there is a sequence of positive constants \tau_n \to \infty such that

\tau_n s_n = o\!\left( \sqrt{n h_n} \right) \qquad\text{and}\qquad \tau_n\, (n/h_n)^{1/2}\, \alpha(s_n) \to 0.    (4.33)

Define the large-block size r_n by r_n = \lfloor (n h_n)^{1/2} / \tau_n \rfloor and the small-block size s_n. Then it can easily be shown from (4.33) that, as n → ∞,

s_n/r_n \to 0, \qquad r_n (n h_n)^{-1/2} \to 0, \qquad r_n/n \to 0, \qquad\text{and}\qquad (n/r_n)\, \alpha(s_n) \to 0.    (4.34)

Observe that

E[Q_{n,2}]^2 = \frac{1}{n} \left[ \sum_{j=0}^{q-1} Var(V_j) + 2 \sum_{0 \le i < j \le q-1} Cov(V_i, V_j) \right] \equiv F_1 + F_2.    (4.35)

It follows from stationarity and (4.27) that

F_1 = \frac{q_n}{n}\, Var(V_0) = \frac{q_n}{n}\, Var\!\left( \sum_{j=1}^{s_n} \eta_j \right) = \frac{q_n s_n}{n}\, \left[ \vartheta^2(x_0) + o(1) \right].    (4.36)

Next, consider the second term F_2. Since the blocks V_i and V_j with i < j are separated by at least r_n observations, we therefore have