Science in China Series A: Mathematics, Jul. 2008, Vol. 51, No. 7, 1315–1338

Two-stage local M-estimation of additive models

JIANG JianCheng 1,2 & LI JianTao 1†

1 School of Mathematical Sciences, Peking University, Beijing 100871, China
2 Department of Mathematics and Statistics, University of North Carolina at Charlotte, NC 28223, USA
(email: [email protected], [email protected])

Abstract  This paper studies local M-estimation of the nonparametric components of additive models. A two-stage local M-estimation procedure is proposed for estimating the additive components and their derivatives. Under very mild conditions, the proposed estimators of each additive component and its derivative are jointly asymptotically normal and share the same asymptotic distributions as they would have if the other components were known. The established asymptotic results also hold for two particular local M-estimations: the local least squares and least absolute deviation estimations. However, the implementation of general two-stage local M-estimation with continuous and nonlinear ψ-functions is time-consuming. To reduce the computational burden, one-step approximations to the two-stage local M-estimators are developed. The one-step estimators are shown to achieve the same efficiency as the fully iterative two-stage local M-estimators, which makes the two-stage local M-estimation more feasible in practice. The proposed estimators inherit the advantages and at the same time overcome the disadvantages of the local least-squares based smoothers. In addition, the practical implementation of the proposed estimation is considered in detail. Simulations demonstrate the merits of the two-stage local M-estimation, and a real example illustrates the performance of the methodology.

Keywords: local M-estimation, one-step approximation, orthogonal series estimator, two-stage

MSC(2000): 62G35, 62G05, 62G08

Received December 25, 2006; accepted August 20, 2007
DOI: 10.1007/s11425-007-0173-6
† Corresponding author
This work was supported by the National Natural Science Foundation of China (Grant No. 10471006)

1 Introduction

Consider the following additive model

    Y = α + m_1(X^1) + · · · + m_d(X^d) + ε,    (1)

where X^j (j = 1, . . . , d) is the j-th component of the random vector X ∈ R^d for a given d, α is an unknown constant, and the m_j (j = 1, . . . , d) are unknown functions to be estimated. To ensure the identifiability of the m_j(X^j), we impose the constraints E[m_j(X^j)] = 0 for all j. The additive model, which was suggested by Friedman and Stuetzle[1], has been widely used in multivariate nonparametric modeling. Since all of the unknown functions are one-dimensional, the difficulty associated with the so-called "curse of dimensionality" is substantially reduced. For details, see [2–4].

Given an iid random sample {(x_i, y_i)}_{i=1}^{n} of (X, Y), it is important to estimate the additive components m_j efficiently. There are many works in the literature on estimating the functions m_j. Most of them use the backfitting, marginal integration, or spline methods for fitting the model.


For example, the backfitting estimators were investigated in [5–9]; the marginal integration estimators were considered in [10–14]; the spline estimators were studied in [15, 16]. However, these estimation methods generally lead to complicated asymptotic biases for the estimators, which makes bandwidth selection difficult. Recently, Horowitz and Mammen[17] developed a two-stage method for estimating the additive components of a generalized additive model. They established an oracle phenomenon: each of the additive components can be estimated as well as if the other components were known. This makes the approach attractive and renders common bandwidth selection methods applicable to fitting the additive model.

The above-mentioned methods are based on the least squares technique and hence lack robustness. One method for achieving robustness is the approach of Linton[18] using partial Lq-norm medians, but it is difficult to choose an appropriate q to achieve high efficiency. Another method is to use local M-estimation based on the backfitting, marginal integration, or spline methods, but the resulting estimators have complicated asymptotic biases and lack the aforementioned oracle phenomenon. This motivates us to propose a two-stage local M-estimation method using the idea of Horowitz and Mammen[17], since the resulting estimators have simple asymptotic biases and share the oracle phenomenon, which facilitates the choice of bandwidth. Our estimation method includes the least squares (LS) and least absolute deviation (LAD) estimations as special examples. However, the implementation of general two-stage local M-estimation with continuous and nonlinear ψ-functions is time-consuming. To reduce the computational burden, one-step approximations[19, 20] to the two-stage local M-estimators are developed. The one-step estimators are shown to achieve the same efficiency as the fully iterative two-stage local M-estimators.

The two-stage local M-estimator is asymptotically normally distributed with an optimal convergence rate of n^{−2/5}, regardless of the (finite) dimension of X; that is, asymptotically there is no curse of dimensionality. It is shown that each of the additive components can be estimated as well as in one-dimensional nonparametric regression (see [19, 20]). Since the proposed two-stage local M-estimator admits no closed form and the number of parameters in the first-stage estimation diverges to infinity, considerable effort is devoted to deriving its asymptotic properties. The implementation of the proposed estimation method requires choosing several unknown parameters, so we present in detail a rule for selecting them.

This paper is organized as follows. In Section 2, we introduce the two-stage local M-estimation procedure. Some notations and conditions needed for establishing the asymptotic results are listed in Section 3. Section 4 presents the asymptotic results. The implementation of the proposed estimation method is discussed in Section 5. In Section 6, Monte Carlo simulations are carried out. In Section 7, a real data example is given to illustrate the utility of the proposed methodology. The proofs of the main results are given in Section 8.

2 Two-stage local M-estimation

The two-stage local M-estimation consists of the following orthogonal series based M-estimation and local M-estimation.


2.1 Orthogonal series based M-estimation

As in Horowitz and Mammen[17], we assume that the support of X is X = [−1, 1]^d and normalize the m_j's so that ∫_{−1}^{1} m_j(v) dv = 0. Let {p_k(·), k = 1, 2, . . .} be a standard orthogonal basis for smooth functions on [−1, 1] satisfying ∫_{−1}^{1} p_k(x) dx = 0 and

    ∫_{−1}^{1} p_k(x) p_j(x) dx = 1 if k = j, and 0 otherwise.

Let

    P_κ(x) = [1, p_1(x^1), . . . , p_κ(x^1), p_1(x^2), . . . , p_κ(x^2), . . . , p_1(x^d), . . . , p_κ(x^d)]^T

and θ_κ = (θ_0, θ_{11}, . . . , θ_{1κ}, . . . , θ_{d1}, . . . , θ_{dκ})^T. Then for any θ_κ ∈ R^{dκ+1}, P_κ^T(x)θ_κ is a series approximation to α + m(x). Our orthogonal series estimator of α + m(x) based on M-estimation is defined as α̃ + m̃(x) = P_κ^T(x)θ̂_{nκ}, where

    θ̂_{nκ} = arg min_{θ_κ ∈ Θ_κ} Σ_{i=1}^{n} ρ{y_i − P_κ^T(x_i)θ_κ},    (2)

with ρ(·) being a pre-determined outlier-resistant function and Θ_κ being a compact parameter set. The orthogonal series estimator of m_j(x^j) is the inner product of [p_1(x^j), . . . , p_κ(x^j)] with the corresponding components of θ̂_{nκ}. It will be employed as the initial estimator of the regression components in the second-stage estimation introduced below. The orthogonal series estimators are used to ensure that the biases of the first-stage estimators converge to zero rapidly.

2.2 Local M-estimation and its one-step approximation

Let X̃_i^1 = (1, X_i^1 − x^1)^T, x̃_i = (X_i^2, . . . , X_i^d)^T and a = (a_0, a_1)^T. Using the kernel weighted local linear M-estimation approach, we find

    â(x^1) = arg min_a Σ_{i=1}^{n} ρ{y_i − [α̃ + a^T X̃_i^1 + m̃_{−1}(x̃_i)]} K_{h_n}(X_i^1 − x^1),    (3)

where K_{h_n}(·) = K(·/h_n) for a kernel function K and a bandwidth h_n, and m̃_{−1}(x̃_i) = Σ_{k=2}^{d} m̃_k(X_i^k). Then â(x^1) is a local M-type estimator of a(x^1) = (m_1(x^1), m_1'(x^1))^T. Equivalently, when ρ is differentiable with derivative ψ, â(x^1) satisfies the following local M-estimation equations:

    Ψ_{nk}(a(x^1)) ≡ Σ_{i=1}^{n} ψ{y_i − [α̃ + a^T X̃_i^1 + m̃_{−1}(x̃_i)]} K_{h_n}(X_i^1 − x^1)(X_i^1 − x^1)^k = 0,    (4)

for k = 0, 1. When ρ(x) = x^2 or |x|, one can easily calculate the two-stage estimates. For other nonlinear ρ(·) with continuous derivative ψ(·), the above equations can be solved by Newton iterations, but this involves intensive computation at each point x^1. To reduce the computational burden, we here employ the one-step local M-estimation method developed in Fan and Jiang[19]. Specifically, given an initial value ₀a(x^1) = (m̌_1(x^1), m̌_1'(x^1))^T, we propose to estimate a(x^1) by

    ₁a(x^1) = ₀a(x^1) − W_n^{−1} Ψ_n(₀a(x^1)),    (5)


where Ψ_n = (Ψ_{n0}, Ψ_{n1})^T and W_n = (w_{km}) is the 2 × 2 Hessian matrix with w_{km} = ∂Ψ_{nk}/∂a_m |_{a=₀a(x^1)}, for k, m = 0, 1. We label the estimator ₁a(x^1) the "one-step local M-estimator" of a(x^1). For estimation of the other components m_j, the same procedure can be applied. In the following, we will demonstrate that the asymptotic properties of ₁a(x^1) are the same as those of the fully iterative two-stage local M-estimators.
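To fix ideas, the following minimal Python/NumPy sketch shows one way the update (5) could be computed at a single point x^1 once first-stage fitted values α̃ + m̃_{−1}(x̃_i) are available; the Huber ψ-function, the biweight kernel of Section 6, and all function and variable names are illustrative choices rather than the authors' code.

```python
import numpy as np

def huber_psi(r, c=1.345):
    """Huber psi-function psi_c(r) = max(-c, min(c, r))."""
    return np.clip(r, -c, c)

def huber_psi_prime(r, c=1.345):
    """Almost-everywhere derivative of the Huber psi-function."""
    return (np.abs(r) <= c).astype(float)

def biweight(u):
    """Biweight kernel K(u) = (15/16)(1 - u^2)^2 on |u| <= 1."""
    return np.where(np.abs(u) <= 1, 15.0 / 16.0 * (1.0 - u**2) ** 2, 0.0)

def one_step_local_m(x1, X1, partial_resid, a0, h, c=1.345):
    """One-step local M-estimator (5) of (m_1(x1), m_1'(x1)).

    X1            : n-vector of the covariate X_i^1
    partial_resid : n-vector y_i - alpha_tilde - m_tilde_{-1}(x_tilde_i)
    a0            : initial value (m_check_1(x1), m_check_1'(x1))
    h             : bandwidth h_n
    """
    a0 = np.asarray(a0, dtype=float)
    K = biweight((X1 - x1) / h)                          # K_{h_n}(X_i^1 - x^1)
    D = np.column_stack([np.ones_like(X1), X1 - x1])     # local linear design (1, X_i^1 - x^1)
    r = partial_resid - D @ a0                           # residuals at the initial value
    Psi = D.T @ (K * huber_psi(r, c))                    # (Psi_n0, Psi_n1) of (4)
    W = -(D * (K * huber_psi_prime(r, c))[:, None]).T @ D   # Hessian W_n = (w_km)
    return a0 - np.linalg.solve(W, Psi)                  # update (5)
```

Iterating this update, with ₁a(x^1) as the new initial value, gives the two-step estimator compared in Section 6, and repeating it until convergence gives the fully iterative estimator.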

3 Notations and conditions

To facilitate the exposition of the paper, we introduce some notations. Let

    H = diag(1, h_n),
    K_i = K((X_i^1 − x^1)/h_n),
    X̌_i = (1, (X_i^1 − x^1)/h_n)^T,
    m(x) = m_1(x^1) + · · · + m_d(x^d),
    S_{nκ}(θ) = n^{−1} Σ_{i=1}^{n} ρ{y_i − P_κ^T(x_i)θ},
    S_κ(θ) = E[S_{nκ}(θ)],
    θ_{κ0} = arg min_{θ ∈ Θ_κ} S_κ(θ),
    b_{κ0}(x) = α + m(x) − P_κ^T(x)θ_{κ0},
    b̄_{κ0}(x) = α + m_{−1}(x̃) − P̄_κ^T(x)θ_{κ0},
    θ̄_κ = (θ_0, 0, . . . , 0, θ_{21}, . . . , θ_{2κ}, . . . , θ_{d1}, . . . , θ_{dκ})^T,
    P̄_κ(x) = [1, 0, . . . , 0, p_1(x^2), . . . , p_κ(x^2), . . . , p_1(x^d), . . . , p_κ(x^d)]^T,
    R(X_i^1) = m_1(X_i^1) − m_1(x^1) − m_1'(x^1)(X_i^1 − x^1).

For any matrix A, define ‖A‖ as the Frobenius norm of A, that is, ‖A‖ = [Tr(A^T A)]^{1/2}; for any vector ξ, let ‖ξ‖ = (ξ^T ξ)^{1/2}. Throughout this paper, for any vector b, we denote bb^T by b^{⊗2}.

The following conditions are needed to derive the asymptotic properties of the proposed estimators.

(A1) Assume that {(x_i, y_i)}_{i=1}^{n} is an iid sample from the distribution of (X, Y).

(A2) The probability density function f_X(·) of X is continuous on X, and f_X(·) > 0 at the point x where the regression function is estimated.

(A3) (i) The outlier-resistant function ρ(·) has a derivative function ψ(·) almost everywhere, and ψ'(·) exists almost everywhere. Assume that E[ψ(ε) | X = x] = 0 for all x ∈ X.
(ii) Assume that

    E[ sup_{|δ| ⩽ ⋯} |ψ'(ε + δ) − ψ'(ε)| | X = x ] = o(1).

…

(A7) (ii) λ_min(Q_{κ0}) > c_λ for all κ. Assume that λ_max(Q_{κ0}) is bounded for all κ.

(A8) All component functions m_j (j = 1, . . . , d) are twice continuously differentiable on [−1, 1].

The conditions above are very mild and are fulfilled in many applications. Conditions (A3) and (A8) define the smoothness of the outlier-resistant function and of the components of the regression function, respectively. In particular, Huber's ψ_c-function[21] and the ψ-functions used in the local LS and LAD estimations satisfy the conditions. Condition (A4) bounds the magnitudes of the basis functions and ensures that the errors in the series estimators of the m_j converge to zero sufficiently fast as κ → ∞; this helps the second-stage estimator avoid the curse of dimensionality. These conditions are satisfied by splines and the Fourier basis. To simplify the proofs of the theoretical results, K(·) is assumed to have a compact support; this can be relaxed to allow kernels with noncompact support if restrictions are put on the tail of K(·). Condition (A6) states the rates at which κ → ∞ and the bandwidth converges to 0 as n → ∞. It requires the first-stage estimator to be undersmoothed. Undersmoothing is needed to ensure sufficiently rapid convergence of the bias of the orthogonal series estimator. We will show that the asymptotic normality of the two-stage local M-estimator does not depend on the choice of κ if (A6) is satisfied. Optimizing the choice of κ would require a rather complicated higher-order theory and is beyond the scope of this paper. Condition (A7) ensures the existence and nonsingularity of the asymptotic covariance matrix of the first-stage estimator.

4 Asymptotic properties

In this section, we establish the asymptotic normality of the two-stage local M-estimators. Put e_1 = (1, 0)^T and c_1 = (s_2, s_3)^T. Let S = (s_{i+j−2}) and V = (v_{i+j−2}) (for i, j = 1, 2) be 2 × 2 matrices, where s_l = ∫ z^l K(z) dz and v_l = ∫ z^l K²(z) dz (cf. Lemmas 6–8 in Section 8). Here Γ_1(x^1) = E[ψ'(ε) | X^1 = x^1] and Γ_2(x^1) = E[ψ²(ε) | X^1 = x^1] (see Section 5.1).

Theorem 1 (Consistency). Under Conditions (A1)–(A8), there exists a sequence of solutions to (4), denoted by â(x^1), such that H[â(x^1) − a(x^1)] → 0 in probability as n → ∞.

Theorem 2 (Fully iterative local M-estimator). Assume that Conditions (A1)–(A8) hold. Then for the estimator â(x^1) in Theorem 1,

    √(nh_n) {H[â(x^1) − a(x^1)] − β(x^1)} →_L N(0, Σ(x^1)),

where β(x^1) = (1/2) h_n² m_1''(x^1) S^{−1} c_1 (1 + o_p(1)) and

    Σ(x^1) = [Γ_2(x^1) / (f_{X^1}(x^1) Γ_1²(x^1))] S^{−1} V S^{−1}.

Corollary 1 (LAD estimator). If ρ(·) = |·| and ψ(z) = sgn(z), then the two-stage local M-estimation reduces to the case using the LAD principle. Let f(·|x^1) be the conditional density function of ε given X^1 = x^1. Assume that f(·|x^1) is continuous at zero and f(0|x^1) > 0. Then Γ_1(x^1) = 2f(0|x^1), Γ_2(x^1) = 1, and hence

    √(nh_n) {H[â(x^1) − a(x^1)] − β(x^1)} →_L N(0, Σ_LAD(x^1)),

where Σ_LAD(x^1) = S^{−1} V S^{−1} / [4 f_{X^1}(x^1) f²(0|x^1)].

Theorem 3 (One-step local M-estimator). Assume that the initial value ₀a(x^1) satisfies H[₀a(x^1) − a(x^1)] = O_p(h_n² + 1/√(nh_n)) and E{|ψ(ε + z) − ψ(ε) − ψ'(ε)z| | X^1 = x^1} = o(z) as z → 0, uniformly in x^1. Then under Conditions (A1)–(A8),

    √(nh_n) {H[₁a(x^1) − a(x^1)] − β(x^1)} →_L N(0, Σ(x^1)),

where β(x^1) and Σ(x^1) are the same as those in Theorem 2.

Remark 1. The two-stage local M-estimators of m_2(x^2), . . . , m_d(x^d) can be obtained similarly. The structures of the asymptotic biases and covariances of the proposed estimators are the same as those in univariate nonparametric regression (see [19, 20]). This shows that the two-stage local M-estimators have an oracle property: even though the other components are unknown, each estimator performs asymptotically as well as if they were known. The result holds regardless of the (finite) dimension of X, so asymptotically there is no curse of dimensionality.

Remark 2. According to Theorem 3, the asymptotic mean square error (AMSE) of the one-step local M-estimator is

    AMSE(x^1) = (1/4) h_n^4 [m_1''(x^1)]² [e_1^T S^{−1} c_1]² + (1/(nh_n)) · [Γ_2(x^1) / (f_{X^1}(x^1) Γ_1²(x^1))] · e_1^T S^{−1} V S^{−1} e_1.

Minimizing AMSE(x^1) with respect to h_n, we obtain an optimal bandwidth for estimating m_1(x^1) in the second-stage estimation:

    h_opt = { Γ_2(x^1)(s_2² v_0 + s_1² v_2 − 2 s_1 s_2 v_1) / ( f_{X^1}(x^1) [Γ_1(x^1) m_1''(x^1) (s_2² − s_1 s_3)]² ) }^{1/5} n^{−1/5}.    (6)
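To make (6) concrete, the following sketch (our own illustration, not the paper's code) evaluates h_opt for the biweight kernel used in Section 6; the pilot estimates of Γ_1(x^1), Γ_2(x^1), m_1''(x^1) and f_{X^1}(x^1) are assumed to be supplied by the user, and how they are obtained is left open here.

```python
import numpy as np

def kernel_moments(kernel=lambda u: 15 / 16 * (1 - u**2) ** 2,
                   grid=np.linspace(-1, 1, 20001)):
    """Numerically compute s_l = int z^l K(z) dz and v_l = int z^l K(z)^2 dz, l = 0,...,3."""
    K = kernel(grid)
    s = [np.trapz(grid**l * K, grid) for l in range(4)]
    v = [np.trapz(grid**l * K**2, grid) for l in range(4)]
    return s, v

def h_opt(gamma1, gamma2, m1_dd, f_x1, n):
    """Plug-in optimal bandwidth (6) at a point x^1, given pilot estimates."""
    s, v = kernel_moments()
    num = gamma2 * (s[2]**2 * v[0] + s[1]**2 * v[2] - 2 * s[1] * s[2] * v[1])
    den = f_x1 * (gamma1 * m1_dd * (s[2]**2 - s[1] * s[3]))**2
    return (num / den) ** 0.2 * n ** (-0.2)
```

For a symmetric kernel such as the biweight, s_1 = s_3 = 0, so (6) reduces to the familiar form { Γ_2(x^1) v_0 / [ f_{X^1}(x^1) Γ_1²(x^1) (m_1''(x^1))² s_2² ] }^{1/5} n^{−1/5}.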

5 Implementation of the proposed estimation

The proposed two-stage local M-estimation involves the outlier-resistant function ψ and the smoothing parameter κ_n in the first stage. In the following, we discuss the selection of this function and this parameter.


5.1 Choice of ψ(·)

By Theorem 3, the asymptotic bias of the one-step local M-estimator does not depend on ψ(·). Therefore, we choose a function ψ from the Huber ψ_c-function family C = {ψ_c : ψ_c(x) = max{−c, min(c, x)}, c > 0} to minimize the asymptotic variance parameter

    σ²(ψ, x^1, F) = Γ_2(x^1)/Γ_1²(x^1) = E(ψ²(ε) | X^1 = x^1) / [E(ψ'(ε) | X^1 = x^1)]²,

that is, to find ψ* such that σ²(ψ*, x^1, F) = inf_{ψ∈C} σ²(ψ, x^1, F), where F is the distribution of ε. Theoretically one should choose c to minimize σ²(ψ_c, x^1, F), but σ²(ψ_c, x^1, F) involves the unknown error distribution. However, σ²(ψ_c, x^1, F)/f_{X^1}(x^1) can be consistently estimated by

    σ̂²(ψ_c, x^1, F) = [ n^{−1} h_n^{−1} Σ_{j=1}^{n} ψ_c²(ε̂_j) K((X_j^1 − x^1)/h_n) ] / [ n^{−1} h_n^{−1} Σ_{j=1}^{n} ψ_c'(ε̂_j) K((X_j^1 − x^1)/h_n) ]²,    (7)

where ε̂_j = Y_j − α̂ − m̂(x_j), and α̂ + m̂(x_j) is any consistent estimator of α + m(x_j), such as the estimator in Horowitz and Mammen[17]. Therefore, one viable choice of c is the value ĉ minimizing σ̂²(ψ_c, x^1, F). A theoretical justification of this choice can be given in a way similar to Jiang and Mack[20]. A small sketch of this grid search is given below.
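The following fragment is a sketch of how ĉ could be obtained in practice (our own naming and grid, not the authors' code); it assumes that residuals ε̂_j from some consistent pilot fit are already available.

```python
import numpy as np

def sigma2_hat(c, resid, X1, x1, h,
               kernel=lambda u: np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)):
    """Local estimate (7) of sigma^2(psi_c, x^1, F) / f_{X^1}(x^1)."""
    K = kernel((X1 - x1) / h)
    num = np.mean(np.clip(resid, -c, c) ** 2 * K) / h             # n^{-1} h^{-1} sum psi_c^2(e_j) K_j
    den = np.mean((np.abs(resid) <= c).astype(float) * K) / h     # n^{-1} h^{-1} sum psi_c'(e_j) K_j
    return num / den**2

def choose_c(resid, X1, x1, h, grid=None):
    """Grid search for the c minimizing (7); the scale-adjusted default grid is our own choice."""
    if grid is None:
        grid = np.linspace(0.5, 3.0, 26) * np.median(np.abs(resid)) / 0.6745
    values = [sigma2_hat(c, resid, X1, x1, h) for c in grid]
    return grid[int(np.argmin(values))]
```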

5.2 Choice of κ_n

To automatically obtain an appropriate κ_n, we adopt the following Schwarz-type information criterion[22–24]:

    QBIC(κ_n) ≡ n log{ Σ_{i=1}^{n} ρ[y_i − P_κ^T(x_i)θ̂_{nκ}] } + 2(log n)κ_n.    (8)

Let κ̂_n be the minimizer of QBIC. For our simulation studies in the next section, we use this method to obtain the corresponding κ_n's. In order to reduce the asymptotic bias of the first-stage estimator, we choose κ_n = κ̂_n + 1 in all simulations, which corresponds to undersmoothing the first-stage estimator. A sketch of the first-stage fit and this selection rule follows.
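As a companion sketch (again our own construction, with a cosine basis standing in for the B-splines actually used in Section 6, and the Huber loss standing in for ρ), the fragment below fits the first-stage series M-estimator (2) for a given κ by numerical minimization and selects κ by criterion (8).

```python
import numpy as np
from scipy.optimize import minimize

def huber_rho(r, c=1.345):
    """Huber rho-function."""
    a = np.abs(r)
    return np.where(a <= c, 0.5 * r**2, c * a - 0.5 * c**2)

def cosine_basis(x, kappa):
    """p_k(x) = cos(k*pi*(x+1)/2): orthonormal on [-1, 1] with zero integral (our stand-in basis)."""
    k = np.arange(1, kappa + 1)
    return np.cos(np.pi * np.outer((x + 1) / 2.0, k))

def design(X, kappa):
    """Build the rows P_kappa(x_i): an intercept plus kappa basis functions per covariate."""
    cols = [np.ones((X.shape[0], 1))] + [cosine_basis(X[:, j], kappa) for j in range(X.shape[1])]
    return np.hstack(cols)

def series_m_fit(X, y, kappa, c=1.345):
    """First-stage estimator (2): minimize sum rho(y_i - P_kappa(x_i)^T theta)."""
    P = design(X, kappa)
    theta0 = np.linalg.lstsq(P, y, rcond=None)[0]        # least-squares starting value
    obj = lambda th: np.sum(huber_rho(y - P @ th, c))
    return minimize(obj, theta0, method="BFGS").x, P

def choose_kappa(X, y, kappas=range(2, 9), c=1.345):
    """Select kappa by QBIC (8); the paper then uses kappa_hat + 1 for undersmoothing."""
    n = len(y)
    qbic = []
    for kappa in kappas:
        theta, P = series_m_fit(X, y, kappa, c)
        qbic.append(n * np.log(np.sum(huber_rho(y - P @ theta, c))) + 2 * np.log(n) * kappa)
    return list(kappas)[int(np.argmin(qbic))]
```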

6 Simulation results

In this section, we compare the finite sample behavior of the two-stage local M-estimators with some previous estimators and an oracle estimator. Three error distributions are considered: the normal, mixed normal and t(3) distributions. Since the B-spline[25] method is efficient in digital computation and functional approximation, we use the B-spline basis in the first-stage estimation. The kernel used for all the estimators is the biweight kernel K(x) = (15/16)(1 − x²)² I(|x| ≤ 1), where I(·) is the indicator function. All the simulations are carried out with d = 2, and all the explanatory variables are restricted to the interval [−1, 1]. The local LS estimators (ρ(x) = x²) are chosen as the initial values for the one-step and fully iterative estimators. The following five estimation methods are compared:
(a) The one-step local M-estimator in (5) with the local LS estimator as an initial value.
(b) Using the one-step local M-estimator as an initial value, update the estimator in (5). We call the resulting estimator a two-step local M-estimator. Intuitively, the two-step local M-estimator further robustifies the one-step local M-estimator.


(c) The fully iterative two-stage local M-estimator in (4).
(d) The oracle estimator: given the true m_2, we estimate m_1 by the fully iterative two-stage local M-estimation. The oracle estimator cannot be used in applications but provides a benchmark for comparison.
(e) The two-stage estimator proposed in Horowitz and Mammen[17] (HM-estimator).

For each of the five methods above, namely the one-step, two-step, fully iterative, oracle and HM-estimators, we assess performance via the mean absolute deviation error (MADE):

    MADE(m̆_j) = s^{−1} Σ_{i=1}^{s} |m̆_j(X_i^j) − m_j(X_i^j)|,   j = 1, . . . , d,

where {X_i^j}_{i=1}^{s} are grid points and m̆_j is one of the above five estimators. The bandwidth for the first four estimation methods is h_opt in (6). For the HM-estimator, the optimal bandwidth in Section 4 of Horowitz and Mammen[17] is used.

Example 1. We simulate 500 samples of size n = 200, 400, 800 from the model

    Y = m_1(X^1) + m_2(X^2) + ε,    (9)

where m_1(X^1) = 2 sin(π(1 + X^1)), m_2(X^2) = X^2, and ε ∼ N(0, 0.4²). The components of X are independent of ε. X^1 = Z_1 + Z_2 and X^2 = Z_1 − β_ρ Z_2, where Z_1 and Z_2 are uniformly distributed on the interval [−1, 1], and β_ρ is selected such that the correlation coefficient of X^1 and X^2 is ρ (a sketch of this design is given below). The simulation results are summarized in Table 1. Only the results for the estimators of m_1 are presented; the results for m_2 are omitted to save space.
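As a small reproducibility aid (our own illustrative code, not from the paper), the fragment below solves for β_ρ in the Example 1 design and evaluates the MADE criterion on a grid; how the "Independent" rows of Table 1 were generated is not spelled out in the text and is therefore not coded here.

```python
import numpy as np
from scipy.optimize import brentq

def beta_for_rho(rho):
    """Solve (1 - beta)/sqrt(2(1 + beta^2)) = rho for beta, i.e. corr(Z1 + Z2, Z1 - beta*Z2) = rho
    when Z1 and Z2 are iid with equal variances (as in Example 1)."""
    return brentq(lambda b: (1 - b) / np.sqrt(2 * (1 + b**2)) - rho, -0.999, 100.0)

def made(m_hat, m_true, grid):
    """Mean absolute deviation error MADE over a grid of evaluation points."""
    return np.mean(np.abs(m_hat(grid) - m_true(grid)))

# Example usage with the true m_1 of model (9) (fitted_m1 denotes any fitted component function):
# m1 = lambda x: 2 * np.sin(np.pi * (1 + x))
# err = made(fitted_m1, m1, np.linspace(-1, 1, 101))
```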

Table 1  MADEs of one-step, two-step, fully iterative, oracle estimator and HM-estimator

Dependence index   Sample size   One-step   Two-step   Fully iterative   Oracle    HM
Independent        200           0.0815     0.0817     0.0818            0.0810    0.0812
Independent        400           0.0612     0.0614     0.0615            0.0606    0.0609
Independent        800           0.0407     0.0409     0.0409            0.0403    0.0405
ρ = 0.1            200           0.1033     0.1035     0.1037            0.1027    0.1031
ρ = 0.1            400           0.0811     0.0812     0.0813            0.0804    0.0808
ρ = 0.1            800           0.0756     0.0756     0.0756            0.0749    0.0753
ρ = 0.6            200           0.1112     0.1114     0.1115            0.1105    0.1109
ρ = 0.6            400           0.0931     0.0933     0.0933            0.0927    0.0929
ρ = 0.6            800           0.0768     0.0769     0.0770            0.0765    0.0768

Clearly, the one-step and two-step estimators are nearly as efficient as the fully iterative estimator, which is consistent with Theorem 3. The simulation results also show that the one-step and two-step estimators have nearly the same performance as the HM-estimator in terms of MADE; that is, our estimators are nearly as efficient as the HM-estimator. From the data presented in Table 1, we also infer that the newly proposed local M-estimators have the oracle property.

Example 2. We simulate 500 samples of size n = 200, 400 from the model

    Y = 2 sin(π(1 + X^1)) + X^2 + τ ε,    (10)

where ε ∼ t(3) or ε ∼ 0.05N(0, 9) + 0.95N(0, 1). The components of X are independent of ε, and X^1 and X^2 are generated as described in Example 1. τ is a parameter controlling the signal-to-noise ratio of the model. The simulation results are summarized in Tables 2 and 3; only the results for the estimators of m_1 are presented, and the corresponding results for m_2 are omitted.

Table 2  MADEs of one-step, two-step, fully iterative, oracle estimator and HM-estimator

Dependence index   τ     Sample size   One-step   Two-step   Fully iterative   Oracle    HM
Independent        0.1   200           0.0793     0.0784     0.0783            0.0781    0.0876
Independent        0.1   400           0.0683     0.0667     0.0664            0.0659    0.0757
Independent        0.3   200           0.0925     0.0912     0.0910            0.0906    0.1048
Independent        0.3   400           0.0762     0.0744     0.0742            0.0737    0.0851
Independent        0.5   200           0.1164     0.1135     0.1127            0.1119    0.1304
Independent        0.5   400           0.0934     0.0906     0.0901            0.0897    0.1112
ρ = 0.1            0.1   200           0.0872     0.0856     0.0854            0.0846    0.0921
ρ = 0.1            0.1   400           0.0694     0.0682     0.0679            0.0673    0.0756
ρ = 0.1            0.3   200           0.1059     0.1040     0.1036            0.1030    0.1125
ρ = 0.1            0.3   400           0.0855     0.0838     0.0834            0.0829    0.0942
ρ = 0.1            0.5   200           0.1291     0.1270     0.1266            0.1254    0.1401
ρ = 0.1            0.5   400           0.1056     0.1034     0.1026            0.1014    0.1190
ρ = 0.6            0.1   200           0.0966     0.0950     0.0947            0.0939    0.1024
ρ = 0.6            0.1   400           0.0840     0.0829     0.0825            0.0821    0.0899
ρ = 0.6            0.3   200           0.1115     0.1089     0.1084            0.1079    0.1207
ρ = 0.6            0.3   400           0.0899     0.0879     0.0874            0.0868    0.0996
ρ = 0.6            0.5   200           0.1186     0.1160     0.1155            0.1149    0.1297
ρ = 0.6            0.5   400           0.0959     0.0934     0.0929            0.0924    0.1098

Note. The error distribution is t(3).

Table 3  MADEs of one-step, two-step, fully iterative, oracle estimator and HM-estimator

Dependence index   τ     Sample size   One-step   Two-step   Fully iterative   Oracle    HM
Independent        0.1   200           0.0854     0.0843     0.0839            0.0834    0.0915
Independent        0.1   400           0.0779     0.0772     0.0768            0.0763    0.0841
Independent        0.3   200           0.0981     0.0962     0.0959            0.0955    0.1089
Independent        0.3   400           0.0866     0.0850     0.0846            0.0843    0.0970
Independent        0.5   200           0.1090     0.1068     0.1059            0.1050    0.1197
Independent        0.5   400           0.0989     0.0962     0.0957            0.0948    0.1112
ρ = 0.1            0.1   200           0.0957     0.0951     0.0947            0.0945    0.1024
ρ = 0.1            0.1   400           0.0778     0.0767     0.0765            0.0760    0.0856
ρ = 0.1            0.3   200           0.1012     0.0999     0.0997            0.0993    0.1093
ρ = 0.1            0.3   400           0.0792     0.0778     0.0775            0.0768    0.0889
ρ = 0.1            0.5   200           0.1133     0.1113     0.1109            0.1001    0.1230
ρ = 0.1            0.5   400           0.0896     0.0882     0.0878            0.0877    0.1006
ρ = 0.6            0.1   200           0.0996     0.0985     0.0983            0.0980    0.1068
ρ = 0.6            0.1   400           0.0846     0.0837     0.0834            0.0830    0.0921
ρ = 0.6            0.3   200           0.1055     0.1042     0.1039            0.1035    0.1136
ρ = 0.6            0.3   400           0.0859     0.0848     0.0844            0.0839    0.0949
ρ = 0.6            0.5   200           0.1158     0.1139     0.1135            0.1130    0.1260
ρ = 0.6            0.5   400           0.0881     0.0863     0.0860            0.0854    0.0995

Note. ε ∼ 0.95N(0, 1) + 0.05N(0, 9).


Tables 2 and 3 show that there are big differences in MADE between the HM-estimator and the newly proposed estimators: (i) when the random errors come from the t(3) distribution, the MADEs of the two-step estimator are much smaller than those of the HM-estimator; (ii) when the random errors come from the mixed normal distribution, the differences between these two estimators are still obvious. From these simulation experiments, we infer that the HM-estimator is not robust, that is, it is very sensitive to outliers. We have also conducted simulations with other levels of contaminated errors; the conclusions are basically the same as above. From Tables 2 and 3, we observe that the MADEs of the oracle estimator and the local M-estimators are very close, so the local M-estimators have the oracle property in these settings. Tables 2 and 3 also show that the performance of the one-step estimator is a little worse than that of the two-step and fully iterative estimators. Many simulation experiments show that, when the data are contaminated with heavy-tailed errors, the one-step local M-estimator is not good enough, but the two-step local M-estimator performs very well. From the above discussion, the local M-estimators are superior to the HM-estimator when the errors come from heavy-tailed or contaminated distributions.

7 Real data example

We now apply the proposed method to the Boston housing data. The full dataset comprises the median value of homes in 506 census tracts in 1970 and 13 accompanying sociodemographic and related variables. It has been studied in many papers, such as [5, 26–29]. Of the 13 variables, the dependent variable and the four covariates of interest are
MV: median value of owner-occupied homes (in $1,000);
RM: average number of rooms in owner units;
TAX: full property tax rate ($/$10,000);
PTRATIO: pupil/teacher ratio by town school district;
LSTAT: proportion of population that is of "lower status" (%).
The last four covariates were chosen by [5, 28] to investigate the factors that affect the median value of owner-occupied homes. We now analyse the dataset via the following four-dimensional additive model:

    E[MV | X^1, X^2, X^3, X^4] = α + m_1(X^1) + m_2(X^2) + m_3(X^3) + m_4(X^4),    (11)

where X^1 = RM, X^2 = log(TAX), X^3 = PTRATIO and X^4 = log(LSTAT). Since the dataset contains 6 outliers[30], previous works mostly analysed it with the outliers removed. Two estimators, the one-step local M-estimator and the LS estimator based on the backfitting algorithm and local linear smoother, are used to fit model (11). We compare the one-step local M-estimator using the full data with the LS estimator using the outliers-dropout data. Figure 1 presents the four estimated additive components.

Figure 1  Estimated additive components for the Boston housing data: (a) RM; (b) log(TAX); (c) PTRATIO; (d) log(LSTAT). The solid lines represent the one-step local M-estimators based on the full data; the dashed lines display the LS estimators based on the outliers-dropout data.

From the shapes of the fitted curves, we can draw the following conclusions. The relationships between the dependent variable and log(TAX), PTRATIO and log(LSTAT) are strongly linear, whereas the relationship between the dependent variable and RM is evidently quadratic. This is consistent with the testing conclusion of Fan and Jiang[29]. Figure 1 also shows that the one-step local M-estimator is robust against outliers, since the two estimated curves of each component are very close.

8 Proofs of the main results

To facilitate the proofs of the main results, we first introduce some lemmas.

Lemma 1. If A and B are nonnegative matrices, then
(a) λ_min(A)Tr(B) ≤ Tr(AB) ≤ λ_max(A)Tr(B);
(b) λ_min(A)λ_max(B) ≤ λ_max(AB) ≤ λ_max(A)λ_max(B).

Proof. Part (a) is Lemma 6.5 of [31]; part (b) is a basic inequality.

Lemma 2. Let Ω_{κ1} = Q_κ^{−1} E{ψ²(ε)P_κ^{⊗2}(x)} Q_κ^{−1} and Ω_{κ2} = Q_κ^{−1} E{ψ_ε²(x)P_κ^{⊗2}(x)} Q_κ^{−1}. If Conditions (A3) and (A7) hold, then
(a) there is a positive constant C_0 > 0 such that λ_min(Q_κ) > C_0 and λ_min(Q_κ*) > C_0; in addition, λ_max(Q_κ) is bounded for all κ;
(b) the largest eigenvalues of E{ψ²(ε)P_κ^{⊗2}(x)}, E{ψ_ε²(x)P_κ^{⊗2}(x)}, Ω_{κ1} and Ω_{κ2} are bounded for all κ.

Proof. (a) By Conditions (A7)(i) and (A3)(iii), we have C_{V3}Q_{κ0} ≤ Q_κ ≤ C_{V2}Q_{κ0}. Then λ_max(Q_κ) ≤ C_{V2}λ_max(Q_{κ0}). By Condition (A7)(ii), λ_max(Q_κ) is bounded for all κ. Similarly, we get λ_min(Q_κ) ≥ C_{V3}λ_min(Q_{κ0}) > C_{V3}c_λ and λ_min(Q_κ*) ≥ C_{V4}λ_min(Q_{κ0}) > C_{V4}c_λ for all κ. This completes the proof of part (a).
(b) By Condition (A3)(iii), E{ψ²(ε)P_κ^{⊗2}(x)} < C_{V1}E{P_κ^{⊗2}(x)} = C_{V1}Q_{κ0}. Then by Condition (A7)(ii), the largest eigenvalue of E{ψ²(ε)P_κ^{⊗2}(x)} is bounded for all κ. Similarly,


E{ψ_ε²(x)P_κ^{⊗2}(x)} < C_{V2}Q_{κ0}. Hence, λ_max(E{ψ_ε²(x)P_κ^{⊗2}(x)}) is bounded for all κ. Similar arguments yield that E{ψ²(ε)P_κ^{⊗2}(x)} > C_{V0}Q_{κ0} and E{ψ_ε²(x)P_κ^{⊗2}(x)} > C_{V3}Q_{κ0}. Therefore, Ω_{κ1} and Ω_{κ2} are positive. Since C_{V3}c_λ < λ_min(Q_κ) ≤ λ_max(Q_κ) < C (a constant), there exists an orthogonal matrix U such that Q_κ^{−1} = U^T diag(λ_1^{−1}, . . . , λ_{d(κ)}^{−1}) U, where λ_1, . . . , λ_{d(κ)} are the eigenvalues of Q_κ. For any unit vector u, let u* = Uu. Then applying part (b) of Lemma 1, we obtain that

    λ_max(Ω_{κ1}) = max_{‖u*‖=1} |u*^T diag(λ_1^{−1}, . . . , λ_{d(κ)}^{−1}) U E{ψ²(ε)P_κ^{⊗2}(x)} U^T diag(λ_1^{−1}, . . . , λ_{d(κ)}^{−1}) u*|
                 = λ_max(diag(λ_1^{−1}, . . . , λ_{d(κ)}^{−1}) U E{ψ²(ε)P_κ^{⊗2}(x)} U^T diag(λ_1^{−1}, . . . , λ_{d(κ)}^{−1}))
                 ≤ C_λ λ_max(E{ψ²(ε)P_κ^{⊗2}(x)}),

where C_λ = max{λ_1^{−2}, . . . , λ_{d(κ)}^{−2}} ≤ C_{V3}^{−2} c_λ^{−2}. Similarly, λ_max(Ω_{κ2}) is bounded for all κ.

Lemma 3. If Conditions (A1)–(A8) hold, then
(a) ‖Q̂_κ − Q_κ‖² = O_p(κ²/n);
(b) ‖Q̂_κ^{−1}‖² = O_p(κ), ‖Q_κ^{−1}‖² = O(κ);
(c) ‖Q_κ − Q̂_{κ0}‖² = O_p(κ²);
(d) ‖Q̂_κ^{−1}(Q_κ − Q̂_κ)‖² = O_p(κ²/n).

Proof. (a) Let P_{κj}(x_i) denote the j-th component of P_κ(x_i). Using an argument similar to that for Lemma 4 in Horowitz and Mammen[17], we can prove part (a).
(b) Only the first conclusion is proved; the same argument can be applied to prove the second. There exist an orthogonal matrix A_κ and a diagonal matrix Λ_κ such that A_κ Q̂_κ^{−1} A_κ^T = Λ_κ, where Λ_κ = diag(λ_1(ω), . . . , λ_{d(κ)}(ω)) and the λ_i(ω) are the eigenvalues of Q̂_κ^{−1}. Let q_1, . . . , q_{d(κ)} be the eigenvectors of Q̂_κ^{−1} and λ_{κ,max}(ω) = max{λ_1(ω), . . . , λ_{d(κ)}(ω)}. By Lemma 2, P{λ_min(Q̂_κ) ≥ C_0/2} → 1, that is, P{λ_{κ,max}(ω) ≤ 2/C_0} → 1. Therefore, ‖Q̂_κ^{−1}‖ = ‖A_κ^T Λ_κ A_κ‖ = ‖Λ_κ‖ = O_p(κ^{1/2}).
(c) Note that ‖Q_κ − Q̂_{κ0}‖ ≤ ‖Q_κ − Q_{κ0}‖ + ‖Q_{κ0} − Q̂_{κ0}‖ ≡ L_{n1} + L_{n2}. By simple algebra, L_{n1}² = Σ_{k=1}^{d(κ)} Σ_{j=1}^{d(κ)} E²[(ψ_ε(x) − 1)P_{κk}(x)P_{κj}(x)]. Then by Conditions (A3)(iii) and (A4), we have L_{n1}² ≤ Σ_{k=1}^{d(κ)} Σ_{j=1}^{d(κ)} E(ψ_ε(x) − 1)² E[P_{κk}(x)P_{κj}(x)]² = O(κ²). Using a similar argument as in part (a) yields that ‖Q_{κ0} − Q̂_{κ0}‖² = O_p(κ²/n). Therefore, by Condition (A6), part (c) holds.
(d) By part (a) of Lemmas 1–3, we have

    ‖Q̂_κ^{−1}(Q_κ − Q̂_κ)‖² = Tr[(Q_κ − Q̂_κ)Q̂_κ^{−2}(Q_κ − Q̂_κ)] = Tr[Q̂_κ^{−2}(Q_κ − Q̂_κ)²]
        ≤ λ_max(Q̂_κ^{−2})Tr[(Q_κ − Q̂_κ)²] = λ_max²(Q̂_κ^{−1}) · ‖Q_κ − Q̂_κ‖² = λ_min^{−2}(Q̂_κ) · O_p(κ²/n) = O_p(κ²/n).

Lemma 4. Suppose Conditions (A1)–(A8) hold, then
(a) ‖Q_κ^{−1} Σ_{i=1}^{n} P_κ(x_i)ψ(ε_i)/n‖ = O_p(κ^{1/2}/n^{1/2});
(b) ‖Q_κ^{−1} Σ_{i=1}^{n} ψ_ε(x_i)P_κ(x_i)b_{κ0}(x_i)/n‖ = O_p(1/(n^{1/2}κ^{1/2})).

Proof. (a) Denote by σn the σ-algebra generated by {xi : i = 1, . . . , n}. Let Φ(ε) be the n × 1 vector whose i-th element is ψ(εi ) and Zκ be a d(κ) × n matrix, whose i-th column element is


Pκ (xi ). Then, 2  n   −1/2  1 T T −1 Q Pκ (xi )ψ(εi )/n  = n2 Φ (ε)Zκ Qκ Zκ Φ(ε)  κ i=1 1 T ˆ −1 Zκ Φ(ε) Φ (ε)ZκT [E{ψε (x)|σn }]−1 Q κ0 n2  −1 −1 1 ˆ }Zκ Φ(ε) + 2 ΦT (ε)ZκT {Q−1 Q κ − E{ψε (x)|σn } κ0 n ≡ An + Bn .

=

Using Condition (A3)(iii) and simple algebra, we obtain that 1 ˆ −1 Zκ Φ(ε)} E{ΦT (ε)ZκT Q κ0 n2 E{ψε (x)|σn } 1 ˆ −1 Zκ Φ(ε)ΦT (ε)]|σn }} E{E{Tr[ZκT Q = 2 κ0 n E{ψε (x)|σn } CV ˆ −1 Zκ )} = CV1 d(κ) = O(κ/n).  2 1 E{Tr(ZκT Q κ0 n CV3 nCV3 Therefore, An = Op (κ/n). In the following, we will prove Bn = op (κ/n). In fact, by the properties of Frobenius norm, 1 ˆ ˆ −1 ΦT (ε)ZκT Q−1 κ {[E{ψε (x)|σn }]Qκ0 − Qκ }Qκ0 Zκ Φ(ε) n2 |E{ψε (x)|σn }| 1 ˆ ˆ −1 ΦT (ε)ZκT Q−1  κ /n · [E{ψε (x)|σn }]Qκ0 − Qκ  · Qκ0 Zκ Φ(ε)/n CV3 = Bn1  · Bn2  · Bn3 /CV3 ,

|Bn | =

ˆ ˆ −1 where Bn1 = ΦT (ε)ZκT Q−1 κ /n, Bn2 = E{ψε (x)|σn }Qκ0 − Qκ and Bn3 = Qκ0 Zκ Φ(ε)/n. By part (b) of Lemma 3 and the center limit theorem, we have   n  −1   −1 1/2 1/2 1/2  Bn1  = Qκ Pκ (xi )ψ(εi )/n   Qκ  · Op (κ /n ) = Op (κ/n ). i=1

ˆ κ |σn ] − E Q ˆ κ , which is a sum of iid random variables with mean zero. Note that Bn2 = E[Q Through the similar argument as that in part (a) of Lemma 3, it is straightforward to verify √ that Bn2  = Op (κ/ n). Using an argument similar to that for Lemma 5 of Horowitz and Mammen[17] , we obtain Bn3  = Op (κ1/2 /n1/2 ). Therefore by Condition (A6), we obtain Bn = op (κ/n) and   n  −1/2   1/2 1/2  Qκ Pκ (xi )ψ(εi )/n   = Op (κ /n ).

(12)

i=1

−1/2

Define ξ = Qκ Zκ Φ(ε)/n. Let λl and ql (l = 1, . . . , d(κ)) be the eigenvalues and eigenvectors d(κ) −1 −1 T of Q−1 = κ , respectively. By the spectral decomposition of Qκ , we have Qκ l=1 λl ql ql . Then by part (a) of Lemma 2 and (12), we have 

d(κ) 2 Q−1 κ Zκ Φ(ε)/n

=

l=1



d(κ)

λl ξ

T

ql qlT ξ



λmax (Q−1 κ )

l=1

T ξ T ql qlT ξ  λmax (Q−1 κ )ξ ξ = Op (κ/n).



Hence, part (a) holds. (b) Let  be the n× 1 vector whose i-th component is ψε (xi )bκ0 (xi ). By Conditions (A3)(iii), (A4) and part (a) of Lemma 1, we obtain that 1 ˆ −1 Zκ )λmax (T )} E{Tr(ZκT Q κ0 n2 1 ˆ −1 Zκ )Tr(T )} = 2 E{Tr(ZκT Q κ0 n 

n  1 2 ˆ −1 Zκ Z T ) = 2 E Tr(Q [ψ (x )b (x )] ε i κ0 i κ κ0 n i=1

 n  d(κ) 2 1 CV2 O(κ−4 ) = O(1/(nκ3 )), b2κ0 (xi ) =  2 E Id(κ) CV22 n n i=1

−1

ˆ 2 Zκ /n2 }  E{Q κ0

where Id(κ) is a d(κ) × d(κ) identity matrix. 1 ˆ − 2 Zκ /n. By the spectral decomposition of Q ˆ −1 , Let ϑ = Q κ0 κ0 ˆ −1 )ϑT ϑ = Op (1/(nκ3 )). ˆ −1 Zκ /n2  λmax (Q Q κ0 κ0

(13)

ˆ κ0 )τ . Then by part (a) of Lemma 2, part (c) of Lemma 3 ˆ −1 Zκ /n and ζ = (Qκ − Q Let τ = Q κ0 and (13), we have 2 T −2 ˆ −1 ˆ ˆ ˆ (Q−1 κ − Qκ0 )Qκ0 τ  = Tr[τ (Qκ − Qκ0 )Qκ (Qκ − Qκ0 )τ ] T 2 3  λ2max (Q−1 κ )ζ ζ = Op (κ )Op (1/(nκ )) = Op (1/(nκ)).

Hence part (b) follows.

Lemma 5. Under Conditions (A3)–(A8), we have
(a) ‖θ̂_{nκ} − θ_{κ0}‖ = O_p(√(κ/n));
(b) θ̂_{nκ} − θ_{κ0} = (1/n) Q_κ^{−1} Σ_{i=1}^{n} ψ(ε_i)P_κ(x_i) + (1/n) Q_κ^{−1} Σ_{i=1}^{n} ψ_ε(x_i)P_κ(x_i)b_{κ0}(x_i) + R_n,

where ‖R_n‖ = O_p(κ^{3/2}/n).

Proof. (a) Let α_n = √(κ/n). It suffices to show that, for any given ε > 0, there is a large constant c such that for large n,

    P{ inf_{‖u‖=c} S_{nκ}(θ_{κ0} + α_n u) ≥ S_{nκ}(θ_{κ0}) } ≥ 1 − ε.    (14)

ˆ nκ of Snk (θ) in the This implies that with probability tending to one there exists a minimizer θ ball {θκ0 + αn u : u  c} such that ˆ nκ − θκ0  = Op (αn ). θ

(15)

In fact, if we let Dn (u) = Snκ (θ κ0 + αn u) − Snκ (θκ0 ), then Dn (u) = Dn1 (u) − Dn2 (u), where n n −1 Dn1 (u) = n−1 i=1 [ρ(εi + bk0 (xi ) − PT k (xi )uαn ) − ρ(εi )], Dn2 (u) = n i=1 [ρ(εi + bk0 (xi )) − ρ(εi )]. Decompose Dn1 (u) and Dn2 (u) as Dn1 (u) = In1 +In2 +In3 and Dn2 (u) = Jn1 +Jn2 +Jn3 n 1 n respectively, where In1 = n1 i=1 ψ(εi )[bk0 (xi ) − PT κ (xi )uαn ], In2 = n i=1 A(xi )[bk0 (xi ) − n n 1 1 T 2 Pκ (xi )uαn ] , In3 = Dn1 (u)−In1 −In2 , Jn1 = n i=1 ψ(εi )bk0 (xi ), Jn2 = n i=1 A(xi )b2k0 (xi )



and Jn3 = Dn2 (u) − Jn1 − Jn2 . Note that by part (b) of Lemma 2 and direct computations, 2 2 2 E[In1 − Jn1 ] = −E[ψ(ε)PT κ (x)uαn ] = 0 and E[In1 − Jn1 ] = O(αn u /n). Then √ In1 − Jn1 = Op (α2n u/ κ).

(16)

Decompose In2 − Jn2 as 1 1 2 2 A(xi )[PT A(xi )bκ0 (xi )PT κ (xi )u] αn − κ (xi )uαn ≡ Mn21 + Mn22 . n i=1 n i=1 n

In2 − Jn2 =

n

Using Lemma 2, Condition (A7)(i) and the law of the large number, we obtain that Mn21 = α2n uT E[A(x)P⊗2 κ (x)]u(1 + op (1)) = α2n uT Q∗κ u(1 + op (1))  λmin (Q∗κ )α2n u2(1 + op (1)). By Conditions (A3)(iii) and (A4), supx∈X |bκ0 (x)| = O(κ−2 ). Then

  n 1 −2 T O(κ )αn |Pκ (xi )u| = O(κ−2 αn u). E|Mn22 |  E n i=1

(17)

By Condition (A6), Mn22 = op (Mn21 ). Therefore In2 − Jn2  12 λmin (Q∗κ )α2n u2 . Note that 1 |ρ{εi + bκ0 (xi ) − PT κ (xi )uαn } − ρ(εi ) n i=1 n

|In3 | 

T 2 − ψ(εi )[bκ0 (xi ) − PT κ (xi )uαn ] − A(xi )[bκ0 (xi ) − Pκ (xi )uαn ] |.

Then by Conditions (A3)(ii), (A4) and (A7)(ii), we have

E|In3 |  E

1 2 o[bκ0 (xi ) − αn PT κ (xi )u] n i=1 n



= o(α2n )uT Qκ0 u + o(κ−4 )

= o(α2n )u2 λmax (Qκ0 ) + o(κ−4 ).

(18)

Similar arguments yield that E|Jn3 | = o(κ−4 ).

(19)

Therefore by (16)–(19), In1 − Jn1 , Mn22 , In3 and Jn3 are dominated by Mn21 . Noting that Mn21 is positive, if we let u be large enough, it follows that Dn (u) > 0 and hence (14) holds. ˆ nκ − Mi = PT (xi )(θ ˆ nκ − θ κ0 ) − bκ0 (xi ). Then (b) Let Mi = α + m(xi ) and ηi = PT (xi )θ κ

κ

ˆ PT κ (xi )θ nκ = Mi + ηi .

(20)

ˆ nκ }Pκ (xi ) = 0. This combines with ˆ nκ satisfies the equation: n ψ{yi − PT (xi )θ By (2), θ κ i=1 n (20) yields that i=1 ψ(εi − ηi )Pκ (xi ) = 0, which can be rewritten as 1 1 [ψ(εi − ηi ) − ψ(εi ) + ψε (xi )ηi ]Pκ (xi ) + [ψ(εi ) − ψε (xi )ηi ]Pκ (xi ) = 0. n i=1 n i=1 n

n

(21)



Let Ψi = ψ(εi − ηi ) − ψ(εi ) + ψε (xi )ηi . It follows from the definition of ηi that 1 1 ˆ ψε (xi )P⊗2 ψε (xi )Pκ (xi )bκ0 (xi ) κ (xi )(θ nκ − θ κ0 ) − n i=1 n i=1 n

n

1 1 ψ(εi )Pκ (xi ) + Ψi Pκ (xi ). n i=1 n i=1 n

=

n

Therefore, n n  1 ˆ −1  ˆ nκ − θ κ0 = 1 Q ˆ −1 θ ψ(ε )P (x ) + ψε (xi )Pκ (xi )bκ0 (xi ) Q i κ i n κ i=1 n κ i=1

+

n 1 ˆ −1  Ψi Pκ (xi ). Qκ n i=1

(22)

n n 1 ˆ −1 −1 Make the following definitions: Ln1 = n1 Q−1 κ i=1 ψ(εi )Pκ (xi ), Ln2 = n (Qκ −Qκ ) i=1 ψ(εi )   n n 1 −1 1 ˆ −1 −1 Pκ (xi ), Ln3 = n Qκ i=1 ψε (xi )Pκ (xi )bκ0 (xi ), Ln4 = n (Qκ −Qκ ) i=1 ψε (xi )Pκ (xi )bκ0 (xi ) 1 ˆ −1 n ˆ and Ln5 = n Qκ i=1 Ψi Pκ (xi ). Then θ nκ − θ κ0 = Ln1 + Ln2 + Ln3 + Ln4 + Ln5 . Applying Lemmas 1–2 and Condition (A4), we obtain that  √ ˆ −1 ˆ −1 ⊗2 ]  λmax (Q ˆ −1 Q P (x ) = Tr[P⊗2 κ (xi )(Qκ ) κ i κ κ )Pκ (xi ) = Op ( κ), uniformly for i = 1, . . . , n. Then Ln5   n−1

n 

n  √ ˆ −1 Pκ (xi ) = Op ( κ)n−1 |Ψi | · Q |Ψi |. κ

i=1

(23)

i=1

n n ˆ n 2 T ⊗2 ˆ Since i=1 ηi2  2 i=1 (θ nκ − θ κ0 ) Pκ (xi )(θ nκ − θ κ0 ) + 2 i=1 bκ0 (xi ), then by Conditions (A4) and (A7)(ii), we obtain that n−1

n 

ˆ nκ − θ κ0 )T Q ˆ nκ − θ κ0 ) + κ−4 } = Op (1){θ ˆ nκ − θ κ0 2 + κ−4 }. ˆ κ0 (θ ηi2  2{(θ

(24)

i=1

For any given η > 0, let Δn = (δ1 , . . . , δn )T , Dη = {Δn : |δi |  η, ∀ i  n}, and Ψ(Δn ) =  n−1 ni=1 |ψ(εi − δi ) − ψ(εi ) + ψε (xi )δi |. Then by Condition (A3)(ii) and taking iterative expectation, there exists a sequence of positive and bounded numbers {aη } such that n n

     

E sup Ψ(Δn )|σn = n−1 E sup |ψ(εi − δi ) − ψ(εi ) + ψε (xi )δi | σn  aη n−1 δi2 . (25) Dη

i=1



i=1

n −1

Note that sup1in |ηi | = op (1), which combines with (24)–(25) leads to n i=1 |Ψi | = ˆ n ) = Op (θ ˆ nκ − θκ0 2 + κ−4 ) with Δ ˆ n = (η1 , . . . , ηn )T . This together with (23) yields that Ψ(Δ √ ˆ 2 −7/2 ) = Op (κ3/2 /n). Ln5  = Op ( κθ nκ − θ κ0  ) + Op (κ

(26)

ˆ −1 (Qκ − Q ˆ κ ) 1 Q−1 n ψ(εi )Pκ (xi )  Using part (d) of Lemma 3, we obtain that Ln2  = Q κ i=1 n κ n 1/2  n1 Q−1 ψ(ε )P (x ) · O (κ/n ). By part (a) of Lemma 4, we have i κ i p κ i=1 Ln2  = Op (κ3/2 /n).

(27)



Similarly, by part (d) of Lemma 3 and part (b) of Lemma 4, we get Ln4  = Op (κ1/2 /n).

(28)

Therefore by Lemma 4, (22)–(28) and Condition (A6), part (c) holds.

Lemma 6. Assume that L(·) and S(·) are bounded and continuous at the point x^1, which is in the interior of the support of f_{X^1}(·). Suppose lim sup_{|z|→∞} |L(z)z^{l+2}| < ∞ for some non-negative integer l. Then

    Σ_{j=1}^{n} L((X_j^1 − x^1)/h_n) S(X_j^1)(X_j^1 − x^1)^l = n h_n^{l+1} S(x^1) f_{X^1}(x^1) ∫_R L(z)z^l dz (1 + o_p(1)).

By directly calculating the mean and variance, the result holds. √ n ˇ i . Under Conditions (A1)–(A8), Jn / nhn is asymptotLemma 7. Let Jn = i=1 ψ(εi )Ki X ically normal with mean zero and covariance matrix D(x1 ) = VΓ2 (x1 )fX 1 (x1 ). Proof.

This result holds from the same argument as for Lemma 7.3 of Jiang and Mack[20] .

Lemma 8. Under Conditions (A1)–(A8), for any non-negative integer l, we have the following conclusions: n 1 1 (a) i=1 ψε (xi )Ki (Xi1 − x1 )l = nhl+1 n sl Γ1 (x )fX 1 (x )(1 + op (1)); n  1 1 1 (b) i=1 ψε (xi )R(Xi1 )Ki (Xi1 − x1 )l = 12 nhl+3 n sl+2 m1 (x )Γ1 (x )fX 1 (x )(1 + op (1)). Proof. Using the same arguments for Lemma 7.2 of Jiang and Mack[20] , we can obtain the conclusions. ¯ κ(˜ In order to state the following lemma, we first give some definitions. Let ξi= h1n ψε (xi )Ki P xi )  n 1 1 1 1 ¯ ˜ )Pκ (˜ for i = 1, . . . , n, C(x ) = G(x , x x)d˜ x, rn1 = n i=1 (ξi − Eξi ), and rn2 = Eξ1 − C(x ). Lemma 9.

Proof.

Under Conditions (A1)–(A8), we have   κ (a) rn1  = Op and (b) rn2 2 = O(h2n κ). nhn

(a) By the definition of Frobenius norm and Conditions (A3)–(A5), 1 [Eξ1T ξ1 − Eξ1T Eξ1 ] n 1 ¯ x1 )} − 1 E{ψε (x1 )K1 P ¯ T (˜ ¯ T (˜ ¯ x1 )} = E{ψε2 (x1 )K12 P κ x1 )Pκ (˜ κ x1 )}E{ψε (x1 )K1 Pκ (˜ nh2n nh2n = O(κ/(nhn )) + O(κ/n) = O(κ/(nhn )).

Ern1 2 =

Applying Markov’s inequality, one gets the conclusion of part (a). (b) By the definitions of C(x1 ) and ξ1 , Conditions (A3) and (A4), we have

  1 ¯ κ (˜ ¯ ˜ )P rn2 = E ψε (x1 )K1 Pκ (˜ x1 ) − G(x1 , x x)d˜ x hn    ¯ κ (˜ ˜ )K(t) − G(x1 , x ˜ )K(t)]dt P = [G(x1 + hn t, x x)d˜ x    ˜) ∂G(x1 + Δ, x ¯ κ (˜ tK(t)dt P x)d˜ x (Dominated convergence theorem) = hn 1 ∂X  ˜) ¯ ∂G(x1 , x x)d˜ x(1 + o(1)), Pκ (˜ = s1 h n 1 ∂X



where Δ is between 0 and hn t. Therefore, we obtain that rn2 2 = O(h2n κ). Lemma 10.

Under Conditions (A1)–(A8), for l = 0, 1, we have



n 1  ψε (xi )Ki [˜ α+m ˜ −1 (˜ xi ) − α − m−1 (˜ xi )](Xi1 − x1 )l = op (hln ). nhn i=1

Proof. Only the case l = 0 is proved, since the similar argument can be applied to the case l = 1. By part (b) of Lemma 5, n 1  √ ψε (xi )Ki [˜ α+m ˜ −1 (˜ xi ) − α − m−1 (˜ xi )] nhn i=1  n n  1 ¯T 1  −1 ψε (xi )Ki P ψ(εj )Pκ (xj ) =√ κ (xi )Qκ n nhn i=1 j=1

+

 n  1 ¯T  T ¯ ¯ ψ (ε )P (x )b (x ) − b (x ) + P (x )R Pκ (xi )Q−1 j κ j κ0 j κ0 i i n κ κ n j=1

≡ Cn1 + Cn2 + Cn3 + Cn4 , where Rn is defined in part (b) of Lemma 5. Using Conditions (A3)(iii), (A4), (A6) and through simple algebra operation, we obtain that n 1  ¯ T (xi )Rn Cn4 = √ ψε (xi )Ki P κ nhn i=1 n 1  ¯T √ |ψε (xi )|Ki · max P κ (xi )Rn  1in nhn i=1 n 1  =√ |ψε (xi )|Ki · Op (κ2 /n) = Op ( nhn κ2 /n) = op (1). nhn i=1

(29)

n 1 −1 ¯T ψε (xi )Ki P κ (xi ). Then Cn1 = α(x )Qκ j=1 Pκ (xj )ψ(εj ). By   h κ T 1 T T T n Lemma 9, it is easy to see that α(x1 ) = n (C (x ) + rn1 + rn2 ), where rn1  = Op ( nhn ) √ T and rn2  = O(hn κ). Then  n  hn T 1 T T (C (x ) + rn1 + rn2 )Q−1 Pκ (xj )ψ(εj ) ≡ Jn1 + Jn2 + Jn3 . (30) Cn1 = κ n j=1 −1/2

Let α(x1 ) = n−3/2 hn

n

i=1

For each x1 ∈ [−1, 1], the components of C(x1 ) include zeros and the Fourier coefficients of ˜ ), which is bounded. Therefore, by Bessel’s inequality, there exists some finite constant G(x1 , x M for all κ, such that C T (x1 )C(x1 )  M. Then applying part (b) of Lemma 2, we obtain from the spectral decomposition theorem that   2 n  T 1 −1 hn    E C (x )Qκ Pκ (xj )ψ(εj )  n j=1 ⊗2 2 −1 1 T 1 1 = hn C T (x1 )Q−1 κ E{ψ (ε)Pk (x)}Qκ C(x ) = hn C (x )Ωκ1 C(x ) = O(hn ).

By Markov’s inequality, Jn1 = op (1). By part (a) of Lemma 4, Lemma 9 and Condition (A6), √ 3/2 it is easy to prove Jn2 = Op (κ/ n) = op (1). Similarly, Jn3 = Op (κhn ) = op (1). Therefore, Cn1 = op (1).

(31)



Applying part (b) of Lemma 4, Conditions (A3)(iii) and (A4), we have   n n  1 −1   1  T  ¯ ψε (xi )Ki Pκ (xi ) ·  Qκ ψε (xj )Pκ (xj )bκ0 (xj ) |Cn2 |  √  n nhn i=1 j=1 n √ 1  ¯ κ (xi ) · Op (1/ nκ) = Op ( hn ) = op (1). |ψε (xi )|Ki · sup P  √ nhn i=1 1in

(32)

Using Lemma 8, we obtain from Conditions (A3)(iii), (A4) and (A6) that |Cn3 |  √

n 1  |ψε (xi )|Ki · max |¯bκ0 (xi )| = Op ( nhn κ−2 ) = op (1). 1in nhn i=1

(33)

Therefore, the conclusion follows from (29) and (31)–(33). n ˜+m ˜ −1 (˜ xi )]}Ki . Proof of Theorem 1. Put L(a0 , a1 ) = nh1 n i=1 ρ{yi − [a0 + a1 (Xi1 − x1 ) + α Denote a ˜ = a0 , ˜b = hn a1 , zi = yi − α − m−1 (˜ xi ) and Rni = α ˜+m ˜ −1 (˜ xi ) − α − m−1 (˜ xi ) = Xi1 −x1 1 n T ˆ ¯ ˜ ˜ ¯ Pκ (xi )(θ nκ − θκ0 ) − bκ0 (xi ). Then L(˜ a, b) = L(a0 , a1 ) = nhn i=1 ρ{zi − a ˜ − b hn − Rni }Ki . 1  1 T T ˜ ˜ Let (m ˆ 1 (x ), hn m ˆ 1 (x )) be the minimizer of L(˜ a, b). Put r = (˜ a, b) , r0 = (m1 (x1 ), hn m1 (x1 ))T  Tˇ and ri = (r − r0 ) Xi . Let Sδ be the circle centered at r0 with radius δ and Ln (r) = nh1 n ni=1 n X 1 −x1 X 1 −x1 ρ{zi − a ˜ − ˜b i − Rni }Ki , ln (r0 ) = 1 ρ{zi − m1 (x1 ) − hn m (x1 ) i }Ki . We will hn

nhn

i=1

1

hn

show that for any sufficiently small δ, lim P inf Ln (r)  Ln (r0 ) = 1. n→∞

r∈Sδ

(34)

By simple algebra, we have  εi +R(Xi1 )−Rni −ri n 1  Ki ψ(t)dt nhn i=1 εi +R(Xi1 )  εi +R(Xi1 )−Rni −ri n 1  Ki {ψ(εi ) + ψε (xi )(t − εi ) = nhn i=1 εi +R(Xi1 )

Ln (r) − ln (r0 ) =

+ [ψ(t) − ψ(εi ) − ψε (xi )(t − εi )]}dt ≡ Kn1 + Kn2 + Kn3 ,

(35)

√ where r ∈ Sδ and max1in |Rni | = Op (κ/ n). Directly calculating the mean and variance, we obtain that Kn1

 εi +R(Xi1 )−Rni −ri n 1  = Ki ψ(εi )dt nhn i=1 εi +R(Xi1 ) = −(r − r0 )T

n n  1  ˇi − 1 Ki ψ(εi )X Ki ψ(εi )Rni ≡ Kn11 + Kn12 . nhn i=1 nhn i=1

√ Using Lemma 7, we get Kn11 = Op ((nhn )−1/2 )δ. Since max1in |Rni | = Op (κ/ n), |Kn12 |   √ √ n 1 i=1 Ki |ψ(εi )| · Op (κ/ n) = Op (κ/ n). Therefore nhn √ Kn1 = Op ((nhn )−1/2 )δ + Op (κ/ n).

(36)



By the mean value theorem for integration, we have Kn3 = −(r − r0 )T −

n 1  ˇi Ki [ψ(εi + ui ) − ψ(εi ) − ψε (xi )ui ]X nhn i=1

n 1  Ki [ψ(εi + ui ) − ψ(εi ) − ψε (xi )ui ]Rni ≡ Kn31 + Kn32 , nhn i=1

where ui lies between R(Xi1 ) and R(Xi1 ) − Rni − ri for i = 1, . . . , n. Using Conditions (A5) and √ (A8), we obtain maxi:|Xi1 −x1 |hn |ui |  2δ + Op (κ/ n). Now we will prove √ (37) Kn3 = op (δ 2 ) + op (κ/ n)δ + op (κ2 /n). For any given η > 0, let Δn = (η1 , . . . , ηn )T , Dη = {Δn : |ηi | < η, ∀ i  n} and (r − r0 )T  ˇ i. [ψ(εi + ηi ) − ψ(εi ) − ψε (xi )ηi ]Ki X nhn i=1 n 1 ˇ i |. By Then supDη |V (Δn )|  nhn i=1 supDη |ψ(εi + ηi ) − ψ(εi ) − ψε (xi )ηi | · Ki · |(r − r0 )T X n 2δ Condition (A3)(ii), we have E[supDη |V (Δn )|]  aη nhn E[ i=1 Ki ]  2δbη , where aη and bη are two sequences of positive numbers, tending to zero as η → 0. Since maxi:|Xi1 −x1 |hn |ui |  2δ + √ ˆ n ) = op (δ 2 ) + op (κ/√n)δ with Δ ˆ n = (u1 , . . . , un )T . Therefore, Op (κ/ n), it follows that V (Δ √ √ Kn31 = op (δ 2 ) + op (κ/ n)δ. Similarly, we obtain Kn32 = op (κ2 /n) + op (κ/ n)δ. Now, (37) is proved. Note that by simple integral operation n

V (Δn ) =

Kn2 =

n 1  ˇ iX ˇ T (r − r0 ) − 2R(X 1 )ri − 2R(X 1 )Rni + 2Rni ri + R2 ] Ki ψε (xi )[(r − r0 )T X i i i ni 2nhn i=1

≡ Mn1 + Mn2 + Mn3 + Mn4 + Mn5 .

The same arguments as above and Lemma 8 yield that 1 (r − r0 )T Γ1 (x1 )fX 1 (x1 )S(r − r0 )(1 + op (1)), 2 n 1  ˇ i = Op (h2 )δ, Mn2 = −(r − r0 )T Ki ψε (xi )R(Xi1 )X n nhn i=1



n



√ 1  Ki ψε (xi )R(Xi1 )Rni

= Op (h2n κ/ n), |Mn3 | =

− nhn i=1



n 



√ T 1 ˇ

|Mn4 | = (r − r0 ) Ki ψε (xi )Rni Xi

= Op (κ/ n)δ, nhn i=1



n

1  2

Ki ψε (xi )Rni = Op (κ2 /n). |Mn5 | =

2nhn i=1 Mn1 =

Therefore Kn2 =

√ 1 (r − r0 )T Γ1 (x1 )fX 1 (x1 )S(r − r0 )(1 + op (1)) + Op (κ/ n)δ + Op (κ2 /n). 2

Put Tni = m1 (Xi1 ) − m1 (x1 ) − m1 (x1 )(Xi1 − x1 ) and Qni = Tni − Rni , then Ln (r0 ) − ln (r0 ) =

n n 1  1  [ρ(εi + Qni ) − ρ(εi )]Ki − [ρ(εi + Tni ) − ρ(εi )]Ki nhn i=1 nhn i=1

≡ Wn1 + Wn2 .

(38)



Decompose Wn1 as follows: Wn1

n n 1  1  = [ρ(εi + Qni ) − ρ(εi ) − ψ(εi )Qni ]Ki + ψ(εi )Qni Ki ≡ Vn1 + Vn2 . nhn i=1 nhn i=1

√ Note that max1in |Qni | = Op (κ/ n). It is easy to see that Vn2 = op (1). Using Condition (A3)(ii) and the same argument as for (37), we obtain Vn1 = op (1). Therefore, Wn1 = op (1). Similarly, we get Wn2 = op (1). Then Ln (r0 ) − ln (r0 ) = op (1).

(39)

Then by (35)–(39), Ln (r) − Ln (r0 ) = Kn1 + Kn2 + Kn3 − [Ln (r0 ) − ln (r0 )] 1 = (r − r0 )T Γ1 (x1 )fX 1 (x1 )S(r − r0 )(1 + op (1)) + op (δ 2 ) + op (1). 2

(40)

Let λ be the smallest eigenvalue of the positive definite matrix S. Then for any r ∈ Sδ , we have for sufficiently small δ, limn→∞ P {inf r∈Sδ Ln (r) − Ln (r0 ) > 14 λΓ1 (x1 )fX 1 (x1 )δ 2 } = 1, which shows that (34) holds. Hence, Ln (r) has a local minimum in the interior of Sδ . Since at a local minimum, (4) must be satisfied. Therefore lim P {[m ˆ 1 (x1 ) − m1 (x1 )]2 + h2n [m ˆ 1 (x1 ) − m1 (x1 )]2  δ 2 } = 1.

n→∞

This completes the proof of the theorem. xi )] and Let εi = yi − [α + m1 (Xi1 ) + m−1 (˜

Proof of Theorem 2.

ˇT a(x1 ) − a(x1 )] + α − α ˜ + m−1 (˜ xi ) − m ˜ −1 (˜ xi ). ηˆi = R(Xi1 ) − X i H[ˆ α+m ˆ 1 (x1 ) + m ˜ −1 (˜ xi )] − m ˆ 1 (x1 )(Xi1 − x1 ). Thus the local M-estimation Then εi + ηˆi = yi − [˜ n ˇ i = 0, which is equivalent to equations (4) can be rewritten as i=1 ψ(εi + ηˆi )Ki X n 

ˇ i = 0. {ψ(εi ) + ψε (xi )ˆ ηi + [ψ(εi + ηˆi ) − ψ(εi ) − ψε (xi )ˆ ηi ]}Ki X

i=1

We denote the lefthand side of the above equation by In1 + In2 + In3 . Then In2 =

n 

ˇi − ψε (xi )R(Xi1 )Ki X

i=1 n 

+

n 

ˇ iX ˇ T H[ˆ ψε (xi )Ki X a(x1 ) − a(x1 )] i

i=1

ˇ i ≡ Ln1 + Ln2 + Ln3 . ψε (xi )[α − α ˜ + m−1 (˜ xi ) − m ˜ −1 (˜ xi )]Ki X

i=1

By Lemma 8, 1 3  1 nh m (x )Γ1 (x1 )fX 1 (x1 )c1 (1 + op (1)), 2 n 1 = −nhn Γ1 (x1 )fX 1 (x1 )SH[ˆ a(x1 ) − a(x1 )](1 + op (1)).

Ln1 = Ln2

(41)



Applying Lemma 10, we get Ln3 = op (Ln1 ). By Theorem 1 and Lemma 5, max

i:|Xi1 −x1 |hn

|ˆ ηi | =

max

i:|Xi1 −x1 |hn

+

|R(Xi1 )| +

max

i:|Xi1 −x1 |hn

max

i:|Xi1 −x1 |hn

|α − α ˜ + m(˜ xi ) − m(˜ ˜ xi )|

|m ˆ 1 (x1 ) − m1 (x1 ) + (m ˆ 1 (x1 ) − m1 (x1 ))(Xi1 − x1 )|

ˆ 1 (x1 ) − m1 (x1 )| + hn |m ˆ 1 (x1 ) − m1 (x1 )|) = Op (h2n + κ/n1/2 + |m = op (1).

(42)

Applying Conditions (A3)(ii), (A6) and the same argument as that in Lemma 7.2 of Jiang and Mack[20] , we obtain that ˆ 1 (x1 ) − m1 (x1 )| + hn |m ˆ 1 (x1 ) − m1 (x1 )|]2−ν In3 = op (nhn )[h2n + κ/n1/2 + |m = op (Ln1 ) + op (Ln2 ), where ν is some sufficiently small positive number. By (41), 1 H[ˆ a(x1 ) − a(x1 )] − h2n m1 (x1 )S−1 c1 (1 + op (1)) = [nhn Γ1 (x1 )fX 1 (x1 )]−1 S−1 Jn (1 + op (1)). 2 Then the result of the theorem follows from Lemma 7 and Slutsky’s theorem. Lemma 11. Under the conditions of Theorem 3, for any random sequence {ηi }ni=1 , if max1in |ηi | = op (1), then for any non-negative integer l, n 1 1 (a) i=1 ψ  (εi + ηi )Ki (Xi1 − x1 )l = nhl+1 n sl Γ1 (x )fX 1 (x )(1 + op (1));   1 1 1 (b) ni=1 ψ  (εi + ηi )R(Xi1 )Ki (Xi1 − x1 )l = 12 nhl+3 n sl+2 m1 (x )Γ1 (x )fX 1 (x )(1 + op (1)). Proof. Under the conditions in Theorem 3, we have E[ψ  (ε)|X 1 = x1 ] = Γ1 (x1 ). Then the results follow from Lemma 7.2 of Jiang and Mack[20] . ˇ T H[0 a(x1 ) − a(x1 )] + α − α Proof of Theorem 3. Let δˆi = R(X 1 ) − X ˜ + m−1 (˜ xi ) − m ˜ −1 (˜ xi ). i

i

Similar to (42) and by the initial condition on 0 a(x1 ), we have max

i:|Xi1 −x1 |hn

|δˆi | = Op (h2n + κ/n1/2 + |m ˇ 1 (x1 ) − m1 (x1 )| + hn |m ˇ 1 (x1 ) − m1 (x1 )|) = op (1).

By Lemma 11, wkm = −

n 

ψ  {yi − [˜ α + a0 + a1 (Xi1 − x1 ) + m ˜ −1 (˜ xi )]}Ki (Xi1 − x1 )k+m |a=0 a(x1 )

i=1

=−

n 

ψ  (εi + δˆi )Ki (Xi1 − x1 )k+m = −nh1+k+m sk+m Γ1 (x1 )fX 1 (x1 )(1 + op (1)), n

i=1

for k, m = 0, 1. Then by the definition of Wn , we have Wn−1 = −[nhn Γ1 (x1 )fX 1 (x1 )HSH]−1 (1+ op (1)). Note that 1

ψn0 (0 a(x )) =

n 

ψ{yi − [˜ α+m ˇ 1 (x1 ) + m ˇ 1 (x1 )(Xi1 − x1 ) + m ˜ −1 (˜ xi )]}Ki

i=1

=

n  i=1

{ψ(εi ) + ψε (xi )δˆi + [ψ(εi + δˆi ) − ψ(εi ) − ψε (xi )δˆi ]}Ki

≡ Bn1 + Bn2 + Bn3 .



Using the similar argument as that in Theorem 2, we obtain that Bn2 = −nhn Γ1 (x1 )fX 1 (x1 )(s0 , s1 )H[0 a(x1 ) − a(x1 )](1 + op (1)) 1 + nh3n s2 m1 (x1 )Γ1 (x1 )fX 1 (x1 )(1 + op (1)) ≡ Bn21 + Bn22 2 and Bn3 = op (Bn22 ). Then ψn0 (0 a(x1 )) =

n 

1 ψ(εi )Ki + nh3n s2 m1 (x1 )Γ1 (x1 )fX 1 (x1 )(1 + op (1)) 2 i=1

− nhn Γ1 (x1 )fX 1 (x1 )(s0 , s1 )H[0 a(x1 ) − a(x1 )](1 + op (1)). Similarly, ψn1 (0 a(x1 )) =

n 

1 ψ(εi )Ki (Xi1 − x1 ) + nh4n s3 m1 (x1 )Γ1 (x1 )fX 1 (x1 )(1 + op (1)) 2 i=1

− nh2n Γ1 (x1 )fX 1 (x1 )(s1 , s2 )H[0 a(x1 ) − a(x1 )](1 + op (1)). Therefore, HWn−1 Ψn (0 a(x1 )) =

1 3  1 nh m (x )HWn−1 HΓ1 (x1 )fX 1 (x1 )c1 (1 + op (1)) 2 n 1 − nhn HWn−1 HΓ1 (x1 )fX 1 (x1 )SH[0 a(x1 ) − a(x1 )](1 + op (1)) n  ˇ i ≡ Mn1 + Mn2 + Mn3 . ψ(εi )Ki X + HWn−1 H i=1

Simple algebra yields that M_{n1} = −(1/2) h_n² m_1''(x^1) S^{−1} c_1 (1 + o_p(1)) and M_{n2} = H[₀a(x^1) − a(x^1)](1 + o_p(1)). Therefore,

    √(nh_n)[H(₁a(x^1) − a(x^1))] = √(nh_n) H[₀a(x^1) − a(x^1)] − √(nh_n) H W_n^{−1} Ψ_n(₀a(x^1))
        = (1/2)√(nh_n) h_n² m_1''(x^1) S^{−1} c_1 (1 + o_p(1)) − √(nh_n) H W_n^{−1} H Σ_{i=1}^{n} ψ(ε_i)K_i X̌_i
        = (1/2)√(nh_n) h_n² m_1''(x^1) S^{−1} c_1 (1 + o_p(1)) + [Γ_1(x^1) f_{X^1}(x^1)]^{−1} S^{−1} J_n / √(nh_n).

Then the theorem follows from Slutsky's theorem and Lemma 7.

References
1 Friedman J H, Stuetzle W. Projection pursuit regression. J Amer Statist Assoc, 76: 817–823 (1981)
2 Stone C J. Additive regression and other nonparametric models. Ann Statist, 13: 689–705 (1985)
3 Stone C J. The dimensionality reduction principle for generalized additive models. Ann Statist, 14: 590–606 (1986)
4 Hastie T, Tibshirani R J. Generalized Additive Models. London: Chapman & Hall, 1990
5 Breiman L, Friedman J H. Estimating optimal transformations for multiple regression and correlation. J Amer Statist Assoc, 80: 580–619 (1985)
6 Buja A, Hastie T, Tibshirani R J. Linear smoothers and additive models. Ann Statist, 17: 453–510 (1989)
7 Opsomer J D, Ruppert D. Fitting a bivariate additive model by local polynomial regression. Ann Statist, 25: 186–211 (1997)



8 Mammen E, Linton O, Nielsen J P. The existence and asymptotic properties of backfitting projection algorithm under weak conditions. Ann Statist, 27: 1443–1490 (1999)
9 Opsomer J D. Asymptotic properties of backfitting estimator. J Multivariate Anal, 73: 166–179 (2000)
10 Tjøstheim D, Auestad B H. Nonparametric identification of nonlinear time series: Projections. J Amer Statist Assoc, 89: 1398–1409 (1994)
11 Linton O, Nielsen J P. A kernel method of estimating structured nonparametric regression based on marginal integration. Biometrika, 82: 93–100 (1995)
12 Chen R, Härdle W, Linton O, et al. Nonparametric estimation of additive separable regression models. In: Härdle W, Schimek M, eds. Statistical Theory and Computational Aspects of Smoothing. Heidelberg: Physica, 1996, 247–253
13 Linton O, Härdle W. Estimating additive regression models with known link function. Biometrika, 83: 529–540 (1996)
14 Fan J, Härdle W, Mammen E. Direct estimation of low-dimensional components in additive models. Ann Statist, 26: 943–971 (1998)
15 Stone C J. The use of polynomial splines and their tensor products in multivariate function estimation. Ann Statist, 22: 118–184 (1994)
16 Newey W K. Convergence rates and asymptotic normality for series estimators. J Multivariate Anal, 73: 147–168 (1997)
17 Horowitz J L, Mammen E. Nonparametric estimation of an additive model with a link function. Ann Statist, 32: 2412–2443 (2004)
18 Linton O. Estimating additive nonparametric models by partial Lq norm: the curse of fractionality. Econometric Theory, 17: 1037–1050 (2001)
19 Fan J, Jiang J. Variable bandwidth and one-step local M-estimator. Sci China Ser A-Math, 43: 65–81 (2000)
20 Jiang J, Mack Y P. Robust local polynomial regression for dependent data. Statist Sinica, 11: 705–722 (2001)
21 Huber P J. Robust Statistics. New York: Wiley, 1981
22 He X, Shi P. Bivariate tensor-product B-splines in a partly linear model. J Multivariate Anal, 58: 162–181 (1996)
23 Doksum K, Koo J-Y. On spline estimators and prediction intervals in nonparametric regression. Comput Statist Data Anal, 35: 67–82 (2000)
24 Horowitz J L, Lee S. Nonparametric estimation of an additive quantile regression model. J Amer Statist Assoc, 100: 1238–1249 (2005)
25 De Boor C. A Practical Guide to Splines. New York: Springer-Verlag, 1978
26 Harrison D, Rubinfeld D L. Hedonic housing prices and the demand for clean air. J Econom Manag, 5: 81–102 (1978)
27 Belsley D A, Kuh E, Welsch R E. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley, 1980
28 Opsomer J D, Ruppert D. A fully automated bandwidth selection method for fitting additive models. J Amer Statist Assoc, 93: 605–619 (1998)
29 Fan J, Jiang J. Nonparametric inference for additive models. J Amer Statist Assoc, 100: 890–907 (2005)
30 Opsomer J D. Optimal bandwidth selection for fitting an additive model by local polynomial regression. PhD thesis. Ithaca, NY: Cornell University, 1995
31 Zhou S, Shen X, Wolfe D A. Local asymptotics for regression splines and confidence regions. Ann Statist, 26: 1760–1782 (1998)