Econometric Theory, 26, 2010, 29–59. doi:10.1017/S0266466609090604

SPLINE-BACKFITTED KERNEL SMOOTHING OF ADDITIVE COEFFICIENT MODEL

RONG LIU

University of Toledo

LIJIAN YANG

Michigan State University

The additive coefficient model (Xue and Yang, 2006a, 2006b) is a flexible regression and autoregression tool that circumvents the "curse of dimensionality." We propose spline-backfitted kernel (SBK) and spline-backfitted local linear (SBLL) estimators for the component functions in the additive coefficient model that are both (i) computationally expedient, so they are usable for analyzing high-dimensional data, and (ii) theoretically reliable, so inference can be made on the component functions with confidence. In addition, they are (iii) intuitively appealing and easy to use for practitioners. The SBLL procedure is applied to a varying coefficient extension of the Cobb-Douglas model for U.S. GDP that allows nonneutral effects of R&D on capital and labor as well as on total factor productivity (TFP).

The comments from two anonymous referees and co-editor Oliver Linton have resulted in substantial improvement of the work. This research is part of the first author's dissertation work under the supervision of the second author and has been supported in part by NSF awards DMS 0405330 and DMS 0706518. Address correspondence to Rong Liu, Department of Mathematics, University of Toledo, Toledo, OH 43606, USA; e-mail: [email protected]; or to Lijian Yang, Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA; e-mail: [email protected]. © Cambridge University Press, 2010.

1. INTRODUCTION

Regression analysis has been widely used in econometric studies, for instance, in the estimation of production/cost functions. Typical parametric regression models presume that their regression functions follow a predetermined form with finitely many unknown parameters. Nonparametric models, on the other hand, impose less stringent assumptions on the regression functions, but for their flexibility pay the price of the "curse of dimensionality." A structured model offers a sensible compromise between parametric simplicity and nonparametric flexibility; see, for example, Sperlich, Tjøstheim, and Yang (2002) for additive interaction modeling of the production function of Wisconsin farms and Rodríguez-Póo, Sperlich, and Vieu (2003) for a general framework of separable models. Recently, Xue and Yang (2006a, 2006b) proposed an additive coefficient model that allows a response variable $Y$ to depend linearly on some regressors, with coefficients that are smooth additive functions of other predictors, called tuning variables. Specifically,


$$E(Y\,|\,\mathbf{X},\mathbf{T}) \equiv m(\mathbf{X},\mathbf{T}) \equiv \sum_{l=1}^{d_1} m_l(\mathbf{X})\,T_l, \qquad m_l(\mathbf{X}) = m_{0l} + \sum_{\alpha=1}^{d_2} m_{\alpha l}(X_\alpha), \quad 1\le l\le d_1, \tag{1}$$

in which the predictor vector $(\mathbf{X},\mathbf{T})$ consists of the tuning variables $\mathbf{X} = (X_1,\ldots,X_{d_2})^T\in R^{d_2}$ and the linear predictors $\mathbf{T} = (T_1,\ldots,T_{d_1})^T\in R^{d_1}$. The functional coefficient model of Chen and Tsay (1993b) corresponds to the case $d_2 = 1$; the varying coefficient model of Hastie and Tibshirani (1993) corresponds to the case $d_2 = d_1$ with, for each $l = 1,\ldots,d_1$, only one single significant $m_{\alpha l}$, the one with $\alpha = l$. Also included as special cases of model (1) are the additive model of Hastie and Tibshirani (1990) and Chen and Tsay (1993a) and the multivariate linear regression model (see Xue and Yang, 2006a, for detailed discussion).

Model (1)'s versatility for econometric applications is illustrated by the following example. Consider the forecasting of the U.S. GDP annual growth rate, which is modeled as the total factor productivity (TFP) growth rate plus a linear function of the capital growth rate and the labor growth rate, according to the classic Cobb-Douglas model (Cobb and Douglas, 1928). As pointed out in Li and Racine (2007, p. 302), it is unrealistic to ignore the nonneutral effect of R&D spending on the TFP growth rate and on the complementary slopes of the capital and labor growth rates. Thus, a smooth coefficient model should fit the production function better than the parametric Cobb-Douglas model. Indeed, Figure 1 shows that a smooth coefficient model has much smaller rolling forecast errors than the parametric Cobb-Douglas model, based on data from 1959 to 2002. In addition, Figure 2 shows that the TFP growth rate is a function of R&D spending, not a constant.

Many methods exist for the estimation of functional/varying coefficient models; see Cai, Fan, and Yao (2000) and Yang, Park, Xue, and Härdle (2006) for kernel type estimators, and Huang, Wu, and Zhou (2002) and Huang and Shen (2004) for spline estimators.
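To make the structure of model (1) concrete, the following minimal simulation sketch generates data from it. The particular component functions, constants, and distributions are our illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Simulate from model (1): Y = sum_l { m_0l + sum_a m_al(X_a) } T_l + noise.
# All component choices below are illustrative assumptions only.
rng = np.random.default_rng(0)
n, d1, d2 = 500, 2, 2

# Centered components m_al with E m_al(X_a) = 0 for X_a ~ U(0, 1),
# matching the requirement that each component function be centered.
m = {(1, 1): lambda x: np.sin(2 * np.pi * x),   # m_11(x_1)
     (2, 1): lambda x: x - 0.5,                 # m_21(x_2)
     (1, 2): lambda x: np.cos(2 * np.pi * x),   # m_12(x_1)
     (2, 2): lambda x: x ** 2 - 1 / 3}          # m_22(x_2)
m0 = np.array([1.0, -0.5])                      # constants m_01, m_02

X = rng.uniform(size=(n, d2))                   # tuning variables
T = np.column_stack([np.ones(n),                # T_1 = 1: an additive-model part
                     rng.normal(size=n)])       # T_2: a linear predictor

# Coefficient functions m_l(X) = m_0l + m_1l(X_1) + m_2l(X_2), l = 1, 2.
coef = np.column_stack([m0[l] + m[(1, l + 1)](X[:, 0]) + m[(2, l + 1)](X[:, 1])
                        for l in range(d1)])
Y = (coef * T).sum(axis=1) + 0.2 * rng.standard_normal(n)
```

Setting $T_1 \equiv 1$ makes the first coefficient function an ordinary additive component, illustrating how the additive model is nested in (1).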

FIGURE 1. Errors of GDP forecasts: solid line = model (31); dotted line = model (30).


FIGURE 2. Estimation of the TFP growth rate function $c_1 + m_{\mathrm{SBLL},41}(X_{t-3})$.

These published works have had partial success in addressing the inaccuracy of estimating multivariate nonparametric functions, commonly known as the "curse of dimensionality." Typically, optimal convergence rates of the coefficient function estimators are established, locally for kernel estimators or globally for spline estimators. Our view is that a satisfactory procedure for estimating the functions $\{m_{\alpha l}(x_\alpha)\}_{l=1,\alpha=1}^{d_1,d_2}$ and constants $\{m_{0l}\}_{l=1}^{d_1}$ in model (1) should meet three broad criteria. Specifically, the procedure should be (i) computationally expedient, (ii) theoretically reliable, and (iii) intuitively appealing.

As model (1) is a natural extension of the additive model, we extend the "spline-backfitted kernel smoothing" of Wang and Yang (2007) to the additive coefficient model, combining the best features of both kernel and spline methods. Kernel procedures for the additive model, such as those in Yang, Härdle, and Nielsen (1999), Sperlich, Tjøstheim, and Yang (2002), Yang, Sperlich, and Härdle (2003), Rodríguez-Póo, Sperlich, and Vieu (2003), and Hengartner and Sperlich (2005), satisfy criterion (iii) and partly (ii), as they are asymptotically normal at any given point, but do not satisfy (i), since they are extremely computationally intensive when either the dimension is high or the sample size is large, as illustrated in the Monte Carlo results of Wang and Yang (2007). Spline approaches of Stone (1985), Huang (1998a, 1998b), and Huang and Yang (2004) to the additive model, on the other hand, are fast to compute, thus satisfying (i), but do not satisfy criterion (ii), as they lack a limiting distribution. In addition, none of the published works has established a "uniform convergence rate," another shortfall with regard to (ii).


The spline-backfitted kernel (SBK) and spline-backfitted local linear (SBLL) estimators we propose are essentially as fast and accurate as univariate kernel and local linear smoothing, thus completely satisfying all three criteria (i)-(iii). Other alternatives for estimating model (1) that may satisfy criteria (i)-(iii) are possible extensions of the smoothed backfitting of Mammen, Linton, and Nielsen (1999) and Nielsen and Sperlich (2005), and of the two-stage estimator of Horowitz and Mammen (2004). It is important to note that although Horowitz and Mammen (2004) used B-splines in their simulations, their theoretical proof was for what should be called an "orthogonal series-backfitted local linear" estimator in our parlance.

We extend the oracle smoothing idea of Linton (1997) and Wang and Yang (2007) to model (1). If all the nonparametric functions of the last $d_2 - 1$ variables, $\{m_{\alpha l}(x_\alpha)\}_{l=1,\alpha=2}^{d_1,d_2}$, and all the constants $\{m_{0l}\}_{l=1}^{d_1}$ were known by "oracle," one could define a new variable
$$Y_{\cdot 1} = Y - \sum_{l=1}^{d_1}\Big\{m_{0l} + \sum_{\alpha=2}^{d_2} m_{\alpha l}(X_\alpha)\Big\}T_l = \sum_{l=1}^{d_1} m_{1l}(X_1)\,T_l + \sigma(\mathbf{X},\mathbf{T})\,\varepsilon$$
and estimate all the functions $\{m_{1l}(x_1)\}_{l=1}^{d_1}$ by linear regression of $Y_{\cdot 1}$ on $T_1,\ldots,T_{d_1}$ with kernel weights computed from the variable $X_1$. These would-be estimators do not suffer from the "curse of dimensionality" and are called "oracle smoothers." We propose to pre-estimate the functions $\{m_{\alpha l}(x_\alpha)\}_{l=1,\alpha=2}^{d_1,d_2}$ and constants $\{m_{0l}\}_{l=1}^{d_1}$ by linear spline, then use these estimates as substitutes to obtain an approximation $\hat{Y}_{\cdot 1}$ to the variable $Y_{\cdot 1}$ and construct "oracle" estimators based on $\hat{Y}_{\cdot 1}$. As in Wang and Yang (2007), the theoretical contribution of this paper is proving that the error caused by this "cheating" is negligible. Consequently, the SBK/SBLL estimators are uniformly (over the data range) equivalent to the univariate kernel/local linear "oracle smoothers," automatically inheriting all their oracle efficiency properties. Our proof relies on the same principles of "reducing bias by undersmoothing in step one" and "averaging out the variance in step two," accomplished with the joint asymptotics of kernel and spline functions.

Compared to Wang and Yang (2007), a major theoretical complication is the dependence structure of $\mathbf{T}$ on $\mathbf{X}$, necessitating Assumption 2 on the second moment matrix $\mathbf{Q}(\mathbf{x}) = E(\mathbf{T}\mathbf{T}^T\,|\,\mathbf{X}=\mathbf{x})$; see the detailed discussion of Assumption 2 at the end of Section 2 and the extra step of estimating $\mathbf{Q}(\mathbf{x})$ in Section 5. In contrast, for the additive model of Wang and Yang (2007), there is no need for Assumption 2, and $\mathbf{Q}(\mathbf{x})\equiv 1$ requires no estimation. Another innovation in this paper is the $\sqrt{n}$-consistent oracle estimation of the constants $\{m_{0l}\}_{l=1}^{d_1}$ under no more than second-order smoothness of $\{m_{\alpha l}(x_\alpha)\}_{l=1,\alpha=1}^{d_1,d_2}$. Xue and Yang (2006a) provided $\sqrt{n}$-consistent estimation of the constants $\{m_{0l}\}_{l=1}^{d_1}$ only under higher order smoothness assumptions, and Xue and Yang (2006b) failed to obtain $\sqrt{n}$-consistency for estimating $\{m_{0l}\}_{l=1}^{d_1}$. For the additive model of Wang and Yang (2007), there is only one such unknown constant, and it is $\sqrt{n}$-consistently estimated by the sample mean $\bar{Y}$. Lastly, asymptotic theory for the oracle smoothers is developed separately in Section 3, whereas Wang and Yang (2007) used existing theory from the kernel smoothing literature.
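To fix ideas, here is a minimal numpy sketch of the two-step procedure under simplifying assumptions of our own: a truncated-power linear-spline basis with quantile knots, an Epanechnikov kernel, and the constants $m_{0l}$ left inside the $X_1$ component, so that step two targets $m_{0l} + m_{1l}(x_1)$. Function names and tuning choices are illustrative, not the authors'.

```python
import numpy as np

def lin_spline(x, knots):
    # Truncated-power linear spline basis in one variable: x, (x - k)_+.
    return np.column_stack([x] + [np.clip(x - k, 0.0, None) for k in knots])

def sbk(Y, X, T, x1_grid, h, n_knots=4):
    """Illustrative SBK sketch: linear-spline pre-fit, then kernel step in X_1."""
    n, d2 = X.shape
    d1 = T.shape[1]
    # Step 1: undersmoothed additive linear-spline fit of all components.
    blocks = [np.ones((n, 1))]                      # intercept block
    for a in range(d2):
        knots = np.quantile(X[:, a], np.linspace(0, 1, n_knots + 2)[1:-1])
        blocks.append(lin_spline(X[:, a], knots))
    B = np.column_stack(blocks)                     # n x p additive basis
    p = B.shape[1]
    D = np.column_stack([T[:, [l]] * B for l in range(d1)])
    beta, *_ = np.linalg.lstsq(D, Y, rcond=None)
    # Columns belonging to the intercept and the X_1 spline block, per l:
    p1 = 1 + blocks[1].shape[1]
    keep = np.concatenate([np.arange(l * p, l * p + p1) for l in range(d1)])
    drop = np.setdiff1d(np.arange(d1 * p), keep)
    # Pseudo-responses: remove the fitted X_2, ..., X_d2 components from Y.
    Y1_hat = Y - D[:, drop] @ beta[drop]
    # Step 2: kernel-weighted least squares of Y1_hat on T at each grid point.
    out = np.empty((len(x1_grid), d1))
    for j, x1 in enumerate(x1_grid):
        u = (X[:, 0] - x1) / h
        w = np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0) / h
        sw = np.sqrt(w)
        out[j], *_ = np.linalg.lstsq(sw[:, None] * T, sw * Y1_hat, rcond=None)
    return out   # row j estimates (m_0l + m_1l(x1_grid[j]))_{l=1..d1}
```

With the simulated data from the earlier sketch, `sbk(Y, X, T, np.linspace(0.05, 0.95, 19), h=0.1)` traces out the coefficient curves in the first tuning variable.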


The paper is organized as follows. In Section 2 we discuss the assumptions of model (1). In Section 3, we introduce the oracle smoothers and discuss their asymptotic properties. In Section 4, we introduce the SBK and SBLL estimators, their uniform consistency, and asymptotic normal distributions; the ideas behind our proofs of the main theoretical results are given by decomposing the estimator's "cheating" error into a bias part and a variance part. In Section 5, we discuss the implementation of the estimators. In Section 6, we apply the methods to an empirical example. All technical proofs are given in the Appendix.

2. ASSUMPTIONS ON THE MODEL

Let $\{(Y_i,\mathbf{X}_i,\mathbf{T}_i)\}_{i=1}^n$ be a sequence of strictly stationary observations with the same distribution as $(Y,\mathbf{X},\mathbf{T})$ in model (1). Denote the unknown conditional mean and variance functions as $m(\mathbf{X},\mathbf{T}) = E(Y\,|\,\mathbf{X},\mathbf{T})$ and $\sigma^2(\mathbf{X},\mathbf{T}) = \mathrm{var}(Y\,|\,\mathbf{X},\mathbf{T})$; then one has

$$Y_i = m(\mathbf{X}_i,\mathbf{T}_i) + \sigma(\mathbf{X}_i,\mathbf{T}_i)\,\varepsilon_i \tag{2}$$
for some conditional white noises $\{\varepsilon_i\}_{i=1}^n$ that satisfy $E(\varepsilon_i\,|\,\mathbf{X}_i,\mathbf{T}_i)=0$ and $E(\varepsilon_i^2\,|\,\mathbf{X}_i,\mathbf{T}_i)=1$. The variables $(\mathbf{X}_i,\mathbf{T}_i)$ can consist of either exogenous variables or lagged values of $Y_i$. For the additive coefficient model, the regression function $m$ takes the form in (1) and satisfies the identification conditions that

$$E\{m_{\alpha l}(X_\alpha)\} = 0,\quad 1\le l\le d_1,\ 1\le\alpha\le d_2, \tag{3}$$
ensuring the unique additive representation of $m_l(\mathbf{x}) = m_{0l} + \sum_{\alpha=1}^{d_2} m_{\alpha l}(x_\alpha)$. As in most works on nonparametric smoothing, estimation of the functions $\{m_{\alpha l}(x_\alpha)\}_{l=1,\alpha=1}^{d_1,d_2}$ is conducted on compact sets. Without loss of generality, let the compact set be $\chi = [0,1]^{d_2}$. Following Stone (1985, p. 693), the space of $\alpha$-centered square integrable functions on $[0,1]$ is
$$H_\alpha^0 = \left\{g : E\{g(X_\alpha)\} = 0,\ E\{g^2(X_\alpha)\} < +\infty\right\},\quad 1\le\alpha\le d_2.$$

Next, define the model space $M$, a collection of functions on $\chi\times R^{d_1}$, as
$$M = \left\{g(\mathbf{x},\mathbf{t}) = \sum_{l=1}^{d_1} g_l(\mathbf{x})\,t_l;\ g_l(\mathbf{x}) = g_{0l} + \sum_{\alpha=1}^{d_2} g_{\alpha l}(x_\alpha);\ g_{\alpha l}\in H_\alpha^0\right\},$$
in which $\{g_{0l}\}_{l=1}^{d_1}$ are finite constants. The constraints that $E\{g_{\alpha l}(X_\alpha)\} = 0$, $1\le\alpha\le d_2$, ensure the unique additive representation of $m_l$ as expressed in (3) but are not necessary for the definition of the space $M$. In what follows, denote by $E_n$ the empirical expectation, $E_n\varphi = \sum_{i=1}^n \varphi(\mathbf{X}_i,\mathbf{T}_i)/n$. We introduce two inner products on $M$. For functions $g_1,g_2\in M$, the theoretical and empirical inner products are defined respectively as $\langle g_1,g_2\rangle = E\{g_1(\mathbf{X},\mathbf{T})\,g_2(\mathbf{X},\mathbf{T})\}$ and $\langle g_1,g_2\rangle_n = E_n\{g_1(\mathbf{X},\mathbf{T})\,g_2(\mathbf{X},\mathbf{T})\}$. The corresponding induced norms are $\|g_1\|_2^2 = E g_1^2(\mathbf{X},\mathbf{T})$ and $\|g_1\|_{2,n}^2 = E_n g_1^2(\mathbf{X},\mathbf{T})$.
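As a concrete reading of these definitions, the empirical inner product is just a sample average; a minimal sketch, in which the function $g$ is an arbitrary illustrative member of $M$:

```python
import numpy as np

def inner_n(g1, g2, X, T):
    # <g1, g2>_n = E_n{ g1(X,T) g2(X,T) }, a sample average over the data.
    return np.mean(g1(X, T) * g2(X, T))

# ||g||_{2,n}^2 = <g, g>_n for an illustrative g(x, t) = (x_1 - 1/2) t_1 in M.
g = lambda X, T: (X[:, 0] - 0.5) * T[:, 0]
rng = np.random.default_rng(1)
X, T = rng.uniform(size=(200, 2)), rng.normal(size=(200, 2))
norm2_n = inner_n(g, g, X, T)
```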


The model space $M$ is called theoretically (empirically) identifiable if, for any $g\in M$, $\|g\|_2 = 0$ ($\|g\|_{2,n} = 0$) implies that $g = 0$ a.s.

In this paper, for any compact interval $[a,b]$, we denote the space of $p$th order smooth functions as $C^{(p)}[a,b] = \{g : g^{(p)}\in C[a,b]\}$ and the class of Lipschitz continuous functions for constant $C>0$ as $\mathrm{Lip}([a,b],C) = \{g : |g(x)-g(x')|\le C|x-x'|,\ \forall x,x'\in[a,b]\}$. We mean by "$\sim$" that both sides have the same order as $n\to\infty$. We denote by $I_{d_1\times d_1}$ the $d_1\times d_1$ identity matrix and by $0_{d_1\times d_1}$ the $d_1\times d_1$ zero matrix. For any vector $\mathbf{x} = (x_1,x_2,\ldots,x_{d_2})$, we denote the supremum and Euclidean norms as $|\mathbf{x}| = \max_{1\le\alpha\le d_2}|x_\alpha|$ and $\|\mathbf{x}\| = \big(\sum_{\alpha=1}^{d_2} x_\alpha^2\big)^{1/2}$. We need the following assumptions on the data-generating process.

Assumption 1. The tuning variable $\mathbf{X} = (X_1,\ldots,X_{d_2})$ has a continuous probability density function $f(\mathbf{x})$ that satisfies $0 < c_f \le \min_{\mathbf{x}\in\chi} f(\mathbf{x}) \le \max_{\mathbf{x}\in\chi} f(\mathbf{x}) \le C_f < \infty$ for some constants $c_f$ and $C_f$, with $f(\mathbf{x}) = 0$ for $\mathbf{x}\notin\chi = [0,1]^{d_2}$.

Assumption 2. There exist constants $0 < c_Q \le C_Q < +\infty$, $0 < c_\delta \le C_\delta < +\infty$, and some $\delta > 1/2$, such that $c_Q I_{d_1\times d_1} \le \mathbf{Q}(\mathbf{x}) = \{q_{ll'}(\mathbf{x})\}_{l,l'=1}^{d_1} = E(\mathbf{T}\mathbf{T}^T\,|\,\mathbf{X}=\mathbf{x}) \le C_Q I_{d_1\times d_1}$ and $c_\delta \le E(|T_l T_{l'}|^{2+\delta}\,|\,\mathbf{X}=\mathbf{x}) \le C_\delta$ for all $\mathbf{x}\in\chi$ and $l,l' = 1,\ldots,d_1$.

Assumption 3. The vector process $\{\boldsymbol{\varsigma}_t\}_{t=-\infty}^{\infty} = \{(Y_t,\mathbf{X}_t,\mathbf{T}_t)\}_{t=-\infty}^{\infty}$ is strictly stationary and geometrically strongly mixing; that is, its $\alpha$-mixing coefficient satisfies $\alpha(k)\le c\rho^k$ for constants $c>0$ and $0<\rho<1$, where $\alpha(k) = \sup_{A\in\sigma(\boldsymbol{\varsigma}_t,t\le0),\,B\in\sigma(\boldsymbol{\varsigma}_t,t\ge k)}|P(A)P(B) - P(A\cap B)|$.

Assumption 4. The coefficient components satisfy $m_{\alpha l}\in C^1[0,1]$ and $m'_{\alpha l}\in\mathrm{Lip}([0,1],C_\infty)$ for all $1\le\alpha\le d_2$, $1\le l\le d_1$, with $m_{1l}\in C^2[0,1]$ for all $1\le l\le d_1$.

Assumption 5. The conditional variance function $\sigma^2(\mathbf{x},\mathbf{t})$ is measurable and bounded. The errors $\{\varepsilon_i\}_{i=1}^n$ satisfy $E(\varepsilon_i\,|\,\mathcal{F}_i)=0$, $E(\varepsilon_i^2\,|\,\mathcal{F}_i)=1$, and $E(|\varepsilon_i|^{2+\eta}\,|\,\mathcal{F}_i)\le C_\eta$ for some $\eta\in(1/2,1]$ and the sequence of $\sigma$-fields $\mathcal{F}_i = \sigma\{\mathbf{X}_j,\mathbf{T}_j,\,j\le i;\ \varepsilon_j,\,j\le i-1\}$, $i=1,\ldots,n$.

Assumption 6. The marginal density $f_1(x_1)$ of $X_1$ and the conditional second-moment matrix function $\mathbf{Q}_1(x_1)$ defined in (4) both have continuous derivatives on $[0,1]$.

Assumptions 1-5 are common in the literature; see, for instance, Huang and Yang (2004), Huang and Shen (2004), and especially Xue and Yang (2006b). Assumption 6 is needed only for the asymptotic theory of the oracle "kernel smoother," not for the oracle "local linear smoother." Assumption 2 also implies that for all $x_\alpha\in[0,1]$, $1\le\alpha\le d_2$, and $l,l'=1,\ldots,d_1$,
$$c_Q I_{d_1\times d_1} \le \mathbf{Q}_\alpha(x_\alpha) = \{q_{ll',\alpha}(x_\alpha)\}_{l,l'=1}^{d_1} = E(\mathbf{T}\mathbf{T}^T\,|\,X_\alpha=x_\alpha) \le C_Q I_{d_1\times d_1},\quad c_\delta \le E(|T_l T_{l'}|^{2+\delta}\,|\,X_\alpha=x_\alpha) \le C_\delta. \tag{4}$$


Furthermore, Assumptions 2 and 5 imply that for some constant $C>0$,
$$\max_{1\le l\le d_1} E|T_l|^{2+\eta} < C \max_{1\le l\le d_1} E|T_l T_l|^{2+\delta} = C \max_{1\le l\le d_1} E|T_l|^{4+2\delta} \le C C_\delta < +\infty. \tag{5}$$

At one referee's request, we provide here insight into the relationship allowed between the vectors $\mathbf{T}$ and $\mathbf{X}$ under Assumption 2. It is instructive to first understand what $\mathbf{T}$ and $\mathbf{X}$ cannot be in the context of identifiability of the functions $\{m_{\alpha l}(x_\alpha)\}_{l=1,\alpha=1}^{d_1,d_2}$. Suppose that the vector $\mathbf{X}$ is centered so that $E\mathbf{X} = \mathbf{0}$. Then model (1) is unidentifiable when $(T_1,T_2) = (X_1,X_2)$, since $-3X_2T_1 + 3X_1T_2 = 0$, $E(-3X_2) = E(3X_1) = 0$, and the function $m(\mathbf{x},\mathbf{t})$ in (1) can be expressed as
$$\begin{aligned}
m(\mathbf{x},\mathbf{t}) ={}& \sum_{l=3}^{d_1}\Big\{m_{0l} + \sum_{\alpha=1}^{d_2} m_{\alpha l}(x_\alpha)\Big\}t_l + \Big\{m_{01} + m_{21}(x_2) + \sum_{\alpha=1,\alpha\neq2}^{d_2} m_{\alpha 1}(x_\alpha)\Big\}t_1 \\
&+ \Big\{m_{02} + m_{12}(x_1) + \sum_{\alpha=2}^{d_2} m_{\alpha 2}(x_\alpha)\Big\}t_2 \\
={}& \sum_{l=3}^{d_1}\Big\{m_{0l} + \sum_{\alpha=1}^{d_2} m_{\alpha l}(x_\alpha)\Big\}t_l + \Big\{m_{01} + m_{21}(x_2) - 3x_2 + \sum_{\alpha=1,\alpha\neq2}^{d_2} m_{\alpha 1}(x_\alpha)\Big\}t_1 \\
&+ \Big\{m_{02} + m_{12}(x_1) + 3x_1 + \sum_{\alpha=2}^{d_2} m_{\alpha 2}(x_\alpha)\Big\}t_2,
\end{aligned}$$
so one can use $m_{21}^*(x_2) = m_{21}(x_2) - 3x_2$ and $m_{12}^*(x_1) = m_{12}(x_1) + 3x_1$ to replace $m_{21}(x_2)$ and $m_{12}(x_1)$ without changing the data generating process (1). In other words, the functions $m_{21}(x_2)$ and $m_{12}(x_1)$ are unidentifiable. Xue and Yang (2006a, p. 2523) gave a similar counterexample and discussed why an unidentifiable model may perform better for prediction.

More generally, it is revealing to note that Assumption 2 not only rules out the above anomaly, but also disallows the possibility that two of the $T_l$'s ($1\le l\le d_1$) are almost surely equal to two Borel functions of $\mathbf{X}$. To see this, suppose that $(T_1,T_2) = \{\varphi_1(\mathbf{X}),\varphi_2(\mathbf{X})\}$ a.s. for some Borel functions $\varphi_1$ and $\varphi_2$. Assumption 2 implies that
$$c_Q I_{2\times2} \le E\left\{\begin{pmatrix} T_1^2 & T_1T_2 \\ T_1T_2 & T_2^2 \end{pmatrix}\,\middle|\,\mathbf{X}=\mathbf{x}\right\} \le C_Q I_{2\times2},\quad \forall\mathbf{x}\in\chi,$$
leading to
$$c_Q I_{2\times2} \le \begin{pmatrix} \varphi_1^2(\mathbf{x}) & \varphi_1(\mathbf{x})\varphi_2(\mathbf{x}) \\ \varphi_1(\mathbf{x})\varphi_2(\mathbf{x}) & \varphi_2^2(\mathbf{x}) \end{pmatrix} \le C_Q I_{2\times2},\quad \text{a.s.},\ \forall\mathbf{x}\in\chi,$$
which cannot be true because, for any $\mathbf{x}\in\chi$, the $2\times2$ matrix in the above display is singular and thus cannot be bounded below by $c_Q I_{2\times2}$.
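This rank argument is easy to check numerically; in the following sketch, $(\varphi_1,\varphi_2)$ are arbitrary illustrative Borel functions of our choosing:

```python
import numpy as np

# If (T_1, T_2) = (phi_1(X), phi_2(X)) a.s., then conditionally on X = x,
# E(T T' | X = x) is the rank-one outer product of (phi_1(x), phi_2(x)),
# whose smallest eigenvalue is 0, so no c_Q > 0 can satisfy Assumption 2.
x = 0.7                                   # any fixed point in [0, 1]
phi = np.array([x, x ** 2])               # illustrative Borel functions of x
M = np.outer(phi, phi)                    # the conditional second-moment matrix
print(np.linalg.eigvalsh(M))              # [0., phi_1^2 + phi_2^2]: singular
```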

That Assumption 2 guarantees the identifiability of model (1) has been established in Lemma 1 of Xue and Yang (2006b). It is important to observe, however, that Assumption 2 does allow exactly one $T_l$, $1\le l\le d_1$, to be almost surely equal to a Borel function of $\mathbf{X}$.

3. THE ORACLE SMOOTHERS

We now introduce what is known as the oracle smoother in Wang and Yang (2007) as a benchmark for evaluating the estimators. Denote for any vector $\mathbf{x} = (x_1,x_2,\ldots,x_{d_2})$ the deleted vector $\mathbf{x}_{\_1} = (x_2,\ldots,x_{d_2})$, and for the random vector $\mathbf{X}_i = (X_{i1},X_{i2},\ldots,X_{id_2})$ the deleted vector $\mathbf{X}_{i,\_1} = (X_{i2},\ldots,X_{id_2})$, $1\le i\le n$. For any $1\le l\le d_1$, write $m_{\_1,l}(\mathbf{x}_{\_1}) = m_{0l} + \sum_{\alpha=2}^{d_2} m_{\alpha l}(x_\alpha)$. Denote the vector of pseudo-responses $\mathbf{Y}_1 = (Y_{1,1},\ldots,Y_{n,1})^T$, in which
$$Y_{i,1} = Y_i - \sum_{l=1}^{d_1} m_{\_1,l}(\mathbf{X}_{i,\_1})\,T_{il} = \sum_{l=1}^{d_1} m_{1l}(X_{i1})\,T_{il} + \sigma(\mathbf{X}_i,\mathbf{T}_i)\,\varepsilon_i.$$

These would have been the responses had the unknown functions $\{m_{\_1,l}(\mathbf{x}_{\_1})\}_{1\le l\le d_1}$ been given. In that case, one could estimate all the coefficient functions in $x_1$, that is, the vector function $m_{1,\cdot}(x_1) = (m_{11}(x_1),\ldots,m_{1d_1}(x_1))^T$, by solving the kernel weighted least squares problem
$$\tilde{m}_{\mathrm{K},1,\cdot}(x_1) = \left(\tilde{m}_{\mathrm{K},11}(x_1),\ldots,\tilde{m}_{\mathrm{K},1d_1}(x_1)\right)^T = \mathop{\mathrm{argmin}}_{\boldsymbol{\lambda}=(\lambda_l)_{1\le l\le d_1}} L\left(\boldsymbol{\lambda}, m_{\_1,\cdot}, x_1\right),$$
in which
$$L\left(\boldsymbol{\lambda}, m_{\_1,\cdot}, x_1\right) = \sum_{i=1}^n \left(Y_{i,1} - \sum_{l=1}^{d_1}\lambda_l T_{il}\right)^2 K_h(X_{i1}-x_1).$$

Alternatively, one could rewrite the above kernel oracle smoother in matrix form as
$$\tilde{m}_{\mathrm{K},1,\cdot}(x_1) = \left(C_{\mathrm{K}}^T W_1 C_{\mathrm{K}}\right)^{-1} C_{\mathrm{K}}^T W_1 \mathbf{Y}_1 = \left(\frac{1}{n}C_{\mathrm{K}}^T W_1 C_{\mathrm{K}}\right)^{-1}\frac{1}{n}C_{\mathrm{K}}^T W_1 \mathbf{Y}_1, \tag{6}$$
in which $\mathbf{T}_i = (T_{i1},\ldots,T_{id_1})^T$, $C_{\mathrm{K}} = (\mathbf{T}_1,\ldots,\mathbf{T}_n)^T$, $W_1 = \mathrm{diag}\{K_h(X_{11}-x_1),\ldots,K_h(X_{n1}-x_1)\}$, and $K_h(u) = K(u/h)/h$ for a kernel function $K$ and bandwidth $h$ that satisfy Assumption 7 below.
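The matrix form (6) translates line by line into code; a minimal numpy sketch, assuming the oracle pseudo-responses $Y_{i,1}$ are available and using an Epanechnikov kernel (our illustrative choice of a kernel satisfying Assumption 7):

```python
import numpy as np

def kernel_oracle(x1, Y1, X1, T, h):
    """Oracle kernel smoother (6): (C_K' W_1 C_K)^{-1} C_K' W_1 Y_1, C_K = T."""
    n = len(Y1)
    u = (X1 - x1) / h
    k = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0) / h   # K_h
    A = T.T @ (k[:, None] * T) / n          # (1/n) C_K' W_1 C_K
    b = T.T @ (k * Y1) / n                  # (1/n) C_K' W_1 Y_1
    return np.linalg.solve(A, b)            # the d1-vector m~_{K,1,.}(x1)
```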


Assumption 7. The function $K$ is a symmetric probability density function supported on $[-1,1]$, and $K\in\mathrm{Lip}([-1,1],C_K)$ for some $C_K>0$, while the bandwidth $h = h_{1,n} > 0$ satisfies $h\sim n^{-1/5}$.

Likewise, one can define the local linear oracle smoother of $m_{1,\cdot}(x_1)$ as
$$\tilde{m}_{\mathrm{LL},1,\cdot}(x_1) = \left(I_{d_1\times d_1}, 0_{d_1\times d_1}\right)\left(\frac{1}{n}C_{\mathrm{LL},1}^T W_1 C_{\mathrm{LL},1}\right)^{-1}\frac{1}{n}C_{\mathrm{LL},1}^T W_1 \mathbf{Y}_1, \tag{7}$$
in which
$$C_{\mathrm{LL},1} = \begin{pmatrix} \mathbf{T}_1^T & \mathbf{T}_1^T(X_{11}-x_1) \\ \vdots & \vdots \\ \mathbf{T}_n^T & \mathbf{T}_n^T(X_{n1}-x_1) \end{pmatrix}.$$

In this paper, denote $\mu_2(K) = \int u^2K(u)\,du$, $\|K\|_2^2 = \int K(u)^2\,du$, and $\mathbf{Q}_1(x_1)$ as in (4), and define the following bias and variance coefficients:
$$b_{\mathrm{LL},l,l',1}(x_1) = \frac{1}{2}\mu_2(K)\,m_{1l}''(x_1)\,f_1(x_1)\,q_{ll',1}(x_1),$$
$$b_{\mathrm{K},l,l',1}(x_1) = \frac{1}{2}\mu_2(K)\left[2m_{1l}'(x_1)\frac{\partial}{\partial x_1}\left\{f_1(x_1)\,q_{ll',1}(x_1)\right\} + m_{1l}''(x_1)\,f_1(x_1)\,q_{ll',1}(x_1)\right],$$
$$\boldsymbol{\Sigma}_1(x_1) = \|K\|_2^2\,f_1(x_1)\,E\left\{\mathbf{T}\mathbf{T}^T\sigma^2(\mathbf{X},\mathbf{T})\,|\,X_1=x_1\right\},$$
$$\left\{v_{l,l',1}(x_1)\right\}_{l,l'=1}^{d_1} = \mathbf{Q}_1(x_1)^{-1}\,\boldsymbol{\Sigma}_1(x_1)\,\mathbf{Q}_1(x_1)^{-1}. \tag{8}$$
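For completeness, a matching numpy sketch of the local linear oracle smoother (7), again an illustrative implementation rather than the authors' code: the design is augmented with $\mathbf{T}_i(X_{i1}-x_1)$ and the first $d_1$ coordinates of the solution are retained.

```python
import numpy as np

def local_linear_oracle(x1, Y1, X1, T, h):
    """Oracle local linear smoother (7); returns the first d1 coordinates."""
    n, d1 = T.shape
    u = (X1 - x1) / h
    k = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0) / h   # K_h
    C = np.hstack([T, T * (X1 - x1)[:, None]])                       # C_{LL,1}
    A = C.T @ (k[:, None] * C) / n
    b = C.T @ (k * Y1) / n
    return np.linalg.solve(A, b)[:d1]   # (I_{d1}, 0_{d1}) applied to solution
```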

THEOREM 1. Under Assumptions 1-5 and 7, for any $x_1\in[h,1-h]$, as $n\to\infty$ the oracle local linear smoother $\tilde{m}_{\mathrm{LL},1,\cdot}(x_1)$ given in (7) satisfies
$$\sqrt{nh}\,\Bigg[\tilde{m}_{\mathrm{LL},1,\cdot}(x_1) - m_{1,\cdot}(x_1) - \Big\{\sum_{l'=1}^{d_1} b_{\mathrm{LL},l,l',1}(x_1)\Big\}_{l=1}^{d_1} h^2\Bigg] \xrightarrow{d} N\Big(0,\ \{v_{l,l',1}(x_1)\}_{l,l'=1}^{d_1}\Big).$$
With Assumption 6 in addition, the oracle kernel smoother $\tilde{m}_{\mathrm{K},1,\cdot}(x_1)$ in (6) satisfies
$$\sqrt{nh}\,\Bigg[\tilde{m}_{\mathrm{K},1,\cdot}(x_1) - m_{1,\cdot}(x_1) - \Big\{\sum_{l'=1}^{d_1} b_{\mathrm{K},l,l',1}(x_1)\Big\}_{l=1}^{d_1} h^2\Bigg] \xrightarrow{d} N\Big(0,\ \{v_{l,l',1}(x_1)\}_{l,l'=1}^{d_1}\Big).$$
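Theorem 1 supports pointwise inference. As a hedged illustration, if an undersmoothed bandwidth renders the $h^2$ bias term negligible and `v_hat_ll` is some plug-in estimate of the asymptotic variance coefficient $v_{l,l,1}(x_1)$ (its construction is not specified here), an approximate 95% interval for $m_{1l}(x_1)$ would be:

```python
import numpy as np

def pointwise_ci(m_tilde_l, v_hat_ll, n, h, z=1.96):
    # From Theorem 1, sqrt(nh){m~ - m - bias} -> N(0, v); with the bias
    # undersmoothed away, m~ is approximately N(m, v/(nh)).
    half = z * np.sqrt(v_hat_ll / (n * h))
    return m_tilde_l - half, m_tilde_l + half
```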


THEOREM 2. Under Assumptions 1-5 and 7, as $n\to\infty$, the oracle local linear smoother $\tilde{m}_{\mathrm{LL},1,\cdot}(x_1)$ given in (7) satisfies
$$\sup_{x_1\in[h,1-h]}\left|\tilde{m}_{\mathrm{LL},1,\cdot}(x_1) - m_{1,\cdot}(x_1)\right| = O_p\!\left(\sqrt{\log n/(nh)}\right).$$
With Assumption 6 in addition, the oracle kernel smoother $\tilde{m}_{\mathrm{K},1,\cdot}(x_1)$ in (6) satisfies
$$\sup_{x_1\in[h,1-h]}\left|\tilde{m}_{\mathrm{K},1,\cdot}(x_1) - m_{1,\cdot}(x_1)\right| = O_p\!\left(\sqrt{\log n/(nh)}\right).$$

Remark 1. The above theorems hold for $\tilde{m}_{\mathrm{LL},\alpha,\cdot}(x_\alpha)$ and $\tilde{m}_{\mathrm{K},\alpha,\cdot}(x_\alpha)$, constructed similarly to $\tilde{m}_{\mathrm{LL},1,\cdot}(x_1)$ and $\tilde{m}_{\mathrm{K},1,\cdot}(x_1)$ for any $\alpha = 2,\ldots,d_2$, that is,
$$\tilde{m}_{\mathrm{LL},\alpha,\cdot}(x_\alpha) = \left(I_{d_1\times d_1}, 0_{d_1\times d_1}\right)\left(\frac{1}{n}C_{\mathrm{LL},\alpha}^T W_\alpha C_{\mathrm{LL},\alpha}\right)^{-1}\frac{1}{n}C_{\mathrm{LL},\alpha}^T W_\alpha \mathbf{Y}_\alpha,$$
$$\tilde{m}_{\mathrm{K},\alpha,\cdot}(x_\alpha) = \left(\frac{1}{n}C_{\mathrm{K}}^T W_\alpha C_{\mathrm{K}}\right)^{-1}\frac{1}{n}C_{\mathrm{K}}^T W_\alpha \mathbf{Y}_\alpha,$$
except that in Assumption 4 one has to replace "$m_{1l}\in C^2[0,1],\ \forall 1\le l\le d_1$" with "$m_{\alpha l}\in C^2[0,1],\ \forall 1\le l\le d_1$," and in Assumption 6, $f_1(x_1)$ and $\mathbf{Q}_1(x_1)$ have to be replaced with $f_\alpha(x_\alpha)$ and $\mathbf{Q}_\alpha(x_\alpha)$. The proofs of Theorems 1 and 2 can be found in Liu and Yang (2008, Sect. A.4).

The same oracle idea applies to the constants as well. Define the would-be estimators of the constants $(m_{0l})_{1\le l\le d_1}^T$ as the following least squares solution: $\tilde{\mathbf{m}}_0 = (\tilde{m}_{0l})_{1\le l\le d_1}^T$