SPLINE SINGLE-INDEX PREDICTION MODEL
arXiv:0704.0302v2 [math.ST] 6 Apr 2007
Li Wang and Lijian Yang University of Georgia and Michigan State University
Abstract: For the past two decades, the single-index model, a special case of projection pursuit regression, has proven to be an efficient way of coping with the high-dimensional problem in nonparametric regression. In this paper, based on a weakly dependent sample, we investigate the single-index prediction (SIP) model, which is robust against deviation from the single-index model. The single-index is identified by the best approximation to the multivariate prediction function of the response variable, regardless of whether the prediction function is a genuine single-index function. A polynomial spline estimator is proposed for the single-index prediction coefficients, and is shown to be root-n consistent and asymptotically normal. An iterative optimization routine is used which is sufficiently fast for the user to analyze large data of high dimension within seconds. Simulation experiments have provided strong evidence that corroborates the asymptotic theory. Application of the proposed procedure to the river flow data of Iceland has yielded superior out-of-sample rolling forecasts. Key words and phrases: B-spline, geometric mixing, knots, nonparametric regression, root-n rate, strong consistency.
1. Introduction

Let $\{X_i^T, Y_i\}_{i=1}^n = \{X_{i,1}, \dots, X_{i,d}, Y_i\}_{i=1}^n$ be a length-$n$ realization of a $(d+1)$-dimensional strictly stationary process following the heteroscedastic model
$$Y_i = m(X_i) + \sigma(X_i)\,\varepsilon_i, \qquad m(X_i) = E(Y_i \mid X_i), \qquad (1.1)$$
in which $E(\varepsilon_i \mid X_i) = 0$, $E(\varepsilon_i^2 \mid X_i) = 1$, $1 \le i \le n$. The $d$-variate functions $m, \sigma$ are the unknown mean and standard deviation of the response $Y_i$ conditional on the predictor vector $X_i$, often estimated nonparametrically. In what follows, we let $X^T, Y, \varepsilon$ have the stationary distribution of $X_i^T, Y_i, \varepsilon_i$. When the dimension of $X$ is high, one unavoidable issue is the "curse of dimensionality", which refers to the poor convergence rate of nonparametric estimation of a general multivariate function. Much effort has been devoted to circumventing this difficulty. In the words of Xia, Tong, Li and Zhu (2002), there are essentially two approaches: function approximation and dimension reduction. A favorite function approximation technique is the generalized additive model advocated by Hastie and Tibshirani (1990),

Address for correspondence: Lijian Yang, Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA. E-mail:
[email protected]
see also, for example, Mammen, Linton and Nielsen (1999), Huang and Yang (2004), Xue and Yang (2006 a, b), Wang and Yang (2007). An attractive dimension reduction method is the single-index model, similar to the first step of projection pursuit regression; see Friedman and Stuetzle (1981), Hall (1989), Huber (1985), Chen (1991). The basic appeal of the single-index model is its simplicity: the $d$-variate function $m(x) = m(x_1, \dots, x_d)$ is expressed as a univariate function of $x^T\theta_0 = \sum_{p=1}^d x_p \theta_{0,p}$. Over the last two decades, many authors have devised various intelligent estimators of the single-index coefficient vector $\theta_0 = (\theta_{0,1}, \dots, \theta_{0,d})^T$; for instance, Powell, Stock and Stoker (1989), Härdle and Stoker (1989), Ichimura (1993), Klein and Spady (1993), Härdle, Hall and Ichimura (1993), Horowitz and Härdle (1996), Carroll, Fan, Gijbels and Wand (1997), Xia and Li (1999), Hristache, Juditski and Spokoiny (2001). More recently, Xia, Tong, Li and Zhu (2002) proposed the minimum average variance estimation (MAVE) for several index vectors. All the aforementioned methods assume that the $d$-variate regression function $m(x)$ is exactly a univariate function of some $x^T\theta_0$, and obtain a root-$n$ consistent estimator of $\theta_0$. If this model is misspecified ($m$ is not a genuine single-index function), however, a goodness-of-fit test becomes necessary and the estimation of $\theta_0$ must be redefined; see Xia, Li, Tong and Zhang (2004). In this paper, instead of presuming that the underlying true function $m$ is a single-index function, we estimate a univariate function $g$ that optimally approximates the multivariate function $m$ in the sense that
$$g(\nu) = E\left\{m(X) \mid X^T\theta_0 = \nu\right\}, \qquad (1.2)$$
where the unknown parameter $\theta_0$ is called the SIP coefficient, used for simple interpretation once estimated; $X^T\theta_0$ is the latent SIP variable; and $g$ is a smooth but unknown function used for further data summary, called the link prediction function. Our method therefore is clearly interpretable regardless of the goodness-of-fit of the single-index model, making it much more relevant in applications. We propose estimators of $\theta_0$ and $g$ based on a weakly dependent sample, which covers many existing nonparametric time series models, that are (i) computationally expedient and (ii) theoretically reliable. Estimation of both $\theta_0$ and $g$ has been done via kernel smoothing techniques in the existing literature, while we use polynomial spline smoothing. The greatest advantages of spline smoothing, as pointed out in Huang and Yang (2004) and Xue and Yang (2006 b), are its simplicity and fast computation. Our proposed procedure involves two stages: estimation of $\theta_0$ by some $\sqrt{n}$-consistent $\hat\theta$, minimizing an empirical version of the mean squared error $R(\theta) = E\{Y - E(Y \mid X^T\theta)\}^2$; and spline smoothing of $Y$ on $X^T\hat\theta$ to obtain a cubic spline estimator $\hat g$ of $g$. The best single-index approximation to $m(x)$ is then $\hat m(x) = \hat g(x^T\hat\theta)$. Under a geometrically strong mixing condition, strong consistency and $\sqrt{n}$-rate asymptotic
normality of the estimator $\hat\theta$ of the SIP coefficient $\theta_0$ in (1.2) are obtained. Proposition 2.2 is the key to understanding the efficiency of the proposed estimator: it shows that the derivatives of the risk function up to order 2 are uniformly almost surely approximated by their empirical versions. Practical performance of the SIP estimators is examined via Monte Carlo examples. The estimator of the SIP coefficient performs very well for data of both moderate and high dimension $d$, and for sample sizes $n$ from small to large; see Tables 1 and 2, Figures 1 and 2. By taking advantage of spline smoothing and the iterative optimization routines, one reduces the computational burden immensely for massive data sets. Table 2 reports the computing time of one simulation example on an ordinary PC, which shows that for massive data sets the SIP method is much faster than the MAVE method. For instance, the SIP estimation of a 200-dimensional $\theta_0$ from a data set of size 1000 takes on average a mere 2.84 seconds, while the MAVE method needs 2432.56 seconds on average to obtain a comparable estimate. Hence, on account of criteria (i) and (ii), our method is indeed appealing. Applying the proposed SIP procedure to the river flow data of Iceland, we have obtained superior forecasts, based on a 9-dimensional index selected by BIC; see Figure 5.

The rest of the paper is organized as follows. Section 2 gives details of the model specification, the proposed methods of estimation, and the main results. Section 3 describes the actual procedure used to implement the estimation method. Section 4 reports our findings in an extensive simulation study. The proposed SIP model and estimation procedure are applied in Section 5 to the river flow data of Iceland. Most of the technical proofs are contained in the Appendix.

2. The Method and Main Results

2.1. Identifiability and definition of the index coefficient

It is obvious that without constraints, the SIP coefficient vector $\theta_0 = (\theta_{0,1}, \dots, \theta_{0,d})^T$ is identified only up to a constant factor. Typically, one requires $\|\theta_0\| = 1$, which entails
that at least one of the coordinates θ0,1 , ..., θ0,d is nonzero. One could assume without loss of
generality that $\theta_{0,d} > 0$, and the candidate $\theta_0$ would then belong to the upper unit hemisphere $S_+^{d-1} = \left\{(\theta_1, \dots, \theta_d) \mid \sum_{p=1}^d \theta_p^2 = 1,\ \theta_d > 0\right\}$. For a fixed $\theta = (\theta_1, \dots, \theta_d)^T$, denote $X_\theta = X^T\theta$, $X_{\theta,i} = X_i^T\theta$, $1 \le i \le n$. Let
$$m_\theta(X_\theta) = E(Y \mid X_\theta) = E\{m(X) \mid X_\theta\}. \qquad (2.1)$$
Define the risk function of $\theta$ as
$$R(\theta) = E\{Y - m_\theta(X_\theta)\}^2 = E\{m(X) - m_\theta(X_\theta)\}^2 + E\sigma^2(X), \qquad (2.2)$$
which is uniquely minimized at $\theta_0 \in S_+^{d-1}$, i.e.,
$$\theta_0 = \arg\min_{\theta \in S_+^{d-1}} R(\theta).$$
Remark 2.1. Note that $S_+^{d-1}$ is not a compact set, so we introduce a cap-shaped subset of $S_+^{d-1}$:
$$S_c^{d-1} = \left\{(\theta_1, \dots, \theta_d) \,\middle|\, \sum_{p=1}^d \theta_p^2 = 1,\ \theta_d \ge \sqrt{1 - c^2}\right\}, \qquad c \in (0, 1).$$
Clearly, for an appropriate choice of $c$, $\theta_0 \in S_c^{d-1}$, which we assume in the rest of the paper.
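To make the identifiability constraint concrete, the following sketch normalizes a candidate coefficient vector onto the upper unit hemisphere and checks membership in the cap-shaped subset $S_c^{d-1}$ (the helper names and the value of $c$ are ours, chosen for illustration only):

```python
import numpy as np

def to_upper_hemisphere(theta):
    """Normalize a candidate coefficient vector onto the unit sphere,
    flipping the sign so the last coordinate is nonnegative (identifiability)."""
    theta = np.asarray(theta, dtype=float)
    theta = theta / np.linalg.norm(theta)
    if theta[-1] < 0:
        theta = -theta
    return theta

def in_cap_subset(theta, c=0.9):
    """Check membership in the cap-shaped subset S_c^{d-1}:
    unit norm and theta_d >= sqrt(1 - c^2)."""
    return bool(np.isclose(np.linalg.norm(theta), 1.0)
                and theta[-1] >= np.sqrt(1.0 - c**2))

theta = to_upper_hemisphere([1.0, 1.0, -2.0])
```

The sign flip is harmless because $\theta_0$ is identified only up to a constant factor; restricting to the upper hemisphere simply picks one representative.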
Denote $\theta_{-d} = (\theta_1, \dots, \theta_{d-1})^T$; since for fixed $\theta \in S_+^{d-1}$ the risk function $R(\theta)$ depends only on the first $d-1$ entries of $\theta$, $R(\theta)$ is a function of $\theta_{-d}$:
$$R^*(\theta_{-d}) = R\left(\theta_1, \theta_2, \dots, \theta_{d-1}, \sqrt{1 - \|\theta_{-d}\|_2^2}\right),$$
with well-defined score vector and Hessian matrix
$$S^*(\theta_{-d}) = \frac{\partial}{\partial \theta_{-d}} R^*(\theta_{-d}), \qquad H^*(\theta_{-d}) = \frac{\partial^2}{\partial \theta_{-d}\, \partial \theta_{-d}^T} R^*(\theta_{-d}). \qquad (2.3)$$
Assumption A1: The Hessian matrix $H^*(\theta_{0,-d})$ is positive definite, and the risk function $R^*$ is locally convex at $\theta_{0,-d}$, i.e., for any $\varepsilon > 0$ there exists $\delta > 0$ such that $R^*(\theta_{-d}) - R^*(\theta_{0,-d}) < \delta$ implies $\|\theta_{-d} - \theta_{0,-d}\|_2 < \varepsilon$.
2.2. Variable transformation

Throughout this paper, we denote by $B_a^d = \{x \in \mathbb{R}^d \mid \|x\| \le a\}$ the $d$-dimensional ball with radius $a$ and center $0$, and by
$$C^{(k)}(B_a^d) = \left\{m \mid \text{the } k\text{th-order partial derivatives of } m \text{ are continuous on } B_a^d\right\}$$
the space of $k$-th order smooth functions.

Assumption A2: The density function $f(x)$ of $X$ satisfies $f \in C^{(4)}(B_a^d)$, and there are constants $0 < c_f \le C_f$ such that
$$c_f / \mathrm{Vol}_d(B_a^d) \le f(x) \le C_f / \mathrm{Vol}_d(B_a^d) \ \text{ for } x \in B_a^d, \qquad f(x) \equiv 0 \ \text{ for } x \notin B_a^d.$$
For a fixed $\theta$, define the transformed variables of the SIP variable $X_\theta$:
$$U_\theta = F_d(X_\theta), \qquad U_{\theta,i} = F_d(X_{\theta,i}), \quad 1 \le i \le n, \qquad (2.4)$$
in which $F_d$ is the rescaled centered Beta$\{(d+1)/2, (d+1)/2\}$ cumulative distribution function, i.e.,
$$F_d(\nu) = \int_{-1}^{\nu/a} \frac{\Gamma(d+1)}{\Gamma\{(d+1)/2\}^2\, 2^d} \left(1 - t^2\right)^{(d-1)/2} dt, \qquad \nu \in [-a, a]. \qquad (2.5)$$
Remark 2.2. For any fixed $\theta$, the transformed variable $U_\theta$ in (2.4) has a quasi-uniform $[0,1]$ distribution. Let $f_\theta(u)$ be the probability density function of $U_\theta$; then for any $u \in [0,1]$,
$$f_\theta(u) = \left\{F_d'(v)\right\}^{-1} f_{X_\theta}(v), \qquad v = F_d^{-1}(u),$$
in which $f_{X_\theta}(v) = \lim_{\Delta\nu \to 0} P(\nu \le X_\theta \le \nu + \Delta\nu)/\Delta\nu$. Noting that $x_\theta$ is exactly the projection of $x$ on $\theta$, let $D_\nu = \{x \mid \nu \le x_\theta \le \nu + \Delta\nu\} \cap B_a^d$; then one has
$$P(\nu \le X_\theta \le \nu + \Delta\nu) = P(X \in D_\nu) = \int_{D_\nu} f(x)\, dx.$$
According to Assumption A2,
$$\frac{c_f\, \mathrm{Vol}_d(D_\nu)}{\mathrm{Vol}_d(B_a^d)} \le P(\nu \le X_\theta \le \nu + \Delta\nu) \le \frac{C_f\, \mathrm{Vol}_d(D_\nu)}{\mathrm{Vol}_d(B_a^d)}.$$
On the other hand,
$$\mathrm{Vol}_d(D_\nu) = \mathrm{Vol}_{d-1}(J_\nu)\,\Delta\nu + o(\Delta\nu),$$
where $J_\nu = \{x \mid x_\theta = \nu\} \cap B_a^d$. Note that the volume of $B_a^d$ is $\pi^{d/2} a^d / \Gamma(d/2+1)$ and $\mathrm{Vol}_{d-1}(J_\nu) = \pi^{(d-1)/2}\left(a^2 - \nu^2\right)^{(d-1)/2} \big/ \Gamma\{(d+1)/2\}$, thus
$$\frac{\mathrm{Vol}_{d-1}(J_\nu)}{\mathrm{Vol}_d(B_a^d)} = \frac{1}{a}\, \frac{\Gamma(d+1)}{\Gamma\{(d+1)/2\}^2\, 2^d} \left(1 - \frac{\nu^2}{a^2}\right)^{(d-1)/2}.$$
Therefore $0 < c_f \le f_\theta(u) \le C_f < \infty$ for any fixed $\theta$ and $u \in [0, 1]$.
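Since (2.5) is the CDF of a centered and rescaled Beta$\{(d+1)/2, (d+1)/2\}$ variable, the transformation can be computed directly from a standard Beta CDF. A minimal sketch (the function name `F_d` is ours, not from the paper):

```python
import numpy as np
from scipy.stats import beta

def F_d(nu, a, d):
    """Rescaled centered Beta{(d+1)/2,(d+1)/2} CDF of (2.5):
    if T ~ Beta((d+1)/2, (d+1)/2) on [0,1], then S = 2T - 1 has density
    proportional to (1 - s^2)^{(d-1)/2} on [-1,1], and F_d(nu) = P(S <= nu/a)."""
    t = (np.asarray(nu, dtype=float) / a + 1.0) / 2.0  # map [-a, a] -> [0, 1]
    return beta.cdf(t, (d + 1) / 2, (d + 1) / 2)
```

By symmetry of the Beta density, $F_d(0) = 1/2$, and $F_d$ maps $[-a, a]$ monotonically onto $[0, 1]$, which is exactly the quasi-uniformization described in Remark 2.2.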
In terms of the transformed SIP variable $U_\theta$ in (2.4), we can rewrite the regression function $m_\theta$ in (2.1) for fixed $\theta$ as
$$\gamma_\theta(U_\theta) = E\{m(X) \mid U_\theta\} = E\{m(X) \mid X_\theta\} = m_\theta(X_\theta); \qquad (2.6)$$
then the risk function $R(\theta)$ in (2.2) can be expressed as
$$R(\theta) = E\{Y - \gamma_\theta(U_\theta)\}^2 = E\{m(X) - \gamma_\theta(U_\theta)\}^2 + E\sigma^2(X). \qquad (2.7)$$

2.3. Estimation Method
Estimation of both $\theta_0$ and $g$ requires a degree of statistical smoothing, and all estimation here is carried out via cubic splines. In the following, we define the estimator $\hat\theta$ of $\theta_0$ and the estimator $\hat g$ of $g$. To introduce the space of splines, we pre-select an integer $n^{1/6} \ll N = N_n \ll n^{1/5}(\log n)^{-2/5}$; see Assumption A6 below. Divide $[0,1]$ into $(N+1)$ subintervals $J_j = [t_j, t_{j+1})$, $j = 0, \dots, N-1$, $J_N = [t_N, 1]$, where $T := \{t_j\}_{j=1}^N$ is a sequence of equally spaced points, called interior knots, given as
$$t_{1-k} = \cdots = t_{-1} = t_0 = 0 < t_1 < \cdots < t_N < 1 = t_{N+1} = \cdots = t_{N+k},$$
in which $t_j = jh$, $j = 0, 1, \dots, N+1$, and $h = 1/(N+1)$ is the distance between neighboring knots. The $j$-th B-spline of order $k$ for the knot sequence $T$, denoted $B_{j,k}$, is defined recursively as in de Boor (2001). Denote by $\Gamma^{(k-2)} = \Gamma^{(k-2)}[0,1]$ the space of all $C^{(k-2)}[0,1]$ functions that are polynomials of degree $k-1$ on each subinterval. For fixed $\theta$, the cubic spline estimator $\hat\gamma_\theta$ of $\gamma_\theta$ and the related estimator $\hat m_\theta$ of $m_\theta$ are defined as
$$\hat\gamma_\theta(\cdot) = \arg\min_{\gamma(\cdot) \in \Gamma^{(2)}[0,1]} \sum_{i=1}^n \{Y_i - \gamma(U_{\theta,i})\}^2, \qquad \hat m_\theta(\nu) = \hat\gamma_\theta\{F_d(\nu)\}. \qquad (2.8)$$
Define the empirical risk function of $\theta$ as
$$\hat R(\theta) = n^{-1} \sum_{i=1}^n \{Y_i - \hat\gamma_\theta(U_{\theta,i})\}^2 = n^{-1} \sum_{i=1}^n \{Y_i - \hat m_\theta(X_{\theta,i})\}^2; \qquad (2.9)$$
then the spline estimator of the SIP coefficient $\theta_0$ is defined as
$$\hat\theta = \arg\min_{\theta \in S_c^{d-1}} \hat R(\theta),$$
and the cubic spline estimator of $g$ is $\hat m_\theta$ with $\theta$ replaced by $\hat\theta$, i.e.,
$$\hat g(\nu) = \left[\arg\min_{\gamma(\cdot) \in \Gamma^{(2)}[0,1]} \sum_{i=1}^n \left\{Y_i - \gamma(U_{\hat\theta,i})\right\}^2\right]\{F_d(\nu)\}. \qquad (2.10)$$
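The inner spline fit in (2.8) is an ordinary least squares problem in a B-spline basis. A minimal sketch using SciPy (the helper names and default knot count are ours, chosen for illustration):

```python
import numpy as np
from scipy.interpolate import BSpline

def spline_basis(u, n_knots, k=3):
    """Cubic (k=3) B-spline design matrix on [0,1], with n_knots equally
    spaced interior knots and coincident boundary knots, as in Section 2.3."""
    interior = np.linspace(0.0, 1.0, n_knots + 2)[1:-1]
    t = np.concatenate([[0.0] * (k + 1), interior, [1.0] * (k + 1)])
    n_basis = len(t) - k - 1
    B = np.empty((len(u), n_basis))
    for j in range(n_basis):
        c = np.zeros(n_basis)
        c[j] = 1.0
        B[:, j] = BSpline(t, c, k, extrapolate=False)(np.clip(u, 0.0, 1.0))
    return np.nan_to_num(B)

def fit_gamma(u, y, n_knots=5):
    """Least-squares cubic spline fit of y on u in [0,1], mimicking (2.8);
    returns a callable estimate of the link function."""
    B = spline_basis(u, n_knots)
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return lambda v: spline_basis(np.atleast_1d(v), n_knots) @ coef
```

With the transformed variable $U_{\theta,i}$ as `u` and the responses as `y`, `fit_gamma` returns the fitted $\hat\gamma_\theta$; its residual sum of squares, divided by $n$, is the empirical risk (2.9) at that $\theta$.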
2.4. Asymptotic results

Before giving the main theorems, we state some additional assumptions.

Assumption A3: The regression function $m \in C^{(4)}(B_a^d)$ for some $a > 0$.

Assumption A4: The noise $\varepsilon$ satisfies $E(\varepsilon \mid X) = 0$, $E(\varepsilon^2 \mid X) = 1$, and there exists a positive constant $M$ such that $\sup_{x \in B_a^d} E(|\varepsilon|^3 \mid X = x) < M$. The standard deviation function $\sigma(x)$ is continuous on $B_a^d$, with
$$0 < c_\sigma \le \inf_{x \in B_a^d} \sigma(x) \le \sup_{x \in B_a^d} \sigma(x) \le C_\sigma < \infty.$$
Assumption A5: There exist positive constants $K_0$ and $\lambda_0$ such that $\alpha(n) \le K_0 e^{-\lambda_0 n}$ holds for all $n$, with the $\alpha$-mixing coefficient for $Z_i = (X_i^T, \varepsilon_i)$, $i = 1, \dots, n$, defined as
$$\alpha(k) = \sup_{B \in \sigma\{Z_s, s \le t\},\, C \in \sigma\{Z_s, s \ge t+k\}} |P(B \cap C) - P(B)P(C)|, \qquad k \ge 1.$$
Assumption A6: The number of interior knots $N$ satisfies $n^{1/6} \ll N \ll n^{1/5}(\log n)^{-2/5}$.

Remark 2.3. Assumptions A3 and A4 are typical in the nonparametric smoothing literature; see, for instance, Härdle (1990), Fan and Gijbels (1996), Xia, Tong, Li and Zhu (2002). By the result of Pham (1986), a geometrically ergodic time series is a strongly mixing sequence. Therefore Assumption A5 is suitable for (1.1) as a time series model under the aforementioned assumptions.

We now state our main results in the next two theorems.

Theorem 1. Under Assumptions A1-A6, one has
$$\hat\theta_{-d} \longrightarrow \theta_{0,-d}, \quad a.s. \qquad (2.11)$$

Proof. Denote by $(\Omega, \mathcal{F}, P)$ the probability space on which all $\{X_i^T, Y_i\}_{i=1}^\infty$ are defined. By Proposition 2.2, given at the end of this section,
$$\sup_{\|\theta_{-d}\|_2 \le \sqrt{1-c^2}} \left|\hat R^*(\theta_{-d}) - R^*(\theta_{-d})\right| \longrightarrow 0, \quad a.s. \qquad (2.12)$$
So for any $\delta > 0$ and $\omega \in \Omega$, there exists an integer $n_0(\omega)$ such that when $n > n_0(\omega)$, $|\hat R^*(\theta_{0,-d}, \omega) - R^*(\theta_{0,-d})| < \delta/2$. Note that $\hat\theta_{-d} = \hat\theta_{-d}(\omega)$ is the minimizer of $\hat R^*(\theta_{-d}, \omega)$, so $\hat R^*\{\hat\theta_{-d}(\omega), \omega\} - R^*(\theta_{0,-d}) < \delta/2$. Using (2.12), there exists $n_1(\omega)$ such that when $n > n_1(\omega)$, $R^*\{\hat\theta_{-d}(\omega)\} - \hat R^*\{\hat\theta_{-d}(\omega), \omega\} < \delta/2$. Thus, when $n > \max\{n_0(\omega), n_1(\omega)\}$,
$$R^*\{\hat\theta_{-d}(\omega)\} - R^*(\theta_{0,-d}) < \delta/2 + \delta/2 = \delta.$$
According to Assumption A1, $R^*$ is locally convex at $\theta_{0,-d}$, so for any $\varepsilon > 0$ and any $\omega$, if $R^*\{\hat\theta_{-d}(\omega)\} - R^*(\theta_{0,-d}) < \delta$, then $\|\hat\theta_{-d}(\omega) - \theta_{0,-d}\| < \varepsilon$ for $n$ large enough, which implies the strong consistency.
Theorem 2. Under Assumptions A1-A6, one has
$$\sqrt{n}\left(\hat\theta_{-d} - \theta_{0,-d}\right) \stackrel{d}{\longrightarrow} N\{0, \Sigma(\theta_0)\},$$
where $\Sigma(\theta_0) = \{H^*(\theta_{0,-d})\}^{-1}\, \Psi(\theta_0)\, \{H^*(\theta_{0,-d})\}^{-1}$, with $H^*(\theta_{0,-d}) = \{l_{pq}\}_{p,q=1}^{d-1}$ and $\Psi(\theta_0) = \{\psi_{pq}\}_{p,q=1}^{d-1}$, where
$$l_{p,q} = -2E\left[\{\dot\gamma_p\dot\gamma_q + \gamma_{\theta_0}\ddot\gamma_{p,q}\}(U_{\theta_0})\right] + 2\theta_{0,q}\theta_{0,d}^{-1} E\left[\{\dot\gamma_p\dot\gamma_d + \gamma_{\theta_0}\ddot\gamma_{p,d}\}(U_{\theta_0})\right]$$
$$\qquad + 2\theta_{0,d}^{-3} E\left[(\gamma_{\theta_0}\dot\gamma_d)(U_{\theta_0})\right]\left\{\theta_{0,d}^2 + \theta_{0,p}^2 I_{\{p=q\}} + \theta_{0,p}\theta_{0,q} I_{\{p \ne q\}}\right\}$$
$$\qquad + 2\theta_{0,p}\theta_{0,d}^{-1} E\left[\{\dot\gamma_q\dot\gamma_d + \gamma_{\theta_0}\ddot\gamma_{q,d}\}(U_{\theta_0})\right] - 2\theta_{0,p}\theta_{0,q}\theta_{0,d}^{-2} E\left[\{\dot\gamma_d^2 + \gamma_{\theta_0}\ddot\gamma_{d,d}\}(U_{\theta_0})\right],$$
$$\psi_{pq} = 4E\left[\left\{\dot\gamma_p - \theta_{0,p}\theta_{0,d}^{-1}\dot\gamma_d\right\}\left\{\dot\gamma_q - \theta_{0,q}\theta_{0,d}^{-1}\dot\gamma_d\right\}(U_{\theta_0})\, \{\gamma_{\theta_0}(U_{\theta_0}) - Y\}^2\right],$$
in which $\dot\gamma_p$ and $\ddot\gamma_{p,q}$ are the values of $\frac{\partial}{\partial\theta_p}\gamma_\theta$ and $\frac{\partial^2}{\partial\theta_p\,\partial\theta_q}\gamma_\theta$ taken at $\theta = \theta_0$, for any $p, q = 1, 2, \dots, d-1$, and $\gamma_\theta$ is given in (2.6).

Remark 2.4. Consider the generalized linear model (GLM): $Y = g(X^T\theta_0) + \sigma(X)\varepsilon$, where $g$ is a known link function. Let $\tilde\theta$ be the nonlinear least squares estimator of $\theta_0$ in the GLM.
Theorem 2 shows that under Assumptions A1-A6, the asymptotic distribution of $\hat\theta_{-d}$ is the same as that of $\tilde\theta_{-d}$. This implies that our proposed SIP estimator $\hat\theta_{-d}$ is as efficient as if the true link function $g$ were known.
The next two propositions play an important role in our proof of the main results. Proposition 2.1 establishes the uniform convergence rate of the derivatives of $\hat\gamma_\theta$ up to order 2 to those of $\gamma_\theta$, uniformly in $\theta$. Proposition 2.2 shows that the derivatives of the risk function up to order 2 are uniformly almost surely approximated by their empirical versions.

Proposition 2.1. Under Assumptions A2-A6, with probability 1,
$$\sup_{\theta \in S_c^{d-1}} \sup_{u \in [0,1]} |\hat\gamma_\theta(u) - \gamma_\theta(u)| = O\left\{(nh)^{-1/2}\log n + h^4\right\}, \qquad (2.13)$$
$$\max_{1 \le p \le d} \sup_{\theta \in S_c^{d-1}} \max_{1 \le i \le n} \left|\frac{\partial}{\partial\theta_p}\{\hat\gamma_\theta(U_{\theta,i}) - \gamma_\theta(U_{\theta,i})\}\right| = O\left(\frac{\log n}{\sqrt{nh^3}} + h^3\right), \qquad (2.14)$$
$$\max_{1 \le p,q \le d} \sup_{\theta \in S_c^{d-1}} \max_{1 \le i \le n} \left|\frac{\partial^2}{\partial\theta_p\,\partial\theta_q}\{\hat\gamma_\theta(U_{\theta,i}) - \gamma_\theta(U_{\theta,i})\}\right| = O\left(\frac{\log n}{\sqrt{nh^5}} + h^2\right). \qquad (2.15)$$

Proposition 2.2. Under Assumptions A2-A6, one has for $k = 0, 1, 2$,
$$\sup_{\|\theta_{-d}\| \le \sqrt{1-c^2}} \left\|\frac{\partial^k}{\partial\theta_{-d}^k}\left\{\hat R^*(\theta_{-d}) - R^*(\theta_{-d})\right\}\right\| = o(1), \quad a.s.$$

Proofs of Theorem 2 and Propositions 2.1 and 2.2 are given in the Appendix.
3. Implementation

In this section, we describe the actual procedure used to implement the estimation of $\theta_0$ and $g$. We first introduce some new notation. For fixed $\theta$, write the B-spline matrix as $B_\theta = \{B_{j,4}(U_{\theta,i})\}_{i=1,\,j=-3}^{n,\,N}$ and
$$P_\theta = B_\theta\left(B_\theta^T B_\theta\right)^{-1} B_\theta^T$$
the projection matrix onto the cubic spline space $\Gamma_{n,\theta}^{(2)}$. For any $p = 1, \dots, d$, denote
$$\dot B_p = \frac{\partial}{\partial\theta_p} B_\theta, \qquad \dot P_p = \frac{\partial}{\partial\theta_p} P_\theta \qquad (3.1)$$
as the first-order partial derivatives of $B_\theta$ and $P_\theta$ with respect to $\theta_p$.

Let $\hat S^*(\theta_{-d})$ be the score vector of $\hat R^*(\theta_{-d})$, i.e.,
$$\hat S^*(\theta_{-d}) = \frac{\partial}{\partial\theta_{-d}} \hat R^*(\theta_{-d}). \qquad (3.2)$$
The next lemma provides the exact form of $\hat S^*(\theta_{-d})$.

Lemma 3.1. For the score vector of $\hat R^*(\theta_{-d})$ defined in (3.2), one has
$$\hat S^*(\theta_{-d}) = -n^{-1}\left\{Y^T \dot P_p Y - \theta_p\theta_d^{-1}\, Y^T \dot P_d Y\right\}_{p=1}^{d-1}, \qquad (3.3)$$
where for any $p = 1, 2, \dots, d$,
$$Y^T \dot P_p Y = 2 Y^T (I - P_\theta)\, \dot B_p \left(B_\theta^T B_\theta\right)^{-1} B_\theta^T Y, \qquad (3.4)$$
in which $\dot B_p = \left[\{B_{j,3}(U_{\theta,i}) - B_{j+1,3}(U_{\theta,i})\}\, \dot F_d(X_{\theta,i})\, h^{-1} X_{i,p}\right]_{i=1,\,j=-3}^{n,\,N}$, with
$$\dot F_d(x) = \frac{d}{dx} F_d(x) = \frac{\Gamma(d+1)}{a\,\Gamma\{(d+1)/2\}^2\, 2^d}\left(1 - \frac{x^2}{a^2}\right)^{\frac{d-1}{2}} I(|x| \le a).$$
Proof. For any $p = 1, 2, \dots, d$, the derivative formula for B-splines in de Boor (2001) implies
$$\dot B_p = \left\{\frac{\partial}{\partial\theta_p} B_{j,4}(U_{\theta,i})\right\}_{i=1,\,j=-3}^{n,\,N} = \left\{\frac{d}{du} B_{j,4}(U_{\theta,i})\, \frac{d}{d\theta_p} U_{\theta,i}\right\}_{i=1,\,j=-3}^{n,\,N}$$
$$= \left[3\left\{\frac{B_{j,3}(U_{\theta,i})}{t_{j+3} - t_j} - \frac{B_{j+1,3}(U_{\theta,i})}{t_{j+4} - t_{j+1}}\right\} \dot F_d(X_{\theta,i})\, X_{i,p}\right]_{i=1,\,j=-3}^{n,\,N}$$
$$= \left[\{B_{j,3}(U_{\theta,i}) - B_{j+1,3}(U_{\theta,i})\}\, \dot F_d(X_{\theta,i})\, h^{-1} X_{i,p}\right]_{i=1,\,j=-3}^{n,\,N}.$$
Next, note that
$$\dot P_p = \dot B_p \left(B_\theta^T B_\theta\right)^{-1} B_\theta^T + B_\theta\left\{\frac{\partial}{\partial\theta_p}\left(B_\theta^T B_\theta\right)^{-1}\right\} B_\theta^T + B_\theta\left(B_\theta^T B_\theta\right)^{-1} \dot B_p^T.$$
Since
$$0 \equiv \frac{\partial}{\partial\theta_p}\left\{\left(B_\theta^T B_\theta\right)^{-1}\left(B_\theta^T B_\theta\right)\right\} = \left\{\frac{\partial}{\partial\theta_p}\left(B_\theta^T B_\theta\right)^{-1}\right\}\left(B_\theta^T B_\theta\right) + \left(B_\theta^T B_\theta\right)^{-1}\frac{\partial}{\partial\theta_p}\left(B_\theta^T B_\theta\right)$$
and $\frac{\partial}{\partial\theta_p}\left(B_\theta^T B_\theta\right) = \dot B_p^T B_\theta + B_\theta^T \dot B_p$, thus
$$\frac{\partial}{\partial\theta_p}\left(B_\theta^T B_\theta\right)^{-1} = -\left(B_\theta^T B_\theta\right)^{-1}\left(\dot B_p^T B_\theta + B_\theta^T \dot B_p\right)\left(B_\theta^T B_\theta\right)^{-1}.$$
Hence
$$\dot P_p = (I - P_\theta)\, \dot B_p \left(B_\theta^T B_\theta\right)^{-1} B_\theta^T + B_\theta\left(B_\theta^T B_\theta\right)^{-1} \dot B_p^T (I - P_\theta).$$
Thus, (3.4) follows immediately.
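Putting the pieces together, the empirical risk (2.9) can be profiled over $\theta_{-d}$ and minimized numerically. The sketch below is a simplified stand-in for the paper's procedure: it uses a generic Nelder-Mead search rather than the port routine with the analytic score (3.3), and a fixed small knot number; all helper names and the toy data are ours.

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.optimize import minimize
from scipy.stats import beta

def basis(u, n_knots=5, k=3):
    """Cubic B-spline design matrix on [0,1] with equally spaced interior knots."""
    interior = np.linspace(0.0, 1.0, n_knots + 2)[1:-1]
    t = np.concatenate([[0.0] * (k + 1), interior, [1.0] * (k + 1)])
    m = len(t) - k - 1
    B = np.empty((len(u), m))
    for j in range(m):
        c = np.zeros(m)
        c[j] = 1.0
        B[:, j] = BSpline(t, c, k, extrapolate=False)(np.clip(u, 0.0, 1.0))
    return np.nan_to_num(B)

def empirical_risk(theta_md, X, Y, a, n_knots=5):
    """R-hat(theta): mean squared residual of the spline fit of Y on
    U_theta = F_d(X theta), with theta_d = sqrt(1 - |theta_{-d}|^2)."""
    d = X.shape[1]
    last = np.sqrt(max(1.0 - float(np.sum(theta_md**2)), 0.0))
    theta = np.append(theta_md, last)
    u = beta.cdf((X @ theta / a + 1.0) / 2.0, (d + 1) / 2, (d + 1) / 2)
    B = basis(u, n_knots)
    resid = Y - B @ np.linalg.lstsq(B, Y, rcond=None)[0]
    return float(np.mean(resid**2))

# toy single-index data with true theta0 = (1, 1)/sqrt(2)
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(400, 2))
theta0 = np.array([1.0, 1.0]) / np.sqrt(2)
Y = np.sin(X @ theta0) + 0.1 * rng.standard_normal(400)
a = np.quantile(np.linalg.norm(X, axis=1), 0.95)
res = minimize(empirical_risk, x0=np.array([0.3]), args=(X, Y, a),
               method="Nelder-Mead")
theta_hat = np.append(res.x, np.sqrt(max(1.0 - res.x[0]**2, 0.0)))
```

Refitting the spline at every candidate $\theta$ is what makes the profiled criterion cheap: each evaluation is a small linear least squares problem, which is the computational advantage of spline smoothing noted in the introduction.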
In practice, the estimation is implemented via the following procedure.

Step 1. Standardize the predictor vectors $\{X_i\}_{i=1}^n$, and for each fixed $\theta \in S_c^{d-1}$ obtain the CDF-transformed variables $\{U_{\theta,i}\}_{i=1}^n$ of the SIP variable $\{X_{\theta,i}\}_{i=1}^n$ through formula (2.5), where the radius $a$ is taken to be the 95th percentile of $\{\|X_i\|\}_{i=1}^n$.

Step 2. Compute the quadratic and cubic B-spline bases at each value $U_{\theta,i}$, where the number of interior knots $N$ is
$$N = \min\left\{\left[c_1 n^{1/5.5}\right], c_2\right\}. \qquad (3.5)$$

Step 3. Find the estimator $\hat\theta$ of $\theta_0$ by minimizing $\hat R^*$ through the port optimization routine, with $(0, 0, \dots, 1)^T$ as the initial value and the empirical score vector $\hat S^*$ in (3.3). If $d < n$, one can instead take as initial value the simple least squares estimator (without the intercept) for the data $\{Y_i, X_i\}_{i=1}^n$, with its last coordinate set positive.

Step 4. Obtain the spline estimator $\hat g$ of $g$ by plugging the $\hat\theta$ obtained in Step 3 into (2.10).

Remark 3.1. In (3.5), $c_1$ and $c_2$ are positive integers and $[\nu]$ denotes the integer part of $\nu$. The choice of the tuning parameter $c_1$ makes little difference for large samples, and according to our asymptotic theory there is no optimal way to set these constants. We recommend $c_1 = 1$ to save computing time for massive data sets. The first term ensures Assumption A6. The additional constraint $c_2$ can be taken from 5 to 10 for smooth monotone or smooth unimodal regression, and $c_2 > 10$ if the regression has many local minima and maxima, which is very unlikely in applications.

4. Simulations

In this section, we carry out two simulations to illustrate the finite-sample behavior of our SIP estimation method. The number of interior knots $N$ is computed according to (3.5) with $c_1 = 1$, $c_2 = 5$. All of our code has been written in R.

Example 1. Consider the model in Xia, Li, Tong and Zhang (2004):
$$Y = m(X) + \sigma_0\varepsilon, \qquad \sigma_0 = 0.3, 0.5, \qquad \varepsilon \stackrel{i.i.d.}{\sim} N(0, 1), \qquad (4.1)$$
where $X = (X_1, X_2)^T \sim N(0, I_2)$, truncated by $[-2.5, 2.5]^2$, and
$$m(x) = x_1 + x_2 + 4\exp\left\{-(x_1 + x_2)^2\right\} + \delta\left(x_1^2 + x_2^2\right)^{1/2}.$$
If $\delta = 0$, then the underlying true function $m$ is a single-index function, i.e., $m(X) = \sqrt{2}\, X^T\theta_0 + 4\exp\left\{-2\left(X^T\theta_0\right)^2\right\}$, where $\theta_0^T = (1, 1)/\sqrt{2}$. When $\delta \ne 0$, $m$ is not a genuine single-index
function. An impression of the bivariate function $m$ for $\delta = 0$ and $\delta = 1$ can be gained in Figure 1 (a) and (b), respectively.

Table 1: Report of Example 1 (values out/in parentheses: $\delta = 0$ / $\delta = 1$)

σ0    n    θ0     BIAS                  SD                   MSE                 Average MSE
0.3  100  θ0,1    5e-04 (-0.00236)     0.00825 (0.02093)    7e-05 (0.00044)     7e-05 (0.00043)
          θ0,2   -6e-04 (0.00174)      0.00826 (0.02083)    7e-05 (0.00043)
     300  θ0,1   -0.00124 (-0.00129)   0.00383 (0.01172)    2e-05 (0.00014)     2e-05 (0.00014)
          θ0,2   -0.00124 (0.00110)    0.00383 (0.01160)    2e-05 (0.00013)
0.5  100  θ0,1    0.00121 (-0.00137)   0.01346 (0.02257)    0.00018 (0.00051)   0.00018 (0.00051)
          θ0,2   -0.00147 (0.00062)    0.01349 (0.02309)    0.00018 (0.00052)
     300  θ0,1   -0.00204 (-0.00229)   0.00639 (0.01205)    4e-05 (0.00015)     4e-05 (0.00015)
          θ0,2    0.00197 (0.00208)    0.00637 (0.01190)    4e-05 (0.00014)
For $\delta = 0, 1$, we draw 100 random realizations for each sample size $n = 50, 100, 300$. To demonstrate how close our SIP estimator is to the true index parameter $\theta_0$, Table 1 lists the bias (BIAS), standard deviation (SD) and mean squared error (MSE) of the estimates of $\theta_0$, together with the average MSE over both directions. From this table, we find that the SIP estimators are very accurate for both $\delta = 0$ and $\delta = 1$, which shows that our proposed method is robust against deviation from the single-index model. As expected, the SIP coefficient is estimated more accurately as the sample size increases. Moreover, for $n = 100, 300$, the average MSE is inversely proportional to $n$.

Example 2. Consider the heteroscedastic regression model (1.1) with
$$m(X) = \sin\left\{\frac{\pi}{4}\left(X^T\theta_0\right)\right\}, \qquad \sigma(X) = \sigma_0\, \frac{5 - \exp\left(\|X\|/\sqrt{d}\right)}{5 + \exp\left(\|X\|/\sqrt{d}\right)}, \qquad (4.2)$$
in which $X_i = (X_{i,1}, \dots, X_{i,d})^T$ and $\varepsilon_i$, $i = 1, \dots, n$, are i.i.d. $N(0, 1)$, and $\sigma_0 = 0.2$. In our simulation, the true parameter is $\theta_0^T = (1, 1, 0, \dots, 0, 1)/\sqrt{3}$ for various sample sizes $n$ and dimensions $d$. The
superior performance of the SIP estimators is borne out in comparison with the MAVE of Xia, Tong, Li and Zhu (2002). We also investigate the behavior of the SIP estimators in the previously unexplored cases where the sample size $n$ is smaller than or equal to $d$, for instance, $n = 100$, $d = 100, 200$ and $n = 200$, $d = 200, 400$. The average MSEs over the $d$ dimensions are listed in Table 2, from which we see that the performance of the SIP estimators is quite reasonable; in most of the scenarios with $n \le d$, the SIP estimators still work astonishingly well where the MAVEs become unreliable. For $n = 100$, $d = 10, 50, 100, 200$, the estimates of the link prediction function $g$ from model (4.2) are plotted in Figure 2; the fits are rather satisfactory even when the dimension $d$ exceeds the sample size $n$.

Theorem 1 indicates that $\hat\theta_{-d}$ is strongly consistent for $\theta_{0,-d}$. To see the convergence, we run 100 replications, and in each replication the value of $\|\hat\theta - \theta_0\|/\sqrt{d}$ is computed. Figure 3 plots the kernel density estimates of the 100 values of $\|\hat\theta - \theta_0\|$ in Example 2, in which the dimension $d = 10, 50, 100, 200$. There are four line types corresponding to the four sample sizes: the dotted-dashed line ($n = 100$), dotted line ($n = 200$), dashed line ($n = 500$) and solid line ($n = 1000$). As the sample size increases, the squared errors become closer to 0, with narrower spread, confirming the conclusions of Theorem 1.

Lastly, we report the average computing time in Example 2 to generate one sample of size $n$ and perform the SIP or MAVE procedure, done on the same ordinary Pentium IV PC, in Table 2. From Table 2, one sees that our proposed SIP estimator is much faster than the MAVE. The computing time for MAVE is extremely sensitive to sample size, as we expected. For very large $d$, MAVE becomes unstable to the point of breaking down in four cases.

5. An application

In this section we demonstrate the proposed SIP model through the river flow data of the Jökulsá Eystri River of Iceland, from January 1, 1972 to December 31, 1974. There are 1096 observations; see Tong (1990). The response variable is the daily river flow ($Y_t$) of the Jökulsá Eystri River, measured in cubic meters per second. The exogenous variables are temperature ($X_t$) in degrees Celsius and daily precipitation ($Z_t$) in millimeters, collected at the meteorological station at Hveravellir. This data set was analyzed earlier through threshold autoregressive (TAR) models by Tong, Thanoon and Gudmundsson (1985) and Tong (1990), and through nonlinear additive autoregressive (NAARX) models by Chen and Tsay (1993). Figure 4 shows the plots of the three time series, from which some nonlinear and nonstationary features of the river flow series are evident. To make these series stationary, we remove the trend by a simple quadratic spline regression; the trends (dashed lines) are shown in Figure 4. By an abuse of notation, we shall continue to use $X_t, Y_t, Z_t$ to denote the detrended series.
In the analysis, we pre-select all the lagged values in the last 7 days (1 week), i.e., the predictor pool is $\{Y_{t-1}, \dots, Y_{t-7}, X_t, X_{t-1}, \dots, X_{t-7}, Z_t, Z_{t-1}, \dots, Z_{t-7}\}$. Using a BIC similar to that of Huang and Yang (2004) for our proposed spline SIP model with 3 interior knots, the following 9 explanatory variables are selected from the above set: $\{Y_{t-1}, \dots, Y_{t-4}, X_t, X_{t-1}, X_{t-2}, Z_t, Z_{t-1}\}$. Based on this selection, we fit the SIP model again and obtain the estimate of the SIP coefficient
$$\hat\theta = (-0.877, 0.382, -0.208, 0.125, -0.046, -0.034, 0.004, -0.126, 0.079)^T.$$
Figure 5 (a) and (b) display the fitted river flow series and the residuals against time.
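A fitted single-index model of this kind can be applied day by day to produce one-step-ahead forecasts with the model parameters held fixed. The schematic sketch below illustrates the idea; the fitted link, coefficients, and data here are purely hypothetical stand-ins, not the estimates from this paper.

```python
import numpy as np

def rolling_forecast(predict, X_future):
    """One-step-ahead out-of-sample rolling forecast: for each period in
    the hold-out sample, predict the response from that period's (observed)
    predictor vector, using a model fitted on the training sample only."""
    return np.array([predict(x) for x in X_future])

# hypothetical fitted SIP model: y_hat = g_hat(x^T theta_hat)
theta_hat = np.array([0.6, 0.8])
g_hat = np.tanh  # stand-in for a fitted spline link function
preds = rolling_forecast(lambda x: g_hat(x @ theta_hat),
                         np.array([[0.5, 0.2], [1.0, -0.3]]))
# mean squared prediction error against (hypothetical) observed responses
mspe = np.mean((preds - np.array([0.4, 0.3])) ** 2)
```

Because the index coefficient and link are frozen at their training-sample estimates, the forecast errors measure genuine out-of-sample performance rather than in-sample fit.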
Next we examine the forecasting performance of the SIP method. We first estimate the SIP model using only the observations of the first two years, then perform an out-of-sample rolling forecast of the entire third year. The observed values of the exogenous variables are used in the forecast. Figure 5 (c) shows the SIP out-of-sample rolling forecasts. For the purpose of comparison, we also apply the MAVE method, with the same predictor vector selected by BIC. The mean squared prediction error is 60.52 for the SIP model, 61.25 for MAVE, 65.62 for NAARX, 66.67 for TAR and 81.99 for the linear regression model; see Chen and Tsay (1993). Among these five models, the SIP model produces the best forecasts.

6. Conclusion

In this paper we propose a robust SIP model for stochastic regression under weak dependence, regardless of whether the underlying function is exactly a single-index function. The proposed spline estimator of the index coefficient not only possesses the usual strong consistency and $\sqrt{n}$-rate asymptotically normal distribution, but is also as efficient as if the true link function $g$ were known. By taking advantage of the spline smoothing method and the iterative optimization method, the proposed procedure is much faster than the MAVE method. This procedure is especially powerful for large sample size $n$ and high dimension $d$; unlike the MAVE method, the performance of the SIP remains satisfactory in the case $d > n$.
Acknowledgment. This work is part of the first author's dissertation under the supervision of the second author, and has been supported in part by NSF award DMS 0405330.

Appendix

A.1. Preliminaries

In this section, we introduce some properties of B-splines.

Lemma A.1. There exists a constant $c > 0$ such that for any spline $\sum_{k=2}^4 \sum_{j=-k+1}^N \alpha_{j,k} B_{j,k}$ of order up to $k = 4$,
$$c\, h^{1/r} \|\alpha\|_r \le \left\|\sum_{k=2}^4 \sum_{j=-k+1}^N \alpha_{j,k} B_{j,k}\right\|_r \le 3^{1-1/r}\, h^{1/r} \|\alpha\|_r, \qquad 1 \le r \le \infty,$$
$$c\, h^{1/r} \|\alpha\|_r \le \left\|\sum_{k=2}^4 \sum_{j=-k+1}^N \alpha_{j,k} B_{j,k}\right\|_r \le (3h)^{1/r} \|\alpha\|_r, \qquad 0 < r < 1,$$
where $\alpha := (\alpha_{-1,2}, \alpha_{0,2}, \dots, \alpha_{N,2}, \dots, \alpha_{N,4})$. In particular, under Assumption A2, for any fixed $\theta$,
$$c\, h^{1/2}\|\alpha\|_2 \le \left\|\sum_{k=2}^4 \sum_{j=-k+1}^N \alpha_{j,k} B_{j,k}\right\|_{2,\theta} \le C\, h^{1/2}\|\alpha\|_2.$$
Proof. It follows from the B-spline property on page 96 of de Boor (2001) that $\sum_{k=2}^4 \sum_{j=-k+1}^N B_{j,k} \equiv 3$ on $[0,1]$, so the right inequality follows immediately for $r = \infty$. When $1 \le r < \infty$, we use Hölder's inequality to find
$$\left|\sum_{k=2}^4 \sum_{j=-k+1}^N \alpha_{j,k} B_{j,k}\right| \le \left(\sum_{k=2}^4 \sum_{j=-k+1}^N |\alpha_{j,k}|^r B_{j,k}\right)^{1/r}\left(\sum_{k=2}^4 \sum_{j=-k+1}^N B_{j,k}\right)^{1-1/r} = 3^{1-1/r}\left(\sum_{k=2}^4 \sum_{j=-k+1}^N |\alpha_{j,k}|^r B_{j,k}\right)^{1/r}.$$
Since all the knots are equally spaced and $\int_{-\infty}^\infty B_{j,k}(u)\, du \le h$, the right inequality follows from
$$\int_0^1 \left|\sum_{k=2}^4 \sum_{j=-k+1}^N \alpha_{j,k} B_{j,k}(u)\right|^r du \le 3^{r-1} h\, \|\alpha\|_r^r.$$
When $r < 1$, we have
$$\left|\sum_{k=2}^4 \sum_{j=-k+1}^N \alpha_{j,k} B_{j,k}\right|^r \le \sum_{k=2}^4 \sum_{j=-k+1}^N |\alpha_{j,k}|^r B_{j,k}^r.$$
Since $\int_{-\infty}^\infty B_{j,k}^r(u)\, du \le t_{j+k} - t_j = kh$ and
$$\int_0^1 \left|\sum_{k=2}^4 \sum_{j=-k+1}^N \alpha_{j,k} B_{j,k}(u)\right|^r du \le \|\alpha\|_r^r \max_{j,k}\int_{-\infty}^\infty B_{j,k}^r(u)\, du \le 3h\, \|\alpha\|_r^r,$$
the right inequality follows in this case as well. For the left inequalities, we derive from Theorem 5.4.2 of DeVore and Lorentz (1993) that
$$|\alpha_{j,k}| \le C_1 h^{-1/r}\left(\int_{t_j}^{t_{j+1}} \left|\sum_{j'=-k+1}^N \alpha_{j',k} B_{j',k}(u)\right|^r du\right)^{1/r}$$
for any $0 < r \le \infty$, so
$$|\alpha_{j,k}|^r \le C_1^r h^{-1}\int_{t_j}^{t_{j+1}} \left|\sum_{j'=-k+1}^N \alpha_{j',k} B_{j',k}(u)\right|^r du.$$
Since each $u \in [0,1]$ appears in at most $k$ intervals $(t_j, t_{j+k})$, adding up these inequalities, we obtain
$$\|\alpha\|_r^r \le C_1^r h^{-1}\sum_{k=2}^4 \sum_j \int_{t_j}^{t_{j+k}}\left|\sum_{j'=-k+1}^N \alpha_{j',k} B_{j',k}(u)\right|^r du \le 3C h^{-1}\left\|\sum_{k=2}^4 \sum_{j=-k+1}^N \alpha_{j,k} B_{j,k}\right\|_r^r.$$
The left inequality follows.
For any functions $\phi$ and $\varphi$, define the empirical inner product and the empirical norm as
$$\langle\phi, \varphi\rangle_{n,\theta} = n^{-1}\sum_{i=1}^n \phi(U_{\theta,i})\varphi(U_{\theta,i}), \qquad \|\phi\|_{2,n,\theta}^2 = n^{-1}\sum_{i=1}^n \phi^2(U_{\theta,i}).$$
In addition, if the functions $\phi, \varphi$ are $L^2[0,1]$-integrable, define the theoretical inner product and its corresponding theoretical $L^2$ norm as
$$\langle\phi, \varphi\rangle_\theta = \int_0^1 \phi(u)\varphi(u) f_\theta(u)\, du, \qquad \|\phi\|_{2,\theta}^2 = \int_0^1 \phi^2(u) f_\theta(u)\, du.$$
Lemma A.2. Under Assumptions A2, A5 and A6, with probability 1,
$$\sup_{\theta \in S_c^{d-1}}\max_{k,k'=2,3,4}\max_{1 \le j,j' \le N}\left|\langle B_{j,k}, B_{j',k'}\rangle_{n,\theta} - \langle B_{j,k}, B_{j',k'}\rangle_\theta\right| = O\left\{(nN)^{-1/2}\log n\right\}.$$
Proof. We only prove the case $k = k' = 4$; all other cases are similar. Let
$$\zeta_{\theta,j,j',i} = B_{j,4}(U_{\theta,i}) B_{j',4}(U_{\theta,i}) - E\{B_{j,4}(U_{\theta,i}) B_{j',4}(U_{\theta,i})\},$$
with second moment
$$E\zeta_{\theta,j,j',i}^2 = E\{B_{j,4}^2(U_{\theta,i}) B_{j',4}^2(U_{\theta,i})\} - \left[E\{B_{j,4}(U_{\theta,i}) B_{j',4}(U_{\theta,i})\}\right]^2,$$
where $\left[E\{B_{j,4}(U_{\theta,i})B_{j',4}(U_{\theta,i})\}\right]^2 \sim N^{-2}$ and $E\{B_{j,4}^2(U_{\theta,i})B_{j',4}^2(U_{\theta,i})\} \sim N^{-1}$ by Assumption A2. Hence $E\zeta_{\theta,j,j',i}^2 \sim N^{-1}$. The $k$-th moment is given by
$$E\left|\zeta_{\theta,j,j',i}\right|^k = E\left|B_{j,4}(U_{\theta,i})B_{j',4}(U_{\theta,i}) - E\{B_{j,4}(U_{\theta,i})B_{j',4}(U_{\theta,i})\}\right|^k$$
$$\le 2^{k-1}\left[E\{B_{j,4}(U_{\theta,i})B_{j',4}(U_{\theta,i})\}^k + \left|E\{B_{j,4}(U_{\theta,i})B_{j',4}(U_{\theta,i})\}\right|^k\right],$$
where $E\{B_{j,4}(U_{\theta,i})B_{j',4}(U_{\theta,i})\}^k \sim N^{-1}$ and $\left|E\{B_{j,4}(U_{\theta,i})B_{j',4}(U_{\theta,i})\}\right|^k \sim N^{-k}$. Thus there exists a constant $C > 0$ such that $E|\zeta_{\theta,j,j',i}|^k \le C 2^{k-1} k!\, E\zeta_{j,j',i}^2$, so Cramér's condition is satisfied with Cramér's constant $c^*$. By Bernstein's inequality (see Bosq (1998), Theorem 1.4, page 31), we have for $k = 3$,
$$P\left\{\left|n^{-1}\sum_{i=1}^n \zeta_{\theta,j,j',i}\right| \ge \delta_n\right\} \le a_1\exp\left(-\frac{q\delta_n^2}{25m_2^2 + 5c^*\delta_n}\right) + a_2(3)\,\alpha\left(\left[\frac{n}{q+1}\right]\right)^{6/7},$$
where
$$a_1 = 2\frac{n}{q} + 2\left(1 + \frac{\delta_n^2}{25m_2^2 + 5c^*\delta_n}\right), \qquad m_2^2 \sim N^{-1}, \qquad \delta_n = \delta\frac{\log n}{\sqrt{nN}},$$
$$a_2(3) = 11n\left(1 + \frac{5m_3^{6/7}}{\delta_n}\right), \qquad m_3 = \max_{1 \le i \le n}\left\|\zeta_{\theta,j,j',i}\right\|_3 \le cN^{1/3}.$$
Observe that $5c^*\delta_n = o(1)$ by Assumption A6; then by taking $q$ such that $\left[\frac{n}{q+1}\right] \ge c_0\log n$, $q \ge c_1 n/\log n$ for some constants $c_0, c_1$, one has $a_1 = O(n/q) = O(\log n)$ and $a_2(3) = o(n^2)$ via Assumption A6 again. Assumption A5 yields that
$$\alpha\left(\left[\frac{n}{q+1}\right]\right)^{6/7} \le K_0^{6/7}\exp\left(-\frac{6}{7}\lambda_0\left[\frac{n}{q+1}\right]\right) \le Cn^{-6\lambda_0 c_0/7}.$$
Thus, for fixed $\theta \in S_c^{d-1}$, when $n$ is large enough,
$$P\left\{\left|\frac{1}{n}\sum_{i=1}^n \zeta_{\theta,j,j',i}\right| > \delta_n\right\} \le c\log n\,\exp\left(-c_2\delta^2\log n\right) + Cn^{2-6\lambda_0 c_0/7}. \qquad (A.1)$$
We divide the range of each $\theta_{p}$, $p=1,2,\dots,d-1$, into $n^{6/(d-1)}$ equally spaced intervals with disjoint endpoints $-1=\theta_{p,0}<\theta_{p,1}<\cdots<\theta_{p,M_{n}}=1$, $p=1,\dots,d-1$. Projecting these small cylinders onto $S_c^{d-1}$, the radius of each patch $\Lambda_{r}$, $r=1,\dots,M_{n}$, is bounded by $cM_{n}^{-1}$. Denote the projections of the $M_{n}$ points as $\theta_{r}=\bigl(\theta_{r,-d},\sqrt{1-\|\theta_{r,-d}\|_{2}^{2}}\bigr)$, $r=0,1,\dots,M_{n}$. Employing the discretization method, $\sup_{\theta\in S_c^{d-1}}\max_{1\le j,j'\le N}\bigl|n^{-1}\sum_{i=1}^{n}\zeta_{\theta,j,j',i}\bigr|$ is bounded by
\[ \max_{0\le r\le M_{n}}\max_{1\le j,j'\le N}\Bigl|n^{-1}\sum_{i=1}^{n}\zeta_{\theta_{r},j,j',i}\Bigr| +\max_{0\le r\le M_{n}}\max_{1\le j,j'\le N}\sup_{\theta\in\Lambda_{r}} \Bigl|n^{-1}\sum_{i=1}^{n}\bigl(\zeta_{\theta,j,j',i}-\zeta_{\theta_{r},j,j',i}\bigr)\Bigr|. \tag{A.2} \]
By (A.1) and Assumption A6, there exists a large enough value $\delta>0$ such that
\[ P\Bigl\{\Bigl|n^{-1}\sum_{i=1}^{n}\zeta_{\theta_{r},j,j',i}\Bigr|>\delta_{n}\Bigr\}\le n^{-10}, \]
which implies that
\[ \sum_{n=1}^{\infty}P\Bigl\{\max_{0\le r\le M_{n}}\max_{1\le j,j'\le N} \Bigl|n^{-1}\sum_{i=1}^{n}\zeta_{\theta_{r},j,j',i}\Bigr|\ge\delta_{n}\Bigr\} \le 2\sum_{n=1}^{\infty}N^{2}M_{n}n^{-10}\le C\sum_{n=1}^{\infty}n^{-3}<\infty. \]
Thus, the Borel--Cantelli Lemma entails that
\[ \max_{0\le r\le M_{n}}\max_{1\le j,j'\le N}\Bigl|n^{-1}\sum_{i=1}^{n}\zeta_{\theta_{r},j,j',i}\Bigr| =O\Bigl(\frac{\log n}{\sqrt{nN}}\Bigr),\ \mathrm{a.s.} \tag{A.3} \]
Employing the Lipschitz continuity of the cubic B-spline, one has with probability 1
\[ \max_{0\le r\le M_{n}}\max_{1\le j,j'\le N}\sup_{\theta\in\Lambda_{r}} \Bigl|n^{-1}\sum_{i=1}^{n}\bigl(\zeta_{\theta,j,j',i}-\zeta_{\theta_{r},j,j',i}\bigr)\Bigr| =O\bigl(M_{n}^{-1}h^{-6}\bigr). \tag{A.4} \]
Therefore Assumption A2, (A.2), (A.3) and (A.4) lead to the desired result.
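The empirical-versus-theoretical inner product comparison at the heart of Lemma A.2 is easy to check numerically. The sketch below is our own illustration (the knot layout, seed and sample size are assumptions, not the paper's): for $U_i$ i.i.d. uniform on $[0,1]$, the empirical inner product of two cubic B-splines should be close to the theoretical one.

```python
import numpy as np
from scipy.interpolate import BSpline

def cubic_bspline(knots, j):
    """Return the j-th cubic B-spline on the given full knot vector."""
    c = np.zeros(len(knots) - 4)   # order 4 = degree 3
    c[j] = 1.0
    return BSpline(knots, c, 3, extrapolate=False)

N = 6                                              # interior knots
t = np.concatenate([[0.0] * 3, np.linspace(0, 1, N + 2), [1.0] * 3])
Bj, Bk = cubic_bspline(t, 2), cubic_bspline(t, 3)  # two overlapping splines

# theoretical inner product under the uniform density f(u) = 1,
# approximated by a Riemann sum on a fine grid
u = np.linspace(0.0, 1.0, 20001)
theo = np.mean(np.nan_to_num(Bj(u)) * np.nan_to_num(Bk(u)))

# empirical inner product n^{-1} sum B_j(U_i) B_k(U_i)
rng = np.random.default_rng(1)
U = rng.uniform(0, 1, size=20000)
emp = np.mean(np.nan_to_num(Bj(U)) * np.nan_to_num(Bk(U)))
```

For $n=20000$ draws the two quantities agree to a few decimal places, consistent with the $O\{(nN)^{-1/2}\log n\}$ rate.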
Denote by $\Gamma=\Gamma^{(0)}\cup\Gamma^{(1)}\cup\Gamma^{(2)}$ the space of all linear, quadratic and cubic spline functions on $[0,1]$. We next establish the uniform rate at which the empirical inner product approximates the theoretical inner product for all B-splines $B_{j,k}$ with $k=2,3,4$.

Lemma A.3. Under Assumptions A2, A5 and A6, one has
\[ A_{n}=\sup_{\theta\in S_c^{d-1}}\sup_{\gamma_{1},\gamma_{2}\in\Gamma} \frac{\bigl|\langle\gamma_{1},\gamma_{2}\rangle_{n,\theta}-\langle\gamma_{1},\gamma_{2}\rangle_{\theta}\bigr|} {\|\gamma_{1}\|_{2,\theta}\|\gamma_{2}\|_{2,\theta}} =O\bigl\{(nh)^{-1/2}\log n\bigr\},\ \mathrm{a.s.} \]
Proof. Without loss of generality, write
\[ \gamma_{1}=\sum_{k=2}^{4}\sum_{j=-k+1}^{N}\alpha_{j,k}B_{j,k},\qquad \gamma_{2}=\sum_{k=2}^{4}\sum_{j=-k+1}^{N}\beta_{j,k}B_{j,k}, \]
for two $3(N+3)$-vectors $\alpha=(\alpha_{-1,2},\alpha_{0,2},\dots,\alpha_{N,2},\dots,\alpha_{N,4})^{T}$ and $\beta=(\beta_{-1,2},\beta_{0,2},\dots,\beta_{N,2},\dots,\beta_{N,4})^{T}$. Then for fixed $\theta$,
\[ \langle\gamma_{1},\gamma_{2}\rangle_{n,\theta} =\frac1n\sum_{i=1}^{n}\Bigl\{\sum_{k=2}^{4}\sum_{j=-k+1}^{N}\alpha_{j,k}B_{j,k}(U_{\theta,i})\Bigr\} \Bigl\{\sum_{k'=2}^{4}\sum_{j'=-k'+1}^{N}\beta_{j',k'}B_{j',k'}(U_{\theta,i})\Bigr\} =\sum_{k=2}^{4}\sum_{j=-k+1}^{N}\sum_{k'=2}^{4}\sum_{j'=-k'+1}^{N} \alpha_{j,k}\beta_{j',k'}\bigl\langle B_{j,k},B_{j',k'}\bigr\rangle_{n,\theta}, \]
\[ \|\gamma_{1}\|_{2,\theta}^{2} =\sum_{k=2}^{4}\sum_{j=-k+1}^{N}\sum_{k'=2}^{4}\sum_{j'=-k'+1}^{N} \alpha_{j,k}\alpha_{j',k'}\bigl\langle B_{j,k},B_{j',k'}\bigr\rangle_{\theta},\qquad \|\gamma_{2}\|_{2,\theta}^{2} =\sum_{k=2}^{4}\sum_{j=-k+1}^{N}\sum_{k'=2}^{4}\sum_{j'=-k'+1}^{N} \beta_{j,k}\beta_{j',k'}\bigl\langle B_{j,k},B_{j',k'}\bigr\rangle_{\theta}. \]
According to Lemma A.1, one has for any $\theta\in S_c^{d-1}$,
\[ c_{1}h\|\alpha\|_{2}^{2}\le\|\gamma_{1}\|_{2,\theta}^{2}\le c_{2}h\|\alpha\|_{2}^{2},\qquad c_{1}h\|\beta\|_{2}^{2}\le\|\gamma_{2}\|_{2,\theta}^{2}\le c_{2}h\|\beta\|_{2}^{2}, \]
hence
\[ c_{1}h\|\alpha\|_{2}\|\beta\|_{2}\le\|\gamma_{1}\|_{2,\theta}\|\gamma_{2}\|_{2,\theta}\le c_{2}h\|\alpha\|_{2}\|\beta\|_{2}. \tag{A.5} \]
Hence, since only pairs of B-splines with overlapping supports contribute to the double sums, the numerator is bounded by $C\|\alpha\|_{2}\|\beta\|_{2}$ times the maximal basis discrepancy, and together with the norm equivalence (A.5) this gives
\[ A_{n}=\sup_{\theta\in S_c^{d-1}}\sup_{\gamma_{1},\gamma_{2}\in\Gamma} \frac{\bigl|\langle\gamma_{1},\gamma_{2}\rangle_{n,\theta}-\langle\gamma_{1},\gamma_{2}\rangle_{\theta}\bigr|} {\|\gamma_{1}\|_{2,\theta}\|\gamma_{2}\|_{2,\theta}} \le c_{0}h^{-1}\sup_{\theta\in S_c^{d-1}}\max_{\substack{k,k'=2,3,4\\ 1\le j,j'\le N}} \bigl|\langle B_{j,k},B_{j',k'}\rangle_{n,\theta}-\langle B_{j,k},B_{j',k'}\rangle_{\theta}\bigr|, \]
which, together with Lemma A.2 and the fact that $N\sim h^{-1}$, implies the conclusion of the lemma.

A.2. Proof of Proposition 2.1
For any fixed $\theta$, we write the response vector $Y=(Y_{1},\dots,Y_{n})^{T}$ as the sum of a signal vector $\gamma_{\theta}$, a parametric noise vector $E_{\theta}$ and a systematic noise vector $E$, i.e., $Y=\gamma_{\theta}+E_{\theta}+E$, in which $\gamma_{\theta}^{T}=\{\gamma_{\theta}(U_{\theta,1}),\dots,\gamma_{\theta}(U_{\theta,n})\}$, $E^{T}=\{\sigma(X_{1})\varepsilon_{1},\dots,\sigma(X_{n})\varepsilon_{n}\}$ and $E_{\theta}^{T}=\{m(X_{1})-\gamma_{\theta}(U_{\theta,1}),\dots,m(X_{n})-\gamma_{\theta}(U_{\theta,n})\}$.
Remark A.1. If $m$ is a genuine single-index function, then $E_{\theta_{0}}\equiv 0$, and thus the proposed SIP model is exactly the single-index model.

Let $\Gamma^{(2)}_{n,\theta}$ be the cubic spline space spanned by $\{B_{j,4}(U_{\theta,i})\}_{i=1}^{n}$, $-3\le j\le N$, for fixed $\theta$. Projecting $Y$ onto $\Gamma^{(2)}_{n,\theta}$ yields
\[ \hat\gamma_{\theta}=\{\hat\gamma_{\theta}(U_{\theta,1}),\dots,\hat\gamma_{\theta}(U_{\theta,n})\}^{T} =\mathrm{Proj}_{\Gamma^{(2)}_{n,\theta}}\gamma_{\theta} +\mathrm{Proj}_{\Gamma^{(2)}_{n,\theta}}E_{\theta} +\mathrm{Proj}_{\Gamma^{(2)}_{n,\theta}}E, \]
where $\hat\gamma_{\theta}$ is given in (2.8). We break the cubic spline estimation error $\hat\gamma_{\theta}(u_{\theta})-\gamma_{\theta}(u_{\theta})$ into a bias term $\tilde\gamma_{\theta}(u_{\theta})-\gamma_{\theta}(u_{\theta})$ and two noise terms $\tilde\varepsilon_{\theta}(u_{\theta})$ and $\hat\varepsilon_{\theta}(u_{\theta})$:
\[ \hat\gamma_{\theta}(u_{\theta})-\gamma_{\theta}(u_{\theta}) =\{\tilde\gamma_{\theta}(u_{\theta})-\gamma_{\theta}(u_{\theta})\} +\tilde\varepsilon_{\theta}(u_{\theta})+\hat\varepsilon_{\theta}(u_{\theta}), \tag{A.6} \]
where
\[ \tilde\gamma_{\theta}(u)=\{B_{j,4}(u)\}_{-3\le j\le N}^{T}V_{n,\theta}^{-1} \bigl\{\langle\gamma_{\theta},B_{j,4}\rangle_{n,\theta}\bigr\}_{j=-3}^{N}, \tag{A.7} \]
\[ \tilde\varepsilon_{\theta}(u)=\{B_{j,4}(u)\}_{-3\le j\le N}^{T}V_{n,\theta}^{-1} \bigl\{\langle E_{\theta},B_{j,4}\rangle_{n,\theta}\bigr\}_{j=-3}^{N}, \tag{A.8} \]
\[ \hat\varepsilon_{\theta}(u)=\{B_{j,4}(u)\}_{-3\le j\le N}^{T}V_{n,\theta}^{-1} \bigl\{\langle E,B_{j,4}\rangle_{n,\theta}\bigr\}_{j=-3}^{N}. \tag{A.9} \]
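The decomposition of $\hat\gamma_\theta$ into the projections of signal and noise above is nothing more than linearity of the least-squares projection onto the spline space. A minimal sketch (ours; the design matrix below is an arbitrary stand-in for the B-spline basis evaluated at $U_{\theta,i}$):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 8
B = rng.normal(size=(n, p))                  # stand-in for the spline basis
proj = B @ np.linalg.solve(B.T @ B, B.T)     # projection matrix P_theta

gamma = rng.normal(size=n)                   # signal vector
E_theta = rng.normal(size=n)                 # parametric noise vector
E = rng.normal(size=n)                       # systematic noise vector
Y = gamma + E_theta + E

# projecting Y equals the sum of the three projected components
lhs = proj @ Y
rhs = proj @ gamma + proj @ E_theta + proj @ E
```

The identity holds exactly (up to floating point), whatever the basis, which is why the three error terms in (A.6) can be analyzed separately.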
In the above, we denote by $V_{n,\theta}$ the empirical inner product matrix of the cubic B-spline basis, and by $V_{\theta}$ the corresponding theoretical inner product matrix:
\[ V_{n,\theta}=\frac1n B_{\theta}^{T}B_{\theta} =\bigl\{\langle B_{j',4},B_{j,4}\rangle_{n,\theta}\bigr\}_{j,j'=-3}^{N},\qquad V_{\theta}=\bigl\{\langle B_{j',4},B_{j,4}\rangle_{\theta}\bigr\}_{j,j'=-3}^{N}. \tag{A.10} \]
In Lemma A.5, we provide uniform upper bounds for $\|V_{n,\theta}^{-1}\|_{\infty}$ and $\|V_{\theta}^{-1}\|_{\infty}$. Before that, we first state a special case of Theorem 13.4.3 of DeVore and Lorentz (1993).
Lemma A.4. If a bi-infinite matrix $A$ with bandwidth $r$ has a bounded inverse $A^{-1}$ on $l_{2}$ and $\kappa=\kappa(A):=\|A\|_{2}\|A^{-1}\|_{2}$ is the condition number of $A$, then $\|A^{-1}\|_{\infty}\le 2c_{0}(1-\nu)^{-1}$, with
\[ c_{0}=\nu^{-2r}\bigl\|A^{-1}\bigr\|_{2},\qquad \nu=\bigl(\kappa^{2}-1\bigr)^{1/4r}\bigl(\kappa^{2}+1\bigr)^{-1/4r}. \]
Lemma A.5. Under Assumptions A2, A5 and A6, there exist constants $0<c_{V}<C_{V}$ such that
\[ c_{V}N^{-1}\|w\|_{2}^{2}\le w^{T}V_{\theta}w\le C_{V}N^{-1}\|w\|_{2}^{2},\qquad c_{V}N^{-1}\|w\|_{2}^{2}\le w^{T}V_{n,\theta}w\le C_{V}N^{-1}\|w\|_{2}^{2},\ \mathrm{a.s.}, \tag{A.11} \]
with matrices $V_{\theta}$ and $V_{n,\theta}$ defined in (A.10). In addition, there exists a constant $C>0$ such that
\[ \sup_{\theta\in S_c^{d-1}}\bigl\|V_{n,\theta}^{-1}\bigr\|_{\infty}\le CN,\ \mathrm{a.s.},\qquad \sup_{\theta\in S_c^{d-1}}\bigl\|V_{\theta}^{-1}\bigr\|_{\infty}\le CN. \tag{A.12} \]
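The eigenvalue bounds (A.11) are easy to observe numerically. Below is our own sanity check (knot layout, seed, and tolerances are assumptions): for $U_i$ uniform on $[0,1]$, all eigenvalues of the cubic B-spline Gram matrix $V_{n,\theta}=B_\theta^T B_\theta/n$ are of exact order $N^{-1}$, and the matrix is banded because distant B-splines have disjoint supports.

```python
import numpy as np
from scipy.interpolate import BSpline

def basis_matrix(x, N):
    """n x (N+4) design matrix of cubic B-splines on equispaced knots."""
    t = np.concatenate([[0.0] * 3, np.linspace(0, 1, N + 2), [1.0] * 3])
    cols = []
    for j in range(len(t) - 4):
        c = np.zeros(len(t) - 4)
        c[j] = 1.0
        cols.append(np.nan_to_num(BSpline(t, c, 3, extrapolate=False)(x)))
    return np.column_stack(cols)

rng = np.random.default_rng(3)
N, n = 6, 5000
U = rng.uniform(0, 1, size=n)
B = basis_matrix(U, N)
V = B.T @ B / n                        # empirical Gram matrix V_{n,theta}
eigs = np.linalg.eigvalsh(V)
```

With $N=6$ the eigenvalues cluster around $N^{-1}$, and entries such as $V_{0,5}$ vanish exactly since $B_{0}$ and $B_{5}$ have disjoint supports.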
Proof. First we compute lower and upper bounds for the eigenvalues of $V_{n,\theta}$. Let $w$ be any $(N+4)$-vector and denote $\gamma_{w}(u)=\sum_{j=-3}^{N}w_{j}B_{j,4}(u)$; then $B_{\theta}w=\{\gamma_{w}(U_{\theta,1}),\dots,\gamma_{w}(U_{\theta,n})\}^{T}$, and the definition of $A_{n}$ in Lemma A.3 entails that
\[ \|\gamma_{w}\|_{2,\theta}^{2}(1-A_{n})\le w^{T}V_{n,\theta}w =\|\gamma_{w}\|_{2,n,\theta}^{2}\le\|\gamma_{w}\|_{2,\theta}^{2}(1+A_{n}). \tag{A.13} \]
Using Theorem 5.4.2 of DeVore and Lorentz (1993) and Assumption A2, one obtains
\[ \frac{c_{f}C}{N}\|w\|_{2}^{2}\le\|\gamma_{w}\|_{2,\theta}^{2}=w^{T}V_{\theta}w =\Bigl\|\sum_{j=-3}^{N}w_{j}B_{j,4}\Bigr\|_{2,\theta}^{2}\le\frac{C_{f}C}{N}\|w\|_{2}^{2}, \tag{A.14} \]
which, together with (A.13), yields
\[ c_{f}CN^{-1}\|w\|_{2}^{2}(1-A_{n})\le w^{T}V_{n,\theta}w\le C_{f}CN^{-1}\|w\|_{2}^{2}(1+A_{n}). \tag{A.15} \]
Now the order of $A_{n}$ given in Lemma A.3, together with (A.14) and (A.15), implies (A.11), in which $c_{V}=c_{f}C$, $C_{V}=C_{f}C$. Next, denote by $\lambda_{\max}(V_{n,\theta})$ and $\lambda_{\min}(V_{n,\theta})$ the maximum and minimum eigenvalues of $V_{n,\theta}$; simple algebra and (A.11) entail that
\[ \|V_{n,\theta}\|_{2}=\lambda_{\max}(V_{n,\theta})\le C_{V}N^{-1},\qquad \bigl\|V_{n,\theta}^{-1}\bigr\|_{2}=\lambda_{\min}^{-1}(V_{n,\theta})\le c_{V}^{-1}N,\ \mathrm{a.s.}, \]
thus
\[ \kappa:=\|V_{n,\theta}\|_{2}\bigl\|V_{n,\theta}^{-1}\bigr\|_{2} =\lambda_{\max}(V_{n,\theta})\lambda_{\min}^{-1}(V_{n,\theta})\le C_{V}c_{V}^{-1}<\infty,\ \mathrm{a.s.} \]
Meanwhile, let $w_{j}$ be the $(N+4)$-vector of all zeros except for a one in the $j$-th position, $j=-3,\dots,N$. Then clearly
\[ w_{j}^{T}V_{n,\theta}w_{j}=\frac1n\sum_{i=1}^{n}B_{j,4}^{2}(U_{\theta,i})=\|B_{j,4}\|_{n,\theta}^{2},\qquad \|w_{j}\|_{2}=1,\ -3\le j\le N, \]
and in particular
\[ w_{0}^{T}V_{n,\theta}w_{0}\le\lambda_{\max}(V_{n,\theta})\|w_{0}\|_{2}^{2}=\lambda_{\max}(V_{n,\theta}),\qquad w_{-3}^{T}V_{n,\theta}w_{-3}\ge\lambda_{\min}(V_{n,\theta})\|w_{-3}\|_{2}^{2}=\lambda_{\min}(V_{n,\theta}). \]
This, together with Lemma A.3, yields
\[ \kappa=\lambda_{\max}(V_{n,\theta})\lambda_{\min}^{-1}(V_{n,\theta}) \ge\frac{w_{0}^{T}V_{n,\theta}w_{0}}{w_{-3}^{T}V_{n,\theta}w_{-3}} =\frac{\|B_{0,4}\|_{n,\theta}^{2}}{\|B_{-3,4}\|_{n,\theta}^{2}} \ge\frac{\|B_{0,4}\|_{\theta}^{2}}{\|B_{-3,4}\|_{\theta}^{2}}\cdot\frac{1-A_{n}}{1+A_{n}}, \]
which leads to $\kappa\ge C>1$, a.s., because the definition of the B-spline and Assumption A2 ensure that $\|B_{0,4}\|_{\theta}^{2}\ge C_{0}\|B_{-3,4}\|_{\theta}^{2}$ for some constant $C_{0}>1$. Next, applying Lemma A.4 with $\nu=(\kappa^{2}-1)^{1/16}(\kappa^{2}+1)^{-1/16}$ and $c_{0}=\nu^{-8}\|V_{n,\theta}^{-1}\|_{2}$, one gets $\|V_{n,\theta}^{-1}\|_{\infty}\le 2\nu^{-8}c_{V}^{-1}N(1-\nu)^{-1}\le CN$, a.s. Hence part one of (A.12) follows; part two of (A.12) is proved in the same fashion.
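The $\|V_{n,\theta}^{-1}\|_{\infty}\le CN$ bound of (A.12) can also be watched numerically: doubling the number of knots should roughly double the max-row-sum norm of the inverse Gram matrix, but not blow it up. A sketch (ours; knot layouts, seed and the growth tolerance are assumptions):

```python
import numpy as np
from scipy.interpolate import BSpline

def gram(N, n, seed):
    """Empirical Gram matrix of cubic B-splines at n uniform points."""
    t = np.concatenate([[0.0] * 3, np.linspace(0, 1, N + 2), [1.0] * 3])
    U = np.random.default_rng(seed).uniform(0, 1, size=n)
    cols = []
    for j in range(len(t) - 4):
        c = np.zeros(len(t) - 4)
        c[j] = 1.0
        cols.append(np.nan_to_num(BSpline(t, c, 3, extrapolate=False)(U)))
    B = np.column_stack(cols)
    return B.T @ B / n

# ||V^{-1}||_inf = max absolute row sum of the inverse
norm_small = np.abs(np.linalg.inv(gram(5, 20000, 7))).sum(axis=1).max()
norm_large = np.abs(np.linalg.inv(gram(10, 20000, 7))).sum(axis=1).max()
```

Going from $N=5$ to $N=10$ the norm grows by a modest factor, consistent with linear growth in $N$.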
In the following, we denote by $Q_{T}(m)$ the 4-th order quasi-interpolant of $m$ corresponding to the knots $T$; see equation (4.12), page 146 of DeVore and Lorentz (1993). According to Theorem 7.7.4 of DeVore and Lorentz (1993), the following lemma holds.

Lemma A.6. There exists a constant $C>0$ such that for $0\le k\le 2$ and $\gamma\in C^{(4)}[0,1]$,
\[ \bigl\|\{\gamma-Q_{T}(\gamma)\}^{(k)}\bigr\|_{\infty}\le C\bigl\|\gamma^{(4)}\bigr\|_{\infty}h^{4-k}. \]
Lemma A.7. Under Assumptions A2, A3, A5 and A6, there exists an absolute constant $C>0$ such that for the function $\tilde\gamma_{\theta}(u)$ in (A.7),
\[ \sup_{\theta\in S_c^{d-1}}\Bigl\|\frac{d^{k}}{du^{k}}(\tilde\gamma_{\theta}-\gamma_{\theta})\Bigr\|_{\infty} \le C\bigl\|m^{(4)}\bigr\|_{\infty}h^{4-k},\ \mathrm{a.s.},\ 0\le k\le 2. \tag{A.16} \]
Proof. According to Theorem A.1 of Huang (2003), there exists an absolute constant $C>0$ such that
\[ \sup_{\theta\in S_c^{d-1}}\|\tilde\gamma_{\theta}-\gamma_{\theta}\|_{\infty} \le C\sup_{\theta\in S_c^{d-1}}\inf_{\gamma\in\Gamma^{(2)}}\|\gamma-\gamma_{\theta}\|_{\infty} \le C\bigl\|m^{(4)}\bigr\|_{\infty}h^{4},\ \mathrm{a.s.}, \tag{A.17} \]
which proves (A.16) for the case $k=0$. Applying Lemma A.6, one has for $0\le k\le 2$
\[ \sup_{\theta\in S_c^{d-1}}\Bigl\|\frac{d^{k}}{du^{k}}\{Q_{T}(\gamma_{\theta})-\gamma_{\theta}\}\Bigr\|_{\infty} \le C\sup_{\theta\in S_c^{d-1}}\bigl\|\gamma_{\theta}^{(4)}\bigr\|_{\infty}h^{4-k} \le C\bigl\|m^{(4)}\bigr\|_{\infty}h^{4-k}. \tag{A.18} \]
As a consequence of (A.17) and (A.18) for the case $k=0$, one has
\[ \sup_{\theta\in S_c^{d-1}}\|Q_{T}(\gamma_{\theta})-\tilde\gamma_{\theta}\|_{\infty} \le C\bigl\|m^{(4)}\bigr\|_{\infty}h^{4},\ \mathrm{a.s.}, \]
which, according to the differentiation of B-splines given in de Boor (2001), entails that
\[ \sup_{\theta\in S_c^{d-1}}\Bigl\|\frac{d^{k}}{du^{k}}\{Q_{T}(\gamma_{\theta})-\tilde\gamma_{\theta}\}\Bigr\|_{\infty} \le C\bigl\|m^{(4)}\bigr\|_{\infty}h^{4-k},\ \mathrm{a.s.},\ 0\le k\le 2. \tag{A.19} \]
Combining (A.18) and (A.19) proves (A.16) for $k=1,2$.
Lemma A.8. Under Assumptions A1, A2, A4 and A5, there exists an absolute constant $C>0$ such that
\[ \sup_{1\le p\le d}\sup_{\theta\in S_c^{d-1}} \Bigl\|\frac{\partial}{\partial\theta_{p}}\{\tilde\gamma_{\theta}(U_{\theta,i})-\gamma_{\theta}(U_{\theta,i})\}_{i=1}^{n}\Bigr\|_{\infty} \le C\bigl\|m^{(4)}\bigr\|_{\infty}h^{3},\ \mathrm{a.s.}, \tag{A.20} \]
\[ \sup_{1\le p,q\le d}\sup_{\theta\in S_c^{d-1}} \Bigl\|\frac{\partial^{2}}{\partial\theta_{p}\partial\theta_{q}}\{\tilde\gamma_{\theta}(U_{\theta,i})-\gamma_{\theta}(U_{\theta,i})\}_{i=1}^{n}\Bigr\|_{\infty} \le C\bigl\|m^{(4)}\bigr\|_{\infty}h^{2},\ \mathrm{a.s.} \tag{A.21} \]
Proof. According to the definition of $\tilde\gamma_{\theta}$ in (A.7) and the fact that $Q_{T}(\gamma_{\theta})$ is a cubic spline on the knots $T$,
\[ \{\{Q_{T}(\gamma_{\theta})-\tilde\gamma_{\theta}\}(U_{\theta,i})\}_{i=1}^{n} =P_{\theta}\{\{Q_{T}(\gamma_{\theta})-\gamma_{\theta}\}(U_{\theta,i})\}_{i=1}^{n}, \]
which entails that
\[ \frac{\partial}{\partial\theta_{p}}\{\{Q_{T}(\gamma_{\theta})-\tilde\gamma_{\theta}\}(U_{\theta,i})\}_{i=1}^{n} =\dot P_{p}\{\{Q_{T}(\gamma_{\theta})-\gamma_{\theta}\}(U_{\theta,i})\}_{i=1}^{n} +P_{\theta}\frac{\partial}{\partial\theta_{p}}\{\{Q_{T}(\gamma_{\theta})-\gamma_{\theta}\}(U_{\theta,i})\}_{i=1}^{n}. \]
Since
\[ \frac{\partial}{\partial\theta_{p}}\{\{Q_{T}(\gamma_{\theta})-\gamma_{\theta}\}(U_{\theta,i})\}_{i=1}^{n} =\Bigl\{\Bigl(\frac{\partial}{\partial\theta_{p}}Q_{T}(\gamma_{\theta})-\frac{\partial}{\partial\theta_{p}}\gamma_{\theta}\Bigr)(U_{\theta,i})\Bigr\}_{i=1}^{n} +\Bigl\{\frac{d}{du}\{Q_{T}(\gamma_{\theta})-\gamma_{\theta}\}(U_{\theta,i})\,X_{ip}\Bigr\}_{i=1}^{n}, \]
applying (A.19) to the decomposition above produces (A.20). The proof of (A.21) is similar.
Lemma A.9. Under Assumptions A2, A5 and A6, there exists a constant $C>0$ such that
\[ \sup_{\theta\in S_c^{d-1}}\bigl\|n^{-1}B_{\theta}^{T}\bigr\|_{\infty}\le Ch,\ \mathrm{a.s.},\qquad \sup_{1\le p\le d}\sup_{\theta\in S_c^{d-1}}\bigl\|n^{-1}\dot B_{p}^{T}\bigr\|_{\infty}\le C,\ \mathrm{a.s.}, \tag{A.22} \]
\[ \sup_{\theta\in S_c^{d-1}}\|P_{\theta}\|_{\infty}\le C,\ \mathrm{a.s.},\qquad \sup_{1\le p\le d}\sup_{\theta\in S_c^{d-1}}\bigl\|\dot P_{p}\bigr\|_{\infty}\le Ch^{-1},\ \mathrm{a.s.} \tag{A.23} \]
Proof. To prove (A.22), observe that for any vector $a\in R^{n}$, with probability 1,
\[ \bigl\|n^{-1}B_{\theta}^{T}a\bigr\|_{\infty} \le\|a\|_{\infty}\max_{-3\le j\le N}n^{-1}\sum_{i=1}^{n}B_{j,4}(U_{\theta,i})\le Ch\|a\|_{\infty}, \]
\[ \bigl\|n^{-1}\dot B_{p}^{T}a\bigr\|_{\infty} \le\|a\|_{\infty}\max_{-3\le j\le N}\frac{1}{nh} \Bigl|\sum_{i=1}^{n}\{(B_{j,3}-B_{j+1,3})(U_{\theta,i})\}\dot F_{d}(X_{\theta,i})X_{i,p}\Bigr| \le C\|a\|_{\infty}. \]
To prove (A.23), one only needs to use (A.12), (A.22) and (3.1).
Lemma A.10. Under Assumptions A2 and A4--A6, one has with probability 1
\[ \sup_{\theta\in S_c^{d-1}}\Bigl\|\frac{B_{\theta}^{T}E}{n}\Bigr\|_{\infty} =\sup_{\theta\in S_c^{d-1}}\max_{-3\le j\le N} \Bigl|n^{-1}\sum_{i=1}^{n}B_{j,4}(U_{\theta,i})\sigma(X_{i})\varepsilon_{i}\Bigr| =O\Bigl(\frac{\log n}{\sqrt{nN}}\Bigr), \tag{A.24} \]
\[ \sup_{1\le p\le d}\sup_{\theta\in S_c^{d-1}} \Bigl\|\frac{\partial}{\partial\theta_{p}}\frac{B_{\theta}^{T}E}{n}\Bigr\|_{\infty} =\sup_{1\le p\le d}\sup_{\theta\in S_c^{d-1}}\Bigl\|\frac{\dot B_{p}^{T}E}{n}\Bigr\|_{\infty} =O\Bigl(\frac{\log n}{\sqrt{nh}}\Bigr). \tag{A.25} \]
Similarly, under Assumptions A2 and A4--A6, with probability 1,
\[ \sup_{\theta\in S_c^{d-1}}\Bigl\|\frac{B_{\theta}^{T}E_{\theta}}{n}\Bigr\|_{\infty} =\sup_{\theta\in S_c^{d-1}}\max_{-3\le j\le N} \Bigl|n^{-1}\sum_{i=1}^{n}B_{j,4}(U_{\theta,i})\{m(X_{i})-\gamma_{\theta}(U_{\theta,i})\}\Bigr| =O\Bigl(\frac{\log n}{\sqrt{nN}}\Bigr), \tag{A.26} \]
\[ \sup_{1\le p\le d}\sup_{\theta\in S_c^{d-1}} \Bigl\|\frac{\partial}{\partial\theta_{p}}\frac{B_{\theta}^{T}E_{\theta}}{n}\Bigr\|_{\infty} =O\Bigl(\frac{\log n}{\sqrt{nh}}\Bigr),\ \mathrm{a.s.} \tag{A.27} \]
Proof. We decompose the noise variable $\varepsilon_{i}$ into a truncated part and a tail part: $\varepsilon_{i}=\varepsilon_{i,1}^{D_{n}}+\varepsilon_{i,2}^{D_{n}}+m_{i}^{D_{n}}$, where $D_{n}=n^{\eta}$ ($1/3<\eta<2/5$), and
\[ \varepsilon_{i,1}^{D_{n}}=\varepsilon_{i}I\{|\varepsilon_{i}|>D_{n}\},\qquad \varepsilon_{i,2}^{D_{n}}=\varepsilon_{i}I\{|\varepsilon_{i}|\le D_{n}\}-m_{i}^{D_{n}},\qquad m_{i}^{D_{n}}=E\bigl[\varepsilon_{i}I\{|\varepsilon_{i}|\le D_{n}\}\,\big|\,X_{i}\bigr]. \]
It is straightforward to verify that the mean of the truncated part is uniformly bounded by $D_{n}^{-2}$, so the boundedness of the B-spline basis and of the function $\sigma^{2}$ entails
\[ \sup_{\theta\in S_c^{d-1}}\Bigl|n^{-1}\sum_{i=1}^{n}B_{j,4}(U_{\theta,i})\sigma(X_{i})m_{i}^{D_{n}}\Bigr| =O\bigl(D_{n}^{-2}\bigr)=o\bigl(n^{-2/3}\bigr). \]
The tail part vanishes almost surely, since
\[ \sum_{n=1}^{\infty}P\{|\varepsilon_{n}|>D_{n}\}\le\sum_{n=1}^{\infty}D_{n}^{-3}<\infty, \]
and the Borel--Cantelli Lemma implies that
\[ n^{-1}\sum_{i=1}^{n}B_{j,4}(U_{\theta,i})\sigma(X_{i})\varepsilon_{i,1}^{D_{n}} =O\bigl(n^{-k}\bigr),\ \text{for any }k>0. \]
For the truncated part, using Bernstein's inequality and discretization as in Lemma A.2,
\[ \sup_{\theta\in S_c^{d-1}}\sup_{1\le j\le N} \Bigl|n^{-1}\sum_{i=1}^{n}B_{j,4}(U_{\theta,i})\sigma(X_{i})\varepsilon_{i,2}^{D_{n}}\Bigr| =O\bigl(\log n/\sqrt{nN}\bigr),\ \mathrm{a.s.} \]
Therefore (A.24) is established, as with probability 1
\[ \sup_{\theta\in S_c^{d-1}}\Bigl\|\frac{1}{n}B_{\theta}^{T}E\Bigr\|_{\infty} =O\bigl(\log n/\sqrt{nN}\bigr)+O\bigl(n^{-k}\bigr)+o\bigl(n^{-2/3}\bigr) =O\bigl(\log n/\sqrt{nN}\bigr). \]
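The three-part truncation used above is an exact identity for any threshold. A small sketch (ours; the error distribution, threshold exponent and seed are our choices, and for a symmetric error the conditional mean of the truncated part is zero):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
eps = rng.standard_normal(n)         # symmetric errors, so m = 0 below
Dn = n ** 0.35                       # D_n = n^eta with eta in (1/3, 2/5)

tail = eps * (np.abs(eps) > Dn)      # eps * 1{|eps| > D_n}
m = 0.0                              # E[eps * 1{|eps| <= D_n}] = 0 by symmetry
trunc = eps * (np.abs(eps) <= Dn) - m  # centered truncated part
```

The decomposition `tail + trunc + m == eps` holds term by term, and the truncated part is bounded by `Dn`, which is what makes Bernstein's inequality applicable to it.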
The proofs of (A.25) and (A.26) are similar, since $E\{m(X_{i})-\gamma_{\theta}(U_{\theta,i})\,|\,U_{\theta,i}\}\equiv 0$; no truncation is needed for (A.26), as $\sup_{\theta\in S_c^{d-1}}\max_{1\le i\le n}|m(X_{i})-\gamma_{\theta}(U_{\theta,i})|\le C<\infty$. Meanwhile, to prove (A.27), we note that for any $p=1,\dots,d$,
\[ \frac{\partial}{\partial\theta_{p}}\bigl(B_{\theta}^{T}E_{\theta}\bigr) =\Bigl\{\sum_{i=1}^{n}\frac{\partial}{\partial\theta_{p}} \bigl[B_{j,4}(U_{\theta,i})\{m(X_{i})-\gamma_{\theta}(U_{\theta,i})\}\bigr]\Bigr\}_{j=-3}^{N}. \]
According to (2.6), one has $\gamma_{\theta}(U_{\theta})\equiv E\{m(X)\,|\,U_{\theta}\}$, hence
\[ E\bigl[B_{j,4}(U_{\theta})\{m(X)-\gamma_{\theta}(U_{\theta})\}\bigr]\equiv 0,\qquad -3\le j\le N,\ \theta\in S_c^{d-1}. \]
Applying Assumptions A2 and A3, one can differentiate through the expectation, thus
\[ E\frac{\partial}{\partial\theta_{p}}\bigl[B_{j,4}(U_{\theta})\{m(X)-\gamma_{\theta}(U_{\theta})\}\bigr]\equiv 0,\qquad 1\le p\le d,\ -3\le j\le N,\ \theta\in S_c^{d-1}, \]
which allows one to apply Bernstein's inequality to obtain, with probability 1,
\[ \Bigl\|\Bigl\{n^{-1}\sum_{i=1}^{n}\frac{\partial}{\partial\theta_{p}} \bigl[B_{j,4}(U_{\theta,i})\{m(X_{i})-\gamma_{\theta}(U_{\theta,i})\}\bigr]\Bigr\}_{j=-3}^{N}\Bigr\|_{\infty} =O\bigl\{(nh)^{-1/2}\log n\bigr\}, \]
which is (A.27).
Lemma A.11. Under Assumptions A2 and A4--A6, for $\hat\varepsilon_{\theta}(u)$ in (A.9), one has
\[ \sup_{\theta\in S_c^{d-1}}\sup_{u\in[0,1]}|\hat\varepsilon_{\theta}(u)| =O\bigl\{(nh)^{-1/2}\log n\bigr\},\ \mathrm{a.s.} \tag{A.28} \]
Proof. Denote $\hat a\equiv(\hat a_{-3},\dots,\hat a_{N})^{T}=\bigl(B_{\theta}^{T}B_{\theta}\bigr)^{-1}B_{\theta}^{T}E=V_{n,\theta}^{-1}\bigl(n^{-1}B_{\theta}^{T}E\bigr)$; then $\hat\varepsilon_{\theta}(u)=\sum_{j=-3}^{N}\hat a_{j}B_{j,4}(u)$, so the order of $\hat\varepsilon_{\theta}(u)$ is related to that of $\hat a$. In fact, by Theorem 5.4.2 of DeVore and Lorentz (1993),
\[ \sup_{\theta\in S_c^{d-1}}\sup_{u\in[0,1]}|\hat\varepsilon_{\theta}(u)| \le\sup_{\theta\in S_c^{d-1}}\|\hat a\|_{\infty} \le\sup_{\theta\in S_c^{d-1}}\bigl\|V_{n,\theta}^{-1}\bigr\|_{\infty}\bigl\|n^{-1}B_{\theta}^{T}E\bigr\|_{\infty} \le CN\sup_{\theta\in S_c^{d-1}}\bigl\|n^{-1}B_{\theta}^{T}E\bigr\|_{\infty},\ \mathrm{a.s.}, \]
where the last inequality follows from (A.12) of Lemma A.5. Applying (A.24) of Lemma A.10, we have established (A.28).

Lemma A.12. Under Assumptions A2 and A4--A6, for $\tilde\varepsilon_{\theta}(u)$ in (A.8), one has
\[ \sup_{\theta\in S_c^{d-1}}\sup_{u\in[0,1]}|\tilde\varepsilon_{\theta}(u)| =O\bigl\{(nh)^{-1/2}\log n\bigr\},\ \mathrm{a.s.} \tag{A.29} \]
The proof is similar to that of Lemma A.11 and is thus omitted. The next result evaluates the uniform size of the noise derivatives.

Lemma A.13. Under Assumptions A2--A6, one has with probability 1
\[ \sup_{1\le p\le d}\sup_{\theta\in S_c^{d-1}}\max_{1\le i\le n} \Bigl|\frac{\partial}{\partial\theta_{p}}\hat\varepsilon_{\theta}(U_{\theta,i})\Bigr| =O\bigl\{(nh^{3})^{-1/2}\log n\bigr\}, \tag{A.30} \]
\[ \sup_{1\le p\le d}\sup_{\theta\in S_c^{d-1}}\max_{1\le i\le n} \Bigl|\frac{\partial}{\partial\theta_{p}}\tilde\varepsilon_{\theta}(U_{\theta,i})\Bigr| =O\bigl\{(nh^{3})^{-1/2}\log n\bigr\}, \tag{A.31} \]
\[ \sup_{1\le p,q\le d}\sup_{\theta\in S_c^{d-1}}\max_{1\le i\le n} \Bigl|\frac{\partial^{2}}{\partial\theta_{p}\partial\theta_{q}}\hat\varepsilon_{\theta}(U_{\theta,i})\Bigr| =O\bigl\{(nh^{5})^{-1/2}\log n\bigr\}, \tag{A.32} \]
\[ \sup_{1\le p,q\le d}\sup_{\theta\in S_c^{d-1}}\max_{1\le i\le n} \Bigl|\frac{\partial^{2}}{\partial\theta_{p}\partial\theta_{q}}\tilde\varepsilon_{\theta}(U_{\theta,i})\Bigr| =O\bigl\{(nh^{5})^{-1/2}\log n\bigr\}. \tag{A.33} \]
Proof. Note that
\[ \frac{\partial}{\partial\theta_{p}}\{\hat\varepsilon_{\theta}(U_{\theta,i})\}_{i=1}^{n} =(I-P_{\theta})\dot B_{p}\bigl(B_{\theta}^{T}B_{\theta}\bigr)^{-1}B_{\theta}^{T}E +B_{\theta}\bigl(B_{\theta}^{T}B_{\theta}\bigr)^{-1}\dot B_{p}^{T}(I-P_{\theta})E. \]
Applying (A.24) and (A.25) of Lemma A.10, (A.12) of Lemma A.5, and (A.22) and (A.23) of Lemma A.9, one derives (A.30). To prove (A.31), note that
\[ \frac{\partial}{\partial\theta_{p}}\{\tilde\varepsilon_{\theta}(U_{\theta,i})\}_{i=1}^{n} =\frac{\partial}{\partial\theta_{p}}\{P_{\theta}E_{\theta}\} =\dot P_{p}E_{\theta}+P_{\theta}\frac{\partial}{\partial\theta_{p}}E_{\theta}=T_{1}+T_{2}, \tag{A.34} \]
in which
\[ T_{1}=(I-P_{\theta})\Bigl\{\dot B_{p}-B_{\theta}\bigl(B_{\theta}^{T}B_{\theta}\bigr)^{-1}\dot B_{p}^{T}B_{\theta}\Bigr\} \bigl(B_{\theta}^{T}B_{\theta}\bigr)^{-1}B_{\theta}^{T}E_{\theta} =(I-P_{\theta})\Bigl\{\dot B_{p}-B_{\theta}\Bigl(\frac{B_{\theta}^{T}B_{\theta}}{n}\Bigr)^{-1}\frac{\dot B_{p}^{T}B_{\theta}}{n}\Bigr\} \Bigl(\frac{B_{\theta}^{T}B_{\theta}}{n}\Bigr)^{-1}\frac{B_{\theta}^{T}E_{\theta}}{n}, \]
\[ T_{2}=B_{\theta}\Bigl(\frac{B_{\theta}^{T}B_{\theta}}{n}\Bigr)^{-1} \frac{\partial}{\partial\theta_{p}}\Bigl(\frac{B_{\theta}^{T}E_{\theta}}{n}\Bigr). \]
By (A.24), (A.12), (A.22) and (A.23), one derives
\[ \sup_{\theta\in S_c^{d-1}}\|T_{1}\|_{\infty}=O\bigl(n^{-1/2}N^{3/2}\log n\bigr),\ \mathrm{a.s.}, \tag{A.35} \]
while (A.27) of Lemma A.10 and (A.12) of Lemma A.5 give
\[ \sup_{\theta\in S_c^{d-1}}\|T_{2}\|_{\infty} =N\times O\bigl(n^{-1/2}h^{-1/2}\log n\bigr)=O\bigl(n^{-1/2}h^{-3/2}\log n\bigr),\ \mathrm{a.s.} \tag{A.36} \]
Now, putting together (A.34), (A.35) and (A.36), we have established (A.31). The proofs of (A.32) and (A.33) are similar.

Proof of Proposition 2.1. According to the decomposition (A.6),
\[ |\hat\gamma_{\theta}(u)-\gamma_{\theta}(u)| =\bigl|\{\tilde\gamma_{\theta}(u)-\gamma_{\theta}(u)\}+\tilde\varepsilon_{\theta}(u)+\hat\varepsilon_{\theta}(u)\bigr|. \]
Then (2.13) follows directly from (A.16) of Lemma A.7, (A.28) of Lemma A.11 and (A.29) of Lemma A.12. Again by definitions (A.8) and (A.9), we write
\[ \frac{\partial}{\partial\theta_{p}}\{(\hat\gamma_{\theta}-\gamma_{\theta})(U_{\theta,i})\} =\frac{\partial}{\partial\theta_{p}}(\tilde\gamma_{\theta}-\gamma_{\theta})(U_{\theta,i}) +\frac{\partial}{\partial\theta_{p}}\tilde\varepsilon_{\theta}(U_{\theta,i}) +\frac{\partial}{\partial\theta_{p}}\hat\varepsilon_{\theta}(U_{\theta,i}). \]
It is clear from (A.20), (A.30) and (A.31) that with probability 1
\[ \sup_{1\le p\le d}\sup_{\theta\in S_c^{d-1}}\max_{1\le i\le n} \Bigl|\frac{\partial}{\partial\theta_{p}}(\tilde\gamma_{\theta}-\gamma_{\theta})(U_{\theta,i})\Bigr|=O\bigl(h^{3}\bigr), \]
\[ \sup_{1\le p\le d}\sup_{\theta\in S_c^{d-1}}\max_{1\le i\le n} \Bigl|\frac{\partial}{\partial\theta_{p}}\tilde\varepsilon_{\theta}(U_{\theta,i}) +\frac{\partial}{\partial\theta_{p}}\hat\varepsilon_{\theta}(U_{\theta,i})\Bigr| =O\bigl\{(nh^{3})^{-1/2}\log n\bigr\}. \]
Putting together all the above yields (2.14). The proof of (2.15) is similar.

A.3. Proof of Proposition 2.2

Lemma A.14. Under Assumptions A2--A6, one has
\[ \sup_{\theta\in S_c^{d-1}}\bigl|\hat R(\theta)-R(\theta)\bigr|=o(1),\ \mathrm{a.s.} \]
Proof. For the empirical risk function $\hat R(\theta)$ in (2.9), one has
\[ \hat R(\theta)=n^{-1}\sum_{i=1}^{n}\{\hat\gamma_{\theta}(U_{\theta,i})-m(X_{i})-\sigma(X_{i})\varepsilon_{i}\}^{2} =n^{-1}\sum_{i=1}^{n}\{\hat\gamma_{\theta}(U_{\theta,i})-\gamma_{\theta}(U_{\theta,i}) +\gamma_{\theta}(U_{\theta,i})-m(X_{i})-\sigma(X_{i})\varepsilon_{i}\}^{2}, \]
hence
\[ \hat R(\theta)=n^{-1}\sum_{i=1}^{n}\{\hat\gamma_{\theta}(U_{\theta,i})-\gamma_{\theta}(U_{\theta,i})\}^{2} +n^{-1}\sum_{i=1}^{n}\sigma^{2}(X_{i})\varepsilon_{i}^{2} +2n^{-1}\sum_{i=1}^{n}\{\hat\gamma_{\theta}(U_{\theta,i})-\gamma_{\theta}(U_{\theta,i})\} \{\gamma_{\theta}(U_{\theta,i})-m(X_{i})-\sigma(X_{i})\varepsilon_{i}\} +n^{-1}\sum_{i=1}^{n}\{\gamma_{\theta}(U_{\theta,i})-m(X_{i})\}^{2} +2n^{-1}\sum_{i=1}^{n}\{\gamma_{\theta}(U_{\theta,i})-m(X_{i})\}\sigma(X_{i})\varepsilon_{i}, \]
where $\hat\gamma_{\theta}$ is defined in (2.8). Using the expression of $R(\theta)$ in (2.7), one has
\[ \sup_{\theta\in S_c^{d-1}}\bigl|\hat R(\theta)-R(\theta)\bigr|\le I_{1}+I_{2}+I_{3}+I_{4}, \]
with
\[ I_{1}=\sup_{\theta\in S_c^{d-1}}n^{-1}\sum_{i=1}^{n}\{\hat\gamma_{\theta}(U_{\theta,i})-\gamma_{\theta}(U_{\theta,i})\}^{2}, \]
\[ I_{2}=\sup_{\theta\in S_c^{d-1}}\Bigl|2n^{-1}\sum_{i=1}^{n}\{\hat\gamma_{\theta}(U_{\theta,i})-\gamma_{\theta}(U_{\theta,i})\} \{\gamma_{\theta}(U_{\theta,i})-m(X_{i})-\sigma(X_{i})\varepsilon_{i}\}\Bigr|, \]
\[ I_{3}=\sup_{\theta\in S_c^{d-1}}\Bigl|n^{-1}\sum_{i=1}^{n}\{\gamma_{\theta}(U_{\theta,i})-m(X_{i})\}^{2} -E\{\gamma_{\theta}(U_{\theta})-m(X)\}^{2}\Bigr|, \]
\[ I_{4}=\sup_{\theta\in S_c^{d-1}}\Bigl|n^{-1}\sum_{i=1}^{n}\sigma^{2}(X_{i})\varepsilon_{i}^{2}-E\sigma^{2}(X) +2n^{-1}\sum_{i=1}^{n}\{\gamma_{\theta}(U_{\theta,i})-m(X_{i})\}\sigma(X_{i})\varepsilon_{i}\Bigr|. \]
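In practice the empirical risk $\hat R(\theta)$ is evaluated by spline-regressing $Y$ on the rescaled index. The sketch below is our own simplified version (we map the index to $[0,1]$ through its empirical CDF; the knot number, simulation design and seed are assumptions, not the paper's procedure): fitting with the true index should produce a much smaller empirical risk than fitting with an orthogonal direction.

```python
import numpy as np
from scipy.interpolate import BSpline

def spline_design(u, N):
    """n x (N+4) cubic B-spline design matrix on equispaced knots in [0,1]."""
    t = np.concatenate([[0.0] * 3, np.linspace(0, 1, N + 2), [1.0] * 3])
    cols = []
    for j in range(len(t) - 4):
        c = np.zeros(len(t) - 4)
        c[j] = 1.0
        cols.append(np.nan_to_num(BSpline(t, c, 3, extrapolate=False)(u)))
    return np.column_stack(cols)

def risk_hat(theta, X, Y, N=5):
    """Mean squared residual of a spline fit in the index X @ theta."""
    z = X @ theta
    u = (np.argsort(np.argsort(z)) + 0.5) / len(z)  # empirical CDF of index
    B = spline_design(u, N)
    coef, *_ = np.linalg.lstsq(B, Y, rcond=None)
    return np.mean((Y - B @ coef) ** 2)

rng = np.random.default_rng(5)
n = 500
X = rng.uniform(-1, 1, size=(n, 2))
theta0 = np.array([0.8, 0.6])                       # true index direction
Y = np.sin(np.pi * X @ theta0) + 0.1 * rng.standard_normal(n)

r_true = risk_hat(theta0, X, Y)                     # near the noise level
r_wrong = risk_hat(np.array([0.6, -0.8]), X, Y)     # orthogonal direction
```

The gap between `r_true` and `r_wrong` is what the minimization of $\hat R(\theta)$ over the sphere exploits.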
The Bernstein inequality and the strong law of large numbers for $\alpha$-mixing sequences imply that
\[ I_{3}+I_{4}=o(1),\ \mathrm{a.s.} \tag{A.37} \]
Now (2.13) of Proposition 2.1 provides that
\[ \sup_{\theta\in S_c^{d-1}}\sup_{u\in[0,1]}|\hat\gamma_{\theta}(u)-\gamma_{\theta}(u)| =O\bigl(n^{-1/2}h^{-1/2}\log n+h^{4}\bigr),\ \mathrm{a.s.}, \]
which entails that
\[ I_{1}=O\Bigl\{\bigl(n^{-1/2}h^{-1/2}\log n+h^{4}\bigr)^{2}\Bigr\},\ \mathrm{a.s.}, \tag{A.38} \]
\[ I_{2}\le O\bigl\{(nh)^{-1/2}\log n+h^{4}\bigr\} \times\sup_{\theta\in S_c^{d-1}}2n^{-1}\sum_{i=1}^{n} \bigl|\gamma_{\theta}(U_{\theta,i})-m(X_{i})-\sigma(X_{i})\varepsilon_{i}\bigr|. \]
Hence
\[ I_{2}=O\bigl(n^{-1/2}h^{-1/2}\log n+h^{4}\bigr),\ \mathrm{a.s.} \tag{A.39} \]
The lemma now follows from (A.37), (A.38), (A.39) and Assumption A6.
Lemma A.15. Under Assumptions A2--A6, one has
\[ \sup_{\theta\in S_c^{d-1}}\sup_{1\le p\le d} \Bigl|\frac{\partial}{\partial\theta_{p}}\bigl\{\hat R(\theta)-R(\theta)\bigr\} -n^{-1}\sum_{i=1}^{n}\xi_{\theta,i,p}\Bigr| =o\bigl(n^{-1/2}\bigr),\ \mathrm{a.s.}, \tag{A.40} \]
in which
\[ \xi_{\theta,i,p}=2\{\gamma_{\theta}(U_{\theta,i})-Y_{i}\}\frac{\partial}{\partial\theta_{p}}\gamma_{\theta}(U_{\theta,i}) -\frac{\partial}{\partial\theta_{p}}R(\theta),\qquad E(\xi_{\theta,i,p})=0. \tag{A.41} \]
Furthermore, for $k=1,2$,
\[ \sup_{\theta\in S_c^{d-1}}\Bigl\|\frac{\partial^{k}}{\partial\theta^{k}}\bigl\{\hat R(\theta)-R(\theta)\bigr\}\Bigr\| =O\bigl\{n^{-1/2}h^{-1/2-k}\log n+h^{4-k}\bigr\},\ \mathrm{a.s.} \tag{A.42} \]
Proof. Note that for any $p=1,2,\dots,d$,
\[ \frac12\frac{\partial}{\partial\theta_{p}}\hat R(\theta) =n^{-1}\sum_{i=1}^{n}\{\hat\gamma_{\theta}(U_{\theta,i})-Y_{i}\} \frac{\partial}{\partial\theta_{p}}\hat\gamma_{\theta}(U_{\theta,i}), \]
\[ \frac12\frac{\partial}{\partial\theta_{p}}R(\theta) =E\{\gamma_{\theta}(U_{\theta})-m(X)\}\frac{\partial}{\partial\theta_{p}}\gamma_{\theta}(U_{\theta}) =E\{\gamma_{\theta}(U_{\theta})-m(X)-\sigma(X)\varepsilon\}\frac{\partial}{\partial\theta_{p}}\gamma_{\theta}(U_{\theta}). \]
Thus $E(\xi_{\theta,i,p})=2E\bigl[\{\gamma_{\theta}(U_{\theta,i})-Y_{i}\}\frac{\partial}{\partial\theta_{p}}\gamma_{\theta}(U_{\theta,i})\bigr]-\frac{\partial}{\partial\theta_{p}}R(\theta)=0$ and
\[ \frac12\frac{\partial}{\partial\theta_{p}}\bigl\{\hat R(\theta)-R(\theta)\bigr\} =(2n)^{-1}\sum_{i=1}^{n}\xi_{\theta,i,p}+J_{1,\theta,p}+J_{2,\theta,p}+J_{3,\theta,p}, \tag{A.43} \]
with
\[ J_{1,\theta,p}=n^{-1}\sum_{i=1}^{n}\{\hat\gamma_{\theta}(U_{\theta,i})-\gamma_{\theta}(U_{\theta,i})\} \frac{\partial}{\partial\theta_{p}}(\hat\gamma_{\theta}-\gamma_{\theta})(U_{\theta,i}), \]
\[ J_{2,\theta,p}=n^{-1}\sum_{i=1}^{n}\{\gamma_{\theta}(U_{\theta,i})-m(X_{i})-\sigma(X_{i})\varepsilon_{i}\} \frac{\partial}{\partial\theta_{p}}(\hat\gamma_{\theta}-\gamma_{\theta})(U_{\theta,i}), \]
\[ J_{3,\theta,p}=n^{-1}\sum_{i=1}^{n}\{\hat\gamma_{\theta}(U_{\theta,i})-\gamma_{\theta}(U_{\theta,i})\} \frac{\partial}{\partial\theta_{p}}\gamma_{\theta}(U_{\theta,i}). \]
The Bernstein inequality implies that
\[ \sup_{\theta\in S_c^{d-1}}\sup_{1\le p\le d} \Bigl|n^{-1}\sum_{i=1}^{n}\xi_{\theta,i,p}\Bigr|=O\bigl(n^{-1/2}\log n\bigr),\ \mathrm{a.s.} \tag{A.44} \]
Meanwhile, applying (2.13) and (2.14) of Proposition 2.1, one obtains that
\[ \sup_{\theta\in S_c^{d-1}}\sup_{1\le p\le d}|J_{1,\theta,p}| =O\bigl\{(nh)^{-1/2}\log n+h^{4}\bigr\}\times O\bigl\{(nh^{3})^{-1/2}\log n+h^{3}\bigr\} =O\bigl(n^{-1}h^{-2}\log^{2}n+h^{7}\bigr),\ \mathrm{a.s.} \tag{A.45} \]
Note that
\[ J_{2,\theta,p}=n^{-1}\sum_{i=1}^{n}\{\gamma_{\theta}(U_{\theta,i})-m(X_{i})-\sigma(X_{i})\varepsilon_{i}\} \frac{\partial}{\partial\theta_{p}}(\tilde\gamma_{\theta}-\gamma_{\theta})(U_{\theta,i}) -n^{-1}(E+E_{\theta})^{T}\frac{\partial}{\partial\theta_{p}}\{P_{\theta}(E+E_{\theta})\}. \]
Applying (A.20), one gets
\[ \sup_{\theta\in S_c^{d-1}}\sup_{1\le p\le d} \Bigl|J_{2,\theta,p}+n^{-1}(E+E_{\theta})^{T}\frac{\partial}{\partial\theta_{p}}\{P_{\theta}(E+E_{\theta})\}\Bigr| =O\bigl(h^{3}\bigr),\ \mathrm{a.s.}, \]
while (A.24), (A.26) and (A.12) entail that, with probability 1,
\[ \sup_{\theta\in S_c^{d-1}}\sup_{1\le p\le d} \Bigl|n^{-1}(E+E_{\theta})^{T}\frac{\partial}{\partial\theta_{p}}\{P_{\theta}(E+E_{\theta})\}\Bigr| =O\bigl\{(nN)^{-1/2}\log n\bigr\}\times N\times N\times O\bigl\{(nN)^{-1/2}\log n\bigr\} =O\bigl(n^{-1}N\log^{2}n\bigr); \]
thus
\[ \sup_{\theta\in S_c^{d-1}}\sup_{1\le p\le d}|J_{2,\theta,p}| =O\bigl(h^{3}+n^{-1}N\log^{2}n\bigr),\ \mathrm{a.s.} \tag{A.46} \]
Lastly,
\[ J_{3,\theta,p}-n^{-1}\sum_{i=1}^{n}(\tilde\gamma_{\theta}-\gamma_{\theta})(U_{\theta,i}) \frac{\partial}{\partial\theta_{p}}\gamma_{\theta}(U_{\theta,i}) =n^{-1}(E+E_{\theta})^{T}B_{\theta}\Bigl(\frac{B_{\theta}^{T}B_{\theta}}{n}\Bigr)^{-1} \frac{B_{\theta}^{T}}{n}\frac{\partial}{\partial\theta_{p}}\gamma_{\theta}. \]
By applying (A.24), (A.26) and (A.12), it is clear that with probability 1
\[ \sup_{\theta\in S_c^{d-1}}\sup_{1\le p\le d} \Bigl|\bigl(n^{-1}B_{\theta}^{T}E+n^{-1}B_{\theta}^{T}E_{\theta}\bigr)^{T} \Bigl(\frac{B_{\theta}^{T}B_{\theta}}{n}\Bigr)^{-1} \frac{B_{\theta}^{T}}{n}\frac{\partial}{\partial\theta_{p}}\gamma_{\theta}\Bigr| =O\bigl\{(nN)^{-1/2}\log n\bigr\}\times N\times O\bigl\{h+(nN)^{-1/2}\log n\bigr\} =O\bigl\{n^{-1}\log^{2}n+(nN)^{-1/2}\log n\bigr\}, \]
while, by applying (A.16) of Lemma A.7, one has
\[ \sup_{\theta\in S_c^{d-1}}\sup_{1\le p\le d} \Bigl|n^{-1}\sum_{i=1}^{n}(\tilde\gamma_{\theta}-\gamma_{\theta})(U_{\theta,i}) \frac{\partial}{\partial\theta_{p}}\gamma_{\theta}(U_{\theta,i})\Bigr|=O\bigl(h^{4}\bigr),\ \mathrm{a.s.}; \]
together, the above entail that
\[ \sup_{\theta\in S_c^{d-1}}\sup_{1\le p\le d}|J_{3,\theta,p}| =O\bigl\{h^{4}+n^{-1}\log^{2}n+(nN)^{-1/2}\log n\bigr\},\ \mathrm{a.s.} \tag{A.47} \]
Therefore, (A.43), (A.45), (A.46), (A.47) and Assumption A6 lead to (A.40), which, together with (A.44), establishes (A.42) for $k=1$. Note that the second-order derivatives of $\hat R(\theta)$ and $R(\theta)$ with respect to $\theta_{p},\theta_{q}$ are
\[ 2n^{-1}\Bigl[\sum_{i=1}^{n}\{\hat\gamma_{\theta}(U_{\theta,i})-Y_{i}\} \frac{\partial^{2}}{\partial\theta_{p}\partial\theta_{q}}\hat\gamma_{\theta}(U_{\theta,i}) +\sum_{i=1}^{n}\frac{\partial}{\partial\theta_{q}}\hat\gamma_{\theta}(U_{\theta,i}) \frac{\partial}{\partial\theta_{p}}\hat\gamma_{\theta}(U_{\theta,i})\Bigr], \]
\[ 2E\Bigl[\{\gamma_{\theta}(U_{\theta})-m(X)\}\frac{\partial^{2}}{\partial\theta_{p}\partial\theta_{q}}\gamma_{\theta}(U_{\theta}) +\frac{\partial}{\partial\theta_{q}}\gamma_{\theta}(U_{\theta}) \frac{\partial}{\partial\theta_{p}}\gamma_{\theta}(U_{\theta})\Bigr], \]
respectively. The proof of (A.42) for $k=2$ follows from (2.13), (2.14) and (2.15).
Proof of Proposition 2.2. The result follows from Lemma A.14, Lemma A.15 and equations (A.50) and (A.51).

A.4. Proof of Theorem 2

Let $\hat S_{p}^{*}(\theta_{-d})$ be the $p$-th element of $\hat S^{*}(\theta_{-d})$, and for $\gamma_{\theta}$ in (2.6) denote
\[ \eta_{i,p}:=2\bigl\{\dot\gamma_{p}-\theta_{0,p}\theta_{0,d}^{-1}\dot\gamma_{d}\bigr\}(U_{\theta_{0},i}) \bigl\{\gamma_{\theta_{0}}(U_{\theta_{0},i})-Y_{i}\bigr\}, \]
where $\dot\gamma_{p}$ is the value of $\partial\gamma_{\theta}/\partial\theta_{p}$ at $\theta=\theta_{0}$, for any $p,q=1,2,\dots,d-1$.
Lemma A.16. Under Assumptions A2--A6, one has
\[ \sup_{1\le p\le d-1}\Bigl|\hat S_{p}^{*}(\theta_{0,-d})-n^{-1}\sum_{i=1}^{n}\eta_{i,p}\Bigr| =o\bigl(n^{-1/2}\bigr),\ \mathrm{a.s.} \]
Proof. For any $p=1,\dots,d-1$,
\[ \hat S_{p}^{*}(\theta_{-d})-S_{p}^{*}(\theta_{-d}) =\Bigl(\frac{\partial}{\partial\theta_{p}}-\theta_{p}\theta_{d}^{-1}\frac{\partial}{\partial\theta_{d}}\Bigr) \bigl\{\hat R(\theta)-R(\theta)\bigr\}. \tag{A.48} \]
Therefore, according to (A.40), (A.41) and (A.48),
\[ n^{-1}\sum_{i=1}^{n}\eta_{i,p} =n^{-1}\sum_{i=1}^{n}\xi_{\theta_{0},i,p}-\theta_{0,p}\theta_{0,d}^{-1}n^{-1}\sum_{i=1}^{n}\xi_{\theta_{0},i,d},\qquad E(\eta_{i,p})=0, \]
\[ \sup_{1\le p\le d-1}\Bigl|\hat S_{p}^{*}(\theta_{0,-d})-S_{p}^{*}(\theta_{0,-d})-n^{-1}\sum_{i=1}^{n}\eta_{i,p}\Bigr| =o\bigl(n^{-1/2}\bigr),\ \mathrm{a.s.} \tag{A.49} \]
Since $S^{*}(\theta_{-d})$ attains its minimum at $\theta_{0,-d}$, for $p=1,\dots,d-1$,
\[ S_{p}^{*}(\theta_{0,-d})\equiv\Bigl(\frac{\partial}{\partial\theta_{p}} -\theta_{p}\theta_{d}^{-1}\frac{\partial}{\partial\theta_{d}}\Bigr)R(\theta)\Big|_{\theta=\theta_{0}}\equiv 0, \]
which, together with (A.49), yields the lemma.
Lemma A.17. The $(p,q)$-th entry of the Hessian matrix $H^{*}(\theta_{0,-d})$ equals $l_{p,q}$ given in Theorem 2.

Proof. It is easy to show that for any $p,q=1,2,\dots,d$,
\[ \frac{\partial}{\partial\theta_{p}}R(\theta) =\frac{\partial}{\partial\theta_{p}}E\{m(X)-\gamma_{\theta}(U_{\theta})\}^{2} =-2E\Bigl[\{m(X)-\gamma_{\theta}(U_{\theta})\}\frac{\partial}{\partial\theta_{p}}\gamma_{\theta}(U_{\theta})\Bigr], \]
\[ \frac{\partial^{2}}{\partial\theta_{p}\partial\theta_{q}}R(\theta) =-2E\Bigl[\{m(X)-\gamma_{\theta}(U_{\theta})\}\frac{\partial^{2}}{\partial\theta_{p}\partial\theta_{q}}\gamma_{\theta}(U_{\theta}) -\frac{\partial}{\partial\theta_{p}}\gamma_{\theta}(U_{\theta})\frac{\partial}{\partial\theta_{q}}\gamma_{\theta}(U_{\theta})\Bigr]. \]
Since $\theta_{d}=\bigl(1-\|\theta_{-d}\|_{2}^{2}\bigr)^{1/2}$, so that $\partial\theta_{d}/\partial\theta_{p}=-\theta_{p}/\theta_{d}$, the chain rule gives
\[ \frac{\partial}{\partial\theta_{p}}R^{*}(\theta_{-d}) =\frac{\partial}{\partial\theta_{p}}R(\theta)-\frac{\theta_{p}}{\theta_{d}}\frac{\partial}{\partial\theta_{d}}R(\theta), \tag{A.50} \]
\[ \frac{\partial^{2}}{\partial\theta_{p}\partial\theta_{q}}R^{*}(\theta_{-d}) =\frac{\partial^{2}}{\partial\theta_{p}\partial\theta_{q}}R(\theta) -\frac{\theta_{q}}{\theta_{d}}\frac{\partial^{2}}{\partial\theta_{p}\partial\theta_{d}}R(\theta) -\frac{\theta_{p}}{\theta_{d}}\frac{\partial^{2}}{\partial\theta_{d}\partial\theta_{q}}R(\theta) +\frac{\theta_{p}\theta_{q}}{\theta_{d}^{2}}\frac{\partial^{2}}{\partial\theta_{d}^{2}}R(\theta) -\Bigl\{\frac{\delta_{pq}}{\theta_{d}}+\frac{\theta_{p}\theta_{q}}{\bigl(1-\|\theta_{-d}\|_{2}^{2}\bigr)^{3/2}}\Bigr\} \frac{\partial}{\partial\theta_{d}}R(\theta). \tag{A.51} \]
Substituting the first- and second-order derivatives of $R(\theta)$ into (A.51) and evaluating at $\theta=\theta_{0}$ gives exactly $l_{p,q}$, which is the desired result.

Proof of Theorem 2. For any $p=1,2,\dots,d-1$, let $f_{p}(t)=\hat S_{p}^{*}\bigl\{t\hat\theta_{-d}+(1-t)\theta_{0,-d}\bigr\}$, $t\in[0,1]$; then
\[ \frac{d}{dt}f_{p}(t)=\sum_{q=1}^{d-1}\frac{\partial}{\partial\theta_{q}} \hat S_{p}^{*}\bigl\{t\hat\theta_{-d}+(1-t)\theta_{0,-d}\bigr\}\,\bigl(\hat\theta_{q}-\theta_{0,q}\bigr). \]
Note that $\hat S^{*}(\theta_{-d})$ attains its minimum at $\hat\theta_{-d}$, i.e., $\hat S_{p}^{*}(\hat\theta_{-d})\equiv 0$. Thus, for each $p=1,2,\dots,d-1$ there exists $t_{p}\in[0,1]$ such that, by the mean value theorem,
\[ -\hat S_{p}^{*}(\theta_{0,-d})=\hat S_{p}^{*}(\hat\theta_{-d})-\hat S_{p}^{*}(\theta_{0,-d}) =f_{p}(1)-f_{p}(0) =\Bigl[\frac{\partial^{2}}{\partial\theta_{q}\partial\theta_{p}} \hat R^{*}\bigl\{t_{p}\hat\theta_{-d}+(1-t_{p})\theta_{0,-d}\bigr\}\Bigr]_{q=1,\dots,d-1}^{T} \bigl(\hat\theta_{-d}-\theta_{0,-d}\bigr), \]
and hence
\[ -\hat S^{*}(\theta_{0,-d}) =\Bigl[\frac{\partial^{2}}{\partial\theta_{q}\partial\theta_{p}} \hat R^{*}\bigl\{t_{p}\hat\theta_{-d}+(1-t_{p})\theta_{0,-d}\bigr\}\Bigr]_{p,q=1,\dots,d-1} \bigl(\hat\theta_{-d}-\theta_{0,-d}\bigr). \]
Now (2.11) of Theorem 1 and Proposition 2.2 with $k=2$ imply that, uniformly in $p,q=1,2,\dots,d-1$,
\[ \frac{\partial^{2}}{\partial\theta_{q}\partial\theta_{p}} \hat R^{*}\bigl\{t_{p}\hat\theta_{-d}+(1-t_{p})\theta_{0,-d}\bigr\}\longrightarrow l_{p,q},\ \mathrm{a.s.}, \tag{A.52} \]
where $l_{p,q}$ is given in Theorem 2. Noting that $\sqrt{n}\bigl(\hat\theta_{-d}-\theta_{0,-d}\bigr)$ is represented as
\[ -\Bigl[\frac{\partial^{2}}{\partial\theta_{q}\partial\theta_{p}} \hat R^{*}\bigl\{t_{p}\hat\theta_{-d}+(1-t_{p})\theta_{0,-d}\bigr\}\Bigr]_{p,q=1,\dots,d-1}^{-1} \sqrt{n}\,\hat S^{*}(\theta_{0,-d}), \]
where $\hat S^{*}(\theta_{0,-d})=\bigl\{\hat S_{p}^{*}(\theta_{0,-d})\bigr\}_{p=1}^{d-1}$, and according to (A.48) and Lemma A.16,
\[ \hat S_{p}^{*}(\theta_{0,-d})=n^{-1}\sum_{i=1}^{n}\eta_{i,p}+o\bigl(n^{-1/2}\bigr),\ \mathrm{a.s.},\qquad E(\eta_{i,p})=0. \]
Let $\Psi(\theta_{0})=(\psi_{pq})_{p,q=1}^{d-1}$ be the covariance matrix of $\bigl\{\sqrt{n}\,\hat S_{p}^{*}(\theta_{0,-d})\bigr\}_{p=1}^{d-1}$, with $\psi_{pq}$ given in Theorem 2. The Cram\'er--Wold device and the central limit theorem for $\alpha$-mixing sequences entail that
\[ \sqrt{n}\,\hat S^{*}(\theta_{0,-d})\stackrel{d}{\longrightarrow}N\{0,\Psi(\theta_{0})\}. \]
Let $\Sigma(\theta_{0})=\{H^{*}(\theta_{0,-d})\}^{-1}\Psi(\theta_{0})\bigl[\{H^{*}(\theta_{0,-d})\}^{T}\bigr]^{-1}$, with $H^{*}(\theta_{0,-d})$ being the Hessian matrix defined in (2.3). The above limiting distribution of $\sqrt{n}\,\hat S^{*}(\theta_{0,-d})$, (A.52) and Slutsky's theorem imply that
\[ \sqrt{n}\bigl(\hat\theta_{-d}-\theta_{0,-d}\bigr)\stackrel{d}{\longrightarrow}N\{0,\Sigma(\theta_{0})\}. \]

References
Bosq, D. (1998). Nonparametric Statistics for Stochastic Processes. Springer-Verlag, New York.
Carroll, R., Fan, J., Gijbels, I. and Wand, M. P. (1997). Generalized partially linear single-index models. J. Amer. Statist. Assoc. 92, 477-489.
Chen, H. (1991). Estimation of a projection-pursuit type regression model. Ann. Statist. 19, 142-157.
de Boor, C. (2001). A Practical Guide to Splines. Springer-Verlag, New York.
DeVore, R. A. and Lorentz, G. G. (1993). Constructive Approximation: Polynomials and Splines Approximation. Springer-Verlag, Berlin.
Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall, London.
Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc. 76, 817-823.
Härdle, W. (1990). Applied Nonparametric Regression. Cambridge University Press, Cambridge.
Härdle, W., Hall, P. and Ichimura, H. (1993). Optimal smoothing in single-index models. Ann. Statist. 21, 157-178.
Härdle, W. and Stoker, T. M. (1989). Investigating smooth multiple regression by the method of average derivatives. J. Amer. Statist. Assoc. 84, 986-995.
Hall, P. (1989). On projection pursuit regression. Ann. Statist. 17, 573-588.
Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Chapman and Hall, London.
Horowitz, J. L. and Härdle, W. (1996). Direct semiparametric estimation of single-index models with discrete covariates. J. Amer. Statist. Assoc. 91, 1632-1640.
Hristache, M., Juditski, A. and Spokoiny, V. (2001). Direct estimation of the index coefficients in a single-index model. Ann. Statist. 29, 595-623.
Huang, J. Z. (2003). Local asymptotics for polynomial spline regression. Ann. Statist. 31, 1600-1635.
Huang, J. and Yang, L. (2004). Identification of nonlinear additive autoregressive models. J. R. Stat. Soc. Ser. B Stat. Methodol. 66, 463-477.
Huber, P. J. (1985). Projection pursuit (with discussion). Ann. Statist. 13, 435-525.
Ichimura, H. (1993). Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. Journal of Econometrics 58, 71-120.
Klein, R. W. and Spady, R. H. (1993). An efficient semiparametric estimator for binary response models. Econometrica 61, 387-421.
Mammen, E., Linton, O. and Nielsen, J. (1999). The existence and asymptotic properties of a backfitting projection algorithm under weak conditions. Ann. Statist. 27, 1443-1490.
Pham, D. T. (1986). The mixing properties of bilinear and generalized random coefficient autoregressive models. Stochastic Anal. Appl. 23, 291-300.
Powell, J. L., Stock, J. H. and Stoker, T. M. (1989). Semiparametric estimation of index coefficients. Econometrica 57, 1403-1430.
Tong, H. (1990). Nonlinear Time Series: A Dynamical System Approach. Oxford University Press, Oxford.
Tong, H., Thanoon, B. and Gudmundsson, G. (1985). Threshold time series modeling of two Icelandic riverflow systems. In Time Series Analysis in Water Resources (ed. K. W. Hipel), American Water Research Association.
Wang, L. and Yang, L. (2007). Spline-backfitted kernel smoothing of nonlinear additive autoregression model. Ann. Statist. Forthcoming.
Xia, Y. and Li, W. K. (1999). On single-index coefficient regression models. J. Amer. Statist. Assoc. 94, 1275-1285.
Xia, Y., Li, W. K., Tong, H. and Zhang, D. (2004). A goodness-of-fit test for single-index models. Statist. Sinica 14, 1-39.
Xia, Y., Tong, H., Li, W. K. and Zhu, L. (2002). An adaptive estimation of dimension reduction space. J. R. Stat. Soc. Ser. B Stat. Methodol. 64, 363-410.
Xue, L. and Yang, L. (2006a). Estimation of semiparametric additive coefficient model. J. Statist. Plann. Inference 136, 2506-2534.
Xue, L. and Yang, L. (2006b). Additive coefficient modeling via polynomial spline. Statistica Sinica 16, 1423-1446.
Table 2: Report of Example 2 (average MSE and computing time of MAVE and SIP)

n      d     MSE (MAVE)   MSE (SIP)   Time (MAVE)   Time (SIP)
50     4     0.00020      0.00018     1.91          0.19
50     10    0.00031      0.00043     2.17          0.10
50     30    0.00106      0.00285     2.77          0.13
50     50    0.00031      0.00043     3.29          0.10
50     100   0.00681      0.00620     5.94          0.31
50     200   0.00529      0.00407     27.90         0.49
100    4     0.00008      0.00008     3.28          0.09
100    10    0.00012      0.00017     3.93          0.13
100    30    0.00017      0.00058     5.41          0.15
100    50    0.00032      0.00127     8.48          0.16
100    100   —            0.00395     —             0.44
100    200   —            0.00324     —             0.73
200    4     0.00004      0.00003     5.32          0.17
200    10    0.00005      0.00007     7.49          0.24
200    30    0.00006      0.00017     10.08         0.26
200    50    0.00007      0.00030     15.42         0.24
200    100   0.00015      0.00061     40.81         0.54
200    200   —            0.00197     —             1.44
500    4     0.00002      0.00001     14.44         0.76
500    10    0.00002      0.00003     24.54         0.79
500    30    0.00002      0.00008     32.51         0.83
500    50    0.00002      0.00010     52.93         0.89
500    100   0.00003      0.00012     143.07        0.99
500    200   0.00004      0.00020     386.80        1.96
500    400   —            0.00054     —             4.98
1000   4     0.00001      0.00001     33.57         1.95
1000   10    0.00001      0.00001     62.54         3.64
1000   30    0.00001      0.00002     92.41         1.95
1000   50    0.00001      0.00003     155.38        2.72
1000   100   0.00001      0.00005     275.73        1.81
1000   200   0.00008      0.00006     2432.56       2.84
1000   400   —            0.00010     —             9.35
Figure 1: Example 1. (a) and (b): plots of the actual surface m in model (4.1) for δ = 0, 1; (c) and (d): plots of various univariate functions for δ = 0, 1: the data {X_i^T θ̂, Y_i}, 1 ≤ i ≤ 50 (dots); the univariate function g (solid line); the estimate of g obtained by plugging in the true index coefficient θ_0 (dotted line); and the estimate of g obtained by plugging in the estimated index coefficient θ̂ (dashed line), with θ̂ = (0.69016, 0.72365)^T for δ = 0 and θ̂ = (0.72186, 0.69204)^T for δ = 1.
Figure 2: Example 2. For n = 100 and d = 10, 50, 100, 200 (panels (a)-(d)): plots of the spline estimator of g with the estimated index parameter θ̂ (dotted curve), the cubic spline estimator of g with the true index parameter θ_0 (dashed curve), the true function m(x) in (4.2) (solid curve), and the data scatter plots (dots).
Figure 3: Example 2. Kernel density estimators of the 100 values of ‖θ̂ − θ_0‖/√d, for d = 10, 50, 100, 200 (panels (a)-(d)), with n = 100, 200, 500, 1000 in each panel.
Figure 4: Time plots of the daily Jökulsá Eystri River data: (a) river flow Y_t (solid line) with its trend (dashed line); (b) temperature X_t (solid line) with its trend (dashed line); (c) precipitation Z_t (solid line) with its trend (dashed line).
Figure 5: (a) Scatter plot of the river flow ("+") and fitted river flow (line); (b) residuals of the fitted SIP model; (c) out-of-sample rolling forecasts (line) of the river flow for the entire third year ("+"), based on the first two years' river flow.