CONDITIONAL PREDICTIVE REGIONS FOR STOCHASTIC PROCESSES

Qiwei Yao
Institute of Mathematics and Statistics
University of Kent at Canterbury
Canterbury, Kent CT2 7NF, UK

Abstract

Studies on pointwise prediction of nonlinear time series have revealed some distinguishing features of nonlinear prediction. Significantly different from linear prediction, the prediction accuracy for nonlinear time series depends on the current position in the state space. Furthermore, a small perturbation in the current position can lead to considerable errors in nonlinear prediction (see Yao and Tong 1994a). The above features can also be observed in state-dependent interval/region prediction. In this paper, we consider two types of region predictors, the maximum conditional density region and the shortest conditional modal interval, both of them depending on the current position in the state space. When the underlying conditional distribution is skewed or multi-modal, the two proposed predictors outperform the predictive interval constructed from the conditional percentiles. Based on the data $\{(X_t, Y_t);\ 1 \le t \le n\}$, which are the observations from a strictly stationary process, the conditional distribution function of $Y_t$ given $X_t$ is estimated using the kernel regression procedure. Further, we estimate the two predictive regions for $Y_t$ from $X_t$ $(t > n)$ based on the estimated conditional distribution functions. The sensitivity measure defined in the form of the Kullback-Leibler information is used to monitor the errors in the coverage probabilities of the predictive regions, as well as the changes in the width of the conditional modal intervals, due to small perturbations in $X_t$. Both simulated and real data sets are used as illustrations.
KEY WORDS: Kernel regression estimation; Maximum conditional density region (MCDR); Nonlinear time series; Sensitivity measure; Shortest conditional modal interval (SCMI).

1 Introduction

In any serious attempt to forecast, a point prediction is only a beginning. A predictive interval or, more generally, a predictive region is much more informative. Further, the complete picture is fully appreciated only if the predictive distribution is available. This paper is devoted to studying predictive regions as well as predictive distributions under a general set-up which includes nonlinear time series models as special cases.

Recent studies on pointwise prediction of nonlinear time series have revealed three distinguishing features of nonlinear prediction: (i) the dependence of the prediction accuracy on the current position in the state space; (ii) the sensitivity of the predictor to small perturbations in the current state; and (iii) the non-monotonicity of the prediction accuracy in multi-step prediction (see Yao and Tong 1994a, and the references therein). It is also our interest in this paper to discuss the implications of these features for interval/region prediction.

In the context of linear time series models with normally distributed errors, the predictive distributions are normal. Therefore, the predictive intervals are easily obtained as the mean plus and minus a multiple of the standard deviation. The width of such an interval, even for multi-step ahead prediction, is constant over the whole state space unless the model has heteroscedastic noise. Predictive intervals constructed in this way have also been used in some special nonlinear models (e.g. threshold autoregressive models; see Tong and Moeanaddin 1988, Davies, Pemberton and Petruccelli 1988). However, the above method is no longer pertinent when the predictive distribution is not normal, which, unfortunately, is the case for most nonlinear time series models (cf. Chan and Tong 1994).

Yao and Tong (1994b) proposed constructing predictive intervals from conditional percentiles for nonlinear time series. The width of the conditional percentile interval does vary with the current position in the state space, which reflects the variation of the prediction accuracy. However, interval predictors so constructed are not always appropriate when the predictive distributions are asymmetric or multi-modal. Asymmetric distributions have been widely used in modelling economic activities (cf. Brannas and Gooijer 1992 and the references therein). Further, skewed predictive distributions may occur in multi-step ahead prediction even when the errors in the models have symmetric distributions. Multi-modal phenomena often indicate model uncertainty; the uncertainty may be caused by factors beyond the variables specified in the prediction. (See §2.1 below for some examples.)

In order to cope with the possible skewness and multi-modality of the underlying predictive

distribution, we propose two alternatives: the shortest conditional modal interval (SCMI) and the maximum conditional density region (MCDR). When the predictive distribution is unimodal and symmetric, both the SCMI and the MCDR reduce to the predictive interval constructed from conditional percentiles. Hyndman (1995) proposed an estimator of the MCDR for parametric nonlinear models based on a simulation method; his approach also assumes that the form of the distribution of the noise in the model is known. Fan, Yao and Tong (1993) presented a nonparametric estimator for predictive density functions, the key idea there being to treat the estimation of a conditional density as a regression problem. In the same vein, we propose a kernel estimator for conditional distribution functions using the Nadaraya-Watson kernel regression estimator or the locally linear regression technique (cf. Fan 1992). Further, the estimators for both the SCMI and the MCDR are constructed directly from the estimated predictive distribution.

To monitor the effect on predictive regions of small perturbations in the current position, the sensitivity measure defined in the form of the Kullback-Leibler information is applied (see Yao and Tong 1994b, and Fan, Yao and Tong 1993). A quantitative description is obtained of the changes in the coverage probabilities of predictive regions due to a small shift in the current position. Some indication is also given of the changes in the width of the SCMI.

The paper is organized as follows. §2 gives a description of the SCMI and the MCDR. The advantages of the two proposed methods are demonstrated through two simple time series models. The results on the sensitivity of the predictive regions are also presented in §2. The kernel regression estimators of predictive distribution functions are derived in §3, where their asymptotic normality is stated and the estimation of the SCMI and the MCDR is discussed. In §4, the methodology is illustrated with two simulated models and a set of rainfall-flow data from a catchment in Wales.

2 Predictive Regions

Suppose that $X$ is an observable $d$-dimensional random vector, and $Y$ is a random variable which is unobservable. It is of interest to predict $Y$ from $X$. The (theoretical) least squares point predictor is $E\{Y \mid X\}$. Since $Y$ is a random variable, a predictive interval/region is much more relevant. To this end, for a given $\alpha \in [0, 1]$, we search for a region $\Lambda_\alpha(X) \subset R$ for which
\[ P\{ Y \in \Lambda_\alpha(x) \mid X = x \} = \alpha, \qquad x \in R^d. \tag{2.1} \]
We call $\Lambda_\alpha$ a predictive region with coverage probability $\alpha$. When $\Lambda_\alpha$ is a connected set, we also call it a predictive interval. A natural predictive interval for $Y$ is
\[ [\,\xi_{0.5-\alpha/2}(X),\ \xi_{0.5+\alpha/2}(X)\,], \tag{2.2} \]
where $\xi_\alpha(x)$ is the conditional $\alpha$-percentile of $Y$ given $X = x$, i.e. $P\{Y \le \xi_\alpha(x) \mid X = x\} = \alpha$ (cf. Yao and Tong 1994b, 1995a). In what follows, we always use $g(\cdot \mid x)$ to denote the conditional density function of $Y$ given $X = x$.
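As a concrete illustration (not part of the original development), the percentile interval (2.2) can be read off any conditional distribution function evaluated on a grid of $y$ values. A minimal Python sketch, assuming the grid and the CDF values are supplied by the user:

```python
import numpy as np
from scipy.stats import norm

def percentile_interval(y_grid, F_y, alpha):
    """Conditional percentile interval (2.2), read off a conditional CDF
    F(.|x) evaluated on an increasing grid of y values."""
    lo = y_grid[np.searchsorted(F_y, 0.5 - alpha / 2)]
    hi = y_grid[np.searchsorted(F_y, 0.5 + alpha / 2)]
    return lo, hi

# Check against a standard normal predictive distribution, alpha = 0.9:
y = np.linspace(-5.0, 5.0, 2001)
print(percentile_interval(y, norm.cdf(y), 0.9))  # approx (-1.645, 1.645)
```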

2.1 SCMI and MCDR

We propose two alternative predictors which may outperform the conditional percentile interval (2.2) when the underlying predictive distribution is asymmetric or multi-modal.

Shortest conditional modal interval (SCMI). For any given $\alpha \in [0, 1]$ and $x \in R^d$, define
\[ b_\alpha(x, y) = \min\Big\{ b > 0 : \int_{y-b}^{y+b} g(u \mid x)\, du \ge \alpha \Big\}, \qquad |y| < \infty. \tag{2.3} \]
Let
\[ b_\alpha(x) = \min_{|y| < \infty} b_\alpha(x, y), \tag{2.4} \]
and let $y_\alpha(x)$ denote a point at which the minimum is attained. The SCMI is the interval $[\,y_\alpha(x) - b_\alpha(x),\ y_\alpha(x) + b_\alpha(x)\,]$, the shortest interval centred at a conditional modal point which carries conditional probability at least $\alpha$.

Maximum conditional density region (MCDR). The MCDR is defined as
\[ \Lambda_\alpha(x) = \{\, y : g(y \mid x) \ge c_\alpha(x) \,\}, \tag{2.5} \]
where $c_\alpha(x)$ is the largest constant for which $P\{Y \in \Lambda_\alpha(x) \mid X = x\} \ge \alpha$. The MCDR has the smallest Lebesgue measure among all the regions fulfilling (2.1), and it may consist of several disconnected intervals when $g(\cdot \mid x)$ is multi-modal.

For the quadratic model (2.6), the conditional distribution of $X_{t+m}$ given $X_t$, $m > 1$, is not necessarily symmetric. For example, an estimated conditional density function of $X_{t+3}$ given $X_t = 8$ is displayed in Fig.1(a), which is obviously skewed to the right. The curve was estimated using 1000 independent samples, each of them generated by iterating equation (2.6) three times with the starting point at 8. The kernel density estimator was used with a Gaussian kernel and bandwidth 0.6164.

Table 1. Three predictive regions with coverage probability $\alpha$: conditional percentile intervals (CPI), SCMI and MCDR. The percentage decreases in width relative to the CPIs are recorded in parentheses.

  alpha  Type   Regions for X_{t+3} from X_t     Regions for Y_{t+1} from Y_t
                at X_t = 8 for model (2.6)       at Y_t = 0 for model (2.7)
  0.95   CPI    [5.11, 14.96]                    [1.23, 8.30]
         SCMI   [6.50, 15.30]  (10.57%)          [1.50, 8.42]  (2.12%)
         MCDR   (same as SCMI)                   [1.15, 4.84] ∪ [7.54, 8.46]  (34.79%)
  0.70   CPI    [8.66, 13.95]                    [2.26, 8.04]
         SCMI   [9.64, 14.15]  (8.72%)           [2.80, 8.29]  (5.02%)
         MCDR   (same as SCMI)                   [2.18, 3.82] ∪ [7.66, 8.34]  (59.86%)
  0.50   CPI    [9.83, 12.95]                    [2.71, 7.89]
         SCMI   [10.69, 13.56]  (8.01%)          [1.80, 4.20]  (53.67%)
         MCDR   (same as SCMI)                   [2.60, 3.40] ∪ [7.71, 8.29]  (73.36%)

Based on the estimated density function, predictive regions for $X_{t+3}$ from $X_t$, with three different values of the coverage probability, are specified in Table 1. Obviously, the accuracy of prediction has been considerably improved by using SCMIs, in the sense that the decrease in the width of the predictive intervals is significant. Note that the SCMI and the MCDR coincide in this case since the conditional density function is unimodal. (Fig.1 is about here.)
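Computationally, the SCMI is the shortest interval $[c, d]$ with $F(d \mid x) - F(c \mid x) \ge \alpha$ (this equivalent form is formalized in (3.6) below), so it can be located by a single two-pointer scan over a grid carrying the conditional distribution function. A minimal sketch, under the same grid conventions as the earlier fragment:

```python
import numpy as np

def scmi(y_grid, F_y, alpha):
    """Shortest interval [c, d] on the grid with F(d|x) - F(c|x) >= alpha,
    i.e. the SCMI in its equivalent shortest-interval form (cf. (3.6))."""
    best_c, best_d = y_grid[0], y_grid[-1]
    j = 0
    for i in range(len(y_grid)):
        # advance the right end point until the interval carries mass alpha
        while j < len(y_grid) and F_y[j] - F_y[i] < alpha:
            j += 1
        if j == len(y_grid):
            break  # no interval starting this far right carries mass alpha
        if y_grid[j] - y_grid[i] < best_d - best_c:
            best_c, best_d = y_grid[i], y_grid[j]
    return best_c, best_d
```

Since the CDF values are monotone, the right end point never moves backwards, so the scan is linear in the grid size.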

We now consider the model
\[ Y_t = 3\cos(Y_{t-1}/10) + X_t + \frac{\varepsilon_t}{0.8 X_t + 1}, \tag{2.7} \]
where $\{\varepsilon_t\}$ is a sequence of independent standard normal random variables, and $\{X_t\}$ is a sequence of independent and identically distributed random variables with $P(X_1 = 0) = 0.65 = 1 - P(X_1 = 5)$. We assume that the two sequences are independent of each other. For the sake of illustration, we assume that the `exogenous' variable $X_t$ is unobservable, and we predict $Y_{t+1}$ from $Y_t$ only. Thus the (theoretical) least squares conditional point predictor is $3\cos(0.1 Y_t) + 1.75$, which is obviously not satisfactory. It is easy to see that the conditional distribution of $Y_{t+1}$ given $Y_t$ is a mixture of two normal distributions. For example, given $Y_t = 0$, the conditional density function of $Y_{t+1}$ is $0.65\varphi(y - 3) + 1.75\varphi(5y - 40)$ (see Fig.1(b)), where $\varphi(\cdot)$ denotes the standard normal density function. The distribution has two modes. Although the global mode is at 8, the `local mode' at 3 is more important in the sense that it attracts more probability mass. From Table 1 we can see that the percentile interval (2.2) fails to do a good job simply because the predictive intervals are too wide. The improvement from using SCMIs is not substantial unless the coverage probability $\alpha$ is small enough that the probability mass within one mode exceeds $\alpha$. The MCDR is much shorter in length (i.e. Lebesgue measure), and therefore offers a much more accurate prediction. Note that all the MCDRs consist of two disconnected intervals centred at 3 and 8 respectively. This fact clearly reveals the uncertainty in $Y_{t+1}$ caused by the `hidden' variable $X_{t+1}$. The coverage probabilities of the two intervals centred at 3 and 8 are 0.61 and 0.34, 0.38 and 0.32, and 0.20 and 0.30 in order, as the global coverage probability is 0.95, 0.70 and 0.50 respectively. As shown in the above example, a multi-modal predictive distribution indicates uncertainty in the predicted variable $Y$. The uncertainty may be due to the fact that not all the relevant factors are encompassed in $X$, which can happen for practical or sometimes technical reasons.
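Because this mixture density is available in closed form, the MCDR entries of Table 1 can be reproduced, up to grid error, by thresholding the density at the appropriate level. The following sketch is purely illustrative: it uses the known density rather than the nonparametric estimates of §3.

```python
import numpy as np
from scipy.stats import norm

# Conditional density of Y_{t+1} given Y_t = 0 under model (2.7):
# the normal mixture 0.65*phi(y - 3) + 1.75*phi(5y - 40).
y = np.linspace(-2.0, 12.0, 14001)
g = 0.65 * norm.pdf(y - 3.0) + 1.75 * norm.pdf(5.0 * y - 40.0)

def mcdr_mask(y, g, alpha):
    """Highest-density region {y : g(y|x) >= c} on a grid, with the
    threshold c chosen so that the region carries probability ~ alpha."""
    dy = y[1] - y[0]
    order = np.argsort(g)[::-1]          # grid points by decreasing density
    mass = np.cumsum(g[order]) * dy      # accumulated probability mass
    k = np.searchsorted(mass, alpha)     # smallest set reaching mass alpha
    mask = np.zeros(y.shape, dtype=bool)
    mask[order[: k + 1]] = True
    return mask

def components(y, mask):
    """Split a boolean grid mask into its maximal intervals."""
    idx = np.flatnonzero(mask)
    breaks = np.flatnonzero(np.diff(idx) > 1)
    starts = np.r_[idx[:1], idx[breaks + 1]]
    ends = np.r_[idx[breaks], idx[-1:]]
    return [(y[s], y[e]) for s, e in zip(starts, ends)]

for alpha in (0.95, 0.70, 0.50):
    print(alpha, components(y, mcdr_mask(y, g, alpha)))
# Each region splits into two intervals around the modes at 3 and 8,
# broadly matching the MCDR column of Table 1.
```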

2.2 Sensitivity to small errors in X

In practice, observations will almost certainly be subject to measurement or rounding errors. Studies of pointwise prediction show that a small error in the observation of the current position can lead to considerable divergence in prediction, especially where multi-step ahead prediction is concerned (see Yao and Tong 1994a, and Tong 1995). To monitor such divergence in interval prediction, we adopt the sensitivity measure defined in the form of the Kullback-Leibler information (Yao and Tong 1994b). We assume that $g(y \mid x)$ is smooth almost surely in both $x$ and $y$, and that partial differentiation with respect to $x$ and integration with respect to $y$ of the function $g(y \mid x)$ are interchangeable where required. We also assume that the integrals in (2.8) and (2.9) below exist and are finite.

Suppose we observe $X = x$, which has a small shift from the true but unobservable value $X = x + \delta$. The divergence between the two conditional distributions of $Y$ given $X = x$ and $X = x + \delta$ respectively is
\[ K(x; \delta) = \int \{g(y \mid x+\delta) - g(y \mid x)\} \log\Big\{\frac{g(y \mid x+\delta)}{g(y \mid x)}\Big\}\, dy = \delta^\tau I(x)\, \delta + o(\|\delta\|^2), \tag{2.8} \]
where
\[ I(x) = \int \dot{g}(y \mid x)\, \dot{g}(y \mid x)^\tau / g(y \mid x)\, dy \tag{2.9} \]
is Fisher's information matrix if we treat $x$ as a parameter vector. In the above expression, $\dot{g}(y \mid x)$ denotes the vector of partial derivatives of $g$ with respect to $x$. Expression (2.8) indicates that the more information $Y$ contains about $X$ at $X = x$, the more sensitively the distribution $g(\cdot \mid x)$ depends on $x$. For a predictive region $\Lambda_\alpha(\cdot)$ defined as in (2.1), we consider two kinds of errors due to the disturbance in the observed value of $X$: the error in the coverage probability, and the change in the Lebesgue measure of $\Lambda_\alpha$. The latter is more difficult, and a general answer is still outstanding.

Theorem 1. (i) Assume that (2.1) holds for $\Lambda_\alpha(\cdot)$. Then for any $x, \delta \in R^d$ such that $\Lambda_\alpha(x)$ is a bounded set,
\[ \big| P\{Y \in \Lambda_\alpha(x) \mid X = x + \delta\} - \alpha \big| \le \alpha^{1/2} \{\delta^\tau I(x)\, \delta\}^{1/2} + O(\|\delta\|^2). \]
(ii) For $x, \delta \in R^d$, $y \in R$ and $b_\alpha(x, y)$ defined as in (2.3),
\[ \big| b_\alpha(x + \delta, y) - b_\alpha(x, y) \big| \le \frac{\alpha^{1/2} \{\delta^\tau I(x)\, \delta\}^{1/2}}{g(y + b_\alpha(x, y) \mid x) + g(y - b_\alpha(x, y) \mid x)} + O(\|\delta\|^2). \]
In the above expressions, $I(x)$ is defined as in (2.9).

The above theorem indicates that the sensitivity of the predictive regions can be monitored by the sensitivity measure $I(\cdot)$ of (2.9). The estimation of $I(\cdot)$ using locally polynomial regression was discussed in Fan, Yao and Tong (1993). Theorem 1(i) was presented in a slightly different form in Yao and Tong (1995b). Theorem 1(ii) gives an upper bound for the change in the width of the $\alpha$-modal interval (see Remark 1(v)), which can be taken as an indirect indication of the change in the width of the SCMI (cf. (2.3) and (2.4)).
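As a numerical illustration of the sensitivity measure (2.9), consider the hypothetical toy model $Y \mid X = x \sim N(m(x), 1)$ with $m(x) = 3\cos(x/10)$, for which $I(x) = \{m'(x)\}^2$ in closed form. The sketch below checks this against a finite-difference evaluation of the integral in (2.9); the model, the point $x = 8$ and the step size are illustrative choices only.

```python
import numpy as np
from scipy.stats import norm

def g(y, x):
    # conditional density of Y given X = x in the toy model
    return norm.pdf(y - 3.0 * np.cos(x / 10.0))

x, eps = 8.0, 1e-5
y = np.linspace(-10.0, 10.0, 20001)
g_dot = (g(y, x + eps) - g(y, x - eps)) / (2.0 * eps)  # d/dx of g(y|x)
I_num = np.trapz(g_dot ** 2 / g(y, x), y)              # integral in (2.9)
I_exact = (0.3 * np.sin(x / 10.0)) ** 2                # {m'(x)}^2
print(I_num, I_exact)                                  # the two values agree
```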

Proof of Theorem 1. It follows from (2.1) that
\begin{align*}
P\{Y \in \Lambda_\alpha(x) \mid X = x + \delta\} &= \int_{\Lambda_\alpha(x)} g(y \mid x + \delta)\, dy \\
&= \int_{\Lambda_\alpha(x)} \{g(y \mid x) + \delta^\tau \dot{g}(y \mid x)\}\, dy + O(\|\delta\|^2) \\
&= \alpha + \int_{\Lambda_\alpha(x)} \delta^\tau \dot{g}(y \mid x)\, dy + O(\|\delta\|^2).
\end{align*}
By the Cauchy-Schwarz inequality, it holds that
\[ \Big| \int_{\Lambda_\alpha(x)} \delta^\tau \dot{g}(y \mid x)\, dy \Big| \le \Big\{ \int_{\Lambda_\alpha(x)} g(y \mid x)\, dy \int \frac{\{\delta^\tau \dot{g}(y \mid x)\}^2}{g(y \mid x)}\, dy \Big\}^{1/2} \le \alpha^{1/2} \{\delta^\tau I(x)\, \delta\}^{1/2}. \tag{2.10} \]
The conclusion (i) follows immediately.

To prove (ii), it is easy to see from (2.3) that
\begin{align*}
\alpha &= \int_{-b_\alpha(x+\delta, y)}^{b_\alpha(x+\delta, y)} g(y + u \mid x + \delta)\, du \\
&= \int_{-b_\alpha(x+\delta, y)}^{b_\alpha(x+\delta, y)} \{g(y + u \mid x) + \delta^\tau \dot{g}(y + u \mid x)\}\, du + O(\|\delta\|^2) \\
&= \int_{-b_\alpha(x+\delta, y)}^{b_\alpha(x+\delta, y)} g(y + u \mid x)\, du + \int_{-b_\alpha(x, y)}^{b_\alpha(x, y)} \delta^\tau \dot{g}(y + u \mid x)\, du + O(\|\delta\|^2). \tag{2.11}
\end{align*}
Without loss of generality, we assume that $b_\alpha(x + \delta, y) > b_\alpha(x, y)$. Then
\begin{align*}
\int_{-b_\alpha(x+\delta, y)}^{b_\alpha(x+\delta, y)} g(y + u \mid x)\, du &= \alpha + \int_{b_\alpha(x, y)}^{b_\alpha(x+\delta, y)} g(y + u \mid x)\, du + \int_{-b_\alpha(x+\delta, y)}^{-b_\alpha(x, y)} g(y + u \mid x)\, du \\
&= \alpha + \big| b_\alpha(x + \delta, y) - b_\alpha(x, y) \big| \{g(y + b_\alpha(x, y) \mid x) + g(y - b_\alpha(x, y) \mid x)\} + O(\|\delta\|^2).
\end{align*}
Substituting the above into (2.11), we have that
\[ \big| b_\alpha(x + \delta, y) - b_\alpha(x, y) \big| = \frac{\big| \int_{-b_\alpha(x, y)}^{b_\alpha(x, y)} \delta^\tau \dot{g}(y + u \mid x)\, du \big|}{g(y + b_\alpha(x, y) \mid x) + g(y - b_\alpha(x, y) \mid x)} + O(\|\delta\|^2). \]
The conclusion (ii) now follows immediately from (2.10). The proof is complete.

3 Estimation

We assume that $\{(Y_t, X_t)\}$ is a strictly stationary process whose marginal distribution is the same as the distribution of $(Y, X)$ described in §2. In the case that $X_t = (Y_{t-1}, \ldots, Y_{t-d})^\tau$, we effectively deal with the time series $\{Y_t\}$ and predict $Y_t$ from its lagged values. Of interest is the estimation of the SCMI and the MCDR for $Y$ based on the observations $(Y_1, X_1), \ldots, (Y_n, X_n)$.

For any fixed $x \in R^d$, if we had a set of observations from the conditional distribution of $Y$ given $X = x$, we could construct the predictive regions directly from the sample (cf. Lientz 1970, Robertson and Cryer 1974). However, such samples seldom exist in reality. Note that both the SCMI and the MCDR are defined in terms of the conditional density function $g(y \mid x)$ (see §2.1). In principle, we could construct estimators for them based on estimators of $g(y \mid x)$, and Fan, Yao and Tong (1993) developed an estimator of the conditional density function based on the locally polynomial regression method. However, recent studies on the estimation of level sets indicate that estimators based on distribution functions are more direct and also have some interesting asymptotic properties (see, e.g. Einmahl and Mason 1992, Polonik 1995a,b). Let $F(\cdot \mid x)$ be the predictive distribution function of $Y$ given $X = x$, namely $F(y \mid x) = \int_{-\infty}^y g(u \mid x)\, du$. In the sequel, we first estimate the conditional distribution function $F(y \mid x)$ in §3.1. The estimators for both the SCMI and the MCDR are then constructed from the estimated conditional distribution functions in §3.2.

3.1 Estimation of predictive distribution functions

Estimating the conditional distribution can be regarded as a nonparametric regression problem. To make this connection, note that
\[ E\{ I_{\{Y \le y\}} \mid X = x \} = F(y \mid x), \]
where $I_A$ denotes the indicator function of the set $A$. The left hand side of the above expression can be regarded as the regression function of the data $Z \equiv I_{\{Y \le y\}}$ on $X$. This idea was used by Fan, Yao and Tong (1993) in the estimation of conditional density functions. Recent nonparametric regression theory (see Fan 1992, and Ruppert and Wand 1994) suggests that we may use a locally polynomial fit to estimate $F(y \mid x)$. Since our primary interest is in the estimation of the function $F(y \mid x)$ itself, we use locally linear regression (cf. Fan et al. 1993). For fixed $y \in R$, let $Z_t = I_{\{Y_t \le y\}}$ for $t = 1, \ldots, n$. By Taylor expansion about $x \in R^d$, we have that
\[ E\{Z \mid X = x'\} = F(y \mid x') \approx F(y \mid x) + (x' - x)^\tau \dot{F}(y \mid x), \]
where $\dot{F}(y \mid x)$ denotes the vector of partial derivatives of $F(y \mid x)$ with respect to $x$. Considerations of this nature suggest the following least squares problem: let $\hat\beta_0$ and $\hat\beta_1$ minimize
\[ \sum_{t=1}^n \{Z_t - \beta_0 - \beta_1^\tau (X_t - x)\}^2\, K\Big(\frac{X_t - x}{h}\Big), \tag{3.1} \]
where $K(\cdot)$ is a nonnegative function on $R^d$ which serves as a kernel function, and $h > 0$ is the bandwidth, controlling the size of the local neighbourhood. Clearly $\hat\beta_0$ and $\hat\beta_1$ estimate $F(y \mid x)$ and $\dot{F}(y \mid x)$ respectively, namely
\[ \hat{F}(y \mid x) = \hat\beta_0, \qquad \hat{\dot{F}}(y \mid x) = \hat\beta_1. \]
Let $\hat\beta_n = (\hat\beta_0, \hat\beta_1^\tau)^\tau$. Least squares theory gives
\[ \hat\beta_n = (\mathbf{X}^\tau W \mathbf{X})^{-1} \mathbf{X}^\tau W Z, \]
where $Z = (Z_1, \ldots, Z_n)^\tau$, $W = \mathrm{diag}\{K(\frac{X_1 - x}{h}), \ldots, K(\frac{X_n - x}{h})\}$, and $\mathbf{X}$ is an $n \times (d+1)$ matrix with $(1, (X_i - x)^\tau)$ as its $i$-th row. More specifically,

\[ \hat{F}(y \mid x) = \frac{1}{nh} \sum_{t=1}^n W_{n,1}\Big(\frac{X_t - x}{h},\, x\Big) Z_t, \qquad \hat{\dot{F}}(y \mid x) = \frac{1}{nh^2} \sum_{t=1}^n W_{n,2}\Big(\frac{X_t - x}{h},\, x\Big) Z_t, \tag{3.2} \]
where
\[ W_{n,1}(t, x) = e_1^\tau S_n^{-1}(x) (1, t^\tau)^\tau K(t), \qquad W_{n,2}(t, x) = (e_2, \ldots, e_{d+1})^\tau S_n^{-1}(x) (1, t^\tau)^\tau K(t). \]
In the above expressions, $e_i$ is the unit vector with $i$-th element 1, and $S_n(x)$ is the $(d+1) \times (d+1)$ matrix
\[ S_n(x) = \begin{pmatrix} s_{11} & S_{21}^\tau \\ S_{21} & S_{22} \end{pmatrix}, \]
where $s_{11} = (nh)^{-1} \sum_{t=1}^n K(\frac{X_t - x}{h})$, and
\[ S_{21} = \frac{1}{nh^2} \sum_{t=1}^n (X_t - x) K\Big(\frac{X_t - x}{h}\Big), \qquad S_{22} = \frac{1}{nh^3} \sum_{t=1}^n (X_t - x)(X_t - x)^\tau K\Big(\frac{X_t - x}{h}\Big). \]

Alternatively, we may consider a locally constant fit, i.e. we estimate $F(y \mid x)$ by the $\hat\beta_0$ which minimizes (3.1) under the constraint $\beta_1 \equiv 0$. This leads to the Nadaraya-Watson kernel estimator
\[ \hat{F}(y \mid x) = \sum_{t=1}^n I_{\{Y_t \le y\}} K\Big(\frac{X_t - x}{h}\Big) \Big/ \sum_{t=1}^n K\Big(\frac{X_t - x}{h}\Big). \tag{3.3} \]
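For a scalar covariate and a Gaussian kernel, (3.3) transcribes directly into code; the following is an illustrative sketch rather than a prescribed implementation.

```python
import numpy as np

def nw_cdf(y, x, X, Y, h):
    """Nadaraya-Watson estimate (3.3) of F(y|x): a locally constant fit of
    the indicators I{Y_t <= y} on X_t, with Gaussian kernel weights."""
    w = np.exp(-0.5 * ((X - x) / h) ** 2)   # proportional to K((X_t - x)/h)
    return np.sum(w * (Y <= y)) / np.sum(w)
```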

To state the asymptotic normality of the above estimators, we need the following assumptions.

(A1) Let $p(\cdot)$ denote the marginal density function of $X$, and let $V(x) = \int y^2 g(y \mid x)\, dy$. For the given $x \in R^d$, there exists a neighbourhood of $x$ on which both $p$ and $V$ have continuous partial derivatives, and on which the conditional distribution function $F(y \mid x)$ has continuous third partial derivatives with respect to $x$.

(A2) The joint density of the distinct elements of $(X_1, X_k)$ $(k > 0)$ is bounded by a constant independent of $k$.

(A3) The strictly stationary process $\{(X_i, Y_i);\ i \ge 1\}$ satisfies the condition that
\[ \rho_j \equiv \sup_{i \ge 1}\ \sup_{U \in \mathcal{F}_1^i,\, V \in \mathcal{F}_{i+j}^\infty} |\mathrm{Corr}(U, V)| \to 0 \quad \text{as } j \to \infty, \tag{3.4} \]
where $\mathcal{F}_i^j$ is the $\sigma$-field generated by $\{(X_k, Y_k) : k = i, \ldots, j\}$ $(j \ge i)$.

(A4) The kernel $K(\cdot)$ is a bounded, symmetric probability density function on $R^d$ with $\int_{R^d} u u^\tau K(u)\, du = \lambda_0^2 I_d$, and $|K(x) - K(y)| \le c\|x - y\|$ for any $x, y \in R^d$.

(A5) The bandwidth $h \to 0$, $nh^{2+d} \to \infty$, and $(\log n)/(nh^d) \to 0$ as $n \to \infty$.

The condition (3.4) entails that the process is $\rho$-mixing. In fact, an autoregressive process satisfying some mild conditions is $\rho$-mixing (cf. Section 4.4 of Györfi et al. 1989).

Theorem 2. Suppose that conditions (A2)-(A5) hold, and that $x \in \{u : p(u) > 0\}$ is a point for which condition (A1) holds. Then for any $y \in R$ with $\sigma^2(x, y) \equiv F(y \mid x)\{1 - F(y \mid x)\} \ne 0$, the statements (i) and (ii) below hold.

(i) For the locally linear regression estimator defined in (3.2),
\[ \sqrt{nh^d}\, \Big\{ \hat{F}(y \mid x) - F(y \mid x) - \frac{h^2 \lambda_0^2}{2}\, \mathrm{tr}\{\ddot{F}(y \mid x)\} \Big\} \xrightarrow{\ d\ } N\Big(0,\ p^{-1}(x)\, \sigma^2(x, y) \int K^2(u)\, du\Big). \]

(ii) For the Nadaraya-Watson estimator defined in (3.3),
\[ \sqrt{nh^d}\, \Big\{ \hat{F}(y \mid x) - F(y \mid x) - h^2 \lambda_0^2 \Big( \frac{\mathrm{tr}\{\ddot{F}(y \mid x)\}}{2} + \frac{\dot{F}(y \mid x)^\tau \dot{p}(x)}{p(x)} \Big) \Big\} \xrightarrow{\ d\ } N\Big(0,\ p^{-1}(x)\, \sigma^2(x, y) \int K^2(u)\, du\Big). \]

In the above expressions, $\ddot{F}(y \mid x)$ denotes the Hessian matrix of $F(y \mid x)$ with respect to $x$, $p(\cdot)$ is the marginal density function of $X$, and $\lambda_0^2$ is given as in (A4).

Remark 2. Theorem 2 indicates that the Nadaraya-Watson estimator has one more term in the bias than the locally linear regression estimator. In fact, the locally linear regression method has some other appealing features over the Nadaraya-Watson estimator (cf. Fan 1992, Fan et al. 1993). However, in practice the $\hat{F}(y \mid x)$ given in (3.2) need not lie in the interval [0, 1], whereas the Nadaraya-Watson estimator (3.3) is always between 0 and 1 and is monotonically increasing in $y$.

When $d = 1$, Theorem 2(i) is a corollary of Theorem 1 of Yao and Tong (1995a). A more direct proof can be given along the lines of that in the Appendix of Fan, Yao and Tong (1993). The proof for multivariate $x$ involves more technical details but no fundamentally new ideas; we therefore omit it here. The estimator $\hat{\dot{F}}(y \mid x)$ given in (3.2) is also asymptotically normal, which we do not state explicitly since it is beyond the scope of this paper.

We are not going to discuss in any detail how to choose the bandwidth $h$, although $h$ plays a very important role in the trade-off between reducing bias and variance (see Theorem 2 above).

In practice, an automatically selected bandwidth is often a useful starting point. A data-driven bandwidth derived from the cross-validation approach is given by
\[ \hat{h} = \arg\min_{h > 0} \sum_{t=1}^n \int \big\{ I_{\{Y_t \le y\}} - \hat{F}_{-t}(y \mid X_t) \big\}^2 w(X_t)\, dy, \tag{3.5} \]
where $\hat{F}_{-t}$ is the estimator of $F$ given as in (3.2) or (3.3) but computed from the sample $\{(X_i, Y_i) \mid 1 \le i \le n,\ i \ne t\}$, and $w(\cdot) \ge 0$ is a weight function.
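The criterion (3.5) can be minimized over a grid of candidate bandwidths, with the leave-one-out estimates obtained by zeroing the diagonal of the kernel weight matrix. The sketch below uses the Nadaraya-Watson form (3.3), takes $w(\cdot) \equiv 1$, and approximates the integral by the trapezoidal rule; these are all simplifying assumptions.

```python
import numpy as np

def cv_bandwidth(X, Y, h_grid, y_grid):
    """Leave-one-out cross-validation (3.5) for the bandwidth of the
    Nadaraya-Watson conditional-CDF estimator, with w(.) taken as 1."""
    Z = (Y[None, :] <= y_grid[:, None]).astype(float)  # I{Y_t <= y}, grid x n
    best_h, best_score = None, np.inf
    for h in h_grid:
        W = np.exp(-0.5 * ((X[:, None] - X[None, :]) / h) ** 2)
        np.fill_diagonal(W, 0.0)            # drop (X_t, Y_t): leave one out
        F_loo = (Z @ W) / W.sum(axis=0)     # F-hat_{-t}(y | X_t), grid x n
        score = np.trapz(((Z - F_loo) ** 2).sum(axis=1), y_grid)
        if score < best_score:
            best_h, best_score = h, score
    return best_h
```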

3.2 Estimators of predictive regions

It is easy to see from (2.3) and (2.4) that for any $\alpha \in [0, 1]$ and $x \in R^d$, the SCMI for $Y$ given $X = x$ can also be defined as
\[ [a, b] = \arg\min\big\{ \mathrm{Leb}([c, d]) : F(d \mid x) - F(c \mid x) \ge \alpha \big\}, \tag{3.6} \]
where $\mathrm{Leb}(C)$ denotes the Lebesgue measure of the set $C$. Therefore, a natural estimator for the SCMI is
\[ [\hat{a}, \hat{b}] = \arg\min\big\{ \mathrm{Leb}([c, d]) : \hat{F}(d \mid x) - \hat{F}(c \mid x) \ge \alpha \big\}, \tag{3.7} \]
where $\hat{F}(\cdot \mid \cdot)$ is defined as in (3.2) or (3.3).
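Composing the sketches given earlier, the estimator (3.7) amounts to plugging the estimated conditional distribution function into the shortest-interval search. In this hypothetical fragment, the data arrays X and Y, the point x0 and the bandwidth h are assumed given:

```python
import numpy as np

# assumes nw_cdf (after (3.3)) and scmi (after Table 1) from the sketches above
y_grid = np.linspace(Y.min(), Y.max(), 1000)
F_hat = np.array([nw_cdf(y, x0, X, Y, h) for y in y_grid])
c_hat, d_hat = scmi(y_grid, F_hat, 0.9)      # the SCMI estimator (3.7)
```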

The estimation of the MCDR is not as straightforward. Let us assume that the conditional density function $g(\cdot \mid x)$ has $k$ modes $(k \ge 1)$. Then the MCDR for $Y$ given $X = x$ can be equivalently defined as
\[ \bigcup_{i=1}^k [a_i, b_i] = \arg\min\Big\{ \mathrm{Leb}\Big(\bigcup_{i=1}^k [c_i, d_i]\Big) : c_1 < d_1 \le c_2 < d_2 \le \cdots \le c_k < d_k \ \text{and}\ \sum_{i=1}^k \{F(d_i \mid x) - F(c_i \mid x)\} \ge \alpha \Big\} \tag{3.8} \]
(see (2.5)). Consequently, an estimator for the MCDR can be defined as
\[ \bigcup_{i=1}^k [\hat{a}_i, \hat{b}_i] = \arg\min\Big\{ \mathrm{Leb}\Big(\bigcup_{i=1}^k [c_i, d_i]\Big) : c_1 < d_1 \le c_2 < d_2 \le \cdots \le c_k < d_k \ \text{and}\ \sum_{i=1}^k \{\hat{F}(d_i \mid x) - \hat{F}(c_i \mid x)\} \ge \alpha \Big\}. \tag{3.9} \]
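A crude plug-in version of (3.9) can be obtained by differencing the estimated conditional distribution function and reusing the highest-density construction sketched after model (2.7). This is only a heuristic starting point; in particular, it does not enforce the explicit constraint of at most $k$ intervals.

```python
# assumes y_grid and F_hat from the previous fragment, and mcdr_mask,
# components from the sketch following model (2.7)
g_hat = np.gradient(F_hat, y_grid)           # crude density estimate from F-hat
intervals = components(y_grid, mcdr_mask(y_grid, g_hat, 0.9))
# Following Remark 3 below, intervals that are very close together or that
# carry very small estimated coverage may be merged or dropped, reducing k.
```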

Remark 3. The MCDR may consist of fewer than $k$ disconnected intervals even though the density function $g(\cdot \mid x)$ has $k$ modes. In fact, the definition given in (3.8) remains valid as long as the number of modes of the density function $g(\cdot \mid x)$ is not greater than $k$. However, due to sampling fluctuation, the estimator (3.9) is quite likely to comprise $k$ disconnected intervals even if the MCDR (3.8) does not. When $x$ is a scalar, we might in practice be able to identify the disconnection due to sampling fluctuation by looking at various plots (cf. Example 2 in §4). When $d > 1$, the task is more difficult. To obtain a preliminary indication, we can estimate $g(\cdot \mid x)$ using the method of Fan, Yao and Tong (1993). On the other hand, empirically we may start with a value of $k$ large enough that the number of modes is not expected to exceed $k$. If some of the intervals in (3.9) are very close to each other or have very small coverage probability, we can reduce the value of $k$ accordingly.

4 Examples

We illustrate the methods via the following two simulated models and one real data set. For the sake of simplicity, we always use the Nadaraya-Watson estimator (3.3) with the Gaussian kernel. The bandwidths are chosen based on the values determined by the cross-validation method (3.5), where $w(\cdot)$ is chosen as the indicator function of the inner 95% of the sample $\{Y_t\}$. We specify a coverage probability $\alpha = 0.9$ for all the predictive regions in the following examples.

Example 1. We begin with the simple quadratic model (2.6). A sample of 1050 points is generated from the model; only the first 1000 points are used in the estimation. We consider two cases: $Y_t = X_{t+m}$ for $m = 2, 3$. We use $h = 0.24$ for $m = 2$ and $h = 0.26$ for $m = 3$. (The bandwidths given by (3.5) are 0.172 for $m = 2$ and 1.894 for $m = 3$.) The estimated conditional distribution functions of $Y_t = X_{t+m}$ given $X_t$ for $m = 2, 3$ are displayed in Fig.2. Based on these estimated conditional distribution functions, the estimated conditional percentile intervals and the estimated SCMIs (3.7) are plotted in Fig.3, together with 50 post-samples. For two-step ahead prediction, the two predictive intervals are almost the same, since the conditional density functions appear to be symmetric over the whole state space (cf. Fig. 1(b) of Fan, Yao and Tong 1993). However, around $X_t = 8$ the SCMIs for $X_{t+3}$ are obviously shorter than the conditional percentile intervals, owing to the fact that the conditional distributions are asymmetric (cf. Fig.1(a) in §2, and also Fig.1(c) of Fan, Yao and Tong 1993). Fig.3(c) shows the widths of the SCMIs in Fig.3(a) and Fig.3(b). When $X_t$ is near 5.6 or 10.5, the predictive interval for $X_{t+3}$ is shorter than that for $X_{t+2}$, which illustrates the non-monotonicity of the prediction accuracy in multi-step ahead prediction. (Figures 2 and 3 are about here.)


Example 2. We consider the quadratic AR(2) model
\[ X_{t+1} = 6.8 - 0.18 X_t^2 + 0.27 X_{t-1} + 0.3 \varepsilon_{t+1}, \tag{4.1} \]

where $\{\varepsilon_t\}$ is the same as in model (2.6). For the sake of illustration, we use only $X_t$ to predict $Y_t \equiv X_{t+m}$ for $m = 1$ and 2. A sample of 1100 points is generated from the above model, of which we use the first 1000 for estimation. The scatter plots of $X_{t+m}$ against $X_t$ for $m = 1, 2$ from those 1000 points are presented in Fig.4, which shows that the conditional distributions of $X_{t+m}$ ($m = 1$ and 2) given $X_t$ have two modes for some values of $X_t$. The estimated conditional distribution functions are displayed in Fig.5. We use $h = 0.21$ for $m = 1$ and $h = 0.2$ for $m = 2$. (The cross-validation method (3.5) specifies the bandwidth 0.203 for $m = 1$ and 0.167 for $m = 2$.) Given $X_t = 0$, the conditional distribution function of $X_{t+1}$ remains constant for $y$ between 5 and 7.5 (see Fig.5(a)), owing to the fact that there is no probability mass in this range of the corresponding conditional distribution (see Fig.4(a)). Similar features can be observed in Fig.5(b).

Fig.6 displays the conditional percentile intervals and the SCMIs. The two types of intervals are almost identical, and both of them exhibit an abrupt change in width as $x$ increases from -2 to -1. Clearly, the predictive intervals after the change are very wide. This leaves room for improvement by using the MCDR. The estimators (3.9) with $k = 2$ for $m = 1, 2$ are plotted in Fig.7. Due to sampling fluctuation, the estimators always consist of two disconnected intervals over the complete state space. The coverage probabilities of the two intervals are plotted in Fig.8. From Fig.7(a) and Fig.8(a), we note that when $X_t \notin [-1.2, 6.2]$, the predictive regions for $X_{t+1}$ are highly erratic, and the two coverage probabilities are either also highly erratic or very close to 0 and 0.9 respectively. Therefore, it seems plausible to use a predictive region consisting of two disconnected intervals for $X_{t+1}$ only when $X_t$ is between -1.2 and 6.2. Similarly, from Fig.7(b) and Fig.8(b), two disconnected intervals should be recommended as a predictive region for $X_{t+2}$ only when $X_t$ is between -1.2 and 4 (see also Fig.4(b)). In summary, to predict $X_{t+1}$ we use the MCDR when $X_t$ is between -1.2 and 6.2; to predict $X_{t+2}$ we use the MCDR when $X_t$ is between -1.2 and 4; otherwise, we always use the SCMIs. The combined predictive regions, together with 100 post-samples, are displayed in Fig.9. (Figures 4-9 are about here.)
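Model (4.1) is fully specified, so the data underlying Fig.4 can be regenerated directly. A minimal simulation sketch, assuming $\{\varepsilon_t\}$ is standard normal noise and starting from zero initial values (whose effect dies out quickly):

```python
import numpy as np

rng = np.random.default_rng(0)       # any seed; illustrative only
n = 1100
X = np.zeros(n)                      # X[0] = X[1] = 0 as starting values
for t in range(1, n - 1):
    X[t + 1] = (6.8 - 0.18 * X[t] ** 2 + 0.27 * X[t - 1]
                + 0.3 * rng.standard_normal())
# Scatter X[t+m] against X[t] (m = 1, 2) over the first 1000 points to
# reproduce the bimodal conditional distributions visible in Fig.4.
```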

Example 3. Figures 10(a) and (b) are plots of 401 hourly rainfall and river flow data from a catchment in Wales. Previous analysis of this data set can be found in Young (1993, 1995) and the references therein. Various statistical modelling procedures, such as the CV optimal subset regression (Yao and Tong 1994c) and MARS (Friedman 1991), suggest that the major effect of the rainfall conditions on the river flow behaviour involves a two-hour time delay. On the other hand, the flow data themselves are strongly autocorrelated. Considerations in these two respects suggest that we predict the flow at the $t$-th hour, $Y_t$, from the observed values of $Y_{t-1}$ and the rainfall within the $(t-2)$-th hour, $X_{t-2}$. We use the first 394 data points for estimation ($n = 392$), and the last 7 points for checking the reliability of the prediction. The data $(Y_{t-1}, X_{t-2})$ are standardized beforehand. The cross-validation method (3.5) gives the value 0.107 for the bandwidth; we use $h = 0.13$ in the estimation.

The three predictive regions for the 7 post-samples are given in Table 2. All the predictive regions cover the corresponding true values. It is easy to see that when $X_{t-2}$ is greater than 0, the predictive intervals and regions for $Y_t$ accommodate larger values, which agrees with the fact that rainfall increases the river flow (cf. Fig.10). The improvement from using the SCMI instead of the conditional percentile interval can be seen in six out of the seven cases; in those six cases, the relative decrease in the width of the SCMI is between 1.46% and 16.03%. The estimates of (3.9) with $k = 2$ always consist of two disconnected intervals. However, except for the case $(Y_{t-1}, X_{t-2}) = (30.7, 0.6)$, the small values ($< 0.1$) of the coverage probability of the second interval seem to suggest that in all the other cases we may use the SCMI as the predictive region. Furthermore, most of the coverage probability of the SCMI comes from the subinterval which intersects the first interval of the estimate of (3.9). When $(Y_{t-1}, X_{t-2}) = (30.7, 0.6)$, the second interval is [65.3, 91.7] with coverage probability 0.14. This indicates the possibility of a rapid rise in river flow caused by rainfall at the level 0.6mm (cf. Figures 10(a) and (b)). (Fig.10 is about here.)

Table 2. Three estimated predictive regions (with coverage probability 0.9) for the hourly river flow: conditional percentile intervals (CPI), SCMI, and (3.9) with k = 2. $Y_t$: flow (litres/sec) at the $t$-th hour; $X_t$: rainfall (mm) within the $t$-th hour. $Y_t$ is predicted from the values of $Y_{t-1}$ and $X_{t-2}$.

  Y_t    (Y_{t-1}, X_{t-2})   CPI            SCMI           (3.9)                        Coverage probabilities
  28.5   (29.1, 0.4)          [14.5, 72.6]   [14.5, 72.6]   [7.6, 34.6] ∪ [58.3, 72.6]   (0.81, 0.09)
  30.7   (28.5, 0.0)          [3.8, 45.4]    [3.7, 41.7]    [3.7, 39.3] ∪ [53.0, 53.1]   (0.89, 0.01)
  28.0   (30.7, 0.6)          [11.2, 91.7]   [10.5, 78.6]   [10.5, 34.6] ∪ [65.3, 91.7]  (0.76, 0.14)
  27.5   (28.0, 0.2)          [4.1, 51.8]    [4.0, 51.1]    [4.0, 34.6] ∪ [45.9, 53.1]   (0.87, 0.03)
  25.4   (27.5, 0.0)          [3.8, 45.4]    [3.7, 41.7]    [3.7, 39.3] ∪ [53.0, 53.1]   (0.90, 0.00)
  26.9   (25.4, 0.0)          [3.8, 41.7]    [3.7, 39.3]    [3.7, 26.9] ∪ [29.0, 40.3]   (0.82, 0.08)
  25.4   (26.9, 0.0)          [3.8, 42.3]    [3.7, 41.1]    [3.7, 34.6] ∪ [38.6, 42.3]   (0.88, 0.02)

Acknowledgements

The author is indebted to Dr Wolfgang Polonik for very stimulating discussions and for information on the estimation of level sets, which led to the estimators in §3.2. Professor Peter C. Young kindly made available the rainfall-flow data analysed in §4.

References

Brannas, K. and Gooijer, J.G. (1992). Modelling business cycle data using autoregressive-asymmetric moving average models. ASA Proceedings of Business and Economic Statistics Section, 331-336.

Chan, K.S. and Tong, H. (1994). A note on noisy chaos. J. Roy. Statist. Soc. B, 56, 301-311.

Davies, N., Pemberton, J. and Petruccelli, J.D. (1988). An automatic procedure for identification, estimation and forecasting univariate self-exciting threshold autoregressive models. The Statistician, 37, 119-204.

Einmahl, J.H.J. and Mason, D.M. (1992). Generalized quantile processes. Ann. Statist., 20, 1062-1078.

Fan, J. (1992). Design-adaptive nonparametric regression. J. Amer. Statist. Assoc., 87, 998-1004.

Fan, J., Gasser, T., Gijbels, I., Brockmann, M. and Engel, J. (1993). Local polynomial fitting: a standard for nonparametric regression. Inst. Statist. Mimeo Series #2302.

Fan, J., Yao, Q. and Tong, H. (1993). Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems. (Submitted.)

Friedman, J.H. (1991). Multivariate adaptive regression splines. Ann. Statist., 19, 1-50.

Györfi, L., Härdle, W., Sarda, P. and Vieu, P. (1989). Non-parametric Curve Estimation from Time Series. Springer-Verlag, Berlin.

Hyndman, R.J. (1995). Forecast regions for non-linear and non-normal time series models. Int. J. Forecasting (in the press).

Lientz, B.P. (1970). Results on nonparametric modal intervals. SIAM J. Appl. Math., 19, 356-366.

Polonik, W. (1995a). Measuring mass concentrations and estimating density contour clusters: an excess mass approach. Ann. Statist. (in the press).

Polonik, W. (1995b). Minimum volume sets and generalized quantile processes. (Submitted.)

Robertson, T. and Cryer, J.D. (1974). An iterative procedure for estimating the mode. J. Amer. Statist. Assoc., 69, 1012-1016.

Ruppert, D. and Wand, M.P. (1994). Multivariate locally weighted least squares regression. Ann. Statist., 22, 1346-1370.

Tong, H. (1990). Non-linear Time Series. Oxford University Press, Oxford.

Tong, H. (1995). A personal overview of nonlinear time series analysis from a chaos perspective (with discussion). Scand. J. Statist., 22 (in the press).

Tong, H. and Moeanaddin, R. (1988). On multi-step non-linear least squares prediction. The Statistician, 37, 101-110.

Yao, Q. and Tong, H. (1994a). Quantifying the influence of initial values on nonlinear prediction. J. Roy. Statist. Soc. B, 56, 701-725.

Yao, Q. and Tong, H. (1994b). On prediction and chaos in stochastic systems. Phil. Trans. Roy. Soc. Lond. A, 348, 357-369.

Yao, Q. and Tong, H. (1994c). On subset selection in non-parametric stochastic regression. Statistica Sinica, 4, 51-70.

Yao, Q. and Tong, H. (1995a). Asymmetric least squares regression estimation: a nonparametric approach. J. Nonparametric Statist. (in the press).

Yao, Q. and Tong, H. (1995b). On initial-condition sensitivity and prediction in nonlinear stochastic systems. Bull. Int. Statist. Inst. (in the press).

Young, P.C. (1993). Time variable and state dependent modelling of nonstationary and nonlinear time series. In Developments in Time Series Analysis, ed. T. Subba Rao. Chapman and Hall, London, 374-413.

Young, P.C. (1995). Data-based mechanistic modelling and the rainfall-flow nonlinearity. Environmetrics (in the press).

Figure Captions

Fig.1 (a) The estimated conditional density function of $X_{t+3}$ given $X_t = 8$ for model (2.6). (b) The conditional density function of $Y_{t+1}$ given $Y_t = 0$ for model (2.7).

Fig.2 The estimated conditional distribution functions of $Y_t \equiv X_{t+m}$ given $X_t$ for model (2.6). (a) $m = 2$; (b) $m = 3$.

Fig.3 (a), (b) The estimated SCMIs and conditional percentile intervals for $X_{t+m}$ given $X_t$ for model (2.6), together with 50 post-samples. The coverage probability is 0.9. Solid curve: SCMI; dashed curve: conditional percentile interval; diamond: post-sample. (a) $m = 2$; (b) $m = 3$. (c) Solid curve: the width of the SCMIs in Fig.3(a); dashed curve: the width of the SCMIs in Fig.3(b).

Fig.4 The scatter plots of $X_{t+m}$ against $X_t$ for model (4.1). (a) $m = 1$; (b) $m = 2$.

Fig.5 The estimated conditional distribution functions of $Y_t \equiv X_{t+m}$ given $X_t$ for model (4.1). (a) $m = 1$; (b) $m = 2$.

Fig.6 The estimated SCMIs and conditional percentile intervals for $X_{t+m}$ given $X_t$ for model (4.1). The coverage probability is 0.9. (a) $m = 1$; (b) $m = 2$.

Fig.7 The estimator (3.9) with $k = 2$ for $X_{t+m}$ given $X_t$ for model (4.1). The coverage probability is 0.9. The two disconnected intervals are bounded respectively by solid lines and dashed lines. (a) $m = 1$; (b) $m = 2$.

Fig.8 The coverage probabilities of the two intervals in Fig.7. Solid curve: the coverage probability of the interval with solid boundaries; dashed curve: the coverage probability of the interval with dashed boundaries. (a) $m = 1$; (b) $m = 2$.

Fig.9 The recommended predictive regions for $X_{t+m}$ given $X_t$ for model (4.1), together with 100 post-samples. Solid curve: boundary of MCDR; dashed curve: boundary of SCMI; diamond: post-sample. (a) $m = 1$; (b) $m = 2$.

Fig.10 Hourly rainfall and river flow data from a catchment in Wales. (a) flow (litres/sec); (b) rainfall (mm).
