Conditional Minimum Volume Predictive Regions For Stochastic Processes

Qiwei Yao
Institute of Mathematics and Statistics, University of Kent at Canterbury, Canterbury, Kent CT2 7NF, UK

Wolfgang Polonik
Institut für Angewandte Mathematik, Universität Heidelberg, Im Neuenheimer Feld 294, 69120 Heidelberg, Germany
Abstract

Motivated by interval/region prediction in nonlinear time series, we propose a minimum volume predictor (MV-predictor) for a strictly stationary process. The MV-predictor varies with respect to the current position in the state space and has the minimum Lebesgue measure among all regions with the nominal coverage probability. We have established consistency, convergence rates, and asymptotic normality for both coverage probability and Lebesgue measure of the estimated MV-predictor under the assumption that the observations are taken from a strong mixing process. Applications with both real and simulated data sets illustrate the proposed methods.
Keywords: Conditional distribution, level set, minimum volume predictor, Nadaraya-Watson estimator, nonlinear time series, predictor, strong mixing.
Supported partially by EPSRC grants and the EU HCM Program. The authors thank Professor Peter C. Young for making available the rainfall-flow data analysed in §4.
1 Introduction

In any serious attempt to forecast, a point prediction is only a beginning. A predictive interval or, more generally, a predictive region is much more informative. In the context of linear time series models with normally distributed errors, the predictive distributions are normal. Therefore, the predictive intervals are easily obtained as the mean plus and minus a multiple of the standard deviation. The width of such an interval, even for multi-step ahead prediction, is constant over the whole state space unless the model has conditionally heteroscedastic noise. Predictive intervals constructed in this way have also been used in some special nonlinear models (e.g. threshold autoregressive models; see Tong and Moeanaddin 1988, Davies, Pemberton and Petruccelli 1988). However, the above method is no longer pertinent when the predictive distribution is not normal, which, unfortunately, is the case for most nonlinear time series models (cf. Chan and Tong 1994). Recent studies on pointwise prediction of nonlinear time series have revealed that the prediction accuracy does depend on the current position in the state space (Yao and Tong 1994a, and the references therein). Yao and Tong (1995a) proposed to construct predictive intervals using conditional quantiles (percentiles) for nonlinear time series. However, interval predictors so constructed are not always appropriate when the predictive distributions are asymmetric or multi-modal. Asymmetric distributions have been widely used in modelling economic activities (cf. Brannas and De Gooijer 1992, and the references therein). Further, skewed predictive distributions may occur in multi-step ahead prediction even though the errors in the models have symmetric distributions. Multi-modal phenomena often indicate model uncertainty. The uncertainty may be caused by factors beyond the variables specified in the prediction. (See §2 below for some examples.)

In order to cope with the possible skewness and multi-modality of the underlying predictive distribution, we propose to use a (conditional) minimum volume set, which we call the (conditional) minimum volume predictor (MV-predictor), among all the candidate regions in a given class (e.g. intervals). The MV-predictor depends on the current position in the state space. It consists of a region where the predictive distribution has its highest mass concentration, in the sense that it has minimal Lebesgue measure among all the sets in a given class with the nominal coverage probability. In particular, the MV-predictor among all the predictive intervals is the one with the shortest length. In fact, the MV-predictor has the aforementioned properties among all the sets with the nominal coverage probability, provided the model implied implicitly by the given class is correct. (See the discussion below Definition 2.2 in §2.1.) For a symmetric and unimodal predictive distribution, the MV-predictor reduces to the quantile interval.

The minimum volume approach has a long history in the statistical literature; a well-known early example is the so-called shorth (cf. Andrews et al. 1972). Additional literature on minimum volume sets can be found in Polonik (1997). A closely related concept, excess mass, was introduced independently by Hartigan (1987) and Muller and Sawitzki (1987, 1991). See also
Nolan (1991), and Polonik (1992, 1995). Hyndman (1995) seems to be the first paper to use the minimum volume approach (under a different name) for time series prediction. Under the assumption that the predictive distribution is of a known parametric form, Hyndman (1995) estimated the MV-predictor based on a simulation method. However, until now there has been a lack of appreciation of the general methodology and its properties, although empirical development of using it for time series prediction can be found in Yao and Tong (1995b), Hyndman (1996) and De Gooijer and Gannoun (1997).

In this paper, we establish various asymptotic properties of the predictive regions, relying on the asymptotic results on conditional empirical processes reported in Polonik and Yao (1997). Under a general setting which admits time series modelling as a special case, we construct an estimator for the MV-predictor directly based on a Nadaraya-Watson estimator of the predictive distribution. Under the assumption that the observed data are from a strictly stationary and strong mixing process, we establish consistency for the estimated MV-predictor using an L1-distance, and asymptotic normality for its coverage probability and its Lebesgue measure. We also derive an explicit rate at which the estimated MV-predictor converges. Compared with the results on global (i.e. unconditional) minimum volume sets reported in Polonik (1997), the convergence rates for conditional MV-predictors are typically slower, although of a similar form (see e.g. Remark 3.3 in §3 below). This reflects the distinction between local and global fittings. We also propose a bootstrap scheme to choose state-dependent bandwidths for the purpose of estimating MV-predictors. We use the Nadaraya-Watson estimator for the predictive distribution simply to keep the theory as simple as possible. It is conceivable that similar results hold if the predictive distribution is estimated by local linear regression (Fan 1992, Yu and Jones 1997) or the adjusted Nadaraya-Watson method (Hall, Wolff and Yao 1997).

The paper is organized as follows. §2 introduces the minimum volume predictor and its estimator. The advantages of using the proposed method are demonstrated through two simple time series models. We present all the theoretical results in §3. §4 reports the simulation results with a nonlinear AR(2) model. The application to a rainfall-flow data set from a catchment in Wales is also reported. The Appendix contains the key technical proofs.
2 Methodology

2.1 Minimum volume predictive region

Suppose that X is an observable d-dimensional random vector, and Y is a d'-dimensional random vector which is unobservable. It is of interest to predict Y from X. In the time series context, d' = 1 and X typically denotes a vector of lagged values of Y. A predictive region Γ(α|X) ⊂ R^{d'} for Y from X with a nominal coverage probability α ∈ [0, 1] should satisfy the condition that

P{ Y ∈ Γ(α|x) | X = x } ≥ α,   x ∈ R^d.   (2.1)
Often we tend to choose Γ(α|x) to be a connected set (e.g. an interval in the case that d' = 1). However, the consideration of accuracy of prediction leads us to search for a set which has the minimum volume (i.e. Lebesgue measure) among all the sets fulfilling condition (2.1). The resulting set is not necessarily connected. In what follows, we always use F(·|x) to denote the conditional distribution of Y given X = x, which is also called the predictive distribution. We use the same F to denote the conditional distribution function in the sense that F(y|x) = F((−∞, y]|x). We use g(·|x) to denote its density function if it exists.
Definition 2.1 Let C denote a class of measurable subsets of R^{d'}. Then

M_C(α|x) ∈ argmin_{C ∈ C} { Leb(C) : F(C|x) ≥ α },   α ∈ [0, 1],   (2.2)

is called a (conditional) minimum volume predictor (MV-predictor) in C at level α, where Leb(C) denotes the Lebesgue measure of C. We denote its volume by

μ_C(α|x) = Leb(M_C(α|x)).   (2.3)
Definition 2.2 For any probability density function φ on R^{d'}, the level set at level λ > 0 is defined as Γ_φ(λ) = {y : φ(y) ≥ λ}. For completeness we define Γ_φ(0) to be the support of φ. For α ∈ [0, 1] we define φ ∈ M_C(α) if and only if there exists a unique λ_α ≥ 0 for which

Γ_φ(λ_α) ∈ argmin_{C ∈ C} { Leb(C) : Π_φ(C) ≥ α },

where Π_φ denotes the probability measure corresponding to φ.

Under the assumption that g(·|x) ∈ M_C(α), there exists an MV-predictor which is a level set. In this case, we can regard the model implied implicitly by C as correct. (Also see Remark 2.3(b).) For a more thorough discussion of the modelling aspect, we refer to Polonik (1995, 1997). To highlight the impact of using the minimum volume approach to prediction, we discuss some interesting features of the MV-predictor in the case d' = 1 in the rest of §2.1. Let I_k denote the class of sets consisting of unions of at most k (one-dimensional) intervals. We write M_{I_k}(α|x) = M_k(α|x), μ_{I_k}(α|x) = μ_k(α|x) and M_{I_k}(α) = M_k(α). Some remarks are now in order.
Remark 2.3 (a) Suppose g(·|x) is symmetric and unimodal in the sense that there exists a point m such that g(·|x) is strictly increasing to the left and strictly decreasing to the right of m. Then M_k(α|x), for all k ≥ 1, reduces to the quantile interval

I(α|x) = [q(0.5 − α/2|x), q(0.5 + α/2|x)],   (2.4)

where q(·|x) is the conditional quantile of Y given X = x, i.e. F(q(α|x)|x) = α.

(b) The MV-predictor is not unique. However, it is unique up to Leb-nullsets if the density function g(·|x) exists and g(·|x) ∈ M_k(α). In this case, an MV-predictor can be expressed as a level set Γ_{g(·|x)}(λ_α) ≡ {y : g(y|x) ≥ λ_α}, which is in fact the optimal MV-predictor in the sense that it minimizes the Lebesgue measure among all predictive regions. Hence, ideally we should choose k equal to the number of modes of the conditional density g(·|x). On the other hand, we always tend to choose a small k in practice. In fact, M_1(α|x) gives the shortest predictive interval.

(c) Similar to the unconditional i.i.d. case (cf. Einmahl and Mason 1992, and Polonik 1997), μ_C(α|x) can be regarded as a generalized (conditional) quantile. To this end, we replace the class C in (2.2) by C_1 = {(−∞, y], y ∈ R} and the Lebesgue measure by the function defined by (−∞, y] ↦ y. Then the resulting `MV-predictor' is (−∞, q(α|x)] with `volume' q(α|x), where q(·|x) is the conditional quantile defined in (a) above.

To highlight the advantages of MV-predictors, we provide numerical illustrations with two simple time series models. We start with a simple quadratic model

Y_t = 0.23 Y_{t−1}(16 − Y_{t−1}) + 0.4 ε_t,   (2.5)
where {ε_t} is a sequence of independent random variables, each with the standard normal distribution truncated in the interval [−12, 12]. The conditional distribution of Y_t given X_t = Y_{t−m} is symmetric for m = 1, but not necessarily so for m > 1. For example, the conditional density function at X_t = 8 with m = 3 is depicted in Figure 1(a), which is obviously skewed to the left. Based on this density function, predictive regions with three different coverage probabilities are specified in Table 1. Obviously, the accuracy of prediction is considerably improved by using MV-predictors. Note that M_k(α|x) = M_1(α|x) for all k ≥ 2 since the conditional density function is unimodal. Now we consider the model

Y_t = 3 cos(Y_{t−1}/10) + Z_t + ε_t/(0.8 Z_t + 1),   (2.6)
where {ε_t} and {Z_t} are two independent i.i.d. sequences with ε_1 ∼ N(0, 1) and P(Z_1 = 0) = 0.65 = 1 − P(Z_1 = 5). For the sake of illustration, we assume that the `exogenous' variable Z_t is unobservable. We predict Y_t from X_t ≡ Y_{t−1} only. Thus the (theoretical) least squares conditional point predictor is 3 cos(0.1 Y_{t−1}) + 1.75, which is obviously not satisfactory. It is easy to see that the conditional distribution of Y_t given Y_{t−1} is a mixture of two normal distributions. Figure 1(b) depicts the conditional density function at Y_{t−1} = 0. We can see that the percentile interval (2.4) fails to do a reasonable job simply because the predictive intervals are too wide. The improvement from using M_1(α|x) is not substantial unless the coverage probability α is small enough that the probability mass within one mode exceeds α. The predictor M_2(α|x) is much shorter in length (i.e. Lebesgue measure), and therefore offers a much more accurate prediction. All three M_2(α|x)'s consist of two disconnected intervals, which clearly reveals the uncertainty in Y_t caused by the `hidden' variable Z_t.
Figure 1: (a) The conditional density function of Y_t given Y_{t−3} = 8 for model (2.5). (b) The conditional density function of Y_t given Y_{t−1} = 0 for model (2.6).

The coverage probabilities of the two intervals centered at 3 and 8 are 0.61 and 0.34, 0.38 and 0.32, and 0.20 and 0.30 in order, when the global coverage probability is 0.95, 0.70 and 0.50 respectively.

Table 1. Three predictive regions with coverage probability α: the quantile intervals I(α|x), and the MV-predictors M_1(α|x) and M_2(α|x). The percentage decreases in volume (i.e. Lebesgue measure) relative to I(α|x) are recorded in parentheses.

  α     Predictor   Predictive region for Y_t from Y_{t−3}    Predictive region for Y_t from Y_{t−1}
                    at Y_{t−3} = 8 for model (2.5)            at Y_{t−1} = 0 for model (2.6)
  0.95  I(α|x)      [5.11, 14.96]                             [1.23, 8.30]
        M_1(α|x)    [6.50, 15.30] (10.57%)                    [1.50, 8.42] (2.12%)
        M_2(α|x)                                              [1.15, 4.84] ∪ [7.54, 8.46] (34.79%)
  0.70  I(α|x)      [8.66, 13.95]                             [2.26, 8.04]
        M_1(α|x)    [9.64, 14.15] (8.72%)                     [2.80, 8.29] (5.02%)
        M_2(α|x)                                              [2.18, 3.82] ∪ [7.66, 8.34] (59.86%)
  0.50  I(α|x)      [9.83, 12.95]                             [2.71, 7.89]
        M_1(α|x)    [10.69, 13.56] (8.01%)                    [1.80, 4.20] (53.67%)
        M_2(α|x)                                              [2.60, 3.40] ∪ [7.71, 8.29] (73.36%)
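To make the level-set construction behind M_k(α|x) concrete, the following sketch (ours, not the authors' code; Python with numpy) computes a minimum volume region from a predictive density tabulated on a grid, by retaining the cells of highest density until the nominal coverage α is reached. The two-component normal mixture below is only an illustrative stand-in loosely imitating the bimodal density in Figure 1(b); its weights, centres and scales are assumptions of this sketch.

import numpy as np

def mv_region_from_density(y, dens, alpha):
    # Smallest-volume region with coverage >= alpha for a density tabulated
    # on an equally spaced grid y: keep the grid cells of highest density
    # until their total probability mass reaches alpha.
    dy = y[1] - y[0]
    order = np.argsort(dens)[::-1]                 # highest density first
    mass = np.cumsum(dens[order]) * dy
    keep = order[: np.searchsorted(mass, alpha) + 1]
    inside = np.zeros(y.shape, dtype=bool)
    inside[keep] = True
    return inside, inside.sum() * dy               # indicator and its Lebesgue measure

# Illustrative stand-in for the predictive density in Figure 1(b): a mixture of
# two normals with modes near 3 and 8 (weights and scales are assumed values).
phi = lambda u, m, s: np.exp(-0.5 * ((u - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
y = np.linspace(-2.0, 12.0, 4001)
dens = 0.65 * phi(y, 3.0, 1.0) + 0.35 * phi(y, 8.0, 0.2)

inside, volume = mv_region_from_density(y, dens, alpha=0.5)
edges = y[np.flatnonzero(np.diff(inside.astype(int)))]   # approximate interval end points
print("volume:", round(float(volume), 2), "break points:", np.round(edges, 2))

For a bimodal density and moderate α the retained cells form two disjoint intervals, one around each mode, mirroring the behaviour of M_2(α|x) in Table 1; for a unimodal density the same procedure returns a single interval.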
2.2 Estimation

We assume that {(X_t, Y_t)} is a strictly stationary process, and that it has the same marginal distribution as (X, Y) ∈ R^d × R^{d'}. Of interest is to estimate M_C(α|x) based on the observations {(X_t, Y_t), 1 ≤ t ≤ n}.
Note that for any given measurable set C ⊂ R^{d'}, E{I(Y ∈ C) | X = x} = F(C|x). This regression relationship suggests the following Nadaraya-Watson estimator for the conditional distribution F(·|x):

F̂_n(C|x) = [ Σ_{t=1}^{n} I{Y_t ∈ C} K((X_t − x)/h) ] / [ Σ_{t=1}^{n} K((X_t − x)/h) ],

where K(·) ≥ 0 is a kernel function on R^d, and h > 0 is a bandwidth. F̂_n(·|x) is called the empirical conditional distribution. An estimator M̂_C(α|x) for the MV-predictor is obtained from (2.2) with F(·|x) replaced by F̂_n(·|x), i.e.

M̂_C(α|x) ∈ argmin_{C ∈ C} { Leb(C) : F̂_n(C|x) ≥ α }.   (2.7)
We denote its volume and real (unknown) coverage probability by

μ̂_C(α|x) = Leb(M̂_C(α|x)),   α̃_C = F{M̂_C(α|x) | x}.   (2.8)
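For the class I_1 of single intervals, (2.7) can be solved by searching over intervals with endpoints at the observed responses. The following sketch is one plausible implementation (ours, not the authors' code) for a univariate regressor with a Gaussian kernel; restricting candidate endpoints to observed Y-values is an assumption made to keep the search finite.

import numpy as np

def nw_weights(X, x, h):
    # Nadaraya-Watson weights K((X_t - x)/h) / sum_s K((X_s - x)/h),
    # Gaussian kernel, univariate regressor for simplicity.
    k = np.exp(-0.5 * ((X - x) / h) ** 2)
    return k / k.sum()

def mv_interval(X, Y, x, h, alpha):
    # Estimated shortest predictive interval: among the intervals with endpoints
    # at observed responses whose estimated conditional coverage F^_n(.|x) is at
    # least alpha, return the one of minimum length.
    w = nw_weights(X, x, h)
    idx = np.argsort(Y)
    ys, ws = Y[idx], w[idx]
    cum = np.concatenate(([0.0], np.cumsum(ws)))   # cum[m] = total weight of the m smallest Y's
    best_len, best_int = np.inf, None
    for i in range(len(ys)):
        m = np.searchsorted(cum, cum[i] + alpha)   # smallest m with cum[m] >= cum[i] + alpha
        j = m - 1                                  # coverage of [ys[i], ys[j]] is cum[j+1] - cum[i]
        if j < len(ys) and ys[j] - ys[i] < best_len:
            best_len, best_int = ys[j] - ys[i], (ys[i], ys[j])
    return best_int

The same weights can be reused to evaluate F̂_n(C|x) for any candidate set C, so an estimator of M_2(α|x) can be obtained analogously by searching over pairs of disjoint intervals.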
Now we explain the reasons for using the minimum volume approach rather than the level set approach to construct predictive regions. First, it is easy to obtain a predictive region which consists of not more than k intervals based on the minimum volume approach with k ≥ 1 prescribed. This cannot always be achieved using the level set approach. In fact, the volumes of the MV-predictors with different k's also provide additional information on the shape of the predictive distribution (see Figure 6). Further work investigating the modality of the distribution using excess mass will be reported elsewhere. Secondly, the level set approach would crucially require estimating the predictive density function, for which smoothing in both the x- and y-directions seems inevitable (cf. Fan, Yao and Tong 1996).
2.3 Bootstrap bandwidth selector

Like all other kernel smoothers, the quality of our estimator depends crucially on the choice of the bandwidth h. However, conventional data-driven bandwidth selectors such as cross-validation do not appear to have obvious analogues in the context of estimating MV-predictors. Deriving asymptotically optimal bandwidths is a tedious matter, and using plug-in methods requires explicit estimation of complex functions. Such complexity is arguably not justified, not least because the conditional measure F(C|x) is often approximately monotone in x (e.g. x behaves like a location parameter) and so has only limited opportunity for complex behaviour. Instead, we suggest a bootstrap scheme to select the bandwidth. To simplify the presentation, we outline the scheme for the case d' = 1 only. We fit a parametric model

Y_i = G(X_i; q) + σ ε_i,

where G(x; q) denotes a polynomial function of x, q is a set of indices indicating the terms included in G, and ε_i is standard normal. We use the AIC to determine q. The coefficients of G and σ are estimated from the data. We form a parametric estimator M̄_k(α|x) based on the above model. By Monte Carlo simulation from the fitted model, we compute a bootstrap version {Y*_1, …, Y*_n} of {Y_1, …, Y_n} based on the given observations {X_1, …, X_n}, and hence a bootstrap version M̂*_k(α|x) of M̂_k(α|x) with {(X_i, Y_i)} replaced by {(X_i, Y*_i)}. Define

D(h) = E[ Leb{ M̂*_k(α|x) Δ M̄_k(α|x) } | {X_i, Y_i} ],

where A Δ B denotes the symmetric difference of the sets A and B. Choose h = ĥ to minimise D(h). A similar idea has been used for bandwidth selection in estimating a conditional distribution function by Hall, Wolff and Yao (1997).
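A minimal sketch of the bootstrap bandwidth selector just described (again ours, not the authors' code). It reuses the function mv_interval from the sketch in §2.2, replaces the AIC-selected polynomial by a fixed cubic, fixes the standard normal quantile at 1.6449 (i.e. α = 0.9), and approximates D(h) by a Monte Carlo average; all of these simplifications are assumptions of the sketch.

import numpy as np

rng = np.random.default_rng(0)

def fit_cubic(X, Y):
    # Parametric reference model Y = G(X) + sigma * eps fitted by least squares.
    # A fixed cubic polynomial stands in for the AIC-selected one (an assumption).
    design = np.vander(X, 4)                               # columns X^3, X^2, X, 1
    theta, *_ = np.linalg.lstsq(design, Y, rcond=None)
    sigma = float(np.std(Y - design @ theta))
    G = lambda x: float(np.vander(np.atleast_1d(float(x)), 4) @ theta)
    return G, sigma

def sym_diff_length(a, b):
    # Lebesgue measure of the symmetric difference of two intervals a and b.
    overlap = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    return (a[1] - a[0]) + (b[1] - b[0]) - 2.0 * overlap

def bootstrap_bandwidth(X, Y, x, alpha, bandwidths, n_boot=50, z=1.6449):
    # z is the standard normal quantile of (1 + alpha)/2; 1.6449 corresponds to alpha = 0.9.
    G, sigma = fit_cubic(X, Y)
    bar_M1 = (G(x) - z * sigma, G(x) + z * sigma)          # parametric MV-interval at x
    scores = np.zeros(len(bandwidths))
    for _ in range(n_boot):
        Ystar = np.array([G(xi) for xi in X]) + sigma * rng.standard_normal(len(X))
        for j, h in enumerate(bandwidths):
            # mv_interval is the Nadaraya-Watson sketch from Section 2.2
            scores[j] += sym_diff_length(mv_interval(X, Ystar, x, h, alpha), bar_M1)
    return bandwidths[int(np.argmin(scores))]              # h minimising the Monte Carlo estimate of D(h)

In practice the grid of candidate bandwidths and the number of bootstrap replications would be tuned to the data; since ĥ is computed separately for each state x, the resulting bandwidth is state-dependent.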
3 Theoretical properties

In this section, we always assume that x ∈ R^d is given and f(x) > 0, where f(·) denotes the marginal density of X, and c denotes a generic constant which may be different at different places. Furthermore, all stochastic quantities are assumed to be measurable, and we write d_{F(·|x)}(A, B) = F(A Δ B | x). We first introduce some regularity conditions.
(A1) The kernel density function K is bounded and symmetric, and lim_{||u||→∞} ||u||^d K(u) = 0.

(A2) f ∈ C^{2,d}(b), where C^{2,d}(b) denotes the class of bounded real-valued functions with bounded second order partial derivatives.

(A3) F(·|x) has a Lebesgue density g(·|x) ∈ C^{2,d'}(b). Moreover, for each C ∈ C the function F(C|·) ∈ C^{2,d}(b), such that sup_{C ∈ C} |∂²F(C|x)/∂x_i ∂x_j| < ∞ for all 1 ≤ i, j ≤ d.

(A4) || ∫ v vᵀ K(v) dv || < ∞.

(A5) The process {(X_t, Y_t)} is strong mixing, i.e.

α(j) = sup_{A ∈ F^t_{−∞}, B ∈ F^∞_{t+j}} |P(A ∩ B) − P(A) P(B)| → 0   as j → ∞,   (3.1)

where F^t_s denotes the σ-algebra generated by {(X_i, Y_i) : s ≤ i ≤ t}. Further we assume that α(k) ≤ b ρ^k for some b > 0 and 0 < ρ < 1.
(A6) The joint density function of the distinct elements of (X_s, X_t, X_q, X_r) exists and is bounded from above by a constant independent of (s, t, q, r), for all s ≤ t < q ≤ r.

Now we introduce the notion of metric entropy with bracketing, which provides a measure of the richness or complexity of a class of sets C. This notion is closely related to covering numbers. We adopt L1-type covering numbers using the bracketing idea. The bracketing reduces to inclusion
when it is applied to classes of sets rather than classes of functions. For each ε > 0, the covering number is defined as

N_I(ε, C, F(·|x)) = inf{ n ≥ 1 : ∃ C_1, …, C_n ∈ C such that ∀ C ∈ C ∃ 1 ≤ i, j ≤ n with C_i ⊂ C ⊂ C_j and F(C_j \ C_i | x) < ε }.   (3.2)

The quantity log N_I(ε, C, F(·|x)) is called the metric entropy with inclusion of C with respect to F(·|x). A pair of sets C_i, C_j is called a bracket for C. Estimates for such covering numbers are known for many classes (see, e.g., Dudley 1974, 1984). We will often assume below that either log N_I(ε, C, F(·|x)) or N_I(ε, C, F(·|x)) behaves like a power of ε^{−1}. We say that condition (R_γ) holds if

log N_I(ε, C, F(·|x)) ≤ H_γ(ε)   for all ε > 0,   (R_γ)

where

H_γ(ε) = A log(ε^{−r}) if γ = 0,   and   H_γ(ε) = A ε^{−γ} if γ > 0,   (3.3)

for some constants A, r > 0. In fact (R_0) holds for intervals, rectangles, balls, ellipsoids, and for classes which are constructed from these by performing the set operations union, intersection and complement finitely many times. In particular, the classes I_k used above fulfil (R_0). The classes of convex sets in R^d (d ≥ 2) fulfil condition (R_γ) with γ = (d − 1)/2. Other classes of sets satisfying (R_γ) with γ > 0 can be found in Dudley (1974, 1984).
Theorem 3.1 (Uniform consistency)
Let 0 < a < b < 1, and suppose that the following two conditions hold: (i) μ_C(α|x) is continuous in α ∈ [a, b]; (ii) sup_{C ∈ C} |F̂_n(C|x) − F(C|x)| → 0 in probability (or almost surely) as n → ∞. If in addition M_C(α|x) is uniquely (up to Leb-nullsets) defined by (2.2), then

sup_{α ∈ [a, b]} d_{F(·|x)}(M̂_C(α|x), M_C(α|x)) → 0   in probability (or almost surely).

The above theorem is formulated in general terms. The key assumption is condition (ii). To verify this condition in the univariate case with C = I_k, we can directly use results on the consistency of usual conditional empirical processes (see for instance Bosq 1996). A more general multivariate result is obtained by Polonik and Yao (1997). All the results given below can be formulated in a similar way, relying on certain properties of the set-indexed empirical conditional distribution (see also Polonik 1997 for the unconditional i.i.d. case). However, the real difficulty is to validate those properties under appropriate mixing conditions. Therefore, we present the other results more explicitly.
The following theorem concerns the rate of convergence of d_{F(·|x)}(M̂_C(α|x), M_C(α|x)). This is the only place where we assume that the model implied by the class C is correct, i.e. g(·|x) ∈ M_C(α). Under this condition, d_{F(·|x)}(M̂_C(α|x), M_C(α|x)) = d_{F(·|x)}(M̂_C(α|x), Γ_{g(·|x)}(λ_α)).
Theorem 3.2 (Rates of convergence)
Let conditions (A1)–(A6) and (R_γ) hold. Assume that for some α ∈ (0, 1), g(·|x) ∈ M_C(α) with corresponding level λ_α > 0. Suppose further that Γ_{g(·|x)}(λ) is Lipschitz continuous in λ at λ = λ_α with respect to d_{F(·|x)}. Suppose that

|F̂_n(M̂_C(α|x) | x) − F(M_C(α|x) | x)| = o_P((n h^d)^{−1/2}).   (3.4)

Then for any δ > 0 and

h = c max{ n^{−1/(d+3+δ)}, n^{−1/(d+2(3γ+1))} },

we have, as n → ∞,

d_{F(·|x)}(M̂_C(α|x), M_C(α|x)) = O_P((n h^d)^{−1/3+δ}).

Remark 3.3 In fact one typically obtains that even |F̂_n(M̂_C(α|x) | x) − α| = O_P((n h^d)^{−1}), so that (3.4) is satisfied.
Theorem 3.4 (Asymptotic normality of coverage probabilities)
Let conditions (A1)–(A6) hold, and let (R_γ) hold with γ < 1/3. Let h = c n^{−1/(d+4)} (log log n)^{−1}. Suppose that μ_C(·|x) is continuous at α, that F(M_C(α|x) | x) = α, and that (3.4) holds. Then for α̃_C defined in (2.8) we have, as n → ∞,

√(n h^d) (α̃_C − α) →^d N(0, α(1 − α) ∫K² / f(x)).
Remark 3.5 (a) The only unknown quantity in the asymptotic variance is the marginal density f(x), which can be estimated consistently. This is in marked contrast with the asymptotic variance of F̂(y|x) (see, for example, Hall, Wolff and Yao 1997). Therefore, confidence intervals for the coverage probability can easily be constructed based on the above theorem.
(b) The continuity assumption on μ_C(·|x) is satisfied for all α ∈ (0, 1) if g(·|x) is continuous and has no flat parts, i.e. Leb{y : g(y|x) = λ} = 0 for all λ > 0.

Theorem 3.6 (Asymptotic normality of volumes)
Suppose that the function μ_C(·|x) defined in (2.3) is differentiable, and let μ'_C(·|x) denote its Lipschitz continuous derivative. If μ'_C(α|x) > 0, then under the assumptions of Theorem 3.4, as n → ∞,

√(n h^d) (μ̂_C(α|x) − μ_C(α|x)) →^d N(0, α(1 − α) [μ'_C(α|x)]² ∫K² / f(x)).
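As noted in Remark 3.5(a), Theorem 3.4 yields confidence intervals for the coverage probability once f(x) is estimated. The following sketch (ours, not the authors' code) assumes a univariate regressor and a Gaussian kernel, so that d = 1 and ∫K² = 1/(2√π), and plugs in a kernel density estimate of f(x).

import numpy as np

def coverage_ci(alpha, X, x, h, z=1.96):
    # Normal-approximation interval for the true coverage probability of the
    # estimated MV-predictor, based on Theorem 3.4:
    #   sqrt(n h^d) (alpha_tilde - alpha)  ->  N(0, alpha (1 - alpha) intK2 / f(x)).
    # A univariate regressor and a Gaussian kernel are assumed, so d = 1 and
    # intK2 = 1/(2 sqrt(pi)); f(x) is replaced by a kernel density estimate.
    # z = 1.96 gives an approximate 95% interval.
    n = len(X)
    f_hat = float(np.mean(np.exp(-0.5 * ((X - x) / h) ** 2) / (h * np.sqrt(2.0 * np.pi))))
    int_k2 = 1.0 / (2.0 * np.sqrt(np.pi))
    se = np.sqrt(alpha * (1.0 - alpha) * int_k2 / f_hat / (n * h))
    return alpha - z * se, alpha + z * se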
4 Numerical properties

To appreciate the finite sample properties of the estimated MV-predictors, we illustrate the methods via one nonlinear AR(2) model and a set of rainfall and river flow data from a catchment in Wales. We always use the standard Gaussian kernel in the calculation. The bandwidths are selected by the bootstrap scheme stated in §2.3. We always set the coverage probability α = 0.9. We only consider here the estimation of MV-predictors; examples of estimating predictive distributions can be found in Hall, Wolff and Yao (1997).
Figure 2: The scatter plot of Y_{t−1} against Y_{t−2} from a sample of size 1000 generated from model (4.1). The marked positions are (from left to right) (−4.5, 5.7), (1.5, 7.8), (3.9, 2.4) and (8.1, −4.7).
Example 1. We first consider the following simulated model

Y_t = 6.8 − 0.17 Y²_{t−1} + 0.26 Y_{t−2} + 0.3 ε_t,   (4.1)
where {ε_t} is a sequence of independent random variables, each with the standard normal distribution truncated in the interval [−12, 12]. We conduct the simulation in two stages, estimating the MV-predictors for Y_t given (i) X_t ≡ (Y_{t−1}, Y_{t−2}) and (ii) X_t ≡ Y_{t−1}, respectively.
Figure 3: Simulation results for model (4.1). (a) Boxplots of the ratio of Leb{M̂_1(α|x) Δ M_1(α|x)} to Leb{M_1(α|x)}. (b) Boxplots of the bandwidths selected by the bootstrap scheme stated in §2.3.

(i) For four fixed values of X_t = (Y_{t−1}, Y_{t−2}), we repeat the simulation 100 times with sample sizes n = 1000 and n = 500 respectively. Figure 2 is the scatter plot of a sample of size n = 1000. The four marked positions are the values of X_t = x at which the MV-predictors M_1(α|x) for Y_t are estimated. Figure 3(a) shows boxplots of Leb{M̂_1(α|x) Δ M_1(α|x)}/Leb{M_1(α|x)}, which in general decreases when the sample size increases from 500 to 1000. The boxplots of the bandwidths selected by the bootstrap scheme are displayed in Figure 3(b). With the given sample sizes, the AIC always identifies the correct model from the candidate polynomial models of order 3. Figures 4 and 5 present the histograms of n^{1/2} ĥ(x)[Leb{M̂_1(α|x)} − Leb{M_1(α|x)}] and n^{1/2} ĥ(x)[F{M̂_1(α|x)|x} − α], where ĥ(x) denotes the bandwidth selected by the bootstrap scheme. With the given sample sizes, we could not observe the predominance of the asymptotic normality stated in Theorems 3.4 and 3.6. However, all the histograms are centered around 0, and overall the distributions with n = 1000 tend to be more normal-like than those with n = 500.

(ii) For a sample of size 1000, we estimate both predictors M_1(α|x) and M_2(α|x) for Y_t given its first lagged value Y_{t−1} = x only. We let x range over the 90% inner sample. We use a post-sample of size 100 to check the performance of the predictors.
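One possible way to reproduce part (ii) of the experiment, namely the sweep of M̂_1(α|x) over x plotted in Figure 6(a), is sketched below (not the authors' code). It assumes model (4.1) in the form reconstructed above, reuses mv_interval from the sketch in §2.2, and fixes the bandwidth at an arbitrary value instead of selecting it by the bootstrap scheme of §2.3.

import numpy as np

rng = np.random.default_rng(1)

def simulate_41(n, burn=200):
    # Model (4.1) in the form reconstructed above (an assumption of this sketch):
    # Y_t = 6.8 - 0.17*Y_{t-1}^2 + 0.26*Y_{t-2} + 0.3*eps_t, eps_t truncated N(0,1).
    y = np.zeros(n + burn)
    for t in range(2, n + burn):
        eps = float(np.clip(rng.standard_normal(), -12.0, 12.0))
        y[t] = 6.8 - 0.17 * y[t - 1] ** 2 + 0.26 * y[t - 2] + 0.3 * eps
    return y[burn:]

y = simulate_41(1001)
X, Y = y[:-1], y[1:]                                # predict Y_t from Y_{t-1} only, as in part (ii)
xs = np.linspace(np.quantile(X, 0.05), np.quantile(X, 0.95), 60)   # 90% inner sample range
h = 0.4                                             # assumed fixed bandwidth, for illustration only
# mv_interval is the Nadaraya-Watson sketch from Section 2.2.
intervals = [mv_interval(X, Y, xv, h, alpha=0.9) for xv in xs]
# 'intervals' traces out lower and upper interval boundaries of the kind plotted in Figure 6(a).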
Figure 4: Simulation results for model (4.1). The histograms of n^{1/2} ĥ(x)[Leb{M̂_1(α|x)} − Leb{M_1(α|x)}].
Figure 5: Simulation results for model (4.1). The histograms of n^{1/2} ĥ(x)[F{M̂_1(α|x)|x} − α].
Figure 6: Simulation results for model (4.1). (a) The estimated M_1(α|x) for Y_t given Y_{t−1} = x, together with 100 post-sample points. (b) The estimated M_2(α|x) for Y_t given Y_{t−1} = x; the two disconnected intervals are bounded respectively by solid lines and dashed lines. (c) The coverage probabilities of the two intervals in (b): solid curve, the coverage probability of the interval with solid boundaries; dashed curve, the coverage probability of the interval with dashed boundaries. (d) The recommended MV-predictor for Y_t given Y_{t−1} = x, together with 100 post-sample points.
For estimating the bandwidths, the parametric model selected by the AIC is

Y_t = 8.088 − 0.316 Y_{t−1} − 0.179 Y²_{t−1} + 0.003 Y³_{t−1} + 0.825 ε_t.

Figure 6(a) displays the estimated M_1(α|x) together with the 100 post-sample points. Within the range of values of x on which the estimation is conducted, M̂_1(α|x) contains about 90% of the post-sample. Note that M̂_1(α|x) has an abrupt change in width around x = 1.5. In fact the predictor M̂_1(α|x) is not satisfactory for x between 1.5 and about 6, because the intervals are void in the centre for those x-values (also see Figure 2). The estimator M̂_2(α|x) is plotted in Figure 6(b). Due to sampling fluctuation, the estimator always consists of two disconnected intervals over the whole sample space. The coverage probabilities of the two intervals are plotted in Figure 6(c). From Figures 6(b) and 6(c), we note that when x ∉ [1.5, 6.3], the two intervals of M̂_2(α|x) are almost connected, and the two corresponding coverage probabilities are either erratic or very close to 0 and 0.9 respectively. Therefore, it seems plausible to use M̂_2(α|x) for x ∈ [1.5, 6.3] and M̂_1(α|x) for x ∉ [1.5, 6.3]. The combined MV-predictor is depicted in Figure 6(d) together with the post-sample. The combined predictor covers the post-sample as well as M̂_1(α|x) does, although its Lebesgue measure has been reduced significantly for x ∈ [1.5, 6.3].
Example 2. Figures 7(a) and (b) plot 401 hourly rainfall and river flow data from a catchment in Wales. Various statistical modelling procedures, such as the CV optimal subset regression (Yao and Tong 1994b) and MARS (Friedman 1991), suggest that the major effect of rainfall on the river flow is a two-hour delay in time. Figures 7(d)–(f) indicate that the point-cloud in the scatter plot of flow against rainfall with time lag 2 is slightly thinner than those with time lags 0 and 1. On the other hand, the flow data themselves are strongly auto-correlated (Figure 7(c)). Considerations in these two respects suggest that we predict the flow at the t-th hour, Y_t, from its lagged value Y_{t−1} and the rainfall within the (t−2)-th hour, X_{t−2}. We estimate three types of predictors using the data with sample size n = 392, which results from leaving out the 373rd, the 375th and the last five flow data points (and therefore also their corresponding lagged values and the rainfall data) in order to check the reliability of the prediction. We standardise the observations of the regressors before the fitting. We adopt the bootstrap scheme to select the bandwidth. The parametric model determined by the AIC is

Y_t = −1.509 + 1.191 Y_{t−1} + 0.924 X_{t−2} + 0.102 Y_{t−1} X_{t−2} − 0.004 Y²_{t−1} + 7.902 ε_t,

where ε_t is standard normal. Table 2 reports the estimated predictors for the 7 data points which are not used in the estimation. All the quantile intervals cover the corresponding true values. For the MV-predictor M_1(α|x), 6 out of 7 intervals contain the true value. The only exception occurs when there is a high burst of river flow at the value 86.9. It is easy to see from Figure 7 that the data are sparse at this level and upwards.
Figure 7: Hourly rainfall and river flow data from a catchment in Wales. (a) flow (litres/sec); (b) rainfall (mm); (c) scatter plot of the flow data; (d)–(f) scatter plots of rainfall against flow at time lags 2, 1 and 0 respectively.
Table 2. The estimators of the quantile interval I(α|x) and the MV-predictors M_i(α|x) (i = 1, 2) with coverage probability α = 0.9 for the river flow at the t-th hour, Y_t, from its lagged value Y_{t−1} and the rainfall in the (t−2)-th hour, X_{t−2}.

  Y_t    (Y_{t−1}, X_{t−2})   Î(α|x)         M̂_1(α|x)      M̂_2(α|x)                      Coverage probabilities
  47.9   (29.1, 1.8)          [3.8, 71.8]    [3.3, 51.1]    [3.6, 42.3] ∪ [65.4, 67.5]     (0.89, 0.01)
  86.9   (92.4, 2.6)          [3.8, 102.9]   [3.3, 76.2]    [3.3, 53.1] ∪ [62.6, 78.5]     (0.83, 0.07)
  28.0   (30.7, 0.6)          [5.6, 39.3]    [3.8, 33.5]    [4.3, 5.6] ∪ [6.6, 34.6]       (0.03, 0.87)
  27.5   (28.0, 0.2)          [4.7, 35.8]    [3.8, 32.4]    [4.1, 26.9] ∪ [30.7, 34.6]     (0.82, 0.08)
  25.4   (27.5, 0.0)          [4.4, 34.6]    [3.8, 32.4]    [3.6, 24.7] ∪ [30.7, 34.6]     (0.02, 0.88)
  26.9   (25.4, 0.0)          [7.7, 33.5]    [9.3, 33.5]    [9.7, 26.9] ∪ [29.1, 34.1]     (0.81, 0.09)
  25.4   (26.9, 0.0)          [4.7, 34.1]    [3.6, 31.7]    [3.3, 25.9] ∪ [30.7, 34.1]     (0.83, 0.07)
Due to the quick river flow caused by rainfall, we expect the predictive distributions to be skewed to the right. Therefore, an improvement in prediction should be observed by using the MV-predictor M_1(α|x) instead of the quantile interval I(α|x). In fact, even if we discard the case where the true Y_t lies outside of M̂_1(α|x), the relative decrease in length of M̂_1(α|x) with respect to the quantile predictor Î(α|x) is between 4.4% and 27.9% for the 6 other cases. Actually M̂_1(α|x) can be regarded as a compressed shift of Î(α|x) to its left in 6 out of 7 cases. For applications with data sets like this, it is pertinent to use state-dependent bandwidths. For example, for estimating M_1(α|x), our bootstrap scheme selected quite large bandwidths, 1.57 and 2.99, for the first two cases in Table 2 respectively, in response to the sparseness of the data in the area with positive rainfall and quick river flow. The selected bandwidths for the last five cases are rather stable, lying between 0.33 and 0.43. There seems little evidence of multi-modality, since the estimated M_2(α|x)'s always have one interval with a very tiny coverage probability. For this example, we recommend using the MV-predictive interval M_1(α|x). In the above application, we include a single rainfall point X_{t−2} in the model for the sake of simplicity. A more pertinent approach should take into account the moisture condition of the soil, which depends on prior rainfall. For a more detailed discussion of this aspect, we refer to Young (1993) and Young and Beven (1994).
5 Appendix: Proofs

The proofs rely on some asymptotic results for a conditional empirical process from Polonik and Yao (1997), which we list in the Addendum below for easy reference. The basic idea is similar to the proofs of Polonik (1997), where a global (unconditional) empirical process with i.i.d. observations was concerned. The proof of Theorem 3.1 is omitted since it is the least technically involved.
Proof of Theorem 3.2: The basic inequality (5.1) below follows along the same lines as (7.24) of Polonik (1997), given there in the proof of Theorem 3.1; we do not present the details here. In (5.1) we heavily use the fact that F(Γ_{g(·|x)}(λ_α) Δ M_C(α) | x) = 0, which follows from the assumption that g(·|x) ∈ M_C(α). For each ε > 0 and M ≥ sup_y g(y|x) we have

d_{F(·|x)}(M̂_C(α|x), Γ_{g(·|x)}(λ_α)) ≤ D(ε) + R_{1n}(ε) + R_{2n}(ε) + o_P((n h^d)^{−1/2}),   (5.1)

where

D(ε) = F({ y : |g(y|x) − λ_α| ≤ ε } | x),
R_{1n}(ε) = (M/ε) | (F̂_n − F)(Γ_{g(·|x)}(λ_α) | x) + μ'_C(α|x)^{−1} (μ̂_C(α|x) − μ_C(α|x)) |,
R_{2n}(ε) = (M/ε) | (F̂_n − F)(M̂_C(α|x) | x) − (F̂_n − F)(Γ_{g(·|x)}(λ_α) | x) |.

Note that in the present situation we have μ'_C(α|x)^{−1} = λ_α. Below we use the Bahadur-Kiefer approximation of Theorem 6.2 to control R_{1n}(ε). In fact, it follows that R_{1n}(ε) and R_{2n}(ε) are of the same asymptotic stochastic order.
We now present the proof for γ = 0; the other cases can be proven similarly, as will be indicated at the end of the proof. First note that

D(ε) = O(ε),   (5.2)
R_{1n}(ε) = R_{2n}(ε) = O_P(ε^{−1} (n h^d)^{−1/2}).   (5.3)

(5.2) follows immediately from the assumptions. As for (5.3), we use Theorem 6.1 with δ² = 1. By choosing ε = (n h^d)^{−1/4} to balance the rate of the deterministic term D(ε) against the sum of the stochastic terms R_{1n}(ε) + R_{2n}(ε), we get from (5.1) a first stochastic rate

d_{F(·|x)}(M̂_C(α|x), Γ_{g(·|x)}(λ_α)) = O_P((n h^d)^{−1/4}).

Note that at this first step we could choose an even larger bandwidth, namely the optimal bandwidth h = n^{−1/(d+4)}. Then, however, we would have to stop here, and this would finally give a slower rate of convergence. Instead, by using the faster bandwidth given in the theorem, this first rate can be used to obtain a faster rate of convergence (and then one iterates). By choosing δ² = (n h^d)^{−1/4} we obtain from Theorem 6.1 and Theorem 6.2, respectively, that

R_{1n}(ε) = R_{2n}(ε) = O_P(ε^{−1} (n h^d)^{−(1/2 + 1/8)} (log n)^{1/2}).   (5.4)

As above, by balancing the deterministic and the stochastic terms in (5.1) we get the second rate

d_{F(·|x)}(M̂_C(α|x), Γ_{g(·|x)}(λ_α)) = O_P((n h^d)^{−(1/4 + 1/16)} (log n)^{1/4}).

Iterating this argument M times we obtain the rate (n h^d)^{−S(M)} (log n)^{S(M−1)}, where S(M) = Σ_{j=1}^{M} (1/4)^j. We can do this iteration arbitrarily, but finitely, often. Since S(M) → 1/3 as M → ∞, the assertion follows. Note that in order to ensure that the assumptions of Theorem 6.1 on h and δ² are satisfied in each iteration step, we need for each δ > 0 that n h^{d+4} ≤ (n h^d)^{−1/3+δ}; this leads to the condition h = c n^{−1/(d+3+δ)}. As for γ > 0, the same proof in principle works. However, we have to take into account that we can only proceed with the above iteration steps as long as the resulting rate is not faster than (n h^d)^{−(1−γ)/(2(3γ+1))}, which comes from the definition of Ψ in Theorem 6.1. For this reason we can do the iteration arbitrarily often only for γ < 1/5; for γ ≥ 1/5 iteration does not help, so that the arbitrary δ > 0 in the exponent does not appear.   q.e.d.
Proof of Theorem 3.4: By assumption we have F(M_C(α|x) | x) = α = F̂_n(M̂_C(α|x) | x) + o_P((n h^d)^{−1/2}); it follows that

√(n h^d) (α̃_C − α) = √(n h^d) (F(M̂_C(α|x) | x) − F̂_n(M̂_C(α|x) | x)) + o_P(1)
                   = √(n h^d) (F(M_C(α|x) | x) − F̂_n(M_C(α|x) | x)) + o_P(1).

The last equality follows from Theorem 6.1, because Theorem 3.1 gives consistency of M̂_C(α|x), that is d_{F(·|x)}(M̂_C(α|x), M_C(α|x)) = o_P(1). Finally, the asymptotic normality of √(n h^d)(F(M_C(α|x)|x) − F̂_n(M_C(α|x)|x)) with the given variance under the present conditions can be found in Polonik and Yao (1997).   q.e.d.
Theorem 3.6 immediately follows from Theorem 6.2 together with the asymptotic normality of the conditional empirical process (cf. Polonik and Yao 1997).
6 Addendum

For ease of reference we present two theorems about the asymptotic behaviour of a conditional empirical process, which can be found in Polonik and Yao (1997). Let

Ψ(δ², n) = (δ² log(1/δ²))^{1/2}   if γ = 0,
Ψ(δ², n) = max{ (δ²)^{1−γ}, (n h^d)^{−(1−γ)/(2(3γ+1))} }   if γ > 0.   (6.1)
Theorem 6.1 Suppose that (A2)–(A6) hold. For each δ² > 0, let C_{δ²} ⊂ C be a class of measurable sets with sup_{C ∈ C_{δ²}} F(C|x) ≤ δ², and suppose that C fulfils (R_γ) with some γ ≥ 0. Further we assume that h^d → 0 and n h^d → ∞ as n → ∞ such that

(n h^{d+4})^{1/2} ≤ Ψ(δ², n) ≤ (n h^d)^{1/2} δ² / (2 log(1/δ²))^{1/2}.   (6.2)

For γ = 0 we assume in addition that δ² log n → ∞ as n → ∞. Then there exists a constant M > 0 such that for every ε > 0 there exists δ²_0 > 0 such that for all δ² ≤ δ²_0 and for n large enough,

P( sup_{C ∈ C_{δ²}} |ν_n(C|x)| ≥ M Ψ(δ², n) ) ≤ ε,

where ν_n(C|x) = √(n h^d) {F̂_n(C|x) − F(C|x)} denotes the normalized conditional empirical process.
Theorem 6.2 (Generalized Bahadur-Kiefer approximation) Suppose that (A2)–(A6) hold. Assume that μ_C(·|x) is differentiable with Lipschitz-continuous derivative μ'_C(·|x). Let C be such that (R_γ) holds. Let further α ∈ (0, 1) be fixed and suppose that M_C(α|x) is unique up to Leb-nullsets, that F(M_C(β|x)|x) = β for all β in a neighbourhood of α, and that μ'_C(α|x) > 0. If for h and δ² satisfying the conditions of Theorem 6.1 we have, as n → ∞, that d_{F(·|x)}(M̂_C(α|x), M_C(α|x)) = O_P(δ²), then as n → ∞,

√(n h^d) | (F̂_n − F)(Γ_{g(·|x)}(λ_α) | x) + μ'_C(α|x)^{−1} (μ̂_C(α|x) − μ_C(α|x)) | = O_P(Ψ(δ², n)).
References

Andrews, D.F., Bickel, P.J., Hampel, F.R., Huber, P.J., Rogers, W.H. and Tukey, J.W. (1972). Robust Estimation of Location: Survey and Advances. Princeton University Press, Princeton, N.J.

Bosq, D. (1996). Nonparametric Statistics for Stochastic Processes. Lecture Notes in Statistics 110, Springer, New York.

Brannas, K. and De Gooijer, J.G. (1992). Modelling business cycle data using autoregressive-asymmetric moving average models. ASA Proceedings of Business and Economic Statistics Section, 331-336.

Chan, K.S. and Tong, H. (1994). A note on noisy chaos. J. R. Statist. Soc. B, 56, 301-311.

Davies, N., Pemberton, J. and Petruccelli, J.D. (1988). An automatic procedure for identification, estimation and forecasting univariate self-exciting threshold autoregressive models. The Statistician, 37, 119-204.

De Gooijer, J.G. and Gannoun, A. (1997). Nonparametric conditional predictive regions for stochastic processes. (A preprint.)

Dudley, R.M. (1974). Metric entropy of classes of sets with differentiable boundaries. J. Approx. Theory, 10, 227-236.

Dudley, R.M. (1984). A course in empirical processes. Ecole d'Ete de Probabilites de Saint-Flour XII-1982, Lecture Notes in Mathematics 1097, 1-142, Springer, New York.

Einmahl, J.H.J. and Mason, D.M. (1992). Generalized quantile processes. Ann. Statist., 20, 1062-1078.

Fan, J. (1992). Design-adaptive nonparametric regression. J. Amer. Statist. Assoc., 87, 998-1004.

Fan, J., Yao, Q. and Tong, H. (1996). Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems. Biometrika, 83, 189-206.

Friedman, J.H. (1991). Multivariate adaptive regression splines. Ann. Statist., 19, 1-142.

Hall, P., Wolff, R.C.L. and Yao, Q. (1997). Methods for estimating a conditional distribution function. (Submitted.)

Hartigan, J.A. (1987). Estimation of a convex density contour in two dimensions. J. Amer. Statist. Assoc., 82, 267-270.

Hyndman, R.J. (1995). Forecast regions for non-linear and non-normal time series models. Int. J. Forecasting, 14, 431-441.

Hyndman, R.J. (1996). Computing and graphing highest density regions. Amer. Statist., 50, 361-365.

Kim, J. and Pollard, D. (1990). Cube root asymptotics. Ann. Statist., 18, 191-219.

Muller, D.W. and Sawitzki, G. (1987). Using excess mass estimates to investigate the modality of a distribution. Preprint No. 398, SFB 123, Universitat Heidelberg.

Muller, D.W. and Sawitzki, G. (1991). Excess mass estimates and tests of multimodality. J. Amer. Statist. Assoc., 86, 738-746.

Polonik, W. (1995). Measuring mass concentration and estimating density contour clusters - an excess mass approach. Ann. Statist., 23, 855-881.

Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stoch. Processes Appl., 69, 1-24.

Polonik, W. and Yao, Q. (1997). Set indexed conditional empirical and quantile processes. Manuscript in preparation.

Tong, H. and Moeanaddin, R. (1988). On multi-step non-linear least squares prediction. The Statistician, 37, 101-110.

Yao, Q. and Tong, H. (1994a). Quantifying the influence of initial values on nonlinear prediction. J. Roy. Statist. Soc. B, 56, 701-725.

Yao, Q. and Tong, H. (1994b). On subset selection in non-parametric stochastic regression. Statistica Sinica, 4, 51-70.

Yao, Q. and Tong, H. (1995a). Asymmetric least squares regression estimation: a nonparametric approach. J. Nonparametric Statist., 6, 273-292.

Yao, Q. and Tong, H. (1995b). On initial-condition sensitivity and prediction in nonlinear stochastic systems. Bull. Int. Statist. Inst., IP 10.3, 395-412.

Yu, K. and Jones, M.C. (1997). Local linear quantile regression. J. Amer. Statist. Assoc. (To appear.)

Young, P.C. (1993). Time variable and state dependent modelling of nonstationary and nonlinear time series. In Developments in Time Series Analysis (ed. T. Subba Rao), Chapman and Hall, London, 374-413.

Young, P.C. and Beven, K.J. (1994). Data-based mechanistic modelling and the rainfall-flow nonlinearity. Environmetrics, 5, 335-363.