Combining Forecasting Procedures: Some Theoretical Results

Yuhong Yang
Department of Statistics and Statistical Laboratory
312 Snedecor Hall, Iowa State University
Ames, IA 50011-1210
Email: [email protected]

March, 2000

Abstract
We study some methods of combining procedures for forecasting a continuous random variable. Statistical risk bounds under the squared error loss are obtained under mild distributional assumptions on the future given the current outside information and the past observations. The risk bounds show that the combined forecast automatically achieves the best performance among the candidate procedures up to a constant factor and an additive penalty term. In terms of the rate of convergence, the combined forecast performs as well as if one knew in advance which candidate forecasting procedure is the best. Empirical studies suggest that combining procedures can sometimes improve forecasting accuracy compared to the original procedures. Risk bounds are derived to theoretically quantify the potential gain and the price of linearly combining forecasts for improvement. The result supports the empirical finding that it is not automatically a good idea to combine forecasts. Blind combining can degrade performance dramatically due to the undesirably large variability in estimating the best combining weights. An automated combining method is shown in theory to achieve a balance between the potential gain and the complexity penalty (the price for combining); to take advantage (if any) of sparse combining; and to maintain the best performance (in rate) among the candidate forecasting procedures if linear or sparse combining does not help.
Keywords: Combining forecasts, combining for adaptation, combining for improvement.
1 Introduction

Forecasting or prediction is of interest in many scientific fields. Depending on the nature of the process of constructing forecasts, there are model-based approaches and more subjective opinion-based methods. The former include empirically based modeling utilizing statistical tools and theory-based methods, which depend mainly on a theory about the subject matter. Opinion-based forecasting requires extensive experience in the subject field, and experts' predictions are usually based on a difficult-to-quantify mixture of many different sources of information (qualitative and/or quantitative) and their subjective intuition or beliefs. A large body of research in various fields has strongly suggested that when forecasts are combined (sometimes even in simple fashions), the performance of prediction can be enhanced. A survey of research results on combining forecasts is in Clemen (1989). While the idea of combining models/wisdom is always attractive, the challenge, of course, is how to do it right. Although many approaches have been proposed for combining forecasts, with a lot of
empirical evidence of success and some theoretical justifications, there is still a lack of general theoretical understanding. A difficulty is that the forecasts to be combined can be completely different in their derivations and assumptions, and therefore the commonly used methods for comparing statistical models, such as model selection criteria (e.g., AIC (Akaike, 1973), BIC (Schwarz, 1978), or cross-validation (Stone, 1974)) or the traditional hypothesis testing techniques, are not generally applicable. As pointed out by, e.g., Armstrong (1989) and others, the research in the area of combining forecasts so far provides few guidelines for applications. In this paper, we propose some methods for combining point forecasts of a continuous variable and address several issues on combining forecasts. We focus on theoretical developments.
1.1 Setup of the problem

Suppose that we are interested in forecasting or predicting a real-valued continuous random quantity $Y$. Let $Y_1, Y_2, \ldots$ be the true values at times 1, 2, ... respectively. At each time $i \ge 1$, (possibly) related qualitative and/or quantitative information, denoted jointly by $X_i$ (which thus could be multidimensional), is observed prior to the occurrence of $Y_i$. We need to predict $Y_i$ based on $X_i$ and the earlier data $Z^{i-1} = (X_l, Y_l)_{l=1}^{i-1}$. After $Y_i$ is revealed, a possible discrepancy (loss) occurs between the forecast and the true value $Y_i$. The goal of forecasting is to make the discrepancy as small as possible.

Assume that for $i \ge 1$, the conditional distribution of $Y_i$ given $X_i = x_i$ and the previous observations $Z^{i-1} = z^{i-1}$ has (conditional) mean $m_i$ and variance $v_i$. That is, $Y_i = m_i + e_i$, where $e_i$ is the random error representing the conditional uncertainty in $Y$ at time $i$. The conditional mean $m_i$ and variance $v_i$ depend on $x_i$ and $z^{i-1}$ in general. Let $\hat{y}_i$ be a predicted value of $Y_i$. Then, since $\hat{y}_i$ is determined by $z^{i-1}$ and $x_i$ and the cross term $2(m_i - \hat{y}_i) E_i e_i$ vanishes,
$$E_i (Y_i - \hat{y}_i)^2 = E_i (m_i + e_i - \hat{y}_i)^2 = (m_i - \hat{y}_i)^2 + v_i,$$
where $E_i$ denotes the conditional expectation given $Z^{i-1} = z^{i-1}$ and $X_i = x_i$. The decomposition says that given $z^{i-1}$ and $x_i$, the expected total squared error of $\hat{y}_i$ has two components: the squared difference (bias) between $\hat{y}_i$ and the unknown conditional mean $m_i$, and the conditional uncertainty $v_i$. The latter part is not in one's control and therefore is always present regardless of which method is used for prediction. The best one can hope for is that $\hat{y}_i$ equals (or is very close to) $m_i$. Since $v_i$ is the same for all forecasts, it will not be included in our measure of performance. Accordingly, we consider the loss function $L(Y_i, \hat{y}_i) = (m_i - \hat{y}_i)^2$.

Let $\delta$ be a forecasting procedure yielding forecasts $\hat{y}_1, \hat{y}_2, \ldots$ at times 1, 2, and so on. We consider the average cumulative risk in prediction up to the present time $n$, denoted by $R(\delta; n)$, as the performance measure of the forecasting method $\delta$, i.e.,
$$R(\delta; n) = \frac{1}{n} \sum_{i=1}^{n} E (m_i - \hat{y}_i)^2,$$
where the expectation is taken with respect to the randomness of $Z^{n-1}$ and $X_n$.
Now suppose we have a collection of forecasting procedures, denoted by $\Delta = \{\delta_1, \delta_2, \ldots\}$. The number of forecasting procedures in $\Delta$ can be large or even countably infinite, which, from a theoretical point of view, is appropriate for capturing the complexity of the situation when the number of candidate forecasting procedures is large relative to $n$. For each forecasting procedure $\delta_j \in \Delta$, at each time $i \ge 1$, the procedure gives a predicted value or forecast $\hat{y}_{j,i}$ based on $z^{i-1}$ and $x_i$. The outside information $x$ may be completely, partially, or not at all available to a procedure. Our interest in this paper is in combining the candidate procedures. Some basic questions are:
- What are the advantages and disadvantages of combining forecasts? What price do we have to pay when combining forecasts?
- How many forecasts should be combined?
1.2 Some views on combining forecasts

Despite empirical successes of combining forecasts, the topic started out as somewhat controversial (see, e.g., the discussion of Newbold and Granger (1974)), especially in the statistical community. One reason is the lack of trust in the rather ad hoc nature of the majority of the proposed methods from a statistical point of view. Another, more philosophical, objection views combining forecasts as fundamentally flawed as a technique. The argument is that one should combine all the information that is used by the individual forecasters, and with the pooled information, one can construct a single best model, based on which the optimal forecast can be found.

We regard the first reason above as valid. Useful methods have been proposed to pursue the "optimal" combination based on variability assessment of the assumed unbiased forecasts (e.g., Bates and Granger (1969)). However, there are few results on the performance of the combining methods in terms of statistical risks. Such results would be very helpful for comparing different methodologies. Intuitively, at least based on the familiar danger of overfitting in related contexts (e.g., regression), there should be a price (in terms of performance, in addition to the computing cost) for, e.g., linearly combining a number of forecasts, because the "optimal" weights must be estimated and therefore can have a large variability when combining a number of forecasts. Simulation and case studies have indeed shown that complicated combining methods often lead to unstable weights and the combined forecast performs even worse than the best individual forecast (see, e.g., Figlewski and Urich (1983), Kang (1986), Clemen and Winkler (1986) and others). A related question is how many forecasts should be combined. Should one just go ahead and combine all the forecasts available? Understanding these questions will provide helpful guidelines for combining procedures in practice.

Regarding the objection to combining forecasts based on the philosophy that combining information and then creating a comprehensive model is preferable, we strongly disagree. As pointed out by many researchers who favor combining forecasts, the combining-information approach may not be feasible if one does not have access to the information used by the individual forecasters. Furthermore, even if
one can in principle obtain the information, practical considerations (e.g., cost, time constraints, and lack of expertise in processing some of the information) may well prohibit such an action. While these are valid defenses of combining forecasts, we would like to argue that even if one ignores these difficulties associated with combining information to build a super model, combining forecasts is still a legitimate and constructive approach to combining information (but without an attempt to build a super model). Combining information to build a super model, in our view, in light of nonparametric statistical research, has a very limited potential for successful applications. Let us consider a relatively simple regression case to illustrate the point. Assume that the variable being forecast, $Y_n$, is a function of the information set $X_n = (X_{n1}, X_{n2}, \ldots)$ plus some white noise, i.e., $Y_n = f(X_n) + \varepsilon_n$, where the errors are independent. The individual forecasters have access to some or all of the components of $X_n$. Of course, $f$ is unknown to any forecaster. Unless one has a very strong belief (hopefully a correct one) in a parametric form of $f$, the estimation of $f$ is a nonparametric problem. A super model of infinite dimension is not very helpful. Besides fully nonparametric methods, such as kernel and nearest-neighbor regression, an automated parametric approximation approach has gained more and more attention in recent years. The idea is that while honestly admitting that the target $f$ cannot be described by a model with a finite number of parameters, one hopes that it can be approximated reasonably well by a sequence of parametric models. A key here is to allow the dimensions of the approximating models to increase appropriately (based on data) so that the approximation error (bias) of the model and the estimation error (variance of the estimator of $f$) are well balanced to achieve the optimal performance. The fundamental difference between this approach and the traditional parametric approach is that the latter performs estimation and prediction based on the selected model, taking it as the true model. It is now commonly acknowledged that such parametric approaches usually lead to under-estimation of uncertainty in the estimation/prediction procedure (see, e.g., Chatfield (1995)). For nonparametric estimation of $f$ via approximating models, various results have shown that with appropriate model selection criteria, the resulting estimator or predictor converges in an optimal fashion without knowledge of which approximating model is the best at the given sample size (for some recent results, see, e.g., Barron and Cover (1991), Yang and Barron (1998), Barron, Birgé, and Massart (1999), Yang (1999a), Lugosi and Nobel (1999)). When $f$ is not finite-dimensional, the best model balancing bias and variability in general depends on the sample size, and its dimension increases as more data arrive. From this point of view, identifying the right model for forecasting is problematic. One will never recover the true model, but the prediction accuracy will get better and better. Even if one is given the right model with a large number of unknown parameters, one is still better off using a much simpler model, trading the bias for much less variability. From the above, model selection is a valuable technique for using the combined information. It can be regarded as a special case of combining forecasts: the individual forecasts are based on different models and the weights on the candidate forecasts concentrate on only one individual.
Note that the objective here is to select the best model for the purpose of estimation/prediction. This goal is different from the focus
of combining forecasts in the literature, where the forecasts are combined to improve the performance of even the best individual forecast. It is anticipated that different information and aspects of the forecasting procedures, when put together suitably, can yield a better prediction procedure. Some of the research on combining forecasts seems to suggest that if the combined procedure cannot do better than the best individual candidate, it is not worth considering. While we agree that one should try to get the best out of the individual forecasts, we view the problem of combining forecasts to achieve the best individual performance (which will be called combining for adaptation) as worth studying, and in some sense it is even more fundamental than combining for improvement. The problem of selecting the best model is far from trivial, so attaining the performance of the best forecast (which is unknown, of course) is a serious research question. In addition, if some forecasts are not model-based, the usual model selection criteria do not apply. Combining for improvement is closely related to the problem of combining to achieve the best performance, because one can consider the set of all the allowed combined forecasts (e.g., through linear combination) as the new set of candidate procedures. The techniques and results used for combining for adaptation will then be useful as well for combining for improvement.

Model selection often produces unstable estimators in the sense that a slight change in the data causes the selection of a different model, which usually makes the estimator or forecast based on the selected model have unnecessarily large variance. While strategies such as bagging (Breiman (1996b)) have been suggested to stabilize such estimators or forecasts, there is evidence suggesting that appropriate weighting of the models can perform better; see, e.g., Buckland et al. (1995) and Yang (1999bc). Thus combining procedures with suitable weights can be better than model selection in terms of estimation or prediction accuracy. Under the Bayesian framework, promising results have been obtained for Bayesian model averaging (BMA); see the recent review article by Hoeting et al. (1999) for references.

For the reasons above, we will begin with combining for adaptation. The forecasts to be combined are not restricted to be model-based. Results will also be given on combining for improvement via linear combinations. In recent years, a number of interesting results have been obtained on combining experts or forecasts in the machine learning literature (see, e.g., Vovk (1990), Littlestone and Warmuth (1994), Cesa-Bianchi et al. (1997)). The emphasis there is that no probabilistic assumptions at all are required on the generation of the data and the performance bound on the combined procedure is required to hold for all possible realizations of the data. The results typically state that the cumulative loss of the combined procedure is always guaranteed to attain that of the best individual procedure within a small penalty term. While amazingly general, for the methods to be successful in practice one needs a good collection of original forecasting procedures. However, to come up with a good set of procedures, probabilistic and statistical assumptions are most likely (if not always) necessary. Statistical risks then naturally tie together the first stage of constructing individual forecasts and the second stage of combining them.
Note that performance bounds valid for all realizations of the data usually imply some probabilistic bounds. However, such bounds need not be optimal and may not be sufficient for statistical characterizations, as will be discussed later in Section 2.
1.3 Potential applications

1.3.1 Single series forecasting
In this case, one is interested in predicting a variable based on its past values. Here no outside information is used to help prediction. Let $Y_1, Y_2, \ldots$ be a univariate time series. Different statistical models/methods, including Box-Jenkins methods using ARIMA models (Box and Jenkins (e.g., 1976)), exponential smoothing (e.g., Holt (1957) and Winters (1960)), as well as models based on, e.g., econometric theories, have been proposed and are widely used in practice. The methods we study can be used to combine these procedures. Note also that forecasting procedures based on ARIMA models with different orders can be combined, instead of selecting one using a formal criterion and/or graphical inspection. As mentioned earlier, model selection often produces unstable estimators/forecasts, and a suitable model averaging (combining) can do a better job in terms of prediction accuracy. Many authors on combining forecasts have commented that one might expect a greater improvement when the procedures are more distinct in nature (see, e.g., Newbold and Granger (1974)). The above discussion suggests that it may also be worthwhile to combine procedures of similar nature to avoid the large variability due to selection.
1.3.2 Forecasting based on regression

Here suppose the outside information $X_i = (X_{i1}, \ldots, X_{id})$ consists of $d$ explanatory variables and takes values in $\mathcal{D} \subset \mathbf{R}^d$. The relationship between $Y$ and $X$ is modeled as $Y_i = f(X_i) + \varepsilon_i$, where $f$ is the unknown regression function and the $\varepsilon_i$'s are the random errors. There are several scenarios that one can consider according to different assumptions on the regression function (parametric or nonparametric), the nature of the explanatory variables (fixed design or random design), and assumptions on the errors (independence, short- or long-range dependence, non-stationarity). Many research results have been obtained for different combinations of the above assumptions. In principle, one can use the methods we study to combine forecasts based on various such regression procedures.
2 Proposed methods for combining forecasts for adaptation

2.1 Combining when the conditional variances are known
We first consider the case when the conditional distribution of $Y_i$ given $Z^{i-1} = z^{i-1}$ and $X_i = x_i$ is Gaussian for all $i \ge 1$ with the conditional variances $v_i$ known. This, of course, is not realistic in most
applications, but it provides a relatively simple case to begin with. To combine the forecasting procedures $\Delta = \{\delta_1, \delta_2, \ldots\}$ at each time $n$, we look at their past performances and assign weights accordingly as follows. Let $\Pi = \{\pi_j : j \ge 1\}$ be the prior weights on the procedures, where the $\pi_j$'s are positive numbers summing up to 1. Let $W_{j,1} = \pi_j$ and for $n \ge 2$, let
$$W_{j,n} = \frac{\pi_j \exp\left(-\frac{1}{2}\sum_{i=1}^{n-1} \frac{(Y_i - \hat{y}_{j,i})^2}{v_i}\right)}{\sum_{j' \ge 1} \pi_{j'} \exp\left(-\frac{1}{2}\sum_{i=1}^{n-1} \frac{(Y_i - \hat{y}_{j',i})^2}{v_i}\right)}.$$
Note that $\sum_{j=1}^{\infty} W_{j,n} = 1$ for $n \ge 1$ and $W_{j,n}$ depends only on the past forecasts and the corresponding actual realizations of $Y$. Then the combined forecasting procedure is a convex combination of the original procedures:
$$\hat{y}_n = \sum_{j=1}^{\infty} W_{j,n}\, \hat{y}_{j,n}$$
for $n \ge 1$. Related algorithms focusing on worst-case performance have been studied in machine learning; see, e.g., Vovk (1990), Littlestone and Warmuth (1994), Cesa-Bianchi et al. (1997), Haussler, Kivinen and Warmuth (1998). Closely related ideas on Bayesian updating, universal coding, and prequential statistical analysis have been considered earlier and since in information theory and statistics (see, e.g., Cover (1965), Dawid (1984), Foster (1991), and the recent review articles on universal prediction (Merhav and Feder (1998)) and on the minimum description length principle in estimation and prediction (Barron, Rissanen, and Yu (1998))). Results in the machine learning literature focus on bounding the cumulative loss $\sum_{i=1}^{n}(Y_i - \hat{y}_i)^2$ (or losses under other loss functions). Results are usually obtained under the strong condition that the observations $Y_i$ are uniformly bounded. Methods and statistical risk bounds for combining estimation procedures in the context of density estimation, regression and conditional probability estimation are in Yang (1999bce, 2000ab) and Catoni (1999).

Note that in the determination of the $W_{j,n}$'s, the previous prediction errors are weighted according to the actual conditional variances $v_i$. When there is a larger uncertainty in $Y_i$ given $z^{i-1}$ and $x_i$, one is more likely to have a larger prediction error. The division by $v_i$ is in accordance with one's willingness to tolerate a bigger error in such a case. Notice also that the weight assignment does not depend on the outside information $x_n$ directly, though the forecasts $\hat{y}_{j,n}$ and $v_n$ will generally, as expected, depend on $x_n$ (partially or fully). Note also that
$$W_{j,n} = \frac{W_{j,n-1} \exp\left(-\frac{(Y_{n-1} - \hat{y}_{j,n-1})^2}{2 v_{n-1}}\right)}{\sum_{j' \ge 1} W_{j',n-1} \exp\left(-\frac{(Y_{n-1} - \hat{y}_{j',n-1})^2}{2 v_{n-1}}\right)}.$$
Thus after each additional observation, the weights on the candidate forecasts are updated. We call the algorithm Aggregated Forecast Through Exponential Re-weighting (AFTER).
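To make the recursion concrete, here is a minimal Python sketch of the AFTER update with known conditional variances. The function name, array layout, and the two toy candidate forecasters are our own illustration and are not part of the paper.

```python
import numpy as np

def after_known_variance(y, forecasts, variances, prior=None):
    """AFTER weights and combined forecasts with known conditional variances.

    y          : array of length n, realized values Y_1, ..., Y_n
    forecasts  : array (n, M); forecasts[i, j] = yhat_{j, i+1}, made before Y_{i+1}
    variances  : array of length n, conditional variances v_1, ..., v_n
    prior      : length-M prior weights pi_j (uniform if None)
    Returns (combined forecasts, weight history).
    """
    n, M = forecasts.shape
    w = np.full(M, 1.0 / M) if prior is None else np.asarray(prior, float)
    combined = np.empty(n)
    weights = np.empty((n, M))
    for i in range(n):
        weights[i] = w
        combined[i] = w @ forecasts[i]          # yhat_i = sum_j W_{j,i} yhat_{j,i}
        # exponential re-weighting by the realized squared error, scaled by v_i
        w = w * np.exp(-0.5 * (y[i] - forecasts[i]) ** 2 / variances[i])
        w = w / w.sum()
    return combined, weights

# Toy usage (ours): combine a "last value" and a "running mean" forecaster.
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200)) * 0.1 + rng.normal(size=200)
f1 = np.r_[0.0, y[:-1]]                                   # random-walk forecast
f2 = np.r_[0.0, np.cumsum(y)[:-1] / np.arange(1, 200)]    # running-mean forecast
yhat, W = after_known_variance(y, np.c_[f1, f2], np.ones(200))
```

The renormalization at each step reproduces the recursive form of $W_{j,n}$ given above, so the weights depend only on past forecasts and realized values.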
We need an assumption controlling the magnitudes of the conditional means and the forecasts relative to the conditional variance $v_n$.

Condition 0: There exists a constant $\tau > 0$ such that for all $i \ge 1$,
$$|m_i| \le \tau \sqrt{v_i}, \qquad \sup_{j \ge 1} |\hat{y}_{j,i}| \le \tau \sqrt{v_i}.$$
Theorem 1: Assume that Condition 0 is satisfied. Then the mean average cumulative square risk (relative to the conditional variance) of the combined procedure satisfies
$$\frac{1}{n}\sum_{i=1}^{n} E\left(\frac{(m_i - \hat{y}_i)^2}{v_i}\right) \le 2(\tau^2 + 1)\inf_{j \ge 1}\left(\frac{2\log(1/\pi_j)}{n} + \frac{1}{n}\sum_{i=1}^{n} E\left(\frac{(m_i - \hat{y}_{j,i})^2}{v_i}\right)\right).$$
Remarks:

1. Bounds of the type
$$\frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{y}_i)^2 \le C\,\frac{\log M}{n} + \inf_{1 \le j \le M}\frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{y}_{j,i})^2$$
for combining $M$ procedures based on related methods are given in, e.g., Haussler, Kivinen and Warmuth (1998). But the result is obtained under the very strong assumption that the $Y_i$'s are uniformly bounded. Such a requirement is usually not appropriate for modeling the uncertainty in $Y$.

2. Upper bounds of the type
$$\frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{y}_i)^2 \le C \inf_{1 \le j \le M}\left(\frac{\log M}{n} + \frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{y}_{j,i})^2\right),$$
with a constant $C > 1$, when taking expectations, are much weaker than the bound given in the theorem. From the result, up to a constant factor and an additive penalty $\log(1/\pi_j)/n$, the combined procedure achieves the performance of the optimal forecasting procedure among the candidates. In particular, if there are $M$ forecasting procedures, then with uniform weight $\pi_j = 1/M$ we have
$$\frac{1}{n}\sum_{i=1}^{n} E\left(\frac{(m_i - \hat{y}_i)^2}{v_i}\right) \le 2(\tau^2 + 1)\left(\frac{2\log M}{n} + \inf_{1 \le j \le M}\frac{1}{n}\sum_{i=1}^{n} E\left(\frac{(m_i - \hat{y}_{j,i})^2}{v_i}\right)\right).$$
In Theorem 1, the risks of the forecasts are measured relative to the variability of $Y$. Under some conditions, this implies good performance of the combined forecast under the ordinary square loss.

Condition 1: The conditional variances $v_n$ are uniformly bounded away from infinity and from zero, i.e., there exist positive constants $A_1$ and $A_2$ such that $A_1 \le v_n \le A_2$ for all $n \ge 1$.

Condition 2: The conditional means $m_n$ are uniformly bounded between $-A_3$ and $A_3$ for some constant $A_3 > 0$. The forecasts $\hat{y}_{j,n}$ are also bounded in this range for all $j$ and $n$.

Corollary 1: Assume that Conditions 1 and 2 are satisfied. Then the mean average cumulative square risk of the combined forecast satisfies
$$\frac{1}{n}\sum_{i=1}^{n} E (m_i - \hat{y}_i)^2 \le C \inf_{j \ge 1}\left(\frac{\log(1/\pi_j)}{n} + \frac{1}{n}\sum_{i=1}^{n} E (m_i - \hat{y}_{j,i})^2\right),$$
where $C$ is a constant depending only on $A_1$, $A_2$ and $A_3$.

Remark: One does not need to know the constants $A_1$, $A_2$ or $A_3$ to use the AFTER algorithm to combine forecasts.
2.2 Combining when variances are estimated by individual procedures

For Theorem 1, the conditional variances $v_i$ are assumed to be known. This is quite restrictive and is perhaps applicable in practice only when one has a reliable assessment of the variability in $Y$ (which is possible under stationarity conditions). Similar results can be obtained when the variances are estimated by the individual forecasters. Assume that for each forecasting procedure $\delta_j$, prior to forecasting $Y_n$, an estimate of $v_n$, say $\hat{v}_{j,n}$, is obtained based on $z^{n-1}$ and $x_n$. If the observations are stationary, various variance estimation methods have been proposed for different scenarios. Note that the procedures do not have to use different variance estimators, and if some procedures do not provide variance estimates, we can borrow from the others.

Condition 3: Assume the variance estimators $\hat{v}_{j,n}$ are not too far away from the true values: there exists a constant $\lambda > 1$ such that $\lambda^{-1} \le \hat{v}_{j,n}/v_n \le \lambda$ for all $j \ge 1$ and $n \ge 1$.

We modify the AFTER algorithm using the variance estimates. Let $W_{j,1} = \pi_j$ and for $n \ge 2$, let
$$W_{j,n} = \frac{\pi_j \left(\prod_{i=1}^{n-1} \hat{v}_{j,i}^{1/2}\right)^{-1} \exp\left(-\frac{1}{2}\sum_{i=1}^{n-1} \frac{(Y_i - \hat{y}_{j,i})^2}{\hat{v}_{j,i}}\right)}{\sum_{j' \ge 1} \pi_{j'} \left(\prod_{i=1}^{n-1} \hat{v}_{j',i}^{1/2}\right)^{-1} \exp\left(-\frac{1}{2}\sum_{i=1}^{n-1} \frac{(Y_i - \hat{y}_{j',i})^2}{\hat{v}_{j',i}}\right)}.$$
Then combine the forecasts according to the weights:
$$\hat{y}_n = \sum_{j=1}^{\infty} W_{j,n}\, \hat{y}_{j,n}.$$
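A minimal Python sketch of this variant (ours, not the paper's) follows; here each procedure is scored by its own Gaussian plug-in predictive density, and the log-weight bookkeeping is only a numerical-stability choice.

```python
import numpy as np

def after_estimated_variance(y, forecasts, var_estimates, prior=None):
    """AFTER update when each procedure supplies its own variance estimate.

    forecasts[i, j]     = yhat_{j, i+1}
    var_estimates[i, j] = vhat_{j, i+1}, procedure j's variance estimate for time i+1
    """
    n, M = forecasts.shape
    logw = np.log(np.full(M, 1.0 / M) if prior is None else np.asarray(prior, float))
    combined = np.empty(n)
    for i in range(n):
        w = np.exp(logw - logw.max())
        w /= w.sum()
        combined[i] = w @ forecasts[i]
        # score by the Gaussian predictive density with mean yhat_{j,i} and variance vhat_{j,i}
        logw += (-0.5 * np.log(var_estimates[i])
                 - 0.5 * (y[i] - forecasts[i]) ** 2 / var_estimates[i])
    return combined
```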
Theorem 2: Assume that Conditions 0 and 3 are satisfied. Then the mean average cumulative square risk of the combined procedure satisfies
$$\frac{1}{n}\sum_{i=1}^{n} E\left(\frac{(m_i - \hat{y}_i)^2}{v_i}\right) \le (3\tau^2 + 1)\inf_{j \ge 1}\left(\frac{2\log(1/\pi_j)}{n} + \frac{1}{n}\sum_{i=1}^{n} E\left(\frac{(m_i - \hat{y}_{j,i})^2}{v_i}\right) + \frac{C}{n}\sum_{i=1}^{n} E\left(\frac{(\hat{v}_{j,i} - v_i)^2}{v_i^2}\right)\right),$$
where $C$ is a constant depending only on $\lambda$.

Note that the risks of the variance estimators also appear in the above performance bound, and thus for this combining method variance estimation is also important. However, we do not need good variance estimation for all the procedures, as long as there is an accurate one for a good procedure $\delta_j$. The weighting method and the risk bound suggest that procedures with bad forecasts and/or bad variance estimation receive rather small weights and accordingly do not affect the performance of the good procedures. In order to have good estimators of the $v_i$'s, assumptions on the relationship between observations are necessary. For example, one may consider stationary errors with independence, or short- or long-range dependence. The performance of variance estimators can be theoretically justified under reasonable conditions.
2.3 Combining under non-Gaussian errors

We now relax the normality requirement on the conditional distributions of $Y_n$, $n \ge 1$. Let $h(t)$, $t \in \mathbf{R}$, be a fixed probability density function with mean zero and variance 1. We assume that $Y_i$ has the conditional density $h\left(\frac{y - m_i}{s_i}\right)/s_i$ given $Z^{i-1} = z^{i-1}$ and $X_i = x_i$, where $s_i = \sqrt{v_i}$ is the conditional standard deviation of $Y_i$. We assume $h$ is known. Non-Gaussian choices of $h$ allow different degrees of heavy tails in the conditional distribution of $Y$.

Condition 4: The density $h$ satisfies that for each pair $0 < s_0 < 1$ and $T > 0$, there exists a constant $B = B_{s_0,T}$ (depending on $s_0$ and $T$) such that
$$\int h(y)\,\log\frac{h(y)}{\frac{1}{s}\,h\!\left(\frac{y - t}{s}\right)}\,dy \le B\left((1 - s)^2 + t^2\right)$$
for all $s_0 \le s \le s_0^{-1}$ and $-T \le t \le T$.

Note that $\int h(y)\log\frac{h(y)}{\frac{1}{s}h\left(\frac{y-t}{s}\right)}\,dy$ above is the Kullback-Leibler divergence between the density $h$ and a re-scaled and re-located density $\frac{1}{s}h\left(\frac{y-t}{s}\right)$. Under mild conditions, this divergence should behave locally like a squared distance. The condition is satisfied by the Gaussian, double-exponential, $t$, and many other distributions.

We assume variance estimators are available for the forecasting procedures. Let $\hat{s}_{j,i} = \sqrt{\hat{v}_{j,i}}$ for $j \ge 1$ and $i \ge 1$. The earlier AFTER algorithm is now modified to compute the weights based on the density function $h$. Let $W_{j,1} = \pi_j$ and for $n \ge 2$, let
$$W_{j,n} = \frac{\pi_j \prod_{i=1}^{n-1} \frac{1}{\hat{s}_{j,i}}\, h\!\left(\frac{Y_i - \hat{y}_{j,i}}{\hat{s}_{j,i}}\right)}{\sum_{j' \ge 1} \pi_{j'} \prod_{i=1}^{n-1} \frac{1}{\hat{s}_{j',i}}\, h\!\left(\frac{Y_i - \hat{y}_{j',i}}{\hat{s}_{j',i}}\right)}$$
and combine the forecasts accordingly:
$$\hat{y}_n = \sum_{j=1}^{\infty} W_{j,n}\, \hat{y}_{j,n}.$$
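As an illustration (ours, not the paper's), the update above can be coded for any assumed standardized density $h$; the sketch below uses a double-exponential $h$ with mean 0 and variance 1.

```python
import numpy as np

def log_laplace_std(t):
    # double-exponential density with mean 0 and variance 1 (scale 1/sqrt(2))
    return -0.5 * np.log(2.0) - np.sqrt(2.0) * np.abs(t)

def after_general_density(y, forecasts, sd_estimates, log_h=log_laplace_std, prior=None):
    """AFTER weights based on an assumed standardized error density h.

    sd_estimates[i, j] = shat_{j, i+1}, procedure j's conditional s.d. estimate.
    log_h              = log of the assumed density with mean 0 and variance 1.
    """
    n, M = forecasts.shape
    logw = np.log(np.full(M, 1.0 / M) if prior is None else np.asarray(prior, float))
    combined = np.empty(n)
    for i in range(n):
        w = np.exp(logw - logw.max())
        w /= w.sum()
        combined[i] = w @ forecasts[i]
        # plug-in predictive density h((Y - yhat)/shat)/shat evaluated at the realized Y_i
        t = (y[i] - forecasts[i]) / sd_estimates[i]
        logw += log_h(t) - np.log(sd_estimates[i])
    return combined
```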
Theorem 3: Assume that Conditions 0, 3 and 4 are satisfied. Then the mean average cumulative square risk of the combined procedure satisfies
$$\frac{1}{n}\sum_{i=1}^{n} E\left(\frac{(m_i - \hat{y}_i)^2}{v_i}\right) \le (3\tau^2 + 1)\inf_{j \ge 1}\left(\frac{2\log(1/\pi_j)}{n} + \frac{C}{n}\sum_{i=1}^{n} E\left(\frac{(m_i - \hat{y}_{j,i})^2}{v_i}\right) + \frac{C'}{n}\sum_{i=1}^{n} E\left(\frac{(v_i - \hat{v}_{j,i})^2}{v_i^2}\right)\right),$$
where $C$ and $C'$ are constants depending only on $\tau$, $\lambda$ and the constant $B$ in Condition 4.
2.4 Combining based on distribution estimates

Suppose at each time $i$, forecaster $j$ comes up with an estimate, say $\hat{p}_{j,i}$, of the conditional distribution of $Y_i$ given $Z^{i-1} = z^{i-1}$ and $X_i = x_i$. It reflects the forecaster's assessment of the uncertainty in $Y_i$ prior to its occurrence. Various methods have been suggested to combine probability distribution estimates (see, e.g., Genest and Zidek (1984) for a review on this topic). Here our focus is on combining the distribution estimates to form a composite forecast of $Y$.

Let $W_{j,1} = \pi_j$ and for $n \ge 2$, let
$$W_{j,n} = \frac{\pi_j \prod_{i=1}^{n-1} \hat{p}_{j,i}(Y_i)}{\sum_{j' \ge 1} \pi_{j'} \prod_{i=1}^{n-1} \hat{p}_{j',i}(Y_i)}$$
and combine the forecasts accordingly:
$$\hat{y}_n = \sum_{j=1}^{\infty} W_{j,n}\, \hat{y}_{j,n}.$$
Assume that at time $i$, the true conditional distribution of $Y_i$ has density $\frac{1}{s_i} h_i\!\left(\frac{y_i - m_i}{s_i}\right)$, where $h_i$ is a probability density function with mean 0 and variance 1. We assume that at least one forecaster correctly specifies the right forms of the conditional densities $h_i$ for all $i \ge 1$ and provides estimates of the conditional means and variances. Let $J$ denote the collection of all such forecasting procedures. Note that the set $J$ need not be known to combine the procedures.

Condition 5: Assume that Condition 4 is satisfied uniformly for all $h_i$, $i \ge 1$, in the sense that the constants $B_{s_0,T}$ for the $h_i$ in the condition do not depend on $i$.

Theorem 4: Assume that Conditions 0, 3 and 5 are satisfied. Then the mean average cumulative square risk of the combined procedure satisfies
$$\frac{1}{n}\sum_{i=1}^{n} E\left(\frac{(m_i - \hat{y}_i)^2}{v_i}\right) \le (3\tau^2 + 1)\inf_{j \in J}\left(\frac{2\log(1/\pi_j)}{n} + \frac{C}{n}\sum_{i=1}^{n} E\left(\frac{(m_i - \hat{y}_{j,i})^2}{v_i}\right) + \frac{C'}{n}\sum_{i=1}^{n} E\left(\frac{(v_i - \hat{v}_{j,i})^2}{\hat{v}_{j,i}^2}\right)\right),$$
where $C$ and $C'$ are constants depending on $\tau$, $\lambda$ and the constant $B$ in Condition 5.

Remark: For an asymptotic bound, it suffices to assume that there is at least one procedure (unknown) that eventually correctly specifies the shape of the conditional distribution of $Y$.

From the result, as long as there is at least one forecasting procedure that picks up the right forms of the error distributions with good estimates of the conditional means and standard deviations, the combined procedure based on the AFTER algorithm should behave well. Thus different distribution shapes and assumptions on the dependence relationship between the errors $\varepsilon_i$, as well as on the relationship between $Y$ and the outside information $X$, can be tried out to increase the chance of capturing the true uncertainty in $Y$. The theorem shows that if one such choice approximates the reality well, the combined forecast will automatically perform well. We omit the proof of the result, which is similar to the proof of Theorem 3.
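One simple way to act on this observation (our illustration, with hypothetical function names): pair each base forecaster with several candidate error-density specifications and let an AFTER-type weighting handle all the pairs. The sketch below reuses log_laplace_std from the earlier sketch and treats every (forecaster, density) pair as one candidate.

```python
import numpy as np

def log_gaussian_std(t):
    # standard normal log-density
    return -0.5 * np.log(2.0 * np.pi) - 0.5 * t ** 2

def after_multi_density(y, forecasts, sd_estimates, log_densities):
    """Expand the candidate set: every (forecaster, error density) pair is one candidate."""
    n, M = forecasts.shape
    K = len(log_densities)
    logw = np.full(M * K, -np.log(M * K))     # uniform prior over all pairs
    combined = np.empty(n)
    for i in range(n):
        w = np.exp(logw - logw.max()); w /= w.sum()
        combined[i] = w @ np.tile(forecasts[i], K)
        t = (y[i] - forecasts[i]) / sd_estimates[i]
        scores = np.concatenate([log_h(t) - np.log(sd_estimates[i]) for log_h in log_densities])
        logw += scores
    return combined

# e.g. after_multi_density(y, F, S, [log_gaussian_std, log_laplace_std])
```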
2.5 Combining forecasts without variance estimation

The methods in the previous subsections require knowledge of the variances, or appropriate estimates of them, for determining the weights on the candidate forecasts. Sometimes, e.g., when the data are nonstationary, it may be difficult to estimate the conditional variance of $Y_n$ reliably. We present a method here that does not require variance estimation. A key idea is borrowed from Catoni (1999).

Condition 6: Assume that all the forecasts are uniformly bounded between $-A$ and $A$ for a positive constant $A > 0$.
Let $\psi$ be a fixed nonnegative convex function with $\psi(0) = 0$. Here $\psi$ is not necessarily symmetric about 0. When $\psi(Y_i - \hat{y}_i)$ is used as a loss function, asymmetry of $\psi$ is of interest to reflect the reality that over- and under-forecasting might have very different consequences (cf. Newbold and Granger (1974)). Since $\psi$ is convex, for any $a_0$ there exists a line $y = \psi'_{a_0}(a - a_0) + \psi(a_0)$ (called a supporting line) such that $\psi$ is above the line. We further assume that $\psi$ is quadratically off the line and does not change too dramatically in some sense.

Condition 7: There exist constants $\underline{c} > 0$, $\bar{c} > 0$ and $\gamma > 0$ such that for each $a_0 \in \mathbf{R}$ and a supporting line $y = \psi'_{a_0}(a - a_0) + \psi(a_0)$ at $a_0$, we have
$$\psi(a) - \left(\psi'_{a_0}(a - a_0) + \psi(a_0)\right) \ge \underline{c}\,(a - a_0)^2,$$
and
$$\max_{-T \le a \le T} |\psi'_{+}(a)| \le \bar{c}\,(1 + T)^{\gamma}, \qquad \max_{-T \le a \le T} |\psi'_{-}(a)| \le \bar{c}\,(1 + T)^{\gamma}$$
for each $T > 0$, where $\psi'_{+}$ and $\psi'_{-}$ denote the one-sided derivatives of $\psi$.

Condition 7 pretty much restricts $\psi$ to be essentially quadratic. The slight generality is used so as to include the asymmetric function
$$\psi(a) = \begin{cases} c_1 a^2 & \text{when } a > 0 \\ c_2 a^2 & \text{when } a \le 0 \end{cases}$$
for some positive constants $c_1 \ne c_2$. Such a loss $\psi(Y_i - \hat{y}_i)$ penalizes the prediction error differently depending on whether one under- or over-predicts $Y$.

We also need a condition on the conditional distributions of the $Y_n$'s. We do not require knowledge of their shape as long as certain moment generating functions are uniformly upper bounded.

Condition 8: There exist a constant $t_0 > 0$ and continuous functions $0 < M_1(t), M_2(t) < \infty$ on $(-t_0, t_0)$ such that for all $n \ge 1$ and $-t_0 \le t \le t_0$,
$$E_n |Y_n|^2 \exp\left(t |Y_n|^{\gamma}\right) \le M_1(t), \qquad E_n \exp\left(t |Y_n|^{\gamma}\right) \le M_2(t).$$

Let $W_{j,1} = \pi_j$ and for $n \ge 2$,
$$W_{j,n} = \frac{\pi_j \exp\left(-\eta \sum_{i=1}^{n-1} \psi(Y_i - \hat{y}_{j,i})\right)}{\sum_{j' \ge 1} \pi_{j'} \exp\left(-\eta \sum_{i=1}^{n-1} \psi(Y_i - \hat{y}_{j',i})\right)},$$
where $\eta$ is a positive constant to be chosen later. Then the combined forecasting procedure is defined as
$$\hat{y}_n = \sum_{j=1}^{\infty} W_{j,n}\, \hat{y}_{j,n}.$$
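A minimal Python sketch of this loss-based weighting (ours; the value eta=0.1 is an arbitrary placeholder, since the theory only requires $\eta$ below an unknown threshold $\eta_0$):

```python
import numpy as np

def after_loss_based(y, forecasts, loss=lambda e: e ** 2, eta=0.1, prior=None):
    """AFTER-type exponential weighting based only on cumulative loss.

    loss : convex loss psi applied to the forecast error (default: square loss)
    eta  : tuning constant; must be chosen small enough (see Theorem 5)
    """
    n, M = forecasts.shape
    logw = np.log(np.full(M, 1.0 / M) if prior is None else np.asarray(prior, float))
    combined = np.empty(n)
    for i in range(n):
        w = np.exp(logw - logw.max())
        w /= w.sum()
        combined[i] = w @ forecasts[i]
        logw -= eta * loss(y[i] - forecasts[i])
    return combined
```

An asymmetric quadratic loss as in Condition 7 can be passed as, e.g., `loss=lambda e: np.where(e > 0, c1 * e**2, c2 * e**2)` for chosen constants `c1` and `c2`.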
Theorem 5: Assume that Conditions 6-8 are satisfied. Then when $\eta$ is chosen small enough, say $0 < \eta \le \eta_0$ for some constant $\eta_0$ depending on $\underline{c}$, $\bar{c}$, $A$, $M_1$ and $M_2$, the mean average cumulative risk (under the loss $\psi$) of the combined procedure satisfies
$$\frac{1}{n}\sum_{i=1}^{n} E\,\psi(Y_i - \hat{y}_i) \le \inf_{j \ge 1}\left(\frac{\log(1/\pi_j)}{\eta\, n} + \frac{1}{n}\sum_{i=1}^{n} E\,\psi(Y_i - \hat{y}_{j,i})\right).$$
In particular, when $\psi$ is the square loss, we have
$$\frac{1}{n}\sum_{i=1}^{n} E (m_i - \hat{y}_i)^2 \le \inf_{j \ge 1}\left(\frac{\log(1/\pi_j)}{\eta\, n} + \frac{1}{n}\sum_{i=1}^{n} E (m_i - \hat{y}_{j,i})^2\right).$$
Note that unlike the previous results, no variance estimation is required here, nor is the form of the conditional distribution of $Y$ specified. In addition, there is no loss of a constant factor besides the additive penalty term, but this holds under the strong moment generating function conditions. A difficulty with this method is the choice of $\eta$, which can have a dramatic effect on the weights for moderate $n$. In general, the quantities $\underline{c}$, $\bar{c}$, $A$, $M_1$ and $M_2$ are unknown and therefore it is hard to determine $\eta_0$. If one is willing to assume a particular form of distribution (e.g., normal), the quantities can be appropriately bounded and $\eta_0$ can hence be determined (cf. Catoni (1999) for the regression case with independent errors). In contrast, the earlier algorithms do not require such a tuning parameter. Some simulation results in the context of nonparametric regression with independent errors in Yang (1999bc) suggest that when the true error distribution is double-exponential but the weight calculation is based on a normal error, estimation accuracy is significantly decreased. On the other hand, when both types of error distributions are considered, the combined procedure behaves as if it knew which error distribution is the correct one. Thus we tend to recommend the use of the earlier methods when one has reasonably strong confidence that the true error distributions are captured well by some members of the set of chosen error distributions.
3 Combining forecasts for a better performance

The results given so far deal with combining forecasts for adaptation, i.e., the purpose of combining the individual forecasts is to achieve the best overall performance over time among all the competing forecasts. In applications, the more aggressive goal of combining procedures to improve on the individual forecasts is certainly more appealing. Empirical evidence has demonstrated that when different forecasts are combined, a better performance is sometimes obtained. Weak forecasts can still have good components or partial information, which, when used appropriately, can help to improve on a better forecasting procedure. See Clemen, Murphy and Winkler (1995) for a discussion of the difference between the two directions of combining forecasts. There are many (uncountably many) ways in which one can combine forecasts. It seems clear that one cannot expect any combining method to achieve the best performance among all possible combined forecasts without very restrictive probabilistic assumptions on the data. Accordingly, one may restrict attention to a fixed class of combining methods. For example, one could consider the class of all convexly or linearly combined forecasts, and demand that a combined forecast perform nearly as well as the best convexly or linearly combined forecast. Worst-case quadratic loss bounds for linearly combining a finite collection
of forecasting procedures have been obtained in machine learning; see, e.g., Cesa-Bianchi, Long, and Warmuth (1996). Theoretically speaking, the problem of combining procedures within a given class is closely related to the problem of combining forecasts for adaptation, because one can treat the class of all allowed combined forecasts as the original set of forecasts and then apply the methods given in Section 2 to combine them. To that end, one may need to "discretize" the class (e.g., discretize the linear coefficients) when there are uncountably many ways to combine within the class, since the given algorithms deal only with a countable collection of forecasting procedures. Related theoretical results on linearly combining nonparametric regression procedures with independent errors are in Juditsky and Nemirovski (1996) and Yang (1999d). Some practical combining methods for regression have been proposed by Wolpert (1992), Breiman (1996a) and LeBlanc and Tibshirani (1996). In this section, we consider linear combinations of forecasts with a certain constraint. The treatment is mainly theoretical since the combining methods we study are hard to implement in applications. Our purpose is to address several important issues on combining forecasts for a better performance. The theoretical understanding can be very helpful when judging practically feasible combining methods. Let $\Delta = \{\delta_1, \delta_2, \ldots\}$ be the set of original forecasting procedures. One hopes that a certain linear combination of them will significantly outperform any of the individual procedures. There are two related issues that we want to address:
- stability versus pursuing further improvement;
- sparsity in combining.

In addition to the higher computational cost, there must also be a price in terms of statistical performance for searching for the best linearly combined procedure. The price will be quantified in a minimax framework. As intuition suggests, the more procedures are involved in the combining, the bigger the price one needs to pay. The first issue above then amounts to comparing the potential gain and the price one pays when combining more forecasting procedures. If the potential gain is larger than the price, then one should combine more procedures (ignoring the computational cost). Otherwise, one can end up dramatically degrading the performance. Of course, one does not know the trade-off between the potential gain and the price. Thus it would be nice to have a combined procedure which automatically behaves the right way: being either conservative or aggressive, whichever is better. Sparseness in combining is also an important issue. If many coefficients in a linear combination of the procedures are rather small, keeping only the important ones can lead to a much better performance due to the much reduced variability of the resulting forecast. Since one does not know in advance which coefficients are important, some sort of search over sparse combinations is needed. Empirical evidence suggests that simple combining methods sometimes outperform more complicated versions. A possible explanation, as proposed by several researchers (e.g., Figlewski and Urich (1983), Kang (1986), Clemen and Winkler (1986)), is that more complicated combining methods have a much larger variability. Our theoretical results tend to support this view.
For simplicity of illustration, we will assume normality of the conditional distributions of the future response given the past data and the present outside information, and also assume that the variances $v_i$ are known and bounded above and below by positive constants as in Corollary 1. However, similar results can be obtained with little difficulty for the other cases considered in Section 2 under weaker assumptions.
3.1 What is the price for combining for improvement?

Let $\Delta = \{\delta_1, \ldots, \delta_M\}$ be a finite collection of forecasting procedures with $\delta_j$ producing forecast $\hat{y}_{j,i}$ at time $i$. Consider linear combinations of the procedures:
$$\hat{y}_i = \sum_{j=1}^{M} \alpha_j\, \hat{y}_{j,i}, \qquad i \ge 1,$$
where $\alpha = (\alpha_1, \ldots, \alpha_M)$ satisfies the constraint $\sum_{j=1}^{M} |\alpha_j| \le 1$. Let $\delta_{\alpha}$ denote the corresponding forecasting procedure and let $\Lambda = \Lambda_M$ denote the set of all such $\alpha$. Let $\alpha^*$ be the minimizer of
$$\frac{1}{n}\sum_{i=1}^{n} E\left(m_i - \hat{y}_i\right)^2$$
over $\alpha \in \Lambda$ and let $R(M; n; \Lambda)$ denote the minimized value. It captures the best performance among all the linearly combined procedures subject to the constraint. Targeting $\alpha^*$ (which is unknown), we combine the procedures using the AFTER algorithm with $\alpha$ restricted to a finite set $D \subset \Lambda$. An appropriate choice of $D$ leads to the following result. Let $\bar{\delta}_M$ denote the combined forecasting procedure. For a forecasting procedure $\delta$ with forecast $\hat{y}_i$ for $i \ge 1$, let $R(\delta; n) = \frac{1}{n}\sum_{i=1}^{n} E (m_i - \hat{y}_i)^2$ denote the mean average cumulative square risk.

Theorem 6: Assume that the conditions for Corollary 1 are satisfied. Then one can construct a combined procedure $\bar{\delta}_M$ such that
$$R(\bar{\delta}_M; n) \le C \begin{cases} R(M; n; \Lambda) + \frac{M \log(1 + n/M)}{n} & \text{when } M < \sqrt{n} \\ R(M; n; \Lambda) + \sqrt{\frac{\log M}{n \log n}} & \text{when } M \ge \sqrt{n}, \end{cases}$$
where $C$ is a constant depending on $A_1$, $A_2$, and $A_3$. In particular, if $M = M_n \sim C_0 n^{\xi}$ for some $\xi > 0$ and $C_0 > 0$, then
$$R(\bar{\delta}_{M_n}; n) \le C' \begin{cases} R(M_n; n; \Lambda) + n^{-(1-\xi)} \log n & \text{when } 0 < \xi \le 1/2 \\ R(M_n; n; \Lambda) + \sqrt{\frac{\log n}{n}} & \text{when } 1/2 < \xi < 1, \end{cases} \tag{1}$$
where the constant $C'$ depends on $A_1$, $A_2$, $A_3$, and $C_0$. The result characterizes the performance of the combined procedure in terms of the best linear combination (subject to the constraint) and a penalty. For nonparametric estimation or prediction, exact risks are usually too difficult to obtain without restrictive conditions, and accordingly convergence rates are used to compare procedures. The result implies that when combining $M = C_0 n^{\xi}$ procedures, the combined forecast has mean average cumulative risk converging at the rate of the best combination
plus a penalty term or price, which is of order $(\log n / n)^{1/2}$ when $\xi > 1/2$ and of order $n^{-(1-\xi)} \log n$ when $0 < \xi \le 1/2$. Define
$$\epsilon_n(M) = \begin{cases} \frac{M \log(1 + n/M)}{n} & 1 \le M < \sqrt{n} \\ \sqrt{\frac{\log M}{n \log n}} & M \ge \sqrt{n}. \end{cases}$$
Note that $\epsilon_n(M)$ is the penalty term associated with the combined procedure for not knowing which linear coefficients work the best. The term $\epsilon_n(M)$ increases as $M$ increases and thus can be viewed as a complexity penalty for combining $M$ procedures (relative to the sample size). When $n$ is small, the price for combining a large number of procedures is very high, but when $n$ gets large, more forecasts can be combined at a relatively low price. One might wonder whether the price we give above for combining forecasts is really necessary or is rather due to a sub-optimal method of combining. In fact, the upper bound given above is nearly optimal in the sense that it cannot be improved much in general, as shown in the following result. For simplicity, assume that the conditional variance of $Y_i$ given $z^{i-1}$ and $x_i$ is 1 for $i \ge 1$.

Theorem 7: Consider $M_n = \lfloor n^{\xi} \rfloor$ for some $\xi > 0$. There exist $M_n$ procedures $\Delta_{M_n} = \{\delta_j, 1 \le j \le M_n\}$ satisfying Condition 2 such that for any combined procedure $\delta$ based on $\Delta_{M_n}$, one can find a case such that
$$R(\delta; n) - R(M_n; n; \Lambda_{M_n}) \ge C \begin{cases} n^{-(1-\xi)} & \text{when } 0 \le \xi \le 1/2 \\ \left(\frac{\log n}{n}\right)^{1/2} & \text{when } 1/2 < \xi < 1, \end{cases}$$
where the constant $C$ does not depend on $n$.

Thus no combining method can achieve the smallest risk over all the (constrained) linear combinations uniformly within an order smaller than the ones given above. Note that the lower rate matches the upper rate in Theorem 6 when $\xi > 1/2$, and the upper and lower rates differ only in logarithmic factors when $0 \le \xi \le 1/2$.

Theorems 6 and 7, together with Corollary 1, provide some understanding of the question of whether one should combine for adaptation or combine for improvement. By Corollary 1, taking the uniform prior weight on the $M_n$ procedures, combining for adaptation yields risk of order
$$\inf_{1 \le j \le M_n} R(\delta_j; n) + \frac{\log M_n}{n}.$$
As long as $M_n$ does not increase exponentially fast, $\log M_n / n$ is of order $\log n / n$, which is usually negligible for nonparametric estimation or prediction. Then combining for adaptation achieves the best performance (rate) among the $M_n$ forecasting procedures. In terms of the rate of convergence of the mean average cumulative risk, linearly combining $M_n \sim n^{\xi}$ procedures for better performance is guaranteed to do better than combining for adaptation only when both $R(M_n; n; \Lambda_{M_n})$ and the penalty term ($(\log n/n)^{1/2}$ for $\xi > 1/2$ and $n^{-(1-\xi)}$ for $0 \le \xi \le 1/2$) converge at a faster rate than $\inf_{1 \le j \le M_n} R(\delta_j; n) + \frac{\log M_n}{n}$. Otherwise, the linearly combined procedure can do worse or even much worse than combining for adaptation. Thus it is not automatically advantageous to combine forecasting procedures for improvement.
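As a quick numerical illustration (ours, not from the paper), the complexity penalty $\epsilon_n(M)$ defined above can be tabulated for a few values of $M$ and $n$ and compared with the adaptation penalty $\log M / n$ to see how much faster the price of combining for improvement grows.

```python
import numpy as np

def eps_n(M, n):
    """Complexity penalty for linearly combining M procedures with n observations."""
    if M < np.sqrt(n):
        return M * np.log(1.0 + n / M) / n
    return np.sqrt(np.log(M) / (n * np.log(n)))

for n in (100, 1000, 10000):
    for M in (2, 10, 100):
        print(f"n={n:6d} M={M:4d}  eps_n(M)={eps_n(M, n):.4f}  logM/n={np.log(M)/n:.5f}")
```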
Theorems 6 and 7 can also help us understand how many procedures should be combined. Note that $\inf_{1 \le j \le M_n} R(\delta_j; n)$ decreases as more forecasting procedures are included in $\Delta$, but on the other hand the penalty $\epsilon_n(M_n)$ increases. Thus one needs a good trade-off between the potential gain and the complexity penalty. In the next subsection, a method is given that automatically achieves the best trade-off.
3.2 Combining with multiple discretization accuracies

In the construction of the combined forecast in the previous subsection, a discretization is used which is chosen according to $n$ (among other things). This is undesirable because it requires knowledge of how many times one will forecast in the future. Theoretically speaking, the problem can be fixed by considering different discretization accuracies at the same time. In addition, we will seek an automatic balance between the potential gain and the complexity penalty.

Let $\Delta = \{\delta_1, \delta_2, \ldots\}$ be a list of forecasting procedures. Let $\Delta_M = \{\delta_1, \delta_2, \ldots, \delta_M\}$. Fix $M$ for a moment. For each $k \ge 1$, consider a best $\epsilon$-net $N_k$ in $\Lambda_M$ with $2^k$ points. From Schütt (1984, Theorem 1), the entropy number under the $l_{\infty}^M$ distance is upper bounded by $c\,\epsilon(M, k)$, where
$$\epsilon(M, k) = \begin{cases} 1 & \text{if } 1 \le k \le \log_2 M \\ \frac{\log_2(1 + M/k)}{k} & \text{if } \log_2 M \le k \le M \\ 2^{-2k/M} M^{-1} & \text{if } k > M \end{cases}$$
and $c$ does not depend on $M$ or $k$. We combine the $2^k$ corresponding forecasting procedures (linear combinations of the original $M$ procedures with $\alpha$ in $N_k$) with uniform weight. Denote the procedure by $\bar{\delta}^{(M,k)}$. By Corollary 1,
$$R(\bar{\delta}^{(M,k)}; n) \le C\left(\frac{k}{n} + R(M; n; \Lambda_M) + \epsilon^2(M, k)\right),$$
where $C$ is a constant depending on $A_1$, $A_2$, and $A_3$. For convenience, we will use the same symbol $C$ to denote constants that may depend on these constants. Define $\overline{\log}$ by $\overline{\log}\, x = \log(x+1) + 2\log\log(x+1)$. Now we combine the procedures $\{\bar{\delta}^{(M,k)} : M \ge 1, k \ge 1\}$ for adaptation with prior weight $c\, e^{-\overline{\log} M - \overline{\log} k}$, where the constant $c$ is chosen to normalize the weights to add up to 1. Let $\bar{\delta}^{*}$ denote this forecasting procedure. It follows again from Corollary 1 that
$$R(\bar{\delta}^{*}; n) \le C \inf_{M, k}\left(R(M; n; \Lambda_M) + \frac{k}{n} + \epsilon^2(M, k) + \frac{\overline{\log} M}{n} + \frac{\overline{\log} k}{n}\right).$$
Compared to the other terms, $\frac{\overline{\log} M}{n} + \frac{\overline{\log} k}{n}$ is asymptotically negligible. Thus the final combined procedure behaves automatically as well as if it knew the optimal discretization accuracy that balances $\frac{k}{n}$ and $\epsilon^2(M, k)$, and also knew the optimal number of procedures $M_n$ that balances $R(M; n; \Lambda_M)$ and $\inf_k\left(\frac{k}{n} + \epsilon^2(M, k)\right)$. In particular, for each $n$ and $M$, by taking $k$ as in the proof of Theorem 6, we have
$$R(\bar{\delta}^{*}; n) \le C \inf_{M}\left(R(M; n; \Lambda_M) + \epsilon_n(M)\right). \tag{2}$$
Note that the upper bound agrees in order with that in Theorem 6, but the combining procedure does not need to know $n$ in advance. Up to a constant factor, the combined forecast automatically achieves the best trade-off between risk reduction and the complexity penalty for combining.
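The construction above is theoretical; a crude, practically motivated sketch of the same idea (ours) uses a random net of coefficient vectors in the $\ell_1$ ball in place of an optimal $\epsilon$-net, and reuses the after_known_variance routine from the Section 2.1 sketch.

```python
import numpy as np

def combine_for_improvement(y, forecasts, variances, n_candidates=256, seed=0):
    """Combine linear combinations (sum |alpha_j| <= 1) of M base forecasts.

    A random set of coefficient vectors stands in for the epsilon-net N_k;
    each candidate alpha defines one linearly combined forecaster, and the
    candidates are then combined for adaptation with the AFTER weights.
    Returns (combined forecasts, weight history) from after_known_variance.
    """
    rng = np.random.default_rng(seed)
    n, M = forecasts.shape
    alphas = rng.laplace(size=(n_candidates, M))
    # rescale any alpha outside the l1 ball back onto it
    alphas /= np.maximum(1.0, np.abs(alphas).sum(axis=1, keepdims=True))
    cand = forecasts @ alphas.T          # (n, n_candidates): column k is the alpha_k combination
    return after_known_variance(y, cand, variances)
```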
3.3 Sparsely combining

In some cases, when combining many procedures, the best linear combination may concentrate on only a few procedures. For such a case, combining only the important procedures leads to a much better performance because of the significantly reduced variability in the final combined procedure. Usually one does not know which of the original procedures are important individually, and even if one does, some of the individually important procedures may not be needed in the best linear combination. Without such knowledge, we consider different sparse combinations of the candidate forecasts. For each integer $M > 1$, $1 \le L < M$, and a subset $S$ of $\{1, 2, \ldots, M\}$ of size $L$, let $\bar{\delta}^{(S)}$ be the linearly combined (for improvement) procedure based on $\{\delta_j : j \in S\}$ using multiple discretization accuracies as in the previous subsection. Then let $\bar{\delta}^{M,L}$ be the combined (for adaptation) procedure based on all such $\bar{\delta}^{(S)}$ with uniform weight $1/\binom{M}{L}$ (there are $\binom{M}{L}$ many such procedures). Then let $\bar{\delta}^{(M)}$ be the combined (for adaptation) procedure based on $\bar{\delta}^{M,1}, \ldots, \bar{\delta}^{M,M-1}$ using the uniform weight $1/(M-1)$. Let $\bar{\delta}^{F}$ denote the combined (for adaptation) procedure based on $\bar{\delta}^{(M)}$, $M \ge 2$, with weight $c_0 e^{-\overline{\log} M}$, where the constant $c_0$ is chosen such that $\sum_{M=2}^{\infty} c_0 e^{-\overline{\log} M} = 1$. Let $\Delta(S)$ denote the collection of procedures $\{\delta_j : j \in S\}$. Based on Corollary 1, we have
$$R(\bar{\delta}^{F}; n) \le C \inf_{M \ge 2}\; \inf_{1 \le L \le M-1}\; \inf_{|S| = L,\, S \subset \{1, 2, \ldots, M\}} \left(R(M; n; \Delta(S)) + \epsilon_n(L) + \frac{\log\binom{M}{L}}{n}\right) \tag{3}$$
$$\le C \inf_{M \ge 2}\; \inf_{1 \le L \le M-1}\; \inf_{|S| = L,\, S \subset \{1, 2, \ldots, M\}} \left(R(M; n; \Delta(S)) + \epsilon_n(L) + \frac{L \log M}{n}\right).$$
Compared with the upper bound (2), we see the potential advantage of sparse combining. Let $M_n$ achieve the best trade-off between $R(M; n; \Lambda_M)$ and $\epsilon_n(M)$. Suppose there exists $\Delta(S) \subset \{\delta_1, \ldots, \delta_{M_n}\}$ with $L = |\Delta(S)|$ much smaller than $M_n$ such that $R(M_n; n; \Delta(S))$ is only slightly larger than $R(M_n; n; \Lambda_{M_n})$. Then $\epsilon_n(L) + \frac{L \log M_n}{n}$ is much smaller than $\epsilon_n(M_n)$, as long as $M_n$ does not grow exponentially fast. For such a case, sparse combining avoids the unnecessarily large variability of combining all the first $M_n$ procedures. Sparse approximation/estimation has been effectively considered for wavelet estimation (see, e.g., Donoho and Johnstone (1994, 1998), Johnstone (1999)) and has also been applied to general model selection (see, e.g., Barron (1994), Yang and Barron (1998, 1999), Barron, Birgé, and Massart (1999), Yang (1999a)).
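A rough computational sketch of this sparse scheme (ours; it reuses the two helper routines above, enumerates subsets explicitly, and is therefore only practical for a small number of base procedures):

```python
import itertools
import numpy as np

def sparse_combine(y, forecasts, variances, max_size=3, n_candidates=64, seed=0):
    """Sparse combining: search over small subsets S of the M base procedures.

    For each subset S of size L <= max_size, build linear combinations supported
    on S (via combine_for_improvement); then combine all resulting procedures
    for adaptation with the AFTER weights.
    """
    n, M = forecasts.shape
    pooled = []
    for L in range(1, max_size + 1):
        for S in itertools.combinations(range(M), L):
            sub, _ = combine_for_improvement(y, forecasts[:, list(S)], variances,
                                             n_candidates=n_candidates, seed=seed)
            pooled.append(sub)
    pooled = np.column_stack(pooled)      # each column is one candidate based on a subset S
    combined, _ = after_known_variance(y, pooled, variances)
    return combined
```

Since the AFTER weights at time $i$ only use data up to time $i-1$, feeding the sequential forecasts of the subset procedures into a second AFTER layer still respects the forecasting time order.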
3.4 Adaptively combining forecasts

The results in the previous subsections provide helpful understanding of the difference between combining for adaptation and combining for improvement; of the potential gain and price of combining forecasts; and of the advantage of sparse combining. In practice, however, one does not know which scenario the current situation is in and therefore cannot gauge beforehand how to choose the optimal combining strategy. If one just bluntly combines a large number of procedures, one might end up with a much worse performance compared to combining for adaptation. In this subsection, we give an adaptive combining method to overcome these difficulties, at least in theory. Here we show that when the procedures are combined properly, one has the potential of obtaining a large gain in forecasting accuracy, yet without losing much when there happens to be no advantage in considering sophisticated linear combinations for improvement.

Consider three different approaches to combining the procedures in $\Delta = \{\delta_1, \delta_2, \ldots\}$. The first approach is to combine the procedures for adaptation. Here one intends to capture the best performance (in terms of rate of convergence) among the candidate procedures. Let $\delta^A$ denote this combined procedure based on $\Delta$ using the AFTER algorithm as given in Section 2. When $\Delta$ is not a finite collection, one cannot use uniform weights. A choice of prior weight $\pi_j$ is $c\, e^{-\overline{\log} j}$, where the constant $c$ is chosen to normalize the weights to add up to 1. Based on Corollary 1, we have
$$R(\delta^A; n) \le C \inf_{j}\left(\frac{\overline{\log}(j + 1)}{n} + R(\delta_j; n)\right) =: C\, R_1(n; \Delta). \tag{4}$$
If one procedure, say $\delta_j$, behaves the best asymptotically, then the penalty is of order $1/n$. If the best procedure changes with $n$, then $\inf_j\left(\overline{\log}(j+1)/n + R(\delta_j; n)\right)$ is a trade-off between complexity and estimation accuracy.

The second approach targets the best performance among all the linear combinations of the original procedures up to different orders. This is the combining procedure described in Section 3.2. Let $\delta^B$ denote this procedure. Then
$$R(\delta^B; n) \le C \inf_{M}\left(R(M; n; \Lambda_M) + \epsilon_n(M)\right) =: C\, R_2(n; \Delta).$$
The third approach is to combine the procedures through various sparse combinations as described in Section 3.3. Let $\delta^C$ denote this procedure. Then
$$R(\delta^C; n) \le C \inf_{M \ge 2}\; \inf_{1 \le L \le M-1}\; \inf_{|S| = L,\, S \subset \{1, 2, \ldots, M\}}\left(R(M; n; \Delta(S)) + \epsilon_n(L) + \frac{L \log M}{n}\right) =: C\, R_3(n; \Delta).$$
Now we combine these three procedures $\delta^A$, $\delta^B$, and $\delta^C$ with, e.g., equal weight 1/3. Let $\delta^F$ denote the final combined procedure. Note that it is still a linear combination of the original procedures. We have the following conclusion.

Proposition 1:
Assume Conditions 1 and 2 are satisfied. Then the combined procedure satisfies
$$R(\delta^F; n) \le C \min\left(R_1(n; \Delta),\, R_2(n; \Delta),\, R_3(n; \Delta)\right),$$
where $C$ is a constant depending only on $A_1$, $A_2$ and $A_3$.
Thus in terms of the rate of convergence, first of all, the combined forecasting procedure converges as fast as any original procedure. Secondly, when linear combinations of the first $M_n$ procedures (for some $M_n > 1$) can improve estimation accuracy dramatically, one pays a price of at most order $\epsilon_n(M_n)$ for the better performance. When certain linear combinations of a small number of procedures perform well, the combined procedure can also take advantage of that. In summary, the combined procedure is automatically both aggressive, pursuing further accuracy improvement, and conservative, avoiding paying a price higher than the potential gain.
4 Concluding remarks

Combining forecasts is a useful technique for sharing the strengths of different forecasting procedures. There are two closely related directions in combining forecasts: combining for adaptation and combining for improvement. An algorithm, AFTER, is proposed to combine forecasts for adaptation. The method assigns weights to the candidate procedures according to their performances so far, and the computations rely on specifying the forms (e.g., Gaussian or double-exponential) of the conditional distributions of $Y$ given the past data and the current outside information. One may use a fixed specification for combining or use the specifications from the individual forecasters. When the common specification in the first case, or the individual specifications corresponding to the best forecasts, capture the true uncertainty in $Y$ well, the AFTER algorithm is shown to have a prediction risk automatically close to that of the best procedure. No assumption on the relationship between the different forecasts is needed for the result. In applications, various model assumptions on the relationship between $Y$ and the outside information $X$ and on the dependence structure among the errors (noise) can be considered for constructing individual predictions. The resulting forecasts, as well as additional ones based on subjective assessments, can be combined. The algorithm AFTER is easy to implement in practice. Combining without specifying or estimating the conditional distributions of $Y$ is also studied under assumptions on certain moment generating functions.

An alternative to combining procedures is selecting the best one. Model selection criteria can be used in that regard when combining model-based forecasts. However, model selection is often unstable, which causes a rather large variability in prediction. Combining with appropriate weighting, as with AFTER, gives more stable predictions and accordingly yields a better performance.

In the other direction, we studied linearly combining (subject to a constraint) forecasts for improving on the individual procedures. Since the "optimal" weights are estimated, one must pay a price in the sense that the risk of the combined forecast is in general larger than that of the best linear combination. The potential gain and the price in terms of prediction accuracy are theoretically quantified. The result supports the empirical finding that combining does not necessarily lead to performance improvement, and blind combining of a number of forecasts can have a much worse performance compared to the best individual procedure. In addition, we showed the advantage of sparse combining. A multi-purpose combining method is shown to be able to automatically balance the potential gain and the complexity penalty; it takes advantage of sparse combining when only a few procedures have large weights in the best linear combination; and it always achieves (in rate) the best individual performance.

Our results on linearly combining forecasts for improvement are mainly theoretical. Computationally feasible methods with similar properties need to be developed. In addition to linear combining, nonlinear methods may also prove helpful for sharing the strengths of different procedures. Another interesting and challenging research direction is combining prediction intervals. See Taylor and Bunn (1999) for some simulation results. Further theoretical research on these topics, in our opinion, will be fruitful to
guide the practice of combining forecasts in reality.
5 Proofs of the results

Proof of Theorem 1: Let
$$f^n = \prod_{i=1}^n \frac{1}{\sqrt{2\pi v_i}}\exp\left(-\frac{1}{2v_i}(y_i - m_i)^2\right) = \frac{1}{(2\pi)^{n/2}\prod_{i=1}^n v_i^{1/2}}\exp\left(-\frac{1}{2}\sum_{i=1}^n \frac{(y_i - m_i)^2}{v_i}\right)$$
and
$$q_n = \sum_{j=1}^{\infty}\pi_j \prod_{i=1}^n \frac{1}{\sqrt{2\pi v_i}}\exp\left(-\frac{1}{2v_i}(y_i - \hat{y}_{j,i})^2\right) = \frac{1}{(2\pi)^{n/2}\prod_{i=1}^n v_i^{1/2}}\sum_{j=1}^{\infty}\pi_j\exp\left(-\frac{1}{2}\sum_{i=1}^n \frac{(y_i - \hat{y}_{j,i})^2}{v_i}\right).$$
Consider $\log(f^n/q_n)$. By monotonicity of the log function, for each fixed $j^* \ge 1$, we have
$$\log(f^n/q_n) \le \log\left(\frac{\exp\left(-\frac{1}{2}\sum_{i=1}^n \frac{(y_i - m_i)^2}{v_i}\right)}{\pi_{j^*}\exp\left(-\frac{1}{2}\sum_{i=1}^n \frac{(y_i - \hat{y}_{j^*,i})^2}{v_i}\right)}\right) = \log(1/\pi_{j^*}) + \frac{1}{2}\sum_{i=1}^n \frac{(y_i - \hat{y}_{j^*,i})^2 - (y_i - m_i)^2}{v_i}. \qquad (5)$$
On the other hand, observe
$$q_n = \left(\sum_{j=1}^{\infty}\pi_j \frac{1}{\sqrt{2\pi v_1}}e^{-\frac{(y_1-\hat{y}_{j,1})^2}{2v_1}}\right)\cdot \frac{\sum_{j=1}^{\infty}\pi_j \frac{1}{\sqrt{2\pi v_1}}\frac{1}{\sqrt{2\pi v_2}}e^{-\frac{(y_1-\hat{y}_{j,1})^2}{2v_1}-\frac{(y_2-\hat{y}_{j,2})^2}{2v_2}}}{\sum_{j=1}^{\infty}\pi_j \frac{1}{\sqrt{2\pi v_1}}e^{-\frac{(y_1-\hat{y}_{j,1})^2}{2v_1}}}\cdots \frac{\sum_{j=1}^{\infty}\pi_j \prod_{i=1}^n \frac{1}{\sqrt{2\pi v_i}}e^{-\sum_{i=1}^n\frac{(y_i-\hat{y}_{j,i})^2}{2v_i}}}{\sum_{j=1}^{\infty}\pi_j \prod_{i=1}^{n-1}\frac{1}{\sqrt{2\pi v_i}}e^{-\sum_{i=1}^{n-1}\frac{(y_i-\hat{y}_{j,i})^2}{2v_i}}}.$$
Let $p_i = \frac{1}{\sqrt{2\pi v_i}}\exp\left(-\frac{(y_i-m_i)^2}{2v_i}\right)$ and $g_i = \sum_{j=1}^{\infty}W_{j,i}\frac{1}{\sqrt{2\pi v_i}}\exp\left(-\frac{(y_i-\hat{y}_{j,i})^2}{2v_i}\right)$. It follows by the definition of $W_{j,i}$ that
$$\log(f^n/q_n) = \sum_{i=1}^n \log\frac{p_i}{g_i}.$$
Together with (5), we have
$$\sum_{i=1}^n \log\frac{p_i}{g_i} \le \log(1/\pi_{j^*}) + \frac{1}{2}\sum_{i=1}^n \frac{(y_i - \hat{y}_{j^*,i})^2 - (y_i - m_i)^2}{v_i}. \qquad (6)$$
Taking expectation on both sides of (6) under the conditional distribution of $Y_n$ given $Z^{n-1} = z^{n-1}$ and $x_n$, we have
$$\sum_{i=1}^{n-1}\log\frac{p_i}{g_i} + E_n\log\frac{p_n}{g_n} \le \log(1/\pi_{j^*}) + \frac{1}{2}\sum_{i=1}^{n-1}\frac{(y_i-\hat{y}_{j^*,i})^2 - (y_i-m_i)^2}{v_i} + E_n\frac{(Y_n-\hat{y}_{j^*,n})^2 - (Y_n-m_n)^2}{2v_n}. \qquad (7)$$
Since $y_n = m_n + e_n$, where $E_n e_n = 0$, we have
$$(y_n - \hat{y}_{j^*,n})^2 - (y_n - m_n)^2 = (m_n - \hat{y}_{j^*,n})^2 - 2e_n(\hat{y}_{j^*,n} - m_n)$$
and
$$E_n\left[e_n(\hat{y}_{j^*,n} - m_n)\right] = (\hat{y}_{j^*,n} - m_n)E_n e_n = 0.$$
It follows that
$$E_n\frac{(Y_n - \hat{y}_{j^*,n})^2 - (Y_n - m_n)^2}{2v_n} = \frac{(m_n - \hat{y}_{j^*,n})^2}{2v_n}.$$
Observe
$$E_n\log\frac{p_n}{g_n} = \int p_n\log\frac{p_n}{g_n}\,dy_n \ge \int\left(\sqrt{p_n} - \sqrt{g_n}\right)^2 dy_n,$$
where the inequality is the familiar relationship between the Kullback-Leibler divergence and the squared Hellinger distance. Note that the means of $p_n$ and $g_n$ are $m_n$ and $\hat{y}_n = \sum_{j=1}^{\infty}W_{j,n}\hat{y}_{j,n}$, respectively. The above Hellinger distance is related to the performance of the combined forecast:
$$\left(m_n - \sum_{j=1}^{\infty}W_{j,n}\hat{y}_{j,n}\right)^2 = \left(\int y_n p_n\,dy_n - \int y_n g_n\,dy_n\right)^2 = \left(\int y_n(p_n - g_n)\,dy_n\right)^2 \le \left(\int |y_n|\,\left|\sqrt{p_n}+\sqrt{g_n}\right|\,\left|\sqrt{p_n}-\sqrt{g_n}\right|dy_n\right)^2 \le \int y_n^2\left(\sqrt{p_n}+\sqrt{g_n}\right)^2 dy_n \int\left(\sqrt{p_n}-\sqrt{g_n}\right)^2 dy_n,$$
where the last step follows from the Cauchy-Schwarz inequality. Note that
$$\int y_n^2\left(\sqrt{p_n}+\sqrt{g_n}\right)^2 dy_n \le 2\int y_n^2(p_n + g_n)\,dy_n = 2\left(m_n^2 + v_n + \int y_n^2 g_n\,dy_n\right) \le 2\left(m_n^2 + 2v_n + \sup_{j\ge1}\hat{y}_{j,n}^2\right),$$
where the last step follows from the fact that $g_n$ is a convex combination of densities with the same variance $v_n$ and bounded means. Together with (7), we have
$$\sum_{i=1}^{n-1}\log\frac{p_i}{g_i} + \frac{(m_n - \hat{y}_n)^2}{2\left(m_n^2 + 2v_n + \sup_{j\ge1}\hat{y}_{j,n}^2\right)} \le \log(1/\pi_{j^*}) + \frac{1}{2}\sum_{i=1}^{n-1}\frac{(y_i-\hat{y}_{j^*,i})^2 - (y_i-m_i)^2}{v_i} + \frac{(m_n-\hat{y}_{j^*,n})^2}{2v_n}.$$
Thus
$$E\sum_{i=1}^{n-1}\log\frac{p_i}{g_i} - \frac{1}{2}E\sum_{i=1}^{n-1}\frac{(Y_i-\hat{y}_{j^*,i})^2 - (Y_i-m_i)^2}{v_i} \le \log(1/\pi_{j^*}) - E\frac{(m_n - \hat{y}_n)^2}{2\left(m_n^2 + 2v_n + \sup_{j\ge1}\hat{y}_{j,n}^2\right)} + E\frac{(m_n-\hat{y}_{j^*,n})^2}{2v_n}.$$
Handling $\log\frac{p_{n-1}}{g_{n-1}} - \frac{(y_{n-1}-\hat{y}_{j^*,n-1})^2 - (y_{n-1}-m_{n-1})^2}{2v_{n-1}}$ in the same way as for $i = n$, and similarly for the other $1 \le i < n-1$, we have
$$\sum_{i=1}^n E\frac{(m_i - \hat{y}_i)^2}{2\left(m_i^2 + 2v_i + \sup_{j\ge1}\hat{y}_{j,i}^2\right)} \le \log(1/\pi_{j^*}) + \sum_{i=1}^n E\frac{(m_i - \hat{y}_{j^*,i})^2}{2v_i}.$$
Since the above holds for every $j^*$, together with Condition 0 it follows that
$$\sum_{i=1}^n E\frac{(m_i - \hat{y}_i)^2}{4(\tau + 1)v_i} \le \inf_{j\ge1}\left(\log(1/\pi_j) + \sum_{i=1}^n E\frac{(m_i - \hat{y}_{j,i})^2}{2v_i}\right),$$
where $\tau$ denotes the constant in Condition 0. The conclusion of Theorem 1 then follows. This completes the proof of Theorem 1.
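The only analytic inequality used repeatedly in the proof above is the lower bound of the Kullback-Leibler divergence by the squared Hellinger distance. For two normal densities with a common variance both quantities have closed forms, so the inequality can be checked directly; the snippet below is our own numerical illustration, not part of the proof.

```python
import numpy as np

# KL(p||g) and squared Hellinger distance for p = N(m1, v), g = N(m2, v)
def kl_gauss(m1, m2, v):
    return (m1 - m2) ** 2 / (2.0 * v)

def hellinger_sq_gauss(m1, m2, v):
    # int (sqrt(p) - sqrt(g))^2 dy = 2 * (1 - exp(-(m1 - m2)^2 / (8 v)))
    return 2.0 * (1.0 - np.exp(-(m1 - m2) ** 2 / (8.0 * v)))

rng = np.random.default_rng(0)
for _ in range(1000):
    m1, m2 = 10.0 * rng.normal(size=2)
    v = rng.uniform(0.1, 10.0)
    assert kl_gauss(m1, m2, v) >= hellinger_sq_gauss(m1, m2, v) - 1e-12
print("KL >= squared Hellinger held in all sampled cases")
```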
Proof of Theorem 2: We only need to modify the proof of Theorem 1. Define $f^n$ as before, but modify $q_n$ to be
$$q_n = \sum_{j=1}^{\infty}\pi_j\prod_{i=1}^n\frac{1}{\sqrt{2\pi\hat{v}_{j,i}}}\exp\left(-\frac{1}{2\hat{v}_{j,i}}(y_i - \hat{y}_{j,i})^2\right) = \sum_{j=1}^{\infty}\pi_j\frac{1}{(2\pi)^{n/2}\prod_{i=1}^n\hat{v}_{j,i}^{1/2}}\exp\left(-\frac{1}{2}\sum_{i=1}^n\frac{(y_i-\hat{y}_{j,i})^2}{\hat{v}_{j,i}}\right).$$
As before, for each fixed $j^* \ge 1$, we have
$$\log(f^n/q_n) \le \log\left(\frac{\prod_{i=1}^n(2\pi v_i)^{-1/2}\exp\left(-\frac{1}{2}\sum_{i=1}^n\frac{(y_i-m_i)^2}{v_i}\right)}{\pi_{j^*}\prod_{i=1}^n(2\pi\hat{v}_{j^*,i})^{-1/2}\exp\left(-\frac{1}{2}\sum_{i=1}^n\frac{(y_i-\hat{y}_{j^*,i})^2}{\hat{v}_{j^*,i}}\right)}\right) = \log(1/\pi_{j^*}) + \frac{1}{2}\sum_{i=1}^n\left(\log\frac{\hat{v}_{j^*,i}}{v_i} + \frac{(y_i-\hat{y}_{j^*,i})^2}{\hat{v}_{j^*,i}} - \frac{(y_i-m_i)^2}{v_i}\right).$$
Taking expectation conditioned on $z^{i-1}$ and $x_i$, we have
$$E_i\left(\log\frac{\hat{v}_{j^*,i}}{v_i} + \frac{(Y_i-\hat{y}_{j^*,i})^2}{\hat{v}_{j^*,i}} - \frac{(Y_i-m_i)^2}{v_i}\right) = \frac{(m_i-\hat{y}_{j^*,i})^2}{\hat{v}_{j^*,i}} + \frac{v_i}{\hat{v}_{j^*,i}} - 1 - \log\frac{v_i}{\hat{v}_{j^*,i}}.$$
It follows that
$$E\log(f^n/q_n) \le \log(1/\pi_{j^*}) + \frac{1}{2}\sum_{i=1}^n E\left(\frac{(m_i-\hat{y}_{j^*,i})^2}{\hat{v}_{j^*,i}} + \frac{v_i}{\hat{v}_{j^*,i}} - 1 - \log\frac{v_i}{\hat{v}_{j^*,i}}\right). \qquad (8)$$
Similarly as before,
$$\log(f^n/q_n) = \sum_{i=1}^n\log\frac{p_i}{g_i}, \qquad (9)$$
where $g_i$ is modified to be $g_i = \sum_{j=1}^{\infty}W_{j,i}\frac{1}{\sqrt{2\pi\hat{v}_{j,i}}}\exp\left(-\frac{(y_i-\hat{y}_{j,i})^2}{2\hat{v}_{j,i}}\right)$. Combining (8) and (9), we have
$$\sum_{i=1}^n E\log\frac{p_i}{g_i} \le \log(1/\pi_{j^*}) + \frac{1}{2}\sum_{i=1}^n E\left(\frac{(m_i-\hat{y}_{j^*,i})^2}{\hat{v}_{j^*,i}} + \frac{v_i}{\hat{v}_{j^*,i}} - 1 - \log\frac{v_i}{\hat{v}_{j^*,i}}\right). \qquad (10)$$
Note as before
$$E_i\log\frac{p_i}{g_i} = \int p_i\log\frac{p_i}{g_i}\,dy_i \ge \int\left(\sqrt{p_i}-\sqrt{g_i}\right)^2 dy_i,$$
and
$$\int y_i^2 g_i\,dy_i \le \sup_{j\ge1}\hat{v}_{j,i} + \sup_{j\ge1}\hat{y}_{j,i}^2,$$
so that
$$E_i\log\frac{p_i}{g_i} \ge \frac{(m_i-\hat{y}_i)^2}{2\left(m_i^2 + v_i + \sup_{j\ge1}\hat{v}_{j,i} + \sup_{j\ge1}\hat{y}_{j,i}^2\right)}.$$
Together with (10), we have
$$\sum_{i=1}^n E\frac{(m_i-\hat{y}_i)^2}{2\left(m_i^2 + v_i + \sup_{j\ge1}\hat{v}_{j,i} + \sup_{j\ge1}\hat{y}_{j,i}^2\right)} \le \log(1/\pi_{j^*}) + \frac{1}{2}\sum_{i=1}^n E\left(\frac{(m_i-\hat{y}_{j^*,i})^2}{\hat{v}_{j^*,i}} + \frac{v_i}{\hat{v}_{j^*,i}} - 1 - \log\frac{v_i}{\hat{v}_{j^*,i}}\right).$$
It is straightforward to verify that if $1/\xi \le x \le \xi$, then $x - 1 - \log x \le c_\xi(x-1)^2$ for some constant $c_\xi$ depending only on $\xi$. Together with the fact that the above inequality holds for every $j^*$, under Condition 3 it follows that
$$\sum_{i=1}^n E\frac{(m_i-\hat{y}_i)^2}{2(3\tau+1)v_i} \le \inf_{j\ge1}\left(\log(1/\pi_j) + \sum_{i=1}^n\left(E\frac{(m_i-\hat{y}_{j,i})^2}{2v_i} + C_1 E\frac{(\hat{v}_{j,i}-v_i)^2}{2v_i^2}\right)\right),$$
where $\tau$ and $C_1 > 0$ are constants determined by the constants in Conditions 0 and 3. The conclusion then follows. This completes the proof of Theorem 2.

Proof of Theorem 3: Redefine
$$f^n = \prod_{i=1}^n\frac{1}{s_i}\,h\!\left(\frac{y_i - m_i}{s_i}\right)$$
and
$$q_n = \sum_{j=1}^{\infty}\pi_j\prod_{i=1}^n\frac{1}{\hat{s}_{j,i}}\,h\!\left(\frac{y_i - \hat{y}_{j,i}}{\hat{s}_{j,i}}\right).$$
The proof then follows similarly as the proof of Theorem 2.
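The proof of Theorem 2 above relies on the elementary fact that $x - 1 - \log x$ is dominated by a multiple of $(x-1)^2$ when $x$ is confined to an interval $[1/\xi, \xi]$. The smallest admissible constant for a given $\xi$ can be found numerically; the following check is our own illustration, with $\xi$ simply standing in for the bound implied by Condition 3.

```python
import numpy as np

def smallest_constant(xi, num=200001):
    """Smallest c such that x - 1 - log(x) <= c * (x - 1)^2 on [1/xi, xi]."""
    x = np.linspace(1.0 / xi, xi, num)
    x = x[np.abs(x - 1.0) > 1e-8]   # the ratio extends continuously to 1/2 at x = 1
    ratio = (x - 1.0 - np.log(x)) / (x - 1.0) ** 2
    return ratio.max()

for xi in (2.0, 5.0, 10.0):
    print(f"xi = {xi:4.1f}:  c_xi approx {smallest_constant(xi):.3f}")
    # the constant grows with xi but stays finite for any fixed xi
```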
Proof of Theorem 5: Define $h(x) = \exp(-\lambda\psi(x))$ (it is not necessarily a probability density function). Define
$$q_n = \sum_{j=1}^{\infty}\pi_j\prod_{i=1}^n h(y_i - \hat{y}_{j,i}).$$
For each fixed $j^* \ge 1$, we have
$$\log(1/q_n) \le \log(1/\pi_{j^*}) + \lambda\sum_{i=1}^n\psi(y_i - \hat{y}_{j^*,i}). \qquad (11)$$
As before,
$$q_n = \left(\sum_{j=1}^{\infty}\pi_j h(y_1-\hat{y}_{j,1})\right)\cdot\frac{\sum_{j=1}^{\infty}\pi_j h(y_1-\hat{y}_{j,1})h(y_2-\hat{y}_{j,2})}{\sum_{j=1}^{\infty}\pi_j h(y_1-\hat{y}_{j,1})}\cdots\frac{\sum_{j=1}^{\infty}\pi_j\prod_{i=1}^n h(y_i-\hat{y}_{j,i})}{\sum_{j=1}^{\infty}\pi_j\prod_{i=1}^{n-1}h(y_i-\hat{y}_{j,i})}.$$
It follows that
$$\log(1/q_n) = -\sum_{i=1}^n\log\left(\sum_{j=1}^{\infty}W_{j,i}h(y_i-\hat{y}_{j,i})\right). \qquad (12)$$
We now bound $\log\left(\sum_{j=1}^{\infty}W_{j,i}h(y_i-\hat{y}_{j,i})\right) = \log E^J\exp\{-\lambda\psi(y_i-\hat{y}_{J,i})\}$, where $E^J$ denotes expectation with respect to $J$ under the probability mass function $f(j) = P(J = j) = W_{j,i}$ (for a fixed $i$). By Lemma 10.1 of Catoni (1999), under Condition 8, we have
$$\log E^J\exp\left(-\lambda\psi(y_i-\hat{y}_{J,i})\right) \le -\lambda E^J\psi(y_i-\hat{y}_{J,i}) + \frac{\lambda^2}{2}E^J\left(\psi(y_i-\hat{y}_{J,i}) - E^J\psi(y_i-\hat{y}_{J,i})\right)^2\exp\left(\lambda c_2\left(|y_i| + 1 + \sup_{j\ge1}|\hat{y}_{j,i}|\right)\right). \qquad (13)$$
We need to handle the second term on the right-hand side above. Denote it by $I$. Note that by the second part of Condition 7,
$$E^J\left(\psi(y_i-\hat{y}_{J,i}) - E^J\psi(y_i-\hat{y}_{J,i})\right)^2 \le E^J\left(\psi(y_i-\hat{y}_{J,i}) - \psi\left(y_i-E^J\hat{y}_{J,i}\right)\right)^2 \le c_2^2\left(|y_i| + 1 + \sup_{j\ge1}|\hat{y}_{j,i}|\right)^{2\alpha}E^J\left(\hat{y}_{J,i}-E^J\hat{y}_{J,i}\right)^2 \le c_2^2\,2^{2\alpha-1}\left(\left(1+\sup_{j\ge1}|\hat{y}_{j,i}|\right)^{2\alpha} + |y_i|^{2\alpha}\right)E^J\left(\hat{y}_{J,i}-E^J\hat{y}_{J,i}\right)^2. \qquad (14)$$
Under the first part of Condition 7, let $a_0 = y_i - \hat{y}_i = y_i - E^J\hat{y}_{J,i}$; we have
$$\psi(y_i-\hat{y}_{j,i}) - \left(\psi'(a_0)\left((y_i-\hat{y}_{j,i}) - (y_i-\hat{y}_i)\right) + \psi(y_i-\hat{y}_i)\right) \ge c\,\left|(y_i-\hat{y}_{j,i}) - (y_i-\hat{y}_i)\right|^2 = c\,(\hat{y}_{j,i}-\hat{y}_i)^2. \qquad (15)$$
Taking expectation under $E^J$, we have
$$E^J\psi(y_i-\hat{y}_{J,i}) - \left(\psi'(a_0)E^J(\hat{y}_i-\hat{y}_{J,i}) + \psi(y_i-\hat{y}_i)\right) \ge c\,E^J(\hat{y}_{J,i}-\hat{y}_i)^2.$$
Observing that $E^J(\hat{y}_i - \hat{y}_{J,i}) = 0$ by the definition of $\hat{y}_i$, it follows that
$$E^J\psi(y_i-\hat{y}_{J,i}) - \psi(y_i-\hat{y}_i) \ge c\,E^J(\hat{y}_{J,i}-\hat{y}_i)^2. \qquad (16)$$
It follows that the second term $I$ in inequality (13) is bounded by $E^J(\hat{y}_{J,i}-\hat{y}_i)^2$ multiplied by
$$\frac{\lambda^2}{2}c_2^2\,2^{2\alpha-1}\left(\left(1+\sup_{j\ge1}|\hat{y}_{j,i}|\right)^{2\alpha} + |y_i|^{2\alpha}\right)\exp\left(\lambda c_2\left(|y_i| + 1 + \sup_{j\ge1}|\hat{y}_{j,i}|\right)\right) \le \lambda^2 c_2^2\,2^{2\alpha-2}\left((A+1)^{2\alpha} + |y_i|^{2\alpha}\right)\exp\left(\lambda c_2\left(|y_i| + A + 1\right)\right),$$
using $\sup_{j\ge1}|\hat{y}_{j,i}| \le A$. Under Condition 8, taking conditional expectation given $z^{i-1}$ and $x_i$, we have
$$E_i(I) \le E^J\left(\hat{y}_{J,i}-E^J\hat{y}_{J,i}\right)^2\,\lambda^2 c_2^2\,2^{2\alpha-2}\left((A+1)^{2\alpha}e^{\lambda c_2(A+1)}M_2(\lambda c_2) + e^{\lambda c_2(A+1)}M_1(\lambda c_2)\right).$$
Take $\lambda$ sufficiently small, say $0 < \lambda \le \lambda_0$, so that
$$\lambda c_2^2\,2^{2\alpha-2}\left((A+1)^{2\alpha}e^{\lambda c_2(A+1)}M_2(\lambda c_2) + e^{\lambda c_2(A+1)}M_1(\lambda c_2)\right) \le c/2.$$
Together with (16),
$$E_i(I) \le \frac{\lambda c}{2}E^J(\hat{y}_{J,i}-\hat{y}_i)^2 \le \frac{\lambda}{2}E_i\left\{E^J\psi(Y_i-\hat{y}_{J,i}) - \psi(Y_i-\hat{y}_i)\right\}.$$
It follows that
$$E_i\log E^J\exp\left(-\lambda\psi(Y_i-\hat{y}_{J,i})\right) \le -\lambda E_i\psi(Y_i-\hat{y}_i) + \lambda E_i\left(\psi(Y_i-\hat{y}_i) - E^J\psi(Y_i-\hat{y}_{J,i})\right) + \frac{\lambda}{2}E_i\left(E^J\psi(Y_i-\hat{y}_{J,i}) - \psi(Y_i-\hat{y}_i)\right) \le -\lambda E_i\psi(Y_i-\hat{y}_i),$$
where the last step follows from $E^J\psi(Y_i-\hat{y}_{J,i}) \ge \psi(Y_i-\hat{y}_i)$ by Jensen's inequality based on the convexity of $\psi$. Together with (12) and (11), we have
$$-\lambda E\sum_{i=1}^n\psi(Y_i-\hat{y}_i) \ge -E\log(1/q_n) \ge -\log(1/\pi_{j^*}) - \lambda E\sum_{i=1}^n\psi(Y_i-\hat{y}_{j^*,i}).$$
Since $j^*$ is arbitrary, we have
$$E\sum_{i=1}^n\psi(Y_i-\hat{y}_i) \le \inf_{j\ge1}\left(\frac{\log(1/\pi_j)}{\lambda} + E\sum_{i=1}^n\psi(Y_i-\hat{y}_{j,i})\right).$$
This completes the proof of Theorem 5.
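The scheme analyzed in Theorem 5 replaces the Gaussian density in the AFTER update by $h(x) = \exp(-\lambda\psi(x))$ for a convex loss $\psi$. A minimal sketch of that exponential-weighting update follows; the default loss, the value of $\lambda$, and the names are illustrative choices of ours, and a loss actually satisfying the conditions of the theorem is what the result requires.

```python
import numpy as np

def combine_general_loss(forecasts, y, psi=np.square, lam=0.1, prior=None):
    """Exponential weighting with h(x) = exp(-lam * psi(x)); an illustrative sketch.

    forecasts : (n, J) array; forecasts[i, j] is available before y[i] is seen.
    y         : (n,) array of realized values.
    psi       : convex loss; np.square is an illustrative default.
    lam       : tuning parameter lambda > 0 (illustrative value).
    """
    n, J = forecasts.shape
    w = np.full(J, 1.0 / J) if prior is None else np.asarray(prior, float)
    combined = np.empty(n)
    for i in range(n):
        combined[i] = w @ forecasts[i]                # convex combination of forecasts
        log_w = np.log(w) - lam * psi(y[i] - forecasts[i])
        log_w -= log_w.max()                          # numerical stabilization
        w = np.exp(log_w)
        w /= w.sum()
    return combined
```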
Proof of Theorem 6: We consider first the case when $M < \sqrt{n}$. Recall $\Lambda_M = \{\lambda = (\lambda_1,\ldots,\lambda_M): \sum_{j=1}^M|\lambda_j| \le 1\}$. Let $N_\epsilon$ be an $\epsilon$-net in $\Lambda_M$ under the $\ell_1^M$ distance, i.e., for each $\lambda \in \Lambda_M$ there exists $\lambda' \in N_\epsilon$ such that $\|\lambda - \lambda'\|_1^M = \sum_{j=1}^M|\lambda_j - \lambda_j'| \le \epsilon$. An $\epsilon$-net $N_\epsilon$ in $\Lambda_M$ yields a suitable net in the set $F = \{\hat{y}_\lambda: \lambda \in \Lambda_M\}$ of the linear combinations of the original forecasts. Let $\hat{y}_1,\ldots,\hat{y}_M$ denote the original forecasts by the $M$ procedures at time $i$. Let $F_\epsilon$ be the set of the linear combinations of the forecasts $\hat{y}_1,\ldots,\hat{y}_M$ with coefficients in $N_\epsilon$. Then for any $\hat{y}_\lambda = \sum_{j=1}^M\lambda_j\hat{y}_j$ with $\lambda \in \Lambda_M$, there exists $\lambda' \in N_\epsilon$ such that
$$\left|\hat{y}_\lambda - \sum_{j=1}^M\lambda_j'\hat{y}_j\right| = \left|\sum_{j=1}^M(\lambda_j - \lambda_j')\hat{y}_j\right| \le A_3\|\lambda-\lambda'\|_1^M \le A_3\epsilon, \qquad (17)$$
where $A_3$ is an upper bound on the magnitude of the original forecasts.

Now we combine all the procedures in $F_\epsilon$ using the AFTER algorithm given in Section 2 for Theorem 1, with uniform weight $1/|N_\epsilon|$. Let $\hat{y}_i^\epsilon$, $i \ge 1$, denote the combined forecasts at time $i$. By Corollary 1, we have
$$\frac{1}{n}\sum_{i=1}^n E\left(m_i - \hat{y}_i^\epsilon\right)^2 \le C\frac{\log(|N_\epsilon|)}{n} + C\inf_{\lambda\in N_\epsilon}\frac{1}{n}\sum_{i=1}^n E\left(m_i - \hat{y}_{\lambda,i}\right)^2,$$
where $C$ depends only on $A_1$, $A_2$, and $A_3$. Since $F_\epsilon$ is an $(A_3\epsilon)$-net in $F$, by the triangle inequality, for any set of $m_i$'s we have $\inf_{\lambda\in N_\epsilon}E\sum_{i=1}^n(m_i - \hat{y}_{\lambda,i})^2 \le 2\inf_{\lambda\in\Lambda_M}E\sum_{i=1}^n(m_i - \hat{y}_{\lambda,i})^2 + 2nA_3^2\epsilon^2$. It follows that
$$\frac{1}{n}\sum_{i=1}^n E\left(m_i - \hat{y}_i^\epsilon\right)^2 \le C\frac{\log(|N_\epsilon|)}{n} + 2C\inf_{\lambda\in\Lambda_M}\frac{1}{n}\sum_{i=1}^n E\left(m_i - \hat{y}_{\lambda,i}\right)^2 + 2A_3^2C\epsilon^2. \qquad (18)$$
To get the best upper bound (in order), we need to minimize $\log(|N_\epsilon|) + 2A_3^2\epsilon^2 n$ when discretizing $\Lambda_M$. We apply a result on entropy numbers, i.e., the worst-case approximation error with the best net of size $2^k$ points. Let $\epsilon_k$ denote the entropy number of $\Lambda_M$ under the $\ell_1^M$ distance. From Schutt (1984, Theorem 1), when $k \ge M$, $\epsilon_k \le c2^{-k/M}$ for some constant $c$ independent of $k$ and $M$. Take
$$k = \frac{M(\log(n/M) + 2\log 2)}{2\log 2}$$
(note that $k \ge M$; strictly speaking, we need to round up to make $k$ an integer). Then
$$\log(|N_{\epsilon_k}|) + 2A_3^2\epsilon_k^2 n \le \frac{M(\log(n/M) + 2\log 2)}{2} + \frac{(A_3c)^2M}{2} \le c_0 M\log(1 + n/M),$$
where $c_0$ depends only on $A_3$ and $c$. The upper bound in Theorem 6 for $M < \sqrt{n}$ then follows.

Now consider the other case: $M \ge \sqrt{n}$. The argument above leads to a sub-optimal rate $M\log(1 + n/M)/n$ for $M \ge \sqrt{n}$. For this case, due to the $\ell_1$ constraint, the number of large coefficients is small relative to $M$ when $M \ge \sqrt{n}$. Working with only the large coefficients can result in the optimal rate of convergence. Note that for $\|\lambda\|_1^M \le 1$, $|\sum_{j=1}^M\lambda_j\hat{y}_j| \le A_3$. Then by a sampling argument (see, e.g., Lemma 1 in Barron (1993)), for each $m$ there exist a subset $I \subset \{1,\ldots,M\}$ of size $m$ and $\lambda_I' = (\lambda_i', i\in I)$ such that $|\sum_{j=1}^M\lambda_j\hat{y}_j - \sum_{i\in I}\lambda_i'\hat{y}_i| \le A_3/\sqrt{m}$. Taking $m^* = \sqrt{n/\log n}$, we have $|\sum_{j=1}^M\lambda_j\hat{y}_j - \sum_{j\in I}\lambda_j'\hat{y}_j| \le A_3(\log n/n)^{1/4}$. Consider an $\epsilon$-net in $B_I = \{\lambda_I: \sum_{j\in I}|\lambda_j| \le 1\}$ under the $\ell_1$ distance. Again by Schutt (1984, Theorem 1), taking $k = \frac{m^*(\log(n/m^*) + 2\log 2)}{2\log 2}$, the best $\epsilon$-net has approximation accuracy $\epsilon \le \frac{c}{2}\sqrt{m^*/n}$. Then, as in (17), we know that there exists $\lambda_I''$ in this $\epsilon$-net such that $|\sum_{i\in I}\lambda_i'\hat{y}_i - \sum_{i\in I}\lambda_i''\hat{y}_i| \le \frac{A_3c}{2}\sqrt{m^*/n}$. Thus for each $\hat{y}_\lambda \in F$, there exist $I \subset \{1,\ldots,M\}$ of size $m^*$ and $\lambda''$ such that
$$\left|\sum_{j=1}^M\lambda_j\hat{y}_j - \sum_{j\in I}\lambda_j''\hat{y}_j\right| \le A_3\frac{(\log n)^{1/4}}{n^{1/4}} + \frac{A_3c}{2n^{1/4}(\log n)^{1/4}} \le c''\frac{(\log n)^{1/4}}{n^{1/4}},$$
where $c''$ depends only on $A_3$ and $c$. Let $\delta_I = \{\delta_i, i\in I\}$ denote the procedures with indices in $I$. Now for each fixed subset $I \subset \{1,\ldots,M\}$ of size $m^*$, discretize the linear coefficients as described earlier and (with uniform weight) combine the corresponding linear combinations of the forecasting procedures in $\delta_I$. Then combine these (combined) procedures over all possible choices of $I$ (there are $\binom{M}{m^*}$ many such $I$ altogether) with uniform weight, and let $\delta_M^*$ denote this final procedure. Applying Corollary 1 twice, we have
$$R(\delta_M^*; n) \le C\left(2\inf_{\lambda\in\Lambda_M}\frac{1}{n}\sum_{i=1}^n E\left(m_i - \hat{y}_{\lambda,i}\right)^2 + \frac{(n\log n)^{1/2}}{n} + \frac{m^*\log(n/m^*)}{n} + \frac{\log\binom{M}{m^*}}{n}\right) \le C'\left(2\inf_{\lambda\in\Lambda_M}\frac{1}{n}\sum_{i=1}^n E\left(m_i - \hat{y}_{\lambda,i}\right)^2 + \frac{\log M}{\sqrt{n\log n}}\right),$$
where the constants $C$ and $C'$ depend on $A_1$, $A_2$, and $A_3$. This completes the proof of Theorem 6.
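The choice of $k$ in the first part of the proof of Theorem 6 is simply the value that balances the metric-entropy term $\log|N_{\epsilon_k}| \le k\log 2$ against the discretization term $2A_3^2\epsilon_k^2 n$ with $\epsilon_k \le c2^{-k/M}$; both then stay within a constant factor of $M\log(1+n/M)$. The short computation below only evaluates these expressions for a few $(n, M)$ pairs; $A_3$ and $c$ are placeholder values of ours.

```python
import numpy as np

def penalty_terms(n, M, A3=1.0, c=1.0):
    """Evaluate the two terms balanced when discretizing Lambda_M (illustrative A3, c)."""
    k = M * (np.log(n / M) + 2 * np.log(2)) / (2 * np.log(2))
    eps = c * 2.0 ** (-k / M)               # Schutt-type entropy bound eps_k <= c 2^{-k/M}
    metric_entropy = k * np.log(2)          # log|N_eps| <= k log 2
    approx_term = 2 * A3 ** 2 * eps ** 2 * n
    return metric_entropy, approx_term, M * np.log(1 + n / M)

for n, M in [(1000, 5), (10000, 20), (100000, 100)]:
    ent, appr, ref = penalty_terms(n, M)
    print(f"n={n:6d}, M={M:3d}: log|N|={ent:9.1f}, "
          f"2*A3^2*eps^2*n={appr:6.2f}, M*log(1+n/M)={ref:8.1f}")
```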
Proof of Theorem 7: Let $X_i = (X_{i1},\ldots,X_{i,M})$, $i \ge 1$, be i.i.d., where $X_{i1},\ldots,X_{i,M}$ are independent and uniformly distributed on $[0,1]$. Assume $Y_i = f(X_i) + \varepsilon_i$, where the errors $\varepsilon_i$ are independent and normally distributed with mean zero and variance 1. Let $\varphi_1(x),\ldots,\varphi_M(x)$ be uniformly bounded orthonormal basis functions (e.g., the trigonometric basis). For $j \ge 1$, take $\delta_j$ to be the procedure that forecasts $Y_i$ by $\varphi_j(X_i)$. For each $M$, consider the class of regression functions
$$F = \left\{f(x_1,x_2,\ldots,x_M) = \beta_1\varphi_1(x_1) + \cdots + \beta_M\varphi_M(x_M): \|\beta\|_1^M \le 1\right\}.$$
It is clear that the best linear combination of the procedures then has zero risk, i.e., $\inf_{\lambda\in\Lambda_M}E\sum_{i=1}^n(m_i - \hat{y}_{\lambda,i})^2 = 0$ for $f \in F$. For this case, due to the independence of the errors, prediction and regression are essentially identical. The conclusion of Theorem 7 then follows from Theorem 2 of Yang (1999d). This completes the proof of Theorem 7.
6 Acknowledgments

This research was supported by US National Security Agency Grant MDA904-99-1-0060.
References

[1] Akaike, H. (1973) Information theory and an extension of the maximum likelihood principle. In Proc. 2nd Int. Symp. Info. Theory, pp. 267-281, eds. B.N. Petrov and F. Csaki, Akademia Kiado, Budapest.
[2] Armstrong, J.S. (1989) Combining forecasts: The end of the beginning or the beginning of the end? Intl. J. Forecast., 5, 585-588.
[3] Barron, A.R. (1993) Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. on Information Theory, 39, 930-945.
[4] Barron, A.R. (1994) Approximation and estimation bounds for artificial neural networks. Machine Learning, 14, 115-133.
[5] Barron, A.R. and Cover, T.M. (1991) Minimum complexity density estimation. IEEE Trans. on Information Theory, 37, 1034-1054.
[6] Barron, A.R., Birge, L. and Massart, P. (1999) Risk bounds for model selection via penalization. Probability Theory and Related Fields, 113, 301-413.
[7] Barron, A.R., Rissanen, J., and Yu, B. (1998) The minimum description length principle in coding and modeling. IEEE Trans. on Information Theory, 44, 2743-2760.
[8] Bates, J.M. and Granger, C.W.J. (1969) The combination of forecasts. Operational Research Quarterly, 20, 451-468.
[9] Box, G.E.P. and Jenkins, G.M. (1976) Time Series Analysis: Forecasting and Control, 2nd ed. San Francisco: Holden-Day.
[10] Breiman, L. (1996a) Stacked regressions. Machine Learning, 24, 49-64.
[11] Breiman, L. (1996b) Bagging predictors. Machine Learning, 24, 123-140.
[12] Buckland, S.T., Burnham, K.P., and Augustin, N.H. (1995) Model selection: An integral part of inference. Biometrics, 53, 603-618.
[13] Catoni, O. (1999) "Universal" aggregation rules with exact bias bounds. Preprint.
[14] Cesa-Bianchi, N., Freund, Y., Haussler, D.P., Schapire, R., and Warmuth, M.K. (1997) How to use expert advice. Journal of the ACM, 44, 427-485.
[15] Chatfield, C. (1995) Model uncertainty, data mining, and statistical inference (with Discussion). J. Roy. Statist. Soc., Ser. A, 158, 419-466.
[16] Clemen, R.T. (1989) Combining forecasts: a review and annotated bibliography. Intl. J. Forecast., 5, 559-583.
[17] Clemen, R.T., Murphy, A.H., and Winkler, R.L. (1995) Screening probability forecasts: contrasts between choosing and combining. Intl. J. Forecast., 11, 133-145.
[18] Clemen, R.T. and Winkler, R.L. (1986) Combining economic forecasts. Journal of Business and Economic Statistics, 4, 39-46.
[19] Cover, T.M. (1965) Behavior of sequential predictors of binary sequences. In Transactions of the Fourth Prague Conference on Information Theory, Statistical Decision Functions, and Random Processes, 263-271. Publishing House of the Czechoslovak Academy of Sciences.
[20] Dawid, A.P. (1984) Present position and potential developments: Some personal views. Statistical theory: The prequential approach (with Discussion). J. Roy. Statist. Soc., Ser. A, 147, 278-292.
[21] Donoho, D.L. and Johnstone, I.M. (1994) Ideal denoising in an orthonormal basis chosen from a library of bases. C. R. Acad. Sci. Paris, 319, 1317-1322.
[22] Donoho, D.L. and Johnstone, I.M. (1998) Minimax estimation via wavelet shrinkage. Ann. Statistics, 26, 879-921.
[23] Figlewski, S. and Urich, T. (1983) Optimal aggregation of money supply forecasts: accuracy, profitability and market efficiency. Journal of Finance, 28, 695-710.
[24] Foster, D.P. (1991) Prediction in the worst case. Ann. Statistics, 19, 1084-1090.
[25] Genest, C. and Zidek, J.V. (1986) Combining probability distributions: a critique and an annotated bibliography. Statistical Science, 1, 114-148.
[26] Hoeting, J.A., Madigan, D., Raftery, A.E., and Volinsky, C.T. (1999) Bayesian model averaging: a tutorial (with Discussion). Statistical Science, 14.
[27] Holt, C.C. (1957) Forecasting seasonals and trends by exponentially weighted moving averages. Carnegie Institute of Technology.
[28] Johnstone, I. (1999) Function Estimation in Gaussian Noise: Sequence Models. Book manuscript.
[29] Juditsky, A. and Nemirovski, A. (1996) Functional aggregation for nonparametric estimation. Publication Interne, IRISA, N. 993.
[30] Kang, H. (1986) Unstable weights in the combination of forecasts. Management Science, 32, 683-695.
[31] LeBlanc, M. and Tibshirani, R. (1996) Combining estimates in regression and classification. J. Amer. Statist. Asso., 91, 1641-1650.
[32] Littlestone, N. and Warmuth, M.K. (1994) The weighted majority algorithm. Information and Computation, 108, 212-261.
[33] Merhav, N. and Feder, M. (1998) Universal prediction. IEEE Trans. on Information Theory, 44, 2124-2147.
[34] Newbold, P. and Granger, C.W.J. (1974) Experience with forecasting univariate time series and the combination of forecasts (with Discussion). J. Roy. Statist. Soc., Ser. A, 137, 131-165.
[35] Schutt, C. (1984) Entropy numbers of diagonal operators between symmetric Banach spaces. J. Approx. Theory, 40, 121-128.
[36] Schwartz, G. (1978) Estimating the dimension of a model. Ann. Statistics, 6, 461-464.
[37] Stone, M. (1974) Cross-validatory choice and assessment of statistical predictions (with Discussion). J. Roy. Statist. Soc., Ser. B, 36, 111-147.
[38] Taylor, J.W. and Bunn, D.W. (1999) Investigating improvements in the accuracy of prediction intervals for combinations of forecasts: a simulation study. Intl. J. Forecast., 15, 325-339.
[39] Vovk, V.G. (1990) Aggregating strategies. In Proceedings of the 3rd Annual Workshop on Computational Learning Theory, 372-383.
[40] Winters, P.R. (1960) Forecasting sales by exponentially weighted moving averages. Man. Sci., 6, 324-342.
[41] Wolpert, D. (1992) Stacked generalization. Neural Networks, 5, 241-259.
[42] Yang, Y. (1999a) Model selection for nonparametric regression. Statistica Sinica, 9, 475-499.
[43] Yang, Y. (1999b) Regression with multiple candidate models: selecting or mixing? Technical Report No. 8, Department of Statistics, Iowa State University.
[44] Yang, Y. (1999c) Adaptive regression by mixing. Technical Report No. 12, Department of Statistics, Iowa State University.
[45] Yang, Y. (1999d) Aggregating regression procedures for a better performance. Technical Report No. 17, Department of Statistics, Iowa State University.
[46] Yang, Y. (1999e) Adaptive estimation in pattern recognition by combining different procedures. Accepted by Statistica Sinica.
[47] Yang, Y. (2000a) Mixing strategies for density estimation. To appear in the Ann. Statistics.
[48] Yang, Y. (2000b) Combining different procedures for adaptive regression. Journal of Multivariate Analysis, 74, 135-161.
[49] Yang, Y. and Barron, A.R. (1998) An asymptotic property of model selection criteria. IEEE Trans. on Information Theory, 44, 95-116.
[50] Yang, Y. and Barron, A.R. (1999) Information-theoretic determination of minimax rates of convergence. To appear in the Ann. Statistics.