Appl. Statist. (2015) 64, Part 1, pp. 75–92
Combining the Bayesian processor of output with Bayesian model averaging for reliable ensemble forecasting

R. Marty, Université Laval, Québec, Canada
V. Fortin, Environnement Canada, Dorval, Canada
H. Kuswanto, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia
A.-C. Favre, Laboratoire d'Etude des Transferts en Hydrologie et Environnement, UMR 5564, Grenoble, France
and E. Parent, AgroParisTech–Institut National de la Recherche Agronomique, Paris, France

[Received June 2012. Final revision December 2013]

Summary. Weather predictions are uncertain by nature. This uncertainty is dynamically assessed by a finite set of trajectories, called ensemble members. Unfortunately, ensemble prediction systems underestimate the uncertainty and are thus unreliable. Statistical approaches have been proposed to post-process ensemble forecasts, including Bayesian model averaging and the Bayesian processor of output. We develop a methodology, called the Bayesian processor of ensemble members, built from a hierarchical model that combines the two aforementioned frameworks to calibrate ensemble forecasts. The Bayesian processor of ensemble members is compared with Bayesian model averaging and the Bayesian processor of output by calibrating surface temperature forecasts at eight stations in the province of Quebec (Canada). Results show that the skill of the ensemble forecasts is improved by the method developed.

Keywords: Bayesian model averaging; Bayesian processor of output; Ensemble post-processing; Ensemble prediction system; Hierarchical Bayesian model; Predictive distribution
Address for correspondence: R. Marty, Département de Génie Civil et de Génie des Eaux, Université Laval, 1065 Avenue de la Médecine, Québec, G1V 0A6, Canada. E-mail: [email protected]

1. Introduction

Uncertainty in weather forecasts is mainly a consequence of imperfect knowledge of the initial conditions of the current atmosphere and of the deficiencies of (operational) numerical weather prediction (NWP) models. An ensemble prediction system (EPS) is an NWP system which
aims to represent dynamically the uncertainty in its state variables by an empirical distribution provided by a finite set of trajectories, called ensemble members. Ensemble meteorological forecasts thus provide a way of communicating information on forecast uncertainty to end users. However, users tend to assume that the ensemble is reliable, i.e. that it represents the uncertainty accurately (Johnson and Bowler, 2009). Unfortunately, current EPSs typically underestimate the uncertainty, partly because not all sources of uncertainty are taken into account, and partly because there is a discrepancy between the temporal and spatial scales at which the variables are predicted and the scales at which they are needed for specific applications.

Ensemble forecasts are generally obtained by perturbing the initial conditions, boundary conditions and parameters of a single NWP model (Houtekamer et al., 1996; Molteni et al., 1996; Toth and Kalnay, 1997; Buizza et al., 2007), but this approach generally fails to take model uncertainty into account (Leutbecher and Palmer, 2008). A multimodel EPS, such as the 'North American ensemble forecasting system' (Candille, 2009), is useful in that respect, but it introduces an additional difficulty, as the ensemble members are then typically not exchangeable, nor even equiprobable. Statistical post-processing of ensemble forecasts is therefore generally necessary to obtain a reliable assessment of the forecast uncertainty. Several approaches have been proposed recently to build reliable probabilistic forecasts from ensemble forecasts, including Bayesian model averaging (BMA) (Raftery et al., 2005; Wilson et al., 2007a), the Bayesian processor of ensemble (Krzysztofowicz, 2008), kernel dressing methods (Roulston and Smith, 2003; Wang and Bishop, 2005; Fortin et al., 2006) and other ensemble model output statistics (Gneiting et al., 2005), sometimes based on reforecasts (Hagedorn et al., 2008).

With BMA, the predictive distribution of the variable of interest is represented by a finite mixture of probability density functions. A major difficulty with the BMA methodology, in its most general form, is the fact that one needs to estimate a large number of parameters and weights. Usually, one makes additional assumptions to reduce the number of free parameters, such as exchangeability of the ensemble members (Fraley et al., 2010). In fact, in many implementations of BMA, a single kernel distribution centred on the unbiased ensemble members is used. BMA is thus closely related to kernel smoothing methods. This is in particular so for the ensemble BMA framework (see the ensembleBMA R package (R Development Core Team, 2006; Fraley et al., 2011)). Di Narzo and Cocchi (2010) proposed a different hierarchical Bayesian model based on the best member approach, in which a latent variable represents the best member and the parameters' uncertainty is taken into account within a Bayesian framework. Möller et al. (2012) developed a multivariate post-processor designed to obtain a joint predictive distribution of five weather variables by adding Gaussian copulas to BMA to offer information on the spatial dependences between weather quantities.

The idea of the Bayesian processor of ensemble is to build a likelihood function from the joint record of ensemble forecasts and observations, and to combine it with a prior distribution to obtain the posterior distribution of the predictand. Thus information from asymmetric samples (a long climatic sample and a shorter joint sample) is optimally fused.
Nevertheless, this theoretically correct Bayesian approach suffers from the curse of dimensionality, as it is not easy to identify the likelihood given the available data. A necessary simplification consists of summarizing the ensemble forecasts by a statistic of lower dimensionality, which means that some of the information that is provided by the shape of the empirical distribution of the ensemble members is lost. A particular case which can occur in practice is that the only information that one can extract from the ensemble is the ensemble mean, especially because it can take more than 10 years of past forecasts to identify correctly the relationship between the ensemble spread and the skill of the ensemble mean (Whitaker and Loughe, 1998). In that case, the ensemble
forecast collapses into a point forecast, and all information about the spread of the ensemble is lost. The simpler framework of the Bayesian processor of output (BPO) (Krzysztofowicz, 2004), which is applicable for post-processing deterministic forecasting systems, can be used in this case.

In this paper, we propose a methodology based on the BPO and BMA frameworks. This is done by framing the problem as a downscaling issue and solving it by using a hierarchical model: instead of viewing the EPS as unreliable for the purpose of forecasting the predictand, we assume that it is reliable for an unobserved (latent) variable, and that there is a statistical relationship between this latent variable and the predictand. This makes sense for NWP models running at low resolution, which cannot resolve the spatial scale at which observations are taken but try to represent, through parameterizations, the average effect at the grid scale of the physical processes occurring at finer scales. The ensemble members are then considered to be exchangeable with this latent variable and can serve to build a non-parametric predictive distribution for the latent variable. Conditionally on the latent variable, the predictive distribution of the quantity of interest is then obtained by using the BPO framework. The method developed is compared with ensemble BMA and the BPO by calibrating surface temperature forecasts at eight stations in the province of Quebec (Canada) during the summer of 2008.

The paper is organized as follows. In Section 2, we detail the methodology, by first defining the notation and stating the underlying assumptions, and then presenting how to combine BMA and the BPO in a new framework to calibrate ensemble forecasts. In Section 3, we present the forecast and observation data sets, and the verification scores that are used to assess the skill and reliability of raw and calibrated ensemble forecasts. In Section 4, we discuss the results obtained for the summer of 2008 and compare the skill of the three Bayesian approaches. The main conclusions are summarized in Section 5, in which we also identify issues requiring further investigation.

2. The Bayesian processor of ensemble members method for post-processing ensemble forecasts

2.1. Notation and assumptions
Let $y_t$ be a scalar quantity of interest for which a forecast is required. Let $t$ represent the time at which the forecast is valid. We are interested in obtaining a forecast of $y_t$ issued at time $t - h$, where $h$ is the lead time of the forecast. Let also $\Theta_{t-h}$ represent the initial conditions of the atmosphere at the time that the forecast is issued. $y_t$ can be seen as a realization from a stochastic process $G_y$, conditional on $\Theta_{t-h}$:

$$y_t \mid \Theta_{t-h} \sim G_y.$$

NWP models are designed to extrapolate the initial conditions in time and to summarize the information that is required to forecast quantities of interest through a finite set of state variables. Let $\xi_t = T(\Theta_{t-h})$ be the subset of state variables which is relevant for forecasting $y_t$, in the sense that it is a sufficient statistic for $y_t$:

$$p(y_t \mid \Theta_{t-h}) = p(y_t \mid \xi_t).$$

Forecasting errors make it impossible in practice to obtain $\xi_t$; hence $\xi_t$ can be considered as a latent variable. The objective of an EPS is to represent the uncertainty in its state variables. Consider an EPS which produces a finite set of $S$ forecasts $X_t^{(h)} = \{X_{t,s}^{(h)},\ s = 1, \ldots, S\}$ for $\xi_t$, and which is thus informative for the scalar quantity $y_t$.
Ensemble forecasts $X_t^{(h)}$ can be seen as realizations from a different stochastic process $G_X$, conditional on $\Theta_{t-h}$:
$$X_t^{(h)} \mid \Theta_{t-h} \sim G_X.$$

In the remainder of the paper, $h$ is omitted when it is clear from the context. Note that we do not require $X_{t,s}$ or $\xi_t$ to refer to exactly the same physical quantity as $y_t$. For example, $X_{t,s}$ could refer to the grid box average of air temperature at a height corresponding to 850 hPa of pressure, whereas $y_t$ could refer to screen level temperature at a specific location. In most cases, there is indeed a scale discrepancy between what the NWP model can resolve and what the observation actually represents. We shall assume that the joint distribution $p(\xi_t, X_{t,1}, X_{t,2}, \ldots, X_{t,S})$ is exchangeable, which implies some faith in the NWP system. The ensemble members are also assumed to be exchangeable with each other. Essentially, the numerical model is assumed to be well suited for predicting $\xi_t$, with differences between ensemble members which adequately reflect the uncertainty in $\xi_t$.

2.2. The Bayesian model averaging component
If we could observe $\xi_t$, then predicting $y_t$ would only require building a probability distribution for $y_t$ conditional on $\xi_t = T(\Theta_{t-h})$. However, we observe only $X_t$, and hence we instead need to obtain $p(y_t \mid X_t)$. From the law of total probability,

$$p(y_t \mid X_t) = \int p(y_t \mid T = \xi_t, X_t)\, p(\xi_t \mid X_t)\, d\xi_t.$$

Since $\xi_t$ is a sufficient statistic for $\Theta_{t-h}$ and $X_t$ is a stochastic function of $\Theta_{t-h}$,

$$p(y_t \mid T = \xi_t, X_t) = p(y_t \mid T = \xi_t).$$

Now, since $\xi_t$ is exchangeable with any ensemble member, $p(\xi_t \mid X_t)$ is equivalent to the predictive distribution of an additional ensemble member $X_{t,S+1}$ given $X_t$. As a consequence of this exchangeable behaviour, a non-parametric estimator $\hat{F}$ for the distribution of $\xi_t \mid X_t$ can be obtained by using a Dirichlet process mixture model (Blackwell and MacQueen, 1973; Fortin et al., 1997):

$$\hat{F}(\xi_t \mid X_t) = \alpha F_0 + (1 - \alpha) F_S(\xi_t \mid X_t),$$

where $F_0$ is a prior estimate of the distribution $F$,

$$F_S(\xi_t \mid X_t) = \frac{1}{S} \sum_s H(\xi_t - X_{t,s})$$

is the empirical distribution of the sample $X_t$ ($H$ being the Heaviside function) and $\alpha$ is the weight that is given to the prior estimate. If the weight $\alpha$ that is given to the prior distribution is considered as negligible, then $p(\xi_t \mid X_t)$ can be approximated by the empirical distribution of $X_t$:
$$p(\xi_t \mid X_t) \approx f_S(\xi_t \mid X_t), \quad \text{where} \quad f_S(\xi_t \mid X_t) = \frac{1}{S} \sum_s \delta(\xi_t - X_{t,s}),$$

with $\delta$ being the Dirac function. It follows that

$$p(y_t \mid X_t) \approx \frac{1}{S} \sum_s \int p(y_t \mid T = \xi_t)\, \delta(X_{t,s} - \xi_t)\, d\xi_t.$$
Consequently,

$$p(y_t \mid X_t) \approx \frac{1}{S} \sum_s p(y_t \mid T = X_{t,s}). \quad (1)$$

Note that this expression of the predictive distribution is the same as in ensemble BMA. Regardless of the notation that is used within these equations, and considering a single group of exchangeable members, equation (1) is identical to equation (5) in Fraley et al. (2010). Sampling from the predictive distribution $p(y_t \mid X_t)$ involves a two-step process:
(a) sampling $\xi_t^{(j)}$ from the empirical distribution $F_S(\xi_t \mid X_t)$; then
(b) sampling from $p(y_t \mid T = \xi_t^{(j)})$.

2.3. The Bayesian processor of output component
2.3.1. Exchangeability
Even if exchangeability is assumed for the joint probability distribution of the ensemble members $X_t$ and $\xi_t$, it is not straightforward to recover the probability distribution of $y_t$ given $\xi_t$, $p(y_t \mid T = \xi_t)$, from the relationship between $y_t$ and $X_t$. Such a problem can be cast into the general framework of a regression with errors in the predictors, the ensemble members $X_t$ being related to $\xi_t$ through a joint distribution $p(X_{t,1}, X_{t,2}, \ldots, X_{t,S}, \xi_t)$. The problem would become significantly simpler if it consisted of obtaining the probability distribution $p(\xi_t \mid y_t)$ of $\xi_t$ given $y_t$ from the relationship between $y_t$ and $X_t$, as it would then amount to the simpler problem of regression with noise on the measurement of the predictand. Through Bayes's rule,

$$p(y_t \mid T = \xi_t) \propto p(\xi_t \mid y_t)\, p(y_t),$$

where $p(y_t)$ is the marginal distribution of $y_t$. This is analogous to the BPO method for the case where $\xi_t$ is observed. Krzysztofowicz (2004) referred to $p(y_t)$ as the prior distribution of the predictand, even if $y_t$ is not a parameter of the model but an observable. Note that Bayesian approaches applied to observables are sometimes used in meteorology and climate sciences, as in forecast calibration or assimilation (Coelho et al., 2004; Stephenson et al., 2005). $p(\xi_t \mid y_t)$ then corresponds to the likelihood function, and $p(y_t \mid T = \xi_t)$ to the posterior distribution of the predictand. We shall use the same terminology throughout the paper.

2.3.2. Linear model
Assume, for example, a linear model between $y_t$ and $\xi_t$, considering past forecasts over a training period of length $N$:

$$\xi_t = \beta_0 + \beta_1 y_t + \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, \sigma^2_{\varepsilon|y}), \quad (2)$$

where the $\varepsilon_t$ are assumed independent. The relationship between $y_t$ and $X_{t,s}$ would then be similar, but with more uncertainty:

$$X_{t,s} = \beta_0 + \beta_1 y_t + \varepsilon_t + \eta_{t,s}. \quad (3)$$
Various methods are possible for estimating the parameters on the basis of past forecasts, including unconstrained least squares, or constrained least squares where the slope parameter $\beta_1$ is set to 1, as suggested by Hamill (2007) and Wilson et al. (2007b). However, a Bayesian specification is also possible and relatively straightforward. The estimation of the parameters of the linear model is described in Section 4.1.2.

The main issue comes from the fact that the departures $\eta_{t,s}$ between $\xi_t$ and each ensemble member $X_{t,s}$ are not stochastically independent, leading to difficulties in obtaining the likelihood
function of the regression parameters given the observations. A simple solution consists in working with the ensemble mean $\bar{X}_t = (1/S) \Sigma_s X_{t,s}$ for the purpose of estimating the regression parameters:

$$\bar{X}_t = \beta_0 + \beta_1 y_t + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, \sigma^2_{\bar{X}|y}). \quad (4)$$

This is possible without unduly affecting the parameter estimation, because the ensemble mean is the dependent variable, not the explanatory variable (as with the BMA method). It should also be realized that assuming normality of the residuals in expression (4), although convenient, puts constraints on the joint distribution of the ensemble mean $\bar{X}_t$ and the latent variable $\xi_t$.

The estimation of the residual variance $\sigma^2_{\varepsilon|y}$ of the latent variable $\xi_t$ conditionally on the observation $y_t$ is not straightforward, even knowing an estimate of the residual variance $\sigma^2_{\bar{X}|y}$ of the ensemble mean given an observation. From the assumption of conditional independence of the ensemble members and $y_t$, given $\xi_t$, it follows that there is less information in $\bar{X}_t$ than in $\xi_t$, and hence that $\sigma^2_{\bar{X}|y}$ provides an upper bound for $\sigma^2_{\varepsilon|y}$. To avoid recourse to a parametric exchangeable model, and to obtain a final estimate which is consistent with this interval but also maximizes the skill of the predictive distribution, we suggest using the value for $\sigma^2_{\varepsilon|y}$ between 0 and $\sigma^2_{\bar{X}|y}$ which minimizes an objective function: here the mean continuous ranked probability score (CRPS) (Matheson and Winkler, 1976) of the probabilistic forecast, as advocated by Gneiting et al. (2005). This score is described in detail in Section 3.2. Computing the CRPS requires sampling from the predictive distribution a large number of times if the predictive distribution is not available in closed form for the integration. An exploratory sensitivity analysis suggests that R = 100 samples per ensemble member is sufficient to obtain robust results.
2.3.3. Prior probability distribution for the predictand
Once the probability model $p(\xi_t \mid y_t)$ has been established from a regression analysis of all pairs $(y_t, X_t)$, it can be combined with a prior distribution $p(y_t)$ to obtain the posterior $p(y_t \mid \xi_t)$. The prior could be defined from climatological records, as is standard with the BPO method, with particular attention to seasonal effects, to fit a normal probability distribution $\mathcal{N}(y_t; \mu, \sigma^2)$. Other sources of information could be incorporated in the prior distribution. For example, if the errors of two or more forecasting systems are independent, their forecasts could be assimilated sequentially, the prediction obtained by using one system becoming the prior to be used for assimilating the information that is provided by the next system. This technique could, for example, be used to assimilate both the Canadian and American EPSs which are part of the North American ensemble forecasting system. Even forecasts from the same EPS but for a longer lead time could be used as a prior, if the difference in lead time is sufficient for the forecast errors to be independent. For example, the probabilistic prediction that is obtained by using a forecast with a lead time of 192 h could be used a week later as the prior for obtaining a probabilistic forecast with a lead time of 24 h. Another possibility would be to define a non-informative prior with a zero mean and a large variance.

2.3.4. Training sample
We now discuss the choice of the sample from which the parameters of the regression should be estimated. It is very likely that the sample is affected by seasonality effects. Seasonality and potential non-stationarity lead to non-constant regression parameters over time: the forecasts in winter may be more informative than in summer, and consequently the stochastic dependence in winter would be stronger than in the summer season (Krzysztofowicz and Evans, 2008). Instead of performing a deseasonalization of the time series to remove any seasonality effect
or non-stationarity, we choose to re-estimate the parameters of the model at each time step by using a moving window of $N$ past forecasts. The size $N$ of the window should be sufficiently large to obtain stable estimates for the parameters, but sufficiently small to avoid non-stationarities in the parameter values caused by seasonality effects.

2.3.5. Posterior probability density function for the predictand
The posterior distribution is obtained by simply combining the prior with the likelihood function,

$$p(y_t \mid \xi_t) \propto p(y_t)\, p(\xi_t \mid y_t),$$

where

$$p(y_t) = \mathcal{N}(y_t;\, \mu, \sigma^2), \qquad p(\xi_t \mid y_t) = \mathcal{N}(\xi_t;\, \beta_0 + \beta_1 y_t,\, \sigma^2_{\varepsilon|y}).$$

Since the prior $p(y_t)$ and the likelihood function $p(\xi_t \mid y_t)$ are conjugate, it is straightforward to perform the Bayesian updating analytically:

$$p(y_t \mid \xi_t) = \mathcal{N}(y_t;\, \mu_{y|\xi},\, \sigma^2_{y|\xi}), \quad (5)$$

where the parameters $\mu_{y|\xi}$ and $\sigma^2_{y|\xi}$ are defined as follows:

$$\mu_{y|\xi} = \frac{\sigma^2 \beta_1^2 (\xi_t - \beta_0)/\beta_1 + \sigma^2_{\varepsilon|y}\, \mu}{\sigma^2 \beta_1^2 + \sigma^2_{\varepsilon|y}} \quad (6)$$

and

$$\sigma^2_{y|\xi} = \frac{\sigma^2 \sigma^2_{\varepsilon|y}}{\sigma^2 \beta_1^2 + \sigma^2_{\varepsilon|y}}. \quad (7)$$
2.4. Combining the Bayesian model averaging and Bayesian processor of output components
Combining equation (1) with equation (5), we obtain as the final probability model for $y_t$ given the ensemble forecast $X_t$

$$p(y_t \mid X_t) \approx \frac{1}{S} \sum_s \mathcal{N}(y_t;\, \mu_{y|\xi=X_{t,s}},\, \sigma^2_{y|\xi}), \quad (8)$$

where $\mu_{y|\xi=X_{t,s}}$ and $\sigma^2_{y|\xi}$ have been defined in equations (6) and (7) respectively. It is interesting that the expectation of $y_t$ given $X_t$ can be written as a weighted average of the bias-corrected ensemble mean and of the prior specification for the mean.
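A minimal sketch of the predictive model (8), with all numerical inputs hypothetical: it computes the per-member posterior moments of equations (6) and (7), evaluates the equal-weight Gaussian mixture, and draws samples via the two-step procedure of Section 2.2.

```python
import numpy as np
from scipy.stats import norm

def posterior_moments(x_ens, beta0, beta1, sigma2_eps, mu, sigma2):
    """Per-member posterior mean and variance, equations (6) and (7)."""
    denom = sigma2 * beta1**2 + sigma2_eps
    mu_post = (sigma2 * beta1 * (x_ens - beta0) + sigma2_eps * mu) / denom   # (6)
    sig2_post = sigma2 * sigma2_eps / denom                                  # (7)
    return mu_post, sig2_post

def bpem_pdf(y, x_ens, beta0, beta1, sigma2_eps, mu, sigma2):
    """Equation (8): equal-weight mixture of S Gaussian kernels."""
    mu_post, sig2_post = posterior_moments(x_ens, beta0, beta1, sigma2_eps, mu, sigma2)
    return norm.pdf(y, loc=mu_post, scale=np.sqrt(sig2_post)).mean()

def bpem_sample(n, x_ens, beta0, beta1, sigma2_eps, mu, sigma2, seed=0):
    """Two-step sampling: (a) pick a member (latent xi ~ F_S); (b) draw y | xi."""
    rng = np.random.default_rng(seed)
    mu_post, sig2_post = posterior_moments(x_ens, beta0, beta1, sigma2_eps, mu, sigma2)
    members = rng.integers(0, x_ens.size, size=n)            # step (a)
    return rng.normal(mu_post[members], np.sqrt(sig2_post))  # step (b)

# Toy example with made-up numbers: a 20-member forecast near 18 degC,
# a seasonal prior N(15, 9) and previously fitted regression parameters.
x_ens = 18.0 + np.random.default_rng(1).normal(0.0, 1.0, size=20)
print(bpem_pdf(17.5, x_ens, beta0=0.3, beta1=0.95, sigma2_eps=1.2, mu=15.0, sigma2=9.0))
```

Note how equation (6) makes each calibrated kernel mean a precision-weighted average of the bias-corrected member and the prior mean, in line with the remark above.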
The method proposed is built by combining the strengths of two components, namely BMA and the BPO. Henceforth we denote the method the Bayesian processor of ensemble members (BPEM). The description of the BPEM highlights that this method differs from ensemble BMA by introducing a prior distribution. The BPO calibration is performed on the ensemble mean through the combination of a regression-based bias correction and a climatological prior. The BPEM is more sophisticated in that the calibration is applied to each ensemble member. These methods were considered as 'empirical Bayesian approaches' by Di Narzo and Cocchi (2010), who proposed a method combining the dressing kernel approach with a Bayesian methodology. Their calibration is based on the introduction of a latent variable defining the (unknown) best member and is implemented by sampling the whole parameter space and evaluating the likelihood on a training period set equal to 30 days. Our approach differs from that of Di Narzo and Cocchi (2010) by the introduction of a prior on the predictand, the expression of the linear model and the estimation of its parameters.
3. Data and forecast assessment
3.1. Observational data and ensemble forecasts
The temperature forecast data set used in this study was provided by the Canadian Meteorological Centre. It contains 1 year (April 2008 to April 2009) of observed temperatures from 322 meteorological stations over Canada and the corresponding ensemble forecasts. The ensemble forecasts are first downscaled to the station sites by using bilinear interpolation and then rounded to the nearest degree to save space in the database. This database is extracted from that used in Wilson et al. (2007a). Each ensemble forecast comprises the 20 perturbed members produced by the 'Global ensemble prediction system' (GEPS) (Houtekamer et al., 2009; Charron et al., 2010) (see also 'General notices' from the Canadian Meteorological Centre (2013)); the control member has been removed from the original ensemble to respect the assumption of exchangeability of the ensemble members. As a case-study, the method proposed is used to calibrate ensemble forecasts only in the summer season (July and August 2008). 6-hourly ensemble forecasts are issued at 00 and 12 h Universal Time Coordinated (UTC) with lead times from 6 h up to 384 h. Only the 00 UTC run and lead times that are multiples of 24 h have been kept in this paper. The observed data set that was used to estimate the climatological prior is taken from the Canadian daily climate data and information archive: www.climat.meteo.gc.ca. We consider in this paper the temperature forecasts from eight stations across Quebec province. Table 1 summarizes general and geographical information for these meteorological stations, together with the optimal training lengths (in days) that are identified in Section 4.2.

Table 1. Meteorological stations and optimal training lengths (days)

Station                              Longitude    Latitude   Elevation (m above sea level)   BPEM–isc   BPO–isc   ensBMA
Baie-Comeau Airport                  −68.12°E     49.08°N     22                             20         25        25
Chibougamau-Chapais Airport          −74.32°E     49.46°N    387                             15         15        15
Gaspé Airport                        −64.29°E     48.47°N     34                             15         25        25
Montreal International Airport       −73.45°E     45.28°N     36                             15         20        25
Quebec City International Airport    −71.23°E     46.48°N     74                             15         25        45
Sherbrooke Airport                   −71.41°E     45.26°N    241                             30         25        10
Sept-Îles Airport                    −66.16°E     50.13°N     55                             50         25        30
Val d'Or Airport                     −77.47°E     48.03°N    337                             30         25        15

3.2. Assessment of forecast skill and reliability
The objective function that is used to assess the performance of the calibrated ensemble forecasts is the CRPS (Matheson and Winkler, 1976). The CRPS compares the cumulative distribution function of the forecast with the observation. The CRPS can be written in a general form as

$$\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \{F(u) - H(u - y)\}^2\, du, \quad (9)$$

where $F$ is the predictive cumulative distribution function, $y$ refers to the observation and
$H(u - y)$ is the Heaviside function, equal to 0 when $u < y$ and to 1 otherwise. The CRPS is a negatively oriented score; hence a smaller value is better. The properties of the CRPS have been well documented in Candille and Talagrand (2005), Gneiting and Raftery (2007) and Gneiting et al. (2007). It has become a popular and attractive assessment method because it can address calibration as well as sharpness (Gneiting et al., 2005). It is difficult to assess why a meteorological forecasting system performs well or not for a single event, and hence the CRPS of an ensemble forecast provided at a single time point is generally not of interest. What matters more is the average performance of a forecasting system over a time period. For this reason, we rely on the average value $\overline{\mathrm{CRPS}}$ of the CRPS over a training period to identify the model parameters, and over a verification period to assess model skill.

Users might prefer a reliable system over a sharper forecast which is not reliable, in particular when using the ensemble spread as an input to a risk management process. Many verification tools have been developed for assessing the reliability of probabilistic forecasts (Jolliffe and Stephenson, 2003). These include the rank histogram (Talagrand et al., 1999) and the reliability diagram (Wilks, 1995). In this paper, reliability is assessed through the decomposition of $\overline{\mathrm{CRPS}}$ into reliability and potential components proposed by Hersbach (2000):

$$\overline{\mathrm{CRPS}} = \mathrm{Rel} + \mathrm{Pot}, \quad (10)$$

where

$$\mathrm{Rel} = \sum_s \bar{g}_s \left( \bar{o}_s - \frac{s}{S} \right)^2, \qquad \mathrm{Pot} = \sum_s \bar{g}_s\, \bar{o}_s (1 - \bar{o}_s).$$
Here the ensemble members are ordered and the subscript $s$ refers to the bin defined by two consecutive members $[X_{t,s}, X_{t,s+1}]$; $\bar{g}_s$ represents the mean width of bin $s$ and $\bar{o}_s$ the average frequency with which the observation is lower than $(X_{t,s} + X_{t,s+1})/2$. Note that the $s/S$ terms mean that the cumulative distribution function $F$ is considered as a piecewise constant cumulative distribution function. The reliability component Rel measures the ability of the EPS to be statistically consistent by comparing the verifying frequency $\bar{o}_s$ with the expected probability $s/S$, the squared differences being weighted by the average width of bin $s$. The potential component Pot is the $\overline{\mathrm{CRPS}}$ that would be obtained by a perfectly reliable EPS and depends mostly on the data and on the EPS itself.

Fig. 1 depicts the skill and reliability of the raw GEPS and climatology-based temperature forecasts for several lead time projections (24 h, 48 h, 72 h, 96 h, 192 h, 288 h and 384 h), averaged over the eight stations in Quebec province (see Table 1). Here climatology is defined by a moving window of ±5 days around the valid dates over the previous 30 years (1978–2007). Skill decreases with increasing lead time. Beyond a lead time of 192 h, climatology-based forecasts are more skilful than the GEPS ones. Note that these results hide the particular behaviour of the Quebec City International Airport station, for which the GEPS is less skilful than climatology at all lead times. The decomposition into reliability and potential CRPS shows that raw ensemble forecasts are not reliable, with Rel values from 0.6 to 1.4 °C. The results of the reliability assessment, which are shown in Fig. 1, corroborate conclusions in the literature about the underdispersion of EPSs and indicate that raw ensemble forecasts are clearly less reliable than forecasts based on climatology. Reliability improves with lead time because of a corresponding increase in ensemble spread. It is worth noting that climatology is not perfectly reliable because it is based on 30 years of data whereas the verification is computed with only 2 months. These results emphasize that raw ensemble forecasts should be calibrated in order to be reliable and skilful.
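As a computational aside, the following minimal sketch (with a made-up ensemble and observation) shows how equation (9) can be evaluated for a raw S-member ensemble, whose predictive CDF is piecewise constant. The closed form used is the standard identity $\mathrm{CRPS} = \mathrm{E}|X - y| - \frac{1}{2}\mathrm{E}|X - X'|$ for an empirical CDF, checked here against brute-force integration of (9); neither function is code from the paper.

```python
import numpy as np

def crps_ensemble(x, y):
    """Exact CRPS (9) for the empirical (piecewise-constant) CDF of an ensemble,
    via the identity CRPS = E|X - y| - 0.5 E|X - X'|, X uniform on the members."""
    x = np.sort(np.asarray(x, dtype=float))
    S = x.size
    term1 = np.mean(np.abs(x - y))
    term2 = 2.0 * np.sum(x * (2.0 * np.arange(1, S + 1) - S - 1)) / S**2
    return term1 - 0.5 * term2

def crps_by_integration(x, y, lo=-50.0, hi=50.0, n=200_000):
    """Brute-force check of equation (9) on a fine grid (validation only)."""
    u = np.linspace(lo, hi, n)
    F = np.searchsorted(np.sort(x), u, side="right") / len(x)  # empirical CDF
    H = (u >= y).astype(float)                                 # Heaviside step
    return np.sum((F - H) ** 2) * (u[1] - u[0])

ens = np.array([16.2, 17.0, 17.4, 18.1, 18.3, 19.0])  # made-up 6-member forecast
obs = 18.8
print(crps_ensemble(ens, obs), crps_by_integration(ens, obs))  # the two should agree
```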
Fig. 1. CRPS and its decomposition into reliability (Rel) and potential CRPS (Pot) of raw ensemble forecasts (GEPS) and forecasts based on climatology (Clim), averaged over eight stations in Quebec province and over the summer season (July and August 2008); the horizontal axis shows the forecast lead time (h) and a base 10 logarithmic scale is used for the vertical axis; full curves show the GEPS and broken curves the climatology, for each of CRPS, Rel and Pot.
4. Assessing ensemble post-processing methods for Quebec surface temperature

4.1. Specification of prior and linear model parameters
4.1.1. Specification of the prior parameters μ and σ²
To assess the effect of the prior definition, two configurations are considered in this study to specify the prior distribution $p(y_t)$.

(a) isc: the informative seasonal climatology is specified by a moving window of ±5 days around a valid date over the previous 30 years, to avoid any seasonal effect. The specification is thus performed by using a sample of size 30 × 11 = 330. As there is no seasonality effect or trend, the climatological series is thus assumed to be stationary (not shown). The prior distribution is the normal distribution defined by the empirical mean and standard deviation of the climatological sample. Note that the standard deviation is often close to 3 °C.
(b) nic: the non-informative climatology corresponds to all observations in the archive, regardless of the season. This non-informative prior distribution is a normal distribution with mean 5.6 °C and standard deviation 12.2 °C.

A comparison of the variances of these prior distributions shows the effect of selecting observations within the season through a moving window; a sketch of how both priors can be constructed is given below. A preliminary analysis showed that it is preferable to use the informative seasonal climatology isc rather than the non-informative climatology nic as the prior.
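A minimal sketch of how the isc and nic priors might be constructed from a daily observation archive; the helper names and the pandas series `obs` (daily temperatures indexed by date) are hypothetical.

```python
import numpy as np
import pandas as pd

def isc_prior(obs, valid_date, window=5, years=30):
    """Informative seasonal climatology: fit a normal to the observations within
    +/- `window` days of the valid date's day of year, over the last `years` years."""
    valid = pd.Timestamp(valid_date)
    doy = obs.index.dayofyear
    # circular day-of-year distance handles the year boundary
    dist = np.minimum(np.abs(doy - valid.dayofyear), 365 - np.abs(doy - valid.dayofyear))
    recent = obs.index.year >= valid.year - years
    sample = obs[(dist <= window) & recent]      # about 30 x 11 = 330 values
    return sample.mean(), sample.var(ddof=1)     # prior mu, sigma^2

def nic_prior(obs):
    """Non-informative climatology: all observations, regardless of the season."""
    return obs.mean(), obs.var(ddof=1)
```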
Forecasts calibrated by a Bayesian post-processor using the non-informative prior are less skilful than those using the informative prior. In particular, the skill of raw ensemble forecasts is higher than that of forecasts calibrated with BPO–nic for all lead times, and with BPEM–nic for longer-range forecasts. In addition, the analysis of average CRPS values shows that the BPO is more sensitive than the BPEM to the prior specification. Calibrating the individual ensemble members instead of the ensemble mean, or specifying the linear model parameters within a Bayesian framework, limits the effect of the prior. Thus, in the remainder of the paper, only the informative prior is considered in the BPEM and BPO frameworks, to simplify the description and the discussion of the results.

4.1.2. Bayesian specification of the linear model
An exploratory analysis performed on the data set that was presented in Section 3.1 revealed that a Bayesian estimation of the linear model actually performed better than an ordinary least squares approach, especially for long lead times. Thus here the estimation is undertaken in a Bayesian framework. Informative conjugate priors are used to introduce a priori information (Gelman et al., 2004; Marin and Robert, 2007). The specification of the linear model is detailed in Appendix A. A look at the posterior expectations of the regression parameters (equation (17)) and of the residual variance (equation (18)) shows that, if the observed data do not provide enough information, then the Bayesian point estimators are close to the prior hyperparameters. Conversely, the parameters that are determined by least squares are retained if the data are genuinely informative. This feature is a positive aspect of the Bayesian procedure, as it provides an element of robustness for dealing with both short training periods and forecasting systems having low forecast skill.

As often with Bayesian analysis, building the prior distribution by choosing values for the hyperparameters is a challenge. This is very much specific to the practical problem at hand. An advantage of the data set that is used in this paper is that forecasts of the same quantity are available for different lead times $h$ (from 24 h to 384 h). It is thus possible to obtain prior information on the skill of the model for a given lead time by observing its skill for other lead times. As the skill of the forecasts is expected to decrease with lead time, the hyperparameters for a specific lead time ($h$ = 24 h, 48 h, 72 h, 96 h, 192 h, 288 h and 384 h) are determined by considering the closest, and previous if available, lead time ($h'$ = 48 h, 24 h, 48 h, 72 h, 96 h, 192 h and 288 h respectively). Ideally, the same hour of the day should be used, as numerical models of the atmosphere typically have a diurnal cycle in their bias. For each lead time $h'$ and each training length $N$, the parameters of the linear regression ($\beta_0$, $\beta_1$ and $\sigma^2_{\bar{X}|y}$) are estimated by least squares over the whole data set, leading to an estimate of the hyperparameters by fitting the prior distributions of the parameters $\beta_0$, $\beta_1$ and $\sigma^2_{\bar{X}|y}$ (see equations (12) and (13)). The precision matrix $H_0$ is estimated by the inverse of the covariance matrix of $\beta_0$ and $\beta_1$, which are previously normalized by $\sigma_{\bar{X}|y}$. The hyperparameters $\gamma_0$ are subsequently set by the following consideration: for the data set that was presented earlier in this paper, the numerical model forecasts directly the observed quantity (screen level temperature) and aims to be unbiased. Hence, it is reasonable for the prior distribution to reflect an a priori trust in the numerical model by setting $\gamma_0$ to $(0\ 1)^{\mathrm{T}}$.
4.2. Identification of the optimum training length
The Bayesian approaches designed to calibrate ensemble forecasts require a training period on which the linear model parameters are estimated. This step is performed independently for each station. Then the verification is carried out for all sites by determining the average CRPS over the verification period (the summer of 2008) and its decomposition into reliability and potential components. The training length is optimized by minimizing the average CRPS over the application period (the summer of 2008), searching over training periods between 10 and 50 days.
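To illustrate the mechanics of this optimization, the sketch below is a deliberately simplified stand-in: it refits the bias-correction regression (4) on a moving window of the N most recent forecast–observation pairs, scores each verification date with the closed-form CRPS of a normal predictive distribution (Gneiting et al., 2005) under an assumed fixed spread, and keeps the shortest near-optimal N. The fixed spread `sigma_pred` and the 1% tolerance rule are illustrative assumptions, not part of the method, which derives its spread from equations (6)-(8).

```python
import numpy as np
from scipy.stats import norm

def moving_window_crps(y, xbar, N, sigma_pred=1.5):
    """Average CRPS over verification dates when regression (4) is refitted on a
    moving window of the N most recent (observation, ensemble mean) pairs.

    sigma_pred is a fixed, assumed predictive spread (illustrative only).
    """
    scores = []
    for t in range(N, len(y)):
        A = np.column_stack([np.ones(N), y[t - N:t]])
        beta, *_ = np.linalg.lstsq(A, xbar[t - N:t], rcond=None)
        mu_t = (xbar[t] - beta[0]) / beta[1]   # invert equation (4); assumes beta[1] != 0
        z = (y[t] - mu_t) / sigma_pred
        # closed-form CRPS of a normal forecast (Gneiting et al., 2005)
        crps = sigma_pred * (z * (2 * norm.cdf(z) - 1)
                             + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))
        scores.append(crps)
    return np.mean(scores)

def best_training_length(y, xbar, lengths=range(10, 55, 5), tol=1.01):
    """Shortest N whose average CRPS is within `tol` of the minimum."""
    scores = {N: moving_window_crps(np.asarray(y), np.asarray(xbar), N)
              for N in lengths}
    best = min(scores.values())
    return min(N for N, s in scores.items() if s <= tol * best)
```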
Regardless of the Bayesian calibration method and the prior definition, this optimal training length is common to all lead times. The skill of the calibrated forecasts depends on the sample size, but there are clear advantages in using shorter training lengths. Indeed, seasonality in the predictand and in the skill of the forecast limits the applicability of methods that are based on longer training lengths. Furthermore, frequent changes to operational forecasting systems make it difficult to use forecasts from past years to calibrate statistical models. In addition, the Canadian Meteorological Centre plans to produce, once a week, a reforecast of the previous 15 years. Being able to work with a short training length would allow the method to take advantage of such a database easily, thus reducing even more the negative effects of seasonality in the predictand and in the skill of the forecast, as calibration could be performed by using observations and forecasts that are valid for the same week of the year. On the basis of these considerations, the optimal training length is selected to be as short as possible while keeping the average CRPS low. Table 1 details the optimal training length for each Bayesian framework and for each meteorological station. Training lengths are generally shorter than 25 days, excluding the BPEM and BPO using the non-informative climatology nic.
4.3. Verification of ensemble forecasts calibrated by the Bayesian processor of ensemble members, ensemble Bayesian model averaging and the Bayesian processor of output
Verification of the calibrated ensemble forecasts is performed through the average CRPS and its reliability component. The calibration of ensemble forecasts with the BPEM is detailed in the supporting information in the on-line version of the paper. The BPEM is compared with ensemble BMA assuming exchangeability of the ensemble members (Fraley et al., 2010) and with the BPO applied to the ensemble mean. The latter approaches are currently the most widely used to calibrate ensemble forecasts.

Fig. 2 depicts the average CRPS values obtained during the summer season (July and August 2008). The CRPS values of the informative climatology CLIM–isc are independent of lead time. Bayesian calibrations of the ensemble forecasts reduce the CRPS and so improve forecast skill. For lead times greater than 192 h, the skill of the calibrated forecasts is close to that of the seasonal-climatology-based forecasts (1.5 °C). The calibration methods present a similar CRPS for longer lead times (greater than 192 h). This behaviour is largely explained by the parameters of the linear model. For longer lead times the relationship between predictions and observations is less robust: the slope parameter of the linear models tends to 0. Therefore, the predictive distributions obtained by BPEM–isc and BPO–isc approximate the informative climatology, whereas the ensemble BMA predictive distribution becomes a normal distribution with mean equal to the average of the observations over the training period, and standard deviation of the same order of magnitude as the seasonal climatological standard deviation. The CRPS values of both BPEM–isc and BPO–isc converge (by design) to that of climatology. Overall, the improvement in skill with BPEM–isc is slightly greater than with the other Bayesian approaches.

Results on reliability are given in Fig. 2(b). As expected, the Bayesian calibrations also improve reliability, with a gain of about 0.5 °C. The verification results show a slight advantage for the new BPEM methodology when using seasonal climatology as prior information. BPEM–isc presents the lowest average CRPS values, and so the best skill among all the Bayesian frameworks evaluated in this paper, for shorter lead times. Ensemble forecasts calibrated by the BPEM show a reliability component of the same order of magnitude as that of forecasts calibrated by the BPO or ensemble BMA, regardless of the effect of assessing reliability over a short period (2 months). The proposed post-processing method BPEM can generate a more skilful and reliable forecast which satisfies the definition of a good probabilistic forecast (Gneiting, 2008).
Fig. 2. (a) CRPS and (b) its reliability component of the GEPS, climatology-based (CLIM–isc) and calibrated ensemble forecasts (BPEM–isc, ensBMA–isc and BPO–isc) for the summer season (July and August 2008), considering the eight stations in Quebec province, at lead times of 24–384 h; a base 10 logarithmic scale is used for the vertical axis.
The approach developed performs well and constitutes a significant improvement over the BMA and BPO methods by combining them. The comparison with ensemble BMA indicates that incorporating the climatological information via the prior distribution in the BPEM introduces a significant benefit in generating more skilful probabilistic forecasts. In addition, these results indicate that it is preferable to consider a weighted sum of normal distributions fitted on the unbiased members rather than a single normal distribution fitted on the unbiased ensemble mean. In other words, the statistical calibration applied to each ensemble member by the BPEM provides more skilful forecasts, compared with the
BPO based on the ensemble mean. Beyond a 192 h lead time, calibrated forecasts are as reliable and accurate as forecasts based on seasonal climatology.

5. Discussion and conclusion
This study presents a new methodology, called the BPEM, to calibrate ensemble forecasts. This method successfully generates reliable forecasts and slightly outperforms both the ensemble BMA and BPO approaches, as well as climatological forecasts, when using informative climatology as prior information. The comparison with ensemble BMA and the BPO reveals that the introduction of the seasonal climatological information in the predictive distribution and the application of the calibration to each ensemble member through the latent variable mainly explain this good behaviour. Parameters are determined by a Bayesian specification of the linear regression and the minimization of the CRPS on a training period. The choice of a short optimal training length is based on practical considerations, which was confirmed by an analysis of the CRPS as a function of the training sample size. Results obtained for eight stations in Quebec province are shown in this paper. The skill of the ensemble forecasts is slightly improved by the BPEM. The analysis of the reliability component highlights that the Bayesian approaches enhance ensemble reliability significantly, with a CRPS reliability component of the same order of magnitude as that of seasonal-climatology-based predictions. This calibration approach is in the process of being evaluated by Environment Canada using a more recent version of the Canadian GEPS.

Further experiments are planned to adapt this methodology to other variables, including precipitation and streamflow, to confirm its advantage over ensemble BMA and the BPO. Ensemble BMA can in fact be seen as a particular case of the BPEM corresponding to a flat prior, and hence it is expected that similar or better results can generally be obtained with the BPEM. The BPO method applied to the ensemble mean could, however, have an edge over the BPEM method, especially if the ensemble size is small, owing to the non-parametric approximation of the predictive distribution. Besides, a more parsimonious model using only the ensemble mean as a predictor (or some other summary statistic) could lead to more robust results. In such a case, it would be logical to use the BPO method instead of the BPEM. The latter could become more relevant if improvements in the underlying EPS make it possible to extract information from the individual ensemble members beyond the ensemble mean. It might even be possible to use a comparison between the BPEM and the BPO as a metric for evaluating possible improvements to the EPS, such as increasing the ensemble size, investing in a reforecasting experiment, increasing the horizontal resolution or improving the representation of the observational and model uncertainty. Improvements to the BPEM method are certainly possible, however. Ideas for future work are described in the following subsections.

5.1. Spread–skill relationship and assumptions about the distribution of the residuals η_t in the regression between forecasts and observations
The forecast–observation database that was used in this paper is considered too small to identify correctly the link between the forecast spread and the skill of the ensemble mean. Reforecast data sets can extend the forecast–observation database. Consequently, if a strong spread–skill relationship is observed, one could instead assume a weighted linear regression in which a weight equal to the inverse of the ensemble variance is set for each ensemble forecast.

5.2. Using reforecasts as the training data set
Reforecast data sets (for example see Hamill et al. (2006) and Hagedorn et al. (2008)) can
be used in addition to, or instead of, recent ensemble forecasts in the linear model linking ensemble forecasts and observations. Such a data set should compensate for seasonality issues and model changes in existing operational forecast data sets. Indeed, results obtained with the BPEM method for June 2008 (which are not shown) suggest that in transitional periods the bias between the observations and the ensemble mean is less systematic, leading to some calibrated forecasts with poor skill. The main challenge with reforecast data sets is that they are produced with a frozen model version and data assimilation system, which can have significantly less horizontal resolution and skill than operational systems. A compromise consists in performing the reforecast experiment along with the forecast, for the same Julian date or week of the year, but using fewer members and a smaller number of past years than a typical reforecast experiment. It would be interesting to compare both reforecasting techniques, but also to assess how the BPEM forecast skill improves as a function of the size of the reforecast data set, in terms of both the number of years and the number of members.

5.3. Multivariate forecasting
Traditionally, the calibration of weather forecasts has focused on forecasting a single weather element for a single lead time. For many practical applications, multivariate forecasting is required, possibly for multiple lead times, and the correlations existing between weather elements and lead times must be preserved.

5.4. Estimation of the variance of the latent variable ξ
The estimation of the variance of the latent variable is based on minimizing the CRPS with a grid search procedure. When the standard error of the regression (3) is sufficiently large, the grid interval becomes wide. Consequently, the optimum grid search is computationally intensive, and hence it could be an inefficient procedure. A numerical optimization procedure such as Newton–Raphson iteration could be considered in these cases. Skill measures other than the CRPS could also be used in the estimation process. A complete hierarchical Bayesian implementation, in which all parameters are estimated jointly, leading to a predictive distribution in which the parameter uncertainty is fully accounted for, is also desirable.

Acknowledgements
Financial support for the undertaking of this work was provided by the MPRIME project on hydrological ensemble forecasting for water resources management (formerly 'Mathematics of information technology and complex systems'), led by François Anctil (Chaire de recherche Institut Hydro-Québec en Environnement, Développement et Société en prévisions et actions hydrologiques, Université Laval) and supported by Hydro-Québec and Manitoba Hydro. Stéphane Beauregard (Meteorological Service of Canada) is thanked for providing the observation and ensemble prediction data sets. The authors thank the two reviewers, the Associate Editor and the Joint Editor for their valuable comments and suggestions, which significantly improved the final form of the paper. R. Marty was a post-doctoral researcher of the 'Chaire de recherche EDS en prévisions et actions hydrologiques'.

Appendix A: Bayesian linear regression
The linear model (4) can be expressed in the matrix form

$$x = Y\gamma + \eta$$
with

$$x = \begin{pmatrix} \bar{X}_1 \\ \vdots \\ \bar{X}_N \end{pmatrix}, \qquad Y = \begin{pmatrix} 1 & y_1 \\ \vdots & \vdots \\ 1 & y_N \end{pmatrix}, \qquad \gamma = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}. \quad (11)$$

The parameters $\gamma$ are classically estimated by least squares, leading to
$$\hat{\gamma} = (Y^{\mathrm{T}} Y)^{-1} Y^{\mathrm{T}} x.$$

The Bayesian specification supplements the linear model with additional information in the form of a prior probability distribution. Here we choose to use informative conjugate priors. The conditional likelihood can be written as

$$l(x \mid Y, \gamma, \sigma^2_{\bar{X}|y}) = (2\pi\sigma^2_{\bar{X}|y})^{-N/2} \exp\{-\tfrac{1}{2}\sigma^{-2}_{\bar{X}|y}\, (x - Y\gamma)^{\mathrm{T}} (x - Y\gamma)\},$$
where $N$ is the number of pairs $(y_t, \bar{X}_t)$ used for model calibration, which is hereafter also referred to as the training length, and where $\sigma^2_{\bar{X}|y}$ is the residual variance of the linear model, i.e. the variance of the ensemble mean $\bar{X}$ conditional on an observation $y_t$ (for known parameter values). Considering the likelihood shape, the marginal prior of $\sigma^2_{\bar{X}|y}$ is a scaled inverse $\chi^2$-distribution and the conditional prior of $\gamma$ given $\sigma^2_{\bar{X}|y}$ is Gaussian. The conjugate priors can thus be obtained within this family as

$$p(\sigma^2_{\bar{X}|y} \mid Y) = \text{Inv-}\chi^2(\sigma^2_{\bar{X}|y};\, v_0, s_0^2), \quad (12)$$

$$p(\gamma \mid Y, \sigma^2_{\bar{X}|y}) = \mathcal{N}(\gamma;\, \gamma_0,\, \sigma^2_{\bar{X}|y} H_0^{-1}), \quad (13)$$

where $\gamma_0$ is the prior specification of the parameters of the linear model, $H_0$ the precision matrix, $v_0$ the number of $\chi^2$ degrees of freedom and $s_0^2$ its scaling parameter. The posterior probability density functions can be expressed as
$$p(\sigma^2_{\bar{X}|y} \mid Y, x) = \text{Inv-}\chi^2\!\left(\sigma^2_{\bar{X}|y};\; N + v_0,\; \frac{1}{N + v_0}\left[(N - 2)s^2 + v_0 s_0^2 + (\hat{\gamma} - \gamma_0)^{\mathrm{T}} \{(Y^{\mathrm{T}} Y)^{-1} + H_0^{-1}\}^{-1} (\hat{\gamma} - \gamma_0)\right]\right), \quad (14)$$

$$p(\gamma \mid Y, x, \sigma^2_{\bar{X}|y}) = \mathcal{N}\{\gamma;\; (Y^{\mathrm{T}} Y + H_0)^{-1} (Y^{\mathrm{T}} Y \hat{\gamma} + H_0 \gamma_0),\; \sigma^2_{\bar{X}|y} (Y^{\mathrm{T}} Y + H_0)^{-1}\}, \quad (15)$$
where $s^2$ denotes the residual variance under the least squares estimators and can be computed as

$$(N - 2)s^2 = (x - Y\hat{\gamma})^{\mathrm{T}} (x - Y\hat{\gamma}).$$

The marginal posterior distribution of $\gamma$ is determined by integrating equation (15) against equation (14) over $\sigma^2_{\bar{X}|y}$:

$$p(\gamma \mid Y, x) = \int_0^\infty p(\gamma \mid Y, x, \sigma^2_{\bar{X}|y})\, p(\sigma^2_{\bar{X}|y} \mid Y, x)\, d\sigma^2_{\bar{X}|y},$$
leading to a bivariate Student distribution $T_2$:

$$p(\gamma \mid Y, x) = T_2(\gamma;\, \nu_T, \mu_T, \Sigma_T) = \frac{\Gamma\{(\nu_T + 2)/2\}}{\Gamma(\nu_T/2)\, |\Sigma_T|^{1/2}\, (\nu_T \pi)^{2/2}} \left\{1 + \frac{1}{\nu_T} (\gamma - \mu_T)^{\mathrm{T}} \Sigma_T^{-1} (\gamma - \mu_T)\right\}^{-(\nu_T + 2)/2}, \quad (16)$$

with

$$\nu_T = N + v_0, \qquad \mu_T = (Y^{\mathrm{T}} Y + H_0)^{-1} (Y^{\mathrm{T}} Y \hat{\gamma} + H_0 \gamma_0),$$

$$\Sigma_T = \frac{(N - 2)s^2 + v_0 s_0^2 + (\hat{\gamma} - \gamma_0)^{\mathrm{T}} \{(Y^{\mathrm{T}} Y)^{-1} + H_0^{-1}\}^{-1} (\hat{\gamma} - \gamma_0)}{N + v_0}\, (Y^{\mathrm{T}} Y + H_0)^{-1}.$$

Finally, the Bayesian point estimators ($\tilde{\beta}_0$, $\tilde{\beta}_1$ and $\tilde{\sigma}^2_{\bar{X}|y}$) are obtained by taking the expectations of the marginal posterior distributions:
$$\tilde{\gamma} = (\tilde{\beta}_0\ \tilde{\beta}_1)^{\mathrm{T}} = E(\gamma \mid Y, x) = (Y^{\mathrm{T}} Y + H_0)^{-1} (Y^{\mathrm{T}} Y \hat{\gamma} + H_0 \gamma_0), \quad (17)$$
$$\tilde{\sigma}^2_{\bar{X}|y} = E(\sigma^2_{\bar{X}|y} \mid Y, x) = \frac{(N - 2)s^2 + v_0 s_0^2}{N - 2 + v_0} + \frac{1}{N - 2 + v_0}\left[(\hat{\gamma} - \gamma_0)^{\mathrm{T}} \{(Y^{\mathrm{T}} Y)^{-1} + H_0^{-1}\}^{-1} (\hat{\gamma} - \gamma_0)\right], \quad (18)$$
where $\gamma_0$, $H_0$, $v_0$ and $s_0^2$ are the hyperparameters defining the prior distributions, and where $s^2$ denotes the residual variance under the least squares estimators, computed as $(N - 2)s^2 = (x - Y\hat{\gamma})^{\mathrm{T}} (x - Y\hat{\gamma})$. The Bayesian point estimators are a compromise between the a priori estimates and the least squares estimates. The posterior expectation of the regression parameters, $\tilde{\gamma}$, is a weighted average of the prior estimate $\gamma_0$ and the least squares estimate $\hat{\gamma}$. The posterior expectation of the residual variance is also a weighted average of the prior estimate $s_0^2$ and the least squares estimate $s^2$, plus a term reflecting prior–data conflict. These expressions show that $H_0$ and $v_0$ correspond to the weights that are given to the prior information.
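A minimal numerical sketch of the point estimators (17) and (18); the training pairs and hyperparameter values passed to the function are hypothetical.

```python
import numpy as np

def bayes_linreg(y, xbar, gamma0, H0, v0, s0_2):
    """Posterior point estimators (17) and (18) for the conjugate Bayesian
    linear regression of Appendix A.

    y, xbar: training pairs; gamma0 (2,), H0 (2, 2), v0, s0_2: hyperparameters.
    """
    y, xbar = np.asarray(y, float), np.asarray(xbar, float)
    N = y.size
    Y = np.column_stack([np.ones(N), y])
    YtY = Y.T @ Y
    gamma_hat = np.linalg.solve(YtY, Y.T @ xbar)   # least squares estimate
    resid = xbar - Y @ gamma_hat
    s2 = resid @ resid / (N - 2)                   # (N - 2) s^2 = residual sum of squares
    # posterior mean of the regression parameters, equation (17)
    gamma_tilde = np.linalg.solve(YtY + H0, YtY @ gamma_hat + H0 @ gamma0)
    # posterior mean of the residual variance, equation (18)
    d = gamma_hat - gamma0
    M = np.linalg.inv(np.linalg.inv(YtY) + np.linalg.inv(H0))
    sigma2_tilde = ((N - 2) * s2 + v0 * s0_2 + d @ M @ d) / (N - 2 + v0)
    return gamma_tilde, sigma2_tilde
```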
References

Blackwell, D. and MacQueen, J. B. (1973) Ferguson distributions via Polya urn schemes. Ann. Statist., 1, 353–355.
Buizza, R., Bidlot, J. R., Wedi, N., Fuentes, M., Hamrud, M., Holt, G. and Vitart, F. (2007) The new ECMWF VAREPS (Variable Resolution Ensemble Prediction System). Q. J. R. Meteorol. Soc., 133, 681–695.
Canadian Meteorological Centre (2013) GENOTS and system changes. Canadian Meteorological Centre, Dorval. (Available from collaboration.cmc.ec.gc.ca/cmc/cmoi/product_guide/table_of_contents_e.html.)
Candille, G. (2009) The multiensemble approach: the NAEFS example. Mnthly Weath. Rev., 137, 1655–1665.
Candille, G. and Talagrand, O. (2005) Evaluation of probabilistic prediction systems for a scalar variable. Q. J. R. Meteorol. Soc., 131, 2131–2150.
Charron, M., Pellerin, G., Spacek, L., Houtekamer, P. L., Gagnon, N., Mitchell, H. L. and Michelin, L. (2010) Toward random sampling of model error in the Canadian Ensemble Prediction System. Mnthly Weath. Rev., 138, 1877–1901.
Coelho, C. A. S., Pezzulli, S., Balmaseda, M., Doblas-Reyes, F. J. and Stephenson, D. B. (2004) Forecast calibration and combination: a simple Bayesian approach for ENSO. J. Clim., 17, 1504–1516.
Di Narzo, A. F. and Cocchi, D. (2010) A Bayesian hierarchical approach to ensemble weather forecasting. Appl. Statist., 59, 405–422.
Fortin, V., Bernier, J. and Bobée, B. (1997) Simulation, Bayes, and bootstrap in statistical hydrology. Wat. Resour. Res., 33, 439–448.
Fortin, V., Favre, A. C. and Saïd, M. (2006) Probabilistic forecasting from ensemble prediction systems: improving upon the best-member method by using a different weight and dressing kernel for each member. Q. J. R. Meteorol. Soc., 132, 1349–1369.
Fraley, C., Raftery, A. E. and Gneiting, T. (2010) Calibrating multimodel forecast ensembles with exchangeable and missing members using Bayesian Model Averaging. Mnthly Weath. Rev., 138, 190–202.
Fraley, C., Raftery, A. E., Gneiting, T., Sloughter, M. and Berrocal, V. (2011) Probabilistic weather forecasting in R. R J., 3, 55–63.
Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (2004) Bayesian Data Analysis, 2nd edn. Boca Raton: Chapman and Hall–CRC.
Gneiting, T. (2008) Probabilistic forecasting. J. R. Statist. Soc. A, 171, 319–321.
Gneiting, T., Balabdaoui, F. and Raftery, A. E. (2007) Probabilistic forecasts, calibration and sharpness. J. R. Statist. Soc. B, 69, 243–268.
Gneiting, T. and Raftery, A. E. (2007) Strictly proper scoring rules, prediction, and estimation. J. Am. Statist. Ass., 102, 359–378.
Gneiting, T., Raftery, A. E., Westveld, A. H. and Goldman, T. (2005) Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Mnthly Weath. Rev., 133, 1098–1118.
Hagedorn, R., Hamill, T. M. and Whitaker, J. S. (2008) Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts, Part I: Two-meter temperatures. Mnthly Weath. Rev., 136, 2608–2619.
Hamill, T. M. (2007) Calibrated surface temperature forecasts from the Canadian ensemble prediction system using Bayesian Model Averaging—Comment. Mnthly Weath. Rev., 135, 4226–4230.
Hamill, T. M., Whitaker, J. S. and Mullen, S. L. (2006) Reforecasts—an important dataset for improving weather predictions. Bull. Am. Meteorol. Soc., 87, 33–46.
Hersbach, H. (2000) Decomposition of the Continuous Ranked Probability Score for ensemble prediction systems. Weath. Forecast., 15, 559–570.
Houtekamer, P. L., Lefaivre, L., Derome, J., Ritchie, H. and Mitchell, H. L. (1996) A system simulation approach to ensemble prediction. Mnthly Weath. Rev., 124, 1225–1242.
Houtekamer, P. L., Mitchell, H. L. and Deng, X. X. (2009) Model error representation in an operational ensemble Kalman filter. Mnthly Weath. Rev., 137, 2126–2143.
Johnson, C. and Bowler, N. (2009) On the reliability and calibration of ensemble forecasts. Mnthly Weath. Rev., 137, 1717–1720.
Jolliffe, I. and Stephenson, D. (2003) Forecast Verification: a Practitioner's Guide in Atmospheric Science. Chichester: Wiley.
Krzysztofowicz, R. (2004) Bayesian Processor of Output: a new technique for probabilistic weather forecasting. In Proc. 17th Conf. Probability and Statistics in the Atmospheric Sciences, vol. 4.2. Seattle: American Meteorological Society.
Krzysztofowicz, R. (2008) Bayesian Processor of Ensemble: concept and development. In Proc. 19th Conf. Probability and Statistics in the Atmospheric Sciences, vol. 4.5. Seattle: American Meteorological Society.
Krzysztofowicz, R. and Evans, W. B. (2008) Probabilistic forecasts from the National Digital Forecast Database. Weath. Forecast., 23, 270–289.
Leutbecher, M. and Palmer, T. N. (2008) Ensemble forecasting. J. Computnl Phys., 227, 3515–3539.
Marin, J.-M. and Robert, C.-P. (2007) Bayesian Core: a Practical Approach to Computational Bayesian Statistics. New York: Springer.
Matheson, J. E. and Winkler, R. L. (1976) Scoring rules for continuous probability distributions. Mangmnt Sci., 22, 1087–1096.
Möller, A., Lenkoski, A. and Thorarinsdottir, T. L. (2012) Multivariate probabilistic forecasting using ensemble Bayesian model averaging and copulas. Q. J. R. Meteorol. Soc., 139, 982–991.
Molteni, F., Buizza, R., Palmer, T. N. and Petroliagis, T. (1996) The ECMWF Ensemble Prediction System: methodology and validation. Q. J. R. Meteorol. Soc., 122, 73–119.
Raftery, A. E., Gneiting, T., Balabdaoui, F. and Polakowski, M. (2005) Using Bayesian Model Averaging to calibrate forecast ensembles. Mnthly Weath. Rev., 133, 1155–1174.
R Development Core Team (2006) R: a Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing.
Roulston, M. S. and Smith, L. A. (2003) Combining dynamical and statistical ensembles. Tellus A, 55, 16–30.
Stephenson, D. B., Coelho, C. A. S., Doblas-Reyes, F. J. and Balmaseda, M. (2005) Forecast assimilation: a unified framework for the combination of multi-model weather and climate predictions. Tellus A, 57, 253–264.
Talagrand, O., Vautard, R. and Strauss, B. (1999) Evaluation of probabilistic prediction systems. In Proc. European Centre for Medium-Range Weather Forecasts Wrkshp Predictability, vol. 4.2, pp. 1–25. Reading: European Centre for Medium-Range Weather Forecasts.
Toth, Z. and Kalnay, E. (1997) Ensemble forecasting at NCEP and the breeding method. Mnthly Weath. Rev., 125, 3297–3319.
Wang, X. G. and Bishop, C. H. (2005) Improvement of ensemble reliability with a new dressing kernel. Q. J. R. Meteorol. Soc., 131, 965–986.
Whitaker, J. S. and Loughe, A. F. (1998) The relationship between ensemble spread and ensemble mean skill. Mnthly Weath. Rev., 126, 3292–3302.
Wilks, D. S. (1995) Statistical Methods in the Atmospheric Sciences: an Introduction. San Diego: Academic Press.
Wilson, L. J., Beauregard, S., Raftery, A. E. and Verret, R. (2007a) Calibrated surface temperature forecasts from the Canadian ensemble prediction system using Bayesian Model Averaging. Mnthly Weath. Rev., 135, 1364–1385.
Wilson, L. J., Beauregard, S., Raftery, A. E. and Verret, R. (2007b) Calibrated surface temperature forecasts from the Canadian ensemble prediction system using Bayesian Model Averaging—Reply. Mnthly Weath. Rev., 135, 4231–4236.
Supporting information Additional ‘supporting information’ may be found in the on-line version of this article: ‘Combining the Bayesian processor of output with Bayesian model averaging for reliable ensemble forecasting—How to calibrate ensemble forecast with BPEM’.