JULY 2005

MONTHLY WEATHER REVIEW, VOLUME 133

Test of a Poor Man's Ensemble Prediction System for Short-Range Probability Forecasting

A. ARRIBAS, K. B. ROBERTSON, AND K. R. MYLNE
Met Office, Exeter, United Kingdom

(Manuscript received 19 March 2004, in final form 29 September 2004)

ABSTRACT

Current operational ensemble prediction systems (EPSs) are designed specifically for medium-range forecasting, but there is also considerable interest in predictability in the short range, particularly for potential severe-weather developments. A possible option is to use a poor man's ensemble prediction system (PEPS) comprising output from different numerical weather prediction (NWP) centers. By making use of a range of different models and independent analyses, a PEPS provides essentially a random sampling of both the initial condition and model evolution errors. In this paper the authors investigate the ability of a PEPS using up to 14 models from nine operational NWP centers. The ensemble forecasts are verified over a 101-day period for five variables: mean sea level pressure, 500-hPa geopotential height, temperature at 850 hPa, 2-m temperature, and 10-m wind speed. Results are compared with the operational ECMWF EPS, using the ECMWF analysis as the verifying "truth." It is shown that, despite its smaller size, PEPS is an efficient way of producing ensemble forecasts and can provide competitive performance in the short range. The best relative performance is found to come from hybrid configurations combining output from a small subset of the ECMWF EPS with other different NWP models.
1. Introduction

A poor man's ensemble prediction system (PEPS) is a set of different numerical weather prediction (NWP) forecasts produced by different operational centers using their own models, schemes, and analyses. Its name derives from the fact that all the forecasts are already produced by different sources and only need to be pulled together to be used as an ensemble; it is, therefore, essentially free. Current operational global ensemble prediction systems (EPSs), such as those run at the European Centre for Medium-Range Weather Forecasts (ECMWF; Molteni et al. 1996), the National Centers for Environmental Prediction (NCEP; Toth and Kalnay 1993), or the Canadian Meteorological Centre (CMC; Houtekamer et al. 1996), are designed for medium-range weather forecasting. In this paper we investigate whether a PEPS may provide a viable alternative system more suited to short-range use.
Corresponding author address: Alberto Arribas, Met Office, FitzRoy Road, Exeter EX1 3PB, United Kingdom. E-mail: [email protected]

© 2005 American Meteorological Society
Many forecasting centers now have access to medium-range ensemble products from one of the major centers above, but few have access to an EPS more appropriate for short-range prediction (less than about 72 h). For example, the ECMWF EPS is extensively used by 25 national meteorological services across Europe, but experience has shown that it generally has too little spread early in the forecast period. This affects its performance at short range, as illustrated by the common observation of Met Office forecasters that ECMWF's ensemble members follow the solution of the high-resolution deterministic model (or of the EPS control) too closely in the short range (Young and Carroll 2002). In other words, the ECMWF EPS does not spread sufficiently to incorporate the full range of uncertainty. Forecasters also observe subjectively that model forecasts from other NWP centers are synoptically more different from the ECMWF deterministic model than are the EPS members. Legg and Mylne (2004) also found that using the ECMWF EPS resulted in much better forecast probabilities of severe weather 4 days ahead than 2 days ahead. Early evidence that a PEPS approach may give good results in the short range was provided by Kalnay and Ham (1989), Atger (1999), and Ziehmann (2000). In
the latter study, a small ensemble consisting of four different models was tested and shown to outperform the ECMWF EPS when allowance was made for its much smaller ensemble size, and it was thus suggested that the PEPS methodology is a particularly efficient way of providing ensemble forecasts. More recently, this approach has been supported by Ebert (2001). A slightly different approach was followed by Krishnamurti et al. (1999), who took data from a PEPS-like multimodel ensemble and applied a multilinear regression to obtain the optimal deterministic forecast at every grid point. Multimodel ensemble prediction is now becoming the standard approach in longer-range prediction, as, for example, in the European seasonal forecasting project Development of a European Multimodel Ensemble System for Seasonal to Interannual Prediction (DEMETER; see http://www.ecmwf.int/research/demeter). Also, research in short-range ensemble prediction (Hou et al. 2001; Wandishin et al. 2001) has consistently shown benefits from a multimodel approach. To provide probabilistic forecasts, an ensemble attempts to sample the probability density function (pdf) of forecast states, and for these forecasts to be representative, this sampling is required to have a covariance similar to the forecast error covariance, including the short-range forecast "errors of the day." The ensemble must therefore attempt to sample all the dominant sources of uncertainty for the forecast range of interest. For the medium range, the uncertainty is dominated by error growth due to baroclinic instability on relatively large, synoptic length scales. The singular vector (SV) perturbations used by the ECMWF EPS (Buizza and Palmer 1995) are designed to maximize the linear growth of synoptic-scale baroclinic structures over the early part of the forecast (currently the first 48 h).
This approach is well suited to, and successful for, medium-range prediction, but the sampling of the pdf within this optimization period is clearly not random. For short-range prediction, while baroclinic instability is important, other sources of error occurring on shorter length scales are also very important. Coutinho et al. (2004) show that increasing the horizontal resolution of the SV calculation from the current operational resolution of T42 to T63 moves the peak of the initial-time total energy spectrum to larger wavenumbers as well as better resolving the entire spectrum, indicating that the T42 calculation is insufficient to capture the important shortwave processes. Also, singular vectors are calculated using a linearized model with simplified physics, which imposes a further limit on the possible solutions. In particular, the current operational EPS SVs neglect moist processes in the extratropics, which are likely to
be important for short-range instabilities (Coutinho et al. 2004). By contrast, the PEPS approach to sampling initial condition errors is essentially a random sampling at the analysis time, resulting from the different analyses of the NWP centers and including variations on all scales resolved by the data assimilation systems and the full physics of the models. To quantify the full range of forecast uncertainties, an EPS also needs to take account of errors due to model imperfections as well as initial condition errors. ECMWF introduced their "stochastic physics" scheme in October 1998 to account for model error by multiplying the total parameterized tendencies by a random number (Buizza et al. 1999, 2000). The approach of the CMC EPS and the NCEP Short-Range Ensemble Forecasting system (Tracton and Du 1998) is more complex and, in some ways, similar to the PEPS philosophy: both systems include different physical parameterizations and different models to represent the model uncertainties. In the case of ECMWF's ensemble, although the "stochastic physics" has been demonstrated to have a positive impact, particularly for precipitation fields, the overall effect on ensemble performance in the extratropics is small. It has also been shown (Mylne et al. 2001) that the addition of a second model and analysis to the ECMWF EPS further improved the overall medium-range ensemble performance. There are, then, a number of reasons why the PEPS approach, by including a large number of different models and independent analyses, may provide improved ensemble performance in the short range compared to the ECMWF EPS. On this basis, the Met Office started a project in 2000 to test the skill of a PEPS for short-range probability forecasting, up to 120 h ahead, using a larger ensemble than in previous experiments.
Initial results showed that, using an ensemble size of typically eight members with low-resolution data (on a 5° latitude × 0.5° longitude grid) for mean sea level pressure (PMSL) and 500-hPa geopotential height only, the PEPS outperformed the ECMWF's EPS according to Brier skill score measures. Continuing this work, a more detailed study was initiated for which NWP centers around the world were invited to contribute output from their models in a common format. Model fields were collected at the higher grid resolution of 1.25°, and fields included surface parameters (mean sea level pressure, 2-m temperature, and 10-m wind speed) that are of more relevance to short-range forecasting. It is the results of these more detailed experiments that are reported in this paper. The next section presents the data and methodology used. Section 3 describes the analysis and verification strategy for most variables. Section 4 investigates the
TABLE 1. NMS data available 1 Sep 2002–10 Dec 2002. All models have data times of 0000 and 1200 UTC; the forecast times as supplied are listed per data time.

Forecast (abbreviation)                                       0000 UTC                         1200 UTC
Met Office 432 × 325 Unified Model (UM) (MetO)                T + 12 to T + 120 (12 hourly)    T + 12 to T + 120 (12 hourly)
ECMWF T511 (ECMWF)                                            Not available                    T + 12 to T + 132 (12 hourly)
ECMWF EPS T255 control forecast (EPS-CF)                      T + 12 to T + 132 (12 hourly)    T + 12 to T + 132 (12 hourly)
ECMWF EPS T255 perturbed forecast member nn (EPS-nn)          T + 12 to T + 132 (12 hourly)    T + 12 to T + 132 (12 hourly)
Met Office 288 × 217 UM, Met Office analysis (MetO-LR-UK)     T + 12 to T + 132 (12 hourly)    T + 12 to T + 132 (12 hourly)
Met Office 288 × 217 UM, ECMWF analysis (MetO-LR-EC)          T + 12 to T + 132 (12 hourly)    T + 12 to T + 132 (12 hourly)
Bureau of Meteorology (BOM)                                   T + 12 to T + 120 (12 hourly)    T + 12 to T + 120 (12 hourly)
Deutscher Wetterdienst (DWD)                                  T + 12 to T + 120 (12 hourly)    T + 12 to T + 120 (12 hourly)
Météo-France (M-F)                                            T + 12 to T + 96 (12 hourly)     T + 12 to T + 72 (12 hourly)
Canadian Meteorological Centre global (CMC)                   T + 12 to T + 120 (12 hourly)    T + 12 to T + 120 (12 hourly)
Canadian Meteorological Centre ensemble control (CMC-CF)      T + 24 to T + 120 (24 hourly)    Not available
Japan Meteorological Agency (JMA)                             T + 12 to T + 84 (12 hourly)     T + 12 to T + 120 (12 hourly)
Korean Meteorological Administration (KMA)                    T + 12 to T + 84 (12 hourly)     T + 12 to T + 120 (12 hourly)
National Centers for Environmental Prediction Medium-Range
  Forecast (MRF) and aviation runs (NCEP)                     T + 12 to T + 120 (12 hourly)    T + 12 to T + 120 (12 hourly)
National Centers for Environmental Prediction ensemble
  control (NCEP-CF)                                           T + 12 to T + 120 (12 hourly)    Not available
Russian Hydrometeorological Center* (RHMC)                    T + 12 to T + 120 (12 hourly)    T + 12 to T + 84 (12 hourly)
* Unfortunately, because of problems with the data format and time constraints we were not able to use this model.
dependence of the results on the configuration type and number of members. In section 5 we analyze the relative economic value of PEPS and its suitability for detecting extreme events. Discussion and conclusions are presented in section 6.
2. Forecast data and methodology

The poor man's configurations investigated in this paper have been based on data collected from a variety of different national meteorological services (NMSs) around the globe. In total, 10 NMSs were contacted, with the result that data from 14 forecast models were provided in a common grid format. Only 9 of the 14 forecast models used were completely independent, the rest being modified versions of the originals. All models and data are shown in Table 1, which also defines the abbreviations referenced in the text. Fields were retrieved via FTP, in most cases in as near real time as possible. A total of five field types were assessed, including some surface parameters that are of greatest relevance when verifying short-range forecasts: PMSL, 500-hPa geopotential height (H500), 850-hPa temperature (T850), 10-m wind speed (WS10), and 2-m temperature (TMP2). Forecasts were retrieved from T + 12 to T + 120 at 1.25° × 1.25° horizontal resolution and were verified over four regions: Northern Hemisphere (NHEM; 0°–70°N, 0°–360°E), Southern Hemisphere (SHEM; 0°–70°S, 0°–360°E), North America (NAMR; 30°–70°N, 130°–60°W), and Europe (EURO; 35°–60°N, 10°W–15°E). When fields were not at the required horizontal resolution they were bilinearly interpolated to it. Also, when any field was missing, a 12-h-older field was used instead. The analysis was carried out over a 101-day period, between 1 September and 10 December 2002. Originally, three configurations were analyzed for all variables, regions, and thresholds:

(i) P9: A nine-member configuration including all independent models (MetO, M-F, NCEP, CMC, DWD, ECMWF, BOM, JMA, KMA).
(ii) P6: A six-member configuration. These six models were chosen because they are the easiest and fastest to retrieve (MetO, M-F, NCEP, CMC, DWD, ECMWF).
(iii) P6 + 6: A 12-member "hybrid" configuration, with 6 members coming from different models (MetO, M-F, NCEP, DWD, ECMWF, and EPS-CF) and 6 members coming from the ECMWF's EPS (the first three pairs of members). This configuration allows us to assess the benefit of incorporating a contribution from SV perturbations in the ensemble.

Although we recognize that it would have been preferable to compare ensembles with the same number of members, practical issues (mainly limited computing resources) made this impossible. However, in section 4 we present results from some additional configurations designed specifically to study the dependence of the results on the number of ensemble members and the type of configuration. The main focus of this study is to compare the skill of the PEPS with the ECMWF EPS in the short-range period (24–72 h). To this end, verification was completed using the ECMWF analysis as the verifying "truth," and therefore any advantage was given to the ECMWF EPS when comparing the two systems. Although the use of observations to verify and compare both systems might have been better (avoiding potential problems such as extrapolation beneath the model topography at certain locations), it would have been subject to different problems. In particular, each of the models would have suffered differently from representativity errors in the interpolation to site observations. While our interest was in ensemble performance in the short range, the PEPS does use global models with relatively low resolution, and it is therefore appropriate to use analyses (giving any advantage to the "rival" system, the ECMWF EPS) while focusing primarily on near-surface weather parameters, such as TMP2 and WS10.
The ECMWF EPS was chosen because it is the operational EPS currently used by 25 NMSs across Europe. Also, although not specifically designed for the short range, it is arguably the best performing global ensemble (Buizza et al. 2005). The current operational ECMWF EPS is a 51-member ensemble with a resolution of T255L40 (Buizza et al. 2003). The control forecast is started from the unperturbed analysis (interpolated from the T511L60 analysis), and the 50 additional forecasts are started from perturbed analysis fields generated by adding to, or subtracting from, the unperturbed analysis a combination of the dynamically fastest growing perturbations, calculated using SVs (Molteni et al. 1996) computed at T42L40.

To remove each model's biases (including those of the ECMWF EPS model), forecasts were expressed as anomalies with respect to the model's mean over the previous 60 days. By doing this, we can be sure that any benefit from the PEPS is due to the ensemble system itself and not to model error cancellation. Using only 60 days would also allow us, were this system to become operational, to adapt quite quickly to the frequent changes to which any NWP model is subject. To compare the EPS and the various PEPS configurations, a number of different measures are used. These include the Brier skill score and its decomposition into reliability and resolution components (always taking the ECMWF's EPS as the reference forecast), reliability and sharpness diagrams, relative operating characteristics, relative economic value, and rank histograms. Nothing will be said about individual models' performances; the purpose of the work is to test the PEPS as an ensemble, not as a model intercomparison. The experiments discussed in this paper give no time advantage to PEPS, although for operational purposes it may be of interest to note that results could be made available to users several hours earlier than those from the current operational implementation of the EPS. This will be commented on in section 3. Events to be forecast were defined in terms of selected weather parameters being smaller or greater than certain relative thresholds. Instead of absolute thresholds (e.g., PMSL < 980 hPa), as commonly used in previous experiments (Atger 1999; Ziehmann 2000), we chose to compare anomalies referenced to climatological standard deviations based on the 15-yr ECMWF Reanalysis (ERA-15; see http://www.ecmwf.int/research/era/ERA-15/). The reason is that absolute thresholds can be rare in some areas of the world but common in others.
The climatological standard deviations have been calculated from ERA-15 on a monthly (linearly interpolated to each day) and a gridpoint-by-gridpoint basis. All weather parameters were examined for nine different threshold values (from −4 to +4 standard deviations), except 10-m wind speed, which was only examined for positive thresholds (0 to +4 standard deviations).
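The bias removal and relative-threshold definitions above can be sketched as follows. This is a minimal illustration, not the operational implementation: the grid size, the constant climatological standard deviation, and all variable names are invented for the example.

```python
import numpy as np

# Illustrative data: 60 previous daily model fields and today's forecast on a
# small 4 x 4 grid (shapes and values are invented for this sketch).
rng = np.random.default_rng(0)
past_60_days = rng.normal(1010.0, 8.0, size=(60, 4, 4))  # model's previous 60 daily fields
todays_forecast = rng.normal(1012.0, 8.0, size=(4, 4))   # today's raw forecast
climo_std = np.full((4, 4), 8.0)                         # climatological std dev per grid point

# Bias removal: express the forecast as an anomaly with respect to the model's
# own mean over the previous 60 days, so systematic biases largely cancel.
model_mean = past_60_days.mean(axis=0)
anomaly = todays_forecast - model_mean

# Relative threshold: the event "anomaly above +1 climatological standard
# deviation" is evaluated grid point by grid point, so the event has comparable
# rarity everywhere rather than being common in some regions and rare in others.
event = anomaly > 1.0 * climo_std
```

In an operational setting the 60-day mean would be maintained separately for each model, lead time, and grid point.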
3. Analysis of results

Analysis of results from the PEPS experiments involves processing very large quantities of data, and it has not been possible to analyze more than 101 days. Verification results are presented here for autumn 2002
FIG. 1. P6 + 6 (solid line), P9 (dash–dot line), and P6 (dashed line) mean (over all variables considered) Brier skill score values at T + 24 relative to the ECMWF's EPS for all considered thresholds (−4 to +4) in the following regions: (a) NHEM, (b) SHEM, (c) EURO, and (d) NAMR.
(from 1 September 2002 to 10 December 2002 inclusive), five variables (PMSL, H500, T850, WS10, and TMP2), and four different regions (NHEM, SHEM, NAMR, EURO). Following convention (e.g., Wilks 1995), comparison of the skill of different forecasts is normally done in terms of a skill score:

    S = \frac{X_r - X_f}{X_r - X_p},    (1)

where X_f is the value of score X for the forecast, X_r for the reference forecast, and X_p for a perfect deterministic forecast. A skill score has a maximum value of 1 (or 100%) for a perfect forecast (X_f = X_p) and a value of zero for performance equal to that of the reference (X_f = X_r); S has no lower limit, with negative values representing poorer skill than the reference. The most commonly used verification diagnostic for probabilistic forecasts is the Brier score (BS), originally introduced by Brier (1950) and described in its modified standard form by Wilks (1995) as

    BS = \frac{1}{n} \sum_{i=1}^{n} (f_i - o_i)^2.    (2)
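A minimal sketch of the skill score (1), the Brier score (2), and ensemble probabilities obtained by democratic counting (the fraction of members forecasting the event); the member and observation values are invented for the example.

```python
import numpy as np

def ensemble_probability(members, threshold):
    """Democratic counting: the probability is the fraction of ensemble
    members forecasting the event (here, value > threshold)."""
    return (np.asarray(members, float) > threshold).mean(axis=0)

def brier_score(f, o):
    """Eq. (2): mean square error of probability forecasts f against
    binary outcomes o (1 if the event occurred, 0 if not)."""
    f, o = np.asarray(f, float), np.asarray(o, float)
    return np.mean((f - o) ** 2)

def skill_score(x_f, x_r, x_p=0.0):
    """Eq. (1): skill of score x_f relative to a reference score x_r,
    where x_p is the score of a perfect forecast (0 for the BS)."""
    return (x_r - x_f) / (x_r - x_p)

# Toy data, invented for illustration: 6 members, 5 cases, event "value > 0".
members = np.array([[ 0.3, -0.2, 1.1,  0.4, -0.9],
                    [ 0.1, -0.5, 0.8, -0.1, -0.4],
                    [-0.2, -0.1, 0.9,  0.2, -0.6],
                    [ 0.4, -0.3, 1.3,  0.5, -1.0],
                    [ 0.2, -0.4, 0.7, -0.2, -0.3],
                    [-0.1, -0.6, 1.0,  0.3, -0.7]])
obs = np.array([1, 0, 1, 0, 0])

f = ensemble_probability(members, 0.0)   # [2/3, 0, 1, 2/3, 0]
bs = brier_score(f, obs)                 # 1/9
bss = skill_score(bs, x_r=brier_score(np.full(5, 0.5), obs))  # 5/9, vs a constant-50% reference
```

In the paper the reference forecast is the EPS itself, so a positive BSS means the PEPS configuration beats the EPS.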
The BS is essentially the mean square error for probability forecasts of an event, where f_i and o_i are the forecast and observed probabilities, respectively; o_i takes the value 1 when the event occurs and 0 when it does not. Brier skill scores (BSSs) are calculated by using the BS as X in Eq. (1). Ensemble forecast probabilities were calculated by simple democratic counting, the probability being equal to the proportion of forecasts in the ensemble predicting the event, and probabilistic skill scores were calculated for forecasts of each of the five weather parameters being below (or above) the nine thresholds defined with respect to the climatological standard deviation (from −4 to +4). In all cases the ECMWF's EPS (EPS hereafter) has been taken as the reference forecasting system, and verification was performed using the ECMWF analysis as the verifying "truth."

Figures 1–3 show the BSS of three PEPS configurations (P6, P9, and P6 + 6) relative to the EPS in all regions and thresholds considered at T + 24, T + 48, and T + 72, respectively. Positive/negative values indicate that PEPS is performing better/worse than the EPS. The values shown are the average of the individual BSS for each of the five variables analyzed (PMSL, H500, T850, WS10, TMP2). For reference, individual BSS values for each variable in P6 and P6 + 6 at T + 24, T + 48, and T + 72 are shown in Fig. 4.

FIG. 2. Same as Fig. 1, but at T + 48.

In Figs. 1–3 it can be seen that, in general terms, P6 + 6 (solid line) is the most skillful of the three configurations, with the best results found in the NHEM and EURO regions. This hybrid configuration outperforms the EPS at most verification times and thresholds (including the most extreme ones) during the first 72 h. Despite their different sizes and model combinations, the two pure PEPS configurations (P9, dash–dot line, and P6, dashed line) perform similarly, with slightly positive BSS at T + 48 and T + 72 for all regions except SHEM, and a more mixed signal at T + 24 (positive BSS for EURO and slightly negative BSS for the other regions). It is also interesting to note that, especially at T + 24, P6 usually performs marginally better than P9. Some of these features (the better P6 and P9 performance at T + 48 and T + 72 than at T + 24, and the slightly better performance of P6 than P9) may be partially explained by the fact that we are verifying against the ECMWF analysis. By using this analysis we are penalizing those configurations where the contribution from ECMWF is smaller, especially at the shortest verification time, T + 24. Thus, P9, where only
1 member in 9 comes from ECMWF, is disadvantaged compared to P6 (1 in 6) and P6 + 6 (8 in 12). It is also plausible that, as indicated by Ebert (2001), there is an optimal number of models, and the addition of less skillful ones could degrade the performance of the system. A related result is the smaller P6 and P9 BSS in the SHEM region, which suggests that most of the models perform better relative to the EPS in the NHEM than in the SHEM. This is probably due to the excellent capability of ECMWF to assimilate satellite data, which gives its model, and the EPS, a significant advantage over many other models in the large regions of the Southern Hemisphere that are poorly covered by conventional nonsatellite observations. The configuration P6 + 6, which has the largest contribution from ECMWF, exhibits a much smaller change in skill between the two hemispheres than P6 and P9. The better P6 + 6 performance in the first 48 h, although partially explained by the use of the ECMWF analysis and its larger size, may also imply that the inclusion of some SV perturbations in the ensemble is beneficial. A further analysis of the dependence of the results on the configuration type and number of members is presented in section 4. In any case, in spite of the advantage given to the EPS by using its
FIG. 3. Same as Fig. 1, but at T + 72.
analysis as the verifying truth, all PEPSs are shown to be highly competitive, even considering their much smaller number of members (6, 9, or 12) compared with the full EPS (51). The fact that the nonhybrid PEPS (P6 and P9) could be issued earlier (by as much as 12 h) than the EPS or the hybrid PEPS would further improve their relative performance, as was seen in preliminary experiments. In those experiments (not shown), PEPS configurations were given a 12-h advantage in recognition of this earlier availability, and much larger positive BSS (up to 0.3) were recorded. However, this advantage was not given in the current work, as it was impossible to separate the effect of the shorter lead time from the performance of the PEPS method. Therefore, the results shown here may be taken as a lower bound on what could be offered by an operational PEPS relative to the EPS. Let us now consider the different components of the BSS. Based on the decomposition of the Brier score into three terms proposed by Murphy (1973) (reliability: the ability of the system to forecast accurate probabilities; resolution: the ability of the forecast to separate the different categories, whatever the forecast probabilities; and uncertainty: the variance of the observations, independent of the forecast system), Atger (1999) decomposed the BSS into two positively oriented terms indicating the skill due to reliability and to resolution:
    BSS = \frac{rel_{ref} - rel}{BS_{ref}} + \frac{res - res_{ref}}{BS_{ref}}.    (3)
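The reliability and resolution terms in Eq. (3) come from Murphy's (1973) decomposition BS = rel − res + unc. A sketch, on invented data; grouping cases by unique forecast probability makes the identity exact, whereas coarser probability bins make it only approximate.

```python
import numpy as np

def murphy_decomposition(f, o):
    """Decompose the Brier score into reliability, resolution, and uncertainty
    (Murphy 1973): BS = rel - res + unc. Cases are grouped by unique forecast
    probability, so the identity holds exactly."""
    f, o = np.asarray(f, float), np.asarray(o, float)
    n, obar = len(f), o.mean()
    rel = res = 0.0
    for fk in np.unique(f):
        mask = f == fk
        nk = mask.sum()
        ok = o[mask].mean()               # observed frequency given forecast fk
        rel += nk * (fk - ok) ** 2        # penalizes miscalibrated probabilities
        res += nk * (ok - obar) ** 2      # rewards separating cases from climatology
    return rel / n, res / n, obar * (1.0 - obar)

# Invented example data: 8 forecast probabilities and binary outcomes.
f = np.array([0.1, 0.8, 0.8, 0.3, 0.5, 0.9, 0.2, 0.7])
o = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0])

rel, res, unc = murphy_decomposition(f, o)
bs = np.mean((f - o) ** 2)
# BS = rel - res + unc; the two skill terms of Eq. (3) then follow as
# (rel_ref - rel) / BS_ref and (res - res_ref) / BS_ref.
```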
Given that the reliability of probability forecasts can generally be corrected by calibration (although normally at the cost of reduced resolution), it has normally been considered more important for operational purposes to improve resolution (Atger 1999). Figure 5 shows the reliability and resolution components of the BSS in the NHEM region (plots for other regions are similar and therefore not shown) at T + 24, T + 48, and T + 72. As for the BSS, the values plotted are the average of the reliability and resolution components of the BSS for the five variables considered. For the most extreme thresholds (±3, ±4) and for all configurations, the contribution to the positive BSS comes from the reliability component. It is only in the central thresholds (between −2 and +2) that there is also often a positive contribution from the resolution component. The positive contribution from the resolution component is always larger for the hybrid configuration P6 + 6 than for P6 or P9.

FIG. 4. BSS values for each individual variable: PMSL (solid line), TMP2 (dotted line), T850 (dashed line), H500 (dash–dot line), and WS10 (dash–dot–dot–dot–dash line) for configurations (left) P6 and (right) P6 + 6 at (top) T + 24, (middle) T + 48, and (bottom) T + 72. The horizontal axis shows the threshold considered (−4 to +4), and the vertical axis shows the BSS values.

Another important aspect of an ensemble system is the number of outliers (i.e., the number of occasions when the observation lies outside the entire spread of the ensemble). This is an important characteristic, especially when interested in probability forecasting of extreme events, since an underspreading system with a larger than expected number of outliers would have a smaller chance of capturing different solutions. A useful way of analyzing the number of outliers is rank histograms (Hamill and Colucci 1997). Ideally, a rank histogram should have all the bins defined by the ranked ensemble members equally populated, indicating that the ensemble is reliably representing the spread of uncertainty. Figure 6a shows rank histograms for PMSL in the NHEM region at T + 24, T + 48, T + 72, and T + 120 lead times. To compare ensembles of different sizes on one graph, the bin populations have been normalized relative to their ideal populations and the "histograms" plotted as lines. Results for other variables and regions are similar and are therefore not shown. It can be observed that the outlier bins for the EPS (Fig. 6a, dotted line), especially the upper one, are more severely overpopulated relative to the ideal value of 1.0 than those of the PEPS configurations. This is true at all lead times. Configurations P6 and P9 perform similarly (the dashed and dash–dot lines overlap), with a tendency toward overpopulation of the middle ranks, especially at the shorter lead times, which indicates overspreading.
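The rank histogram and outlier diagnostics can be sketched as follows. The synthetic Gaussian data, drawn so that the truth comes from the same distribution as the members (i.e., a well-calibrated ensemble), are purely illustrative.

```python
import numpy as np

def rank_histogram(members, truth):
    """Count, for each case, how many ensemble members fall below the verifying
    value: an m-member ensemble defines m + 1 rank bins, and bins 0 and m are
    the outlier bins (truth outside the entire ensemble)."""
    members = np.asarray(members, float)       # shape (m, n_cases)
    truth = np.asarray(truth, float)           # shape (n_cases,)
    ranks = (members < truth).sum(axis=0)
    return np.bincount(ranks, minlength=members.shape[0] + 1)

# Well-calibrated synthetic case: truth and members share one distribution,
# so all m + 1 bins should be roughly equally populated.
rng = np.random.default_rng(1)
m, n = 6, 10_000
counts = rank_histogram(rng.normal(size=(m, n)), rng.normal(size=n))

normalized = counts / (n / (m + 1))            # ideal population of each bin is 1.0
relative_outliers = (counts[0] + counts[-1]) / (2 * n / (m + 1))  # ~1 when well spread
```

An underspread ensemble piles counts into the two outlier bins (relative_outliers well above 1), while an overspread one overpopulates the middle ranks.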
FIG. 5. P6 + 6 (solid line), P9 (dash–dot line), and P6 (dashed line) mean values (averaged over all five variables considered) of the (left) reliability and (right) resolution components of the BSS for all considered thresholds (−4 to +4) in the NHEM region. Values are shown at (top) T + 24, (middle) T + 48, and (bottom) T + 72.
For both configurations, P6 and P9, the highest rank is close to the ideal population. The hybrid configuration's rank histogram (P6 + 6, solid line) is the closest to the ideal and lies in the middle, between the EPS and the pure PEPS. Overall, the normalized rank histograms of all PEPS configurations are closer to ideal than the EPS at all lead times. Another way of looking at the number of outliers of each system is shown in Fig. 6b, where the actual number of outliers is plotted relative to its expected number (the optimal value being 1). Here we can see that all PEPS configurations perform similarly (the lines overlap at all lead times), and for all of them the relative number of outliers is smaller than, but close to, 1. As mentioned for Fig. 6a, this is a symptom that all PEPS configurations are slightly overspreading. On the other hand, the number of outliers for the EPS is much larger, between 5 (at T + 24) and 3 (at T + 72) times its expected number, corroborating that, as seen in the rank histograms, the EPS is underspreading.
TABLE 2. H500 correlation of rms spread and rms error for the ECMWF's EPS (EPS) and the PEPS configurations P6 + 6 and P9 from T + 24 to T + 96. Note that the data shown in this table have not been bias corrected (see text).

Spread–error (H500)    T + 24    T + 48    T + 72    T + 96
EPS                     0.27      0.57      0.59      0.60
P6 + 6                  0.42      0.58      0.57      0.55
P9                      0.71      0.62      0.53      0.55

FIG. 6. (a) Normalized rank histograms for PMSL in the Northern Hemisphere at T + 24, T + 48, T + 72, and T + 120 for the ECMWF's EPS (dotted line), P6 + 6 (solid line), P9 (dash–dot line), and P6 (dashed line). (b) Relative number of outliers (absolute number divided by the expected number) for P6 + 6 (dashed line with solid squares), P6 (gray dashed line with circles), P9 (dotted line with crosses), and EPS (solid line with diamonds) at all lead times.

In fact, and despite the different ensemble sizes, even the absolute number of outliers (not shown) of P6 + 6 and P9 is smaller than that of the EPS at T + 24. Underdispersion in ensembles is hypothesized to be a major cause of the low correlation between ensemble spread and the accuracy of the mean forecast (Hamill and Colucci 1997; Stensrud et al. 1999). In Table 2 we show the correlation between H500 rms spread and rms error (both with respect to the ensemble mean) for the EPS and the two PEPS configurations P6 + 6 and P9. [Note that the data shown in Table 2 have not been bias corrected. These are the only data shown in this paper that have not been, the reason being that the method used to remove the bias (calculation of anomalies with respect to the previous 60-day model mean) would have produced rms spread and rms error values of a different magnitude from those commonly shown in the literature (Buizza et al. 1999; Hou et al. 2001), making comparison with previous studies difficult.] It can be seen that, at T + 24, P9 has the best spread–error correlation (0.71) and the EPS the worst (0.27). This is a clear effect of the way the SVs are designed to maximize growth over the first 48 h. From T + 48 to T + 96, correlation values are more similar for all configurations. The hybrid configuration (P6 + 6) shows the benefit of combining both systems, with a better spread–skill relationship than the EPS at T + 24 and similar values from T + 48 to T + 96. The larger spread of the PEPS configurations in the short range allows them to better forecast the skill of the ensemble mean. Also, capturing a wider range of solutions at this time may give a better chance of incorporating extreme-weather developments.
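The spread–error correlation of Table 2 can be sketched as follows on synthetic data: each "day" has its own uncertainty level, and the daily rms spread about the ensemble mean is correlated with the rms error of that mean. All numbers (day count, member count, noise levels) are invented for the illustration.

```python
import numpy as np

# A well-spread ensemble: member perturbations are drawn with each day's true
# uncertainty, so spread should track the error of the ensemble mean.
rng = np.random.default_rng(2)
n_days, m, npts = 120, 9, 500

spreads, errors = [], []
for _ in range(n_days):
    sigma = rng.uniform(0.5, 2.0)                      # that day's true uncertainty
    truth = rng.normal(size=npts)                      # verifying field
    members = truth + rng.normal(scale=sigma, size=(m, npts))
    ens_mean = members.mean(axis=0)
    spreads.append(np.sqrt(np.mean((members - ens_mean) ** 2)))  # rms spread
    errors.append(np.sqrt(np.mean((ens_mean - truth) ** 2)))     # rms error of the mean

corr = np.corrcoef(spreads, errors)[0, 1]  # high, since spread reflects true uncertainty
```

An underdispersive system, whose spread varies little from day to day, yields a much weaker correlation, which is the behavior Table 2 shows for the EPS at T + 24.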
4. Dependence on configuration type and number of members

It has been shown in the previous section that PEPS configurations perform comparably to, or even outperform, the EPS in the short range. In general, the hybrid approach (P6 + 6) offers better results than the pure PEPS configurations (P6 and P9), but it is not clear whether this is due to the combination of SVs with a multimodel, multianalysis system or to its larger number of members (12 compared to 9 and 6). For a fairer comparison, some new configurations are analyzed for all regions and thresholds but, because of the large amount of data involved, only for WS10. This variable was chosen as being typical of all the variables previously analyzed (see Fig. 4) and also an interesting surface parameter for extreme-weather and economic impacts. The new configurations analyzed are

(i) P14: a 14-member ensemble; essentially P9 plus NCEP-CF, CMC-CF, and EPS-CF (the control runs of the NCEP, CMC, and ECMWF ensembles, respectively) and the two reduced-resolution versions of the MetO model using initial conditions (ICs) provided by the Met Office and ECMWF.
(ii) E15: a 15-member subset of the ECMWF EPS, including the control member and the first seven pairs of perturbed members.
(iii) E6 + 6: a 12-member configuration using the same ICs as P6 + 6 but only the ECMWF model. The MetO, M-F, NCEP, and DWD models in P6 + 6 have been substituted by runs of the ensemble model (T255) produced at ECMWF using the analyses from these centers.

FIG. 7. The 12-h evolution of mean BSS values (averaged between 0 and +4 thresholds) relative to the EPS for the 10-m wind speed and four different configurations: P6 + 6 (dash–dot line), P14 (short dash line), E6 + 6 (long dash line), and E15 (solid line).

Figure 7 shows the mean BSS values of WS10 (averaged between 0 and +4 thresholds) for these three configurations together with P6 + 6 in the NHEM region between 12- and 84-h lead time. We can now, therefore, compare the relative performance of four systems with a similar number of members, ranging between 12 and 15. The best performance from T + 12 to T + 96 corresponds to P6 + 6, which consistently beats the EPS during the analyzed period. This hybrid configuration also beats the pure PEPS configuration P14 (despite P14's larger number of members) at all lead times. As discussed in section 3, using the ECMWF analysis puts P14 at a disadvantage, but P6 + 6 still performs better at days 3 and 4, when the impact of the verifying analysis should be much smaller than earlier in the period. Except in the first 24 h, when the poor performance of P14 could also be partially explained by the use of the ECMWF as the verifying analysis, P14 outperforms the EPS from T + 36 to T + 84. Verification against an independent dataset (observations or an independent analysis) would be needed to quantify what part of the benefit of P6 + 6 over P14 comes from better sampling of the initial uncertainty and what part is artificially produced by verifying against the ECMWF analysis, but it seems plausible that the combination of SVs and a multimodel, multianalysis system is having a positive effect. This leads to the question of whether the benefits of P6 + 6 could be obtained using a multianalysis ensemble, with ICs coming from different NWP centers but only a single forecast model. The comparison between the P6 + 6 (multimodel, multianalysis) and E6 + 6 (single-model, multianalysis) systems, both using the same set of ICs, is a way of addressing this. Although in the first 48 h both systems beat the EPS, it is clear that the multimodel system (P6 + 6) outperforms the single-model system (E6 + 6) consistently through the forecast period. This suggests that the inclusion of different ICs alone is not enough to explain the better BSS of P6 + 6 in the earlier period, and that model error plays an important role even at the short range. Thus, similar to what was observed by Mylne et al. (2002), it does appear that the addition of independent ICs to the EPS ICs (and, in the case of P6 + 6, also the use of different models to better represent model uncertainty) increases the spread of the system in the first 48 h, the growing period of the SVs. This, in turn, reduces the number of outliers and increases the spread–error correlation seen in section 3 for P6 + 6.
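The BSS comparisons above, and the reliability/resolution split discussed in the conclusions, follow the standard definitions (Brier 1950; Murphy 1973). A hedged sketch against the sample climatology, with an illustrative binning choice (the paper's own bin widths may differ):

```python
import numpy as np

def brier_score(p, o):
    """Mean squared difference between forecast probability p and
    binary outcome o (Brier 1950)."""
    p, o = np.asarray(p, float), np.asarray(o, float)
    return float(np.mean((p - o) ** 2))

def brier_skill_score(p, o):
    """BSS = 1 - BS/BS_ref, with the sample climatology as reference."""
    o = np.asarray(o, float)
    ref = brier_score(np.full(o.shape, o.mean()), o)
    return 1.0 - brier_score(p, o) / ref

def reliability_resolution(p, o, n_bins=10):
    """Murphy (1973) partition: BS = reliability - resolution + uncertainty.
    Forecast probabilities are pooled into bins of width 1/n_bins."""
    p, o = np.asarray(p, float), np.asarray(o, float)
    clim = o.mean()
    idx = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    rel = res = 0.0
    for k in range(n_bins):
        sel = idx == k
        if sel.any():
            rel += sel.sum() * (p[sel].mean() - o[sel].mean()) ** 2
            res += sel.sum() * (o[sel].mean() - clim) ** 2
    return rel / p.size, res / p.size
```

A positive BSS can come from a smaller reliability term, a larger resolution term, or both; the uncertainty term clim * (1 - clim) depends only on the event climatology.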
Not surprisingly, the 15-member EPS subset (E15) performs worse than the full EPS at all verification times. However, it is interesting to note that, used as a way to compare PEPS and EPS systems of similar size, the PEPS approach clearly outperforms the EPS approach: the BSS values of P6 + 6 and P14 (except at T + 12) are always better than the corresponding E15 values. This supports the results previously found by Ziehmann (2000) and Hou et al. (2001), suggesting that the PEPS methodology is a particularly efficient way of producing ensemble forecasts.
5. Extreme events and the economic value of the system

One of the main reasons to develop a short-range probability forecasting system is to attempt to capture extreme weather events, those severely affecting people's lives and wealth. For this reason, in this section we focus on surface variables and events more than three standard deviations from climatology in the first 48 h of the forecast period.

FIG. 8. ROC, relative economic value, reliability diagram, and sharpness diagram for the WS10 variable and the full EPS (solid line), P6 + 6 (dotted line), and P14 (dashed line) configurations. Data correspond to (a) WS10 > 3 in the Northern Hemisphere region and T + 24. (b) Same as (a), but for T + 48.

The probabilistic measures of skill used in this section are the relative operating characteristic (ROC) curve, reliability and sharpness diagrams, and the relative economic value. The ROC curve (Stanski et al. 1989) measures the skill of a forecast in terms of a hit rate (HR)
and a false alarm rate (FAR), both classified according to the observations (the ECMWF analysis in our case). Another way of assessing the value of a forecast for decision making is the relative economic value (Richardson 2000). This is based on a simple decision-analytic model in which a decision maker has just two alternatives, depending exclusively on whether a given weather event occurs or not, and converts the ROC information into potential financial savings for the user of the forecast over a range of cost–loss ratios. A reliability
diagram (Wilks 1995) is a plot of the observed relative frequency against the forecast probability. The sharpness diagram characterizes the relative frequencies of use of the forecast probabilities. Figure 8 shows the ROC curves, the reliability and sharpness diagrams, and the relative economic value for the event WS10 > +3, at (a) T + 24 and (b) T + 48 in the NHEM region. Three systems are shown: the EPS, P6 + 6, and P14. To compare all systems, the probabilistic information has been condensed into probability bins of width 10%. The ROC curves are similar for all three systems (the corresponding lines overlap at most probabilities), and all are bowed close to the ideal top-left corner, indicating good discriminatory skill. The area under the ROC
curve is marginally larger for the EPS (0.979, compared to 0.974 and 0.968 for P6 + 6 and P14, respectively, at T + 24; values at T + 48 are 0.970 for the EPS, 0.960 for P6 + 6, and 0.951 for P14), mainly because of its better HR/FAR values at the lowest probability threshold (10%). Similarly, for very small cost–loss ratios (smaller than the climatological frequency of the event, 0.04) the EPS is slightly better in terms of relative value. This is to be expected, since the larger number of members in the EPS allows it to better represent small probabilities (if the full range of probabilities resolved by the 51-member EPS had been analyzed, allowing probabilities down to ~2% rather than the 10% limit imposed, we would expect to see some further potential benefit of the EPS because of its large size). For cost–loss ratios higher than 0.04, P6 + 6 is a marginally better decision-making system at lead times T + 24 and T + 48. P14, although worse than P6 + 6, also performs quite well for cost–loss ratios above 0.04. In terms of reliability, all three systems slightly underestimate the frequency of the event with respect to the observed values, but P6 + 6 and P14 behave better than the EPS, especially for probabilities smaller than 50%. The sharpness diagrams are similar (at T + 24 and T + 48), indicating good resolution for all systems for this event (note that the peaks in the sharpness diagrams for P6 + 6 and P14 are simply due to some 10% probability bins being populated by more than one ensemble probability). The similarity of the sharpness diagrams and of the ROC and economic value scores for the PEPS configurations and the EPS indicates that, despite what was seen in section 3 (that the positive BSS at extreme events came from the reliability component), the PEPS configurations have discriminatory ability comparable to the EPS and are therefore a suitable tool for decision making. More analyses were completed for other extreme events (equal to or larger than three standard deviations) and surface variables (TMP2), showing behavior similar to the case discussed above (WS10 > 3); they are therefore not shown. Although the intention of this study was to test the performance of a multimodel system over a long period of time and not just over a sample of case studies, it is also worth briefly mentioning a few extreme events that occurred during our data-collection period. These include the rain episodes that led to the central European floods in August 2002, the southern France flood in September 2002, and two intense windstorms over Europe in October 2002.
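The ROC and value statistics discussed above derive from the 2x2 contingency table at each probability threshold. The following is a minimal sketch (function and variable names are ours, not the authors') of the HR/FAR computation and of the cost–loss relative value, following the simple decision model of Richardson (2000), with expenses written per unit loss L:

```python
import numpy as np

def hit_false_alarm_rates(prob, event, threshold):
    """HR and FAR when protective action is taken for prob >= threshold."""
    warn = np.asarray(prob) >= threshold
    ev = np.asarray(event, bool)
    hr = (warn & ev).sum() / ev.sum()        # hits / (hits + misses)
    far = (warn & ~ev).sum() / (~ev).sum()   # false alarms / non-events
    return float(hr), float(far)

def relative_value(hr, far, base_rate, cost_loss):
    """Relative economic value V: savings over climatology, normalized by
    the savings of a perfect forecast. o = climatological frequency of the
    event, a = C/L cost-loss ratio of the user."""
    o, a = base_rate, cost_loss
    e_clim = min(a, o)                # always or never protect, whichever is cheaper
    e_fcst = o * hr * a + (1 - o) * far * a + o * (1 - hr)  # protect on warnings, lose on misses
    e_perf = o * a                    # protect exactly when the event occurs
    return (e_clim - e_fcst) / (e_clim - e_perf)
```

A perfect system (HR = 1, FAR = 0) gives V = 1, and a system no more informative than climatology gives V <= 0; sweeping the cost–loss ratio a traces curves like those in Fig. 8.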
With the exception of the intense windstorm that occurred on 27 October 2002, all these extreme developments were well captured by the ECMWF EPS and by the experimental PEPS configurations (although generally the PEPS showed a weaker signal than the EPS). The 27 October storm was poorly forecast by the EPS, which missed any indication of strong winds at some of the most severely affected areas (southern United Kingdom, northern France) at T + 24. The EPS probabilities were better at T + 48 but again missed large areas that were severely affected. By contrast, the PEPS configurations P6 + 6 and P14 gave a much better indication of risk (10%–20%) at both lead times. Preliminary results from an a posteriori investigation of this case at ECMWF indicated that the poor forecast may have been caused by the use of a long time step in the EPS model (F. Lalaurette 2004, personal communication). Although this particular problem can be corrected, this case illustrates the potential vulnerability of a single-model system and shows that, at low cost, a multimodel system can provide useful information not contained in a single-model EPS, which could lead to a better risk assessment of extreme events.
6. Discussion and conclusions

In this paper we have tested and analyzed different configurations of a poor man's ensemble with the aim of creating a useful probability forecasting system for the short range. The results from PEPS have been assessed mainly against the ECMWF EPS, and in general PEPS has proved to be a good system, with results highly competitive with the EPS. The efficiency of PEPS is remarkable: despite being a much smaller and cheaper system, some configurations produce results comparable to or better than those of the full EPS. The best performance was obtained by combining a PEPS with a small subset of ECMWF EPS members, in what we have called a hybrid ensemble. Brier skill score results show a better performance of the hybrid configurations for most variables, regions, and thresholds. There is, therefore, a consistent benefit over the full EPS. The hybrid approach represents a promising option, including a representation of the model error through a multimodel system and a good set of initial conditions through the combination of random sampling of analysis uncertainty (multianalysis) and the fastest-growing modes of the analysis (singular vectors). The model error representation has been found to be particularly important at all time scales. Thus, the comparison between a multianalysis, single-model configuration (E6 + 6) and a multianalysis, multimodel one (P6 + 6), both using the same set of initial conditions, has shown that the multimodel system outperforms the single-model system at all lead times. The analysis of the contributions of the reliability and resolution components to the BSS results shows markedly different patterns for normal events (those within ±2 standard deviations of climatology) and extreme ones (those beyond ±2). In general, in the case of normal events, there is a positive contribution from both components for most variables and regions. For the most extreme events, BSS increases come only from the reliability component.
Nevertheless, when extreme events (±3 standard deviations) are analyzed for surface variables, PEPS resolution is found to be competitive with that of the ECMWF EPS. These extreme events are also studied in terms of ROC and relative economic value, confirming that the PEPS configurations offer a useful decision-making tool, even for the lowest thresholds.
Another basic aspect of a system like this, especially when one is interested in the predictability of severe-weather developments in the short range, is the number of outliers. Our results show that PEPS configurations have a smaller relative number of outliers at the short range than the full EPS (and, at T + 24, a smaller absolute number as well), suggesting that a PEPS may have a greater chance of capturing extreme-weather events. Also, the correlation between rms spread and rms error is higher for the PEPS systems than for the EPS at T + 24 (and similar from T + 48 to T + 96), suggesting a better capacity of PEPS to forecast the forecast skill early in the short range. A well-designed PEPS is, therefore, able to provide useful information not already contained in single-model ensemble prediction systems, helping to produce a better assessment of weather-related risks. Finally, various centers around the world are currently developing systems specifically designed for short-range ensemble prediction, and for all of these a poor man's ensemble system offers a cheap alternative and benchmark.

Acknowledgments. We thank all the following NWP centers for their considerable efforts in providing the data used in this study in the formats requested and in a timely fashion over many months: Bureau of Meteorology (Australia), Canadian Meteorological Centre, Deutscher Wetterdienst, ECMWF, Japan Meteorological Agency, Korean Meteorological Administration, Météo-France, National Centers for Environmental Prediction, and the Russian Hydrometeorological Center. Without their contributions this project would have been impossible.

REFERENCES

Atger, F., 1999: The skill of ensemble prediction systems. Mon. Wea. Rev., 127, 1941–1953.
Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3.
Buizza, R., and T. N. Palmer, 1995: The singular vector structure of the atmospheric general circulation. J. Atmos. Sci., 52, 1434–1456.
——, M. Miller, and T. N. Palmer, 1999: Stochastic representation of model uncertainties in the ECMWF EPS. Quart. J. Roy. Meteor. Soc., 125, 2887–2908.
——, J. Barkmeijer, T. N. Palmer, and D. Richardson, 2000: Current status and future developments of the ECMWF Ensemble Prediction System. Meteor. Appl., 7, 163–175.
——, D. S. Richardson, and T. N. Palmer, 2003: Benefits of increased resolution in the ECMWF ensemble system and comparison with poor-man's ensembles. Quart. J. Roy. Meteor. Soc., 129, 1269–1288.
——, P. L. Houtekamer, Z. Toth, G. Pellerin, M. Wei, and Y. Zhu, 2005: A comparison of the ECMWF, MSC, and NCEP global ensemble prediction systems. Mon. Wea. Rev., 133, 1076–1097.
Coutinho, M. M., B. J. Hoskins, and R. Buizza, 2004: The influence of physical processes on extratropical singular vectors. J. Atmos. Sci., 61, 195–209.
Ebert, E. E., 2001: Ability of a poor man's ensemble to predict the probability and distribution of precipitation. Mon. Wea. Rev., 129, 2461–2479.
Hamill, T. M., and S. J. Colucci, 1997: Verification of Eta–RSM short-range ensemble forecasts. Mon. Wea. Rev., 125, 1312–1327.
Hou, D., E. Kalnay, and K. K. Droegemeier, 2001: Objective verification of the SAMEX'98 ensemble forecasts. Mon. Wea. Rev., 129, 73–91.
Houtekamer, P. L., L. Lefaivre, J. Derome, H. Ritchie, and H. L. Mitchell, 1996: A system simulation approach to ensemble prediction. Mon. Wea. Rev., 124, 1225–1242.
Kalnay, E., and M. Ham, 1989: Forecasting forecast skill in the Southern Hemisphere. Extended Abstracts, Third Int. Conf. on Southern Hemisphere Meteorology and Oceanography, Buenos Aires, Argentina, Amer. Meteor. Soc., 24–27.
Krishnamurti, T. N., C. M. Kishtawal, T. LaRow, D. Bachiochi, Z. Zhang, E. Williford, S. Gadgil, and S. Surendran, 1999: Improved weather and seasonal climate forecasts from multimodel superensemble. Science, 285, 1548–1550.
Legg, T. P., and K. R. Mylne, 2004: Early warnings of severe weather from ensemble forecast information. Wea. Forecasting, 19, 891–906.
Molteni, F., R. Buizza, T. N. Palmer, and T. Petroliagis, 1996: The ECMWF Ensemble Prediction System: Methodology and validation. Quart. J. Roy. Meteor. Soc., 122, 73–119.
Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600.
Mylne, K. R., R. E. Evans, and R. T. Clark, 2002: Multi-model multi-analysis ensembles in quasi-operational medium-range forecasting. Quart. J. Roy. Meteor. Soc., 128, 361–384.
Richardson, D., 2000: Skill and relative economic value of the ECMWF Ensemble Prediction System. Quart. J. Roy. Meteor. Soc., 126, 649–667.
Stanski, H. R., L. J. Wilson, and W. R. Burrows, 1989: Survey of common verification methods in meteorology. WMO WWW Tech. Rep. 8, WMO Tech. Doc. 358, 114 pp.
Stensrud, D. J., H. E. Brooks, J. Du, M. S. Tracton, and E. Rogers, 1999: Using ensembles for short-range forecasting. Mon. Wea. Rev., 127, 433–446.
Toth, Z., and E. Kalnay, 1993: Ensemble forecasting at the NMC: The generation of perturbations. Bull. Amer. Meteor. Soc., 74, 2317–2330.
Tracton, M. S., and J. Du, 1998: Short-range ensemble forecasting (SREF) at NCEP/EMC. Proc. Sixth Workshop on Meteorological Operational Systems, Reading, United Kingdom, ECMWF.
Wandishin, M. S., S. L. Mullen, D. J. Stensrud, and H. E. Brooks, 2001: Evaluation of a short-range multimodel ensemble system. Mon. Wea. Rev., 129, 729–747.
Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences—An Introduction. International Geophysics Series, Vol. 59, Academic Press, 467 pp.
Young, M. V., and E. B. Carroll, 2002: The use of medium-range ensembles at the Met Office. II: Applications for medium-range forecasting. Meteor. Appl., 9, 273–288.
Ziehmann, C., 2000: Comparison of a single-model EPS with a multi-model ensemble consisting of a few operational models. Tellus, 52A, 280–299.