Assessing the Usefulness of Probabilistic Forecasts


STEPHEN CUSACK AND ALBERTO ARRIBAS

Met Office Hadley Centre, Exeter, United Kingdom

(Manuscript received 30 January 2007, in final form 8 August 2007)

ABSTRACT

The errors in both the initialization and simulated evolution of weather and climate models create significant uncertainties in forecasts at lead times beyond a few days. Modern prediction systems sample the sources of these uncertainties to produce a probability distribution function of future meteorological conditions to help end users in their risk assessment and decision-making processes. The performance of prediction systems is assessed using data from a set of historical forecasts and the corresponding observations. There are many aspects to the correspondence between forecasts and observations, and various summary scores have been created to measure the different features of forecast quality. The main concern for end users is the usefulness of forecasts. There are two independent and sufficient aspects for the assessment of the usefulness of forecasts to end users: 1) the statistical consistency of forecast statements with observations and 2) the extra information contained in the forecast relative to the situation in which such predictions are unavailable. In this paper two new scores, the full-pdf-reliability Rpdf and information quantity IQ, are proposed to measure these two independent aspects of usefulness. In contrast to all existing summary scores, both Rpdf and IQ depend upon all moments of the forecast pdf. When taken together, the values of Rpdf and IQ offer a general measure of the usefulness of ensemble predictions.

Corresponding author address: Stephen Cusack, Met Office Hadley Centre, FitzRoy Road, Exeter EX1 3PB, United Kingdom. E-mail: [email protected]

DOI: 10.1175/2007MWR2160.1

1. Introduction Uncertainties in forecasts arise from errors in both the initialization and subsequent modeled evolution and are amplified by the chaotic nonlinear dynamics of the climate system (e.g., Lorenz 1963, 1993). These sources of uncertainty are sampled by current ensemble prediction systems to produce a probability distribution function (pdf) of a predictand (e.g., Toth and Kalnay 1993; Palmer et al. 1993). Therefore, the forecast is probabilistic in nature, and a full assessment of its quality requires analysis of the whole forecast pdf. The assessment of the performance of prediction systems is used to inform both end users and system developers. Information on forecast quality can influence the decisions made by end users and provide direction to the system’s research and development efforts. In practice, short- and medium-range probabilistic forecasts are assessed using real-time forecasts over a certain period of time (typically a few months). However, in the case of long-range forecasts (e.g., seasonal) the


assessment is completed by running a large number of historical reforecasts, usually referred to as hindcasts (e.g., Graham et al. 2005). An assessment of the usefulness of probabilistic forecasts to end users consists of two independent aspects: 1) the statistical consistency of the issued forecast statements with observations [as defined in Toth et al. (2003)] and 2) the extra information contained in the issued predictions relative to a reference situation representing the end user’s knowledge in the absence of a forecast. There are other aspects of forecast quality (e.g., Murphy and Winkler 1987; Murphy et al. 1989; Murphy 1993) that are more appropriate to utilize in system development work rather than measuring the usefulness of probabilistic forecasts to end users. The assessment of the statistical consistency of predictions requires the comparison of forecast information with corresponding observations and the assignment of appropriate penalties for differences. If the forecast information is probabilistic in nature then it is desirable for the measure to reflect the full characteristics of the pdf. The appropriate penalty to be applied to differences between forecasts and observations is specific to the end user’s exposure to weather and climate. The wide variety of applications of forecast in-


formation requires a large range of appropriate penalty functions. Although it is desirable to tailor the assessment of the usefulness of forecasts to suit each end user’s exposure to weather and climate, general measures of usefulness could offer a summary of the qualities of a forecast system to all end users. A simple functional form to penalize the distance between forecasts and observations provides a general measure of statistical consistency to all end users. The end user’s knowledge in the absence of a forecast is required to define the extra information contained in the issued predictions. The end user is commonly assumed to have knowledge of the observed climatological mean values (hereinafter often referred to as “climatology”) in the absence of a forecast. The comparison of issued predictions with the observed climatology produces one suitable measure of the extra information contained in forecasts. In accordance with the assessment of statistical consistency, the estimates of the extra information quantity should be based upon the full characteristics of the forecast pdfs. The possibility that summary performance measures in common use fail to capture the essential features of forecasts has been discussed by Murphy (1991). In section 2, we analyze the ability of current assessment methods to measure the usefulness of forecast pdfs, and problems are identified in all measures. In particular, existing measures fail to measure properly the qualities of the whole forecast pdf. Section 3 includes a description of the two new scores Rpdf and IQ and a discussion of the uncertainties in their estimates. In section 4, a practical example is used to highlight the impact of sampling errors and a comparison of the values of these two new scores with a commonly used probabilistic score. Section 5 contains a summary and some discussion of future work.

2. Assessing probabilistic forecasts The ability of current assessment methods to measure the two aspects of the usefulness of forecast pdfs is now discussed. The forecasts are assumed to concern continuous variables, and the scores analyzed below are those used commonly to assess probabilistic forecasts. Measures of the quality of deterministic forecasts, such as mean square error, mean absolute error, anomaly correlation, linear error in probability space (Ward and Folland 1991), and conditional quantile diagrams (Murphy 1991) are inappropriate for probabilistic forecasts and therefore are not considered.

a. The Brier score

Wilks (2006) suggests that the most common scalar measure of probability forecasts of dichotomous events is the Brier score (BS; Brier 1950), which measures the difference in probability space between forecasts and observations for a specific event occurring:

$$\mathrm{BS} = \frac{1}{n}\sum_{k=1}^{n}\left(y_k - o_k\right)^2, \qquad (1)$$

where yk is the forecast probability of the kth event, ok is the corresponding observed probability, and there are n events in total. The first point of note is the failure of the BS to assess the whole distribution of possible events. This lack of generality is considered a shortcoming if a measure of forecast quality of relevance to all end users is sought. The BS cannot distinguish between the two independent aspects of forecast utility because it is single valued. However, the BS can be decomposed into the sum of three separate terms (Murphy 1973):

$$\mathrm{BS} = \frac{1}{n}\sum_{i=1}^{I} N_i\left(y_i - \bar{o}_i\right)^2 - \frac{1}{n}\sum_{i=1}^{I} N_i\left(\bar{o}_i - \bar{o}\right)^2 + \bar{o}\left(1 - \bar{o}\right), \qquad (2)$$

where n is the total number of forecast–observation pairs, I is the total number of discrete values that a forecast probability can assume, yi is the relative frequency of a forecast in the ith category, oi is the average observed probability conditional on yi occurring, and o is the mean climatological observed value over the whole sample. The three terms on the right-hand side of Eq. (2) are labeled the reliability, resolution, and uncertainty, respectively. The decomposition is based upon the grouping together of forecasts with similar values of the probability of an event occurring and gathering the corresponding observations in each of these forecast groups. However, forecast pdfs with similar values of the probability of an event occurring can have significantly different distribution details. This situation is depicted in Fig. 1 for two distinct ensemble forecast pdfs of a variable X. One pdf represents a forecast of extreme values of X, although the sign of the variable is not known, whereas the other pdf represents a forecast of near-climatological conditions for X. If the components of the BS were computed for the event in which the anomaly is less than zero, for the forecasting system that produces the two pdfs shown in Fig. 1, then these two forecasts would be grouped together into the same ith category for the purpose of conditioning the observations. Therefore, the estimates of the reliability and resolution components of the BS for this event are based upon treating these two forecasts as similar. Because these scores are based upon conditioning based on one attribute of pdfs, namely, the probability of an


event occurring, they are inappropriate if one wishes to assess the full characteristics of the forecast pdfs. The formulation of the components of the BS contains another feature that is inappropriate for measures of forecast usefulness. The resolution component of the BS cannot be considered as a measure of the extra information in a forecast. The resolution component in Eq. (2) is based upon the difference between conditioned observations and climatology whereas a true measure of the added information of predictions should be based upon the forecast statements rather than on conditioned observations. Therefore, the resolution term of the BS cannot measure the extra information contained in forecasts.

FIG. 1. Two forecast pdfs are shown for the standardized anomaly of a variable X. The pdfs are based upon idealized output from an ensemble forecasting system: the dashed curve represents a situation in which the forecasting system produces a bimodal distribution in which extreme values of X are likely, although the sign is not known, and the dotted curve represents a situation in which average conditions are forecast to be much more likely.

b. Multicategory BS and ranked probability score

Brier (1950) described a score based on multiple categories of events that is less widely used than the BS. The ranked probability score (RPS) is a measure of the quality of forecasts of a multicategory predictand and can be expressed as the linear combination of individual BS values (e.g., Toth et al. 2003). Both the multicategory BS and the RPS are dependent upon a distribution of possible events and therefore overcome the first of the undesirable features of the BS noted in section 2a. However, neither measure addresses the other problems of the BS described above, concerning the use of a conditioning variable that identifies forecasts as similar when they may have very dissimilar pdfs, and a resolution component that is representative of the extra information in conditioned observations rather than the forecasts issued to end users. For these reasons, the multicategory BS and RPS scores and their respective decompositions into reliability, resolution, and uncertainty components provide inappropriate measures of the general usefulness of forecast pdfs.

c. The relative operating characteristic

The relative operating characteristic (ROC) area and its associated graphical form known as the ROC curve are suggested by WMO (2002) as a means of assessing aspects of probabilistic forecasts of binary events. These measures of the quality of forecasting dichotomous events are constructed by grouping together forecasts with the same probability for the binary event in question. Toth et al. (2003) show that ROC scores are in most cases sensitive to the resolution of the forecasting system and are very similar to the resolution component of the BS. This leads to a sensitivity to one but not both aspects of the usefulness of forecasts. The comments on the BS and its resolution component are applicable to ROC-based measures. The situation depicted in Fig. 1, in which the algorithm for sorting forecasts is not based upon the full characteristics of pdfs, is equally relevant to ROC-based measures, and the problem regarding resolution being based upon conditioned observations rather than the issued forecasts is also applicable. Therefore, ROC scores are an inappropriate measure of the usefulness of probabilistic forecasts.

d. The rank histogram

The rank histogram is a graphical measure of the correspondence between probabilistic forecasts and observations. It was developed contemporaneously by Anderson (1996), Hamill and Colucci (1997), Harrison et al. (1995), and Talagrand et al. (1997). In brief, the ensemble members are ranked by value, and the rank of the observed value in this sorted distribution of ensemble members is found. This is repeated for all events, and then a histogram of the rank of the observation is finally compiled. Hamill (2001) provides a fuller description of this graphical measure together with an examination of its performance. The interpretation of the rank histogram can provide insight into the statistical consistency between forecast pdfs and corresponding observations but can offer no details on the other aspect of forecast usefulness, concerning the extra information in a forecast. Furthermore, the information concerning statistical consistency provided by the rank histogram is limited.


The histogram contains information from all pairs of forecasts and observations, and forecast error compensation can pass undetected by the rank histogram. For example, if the forecast has a warm bias when cold anomalies are observed, and vice versa, the rank histogram will be flat. The aggregation of all of the different forecast events into one histogram allows this cancellation of errors to occur.
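To make this concrete, the following is a minimal sketch, not taken from the paper, of how a rank histogram is built, together with a synthetic illustration of the compensation just described: a forecast that always issues the climatological distribution is conditionally biased warm when cold anomalies are observed (and vice versa), yet its aggregated rank histogram is flat. The data and function names are assumptions made for the example.

```python
import numpy as np

def rank_histogram(ensembles, observations):
    """Count, for each forecast-observation pair, the rank of the
    observation within the sorted ensemble (0 .. n_members)."""
    n_members = ensembles.shape[1]
    counts = np.zeros(n_members + 1, dtype=int)
    for members, obs in zip(ensembles, observations):
        rank = np.searchsorted(np.sort(members), obs)
        counts[rank] += 1
    return counts

rng = np.random.default_rng(0)
n_cases, n_members = 2000, 15

# Synthetic example: the forecast ignores the actual state and always
# samples the climatological distribution, so it is biased warm on cold
# days and cold on warm days, yet the pooled histogram comes out flat.
obs = rng.normal(size=n_cases)
ens = rng.normal(size=(n_cases, n_members))

print(rank_histogram(ens, obs))
```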

e. Potential predictability The potential predictability (PP) of ensemble forecasts summarizes a property similar to the extra information contained in forecasts and does not analyze their statistical consistency with observations at all. The PP has been assessed using analysis-of-variance methods (e.g., Rowell et al. 1995) by comparing the variance of the forecast and climatological distributions. This definition of PP is based on just one characteristic of the forecast pdf and so cannot be considered as a general measure of the extra information contained in a forecast. Kleeman (2002) presents an illustration of the weakness of PP if it is defined as a function of variance alone using a forecast pdf with a different first moment from the corresponding climatological pdf and suggests a suitable modification to PP to capture differences in both the first and second moments. The Kleeman (2002) measure fails to capture all characteristics of forecast and climatological pdfs and is therefore unsuitable as a general measure of the extra information of forecasts.
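Before moving on to the new scores, the conditioning-based quantities discussed in sections 2a and 2b can be made concrete with a small sketch of ours (not the authors' code). It evaluates the BS of Eq. (1) and its Murphy (1973) decomposition of Eq. (2), assuming that the forecast probabilities are restricted to a small set of discrete values so that the conditioning on each forecast category is straightforward; the synthetic data are illustrative only.

```python
import numpy as np

def brier_score(y, o):
    """Eq. (1): mean squared difference between forecast probabilities
    y and binary outcomes o (both 1D arrays of length n)."""
    return np.mean((y - o) ** 2)

def murphy_decomposition(y, o):
    """Eq. (2): reliability, resolution, and uncertainty terms obtained
    by conditioning the outcomes on each distinct forecast value."""
    n = len(y)
    o_bar = o.mean()
    reliability = resolution = 0.0
    for yi in np.unique(y):
        idx = (y == yi)
        Ni = idx.sum()
        oi_bar = o[idx].mean()
        reliability += Ni * (yi - oi_bar) ** 2
        resolution += Ni * (oi_bar - o_bar) ** 2
    return reliability / n, resolution / n, o_bar * (1.0 - o_bar)

rng = np.random.default_rng(1)
# Forecast probabilities drawn from I = 11 allowed values (0.0, 0.1, ..., 1.0).
y = rng.choice(np.linspace(0.0, 1.0, 11), size=5000)
o = (rng.random(5000) < y).astype(float)     # outcomes consistent with y

rel, res, unc = murphy_decomposition(y, o)
print(brier_score(y, o), rel - res + unc)    # the two values agree
```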

3. General measures of the usefulness of probabilistic forecasts The assessment of the usefulness of forecasts issued to end users requires examination of two independent aspects: 1) the differences between forecasts and observations and 2) the extra information contained in the forecast relative to the end user’s knowledge in the absence of a forecast. In addition, the basic forecast consists of probabilistic information, and a full assessment requires that the entire forecast pdf be considered. The components of the assessment consist of a forecast pdf, a climatological pdf to represent the situation in which the user has no forecast, and the construction of an observed pdf to allow the full characteristics of the forecast pdf to be assessed. Furthermore, a penalty function must be defined to reward or penalize differences between distributions as appropriate. These components of a general assessment of the usefulness of forecasts are now discussed. This is followed by the description of measures summarizing the general usefulness of forecasts.


a. The components of an assessment of usefulness The forecast pdf should represent the basic product of the ensemble prediction system after postprocessing has been applied [e.g., calibration, or more sophisticated techniques such as those described in Hamill and Whitaker (2006)]. Such a postprocessed forecast is the basis of products delivered to end users, and it is appropriate for this to form the basis of the assessment. The climatological pdf is intended to represent the end users’ knowledge in the absence of information from a forecasting system and can be formed from the collection of observed values of the predictand over a suitable climatological period. The choice of climatological period is made more difficult by the nonstationary nature of recent climate statistics, and care must be taken to choose a relevant time period. There are alternatives to this choice of information in the absence of a forecast, such as the persistence of prior observed values: hereinafter we choose the climatological pdf as the end user’s knowledge in the absence of a forecast without loss of generality. Observed pdfs are formed by the creation of homogeneous subsets of forecasts. The corresponding verifying observations can be gathered for each subset to form a pdf corresponding to each homogeneous sample of forecasts. There are practical limitations, and related consequences, in the formation of such observed pdfs, and these issues are discussed at the end of this section and explored further in section 4. A penalty function is required to quantify differences between pdfs. In the ideal case, such a penalty function should reflect the exposure of the end user to the weather and climate. However, the great diversity of end-user applications of forecasts has a corresponding wide variety of suitable penalty functions. No unique penalty function exists that accurately captures the exposure to weather and climate conditions for all end users. Therefore a penalty function with a necessarily compromised ability to measure the specific usefulness of forecasts to each end user must be employed. However, the desired behavior of the penalty function regarding resistance to outliers, robustness to the form of pdf, and sensitivity to distance should not be compromised. The differences between pdfs could be assessed using information theory (e.g., Cover and Thomas 1991). For example, Kleeman (2002) proposes the use of relative entropy to assess the differences between the forecast and climatological pdfs. The definition of information used in the field of information theory is derived within the context of the encoding of data, for subsequent transmission. The information (or uncertainty) is mea-


sured by the use of a logarithm function with an appropriate base in the penalty function. Roulston and Smith (2002) argue that relative entropy is not a true measure of distance between two distributions because it satisfies neither symmetry nor the triangle inequality (Cover and Thomas 1991). Furthermore, basing a penalty on a logarithm function is appropriate for the encoding/transmission of information, though its relevance as a general measure of end users’ exposure to weather and climate is unsupported and a simpler formulation could be used. Note that hereinafter a general definition of information based on the knowledge provided by forecasts is intended, rather than the numerical definition employed in information theory concerned with data encoding for subsequent transmission. Anderson and Stern (1996) examine the potential predictive utility of forecasts using a significance test based on Kuiper’s statistic (Press et al. 1986). This statistic represents the differences between two cumulative distribution functions (cdfs; e.g., forecast and climatology) and is preferred to the more common Kolmogorov–Smirnov (K–S) test (Knuth 1981) because of its uniform sensitivity to the entire distribution of values. However, Kuiper’s statistic can fail to reflect differences between multimodal distributions (Press et al. 1986), and it is conceivable that the modal behavior of the climate system could produce such problematic distributions. The K–S statistic and variants such as Kuiper’s statistic are not used as the basis for measuring differences between distributions because they are not robust to the shape of the distribution. As mentioned earlier, the penalty function cannot capture accurately the many varied penalties incurred by end users. Therefore, an alternative criterion must be used to judge the most suitable formulation. Resistance to outliers is a desirable attribute of a general summary score, and a linear penalty function is more resistant than the quadratic form. For this reason, the simplest means of specifying differences between distributions is chosen, based on the absolute value of the linear difference between pdfs.
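To illustrate the choice of penalty function discussed above, the short sketch below (our own illustration, with made-up example pdfs on an assumed common grid) contrasts the integrated absolute difference between two discretized pdfs, which is symmetric and bounded, with relative entropy, which is not symmetric in its arguments.

```python
import numpy as np

def abs_difference(p, q, widths):
    """Integrated absolute difference between two binned pdfs (the
    linear penalty adopted in the text); symmetric in p and q."""
    return np.sum(np.abs(p - q) * widths)

def relative_entropy(p, q, widths):
    """Kullback-Leibler divergence of p from q; not symmetric and
    very large when p places mass where q has almost none."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]) * widths[mask])

# Two made-up pdfs on a common set of equal-width bins.
edges = np.linspace(-4.0, 4.0, 41)
widths = np.diff(edges)
x = 0.5 * (edges[:-1] + edges[1:])
p = np.exp(-0.5 * x**2); p /= np.sum(p * widths)                  # ~ N(0, 1)
q = np.exp(-0.5 * (x - 1.0)**2 / 0.25); q /= np.sum(q * widths)   # ~ N(1, 0.5**2)

print(abs_difference(p, q, widths), abs_difference(q, p, widths))      # equal
print(relative_entropy(p, q, widths), relative_entropy(q, p, widths))  # differ
```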


b. The statistical consistency of forecasts

We propose that the measurement of the statistical consistency of forecasts with observations proceeds by sorting forecasts into homogeneous subsamples and gathering the corresponding observations to form observed pdfs in each subsample. The degree of similarity between the forecast and observed pdfs in each subsample constitutes the statistical consistency of the forecasts. This procedure can be viewed as an assessment of the reliability of predictions; however, Murphy (1973) proposed a technical definition of reliability as the conditional bias of forecasts, therefore restricting it to examination of the first moment alone of pdfs. The new measure is designed to examine the full characteristics of the pdfs and is labeled the full-pdf-reliability to distinguish it from the measure of conditional bias.

If there are M homogeneous subsamples, and the predictand is discretized into one of N bins, then the forecasts can be converted to M pdfs Fi,m, where i = 1, . . . , N and m = 1, . . . , M. A similar conversion can be done for the observations to yield Oi,m. The most suitable values of M and N and the creation of homogeneous subsamples are discussed in section 4. The estimate Xm of the difference between pdfs for subsample m takes the form

$$X_m = \frac{\sum_{i=1}^{N} \left|F_{i,m} - O_{i,m}\right| \Delta_i}{\sum_{i=1}^{N} \Delta_i}, \qquad (3a)$$

where Δi is the width of the bin. The average difference X over all subsamples is given by

$$\bar{X} = \frac{\sum_{m=1}^{M} X_m}{M}. \qquad (3b)$$

A positively oriented full-pdf-reliability score Rpdf that measures the similarity between the forecast and observed distributions averaged over all homogeneous subsamples, with 0 indicating complete unreliability and 1 indicating perfect reliability of the full pdf, is formed from X as follows:

$$R_{\mathrm{pdf}} = 1 - \bar{X}/2, \qquad (3c)$$

where X is halved to account for the fact that all Xm and, hence, X lie between the values of 0 and 2. The maximum value of Xm = 2 occurs when all nonzero values of one pdf occur at zero values for the other pdf, and vice versa.
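The discretization step described above, converting the forecasts and the verifying observations of one homogeneous subsample into binned pdfs Fi,m and Oi,m, could look like the following sketch. It is our illustration rather than the authors' code: the density normalization, the common bin edges, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def binned_pdf(values, edges):
    """Histogram `values` on the bins defined by `edges` and normalize
    to a density, so that sum(pdf * bin_width) = 1."""
    counts, _ = np.histogram(values, bins=edges)
    widths = np.diff(edges)
    return counts / (counts.sum() * widths), widths

# One homogeneous subsample m: ensemble forecasts (rows are start dates,
# columns are members) and the corresponding observed values.
rng = np.random.default_rng(2)
ens_subsample = rng.normal(0.8, 0.6, size=(40, 15))   # synthetic, standardized
obs_subsample = rng.normal(0.7, 0.8, size=40)

N = 16                                   # number of predictand bins
edges = np.linspace(-4.0, 4.0, N + 1)    # common bin edges for all pdfs
F_m, widths = binned_pdf(ens_subsample.ravel(), edges)
O_m, _ = binned_pdf(obs_subsample, edges)
print(F_m.round(3))
print(O_m.round(3))
```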

c. The extra information in a forecast

The extra information in a forecast is appropriately defined as differences in the information between the issued forecasts and that available in the absence of a forecast. This is parameterized as the differences between the homogeneous subsampled forecasts Fi,m and the observed climatological pdf Ci, where subscripts i and m are defined above. The estimate Ym of the difference between pdfs for subsample m takes the form


$$Y_m = \frac{\sum_{i=1}^{N} \left|F_{i,m} - C_i\right| \Delta_i}{\sum_{i=1}^{N} \Delta_i}, \qquad (4a)$$

where Δi is the width of the bin. The average difference Y over all subsamples is given by

$$\bar{Y} = \frac{\sum_{m=1}^{M} Y_m}{M}. \qquad (4b)$$

A positively oriented score that measures the extra information in the forecasts relative to the climatological observed distribution [the information quantity of a system (IQ)] averaged over all homogeneous subsamples, with a value of 0 indicating no new information and a value of 1 indicating the maximum possible amount of new information, is as follows:

$$\mathrm{IQ} = \bar{Y}/2, \qquad (4c)$$

where Y is halved to account for the fact that all Ym and, hence, Y lie between the values of 0 and 2.
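Putting Eqs. (3) and (4) together, a minimal sketch of the Rpdf and IQ computations for binned pdfs is given below. This is our illustration rather than the authors' implementation: the binned pdfs are assumed to be normalized as densities on a common set of bins, and the exact numerical range of Xm and Ym depends on that normalization convention, which is an assumption on our part.

```python
import numpy as np

def pdf_difference(p, q, widths):
    """Eqs. (3a)/(4a): width-weighted absolute difference between two
    binned pdfs p and q defined on bins of the given widths."""
    return np.sum(np.abs(p - q) * widths) / np.sum(widths)

def rpdf_and_iq(F, O, C, widths):
    """F, O: arrays of shape (M, N) holding the forecast and observed
    pdfs for each of M homogeneous subsamples; C: climatological pdf of
    length N. Returns (Rpdf, IQ) following Eqs. (3b)-(3c) and (4b)-(4c)."""
    X = np.array([pdf_difference(f, o, widths) for f, o in zip(F, O)])
    Y = np.array([pdf_difference(f, C, widths) for f in F])
    return 1.0 - X.mean() / 2.0, Y.mean() / 2.0

# Synthetic example: M = 4 subsamples, N = 16 bins on a standardized axis.
edges = np.linspace(-4.0, 4.0, 17)
widths = np.diff(edges)
x = 0.5 * (edges[:-1] + edges[1:])

def normal_pdf(mu, sigma):
    p = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return p / np.sum(p * widths)

centers = [-1.5, -0.5, 0.5, 1.5]                          # subsets sorted by median
F = np.array([normal_pdf(c, 0.7) for c in centers])        # forecast pdfs
O = np.array([normal_pdf(0.8 * c, 0.9) for c in centers])  # observed pdfs
C = normal_pdf(0.0, 1.0)                                   # climatology

print(rpdf_and_iq(F, O, C, widths))
```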

d. Uncertainties in the estimates of Rpdf and IQ The pdfs in Eqs. (3) and (4) are based upon finite data amounts in practice. Figure 2 illustrates the variable effects that finite data samples can have upon the estimates of forecast pdfs. This will lead to uncertainties in the values of the scores. The effects of these uncertainties on the scores have both a random and a systematic component, and these two components are now discussed. The random component of the uncertainty in the scores is due to sampling errors of the assumed true underlying distribution. If the true underlying distribution is constructed from the data values in a subsample then the sampling error can be estimated using the bootstrap resampling procedure (e.g., Efron 1982). In brief, the true underlying pdf is resampled to create an artificial set of data values from which a new estimate of the pdf is obtained. If this procedure is applied to the pdfs in Eqs. (3) and (4), then resampled values of the scores can be obtained, and repeated iteration of this procedure produces a sampling distribution of both Rpdf and IQ statistics representing the uncertainty in these scores. The systematic component of the uncertainty in the scores is due to the errors in estimating the true underlying distribution. In the above, it was assumed that the data values in a subsample formed the true underlying distribution; however, a finite amount of data values invalidates this assumption. This error in the sampling of the true distribution is equivalent to noise in the


values of Fi,m, Oi,m, and Ci in Eqs. (3) and (4). Noise in these quantities leads to increased values of Xm and Ym in Eqs. (3a) and (4a), respectively, and these effects lead to systematically lowered values of Rpdf and systematically raised values of IQ. A proposed method of estimating the systematic uncertainties is based on the assumption that the original data are drawn from an underlying distribution of values from a theoretical distribution, such as the normal distribution for temperatures, the gamma distribution for representing precipitation, and the Weibull distribution for representing wind speeds. The original data values are used to provide the best estimates of the parameters of the chosen underlying distribution, and the resulting pdfs can be sampled a very large number of times to obtain new estimates of the quantities in Eqs. (3) and (4) to yield the Rpdf and IQ scores. The difference in these estimates of Rpdf and IQ from those obtained using original data values to estimate the true pdf is the systematic error.

If the systematic error is accounted for by basing estimates of Rpdf and IQ on best-fitted theoretical distributions then the method of estimating the random error in these scores has to be adjusted from that described above. If there are P data values in the original subsample, then the underlying theoretical distribution can be sampled P times to obtain a new estimate of the parameters of the theoretical distribution, and this new estimate can be sampled many times to obtain resampled values of Rpdf and IQ scores. This bootstrap resampling procedure can be repeated many times to obtain an estimate of the random error in these scores due to the sampling error of the underlying distribution.

FIG. 2. A pdf generated by sampling a Gaussian random number generator 500 times and displayed using two different discretizations of the pdf. The solid line uses N = 12 bins, whereas the dashed line uses N = 120 bins.
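One way to read the bootstrap procedure of this subsection is sketched below; it is our interpretation, with a normal distribution standing in for the chosen theoretical distribution and a simple width-weighted pdf difference standing in for the full Rpdf calculation. The fitted parameters are re-estimated from P resampled values, and the spread of the resulting scores estimates the random error.

```python
import numpy as np

def score_from_fit(mu, sigma, obs_pdf, edges):
    """Stand-in for the score calculation: width-weighted absolute
    difference between a fitted normal pdf and a fixed observed pdf."""
    widths = np.diff(edges)
    x = 0.5 * (edges[:-1] + edges[1:])
    fcst_pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    fcst_pdf /= np.sum(fcst_pdf * widths)
    return np.sum(np.abs(fcst_pdf - obs_pdf) * widths) / np.sum(widths)

rng = np.random.default_rng(4)
edges = np.linspace(-4.0, 4.0, 17)
widths = np.diff(edges)

# Original subsample of P forecast values and a fixed observed pdf (synthetic).
P = 80
sample = rng.normal(0.4, 0.9, size=P)
obs_counts, _ = np.histogram(rng.normal(0.2, 1.0, size=P), bins=edges)
obs_pdf = obs_counts / (obs_counts.sum() * widths)

# Best-fit parameters of the assumed theoretical (normal) distribution.
mu_hat, sigma_hat = sample.mean(), sample.std(ddof=1)
best = score_from_fit(mu_hat, sigma_hat, obs_pdf, edges)

# Parametric bootstrap: redraw P values from the fitted normal, refit,
# and recompute the score; the spread estimates the random error.
B = 1000
boot = np.empty(B)
for b in range(B):
    resample = rng.normal(mu_hat, sigma_hat, size=P)
    boot[b] = score_from_fit(resample.mean(), resample.std(ddof=1), obs_pdf, edges)

print(best, boot.std(), np.percentile(boot, [2.5, 97.5]))
```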


4. Applying the scores in practice

The estimation of the values of Rpdf and IQ requires the specification of M and N and the choice of a suitable sorting algorithm for creating homogeneous subsets of forecasts. Discussions of these issues are given in this section, together with suggestions for appropriate options. Techniques to reduce the uncertainty in the estimates of Rpdf and IQ are described, and a practical example of these new scores is given in section 4d. Results in this section are derived from a large dataset of forecasts described in Hamill et al. (2006, hereinafter HW06). In brief, this large dataset consists of a 15-member ensemble of 0000 UTC forecasts started every day from 1979 to 2007 (and is currently being run at near–real time), with forecasts integrated to a lead time of 15 days. The operational version of the National Centers for Environmental Prediction Global Forecast System in 1998 was used, with T62 horizontal resolution and 28 vertical levels. The start dates used in the following analysis consist of one per week in a 16-week window centered on mid-July of each year. The assessment is based on forecasts of average air temperature at 2 m in the second week of the forecast (for a grid box in northwest Europe in section 4a and the whole global domain in section 4d), using normalized forecast and observed data for 1980–2004.

a. Sorting the forecasts into homogeneous subsets The dataset of historical forecasts is sorted into M subsamples, each of which contains forecasts as similar to one another as possible. The sorting is most simply based upon the median value of forecast pdfs. If this produces subsamples with significant variations in the interquartile range then one may consider doing a twostage sorting. More complex sorting procedures such as clustering techniques (e.g., Philipp et al. 2007) could be employed, and a measure of the degree of inhomogeneity of the forecast pdfs in a subsample could be devised and compared with similar measures from other sorting algorithms (for similar values of M ) to determine the benefit from using complex clustering techniques. The results from clustering techniques are obtained at very significant computational cost because each grid box in a model’s domain must be separately analyzed. Simpler techniques may prove to be more practical and sufficient for the purpose.
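A minimal version of the median-based sorting described above could look like the following sketch of ours; the use of equal-population quantile bins of the ensemble medians is an assumed detail rather than something specified in the text.

```python
import numpy as np

def sort_into_subsamples(ensembles, M):
    """Assign each forecast (row of `ensembles`) to one of M subsamples
    according to the quantile of its ensemble median, so that the
    subsamples are roughly equally populated."""
    medians = np.median(ensembles, axis=1)
    # Interior quantile edges of the median values; digitize into M groups.
    edges = np.quantile(medians, np.linspace(0.0, 1.0, M + 1)[1:-1])
    return np.digitize(medians, edges)       # labels 0 .. M-1

rng = np.random.default_rng(5)
ens = rng.normal(size=(500, 15))             # 500 start dates, 15 members (synthetic)
labels = sort_into_subsamples(ens, M=4)
print(np.bincount(labels))                   # roughly 125 forecasts per subsample
```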

b. The values of M and N End users can place constraints on the values of M and N that are specific to their applications. However, both Rpdf and IQ are intended as general measures of


usefulness, and it would be appropriate to use criteria for specifying M and N that are general rather than customer specific. Figure 3 presents results from an analysis of the HW06 dataset. Figure 3a shows the dependence of Rpdf on the values of M and N, and Fig. 3b is the corresponding plot for the IQ score. It is apparent that both scores have a significant dependence on the values of M and N. Therefore the intercomparison of scores between different systems must be made using similar values of M and N to allow meaningful conclusions to be drawn about the comparative usefulness of different systems. Figure 4 shows typical forecast and observed pdfs when the dataset is sorted into four subsets based on the median value of the forecast pdf, using N ⫽ 16. It can be seen that the shape of all pdfs is similar to the shape of the normal distribution. Therefore, the systematic error that is due to the insufficient number of data values sampling the true pdf can be addressed by employing a normal distribution with parameters fitted from the data values in each subsample, as described in section 3d. The values for the Rpdf and IQ scores are shown in Figs. 3c and 3d, respectively. The results suggest that a large part of the dependence of the scores on the values of M and N can be explained by the sampling error in the estimation of the true underlying distribution by the original dataset. In some cases the uncertainty in the nature of the theoretical distribution may produce errors of a larger magnitude than the reduction in the errors noted above. Section 4c contains a discussion of alternative options to reduce the sampling error of the underlying pdf, and this can be used either to produce better estimates of the parameters of the theoretical distribution or simply to produce a better approximation of the true distribution in the absence of any suitable theoretical distribution. The values plotted in Figs. 3c and 3d have associated random errors that are due to the uncertainty in the values of the theoretical distribution’s parameters and that can be estimated using a bootstrap method discussed in section 3d. The estimates of the summary scores using bestfitting theoretical distributions have a residual dependence on the number of homogeneous subsamples (Figs. 3c and 3d). More subsamples tend to produce sharper forecast pdfs in each subsample. Sharper forecasts will tend to produce an increase in the IQ score and reductions in Rpdf as is observed in Figs. 3c and 3d. The intercomparison of the scores from different systems will be affected by choices made in the estimation of Rpdf and IQ. It is recommended 1) that a theoretical distribution with best-fitting parameters from the dataset is used to minimize the error in estimating the true underlying distribution whenever possible and 2)



that the uncertainty in the scores is estimated using a bootstrap resampling of the estimates of the parameters of the theoretical distribution. If these choices are made then the Rpdf and IQ scores have a residual dependence on the choice of M that should be accounted for in any intercomparison. The use of these guidelines leads to a more accurate assessment of the usefulness of probabilistic forecasts.

FIG. 3. (a) Contour plot of Rpdf for various values of N and M. The value of N governs the resolution of predictand values, and M represents the number of subsets of homogeneous forecasts. (b) As in (a), but for the IQ score. (c) As in (a), but with Rpdf estimated using a best-fitting normal distribution. (d) As in (b), but with IQ estimated using a best-fitting normal distribution.

FIG. 4. The pdfs of both the forecast and observed data sorted into four subsets based on the value of the median of the forecast ensemble members for each start date.
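The first of the recommendations in this subsection (using a best-fitting theoretical distribution) can be illustrated with a short sketch of ours using synthetic data: the pdf-difference ingredient of the scores is computed once from raw histograms and once from best-fitting normal distributions evaluated on the same bins, and the latter is typically much less sensitive to the number of bins N.

```python
import numpy as np

def raw_pdf(values, edges):
    counts, _ = np.histogram(values, bins=edges)
    return counts / (counts.sum() * np.diff(edges))

def fitted_normal_pdf(values, edges):
    x = 0.5 * (edges[:-1] + edges[1:])
    widths = np.diff(edges)
    p = np.exp(-0.5 * ((x - values.mean()) / values.std(ddof=1)) ** 2)
    return p / np.sum(p * widths)

def abs_diff(p, q, widths):
    return np.sum(np.abs(p - q) * widths)

rng = np.random.default_rng(6)
fcst = rng.normal(0.3, 0.8, size=100)     # small subsample of forecast values
obs = rng.normal(0.1, 1.0, size=100)      # corresponding observed values

for N in (12, 120):                        # cf. Fig. 2: two discretizations
    edges = np.linspace(-4.0, 4.0, N + 1)
    widths = np.diff(edges)
    raw = abs_diff(raw_pdf(fcst, edges), raw_pdf(obs, edges), widths)
    fit = abs_diff(fitted_normal_pdf(fcst, edges), fitted_normal_pdf(obs, edges), widths)
    print(N, round(raw, 3), round(fit, 3))
```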

c. Reducing uncertainties by spatial and temporal aggregation of data The magnitude of both the systematic and random uncertainties in the values of the summary scores is

dependent upon the number of start dates Ns that contain independent information in the assessment dataset. The constraints on the value of Ns are discussed in this section, and some strategies to reduce uncertainties are proposed. In general, a forecasting system’s quality will vary according to the processes being simulated, and it is desirable for forecast assessments to distinguish between periods for which different processes dominate (e.g., Shongwe et al. 2007; Rodwell and Folland 2002; Lenderink et al. 2007). Therefore, it is inappropriate to group together all forecasts throughout the annual cycle. Furthermore, the forecast quality can be ex-



pected to depend upon the quality of the initialization; hence the magnitude of Ns is restricted by the nonstationary nature of the observing system. The computational requirements of a probabilistic forecasting system can also restrict the value of Ns, because operational systems are often designed to utilize a significant fraction of this expensive resource. Using seasonal forecasting as an example, the Met Office and European Centre for Medium-Range Weather Forecasts operational postprocessing of seasonal predictions gathers forecasts based on the start month, and the effective number of samples Ns,e used for calibration and postprocessing is equal to Ns /12. HW06 combine daily forecasts within a 91-day window centered on the date of interest in their assessment of daily precipitation forecasts; therefore, Ns,e is approximately Ns /4. The results shown in Figs. 3 and 4 are from a 16-week window centered in mid-July; therefore, Ns,e is approximately Ns /3. The optimal temporal aggrega-


tion period may depend upon location; for instance, in tropical regions it could be satisfactory to aggregate six months of start dates centered on the specific start date of interest, whereas in the extratropics the aggregation window may be only 3 or 4 months wide. Finding the optimal temporal aggregation of forecasts will produce smaller sampling errors, thereby reducing uncertainty in results. The spatial aggregation of data provides a means of reducing high-frequency spatial noise. The spectrum of the signal can have much less power at short wavelengths than the sampling error; therefore, the spatial aggregation at small scales can produce a smaller sampling error with negligible reduction of the signal. For example, longer-range forecasting on the time scale of months is more concerned with the lowest zonal wavenumbers where anomalies have most variance. The current longitudinal resolution of long-range prediction systems is about 1° or 2°; therefore, there is scope for spatial aggregation to reduce sampling errors. Figures 5a and 5b are analogous to Figs. 3a and 3b, except the eight adjacent neighbors’ data have been combined with the selected grid box. The variability of the scores as a function of M and N is significantly reduced by the spatial aggregation of data. This is due to smaller sampling errors in the estimate of the pdf in a subsample. Care should be taken with the temporal and spatial aggregation of values. The use of local anomalies, relative to an appropriate local climatology, ensures no degradation in the representation of the first moment of pdfs. The use of normalized anomalies ensures both the first and second moments are represented properly, and this method was applied to produce the results presented in Figs. 3, 4, and 5. HW06 applied rank ordering of values at different locations before applying the spatial and temporal aggregation of forecast information to produce results that are superior to both unadjusted and local anomaly forecasts. Transforming all values into an order by rank corrects for the differences in all moments of the pdfs that are being aggregated and is recommended.
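The local normalization followed by pooling of neighboring grid boxes can be sketched as follows; this is our illustration, and the 3 x 3 neighborhood, the synthetic temperatures, and the normalization over the full time dimension are assumptions made for the example.

```python
import numpy as np

def normalized_anomalies(data):
    """Standardize each grid box separately over the time dimension so
    that the first two moments of the pooled values are comparable.
    `data` has shape (time, ny, nx)."""
    mean = data.mean(axis=0)
    std = data.std(axis=0, ddof=1)
    return (data - mean) / std

def pool_neighbourhood(data, j, i):
    """Gather the time series of grid box (j, i) together with its eight
    adjacent neighbours into a single sample (cf. Fig. 5)."""
    block = data[:, j - 1:j + 2, i - 1:i + 2]
    return block.reshape(block.shape[0], -1).ravel()

rng = np.random.default_rng(7)
temps = rng.normal(285.0, 3.0, size=(400, 20, 30))   # synthetic 2-m temperatures
anoms = normalized_anomalies(temps)
pooled = pool_neighbourhood(anoms, j=10, i=15)
print(pooled.shape, pooled.mean().round(3), pooled.std(ddof=1).round(3))
```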

d. Practical example The Rpdf and IQ scores were computed for the entire global domain for the weekly-averaged near-surface air temperature at one-week lead time over the boreal summer period (June–August) from 1979 to 2004. The values of M and N were chosen as a compromise between the end users’ need for the most precise forecast possible (requiring large values of M and N ) and the desire to reduce the error in the estimates of the scores (achieved by the use of smaller values of M and N ).


FIG. 5. As in Figs. 3a,b, but using results based on data from the same grid box aggregated with its eight adjacent neighbors.

This compromise is made within the context of generally low levels of sharpness in forecasts of this predictand. After taking all of these factors into account, the values of M = 4 and N = 32 were chosen to provide estimates of the scores for forecasts. The forecast data were used to generate best-fitting parameters for a normal distribution, and a similar procedure was done for observations, to reduce the effects of sampling errors. The results are shown in Fig. 6, and for comparison the values from the RPS, which is a common measure of probabilistic forecasts, are included. The RPS is a negatively oriented score; therefore, lower values indicate closer agreement between forecasts and observations. Its estimate is based upon a discretization of the predictand values into 32 intervals, the same as was used for the Rpdf and IQ scores. The results in Fig. 6 indicate that Rpdf is very high globally, with larger values occurring in the extratropics. In contrast, higher values of the IQ score occur in the tropics and especially over the oceans. The higher IQ scores in the tropics are due to the more persistent nature of weather on the weekly time scale relative to the extratropics, and especially over the tropical oceans, which act to redden the spectra of atmospheric variability. Small perturbations to the initial state grow more slowly than in the midlatitudes. However, the smaller values of Rpdf in the tropics and especially over the oceans are notable: the tropical forecast pdfs may contain more information but this information is slightly less reliable. The minima in the RPS score tend to occur in regions where there is a maximum in either of the Rpdf or IQ scores, with some exceptions such as off the northwestern coast of Australia. An example of the shortcomings of the RPS in assessing probabilistic forecasts can be seen in the eastern tropical Pacific Ocean. This region has a global minimum value of RPS, indicating the best quality of forecast pdfs. However, the Rpdf values indicate that the forecast pdf is less reliable here than in many other regions, and this result needs to be accounted for in probabilistic forecast statements. In this example the RPS is in general very sensitive to the two aspects that compose the usefulness of a forecasting system, although it is unable to distinguish between them. Moreover, for reasons discussed in section 2, the RPS is not sensitive to the full characteristics of probabilistic forecasts. Therefore, the Rpdf and IQ scores provide a more robust estimate of the usefulness of forecasts.

FIG. 6. Global maps of (a) Rpdf, (b) IQ, and (c) RPS for the weekly mean near-surface air temperature forecasts at one-week lead time for boreal summer cases from 1979 to 2004.
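For reference, the RPS used for comparison in Fig. 6 can be written down in a few lines. This sketch is ours and follows the usual definition of the RPS as a sum of squared differences between cumulative forecast and observed probabilities over the ordered categories of a discretized predictand, here with 32 intervals as in the text; the ensemble values are synthetic.

```python
import numpy as np

def rps(forecast_probs, obs_category):
    """Ranked probability score for one forecast: squared differences
    between the cumulative forecast probabilities and the cumulative
    (step-function) observation over the ordered categories."""
    cum_fcst = np.cumsum(forecast_probs)
    cum_obs = np.zeros_like(cum_fcst)
    cum_obs[obs_category:] = 1.0
    return np.sum((cum_fcst - cum_obs) ** 2)

rng = np.random.default_rng(8)
edges = np.linspace(-4.0, 4.0, 33)         # 32 intervals, as in section 4d

# One ensemble forecast converted to category probabilities, plus the
# category containing the verifying observation.
members = rng.normal(0.5, 0.8, size=15)
probs, _ = np.histogram(members, bins=edges)
probs = probs / probs.sum()
obs_cat = np.searchsorted(edges, 0.9) - 1

print(rps(probs, obs_cat))
```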

5. Summary and future work This paper presented two new scores to assess the usefulness of probabilistic forecasts that, unlike standard summary scores, depend explicitly on the full probability distribution function. There are two aspects that are sufficient to define the usefulness of probabilistic predictions to end users: 1) the statistical consistency between the forecast and the observations and 2) the extra information contained in the forecast relative to the end user’s knowledge in the absence of a forecast. These two distinct aspects are evaluated by the proposed new scores: the full-pdfreliability Rpdf measures statistical consistency and the information quantity IQ measures the extra information in a forecast. The basis of the new scores consists of penalizing the





forecast for differences from the ideal behavior for the aspect of performance being considered. The proposed scores are designed to be appropriate measures of the general utility of probabilistic forecasts, and a simple linear functional form has been chosen to penalize differences between pdfs. As with all assessment measures, sampling errors due to the use of finite data amounts need consideration when interpreting the estimated values of the scores. The assessment of the full characteristics of forecast pdfs rather than single moments is equivalent to evaluating forecasts in greater detail than with previous scores, and as a consequence the sampling errors would be expected to be larger. The choices of the parameters M and N affect the sampling errors by altering the amount of data used to estimate pdfs. The use of appropriately shaped theoretical distributions greatly reduces the sampling error in the estimates of the new scores. It is recommended that the sharpness of forecasts be examined to find the smallest values of M and N. For instance, long-range predictions are typically issued in the form of tercile or quintile probabilities; therefore, forecasts need only be classified into three or five subsamples. Furthermore, the aggregation of forecasts over larger spatial regions and longer time periods increases the effective number of samples and thus reduces the sampling errors. Methods to further reduce uncertainties in Rpdf and IQ will be sought in the future. As illustrated by the assessment of a large dataset of medium-range forecasts, the Rpdf and IQ scores can provide robust estimates of the usefulness of a forecasting system. Crucially, the Rpdf and IQ scores differentiate between the two independent aspects of forecast usefulness, allowing the identification of areas where the system is adding additional information to our base knowledge (e.g., climatology). This is an important property of these scores, especially for time scales such as seasonal prediction, for which the current operational systems have only a modest level of skill—which varies depending on the month, lead time, and region— beyond that of using climatology. The application of these scores to the hindcast datasets generated by the major centers that produce seasonal forecasts will be completed in future work. Acknowledgments. This work was funded by the U.K. Government Meteorological Research Programme. The authors thank Jim Hansen for many constructive comments on the contents of earlier drafts. The authors also thank the anonymous reviewers for their comments, many of which led to improvements to the final manuscript.

REFERENCES Anderson, J. L., 1996: A method for producing and evaluating probabilistic forecasts from ensemble model integrations. J. Climate, 9, 1518–1530. ——, and W. F. Stern, 1996: Evaluating the potential predictive utility of ensemble forecasts. J. Climate, 9, 260–269. Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3. Cover, T. M., and J. A. Thomas, 1991: Elements of Information Theory. John Wiley and Sons, 542 pp. Efron, B., 1982: The Jackknife, the Bootstrap, and Other Resampling Plans. Society for Industrial and Applied Mathematics, 92 pp. Graham, R. J., M. Gordon, P. J. McLean, S. Ineson, M. R. Huddleston, M. K. Davey, A. Brookshaw, and R. T. H. Barnes, 2005: A performance comparison of coupled and uncoupled versions of the Met Office seasonal prediction general circulation model. Tellus, 57A, 320–339. Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea. Rev., 129, 550–560. ——, and S. J. Colucci, 1997: Verification of Eta–RSM shortrange ensemble forecasts. Mon. Wea. Rev., 125, 1312–1327. ——, and J. S. Whitaker, 2006: Probabilistic quantitative precipitation forecasts based on reforecast analogs: Theory and application. Mon. Wea. Rev., 134, 3209–3229. ——, ——, and S. L. Mullen, 2006: Reforecasts: An important dataset for improving weather predictions. Bull. Amer. Meteor. Soc., 87, 33–46. Harrison, M. S. J., D. S. Richardson, K. Robertson, and A. Woodcock, 1995: Medium-range ensembles using both the ECMWF T63 and unified models—An initial report. UKMO Tech. Rep. 153, 25 pp. Kleeman, R., 2002: Measuring dynamical prediction utility using relative entropy. J. Atmos. Sci., 59, 2057–2072. Knuth, D. E., 1981: Seminumerical Algorithms. Vol. 2, The Art of Computer Programming, Addison-Wesley, 688 pp. Lenderink, G., A. P. van Ulden, B. van den Hurk, and E. van Meijgaard, 2007: Summertime inter-annual temperature variability in an ensemble of regional model simulations: Analysis of the surface energy budget. Climatic Change, 81, 233– 247. Lorenz, E. N., 1963: Deterministic nonperiodic flow. J. Atmos. Sci., 20, 130–141. ——, 1993: The Essence of Chaos. University of Washington Press, 227 pp. Murphy, A. H., 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600. ——, 1991: Forecast verification: Its complexity and dimensionality. Mon. Wea. Rev., 119, 1590–1601. ——, 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281– 293. ——, and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330–1338. ——, B. G. Brown, and Y.-S. Chen, 1989: Diagnostic verification of temperature forecasts. Wea. Forecasting, 4, 485–501. Palmer, T. N., F. Molteni, R. Mureau, R. Buizza, P. Chapelet, and J. Tribbia, 1993: Ensemble prediction. Proc. Seminar on Validation of Models over Europe, Vol. 1, Reading, United Kingdom, European Centre for Medium-Range Weather Forecasts, 21–66. Philipp, A., P. M. Della-Marta, J. Jacobeit, D. R. Fereday, P. D.


Jones, A. Moberg, and H. Wanner, 2007: Long-term variability of daily North Atlantic–European pressure patterns since 1850 classified by simulated annealing clustering. J. Climate, 20, 4065–4095. Press, W. H., B. P. Flannery, S. A. Teulosky, and W. T. Vetterling, 1986: Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 818 pp. Rodwell, M. J., and C. K. Folland, 2002: Atlantic air–sea interaction and seasonal predictability. Quart. J. Roy. Meteor. Soc., 128, 1413–1443. Roulston, M. S., and L. A. Smith, 2002: Evaluating probabilistic forecasts using information theory. Mon. Wea. Rev., 130, 1653–1660. Rowell, D. P., C. K. Folland, K. Maskell, and M. N. Ward, 1995: Variability of summer rainfall over tropical North Africa (1906–92): Observations and modelling. Quart. J. Roy. Meteor. Soc., 121, 669–704. Shongwe, M. E., C. A. T. Ferro, C. A. S. Coelho, and G. J. van Oldenburgh, 2007: Predictability of cold spring seasons in Europe. Mon. Wea. Rev., 135, 4185–4201.


Talagrand, O., R. Vautard, and B. Strauss, 1997: Evaluation of probabilistic prediction systems. Proc. Workshop on Predictability, Reading, United Kingdom, ECMWF, 1–25. Toth, Z., and E. Kalnay, 1993: Ensemble forecasting at NMC: The generation of perturbations. Bull. Amer. Meteor. Soc., 74, 2317–2330. ——, O. Talagrand, G. Candille, and Y. Zhu, 2003: Probability and ensemble forecasts. Forecast Verification: A Practitioner’s Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., John Wiley and Sons, 137–163. Ward, M. N., and C. K. Folland, 1991: Prediction of seasonal rainfall in the north Nordeste of Brazil using eigenvectors of sea-surface temperature. Int. J. Climatol., 11, 711–743. Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences. 2nd ed. Academic Press, 627 pp. WMO, 2002: Standardised verification system (SVS) for longrange forecasts (LRF): New attachment II-9 to the manual on the GDPS (WMO 485). Vol. 1, WMO, Geneva, Switzerland, 21 pp.
