A Framework for Evaluating Climate Model Performance Metrics

NOEL C. BAKER AND PATRICK C. TAYLOR

NASA Langley Research Center, Hampton, Virginia

(Manuscript received 6 February 2015, in final form 21 December 2015)

ABSTRACT

Given the large amount of climate model output generated from the series of simulations from phase 5 of the Coupled Model Intercomparison Project (CMIP5), a standard set of performance metrics would facilitate model intercomparison and tracking performance improvements. However, no framework exists for the evaluation of performance metrics. The proposed framework systematically integrates observations into metric assessment to quantitatively evaluate metrics. An optimal metric is defined in this framework as one that measures a behavior that is strongly linked to model quality in representing mean-state present-day climate. The goal of the framework is to objectively and quantitatively evaluate the ability of a performance metric to represent overall model quality. The framework is demonstrated, and the design principles are discussed, using a novel set of performance metrics, which assess the simulation of top-of-atmosphere (TOA) and surface radiative flux variance and probability distributions within 34 CMIP5 models against Clouds and the Earth's Radiant Energy System (CERES) observations and the GISS Surface Temperature Analysis (GISTEMP). Of the 44 tested metrics, the optimal metrics are found to be those that evaluate global-mean TOA radiation flux variance.
1. Introduction

Efforts to standardize climate model experiments and collect simulation data—such as the Coupled Model Intercomparison Project (CMIP)—provide the means to intercompare and evaluate climate models. The evaluation of models using observations is a critical component of model assessment. Performance metrics—or the systematic determination of model biases—succinctly quantify aspects of climate model behavior. With the many global climate models that participate in CMIP, it is necessary to compare individual model performance and to track changes due to development, allowing the quantitative assessment of progress between generations of models and the evaluation of individual parameterization changes to each model (Reichler and Kim 2008). A wide variety of performance metrics have been developed (e.g., Willmott 1982; Taylor 2001; Giorgi and Mearns 2002; Qu and Hall 2006; Pincus et al. 2008; Williams and Webb 2009; Christensen et al. 2010). Metrics typically quantify model performance in
reproducing a single climate characteristic or process. Several assessments, however, take a broader approach and determine performance using a combined set of diagnostics, creating a single performance index. Reichler and Kim (2008) used the squared differences between mean-state modeled and observed climate quantities, normalized by the observed variance. Gleckler et al. (2008) created a set of performance metrics using the root-mean-squared differences between simulated and observed climate. Waugh and Eyring (2008) focused on stratospheric processes in chemistry climate models to assign model performance grades. While these studies signify advances in model performance evaluation, Knutti (2010) argues that there is a strong need for a wider variety of performance metrics. A further need for an optimal suite of metrics, developed from this broader set, has also been expressed (Gleckler et al. 2008; Knutti et al. 2010). The WGNE/WGCM Climate Model Metrics Panel [jointly established by the Working Group on Numerical Experimentation (WGNE) and the Working Group on Coupled Modelling (WGCM)] is a recent effort in this direction. Coordinating with the activities of the World Climate Research Programme (WCRP), the primary panel objective is to identify a standard set of performance metrics to benchmark climate models. The panel has taken the first clear steps toward compiling a unified
suite of performance metrics. Despite the large number of individual efforts to propose and demonstrate performance metrics, no framework exists to facilitate the selection of an optimal suite of metrics through a quantitative metric assessment and intercomparison. A unique framework addressing this critical gap is described and demonstrated here. The general design principles behind the framework are described in section 4. The framework is then demonstrated using a set of previously unexplored performance metrics that evaluate the simulation of top-of-atmosphere (TOA) and surface radiative flux variance and probability distributions within 34 CMIP5 models against Clouds and the Earth's Radiant Energy System (CERES) observations and the GISS Surface Temperature Analysis (GISTEMP). The model output and data used are described in section 2, and the performance metric calculations are presented in section 3. Results are presented in section 5, and a summary and conclusions are provided in section 6.
2. Data and models

This study uses the newest ensemble of CMIP5 climate models, limited to 34 models for which the necessary variables and data records are available (WCRP 2014). These models and their respective modeling centers are listed in Table 1. For each model, simulation data from two experiments were obtained: "preindustrial control," which uses nonevolving preindustrial conditions with prescribed atmospheric gas concentrations and unperturbed land use, and "historical," which imposes changing conditions of atmospheric gas concentrations and land use consistent with observations (see Taylor et al. 2012). Monthly mean model output is processed at its native resolution and then interpolated to a 1° horizontal resolution grid. The quantities used and their associated abbreviations are listed in Table 2.

Monthly mean observations are taken from several sources. Radiation fluxes are from NASA's CERES Energy Balanced and Filled TOA (EBAF-TOA; Loeb et al. 2009) and EBAF-Surface Ed2.8 (Kato et al. 2013) radiation flux products. All currently available CERES data (March 2000–October 2014) are used. While a longer record is desirable, the dataset provides information suitable for climate model evaluation (e.g., Hartmann and Ceppi 2014). CERES stability is verified to be better than 0.2 W m⁻² decade⁻¹—suitable for analysis of interannual variability (Loeb et al. 2012). Doelling et al. (2013) show that monthly mean RMS errors in the radiative fluxes are <1%. Observed surface temperature anomalies are obtained from NASA's GISTEMP (Hansen et al. 2010) gridded land–ocean temperature index maps (2° horizontal resolution), expressed as monthly mean deviations from the corresponding 1951–80 means. The data record range used is the same as for the radiation fluxes. Improved measurements in more recent decades have reduced the 95% confidence range uncertainty to about 0.05°C (Hansen et al. 2010).

TABLE 1. CMIP5 models analyzed in this study (http://cmip-pcmdi.llnl.gov/cmip5/), listed by institution and country.

Commonwealth Scientific and Industrial Research Organisation (CSIRO) and Bureau of Meteorology, Australia: ACCESS1.0; ACCESS1.3
Beijing Climate Center, China Meteorological Administration: BCC_CSM1.1; BCC_CSM1.1(m)
Canadian Centre for Climate Modelling and Analysis: CanESM2
U.S. National Center for Atmospheric Research: CCSM4
U.S. National Center for Atmospheric Research Community Earth System Model contributors: CESM1 with biogeochemistry [CESM1(BGC)]; CESM1(CAM5); CESM1(FASTCHEM); CESM1(WACCM)
Centre National de Recherches Météorologiques/Centre Européen de Recherche et de Formation Avancée en Calcul Scientifique, France: CNRM-CM5
Commonwealth Scientific and Industrial Research Organisation in collaboration with the Queensland Climate Change Centre of Excellence, Australia: CSIRO Mk3.6.0
LASG, Institute of Atmospheric Physics, Chinese Academy of Sciences and Center for Earth System Sciences (CESS), Tsinghua University: FGOALS-g2
U.S. NOAA/Geophysical Fluid Dynamics Laboratory: GFDL CM3; GFDL-ESM2G; GFDL-ESM2M
U.S. NASA Goddard Institute for Space Studies: GISS-E2-H; GISS-E2-H-CC; GISS-E2-R; GISS-E2-R-CC
Met Office Hadley Centre, United Kingdom: HadGEM2-CC; HadGEM2-ES
Institute of Numerical Mathematics, Russia: INM-CM4.0
L'Institut Pierre-Simon Laplace, France: IPSL-CM5A-LR; IPSL-CM5A-MR; IPSL-CM5B-LR
Atmosphere and Ocean Research Institute (The University of Tokyo), National Institute for Environmental Studies, and Japan Agency for Marine-Earth Science and Technology: MIROC4h; MIROC5
Japan Agency for Marine-Earth Science and Technology, Atmosphere and Ocean Research Institute (The University of Tokyo), and National Institute for Environmental Studies: MIROC-ESM
Max Planck Institute for Meteorology, Germany: MPI-ESM-LR; MPI-ESM-MR; MPI-ESM-P
Norwegian Climate Centre: NorESM1-M; NorESM1-ME

TABLE 2. List of climatological quantities, derived metrics, and associated abbreviations used in this study.

Climate quantities (abbreviation):
Surface temperature (Ts)
Top-of-atmosphere outgoing longwave radiation flux, all-sky conditions (OLR)
Top-of-atmosphere outgoing longwave radiation flux, clear-sky conditions (OLR clear-sky)
Top-of-atmosphere outgoing longwave radiation flux, cloudy-sky conditions, taken as the difference between all sky and clear sky (OLR cloudy-sky)
Top-of-atmosphere reflected shortwave radiation flux, all-sky conditions (SW)
Top-of-atmosphere reflected shortwave radiation flux, clear-sky conditions (SW clear-sky)
Top-of-atmosphere reflected shortwave radiation flux, cloudy-sky conditions, taken as the difference between all sky and clear sky (SW cloudy-sky)
Surface downwelling longwave radiation flux, all-sky conditions
Surface downwelling longwave radiation flux, clear-sky conditions
Surface downwelling shortwave radiation flux, all-sky conditions
Surface downwelling shortwave radiation flux, clear-sky conditions
Surface upwelling longwave radiation, all-sky conditions
Surface upwelling shortwave radiation, all-sky conditions
Surface upwelling shortwave radiation, clear-sky conditions

Derived metrics (abbreviation):
Ratio of OLR to Ts, calculated at each monthly time step in which the Ts anomaly is ≥0.1°C or ≤−0.1°C (OLR/Ts)
Ratio of OLR (clear-sky) to Ts, calculated at each monthly time step in which the Ts anomaly is ≥0.1°C or ≤−0.1°C (OLR(clear-sky)/Ts)
Ratio of OLR (cloudy-sky) to Ts, calculated at each monthly time step in which the Ts anomaly is ≥0.1°C or ≤−0.1°C (OLR(cloudy-sky)/Ts)
Least squares linear regression fit of the OLR time series to the Ts time series (OLR Ts regression)
Least squares linear regression fit of the OLR (cloudy-sky) time series to the Ts time series (OLR(cloudy-sky) Ts regression)
Ratio of SW to Ts, calculated at each monthly time step in which the Ts anomaly is ≥0.1°C or ≤−0.1°C (SW/Ts)
Ratio of SW (clear-sky) to Ts, calculated at each monthly time step in which the Ts anomaly is ≥0.1°C or ≤−0.1°C (SW(clear-sky)/Ts)
Ratio of SW (cloudy-sky) to Ts, calculated at each monthly time step in which the Ts anomaly is ≥0.1°C or ≤−0.1°C (SW(cloudy-sky)/Ts)
Least squares linear regression fit of the SW time series to the Ts time series (SW Ts regression)
Least squares linear regression fit of the SW (cloudy-sky) time series to the Ts time series (SW(cloudy-sky) Ts regression)
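The gridded monthly fields listed above are reduced to global-mean anomaly series before the metrics of section 3 are computed. The minimal sketch below shows one way to perform the area-weighted spatial reduction; it assumes a (time, lat, lon) array already interpolated to the common 1° grid (the interpolation step itself is not shown), and the function name is chosen here for illustration only.

```python
import numpy as np

def global_mean_series(field, lat):
    """Area-weighted (cosine-latitude) global mean of a (time, lat, lon) field,
    returning one value per month."""
    w = np.cos(np.deg2rad(lat))
    w = w / w.sum()
    # average over longitude first, then apply the latitude weights
    return (field.mean(axis=2) * w[None, :]).sum(axis=1)
```

Deseasonalizing and detrending the resulting series (step 1 in section 3) is illustrated in the sketch that follows Eq. (4).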
3. Performance metrics

The climate model performance metrics used in this study focus on global-mean variance, probability distributions, and relationships between variables fundamental to the earth's energy budget, because a trustworthy model is expected to reliably reproduce characteristics of the energy budget (e.g., Hartmann and Ceppi 2014). Using interannual-variability process relationships in Earth's energy budget as performance metrics is justified by their fundamental links with the climate state and climate sensitivity (e.g., Hansen et al. 1984; Wetherald and Manabe 1988; Taylor et al. 2011a,b; Lambert and Taylor 2014). These metrics have been selected because the statistical evaluation of radiation flux variance and probability distribution quantities has not previously been investigated as a source of model performance metrics. Variance within the climate system is generated by physical processes and is therefore considered an integrated measure of climate processes. A new set of performance metrics is also introduced that characterizes regression relationships between TOA flux variables and surface temperature. These regression relationships represent interannual-time-scale radiative feedbacks similar to those of Dessler (2013) and Armour et al. (2013). Descriptions of the derived quantities used to construct each metric are listed in Table 2.

The performance of each CMIP5 model is assessed against observed conditions using performance metrics. Five metric classes are introduced in this study to produce 44 different performance metrics. The first metric class statistically compares the similarity of a modeled and observed time series; variance is tested using the two-sample F test for equal variances (denoted the variance test), and distribution similarity is tested using the two-sample Kolmogorov–Smirnov test (K–S test). Weights for this performance metric class are calculated using the following steps.

1) A global-mean monthly detrended, deseasonalized anomaly time series is computed for each model and observational record, using the last 100 yr of the preindustrial control experiment for each model and the most recent 14 yr of the CERES and GISTEMP records.
2) The observed time series is evaluated against the first 14-yr section of the modeled time series using the chosen statistical test. The test evaluates the statistical similarity (at the 95% confidence level) between the modeled and the observed time series: a test pass P occurs when the model and observations are statistically similar, and a test failure F occurs when they are statistically different.

3) The test is then repeated using sequential 14-yr sections of the model-simulated time series with a one-month moving window. The motivation for this approach is that the 100-yr modeled time series is nonstationary and heteroscedastic; selecting time sections of the same length as the observations allows the variances to be compared over the length of the series with multiple samples.
4) The final metric value w_i is calculated as the ratio of time sections that pass the statistical test, resulting in a metric value between 0 and 1:

w_i = P / (P + F).    (1)

An illustrative code sketch of this window-based scoring procedure is given after Eq. (4).

The second metric class compares the variance of the magnitude of changes at each time step, called "local variance," between each model and observations. This concept is based on a method used for statistical analysis in behavioral psychology (Madison 2001). It is defined as the measure of variance (var) of the first derivative of the time series of the simulated variable x:

v_local = var(dx/dt).    (2)
A time series of derivative values is generated using the forward difference method. This concept is similar to the first-lag autocorrelation of a time series, reflecting the 1-month-lag responses to changes. Then the final metric values are computed following the procedure for the first metric class using the variance test.

The third metric class quantifies overlap between modeled and observed cumulative probability distributions. The earth mover's distance (EMD) is traditionally used as an empirical method for color-based image retrieval (Rubner et al. 2000) and has been used in atmospheric science to isolate the difference between cloud type characteristics by Xu et al. (2005). It is defined mathematically as the measure of distance between two cumulative probability distributions:

EMD = Σ_{i=1}^{N} |cdf(x_i) − cdf(o_i)|,    (3)
where cdf(o_i) and cdf(x_i) represent the global-mean cumulative distribution functions of the observed and simulated anomaly time series, respectively. A larger EMD value denotes a model with a probability distribution that is less similar to observations; that is, a model that performs worse has a larger EMD value. A normalization procedure is used to rank the models and create metric values between 0 and 1, in which the metric value w_i for each model i is calculated by subtracting the lowest EMD value from all models, dividing by the highest EMD value, and then subtracting from 1:

w_i = 1 − [EMD_i − min(EMD_i)] / max(EMD_i).    (4)
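To make the preceding definitions concrete, the sketch below outlines how the first three metric classes could be computed for a single model. It is an illustrative implementation only: the helper names, the use of scipy.stats for the F and K–S tests, the binning of the empirical distributions in the EMD step, and the assumption of whole-year monthly records are choices made here and are not prescribed by the text above.

```python
import numpy as np
from scipy import stats

def monthly_anomalies(series):
    """Deseasonalize and detrend a global-mean monthly series (step 1).
    Assumes the record covers whole years (length divisible by 12)."""
    x = np.asarray(series, dtype=float)
    clim = x.reshape(-1, 12).mean(axis=0)               # mean seasonal cycle
    anom = x - np.tile(clim, x.size // 12)
    slope, intercept = np.polyfit(np.arange(anom.size), anom, 1)
    return anom - (slope * np.arange(anom.size) + intercept)

def window_score(model_anom, obs_anom, test="variance", alpha=0.05):
    """First metric class, Eq. (1): fraction of 14-yr model windows that are
    statistically similar to the observed record, w_i = P / (P + F)."""
    n = obs_anom.size                                    # e.g., 14 yr x 12 months
    n_windows = model_anom.size - n + 1                  # one-month moving window
    passes = 0
    for start in range(n_windows):
        window = model_anom[start:start + n]
        if test == "variance":
            # Two-sample F test for equal variances (two-sided p value)
            f = np.var(window, ddof=1) / np.var(obs_anom, ddof=1)
            p = 2.0 * min(stats.f.cdf(f, n - 1, n - 1), stats.f.sf(f, n - 1, n - 1))
        else:
            # Two-sample Kolmogorov-Smirnov test on the distributions
            p = stats.ks_2samp(window, obs_anom).pvalue
        passes += p >= alpha                             # pass: similarity not rejected
    return passes / n_windows

# Second metric class, Eq. (2): apply the variance test to the forward-differenced
# series, whose variance is the "local variance":
#   w_local = window_score(np.diff(model_anom), np.diff(obs_anom), "variance")

def emd(model_anom, obs_anom, bins=50):
    """Third metric class, Eq. (3): distance between the empirical cumulative
    distributions of the simulated and observed anomaly series."""
    lo = min(model_anom.min(), obs_anom.min())
    hi = max(model_anom.max(), obs_anom.max())
    edges = np.linspace(lo, hi, bins + 1)
    cdf_x = np.cumsum(np.histogram(model_anom, edges)[0]) / model_anom.size
    cdf_o = np.cumsum(np.histogram(obs_anom, edges)[0]) / obs_anom.size
    return np.abs(cdf_x - cdf_o).sum()

def emd_weights(emd_values):
    """Eq. (4): normalize the EMD values of all models into weights in [0, 1];
    the model with the smallest EMD receives a weight of 1."""
    e = np.asarray(emd_values, dtype=float)
    return 1.0 - (e - e.min()) / e.max()
```

In this sketch the same window_score routine handles both the variance test and the K–S test, mirroring the way the first, second, and fourth metric classes reuse the scoring procedure defined in steps 1–4.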
The fourth performance metric class represents a subset of metrics that compare modeled and observed relationships between global-mean TOA radiation changes and changes in global-mean surface temperature. For the first type of metric, a time series is generated that computes the ratio of monthly TOA radiation
anomalies to monthly surface temperature anomalies. Final metric values are computed using the procedure for the first metric class with either the variance test or the K–S test. The second type of metric computes a least squares linear regression relationship between a TOA radiation anomaly time series and a surface temperature time series. The final metric value is computed following the procedure for the first metric class using a two-sample t test for equal means without assuming equal variances (the means test). Details for these metric calculations are found in the derived metrics section of Table 2.

The fifth performance metric class uses spatial correlation to compare modeled and observed TOA radiation flux variance. A gridded map of variance is computed for each TOA radiation quantity using the last 14 yr of the preindustrial control experiment for each model and the most recent 14 yr of the CERES observation record, with the model fields interpolated to the CERES horizontal resolution. The squared correlation coefficient R² is computed between the modeled and observed variance maps, and the final metric value w_i for each model is set equal to R², in which higher values denote higher spatial correlation and thus better model performance.
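The two remaining classes can be sketched in the same style. The constructions below follow the descriptions of the fourth class and Table 2 (including the 0.1°C anomaly threshold) and the fifth class; the function and argument names are again choices made for this illustration, and the means-test step of the regression metric is omitted for brevity.

```python
import numpy as np
from scipy import stats

def ratio_series(toa_anom, ts_anom, threshold=0.1):
    """Fourth metric class, first type: ratio of monthly TOA flux anomalies to
    surface temperature anomalies, keeping months where |Ts anomaly| >= 0.1 degC
    (Table 2). The resulting series is scored with the variance or K-S test."""
    keep = np.abs(ts_anom) >= threshold
    return toa_anom[keep] / ts_anom[keep]

def regression_slope(toa_anom, ts_anom):
    """Fourth metric class, second type: least squares regression of a TOA flux
    anomaly series onto the surface temperature series."""
    return stats.linregress(ts_anom, toa_anom).slope

def variance_map_score(model_field, obs_field):
    """Fifth metric class: squared correlation (R^2) between gridded variance
    maps; inputs are (time, lat, lon) arrays on a common grid."""
    var_model = model_field.var(axis=0, ddof=1).ravel()
    var_obs = obs_field.var(axis=0, ddof=1).ravel()
    r = np.corrcoef(var_model, var_obs)[0, 1]
    return r ** 2
```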
4. Evaluation framework

The proposed framework provides a means to evaluate climate model performance metrics. Metrics are evaluated by their ability to improve the ensemble-mean representation of present-day mean-state climate. The framework methodology consists of three steps: 1) assessing climate models using performance metrics, 2) creating metric-weighted ensemble-mean simulations of present-day mean-state climate quantities, and 3) evaluating the performance metrics using an aggregated error index.
a. Assessing model skill with performance metrics

The first step tests climate models against observed conditions using performance metrics. In the present study, the preindustrial control experiment simulations are used to assess model skill since this experiment represents unforced model control runs, allowing for the evaluation of model process performance in the absence of an external forcing. Models are compared to observations using the performance metric statistical tests described in section 3.
b. Metric-weighted ensembles

The second step of the methodology uses knowledge of model skill to construct metric-weighted ensemble-mean simulations. In the traditional approach for
constructing an ensemble mean of N models, the equal-weight ensemble mean ⟨S_y⟩ of a climatological variable mean state y_i simulated by model i is defined as

⟨S_y⟩ = Σ_{i=1}^{N} W_i y_i,    (5)
where W_i = 1/N is the weight of each model. Conversely, in an unequal-weight ensemble averaging approach, the model weight W_i can differ for each model i based on a performance metric evaluation such that

⟨S_{y,m}⟩ = Σ_{i=1}^{N} W_{i,m} y_i,    (6)
where ⟨S_{y,m}⟩ is the ensemble mean weighted by performance metric m, and W_{i,m} is the weight of each model. The weights are determined using performance metrics that assess the quality of the models. In the present study, the model weights W_{i,m} are equivalent to the performance metric values w_i calculated using Eqs. (1)–(4). Metric-weighted ensemble means of present-day climate are constructed using the CMIP5 historical simulation. Better-performing models are weighted more strongly, and a smaller (or zero) weight is given to models that poorly reproduce the performance metric, so that these models contribute less—or not at all—to the unequally weighted ensemble mean. The weight computation methodology is unspecified in the general framework, as computing weights will depend upon the nature of the individual metric, but a wide variety of metrics can be incorporated into the framework using a normalization technique such as the one employed in Eq. (4).
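As a small illustration of Eqs. (5) and (6), the sketch below forms an ensemble mean with either equal weights or metric-based weights. The renormalization of the metric values so that the weights sum to one is an assumption of this sketch; the text above equates W_{i,m} with the metric values w_i and leaves any rescaling unstated.

```python
import numpy as np

def ensemble_mean(fields, metric_values=None):
    """Eqs. (5)-(6): ensemble mean of model climatologies with optional weights.
    `fields` has shape (n_models, nlat, nlon); with no metric supplied the
    equal weights W_i = 1/N of Eq. (5) are used."""
    fields = np.asarray(fields, dtype=float)
    n_models = fields.shape[0]
    if metric_values is None:
        weights = np.full(n_models, 1.0 / n_models)
    else:
        weights = np.asarray(metric_values, dtype=float)
        weights = weights / weights.sum()        # assumed renormalization to sum to 1
    return np.tensordot(weights, fields, axes=1) # sum_i W_i * y_i
```

With this convention, a model whose metric value is zero drops out of the weighted mean entirely, consistent with the description above.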
c. Evaluation of performance metrics

The evaluation of a performance metric is determined by whether the metric can be successfully used to constrain the model ensemble by reducing ensemble-mean error in mean-state present-day climate simulations. We focus on the mean-state climate since it is a fundamental aspect of climate for which we have a high-quality observational record to evaluate model ensemble performance. The ability to accurately simulate climate is quantified using a method adapted from a model performance index I² used by Reichler and Kim (2008). Here, we use a modified I² error index to measure combined errors of a model ensemble against observed climatological radiation budget quantities listed in Table 2. The normalized error is calculated as the difference between ensemble-mean and observed climate:
e²_{y,m} = Σ_n [A_n (⟨S_{y,m,n}⟩ − o_{y,n})² / σ²_{y,n}],    (7)
where ⟨S_{y,m,n}⟩ is the simulated ensemble-mean climatology weighted by metric m for climate variable y per grid point n, o_{y,n} is the corresponding observed climatology, and σ²_{y,n} is the interannual variance from observations. The normalized errors are globally averaged and area weighted by A_n. The average errors are then scaled relative to the normalized error of the equally weighted ensemble mean, e²_{y,eq}:

I²_{y,m} = e²_{y,m} / e²_{y,eq},    (8)
and averaged over the 12 climate variables y listed in Table 2 (excluding derived cloudy-sky quantities) to produce the final index:

I²_m = (1/12) Σ_y I²_{y,m}.    (9)
The I² error index is chosen to evaluate performance metrics for several reasons. First, I² allows a straightforward comparison of a variety of metrics. The index also accounts for the aggregated errors of many variables, incorporating multiple measures of performance. Lower I² values are favorable, as they indicate less error; thus, metric-weighted ensemble means that produce I² values lower than that of the equally weighted ensemble are considered improvements.
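Putting Eqs. (7)–(9) together, the error index for a single metric could be assembled as follows. The cosine-latitude area weighting and the dictionary bookkeeping over climate variables are conventions adopted for this sketch; the text above specifies only that the errors are area weighted, variance normalized, scaled by the equal-weight error, and averaged over the 12 variables.

```python
import numpy as np

def normalized_error(ens_clim, obs_clim, obs_var, lat):
    """Eq. (7): area-weighted, variance-normalized squared error of an
    ensemble-mean climatology against the observed climatology (lat, lon)."""
    area = np.cos(np.deg2rad(lat))[:, None] * np.ones_like(ens_clim)
    area = area / area.sum()                              # normalized A_n
    return float(np.sum(area * (ens_clim - obs_clim) ** 2 / obs_var))

def error_index(metric_means, equal_means, obs_clim, obs_var, lat):
    """Eqs. (8)-(9): I^2 for one metric, averaged over the climate variables.
    The first four arguments are dicts keyed by variable name holding (lat, lon)
    fields; `lat` is the 1D latitude array of the common grid."""
    ratios = []
    for name in metric_means:
        e_m = normalized_error(metric_means[name], obs_clim[name], obs_var[name], lat)
        e_eq = normalized_error(equal_means[name], obs_clim[name], obs_var[name], lat)
        ratios.append(e_m / e_eq)                         # I^2_{y,m}, Eq. (8)
    return float(np.mean(ratios))                         # I^2_m, Eq. (9)
```

An I² value below 1 then means that the metric-weighted ensemble has smaller aggregated error than the equally weighted ensemble, which is the condition used in section 5 to label a metric as optimal.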
d. Framework design principles

Several key considerations and design principles are adopted to define this framework. The first design principle is that two different simulation experiments should be used for metric evaluation. Since experimental conditions—such as radiative forcing, atmospheric gas concentrations, and sea ice coverage—can vary between experiments (Taylor et al. 2012), model performance may depend upon the specific experiment because of correlations between the performance metric variable, the surface temperature, and the radiative forcing. For these reasons, the framework encourages using simulations from one experiment to assess model performance and using a different experiment to evaluate the metric. The second design principle is that observations should be integrated into metric evaluation. The integration of observations into this framework is critical as a means to evaluate model representation of the natural system. This design principle dictates that metric evaluation be performed against present-day climate in order to incorporate available observations that encompass a wide range of variables. The assumption
is then also made that a metric is considered effective if a better present-day simulated climate is produced.
5. Results

a. Model assessment using performance metrics

The 34 climate models listed in Table 1 are tested using a set of 44 radiation performance metrics characterizing variance, probability distributions, and regression relationships. Figure 1 shows the results of the model performance assessment. Performance metric numbers are listed to the left of the figure (for readability, the numbers correspond to the metrics named in the legend of Fig. 2), and the 34 tested CMIP5 models are listed across the bottom. The diagram is colored based on model performance; better-scoring models are those with high metric values, indicated by darker-shaded squares, and poor performance is indicated by pink or white shades. In general, models reproduced the observed energy budget quantities and associated processes reasonably well.

The mean metric score for each model is shown as the bottom horizontal row of Fig. 1. The 27 models with a mean metric value of 0.5 or higher are colored in blue or darker shades, while the 7 models with a mean value below 0.5 are colored pink. The best-performing models across all tested metrics are the NCAR CESM models, the CSIRO ACCESS models, and the NorESM models. The model with the lowest metric score (MIROC-ESM) consistently failed to reproduce radiation processes, scoring 0 for 11 of the tested metrics.

Figure 1 shows that model behavior is consistent for certain tested metrics. For example, models generally perform worse for metrics derived using clear-sky radiation fluxes, and they are better able to capture observed cloud radiative effects in the cloudy-sky and all-sky metrics. All of the models perform poorly in the TOA radiation and surface temperature regression tests. Model performance is also relatively consistent across modeling centers; this result is expected, since these models often share large amounts of code, parameterizations, and physical approximations. For example, the four NASA GISS models tend to behave similarly for many tested metrics.

Interestingly, on average, models tend to score worse for metrics related to the variance of radiation quantities. The models are generally better able to recreate observed probability distributions. This may be because of the relatively short observation length, since the modeled variance of radiation and surface temperature is nonstationary and heteroscedastic over the duration
of the time series and does not consistently reproduce the observed variance.

FIG. 1. Results of tested model performance metrics (for better readability, the numbers listed on the left correspond to the legend in Fig. 2) for the CMIP5 models listed across the bottom. Metric values have been calculated as described in section 3. Model performance is indicated by the color scale on the right: darker shades denote higher metric values and better performance.
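The mean-score row at the bottom of Fig. 1 is a simple aggregation of the metric values. The minimal sketch below assumes the metric values are held in a (44 metrics × 34 models) array; the array orientation and threshold argument are choices made here for illustration.

```python
import numpy as np

def model_mean_scores(metric_values):
    """Bottom row of Fig. 1: mean metric value per model, given an array of
    shape (n_metrics, n_models) holding the w_i values from section 3."""
    return np.asarray(metric_values, dtype=float).mean(axis=0)

def low_skill_models(metric_values, threshold=0.5):
    """Indices of models whose mean metric value falls below the threshold."""
    return np.flatnonzero(model_mean_scores(metric_values) < threshold)
```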
b. Evaluation of metric-weighted ensembles

The proposed framework is used to evaluate and compare the 44 performance metrics, listed in Fig. 2. Each colored marker corresponds to a CMIP5 model ensemble mean that has been weighted by using the values of the indicated metric as model weights (as described in section 4b). For each metric-weighted ensemble, an I² error index value has been computed, then normalized by the equal-weighted ensemble mean, indicated by a black circle where I² is equal to 1. The I² error index measures aggregated model ensemble errors computed against observed mean-state TOA and
surface radiation budget quantities listed in Table 2. Markers on the farthest left of the figure indicate metric-weighted ensembles that exhibit less error than the equal-weighted ensemble mean; these metrics are considered to be optimal metrics.

FIG. 2. The I² error index values for the tested metrics. Colored markers indicate the I² value for the unequally weighted ensemble means corresponding to the numbered metrics in the legend. Each I² value is the average of aggregated errors for the 12 radiation budget quantities listed in Table 2. The black marker (45) indicates the I² value of the equally weighted ensemble mean. Smaller I² values indicate less error relative to observations.

An optimal performance metric produces a metric-weighted ensemble with a low I² error index; this indicates that climate models that perform better for this metric also simulate a present-day TOA and surface energy budget in better agreement with observations. The five best performance metrics are derived from TOA OLR and shortwave (SW) radiation fluxes. These metrics are OLR clear-sky local variance, OLR cloudy-sky variance, SW cloudy-sky variance, OLR clear-sky variance, and SW all-sky local variance. In general,
the optimal performance metrics are related to global-mean TOA radiation flux variance. This result indicates that the variance in global-mean TOA radiation is highly relevant for assessing model quality in reproducing mean-state radiation budget quantities.

Metric 11 (OLR clear-sky local variance) received the lowest I² index, indicating that the ensemble weighted with this metric scores better than those weighted by the other tested metrics and the equally weighted ensemble mean. The results of the I² test indicate that this metric is highly relevant to a model's ability to accurately represent mean-state radiation budget quantities. The farthest left column in Fig. 1 shows the metric mean value across all models; the metric 11 mean is marked by a light pink color, which represents a low metric score. Since clear-sky OLR flux variations are dominated by atmospheric temperature and humidity variations (e.g., Loeb et al. 2012), and the local variance represents a variance in the magnitude of monthly changes, this result indicates that CMIP5 models generally fail to reproduce these monthly atmospheric temperature and humidity variations.

To demonstrate the effectiveness of the framework in identifying optimal performance metrics, the top five metrics are used to constrain the ensemble mean of
two important radiation budget quantities: TOA all-sky OLR and reflected SW radiation fluxes. The global-mean value of these two quantities, shown in Fig. 3, is calculated for the CERES observations, each tested CMIP5 model, the equally weighted model ensemble mean, and the unequally weighted ensemble means using the top five performance metrics. The equally weighted ensemble mean exhibits nearly a 3 W m⁻² bias in the SW flux and a 2 W m⁻² bias in the OLR flux. However, all five metric-weighted means show improvement over the equally weighted mean, with values closer to the CERES observations. When used as ensemble weights, the three top metrics each improve the bias in modeled global-mean TOA OLR and SW radiation fluxes by approximately 1 W m⁻². This result demonstrates that optimal metrics determined through the proposed framework are well suited to evaluate the quality of CMIP5 models in order to improve twentieth-century ensemble-mean simulations of relevant climate quantities.

FIG. 3. Global-mean TOA OLR and SW radiation fluxes calculated from monthly mean CMIP5 climate model simulations over the last 20 yr of the historical experiment. CERES observations are shown as a yellow star. The equally weighted model ensemble mean and the unequally weighted ensemble means using the top five performance metrics are also shown.

While most OLR variance metrics score better than the equally weighted ensemble on the I² scale, metric 7 (OLR cloudy-sky local variance) scored poorly in the I² index test. Global-mean TOA OLR cloud variations are dominated by ENSO events in the tropics (Loeb et al. 2012); since the local variance represents the
magnitude of the monthly changes, the poor score of this metric could indicate a failure of the models to capture the monthly variations associated with these events. Other metrics that scored I² index values near to or higher than the equally weighted mean are not considered optimal metrics in this framework, as they are not strongly linked to the quality of mean-state simulations of the tested radiation budget quantities. It is possible that the physical processes tested by these metrics are not relevant to assessing overall model quality. However, since some of these metrics are physically similar to the optimal metrics, it is also possible that some metric computation methods do not adequately assess model quality. Of the five metric classes discussed in section 3, the first metric class produced the majority of the optimal metrics. This result shows that the method used to compute a model performance metric is important for 1) determining how models will perform and 2) determining how well a metric assesses overall model quality.
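The comparison summarized in Fig. 3 reduces to a global-mean bias calculation for each candidate ensemble. A brief sketch follows, assuming (lat, lon) climatologies of a single flux (e.g., all-sky OLR) on a common grid; the cosine-latitude area weighting and the function names are choices made here for illustration.

```python
import numpy as np

def global_mean(field, lat):
    """Area-weighted (cosine-latitude) global mean of a (lat, lon) field."""
    area = np.cos(np.deg2rad(lat))[:, None] * np.ones_like(field)
    return float(np.sum(area * field) / area.sum())

def ensemble_bias(model_clims, weights, obs_clim, lat):
    """Global-mean bias (ensemble minus observations) of a weighted ensemble
    mean built from (n_models, lat, lon) model climatologies."""
    w = np.asarray(weights, dtype=float)
    ens = np.tensordot(w / w.sum(), model_clims, axes=1)
    return global_mean(ens, lat) - global_mean(obs_clim, lat)
```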
6. Summary

In summary, a unique, systematic framework that integrates observations to quantitatively evaluate performance metrics is described and demonstrated using a set of previously unexplored performance metrics that evaluate the simulation of TOA radiative flux variance
and probability distributions and regression relationships with surface temperature changes. The framework is developed using two main design principles: 1) Two different simulations—one to create metric weights and another to apply weights and create weighted ensembles—should be used to provide a simulation-independent metric evaluation. 2) Observations should be integrated into the performance metric evaluation. This study presents an application of the framework by focusing on five metric classes related to global-mean radiation processes. However, the framework applies to a broad array of performance metrics, making it possible to directly compare many different kinds of metrics.

There is a community need to provide an optimal set of climate model performance metrics. This process should be quantitative and integrate observations. The framework presented here represents a potential way forward. The framework tests whether metrics can be used to constrain ensemble-averaged mean-state quantities. This step provides a link between the metrics, which are related to climate processes, and the tested mean-state simulations. This concept is demonstrated in Fig. 3, successfully improving the ensemble-mean representation of TOA radiation flux quantities. A future application of the framework will be to assess the relationship between performance metrics and ensemble-mean future projections. While the current framework is only tested on twentieth-century simulations, it could be extended to constrain climate trends in response to changing climate conditions.

Acknowledgments. This study was funded through the NASA Postdoctoral Program with the support of Oak Ridge Associated Universities and NASA Langley Research Center. Observational data products are publicly available online and were obtained from the websites of the CERES products (http://ceres.larc.nasa.gov/) and the GISTEMP temperature datasets (http://data.giss.nasa.gov/gistemp/). The authors appreciate the helpful comments received from Anthony Broccoli and an anonymous reviewer.

REFERENCES

Armour, K. C., C. M. Bitz, and G. H. Roe, 2013: Time-varying climate sensitivity from regional feedbacks. J. Climate, 26, 4518–4534, doi:10.1175/JCLI-D-12-00544.1.

Christensen, J. H., E. Kjellström, F. Giorgi, G. Lenderink, and M. Rummukainen, 2010: Weight assignment in regional climate models. Climate Res., 44, 179–194, doi:10.3354/cr00916.

Dessler, A. E., 2013: Observations of climate feedbacks over 2000–10 and comparisons to climate models. J. Climate, 26, 333–342, doi:10.1175/JCLI-D-11-00640.1.
Doelling, D. R., and Coauthors, 2013: Geostationary enhanced temporal interpolation for CERES flux products. J. Atmos. Oceanic Technol., 30, 1072–1090, doi:10.1175/JTECH-D-12-00136.1.

Giorgi, F., and L. O. Mearns, 2002: Calculation of average, uncertainty range, and reliability of regional climate changes from AOGCM simulations via the "reliability ensemble averaging" (REA) method. J. Climate, 15, 1141–1158, doi:10.1175/1520-0442(2002)015<1141:COAURA>2.0.CO;2.

Gleckler, P. J., K. E. Taylor, and C. Doutriaux, 2008: Performance metrics for climate models. J. Geophys. Res., 113, D06104, doi:10.1029/2007JD008972.

Hansen, J., A. Lacis, D. Rind, G. Russell, P. Stone, I. Fung, and R. Ruedy, 1984: Climate sensitivity: Analysis of feedback mechanisms. Climate Processes and Climate Sensitivity, Geophys. Monogr., Vol. 5, Amer. Geophys. Union, 130–163.

——, R. Ruedy, M. Sato, and K. Lo, 2010: Global surface temperature change. Rev. Geophys., 48, RG4004, doi:10.1029/2010RG000345.

Hartmann, D. L., and P. Ceppi, 2014: Trends in the CERES dataset, 2000–13: The effects of sea ice and jet shifts and comparison to climate models. J. Climate, 27, 2444–2456, doi:10.1175/JCLI-D-13-00411.1.

Kato, S., N. G. Loeb, F. G. Rose, D. R. Doelling, D. A. Rutan, T. E. Caldwell, L. Yu, and R. A. Weller, 2013: Surface irradiances consistent with CERES-derived top-of-atmosphere shortwave and longwave irradiances. J. Climate, 26, 2719–2740, doi:10.1175/JCLI-D-12-00436.1.

Knutti, R., 2010: The end of model democracy? Climatic Change, 102, 395–404, doi:10.1007/s10584-010-9800-2.

——, R. Furrer, C. Tebaldi, J. Cermak, and G. A. Meehl, 2010: Challenges in combining projections from multiple climate models. J. Climate, 23, 2739–2758, doi:10.1175/2009JCLI3361.1.

Lambert, F. H., and P. C. Taylor, 2014: Regional variation of the tropical water vapor and lapse rate feedbacks. Geophys. Res. Lett., 41, 7634–7641, doi:10.1002/2014GL061987.

Loeb, N. G., B. A. Wielicki, D. R. Doelling, G. L. Smith, D. F. Keyes, S. Kato, N. Manalo-Smith, and T. Wong, 2009: Toward optimal closure of the earth's top-of-atmosphere radiation budget. J. Climate, 22, 748–766, doi:10.1175/2008JCLI2637.1.

——, S. Kato, W. Su, T. Wong, F. G. Rose, D. R. Doelling, J. R. Norris, and X. Huang, 2012: Advances in understanding top-of-atmosphere radiation variability from satellite observations. Surv. Geophys., 33, 359–385, doi:10.1007/s10712-012-9175-1.

Madison, G., 2001: Variability in isochronous tapping: Higher order dependencies as a function of intertap interval. J. Exp. Psychol., 27, 411–422.
Pincus, R., C. P. Batstone, R. J. P. Hofmann, K. E. Taylor, and P. J. Gleckler, 2008: Evaluating the present-day simulation of clouds, precipitation, and radiation in climate models. J. Geophys. Res., 113, D14209, doi:10.1029/2007JD009334.

Qu, X., and A. Hall, 2006: Assessing snow albedo feedback in simulated climate change. J. Climate, 19, 2617–2630, doi:10.1175/JCLI3750.1.

Reichler, T., and J. Kim, 2008: How well do coupled models simulate today's climate? Bull. Amer. Meteor. Soc., 89, 303–311, doi:10.1175/BAMS-89-3-303.

Rubner, Y., C. Tomasi, and L. J. Guibas, 2000: The Earth Mover's Distance as a metric for image retrieval. Int. J. Comput. Vision, 40, 99–121, doi:10.1023/A:1026543900054.

Taylor, K. E., 2001: Summarizing multiple aspects of model performance in a single diagram. J. Geophys. Res., 106, 7183–7192, doi:10.1029/2000JD900719.

——, R. J. Stouffer, and G. A. Meehl, 2012: An overview of CMIP5 and the experiment design. Bull. Amer. Meteor. Soc., 93, 485–498, doi:10.1175/BAMS-D-11-00094.1.

Taylor, P. C., R. G. Ellingson, and M. Cai, 2011a: Geographical distribution of climate feedbacks in the NCAR CCSM3.0. J. Climate, 24, 2737–2753, doi:10.1175/2010JCLI3788.1.

——, ——, and ——, 2011b: Seasonal contributions to climate feedbacks in the NCAR CCSM3.0. J. Climate, 24, 3433–3444, doi:10.1175/2011JCLI3862.1.

Waugh, D. W., and V. Eyring, 2008: Quantitative performance metrics for stratospheric-resolving chemistry–climate models. Atmos. Chem. Phys., 8, 5699–5713, doi:10.5194/acp-8-5699-2008.

WCRP, 2014: CMIP5—Coupled Model Intercomparison Project. Accessed December 2014. [Available online at http://cmip-pcmdi.llnl.gov/cmip5/.]

Wetherald, R. T., and S. Manabe, 1988: Cloud feedback processes in a general circulation model. J. Atmos. Sci., 45, 1397–1415, doi:10.1175/1520-0469(1988)045<1397:CFPIAG>2.0.CO;2.

Williams, K. D., and M. J. Webb, 2009: A quantitative performance assessment of cloud regimes in climate models. Climate Dyn., 33, 141–157, doi:10.1007/s00382-008-0443-1.

Willmott, C. J., 1982: Some comments on the evaluation of model performance. Bull. Amer. Meteor. Soc., 63, 1309–1313, doi:10.1175/1520-0477(1982)063<1309:SCOTEO>2.0.CO;2.

Xu, K. M., T. Wong, B. A. Wielicki, L. Parker, and Z. A. Eitzen, 2005: Statistical analyses of satellite cloud object data from CERES. Part I: Methodology and preliminary results of the 1998 El Niño/2000 La Niña. J. Climate, 18, 2497–2514, doi:10.1175/JCLI3418.1.