Considerations in the Selection of Global Climate Models for Regional Climate Projections: The Arctic as a Case Study*

JAMES E. OVERLAND
NOAA/Pacific Marine Environmental Laboratory, Seattle, Washington

MUYIN WANG AND NICHOLAS A. BOND
Joint Institute for the Study of the Atmosphere and Ocean, University of Washington, Seattle, Washington

JOHN E. WALSH
University of Alaska Fairbanks, Fairbanks, Alaska

VLADIMIR M. KATTSOV
Voeikov Main Geophysical Observatory, St. Petersburg, Russia

WILLIAM L. CHAPMAN
University of Illinois at Urbana–Champaign, Urbana, Illinois

(Manuscript received 2 October 2009, in final form 10 November 2010)

ABSTRACT

Climate projections at regional scales are in increased demand from management agencies and other stakeholders. While global atmosphere–ocean climate models provide credible quantitative estimates of future climate at continental scales and above, individual model performance varies for different regions, variables, and evaluation metrics—a less than satisfying situation. Using the high-latitude Northern Hemisphere as a focus, the authors assess strategies for providing regional projections based on global climate models. Starting with a set of model results obtained from an "ensemble of opportunity," the core of this procedure is to retain a subset of models through comparisons of model simulations with observations at both continental and regional scales. The exercise is more one of model culling than model selection. The continental-scale evaluation is a check on the large-scale climate physics of the models, and the regional-scale evaluation emphasizes variables of ecological or societal relevance. An additional consideration is given to the comprehensiveness of processes included in the models. In many but not all applications, different results are obtained from a reduced set of models compared to relying on the simple mean of all available models. For example, in the Arctic the top-performing models tend to be more sensitive to greenhouse forcing than the poorer-performing models. Because of the mostly unexplained inconsistencies in model performance under different selection criteria, simple and transparent evaluation methods are favored. The use of a single model is not recommended. For some applications, no model may be able to provide a suitable regional projection. The use of model evaluation strategies, as opposed to relying on simple averages of ensembles of opportunity, should be part of future synthesis activities such as the upcoming Fifth Assessment Report of the Intergovernmental Panel on Climate Change.
* National Oceanic and Atmospheric Administration Contribution Number 1760 and Pacific Marine Environmental Laboratory Contribution Number 3458.

Corresponding author address: J. E. Overland, NOAA/PMEL, 7600 Sand Point Way NE, Seattle, WA 98115. E-mail: [email protected]

DOI: 10.1175/2010JCLI3462.1

© 2011 American Meteorological Society

1. Introduction
Comprehensive atmosphere–ocean general circulation models (AOGCMs) comprise the major objective tool that scientists use to account for the complex interaction of processes that determine future climate change. To facilitate this process, the Intergovernmental Panel on Climate Change (IPCC) used the simulations
FIG. 1. Annual mean surface air temperature (SAT) and sea level pressure (SLP) simulated by 16 CMIP3 models compared to ERA-40 for the period 1980–99, integrated over 60°–90°N. Each model's spatial-mean rms error (RMSE) is normalized by the RMSE of the 16-member ensemble mean. Models are ranked by increasing SAT RMSE.
from about two dozen AOGCMs developed by 17 international modeling centers to form the basis for the results in their Fourth Assessment Report (AR4) (Solomon et al. 2007). Regional projections from these models are also being used by management agencies to assess and plan for future ecological and societal impacts. These AOGCM results are archived as part of the Coupled Model Intercomparison Project phase 3 (CMIP3) at the Program for Climate Model Diagnosis and Intercomparison (PCMDI). The IPCC’s AR4 emphasizes that current generation AOGCMs provide credible quantitative estimates of future climate change at continental scales and above (Solomon et al. 2007). While it is clear that the CMIP3 models are better than the earlier models used for the Third Assessment Report (Randall et al. 2007; Reichler and Kim 2008), a general question remains: how reliable are these model projections at regional scales and what is the limit of their utility? Nevertheless, the climate community is making use of the AR4 AOGCM simulations on regional scales. PCMDI shows over 1100 projects and over 500 publications using CMIP3, most based on regional projections. While there is emerging evidence of the utility of some regional climate model projections, users should strive to evaluate uncertainties in the multiple-model simulations. We present a set of considerations to guide regional applications of AOGCMs. In anticipation of applications for the IPCC Fifth Assessment Report (AR5), we recommend accounting for the strengths and weaknesses in individual models, rather than relying primarily on averages over all models from ‘‘ensembles of opportunity’’ (Tebaldi and Knutti 2007).
Selection considerations arise even for simple applications. An example is provided by annual climate statistics for the region north of 60°N (Fig. 1). Rankings of models based on their normalized root-mean-square differences with the 40-yr European Centre for Medium-Range Weather Forecasts Re-Analysis (ERA-40) depend on the variables being considered, such as surface air temperature (SAT) and sea level pressure (SLP). Consider the two leftmost models: the first model shows the best results for SAT and the worst results for SLP, whereas the second model has relatively low errors for both variables. If one is only interested in SAT, is it better to use the "best" model? Or is overall model performance a more important evaluation factor? Value judgments about different selection criteria quickly arise. This paper is organized as follows. We first summarize the sources of uncertainty in AOGCM projections and continue with a discussion of selection criteria. Examples from Arctic and subarctic regions are provided through a meta-analysis of several CMIP3 studies and from new results. Our theme is that model performance varies among different variables and regions of the Arctic, pointing to a need for regional specificity in model assessment.
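As a concrete illustration of the kind of calculation behind Fig. 1, the short Python/NumPy sketch below computes an area-weighted spatial RMSE for each model against a reference field and normalizes it by the RMSE of the all-model ensemble mean; the synthetic fields, grid, and weights are hypothetical placeholders rather than the CMIP3 or ERA-40 data:

import numpy as np

def spatial_rmse(field, reference, weights):
    """Area-weighted RMSE between a model field and a reference field."""
    return np.sqrt(np.average((field - reference) ** 2, weights=weights))

# Hypothetical inputs: 16 models on an nlat x nlon polar-cap grid (60-90N),
# a reanalysis field on the same grid, and cos(latitude) area weights.
rng = np.random.default_rng(0)
nmodels, nlat, nlon = 16, 30, 120
lats = np.linspace(60.0, 90.0, nlat)
weights = np.broadcast_to(np.cos(np.deg2rad(lats))[:, None], (nlat, nlon))
reanalysis = rng.normal(size=(nlat, nlon))
models = reanalysis + rng.normal(scale=0.5, size=(nmodels, nlat, nlon))

rmse = np.array([spatial_rmse(m, reanalysis, weights) for m in models])
ens_mean_rmse = spatial_rmse(models.mean(axis=0), reanalysis, weights)
normalized = rmse / ens_mean_rmse      # values > 1 are worse than the ensemble mean
ranking = np.argsort(normalized)       # model indices, best first
print(normalized.round(2), ranking)

Repeating this for each variable (SAT, SLP, etc.) produces the kind of variable-dependent rankings shown in Fig. 1.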
2. Climate projection uncertainties

Table 1 itemizes the names, contributing country, resolution, and number of available runs for selected variables for the twentieth-century (20c3m) model simulations carried out for CMIP3. The table shows that the number of available ensemble members can be different
even for the same model but from different components of the coupled models.

TABLE 1. List of coupled atmosphere–ocean general circulation models and the number of ensemble runs archived for selected variables (SLP, SST, and sea ice concentration) in the 20c3m simulations. The number after the letter "L" indicates the number of vertical levels in the model. If no archive is available for a selected variable, no number is given in the column. Institutions/models, excluding those named in section 4, are (1) Beijing Climate Center, (2) Bjerknes Centre for Climate Research, (7) Commonwealth Scientific and Industrial Research Organisation Mark version 3.0, (10) Flexible Global Ocean–Atmosphere–Land System Model gridpoint version 1.0 (Institute of Atmospheric Physics), (13) Goddard Institute for Space Studies Atmosphere–Ocean Model, (14) GISS model version E-H, (15) GISS model version E-R, (16) Istituto Nazionale di Geofisica e Vulcanologia SINTEX-G, (17) Institute of Numerical Mathematics Coupled Model, version 3.0, (22) Meteorological Research Institute, and (23) Parallel Climate Model.

Model (country): (1) BCC (China); (2) BCCR-BCM2.0 (Norway); (3) CCSM3 (United States); (4) CGCM3.1(T47) (Canada); (5) CGCM3.1(T63) (Canada); (6) CNRM-CM3 (France); (7) CSIRO Mk3.0 (Australia); (8) CSIRO Mk3.5 (Australia); (9) ECHAM5/MPI-OM (Germany); (10) FGOALS-g1.0 (IAP) (China); (11) GFDL-CM2.0 (United States); (12) GFDL-CM2.1 (United States); (13) GISS-AOM (United States); (14) GISS-EH (United States); (15) GISS-ER (United States); (16) INGV-SXG (Italy); (17) INM-CM3.0 (Russia); (18) IPSL-CM4 (France); (19) MIROC3.2(hires) (Japan); (20) MIROC3.2(medres) (Japan); (21) ECHO-G (Germany/Korea); (22) MRI-CGCM2.3.2 (Japan); (23) PCM (United States); (24) UKMO-HadCM3 (United Kingdom); (25) UKMO-HadGEM1 (United Kingdom).

Atmosphere resolution: 2.8° × 2.8° L31; 1.4° × 1.4° L26; 3.75° × 3.7° L31; 2.8° × 2.8° L31; 2.8° × 2.8° L45; 1.875° × 1.865° L18; 1.875° × 1.865° L18; 1.875° × 1.865° L31; 2.8° × 2.8° L26; 2.5° × 2.0° L24; 2.5° × 2.0° L24; 4° × 3° L20; 5° × 4° L20; 5° × 4° L13; 1.125° × 1.12° L19; 5° × 4° L21; 3.75° × 2.5° L19; 1.125° × 1.12° L56; 2.8° × 2.8° L20; 3.75° × 3.7° L19; 2.8° × 2.8° L30; 2.8° × 2.8° L18; 3.75° × 2.5° L15; 1.875° × 1.25° L38.

Ocean resolution: (0.5°–1.5°) × 1.5° L35; (0.3°–1.0°) × 1.0° L40; 1.9° × 1.9° L29; 1.4° × 0.9° L29; 2° × (0.5°–2°) L31; 1.875° × 0.925° L31; 1.875° × 0.925° L31; 1.5° × 1.5° L40; 1° × 1° L30; 1° × 1° L50; 1° × 1° L50; 1.4° × 1.4° L43; 2° × 2° cos(lat) L16; 5° × 4° L33; 1° × 1° L33; 2° × 2.5° L33; 2° × 1° L31; 0.28° × 0.188° L47; (0.5°–1.4°) × 1.4° L44; (0.5°–2.8°) × 2.8° L20; (0.5°–2.5°) × 2° L23; (0.5°–0.7°) × 0.7° L32; 1.25° × 1.25° L20; (0.33°–1.0°) × 1.0° L40.

Archived runs, SLP: 1, 8, 5, 1, 1, 3, 1, 4, 3, 3, 3, 2, 5, 9, 1, 1, 1, 1, 3, 5, 5, 4, 2, 2 (total 70).

Archived runs, SST: 1, 2, 5, 1, 1, 3, 1, 3, 3, 3, 5, 2, 5, 9, 1, 1, 1, 1, 3, 3, 5, 3, 2, 2 (total 57).

Archived runs, sea ice: 1, 7, 5, 1, 1, 3, 3, 3, 3, 3, 5, 2, 9, 1, 1, 1, 1, 3, 3, 5, 2, 2, 2 (total 62).

¹ In fact, twentieth-century forcings are not identical for different CMIP3 models; for example, some of them include volcanic aerosol and solar variability while others do not.

There are several reasons why CMIP3 AOGCMs can provide reliable projections. Models are built on well-known dynamical and physical principles, and many large-scale aspects of present-day climate are simulated quite well by these models (Randall et al. 2007; Knutti 2008). Further, some biases in simulated climate by different models can be unsystematic (Räisänen 2007; Jun et al. 2008). While it may be supposed that there would be considerable convergence between different models in their simulation results for the twentieth century and projections for the twenty-first century, given that they are trying to simulate responses to similar forcing,¹ there are in fact considerable differences in the models' ability to hindcast regional climate variability based on location,
variable of interest, and metrics—for example, means, variance, trends, etc.—with no convergence toward a single subset of preferred models. Thus, the question of model reliability has no simple quantitative answer; there is no one best model (Gleckler et al. 2008; Reifen and Toumi 2009; Räisänen et al. 2009). There are three main sources of uncertainty in the use of AOGCMs for climate projections: large natural variations (both forced and unforced), the range in emissions scenarios, and across-model differences (Hawkins and Sutton 2009). First, it is known that, if climate models are run several times with slightly different initial conditions, the trajectory of day-to-day and year-to-year evolution will have different timing of events, even though the underlying statistical–spectral character of the model climate tends to be similar for each run. This variability is a feature of the real climate system, and consumers of climate projections must recognize its importance. Natural variability is a source of ambiguity in the comparison of models with each other and with observational data. This uncertainty can affect decadal
or even longer means, so it is highly relevant to the use of model-derived climate projections. To reduce this uncertainty, one may average the projections over decades or, preferably, form ensemble averages from a set of at least several runs of the same model. One may also be interested in possible changes in extreme states; however, it is important to recognize the limitations of predictability in the timing of events and the loss of information about variability when multiple simulations are averaged. It is unfortunate that for some models only one model run exists in the archive. A strong recommendation is for climate centers to provide multiple ensemble model runs in future model archives.
A second source of uncertainty arises from the range in plausible emissions scenarios. Emissions scenarios have been developed based on assumptions about the future development of humankind (Nakicenovic et al. 2000); they are converted into greenhouse gas and aerosol concentrations, which are then used to drive the AOGCMs in the form of external forcing specified in the CMIP3 models and summarized in the IPCC AR4. Because of the residence time of carbon in the atmosphere and the thermal inertia of the climate system, climate projections are often relatively insensitive to the precise details of which future emissions scenarios are used over the next few decades, as the impacts of the scenarios are rather similar before the mid-twenty-first century (Hawkins and Sutton 2009). For the second half of the twenty-first century, however, and especially by 2100, the choice of the emissions scenario becomes a major source of uncertainty in climate projections and dominates over natural variability and model-to-model differences, at least for temperature (Solomon et al. 2007). If 2030–50 is a time scale of interest, we often use the A1B scenario, or A1B and A2 together, to increase the number of potential ensemble members, since their CO2 trajectories are similar before 2050. However, some model differences in spatial distributions of precipitation do emerge in the Arctic even before the mid-twenty-first century as a result of different emissions scenarios.
The third source is termed across-model uncertainty, which can be separated into parameterization uncertainty and structural uncertainty (Knutti 2008). Subgrid-scale processes must be parameterized, requiring specification of functional relationships and tuning of the attendant coefficients. Different numerical approximations of the model equations, spatial resolution, and other model development factors introduce structural uncertainty between different models. Further, there are deficiencies common to all models, such as the lack of dynamic vegetation and ice sheet processes. Model uncertainty can be addressed in part through consideration of multimodel ensemble means. For example,
Fig. 2 shows the dependence of the rms error (integrated over 60°–90°N and the seasonal cycle) for Arctic SAT, SLP, and precipitation plotted against the number of highest-ranking models included in a multimodel ensemble composite. There is a minimum error when the composite contains five to seven models. Apparently, individual models are subject to errors that are reduced by the multimodel ensemble mean. As the number of models in the average exceeds five to seven, the inclusion of additional, worse-performing models degrades the multimodel ensemble composite relative to observations. This finding is robust across variables and larger domains within the extratropical Northern Hemisphere (Fig. 2d). Reifen and Toumi (2009) also advocate multimodel ensembles, noting in a regional case for Siberia that selected ensembles of two to five models slightly outperform a 17-model ensemble on average. In their examples, which focus on the simulation of the twentieth-century temperature evolution, they do not see the major increases in error with additional models seen in our Fig. 2. It is worth noting that for hindcast comparisons the "truth" is known and the best models can be selected unambiguously. From a Bayesian perspective, while a poorly performing model in hindcast mode is less likely to be relatively skillful in forecast mode, the opposite result is possible. For this reason, the optimal ensemble size for climate change projections could be larger than that indicated by Fig. 2.
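The shape of the curves in Fig. 2 follows from a simple calculation of this kind: rank the models by their individual error, then compute the error of the cumulative composite of the k best models. The sketch below uses synthetic stand-in fields (not the CMIP3 output) to make the procedure explicit:

import numpy as np

rng = np.random.default_rng(1)
nmodels, npts = 20, 500               # e.g., grid points x months of the seasonal cycle
truth = rng.normal(size=npts)
# Synthetic "models": the shared truth plus model-specific noise of varying amplitude.
noise_amp = rng.uniform(0.3, 1.5, nmodels)
models = truth + noise_amp[:, None] * rng.normal(size=(nmodels, npts))

def rmse(x, y):
    return np.sqrt(np.mean((x - y) ** 2))

individual = np.array([rmse(m, truth) for m in models])
order = np.argsort(individual)        # best-performing model first
composite_rmse = [rmse(models[order[:k]].mean(axis=0), truth)
                  for k in range(1, nmodels + 1)]
best_k = int(np.argmin(composite_rmse)) + 1
print(f"composite error is smallest with the top {best_k} models")

With errors of mixed amplitude, the composite error typically falls as the first few good models are averaged and then rises as poorer models are added, which is the behavior seen in Fig. 2.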
3. Considerations in model selection/culling

There is no universal methodology in the use of multiple AOGCMs for climate projections (Gleckler et al. 2008; Räisänen et al. 2009). If the climate impact problem is defined by needing a projection of a specific variable in a specific location, it is unclear whether it is preferable to base model selection decisions on the accuracy of the models for this local variable in hindcast simulations or with respect to overall measures of performance over a range of variables and larger regions. For example, successful projections of advectively driven changes require that models capture changes over larger areas. Moreover, the coarse resolution of most current climate models certainly dictates caution in their application on smaller scales in heterogeneous regions, such as along coastlines or in rugged orography. Our experience and that of other groups conducting model evaluations indicates the importance of multiple, complementary approaches.
There are three general concepts pertaining to multiple-model evaluation. The first consideration is the skill of the models' twentieth-century hindcast simulations relative to
FIG. 2. Dependence of the rms error (model minus NCAR–NCEP reanalysis values, integrated over 60°–90°N and the seasonal cycle) of SAT, Arctic precipitation, and SLP on the number of highest-ranking models included in a composite. The lower-right panel is temperature for 20°–90°N.
various datasets or fields. While past skill would seem to be a necessary condition for using a model, this provides no guarantee of accurate model projections under new external forcing or climate states. For example, realistic sensitivity to solar forcing may produce a well-simulated seasonal cycle, but greenhouse-driven changes involve a different type of forcing. The external forcing by increasing greenhouse gases relates to the longwave radiative fluxes, which also involve water vapor and cloud feedbacks that are fast enough to affect the seasonal cycle (although in the case of the seasonal cycle, the temperature sensitivity to local radiative forcing is strongly moderated by horizontal heat transport and the heat capacity of the system). The second consideration is the quality of the underlying physics. Does the model include reasonable formulations of the physics of interest? For example, sea ice is represented in different models at considerably different levels of sophistication. The number of layers in the ice and the presence or absence of ice transport and deformation vary among models. Similarly, if projections of permafrost are to be obtained directly from an AOGCM, the model should include processes affecting
soil temperature and water in a manner that allows for credible simulations of temporal evolution (Walsh 2008). The third consideration is consistency: do different models produce similar simulations, given the range of natural variability? Invoking a model evaluation procedure involves making a set of choices about the models, validation datasets, variables, regions, evaluation metrics, and weighting approaches and, thus, inevitably entails subjectivity. However, using means across all members of an ensemble of opportunity is also a subjective choice. Metrics can include comparisons to mean climate, climate variability as represented by the annual cycle or interannual variability, or trends due to externally forced change. Validation of models often includes just comparisons to the mean, but this has drawbacks. First, the quality of projections should relate also to potential changes or sensitivities in the force balance or heat budget in the system, in addition to mean states. Second, it is unclear to what degree models are tuned to represent the mean state. If, however, model projections depend on the state of the system at the end of the twentieth century, such as sea ice, then having the
correct mean conditions is critical (Wang and Overland 2009). Several authors have recommended using the seasonal cycle for comparison, which appears to be a good choice since the range of radiative forcing and other conditions over the year is comparable to or greater than the magnitude of potential future climate shifts (Hall and Qu 2006; Knutti et al. 2008; Walsh et al. 2008). Interannual variability (variance) is another important consideration. For example, Stoner et al. (2009) document the models' ability to capture the time scales and spatial patterns of the major modes of observed natural variability [Arctic Oscillation, Pacific decadal oscillation, and El Niño–Southern Oscillation]. The correlation between model and observation time series is a poor choice since the observations represent but a single realization of a system with considerable natural variability. Model simulations are not designed to replicate the timing of fluctuations in the observed climate (Overland and Wang 2007; Kattsov et al. 2007a); comparisons between models and observations should utilize simple statistics. Trends would be another seemingly good choice, as we are interested in projecting future trends. Over time scales of 20–50 yr, however, internal variability is large and obscures the underlying externally forced trend in the observational record, especially in the case of regional trends.
Applying complex or aggregate metrics (i.e., using multiple criteria) to model selection, while intellectually appealing, also has limitations. If one includes more and more variables or subregions as independent constraints, more and more models will exhibit deficiencies that make them candidates for exclusion; our experience is that a practical limit to model selection is rapidly reached for the CMIP3 archived models. Likewise, if one develops a numerical "model climate performance index" by combining performance rankings over many variables, it can produce inconsistencies, as suggested by Fig. 1. User priorities dictate the use of a particular variable in model evaluation, but deficiencies in fields of other variables may point to model shortcomings that adversely affect a model's projections of change.
Our evaluation is based on the models in the CMIP3 archive. Since the CMIP3 ensemble of models is an "ensemble of opportunity" (i.e., a collection of model output based on availability), an important consideration is the representativeness of the CMIP3 models in terms of both their correspondence to reality and their spread about the ensemble mean. Annan and Hargreaves (2010) have recently assessed the CMIP3 models on the basis of a paradigm that reality (truth) is drawn from the same distribution as the ensemble members—the "statistically indistinguishable" paradigm. In that study, the CMIP3 ensemble was found to provide a good sample under this
paradigm, although the model spread was slightly too broad and the ensemble showed some biases. The same authors also noted that "the quality of probabilistic predictions in terms of both reliability and resolution may be improved by some non-uniform weighting" (Annan and Hargreaves 2010, p. 5). In the present paper, our framework of selection (inclusion or exclusion) is one form of nonuniform weighting, that is, binary weights of 1 and 0. We suggest transparent choices regarding selection methods. In fact, one should view the process as reducing the impact of models with large hindcast error, that is, culling, rather than as selecting best models, while retaining several models as a measure of model uncertainty. Credibility is gained when the model selection procedure is simple and thoroughly documented. We can recommend the following steps (a schematic sketch of this two-step screen follows the list):
1) In many applications, it is advisable to eliminate the models that seriously fail to meet one or more observational constraints on continental scales, based on comparison with actual or synthetic data (e.g., reanalysis products) for climatically relevant variables such as sea level pressure, sea surface temperature, seasonal sea ice extents, or upper-air variables. Do the selected variables have a reasonable mean, variance, and seasonal cycle relative to observations at continental scales or above? It bears noting that even reproducing mean quantities correctly is nontrivial, as the CMIP3 models are initialized in the nineteenth century. More importantly, however, model errors are often larger than the climate changes observed since the nineteenth century, pointing to the fundamental fact that climate must be simulated from first principles. Although tuning of a model is to some extent possible, a model cannot be explicitly forced to agree with observations.
2) After eliminating poorly performing models based on continental-scale climate processes, the remaining subset of models is evaluated for individual variables and regions of interest. These are user selected for the problem at hand, such as a variable with an ecosystem or societal impact. Our analyses of AOGCM hindcast simulations show that models can perform differently based on region and for different variables within a region, often without obvious reasons (Overland and Wang 2007; Walsh et al. 2008). Nevertheless, it is plausible that the uncertainties of future climate projections would be reduced among models with better regional hindcast simulations.
3) Model uncertainty is a sampling problem, so a sample size of at least several models is desirable. Model means selected by multivariable metrics can outperform
FIG. 3. Summary of model evaluations over the Northern Hemisphere and Arctic for selected variables. Numbers in columns 3–10 indicate the ranking of each model for the selected variable and region from published papers. Blue in column 6 indicates that a model performs better than the median with respect to the reference data, and red indicates worse performance. Pass or Fail indicates whether a model passes the culling criterion or criteria. When two numbers are shown in the same column (columns 7–10), they correspond to the model's rank based on the original output (the number before the "/") and with the bias removed (the number after the "/").
any individual model (Walsh et al. 2008). Natural variability is a complicating factor in using models when only one ensemble member is available.
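A minimal sketch of the two-step screen described in steps 1 and 2 is given below; the model names, error values, and cutoff are hypothetical placeholders rather than the criteria used in this paper:

# Step 1: cull models whose continental-scale errors exceed a (user-chosen) threshold;
# Step 2: rank the survivors on the regional, user-relevant variable.
continental_error = {        # hypothetical normalized large-scale RMSE (e.g., 60-90N SAT/SLP)
    "ModelA": 0.8, "ModelB": 1.9, "ModelC": 1.1, "ModelD": 0.9, "ModelE": 2.4,
}
regional_error = {           # hypothetical error for the regional variable of interest
    "ModelA": 1.3, "ModelB": 0.6, "ModelC": 0.7, "ModelD": 1.0, "ModelE": 0.5,
}

CONTINENTAL_CUTOFF = 1.5     # assumed culling criterion (relative to ensemble-mean error)

survivors = [m for m, e in continental_error.items() if e <= CONTINENTAL_CUTOFF]
regional_ranking = sorted(survivors, key=lambda m: regional_error[m])

# Retain several models (not just the single best) as a measure of model uncertainty.
selected = regional_ranking[:3]
print("retained after continental screen:", survivors)
print("selected for regional projection:", selected)

Note that a model with a small regional error but a large continental-scale error (ModelB above) is removed in the first step, which is the intended behavior of the culling approach.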
4. Meta-analysis

We illustrate a meta-analysis approach by synthesizing the results from several earlier papers (Fig. 3), with the mid-to-high-latitude Northern Hemisphere as a focus. Walsh et al. (2008) analyzed air temperature and sea level pressure (SLP) north of 60°N using the observational constraint of the rms error and variance over the seasonal cycle. Similarly, Gleckler et al. (2008) compared models to mean values of SLP and 850-hPa temperatures and provided a model climate performance index. Wang et al. (2007) used the criterion of interannual variance of winter Arctic land surface air temperatures. Summer Arctic sea ice was investigated by Wang and Overland (2009), who required that the models not only be able to simulate the mean minimum sea ice extent but also the magnitude of the seasonality (March minus September extent). Overland and Wang (2007) evaluated the Pacific
decadal oscillation, a measure of variance for North Pacific sea surface temperature and an index of North Pacific climate. All of these evaluations involved assessing how well the model hindcasts for the twentieth century matched observations. We include two additional studies: Reichler and Kim (2008), who included criteria for the whole planet, and Wu et al. (2008), who used a criterion of model sensitivity. By spring 2009, 25 CMIP3 models were available via the PCMDI archive (online at http://www-pcmdi.llnl.gov) (Table 1). Figure 3 summarizes the performance of these models based on the variables and analysis criteria from the above papers. The meta-analysis nominally ranks the models based on their performance on selected variables (SAT, sea ice extent, and SLP) over the Arctic and Northern Hemisphere when compared with observations. Of the 25 models in Table 1, certain models consistently perform better than the others across the criteria. These are, in alphabetical order: the Community Climate System Model, version 3 (CCSM3); Centre National de Recherches Météorologiques (CNRM); ECHAM and the global Hamburg Ocean Primitive Equation (ECHO-G)–Meteorological Institute of the University of Bonn, ECHO-G Model
(MIUBECHOG); and the Met Office (UKMO) Hadley Centre Global Environmental Model version 1 (HadGEM1) (shaded in gold in Fig. 3). These models pass both the Arctic SAT and sea ice selection criteria (Wang et al. 2007; Wang and Overland 2009) and rank relatively high in other studies (columns 3 and 4 in Fig. 3). A second group of models (purple) ranks generally high but fails at least one of the sea ice extent or Arctic SAT tests: Canadian Centre for Climate Modelling and Analysis (CCCma) Coupled General Circulation Model, version 3.1 [CGCM3.1(T47)]; CGCM3.1(T63); ECHAM5–Max Planck Institute Ocean Model (MPI-OM); Geophysical Fluid Dynamics Laboratory Climate Model version 2.0 (GFDL CM2.0); GFDL CM2.1; the Model for Interdisciplinary Research on Climate 3.2, medium-resolution version [MIROC3.2(medres)]; and the UKMO third climate configuration of the Met Office Unified Model (HadCM3). A third group includes one model [L'Institut Pierre-Simon Laplace Coupled Model, version 4 (IPSL-CM4)], which performs well in the sea ice simulation but ranks generally low by other measures. Although there is some consistency among the different methods of ranking the models, there are also caveats in the process. For example, the National Center for Atmospheric Research (NCAR) CCSM3 model performs relatively well for temperature and sea ice simulations but is biased for SLP in the Northern Hemisphere; its ranking advances from its original 15th place (last) to fourth when the SLP bias is removed (column 10 of Fig. 3). The results of this meta-analysis are consistent with a culling methodology. Note, however, that we do not consider the meta-analysis in Fig. 3 as an endorsement of any particular model. The theme of the following section 5 is to provide examples that compare model selection results with all-model means obtained from CMIP3.
5. Examples of regional applications

a. Examples from Arctic-wide comparisons

The magnitudes of the projected late twenty-first-century (2070–90) Arctic-wide (60°–90°N) mean SAT, SLP, and precipitation vary with model rank (Fig. 4), which in this case is an "integrated rank" in which air temperature, precipitation, and sea level pressure are given equal weight over the domain poleward of 60°N. There is a tendency for the higher-ranking models [based on minimum rms errors over the seasonal cycle relative to the NCAR–National Centers for Environmental Prediction (NCEP) reanalysis] to project larger twenty-first-century changes. Perhaps models that best capture the seasonal cycle of radiative forcing are more sensitive to external (greenhouse) forcing.
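The "integrated rank" used here can be read as an equally weighted mean of per-variable ranks, roughly as in the following sketch (the model names and rank values are hypothetical):

import numpy as np

models = ["ModelA", "ModelB", "ModelC", "ModelD"]
# Hypothetical per-variable ranks (1 = best) for SAT, precipitation, and SLP over 60-90N.
ranks = np.array([
    [1, 3, 2],
    [2, 1, 4],
    [3, 4, 1],
    [4, 2, 3],
], dtype=float)

weights = np.ones(ranks.shape[1]) / ranks.shape[1]   # equal weight per variable
integrated = ranks @ weights                          # mean rank across variables
for i in np.argsort(integrated):
    print(f"{models[i]}: integrated rank score {integrated[i]:.2f}")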
FIG. 4. Projected late twenty-first-century (2070–90) Arctic-wide (60°–90°N) mean (a) SAT, (b) precipitation, and (c) SLP vary with model rank. After Walsh et al. (2008).
TABLE 2. List of selected models for different regions. SIE: sea ice extent.

Region: Alaska. Variable: SAT. Better-performing models: CCSM3, CNRM-CM3, ECHAM5/MPI-OM, GFDL-CM2.1, MIROC3.2(medres), UKMO-HadCM3.
Region: Barents Sea. Variable: SIE. Better-performing models: CCSM3.
Region: Chukchi Sea. Variable: SIE. Better-performing models: CCSM3, CNRM-CM3, ECHO-G, IPSL-CM4, MIROC3.2(medres), UKMO-HadGEM1.
Region: East Bering Sea. Variable: SIE. Better-performing models: CCSM3, CNRM-CM3, ECHO-G, MIROC3.2(medres).
Region: East Bering Sea. Variable: SST. Better-performing models: CGCM3.1(T63), MIROC3.2(medres), MRI-CGCM2.3.2, UKMO-HadCM3, UKMO-HadGEM1.
Region: East Bering Sea. Variable: Spring winds. Better-performing models: GFDL-CM2.1, MIROC3.2(hires), MIROC3.2(medres), MRI-CGCM2.3.2, UKMO-HadCM3.
This result is a particularly important argument for considering model culling, as the individual members of the "ensemble of opportunity" of all available models do not necessarily represent an unbiased sample of future projections. The selection process can reduce the spread in the future projections. Taking September sea ice extent over the Northern Hemisphere as an example, the averaged standard deviation among all model runs is 2.7 × 10⁶ km² when averaged over the twenty-first century. This standard deviation is reduced to 1.3 × 10⁶ km² when projections are made by the six best-performing models identified by Wang and Overland (2009), a reduction of about 50%.
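The quoted reduction in spread amounts to comparing the across-run standard deviation before and after culling, as in this schematic sketch (the synthetic sea ice trajectories below are illustrative, not the archived CMIP3 runs):

import numpy as np

rng = np.random.default_rng(2)
nruns, nyears = 40, 100                      # all available runs, twenty-first century
years = np.arange(nyears)
base = 6.0 - 0.04 * years                    # declining mean extent (10^6 km^2)
all_runs = base + rng.normal(scale=2.0, size=(nruns, nyears))
good_runs = base + rng.normal(scale=1.0, size=(12, nyears))   # culled, better-performing subset

# Standard deviation across runs in each year, then averaged over the century.
spread_all = np.std(all_runs, axis=0).mean()
spread_culled = np.std(good_runs, axis=0).mean()
print(f"all models: {spread_all:.1f}; culled subset: {spread_culled:.1f} (10^6 km^2)")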
b. Examples from regional results for Alaska

Walsh et al. (2008) found five top-performing models for Alaska (Table 2), all of which have meta-rankings in the first and second groups of Fig. 3. It should be noted that the ranking on a regional level depends on whether the annual mean biases are removed before the rms error calculations. The ECHAM5–MPI model ranks well below the median (13th out of the 15 models) in its simulation of Alaskan temperature, primarily because the model is too cold throughout the year over Alaska (Fig. 5). On the other hand, the amplitude of its seasonal cycle is very close to the observed, so excluding the annual mean bias from consideration improves the model's rank to first among the same set of 15 models. While it seems sensible to remove a model's bias when using its projections, it is also plausible that the bias is inversely related to a model's reliability in making these projections, and hence bias should be considered in model evaluation. This example illustrates that a climate model that simulates several variables well at the large scale can be a relatively poor performer on a regional scale. With regard to the reasons for the different levels of skill over Alaska as well as Greenland and larger domains, no systematic relationship to model resolution emerged. The models with the smallest rms errors, ECHAM5–MPI and GFDL-CM2.1, have resolutions that are neither the highest nor lowest of the 15 models.
Other candidates for explanations of the differences in model performance include the cloud formulations and radiative properties, the planetary boundary layer parameterization, sea ice, and the land surface schemes of the various models (e.g., Sorteberg et al. 2007; Kattsov et al. 2007b; Walsh 2008). Biases in the large-scale atmospheric circulation, perhaps driven by processes outside the Arctic, are also candidates to explain the across-model differences in temperature, precipitation, and sea level pressure.
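The sensitivity of a regional ranking to bias removal, noted above for the ECHAM5–MPI Alaskan temperatures, comes down to whether the rms error is computed on the raw monthly climatology or on anomalies from the annual mean, as in this sketch with placeholder values:

import numpy as np

# Hypothetical monthly climatologies (deg C) for a region: observations and one model
# that is uniformly too cold but has a realistic seasonal cycle.
obs = np.array([-22, -20, -14, -4, 6, 13, 16, 13, 7, -4, -15, -20], dtype=float)
model = obs - 6.0 + np.random.default_rng(3).normal(scale=0.5, size=12)

rmse_raw = np.sqrt(np.mean((model - obs) ** 2))              # penalizes the cold bias
anom_m, anom_o = model - model.mean(), obs - obs.mean()
rmse_debiased = np.sqrt(np.mean((anom_m - anom_o) ** 2))     # seasonal cycle only
print(f"RMSE with bias: {rmse_raw:.1f} C; with annual mean bias removed: {rmse_debiased:.1f} C")

A model of this kind scores poorly on the raw error but very well once the annual mean bias is excluded, which is how a large change in rank can arise from a single methodological choice.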
c. Examples from regional sea ice simulations for the Barents and Chukchi Seas

The Chukchi Sea northwest of Alaska and the Barents Sea north of Norway are seasonal sea ice zones, where the sea ice advances in winter/spring and retreats in summer, leaving regions of open water. We apply the proposed two-step strategy that first evaluates the models' large-scale climate performance and then selects models suitable for regional projections. We start with the six models identified by Wang and Overland (2009) for their Arctic-wide performance and then evaluate their regional behavior with regard to the mean and seasonal cycle of sea ice extent. For the Barents Sea (65°–82°N, 15°–60°E) we found that all models but one overestimate the sea ice extent compared to the observed climatology (thick red line in Fig. 6, left). This is an exception to the statement that biases in simulated climate by different models tend to be unsystematic. An explanation for this result is that expansion of sea ice in the Barents Sea is generally countered by ocean advection of heat into the region from the North Atlantic. An ongoing assessment indicates that present generation models in CMIP3 are often unsatisfactory in their simulations of the regional ocean currents. Because only one model (CCSM3) shows agreement with observations, we have less confidence in making projections of future sea ice conditions for the Barents Sea. The overspecification of sea ice at the end of the twentieth century in the Barents Sea had unfortunate implications for the AR4. Too much sea ice in these models meant that the simulated winter temperatures
FIG. 5. Maps of composite and 14 individual model biases of temperature for January (1958–2000). Simulations are from a subset of the CMIP3 twentieth-century simulations. After Walsh et al. (2008).
FIG. 6. Seasonal cycle of sea ice extent averaged over 1980–99 for the Barents Sea and Chukchi Sea. Units are 10⁶ km². The thick red line shows observations based on the Hadley Centre sea ice analysis, and thick blue lines indicate the ±20% bracket relative to the observed value. The thin colored lines show the sea ice extent seasonality from various CMIP3 models. Note that most models show too much sea ice coverage over the Barents Sea.
for the region during 1980–99 were also too low (Fig. 5). When temperature change was calculated for 2090–99 relative to 1980–99 and averaged over all models, the changes were some of the largest on the planet, greater than 5°C (IPCC Figures SPM.6 and 11.5). By way of comparison, CCSM3 projects a warming in annual mean SAT of about 5.0°C in the Barents Sea over the course of the twenty-first century (2080–99 minus 1980–99) under the A1B emissions scenario, while the CMIP3 ensemble mean projection, excluding CCSM3, is a warming of 5.5°C. The reason for the large temperature changes is not that this region would be excessively warm in the 2090s, but that the reference temperatures were too cold in the late twentieth century because of excessive sea ice cover in the Barents Sea. On a global mean basis, the annual mean temperature change for the same periods (i.e., 2080–99 minus 1980–99) projected by the CCSM3 model is an increase of
2.6°C, whereas the rest of the models give a mean increase of 2.4°C. For the Chukchi Sea, we found that all six models identified by Wang and Overland (2009) pass our regional selection criteria (Table 2) (Fig. 6, right). This may be attributable to the lesser importance of oceanic heat advection to the heat budget of the Chukchi Sea relative to the Barents Sea.
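The regional screen used for the Chukchi and Bering Seas (the ±20% bracket in Fig. 6) reduces to a comparison of each model's climatological regional extent with the observed value; a sketch with illustrative numbers (the model names and extents below are hypothetical):

# Hypothetical climatological regional sea ice extents (10^6 km^2).
observed_extent = 0.60
model_extent = {
    "ModelA": 0.55, "ModelB": 0.73, "ModelC": 0.48, "ModelD": 0.62,
}

TOLERANCE = 0.20   # +/-20% of the observed value
passing = {m: e for m, e in model_extent.items()
           if abs(e - observed_extent) <= TOLERANCE * observed_extent}
print("models passing the regional sea ice criterion:", sorted(passing))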
d. Counterexamples from the Bering Sea

As in the example for the Chukchi Sea, evaluation of sea ice in the eastern Bering Sea (54°–66°N, 175°–155°W) starts with the six models identified by Wang and Overland (2009). We then further require that these models be able to simulate the spring (April and May) sea ice extent over the eastern Bering Sea with less than 20% error relative to the observed value. The latter criterion was imposed because of the increased need for robust information on the
timing and distribution of sea ice since at this time of year it provides an essential habitat for marine mammals such as ringed seals and walrus (e.g., Moore and Huntington 2008). The process indicates that four models are appropriate for sea ice over the eastern Bering Sea (Table 2, column 5). In contrast, for the western Bering Sea we found that only one model (CCSM3) passed the regional selection criteria and none passed the regional test for the Sea of Okhotsk. Sea ice is evidently too extensive east of the Kamchatka Peninsula in most IPCC models. While these results from the Chukchi and east Bering Seas are encouraging, we still lack a general explanation of why some models perform well in one region but not in other proximate regions. Such results should be accepted: some regional/variable combinations are not suitable for projections using this set of models. Applying the two-step approach for southeastern Bering Sea SST and winds, we require that the models first must represent North Pacific SST variability via the Pacific decadal oscillation with sufficient fidelity in terms of spatial pattern and low-frequency energy content, as determined in Overland and Wang (2007). Twelve models remain from the use of this large-scale physical constraint. Presumably, this selection is achieved by the models that properly characterize the multiyear variability associated with air–sea interactions involving the basin-scale forcing, that is, the Aleutian low. Regional evaluation for the southeastern Bering Sea has considered SST during the summer (June–September) and winds during the spring (April–June) using a method of weighted means (Hollowed et al. 2009). With regard to the SST, we have evaluated how the models’ hindcast simulations of the mean, interannual variance, and trend for the last half of the twentieth century compare with observations. The five leading models are listed in Table 2. Figure 7 compares the SST projections for the all-CMIP3-models mean (yellow), the mean of the culled set of 12 models that pass the PDO test (blue) and the five-model (unweighted) regional subset mean (green). The pink line represents observations, and the gray lines are individual ensemble members from the set of 12 culled models. Unlike sea ice, future temperature in this example is a variable for which all-model means and the reduced set of model means have similar trajectories. Of particular interest in Fig. 7 is the range of individual ensemble members in any given projection year. The year-to-year range of natural variability will still dominate the climate well into the twenty-first century. With regard to the springtime winds, we investigate whether weighting model results in forming ensembles is necessarily justified relative to simple all-model means. Weighting has appeal from a Bayesian perspective in that it is reasonable to presume models with demonstrably
greater skill in their hindcast simulations are more reliable for future projections. In addition, there is precedent for using weighted ensembles in seasonal prediction (Krishnamurti et al. 2006) and for longer-term climate prediction (e.g., Giorgi and Mearns 2002). On the other hand, the amount of information available for validation of model hindcasts is limited and there is subjectivity in the criteria by which model hindcasts are gauged. We show an example of a comparison between a simple and weighted ensemble for the onshore wind component in the southeast Bering Sea, where weights were assigned to the 12 models that passed the PDO test, based on the normalized errors of means and interannual variability since the observed past trend is negligible. Leading models for winds are different than those selected for SST (Table 2). We show a comparison between a simple and weighted ensemble of projections of the 12 selected models, illustrated by percentage histograms of the anomalies of the modeled springtime averaged (April–June) onshore wind component for 104 samples (21 ensemble members and yearly samples from the 5-yr period 2043–47, Fig. 8). The weighted ensemble has a mean and median that is shifted slightly to more negative (i.e., more offshore) values relative to the simple, unweighted ensemble. Perhaps more importantly, the tails of the distribution are less pronounced in the weighted versus unweighted ensemble. Individual models yielding strongly negative and positive wind anomaly projections were among the group that received lower weights based on twentieth-century hindcasts. While the two ensembles are similar, the reductions in extreme projections suggest a reason to consider the use of weighting. We can conclude from this example and other examples in section 5 that the merits and impacts of model culling and model weighting are application dependent.
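A weighted ensemble of the kind compared in Fig. 8 can be formed along the following lines, with weights inversely proportional to a normalized hindcast error; all inputs in this sketch are synthetic placeholders rather than the 12 selected CMIP3 models:

import numpy as np

rng = np.random.default_rng(4)
nmodels, nsamples = 12, 104                  # e.g., ensemble members x sampled years
projections = rng.normal(scale=1.5, size=(nmodels, nsamples))   # wind anomalies (m/s)

# Hypothetical normalized hindcast errors (mean plus interannual variability); smaller is better.
hindcast_error = rng.uniform(0.5, 2.0, nmodels)
weights = 1.0 / hindcast_error
weights /= weights.sum()

simple_mean = projections.mean(axis=0)                       # equal weights
weighted_mean = (weights[:, None] * projections).sum(axis=0) # hindcast-based weights
print(f"simple: {simple_mean.mean():+.2f} m/s, weighted: {weighted_mean.mean():+.2f} m/s")

Down-weighting models with poor hindcasts mainly trims the influence of outlying projections, consistent with the reduced tails of the weighted histogram in Fig. 8.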
6. Conclusions

This paper discusses approaches for using the output from current and future generation climate models for regional applications. In general, the quality of individual model simulations depends on the variable, region, and evaluation metric. The reasons for these inconsistencies are rarely clear, which bears on the credibility of the models' results on regional scales. Nevertheless, there is certainly interest in regional projections, as gauged by the hundreds of studies indexed by the PCMDI. Further, the top-performing models for the Arctic in many of the cases that we have investigated tend to be more sensitive to greenhouse forcing than the poorer-performing models, suggesting that all-model means do not provide unbiased projections.
FIG. 7. Projected SST anomalies for the eastern Bering Sea during June from ensemble members of 12 CMIP3 models (gray lines). The colored lines show ensemble means based on different numbers of model projections using the culling strategy. The magenta line indicates the observed value based on the Hadley SST2 analysis.
As illustrated by the examples in the preceding section, the strategy for dealing with climate models’ uncertainties can be keyed to several considerations. First, the available models of opportunity can be evaluated based on the ability of their twentieth-century hindcasts to reproduce large-scale climate variability. This variability can include the annual cycle of certain variables
that are responsive to radiative forcing; it can also include leading spatial and/or temporal modes of variability. Such an evaluation serves as a climatic basis for culling some models from further consideration. Second, one can consider twentieth-century hindcasts of problem-relevant variables in the region of interest for further selection of models or assignments of weights to
FIG. 8. Comparison of histograms between simple and weighted ensembles for anomalies of the onshore wind component projections in the southeast Bering Sea using 21 ensemble members and samples from five individual years (2043–47). Anomalies of wind are binned into categories of width 1 m s⁻¹, centered on integer values. There is a shift in the mean and a reduction in extremes when the influence of models with poor comparisons to observed winds is given reduced weight.
variables or models. While it may be tempting to determine a single best model and use its projections, this practice has serious risks. All climate models are subject to model uncertainty, and prior success may be fortuitous (Reifen and Toumi 2009). Moreover, the spread in the projections from different models provides a measure of one major source of uncertainty of projections. Finally, consideration can be given to the sophistication of each model vis-à-vis the parameters of interest. For example, the models with more elaborate schemes for handling sea ice tended to compare better with observations and thus have increased credibility for Arctic applications.
The Northern Hemisphere high-latitude examples presented here support a metric-based selection of a subset of models; it is counterintuitive to include the simulations from models known or strongly suspected to have low skill for historical climate variability or particular regional variables. Studies that include a subset of models should openly acknowledge that the selection process is guided by a range of choices. Given that there is presently, and for the foreseeable future, no convergence on a best subset of models for all applications or a universal procedure for model selection, the needs and priorities of a particular application will inevitably influence at least some steps of the model-selection procedure. The meta-analysis in Fig. 3 is for illustrative purposes and should not be used as a model endorsement, as the use of a particular model should be application dependent. Even within the reduced set of selected models, as supported by our proposed procedure, the quest for reasons to explain the full range of model results to date has no apparent quantitative solution. Nevertheless, it is apparent that the age of model democracy is over, as there is presently a widespread effort to move beyond arithmetic averaging over all members of an ensemble of opportunity. Given this situation, the use of selection metrics and aggregation methods should remain simple and transparent, with careful documentation of the model selection procedure.

Acknowledgments. We acknowledge the modeling groups, the Program for Climate Model Diagnosis and Intercomparison (PCMDI), and the WCRP's Working Group on Coupled Modelling (WGCM) for their roles in making available the WCRP CMIP3 multimodel dataset. Support of this dataset is provided by the Office of Science, U.S. Department of Energy. We appreciate the organizational support to host workshops to discuss the use of CMIP3 models from the Arctic Council AMAP Climate Group, the ongoing Snow, Water, Ice and Permafrost in the Arctic (SWIPA) review, Ecosystem Studies of Sub-Arctic Seas (ESSAS), and
Working Group 20 of PICES. JEO, MW, and NAB appreciate the support of the NOAA Arctic Project, the NOAA FATE Project, and the AYK Project. This publication is partially funded by the Joint Institute for the Study of the Atmosphere and Ocean (JISAO) under NOAA Cooperative Agreement NA17RJ1232. JEW was supported by NOAA Grant NA10OAR4310055 and NSF Grant ARC-0652838, while WLC was supported by NSF Grant OPP-0520112. VMK was supported by NSF Grant OPP-0652838 and RFBR Grant 08-05-00569.
REFERENCES

Annan, J. D., and J. C. Hargreaves, 2010: Reliability of the CMIP3 ensemble. Geophys. Res. Lett., 37, L02703, doi:10.1029/2009GL041994.
Giorgi, F., and L. O. Mearns, 2002: Calculation of average, uncertainty range, and reliability of regional climate changes from AOGCM simulations via the "Reliability Ensemble Averaging" (REA) method. J. Climate, 15, 1141–1158.
Gleckler, P. J., K. E. Taylor, and C. Doutriaux, 2008: Performance metrics for global climate models. J. Geophys. Res., 113, D06104, doi:10.1029/2007JD008972.
Hall, A., and X. Qu, 2006: Using the current seasonal cycle to constrain snow albedo feedback in future climate change. Geophys. Res. Lett., 33, L03502, doi:10.1029/2005GL025127.
Hawkins, E., and R. Sutton, 2009: The potential to narrow uncertainty in regional climate predictions. Bull. Amer. Meteor. Soc., 90, 1095–1107.
Hollowed, A. B., N. A. Bond, T. K. Wilderbuer, W. T. Stockhausen, Z. T. A'mar, R. J. Beamish, J. E. Overland, and M. J. Schirripa, 2009: A framework for modelling fish and shellfish responses to future climate change. ICES J. Mar. Sci., 66, 1584–1594.
Jun, M. Y., R. Knutti, and D. W. Nychka, 2008: Local eigenvalue analysis of CMIP3 climate model errors. Tellus, 60A, 992–1000.
Kattsov, V. M., G. A. Alekseev, T. V. Pavlova, P. V. Sporyshev, R. V. Bekryaev, and V. A. Govorkova, 2007a: Modeling the evolution of the World Ocean ice cover in the 20th and 21st centuries. Izv. Russ. Acad. Sci., 43, 165–181.
——, J. E. Walsh, W. L. Chapman, V. A. Govorkova, T. V. Pavlova, and X. Zhang, 2007b: Simulation and projection of Arctic freshwater budget components by the IPCC AR4 global climate models. J. Hydrometeor., 8, 571–589.
Knutti, R., 2008: Should we believe model predictions of future climate change? Philos. Trans. Roy. Soc., A366, 4647–4664.
——, and Coauthors, 2008: A review of uncertainties in global temperature projections over the twenty-first century. J. Climate, 21, 2651–2663.
Krishnamurti, T. N., A. Chakraborty, R. Krishnamurti, W. K. Dewar, and C. A. Clayson, 2006: Seasonal prediction of sea surface temperature anomalies using a suite of 13 coupled atmosphere–ocean models. J. Climate, 19, 6069–6088.
Moore, S. E., and H. P. Huntington, 2008: Arctic marine mammals and climate change: Impacts and resilience. Ecol. Appl., 18 (Suppl.), S157–S165.
Nakicenovic, N., and Coauthors, 2000: IPCC Special Report on Emissions Scenarios: A Special Report of Working Group III of the IPCC. Cambridge University Press, 599 pp.
Overland, J. E., and M. Wang, 2007: Future climate of the North Pacific Ocean. Eos, Trans. Amer. Geophys. Union, 88, 182.
Räisänen, J., 2007: How reliable are climate models? Tellus, 59A, 2–29.
——, L. Ruokolainen, and J. Ylhäisi, 2009: Weighting of model results for improving best estimate of climate change. Climate Dyn., 35, 407–422.
Randall, D. A., and Coauthors, 2007: Climate models and their evaluation. Climate Change 2007: The Physical Science Basis, S. Solomon et al., Eds., Cambridge University Press, 589–662.
Reichler, T., and J. Kim, 2008: How well do coupled models simulate today's climate? Bull. Amer. Meteor. Soc., 89, 303–311.
Reifen, C., and R. Toumi, 2009: Climate projections: Past performance no guarantee of future skill? Geophys. Res. Lett., 36, L13704, doi:10.1029/2009GL038082.
Solomon, S., D. Qin, M. Manning, M. Marquis, K. Averyt, M. M. B. Tignor, H. L. Miller Jr., and Z. Chen, Eds., 2007: Climate Change 2007: The Physical Science Basis. Cambridge University Press, 996 pp.
Sorteberg, A., V. Kattsov, J. E. Walsh, and T. Pavlova, 2007: The Arctic surface energy budget as simulated with the IPCC AR4 AOGCMs. Climate Dyn., 29, 131–156.
Stoner, A. M., K. Hayhoe, and D. J. Wuebbles, 2009: Assessing general circulation model simulations of atmospheric teleconnection patterns. J. Climate, 22, 4348–4372.
Tebaldi, C., and R. Knutti, 2007: The use of the multi-model ensemble in probabilistic climate projections. Philos. Trans. Roy. Soc., A365, 2053–2075.
Walsh, J. E., 2008: Simulations of present Arctic climate and future regional projections. Proc. Ninth Int. Conf. on Permafrost, Fairbanks, AK, U.S. Permafrost Association, 1911–1916.
——, W. L. Chapman, V. Romanovsky, J. H. Christensen, and M. Stendel, 2008: Global climate model performance over Alaska and Greenland. J. Climate, 21, 6156–6174.
Wang, M., and J. E. Overland, 2009: A sea ice free summer Arctic within 30 years? Geophys. Res. Lett., 36, L07502, doi:10.1029/2009GL037820.
——, ——, V. M. Kattsov, J. E. Walsh, X. Zhang, and T. Pavlova, 2007: Intrinsic versus forced variations in coupled climate model simulations over the Arctic during the twentieth century. J. Climate, 20, 1093–1107.
Wu, Q., D. J. Karoly, and G. R. North, 2008: Role of water vapor feedback on the amplitude of seasonal cycle in the global mean surface air temperature. Geophys. Res. Lett., 35, L08711, doi:10.1029/2008GL033454.