Evaluation of Short-Range Quantitative Precipitation Forecasts from a Time-Lagged Multimodel Ensemble

HUILING YUAN

Cooperative Institute for Research in Environmental Sciences, University of Colorado, and NOAA/Earth System Research Laboratory/Global Systems Division, Boulder, Colorado
CHUNGU LU NOAA/Earth System Research Laboratory/Global Systems Division, Boulder, and Cooperative Institute for Research in the Atmosphere, Colorado State University, Fort Collins, Colorado
JOHN A. MCGINLEY, PAUL J. SCHULTZ, BRIAN D. JAMISON, AND LINDA WHARTON NOAA/Earth System Research Laboratory/Global Systems Division, Boulder, Colorado
CHRISTOPHER J. ANDERSON

NOAA/Earth System Research Laboratory/Global Systems Division, Boulder, and Cooperative Institute for Research in the Atmosphere, Colorado State University, Fort Collins, Colorado

(Manuscript received 19 June 2007, in final form 1 August 2008)

ABSTRACT

Short-range quantitative precipitation forecasts (QPFs) and probabilistic QPFs (PQPFs) are investigated for a time-lagged multimodel ensemble forecast system. One of the advantages of such an ensemble forecast system is its low-cost generation of ensemble members. In conjunction with a frequently cycling data assimilation system using a diabatic initialization [such as the Local Analysis and Prediction System (LAPS)], the time-lagged multimodel ensemble system offers a particularly appealing approach for QPF and PQPF applications. Using the NCEP stage IV precipitation analyses for verification, 6-h QPFs and PQPFs from this system are assessed during the period of March–May 2005 over the west-central United States. The ensemble system was initialized by hourly LAPS runs at a horizontal resolution of 12 km using two mesoscale models: the fifth-generation Pennsylvania State University–National Center for Atmospheric Research Mesoscale Model (MM5) and the Weather Research and Forecast (WRF) model with the Advanced Research WRF (ARW) dynamic core. The 6-h PQPFs from this system provide better performance than the NCEP operational North American Mesoscale (NAM) deterministic runs at 12-km resolution, even though individual members of the MM5 or WRF models perform comparatively worse than the NAM forecasts at higher thresholds and longer lead times. Recalibration was conducted to reduce the intensity errors in time-lagged members. In spite of large biases and spatial displacement errors in the MM5 and WRF forecasts, statistical verification of QPFs and PQPFs shows more skill at longer lead times by adding more members from earlier initialized forecast cycles. Combining the two models only reduced the forecast biases. The results suggest that further studies on time-lagged multimodel ensembles for operational forecasts are needed.

Corresponding author address: Huiling Yuan, NOAA/ESRL, R/GSD7, 325 Broadway, Boulder, CO 80305-3328. E-mail: [email protected]

DOI: 10.1175/2008WAF2007053.1
1. Introduction

Short-range quantitative precipitation forecasts (QPFs) are critical in providing flash flood warnings, information for transportation management (including aviation
and highway decision making), and fire weather forecasting. Flash flood warnings require 30-min precipitation input, while river flood prediction requires 6-h QPFs. Due to a lack of good precipitation estimates at high temporal frequencies (such as hourly and less than hourly), this paper focuses on discussing 6-h QPFs and probabilistic QPFs (PQPFs). One of the issues related to short-range QPFs from regional, fine-resolution atmospheric models initialized
FIG. 1. Flowchart of the NOAA LAPS diabatic initialization scheme. The scheme can initialize any forecast model that uses explicit microphysics with a realistic diabatic state and initial clouds.
by simply interpolating from larger-scale models is the ‘‘spinup’’ time period generally necessary for these models to fully develop cloud systems and precipitation. This spinup problem results from the absence of clouds and saturated updrafts in the initialization, which means latent heating rates are initialized to zero everywhere. We refer to this type of initialization as a ‘‘cold start,’’ which can lead to spinup times of up to 6 h into the model forecast. Lu et al. (2007) showed that model spinup imposed a major forecast error on short-range numerical weather predictions (NWP). To overcome the initial spinup problem in an NWP model, the National Oceanic and Atmospheric Administration/Earth System Research Laboratory/Global Systems Division [NOAA/ESRL/GSD, formerly the NOAA Forecast Systems Laboratory (FSL)] developed the Local Analysis and Prediction System (LAPS, Albers 1995; Jian et al. 2003; Jian and McGinley 2005), a ‘‘hot start’’ diabatic initialization scheme. As shown in Fig. 1, the LAPS scheme ingests real-time data from radar, satellite, aircraft, surface observations, and background information from a regional-scale atmospheric model. Estimates of cloud hydrometeors are diagnosed for each grid column based on a one-dimensional cloud model, ensuring thermodynamic consistency with the state variables. This provides a realistic model initialization state. With this diabatic initialization, microphysical processes and grid-resolving precipitation are activated in the first model time step, so that precipitation is present from the beginning. Jian and McGinley (2005)
demonstrated that much improved short-range QPFs can be obtained using this diabatic initialization scheme in a study of several tropical cyclones. In recent years, with the rapid advancement of computer technology, ensemble forecasting has been widely used to estimate uncertainties in weather forecasting and provide operational weather forecasts (Lewis 2005 and references within). Brooks et al. (1995) examined the feasibility of short-range ensemble forecast applications in NWP. Du et al. (1997) studied the impact of initial condition uncertainty on short-range precipitation forecasts. Their results demonstrated that cool season QPFs and PQPFs from ensemble forecasts could achieve greater forecast skill than using a deterministic forecast with double the horizontal resolution. Previous studies investigated Monte Carlo methods to generate initial conditions in ensemble forecasts (such as Leith 1974; Mullen and Baumhefner 1989; Stensrud et al. 2000). The NOAA/National Centers for Environmental Prediction (NCEP) developed breeding perturbed initializations for both global ensemble forecasts (Toth and Kalnay 1993, 1997) and short-range ensemble forecasting (SREF, Du et al. 2006) systems. The European Centre for Medium-Range Weather Forecasts (ECMWF) operates short-range to early middle-range ensemble forecasts using singular vector-perturbed initializations (e.g., Palmer et al. 1993; Molteni et al. 1996; Buizza et al. 1999; Hersbach et al. 2000). In addition to short-range ensembles using perturbed initial conditions, there are multiple physics ensembles (e.g., Stensrud et al.
2000; Bright and Mullen 2002), multimodel ensembles (e.g., Krishnamurti et al. 1999), and the combinations of multiple physics and multimodel ensembles (e.g., Alhamed et al. 2002; Du et al. 2006). Early studies showed the usefulness of the time-lagged ensemble forecast system (e.g., Hoffman and Kalnay 1983; Dalcher et al. 1988). A time-lagged ensemble forecast system has been applied to construct high-resolution ensembles, such as the shifted initialization technique (Walser et al. 2004). Lu et al. (2007) developed a time-lagged ensemble forecast system for the application of short-range NWP, including verifying state variables of geopotential heights, temperature, relative humidity, and winds. They argued that because the time-lagged ensembles were composed of a set of deterministic forecasts initialized sequentially from different analysis times, the ensemble system captured the flow-evolving error growth in the model initial conditions. By optimally weighting these ensemble members using a multilinear regression algorithm, they showed that the short-range NWP errors could be reduced significantly. However, the model configuration in Lu et al. (2007) was based on rapid updating, which posed the problem that each forecast in the time series was not sufficiently independent of the previous run. Other studies on the NCEP SREF experiments indicate that these ensembles, which were built using only initial perturbations, generally have insufficient dispersion (Hamill and Colucci 1998; Stensrud et al. 1999); that is, the variability between ensemble members is too small. This may be especially true for short-range QPFs, because forecast precipitation strongly depends on parameterized model physics. Alhamed et al. (2002) showed that model diversity in an ensemble system could generate forecasts with enhanced ensemble spread. Stensrud et al. (2000) discussed the significance of variations in both model physics and initial conditions in ensemble forecasting. Motivated by the methodology used in Lu et al. (2007), this study evaluates a time-lagged ensemble system that attempts to exploit ensemble diversity obtained by multiple initialization datasets. Multiple models are also used to add diversity arising from various model dynamic cores and model physics, and to increase ensemble spread. The ensemble system is initialized by LAPS for each forecast cycle. The introduction of new observations at each analysis time helps to reduce temporal error correlations. Short-range QPFs and PQPFs from this ensemble system are investigated for the period of March–May 2005 over the west-central United States. To reduce the intensity errors due to longer lead times, recalibration (section 4b) was applied to the time-lagged members using the most recent forecast.
This paper is organized as follows. Section 2 introduces the time-lagged multimodel ensemble system. Section 3 describes the model configurations, the experimental design, the verification data, and a reference operational forecast dataset. The assessment of 6-h QPFs and PQPFs during 3 months is presented in sections 4 and 5, respectively. A summary and conclusions are provided in section 6.
2. Time-lagged multimodel ensemble system

a. Time-lagged multimodel ensembles

The concept underlying a time-lagged ensemble forecast system is to utilize a frequently updating data assimilation–forecast system to generate an ensemble of forecasts (Fig. 2; Hoffman and Kalnay 1983). Each cycle generates a new forecast and no additional cost is incurred for the generation of ensemble members; thus, a set of time-lagged forecasts provides an economical way to enlarge the ensemble. In general, the older the forecast, the larger the error of 6-h QPFs contributed by that member. Therefore, including forecasts many hours old could negatively impact the verification of ensemble forecasts. On the other hand, ensemble forecasting needs a set of diversified model solutions (a large ensemble spread) to broaden the statistical sample domain and more skillfully capture the true solution. The number of ensemble members depends on the forecast projection time and forecast initialization interval. The computational demand of running many models concurrently limits the number of ensemble members for longer forecast lead times. However, the time-lagged ensemble can be enhanced by including additional models. Such a mixture gives rise to the concept of a time-lagged multimodel ensemble system, which not only increases the ensemble size, but also enhances the diversity of the ensemble members. Figure 2 shows the time-lagged multimodel ensemble system with a combination of two mesoscale models: the fifth-generation Pennsylvania State University–National Center for Atmospheric Research (NCAR) Mesoscale Model (MM5) and the Weather Research and Forecast (WRF) model with the Advanced Research WRF (ARW) dynamic core. In practice, more models could be used for constructing this ensemble system. As indicated in Fig. 2, the two models (white and gray horizontal bars) are initialized every dt hours at H, H + dt, H + 2dt, and so on. At any arbitrary forecast time (t0, the vertical line, within the maximum model forecast projection, i.e., the length of the bars), there are N forecasts (e.g., QPFs) generated by these two models; that
FIG. 2. Schematic plot for the time-lagged multimodel ensemble forecast system. At each initial time, each pair of ensemble runs uses an updated LAPS diabatic initialization.
is, N is the number of ensemble members. One can determine the value of N by counting how many times the model bars intersect a particular time line (t0) or by using the following formula:

N = M × T = M × INT[(maximum forecast projection − lead time)/(model initialization interval dt) + 1],    (1)

where M is the number of models, T is the number of time-lagged members for an individual model, and INT is a function for truncating a real number to an integer. The number (T) of time-lagged members can be determined by the maximum forecast projection time, the forecast lead (projection) time, and the model initialization interval. In this study, for 6-h precipitation accumulations, the maximum forecast projection is 18 h using two models with a 1-h initialization interval, and the lead time can be from 6 to 18 h. Therefore, a total of 26 time-lagged multimodel ensemble members are available at a 6-h lead time, and the ensemble size decreases to 2 at an 18-h lead time.
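As a quick check of Eq. (1), the following short sketch (my own illustration, not code from the study) evaluates the ensemble size for the configuration described above; the function and argument names are hypothetical.

```python
def ensemble_size(n_models, max_projection_h, lead_time_h, init_interval_h):
    """Number of time-lagged multimodel members available at a given lead time [Eq. (1)]."""
    n_lagged = int((max_projection_h - lead_time_h) / init_interval_h) + 1  # T = INT[...] + 1
    return n_models * n_lagged  # N = M * T

# Two models (MM5 and WRF), 18-h maximum projection, hourly initialization:
print(ensemble_size(2, 18, 6, 1))   # 26 members at a 6-h lead time
print(ensemble_size(2, 18, 18, 1))  # 2 members at an 18-h lead time
```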
b. Ensemble-mean QPF and PQPF

For each grid point, the ensemble-mean QPF is the average value calculated by equally weighting all time-lagged multimodel ensemble members [e.g., Eq. (4.1) in Lu et al. (2007)]. Similarly, the PQPF can be computed by equally weighting all ensemble members: it represents the probability of precipitation at a grid point exceeding a given precipitation threshold, that is, the fraction of ensemble members exceeding the given threshold [P = (number of members > threshold)/(total number of ensemble members)]. The observed probability equals either 0 or 1, for the case of occurrence or nonoccurrence over the precipitation threshold.
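A minimal sketch of these two quantities, assuming the member QPFs are stacked in a NumPy array (my own illustration; the array and variable names are hypothetical):

```python
import numpy as np

def ensemble_mean_qpf(members):
    """Equally weighted ensemble-mean QPF; members has shape (n_members, ny, nx)."""
    return members.mean(axis=0)

def pqpf(members, threshold):
    """Fraction of members exceeding the threshold at each grid point."""
    return (members > threshold).mean(axis=0)

# Example with synthetic fields standing in for 26 member 6-h QPFs on a 128 x 128 grid
rng = np.random.default_rng(0)
qpf_members = rng.gamma(shape=0.5, scale=2.0, size=(26, 128, 128))
prob_exceed = pqpf(qpf_members, 2.5)  # probability of exceeding 2.5 mm (6 h)^-1
```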
3. Model configurations and data
a. Model configurations
The MM5 and WRF models used in this study were implemented during the 2004/05 winter to support highway snow removal operations via the Maintenance Decision Support System (MDSS; Mahoney et al. 2005), a development and demonstration project sponsored by the Federal Highway Administration. The two models were nonhydrostatic and used different dynamic cores, with different terrain-following coordinate systems and advection schemes. The Noah land surface model (Ek et al. 2003) was similar in both models, but the microphysical schemes were different: the WRF model used the NCAR WRF Single-Moment (WSM) five-class microphysical scheme, while the MM5 model used the Schultz (1995) scheme. Convective cumulus schemes were not used in either model at a horizontal grid spacing of 12 km. Both models were initialized every hour by the LAPS diabatic initialization and run
over a west-central United States domain (Fig. 3) with 128 × 128 horizontal grid points and 31 vertical levels. The lateral boundary conditions were provided by the NCEP North American Mesoscale (NAM, the former Eta Model; information online at http://www.meted.ucar.edu/nwp/pcu2/NAMMay2005.htm) model at 40-km resolution [gridded binary (GRIB) 212 grids]. The WRF (MM5) model was run to 18 (24) h with hourly model output. The ensemble output data were analyzed from March to May 2005 over the west-central United States. This configuration was intended to optimize 0–6-h precipitation forecasts in the Denver, Colorado, metropolitan area and the stretch of Interstate 70 that extends approximately 100 km due west of Denver into the mountainous terrain (Mahoney et al. 2005). In this area, a substantial fraction of the total observed precipitation was caused by orographic forcing. Furthermore, the heaviest precipitation, well away from the Denver area, was associated with strong convection, which was usually not accurately treated by models using 12-km grid spacing without a convective parameterization. For these reasons, the model configuration may not have been ideal for QPFs computed over the entire domain, as done herein. Weisman et al. (1997) examined the performance of models with and without convective parameterizations at various grid spacings; their results suggest that a convective parameterization would be appropriate on a 12-km grid, such as that used in this study, when deep convection is present. For the purpose of optimizing precipitation forecasts in situations with deep convection and for the lead times analyzed, the modeling domain should be large enough to easily contain an entire large synoptic-scale cyclonic disturbance, and either the grid increment should be decreased to about 4 km or a convective parameterization should be used. The NAM model runs used here for comparison, which are also computed at ~12-km grid spacing, are configured better for this purpose; this is reflected in the relatively good skill scores compared to the ensemble system (to be shown). Still, the salient points of this paper remain: the use of time-lagged model runs improves the ensemble system forecast performance, as does the use of a variety of models.
b. Verification data

Precipitation verification data are from the NCEP stage IV precipitation analyses (information online at http://www.emc.ncep.noaa.gov/mmb/ylin/pcpanl/). Since hourly precipitation estimates from the NCEP stage II and IV analyses have quality control (QC) problems, this paper verifies only 6-h precipitation against the NCEP stage IV 6-h dataset, which is composed from
radar and rain gauge observations with a QC procedure. The 4-km stage IV 6-h precipitation data are aggregated to the 12-km WRF and MM5 model grids four times per day (0000, 0600, 1200, and 1800 UTC) during the study period. Uncertainties in the observation–verification data greatly affect verification scores in precipitation forecasts (e.g., Yuan et al. 2005), especially over mountainous terrain. This paper uses the stage IV 6-h precipitation analyses as "truth" to verify QPFs and PQPFs. The resampling technique (Hamill 1999) is adopted to estimate the uncertainty of the verification scores. By randomly selecting the statistics 10 000 times from all 6-h verification periods (with replacement) and computing a new score each time, the 90% confidence bounds are obtained as the upper and lower bounds (95% and 5%) of the 10 000 scores.
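The resampling procedure can be sketched as follows (my own reading of the Hamill 1999 approach described above; the score function and per-period statistics are placeholders):

```python
import numpy as np

def bootstrap_bounds(period_stats, score_fn, n_boot=10000, seed=0):
    """5th/95th percentiles (90% bounds) of a score recomputed from resampled periods."""
    rng = np.random.default_rng(seed)
    n = len(period_stats)
    scores = np.empty(n_boot)
    for i in range(n_boot):
        sample = [period_stats[j] for j in rng.integers(0, n, size=n)]  # with replacement
        scores[i] = score_fn(sample)
    return np.percentile(scores, 5), np.percentile(scores, 95)
```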
c. Operational high-resolution NAM forecasts

The NCEP operational 12-km NAM model is used as a reference for assessing time-lagged multimodel ensembles during the same study period. The 12-km NAM data on GRIB 218 grids over the continental United States are interpolated to the WRF and MM5 12-km model grids using an inverse distance-weighting scheme. The boundary conditions of the time-lagged multimodel ensemble system come from the 40-km NAM forecasts, which share the same model structures and physical schemes as the 12-km operational NAM runs. However, the model physics in the 12-km WRF and MM5 models were different from the 12-km NAM. The purpose of this evaluation is not to compare individual model performance, but to examine how the time-lagged multimodel ensemble system can improve precipitation forecasts.
4. Six-hourly QPFs during 3 months

a. The 3-month mean of 6-h QPFs

Figure 3 shows the 3-month mean of 6-h precipitation accumulations, including a total of 348 archived 6-h validation times from 1800 UTC 1 March 2005 to 0000 UTC 1 June 2005. The observed precipitation (Fig. 3a) is concentrated over the relatively high terrain. The observed heavy precipitation centers [>0.75 mm (6 h)⁻¹, ~90 mm month⁻¹] are mainly located in Nebraska and the border areas of Utah, Wyoming, and Idaho. The 6-h QPFs (Figs. 3b–l) at a 6-h lead time are shown for the NAM, MM5, WRF, and multimodel ensemble forecasts. Compared to the stage IV analyses (Fig. 3a), the 0–6-h NAM forecasts at 12-km resolution (Fig. 3b) underestimate heavy precipitation centers and show extended
FIG. 3. The 3-month mean of 6-h precipitation from March to May 2005 for (a) the stage IV analyses and the forecasts at a 6-h lead time from the (b) NAM, (c) MM5 with 19 members, and (d)–(f) MM5, (g)–(i) WRF, and (j)–(l) multimodel ensemble with the 1, 6, and 13 most recent members in each model. Spatial correlation coefficients (cor) and RMSEs (rmse) between the 3-month mean of (a) the stage IV analyses and (b)–(l) the model forecasts are shown in the titles.
precipitation areas, which perhaps is caused by the Betts–Miller–Janjic (BMJ; Janjic 1994) convective scheme. Over- and underestimations of low and high precipitation amounts, respectively, often can be seen in the operational NAM forecasts (see the scores and biases online: http://www.emc.ncep.noaa.gov/mmb/ylin/pcpverif/scores/). Previous studies indicated that the operational Eta Model (the former version of the NAM) using the BMJ convective scheme tends to produce smooth precipitation fields (e.g., Gallus 1999), and much smoother results compared to the Eta and WRF models using the Kain–Fritsch (Kain and Fritsch 1993) convective scheme (Baldwin and Wandishin 2002). Baldwin and Wandishin (2002) also suggested that different smoothing of precipitation forecasts results from different convective schemes and model parameterizations in horizontal diffusion and vertical turbulent mixing. In this study, both the WRF and MM5 models were run with the convective scheme deactivated and differed from the NAM model in other respects. Not surprisingly, the 0–6-h NAM QPFs (Fig. 3b) are much smoother than the MM5 (Fig. 3d) and WRF (Fig. 3g) forecasts. The 0–6-h MM5 forecasts (Fig. 3c) demonstrate dry biases in both precipitation coverage and extreme centers, while the WRF forecasts (Fig. 3g) show wet biases, especially in the Salt Lake area of Utah. Spatial correlation coefficients and root-mean-square errors (RMSEs, shown in the titles in Figs. 3b–l) between the 3-month mean fields of the stage IV analyses and the corresponding model forecasts show that the NAM forecasts possess smaller forecast errors and higher spatial correlations than the MM5 and WRF forecasts, consistent with the smoother NAM field performing better under quadratic metrics such as the RMSE and the spatial correlation coefficient. Three ensemble groups are constructed using the time-lagged members from the WRF, MM5, and multimodel ensemble (combining both the MM5 and WRF forecasts at the same lead time or with the corresponding members; e.g., Fig. 3j uses the members from Figs. 3d and 3g) forecasts, respectively. The 3-month mean of 6-h ensemble-mean QPFs (Figs. 3e and 3h) using the six most recently initialized time-lagged members (i.e., no forecasts initialized earlier than 6 h) in each model also exhibits wet biases and spatial displacement errors. The WRF forecasts (Figs. 3g, 3h, and 3i) have higher spatial correlation coefficients with slightly increased RMSEs compared to the MM5 forecasts (Figs. 3d, 3e, and 3f). The multimodel ensemble mean (Figs. 3j, 3k, and 3l) shows a similar precipitation distribution with slightly decreased RMSEs and spatial correlation coefficients compared to the WRF model. In general, spatial correlation coefficients change slightly
and the RMSE increases with increasing ensemble size [such as the multimodel forecasts (Figs. 3j, 3k, and 3l) and the MM5 forecasts (Figs. 3d, 3e, 3f, and 3c)].
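For reference, the two summary measures quoted in the Fig. 3 panel titles can be computed as in the following sketch (my own illustration, assuming both 3-month mean fields are on the common 12-km grid):

```python
import numpy as np

def spatial_cor_rmse(forecast_mean, analysis_mean):
    """Spatial correlation coefficient and RMSE between two 2D mean fields."""
    f, o = forecast_mean.ravel(), analysis_mean.ravel()
    cor = np.corrcoef(f, o)[0, 1]
    rmse = np.sqrt(np.mean((f - o) ** 2))
    return cor, rmse
```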
b. Frequency biases, recalibration, and bias-adjusted threat scores

Frequency bias (Jollife and Stephenson 2003) and the bias-adjusted threat score (TSA; Baldwin and Kain 2006) are used to analyze QPFs. The TSA is less sensitive to forecast bias than the equitable threat score (ETS; Hamill 1999; Baldwin and Kain 2006). The TSA at a given threshold is defined as

TSA = [(a + c)^(1/B) − c^(1/B)] / [(a + c)^(1/B) + c^(1/B)],    (2)

where a is the number of hit events (forecast events that were observed), b is the number of false alarm events, and c is the number of missed events; in addition, B = (a + b)/(a + c) is the frequency bias, which has a best value of 1 and indicates an underestimation (overestimation) when its value is less (greater) than 1. The TSA ranges from 0 to 1 with a best value of 1. Resampling (Hamill 1999) is used to provide the upper and lower bounds (95% and 5%) of a skill score, by randomly selecting the statistics 10 000 times from the 348 six-hour verification periods (with replacement).

At the 0.25 mm (6 h)⁻¹ threshold (Fig. 4a), the WRF and multimodel (combining the corresponding WRF and MM5) forecasts show small biases, while the MM5 deterministic forecasts exhibit dry biases. For higher thresholds, frequency biases increase with the lead time for the WRF, MM5, and multimodel forecasts, and the rate of increase slows after the 12-h lead time. At short lead times [e.g., 6–11-h lead times at the 2.5 mm (6 h)⁻¹ threshold], the confidence bounds of the multimodel ensemble forecasts do not overlap with the scores of the individual models, indicating that the differences are significant at the 95% level or higher. Considering the confidence bounds, the multimodel ensemble forecasts demonstrate advantages over the individual models at lower thresholds and shorter lead times. At lower thresholds [0.25 and 2.5 mm (6 h)⁻¹], the 12-km NAM forecasts at the 6-, 12-, and 18-h lead times exhibit higher wet biases than the other models, while at higher thresholds [5 and 10 mm (6 h)⁻¹], the NAM forecasts show better bias scores at longer lead times than the other models.

With increasing ensemble size (from the most recently initialized members to the longest time-lagged members), frequency biases (Fig. 5) of the 6-h ensemble-mean QPFs at a 6-h lead time tend to increase when using a few time-lagged members but soon saturate and decrease.
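A small sketch (my own, using the reconstruction of Eq. (2) above) of the frequency bias and the TSA from 2 × 2 contingency counts, where a = hits, b = false alarms, and c = misses:

```python
def frequency_bias(a, b, c):
    """B = (a + b)/(a + c); 1 is unbiased, <1 underforecast, >1 overforecast."""
    return (a + b) / (a + c)

def tsa(a, b, c):
    """Bias-adjusted threat score of Eq. (2)."""
    B = frequency_bias(a, b, c)
    num = (a + c) ** (1.0 / B) - c ** (1.0 / B)
    den = (a + c) ** (1.0 / B) + c ** (1.0 / B)
    return num / den

# For an unbiased forecast (b == c), the TSA reduces to the ordinary threat score
# a / (a + 2c); e.g., tsa(50, 20, 20) == 50 / 90.
```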
FIG. 4. Frequency biases of 6-h QPFs during the 3 months from March to May 2005 as a function of forecast lead time for thresholds at (a) 0.25, (b) 2.5, (c) 5.0, and (d) 10 mm (6 h)⁻¹ from the WRF (boldface solid line), MM5 (dashed line), and multimodel ensemble (solid line with open circles) forecasts, and the NAM runs at 6-, 12-, and 18-h lead times (asterisks). Error bars represent 90% confidence bounds for the multimodel ensemble.

FIG. 5. Same as in Fig. 4 but for frequency biases at a 6-h lead time as a function of the number of time-lagged members (from the most recent to longest time lags).
There is a tendency to gain wet biases when more time-lagged members are used. Frequency biases of the multimodel ensemble mean fall between the bias values of the WRF and MM5 ensembles, and show better results than the individual model ensembles when only a few members are used. The multimodel ensemble mean also shows better frequency biases than the 12-km NAM forecasts. The forecast biases tend to increase with the forecast lead time (Figs. 3 and 4). The displacement errors and the small sample size for heavy amounts pose a great challenge to removing the biases during the study period. Instead, recalibration (Casati et al. 2004) was conducted to reduce the intensity biases in the time-lagged members by using the most recent forecast.
This procedure ranks all precipitation amounts in the field and matches the cumulative frequency distribution (CFD) of each time-lagged member to the CFD of the most recent member at each verification time, separately for the MM5 and WRF forecasts. By doing so, the spatial structure of each member is retained and the intensity errors due to the forecast lead time are reduced. After recalibration, the frequency biases for each model at different lead times are similar to those at the 6-h lead time, and the frequency biases for the multimodel ensemble mean are around 1.0 (not shown).
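The CFD matching can be viewed as a rank-based quantile mapping; the following is a minimal sketch of that idea (my own illustration, not the authors' code), assuming both fields are on the same grid:

```python
import numpy as np

def cfd_match(lagged_field, recent_field):
    """Replace the amounts of a time-lagged member with the sorted amounts of the
    most recent member, assigned by rank, so the spatial pattern is preserved."""
    flat = lagged_field.ravel()
    order = np.argsort(flat)               # ranks of the lagged member's amounts
    donor = np.sort(recent_field.ravel())  # sorted amounts of the most recent member
    matched = np.empty_like(flat)
    matched[order] = donor                 # rank-for-rank substitution
    return matched.reshape(lagged_field.shape)
```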
FIG. 6. The 6-h precipitation accumulation ending at 1800 UTC 14 Mar 2005 for (a) the stage IV analysis; the MM5 forecasts at (b) 6-, (c) 12-, and (d) 18-h lead times; and the recalibrated MM5 forecasts at (e) 12- and (f) 18-h lead times; and for the WRF forecasts at (g) 6-, (h) 12-, (i) 18-h lead times, and the recalibrated WRF forecasts at (j) 12- and (k) 18-h lead times. Spatial correlation coefficients (cor) and RMSEs (rmse) between (a) the stage IV analyses and (b)–(k) the forecasts are shown in the titles.
FIG. 7. The distribution of the matched 6-h precipitation totals from the MM5 and WRF forecasts before and after recalibration for the case shown in Fig. 6.
Figure 6 shows an example of 6-h precipitation fields before and after recalibration. Compared to the stage IV analysis (Fig. 6a), both the MM5 and WRF forecasts at longer lead times (Figs. 6c, 6d, 6h, and 6i) show larger wet biases than the forecasts at the 6-h lead time (Figs. 6b and 6g). After recalibration, the biases are much reduced with smaller RMSEs, while the main spatial feature is retained with slightly varied spatial correlation coefficients (Figs. 6e, 6f, 6j, and 6k). The matched forecasts (Fig. 7) using the CFD indicate that recalibration generates drier forecasts than the raw forecasts for this case (the curves are below the diagonal), especially for the MM5 forecasts. The verification measures hereafter (Figs. 8–15) are based on the recalibrated data.

The TSA of the QPFs (Fig. 8) decreases, and the uncertainty bounds of the TSA increase, with increasing lead time and threshold. The TSA of the NAM forecasts surpasses the other model forecasts at longer lead times of 6 and 12 h for lower thresholds. The multimodel ensemble forecasts show a slightly higher TSA than the MM5 or WRF forecasts, but with very small differences from the forecasts of the individual models in terms of the 90% confidence bounds. With increasing ensemble size, the TSA of the ensemble mean at a 6-h lead time varies slightly (Fig. 9) at lower thresholds and increases slightly at higher thresholds. The multimodel ensemble is not superior to the best individual model ensemble considering the 90% confidence bounds. Since the TSA is less sensitive to the biases and changes only slightly with increasing lead time and ensemble size, forecasts from longer time-lagged members can be used to compose ensembles and increase the ensemble size without significantly degrading the forecast skill.
c. Rank histograms

Rank histograms (RHs; e.g., Talagrand et al. 1997; Hamill and Colucci 1998; Hamill 2001) are generated by counting the rank of the observation data as compared to available ensemble members and showing the frequency of each rank for all grid pixels. Therefore, the number of possible ranks is the number of ensemble
FIG. 8. Same as in Fig. 4 but for the TSAs.
members plus 1. Figure 10 shows the RHs of ensemble forecasts at a 6-h lead time for several ensemble configurations. A uniform rank (the horizontal line in Fig. 10, equal to the value of 1 divided by the number of total ranks) is expected if the observations rank equally among the ensemble members. Generally, Fig. 10 shows a U-shaped RH for the WRF ensemble, a reverse L-shaped RH for the MM5 ensemble, and a skewed U-shaped RH for the multimodel ensemble. Using the two most recent ensemble members (Figs. 10a1, 10b1, and 10c1), the frequencies are close to a uniform rank, with somewhat higher frequencies at the end ranks and slightly lower values at the middle-range ranks. By including longer time-lagged members, the skewness toward the highest rank increases in each ensemble group (Figs. 10a1–a3, 10b1–b3, and 10c1–c3). This mainly results from the increased biases with lead time (Fig. 4) and is partially due to recalibration. For example, the 6-h
MM5 forecasts show dry biases (Fig. 4) and the recalibrated MM5 forecasts at longer lead times tend to have dry biases as well (e.g., Fig. 6). When using the 12 most recent members in the MM5 ensemble (Fig. 10c3), the frequency of the highest rank almost doubles that of the uniform rank. By using the maximum ensemble size, the RH shape for each ensemble group (not shown) is similar to that using the 12 most recent members. Comparing the ensemble groups at the same ensemble size, the multimodel ensemble mixes the effects of the RH from the WRF and MM5 ensembles, and the middle ranks have not been improved.

Hamill (2001) suggested a cautious approach to interpreting different RHs, the shape of which can be affected by several factors. Underdispersive ensembles, which lack variability among ensemble members (i.e., smaller ensemble spread), can lead to a U-shaped RH. Also, uncertainties in the observation data and mixed dry–wet biases in the ensembles can cause a U-shaped RH. A skewed RH indicates forecast biases. Wet biases in ensemble forecasts can lead to an L-shaped RH, with more events situated at the low-end ranks, while dry biases often cause observations to rank highest, giving a reversed L shape. Increasing ensemble size does not improve the RH, indicating that the biases (Figs. 4 and 5) associated with nonuniform ranks are not reduced. A mixture of the two models shows an RH with a skewed U shape and more dry biases. This may suggest that just two models are not enough to increase the variability. Alhamed et al. (2002) also showed that forecasts from a single model tend to cluster together. In addition, hourly initialization in the ensemble system might not introduce enough new observation data, which may have led to less diversified forecasts in each cycle. Previous studies (e.g., Mittermaier 2007; Yuan et al. 2008) showed that using a time-lagged ensemble with 6-h lagged cycles could benefit 6-h QPFs and PQPFs. Further investigations into the design of the time-lagged multimodel ensemble system are needed, such as combinations of more models and different initialization intervals to increase the ensemble spread.
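The rank histogram construction described in this subsection can be sketched as follows (my own illustration; ties between the observation and members are ignored for simplicity):

```python
import numpy as np

def rank_histogram(members, obs):
    """Frequency of each observation rank; members (K, ny, nx), obs (ny, nx)."""
    ranks = (members < obs[None, :, :]).sum(axis=0)  # rank 0..K at each grid point
    counts = np.bincount(ranks.ravel(), minlength=members.shape[0] + 1)
    return counts / counts.sum()  # compare with the uniform value 1/(K + 1)
```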
5. Six-hourly PQPFs during 3 months

This section discusses the forecast quality and verification scores of 6-h PQPFs at a 6-h lead time during the 3 months, as compared to the stage IV analyses.
FIG. 9. Same as in Fig. 5 but for the TSAs.

a. Attributes diagrams
Attributes diagrams (ADs; Wilks 2006; Jollife and Stephenson 2003 and references within) illustrate the consistency between observed frequencies and forecast probabilities. On an AD (Fig. 11), the observed occurrence frequencies of the forecasts in each forecast probability category are plotted as the reliability curve. The closer the curve is to the diagonal (perfect line), the better the reliability. The flatter the curve on an AD, the less resolution it has. The sample climatological frequency (i.e., the occurrence frequency for the total sample at the selected threshold) is represented by the horizontal solid line (no resolution) and the vertical dashed line. A reliability curve that falls in the area between this vertical dashed line and a 22.5° dashed line (halfway between the perfect line and the no-resolution line) indicates skillful forecasts compared to the sample climatology. Inset histograms show the percentage of the total sample (~5.7 million) in each forecast probability category. The squared distance between the reliability curve and the perfect line, weighted by the percentage of the total sample, quantifies the reliability of the ensemble forecasts, and the squared distance between the curve and the horizontal
FIG. 10. Rank histograms of 6-h QPFs at a 6-h lead time during the 3 months from March to May 2005 for the (left) multimodel, (middle) WRF, and (right) MM5 ensemble forecasts with the 2, 6, and 12 most recent initialized members. The abscissa indicates the rank of the observation among all ensemble members. The ordinate indicates the frequency of the total sample for each rank. The horizontal line denotes a uniform rank.
sample climatological frequency, weighted by the percentage of the total sample, quantifies the resolution. A perfect ensemble has a reliability of 0 and a resolution of 1. As long as the resolution is greater than the reliability, the forecast is skillful. Figure 11 shows the ADs of 6-h PQPFs at a 6-h lead time for varying ensemble size, compared to the stage IV analyses. The sample climatological frequency (i.e., that of the stage IV precipitation analyses during the study period) greatly decreases from low to high thresholds, dropping from 12.22% to 0.58%. At the 0.25 mm (6 h)⁻¹ threshold, the reliability curves suggest overconfident forecasts (Wilks 2006, p. 288) with poor resolution (flatter slope). For high-probability categories, the curves below the 1:1 diagonal line indicate wet biases. Similarly, the ADs above the 2.5 mm (6 h)⁻¹ threshold show overconfidence and strong wet biases for high-probability categories. At the 5- and 10-mm thresholds, the reliability
curves of the MM5 and WRF ensembles using the six most recent members mostly align with the no-skill line, indicating that the PQPFs at higher thresholds are generally unskillful compared to the sample climatology. With increasing ensemble size, the reliability curves rotate toward the perfect reliability line, with larger improvements at higher thresholds. The improvement in the reliability curves with more ensemble members is consistent with the results of Richardson (2001). Comparing the three ensemble groups at the same ensemble size (e.g., 12 members in Fig. 11), the multimodel ensemble only shows better reliability at the lower thresholds [0.25 and 2.5 mm (6 h)⁻¹]. The inset histograms show that the percentage of nonzero forecast probabilities in the middle-range categories increases with ensemble size, while the percentage in the extreme categories (0% and 100%) decreases, and thus the sharpness is reduced.
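The ingredients of an attributes diagram can be assembled as in the following sketch (my own illustration), grouping the grid-point PQPFs by their discrete probability values and computing the observed frequency in each category:

```python
import numpy as np

def reliability_curve(prob, event):
    """prob: forecast probabilities (j/K values); event: 0/1 observed occurrences."""
    values = np.unique(prob)
    obs_freq = np.array([event[prob == p].mean() for p in values])
    counts = np.array([(prob == p).sum() for p in values])
    climatology = event.mean()  # sample climatological frequency
    return values, obs_freq, counts, climatology
```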
FIG. 11. Attributes diagrams of 6-h PQPFs at a 6-h lead time during the three months for four thresholds: (a) 0.25, (b) 2.5, (c) 5, and (d) 10 mm (6 h)⁻¹. Reliability curves are for different ensembles using the most recent members. The 1:1 diagonal line indicates the perfect forecast. The sloped and vertical dashed lines denote no-skill lines. The horizontal line with digits above is the sample climatological frequency. Inset histograms indicate the percentage of the total sample (~5.7 million) at nonzero forecast probability categories for using the 12 (white bar) and 24 (gray bar) most recent multimodel members, and the two digits below the histogram indicate the percentage for the 0% probability category.
In general, increasing ensemble size does not significantly improve the reliability or reduce the conditional forecast biases in the PQPFs. Dry biases and more severe wet biases in the ADs are also related to the U-shaped RH (Fig. 10), and neither the biases nor the RH shape is improved by increasing the ensemble size. The model forecasts contain large biases in the QPFs (Figs. 3–5) and
PQPFs (Fig. 11) compared to the stage IV analyses. The forecasts may be improved through postprocessing based on long-term forecasts.
b. Brier skill scores

The Brier skill score (BSS; e.g., Wilks 2006; Jollife and Stephenson 2003 and references within) is based
FIG. 12. BSSs of 6-h PQPFs vs the 6-h NAM runs at a 6-h lead time during the 3 months for four thresholds: (a) 0.25, (b) 2.5, (c) 5, and (d) 10 mm (6 h)⁻¹ as a function of the number of time-lagged members (from the most recent to longest time lags) from the WRF (boldface solid line), MM5 (dashed line), and multimodel ensemble (solid line with open circles) forecasts. Error bars represent 90% confidence bounds for the multimodel ensemble.
FIG. 13. BSSs of 6-h PQPFs at a 6-h lead time during the 3 months vs the 6-h NAM runs as a function of forecast lead time for using the (a) 10 most recent and (b) maximum multimodel members. The labels show four thresholds.
on the Brier score (BS), which measures the accuracy of n probability assessments. The BS [BS = (1/n) Σ_{j=1}^{n} (p_j − o_j)²] computes the average squared deviation between the forecast probabilities (p_j) and the observational outcomes (o_j). For a dichotomous forecast (e.g., the stage IV precipitation over a threshold in this study), the observational outcome equals 1 if the event occurs and 0 if the event does not occur. The BS values range from 0 to 1, with lower scores indicating better forecasts. The BSS can be defined as

BSS = 1 − BS/BS_ref = (BS_resolution − BS_reliability)/BS_uncertainty,    (3)
where BS and BS_ref stand for the BS from the assessed ensemble system and from a reference forecast system, respectively. The resolution term (BS_resolution) and reliability term (BS_reliability) can be computed from the reliability curves in an attributes diagram, while the uncertainty term (BS_uncertainty) depends solely on the sample climatology. The BSS is a positively oriented score, with a perfect value of 1 and positive (zero or negative) values for skillful (unskillful) forecasts compared to the reference system.
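A minimal sketch of the BS and BSS as used here (my own illustration; the deterministic reference forecast is treated as a 0/1 probability of exceeding the threshold, which is one plausible reading of the comparison with the NAM runs):

```python
import numpy as np

def brier_score(prob, event):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return np.mean((np.asarray(prob) - np.asarray(event)) ** 2)

def brier_skill_score(prob, ref_prob, event):
    """BSS = 1 - BS/BS_ref; positive values beat the reference forecast."""
    return 1.0 - brier_score(prob, event) / brier_score(ref_prob, event)
```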
Figure 12 shows the BSS referenced to the 6-h operational NAM forecasts at 12-km resolution. The positive BSS values indicate that the PQPFs are better than the NAM forecasts. Similar to the TSA (Fig. 9), the BSS decreases, and the uncertainties of the BSS increase, with the threshold. Using one member, the 6-h WRF [at the 2.5, 5, and 10 mm (6 h)⁻¹ thresholds] and MM5 [at 10 mm (6 h)⁻¹] deterministic runs are worse than the 6-h NAM forecasts. The WRF ensemble is worse than the MM5 ensemble when using a few members but becomes closer with increasing ensemble size. This possibly results from the larger biases in the WRF forecasts (Fig. 5) at the selected thresholds. Adding members leads to a smoother field in the WRF forecasts (Fig. 3) and smaller differences in skill from the MM5 forecasts (Fig. 12). Compared to the best individual model ensemble at the same ensemble size, the BSS of the multimodel ensemble is only improved at the lowest threshold and when using a few members, in terms of the 90% confidence bounds. By increasing the number of time-lagged ensemble members, the BSS increases for each ensemble group, which is consistent with the previous study that elucidates the relationship between increased ensemble size and improved BSS (Richardson 2001). The BSS grows rapidly as the ensemble goes from 1 to 6 of the most recent members, while the skill increases slowly or varies only slightly when seven or more of the most recent members are used. This implies that additional members are
FIG. 14. The ROC curves of 6-h PQPFs at a 6-h lead time during the 3 months for four thresholds: (a) 0.25, (b) 2.5, (c) 5, and (d) 10 mm (6 h)⁻¹. Inset text shows the area under the corresponding ROC curve using the trapezoidal method by connecting the points of the ROC curve.
deficient. This is also supported by the finding that the skill improvement with increasing ensemble size is more significant for smaller ensemble sizes and that the benefit can be reduced by adding more members as the forecast probability density function (PDF) becomes better defined (Richardson 2001). Similar to the reliability curves (Fig. 11), larger improvements can be seen for higher thresholds. Increasing the ensemble size improves neither the forecast biases (Figs. 3 and 5) nor the TSA (Fig. 9) in the ensemble-mean QPFs, while the 6-h PQPFs show better skill (Fig. 12) with increased ensemble size. This indicates that the ensemble with more members can generate more realistic probabilities.
By selecting the 10 most recent members at each forecast lead time, the BSS is computed for the multimodel ensemble (Fig. 13a) referenced to the 6-h NAM deterministic runs. At lower thresholds, the BSS values drop at 6–11-h lead times and vary little afterward. The BSS varies only slightly at the 10 mm (6 h)⁻¹ threshold. The PQPFs remain skillful for all thresholds up to a 14-h lead time (the longest lead time with a limitation of 10 members) compared to the 6-h NAM runs. Both decreasing ensemble size (Richardson 2001) and increasing forecast lead time can cause the BSS to decrease. By using the maximum number of members for each lead time (Fig. 13b), the ensemble size and BSS gradually decrease with
FIG. 15. Same as in Fig. 12 but for the areas under the fitted ROC curve by transforming the ROC curve into a normal deviate space.
increasing lead time, but this time-lagged multimodel ensemble system shows skillful forecasts for all thresholds up to an 18-h (maximum) lead time compared to the 6-h NAM runs. For the lead times of 14–18 h, the BSS
greatly decreases due to the small ensemble size and increased forecast errors. In summary, the inclusion of time-lagged members can improve the forecast skill, and using the time-lagged
multimodel members can extend the skillful forecasts to a longer forecast lead time.
c. Area under the relative operating characteristic curve

The relative operating characteristic curve (ROC; Mason 1982) is plotted with the hit rate [a/(a + c)] on the ordinate and the false alarm rate [b/(b + d), where a, b, and c are as in Eq. (2) in section 4b, and d is the number of correct forecasts of nonevents] on the abscissa for each probability category. The area under the ROC curve indicates the ability to discriminate events at the selected threshold; it has a perfect value of 1.0 and an unskillful value of 0.5. Assuming constant variance across subsets of the data, a value of ~0.75 or higher for the ROC area is an indicator of a good forecast. Composite hit and false alarm rates of 6-h PQPFs during the study period before recalibration (not shown) are almost the same as those after recalibration. This indicates that recalibration only reduces forecast biases and does not improve the discriminating ability, since the ROC curve is insensitive to forecast biases (Harvey et al. 1992; Mason and Graham 2002; Jollife and Stephenson 2003). Figure 14 shows the ROC curves for the three ensemble groups at different ensemble sizes. The inset ROC areas were calculated based on the trapezoidal method by connecting the points of the ROC curve, which may underestimate the ROC area since the points tend to populate the lower-left corner, where both the hit and false alarm rates equal 0. With increasing ensemble size, the ROC curves move closer to the upper-left corner (where the false alarm rate equals 0 and the hit rate equals 1) for each ensemble group. The ROC area (Fig. 14) increases with increasing ensemble size and decreases with the threshold. Surprisingly, as time-lagged members are added, the ROC curves almost lie along a single ROC curve. This indicates that additional members only cover more of the hit and false alarm rate space (with especially rapid growth up to the first six members; not shown) and do not improve the fundamental discrimination of the ensemble. The ROC curves of the MM5 ensemble are always beneath those of the WRF ensemble, showing inferior forecasts. Considering the same ensemble size (e.g., 12 members in Fig. 14), the ROC curves of the multimodel ensemble are inferior to those of the WRF ensemble at all thresholds due to the poorer MM5 ensemble. Unlike the BSS, the ROC area is less sensitive to the biases; therefore, the multimodel ensemble does not improve the discrimination. Because of the limited ensemble size, the trapezoidal method (Fig. 14) tends to underestimate the ROC area. To overcome this problem, the ROC curves were fitted in
a normal deviate space (Wilson 2000) and the ROC areas were recomputed (Fig. 15). First, the hit and false alarm rates were transformed into a normal deviate space and the points were fitted with a straight line; then, the fitted curves were transformed back to the ROC space. For two and more members, the values of the fitted ROC area (Fig. 15) are high, around 0.9, and vary only slightly across all thresholds. The uncertainties of the ROC area increase with the threshold and the ensemble size. The MM5 ensemble has lower fitted ROC areas than the WRF ensemble, especially at higher thresholds, and shows a downward trend with increasing ensemble size. At higher thresholds, the fitted ROC areas with a few members are slightly better than those with more members, and the multimodel ensemble shows only a marginal benefit. In general, the discriminating ability is not improved with increasing ensemble size. Adding time-lagged members to the WRF ensemble is beneficial in that the PDF is better defined without harming the discriminating ability. Adding time-lagged members to the MM5 ensemble slightly decreases the resolution, while improving the BSS with reduced forecast biases (Fig. 12). The good ROC area values also indicate that a bias correction procedure may help increase the forecast skill by removing systematic errors (Yuan et al. 2007) using long-term historical forecast data.
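The normal-deviate fitting of the ROC curve can be sketched as follows (my own reading of the Wilson 2000 procedure described above; a placeholder numerical recipe rather than the authors' implementation):

```python
import numpy as np
from scipy.stats import norm

def fitted_roc_area(false_alarm_rates, hit_rates):
    """Fit the ROC points with a line in normal-deviate (probit) space,
    transform back, and integrate to obtain the fitted ROC area."""
    far, hr = np.asarray(false_alarm_rates), np.asarray(hit_rates)
    keep = (far > 0) & (far < 1) & (hr > 0) & (hr < 1)  # probit needs open interval
    slope, intercept = np.polyfit(norm.ppf(far[keep]), norm.ppf(hr[keep]), 1)
    f = np.linspace(1e-4, 1.0 - 1e-4, 2001)             # dense false-alarm-rate grid
    h = norm.cdf(intercept + slope * norm.ppf(f))       # fitted ROC curve
    return np.trapz(h, f)                               # area under the fitted curve
```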
6. Summary and conclusions

This study demonstrates the construction of a time-lagged multimodel ensemble forecast system. This system provides an economical way of generating a set of ensemble members. A frequently cycling data assimilation system with a diabatic initialization (such as the NOAA LAPS) is particularly appropriate for generating these ensembles with unique initial perturbations, which capture the time-evolving uncertainty in the initial conditions. The further use of different models not only helps to increase the ensemble size, but also captures the uncertainty of different model physical parameterizations, thus enhancing the ensemble diversity. Six-hourly QPFs and PQPFs at 12-km resolution were investigated for the time-lagged multimodel ensemble system. The 3-month mean of the QPFs showed heavy precipitation centers as in the NCEP stage IV precipitation analyses, but with forecast biases in the system. The MM5 and WRF models suffer from problems associated with not having a convective parameterization scheme; the explicit precipitation process led to the overprediction of precipitation at the 12-km grid scale. Because frequency biases increased with the forecast lead time (Fig. 4), recalibration was implemented to help reduce the intensity errors while still retaining the
spatial pattern in the time-lagged members. The TSA (Fig. 8), which is insensitive to forecast biases, gradually decreased with the forecast lead time for the WRF and MM5 deterministic QPFs and the multimodel ensemble mean. When more time-lagged members were added, the TSAs (Fig. 9) of the ensemble-mean QPFs from the three ensemble groups varied only slightly, while the shape of the rank histograms (Fig. 10) was not improved with increased ensemble size. A skewed U-shaped rank histogram might be caused by mixing underestimated and overestimated forecasts or by underdispersion of the ensemble. As expected from Richardson (2001), the improvements in the 6-h PQPFs (the attributes diagrams and the BSS) with increasing ensemble size were more obvious when adding a few members, and diminished when the PDF was better defined by adding more members. Dry and wet biases were also shown in the attributes diagrams (Fig. 11). When the ensemble size was increased, the reliability curves were slightly improved and the observed event frequencies were slightly increased for the middle-range nonzero probability categories. Although the deterministic runs had lower TSAs at longer lead times than the 12-km operational NAM forecasts (Fig. 8), the inclusion of forecasts from earlier initialization cycles improved the BSSs (Fig. 12) in the WRF and MM5 ensembles. Compared to the best individual model ensemble, the multimodel ensemble only improved the BSS at the lowest threshold with a few members. The forecast lead time of skillful PQPFs was extended (Fig. 13) for the time-lagged multimodel ensemble, compared to the 6-h NAM deterministic forecasts. With increasing time-lagged members, more of the hit and false alarm rate space was covered, but the ROC curves (Fig. 14) clustered together in the lower-left corner, and the area under the fitted ROC curve generally was not improved (Fig. 15). Overall, by combining only two models, the multimodel ensemble did not improve the resolution, due to the poorer MM5 ensemble. Collectively, these results encourage further investigations into ensembles constructed using more models with a variety of configurations. Also, ensemble members with hourly initialization might not produce large enough ensemble spreads in this study. Longer initialization intervals (such as 6 h) will need to be examined for 6-h QPFs and PQPFs. Additionally, future studies should further examine the selection of the physical parameterization schemes, particularly the option of using a convective scheme in this application, and postprocessing procedures for removing forecast biases (e.g., Yuan et al. 2007) based on long-term historical forecasts. Also, uncertainties in the observation–verification data should be considered in future verification studies. While the accuracy of
an ensemble forecast system depends on the model physics and numerical schemes, the postprocessing also plays an important role in improving the QPF skill. Several methods of QPF and PQPF postprocessing in ensemble systems have been developed, such as a regression method used to correct systematic errors in ‘‘superensemble’’ forecasts (Krishnamurti et al. 1999) and analog forecasting of QPFs using 25-yr ensemble reforecasts (Hamill et al. 2006). Regarding increased errors for longer forecast lead times, weighting different time-lagged members should also be tested. Research on optimal combinations of different ensemble members for improving PQPF and QPF is under way. Acknowledgments. This research was performed while H. Yuan held a National Research Council Research Associateship Award at NOAA/ESRL. We thank Drs. Alexander MacDonald and Steven Koch for discussions on the subject, Steve Albers for conducting the NOAA internal review, and Ann Reiser for providing the technical edit. This research was funded by the NOAA Office of Oceanic and Atmospheric Research. The views expressed are those of the authors and do not necessarily represent the official policy or position of NOAA. We also thank the three anonymous reviewers for their valuable suggestions.
REFERENCES

Albers, S. C., 1995: The LAPS wind analysis. Wea. Forecasting, 10, 342–352. Alhamed, A., S. Lakshmivarahan, and D. J. Stensrud, 2002: Cluster analysis of multimodel ensemble data from SAMEX. Mon. Wea. Rev., 130, 226–256. Baldwin, M. E., and M. S. Wandishin, 2002: Determining the resolved spatial scales of Eta model precipitation forecasts. Preprints, 15th Conf. on Numerical Weather Prediction, San Antonio, TX, Amer. Meteor. Soc., 3.2. [Available online at http://ams.confex.com/ams/pdfpapers/47735.pdf.] ——, and J. S. Kain, 2006: Sensitivity of several performance measures to displacement error, bias, and event frequency. Wea. Forecasting, 21, 636–648. Bright, D., and S. L. Mullen, 2002: Short-range ensemble forecasts of precipitation during the southwest monsoon. Wea. Forecasting, 17, 1080–1100. Brooks, H. E., M. S. Tracton, D. J. Stensrud, G. DiMego, and Z. Toth, 1995: Short-range ensemble forecasting: Report from a workshop, 25–27 July 1994. Bull. Amer. Meteor. Soc., 76, 1617–1624. Buizza, R., A. Hollingsworth, F. Lalaurette, and A. Ghelli, 1999: Probabilistic predictions of precipitation using the ECMWF Ensemble Prediction System. Wea. Forecasting, 14, 168–189. Casati, B., G. Ross, and D. B. Stephenson, 2004: A new intensity-scale approach for the verification of spatial precipitation forecasts. Meteor. Appl., 11, 141–154.
Dalcher, A., E. Kalnay, and R. N. Hoffman, 1988: Medium range lagged average forecasts. Mon. Wea. Rev., 116, 402–416. Du, J., S. L. Mullen, and F. Sanders, 1997: Short-range ensemble forecasting of quantitative precipitation. Mon. Wea. Rev., 125, 2427–2459. ——, J. McQueen, G. DiMego, Z. Toth, D. Jovic, B. Zhou, and H. Chuang, 2006: New dimension of NCEP Short-Range Ensemble Forecasting (SREF) system: Inclusion of WRF members. Preprints, WMO Expert Team Meeting on Ensemble Prediction System, Exeter, United Kingdom, World Meteorological Organization. [Available online at http:// wwwt.emc.ncep.noaa.gov/mmb/SREF/WMO06_full.pdf.] Ek, M. B., K. E. Mitchell, Y. Lin, P. Grunmann, E. Rogers, G. Gayno, and V. Koren, 2003: Implementation of the upgraded Noah land-surface model in the NCEP operational mesoscale Eta model. J. Geophys. Res., 108, 8851, doi:10.1029/ 2002JD003296. Gallus, W. A., 1999: Eta simulations of three extreme precipitation events: Sensitivity to resolution and convective parameterization. Wea. Forecasting, 14, 405–426. Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155–167. ——, 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea. Rev., 129, 550–560. ——, and S. J. Colucci, 1998: Evaluation of Eta–RSM ensemble probabilistic precipitation forecasts. Mon. Wea. Rev., 126, 711–724. ——, J. S. Whitaker, and S. L. Mullen, 2006: Reforecasts: An important dataset for improving weather predictions. Bull. Amer. Meteor. Soc., 87, 33–46. Harvey, L. O., K. R. Hammond, C. M. Lusk, and E. F. Mross, 1992: The application of signal detection theory to weather forecasting behavior. Mon. Wea. Rev., 120, 863–883. Hersbach, H., R. Mureau, J. D. Opsteegh, and J. Barkmeijer, 2000: A short-range to early-medium-range ensemble prediction system for the European area. Mon. Wea. Rev., 128, 3501– 3519. Hoffman, R. N., and E. Kalnay, 1983: Lagged average forecasting, an alternative to Monte Carlo forecasting. Tellus, 35A, 100– 118. Janjic, Z., 1994: The step-mountain Eta coordinate model: Further developments of the convection closure schemes. Mon. Wea. Rev., 122, 927–945. Jian, G.-J., and J. A. McGinley, 2005: Evaluation of a short-range forecast system on quantitative precipitation forecasts associated with tropical cyclones of 2003 near Taiwan. J. Meteor. Soc. Japan, 83, 657–681. ——, S.-L. Shieh, and J. A. McGinley, 2003: Precipitation simulation associated with Typhoon Sinlaku (2002) in Taiwan area using the LAPS diabatic initialization for MM5. Terr., Atmos., Oceanic Sci., 14, 261–288. Jollife, I. T., and D. B. Stephenson, 2003: Forecast Verification: A Practitioner’s Guide in Atmospheric Science. John Wiley and Sons, 240 pp. Kain, J. S., and J. M. Fritsch, 1993: Convective parameterization for mesoscale models: The Kain–Fritsch scheme. The Representation of Cumulus Convection in Numerical Models, Meteor. Monogr., No. 46, Amer. Meteor. Soc., 165–170. Krishnamurti, T. N., and Coauthors, 1999: Improved weather and seasonal climate forecasts from multimodel superensemble. Science, 285, 1548–1550. Leith, C. E., 1974: Theoretical skill of Monte Carlo forecasts. Mon. Wea. Rev., 102, 409–418.
Lewis, J., 2005: Roots of ensemble forecasting. Mon. Wea. Rev., 133, 1865–1885. Lu, C., H. Yuan, B. Schwartz, and S. Benjamin, 2007: Short-range forecast using time-lagged ensembles. Wea. Forecasting, 22, 580–595. Mahoney, W. P., III, and Coauthors, 2005: The Federal Highway Administration's Maintenance Decision Support System Project: Summary results and recommendations. Transportation Research Record 1911, TRB, National Research Council, Washington, DC, 133–142. Mason, I., 1982: A model for assessment of weather forecasts. Aust. Meteor. Mag., 30, 291–303. Mason, S. J., and N. E. Graham, 2002: Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: Statistical significance and interpretation. Quart. J. Roy. Meteor. Soc., 128, 2145–2166. Mittermaier, M. P., 2007: Improving short-range high-resolution model precipitation forecast skill using time-lagged ensembles. Quart. J. Roy. Meteor. Soc., 133, 1487–1500. Molteni, F., R. Buizza, T. N. Palmer, and T. Petroliagis, 1996: The ECMWF Ensemble Prediction System: Methodology and validation. Quart. J. Roy. Meteor. Soc., 122, 73–119. Mullen, S. L., and D. P. Baumhefner, 1989: The impact of initial condition uncertainty on numerical simulations of large-scale explosive cyclogenesis. Mon. Wea. Rev., 117, 2800–2821. Palmer, T. N., F. Molteni, R. Mureau, R. Buizza, P. Chapelet, and J. Tribbia, 1993: Ensemble prediction. Proc. ECMWF Seminar on Validation of Models over Europe, Vol. 1, Reading, United Kingdom, ECMWF, 21–66. Richardson, D. S., 2001: Measures of skill and value of ensemble prediction systems, their interrelationship and the effect of ensemble size. Quart. J. Roy. Meteor. Soc., 127, 2473–2489. Schultz, P., 1995: An explicit cloud physics parameterization for operational numerical weather prediction. Mon. Wea. Rev., 123, 3331–3343. Stensrud, D. J., H. E. Brooks, J. Du, M. S. Tracton, and E. Rogers, 1999: Using ensembles for short-range forecasting. Mon. Wea. Rev., 127, 433–446. ——, J. W. Bao, and T. T. Warner, 2000: Using initial condition and model physics perturbations in short-range ensemble simulations of mesoscale convective systems. Mon. Wea. Rev., 128, 2077–2107. Talagrand, O., R. Vautard, and B. Strauss, 1997: Evaluation of probabilistic prediction systems. Proc. ECMWF Workshop on Predictability, Reading, United Kingdom, ECMWF, 1–25. Toth, Z., and E. Kalnay, 1993: Ensemble forecasting at NMC: The generation of perturbations. Bull. Amer. Meteor. Soc., 74, 2317–2330. ——, and ——, 1997: Ensemble forecasting at NCEP and the breeding method. Mon. Wea. Rev., 125, 3297–3319. Walser, A., D. Lüthi, and C. Schär, 2004: Predictability of precipitation in a cloud-resolving model. Mon. Wea. Rev., 132, 560–577. Weisman, M. L., W. C. Skamarock, and J. B. Klemp, 1997: The resolution dependence of explicitly modeled convective systems. Mon. Wea. Rev., 125, 527–548. Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences. 2nd ed. International Geophysical Series, Vol. 91, Academic Press, 627 pp.
Wilson, L. J., 2000: Comments on ‘‘Probabilistic predictions of precipitation using the ECMWF ensemble prediction system.’’ Wea. Forecasting, 15, 361–364. Yuan, H., S. L. Mullen, X. Gao, S. Sorooshian, J. Du, and H. H. Juang, 2005: Verification of probabilistic quantitative precipitation forecasts over the southwest United States during winter 2002/2003 by the RSM ensemble system. Mon. Wea. Rev., 133, 279–294.
——, X. Gao, S. L. Mullen, S. Sorooshian, J. Du, and H. H. Juang, 2007: Calibration of probabilistic quantitative precipitation forecasts with an artificial neural network. Wea. Forecasting, 22, 1287–1303. ——, J. A. McGinley, P. J. Schultz, C. J. Anderson, and C. Lu, 2008: Short-range precipitation forecasts from time-lagged multimodel ensembles during the HMT-West-2006 campaign. J. Hydrometeor., 9, 477–491.