AMERICAN METEOROLOGICAL SOCIETY Journal of Hydrometeorology

EARLY ONLINE RELEASE

This is a preliminary PDF of the author-produced manuscript that has been peer-reviewed and accepted for publication. Since it is being posted so soon after acceptance, it has not yet been copyedited, formatted, or processed by AMS Publications. This preliminary version of the manuscript may be downloaded, distributed, and cited, but please be aware that there will be visual differences and possibly some content differences between this version and the final published version. The DOI for this manuscript is doi: 10.1175/JHM-D-15-0130.1. The final published version of this manuscript will replace the preliminary version at the above DOI once it is available.

If you would like to cite this EOR in a separate work, please use the following full citation: Zsoter, E., F. Pappenberger, P. Smith, R. Emerton, E. Dutra, F. Wetterhall, D. Richardson, K. Bogner, and G. Balsamo, 2016: Building a multi-model flood prediction system with the TIGGE archive. J. Hydrometeor., doi:10.1175/JHM-D-15-0130.1, in press.

© 2016 American Meteorological Society


Building a multi-model flood prediction system with the TIGGE archive

Ervin Zsoter, European Centre for Medium-Range Weather Forecasts, Reading, UK
Current address: European Centre for Medium-Range Weather Forecasts, Shinfield Park, Reading, RG2 9AX, UK

Florian Pappenberger, European Centre for Medium-Range Weather Forecasts, Reading, UK and College of Hydrology and Water Resources, Hohai University, Nanjing, China

Paul Smith, European Centre for Medium-Range Weather Forecasts, Reading, UK and Lancaster Environment Centre, Lancaster University, UK

Rebecca Elizabeth Emerton, Department of Geography and Environmental Science, University of Reading, Reading, UK and European Centre for Medium-Range Weather Forecasts, Reading, UK

Emanuel Dutra, European Centre for Medium-Range Weather Forecasts, Reading, UK

Fredrik Wetterhall, European Centre for Medium-Range Weather Forecasts, Reading, UK

David Richardson, European Centre for Medium-Range Weather Forecasts, Reading, UK

Konrad Bogner, Swiss Federal Research Institute WSL, Birmensdorf, Switzerland and European Centre for Medium-Range Weather Forecasts, Reading, UK

Gianpaolo Balsamo, European Centre for Medium-Range Weather Forecasts, Reading, UK

Corresponding author: E. Zsoter, European Centre for Medium-Range Weather Forecasts, Shinfield Park, Reading, RG2 9AX, UK ([email protected])

Abstract

In the last decade operational probabilistic ensemble flood forecasts have become common in supporting decision-making processes leading to risk reduction. Ensemble forecasts can assess uncertainty, but they are limited to the uncertainty in a specific modelling system. Many of the current operational flood prediction systems use a multi-model approach to better represent the uncertainty arising from insufficient model structure.

This study presents a multi-model approach to building a global flood prediction system, using multiple atmospheric reanalysis datasets for river initial conditions and multiple TIGGE ensemble forcing inputs to the ECMWF land-surface model. A sensitivity study is carried out to clarify the effect of using archived ensemble meteorological predictions and uncoupled land-surface models. The probabilistic discharge forecasts derived from the different atmospheric models are compared with those from the multi-model combination. The potential for further improving forecast skill by bias correction and Bayesian Model Averaging is examined.

The results show that, other than for precipitation, the impact of the different TIGGE input variables in the HTESSEL/CaMa-Flood system setup is rather limited. This provides a sufficient basis for evaluation of the multi-model discharge predictions. We also highlight that the three applied reanalysis datasets have different error characteristics, allowing large potential gains with a multi-model combination. It is shown that large improvements to the forecast performance of all models can be achieved through appropriate statistical post-processing (bias and spread correction). A simple multi-model combination generally improves the forecasts, while a more advanced combination using Bayesian Model Averaging provides further benefits.

1 Introduction

Operational probabilistic ensemble flood forecasts have become more common in the last decade (Cloke and Pappenberger 2009; Demargne et al. 2014; Olsson and Lindström 2008). Ensemble forecasts are a good way of assessing forecast uncertainty, but they are limited to the uncertainty captured by a specific modelling system. A multi-model approach can address this shortcoming and provide a more complete representation of the uncertainty in the model structure, also potentially reducing the errors (Krishnamurti et al. 1999).

"Multi-model" can refer to systems using multiple meteorological models, hydrological models or both (Velázquez et al. 2011). According to Emerton et al. (2016), amongst the many regional-scale operational hydrological ensemble prediction systems across the globe, at present there are six large (continental and global) scale systems: four running at continental scale over Europe, Australia and the USA, and two available globally. The U.S. Hydrologic Ensemble Forecast Service (HEFS), run by the National Weather Service (NWS) (Demargne et al. 2014), and the Global Flood Forecasting Information System (GLOFFIS), a recent development at Deltares in The Netherlands, are examples of systems using different hydrological models as well as multiple meteorological inputs. The European Flood Awareness System (EFAS), developed by the Joint Research Centre (JRC) of the European Commission (EC) and ECMWF, operates using a single hydrological model with multi-model meteorological input (Thielen et al. 2009). Finally, the European HYdrological Predictions for the Environment (E-HYPE) Water in Europe Today (WET) model of the Swedish Meteorological and Hydrological Institute (SMHI) (Donnelly et al. 2015), the Australian Flood Forecasting and Warning Service and the Global Flood Awareness System (GloFAS; Alfieri et al. 2013), run in collaboration between ECMWF and JRC, all use one main hydrological model and one meteorological model input.

While the multi-model approach has traditionally involved the use of multiple forcing inputs and hydrological models to generate discharge forecasts, it also allows for consideration of multiple initial conditions. In keeping with GloFAS, this paper uses atmospheric reanalysis data to generate the initial conditions of the land-surface components of the forecasting system; therefore a multi-model approach based on three reanalysis datasets is trialled.

The THORPEX (The Observing system Research and Predictability Experiment) Interactive Grand Global Ensemble (TIGGE; Bougeault et al. 2010) archive is an invaluable source of multi-model meteorological forcing data. The archive has attracted attention among hydrological forecasters and is already being extensively used in hydrological applications. The first published example of a hydro-meteorological forecasting application was by Pappenberger et al. (2008). In that paper, the forecasts of nine TIGGE centres were used within the setting of EFAS for a case study of a flood event in Romania in October 2007; it showed that the lead time of flood warnings could be improved by up to 4 days through the use of multiple forecasting models rather than a single model. This study and other subsequent studies using TIGGE multi-model data (e.g. He et al. 2009; He et al. 2010; Bao and Zhao 2012) have indicated that combining different models increases not only the skill but also the lead time at which warnings can be issued. He et al. (2009) highlighted this and further showed that the individual systems of the multi-model forecast have systematic errors in time and space which would require temporal and spatial post-processing. Such post-processing should carefully maintain spatial, temporal and inter-variable correlations; otherwise it leads to deteriorating hydrological forecast skill.

The scientific literature contains numerous studies on methods that can lead to significant gains in forecast skill by combining and post-processing different forecast systems. Statistical ensemble post-processing techniques target the generation of sharp and reliable probabilistic forecasts from ensemble outputs. Hagedorn et al. (2012) showed, based on TIGGE ensembles, that under an equal-weight multi-model approach a selection of the best NWP models may be needed to gain skill over the best performing single model. In addition, the calibration of the best single model using a reforecast dataset can lead to quality comparable or even superior to the multi-model prediction. Gneiting and Katzfuss (2014) focus on various methodologies that require weighting of the different contributing forecasts to optimize model error corrections. They recommend the application of well-established techniques in the operational environment, such as nonhomogeneous regression or Bayesian model averaging (BMA). The BMA method generates calibrated and sharp probability density functions (PDFs) from ensemble forecasts (Raftery et al. 2005), where the predictive PDF is a weighted average of the PDFs centred on the bias-corrected forecasts. The weights reflect the relative skill of the individual members over a training period. BMA has been widely used and has proved beneficial in hydrological ensemble systems (e.g. Ajami et al. 2007; Cane et al. 2013; Dong et al. 2013; Liang et al. 2013; Todini 2008; Vrugt and Robinson 2007).

Previous studies have used hydrological models, rather than land-surface models, to analyse the benefits of multi-model forecasting, and have focused on individual catchments. The potential of multi-model forecasts at the regional or continental scale shown in previous studies provides the motivation for building a global multi-model hydro-meteorological forecasting system.

In this study we present our experiences in building a multi-model hydro-meteorological forecasting system. Global ensemble discharge forecasts with a 10-day horizon are generated using the ECMWF land-surface model and a river-routing model. The multi-model approach arises from the use of meteorological forecasts from four models in the TIGGE archive and the derivation of river initial conditions using three global reanalysis datasets. The main focus of our study is the quality of the discharge forecasts derived from the TIGGE data. We analyse the HTESSEL/CaMa-Flood set-up and the scope for error reduction by applying the multi-model approach and different post-processing methods to the forecast data. Three sets of experiments are undertaken to test (i) the sensitivity of the forecasting system to the input variables, (ii) the potential improvements in forecasting historical discharge that can be achieved by a combination of different reanalysis datasets and (iii) the use of bias correction and model combination to improve the predictive distribution of the forecasts.

In Section 2 the datasets, models and methodology used throughout the paper are described. Section 3 summarises the discharge experiments we produced and analysed. In Section 4 we provide the results, while Section 5 gives the conclusions of the paper.

2 System Description and Datasets

2.1 HTESSEL land-surface model

The hydrological component of this study was the Hydrology-Tiled ECMWF Scheme for Surface Exchange over Land (HTESSEL; Balsamo et al. 2009; Balsamo et al. 2011) land-surface model. The HTESSEL scheme follows a mosaic (or tiling) approach where the grid boxes are divided into patches (or tiles), with up to six fractions over land (bare ground, low and high vegetation, intercepted water, shaded and exposed snow) and two extra tiles over water (open and frozen water) exchanging energy and water with the atmosphere. The model is part of the Integrated Forecasting System (IFS) at ECMWF and is used in coupled atmosphere-surface mode on time ranges from the medium range to seasonal forecasts. In addition, the model provides a research test bed for applications where the land-surface model can run in a stand-alone mode. In this so-called "offline" version the model is forced with near-surface meteorological input (temperature, specific humidity, wind speed and surface pressure), radiative fluxes (downward solar and thermal radiation) and water fluxes (liquid and solid precipitation). This offline methodology has been explored in various research applications where HTESSEL or other models were applied (e.g. Agustí-Panareda et al. 2010; Dutra et al. 2011; Haddeland et al. 2011).

2.2 CaMa-Flood river-routing

The Catchment-based Macro-scale Floodplain model (CaMa-Flood; Yamazaki et al. 2011) was used to integrate HTESSEL runoff over the river network into discharge. CaMa-Flood is a distributed global river-routing model that routes runoff to oceans or inland seas using a river network map. A major advantage of CaMa-Flood is the explicit representation of water level and flooded area in addition to river discharge. The relationship between water storage (the only prognostic variable), water level and flooded area is determined on the basis of subgrid-scale topographic parameters derived from a 1 km digital elevation model.

2.3 TIGGE ensemble forecasts

The atmospheric forcing for the forecast experiments is taken from the TIGGE archive, where all variables are available at a standard 6-hour forecast frequency. The ensemble systems of ECMWF, UKMO (UK Met Office), NCEP (National Centers for Environmental Prediction) and CMA (China Meteorological Administration) provide, in the TIGGE archive, meteorological forcing fields from the 00 UTC runs with 6-hour frequency starting from 2006-2008, depending on the model. All four models were only available with the complete forcing variable set from August 2008. ECMWF was available with 50 ensemble members at 32 km horizontal resolution (~50 km before January 2010) up to 15 days ahead, UKMO with 23 members at ~60 km (~90 km before March 2010) also up to 15 days, NCEP with 20 members at ~110 km up to 16 days ahead, and finally CMA with 14 members at ~60 km resolution up to 10 days ahead. In testing the sensitivity of the experimental set-up to meteorological forcing (see Section 4.1), the ECMWF control forecasts were used, extracted directly from ECMWF's Meteorological Archival and Retrieval System (MARS), where the meteorological variables are available without the TIGGE restrictions. These have the same resolution as the 50 ensemble members but start from the unperturbed analysis.

2.4 Reanalysis data

The discharge modelling experiments require reanalysis data, which are used to provide the climate and the initial conditions needed for the HTESSEL land-surface model runs, and to produce the river initial conditions required in the CaMa-Flood routing part of the TIGGE forecast experiments.

In this study we have used three different reanalysis datasets: two produced by ECMWF, the ERA-Interim reanalysis (hereafter ERAI) and the ERA-Interim/Land version with GPCP v2.2 precipitation (Huffman et al. 2009) correction (Balsamo et al. 2015; hereafter ERAI-Land), and a third, the MERRA-Land upgrade of the Modern-Era Retrospective Analysis for Research and Applications (MERRA) dataset, produced by NASA. The combination of these three sources was a proof of concept of the potential added value of multiple initial conditions.

ERAI is ECMWF's global atmospheric reanalysis from 1979 to present, produced with an older (2006) version of the ECMWF IFS at T255 spectral resolution (Dee et al. 2011). ERAI-Land is a version of ERAI at the same 80 km spatial resolution with improvements for the land surface. It was produced in offline mode with a 2014 version of the HTESSEL land-surface model using atmospheric forcing from ERA-Interim, with precipitation adjustments based on GPCP v2.2, where the ERA-Interim 3-hourly precipitation is rescaled to match the monthly accumulated precipitation provided by the GPCP v2.2 product (for more details please consult Balsamo et al. 2010).

The MERRA-Land dataset is similar to ERAI-Land in that it is a land-only version of the MERRA land model component, produced also in offline mode, using improved precipitation forcing and an improved version of the catchment land surface model (Reichle et al. 2011).

2.5 Discharge Data

In this study a subset of the observations available in GloFAS was used, mainly originating from the Global Runoff Data Centre (GRDC) archive. The GRDC is the digital worldwide repository of discharge data and associated metadata. It is an international archive of data starting in 1812, and it fosters multinational and global long-term hydrological studies.

For the discharge modelling a dataset of 1121 stations with upstream areas over 10,000 km² was available until the end of 2013. The GRDC has a gradually decreasing number of stations with data in the archive for more recent years, limiting their use for those years. For the forecast discharge we limited our analyses to the period August 2008 to May 2010. This period provided the optimal compromise, for increasing the sample size, between the length of the period and the number of stations with good data coverage. For the reanalysis discharge experiments, and also for generating the observed discharge climate, stations with a minimum of 15 years of available observations in the 30-year period from 1981 to 2010 were used. For the forecast experiments, stations with at least 80% of the observations available in the 22-month period from August 2008 to May 2010 were used. Figure 1 shows the observation availability in the reanalysis and TIGGE forecast experiments. It highlights that for the reanalysis the coverage is better globally, with about 850 stations, while the forecast experiments have around 550 stations, with large missing areas mainly in Africa and Asia.

2.6 Forecasting system setup

To produce runoff from the TIGGE atmospheric ensemble variables (see Section 2.3), HTESSEL experiments were run with 6-hourly forcing frequency and an hourly model time step. For instantaneous variables (such as 2 m temperature), linear interpolation was used to move from the 6-hour to the hourly time step used in the CaMa-Flood simulations. For accumulated variables (such as precipitation), a disaggregation algorithm which conserves the 6-hourly totals was used. The algorithm divides each 6-hourly total into hourly values based on a linear combination of the current and adjacent 6-hourly totals, with weights derived from the time differences.
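The exact weighting formula is not given in the text, so the following Python sketch only illustrates the idea: each hourly value is shaped by a linear blend of the current and neighbouring 6-hourly totals (weights taken from the time distance to the window centres), and a final rescaling enforces conservation of every 6-hourly total. The function name and the hour-centre weighting are assumptions, not the operational code.

```python
import numpy as np

def disaggregate_6h_to_1h(totals_6h):
    """Split 6-hourly accumulations into hourly values (illustrative sketch only).

    The hourly shape blends the current and adjacent 6-hourly totals with weights
    based on time distance; rescaling then conserves each 6-hourly total exactly.
    """
    totals_6h = np.asarray(totals_6h, dtype=float)
    n_windows = totals_6h.size
    hourly = np.zeros(n_windows * 6)

    for w in range(n_windows):
        prev_t = totals_6h[max(w - 1, 0)]
        next_t = totals_6h[min(w + 1, n_windows - 1)]
        for h in range(6):
            # Offset of this hour's centre from the window centre (-2.5 .. +2.5 h)
            offset = (h + 0.5) - 3.0
            if offset < 0:      # first half of window: blend with previous total
                w_nb = -offset / 6.0
                shape = (1.0 - w_nb) * totals_6h[w] + w_nb * prev_t
            else:               # second half: blend with next total
                w_nb = offset / 6.0
                shape = (1.0 - w_nb) * totals_6h[w] + w_nb * next_t
            hourly[w * 6 + h] = shape
        # Rescale the 6 hourly values so they sum exactly to the original total
        block = hourly[w * 6:(w + 1) * 6]
        s = block.sum()
        if s > 0:
            hourly[w * 6:(w + 1) * 6] = block * (totals_6h[w] / s)
        else:
            hourly[w * 6:(w + 1) * 6] = totals_6h[w] / 6.0
    return hourly

# Example: three 6-hourly precipitation totals (mm); the 6-hour sums are conserved
print(disaggregate_6h_to_1h([0.0, 6.0, 12.0]).reshape(3, 6).sum(axis=1))  # -> [0. 6. 12.]
```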

The climate and the initial conditions needed for the HTESSEL land-surface model runs to produce runoff were taken from ERAI-Land, with the same initial conditions used for all models and ensemble members, without perturbations. The other two reanalysis datasets could also have been used to initialise HTESSEL, but the resulting variability in the TIGGE runoff (and thus in the TIGGE discharge) would be very small compared with the impact of the TIGGE atmospheric forcing (especially precipitation, see also Section 4.1) and the impact of the TIGGE forecast routing initialisation (see Section 3.2 for further details).

The HTESSEL resolution was set to T255 spectral resolution (~80 km). This was the horizontal resolution used in ERAI, and an adequate compromise between the highest (ECMWF, mainly ~50 km) and lowest (NCEP, ~110 km) forcing model resolutions, which also allowed fast enough computations. The TIGGE forcing fields were transformed to T255 using bilinear interpolation.

The TIGGE archive includes variables at the surface and on several pressure levels. However, variables are not available on model levels; as such, temperature, wind and humidity at the surface (i.e. 2 m for temperature and humidity, and 10 m for wind) were used in HTESSEL rather than the preferred lowest model level (LML) values.

Similarly, TIGGE contains several radiation variables, but not the downward radiation fluxes required by HTESSEL. In order to run HTESSEL without major technical modifications, we had to use a radiation replacement for all TIGGE models and ensemble members. We used ERAI-Land for this purpose, as it does not favour any of the TIGGE models used in this study. In this way, for one daily run the same single radiation forecast was used for all ensemble members and all models. These 10-day radiation forecasts were built from 12-hour ERAI-Land short-range predictions. To reduce possible spin-up effects in the first hours of the ERAI-Land forecasts, the 06-18h radiation fluxes were combined (as 12-hour sections) from subsequent 00:00 and 12:00 UTC runs, following the approach described in Balsamo et al. (2015). The sensitivity to the HTESSEL input variables is discussed in Section 4.1.

In this study we were able to process four of the ten global models archived in TIGGE: ECMWF, UKMO, NCEP and CMA. The other six models do not archive one or more of the forcing variables, in addition to the downward radiation, required for this study.

The runoff produced by HTESSEL for the TIGGE ensembles was routed over the river network by CaMa-Flood. These relatively short experiments for the TIGGE forecasts required initial river conditions. These were provided by three CaMa-Flood runs for the 1980-2010 period with ERAI, ERAI-Land and MERRA-Land runoff input.

The discharge forecasts were produced by CaMa-Flood out to 10 days (T+240h), the longest forecast horizon common to all models. No perturbations were applied to the river initial conditions for the ensemble members. The forecasts were extracted from the CaMa-Flood 15 arc min (~25 km) model grid every 24 hours, matching the 24-hour reporting frequency of the discharge observations.

3 Experiments

The main focus of the experiments was on the quality of the discharge forecasts derived from the TIGGE data. Three sets of experiments were performed to test the HTESSEL/CaMa-Flood set-up and the scope for error reduction by applying the multi-model approach and different post-processing methods:

- Discharge sensitivity to meteorological forcing - the first experiment (Section 4.1) tests the sensitivity of the forecasting system to the input variables.
- Reanalysis impact on discharge - the second experiment (Section 4.2) evaluates the potential improvements in the historical discharge that can be achieved by a combination of different reanalysis datasets.
- Improving the forecast distribution - in the third experiment (Section 4.3) the use of bias correction and model combination to improve the predictive distribution of the forecasts is considered.

3.1 Discharge sensitivity to meteorological forcing

In Section 2.6 a number of compromises in the coupling of HTESSEL with forecasts from the TIGGE archive were introduced. Sensitivity experiments were conducted to study their impact. Table 1 provides a short description of the experiments.

The baseline for the comparisons is the discharge forecasts generated by HTESSEL and CaMa-Flood driven by ECMWF control forecasts. These forecasts were produced weekly (at 00 UTC) throughout 2008-2012 to cover several seasons (~260 forecast runs in total). In the baseline setup, the LML meteorological output for temperature, wind and humidity was used to drive HTESSEL.

The first sensitivity test (Surf vs. LML) was to replace these LML values with the surface values (2 m temperature and humidity, and 10 m wind) from the same model run. This mirrors the change needed to make use of the TIGGE archive. Due to limitations in the TIGGE archive, the ERAI-Land radiation was used for all forecasts; substitution of the ECMWF control radiation in the HTESSEL input by ERAI-Land is the second sensitivity test (Rad). Further to this, substitution of the wind (Wind); of temperature, humidity and surface pressure together (THP); and finally of precipitation (Prec) from ERAI-Land in place of the ECMWF ensemble control run values was also evaluated. Temperature and humidity were analysed together due to the sensitive nature of the balance between these two variables. Although these changes were not applied to the TIGGE data, they give a more complete picture of the sensitivity to the forcing variables. This puts into context the discharge errors that we indirectly introduced through the TIGGE-HTESSEL setup changes.

The impact on the errors was compared by evaluating the ratio of the magnitude (absolute value) of the discharge differences to the baseline experiment's discharge values. These relative discharge changes were computed for each station as the average of the relative changes over all runs (in the 2008-2012 period with weekly runs), and also as a global average over all available stations.
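As a minimal illustration of this metric (the variable names and array layout are assumptions, not the authors' code), the relative discharge change for one sensitivity experiment could be computed as follows.

```python
import numpy as np

def relative_discharge_change(test_discharge, baseline_discharge):
    """Relative change metric used in the sensitivity tests (illustrative sketch).

    Both inputs are arrays of shape (n_runs, n_stations) holding discharge at a
    fixed lead time (e.g. T+240h). The per-station score is the mean over runs of
    |test - baseline| / baseline, and the global score is the mean over stations.
    """
    rel = np.abs(test_discharge - baseline_discharge) / baseline_discharge
    per_station = rel.mean(axis=0)   # average over the ~260 weekly runs
    return per_station, per_station.mean()

# Example with synthetic data for 260 runs and 550 stations
rng = np.random.default_rng(0)
base = rng.gamma(2.0, 100.0, size=(260, 550))
test = base * rng.normal(1.0, 0.05, size=base.shape)
_, global_change = relative_discharge_change(test, base)
print(f"global mean relative change: {global_change:.3f}")
```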


3.2 Reanalysis impact on discharge

For the forecast CaMa-Flood routing, the river initial conditions are provided by reanalysis-based simulations (see Section 2.4). These do not make use of observed river flow and are therefore only an estimate of the observed values. The quality of the forecast discharge is expected to be strongly dependent on the skill of this reanalysis-derived historical discharge. This is highlighted in Figure 2, where ERAI-Land, ERAI and MERRA-Land are compared for a station in the US over a 4-year period.

Each of these reanalyses has different error characteristics, which can potentially be harnessed by using a multi-model approach. For this station ERAI has a tendency to produce occasional high peaks, while MERRA-Land has a strong negative bias. Although Figure 2 is only a single example, it highlights the large variability between these reanalysis datasets and therefore the potentially severe underestimation of the uncertainties in the subsequent forecast experiments when only a single initialisation dataset is used.

The impact of the multi-model approach was analysed through experiments with the historical discharges derived from ERAI, ERAI-Land and MERRA-Land inputs. Three sets of CaMa-Flood routing runs were performed for each of the four TIGGE models for the whole 22-month period in 2008-2010, each initialised from one of the three reanalysis-derived historical river conditions. The performance of the historical discharge was evaluated independently of the TIGGE forecasts over the period 1981-2010.


3.3 Improving the forecast distribution

In the third group of experiments a number of post-processing techniques were applied at each site with the aim of improving the forecast distribution for the observed data. Here we outline the techniques with reference to a single site and forecast origin $t$. The forecast values available are denoted $f_{m,j,t,i}$, where $m$ indexes over the forecast products (ECMWF, UKMO, NCEP, CMA), $j$ over the $N_m$ ensemble members in forecast product $m$ and $i = 1,\dots,10$ over the available lead times.


3.3.1 Bias correction

As a first step we analysed the biases of the data. As described in Section 3.2, the historical river initial conditions have potentially large errors. In addition, the variability of the discharge over a 10-day forecast horizon is generally much smaller than that derived from reanalysis over a long period. Therefore any timing or magnitude error in the initial conditions provided by the historical discharge means the forecast errors can be very large and will change only slightly, in relative terms, throughout the 10-day forecast period.

As bias was expected to be a very important aspect of the errors, three methods of computing the bias correction $e_{m,j,t,i}$ to add to the forecast $f_{m,j,t,i}$ were proposed. The first of these is to apply no correction (uncorrected); that is, $e_{m,j,t,i} = 0$ in all cases. The second method, referred to as the 30-day correction, removes the mean bias of the 30-day period preceding the actual forecast run for each forecast product at each specified forecast range. The mean bias is computed as the average error of the ensemble mean over a 30-day period. In this case, given a series of 30 dates $t = 1,\dots,N_{30}$ and observed discharge data $y_t$, the bias corrections are given by

$$e_{m,j,t,i} = \frac{\sum_{t=1}^{N_{30}} \sum_{j=1}^{N_m} \left( y_t - f_{m,j,t-i,i} \right)}{N_{30}\, N_m}.$$

The third correction method, referred to as the initial-time correction, focused specifically on the initial condition errors of the historical discharge. The error at initialisation of the routing ($f_{m,j,t,0}$), i.e. the error of the historical discharge, was used as a correction for all forecast ranges. This initial-time correction gives

$$e_{m,j,t,i} = y_t - f_{m,j,t,0}.$$

This method therefore uses a specific error correction for each individual forecast run from day 1 to day 10. Due to the common initialisation, the initial-time correction was the same for all four TIGGE models, for each of the three historical discharge experiments respectively.
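A minimal sketch of the two non-trivial corrections, assuming the forecasts for one station are stored as a NumPy array indexed as [model, member, origin, lead] and the observations as a daily series aligned with the forecast origin dates; these array layouts and function names are assumptions made for illustration.

```python
import numpy as np

def thirty_day_correction(fc, obs, m, t, i, n30=30):
    """30-day correction for model m, origin index t and lead i (days).

    fc:  array [model, member, origin, lead] of forecast discharge
    obs: array [date] of observed discharge, with obs[t] valid at origin date t.
    The correction is the mean error, over all members of model m, of the n30
    forecasts that verified during the 30 days preceding origin t.
    """
    errors = []
    for t_valid in range(t - n30, t):          # verification dates in the window
        t_origin = t_valid - i                 # forecast that verified on t_valid
        errors.append(obs[t_valid] - fc[m, :, t_origin, i])
    return np.mean(errors)                     # same value added to every member

def initial_time_correction(fc, obs, m, t):
    """Initial-time correction: the error of the historical discharge at T+0h,
    applied unchanged to every lead time of the forecast started at origin t."""
    return obs[t] - fc[m, 0, t, 0]             # lead index 0 holds the initial state
```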


3.3.2 Multi-model combination

To investigate the potential further benefits of combining different forecast products, two model combination strategies were trialled. The naïve combination strategy (also referred to as the multi-model combination or MM) was based on forming a grand ensemble with each member having equal weight. In this combination the larger ensembles (the largest being ECMWF with 50 members) receive larger total weight. In direct analogy to the case of a single forecast product, the cumulative forecast distribution is expressed in terms of the indicator function $\delta(z)$, which takes the value 1 if the statement $z$ is true and zero otherwise, as

$$\Pr(Y_{t+i} < y) = \frac{\sum_{m} \sum_{j=1}^{N_m} \delta\!\left(f_{m,j,t,i} + e_{m,j,t,i} < y\right)}{\sum_{m} N_m},$$

where $e_{m,j,t,i}$ denotes one of the three bias corrections introduced in the previous section.
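A sketch of the equal-member-weight grand ensemble and its empirical CDF, under the same assumed data layout as above (one 1-D array of member values per model for a given origin and lead time).

```python
import numpy as np

def multi_model_cdf(member_forecasts, corrections, y):
    """Empirical CDF value Pr(Y < y) of the naive multi-model (MM) combination.

    member_forecasts: list of 1-D arrays, one per model (e.g. 50, 23, 20 and 14
                      member values for ECMWF, UKMO, NCEP and CMA).
    corrections:      list of per-model bias corrections e_{m,j,t,i} (scalar per
                      model here, since the corrections above are member-independent).
    Every member carries equal weight, so models with more members weigh more.
    """
    pooled = np.concatenate([f + e for f, e in zip(member_forecasts, corrections)])
    return np.mean(pooled < y)

# Example: pool four toy ensembles and evaluate the CDF at y = 1500 m3/s
rng = np.random.default_rng(1)
ens = [rng.normal(1400, 200, n) for n in (50, 23, 20, 14)]
print(multi_model_cdf(ens, [0.0, 0.0, 0.0, 0.0], 1500.0))
```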

In the second combination strategy, BMA was used to explore further the effects of a weighted combination and a temporally localised bias correction. Since discharge is always positive, the variables were transformed so that their distributions marginalised over time are standard Gaussian. This is achieved using the Normal Quantile Transform (Krzysztofowicz 1997), with the upper and lower tails handled as in Coccia and Todini (2011). The transformed values of the bias-corrected forecasts and observations are denoted $\tilde{f}_{m,j,t,i}$ and $\tilde{y}_t$ respectively.

This study follows the BMA approach proposed by Fraley et al. (2010) for systems with exchangeable members, with the weight ($w_{m,t,i}$), linear bias correction (with parameters $a_{m,t,i}$ and $b_{m,t,i}$) and nugget variance ($\sigma^2_{m,t,i}$) being identical for each ensemble member within a given forecast product. The resulting cumulative forecast distribution, in the transformed space, is then a weighted combination of standard Gaussian cumulative distributions, specifically:

$$\Pr(\tilde{Y}_{t+i} < \tilde{y}) = \sum_{m} w_{m,t,i} \sum_{j=1}^{N_m} \Phi\!\left( \frac{\tilde{y} - a_{m,t,i} - b_{m,t,i}\,\tilde{f}_{m,j,t,i}}{\sigma_{m,t,i}} \right).$$

As indicated by the origin and lead-time subscripts, the BMA parameters were estimated for each forecast origin and lead time. Estimation proceeds by first fitting the linear bias correction using least squares, before estimating the weight and variance terms using maximum likelihood (Raftery et al. 2005). A moving window of 30 days of data before the initialisation of the forecasts, similarly to the 30-day correction, was utilised for the estimation to mimic operational practice.
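The estimation itself is only outlined in the text, so the following is a simplified sketch rather than the authors' implementation: a rank-based NQT without the tail treatment of Coccia and Todini (2011), a pooled least-squares fit of the linear bias parameters, and a short EM loop for the exchangeable-member weights and variances. All function names and the training-data layout are assumptions.

```python
import numpy as np
from scipy.stats import norm

def nqt(values, climatology):
    """Normal Quantile Transform via the empirical CDF of a climatological sample
    (tails are simply clipped here; the paper follows Coccia and Todini (2011))."""
    clim = np.sort(climatology)
    p = np.searchsorted(clim, values, side="right") / (clim.size + 1.0)
    return norm.ppf(np.clip(p, 1e-4, 1 - 1e-4))

def fit_bma_exchangeable(fc_t, obs_t, n_iter=50):
    """Fit exchangeable-member BMA in NQT space for one lead time.

    fc_t:  list over models of arrays [time, member] of transformed forecasts
           from the 30-day training window; obs_t: array [time] of transformed obs.
    Returns per-model (a, b, sigma, w), with w the per-member weight shared by all
    members of a model so that sum_m N_m * w_m = 1.
    """
    M = len(fc_t)
    a, b = np.zeros(M), np.ones(M)
    for m, f in enumerate(fc_t):                      # least-squares bias fit
        x = f.ravel()
        y = np.repeat(obs_t, f.shape[1])
        b[m], a[m] = np.polyfit(x, y, 1)
    w = np.full(M, 1.0 / sum(f.shape[1] for f in fc_t))
    sigma = np.ones(M)
    for _ in range(n_iter):                           # EM for weights and variances
        dens, resid = [], []
        for m, f in enumerate(fc_t):
            r = obs_t[:, None] - (a[m] + b[m] * f)    # [time, member] residuals
            dens.append(w[m] * norm.pdf(r, scale=sigma[m]))
            resid.append(r)
        total = sum(d.sum(axis=1) for d in dens)      # mixture density per date
        for m in range(M):
            z = dens[m] / total[:, None]              # member responsibilities
            w[m] = z.sum() / (obs_t.size * fc_t[m].shape[1])
            sigma[m] = np.sqrt((z * resid[m] ** 2).sum() / max(z.sum(), 1e-9))
    return a, b, sigma, w

def bma_cdf(y_t, fc_members_t, params):
    """Predictive CDF Pr(Y~ < y~) in transformed space for one origin and lead."""
    a, b, sigma, w = params
    return sum(w[m] * norm.cdf((y_t - a[m] - b[m] * f) / sigma[m]).sum()
               for m, f in enumerate(fc_members_t))
```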

As the initial conditions were expected to play an important role, a further forecast was introduced in the context of the BMA analysis. The deterministic persistence forecast is, throughout the 10-day forecast range, the most recent observation available at the time of issue; that is, $f_{m,1,t,i} = y_t$. This persistence forecast was also used as a simple reference against which to compare our forecasts.

To aid comparison with the naïve combination strategy, a similarly sized ensemble of forecasts was generated from the BMA combination by applying ensemble copula coupling (Schefzik et al. 2013) to a sample generated by taking equally spaced quantiles from the forecast distribution and reversing the transformation.
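A short sketch of how an ensemble of the same size could be drawn back from the BMA distribution, assuming a numerical (bisection) inversion of the predictive CDF and a reordering of the equally spaced quantiles by the ranks of the raw grand ensemble; the inversion method and names are assumptions.

```python
import numpy as np

def ecc_sample(cdf, raw_ensemble, lo=-10.0, hi=10.0, tol=1e-6):
    """Draw len(raw_ensemble) equally spaced quantiles from a predictive CDF and
    reorder them with the rank order of the raw ensemble (ensemble copula coupling).

    cdf: callable y~ -> Pr(Y~ < y~) in transformed space (e.g. the BMA CDF above).
    """
    n = len(raw_ensemble)
    probs = (np.arange(n) + 0.5) / n          # equally spaced probability levels
    quantiles = []
    for p in probs:
        a, b = lo, hi
        while b - a > tol:                    # invert the CDF by bisection
            mid = 0.5 * (a + b)
            if cdf(mid) < p:
                a = mid
            else:
                b = mid
        quantiles.append(0.5 * (a + b))
    ranks = np.argsort(np.argsort(raw_ensemble))
    return np.sort(quantiles)[ranks]          # the inverse NQT would follow here
```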


3.3.3 Verification statistics

The forecast distributions were evaluated using the continuous ranked probability score (CRPS; Candille and Talagrand 2005). The CRPS evaluates the global skill of ensemble prediction systems by measuring a distance between the predicted and the observed cumulative density functions of scalar variables. For a set of dates $t = 1,\dots,N_t$ with observations and forecasts issued with the same lead time, the CRPS can be defined as

$$\mathrm{CRPS} = \frac{1}{N_t} \sum_{t=1}^{N_t} \int_{-\infty}^{\infty} \left( \Pr(Y_{t+i} < y) - \delta(y \ge y_{t+i}) \right)^2 dy.$$

The CRPS has a perfect score of 0 and has the advantage of reducing to the mean absolute error for deterministic forecasts, thus providing a simple way of comparing different types of systems. In this study the method of Hersbach (2000) for computing the CRPS from samples was used. The global CRPS scores reported for each lead time were produced by pooling the samples from all the stations before computing the scores.
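A sketch of the sample-based CRPS for one station and lead time, computed here directly from the empirical ensemble distribution via the energy-form identity rather than with the decomposition of Hersbach (2000) used in the paper; both should give the same total score for an ensemble forecast.

```python
import numpy as np

def crps_ensemble(members, obs):
    """CRPS of one ensemble forecast against one observation, using the identity
    CRPS = E|X - y| - 0.5 E|X - X'| for the empirical ensemble distribution."""
    members = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(members - obs))
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return term1 - term2

def pooled_crps(all_members, all_obs):
    """Average CRPS over a set of cases (dates, and for the global scores in the
    paper also the pooled station sample)."""
    return np.mean([crps_ensemble(m, y) for m, y in zip(all_members, all_obs)])

# Example: a single-member (deterministic) forecast reduces to the absolute error
print(crps_ensemble([3.0], 5.0))   # -> 2.0
```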

As the CRPS has the unit of the physical quantity (e.g. m³/s for discharge), comparing scores can be problematic and is only meaningful if two scores based on homogeneous samples are compared; for example, different geographical areas or different seasons cannot really be compared. In this study we ensured that for any comparison of forecast models and post-processed versions the samples were homogeneous. Specifically, we considered the same days in the verification period at each station, and also the same stations in the global analysis, producing equal sample sizes across all compared products.

To help compare results across different stations and areas, we used the CRPS-based skill score (CRPSS) in our verification, with the observed discharge climate as the reference system. We produced the daily observed climate for the 30-year period 1981-2010, pooling observations from a 31-day window centred on each day. The observed climate was produced for stations with at least 10 years of data available in total (310 values) for all days of the year.

The historical discharge experiments were compared using the mean absolute error (MAE) based skill score, with the observed daily discharge climate as reference,

$$\mathrm{MAESS} = 1 - \frac{\mathrm{MAE}}{\mathrm{MAE}_{\mathrm{obsclim}}},$$

and the sample Pearson correlation coefficient,

$$\mathrm{CORR} = \frac{\sum_{t=1}^{N_t} (x_t - \bar{x})(y_t - \bar{y})}{\sqrt{\sum_{t=1}^{N_t} (x_t - \bar{x})^2}\, \sqrt{\sum_{t=1}^{N_t} (y_t - \bar{y})^2}}.$$

The MAE reflects the ability of the systems to match the actual observed discharge, while the correlation highlights the quality of the match between the temporal behaviour of the historical forecast time series and the observation time series.
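The two summary statistics above translate directly into code; a minimal sketch assuming aligned simulated and observed discharge series plus a deterministic daily climatology series (e.g. the day-of-year climatological value) as the reference forecast. The names are assumptions.

```python
import numpy as np

def maess(sim, obs, obs_clim):
    """MAE-based skill score against the daily observed climatology reference."""
    mae = np.mean(np.abs(sim - obs))
    mae_clim = np.mean(np.abs(obs_clim - obs))
    return 1.0 - mae / mae_clim

def corr(sim, obs):
    """Sample Pearson correlation between simulated and observed discharge."""
    sx, sy = sim - sim.mean(), obs - obs.mean()
    return (sx * sy).sum() / np.sqrt((sx ** 2).sum() * (sy ** 2).sum())
```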

4 Results

First we present the findings of the sensitivity experiments, carried out using the ECMWF ensemble control forecast, on the impact of the HTESSEL coupling with the TIGGE meteorological input. Then we compare the quality of the historical discharge produced from the ERAI, ERAI-Land and MERRA-Land reanalysis datasets and the impact of their combination. Finally, from the large number of forecast products described in Sections 3.2 and 3.3, we present results that aid interpretation of the discharge forecast skill and errors, with a focus on the potential multi-model improvements:

- The four uncorrected TIGGE forecasts with ERAI-Land initialisation.
- The MM combination of the four uncorrected models with the ERAI, ERAI-Land and MERRA-Land initialisations, and the grand ensemble of these three MM combinations (called GMM hereafter).
- The GMM combinations of the 30-day corrected, the initial-time corrected, and the combined initial-time and 30-day corrected MM forecasts (first initial-time correct the forecasts, then apply the 30-day correction to these).
- Finally, the GMM of the BMA-combined MM forecasts (from all three initialisations) with the uncorrected models, the initial-time corrected models, and also the uncorrected models extended by the persistence as a separate single-valued model.

4.1 Discharge sensitivity to meteorological forcing

The impact of replacing HTESSEL forcing variables other than precipitation (the combination of the Rad, Wind and THP tests) with ERAI-Land is rather small (~3% by T+240h, brown curve in Figure 3). The least influential is the wind (red curve), while the biggest contribution comes from THP (green curve). When all ensemble forcing is replaced, including precipitation, the impact jumps to ~15% by T+240h, showing that the large majority of the change in the discharge comes from differences in precipitation (not shown).

The analysis of different areas and periods (see Table 2) highlights that larger impacts are seen for the winter period, where the contribution of precipitation decreases and the contribution of the other forcing variables, both individually and combined, increases by roughly two- to fivefold (this is particularly noticeable for THP). This is most likely a consequence of snow-related processes, with snowmelt being dependent on temperature, radiation and also wind in the cold seasons. It also implies that the results are dependent on seasonality, a result also found by Liu et al. (2013), who looked at the skill of post-processed precipitation forecasts using TIGGE data for the Huai river basin in China. In this study, due to the relatively short period we were able to use in the forecast verification, scores were only computed for the whole verification period and no seasonal differences were analysed.

Regarding the change from LML to surface forcing for temperature (2 m), wind (10 m) and humidity (2 m), the potential impact can be substantial, as shown by an example for 1 January 2012 in Figure 4. In such cold winter conditions, large erroneous surface runoff values could appear in some parts of Russia when switching to surface forcing in HTESSEL. The representation of dew deposition is a general feature of HTESSEL which can be amplified in stand-alone mode. When coupled to the atmosphere, the deposition is limited in time, as it leads to a decrease of atmospheric humidity. However, in stand-alone mode, since the atmospheric conditions are prescribed, large deposition rates can be generated when the atmospheric forcing is not in balance (e.g. after model grid interpolation, or after changing from LML to surface forcing).

This demonstrates that with a land-surface model such as HTESSEL, particular care needs to be taken in the design of the experiments when model imbalances are expected. The use of surface data was an acceptable compromise, as the sensitivity experiments highlighted only a small impact caused by the switch from LML to surface forcing (dashed black line in Figure 3), and similarly for the impact of the Rad test, confirming that the necessary changes in the TIGGE land-surface model setup did not have a major impact on the TIGGE discharge.

4.2 Reanalysis impact on discharge

The quality of the historical river flow that provides initial conditions for the CaMa-Flood TIGGE routing is expected to have a significant impact on the forecast skill. The discharge performance is highlighted in Figure 5, which shows the mean absolute error skill score (MAESS) and the sample Pearson correlation (CORR) for the ERAI, ERAI-Land and MERRA-Land simulated historical discharge from 1981 to 2010, and for their equal-weight multi-model average (MMA). The results are provided as continental and also as global averages of the available stations for Europe (~150 stations), North America (~350 stations), South America (~150 stations), Africa (~80 stations), Asia (mainly Russia, 60 stations) and Australia & Indonesia (~50 stations), making ~840 stations globally.

The general quality of these global simulations is quite low. The MAESS averages over the available stations (see Figure 1) are below 0 for all continents, i.e. the large-scale average performance is worse than the daily observed climatology. The models are closest to the observed climate performance over Europe and Australia & Indonesia. The correlation between the simulated and observed time series shows a slightly more mixed picture; in some cases, especially Europe and Australia & Indonesia, the model is better than the observed climate. It is interesting to note that although the observed climate produces a high forecast time series correlation, in Asia the reanalysis discharge scores very low for all three sources. This could be related to the problematic handling of snow in that area.

Figure 2 shows an example where MERRA-Land displayed a very strong negative bias. This example highlights the large variability amongst these data sources and is not an indication of the overall quality. Although MERRA-Land shows a generally negative bias (not shown), the overall quality of the three reanalysis-driven historical discharge datasets is rather comparable. The highest skill and correlation is generally shown by ERAI-Land for most of the regions, with the exception of Africa and Australia, where MERRA-Land is superior. ERAI, as the oldest dataset, appears to be the least skilful. Reichle et al. (2011) found the same relationship between MERRA-Land and ERAI using 18 catchments in the US. Although they computed correlation between seasonal anomaly time series (rather than the actual time series evaluated here), they could show that runoff estimates had higher correlation of the anomaly time series in MERRA-Land than in ERAI.

The multi-model average of the three simulations is clearly superior in the global and also in the continental averages, with very few exceptions that have marginally lower MMA scores compared with the best individual reanalysis. The MMA is able to improve on the best of the three individual datasets at about half of the stations globally, in both the MAESS and CORR scores. Figure 6 shows the improvements in correlation. The points where the combination of the three reanalyses helps to improve on the best model cluster mainly over Europe, Amazonia and the east of the US. On the other hand, the northern hemispheric winter areas seem to show mainly deterioration. This again is most likely related to the difficulty of the snow-related processes, which can hinder the success of the combination if, for example, one model is significantly worse, with larger biases, than the other two. Further analysis could help identify these more detailed error characteristics, providing a basis for further potential improvements.

4.3 Improving the forecast distribution

561

Figure 7 displays example hydrographs of some analysed forecast products for a single

562

forecast run to provide a practical impression of our experiments. The forecasts from 18 April

563

2009 are plotted for the GRDC station of Lobith in the Netherlands. The thin solid coloured

564

lines are the four TIGGE models (ECMWF, UKMO, NCEP and CMA) plotted together

565

(MM) with ERAI-Land (red), ERAI (green) and MERRA-Land (blue) initialisations. They

566

start from very different levels which are quite far from the observation (thick black line), but

567

then seem to converge to roughly the same range in this example. The ensemble mean of the

568

initial error corrected MM-s (from the three initialisations with dashed lines), which by

569

definition start from the observed discharge at T+0h, then follow faithfully the pattern of the

570

mean of the respective MM-s. The 30-day corrected forecasts (dash-dotted lines) follow a

571

pattern relative to the MM ensemble means set by the last 30 days performance. The

572

combination of the two bias correction methods (dotted lines) blend the characteristics of the

573

two; all three versions start from the observation (as first the initial error is removed), and

574

then follow the pattern set by the past 30 day performance of this initial time corrected

575

forecast. Finally, the BMA transformed (uncorrected) MM-s (thin grey lines) happen to be

576

closest to the observations in this example, showing a rather uniform spread throughout the

577

processed T+24h to T+240h range.

578 579

The quality of the TIGGE discharge forecasts based on the verified August 2008 to May

580

2010 period is strongly dependent on the historical discharge that is used to initialise them.

581

Figure 5 highlighted that the daily observed discharge climate is a better predictor than any of

582

the three historical reanalysis-driven discharges (MAESS < 0). It is therefore not surprising

583

that the uncorrected TIGGE forecasts show similarly low relative skill based on the CRPS

584

(Figure 8). Figure 8 also shows the performance of the four models (dashed grey lines). As in

26

585

this study we concentrate on the added value of the multi-model combination, we do not

586

distinguish between the four raw models. The scores change very little over the 10-day

587

forecast period, showing a marginal increase in CRPSS as lead time increases. This is

588

indicative of the incorrect initialisation, with the forecast outputs becoming less dependent on

589

initialisation further into the medium-range, and slowly converging towards climatology.

590 591

The first stage of the multi-model combination is the red line in Figure 8, the combination of

592

the four uncorrected model with the same ERAI-Land initialisation. On the basis of this

593

verification period and global station list the simple equal-weight combination of the

594

ensembles does not really seem to be able to improve on the best model. However, we have

595

to acknowledge that the performance in general is very low.

596 597

The other area where we expect improvements through the multi-model approach is the

598

initialisation. Figure 8 highlights a significant improvement when using three historical

599

discharge initialisations instead of only one. The quality of the ERAI-Land (red), ERAI

600

(green) and MERRA-Land (blue) initialised forecasts (showed here only the multi-model

601

combination versions) are comparable, with the ERAI-Land slightly ahead which is in

602

agreement with the results of the direct historical discharge comparisons presented in Section

603

4.2. However, the grand combination of the three is able to improve significantly (orange

604

line) on all of them. The improvement is much larger at shorter lead times as the TIGGE

605

meteorological inputs provide lower spread, and therefore the spread introduced by the

606

different initialisations is able to have a bigger impact.

607 608

The quality of the discharge forecasts could be improved noticeably by introducing different

609

initial conditions. However, the CRPSS is still significantly below zero, pointing to the need

27

610

for post-processing. In this study we have experimented with a few methods which were

611

proven to be beneficial.

612 613

The 30-day correction removed the mean bias of the most recent 30 runs from the forecasts.

614

Figure 8 shows the grand combination of the 30-day bias corrected multi-models (with all

615

three initialisations) which brings the CRPSS to almost 0 throughout the 10 day forecast

616

range (dashed burgundy line in Figure 8). This confirms that the forecasts are severely biased.

617

In addition, the shape of the curve remains fairly horizontal suggesting this correction is not

618

making best use of the temporal patterns in the bias.

619 620

Further significant improvements in CRPSS are gained at shorter forecast ranges by using the

621

initial time correction (dashed purple line with markers in Figure 8), which does make use of

622

temporal patterns in the bias. The shape of this error curve shows a typical pattern with the

623

CRPSS decreasing with forecast range, reflecting the decreasing impact of the initial time

624

correction and increased uncertainty in the forecast. The impact of the initial time errors

625

gradually decreases until finally it disappears by around day 5-6 when the 30-day correction

626

becomes superior.

627 628

The combination of the two methods, by applying the 30-day bias correction to forecasts

629

already adjusted by the initial time correction, blends the advantages of both corrections. The

630

CRPSS is further improved mainly in the middle of the 10-day forecast period with

631

disappearing gain by T+240h (solid burgundy line with markers in Figure 8).

632 633

The fact that the performance of the 30-day correction is worse in the short range than the

634

initial time correction highlights that the impact of the errors at initial time has a structural

28

635

component that cannot be explained by the temporally averaged bias. Similarly the initial

636

time correction cannot account exclusively for the large biases in the forecasts as its impact

637

trails off relatively quickly.

638 639

The persistence forecast shows a distinct advantage over these post-processed forecasts

640

(dashed light blue line with circles in Figure 8). There is positive skill up to T+144h and the

641

advantage of the persisted observation as a forecast diminishes so that by T+240h its skill is

642

similar to that of the combined corrected forecasts. This further highlights that the utilising of

643

the discharge observations in the forecast production promises to provide a really significant

644

improvement.

645 646

It is suggested that the structure of the initial errors has two main components: (i) biases in

647

the reanalysis initialisations due to biases in the forcing (e.g. precipitation) and in the

648

simulations (e.g. evapotranspiration) and (ii) biases introduced by timing errors in the routing

649

model due, in part, to the lack of optimized model parameters. A further evaluation of the

650

weight of each of these error sources is beyond the scope of this study.

651 652

The final of our trialled post-processing methods is the BMA. In Figure 8, similarly to the

653

other post-processed products, only the grand combination is displayed of the three BMA

654

transformed MM-s with the different initialisations. The BMA of the uncorrected forecasts

655

was able to increase further the CRPSS markedly across all forecast ranges except T+24h

656

(black line without markers). The results for T+24h suggest that at this lead time the perfect

657

initial error correction from T+0h still holds superior.

658

29

659

The other two BMA versions, one with the uncorrected forecasts extended by the persistence

660

as predictor (black line with circles), and one with the initial time corrected forecasts (not

661

shown), both provide further skill improvements. The one with the persistence performs

662

overall better, especially in the first few days. The BMA incorporating the persistence

663

forecast remains skilful up to T+168h, the longest lead time of any of the forecast methods

664

tested. At longer lead times (days 8-10) the BMA of the uncorrected model forecast appears

665

to provide the highest skill of all the post-processed products. This is evidence that the

666

training of the BMA is not optimal. In part this is due to the estimation methodology used.

667

More significantly experiments (not reported) show that the optimal training window for the

668

BMA varies across sites, showing different picture for the BMA with or without persistence,

669

and also delivering potentially higher global average skill using a longer window.

670 671

Although Figure 8 shows only the impact of the four post-processing methods on the grand

672

combination of the MM forecasts, the individual MM-s with the three initialisations show the

673

same behaviour. The GMM-s always outperform the three MM-s for all the post-processing

674

products, e.g. for the most skilful method, the BMA, the grand combination extends the

675

positive skill by ~1 day (from around 5 days to 6 days, not shown).

676 677

The distribution of the skill increments over all stations provided by different combination

678

and post-processing products is summarised in Figure 9 at T+24h (Figure 9a) and T+240

679

(Figure 9b). The reference skill is the average CRPSS of the four TIGGE models with the

680

ERA-Land initialisation (these values are represented by the dashed grey lines in Figure 8).

681

Figure 9 highlights the structure of the improvements in different ranges of the CRPSS for

682

the different methods over all verified stations in the August 2008 to May 2010 period. The

683

picture is characteristically different at different lead times as suggested by the T+24h and

30

684

T+240h plots. At short range the improvements of the different products scale nicely into

685

separate bands. The relatively simple MM combination of the 4 models with ERAI-Land (red

686

circles) does not improve on the forecast; the increments are small and with mixed sign. The

687

GMM combination of the three uncorrected MM-s (green triangles) shows a marked

688

improvement, and the 30-day correction version (orange triangles) improves further while the

689

initial time correction products (cyan squares and purple stars) show the largest improvement

690

over most of the stations. At this short T+24h range the BMA (blue stars) of the uncorrected

691

forecasts is slightly behind which is a general feature across the displayed -5 to 1 CRPSS

692

range.

In contrast to the short range, T+240h provides a significantly different picture. The relatively clear ranking of the products is gone by this lead time. The MM and GMM combinations are able to improve only slightly for most of the stations, but at this range their contribution is almost always positive. The post-processing methods at this medium range, however, sometimes deteriorate the forecasts, especially in the -1 to 0.5 range (the 30-day correction behaves noticeably better in this respect). The general improvements are nevertheless clear for most of the stations, and the overall ranking of the methods seen in Figure 8 is also reflected, although much less clearly than at T+24h, with the BMA topping the list at T+240h.
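The skill measure used throughout is the CRPSS with the daily observed discharge climate as reference. As a minimal illustration (not the verification code used in this study), the sketch below computes the ensemble CRPS from its kernel form, CRPS = E|X - y| - 0.5 E|X - X'|, and the corresponding skill score against a climatological ensemble; the toy data and the way the climate sample is built are assumptions for illustration only.

```python
import numpy as np

def crps_ensemble(ens, obs):
    """CRPS of one ensemble forecast `ens` (m members) for a scalar `obs`,
    using CRPS = E|X - y| - 0.5 E|X - X'| over the ensemble members."""
    ens = np.asarray(ens, dtype=float)
    term1 = np.abs(ens - obs).mean()
    term2 = 0.5 * np.abs(ens[:, None] - ens[None, :]).mean()
    return term1 - term2

def crpss(fcst_ens, clim_ens, obs):
    """Skill score of ensemble forecasts against a reference 'climate' sample.

    fcst_ens : (N, m) ensemble forecasts for N forecast days
    clim_ens : (N, c) observed-climate values used as the reference forecast
    obs      : (N,) verifying observations
    """
    crps_f = np.mean([crps_ensemble(f, o) for f, o in zip(fcst_ens, obs)])
    crps_c = np.mean([crps_ensemble(c, o) for c, o in zip(clim_ens, obs)])
    return 1.0 - crps_f / crps_c

# toy example: 20 forecast days, 10-member ensemble, 30-value daily climate
rng = np.random.default_rng(1)
obs = rng.gamma(2.0, 50.0, size=20)
fcst = obs[:, None] + rng.normal(0.0, 15.0, size=(20, 10))
clim = rng.gamma(2.0, 50.0, size=(20, 30))
print(crpss(fcst, clim, obs))   # values above 0 beat the climatology
```

A positive CRPSS in this sense corresponds to the "positive skill" referred to in the discussion of Figures 8-10.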

Finally, Figure 10 presents the discharge performance achieved in this study for all the stations that could be processed in the August 2008 to May 2010 period at T+240h. It displays the CRPSS of the best overall product, the GMM with the BMA of the uncorrected forecasts (the combination of the three BMA transformed MM-s with the three initialisations, without initial time or 30-day bias correction). The geographical variability of the scores is very large, but patterns emerge. Higher performance is observed in the Amazon and in the central and western parts of the USA, while lower CRPSS values are seen over the Rocky Mountains in North America and at northerly points in Europe and Russia. Unfortunately the geographical coverage of the stations is not good enough to draw more detailed conclusions.

5 Conclusions

This study has shown aspects of building a global multi-model hydro-meteorological forecasting system using the TIGGE archive, and has analysed the impact on the forecasts of the post-processing required to run such a multi-model system.

The atmospheric input was taken from four operational global meteorological ensemble systems, using data available from TIGGE. The hydrological component of this study was the HTESSEL land-surface model, while the CaMa-Flood global river routing model was used to integrate runoff over the river network. Observations from the GRDC discharge archive were used for evaluation and post-processing.

We have shown that the TIGGE archive is a valuable resource for river discharge forecasting, and three main objectives were successfully addressed: (i) the sensitivity of the forecasting system to the meteorological input variables, (ii) the potential improvements to the historical discharge dataset (which provides initial river conditions to the forecast routing) and (iii) improving the predictive distribution of the forecasts. The main outcomes can be grouped as follows:

(i) The impact of replacing or altering the input meteorological variables to fit the system requirements is small, which allows the use of variables from the TIGGE archive for this hydrological study.

(ii) The multi-model average historical discharge dataset provides a very valuable source of uncertainty information and a general gain in skill.

(iii) Significant improvements in the forecast distribution can be produced through the use of initial time and 30-day bias corrections on the TIGGE model discharge, or on the combination of the forecast models; however, the combination of techniques used has a big impact on the improvement observed, with the best BMA products providing positive skill up to 6 days.

The quality of the raw TIGGE-based discharge forecasts has been shown to be low, mainly determined by the limited performance of the reanalysis-driven historical river conditions analysed in Section 4.2. The lower skill is in agreement with results found in other studies. For example, Alfieri et al. (2013) showed that in the context of GloFAS, the Lisflood hydrological model (Van Der Knijff et al. 2010), forced by ERAI-Land runoff, shows variable performance over the 1990-2010 historical period. For the 620 global observing stations analysed, the Pearson correlation coefficient reaches as low as -0.2, and only 71% of the stations provide correlation values above 0.5. Donnelly et al. (2015) highlighted similar behaviour with the E-HYPE system based on 181 river gauges in Europe for 1981-2000. The correlation component of the Kling-Gupta efficiency started around 0, and the geographical distribution of values in Europe was very similar to our result (not shown). The lowest correlation was found mainly in Spain and in Scandinavia, with an average value comparable to our European mean of 0.6-0.7 (see our Figure 5).
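The deterministic measures referred to above are the sample Pearson correlation and the correlation component of the Kling-Gupta efficiency. For reference, a minimal sketch of these quantities is given below, assuming paired simulated and observed daily discharge series; the variable names and toy data are illustrative only.

```python
import numpy as np

def kge_components(sim, obs):
    """Return (r, alpha, beta, KGE) for simulated vs observed discharge:
    r     - Pearson correlation (the component discussed in the text)
    alpha - variability ratio std(sim) / std(obs)
    beta  - bias ratio mean(sim) / mean(obs)
    KGE   = 1 - sqrt((r - 1)**2 + (alpha - 1)**2 + (beta - 1)**2)
    """
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    kge = 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
    return r, alpha, beta, kge

# toy example with an imperfect simulation of an observed record
rng = np.random.default_rng(2)
obs = rng.gamma(2.0, 50.0, size=365)
sim = 0.9 * obs + rng.normal(0.0, 20.0, size=365)
print(kge_components(sim, obs))
```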

The combination and post-processing methods we applied to the discharge forecasts provided a significant improvement in skill. Although the simple multi-model combinations and the 30-day bias correction (removing the mean error of the most recent 30 days) both provide significant improvements, they are not capable of achieving positive global skill (i.e. of outperforming the daily observed discharge climate). The initial-time correction, which adjusts the forecast to the observations at initial time and applies this error correction along the forecast range, is able to provide skill in the short range (only up to 2-3 days), especially when combined with the 30-day correction. However, the impact quickly wears off, and for longer lead times (up to about 6 days) only the BMA post-processing method is able to provide positive average global skill (closely followed by the persistence).
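The two simple corrections can be written down compactly. The sketch below is a minimal illustration assuming daily discharge forecasts with matching observations; in particular, carrying the full initial-time error unchanged to every lead time is a simplifying assumption made here for illustration and is not necessarily the exact scheme applied in the study.

```python
import numpy as np

def correct_30day(fcst, past_fcst, past_obs):
    """Remove the mean error of the most recent training days (e.g. 30)
    from a forecast, separately for each lead time.

    fcst      : (L,) raw forecast for lead times 1..L
    past_fcst : (30, L) forecasts of the previous 30 days at the same leads
    past_obs  : (30, L) verifying observations for those forecasts
    """
    mean_error = (past_fcst - past_obs).mean(axis=0)
    return fcst - mean_error

def correct_initial_time(fcst, analysis_t0, obs_t0):
    """Shift the whole forecast by the discharge error diagnosed at T+0h
    (simplest variant: the initial error is carried unchanged to all leads)."""
    return fcst - (analysis_t0 - obs_t0)

# toy example: a forecast with a constant +20 m3/s bias
rng = np.random.default_rng(3)
L = 10
truth = 100.0 + 5.0 * np.arange(L)
fcst = truth + 20.0 + rng.normal(0.0, 5.0, L)
past_obs = np.tile(truth, (30, 1))
past_fcst = past_obs + 20.0 + rng.normal(0.0, 5.0, (30, L))
print(correct_30day(fcst, past_fcst, past_obs))
print(correct_initial_time(fcst, analysis_t0=120.0, obs_t0=100.0))
```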

Although other studies have shown significant improvement from using multiple meteorological inputs (e.g. Pappenberger et al. 2008), in this study the impact of combining different TIGGE ensemble models is rather small. This is most likely a consequence of the overwhelming influence of the historical river conditions on the river initialisation. The grand combinations, in which the forecasts produced with the different reanalysis-driven historical river conditions are combined, however, always outperform the individual MM-s (single initialisation) for all the post-processing products. They provide a noticeable overall skill improvement, which in our study translated into an extension of about one day, as a global average, in the lead time at which the CRPSS drops below 0 for the most skilful BMA forecasts.
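The multi-model (MM) and grand multi-model (GMM) combinations are, in essence, pools of ensemble members. A minimal sketch is given below, assuming equally weighted members; the member counts and dictionary names are illustrative only.

```python
import numpy as np

def multi_model(members_by_model):
    """Pool the ensemble members of several NWP models (one initialisation)
    into a single multi-model (MM) ensemble, all members equally weighted."""
    return np.concatenate(list(members_by_model.values()))

def grand_combination(mm_by_init):
    """Pool the MM ensembles built with different reanalysis-driven river
    initialisations into one grand multi-model (GMM) ensemble."""
    return np.concatenate(list(mm_by_init.values()))

# toy example: discharge members (m3/s) for one station and one lead time
rng = np.random.default_rng(4)
models = {"ECMWF": rng.normal(500, 40, 51), "UKMO": rng.normal(520, 50, 24),
          "NCEP": rng.normal(480, 60, 21), "CMA": rng.normal(510, 55, 15)}
mm_one_init = multi_model(models)                 # MM for one initialisation
mm_by_init = {"init_a": mm_one_init,
              "init_b": mm_one_init + rng.normal(0, 10, mm_one_init.size),
              "init_c": mm_one_init + rng.normal(0, 10, mm_one_init.size)}
gmm = grand_combination(mm_by_init)
print(mm_one_init.size, gmm.size)
```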

In the future we plan to extend this study to address other aspects of building a skilful multi-model hydro-meteorological system. The following areas are considered:

(i) Include other data sets that provide global coverage of runoff data at a high enough horizontal resolution, such as the Japanese 55-year Reanalysis (JRA-55, Kobayashi et al. 2015) or the NCEP Climate Forecast System Reanalysis (CFSR, Saha et al. 2010), to provide further improvements in the initial river condition estimates.

(ii) Introduce a multi-hydrology aspect by adding an additional land-surface model such as JULES (the Joint UK Land Environment Simulator, Best et al. 2011).

(iii) The scores presented in this study are relatively low even with the post-processing methods applied. In order to achieve significantly higher overall scores, the information in the discharge observations should be utilised in the modelling.

(iv) Similarly, the discharge quality could be significantly improved by better calibration of many of the watersheds in the CaMa-Flood routing.

(v) Alternatively, the application of different river routing schemes, such as Lisflood, which is currently used in GloFAS, would also provide a potential increase in skill through multi-model use.

(vi) Further analysis of the errors and the trialling of other post-processing methods could also lead to potential improvements. In particular, better allowance should be made for temporal correlation in the forecast errors. The use of the Extreme Forecast Index (Zsoter 2006) as a tool to compare the forecasts to the model climate could potentially bring added skill to the flood predictions.

6 Acknowledgements

We are thankful to the European Commission for funding this study through the GEOWOW project in the 7th Framework Programme for Research and Technological Development (FP7/2007-2013) under Grant Agreement no. 282915. We are also grateful to the Global Runoff Data Centre, 56068 Koblenz, Germany, for providing the discharge observation dataset for our discharge forecast analysis.

7 References

Agustí-Panareda, A., G. Balsamo, and A. Beljaars, 2010: Impact of improved soil moisture on the ECMWF precipitation forecast in West Africa, Geophys. Res. Lett., 37, L20808, doi:10.1029/2010GL044748.

Ajami, N. K., Q. Duan, and S. Sorooshian, 2007: An integrated hydrologic Bayesian multimodel combination framework: Confronting input, parameter, and model structural uncertainty in hydrologic prediction, Water Resour. Res., 43, W01403, doi:10.1029/2005WR004745.

Alfieri, L., P. Burek, E. Dutra, B. Krzeminski, D. Muraro, J. Thielen, and F. Pappenberger, 2013: GloFAS - global ensemble streamflow forecasting and flood early warning, Hydrology and Earth System Sciences, 17, 1161–1175, doi:10.5194/hess-17-1161-2013.

Balsamo, G., A. Beljaars, K. Scipal, P. Viterbo, B. van den Hurk, M. Hirschi, and A. K. Betts, 2009: A Revised Hydrology for the ECMWF Model: Verification from Field Site to Terrestrial Water Storage and Impact in the Integrated Forecast System, J. Hydrometeor., 10, 623–643, doi:10.1175/2008JHM1068.1.

Balsamo, G., S. Boussetta, P. Lopez, and L. Ferranti, 2010: Evaluation of ERA-Interim and ERA-Interim-GPCP-rescaled precipitation over the U.S.A., ERA Report Series 01/2010, 10 pp., European Centre for Medium-Range Weather Forecasts, Reading, UK.

Balsamo, G., F. Pappenberger, E. Dutra, P. Viterbo, and B. van den Hurk, 2011: A revised land hydrology in the ECMWF model: a step towards daily water flux prediction in a fully closed water cycle, Hydrol. Proc., 25, 1046–1054, doi:10.1002/hyp.7808.

Balsamo, G., and Coauthors, 2015: ERA-Interim/Land: a global land surface reanalysis data set, Hydrol. Earth Syst. Sci., 19, 389–407, doi:10.5194/hess-19-389-2015.

Bao, H. H., and O. Zhao, 2012: Development and application of an atmospheric-hydrologic-hydraulic flood forecasting model driven by TIGGE ensemble forecasts, Acta Meteorologica Sinica, 26, 93–102.

Best, M. J., and Coauthors, 2011: The Joint UK Land Environment Simulator (JULES), model description – Part 1: Energy and water fluxes, Geosci. Model Dev., 4, 677–699, doi:10.5194/gmd-4-677-2011.

Bougeault, P., and Coauthors, 2010: The THORPEX Interactive Grand Global Ensemble, Bull. Amer. Meteor. Soc., 91, 1059–1072.

Candille, G., and O. Talagrand, 2005: Evaluation of probabilistic prediction systems for a scalar variable, Q. J. R. Meteorol. Soc., 131, 2131–2150, doi:10.1256/qj.04.71.

Cane, D., S. Ghigo, D. Rabuffetti, and M. Milelli, 2013: The THORPEX Interactive Grand Global Ensemble: The case study of the May 2008 flood in western Piemonte, Italy, Nat. Hazards Earth Syst. Sci., 13, 211–220.

Cloke, H. L., and F. Pappenberger, 2009: Ensemble flood forecasting: A review, J. Hydrology, 375, 613–626, doi:10.1016/j.jhydrol.2009.06.005.

Coccia, G., and E. Todini, 2011: Recent developments in predictive uncertainty assessment based on the model conditional processor approach, Hydrol. Earth Syst. Sci., 15, 3253–3274, doi:10.5194/hess-15-3253-2011.

Dee, D. P., and Coauthors, 2011: The ERA-Interim reanalysis: configuration and performance of the data assimilation system, Q. J. R. Meteorol. Soc., 137, 553–597, doi:10.1002/qj.828.

Demargne, J., and Coauthors, 2014: The science of NOAA's Operational Hydrologic Ensemble Forecast Service, Bull. Amer. Meteor. Soc., 95, 79–98, doi:10.1175/BAMS-D-12-00081.1.

Donnelly, C., J. C. M. Andersson, and B. Arheimer, 2015: Using flow signatures and catchment similarities to evaluate the E-HYPE multi-basin model across Europe, Hydrological Sciences Journal, doi:10.1080/02626667.2015.1027710.

Dong, L., L. Xiong, and K.-x. Yu, 2013: Uncertainty Analysis of Multiple Hydrologic Models Using the Bayesian Model Averaging Method, J. Appl. Math., 2013, 11 pp., doi:10.1155/2013/346045.

Dutra, E., C. Schär, P. Viterbo, and P. M. A. Miranda, 2011: Land-atmosphere coupling associated with snow cover, Geophys. Res. Lett., 38, L15707, doi:10.1029/2011GL048435.

Emerton, R. E., E. M. Stephens, F. Pappenberger, T. C. Pagano, A. H. Weerts, A. W. Wood, P. Salamon, J. D. Brown, N. Hjerdt, C. Donnelly, C. A. Baugh, and H. L. Cloke, 2016: Continental and Global Scale Flood Forecasting Systems, WIREs Water, doi:10.1002/wat2.1137.

Fraley, C., A. E. Raftery, and T. Gneiting, 2010: Calibrating Multimodel Forecast Ensembles with Exchangeable and Missing Members Using Bayesian Model Averaging, Mon. Wea. Rev., 138, 190–202, doi:10.1175/2009MWR3046.1.

Gneiting, T., and M. Katzfuss, 2014: Probabilistic Forecasting, Annual Review of Statistics and Its Application, 1, 125–151.

Haddeland, I., and Coauthors, 2011: Multimodel Estimate of the Global Terrestrial Water Balance: Setup and First Results, J. Hydrometeor., 12, 869–884, doi:10.1175/2011JHM1324.1.

Hagedorn, R., R. Buizza, T. M. Hamill, M. Leutbecher, and T. N. Palmer, 2012: Comparing TIGGE multimodel forecasts with reforecast-calibrated ECMWF ensemble forecasts, Q. J. R. Meteorol. Soc., 138, 1814–1827, doi:10.1002/qj.1895.

He, Y., F. Wetterhall, H. L. Cloke, F. Pappenberger, M. Wilson, J. Freer, and G. McGregor, 2009: Tracking the uncertainty in flood alerts driven by grand ensemble weather predictions, Meteor. Appl., 16, 91–101, doi:10.1002/met.132.

He, Y., F. Wetterhall, H. Bao, H. L. Cloke, Z. Li, F. Pappenberger, Y. Hu, D. Manful, and Y. Huang, 2010: Ensemble forecasting using TIGGE for the July-September 2008 floods in the Upper Huai catchment: a case study, Atm. Sci. Lett., 11, 132–138, doi:10.1002/asl.270.

Hersbach, H., 2000: Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems, Wea. Forecasting, 15, 559–570, doi:10.1175/1520-0434(2000)015<0559:DOTCRP>2.0.CO;2.

Huffman, G. J., R. F. Adler, D. T. Bolvin, and G. Gu, 2009: Improving the Global Precipitation Record: GPCP Version 2.1, Geophys. Res. Lett., 36, doi:10.1029/2009GL040000.

Kobayashi, S., and Coauthors, 2015: The JRA-55 Reanalysis: General Specifications and Basic Characteristics, Journal of the Meteorological Society of Japan, 93, 5–48, doi:10.2151/jmsj.2015-001.

Krishnamurti, T. N., C. M. Kishtawal, T. E. LaRow, D. R. Bachiochi, Z. Zhang, C. E. Williford, S. Gadgil, and S. Surendran, 1999: Improved weather and seasonal climate forecasts from multimodel superensemble, Science, 285, 1548–1550.

Krzysztofowicz, R., 1997: Transformation and normalization of variates with specified distributions, J. Hydrol., 197, 286–292.

Liang, Z., D. Wang, Y. Guo, Y. Zhang, and R. Dai, 2013: Application of Bayesian Model Averaging Approach to Multimodel Ensemble Hydrologic Forecasting, J. Hydrol. Eng., 18, 1426–1436.

Liu, Y., Q. Duan, L. Zhao, A. Ye, Y. Tao, C. Miao, X. Mu, and J. C. Schaake, 2013: Evaluating the predictive skill of post-processed NCEP GFS ensemble precipitation forecasts in China's Huai river basin, Hydrological Processes, 27, 57–74, doi:10.1002/hyp.9496.

Olsson, J., and G. Lindström, 2008: Evaluation and calibration of operational hydrological ensemble forecasts in Sweden, J. Hydrol., 350, 14–24, doi:10.1016/j.jhydrol.2007.11.010.

Pappenberger, F., J. Bartholmes, J. Thielen, H. L. Cloke, R. Buizza, and A. de Roo, 2008: New dimensions in early flood warning across the globe using grand-ensemble weather predictions, Geophys. Res. Lett., 35, L10404, doi:10.1029/2008GL033837.

Raftery, A. E., T. Gneiting, F. Balabdaoui, and M. Polakowski, 2005: Using Bayesian Model Averaging to Calibrate Forecast Ensembles, Mon. Wea. Rev., 133, 1155–1174, doi:10.1175/MWR2906.1.

Reichle, R. H., R. D. Koster, G. J. M. De Lannoy, B. A. Forman, Q. Liu, S. P. P. Mahanama, and A. Toure, 2011: Assessment and enhancement of MERRA land surface hydrology estimates, Journal of Climate, 24, 6322–6338, doi:10.1175/JCLI-D-10-05033.1.

Schefzik, R., L. T. Thorarinsdottir, and T. Gneiting, 2013: Uncertainty Quantification in Complex Simulation Models Using Ensemble Copula Coupling, Statistical Science, 28, 616–640, doi:10.1214/13-STS443.

Saha, S., and Coauthors, 2010: The NCEP Climate Forecast System Reanalysis, Bulletin of the American Meteorological Society, 91, 1015–1057, doi:10.1175/2010BAMS3001.1.

Thielen, J., J. Bartholmes, M. H. Ramos, and A. de Roo, 2009: The European Flood Alert System - Part 1: Concept and development, Hydrology and Earth System Sciences, 13, 125–140.

Todini, E., 2008: A model conditional processor to assess predictive uncertainty in flood forecasting, Intern. J. River Basin Manag., 6, 123–137, doi:10.1080/15715124.2008.9635342.

Van Der Knijff, J. M., J. Younis, and A. P. J. De Roo, 2010: LISFLOOD: a GIS-based distributed model for river basin scale water balance and flood simulation, International Journal of Geographical Information Science, 24, 189–212, doi:10.1080/13658810802549154.

Velázquez, J. A., F. Anctil, M. H. Ramos, and C. Perrin, 2011: Can a multi-model approach improve hydrological ensemble forecasting? A study on 29 French catchments using 16 hydrological model structures, Adv. Geosci., 29, 33–42, doi:10.5194/adgeo-29-33-2011.

Vrugt, J. A., and B. A. Robinson, 2007: Treatment of uncertainty using ensemble methods: Comparison of sequential data assimilation and Bayesian model averaging, Water Resour. Res., 43, W01411, doi:10.1029/2005WR004838.

Yamazaki, D., S. Kanae, H. Kim, and T. Oki, 2011: A physically based description of floodplain inundation dynamics in a global river routing model, Water Resour. Res., 47, W04501, doi:10.1029/2010WR009726.

Zsoter, E., 2006: Recent developments in extreme weather forecasting, ECMWF Newsletter, pp. 8–17, Reading, United Kingdom.

Figure 1 Location of discharge observing stations that could be processed in the discharge experiments. The blue points are used in both the reanalysis (at least 15 years of data available in 1981-2010) and the TIGGE forecast experiment (at least about 80% of days available in August 2008 to May 2010) evaluation. The yellow points provide enough observations only for the reanalysis, while the red points have enough data available only for the TIGGE forecasts.

Figure 2 Example of discharge produced by ERAI-Land (red), ERAI (green) and MERRA-Land (blue) forcing and the corresponding observations (black) for a GRDC station on the Rainy River at Manitou Rapids in the US.

Figure 3 Impact of different forcing configurations in HTESSEL on the discharge outputs as a relative change compared to the baseline (see Section 3.1). The black dashed line displays the impact of changing the lowest model level (LML) to surface forcing (2m for temperature and humidity, 10m for wind). The coloured lines highlight the impact of replacing different ensemble control forcing variables, either individually or in combination, with ERAI-Land data.

Figure 4 Surface runoff output of HTESSEL for the period 1-10 January 2012 (240-hour accumulation) from two ensemble control experiments, one using (a) surface forcing and (b) where possible the lowest model level forcing. In the surface forcing experiment very large erroneous surface runoff values appear in very cold winter conditions.

Figure 5 Historical discharge forecast performance for ERAI-Land, ERAI, MERRA-Land and their equal-weight multi-model average (MMA). Mean Absolute Error Skill Score (MAESS) and sample Pearson correlation (CORR) are provided for each continent (North America, South America, Europe, Africa, Asia and Australia & Indonesia). The reference forecast system in the skill score is the observed discharge climate used as a daily prediction. The correlation values are also provided for the observed climate. The scores are continental and global averages of the individual scores of the available stations (for station locations see Figure 1).

Figure 6 Relative improvements in the sample Pearson correlation (CORR) from the equal-weight average of the ERAI-Land, ERAI and MERRA-Land discharges. Values show the change in correlation compared with the best of ERAI-Land, ERAI and MERRA-Land. Positive values show improvement, while negative change means lower skill in the average than in the best of the three historical discharges.

Figure 7 Example of different discharge forecast products for the GRDC station of Lobith on the Rhine river in the Netherlands. All forecasts are from the 00 UTC run of 18 April 2009 up to T+240h. The following products are plotted: multi-model combinations of four TIGGE models (ECMWF, UKMO, NCEP and CMA) with ERAI-Land (solid red lines), ERAI (solid green lines) and MERRA-Land (solid blue lines) initialisations; 30-day corrected (dash-dotted lines), initial time corrected (dashed lines) and 30-day and initial time corrected (dotted lines) versions of the three multi-model combinations, each with all three initialisations (with the respective colours); and finally the Bayesian Model Averaging versions of the three multi-model combinations (all with grey lines, only from T+24h). The verifying observations are displayed by the black line.

Figure 8 Discharge forecast performance for forecast ranges of T+0h to T+240h for the period of August 2008 to May 2010 as global averages of CRPSS (computed at each station over the whole period) for the following forecast products: the four TIGGE models (ECMWF, UKMO, NCEP and CMA) with ERAI-Land initialisation (grey lines); the multi-model combination of these four models with ERAI-Land (red line), ERAI (green line) and MERRA-Land (blue line) initialisation; a grand combination of these three multi-models (orange line); and grand combinations for six post-processed products: the multi-model of the 30-day correction (dashed burgundy line), the initial-error correction (dashed purple line with markers), the 30-day and initial-error correction combination (solid burgundy line with markers), two Bayesian model averaging (BMA) versions of the multi-model, one with the uncorrected forecasts (black line without markers) and one with the uncorrected forecasts extended by the persistence as predictor (black line with circles), and finally the persistence forecast (dashed light blue line with circles). The CRPSS is positively oriented and has a perfect value of 1. The 0 value line represents the quality of the reference system, the daily observed discharge climate.

Figure 9 Distribution of the skill increments over all stations provided by six combination and post-processing products for two time ranges, T+24h (a) and T+240h (b). The x axis shows the reference skill, the average CRPSS of the four TIGGE models with the ERAI-Land initialisation, while the y axis displays the CRPSS of the post-processed forecasts at the stations. The six products are the MM combination of the four models with ERAI-Land (red circles), the GMM combination of the three uncorrected MM-s (green triangles), and the GMM combination of four post-processed products: the 30-day corrected MM-s (orange triangles), the initial time corrected MM-s (cyan squares), the combined 30-day and initial time corrected MM-s (purple stars) and finally the BMA transformed MM-s of the uncorrected forecasts (blue stars), where all the MM-s are the three MM-s with the different initialisations. The diagonal line represents no skill improvement; above this line the six products are better, while below it they are worse than the reference. The CRPSS values are computed based on the period of August 2008 to May 2010. Some stations with reference CRPSS below -5 are not plotted.

Figure 10 Global CRPSS distribution of the highest quality post-processed product at T+240h, the grand multi-model combination of the BMA transformed uncorrected forecasts, based on the August 2008 to May 2010 period. The CRPSS is positively oriented and has a perfect value of 1. The 0 value represents the quality of the reference system, the daily observed discharge climate.

Table 1 Summary description of the sensitivity experiments with the ECMWF ensemble control forecasts. The baseline is the reference run with the lowest model level (LML) for wind, temperature and humidity forcing. The other experiments apply different changes to the forcing variables: first the LML is changed to surface (Surf), then different variables of the ensemble control and their combinations are substituted by ERAI-Land data. Entries prefixed EC denote ensemble control forcing input, while entries prefixed ERAI denote substituted ERAI-Land input.

Sensitivity experiment   Rad        THP        Wind       Prec
Baseline                 EC-Surf    EC-LML     EC-LML     EC-Surf
Surf vs. LML             EC-Surf    EC-Surf    EC-Surf    EC-Surf
Radiation                ERAI-Surf  EC-LML     EC-LML     EC-Surf
Wind                     EC-Surf    EC-LML     ERAI-LML   EC-Surf
THP                      EC-Surf    ERAI-LML   EC-LML     EC-Surf
Rad/THP                  ERAI-Surf  ERAI-LML   EC-LML     EC-Surf
Rad/THP/Wind             ERAI-Surf  ERAI-LML   ERAI-LML   EC-Surf
Rad/THP/Wind/Prec        ERAI-Surf  ERAI-LML   ERAI-LML   ERAI-Surf

Table 2 Detailed evaluation of the discharge sensitivity experiments at the T+240h range for different areas and periods. Average relative discharge differences (%) are shown after replacing ensemble control forcing variables, either individually or in combination, by ERAI-Land, and also after replacing the lowest model level (LML) with surface forcing (2m for temperature and humidity, 10m for wind). The whole globe, the Northern Extratropics (defined here as 35-70N) and the Tropics (30S-30N), as well as the specific seasons, are displayed.

Average difference (%)   Surf vs. LML  THP   Wind  Rad+THP+Wind  Rad+THP+Wind+Prec  Rad
Global                   1.0           2.4   0.6   2.9           15.6               0.7
N. Extratropics          1.0           3.1   0.6   3.5           12.8               0.8
N. Extratropics JJA      0.5           0.6   0.2   0.9           13.0               0.3
N. Extratropics DJF      1.1           2.9   0.7   3.6           9.5                0.8
Tropics                  1.0           1.1   0.5   1.8           17.6               0.5
Tropics JJA              0.9           0.9   0.4   1.5           15.0               0.5
Tropics DJF              1.2           1.3   0.6   2.1           18.7               0.6
