An efficient rounding-off rule estimator: Application to daily rainfall time


WATER RESOURCES RESEARCH, VOL. 43, W12405, doi:10.1029/2006WR005409, 2007


An efficient rounding-off rule estimator: Application to daily rainfall time series

Roberto Deidda¹

Received 7 August 2006; revised 22 February 2007; accepted 7 June 2007; published 15 December 2007.

[1] An overview of problems and errors arising when fitting parametric distributions and applying goodness of fit tests on samples containing roughly rounded off measurements is first illustrated. The paper then presents the rounding-off rule estimator (RRE), an original method that allows the estimation of the percentages of rainfall measurements that have been rounded at some potential resolutions. The efficiency of the RRE is evaluated using a wide set of samples drawn by different distributions and rounded off according to different rounding rules. Finally, the RRE is applied on 340 daily rainfall time series collected by the rain gauge network of the Sardinian Hydrological Survey (Italy). In most stations, results revealed the presence of significant percentages of roughly rounded-off measurements, even at 1 and 5 mm resolutions, rather than at the standard 0.1 or 0.2 mm discretization. The application of the proposed RRE may give important support to perform quality data analyses, to assess and discriminate methods to fit parametric distributions on rounded-off samples, and to detect if and how the precision of recorded measurements might have changed in long time series used for climatic change studies.

Citation: Deidda, R. (2007), An efficient rounding-off rule estimator: Application to daily rainfall time series, Water Resour. Res., 43, W12405, doi:10.1029/2006WR005409.

1. Introduction

[2] The analysis of long time series of observations often requires taking into account the different sampling rules throughout the considered period. This is a well-known problem when dealing with paleoclimatic time series, where the older data are usually estimated by indirect measurement of other variables or proxies, while the more recent data are recorded by dedicated instruments. In these cases, series can be easily split into records of old measures where data may assume discrete and/or very approximate values (which are often censored and sampled at irregular time intervals) and records of more recent measures collected at regular sampling times with appropriate resolutions to correctly describe the process of interest. To take into account these and other kinds of inhomogeneities in the sampling of data, methods have been developed and widely applied [e.g., see Martins and Stedinger, 2001; Brunetti et al., 2006, and references therein].

[3] When dealing with time series recorded in the last century, such as daily rainfall collected by rain gauges, one expects to analyze records with an appropriate resolution. For instance, the standard resolution of daily rainfall measurements taken by the Hydrological Surveys in Italy, and also in many other countries, should be 0.1 or 0.2 mm/d. Unfortunately, as will be shown in section 4 of this paper, a systematic analysis conducted on 340 time series collected between 1922 and 1980 by the Hydrological Survey of the Sardinia Region (Italy) revealed a different situation. Only 33 series (i.e., less than 10% of the whole data set) were correctly recorded at the standard discretization of 0.1 or 0.2 mm/d, while in most stations many values were rounded off at larger resolutions, often at 1 or 5 mm/d and sometimes even at 10 mm/d. In 160 time series the percentage of records rounded off at 5 mm resolution ranges between 10% and 40% of the whole record.

¹ Dipartimento di Ingegneria del Territorio, Università di Cagliari, Cagliari, Italy.

Copyright 2007 by the American Geophysical Union. 0043-1397/07/2006WR005409$09.00

[4] The reason for the presence of such roughly rounded-off values should be sought in the way in which daily rainfall depths were measured in the analyzed period. Until the 1980s, in Italy, as in many other countries, rain gauge networks were set up mainly with two types of instruments: many nonrecording standard rain gauges (that require daily manual measurement by an entrusted person) and very few recording gauges (mainly tipping buckets that trace rainfall signals on weekly strip charts). Only after the 1980s were rain gauge networks progressively updated with the installation of automatically recording rain gauges that store rainfall information directly in memory chips and are often able to transmit, in real time, the rainfall evolution to a central office.

[5] Recording tipping bucket rain gauges have been adopted since the beginning of the 1900s by the Italian and many other Hydrological Surveys. These devices trace vertical tips, each one corresponding to 0.2 mm rainfall depth, on rotating strip charts. Daily rainfall depths are then estimated on these paper charts by specialized staff, thus time series derived from these kinds of stations are usually correctly discretized at 0.2 mm resolution.

[6] Anomalous percentages of roughly rounded-off records are instead present in most time series collected by nonrecording standard rain gauges. These devices


DEIDDA: AN EFFICIENT ROUNDING-OFF RULE ESTIMATOR

consist of a cylindrical collector above a funnel leading to a receiver. Italian and many other European standard gauges have a 0.1 m² orifice [Tonini, 1959], thus 1 L of stored water corresponds to 10 mm of rainfall depth. Daily measurements are then obtained by pouring the catch from the receiver into measuring glasses. The Sardinian Hydrological Survey adopted three sizes of glasses: 1 L, 1/2 L, and 1 dL capacity, corresponding to 10 mm, 5 mm, and 1 mm rainfall depth respectively. A truncated measure of daily rainfall depth (in millimeters) can be obtained by counting how many times the 1-L glass can be fully filled, and then how many times the water remaining in the last incompletely filled glass allows fully filling the 1/2-L and the 1-dL glasses. Finally, inserting a calibrated measuring stick in the last incompletely filled 1-dL glass allows the estimation of how many tenths of a millimeter should be added to the previous measure. Thus the standard discretization of these kinds of data should be 0.1 mm.

[7] These very simple, but delicate, measurements were not carried out by Hydrological Survey staff, but by people working or living in places close to the rain gauge location. They were entrusted with the daily measuring operations described above, and had to annotate the number of fully filled glasses of each type, as well as the stick measure, on registers that today remain the only information we have about the rainfall that occurred in the past. Annotations were then converted by Hydrological Survey staff into daily rainfall depths that were published in yearly reports (Annali Idrologici).

[8] We suppose that the anomalous concentrations of roughly rounded-off values may be attributed to the limited care taken by these entrusted people in the delicate measuring operations required to estimate daily rainfall depths. Rounded-off values at 5 mm resolution may be due to the use of the first two glasses only. Instead, an anomalous concentration of ties at multiples of 1 mm means that all three glasses were used, but the final estimation by the stick was not carried out.

[9] Standard discretization of 0.1 or 0.2 mm resolution, usually prescribed for rainfall measurements, should be fine enough to allow parametric distributions to be accurately fitted without the need to apply estimation methods for discrete data (such as, e.g., the binned maximum likelihood). Nevertheless, even the standard discretization may be a source of errors for some kinds of statistical analysis, such as the moment scaling function estimation [Harris et al., 1997].

[10] However, when significant percentages of records are rounded off at resolutions larger than the standard discretization, as detected in many of the time series analyzed here, even errors in fitting parametric distributions and in applying goodness of fit tests become particularly relevant and cannot be disregarded, as described in section 2 of this paper. The knowledge of the rounding-off rules, i.e., the percentage of records rounded off at different resolutions, is fundamental to overcome some of these problems, such as the derivation of the proper percentage points for statistical tests aimed at evaluating the goodness of fitted distributions [Deidda and Puliga, 2006]. Nevertheless, the empirical determination of the rounding-off rules may be a difficult task when dealing with databases containing many stations, and, moreover, it may be affected by subjective


judgment. Thus, in this paper, we develop an objective, statistically based method to estimate the rounding-off rules in any time series. The derivation of this rounding-off rule estimator (RRE) is presented in Appendix A, while its performances are evaluated and discussed in section 3. Section 4 is devoted to showing the results of a systematic application of the proposed estimator on 340 time series of daily rainfall depths in our database. In section 5, conclusions are drawn and potential applications of the RRE are briefly discussed.
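The glass-based procedure of paragraphs [6] to [8] can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the Survey's actual protocol: the function name `measure_tenths` and its two flags are hypothetical, depths are handled as integer tenths of a millimeter, and every step truncates, as described in the text.

```python
def measure_tenths(depth_tenths, use_decilitre_glass=True, use_stick=True):
    """Return the recorded depth (tenths of mm) for a true depth in tenths of mm.

    Hypothetical helper: 1-L glass = 10 mm, 1/2-L glass = 5 mm,
    1-dL glass = 1 mm, measuring stick = 0.1 mm steps.
    """
    full_litres, rest = divmod(depth_tenths, 100)   # 1-L glasses, 10 mm each
    half_litres, rest = divmod(rest, 50)            # 1/2-L glasses, 5 mm each
    decilitres, stick = divmod(rest, 10)            # 1-dL glasses, 1 mm each
    recorded = full_litres * 100 + half_litres * 50
    if use_decilitre_glass:
        recorded += decilitres * 10
        if use_stick:
            recorded += stick                       # tenths read on the stick
    return recorded


true_depth = 287  # 28.7 mm of rainfall
careful = measure_tenths(true_depth)
no_stick = measure_tenths(true_depth, use_stick=False)
two_glasses = measure_tenths(true_depth, use_decilitre_glass=False)
print(careful / 10, no_stick / 10, two_glasses / 10)  # prints 28.7 28.0 25.0
```

Skipping the stick leaves values at multiples of 1 mm, and using only the first two glasses leaves values at multiples of 5 mm, which are the two kinds of ties discussed above.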

2. Undesirable Effects Arising With Rounded-off Records

[11] The aim of this section is to highlight and comment on some undesirable effects arising in fitting parametric distributions and applying goodness of fit tests on samples containing roughly rounded-off records. We refer to the real cases of two daily rainfall time series: the first one (recorded by station 007) is a badly discretized time series, since a high percentage of data were rounded off at 1 mm and 5 mm resolution, while the second one (recorded by station 235) is a quasi-perfect discrete time series with most values rounded at 0.1 or 0.2 mm. In the following, we will call perfect discrete time series only those series in which all records are rounded at 0.1 mm. In detail, the application of the rounding-off rule estimator, derived in Appendix A, revealed that the 44-year-long time series recorded by the bad station 007 contains 37% of values rounded off at 0.1 mm resolution, 24% rounded at 1 mm, and 39% rounded at 5 mm. The same analysis performed on the quasi-perfect time series (50 years long), retrieved by station 235, determined that 85% of data were rounded at 0.1 or 0.2 mm, while the remaining 15% were rounded at 1 mm.

2.1. Fitting Parametric Distributions

[12] Time series of daily rainfall data usually contain zero and nonzero values. Thus several modeling approaches first separate rainy from not rainy data, and then separately deal with the distribution of rainfall occurrences (i.e., the succession of wet and dry periods) and the distribution of rainfall values in rainy days. Often, all strictly positive rainfall data are then used to fit a distribution of rainy values. If this is the case, the following general equation F(x) can be assumed as the cumulative distribution function (CDF) of rainy and not rainy values x on each time series:

$$F(x) = (1 - \zeta_0) + \zeta_0 F_0(x), \qquad x \ge 0 \qquad (1)$$

where ζ_0 = Pr{X > 0} represents the probability of occurrence of rainy days, while F_0(x) = Pr{X ≤ x | X > 0} is the CDF of only rainy values.

[13] Nevertheless, it would be advisable to carefully analyze the smallest record values before fitting F_0(x) using all strictly positive rainfall data. Indeed, the distribution of very small values may not be clearly defined: small values may be due to dew processes rather than being true rainfall, there may be effects of subjective rounding and errors, and, whatever the cause may be, there is empirical evidence that small values often depart from the distribution of higher rainfall values. The generalized Pareto distribution [Pickands, 1975] explicitly requires one to put a threshold on


data and to infer parameter values only on records exceeding the threshold. In any case, it would be a good rule to apply the same approach whatever distribution is a candidate to describe daily rainfall, and to choose a threshold value that allows the inferred distribution to be accepted, for instance by applying a goodness of fit test.

Figure 1. Simple moments and L moments of excess y over thresholds u: y = x − u, where x > u are left-censored daily rainfall records. Mean μ, coefficient of variation CV, and L moment coefficient of variation L-CV of y are plotted versus u for (left) station 007 and (right) station 235.

[14] Adopting this last approach, once the threshold u has been selected, we can fit a CDF F_u(y) on the (strictly positive) exceedances y = x − u, where x is the original sample of daily rainfall. The fitted distribution F_u(y) then allows quantiles of daily rainfall (in the range x > u) to be computed as x = u + F_u^{-1}(P), where P = Pr{Y ≤ y | Y > 0} = Pr{X ≤ x | X > u}, and F_u^{-1} is the inverse function of F_u.

[15] Let us consider the case that an estimation method based on the simple moments (SM) or on the probability weighted moments (PWM) is applied to infer parameter values of any distribution F_u(y) chosen as a candidate for modeling the excesses y. If the chosen distribution may correctly describe only data exceeding a nonzero threshold u, moments or L moments have to be computed only on the (strictly positive) exceedances y = x − u. As an example, for the two time series selected above, Figure 1 shows the mean, the coefficient of variation CV, and the L moment coefficient of variation L-CV of the exceedances y computed for left-censoring thresholds u ranging from 0 to 10 mm. For the badly discretized time series (left plots in Figure 1) we can observe how the mean of excesses y decreases when u ranges from 0 to 5 mm; for u = 5 mm there is a jump where the mean again assumes a high value; then the mean decreases again for u ranging from 5 to 10 mm. Although not shown, the mean decreases with the same behavior in each following 5-mm interval, while other jumps are localized at u = 10 mm, 15 mm, 20 mm, etc. Similar comments hold for the CV and L-CV statistics computed on y for different left-censoring thresholds u, but these statistics are now increasing in each 5-mm interval of thresholds u. The repetition of the described patterns is an effect of the large percentage of values rounded at 5 mm resolution. Indeed, if

we look at moments and L moments computed on the quasi-perfect discrete time series (right plots in Figure 1) we can observe a regular behavior for thresholds u larger than about 2–3 mm: the mean increases linearly with u, while CV and L-CV become nearly constant, and jumps at multiples of 5 mm are no longer present. Looking in more detail, also in this case we can observe small jumps repeated for u multiples of 1 mm: they are due to the presence of about 15% of values rounded off at 1 mm resolution. In any case, the spread remains very limited since most of the data are correctly discretized. Although not shown in Figure 1, similar behaviors can be observed computing higher moments and L moments that are needed to fit distributions with more than two parameters.

[16] Using the sample moments or L moments shown in Figure 1 to fit any one- or two-parameter distribution leads to a consequent spread in the parameter estimates. As an example, for the two considered time series, we show in Figure 2 the estimates of the parameters of the following generalized Pareto distribution (GPD):

$$F_u(x; \alpha_u, \xi) = \Pr\{X \le x \mid X > u\} = \begin{cases} 1 - \left(1 + \xi\,\dfrac{x - u}{\alpha_u}\right)^{-1/\xi}, & \xi \ne 0 \\[2ex] 1 - \exp\left(-\dfrac{x - u}{\alpha_u}\right), & \xi = 0 \end{cases} \qquad (2)$$

where ξ is the shape parameter, α_u the scale parameter, while u is the threshold value.

[17] Although u is often referred to as the "position" or "location" parameter, it cannot be considered a true distribution parameter. Indeed, it is used to left-censor sample x before fitting equation (2), and thus it should be fixed a priori. This is also the reason why the maximum likelihood (ML) estimation method does not allow u to be estimated (indeed we cannot maximize a likelihood function for a sample where some values may be excluded by the left-censoring threshold u itself), and why the SM


Figure 2. Parameters ξ and α_0 of the generalized Pareto distribution F_0(x) in equation (2): SM, PWM, and ML estimates obtained using different thresholds u are compared. Shown are data from (left) station 007 and (right) station 235.

and PWM methods require only the first two moments or L moments, computed on the excesses y, to estimate ξ and α_u.

[18] In order to estimate the shape ξ and the scale α_u parameters of the generalized Pareto distribution, we applied the SM and PWM methods on the basis of the expressions reported by Hosking and Wallis [1987] and Stedinger et al. [1993]. The ML estimates of the same parameters were obtained by the optimized univariate approach proposed by Grimshaw [1993]. Note that the sign adopted for the shape parameter in the works referred to above is opposite with respect to equation (2).

[19] The ξ parameter controls the tail behavior of the distribution. For ξ = 0 the distribution has the ordinary exponential form. For ξ > 0 the distribution has a long right tail, thus it is often referred to as a "heavy tailed distribution": in this case, simple moments of order greater than or equal to 1/ξ are degenerate, thus for ξ < 1/2 the mean and variance do not degenerate and allow fitting equation (2) with the SM method. For ξ < 0 the distribution is short tailed with an upper bound value (u − α_u/ξ).

[20] The generalized Pareto distribution has a very important property: if a sample can be reasonably considered drawn from a GPD F_{u_0}(x) with threshold u_0 and parameters ξ and α_{u_0}, then the excesses above any other threshold u > u_0 should also follow the GPD in equation (2) with the same shape parameter ξ and a scale parameter α_u given by the following equation [Coles, 2001, p. 83]:

$$\alpha_u = \alpha_{u_0} + \xi\,(u - u_0) \qquad (3)$$

[21] Thus, once the GPD parameters ξ and α_u of F_u(x) in equation (2) are estimated on the excesses above any threshold u > u_0, equation (3) also allows one to reparameterize a generalized Pareto distribution F_0(x) that will perfectly overlap the fitted F_u(x) for any x > u. Such a distribution F_0(x) can be described by equation (2) with a threshold u = 0, the same ξ parameter estimated for F_u(x), and a scale parameter α_0 obtained for u_0 = 0 by equation (3), which can be rewritten as

$$\alpha_0 = \alpha_u - \xi\,u \qquad (4)$$
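As a numerical check of equations (2) to (4), the following sketch evaluates the GPD CDF and verifies that F_0, once conditioned on X > u, coincides with the distribution fitted above the threshold u. This is how we read the "overlap" stated in the text. The function names and the parameter values are hypothetical and chosen only for illustration.

```python
import math


def gpd_cdf(x, u, alpha_u, xi):
    """Equation (2): Pr{X <= x | X > u}, evaluated for x > u."""
    y = (x - u) / alpha_u
    if xi == 0.0:
        return 1.0 - math.exp(-y)
    base = 1.0 + xi * y
    if base <= 0.0:              # beyond the upper bound, possible only for xi < 0
        return 1.0
    return 1.0 - base ** (-1.0 / xi)


def scale_at_zero(alpha_u, xi, u):
    """Equation (4): alpha_0 = alpha_u - xi * u."""
    return alpha_u - xi * u


# Hypothetical estimates obtained above a threshold u = 5 mm.
xi, u, alpha_u = 0.1, 5.0, 8.5
alpha_0 = scale_at_zero(alpha_u, xi, u)

for x in (6.0, 20.0, 80.0):
    f0_x = gpd_cdf(x, 0.0, alpha_0, xi)
    f0_u = gpd_cdf(u, 0.0, alpha_0, xi)
    conditional = (f0_x - f0_u) / (1.0 - f0_u)    # F_0 restricted to X > u
    print(x, conditional, gpd_cdf(x, u, alpha_u, xi))
```

The two printed probabilities agree to rounding error for every x > u, which is the threshold-stability property expressed by equation (3).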

[22] The reparameterized distribution F_0(x) is able to describe all rainy values since it is defined for x > 0, although there may be departures for very small values x ∈ (0, u_0). Moreover, inserting F_0(x) in equation (1) allows modeling in a simple way the whole rainfall process, including the rainy and not rainy occurrences. Finally, we highlight that, by virtue of equation (4), both the ξ and α_0 parameters should be constant if they are estimated for distributions F_u(x) with any threshold u > u_0.

[23] Figure 2 shows the estimates of ξ and α_0 obtained by the simple moments (SM), the probability weighted moments (PWM), and the maximum likelihood (ML) methods for thresholds u ranging from 0 to 10 mm. The left plots of Figure 2 refer to the badly discretized time series and clearly show how the spread of moments and L moments observed in the left plots of Figure 1 is also reflected in the parameter estimates obtained by the SM and PWM methods. Moreover, it is apparent that the ML estimates, shown in the same plots, display a similar spread, revealing that the ML estimation method is also sensitive to the presence of ties. The spread of the ξ estimates for station 007 ranges from −0.3 to 0.3, with some differences among the considered fitting methods: it is clearly unlikely that by varying the left-censoring threshold u the behavior of the fitted distribution may oscillate from a bounded type (ξ < 0) to a heavy tailed one (ξ > 0). As a matter of fact (as shown in Figure 8 discussed in section 4), the distribution of data belonging to station 007 is very close to an exponential one (ξ ≈ 0). The oscillation of the distribution shape, when left-censoring data with different threshold values, is clearly an artificial effect that is driven by the large percentage of values rounded off at 5 mm resolution.
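The threshold scan behind Figures 1 and 2 can be sketched on synthetic data. The code below is an illustration under stated assumptions, not the procedure used for the figures: the parent is an exponential distribution with mean 8.7 mm, rounding is taken as "nearest multiple" of a resolution drawn with the percentages reported for station 007, and all function names are hypothetical. The SM and PWM formulas are the standard GPD moment and L-moment relations, written with the sign convention of equation (2).

```python
import random


def round_to(x, delta):
    """Round to the nearest multiple of delta (assumed rounding mechanism)."""
    return round(round(x / delta) * delta, 1)


def excess_stats(sample, u):
    """Mean, CV, and L-CV of the strictly positive excesses y = x - u."""
    y = sorted(x - u for x in sample if x > u)
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / (n - 1)
    b1 = sum(i * v for i, v in enumerate(y)) / (n * (n - 1))  # PWM b_1
    l2 = 2.0 * b1 - mean                                      # second L moment
    return mean, var ** 0.5 / mean, l2 / mean


def sm_fit(mean, cv):
    """Simple-moment GPD estimates (xi, alpha_u)."""
    xi = 0.5 * (1.0 - 1.0 / cv ** 2)
    return xi, mean * (1.0 - xi)


def pwm_fit(mean, lcv):
    """PWM (L-moment) GPD estimates (xi, alpha_u)."""
    xi = 2.0 - 1.0 / lcv
    return xi, mean * (1.0 - xi)


rng = random.Random(0)
rule = [(0.1, 0.37), (1.0, 0.24), (5.0, 0.39)]   # percentages reported for station 007
raw = [rng.expovariate(1.0 / 8.7) for _ in range(20000)]
deltas = rng.choices([d for d, _ in rule], weights=[p for _, p in rule], k=len(raw))
rounded = [round_to(x, d) for x, d in zip(raw, deltas)]

for u in (0.0, 2.5, 4.9, 5.0, 7.5):
    mean, cv, lcv = excess_stats(rounded, u)
    xi, alpha_u = pwm_fit(mean, lcv)
    alpha_0 = alpha_u - xi * u                   # equation (4)
    print(f"u={u:4.1f}  mean={mean:6.2f}  L-CV={lcv:5.3f}  xi={xi:6.3f}  alpha_0={alpha_0:6.2f}")
```

On the rounded sample the mean excess jumps upward when u passes 5 mm, and the fitted ξ drifts with u, whereas on the unrounded sample `raw` both estimators return ξ close to zero.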


Figure 3. Points (drawn using the Weibull plotting position) displaying the empirical distribution functions of daily rainfall depth collected by (left) station 007 and (right) station 235. Lines represent fitted GPD F_u(x) in equation (2) for different thresholds u (values from 2.5 to 12.5 mm with increment of 0.1 mm are used). Right vertical axes report some return periods T that were associated with exceedance probability P = 1 − 1/(365.25 T).
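The link between a return period T and a daily rainfall quantile can be sketched as follows. The sketch assumes that P = 1 − 1/(365.25 T) is applied to the distribution of all days in equation (1), which the caption does not state explicitly. The value of ζ_0 is hypothetical, while ξ and α_0 are the values quoted in the text for station 007; the function names are illustrative.

```python
import math


def gpd_quantile(p0, alpha_0, xi):
    """Inverse of F_0(x), i.e., equation (2) with u = 0."""
    if xi == 0.0:
        return -alpha_0 * math.log(1.0 - p0)
    return alpha_0 / xi * ((1.0 - p0) ** (-xi) - 1.0)


def daily_quantile(t_years, zeta_0, alpha_0, xi):
    """Daily depth with return period t_years, from equations (1) and (2)."""
    p = 1.0 - 1.0 / (365.25 * t_years)       # probability over all days
    p0 = (p - (1.0 - zeta_0)) / zeta_0       # equation (1) solved for F_0
    return gpd_quantile(p0, alpha_0, xi)


zeta_0 = 0.2                 # hypothetical probability of a rainy day
xi, alpha_0 = 0.02, 8.7      # values quoted in the text for station 007
for t_years in (2, 5, 10, 50):
    print(t_years, round(daily_quantile(t_years, zeta_0, alpha_0, xi), 1))
```

Because the quantile depends on ξ and α_0, the spread of estimates across thresholds translates directly into a spread of design values for a fixed T.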

[24] Let us now look at the right plots of Figure 2, which refer to the quasi-perfect discrete time series. We can observe that both the ξ and α_0 estimates become nearly constant, as expected from theoretical arguments, for any threshold larger than u_0 ≈ 2–3 mm. As already noticed in Figure 1, very small jumps are still present every 1 mm. Nevertheless, the spread of estimates remains very limited because of the small percentage of values rounded at 1 mm as well as the small rounding resolution itself.

[25] The size of errors induced by fitting parametric distributions to roughly rounded-off measurements is illustrated in Figure 3. Empirical cumulative distribution functions of daily rainfall depths recorded by stations 007 and 235 are compared with the generalized Pareto distributions fitted to left-censored samples with thresholds u in the range 2.5–12.5 mm (the ML estimates displayed in Figure 2 are used). For the badly discretized time series (left plot in Figure 3), a wide spread of fitted distributions is apparent. Moreover, depending on the applied threshold, the 5-year return period quantile provided by the fitted distribution may be larger than the largest observed value in 44 years of observations! Conversely, distributions fitted on the quasi-perfect discrete time series (right plot in Figure 3) are close to each other.

2.2. Goodness of Fit Tests

[26] Other undesirable effects arise when computing goodness of fit statistics on rounded-off records. In fact, the presence of ties may change the distribution of test statistics, thus using the percentage points derived from asymptotic distributions for continuous samples (usually available in tables of books and scientific papers) may lead to misinterpreting the goodness of fit test results [Deidda and Puliga, 2006]. To better clarify this aspect, we focus on the W² Cramér–von Mises and A² Anderson-Darling statistics. Both are empirical distribution function (EDF) statistics since they measure the discrepancy between the empirical distribution function and the tested cumulative distribution

function G(x), whose parameters may be known or unknown [Stephens, 1986]. For a sample of size n, W² and A² can be defined as

$$W^2 = \frac{1}{12n} + \sum_{i=1}^{n} \left[ G(x_i) - \frac{2i - 1}{2n} \right]^2 \qquad (5)$$

$$A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} (2i - 1) \left[ \log G(x_i) + \log\left(1 - G(x_{n+1-i})\right) \right] \qquad (6)$$

where x_i are sample values arranged in increasing order, while G is the cumulative distribution function that will be tested for fitting.

[27] W² and A² statistics are often employed in goodness of fit tests: first, equations (5) and (6) are evaluated using sample data and the fitted distribution G, then the results are compared with percentage points at a given confidence level. Examples of applications of these tests to evaluate the goodness of fit of various distributions to hydrologic data can be found in work by Ahmad et al. [1988], Claps and Laio [2003], Choulakian and Stephens [2001], Dupuis [1999], and Laio [2004]. It is worthwhile to remark that the distributions of EDF statistics, and thus also percentage points to be used in statistical tests, depend on the fitting distribution G itself (e.g., distributions of W² and A² for the generalized extreme value and the GPD families are different), on the shape of G, on the parameters of G to be estimated, on the parameter estimation method, and on the sample size.

[28] Stephens [1986] provided percentage points of W² and A² in case parameters of some widely used distributions are estimated by the ML method, but the generalized Pareto distribution was not considered. Choulakian and Stephens [2001] studied asymptotic distributions of W² and A² statistics in case equations (5) and (6) are evaluated on


Figure 4. CDFs of test statistics (left) W² and (right) A² determined by Monte Carlo generations of 10,000 GPD random samples with parameters ξ = 0.02 and α = 8.7 mm. Solid line (case A) refers to continuous samples; dashed line (case B) refers to perfect discrete samples, rounded off at resolution 0.1 mm; and dotted line (case C) is traced for samples rounded off with the same rounding rule estimated on station 007, i.e., 37% at resolution 0.1 mm, 24% at resolution 1 mm, and 39% at resolution 5 mm.

samples drawn by a generalized Pareto distribution, and only one or both the scale and shape parameters are estimated by the ML method. They highlighted how EDF statistics for the GPD are only affected by the shape parameter (being invariant under scale changes) and provided tables of percentage points of W² and A² at several significance levels for different values of the shape parameter ξ.

[29] In addition, we stress that percentage points of W² and A² provided in tables and expressions within some of the above cited works were derived for the case of distributions G fitted on continuous samples. In case we are dealing with rounded measures, percentage points provided for continuous samples cannot be applied anymore, since the presence of ties changes the distribution of the test statistics. The Monte Carlo approach represents in these cases the only way to correctly take into account samples with rounded records, once the rounding rule of the sample itself is known. Moreover, the Monte Carlo approach easily takes into account also the family and the shape of the fitting distribution, the estimation method, the parameters to be estimated, and the sample size.

[30] As an example, in Figure 4 the CDFs of test statistics W² and A² determined by Monte Carlo generations of 10,000 GPD random samples are compared for the following cases: case A, continuous samples (only computer roundoff discretization); case B, perfect discrete samples rounded off at the standard 0.1 mm resolution; case C, samples rounded off at 0.1, 1, and 5 mm according to the same rounding-off rule estimated for the badly discretized time series (station 007), as reported at the beginning of the section. All samples have size 3000 and were drawn by a GPD F_0(x) with parameters ξ = 0.02 and α_0 = 8.7 mm, which allows a good fit on station 007 data to be obtained (see Figure 8). Both ξ and α_0 parameters were first estimated by the ML method on each generated (continuous for case A or rounded for cases B and C) sample, then each fitted distribution was used together with the same synthetic (and possibly rounded) sample in equations (5) and (6). The 10,000 values of W² and A² obtained in such a way are plotted in Figure 4.

[31] It is apparent from Figure 4 that, even with a perfect discrete sample, percentage points derived for continuous samples need to be redetermined. The differences become even more significant for samples with values rounded at larger resolutions, as is the case for many time series in the analyzed database. A way to determine the correct percentage points is illustrated in the same Figure 4. Indeed, once the confidence level has been chosen, quantiles or percentage points can be obtained by (interpolation of) the empirical distribution of the test statistics obtained by Monte Carlo generations. Values obtained from Figure 4 for the 90% confidence level (p value = 0.1) are reported in Table 1 and compared with percentage points provided by Choulakian and Stephens [2001, Table 2]. Table 1 shows that percentage points for continuous samples (case A) are very close to the linear interpolations of values provided by Choulakian and Stephens [2001] for ξ = 0 and ξ = 0.1; percentage points for perfect discrete samples (case B) become about 1.5–2 times the values for the continuous case; and for samples rounded as station 007 records, percentage points become about 50 times greater than those valid for continuous samples. This example clearly shows how applying goodness of fit tests without knowing how the analyzed sample has been rounded off would certainly lead to wrong conclusions. Indeed, only once the rounding-off rules are estimated can the proper percentage points for the application of statistical tests be derived.

Table 1. Dependence of A² and W² Percentage Points at 90% Confidence Level on the Rounding-off Rule of the Sample(a)

                                          W²        A²
Asymptotic results (ξ = 0)              0.124     0.796
Asymptotic results (ξ = 0.1)            0.116     0.766
MC: case A continuous samples           0.121     0.789
MC: case B perfect discrete samples     0.153     1.403
MC: case C mixed rounded samples        5.411    31.215

(a) Asymptotic results are derived from Choulakian and Stephens [2001, Table 2] for GPD with shape parameters ξ = 0 and ξ = 0.1 and refer to the case that both the unknown ξ and α_0 parameters are estimated by the ML method. Monte Carlo (MC) results were obtained by applying the ML method on 10,000 GPD random samples of size 3000 generated with parameters ξ = 0.02 and α = 8.7 mm. MC results refer to the following cases: case A, continuous samples; case B, perfect discrete samples (all records rounded at 0.1 mm); and case C, mixed rounded samples with the same rounding-off rule estimated on station 007. Percentage points for MC experiments are obtained by quantiles of CDFs plotted in Figure 4.

Table 2. Summary of the Rounding-off Rules P(Δ) and RRE Performances in All the Groups of Tests Considered(a)

         P(Δ) for Δ =                                Maximum Bias         Maximum RMSE
Group    0.1 mm  0.2 mm  0.5 mm  1.0 mm  5.0 mm      N = 700  N = 3000    N = 700  N = 3000
A        0.4     0.1     0.1     0.1     0.3         0.018    0.018       0.042    0.023
B        0.1     0.2     0.3     0.3     0.1         0.028    0.027       0.046    0.032
C        0.2     0.2     0.2     0.2     0.2         0.023    0.022       0.038    0.027
D        1       0       0       0       0           0.022    0.012       0.038    0.018
E        0.5             0.2             0.3         0.004    0.004       0.022    0.011
F        0.5             0.3                         0.023    0.023       0.045    0.030
         [Group F has one further entry, 0.2, whose column is not recoverable from the source layout.]

(a) In the last four columns, the maximum RMSE and absolute bias among all Δ_j's and all considered couples of (α, ξ) for parent distributions are reported.

2.3. Final Remarks

[32] In this section we have highlighted, with the aid of some examples, how the presence of rounded-off records may lead to undesirable and sometimes unexpected problems when fitting parametric distributions and/or when applying goodness of fit tests. Although we considered only the generalized Pareto distribution, we stress that similar problems also arise for any other continuous parametric distribution, with one, two, or more parameters. The following sections develop a methodology to estimate the rounding-off rules from sample data. The knowledge of the percentages of rounded-off values is necessarily the first step to assess the best approach to estimate parameters of parametric distributions, and to determine the percentage points for goodness of fit tests correctly accounting for the rounding of the sample.
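The Monte Carlo derivation of percentage points discussed in section 2.2 can be sketched in a few lines. This is a deliberately simplified illustration and not the experiment behind Figure 4 and Table 1: the parent is exponential (the ξ = 0 case of equation (2)) with only the scale estimated by ML, instead of a two-parameter GPD fit; samples are smaller and fewer; rounding is taken as "nearest multiple" with values never recorded as zero. These choices and all function names are assumptions, so the numbers produced are not comparable with Table 1.

```python
import math
import random


def cvm_w2(g):
    """Equation (5); g holds G(x_i) for the sample sorted in increasing order."""
    n = len(g)
    return 1.0 / (12 * n) + sum((gi - (2 * i - 1) / (2 * n)) ** 2
                                for i, gi in enumerate(g, start=1))


def ad_a2(g):
    """Equation (6); same input as cvm_w2."""
    n = len(g)
    s = sum((2 * i - 1) * (math.log(g[i - 1]) + math.log(1.0 - g[n - i]))
            for i in range(1, n + 1))
    return -n - s / n


def round_sample(xs, rule, rng):
    """Round each value at a resolution drawn from rule = [(delta, prob), ...]."""
    deltas = [d for d, _ in rule]
    probs = [p for _, p in rule]
    out = []
    for x in xs:
        d = rng.choices(deltas, weights=probs)[0]
        out.append(max(d, round(x / d) * d))    # assumption: never record zero
    return out


def percentage_points(rule, n=300, reps=200, level=0.90, seed=1):
    """Monte Carlo percentage points of W2 and A2 for an exponential parent."""
    rng = random.Random(seed)
    w2s, a2s = [], []
    for _ in range(reps):
        xs = [rng.expovariate(1.0 / 8.7) for _ in range(n)]
        if rule:
            xs = round_sample(xs, rule, rng)
        xs.sort()
        scale = sum(xs) / n                     # ML estimate of the exponential scale
        g = [1.0 - math.exp(-x / scale) for x in xs]
        w2s.append(cvm_w2(g))
        a2s.append(ad_a2(g))
    k = int(level * reps) - 1                   # index of the empirical quantile
    return sorted(w2s)[k], sorted(a2s)[k]


rule_007 = [(0.1, 0.37), (1.0, 0.24), (5.0, 0.39)]   # rule reported for station 007
w2_cont, a2_cont = percentage_points(rule=None)
w2_round, a2_round = percentage_points(rule=rule_007)
print(w2_cont, a2_cont, w2_round, a2_round)
```

Even in this reduced setting the 90% points for the rounded samples are far larger than those for continuous samples, so a test using continuous-sample tables would reject a correct model almost every time.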

3. Description and Performances of the Rounding-off Rule Estimator

[33] The aim of this section is to briefly describe the rounding-off rule estimator (RRE), whose derivation is provided in Appendix A together with some details for numerical implementation, and to evaluate the performances of the estimator on synthetic samples that may be considered representative of daily rainfall time series.

[34] The RRE provides the estimation of the percentages of measurements rounded off at different resolutions within a given sample. It requires only a preliminary exploration of the data set in order to detect the r potential rounding-off resolutions Δ = {Δ_j : j = 1, …, r}. This aim can be easily pursued by visual inspection of the empirical cumulative distribution functions and frequency histograms, looking for anomalous recurrent ties. For instance, an exploration of the data set analyzed in this paper revealed that recorded values may have been rounded off at the following resolutions:

$$\Delta = [0.1,\; 0.2,\; 0.5,\; 1,\; 5]\ \text{mm} \qquad (7)$$

[35] Then the RRE application requires solving the set of linear algebraic equations (equation (A1)), given in Appendix A, where the matrix A of known coefficients can be uniquely determined once the vector D is defined. The solution of this set of equations is finally used in equation (A7), also provided in Appendix A, to estimate the probability $\hat{P}(D = D_j)$ that a value has been rounded off at each resolution Dj, for j = 1, ..., r.

[36] Before applying the rounding-off rule estimator to observed data (as exemplified in section 4, where we discuss the results obtained by a systematic application of the RRE on a wide database of daily rainfall time series), it is essential to evaluate errors and performances of the estimator on synthetic series where we know a priori the rounding-off rule. Thus, in subsection 3.1, we present a preliminary analysis of our data set aimed at identifying reliable probability distributions of daily rainfall time series, while in subsection 3.2 we discuss the setting of test cases to evaluate the RRE performances, which are then summarized in subsection 3.3.

3.1. Reliable Probability Distributions for RRE Performances Evaluation

[37] We assume that rainy values can be considered drawn from a generalized Pareto distribution F0(x), described by equation (2) with threshold u = 0. We also assume that, once F0(x) has been fitted and the probability of rainy days z0 has been estimated, equation (1) is able to describe the whole distribution of rainy and not rainy daily values. We stress that although equation (2) can often satisfactorily fit daily rainfall data only when a nonzero threshold u is adopted (generally a few millimeters may be enough), once au has been estimated for any given threshold u, equation (4) allows equation (2) to be reparameterized in order to obtain F0(x).
In such a way, equation (1) not only gives a very simple representation of rainy and not rainy values, but also perfectly overlaps the GPD fitted on data exceeding a reliable threshold u. Certainly, there may be a small departure between equation (1) and the empirical distribution of observed data in the range (0, u), but this does not affect the findings of this section.

[38] There are many reasons to assume rainfall depths distributed as a GPD. Indeed, several studies have proved that the GPD is able to correctly fit rainfall depths at daily or smaller scales [see, e.g., Cameron et al., 2000; Coles et al., 2003; De Michele and Salvadori, 2005; Fitzgerald, 1989; Madsen et al., 2002; Salvadori and De Michele, 2001; Van Montfort and Witter, 1986]. Recently, Deidda and Puliga [2006] also gave evidence, using L-moment ratio diagrams [Hosking, 1990], that the GPD is the best candidate to be the parent distribution of the daily rainfall time series analyzed here. Moreover, besides the above reasons, the GPD family is particularly suitable for the scope of this section. Indeed, F0(x) provided by equation (2) can easily represent data distributions with different shapes and different scales, just by changing the x and a0 parameters, allowing the evaluation of the performances of the estimator in very different conditions.

Figure 5. Parameters x and a0 of the generalized Pareto distribution F0(x) in equation (2), with u = 0, estimated from 200 daily rainfall time series collected by the rain gauge network of the Sardinian Hydrological Survey. The observation period is longer than 40 years for all considered series.

[39] In order to reduce the estimation variance, parameters x and a0 were estimated for the 200 longest time series of daily rainfall, each more than 40 years long. Moreover, to deal with the spread of estimates (x, a0) in stations with rounded data (as shown in Figure 2), the median values of the estimates for thresholds u ranging from 2.5 mm to 12.5 mm were adopted. Figure 5 shows the couples of parameters (x, a0) determined in such a way for the 200 time series. For most stations, the generalized Pareto distributions F0(x), drawn adopting these couples of parameters, appeared, at a visual inspection, very close to the empirical distribution functions. The aspects related to parameter estimation on samples containing rounded-off values certainly deserve further investigation (e.g., some of the adopted estimates would benefit from further refinements), but this matter goes beyond the scope of this paper.
Here we use the estimates (x, a0) presented in Figure 5 only with the aim of evaluating the performances of the RRE for all the possible distributions of daily rainfall time series in our database.

[40] As far as the last free parameter is concerned, i.e., z0 in equation (1), we found that in most stations the probability of rainy days is slightly larger than 20%, with a small spread around this value.

3.2. Test Cases for RRE Performances Evaluation

[41] On the basis of results presented in subsection 3.1, we evaluated the performances of the RRE on samples drawn from generalized Pareto distributions F0(x) with shape parameter x in the range from 0.15 to 0.35, and the scale parameter a0 ranging from 5 to 15 mm. Moreover, sample sizes N = 700 and N = 3000 were taken into account, since these lengths are representative of daily rainfall time series about 10 years and 40 years long with a probability of rain z0 ≈ 20%.

[42] All tests were performed in the following way. For each couple (x, a0) in the ranges given above, we generated S = 10,000 GPD random samples of size 700 and 3000, and then we rounded off the data according to the rounding-off rule P(D) chosen for the test. Finally we applied the RRE on each rounded sample and evaluated the bias and the root mean square error (RMSE) of the estimated probability of each rounding resolution Dj:

$$
\begin{aligned}
\mathrm{Bias}\left[\hat{P}(D_j)\right] &= \frac{1}{S}\sum_{s=1}^{S}\left[\hat{P}_s(D_j) - P(D_j)\right] \\
\mathrm{RMSE}\left[\hat{P}(D_j)\right] &= \sqrt{\frac{1}{S}\sum_{s=1}^{S}\left[\hat{P}_s(D_j) - P(D_j)\right]^2}
\end{aligned}
\tag{8}
$$

where $\hat{P}_s(D_j)$ is the estimate, on the sth synthetic series, of the probability that records have been rounded off at resolution Dj, while P(Dj) is the probability actually used to round off Monte Carlo samples at resolution Dj.

[43] Performances of the RRE were evaluated for different vectors of rounding-off resolutions D and associated probabilities P(D), which are reported in Table 2, where labels A, B, C, D, E and F identify each couple D, P(D) used as rounding rule. In such a way we analyzed the performances of the RRE for different rounding-off rules and for all possible distributions of rainfall data. Examples of the RRE performances evaluated by equation (8) on each rounding-off resolution Dj are presented in Tables 3 and 4.

Table 3. Evaluation of Performances of the Rounding-off Rule Estimator on S = 10,000 Samples Drawn by an Exponential Distribution (x = 0) With Scale Parameter a0 = 10 mm and Then Rounded off According to the Rounding-off Rule of Group A

D                        0.1 mm    0.2 mm    0.5 mm    1 mm      5 mm
Rounding-off Rule Adopted for Group of Tests A
P(D)                     0.4       0.1       0.1       0.1       0.3
RRE Performances for Sample Size N = 3000
Mean[$\hat{P}(D)$]       0.402     0.101     0.091     0.105     0.301
Bias[$\hat{P}(D)$]       0.002     0.001     −0.009    0.005     0.001
RMSE[$\hat{P}(D)$]       0.019     0.016     0.015     0.014     0.011
RRE Performances for Sample Size N = 700
Mean[$\hat{P}(D)$]       0.402     0.100     0.092     0.105     0.301
Bias[$\hat{P}(D)$]       0.002     0.000     −0.008    0.005     0.001
RMSE[$\hat{P}(D)$]       0.039     0.034     0.027     0.027     0.022

Table 4. Same as Table 3 but Here Samples Are Rounded off According to the Rounding-off Rule of Group D (a)

D                        0.1 mm    0.2 mm    0.5 mm    1 mm      5 mm
Rounding-off Rule Adopted for Group of Tests D
P(D)                     1         0         0         0         0
RRE Performances for Sample Size N = 3000
Mean[$\hat{P}(D)$]       0.980     0.009     0.002     0.007     0.001
Bias[$\hat{P}(D)$]       −0.020    0.009     0.002     0.007     0.001
RMSE[$\hat{P}(D)$]       0.024     0.016     0.005     0.010     0.002
RRE Performances for Sample Size N = 700
Mean[$\hat{P}(D)$]       0.960     0.019     0.007     0.011     0.002
Bias[$\hat{P}(D)$]       −0.040    0.019     0.007     0.011     0.002
RMSE[$\hat{P}(D)$]       0.050     0.034     0.016     0.017     0.004

(a) Perfect discrete samples: All records are rounded off at resolution 0.1 mm.

[44] In detail, for groups A, B, C and D the RRE was tested for the five-element vector of potential rounding resolutions defined in equation (7). Thus, according to equation (A3), the matrix of probabilities aij becomes

$$
A = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 \\
\tfrac{1}{2} & 1 & \tfrac{1}{2} & 1 & 1 \\
\tfrac{1}{5} & \tfrac{1}{5} & 1 & 1 & 1 \\
\tfrac{1}{10} & \tfrac{1}{5} & \tfrac{1}{2} & 1 & 1 \\
\tfrac{1}{50} & \tfrac{1}{25} & \tfrac{1}{10} & \tfrac{1}{5} & 1
\end{bmatrix}
\tag{9}
$$

[45] To better understand the meaning of the probabilities in equation (A3), let us look at the values of the above matrix A. The first row contains the probabilities a1,j that the values rounded off at the resolutions Dj provided in equation (7) might be multiples of 0.1 mm (D1). Thus the row contains only ones, since whatever discretization Dj is used, the rounded value is certainly a multiple of 0.1 mm. The second row contains the probabilities a2,j that the values rounded off at resolutions Dj might be multiples of 0.2 mm (D2). Thus a2,1 = 1/2 means that we expect that only one half of the values rounded off at resolution 0.1 mm will be multiples of 0.2 mm; obviously a2,2 = 1; while measurements rounded off with resolution 0.5 mm (D3) will be multiples of 0.2 mm only when they take the values 1 mm, 2 mm, 3 mm, etc., thus a2,3 = 1/2. Conversely, a3,2 is equal to 1/5 since we expect that only one fifth of the measurements rounded off with resolution 0.2 mm are multiples of 0.5 mm (this happens again when rounded measurements take the values 1 mm, 2 mm, 3 mm, etc.). Similar considerations also hold for the other coefficients.

[46] Groups of tests A, B, C and D differ only in the vector of probabilities P(D) used to round data at the resolutions D provided in equation (7), as shown in Table 2. In group A, most data are rounded off at the smallest (D1 = 0.1 mm) and largest (D5 = 5 mm) resolutions. In group B, most data are rounded off at the inner and close resolutions D3 = 0.5 mm and D4 = 1 mm. In group C, data are rounded off with an equiprobable rule. In group D, all data are rounded off at the smallest resolution D1 = 0.1 mm (perfect discrete sample).

[47] Groups of tests E and F are instead aimed at showing how the performances of the RRE improve when the number of potential resolutions decreases. For group E the vector D = [0.1, 1, 5] mm has been used, while D = [0.1, 0.2, 0.5] mm was used for group F. The probability of rounding at the smallest, medium and largest resolution is the same for both groups of tests: P(D) = [50%, 20%, 30%].

[48] Matrices A for tests E and F become, respectively,

$$
A_E = \begin{bmatrix}
1 & 1 & 1 \\
\tfrac{1}{10} & 1 & 1 \\
\tfrac{1}{50} & \tfrac{1}{5} & 1
\end{bmatrix},
\qquad
A_F = \begin{bmatrix}
1 & 1 & 1 \\
\tfrac{1}{2} & 1 & \tfrac{1}{2} \\
\tfrac{1}{5} & \tfrac{1}{5} & 1
\end{bmatrix}
\tag{10}
$$
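As an illustration of how the coefficients in equations (9) and (10) follow from equation (A3), the matrices can be rebuilt with exact rational arithmetic. The sketch below is not taken from the original numerical implementation: the function name `coefficient_matrix` and the `scale` argument (which converts resolutions in mm to integer tenths of a millimeter before the least common multiple is taken) are assumptions introduced here.

```python
from fractions import Fraction
from math import gcd


def coefficient_matrix(resolutions_mm, scale=10):
    """Return a[i][j] = D_j / lcm(D_i, D_j), following equation (A3).

    `scale` is an assumption of this sketch: it must turn every
    resolution into an integer (10 works for multiples of 0.1 mm).
    """
    units = [round(d * scale) for d in resolutions_mm]
    matrix = []
    for di in units:
        row = []
        for dj in units:
            lcm = di * dj // gcd(di, dj)  # least common multiple [D_i, D_j]
            row.append(Fraction(dj, lcm))
        matrix.append(row)
    return matrix


# Five-resolution vector of equation (7) and the two three-resolution vectors
A = coefficient_matrix([0.1, 0.2, 0.5, 1, 5])
A_E = coefficient_matrix([0.1, 1, 5])
A_F = coefficient_matrix([0.1, 0.2, 0.5])

for row in A:
    print("  ".join(str(a) for a in row))
```

Working with `Fraction` keeps entries such as 1/25 and 1/50 exact, which makes a row-by-row comparison with the printed matrices straightforward.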

[49] Tests of group D deserve a special comment since they were aimed at evaluating the size of false rounding detections that may be induced by equation (A7) in cases where the RRE is applied to estimate probabilities of rounding at resolutions D, but sample values are rounded off only at a subset of resolutions. We chose the most critical case in which we are dealing with perfect discrete samples where all values are rounded at resolution 0.1 mm, but the RRE is applied with the vector of resolutions D provided in equation (7). An example of test results is presented in Table 4 for samples drawn by an exponential distribution (x = 0) with scale parameter a0 = 10 mm. If we look at results for Dj ≠ 0.1 mm, it is apparent from Table 4 that the RRE may provide probabilities different from zero for resolutions not used to round off data. Nevertheless, the errors on probabilities of rounding at resolutions Dj ≠ 0.1 mm, erroneously detected by the RRE, are still small: absolute values of bias and RMSE remain lower than 1% and 2% respectively for sample size N = 3000, corresponding to about 40-year-long series, and increase only to values lower than 2% and 4% respectively for shorter series (N = 700, corresponding to about 10-year-long series). We highlight also that owing to equation (A7), the size of the underestimation of the probability of rounding at resolution 0.1 mm is equal to the sum of the probabilities estimated for rounding resolutions Dj ≠ 0.1 mm. Since the aim of this group of tests was to evaluate the size of false detection, the maximum absolute bias and the maximum RMSE were determined only among Dj ≠ 0.1 mm.

3.3. Discussion of RRE Performances

[50] For the sake of synthesis in presenting results of the large number of cases considered in the tests, Table 2 reports, for each group of tests, only the maximum absolute value of the bias and the maximum RMSE among all the considered Dj and all the distributions F0(x) drawn with the different couples of considered parameters (x, a0).

[51] In Table 2 we can observe that the overall performances of the RRE are generally very good, although there is a slight worsening when data are rounded at very close resolutions, as in groups B and F. However, we want to stress here that our major interest is in detecting rounding rules when data have been rounded at far and significantly different resolutions. Indeed, in these cases the rounded values exert the larger statistical impact when fitting parametric distributions, or when applying tests based on goodness of fit statistics, as shown in section 2. An example can better clarify this point. First, let us consider a time series where all data were rounded at close resolutions, for instance half and half at 0.1 mm and 0.2 mm. In this case, there is certainly a worsening of the performance (although bias and RMSE do not exceed a few percentage points); nevertheless, we have less interest in discriminating the percentage of data rounded at the two possible resolutions: for instance, distributions of test statistics, such as those shown in Figure 4, do not significantly change if the RRE provides percentages of rounding at resolutions 0.1 mm and 0.2 mm slightly different from 50% and 50% (note that RRE errors are anyhow lower than a few percentage points). Conversely, if we look at a sample with values rounded at very far resolutions, for instance 0.1 mm and 5 mm, there is a greater interest in discriminating the percentage of roundings at the two resolutions, and in this case the RRE becomes even more efficient (as proved by groups of tests A and E).

Figure 6. Results of the RRE on 340 daily rainfall time series that are more than 10 years long. For each series, rounding probabilities were first determined for resolutions D = [0.1, 0.2, 0.5, 1, 5] mm and then were recomputed only for those Dl for which $\hat{P}(D_l)$ > 10% resulted, in order to exclude false detections. (a) Histogram classifying stations on the probability that measurements have been rounded at the smallest resolution D = 0.1 mm. (b) Histogram based on the probability that measurements have been rounded at D = 0.1 mm or D = 0.2 mm. (c) Histogram based on the probability that measurements have been rounded at resolution D = 0.5 mm or smaller. (d) Histogram based on the probability that measurements have been rounded at resolution D = 1 mm or smaller: Only 180 time series count more than 90% of data rounded off at resolutions smaller than or equal to 1 mm; consequently, 160 series have more than 10% of records rounded off at larger resolutions (see Figure 7 for details).

Figure 7. Results of the RRE on 340 daily rainfall time series with more than 10 years of data. Histogram classifies stations on the probability that values have been rounded at resolution D = 5 mm: Fourteen series have more than 30% of measurements rounded off at 5 mm; in 44 series, data rounded at 5 mm are between 20% and 30% of the whole sample; and in 102 stations, the percentage of large roundings is between 10% and 20%. The first bin is empty because of recomputing with a new D to avoid false detections.

[52] In conclusion, a generally good efficiency of the RRE has been verified in all the considered groups of tests: results can be summarized as follows. For the five-resolution vector D, the maximum RMSE is about 2–3% for 40-year-long time series (N = 3000) and only increases to about 4% for 10-year-long time series (N = 700), while the maximum absolute value of the bias is around 2%. Considering the most interesting case (group E) for the three-resolution vectors D, the maximum RMSE was about 1% for 40-year-long time series (N = 3000) and about 2% for 10-year-long time series (N = 700), while the maximum absolute value of the bias is less than 0.5% irrespective of the sample size.
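The Monte Carlo setup of subsection 3.2 can be sketched in a few lines. The code below is a minimal illustration rather than the original test code: the helper names (`sample_gpd`, `round_with_rule`, `bias_and_rmse`) are hypothetical, and the GPD sampler assumes the standard parameterization F0(x) = 1 - (1 + shape * x / scale)^(-1/shape) with threshold u = 0, which is taken here to correspond to equation (2). It covers sample generation, rounding according to a rule P(D), and the two statistics of equation (8); the estimator itself is left out.

```python
import numpy as np


def sample_gpd(rng, size, shape, scale):
    """Inverse-transform sampling from a GPD with threshold u = 0 (assumed form)."""
    u = rng.uniform(size=size)
    if shape == 0.0:
        return -scale * np.log1p(-u)  # exponential limit of the GPD
    return scale / shape * ((1.0 - u) ** (-shape) - 1.0)


def round_with_rule(rng, x, resolutions, probs):
    """Round each value at a resolution drawn at random with probabilities P(D)."""
    res = rng.choice(resolutions, size=x.size, p=probs)
    return np.round(x / res) * res


def bias_and_rmse(estimates, true_probs):
    """Equation (8); each row of `estimates` is one Monte Carlo replicate."""
    err = np.asarray(estimates, dtype=float) - np.asarray(true_probs, dtype=float)
    return err.mean(axis=0), np.sqrt((err ** 2).mean(axis=0))


rng = np.random.default_rng(0)
D = np.array([0.1, 0.2, 0.5, 1.0, 5.0])   # resolutions of equation (7), in mm
P = np.array([0.4, 0.1, 0.1, 0.1, 0.3])   # rounding rule of group A (Table 3)
sample = round_with_rule(rng, sample_gpd(rng, 3000, 0.0, 10.0), D, P)
```

In a full experiment, a loop over s = 1, ..., S would apply the RRE to each rounded sample and stack the estimated probabilities as rows of `estimates` before calling `bias_and_rmse`.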

4. RRE Application on Daily Rainfall Data

[53] The rounding-off rule estimator, described in Appendix A and tested in section 3, proved to be very efficient even with small samples. Thus we applied it on the 340 daily rainfall time series with more than 10 years of records, which were collected by the Hydrological Survey of the Sardinia Region (Italy) from 1922 to 1980. Among these 340 series, 200 have more than 40 years of records and were already used to produce Figure 5.

[54] On each time series, the application has been performed in two steps. In the first step the RRE has been applied to estimate the probabilities of rounding at resolutions D = [0.1, 0.2, 0.5, 1, 5] mm. In the second step the RRE has been applied again only for those resolutions Dl for which $\hat{P}(D_l)$ > 10% resulted. On the basis of the results obtained for group of tests D, discussed in section 3, we are quite confident that in such a way we exclude the chance of estimating nonzero probabilities for unused rounding-off resolutions. Moreover, we obtain a secondary advantage, since reducing the size of vector D increases the efficiency of the RRE.

[55] Results of the RRE obtained for the 340 time series are presented in Figures 6 and 7 in the form of histograms. The histogram in Figure 6a classifies stations according to the estimated probability $\hat{P}(D = 0.1\ \mathrm{mm})$ that measurements have been rounded at the smallest resolution D = 0.1 mm. We can observe that in most stations the percentage of data rounded at 0.1 mm resolution does not exceed 50% of the whole time series ($\hat{P}(D = 0.1\ \mathrm{mm})$ < 0.5 for 313 stations). The histogram shown in Figure 6b is based on the estimated probability $\hat{P}(D \le 0.2\ \mathrm{mm})$ that measurements have been rounded at D = 0.1 mm or D = 0.2 mm. It is apparent that only 33 stations, thus only about 10% of the 340 analyzed ones, have more than 90% of records correctly discretized. Certainly, this is an unexpected result, since we would expect all stations to be rounded off at the standard discretizations of 0.1 or 0.2 mm. The situation does not improve much even if we also accept as good the records rounded off at resolution D = 0.5 mm. As we can observe in Figure 6c, only 46 time series contain more than 90% of records rounded off at resolutions 0.1 mm, 0.2 mm or 0.5 mm. Finally, if we further relax the requirements for our time series and also accept records rounded off at 1 mm resolution, we find only 180 time series with more than 90% of records rounded at resolution D = 1 mm or smaller. The remaining 160 series have more than 10% of measurements rounded off at the larger resolution, D = 5 mm.

[56] Figure 7 shows in detail the estimated percentages of records rounded off at resolution D = 5 mm in these 160 time series: 14 series contain more than 30% of measurements rounded off at 5 mm, in 44 series the number of records rounded at 5 mm is between 20% and 30% of the whole sample, while in 102 stations the percentage is between 10% and 20%. The first bin is empty since at the second step of application of the RRE, a subset of D was selected for each station excluding resolutions with estimated probability lower than 10%.

[57] Figures 8 and 9 display (on the left plots) the empirical CDFs of two time series containing a high percentage of records rounded off at 5 mm resolution: the distribution of records in the first time series (station 007, the same bad station analyzed in section 2) is very close to an exponential one, while measurements in the second time series (station 356) follow a heavy tailed distribution.
Figures 8 and 9 were produced with a triple purpose. The first is to emphasize how the generalized Pareto distribution can easily adapt to very different distributions of daily rainfall records, even if containing rounded-off values. The second purpose is to highlight how some properties of the GPD, such as the invariance, under left-censoring threshold changes, of parameters x and a0 provided by equation (4), can be used to limit the uncertainties in parameter estimation due to spread discussed in section 2: fitting lines in Figures 8 and 9 were drawn using equations (1) and (2) with u = 0, where parameters of F0(x) were assumed as the median values of x and a0 estimated for thresholds u in the range from 2.5 mm to 12.5 mm. Note that parameters x and a0 used to draw lines in Figure 8 are the median values of estimates displayed in Figure 2 (left) and already used, before reparameterization (4), to draw CDFs Fu(x) for different thresholds u in Figure 3 (left). The third purpose is to show (on the right plots) how the CDFs of synthetic samples, drawn by the fitted GPD and then rounded off with the same rounding rule estimated by the RRE on the observed time series, look like those of observed time series.
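The two-step application described in paragraph [54] can be summarized by the following sketch. It is only an outline under stated assumptions: `two_step` and its `estimate` argument are hypothetical names, where `estimate(x, resolutions)` stands for any implementation of the RRE that returns one probability per resolution, and `stub_estimate` is a stand-in with made-up numbers used solely to make the example runnable.

```python
def two_step(estimate, x, resolutions, threshold=0.10):
    """Apply `estimate` on all resolutions, then again on those with P > threshold."""
    first = list(estimate(x, resolutions))
    kept = [d for d, p in zip(resolutions, first) if p > threshold]
    second = list(estimate(x, kept))
    return dict(zip(kept, second))


def stub_estimate(x, resolutions):
    # Stand-in for the RRE: made-up probabilities, renormalized over
    # whichever resolutions are requested. Not real results.
    made_up = {0.1: 0.45, 0.2: 0.04, 0.5: 0.03, 1.0: 0.28, 5.0: 0.20}
    total = sum(made_up[d] for d in resolutions)
    return [made_up[d] / total for d in resolutions]


result = two_step(stub_estimate, x=[], resolutions=[0.1, 0.2, 0.5, 1.0, 5.0])
print(result)
```

Because the first-step probabilities sum to one over at most five resolutions, at least one of them always exceeds the 10% threshold, so the second step never receives an empty vector.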

5. Conclusions

[58] The presence of roughly rounded-off measurements produces undesirable effects and errors when fitting parametric distributions and when applying goodness of fit tests. Assuming the generalized Pareto distribution (GPD) as the parent distribution of daily rainfall records, we have shown how roughly rounded-off values in the analyzed samples may cause wide spreads of parameter estimates when some common and widely adopted estimation methods are applied (the simple moments, the probability weighted moments and the maximum likelihood estimators were considered here). Although we only showed results for the GPD, we highlight that similar problems arise even if any other continuous parametric distribution is a candidate to model left-censored samples. Indeed, the spread of parameter estimates is driven by the spread of the sample statistics computed on the excesses above left-censoring thresholds. Moreover, the presence of rounded-off values may also change the distribution of goodness of fit statistics, as shown for the Cramér-von Mises and the Anderson-Darling ones. Thus the right determination of percentage points for a correct application of statistical tests requires the knowledge of the rounding-off rule of the analyzed sample.

Figure 8. GPD mixed with zeros according to equation (1) with parameters x = 0.02, a0 = 8.66 mm, and z0 = 21% (solid line). (left) Empirical CDFs of daily rainfall data collected by station 007 (circles). (right) Empirical CDFs of a synthetic sample drawn by a GPD with the parameters given above and rounded off with rule D = [0.1, 1, 5], P(D) = [37%, 24%, 39%] (circles). (top) Plots in semi-log scale to highlight the tail behavior. (bottom) Zoom of smaller values.

[59] The paper presents an original, objective, statistically based method that allows the estimation of the percentages of measurements that were rounded off at some predetermined resolutions. The efficiency of the rounding-off rule estimator (RRE) has been proved to be high on different distributions (that can reliably describe the variety of daily rainfall distributions in the analyzed data set) and different rounding-off rules. A wide set of tests showed that for 40-year-long time series the maximum RMSE of probabilities estimated on 5 potential rounding-off resolutions is about 2–3%, and that it becomes lower than 1% when considering 3 potential rounding-off resolutions. The efficiency remains good also for shorter time series. Indeed, considering 10-year-long time series the maximum RMSE increases only to 4% and 2% for probabilities estimated respectively on 5 and 3 potential rounding-off resolutions.

Figure 9. GPD mixed with zeros according to equation (1) with parameters x = 0.28, a0 = 6.80 mm, and z0 = 23% (solid line). (left) Empirical CDFs of daily rainfall data collected by station 356 (circles). (right) Empirical CDFs of a synthetic sample drawn by a GPD with the parameters given above and rounded off with rule D = [0.1, 0.5, 1, 5], P(D) = [31%, 11%, 31%, 27%] (circles). (top) Plots in semilog scale to highlight the tail behavior. (bottom) Zoom of smaller values.

[60] The RRE has been systematically applied to estimate the rounding-off rules on 340 time series of daily rainfall records collected by the rain gauge network of the Hydrological Survey of the Sardinian Region (Italy). Results revealed that most series contain significant percentages of values rounded off at very large resolutions, even 1 mm and 5 mm, rather than being discretized at the standard 0.1 or 0.2 mm resolution. The reasons for the presence of anomalous tie concentrations at these resolutions were discussed in the Introduction of the paper. We believe that similar problems might also have happened in measuring rainfall depths in other rain gauge networks, and thus the proposed rounding-off rule estimator represents a very useful method to evaluate the quality of records in other data sets as well.

[61] In order to choose reliable estimates of GPD parameters, despite the large spreads due to the presence of rounded values, we adopted the median values of parameters estimated for different thresholds. In such a way, an acceptable fit was obtained in most time series, as in the examples shown in Figures 8 and 9. Nevertheless, there were a few time series where, even adopting the median values among multiple threshold estimates, the fitted GPDs did not perfectly describe the empirical distributions. Thus the assessment of the best approach for parameter estimation on roughly discretized samples certainly deserves further investigation in forthcoming research activities, and the rounding-off rule estimator may give important support for these tasks.

[62] Finally, over the last years there has been an increasing interest in assessing possible climate change on the basis not only of meteorological scenarios, but also of the analysis of long rainfall time series (e.g., some analyses on Italian daily rainfall time series have been performed by Brunetti et al. [2001a, 2001b, 2006] and Cislaghi et al. [2005]). The good performances demonstrated by the rounding-off rule estimator, even in the case of short time series (10 years long), make it an important analysis method to support this kind of study. In drawing conclusions on this delicate subject, as well as in performing quality control and homogenizing procedures, the chance that the precision of measurements and the rounding-off rules have changed during the analyzed periods is certainly an important aspect to evaluate. Indeed, as we have shown, the rounding process may affect the inference of distributions and may also alter the percentage of recorded rainy days.

Appendix A: Rounding-off Rule Estimator (RRE)

[63] Let D = {Dl: l = 1, ..., r} be the vector containing the r potential rounding-off resolutions of the data. Estimates of the rounding-off rules can be obtained by solving the following set of linear algebraic equations:

$$ \mathbf{A}\,\mathbf{n} = \mathbf{m} \tag{A1} $$

where A = {ai,j: i, j = 1, ..., r} is an r-by-r matrix of known coefficients defined in equation (A3), m = {mi: i = 1, ..., r} is an r-by-1 vector of right-hand side known values provided by equation (A6), while n = {nj: j = 1, ..., r} is the r-by-1 vector of unknowns.

[64] Each unknown nj represents the number of measurements rounded off at resolution Dj. Thus the probability that a value has been rounded off at resolution Dj can be estimated as

$$ \hat{P}(D = D_j) = \frac{n_j}{\sum_{k=1}^{r} n_k}, \qquad j = 1, \ldots, r \tag{A2} $$

which obviously assures that the property $\sum_{j=1}^{r} \hat{P}(D = D_j) = 1$ holds.

[65] Each value mi represents the number of measurements that are multiples of Di. The number mi certainly includes all the ni measurements rounded off at resolution Di, but can also include some other measurements rounded off at different resolutions Dj. Thus we introduce the following equation to estimate the probability ai,j that a measurement rounded off at resolution Dj might also be a multiple of Di:

$$ a_{i,j} = \frac{1}{[D_i, D_j] / D_j}, \qquad i, j = 1, \ldots, r \tag{A3} $$

where [Di, Dj] is the least common multiple of Di and Dj. If Di or Dj is not an integer, it may be convenient to multiply both of them by a common factor before evaluating equation (A3), since numerical algorithms usually require the arguments of [ ] to be integers. Examples of matrices A and some comments on the meaning of coefficients provided by equation (A3) are given in section 3.

[66] If the unknowns nj were known, the expected number mi of measurements that are multiples of Di would be estimated by summing the products of the number nj of values rounded off at resolution Dj by the probability ai,j:

$$ \hat{m}_i = \sum_{j=1}^{r} a_{i,j}\, n_j, \qquad i = 1, \ldots, r \tag{A4} $$


[67] Nevertheless, our problem is dual. Indeed, we can determine exactly the vector m on any given sample, while the nj are unknowns. Thus equation (A4) takes the form of the set of equations (A1) which provides, together with equation (A2), the rounding-off rule estimator.

[68] Let x = {xk: xk > Dmax/2 and k = 1, ..., N} be the sample where we have an interest in determining the unknown rounding rule P(D) = {P(D = Dl): l = 1, ..., r}. The need for the constraint xk > Dmax/2, where Dmax = max{Dl: l = 1, ..., r}, will be discussed later. For each Di we can define a working vector $y^{(i)} = \{y_k^{(i)} = \mathrm{ROUND}(x_k / D_i) \cdot D_i : k = 1, \ldots, N\}$, where ROUND() is the nearest integer function. The number mi of multiples of Di can thus be evaluated as

$$ m_i = \#\left\{ x_k : x_k = y_k^{(i)} \right\}, \qquad i = 1, \ldots, r \tag{A5} $$

where #{} returns the number of elements within the set, thus equation (A5) provides the number of elements in the x vector that are equal to the corresponding rounded values in the working vector $y^{(i)}$. To avoid mistakes due to the floating-point representation of numbers in numerical calculation, equation (A5) should be substituted by

$$ m_i = \#\left\{ x_k : \mathrm{ABS}\left( x_k - y_k^{(i)} \right) < \varepsilon \right\}, \qquad i = 1, \ldots, r \tag{A6} $$

where ABS() is a function that returns the absolute value, while $\varepsilon$ is a small quantity (larger than the floating-point roundoff errors and smaller than the data resolution Dmin = min{Dl: l = 1, ..., r}).

[69] In rainfall time series there are usually some data equal to zero and others that are strictly positive. Nevertheless, some zero values may come from the rounding of small rainfall depths. If equation (A5) or (A6) were evaluated for all positive values of our series, the number of these zeros-nonzeros could not be taken into account, leading to an underestimation of the rounding probability at larger resolutions, and especially of P(Dmax). Thus the constraint xk > Dmax/2 has been introduced to avoid this cause of bias (that was observed in numerical experiments performed without the constraint).

[70] A final remark regards the case in which we are looking for possible roundings D = {Dl: l = 1, ..., r}, but data were rounded off using only a subset of resolutions. Let Df be a resolution within the set of potential resolutions D that was not used to round any value in the analyzed sample. Solving the set of equations (A1) we would expect the corresponding unknown nf to be zero. Nevertheless, because of estimation variance, estimates nf will be distributed around zero and can take positive and negative values. We thus suggest the substitution of equation (A2) with the following one, which forces unreliable negative values to be zero:

$$ \hat{P}(D = D_j) = \frac{\max\{n_j, 0\}}{\sum_{k=1}^{r} \max\{n_k, 0\}}, \qquad j = 1, \ldots, r \tag{A7} $$

which still ensures that $\sum_{j=1}^{r} \hat{P}(D = D_j) = 1$ holds.

[71] While unreliable negative values are simply corrected by the above equation, positive $n_f$ values can lead to a false rounding detection $P(D = D_f) > 0$: the size of these


errors is evaluated in section 3 (tests of group D). We just remark here that, for the resolutions $D_j$ actually used to round data in the analyzed sample, equation (A7) provides the same results as equation (A2); thus we strongly suggest implementing the RRE using the new equation (A7).
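The two computational steps described in this appendix, counting the multiples of each candidate resolution with a tolerance (equation (A6)) and converting the estimated counts into probabilities with negative values forced to zero (equation (A7)), can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the author's implementation: the function names `count_multiples` and `rounding_probabilities`, the default tolerance of one hundredth of the finest resolution, and the sample values are all hypothetical. The step that solves the set of equations (A1) for the $n_j$ is not shown, so the vector passed to `rounding_probabilities` is simply made up for illustration.

```python
def count_multiples(x, resolutions, eps=None):
    """Count, for each candidate resolution D_i, the sample values that are
    multiples of D_i within a tolerance eps, as in equation (A6)."""
    d_max = max(resolutions)
    d_min = min(resolutions)
    if eps is None:
        # Assumption: any tolerance above floating-point round-off and
        # below D_min would do; D_min / 100 is an arbitrary choice.
        eps = d_min / 100.0
    # Constraint x_k > D_max / 2: discard values that the coarsest
    # resolution could have rounded to zero.
    kept = [xk for xk in x if xk > d_max / 2.0]
    counts = []
    for d in resolutions:
        # Working vector y_k = ROUND(x_k / D_i) * D_i. Python's round()
        # resolves exact ties to the even integer, which does not matter
        # here because a tie is never within eps of a multiple of D_i.
        m_i = sum(1 for xk in kept if abs(xk - round(xk / d) * d) < eps)
        counts.append(m_i)
    return counts


def rounding_probabilities(n):
    """Turn estimated counts n_j into probabilities as in equation (A7):
    negative estimates are set to zero and the rest are normalized."""
    clipped = [max(nj, 0.0) for nj in n]
    total = sum(clipped)
    if total <= 0.0:
        raise ValueError("no positive estimate n_j: probabilities undefined")
    return [c / total for c in clipped]


# Hypothetical sample (mm) and candidate resolutions of 0.1, 1 and 5 mm.
sample = [0.1, 0.3, 0.6, 1.0, 2.4, 5.0, 7.0, 10.0, 12.2]
print(count_multiples(sample, [0.1, 1.0, 5.0]))    # [4, 3, 2]

# Hypothetical estimates n_j, one of them slightly negative.
print(rounding_probabilities([60.0, -2.0, 40.0]))  # [0.6, 0.0, 0.4]
```

In the example only the four values larger than $D_{\max}/2 = 2.5$ mm are retained, which is why the count at the 0.1 mm resolution is 4 rather than 9.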


R. Deidda, Dipartimento di Ingegneria del Territorio, Università di Cagliari, Piazza d'Armi, I-09123 Cagliari, Italy. ([email protected])
