Environ Monit Assess (2010) 160:1–22 DOI 10.1007/s10661-008-0653-3

A methodology for treating missing data applied to daily rainfall data in the Candelaro River Basin (Italy)

Rossella Lo Presti · Emanuele Barca · Giuseppe Passarella

Received: 19 February 2008 / Accepted: 5 November 2008 / Published online: 19 December 2008 © Springer Science + Business Media B.V. 2008

Abstract  Environmental time series are often affected by the presence of missing data; when such data are analysed statistically, the need to fill in the gaps by estimating the missing values must be considered. At present, a large number of statistical techniques are available to achieve this objective; they range from very simple methods, such as using the sample mean, to very sophisticated ones, such as multiple imputation. A new methodology for missing data estimation is proposed here, which aims to merge the obvious advantages of the simplest techniques (e.g. their ease of implementation) with the strength of the newest ones. The proposed method consists of two consecutive stages: once it has been ascertained that a specific monitoring station is affected by missing data, the "most similar" monitoring stations are identified among the neighbouring stations on the basis of a suitable similarity coefficient; in the second stage, a regressive method is applied in order to estimate the missing data. In this paper, four different regressive methods are applied and compared, in order to determine which is the most reliable for filling in the gaps, using rainfall data series measured in the Candelaro River Basin in southern Italy.

Keywords  Time series · Missing data · Parametric regression · Uncertainty

R. Lo Presti · E. Barca · G. Passarella (B) Water Research Institute of the National Research Council, Department of Bari, Viale F. De Blasio, 5, 70123 Bari, Italy. e-mail: [email protected]

Introduction

The presence of gaps in environmental data time series represents a very common but extremely critical problem, since it can produce biased results and, in the worst case, can even prevent important analyses of the considered variables from being carried out. Missing data plague almost all surveys and quite a number of experiments, no matter how carefully an investigator tries to control all the possible sources of data noise in a survey, or how well an experiment is designed. The problem is how to deal with missing data once it has been deemed impossible to recover the actual missing values. Apart from the amount of missing data, another issue which plays an important role in the choice of any recovery approach is the evaluation of the "missingness" mechanism. The term missing completely at random (MCAR) refers to data where the missingness mechanism does not depend on the variable under investigation or on any

2

Environ Monit Assess (2010) 160:1–22

other variable observed in the dataset. This very stringent assumption is required to allow the application of a case deletion procedure, but it must be underlined that missing data are very rarely MCAR in practical applications (Rubin 1976). The term missing at random (MAR) means that data are missing conditionally on some other variable observed in the data set, i.e. a variable other than the one under investigation (Schafer 1997). Finally, not missing at random (NMAR) occurs when the missingness mechanism depends on the actual value of the missing data; this is the most difficult condition to model. Traditional approaches include case deletion and mean imputation. Nevertheless, in the last decade, interest has arisen in the estimation of missing data via regression (single imputation). Single imputation comprises imputation of group means (or medians, or modes), regression imputation, stochastic regression imputation and a variety of other methods. More recently, multiple imputation has become available, which returns M complete datasets by imputing M times, and is generally considered superior to case deletion or single imputation methods (Scheffer 2002). In this paper, a new, relatively quick and automatic methodology for treating missing data is proposed, which is also able to provide the uncertainty related to the estimated data. Four different regressive methods of imputation are compared, given the missingness mechanism and the amount of missing data. The methodology, which is based on the MAR assumption, was applied to the daily rainfall time series recorded in the Candelaro River Basin (Apulia, southern Italy) from January 1970 to December 2001.
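For intuition, the three mechanisms can be illustrated with a small simulation; the variables, sample size and missingness probabilities below are invented purely for illustration and are not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
z = rng.normal(size=n)            # covariate observed everywhere (e.g. elevation)
y = 2.0 * z + rng.normal(size=n)  # variable that will lose values (true mean = 0)

# MCAR: missingness ignores both y and z
mcar = rng.random(n) < 0.2
# MAR: missingness depends only on the observed covariate z
mar = rng.random(n) < np.where(z > 0, 0.35, 0.05)
# NMAR: missingness depends on the (unobserved) value of y itself
nmar = rng.random(n) < np.where(y > 2, 0.6, 0.05)

for name, mask in [("MCAR", mcar), ("MAR", mar), ("NMAR", nmar)]:
    print(name, round(y[~mask].mean(), 3))  # mean of the surviving values
```

Under MCAR the observed mean stays close to the true mean, while under NMAR it is biased (here downward, because large values of y are preferentially lost), which is why NMAR is the hardest case to handle.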

Methods

In the case study, the proposed methodology was applied to space–time continuous data, i.e. to time series referred to different monitoring stations belonging to the same network. Data were arranged in a matrix having the following structure:

– each row was uniquely associated with the date of the observation;
– each column was uniquely associated with a monitoring site.

The proposed methodology can be considered a broadening and development of the classic univariate regressive methods suggested when the MAR hypothesis holds (Tables 1 and 2). The case study was set up to check the three hypotheses generating missing data and to validate the MAR assumption. It should be borne in mind that the proposed methodology, although developed for cases where the MAR hypothesis has been verified, can also be applied in the MCAR case, which is less restrictive than MAR. The first stage consists in finding the gaps in the data matrix; given the data matrix structure, each gap is uniquely associated with a date and a monitoring site. In the second stage, a subset of gauging stations "similar" to the one affected by missing values (the target station) with respect to the time behaviour of the considered variable is identified by means of the non-parametric Spearman coefficient (Conover 1971). Once the similarity matrix has been obtained, those stations whose Spearman coefficient is greater than a given threshold of acceptance, defined in the literature (Hubbard 1994)
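Stage 1, locating the gaps in the date-by-station matrix, can be sketched as follows; the station names, dates and rainfall values are invented for illustration:

```python
import numpy as np

# Hypothetical data matrix: rows = observation dates, columns = monitoring sites
stations = ["Alberona", "Biccari"]
dates = ["1970-01-01", "1970-01-02", "1970-01-03"]
data = np.array([[0.0, 0.0],
                 [4.2, 3.8],
                 [np.nan, 0.4]])   # NaN marks a missing daily value

# Each gap is uniquely identified by a (date, station) pair
rows, cols = np.where(np.isnan(data))
gaps = [(dates[r], stations[c]) for r, c in zip(rows, cols)]
print(gaps)
```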

Table 1  Formal definition of the data loss mechanisms

Hypothesis | Definition | Suggested methods
MCAR | p(M|Y, ϕ) = p(M|ϕ) for all Y, ϕ | Listwise and pairwise deletion
MAR | p(M|Y, ϕ) = p(M|Yobs, ϕ) for all Ymiss, ϕ | Regression and multiple imputation
NMAR | p(M|Y, ϕ) = p(M|Ymiss, ϕ) | Mixed models

Here p is a probability function, M the missingness, ϕ the probability function parameter set, Y the matrix of all potentially observed values (= Yobs + Ymiss), Yobs the matrix of actually observed data values and Ymiss the matrix of missing data values.

Table 2  Main techniques presently in use for missing data management

Listwise (data deletion)
  Description: all the data matrix rows presenting lost data are deleted.
  Pros: very simple application.
  Cons: the main statistics are unbiased only if the MCAR condition is verified; it causes a further loss of data.
  Bibliography: any book or paper treating missing data (e.g. Allison 2001).

Pairwise (data deletion)
  Description: all the variables present in the data matrix are coupled, and the coupled data containing a missing datum are then deleted.
  Pros: very simple application; the data loss is less than with the listwise method.
  Cons: statistics are unbiased only if the MCAR condition is verified; the correlation matrix may be asymmetrical and not positive definite; no information about the uncertainty associated with the restored data is provided.
  Bibliography: any book or paper treating missing data (e.g. Allison 2001).

Imputation of the sample mean, or any other central tendency index (global imputation based only on the variable containing the missing data)
  Description: the gaps are filled in using the sample mean or another central tendency index.
  Pros: simple application; the input value is always the same for all the gaps.
  Cons: it causes an underestimation of the sample variance; there is no information about the uncertainty associated with the imputed data.
  Bibliography: any book or paper treating missing data (e.g. Allison 2001).

Imputation of a random value (global stochastic imputation based only on the variable containing the lost data)
  Description: the gaps are filled in using a random value obtained by perturbing the sample mean with a stochastic component.
  Pros: simple application.
  Cons: it causes an overestimation of the sample variance; there is no information about the uncertainty associated with the imputed data.
  Bibliography: any book or paper treating missing data (e.g. Allison 2001).

Least squares parametric regression (global imputation based on all the variables present in the data matrix)
  Description: a probabilistic model is applied to the complete cases, considering the variable containing the lost data as the dependent variable.
  Pros: simple application in both the univariate and the multivariate case.
  Cons: the model parameters are unbiased only if the hypotheses of Gaussianity, homoscedasticity and uncorrelated residuals are verified; the relationship between the variables (e.g. linear, exponential) must be known.
  Bibliography: any book or paper treating missing data (e.g. Allison 2001).

Non-parametric regression (global imputation based on all the variables present in the data matrix)
  Description: a probabilistic model is applied to the complete cases, considering the variable containing the lost data as the dependent variable.
  Pros: knowledge about the functional relationship between dependent and independent variables is not needed; there is no need to verify the hypotheses of standard regression.
  Cons: the implementation of such methods may be difficult because they are generally not present in the main statistical software.
  Bibliography: Theil (1950); Conover (1971).

Hot deck (local imputation)
  Description: the gaps are filled in using the data matrix rows "similar" to the one containing the missing data.
  Pros: no strong hypothesis is needed to apply the method.
  Cons: the criterion for defining similarity among cases is often subjective.
  Bibliography: Sande (1983).

Expectation maximisation (EM) (global imputation)
  Description: supposing the probability distribution of the variable containing the missing data to be known, the method iteratively estimates the model parameters.
  Pros: the method provides an unbiased estimation of the missing values.
  Cons: the method is iterative in nature, so convergence might be time-demanding; it requires knowledge of the probability distribution of the data.
  Bibliography: Little and Rubin (1987); Schafer (1997).

Multiple imputation (MI) (global imputation)
  Description: the missing data are filled in by a set of values, which are thereafter analysed with standard statistical methods.
  Pros: the method provides a measure of the uncertainty associated with the estimation; it is a robust method.
  Cons: the algorithm is rather sophisticated and not implemented in the main statistical software; it might be time-demanding.
  Bibliography: Rubin (1987, 1988).

Recursive partitioning (local imputation)
  Description: the data matrix is rearranged in lexicographic order, after which an iterative tree algorithm is applied.
  Pros: the method is not time-consuming, even when dealing with a large database.
  Cons: the method works only if the missing data patterns are monotonic; it requires that some variables in the dataset are complete.
  Bibliography: Conversano (2003).
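As a small illustration of the listwise-deletion and mean-imputation rows of Table 2, the following sketch shows how mean imputation shrinks the sample variance while deletion under MCAR leaves it essentially unchanged; the data and the missingness rate are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.gamma(2.0, 5.0, size=5_000)   # complete synthetic sample (var = 50)
mask = rng.random(y.size) < 0.3       # 30% of values go missing (MCAR)

listwise = y[~mask]                   # deletion: fewer cases, variance preserved
mean_imputed = y.copy()
mean_imputed[mask] = listwise.mean()  # every gap gets the same central value

print(round(y.var(), 1), round(listwise.var(), 1), round(mean_imputed.var(), 1))
```

The mean-imputed series shows a variance roughly 30% smaller than the complete one, which is exactly the underestimation flagged in the table.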


for different variables, are considered "similar" to the target station. Afterwards, if all the "similar" stations record the same value on the given day, that value is assigned straightforwardly to the missing datum (stage 3a); otherwise, even if only one of the similar stations registered a value different from the others, the missing value is estimated by means of a univariate regressive technique (stage 3b) (Neter et al. 1996). In this study, two parametric and two non-parametric techniques were applied and compared. Specifically, the following four regressive techniques were applied: (a) simple substitution, (b) classical least squares univariate parametric regression, (c) ranked regression (Conover 1971) and (d) the Theil (1950) method. The non-parametric methods (c and d) offer the advantage that they do not depend on the relationship (linear, exponential, etc.) between the variables, which is generally unknown, and are intrinsically distribution-free. They therefore produce unbiased results even when the distribution of the dependent variable is extremely asymmetric (such as the daily rainfall in this case study). Consequently, we should expect these two non-parametric techniques to perform better than the other two. In the following sections, the details of the non-parametric techniques are presented. Given that this procedure is always applied to a univariate field, it can be repeated independently as many times as there are similar sites. This has the advantage of filling the time series gaps with a distribution of values (multiple imputation) instead of a single value; the mean and variance of this distribution represent the most probable value and the related estimation uncertainty, respectively.
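The last step, repeating the univariate imputation once per similar site and summarising the resulting distribution, might be sketched as follows; the four estimates below are invented placeholders for the outputs of the per-station regressions:

```python
import numpy as np

# One estimate of the same missing value per "twin" station, each produced
# by an independent univariate regression (values invented for illustration)
estimates = np.array([12.4, 10.9, 13.1, 11.6])

imputed_value = estimates.mean()     # most probable value
uncertainty = estimates.std(ddof=1)  # related estimation uncertainty
print(round(imputed_value, 2), round(uncertainty, 2))
```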

Missingness mechanisms

In order to properly manage matrices characterised by incomplete data series, the first requirement is to identify the mechanism responsible for the loss of data values. When describing the phenomenon of missing data in probabilistic terms, three different types of process can be considered (Little and Rubin 1987). The simplest case, although the least frequent, is when data loss occurs


absolutely at random. This condition, denoted by the acronym MCAR, occurs when the probability that the value of a given variable is missing is constant for all cases and independent of the values of other variables. On the contrary, when the probability that a value is missing depends on the values of other variables, and only on them, the condition is called MAR. The MCAR condition assumes that the distribution of the observed data equals that of the complete dataset and, consequently, that the loss of data does not change the description of the phenomenon. On the other hand, the MAR condition establishes that the two distributions can differ, but that the distribution of the complete dataset can be predicted starting from the observed values (Schafer 1997). Finally, the third and last case arises when the loss of data does not occur randomly at all (NMAR); in this case, the probability that a value is missing depends on the missing value itself. This can happen, for example, when the values exceed the maximum measurable threshold of the gauging instrument. Table 1 provides a formal description of each missing-value mechanism and the appropriate techniques for dealing with it. Once the data loss mechanism has been established, the next step is to choose a suitable technique for managing the missing data. In general, two possibilities exist: (1) to exclude from the data matrix those cases (matrix rows) containing even a single missing datum (missing data elimination techniques) or (2) to fill in the matrix with estimated data values (missing data imputation techniques). Table 2 shows the main techniques presently used, with their related pros and cons. It is important to underline that no best estimation technique in the absolute sense can exist. The effectiveness of each technique depends on a number of factors, among which are the percentage of missing data in the series, the data loss mechanism (MCAR, MAR or NMAR) and the intrinsic characteristics of the considered variable. In general, unlike NMAR, the effects of the MCAR and MAR mechanisms can be negligible when the percentage of missing data is not too great (Little and Rubin 1987). Regression and multiple imputation are both valid provided that the missingness mechanism is not NMAR.


Table 3  Methods suggested taking into account the missing data (MD) rate

MD rate | Suggested methods to deal with missing data
< 5%    | Any method works well
5–10%   | Using a constant value (e.g. the sample mean) might be good, but only if the correlation among the variables is negligible
10–15%  | Single imputation methods (e.g. regression) work well; multiple imputation methods are the best choice
> 15%   | Single imputation methods might be biased; multiple imputation is strongly recommended and almost always reliable
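The guidance of Table 3 could be encoded, for instance, as a small helper; the thresholds follow the table, while the wording of the recommendations is our paraphrase:

```python
def suggest_method(md_rate: float) -> str:
    """Hypothetical helper encoding the guidance of Table 3 (md_rate in %)."""
    if md_rate < 5:
        return "any method"
    if md_rate < 10:
        return "constant value (only if inter-variable correlation is negligible)"
    if md_rate < 15:
        return "single imputation (e.g. regression) or multiple imputation"
    return "multiple imputation"

# e.g. the Alberona station, with 3.47% missing data
print(suggest_method(3.47))
```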

Nonetheless, the computation time, the user's statistical knowledge and the power of the software available for computation should also be taken into account when choosing the method. For instance, concerning the completeness of the data series, Johnson (2003) reports that replacing missing data with the overall data average produces good results only if the correlation between the variables is scarce. When the amount of missing data exceeds 5–10% of the total measured data, Johnson suggests instead the use of single imputation methods, such as regression, or, better still, of multiple imputation methods (Table 3). Currently, the most used method is classical least squares parametric regression, since it combines speed and reliability sufficiently well. Nevertheless, it may produce biased estimates when its underlying conditions, namely normality, variance homogeneity, uncorrelated residuals normally distributed with null expected value, and absence of uncertainty associated with the independent variable(s) X, are not respected (Neter et al. 1996).

Gauging station similarity

In order to quantify the similarities among the pluviometric stations, the non-parametric Spearman correlation coefficient, ρ, was used, calculated with the following formula:

$$\rho = \frac{\sum_{i=1}^{N}\left(R_{x_i} - \frac{N+1}{2}\right)\left(R_{y_i} - \frac{N+1}{2}\right)}{N\left(N^2 - 1\right)/12} \quad (1)$$
where $R_{x_i}$ is the rank of the variable x, $R_{y_i}$ is the rank of the variable y, N is the time series length and the index i refers to the i-th observation (Conover 1971). Since ranking the data frees the variables from their distribution type and from the form of their relationship (linear, exponential, etc.), using this coefficient, rather than its most common parametric counterpart (the Pearson coefficient), allows unbiased results to be obtained even in the presence of asymmetric data distributions. The Spearman coefficient was calculated for every pair of stations, obtaining a symmetrical square matrix (the non-parametric correlation matrix) of M² elements, where M is the number of stations. In order to define the most similar stations, called here "twin stations", a threshold value of the similarity coefficient needs to be chosen. Hubbard (1994) demonstrated that an average distance of 10 km between rainfall stations is sufficient to explain most of the variation between locations. Consequently, the similarity coefficient corresponding to stations located within a range of 10 km was chosen as the threshold.

Regressive methods

A large number of regressive methods could be used as the engine of the proposed methodology for estimating missing values. In particular, the following methods were applied in this study, and a comprehensive evaluation of the most reliable for rainfall data was carried out:

– simple substitution;
– parametric regression (Neter et al. 1996);
– ranked regression (Conover 1971);
– the Theil (1950) method.
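The similarity stage, the Spearman coefficient of Eq. 1 plus thresholding, can be sketched as follows; the station names are borrowed from the paper for readability, but the series, the noise levels and the 0.8 threshold are synthetic assumptions, not the paper's values:

```python
import numpy as np

def spearman(a, b):
    # Spearman's rho as the Pearson correlation of the ranks (no-ties shortcut)
    ra = a.argsort().argsort().astype(float)
    rb = b.argsort().argsort().astype(float)
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(0)
base = rng.gamma(shape=0.5, scale=8.0, size=365)    # skewed, rainfall-like series
series = {
    "Alberona": base,                                # target station
    "Biccari": base + rng.normal(0.0, 0.1, 365),     # strongly related neighbour
    "Volturino": base + rng.normal(0.0, 10.0, 365),  # weakly related neighbour
    "Ortanova": rng.gamma(0.5, 8.0, 365),            # unrelated station
}

threshold = 0.8  # illustrative acceptance threshold
twins = [name for name, s in series.items()
         if name != "Alberona" and spearman(series["Alberona"], s) > threshold]
print(twins)
```

Only the stations whose rank correlation with the target exceeds the threshold are retained as "twin stations" for the regression stage.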

Given that the first two methods are well-known and easily available in the literature (Neter et al. 1996), only the last two methods are described in detail in the following sections.

Ranked regression


Let us consider two time series X and Y, representing the considered variable values (e.g. daily


rainfall) recorded at two different sites, Y being the time series containing the gap to be filled in and X the corresponding time series at one of the similar sites. First, the X and Y values must be ordered and ranked (equal rank positions are assigned to repeated rainfall values); afterwards, the linear regression equation between the ranks is estimated by means of the least squares method. The rank $\hat{R}(y_0)$ of the generic missing value $y_0$ of Y is then estimated through the following equation:

$$\hat{R}(y_0) = b \times R(x_0) + a \quad (2)$$

where b is the regression coefficient, a is the intercept and $R(x_0)$ is the rank of the measured value $x_0$. Finally, the estimated rank $\hat{R}(y_0)$ is back-transformed into the estimate $\hat{y}_0$ by applying one of the following rules:

(a) if $\hat{R}(y_0) = R(y_i)$, then $\hat{y}_0 = y_i$;
(b) if $R(y_i) \le \hat{R}(y_0) \le R(y_j)$, then $\hat{y}_0 = y_i + \frac{\hat{R}(y_0) - R(y_i)}{R(y_j) - R(y_i)}\,\left(y_j - y_i\right)$;
(c) if $\hat{R}(y_0) > \max_i R(y_i)$, then $\hat{R}(y_0) = \max_i R(y_i)$;
(d) if $\hat{R}(y_0) < \min_i R(y_i)$, then $\hat{R}(y_0) = \min_i R(y_i)$,   (3)

where $y_i$ and $y_j$ are observed values and $y_i < y_j$. The main benefit of this method is that it allows the sample frequency distribution to be neglected, even if, in the case of Gaussianity, it is less powerful than its parametric counterpart; in fact, since a rank is an ordinal value, it entails a loss of information with respect to the original data scale (Conover 1971). However, when the samples are drawn from populations which are not normally distributed, non-parametric methods become not only more reliable, but even more powerful than parametric ones (Glantz 1988).
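A minimal sketch of the ranked-regression imputation (Eqs. 2 and 3), assuming no tied values and using adjacent observed ranks for interpolation rule (b); the station data are invented:

```python
import numpy as np

def ranked_regression_impute(x, y, x0):
    # Rank the paired observations (no-ties case, kept simple for illustration)
    rx = x.argsort().argsort() + 1.0   # ranks 1..N of the twin-station series
    ry = y.argsort().argsort() + 1.0   # ranks 1..N of the target series
    b, a = np.polyfit(rx, ry, 1)       # Eq. (2): least squares fit on the ranks
    r0 = b * np.sum(x <= x0) + a       # estimated rank of the missing y value
    ys = np.sort(y)
    n = len(y)
    r0 = min(max(r0, 1.0), float(n))   # rules (c) and (d): clip to observed ranks
    lo = int(np.floor(r0)) - 1
    frac = r0 - np.floor(r0)
    hi = min(lo + 1, n - 1)
    return ys[lo] + frac * (ys[hi] - ys[lo])  # rules (a)/(b): interpolate

x = np.array([0.0, 1.2, 3.5, 7.0, 12.4])  # twin-station values (invented)
y = np.array([0.2, 1.0, 4.1, 8.3, 15.0])  # target-station values (invented)
print(ranked_regression_impute(x, y, 3.5))
```

Because the series are perfectly concordant here, the estimated rank of the gap equals the rank of x0 and the imputed value coincides with the corresponding observed y.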

The Theil method

Considering again X and Y as defined in the previous section, let $x_i$ be the i-th observation of the variable X and $y_i$ the corresponding value of the dependent variable Y. For each pair of observations $(x_i, y_i)$ and $(x_j, y_j)$, the angular coefficient $b_{ij}$ can be evaluated by means of:

$$b_{ij} = \frac{y_j - y_i}{x_j - x_i}. \quad (4)$$

For the inference of the regressive model, $b = b^*$ is assumed, b being the angular coefficient of the regression line and $b^*$ the median of the angular coefficients of the straight lines connecting each pair of observations. Since N(N − 1)/2 values of $b_{ij}$ need to be calculated for N observations, the larger the time series, the longer the computational procedure becomes. Theil (1950) proposed a shortened method, which requires fewer computations, albeit producing less reliable results than the original method. In the Theil method, the measured data must be ordered according to increasing values of X and split into two halves (if the number of observations is odd, one observation, generally the median value of X, is discarded); then N/2 differences are calculated for both variables, X and Y, according to the following relationships:

$$\Delta x_i = x_{(i+N/2)} - x_i, \qquad \Delta y_i = y_{(i+N/2)} - y_i. \quad (5)$$

Finally, the angular coefficient of the regression line, $b^*$, and the intercept, $a^*$, are evaluated by means of:

$$b^* = \operatorname{median}(\Delta y_i) / \operatorname{median}(\Delta x_i), \quad (6)$$

$$a^* = \operatorname{median}(y_i) - b^*\,\operatorname{median}(x_i). \quad (7)$$

In this way, a regression line is obtained which passes through the crossing point of the medians (instead of the means), which is considered the non-parametric centre of the cloud of points (Theil 1950).

Case study

Study area

The Candelaro river catchment (Fig. 1a) is located in the northern part of the Apulian region, in a geo-morphological environment typical of the


Tavoliere of Apulia (the largest alluvial plain in southern Italy and the second largest in Italy), and it extends over more than 2,330 km². The entire basin is enclosed between the Apennine Chain to the west and the Mesozoic carbonate platform (the Apulian foreland) that crops out in the Gargano Promontory to the east. Morphologically, the basin is characterised by a mean elevation of 300 m, a maximum of 1,150 m and a minimum of 0 masl. Three different morphological zones can be distinguished (Fig. 1b):

– Zone 1, corresponding to the eastern border of the Apennine Chain (Subappennino Dauno);
– Zone 2, the most extensive, corresponding to the Tavoliere plain;
– Zone 3, corresponding to the southern slope of the Gargano Promontory.

Fig. 1  a Study area. b The Candelaro River Basin: morphological zones. c The Candelaro River Basin: meteorologically homogeneous areas

Rainfall regimes

On the basis of the rainfall values, it is possible to define two homogeneous areas (A and B) in the basin (Fig. 1c), basically influenced by ground elevation. The homogeneous climatic area B corresponds to morphological zones 1 and 3, whilst the homogeneous area A coincides with the Tavoliere plain (morphological zone 2). In general, rainfall over the study area shows a twofold behaviour depending on the season. During the winter, the basin is mainly affected by frontal cyclonic precipitation due to the collision of cold air masses coming from the Balkans with the relatively warm masses coming from Africa. Locally, rainfall events are partially moderated by the Gargano Mountains, which act as a protective barrier, producing a consistent decrease in the winter rainfall rate to an average seasonal value lower than 400 mm. During the summer, rainfall behaviour changes to convective and orographic, generally characterised by violent, short-lasting and spatially scattered downpours; nevertheless, the average seasonal rainfall values rarely exceed 100 mm. The rainfall regime of the study area is usually assumed to be Mediterranean,


Fig. 2 The Candelaro River Basin: rainfall regime

which means hot, dry summers and cool, wet winters. Locally, the pluviometric regime is a typical Adriatic regime (Fig. 2), characterised by a main maximum during the late fall and a secondary, smaller peak during the early spring; the minimum rainfall rate usually occurs during the summer. Analysing the data from each time series of the monitoring network confirms the expected rainfall regime (Barca et al. 2006).

Data

In this study, the proposed methodology was applied to daily rainfall time series originating from 16 stations irregularly positioned within the Candelaro River Basin (Fig. 1a). The time series range from January 1, 1970 to December 31, 2001 at 14 stations, whilst they start from January 1, 1976 at the remaining two stations (Istituto Centrale di Statistica 1983). The elevation of the stations ranges from 4 masl (Manfredonia station) to 910 masl (Orto di Zolfo station).

Apart from this monitoring network, data are also available from ten stations located outside the river basin, although in its neighbourhood; the time series for these ten stations also start from January 1976. The gauging stations all belong to the meteorological monitoring network of the Hydrographic Service of the Land Protection Department of the Apulia Region. Tipping-bucket rain gauges are used for measuring rainfall rates, and daily total amounts are conventionally recorded each day at 9:00. None of the considered gauging stations is equipped with systems able to minimise the wind effect on the measurements (Sevruk 1986). The average distance between the monitoring stations is around 8.5 km, with a standard deviation of 4.5 km. The methodology was applied to the Alberona rainfall gauging station, which is affected by an intermediate amount of missing data (3.47%) and is surrounded by four twin stations (Biccari, Volturino, Tertiveri and Orto di Zolfo) within a neighbourhood of 10 km (Fig. 1a).


Exploratory data analysis

Data distribution

Analysing daily rainfall time series presents an intrinsic difficulty, since precipitation is characterised by intermittence. Consequently, both a discontinuous and a continuous component need to be considered to describe the phenomenon: the first component records the occurrence of the rainfall event, whilst the second reports the measured rainfall amount (Dunn 2003). Table 4 shows the main statistics related to the annual amount of precipitation, site by site, calculated exclusively for rainy days (HR > 1 mm), and the relative frequencies of the rainfall events. In general, it can be observed that both rainfall amounts and events increase with altitude (orographic effect), even though some exceptions exist (Fig. 3). A more detailed analysis shows that the gauging stations fall into two groups, each characterised by a different range of ground elevations. The first group includes all the gauging stations located between 0.0 and 400.0 masl; the median of the daily rainfall rates of this group is about 4 mm, with the exception of three "anomalous" sites (Lesina, Sannicandro and Cagnano-Varano), whose medians stand around 5 mm. The second group, which includes all the stations located above 400.0 masl, is mostly characterised by a median daily rainfall rate of 5 mm. Even in this case, a couple of stations make an exception to the rule: Monte Sant'Angelo and Volturino, whose medians are more similar to those of the lower stations. A possible explanation of such anomalies could be the different geographic exposure of the three stations of the first group; in fact, these stations, located on the Gargano promontory, are the only three north-facing ones. Concerning the second group, the anomaly of the Monte Sant'Angelo station could be explained by its proximity to the sea, unlike the other stations of the group, which are all located inland. Finally, the low median value of Volturino is probably due to the amount of missing data affecting this station.


Furthermore, in agreement with Chandler and Wheater (1998), the more elevated stations show a larger spread of data in comparison with those of the first group. The first simple data analyses (see above) revealed that daily precipitation values vary greatly over time. Indeed, a very high variability was found even when comparing the monthly rainfall values year after year at the same station. For example, Fig. 4 clearly shows the difference between July and November rainfall values at the Foggia_OSS gauging station, the time series of this station being the most complete. In particular, the roughly 160-mm value recorded in the extremely wet November of 1985 is noteworthy, compared to the roughly 1-mm value registered during the relatively dry November of 1973. Concerning the theoretical rainfall probability distribution, the high value of the skewness coefficient and the median position in the box-and-whiskers representation (Fig. 3) unequivocally show that the distribution is strongly asymmetrical at all the stations (many dry days and few rainy days). A preliminary idea of the theoretical distribution model can, therefore, be drawn by computing the Pearson coefficient k according to the following relationship (Yevjevich 1972):

$$k = \frac{C_s^2\,(E + 6)^2}{4\left(2E - 3C_s^2\right)\left(4E - 3C_s^2 + 12\right)} \quad (8)$$

where $C_s$ is the coefficient of skewness and E is the kurtosis, calculated as the deviation of the sample value from the theoretical Gaussian distribution. This analysis shows that a Pearson distribution of type III (commonly known as a gamma distribution) can be associated with all the stations (Yevjevich 1972).

Testing the missingness mechanisms

Missing data distribution within the time series (NMAR hypothesis rejection)

As stated above, the ultimate objective of this paper was the application of a regressive method


Table 4  Percent frequencies of rainy days for each monitoring station and descriptive statistics of the amount of precipitation

Station name | Elevation (masl) | Rainy days (%) | Average (mm) | Median (mm) | Standard deviation (mm) | Maximum value (mm) | Skewness | Kurtosis
Manfredonia | 2 | 15.73 | 6.4 | 3.8 | 7.28 | 60.2 | 2.81 | 10.13
Lesina | 5 | 18.93 | 8.2 | 5.0 | 9.61 | 121.6 | 3.26 | 18.89
Fonterosa | 25 | 16.46 | 6.7 | 4.0 | 7.67 | 76.6 | 2.97 | 13.13
Orsara | 55 | 16.56 | 6.7 | 4.2 | 7.42 | 67.6 | 2.92 | 12.12
Foggia_IAC | 74 | 16.47 | 6.6 | 3.8 | 7.56 | 65.8 | 2.94 | 11.80
Foggia_OSS | 74 | 17.69 | 6.9 | 4.0 | 8.17 | 88.8 | 3.42 | 18.03
S.severo | 87 | 18.30 | 7.1 | 4.3 | 8.10 | 119.8 | 3.80 | 28.53
Cagnano | 150 | 21.22 | 9.4 | 5.6 | 10.64 | 101.8 | 2.72 | 11.41
Torremaggiore | 169 | 18.19 | 7.3 | 4.4 | 8.11 | 68.6 | 2.67 | 10.01
Ortodizolfo | 176 | 18.96 | 7.2 | 4.0 | 8.49 | 92.8 | 3.21 | 16.15
Sannicandro | 224 | 21.29 | 9.5 | 5.4 | 12.19 | 132.0 | 3.64 | 20.10
PietramEAAP | 225 | 19.14 | 7.2 | 4.4 | 8.19 | 106.8 | 3.30 | 19.38
Lucera | 251 | 19.40 | 7.1 | 4.0 | 8.16 | 83.6 | 2.90 | 12.45
Castelluccio | 284 | 18.92 | 7.4 | 4.4 | 8.61 | 93.0 | 3.33 | 17.66
Tertiveri | 352 | 20.35 | 7.2 | 4.4 | 7.94 | 83.4 | 2.74 | 11.50
Troia | 439 | 20.04 | 7.7 | 4.4 | 8.83 | 103.4 | 3.05 | 15.22
Biccari | 449 | 22.91 | 8.6 | 5.4 | 9.67 | 92.6 | 2.70 | 10.75
Pietram | 456 | 22.94 | 9.2 | 5.2 | 10.88 | 126.8 | 2.80 | 12.00
S.giovanni | 557 | 23.12 | 8.6 | 5.2 | 9.74 | 104.8 | 3.12 | 15.31
S.marco | 560 | 24.28 | 9.3 | 5.6 | 11.76 | 259.2 | 5.81 | 82.93
Bovino | 646 | 23.63 | 8.7 | 5.2 | 9.91 | 89.6 | 2.80 | 11.44
Monte s.angelo | 650 | 23.60 | 8.9 | 5.4 | 10.09 | 101.6 | 3.08 | 14.42
Alberona | 700 | 24.15 | 8.6 | 5.2 | 9.79 | 122.4 | 3.19 | 17.51
Savignano | 718 | 22.85 | 7.7 | 5.0 | 7.95 | 75.0 | 2.66 | 11.72
Volturino | 735 | 20.30 | 7.4 | 4.4 | 8.43 | 77.6 | 2.66 | 9.61
Pidocchiara | 843 | 20.96 | 8.4 | 4.4 | 11.20 | 123.4 | 3.72 | 20.51
Faeto | 905 | 24.86 | 8.1 | 5.0 | 8.99 | 120.0 | 3.30 | 20.73
Ortanova | 910 | 25.10 | 9.1 | 5.2 | 11.24 | 151.8 | 3.65 | 21.78

to fill in the gaps within the rainfall time series; this can be achieved only if the mechanism producing the data loss is not NMAR (Table 1). To exclude the NMAR hypothesis, the latter was tested on the basis of empirical knowledge regarding the rainfall phenomenon in Apulia. It was assumed that the loss of data was due to gauge failures caused by particularly intense rainy events. If this hypothesis is correct, and assuming that the gauge starts working correctly again soon after any intense rainfall event, then high rainfall rates should be lost more frequently than low values. This would mean that the probability that a certain rainfall value is missing depends on the data value itself, which is exactly the definition of the NMAR mechanism. In practice, if the NMAR

hypothesis is true, both the following statements should be verifiable:

1. a positive correlation between the amount of missing data and the elevation of any station should exist, given that rainfall tends, on average, to increase with altitude;
2. the amount of missing data should be affected by an evident seasonal behaviour, given that autumn–winter semesters are generally rainier than spring–summer ones.

The amount of missing data was calculated as the percent ratio between the number of missing daily data and the overall number of observation days, excluding those years when the gauge stopped working; the highest values were found for the


Fig. 3 Box plots of daily rainfall rate distributions per monitoring station (the monitoring stations are in order of increasing height above sea level)

Fig. 4 Total monthly rainfall rate variability in July and November at Foggia_OSS monitoring station


Fig. 5 Mean monthly rate of missing data in the stations of Alberona and Foggia_OSS

stations of Volturino and Troia (8% and 5%, respectively) (Table 5). The correlation between the amount of missing data and the elevation turned out to be very low (R = 0.26), apparently suggesting that the NMAR hypothesis should be rejected. In fact, this result could also be explained by the complexity of the relationship between rainfall rates and topography; several authors have demonstrated that, when determining the rainfall rates at a given site, other topographical variables, such as terrain slope, exposure and distance from the sea, also play an important role (Prudhomme and Reed 1998; Johansson and Chen 2003). Nevertheless, if the second statement were also contradicted by the data, this would result in a rejection of the NMAR hypothesis, and therefore, it could be stated with sufficient confidence that this is not the mechanism producing data loss in the study case. The monthly percentages of missing rainfall data were then explored in order to identify a seasonal behaviour pattern. However, an almost uniform distribution of these percentages during

the year emerged. In fact, Fig. 5, which shows the mean monthly rate of missing data over the whole

Table 5 Percent rate of missing data related to the 18 stations of the monitoring network located in the study area (Candelaro)

Station name                  Rate (%)
Alberona                      3.47
Biccari                       1.38
Faeto                         2.93
Foggia_IAC                    3.74
Foggia_OSS                    0.14
Lucera                        0.65
Manfredonia                   4.40
Pidocchiara                   2.09
Orto di Zolfo                 3.34
Pietramontecorvino            0.47
Pietramontecorvino EAAP       0.68
San Marco in Lamis            1.64
San Severo                    3.41
San Giovanni Rotondo          1.44
Tertiveri                     1.27
Torremaggiore                 2.51
Troia                         4.79
Volturino                     7.94
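As a sketch of the first NMAR check, the missing-data rates of Table 5 can be paired with the station elevations of Table 4 and correlated. The pairing below was assembled by us from the two tables; the resulting coefficient is of the same low order as the paper's R = 0.26, and small differences presumably stem from how the original computation paired or pre-processed the stations.

```python
import numpy as np

# Elevations (masl, Table 4) and percent missing-data rates (Table 5)
# for the 18 stations of Table 5, paired by station name (our pairing).
elevation = np.array([700, 449, 905, 74, 74, 251, 2, 843, 176,
                      456, 225, 560, 87, 557, 352, 169, 439, 735], float)
missing_rate = np.array([3.47, 1.38, 2.93, 3.74, 0.14, 0.65, 4.40, 2.09, 3.34,
                         0.47, 0.68, 1.64, 3.41, 1.44, 1.27, 2.51, 4.79, 7.94])

# Correlation between elevation and missing-data rate: a low value
# argues against the first NMAR condition.
r = np.corrcoef(elevation, missing_rate)[0, 1]
print(f"R = {r:.2f}")
```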


observation period, seems to highlight a slight difference between wet and dry months. However, no relationship was found between the amount of missing data and the measurement period, as demonstrated below, where those values are investigated more rigorously and in detail. The standardised entropy H of the local probability distribution, Eq. 9, was used to study the monthly distribution of missing data:

H = − [ Σ(k=1..K) p(k) ln p(k) ] / ln K          (9)

where p(k) is the percentage frequency of missing data in the k-th month. According to Shannon (1948), the standardised measure of local entropy is obtained by dividing H by its upper bound (ln K), which is the entropy associated with the uniform distribution, whose probability is p(k) = 1/K, corresponding to the maximum entropy. In practice, using the standardised entropy as a uniformity index of the missing data distribution during the average year allows us to determine objectively whether the distribution is affected by a seasonal behaviour pattern. In fact, once the standardised entropy has been computed as a function of the monthly frequencies of missing data per station, the closer this parameter is to 1, the more uniform the missing data distribution over the months is. Table 6 reports the standardised entropy values for the considered monitoring station (Alberona) and for all the “similar

Table 6 Percentages of missing data per month, over the whole observation period (32 years), at the Alberona station and at the “similar stations” of Biccari, Orto di Zolfo and Tertiveri, and the standardised entropy related to each monthly distribution of missing data

stations” (Biccari, Orto di Zolfo, Tertiveri and Volturino). The entropy values reported in Table 6 are evidently close to 1, allowing us to reject the hypothesis that a relationship exists between the amount of missing data and the season and, consequently, to reject the NMAR mechanism as being responsible for the presence of the missing data. Excluding any dependence of missing rainfall data on rainfall amount, the MAR and MCAR mechanisms remain to be tested.

Variables influencing rainfall measures (MCAR hypothesis rejection)

Gauging measures are always affected by a number of errors, which lead, as a rule, to an underestimation of the actual rainfall rate. The scientific literature (Drécourt and Madsen 2002; Vejen et al. 1998; Rubel and Hantel 1999) reports three main sources of error:

1. wind-induced losses, which usually produce the largest error;
2. wetting of the walls and evaporation from the tipping bucket;
3. instrumental accuracy and precision.

Furthermore, several secondary sources of error are also reported in the literature, the most frequent of which are splash-in, splash-out, windshield, exposure and temperature effects. Figure 6 summarises graphically the basic components of the systematic error of precipitation measurement (Goodison et al. 1998).

Month                 Alberona  Biccari  Orto di Zolfo  Tertiveri  Volturino
January               8.6       16.7     30.7           11.5       10.8
February              10.5      13.0     19.4           7.8        8.1
March                 8.3       12.3     16.6           8.2        8.7
April                 5.1       3.7      0.0            5.9        9.9
May                   6.3       1.2      1.8            6.5        7.5
June                  4.6       5.6      0.0            6.8        7.4
July                  4.3       1.2      2.0            6.1        7.1
August                6.3       0.0      3.6            6.7        7.7
September             4.9       13.0     0.0            9.4        8.5
October               12.3      16.0     2.6            10.6       7.9
November              15.0      9.3      12.0           10.4       8.2
December              13.7      8.0      11.3           10.2       8.2
Standardised entropy  0.96      0.88     0.74           0.99       0.99
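The seasonality test of Eq. 9 can be sketched in a few lines; the monthly percentages are those of the Alberona column of Table 6, normalised into a probability distribution before computing the entropy.

```python
import math

# Monthly percent frequencies of missing data at Alberona (Table 6)
monthly = [8.6, 10.5, 8.3, 5.1, 6.3, 4.6, 4.3, 6.3, 4.9, 12.3, 15.0, 13.7]

def standardised_entropy(freqs):
    """Eq. 9: Shannon entropy of the monthly distribution divided by its
    upper bound ln(K), so that 1 means a perfectly uniform distribution."""
    total = sum(freqs)
    p = [f / total for f in freqs if f > 0]   # 0 * ln(0) treated as 0
    h = -sum(pk * math.log(pk) for pk in p)
    return h / math.log(len(freqs))           # K = 12 months

print(round(standardised_entropy(monthly), 2))  # 0.96, as in Table 6
```

A value this close to 1 indicates a near-uniform distribution of missing data over the months, i.e. no seasonal pattern.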


Fig. 6 Basic components of systematic error of precipitation measurement (Goodison et al. 1998)

The first source of error comes from the use of elevated can-type gauges, which are subject to systematic errors mostly due to wind field deformation above the precipitation gauge orifice (Sevruk 1986; Sevruk and Nespor 1998). An elevated precipitation gauge systematically deforms the wind field and forces the wind speed to increase over the gauge orifice (blocking effect). Due to this adverse wind action, some of the lighter precipitation particles are borne away before reaching the gauge and are lost for measurement purposes. The wind-induced loss depends on wind speed, the weight of the precipitation particles (i.e. the intensity of precipitation) and the gauge construction parameters. The use of a windshield and a low installation height of the gauge can help to reduce the wind speed at the level of the gauge orifice. Shielded gauges reduce the wind-induced loss by up to 50% and even 70% for snow and mixed precipitation, respectively. Summing up, the wind-induced error is small for large intensities, small installation heights and gauges equipped with a windshield. On average, when using unshielded gauges, wind speeds greater than 4 m/s produce underestimations in the range of 2–10% with respect to the true rainfall value, and of up to 60% when measuring snowfall.

The second main source of error, i.e. wetting (2–10%) and evaporation (0–4%) losses, depends on the age, material and dimensions of the tipping bucket. The shape of the gauge orifice rim can have a certain effect too. Evaporation loss also depends on the gauge construction parameters. Finally, the third source of error obviously depends on the characteristics of the measuring instrument, and it is generally modelled as white noise.

An important consideration can be made at this point with regard to the missingness assumptions. Concerning the choice between the MAR and MCAR hypotheses, having accepted the influence of other variables on rainfall measurement, it can reasonably be affirmed that these variables also have some influence on the instrumental failures causing data losses. Accepting this last statement is equivalent to rejecting the MCAR mechanism as the one responsible for missing data. To this end, Rubin (1976) and Scheffer (2002) point out that “missing data is very rarely MCAR”.

Station similarity

As stated above, a crucial step in the proposed methodology lies in the selection of the stations most similar (twin stations) to the gauging station affected by missing data (target station). In order to find twin stations, the Spearman coefficients (similarity index) were ranked, coupling the target station time series with those of each of the nearby stations. In determining the similarity index values, the whole time series were considered, including the “dry” days characterised by no rainfall. This choice was motivated by the following considerations:

1. missing data can also include non-rainy days, since we hypothesise that they are missing at random;
2. using only rainy days could bias the similarity index values, since it would not allow us to consider those events where a zero value was registered at the target station whilst a rainfall event occurred at a neighbouring station, and vice versa.

Table 7 reports all the Spearman coefficients ρ related to each station, ordered in descending mode, whilst Fig. 7 shows the spatial distribution

Table 7 Similarity index values related to the Alberona station

Gauging station   Spearman coefficient (similarity index)
Biccari           0.82218
Volturino         0.81161
Tertiveri         0.80540
Ortodizolfo       0.79118
Pietram           0.77727
Bovino            0.76328
Orsara            0.76223
Troia             0.74859
Faeto             0.74850
PietramEAAP       0.73434
Savignano         0.72235
Lucera            0.71503
Castelluccio      0.70699
Foggia_OSS        0.67946
Foggia_IAC        0.65715
S.marco           0.64457
Torremaggiore     0.64363
S.severo          0.63771
S.giovanni        0.63406
Monte s.angelo    0.62916
Ortanova          0.60042
Sannicandro       0.59760
Lesina            0.59647
Manfredonia       0.59206
Cagnano           0.58813
Fonterosa         0.56790

of the similar stations referred to Alberona in the study area. In agreement with the criterion explained in Section Gauging station similarity, values of ρ larger than 0.75 were found between stations located within a range of 10 km; this value was therefore chosen as the threshold for characterising the twin sites. Finally, as Table 7 and Fig. 7 clearly show, the twin stations of Alberona are those of Biccari, Volturino, Tertiveri and Orto di Zolfo. These four stations were used to estimate the daily missing values at Alberona.

Biased rainfall values and parametric regression

At this point, it is opportune to explain that, even though the proposed methodology does not exclude a priori any specific kind of regression model, its reliability can be improved, to some

Fig. 7 Spatial distribution of stations similar to Alberona in the study area

Fig. 8 Mean monthly behaviour of the index of similarity of the four most similar stations to Alberona


extent, by choosing the regressive model taking into account the statistical distribution of the data. Concerning the specific case study, the authors had previously noticed that rainfall measurements are often influenced by other variables (ref. Section Variables influencing rainfall measures (MCAR hypothesis rejection)). This influence, even though systematic, is not constant for each rainfall event but depends strongly upon the instantaneous intensity of the disturbing variables. Moreover, the global effect of the three main sources of error reported in the cited section cannot be considered white noise, because it evidently always produces underestimated rainfall measures. These two issues lead to the rejection of the homoscedasticity hypothesis, which underlies parametric regression (Neter et al. 1996). In brief, the assumption of homoscedasticity means that the variance of the regression error is assumed to be constant in the population, conditional on the explanatory variables. The assumption fails when the variance changes in different segments of the population (Wooldridge 2006). This does not affect the computation of point estimates but, if the error variance is not constant, parametric regression is no longer the best linear unbiased estimator. Summarising, in Section Data distribution, it was rigorously proved that the daily rainfall distribution law is not Gaussian but gamma; in Section Variables influencing rainfall measures (MCAR hypothesis rejection), it was shown that the rainfall measurements used to estimate missing data are not error-free; finally, in the current section, the homoscedasticity hypothesis was rejected. Consequently, it can be concluded that non-parametric regressive models, being essentially distribution-free, are more suitable than parametric ones. Furthermore, this conclusion is supported by the cited scientific literature.

Table 8 Main statistics of absolute error distribution

Station         Statistic  Simple substitution  Parametric regression  Conover  Theil
Biccari         Median     0.6                  −0.9                   0.4      0.0
                Mode       0.6                  −0.9                   −0.2     0.0
                IR         4.0                  4.1                    5.0      4.0
                Min        −54.4                −42.4                  −12.0    −43.5
                Max        58.6                 61.4                   110.4    61.6
                Skewness   −0.08                1.40                   3.84     1.18
                Kurtosis   12.85                12.79                  26.61    12.68
Volturino       Median     1.0                  −1.0                   0.4      −0.3
                Mode       1.0                  −1.5                   −0.4     −0.5
                IR         3.8                  4.0                    5.2      3.9
                Min        −41.4                −35.5                  −11.0    −38.9
                Max        56.8                 58.6                   68.8     57.4
                Skewness   0.70                 1.52                   3.08     1.02
                Kurtosis   14.11                13.93                  15.29    14.14
Tertiveri       Median     1.5                  −1.0                   0.4      −0.1
                Mode       1.2                  −1.8                   −0.6     −0.4
                IR         4.2                  4.3                    5.6      4.2
                Min        −46.4                −44.9                  −11.9    −48.0
                Max        74.8                 72.4                   110.4    73.2
                Skewness   1.28                 1.66                   3.73     1.28
                Kurtosis   15.27                15.21                  25.52    15.27
Orto di Zolfo   Median     −0.6                 −1.2                   0.4      −0.1
                Mode       −1.0                 −3.1                   −1.8     −1.3
                IR         4.2                  4.4                    5.4      4.0
                Min        −74.8                −45.3                  −10.6    −66.8
                Max        48.6                 60.4                   111.6    48.6
                Skewness   −1.66                1.37                   3.81     −1.01
                Kurtosis   18.25                12.80                  26.23    16.63
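Among the non-parametric options, the Theil (1950) estimator fits a line whose slope is the median of all pairwise slopes, so it requires neither Gaussian errors nor homoscedasticity. A minimal sketch follows; the function and variable names are ours and the data are synthetic.

```python
import numpy as np

def theil_fit(x, y):
    """Theil (1950) rank-invariant line: slope = median of all pairwise
    slopes, intercept = median of the residual offsets."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    i, j = np.triu_indices(len(x), k=1)
    valid = x[j] != x[i]                      # skip pairs with equal x
    slopes = (y[j] - y[i])[valid] / (x[j] - x[i])[valid]
    slope = np.median(slopes)
    intercept = np.median(y - slope * x)
    return slope, intercept

# Twin-station rainfall with one corrupted day: the median of pairwise
# slopes ignores the outlier that would drag least squares off course.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0
y[5] = 60.0                                   # gross error on one day
slope, intercept = theil_fit(x, y)
print(slope, intercept)                       # 2.0 1.0 despite the outlier
```

An ordinary least-squares fit on the same data would be pulled strongly towards the corrupted point, which is why the median-of-slopes construction suits heavy-tailed rainfall errors.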

Results

Figure 8 shows the monthly behaviour of the Spearman coefficient, assumed as an index of similarity, between Alberona and the twin stations around it. The similarity is greater during the winter than in the summer, which is characterised by localised, intense and short rainfall phenomena. New, complete time series for Alberona were generated applying the four methods described above and using, as the known variable, the time series of each of the twin stations of Alberona (i.e. Biccari, Orto di Zolfo, Tertiveri and Volturino). The reliability of the generated time series was evaluated by computing the “absolute error”, that is, the difference between the measured and the generated daily values. Table 8 reports some statistics of the absolute error distributions, crossing similar stations and methods. The median and the inter-quartile range (IR) were chosen as the main descriptive parameters instead of the mean and standard deviation. The reason for this choice lies in the evident departure from Gaussianity of all the error distributions, as the skewness coefficient shows; in such cases, the median value, which is not influenced by extreme values, is preferable. Considering these two statistics, a preliminary idea of the differences between the four methods can be gained. In general, the closer the median is to zero, the more the distribution of the absolute errors resembles white noise. Furthermore, the shorter the IR, the more the absolute error values are concentrated around zero. Consequently, looking at the statistics of the four distributions per similar station, the one characterised by both the median closest to zero and the shortest IR should be considered the most reliable. Given the median values, the non-parametric methods (ranked regression and Theil) seem to perform better than the other two. The mode value, which indicates the most frequent value, also seems to confirm that the non-parametric methods tend to provide better results. Nevertheless, the high IR value related to ranked regression (Conover method) strongly reduces the suitability of that method. On the contrary, the Theil method is confirmed as being very reliable, as it is characterised by median values close to

zero and low IR values. Considering the median and the IR values jointly, simple substitution also seems rather reliable, at least when stations with a strong similarity are used for the computations. Finally, as expected, the parametric regression method again proves to be the least suitable of the four tested. It is noteworthy that the Theil method also appears more robust than the other methods, since the basic statistics used for evaluating the methods' reliability remain almost invariant even for less similar stations.
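The selection rule just described (median closest to zero, shortest IR as tie-breaker) can be sketched as follows; the error samples and method names here are illustrative placeholders, not the paper's data.

```python
import numpy as np

def median_and_ir(errors):
    """Median and inter-quartile range (IR) of an absolute-error sample."""
    q1, med, q3 = np.percentile(errors, [25, 50, 75])
    return med, q3 - q1

# Hypothetical absolute errors (measured minus estimated) per method
errors = {
    "substitution": np.array([-1.2, 0.4, 0.9, -0.3, 2.5, -0.6]),
    "theil":        np.array([-0.4, 0.1, 0.3, -0.2, 1.0, -0.1]),
}
for method, e in errors.items():
    med, ir = median_and_ir(e)
    print(f"{method}: median={med:+.2f}, IR={ir:.2f}")

# Rank methods: median nearest zero first, shorter IR breaks ties
best = min(errors, key=lambda m: (abs(median_and_ir(errors[m])[0]),
                                  median_and_ir(errors[m])[1]))
print("most reliable:", best)
```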

Table 9 Percent of errors within different ranges around zero in each of the four “most similar” stations to Alberona (in italics, the highest percent values)

Range   Substitution  Regression  Ranked regression  Theil
Biccari
±0.25   8             5           9                  10
±0.50   13            11          16                 19
±0.75   19            17          22                 25
±1.00   27            24          29                 31
±1.25   36            31          36                 36
±1.50   41            37          40                 42
±2.00   50            47          48                 51
Volturino
±0.25   9             5           10                 8
±0.50   14            10          17                 17
±0.75   20            15          23                 25
±1.00   28            20          30                 32
±1.25   38            27          37                 38
±1.50   44            34          41                 43
±2.00   52            47          48                 52
Tertiveri
±0.25   6             5           8                  10
±0.50   10            10          15                 16
±0.75   14            15          21                 23
±1.00   20            20          27                 31
±1.25   31            26          35                 38
±1.50   36            31          39                 42
±2.00   45            45          47                 50
Orto di Zolfo
±0.25   8             5           10                 6
±0.50   13            9           16                 13
±0.75   18            13          21                 21
±1.00   27            19          28                 28
±1.25   35            25          35                 36
±1.50   40            30          39                 42
±2.00   49            41          50                 50
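The within-range percentages of Table 9 can be computed from an error sample in a few lines; the error values below are illustrative, not the paper's data.

```python
import numpy as np

def percent_within(errors,
                   half_widths=(0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 2.00)):
    """Percent of errors falling inside +/- each half-width around zero."""
    errors = np.asarray(errors, float)
    return {w: 100.0 * np.mean(np.abs(errors) <= w) for w in half_widths}

# Illustrative absolute-error sample (mm)
errors = [0.1, -0.3, 0.6, -0.9, 1.4, -2.5, 0.2, 3.0]
for w, pct in percent_within(errors).items():
    print(f"±{w:.2f}: {pct:.0f}%")   # e.g. ±0.25: 25%, ..., ±2.00: 75%
```

A method whose percentages grow fastest near zero clusters its errors most tightly, which is the criterion used to read Table 9.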


Fig. 9 a Box plots of the absolute errors in the estimated values for Alberona using the Biccari station recorded data. b Box plots of the absolute errors in the estimated values for Alberona using the Orto di Zolfo station recorded data. c Box plots of the absolute errors in the estimated values for Alberona using the Tertiveri station recorded data. d Box plots of the absolute errors in the estimated values for Alberona using the Volturino station recorded data

A more accurate evaluation of the error dispersion was carried out considering the percent of error values clustered in seven given intervals of increasing size around zero (Table 9); in fact, zero represents the most preferable error value (white

noise error distribution) since it would mean that the estimated values are identical to the actual ones. Table 9 outlines how the Theil method is almost always able to cluster a larger amount of estimated errors around the most desirable value

of zero. The values reported in Table 9 also show that even simple substitution performs satisfactorily when the similarity coefficient is relatively high. Finally, the box diagrams in Fig. 9 summarise graphically all the considerations discussed above. These diagrams show clearly enough that the ranked regression method, even though characterised by an almost null median, has a larger IR value and is evidently non-symmetric in shape, confirming that it may not be the most reliable method to apply. On the contrary, the plots confirm that the non-parametric Theil technique provides the most accurate estimations of missing daily rainfall data.
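The similarity ranking that produced Table 7, i.e. the first stage of the methodology, can be sketched as follows. The tie-aware ranking matters because dry days give many tied zero values; the station series below are synthetic, and the function names are ours.

```python
import numpy as np

def average_ranks(v):
    """Ranks 1..n, with tied values replaced by their average rank."""
    v = np.asarray(v, float)
    order = np.argsort(v, kind="stable")
    ranks = np.empty(len(v))
    ranks[order] = np.arange(1, len(v) + 1)
    for val in np.unique(v):          # average the ranks of tied values
        tied = v == val
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Spearman rho = Pearson correlation of the two rank vectors."""
    return np.corrcoef(average_ranks(x), average_ranks(y))[0, 1]

# Toy daily series: target station vs two candidates (0.0 = dry day)
target = np.array([0.0, 0.0, 5.2, 0.0, 12.4, 3.1, 0.0, 8.0])
cand_a = np.array([0.0, 0.0, 4.8, 0.0, 10.9, 2.6, 0.0, 7.1])  # co-varies
cand_b = np.array([6.0, 0.0, 0.0, 3.0, 0.0, 0.0, 9.5, 0.0])   # does not

print(spearman(target, cand_a), spearman(target, cand_b))
```

Candidate stations are then sorted by this coefficient, and those above the chosen threshold (ρ > 0.75 in the case study) become the twin stations.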

Conclusion

In conclusion, a methodology for filling in gaps in time series was proposed and applied, in this paper, to rainfall time series. Given a gauging station affected by missing data (target station), the methodology consists of two steps. In the first step, the randomness of the missing data is checked and a classification of “similarity” between the target station and the other gauging stations of the study area is carried out. Similarity is defined as the Spearman coefficient between the daily rainfall time series of the target station and that of each of the other stations of the study area. The second step is the estimation phase proper. Four different estimation methods were tested with the aim of assessing the most reliable one for the studied variable. Obviously, these methods, even though all univariate, require different levels of user expertise; from this point of view, the simplest method is indeed “simple substitution”, whilst the most demanding is the Theil method. In between, the parametric and the ranked regressions were also applied. The results show that, even if the Theil method is the most reliable, simple substitution may be acceptable, particularly when the similarity value is significantly high. Both parametric and ranked regressions proved barely reliable and, in particular, the latter was the worst.

Further developments

To improve the estimation of missing data starting from the proposed methodology, various paths, currently under study, can be followed. One development consists in extending the techniques already used in the univariate setting to the bivariate and multivariate settings. Unfortunately, at present, the literature on multivariate non-parametric regression and the related software is rather scarce. Consequently, substantial theoretical and practical efforts are needed to follow this path. Nevertheless, using univariate methods is not a point of weakness of the proposed methodology; this approach follows the modern paradigm of missing data assessment, as originally reported by Rubin (1987, 1988) in the “multiple imputation” framework. It consists in using not only the “most similar” station for making estimations, but a set of “similar stations”, so that, instead of a unique estimated value, a sort of estimation range can be determined. This would allow the introduction of some kind of uncertainty range related to the most probable value. Further investigations in this direction are in progress. Another option consists in grouping the months of the time series on the basis of the most representative statistics of the considered variable (e.g. considering rainfall rates, three classes could be defined as dry, moderately rainy and rainy months on the basis of the total monthly amount) and carrying out the estimation using homogeneous data subsets of the corresponding month. This could certainly improve the reliability of the whole methodology by diminishing the variability of the starting dataset.

Acknowledgements The authors wish to acknowledge the courtesy of Dr. Giuseppe Tedeschi and Dr. Giuseppe Amoruso of the Hydrographic Regional Office (HRO) in providing the data used throughout the paper.

References

Allison, P. D. (2001). Missing data. Thousand Oaks, CA: Sage.
Barca, E., Passarella, G., Lo Presti, R., Masciale, R., & Vurro, M. (2006). HarmoniRiB river basin data documentation: Chapter 7—Candelaro River Basin. Bari, Italy: Water Research Institute of the National Research Council. Retrieved from http://www.harmonirib.com.
Chandler, R. E., & Wheater, H. S. (1998). Climate change detection using generalized linear models for rainfall—a case study from the West of Ireland. I. Preliminary analysis and modelling of rainfall occurrence. Research Report No. 194, Department of Statistical Science, University College London.
Conover, W. J. (1971). Practical nonparametric statistics (2nd ed.). New York: Wiley.
Conversano, C. (2003). Incremental algorithms for missing data imputation based on recursive partitioning. In Proceedings of the 35th symposium on the interface. Salt Lake City, Utah, 12–15 March 2003.
Drécourt, J. P., & Madsen, H. (2002). Uncertainty estimation in groundwater modelling using Kalman filtering. In K. Kovar & Z. Hrkal (Eds.), Proceedings of the 4th international conference on calibration and reliability in groundwater modelling, ModelCARE 2002 (Vol. 46(2/3), pp. 306–309). Acta Universitatis Carolinae–Geologica 2002, Prague.
Dunn, P. K. (2003). Precipitation occurrence and amount can be modelled simultaneously. Faculty of Sciences, USQ, Working Paper Series SC-MC-0305.
Glantz, S. (1988). Primer in biostatistics. Milan, Italy: McGraw-Hill.
Goodison, B. E., Louie, P. Y. T., & Yang, D. (1998). WMO solid precipitation measurement intercomparison—final report. Instruments and Observing Methods Report No. 67, WMO/TD-No. 872.
Hubbard, K. G. (1994). Spatial variability of daily weather variables in the high plains of the USA. Agricultural and Forest Meteorology, 68, 29–41.
Istituto Centrale di Statistica (1983). In ISTAT (Ed.), Annuario di statistiche meteorologiche 1981 (Vol. XXI). Rome.
Johansson, B., & Chen, D. (2003). The influence of wind and topography on precipitation distribution in Sweden: Statistical analysis and modelling. International Journal of Climatology, 23, 1523–1535.
Johnson, M. L. (2003). Lose something? Ways to find your missing data. Houston Center for Quality of Care and Utilization Studies Professional Development Series, 17-09-2003.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.
Neter, J., Kutner, M. H., & Nachtsheim, C. J. (1996). Applied linear statistical models. Chicago, IL: Irwin.
Prudhomme, C., & Reed, D. W. (1998). Relationships between extreme daily precipitation and topography in a mountainous region: A case study in Scotland. International Journal of Climatology, 18, 1439–1453.
Rubel, F., & Hantel, M. (1999). Correction of daily gauge measurements in the Baltic Sea drainage basin. Nordic Hydrology, 30, 191–208.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
Rubin, D. B. (1988). An overview of multiple imputation. In Proceedings of the survey research methods section of the American Statistical Association (pp. 79–84). American Statistical Association.
Sande, I. G. (1983). Hot-deck imputation procedures. In W. G. Madow & I. Olkin (Eds.), Proceedings of symposium: Incomplete data in sample surveys (Vol. 3). New York: Academic Press.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. London: Chapman & Hall.
Scheffer, J. (2002). Dealing with missing data. Research Letters in the Information and Mathematical Sciences, 3, 153–160. Retrieved from http://www.massey.ac.nz/~wwiims/research/letters/.
Sevruk, B. (1986). Correction of precipitation measurements: Summary report. In B. Sevruk (Ed.), Correction of precipitation measurements (Vol. 23, pp. 13–23). Zurich: Zürcher Geographische Schriften.
Sevruk, B., & Nespor, V. (1998). Empirical and theoretical assessment of the wind induced error of rain measurement. Water Science and Technology, 37(11), 171–178.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
Theil, H. (1950). A rank-invariant method of linear and polynomial regression analysis. Indagationes Mathematicae, 12, 85–91.
Vejen, F., Allerup, P., & Madsen, H. (1998). Korrektion for fejlkilder af daglige nedbørmålinger i Danmark [Correction for error sources in daily precipitation measurements in Denmark]. Technical Report 98-9, Danish Meteorological Institute. In Danish.
Wooldridge, J. (2006). Introductory econometrics: A modern approach (3rd ed.). Cincinnati, OH: South-Western College.
Yevjevich, V. (1972). Probability and statistics in hydrology. Fort Collins, CO: Water Resources Publications.
