Clustering and preferential sampling, two distinct issues in Geostatistics

Raquel Menezes

Centre of Mathematics, University of Minho, Campus de Azurém, 4800-058 Guimarães, Portugal

Abstract: This work gives an overview of two issues of sampling design, clustering and preferability, within the scope of geostatistics, the area concerned with fitting spatially continuous models to spatially discrete data (Chilès and Delfiner, 1999). Firstly, data-locations are typically assumed to be uniformly representative of the observation region. Menezes et al. (2008) show that the impact of clustered data on variogram estimation is not negligible and propose a kernel estimator for this type of data. Secondly, preferential sampling arises when the process that determines the data-locations and the process being modelled are stochastically dependent. Diggle et al. (2008) propose a model-based approach to deal with this problem. Here, we intend to emphasize the difference between these two potential problems of sampling design.

Keywords: geostatistics; clustered design; variogram; preferential sampling; model-based inference.

1 Introduction

Geostatistics is the part of spatial statistics concerned with data obtained by spatially discrete sampling of some spatially continuous process $S(x)$ with $x \in \mathbb{R}^2$. Suppose that the sample data are of the form $(x_i, y_i) : i = 1, \ldots, n$, where $x_1, \ldots, x_n$ are data-locations within an observation region $D \subset \mathbb{R}^2$ and $y_1, \ldots, y_n$ are measurements associated with these locations. The $x_i$ represent the sampling design and $y_i$ is a realization of $Y_i = Y(x_i)$, where $Y(x)$ with $x \in D$ is the measurement process. The unobserved signal process $S(x)$ with $x \in D$ is the one we aim to model, and it is usually the target of prediction. Assuming an additive measurement error, a simple model for the data takes the form

$$Y_i = \mu + S(x_i) + Z_i, \quad i = 1, \ldots, n, \tag{1}$$

where the $Z_i$ are mutually independent, zero-mean random variables. If one considers $E[S(x)] = 0$ for all $x$, then in (1) $E[Y_i] = \mu$ for all $i$. Model (1) can easily be extended to the regression setting by taking $E[Y_i] = \mu_i = d_i' \beta$, with $d_i$ a vector of explanatory variables associated with $Y_i$.
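To make model (1) concrete, the following is a minimal simulation sketch. It rests on assumptions that are not part of the text: the signal $S$ is taken to be a zero-mean Gaussian process with an exponential covariance function, the observation region is the unit square, and the function name `simulate_model` and all parameter values (`mu`, `sigma2`, `phi`, `tau2`) are purely illustrative.

```python
import numpy as np

def simulate_model(n=50, mu=4.0, sigma2=1.5, phi=0.15, tau2=0.1, seed=0):
    """Simulate (x_i, y_i) from Y_i = mu + S(x_i) + Z_i on the unit square.

    Assumptions (not from the paper): S is a zero-mean Gaussian process with
    exponential covariance sigma2 * exp(-d / phi); Z_i ~ N(0, tau2).
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=(n, 2))                        # data-locations in D = [0, 1]^2
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    cov = sigma2 * np.exp(-d / phi)                     # covariance matrix of S at the x_i
    chol = np.linalg.cholesky(cov + 1e-10 * np.eye(n))  # small jitter for numerical stability
    s = chol @ rng.standard_normal(n)                   # signal values S(x_i), mean zero
    z = np.sqrt(tau2) * rng.standard_normal(n)          # independent measurement errors Z_i
    return x, s, mu + s + z

x, s, y = simulate_model()
```

Here the locations are drawn uniformly and independently of $S$, so the design carries no information about the signal.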


Bearing in mind that the sampling design for the locations $x_i$ might be deterministic (for example, a regular grid) or stochastic, all analyses are carried out conditionally on the $x_i$, which enter the analysis through the spatial dependence structure. Even when the locations are generated by a stochastic point process, their distribution is normally irrelevant to the geostatistical model. Suppose that, as a convenient shorthand notation, we use $[\cdot]$ to mean “the distribution of”, and write $S = \{S(x) : x \in \mathbb{R}^2\}$ and $Y = (Y_1, \ldots, Y_n)$. Then, for model (1), one has $[S, Y] = [S][Y|S] = [S][Y_1|S(x_1)] \cdots [Y_n|S(x_n)]$, where the $Y_i|S(x_i)$ may be taken as univariate Gaussian distributions with means $\mu + S(x_i)$ and common variance $\tau^2$.

The rest of this draft is structured as follows. In Section 2, we introduce the two sampling issues: clustering and preferential sampling. Both are typically associated with the aggregation of sample locations, but only the latter involves stochastic dependence between the locations and the data processes. Section 3 reviews the solution proposed in Menezes et al. (2008) for the clustering issue. Section 4 reviews the solution proposed in Diggle et al. (2008) for the preferability issue, which may occur, for example, in environmental monitoring networks if there is a natural inclination to place more monitors in areas classified as high risk for pollution. Some final remarks are included in Section 5.

2 Clustering versus preferential sampling

In geostatistics, it is commonly assumed that the sampling locations $x_i$ are uniformly spread over the observation region. However, the strategy for data collection may produce unequal sampling density, leading to clustered sampling. Observations that are clustered in space are in most cases driven by external factors, i.e. factors not related to the underlying process $S(x)$, such as the existence of specific geographic or demographic spots. Moreover, clusters may be needed to better characterize short-range variability, which requires denser sampling that is sometimes too costly to extend over the whole study region.

On the other hand, geostatistical methods rely on the fundamental assumption that the sampling locations $x_i$ have been chosen independently of the values of the spatial variable. Consequently, the sampling design is assumed to be stochastically independent of the process $S(x)$ and hence of $Y$. However, the adoption of denser sampling in areas deemed critical (for example, where maximum values are expected to occur) may violate this stochastic independence. Furthermore, if we denote $X = (x_1, \ldots, x_n)$, this situation occurs when $X$ and $Y$ are the joint outcome of a marked point process in which there is dependence between the points, $X$, and the marks, $Y$. So, as described in Diggle et al. (2008), preferential sampling refers to any situation in which $[X, S, Y] \neq [X][S, Y]$. Note that preferential sampling typically leads to the presence of clustered data but not the other way around.

Figure 1 compares three distinct sampling designs. The left-hand panel illustrates the completely random sampling design, where the sample locations $x_i$ are an independent random sample from the uniform distribution on $D$. The centre panel exemplifies the preferential sample and the right-hand panel the clustered sample. In each case, the grey-scale image represents the realisation of $S(x)$ used to generate the associated measurement data. We wish to highlight that both the centre and right-hand panels show clustered data, but only the preferential design concentrates the data-locations according to the underlying process $S(x)$.

FIGURE 1. Sample locations and underlying signal process $S(x)$. From left to right: the completely random sample, the preferential sample and the clustered sample.
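The distinction between the three designs can be sketched in code. The example below is illustrative only and is not the simulation setup of the cited papers: it assumes that a realisation of $S$ is available as an `m x m` image `s_grid` on the unit square, that preferential locations are drawn with probability proportional to `exp(beta * S)` per lattice cell, and that clustered locations are scattered around centres chosen without reference to $S$. The function name `sample_designs` and its parameters are hypothetical.

```python
import numpy as np

def sample_designs(s_grid, n=100, beta=2.0, n_centres=5, spread=0.05, seed=0):
    """Return three (n, 2) arrays of locations in the unit square:
    completely random, preferential and clustered (non-preferential).

    s_grid[row, col] holds S on an m x m lattice; x = col / m, y = row / m.
    """
    rng = np.random.default_rng(seed)
    m = s_grid.shape[0]
    # Completely random: uniform on D, independent of S.
    random_x = rng.uniform(size=(n, 2))
    # Preferential (assumed mechanism): choose lattice cells with probability
    # proportional to exp(beta * S), then jitter uniformly within the cell.
    w = np.exp(beta * (s_grid - s_grid.max())).ravel()
    cells = rng.choice(m * m, size=n, replace=True, p=w / w.sum())
    rows, cols = np.unravel_index(cells, (m, m))
    pref_x = (np.column_stack([cols, rows]) + rng.uniform(size=(n, 2))) / m
    # Clustered: Gaussian scatter around centres drawn without looking at S.
    centres = rng.uniform(size=(n_centres, 2))
    parents = rng.integers(n_centres, size=n)
    clus_x = np.clip(centres[parents] + spread * rng.standard_normal((n, 2)), 0.0, 1.0)
    return random_x, pref_x, clus_x
```

Both the second and third outputs are spatially aggregated, but only the second depends on `s_grid`, which is the sense in which $[X, S, Y] \neq [X][S, Y]$.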

3 The clustered sampling problem

Under clustered sampling, the performance of widely used geostatistical tools such as the variogram may deteriorate significantly, as shown in Kovitz and Christakos (2004) and Menezes et al. (2008). The latter authors propose a variogram estimator robust to clusters, which weights each pair of observations inversely to the neighbourhood density, as a compensation for the unpopulated areas. For the isotropic case, their proposal can be represented by

$$\hat{\gamma}(u) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}(u) \, [Y(x_i) - Y(x_j)]^2}{2 \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}(u)}, \quad u \geq 0,$$

where $w_{ij}(u) = \frac{1}{\sqrt{n_i \times n_j}} \times K\left(\frac{u - \|x_i - x_j\|}{h}\right)$ and $n_i = \sum_{k} I_{\{\|x_i - x_k\| \leq \delta\}}$, with $K(\cdot)$ and $I_{\{\cdot\}}$ denoting the kernel and the indicator functions, respectively.


Note that $n_i$ represents the number of sample points that fall within the circle of radius $\delta$ centred at $x_i$. The idea is to adjust the contribution of any sample point $x_i$ according to its number of neighbours $n_i$; as the variogram estimator sums contributions from pairs of data points $(x_i, x_j)$, the product $n_i n_j$ represents the joint contribution of the pair, and the square root is adopted as a scale correction. The resulting estimator downweights squared increments of $Y(\cdot)$ based on the geometric mean of the local densities of observations around the two locations making up the increment. This estimator requires the selection of two user-adjustable values: the kernel bandwidth $h$ and the neighbourhood radius $\delta$. The former is selected via the mean square error (MSE). The latter results from analysing a density estimate of the distances between points in the observation region. In Menezes et al. (2008), this cluster-robust variogram is shown to enjoy good properties, such as asymptotic unbiasedness and consistency.
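A direct transcription of the estimator into code may help fix ideas. This is a sketch rather than the authors' implementation: the Epanechnikov kernel is an assumption (the text only requires some kernel $K$), the double sum is taken over all pairs $(i, j)$ as in the displayed formula, and the function name `cluster_robust_variogram` is hypothetical. The bandwidth `h` and radius `delta` must be supplied by the user; their data-driven selection is not covered here.

```python
import numpy as np

def cluster_robust_variogram(x, y, u, h, delta):
    """Kernel variogram estimate at the lags in u, with weights
    w_ij(u) = K((u - ||x_i - x_j||) / h) / sqrt(n_i * n_j)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    u = np.atleast_1d(u).astype(float)
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)  # pairwise distances
    n_i = (d <= delta).sum(axis=1)            # points within delta of x_i (x_i included)
    density = 1.0 / np.sqrt(np.outer(n_i, n_i))
    sq_inc = (y[:, None] - y[None, :]) ** 2   # squared increments [Y(x_i) - Y(x_j)]^2
    gamma = np.full(u.shape, np.nan)          # NaN where no pair receives positive weight
    for k, lag in enumerate(u):
        t = (lag - d) / h
        kern = np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t ** 2), 0.0)  # Epanechnikov (assumption)
        w = kern * density
        total = w.sum()
        if total > 0:
            gamma[k] = (w * sq_inc).sum() / (2.0 * total)
    return gamma
```

Pairs drawn from densely sampled neighbourhoods have large $n_i n_j$ and therefore small weights, so a cluster cannot dominate the estimate at any given lag.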

4 The preferential sampling problem

We now present some evidence of why the preferability issue should not be treated as a simple clustering issue. This evidence is based on two simulation experiments presented in Diggle et al. (2008). Both experiments involved the simulation of data under random, preferential and clustered sampling for comparison purposes.

4.1 Impact of the preferability issue

The main goal of the first experiment was to analyse the effect of the sampling design on variogram estimation. Figure 2 shows simulation-based estimates of the point-wise bias and standard deviation of the classical (method-of-moments) variogram, derived from 500 independent simulations of each of our three sampling designs. With respect to bias, the results under both uniform and clustered non-preferential sampling designs are consistent with the unbiasedness of the variogram estimator. In contrast, under preferential sampling the results show severe bias. With respect to efficiency, the right-hand panel of Figure 2 illustrates that clustered sampling designs, whether preferential or not, are also less efficient than random sampling.

FIGURE 2. Bias (left-hand panel) and standard deviation (right-hand panel) of the sample variogram, plotted against distance, under random, preferential and clustered sampling.

The second simulation experiment was aimed at extending the previous analysis to the widely used ordinary kriging predictor. For each of the 500 replicate simulations, the bias and the mean square prediction error of the predictor at one fixed location were derived. Table 1 summarizes the simulation results. The bias is large and positive under preferential sampling. This prediction bias is a direct consequence of the bias in the estimation of the model parameters, which in turn arises because the preferential sampling model adopted in these simulation studies leads to the over-sampling of locations corresponding to high values of the underlying process $S$. The two non-preferential sampling designs both lead to approximately unbiased prediction, as expected from theory. The substantially larger mean square error for clustered sampling by comparison with completely random sampling reflects the inefficiency of the former, as already illustrated in the context of variogram estimation.

TABLE 1. Impact of sampling design on the bias and mean square error of the ordinary kriging predictor. Each entry in the table is a 95% coverage interval calculated empirically from 500 independent simulations.

                     Sampling design
                     Random             Preferential       Clustered
bias                 (-0.081, 0.059)    (1.290, 1.578)     (-0.082, 0.186)
mean square error    (0.268, 0.354)     (2.967, 3.729)     (0.948, 1.300)

4.2 Monte Carlo maximum likelihood estimator

Diggle et al. (2008) propose a model-based approach for preferential sampling. First, bear in mind that the joint distribution for $S$, $X$ and $Y$ can be factorized as $[S, X, Y] = [S][X|S][Y|X, S]$, where $[X|S]$ is modelled as an inhomogeneous Poisson process with intensity $\lambda(x) = \exp\{\alpha + \beta S(x)\}$ ($\alpha$ and $\beta$ are constants). Then, the likelihood function for the data can be expressed as

$$L(\theta) = [X, Y] = E_S\big[[X|S][Y|X, S]\big], \tag{2}$$

where $\theta$ identifies the unknown model parameters. Note that the conditional distribution $[X|S]$ strictly requires the realisation of $S$ to be available at all locations $x \in D$. In practice, the spatially continuous realisation of $S$ can be approximated by the set of values of $S$ on a fine lattice covering $D$, and the exact locations $X$ can be replaced by their closest lattice points. $S$ can then be partitioned as $S = \{S_0, S_1\}$, where $S_0$ denotes the values of $S$ at each of the $n$ data-locations $x_i \in X$, and $S_1$ denotes the values of $S$ at the remaining lattice points. Consequently, with some algebraic manipulation, the exact likelihood in (2) can be rewritten as

$$L(\theta) = E_{S|Y}\left[[X|S] \, \frac{[Y|S_0]}{[S_0|Y]} \, [S_0]\right],$$

and a Monte Carlo approximation is

$$L_{MC}(\theta) = m^{-1} \sum_{j=1}^{m} [X|S_j] \, \frac{[Y|S_{0j}]}{[S_{0j}|Y]} \, [S_{0j}], \tag{3}$$

where the $S_j$ are simulations of $S$ conditional on $Y$. This requires efficient algorithms for simulating a stationary Gaussian process on a fine grid, such as those described in Rue and Held (2005).
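Two ingredients of the Monte Carlo approximation lend themselves to a short sketch. The first function evaluates the logarithm of $[X|S]$ on a lattice under the assumption of an unconditional Poisson process with piecewise-constant intensity (whether one conditions on the number of points is a modelling choice the text does not specify), up to an additive constant free of $\alpha$, $\beta$ and $S$. The second combines the per-simulation log terms of (3) with the log-sum-exp device, since products of densities underflow easily. The function names and arguments are hypothetical, and the conditional simulation of the $S_j$ given $Y$ is not shown.

```python
import numpy as np

def log_px_given_s(counts, s_lattice, alpha, beta, cell_area):
    """log [X|S], up to a constant, for a Poisson process whose intensity
    exp(alpha + beta * S) is constant within each lattice cell.

    counts[k] is the number of data-locations assigned to cell k.
    """
    counts = np.asarray(counts, dtype=float)
    log_lam = alpha + beta * np.asarray(s_lattice, dtype=float)
    # Sum of log-intensities at the points minus the integrated intensity.
    return float(np.sum(counts * log_lam) - cell_area * np.sum(np.exp(log_lam)))

def log_mc_likelihood(log_px, log_py_s0, log_ps0, log_ps0_y):
    """log L_MC = log( m^-1 * sum_j [X|S_j] [Y|S_0j] [S_0j] / [S_0j|Y] ),
    with one entry per simulation j in each input array."""
    terms = (np.asarray(log_px, dtype=float) + np.asarray(log_py_s0, dtype=float)
             + np.asarray(log_ps0, dtype=float) - np.asarray(log_ps0_y, dtype=float))
    top = terms.max()                                   # log-sum-exp for numerical stability
    return float(top + np.log(np.mean(np.exp(terms - top))))
```

In an actual fit these quantities would be recomputed for each candidate $\theta$ and the result maximized numerically.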

5 Final remarks

We conclude that geostatistical analyses that ignore the existence of clustered or preferentially sampled data can be misleading. Each of these sampling design issues should be treated accordingly. As remarked, one possible approach to the analysis of preferentially sampled geostatistical data is to adopt a parametric model and fit it to the data using likelihood-based methods of inference. Alternatively, one can use explanatory variables to eliminate, or at least reduce, the adverse effects of preferential sampling.

References

Diggle, P.J., Menezes, R., and Su, T.L. (2008). Geostatistical Inference Under Preferential Sampling. Under submission.
Kovitz, J. and Christakos, G. (2004). Spatial statistics of clustered data. Stochastic Environmental Research, 18, 147-166.
Menezes, R., García-Soidán, P., and Febrero-Bande, M. (2008). A kernel variogram estimator for clustered data. Scandinavian Journal of Statistics, 35(1), 18-37.
Rue, H. and Held, L. (2005). Gaussian Markov Random Fields: Theory and Applications. London: Chapman and Hall.