Australian & New Zealand Journal of Statistics Aust. N. Z. J. Stat. 58(1), 2016, 47–69
doi: 10.1111/anzs.12140
ON THE CORRELATION STRUCTURE OF GAUSSIAN COPULA MODELS FOR GEOSTATISTICAL COUNT DATA
Zifei Han1 and Victor De Oliveira1,*
The University of Texas at San Antonio
Summary
We describe a class of random field models for geostatistical count data based on Gaussian copulas. Unlike hierarchical Poisson models often used to describe this type of data, Gaussian copula models allow a more direct modelling of the marginal distributions and association structure of the count data. We study in detail the correlation structure of these random fields when the family of marginal distributions is either negative binomial or zero-inflated Poisson; these represent two types of overdispersion often encountered in geostatistical count data. We also contrast the correlation structure of one of these Gaussian copula models with that of a hierarchical Poisson model having the same family of marginal distributions, and show that the former is more flexible than the latter in terms of range of feasible correlation, sensitivity to the mean function and modelling of isotropy. An exploratory analysis of a dataset of Japanese beetle larvae counts illustrates some of the findings. All of these investigations show that Gaussian copula models are useful alternatives to hierarchical Poisson models, especially for geostatistical count data that display substantial correlation and small overdispersion.
Key words: Fréchet–Hoeffding upper bound; Gaussian random field; isotropy; negative binomial; Poisson-Gamma model; zero-inflated Poisson.
1. Introduction
Geostatistical count data are collected in many earth and social sciences, such as ecology, demography and geology, but the number and variety of models to describe such count data are less rich than what is available for geostatistical continuous data. On the one hand there are classes of models based only on second-order specifications, in line with classical geostatistical models, such as binomial and Poisson kriging (McNeill 1991; Monestiez et al. 2006; De Oliveira 2014). On the other hand there are classes of models based on specifications of the family of finite-dimensional distributions that use Gaussian random fields as building blocks. The prime example of the latter is the Poisson-lognormal model proposed by Diggle, Tawn & Moyeed (1998), a hierarchical model that can be viewed as a generalized linear mixed model. A class of hierarchical Poisson models that includes the Poisson-lognormal model as a particular case was proposed by De Oliveira (2013). The latter work studied the second-order properties of this type of model, and found that these models are unable to represent certain types of spatial count data, particularly data sets that consist mostly of small counts or that display substantial correlation and small overdispersion.
*Author to whom correspondence should be addressed.
1 Department of Management Science and Statistics, The University of Texas at San Antonio, San Antonio, TX 78249, USA; e-mail: [email protected]
Acknowledgements. We warmly thank two anonymous referees for helpful comments and suggestions that led to an improved article. This work was partially supported by the U.S. National Science Foundation Grant DMS–1208896.
© 2016 Australian Statistical Publishing Association Inc. Published by John Wiley & Sons Australia, Ltd.
A different type of model that also uses Gaussian random fields as building blocks seeks to construct random fields with a pre-specified family of marginal distributions using Gaussian copulas (Sklar 1959; Nelsen 2006). Such models have a long history in the literature. They have been proposed in the mechanical engineering literature under the name of ‘translation’ models (Grigoriu 2007), and in the operations research literature under the name of ‘NORTA’ models (for NORmal To Anything; Cario & Nelson 1997). In the statistical literature Madsen (2009), Kazianka & Pilz (2010) and Kazianka (2013) proposed the use of Gaussian copulas to model geostatistical count data, and investigated methods for performing parameter estimation and prediction. One of the main sources of appeal of this type of model is the separate specification of the marginal distributions and the association structure of the random field, which suggests that this type of model may be free from (some of) the aforementioned limitations of hierarchical Poisson models. Nevertheless, a detailed study of the correlation structure of Gaussian copula models seems to be lacking in the literature, so we undertake such a study in this work for the case of some families of discrete marginal distributions.
Specifically, we study the properties of correlation functions of Gaussian copula spatial models with discrete count marginal distributions. We consider in detail the cases when the family of marginal distributions is either negative binomial or zero-inflated Poisson, as these families have been shown to be useful in describing geostatistical count data. We view such models as transformed Gaussian random fields (De Oliveira 2003), which is convenient for the purpose of studying their correlation functions. Although in general there is no closed-form expression for the correlation functions of such random fields, they can be accurately approximated using a convenient series representation. By a mix of analytical and numerical exploration, we show that the second-order structure of Gaussian copula spatial models is quite flexible, their flexibility resembling in a sense that of the second-order structure of Gaussian random fields. To illustrate this point, we provide a side-by-side comparison between the correlation function of a Gaussian copula spatial model and that of a hierarchical Poisson spatial model, where both have the same family of negative binomial marginal distributions. It is shown that the former model is more flexible than the latter in terms of the range of feasible correlation, sensitivity to the mean function and modelling of isotropy. An exploratory analysis of Japanese beetle larvae counts illustrates some of the shortcomings of hierarchical Poisson spatial models in describing small count datasets that display substantial correlation. We close with a discussion of the main findings and possible topics for further research.
2. Gaussian copula spatial models
We describe here a random field model for geostatistical data based on copulas (Nelsen 2006). The model was proposed for the analysis of geostatistical count data in papers by Madsen (2009), Kazianka & Pilz (2010) and Kazianka (2013), wherein the focus was mainly on inference. We describe a slightly different formulation of the model, explicitly as a transformed Gaussian random field (De Oliveira 2003), which is convenient for the purpose of studying the second-order properties of the field. Let $\{Y(s) : s \in D\}$, $D \subset \mathbb{R}^2$, be a random field taking values in $\mathbb{N}_0 = \{0, 1, 2, \ldots\}$, and let $\mathcal{F} = \{F_s(\cdot\,; \psi) : s \in D\}$ be a family of count cdfs with support contained in $\mathbb{N}_0$ and corresponding pmfs $f_s(\cdot\,; \psi)$, where $\psi$ is a vector of marginal parameters.
The proposed model for the random field $Y(\cdot)$ is
\[ Y(s) = F_s^{-1}\big(\Phi(Z(s)); \psi\big), \qquad s \in D, \tag{1} \]
where $\{Z(s) : s \in D\}$ is a latent Gaussian random field with mean 0, variance 1 and correlation function $K_\vartheta(s, u)$ (whence $K_\vartheta(s, s) = 1$ for all $s$), $\Phi(\cdot)$ is the cdf of the standard normal distribution, and $F_s^{-1}(\cdot\,; \psi)$ is the quantile function of $F_s(\cdot\,; \psi)$, defined as
\[ F_s^{-1}(u; \psi) = \inf\{x \in \mathbb{R} : F_s(x; \psi) \ge u\}, \qquad u \in (0, 1). \]
By construction the marginal distribution of $Y(s)$ is $F_s(\cdot\,; \psi)$. We assume the family $\mathcal{F}$ to be parameterized in terms of $E(Y(s))$ and $\kappa^2 > 0$, where the former may include auxiliary information and the latter controls some type of departure from Poisson distributions with the same mean (usually overdispersion). Specifically, we assume that
\[ E(Y(s)) = t(s)\exp\big(\beta^\top f(s)\big) = \mu(s), \qquad s \in D, \]
where $\beta = (\beta_1, \ldots, \beta_p)^\top$ are regression parameters, $f(s) = (f_1(s), \ldots, f_p(s))^\top$ are known location-dependent covariates, with $f_1(s) \equiv 1$, and $t(s) > 0$ are known ‘sampling efforts’. We assume that $\operatorname{var}(Y(s)) < \infty$, which guarantees that $\operatorname{cov}(Y(s), Y(u))$ exists for all $s, u \in D$. Consequently we have $\psi = (\beta^\top, \kappa^2)^\top$. As for the correlation function of the latent Gaussian random field, we assume that
\[ K_\vartheta(s, u) = (1 - \tau^2)\,\bar{K}_\theta(s, u) + \tau^2\,\mathbf{1}\{s = u\}, \tag{2} \]
where $\bar{K}_\theta(s, u) \ge 0$ is a correlation function in $\mathbb{R}^2$ that is continuous everywhere, $\mathbf{1}\{A\}$ is the indicator function of $A$, and $\vartheta = (\theta^\top, \tau^2)^\top$ is a vector of correlation parameters, with $\tau^2 \in [0, 1]$. Such expressions include both correlation functions that are continuous everywhere and those that are discontinuous along the ‘diagonal’ $s = u$, a property that, as we will show later, is inherited by the correlation function of $Y(\cdot)$.
2.1. Marginal distributions
The family $\mathcal{F}$ of marginal distributions used to construct model (1) can be of any type, continuous, discrete or mixed, but in this work we focus on the case where this is a family of count distributions. More specifically, we consider in detail the cases when $\mathcal{F}$ is either a family of negative binomial distributions or of zero-inflated Poisson distributions, since these families have been shown to be useful in describing geostatistical count data.
Negative binomial distributions are commonly used to model count data that display overdispersion by having tails that are heavier than those of Poisson distributions. We use a parameterization expressed in terms of the mean and a ‘size’ parameter (what Cameron & Trivedi (2013) call the NB2 distribution), wherein the pmf of $Y(s)$ is given by
\[ f_s^{NB}(y; \psi) = \frac{\Gamma(y + \kappa^2)}{\Gamma(\kappa^2)\, y!} \left(\frac{\kappa^2}{\kappa^2 + \mu(s)}\right)^{\kappa^2} \left(1 - \frac{\kappa^2}{\kappa^2 + \mu(s)}\right)^{y}, \qquad y = 0, 1, 2, \ldots \tag{3} \]
With this parameterization the mean and variance of $Y(s)$ are
\[ E(Y(s)) = \mu(s) \quad \text{and} \quad \operatorname{var}(Y(s)) = \mu(s)\left(1 + \frac{\mu(s)}{\kappa^2}\right), \tag{4} \]
from which it follows that the distribution is overdispersed (relative to the Poisson distribution with the same mean). The amount of (relative) overdispersion is defined as
\[ \frac{\operatorname{var}(Y(s)) - E(Y(s))}{E(Y(s))} = \frac{\mu(s)}{\kappa^2}, \tag{5} \]
so $\kappa^2$ controls overdispersion among distributions with mean $\mu(s)$; we call this the Gaussian copula negative binomial (GC-NB) model.
Zero-inflated Poisson distributions are commonly used to model count data with an ‘excess’ of zeros in relation to what Poisson (and negative binomial) distributions predict. This distribution is a mixture of a Poisson distribution and an atom at zero (Cameron & Trivedi 2013), with pmf given by
\[ \Pr\big(Y(s) = y\big) = \begin{cases} \pi + (1 - \pi)\exp(-\lambda(s)) & \text{if } y = 0, \\[4pt] (1 - \pi)\exp(-\lambda(s))\,\dfrac{(\lambda(s))^{y}}{y!} & \text{if } y = 1, 2, \ldots \end{cases} \]
where $\lambda(s)$ is the mean of the Poisson distribution that is being mixed and $\pi$ is the ‘constructed probability of 0 count’. This probability could also be made dependent on covariates (e.g. Lambert 1992; Agarwal, Gelfand & Citron-Pousty 2002), but we assume it to be constant. We again use a parameterization in terms of the mean $\mu(s) = (1 - \pi)\lambda(s)$ and the overdispersion parameter $\kappa^2 = 1/\pi - 1$, in which case the pmf of $Y(s)$ is given by
\[ f_s^{ZIP}(y; \psi) = \begin{cases} \dfrac{1}{\kappa^2 + 1} + \dfrac{\kappa^2}{\kappa^2 + 1}\exp\left(-\left(1 + \dfrac{1}{\kappa^2}\right)\mu(s)\right) & \text{if } y = 0, \\[8pt] \dfrac{\kappa^2}{\kappa^2 + 1}\,\dfrac{\left(\left(1 + \frac{1}{\kappa^2}\right)\mu(s)\right)^{y}}{y!}\exp\left(-\left(1 + \dfrac{1}{\kappa^2}\right)\mu(s)\right) & \text{if } y = 1, 2, \ldots \end{cases} \]
With this parameterization the mean and variance of $Y(s)$ are also given by (4); we call this the Gaussian copula zero-inflated Poisson (GC-ZIP) model.
The aforementioned families of distributions are used to model different types of count data as they represent different forms of overdispersion (or different deviations from the Poisson distribution). The negative binomial distribution has a heavier tail than the Poisson distribution, while the zero-inflated Poisson distribution has an ‘excess’ of zeros when compared to what Poisson (and negative binomial) distributions predict. Both families include distributions that approach the Poisson distribution with mean $\mu(s)$ as $\kappa^2 \to \infty$. Figure 1 displays several pmfs from these families with different mean and overdispersion parameters.
Figure 1. Several negative binomial and zero-inflated Poisson pmfs with parameters $\mu(s) = 3$ and 6 (top and bottom rows), and $\kappa^2 = 0.8$, 2 and 5 (left, middle and right columns). Each panel plots probability against value for the NB and ZIP pmfs.
2.2. Covariance function
For most families $\mathcal{F}$ of marginal distributions the covariance function of the random field in (1) is not available in closed form, except for a few cases (e.g. for uniform and lognormal distributions), and this is especially true for discrete distributions. We use here a representation of the covariance function of $Y(\cdot)$ that is particularly convenient for numerical computation.
Lemma 2.1. Let $Y(\cdot)$ be the random field defined in (1) obtained from the family $\mathcal{F}$ and the correlation function $K_\vartheta(s, u)$. Then
\[ \operatorname{cov}(Y(s), Y(u)) = \sum_{k=1}^{\infty} a_k(F_s)\, a_k(F_u)\, \frac{\{K_\vartheta(s, u)\}^{k}}{k!}, \qquad s, u \in D, \tag{6} \]
with
\[ a_k(F_s) = \int_{-\infty}^{\infty} F_s^{-1}\big(\Phi(t); \psi\big)\, H_k(t)\, \phi(t)\, dt, \qquad k = 1, 2, \ldots \tag{7} \]
where $\phi(t)$ is the pdf of the standard normal distribution and $H_k(t)$ is the (probabilists’) Hermite polynomial of degree $k$ ($H_1(t) = t$, $H_2(t) = t^2 - 1$, …). Additionally, for any fixed $s, u \in D$ for which $K_\vartheta(s, u) \in [0, 1)$ the series (6) is absolutely convergent.
Proof. See De Oliveira (2013).
From this representation an approximation for $\operatorname{cov}(Y(s), Y(u))$ can be obtained by truncating the infinite series (6) and approximating the coefficients (7) by numerical integration. This strategy was used by De Oliveira (2013) to approximate covariance functions of copula models with continuous marginals, and we use it here to do the same for models with discrete marginals. The computation of $a_k(F_s)$ requires evaluation of the quantile functions. For negative binomial marginals we use the function qnbinom in R, while for zero-inflated Poisson marginals we use the function qzipois in the R package VGAM. We carry out the approximation of $a_k(F_s)$ by Gaussian quadrature using the function gauss.quad.prob from the R package statmod, which results in an error that is negligible compared to the truncation error, once we select a large enough truncation value. To guide the selection of the truncation value that guarantees that the approximation error does not surpass a given error tolerance, we use the following lemma:
Lemma 2.2. Consider the random field $Y(\cdot)$ defined in (1) obtained from the family $\mathcal{F}$, with mean and variance functions given in (4), and correlation function $K_\vartheta(s, u)$. If $e(M)$ denotes the error in approximating $\operatorname{corr}(Y(s), Y(u))$ that results from truncating the infinite series (6) at $M$, then
\[ |e(M)| \le \left[\left(1 + \frac{\mu(s)\kappa^2}{\mu(s) + \kappa^2}\right)\left(1 + \frac{\mu(u)\kappa^2}{\mu(u) + \kappa^2}\right)\right]^{1/2} \frac{(K_\vartheta(s, u))^{M+1}}{1 - K_\vartheta(s, u)}. \]
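As an illustration of the truncation strategy and of the bound in Lemma 2.2, the following is a minimal Python sketch for negative binomial marginals; it is not the authors' R implementation. It rests on two assumptions that are not in the text. First, instead of Gaussian quadrature it uses the fact that, for a count marginal, $F_s^{-1}(\Phi(t))$ is a step function in $t$, so that (7) reduces to $a_k(F_s) = \sum_{y \ge 0} H_{k-1}(z_y)\phi(z_y)$ with $z_y = \Phi^{-1}(F_s(y))$. Second, the sum over $y$ is stopped once the upper tail mass falls below $10^{-12}$. All function names and the argument `size` (the size parameter in (3)) are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: provides pdf and inv_cdf


def nb_pmf(y, mu, size):
    """Negative binomial pmf with mean `mu` and size parameter `size`, as in (3)."""
    p = size / (size + mu)
    log_pmf = (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
               + size * math.log(p) + y * math.log1p(-p))
    return math.exp(log_pmf)


def thresholds(mu, size, tail=1e-12):
    """Latent thresholds z_y = Phi^{-1}(F(y)), y = 0, 1, ..., kept while 1 - F(y) >= tail."""
    zs, cdf, y = [], 0.0, 0
    while True:
        cdf += nb_pmf(y, mu, size)
        if cdf >= 1.0 - tail:  # assumed cut-off: remaining tail mass is ignored
            return zs
        zs.append(STD_NORMAL.inv_cdf(cdf))
        y += 1


def normalized_coeffs(mu, size, M):
    """Return [a_k(F) / sqrt(k!) for k = 1..M] via a_k(F) = sum_y H_{k-1}(z_y) phi(z_y)."""
    zs = thresholds(mu, size)
    phis = [STD_NORMAL.pdf(z) for z in zs]
    h_prev = [0.0] * len(zs)  # placeholder for h_{-1}
    h_curr = [1.0] * len(zs)  # h_0, where h_k = H_k / sqrt(k!)
    coeffs = []
    for k in range(1, M + 1):
        # h_curr holds h_{k-1}; a_k / sqrt(k!) = sum_y h_{k-1}(z_y) phi(z_y) / sqrt(k)
        coeffs.append(sum(h * p for h, p in zip(h_curr, phis)) / math.sqrt(k))
        # three-term recurrence: sqrt(k) h_k(t) = t h_{k-1}(t) - sqrt(k-1) h_{k-2}(t)
        h_next = [(z * hc - math.sqrt(k - 1) * hp) / math.sqrt(k)
                  for z, hc, hp in zip(zs, h_curr, h_prev)]
        h_prev, h_curr = h_curr, h_next
    return coeffs


def gc_nb_corr(mu_s, mu_u, size, K, M=50):
    """Series (6) truncated at M, divided by the standard deviations implied by (4)."""
    bs = normalized_coeffs(mu_s, size, M)
    bu = normalized_coeffs(mu_u, size, M)
    cov = sum(b1 * b2 * K ** k for k, (b1, b2) in enumerate(zip(bs, bu), start=1))
    var_s = mu_s * (1.0 + mu_s / size)
    var_u = mu_u * (1.0 + mu_u / size)
    return cov / math.sqrt(var_s * var_u)


def truncation_error_bound(mu_s, mu_u, size, K, M):
    """Upper bound of Lemma 2.2 on the error from truncating (6) at M (needs K < 1)."""
    c_s = 1.0 + mu_s * size / (mu_s + size)
    c_u = 1.0 + mu_u * size / (mu_u + size)
    return math.sqrt(c_s * c_u) * K ** (M + 1) / (1.0 - K)
```

For instance, `gc_nb_corr(3.0, 3.0, 2.0, 0.5)` approximates the correlation between two sites with mean 3, size 2 and latent correlation 0.5, and `truncation_error_bound(3.0, 3.0, 2.0, 0.5, 50)` bounds the truncation error of that approximation. Working with the normalized coefficients $a_k/\sqrt{k!}$ avoids forming large factorials explicitly.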
Proof. See the Appendix.
It follows that for a given desired error tolerance, $\epsilon > 0$ say, a truncation value in (6) that guarantees the error in approximating $\operatorname{corr}(Y(s), Y(u))$ is not larger than $\epsilon$ is
\[ M = \left\lfloor \frac{\log\left(\epsilon\,(1 - K_\vartheta(s, u))\left[\left(1 + \frac{\mu(s)\kappa^2}{\mu(s) + \kappa^2}\right)\left(1 + \frac{\mu(u)\kappa^2}{\mu(u) + \kappa^2}\right)\right]^{-1/2}\right)}{\log(K_\vartheta(s, u))} \right\rfloor, \]
where $\lfloor \cdot \rfloor$ is the floor function. It can be seen from the proof of Lemma 2.2 that this truncation value is quite conservative, especially when $K_\vartheta(s, u)$ is close to one. Note that the above upper bound for the error depends on $\kappa^2$, $\mu(s)$, $\mu(u)$ and $K_\vartheta(s, u)$, but not on the particular family $\mathcal{F}$, as long as it satisfies (4), and that this upper bound for the error is an increasing function of each of these individual quantities (when the others are kept constant).
3. Hierarchical Poisson spatial models
In this section we briefly describe the class of hierarchical Poisson models proposed in De Oliveira (2013), which includes as a special case the Poisson-lognormal model proposed by Diggle et al. (1998). In what follows, the quantities $\beta$, $f(s)$, $t(s)$ and $\mu(s)$ have the same definitions and interpretations as for the Gaussian copula spatial models.
Let $\{\Lambda(s) : s \in D\}$ be a positive random field describing the spatial variation of a quantity of interest over the domain $D$, whose values are not observable, and let $\mathcal{G} = \{G_s(\cdot\,; \psi) : s \in D\}$ be its family of continuous marginal cdfs, assumed to satisfy $E\{\Lambda(s)\} = \exp(\beta^\top f(s))$. A random field model for $Y(\cdot)$ is defined hierarchically as follows:
1. For any set of distinct locations $\{s_1, \ldots, s_n\} \subset D$, the counts $Y(s_1), \ldots, Y(s_n)$ are conditionally independent given $\Lambda = (\Lambda(s_1), \ldots, \Lambda(s_n))$, and
\[ Y(s_i) \mid \Lambda \overset{d}{=} Y(s_i) \mid \Lambda(s_i) \sim \mathrm{Pois}\big(t(s_i)\Lambda(s_i)\big), \qquad i = 1, \ldots, n, \]
where $\mathrm{Pois}(a)$ indicates the Poisson distribution with mean $a > 0$.
2. It is assumed that $\Lambda(s) = G_s^{-1}(\Phi(Z(s)); \psi)$, where $Z(\cdot)$ is a Gaussian random field with mean 0, variance 1 and correlation function $\bar{K}_\theta(s, u) \ge 0$ that is continuous everywhere.
The second-order properties of this class of random fields were studied in De Oliveira (2013). Here we will explore properties of the correlation function of the Poisson-Gamma model 2 (PG2), a member of the foregoing class of models that was studied in detail by De Oliveira (2013), which results when $G_s(\cdot\,; \psi)$ is the cdf of the gamma distribution with shape parameter equal to $\kappa^2$ and scale parameter equal to $\exp(\beta^\top f(s))/\kappa^2$.
The PG2 and GC-NB models are similar in some respects, but different in others. On the one hand, the family of marginal distributions is the same for both models, explicitly the negative binomial distributions specified in (3), so for the PG2 model $E(Y(s))$ and $\operatorname{var}(Y(s))$ are the same as those for the GC-NB model. In particular, $\kappa^2$ controls overdispersion in both models. Additionally, from results in De Oliveira (2013, propositions 2.1 & 4.1) it follows that
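\[ \begin{aligned} \operatorname{corr}(Y(s), Y(u)) &= \operatorname{corr}(\Lambda(s), \Lambda(u)) \cdot \left[\left(1 + \frac{\kappa^2}{\mu(s)}\right)\left(1 + \frac{\kappa^2}{\mu(u)}\right)\right]^{-1/2} \\ &= \kappa^2 \sum_{k=1}^{\infty} a_k^2(\tilde{G})\, \frac{(\bar{K}_\theta(s, u))^{k}}{k!} \cdot \left[\left(1 + \frac{\kappa^2}{\mu(s)}\right)\left(1 + \frac{\kappa^2}{\mu(u)}\right)\right]^{-1/2}, \end{aligned} \tag{8} \]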
where the $a_k(\tilde{G})$ are given by (7), with $\tilde{G}$ being the cdf of the gamma distribution with shape parameter equal to $\kappa^2$ and scale parameter equal to $1/\kappa^2$. Consequently we can approximate $\operatorname{corr}(Y(s), Y(u))$ to any desired accuracy by the same combination of truncation and numerical integration described in Section 2.2. On the other hand, it will be seen in the next section that the correlation functions of the GC-NB and PG2 models display different behaviours in several respects.
4. Properties of correlation functions
In this section we explore properties of the correlation functions of the Gaussian copula models GC-NB and GC-ZIP described in Section 2, and contrast properties of the former model with those of the correlation function of the hierarchical Poisson model PG2. The exploration and comparison will be in terms of the range of feasible correlation, sensitivity to the mean function, and modelling of isotropy of the correlation functions. In what follows we parameterize overdispersion in terms of $\sigma^2 = 1/\kappa^2$ to match the notation in De Oliveira (2013) and ease comparisons.
4.1. Range of feasible correlation
For geostatistical models it is desirable that, for $s \neq u$, $\operatorname{corr}(Y(s), Y(u))$ may take any value in the interval $[0, 1)$ that is compatible with the covariation apparent in the data, and feasible within the limit imposed by the Fréchet–Hoeffding upper bound (values in $[-1, 0)$ are rarely of practical interest). A general expression for this upper bound appears in Whitt (1976) and De Oliveira (2013, equation 23), and an alternative expression for the case of bivariate distributions with discrete marginals and support contained in $\mathbb{N}_0$ was given in Nelsen (1987). When (4) holds the latter is
\[ \frac{\sum_{(x,y) \in S} \big(1 - F_u(y; \psi)\big) + \sum_{(x,y) \in T} \big(1 - F_s(x; \psi)\big) - \mu(s)\mu(u)}{\big[\mu(s)(1 + \sigma^2\mu(s))\,\mu(u)(1 + \sigma^2\mu(u))\big]^{1/2}}, \]
where $S = \{(x, y) \in \mathbb{N}_0^2 : F_s(x; \psi) \le F_u(y; \psi)\}$ and $T = S^c$.
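The bound above can be evaluated directly for negative binomial marginals. The following Python sketch is one way to do so, under assumptions that are not in the text: both sums are truncated once the upper tail mass of a marginal falls below $10^{-12}$, and the names `nb_survival` and `fh_upper_bound` are hypothetical. The marginals are parameterized by the mean and by the overdispersion $\sigma^2$, so that the variance is $\mu(1 + \sigma^2\mu)$ and the size parameter in (3) is $1/\sigma^2$.

```python
import math


def nb_survival(mu, sigma2, tol=1e-12):
    """Values 1 - F(x), x = 0, 1, ..., kept while they are at least `tol`.

    F is the negative binomial cdf with mean mu and variance mu * (1 + sigma2 * mu).
    """
    size = 1.0 / sigma2
    p = size / (size + mu)
    surv, cdf, x = [], 0.0, 0
    while True:
        log_pmf = (math.lgamma(x + size) - math.lgamma(size) - math.lgamma(x + 1)
                   + size * math.log(p) + x * math.log1p(-p))
        cdf += math.exp(log_pmf)
        tail = 1.0 - cdf
        if tail < tol:  # assumed cut-off: smaller tail terms are dropped
            return surv
        surv.append(tail)
        x += 1


def fh_upper_bound(mu_s, mu_u, sigma2):
    """Frechet-Hoeffding upper bound on corr(Y(s), Y(u)) for NB marginals."""
    surv_s = nb_survival(mu_s, sigma2)  # 1 - F_s(x)
    surv_u = nb_survival(mu_u, sigma2)  # 1 - F_u(y)
    total = 0.0
    for a in surv_s:
        for b in surv_u:
            # (x, y) in S  <=>  F_s(x) <= F_u(y)  <=>  a >= b: add 1 - F_u(y);
            # otherwise (x, y) in T: add 1 - F_s(x).
            total += b if a >= b else a
    var_s = mu_s * (1.0 + sigma2 * mu_s)
    var_u = mu_u * (1.0 + sigma2 * mu_u)
    return (total - mu_s * mu_u) / math.sqrt(var_s * var_u)
```

The double loop is quadratic in the number of retained support points, which is adequate for moderate means and overdispersion but slow for very heavy tails; a merge over the two sorted survival sequences would be a natural refinement.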
This expression for the bound was also used in Madsen (2009). Let $E\{Y(s)\} = 10$ be fixed. Figure 2 displays the Fréchet–Hoeffding upper bound as a function of $E\{Y(u)\}$ for the family of bivariate distributions having marginals (3) with $\sigma^2 = 0.2$; the upper bound for $\sigma^2 = 4$ is very similar, except for $E\{Y(u)\}$ close to zero. This shows that unless $E\{Y(u)\}$ is close to zero, the range of feasible correlation in this family of distributions is almost the entire interval $[0, 1)$. In contrast, the range of correlation under the PG2 model can be severely limited for some negative binomial marginals (De Oliveira 2013). Figure 2 also displays the maximum possible correlation under the PG2 model (obtained by setting $\bar{K}_\theta(s, u) = 1$ in (8)) for $\sigma^2 = 0.2$ (broken line) and $\sigma^2 = 4$ (dotted line). This shows that when overdispersion is small, the maximum possible correlation under the PG2 model is substantially smaller than what is feasible for models with negative binomial marginals.
Figure 2. Plots of correlation upper bounds as a function of $E\{Y(u)\}$ when $E\{Y(s)\} = 10$ for the family of bivariate distributions with negative binomial marginals (3); Fréchet–Hoeffding upper bound (solid line), PG2 models with $\sigma^2 = 0.2$ (broken line) and $\sigma^2 = 4$ (dotted line).
On the other hand, Gaussian copula models always attain the Fréchet–Hoeffding upper bound when $\sup\{K_\vartheta(s, u) : s \neq u\} = 1$ (Grigoriu 2007). So when $\tau^2 = 0$ the GC-NB and GC-ZIP models have the entire range of correlation variation that is feasible for their respective family of marginal distributions. When $\tau^2 > 0$ the range of correlations is a subset of $[0, 1)$. From (6) it follows that, for $s \neq u$, the maximum possible correlation under the GC-NB model is
\[ \lim_{u \to s} \operatorname{corr}(Y(s), Y(u)) = \frac{\sum_{k=1}^{\infty} a_k^2(F_s)(1 - \tau^2)^{k}/k!}{\mu(s)(1 + \sigma^2\mu(s))}. \tag{9} \]
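The step from (6) to (9) can be spelled out as follows. This is a sketch of the argument rather than a full proof: it assumes that $\mu(\cdot)$ is continuous at $s$, so that $F_u \to F_s$ and $a_k(F_u) \to a_k(F_s)$ as $u \to s$, and that the limit can be taken term by term inside the series.

```latex
% For s \neq u, (2) gives K_\vartheta(s,u) = (1-\tau^2)\,\bar{K}_\theta(s,u),
% and \bar{K}_\theta(s,u) \to 1 as u \to s because \bar{K}_\theta is continuous.
\begin{align*}
\lim_{u \to s} \operatorname{cov}(Y(s), Y(u))
  &= \lim_{u \to s} \sum_{k=1}^{\infty} a_k(F_s)\, a_k(F_u)\,
     \frac{\{(1-\tau^2)\,\bar{K}_\theta(s,u)\}^{k}}{k!}
   = \sum_{k=1}^{\infty} a_k^2(F_s)\, \frac{(1-\tau^2)^{k}}{k!}, \\
\lim_{u \to s} \{\operatorname{var}(Y(s)) \operatorname{var}(Y(u))\}^{1/2}
  &= \operatorname{var}(Y(s)) = \mu(s)\{1 + \sigma^2 \mu(s)\}.
\end{align*}
```

Dividing the first limit by the second gives (9), with the variance taken from (4) written in terms of $\sigma^2$.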
Figure 3 displays the maximum possible correlation as a function of $E(Y(u))$ (with $E(Y(s)) = 10$) under the GC-NB model for two values of $\sigma^2$ and two values of $\tau^2$. Unlike the PG2 model, under the GC-NB model this maximum displays a relatively mild dependence on the marginals. Consequently the PG2 model cannot represent count data that display both strong correlation and small overdispersion, while the GC-NB model can. From (6) it also follows that for the GC-NB model (and the GC-ZIP model) the magnitude of the discontinuity of the correlation function at $s = u$ is
\[ 1 - \frac{\sum_{k=1}^{\infty} a_k^2(F_s)(1 - \tau^2)^{k}/k!}{\mu(s)(1 + \sigma^2\mu(s))}. \]
The above expression is zero when $\tau^2 = 0$, since $\sum_{k=1}^{\infty} a_k^2(F_s)/k! = \operatorname{var}(Y(s))$, while it is positive when $\tau^2 > 0$; see also De Oliveira (2013, proposition 2.2) and Figure 4. Hence, the GC-NB model includes both correlation functions that are continuous and discontinuous along the ‘diagonal’ $s = u$. On the other hand, it was shown in De Oliveira (2013, proposition 2.2) that the correlation function of the PG2 model is always discontinuous along the ‘diagonal’ $s = u$, and the magnitude of the discontinuity at $s$ is $\{1 + \sigma^2\mu(s)\}^{-1}$; see also Figure 6. Thus the PG2 model imposes a very specific link between its ‘nugget’ and the mean function, whereas, as will be shown in the next section, for the GC-NB model the size of the nugget is not much affected by the mean.
Figure 3. Plots of the maximum possible correlation under the GC-NB model as a function of $E\{Y(u)\}$ when $E\{Y(s)\} = 10$, for two values of $\sigma^2$ (0.2 and 4.0) and two values of $\tau^2$ (0.1 and 0.5).
4.2. Sensitivity to the mean function
From (6) and (8) it is apparent that the correlation functions of the GC-NB, GC-ZIP and PG2 models depend in principle on several features of the underlying random fields. Since most of the exploration and comparison that follow are by necessity carried out numerically, we consider in this and the next sections a small set of values for these features that have the same (or very similar) interpretations across all models. For $D = [0, 1] \times [0, 1]$, the random field features to be varied are:
• The mean function $\mu(s)$. This is allowed to take the values 3, 10 or 30. These cases include small, intermediate and large counts.
• The parameter $\sigma^2$. This is allowed to take the values 0.2 (small overdispersion) or 4 (large overdispersion).
• The continuous correlation component $\bar{K}_\theta(d)$ in (2). This is taken to be either
\[ \exp\left(-\frac{d}{0.3}\right) \quad \text{or} \quad \left(1 + \frac{d}{0.2} + \frac{d^2}{0.12}\right)\exp\left(-\frac{d}{0.2}\right), \]
where $d = \|s - u\|$ is Euclidean distance. Both are isotropic correlation functions from the Matérn($\theta_1$, $\theta_2$) family. The first has range parameter $\theta_1 = 0.3$ and smoothness parameter $\theta_2 = 0.5$ (so $Z(\cdot)$ is mean-square continuous but not mean-square differentiable), while the second has range parameter $\theta_1 = 0.2$ and smoothness parameter $\theta_2 = 2.5$ (so $Z(\cdot)$ is twice mean-square differentiable). The different range parameters were chosen to produce comparable correlation decays.
• The nugget effect $\tau^2$ in (2). This is either 0 or 0.25 for the GC-NB and GC-ZIP models; it is always 0 for the PG2 model.
Figure 4 displays the correlation functions of GC-NB models with different mean functions. The left column is for Matérn(0.3, 0.5) models, while the right column is for Matérn(0.2, 2.5) models; $\tau^2 = 0$ in rows 1 and 2, and $\tau^2 = 0.25$ in rows 3 and 4; $\sigma^2 = 0.2$ in rows 1 and 3, and $\sigma^2 = 4$ in rows 2 and 4. This figure shows that for all of the considered model features the correlation function of the GC-NB model displays very little sensitivity to the mean function, thus resembling the situation in Gaussian random fields where the mean and covariance functions are separately specified. Note also that GC-NB models include as possibilities the presence or absence of a ‘nugget effect’, depending on whether $\tau^2$ is positive or zero, and in the former case the size of the nugget does not depend heavily on the mean function.
Figure 4. Correlation functions for the GC-NB models with different means. The correlation function $\bar{K}_\theta(s, u)$ is Matérn(0.3, 0.5) (left column) and Matérn(0.2, 2.5) (right column); $\tau^2 = 0$ (rows 1 and 2) and $\tau^2 = 0.25$ (rows 3 and 4); $\sigma^2 = 0.2$ (rows 1 and 3) and $\sigma^2 = 4$ (rows 2 and 4). Each panel plots $\operatorname{corr}(Y(s), Y(u))$ against $\|s - u\|$ for $E(Y(s)) = 3$, 10 and 30.
Figure 5 displays the correlation functions of GC-ZIP models with different mean functions, with the same layout as in Figure 4. The correlation function of GC-ZIP models also displays little sensitivity to the mean function, although this sensitivity is slightly stronger than for GC-NB models. These models also include the possible presence or absence of a ‘nugget effect’.
Figure 6 displays the correlation functions of PG2 models with different mean functions. The left column is for Matérn(0.3, 0.5) models, while the right column is for Matérn(0.2, 2.5) models; $\sigma^2 = 0.2$ in the first row and $\sigma^2 = 4$ in the second row. For this model the correlation function is very sensitive to the mean function when overdispersion is small, but not when overdispersion is large. This behaviour renders the PG2 model less flexible than the GC-NB model.
4.3. Modelling isotropy
In this section we investigate whether assuming $\bar{K}_\theta(s, u)$ to be isotropic implies that $\operatorname{corr}(Y(s), Y(u))$ is also isotropic or close to being so.
This would be useful for model construction and assessment, since isotropy is often a default assumption. When (s) is constant, it follows from (6) and (8) that the correlation functions of Y (·) under the GC-NB, GC-ZIP and PG2 models are all isotropic when K¯ θ (s, u) is isotropic. The situation is much less clear when (s) is not constant, in which case the investigation needs to be carried out numerically. We consider the same set of random field features described in Section 4.2, and look at this issue separately for random fields in R1 and R2 . 4.3.1. Random fields in the line In the line, corr(Y (s), Y (u)) isotropic means that it is stationary (homogeneous). Suppose that (s) = exp(1 + s), with s ∈ D = [0, 1], and K¯ θ (s, u) is one of the correlation functions listed in Section 4.2. We investigate the aforementioned issue by computing corr(Y (s), Y (u)) on a regular grid of 21 × 21 (= 441) points in [0, 1] × [0, 1], and noting that corr(Y (s), Y (u)) = h(s − u) for some (one-to-one) function h(·) if and only if the level curves of this correlation function are lines with slope equal to 1 (i.e., lines parallel to u − s = 0). For the GC-NB model, Figures 7 and 8 display heatmaps of corr(Y (s), Y (u)) as a function of (s, u) for, respectively, 2 = 0 and 2 = 0.25. In both figures the left column is for Mat´ern(0.3, 0.5) models, while the right column is for Mat´ern(0.2, 2.5) models; 2 = 0.2 in the top row and 2 = 4 in the bottom row. These figures show that for all of the considered © 2016 Australian Statistical Publishing Association Inc.
CORRELATION OF GAUSSIAN COPULA MODELS 1.0
1.0
58
0.6
0.8
E (Y(s)) 3 10 30
0.0
0.0
0.2
0.4
cor (Y(s), Y(u))
0.6 0.4 0.2
cor (Y(s), Y(u))
0.8
E (Y(s)) 3 10 30
0.0
0.2
0.4
0.6
0.8
1.0
0.0
1.2
0.2
0.4
1.0
1.2
0.6 0.0
0.2
0.4
cor (Y(s), Y(u))
0.8
E (Y(s)) 3 10 30
0.2
0.4
0.6
0.8
1.0
1.2
0.0
0.2
0.4
0.6
0.8
1.0
1.2
||s−u|| 1.0
1.0
||s−u||
0.6
0.8
E (Y(s)) 3 10 30
0.0
0.0
0.2
0.4
cor (Y(s), Y(u))
0.4
0.6
0.8
E (Y(s)) 3 10 30
0.2
cor (Y(s), Y(u))
0.8
1.0
1.0 0.6 0.4 0.0
0.2
cor (Y(s), Y(u))
0.8
E (Y(s)) 3 10 30
0.0
0.0
0.2
0.4
0.6
0.8
1.0
1.2
0.0
0.2
0.4
||s−u||
0.6
0.8
1.0
1.2
1.0
1.0
||s−u||
0.6 0.4
cor (Y(s), Y(u))
0.8
E (Y(s)) 3 10 30
0.2
0.4
0.6
0.8
E (Y(s)) 3 10 30
0.0
0.0
0.2
cor (Y(s), Y(u))
0.6
||s−u||
||s−u||
0.0
0.2
0.4
0.6
0.8
1.0
1.2
||s−u||
0.0
0.2
0.4
0.6
0.8
1.0
1.2
||s−u||
Figure 5. Correlation functions for the GC-ZIP models with different means. The correlation function K¯ θ (s, u) is Mat´ern(0.3, 0.5) (left column) and Mat´ern(0.2, 2.5) (right column); 2 = 0 (rows 1 and 2) and 2 = 0.25 (rows 3 and 4); 2 = 0.2 (rows 1 and 3) and 2 = 4 (rows 2 and 4). © 2016 Australian Statistical Publishing Association Inc.
0.6 0.4
cor (Y(s), Y(u))
0.8
E (Y(s)) 3 10 30
0.0
0.0
0.2
0.4
0.6
0.8
E (Y(s)) 3 10 30
0.2
cor (Y(s), Y(u))
59
1.0
1.0
ZIFEI HAN AND VICTOR DE OLIVEIRA
0.0
0.2
0.4
0.6
0.8
1.0
0.0
1.2
0.2
0.4
0.8
1.0
1.2
1.0
1.0
0.6 0.4
cor (Y(s), Y(u))
0.8
E (Y(s)) 3 10 30
0.2
0.4
0.6
0.8
E (Y(s)) 3 10 30
0.0
0.0
0.2
cor (Y(s), Y(u))
0.6
||s−u||
||s−u||
0.0
0.2
0.4
0.6
0.8
1.0
1.2
||s−u||
0.0
0.2
0.4
0.6
0.8
1.0
1.2
||s−u||
Figure 6. Correlation functions for the PG2 models with different means. The correlation function K¯ θ (s, u) is Mat´ern(0.3, 0.5) (left column) and Mat´ern(0.2, 2.5) (right column); 2 = 0.2 (top row) and 2 = 4 (bottom row).
model features the level curves are (roughly) lines parallel to u − s = 0, which implies that corr(Y(s), Y(u)) is approximately isotropic. For the GC-ZIP model, Figures 9 and 10 display heatmaps of corr(Y(s), Y(u)) as a function of (s, u) for, respectively, τ² = 0 and τ² = 0.25, with the same layout as in Figures 7 and 8. Again, for all of the considered model features the level curves are (roughly) lines parallel to u − s = 0. Although a slight deviation is apparent around the main diagonal in the lower right panel of Figure 10, inspection of the numerical values shows that the deviation from isotropy is quite minor, so for this model corr(Y(s), Y(u)) is also approximately isotropic. For the PG2 model, Figure 11 displays heatmaps of corr(Y(s), Y(u)) as a function of (s, u), with the same layout as in Figure 7. This shows that the level curves are not close to being lines parallel to u − s = 0 when the overdispersion is small, so in this case corr(Y(s), Y(u)) is not close to being isotropic. This behaviour is mainly due to the strong sensitivity of corr(Y(s), Y(u)) to μ(s) and μ(u), as illustrated in Figure 6 when the mean function is constant. However, when the overdispersion is large, the deviation from isotropy is mild.

4.3.2. Random fields in the plane

Suppose now that μ(s) = exp(0.5 + 0.5x + y), with s = (x, y) ∈ D = [0, 1] × [0, 1], and K̄_θ(s, u) is again one of the correlation functions listed in Section 4.2. Unlike the ℝ¹ setting,
Figure 7. Heatmaps of corr(Y(s), Y(u)) as a function of (s, u) for GC-NB models with τ² = 0. The correlation function K̄_θ(s, u) is Matérn(0.3, 0.5) (left column) and Matérn(0.2, 2.5) (right column); σ² = 0.2 (top row) and σ² = 4 (bottom row). [Each panel has s on the horizontal axis and u on the vertical axis, both over [0, 1].]
it is now infeasible to visually assess the behaviour of corr(Y(s), Y(u)) as a function of (s, u) ∈ [0, 1]² × [0, 1]². Hence, we select three distinct distances, 0.05, 0.2 and 0.6, and for each distance compute corr(Y(s), Y(u)) for six pairs of locations within D separated by these distances and having different orientations; Figure 12 displays for each distance the six pairs of locations. In all tables the error in approximating corr(Y(s), Y(u)) is less than ε = 10⁻³.

Table 1 reports corr(Y(s), Y(u)) for the GC-NB models corresponding to all distances and orientations. It shows that for each combination of distance and set of model features, corr(Y(s), Y(u)) is close to being constant for all pairs of locations separated by the same distance. Table 2 reports corr(Y(s), Y(u)) for the GC-ZIP models corresponding to all distances and orientations, where the same behaviour is observed. These results suggest that corr(Y(s), Y(u)) is close to being isotropic when K̄_θ(s, u) is isotropic.

Table 3 reports corr(Y(s), Y(u)) for the PG2 models corresponding to all distances and orientations. Unlike the situation for Gaussian copula models, for PG2 models corr(Y(s), Y(u)) appears to vary substantially for pairs of locations separated by the same distance. For instance, the range of variation of corr(Y(s), Y(u)) is 0.23 when d = 0.05, σ² = 0.2, and K̄_θ(s, u) is Matérn(0.3, 0.5). Consequently corr(Y(s), Y(u)) appears not to inherit isotropy from K̄_θ(s, u).
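The orientation dependence under the PG2 model can be illustrated with a rough sketch. It assumes that (8), which is not reproduced in this section, has the multiplicative form corr(Y(s), Y(u)) = K̄_θ(s, u) × [(1 + 1/(σ²μ(s)))(1 + 1/(σ²μ(u)))]^(−1/2); this form is an assumption made here for illustration, and the function names are hypothetical. The code evaluates only the factor that multiplies K̄_θ(s, u), for the six pairs of locations at distance 0.05.

```python
import math

def mean_plane(x, y):
    """Mean function mu(s) = exp(0.5 + 0.5 x + y) used for fields in the plane."""
    return math.exp(0.5 + 0.5 * x + y)

def pg2_attenuation(s, u, sigma2):
    """Factor multiplying K(s, u) in the ASSUMED form of the PG2 correlation:
    [(1 + 1/(sigma2*mu(s))) * (1 + 1/(sigma2*mu(u)))]^(-1/2)."""
    fs = 1.0 + 1.0 / (sigma2 * mean_plane(*s))
    fu = 1.0 + 1.0 / (sigma2 * mean_plane(*u))
    return 1.0 / math.sqrt(fs * fu)

# The six pairs of locations separated by d = 0.05 (as listed in the tables).
pairs = [
    ((0.58, 0.23), (0.55, 0.27)),
    ((0.71, 0.82), (0.67, 0.84)),
    ((0.96, 0.98), (0.99, 0.94)),
    ((0.99, 0.86), (0.98, 0.82)),
    ((0.39, 0.32), (0.42, 0.27)),
    ((0.14, 0.52), (0.10, 0.49)),
]

factors = [pg2_attenuation(s, u, sigma2=0.2) for s, u in pairs]
spread = max(factors) - min(factors)

for (s, u), f in zip(pairs, factors):
    print(s, u, round(f, 3))
print("spread across orientations:", round(spread, 3))
```

Under this assumed form the factor depends on the pair only through μ(s) and μ(u), so pairs at the same distance but in regions with different means receive different correlations even when K̄_θ(s, u) is isotropic.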
Figure 8. Heatmaps of corr(Y(s), Y(u)) as a function of (s, u) for GC-NB models with τ² = 0.25. The correlation function K̄_θ(s, u) is Matérn(0.3, 0.5) (left column) and Matérn(0.2, 2.5) (right column); σ² = 0.2 (top row) and σ² = 4 (bottom row). [Each panel has s on the horizontal axis and u on the vertical axis, both over [0, 1].]
In summary, the foregoing explorations for random fields in ℝ¹ and ℝ² strongly suggest that for Gaussian copula models corr(Y(s), Y(u)) is approximately isotropic when K̄_θ(s, u) is isotropic, while this is not the case for the PG2 model.

Remark. Let X, Y be random variables with joint cdf H, having marginals F, G and copula C, so H(x, y) = C(F(x), G(y)). It is often stated in the literature that dependence between X and Y does not depend on their marginals, but only on the copula. When F and G are continuous, this is indeed the case for many measures of dependence, such as Kendall's tau and Spearman's rho, but not for Pearson's correlation coefficient (irrespective of the marginal type). The examples in the previous sections show that for Gaussian copula models with negative binomial and zero-inflated Poisson marginals, Pearson's correlation displays some dependence on the marginals, albeit a mild one.

5. Example

In this section we perform an exploratory data analysis of a geostatistical count dataset to illustrate some of the capabilities and limitations of the models considered here. Figure 13 displays counts, within quadrats which are one square foot in size, of Japanese beetle larvae, collected on an 18 × 8 regular grid in a field planted with maize, where the distance
Figure 9. Heatmaps of corr(Y(s), Y(u)) as a function of (s, u) for GC-ZIP models with τ² = 0. The correlation function K̄_θ(s, u) is Matérn(0.3, 0.5) (left column) and Matérn(0.2, 2.5) (right column); σ² = 0.2 (top row) and σ² = 4 (bottom row). [Each panel has s on the horizontal axis and u on the vertical axis, both over [0, 1].]
between two nearest quadrat centers is 4 feet. The counts range from 0 to 8. This dataset is available in the R package SMPracticals. There are no covariates in this dataset, and an exploratory analysis (not shown) suggests no apparent spatial trend, so we assume that μ(s) and var(Y(s)) are constant, whence we estimate these quantities by the sample mean and variance, μ̂ = 2.5 and v̂ar{Y(s)} = 3.08. Figure 14 (left) displays the empirical semivariogram of the Japanese beetle larvae counts calculated under the assumption of isotropy, from which we get a preliminary estimate of the semivariogram for large distances, γ̂(∞) = sill-hat = 3.10. From the relation connecting the semivariogram and correlation functions in weakly stationary processes, we get for the smallest distance between observations (d = 4) that γ̂(4) = sill-hat × (1 − ρ̂(4)), so ρ̂(4) = 1 − 1.97/3.10 ≈ 0.36. This is an estimate of the maximum correlation among pairs of counts, so a sensible model for this dataset should be capable of matching it.

Consider the PG2 model. From (5), matching the sample mean and variance gives the estimate 2.5²/(3.08 − 2.5) ≈ 10.8 for the PG2 parameter, and then the maximum correlation under the PG2 model is obtained by setting K̄_θ(s, u) = 1 and substituting this estimate and μ̂ for the corresponding parameters in (8), resulting in ρ̂^max_PG2 = (1 + 10.8/2.5)⁻¹ ≈ 0.19. This falls far short of 0.36, whence the PG2 model seems inadequate for describing this dataset.

Consider now the GC-NB model. From (9) we note that to get an estimate of the maximum correlation under the GC-NB model, in addition to estimates of the marginal parameters, we also need an estimate of τ², the magnitude of the discontinuity of the correlation function of the (unobserved) latent Gaussian random field Z(·). To get a preliminary
Figure 10. Heatmaps of corr(Y(s), Y(u)) as a function of (s, u) for GC-ZIP models with τ² = 0.25. The correlation function K̄_θ(s, u) is Matérn(0.3, 0.5) (left column) and Matérn(0.2, 2.5) (right column); σ² = 0.2 (top row) and σ² = 4 (bottom row). [Each panel has s on the horizontal axis and u on the vertical axis, both over [0, 1].]
estimate, note from (1) that for any s ∈ D and k ∈ ℕ₀, Z(s) | Y(s) = k ∼ N(0, 1) truncated to (Φ⁻¹(F_s(k − 1)), Φ⁻¹(F_s(k))), so a set of 'pseudo residuals' can be defined as

$$\hat Z_i = \hat E\big(Z(s_i) \mid Y(s_i) = k\big) = \frac{\phi\big(\Phi^{-1}(\hat F_{s_i}(k-1))\big) - \phi\big(\Phi^{-1}(\hat F_{s_i}(k))\big)}{\hat F_{s_i}(k) - \hat F_{s_i}(k-1)},$$

where φ(·) is the pdf of the standard normal distribution, and F̂_{s_i}(·) is obtained by replacing μ and σ² (the reciprocal of the PG2 parameter estimated above) with the preliminary estimates previously obtained. Figure 14 (right) displays the empirical semivariogram of the pseudo residuals Ẑ_1, …, Ẑ_n. Since an estimate of τ² is highly dependent on the assumed correlation function K̄_θ(d) in (2), we fit the empirical semivariogram of the residuals to two models, the exponential and squared exponential correlation functions, representing opposite extremes in terms of 'smoothness'. The resulting estimates of the size of the discontinuity at the origin are γ̂_E(0⁺) = 0 and γ̂_SE(0⁺) = 0.43; see Figure 14 (right). Again using the relation connecting the semivariogram and the correlation functions, we have γ̂(0⁺) = sill-hat × τ̂², so preliminary estimates for τ² under the two assumed correlation functions are τ̂²_E = 0 and τ̂²_SE = 0.43/0.97 = 0.44. Finally, from (9), estimates of the maximum correlation under the GC-NB model under these two correlation models are
Figure 11. Heatmaps of corr(Y(s), Y(u)) as a function of (s, u) for PG2 models. The correlation function K̄_θ(s, u) is Matérn(0.3, 0.5) (left column) and Matérn(0.2, 2.5) (right column); σ² = 0.2 (top row) and σ² = 4 (bottom row). [Each panel has s on the horizontal axis and u on the vertical axis, both over [0, 1].]

Figure 12. Pairs of locations with different orientations separated by a distance 0.05 (left), 0.2 (center) and 0.6 (right). [In each panel the six pairs are labelled 1 to 6 within the unit square.]
ρ̂^max_GCNB,E = 1 and ρ̂^max_GCNB,SE = 0.54. The smaller of these is well above the model-free estimate 0.36, so the GC-NB model seems to provide a quite plausible description of this dataset.

6. Discussion

In this paper we have explored properties of correlation functions of Gaussian copula spatial models when the marginal distributions are discrete with support contained in ℕ₀. Specifically, we have focused on the case in which the marginal distributions are either negative binomial or zero-inflated Poisson. Although there is no closed-form expression for
Table 1. Corr(Y(s), Y(u)) for the GC-NB model corresponding to three distances d, six pairs of locations (s, u) separated by distance d, and different model features.

                                  Matérn(0.3, 0.5)                Matérn(0.2, 2.5)
                                  τ² = 0          τ² = 0.25       τ² = 0          τ² = 0.25
d     s             u             σ²=0.2  σ²=4    σ²=0.2  σ²=4    σ²=0.2  σ²=4    σ²=0.2  σ²=4
0.05  (0.58, 0.23)  (0.55, 0.27)  0.822   0.775   0.608   0.513   0.973   0.983   0.716   0.640
0.05  (0.71, 0.82)  (0.67, 0.84)  0.834   0.777   0.618   0.515   0.982   0.984   0.727   0.642
0.05  (0.96, 0.98)  (0.99, 0.94)  0.836   0.777   0.621   0.516   0.984   0.984   0.730   0.642
0.05  (0.99, 0.86)  (0.98, 0.82)  0.835   0.776   0.620   0.516   0.984   0.984   0.729   0.642
0.05  (0.39, 0.32)  (0.42, 0.27)  0.822   0.775   0.607   0.513   0.973   0.983   0.715   0.640
0.05  (0.14, 0.52)  (0.10, 0.49)  0.823   0.775   0.609   0.514   0.974   0.982   0.717   0.640
0.2   (0.71, 0.25)  (0.56, 0.37)  0.490   0.386   0.364   0.267   0.837   0.792   0.619   0.524
0.2   (0.09, 0.96)  (0.29, 0.98)  0.496   0.387   0.369   0.268   0.844   0.793   0.626   0.525
0.2   (0.57, 0.76)  (0.71, 0.62)  0.495   0.387   0.369   0.268   0.844   0.793   0.626   0.525
0.2   (0.04, 0.66)  (0.19, 0.52)  0.490   0.386   0.365   0.267   0.837   0.792   0.620   0.524
0.2   (0.89, 0.57)  (0.72, 0.46)  0.494   0.387   0.368   0.268   0.842   0.793   0.624   0.525
0.2   (0.37, 0.36)  (0.20, 0.25)  0.486   0.385   0.361   0.266   0.832   0.791   0.615   0.523
0.6   (0.07, 0.91)  (0.11, 0.31)  0.126   0.079   0.094   0.058   0.329   0.236   0.245   0.167
0.6   (0.82, 0.94)  (0.77, 0.35)  0.128   0.080   0.096   0.058   0.333   0.237   0.248   0.168
0.6   (0.76, 0.75)  (0.53, 0.20)  0.127   0.080   0.095   0.058   0.330   0.236   0.246   0.167
0.6   (0.84, 0.92)  (0.24, 0.93)  0.128   0.080   0.096   0.057   0.334   0.237   0.250   0.168
0.6   (0.27, 0.77)  (0.25, 0.17)  0.125   0.079   0.094   0.058   0.328   0.236   0.244   0.167
0.6   (0.04, 0.02)  (0.53, 0.38)  0.123   0.079   0.092   0.057   0.323   0.235   0.241   0.166
the correlation functions of the random fields that arise, these can be accurately approximated using a convenient representation. By a mix of analytical and numerical exploration, it was shown that the second-order structure of these Gaussian copula spatial models is quite flexible, this flexibility resembling, in a sense, that of the second-order structure of Gaussian random fields. In contrast, the second-order structure of hierarchical Poisson spatial models is not as flexible, as was shown in De Oliveira (2013). This was illustrated by a side-by-side comparison between the correlation functions of a Gaussian copula spatial model and a hierarchical Poisson spatial model. The comparisons were expressed in terms of range of feasible correlation, sensitivity to the mean function and modelling of isotropy. The main findings about the correlation functions of Gaussian copula models are summarized as follows:

• They can be either continuous everywhere or discontinuous along the 'diagonal', depending on whether the correlation function of the latent Gaussian random field is continuous everywhere or not.
• When the correlation function of the latent Gaussian random field is continuous, they can take any value in the interval [0, 1) that is feasible within the limit imposed by the Fréchet–Hoeffding upper bound.
• They display little sensitivity to the specification of their mean functions, so the correlation and mean functions are approximately (functionally) independent of each other. The situation again resembles that which pertains in Gaussian random fields.
• They can be made (approximately) isotropic by choosing the correlation function of the latent Gaussian random field to be isotropic, even when their mean functions are not constant.
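The limit mentioned in the second finding can be made concrete with a small sketch. The Fréchet–Hoeffding upper bound on the correlation of two count variables is attained by the comonotone pair (F_s⁻¹(U), F_u⁻¹(U)) with U uniform on (0, 1). The code below computes that bound for two negative binomial marginals; it assumes the parameterization with mean μ and variance μ(1 + σ²μ), and the function names are hypothetical.

```python
import math

def nb_probs(mu, sigma2, tail=1e-12):
    """Negative binomial pmf values P(Y = 0), P(Y = 1), ... up to a tiny tail.

    Assumed parameterization: mean mu, variance mu * (1 + sigma2 * mu).
    """
    r = 1.0 / sigma2                 # NB size parameter
    p = r / (r + mu)                 # success probability
    pmf, total, j, probs = p ** r, 0.0, 0, []
    while total < 1.0 - tail:
        probs.append(pmf)
        total += pmf
        pmf *= (j + r) / (j + 1) * (1.0 - p)   # P(Y = j + 1) from P(Y = j)
        j += 1
    return probs

def frechet_upper_corr(probs_x, probs_y):
    """Correlation of the comonotone pair (F_x^{-1}(U), F_y^{-1}(U))."""
    def mean_var(probs):
        m1 = sum(k * q for k, q in enumerate(probs))
        m2 = sum(k * k * q for k, q in enumerate(probs))
        return m1, m2 - m1 * m1

    mx, vx = mean_var(probs_x)
    my, vy = mean_var(probs_y)

    # Sweep u over (0, 1): on each piece both quantile functions are constant.
    i = j = 0
    rem_x, rem_y = probs_x[0], probs_y[0]
    exy = 0.0
    while i < len(probs_x) and j < len(probs_y):
        step = min(rem_x, rem_y)
        exy += i * j * step
        rem_x -= step
        rem_y -= step
        if rem_x <= 0.0:
            i += 1
            if i < len(probs_x):
                rem_x = probs_x[i]
        if rem_y <= 0.0:
            j += 1
            if j < len(probs_y):
                rem_y = probs_y[j]
    return (exy - mx * my) / math.sqrt(vx * vy)

same = frechet_upper_corr(nb_probs(3.0, 0.2), nb_probs(3.0, 0.2))
diff = frechet_upper_corr(nb_probs(3.0, 0.2), nb_probs(30.0, 0.2))
print("equal means:", same)
print("means 3 and 30:", diff)
```

With identical marginals the bound equals 1, while for marginals with different means it falls below 1, which is why the attainable range is stated as limited by this bound.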
Table 2. Corr(Y(s), Y(u)) for the GC-ZIP model corresponding to three distances d, six pairs of locations (s, u) separated by distance d, and different model features.

                                  Matérn(0.3, 0.5)                Matérn(0.2, 2.5)
                                  τ² = 0          τ² = 0.25       τ² = 0          τ² = 0.25
d     s             u             σ²=0.2  σ²=4    σ²=0.2  σ²=4    σ²=0.2  σ²=4    σ²=0.2  σ²=4
0.05  (0.58, 0.23)  (0.55, 0.27)  0.821   0.686   0.609   0.463   0.973   0.954   0.716   0.575
0.05  (0.71, 0.82)  (0.67, 0.84)  0.822   0.709   0.610   0.450   0.979   0.943   0.716   0.559
0.05  (0.96, 0.98)  (0.99, 0.94)  0.816   0.688   0.604   0.445   0.978   0.939   0.710   0.553
0.05  (0.99, 0.86)  (0.98, 0.82)  0.819   0.701   0.607   0.447   0.979   0.941   0.713   0.555
0.05  (0.39, 0.32)  (0.42, 0.27)  0.821   0.699   0.608   0.463   0.973   0.954   0.715   0.576
0.05  (0.14, 0.52)  (0.10, 0.49)  0.822   0.703   0.610   0.462   0.973   0.953   0.716   0.574
0.2   (0.71, 0.25)  (0.56, 0.37)  0.491   0.350   0.367   0.247   0.835   0.715   0.619   0.470
0.2   (0.09, 0.96)  (0.29, 0.98)  0.493   0.345   0.369   0.243   0.835   0.702   0.620   0.461
0.2   (0.57, 0.76)  (0.71, 0.62)  0.493   0.345   0.369   0.244   0.836   0.703   0.620   0.462
0.2   (0.04, 0.66)  (0.19, 0.52)  0.492   0.350   0.367   0.247   0.835   0.714   0.620   0.469
0.2   (0.89, 0.57)  (0.72, 0.46)  0.493   0.347   0.369   0.245   0.835   0.706   0.620   0.464
0.2   (0.37, 0.36)  (0.20, 0.25)  0.488   0.353   0.364   0.248   0.831   0.720   0.616   0.473
0.6   (0.07, 0.91)  (0.11, 0.31)  0.128   0.077   0.096   0.057   0.330   0.219   0.247   0.158
0.6   (0.82, 0.94)  (0.77, 0.35)  0.128   0.076   0.096   0.056   0.331   0.217   0.248   0.156
0.6   (0.76, 0.75)  (0.53, 0.20)  0.128   0.077   0.096   0.057   0.330   0.218   0.247   0.157
0.6   (0.84, 0.92)  (0.24, 0.93)  0.129   0.076   0.097   0.056   0.332   0.216   0.249   0.156
0.6   (0.27, 0.77)  (0.25, 0.17)  0.127   0.077   0.095   0.057   0.329   0.219   0.246   0.158
0.6   (0.04, 0.02)  (0.53, 0.38)  0.125   0.077   0.094   0.057   0.325   0.221   0.243   0.158

Table 3. Corr(Y(s), Y(u)) for the PG2 model corresponding to three distances d, six pairs of locations (s, u) separated by distance d, and different model features.

                                  Matérn(0.3, 0.5)  Matérn(0.2, 2.5)
d     s             u             σ²=0.2  σ²=4      σ²=0.2  σ²=4
0.05  (0.58, 0.23)  (0.55, 0.27)  0.338   0.718     0.355   0.903
0.05  (0.71, 0.82)  (0.67, 0.84)  0.465   0.711     0.511   0.886
0.05  (0.96, 0.98)  (0.99, 0.94)  0.234   0.739     0.577   0.933
0.05  (0.99, 0.86)  (0.98, 0.82)  0.287   0.733     0.551   0.931
0.05  (0.39, 0.32)  (0.42, 0.27)  0.334   0.727     0.348   0.929
0.05  (0.14, 0.52)  (0.10, 0.49)  0.427   0.713     0.363   0.942
0.2   (0.71, 0.25)  (0.56, 0.37)  0.192   0.359     0.326   0.734
0.2   (0.09, 0.96)  (0.29, 0.98)  0.246   0.369     0.417   0.754
0.2   (0.57, 0.76)  (0.71, 0.62)  0.239   0.368     0.406   0.752
0.2   (0.04, 0.66)  (0.19, 0.52)  0.194   0.360     0.330   0.736
0.2   (0.89, 0.57)  (0.72, 0.46)  0.227   0.366     0.385   0.748
0.2   (0.37, 0.36)  (0.20, 0.25)  0.170   0.354     0.289   0.723
0.6   (0.07, 0.91)  (0.11, 0.31)  0.050   0.074     0.130   0.220
0.6   (0.82, 0.94)  (0.77, 0.35)  0.062   0.076     0.162   0.226
0.6   (0.76, 0.75)  (0.53, 0.20)  0.054   0.075     0.141   0.222
0.6   (0.84, 0.92)  (0.24, 0.93)  0.068   0.077     0.176   0.228
0.6   (0.27, 0.77)  (0.25, 0.17)  0.048   0.074     0.126   0.219
0.6   (0.04, 0.02)  (0.53, 0.38)  0.041   0.072     0.106   0.214
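The contrast between the three tables can be summarized by the spread of the reported correlations across the six orientations at a fixed distance. The short sketch below does this for d = 0.05 and the first column of each table (Matérn(0.3, 0.5) with σ² = 0.2, and τ² = 0 for the Gaussian copula models); the values are copied from Tables 1 to 3 and the helper name is hypothetical.

```python
# Correlations at d = 0.05 for the six orientations, first data column of each table.
gc_nb = [0.822, 0.834, 0.836, 0.835, 0.822, 0.823]   # Table 1 (GC-NB)
gc_zip = [0.821, 0.822, 0.816, 0.819, 0.821, 0.822]  # Table 2 (GC-ZIP)
pg2 = [0.338, 0.465, 0.234, 0.287, 0.334, 0.427]     # Table 3 (PG2)

def spread(values):
    """Range of variation across orientations: max minus min."""
    return max(values) - min(values)

for name, values in [("GC-NB", gc_nb), ("GC-ZIP", gc_zip), ("PG2", pg2)]:
    print(name, round(spread(values), 3))
```

The PG2 spread is the range of variation of 0.23 quoted in the text, more than an order of magnitude larger than the spreads for the two Gaussian copula models.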
Figure 13. Counts of Japanese beetle larvae within square foot quadrats. [The counts are plotted at the quadrat locations, with x (ft) on the horizontal axis and y (ft) on the vertical axis.]

Figure 14. Left: Empirical semivariogram of the Japanese beetle larvae counts; Right: Empirical semivariogram of the pseudo residuals Ẑ_i under the GC-NB model, and two fitted semivariograms. [Both panels plot the semivariogram against distance (ft); the fitted models in the right panel are the exponential and the squared exponential.]
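The pseudo residuals of Section 5 can be sketched as follows. The code assumes a negative binomial marginal with mean μ and variance μ(1 + σ²μ), plugs in the moment estimates μ̂ = 2.5 and σ̂² = (3.08 − 2.5)/2.5², and evaluates the truncated normal mean for each possible count from 0 to 8. The function names are hypothetical, and the parameterization is an assumption consistent with the estimates quoted in the text.

```python
from statistics import NormalDist

N01 = NormalDist()  # standard normal distribution

def nb_cdf_table(mu, sigma2, kmax):
    """Negative binomial cdf values F(0), ..., F(kmax).

    Assumed parameterization: mean mu, variance mu * (1 + sigma2 * mu).
    """
    r = 1.0 / sigma2
    p = r / (r + mu)
    pmf, total, cdf = p ** r, 0.0, []
    for k in range(kmax + 1):
        total += pmf
        cdf.append(min(total, 1.0))
        pmf *= (k + r) / (k + 1) * (1.0 - p)   # P(Y = k + 1) from P(Y = k)
    return cdf

def phi_of_quantile(prob):
    """phi(Phi^{-1}(prob)), taken as 0 at the endpoints prob = 0 and prob = 1."""
    if prob <= 0.0 or prob >= 1.0:
        return 0.0
    return N01.pdf(N01.inv_cdf(prob))

def pseudo_residual(k, cdf):
    """Mean of N(0, 1) truncated to (Phi^{-1}(F(k - 1)), Phi^{-1}(F(k)))."""
    lower = cdf[k - 1] if k > 0 else 0.0
    upper = cdf[k]
    return (phi_of_quantile(lower) - phi_of_quantile(upper)) / (upper - lower)

mu_hat = 2.5
sigma2_hat = (3.08 - 2.5) / 2.5 ** 2      # moment estimate, about 1/10.8
cdf = nb_cdf_table(mu_hat, sigma2_hat, kmax=8)
residuals = [pseudo_residual(k, cdf) for k in range(9)]

for k, z in enumerate(residuals):
    print(k, round(z, 3))
```

Because the mean is assumed constant here, every quadrat with the same count receives the same pseudo residual, so the empirical semivariogram of the residuals only requires this lookup of nine values.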
Based on the aforementioned findings and the exploratory data analysis in Section 5 we conjecture that the Gaussian copula spatial models studied in this work may be more capable than hierarchical Poisson models of describing geostatistical data that consist mostly of small counts or that display strong correlation and small overdispersion. Although the explorations and findings in this work focused on Gaussian copula spatial models having discrete marginal distributions (specifically negative binomial and zero-inflated Poisson distributions), we conjecture that similar properties should also hold for other families of marginal distributions, whether discrete, continuous or mixed. Further investigation of this and related matters will be undertaken elsewhere.

7. Appendix A

Proof of Lemma 2.2

The error in approximating cov(Y(s), Y(u)) that results from truncating (6) at M is

$$\tilde e(M) = \mathrm{cov}(Y(s), Y(u)) - \sum_{k=1}^{M} a_k(F_s)\, a_k(F_u)\, \frac{(K_\vartheta(s,u))^k}{k!} = \sum_{k=M+1}^{\infty} a_k(F_s)\, a_k(F_u)\, \frac{(K_\vartheta(s,u))^k}{k!}.$$

From the Cauchy-Schwarz inequality it follows that for any s ∈ D and k ∈ ℕ₀

$$|a_k(F_s)| = \big|E\big(F_s^{-1}(\Phi(Z); \psi)\, H_k(Z)\big)\big| \le \big(E(Y^2(s))\, E(H_k^2(Z))\big)^{1/2},$$

where Z ∼ N(0, 1). Since E(H_k²(Z)) = k!, we have

$$|\tilde e(M)| \le \sum_{k=M+1}^{\infty} |a_k(F_s)\, a_k(F_u)|\, \frac{(K_\vartheta(s,u))^k}{k!} \le \sum_{k=M+1}^{\infty} \big(E(Y^2(s))\, k!\big)^{1/2} \big(E(Y^2(u))\, k!\big)^{1/2}\, \frac{(K_\vartheta(s,u))^k}{k!} = \big(E(Y^2(s))\, E(Y^2(u))\big)^{1/2}\, \frac{(K_\vartheta(s,u))^{M+1}}{1 - K_\vartheta(s,u)}.$$

Using this and (4), we have

$$|e(M)| = \frac{|\tilde e(M)|}{\big[\mu(s)\mu(u)\,(1 + \sigma^2\mu(s))(1 + \sigma^2\mu(u))\big]^{1/2}} \le \frac{\big[\mu(s)\mu(u)\,(1 + \mu(s) + \sigma^2\mu(s))(1 + \mu(u) + \sigma^2\mu(u))\big]^{1/2}\,(K_\vartheta(s,u))^{M+1}}{\big[\mu(s)\mu(u)\,(1 + \sigma^2\mu(s))(1 + \sigma^2\mu(u))\big]^{1/2}\,(1 - K_\vartheta(s,u))}$$

$$= \left[\left(1 + \frac{\mu(s)^2}{\mu(s) + \sigma^2\mu(s)^2}\right)\left(1 + \frac{\mu(u)^2}{\mu(u) + \sigma^2\mu(u)^2}\right)\right]^{1/2} \frac{(K_\vartheta(s,u))^{M+1}}{1 - K_\vartheta(s,u)}.$$
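The bound above gives a simple rule for choosing the truncation point M, and the truncated series itself is easy to evaluate for count marginals. The sketch below is a minimal illustration for negative binomial marginals, assuming mean μ and variance μ(1 + σ²μ). It relies on the identity a_k(F) = Σ_j φ(c_j) H_{k−1}(c_j), with c_j = Φ⁻¹(F(j)), which follows from integrating H_k φ over the intervals where F⁻¹(Φ(z)) is constant; this identity is a derivation made here and is not quoted from the text. All function names are hypothetical.

```python
import math
from statistics import NormalDist

N01 = NormalDist()  # standard normal distribution

def nb_cutpoints(mu, sigma2, tail=1e-12):
    """Cutpoints c_j = Phi^{-1}(F(j)) of a negative binomial cdf F.

    Assumed parameterization: mean mu, variance mu * (1 + sigma2 * mu).
    """
    r = 1.0 / sigma2
    p = r / (r + mu)
    pmf, cdf, j, cuts = p ** r, 0.0, 0, []
    while True:
        cdf += pmf
        if cdf >= 1.0 - tail:          # ignore the negligible upper tail
            return cuts
        cuts.append(N01.inv_cdf(cdf))
        pmf *= (j + r) / (j + 1) * (1.0 - p)
        j += 1

def normalized_coeffs(cuts, M):
    """b_k = a_k / sqrt(k!) for k = 1..M, with a_k = sum_j phi(c_j) H_{k-1}(c_j)."""
    b = [0.0] * (M + 1)
    for c in cuts:
        phi = N01.pdf(c)
        h_prev, h = 0.0, 1.0           # normalized Hermite values h_{-1}, h_0
        for k in range(1, M + 1):
            b[k] += phi * h / math.sqrt(k)
            # h_k = (c * h_{k-1} - sqrt(k - 1) * h_{k-2}) / sqrt(k)
            h_prev, h = h, (c * h - math.sqrt(k - 1) * h_prev) / math.sqrt(k)
    return b

def error_bound(mu_s, mu_u, sigma2, K, M):
    """Upper bound on |e(M)| from the proof, valid for 0 <= K < 1."""
    def ratio(mu):
        return 1.0 + mu ** 2 / (mu + sigma2 * mu ** 2)
    return math.sqrt(ratio(mu_s) * ratio(mu_u)) * K ** (M + 1) / (1.0 - K)

def smallest_M(mu_s, mu_u, sigma2, K, eps=1e-3):
    """Smallest truncation point whose error bound is below eps."""
    M = 0
    while error_bound(mu_s, mu_u, sigma2, K, M) >= eps:
        M += 1
    return M

def gc_nb_corr(mu_s, mu_u, sigma2, K, eps=1e-3):
    """Truncated-series approximation of corr(Y(s), Y(u)) and the M used."""
    M = smallest_M(mu_s, mu_u, sigma2, K, eps)
    bs = normalized_coeffs(nb_cutpoints(mu_s, sigma2), M)
    bu = normalized_coeffs(nb_cutpoints(mu_u, sigma2), M)
    cov = sum(bs[k] * bu[k] * K ** k for k in range(1, M + 1))
    var_s = mu_s * (1.0 + sigma2 * mu_s)
    var_u = mu_u * (1.0 + sigma2 * mu_u)
    return cov / math.sqrt(var_s * var_u), M

rho, M_used = gc_nb_corr(3.0, 3.0, 0.2, 0.5)
print("M =", M_used, "corr approx =", round(rho, 3))
```

Working with the normalized coefficients a_k/√(k!) avoids forming large factorials and Hermite values separately. The default tolerance 10⁻³ matches the approximation error stated for the tables.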
References

Agarwal, D.K., Gelfand, A.E. & Citron-Pousty, S. (2002). Zero-inflated models with application to spatial count data. Environ. Ecol. Stat. 9, 341–355.
Cameron, A.C. & Trivedi, P.K. (2013). Regression Analysis of Count Data, 2nd edn. Cambridge University Press, New York.
Cario, M.C. & Nelson, B.L. (1997). Modeling and generating random vectors with arbitrary marginal distributions and correlation matrix. Technical report, Department of Industrial Engineering and Management Sciences, Northwestern University.
De Oliveira, V. (2003). A note on the correlation structure of transformed Gaussian random fields. Aust. N. Z. J. Stat. 45, 353–366.
De Oliveira, V. (2013). Hierarchical Poisson models for spatial count data. J. Multivariate Anal. 122, 393–408.
De Oliveira, V. (2014). Poisson kriging: a closer investigation. Spat. Stat. 7, 1–20.
Diggle, P.J., Tawn, J.A. & Moyeed, R.A. (1998). Model-based geostatistics (with discussion). J. R. Stat. Soc. Ser. C. Appl. Stat. 47, 299–326.
Grigoriu, M. (2007). Multivariate distributions with specified marginals: applications to wind engineering. J. Eng. Mech. 133, 174–184.
Kazianka, H. (2013). Approximate copula-based estimation and prediction of discrete spatial data. Stoch. Env. Res. Risk. A. 27, 2015–2026.
Kazianka, H. & Pilz, J. (2010). Copula-based geostatistical modeling of continuous and discrete data including covariates. Stoch. Env. Res. Risk. A. 24, 661–673.
Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34, 1–14.
Madsen, L. (2009). Maximum likelihood estimation of regression parameters with spatially dependent discrete data. J. Agric. Biol. Environ. Stat. 14, 375–391.
McNeill, L. (1991). Interpolation and smoothing of binomial data for the Southern African Bird Atlas Project. South African Statist. J. 25, 129–136.
Monestiez, P., Dubroca, L., Bonnin, E., Durbec, J.-P. & Guinet, C. (2006). Geostatistical modelling of spatial distribution of Balaenoptera physalus in the northwestern Mediterranean Sea from sparse count data and heterogeneous. Ecological Modelling 193, 615–628.
Nelsen, R.B. (1987). Discrete bivariate distributions with given marginals and correlation. Comm. Statist. Simulation Comput. 16, 199–208.
Nelsen, R.B. (2006). An Introduction to Copulas, 2nd edn. Springer, New York.
Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publications de l'Institut de Statistique de l'Université de Paris 8, 229–231.
Whitt, W. (1976). Bivariate distributions with given marginals. Ann. Statist. 4, 1280–1289.