Stochastic Environmental Research and Risk Assessment 16 (2002) 425–448 © Springer-Verlag 2002 DOI 10.1007/s00477-002-0114-4
Spatial prediction of categorical variables: the Bayesian maximum entropy approach P. Bogaert
Abstract. Being a non-linear method based on a rigorous formalism and an efficient processing of various information sources, the Bayesian maximum entropy (BME) approach has proven to be a very powerful method in the context of continuous spatial random fields, providing much more satisfactory estimates than those obtained from traditional linear geostatistics (i.e., the various kriging techniques). This paper presents an extension of the BME formalism to categorical spatial random fields. In the first part of the paper, the indicator kriging and cokriging methods are briefly presented and discussed, with special emphasis on their inherent limitations, from both the theoretical and the practical point of view. The second part presents the theoretical developments of the BME approach for the case of categorical variables. The three-stage procedure is explained and the formulations for obtaining prior joint distributions and computing posterior conditional distributions are given for various typical cases. The last part of the paper consists of a simulation study assessing the performance of BME relative to the traditional indicator (co)kriging techniques. The results of these simulations highlight the theoretical limitations of the indicator approach (negative probability estimates, probability distributions that do not sum to one, etc.) as well as the much better performance of the BME approach, whose estimates are very close to the theoretical conditional probabilities that can be computed according to the stated simulation hypotheses.
Keywords: Geostatistics, Indicator kriging, BME, Categorical variable
P. Bogaert, UCL/AGRO/MILA/ENGE, Place Croix du Sud, 2 Box 16, 1348 Louvain-la-Neuve, Belgium
1 Introduction
A new and powerful epistemic approach to random field estimation (mapping) across space and time, which combines Bayesian conditionalization of physical knowledge with stochastic information (entropy) maximization, was proposed by Christakos (1990, 1991, 2000). This approach became widely known under the name of Bayesian maximum entropy (BME) and has been used with great success in a variety of scientific and engineering applications (see, e.g., Christakos and Li, 1998; Bogaert et al., 1999; D’Or et al., 2001; Christakos et al., 2002). In this work we propose an extension of the BME formalism to discrete-valued or categorical data, for both ordinal and nominal variables, which provides a very powerful method for spatial prediction and mapping of categorical variables. In the environmental context, the efficient prediction of categorical distributions is of paramount importance for mapping purposes, e.g., land use suitability (Bierkens and Burrough, 1993), land cover (de Bruin, 2000) or soil mapping (Heuvelink and Webster, 2001), which are typically characterized by their discrete nature and the necessity of integrating various data sources. Currently, some of the most widely used reference methods rely on the geostatistical indicator formalism (Journel, 1983). This formalism offers the advantage of being simple to understand and easy to implement, though it suffers from many limitations, both theoretical and methodological. Indicator kriging techniques are based on a least-squares optimality criterion and remain linear estimators, even if the coding of the data themselves (indicator coding) is made in a non-linear way. Also, there are major concerns about the consistency of the models that are used, as well as about the consistency of the data coding. In order to circumvent these problems, the BME approach is used, which is completely non-linear and does not rely on the least-squares optimality criterion. Before moving on to the implementation of the BME method, it is useful to define first the general framework within which the method is intended to be used, as well as to review the indicator formalism of geostatistics.
2 The general framework
This section is devoted to a brief presentation of the general context of our work as well as to a quick overview of the main tools that are frequently used for characterizing the spatial dependence of categorical variables, namely the indicator (cross-)covariance functions and (cross-)variograms.
2.1 Categorical random field
Consider a discrete or categorical random variable $C$, which can be nominal or ordinal, having $\Omega_C \equiv \{c_i;\ i = 1,\dots,n_c\}$ as its finite set of possible outcomes. We will be interested in the case where the $c_i$'s form a complete system of events, i.e., $c_i \neq \emptyset\ \forall\, i$ and $c_i \cap c_j = \emptyset\ \forall\, i \neq j$, so that $P[\cup_i c_i] = \sum_i P[c_i] = 1$. Consider a spatial domain $D$ and an arbitrary location vector $x_\alpha$ such that $F_C \equiv \{C(x_\alpha);\ x_\alpha \in D \subset \mathbb{R}^d\}$ defines a discrete-valued random field with a continuous domain over space. Let
$$p_{\alpha(i_\alpha)} \equiv P[C(x_\alpha) = c_{i_\alpha}] \tag{1}$$
be the univariate probability that the field takes the value $c_{i_\alpha}$ at location $x_\alpha$ ($i_\alpha = 1,\dots,n_c$). Similarly, let
$$p_{\alpha,\beta(i_\alpha,i_\beta)} \equiv P[C(x_\alpha) = c_{i_\alpha} \cap C(x_\beta) = c_{i_\beta}] \tag{2}$$
be the bivariate probability that the field takes the values $c_{i_\alpha}$ and $c_{i_\beta}$ at locations $x_\alpha$ and $x_\beta$, respectively ($i_\alpha, i_\beta = 1,\dots,n_c$). We will assume that by letting $h_{\alpha\beta} = x_\alpha - x_\beta$ and $h_{\chi\delta} = x_\chi - x_\delta$, we have $p_{\alpha,\beta(i,j)} = p_{\chi,\delta(i,j)}$ if $h_{\alpha\beta} = h_{\chi\delta}$, $\forall\, i,j = 1,\dots,n_c$, i.e., bivariate probabilities are invariant under translation (an even stronger hypothesis would be $p_{\alpha,\beta(i,j)} = p_{\chi,\delta(i,j)}\ \forall\, \|h_{\alpha\beta}\| = \|h_{\chi\delta}\|$, i.e., bivariate invariance under translation and rotation). It is clear that assuming this automatically implies that $p_{\alpha,\alpha(i,j)} = p_{\chi,\chi(i,j)}$ when $h_{\alpha\beta} = h_{\chi\delta} = 0$, yielding $p_{\alpha(i)} = p_{\chi(i)}\ \forall\, x_\alpha, x_\chi$, i.e., invariance under translation of the univariate probability distribution.
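2.2 Indicator covariance functions and variograms
Covariance functions and variograms are widely used for characterizing the spatial dependence of second-order stationary continuous-valued random fields. They can be extended to the case of discrete-valued random fields by using the Kronecker delta operator
$$\delta_i(x) = \begin{cases} 1 & \text{if } C(x) = c_i \\ 0 & \text{if } C(x) \neq c_i \end{cases} \tag{3}$$
so that computing $C_{\delta_{i_\alpha},\delta_{i_\beta}}(x_\alpha, x_\beta) = \mathrm{Cov}[\delta_{i_\alpha}(x_\alpha), \delta_{i_\beta}(x_\beta)]$ under the hypothesis of bivariate invariance under translation yields the indicator (cross-)covariance function
$$C_{\delta_{i_\alpha},\delta_{i_\beta}}(h_{\alpha\beta}) = E[\delta_{i_\alpha}(x_\alpha)\,\delta_{i_\beta}(x_\beta)] - E[\delta_{i_\alpha}(x_\alpha)]\,E[\delta_{i_\beta}(x_\beta)] = p_{\alpha,\beta(i_\alpha,i_\beta)} - p_{i_\alpha} p_{i_\beta}, \tag{4}$$
sometimes referred to as the class indicator (cross-)covariance function. From (4),
$$C_{\delta_{i_\alpha},\delta_{i_\beta}}(0) = \begin{cases} -p_{i_\alpha} p_{i_\beta} & \text{if } i_\alpha \neq i_\beta \\ p_{i_\alpha}(1 - p_{i_\alpha}) & \text{if } i_\alpha = i_\beta \end{cases} \tag{5}$$
as $p_{\alpha,\alpha(i_\alpha,i_\beta)} = 0$ if $i_\alpha \neq i_\beta$ and $p_{\alpha,\alpha(i_\alpha,i_\alpha)} = p_{\alpha(i_\alpha)} = p_{i_\alpha}$ from the invariance under translation hypothesis. Considering that $\delta_{i_\alpha}(x_\alpha)$ and $\delta_{i_\beta}(x_\beta)$ can be assumed to be independent when $\|h_{\alpha\beta}\| \to \infty$ allows us to write $p_{\alpha,\beta(i_\alpha,i_\beta)} = p_{i_\alpha} p_{i_\beta}$, and so
$$\lim_{\|h_{\alpha\beta}\| \to \infty} C_{\delta_{i_\alpha},\delta_{i_\beta}}(h_{\alpha\beta}) = 0 \quad \forall\, i_\alpha, i_\beta, \tag{6}$$
which we will simply denote as $C_{\delta_{i_\alpha},\delta_{i_\beta}}(\infty) = 0$. A closely related function is the indicator (cross-)variogram,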
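$$\gamma_{\delta_{i_\alpha},\delta_{i_\beta}}(h_{\alpha\beta}) = \tfrac{1}{2}\,\mathrm{Cov}[\delta_{i_\alpha}(x_\alpha) - \delta_{i_\alpha}(x_\beta),\ \delta_{i_\beta}(x_\alpha) - \delta_{i_\beta}(x_\beta)] = \delta_{(i_\alpha = i_\beta)}\, p_{i_\alpha} - p_{\alpha,\beta(i_\alpha,i_\beta)}, \tag{7}$$
where
$$\gamma_{\delta_{i_\alpha},\delta_{i_\beta}}(0) = 0 \quad \forall\, i_\alpha, i_\beta \tag{8}$$
and
$$\gamma_{\delta_{i_\alpha},\delta_{i_\beta}}(\infty) = \begin{cases} -p_{i_\alpha} p_{i_\beta} & \text{if } i_\alpha \neq i_\beta \\ p_{i_\alpha}(1 - p_{i_\alpha}) & \text{if } i_\alpha = i_\beta \end{cases} \tag{9}$$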
using the same hypotheses as above. At this point, one should remark from (4) and (7) that
$$\mathrm{Var}[\delta_i] = C_{\delta_i,\delta_i}(0) = \gamma_{\delta_i,\delta_i}(\infty) = p_i(1 - p_i), \qquad E[\delta_i] = p_i \tag{10}$$
so that $\mathrm{Var}[\delta_i] = E[\delta_i] - E^2[\delta_i]$. It is clear from (10) that knowing $E[\delta_i]$ automatically implies the knowledge of $\mathrm{Var}[\delta_i]$. Clearly, the variograms
$$\gamma_{\delta_i,\delta_i}(h_{\alpha\beta}) = \tfrac{1}{2}\,\mathrm{Var}[\delta_i(x_\alpha) - \delta_i(x_\beta)] = p_i - p_{\alpha,\beta(i,i)} \tag{11}$$
depend on $E[\delta_i] = p_i$. As a second remark, among other conditions, it is obvious from (10) that $\gamma_{\delta_i,\delta_i}(\infty) \le 0.25\ \forall\, i$ and that the $\mathrm{Var}[\delta_i]$'s must be such that the corresponding $p_i$'s sum to one, two basic requirements which are found to be frequently violated by the categorical indicator variogram models found in the literature (see, e.g., Castrignano et al., 2000; De Cesare and Posa, 1995; Oberthur et al., 1999). As a final remark, it is also clear that the set of indicator (cross-)covariance functions and (cross-)variograms does not contain any additional information compared to the more natural and much simpler set of probability functions
$$p_{i_\alpha,i_\beta}(h_{\alpha\beta}) = p_{\alpha,\beta(i_\alpha,i_\beta)}, \tag{12}$$
which are directly linked to transition probability functions in a continuous Markov chain context. Equation (12) is also intimately related to transition rates between categories and mean length of these categories (see, e.g., Weissmann et al., 1999). A detailed discussion of the advantages of (12) over (4) and (7) is given in Carle and Fogg (1996). Of course, this also implies that the various indicator (cross-)variograms (covariance functions) are intimately linked together, as they must define a valid set of univariate and bivariate probabilities. These remarks will be useful when examining the modeling of (4) and (7) and their use in a prediction algorithm like indicator (co)kriging.
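As a minimal numerical sketch of relations (4) to (10), the following Python fragment builds the indicator covariances and variograms directly from bivariate probability tables. The three univariate probabilities and the function names are hypothetical choices made for illustration only; the two limiting tables (lag zero and infinite lag) follow from the hypotheses stated above.

```python
import numpy as np

def indicator_cov(p_biv, p):
    """Eq. (4): C(h) = p_{a,b(ia,ib)} - p_ia * p_ib for one lag h."""
    return p_biv - np.outer(p, p)

def indicator_variogram(p_biv, p):
    """Eq. (7): gamma(h) = delta(ia = ib) * p_ia - p_{a,b(ia,ib)}."""
    return np.diag(p) - p_biv

# Hypothetical univariate probabilities for n_c = 3 categories.
p = np.array([0.2, 0.5, 0.3])

# Limiting bivariate tables: the same category is observed twice at h = 0,
# and categories are independent as ||h|| goes to infinity.
p_biv_zero = np.diag(p)
p_biv_inf = np.outer(p, p)

cov_zero = indicator_cov(p_biv_zero, p)         # Eq. (5)
cov_inf = indicator_cov(p_biv_inf, p)           # Eq. (6)
gam_zero = indicator_variogram(p_biv_zero, p)   # Eq. (8)
gam_inf = indicator_variogram(p_biv_inf, p)     # Eq. (9)
```

In this sketch the sills of the direct variograms are fully determined by the $p_i$'s, which is the point made by (10): a model whose sills do not correspond to a probability vector summing to one cannot come from a valid set of bivariate probabilities.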
3 The indicator approach
In a spatial prediction context, we are traditionally interested in predicting the probabilities of observing each category at an unsampled location $x_0$, based on a set of information at locations $x_\alpha$ ($\alpha = 1,\dots,n$) where the categories have been observed. This problem is often solved using various forms of indicator (co)kriging algorithms, which are obtained as a straightforward modification of the regular kriging algorithm for continuous-valued random fields. These algorithms are generally used in a prediction context, but they are also at the root of simulation methods (see, e.g., De Cesare and Posa, 1995; Gomez-Hernandez et al., 1990; Ying, 2000). We will focus first on the case of indicator kriging (Journel, 1983) and emphasize the limitations of the method in the context considered in this work. We will also show the limitations of the indicator cokriging algorithm, which is the straightforward multivariate extension of indicator kriging. For a comprehensive discussion of the limitations of indicator (co)kriging in the context of continuous-valued random fields, see, e.g., Cressie (1993) and Olea (1999).
3.1 Indicator kriging
What is sought are the $\hat p_{i_0}(x_0 \mid \{i_\alpha\}) \equiv \hat P[C(x_0) = c_{i_0} \mid \{C(x_\alpha) = c_{i_\alpha};\ \alpha = 1,\dots,n\}]$, $i_0 = 1,\dots,n_c$, i.e., the conditional probability estimates for each category at an unsampled location $x_0$ taking into account the information provided at surrounding locations $x_\alpha$, $\alpha = 1,\dots,n$. Define the indicator coding
$$\delta_{i_0}(x_\alpha) = \begin{cases} 1 & \text{if } C(x_\alpha) = c_{i_0} \\ 0 & \text{if } C(x_\alpha) \neq c_{i_0} \end{cases} \tag{13}$$
The indicator kriging predictor is then written as
$$\hat p_{i_0}(x_0 \mid \{i_\alpha\}) = \sum_{\alpha=1}^{n} \lambda_{i_0,\alpha}\, \delta_{i_0}(x_\alpha), \tag{14}$$
where the weights $\lambda_{i_0,\alpha}$ are obtained by solving a linear set of equations that involve $C_{\delta_{i_0},\delta_{i_0}}(h)$ or $\gamma_{\delta_{i_0},\delta_{i_0}}(h)$, subject to unbiasedness and optimality constraints. Classically, unbiasedness is ensured by imposing that $\sum_\alpha \lambda_{i_0,\alpha} = 1$, so that $E[\hat p_{i_0}(x_0 \mid \{i_\alpha\})] = p_{i_0}$. The consequence of this constraint in the linear kriging equations system is that the $\lambda_{i_0,\alpha}$'s are invariant under a multiplication of the covariance function or variogram by a positive constant, i.e., using $\gamma_{\delta_{i_0},\delta_{i_0}}(h)$ or $k\,\gamma_{\delta_{i_0},\delta_{i_0}}(h)$ with $k > 0$ does not affect these weights and yields the same predicted probability $\hat p_{i_0}(x_0 \mid \{i_\alpha\})$. This could be viewed as a paradox of the method, as $\gamma_{\delta_{i_0},\delta_{i_0}}(h)$ should depend on $p_{i_0}$ as seen from (11). The use of the unbiasedness constraint is often justified by the fact that it would allow one to consider the indicator mean $E[\delta_i(x)] = p_i$ as constant but unknown (Goovaerts, 1997), by analogy to what is done for continuous random fields. However, for categorical data, this is inconsistent as the knowledge of the variogram $\gamma_{\delta_{i_0},\delta_{i_0}}(h)$ implies the knowledge of $p_{i_0}$. A more consistent approach would be to systematically use the simple indicator kriging predictor
$$\hat p_{i_0}(x_0 \mid \{i_\alpha\}) = \sum_{\alpha=1}^{n} \lambda_{i_0,\alpha}\, \big(\delta_{i_0}(x_\alpha) - p_{i_0}\big) + p_{i_0}. \tag{15}$$
Note that, being nonconvex operators, the predictors (14) and (15) cannot guarantee that $\hat p_{i_0}(x_0 \mid \{i_\alpha\}) \ge 0$, so invalid negative probability estimates can easily be obtained. As the probability estimates are obtained separately for each value of the index $i_0$, there is also no guarantee that $\sum_{i_0} \hat p_{i_0}(x_0 \mid \{i_\alpha\}) = 1$, as required. These problems will be illustrated later with some examples.
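The invariance of the weights under a rescaling of the covariance function can be checked numerically. The sketch below is only an illustration under stated assumptions: the sample locations, the observed categories, the value $p_1 = 0.2$ and the exponential covariance model with practical range 0.5 are all hypothetical, and the ordinary kriging system is written in its usual covariance form with one Lagrange multiplier.

```python
import numpy as np

def exp_cov(h, sill, practical_range=0.5):
    """Exponential covariance model (an assumed, illustrative choice)."""
    return sill * np.exp(-3.0 * h / practical_range)

def ordinary_kriging_weights(xs, x0, sill):
    """Solve the ordinary kriging system [[C, 1], [1', 0]] [w; m] = [c0; 1]."""
    n = len(xs)
    dist = np.linalg.norm(xs[:, None, :] - xs[None, :, :], axis=-1)
    dist0 = np.linalg.norm(xs - x0, axis=-1)
    lhs = np.ones((n + 1, n + 1))
    lhs[:n, :n] = exp_cov(dist, sill)
    lhs[n, n] = 0.0
    rhs = np.ones(n + 1)
    rhs[:n] = exp_cov(dist0, sill)
    # Drop the Lagrange multiplier, keep the n kriging weights.
    return np.linalg.solve(lhs, rhs)[:n]

# Hypothetical data: five sampled locations with observed categories.
xs = np.array([[0.1, 0.2], [0.4, 0.9], [0.7, 0.3], [0.8, 0.8], [0.3, 0.5]])
cats = np.array([1, 3, 2, 1, 1])
x0 = np.array([0.5, 0.5])

p1 = 0.2  # assumed univariate probability of category 1
w_consistent = ordinary_kriging_weights(xs, x0, sill=p1 * (1 - p1))
w_rescaled = ordinary_kriging_weights(xs, x0, sill=1.0)

indicator = (cats == 1).astype(float)   # indicator coding, Eq. (13)
p1_hat = w_consistent @ indicator       # predictor, Eq. (14)
```

Both calls return the same weights although only the first sill, $p_1(1 - p_1)$, is consistent with (10), which is the paradox mentioned above.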
3.2 Indicator cokriging
Indicator cokriging was proposed as a remedy for the fact that joint information about the various categories is neglected in indicator kriging (see, e.g., Lajaunie, 1990). This joint information is carried by the cross-variograms or cross-covariance functions. These must be modeled as a whole for their use in the linear cokriging equations, as it is required that $\mathrm{Var}[\hat p_{i_0}(x_0 \mid \{i_\alpha\})] \ge 0$. Traditionally, this is ensured by defining a valid linear model of coregionalization (LMC; Journel and Huijbregts, 1978). For example, in terms of covariance functions, this model must be positive definite and is written as
$$\Sigma_h = \sum_m \Sigma_0^m\, C^m(h), \tag{16}$$
with
$$\Sigma_h = \begin{bmatrix} C_{\delta_1\delta_1}(h) & \cdots & C_{\delta_1\delta_{n_c}}(h) \\ & \ddots & \vdots \\ (\mathrm{sym}) & & C_{\delta_{n_c}\delta_{n_c}}(h) \end{bmatrix}, \tag{17}$$
where the $C^m(h)$'s are permissible models of covariance functions with $C^m(0) = 1\ \forall\, m$, and where the $\Sigma_0^m$'s are $n_c \times n_c$ positive definite matrices. Classically, these matrices are obtained through a least squares-based algorithm conditionally to the choice of the $C^m(h)$'s, which is made by the user (Goulard and Voltz, 1992). Although (16) has proven useful for multivariate continuous-valued random fields, it can be shown (see Appendix) that for our discrete-valued random field, the only possible valid expression for this LMC is
$$\Sigma_h = \Sigma_0 \sum_m \lambda_m\, C^m(h), \tag{18}$$
with
$$\sum_m \lambda_m = 1 \tag{19}$$
and where, from (5), $\Sigma_0$ is given by
$$\Sigma_0 = \begin{bmatrix} C_{\delta_1\delta_1}(0) & \cdots & C_{\delta_1\delta_{n_c}}(0) \\ & \ddots & \vdots \\ (\mathrm{sym}) & & C_{\delta_{n_c}\delta_{n_c}}(0) \end{bmatrix} = \mathrm{diag}(p) - p\,p' \tag{20}$$
with $p' = (p_1,\dots,p_{n_c})$. Any other form for the LMC is not permissible, in the sense that it leads to an inconsistent model for bivariate probabilities. Note also that (18) and (20) define a positive semidefinite model as $\mathrm{rank}(\Sigma_0) = n_c - 1$, which is the consequence of the constraint $\sum_i p_i = 1$. This also implies that the indicator cokriging system cannot be solved as is, as the system of equations built from (18) is rank deficient. Equation (18) is sometimes referred to as the intrinsic LMC and also induces a property called self-krigeability, which states that any cokriging system based on this set of cross-covariance functions reduces to a kriging system when the various categories are known at the same set of locations, thus removing any advantage of the method over indicator kriging. Of course, still being a linear non-convex estimator, indicator cokriging suffers from the same general limitations as indicator kriging. Indicator kriging and cokriging are commonly (and probably abusively) referred to as nonlinear methods, though the only nonlinear part is the binary coding of the data, the predictor still remaining a linear combination of these binary values. Clearly, in general $\hat p_{i_0}(x_0 \mid \{i_\alpha\})$ does not correspond to the conditional probability but is only its least squares estimate, as it is well known that this conditional probability is a highly nonlinear function of the data. In the author’s opinion, given the numerous limitations of the indicator (co)kriging algorithm, it would be more efficient to seek this conditional probability directly, which is possible provided that the joint probability distribution $P[\cap_j C(x_j) = c_{i_j}]$, $j = 0,\dots,n$, $i_j = 1,\dots,n_c$, can be estimated. This is precisely what is proposed by the BME approach.
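The rank deficiency of $\Sigma_0$ in (20) is easy to verify numerically. The following sketch uses a hypothetical probability vector for four categories (the values are an assumption chosen for illustration) and checks that $\mathrm{diag}(p) - p\,p'$ is positive semidefinite with rank $n_c - 1$.

```python
import numpy as np

# Hypothetical univariate probabilities for n_c = 4 categories (sum to one).
p = np.array([0.2, 0.3, 0.3, 0.2])

# Eq. (20): indicator (cross-)covariance matrix at lag zero.
sigma0 = np.diag(p) - np.outer(p, p)

rank = np.linalg.matrix_rank(sigma0, tol=1e-10)
row_sums = sigma0.sum(axis=1)          # each row sums to p_i - p_i * sum(p)
eigvals = np.linalg.eigvalsh(sigma0)   # eigenvalues of the symmetric matrix
```

Each row of `sigma0` sums to zero because the $p_i$'s sum to one, so the vector of ones lies in the null space; this is why a cokriging system built from (18) cannot be solved without removing one category or adding a constraint.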
4 The Bayesian maximum entropy approach
This section is devoted to the theoretical developments that underlie the BME approach. The general process can be viewed as a three-stage procedure (Christakos, 2000) that can be briefly summarized as follows: (i) at the first stage (the prior stage), one aims at finding a joint probability distribution that has maximum entropy and respects some general constraints (e.g., a set of probabilities that are supposed to be known and are imposed a priori); (ii) at the second stage (the meta-prior stage), the specific information about the data set under study is collected and translated into usable mathematical relations; (iii) the final stage (integration stage) is the computation of the posterior conditional distribution with respect to the maximum entropy distribution obtained at the prior stage and the information collected at the meta-prior stage. The first part of this section will focus on the maximum entropy solution for the joint probability distribution. The second part will consider the integration of various kinds of specific information and the corresponding formulations for the posterior conditional distributions.
4.1 Estimation of the joint probability distribution
The first step in the BME approach is to seek the joint probability distribution $p_{0,\dots,n(i_0,\dots,i_n)} = P[\cap_j C(x_j) = c_{i_j}]$, $j = 0,\dots,n$, $i_j = 1,\dots,n_c$, that is the maximum entropy distribution subject to some constraints. Indeed, in practice one has only a partial general knowledge represented by a set $K_G$ that specifies some of the properties of the random field. For example, $K_G \equiv \{p_{\alpha,\beta(i_\alpha,i_\beta)};\ \alpha,\beta = 0,\dots,n;\ i_\alpha,i_\beta = 1,\dots,n_c\}$ if the set of (cross-)covariance functions (4), (cross-)variograms (7) or probability functions (12) is known.
If one thinks of $p_{0,\dots,n(i_0,\dots,i_n)}$ as an $n_c \times \cdots \times n_c$ ($n$ times) hypersquare probability table having $n_c^n$ cells, the $p_{\alpha,\beta(i_\alpha,i_\beta)}$'s are then the margins of order 2, so that the set of bivariate probabilities forms $n^2$ square $n_c \times n_c$ probability tables. We will consider that these probability tables are the constraints in our maximum entropy estimation problem. Denote $\hat p_{0,\dots,n(i_0,\dots,i_n)}$ as an estimate of $p_{0,\dots,n(i_0,\dots,i_n)}$. Its entropy is given by the relation
$$H_{0,\dots,n} = -\sum_{i_0,\dots,i_n} \hat p_{0,\dots,n(i_0,\dots,i_n)} \ln \hat p_{0,\dots,n(i_0,\dots,i_n)}, \tag{21}$$
where $\sum_{i_0,\dots,i_n}$ denotes summation over all the possible values for all the indexes. If we assume that we know the $p_{\alpha,\beta(i_\alpha,i_\beta)}$'s, we want to respect the conditions
$$\hat p_{\alpha,\beta(i_\alpha,i_\beta)} = p_{\alpha,\beta(i_\alpha,i_\beta)} \quad \forall\, \alpha \neq \beta,\ \forall\, i_\alpha, i_\beta, \tag{22}$$
where
$$\hat p_{\alpha,\beta(i_\alpha,i_\beta)} = \sum_{\{i_j;\ j \neq \alpha,\beta\}} \hat p_{0,\dots,n(i_0,\dots,i_n)}. \tag{23}$$
Provided that the $p_{\alpha,\beta(i_\alpha,i_\beta)}$'s are a consistent set of probabilities, Eq. (22) is sufficient for ensuring that $\hat p_{\alpha(i_\alpha)} = p_{\alpha(i_\alpha)}$ as well as that $\sum_{i_0,\dots,i_n} \hat p_{0,\dots,n(i_0,\dots,i_n)} = 1$. Using the Lagrangian formalism, maximizing (21) under the constraints (22) is equivalent to maximizing the objective function
$$L_{0,\dots,n} = H_{0,\dots,n} + \sum_{\substack{\alpha,\beta \\ \alpha \neq \beta}} \sum_{i_\alpha,i_\beta} \mu_{\alpha\beta(i_\alpha i_\beta)} \big(\hat p_{\alpha,\beta(i_\alpha i_\beta)} - p_{\alpha,\beta(i_\alpha i_\beta)}\big). \tag{24}$$
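Setting the partial derivatives with respect to $\hat p_{0,\dots,n(i_0,\dots,i_n)}$ and $\mu_{\alpha\beta(i_\alpha i_\beta)}$ equal to zero yields
$$\begin{aligned} \frac{\partial L_{0,\dots,n}}{\partial \hat p_{0,\dots,n(i_0,\dots,i_n)}} &= -\ln \hat p_{0,\dots,n(i_0,\dots,i_n)} - 1 + \sum_{\substack{\alpha,\beta \\ \alpha \neq \beta}} \mu_{\alpha\beta(i_\alpha i_\beta)} = 0, \\ \frac{\partial L_{0,\dots,n}}{\partial \mu_{\alpha\beta(i_\alpha i_\beta)}} &= \hat p_{\alpha,\beta(i_\alpha i_\beta)} - p_{\alpha,\beta(i_\alpha i_\beta)} = 0, \quad \alpha \neq \beta, \end{aligned} \tag{25}$$
which must be solved with respect to the coefficients $\mu_{\alpha\beta(i_\alpha i_\beta)}$. This nonlinear system of equations (25) can be solved using classical numerical methods. However, one can remark that using an arbitrary reparametrization like
$$\mu_{\alpha\beta(i_\alpha i_\beta)} = \eta_{\alpha\beta(i_\alpha i_\beta)} + \frac{\eta_{\alpha(i_\alpha)}}{n - 1} + \frac{\eta + 1}{n(n - 1)} \tag{26}$$
leads to
$$\ln \hat p_{0,\dots,n(i_0,\dots,i_n)} = \eta + \sum_{\alpha} \eta_{\alpha(i_\alpha)} + \sum_{\alpha} \sum_{\substack{\beta \\ \beta \neq \alpha}} \eta_{\alpha\beta(i_\alpha i_\beta)}, \tag{27}$$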
such that (27) corresponds to the definition of a non-saturated log-linear model involving only main effects $\eta_{\alpha(i_\alpha)}$ and first-order interaction effects $\eta_{\alpha\beta(i_\alpha i_\beta)}$. The equivalence between maximum entropy probability distribution functions that satisfy marginal constraints and non-saturated log-linear models is well known (Good, 1963). A consequence of this is that estimating $p_{0,\dots,n(i_0,\dots,i_n)}$ from the BME formalism under marginal constraints is equivalent to fitting a non-saturated log-linear model (Bishop et al., 1975), and simple and classical algorithms (e.g., an iterative scaling procedure as proposed by Deming and Stephan, 1940) can be used for this purpose. The previous developments focused on the estimation of the joint distribution that respects a set of bivariate probabilities, but there would be no mathematical difficulty in deriving the same relation for incorporating higher-order probabilities (e.g., trivariate probabilities $p_{\alpha,\beta,\chi(i_\alpha,i_\beta,i_\chi)}$) using the same formalism, as long as the imposed constraints are consistent with each other so that there exists a unique maximum entropy solution that satisfies all these constraints. This highlights the great generality of the approach.
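The iterative scaling idea mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the results of this paper: the function name, the fixed number of sweeps and the randomly generated reference table (three locations, three categories) are assumptions made for the example. Starting from the uniform table, each sweep rescales the current estimate so that one bivariate margin at a time matches its imposed value, as in (22) and (23).

```python
import itertools
import numpy as np

def fit_max_entropy(pair_tables, n_vars, n_cat, n_sweeps=200):
    """Iterative proportional fitting of imposed bivariate margins.

    pair_tables maps a pair (a, b), with a listed before b, to the imposed
    n_cat x n_cat table; the starting point is the uniform joint table.
    """
    joint = np.full((n_cat,) * n_vars, 1.0 / n_cat ** n_vars)
    for _ in range(n_sweeps):
        for (a, b), target in pair_tables.items():
            other = tuple(k for k in range(n_vars) if k not in (a, b))
            current = joint.sum(axis=other)            # margin, Eq. (23)
            ratio = np.divide(target, current,
                              out=np.zeros_like(target), where=current > 0)
            shape = [1] * n_vars
            shape[a] = n_cat
            shape[b] = n_cat
            joint = joint * ratio.reshape(shape)       # enforce Eq. (22)
    return joint

def entropy(table):
    """Eq. (21) for a table with strictly positive cells."""
    return float(-np.sum(table * np.log(table)))

# Hypothetical reference joint table, used only to produce a consistent
# set of bivariate constraints.
rng = np.random.default_rng(0)
reference = rng.random((3, 3, 3))
reference /= reference.sum()

pairs = {}
for a, b in itertools.combinations(range(3), 2):
    other = tuple(k for k in range(3) if k not in (a, b))
    pairs[(a, b)] = reference.sum(axis=other)

fitted = fit_max_entropy(pairs, n_vars=3, n_cat=3)
```

The fitted table reproduces the imposed bivariate margins but generally differs from the reference table: among all tables sharing these margins it is the one with the largest entropy, so any higher-order interaction present in the reference is discarded.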
4.2 Estimation of the conditional probabilities
In our spatial prediction context, what is sought is a conditional probability distribution at the unsampled location $x_0$, i.e.,
$$p_{0(i_0)|K_S} \equiv P[C(x_0) = c_{i_0} \mid K_S] = \frac{P[(C(x_0) = c_{i_0}) \cap K_S]}{P[K_S]}, \quad \forall\, i_0 = 1,\dots,n_c, \tag{28}$$
with $P[K_S] = \sum_{i_0} P[(C(x_0) = c_{i_0}) \cap K_S]$, where $K_S$ refers to a set of specific knowledge about the random field. We will only examine here the cases for which this set of knowledge is such that $P[(C(x_0) = c_{i_0}) \cap K_S]$ can be computed unambiguously from the joint probability distribution $p_{0,\dots,n(i_0,\dots,i_n)}$ or its estimate $\hat p_{0,\dots,n(i_0,\dots,i_n)}$ obtained from the previous BME approach. In this last case, the theoretical probabilities $p_{0,\dots,n(i_0,\dots,i_n)}$ are replaced by their estimates $\hat p_{0,\dots,n(i_0,\dots,i_n)}$ in order to obtain an estimate $\hat p_{0(i_0)|K_S}$. There are numerous kinds of specific knowledge that can be incorporated using (28). Among others, a classical example for $K_S$ would be
$$K_S \equiv \bigcap_{j=1}^{n} C(x_j) = c_{i_j}, \tag{29}$$
(the category is known at each location $x_1,\dots,x_n$), so that
$$\hat P[(C(x_0) = c_{i_0}) \cap K_S] = \hat p_{0,\dots,n(i_0,\dots,i_n)}. \tag{30}$$
A more elaborate example would be
$$K_S \equiv \bigcap_{j=1}^{n} C(x_j) \in E_j, \tag{31}$$
where $E_j \subset \Omega_C\ \forall\, j$ (the subset of possible categories is known at each location $x_1,\dots,x_n$), leading to
$$\hat P[(C(x_0) = c_{i_0}) \cap K_S] = \sum_{i_1 \in E_1,\dots,i_n \in E_n} \hat p_{0,\dots,n(i_0,\dots,i_n)}. \tag{32}$$
In some instances, there could be specific probability information that is made available, so that
$$K_S \equiv \Big\{ P_S\Big[\bigcap_{j=1}^{n} C(x_j) = c_{i_j}\Big];\ i_j = 1,\dots,n_c \Big\}, \tag{33}$$
(the joint probability of occurrence for the categories is known at each location $x_1,\dots,x_n$), where $P_S[\cap_j C(x_j) = c_{i_j}]$ refers to specific probabilities, obtained independently from the a priori probabilities $p_{1,\dots,n(i_1,\dots,i_n)}$, leading to
$$\hat P[(C(x_0) = c_{i_0}) \cap K_S] = \sum_{i_1,\dots,i_n} \hat p_{0,\dots,n(i_0,\dots,i_n)}\, P_S\Big[\bigcap_{j=1}^{n} C(x_j) = c_{i_j}\Big]. \tag{34}$$
This last relation should not be mistaken for the traditional Bayesian conditioning obtained from the total probability theorem. For example, using (29), one can define the incompatible events
$$K_{S,m} \equiv \bigcap_{j=1}^{n} \big(C(x_j) = c_{i_j}\big) \tag{35}$$
where each $K_{S,m}$ corresponds to one of the possible combinations of values for the indexes $i_1,\dots,i_n$ such that $P[\cup_m K_{S,m}] = \sum_m P[K_{S,m}] = 1$. This would lead to
$$E_S[\,p_{0(i_0)|K_{S,m}}] = \sum_m p_{0(i_0)|K_{S,m}}\, P_S[K_{S,m}] \tag{36}$$
where the expectation $E_S[\cdot]$ is computed with respect to the probability distribution $P_S[\cdot]$ for the set of events $K_{S,m}$, and where each $p_{0(i_0)|K_{S,m}}$ is obtained from (28) and (30). All the previous relations focused on information provided about the $C(x_j)$'s where $j = 1,\dots,n$. There is however no problem in including any additional information that refers to $C(x_0)$ itself, e.g., $C(x_0) \in E_0 \subset \Omega_C$ (at the prediction location $x_0$, the subset of possible categories is known). Note also that there would be no difficulty in obtaining any multivariate conditional distribution
$$\hat p_{0,\dots,m(i_0,\dots,i_m)|K_S} = \frac{\hat P\big[\bigcap_{j=0}^{m} (C(x_j) = c_{i_j}) \cap K_S\big]}{\hat P[K_S]}, \quad \forall\, i_0,\dots,i_m \in \Omega_C \tag{37}$$
using exactly the same formalism as above. Obviously, various combinations of events can be considered simultaneously using a combination of the previous relations, leading to more complex relations. Note also that, as all these conditional probabilities are computed from a valid joint probability distribution, they automatically lead to valid conditional distributions. These remarks emphasize the considerable generality and the power of the BME formalism. Various kinds of knowledge are easily incorporated and lead to nonlinear conditional probability estimates. The superiority of the BME approach over the indicator (co)kriging approach is emphasized hereafter using two simulated examples.
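Relations (28) to (32) amount to slicing or summing a joint probability table and renormalizing. The sketch below illustrates this for a hypothetical joint table over three locations (axis 0 standing for $x_0$) and three categories; the table is random and the function names are illustrative assumptions, with categories indexed from zero as is usual in Python.

```python
import numpy as np

def conditional_hard(joint, observed):
    """Eqs. (28)-(30): the category is known at each of x_1, ..., x_n.

    observed[j] is the 0-based category index observed at x_{j+1}.
    """
    numer = joint[(slice(None),) + tuple(observed)]
    return numer / numer.sum()

def conditional_soft(joint, subsets):
    """Eqs. (28), (31)-(32): only a subset E_j of categories is known.

    subsets[j] lists the 0-based admissible categories at x_{j+1}.
    """
    numer = joint
    for axis, allowed in enumerate(subsets, start=1):
        numer = np.take(numer, allowed, axis=axis)
    numer = numer.sum(axis=tuple(range(1, joint.ndim)))
    return numer / numer.sum()

# Hypothetical joint table; axes correspond to x_0, x_1 and x_2.
rng = np.random.default_rng(1)
joint = rng.random((3, 3, 3))
joint /= joint.sum()

p_hard = conditional_hard(joint, observed=[0, 2])
p_soft = conditional_soft(joint, subsets=[[0, 1], [2]])
```

Because both functions divide by the sum over $i_0$, the results are nonnegative and sum to one by construction, which is the validity property stressed above.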
5 Comparing BME and indicator (co)kriging: two simulated examples
For our first example, we will assume that 100 locations have been randomly sampled over a square of unit size. The prediction is conducted over a 40 by 40 grid covering the square. Continuous values $z(x)$ are jointly simulated at the 1600 prediction nodes and at the 100 sampling locations using a sequential simulation method and an exponential semivariogram model $\gamma_Z(h)$ with a range equal to 0.5 and a sill equal to 1. The values are thus zero-mean unit variance multivariate Gaussian distributed. These simulated values are then replaced by the interval to which they belong. The limits $a_i$ of these intervals are taken as the 0, 0.2, 0.5, 0.8 and 1 quantiles of the zero-mean unit variance Gaussian distribution (Fig. 1), so that the complete system of events $c_i \equiv Z \in [a_i, a_{i+1}[$, $i = 1,\dots,4$, can be defined. These intervals of values can be viewed as an ordered categorical variable.
Fig. 1. Zero-mean unit variance Gaussian probability distribution function and indicator coding of the data. Categories 1 to 4 are defined as the intervals bounded by the 0, 0.2, 0.5, 0.8 and 1 quantiles of the distribution, respectively
As the simulated values are multivariate Gaussian distributed, the knowledge of the mean and covariance function is necessary and sufficient information for characterizing any joint probability distribution function. Conditional probabilities can then easily be computed from it. The continuous conditional distribution $f(z_0 \mid K_S)$ is given by
$$f(z_0 \mid K_S) = \frac{\int_{a_{i_1}}^{a_{i_1+1}} \cdots \int_{a_{i_n}}^{a_{i_n+1}} f(z_0,\dots,z_n)\, dz_n \cdots dz_1}{\int_{a_{i_1}}^{a_{i_1+1}} \cdots \int_{a_{i_n}}^{a_{i_n+1}} f(z_1,\dots,z_n)\, dz_n \cdots dz_1}, \tag{38}$$
where $f(z_0,\dots,z_n)$ is the joint multivariate Gaussian probability distribution function for locations $x_0,\dots,x_n$, and
$$K_S \equiv \bigcap_{j=1}^{n} C(x_j) = c_{i_j} \equiv \bigcap_{j=1}^{n} Z(x_j) \in [a_{i_j}, a_{i_j+1}[. \tag{39}$$
From (38), the conditional probabilities $p_{0(i_0)|K_S}$, $i_0 = 1,\dots,4$, are obtained using the relation
$$p_{0(i_0)|K_S} \equiv P\Big[C(x_0) = c_{i_0} \,\Big|\, \bigcap_{j=1}^{n} C(x_j) = c_{i_j}\Big] = \int_{a_{i_0}}^{a_{i_0+1}} f(z_0 \mid K_S)\, dz_0. \tag{40}$$
Using (38) and (40), the theoretical conditional probability of belonging to each category can be easily obtained, as the multivariate continuous Gaussian distribution $f(z_0,\dots,z_n)$ that was used to generate the data set is known. Clearly, the prediction of the probability of belonging to each one of the classes does depend on the ordering of these classes, as the computation is conducted from the continuous distribution $f(z_0,\dots,z_n)$ using (39).
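The coding of continuous Gaussian values into the four ordered categories can be sketched as follows. This fragment only illustrates the thresholding step: the values drawn here are independent standard Gaussian values, which is an assumption made to keep the example short, whereas the study itself uses spatially correlated values obtained by sequential simulation. The function name and the random seed are likewise illustrative.

```python
from statistics import NormalDist

import numpy as np

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Limits a_1, ..., a_5: the 0, 0.2, 0.5, 0.8 and 1 quantiles of N(0, 1).
limits = np.array([-np.inf,
                   std_normal.inv_cdf(0.2),
                   std_normal.inv_cdf(0.5),
                   std_normal.inv_cdf(0.8),
                   np.inf])

def code_categories(z, limits):
    """Return the 1-based category i such that z lies in [a_i, a_{i+1}[."""
    return np.digitize(z, limits[1:-1]) + 1

# Independent draws, used here only to check the category frequencies.
rng = np.random.default_rng(42)
z = rng.standard_normal(100_000)
cats = code_categories(z, limits)
freq = np.bincount(cats, minlength=5)[1:] / z.size
```

With these limits the expected category proportions are 0.2, 0.3, 0.3 and 0.2, which are the $p_i$'s entering the indicator models of Section 2.2.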
The results provided by BME, indicator kriging (IK) and indicator cokriging (ICK) can thus be compared to the reference values provided by (40), as the best possible estimates of the conditional probabilities that one can expect from the available information are the theoretical conditional probabilities provided by (40). It is worth emphasizing here that the aim of the study is to compare the performance of the various methods in predicting correct conditional probabilities; the study is in no way intended to show the performance of the conditional probability for predicting the true category over the grid. This performance itself is highly dependent on several factors like the number of available data, the number of classes that are considered, the intensity of the correlation over space, etc. For comparison purposes, the theoretical indicator variogram and cross-covariance function models (Fig. 2) and the theoretical probability and cross-probability functions (Fig. 3) will be used for all methods. These models are easily obtained by integration over bivariate Gaussian distributions that are derived from the joint distribution $f(z_0,\dots,z_n)$. Note that, as seen from Fig. 3, these functions exhibit complex shapes that are unlikely to be captured by a very simple model like the LMC. Using these theoretical functions in the estimation process allows us to obtain a fair and objective comparison of the performances of the different approaches without taking into account complex inference problems and methodological considerations. In some sense, the results of the estimations will represent the best results that one could expect from these methods. For all methods, the same neighborhood, consisting of the five closest sampled locations, has been considered for each estimation location.
Fig. 2. Theoretical indicator covariance functions built from the categories defined in Fig. 1 using an exponential covariance function with sill and range equal to 1 and 0.5, respectively. Direct indicator covariances are represented on the diagonal, and cross-indicator covariances are represented off-diagonal
Fig. 3. Theoretical probability functions built from the categories defined in Fig. 1 using an exponential covariance function with sill and range equal to 1 and 0.5, respectively. Direct probabilities are represented on the diagonal, and cross-probabilities are represented off-diagonal
As a first result, due to the inherent limitations of the linear estimates IK and ICK, the basic requirements for obtaining valid distributions are not met. Using IK, all the estimated conditional probabilities that are obtained are non-negative (however, there is no guarantee that this will be true in general, as kriging is a non-convex estimator), but they are far from summing to one as required (Fig. 4a). This is the natural consequence of the fact that the probabilities of belonging to each category have been obtained separately from each other. On the other hand, the opposite situation occurs when using ICK. The jointly estimated probabilities now sum to one, but out of the 6400 conditional probability estimates, 19% are negative, with a mean negative probability equal to $-0.025$ and a minimum negative probability equal to $-0.12$ (Fig. 4b). Goovaerts (1994) claims that the order of magnitude of these deviations from admissible values is usually small, but this simple example shows that they can be quite high too and may lead to considerable difficulties. The recommended obvious ad hoc solution consists in resetting non-admissible probability values to the closest bound 0 or 1, followed by a standardization so that they sum to one, but the potential consequences of this trick are never assessed. Another trick consists in computing all but one of the conditional probability estimates $\hat p_{0(i_0)|K_S}$, $i_0 \neq k$, so that $\hat p_{0(k)|K_S}$ is obtained as one minus the sum of these estimates (see, e.g., Mao and Journel, 1999); it is however clear that the results are then not invariant with respect to the choice of the category which is left out during the estimation. All these serious drawbacks of IK and ICK are overcome naturally by BME, as the various inequality and equality constraints are automatically incorporated into the estimation procedure. As a second result, a comparison of the conditional probability estimates obtained using BME shows that they are in very good agreement with the true conditional probabilities (Fig. 5).
This is even more obvious if one plots the difference between the predicted and the true conditional probabilities for each method (Fig. 7). In spite of the fact that BME does not explicitly use the information that the categories are strictly ordered, little information has been lost. The situation is much less favorable for IK and ICK, for which conditional probabilities are quite different from the true ones, with plenty of values equal to 0 or 1. Note also that there is little improvement when using ICK instead of IK; see Goovaerts (1994) for a discussion of this aspect. It also means that the extra information carried by the indicator cross-covariance function between categories has been poorly used by ICK. Based on the estimated conditional distributions at the 1600 estimation locations, one can draw a map of the most probable category at each location (Fig. 6). A comparison of the most probable category at each grid point with the simulated categorical map shows that, using the theoretical conditional probability distributions, 52.9% of the grid points are correctly classified. These values are equal to 52.9%, 49.4% and 48.9% for BME, IK and ICK, respectively. The choice of the most probable category as the estimate is of course arbitrary to some extent; a more appropriate choice could have been made if needed using a more complex loss function.
Fig. 4. Non validity of the conditional probability estimates obtained from IK and ICK. Part (a) shows the sum of the probabilities to belong to one of the four categories at the estimation locations (the correct value should be 1). Part (b) shows the probability estimates (all categories mixed) obtained from ICK (correct values should belong to the [0, 1] interval)
Fig. 5a–d. Histograms and scatterplots for the conditional probability estimates (all categories mixed). Parts a, b, c and d refer to the true values, BME estimates, IK estimates and ICK estimates, respectively
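The two ad hoc corrections discussed earlier for IK and ICK estimates (resetting inadmissible values to the closest bound and restandardizing, or obtaining one category as one minus the sum of the others) can be sketched as follows. This is only a minimal illustration: the function names and the numerical values in `raw` are hypothetical and are not taken from the simulation study.

```python
import numpy as np

def clip_and_renormalize(probs):
    """Reset inadmissible values to the closest bound (0 or 1), then rescale to sum to one."""
    clipped = np.clip(np.asarray(probs, dtype=float), 0.0, 1.0)
    total = clipped.sum()
    if total == 0.0:
        raise ValueError("all estimates were clipped to zero; cannot renormalize")
    return clipped / total

def complete_by_difference(probs, left_out):
    """Keep every estimate except `left_out`, which is set to one minus the sum of the others."""
    completed = np.asarray(probs, dtype=float).copy()
    completed[left_out] = 0.0
    completed[left_out] = 1.0 - completed.sum()
    return completed

# Hypothetical kriging-style estimates for four categories (made-up numbers):
# they sum to 1.10 and one of them is negative.
raw = np.array([0.45, 0.40, 0.30, -0.05])

fixed = clip_and_renormalize(raw)
# One completed distribution per choice of the category left out.
by_diff = [complete_by_difference(raw, k) for k in range(raw.size)]
```

Each element of `by_diff` sums to one, but the distributions differ from one another (and may still contain negative values), which illustrates the lack of invariance with respect to the category that is left out.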
Fig. 6a–f. Categories of maximum conditional probability at the estimation locations. Part a shows the locations of the 100 samples that have been used for the estimation at the nodes of a 40 by 40 square grid. Part b shows the simulated categories at the nodes of the grid. Parts c, d, e and f refer to the true maximum conditional probability categories, the IK estimates, the BME estimates and the ICK estimates, respectively. Categories are coded on a grayscale from black (category 1) to white (category 4)
Fig. 7. Histograms of the differences between estimated and theoretical probabilities of belonging to a given category for BME, IK and ICK, along with the estimated standard deviation of the values
A close examination of these maps confirms that there is a very good agreement between BME and the true conditional distributions, as measured by the high frequency (95%) of identical maximum probability categories. For IK and ICK, this frequency is equal to 79% and 78%, respectively. Moreover, the spatial variations of the maps are quite different. IK and ICK maps tend to show patchy areas where the expected progressive transitions from a category to an adjacent category are only weakly apparent (it is worth noting again the very marginal differences
between the IK and ICK maps). The BME map is much more satisfactory: there is a clear progressive transition between categories, that accounts correctly for the fact that these categories are strictly ordered. In our second example, we will focus on the case where there is an uncertainty about the exact category at the sampled locations. We will assume that there are 200 locations that have been randomly sampled over the same square as above. The prediction is conducted again over the 40 by 40 grid covering the square. Continuous values are jointly simulated as previously using the same semivariogram model. At the sampled locations, these simulated values are then replaced by intervals having as bounds the 0, 0.5, and 1 quantiles of the zero-mean unit variance Gaussian distribution, so that it corresponds to a grouping of the categories used in the first example. There is thus an uncertainty about the exact category at the sampled locations, as we now have
$$K_S \equiv \bigcap_{j=1}^{n} C(x_j) \in E_j , \qquad (41)$$
where each $E_j$ is one of the two incompatible events $(C(x_j) = c_1) \cup (C(x_j) = c_2)$ and $(C(x_j) = c_3) \cup (C(x_j) = c_4)$. The computation of the true conditional distributions and probabilities is done using relations similar to (38) and (40). For BME, there is no difficulty in processing such composite information using (37). For IK, the situation is ambiguous. Indeed, IK requires a probability coding of the data for each one of the categories, but the information is only provided here for a grouping of these categories. Two possibilities will be considered. The first one (IK1) converts the information by assigning identical probabilities to both possible categories:
$$P[(C(x_j) = c_1) \cup (C(x_j) = c_2)] = 1 \;\rightarrow\; \begin{cases} P[C(x_j) = c_1] = P[C(x_j) = c_2] = 0.5 \\ P[C(x_j) = c_3] = P[C(x_j) = c_4] = 0 \end{cases} \qquad (42)$$
such that $\sum_i P[C(x_j) = c_i] = 1$, whereas the second option (IK2) is to consider the conversion
$$P[(C(x_j) = c_1) \cup (C(x_j) = c_2)] = 1 \;\rightarrow\; \begin{cases} P[C(x_j) = c_1] = p_1/(p_1 + p_2) \\ P[C(x_j) = c_2] = p_2/(p_1 + p_2) \\ P[C(x_j) = c_3] = P[C(x_j) = c_4] = 0 \end{cases} \qquad (43)$$
in an attempt to account for the a priori probabilities $p_i$. Similar conversions are of course considered for the event $(C(x_j) = c_3) \cup (C(x_j) = c_4)$. In both cases, instead of coding the data as the binary probability values 0 or 1, these probabilities are now allowed to take values in the [0, 1] interval. This has sometimes been referred to as a soft indicator coding (Goovaerts, 1997; Journel, 1986). Note however that neither (42) nor (43) is equivalent to the specified information; the right-hand side of (42) or (43) is much more informative (and thus more conditioning) than the left-hand side alone. Moreover, the choice between the first and the second option seems to be largely arbitrary. Note also that choosing IK1 or IK2 does not affect the weights in (14), so
the conditional probability estimates for a category using (42) or (43) are the same up to a multiplicative constant. A comparison between the BME conditional probability estimates and the true conditional probabilities shows again that they are in very good agreement (Fig. 8). However, for IK1 and IK2, the conditional probability estimates have little to do with the true ones. As explained above, the conditional probability estimates for IK1 and IK2 are also linearly related, and using (43) instead of (42) does not much improve the quality of the estimates. A glance at the maps of the most probable category (Fig. 9) shows that IK1 and IK2 perform quite poorly. The frequency of correctly identified maximum probability categories (compared to the true maximum conditional probabilities) is equal to 66% and 39% for IK1 and IK2, respectively, whereas it remains very high (98%) for BME. As for the first example, the IK1 map does not correctly reproduce the progressive transitions from one category to an adjacent one. On the other hand, using (43) has a dramatic effect and yields a useless IK2 map, where only categories 2 and 3 are represented. This is the direct consequence of arbitrarily imposing higher a priori probability values for these two classes. This second example emphasizes even more clearly the strong limitations of indicator (soft) kriging. The same general remarks apply to the results obtained with indicator cokriging (not shown here).
Fig. 8. Histograms and scatterplots for the conditional probability estimates (all categories mixed). Parts (a), (b), (c) and (d) refer to the true values, BME estimates, IK1 estimates and IK2 estimates, respectively
Fig. 9a–f. Categories of maximum conditional probability at the estimation locations. Part a shows the locations of the 200 samples that have been used for the estimation at the nodes of a 40 by 40 square grid. Part b shows the simulated categories at the nodes of the grid. Parts c, d, e and f refer to the true maximum conditional probability categories, the IK1 estimates, the BME estimates and the IK2 estimates, respectively. Categories are coded on a grayscale from black (category 1) to white (category 4)
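The two soft indicator codings (42) and (43) can be illustrated with a short sketch. The helper `soft_code` and the a priori probabilities in `p` are hypothetical and chosen only for illustration; they are not the values used in the simulation study. The sketch also shows that, category by category, the IK2 codes are the IK1 codes multiplied by a constant.

```python
import numpy as np

def soft_code(group, prior, use_prior):
    """Soft indicator coding of the event 'the category belongs to `group`'.

    use_prior=False follows (42): equal probabilities inside the group (IK1).
    use_prior=True  follows (43): a priori probabilities rescaled inside the group (IK2).
    """
    prior = np.asarray(prior, dtype=float)
    code = np.zeros_like(prior)
    idx = list(group)
    if use_prior:
        code[idx] = prior[idx] / prior[idx].sum()
    else:
        code[idx] = 1.0 / len(idx)
    return code

# Hypothetical a priori probabilities p_1..p_4 (illustrative values only)
p = np.array([0.1, 0.4, 0.4, 0.1])
low, high = (0, 1), (2, 3)   # zero-based indices for {c1, c2} and {c3, c4}

# One row per possible composite datum, one column per category
ik1 = np.array([soft_code(g, p, use_prior=False) for g in (low, high)])
ik2 = np.array([soft_code(g, p, use_prior=True) for g in (low, high)])

# Each category is nonzero in exactly one row, so the column sums give
# the per-category multiplicative constant linking the two codings.
scale = ik2.sum(axis=0) / ik1.sum(axis=0)
```

With these illustrative priors, `scale` is larger than one for the two middle categories and smaller than one for the two extreme ones, so the coding (43) systematically favors the categories with the higher a priori probabilities.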
6 Conclusions
As seen from the theory and the previous examples, BME appears to be much more satisfactory than IK or ICK in several respects:
• It yields conditional probability distributions that are automatically valid (no negative probability estimates, probability estimates that sum to one). These simple conditions cannot be enforced using IK and ICK;
• BME does not require the use of indicator cross-covariance functions or variograms, as it directly uses probability functions. Moreover, it provides truly nonlinear estimates, as opposed to IK and ICK estimates, which are still linear combinations of values (even if these values are obtained from a nonlinear transformation of the data);
• Various and possibly complex sources of information about the categories are naturally incorporated without making additional hypotheses or transformations, as required by IK or ICK;
• The method can be easily generalized, e.g., to obtain multivariate conditional probabilities or to account for multi-point probabilities (e.g., trivariate probabilities), without any theoretical difficulty.
Although the methodology presented here focused, for the sake of brevity, on a single spatial categorical variable, it can be generalized for dealing with space/time data, with several categorical variables at the same time, or even for combining both continuous and categorical variables. The only drawback of the BME approach seems to be its computing requirements. Indeed, the maximum entropy procedure requires the reconstruction of a hypersquare probability table having $n_c^{n}$ cells, and this operation has to be repeated for each different configuration of the sampled/estimation locations. In this respect, at least, the simpler IK and ICK algorithms are less demanding. This higher computing requirement is the price to pay for getting sound and powerful estimates.
However, note that in general the few data that are closest to the estimation location are the most informative, so there is little point in considering high values for $n$. The tests conducted for our examples have shown that $n = 5$ was reasonable for providing stable estimates, with a computing requirement of less than a second per grid point on a medium-size computer. For regularly located sampled/estimation locations (e.g., gridded data), this limitation is overcome as the same probability table can be used everywhere. In conclusion, BME can be considered a serious challenger for processing categorical variables in a spatial estimation context.
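To give a rough idea of how the size of the probability table grows, the following sketch counts its cells, reading the cell count quoted above as the number of categories raised to the power $n$ (an assumption about the notation; whether $n$ includes the estimation location is not restated here). The function name `table_cells` is hypothetical.

```python
def table_cells(n_categories, n_locations):
    """Number of cells in a joint probability table with one axis per location."""
    return n_categories ** n_locations

# Four categories, as in the examples; n ranges over a few neighborhood sizes.
sizes = {n: table_cells(4, n) for n in range(2, 11)}
```

Under this reading, `sizes[5]` is 1024 while `sizes[10]` already exceeds one million, which is consistent with the recommendation to keep $n$ small.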
Appendix
Consider the symmetric $n_c \times n_c$ matrix of bivariate probabilities for arbitrary locations $x_a, x_b$:
$$P_{ab} = \begin{bmatrix} p_{a,b(1,1)} & \cdots & p_{a,b(1,n_c)} \\ & \ddots & \vdots \\ (\mathrm{sym}) & & p_{a,b(n_c,n_c)} \end{bmatrix} . \qquad (A1)$$
From the total probability theorem, it is clear that we must have
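$$\begin{cases} \sum_j p_{a,b(i,j)} = p_i \quad \forall i \\ \sum_i p_{a,b(i,j)} = p_j \quad \forall j \\ \sum_{i,j} p_{a,b(i,j)} = 1 \end{cases} \;\Longleftrightarrow\; \begin{cases} \mathbf{1}' P_{ab} = p' \\ P_{ab}\mathbf{1} = p \\ \mathbf{1}' P_{ab} \mathbf{1} = 1 \end{cases} , \qquad (A2)$$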
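where $p' = (p_1, \ldots, p_{n_c})$. It is sufficient to verify that $\mathbf{1}' P_{ab} = p'$, as this also verifies $P_{ab}\mathbf{1} = p$ and $\mathbf{1}' P_{ab} \mathbf{1} = 1$ from the symmetry property for $C_{h_{ab}}$ and the fact that $\mathbf{1}' p = 1$. From (4) and (17), it is clear that $\Sigma_{h_{ab}} = P_{ab} - pp'$, so that
$$p' = \mathbf{1}' P_{ab} = \mathbf{1}' (\Sigma_{h_{ab}} + pp') , \qquad (A3)$$
and thus $\mathbf{1}' \Sigma_{h_{ab}} = \mathbf{0}'$. Using (16), it follows that
$$\mathbf{1}' \Sigma_{h_{ab}} = \sum_m \mathbf{1}' \Sigma_0^m \, C^m(h_{ab}) = \mathbf{0}' \quad \forall h_{ab} . \qquad (A4)$$
As $C^m(h_{ab}) > 0$ for some $h_{ab}$, the only general solution is $\mathbf{1}' \Sigma_0^m = \mathbf{0}' \ \forall m$ (for the sake of notational simplicity, we will drop the $ab$ index in the subsequent notations). Remark that the conditions
$$\mathbf{1}' \Sigma_0 = \mathbf{0}' , \qquad \mathbf{1}' \Sigma_0^m = \mathbf{0}' \quad \forall m \qquad (A5)$$
are equivalent to saying that any arbitrary column (row) of the symmetric matrices $\Sigma_0$ and $\Sigma_0^m$ can be expressed as minus the sum of the other columns (rows). That is, choosing arbitrarily the $j$th column $\Sigma_{0(j)}$ of $\Sigma_0$, we have
$$\Sigma_{0(j)} = -\sum_{i \neq j} \Sigma_{0(i)} , \quad \forall j . \qquad (A6)$$
Similarly, one can also write for any of the $\Sigma_0^m$'s that
$$\Sigma^m_{0(j)} = -\sum_{i \neq j} \Sigma^m_{0(i)} , \quad \forall j, m . \qquad (A7)$$
Whatever the choice for the $\Sigma_0^m$'s, it is always possible to write
$$D_{m,j} \, \Sigma^m_{0(j)} = \Sigma_{0(j)} , \quad \forall j, m \qquad (A8)$$
where the $D_{m,j}$'s are appropriate diagonal matrices. Using (A8) for $\Sigma_{0(i)}$ in (A6) and using (A7) for $\Sigma^m_{0(j)}$ in (A8) yields the relations
$$\begin{cases} \Sigma_{0(j)} = -\sum_{i \neq j} D_{m,i} \, \Sigma^m_{0(i)} & \forall j, m \\ \Sigma_{0(j)} = -D_{m,j} \sum_{i \neq j} \Sigma^m_{0(i)} & \forall j, m \end{cases} \qquad (A9)$$
and shows that (A9) holds true $\forall j$ in general when $D_{m,j} = D_m$, so that $\Sigma_{0(j)} = D_m \Sigma^m_{0(j)}$. Using (A5), we obtain
$$\mathbf{1}' \Sigma_{0(j)} = \mathbf{1}' D_m \Sigma^m_{0(j)} = 0 , \quad \forall j, m . \qquad (A10)$$
The only possible choice for $D_m$ such that (A10) holds true in general is that $k_m \mathbf{1}' D_m = \mathbf{1}'$, so that $\Sigma_0^m = k_m \Sigma_0$, and so (16) becomes
$$\Sigma_h = \Sigma_0 \sum_m k_m C^m(h) . \qquad (A11)$$
Using (A11) for $h = 0$, it also follows that $\Sigma_0 = \Sigma_0 \sum_m k_m$ and so $\sum_m k_m = 1$.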
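The conditions derived in this appendix can be checked numerically on a simple family of bivariate probability tables. The sketch below is only an illustration under stated assumptions: the marginal probabilities in `p` are hypothetical, and the table is built as a mixture of the zero-distance table diag(p) and the independence table pp', with the mixing weight `rho` standing in for the scalar sum in (A11).

```python
import numpy as np

p = np.array([0.1, 0.4, 0.4, 0.1])   # hypothetical marginal probabilities (sum to one)
ones = np.ones_like(p)

def bivariate_table(rho):
    """Symmetric bivariate table with marginals p: rho * diag(p) + (1 - rho) * p p'."""
    return rho * np.diag(p) + (1.0 - rho) * np.outer(p, p)

# Covariance matrix at zero distance, where the bivariate table reduces to diag(p)
sigma_0 = np.diag(p) - np.outer(p, p)

checks = []
for rho in (0.0, 0.3, 0.9, 1.0):
    P = bivariate_table(rho)
    sigma_h = P - np.outer(p, p)
    checks.append((
        np.allclose(ones @ P, p),                                  # marginal condition of (A2)
        np.allclose(ones @ sigma_h, 0.0),                          # zero column sums, as below (A3)
        np.allclose(sigma_h[:, 0], -sigma_h[:, 1:].sum(axis=1)),   # same identity as (A6), applied to sigma_h
        np.allclose(sigma_h, rho * sigma_0),                       # proportionality to sigma_0, as in (A11)
    ))
```

All four checks hold for every value of `rho` in this family, since the covariance matrix is then the zero-distance matrix multiplied by a single scalar function.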
References
Bierkens MFP, Burrough PA (1993) The indicator approach to categorical soil data. II. Application to mapping and land use suitability analysis. J. Soil Science 44: 369–381
Bishop Y, Fienberg S, Holland P (1975) Discrete Multivariate Analysis. Theory and Practice. MIT Press, Cambridge, 557 pp
Bogaert P, Serre M, Christakos G (1999) Bayesian maximum entropy method using transformations. Proc. Annual Conf. Int. Assoc. Math. Geol. 1: 57–62
Carle SF, Fogg GE (1996) Transition probability-based indicator geostatistics. Mathematical Geology 28: 453–476
Castrignano A, Goovaerts P, Lulli L, Bragato G (2000) A geostatistical approach to estimate probability of occurrence of Tuber melanosporum in relation to some soil properties. Geoderma 98: 95–113
Christakos G (1990) A Bayesian/maximum entropy view to the spatial estimation problem. Math. Geol. 22: 763–776
Christakos G, Li X (1998) Bayesian maximum entropy analysis and mapping: a farewell to kriging estimators? Math. Geol. 30: 435–462
Christakos G (1991) Some applications of the Bayesian maximum entropy concept in geostatistics. Fundamental Theories of Physics, Kluwer Academic Publishers, Boston, pp. 215–229
Christakos G (2000) Modern Spatiotemporal Geostatistics. Oxford University Press, New York, 312 pp
Christakos G, Bogaert P, Serre M (2002) Temporal Geographical Information Systems: A Bayesian Maximum Entropy Primer for Natural and Epidemiological Sciences. Springer-Verlag, New York, 250 pp
Cressie N (1993) Statistics for Spatial Data. Wiley, New York, 900 pp
de Bruin S (2000) Predicting the areal extent of land-cover types using classified imagery and Geostatistics. Remote Sensing of Environment 74: 387–396
De Cesare L, Posa D (1995) A simulation technique of a non-Gaussian spatial process. Comput. Stat. Data Anal. 20: 543–555
Deming WE, Stephan FF (1940) On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Ann. Math. Statist. 11: 427–444
D’Or D, Bogaert P, Christakos G (2001) Application of the BME approach to soil texture mapping. Stoc. Environ. Res. Risk Assess. 15: 87–100
Gomez-Hernandez JJ, Srivastava RM (1990) ISIS3D: an ansi-C three-dimensional multiple indicator conditional simulation program. Comp. Geosci. 16: 395–440
Good IJ (1963) Maximum entropy for hypotheses formulation especially for multidimensional contingency tables. Ann. Math. Statist. 34: 911–934
Goovaerts P (1994) Comparative performance of indicator algorithms for modelling conditional probability distribution functions. Math. Geol. 26: 389–411
Goovaerts P (1997) Geostatistics for Natural Resources Evaluation. Oxford University Press, New York, 496 pp
Goulard M, Voltz M (1992) Linear coregionalization model: tools for estimation and choice of cross-variogram matrix. Math. Geol. 24: 269–286
Heuvelink GBM, Webster R (2001) Modelling soil variation: past, present and future. Geoderma 100: 269–301
Journel AG, Huijbregts CJ (1978) Mining Geostatistics. Academic Press, New York, 600 pp
Journel AG (1983) Non-parametric estimation of spatial distributions. Math. Geol. 15: 445–468
Journel AG (1986) Constrained interpolation and qualitative information – the soft kriging approach. Math. Geol. 18: 269–286
Lajaunie C (1990) Comparing some approximate methods for building local confidence intervals for predicting regionalized variables. Math. Geol. 22: 123–144
Mao S, Journel AG (1999) Conditional 3D simulation of lithofacies with 2D seismic data. Comp. Geosci. 25: 845–862
Oberthur T, Goovaerts P, Dobermann A (1999) Mapping soil texture classes using field texturing, particle size distribution and local knowledge by both conventional and geostatistical methods. European J. Soil Sci. 50: 457–479
Olea RA (1999) Geostatistics for Engineers and Earth Scientists. Kluwer Academics, Dordrecht, 303 pp
Weissmann GS, Carle SF, Fogg GE (1999) Three dimensional hydrofacies modeling based on soil surveys and transition probability geostatistics. Water Resour. Res. 35: 1761–1770
Ying Z (2000) IKSIM: a fast algorithm for indicator kriging and simulation in the presence of inequality constraints, hard and soft data. Comp. Geosci. 26: 493–507