Published online May 16, 2007
Eungyu Park* Dep. of Geology Kyungpook National Univ. Korea
Amro M. M. Elfeki Dep. of Hydrology and Water Resour. Management Faculty of Meteorology, Environ. and Arid Land Agric. King Abdulaziz Univ. Kingdom of Saudi Arabia (currently on leave from Irrigation and Hydraulics Dep. Faculty of Engineering Mansoura Univ., Egypt)
Yungoo Song Dep. of Earth System Sciences Yonsei Univ. Korea
SOIL PHYSICS
Generalized Coupled Markov Chain Model for Characterizing Categorical Variables in Soil Mapping

We developed a general formulation of the Markovian transition probability model and the corresponding computational algorithm for characterizing heterogeneity in soil types. The generalized model is based on the previously developed coupled Markov chain (CMC) model in which spatial conditioning is done using transition probabilities that incorporate field observations. The generalized coupled Markov chain (GCMC) model is more flexible with respect to conditioning than the previous CMC model because there are no restrictions on the input data format, and a random sequence calculation algorithm is used. The GCMC model was compared with the sequential indicator simulation (SIS), and the results were quantitatively analyzed. When adequate soil sampling data are available, the GCMC model predicts the spatial distribution of soil types as well as or better than the SIS model. The GCMC model has the advantage of simple input variables (because preprocessing is not required) and faster computation time (by about 60%). The models were also tested with sparse data sets, and the GCMC model predicted the presence of soil types better than the SIS model, based on a metric derived from ensemble probabilities. Further studies are in progress to expand applications of the model to stationary and nonstationary soil type distributions, improve algorithm efficiency, address underestimation caused by undersampled lithology, and extend the model to three dimensions.

Abbreviations: CMC, coupled Markov chain; EPD, ensemble probability distribution; GCMC, generalized coupled Markov chain; NN, nearest neighborhood; SIS, sequential indicator simulation; TPM, transition probability matrix.
Kangjoo Kim School of Civil and Environ. Engineering Kunsan National Univ. Korea
Soil Sci. Soc. Am. J. 71:909–917
doi:10.2136/sssaj2005.0386
Received 29 Nov. 2005.
*Corresponding author ([email protected]).
© Soil Science Society of America, 677 S. Segoe Rd., Madison, WI 53711 USA. All rights reserved. No part of this periodical may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Permission for printing and for reprinting the material contained herein has been obtained by the publisher.
SSSAJ: Volume 71: Number 3 • May–June 2007
Surface and subsurface heterogeneities are complex mixtures of discrete structures that may be categorized by soil types or geology and characterized by more or less discontinuous boundaries and random features. They are common in nature but difficult for scientists and engineers to predict accurately, especially with limited physical resources and technologies. In practical applications requiring knowledge of surface and subsurface properties, building a conceptual model that characterizes spatial heterogeneities and their distributions is an important first step, and the model should make use of quantitative and qualitative field information as much as possible. Uncertainties in spatial predictions, however, give rise to considerable uncertainty in applications such as risk assessment, cost estimation, and decision analysis (Li et al.,
2003; Zhang and Li, 2005). Deterministic approaches consider only the most plausible cases and do not incorporate uncertainty in spatial predictions. To overcome this, geostatistical stochastic simulations have been successfully applied (Isaaks and Srivastava, 1989; Deutsch and Journel, 1992). Although the concept of predicting statistical variability of observed field properties is simple, developing models capable of mimicking complex spatial variability has proven difficult. In conventional geostatistics, sequential Gaussian simulation (SGS) and SIS are used mainly as tools for stochastic simulation, with both models requiring random interrogation of unknown nodes by kriging (Deutsch and Journel, 1992). The SGS method assumes intrinsic stationarity and a Gaussian distribution of the random variables of interest. The method also assumes that means of the random variables exist, and spatial relationships between variables can be described by a semivariogram model. These assumptions are often unrealistic and limit the model, as does the fact that the model cannot be used with mutually exclusive categorical variables such as soil type or geology. Currently, a few algorithms are available for simulating categorical variables (Goovaerts, 1997). The prevalent SIS method does not assume a particular distribution of random variables. Indicator-based simulation methods, such as SIS, are well suited for many types of geological or pedological characterization because indicator kriging works well for nonparametric categorical variables such as geology and soil type (Gomez-Hernandez and
Srivastava, 1990; Journel, 1983; Carle and Fogg, 1996). They have been widely used in many fields such as soil mapping, stratigraphy, hydrogeology, and sedimentology (e.g., Bierkens and Burrough, 1993a, 1993b; Weerts and Bierkens, 1993; Bierkens and Weerts, 1994; Carle et al., 1998; Ritzi, 2000). A few studies have reported, however, that SIS has difficulties in accurately reproducing spatial structures for systems comprised of multiple categorical variables (Chilés and Delfiner, 1999; Seifert and Jensen, 1999). Both methods exclude many realistic soil and geology problems because they assume directional symmetry (by use of an isotropic semivariogram or autocovariance model) and cannot handle the asymmetry often encountered in practice (Carle and Fogg, 1996, 1997; Carle et al., 1998; Weissmann et al., 1999). Conventional stochastic simulations cannot readily incorporate geologic interpretations or handle asymmetric juxtaposition tendencies (Carle and Fogg 1996). In recent years, transition-probability-based conditional indicator models have been proposed to address these problems. Carle and Fogg (1996, 1997) first introduced transition-probability-based indicator geostatistics in which the variograms conventionally used by geostatisticians are replaced by Markovian transition probabilities. The latter are more convenient for developing and interpreting cross-correlations of multiple lithologies. Elfeki and Dekking (2001) independently developed a two-dimensional CMC model that explicitly uses a conditional simulation algorithm with a simplified transition probability equation using Markovian properties. Parks et al. (2000) used Markov chains with a simulated annealing technique to capture geological realism in stochastic simulations. Wu et al. (2004) developed a Markov chain Monte Carlo methodology and successfully reproduced the texture and structure of binary thin-section images. Li et al. 
(2004) proposed a triplex Markov chain model that improved the prediction accuracy by introducing two horizontal chains, with one chain running from left to right and the other running from right to left, relaxing what they call “an artifact” of the previous CMC model; however, they lost the essence of asymmetry in the Markov transition probabilities by doing so. Elfeki and Dekking (2005) have shown the effect of asymmetry and the directional dependency of Markov chains on predictions of spatial distributions. Park et al. (2005) extended the two-dimensional CMC model of Elfeki and Dekking (2001) to three dimensions. The CMC method for computing probability (Elfeki and Dekking, 2001) is rather explicit compared with conventional kriging-based geostatistics because it directly uses the CMC probability equation by finding a set of conditioning information around a given cell, and it does not require matrix inversions. The model could potentially be integrated with soft information, such as geologic inferences and geophysical data, because of its explicit scheme. The CMC model has been applied to real outcrop and borehole data (Elfeki and Dekking, 2001, 2005; Elfeki, 2006). Implementation of the CMC model is also convenient because it does not require post-processing (e.g., parametric fitting of a semivariogram model or cumbersome indicator cokriging techniques),
Fig. 1. Schematic drawing for one-dimensional coupled Markov chain calculation (Sp, state of cell M; Sj, state of cell i; and Sq, state of cell N).
and computational efficiency can be maximized by searching minimal directional conditioning information. A limitation of previous CMC methods is the rigidity of this algorithm, however, a characteristic inherent in the CMC transition probability equations (Elfeki and Dekking, 2001). In the previous two-dimensional CMC model, calculations must be performed following a row-by-row, line-by-line sequential path, and the probability equation requires that at least one piece of conditioning information for each direction (x and y) be associated with the cell adjacent to that for which the probability is being calculated (Elfeki and Dekking, 2001). Therefore, unidirectional information transfer may predominate, and this can give rise to “artificial” lithology parcel inclination for certain transition probabilities, sampling intervals, and borehole spacings (Elfeki and Dekking, 2005). To avoid these problems, a more generalized transition probability equation and its computational algorithm is needed. Another problem that has persisted since the development of the CMC model is underestimation for undersampled or sparsely located indicators (Li et al., 2004; Li and Zhang, 2005). This underestimation is present in all CMC-based models and has not been resolved completely (Li and Zhang, 2005). The current study is confined to cases where all indicators are properly sampled in both number and interval. In this study, we first derived a general form of the Markovian transition probability equation with no restrictions on the separation of conditioning information. Then a corresponding computational algorithm was developed that efficiently assesses the conditional probability based on the equation at a given location using neighboring data. This GCMC model was tested on a hypothetical soil map reconstruction using sparse data extracted from the soil map. The results were qualitatively and quantitatively compared with results from the SIS model. 
The CMC model cannot be used for comparison because it is unable to handle sparse data. To compare the models, we used self- and cross-transition probabilities from the simulated map, as well as model predictability as measured by ensemble probability.
MATERIALS AND METHODS
Development of Generalized Two-Dimensional Coupled Markov Chain Transition Probability Equations
Previous Two-Dimensional Coupled Markov Chain Theory
The CMC model in two-dimensional space has been described by Elfeki and Dekking (2001). In the Markovian framework, the conditional distribution of any future state is independent of the past history if the present state is given. That is, if the discrete stochastic process $\{Z_n, n = 0, 1, 2, \ldots\}$ is a sequence of random variables taking values in the state space $\{S_1, S_2, \ldots, S_n\}$ (Fig. 1), then the sequence is a Markovian process if (Ross, 2000)

$$\Pr(Z_i = S_k \mid Z_{i-1} = S_l, Z_{i-2} = S_n, Z_{i-3} = S_r, \ldots, Z_0 = S_p) = \Pr(Z_i = S_k \mid Z_{i-1} = S_l) = p_{lk} \qquad [1]$$

where $p_{lk}$ is the transition probability from a state ($S_l$) to another state ($S_k$). In one-dimensional problems, a Markov chain is described by a single transition probability matrix (TPM). For details of one-dimensional Markov chains applied in geology, see Krumbein (1967). Transition probabilities can be estimated from the relative frequencies of transitions from a certain state to other states. These transition probabilities can be arranged into an n × n matrix:

$$\mathbf{P} = \begin{bmatrix} p_{11} & \cdots & p_{1n} \\ \vdots & p_{lk} & \vdots \\ p_{n1} & \cdots & p_{nn} \end{bmatrix} \qquad [2]$$
where n is the number of states in the system. In the matrix P, all elements must be non-negative (plk ≥ 0), and the sum of the elements in each row must be 1. The transition probabilities considered in Eq. [2] are called one-step transitions, meaning that the transition from one state to another occurs in one step. For transitions that occur in N steps, the one-step transition probability matrix is multiplied by itself N times to get the N-step transition probability matrix. Imposing two mild constraints on the transition matrix (aperiodicity and irreducibility), successive multiplications lead to identical rows (w1, w2, …, wn). Marginal probabilities wk (k = 1, 2, …, n) are then given by
$$\lim_{N \to \infty} p_{lk}^{N} = w_k \qquad [3]$$
where wk is no longer dependent on the initial state Sl. The aforementioned is a conventional method to estimate transition probabilities from finite continuous data. A few methodologies to estimate continuous multistep transition probabilities from sparse field observations are also currently available, including one based on the continuous-lag Markov chain model (Carle et al., 1998), and another on estimation from the experimental “transiogram” (Li and Zhang, 2005; Li, 2006). The conditional probability equation that is governed by the immediate past and the future state was derived by Elfeki and Dekking (2001):
$$\Pr(Z_i = S_j \mid Z_{i-1} = S_l, Z_N = S_q) = \frac{p_{lj}\, p_{jq}^{N-i}}{p_{lq}^{N-i+1}} \qquad [4]$$
where ZN is a random variable of the future, and Sq is the future state. This can be extended to two-dimensional space by coupling two state spaces, {S1, S2, …, Sn} × {S1, S2, …, Sn}, with the assumption that a transition to a given location in two-dimensional space from two independent chains must be to the same state from both chains (Fig. 2). A simple modification to Eq. [4] yields the two-dimensional CMC formulation:
$$\Pr(Z_{i,j} = S_k \mid Z_{i-1,j} = S_p, Z_{N,j} = S_q, Z_{i,j-1} = S_f, Z_{i,P} = S_g) = C\,\frac{p_{pk}\, p_{kq}^{N-i}\, p_{fk}\, p_{kg}^{P-j}}{p_{pq}^{N-i+1}\, p_{fg}^{P-j+1}} \qquad [5]$$

where C is

$$C = \frac{p_{pq}^{N-i+1}\, p_{fg}^{P-j+1}}{\sum_{l}^{n} p_{pl}\, p_{lq}^{N-i}\, p_{fl}\, p_{lg}^{P-j}} \qquad [6]$$

assuming a forced transition (Elfeki and Dekking, 2001).

Fig. 2. Schematic drawing for two-dimensional coupled Markov chain calculation (Sp, state of cell M; Sj, state of cell i; Sq, state of cell N; Sf, state of cell O; and Sg, state of cell P).

Limitations of the Previous Coupled Markov Chain Probability Equation
A major limitation of the previous coupled Markov chain is that past conditioning information must be adjacent to the calculation location (Elfeki and Dekking, 2001). This restriction is caused by the derived probability equation and forces line-by-line sequential calculation. It also induces an inclination of the indicator parcel at an angle of $\tan^{-1}(\Delta y/\Delta x)$ when discretization of the model domain in the x direction (Δx) or y direction (Δy) is too small, or the conditioning information is sparse. Elfeki and Dekking (2005) have investigated the problem and concluded that the choice of these intervals is crucial to obtaining plausible results, leaving the problem still unresolved. Another issue is that the algorithm works row by row, and conditioning information must be input sequentially (data on a line across a domain must be fully informed). This may be satisfactory for simulating geological cross-sections, where the conditioning information is acquired from vertical boreholes (Elfeki and Dekking, 2001, 2005). Liu and Journel (2004) have improved sequential simulations by using a structured path guided by information content rather than a random path based on a priori knowledge. In more general two-dimensional cases, however, where prior information is lacking, acquisition of information in the direction of lines is impractical, and random distribution of the surveyed information is commonly encountered. In this situation, the random sequence calculation is more suitable (Isaaks, 1990).

Generalized Coupled Markov Chain Probability Equation
In this study, we compose the directional transition probabilities from the interpolated training map using the nearest neighborhood (NN) method. Every unsurveyed cell is evaluated based on the nearest hard information, with the nearest neighboring hard information found by a circular search algorithm, and the lithology is assigned to an uninformed cell. The evaluated cell then becomes additional hard information to be used in the following evaluations. To build the interpolated training map, we use sparse data as the sole input and impose the assumption that each observation is equally honored in the initial guess. This is one way to honor sparse data without bias. In the case of expected directional anisotropy with lower sampling frequency, however, this method may fail to capture the spatial characteristics of indicator spatial structures, as a variogram system will do.

One of the biggest advantages of the Markov chain model is that conditioning using the Markovian property improves calculation efficiency by finding the minimum amount of conditioning information. Consider a one-dimensional series of events that are Markovian (Fig. 1). The probability of cell i being in state Sj, given that the previous cell M is in state Sp and cell N is in state Sq, can be expressed mathematically as

$$\Pr(Z_i = S_j \mid Z_M = S_p, Z_N = S_q) = \frac{\Pr(Z_i = S_j, Z_M = S_p, Z_N = S_q)}{\Pr(Z_M = S_p, Z_N = S_q)} \qquad [7]$$

The joint distribution probability can be expressed as a combination of shorter forms of the joint distribution probability and conditional probability by applying the Markovian property to get

$$\Pr(Z_i = S_j \mid Z_M = S_p, Z_N = S_q) = \frac{\Pr(Z_N = S_q \mid Z_i = S_j)\,\Pr(Z_i = S_j \mid Z_M = S_p)}{\Pr(Z_N = S_q \mid Z_M = S_p)} \qquad [8]$$

Therefore, the generalized one-dimensional Markov chain transition conditional probability equation will be

$$\Pr(Z_i = S_j \mid Z_M = S_p, Z_N = S_q) = \frac{{}^{-}p_{pj}^{M-i}\;{}^{+}p_{jq}^{N-i}}{{}^{+}p_{pq}^{N-M}} \qquad [9]$$

Here we assume spatial asymmetry of the spatial indicator structures, and the negative and positive superscripts stand for the multistep transition probability in the negative or positive direction calculated from the TPMs for the negative or positive direction, respectively. In Eq. [9], when cell N is far from cell i, the terms ${}^{+}p_{pq}^{N-M}$ and ${}^{+}p_{jq}^{N-i}$ will cancel each other because they will almost equal the stationary probability $w_q$. When we get closer to cell N, however, its state will start to play a role, and the simulation results will be influenced by the state at that cell. Physically, it is obvious that when the calculation location is far from the conditioning data, the conditioning data has little influence, while the probability calculation is strongly dependent on the conditioning data when its location is very close. As a special case, if M = i − 1 and Sp = Sl, the equation above becomes similar to the previous one-dimensional Markov chain formula (Eq. [4]):

$$\Pr(Z_i = S_j \mid Z_M = S_p, Z_N = S_q) = \frac{{}^{-}p_{pj}\;{}^{+}p_{jq}^{N-i}}{{}^{+}p_{pq}^{N-i+1}} \qquad [10]$$

except that in Eq. [4], only one transition probability matrix is used for the whole calculation instead of two asymmetric transition probability matrices. As in the previous two-dimensional CMC model, we couple two independent one-dimensional chains to get a two-dimensional generalized CMC by excluding impractical transitions (Elfeki and Dekking, 2001). The new two-dimensional formula conditioned by the surrounding four known points (Fig. 2) is given by

$$\Pr(Z_i = S_j \mid Z_M = S_p, Z_N = S_q, Z_O = S_f, Z_P = S_g) = \frac{\left[{}^{w}p_{pj}^{M-i}\right]^{\delta_w}\left[{}^{e}p_{jq}^{N-i}\right]^{\delta_e}\left[{}^{n}p_{fj}^{O-i}\right]^{\delta_n}\left[{}^{s}p_{jg}^{P-i}\right]^{\delta_s}}{\sum_{k}^{n}\left[{}^{w}p_{pk}^{M-i}\right]^{\delta_w}\left[{}^{e}p_{kq}^{N-i}\right]^{\delta_e}\left[{}^{n}p_{fk}^{O-i}\right]^{\delta_n}\left[{}^{s}p_{kg}^{P-i}\right]^{\delta_s}} \qquad [11]$$

where $\delta_{\xi}$ is the Dirac delta function,

$$\delta_{\xi} = \begin{cases} 0 & \varphi_{\xi} > \lambda_{\max}^{\xi} \\ 1 & \varphi_{\xi} \le \lambda_{\max}^{\xi} \end{cases} \qquad [12]$$

with $\varphi_{\xi}$ defined as the distance to the ξ-directional conditioning information and $\lambda_{\max}^{\xi}$ defined as the maximum search range in the ξ direction.
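As a rough illustration of Eq. [9] through [12], the sketch below computes the conditional distribution at one cell from up to four directional conditioning data. It is a minimal sketch under stated assumptions, not the authors' code: the function names `mat_pow` and `gcmc_probabilities`, the dictionary keys "w", "e", "n", "s", and the use of the absolute cell distance as the number of transition steps are all choices made here. Equation [11] is read as a product of directional multistep transition probabilities in which a term is dropped when its datum lies beyond the search range (δ = 0).

```python
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]


def mat_pow(p: Matrix, steps: int) -> Matrix:
    """Return the multistep TPM, i.e. the one-step matrix to the power `steps`."""
    n = len(p)
    out = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for _ in range(steps):
        out = [[sum(out[r][k] * p[k][c] for k in range(n)) for c in range(n)]
               for r in range(n)]
    return out


def gcmc_probabilities(
    tpm: Dict[str, Matrix],
    neighbours: Dict[str, Optional[Tuple[int, int]]],
    max_range: int,
) -> List[float]:
    """Conditional distribution of the cell state in the spirit of Eq. [11].

    tpm        one-step TPM per direction key "w", "e", "n", "s" (assumed layout)
    neighbours per direction: (state index, distance in cells) or None
    max_range  search range; data farther away get delta = 0 as in Eq. [12]
    """
    n = len(next(iter(tpm.values())))
    weights = [1.0] * n
    for direction, info in neighbours.items():
        if info is None or info[1] > max_range:
            continue  # delta = 0: the term drops out of the product
        state, distance = info
        multi = mat_pow(tpm[direction], distance)
        for j in range(n):
            if direction in ("w", "n"):
                weights[j] *= multi[state][j]  # datum state -> candidate state j
            else:
                weights[j] *= multi[j][state]  # candidate state j -> datum state
    total = sum(weights)
    if total == 0.0:
        raise ValueError("conditioning data are incompatible with the TPMs")
    return [w / total for w in weights]  # normalisation = denominator of Eq. [11]


# Two states, one matrix reused for every direction (illustrative values only).
P = [[0.9, 0.1], [0.1, 0.9]]
tpm = {d: P for d in "wens"}
probs = gcmc_probabilities(tpm, {"w": (0, 1), "e": (0, 1)}, max_range=10)
```

With a single matrix and only west and east data one step away, the result for state 0 is 0.81/0.82, which is what Eq. [9] gives because the two-step probability in the denominator equals the normalising sum. When the east datum is moved far away, its term tends to the stationary probability for every candidate state and the distribution approaches the one-step row of the west datum, in line with the limiting behaviour described for Eq. [9].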
Improved Calculation Algorithm
The previous CMC model is limited in its application because of the restriction in the probability equation, as mentioned above. The two-dimensional GCMC model does not have such a restriction and can evaluate the probability equation in a random sequence using randomly located conditioning information. Therefore, the GCMC probability equation is more widely applicable because any calculation algorithm can be used, such as line-by-line sequential or random path, and the two-dimensional GCMC model can accept any distribution of the conditioning information. To estimate the directional TPMs from randomly scattered hard data, the GCMC model must interpolate unsampled locations as a training image. In this study, an interpolated map using the NN method, which is described above, was used as a training image by assuming that all acquired data are equally important and that the spatial structures represented by the data are isotropic. Two sets of randomly scattered data generated using different data acquisition frequencies of 600 and 120 samplings across 120 000 m² were used (Fig. 3b and 3c).

In the simulation, the calculation sequence is decided by a random generator, which is different from the line-by-line sequential path of the previous CMC model. At a given calculation location, four cells with conditioning information are sought in the north (N), east (E), west (W), and south (S) directions (Fig. 4). The dimensions of the calculation cell are determined from the minimum correlation scale, and the maximum correlation scale is used to specify the maximum search range, λmax, and the maximum search swath, ωmax. To maximize the utility of sparsely located conditional information, the search algorithm uses an angle tolerance, δ, which may be selected empirically, and conditional information in straight as well as oblique directions is sought (Fig. 4). This searching algorithm is similar to that used for semivariogram estimation in GSLIB (Deutsch and Journel, 1992).

Fig. 3. Randomly chosen sampling location maps of each frequency: (a) original soil map; (b) sparse data set with 600 randomly selected data points; (c) sparse data set with 120 randomly selected data points. Each symbol represents a different soil type.

Fig. 4. Schematic drawing of calculation using generalized two-dimensional coupled Markov chain probability equation (N, E, W, S: cells with conditioning information in the north, east, west, and south, respectively; λmax, maximum search range; ωmax, maximum search swath; δ, angle tolerance).
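The directional search with a maximum range, a maximum swath, and an angle tolerance can be sketched as follows. This is a brute-force illustration under assumptions of my own (north taken as +y, hard data held in a dictionary keyed by cell coordinates, the function name `directional_search`); it is not the GSLIB routine or the authors' implementation, and a production code would use a spatial index rather than scanning every datum.

```python
import math
from typing import Dict, Optional, Tuple

Cell = Tuple[int, int]
# Unit vectors per direction; taking north as +y is a convention of this sketch.
DIRECTIONS: Dict[str, Cell] = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}


def directional_search(
    cell: Cell,
    data: Dict[Cell, int],
    max_range: float,
    max_swath: float,
    angle_tol: float,
) -> Dict[str, Optional[Tuple[int, int]]]:
    """Find the nearest conditioning datum in each of the four directions.

    Returns, per direction, (state, lag along the search axis in cells) or None.
    A datum qualifies if it lies ahead of the cell, within `max_range` along the
    axis, within `max_swath` across it, and within `angle_tol` (radians) of it.
    """
    found: Dict[str, Optional[Tuple[int, int]]] = {}
    for name, (ux, uy) in DIRECTIONS.items():
        best = None
        for (x, y), state in data.items():
            dx, dy = x - cell[0], y - cell[1]
            along = dx * ux + dy * uy        # offset along the search axis
            across = abs(dx * uy - dy * ux)  # perpendicular offset (swath)
            if along <= 0 or along > max_range or across > max_swath:
                continue
            if math.atan2(across, along) > angle_tol:
                continue
            dist = math.hypot(dx, dy)
            if best is None or dist < best[0]:
                best = (dist, state, along)
        found[name] = None if best is None else (best[1], best[2])
    return found


hard_data = {(5, 0): 1, (0, 3): 2, (-2, 0): 0, (1, -4): 1, (20, 0): 2}
hits = directional_search((0, 0), hard_data, max_range=10, max_swath=1,
                          angle_tol=math.radians(20))
```

In this example the datum at (1, -4) is accepted for the south direction because it is only one cell off-axis and about 14 degrees from the axis, whereas the datum at (20, 0) is rejected for the east direction because it exceeds the search range.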
Prediction Algorithm
The algorithm for stochastic prediction using the two-dimensional GCMC model is as follows:
Step 1: The simulated domain is discretized considering the sampling interval and minimum correlation lengths of the indicators.
Step 2: Acquired field data are stored in the corresponding cells of the discretized domain.
Step 3: If the data format is sequential, the transition probabilities of each indicator are directly calculated following Elfeki and Dekking (2001); if the data format is sparse, an interpolated map using the NN method is generated first, and then four different transition probability matrices (positive x, negative x, positive y, and negative y directions) are calculated from the guessed map.
Step 4: The starting cell is chosen by a random generator.
Step 5: To generate a conditional probability distribution, a search algorithm seeks out a cell with conditioning information in each direction (positive x, negative x, positive y, and negative y), if it exists, and the probability distribution is drawn using Eq. [11] and [12].
Step 6: A random number is drawn, and the soil type is determined from the probability distribution.
Step 7: The next cell to visit and calculate is decided by the random generator. Steps 5 through 7 are repeated until all cells have been visited.

Fig. 5. (a and b) Single realizations and (c and d) ensemble maps computed from 50 realizations using 600 data points: (a) a realization from the generalized coupled Markov chain (GCMC) model; (b) a realization from the sequential indicator simulation (SIS) model; (c) an ensemble soil map from the GCMC model; (d) an ensemble soil map from the SIS model.
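The steps above can be sketched in a few functions. The code below is a simplified illustration and makes several assumptions that are not in the text: the training map uses a brute-force nearest-datum lookup instead of the circular search with sequentially added information, the grid is indexed as `grid[y][x]`, and the conditional distribution of Step 5 is passed in as a hypothetical callable `conditional(cell, known)` so that the random-path loop stays short. The uniform stand-in used at the end only makes the example run; it is not Eq. [11].

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Cell = Tuple[int, int]
Grid = List[List[int]]


def nn_training_map(width: int, height: int, data: Dict[Cell, int]) -> Grid:
    """Step 3a: give every cell the state of its nearest hard datum (brute force)."""
    def nearest(x: int, y: int) -> int:
        return min(data.items(),
                   key=lambda kv: (kv[0][0] - x) ** 2 + (kv[0][1] - y) ** 2)[1]
    return [[nearest(x, y) for x in range(width)] for y in range(height)]


def directional_tpms(grid: Grid, n_states: int) -> Dict[str, List[List[float]]]:
    """Step 3b: one-step TPMs from transition counts in the +x, -x, +y, -y directions."""
    steps = {"+x": (1, 0), "-x": (-1, 0), "+y": (0, 1), "-y": (0, -1)}
    height, width = len(grid), len(grid[0])
    tpms = {}
    for name, (dx, dy) in steps.items():
        counts = [[0.0] * n_states for _ in range(n_states)]
        for y in range(height):
            for x in range(width):
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and 0 <= ny < height:
                    counts[grid[y][x]][grid[ny][nx]] += 1.0
        # Normalise each row; rows without observations stay all zero.
        tpms[name] = [[c / sum(row) if sum(row) else 0.0 for c in row]
                      for row in counts]
    return tpms


def simulate(
    width: int,
    height: int,
    data: Dict[Cell, int],
    n_states: int,
    conditional: Callable[[Cell, Dict[Cell, int]], Sequence[float]],
    rng: random.Random,
) -> Grid:
    """Steps 2 and 4 to 7: visit unknown cells on a random path and draw states."""
    known = dict(data)                                   # Step 2: store hard data
    path = [(x, y) for y in range(height) for x in range(width)
            if (x, y) not in known]
    rng.shuffle(path)                                    # Steps 4 and 7: random order
    for cell in path:
        probs = conditional(cell, known)                 # Step 5: distribution
        known[cell] = rng.choices(range(n_states), weights=probs)[0]  # Step 6
    return [[known[(x, y)] for x in range(width)] for y in range(height)]


samples = {(0, 0): 0, (3, 0): 1}
training = nn_training_map(4, 1, samples)        # [[0, 0, 1, 1]]
tpms = directional_tpms(training, n_states=2)
realization = simulate(4, 1, samples, 2, lambda cell, known: [0.5, 0.5],
                       random.Random(1))
```

For the four-cell training row, the +x matrix is [[0.5, 0.5], [0, 1]] and the -x matrix is [[1, 0], [0.5, 0.5]], a small example of the directional asymmetry that transition probabilities can carry and a symmetric variogram cannot. Each simulated cell is added to `known`, so later cells are conditioned on earlier draws as well as on the hard data.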
RESULTS AND DISCUSSION To test the developed two-dimensional GCMC model, a synthesized soil map, which was modified from a real soil map of the Velddriel area, in the Rhine–Meuse delta, the Netherlands (Weerts, 1996), by merging physically similar soil categories (Fig. 3a), was used. The mapped area comprises three soil types (Type 1, 34%; Type 2, 30%; Type 3, 36%). Sparse data sets were prepared by selecting samples at random locations in the original map at two different sampling frequencies. The simulation domain had dimensions of 300 m (x direction) by 400 m (y direction) discretized into 120 000 cells with cell size 1 m (x direction) by 1 m (y direction). Figure 3b shows sampling locations and soil types for 600 points (Type 1, 36%; Type 2, 31%; Type 3, 33%) and Fig. 3c shows those for 120 points (Type 1, 39%; Type 2, 27%; Type 3, 34%).
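Preparing a sparse data set of this kind and checking its soil-type proportions against the full map can be sketched as below. The function names `sample_map` and `proportions`, the `grid[y][x]` indexing, and the toy map are assumptions made for illustration; the study's actual map and sampling code are not reproduced here.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def sample_map(grid: List[List[int]], n_points: int,
               rng: random.Random) -> Dict[Cell, int]:
    """Draw `n_points` distinct cells at random and record their soil types."""
    cells = [(x, y) for y in range(len(grid)) for x in range(len(grid[0]))]
    return {(x, y): grid[y][x] for (x, y) in rng.sample(cells, n_points)}


def proportions(data: Dict[Cell, int]) -> Dict[int, float]:
    """Fraction of the sampled points belonging to each soil type."""
    counts = Counter(data.values())
    return {state: counts[state] / len(data) for state in sorted(counts)}


toy_map = [[0, 0], [1, 2]]
sparse = sample_map(toy_map, n_points=3, rng=random.Random(0))
shares = proportions(sparse)
```

Comparing `proportions` of a sample with the proportions of the full map gives a quick check of how representative a sparse data set is, in the same way that the 600-point and 120-point proportions are compared with the map-wide values above.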
Using the sampling points in each map (Fig. 3b and 3c) as input data, a total of 50 realizations were generated by the developed code. Computation time for the 50 GCMC realizations was about 65 s using a computer system with 3.8 GHz dual CPU and 4 Gb memory. Figure 5a is an example of a single realization computed by the code using the 600-point data set. It is not possible to compare the GCMC model directly with the previous CMC model using the given data sets because input data for the former are spatially nonsequential and cannot be used with the CMC model. Comparing the models using sequential input data is not meaningful, even if it is possible, because the GCMC model is intrinsically identical to the CMC model for that case. Indeed, the GCMC model is the general form of the CMC because it can handle both sequential and nonsequential sparse data. We therefore compared our results with the GSLIB SIS model, which uses simple indicator kriging as an estimator (Deutsch and Journel, 1992). Table 1 provides a summary of the comparison between the GCMC and SIS models. Computation time for the 50 SIS realizations, not including preprocessing (i.e., semivariogram estimation), was 116 s using the
Table 1. Summary of comparison between sequential indicator simulation (SIS) and generalized coupled Markov chain (GCMC) models.
Parameter: Description of spatial structure
  SIS: Variogram or autocovariance
  GCMC: Transition probability
Parameter: Asymmetry
  SIS: Cannot handle asymmetry (variogram or autocovariance are symmetric): γ(s) = γ(−s), cov(s) = cov(−s)
  GCMC: Can handle asymmetry: pij(s) ≠ pij(−s)
Parameter: Procedure and matrix formulation
  SIS: Implicit. A covariance matrix is generated and a kriging system is solved for each unknown cell.
  GCMC: Explicit. There is no matrix formulation. Each grid cell is calculated by sampling from the conditional distribution given in Eq. [11].
Parameter: Relative simulation time
  SIS: 1.78
  GCMC: 1
Parameter: Preprocessing
  SIS: Models the experimental covariance or variogram
  GCMC: Uses the nearest-neighbor method to produce a training image (guessed map) to estimate the transition probabilities
Table 2. Transition probability matrix calculated from the original soil map with 1000 data points, and over- or underpredictions of the generalized coupled Markov chain (GCMC) single realization and sequential indicator simulation (SIS) single realization, where negative numbers are underpredictions and positive numbers are overpredictions of the relationships between indicators. Diagonal entries are self-transitions.
                      Soil type
Soil type        1          2          3
Original
1             0.9896     0.0043     0.0060
2             0.0012     0.9827     0.0161
3             0.0073     0.0099     0.9827
GCMC over- and underpredictions
1            −0.0063    −0.0014     0.0077
2             0.0003     0.0004    −0.0007
3             0.0050     0.0032    −0.0082
SIS over- and underpredictions
1            −0.0124     0.0039     0.0085
2             0.0049    −0.0164     0.0115
3             0.0079     0.0170    −0.0249
same computer system. This is about 1.8 times longer than the GCMC model took for the example case. Figure 5b is an example of a single realization from the SIS model, using the same data as that used for the GCMC model simulation. The results look similar except that the SIS map has higher entropy at the soil-type boundaries. Comparing realizations from the SIS and GCMC models, it is obvious that the GCMC is more conservative and, therefore, shows clearer boundaries between soil types. This distribution is also seen in the TPMs. Table 2 shows the TPM in the positive x direction acquired from the original soil map, and the component-wise residuals represent the over- or underpredictions of the realizations from each model calculated by subtracting the original TPM from the simulated TPM. The tabulated residuals of the GCMC model are generally less than the residuals of the SIS realization. The larger residuals obtained with the SIS model may be attributed to the high entropy of soil-type boundaries shown in Fig. 5b. The residuals are generally larger in the self-transition components (diagonals) than in the cross-transition components (off-diagonal) for both the GCMC and SIS models. The underpredictions are obvious in the self-transition components for both the GCMC and SIS models.

From the 50 single realizations of each model, ensemble soil maps were generated (Fig. 5c and 5d) by assigning the most frequently appearing soil type in each cell. In Table 3, the same metric is used to compare the two different models. The results show that five out of nine transition relationships of the GCMC ensemble map are closer to that of original soil map, and four transition relationships of the SIS ensemble map are closer to the original relationship. This suggests that even if the GCMC model seems superior to the SIS model in predicting transition relationships between soil types, their abilities to predict transition relationships are almost equal for single realizations, if the quantity of input data is sufficient.

Table 3. Over- or underpredictions of generalized coupled Markov chain (GCMC) ensemble realizations with 1000 data points, and sequential indicator simulation (SIS) ensemble realizations. Negative numbers are underpredictions and positive numbers are overpredictions of the relationship between indicators. Diagonal entries are self-transitions.
                      Soil type
Soil type        1          2          3
GCMC
1            −0.0027     0.0019     0.0008
2            −0.0007    −0.0063     0.0071
3             0.0017    −0.0007    −0.0010
SIS
1            −0.0009     0.0038    −0.0029
2             0.0006    −0.0023     0.0017
3             0.0027     0.0027    −0.0054

Fig. 6. Comparison of ensemble probability distributions of indicator soil types 1 through 3 computed from 50 generalized coupled Markov chain (GCMC; first column) and sequential indicator simulation (SIS; second column) realizations using 600 data points.

Table 4. Comparison of generalized coupled Markov chain (GCMC) and sequential indicator simulation (SIS) predictability based on Eq. [13]. The ensemble probability was calculated from 50 realizations using 600 data points.
              Model predictability (%)
Soil type     GCMC      SIS
1             91.2      84.9
2             88.5      80.9
3             74.3      70.4

To assess each model’s ability to predict the appearance of soil types and to assess robustness, ensemble probability distributions (EPD) of each soil type for each scheme were computed from the single realizations of each model (Fig. 6). For the EPD calculation, the number of times each soil type appears in each computational cell is divided by the total number of realizations (i.e., 50). In Fig. 6, the SIS results have a more dispersive margin than the GCMC results, indicating that the SIS model is less robust. Blurring of the margin in the SIS realizations is most severe for Soil Type 3, which has the most scattered distribution. Robustness of the stochastic predictions may or may not be advantageous in making predictions and depends on field conditions and simulation objectives. For a quantitative measure, we define
nx
∑∑ δ qm =
i
m ij
p ijm
j
nx
ny
i
j
∑∑ δ
m = 1, 2, 3
[13]
m ij
where $q_m$ is the model predictability for soil type $m$ for a given set of input data, $\delta_{ij}^{m}$ is a delta function with a value of 1 if the soil type on the original map is $m$ at a given location $(i, j)$ and a value of 0 elsewhere, and $p_{ij}^{m}$ is the ensemble probability of soil type $m$ at a given location $(i, j)$. In both the GCMC and SIS schemes, the calculated model predictability (Table 4) is highest for Soil Type 1 and lowest for Soil Type 3, which may be attributed to the nonstationarity in directional anisotropy of Soil Type 3. The GCMC model predictabilities are 5 to 10% higher than those of the SIS model (Table 4), which means that the GCMC model more confidently predicts the appearance of soil types.

To further test the GCMC model, a sparser sampling data set of 120 locations across an area of 120 000 m² was used. Figures 7a and 7b show single realizations from the GCMC and SIS models, respectively. When compared with Fig. 5a and 5b, both single realizations poorly mimic the original soil map (Fig. 3a). This is because the sampling frequency is reduced to about one-eighth. Figures 7c and 7d are ensemble maps from 50 realizations of the GCMC and SIS models, respectively. Qualitatively, the distributions of each soil type predicted by both the GCMC and SIS models resemble the original soil map. The intra- and interrelationships of soil types were quantitatively measured using the residual matrix of transition probabilities (Table 5). All absolute residual components of the transition probabilities calculated from the GCMC ensemble map are slightly less than those from the SIS ensemble map except the Soil Types 1 to 2 cross-transition component. This result may indicate that the GCMC model performs slightly better than, or at least equal to, the SIS model depending on the sampling frequency used. In Fig. 8, the EPDs of each soil type from the GCMC and SIS models are presented. The ensemble probabilities from both models are much more diffused than those in Fig. 6 because the sampling frequency was lower in this case. As also seen in Fig. 6, the ensemble probabilities of the SIS model are more diffused than those of the GCMC model. Using the derived EPD, model predictability was calculated using Eq. [13] (Table 6). Comparing the GCMC and SIS model predictabilities for each soil type, we conclude that the GCMC model predicts the distribution of soil types better than the SIS model.

Fig. 7. (a and b) Single realizations using 120 data points and (c and d) ensemble maps computed from 50 realizations using 120 data points.

Table 5. Over- or underpredictions of generalized coupled Markov chain (GCMC) ensemble realizations with 120 data points, and sequential indicator simulation (SIS) ensemble realizations. Negative numbers are underpredictions and positive numbers are overpredictions of the relationships between indicators. Diagonal entries are self-transitions.
                       Soil type
Soil type         1          2          3
GCMC
  1           −0.0026     0.0049    −0.0023
  2           −0.0009    −0.0071     0.0080
  3            0.0054    −0.0019    −0.0035
SIS
  1            0.0000     0.0103    −0.0103
  2            0.0010     0.0109    −0.0120
  3            0.0072     0.0089    −0.0161

Fig. 8. Comparison of ensemble probability distributions of indicator soil types 1 through 3 computed from 50 generalized coupled Markov chain (GCMC; first column) and sequential indicator simulation (SIS; second column) realizations using 120 data points.

Table 6. Comparison of generalized coupled Markov chain (GCMC) and sequential indicator simulation (SIS) predictability based on Eq. [13]. The ensemble probability is calculated from 50 realizations using 120 data points.
              Model predictability (%)
Soil type        GCMC       SIS
  1              79.2       66.9
  2              79.9       59.7
  3              58.2       57.7

CONCLUSIONS
We presented a generalization of the transition probability equation for the Markovian two-dimensional transition probability model, as well as the development of a computational algorithm based on that equation. The generalized transition probability equation is not limited by input data format or calculation sequences. Hence, artificial parcel inclinations that occur in the CMC model do not occur in the new model. The underestimation of low-frequency indicators was not addressed in this study.

Because the previous CMC model cannot handle sparse data, the new model was instead compared with the SIS model using a hypothetical soil map and sampling data. The GCMC model simulation time was only 65 s, compared with 116 s for the SIS model. Simulation results using one synthetic data set found the GCMC model performance to be equal to or slightly better than that of the SIS model and that, using the metrics derived here as well as EPD maps, the spatial distribution of soil type was predicted with greater accuracy. This suggests the superiority of the GCMC method relative to SIS, but further research is warranted into which model is advantageous for a given data set, because GCMC may not be unconditionally superior to relevant models.

Further development of the GCMC model is required to resolve problems associated with data nonstationarity, such as underestimation of undersampled or sparsely distributed information, and to extend the model to three-dimensional space. These issues will be addressed subsequently.

ACKNOWLEDGMENTS
This research was financially supported by the Core Environmental Technology Development Project for Next Generation (Eco-technopia 21) of the Korean Inst. of Environmental Science and Technology (Kiest) under Contract 04003-0008 and the Korea Science and Engineering Foundation (R01-2006-000-11050-0).
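The ensemble post-processing described in the text (the most-frequent-type ensemble map, the EPD, the predictability measure of Eq. [13], and the residuals of the transition probability matrix in the positive x direction) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: realizations are assumed to be stored as an integer NumPy array of shape (n_real, ny, nx), soil types are assumed to be coded 0, 1, ..., K−1 rather than 1 through 3, and all function names are hypothetical.

```python
import numpy as np

def ensemble_probability(realizations, n_types):
    """EPD: fraction of realizations in which each soil type occupies each cell.
    realizations: int array (n_real, ny, nx), types coded 0..n_types-1 (assumed coding).
    Returns a float array of shape (n_types, ny, nx)."""
    return np.stack([(realizations == m).mean(axis=0) for m in range(n_types)])

def ensemble_map(epd):
    """Most frequently appearing soil type per cell (ties go to the lowest code)."""
    return epd.argmax(axis=0)

def predictability(epd, original):
    """Eq. [13]: mean ensemble probability of type m over cells where the original map is m."""
    q = np.full(epd.shape[0], np.nan)  # NaN marks types absent from the original map
    for m in range(epd.shape[0]):
        mask = original == m
        if mask.any():
            q[m] = epd[m][mask].mean()
    return q

def tpm_x(soil_map, n_types):
    """One-step transition probability matrix in the positive x direction
    (row-normalized counts of left-cell -> right-cell transitions)."""
    counts = np.zeros((n_types, n_types))
    np.add.at(counts, (soil_map[:, :-1].ravel(), soil_map[:, 1:].ravel()), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Toy usage: a 2 x 3 "original" map and four realizations, one of which differs in one cell.
original = np.array([[0, 0, 1],
                     [0, 1, 1]])
variant = np.array([[0, 1, 1],
                    [0, 1, 1]])
realizations = np.stack([original, original, original, variant])

epd = ensemble_probability(realizations, n_types=2)
ens = ensemble_map(epd)
q = predictability(epd, original)                  # multiply by 100 for percent
residuals = tpm_x(ens, 2) - tpm_x(original, 2)     # simulated TPM minus original TPM
print(q)
print(residuals)
```

In this sketch, negative residuals correspond to underpredictions and positive residuals to overpredictions, following the sign convention used in Tables 3 and 5. Types that never occur in the original map return NaN from `predictability` rather than a division error, which is a design choice of the sketch.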