CONTRIBUTED RESEARCH ARTICLES
spMC: Modelling Spatial Random Fields with Continuous Lag Markov Chains
by Luca Sartore

Abstract Currently, a part of the R statistical software is developed in order to deal with spatial models. More specifically, some available packages allow the user to analyse categorical spatial random patterns. However, only the spMC package considers a viewpoint based on transition probabilities between locations. Through the use of this package it is possible to analyse the spatial variability of data, make inference, predict and simulate the categorical classes in unobserved sites. An example is presented by analysing the well-known Swiss Jura data set.
Introduction

Originally, the spMC package (Sartore, 2013) was developed with the purpose of analysing categorical data observed in 3-D locations. It deals with stochastic models based on Markov chains, which may be used for the analysis of spatial random patterns in continuous multidimensional spaces (Carle and Fogg, 1997). Its results are easily interpretable and it is a good alternative to the T-PROGS software developed by Carle (1999), which is oriented towards modelling groundwater systems. The models considered in the spMC package are used to analyse any categorical random variable $Z(s)$ at the $d$-dimensional position $s \in \mathbb{R}^d$ which satisfies the Markov property.

Other R packages are also helpful for analysing categorical spatial data. For example, the gstat package (Pebesma, 2004) allows for analyses using traditional methods such as the parameter estimation of spatial models based on variograms and kriging techniques for predictions. All these methods and their variants are also available in other packages, e.g. geoRglm (Christensen and Ribeiro Jr, 2002) and RandomFields (Schlather, 2013). When $Z(s)$ is assumed to be linked to a continuous hidden random process, these packages are useful for studying the covariance structure of the data.

The spMC package extends the functionality of the T-PROGS software to R users. New useful functions are included for faster modelling of transition probability matrices, and efficient algorithms are implemented for improving predictions and simulations of categorical random fields. The main tasks and their functions are summarised in Table 1. Three different fitting methods were implemented in the package. The first is based on the estimates of the main features that characterise the process, the second focuses on the minimisation of the discrepancies between the empirical and theoretical transition probabilities, and the third follows the maximum entropy approach.

Once the model parameters are properly estimated, transition probabilities are calculated through the matrix-valued exponential function (see Higham, 2008, Algorithm 10.20 in Chapter 10). These transition probabilities are then combined to predict the category in an unsampled position. Three algorithms are used to simulate spatial random fields: those based on the kriging techniques (Carle and Fogg, 1996), those using fixed and random path methods (Li, 2007a; Li and Zhang, 2007), or those using multinomial categorical simulation proposed by Allard et al. (2011). In order to reduce computation time through OpenMP API (version 3.0; OpenMP Architecture Review Board, 2008), the setCores() function allows the user to change the number of CPU cores, so that one can mix shared memory parallel techniques with those based on the Message Passing Interface (The MPI Forum, 1993) as described in Smith (2000).

Here, it will be shown how to perform a geostatistical analysis of the Jura data set (Goovaerts, 1997) using the spMC package (version 0.3.1). The data set consists of 359 sampled spatial coordinates and their respective observed realisations of two categorical variables (related to the rock-type and the land use) and some continuous variables (corresponding to the topsoil content).
Brief overview of the models

The spMC package deals with both one-dimensional and multidimensional continuous lag models. If $Z(s_l)$ denotes a categorical random variable in a location $s_l$, for any $l = 1, \ldots, n$, its outcome conventionally takes values in the set of mutually exclusive states $\{z_1, \ldots, z_K\}$, where $K$ represents the total number of observable categories. A continuous lag Markov chain model organises the conditional probabilities $t_{ij}(s_l - s_k) = \Pr(Z(s_l) = z_j \mid Z(s_k) = z_i)$, for any $i, j = 1, \ldots, K$, in a $K \times K$ transition probability matrix. Generally speaking, such a model is a transition probability matrix-valued function depending on one-dimensional or multidimensional
The R Journal Vol. 5/2, December
ISSN 2073-4859
Tasks and functions     Techniques implemented in the spMC package

Estimations of one-dimensional continuous lag models
  transiogram           Empirical transition probabilities estimation
  tpfit                 One-dimensional model parameters estimation
  tpfit_ils             Iterated least squares method for one-dimensional model parameters estimation
  tpfit_me              Maximum entropy method for one-dimensional model parameters estimation
  tpfit_ml              Mean length method for one-dimensional model parameters estimation

Estimations of multidimensional continuous lag models
  pemt                  Pseudo-empirical multidimensional transiograms estimation
  multi_tpfit           Multidimensional model parameters estimation
  multi_tpfit_ils       Iterated least squares method for multidimensional model parameters estimation
  multi_tpfit_me        Maximum entropy method for multidimensional model parameters estimation
  multi_tpfit_ml        Mean length method for multidimensional model parameters estimation

Categorical spatial random field simulation and prediction
  sim                   Random field simulation
  sim_ck                Conditional simulation based on indicator cokriging
  sim_ik                Conditional simulation based on indicator kriging
  sim_mcs               Multinomial categorical simulation
  sim_path              Conditional simulation based on path algorithms

Graphical tools
  plot.transiogram      Plot one-dimensional transiograms
  mixplot               Plot of multiple one-dimensional transiograms
  contour.pemt          Display contours with pseudo-empirical multidimensional transiograms
  image.pemt            Images with pseudo-empirical multidimensional transiograms
  image.multi_tpfit     Images with multidimensional transiograms

Table 1: Most important user functions in the spMC package.
lags, i.e. $T(h_\phi) : \mathbb{R}^d \to [0, 1]^{K \times K}$, wherein $h_\phi$ denotes a $d$-dimensional continuous lag along the direction $\phi \in \mathbb{R}^d$. Such a lag corresponds to the difference between the location coordinates and is proportional to the direction $\phi$. The exponential form,

$$T(h_\phi) = \exp\left(\|h_\phi\| R_\phi\right) = \sum_{u=0}^{\infty} \frac{\|h_\phi\|^u}{u!} R_\phi^u, \tag{1}$$

is usually adopted to model the observed variability and local anisotropy. The components of the transition rate matrix $R_\phi \in \mathbb{R}^{K \times K}$ (the model coefficients) depend on the direction $\phi$ and they must satisfy the following properties (Norris, 1998, Section 2.1):
• $r_{ii} \le 0$, for any $i = 1, \ldots, K$.
• $r_{ij} \ge 0$, if $i \ne j$.
• The row sums satisfy $\sum_{j=1}^{K} r_{ij} = 0$.
• The column sums satisfy $\sum_{i=1}^{K} p_i r_{ij} = 0$, where $p_i$ is the proportion of the $i$-th category.
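The properties above and the exponential form in (1) can be illustrated with a minimal base R sketch. The rate matrix `R`, the proportions `p` and the helper names `is_rate_matrix()`, `mat_exp()` and `trans_prob()` are hypothetical and are not part of spMC; the matrix exponential is approximated by a truncated Taylor series with scaling and squaring, which is a simpler scheme than the algorithm of Higham (2008) that the package is reported to use.

```r
# Hypothetical 3-category example: proportions and rates are made up.
p <- c(0.5, 0.3, 0.2)                        # category proportions p_i
R <- matrix(c(-0.18,  0.18,  0.00,
               0.10, -0.30,  0.20,
               0.30,  0.00, -0.30),
            nrow = 3, byrow = TRUE)          # transition rate matrix R_phi

# Check the four properties of a transition rate matrix.
is_rate_matrix <- function(R, p, tol = 1e-12) {
  off_diag <- R[row(R) != col(R)]
  all(diag(R) <= 0) &&                       # r_ii <= 0
    all(off_diag >= 0) &&                    # r_ij >= 0 for i != j
    all(abs(rowSums(R)) < tol) &&            # sum_j r_ij = 0
    all(abs(drop(p %*% R)) < tol)            # sum_i p_i r_ij = 0
}

# Matrix exponential: truncated Taylor series with scaling and squaring.
mat_exp <- function(A, terms = 20) {
  s <- max(0, ceiling(log2(max(1, norm(A, "O")))))
  A <- A / 2^s                               # scale so the series converges fast
  E <- diag(nrow(A))
  term <- diag(nrow(A))
  for (u in 1:terms) {
    term <- term %*% A / u                   # A^u / u!
    E <- E + term
  }
  for (i in seq_len(s)) E <- E %*% E         # undo the scaling by squaring
  E
}

# Transition probability matrix T(h) = exp(||h|| R) for a lag vector h.
trans_prob <- function(h, R) mat_exp(sqrt(sum(h^2)) * R)

T_h <- trans_prob(c(1.5, 2.0), R)            # lag of length 2.5
print(is_rate_matrix(R, p))
print(round(T_h, 4))
```

Each row of `T_h` is a probability distribution over the categories, and because the column-sum property holds, the proportions `p` are left unchanged by `T_h`.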
The components of $R_{-\phi}$ may be computed through the relation

$$r_{ij,-\phi} = \frac{p_j}{p_i}\, r_{ji,\phi} \qquad \forall i, j = 1, \ldots, K,$$

where $-\phi$ denotes the opposite direction.
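As a small illustration of the relation above, the following base R sketch derives the rates for the opposite direction from a forward rate matrix. The values of `p` and `R_fwd` and the helper name `reverse_rates()` are assumptions made for this example only; they do not come from the package.

```r
# Hypothetical proportions and forward rate matrix (illustrative values).
p <- c(0.5, 0.3, 0.2)
R_fwd <- matrix(c(-0.18,  0.18,  0.00,
                   0.10, -0.30,  0.20,
                   0.30,  0.00, -0.30),
                nrow = 3, byrow = TRUE)

# r_{ij,-phi} = (p_j / p_i) * r_{ji,phi}
# t(R)[i, j] is r_ji and outer(1 / p, p)[i, j] is p_j / p_i.
reverse_rates <- function(R, p) t(R) * outer(1 / p, p)

R_bwd <- reverse_rates(R_fwd, p)
print(R_bwd)
```

Since `R_fwd` satisfies both the row-sum and the column-sum property, the rows of `R_bwd` also sum to zero, and applying `reverse_rates()` twice recovers `R_fwd`.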
Transition rate matrix estimation

In order to obtain an estimate of the transition rate matrix along the direction $\phi$, the package provides two solutions, i.e. by following the one-dimensional approach or the multidimensional one. The latter estimates the matrix $R_\phi$ by the ellipsoidal interpolation of $d$ matrices, which are computed along the axial directions through one-dimensional procedures.

The one-dimensional techniques related to the tpfit_ml() and tpfit_me() functions are based on mean lengths $L_{i,\phi}$ and transition frequencies of embedded occurrences $f^*_{kj,\phi}$. The iterated least squares method is implemented through the tpfit_ils() function.

The first two functions estimate the stratum mean lengths for each category through the mlen() function. The mean lengths are computed either with the average of the observed stratum lengths or their expectation based on the maximum likelihood estimate by assuming that the observed lengths are independent realisations of a log-normal random variable. In order to verify the distributional assumption on the lengths, the function getlen() estimates stratum lengths of embedded Markov chains along a chosen direction, while other functions such as boxplot.lengths(), density.lengths(), hist.lengths() are used for graphical diagnostics.

The tpfit_ml() function computes the transition frequencies of embedded occurrences as an average through the function embed_MC(). The maximum entropy method, adopted by the tpfit_me() function, calculates the transition frequencies of embedded occurrences through iterative proportional fitting (Goodman, 1968). The algorithm may be summarised as follows:
1. Initialise $f_{i,\phi}$ with $p_i / L_{i,\phi}$.
2. Compute $f^*_{ij,\phi} = f_{i,\phi} f_{j,\phi}$ $\forall i, j = 1, \ldots, K$.
3. Compute

$$f_{i,\phi} = \frac{p_i \sum_{k=1}^{K} \sum_{j \ne k}^{K} f^*_{kj,\phi}}{L_{i,\phi} \sum_{j \ne i}^{K} f^*_{ij,\phi}}.$$

4. Repeat the second and the third step until convergence.

Both tpfit_ml() and tpfit_me() functions estimate the autotransition rates as $r_{ii} = -1/L_{i,\phi}$, while the rates for any $i \ne j$ are calculated as $r_{ij,\phi} = f^*_{ij,\phi} / L_{i,\phi}$.

The tpfit_ils() function estimates the transition rate matrix by minimising the sum of the squared discrepancies between the empirical probabilities given by the transiogram() function and theoretical probabilities given by the model. The bound-constrained Lagrangian method (Conn et al., 1991) is performed in order to have a proper transition rate matrix, which satisfies the transition rate properties.

The multidimensional approach is computationally efficient. In fact a generic entry of the matrix $R_\phi$ is calculated by the ellipsoidal interpolation as

$$|r_{ij,\phi}| = \sqrt{\sum_{v=1}^{d} \left( \frac{h_{v,\phi}}{\|h_\phi\|}\, r_{ij,e_v} \right)^{2}}, \tag{2}$$

where $h_{v,\phi}$ is the $v$-th component of the vector $h_\phi$, $e_v$ represents the standard basis vector, and the rate $r_{ij,e_v}$ is replaced by $r_{ij,-e_v}$ for components $h_{v,\phi} < 0$. In this way, it is only necessary to have in memory the estimates for the main directions. The multi_tpfit_ml(), multi_tpfit_me() and the multi_tpfit_ils() functions automatically perform the estimation of $d$ transition rate matrices along the axial directions with respect to the chosen one-dimensional method.
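The ellipsoidal interpolation in (2) can be sketched in base R as follows. The function `interp_rates()` and its arguments `R_pos` and `R_neg` (lists of rate matrices along the positive and negative axial directions) are hypothetical names, and the example matrices are invented. Equation (2) only defines the absolute value of each rate, so restoring the signs (negative diagonal, non-negative off-diagonal entries) is an assumption made here to agree with the rate matrix properties.

```r
# Ellipsoidal interpolation of axial rate matrices for a lag vector h.
# R_pos[[v]] / R_neg[[v]]: rates along +e_v / -e_v (hypothetical inputs).
interp_rates <- function(h, R_pos, R_neg) {
  stopifnot(any(h != 0), length(h) == length(R_pos))
  w <- h / sqrt(sum(h^2))                  # components h_v / ||h||
  K <- nrow(R_pos[[1]])
  acc <- matrix(0, K, K)
  for (v in seq_along(h)) {
    # Use the rates of the opposite axial direction when h_v < 0.
    R_v <- if (h[v] < 0) R_neg[[v]] else R_pos[[v]]
    acc <- acc + (w[v] * R_v)^2            # element-wise squares
  }
  sgn <- matrix(1, K, K)
  diag(sgn) <- -1                          # assumed sign pattern
  sgn * sqrt(acc)
}

# Two categories in two dimensions; with K = 2 the reversed rates equal
# the forward ones, so the same matrices serve for both orientations.
R_x <- matrix(c(-1, 1, 0.5, -0.5), nrow = 2, byrow = TRUE)
R_y <- matrix(c(-2, 2, 1.0, -1.0), nrow = 2, byrow = TRUE)
R_axes <- list(R_x, R_y)

R_diag <- interp_rates(c(1, 1), R_axes, R_axes)
print(R_diag)
```

Along an axial direction the interpolation returns the corresponding axial matrix, which is why only the axial estimates need to be stored.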
Prediction and simulation based on transition probabilities

Several methods were developed to predict or simulate the category in an unobserved location $s_0$ given the knowledge of the sample positions $s_1, \ldots, s_n$. The conditional probability,

$$\Pr\left( Z(s_0) = z_i \,\middle|\, \bigcap_{l=1}^{n} Z(s_l) = z(s_l) \right),$$

is used to predict or simulate the category in $s_0$, where $z_i$ represents the $i$-th category and $z(s_l)$ is the observed category in the $l$-th sample location. Usually such a probability is approximated through
• Kriging- and Cokriging-based methods, implemented in the sim_ik() and sim_ck() functions.
• Fixed or random path algorithms, available in sim_path().
• Multinomial categorical simulation procedure, sim_mcs() function.

The approximation proposed by Carle and Fogg (1996) is implemented in the functions sim_ik() and sim_ck(). Both of them use some variant of the following

$$\Pr\left( Z(s_0) = z_j \,\middle|\, \bigcap_{l=1}^{n} Z(s_l) = z(s_l) \right) \approx \sum_{l=1}^{n} \sum_{i=1}^{K} w_{ij,l}\, c_{il},$$

where

$$c_{il} = \begin{cases} 1 & \text{if } z(s_l) = z_i, \\ 0 & \text{otherwise,} \end{cases}$$

and the weights $w_{ij,l}$ are calculated by solving the following system of linear equations:

$$\begin{bmatrix} T(s_1 - s_1) & \cdots & T(s_n - s_1) \\ \vdots & \ddots & \vdots \\ T(s_1 - s_n) & \cdots & T(s_n - s_n) \end{bmatrix} \begin{bmatrix} W_1 \\ \vdots \\ W_n \end{bmatrix} = \begin{bmatrix} T(s_0 - s_1) \\ \vdots \\ T(s_0 - s_n) \end{bmatrix},$$

where

$$W_l = \begin{bmatrix} w_{11,l} & \cdots & w_{1K,l} \\ \vdots & \ddots & \vdots \\ w_{K1,l} & \cdots & w_{KK,l} \end{bmatrix}.$$

This approximation does not satisfy the probability axioms, because such probabilities might lie outside the interval [0, 1] and it is not ensured that they sum up to one. To solve the former problem truncation is considered, but the usual normalisation is not adopted to solve the latter; in fact, after the truncation, these probabilities might also sum up to zero instead of one. The implemented stabilisation algorithm translates the probabilities with respect to the minimum computed for that point. Then, the probabilities are normalised as usual.

To improve the computational efficiency of the algorithm, the m-nearest neighbours are considered in the system of equations instead of all sample points; in so doing, a decrease in computing time is noted and the allocated memory is drastically reduced to a feasible quantity.

For the approximation adopted in the sim_path() function, conditional independence is assumed in order to approximate the conditional probability as in the analysis of a Pickard random field (Pickard, 1980). This method, as described in Li (2007b), considers $m$ known neighbours in the axial directions, so that the probability is computed as

$$\Pr\left( Z(s_0) = z_i \,\middle|\, \bigcap_{l=1}^{n} Z(s_l) = z(s_l) \right) \approx \Pr\left( Z(s_0) = z_i \,\middle|\, \bigcap_{l=1}^{m} Z(s_l) = z_{k_l} \right) \propto t_{k_1 i}(s_0 - s_1) \prod_{l=2}^{m} t_{i k_l}(s_0 - s_l).$$

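The path-based approximation above can be sketched in a few lines of base R. The function `path_prob()` is a hypothetical helper, not the interface of sim_path(): it assumes the transition probability matrices $T(s_0 - s_l)$ have already been evaluated for each of the $m$ neighbours and are passed as a list, and the two matrices below are invented for illustration.

```r
# Path-based approximation of the conditional probability at s_0.
# T_list[[l]] : K x K transition probability matrix for the lag s_0 - s_l
# k[l]        : index of the category observed at the l-th neighbour
path_prob <- function(T_list, k) {
  w <- T_list[[1]][k[1], ]                 # t_{k_1 i}, for i = 1, ..., K
  for (l in seq_along(T_list)[-1]) {
    w <- w * T_list[[l]][, k[l]]           # t_{i k_l}, for i = 1, ..., K
  }
  w / sum(w)                               # normalise the proportional terms
}

# Two categories, two neighbours (illustrative stochastic matrices).
T1 <- matrix(c(0.8, 0.2,
               0.4, 0.6), nrow = 2, byrow = TRUE)
T2 <- matrix(c(0.7, 0.3,
               0.6, 0.4), nrow = 2, byrow = TRUE)

prob <- path_prob(list(T1, T2), k = c(1, 2))
print(prob)
```

With these inputs the unnormalised terms are 0.8 × 0.3 and 0.2 × 0.4, so the first category receives three quarters of the probability mass.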
The method proposed by Allard et al. (2011) is implemented in the sim_mcs() function. It was introduced to improve the computational efficiency of the Bayesian maximum entropy approach
proposed by Bogaert (2002). Here, the approximation of the conditional probability is

$$\Pr\left( Z(s_0) = z_i \,\middle|\, \bigcap_{l=1}^{n} Z(s_l) = z(s_l) \right) \approx \frac{p_i \prod_{l=1}^{n} t_{i k_l}(s_0 - s_l)}{\sum_{i=1}^{K} p_i \prod_{l=1}^{n} t_{i k_l}(s_0 - s_l)}.$$

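A minimal base R sketch of this multinomial categorical approximation is given below. The helper `mcs_prob()` is hypothetical and does not reproduce the interface of sim_mcs(); it assumes that the matrices $T(s_0 - s_l)$ are precomputed and supplied as a list, and the proportions and matrices are made-up values.

```r
# Multinomial categorical simulation (MCS) approximation at s_0.
# p           : category proportions p_i
# T_list[[l]] : K x K transition probability matrix for the lag s_0 - s_l
# k[l]        : index of the category observed at the l-th data point
mcs_prob <- function(p, T_list, k) {
  w <- p
  for (l in seq_along(T_list)) {
    w <- w * T_list[[l]][, k[l]]           # multiply by t_{i k_l}
  }
  w / sum(w)                               # denominator of the approximation
}

# Two categories, two data points (illustrative values).
p  <- c(0.6, 0.4)
T1 <- matrix(c(0.8, 0.2,
               0.4, 0.6), nrow = 2, byrow = TRUE)
T2 <- matrix(c(0.7, 0.3,
               0.6, 0.4), nrow = 2, byrow = TRUE)

prob <- mcs_prob(p, list(T1, T2), k = c(1, 2))
print(prob)
```

Restricting `T_list` and `k` to the nearest neighbours of $s_0$ gives the reduced version of the same computation.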
Also in this case, the user can choose to apply this approximation by considering all data or only the m-nearest neighbours, with the same advantages described above.

Once the conditional probabilities are computed, the prediction is given by the most probable category, while the simulation is given by randomly selecting one category according to the computed probabilities. After the first simulation is drawn, the sim_ik() and sim_ck() functions execute an optimisation phase in order to avoid “artifact discontinuities”, wherein the simulated patterns do not collimate with the conditioning data (Carle, 1997). The user can then choose to perform simulated annealing or apply a genetic algorithm in order to reduce the quantity

$$\sum_{i=1}^{K} \sum_{j=1}^{K} \left( r_{ij,\mathrm{SIM}} - r_{ij,\mathrm{MOD}} \right)^{2} + \sum_{i=1}^{K} \left( p_{i,\mathrm{SIM}} - p_{i,\mathrm{MOD}} \right)^{2},$$

where $r_{ij,\mathrm{SIM}}$ and $p_{i,\mathrm{SIM}}$ are coefficients estimated from the pattern to optimise, while $r_{ij,\mathrm{MOD}}$ and $p_{i,\mathrm{MOD}}$ are those used to generate the initial simulation. Other comparison methods are also available through the argument optype.
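The quantity minimised in the optimisation phase is straightforward to write down in base R. The function name `sim_objective()` and the numerical values are hypothetical; the sketch only evaluates the criterion and does not implement the simulated annealing or genetic algorithm themselves.

```r
# Squared discrepancy between coefficients estimated from a simulated
# pattern (SIM) and those of the generating model (MOD).
sim_objective <- function(R_sim, R_mod, p_sim, p_mod) {
  sum((R_sim - R_mod)^2) + sum((p_sim - p_mod)^2)
}

# Illustrative two-category values.
R_mod <- matrix(c(-1.0, 1.0, 0.5, -0.5), nrow = 2, byrow = TRUE)
R_sim <- matrix(c(-1.2, 1.2, 0.5, -0.5), nrow = 2, byrow = TRUE)
p_mod <- c(1 / 3, 2 / 3)
p_sim <- c(0.4, 0.6)

obj <- sim_objective(R_sim, R_mod, p_sim, p_mod)
print(obj)
```

The criterion is zero only when the simulated pattern reproduces both the rates and the proportions of the model exactly.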
An example with the Jura data set

The data set consists of spatial coordinates and the observed values of both categorical and continuous random variables collected at 359 locations in the Jura region in Switzerland. In particular, we will deal with the rock-type categorical variable of the geological map created by Goovaerts (1997, see Figure 4), which consists of 5957 sites. The aim of these analyses is related to the parameters estimation of the model in (1), and its interpretation through graphical methods. These analyses are useful to check the model assumptions and to ensure the accuracy of the predictions. First, the spMC package and the Jura data set in the gstat package are loaded as follows:

    library(spMC)
    data(jura, package = "gstat")

If the package is compiled with the OpenMP API, the number of CPU cores to use can be set by

    setCores(4)

otherwise a message will be displayed and only one core can be internally used by the spMC package.

In order to study the spatial variability of the data and interpret the transitions from a geometrical viewpoint, the empirical transition probabilities along the main axes are calculated. These probabilities point out the persistence of a category according to the lag between two points. They also provide juxtapositional and asymmetrical features of the process, which are not detected by adopting indicator cross-variograms (Carle and Fogg, 1996). Therefore, all pairs of points along axial directions are chosen such that their lag-length is less than three. Afterwards, we calculate the empirical transition probabilities for twenty points within the maximum distance. This can be conducted with the execution of the following code:

data