Computational Statistics and Data Analysis 53 (2009) 2781–2785
Contents lists available at ScienceDirect
Computational Statistics and Data Analysis journal homepage: www.elsevier.com/locate/csda
Editorial
Spatial statistics: Methods, models & computation
James LeSage a, Sudipto Banerjee b, Manfred M. Fischer c, Peter Congdon d,e,∗
a Department of Finance & Economics, Texas State University, United States
b School of Public Health, University of Minnesota, United States
c Institute for Economic Geography and GIScience, Vienna University of Economics and Business Administration, Austria
d Centre for Statistics, Queen Mary, University of London, United Kingdom
e Department of Geography, Queen Mary, University of London, United Kingdom
Article info
Article history: Available online 14 November 2008
This is the first Special Issue of Computational Statistics & Data Analysis devoted to the topic of spatial statistics. There have been substantial advances in the analysis of spatial data in the last decade or two, and many of these are linked to the use of computationally intensive techniques for modeling and statistical inference, making a special issue of CSDA devoted to spatial statistics appropriate. Spatial applications in econometrics, environmental science and epidemiology have also benefited from new modeling perspectives, whether using classical or Bayesian approaches to draw inferences. Related advances have occurred in spatio-temporal statistics and statistical image analysis.

A common theme in the use of spatial statistical methods is the need to relax the simplifying assumption of independence among observations, which is seldom realistic when dealing with sample data collected with reference to points or regions located in space. We replace the independence assumption with others that provide a framework for addressing dependence relations between observations. The reader will see a number of alternative strategies taken to accomplish this, based on varying assumptions and application frameworks. The role of computationally intensive techniques should become clear when modeling dependent data relationships, since this situation leads to a potential for each observation to depend on all others.

Measuring spatial dependence and clustering

The first group of papers has a common theme in seeking to model or measure spatial dependence without introducing a predictor–response model. The papers by Ceyhan, by Wu and Li, by Assuncao and Correa and by Hossain and Lawson address issues in multivariate spatial and spatio-temporal point pattern analysis, while the papers by Cucala and by Zhang and Lin both consider spatial scan statistics. Finally, Bivand et al. consider the distribution of spatial correlation measures.
In his paper, Ceyhan (2009) considers the problem of multivariate interaction between two or more classes (or species) and resulting spatial segregation or association patterns that can be studied using a nearest neighbor contingency table (NNCT). New versions of overall and cell-specific tests based on NNCTs are introduced and compared with Dixon’s overall and cell-specific tests and other spatial clustering methods. Overall tests are used to detect any deviation from the null case, while the cell-specific tests are post hoc pairwise spatial interaction tests that are applied when the overall test yields a significant result. The distributional properties of these tests are analyzed and their finite sample performance assessed
∗ Corresponding editor at: Department of Geography, Queen Mary, University of London, United Kingdom. Tel.: +44 207 882 7760; fax: +44 208 981 6276. E-mail addresses:
[email protected] (J. LeSage),
[email protected] (S. Banerjee),
[email protected] (M.M. Fischer),
[email protected] (P. Congdon). 0167-9473/$ – see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2008.11.001
by means of an extensive Monte Carlo simulation study. Furthermore, it is shown that the new NNCT tests have better performance in terms of type I error and power. The methods are applied to two real-life data sets.

Wu and Li (2009) argue that conventional summary statistics for multivariate spatial point pattern data can only detect the dependence between two types of points, and cannot be used to detect the dependence among three types of points. Their paper proposes new summary statistics for detecting the presence of the kth type of point and its influence on the relationship between the ith and the jth types of points, regardless of the direction and presence or otherwise of correlation between the ith and jth point types. Such statistics can also be used to infer information about the type of correlation and the range of interaction in multivariate point patterns. Particular attention is paid to edge effects, with a simulation and applied example used to illustrate the proposed methodologies.

Assuncao and Correa (2009) propose a space-time surveillance system for the quick detection of emerging space-time clusters in a set of point process events. The system takes into account purely spatial and purely temporal heterogeneity, and uses the Shiryayev–Roberts control chart method based on martingales and tuning parameters rather than type I error probabilities. The performance of the system is evaluated by means of a Monte Carlo study and illustrated in real-world contexts.

Hossain and Lawson (2009) review a range of exact and approximate methods for estimating spatial point process models that are commonly used in spatial epidemiology applications. Approximate methods are based on discretization of the study window, while exact methods include a marked point process model, i.e. the conditional logistic model.
They find that the discretized methods perform relatively well in explaining spatial separation, while in cases involving spatial heterogeneity the Poisson model (or the log-Gaussian Cox process model) and a discretized window produce much better estimates.

Cucala (2009) tackles the issue of cluster identification in spatial point processes, proposing a dimensionality reduction that transforms spatially located points to ordered events with the spacing between events represented as areas. The transformation draws on earlier work by Dematteï et al. (2007) on spatial cluster detection for event data. Using the null distribution of area spacings, cluster detection in the two-dimensional space can be tackled using one-dimensional temporal cluster detection methods such as scan statistics that rely on likelihood ratio tests. The argument is that arbitrarily shaped spatial clusters can be detected using the method without resorting to the conventional family of elliptic windows with predetermined shape, angle and center. Empirical tests suggest that the method has power against arbitrarily shaped cluster alternatives. This would be important in public health contexts where the goal is to identify candidate clusters of locations containing high disease rates that should be further explored.

A second work on spatial scan statistics is by Zhang and Lin (2009). They point to different philosophies regarding the use of clustering methods for Poisson event variables of interest. Economists, ecologists and regional scientists tend toward the use of model covariates to first filter variables of interest, with subsequent application of clustering methods to the model residuals. There is typically only a single (dependent) variable of interest that arises from some underlying theory. Geographers and epidemiologists prefer direct application of clustering methods to all variables of interest to identify important spatial clusters which are then the subject of the modeling exercise.
Here there are frequently a number of covariates of interest rather than a single dependent variable. The resolution proposed by Zhang and Lin is to fit the covariates and cluster components of the model at the same time, introducing a competition between these to explain variation in the variable of interest. While this is likely to displease both camps noted above, the authors make a case for some advantages to their approach. For example, random effects or spatially structured variants of these can be easily incorporated in the model to accommodate overdispersion. Advocates of a modeling approach that proceeds sequentially rather than the simultaneous method proposed here might object that spatially structured effects should not need to compete with spatial cluster detection.

The work by Bivand et al. (2009) examines problems that may arise from using a normal approximation to the exact distribution for both the global and local Moran’s I statistic, frequently used for empirically testing the strength of spatial dependence. In the case of local or sub-sample-based Moran’s I statistics, the small (sub-)sample size is likely to create problems for the normal approximation. They argue that there is no need for the normal approximation since the numerically demanding exact distribution can be evaluated for reasonably large problems using some results that relate the local Moran statistic through Imhof’s formula and today’s computational software and hardware. An implementation of their method in the R open source computing environment is discussed. Their software implementation demonstrates how power analysis of statistical tests related to alternative spatial patterns can be simplified in practice. This is important since the distribution of the test statistic under the alternative hypothesis of spatial dependence varies across alternative spatial processes.
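For readers unfamiliar with the statistic discussed above, the following is a minimal sketch of the global Moran's I on a small lattice. It is not the exact-distribution method of Bivand et al. (which relies on Imhof's formula and is implemented in R); as a stand-in it uses a simple permutation reference distribution. The function names `rook_weights`, `morans_i` and `permutation_p_value` are hypothetical and introduced only for this illustration.

```python
import random


def rook_weights(nrow, ncol):
    """Binary rook-contiguity weights for an nrow x ncol lattice (row-major)."""
    n = nrow * ncol
    w = [[0.0] * n for _ in range(n)]
    for r in range(nrow):
        for c in range(ncol):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrow and 0 <= cc < ncol:
                    w[r * ncol + c][rr * ncol + cc] = 1.0
    return w


def morans_i(values, weights):
    """Global Moran's I = (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(weights[i][j] * z[i] * z[j]
                for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in z)


def permutation_p_value(values, weights, n_perm=499, seed=0):
    """One-sided p-value for positive dependence from random relabelling.

    Assumption: a permutation reference distribution is used here in place
    of both the normal approximation and the exact distribution.
    """
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, weights) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)


if __name__ == "__main__":
    w = rook_weights(4, 4)
    # A north-south gradient: neighbouring cells carry similar values.
    gradient = [float(r) for r in range(4) for _ in range(4)]
    stat, p = permutation_p_value(gradient, w)
    print(f"Moran's I = {stat:.3f}, permutation p-value = {p:.3f}")
```

With only 16 cells, as in this toy lattice, the reference distribution of the statistic can be far from normal, which is the small-sample concern the paragraph above raises.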
Multivariate spatial regression

A second broad grouping of papers is linked by a concern with regression in spatial and spatio-temporal contexts. The contribution by Finley et al. (2009) tackles situations where the number of sample observations is too large to allow use of hierarchical spatial random effects models with Markov chain Monte Carlo methods. The collection of large data sets in many scientific fields, together with the advent of space-time data sets that contain multiple observations for each location over time, has increased the number of situations where this problem arises. A bias-reducing modification to a previously proposed dimensionality reduction scheme that relies on a predictive process derived from ideas related to kriging is set forth here. The authors describe an algorithm for approximately optimal spatial design of knot locations used in the predictive process, which essentially eliminates the need for concern about the impact of knot design on modeling outcomes. Illustrations of multivariate spatial regressions that use the modified predictive process are provided using both simulated and real data
sets. The latter represent forest inventory data from the USDA Forest Service Bartlett Experimental Forest coupled with Landsat imagery and other variables in an application that predicts forest biomass by tree species.

The work by Bolin et al. (2009) examines spatio-temporal changes in land surface useful for monitoring trends in vegetation cover. The Bayesian hierarchical model allows for spatially varying coefficients, and the authors argue for the use of an EM algorithm as a computationally efficient approach to estimation. Their prior on regression coefficients assumes dependence between underlying observation fields and restricts the time evolution of the pixel fields to lie in the space spanned by the regression functions. This induces a spatial prior on the regression coefficients which seems appropriate to modeling land surface changes. A novel aspect of the method is the way in which the authors exploit the cyclic property of the trace operator in conjunction with the Cholesky factorization to deal with typically difficult variance–covariance matrix calculations. They use the African Sahel, a semi-arid region, to illustrate their method.

The paper by Baltagi et al. (2009) shifts attention to a panel data regression model with heteroscedasticity in the individual effects and spatial correlation in the error structure. They show that ignoring spatial dependence when testing for heteroscedasticity and, vice versa, ignoring heteroscedasticity when testing for spatial correlation produces misleading results. The authors derive joint and conditional LM tests for heteroscedasticity and spatial correlation, and study their performance using Monte Carlo experiments.

Haining et al. (2009) use existing methodology for discrete valued spatial data and a crime data set to compare competing small-area analyses, in the presence of overdispersion and spatial autocorrelation.
The empirical comparison of several smoothing methods offers an indication of what might work successfully on other data sets.

The papers by Ugarte et al. (2009), MacNab and Lin (2009) and Lee and Durbán (2009) develop different methods for empirical Bayes (EB) analysis of generalized linear mixed models for disease counts (Breslow and Clayton, 1993). Ugarte et al. (2009) consider how to reconcile EB methods to smooth incidence maps with detection of true high risk areas. They propose second-order correct estimators for the MSE of the log-relative risk predictor to build confidence intervals of relative risks and thereby detect areas with elevated risks. The performance of this EB procedure is investigated in a simulation study using the geographical structure of the Scottish lip cancer data. Fully Bayesian credible intervals and decision rules based on the posterior distribution of the relative risks are also investigated. These are found to be more powerful in detecting elevated risk than EB confidence intervals. However, they conclude that performance of different decision rules may depend on the spatial configuration and the data, and it is difficult to define a global criterion that can be routinely applied in every applied setting.

MacNab and Lin (2009) present an empirical Bayes PQL (EBPQL) procedure for approximating posterior point and interval predictions of GLMM random effects, and assess these using prediction uncertainty attributions with respect to the random effects, the fixed effects, and the prior parameters. They present a Monte Carlo assessment of EBPQL point and interval estimates for the random effects, fixed effects, and prior parameters in univariate and bivariate (shared component) disease mapping models. They conclude that uncertainty about the random effects prior parameters may have only minor influence on EBPQL random effect prediction and inference in situations involving Bayesian ecological modeling.
Specifically, they assess cases involving samples of moderate Poisson observations and attribute less than 5% of the prediction error to uncertainty about the prior parameters. A corresponding finding is that a gain of 1% or less arose in the coverage rates when prior uncertainty was accounted for in the EBPQL estimates of the random effect prediction errors.

Spatial random effects often play the role of smoothers and can be modeled as smoothing splines. Lee and Durbán (2009) develop spatial models employing penalized splines (P-splines) in addition to individual random effects for the analysis of spatial count data. This yields a unified framework for estimation using generalized linear mixed models. The authors propose modeling spatial variation as a two-dimensional P-spline at the centroids of regions. Additional random effects are introduced to account for individual variation among regions. This approach helps separate the large-scale geographical trend from the local spatial correlation. The methodology proposed is applied to the analysis of lip cancer incidence rates in Scotland.

Spatial computational methods

Likelihood-based methods for modeling multivariate Gaussian spatial data have desirable statistical characteristics, but the practicability of these methods for treating massive georeferenced data sets is often questioned. Smirnov and Anselin (2009) present an O(N) parallel method of computing the log-determinant that arises in the log-likelihood function for models involving spatial interaction on a lattice. The major apparatus that they employ is the well known method of eigenvalue moments. In particular, for reasonable parameters of the model under study one is able to utilize a Taylor series expansion of the logarithm function. The paper proposes that for the range of the free model parameters, one can safely ignore higher order eigenvalue moments, essentially truncating the Taylor series.
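The truncation idea can be illustrated with the standard identity ln det(I − ρW) = −Σ_{k≥1} ρ^k tr(W^k)/k, which holds when the spectral radius of ρW is below one; the traces tr(W^k) are the (unnormalized) eigenvalue moments. The sketch below is a dense, small-lattice illustration only and is not the O(N) parallel algorithm of Smirnov and Anselin. The row-standardized rook weight matrix, the value ρ = 0.5 and all function names are assumptions made for this example.

```python
import math


def row_standardized_rook(nrow, ncol):
    """Rook-contiguity weights on a lattice, each row scaled to sum to one."""
    n = nrow * ncol
    w = [[0.0] * n for _ in range(n)]
    for r in range(nrow):
        for c in range(ncol):
            nbrs = [(r + dr, c + dc)
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= r + dr < nrow and 0 <= c + dc < ncol]
            for rr, cc in nbrs:
                w[r * ncol + c][rr * ncol + cc] = 1.0 / len(nbrs)
    return w


def matmul(a, b):
    """Dense matrix product of two square lists-of-lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def logdet_taylor(w, rho, order):
    """Truncated series: -sum_{k=1..order} rho^k * tr(W^k) / k."""
    n = len(w)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = 0.0
    for k in range(1, order + 1):
        power = matmul(power, w)  # power now holds W^k
        total -= rho ** k * sum(power[i][i] for i in range(n)) / k
    return total


def logdet_exact(a):
    """log|det(a)| by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in a]
    n = len(m)
    total = 0.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        total += math.log(abs(m[col][col]))
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= f * m[col][c]
    return total


def i_minus_rho_w(w, rho):
    """Build the matrix I - rho * W."""
    n = len(w)
    return [[(1.0 if i == j else 0.0) - rho * w[i][j] for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    w, rho = row_standardized_rook(4, 4), 0.5
    exact = logdet_exact(i_minus_rho_w(w, rho))
    for order in (2, 4, 8, 16):
        approx = logdet_taylor(w, rho, order)
        print(order, approx, abs(approx - exact))
```

Because the terms shrink like ρ^k/k, a handful of moments already gives a close approximation for moderate ρ, which is the sense in which higher-order moments can be ignored.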
Bayesian hierarchical models

As spatially referenced data sets become more prevalent in scientific applications, the desire for full inference and accurate assessment of uncertainty has become increasingly important. Bayesian hierarchical models may be proposed for capturing different sources of uncertainty accurately, yet present several computational challenges. The special issue has a number of articles that illustrate computationally feasible hierarchical models for analyzing challenging data sets.
Hierarchical spatial models are being increasingly employed to explore and understand relationships between environmental pollutants and different health hazards. The article by Choi et al. (2009) develops a statistical hierarchical Bayesian framework for studying the spatio-temporal associations between daily mortality and exposure to daily fine particulate matter, PM2.5, while accounting for different sources of uncertainty. The hierarchical setup enables them to combine information from monitoring data and an air quality model (CMAQ) at different spatial and temporal scales. A spatio-temporal generalized Poisson regression model enables them to examine the spatio-temporal relationships between health end-points and exposures to PM2.5 while accounting for epidemiological confounders. The methods are illustrated with a data set for North Carolina during the year 2001.

In another interesting health-related application of hierarchical spatial modeling, Hatfield et al. (2009) explore the mechanisms relating exposure to ultraviolet (UV) radiation and elevated risk of skin cancer. They develop hierarchical spatial logistic models based on a sample cohort of x-ray technologists to better estimate UV exposure and explore the temporal pattern of UV exposure and its gradient. An especially attractive feature here is that the models can be estimated using standard software for hierarchical models such as WinBUGS.

Hierarchical modeling approaches, while attractive, can often be computationally expensive. In many settings, investigators wish to understand what gains, if any, accrue from these models. Kang et al. (2009) offer a very educational investigation into hierarchical and non-hierarchical regression models for spatially contiguous small areas with independent and spatially dependent error terms.
Spatial dependence is introduced according to which areas are neighbors, but care is taken to account for extra components of variability due to measurement error, which a careful statistical analysis should filter out. The authors develop a computationally feasible methodology to estimate the measurement-error variance, to look for spatial outliers, and to determine which models are appropriate for accurate prediction. A small-area data set of doctors’ prescription amounts per consultation is fitted to the different kinds of models and used to illustrate their spatial methodology.

The three papers by White and Ghosh (2009), by Congdon (2009), and by Wall and Liu (2009) consider special applications and extensions of the popular conditionally autoregressive (CAR) model for lattice data, implemented using fully Bayesian methods. White and Ghosh note that applications of the CAR model usually assume neighborhoods formed deterministically using area inter-distances or boundaries. They propose instead that selection of the neighborhood depends on unknown parameter(s), leading to a Stochastic Neighborhood CAR (SNCAR) model. The resulting model shows flexibility in accurately estimating covariance structures for data generated from a variety of spatial covariance models. From an extensive simulation study they conclude that the SNCAR model provides a better fit than the CAR or exponential models when the models are misspecified as exponential and CAR, respectively. A real data application involves radioactive contamination of the soil in Switzerland after the Chernobyl accident.

The papers by Congdon (2009) and by Wall and Liu (2009) both consider multivariate reduction of spatial data. Congdon (2009) proposes a factor analytic model for the impact of spatially defined latent social constructs on area health outcomes, involving two sub-models. The first is a measurement model using socioeconomic variables (e.g.
from population censuses) as indicators of latent social constructs. The other sub-model considers variation in spatial health outcomes in terms of both latent social constructs and residual common factors, where the latter have only the health variation component as their measurement model. The two sets of latent variables (social indicator based and residual) can be mutually correlated and latent scores can be correlated over areas. However, the extent of spatial dependence in the scores for any particular latent variable is determined by the data, so a prior assumption of exclusively spatial dependence is avoided. A case study considers the impact of two latent social constructs (denoted as social deprivation and social fragmentation) on four types of psychiatric hospitalization in the local authorities of London, England.

Wall and Liu (2009) mention that the conventional latent class analysis (LCA) model for multivariate binary data assumes that within each cluster the responses are independent, whereas this may not be appropriate for multiple indicators collected repeatedly over time, or over different spatial locations. Wall and Liu introduce a hierarchical spatial LCA model in which the first part defines the relationship between the observed indicators and the latent classes, and the second part of the model is a multinomial probit involving random variables (a multivariate CAR spatial process) underlying the latent classes. Identifiability restrictions on the covariance matrix of the spatial process are considered. The motivating data are indicators at 97 locations of whether the soil levels of eight different heavy metals exceed a legal threshold. The primary goal is to find out which heavy metals share a pollution source and which of the locations are more polluted than others, and a three-class SLCA model is used to that end.
Spatial allocation and decision models

The central idea in Bonneu and Thomas-Agnan (2009) is to analyze optimal spatial location of new fire stations in their application in a truly stochastic way by merging the optimization problem with a scheme for sampling from possible spatial demand patterns. This leads to a spatial distribution of optimal locations associated with alternative realized demand patterns that arise from the sampling process. The distribution of optimal locations is more informative about the stochastic nature of the problem at hand and of course amenable to statistical analysis. They also argue their approach has some advantages in terms of dimensionality reduction, allowing the optimization problem to be tackled in the smaller dimension spatial model parameter space rather than the larger dimension sample space.
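The general idea of pairing an optimizer with sampled demand patterns can be sketched as follows. This is a toy illustration under loudly stated assumptions and is not the authors' model: demand is simulated from a hypothetical mixture (a Gaussian cluster plus uniform background) on the unit square, the candidate sites are a fixed hypothetical list, and the "optimization" is a brute-force choice of the single site minimizing total distance to demand.

```python
import math
import random
from collections import Counter


def sample_demand(rng, n_points, cluster_share=0.75):
    """Draw one demand pattern on the unit square.

    Assumption: a share of points falls in a Gaussian cluster centred at
    (0.3, 0.3) with standard deviation 0.1; the rest are uniform.
    """
    points = []
    for _ in range(n_points):
        if rng.random() < cluster_share:
            x = min(max(rng.gauss(0.3, 0.1), 0.0), 1.0)
            y = min(max(rng.gauss(0.3, 0.1), 0.0), 1.0)
        else:
            x, y = rng.random(), rng.random()
        points.append((x, y))
    return points


def best_site(candidates, demand):
    """Candidate site with the smallest total Euclidean distance to demand."""
    return min(
        candidates,
        key=lambda s: sum(math.hypot(s[0] - p[0], s[1] - p[1]) for p in demand),
    )


def optimal_site_distribution(candidates, n_patterns=200, n_points=50, seed=1):
    """Relative frequency with which each site is optimal across patterns."""
    rng = random.Random(seed)
    counts = Counter(
        best_site(candidates, sample_demand(rng, n_points))
        for _ in range(n_patterns)
    )
    return {site: count / n_patterns for site, count in counts.items()}


if __name__ == "__main__":
    sites = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75), (0.5, 0.5)]
    for site, freq in sorted(optimal_site_distribution(sites).items()):
        print(site, round(freq, 3))
```

The output is a distribution over sites rather than a single answer, so the uncertainty in the optimal location that comes from random demand can be summarized and analyzed directly.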
In many environmental and ecological application contexts, the data available are a sample of some regionalized variables that may be modeled as random fields with spatial dependence. When the sampling scheme is very irregular, a direct application of supervised classification algorithms leads to biased discriminant rules. The paper by Bel et al. (2009) suggests modifications of the classical CART (Classification And Regression Trees) algorithm when the sampling pattern is very irregular, in particular in the presence of clusters. Multiple examples not only indicate the types of potential application, but also build support for the proposed methods.

There has been recent interest in capturing spatial variation while analyzing large data sets from microarray experiments. Here hierarchical models may prove to be computationally infeasible. The article by Wu et al. (2009) considers an interesting application that uses spatial information in the analysis of protein microarray data. They formulate a classical hypothesis testing framework for identifying regions of differential expression on the protein microarray while accounting for spatial information for the two-dimensional antibody microarray. Borrowing strength from the spatial neighbors enables detection of the differentially expressed regions with significantly higher power than would have been possible without harnessing the spatial information.

References

Assuncao, R., Correa, T., 2009. Surveillance to detect emerging space-time clusters. Computational Statistics and Data Analysis 53 (8), 2817–2830.
Baltagi, B., Song, S., Kwon, J., 2009. Testing for heteroskedasticity and spatial correlation in a random effects panel data model. Computational Statistics and Data Analysis 53 (8), 2897–2922.
Bel, L., Allard, D., Laurent, J.M., Cheddadi, R., Bar-Hen, A., 2009. CART algorithm for spatial data: Application to environmental and ecological data. Computational Statistics and Data Analysis 53 (8), 3082–3093.
Bivand, R., Müller, W., Reder, M., 2009. Power calculations for global and local Moran’s I. Computational Statistics and Data Analysis 53 (8), 2859–2872.
Bolin, D., Lindström, J., Eklundh, L., Lindgren, F., 2009. Fast estimation of spatially dependent temporal vegetation trends using Gaussian Markov random fields. Computational Statistics and Data Analysis 53 (8), 2885–2896.
Bonneu, F., Thomas-Agnan, C., 2009. Spatial point process models for location–allocation problems. Computational Statistics and Data Analysis 53 (8), 3070–3081.
Breslow, N., Clayton, D., 1993. Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88, 9–25.
Ceyhan, E., 2009. Overall and pairwise segregation tests based on nearest neighbor contingency tables. Computational Statistics and Data Analysis 53 (8), 2786–2808.
Choi, J., Fuentes, M., Reich, B., 2009. Spatial–temporal association between fine particulate matter and daily mortality. Computational Statistics and Data Analysis 53 (8), 2989–3000.
Congdon, P., 2009. Modelling the impact of socioeconomic structure on spatial health outcomes. Computational Statistics and Data Analysis 53 (8), 3047–3056.
Cucala, L., 2009. A flexible spatial scan test for case event data. Computational Statistics and Data Analysis 53 (8), 2843–2850.
Dematteï, C., Molinari, N., Daurès, J.P., 2007. Arbitrarily shaped multiple spatial cluster detection for case event data. Computational Statistics and Data Analysis 51, 3931–3945.
Finley, A., Sang, H., Banerjee, S., Gelfand, A., 2009. Improving the performance of predictive process modeling for large datasets. Computational Statistics and Data Analysis 53 (8), 2873–2884.
Haining, R., Law, J., Griffith, D., 2009. Modelling small area counts in the presence of overdispersion and spatial autocorrelation. Computational Statistics and Data Analysis 53 (8), 2923–2937.
Hatfield, L., Hoffbeck, R., Alexander, B., Carlin, B., 2009. Spatiotemporal and spatial threshold models for relating UV exposures and skin cancer in the central United States. Computational Statistics and Data Analysis 53 (8), 3001–3015.
Hossain, M., Lawson, A., 2009. Approximate methods in Bayesian point process spatial models. Computational Statistics and Data Analysis 53 (8), 2831–2842.
Kang, E., Liu, D., Cressie, N., 2009. Statistical analysis of small-area data based on independence, spatial, non-hierarchical, and hierarchical models. Computational Statistics and Data Analysis 53 (8), 3016–3032.
Lee, D.-J., Durbán, M., 2009. Smooth-CAR mixed models for spatial count data. Computational Statistics and Data Analysis 53 (8), 2968–2979.
MacNab, Y., Lin, Y., 2009. On empirical Bayes penalized quasi-likelihood inference in GLMMs and in Bayesian disease mapping and ecological modeling. Computational Statistics and Data Analysis 53 (8), 2950–2967.
Smirnov, O., Anselin, L., 2009. An O(N) parallel method of computing the log-Jacobian of the variable transformation for models with spatial interaction on a lattice. Computational Statistics and Data Analysis 53 (8), 2980–2988.
Ugarte, M., Goicoa, T., Militino, A., 2009. Empirical Bayes and Fully Bayes procedures to detect high-risk areas in disease mapping. Computational Statistics and Data Analysis 53 (8), 2938–2949.
Wall, M., Liu, X., 2009. Spatial latent class analysis model for spatially distributed multivariate binary data. Computational Statistics and Data Analysis 53 (8), 3057–3069.
White, G., Ghosh, S., 2009. A stochastic neighborhood conditional autoregressive model for spatial data. Computational Statistics and Data Analysis 53 (8), 3033–3046.
Wu, L.-C., Li, H.-Q., 2009. Summary statistics for measuring the relationship among three types of points in multivariate point patterns. Computational Statistics and Data Analysis 53 (8), 2809–2816.
Wu, J., Patwa, T., Lubman, D., Ghosh, D., 2009. Identification of differentially expressed spatial clusters using humoral response microarray data. Computational Statistics and Data Analysis 53 (8), 3094–3102.
Zhang, T., Lin, G., 2009. Spatial scan statistics in loglinear models. Computational Statistics and Data Analysis 53 (8), 2851–2858.