Int. J. Geographical Information Science, 2000, vol. 14, no. 6, 543–566
Research Article
Integrating GIS components and spatial statistical analysis in DBMSs
ZHIQIANG ZHANG and DANIEL A. GRIFFITH
Department of Geography, Syracuse University, Syracuse, NY 13244, USA
e-mail: Gri [email protected]; [email protected]
(Received 15 September 1998; accepted 23 November 1999)
Abstract. The integration of GIS and spatial statistical analysis can occur in two different ways: embedding spatial statistical analysis into a GIS environment, or embedding selected GIS functions into a spatial statistical analysis environment. Attention has been given almost exclusively to the former in the past. This paper argues that the latter deserves more attention, especially with the rapid development of embeddable GIS component technology. However, because current statistical software packages have not yet provided the environment into which GIS components can be embedded, we explore the possibility of integrating GIS components and spatial statistical analysis in a current proprietary Database Management System (DBMS: Microsoft Access), which can be easily extended to incorporate statistical analysis capabilities. A conceptual framework for this integration is outlined, an interactive spatial statistical analysis module is presented, and an example is used to illustrate the advantage of this module.
International Journal of Geographical Information Science ISSN 1365-8816 print/ISSN 1362-3087 online © 2000 Taylor & Francis Ltd http://www.tandf.co.uk/journals
1. Introduction
Much progress has been made in the last decade in improving the spatial analytical capabilities of Geographical Information Systems (GIS). Advanced spatial analytical routines, such as principal components analysis (PCA), point pattern analysis and kriging, have been incorporated into major commercial GIS such as Arc/Info and IDRISI. Despite this progress, much remains to be done in order to turn a GIS into a spatial analytical platform. Criticism of the deficiencies of GISs in spatial analytical capabilities is widespread in the literature (Goodchild 1987, Haining and Wise 1991, Fotheringham and Rogerson 1993, Anselin et al. 1993). More recently, Levine (1996) argues that neither the existing GIS programs nor statistical packages provide quantitative measures of spatial relationships such as dispersion, concentration, or spatial autocorrelation. Cressie et al. (1996) confirm that GISs have only recently begun to incorporate spatial statistical procedures into their information-processing subsystems. UCGIS (1998) identifies spatial analysis in a GIS environment as one of its top research priorities, further justifying the importance of integrating GIS with spatial statistics.
Different researchers have put forward very different opinions regarding what kinds of spatial analytical functionalities should be linked to GIS. For instance, in Goodchild’s view, an ideal GIS should include at least six classes of spatial analysis (Goodchild
1987). Openshaw (1991) recommends eight spatial analysis techniques that might be regarded as ‘GIS appropriate generic’. Haining (1994) identifies three fundamental types of spatial operations for a spatial data analysis module in GISs. Bailey (1994) lists more than ten potentially useful spatial analysis techniques.
The strategies for linking a GIS with statistical analysis are also very different in practice. Following Goodchild (1991), four strategies can be identified: (a) stand-alone spatial analysis software, (b) loose coupling of existing GIS software with statistical software, (c) close coupling of GIS software with statistical software, and (d) full integration of statistical spatial analysis in a GIS. Developing stand-alone packages, though a sensible choice in some situations, is generally not considered a good strategy because it seems a waste not to exploit either the powerful spatial data handling, displaying, querying and analysis capabilities of current GIS software, or the sophisticated statistical data analysis routines available in statistical software. Loose coupling approaches leave GIS and spatial analysis as two separate entities with an ability to transfer files in a common format between them. This is probably the oldest method that has been used in practice. It is feasible but very inefficient. Anselin and Getis (1993) even regard this as a non-solution, since it ignores the distinctive characteristics of a spatial database for use in spatial data analysis. Anticipating that GIS developers will choose not to embed spatial statistical functions in their data structure and software design architecture in the near future, researchers have been pursuing the strategies of close coupling and integration to develop their own customized spatial statistical modules.
In our opinion, the integration approach should include two totally different but equally valid solutions: (a) embedding spatial analysis functions into a GIS, or (b) embedding selected GIS functions into a spatial analytical software package. Attention in the literature has focused almost exclusively on the former so far. The reasons for this are numerous. First, GISs have a unique and complicated data structure, which is very difficult to replicate in other software packages. Second, most spatial analysis routines require the input of locational information, which is readily available in a GIS. Third, many spatial analytical functions are developed by researchers in urban and regional sciences, geosciences, geography, or environmental science; they usually have a good knowledge of GIS, and prefer to work within a GIS framework. The purpose of this paper is to explore the possibilities and advantages of the second option, an integration strategy that has been largely ignored in the past, namely embedding a limited set of GIS components into a software system that can be easily extended to incorporate spatial data analysis functions. A module is designed and a practical implementation using Microsoft Access, a proprietary Database Management System (DBMS), is presented.
Microsoft Access offers a combination of advantages over the other currently available platforms (such as the major statistical packages and DBMSs) that could be used to implement the preceding idea: (1) it provides seamless support for standard COM objects and ActiveX controls, allowing current commercial GIS components and other software components to be quickly assembled; (2) it has powerful database management and analysis capabilities, which are a crucial part of spatial data analysis; (3) it uses Visual Basic as its programming language, making it easy to execute spatial database operations, to develop spatial analytical routines, and to integrate with legacy computer code written in other languages such as C/C++ and Fortran; and, (4) it is one of the most popular software packages for handling data, making it a good starting point for disseminating spatial statistical techniques. In the process of implementing the spatial statistical module in Access, we
contribute our opinions to the debate about which spatial analysis functions should be coupled with a GIS by identifying a group of spatial statistical functions that could serve well as the bridge between GISs and classical statistical analysis systems (Griffith 1993a), especially spatial autocorrelation testing and spatial autoregression.
2. GISs and spatial statistical analysis
The argument that a GIS is deficient in spatial analysis capabilities is itself deficient, if the concept of ‘spatial analysis’ is defined to include any analytical operations that could be performed on spatial data. Such valuable techniques as surface partitioning, map overlay, buffering, network analysis, and location-allocation analysis have been part of a GIS almost since its advent, and we cannot expect a GIS to incorporate all valuable spatial analysis techniques. We should be very cautious, however, in arguing that GISs are deficient in spatial statistical analysis capabilities, too, if the definition of ‘spatial statistical analysis’ covers all the statistical procedures and functions suitable for analysing spatial data. In fact, since the late 1980s, the statistical aspect of spatial analysis has received increasing attention from the GIS community. The release of IDRISI for the first time brought statistical functionalities into a raster-based GIS, a situation in which statistical functions are relatively easy to implement because of the rather simple topological structure of raster data. The statistical analysis module of IDRISI (Eastman 1992) offers not only classic regression analysis and some multivariate statistical methods, such as principal components analysis (PCA), but also the following specialized statistical routines for the spatial description of image or raster-based data: (1) trend surface analysis, which calculates the relationship between a single image and the Cartesian coordinate axes of the image itself; (2) point pattern analysis, a technique for determining the character of a point set’s geographical arrangement; and, (3) a simple spatial autocorrelation test, which calculates the first lag autocorrelation coefficient of an image using Moran’s ‘I’ statistic.
The Unix ARC/INFO (ESRI 1991) package, a primarily vector-based GIS, also began to provide kriging capabilities in its Grid module, although its model parameter estimation still is in need of improvement; the data generated with a kriging function can be graphically portrayed using a semivariogram plot. Moreover, linkages between GIS and standard statistical software have been developed. For example, Arc/Info and ArcView have been linked with S+, and MapInfo has been linked with SPSS. However, when employing Cressie’s taxonomy of statistics for spatial data (Cressie 1991, pp. 8–9), it is apparent that a group of very important spatial statistical methods is still missing in current GIS software: those dealing with areally aggregated data. Cressie divides spatial data into four general categories: (a) geostatistical data, whose most important characteristic is continuous variation over space; (b) point patterns, i.e. the locations of events; (c) lattice data (either regular or irregular), representing observations with a discretized distribution over space; and, (d) objects, or 3- and 4-dimensional physical entities. Geostatistics usually refers to statistical methods for analyzing spatial variability of geostatistical data at both the large scale (spatial trend) and the small scale (spatial correlation). As we noted above, two of the most important techniques in geostatistics, trend surface analysis and kriging (predicting unobserved values using observed values), have been incorporated in some current GIS packages. Point pattern analysis, whose major aim is to determine whether a locational pattern is random, clustered, or uniformly spaced, also is available in some GIS packages like IDRISI. However,
spatial statistical methods for analyzing lattice or tessellation data (observations associated with regular or irregular area units; see Griffith and Layne, 1999), are still absent in almost all the current GIS software packages, with the exception of IDRISI’s simple spatial autocorrelation test for pixel data. It is this deficiency that presumably will hamper the diffusion of GIS technology in the future, because: (1) point pattern analysis does not require much in terms of the functionality of a GIS, once the coordinates of the locations have been determined; and, (2) geostatistics are better suited for data with continuous distributions over space, a condition typically not satisfied by spatial samples in the social sciences (Anselin 1992b). Most urban and regional analysis (social, economic, or environmental) is conducted with discretized data aggregated for different geographical areas or zones, such as census tracts, counties, council districts or states. The focus of this paper is on identifying and implementing a set of statistical methods suitable for handling these data. Hence the ‘spatial statistics’ referred to in the following sections of this paper has a narrower definition than the one that has been used by some researchers to include all the statistical methods for analyzing spatial data. It is closer to the ‘spatial statistics’ referred to by Griffith (1987) and the ‘spatial econometrics’ by Anselin (1988). The core of spatial statistics is the explicit recognition of the inescapable property of almost all spatial data (Anselin 1990, Goodchild et al. 1992): spatial dependency or spatial autocorrelation (Cliff and Ord 1973; Goodchild 1986; Griffith 1987; Odland 1988) among georeferenced data values. The existence of spatial dependence violates the basic assumption of independence of most classical statistical analyses, and causes a redundancy in information for nearby locations.
The standard errors of regression coefficients, as well as the standard errors of the correlation coefficients computed for pairs of geographically distributed variables, tend to be incorrect if they are calculated with classical statistical formulae. The corrupted inferential basis may lead us to draw seriously flawed conclusions. Classic statistical methods, when applied to georeferenced data, fail to capture the spatial dependency of the data, raising questions of estimator sufficiency, bias, efficiency, and consistency. Therefore, the major concerns of spatial statistics are how to measure spatial autocorrelation and how to rewrite the classic general linear model so that spatial autocorrelation can be taken into account. The attractiveness of spatial statistics is that rather than throwing away all of the classic statistical techniques, it modifies them so that they can be used properly for spatial data analysis. It thus extends the power of statistical analysis into the arena of spatial data. If the 1970s and early 1980s are the ‘classical spatial analysis’ period for a GIS, and the late 1980s and early 1990s are the period when the GIS community took action to incorporate ‘geostatistical analysis’, then the late 1990s and first decade of the new millennium should be the time when both the GIS developer and user communities adopt more comprehensive ‘spatial statistical analysis’.
3. Integrating GIS components and spatial statistical analysis in DBMSs
3.1. ActiveX/OCX and GIS components
As mentioned earlier, embedding GIS functions into a spatial analysis package, though theoretically a valid strategy for integrating GIS and spatial statistical analysis capabilities, has received insufficient attention in the literature. The biggest obstacle is the technical difficulty of replicating GIS functionalities in platforms other than GISs, due to the complexity of GIS data structures.
However, doing so is becoming very desirable and showing considerable promise with the recent technological
breakthroughs in object-oriented programming and component-based rapid application development (RAD). The cutting edge of these breakthroughs is the ActiveX technology (Microsoft 1997a), formerly known as OLE (Object Linking and Embedding). OLE technology was originally proposed by Microsoft to provide a cut-and-paste capability for Windows. It subsequently evolved into OLE2, for more general communication between Windows applications, permitting the embedding of one document type (for example, an Excel spreadsheet) inside another (for example, a Microsoft Word document). During this evolution a new object communication and encapsulation model was born: COM, which stands for Component Object Model (Microsoft 1997b). One of the major ideas of COM is to promote the development of reusable software components: OLE Controls (or OCXs, after their three-letter file extension). OLE controls became ActiveX Controls when Microsoft made a strategic change to encompass the Internet. ActiveX Controls developed in one specific platform or language (such as Visual Basic, C++, and Delphi) can be reused in another platform or language. This flexibility boosts the efficiency of software and application development since programmers do not have to repeatedly develop everything from scratch. A complicated software application can be built simply by assembling software components, such as a command button, a list box, or a check box. The assembling work can be done easily using drag-and-drop in a visual development environment. ActiveX technology has profoundly impacted the entire software development industry. Software development companies are rushing to offer ActiveX control versions of their flagship software. Interoperability with other software becomes an increasingly important criterion for assessing a software product.
Such jobs as scanning and editing images, connecting and viewing tabular data, and playing video and audio, are not as formidable as they used to be for application developers. The GIS industry is no exception to this trend. After the release of the first embeddable GIS component, SylvanMAPS/OCX (Sylvan Ascent 1995), in 1995, all major GIS software companies released ActiveX/OCX versions of their flagship GIS software. MapObjects (ESRI 1996), MapX (MapInfo 1996), GeoView (Blue Marble Geographics 1996), and GeoPoint (Visual Components 1996) are some of those on the market today. The potential impact of embeddable GIS technology on the future of spatial data analysis cannot be overlooked. For a long time educators and researchers have worried about the inappropriate use of GISs by people lacking a basic knowledge of the special characteristics (for example, map projections and spatial autocorrelation) of spatial data. Now with the ActiveX technology, GISs can be dragged and dropped by users into their applications within a matter of seconds. The danger of the misuse of GISs by those without a basic appreciation of spatial data analysis will be more serious than ever before. To make matters worse, those responsible for the education of spatial analysts (especially geographers and urban and regional scientists) seem to be falling behind in the trend of embeddable GIS technology (Luo 1997). Therefore, there is a significant need today to raise the profile of the special characteristics of spatial data, emphasize the importance of spatial data analysis, and explore ways in which the ideas of spatial statistical analysis can be implemented and grasped by users.
3.2. Linking GIS and spatial statistical analyses outside a GIS environment
Since spatial statistical analysis is closely related to both GIS and classical statistical analysis, it can be implemented either in a GIS environment or in a statistical analysis environment (e.g. the S+ SpatialStats module).
The former of these two strategies has two major advantages. First, the spatial configuration information, which is the essential input for spatial statistical analysis, can be easily generated in a GIS environment. For example, it is well known that a spatial connectivity matrix can be constructed by accessing the topological structure of ARC/INFO (Kehris 1991, Can 1996). Second, a GIS provides sophisticated tools for visualizing the results of a spatial statistical analysis. Mapping residuals of a spatial autoregressive modelling exercise, for instance, helps us identify spatial outliers. Because of these advantages, GISs have been the dominant platform into which spatial statistical techniques are integrated. Ding and Fotheringham (1992) explore the possibilities of developing a statistical analysis module (‘SAM’) that runs within the environment of ARC/INFO. SAM comprises C programs for calculating the measures of spatial autocorrelation. These C programs can be called with AML (ARC/INFO macro language) commands. The best example of linking spatial statistical analysis with ARC/INFO is SAGE (spatial analysis in a GIS environment), a software package for the interactive analysis of area-based data in a client-server architecture, developed by Haining et al. (1996) at the University of Sheffield. SAGE not only supports both classical and spatial statistics, but also has the linked window capability, which means that cases highlighted in one window are also highlighted in others. The major drawback of ARC/INFO as an environment for spatial statistical analysis is that ARC/INFO does not support statistical graphics, which is a crucial element of statistical analysis.
Because of this weakness, recently there has been increasing interest in exploring the possibilities of linking spatial statistical analysis with ArcView GIS (ESRI 1994), which not only contains the essence of ARC/INFO, but also provides a standard Windows working environment and a chart module for generating statistical graphics. Majure et al. (1996) built the link between ArcView and Xgobi, a dynamic graphics program that allows multivariate data to be explored. Anselin and Bao (1996) explore the linkage between ArcView and SpaceStat (Anselin 1992a), a spatial statistical analysis package developed using Gauss. Zhang and Griffith (1997) show that a user-friendly spatial statistical analysis module can be directly developed in ArcView using Avenue. Mathsoft (1998) has developed S+ for ArcView GIS. However, there is a drawback associated with this strategy of developing spatial statistical analysis capabilities within a GIS environment: all statistical analysis functions have to be developed from scratch (for a full integration strategy), and only a very limited number of statistical graphics are supported. This is the major reason why we consider the second strategy: developing spatial statistical functions in a statistical analysis environment. In contrast to the first strategy, this strategy does not need to develop sophisticated GIS functions from scratch since embeddable GIS components already are widely available. Spatial configuration information can be extracted from embedded GIS components, and classical statistical functions provided by current commercial statistical software packages such as Minitab and SAS can be modified to take spatial autocorrelation into account (Griffith 1988, 1993b). Unfortunately, none of the major commercial statistical software packages provides an open development environment into which GIS components and other ActiveX controls can be embedded, though SAS Inc.
has developed some GIS functions inside the SAS package from the ground up (PROC GIS). For this reason, we choose to explore the second strategy within a DBMS, since the strong data handling and query capabilities of a DBMS make it more amenable to the development of statistical analysis functions. With the aid of statistical analysis plug-ins, a DBMS
can become a very decent statistical analysis environment. For example, SPSS has developed a statistical analysis extension for dBASE, an early database management system. Total Access Statistics (FMS 1995), a statistical analysis add-in for Microsoft Access, also enhances the statistical analysis capabilities of Microsoft Access. More importantly, many recent DBMSs, such as Visual FoxPro and Microsoft Access, provide visual application development environments into which GIS components and other ActiveX controls can be embedded simply by drag-and-drop. In this paper, we illustrate how to integrate GIS components and spatial statistical analysis in Microsoft Access. Such an integration allows users to perform exploratory spatial data analysis within a DBMS, without needing to worry about whether GISs or statistical software packages are available or whether they know how to use them. Considering the popularity of Microsoft Access, this integration will more than likely provide a channel for spatial statistical analysis to be more widely adopted by users from various disciplines.
4. Implementing interactive spatial statistical analysis in Access
4.1. Module design
Figure 1 is a simplified representation of the interaction between a GIS and a spatial statistical analysis. Spatial data are handled in a GIS. Two different categories of information contained in a GIS data structure can be extracted: locational information (such as areal unit/configuration connectivity, perimeter, area, inter-centroid distance and topology) and attribute information (attribute values of spatial entities). Locational information is used to create spatial arrangement files, which are crucial
Figure 1. Interactions between GIS and spatial statistical analysis.
to spatial statistical analysis, such as a neighbors file (Zhang and Griffith 1997), a binary connectivity matrix (matrix C), or some weighted version of matrix C (matrix W). Classical statistical methods, such as ordinary least squares (OLS) regression and multivariate statistical methods, can be used to do the preliminary analysis of attribute information, or generate new attribute values that deserve scrutiny (such as the residual from OLS regression). Before we rush to any inference based on the results of a classical statistical analysis, however, it is necessary to examine the spatial dependence latent in the attribute values. If spatial dependence is found to be significant, a classical statistical technique (for instance, an OLS regression procedure) needs to be replaced with its spatial statistical counterpart (for instance, a spatial autoregressive procedure). Output from the spatial statistical analysis then can be moved back to a GIS for further analysis. For example, mapping the local spatial statistics values or residuals from a spatial autoregression is quite revealing about the spatial patterns of data.
4.2. Software components and the application development of a spatial statistical analysis module in Access
This section outlines how to use software components to rapidly develop the aforementioned spatial statistical analysis module. In a traditional application development process, computer code is compiled and linked to form a single monolithic binary file. Once the compiler generates the application, the application does not change until the next version is compiled and shipped. Any change to the code will lead to the recompilation of the entire application. The component-based solution adopted in this article is very different. It breaks the monolithic application into separate pieces, or components, by the use of COM. Each component is like a mini-application, packaged as a binary bundle of code that is compiled, linked, and ready to use.
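The component idea can be illustrated outside COM with a minimal Python sketch. It is an analogy only, not the module described in this paper: the interface name `RegressionComponent`, the two implementing classes, and the client function `fit` are hypothetical, and the regression is restricted to one predictor for brevity. The point it shows is that client code written against an interface keeps working when one implementation is replaced by another.

```python
from typing import Protocol, Sequence, Tuple


class RegressionComponent(Protocol):
    """Hypothetical interface: any object exposing ols() can be plugged in."""

    def ols(self, y: Sequence[float], x: Sequence[float]) -> Tuple[float, float]:
        """Return (intercept, slope) of a simple OLS regression of y on x."""
        ...


class CovarianceOLS:
    """One implementation: slope = cov(x, y) / var(x)."""

    def ols(self, y, x):
        n = len(y)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        slope = sxy / sxx
        return mean_y - slope * mean_x, slope


class NormalEquationsOLS:
    """A replacement implementation: solves the 2x2 normal equations."""

    def ols(self, y, x):
        n = len(y)
        sum_x, sum_y = sum(x), sum(y)
        sum_xx = sum(xi * xi for xi in x)
        sum_xy = sum(xi * yi for xi, yi in zip(x, y))
        det = n * sum_xx - sum_x * sum_x
        intercept = (sum_y * sum_xx - sum_x * sum_xy) / det
        slope = (n * sum_xy - sum_x * sum_y) / det
        return intercept, slope


def fit(component: RegressionComponent, y, x):
    """Client code: depends only on the interface, not on the implementation."""
    return component.ols(y, x)
```

Calling `fit(CovarianceOLS(), y, x)` and `fit(NormalEquationsOLS(), y, x)` should give the same coefficients up to rounding, which is the sense in which a component can be swapped without touching the client.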
Figure 2 shows the components and necessary environment for developing a spatial statistical module in Microsoft Access. GIS components, classical statistical components, and spatial statistical components, combined with other accessory components (such as input boxes, drop-down list boxes, command buttons, and other standard ActiveX controls for building Microsoft Windows applications), provide the basic building blocks for this spatial statistical analysis module. The GIS component we use is ESRI’s MapObjects, one of the most popular embeddable GIS components. MapObjects supports the industry-standard shapefile format, Spatial Database Engine (SDE) layers, and a wide variety of raster image formats such as bmp and tif. Spatial statistical components are built on the data handling capabilities of Access, the standard OLS multiple regression procedures contained in the classic statistical components, and the basic spatial analytical capabilities provided by MapObjects. Different ActiveX controls can be assembled in an Access ‘form’ (a visual environment for application development), and major functions or procedures can be grouped in an Access ‘module’. Of note is that different components are glued together using Visual Basic, the programming language of Microsoft Access and the container for software components as well. Among all of the computer languages, Visual Basic provides the easiest means for accessing and manipulating COM-based software components. For example, to zoom into a map object named ‘Census’ when a user selects a rectangle area of this map using a mouse, we only need to put two lines of code inside the procedure that responds to the MouseDown Event:
Figure 2. Spatial statistical analysis module in Microsoft Access.
    Private Sub Census_MouseDown(Button As Integer, Shift As Integer, X As Long, Y As Long)
        ' Let the user drag a rectangle, then zoom the map to that extent
        Set Census.Extent = Census.TrackRectangle
        Census.Refresh
    End Sub
The way Visual Basic calls the statistical software components is also very straightforward. Suppose we have packaged the OLS regression, the most fundamental statistical routine, inside a COM object named ‘Statistics.Regression’; then the following lines of code will execute an OLS regression of a dependent variable (Y) on independent variables (X):
    Dim Obj As Object, Result As Object
    Set Obj = CreateObject("Statistics.Regression")
    Set Result = Obj.OLS(Y, X)
Of note is that implementation details are hidden inside the COM object; the client simply creates an instance of this object and calls one of the methods the object exposes through its interface. In the future, even if this component is upgraded to incorporate more functionalities, or a new algorithm is adopted to implement the OLS regression, the client still can use the above three lines of code to perform the OLS regression. That is, no code change is needed on the client side as long as
the OLS method is still exposed by the COM interface. This enables our module to evolve gracefully over time as new components replace older components. In the following sections, we discuss in more detail calculating spatial autocorrelation measures and spatial autoregressive modelling, two core elements of our spatial statistical components. The focus is on justifying which spatial statistical functions should be included, and on the algorithms for implementing them. This article will not discuss the technical details of how to program software components. Interested readers can consult Rogerson (1997) and Grimes et al. (1998).
4.3. Spatial autocorrelation testing
Spatial dependence among observations of a georeferenced attribute can be measured at two different scales: global (the autocorrelation in data values in the many locations constituting a map), and local (the association between a single location and its neighbors). The Moran Coefficient (MC, see equation A-1 in the Appendix) and the Geary Ratio (GR) are two well-known global indicators of spatial autocorrelation. They are inversely related, with one index expressible in terms of the other. MC = −1/(n − 1) (where n is the total number of areal units) or GR = 1 indicates a random map pattern; MC > −1/(n − 1) or 0 < GR < 1 indicates that similar values tend to cluster on a map (positive spatial autocorrelation); and, MC < −1/(n − 1) or GR > 1 indicates that dissimilar values tend to cluster on a map (negative spatial autocorrelation). The MC is more popular and easier to understand since its expected value converges on zero when n is large, and a positive value is associated with positive spatial autocorrelation, while a negative value is associated with negative spatial autocorrelation. The MC is also more powerful statistically, and closely parallels the Pearson product moment correlation coefficient.
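To make the two global indices concrete, the following Python sketch computes them from a binary connectivity matrix C. It is an illustration under stated assumptions rather than the code of the Access module: equation A-1 is not reproduced in this section, so the sketch uses the standard textbook definitions of MC and GR, and the function names and the neighbors-list layout (a dictionary mapping a unit index to the indices of adjoining units) are hypothetical.

```python
from typing import Dict, List, Sequence


def connectivity_matrix(neighbors: Dict[int, List[int]], n: int) -> List[List[int]]:
    """Build a binary matrix C with c_ij = 1 when areal units i and j adjoin."""
    c = [[0] * n for _ in range(n)]
    for i, adjacent in neighbors.items():
        for j in adjacent:
            c[i][j] = 1
            c[j][i] = 1  # adjacency is symmetric
    return c


def moran_coefficient(x: Sequence[float], c: List[List[int]]) -> float:
    """MC = (n / sum c_ij) * sum c_ij (x_i - mean)(x_j - mean) / sum (x_i - mean)^2."""
    n = len(x)
    mean = sum(x) / n
    dev = [xi - mean for xi in x]
    total_c = sum(sum(row) for row in c)
    cross = sum(c[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    ss = sum(d * d for d in dev)
    return (n / total_c) * (cross / ss)


def geary_ratio(x: Sequence[float], c: List[List[int]]) -> float:
    """GR = ((n - 1) / (2 sum c_ij)) * sum c_ij (x_i - x_j)^2 / sum (x_i - mean)^2."""
    n = len(x)
    mean = sum(x) / n
    total_c = sum(sum(row) for row in c)
    sq_diff = sum(c[i][j] * (x[i] - x[j]) ** 2 for i in range(n) for j in range(n))
    ss = sum((xi - mean) ** 2 for xi in x)
    return ((n - 1) / (2 * total_c)) * (sq_diff / ss)


def expected_mc(n: int) -> float:
    """Value of MC under a random map pattern: -1 / (n - 1)."""
    return -1.0 / (n - 1)
```

For four areal units arranged in a row, a steadily increasing attribute yields an MC above −1/(n − 1) and a GR below 1, while an alternating attribute yields the opposite, matching the interpretation given in the text.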
Hence, in this article we implement MC in the spatial statistical module and extensively use it for testing spatial autocorrelation in our case study. Computing MC becomes more and more numerically intensive as n gets larger, and calculating the variance of MC is even more complex. Fortunately, there is a very good asymptotic approximation for the standard error of MC (Griffith and Amrhein 1991):
$$\hat{\sigma}_{MC} \approx \sqrt{\frac{2}{\sum_{i=1}^{n}\sum_{j=1}^{n} c_{ij}}} \qquad (1)$$
where c_ij is the (i, j) element of the binary connectivity matrix C. Equation (1) provides a good approximation for all but very small values of n. We use it here to test for spatial autocorrelation in the regression residuals in our module.
Local indicators are useful for examining the partial map patterns that may exist in different local map areas, especially when a given GIS dataset is large. The Gi and Gi* statistics (Getis and Ord 1992) were the first two local statistics used in practice. Anselin (1995) further summarizes the local statistics and defines a class of Local Indicators of Spatial Association (LISA), including Local Gamma, Local Moran and Local Geary. Local statistics, when combined with statistical graphics and the mapping ability of a GIS, are excellent tools for exploratory spatial data analysis (ESDA). (To some extent local statistics are confirmatory in nature, since they require an inference theory before they can be interpreted. However, in this article we mainly use them for exploratory purposes, especially for identifying the extreme values in a data set, and the Z scores for local statistics are used mainly as yardsticks to quantify deviations, rather than to draw an inference.) Recent discussions of this marriage can be seen in the papers by Unwin and Unwin (1998), Brundson
(1998), and Wilhelm and Steck (1998), all of which were published in a special issue of The Statistician.

After Tukey (1977), many exploratory data analysis (EDA) tools have been used to perform preliminary analyses of aspatial data. Statistical graphics, such as boxplots, scatterplots and stem-and-leaf plots, have proven to be very effective in identifying patterns, trends, or outliers and anomalies that are hidden in a data set. Including a map as a view for spatial data is an important extension to the traditional aspatial EDA toolbox. A simple map can be both an appropriate addition to an otherwise aspatial display, and an efficient means of portraying spatial patterns and locations of interest.

Figure 3 shows an interactive ESDA scheme using linked windows. There are at least three different modes of visualizing spatial data: a table (showing the values of different variables), a statistical chart (depicting statistical characteristics), and a map (showing the location and spatial relationships between different spatial entities). Each of these graphics can be displayed in a separate window. A fully interactive ESDA should allow linkages between all these windows. First, linkages can be built within the same mode of windows. For example, a statistical graphic can be linked to another statistical graphic. Brushing linked scatterplots (Becker and Cleveland 1988) has been shown to be a very good strategy for visually examining relationships among different variables. Second, linkages also can be built between different modes of windows. For example, linking a map window with a scatterplot window on the screen extends scatterplot brushing to geographic brushing (Monmonier 1989). When points in one part of a scatterplot are selected, their geographic locations are highlighted. Conversely, when regions on the map are highlighted, the points on the plots corresponding to these regions are identified.
Stuetzle (1987) was among the first to point out the importance and effectiveness of linked windows. MacDougall (1992) developed a prototype interface called Polygon Explorer, which links a map display, some of the more commonly used statistical graphs (e.g. bar charts, histograms, and a scatterplot), and cluster analysis. SPIDER, developed by Haslett et al. (1990), is one of the first truly interactive
Figure 3. Exploratory spatial data analysis using multiple linked windows.
visualization tools for analysing spatially distributed data via linked windows. It provides not only the linkage between maps and scatterplots, but also the linkage between a map and a semivariogram, a visualization tool for analysing the spatial autocorrelation patterns in geostatistical data. A counterpart to this semivariogram plot in spatial statistics is the Moran scatterplot (Anselin 1993). A Moran scatterplot can be used to evaluate whether or not the spatial data under study conform to some of the assumptions of classical statistical analysis: constant variance (by examining the shape and distribution of points on the scatterplot), and zero spatial autocorrelation (note that MC is a standardized regression coefficient associated with the Moran scatterplot constructed with matrix $C$). It is also a straightforward visualization tool for examining whether or not the underlying relationship is linear. The spatial statistical module in this article illustrates how to use the dynamic linking of a Moran scatterplot window with a map window and an attribute table to help identify spatial outliers and local spatial clustering.

4.4. Spatial autoregressive modelling

Three types of spatial autoregressive models are commonly used in spatial statistics and spatial econometrics: the simultaneous autoregressive model (SAR, (A-4) in the Appendix), the autoregressive response model (AR, (A-5) in the Appendix), and the conditional autoregressive model (CAR). The major difference between these models stems from how their inverse covariance function is specified. The commonly used version of the inverse covariance matrix for a CAR model may be written as $(I - \rho C)/\sigma^2$, where $C$ is the binary connectivity matrix and $\rho$ is the spatial autocorrelation parameter. The commonly used version of the inverse covariance matrix for an SAR model may be written as $(I - \rho W)^{t}(I - \rho W)/\sigma^2$, where $W$ is the stochastic (row-standardized) version of matrix $C$ and $t$ denotes the matrix transpose operation.
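The two inverse covariance specifications can be made concrete with a small sketch. The Python code below is an illustration only, not the module's implementation: the function names are hypothetical, the three-unit connectivity matrix is invented, and every areal unit is assumed to have at least one neighbour so that the row sums of $C$ are non-zero.

```python
from typing import List

Matrix = List[List[float]]


def row_standardize(c: Matrix) -> Matrix:
    """Divide each row of C by its row sum to obtain the stochastic matrix W."""
    return [[v / sum(row) for v in row] for row in c]


def identity_minus(rho: float, m: Matrix) -> Matrix:
    """Return I - rho * M for a square matrix M."""
    n = len(m)
    return [[(1.0 if i == j else 0.0) - rho * m[i][j] for j in range(n)]
            for i in range(n)]


def car_precision(c: Matrix, rho: float, sigma2: float) -> Matrix:
    """CAR inverse covariance: (I - rho C) / sigma^2."""
    return [[v / sigma2 for v in row] for row in identity_minus(rho, c)]


def sar_precision(w: Matrix, rho: float, sigma2: float) -> Matrix:
    """SAR inverse covariance: (I - rho W)^t (I - rho W) / sigma^2."""
    a = identity_minus(rho, w)
    n = len(a)
    # Entry (i, j) of A^t A is the inner product of columns i and j of A.
    return [[sum(a[k][i] * a[k][j] for k in range(n)) / sigma2
             for j in range(n)] for i in range(n)]


# Hypothetical map of three areal units arranged in a line: 1 - 2 - 3.
C = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
W = row_standardize(C)
CAR = car_precision(C, rho=0.5, sigma2=1.0)
SAR = sar_precision(W, rho=0.5, sigma2=1.0)
```

Note that $W$ is generally not symmetric even when $C$ is, which is why the SAR form uses the product $(I - \rho W)^{t}(I - \rho W)$ to obtain a symmetric matrix.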
The SAR model is the specification preferred by spatial statisticians, whereas the AR model is the specification preferred by spatial econometricians. The SAR model is appropriate for situations involving second-order dependency (which is captured by $W^{t}W$). One appealing feature of the SAR model is that it casts spatial dependence so that the value in areal unit $i$ is a weighted average of the values in juxtaposed areal units. The conceptual difference between an SAR model and an AR model is that the former posits spatial autocorrelation as a feature of the error term, while the latter posits spatial autocorrelation as a direct response in the dependent variable $Y$. Phenomena that appear to conform to the AR model specification include housing prices, median household income, population density, and selected other socio-economic attributes of areal units. Therefore, the AR model is implemented in our spatial statistical module and is used to model the urban population density distribution in the Washington, DC area.

Following Griffith (1988), when assuming that the error term ($\xi$) is normally distributed, an appropriate normalizing factor (the Jacobian term) needs to be incorporated into parameter estimation procedures for spatial autoregressive models. For example, the SAR model can be specified for estimation purposes as:

$$\frac{Y}{\exp(J/2)} = \frac{\rho W Y}{\exp(J/2)} + \frac{(I - \rho W) X \beta}{\exp(J/2)} + \frac{\xi}{\exp(J/2)} \qquad (2)$$
The Jacobian term ensures that the estimate of $\rho$ will fall within its feasible parameter space, restricting correlations to the interval $[-1, 1]$. The required normalizing factor may be cast as a function of the eigenvalues of matrix $W$ for an
AR or SAR model ((A-6) in the Appendix). This normalizing constant term presents a formidable impediment for parameter estimation when the spatial data set is large. Computing eigenvalues for an $n$-by-$n$ matrix becomes impossible at some value of $n$. Even powerful supercomputers cannot supply the computing resources necessary to numerically handle the normalizing constant for massive spatial data sets containing hundreds of thousands, millions, or billions of areal units (Griffith and Sone 1995). To effectively estimate spatial autoregressive models in a desktop DBMS like Microsoft Access, simplifying the computation of the normalizing constant is highly desirable. In our module we have utilized two forms of a Jacobian approximation. One approximation ((A-7) in the Appendix) only requires that the minimum and maximum eigenvalues of matrices $C$ and $W$ be known (we refer to this as the 'approximate' Jacobian), and the other ((A-9) in the Appendix) can be used when no eigenvalue information is available (we refer to this as the 'generalized' Jacobian). These are illustrated in the next section using the empirical modelling of the Washington, DC population density.

Because $\rho$ is in both the numerator and the denominator of equation (2), the accompanying estimation problem becomes nonlinear. This not only adds great difficulty to the computer implementation of a spatial autoregressive model in Access, but also makes the parameter estimation a numerically intensive procedure. Fortunately, since we know that estimation of equation (2) is a linear regression problem when $\rho$ is set to a given value, and the relationship between the mean sum of squared errors (MSE) and $\rho$ given by equation (2) is roughly quadratic in nature, we can use the following algorithm to implement this nonlinear procedure:
• For a spatial autoregressive model, compute a set of MSEs across a range of $\rho$ values using the OLS method. For example, if the Moran coefficient implies the existence of positive spatial autocorrelation, $\rho$ can take on the values 0.0001, 0.05, 0.1, ..., 0.9999. Otherwise, $\rho$ can take on the values -0.9999, -0.95, -0.90, ..., -0.0001, although the actual ranges are determined by the extreme eigenvalues of matrix $C$ or $W$. Denote the increment of $\rho$ as $\Delta$;
• Choose the $\rho$ ($\rho_i$) with the minimum MSE ($M_i$), and also record $\rho_{i-1}$, $\rho_{i+1}$, $M_{i-1}$ and $M_{i+1}$;
• Use the following quadratic equation-derived formula to get the final estimate $\hat{\rho}$:

$$\hat{\rho} = \rho_i + \frac{\Delta (M_{i-1} - M_{i+1})}{2 (M_{i-1} - 2M_i + M_{i+1})} \qquad (3)$$

Obviously, the advantage of this algorithm is that it uses only OLS, which is much easier to implement than a genuine nonlinear estimation procedure. The only concern is how fine $\Delta$ should be in order to obtain parameter estimates comparable to those yielded by the nonlinear procedure. The smaller the $\Delta$, the more OLS procedures need to be executed. This issue also is addressed in the following case study.
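The grid search and quadratic refinement described above can be sketched in Python as follows. This is a minimal illustration rather than the module's actual code: `estimate_rho` and `mse_at` are hypothetical names, and `mse_at(rho)` stands in for whatever routine fits equation (2) by OLS at a fixed $\rho$ and returns the MSE. The sketch also assumes an evenly spaced grid strictly inside the interval, so that the single increment $\Delta$ of equation (3) applies to both neighbours of the grid minimum.

```python
from typing import Callable, List, Tuple


def estimate_rho(mse_at: Callable[[float], float],
                 lo: float = 0.0,
                 hi: float = 1.0,
                 delta: float = 0.05) -> Tuple[float, List[Tuple[float, float]]]:
    """Grid-search rho, then refine the grid minimum with equation (3)."""
    steps = int(round((hi - lo) / delta))
    # Evaluate strictly inside (lo, hi): rho = lo + k * delta, k = 1..steps-1.
    profile = [(lo + k * delta, mse_at(lo + k * delta)) for k in range(1, steps)]
    i = min(range(len(profile)), key=lambda k: profile[k][1])
    rho_i, m_i = profile[i]
    if i == 0 or i == len(profile) - 1:
        # Minimum sits on the edge of the grid: nothing to interpolate with.
        return rho_i, profile
    m_prev, m_next = profile[i - 1][1], profile[i + 1][1]
    denom = 2.0 * (m_prev - 2.0 * m_i + m_next)
    if denom <= 0.0:
        # MSE profile is not locally convex; keep the grid minimum.
        return rho_i, profile
    return rho_i + delta * (m_prev - m_next) / denom, profile


def toy_mse(rho: float) -> float:
    """Invented, exactly quadratic MSE profile used only for demonstration."""
    return 2.0 * (rho - 0.6739) ** 2 + 0.3


rho_hat, mse_profile = estimate_rho(toy_mse, lo=0.0, hi=1.0, delta=0.05)
print(f"grid points evaluated: {len(mse_profile)}, rho_hat = {rho_hat:.4f}")
```

When the MSE profile is exactly quadratic, as in the toy function, equation (3) recovers the minimizer regardless of $\Delta$; with real data the profile is only roughly quadratic, which is why the choice of $\Delta$ matters.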
5. An example: urban population density modelling for Washington, DC

In this section we present an urban population density modelling exercise for Washington, DC, as an example to illustrate the functionalities of the spatial statistical module we developed in Microsoft Access. The analysed population density data set covers 192 census tracts that are located to the east of the Potomac River. In the
first step of the spatial statistical analysis, a geographical weights matrix needs to be created. In our module, a spatial neighbours file (table 1), instead of a binary spatial weights matrix, is used to store and display the spatial contiguity of different areal units. Such a file contains the same spatial configuration information as a binary spatial weights matrix, but is much smaller in size (all of the 0 cells of matrix $C$ are ignored), especially when the size of a data set is very large.

5.1. Spatial autocorrelation testing and exploratory spatial data analysis

Once the spatial configuration file is generated, the next step is to measure the nature and degree of spatial autocorrelation latent in the data set. The following power transformation was used to bring the raw population density (popden) data into closer conformity with a normal frequency distribution: $Y = \log(\mathrm{popden} + \delta)$, where $\delta = 8540.65$. A global test of spatial autocorrelation yields a global Moran coefficient of 0.46 for the power-transformed population density variable, with a z-score of 10.45, which indicates that we can reject the null hypothesis (i.e. that no spatial autocorrelation is latent in the data set). The MC value itself indicates moderate positive spatial autocorrelation latent in the data. That is, census tracts with similar population density tend to cluster together in the Washington, DC, area.

To further explore the map patterns that may exist in different local areas, and to determine if spatial outliers or local spatial instabilities are present, a local test for spatial autocorrelation was performed. A local spatial statistics table and a Moran scatterplot are two important outputs of the module.
To facilitate the dynamic interaction between a map, a table and a Moran scatterplot, a local spatial statistics table needs to satisfy the following requirements: (a) each record of the table corresponds to an areal unit (in this case, a census tract), and (b) at least the following fields are included:

• Field I: identifier for each areal unit
• Field II: attribute value for each areal unit: in this case, the power-transformed population density ($Y$)
• Field III: attribute value for the spatial neighbours of each areal unit: in this case, $CY$, where $C$ is the binary connectivity matrix
• Field IV: standardized attribute value (standardized $Y$)
• Field V: standardized attribute values for spatial neighbours (standardized $CY$)
• Fields VI and above: local statistics for each areal unit, such as local Moran, G statistics, or local Gamma.

Table 1. The structure of a spatial neighbours file.

Area unit    Spatial neighbours
2¹           3, 4, 5, 7
3            2, 7, 9, 14, 16
4            2, 5, 6, 8, 11
5            2, 4, 7, 8
...          ...

¹ Using an ARC/INFO convention, identifier '1' is reserved for the universal polygon, which is the polygon outside the map boundary.
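To show how a spatial neighbours file of this kind can feed the global test and the first fields of the local statistics table, the following Python sketch may help. It is an illustration under stated assumptions, not the module's code: equation (A-1) is not reproduced in this section, so the standard form of the Moran coefficient is assumed; the function names, the four-unit neighbour structure, and the attribute values are all hypothetical; and the standard error uses the approximation of equation (1), which the text notes is poor for very small $n$ such as this toy case.

```python
from math import sqrt
from typing import Dict, List, Tuple


def moran_summary(neighbours: Dict[int, List[int]],
                  y: Dict[int, float]) -> Dict[str, float]:
    """Global MC, its approximate standard error (equation (1)), and z-score."""
    ids = sorted(neighbours)
    n = len(ids)
    mean = sum(y[i] for i in ids) / n
    dev = {i: y[i] - mean for i in ids}
    # Each listed neighbour is one non-zero cell of the binary matrix C.
    total_c = sum(len(neighbours[i]) for i in ids)
    cross = sum(dev[i] * dev[j] for i in ids for j in neighbours[i])
    mc = (n / total_c) * cross / sum(d * d for d in dev.values())
    se = sqrt(2.0 / total_c)
    expected = -1.0 / (n - 1)  # value of MC under a random map pattern
    return {"MC": mc, "SE": se, "Z": (mc - expected) / se}


def local_table(neighbours: Dict[int, List[int]],
                y: Dict[int, float]) -> List[Tuple[int, float, float]]:
    """Fields I-III of the local statistics table: identifier, Y, and CY."""
    return [(i, y[i], sum(y[j] for j in neighbours[i]))
            for i in sorted(neighbours)]


# Hypothetical file: four areal units in a row, numbered from 2 because
# identifier 1 is reserved for the universal polygon.
toy_neighbours = {2: [3], 3: [2, 4], 4: [3, 5], 5: [4]}
toy_y = {2: 1.0, 3: 2.0, 4: 3.0, 5: 4.0}

summary = moran_summary(toy_neighbours, toy_y)
rows = local_table(toy_neighbours, toy_y)
```

Because only the listed neighbours are visited, the computation never touches the zero cells of $C$, which is the storage advantage of the neighbours file noted above.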
With the above specification, a local statistics table can be linked to a map and its underlying GIS data structure using Field I as a primary key, and a Moran scatterplot can be generated based on either Fields II and III, or Fields IV and V. Figure 4 is a screenshot showing the interaction between a local spatial statistics table (in the lower part) and a map (in the upper part) in the Microsoft Access environment. Of interest here is that a map and its underlying GIS data, which used to be available only in GIS or cartographic software, have been embedded in a DBMS environment using component technology. The user can perform basic GIS-type operations on this map in the Microsoft Access environment, such as zooming, identifying, spatial querying, statistical mapping, and geocoding, just as in a GIS environment. The dynamic linkage between the local statistics table and the map allows the user to select a set of observations of interest in the table, with the corresponding locations of these observations being simultaneously highlighted on the map. For instance, in figure 4 observations with a local Moran statistic greater than 2 (a threshold chosen for exploratory analysis, not for drawing inferences), which are suspected to represent local spatial clustering of similar population densities, are selected from the local statistics table. The corresponding census tracts are simultaneously highlighted in dark grey on the
Figure 4. Local spatial statistics and maps can be used together to explore the population density distribution pattern in Washington, DC.
map. At least two spatial clusters with similar values are identified: one in the southeast part of the map, which contains two census tracts with relatively large areas, and the other located to the north of the first one, which contains several census tracts with smaller areas.

In figure 5, the map and the Moran scatterplot are combined to perform the spatial data analysis interactively. The upward slope of the Moran scatterplot further suggests the existence of moderate positive spatial autocorrelation. Suspecting that the point located in the lower left-hand corner of the scatterplot may be a spatial outlier, and may mark a spatial clustering of low population densities, a user can click on this point with the mouse, and its spatial location is simultaneously highlighted on the map. By using the map-identifying tool, we discover that this is the census tract where Capitol Hill is located, with a population density of only 21 people per square mile. This finding is consistent with the fact that the area around Capitol Hill is mainly occupied by public and commercial, rather than residential, buildings. For simplicity of illustration, this example implements only one-map-to-one-chart
Figure 5. The Moran scatterplot combined with the map to perform an interactive spatial data analysis in Microsoft Access.
interactive analysis. Linking between multiple maps and charts can certainly be developed with more programming effort.

5.2. Spatial autoregressive modelling

The preceding exploratory analysis suggests that a classical urban population density model (Clark 1951) of the form

$$\ln(D_d) = \ln(D_0) - \beta d + \varepsilon, \qquad (4)$$

where $D_d$ is the population density at a given distance from the city centre, $D_0$ is the density at the city centre, $\beta$ is the rate of change of density with distance from the city centre, and $d$ is the distance, will probably yield flawed results. Since the Washington, DC, population density exhibits moderate positive spatial autocorrelation, a spatial autoregressive (AR) model can be used to replace equation (4):

$$\ln(D_d) = \rho W \ln(D_d) + X\beta + \varepsilon, \qquad (5)$$

where $\rho$ is the spatial autocorrelation parameter, $W$ is the row-standardized version of the binary connectivity matrix, $X$ is the data matrix and $\beta$ is the vector of coefficients including the intercept.

Tables 2 and 3 tabulate a comparison of these two models. The OLS regression model accounts for only about 15% of the total variation ($R^2$: 15.7%; adjusted $R^2$: 14.8%). Of note is that moderate spatial autocorrelation (MC: 0.426, normalized z-score: 9.675) remains in the OLS regression residuals. The standard errors of MC for both the OLS model residuals and the AR model residuals have been computed using the asymptotic approximation (equation (1)). The AR model yields an improved fit: its pseudo-$R^2$ is 49.8%. More importantly, the MC for the AR model residuals decreases to 0.024 (normalized z-score: 0.662), indicating that most of the spatial autocorrelation has been taken into account by the AR model. This is an experimental result, since the statistical distribution theory for MC calculated with AR residuals is unknown.

Table 2. OLS model output.

Variable                 Parameter    Estimate    Std. Error
Intercept                $\beta_0$    10.258      0.066
Distance from CBD        $\beta_1$    -0.087      0.020
Low density indicator    $\beta_2$    -1.035      0.234

$R^2$: 15.7%; Residuals: MC = 0.426, S(MC) = 0.045, Z(MC) = 9.675.
The indicator variable takes on a value of 1 when a census tract has zero population density, and a value of 0 when it does not.

Table 3. AR model output.

Variable                 Parameter    Estimate    Asy. Std. Error
Spatial Lag ($WY$)       $\rho$       0.6739      0.0872
Intercept                $\beta_0$    10.4615     0.1775
Distance from CBD        $\beta_1$    -0.0418     0.0170
Low density indicator    $\beta_2$    -0.7350     0.1880

Pseudo-$R^2$: 49.8%; Residuals: MC = 0.024, S(MC) = 0.045, Z(MC) = 0.662.
The results here are generated using SAS spatial statistics programs developed by Griffith (1993). They are compared with those generated by our module.

Figure 6 is a screenshot of spatial autoregressive modelling being performed with our spatial statistical module in Access. The graphical user interface lets a user select the data and variables to be analysed from drop-down combo boxes and list boxes. Users also can determine what increment ($\Delta$) they want to use when estimating the spatial autoregressive parameter ($\rho$), and which Jacobian specification, exact (A-6), approximate (A-7), or generalized (A-9), they are interested in using as the normalizing factor. For example, an increment of 0.05 and the exact Jacobian term are chosen in the example shown in figure 6. The Moran scatterplot for residuals, as displayed in the lower part of figure 6, shows no sign of spatial autocorrelation remaining.
Figure 6. Spatial autoregressive modelling in Microsoft Access.
The ease of use provided by our spatial statistical module allows us to fully explore the estimation behaviour of the autoregressive model specification under different choices of $\Delta$ for incrementing $\rho$ and either the exact or approximate Jacobian term. Table 4 shows the simulation results when the increment and Jacobian terms take on different input values. The results suggest several interesting points:

(a) When the exact Jacobian is used, the difference between the $\hat{\rho}$ produced by our module and that produced by SAS (0.6739, see table 3) appears in the second decimal place when $\Delta$ takes on a value of 0.2. However, as $\Delta$ gets smaller, this difference shrinks: when $\Delta$ is 0.15 or 0.10, the difference first appears in the third decimal place, and when $\Delta$ takes on a value of 0.05 or smaller, the difference only appears in the fourth decimal place. Moreover, the estimates for $\beta_0$, $\beta_1$ and $\beta_2$ are almost identical with those yielded by SAS when $\Delta$ is 0.05 or smaller. These results suggest that the MSE-OLS algorithm used in our module for estimating $\rho$ is a good substitute for the usual nonlinear estimation procedure, and that $\Delta$ should take on a value around 0.15 (only 7 OLS regressions are required) or smaller, depending on how accurate we want our results to be as well as the available computing power.

(b) When the Jacobian approximation is used to replace the exact Jacobian as the normalizing constant in the spatial autoregressive modelling, and $\Delta$ is 0.10 or smaller, the $\hat{\rho}$ and $\beta_0$ yielded by our module only differ from those yielded by SAS in the third decimal place, and $\beta_1$ and $\beta_2$ in the fourth decimal place. These results not only corroborate the finding from (a), i.e. the MSE-OLS algorithm for estimating $\rho$ is a good substitute for the usual nonlinear estimation procedure, but also suggest that (A-7) may be a robust approximation for the exact Jacobian.
(c) If the generalized Jacobian approximation is used to replace the exact Jacobian term, the $\hat{\rho}$, $\beta_0$, and $\beta_2$ yielded by our module start differing from their SAS counterparts from the second decimal place on. However, the difference in $\hat{\rho}$ is only 0.017 (when $\Delta$ is 0.05). Hence the generalized Jacobian can still be utilized for estimating parameters with a reasonable degree of accuracy when the eigenvalues of matrix $C$ or $W$ are not available.

Table 4. Spatial autoregressive model parameter estimates under different selections for increments of $\rho$ and the Jacobian term.

Jacobian       $\rho$ increment    $\hat{\rho}$    $\beta_0$    $\beta_1$    $\beta_2$
Exact          0.20                0.6634          10.4528      -0.0426      -0.7397
Exact          0.15                0.6707          10.4570      -0.0421      -0.7364
Exact          0.10                0.6702          10.4582      -0.0413      -0.7366
Exact          0.05                0.6734          10.4605      -0.0419      -0.7352
Exact          0.02                0.6737          10.4628      -0.0419      -0.7351
Exact          0.01                0.6738          10.4629      -0.0419      -0.7350
Exact          0.005               0.6738          10.4626      -0.0419      -0.7350
Exact          0.001               0.6738          10.4623      -0.0419      -0.7350
Approximate    0.10                0.6754          10.4643      -0.0418      -0.7343
Approximate    0.05                0.6788          10.4645      -0.0416      -0.7328
Approximate    0.01                0.6793          10.4674      -0.0415      -0.7326
Approximate    0.005               0.6793          10.4665      -0.0415      -0.7326
Generalized    0.05                0.6571          10.4470      -0.0430      -0.7425
Generalized    0.01                0.6581          10.4414      -0.0429      -0.7419
Generalized    0.001               0.6584          10.4490      -0.0430      -0.7419
SAS            *                   0.6739          10.4615      -0.0419      -0.7350
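The decimal-place comparisons in points (a) to (c) can be checked directly from the tabulated estimates. The short Python sketch below does this arithmetic for $\hat{\rho}$ only; the values are transcribed from table 4, while the dictionary layout and the function name `rho_gaps` are hypothetical conveniences.

```python
from typing import Dict

SAS_RHO = 0.6739  # nonlinear estimate reported in tables 3 and 4

# rho-hat from table 4, keyed by Jacobian term and then by increment.
TABLE4_RHO: Dict[str, Dict[float, float]] = {
    "exact": {0.20: 0.6634, 0.15: 0.6707, 0.10: 0.6702, 0.05: 0.6734,
              0.02: 0.6737, 0.01: 0.6738, 0.005: 0.6738, 0.001: 0.6738},
    "approximate": {0.10: 0.6754, 0.05: 0.6788, 0.01: 0.6793, 0.005: 0.6793},
    "generalized": {0.05: 0.6571, 0.01: 0.6581, 0.001: 0.6584},
}


def rho_gaps(table: Dict[str, Dict[float, float]],
             reference: float) -> Dict[str, Dict[float, float]]:
    """Absolute difference between each rho-hat and the reference estimate."""
    return {jacobian: {inc: round(abs(rho - reference), 4)
                       for inc, rho in rows.items()}
            for jacobian, rows in table.items()}


gaps = rho_gaps(TABLE4_RHO, SAS_RHO)
for jacobian, rows in gaps.items():
    for inc, gap in rows.items():
        print(f"{jacobian:12s} increment={inc:<6} |rho_hat - SAS| = {gap:.4f}")
```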
6. Discussions

Spatial statistics are designed to handle one special property of spatial data: spatial autocorrelation. However, users who want to employ spatial statistical analyses in their work often face a dilemma: current GIS software packages are very weak in classical statistical analysis capabilities, let alone spatial statistical analysis; and commercial statistical software packages are designed for undertaking classical statistical analysis, rather than spatial statistical analysis. Merging these two separate entities, and modifying classical statistical analysis so that it can take spatial autocorrelation into account, is necessary to overcome this difficulty.

The integration of GIS and spatial statistical analyses can be implemented either by embedding spatial statistical analysis into a GIS environment, or by embedding selected GIS functions into a spatial statistical environment. The platforms used by these two approaches are different, as are the user groups they target. They also have different advantages and disadvantages. Most of the research attention to date has focused on the former approach. In this paper, we argue that the latter approach deserves more attention, especially given the rapid development of embeddable GIS component technology. One advantage of this approach is especially noteworthy: embedded GIS components can provide the basic GIS-type operations that a spatial statistical analysis module needs, while classical statistical routines and statistical graphics do not need to be developed from scratch.

Since current commercial statistical software packages have not yet been able to provide the visual application development environment into which GIS components and other ActiveX controls can be embedded, we choose to integrate GIS components and spatial statistical analysis in a proprietary DBMS, Microsoft Access.
The primary reasons for doing so include: Microsoft Access is strong in data handling and query; it has statistical charting capabilities; with a third-party statistical analysis add-in, it can become a good statistical analysis environment; and, most importantly, it provides visual development environments and programming tools for the embedding of GIS components. Considering the strong user base of Microsoft Access, the importance of this integration scheme cannot be overlooked: it can promote the dissemination of spatial statistics.

In this paper we present a successfully developed user-friendly and interactive spatial statistical module using this integration scheme. This module provides four of the most frequently performed tasks in spatial statistical analysis: specifying the spatial relationship file, testing for spatial autocorrelation, performing exploratory spatial data analysis, and fitting spatial autoregressive models. It enables us to explore empirical spatial patterns of urban population density in Washington, DC. However, currently only spatial modelling for irregular lattice cases, with an AR and SAR model specification, has been implemented. How to extend the functionalities of this module to include regular lattice cases (which often involve a massively large sample size), and other spatial statistical functions, constitutes future research. More case studies also are needed to further assess this module.

Of note is that this integration scheme has one possible limitation: currently it can work only on Windows platforms (Windows 95, Windows NT, or more advanced Windows operating systems). Finally, our research suggests future work for current commercial statistical software vendors: adopt a more open environment in the near future, especially one that supports ActiveX technology.

References

Anselin, L., 1988, Spatial Econometrics: Methods and Models (Dordrecht: Kluwer Academic).
Anselin, L., 1990, What is special about spatial data? Alternative perspective on spatial data analysis. In Spatial Statistics: Past, Present, and Future, edited by D. A. Griffith, Monograph #12, Institute of Mathematical Geography (Ann Arbor: Michigan Document Services).
Anselin, L., 1992a, SpaceStat Tutorial: A Workbook for Using SpaceStat in the Analysis of Spatial Data (Santa Barbara: Department of Geography).
Anselin, L., 1992b, Spatial data analysis with GIS: an introduction to application in the social sciences. Technical Report 92-10, UC-Santa Barbara.
Anselin, L., 1993, The Moran Scatterplot as a Means to Visualize Instability in Spatial Autocorrelation. Position paper prepared for the Workshop on Exploratory Spatial Data Analysis and GIS, NCGIA, Santa Barbara, CA.
Anselin, L., 1995, Local indicators of spatial association - LISA. Geographical Analysis, 27, 93–115.
Anselin, L., and Bao, S., 1996, Exploratory Spatial Data Analysis Linking SpaceStat and ArcView. Working Paper. Morgantown, WV: Regional Research Institute.
Anselin, L., Dodson, R. F., and Hudak, H., 1993, Linking GIS and spatial data analysis in practice. Geographical Systems, 1, 3–23.
Anselin, L., and Getis, A., 1993, Spatial statistical analysis and geographic information systems. In Geographic Information Systems, Spatial Modelling, and Policy Evaluation, edited by M. M. Fischer and P. Nijkamp (Berlin: Springer-Verlag), pp. 103–114.
Bailey, T. C., 1994, A review of statistical spatial analysis in geographical information systems. In Spatial Analysis and GIS, edited by S. Fotheringham and P. Rogerson (London: Taylor and Francis), pp. 13–44.
Becker, R. A., and Cleveland, W. S., 1988, Brushing scatterplots. In Dynamic Graphics for Statistics, edited by W. S. Cleveland and M. E. McGill (Belmont: Wadsworth, Inc.), pp. 201–224.
Blue Marble Geographics, 1996, GeoView. http://www.bluemarblegeo.com/devgv.htm.
Brunsdon, C., 1998, Exploratory spatial data analysis and local indicators of spatial association with XLISP-STAT. The Statistician, 47, 471–84.
Can, A., 1996, Weight matrices and spatial autocorrelation statistics using a topological vector data model. International Journal of Geographical Information Systems, 10, 1009–1017.
Clark, C., 1951, Urban population densities. Journal of the Royal Statistical Society, Series A, 114, 490–496.
Cliff, A. D., and Ord, J. K., 1973, Spatial Autocorrelation. Monographs in Spatial and Environmental Systems Analysis (New York: Methuen, Inc.).
Cressie, N. A., 1991, Statistics for spatial analysis (New York: John Wiley and Sons).
Cressie, N. A., Olsen, A., and Cook, D., 1996, Massive data sets: problems and possibilities, with application to environmental monitoring. In Massive Data Sets: Proceedings of a Workshop, edited by the National Research Council, Committee on Applied and Theoretical Statistics (Washington, DC: National Academy Press), pp. 115–119.
Ding, Y., and Fotheringham, A. S., 1992, The integration of spatial analysis and GIS. Computers, Environment and Urban Systems, 16, 3–19.
Eastman, J. R., 1992, Idrisi Version 4.0: User's Guide (Worcester, Massachusetts: Clark University).
ESRI (Environmental Systems Research Institute), 1991, ARC/INFO 6.0: User's Guide (Redlands: ESRI).
ESRI (Environmental Systems Research Institute), 1994, Introducing ArcView (Redlands: ESRI).
ESRI (Environmental Systems Research Institute), 1996, Building Applications with MapObjects (Redlands: ESRI).
FMS Inc., 1995, Total Access Statistics: Statistical Analysis for Microsoft Access (Vienna: FMS).
Fotheringham, A. S., and Rogerson, P., 1993, GIS and spatial analytical problems. International Journal of Geographical Information Systems, 7, 3–19.
Getis, A., and Ord, K., 1992, The analysis of spatial association by use of distance statistics. Geographical Analysis, 24, 189–206.
Goodchild, M., 1986, Spatial Autocorrelation. CATMOG (Norwich: GeoBooks).
Goodchild, M., 1987, A spatial analytical perspective on geographical information systems. International Journal of Geographical Information Systems, 1, 327–334.
Goodchild, M., 1991, The case for linking GIS and spatial analysis. In GIS and Spatial Data Analysis: Report on the Sheffield Workshop, edited by R. P. Haining and S. M. Wise (Regional Research Laboratory Initiative), Discussion Paper Number 11, pp. 4–6.
Goodchild, M., Haining, R. P., and Wise, S. M., 1992, Integrating GIS and spatial data analysis: problems and possibilities. International Journal of Geographical Information Systems, 6, 407–423.
Griffith, D. A., 1987, Spatial Autocorrelation: A Primer. Resource Publications in Geography (Washington: Association of American Geographers).
Griffith, D. A., 1988, Estimating spatial autoregressive model parameters with commercial statistical packages. Geographical Analysis, 20, 176–186.
Griffith, D. A., 1992, Simplifying the normalizing factor in spatial autoregressions for irregular lattices. Papers in Regional Science, 71, 71–86.
Griffith, D. A., 1993a, Which spatial statistics techniques should be converted to GIS functions? In Geographic Information Systems, Spatial Modelling, and Policy Evaluation, edited by M. M. Fischer and P. Nijkamp (Berlin: Springer-Verlag), pp. 103–114.
Griffith, D. A., 1993b, Spatial Regression Analysis on the PC: Spatial Statistics Using SAS (Washington: Association of American Geographers).
Griffith, D. A., and Amrhein, C., 1991, Statistical Analysis for Geographers (Englewood Cliffs: Prentice Hall).
Griffith, D. A., and Layne, L., 1999, A Case Book for Spatial Data Analysis: A Compilation of Analyses of Different Thematic Data Sets (Oxford: Oxford University Press), pp. 3–5.
Griffith, D. A., and Sone, A., 1995, Trade-offs associated with normalizing constant computational simplifications for estimating spatial statistical models. Journal of Statistical Computing and Simulation, 51, 165–183.
Grimes, R., Stockton, A., Reilly, G., and Templeman, J., 1998, Beginning ATL COM Programming (Birmingham: Wrox Press).
Haining, R. P., 1994, Designing spatial data analysis modules for geographical information systems. In Spatial Analysis and GIS, edited by S. Fotheringham and P. Rogerson (London: Taylor and Francis), pp. 45–64.
Haining, R. P., and Wise, S. M. (editors), 1991, GIS and Spatial Data Analysis: Report on the Sheffield Workshop, Regional Research Laboratory Initiative Discussion Paper No. 11.
Haining, R. P., Ma, J., and Wise, S. M., 1996, Design of a software system for interactive spatial statistical analysis linked to a GIS. Computational Statistics, 11, 449–466.
Haslett, J., Wills, G., and Unwin, A., 1990, SPIDER—an interactive statistical tool for the analysis of spatially distributed data. International Journal of Geographical Information Systems, 4, 285–296.
Kehris, E., 1991, Spatial Autocorrelation Statistics in ARC/INFO. Research Report No. 16, North West Regional Research Laboratory, Lancaster University.
Levine, N., 1996, Spatial statistics and GIS: software tools to quantify spatial patterns. Journal of the American Planning Association, 62, 45–52.
Luo, Y., 1997, Spatial analysis education and GIS componentware technology. In 1997 ESRI User Conference Proceedings (Redlands: ESRI), http://www.esri.com/library/userconf/proc97.
MacDougall, E. B., 1992, Exploratory analysis, dynamic statistical visualization, and geographic information systems. Cartography and Geographic Information Systems, 19, 237–246.
Majure, J., Symanzik, J., and Cook, D., 1996, ArcView2.1-XGobi Link Version 1.3, http://www.gis.iastate.edu/XGobi-AV2/XGobi-AV2.html.
MapInfo Corporation, 1996, MapInfo MapX. http://www.mapinfo.com/events/mapx/mapx.html.
Mathsoft, 1998, S+ for ArcView GIS. http://www.mathsoft.com/splus/splsprod/arcview.htm.
Microsoft Corporation, 1997a, Visual Basic Component Tools Guide: Programming System for Windows (Redmond: Microsoft Press).
Microsoft Corporation, 1997b, White Papers on COM Technologies. http://www.microsoft.com/com/wpaper.htm.
Monmonier, M., 1989, Geographic brushing: enhancing exploratory analysis of the scatterplot matrix. Geographical Analysis, 21, 81–84.
Odland, J., 1988, Spatial Autocorrelation (Newbury Park: SAGE Publications).
Openshaw, S., 1991, What types of spatial analysis techniques are needed in GIS? In GIS and Spatial Data Analysis: Report on the Sheffield Workshop, edited by R. P. Haining and S. M. Wise (Regional Research Laboratory Initiative), Discussion Paper No. 11, pp. 6–7.
Rogerson, D., 1997, Inside COM (Redmond: Microsoft Press).
Stuetzle, W., 1987, Plot Windows. Journal of the American Statistical Association, 82, 466–475.
Sylvan Ascent Inc., 1995, SylvanMAPS/OCX (Santa Fe: Sylvan Ascent).
Tukey, J. W., 1977, Exploratory Data Analysis (Addison-Wesley Publishing Company).
University Consortium for Geographic Information Science (UCGIS), 1998, Research Priorities, Revised White Papers (Park City, Utah: UCGIS).
Unwin, A., and Unwin, D., 1998, Exploratory spatial data analysis with local statistics. The Statistician, 47, 415–21.
Visual Components, 1996, GeoPoint. http://www.visualcomp.com/products/gp.htm.
Wilhelm, A., and Steck, R., 1998, Exploratory spatial data by using interactive graphics and local statistics. The Statistician, 47, 423–30.
Zhang, Z., and Griffith, D. A., 1997, Developing user-friendly spatial statistical analysis modules for GIS: an example using ArcView. Computers, Environment and Urban Systems, 21, 5–29.
Appendix

1.1. Spatial autocorrelation measures

$$
MC = \frac{n \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\left[ \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij} \right] \sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{A-1}
$$

where $c_{ij}$ are the elements of the connectivity matrix C, and $c_{ij} = 1$ if area units i and j are adjacent; otherwise $c_{ij} = 0$.

The mathematical expression of the variance of the MC under a normality assumption is:

$$
\sigma_{mc}^2 = \frac{2 S_0 n^2 - 4n(S_0 + S_1) + 3 S_0^2}{S_0^2 (n^2 - 1)} - \frac{1}{(n - 1)^2} \tag{A-2}
$$

where:

$$
S_0 = \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij}; \qquad S_1 = \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij} \left( \sum_{k=1}^{n} c_{ik} - 1 \right)
$$
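As an illustration, equations (A-1) and (A-2) can be evaluated directly for a small data set. The sketch below is a minimal pure-Python version, assuming a symmetric binary connectivity matrix stored as a list of lists. The function names `moran_coefficient` and `moran_variance_normal` and the four-unit chain example are hypothetical and are not part of the module described in this paper.

```python
def moran_coefficient(x, C):
    """Moran Coefficient as in (A-1); C is a binary connectivity matrix."""
    n = len(x)
    mean = sum(x) / n
    z = [xi - mean for xi in x]  # deviations from the mean
    cross = sum(C[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    s0 = sum(sum(row) for row in C)      # total number of adjacency links
    ss = sum(zi * zi for zi in z)        # sum of squared deviations
    return n * cross / (s0 * ss)


def moran_variance_normal(C):
    """Variance of MC under the normality assumption, as in (A-2)."""
    n = len(C)
    row_sums = [sum(row) for row in C]
    s0 = sum(row_sums)
    # S1 = sum_i sum_j c_ij (row sum of i - 1) = sum_i n_i (n_i - 1)
    s1 = sum(r * (r - 1) for r in row_sums)
    first = (2 * s0 * n**2 - 4 * n * (s0 + s1) + 3 * s0**2) / (s0**2 * (n**2 - 1))
    return first - 1.0 / (n - 1) ** 2


# Hypothetical example: four area units arranged in a chain 1-2-3-4.
C_chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
x_chain = [1.0, 2.0, 3.0, 4.0]
mc = moran_coefficient(x_chain, C_chain)
var_mc = moran_variance_normal(C_chain)
```

For this chain, a steadily increasing variable yields a positive MC, as expected for similar values in adjacent units.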
1.2. Spatial autoregressive models

Suppose a conventional linear statistical model has the following form:

$$
Y = X\beta + \varepsilon \tag{A-3}
$$

where Y is an n-by-1 data vector, X is an n-by-(p + 1) matrix of p predictor variables and an intercept term, $\beta$ is a (p + 1)-by-1 vector of regression coefficients, and $\varepsilon$ is an n-by-1 random error vector whose elements are not independent. Then the SAR model rewrites equation (A-3) as:

$$
Y = \rho W Y + (I - \rho W) X\beta + \xi, \tag{A-4}
$$

where $\xi$ is a spatially filtered error term. An AR model is simply the SAR model without the $\rho W X\beta$ term:

$$
Y = \rho W Y + X\beta + \xi. \tag{A-5}
$$
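The difference between (A-4) and (A-5) can be made concrete by solving each equation for the error term. The following is a minimal sketch, assuming W is a row-standardized weights matrix held as a list of lists. The helper names and the three-unit example values are hypothetical and chosen only for illustration.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sar_error(y, X, beta, rho, W):
    """Error term of (A-4): Y - rho*W*Y - (I - rho*W)*X*beta."""
    xb = matvec(X, beta)
    wy = matvec(W, y)
    wxb = matvec(W, xb)
    return [yi - rho * wyi - (xbi - rho * wxbi)
            for yi, wyi, xbi, wxbi in zip(y, wy, xb, wxb)]


def ar_error(y, X, beta, rho, W):
    """Error term of (A-5): Y - rho*W*Y - X*beta."""
    xb = matvec(X, beta)
    wy = matvec(W, y)
    return [yi - rho * wyi - xbi for yi, wyi, xbi in zip(y, wy, xb)]


# Hypothetical example: three units in a chain, row-standardized W.
W = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # intercept plus one predictor
beta = [1.0, 2.0]
y = [3.0, 5.0, 7.0]                       # equals X*beta exactly
xi_sar = sar_error(y, X, beta, 0.5, W)
xi_ar = ar_error(y, X, beta, 0.5, W)
```

Because `y` equals X times beta here, the SAR error vanishes, while the AR error differs from it by exactly the omitted $\rho W X\beta$ term.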
1.3. The Jacobian term and its approximation

The Jacobian term for spatial autoregressive parameter estimation may be written as follows:

$$
J_w = -\sum_{i=1}^{n} \ln(1 - \rho\lambda_i)/n, \tag{A-6}
$$

where $\lambda_i$ are the eigenvalues of matrix W. One approximation for the Jacobian term is:

$$
-J = -\alpha_1 \left\{ \left[ \frac{\ln(1 - \rho\lambda_{min})}{\rho\lambda_{min}} + 1 \right] - \gamma_1 \ln(1 - \rho\lambda_{min}) \right\} - \alpha_2 \left\{ \left[ \frac{\ln(1 - \rho\lambda_{max})}{\rho\lambda_{max}} + 1 \right] - \gamma_2 \ln(1 - \rho\lambda_{max}) \right\}, \quad \rho \neq 0, \text{ and } -J = 0, \ \rho = 0 \tag{A-7}
$$

where

$$
\alpha_1 = 0.64617 + 0.88966\,\lambda_{min} + 1.35361 \left( \frac{\sum_{i=1}^{n} 1/n_i}{n} \right) (n - 8.32548)^{0.08541};
$$

$$
\alpha_2 = 0.65111 + 0.58813\,\hat{\alpha}_1 - 0.42683\,\hat{\gamma}_1 - 0.10948\,|\lambda_{min}|^{-1.42581};
$$

$$
\gamma_1 = 0.9771 + \frac{0.02760 - 0.07904 \left( \sum_{i=1}^{n} 1/n_i \right)/n}{1 - 0.99743\,e^{-0.01286 n}};
$$

$$
\gamma_2 = 1.01760 + \frac{0.02743 - 0.16523 \left( \sum_{i=1}^{n} 1/n_i \right)/n}{-1 + 1.01839\,e^{-1.00912 n}};
$$

$n_i = \sum_{j=1}^{n} c_{ij}$, the row sum of matrix C, n is the number of area units, and $\lambda_{min}$ and $\lambda_{max}$ are the minimum and maximum eigenvalues of matrix C or W, depending upon which one is used. Note that $\lambda_{max} = 1$ for matrix W.

For the irregular lattice case, the Jacobian term also can be quite accurately approximated by the following function when no information is available regarding the eigenvalues of matrices C and W (Griffith 1992):

$$
J = \alpha_{n,1} \ln(\delta_{n,1}) + \alpha_{n,2} \ln(\delta_{n,2}) - \alpha_{n,1} \ln(\delta_{n,1} + \rho) - \alpha_{n,2} \ln(\delta_{n,2} - \rho). \tag{A-8}
$$

According to Griffith's analysis of classical spatial statistical examples, the magnitudes of the calibrated values for $\alpha_{n,1}$, $\delta_{n,1}$, $\alpha_{n,2}$, $\delta_{n,2}$ are found to be remarkably consistent over the different geographic configurations and numbers of areal units. Consequently, the following generalized Jacobian approximation may be employed:

$$
J = 0.2 \ln(1.75) + 0.12 \ln(1.05) - 0.2 \ln(1.75 + \rho) - 0.12 \ln(1.05 - \rho) \tag{A-9}
$$
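To show how (A-6) and (A-9) might be evaluated, the following minimal sketch computes the Jacobian term from a supplied list of eigenvalues and from the generalized approximation. It assumes that (A-6) is read as the negative mean of ln(1 − ρλ_i), so that the term is zero at ρ = 0. The function names and the four eigenvalues in the example are hypothetical. They are not taken from the paper's data and do not describe an irregular lattice, so the two results below are not expected to agree closely.

```python
import math


def jacobian_from_eigenvalues(eigenvalues, rho):
    """Jacobian term as read from (A-6): -(1/n) * sum of ln(1 - rho*lambda_i)."""
    n = len(eigenvalues)
    return -sum(math.log(1.0 - rho * lam) for lam in eigenvalues) / n


def jacobian_generalized(rho):
    """Generalized approximation (A-9); defined for -1.75 < rho < 1.05."""
    return (0.2 * math.log(1.75) + 0.12 * math.log(1.05)
            - 0.2 * math.log(1.75 + rho) - 0.12 * math.log(1.05 - rho))


# Hypothetical eigenvalues of a small row-standardized W (maximum equals 1).
example_eigenvalues = [1.0, 0.0, -1.0, 0.0]
j_exact = jacobian_from_eigenvalues(example_eigenvalues, 0.5)
j_approx = jacobian_generalized(0.5)
```

The approximation needs only ρ, which is its appeal when the eigenvalues of C or W are unavailable or costly to compute.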