IDENTIFYING NON-HIERARCHICAL SPATIAL CLUSTERS

Alan T. Murray, Tony H. Grubesic
Department of Geography
The Ohio State University
Columbus, Ohio 43210-1361, USA
Email: [email protected]

Investigating relationships in information is an important component of statistical and spatial analysis. Clustering techniques have proven to be effective in helping to find patterns or relationships which otherwise may not have been detected. One of the most popular partitioning based clustering approaches is k-means. In this paper we explore the application of the k-means approach to spatial information in order to assess its characteristic features. One aspect of concern is implementation in commercial statistical software packages. In particular, the invariable sub-optimality (or poor solution quality) of identified clusters will be investigated. Another aspect of evaluation is a direct comparison to an alternative, spatially-based clustering approach. Although numerous authors over the past 30 or so years have discussed shortcomings associated with the use of the k-means approach, empirically based work which illustrates biasing effects is lacking. This paper contributes to the further understanding of clustering techniques.

Significance: This paper highlights the problematic features of non-hierarchical clustering approaches available in the most widely used statistical packages, particularly when used in the analysis of spatial information.

Keywords: Clustering, spatial analysis, spatial patterns, optimization, spatial data mining, location modeling

(Received 5 March 2001; Accepted 2 January 2002)

1. INTRODUCTION

Cluster detection continues to be an important exploratory technique in statistical analysis, utilized in most disciplines. The application areas range from relating business attributes in abstract dimensions to investigating cancer cases in a region. While the range of topics and instances will undoubtedly continue to diversify, it is the use of cluster analysis in the spatial setting that is of interest in this paper. Specifically, the existence of patterns across space may suggest various causal relationships in disease outbreak, crime activity, land use, etc. The inherent notion of location based upon the physical distance or travel time separating areas (or observations) being analyzed is particularly important, but this is not the only consideration. The use of cluster analysis in spatial instances also requires measurements of attribute similarity between observations. Thus, in spatially based applications of cluster analysis, observations are related in terms of both space and attribute variables. This contrasts with the more traditional statistical applications of cluster analysis, which tend to be aspatial, where relationships between observations of interest are defined and differentiated only on the basis of select variables.

The most prominent cluster analysis approach has been the k-means type process (or algorithm), which is detailed in Hartigan (1975) and Estivill-Castro and Murray (1998) among others. The basis of this approach is the use of centroids as cluster centers for creating a specified number of groups or partitions. The use of centroids implies that differences between observations are related in terms of a squared deviation measure from the partition centroid to which they belong. Much of current clustering research still builds upon the basic k-means process (e.g. Cuesta-Albertos et al. 1997; Huang 1998; Rollet 1998; Krishna and Murty 1999).
In this paper we are interested in evaluating optimality and appropriateness issues associated with the use of k-means for cluster analysis. Obtaining high quality solutions continues to be a focus of current research, as k-means is an inherently difficult optimization problem to solve optimally. Unfortunately, commercial statistical packages have not appreciated these optimization oriented complexities, nor have they implemented capabilities for a user to ensure that high quality solutions are obtained. The second issue of appropriateness is perhaps of even more significance for geographic research. Fundamentally, k-means in spatial applications represents the use of a squared distance measure to relate areas, which makes it difficult to interpret, explain, justify or reason about (Watson-Gandy 1972; Murray and Estivill-Castro 1998). Although Kaufman and Rousseeuw (1990), among others, note that the distance squared measure (or centroid) for relating observations may inappropriately influence how clusters are structured, there has not been accompanying empirical analysis to illustrate such side effects.

In this paper, k-means solution quality is evaluated as implemented in the commercial statistical packages commonly relied upon in scientific and engineering research, where these packages are the dominant tools. Another aspect of analysis is the relative performance and clustering results of k-means compared with a spatially-oriented clustering approach. The following sections detail the k-means clustering model and review its inherent features. A spatial clustering model variant is then reviewed. Comparative results are detailed. Finally, discussion and conclusions are provided.

2. K-MEANS CLUSTERING

The use of clustering in statistical analysis has been significant. Fisher (1958) and MacQueen (1967) may be credited with the development of the classic k-means approach (or algorithm) typically relied upon in non-hierarchical clustering analysis. The defining characteristic of this approach is the use of centroids for grouping observations. Reviews of k-means may be found in Belbin (1987), Arabie and Hubert (1996) and Hansen and Jaumard (1997). K-means continues to be the dominant non-hierarchical partitioning technique, particularly in commercial statistical software packages. Further, as noted previously, substantial research continues to be devoted to the use and implementation of k-means based approaches. The following notation will be used to specify this clustering model:

    i = index of observations (total number = n);
    a_i = attribute weight of observation i;
    k = index of clusters (total number = p);
    d_ik = difference measure relating observation i and cluster k;
    z_ik = 1 if observation i is in cluster k, 0 otherwise.

K-means

    Minimize  Σ_i Σ_k a_i d_ik² z_ik        (1)

    Subject to:

    Σ_k z_ik = 1        ∀ i                 (2)

    z_ik ∈ {0, 1}       ∀ i, k              (3)

The objective (1) of the k-means model is to minimize the total weighted squared difference in cluster group membership. This is equivalent to minimizing the within group sum of squares (Kaufman and Rousseeuw 1990). Constraint (2) ensures that each observation is assigned to a cluster group. Constraint (3) imposes integer restrictions on decision variables. It should be noted that throughout this paper the difference measure, d_ik, is assumed to be a Euclidean based metric. The non-linear form of objective (1) means that linear programming based approaches cannot be used to solve k-means optimally. Given this, heuristic solution techniques continue to be a necessity for practical applications. The most successful and widely applied heuristic for k-means is:

Alternating Heuristic
    (i) Generate p clusters.
    (ii) Identify a center for each cluster.
    (iii) Assign observations to their closest center.
    (iv) If cluster groupings have changed, return to (ii). Otherwise, a local optimum has been reached and the heuristic stops.

Initial clusters are typically generated randomly in step (i). When the alternating heuristic is used to solve k-means, the "center" in step (ii) corresponds to the cluster centroid. An important requirement in the application of this heuristic is that it be repeated multiple times with different initial cluster configurations in step (i). This increases the likelihood that a global optimum, rather than a local one, is identified. It is in this respect that commercial implementations of the k-means approach fail to give the matter sufficient attention.
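The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' Fortran implementation; the function names and the random re-start scheme are our own, and planar Euclidean coordinates with observation weights a_i are assumed:

```python
import random

def kmeans_alternating(points, weights, p, seed=0):
    """One run of the alternating heuristic for k-means (objective 1).

    points  -- list of (x, y) observation coordinates
    weights -- list of attribute weights a_i
    p       -- number of clusters
    """
    rng = random.Random(seed)
    # (i) Generate p initial clusters by randomly assigning observations.
    assign = [rng.randrange(p) for _ in points]
    centers = []
    for _ in range(100):  # iteration cap as a simple safety net
        # (ii) Identify a center (weighted centroid) for each cluster.
        centers = []
        for k in range(p):
            members = [i for i, c in enumerate(assign) if c == k]
            if not members:  # re-seed an empty cluster with a random observation
                members = [rng.randrange(len(points))]
            w = sum(weights[i] for i in members)
            centers.append((sum(weights[i] * points[i][0] for i in members) / w,
                            sum(weights[i] * points[i][1] for i in members) / w))
        # (iii) Assign observations to their closest center.
        new_assign = []
        for (x, y) in points:
            d2 = [(x - cx) ** 2 + (y - cy) ** 2 for (cx, cy) in centers]
            new_assign.append(d2.index(min(d2)))
        # (iv) Groupings unchanged -> local optimum reached; stop.
        if new_assign == assign:
            break
        assign = new_assign
    return assign, centers

def kmeans_restarts(points, weights, p, restarts=100):
    """Repeat the heuristic from many random starts; keep the best objective (1)."""
    def objective(assign, centers):  # weighted within-group sum of squares
        return sum(weights[i] * ((x - centers[k][0]) ** 2 + (y - centers[k][1]) ** 2)
                   for i, ((x, y), k) in enumerate(zip(points, assign)))
    best_val, best_assign = None, None
    for s in range(restarts):
        assign, centers = kmeans_alternating(points, weights, p, seed=s)
        val = objective(assign, centers)
        if best_val is None or val < best_val:
            best_val, best_assign = val, assign
    return best_val, best_assign
```

The 10,000 random re-starts used later in the paper correspond to the same loop with restarts=10000; the key point is that a single start (as in the commercial packages discussed below) keeps only one arbitrary local optimum.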

3. SOLUTION QUALITY

From an optimization perspective, it is particularly problematic that one is not able to control the performance of the k-means heuristic solution technique as implemented in most, if not all, statistical packages. This matters because solution quality is likely to be extremely poor when only a single start of the heuristic is utilized rather than multiple re-starts. In this section we investigate the performance of a number of statistical packages using four spatial applications. The first application contains 33 observations representing emergency service call locations in Austin, Texas (Daskin 1982). The second application contains 55 spatial observations associated with relative air travel volumes from the Washington, D.C. region (see Murray 2000). The third application contains 152 observations representing coffee buying centers in Busoga, Uganda (see Murray 2000). These three applications each have varying demand values associated with the represented observations (a_i is not equal to 1, in contrast to the assumption of most statistical packages). The final application contains 114 crime event locations in Akron, Ohio (Edgewood neighborhood) for 1999. In this case the demand values are equal to one (a_i = 1). ArcView version 3.2, a commercial geographic information system (GIS), was utilized for data processing, display and management on a Pentium III/600 personal computer.

The only statistical package capable of handling observation weights not equal to one in non-hierarchical cluster analysis is SAS, so we begin with an evaluation of SAS (release 6.12) using the Austin, Washington and Busoga applications. The FASTCLUS procedure provided in SAS employs the alternating heuristic to solve k-means. However, only one initial cluster configuration is used (and allowed) in this implementation. SAS output from the FASTCLUS procedure does not include solution quality information, so the objective (1) must be calculated externally using identified cluster membership.
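Such an external evaluation of objective (1) is simple to sketch. The function below is a hypothetical stand-in for the authors' Fortran code (names are our own; planar Euclidean coordinates are assumed): it recomputes each cluster's weighted centroid from the reported membership and accumulates the weighted squared distances.

```python
def kmeans_objective(points, weights, assign, p):
    """Evaluate objective (1): weighted within-group sum of squared distances
    from each observation to the centroid of its assigned cluster.

    points  -- list of (x, y) observation coordinates
    weights -- list of attribute weights a_i
    assign  -- cluster index (0..p-1) reported for each observation
    """
    total = 0.0
    for k in range(p):
        members = [i for i, c in enumerate(assign) if c == k]
        if not members:
            continue
        # Weighted centroid of the reported cluster membership.
        w = sum(weights[i] for i in members)
        cx = sum(weights[i] * points[i][0] for i in members) / w
        cy = sum(weights[i] * points[i][1] for i in members) / w
        total += sum(weights[i] * ((points[i][0] - cx) ** 2 +
                                   (points[i][1] - cy) ** 2)
                     for i in members)
    return total
```

The percentage deviations reported below are then simply 100 * (package_value - best_known_value) / best_known_value.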
The k-means objective value for associated FASTCLUS solutions was computed using Fortran code. Given the functional evaluation (the k-means objective (1)), we now have one component necessary for solution quality assessment. As discussed previously, k-means cannot be solved optimally. However, the alternating heuristic with a substantial number of re-starts may be used to identify the best known solution. The alternating heuristic for k-means was coded in Fortran, compiled using Compaq Visual Fortran version 6.1 as a dynamic linked library (DLL), and integrated in ArcView using Avenue script. This allows us to compare the quality of k-means clusters generated using SAS (or any other package) to the best known solutions for a particular application. In this research, 10,000 randomly generated initial clusterings (re-starts) of the alternating heuristic were used to solve each problem application. Although the resulting solutions may not in fact be globally optimal, they do provide a basis for comparing SAS results. If the solutions from SAS are of poorer quality, then the SAS objective function value will be higher than the best solution found for k-means using 10,000 re-starts of the alternating heuristic. The SAS results may be summarized graphically as a percentage deviation from the best known solution for each problem. This comparison is shown in Figure 1 for the Austin (p=3-6), Washington (p=5-10) and Busoga (p=5-15) applications. Figure 1 illustrates that the k-means clusters identified using SAS deviate substantially from the best known k-means solutions. That is, the SAS solutions are not only sub-optimal, but far from optimal in most cases. The minimum deviation of 1.40% is found for the Austin application when p=3. This is in contrast to the maximum deviation of 936.77% found for the Uganda application when p=14. In fact, all of the Uganda solutions generated using SAS are found to be of extremely poor quality as measured by the k-means objective (1).
This is not particularly surprising given that only one initial seed of the heuristic is utilized in FASTCLUS. Nevertheless, it is problematic and highlights the optimality issues being ignored in the implementation of the k-means optimization approach. Also troubling is the observed trend that as problem size increases, SAS solution quality deteriorates dramatically.

SPSS (version 9) and Splus (version 4.5) also provide the capability to perform non-hierarchical clustering using the k-means approach. As mentioned previously, however, these packages limit the weighting of observations to values of one (i.e. a_i = 1). The final application of crime incidents in Akron, Ohio, where weights are one for each observation, will therefore be used to evaluate SPSS, Splus and SAS. Similar to Figure 1, deviation from the best known solution for the Akron application (p=4-20) is summarized in Figure 2 for the three statistical packages. SAS continues to stand out as producing sub-optimal clusters that deviate significantly from the best known solutions. In Figure 2 the SAS clusters have an average deviation of 20.62% and a maximum of 42.01% for p=19. The clusters produced by SPSS are somewhat better, with an average deviation of 9.31% and a maximum of 28.23% for p=19. The clusters identified using Splus are consistently closer to the best known solutions (with the exception of p=18). The average deviation of the Splus solutions is 2.34%, with a maximum of 8.49% for p=11. The findings summarized in Figure 2 suggest that all three packages have some shortcomings with respect to solution quality, although clusters identified using Splus are consistently better than those generated by SAS and SPSS.

[Figure 1 omitted: objective value deviation from the best known solution for the Austin (33 observations, p=3-6), Washington (55 observations, p=5-10) and Busoga (152 observations, p=5-15) applications, ranging from 0% to 1000%.]

Figure 1. SAS FASTCLUS solution quality.

Two comments are in order before continuing. First, the deviations shown in Figures 1 and 2 represent lower bounds, as there may in fact be better solutions for the problems solved. Thus, the SAS, Splus and SPSS results may be worse than they appear in Figures 1 and 2. Second, it is possible for a user to run the FASTCLUS procedure in SAS with different initial clusters by modifying the RANDOM parameter. This would be equivalent to a random re-start. However, the alternative clusters generated cannot be assessed by the user because no objective value information is provided. This does not help one to make an assessment regarding solution quality, so the potential value of the random re-start capability is lost.

[Figure 2 omitted: objective value deviation from the best known solution (0-45%) for SAS, SPSS and SPlus on the Akron application, plotted against the number of clusters (p=4-20).]

Figure 2. Analysis of crime in Akron using three common statistical packages.

4. SPATIAL CLUSTERING

The previous discussion has highlighted the optimization oriented deficiencies found in the implementation of k-means in the most widely utilized statistical packages. Even if this issue is ignored, there are structural concerns regarding the use of k-means in the spatial domain. Although such concerns have been discussed in the literature (Watson-Gandy 1972; Kaufman and Rousseeuw 1990; Murray and Estivill-Castro 1998), few studies have focused on partition representation fundamentals in non-hierarchical clustering. Exceptions are the direct and indirect comparisons of the k-means approach to hierarchical techniques (see Klastorin 1985; Belbin 1987; Milligan 1996).

Although spatial applications of cluster analysis have been of interest for decades, increased access to spatial information continues to expand the application areas of cluster detection approaches. Further, easy-to-use geographical information systems (GIS) and their integration in major commercial statistical packages have given rise to large scale instances of spatially based applications of cluster analysis. One example is SAS, which now has an integrated GIS component, SAS/GIS.

For modeling tools and techniques such as cluster analysis, it is essential that developed and implemented approaches not be biased on the basis of the data utilized. The within group variance minimization approach of k-means (using partition centroids) seems reasonable for aspatial applications of cluster analysis, but is questionable in the geographic domain. Is there a physical interpretation of a distance squared measure if k-means is utilized? For spatial applications, centroid based cluster analysis has been shown to be problematic, primarily with respect to the undue influence of outliers (Watson-Gandy 1972; Murray and Estivill-Castro 1998). This has not, however, altered the use of k-means based approaches in geographic research. Given this, alternatives need to be explored and comparative analysis is necessary.
An alternative to the k-means approach is to consider a modified interpretation of the partition center. Specifically, the partition center can be an equilibrium location rather than a centroid. This interpretation emerged from the location modeling work of Cooper (1963). The major difference between the location model of Cooper (1963) and k-means is the impact that the difference measure has on cluster structure. The formulation of this alternative clustering model is:

Location Cluster Model (LCM)

    Minimize  Σ_i Σ_k a_i d_ik z_ik         (4)

    Subject to:

    Σ_k z_ik = 1        ∀ i                 (5)

    z_ik ∈ {0, 1}       ∀ i, k              (6)

The objective (4) of the LCM is to minimize the total weighted difference in cluster group membership. Constraint (5) ensures that each observation is assigned to a cluster group. Constraint (6) imposes integer restrictions on decision variables. The distinction between objective (1) in k-means and objective (4) in the LCM is the contribution of the difference measure, d_ik. Specifically, objective (1) utilizes d_ik², whereas objective (4) includes only d_ik.

Solving the LCM is challenging given that it is non-linear. Only small (n ≈ 55) problem instances of the LCM may be solved optimally (Rosing 1992; Murray 1999), so heuristic solution approaches are essential for solving medium to large problem instances of the LCM. The alternating heuristic has proven to be the most effective technique for solving the LCM. Convergence properties and associated discussion may be found in Cooper (1963) and Rosing (1992). In contrast to the k-means implementation of this solution approach, adapting the alternating heuristic for the LCM requires only that cluster centers have a slightly different interpretation. The inherent structural properties responsible for the grouping of observations reflect the difference between the LCM and k-means. For k-means, the centers correspond to group centroids as defined by objective (1). For the LCM, the centers correspond to equilibrium locations (referred to as Weber points in the location literature). The centroid and equilibrium locations do not represent the same point for a given set of observations, as discussed in Watson-Gandy (1972) and Wesolowsky (1993). The centroid for each cluster group k in k-means is defined as follows:

    x̃_k = ( Σ_{i∈C_k} a_i x_i ) / ( Σ_{i∈C_k} a_i )        (7)

    ỹ_k = ( Σ_{i∈C_k} a_i y_i ) / ( Σ_{i∈C_k} a_i )        (8)

where

    C_k = set of observations i in cluster k;
    (x_i, y_i) = coordinates of observation i;
    (x̃_k, ỹ_k) = center coordinates of cluster k.

The equilibrium locations delineating cluster groups in the LCM are defined by differentiating objective (4) with respect to x and y and setting the partial derivatives equal to zero. This identifies an optimal equilibrium center (x̃_k, ỹ_k) for each cluster group k as follows:

    x̃_k = Σ_{i∈C_k} [ a_i x_i / ((x̃_k − x_i)² + (ỹ_k − y_i)²)^(1/2) ] / Σ_{i∈C_k} [ a_i / ((x̃_k − x_i)² + (ỹ_k − y_i)²)^(1/2) ]        (9)

    ỹ_k = Σ_{i∈C_k} [ a_i y_i / ((x̃_k − x_i)² + (ỹ_k − y_i)²)^(1/2) ] / Σ_{i∈C_k} [ a_i / ((x̃_k − x_i)² + (ỹ_k − y_i)²)^(1/2) ]        (10)

It is obvious that the equilibrium location (equations 9 and 10) is mathematically more complex than the centroid (equations 7 and 8). It is not possible to isolate the x̃_k and ỹ_k variables as is possible when the centroid is utilized. Thus, solving for the group centers in the LCM is not computationally straightforward. Fortunately, it is possible to obtain the optimal solution for the center location using an iterative scheme, as discussed in Wesolowsky (1993). Thus, optimal partition centers may be identified for both k-means and the LCM.¹ The significance of this discussion is that solving k-means or the LCM using the alternating heuristic requires a different interpretation and implementation of the cluster center. Comparatively, the LCM has an added computational burden in that one must employ an iterative procedure to identify cluster centers. This is in contrast to the closed form solution possible using the centroid (equations 7 and 8). Given this difference, it becomes clear why there may be an industry preference for the centroid approach (k-means), as there are processing speed gains associated with its use. However, as pointed out in Watson-Gandy (1972) and Estivill-Castro and Murray (1998), the use of the centroid in geographic space is fundamentally flawed. We can now examine the extent to which partitions differ based upon the use of the centroid versus the equilibrium location in cluster analysis.
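The iterative scheme referred to above is commonly known as Weiszfeld's procedure: equations (9) and (10) are applied as a fixed-point update until the center stops moving. Below is a minimal Python sketch (our own naming; it starts from the weighted centroid of equations (7) and (8) and simply stops if an iterate lands exactly on an observation, a degenerate case treated more carefully in the location literature):

```python
import math

def weber_point(points, weights, tol=1e-9, max_iter=1000):
    """Weighted equilibrium (Weber) point via fixed-point iteration of (9)-(10)."""
    # Start from the weighted centroid (equations 7 and 8).
    w = sum(weights)
    x = sum(wi * xi for wi, (xi, yi) in zip(weights, points)) / w
    y = sum(wi * yi for wi, (xi, yi) in zip(weights, points)) / w
    for _ in range(max_iter):
        num_x = num_y = den = 0.0
        for wi, (xi, yi) in zip(weights, points):
            d = math.hypot(x - xi, y - yi)
            if d == 0.0:  # iterate sits exactly on an observation; stop here
                return (x, y)
            num_x += wi * xi / d
            num_y += wi * yi / d
            den += wi / d
        nx, ny = num_x / den, num_y / den
        if math.hypot(nx - x, ny - y) < tol:  # converged
            return (nx, ny)
        x, y = nx, ny
    return (x, y)
```

For a symmetric configuration the Weber point and the centroid coincide, but adding a single distant observation drags the centroid much further than the equilibrium location, which is exactly the outlier sensitivity of k-means discussed in this section.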

5. COMPARISONS

The four previously described spatial applications (Austin, Washington, Busoga and Akron) will be used to examine a range of cluster values for comparative analysis. Spatially based problem instances allow us to assess how cluster structure and membership are altered when d_ik has a real and physical interpretation. The initial focus of comparison is on functional performance. Specifically, this involves the evaluation of the two clustering models using a common objective function, as done in Murray (1999, 2000). Cluster model comparisons for the indicated range of partitions, p, are summarized in Figure 3 for the Austin (33 observations), Washington (55 observations), Busoga (152 observations) and Akron (114 observations) applications. Figure 3 shows the resulting functional deviation of the k-means clusters explicitly evaluated using the LCM as a percentage of the best known LCM solution. As an example, for the Austin application when p=5, the best k-means clustering is evaluated as an LCM solution and compared to the best LCM clustering. In this case the clusters identified using k-means were found to be 6.26% higher than the best identified LCM partition. The lowest deviation in Figure 3 is 0.84% (p=5 in the Akron application) and the highest is 15.59% (p=12, also for the Akron application). The average difference between k-means based partitions and the LCM groupings summarized in Figure 3 is 5.48%. The deviations observed in Figure 3 indicate that clusters produced using k-means are not functionally equivalent to LCM clusters. While this is not entirely surprising, it suggests that k-means is unstable with respect to spatial interpretation. The bias inherent in k-means results in clusters that are sub-optimal when evaluated in geographic space using the LCM. This confirms that there are problems with the use of centroids for cluster centers when the difference measure, d_ik, has a physical interpretation.

¹ This does not mean that a global optimum may necessarily be identified for either model, however.

[Figure 3 omitted: objective value difference (0-16%) between k-means clusters evaluated with the LCM objective and the best known LCM solutions, for Austin (33 observations, p=3-6), Washington (55 observations, p=5-10), Busoga (152 observations, p=5-15) and Akron (114 observations, p=4-20).]

Figure 3. Functional comparison of k-means and LCM.

A second basis for comparison is the computational effort required to solve k-means and the LCM using the alternating heuristic. For the Austin applications (33 observations), both approaches required approximately 0.001 seconds per solution. For the Washington applications (55 observations), a high of 0.0019 seconds per solution was required for k-means, and at most 0.0023 seconds per solution was necessary for the LCM. The Busoga applications (152 observations) required a high of 0.0073 seconds per solution for k-means and a high of 0.0117 seconds per solution for the LCM. Finally, the Akron applications (114 observations) had a high of 0.0044 seconds per solution for k-means and approximately 0.0075 seconds per solution for the LCM. Based upon these findings, it appears that almost twice the computational effort was needed to solve the LCM, which may be attributed to the need to iterate for cluster centers (optimal equilibrium locations, but not necessarily optimal solutions).

A final comparison of k-means and the LCM involves the examination of spatial variation. Of course, some degree of spatial variation is expected given the functional differences observed previously. Since these are spatial applications, however, we would like to better understand the extent of these spatial differences. Figures 4 and 5 present cluster solutions for a selected value of p for k-means and the LCM. Figure 4 depicts the p=5 clusters for the Washington application and shows that k-means and the LCM identify very different clusters spatially. Figure 5 shows twelve clusters of crime occurrence in Akron. There is considerable spatial variation in the two depicted cluster groupings in Figure 5. The comparisons presented in Figures 4 and 5 are straightforward and easy to understand, but they obviously do not depict the variation associated with all of the application results.
One way to address this is to present figures for each evaluated configuration of clusters (38 would be needed in this case). Alternatively, reporting a summary measure of spatial similarity between two sets of clusters is more practical. Murray (2000) developed such a summary measure of cluster similarity, based on finding the maximum amount of similarity that exists between two cluster partitions. Thus, it is possible to compare k-means partitions to LCM partitions in a quantitative manner. Figure 6 summarizes these findings. For each value of p, the cluster similarity is expressed as the percent overlap between the two solutions. Thus, the most similar that the k-means and LCM partitions could be, for a given value of p, is when they completely coincide, or have 100% overlap. The extreme results for each application are p=4 for Austin, p=5 for Washington, p=10 for Busoga, and p=20 for Akron, which overlap 51.52%, 67.27%, 64.47%, and 78.95% respectively. There is little similarity between the k-means and LCM partitions in these instances.
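The general idea of such an overlap measure can be sketched as follows (this is a simplified illustration of the concept, not necessarily Murray's (2000) exact formulation): relabel one partition's clusters so as to maximize the number of observations whose memberships coincide, then report that maximum as a percentage of n. The brute-force enumeration below is only practical for small p; in practice an assignment-problem solver would be used instead.

```python
from itertools import permutations

def cluster_overlap(assign_a, assign_b, p):
    """Percent of observations whose cluster membership coincides under the
    best relabeling of partition B's clusters onto partition A's clusters.

    assign_a, assign_b -- cluster index (0..p-1) of each observation
    """
    n = len(assign_a)
    best = 0
    for perm in permutations(range(p)):  # try every relabeling of B
        matched = sum(1 for a, b in zip(assign_a, assign_b) if a == perm[b])
        best = max(best, matched)
    return 100.0 * best / n
```

Two identical partitions score 100% regardless of how their clusters happen to be numbered, while unrelated partitions score much lower.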

[Figure 4 omitted: map of the 55 Washington observations with the k-means and LCM groupings delineated for p=5.]

Figure 4. Washington clusters for p=5.

[Figure 5 omitted: map of the 114 Akron crime event locations with the k-means and LCM groupings delineated for p=12.]

Figure 5. Akron crime clusters (p=12).

6. DISCUSSION

The comparisons of the clusters produced by the k-means and LCM approaches have shown significant differences in terms of functional evaluation as well as spatial structure. It is clear that the unintended influence of outlying observations impacts cluster configuration when the centroid based approach is utilized. This very much supports the assertions made by Watson-Gandy (1972) and Murray and Estivill-Castro (1998), among others, that the use of distance squared measures (or the centroid approach) is problematic. Thus, when the difference measure has a physical interpretation, it is possible to observe the dubious features of the k-means approach, which may be attributed to the use of the centroid for cluster delineation. This finding suggests that even abstract applications of cluster analysis may be ill suited to k-means based approaches. Given these findings, rationalizing the use of k-means for cluster analysis on the basis of computational savings would appear to be unjustified.

Given that k-means is inappropriate in this setting and that solutions obtained from commercial software may be far from optimal, the serious prospect of legal liability arises for analysts and commercial providers. Information provision is now a multi-billion dollar industry, and users of this information have expectations about its quality (Epstein et al. 1998). Research associated with GIS use has recognized the legal ramifications of using spatial information for education, land use planning, regional policy making, etc. (Epstein 1991). Along with user expectations that data quality conform to high standards that are well documented, the time cannot be far away when modeling tools and techniques are held to equivalent standards. After all, these are the analytical methods which process and evaluate this information in various ways. Certainly the issues of model solution quality are closely aligned with data quality expectations. Thus, liability for problematic or fundamentally questionable analytical tools and techniques should be given serious consideration.

[Figure 6 omitted: percentage overlap (50-100%) between k-means and LCM partitions for Austin (p=3-6), Washington (p=5-10), Busoga (p=5-15) and Akron (p=4-20).]

Figure 6. K-means and LCM cluster overlap.

7. CONCLUSIONS

It is fairly clear that cluster analysis will be increasingly applied in spatial settings given the proliferation of GIS and associated spatial information. Further, many commercial statistical packages readily integrate information managed by GIS software, or allow such information to be incorporated directly into GIS analysis functionality. Despite the traditions and past acceptance of particular analytical approaches and techniques, the use of spatial information often dictates that unique considerations be taken into account. Perhaps the clearest example of this is the existence of dependencies in spatial information: spatial autocorrelation. Indeed, a new field of statistics has emerged to contend with issues associated with geographically based analysis.

What has been shown in this paper is that the use of k-means for non-hierarchical cluster analysis applied to spatially referenced information is problematic. This is a significant empirical finding when one considers that most, if not all, commercial statistical software packages utilize the basic k-means type of approach and also have the ability, either directly or indirectly, to analyze geographic information.

We have detailed k-means and the location cluster model (LCM) in this paper. K-means is commonly used for clustering in statistical analysis. The LCM is an alternative cluster model which may be interpreted and justified in the spatial domain. Comparisons between k-means and the LCM using spatial applications illustrated functional and structural differences between the two approaches. These differences do not support the use of k-means over the LCM on the basis that it is more computationally efficient to solve.

REFERENCES

1. Arabie, P. and Hubert, L. (1996). An overview of combinatorial data analysis. In Clustering and Classification (Ed.: P. Arabie, L. Hubert and G. De Soete), 5-63 (World Scientific: New Jersey).
2. Belbin, L. (1987). The use of non-hierarchical allocation methods for clustering large sets of data. Australian Computer Journal, 19: 32-41.
3. Cooper, L. (1963). Location-allocation problems. Operations Research, 11: 331-343.
4. Cuesta-Albertos, J., Gordaliza, A. and Matran, C. (1997). Trimmed k-means: an attempt to robustify quantizers. Annals of Statistics, 25: 553-576.
5. Daskin, M. (1982). Application of an expected covering model to emergency medical service system design. Decision Sciences, 13: 416-439.
6. Epstein, E. (1991). Legal aspects of GIS. In Geographical Information Systems: Principles and Applications (Ed.: D. Maguire, M. Goodchild and D. Rhind), 489-502 (John Wiley: New York).
7. Epstein, E., Hunter, G. and Agumya, A. (1998). Liability insurance and the use of geographic information. International Journal of Geographical Information Science, 12: 203-214.
8. Estivill-Castro, V. and Murray, A.T. (1998). Mining spatial data via clustering. In Research and Development in Knowledge Discovery and Data Mining (Ed.: X. Wu, R. Kotagiri and K.B. Korb), 110-121 (Springer: New York).
9. Fisher, W. (1958). On grouping for maximum homogeneity. Journal of the American Statistical Association, 53: 789-798.
10. Hansen, P. and Jaumard, B. (1997). Cluster analysis and mathematical programming. Mathematical Programming, 79: 191-215.
11. Hartigan, J. (1975). Clustering Algorithms (John Wiley: New York).
12. Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2: 283-304.
13. Kaufman, L. and Rousseeuw, P. (1990). Finding Groups in Data: An Introduction to Cluster Analysis (John Wiley: New York).
14. Klastorin, T. (1985). The median problem for cluster analysis: a comparative test using the mixture model approach. Management Science, 31: 84-95.
15. Krishna, K. and Murty, M. (1999). Genetic k-means algorithm. IEEE Transactions on Systems, Man and Cybernetics, Part B, 29: 433-439.
16. MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Ed.: L. Le Cam and J. Neyman), Vol. 1: 281-297 (University of California: Berkeley).
17. Milligan, G. (1996). Clustering validation: results and implications for applied analysis. In Clustering and Classification (Ed.: P. Arabie, L. Hubert and G. De Soete), 341-375 (World Scientific: New Jersey).
18. Murray, A. (1999). Spatial analysis using clustering methods: evaluating the use of central point and median approaches. Journal of Geographical Systems, 1: 367-383.
19. Murray, A. (2000). Spatial characteristics and comparisons of interaction and median clustering models. Geographical Analysis, 32: 1-18.
20. Murray, A. and Estivill-Castro, V. (1998). Cluster discovery techniques for exploratory spatial data analysis. International Journal of Geographical Information Systems, 12: 431-443.
21. Rollet, R., Benie, G., Li, W., Wang, S. and Boucher, J. (1998). Image classification algorithm based on the RBF neural network and k-means. International Journal of Remote Sensing, 19: 3003-3009.
22. Rosing, K. (1992). An optimal method for solving the (generalized) multi-Weber problem. European Journal of Operational Research, 58: 414-426.
23. Watson-Gandy, D. (1972). A note on the centre of gravity in depot location. Management Science, 18: B-478-481.
24. Wesolowsky, G. (1993). The Weber problem: history and perspectives. Location Science, 1: 5-23.
