Fast Spatial Clustering with Different Metrics and in the Presence of Obstacles

Vladimir Estivill-Castro
Department of CS & SE, The University of Newcastle, Callaghan, NSW 2308, Australia

Ickjai Lee
Department of CS & SE, The University of Newcastle, Callaghan, NSW 2308, Australia

ABSTRACT

In many GIS settings, the Euclidean metric is not applicable as the model for distance between points. Other geometric models are needed in many practical scenarios, of which urban geography is a common example. Recently, Estivill-Castro and Lee [8] proposed an effective and efficient boundary-based clustering method that overcomes drawbacks of traditional spatial clustering, but it has a geometric focus. By factoring out the topological aspects of the method, we obtain a generic boundary-based clustering that generalizes robustly to arbitrary Minkowski distances and is capable of handling obstacles. We illustrate this with the Manhattan distance and the Dominance distance. Experiments demonstrate that our method consistently finds various types of high-quality clusters in subquadratic time.

1. INTRODUCTION

Automated data-collection technology is continuously producing vast amounts of spatial data. Detecting patterns in such large spatial databases is now core business in data mining and GIS. Spatial clustering is a fundamental tool for detecting such patterns. It partitions a set of geo-referenced point data, P = {p1, p2, ..., pn}, in some study region S into smaller homogeneous groups according to spatial proximity under a metric. These distinct groups of high spatial concentration are indicative of phenomena with spatial association. Explaining them yields useful insights, or they suggest hypotheses that provide the ground for further exploratory analysis.

Spatial clustering approaches [4, 5, 6, 12, 20, 23, 24, 27] studied within the GIS community have different strengths and weaknesses, but most share several common drawbacks. First, these semi-automatic clustering methods require some prior knowledge from end-users to produce their best groupings [8]. More seriously, spatial clustering methods ignore the characteristics of geo-referenced data (for example, spatial dependence and spatial heterogeneity) [2] that make space special [9]. Finally, spatial clustering methods based on the Euclidean distance assume a geometric model in which the amount of separation is based on the existence of a straight-line path between pairs of points. In many spatial settings, the Euclidean distance is not applicable.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 2001 ACM X-XXXXX-XX-X/XX/XX ...$5.00.

Recently, Estivill-Castro and Lee [8] proposed a fast and robust boundary-based spatial clustering criterion. This Short-Long criterion requires the Euclidean distance as the underlying geometric model from which topological information is derived. In particular, the topological information captured is the set of spatial neighbors (encoded as a proximity graph). This boundary-based clustering overcomes the shortcomings of semi-automatic clustering and satisfies the special needs of geoinformation. Further, it is able to detect various kinds of high-quality clusters that imitate real-world situations, such as non-convex clusters, clusters of different densities, clusters of various sizes, clusters in the presence of noise, clusters linked by multiple bridges, sparse clusters adjacent to high-density clusters, and closely located high-density clusters.

We derive from this method a topological generalization that is easy to use for exploratory data analysis under alternative geometries; in particular, we extend boundary-based clustering to geometric scenarios beyond the Euclidean distance. For example, in urban geography, the Manhattan distance better approximates real-world situations [15]. This paper applies the Short-Long criterion with the Manhattan metric and the Dominance metric for mining patterns in urban geography, or in situations where paths run only horizontally and vertically (grid-like paths). This extension allows users to compare and contrast clusterings of a set P under different metrics. Our exploratory tool enables users to plug in the metric they desire; the metric is used as a parameter of the process. We further extend the Short-Long criterion to the presence of obstacles.
We also show that the user can select and explore different criteria by which the geometry of obstacles affects the relation "is nearby neighbor". In addition, we show that the user can articulate several levels of influence in the presence of obstacles and analyze or contrast them at the expense of some computational trade-offs.

Section 2 demonstrates how neighboring information (and thus topological information) varies with geometric information, that is, with the choice of metric. Section 3 highlights the operation of the Short-Long criterion in the topological model. Because the clustering step operates on topological information only, it can be extended beyond the Euclidean metric. In Section 4, we analyze clustering in the presence of obstacles. Section 5 provides experimental results on real data. Our experiments demonstrate that the Short-Long criterion not only produces high-quality spatial concentrations, but also applies robustly to different distance metrics. Finally, the last section draws conclusions.

2. SPATIAL NEIGHBORS

Modeling topological relations like "pi is nearby neighbor of pj" is central to spatial clustering. We expect the relation "pi is in the same spatial cluster as pj" to have some correlation with the "is nearby neighbor" relation. Topological predicates, such as declaring two objects neighbors if they intersect or if they share a common boundary, apply directly to area objects but only indirectly to point data, since points neither intersect nor share boundaries unless they coincide. Topological information is therefore derived for point data through point-to-area transformations: two points are regarded as neighbors if their transformed areas share a common boundary. A widely adopted transformation is to assign every location in the study region S to the nearest pi ∈ P under a given metric. This transformation divides S into regions that do not overlap except at their boundaries. The resulting tessellation is the well-known Voronoi diagram. Thus, two points are neighbors if and only if their corresponding Voronoi regions share a common Voronoi edge. Connecting each pair of neighboring points in the Voronoi diagram yields another tessellation, the Delaunay triangulation. This dual graph explicitly encodes spatial neighborhood information. For illustration, we use three instances of the Minkowski metric, namely p = 1, p = 2 and p = ∞. Let dLp :
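To make the dependence of the nearest-point assignment on the metric concrete, the following minimal Python sketch evaluates the standard Minkowski distance for p = 1, p = 2 and p = ∞ and assigns a query location to its nearest point under each. The function names `minkowski` and `nearest_site` and the sample coordinates are assumptions made for illustration; they do not come from the paper, and the sketch is a brute-force check, not the Voronoi or Delaunay construction itself.

```python
import math


def minkowski(a, b, p):
    """Standard Minkowski distance between 2-D points a and b.

    p = 1 gives the Manhattan distance, p = 2 the Euclidean distance,
    and p = math.inf the maximum-coordinate distance.
    """
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    if math.isinf(p):
        return max(dx, dy)
    return (dx ** p + dy ** p) ** (1.0 / p)


def nearest_site(query, sites, p):
    """Index of the site closest to query under the chosen metric (brute force)."""
    return min(range(len(sites)), key=lambda i: minkowski(query, sites[i], p))


# Hypothetical data: two points of P and one location in the study region.
sites = [(0.0, 0.0), (5.0, 2.0)]
query = (2.0, 2.0)

for p in (1, 2, math.inf):
    print(p, nearest_site(query, sites, p))
```

With these sample coordinates the query location is assigned to the second point under p = 1 but to the first point under p = 2 and p = ∞, so the Voronoi region containing it, and potentially the neighbor relation derived from the tessellation, changes with the metric.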
