Efficient clustering technique for regionalisation of a spatial database


Int. J. Business Intelligence and Data Mining, Vol. 3, No. 1, 2008

Lokesh Kumar Sharma*, Simon Scheider and Willy Kloesgen
Fraunhofer Institut Intelligente Analyse und Informationssysteme, Schloss Birlinghoven, 53754 Sankt Augustin, Germany
E-mail: [email protected]
E-mail: [email protected]
E-mail: [email protected]
*Corresponding author

Om Prakash Vyas
Department of Computer Science, Pt. Ravishankar Shukla University, Raipur 492010, India
E-mail: [email protected]

Abstract: Regionalisation, a prominent problem from social geography, could be solved by a classification algorithm for grouping spatial objects. A typical task is to find spatially compact and dense regions of arbitrary shape with a homogeneous internal distribution of social variables. Grouping a set of homogeneous spatial units to compose a larger region can be useful for sampling procedures as well as many applications, e.g., direct mailing. It would be helpful to have specific purpose regions, depending on the kind of homogeneity one is interested in. In this paper, we propose an algorithm combining the ‘spatial density’ clustering approach and a covariance-based method to inductively find spatially dense and non-spatially homogeneous clusters of arbitrary shape.

Keywords: regionalisation; spatial data mining; density-based cluster.

Reference to this paper should be made as follows: Sharma, L.K., Scheider, S., Kloesgen, W. and Vyas, O.P. (2008) ‘Efficient clustering technique for regionalisation of a spatial database’, Int. J. Business Intelligence and Data Mining, Vol. 3, No. 1, pp.66–81.

Biographical notes: Lokesh Kumar Sharma obtained his Master’s Degree in Computer Applications in 2000. He is working for a Doctoral Degree in Descriptive Modelling and Pattern Discovery in Spatial Data Mining at Fraunhofer Institut Intelligente Analyse und Informationssysteme (IAIS), Sankt Augustin, Germany. His work is supported by a DAAD Scholarship 2006–2008. His research interests are spatial clustering and spatial classification.

Simon Scheider received his Diploma in Geography from the University of Trier, Germany. He has been working in the field of Knowledge Discovery in Spatial Databases at Fraunhofer IAIS since 2002. Currently, he is working on his PhD, which is concerned with concept descriptions in geographical databases. His research interests include statistical methods, data models and abstract semantic descriptions for geographical information.

Copyright © 2008 Inderscience Enterprises Ltd.


Willi Kloesgen worked as a Senior Scientist and Project Manager at Fraunhofer IAIS. His work focused on the design and implementation of database management, statistical and modelling systems, and their applications in governmental and industrial projects. Besides application issues of KDD, software architectures of systems, types of discovery patterns and evaluations of interestingness dimensions belong to his primary interests in KDD.

Om Prakash Vyas is currently working as Professor and Head of the School of Computer Science and Information Technology at Pt. R.S. University Raipur, India. He received his MTech in Computer Science and PhD from the Indian Institute of Technology Kharagpur, India, and an MSc (nat.) from the Technical University of Kaiserslautern, Germany. He has long academic experience spanning more than two decades. His research interests are in the fields of data mining and communication networks.

1 Introduction

An important application area for spatial clustering algorithms is social and economic geography (Ester et al., 2001). In the scope of this paper, we consider a classical methodical problem of social geography, ‘regionalisation’ (Sedlacek, 1978) of socially homogeneous residential areas for a city, which is a problem of so-called ‘social area analysis’ (Shevky and Bell, 1955).

A famous hypothesis in social science is that social differentiation causes geographical differentiation. Put simply, similar people tend to live together in one place, because an important selection criterion for housing seems to be the social similarity of the spatial neighbourhood in visibility range. Thus, a social area can be seen as a geometric region, including a set of points as living places, with each point having a densely inhabited spatial neighbourhood of a similar social structure. Although such a region can typically have numerous shapes (in theory, regions can, e.g., be circular city belts, centripetal sectors, or nuclei, compare Figures 1 and 2), it is implicitly clear that regions should not overlap (planarity) and that they should be topologically connected, so geometric holes and islands are not allowed. Formally, a region is planar if all interior neighbourhoods are exclusively inhabited by points of the same region. A region is topologically connected if all pairs of its points are connected by an internal path.

The purposes of a regionalisation can be, e.g., spatial aggregation of statistical values or statistical sampling for social surveys. A fully automated procedure solving this task would allow for many special purpose regionalisations, depending on the kind of homogeneity one is interested in, e.g., special regions for direct mailing.

To compute these regions by classical social area analysis, the first step is to classify some spatial ‘a priori’ regions, like statistical or postcode areas, according to several fixed social criteria. In the second step, neighbouring regions of the same class are merged. This approach has a major disadvantage: the a priori regions can be inappropriate, which means their borders need not coincide with the borders of the actual social areas.

Figure 1  Zonal urban structures after Burgess 1925
Source: Snyder (2000)

Figure 2  Sector and nucleus urban structures after Hoyt 1939 and Harris and Ullmann 1945
Source: Snyder (2000)

On the one hand, this causes inhomogeneity, because the actual spatial distribution of social characteristics is not taken into account. Therefore, regions are not guaranteed to be socially homogeneous (this problem is commonly called the Modifiable Areal Unit Problem, MAUP, in spatial statistics). The merging criterion, class equality, does not solve this problem. As a result, neighbourhoods in different regions can be very similar, whereas neighbourhoods inside the same region can be very different.


On the other hand, it is not guaranteed that the neighbourhoods of all living places of a region are inhabited and dense, because the spatial point distribution of living places is also not taken into account. As a result, two neighbouring living places inside a region can be far away from each other.

In this paper, we consider the problem of classifying a database of two-dimensional address objects into homogeneous, density-connected subsets called ‘regions’. In the following, the term ‘spatial’ refers to the geographical space in this sense. In our opinion, this problem can be tackled by a special clustering method that takes into account spatial point distributions as well as the distribution of non-spatial characteristics. The precise task of regionalisation is discussed in detail in Section 3. As we will show in Section 2, this task cannot be fully solved by existing clustering techniques. In Section 4, we discuss concepts for a density- and homogeneity-based clustering approach. In Section 5, we introduce the algorithm, before we give some experimental evaluation in Section 6.

2 Related work

Significant contributions have been provided by different researchers on the study of spatial clustering. Partitioning algorithms such as k-means, k-medoid, EM and CLARANS (Han et al., 2001; Ng and Han, 2002) construct a partition of a given set D of n objects in a d-dimensional space, given an input parameter k. Convex clusters are built such that the total deviation of each object from its cluster centre is minimal. Hierarchical algorithms (Ward, 1963; Han et al., 2001) create a hierarchical decomposition of the given data objects using distance-based criteria in a d-dimensional space. These clustering algorithms find spherically shaped and sparse or scattered clusters. All these methods can incorporate geographical space as dimensions, but they do not provide clustering constraints like maximal distances or density thresholds. Therefore, it is not possible to assure populated spatial neighbourhoods of a certain density. Furthermore, these algorithms do not allow for arbitrary cluster shapes.

Ester et al. (1996), Sander et al. (1998), Shoier and Borruso (2004), Ma and Zhang (2004), Gorawski and Malczok (2006) and Brecheisen et al. (2006) have provided significant contributions to density-based clustering techniques. The key idea of density-based clustering is to continue growing a given cluster as long as the density in an n-dimensional neighbourhood is larger than some threshold. The density is calculated from the area of the neighbourhood region and the number of objects in that region (Sander et al., 1998). Most density-based clustering techniques focus on purely spatial data sets; only some papers discuss spatial and non-spatial data.

MDBSCAN (Shoier and Borruso, 2004) has a similar structure to DBSCAN (Ester et al., 1996), but introduces a notion of proximity not only for spatial but also for non-spatial characteristics. MDBSCAN uses two different neighbourhood predicates on an n-dimensional space: one for the two-dimensional geographic subspace, and another for the non-geographic (n − 2)-dimensional subspace. In this way, it can assure dense geographic and non-geographic neighbourhoods. A first problem is that the method produces clusters whose respective regions may be non-planar and unconnected (allowing for islands and holes). This happens if the spatial neighbourhood of a point p of a cluster contains points of another cluster.

70

L.K. Sharma et al.

These ‘external’ points then belong to the spatial but not to the non-spatial neighbourhood of p. A second problem is that the method does not actually assure homogeneity, because homogeneity is only required of each density-reachability neighbourhood. Since homogeneity is a statistical measure, it is not true that the union of two density-connected and homogeneous neighbourhoods A and B is itself homogeneous again, so

homogeneous(A) ∧ homogeneous(B) ∧ (∀a ∈ A, b ∈ B: densityConnected(a, b)) ⇒ homogeneous(A ∪ B)

cannot be assumed. As a consequence, homogeneously density-connected paths do not assure homogeneity of the resulting cluster.
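A small invented numeric example (ours, not from the paper) makes this failure concrete. Take $A = \{1, 1\}$ and $B = \{9, 9\}$, with densityConnected holding between all pairs. Each set has intra-cluster variance zero and is therefore perfectly homogeneous, but

$$\bar{x}_{A \cup B} = 5, \qquad \mathrm{Var}(A \cup B) = \frac{2(1-5)^2 + 2(9-5)^2}{4} = 16,$$

so the union can be arbitrarily heterogeneous although both parts are maximally homogeneous; density connectivity between all pairs does not change this.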

3 Regionalisation as a spatial clustering problem

Suppose a spatial database D of geo-referenced addresses (point data) is given. Let X = {X1 … Xj … Xm} be a set of variables associated with D, so that each address oi ∈ D has the m-tuple (x1i, …, xji, …, xmi) of values. Formally, following Section 1, the task of regionalisation requires three characteristics of a clustering in D: first, each cluster must be spatially density connected by close and densely inhabited neighbourhoods, allowing for arbitrary shapes. Second, these neighbourhoods must be populated by only one cluster. Third, each cluster must be homogeneous with respect to some essential variables X.

The first characteristic can clearly be fulfilled by Sander et al.’s (1998) ‘density connected set’ as implemented in DBSCAN. This method produces exactly one determinate result: the partition of D into unclassified noise and the set of density-connected sets CL′, which is maximal, so if C′ is a density-connected set and C′ ⊆ D, then C′ ∈ CL′. But it is immediately clear that not every density-connected set must be homogeneous. It turns out that, in order to allow for homogeneous density-connected clusters, the maximality condition has to be relaxed, and a new planarity and a new homogeneity condition have to be introduced.

In the following, we describe the task of finding a regionalisation clustering for a given database D. Terms and definitions are based on the ones introduced by Sander et al. (1998), and they are presented in numbered definitions where their meaning deviates from the original ones. In contrast to a density-based clustering, we define a regionalisation clustering to be mutually exclusive and not maximal.

Definition 1: Let CL = {C1, …, Ck} be a (not necessarily maximal) mutually exclusive set of non-empty subsets of D, denoting the result of a regionalisation clustering, so that each cluster is defined to be a regionalisation cluster, but not each possible regionalisation cluster is part of CL. We define the noise in D with respect to a given clustering CL as the set of objects in D not belonging to any cluster in CL, noise = D\(C1 ∪ … ∪ Ck). Let Npred be a reflexive and symmetric binary predicate on D meaning that two points are spatial neighbours. Let Card be a function returning the cardinality of a subset of D, and MinC a minimum cardinality.


Definition 2: Internally directly density reachable iddr(): An object p is internally directly density reachable from an object q with respect to Npred, MinC, and CL, iddr(p, q), if
• Npred(p, q) (neighbourhood condition)
• Card({o ∈ D | Npred(o, q)}) > MinC (core object condition)
• ∀o ∈ D: Npred(o, q) ⇒ ∃Ci ∈ CL: o ∈ Ci ∧ q ∈ Ci (planarity condition)

This binary predicate is not symmetric and means that p is part of an inhabited and dense neighbourhood of q, which entirely belongs to one cluster. Based on this predicate, we define ‘internally density reachable’ idr() and ‘internally density connected’ idc() accordingly; the binary predicate idc() is symmetric. These definitions imply the ones in Sander et al. (1998), but they do not follow from them, so idc(p, q) implies dc(p, q), but dc(p, q) does not imply idc(p, q). Furthermore, let H be a homogeneity predicate, meaning that a subset of D is homogeneous with respect to a variable Xj and a minimum homogeneity MinH (see Section 4).

Definition 3: A regionalisation cluster Ci in D with respect to a set of variables X = {X1 … Xm} is a non-empty subset of D satisfying the following formal requirements:
• for all addresses p, q from Ci, p ∈ Ci ∧ q ∈ Ci: p is internally density connected to q (internal density connectivity)
• the addresses of Ci are homogeneous with respect to each variable in X, so ∀Xj ∈ X: H({o ∈ D | o ∈ Ci}, Xj) (homogeneity).

The first requirement looks quite similar to ‘connectivity’ in Sander et al. (1998). It means that each address of a cluster must belong to a close and dense neighbourhood, and each address of this neighbourhood must belong to the same cluster. Furthermore, the cluster cannot consist of disconnected parts. But there is an important difference between internal density connectivity and general density connectivity: a set that is internally density connected is not necessarily a ‘density connected set’, because it need not be maximal. In rough equivalence to the ‘planarity’ concept in region geometry, internal density connectivity prevents ‘overlapping’ of clusters, while it allows clusters to ‘touch’, that is, to be density reachable from each other, because not all density reachable pairs of objects need to belong to the same cluster. In consequence, a bordering address can be density reachable from another cluster’s bordering address, while it is of course not internally density reachable from that address. At the same time, clusters may not ‘overlap’, because idr() neighbourhoods by definition do not contain any addresses from other clusters. This is necessary because we should not allow for the spatial mixing of two clusters’ addresses. The second requirement bounds the information loss incurred if all address values of each variable were substituted by their respective cluster mean. This requirement and the implementation of an appropriate homogeneity measure are discussed in the next section.
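To make Definition 2 concrete, the following is a minimal sketch (our code; the paper publishes no implementation) of the iddr() predicate on a finished clustering, assuming Npred is realised as a Euclidean distance threshold eps and cluster membership is stored per object. The data model and all names are our assumptions.

```java
import java.util.List;
import java.util.stream.Collectors;

// Data model assumed for the sketches in this paper: a geo-referenced address
// with one non-spatial variable and a (possibly unassigned) cluster label.
final class Address {
    final double x, y;    // geographic coordinates
    final double value;   // one non-spatial variable, e.g. 'percentage of families'
    Integer clusterId;    // null while unclassified
    Address(double x, double y, double value) { this.x = x; this.y = y; this.value = value; }
}

final class Iddr {
    private final List<Address> db;  // the database D
    private final double eps;        // distance threshold realising Npred
    private final int minC;          // minimum cardinality MinC

    Iddr(List<Address> db, double eps, int minC) {
        this.db = db; this.eps = eps; this.minC = minC;
    }

    // Npred: reflexive and symmetric spatial neighbour predicate.
    private boolean npred(Address a, Address b) {
        double dx = a.x - b.x, dy = a.y - b.y;
        return Math.sqrt(dx * dx + dy * dy) <= eps;
    }

    /** iddr(p, q) following Definition 2. */
    boolean iddr(Address p, Address q) {
        if (!npred(p, q)) return false;                       // neighbourhood condition
        List<Address> nq = db.stream().filter(o -> npred(o, q))
                             .collect(Collectors.toList());
        if (nq.size() <= minC) return false;                  // core object condition
        // Planarity condition: q's whole neighbourhood lies in one cluster with q.
        if (q.clusterId == null) return false;
        for (Address o : nq)
            if (!q.clusterId.equals(o.clusterId)) return false;
        return true;
    }
}
```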

4 Concepts for regionalisation clustering

4.1 Homogeneity of a region

The purpose of regionalisation is to aggregate variables by parameters, e.g., averages, for each region. We consider only the continuous case. Then, heterogeneity of a cluster can best be interpreted in terms of the information loss of a continuous distribution substituted by its first parameter. In the well-known hierarchical clustering method by Ward (1963), heterogeneity of a clustering is the sum of information losses occurring if individual values are substituted by the mean of their cluster. The information loss is measured by the Error Sum-of-Squares criterion (ESS), which is equivalent to the sum of intra-cluster variances. As outlined in the last section, multidimensional procedures including space as a dimension, like Ward’s clustering method, are not appropriate, because they cannot guarantee spatial neighbourhood density. But there are also some homogeneity-related aspects in which our problem differs from standard clustering problems:
• In regionalisation, we implicitly expect a certain acceptable minimum homogeneity of a region, because we would not accept a region that is arbitrarily heterogeneous. In conventional clustering methods (e.g., Ward’s minimum variance method), a minimum homogeneity for each cluster is not assured, because it is not used as a constraint, but only as a clustering criterion.
• Furthermore, our goal is not to maximise inter-cluster heterogeneity, so it is acceptable that addresses from different clusters are similar, as long as they are spatially remote. This is because regions are considered to be spatial individuals of some general ‘non-spatial region type’, so that two separate regions can be of the same type.
• Because the addresses of a region can be considered as a statistical sample of some ‘region type’, homogeneity should be measured as an estimated parameter of the universe of this type. Therefore, it can also be necessary to assure outlier robustness, because outliers inside a residential area are frequent, and distribution parameters are very sensitive to them.

To formulate a minimum homogeneity constraint, we have to normalise the intra-cluster variance and fix it to some minimum homogeneity. For this purpose, we should take into account distribution characteristics of a variable globally and locally with respect to a cluster. In the following, we write X for Xj ∈ {X1 … Xj … Xm}. We will discuss two slightly different approaches based on normalised variance, called ‘normalised local variance’ and ‘normalised local covariance’, which afterwards will be combined into a general homogeneity measure.

Let $c = \mathrm{Card}(C)$ be the size of a subset $C \subseteq D$ and $\bar{x}_c = \sum_{i=1}^{c} x_i / c$ the local arithmetic mean of variable X in C. Let $n = \mathrm{Card}(D)$ be the size of database D and $\bar{x} = \sum_{i=1}^{n} x_i / n$ the global arithmetic mean of variable X.


4.1.1 Homogeneity as normalised local variance

We define the following measures:

$$\mathrm{Var}_{\text{local-local}}(C, X) = \sum_{i=1}^{c} (x_i - \bar{x}_c)^2 / c,$$

which is the intra-cluster variance of variable X in a cluster C, and

$$\mathrm{Var}_{\text{global}}(X) = \sum_{i=1}^{n} (x_i - \bar{x})^2 / n,$$

which is the global variance of variable X. The normalised variance homogeneity with respect to a cluster C and a variable X is one minus the intra-cluster variance divided by the global variance:

$$H_{\mathrm{NLV}}(C, X) = 1 - \frac{\mathrm{Var}_{\text{local-local}}(C, X)}{\mathrm{Var}_{\text{global}}(X)}.$$

This parameter relates the information loss of cluster aggregation to the information loss of global mean aggregation. It is independent of the location of the local cluster mean relative to the global mean, which means that ‘average’ clusters, whose mean lies near the global mean, are treated in the same way as ‘extraordinary’ ones. This is a problem, because sub-samples with a mean lying near the tails of the global distribution are expected to have a larger variance than average ones. This means that we should consider an extraordinary cluster to be more homogeneous than an average one with the same variance. Therefore, we will look at a second measure.
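An invented numeric example (ours) illustrates the point. Let the global mean be $\bar{x} = 0$ and the global variance $\mathrm{Var}_{\text{global}} = 4$, and consider two clusters $A = \{-0.5, 0.5\}$ and $B = \{3.5, 4.5\}$, both with intra-cluster variance $0.25$:

$$H_{\mathrm{NLV}}(A) = H_{\mathrm{NLV}}(B) = 1 - \frac{0.25}{4} = 0.9375,$$

although B lies far out in the tail of the global distribution. The measure of the next section distinguishes the two: $\mathrm{Var}_{\text{local-global}}(A) = 0.25$ gives $H_{\mathrm{NLC}}(A) = 0$, whereas $\mathrm{Var}_{\text{local-global}}(B) = 0.25 + 16 = 16.25$ gives $H_{\mathrm{NLC}}(B) \approx 0.985$.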

4.1.2 Homogeneity as normalised local covariance with respect to the global mean

We define a local global covariance as the covariance of all value pairs inside one cluster C with respect to the global mean:

$$\mathrm{Cov}_{\text{local-global}}(C, X) = \frac{\sum_{i=1}^{c} \sum_{j=1}^{c} (x_i - \bar{x})(x_j - \bar{x})}{c^2}.$$

Furthermore, we define a local global variance as the local variance with respect to the global mean:

$$\mathrm{Var}_{\text{local-global}}(C, X) = \frac{\sum_{i=1}^{c} (x_i - \bar{x})^2}{c}.$$

The normalised local covariance homogeneity with respect to a cluster C and a database D is the local global covariance of cluster C divided by the local global variance of cluster C:

$$H_{\mathrm{NLC}}(C, X) = \frac{\mathrm{Cov}_{\text{local-global}}(C, X)}{\mathrm{Var}_{\text{local-global}}(C, X)}.$$

It can easily be shown that $\mathrm{Var}_{\text{local-global}}$ is exactly equal to the sum $\mathrm{Cov}_{\text{local-global}} + \mathrm{Var}_{\text{local-local}}$. Therefore, $\mathrm{Cov}_{\text{local-global}} = \mathrm{Var}_{\text{local-global}} - \mathrm{Var}_{\text{local-local}}$. Substituting this for $\mathrm{Cov}_{\text{local-global}}$, we obtain a formula that is much easier to compute and to understand:

$$H_{\mathrm{NLC}}(C, X) = 1 - \frac{\mathrm{Var}_{\text{local-local}}(C, X)}{\mathrm{Var}_{\text{local-global}}(C, X)}.$$
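For completeness, the identity used above can be verified directly (our derivation, using the definitions of this section):

$$\mathrm{Var}_{\text{local-global}} = \frac{1}{c}\sum_{i=1}^{c}\bigl((x_i - \bar{x}_c) + (\bar{x}_c - \bar{x})\bigr)^2 = \mathrm{Var}_{\text{local-local}} + (\bar{x}_c - \bar{x})^2,$$

since the cross term $\tfrac{2}{c}(\bar{x}_c - \bar{x})\sum_{i=1}^{c}(x_i - \bar{x}_c)$ vanishes, and

$$\mathrm{Cov}_{\text{local-global}} = \frac{1}{c^2}\Bigl(\sum_{i=1}^{c}(x_i - \bar{x})\Bigr)^2 = (\bar{x}_c - \bar{x})^2.$$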

To interpret this measure, we have to realise that the local global variance $\mathrm{Var}_{\text{local-global}}$ is the sum of two parts: the local intra-cluster variance $\mathrm{Var}_{\text{local-local}}$, and $\mathrm{Cov}_{\text{local-global}}$, which is the fraction of $\mathrm{Var}_{\text{local-global}}$ influenced by the similarity of absolute address distances to the global mean. The value range of this measure is between zero and one. If all addresses of a cluster have the same (non-zero) distance to the global mean, the intra-cluster variance will be zero and the homogeneity will be one. But $\mathrm{Var}_{\text{local-global}}$ itself increases with the average distance of cluster addresses to the global mean. This means that for a constant homogeneity, the local cluster variance will increase if the cluster mean moves towards the tails of the global distribution (compare Figure 3). In this way, the larger spread of extraordinary clusters can be taken into account for the normalisation.

Figure 3  Local cluster-internal (red and orange) and global (green) normal distributions of the variable ‘percentage of families’. The local variance grows as a function of the distance (dotted horizontal lines) of the respective local mean (dotted vertical red and orange lines) from the global mean (green vertical line)

The disadvantage of this measure is that it becomes zero if the cluster mean reaches the global mean, so the homogeneity of ‘average’ clusters will be underestimated.

4.1.3 A combined outlier-robust homogeneity measure

The advantages of both measures can be combined by applying a dynamic weighted sum. The normalised local covariance measure should dominate our synthesised homogeneity if a cluster is ‘extraordinary’, whereas the normalised local variance measure should dominate if the cluster is ‘average’. A cluster can be considered extraordinary if its distance from the global mean is large compared with its intra-cluster variance. This is exactly what is measured by $H_{\mathrm{NLC}}$ itself: if the global mean is the same as the local mean, $H_{\mathrm{NLC}}$ is 0, and $H_{\mathrm{NLC}}$ will increase with the distance between local and global mean if the local variance remains constant. Because $H_{\mathrm{NLC}}$ is defined to be between 0 and 1, $H_{\mathrm{NLC}}$ and $1 - H_{\mathrm{NLC}}$ can directly be used as weights:

$$H_{\text{comb}}(C, X) = H_{\mathrm{NLC}}(C, X)\, H_{\mathrm{NLC}}(C, X) + \bigl(1 - H_{\mathrm{NLC}}(C, X)\bigr)\, H_{\mathrm{NLV}}(C, X)$$

$$H_{\text{comb}}(C, X) = \left(1 - \frac{\mathrm{Var}_{\text{local-local}}}{\mathrm{Var}_{\text{local-global}}}\right)\left(1 - \frac{\mathrm{Var}_{\text{local-local}}}{\mathrm{Var}_{\text{local-global}}}\right) + \left(\frac{\mathrm{Var}_{\text{local-local}}}{\mathrm{Var}_{\text{local-global}}}\right)\left(1 - \frac{\mathrm{Var}_{\text{local-local}}}{\mathrm{Var}_{\text{global}}}\right).$$

The proposed measure can be made outlier-robust if we apply a conventional predicate for outlier detection. In a general definition, an outlier falls more than 1.5 times the Inter-Quantile Range (IQR) above a higher quantile $Q_u$ or below a lower quantile $Q_l$ of a distribution.

Definition 4: Let MinH ∈ (0, …, 1) be a fixed normalised homogeneity minimum, for example 0.7. Let $Q_u$ and $Q_l$ be an upper and a lower quantile of a variable Xj in cluster C, for example $Q_{80\%}$ and $Q_{20\%}$. Then we consider the cluster C of D to be homogeneous with respect to Xj, H(C, Xj, MinH), if
• $H_{\text{comb}}(C', X_j) > \mathrm{MinH}$,
• with $C' = \{o_i \in C \mid Q_l - 1.5 (Q_u - Q_l) < x_{ji} < Q_u + 1.5 (Q_u - Q_l)\}$,
which means the predicate is true if the cluster shows a minimal homogeneity for an outlier-free subset of its values.
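The measures of this section translate directly into code. The following is a minimal sketch under our own naming (the paper gives only the formulas); the nearest-rank quantile and the lack of handling for degenerate clusters with zero local-global variance are simplifying assumptions.

```java
import java.util.Arrays;

// Sketch of the homogeneity measures of Section 4.1 (our code). 'values' are
// the X-values of one cluster C; 'globalMean' and 'globalVar' are precomputed
// once over the whole database D.
final class Homogeneity {

    static double mean(double[] v) {
        return Arrays.stream(v).average().orElse(0.0);
    }

    // Mean squared deviation of v around an arbitrary centre.
    static double varAround(double[] v, double centre) {
        return Arrays.stream(v).map(x -> (x - centre) * (x - centre))
                     .average().orElse(0.0);
    }

    /** H_NLV(C, X) = 1 - Var_local-local / Var_global. */
    static double hNLV(double[] values, double globalVar) {
        return 1.0 - varAround(values, mean(values)) / globalVar;
    }

    /** H_NLC(C, X) = 1 - Var_local-local / Var_local-global. */
    static double hNLC(double[] values, double globalMean) {
        return 1.0 - varAround(values, mean(values)) / varAround(values, globalMean);
    }

    /** H_comb = H_NLC * H_NLC + (1 - H_NLC) * H_NLV. */
    static double hComb(double[] values, double globalMean, double globalVar) {
        double nlc = hNLC(values, globalMean);
        return nlc * nlc + (1.0 - nlc) * hNLV(values, globalVar);
    }

    /** Definition 4: evaluate H_comb on the outlier-free subset C'. */
    static boolean isHomogeneous(double[] values, double globalMean, double globalVar,
                                 double minH, double qLow, double qHigh) {
        double[] s = values.clone();
        Arrays.sort(s);
        double ql = s[(int) (qLow * (s.length - 1))];    // e.g. Q20%, nearest rank
        double qu = s[(int) (qHigh * (s.length - 1))];   // e.g. Q80%
        double iqr = qu - ql;
        double[] cPrime = Arrays.stream(values)
                .filter(x -> x > ql - 1.5 * iqr && x < qu + 1.5 * iqr).toArray();
        return hComb(cPrime, globalMean, globalVar) > minH;
    }
}
```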

4.2 Heuristics and quality for a regionalisation-clustering algorithm

Similar to DBSCAN, a regionalisation-clustering algorithm should ‘grow’ each cluster from a (randomly chosen) initial core object by density-connected paths. In addition, the resulting cluster must be homogeneous. Noise can be everything that is not part of any cluster. To formulate an algorithm, one has to decide about initial core objects, the application of the homogeneity predicate, and the classification of objects as noise.

In Section 3, we defined a regionalisation clustering to be a non-maximal mutually exclusive set of internally density-connected subsets of D that are homogeneous with respect to each essential variable Xj ∈ X. This has the obvious consequence that, for a given D and a given X, many clusterings are possible. This is because a regionalisation cluster is not uniquely determined by any of its core objects, so there is no equivalent to Lemma 2 in Sander et al. (1998). Lemma 2 assures the deterministic property of the DBSCAN algorithm, because its result is independent of the initial core object of each cluster.

This involves a second aspect, which is the quality of a clustering result according to its purpose. We consider a regionalisation to have good quality if the Error Sum-of-Squares criterion (ESS) is low, the number of homogeneous clusters for a given database D is low, and if at the same time the percentage of noise is low. In this case, the resulting clustering will allow an efficient spatial aggregation of individual values covering a significant fraction of D.

Non-determinism, therefore, has two important consequences:
• a regionalisation-clustering algorithm will come to more or less different results if the initial core object of each cluster is different
• with respect to the spatial distribution of values of variables – especially outliers – the clustering quality, defined by the percentage of noise and the number of clusters, will depend on the initialisation of each cluster.

To increase the clustering quality, the algorithm therefore should follow certain heuristics:
• Avoid and reconsider noise. If the initial core object neighbourhood of a cluster happens to contain (undetected) outliers, or if this core object actually lies at the border of a region, then it is very probable that the cluster will not be realised (compare Figure 4, unfavourable initial core objects). But the same neighbourhood could be part of a homogeneous cluster initialised by another core object. Therefore, inhomogeneous but dense neighbourhoods should not be marked as noise; they should be reconsidered for other clusters. Furthermore, the overall number of initial core objects should be kept small.
• Merge clusters. If the spatial distribution of outliers is irregular, then the number and size of clusters depend on the direction of growing. This is because a cluster can stop growing too soon, and therefore the number of clusters increases (compare Figure 4, unfavourable direction of growing). A heuristic solution to this problem is to merge two neighbouring density-connected clusters if their union fulfils the homogeneity criterion. This can be done in a second step after the actual clustering has finished; a sketch of this merge step follows Figure 4.

Figure 4  The spatial distribution of outlier objects influences the clustering result. See text for explanation
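The merge heuristic can be sketched as follows (our reconstruction; the neighbourhood and homogeneity tests are passed in as predicates, since the paper describes but does not publish them):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiPredicate;
import java.util.function.Predicate;

// Post-clustering merge step: repeatedly merge two neighbouring
// density-connected clusters if their union stays homogeneous.
final class MergeStep {
    static void merge(List<Set<Address>> clusters,
                      BiPredicate<Set<Address>, Set<Address>> densityConnectedNeighbours,
                      Predicate<Set<Address>> homogeneous) {
        boolean changed = true;
        while (changed) {
            changed = false;
            outer:
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    if (!densityConnectedNeighbours.test(clusters.get(i), clusters.get(j)))
                        continue;
                    Set<Address> union = new HashSet<>(clusters.get(i));
                    union.addAll(clusters.get(j));
                    if (homogeneous.test(union)) {   // union fulfils the criterion
                        clusters.set(i, union);
                        clusters.remove(j);
                        changed = true;
                        break outer;                 // rescan after every merge
                    }
                }
            }
        }
    }
}
```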

5 The regionalisation-clustering algorithm RCSDB

To find a regionalisation cluster C ∈ CL, RCSDB (Figure 5) starts with a randomly chosen object p and retrieves all objects density reachable from p with respect to Npred. We test whether the p-neighbourhood fulfils the density condition with respect to MinC (core object condition in Definition 2). If p does not fulfil it, it is assigned to noise (Definition 1). If it does, the neighbourhood is added to a preliminary cluster PC. Then, we measure the homogeneity H (Definition 4) with respect to MinH for this cluster. If the cluster is not homogeneous, we release the p-neighbourhood as unclassified; in this way, it can be reconsidered for other clusters (compare heuristic 1). If it is homogeneous, we expand the cluster (compare Figure 6). This procedure is applied to each randomly chosen p ∈ D that has not yet been classified. After the algorithm has finished, all unclassified points become noise.

Figure 5  Main algorithm RCSDB

Figure 6  Expanding a cluster in RCSDB

D is a database of geo-referenced addresses (point data) with non-spatial attributes. The function getRandomItem(D) chooses a starting object randomly (see Section 4.2). The purpose of the function isDenseObject(PC, MinC) is to assure the density condition with respect to the minimum cardinality. All objects of a cluster have to be considered at once for the homogeneity measure. A function copy() is used to copy a list.
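Since Figures 5 and 6 are reproduced as images, the following is only our reconstruction of the RCSDB control flow from the prose above; getRandomItem and isDenseObject are named in the text, while the data structures and the seed-based expansion are assumptions. It reuses the Address class and the Homogeneity helper from the earlier sketches.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

final class RCSDB {
    private final List<Address> db;                    // database D
    private final double eps;                          // radius realising Npred
    private final int minC;                            // MinC
    private final Predicate<Set<Address>> homogeneous; // Definition 4 test
    final List<Set<Address>> clusters = new ArrayList<>();
    final Set<Address> noise = new HashSet<>();
    private final Set<Address> classified = new HashSet<>();

    RCSDB(List<Address> db, double eps, int minC, Predicate<Set<Address>> homogeneous) {
        this.db = db; this.eps = eps; this.minC = minC; this.homogeneous = homogeneous;
    }

    private List<Address> neighbourhood(Address q) {   // brute-force Npred retrieval
        List<Address> n = new ArrayList<>();
        for (Address o : db) {
            double dx = o.x - q.x, dy = o.y - q.y;
            if (Math.sqrt(dx * dx + dy * dy) <= eps) n.add(o);
        }
        return n;
    }

    void run() {
        List<Address> order = new ArrayList<>(db);
        Collections.shuffle(order);                    // stands in for getRandomItem(D)
        for (Address p : order) {
            if (classified.contains(p)) continue;
            List<Address> np = neighbourhood(p);
            if (np.size() <= minC) continue;           // isDenseObject fails; handled by final noise pass
            Set<Address> pc = new HashSet<>();         // preliminary cluster PC
            for (Address o : np) if (!classified.contains(o)) pc.add(o); // planarity: unclassified only
            if (!homogeneous.test(pc)) continue;       // release neighbourhood, reconsider later
            expand(pc);
            clusters.add(pc);
            classified.addAll(pc);
        }
        for (Address p : db)                           // all remaining unclassified points become noise
            if (!classified.contains(p)) noise.add(p);
    }

    private void expand(Set<Address> pc) {             // compare Figure 6
        Deque<Address> seeds = new ArrayDeque<>(pc);
        while (!seeds.isEmpty()) {
            Address q = seeds.pop();
            List<Address> nq = neighbourhood(q);
            if (nq.size() <= minC) continue;           // grow only from core objects
            Set<Address> candidate = new HashSet<>(pc);
            for (Address o : nq) if (!classified.contains(o)) candidate.add(o);
            if (!homogeneous.test(candidate)) continue; // homogeneity is checked on the whole cluster
            for (Address o : candidate)
                if (pc.add(o)) seeds.push(o);          // newly accepted objects become seeds
        }
    }
}
```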

6 Experimental results

The algorithm was tested on a two-dimensional point database of 133,207 buildings for the city of Cologne, Germany. Each building has several attributes describing statistical properties of its inhabitants, like consumer behaviour and social, structural and demographic characteristics. We selected the following social, structural and demographic variables: percentage of families, average age, buying power and building size. However, we focus on results calculated for the one variable ‘percentage of families’. We computed three variants of clusterings:
• The RCSDB algorithm was implemented in Java and applied to this data set, parameterised by Npred := 0.008, MinC := 3, MinH := 0.7, Qu := Q80%, Ql := Q20%.
• Additionally, after execution of RCSDB, we applied a merging algorithm. A pair of clusters was merged if both were density-connected neighbours and if their union was homogeneous.
• We applied a conventional DBSCAN algorithm to the database.
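Expressed against the sketches above, the reported parameterisation would look roughly as follows (CologneData.load() is a hypothetical placeholder; the paper does not describe its data access):

```java
import java.util.List;

public class Experiment {
    public static void main(String[] args) {
        List<Address> d = CologneData.load();    // hypothetical loader for the 133,207 buildings
        double[] all = d.stream().mapToDouble(a -> a.value).toArray();
        double globalMean = Homogeneity.mean(all);
        double globalVar = Homogeneity.varAround(all, globalMean);

        // Npred := 0.008, MinC := 3, MinH := 0.7, Qu := Q80%, Ql := Q20%
        RCSDB rc = new RCSDB(d, 0.008, 3,
                c -> Homogeneity.isHomogeneous(
                        c.stream().mapToDouble(a -> a.value).toArray(),
                        globalMean, globalVar, 0.7, 0.20, 0.80));
        rc.run();
        System.out.println(rc.clusters.size() + " clusters, "
                + rc.noise.size() + " noise points");
    }
}
```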

We see that the new RCSDB algorithm (Table 1) produces more clusters for the same data set than the conventional DBSCAN. But the percentage of noise is comparably low, and the error sum-of-squares (the sum of squared deviations from the respective cluster means) is considerably lower than for the DBSCAN variant. Also, the weighted average homogeneity has increased by 25% for the RCSDB variant.

Table 1  Regionalisation quality of the three clustering versions for Cologne using one variable ‘percentage of families’ with value range {0, …, 4}

Quality measure                 DBSCAN         RCSDB          RCSDB with merging
Number of clusters              72             122            116
Size of noise                   124 (133207)   126 (133207)   126 (133207)
Error sum-of-squares (ESS)      131085         128001         128190
Standard deviation              0.985          0.962          0.962
Weighted average homogeneity    0.53           0.77           0.78

If we look at the map in Figure 7, which shows the clustering results for DBSCAN, we see very few big yellow clusters covering the whole area of the city centre as well as all neighbouring residential areas. Points that are noise are shown in grey. In Figure 8, we immediately see the reason for the increase in cluster numbers in the RCSDB variant. The central business district is carved out clearly as the central yellow cluster. Furthermore, different closely connected residential sectors can be distinguished, which are configured concentrically like a belt around the CBD (light green, turquoise, light blue, dark blue). These clusters are more or less equivalent to the following residential areas in anticlockwise order: ‘Nippes’, ‘Ehrenfeld’, ‘Lindenthal’ and ‘Rhodenkirchen’.

Figure 7  Clustering of Cologne by DBSCAN

Figure 8  Clustering of Cologne by RCSDB (without merging)


As can be seen in Figure 9, the merging could not improve our result much. Important residential regions like Rhodenkirchen and Lindenthal were merged, and also a part of the CBD with Ehrenfeld. Therefore, we consider the unmerged variant to be more plausible than the merged one. This can of course be different if the distribution of outliers is less favourable.

Figure 9  Clustering of Cologne by RCSDB (with merging)

7 Conclusion and future work

Our experimental results have shown that our proposed algorithm can be used for solving the regionalisation problem. The task is to classify a database of geographical locations into homogeneous, planar and density-connected subsets called ‘regions’. We found that the clustering algorithm must find internally density-connected sets, that is, density-connected sets which may ‘touch’ other clusters but may not ‘overlap’ them. Furthermore, these sets have to exhibit a certain minimal homogeneity. This can be measured by a normalised variance–covariance-based parameter, which takes into account local and global variances as well as the ‘extravagance’ of a cluster. Furthermore, homogeneity should be outlier-robust.

We could produce high-quality and plausible results for the city of Cologne for the one social variable ‘percentage of families’. The algorithm can of course be applied to many variables at once. In the future, we will put a stronger focus on the problem of multivariate data. We will also have to consider the issues of time complexity and optimality of the RCSDB algorithm. Furthermore, we will work on heuristic approaches to improve the plausibility and quality of clustering results.


Acknowledgement

The authors thank Dr. Angi Voss and Dr. Uli Bartling for their valuable support in many discussions about the regionalisation problem.

References

Brecheisen, S., Kriegel, H.P. and Pfeifle, M. (2006) ‘Multi-step density-based clustering’, Knowledge and Information Systems, Vol. 9, No. 3, pp.284–308.

Ester, M., Kriegel, H.P., Sander, J. and Xu, X. (1996) ‘A density-based algorithm for discovering clusters in large spatial databases with noise’, Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD’96), AAAI Press, Portland, pp.291–316.

Ester, M., Kriegel, H-P. and Sander, J. (2001) ‘Algorithms and applications for spatial data mining’, invited chapter for Geographic Data Mining and Knowledge Discovery, Research Monographs in GIS, Taylor and Francis, pp.160–190, ISBN: 9781420073973.

Gorawski, M. and Malczok, R. (2006) AEC Algorithm: A Heuristic Approach to Calculation Density-based Clustering Eps Parameter, LNCS 4243, Springer-Verlag, pp.90–99.

Han, J., Kamber, M. and Tung, A.K.H. (2001) ‘Spatial clustering methods in data mining: a survey’, Geographic Data Mining and Knowledge Discovery, Taylor and Francis, pp.1–29, ISBN: 9781420073973.

Ma, D. and Zhang, A. (2004) ‘An adaptive density-based clustering algorithm for spatial database with noise’, Proc. 4th IEEE Int. Conf. on Data Mining (ICDM’04), pp.467–470.

Ng, R.T. and Han, J. (2002) ‘CLARANS: a method for clustering objects for spatial data mining’, IEEE Trans. Knowl. Data Eng., Vol. 14, No. 5, pp.1003–1016.

Sander, J., Ester, M., Kriegel, H.P. and Xu, X. (1998) ‘Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications’, Data Mining and Knowledge Discovery, Kluwer Academic Publishers, Vol. 2, pp.169–194.

Sedlacek, P. (1978) Regionalisierungsverfahren (Wege der Forschung 195), Wissenschaftliche Buchgesellschaft, Darmstadt.

Shevky, E. and Bell, W. (1955) Social Area Analysis: Theory, Illustrative Application and Computational Procedures, Stanford University Press, Stanford.

Shoier, G. and Borruso, G. (2004) A Clustering Method for Large Spatial Databases, LNCS 3044, Springer-Verlag, pp.1089–1095.

Snyder, D.E. (2000) Urban Geography, http://teacherweb.ftl.pinecrest.edu/snyderd/APHG/Unit%206/urbannotes.htm

Ward, J.H. (1963) ‘Hierarchical grouping to optimize an objective function’, J. Am. Statist. Assoc., Vol. 58, pp.236–244.
