
A Multiple-Resolution Method For Edge-Centric Data Clustering

Scott Epter
Department of Computer Science
Rensselaer Polytechnic Institute
[email protected]

Mukkai Krishnamoorthy
Department of Computer Science
Rensselaer Polytechnic Institute
[email protected]

Abstract

Recent works in spatial data clustering view the input data set in terms of inter-point edge lengths rather than the points themselves. Cluster detection in such a system is a matter of finding connected paths of edges whose weight is no greater than some user-input threshold or cutoff value. The SMTIN algorithm [9] is one such system that uses Delaunay triangulation to compute the set of nearest-neighbor edges quickly and efficiently. Experiments demonstrate a substantial performance and accuracy improvement using SMTIN in comparison to other clustering systems. The resolution of the clusters discovered in the SMTIN system is directly related to the choice of a cutoff threshold, which makes SMTIN perform poorly for input sets with clusters at multiple resolutions. In this work we introduce an edge-centric clustering method that detects clusters at multiple resolutions. Our algorithm detects differences in density among groups of points and uses multiple cutoff points in order to account for clusters at different resolutions. One of the main benefits of the multi-resolution approach of our system is the ability to accurately cluster points that other systems would consider to be noise. Experiments indicate a substantial improvement in the clustering quality of our system in comparison to SMTIN, as well as the removal of the requirement of an input distance threshold, achieved with comparable theoretical as well as actual runtime performance. We present promising directions for this new algorithm.


1 Introduction

Current clustering algorithms in the context of data mining and knowledge discovery can be loosely classified in terms of the data on which they are intended to be used as well as the type of patterns for which they are searching. Several algorithms are geared toward finding spatial clusters, typically in low dimensions. Other algorithms are intended to detect associations and other statistical information, typically in very high dimensions. Several works also exist in the clustering of stored data in the name of enhanced information retrieval.

Recently, work has been introduced that views the input set from the perspective of its inter-point edges rather than the points themselves. There are several advantages to such edge-centric clustering methods, including fast processing time and insensitivity to the order of the input. Also, they do not require the number of clusters as input, instead determining this value automatically from the data set itself. Conversely, however, such systems require the input of a distance threshold that acts as a filtering criterion in the acceptance of edges. The SMTIN algorithm is one example of an edge-centric system. It makes use of a Delaunay triangulation to construct a compact edge set, making it an effective and efficient method for spatial data mining. It makes use of a single, global, user-input distance threshold for filtering. There are two main drawbacks to this approach. First, the user must supply the distance threshold as part of the input data. Ideally, clustering algorithms should by definition work in an unsupervised manner with little or no need for a priori knowledge in the form of user input. Second, the use of a single threshold distance does not permit the clustering of data sets in which clusters exist at multiple resolutions. Consider the data set shown in Figure 1. The proper grouping of the points in this set is sensitive to the chosen threshold. There exist arguably valid groupings of points in this example for which there is no single distance that partitions all internal and all external edges into crisply distinct sets.

Figure 1: A data set that has valid clusterings at multiple resolutions

In this work we present an edge-centric clustering algorithm that extends SMTIN by clustering at multiple resolutions. We use a vector of threshold values that defines a set of distance intervals over which connected edges in the triangulation are considered to be at the same resolution. Using concepts from the clusterability-detection methods in [3], we derive the threshold vector directly from the data set itself. Our work makes the following contributions:

Automated Detection of Edge-Acceptance Thresholds The filter criteria are derived automatically from the input data.

Detection at Multiple Resolutions We provide support for input data that consists of groups of points with different intra-group densities. This allows us to detect clusters hierarchically in a single execution of the application.

Intra-Cluster Homogeneity Our algorithm strives to generate clusters with uniform density by minimizing the difference in edge lengths between nearest neighbors within each detected cluster. The result is an efficient means of detecting even subtle changes in density between regions of points.

Information from Sparse Regions Previous approaches to clustering treat the points in sparse regions as 'outliers' and thus aim to remove them. The hierarchical structure of our algorithm allows such points instead to be grouped in a meaningful way vis-a-vis the denser regions of the data set.

The tendency toward intra-cluster density-homogeneity and multiple-resolution support offer promising potential for our algorithm to work in high dimensions and thus be used for informational clustering. We discuss this point further in Section 6.

The rest of this paper is organized as follows: In Section 2 we discuss related work. In Section 3 we discuss edge-centric clustering, including a description of the SMTIN algorithm and the clusterability methods in [3]. Section 4 discusses and analyzes our multiple-resolution algorithm. Section 5 provides experimental verification of our work, and finally concluding remarks and future directions are presented in Section 6.

2 Related Work

Data clustering is a wide-ranging field encompassing a large body of work in several heterogeneous disciplines. An attempt to provide an exhaustive evaluation of the field is well beyond the scope of this work. [8] and [10] provide excellent overviews of standard issues in data clustering, as well as classical approaches to handling them. In terms of work related to this paper, current clustering algorithms can be classified in terms of the data for which they are suited. In this regard, there are two broad classifications, namely spatial clustering methods and what we will refer to as informational clustering algorithms. Spatial methods detect clusters generally in low (2- or 3-) dimensional spaces as they would be seen by the human eye. Informational methods group points, often in high-dimensional spaces, in order to determine patterns and associations that are not easily visualized. Examples of spatial approaches that perform well in the presence of spherically-shaped clusters are presented in [10], [13] and [17]. Methods geared toward detection of irregularly-shaped clusters are presented in [4], [6] and [16], as well as in [9]. Informational methods that focus specifically on high-dimensional clustering issues are presented in [1] and [7]. The methods in [5] and [7] link clustering to methods for the discovery of association rules. A host of informational work exists in the field of information retrieval, including the work presented in [2] and [11]. The multiple-resolution algorithm introduced here is a direct extension of the SMTIN algorithm [9] and the clusterability and seed-detection methods introduced in [3]. SMTIN performs a connected-component search over a set of nearest-neighbor edges where the weight of each edge is below a user-input threshold. [3] discusses a methodology for determining clustering tendency based on the all-pairs edge lengths of the input set.

3 Background

In this section we provide background information regarding our work. Included in our discussion is an introduction to edge-centric clustering and in particular the SMTIN algorithm. We also cover the motivations behind our multi-resolution approach.

SMTIN uses a Delaunay triangulation of the input point set in order to generate a candidate set of edges. In low-dimensional data sets, a complete set of contiguous, nearest-neighbor edges is generated in O(n log n) time. The number of edges generated is O(n) [14], thus making the complexity of further processing scale well with the number of points. The algorithm itself is quite elegant and simple. After removing all edges with weight above a given threshold τ, the connected components of the remaining edges are detected. The point sets induced by these components constitute the output clusters. There are two obvious and complementary aspects of SMTIN that lend themselves to extension, namely detection of τ automatically and the ability to adapt τ on a local scale dynamically. These two elements form the basis for our new, multiple-resolution algorithm. Using the same triangulation edges, we generate a histogram of edge lengths. The vector of τ values is determined by the makeup of this histogram. Figure 2(a) represents an example data set that cannot be clustered accurately at a single resolution. Figure 2(b) shows the Delaunay triangulation of the data set shown in Figure 2(a). The histogram induced by the edges of the triangulation in Figure 2(b) is shown in Figure 2(c). [3] discusses the use of such histograms for discovering clustering tendency. The intuition of our approach is that valleys in the histogram represent density shifts in the data set. For any particular edge-length interval for which the frequency of occurrence of edges in the interval is low, there is a corresponding density shift in the represented data set. These shifts form the boundaries surrounding clusters that exist at differing resolutions.
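For concreteness, the following is a minimal sketch of the SMTIN-style baseline just described: triangulate, drop edges longer than τ, and report connected components as clusters. This is our reading of the published description, not the authors' C implementation; the function name smtin_cluster and the use of scipy are our own illustration.

```python
import numpy as np
from scipy.spatial import Delaunay
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def smtin_cluster(points, tau):
    """SMTIN-style sketch: cluster 2-D points via Delaunay edges <= tau."""
    tri = Delaunay(points)
    # Collect the unique undirected edges of the triangulation.
    edges = set()
    for simplex in tri.simplices:
        for a in range(3):
            i, j = sorted((simplex[a], simplex[(a + 1) % 3]))
            edges.add((i, j))
    # Keep only edges whose Euclidean length is no greater than tau.
    kept = [(i, j) for i, j in edges
            if np.linalg.norm(points[i] - points[j]) <= tau]
    n = len(points)
    rows = [i for i, _ in kept]
    cols = [j for _, j in kept]
    graph = coo_matrix((np.ones(len(kept)), (rows, cols)), shape=(n, n))
    # Connected components of the filtered edge graph are the clusters.
    _, labels = connected_components(graph, directed=False)
    return labels
```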

4 New Algorithm

Figure 3 lists the steps of our multi-resolution clustering algorithm. A set of edges is returned in Step 1. In Step 2 we generate the edge-length frequency histogram H with a scan of the edge set. By virtue of its length, each edge of the triangulation maps to exactly one bucket of H. The bucket indices are scaled logarithmically in order to adjust the interval of each bucket by length. Interesting results might also be obtained using a histogram-normalization method such as that described in [12], but space limitations prohibit us from exploring this avenue further. During the scan of Step 2 we also create an adjacency list for each point. The nodes of each adjacency list i are pairs of the form <j, b> where j is an index into the point set and b is the index of the bucket to which edge e_{i,j} maps by virtue of its length. For each edge e_{i,j} that maps to bucket b, there is an adjacency node of the form <j, b> in adjacency list i as well as a node of the form <i, b> in adjacency list j after Step 2.
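As a sketch of Steps 2 and 3 under the stated setup (b = 100 buckets, logarithmic scaling, positive edge lengths), one might bucket the edge lengths and locate valleys as follows. The strict-local-minimum rule for valleys and the helper names are our own stand-ins, since the text does not pin down the exact valley criterion; in practice some smoothing of the histogram would likely be wanted.

```python
import numpy as np

def edge_histogram(lengths, b=100):
    """Map each (positive) edge length to one of b log-scaled buckets."""
    log_len = np.log(np.asarray(lengths, dtype=float))
    lo, hi = log_len.min(), log_len.max()
    if hi == lo:
        # Degenerate case: all edges the same length fall into one bucket.
        buckets = np.zeros(len(log_len), dtype=int)
    else:
        buckets = np.minimum(((log_len - lo) / (hi - lo) * b).astype(int), b - 1)
    hist = np.bincount(buckets, minlength=b)
    return hist, buckets

def find_valleys(hist):
    """Indices of strict local minima of the bucket counts."""
    return [k for k in range(1, len(hist) - 1)
            if hist[k] < hist[k - 1] and hist[k] < hist[k + 1]]
```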


Figure 2: Multiple-resolution example. (a) Data set. (b) Triangulation. (c) Edge-frequency histogram.

Note that the relationship between edge lengths and buckets is many-to-one. More than one length may map to a particular bucket. Once H has been created we scan it in order to find the vector V of valley indices in Step 3. These indices become the resolution boundaries used during the connected-component search in Step 5. Each item in V represents a cutoff bucket, defining a resolution restriction on adjacency. In Step 4 we update the bucket index of each adjacency-list node to correspond to the index of the correct cutoff. This assigns to each point the correct resolution at which the point is to be clustered. Since we want each point to be assigned to a cluster based on the lowest edge length (and thus densest) group to which it belongs, we maintain for each point the index of the lowest cutoff index of any of its edges, denoted low. In Step 5 we perform a DFS on the adjacency lists, with the additional constraint that two points are considered adjacent only if their low values are equivalent.

Input: a data set D
Output: a set of clusters
1: find edge set E of D by Delaunay triangulation.
2: generate frequency histogram H of edge lengths.
2a: create adjacency lists (one for each point).
3: find vector V of valleys in H.
4: assign appropriate cutoffs to adjacency-list nodes.
4a: maintain lowest-resolution nodes for each list.
5: find connected components subject to cutoff constraints.

Figure 3: New Algorithm

The final value of low for each point p reflects the resolution at which p will be clustered. Only points with equal low values will be grouped together. This ensures maximally uniform density among co-clustered points. One important notion that deserves consideration is that points which happen to lie on the border between groups of points at different resolutions are assigned to the group with the lower low value (and thus greater density). This situation is illustrated in Figure 4. There are three circled regions of points in the diagram, labelled A, B, and C. The points in question are those in region C, since each of these points lies on the boundary between sets of mutually-clusterable points at different resolutions. Suppose the edges among points in group A are assigned threshold index j and the edges between points in group B are assigned threshold index k, with j < k. Group C points will thus have mixed types of edges at the beginning of Step 4. By the definition of the algorithm, all points in group C will be assigned a low value of j. Since each group A point will have a low value of j and each group B point will have a low value of k, any edge between a point in group C and a point in group B will be filtered. Similarly, any edge between a point in group C and a point in group A will be accepted. Thus all points in group C will cluster with the points in group A.

Figure 4: Choosing the resolution boundary

One of the most powerful aspects of the algorithm is the fact that it makes clustering decisions at both the local and global levels. Each interval of bucket groups in the threshold vector brings together sets of points that have a high level of occurrence in the set as a whole. Thus each interval in the threshold vector defines a single clustering resolution, based upon the makeup of the entire input. The restriction of clustering each point at its lowest resolution causes points to be clustered at tight levels of mutual density on a local scale.
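Putting Steps 4 and 5 together, a hedged sketch of the cutoff assignment and the constrained connected-component search might look as follows. The mapping of buckets to cutoff intervals via searchsorted, and the rule that an accepted edge must itself lie at the shared low resolution, are our interpretation of the description above, not the authors' code.

```python
from collections import defaultdict
import numpy as np

def assign_cutoffs(buckets, valleys):
    """Step 4: map each edge's bucket to the index of its cutoff interval."""
    return np.searchsorted(np.asarray(valleys), buckets, side='left')

def multi_resolution_components(n, edges, cutoff_idx):
    """Steps 4a-5: DFS grouping points that share their lowest cutoff index."""
    adj = defaultdict(list)
    low = [np.inf] * n
    for (i, j), c in zip(edges, cutoff_idx):
        adj[i].append((j, c))
        adj[j].append((i, c))
        low[i] = min(low[i], c)  # keep the densest resolution seen at each point
        low[j] = min(low[j], c)
    labels = [-1] * n
    cluster = 0
    for s in range(n):
        if labels[s] != -1 or low[s] == np.inf:
            continue  # already clustered, or an isolated point
        labels[s] = cluster
        stack = [s]
        while stack:
            u = stack.pop()
            for v, c in adj[u]:
                # Accept the edge only if both endpoints cluster at the same
                # resolution and the edge belongs to that resolution.
                if labels[v] == -1 and low[v] == low[u] and c == low[u]:
                    labels[v] = cluster
                    stack.append(v)
        cluster += 1
    return labels
```

Composed with the earlier sketches: cutoff_idx = assign_cutoffs(buckets, valleys), with buckets from edge_histogram and valleys from find_valleys, then labels = multi_resolution_components(n, edges, cutoff_idx).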

4.1 Complexity Analysis

Here we provide an analysis of the theoretical running time of our algorithm. We demonstrate that the computational complexity of the entire algorithm is dominated by the triangulation step, which is O(n log n). This is equivalent to the complexity of the basic SMTIN algorithm [9]. Step 1 is known to be optimally performed in O(n log n) time [14]. Since the number of edges produced is at most 3n − 6, Steps 2 and 4 each require a single scan of E and are thus each O(n). Let b represent the number of buckets in the frequency histogram (all experiments in this work were done with b = 100). Step 3 is O(b). Step 5 is just a modified depth-first search and is thus O(n). Clearly the running time of the algorithm from input to final output is dominated by the triangulation step and is thus O(n log n).

5 Experimental Verification

In this section we present our experimental results. We demonstrate a substantial improvement in the ability to detect spatial clusters with varying density in comparison to SMTIN. This enhanced processing comes at the expense of only a negligible performance penalty when compared to the triangulation step, a necessary component of both algorithms. Results show our algorithm's ability to discern subtle differences in localized point density with a high degree of precision.

5.1 Description of Experiments

We compare our results with those obtained using our implementation of the SMTIN algorithm. Since both algorithms rely on Delaunay triangulation, our experiments consider the generation of the edge set as a preprocessing step. Our comparisons thus focus on processing a triangulated point set. Results in [9] demonstrate the efficiency of SMTIN using an efficient triangulation algorithm [15]. The amount of time required to completely cluster large data sets (100,000 points) is shown to be on the order of 30-40 seconds. The experiments presented in this section are offered in support of the following points: our algorithm significantly improves upon SMTIN's ability to discern variations in localized mutual density among groups of points. It maintains a high degree of processing efficiency while removing the need for an input distance threshold. The single-run, hierarchical clustering qualities inherent in our algorithm result in the generation of useful information from sparse regions. Previous approaches dismiss the points in such regions as outliers and thus remove them. We report several statistics on each of our clusterings as well as those of SMTIN in Figure 5.

5.1.1 Evaluation Criteria

One of the challenges in data clustering is the determination of the correct resolution or sampling window [8]. The crux of our algorithm is the definition of what constitutes a valid clustering at a particular resolution. Using an automated means of determining such a resolution allows us to detect clusters in a robust, order-independent manner. The basis of our evaluation metric is the desire to maximize intra-cluster homogeneity with respect to nearest-neighbor edge lengths. A high degree of edge-length homogeneity implies uniform density among the points in the cluster. Clusters as we define them do not merely represent endpoints of contiguous edge sets as in [9]; they are further characterized by uniform intra-cluster density. One of the most immediate consequences of this approach is the ability to mine useful information from points that were previously considered noise and thus of no use. The choice of input threshold in SMTIN makes it particularly sensitive to the delineation of noise points. In Figure 5 we provide results that quantify the above qualities within a given clustering in two ways. First, we report the number of points assigned to actual clusters as well as those considered noise. Second, we evaluate the degree of length-homogeneity among the edges that make up the clusters themselves. In order to do so we determine the set of edges of the Euclidean minimum spanning tree [14] for each cluster. This structure represents the means by which a minimally-weighted, connected component of points is generated. Let MST_C denote the Euclidean minimum spanning tree for some cluster C, and let E_C denote the set of edges of MST_C. We define the density-spread of C as the difference between the length of the longest edge in E_C and that of the shortest edge in E_C. Intuitively, a high density-spread value indicates intra-cluster heterogeneity, an undesirable characteristic. We would thus like density-spread to be low. For each result we report the average density-spread over all clusters in a given clustering.
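The density-spread of a single cluster can be computed directly from its Euclidean minimum spanning tree. A minimal sketch, assuming distinct points (scipy's minimum_spanning_tree treats zero entries of the distance matrix as missing edges) and using the dense all-pairs distance matrix for simplicity rather than an asymptotically optimal EMST construction:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def density_spread(cluster_points):
    """Longest minus shortest edge of the cluster's Euclidean MST."""
    if len(cluster_points) < 2:
        return 0.0
    dist = squareform(pdist(np.asarray(cluster_points, dtype=float)))
    mst = minimum_spanning_tree(dist)
    lengths = mst.data  # the n-1 edge weights of the spanning tree
    return float(lengths.max() - lengths.min())
```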

5.2 Results

In this section we present comparisons between our new algorithm and SMTIN. Both algorithms were implemented in C and run on an IBM ThinkPad 600 with a 233 MHz Pentium II processor running Red Hat Linux 5.2 with kernel version 2.0.36-0.7. We show a performance and quality evaluation for three data sets:

A: A four-by-four grid of roughly 2,100 points with varying local density (see Figure 6(a)).

B: Roughly 101,000 points at two resolutions: a dense rectangular region, surrounded by a contiguous, sparser region (see Figure 8(a)).

C: Data set 3 from [17]: 100,000 points of somewhat randomly distributed data (see Figure 9(a)).

For each data set we report the following for each algorithm: number of discovered clusters, total points, clustered points, noise points, average density-spread and processing time. Note that since this evaluation is solely a comparison between our algorithm and SMTIN, the reported processing time here includes only the elements of processing where the two algorithms differ (i.e., selecting the threshold vector in the case of our algorithm, and performing the actual cluster detection). [9] reports results over the entire process including triangulation, in which 100,000 points are completely processed in roughly 30-40 seconds. In each case we report one line for our multiple-resolution algorithm as well as one line each for various interesting input thresholds of SMTIN. For data sets A and B we show representations of the resulting clustering. The grid structure of data set A makes it easy to refer to individual segments of the data set in our description. For data set C we provide a plot that illustrates our grouping of points that would likely be removed as outliers by other approaches.

5.3 Analysis

Figure 6(a) represents data set A and Figure 6(b) illustrates the clusters detected from data set A by our algorithm. In discussing these figures we refer to each represented region by a coordinate pair (i, j), where i represents the row for 1 ≤ i ≤ 4 and j represents the column for 1 ≤ j ≤ 4. Regions (1,1) and (1,3) are each considered complete, individual clusters, as are regions (2,2), (2,3), (4,2) and (4,4). There is a thin line of dense points that runs through regions (2,4), (3,4) and (3,3), thus connecting (4,1) with (3,2). Each small, dense region in (2,4) is detected as an individual cluster, demonstrating the ability of the algorithm to discern subtle density shifts. There is a single bridge intentionally present in the input data between (3,3) and the top row of (4,3), thus linking that row only to the large cluster. In regions (1,2) and (4,3) the vertical separation between points is larger than the horizontal, and so each row represents a cluster. There are four clusters in region (4,1).

Figures 7(a-d) show SMTIN clusterings of data set A for various distance thresholds. The four small clusters in (4,1) are correctly detected in all plots. In plot (a), the low threshold value results in the exclusion of many noise points, yet only the rows of (1,2) and (4,3) are detected correctly. The remaining clustered points are all considered to be part of a single cluster. In examples (b-d) only the clusters of (4,1) are correctly detected. All other points are grouped into one cluster.

Figure 8(a) represents data set B. Data set B consists of a rectangular region of somewhat dense points, surrounded on all sides by a sparser set of points. This example clearly illustrates the ability of the multiple-resolution algorithm to define cluster borders by changes in density in the set. Figure 8(b) illustrates the correct detection of two clusters from data set B by the multiple-resolution algorithm. Figure 8(c) shows SMTIN's detection for a value of τ too small to include the points in the outer ring. Figure 8(d) shows SMTIN's detection for a value of τ too large to distinguish between points in the two regions. Figures 9(a) and (b) show data set C and the sparsest cluster detected by our algorithm. The sparseness of these points would normally cause them to be detected as noise, and the points (or at least a large segment of them) would normally be discarded. Yet the plot indicates that there is useful information that can be discerned from them. The illustrated points represent the sparse borders that surround the more-dense regions of the data set.

5.3.1 Cluster Quality and Running Time Comparison

Figure 5 shows the evaluative characteristics of each of the plots shown above. As would be expected, the density-spread of SMTIN's clusterings increases with the threshold, but a decrease in the threshold results in an increase in the number of points removed as noise, resulting in information loss. While a relative comparison of runtimes between our algorithm and SMTIN shows what appears to be a significant performance penalty, we feel that it is still acceptable for the following reasons. The runtime shown reflects the time to detect the thresholds as well as to process with multiple thresholds; this step is not undertaken by SMTIN. More importantly, the overall processing time will still be dominated by the triangulation, as demonstrated in [9]. When run as a complete process from point set to clusters, there will be little or no perceptible difference in runtime between our algorithm and SMTIN. With this in mind, the multi-resolution algorithm will perform in time comparable to, if not better than, most other algorithms.

5.4 Runtime

Figure 10 shows the runtime of the threshold-detection phase and processing phase of our algorithm as a function of point-set size. The linear-time complexity of the steps unique to our algorithm is confirmed.

6 Conclusions and Future Work

Common clustering-based knowledge-discovery systems include spatial miners and information miners. Spatial data mining algorithms very often work in low dimensions and attempt to group points visually. Information miners generally work in high dimensions and attempt to discover numeric or associative patterns in data. Two important and as yet largely unexplored desiderata of spatial algorithms are the ability to mine clusters at multiple resolutions and the ability to do so without requiring subjective input from the user. In this work we have introduced a novel, multi-resolution clustering algorithm that addresses both of these issues efficiently and effectively. Experimental results demonstrate a substantial improvement in clustering quality when compared to existing algorithms.




Figure 10: Running times for Threshold Detection and Processing

6.1 Promising Directions

Our system uses a novel approach to maximize the mutual similarity of nearest-neighbor edge lengths within each cluster. This quality, coupled with the ability to perform hierarchical clustering at resolutions determined from the data set itself, makes the potential for using our algorithm on high-dimensional data sets quite promising. Two of the most significant challenges in mining high-dimensional data sets should not be problematic for our algorithm. High-dimensional data sets are notoriously hard to cluster because they are in general very sparse. However, the resolution at which clustering takes place in our system is adapted dynamically not only to the entire data set, but within the context of localized regions as well. In the face of extreme sparsity, the algorithm will still discern subtle differences in inter-point density. The second challenge posed by high-dimensional data sets is the tendency of computational complexity to explode as dimensionality increases. Once the edge set of nearest neighbors is determined, however, our algorithm has linear complexity with respect to the size of the point set, regardless of the number of dimensions. Although Delaunay triangulation is known to be intractable in high dimensions, an ample body of algorithms exists to discover sets of nearest neighbors and thus provide a suitable edge set as input to our system, as sketched below. For efficiency purposes it is preferable for the size of the edge set to be linear in the number of points in the data set. In addition to these factors, the tendency of our system to maximize intra-cluster homogeneity maps nicely onto a high-dimensional, information-gathering approach.
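As one illustration of that substitution (our own, not something the paper specifies), a k-nearest-neighbor graph yields an edge set of size linear in n for fixed k; scikit-learn's kneighbors_graph is a readily available way to build it, after which the edges can feed the same histogram-and-components pipeline:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_edge_set(points, k=6):
    """Undirected k-NN edge set as a Delaunay substitute in high dimensions."""
    graph = kneighbors_graph(points, n_neighbors=k, mode='distance')
    coo = graph.tocoo()
    # Symmetrize and deduplicate the directed k-NN edges.
    edges = {tuple(sorted((int(i), int(j)))): w
             for i, j, w in zip(coo.row, coo.col, coo.data)}
    return list(edges.keys()), np.array(list(edges.values()))
```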

References

[1] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. Automatic subspace clustering of high-dimensional data for data mining applications. In ACM SIGMOD Conference on Management of Data, 1998.


(a)
Method   Threshold   Clusters   Total Points   Clustered   Noise   Avg. Density-Spread   Time (secs.)
MR       -           37         2185           2172        13      0                     .03
SMTIN    1           21         2185           1665        520     0                     0
SMTIN    4           5          2185           2055        130     .53                   0
SMTIN    16          5          2185           2176        9       .93                   0
SMTIN    25          5          2185           2185        0       1.13                  0

(b)
Method   Threshold   Clusters   Total Points   Clustered   Noise   Avg. Density-Spread   Time (secs.)
MR       -           2          101007         101007      0       .83                   4.1
SMTIN    4           1          101007         51076       49931   0                     .9
SMTIN    16          1          101007         101007      0       2                     .9

(c)
Method   Threshold   Clusters   Total Points   Clustered   Noise   Avg. Density-Spread   Time (secs.)
MR       -           237        100000         92955       7045    .002                  4.2
SMTIN    .16         381        100000         52236       47764   .0007                 .9
SMTIN    .25         527        100000         76346       23654   .0009                 .9
SMTIN    .5          1          100000         99907       93      .002                  .9

Figure 5: Performance evaluations. (a) Data set A. (b) Data set B. (c) Data set C.

[2] Javed Aslam, Katya Pelekhov, and Daniela Rus. Static and dynamic information organization with star clusters. In Proceedings of the Conference on Information and Knowledge Management, 1998.

[3] Scott Epter, Mukkai Krishnamoorthy, and Mohammed Zaki. Clusterability detection and initial seed selection in large datasets. Technical report, Rensselaer Polytechnic Institute, 1999.

[4] Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Conference on Knowledge Discovery in Databases, 1996.

[5] David Gibson, Jon Kleinberg, and Prabhakar Raghavan. Clustering categorical data: an approach based on dynamical systems. In Proceedings of the 24th VLDB Conference, 1998.

[6] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: an efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data, 1998.

[7] Eui-Hong (Sam) Han, George Karypis, Vipin Kumar, and Bamshad Mobasher. Clustering based on association-rule hypergraphs. Technical report, University of Minnesota, 1997.

[8] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

[9] In-Soo Kang, Tae-wan Kim, and Ki-joune Li. A spatial data mining method by Delaunay triangulation. In ACM Conference on Geographic Information Systems, 1997.

[10] Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, 1990.

[11] Kihong Kim and Sang K. Cha. Sibling clustering of tree-based spatial indexes for efficient spatial query processing. In Proceedings of the Conference on Information and Knowledge Management, 1998.

[12] Sukhamay Kundu. A solution to histogram-equalization and other related problems by shortest path methods. Pattern Recognition, 31, March 1998.

[13] Raymond T. Ng and Jiawei Han. Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th VLDB Conference, 1994.

[14] Franco P. Preparata and Michael Ian Shamos. Computational Geometry: An Introduction. Springer-Verlag, 1985.

[15] J. R. Shewchuk. Triangle: Engineering a 2D quality mesh generator and Delaunay triangulator. In Proceedings of the First Workshop on Applied Computational Geometry, 1996.

[16] Gholamhosein Sheikholeslami, Surojit Chatterjee, and Aidong Zhang. WaveCluster: a multi-resolution clustering approach for very large spatial databases. In Proceedings of the 24th VLDB Conference, 1998.

[17] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: an efficient data clustering method for very large databases. In ACM SIGMOD Conference on Management of Data, 1996.


Figure 6: Data Set A. (a) Original data set. (b) Multi-resolution clusters.


Figure 7: SMTIN clusters of Data Set A. (a) τ = 1. (b) τ = 4. (c) τ = 16. (d) τ = 25.


Figure 8: Data Set B. (a) Original data set. (b) Multi-resolution clusters (two distinct clusters). (c-d) SMTIN clusters: (c) τ = 4 (all points in one cluster; outer ring points removed as outliers), (d) τ = 16 (all points in one cluster).


Figure 9: Data Set C. (a) Original data set. (b) Lowest-density cluster reported by the multi-resolution algorithm.
