Clustering High-Dimensional Data with Low-Order Neighbors

Yanchang Zhao and Chengqi Zhang
Faculty of Information Technology
University of Technology, Sydney, Australia
{yczhao, chengqi}@it.uts.edu.au

Yi-Dong Shen
Lab of Computer Science, Institute of Software
Chinese Academy of Sciences, China
[email protected]
Abstract

Density-based and grid-based clustering are two main clustering approaches. The former is famous for its capability of discovering clusters of various shapes and eliminating noise, while the latter is well known for its high speed. A combination of the two approaches seems to provide better clustering results. To the best of our knowledge, however, all existing algorithms that combine density-based and grid-based clustering take cells as atomic units, in the sense that either all objects in a cell belong to a cluster or no object in the cell belongs to any cluster. This requires the cells to be small enough to ensure a fine resolution of results. In high-dimensional spaces, however, the number of cells can be very large when cells are small, which makes the clustering process extremely costly. Moreover, the number of neighbors of a cell grows exponentially with the dimensionality of the dataset, which increases the complexity further. In this paper, we present a new approach that takes objects (or points) as the atomic units, so that the restriction on cell size can be relaxed without degrading the resolution of clustering results. In addition, the concept of ith-order neighbors is introduced to avoid considering the exponential number of neighboring cells. By considering only low-order neighbors, our algorithm is very efficient while losing only a little accuracy. Experiments on synthetic and public data show that our algorithm can cluster high-dimensional data effectively and efficiently.
1 Introduction

Density-based clustering [2, 3, 5] and grid-based clustering [9, 10] are two well-known clustering approaches. The former is famous for its capability of discovering clusters of various shapes and effectively eliminating outliers, while the latter is well known for its high speed. However, neither approach is scalable to high dimensionality. For density-based approaches, the reason is that the index
structures, such as the R*-tree, are not scalable to high-dimensional spaces. For grid-based approaches, the reason is that both the number of cells and the number of neighboring cells grow exponentially with the dimensionality. Grid-based algorithms take cells as the atomic units, which are inseparable, and thus the intervals into which each dimension is partitioned must be small enough to ensure the accuracy of clustering. Consequently, the number of cells becomes extremely large as the dimensionality increases. Meanwhile, the number of neighbors of a cell also grows exponentially with the dimensionality, which increases the complexity further. Some researchers try to break the curse of dimensionality by using an optimal grid [6], an adaptive grid [8], or an Apriori-like procedure [1].

Previously, we developed an algorithm called AGRID (Advanced GRid-based Iso-Density line clustering) [11] that combines density-based and grid-based approaches to cluster large high-dimensional datasets. Following the idea of density-based clustering, it employs a grid to reduce the complexity of density computation and can discover clusters of arbitrary shapes efficiently. However, in order to reduce the computational cost, only (2d+1) out of all 3^d neighbors are considered for each cell when computing the densities of the objects in it. When the dimensionality is high, the majority of the neighboring cells are ignored and the accuracy becomes very poor.

In this paper, we present a new version of AGRID that substantially improves the accuracy of density estimation and clustering. It has two main technical features. The first is that objects (or points), instead of cells, are taken as the atomic units. In this way, it is no longer necessary to set the intervals in every dimension very small, so the number of cells does not grow dramatically with the dimensionality of datasets. The second feature is the concept of ith-order neighbors, with which the neighboring cells are organized into a number of groups to meet different requirements of accuracy. As a result, we obtain a tradeoff between accuracy and speed.

The rest of the paper is organized as follows. In Section 2, we introduce related work in density-based and
grid-based clustering. Section 3 reviews the basic idea of AGRID. The strategy of ith-order neighbors and our algorithm are described in detail in Section 4. Section 5 shows the experimental results of our algorithm. We conclude this paper in Section 6.
2 Related work

The general idea of density-based clustering is to continue growing a given cluster as long as the density (number of objects) in the neighborhood exceeds some threshold. Such a method can be used to filter out noise and discover clusters of arbitrary shapes. Typical density-based methods are DBSCAN [3], OPTICS [2], and DENCLUE [5].

DBSCAN [3] is a density-based algorithm built on the observation that within each cluster the density of points is significantly higher than the density of points outside the cluster. DBSCAN starts with an arbitrary point and retrieves all points density-reachable from it, using Eps and MinPts as controlling parameters. If the point is a core point, this procedure yields a cluster. If the point is a border point, DBSCAN goes on to the next point in the database. Its time complexity is O(N log N). The major drawback of DBSCAN is the significant input required from the user. In addition, the algorithm is not designed to handle higher-dimensional data.

OPTICS [2] builds an augmented ordering of the data which is consistent with DBSCAN, but goes a step further: keeping the same two parameters, Eps and MinPts, OPTICS covers a spectrum of all Eps' ≤ Eps. The constructed ordering can be used automatically or interactively. OPTICS can be considered a DBSCAN extension in the direction of different local densities. A more mathematically sound approach is to consider a random variable equal to the distance from a point to its nearest neighbor and to learn its probability distribution; instead of relying on user-defined parameters, the conjecture is that each cluster has its own typical distance-to-nearest-neighbor scale.

DENCLUE [5] models the overall point density analytically as the sum of the influence functions of the points. Clusters are identified by determining the density attractors. DENCLUE can handle clusters of arbitrary shape using an equation based on the overall density function.

Grid-based algorithms quantize the data space into a finite number of cells that form a grid structure, and all of the clustering operations are performed on this grid structure. The main advantage of this approach is its fast processing time. However, it does not work effectively and efficiently in high-dimensional spaces due to the so-called "curse of dimensionality". The main grid-based approaches to clustering include STING [10], WaveCluster [9], OptiGrid [6], CLIQUE [1], MAFIA [8], etc., and they are sometimes
called density-grid-based approaches [4, 7].

STING [10] is a grid-based multi-resolution clustering technique in which the spatial area is divided into rectangular cells organized into a hierarchy of statistical information. The statistical information associated with the spatial cells is captured, so queries and clustering problems can be answered without recourse to the individual objects. The hierarchical structure of grid cells and the statistical information associated with them make STING very fast. STING assumes that K, the number of cells at the bottom layer of the hierarchy, is much less than the number of objects, and the overall computational complexity is O(K).

WaveCluster [9] looks at the multidimensional data space from a signal-processing perspective, taking the objects as a d-dimensional signal. The high-frequency parts of the signal correspond to the boundaries of clusters, while the low-frequency parts with high amplitude correspond to the areas of the data space where data are concentrated. It first partitions the data space into cells, then applies a wavelet transform to the quantized feature space and detects the dense regions in the transformed space. With the multi-resolution property of the wavelet transform, it can detect clusters at different scales and levels of detail. Its time complexity is O(Nd log N).

The basic idea of OptiGrid [6] is to use contracting projections of the data to determine the optimal cutting hyperplanes for partitioning the data. The data space is partitioned with arbitrary (non-equidistant, irregular) grids based on the distribution of the data, which avoids the effectiveness problems of existing grid-based approaches and guarantees that all clusters can be found, while still retaining the efficiency of a grid-based approach. The time complexity of OptiGrid is between O(Nd) and O(dN log N).

CLIQUE [1] and MAFIA [8] are two algorithms for discovering clusters in subspaces. CLIQUE discovers clusters in subspaces in a way similar to the Apriori algorithm. It partitions each dimension into intervals and computes the dense units in all dimensions; these dense units are then combined to generate dense units in higher dimensions. MAFIA is an efficient algorithm for subspace clustering using a density- and grid-based approach. It uses adaptive grids to partition each dimension depending on the distribution of data in that dimension. The bins and cells with low data density are pruned to reduce the computation. The boundaries of the bins are not rigid, which improves the quality of the clustering result.
3 Basic idea of AGRID

In this section, we review the basic idea of AGRID. The following notation is used throughout this paper. N is the number of objects (or points, or instances) and d is the dimensionality of the dataset. L is the length of an interval and
r is the radius of the neighborhood. α is an object (or point), and Cα is the cell in which α is located. X is a point with coordinates (x1, x2, ..., xd), and Distp(X, Y) is the distance between X and Y with the Lp metric as the distance measure. Ci1i2..id stands for the cell whose identifier is i1i2..id, where ij is the identifier of the interval in which the cell is located in the j-th dimension. Denq(α) is the density of α when all ith-order neighbors of α (0 ≤ i ≤ q) are considered in the density computation.

In our algorithm, each dimension is divided into several intervals and thus the data space is partitioned into many hyper-rectangular cells. The intervals in each dimension are numbered from zero in ascending order (0, 1, 2, 3, 4, ...), and the identifier of a cell is composed of the IDs of the intervals to which the cell belongs in every dimension. For example, in a 4-dimensional space, if a cell belongs to intervals 5, 2, 7 and 3 in the first, second, third and fourth dimensions respectively, then the identifier of the cell is (5, 2, 7, 3) and we write C5,2,7,3 for the cell.

AGRID uses the grid to reduce the computational cost. For any object α in a cell, we only compute its distances to the objects in its immediate neighboring cells. Objects that are not in the neighboring cells are far away from α and do not contribute to the density of α, so much time can be saved in this way.

Definition 1 (Density) The density of point α is the number of points in the neighborhood of α:

$$Den_q(\alpha) = \left\| \{ \beta \mid \beta \in \text{the } r_p\text{-neighborhood of } \alpha \} \right\|$$

AGRID defines neighborhood and neighboring cells as follows.

Definition 2 (Neighborhood) In a d-dimensional space, the space around point α in which all points are within distance r of α is called the r-neighborhood of α. In particular, when the Lp metric is used as the distance measure, the neighborhood is called the rp-neighborhood (p ≥ 1).

According to the above definition, the r2-neighborhood of a point is a hyper-sphere with radius r, while the r∞-neighborhood is a hyper-cube with edge 2r.

Definition 3 (Neighboring Cells or Neighbors) Cells Ci1i2..id and Cj1j2..jd are neighbors of each other iff

$$\begin{cases} |i_p - j_p| \le 1, & p = l \\ i_p = j_p, & p = 1, 2, \ldots, l-1, l+1, \ldots, d \end{cases}$$

where l is an integer between 1 and d, and i1i2..id and j1j2..jd are respectively the sequences of interval IDs of cells Ci1i2..id and Cj1j2..jd.
Generally speaking, in a d-dimensional space, each cell has (2d + 1) neighbors (including itself) according to Def. 3. Neighborhood and neighbors (or neighboring cells) are two different concepts according to Defs. 2 and 3. The former is defined for a point, and its neighborhood is an area or a space; the latter is defined for a cell, and its neighbors are a number of cells adjacent to it. With the idea of grid and neighbors, only those objects located in neighboring cells are considered when calculating the density of an object. For example, let α be an object located in cell Cα in a d-dimensional space. When computing the density of α, only objects located in the (2d+1) immediate neighboring cells (including Cα itself) are considered (see Figure 1). When the dimensionality is high, many neighboring cells are ignored and the accuracy becomes quite poor. To improve the accuracy, ith-order neighbors are proposed in our new algorithm.
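To make the grid machinery above concrete, the following is a minimal sketch (not the authors' implementation) of AGRID-style density computation: points are mapped to cell identifiers, and the density of each point counts only the points lying in its (2d+1) immediate neighboring cells and within L∞ distance r. The function names and the use of NumPy are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

def cell_id(point, L):
    """Identifier of the cell containing a point: the tuple of interval indices."""
    return tuple(int(x // L) for x in point)

def immediate_neighbors(cid):
    """The (2d+1) immediate neighbors of a cell (Definition 3): the cell itself
    plus the cells differing by +/-1 in exactly one dimension."""
    yield cid
    for j in range(len(cid)):
        for delta in (-1, 1):
            n = list(cid)
            n[j] += delta
            yield tuple(n)

def agrid_densities(points, L, r):
    """Density of each point: the number of points within L-infinity distance r,
    counting only points that lie in the (2d+1) immediate neighboring cells."""
    points = np.asarray(points)
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        grid[cell_id(p, L)].append(idx)
    densities = np.zeros(len(points), dtype=int)
    for idx, p in enumerate(points):
        for n in immediate_neighbors(cell_id(p, L)):
            for j in grid.get(n, ()):
                if np.max(np.abs(points[j] - p)) <= r:  # L-infinity distance
                    densities[idx] += 1
    return densities

# Toy usage: 1,000 random points in 4 dimensions.
den = agrid_densities(np.random.rand(1000, 4), L=0.25, r=0.1)
```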
4 Clustering with ith-order neighbors

When computing the densities of points with the help of the grid, all 3^d neighboring cells of each cell need to be considered if we want the exact densities. Although the accuracy is thus ensured, the computational complexity is exponential in the dimensionality, which makes this impractical. To improve the efficiency, our previous algorithm AGRID considers only the (2d+1) immediate neighboring cells. However, as the dimensionality increases, a majority of the neighboring cells are ignored, which makes the densities and the clustering less accurate.

The above 3^d and (2d+1) neighbors are two extreme cases. To improve the accuracy without losing much efficiency, we provide a strategy of ith-order neighbors to order the neighbors of a cell. All 3^d neighboring cells are grouped into (d+1) groups, and each group is assigned an order. The lower the order is, the larger the contribution of the cells to the densities. Therefore, by considering only the low-order neighbors, high accuracy and high efficiency can be achieved.

Assume that L is the length of an interval. If the radius of the neighborhood is greater than L/2 and we want to compute the density of object α exactly, then all 3^d cells around Cα need to be considered. Now let us see what happens when the radius of the neighborhood is less than L/2. Assume the L∞ metric is used as the distance measure; then the neighborhood of α is a hyper-cube. Figure 1 shows the neighborhood and neighbors in a 2-dimensional space. If object α is located near the upper-left corner of Cα, only those cells on the upper-left side of Cα need to be considered. The number of these cells is 2^d instead of 3^d. In what follows, we assume that r, the radius of the neighborhood, is less than L/2 (this requirement is easy to meet when the L∞ metric is used as the distance measure).
Figure 1. Neighbors and neighborhood. The black point is α, the gray cell in the center is Cα , and the gray cells around it are its (2d + 1) immediate neighbors defined in AGRID. The area circumscribed by the dotted line is the neighborhood of α.
Since only (2d+1) neighbors are considered in AGRID, a major part of the neighborhood is ignored, so the computed density becomes quite inaccurate, especially as d increases. Fortunately, the overlap between the neighborhood and a neighboring cell differs from neighbor to neighbor and is related to the position of the neighbor relative to α. Therefore, the neighbors can be classified into a number of groups according to their relative positions, and the density of α can be approximated by considering only those groups with significant contributions to it. The definition of ith-order neighbors is proposed to help compute densities.
4.1 ith-order neighbors

Definition 4 (ith-order neighbors) Let Cα be a cell in a d-dimensional space. A cell which shares a (d − i)-dimensional facet with cell Cα is an ith-order neighbor of Cα, where i is an integer between 0 and d. In particular, we define the 0th-order neighbor of Cα to be Cα itself.

The above definition can be put more formally as follows.

Definition 5 (ith-order neighbors) Assume that C and C′ are two cells in a d-dimensional space, and [uj, hj) and [u′j, h′j) are respectively the intervals in which C and C′ lie in the j-th dimension. C′ and C are ith-order neighbors of each other iff

$$\begin{cases} u_{j_t} = h'_{j_t} \ \text{or}\ h_{j_t} = u'_{j_t}, & 1 \le t \le i \\ u_{j_t} = u'_{j_t} \ \text{and}\ h_{j_t} = h'_{j_t}, & i+1 \le t \le d \end{cases}$$

where 0 ≤ i ≤ d, and j1, j2, ..., jd is a permutation of 1, 2, ..., d.

With ith-order neighbors, the neighbors of cell Cα are divided into a number of groups according to their positions relative to it. The lower the order is, the greater the contribution to the densities of the points in Cα. Since it is extremely expensive to consider all 3^d
neighboring cells for each cell, we take into account only the low-order neighbors, which improves the quality of clustering while keeping the efficiency. An example of ith-order neighbors in a 3-dimensional space is shown in Figure 2. The gray cell in Figure 2a is Cα, and the 0th-order neighbor of Cα is itself. The gray cells in Figures 2b-d are the 1st-, 2nd- and 3rd-order neighbors of Cα, respectively.

However, even within a single group of neighbors, not all cells are of the same significance to the points of Cα. As Figure 1 shows, if the radius of the neighborhood is less than L/2, only those neighbors which are located on the same side of Cα as α contribute to the densities of the points in Cα. Therefore, for each point α in cell Cα, the neighbors that need to be considered depend on the relative position of α in Cα, so in the following another definition of ith-order neighbors is given, not for cells but for points.

Definition 6 (ith-order neighbors w.r.t. α) In a d-dimensional space, let α be a point and Cα be the cell in which α is located. Among the ith-order neighbors of Cα, those cells located on the same side of Cα as α are called ith-order neighbors of Cα w.r.t. α, or ith-order neighbors of α for short.

An example of the ith-order neighbors of α in a 3-dimensional space is shown in Figure 3. Assume that α is a point in the up-right-back corner of the gray cell (Cα) in Figure 3a, so Cα is the 0th-order neighbor of α. The gray cells labelled "1", "2" and "3" in Figures 3b-d are the 1st-, 2nd- and 3rd-order neighbors of α, respectively. Since it is difficult to visualize the ith-order neighbors when the dimensionality is more than three, we also give an example in a 4-dimensional space. Assume that point α is located in cell C0,0,0,0 in a 4-dimensional space (near the corner of the cell adjacent to C1,1,1,1); then the ith-order neighbors of cell C0,0,0,0 w.r.t. α are the following.

0th: C0,0,0,0
1st: C1,0,0,0, C0,1,0,0, C0,0,1,0, C0,0,0,1
2nd: C1,1,0,0, C1,0,1,0, C1,0,0,1, C0,1,1,0, C0,1,0,1, C0,0,1,1
3rd: C1,1,1,0, C1,1,0,1, C1,0,1,1, C0,1,1,1
4th: C1,1,1,1

According to Definition 6, since an ith-order neighbor of α shares a (d − i)-dimensional facet with Cα, the ID sequence of an ith-order neighbor differs from that of Cα in exactly i IDs, and each differing ID can be either +1 or −1 away. Because the ith-order neighbors of α lie on the same side of Cα as α, the number of ith-order neighbors of α is $\binom{d}{i}$.

Definition 7 (Density) The density of point α is the number of points that are both in the neighborhood of α and in the low-order neighbors of α, i.e., in the ith-order neighbors with 0 ≤ i ≤ q:

$$Den_q(\alpha) = \left\| \{ \beta \mid \beta \in \text{the } r_p\text{-neighborhood of } \alpha \ \text{and}\ \beta \in \text{the ith-order neighbors of } \alpha,\ 0 \le i \le q \} \right\|$$
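To make Definition 6 and the binomial count above concrete, the following is a small sketch (an illustrative assumption, not the authors' code) that enumerates the ith-order neighbors of a point under the assumption r ≤ L/2, where the relevant side in each dimension is the half of the cell in which α lies.

```python
from itertools import combinations
from math import comb

def neighbors_of_point(point, L, order):
    """ith-order neighbors of a point alpha (Definition 6): cells that differ from
    C_alpha in exactly `order` interval IDs, each shifted toward the side of the
    cell on which alpha lies."""
    cid = [int(x // L) for x in point]
    # For each dimension, the side of the cell that alpha is closer to (+1 or -1).
    side = [1 if (x - c * L) >= L / 2 else -1 for x, c in zip(point, cid)]
    for dims in combinations(range(len(cid)), order):
        n = list(cid)
        for j in dims:
            n[j] += side[j]
        yield tuple(n)

# Sanity check: for a point in a 4-dimensional cell, the number of ith-order
# neighbors w.r.t. the point is C(4, i), as stated above.
alpha = [0.9, 0.8, 0.7, 0.95]  # a point near the "upper" corner of cell (0,0,0,0)
for i in range(5):
    assert len(list(neighbors_of_point(alpha, L=1.0, order=i))) == comb(4, i)
```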
Figure 2. ith-order neighbors of Cell Cα .
Figure 3. ith-order neighbors of Object α.

According to the above definition of density, only those neighbors whose orders are no more than q are considered, and the other neighboring cells are ignored. Thus, the density becomes more and more accurate as q increases. When q = d, all 3^d neighbors are considered and the computed density is exactly the true density. Nevertheless, the number of neighbors considered increases dramatically with q. When q is set to 1, the 0th- and 1st-order neighbors together are exactly the (2d+1) neighbors defined in AGRID.

With ith-order neighbors, we can see that the lower the order is, the larger the contribution to the density. Moreover, the more neighbors are considered, the looser the requirement for combination becomes. In fact, if only 0th-order neighbors are considered, two objects are required to be close enough to each other in the full d-dimensional space. If both 0th- and 1st-order neighbors are considered, then they are required to be close enough in a (d − 1)-dimensional space. Generally speaking, when all ith-order neighbors are considered, the two objects to be combined are required to be close enough in a (d − i)-dimensional space. Therefore, as q increases, the requirement is relaxed and the clustering takes the similarity in subspaces into account. In this way, our algorithm has an advantage over algorithms that discover clusters only in the full-dimensional space, since clusters usually exist only in subspaces of a high-dimensional dataset.

By tuning the parameter q, we can obtain different clustering effects. Clearly, both the accuracy and the cost increase as q increases, so we need a tradeoff between accuracy and efficiency. The value of q can be chosen according to the required accuracy and the performance of the computer. A large value of q improves the accuracy of the clustering result, but at the cost of time. Conversely, high speed can be achieved by setting q to a small value, but the accuracy becomes poorer accordingly.
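Putting Definitions 6 and 7 together, here is a minimal sketch of the density computation with low-order neighbors, under the paper's assumption r ≤ L/2 and with the L∞ metric; the function names and the use of NumPy are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from itertools import combinations
from collections import defaultdict

def density_q(points, L, r, q):
    """Approximate density of Definition 7: for each point, count the points within
    L-infinity distance r that lie in one of its ith-order neighbors, 0 <= i <= q."""
    points = np.asarray(points)
    d = points.shape[1]
    cids = [tuple(int(x // L) for x in p) for p in points]
    grid = defaultdict(list)
    for idx, cid in enumerate(cids):
        grid[cid].append(idx)

    densities = np.zeros(len(points), dtype=int)
    for idx, (p, cid) in enumerate(zip(points, cids)):
        # Side of the cell on which the point lies, per dimension (+1 or -1).
        side = [1 if (x - c * L) >= L / 2 else -1 for x, c in zip(p, cid)]
        for i in range(q + 1):                      # only low-order neighbors
            for dims in combinations(range(d), i):  # choose the i dims that differ
                n = list(cid)
                for j in dims:
                    n[j] += side[j]
                for k in grid.get(tuple(n), ()):
                    if np.max(np.abs(points[k] - p)) <= r:
                        densities[idx] += 1
    return densities

# Under r <= L/2, q = 1 yields the same counts as AGRID's (2d+1)-neighbor density;
# q = d would give the exact value.
den = density_q(np.random.rand(2000, 10), L=0.5, r=0.2, q=2)
```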
4.2 Choice of Distance Measure

In clustering, the Manhattan (L1) or Euclidean (L2) metric is usually used as the distance measure. The L∞ metric is chosen as the distance measure in our algorithm for the following two reasons.

One problem with high-dimensional data is that the data become very sparse when the dimensionality is high, since the distances between points grow with the dimensionality. With the L∞ metric as the distance measure, however, the distances between points become smaller and the data become denser, which to some degree counteracts the sparsity of data in high-dimensional space.

Another reason for choosing the L∞ metric is that it ties in with the grid structure. If the Lp metric is used as the distance measure, the distance between the two farthest points in a cell is d^{1/p} L. If d = 36 and p = 2 (Euclidean distance), this distance is 6L. If r is set to 6L, many cells need to be considered when computing the density of a single point. If, on the other hand, r is set to a value no greater than L/2 (as assumed in Section 4.1), the neighborhood of α becomes very small relative to the cell, and the densities of points are too small to be useful for clustering. Fortunately, with the L∞ metric as the distance measure, the distance between the two farthest points in a cell is L, which is not affected by the increase of dimensionality. Thus, it is easy to choose the size of the cells and the neighborhood, which makes our algorithm effective. Moreover, when the L∞ metric is used, the neighborhood of an object is a hyper-cube with edge length 2r (where r is the radius of the neighborhood). In addition, the assumption r ≤ L/2 made in Section 4.1 cannot be met when the L2 metric (or any other Lp metric) is used and the dimensionality is high, because the distances between points become very large; this problem disappears when the L∞ metric is used. Therefore, the L∞ metric is used in our algorithm:

$$Dist_\infty(X, Y) = \max_{i=1..d} |x_i - y_i| \qquad (1)$$
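As a quick numerical check of the claim about cell diameters (illustrative only), the two farthest corners of a single cell are d^{1/p} L apart under the Lp metric but only L apart under the L∞ metric:

```python
import numpy as np

d, L = 36, 1.0
corner_a = np.zeros(d)
corner_b = np.full(d, L)      # the two opposite corners of one cell

l2_diameter = np.linalg.norm(corner_b - corner_a)      # sqrt(d) * L = 6L for d = 36
linf_diameter = np.max(np.abs(corner_b - corner_a))    # L, independent of d

print(l2_diameter, linf_diameter)  # 6.0 1.0
```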
4.3 Choice of r and DT
Since Subsection 4.1 assumes that the radius of the neighborhood is less than L/2, r is simply set to a value less than L/2, while L is decided with the method in AGRID [11]. Since a small r can make the densities too low to find any useful clusters, r is set to a value between L/4 and L/2 in our algorithm. As for DT, the same formula is used as in AGRID [11].
5 Experimental results
Our experiments were performed on a PC with 256MB RAM and an Intel Pentium III 1GHz CPU. In these experiments, we show the scalability, performance and accuracy of CLONE, our new algorithm, and its improvement over AGRID. In addition, a comparison between CLONE and Random Projection [12] shows the effectiveness of our algorithm.
5.1 Contribution of ith-order neighbors to density

Experiments have been conducted with a couple of datasets to test the contribution of ith-order neighbors to the densities. The average result with 15-dimensional datasets of various sizes is shown in Figure 4. The horizontal axis denotes q (the order of neighbors) and the vertical axis represents the percentage of contribution or accuracy. The dotted line shows the contribution of ith-order neighbors to the densities, and the solid line shows the accuracy of the densities when all low-order neighbors are considered. The contribution of 1st-order neighbors is greater than that of the 0th-order neighbor, because there are more 1st-order neighbors than 0th-order neighbors. When the order is greater than one, the contribution to the density decreases as the order increases. Neighbors whose orders are no less than five contribute little to the densities. When q is set to 3, the accuracy of the density reaches 96.8%.

The above experiment shows that it is reasonable to consider only low-order neighbors in our algorithm. When q is zero, the algorithm is fastest, but the accuracy is very low. With the increase of q, more neighbors are considered and the accuracy goes up dramatically, but the running time becomes longer. When q is larger than three, there is no significant increase in accuracy in this experiment. Generally speaking, q can be set by users according to the performance of their computers and the required accuracy.
Figure 4. Contribution and accuracy with ith-order neighbors.
5.2 Improvement

The improvement of CLONE over AGRID is shown in Figure 5. The dataset consists of 5,000 records with 10 dimensions, and there are two clusters in it. Figure 5a shows the clusters discovered by AGRID and Figure 5b shows the clusters discovered by CLONE with q set to two. The first two clusters in Figure 5a are of a similar pattern, but they are separated by AGRID. They are combined into one cluster by CLONE when more neighbors are considered.
Figure 5. Experimental results of AGRID & CLONE: (a) clusters by AGRID; (b) clusters by CLONE.
Figure 6. Scalability with the size (a) and dimensionality (b) of datasets; the curves show the running time for q = 0 to 4.
5.3 Scalability

The performance of CLONE is given in Figure 6. In Figure 6a, the dimensionalities of the datasets are all 20, and the sizes of the datasets range from 10,000 to 100,000. In Figure 6b, the size is 100,000 and the dimensionalities range from 3 to 90. The values of q are set from 0 to 4. Since the performance depends on the specific dataset, the values in the figure are the average of several experiments. From the figure, it is clear that the complexity of CLONE is nearly linear in both the size and the dimensionality of the datasets when q is set to less than four. In addition, the complexity grows with the increase of q.
5.4 Public datasets
In addition to the experiments with synthetic datasets, experiments have been conducted on the Control Chart time series dataset from the UCI KDD Archive (http://kdd.ics.uci.edu/), and a comparison was made with the Random Projection algorithm [12]. The dataset has 60 dimensions and 600 records. The clustering given in the UCI KDD Archive is used as the standard result, and Conditional Entropy (CE) and Normalized Mutual Information (NMI), as used in [12], are utilized to measure the quality of clustering. The value of CE is a non-negative real number, while the value of NMI lies between zero and one. The smaller CE is, the closer the tested result is to the standard result. Contrary to CE, the larger the value of NMI is, the better the clustering. In all, we would like to minimize CE and maximize NMI.
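For reference, the sketch below shows one common way to compute CE and NMI from two labelings using a contingency table; the exact formulations in [12] may differ in details, so this is only an illustrative assumption.

```python
import numpy as np

def ce_and_nmi(true_labels, pred_labels):
    """One common formulation: CE = H(class | cluster), the entropy of the true
    classes within the discovered clusters; NMI = I(class; cluster) normalized by
    sqrt(H(class) * H(cluster))."""
    t = np.asarray(true_labels)
    p = np.asarray(pred_labels)
    n = len(t)
    _, ti = np.unique(t, return_inverse=True)
    _, pi = np.unique(p, return_inverse=True)
    cont = np.zeros((ti.max() + 1, pi.max() + 1))
    for a, b in zip(ti, pi):
        cont[a, b] += 1
    pxy = cont / n                       # joint distribution of (class, cluster)
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)

    def entropy(dist):
        dist = dist[dist > 0]
        return -np.sum(dist * np.log2(dist))

    h_x, h_y, h_xy = entropy(px), entropy(py), entropy(pxy)
    mi = h_x + h_y - h_xy
    ce = h_xy - h_y                      # H(class | cluster)
    nmi = mi / np.sqrt(h_x * h_y) if h_x > 0 and h_y > 0 else 0.0
    return ce, nmi
```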
Table 1. CLONE vs Random Projection

        CLONE    Random Projection
CE      0.517    0.706
NMI     0.822    0.790
The best clustering results of our algorithm and of Random Projection are given in Table 1. From the table, we can see that the clustering of our algorithm has a lower CE and a higher NMI, which shows that our algorithm performs better than Random Projection.
6 Conclusions

By combining density-based and grid-based clustering approaches, our previous algorithm AGRID can discover clusters of various shapes at very high speed. However, it ignores too many neighbors, and its accuracy becomes very low in high-dimensional spaces. In our new algorithm, the idea of ith-order neighbors is employed to solve this problem. Without much degradation of performance, the computed densities become more accurate and the clustering results turn out better than before. By changing the number of neighbors considered through different values of q, the accuracy and performance of the algorithm can be adjusted according to the requirements of different users and applications.
Acknowledgment Yi-Dong Shen is supported in part by the National Natural Science Foundation of China.
References

[1] R. Agrawal, J. Gehrke, et al.: Automatic subspace clustering of high dimensional data for data mining applications. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'98), pp. 94–105, Seattle, WA, June 1998.
[2] M. Ankerst, M. Breunig, et al.: OPTICS: Ordering points to identify the clustering structure. In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'99), pp. 49–60, Philadelphia, PA, June 1999.
[3] M. Ester, H.-P. Kriegel, et al.: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proc. 1996 Int. Conf. on Knowledge Discovery and Data Mining (KDD'96), pp. 226–231, Portland, Oregon, Aug. 1996.
[4] Jiawei Han, Micheline Kamber: Data Mining: Concepts and Techniques. Higher Education Press, Morgan Kaufmann Publishers, 2001.
[5] A. Hinneburg and D.A. Keim: An efficient approach to clustering in large multimedia databases with noise. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), pp. 58–65, New York, Aug. 1998.
[6] A. Hinneburg and D.A. Keim: Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering. In Proc. 25th VLDB Conf., Edinburgh, Scotland, 1999.
[7] Erica Kolatch: Clustering Algorithms for Spatial Databases: A Survey. Dept. of Computer Science, University of Maryland, College Park, 2001. http://citeseer.nj.nec.com/436843.html
[8] H. Nagesh, S. Goil, and A. Choudhary: MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets. Technical Report 9906-010, Northwestern University, June 1999.
[9] G. Sheikholeslami, S. Chatterjee, and A. Zhang: WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proc. 1998 Int. Conf. Very Large Data Bases (VLDB'98), pp. 428–429, New York, Aug. 1998.
[10] W. Wang, J. Yang, and R. Muntz: STING: A statistical information grid approach to spatial data mining. In Proc. 1997 Int. Conf. Very Large Data Bases (VLDB'97), pp. 186–195, Athens, Greece, Aug. 1997.
[11] Yanchang Zhao and Junde Song: AGRID: An Efficient Algorithm for Clustering Large High-Dimensional Datasets. In Proc. of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'03), pp. 271–282, Seoul, Korea, April 2003.
[12] Xiaoli Zhang Fern, Carla E. Brodley: Random Projection for High Dimensional Data Clustering: A Clustering Ensemble Approach. In Proc. 20th Int. Conf. on Machine Learning (ICML'03), Washington DC, 2003.