J. Comput. Sci. & Technol., Jan. 2003, Vol.18, No.1, pp.67-76

Clustering in Very Large Databases Based on Distance and Density

QIAN WeiNing, GONG XueQing and ZHOU AoYing

Department of Computer Science and Engineering, The Laboratory for Intelligent Information Processing, Fudan University, Shanghai 200433, P.R. China

E-mail: {wnqian, xqgong, [email protected]

Received January 4, 2001; revised October 30, 2002.

Abstract    Clustering in very large databases or data warehouses, with many applications in areas such as spatial computation, web information collection, pattern recognition and economic analysis, is a huge task that challenges data mining researchers. Current clustering methods suffer from several problems: 1) scanning the whole database leads to high I/O cost and expensive maintenance (e.g., of an R*-tree); 2) the uncertain parameter k must be pre-specified, so clustering can only be refined by repeated trial and test; 3) they lack efficiency in treating arbitrarily shaped clusters over very large data sets. In this paper, we first present a new hybrid clustering algorithm to solve these problems. This algorithm, which combines both distance and density strategies, can handle arbitrarily shaped clusters effectively. It makes full use of statistics information in mining to reduce the time complexity greatly while keeping good clustering quality. Furthermore, it can easily eliminate noise and identify outliers. An experimental evaluation is performed on a spatial database with this method and other popular clustering algorithms (CURE and DBSCAN). The results show that our algorithm outperforms them in terms of efficiency and cost, and the speedup grows as the data size scales up.

Keywords    data mining, very large database, clustering

1 Introduction

Along with the appearance of more and more applications in spatial databases, pattern recognition, economic market and data analysis, finding the knowledge behind data is becoming very important. Data mining is the process of extracting previously unknown, valid and actionable information from large databases and then using the information to make crucial decisions. Usually the mining objective is clear in the view of decision-makers or users: they know which dimension or combination of dimensions is important to them, and they want to understand the hidden and interesting patterns or characteristics in a rapid, easy and efficient way. That is to say, they want to grasp valuable knowledge at a low cost. Facing huge amounts of data that may be obtained from satellite images, medical equipment, geographic information systems (GIS), image database exploration[1], etc., current clustering algorithms often consider a single criterion and take one fixed strategy alone. Each criterion and strategy corresponds to its own main task.

These methods do not consider the relationships among the criteria or other useful information beneath the data. Because of such limitations, they unavoidably show weaknesses in some aspects, as seen in DBSCAN, CURE, BIRCH, CLARANS, STING, etc. To our knowledge, no efficient algorithm that combines different clustering strategies is currently available. Moreover, in very large databases, and especially in data warehouses, already existing information such as indexes and record clusters is very useful for data mining processes, but this kind of information is not fully utilized and analyzed. The goal of data clustering methods is to group the data or objects in databases into distinct and meaningful subclasses. As data sets grow very large, scanning the whole database or data warehouse for mining is not a wise approach. Although computer memory will increase a lot in the future, the size of a very large database is at least one hundred times that of memory. Clustering methods that scan very large databases cannot avoid I/O swapping, which causes huge overhead and greatly reduces mining efficiency.

This work is supported by the National Grand Fundamental Research '973' Program of China under Grant No.G1998030414 and the National Research Foundation for the Doctoral Program of Higher Education of China under Grant No.99038. The first author is partially supported by a Microsoft Research Fellowship.


If we capture a data sample that precisely reflects the data distribution and make use of the analysis already present in indexes, the clustering procedure can be confined to a limited data space and satisfactory efficiency can be achieved. Another problem troubling data mining researchers is the pre-specified k, which cannot be determined before the final goal is reached. We believe the following requirements for clustering algorithms are necessary: to achieve good time efficiency on very large data sets, to identify arbitrary clusters regardless of their shapes or relative positions, to remove noise or outliers effectively, to produce clustering results that are insensitive to the ordering of the input data, and to cluster without any pre-specified k. In this paper, a new clustering algorithm is proposed. It runs on a hierarchical framework and uses a hybrid criterion based on both the distance between clusters and the density within each cluster. This hybrid method can easily identify arbitrarily shaped clusters and scales up to very large databases efficiently. It also uses statistics information analyzed from indexes, which helps to pre-process the data sets, recognize noise, and configure sub-clusters quickly and precisely. In general cases, this pre-processing greatly reduces the data size to be handled. The rest of the paper is organized as follows. We first summarize the related work and its extensions in Section 2. In Section 3, a new clustering algorithm considering both distance and density is presented, together with its improvements for scaling up to very large databases and its complexity. Section 4 discusses its behavior in different data environments. In Section 5, we show the experimental evaluation of the effectiveness and efficiency of our algorithm on synthetic and very large data sets. Finally, concluding remarks are offered in Section 6.

2 Related Work

Before introducing the hybrid clustering algorithm, we summarize six main kinds of clustering methods found in current clustering algorithms: hierarchical methods, partitioning methods, density-based methods, grid-based methods, wavelet methods and categorical methods.

2.1 Hierarchical Methods

Hierarchical methods produce a sequence of partitions in which each partition is nested into the next, without the restriction of k. An agglomerative algorithm for hierarchical clustering starts with a separate set of clusters, one for each data point. Pairs of items or clusters are then successively merged until the distance between clusters satisfies the minimum requirement. BIRCH[2], CURE[3], etc., are hierarchical algorithms. BIRCH first performs a pre-clustering phase in which dense regions of points are represented by compact summaries, and then a centroid-based hierarchical algorithm is used to cluster the set of summaries (which is much smaller than the original data set). In BIRCH, the pre-clustering algorithm uses an incremental and approximate method to reduce the input size, during which the entire database is scanned and cluster summaries are stored in memory as a CF-tree. For each successive data point, the CF-tree is traversed to find the closest cluster to it in the tree. If the point is within a threshold distance of the closest cluster, it is absorbed into the tree; otherwise, it starts its own cluster in the CF-tree. BIRCH was the first clustering algorithm to reduce outliers, but it is sensitive to the input data order and does not perform well when the original shape is not "spherical". CURE uses multiple representative data points instead of a single representative point in order to capture the geometry of arbitrary shapes well. It uses a novel shrinking technique to reduce the effect of outliers and a kd-tree to simplify the data structure. However, its merge procedure is costly and its efficiency is low when the data size scales up to a very large database. In the worst case, the time complexity is O(n² log n), and its partition strategy is not clear.
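To make the pre-clustering idea above concrete, here is a minimal sketch of threshold-based absorption: each incoming point is absorbed by the nearest existing summary if it lies within a threshold distance, and otherwise starts a new summary. The flat list of summaries, the `Summary` class and the `threshold` parameter are simplifications of our own for illustration; BIRCH actually organizes CF entries in a CF-tree.

```python
import math

class Summary:
    """A toy cluster summary: point count, linear sum and centroid (not a real CF entry)."""
    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def absorb(self, point):
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]

def pre_cluster(points, threshold):
    """Scan the data once; absorb each point into the closest summary or open a new one."""
    summaries = []
    for p in points:
        if summaries:
            closest = min(summaries, key=lambda s: math.dist(s.centroid(), p))
            if math.dist(closest.centroid(), p) < threshold:
                closest.absorb(p)
                continue
        summaries.append(Summary(p))
    return summaries
```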

2.2 Partitioning Methods

Partitioning clustering algorithms try to determine k partitions that optimize a certain criterion measurement. Usually they start with an initial partition and then use an iterative criterion to make the objects within a cluster more similar to each other than to the objects in different clusters. The well-known K-means and K-medoids methods belong to this approach. The most popular partitioning clustering algorithms are PAM[4], CLARA[4] and CLARANS[5]. Ng and Han introduced CLARANS (Clustering Large Applications based on Randomized Search), an improved K-means method.


CLARANS brings clustering techniques into spatial data mining problems and overcomes most of the disadvantages of traditional clustering methods on large data sets. It is experimentally shown that CLARANS outperforms traditional k-means, but it is still slow, the cost of passing over the database is prohibitive, and the cluster quality for very large databases cannot be guaranteed.

2.3 Density-Based Methods

Jain[6] explores a density approach to identify clusters in k-dimensional point sets. The data set is partitioned into a number of non-overlapping cells and histograms are constructed. Cells with relatively high frequency counts of points are the potential cluster centers, and the boundaries between clusters fall in the valleys of the histogram. This method can identify clusters of any shape. However, the space and run-time requirements for storing and searching multidimensional histograms can be large, and the performance of such an approach crucially depends on the size of the cells. DBSCAN[7] and OPTICS[8] rely on a density-based notion of clustering and are designed to discover clusters of arbitrary shapes. The key idea in DBSCAN is that, for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points, i.e., the density in the neighborhood has to exceed a threshold. DBSCAN can separate the noise (outliers) and discover clusters of arbitrary shape. It uses an R*-tree to achieve better performance, but its average run time complexity is O(n log n). When the data size is very large, DBSCAN needs frequent I/O swapping to load data into memory and its efficiency becomes very low; sometimes it does not work at all. R. Agrawal et al. gave the CLIQUE[9] algorithm, which identifies dense clusters in subspaces of maximum dimensionality and generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It is easier to calculate the connection between cluster regions in CLIQUE than in DBSCAN, but it faces the same problem as DBSCAN does, and it also needs an uncertain choice of the optimal cut point.
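As a concrete reading of the DBSCAN condition above, the following sketch checks whether a point is dense enough: its eps-neighborhood must contain at least min_pts points. The brute-force neighborhood scan and the function names are ours for illustration; DBSCAN itself answers these region queries with an R*-tree.

```python
import math

def region_query(points, p, eps):
    """Return all points within distance eps of p (brute force, O(n) per query)."""
    return [q for q in points if math.dist(p, q) <= eps]

def is_core_point(points, p, eps, min_pts):
    """DBSCAN-style density test: the eps-neighborhood of p must contain
    at least min_pts points (p itself counts as part of its neighborhood)."""
    return len(region_query(points, p, eps)) >= min_pts
```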

2.4 Grid-Based Methods

Some algorithms presented recently materialize the data space into a finite number of cells and then perform all operations on the materialized space. The main advantage of these methods is their high processing speed, which is typically independent of the number of data objects.

They depend only on the number of cells in each dimension of the materialized space. Wang et al. proposed STING[10], which divides the spatial data into rectangular cells using a hierarchical structure and stores statistical information together with the cells. STING can use the hierarchical representation of grid cells to decide how a new object is assigned to the clusters. However, in the hierarchy, a parent cell may not be built up correctly because of the statistical summaries, which lowers the quality and accuracy of the clusters despite the high clustering speed. As the agglomerative procedure in the hierarchy moves on, the cells no longer represent the precise information the objects originally carried. Small errors propagate and accumulate through successive agglomerations and finally become large enough that the resulting clusters are often produced with great distortion with respect to the original data sets.

2.5 Wavelet Methods

Some researchers try to solve the clustering problem in other ways. WaveCluster[1] is a novel clustering approach based on wavelet transforms. It uses the multi-resolution property of wavelet transforms to identify arbitrary shapes. In the wavelet transform, convolution with an appropriate kernel function results in a transformed space where the natural clusters in the data become more distinguishable. The clusters can then be identified by finding the dense regions in the transformed domain. A priori knowledge about the exact number of clusters is not required in WaveCluster. However, the transform procedure has high complexity, and the signal processing often discards some clustered data points during the transform or when the amplitude is not wide enough. Also, the filter coefficients are another kind of uncertain parameter that has to be tuned.

2.6 Categorical Methods

Categorical-methods-based clustering focuses on the properties of categories, which appear more important as web applications become closer to our daily life. D. Gibson et al. proposed a method based on an iterative approach[11] for assigning and propagating weights on the categorical values in a table, which facilitates a type of similarity measure arising from the co-occurrence of values in the database. Daniel Boley et al., in their paper[12], presented a method that uses association rules and a weighted categorization of web pages to cluster web documents.


Their methods are very promising but still not general.

3 Hybrid Clustering Algorithm Based on Distance and Density

In this section, we present our hybrid clustering algorithm, whose salient features are: (1) with the hybrid strategy of distance and density, it can easily recognize arbitrarily shaped clusters (e.g., dumbbell-shaped, nested); (2) it is robust to the presence of noise and outliers in different proportions; (3) with statistics information, it takes far less time than traditional hierarchical clustering algorithms while keeping good cluster quality; and (4) the user need not specify k, the final number of clusters, which should be transparent to users.

3.1 Overview

From the related work introduced in Section 2, we conclude that the basic criterion in clustering methods is either distance or density. Current clustering algorithms are based on a single one of them. Each criterion pays much attention to its own way and ignores the advantages of the other; because of this, these methods unavoidably show weaknesses in some aspects, as seen in DBSCAN, CURE, BIRCH, CLARANS, STING, etc. Algorithms based on density can recognize most arbitrarily shaped clusters and finish clustering efficiently, provided the data set is small enough to be loaded into memory. However, a density algorithm is closely tied to the whole data set in the database, since it needs to scan the database sequentially at least once. If the data set scales up to a very large database or data warehouse, the scanning process may take much more time. Moreover, in order to store all the data and their density information, the algorithm employs complex data structures such as the R*-tree (DBSCAN)[7]; maintaining these structures and their operations consumes much system overhead and I/O swapping. In addition, density algorithms have difficulty identifying some shapes (say, dumbbells). On the other hand, algorithms based on distance can apply sampling techniques to reduce the data size greatly and can determine nicely a cluster's shape represented by multiple data points. Apparently, this kind of method is able to record all the clustering information in memory with simple data structures.

It thus avoids large I/O costs. Nevertheless, because of its strict dependency on the minimum distance, this kind of clustering method may produce wrong results: for example, two adjacent but distinctly separated clusters may be merged into one, or a very large cluster may be split into two. It also has a time complexity of O(n² log n) (CURE)[3]. When the data set scales up to a very large database, especially a data warehouse, the number of initial sub-clusters is still very large and the computation time becomes impractical. Besides, users have to give the number of clusters, which should be transparent to them. Our hybrid clustering algorithm runs on a hierarchical architecture. We use several representative points to outline a cluster, as CURE[3] does, which handles arbitrarily shaped clusters well. But we abandon its shrinking process, since shrinking leads to incorrect clusters when the distribution of the data is complex or special, e.g., when the original cluster is hollow. Also, in traditional hierarchical algorithms, the initial sub-clusters on the bottom level are the original data points themselves. With the help of statistics information obtained from indexes, we instead construct units as the sub-clusters on the bottom level. Each unit contains several data points that must belong to one cluster, so the number of units is far smaller than the number of original data points and the computational complexity can be greatly reduced. Our unit method is quite different from the grid idea of STING[10]; we will discuss their distinctions later. The hybrid method uses both density and minimum distance to determine whether two clusters should be merged and connected, and thus has the advantages of both density algorithms and distance algorithms. Detailed comparisons of our clustering method with the traditional hierarchical distance-based algorithm (CURE), the density-based algorithm (DBSCAN) and the grid-based algorithm (STING) are given in Sections 4 and 5.

3.2 Hybrid Clustering Algorithm

The hybrid algorithm needs three parameters: M-DISTANCE, M-DENSITY and M-DIAMETER. M-DIAMETER will be introduced later.

Definition 1. M-DISTANCE is the minimum distance between two clusters.

Definition 2. M-DENSITY is the minimum density a cell belonging to a cluster may have, where the density of a cell is the number of data points it contains.


The main clustering algorithm starts from the original sub-clusters (units or individual data points). It is detailed below.

1.  CLUSTER (M-DISTANCE, M-DENSITY)
2.  {
3.    sort the sub-clusters in a heap;
4.    for each sub-cluster i with minimum distance between i and i.closest do
5.    {
6.      if (distance(i, i.closest) < M-DISTANCE)
7.        merge(i, i.closest);
8.      else
9.      {
10.       if (CONNECTIVITY(i, i.closest, M-DENSITY) == TRUE)
11.         merge(i, i.closest);
12.       else
13.       {
14.         note that i and i.closest cannot connect to each other;
15.       }
16.     }
17.     adjust the heap;
18.   }
19. }

We first sort the sub-clusters according to the distance between each sub-cluster and its closest sub-cluster. Since these distances may change after every merging step, some adjustments to the sub-clusters are needed; the algorithm uses the heap structure[3], which is highly efficient for maintaining clustering data. The loop from line 4 to line 18 is the process that merges the sub-clusters. It takes the sub-cluster with the minimum distance between itself and its closest sub-cluster. If this distance is smaller than M-DISTANCE, the two sub-clusters must belong to one cluster and are therefore merged. There also exist situations where some sub-clusters should be merged although the distance between them is larger than M-DISTANCE, which an ordinary distance-based method cannot handle. In our hybrid algorithm, a simple but effective density-based test is designed to find these sub-clusters: we test the connectivity of sub-clusters whose closest sub-cluster is not very adjacent, i.e., whether a sub-cluster can connect to its closest sub-cluster. Two connected sub-clusters must belong to one cluster, so they are merged. If the two sub-clusters cannot connect to each other in the current merging process, they must belong to two different clusters unless the distance between them becomes smaller than M-DISTANCE, as noted in line 14. After the merging or noting operation, the closest sub-cluster of some sub-clusters may have changed.


The algorithm then adjusts the heap according to the new distance between each sub-cluster and its closest one.

We make use of statistics information to test the connectivity of sub-clusters. First, we give some definitions.

Definition 3. A cluster's diameter is the maximum distance between two data points in the cluster.

Definition 4. M-DIAMETER is the minimum diameter a cluster may have.

Definition 5. A cell is a small data grid whose diagonal length is smaller than min{1/2 M-DISTANCE, M-DIAMETER}.

Some properties of cells can now be stated and proved.

Theorem 1. No cell contains data points belonging to two different clusters, and no cell contains a whole cluster.

Proof. This follows directly from the definition of M-DISTANCE in Definition 1 and that of a cell in Definition 5. □

Definition 6. Noises are the data points in a cell whose density is smaller than M-DENSITY.

Theorem 2. If cell i belongs to cluster A, cell j is a neighbor cell of i, and density(j) > M-DENSITY, then cell j must belong to cluster A too.

Proof. Since density(j) > M-DENSITY, by Definition 6 the points in cell j are not noise, so they belong to some cluster. Because the diagonal of a cell is shorter than 1/2 M-DISTANCE and j is a neighbor of i, the distance between any data point in cell i and any data point in cell j is smaller than M-DISTANCE, so the points of i and j cannot belong to two different clusters. Therefore, all the data points in cell i and cell j belong to cluster A. □

Finally, guided by Theorems 1 and 2, the following procedure judges the connectivity of two sub-clusters.

1.  CONNECTIVITY (cluster i, cluster j, M-DENSITY)
2.  {
3.    QUEUE q;
4.    for each cell k in cluster i do
5.      q.ADD(k);
6.    for each neighbor cell l of cells in q AND l not in q do
7.    {
8.      if (density(l) > M-DENSITY)
9.        if (l.belongto == cluster j)
10.         return TRUE;
11.       else
12.       {
13.         q.ADD(l);
14.         merge(cluster i, l.belongto);
15.         for each cell m in l.belongto do
16.           q.ADD(m);
17.       }
18.   }
19.   return FALSE;
20. }

Here a queue structure stores all the cells that cluster i can reach. The outer loop adds all the cells in cluster i to the queue. The inner loop, from line 6 to line 18, finds all the cells cluster i can reach. The parameter M-DENSITY is the scale used to decide whether the data points in a cell are noise. A cell's belongto attribute denotes the sub-cluster the cell belongs to. When the procedure finds a cell belonging to cluster j, it returns TRUE. When it finds a cell that cluster i can reach, it adds this cell to the queue; in addition, it merges cluster i with the sub-cluster that cell belongs to, so that the main clustering procedure does not repeatedly test whether these two sub-clusters should be merged. All the cells in the newly merged sub-cluster can be reached from cluster i. If two sub-clusters should be merged, this procedure is carried out; if not, the algorithm continues with the next pair of sub-clusters satisfying the merging conditions. When no pair can be merged, the algorithm ends. The algorithm is self-adaptive and does not require the number of final clusters to be provided beforehand. Instead, users only give the parameters M-DISTANCE, M-DENSITY and M-DIAMETER. In most cases, users are not certain about the number of clusters before the end of the mining process, but they are sure about these parameters, which are more precise and more meaningful to them than the number of clusters.
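The following sketch puts CLUSTER and CONNECTIVITY together for 2-D data. It is a simplified rendering under assumptions of our own: a plain scan over all pairs replaces the heap, the connectivity search does not merge the intermediate sub-clusters it walks through, the minimum point-to-point distance stands in for the representative-point distance, and names such as `SubCluster`, `cell_density` and `cell_owner` are illustrative only.

```python
import itertools
import math

class SubCluster:
    """A sub-cluster: a set of grid cells plus the data points they contain."""
    def __init__(self, cid, cells, points):
        self.cid = cid
        self.cells = set(cells)    # grid coordinates, e.g., (ix, iy)
        self.points = list(points)

def min_distance(a, b):
    """Minimum point-to-point distance between two sub-clusters (stands in
    for the representative-point distance of the paper)."""
    return min(math.dist(p, q) for p in a.points for q in b.points)

def neighbors(cell):
    """The 8 grid cells surrounding a 2-D cell."""
    ix, iy = cell
    return [(ix + dx, iy + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)]

def connected(a, b, cell_density, cell_owner, m_density):
    """Breadth-first search over dense neighbouring cells, as in CONNECTIVITY.
    Unlike the paper's procedure, intermediate sub-clusters are not merged here."""
    queue = list(a.cells)
    seen = set(a.cells)
    while queue:
        cell = queue.pop(0)
        for n in neighbors(cell):
            if n in seen or cell_density.get(n, 0) <= m_density:
                continue
            if cell_owner.get(n) == b.cid:
                return True
            seen.add(n)
            queue.append(n)
    return False

def cluster(sub_clusters, cell_density, cell_owner, m_distance, m_density):
    """Main hybrid loop: merge the closest pair if it is within M-DISTANCE or
    density-connected; stop when no pair can be merged."""
    cannot_connect = set()
    while len(sub_clusters) > 1:
        pairs = [(min_distance(a, b), a, b)
                 for a, b in itertools.combinations(sub_clusters, 2)
                 if (a.cid, b.cid) not in cannot_connect]
        if not pairs:
            break
        d, a, b = min(pairs, key=lambda t: t[0])
        if d < m_distance or connected(a, b, cell_density, cell_owner, m_density):
            a.cells |= b.cells                    # merge b into a
            a.points.extend(b.points)
            for c in b.cells:
                cell_owner[c] = a.cid
            sub_clusters.remove(b)
        else:
            cannot_connect.add((a.cid, b.cid))    # note: cannot connect for now
    return sub_clusters
```

Here `cell_density` maps each grid coordinate to its point count and `cell_owner` maps it to the id of the owning sub-cluster, with cells sized so that their diagonal stays below min(1/2 M-DISTANCE, M-DIAMETER).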

3.3 Scaling Up to Very Large Databases

In the hybrid clustering algorithm, the initial sub-clusters consist of data points randomly sampled from the database, which represent the whole data, so the data size involved in computation is relatively small. However, when the database scales up to a very large database, the sampling volume increases as well, and the number of original sub-clusters may still be rather large even though sampling techniques are used. To extend the hybrid algorithm to very large databases, we construct units instead of data points as the original sub-clusters.

To obtain these units, we first partition the data points into cells. Statistics information such as the density, mean value, and maximum and minimum coordinates in each dimension is obtained for every cell. Then we test whether a cell forms a unit. The definition of a unit is as follows.

Definition 7. A unit is a cell whose data points all belong to a certain cluster.

According to Theorem 1, the data points in a cell either all belong to one cluster, partly belong to one cluster, or belong to no cluster. So cells can be classified into three classes: 1) cells whose data points all belong to a certain cluster, which are units; 2) cells whose data points do not belong to any cluster, which are all noise or outliers; 3) cells in which some data points belong to a certain cluster and lie on the brim of that cluster, while the others are noise or outliers. The density of units is higher than that of other cells, so the hybrid algorithm decides whether a cell is a unit by its density: if the density of a cell is M (M > 1) times the average density, we regard the cell as a unit. There may be some cells with low density that are in fact parts of final clusters; in the initial clusters, or even during earlier merging steps, they may not be included in units. We treat each data point not belonging to any unit as a separate sub-cluster. With an appropriate value of M, most units will be identified efficiently. Since the number of noise points, outliers and brim points of clusters is far smaller than the total number of data points, this method greatly reduces the time complexity, as will be shown in the experiments. Our idea differs substantially from STING[10] in the following respects: (1) the hybrid method constructs the units according to the statistics information that most closely controls the data grid, while STING specifies con, an estimate of the number of data points in one grid cell; (2) the hybrid method uses units only for the initial and low-level clustering, since the initial clusters and low-level sub-clusters are the main factors in speeding up the merge procedure while keeping good cluster quality, whereas STING uses the grid throughout the whole hierarchical merge, so errors accumulate on each level and pass on to the final results, making its clustering quality rather poor; (3) the hybrid method can use units to recognize arbitrarily shaped clusters without any preconditions, while STING is only effective under the sufficient condition it defines.


The procedure that initializes the sub-clusters is given below.

1.  INITIALIZE_SMALLCLUSTERS (M-DISTANCE, M-DIAMETER, M)
2.  {
3.    diameter = min(1/2 M-DISTANCE, M-DIAMETER);
4.    collect statistics information of cells;
5.    for each cell i do
6.    {
7.      if (DENSITY(i) > M * AVERAGE_DENSITY)
8.        store the cell as a sub-cluster;   // this cell is a unit
9.      else
10.       store every data point in the cell as a sub-cluster;
11.   }
12. }

The procedure first computes the diameter of the cells and then collects the statistics information from indexes. The most important statistic is the density of each cell. Moreover, we compute the maximum and minimum values in each dimension and obtain the data points these values correspond to; such information may be used to determine the representative points of the sub-clusters. In the loop from line 5 to line 11, the procedure tests each cell to see whether it should be treated as a unit. Each unit is stored as a single sub-cluster, and every data point in the other cells that cannot be included in a unit is stored as a separate sub-cluster. We use the data points corresponding to the maximum and minimum values in each dimension of a unit as the representative points of that sub-cluster, and the single point in each other sub-cluster as its own representative point.
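A minimal 2-D sketch of this initialization follows, assuming square grid cells and in-memory points; in the paper the statistics come from existing indexes rather than a scan, and the names used here (`initialize_small_clusters`, `units`, `leftovers`) are our own.

```python
import math
from collections import defaultdict

def initialize_small_clusters(points, m_distance, m_diameter, m):
    """Grid the data so a cell's diagonal stays below min(1/2 M-DISTANCE, M-DIAMETER),
    collect per-cell statistics, and keep the dense cells as units."""
    diagonal = min(0.5 * m_distance, m_diameter)
    side = diagonal / math.sqrt(2)          # square cells: side * sqrt(2) = diagonal

    cells = defaultdict(list)               # grid coordinate -> points in that cell
    for (x, y) in points:
        cells[(int(x // side), int(y // side))].append((x, y))

    average_density = len(points) / len(cells)
    units, leftovers = [], []
    for coord, members in cells.items():
        # Per-cell statistics; the min/max corners can serve as representative points.
        stats = {
            "coord": coord,
            "density": len(members),
            "mean": (sum(p[0] for p in members) / len(members),
                     sum(p[1] for p in members) / len(members)),
            "min": (min(p[0] for p in members), min(p[1] for p in members)),
            "max": (max(p[0] for p in members), max(p[1] for p in members)),
        }
        if len(members) > m * average_density:
            units.append(stats)             # this cell is a unit: one sub-cluster
        else:
            leftovers.extend(members)       # each point becomes its own sub-cluster
    return units, leftovers
```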

3.4 Labeling Data in the Database

The clusters identified by this algorithm are denoted by their representative points. In most cases, the user needs to know the detailed information about the clusters and the data points they include. Unlike the labeling process in CURE[3], we need not test, for every data point, which cluster it belongs to. Instead, we only find the cluster a cell belongs to; then all the data points in that cell belong to this cluster. This also improves the speed of the whole clustering process.
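A sketch of this cell-level labeling, assuming the `cells` mapping (grid coordinate to points) and a `cell_cluster` mapping (grid coordinate to final cluster id) produced by the earlier steps; both names are illustrative.

```python
def label_points(cells, cell_cluster):
    """Every data point inherits the label of its cell, so no per-point
    nearest-representative search is needed."""
    labels = {}
    for coord, members in cells.items():
        cluster_id = cell_cluster.get(coord)   # None marks noise/outlier cells
        for p in members:
            labels[p] = cluster_id
    return labels
```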

3.5 Time and Space Complexity

The time complexity of our clustering algorithm is O(m² log m) in the worst case, where m is the number of sub-clusters at the beginning. In general, when the data are distributed in well-proportioned dense areas, m will be small and the time complexity tends to O(n), nearly linear in the number of data points in the database. When the data are scattered sparsely over the whole area, m will not be far less than n and the time complexity will not be far less than O(n² log n). Because the cells and the data are stored in linear space, the space complexity of our method is O(n).

4 Enhancements for Different Data Environments

A very large database, and especially a data warehouse, may contain many kinds of data sets. These data sets may be very large, may contain noise and outliers that affect the clustering result, and may contain clusters with different densities or different scales. This section shows how our algorithm handles these kinds of data sets well.

4.1 Handling Very Large Databases

Most hierarchical clustering algorithms cannot be applied directly to large data sets due to their quadratic time complexity with respect to the input size, so we employ random sampling similar to the CURE algorithm[3]. We find that the sampled data points reflect the characteristics of the original data sets nicely. However, although the sample size is far smaller than the original data set, it is still too large for a hierarchical algorithm to handle efficiently. In traditional hierarchical algorithms, the sub-clusters on the bottom level are data points. Although we cannot find the clusters at the beginning, we can easily find sub-clusters by testing the density in a certain area, so we use units instead of the sampled data points as the sub-clusters on the bottom level. The time complexity of our algorithm then depends mainly on the number of units, which is very small compared to the sample size. Because of the appropriate statistics information, using units instead of data points does not affect the clustering results. A time comparison with DBSCAN and CURE on large data sets is shown in Section 5.


4.2 Handling Noise

Noise is a random and persistent disturbance that obscures or reduces the clarity of clusters. Compared to outliers, noise is well proportioned. Our algorithm can easily find noise and wipe it off. In the INITIALIZE_SMALLCLUSTERS procedure, after line 4, we have already acquired the density of each cell. Although noise is well proportioned, it is very sparse, so we can find the cells with very low density and eliminate the data points in them precisely. This reduces the influence of noise on both the clustering quality and the running time.
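As a sketch of this filtering step (with the same illustrative `cells` mapping as above), cells whose density falls below M-DENSITY are dropped before clustering starts:

```python
def drop_noise_cells(cells, m_density):
    """Split cells into kept (dense) cells and the noise points of sparse cells."""
    kept = {c: pts for c, pts in cells.items() if len(pts) >= m_density}
    noise = [p for c, pts in cells.items() if len(pts) < m_density for p in pts]
    return kept, noise
```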

4.3 Handling Outliers

Unlike noise, outliers are not well proportioned. They always affect the clustering results of hierarchical algorithms that depend on the minimum distance, so such algorithms need an additional phase to eliminate them. In our hybrid algorithm, we need not remove the outliers; we identify them instead. Traditional hierarchical algorithms cannot identify outliers because whether two clusters should be merged is determined only by the distance between them; hence outliers adjacent to a cluster may be merged into it. In our hybrid algorithm, the distance between clusters is not the only condition for merging: the algorithm also tests whether two clusters can connect to each other by density. Outliers are data points that lie away from the clusters and have a smaller scale than clusters, so they will not be merged into any cluster. When the algorithm finishes, the sub-clusters with rather small scale are the outliers. Our algorithm determines this scale by the parameter M-DIAMETER, the minimum diameter a cluster may have.
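The scale test can be sketched as follows: after merging stops, sub-clusters whose diameter (Definition 3) stays below M-DIAMETER are reported as outliers. Treating each sub-cluster as a plain list of points and using a brute-force diameter computation are simplifications for illustration.

```python
import itertools
import math

def diameter(points):
    """Largest pairwise distance within a sub-cluster (Definition 3)."""
    return max((math.dist(p, q) for p, q in itertools.combinations(points, 2)),
               default=0.0)

def split_outliers(sub_clusters, m_diameter):
    """Sub-clusters whose diameter stays below M-DIAMETER are outliers."""
    clusters = [c for c in sub_clusters if diameter(c) >= m_diameter]
    outliers = [c for c in sub_clusters if diameter(c) < m_diameter]
    return clusters, outliers
```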

4.4 Handling Data Sets Containing Clusters with Different Density or Different Scale

There may be clusters with different densities or different scales, as shown in Fig.1. Traditional clustering algorithms based on density or minimum distance always have trouble identifying such clusters, whereas our algorithm identifies them easily. Testing the distance between sub-clusters identifies the clusters with different densities.

Clusters with different scales can be identified by testing whether they can connect to each other. Using more representative points also helps identify the clusters with different scales.

Fig.1. Data sets with clusters of different density or different scale.

4.5 Handling Data Sets Containing Clusters with Arbitrary Shape

Most traditional clustering algorithms cannot handle clusters of certain shapes. DBSCAN, a density-based algorithm[7], has trouble dealing with dumbbell-shaped clusters, as shown in Fig.2; in our algorithm, this kind of cluster can be identified by using several representative points and testing the distance between sub-clusters. CURE, a minimum-distance-based algorithm[3], has trouble dealing with hollow-shaped clusters, as shown in Fig.3(a) (the white points are the representative points after shrinking), because the representative points must be shrunk by a factor toward the means in order to handle outliers. In the hybrid algorithm, the outliers are identified automatically, so the representative points need not shrink and this kind of cluster can easily be identified, as shown in Fig.3(b). STING, a statistics-information-based algorithm[10], cannot identify clusters with bevel edges, as shown in Fig.4, which is caused by its thorough use of cells.

Fig.2. Data sets with clusters in dumbbell shape.

Fig.3. Data sets with hollow clusters. (a) CURE. (b) Hybrid algorithm.


Fig.4. Data sets with bevel-edge clusters. (a) Hybrid algorithm. (b) STING.

Fig.5. Data sets with clusters in complex shape.

Unlike the hybrid algorithm, STING uses cells from beginning to end. Although it can find the clusters, it cannot represent them exactly, as shown in Fig.4(b). In the hybrid algorithm, the clusters are represented by several data points, which capture the shape of the clusters exactly, as shown in Fig.4(a). The comparison shows that, by using the non-shrinking multi-representative technique, the algorithm can capture the shape of the clusters precisely. This kind of summarization does not depend on the partitioning of the data space, but on the data set itself.

Fig.6. Comparison to CURE.

Therefore, it overcomes the shortcomings of cell-based methods. Furthermore, the connectivity test, based on density information, ensures that this concise representation of clusters will not mis-separate connected clusters. Fig.5 shows that the hybrid algorithm can recognize clusters of complex shape in large data sets.

5 Performance Evaluation

The experiments were run on a machine with two Intel Pentium II 350MHz CPUs, 512MB RAM and a 9.6GB disk, under Microsoft Windows NT 4.0. We study the performance of our hybrid clustering algorithm to evaluate its effectiveness and efficiency compared with DBSCAN and CURE. The figures above show the clustering results of our hybrid algorithm; in this section, we report the time comparison with DBSCAN and CURE. Fig.6 illustrates the performance of the hybrid clustering algorithm and CURE as the sample size varies from 1,000 to 6,000. It shows that our algorithm far outperforms CURE while keeping good clustering quality with respect to the original data sets. Fig.7 shows that, in the hybrid clustering algorithm, the running time varies linearly as the number of representative points increases. When the data size scales up, the hybrid clustering algorithm can successfully handle arbitrarily large numbers of data points. Fig.8 illustrates the performance of our algorithm and DBSCAN as the data size varies from 30,000 to 1,000,000. It shows that our algorithm far outperforms DBSCAN. (The time for DBSCAN to load data and build the R*-tree is not included.)

Fig.7. Time variance as the number of representative points increases.


Fig.8. Scale-up experiments.

6 Conclusions

In this paper, we have presented a hybrid clustering algorithm. It differs from other clustering algorithms in that it identifies clusters by both the distance between clusters and the density within clusters. With this strategy, the algorithm can easily and efficiently identify arbitrarily shaped clusters with good quality, and the user need not pre-specify the number of clusters. With the help of statistics information, it greatly reduces the computational cost of the clustering process. Our experimental results demonstrate that the hybrid clustering algorithm outperforms other popular clustering methods.

Acknowledgments  We thank Dr. Joerg Sander for providing information and the source code of DBSCAN, Dr. Sudipto Guha for good suggestions on CURE, and Prof. Jiawei Han for valuable critical comments on our algorithm.

References

[1] Sheikholeslami G et al. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proc. 24th Int. Conf. Very Large Data Bases, Gupta A, Shmueli O, Widom J (eds.), New York City: Morgan Kaufmann, 1998, pp.428-438.
[2] Zhang T, Ramakrishnan R, Livny M. BIRCH: An efficient data clustering method for very large databases. In Proc. 1996 ACM SIGMOD Int. Conf. Management of Data, Jagadish H V, Mumick I S (eds.), Quebec: ACM Press, 1996, pp.103-114.
[3] Guha S et al. CURE: An efficient clustering algorithm for large databases. In Proc. 1998 ACM SIGMOD Int. Conf. Management of Data, Haas L M, Tiwary A (eds.), Seattle: ACM Press, 1998, pp.73-84.
[4] Kaufman L et al. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
[5] Ng R T, Han J. Efficient and effective clustering methods for spatial data mining. In Proc. 20th Int. Conf. Very Large Data Bases (VLDB'94), Bocca J B, Jarke M, Zaniolo C (eds.), Santiago de Chile, Chile: Morgan Kaufmann, 1994, pp.144-155.
[6] Jain Anil K. Algorithms for Clustering Data. Prentice Hall, 1988.
[7] Ester M et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. 2nd Int. Conf. Knowledge Discovery and Data Mining (KDD-96), Simoudis E, Han J, Fayyad U M (eds.), AAAI Press, 1996, pp.226-231.
[8] Ankerst M et al. OPTICS: Ordering points to identify the clustering structure. In Proc. 1999 ACM SIGMOD Int. Conf. Management of Data, Delis A, Faloutsos C, Ghandeharizadeh S (eds.), Philadelphia: ACM Press, 1999, pp.49-60.
[9] Agrawal R, Gehrke J, Gunopulos D et al. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. 1998 ACM SIGMOD Int. Conf. Management of Data, Haas L M, Tiwary A (eds.), Seattle: ACM Press, 1998, pp.94-105.
[10] Wang W, Yang J, Muntz R. STING: A statistical information grid approach to spatial data mining. In Proc. 23rd Int. Conf. Very Large Data Bases, Jarke M, Carey M J, Dittrich K R, Lochovsky F H, Loucopoulos P, Jeusfeld M A (eds.), Athens, Greece: Morgan Kaufmann, 1997, pp.186-195.
[11] Gibson D, Kleinberg J M, Raghavan P. Clustering categorical data: An approach based on dynamical systems. In Proc. 24th Int. Conf. Very Large Data Bases, Gupta A, Shmueli O, Widom J (eds.), New York City: Morgan Kaufmann, 1998, pp.311-322.
[12] Boley D, Gini M, Gross R et al. Partitioning-based clustering for web document categorization. Decision Support Systems, 1999, 27(3): 329-341.

QIAN WeiNing is a Ph.D. candidate in the Computer Science Department, Fudan University. His major is database and knowledge-base. His research interests include clustering, data mining and Web mining.

GONG XueQing is a Ph.D. candidate in the Computer Science Department, Fudan University. His major is database and knowledge-base. His research interests include Web data management, data mining and data management over P2P systems.

ZHOU AoYing received his M.S. degree in computer science from Sichuan University in 1988, and his Ph.D. degree in computer software from Fudan University in 1993. He is currently a professor in the Department of Computer Science and Engineering, Fudan University. His main research interests include Web/XML data management, data mining and streaming data analysis, and Peer-to-Peer computing systems and their application.
