Available online at www.sciencedirect.com
Procedia Engineering 29 (2012) 4325 – 4329
www.elsevier.com/locate/procedia
2012 International Workshop on Information and Electronics Engineering (IWIEE)
An Improved Semi-Supervised Clustering Algorithm for Multi-Density Datasets with Fewer Constraints

Xiaoyun Chen a, Sha Liu a,*, Tao Chen a, Zhengquan Zhang b, Hairong Zhang b

a School of Information Science and Engineering, Lanzhou University, Lanzhou, 730000, China
b Gansu Computing Center, Lanzhou, 730000, China
Abstract

Semi-supervised clustering algorithms aim to improve clustering results significantly by using limited supervision in the form of labelled instances or pairwise constraints. However, few of these algorithms are well designed for datasets that are both multi-density and complex in shape, even though such data are common in the real world. In this paper, an improved semi-supervised clustering algorithm is proposed, based on the SCMD algorithm. The new algorithm handles multi-density problems, including not only density variety between clusters but also density differences within a cluster, and it yields good performance with fewer constraints. We test the new algorithm on several synthetic datasets of varying shapes, sizes, and densities. Experimental results show that our algorithm clearly outperforms SCMD, even when the constraints are not sufficient.
© 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of Harbin University of Science and Technology. Open access under CC BY-NC-ND license.

Keywords: data mining, clustering, semi-supervised, constraints, multi-density
1. Introduction

Clustering, one of the fundamental subjects of data mining, is attracting more and more attention and has been studied extensively for decades. Clustering aims to group a large amount of data into several clusters, such that data in the same cluster are similar and data in different clusters are dissimilar. Clustering analysis is widely used in applications ranging from market research and biology to data analysis and image processing. Clustering algorithms can be divided into unsupervised and semi-supervised algorithms, according to whether prior knowledge is available.
* Corresponding author. Tel.: 15002589036. E-mail address: [email protected].
1877-7058 © 2011 Published by Elsevier Ltd. Open access under CC BY-NC-ND license. doi:10.1016/j.proeng.2012.01.665
In this paper, we present a semi-supervised clustering algorithm that focuses on multi-density and arbitrary-shape datasets. Our algorithm makes full use of the guidance that pairwise constraints give to the clustering process, and offers a good solution to density differences both within and between clusters.

The rest of the paper is organized as follows. Section 2 discusses related work. We review the SCMD [1] algorithm and point out two of its shortcomings in Section 3. In Section 4, we detail our improved semi-supervised algorithm that solves the problems of SCMD. Section 5 presents and discusses the empirical results on several datasets. Finally, we present our conclusions and future work in Section 6.

2. Related Work

The DBSCAN [2] algorithm is the most typical density-based clustering algorithm, but it cannot find the natural cluster structure in multi-density datasets. Several algorithms have been proposed to solve this problem, such as Chameleon [3] and SNN [4]. Chameleon uses dynamic modeling to determine the similarity between pairs of clusters. SNN identifies core points using a similarity determined by the number of nearest neighbors shared by two points. In addition, Ref. [5] proposes a two-level SOM-based clustering algorithm using a clustering validity index based on inter-cluster and intra-cluster density. Ref. [6] develops a cluster tree to determine cluster structure and uncover hidden information in datasets with nested clusters or clusters of multiple densities.

Semi-supervised clustering algorithms aim to improve clustering results significantly by using limited supervision, and various such algorithms have been proposed in recent years. Wagstaff et al. [7] propose the COP-KMeans algorithm, which employs supervision in the form of pairwise constraints. Ref. [8] explores the use of labelled data to generate initial seed clusters; during the clustering process, the labelled data provide prior information about the conditional distribution of hidden category labels. SCMD [1] is a semi-supervised clustering algorithm for multi-density and complex-shape datasets. C-DBSCAN [9] uses a KD-Tree to deal effectively with subareas that have different densities.

3. Problems of SCMD

Yu et al. proposed SCMD, a semi-supervised clustering algorithm for multi-density data. It is composed of three phases. First, it calculates the referenced Eps list of the clusters from the must-link constraints. Then, it selects the representative Eps values from the sorted referenced Eps list using the cannot-link constraints. Finally, the classical DBSCAN algorithm is run on the dataset with the representative Eps values. SCMD can indeed show good performance on some datasets, but it still has two problems.
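To make this three-phase pipeline concrete, the following is a minimal Python sketch of it; this is our own illustration, not the authors' implementation. In particular, the helper select_representatives replaces SCMD's cannot-link-based selection rule (detailed in [1]) with a simple density-gap heuristic, and the k-distance of a point is taken as the distance to its k-th nearest neighbor.

```python
# A rough sketch of the three-phase SCMD pipeline described above.
# Assumptions: scikit-learn is available; the cannot-link-based selection
# of representative Eps (phase 2) is simplified to a density-gap rule.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

def k_distance(X, point_idx, k):
    """Distance from each query point to its k-th nearest neighbor."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: a point is its own 0-th neighbor
    dists, _ = nn.kneighbors(X[point_idx])
    return dists[:, k]

def select_representatives(ref_eps, ratio=2.0):
    """Simplified phase 2: group the sorted referenced Eps into density
    levels (a new level starts when an Eps exceeds `ratio` times the
    previous one) and keep the largest Eps of each level."""
    reps, level = [], [ref_eps[0]]
    for e in ref_eps[1:]:
        if e > ratio * level[-1]:
            reps.append(max(level))
            level = []
        level.append(e)
    reps.append(max(level))
    return reps

def scmd(X, must_links, k):
    # Phase 1: a referenced Eps (the k-distance) per must-link constrained point.
    ml_points = sorted({p for pair in must_links for p in pair})
    ref_eps = sorted(k_distance(X, ml_points, k))
    # Phase 2: one representative Eps per density level.
    rep_eps = select_representatives(ref_eps)
    # Phase 3: multi-stage DBSCAN, from the smallest Eps (densest clusters)
    # to the largest, clustering the still-unlabelled points at each stage.
    labels = -np.ones(len(X), dtype=int)
    for eps in rep_eps:
        rest = np.where(labels == -1)[0]
        if rest.size == 0:
            break
        sub = DBSCAN(eps=eps, min_samples=k).fit_predict(X[rest])
        offset = labels.max() + 1                    # keep cluster ids disjoint
        labels[rest[sub >= 0]] = sub[sub >= 0] + offset
    return labels
```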
Fig. 1. (a), (b) The must-link insufficiency problem in SCMD; (c), (d) the non-uniform intra-cluster density problem in SCMD.
In SCMD, all the Eps values used in the clustering process are calculated only from the must-link constraints. Therefore, if a cluster has no must-link constraint, it may not appear in the final clustering results. In particular, if the cluster without a must-link constraint is the sparsest one, it is certain to be lost, and all of its points are assigned to noise. Fig 1(a) and (b) show the problem of
must-link insufficiency. The solid lines represent must-link constraints and the dashed lines represent cannot-link constraints. In Fig 1(a), three representative Eps values can be obtained from the must-link sets, and three clusters of different densities are generated. But in Fig 1(b), the sparsest cluster has no must-link constraint, so SCMD can only obtain two representative Eps values, for clusters C1 and C2.

The other problem of SCMD is that it cannot handle non-uniform intra-cluster density. It does well on datasets whose density varies between clusters, because it can calculate a different Eps for each density level. But for a cluster whose internal density distribution is non-uniform, it will obtain two or more Eps values, and the cluster will be divided into several subclusters. For example, there are in fact two clusters in Fig 1(c). But when the smaller Eps is used to expand clusters, only C1 in Fig 1(d) is generated; the remaining points are grouped into three clusters by the bigger Eps, so the result contains four clusters.

4. Proposed Algorithm

In this section, we present an improved algorithm that resolves the problems mentioned in the last section.

4.1. Adding Eps

As noted above, if a cluster has no must-link constraint, it may not appear in the final clustering results because its Eps has never been calculated. We should therefore try to add its Eps to the referenced Eps list; the key is how to find such clusters. Constraints are transitive: (A, B) in the must-link set means that A and B are in the same cluster, and so does (B, C); then A and C are in the same cluster as well. (A, B) in the cannot-link set means that A and B must not be in the same cluster; if (A, C) is a must-link constraint, then B and C cannot be in the same cluster either. By this transitivity, transitive closures can be computed from the constraints, and the closures containing only one point P correspond exactly to the clusters we want to find. We then take the k-distance of each such P (the distance from P to its k-th nearest neighbor) as the Eps of its cluster and add it to the referenced Eps list. In this way, these clusters will not be lost.

For example, the left part of Fig 2(a) shows some points with constraints; the other points of the clusters are omitted. C4 has no must-link constraint, and d1 is the point we want to find. From the constraints, the sets {a1, a2, a3}, {b1, b2}, {b3, b4}, {c1}, {c2, c3} and {d1} are generated by transitivity, as the right part of Fig 2 shows. We then add the k-distances of c1 and d1 as Eps values to the referenced Eps list. One might think the k-distance of c1 need not be added, since an Eps has already been calculated from (c2, c3) for C3; this does not matter. As long as the k-distance of d1 is added to the referenced Eps list as the Eps of C4, our aim is achieved. A code sketch of this closure computation follows the proof of Lemma 1 below.

4.2. Assigning Boundary Clusters

Definition 1 (Boundary Cluster). A cluster C is a boundary cluster if the total number of points in the cluster is less than k, i.e. |C| < k.

Lemma 1. A boundary cluster cannot be an isolated cluster; it must lie on the boundary of some larger cluster.

Proof. A cluster must have one or more core points, because a cluster is expanded from core points according to the rule of direct density-reachability, and the Eps-neighborhood of a core point must contain at least MinPts (k in this paper) members.
If the cluster were isolated, its points could not satisfy the condition of having k neighbors, because the total number of points in the cluster is less than k, so the cluster would contain no core point. By contradiction, a boundary cluster cannot be isolated and must lie on the boundary of some larger cluster.
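As promised above, here is a minimal sketch of the closure computation of Section 4.1; it is our own illustration, and the example constraint pairs below are assumptions chosen to reproduce the closures of Fig 2(a). Must-link pairs are merged with union-find, and each singleton closure marks a cluster whose Eps (the k-distance of its point) must be added to the referenced Eps list.

```python
# Finding must-link transitive closures with union-find; singleton closures
# mark clusters that have no must-link constraint (Section 4.1).
from collections import defaultdict

def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def singleton_closures(must_links, cannot_links):
    """Return the constrained points whose closure contains only themselves."""
    points = {p for pair in must_links + cannot_links for p in pair}
    parent = {p: p for p in points}
    for a, b in must_links:            # union the must-link pairs
        parent[find(parent, a)] = find(parent, b)
    closures = defaultdict(set)
    for p in points:
        closures[find(parent, p)].add(p)
    return [next(iter(c)) for c in closures.values() if len(c) == 1]

# Hypothetical constraint pairs, consistent with the closures of Fig 2(a):
must = [("a1", "a2"), ("a2", "a3"), ("b1", "b2"), ("b3", "b4"), ("c2", "c3")]
cannot = [("a1", "b1"), ("b3", "c1"), ("c2", "d1")]
print(sorted(singleton_closures(must, cannot)))   # -> ['c1', 'd1']
```

The k-distances of the returned points (here c1 and d1) are then appended to the referenced Eps list.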
In this paper, we assign each boundary cluster to the larger cluster it is close to. The main idea is derived from the KNN classification method [7], with a small difference: a boundary cluster is reassigned as a whole by a majority vote of its members' neighbors, i.e. the whole cluster is assigned to the class most common among its members' k nearest neighbors (a code sketch follows Section 4.3 below).

4.3. The Algorithm Framework

The algorithm is composed of five phases: obtaining the referenced Eps list from the must-link sets, adding referenced Eps values (Section 4.1), selecting the representative Eps values using the cannot-link sets, multi-stage DBSCAN clustering with the representative Eps values, and assigning boundary clusters (Section 4.2). The flowchart of the proposed algorithm is shown in Fig 2(b).
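As a sketch of the boundary-cluster assignment of Section 4.2 (again our own rendering; the neighbor-pool size and the exclusion of noise and of other boundary clusters from the vote are our assumptions):

```python
# Reassigning boundary clusters (Definition 1: |C| < k) to nearby larger
# clusters by a majority vote over the members' k nearest outside neighbors.
import numpy as np
from collections import Counter
from sklearn.neighbors import NearestNeighbors

def assign_boundary_clusters(X, labels, k, pool=50):
    labels = labels.copy()
    sizes = Counter(int(c) for c in labels if c >= 0)
    boundary = {c for c, n in sizes.items() if n < k}   # Definition 1
    nn = NearestNeighbors(n_neighbors=min(len(X), pool)).fit(X)
    for c in boundary:
        members = np.where(labels == c)[0]
        votes = Counter()
        for m in members:
            _, idx = nn.kneighbors(X[m:m + 1])
            # the member's k nearest neighbors that belong to a larger cluster
            outside = [int(labels[j]) for j in idx[0]
                       if labels[j] >= 0 and int(labels[j]) != c
                       and int(labels[j]) not in boundary][:k]
            votes.update(outside)
        if votes:                                        # majority vote of all members
            labels[members] = votes.most_common(1)[0][0]
    return labels
```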
Fig. 2. (a) Transitive closures formed by the constraints; (b) the flowchart of the proposed algorithm.
5. Experimentation

In this section we present the experimental results of our proposed algorithm and compare it with SCMD. We select four datasets, either taken from the literature or synthesized by us; all are multi-density and contain noise points. In all of our experimental results, different colors combined with different shapes represent different clusters, and black dots represent noise.
Fig. 3. (a) SCMD algorithm on Data1; (b) our proposed algorithm on Data1; (c) SCMD algorithm on Data2; (d) our proposed algorithm on Data2.
Fig. 4. (a) SCMD algorithm on Data3; (b) Our proposed algorithm on Data3; (c) SCMD algorithm on Data4; (d) Our proposed algorithm on Data4.
There are two datasets in Fig 3, each containing clusters of different densities and shapes. If the must-link constraint of the sparsest cluster is not given, SCMD can only find three clusters in Data1 and two clusters in Data2, and the points of the sparsest clusters are assigned to noise, as Fig 3(a) and Fig 3(c) show. Our algorithm can still find all the clusters accurately, as shown in Fig 3(b) and Fig 3(d).

There are two datasets in Fig 4. The points of each cluster are Gaussian distributed, meaning that points near the cluster center are denser and points near the cluster boundary are sparser. Experimental results show that most points of the real clusters are assigned correctly by SCMD, as Fig 4(a) and 4(c) show, but the points on the cluster boundaries are grouped into several small clusters because of the bigger Eps. Fig 4(b) and 4(d) show that our algorithm performs better than SCMD.

6. Conclusions and Future Work

In this paper, we propose an improved semi-supervised clustering algorithm for multi-density datasets that needs fewer constraints. It can handle datasets with arbitrary shapes and multiple densities, including not only density variety between clusters but also density differences within a cluster. Experimental results show that our algorithm clearly outperforms SCMD, even when the must-link constraints are not sufficient. But if the sparsest cluster has no constraints at all, how can it be detected during the clustering process? This is our research interest for the future.

Acknowledgements

We gratefully acknowledge support from the 2010 X10 Innovation Awards project of IBM.

References

[1] Y. Yu, T. Huang, G. Guo, and K. Li. "Semi-supervised clustering algorithm for multi-density and complex shape dataset". In Chinese Conference on Pattern Recognition (CCPR '08), pages 1–6, 2008.
[2] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. "A density-based algorithm for discovering clusters in large spatial databases with noise". In Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pages 226–231, 1996.
[3] G. Karypis, E.-H. S. Han, and V. Kumar. "Chameleon: A hierarchical clustering algorithm using dynamic modeling". IEEE Computer, 32:68–75, 1999.
[4] L. Ertöz, M. Steinbach, and V. Kumar. "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data". In Proceedings of the 3rd SIAM International Conference on Data Mining (SDM '03), pages 47–58, 2003.
[5] S. Wu and T. W. Chow. "Clustering of the self-organizing map using a clustering validity index based on inter-cluster and intra-cluster density". Pattern Recognition, 37(2):175–188, 2004.
[6] X. Li, Y. Ye, M. J. Li, and M. K. Ng. "On cluster tree for nested and multi-density data clustering". Pattern Recognition, 43(9):3130–3143, 2010.
[7] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. "Constrained k-means clustering with background knowledge". In Proceedings of the 18th International Conference on Machine Learning, pages 577–584, 2001.
[8] S. Basu, A. Banerjee, and R. Mooney. "Semi-supervised clustering by seeding". In Proceedings of the 19th International Conference on Machine Learning (ICML-2002), pages 19–26, 2002.
[9] C. Ruiz, M. Spiliopoulou, and E. Menasalvas. "Density-based semi-supervised clustering". Data Mining and Knowledge Discovery, 21(3):345–370, 2010.