A New Overlapping Clustering Algorithm Based on Graph Theory

Airel Pérez-Suárez¹,², José Fco. Martínez-Trinidad¹, Jesús A. Carrasco-Ochoa¹, and José E. Medina-Pagola²

¹ Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), Luis E. Erro No. 1, Sta. María Tonantzintla, Puebla, C.P. 72840, Mexico
² Advanced Technologies Application Center (CENATAV), 7ma A  21406 e/ 214 and 216, Playa, C.P. 12200, La Habana, Cuba
{airel,ariel,fmartine}@ccc.inaoep.mx, [email protected]

Abstract. Most of the clustering algorithms reported in the literature build disjoint clusters; however, there are several applications where overlapping clustering is useful and important. Although several overlapping clustering algorithms have been proposed, most of them have a high computational complexity or suffer from limitations that reduce their usefulness in real problems. In this paper, we introduce a new overlapping clustering algorithm which overcomes the limitations of previous algorithms while having an acceptable computational complexity. Experiments conducted over several standard collections demonstrate the good performance of the proposed algorithm.

Keywords: Data Mining, Overlapping Clustering, Graph-based Algorithms.

1 Introduction

Clustering is a Data Mining technique that organizes a set of objects into a set of classes, called clusters, such that objects belonging to the same cluster are similar enough to infer they are of the same type, and objects belonging to different clusters are different enough to infer they are of different types [1]. Most of the clustering algorithms reported in the literature build disjoint clusters; i.e., they do not allow objects to belong to more than one cluster. Although this approach has been successful for unsupervised learning, there are some applications, like information retrieval [2], social network analysis [16, 19], Bioinformatics [14] and news stream analysis [4], among others, where objects could belong to more than one cluster. For this kind of application, overlapping clustering is useful and important.

Several overlapping clustering algorithms have been proposed in the literature [2–21]; however, most of them have a high computational complexity or suffer from limitations that reduce their usefulness in real problems. This is the main reason why we address the problem of overlapping clustering in this work.

I. Batyrshin and M. González Mendoza (Eds.): MICAI 2012, Part I, LNAI 7629, pp. 61–72, 2013. © Springer-Verlag Berlin Heidelberg 2013


In this paper, we introduce a new clustering algorithm for building overlapping clusterings. The proposed algorithm, called OCDC (Overlapping Clustering based on Density and Compactness), introduces a new graph-covering strategy and a new filtering strategy, which together allow obtaining a small set of overlapping clusters. Moreover, our algorithm reduces the limitations of previous algorithms and has an acceptable computational complexity. An experimental evaluation over several standard collections showed that OCDC builds more accurate clusters, according to the FBcubed evaluation measure [22], than those built by previous overlapping clustering algorithms.

The remainder of this paper is organized as follows: in Section 2, the related work is analyzed. In Section 3, we introduce the OCDC algorithm. An experimental evaluation, showing the performance of our algorithm over several data collections, is presented in Section 4. Finally, conclusions and future work are presented in Section 5.

2 Related Work

In the literature, several clustering algorithms have been proposed that address the problem of overlapping clustering [2–21]. The DHS algorithm, reported in [23], is not included as related work because it addresses the problem of overlapping hierarchical clustering, which is out of the scope of this paper.

Star [2], ISC [4], Estar [5, 6], Gstar [11], ACONS [12], ICSD [17] and SimClus [21] are graph-based clustering algorithms which build overlapping clusterings; the SimClus algorithm uses the same strategy and criterion for building the clustering as the Estar algorithm, proposed by Gil-García et al. in [6]. The computational complexity of Star and ICSD is O(n² · log₂ n) and O(n²), respectively. On the other hand, the computational complexity of Estar, Gstar and ACONS is O(n³).

These algorithms have two main limitations. First of all, as we will show in our experiments, they build a large number of clusters. It is true that the correct number of clusters is unknown in real problems; however, when we use a clustering algorithm for discovering hidden relations among the objects of a collection, the number of clusters obtained should be small with respect to the number of objects in the collection. Note that, if the number of clusters grows, then analyzing those clusters could be as difficult as analyzing the whole collection. Another limitation, as we will show in our experiments, is that they build clusterings with high overlapping. As mentioned in the previous section, overlapping clustering is very useful for several applications; however, when the overlapping is high, it could be difficult to learn anything useful about the structure of the data. Moreover, there are applications, for instance document segmentation by topic, where a high overlapping could be a sign of a bad segmentation [24]. Another example can be found in social network analysis [13, 15].


Another algorithm able to build overlapping clusters is STC [3]. As mentioned in [7], the main limitation of STC is its computational complexity, which can become exponential. Besides, STC was specifically designed to work with text strings, which limits its applicability.

Other algorithms addressing the problem of overlapping clustering are SHC [7], LA-IS² [9], RaRe-IS [10], the algorithm proposed by Zhang et al. in [14], RRW [18], LA-CIS [19] and SSDE-Cluster [20]. The computational complexity of SHC, RaRe-IS, LA-IS², LA-CIS and RRW is O(n²). On the other hand, the computational complexity of SSDE-Cluster and the algorithm proposed by Zhang et al. is O(n · k · d) and O(n² · k · h), respectively. From the user's point of view, these algorithms have several limitations. First of all, SHC, RaRe-IS, RRW, SSDE-Cluster and the algorithm proposed by Zhang et al. require tuning several parameters, whose values depend on the collection to be clustered. In general, users do not have any a priori knowledge about the collection they want to cluster; therefore, tuning several parameters can be a difficult task. Second, SHC, RaRe-IS and LA-CIS build clusterings with high overlapping. Finally, LA-IS² and LA-CIS make irrevocable assignments of objects to clusters; that is, once an object has been assigned to a cluster, it cannot be removed and added to another cluster, even if doing so would improve the clustering quality.

Other overlapping clustering algorithms reported in the literature are the algorithm proposed by Palla et al. [8], CONGA [13], CONGO [15] and H-FOG [16]. The computational complexity of CONGA, CONGO and H-FOG is O(n⁶), O(n⁴) and O(n⁶), respectively; the computational complexity of the algorithm proposed by Palla et al. is exponential. The main limitation of these algorithms is their computational complexity, which makes them inapplicable in real problems involving large collections.
Besides, CONGA, CONGO and H-FOG need to know in advance the number of clusters to build; this number is commonly unknown in real problems.

As can be seen, many algorithms reported in the literature address the problem of overlapping clustering. However, as we pointed out in this section, they have some limitations, mainly related to:

a) the production of a large number of clusters;
b) the production of clusterings with high overlapping;
c) the necessity of tuning several parameters whose values depend on the collection to be clustered.

Besides, several algorithms have a high computational complexity; these limitations make them unsuitable for many real problems. The algorithm proposed in this work overcomes the aforementioned limitations, while having an acceptable computational complexity. Moreover, as we will show in our experiments, our algorithm builds more accurate clusterings than those built by previous algorithms.

3 Building Overlapping Clusterings

In this section, a new overlapping clustering algorithm is introduced. Our algorithm, called OCDC (Overlapping Clustering based on Density and Compactness), represents a collection of objects as a weighted thresholded similarity graph G̃β and builds an overlapping clustering in two phases, called initialization and improvement. In the initialization phase, a set of initial clusters is built through a vertex covering of G̃β. In the improvement phase, the initial clusters are processed to reduce both the number of clusters and their overlapping.

The presentation of the OCDC algorithm is divided into three parts. First, in subsection 3.1, we give some basic concepts needed for introducing our algorithm. Then, in subsection 3.2, we explain the initialization phase; finally, in subsection 3.3, we explain the improvement phase.

3.1 Basic Concepts

Let O = {o1, o2, ..., on} be a collection of objects, β ∈ [0,1] a given parameter and S(oi, oj) a symmetric similarity function.

A weighted thresholded similarity graph is an undirected and weighted graph G̃β = ⟨V, Ẽβ, S⟩, such that V = O and there is an edge (v,u) ∈ Ẽβ iff v ≠ u and S(v,u) ≥ β; each edge (v,u) ∈ Ẽβ is labeled with the value of S(v,u).

Let G̃β = ⟨V, Ẽβ, S⟩ be a weighted thresholded similarity graph. A weighted star-shaped sub-graph (ws-graph) in G̃β, denoted by G' = ⟨V', E', S⟩, is a sub-graph of G̃β having a vertex c ∈ V', such that there is an edge between c and all the vertices in V' \ {c}. The vertex c is called the center of G' and the other vertices of V' are called satellites. All vertices having no adjacent vertices (i.e., isolated vertices) are considered degenerated ws-graphs.

Let G̃β = ⟨V, Ẽβ, S⟩ be a weighted thresholded similarity graph and W = {G1, G2, ..., Gk} a set of ws-graphs in G̃β, such that each Gi = ⟨V'i, E'i, S⟩, i = 1...k. The set W is a cover of G̃β iff V = ∪(i=1..k) V'i. In addition, we will say that a ws-graph Gi covers a vertex v iff v ∈ V'i.

Let Gv = ⟨V'v, E'v, S⟩ be a non-degenerated ws-graph having v as its center. The intra-cluster similarity of Gv, denoted as Intra_sim(Gv), is the average similarity between all pairs of vertices belonging to V'v [25], and it is computed as:

    Intra_sim(Gv) = ( Σ_{z,u ∈ V'v, z≠u} S(z,u) ) / ( |V'v|·(|V'v|−1)/2 ),  with |V'v| ≠ 1    (1)

Computing Intra_sim(Gv) using (1) is O(n²). Thus, using this equation for computing the intra-cluster similarity of the ws-graph determined by each vertex in G̃β becomes O(n³). Since we want to reduce the computational complexity of OCDC, we approximate the intra-cluster similarity of a non-degenerated ws-graph Gv by the average similarity between its center and its satellites, as follows:

    Aprox_Intra_sim(Gv) = ( Σ_{u ∈ V'v, u≠v} S(v,u) ) / ( |V'v|−1 ),  with |V'v| ≠ 1    (2)

Equation (2) is O(n); thus, the Aprox_Intra_sim of each ws-graph in G̃β can be computed while we are building G̃β, without extra computational cost.
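Expressed in Python, the definitions above might look as follows. This is a minimal sketch; the function and variable names are ours, not from the paper, and the graph is stored as an adjacency dictionary mapping each vertex to its weighted neighbours:

```python
from itertools import combinations

def build_thresholded_graph(objects, sim, beta):
    """Build the weighted thresholded similarity graph as an adjacency
    dict: adj[v] maps each neighbour u of v to the edge label S(v, u)."""
    adj = {v: {} for v in objects}
    for v, u in combinations(objects, 2):
        s = sim(v, u)
        if s >= beta:              # keep the edge only when S(v,u) >= beta
            adj[v][u] = s
            adj[u][v] = s
    return adj

def aprox_intra_sim(adj, center):
    """Equation (2): average similarity between a ws-graph's center and
    its satellites (0 for degenerated ws-graphs, by convention)."""
    satellites = adj[center]
    if not satellites:
        return 0.0
    return sum(satellites.values()) / len(satellites)
```

Note that the ws-graph determined by a vertex is fully described by its adjacency list, which is why the approximation can be evaluated while the graph is being built.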

3.2 Initialization Phase

The main idea of this phase is to generate an initial set of clusters by covering G̃β using ws-graphs; in this context, each ws-graph constitutes an initial cluster. As can be seen from subsection 3.1, a ws-graph is determined by its center; thus, the problem of building a set W = {Gc1, Gc2, ..., Gck} of ws-graphs, such that W is a cover of G̃β, can be transformed into the problem of finding a set X = {c1, c2, ..., ck} of vertices, such that ci ∈ X is the center of Gci ∈ W, ∀i = 1...k. Each vertex of G̃β forms a ws-graph; therefore, all vertices of G̃β should be analyzed for building X. In order to prune the search space and to establish a criterion for selecting those vertices that should be added to X, we introduce the concepts of density and compactness of a vertex v.

An idea for reducing the number of ws-graphs needed for covering G̃β would be to iteratively add the vertex having the highest degree to X. Thus, we expect to maximize the number of vertices that are iteratively added to the cover of G̃β; nevertheless, the number of vertices that a vertex v could add to the cover of G̃β depends on the previously selected vertices. Let v1, v2, ..., vk be the vertices selected in the first k iterations. Let u be a not previously selected vertex having the highest degree in iteration k+1. If d previously selected vertices (d ≤ k) are adjacent to u, then u would add to the cover of G̃β only |u.Adj| − d vertices. However, if there is a vertex z in the same iteration, such that:

1. |z.Adj| < |u.Adj|,
2. z is adjacent to d1 previously selected vertices (d1 ≤ k), and
3. |u.Adj| − d < |z.Adj| − d1,

then selecting the vertex z in iteration k+1 would be a better choice than selecting u.
From the previous paragraph, we can conclude that if v ∈ V is added to X, the number of adjacent vertices of v having a degree not greater than the degree of v is a good estimation of how many vertices v could include in the cover of G̃β. However, this number is bounded by the total number of adjacent vertices of v; i.e., there could be a bias toward vertices having many adjacent vertices. Let v, u be two vertices, such that v and u could add 6 and 4 vertices to the cover of G̃β, respectively. Based on the previous paragraph, the best choice is adding v to X. But, if we know that |v.Adj| = 10 and |u.Adj| = 5, then u could be a better choice, because it could include in the cover of G̃β a greater percentage of its adjacent vertices than v (4/5 versus 6/10). Motivated by these ideas, we introduce the concept of density of a vertex.


The density of a vertex v ∈ V, denoted as v.density, is computed as follows:

    v.density = v.pre_dens / |v.Adj|    (3)

where v.pre_dens denotes the number of adjacent vertices of v having a degree not greater than the degree of v. Equation (3) takes values in [0,1]. The higher the value of v.density, the greater the number of adjacent vertices that v could include in the cover of G̃β and, therefore, the better v is for covering G̃β.

There are some aspects we would like to highlight from the previous definition. First, a high value of v.density also means that the ws-graph determined by v has low overlapping with the previously selected ws-graphs. Second, since this property is based on the degree of vertices, using it as the only clustering criterion could lead to selecting ws-graphs with a high number of satellites but with low average similarity among them. To solve this issue, we define the concept of compactness of a vertex, which takes into account the Aprox_Intra_sim of a ws-graph, computed using equation (2).

The compactness of a vertex v ∈ V, denoted as v.compactness, is computed as follows:

    v.compactness = v.pre_compt / |v.Adj|    (4)

where v.pre_compt denotes the number of vertices u ∈ v.Adj such that Aprox_Intra_sim(Gv) ≥ Aprox_Intra_sim(Gu), being Gv and Gu the ws-graphs determined by v and u, respectively. Like (3), equation (4) takes values in [0,1]. The higher the value of v.compactness, the greater the number of adjacent vertices that v could include in the cover of G̃β and, therefore, the better v is for covering the graph.

Finally, both criteria are combined through the average of v.density and v.compactness. A high average value corresponds to a high value of v.density and/or v.compactness; thus, the higher this average, the better v is for covering G̃β. Therefore, in order to build the covering of G̃β, we must analyze the vertices in descending order, according to the average of v.density and v.compactness. Moreover, we do not have to analyze those vertices having v.density = 0 and v.compactness = 0.

The strategy proposed for building a covering of G̃β comprises three steps. First, all vertices having an average of density and compactness greater than zero are added to a list of candidates L; isolated vertices are included directly in X. Then, the list L is sorted in descending order, according to the average of density and compactness of the vertices. Finally, L is iteratively processed and each vertex v ∈ L is added to X if it satisfies at least one of the following conditions:

a) v is not covered yet;
b) v is already covered but it has at least one adjacent vertex which is not covered yet.

Condition b) avoids the selection of ws-graphs having all their satellites covered by previously selected ws-graphs. After this process, each selected ws-graph constitutes an initial cluster.
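The initialization phase can be sketched as follows, combining equations (3) and (4) with the three-step covering strategy described above. This is a simplified reading with helper names of our own choosing; the graph is the adjacency-dictionary representation used earlier:

```python
def initialization_phase(adj):
    """Select ws-graph centers X covering the graph, preferring vertices
    with a high average of density and compactness."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    # Aprox_Intra_sim of the ws-graph determined by each vertex (eq. 2)
    aprox = {v: (sum(nbrs.values()) / len(nbrs) if nbrs else 0.0)
             for v, nbrs in adj.items()}

    score = {}
    for v, nbrs in adj.items():
        if not nbrs:                       # isolated: degenerated ws-graph
            continue
        density = sum(1 for u in nbrs if deg[u] <= deg[v]) / len(nbrs)      # eq. (3)
        compactness = sum(1 for u in nbrs if aprox[v] >= aprox[u]) / len(nbrs)  # eq. (4)
        score[v] = (density + compactness) / 2

    X = [v for v, nbrs in adj.items() if not nbrs]   # isolated vertices go to X
    covered = set(X)
    # candidates sorted in descending order of the combined score
    for v in sorted(score, key=score.get, reverse=True):
        if score[v] == 0:
            continue                       # no need to analyze zero-score vertices
        # condition a) v uncovered, or b) v has an uncovered adjacent vertex
        if v not in covered or any(u not in covered for u in adj[v]):
            X.append(v)
            covered.add(v)
            covered.update(adj[v])
    return X                               # each center determines an initial cluster
```

On a small star-shaped graph, only the star's center and any isolated vertices end up in X, since every satellite is already covered once the center is selected.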

A New Overlapping Clustering Algorithm Based on Graph Theory

3.3

67

Improvement Phase

The main idea of this phase is to post-process the initial clusters in order to reduce the number of clusters and their overlapping. To do this, the set X is analyzed in order to remove the less useful ws-graphs. In this context, the usefulness of a ws-graph G' is determined by its number of satellites and the number of satellites it shares with other ws-graphs.

The strategy proposed for building X is greedy; thus, it could happen that, once the set X has been built, some vertices in X could be removed and G̃β would remain covered. For example, if there is a vertex v ∈ X such that:

(i) the ws-graph Gv determined by v shares all its satellites with other selected ws-graphs, and
(ii) v itself belongs to at least one other selected ws-graph,

then we can remove v from X and G̃β would remain covered. Nevertheless, if there is a vertex u ∈ X such that u meets condition (ii) but does not meet condition (i), then u cannot be removed from X. Notice that, even if most of the satellites of Gu were covered by other ws-graphs, removing u from X would leave the non-shared satellites of Gu uncovered. To solve this issue, we define the usefulness of a ws-graph based on how many satellites it shares.

Let v ∈ X be a vertex determining the ws-graph Gv in the covering of G̃β. Let v.Shared be the set of satellites that Gv shares with other selected ws-graphs and let v.Non_shared be the set of satellites belonging only to Gv. In this work, we will consider that Gv is not useful for covering G̃β iff the following conditions are met:

(1) there is at least one other selected ws-graph covering v;
(2) |v.Shared| > |v.Non_shared|.

For removing a non-useful ws-graph, we must add all its non-shared satellites to other initial clusters. Since a non-useful ws-graph Gv meets condition (1), the non-shared satellites of Gv are added to the ws-graph having the greatest number of satellites, among all ws-graphs covering vertex v; thus, we allow the creation of clusters with many objects. In this context, adding the non-shared satellites of Gv to another ws-graph Gu means adding those satellites to a list named u.Linked; the cluster determined by Gu will then also include the vertices in u.Linked.

The strategy proposed for removing the non-useful ws-graphs and building the final clustering consists of two steps. First, the vertices in X are marked as not-analyzed and are sorted in descending order, according to their degree. In the second step, each vertex v ∈ X is visited for removing from v.Adj all the vertices forming non-useful ws-graphs. For this purpose, each vertex u ∈ v.Adj is analyzed as follows: if u belongs to X and is marked as not-analyzed, we check whether Gu is a non-useful ws-graph; the vertices belonging to X and marked as analyzed are those which were determined as useful in a previous iteration and, therefore, they do not need to be verified again. For checking whether Gu is a


non-useful ws-graph, we only need to check whether Gu meets the above-mentioned condition (2); notice that, since u belongs to the ws-graph formed by v, Gu already meets condition (1). Then, if Gu is non-useful, u is removed from X and its non-shared satellites are added to v.Linked; otherwise, u is marked as analyzed. Once all the vertices of v.Adj have been analyzed, the set determined by vertex v, together with the vertices in v.Adj and v.Linked, constitutes a cluster in the final clustering.

We would like to mention that the way in which a cluster is built allows us to interpret its center v as the prototype of the cluster. Notice that each vertex included in this cluster is related to v; that is, it is adjacent to v or it is adjacent to an adjacent vertex of v. The OCDC algorithm has a computational complexity of O(n²); the proof is omitted due to space restrictions.
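The improvement phase can be sketched as follows. This is a simplified reading of the strategy above: in particular, the non-shared satellites of a removed ws-graph are attached to the ws-graph currently being processed, rather than to the largest ws-graph covering its center as the paper specifies, and helper names are ours:

```python
def improvement_phase(adj, X):
    """Drop non-useful ws-graphs from the selected centers X and build
    the final clusters (simplified sketch)."""
    X = sorted(X, key=lambda v: len(adj[v]), reverse=True)  # descending degree
    selected = set(X)
    # how many selected ws-graphs cover each vertex
    # (a center covers itself and all of its satellites)
    cover = {}
    for c in X:
        for s in [c, *adj[c]]:
            cover[s] = cover.get(s, 0) + 1

    analyzed = set()
    linked = {v: [] for v in X}
    clusters = []
    for v in X:
        if v not in selected:              # v was removed as non-useful earlier
            continue
        for u in list(adj[v]):
            if u in selected and u not in analyzed:
                # u is covered by Gv, so condition (1) already holds;
                # check condition (2): more shared than non-shared satellites
                shared = [s for s in adj[u] if cover[s] > 1]
                non_shared = [s for s in adj[u] if cover[s] == 1]
                if len(shared) > len(non_shared):
                    selected.discard(u)            # Gu is non-useful
                    linked[v].extend(non_shared)   # keep its non-shared satellites
                    for s in [u, *adj[u]]:         # Gu no longer covers anything
                        cover[s] -= 1
                else:
                    analyzed.add(u)                # Gu is useful; skip it later
        if v in selected:
            # final cluster: center, its satellites and the linked vertices
            clusters.append({v, *adj[v], *linked[v]})
    return clusters
```

With two overlapping stars where one center's satellites are almost all shared, the weaker star is absorbed and a single cluster containing every vertex remains.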

4 Experimental Results

In this section, the results of some experiments that show the performance of the OCDC algorithm are presented. In these experiments, we contrasted our results against those obtained by the algorithms Star [2], ISC [4], SHC [7], Estar [5], Gstar [11], ACONS [12] and ICSD [17]. The collections used in the experiments were built from five benchmark text collections, commonly used in document clustering: AFP¹, CISI², CACM², Reuters-21578³ and TDT2⁴. From these benchmarks, six document collections were built; the characteristics of these collections are shown in Table 1.

Table 1. Overview of collections

Name     #Documents  #Terms  #Classes  Overlapping
AFP          695      11785      25       1.023
Reu-Te      3587      15113     100       1.295
Reu-Tr      7780      21901     115       1.241
TDT        16006      68019     193       1.188
cacm         433       3038      52       1.499
cisi        1162       6976      76       2.680

In our experiments, documents were represented using the Vector Space Model. The index terms of the documents are the lemmas of the words occurring at least once in the whole collection; stop words were removed. The index terms of each document were statistically weighted using term frequency normalized by the maximum term frequency. The cosine measure was used to compute the similarity between two documents.

¹ http://trec.nist.gov
² ftp://ftp.cs.cornell.edu/pub/smart
³ http://kdd.ics.uci.edu
⁴ http://www.nist.gov/speech/tests/tdt.html
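The weighting and similarity computation described above can be sketched as follows. Lemmatization and stop-word removal are omitted, a simple tokenizer stands in for the one actually used, and function names are ours:

```python
import math
import re

def tf_vector(text):
    """Term weights: term frequency normalized by the maximum term
    frequency in the document, as in the experimental setup."""
    terms = re.findall(r"\w+", text.lower())
    tf = {}
    for t in terms:
        tf[t] = tf.get(t, 0) + 1
    max_tf = max(tf.values())
    return {t: f / max_tf for t, f in tf.items()}

def cosine(d1, d2):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

This similarity function plays the role of S(oi, oj) when building the thresholded similarity graph over a document collection.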


There are three types of evaluation measures: external, relative and internal measures [1]. Of these, the most widely used are the external measures. Several external evaluation measures have been proposed in the literature, for instance: Purity and Inverse Purity, F1-measure, Jaccard coefficient, Entropy, Class Entropy and V-measure, among others (see [22, 26]). However, none of the external measures reported so far has been developed, at least explicitly, for evaluating overlapping clustering; that is, these measures fail to reflect the fact that, in a perfect overlapping clustering, objects sharing n classes should share n clusters.

In [22], Amigó et al. proposed a new external evaluation measure, called FBcubed, for evaluating overlapping clusterings. This measure meets four constraints which evaluate several desirable characteristics of an overlapping clustering solution. These constraints are intuitive and they express important characteristics that should be evaluated by an external evaluation measure. Moreover, in [22] the authors showed that none of the most used external evaluation measures satisfies all four constraints. Based on this analysis, and in order to conduct a fair comparison among the algorithms, we use the FBcubed measure for evaluating the quality of the clusterings built by each algorithm. A more detailed explanation of the FBcubed measure can be found in [22].

The first experiment compares the algorithms according to the quality of the clusterings they build for each collection, evaluated with the FBcubed measure [22]. In Table 2, we show the best performance attained by each algorithm over each collection, according to the FBcubed measure.

Table 2. Best performances of each algorithm over each collection. The highest values per collection appear bold-faced.

Collection  Star  ISC   Estar  Gstar  ACONS  ICSD  SHC   OCDC
AFP         0.69  0.20  0.63   0.63   0.62   0.61  0.27  0.77
Reu-Te      0.45  0.05  0.39   0.40   0.40   0.39  0.20  0.51
Reu-Tr      0.42  0.03  0.36   0.36   0.36   0.36  0.19  0.43
TDT         0.43  0.06  0.37   0.35   0.34   0.35  0.15  0.48
cacm        0.31  0.18  0.32   0.31   0.32   0.32  0.15  0.33
cisi        0.30  0.05  0.29   0.29   0.29   0.29  0.21  0.32
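The FBcubed measure can be sketched as follows. This is a hedged illustration following our reading of the extended BCubed precision and recall of Amigó et al. [22] (multiplicity precision and recall averaged per item, combined by the harmonic mean), not the reference implementation:

```python
def fbcubed(clusters, classes):
    """Extended BCubed F-measure for overlapping clusterings.
    `clusters` and `classes` are lists of sets of item identifiers."""
    # map each item to the indices of the clusters / classes containing it
    cl, la = {}, {}
    for i, g in enumerate(clusters):
        for e in g:
            cl.setdefault(e, set()).add(i)
    for i, g in enumerate(classes):
        for e in g:
            la.setdefault(e, set()).add(i)

    items = list(la)
    prec, rec = [], []
    for e in items:
        ce = cl.get(e, set())
        # multiplicity precision: averaged over items sharing a cluster with e
        p = [min(len(ce & cl.get(x, set())), len(la[e] & la[x])) /
             len(ce & cl.get(x, set()))
             for x in items if ce & cl.get(x, set())]
        # multiplicity recall: averaged over items sharing a class with e
        r = [min(len(ce & cl.get(x, set())), len(la[e] & la[x])) /
             len(la[e] & la[x])
             for x in items if la[e] & la[x]]
        if p:
            prec.append(sum(p) / len(p))
        rec.append(sum(r) / len(r))   # r is never empty: e shares la[e] with itself
    P = sum(prec) / len(prec) if prec else 0.0
    R = sum(rec) / len(rec)
    return 2 * P * R / (P + R) if P + R > 0 else 0.0
```

A clustering identical to the ground-truth classes scores 1.0, while merging items from different classes into one cluster lowers precision and thus the combined score.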

As can be seen from Table 2, OCDC outperforms, according to the FBcubed measure, all the other tested algorithms.

The second experiment compares the algorithms according to the number of clusters they build for each collection. Table 3 shows the results of this experiment. As can be noticed from Table 3, OCDC builds, in almost all collections, fewer clusters than the other algorithms. We would like to highlight that these results of OCDC are obtained without affecting the quality of the clusterings


Table 3. Number of clusters built by each algorithm for each collection. The smallest values per collection appear bold-faced.

Collection  Star  ISC   Estar  Gstar  ACONS  ICSD  SHC   OCDC
AFP          123   334    98     90    129    104    85    52
Reu-Te       507  1785   600    711    798    621   273   102
Reu-Tr       471  3936   904    849    857    853   561   166
TDT         2019  8250  1854   1653   1663   1657  1203   769
cacm         124   228   122    129    129    152    29   102
cisi         134   654   195    159    213    209   100    52

(see Table 2); in this way, OCDC builds clusterings that could be easier to analyze than those built by the other algorithms used in the comparison.

The third experiment compares the algorithms according to the overlapping they produce for each collection; the overlapping of a clustering is computed as the average number of clusters in which an object is included [23]. In Table 4, we show the overlapping of the clusterings built by each algorithm for each collection.

Table 4. Overlapping of the clusterings built by each algorithm for each collection. The lowest values per collection appear bold-faced.

Collection  Star  ISC   Estar  Gstar  ACONS  ICSD   SHC    OCDC
AFP         1.71  1.65   2.52   2.31   2.48   2.53   2.43   1.18
Reu-Te      3.41  1.79   7.40   6.73   6.66   7.64  13.13   1.40
Reu-Tr      5.54  1.84  12.14  13.08  12.65  13.32  29.33   1.56
TDT         4.81  1.88  59.41  69.43  66.38  70.97  80.74   1.50
cacm        2.31  1.82   3.46   3.20   3.19   2.72   1.99   1.26
cisi        4.12  2.12   7.85   7.49   7.77   7.54   6.98   1.58
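The overlapping measure underlying Table 4 is straightforward to compute; a small sketch with names of our own choosing:

```python
def clustering_overlapping(clusters):
    """Overlapping of a clustering: the average number of clusters in
    which an object is included."""
    counts = {}
    for cluster in clusters:
        for obj in cluster:
            counts[obj] = counts.get(obj, 0) + 1
    return sum(counts.values()) / len(counts)
```

A disjoint clustering yields exactly 1.0; values well above 1 indicate that many objects are shared among several clusters.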

From Table 4, it can be seen that OCDC builds clusterings with less overlapping than the clusterings built by the other algorithms. In this way, OCDC allows overlapping among the clusters but keeps it under control, avoiding clusters with such high overlapping that they could be interpreted as the same cluster. This characteristic could be useful for applications where low overlapping is desirable, for instance, document segmentation by topic [24] or web document organization [3, 7].

5 Conclusions

In this paper, we introduced OCDC, a new overlapping clustering algorithm. OCDC introduces a new graph-covering strategy and a new filtering strategy, which together allow obtaining a small set of clusters with low overlapping. Moreover, OCDC overcomes the limitations of previous algorithms and has an acceptable computational complexity.


The OCDC algorithm was compared against other state-of-the-art overlapping clustering algorithms in terms of: i) clustering quality, ii) number of clusters, and iii) overlapping of the clusters. The experimental evaluation showed that OCDC builds more accurate clusterings, according to the FBcubed measure, than the algorithms used in the comparison. Moreover, OCDC builds clusterings having fewer clusters and less overlapping than those built by most of the tested algorithms. From the above, we can conclude that OCDC is a better option for overlapping clustering than previously reported algorithms. As future work, we are going to explore the use of the OCDC algorithm in agglomerative hierarchical overlapping clustering.

Acknowledgment. This work was supported by the National Council on Science and Technology of Mexico (CONACYT) under Grants 106366 and 106443.

References

1. Pfitzner, D., Leibbrandt, R., Powers, D.: Characterization and evaluation of similarity measures for pairs of clusterings. Knowledge and Information Systems 19(3), 361–394 (2009)
2. Aslam, J., Pelekhov, K., Rus, D.: Static and Dynamic Information Organization with Star Clusters. In: Proceedings of the Seventh International Conference on Information and Knowledge Management, pp. 208–217 (1998)
3. Zamir, O., Etzioni, O.: Web document clustering: A feasibility demonstration. In: Proceedings of the 21st ACM SIGIR Conference, pp. 46–54 (1998)
4. Pons-Porrata, A., Ruiz-Shulcloper, J., Berlanga-Llavorí, R., Santiesteban-Alganza, Y.: Un algoritmo incremental para la obtención de cubrimientos con datos mezclados. In: Proceedings of CIARP 2002, pp. 405–416 (2002)
5. Gil-García, R.J., Badía-Contelles, J.M., Pons-Porrata, A.: Extended Star Clustering Algorithm. In: Sanfeliu, A., Ruiz-Shulcloper, J. (eds.) CIARP 2003. LNCS, vol. 2905, pp. 480–487. Springer, Heidelberg (2003)
6. Gil-García, R.J., Badía-Contelles, J.M., Pons-Porrata, A.: Parallel Algorithm for Extended Star Clustering. In: Sanfeliu, A., Martínez Trinidad, J.F., Carrasco Ochoa, J.A. (eds.) CIARP 2004. LNCS, vol. 3287, pp. 402–409. Springer, Heidelberg (2004)
7. Hammouda, K.M., Kamel, M.S.: Efficient Phrase-Based Document Indexing for Web Document Clustering. IEEE Transactions on Knowledge and Data Engineering 16(10), 1279–1296 (2004)
8. Palla, G., Derényi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043), 814–824 (2005)
9. Baumes, J., Goldberg, M., Magdon-Ismail, M.: Efficient Identification of Overlapping Communities. In: Kantor, P., Muresan, G., Roberts, F., Zeng, D.D., Wang, F.-Y., Chen, H., Merkle, R.C. (eds.) ISI 2005. LNCS, vol. 3495, pp. 27–36. Springer, Heidelberg (2005)
10. Baumes, J., Goldberg, M., Krishnamoorty, M., Magdon-Ismail, M., Preston, N.: Finding communities by clustering a graph into overlapping subgraphs. In: Proceedings of IADIS Applied Computing, pp. 97–104 (2005)
11. Suárez, A.P., Pagola, J.E.M.: A Clustering Algorithm Based on Generalized Stars. In: Perner, P. (ed.) MLDM 2007. LNCS (LNAI), vol. 4571, pp. 248–262. Springer, Heidelberg (2007)
12. Alonso, A.G., Suárez, A.P., Pagola, J.E.M.: ACONS: A New Algorithm for Clustering Documents. In: Rueda, L., Mery, D., Kittler, J. (eds.) CIARP 2007. LNCS, vol. 4756, pp. 664–673. Springer, Heidelberg (2007)
13. Gregory, S.: An Algorithm to Find Overlapping Community Structure in Networks. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) PKDD 2007. LNCS (LNAI), vol. 4702, pp. 91–102. Springer, Heidelberg (2007)
14. Zhang, S., Wang, R.S., Zhang, X.S.: Identification of overlapping community structure in complex networks using fuzzy c-means clustering. Physica A: Statistical Mechanics and its Applications 374(1), 483–490 (2007)
15. Gregory, S.: A Fast Algorithm to Find Overlapping Communities in Networks. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008, Part I. LNCS (LNAI), vol. 5211, pp. 408–423. Springer, Heidelberg (2008)
16. Davis, G., Carley, K.: Clearing the FOG: Fuzzy, overlapping groups for social networks. Social Networks 30(3), 201–212 (2008)
17. Suárez, A.P., Martínez-Trinidad, J.F., Carrasco-Ochoa, J.A., Medina-Pagola, J.E.: A New Incremental Algorithm for Overlapped Clustering. In: Bayro-Corrochano, E., Eklundh, J.-O. (eds.) CIARP 2009. LNCS, vol. 5856, pp. 497–504. Springer, Heidelberg (2009)
18. Macropol, K., Can, T., Singh, A.K.: RRW: repeated random walks on genome-scale protein networks for local cluster discovery. BMC Bioinformatics 10(283) (2009)
19. Goldberg, M., Kelley, S., Magdon-Ismail, M., Mertsalov, K., Wallace, A.: Finding Overlapping Communities in Social Networks. In: Proceedings of SocialCom 2010, pp. 104–113 (2010)
20. Magdon-Ismail, M., Purnell, J.: SSDE-Cluster: Fast Overlapping Clustering of Networks Using Sampled Spectral Distance Embedding and GMMs. In: Proceedings of SocialCom 2011, pp. 756–759 (2011)
21. Al-Hasan, M., Salem, S., Zaki, M.J.: SimClus: an effective algorithm for clustering with a lower bound on similarity. Knowledge and Information Systems 28(3), 665–685 (2011)
22. Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval 12, 461–486 (2009)
23. Gil-García, R.J., Pons-Porrata, A.: Dynamic hierarchical algorithms for document clustering. Pattern Recognition Letters 31(6), 469–477 (2010)
24. Abella-Pérez, R., Medina-Pagola, J.E.: An Incremental Text Segmentation by Clustering Cohesion. In: Proceedings of HaCDAIS 2010, pp. 65–72 (2010)
25. Jo, T., Lee, M.R.: The Evaluation Measure of Text Clustering for the Variable Number of Clusters. In: Liu, D., Fei, S., Hou, Z., Zhang, H., Sun, C. (eds.) ISNN 2007, Part II. LNCS, vol. 4492, pp. 871–879. Springer, Heidelberg (2007)
26. Meilă, M.: Comparing Clusterings by the Variation of Information. In: Schölkopf, B., Warmuth, M.K. (eds.) COLT/Kernel 2003. LNCS (LNAI), vol. 2777, pp. 173–187. Springer, Heidelberg (2003)