Clustering by vertex density in a graph

Alain Guénoche
IML - CNRS, 163 Av. de Luminy, 13009 Marseille (France)
[email protected]

Abstract. In this paper we introduce a new principle for two classical problems in clustering: building a set of partial classes or a partition of a set X of n elements. These structures are derived from a distance D and a threshold value σ defining a threshold graph on X with maximum degree δ. The method first computes a density function De : X → R from D. The number of classes, the classes themselves and the partitions are then established using only this density function and the graph edges, with a computational complexity in O(nδ). Monte Carlo simulations on random Euclidean distances validate the method.

Key words: Threshold graph, Density, Classes, Partitions

1 Introduction

Given a distance matrix on X, extracting partial classes (in which not all the elements are clustered) or establishing partitions (all the elements are clustered in disjoint classes) is generally done by optimizing a criterion over the set of all partitions with a given number of classes. For most criteria, the optimization problem is NP-hard and heuristics are applied. Too many authors deserve citation to single out just a few. In this paper, we investigate a different approach. First, we select a threshold σ and, considering only the pairs whose distance is lower than or equal to σ, we build the corresponding graph. We are thus led to the clustering problem in valued graphs, well known in Combinatorial Data Analysis. The method we propose is based on the evaluation of a density function at each vertex. We then search for connected parts with high density values, as proposed in percolation methods (Trémolières 1994). The algorithm differs from similar approaches (Bader and Hogue, 2003; Rougemont and Hingamp, 2003) in several ways: we use the valuation of the edges to measure a density, and we perform progressive clustering, adding the elements in three steps:

– we first build kernels, connected parts made of vertices whose density is locally maximum and greater than the average;
– then these classes are extended, adding vertices that are connected to only one kernel;
– finally, the unclassified elements are assigned to one of the previous classes.

The method can be applied to any data in a metric space - quantitative, qualitative, binary, sequences - given an appropriate distance. We emphasize that the number of classes is not given a priori. It is defined as the number of local maxima of the density function and, with simulations, we show that the correct number can very often be recovered when classes do not intersect.

Given a distance matrix D : X × X → R, the first step is to select a threshold σ to build a graph. Its edge set E is the set of pairs having a distance value lower than or equal to σ. Let n = |X|, m = |E| and Γσ = (X, E) the corresponding graph. When there is no ambiguity on the threshold value, it will simply be denoted Γ. For any part Y of X, let Γ(Y) be the set of vertices adjacent to Y that do not belong to Y. Thus, the neighborhood of x is denoted Γ(x), the degree of a vertex x is Dg(x) = |Γ(x)|, and δ is the maximum degree in Γ (a minimal construction sketch closes this section).

The paper is organized as follows. In section 2 we define three density functions, which are compared in section 4. In section 3 we detail the algorithm for building partial classes and partitions and give its complexity. In section 4 we evaluate classes and partitions using Monte Carlo simulations; several criteria estimate partition quality.
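As announced, here is a minimal Python sketch, not from the paper, that builds the adjacency lists of Γσ from a symmetric distance matrix D given as a list of lists; the function name threshold_graph is ours.

def threshold_graph(D, sigma):
    # Adjacency lists of Gamma_sigma: keep the pairs {x, y} with D[x][y] <= sigma.
    n = len(D)
    gamma = {x: set() for x in range(n)}
    for x in range(n):
        for y in range(x + 1, n):
            if D[x][y] <= sigma:
                gamma[x].add(y)
                gamma[y].add(x)
    return gamma

The maximum degree δ is then simply max(len(gamma[x]) for x in gamma).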

2 Threshold value and density functions

The nested family of threshold graphs is obtained from a distance matrix by making σ vary from 0 to Dmax, the largest distance value. Let σc be the connectivity threshold, that is, the largest edge length of a minimum spanning tree of D. A value larger than or equal to σc gives a connected graph, and a lower value always yields several connected components. For σ = Dmax the graph is complete.

Choosing a threshold value is a delicate problem. It influences the number of classes, since classes are connected parts of X and necessarily included in the connected components of the graph. Hence we replace this problem by choosing a percentage of the number of edges of the complete graph: indicating α percent defines a threshold distance value σ, and the graph contains α × n(n − 1)/2 edges.

For each vertex x, we define a density value, denoted De(x), which is high when the elements of Γ(x) are close to x. We propose three functions:

– The first one only depends on the minimum distance value from x:

  De1(x) = (σ − min_{y ∈ Γ(x)} D(x, y)) / σ

– The second one is computed from the average length of the edges from x:

  De2(x) = (σ − (1/Dg(x)) Σ_{y ∈ Γ(x)} D(x, y)) / σ

– The third one corresponds to the maximum distance value in Γ(x):

  De3(x) = (σ − max_{y ∈ Γ(x)} D(x, y)) / σ

Building the threshold graph from D takes O(n²). To evaluate these functions, it suffices to test the edges in the neighborhood of x, which contains at most δ vertices; the computation of the density function is thus in O(nδ).
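A sketch of this step, chaining with threshold_graph above. The names threshold_from_alpha and density are ours, and the zero density assigned to isolated vertices is our convention, not the paper's.

def threshold_from_alpha(D, alpha):
    # sigma such that the threshold graph keeps about alpha * n(n-1)/2 edges:
    # the alpha-quantile of the sorted off-diagonal distance values.
    n = len(D)
    values = sorted(D[x][y] for x in range(n) for y in range(x + 1, n))
    return values[max(0, int(alpha * len(values)) - 1)]

def density(D, gamma, sigma, which=2):
    # De1 (minimum), De2 (average) or De3 (maximum) of the edge lengths at x,
    # rescaled by sigma as in the three formulas above.
    de = {}
    for x, neigh in gamma.items():
        if not neigh:
            de[x] = 0.0  # convention for isolated vertices (our assumption)
            continue
        lengths = [D[x][y] for y in neigh]
        agg = {1: min(lengths), 2: sum(lengths) / len(lengths), 3: max(lengths)}
        de[x] = (sigma - agg[which]) / sigma
    return de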

3 Classes and partitions

The dense classes are, by definition, connected parts of Γ sharing high density values. Our initial idea was to search for a density threshold and to consider the partial subgraph whose vertices have a density greater than this threshold; the classes would have been its connected components. This strategy does not give the expected results. Enumerating all the possible threshold values, we have observed that often none is satisfactory: by decreasing the threshold, we often obtain a single growing class and many singletons. Since there is no straightforward way to fix a threshold, we turn to the local maximum values of the density function.

3.1 Classes at three levels

We successively compute the kernels of the classes, the extended classes that do not cover X, and the complete classes that make a partition. Over these three levels, the classes are nested.

Kernels

A kernel, denoted K, is a connected part of Γ, obtained by the following algorithm. We first search for the local maxima of the density function and consider the partial subgraph of Γ reduced to these vertices: ∀x ∈ K, ∀y ∈ Γ(x), De(x) ≥ De(y). The initial kernels are the connected components of this subgraph. More precisely, if several vertices with maximum value are in the same kernel, they necessarily have the same density value; otherwise the initial kernels are singletons.

Then, we assign recursively to each kernel K the vertices (i) having a density greater than or equal to the average density over X and (ii) adjacent to only one kernel. By doing so, we avoid any ambiguity in the assignment, deferring the decision when several choices are possible. The number of kernels is the number of classes, and it remains unchanged in what follows. Hence, unlike the alternative methods that optimize a criterion, this number need not be given in advance. We shall see that the method performs well when there is a small number of classes, each having at least 30 to 50 elements.

Extended classes

At the second level, we simply assign the elements that are connected to a unique kernel, whatever their density. If an element outside the kernels is connected to several of them, the decision is again postponed.

Complete classes

Finally, to get partitions, we assign the remaining elements to one class. For x and any extended class C to which it is connected, we compute the number of edges between x and C, and also the average distance from x to C. There are then two candidate classes: the majority connected class Cm and the closest one Cd. If they are identical, x is assigned to it; if they differ, we apply the following empirical rule: if |Cm|/|Cd| > 1.5, class Cm is retained, because the number of links to Cm is clearly larger than to Cd; otherwise Cd is retained. (A sketch of the whole procedure follows Section 3.2.)

3.2 Complexity

Kernel computation is in O(nδ) to find the local maximum vertices, and in O(m) ≤ O(nδ) to determine the kernel elements. During the extension steps, for any x we count its connections to the kernels, and then to the extended classes. Both are also in O(nδ), which is therefore the complexity of the whole algorithm. With this very low computation time, the method can treat large distance matrices (n ≫ 1000) more efficiently than other optimization procedures on distance matrices, which are, in the best case, in O(n²). It also permits testing several threshold values to find an interval in which the number of classes remains the same. For disjoint classes, such a threshold value always exists and the expected number of classes can easily be recovered.

More interesting is the fact that, to build the threshold graph, it is not necessary to store the whole distance matrix; it is sufficient to read it row after row and to keep in memory the adjacency lists or an adjacency matrix. This is very important for biological data, which are more and more numerous: in one year, starting with sixty entirely sequenced genomes we get more than one hundred, each one possessing several thousand genes. And DNA chips quantify the expression of thousands of genes simultaneously.
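The announced sketch of the three-level procedure follows. It is our reconstruction of the algorithm described above, not the author's code; in particular, we read |Cm|/|Cd| as the ratio of the numbers of links from x to Cm and to Cd, and vertices linked to no class are left unassigned.

from statistics import mean

def dense_classes(D, gamma, de):
    # 1. Kernels: connected components of the subgraph of local maxima.
    maxima = {x for x in gamma if all(de[x] >= de[y] for y in gamma[x])}
    kernels, seen = [], set()
    for x in maxima:
        if x in seen:
            continue
        comp, stack = set(), [x]  # depth-first search restricted to maxima
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(gamma[v] & maxima)
        seen |= comp
        kernels.append(comp)

    def classed(x, classes):
        return any(x in c for c in classes)

    def unique_class(x, classes):
        hit = [c for c in classes if gamma[x] & c]
        return hit[0] if len(hit) == 1 else None

    # 2. Kernel growth: vertices above average density, adjacent to one kernel.
    avg = mean(de.values())
    changed = True
    while changed:
        changed = False
        for x in gamma:
            if de[x] >= avg and not classed(x, kernels):
                k = unique_class(x, kernels)
                if k is not None:
                    k.add(x)
                    changed = True

    # 3. Extended classes: unclassified vertices adjacent to a unique class.
    classes = [set(k) for k in kernels]
    changed = True
    while changed:
        changed = False
        for x in gamma:
            if not classed(x, classes):
                c = unique_class(x, classes)
                if c is not None:
                    c.add(x)
                    changed = True

    # 4. Complete classes: majority class Cm vs closest class Cd, 1.5 rule.
    for x in gamma:
        if classed(x, classes) or not any(gamma[x] & c for c in classes):
            continue
        linked = [c for c in classes if gamma[x] & c]
        cm = max(linked, key=lambda c: len(gamma[x] & c))   # most links
        cd = min(linked, key=lambda c: mean(D[x][y] for y in c))  # closest on average
        keep = cm if cm is cd or len(gamma[x] & cm) / len(gamma[x] & cd) > 1.5 else cd
        keep.add(x)
    return kernels, classes

Chained with the previous sketches: gamma = threshold_graph(D, sigma), de = density(D, gamma, sigma), then kernels, partition = dense_classes(D, gamma, de).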

4 Experimental validation

In order to show that this method recovers existing classes, we have tested it, comparing the efficiency of the density functions. First, we have developed a random generator of Euclidean distances with a given number of classes, denoted p: n points are generated in a Euclidean space of dimension m ≥ p; the coordinates of each point are selected uniformly at random between 0 and 1, except for the coordinate corresponding to its class number, which is selected in [1,2]. So the "squares" of any two groups share a common border, providing many small inter-class distances, and the split of the initial partition is much smaller than its diameter. The Euclidean distance is then computed. The generator of random distances thus depends on three parameters:

– n: the number of elements,
– p: the number of initial classes,
– m ≥ p: the dimension of the Euclidean space.

The initial partition is denoted P = {C1, ..., Cp}.

4.1 Quality of the classes compared to the initial partition

For the three levels, we want to estimate the quality of the resulting classes, and so the efficiency of the clustering process. Let n′c be the number of classified vertices at each level. They are distributed into p′ classes denoted C′1, ..., C′p′, realizing a partition P′ over a subset of X for the kernels and the extended classes. We first map the classes of P′ onto those of P by evaluating ni,j = |Ci ∩ C′j|. We define the corresponding class of C′j, denoted Θ(C′j), as the class of P containing the greatest number of elements of C′j: Θ(C′j) = Ck if and only if nk,j ≥ ni,j for all i from 1 to p. In order to measure the accuracy of the classes, we evaluate three criteria:

– τc: the percentage of clustered elements in P′ (τc = n′c/n).
– τe: the percentage of elements in one of the p′ classes that belong to its corresponding class in P:

  τe = (Σi |Θ(C′i) ∩ C′i|) / n′c

– τp: the percentage of pairs joined in the same class of P′ that are also joined in P.

The first criterion measures the efficiency of the clustering process at each level; if very few elements are clustered, the method is inefficient. For the second criterion, we compute, for each class of P′, the distribution of the elements of the initial classes to define its corresponding class in P; it can thus be interpreted as a percentage of "well classified" elements. The third one estimates the probability for a pair in one class of P′ to belong to a single class in P.
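Both the generator and the three criteria translate directly into code. In the following sketch (our code, not the paper's), spreading the n elements evenly over the p classes is our assumption; the paper does not specify the class sizes.

import random
from itertools import combinations
from math import dist  # Python 3.8+

def generate(n, p, m):
    # n points of [0,1]^m; coordinate k of a class-k point is shifted into [1,2].
    points, classes = [], [set() for _ in range(p)]
    for i in range(n):
        k = i % p  # balanced classes (our assumption)
        x = [random.random() for _ in range(m)]
        x[k] += 1.0
        points.append(x)
        classes[k].add(i)
    D = [[dist(a, b) for b in points] for a in points]
    return D, classes

def quality(P, Pp, n):
    # tau_c, tau_e, tau_p of a (possibly partial) partition Pp against P,
    # both given as lists of sets of element indices.
    n_c = sum(len(c) for c in Pp)
    tau_c = n_c / n
    # Theta(C') is the class of P with the largest overlap with C'.
    tau_e = sum(max(len(c & ci) for ci in P) for c in Pp) / n_c
    label = {x: i for i, ci in enumerate(P) for x in ci}
    pairs = [(x, y) for c in Pp for x, y in combinations(c, 2)]
    tau_p = sum(label[x] == label[y] for x, y in pairs) / len(pairs) if pairs else 1.0
    return tau_c, tau_e, tau_p

For instance, D, P = generate(200, 5, 5) reproduces the protocol of Section 4.2 below.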

Remark: the last two criteria may reach their maximum value (1.0) even when partitions P and P′ are not identical. When a class of P is subdivided into two parts, both have the same corresponding class in P; all their elements are counted as well classified, and the rate of pairs also equals 1. Consequently, we indicate in Table 3 the percentage of trials for which the correct number of classes has been found.

4.2 Results

We first evaluate the number of recovered classes according to the percentage of edges in the threshold graph. We have generated 300 distances on 200 points distributed into 5 classes, and computed the number of local maxima using the De2 density function. In a first series, the dimension of the Euclidean space is equal to the number of classes (m = 5). In a second series, this dimension is doubled (m = 10), the new coordinates being selected uniformly at random in [0,1]; these new variables provide no information for partitioning, they just add noise. We first study the average number of classes for increasing values of α and for both dimensions m. The results in Table 1 show that the correct number is predictable from α = 20% to 45%. These computations can be made for a single distance matrix; the number of classes is then the one remaining constant over a large interval of α. Table 2 indicates the percentage of trials giving each computed number of classes.

α        .10   .15   .20   .25   .30   .35   .40   .45   .50   .55   .60   .65   .70
m = 5   10.4   6.0   5.0   5.0   4.9   4.9   4.9   4.8   4.7   4.2   3.0   1.6   1.1
m = 10   9.8   6.0   5.0   4.7   4.6   4.5   4.2   3.8   3.0   2.1   1.3   1.1   1.0

Table 1. Average number of classes determined with a threshold graph having α.n.(n − 1)/2 edges.

                       m = 5                           m = 10
% edges    3    4    5    6    7       1    2    3    4    5    6    7
20%      0.0  .03  .86  .10  .01     0.0  0.0  .01  .14  .70  .14  .02
25%      0.0  .03  .96  .01  0.0     0.0  0.0  .03  .21  .74  .01  0.0
30%      0.0  .06  .94  0.0  0.0     0.0  0.0  .04  .33  .62  .01  0.0
35%      0.0  .10  .90  0.0  0.0     0.0  .01  .08  .37  .54  0.0  0.0
40%      .01  .13  .86  0.0  0.0     0.0  .03  .15  .41  .41  0.0  0.0
45%      .01  .19  .80  0.0  0.0     .01  .08  .27  .41  .23  0.0  0.0

Table 2. Distribution of the computed number of classes (columns) according to the percentage of edges (rows).

One can see that for 25% of edges, 5 classes have been found in more than 95% of the trials when the dimension is equal to the number of classes, and in practically 75% of the trials when it is doubled.
In the other cases, the computed number of classes is very close to 5. These performances remain the same when the number of elements increases up to 1000; they weaken when there are fewer than 30 elements per class. The results are equivalent when there are only 3 classes in a 3-dimensional space, but the best results (100% exact prediction) are then obtained with 30 to 40 percent of edges. This can be generalized: the larger the expected number of classes, the smaller α must be.

We now compare the density functions, using the same protocol for distances (n = 200, p = 5, m = 5, 10) and keeping 25% of the edges in the threshold graph to evaluate the density.

                                  m = 5
                   De1               De2               De3
                τc   τe   τp      τc   τe   τp      τc   τe   τp
Kernels        .30  .82  .72     .36  1.0  .99     .23  .99  .99
Classes        .54  .78  .68     .54  .99  .99     .50  .99  .99
Partitions     1.0  .72  .59     1.0  .96  .93     1.0  .93  .88
% of 5 classes      .30               .94               .67

                                  m = 10
                   De1               De2               De3
                τc   τe   τp      τc   τe   τp      τc   τe   τp
Kernels        .26  .75  .62     .29  .97  .94     .19  .97  .95
Classes        .49  .71  .60     .46  .95  .91     .39  .97  .95
Partitions     1.0  .60  .45     1.0  .89  .80     1.0  .83  .70
% of 5 classes      .21               .73               .31

Table 3. Average results of the quality criteria on the 3 types of classes for the 3 density functions.

The superiority of function De2 is evident. Function De1 is not satisfactory at any level, because it recovers the correct number of classes too rarely; in fact, every pair of mutual nearest neighbors constitutes a kernel. Function De3 seems better, but it predicts too many classes and the criteria hide this bias. For function De2, one third of the elements belong to the kernels and one half to the extended classes; more than 90% of the joined pairs of elements come from the same initial class.

4.3 Evaluating final partitions

In the preceding IFCS conference (Guénoche and Garreta, 2002), we proposed several criteria to assess how well a partition P fits a distance D. We first recall some of them, based on the principle that large distance values should be between-class, or external, links and small distance values should be within-class, or internal, links. A partition P on X induces a bipartition (Le | Li) of the pair set; let Ne and Ni be the number of pairs in each part. These quantities induce another bipartition of the n(n − 1)/2 pairs of X: ranking the distance values in decreasing order, we distinguish the Ne greatest values, on the left side, and the Ni smallest values, on the right side. This bipartition is denoted (Gd | Sd). A perfect partition on X would lead to identical bipartitions on pairs. To compare these two bipartitions, we compute:

– The rate of agreements on pairs: this rate, denoted τa, is the percentage of pairs belonging to Le and Gd (external links and large distances) or to Li and Sd (internal links and small distances).

– The rate of weight: it is computed from the sums of distances in each of the four classes, respectively denoted Σ(Le), Σ(Li), Σ(Gd) and Σ(Sd):

  τw = (Σ(Le)/Σ(Gd)) × (Σ(Sd)/Σ(Li))

These two ratios correspond to the weight of the external links divided by the maximum that could be realized with Ne distance values, and to the minimal weight of Ni edges divided by the weight of the internal links. Both are lower than or equal to 1, and so is τw. A rate close to 1 means that the between-class links belong to the longest distances and the within-class links have been taken from the smallest ones.

– The ratio of well designed triples: we only consider triples made of two elements x1 and x2 belonging to the same class of P and a third, external element y. Such a triple is well designed if and only if D(x1, x2) ≤ inf{D(x1, y), D(x2, y)}. Criterion τt is the percentage of well designed triples.

A sketch of these computations is given after Table 4. After 200 trials, we have computed the average of these three criteria, first for the initial partitions P (they do not reach the theoretical maximum of 1.0, because there are many small distances between elements of different squares) and then for the final partitions P′. The results in Table 4 show that the latter have practically the same quality with respect to the distances.

                      m = 5               m = 10
                    τa   τw   τt       τa   τw   τt
Initial partitions .90  .91  .92      .88  .91  .88
Final partitions   .89  .90  .90      .84  .88  .82

Table 4. Comparison of three fitting criteria between the initial partitions and the final partitions recovered by the density clustering method.

The method has also been tested with other types of distances, such as boolean distances (symmetric difference distance on binary vectors) or graph distances (Czekanovski-Dice distance on graphs). The existence of classes is always guaranteed by generating procedures that cannot be detailed here. The partition quality remains the same, e.g. τe = .96 and τp = .93 for graphs on 300 vertices distributed into 5 classes.
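The announced sketch of these criteria follows (our code, under our reading of the definitions; the triple criterion is cubic in n and meant only for small instances).

from itertools import combinations

def fit_rates(D, P):
    # tau_a and tau_w from the bipartitions (Le | Li) and (Gd | Sd).
    label = {x: i for i, c in enumerate(P) for x in c}
    pairs = list(combinations(range(len(D)), 2))
    Li = {p for p in pairs if label[p[0]] == label[p[1]]}  # internal links
    Le = set(pairs) - Li                                   # external links
    ranked = sorted(pairs, key=lambda p: D[p[0]][p[1]], reverse=True)
    Gd, Sd = set(ranked[:len(Le)]), set(ranked[len(Le):])  # Ne largest, Ni smallest
    tau_a = (len(Le & Gd) + len(Li & Sd)) / len(pairs)
    weight = lambda S: sum(D[x][y] for x, y in S)
    tau_w = (weight(Le) / weight(Gd)) * (weight(Sd) / weight(Li))
    return tau_a, tau_w

def tau_t(D, P):
    # Share of well designed triples: D(x1,x2) <= min(D(x1,y), D(x2,y)).
    good = total = 0
    for c in P:
        outside = [y for other in P if other is not c for y in other]
        for x1, x2 in combinations(c, 2):
            for y in outside:
                total += 1
                good += D[x1][x2] <= min(D[x1][y], D[x2][y])
    return good / total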

5 Conclusion

The density clustering method has several advantages over classical partitioning methods.

– It allows both extracting classes that do not cover the complete set of elements and building a partition. The first problem becomes very important with DNA array data, for which not all the genes have to be clustered.
– It provides, from a threshold distance value or a percentage of edges in the threshold graph, the number of classes; this number is very often the correct one, and always very close to it when classes have at least a few tens of elements.
– It has an average complexity proportional to the number of elements and to the average degree of the graph. This makes it very efficient for large problems with distance matrices that cannot be stored entirely, as in complete genome comparison.

Finally, it is a one-parameter method (the threshold distance value, or the percentage of edges in the threshold graph) that can be used for large clustering problems with biological data.

Acknowledgements

This work is supported by the inter-EPST program "BioInformatique".

References

G.D. Bader and C.W. Hogue (2003). An automated method for finding molecular complexes in large protein interaction networks, BMC Bioinformatics, 4:2, 27 p.

A. Guénoche and H. Garreta (2002). Representation and evaluation of partitions, in Classification, Clustering and Data Analysis, ed. K. Jajuga et al., 131-138, Springer-Verlag.

H. Matsuda, T. Ishihara and A. Hashimoto (1999). Classifying molecular sequences using a linkage graph with their pairwise similarities, Theoretical Computer Science, 210, 305-325.

J. Rougemont and P. Hingamp (2003). DNA microarray data and contextual analysis of correlation graphs, BMC Bioinformatics, 4:15, 11 p.

R.C. Trémolières (1994). Percolation and multimodal data structuring, in New Approaches in Classification and Data Analysis, ed. E. Diday et al., 263-268, Springer-Verlag.