Communities in Graphs and Hypergraphs

Michael Brinkmeier, Sven Recknagel, and Jeremias Werner
Technische Universität Ilmenau, Institute for Theoretical Computer Science, 98684 Ilmenau, Germany

Abstract. In this paper we define a type of cohesive subgroups – called communities – in hypergraphs, based on the edge connectivity of subhypergraphs. We describe a simple algorithm for the construction of these sets and show, based on examples from image segmentation and information retrieval, that these groups may be useful for the analysis and accessibility of large graphs and hypergraphs.

1 Introduction

In many applications one is interested in groups of actors or objects which are more closely related to each other than arbitrary groups. More formally, these so-called cohesive subgroups are sets of vertices in a graph or hypergraph modelling the relations between the objects of interest, and they are in some sense denser or more tightly connected than arbitrary subsets. A very general approach is that of clustering or partitioning (for an introduction see e.g. [Gae04], for hypergraphs see [WF94]). But in many situations a complete decomposition is not appropriate. Instead, some objects might be classified as something like 'outliers', or as not strongly connected to any specific group. Problems like these occur in various settings, such as electronics (e.g. [LS69]), sociometrics (e.g. [Sei83]), information retrieval (e.g. [FLG00,FLGC02,GWFT04]) and segmentation of images (e.g. [SM00]).

Every type of cohesive subgroup uses its own, specific definition of cohesiveness. In this paper we give examples of cohesive subgroups, or communities as they are called here, based on the edge connectivity of sub(hyper)graphs. More specifically, we say that the community of a given subhypergraph is the maximal subhypergraph of maximum edge connectivity containing it. Our definition is an extension of the type of communities described in [Bri03] to hypergraphs. The constructed groups are k-components in the sense of Matula [Mat72], which were subsequently examined by Botafogo et al. [BS91,Bot93]. Our emphasis lies on the presentation of a new algorithm for their construction, which is faster at least as far as experiments indicate, as well as on the treatment of large networks.

In Section 2 we introduce hypergraphs and the required notions, while Section 3 provides the definition of communities and some of their most important properties.
The following section describes our main algorithmic tool, Lax-Adjacency-Orderings as introduced in [Bri05,Bri06], which are used subsequently to formulate a rough description of the implemented algorithm with which the experiments were conducted. Finally, we briefly present some experimental results regarding image segmentation and the detection of cohesive subgroups in large document sets. On the one hand, the experimental data show that even though the asymptotic worst-case runtime of our algorithm is quite large, its average runtime is much lower, allowing the structuring of a collection of documents with about 1.5 million word-document relations in about 900 seconds. On the other hand, the experiments indicate that the resulting groups may be of value in several fields of application.

2 Definitions

For an arbitrary finite set X let P*(X) be the set of all subsets of X with at least two elements. An (undirected) hypergraph H = (V, E) consists of a finite, non-empty set V of vertices and a multi-set E of sets in P*(V), whose elements are called (hyper)edges.¹ The size |e| of an edge is the number of vertices it contains. In general, n denotes the number of vertices of a hypergraph, m the number of edges and M the sum of the sizes of all edges, i.e. $M = \sum_{e \in E} |e|$.

A hypergraph H is edge weighted if there exists a map w : E → R⁺ assigning a strictly positive real number to each edge. More generally, we assume for an arbitrary subset x ⊂ V that w(x) > 0 if and only if x ∈ E. A weighted hypergraph will be written as a triple H = (V, E; w).

An edge e ∈ E is said to touch a subset X ⊂ V of vertices if X ∩ e ≠ ∅. The degree deg_H(v) of a vertex v ∈ V is the number of edges touching it. The weight w(v) incident to v is the sum of the weights of all edges touching v.

A cut of a hypergraph H = (V, E; w) is a non-empty proper subset C ⊂ V of vertices. The weight w(C) of a cut is the sum of the weights of all edges touching both C and its complement C̄ = V \ C. A minimum cut of H = (V, E; w) is a cut of minimum weight. The weight of a minimum cut is called the edge connectivity of H and is denoted by λ_H. A cut and its complement can naturally be identified with each other.

In contrast to graphs, hypergraphs allow two notions of subgraphs. The first, the notion of a partial hypergraph, preserves edges, i.e. it is a hypergraph G = (U, F) with U ⊆ V and F ⊆ E such that e ∈ F implies e ⊆ U. The second, more flexible notion of a subhypergraph allows hypergraphs with 'incomplete' subedges. A subhypergraph G = (U, F) of this type satisfies U ⊆ V, and there exists an injective map ι : F → E with f ⊆ ι(f) for every f ∈ F. In other words, every edge of G is a subset of an edge in H, and every edge of H corresponds to at most one subedge in G. If H is weighted, then the weight of an edge in the subhypergraph (partial hypergraph) is the weight of the associated original edge. In fact, every partial hypergraph is a subhypergraph. Nonetheless, we differentiate both types, due to the next definition.

Let U ⊂ V be a proper subset of vertices. The subhypergraph H[U] induced by U is the hypergraph H[U] = (U, E[U]; w) with E[U] := {e[U] := e ∩ U | e ∈ E and |e ∩ U| ≥ 2} and w(e[U]) := w(e). The partial hypergraph H⟨U⟩ induced by U is the hypergraph H⟨U⟩ = (U, E⟨U⟩; w) with E⟨U⟩ := {e ∈ E | e ⊆ U}.

Let H = (U, F) and H′ = (U′, F′) be two subhypergraphs (partial hypergraphs) of G = (V, E). Then we say H ⊆ H′ if U ⊆ U′ and there exists an injective map ι : F → F′ such that e ⊆ ι(e). This relation defines a partial order on the subhypergraphs of G.

Let H = (V, E) be an undirected hypergraph and U = {U_1, …, U_l} a partition of its vertex set V. For v ∈ V, the class U_i containing v is denoted by [v]. The hypergraph H/U denotes the contraction of H along U, i.e. its vertices are the classes of U and its edges are given by E(H/U) = {[e] | e ∈ E} with [e] = {[v] | e ∩ [v] ≠ ∅}, ignoring all trivial edges (i.e. those containing only vertices of one class).

A partition U of H = (V, E; w) is τ-proper for τ > 0 if [u] = [v] implies λ_H(u, v) ≥ τ, where λ_H(u, v) denotes the minimum weight of a cut of H separating u and v. A hypergraph G is a τ-proper contraction of H if G = H/U and U is τ-proper. It is easy to see that the relation of being a τ-proper contraction is transitive, i.e. if G is a τ-proper contraction of H and G′ is one of G, then G′ is a τ-proper contraction of H, too.

¹ Observe that we allow 'parallel' hyperedges, i.e. the same vertex set may appear multiple times as a hyperedge.
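The notions of this section can be made concrete with a small Python sketch. It is only an illustration under assumptions of our own: a weighted hypergraph is stored as a list of (frozenset, weight) pairs (a list, so that parallel hyperedges are possible), all function names are hypothetical, and the edge connectivity is computed by brute force over all cuts, which is feasible only for tiny examples.

```python
from itertools import combinations

def induced_sub(edges, U):
    """H[U]: keep e ∩ U for every edge with |e ∩ U| >= 2 (weight unchanged)."""
    U = frozenset(U)
    return [(e & U, w) for e, w in edges if len(e & U) >= 2]

def induced_partial(edges, U):
    """H<U>: keep only the edges that lie completely inside U."""
    U = frozenset(U)
    return [(e, w) for e, w in edges if e <= U]

def cut_weight(edges, C):
    """Weight of the cut C: edges touching both C and its complement."""
    C = frozenset(C)
    return sum(w for e, w in edges if e & C and e - C)

def edge_connectivity(vertices, edges):
    """Brute-force edge connectivity: minimum weight over all cuts."""
    vs = sorted(vertices)
    return min(cut_weight(edges, C)
               for r in range(1, len(vs))
               for C in combinations(vs, r))

def contract(edges, partition):
    """H/U: replace each vertex by its class and drop trivial edges."""
    cls = {v: frozenset(block) for block in partition for v in block}
    image = ((frozenset(cls[v] for v in e), w) for e, w in edges)
    return [(e, w) for e, w in image if len(e) >= 2]

# Example hypergraph on the vertices a, b, c, d.
edges = [(frozenset("abc"), 2), (frozenset("cd"), 1), (frozenset("ab"), 3)]
sub = induced_sub(edges, "abd")          # two parallel edges {a, b}
partial = induced_partial(edges, "abd")  # only the original edge {a, b}
lam = edge_connectivity("abcd", edges)   # the cut {d} has weight 1
```

The example shows the difference between the two induced structures: the subhypergraph keeps the truncated edge {a, b, c} ∩ U, whereas the partial hypergraph discards it.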

3 Communities

We want to identify cohesive groups in a hypergraph, i.e. sets of vertices which are in some sense more strongly connected than arbitrary sets. We use the edge connectivity of a subhypergraph as its measure of cohesiveness. Hence a cohesive subgroup containing a specific subset of vertices, or even a subhypergraph G, is a maximal subhypergraph of maximum edge connectivity containing G. More formally, we obtain the following definition.

Definition 1 (Community). Let H = (V, E; w) be a weighted, undirected hypergraph and G = (U; F) a subhypergraph (partial hypergraph) of H. A community C generated by G is a subhypergraph (partial hypergraph) such that
– G ⊆ C,
– λ_C = max {λ_D | G ⊆ D ⊆ H},
– if λ_D = λ_C for a subhypergraph (partial hypergraph) D with G ⊆ D ⊆ H, then D ⊆ C.
The strength str_H(G) of a subhypergraph G is the edge connectivity of its community. The family of all communities of H is denoted by Comm(H).

The following lemmata are consequences of the fact that the edge connectivity of the union of two subhypergraphs with intersecting vertex sets is at least as high as the lower edge connectivity of the two parts (cf. Lemma 7 in Appendix A).

Lemma 1. For every subhypergraph (partial hypergraph) G = (U, F; w) of H = (V, E; w) there exists a uniquely determined community comm_H(G), which is induced by its vertex set.

Lemma 2. Comm(H) is laminar, i.e. for every pair of communities C_1, C_2, either C_1 ∩ C_2 = ∅ or one is contained in the other.
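For very small hypergraphs, Definition 1 can be checked directly by exhaustive search. The following Python sketch does so for the subhypergraph variant, using Lemma 1 to restrict the search to induced subhypergraphs H[U]. The representation (a list of (frozenset, weight) pairs), the function names and the requirement that the seed contains at least two vertices are assumptions of this sketch; it takes exponential time and is not the algorithm of this paper.

```python
from itertools import combinations

def induced_sub(edges, U):
    """Edges of H[U]: e ∩ U for every edge with at least two vertices in U."""
    return [(e & U, w) for e, w in edges if len(e & U) >= 2]

def connectivity(U, edges):
    """Brute-force edge connectivity of a hypergraph on vertex set U (|U| >= 2)."""
    vs = sorted(U)
    return min(
        sum(w for e, w in edges if e & frozenset(C) and e - frozenset(C))
        for r in range(1, len(vs))
        for C in combinations(vs, r)
    )

def community(vertices, edges, seed):
    """Largest induced subhypergraph of maximum edge connectivity containing seed."""
    seed = frozenset(seed)
    rest = sorted(frozenset(vertices) - seed)
    best_u, best_lam = None, -1
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            U = seed | frozenset(extra)
            lam = connectivity(U, induced_sub(edges, U))
            # Prefer higher connectivity; among equals prefer the larger set.
            if lam > best_lam or (lam == best_lam and len(U) > len(best_u)):
                best_u, best_lam = U, lam
    return best_u, best_lam

# A triangle a, b, c with edge weight 2 and a pendant vertex d attached to c.
edges = [(frozenset("ab"), 2), (frozenset("bc"), 2),
         (frozenset("ac"), 2), (frozenset("cd"), 1)]
print(community("abcd", edges, "ab"))  # the triangle, with strength 4
```

In the example the seed {a, b} alone has edge connectivity 2, the triangle has 4 and the whole graph only 1, so the triangle is the community generated by {a, b}.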

In addition to the global notion of community, we are going to use a parametrized version. For an arbitrary subhypergraph G of H = (V, E; w) and for τ with λ_H ≤ τ ≤ str_H(G), the τ-community comm^τ_H(G) of G is the largest community C containing G with λ_C ≥ τ.

In fact, our communities coincide with the k-components introduced by Matula [Mat72]. He defines a ρ-component of an undirected graph as a maximal subgraph of edge connectivity ρ. In our terminology, the τ-community with λ_H ≤ τ ≤ str_H(G) of a subhypergraph G is the minimal ρ-component with ρ ≤ τ containing G.

4 Lax Adjacency Orderings

Lax-Adjacency-Orderings are relaxations of Maximum-Adjacency-Orderings as introduced by Stoer and Wagner [SW97]. They were already used in [Bri05] to decrease the time required for the construction of a minimum cut of an undirected graph, and in [Bri06] in the more general setting of symmetric submodular functions and related concepts (including cuts of hypergraphs).

Definition 2 (Lax Adjacency Ordering). A Lax-Adjacency-Ordering (LA-ordering) with threshold τ > 0 of an undirected hypergraph H = (V, E; w) is a total ordering (v_1, …, v_n) of V such that
w(V_i, v_i) ≥ min {τ, max {w(V_i, v_j) | j ≥ i}}
for i = 2, …, n, where V_i := {v_1, …, v_{i−1}}.

In other words, the i-th vertex in an LA-ordering has maximum adjacency² to its predecessors, or at least adjacency τ. The following lemma describes the most important property of LA-orderings.

Lemma 3 (cf. [Bri06]). Let (v_1, …, v_n) be an LA-ordering with threshold τ > 0 of H = (V, E; w). Then
min {w(V_n, v_n), τ} = min {λ(v_{n−1}, v_n), τ},
and if w(V_i, v_j) ≥ τ, then λ(v_{i−1}, v_j) ≥ τ for j ≥ i ≥ 2.

The application of a modified Fibonacci heap, which identifies all keys ≥ τ, allows the following runtime for the construction of an LA-ordering.

Lemma 4 ([Bri05,Bri06]). The construction of an LA-ordering with threshold τ > 0 for an integer weighted, undirected hypergraph H = (V, E; w) requires O(M + n min{τ, log n}) time.

² The adjacency of v_i is the sum of the weights of all edges connecting v_i to its predecessors.

Due to Lemma 3, every LA-ordering (v_1, …, v_n) of H = (V, E; w) with threshold τ > 0 induces a τ-proper contraction, in which the classes are the maximal high-adjacency sequences, i.e. the maximal sequences v_i, …, v_j such that w(V_k, v_k) ≥ τ for i ≤ k ≤ j. We say G is an LA-contraction of H with threshold τ if there exists an LA-ordering with threshold τ of H such that G is the contraction of H along the induced partition. More generally, we say G is an iterated LA-contraction with threshold τ of H if there exists a sequence H = G_0, G_1, …, G_l = G of hypergraphs such that G_i is an LA-contraction of G_{i−1} with threshold τ for 1 ≤ i ≤ l.

Lemma 5. Let G be a τ-proper contraction of H = (V, E; w). For every pair u, v of vertices with [u] ≠ [v] in G, we have
λ_G([u], [v]) ≥ τ ⇔ λ_H(u, v) ≥ τ.

A direct consequence of the preceding lemma is the following.

Corollary 1. For every undirected hypergraph H = (V, E; w), we have λ_H ≥ τ if and only if the trivial hypergraph with exactly one vertex is an iterated LA-contraction of H with threshold τ.
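A naive construction of an LA-ordering and of the partition it induces can be sketched in Python as follows. The sketch makes several assumptions: hyperedges are (frozenset, weight) pairs, the adjacency of a vertex is taken to be the total weight of the edges containing it and at least one predecessor, and a vertex with adjacency ≥ τ is merged into the class of its immediate predecessor, which is our reading of Lemma 3. The selection runs in quadratic time; the modified Fibonacci heap behind Lemma 4 is not reproduced here.

```python
def adjacency(edges, placed, v):
    """Total weight of the edges containing v and at least one placed vertex."""
    return sum(w for e, w in edges if v in e and e & placed)

def la_ordering(vertices, edges, tau):
    """Return an LA-ordering and the adjacency of each vertex when it was placed."""
    remaining = sorted(vertices)
    order, adj = [remaining.pop(0)], [None]  # the first vertex has no predecessors
    while remaining:
        placed = frozenset(order)
        score = {v: adjacency(edges, placed, v) for v in remaining}
        # Lax rule: any vertex with adjacency >= tau may be chosen next;
        # if there is none, fall back to a vertex of maximum adjacency.
        pick = next((v for v in remaining if score[v] >= tau), None)
        if pick is None:
            pick = max(remaining, key=score.get)
        remaining.remove(pick)
        order.append(pick)
        adj.append(score[pick])
    return order, adj

def induced_partition(order, adj, tau):
    """Start a new class at every vertex whose adjacency is below tau."""
    classes = [[order[0]]]
    for v, a in zip(order[1:], adj[1:]):
        if a >= tau:
            classes[-1].append(v)
        else:
            classes.append([v])
    return classes

# A triangle a, b, c with edge weight 2 and a pendant vertex d attached to c.
edges = [(frozenset("ab"), 2), (frozenset("bc"), 2),
         (frozenset("ac"), 2), (frozenset("cd"), 1)]
order, adj = la_ordering("abcd", edges, tau=3)
print(order, adj)                         # ['a', 'b', 'c', 'd'] [None, 2, 4, 1]
print(induced_partition(order, adj, 3))   # [['a'], ['b', 'c'], ['d']]
```

In the example only c reaches adjacency 4 ≥ τ = 3, so b and c end up in one class. This is consistent with λ(b, c) = 4 in the example graph.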

5 The Algorithm

Assume that G = H/U is a τ-proper contraction of H and that there exists a community C ∈ Comm(H) with λ_C ≥ τ such that the vertices of C lie in at least two classes of U. For an arbitrary LA-ordering ([v_1], …, [v_l]) of G with threshold τ, let [v_j] and [v_i], j < i, be the last two classes containing vertices of C. Then the cut [v_i] separates vertices in C and hence its weight is at least λ_C ≥ τ. By Lemma 3 and Lemma 5, this implies w([v_1] ∪ ⋯ ∪ [v_{i−1}], [v_i]) ≥ τ, since C is completely contained in the subhypergraph (partial hypergraph) induced by all vertices in [v_1] ∪ ⋯ ∪ [v_i]. Therefore we have the following result.

Lemma 6. For τ > 0 let G be a τ-proper contraction of H and ([v_1], …, [v_l]) an LA-ordering with threshold τ of G. If there exists a community C of H with edge connectivity λ_C ≥ τ and two vertices u, v ∈ C with [u] ≠ [v], then there exists a class [v_i] of G with adjacency ≥ τ.

This property of LA-orderings can be used to easily construct all τ-communities for a given τ > 0. Assume that G is an iterated LA-contraction of H with threshold τ and let G′ be an LA-contraction of G with the same threshold. If the τ-communities of H are not contracted into single vertices, we have G′ ≠ G. If G′ = G, then every τ-community has been contracted into a single vertex, and we can restrict our search to the subhypergraphs (partial hypergraphs) induced by the non-trivial classes of G′. If G′ is trivial, i.e. if it consists of exactly one vertex, then by Corollary 1 λ_H ≥ τ and hence H itself is the only τ-community in H. These observations imply that Algorithm 1 constructs the set of all τ-communities of a given hypergraph H = (V, E; w).

Algorithm 1: Construction of all τ-communities.
Input: An undirected, weighted hypergraph H = (V, E; w) and a real τ > 0.
Output: The set of all τ-communities of H.
Data: Two sets S, T of sets of vertices of H

    S := {V}; T := ∅;
    while S ≠ ∅ do
        Remove a set U from S;                 // Extract a candidate set U from S
        Let G be the subhypergraph (partial hypergraph) induced by U;
        repeat                                 // Iteratively contract G along LA-contractions
            G′ := G;
            Let G be an LA-contraction of G′ with threshold τ;
        until G′ = G;
        if G′ is trivial then                  // τ-community found
            T := T ∪ {U}
        else                                   // Add candidates to S for further examination
            forall non-trivial classes [v] of G′ do
                S := S ∪ {[v]};
        end
    end
    return T
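The control flow of Algorithm 1 can be sketched in Python for the subhypergraph variant. This is a minimal illustration under our own assumptions, not the implementation used for the experiments: hyperedges are (frozenset, weight) pairs, classes of a contraction are frozensets of original vertices, all function names are hypothetical, and each LA-contraction is computed naively in quadratic time.

```python
def induced_sub(edges, U):
    """Edges of H[U]: e ∩ U for every edge with at least two vertices in U."""
    return [(e & U, w) for e, w in edges if len(e & U) >= 2]

def contract(edges, classes):
    """Edges of the contraction along `classes`, dropping trivial edges."""
    cls = {v: c for c in classes for v in c}
    image = ((frozenset(cls[v] for v in e), w) for e, w in edges)
    return [(e, w) for e, w in image if len(e) >= 2]

def la_partition(classes, g_edges, tau):
    """One LA-contraction step: order the classes and merge a class into its
    predecessor's class whenever its adjacency is at least tau."""
    remaining = sorted(classes, key=sorted)
    first = remaining.pop(0)
    placed, merged = {first}, [set(first)]
    while remaining:
        score = {c: sum(w for e, w in g_edges if c in e and e & placed)
                 for c in remaining}
        pick = next((c for c in remaining if score[c] >= tau), None)
        if pick is None:
            pick = max(remaining, key=score.get)
        remaining.remove(pick)
        if score[pick] >= tau:
            merged[-1] |= pick
        else:
            merged.append(set(pick))
        placed.add(pick)
    return [frozenset(m) for m in merged]

def tau_communities(vertices, edges, tau):
    S, T = [frozenset(vertices)], []
    while S:
        U = S.pop()
        h_edges = induced_sub(edges, U)
        classes = [frozenset([v]) for v in U]
        while True:  # iterate LA-contractions until nothing changes
            merged = la_partition(classes, contract(h_edges, classes), tau)
            if len(merged) == len(classes):
                break
            classes = merged
        if len(classes) == 1:
            T.append(U)  # trivial contraction: U is a tau-community
        else:
            S.extend(c for c in classes if len(c) >= 2)  # new candidates
    return T

# A triangle a, b, c with edge weight 2 and a pendant vertex d attached to c.
edges = [(frozenset("ab"), 2), (frozenset("bc"), 2),
         (frozenset("ac"), 2), (frozenset("cd"), 1)]
print(tau_communities("abcd", edges, 3))  # [frozenset({'a', 'b', 'c'})]
```

For τ = 3 the whole graph contracts to the two classes {a, b, c} and {d}. The non-trivial class is examined again and contracts to a single vertex, so the triangle is reported as the only 3-community.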

Now assume that the set T of τ-communities of H is known. Since Comm(H) is laminar, every ρ-community of H with ρ > τ is a subhypergraph of a τ-community. Hence, the set of all ρ-communities can be obtained by constructing all ρ-communities in the (known) τ-communities of H. This observation implies a simple algorithm for the construction of all communities if the edge weights are integers. We start with τ = λ_H, which may be obtained via the algorithms described in [KW95,Bri05,Bri06], and determine all τ-communities. Then we increase τ step by step and construct all τ-communities in the communities found in the preceding round. The algorithm terminates if no community was found in the last round.

An alternative is a kind of binary search. Observe that the edge connectivity of every community lies between λ_H and the maximum weight incident to a single vertex. Now assume that for a given hypergraph H it is known that all communities have edge connectivities in the interval [δ, ∆]. Set τ = ⌊(δ + ∆)/2⌋ and construct all τ-communities in H. Then all communities with edge connectivities in [τ, ∆] are subhypergraphs of τ-communities. On the other hand, all communities with edge connectivities in [δ, τ] still are communities in G, which is obtained from H by contracting each τ-community to a single vertex. Hence, every τ-community of H contains only communities with edge connectivities in [τ, ∆], and G only those with edge connectivities in [δ, τ]. This leads to Algorithm 2.

Following the arguments in [Bri05], one can prove that the construction of an LA-ordering requires O(n log n + M) time. Since we can only guarantee one contraction per round (except for the last), Algorithm 1 requires at most n rounds, leading to a worst-case runtime of O(n² log n + nM). Since Algorithm 2 uses Algorithm 1 as a subroutine, its total runtime is at most O(W n² log n + W nM), with W the maximal weight incident to a single vertex.

Algorithm 2: Communities (construction of all communities).
Input: An undirected, integer weighted hypergraph H = (V, E; w) and integers 0 ≤ δ ≤ ∆.
Output: The set of all communities in H with edge connectivity in [δ, ∆].

    τ := ⌊(δ + ∆)/2⌋;
    Determine the set R of all τ-communities of H using Algorithm 1;
    if δ = ∆ then
        return R
    else
        G′ := Contract all τ-communities in H;
        S′ := Communities(G′, δ, τ);
        S := replace every set in S′ by the set of represented, original vertices;
        T := ∅;
        forall U ∈ R do
            T := T ∪ Communities(H[U], τ + 1, ∆);
        end
        return R ∪ S ∪ T
    end

If the edge weights are not integers, we can use Algorithm 2 to determine the τ-communities for a finite number of discrete values of τ. Afterwards, the gaps between these values can be 'filled' using the adaptation of the algorithm described in [Bri03] to hypergraphs.

6 Applications

Now we are going to examine the communities obtained in two applications, the segmentation of images and the structuring of sets of documents. We will see that in both situations the constructed hierarchies indeed have some meaning regarding the applications.

6.1 Image Segmentation

To find segments of an image, we have to construct a graph representing it. In [SM00] this is done by interpreting each pixel as a vertex and connecting adjacent pixels by weighted edges. The weight of an edge is calculated from the color values of the two pixels. For an arbitrary pixel v let F(v) be a tuple of values assigned to it, e.g. the red, green and blue components. Then the weight w(u, v) of an edge (u, v) between adjacent pixels is

$$w(u, v) = e^{-\frac{\|F(u) - F(v)\|}{\sigma}}$$

with a constant σ > 0. Common examples are
1. F(v) = [r, g, b], with r, g and b the red, green and blue components of the color of pixel v, or
2. F(v) = (b, b · s · sin(h), b · s · cos(h)), with h, s and b the hue, saturation and brightness of the color of pixel v.
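The construction of the pixel graph can be sketched in Python as follows. Several details are assumptions of this sketch and are not fixed by the text: the norm is taken to be Euclidean, 'adjacent' is read as the 4-neighbourhood, the hue h is given in radians, and the function names are hypothetical. The optional rescaling to integers uses the range 0 to 10000 reported for the experiments.

```python
import math

def edge_weight(fu, fv, sigma):
    """w(u, v) = exp(-||F(u) - F(v)|| / sigma), Euclidean norm assumed."""
    return math.exp(-math.dist(fu, fv) / sigma)

def hsb_features(h, s, b):
    """Second feature choice: (b, b*s*sin(h), b*s*cos(h)), h in radians."""
    return (b, b * s * math.sin(h), b * s * math.cos(h))

def grid_edges(image, sigma, scale=10000):
    """Weighted edges between horizontally and vertically adjacent pixels.

    `image` is a list of rows of feature tuples F(v); the weights are
    rounded to integers in the range 0..scale.
    """
    edges = {}
    rows, cols = len(image), len(image[0])
    for y in range(rows):
        for x in range(cols):
            for dy, dx in ((0, 1), (1, 0)):
                ny, nx = y + dy, x + dx
                if ny < rows and nx < cols:
                    w = edge_weight(image[y][x], image[ny][nx], sigma)
                    edges[((y, x), (ny, nx))] = round(scale * w)
    return edges

# Two rows of two pixels each; the rows differ in their first component.
image = [[(0, 0, 0), (0, 0, 0)],
         [(1, 0, 0), (1, 0, 0)]]
edges = grid_edges(image, sigma=1.0)
print(edges[((0, 0), (0, 1))])  # identical pixels: maximum weight 10000
print(edges[((0, 0), (1, 0))])  # distance 1: round(10000 * exp(-1)) = 3679
```

Pixels with identical features receive the maximum weight, and the weight decays exponentially with the feature distance, so edges crossing a sharp border become cheap to cut.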

Fig. 1. The dominating communities of a picture with simple shapes and a ‘photograph’ (arranged horizontally).

Since the weight of each edge is a value between 0 and 1, we can easily convert the weights to integers in an arbitrary range.³ This discretization allows faster algorithms and, as the experiments show, does not seem to affect the quality of the image segments found.

The left image in Fig. 1 is a test image containing some simple shapes in different colors. Hence, the borders of the shapes are well-defined. Therefore, the edges between nodes corresponding to adjacent pixels in different shapes have a low weight, whereas adjacent pixels in the same shape lead to the maximum weight of 1, because they have the same F(v)-values. As a consequence, the algorithm always calculates the same (perfect) segments for arbitrary values of σ.

In the right image in Fig. 1 the borders are not as clear, because the color transitions are softer. Consequently, the graph structure depends highly on the value of σ. It turns out that higher values for σ result in better segmentations. For lower values the segments are smaller and only few of them can be further segmented. For σ = 0.01 the main segments are shown in Fig. 1. There are many more segments (communities) that consist of only a few pixels, or add only a few pixels to their substructure. In fact, only about 5% of the segments are of significant size. But by simple postprocessing, communities with a small number of direct children may be removed from the resulting tree.

³ For our experiments it was the range between 0 and 10000.

6.2 Word Document Relations

Let W be a set of words. We interpret a document (or text) d over W as a subset of W. This directly allows the definition of an unweighted, undirected hypergraph W = (W, D) whose vertices are the words and whose edges are the documents. But we are going to use the dual hypergraph D = (D, E_W), whose vertices are the documents and whose edges are the sets e_w := {d ∈ D | w ∈ d}, one for each word w ∈ W.
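The dual hypergraph of a document collection is straightforward to build. The following Python sketch assumes that the documents are given as a dictionary mapping hypothetical document identifiers to their word sets. Words occurring in only one document are dropped, because hyperedges must contain at least two vertices by the definition in Section 2.

```python
def dual_hypergraph(docs):
    """Map each word w to its hyperedge e_w = {d | w in d}.

    `docs` maps a document identifier to the set of words of the document.
    """
    edges = {}
    for d, words in docs.items():
        for w in words:
            edges.setdefault(w, set()).add(d)
    # Hyperedges need at least two vertices (documents).
    return {w: frozenset(ds) for w, ds in edges.items() if len(ds) >= 2}

docs = {
    "d1": {"coffee", "quota", "export"},
    "d2": {"coffee", "export", "price"},
    "d3": {"bank", "rate", "price"},
}
for word, edge in sorted(dual_hypergraph(docs).items()):
    print(word, sorted(edge))
```

In this toy collection only 'coffee', 'export' and 'price' occur in two documents each, so the dual hypergraph has three hyperedges on the vertices d1, d2 and d3.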

Dataset    ‖E‖      # classes  Runtime                   Purity            FScore
                               CS       CP       C       CS    CP    C     CS    CP    C
cacmcisi   63191    2          2.4      1.7      1.68    0.95  0.85  0.86  0.83  0.73  0.75
re0        71997    13         2.70     1.09     2.2     0.85  0.45  0.62  0.38  0.38  0.4
re1        82163    25         3.85     2.4735   9.01    0.8   0.23  0.61  0.21  0.2   0.28
cranmed    100072   2          7.71     4.46     13.10   0.92  0.58  0.9   0.68  0.68  0.7
tr41       159212   10         2.99     1.48     4.61    0.8   0.29  0.63  0.31  0.29  0.38
classic    160207   4          4.33     4.50     6.38    0.74  0.72  0.87  0.55  0.49  0.53
wap        174230   20         14.62    5.42     17.30   0.91  0.22  0.63  0.24  0.18  0.35
tr31       234121   7          4.58     2.47     6.62    0.87  0.46  0.72  0.41  0.42  0.47
k1a        257783   20         19.19    8.81     29.35   0.89  0.22  0.61  0.23  0.17  0.34
k1b        257783   6          19.13    8.81     29.23   0.93  0.59  0.89  0.57  0.54  0.68
fbis       273732   17         21.89    6.66     21.19   0.91  0.21  0.55  0.22  0.19  0.27
hitech     296678   6          15.71    8.95     22.43   0.9   0.27  0.55  0.34  0.33  0.35
reuters    344325   80         5.88     3.96     11.90   0.53  0.55  0.67  0.36  0.44  0.46
ohscal     387417   10         20.91    21.37    59.46   0.4   0.15  0.42  0.21  0.19  0.22
sports     566055   7          31.60    22.18    51.06   0.68  0.51  0.75  0.42  0.47  0.5
reviews    606857   5          13.64    10.59    27.86   0.67  0.35  0.64  0.42  0.41  0.48
la12       617771   6          20.74    21.58    44.74   0.69  0.38  0.61  0.33  0.35  0.36
tui        869250   63         61       59       58      0.64  0.64  0.78  0.48  0.48  0.51
new3       941715   44         36.60    31.14    61.59   0.4   0.18  0.41  0.08  0.11  0.17
nsf        1566640  168        117.49   68.07    848.35  0.21  0.31  0.43  0.1   0.13  0.17
average                        -                         0.73  0.41  0.66  0.37  0.36  0.42

Table 1. The runtime, purity and FScore obtained by computing the communities on subhypergraphs (CS), the communities on partial hypergraphs (CP) and the processed communities (C). (The experiments ran on a 3.2GHz Intel Xeon system with 4GB memory.)

In addition we assign a weight c(w) to each word. We use the common term weighting scheme tf.idf_{w,d} to measure the importance of word w in some document d. Then we set the weight c(w) of word w to

$$c(w) = \frac{\sum_{d \in e_w} \mathrm{tf.idf}_{w,d}}{|e_w|}.$$

Furthermore, we remove words occurring in less than 0.5% or more than 10% of the documents.

To improve the quality of the resulting structure, we use several pre- and postprocessing steps. The preprocessing first cuts off vertices of low degree, which afterwards are assigned to the communities of their neighbors. The postprocessing intends to find further refinements of the initially found communities based on partial hypergraphs. Basically, these constructions are of the type: find communities in subhypergraphs obtained from the communities of the original hypergraph by adding partial edges and/or removing or contracting subcommunities. Details and further explanations about the algorithms and results can be found in [Wer07].

To evaluate the performance of our algorithms we use a collection of about 17 different datasets⁴ from different sources, composed by Karypis and Zhao [ZK01]. In addition we use the abstracts of the National Science Foundation (nsf) and the web graph of the Technische Universität Ilmenau (tui). We further use an enhanced version of the Reuters newswire dataset (reu). Each document of these datasets is assigned to a unique category. For example, the reu dataset contains classes like earn (stock/dividend ticker), acq (company/corporation messages), money-fx (banks and reserves), crude (raw materials), grain (agriculture), interest, sugar and coffee, ordered by size.

We defined communities using partial hypergraphs and subhypergraphs. Table 1 lists the runtime required for the computation of the communities based on subhypergraphs (CS), on partial hypergraphs (CP), and using pre- and postprocessing as indicated above (C). Furthermore, Table 1 contains the Purity [ZK01] and FScore [LA99] achieved by the constructed hierarchies.

We assume a reference categorization κ : D → {1, …, g}, which maps documents to categories. Let C_l be a community without its subcommunities. Then we define the purity P(C_l) as the fraction

$$P(C_l) = \frac{1}{n_l} \cdot \max_h n_l^{(h)},$$

where n_l is the cardinality of C_l and n_l^{(h)} is the number of documents of C_l in category h. The total weighted purity is then defined by

$$P(C) = \sum_{l=1}^{k} \frac{n_l}{n} \cdot P(C_l),$$

where n is the total number of documents. The FScore of category h and community C_l is given by

$$F(\kappa_h, C_l) = \frac{2 \cdot \left(\frac{n_l^{(h)}}{n_h}\right) \cdot \left(\frac{n_l^{(h)}}{n_l}\right)}{\left(\frac{n_l^{(h)}}{n_h}\right) + \left(\frac{n_l^{(h)}}{n_l}\right)},$$

where n_h denotes the number of documents in category h. Then the FScore of the category h is that of the best scoring community, i.e. F(κ_h) = max_{C_l ∈ C} F(κ_h, C_l). The overall FScore is the weighted sum of the individual category FScores:

$$F = \sum_{h=1}^{g} \frac{n_h}{n} \cdot F(\kappa_h).$$

⁴ http://glaros.dtc.umn.edu/gkhome/fetch/sw/cluto/datasets.tar.gz
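The two quality measures can be computed with a few lines of Python. The sketch below simplifies the setting and should be read as an illustration only: the communities are passed as a flat list of non-empty document lists that together cover every document exactly once, whereas in a hierarchy the purity uses communities without their subcommunities. The function and variable names are hypothetical.

```python
from collections import Counter

def purity(communities, category):
    """Total weighted purity: sum over l of (n_l / n) * P(C_l)."""
    n = sum(len(c) for c in communities)
    total = 0
    for c in communities:
        counts = Counter(category[d] for d in c)
        total += max(counts.values())  # equals n_l * P(C_l)
    return total / n

def fscore(communities, category):
    """Overall FScore: sum over h of (n_h / n) * max over l of F(kappa_h, C_l)."""
    n_h = Counter(category.values())
    n = sum(n_h.values())
    overall = 0.0
    for h, nh in n_h.items():
        best = 0.0
        for c in communities:
            hits = sum(1 for d in c if category[d] == h)  # n_l^(h)
            if hits:
                recall, precision = hits / nh, hits / len(c)
                best = max(best, 2 * recall * precision / (recall + precision))
        overall += nh / n * best
    return overall

category = {1: "earn", 2: "earn", 3: "acq", 4: "acq", 5: "acq"}
communities = [[1, 2, 3], [4, 5]]
print(purity(communities, category))
print(fscore(communities, category))
```

In the example the first community contains both earn documents and one acq document, so its purity is 2/3, while the second community is pure. The weighted purity is therefore (2 + 2)/5 = 0.8.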

Since the purity is intended for flat partitionings and not for hierarchies, its use in this context is problematic. In general, communities closer to the root may be expected to have a smaller purity than those far from the root. Nonetheless, since purity values for partitions are given in the literature, it was used in [Wer07] to allow at least a rough comparison. As Table 1 shows, the quality of the resulting structure heavily depends on which type of subhypergraph is used. The structure of CS consists of many small communities, whereas the structure of CP contains fewer but larger communities. In fact, we observed that at higher levels partial hypergraphs seem to be better than subhypergraphs, but on lower levels the partial edges of subhypergraphs become increasingly important. As a consequence, we try to find finer communities by enriching the initial communities on partial hypergraphs. The resulting structure C is a compromise between the size of communities and their accuracy.

[Figure: tree of refined communities; each node is labeled with its most frequent words and each edge with a document count.]

Fig. 2. The dominating refined communities of the reuters dataset.

Since the algorithm and data structures are designed to handle very large graphs, we observe good scalability with respect to the size of the hypergraph. For example, it was possible to compute the structure of the German Wikipedia hypergraph, which currently contains approximately 500,000 documents and 147,000 words, in less than 40 minutes.

Since quality metrics can only give a rough impression of the structure, we examine the hierarchy of the reuters dataset in more detail. Figure 2 is a tree of refined communities C, where each community is represented by the set of its most frequent words. The edges are labeled with the number of documents the subtree contains. Obviously, the hypergraph is split into two main communities. The left tree contains stock ticker messages, in which the abbreviations vs, cts and shr occur most frequently. Nearly all documents are of class earn. The community is then split into three subcommunities containing different kinds of documents, for example about quarterly (qtl) cash dividends (div), or about shares (shrs). The other branch of the hierarchy contains documents of the remaining classes. In the upper part, we find communications about companies and corporations of class acq. The lower part is split into communities about reserves and central banks and an intuitive categorization of merchandise intended for export. Such communities are about coffee, grain, cocoa and sugar.

Fig. 3. The complete community tree of the reuters dataset. Each vertex is colored by the color of its category (for a colored version see http://www.tuilmenau.de/fakia/mbrinkme communities.html).

Figure 3 depicts the community tree, in which each document has a color indicating its class. As this image indicates, many communities have high purity (i.e. are nearly monochrome). But as a closer examination reveals, they often have a common topic, which is hidden at first view. For example, the siblings of community money-fx mainly consist of documents about banking, but the left one contains articles about the Japanese economy.

7 Conclusion

We described a graph theoretical definition of dense or cohesive groups in undirected graphs and hypergraphs, based on edge connectivity. These communities can be constructed by an algorithm with a seemingly high asymptotic runtime. But, as experiments indicate, the average runtime of the algorithm is much lower and allows us to construct the cohesive groups in reasonable time on large graphs. Furthermore, experiments showed that the resulting structure and refinements of it provide a good hierarchical segmentation or clustering of several types of data represented as graphs, like images and collections of documents.

References

[Bot93] Rodrigo A. Botafogo. Cluster analysis for hypertext systems. In Proc. of the 16th Annual ACM SIGIR Conference of Res. and Dev. in Info. Retrieval, pages 116–125, 1993.
[Bri03] Michael Brinkmeier. Communities in graphs. In Thomas Böhme, Gerhard Heyer, and Herwig Unger, editors, IICS, volume 2877 of Lecture Notes in Computer Science, pages 20–35. Springer, 2003.
[Bri05] Michael Brinkmeier. A simple and fast min-cut algorithm. In Maciej Liskiewicz and Rüdiger Reischuk, editors, FCT, volume 3623 of Lecture Notes in Computer Science, pages 317–328. Springer, 2005.
[Bri06] Michael Brinkmeier. Minimizing symmetric set functions faster. Technical Report, Technical University Ilmenau, Germany, 2006.
[BS91] Rodrigo A. Botafogo and Ben Shneiderman. Identifying aggregates in hypertext structures. In UK Conference on Hypertext, pages 63–74, 1991.
[FLG00] Gary Flake, Steve Lawrence, and C. Lee Giles. Efficient identification of web communities. In Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 150–160, Boston, MA, August 20–23 2000.
[FLGC02] Gary William Flake, Steve Lawrence, C. Lee Giles, and Frans Coetzee. Self-organization of the web and identification of communities. IEEE Computer, 35(3):66–71, 2002.
[Gae04] Marco Gaertler. Clustering. In Ulrik Brandes and Thomas Erlebach, editors, Network Analysis, volume 3418 of Lecture Notes in Computer Science, pages 178–215. Springer, 2004.
[GWFT04] Gary William Flake, R. E. Tarjan, and K. Tsioutsiouliklis. Graph clustering and minimum cut trees. Internet Mathematics, 1(3):355–378, 2004.
[KW95] Regina Klimmek and Frank Wagner. A simple hypergraph min cut algorithm, 1995.
[LA99] Bjornar Larsen and Chinatsu Aone. Fast and effective text mining using linear-time document clustering. In KDD '99: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 16–22, New York, NY, USA, 1999. ACM Press.
[LS69] Fabrizio Luccio and Mariagiovanna Sami. On the decomposition of networks in minimally interconnected subnetworks. IEEE Transactions on Circuit Theory, 16(2):184–188, 1969.
[Mat72] David W. Matula. k-components, clusters, and slicings in graphs. SIAM J. Appl. Math., 22(3):459–480, 1972.
[Sei83] Stephen B. Seidman. Internal cohesion of ls sets in graphs. Social Networks, 5:97–107, 1983.
[SM00] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[SW97] Mechtild Stoer and Frank Wagner. A simple min-cut algorithm. Journal of the ACM, 44(4):585–591, July 1997.
[Wer07] Jeremias Werner. Kohesive gruppen in hypergraphen. Diplomarbeit, Technische Universität Ilmenau, Germany, 2007.
[WF94] Stanley Wasserman and Katherine Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.
[ZK01] Y. Zhao and G. Karypis. Criterion functions for document clustering: Experiments and analysis, 2001.

A Unions of subhypergraphs and the Overlap Lemma

Let G_1 and G_2 be two subhypergraphs of H = (V, E; w), i.e. V(G_i) ⊆ V for i = 1, 2, and there exist injective maps ι_i : E(G_i) → E such that f ⊆ ι_i(f). The union G = G_1 ∪ G_2 of these two subhypergraphs is defined by V(G) := V(G_1) ∪ V(G_2) and E(G) := {f_1 ∪ f_2 | f_i ∈ E(G_i) and ι_1(f_1) = ι_2(f_2)}. Obviously, every G_i is a subhypergraph of the union G, and G is a subhypergraph of H.

Lemma 7. Let G_1 and G_2 be two subhypergraphs of H = (V, E; w) such that V(G_1) ∩ V(G_2) ≠ ∅. Then λ_G ≥ min{λ_{G_1}, λ_{G_2}}, where G is the union of G_1 and G_2.

Proof. Assume that C is a cut of G. Since the vertex sets of G_1 and G_2 intersect, C cuts at least one of these hypergraphs. Hence we have w_{G_i}(C) ≥ λ_{G_i} for at least one i ∈ {1, 2}. Since every edge cut by C also appears in G, this implies w_G(C) ≥ λ_{G_i}. Consequently, we have λ_G ≥ min{λ_{G_1}, λ_{G_2}}.

B The Proof of Lemma 1

Since G itself is a subhypergraph, a community obviously exists. Now assume that there exist two communities C_1 and C_2 of the subhypergraph G. Let C be the union of C_1 and C_2. Then, by Lemma 7, we have λ_C ≥ min{λ_{C_1}, λ_{C_2}} and hence C ⊆ C_1, C_2, implying C_1 = C = C_2. Hence, the community is unique.

C The Proof of Lemma 2

Assume that two communities C_1 = comm(G_1) and C_2 = comm(G_2) overlap and that λ_{C_1} ≤ λ_{C_2}. Then, due to the definition of communities, their union C has the same edge connectivity as C_1, implying C ⊆ C_1 and hence C = C_1, implying C_2 ⊆ C_1.

D The Proof of Lemma 5

Obviously, every cut of G which separates [u] and [v] induces a cut of H separating u and v. Hence, λ_H(u, v) ≥ τ implies λ_G([u], [v]) ≥ τ. Now assume λ_G([u], [v]) ≥ τ, but λ_H(u, v) < τ. Then there exists a cut C of H separating u and v with w(C) < τ. Since G is a τ-proper contraction of H, the cut C cannot separate two vertices w and w′ with [w] = [w′]. Hence every class [w] of G lies completely either in C or in its complement, and therefore C induces a cut of G with the same weight, implying λ_G([u], [v]) < τ. But this contradicts our assumption.