
Community Detection Algorithms: a comparative evaluation on artificial and real-world networks D.Phil student report

Department of Engineering Science Department of Zoology

PSORAKIS IOANNIS

supervisors: Prof Stephen Roberts, Prof Ben Sheldon University of Oxford October 21, 2010


Abstract

In this report we provide a brief ‘research diary’ of our work on community detection. We present an overview of widely adopted algorithms along with a comparative analysis against real-world and computer-generated networks. Most importantly, among these methods we present a novel probabilistic community detection algorithm that provides state-of-the-art results with minimal computational overhead.

1 Introduction

The network paradigm provides a formal way of representing data whose associations are of the utmost importance for understanding the phenomenon under study. Many systems in nature consist of interconnected entities [1][2]; the behaviour of each of those entities at an individual level determines (more or less obviously) the function of the whole system at large scale. While networks have already been studied in a rigorous mathematical framework in previous centuries (from the foundations of Graph Theory in the 18th century to the combinatorial analysis problems of the 20th century) [1][2][3], modern advances in data storage and manipulation technologies have allowed a massive resurgence in the study of networks. The plethora of different problems, ranging from social interactions to neuron connectivities, leads to an interdisciplinary approach, using tools from Statistical Mechanics and Computer Science to Behavioural Sciences and Sociology [2].

According to Mark Newman, real-world networks can be classified into four categories [1]: (a) social networks that represent the pattern of interactions between individuals (from human acquaintances to instances of animal behavioural traits), (b) information networks that reflect a knowledge dependency model (for example the journal citation network), (c) technological networks that represent the distribution of resources (from the electricity grid to airline routes) and finally (d) biological networks that capture the structure of systems such as the metabolic pathways or protein interactions.

Social networks have been a major area of research in recent years [4]. Human social networks are now an important aspect of people’s everyday interactions, and the ever-increasing connectivity amongst individuals has led to a major interest in the form and function of these networks. Additionally, advances in sensor technology have facilitated the collection of zoological field data, where animal interactions are now being observed and evaluated at a larger scale. A major aspect of social network research is the study of community structure, i.e. the form and function of network parts or modules with ‘hot-spots’ of heightened connectivity. Indeed, a significant research effort is invested in the development of methodologies for community detection that consist of:
• evaluating whether the given network has a modular, non-random structure.
• identifying the different partitions (groups of individuals) that the network consists of.
• evaluating the function and roles of these modules.

In the present work, our intention is to provide a preliminary analysis of social networks and, more specifically, to present an evaluation of different modern community detection algorithms against real and artificial datasets. Among these models, we also present the performance of a novel probabilistic community detection algorithm that provides extremely competitive performance without suffering from some important weaknesses of the standard approaches.

This report is organized as follows: initially we provide a short description of the Graph Theory notions we will use throughout the report, along with the necessary notation. Then we discuss the notion of community structure and how to quantify it in a mathematical sense. Then we provide a brief overview of the community detection methodologies we have implemented and used for our experiments, along with the results for artificial and real-world datasets. Based on the outcomes of our experiments, we conclude by discussing ideas and challenges for future work.

2 Theoretical Background

In this section we provide an overview of the network theory notions we utilize for our study, along with some critical discussion of the basic approaches to community detection and social network analysis. We start from a purely mathematical level, describing the main notions and notation, and proceed by expanding our view to the interdisciplinary problem of defining and assessing the community structure of a network. Main ideas such as modularity are introduced.

2.1 Graph Theory notions and notation

A graph is the abstract representation of a set V of N entities along with their corresponding M connections D; each individual or node or vertex $\{n_i\}_{i=1}^{N} \in V$ is linked to a subset of V via a collection of M edges $\{l_i\}_{i=1}^{M} \in D$. An example graph is shown in Fig. 1. Each graph G = {V, D} can be conveniently represented in the form of an N × N adjacency matrix A, where $A_{ij} \neq 0$ means that i points to j. In the simplest of cases, we are only interested in the presence or absence of a connection between two entities, so we have an unweighted graph with $A_{ij} \in \{0, 1\}$. We follow the convention that $A_{ii} = 0$. In the more general case of a weighted network, $A_{ij}$ takes values that reflect the strength with which i is connected to j. If A is symmetric, then we have an undirected graph, and vice versa. For the purposes of the present work, we are not interested in other graph types such as bipartite graphs, hypergraphs, etc.

Figure 1: A sample graph with eight vertices and ten edges [1].

For real-world networks, the presence of an edge between a pair of nodes may represent dependence, similarity, direct reachability, commodity flow, causality or any other concept, depending on the problem context and modelling assumptions. In the opposite case, where the presence of an edge between any pair of nodes is a random variable with $P(A_{ij} = 1) = p$, we have an Erdős-Rényi (ER) random graph. ER random graphs have been studied in detail in previous years (see [1] for an overview) and are widely used to evaluate the significance of structures inferred from real-world networks [6][13].
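As an illustration of the notation above, the following minimal sketch builds the adjacency matrix of an undirected, unweighted ER random graph. The function names `erdos_renyi_adjacency` and `count_edges` are hypothetical helpers introduced here for illustration; they are not part of the original MATLAB implementation.

```python
import random


def erdos_renyi_adjacency(n_nodes, p, seed=None):
    """Return a symmetric 0/1 adjacency matrix (list of lists) of an ER graph.

    Each unordered pair (i, j), i != j, is linked independently with
    probability p, i.e. P(A_ij = 1) = p. The diagonal stays 0 (A_ii = 0).
    """
    rng = random.Random(seed)
    A = [[0] * n_nodes for _ in range(n_nodes)]
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):  # upper triangle only
            if rng.random() < p:
                A[i][j] = A[j][i] = 1    # mirror to keep A symmetric
    return A


def count_edges(A):
    """Number of undirected edges M; each edge appears twice in a symmetric A."""
    return sum(sum(row) for row in A) // 2


A = erdos_renyi_adjacency(8, 0.3, seed=0)
M = count_edges(A)
```

Storing A as a dense list of lists is only reasonable for small graphs such as the ones in this report; a sparse representation would be the natural choice for larger networks.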

2.2 Network properties

The study of networks typically consists of looking at different scales of the graph in order to make inferences about the form and function of its modules. Thus, we define sets of properties of individual vertices of the graph along with properties that characterize the network as a whole [1].

Local properties [5] are defined by the topology of the network in the local neighborhood of a single node. Some of them are the degree of a vertex i, that is the number of adjacent vertices; the clustering coefficient, which measures the probability that two vertices adjacent to i are also connected to each other; the node centrality, which measures how many shortest paths in the network pass through that individual vertex; etc. Naturally, in real-world networks nodes can have other properties; for example, in an animal social network a node can have features such as age, gender, species, etc., but these are not taken into account, at least during the first stages of the analysis.

Global properties [5] are defined from large-scale statistical properties of the network under study. One of the most important is the degree distribution, that is the probability that a node i will have degree $k_i$. It has been widely acknowledged that the degree distribution accounts for many of the network's functions, such as the way it is structured into communities. For the majority of real-world networks, the degree distribution follows a power law [2][5], i.e. the fraction of vertices with degree k decays as a power of k, $P(k) \propto k^{-\alpha}$ with $\alpha > 0$, rather than exponentially. Other global properties are the weight distribution for weighted networks, the global clustering coefficient, which is the average of the node clustering coefficients, the average shortest path, which is the average geodesic distance between any pair of nodes, and the degree correlation, which is the probability that nodes with degree k are connected to nodes with degree k'. The latter characterizes networks as assortative (nodes with the same degree tend to be connected) or disassortative (the inverse).

Mesoscopic properties. These characteristics lie on a scale between the global and the local perspective of the network. They describe the community or modular structure of the network and will be discussed in the following section.
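The local properties described in this section can be computed directly from the adjacency matrix. The sketch below is only an illustration on a hypothetical four-node graph (a triangle with one pendant node); the helper names are assumptions and are not taken from the original study.

```python
def degrees(A):
    """Degree of every vertex: the number of non-zero entries in its row."""
    return [sum(1 for a in row if a != 0) for row in A]


def local_clustering(A, i):
    """Fraction of pairs of neighbours of i that are themselves connected."""
    nbrs = [j for j, a in enumerate(A[i]) if a != 0]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention assumed here: undefined cases count as 0
    links = sum(
        1
        for x in range(k)
        for y in range(x + 1, k)
        if A[nbrs[x]][nbrs[y]] != 0
    )
    return 2.0 * links / (k * (k - 1))


def global_clustering(A):
    """Average of the node clustering coefficients, as defined in the text."""
    return sum(local_clustering(A, i) for i in range(len(A))) / len(A)


# Hypothetical example: triangle 0-1-2 plus a pendant node 3 attached to 2.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
node_degrees = degrees(A)
c_node2 = local_clustering(A, 2)
c_global = global_clustering(A)
```

In this example node 2 has three neighbours of which only one pair (0, 1) is connected, so its clustering coefficient is 1/3.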

2.3 Community structure in networks

Although the idea that a network can be partitioned into groups that share some qualitative characteristic is very intuitive, it has been widely acknowledged that it can be quite difficult to formalize it in a mathematical way [3][2]. That is because the idea of a community (or module or group) is itself not very well defined in a context-independent way. Nevertheless, according to a common definition in the literature, a community is a subset $g_c \subset V$ of nodes that are more densely connected to each other than to the rest of the network. Thus, given a single community $g_c$ in the network, and based on [13], that is:

$$\frac{2 m_{in}}{n(n-1)} > \frac{2M}{N(N-1)} > \frac{m_{out}}{n(N-n)} \qquad (1)$$

where $m_{in}$ is the number of edges connecting nodes belonging to $g_c$, n is the size (or cardinality) of $g_c$, and $m_{out}$ is the number of links connecting a node in $g_c$ with a node in the rest of the network. The denominators represent the total number of possible connections between nodes of the same community, between nodes of the whole network, and between nodes of the community and nodes outside it, respectively. Equation (1) states that the proportion of intra-community links is larger than that of the whole network. Additionally, we expect the fraction of inter-community links to be lower than the other two.


Therefore, community structure in the network consists of ‘hot-spots’ of increased connectivity, as seen in Fig. 2.
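The condition of Eq. (1) can be checked numerically for any candidate node subset. The following is a minimal sketch under the assumption of an undirected, unweighted graph stored as a symmetric adjacency matrix; the toy graph (two triangles joined by one edge) and the function names are hypothetical.

```python
def community_densities(A, members):
    """Return the three link densities compared in Eq. (1).

    intra   = 2 m_in / (n (n - 1))   density inside the candidate community
    overall = 2 M / (N (N - 1))      density of the whole network
    inter   = m_out / (n (N - n))    density between community and the rest
    Assumes 1 < n < N so that no denominator vanishes.
    """
    N = len(A)
    inside = set(members)
    n = len(inside)
    M = sum(1 for i in range(N) for j in range(i + 1, N) if A[i][j] != 0)
    m_in = sum(1 for i in inside for j in inside if i < j and A[i][j] != 0)
    m_out = sum(
        1 for i in inside for j in range(N) if j not in inside and A[i][j] != 0
    )
    return (
        2.0 * m_in / (n * (n - 1)),
        2.0 * M / (N * (N - 1)),
        m_out / (n * (N - n)),
    )


def satisfies_eq1(A, members):
    """True when intra > overall > inter, i.e. the subset looks like a community."""
    intra, overall, inter = community_densities(A, members)
    return intra > overall > inter


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = [[0] * 6 for _ in range(6)]
for i, j in edges:
    A[i][j] = A[j][i] = 1

triangle_ok = satisfies_eq1(A, [0, 1, 2])
mixed_ok = satisfies_eq1(A, [0, 3])
```

The first triangle satisfies the inequality, whereas the arbitrary pair {0, 3} has no internal links and therefore fails it.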

Figure 2: A sample community structure given a small network [3].

In real-world networks, communities represent modules that have special characteristics or play a distinct role in the network [2][3]; for example, in human social networks communities represent cliques of friends, in citation networks they represent groups from a similar field of research, and in transportation networks hubs of high geographical proximity. At this point it is important to state that, given a network and based on its topology and weights, we are inferring structural communities, which might not always reflect functional ones [2].

It is very natural for real-world networks to possess a nested or hierarchical community structure [2][6], i.e. each individual community can be further partitioned into sub-communities. Therefore, the whole community structure of the network can be represented in the form of a dendrogram (see Fig. 3), where the top node represents the whole graph and each layer below a possible partition. The community dendrogram reflects the community organization of a network under different resolution perspectives.

Figure 3: A sample dendrogram representing the different layers of community structure in a network. The root represents the whole graph, which breaks down into communities as we go down the tree. The bottom layer represents individual nodes [6]. The horizontal red line represents a given level of community organization.

On the other hand, in random graphs links tend to form between nodes without any regard for mesoscopic cohesiveness. Based on that difference between real and random networks, Mark Newman and Michelle Girvan proposed in [6] the notion of modularity, which evaluates the quality of a community division. Suppose we have a real-world network $G_{real}$ that has some form of community structure; we have C node subsets $\{g_c\}_{c=1}^{C}$ that follow (1). Then, we have another network $G_{null}$, named the null graph, with the same number of nodes and the same subsets $\{g_c\}_{c=1}^{C}$, where each node has the same degree as the corresponding one in $G_{real}$, but its edges point to random nodes in $G_{null}$ without any regard for community membership. In order to define how modular $G_{real}$ is, we compare the fraction of intra-community links in $G_{real}$ against the expected value of that fraction in the null graph. Thus we define modularity Q as:

$$Q = \frac{1}{2M} \sum_{i=1}^{N} \sum_{j=1}^{N} \left( A_{ij} - \frac{k_i k_j}{2M} \right) \delta(C_i, C_j) \qquad (2)$$

where M is the total number of edges, N the number of nodes, $A_{ij}$ the corresponding element of the adjacency matrix, $k_i$ the degree of vertex i, and $\delta(C_i, C_j)$ is 1 if i and j belong to the same community and 0 otherwise. An alternative way to write modularity [19] is by using the assortative mixing matrix e ∈ ...

... Q(k+1) is to avoid getting trapped in local maxima; we start the algorithm with a random initialization of partitions and a low value of T gives a higher freedom to explore possible solutions. As the algorithm converges to an area of ‘good’ solutions, the model becomes very strict and does not allow selections that decrease Q. The solution exploration scheme consists of $N^2$ local and N global ‘moves’, where N is the total number of nodes. The local moves consist of assigning an individual node to a different partition, while global moves consist of merging or splitting whole partitions. Thus, although the algorithm is very popular for general optimization problems, for the community detection framework it is very computationally demanding [3].

The Potts method [13] is a community detection algorithm inspired by Statistical Mechanics. The model assumes that a network is a system of spins that can have q different states. Thus, each node i can have a spin value $\sigma_i \in \{1, ..., q\}$ (which basically accounts for its community membership, C = q), and the interaction energy between spins is given by $-J_{ij}$ (where J = A) if the spins are in the same state and zero if they are not. Finding the appropriate partition for the network amounts to finding the ground state (minimum) of the Hamiltonian:

$$H = -\sum_{i,j \in N} J_{ij}\, \delta_{\sigma_i,\sigma_j} + \gamma \sum_{s=1}^{q} \frac{n_s (n_s - 1)}{2} \qquad (7)$$

where $n_s$ is the number of spins (nodes) in state (community) s, $J_{ij}$ the interaction strength (given by the adjacency or weight matrix A), γ a positive parameter, and the Kronecker delta $\delta_{\sigma_i,\sigma_j}$ is 1 if i, j have the same spin (belong to the same community) and 0 otherwise. The above equation reflects the two competing forces in our system: the first term favours a homogeneous distribution of spins (it is minimal when all i, j are in the same community), while the second term favours spins that are spread uniformly across the q states. To find the ground state (minimum) of the above system we employ a Monte Carlo single spin flip heat-bath algorithm along with simulated annealing. The method is fast and provides very competitive results for most real-world and artificial datasets.

The computational complexity of a variety of different community detection algorithms is presented in [16].
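To make the two competing terms of the Potts Hamiltonian in Eq. (7) concrete, the sketch below evaluates H for a given spin assignment. It does not implement the Monte Carlo heat-bath search itself. It assumes an undirected graph in which each linked pair is counted once in the interaction sum; this counting convention, the toy graph and the function name `potts_energy` are assumptions made for illustration.

```python
from collections import Counter


def potts_energy(A, spins, gamma):
    """Energy of a spin (community) assignment for J = A.

    Interaction term: -sum over linked pairs i < j with equal spins of A_ij.
    Penalty term:     gamma * sum over states s of n_s (n_s - 1) / 2.
    """
    N = len(A)
    interaction = -sum(
        A[i][j]
        for i in range(N)
        for j in range(i + 1, N)       # each unordered pair counted once
        if spins[i] == spins[j]
    )
    sizes = Counter(spins)             # n_s for every occupied state s
    penalty = gamma * sum(n * (n - 1) / 2.0 for n in sizes.values())
    return interaction + penalty


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = [[0] * 6 for _ in range(6)]
for i, j in edges:
    A[i][j] = A[j][i] = 1

h_two_groups = potts_energy(A, [1, 1, 1, 2, 2, 2], gamma=0.5)
h_one_group = potts_energy(A, [1, 1, 1, 1, 1, 1], gamma=0.5)
```

With γ = 0.5 the two-triangle assignment has lower energy than placing all nodes in one state, which is the behaviour the ground-state search is meant to exploit.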

4 Experiments

Our preliminary study involved a comparative analysis of different community detection algorithms across real-world (see Table 1) and computer-generated network datasets (see Table 2). From the outcome of this small-scale research we expect to derive some general conclusions on the performance of different clustering algorithms using a variety of metrics. Additionally, we present the performance of NMF, a novel probabilistic community detection algorithm based on non-negative matrix factorization that produces state-of-the-art results in most problems and gives a promising research direction.

Table 1: Real-world networks
Name                     N      M      Weighted?
Zachary’s karate [22]    34     77     no
Southern women [23]      18     139    yes
Jazz musicians [24]      198    2742   no
Dolphins [25]            62     159    no
Great Tits subset        49     738    yes

4.1 Set-up

As mentioned previously, our experimentation involved a comparative analysis of different methods across real and artificial datasets. For the real datasets we used some very popular networks that are widely adopted as performance benchmarks in the vast majority of the community detection literature. The detailed list is presented in Table 1. For those real-world networks, as in any other data clustering problem, we usually do not have an observed solution, i.e. a collection of communities that represents the real partition of the graph. For that reason, we have also generated a collection of artificial networks with observed community structure, in order to evaluate the performance of our algorithms not only in terms of modularity but also in terms of how similar the groups produced by our algorithms are to the real ones.

For unweighted networks, we adopted the standard procedure presented in [6] and found in the majority of the community detection literature; we generate networks with N = 128 nodes and C = 4 communities of n = 32 nodes each, where each node has an average (expected) degree $\langle k \rangle = 16$. We control the internal cohesiveness of communities by setting the expected intra-community and inter-community degree of each node to $\langle k_{in} \rangle$ (each pair of nodes of the same community is connected with probability $p_{in} = \langle k_{in} \rangle / (32 - 1)$) and $\langle k_{out} \rangle$ (a node from one community connects to a node from another community with probability $p_{out} = \langle k_{out} \rangle / (128 - 32)$) respectively, under the constraint $\langle k \rangle = \langle k_{in} \rangle + \langle k_{out} \rangle$.

For weighted networks, we followed a procedure inspired by [20], where for a network of N nodes and C communities we define a C × C matrix T such that the probability that a node from community i is connected to a node from community j is given by $T_{ij}$. Additionally, we assume that the edge weights follow a Poisson distribution with rates (lambdas) $L_{ij}$, provided by a C × C matrix L, for each pair of communities i, j. The reason we use different distributions for node connections and weights is to capture the fact that a different phenomenon may affect each of the two. For example, people from two different communities might have a very low probability of being acquainted at all, but if they are, they might have a very large number of interactions. Based on the above, we can generate a weighted network of any given size, expected topological features and connection strengths.

Table 2: Computer generated networks. For the Newman-Girvan (NG) graph we provide the expected values of M and Q.
Name        N      M       C    Qobserved
R1          50     274     4    0.519
R2          100    1181    2    0.555
R3          40     226     5    0.62
NG graph    128    1024    4    0.37 to 0.7

All experiments were run on a modern desktop computer (circa 2010), all methodologies were coded in MATLAB, and great care has been taken to follow the implementation-specific guidelines of the authors of each method, if they were available.
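The unweighted benchmark described above can be sketched as follows. This is an illustrative Python re-statement of the stated probabilities $p_{in}$ and $p_{out}$, not the MATLAB code used for the experiments; the function name and its default arguments are assumptions.

```python
import random


def newman_girvan_graph(k_out, n_communities=4, community_size=32,
                        k_total=16, seed=None):
    """Generate one benchmark graph with planted communities.

    Returns (A, labels): a symmetric 0/1 adjacency matrix and the planted
    community index of every node. Uses <k_in> = <k> - <k_out>,
    p_in = <k_in> / (n - 1) and p_out = <k_out> / (N - n).
    """
    rng = random.Random(seed)
    N = n_communities * community_size
    k_in = k_total - k_out
    p_in = k_in / (community_size - 1)
    p_out = k_out / (N - community_size)
    labels = [i // community_size for i in range(N)]
    A = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(i + 1, N):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                A[i][j] = A[j][i] = 1
    return A, labels


A, labels = newman_girvan_graph(k_out=4, seed=1)
mean_degree = sum(sum(row) for row in A) / len(A)
```

With the default arguments the expected number of edges is 128 × 16 / 2 = 1024, which matches the NG entry of Table 2; the realised mean degree of a single sample fluctuates around 16.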

4.2 Performance metrics

In order to evaluate the performance of our community detection algorithms against real-world networks, we used the modularity Q, as it is the only available metric for unobserved datasets. We expect that an efficient algorithm will achieve competitive results against other methods in the literature. For observed datasets, apart from Q we make use of the fact that the community partitions are already known. Thus, we need to measure the similarity between the observed partitions $\{g^{(o)}_c\}_{c=1}^{C_o}$ and the ones produced by our algorithm, $\{g^{(f)}_c\}_{c=1}^{C_f}$. We notice that we might get a different number of communities from the ones observed; some might be split or merged into others. For that reason, we follow an approach from information theory presented in [16] by estimating the normalized mutual information. We define a $C_o \times C_f$ confusion matrix $\mathbf{N}$, where rows correspond to the observed communities and columns correspond to the ‘found’ communities. Each element $N_{ij}$ is the number of nodes in the real community i that appear in the found community j. Thus the similarity between the two partitions $\{g^{(o)}_c\}_{c=1}^{C_o}$ and $\{g^{(f)}_c\}_{c=1}^{C_f}$ is:

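$$I(g^{(o)}, g^{(f)}) = \frac{-2 \sum_{i=1}^{C_o} \sum_{j=1}^{C_f} N_{ij} \log\left( \frac{N_{ij} N}{N_{i*} N_{*j}} \right)}{\sum_{i=1}^{C_o} N_{i*} \log\left( \frac{N_{i*}}{N} \right) + \sum_{j=1}^{C_f} N_{*j} \log\left( \frac{N_{*j}}{N} \right)} \qquad (8)$$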

where $C_o$ and $C_f$ are the number of communities in the observed and found partition respectively, $N_{i*}$ the sum of elements of the i-th row of $\mathbf{N}$, $N_{*j}$ the sum of elements of the j-th column of $\mathbf{N}$, and N the sum of all elements of the confusion matrix $\mathbf{N}$. The quantity I takes values from 0 to 1 and measures the amount of information correctly extracted by the algorithm [16]. Other measures, such as the Rand similarity metric, are presented in [17].

Finally, another important performance metric is the statistical significance of the extracted community structure. Given a partitioning $\{g^{(f)}_c\}_{c=1}^{C_f}$, each community $g^{(f)}_c$ has, say, n nodes, $l_{in}$ intra-community and $l_{out}$ inter-community links. Following [13], we calculate the expected number of possible equivalent communities $E(n, l_{in}, l_{out})$ in a random network of the same size (N nodes, M edges), where the connection probability is $P(A_{ij} = 1) = p = \frac{2M}{N(N-1)}$ for $i \neq j$:

$$E(n, l_{in}, l_{out}) = \binom{N}{n} \binom{n(n-1)/2}{l_{in}} \binom{n(N-n)}{l_{out}}\, p^{l_{in}} \qquad (9)$$

$$\times\, (1-p)^{n(n-1)/2 - l_{in}}\, p^{l_{out}}\, (1-p)^{n(N-n) - l_{out}} \qquad (10)$$

If $E(n, l_{in}, l_{out}) > 1$, then one is likely to find such a community in a random graph of the same size; this marks the border of statistical significance [13].

We conclude this section by raising an important issue: it is accepted in the community detection literature that algorithms that seek to directly optimize modularity perform better than the ones that simply use it as a performance metric.

From our point of view, once a performance metric has been defined, comparing an algorithm that is merely evaluated by it against another that tries to directly optimize it biases the results towards the latter.
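The two main metrics of this section, modularity Q (Eq. 2) and normalized mutual information I (Eq. 8), can be sketched as below. This is a minimal illustration assuming an undirected graph given as a symmetric adjacency matrix and partitions given as one label per node; the function names are hypothetical, and returning 1.0 when both partitions consist of a single group is a convention chosen here, not one stated in the report.

```python
import math
from collections import Counter


def modularity(A, labels):
    """Q = (1 / 2M) * sum_ij (A_ij - k_i k_j / 2M) * delta(C_i, C_j)."""
    N = len(A)
    k = [sum(row) for row in A]          # degrees (or strengths if weighted)
    two_m = sum(k)                       # equals 2M for a symmetric A
    q = 0.0
    for i in range(N):
        for j in range(N):
            if labels[i] == labels[j]:
                q += A[i][j] - k[i] * k[j] / two_m
    return q / two_m


def normalized_mutual_information(observed, found):
    """Eq. (8), computed from the confusion counts N_ij of two labelings."""
    total = len(observed)
    joint = Counter(zip(observed, found))    # N_ij (non-zero entries only)
    rows = Counter(observed)                 # N_i*
    cols = Counter(found)                    # N_*j
    numerator = -2.0 * sum(
        n_ij * math.log(n_ij * total / (rows[i] * cols[j]))
        for (i, j), n_ij in joint.items()
    )
    denominator = (
        sum(n * math.log(n / total) for n in rows.values())
        + sum(n * math.log(n / total) for n in cols.values())
    )
    if denominator == 0.0:                   # both partitions are one group
        return 1.0                           # assumed convention
    return numerator / denominator


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = [[0] * 6 for _ in range(6)]
for i, j in edges:
    A[i][j] = A[j][i] = 1

q_planted = modularity(A, [0, 0, 0, 1, 1, 1])
nmi_relabelled = normalized_mutual_information([0, 0, 0, 1, 1, 1],
                                               [1, 1, 1, 0, 0, 0])
```

For the toy graph the planted two-triangle partition gives Q = 5/14, and I is invariant to relabelling of the groups, which is why it can compare partitions with different group identifiers.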

4.3 Sensitivity analysis

As mentioned in the previous section, our comparative analysis involved the performance evaluation of different community detection algorithms across a variety of datasets. Although good results on a variety of benchmark datasets provide a good indication of the competitiveness of a method, we need further confirmation that the results are not dataset-specific. Additionally, we are interested in the performance of the algorithms in cases where we have missing observations or fuzzy community structure. For that reason, we have designed a variety of tests for weighted and unweighted networks to evaluate the resilience of the methods in cases where the community structure of the same network is not apparent: communities are loosely connected or observations are knocked out.

For unweighted networks, we used the Newman-Girvan random graph [6] described in the ‘Set-up’ section. As mentioned previously, we can control intra-community density by manipulating the values of $\langle k_{out} \rangle$ and $\langle k_{in} \rangle$, which determine the probabilities that nodes from different communities and from the same community are connected. Therefore, starting with densely connected communities ($\langle k_{out} \rangle = 1$, $\langle k_{in} \rangle = 15$) and increasing $\langle k_{out} \rangle$ (thus decreasing $\langle k_{in} \rangle$), we make the communities less cohesive. We run our algorithms for each value of $\langle k_{out} \rangle$ to identify how the performance of each one drops as the network becomes fuzzier. Our sensitivity analysis scheme is shown in Algorithm 1.

Algorithm 1: Sensitivity analysis for Newman-Girvan random graphs
1. Set $\langle k_{out} \rangle \leftarrow 1$, thus $\langle k_{in} \rangle = 15$ because $\langle k \rangle = \langle k_{in} \rangle + \langle k_{out} \rangle$.
2. While $\langle k_{out} \rangle \leq 8$:
   a. Generate 100 Newman-Girvan random graphs (N = 128, C = 4, n = 32) with the given $\langle k_{out} \rangle$, $\langle k_{in} \rangle$.
   b. Run community detection for those graphs using each of the available methods. Get $Q_{mean}^{(k_{out})}$ and $I_{mean}^{(k_{out})}$.
   c. Set $\langle k_{out} \rangle \leftarrow \langle k_{out} \rangle + 1$, thus making the network ‘fuzzier’.
3. Plot $Q_{mean}$, $I_{mean}$ across different values of $\langle k_{out} \rangle$.

For weighted networks generated by the procedure we described in the ‘Set-up’ section we follow a different approach. For simplicity we assume that the weights $A_{ij}$ take integer values and that, in a social network context, 1 unit of weight reflects one unit of ‘co-existence’ or ‘co-occurrence’ of two individuals within a time-frame; for example, $A_{ij}$ can be the number of emails exchanged between two people per day, or the number of occurrences of two animals in the same location per hour. We assume that each unit of co-occurrence is captured by a sensor that can be faulty, i.e. it successfully captures an observation with probability $p_{sensor}$. Therefore, we generate samples of the weight matrix A of the network for different values of $p_{sensor}$ and run our community detection methods in order to evaluate their performance. Our methodology is shown in Algorithm 2 and emulates the real-world problem of capturing bird positions at Wytham Woods, Oxford.

Algorithm 2: Sampling weighted networks using the faulty sensor scheme
1. For each $p_{sensor} \in \{0.9, 0.7, 0.4\}$:
   a. Generate 25 weight matrices W, where $W_{ij}$ follows a binomial distribution with $n = A_{ij}$ and $p = p_{sensor}$.
   b. Run community detection for those graphs using each of the available methods. Get $Q_{mean}^{(p_{sensor})}$ and $I_{mean}^{(p_{sensor})}$.
2. Plot $Q_{mean}$, $I_{mean}$ across different values of $p_{sensor}$.
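The sampling step of Algorithm 2 amounts to binomial thinning of an integer weight matrix. The sketch below illustrates it under the assumption of a symmetric matrix with non-negative integer weights and a zero diagonal; the function name `thin_weights` and the small example matrix are hypothetical.

```python
import random


def thin_weights(A, p_sensor, seed=None):
    """Sample W with W_ij ~ Binomial(n = A_ij, p = p_sensor).

    Every unit of weight (one co-occurrence) is kept independently with
    probability p_sensor. Only the upper triangle is sampled and then
    mirrored, so that W stays symmetric like A.
    """
    rng = random.Random(seed)
    N = len(A)
    W = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(i + 1, N):
            kept = sum(1 for _ in range(A[i][j]) if rng.random() < p_sensor)
            W[i][j] = W[j][i] = kept
    return W


# Hypothetical 3-node weighted network.
A = [
    [0, 3, 0],
    [3, 0, 5],
    [0, 5, 0],
]
samples = [thin_weights(A, 0.7, seed=s) for s in range(25)]
```

Each of the 25 sampled matrices would then be passed to every community detection method, and the resulting Q and I values averaged for that value of $p_{sensor}$.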

4.4 Results

In this section we present the results of our experiments for each community detection method and dataset. Our focus is to compare the performance of the already existing methods against the novel probabilistic NMF algorithm and evaluate its potential for further research.

Table 3: Zachary Karate club results
Method                Modularity    Group size    Run-time (sec)
EO                    0.42±0.01     4±0           4.16±0.10
NMF                   0.12±0.06     5±0           0.17±0.01
Spectral              0.40          4             0.36
Hierarchical          0.35          10            0.1
Donetti-Munoz [11]    0.412         5             N/A

Because methods such as Extremal Optimization and NMF are sensitive to initialization, we performed multiple runs against the same dataset and monitored the mean and standard deviation of the results. As can be seen from the tables, for real-world datasets with no observable solution we monitored the value of modularity Q and the group size of the partition, along with the run-time.

Table 4: Southern women club results
Method          Modularity    Group size    Run-time (sec)
EO              0.26±0.002    2±0           1.84±0.07
NMF             0.24±0.08     2±0           0.06±0.006
Spectral        0.26          2             0.21
Hierarchical    0.26          2             0.01

We can see that NMF produces excellent results in the majority of datasets, both in terms of modularity and group similarity (for observable datasets), with minimal computational effort.

Table 5: Jazz musicians results
Method          Modularity    Group size    Run-time (sec)
EO              0.42±0.007    4±0           74.157±1.84
NMF             0.42±0.008    7±0           2.09±0.15
Spectral        0.39          3             1.01
Hierarchical    0.34          13            0.56

Table 6: Dolphin social network results
Method          Modularity     Group size    Run-time (sec)
EO              0.51±0.006     4±0           11.24±0.428
NMF             0.466±0.033    7±0           0.334±0.030
Spectral        0.491          5             0.44
Hierarchical    0.455          12            0.13

Table 7: Wytham Woods Great Tit subset results
Method          Modularity    Group size    Run-time (sec)
EO              0.35±         2±0           6.73±0.18
NMF             0.35±         3±0           0.25±0.036
Spectral        0.35          2             0.3
Hierarchical    0.3           5             0.09

Additionally, for the artificial datasets in Tables 9 and 10, NMF correctly extracts the exact original community structure, with run-times more than an order of magnitude below those of the second best performing method, Extremal Optimization.

Table 8: Artificial graph 1 results, Qobserved = 0.519
Method          Modularity    I(go, gf)     Group size    Run-time (sec)
EO              0.51±0.014    0.94±0.055    4±0           7.53±0.17
NMF             0.46±0.037    0.93±0.034    5.2±0.42      0.25±0.026
Spectral        0.492         0.89          7             0.37
Hierarchical    0.47          0.78          4             0.06

Table 9: Artificial graph 2 results, Qobserved = 0.555
Method          Modularity    I(go, gf)     Group size    Run-time (sec)
EO              0.49±0.002    0.89±0.005    4±0           22.05±0.56
NMF             0.55±         1±            3±0           0.62±0.04
Spectral        0.55          1             3             0.41
Hierarchical    0.46          0.76          2             0.14

Table 10: Artificial graph 3 results, Qobserved = 0.62
Method          Modularity    I(go, gf)     Group size    Run-time (sec)
EO              0.57±0.032    0.91±0.023    4.7±1.49      5.07±0.2
NMF             0.62±         1±            5±0           0.22±0.039
Spectral        0.49          0.79          4             0.25
Hierarchical    0.52          0.78          3             0.07

For unweighted networks, we performed sensitivity analysis for the Newman-Girvan random graph, using the approach described in the previous section. The results, presented in Fig. 4 and 5, show the performance of each method across different levels of community cohesion. We can see that for increasing inter-community degree $\langle k_{out} \rangle$, modularity and group identification fall as the community structure of the graph becomes fuzzier. Although Spectral Partitioning and Hierarchical clustering perform poorly, Extremal Optimization and NMF have a more stable behaviour, extracting modular structure that is similar to the observed one.

[Figure 4 plot: Modularity Q (y-axis, 0.35 to 0.75) across different values of $\langle k_{out} \rangle$ (x-axis, 1 to 8); series: EO, NMF, Spectral, Hierarchical.]

Figure 4: Modularity of each method across different levels of community cohesion in a Newman-Girvan random network.

[Figure 5 plot: Similarity I(go, gf) to the observed community structure, from normalized mutual information (y-axis, 0.7 to 1.1), across different values of $\langle k_{out} \rangle$ (x-axis, 1 to 8); series: EO, NMF, Spectral, Hierarchical.]

Figure 5: Original group extraction of each method across different levels of community cohesion in a Newman-Girvan random network.

We also performed sensitivity analysis on weighted networks, following the sensor emulation scheme described in the previous section. For the three fully observed weighted networks of Table 2, we monitored the modularity Q and the similarity to the original groups $I(g^{(o)}, g^{(f)})$ for different probabilities $p_{sensor}$ of capturing an observation. We can see in Fig. 6 that NMF has a solid top performance for any value of $p_{sensor}$, both in terms of modularity and original group extraction. Results for other artificial weighted networks illustrate the exact same behaviour.


[Figure 6, top panel: ‘Modularity across different psensor’; y-axis: Modularity Q (0.35 to 0.7); x-axis: psensor (0.9 down to 0.4); series: EO, NMF, Spectral, Hierarchical, observed Q.]
[Figure 6, bottom panel: ‘Original group identification across different psensor’; y-axis: Similarity I(go, gf) (0.4 to 1); x-axis: psensor (0.9 down to 0.4); series: EO, NMF, Spectral, Hierarchical.]

Figure 6: Modularity and original group similarity for a computer generated graph, across different values of sampling probability psensor. NMF outperforms all other methodologies, extracting the observed group structure. Runs using other computer generated graphs show similar results.

5 Conclusion and closing thoughts

In this work we provided a short ‘research diary’ of the community detection notions and methods we studied since the beginning of the project. While it is acknowledged that ‘most accurate methods tend to be more computationally expensive’ [16], we demonstrated how the novel NMF algorithm can provide state of the art results in community detection problems with minimal computational effort, showing a promising direction for our research from this point on. NMF does not only provide competitive results against other popular community detection methods in terms of modularity and original group identification with minimal computational effort; it also provides probabilistic outputs for community membership of each node. The existing methodologies described in previous sections assign each node to a single group based on some educated hard decision. Now matter how correct such an assignment is, it completely omits community overlap, which is an imporant aspect of real world networks. Most standard methodologies handle multi-group membership by performing multiple runs with different initialization parameters and based on a concordance matrix, they evaluate the stability of node assignments, i.e how frequently each node appears in the same community. NMF explicitly handles multi-community membership on a single run by providing for each individual node-i a group membership distribution; given a partition {gc }C c=1 , for each group-c we have a probability P (i ∈ gc ) that the node-i will belong to the group c. Another way of expressing it is that each element of the community matrix P of Eq. (4) now expresses a probability for group membership rather than a hard binary decision. This is illustrated in Fig. 7 on a computer generated dataset with overlapping community structure. Thus instead of following a frequentist approach, using the group membership distribution we capture our degree of belief for the assignment of each node. 
Figure 7: Membership probabilities (z-axis) for a network of 19 nodes (y-axis) and 5 identified communities (x-axis) in a computer-generated graph. For a given node, the bars along the x-axis represent its community membership distribution. Nodes differ in their degree of participation across communities.

The advantages of such probabilistic outputs are:
• We can describe the multi-community membership of each node in a formal way.
• We can identify overlapping communities based on the probability mass of their nodes.
• We can model our prediction error in problems with asymmetric misclassification costs.
• We can implement community-wise descriptive graph visualization tools.
• We can formally measure the ‘fuzziness’ of a network, based on the entropy of the group membership distribution of each node. For example, given 4 communities, if the membership distribution of node i is near a ‘random guess’, i.e. a 25% probability of belonging to each group, this loss of confidence yields a high entropy value.
• The network cartography and individual ‘role assignment’ techniques described in the previous section (based on the intra- and inter-community participation coefficients of [14]) can be improved under a probabilistic framework.
• We can make informed predictions on issues such as group stability over time; for example, we can expect a larger exchange of nodes between fuzzy communities, and phenomena such as fission-fusion can be described in a formal way.
• Based on how membership distributions change over time, we can make informed predictions on events such as whether an individual from group A and another from group B will belong to the same group after a period of time.

Based on the above, we conclude by stating that our research has the potential of a) improving already existing and well-acknowledged methods, b) introducing novel methodologies in the study of networks with the potential of wide adoption and c) providing important context-specific insights on applications such as our Great Tit social network study at Wytham Woods.
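The entropy-based ‘fuzziness’ measure listed among the advantages above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the membership matrix `P` below is made-up data with one row per node and one column per community, each row summing to 1, and the function name `membership_entropy` and the normalisation by log C (which requires at least two communities) are illustrative choices rather than definitions taken from this report.

```python
import numpy as np


def membership_entropy(P, normalise=True):
    """Shannon entropy of each node's community membership distribution.

    P: array of shape (nodes, communities); every row is assumed to be a
    probability distribution. With normalise=True the entropy is divided
    by log(C), so that 0 means a certain assignment and 1 a uniform
    'random guess' over the C >= 2 communities.
    """
    P = np.asarray(P, dtype=float)
    # Use the convention 0 * log(0) = 0 by replacing zeros with 1 inside the log.
    safe = np.where(P > 0, P, 1.0)
    H = -(P * np.log(safe)).sum(axis=1)
    if normalise:
        H = H / np.log(P.shape[1])
    return H


# Hypothetical membership distributions for three nodes and four communities.
P = np.array([
    [0.25, 0.25, 0.25, 0.25],   # 'random guess': maximally fuzzy node
    [1.00, 0.00, 0.00, 0.00],   # hard assignment: no uncertainty
    [0.50, 0.50, 0.00, 0.00],   # node shared equally by two communities
])
H = membership_entropy(P)
```

In this toy example the uniform row attains the maximum normalised entropy of 1, the hard assignment gives 0, and the node split equally between two of the four communities gives 0.5. Averaging `H` over all nodes would be one possible network-level fuzziness score, although other aggregations are equally plausible.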


References

[1] M. E. J. Newman - The structure and function of complex networks, SIAM Review 45, 167-256 (2003)
[2] Mason A. Porter, Jukka-Pekka Onnela and Peter J. Mucha - Communities in Networks, Notices of the American Mathematical Society, Vol. 56, No. 9 (2009)
[3] Santo Fortunato and Claudio Castellano - Community Structure in Graphs, chapter of Springer’s Encyclopedia of Complexity and System Science (2008)
[4] Andrew Sih, Sean F. Hanser, Katherine A. McHugh - Social Network Theory: new insights and issues for behavioral ecologists, Behav Ecol Sociobiol 63, 975-988 (2009)
[5] Alain Barrat, Marc Barthélemy and Alessandro Vespignani - Modeling the evolution of weighted networks, Physical Review E 70, 066149 (2004)
[6] M. E. J. Newman and M. Girvan - Finding and evaluating community structure in networks, Physical Review E 69, 026113 (2004)
[7] M. E. J. Newman - Analysis of weighted networks, Physical Review E 70, 056131 (2004)
[8] M. E. J. Newman - Modularity and community structure in networks, PNAS, Vol. 103, No. 23, 8577-8582 (June 6, 2006)
[9] Jordi Duch and Alex Arenas - Community detection in complex networks using extremal optimization, Physical Review E 72, 027104 (2005)
[10] Ying Fan, Menghui Li, Peng Zhang, Jinshan Wu, Zengru Di - Accuracy and precision of methods for community identification in weighted networks, Physica A 377, 363-372 (2007)
[11] Luca Donetti and Miguel A. Muñoz - Detecting network communities: a new systematic and efficient algorithm, Journal of Statistical Mechanics: Theory and Experiment (2004)
[12] A. Capocci, V. D. P. Servedio, G. Caldarelli, F. Colaiori - Detecting communities in large networks, Physica A, Vol. 352, No. 2-4, pp. 669-676
[13] Jörg Reichardt and Stefan Bornholdt - Detecting Fuzzy Community Structures in Complex Networks with a Potts Model, Physical Review Letters, Vol. 93, No. 21 (2004)
[14] R. Guimerà, L. A. N. Amaral - Cartography of complex networks: modules and universal roles, Journal of Statistical Mechanics: Theory and Experiment (2005)
[15] Roger Guimerà, Marta Sales-Pardo and Luís A. N. Amaral - Classes of complex networks defined by role-to-role connectivity profiles, Nature Physics 3, 63-69 (2007)
[16] Leon Danon, Albert Díaz-Guilera, Jordi Duch and Alex Arenas - Comparing community structure identification, Journal of Statistical Mechanics: Theory and Experiment, Volume 2005, September 2005
[17] Amanda L. Traud, Eric D. Kelsic, Peter J. Mucha, Mason A. Porter - Community Structure in Online Collegiate Social Networks, Physics and Society (2008)
[18] Aaron Clauset, Cristopher Moore and M. E. J. Newman - Hierarchical structure and the prediction of missing links in networks, Nature 453, 98-101 (1 May 2008)
[19] Erik Holmström, Nicolas Bock, Johan Brännlund - Modularity density of network community divisions, Physica D 238, 1161-1167 (2009)
[20] Mark Ebden - Towards plan detection using message passing, Pattern Analysis and Machine Learning Group report, 18 November 2009
[21] L. C. Freeman - A set of measures of centrality based upon betweenness, Sociometry 40, 35-41 (1977)
[22] W. W. Zachary - An information flow model for conflict and fission in small groups, Journal of Anthropological Research 33, 452-473 (1977)
[23] R. Breiger - The duality of persons and groups, Social Forces 53, 181-190 (1974)
[24] P. Gleiser and L. Danon, Adv. Complex Syst. 6, 565 (2003)
[25] D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson - Behavioral Ecology and Sociobiology 54, 396-405 (2003)
