Communities in Social Networks

2009 International Conference on Biometrics and Kansei Engineering

Małgorzata J. Krawczyk
AGH University of Science and Technology, Faculty of Physics and Applied Computer Science, al. Mickiewicza 30, 30-059 Krakow, Poland
[email protected]

communities. Possible approaches to the construction and analysis of such networks are reviewed in [1-4]. Accumulated information can be represented in graph form, where nodes represent the examined elements (people, institutions) and edges describe the connections between them. Edges can express different kinds of dependence, e.g. the number of exchanged mails, the frequency of meetings, or the number of common projects. In the simplest case, one can only give binary information about the existence of a relation between any two elements. In some cases more information is available, e.g. not only whether two persons are acquainted but also the frequency of business or social meetings they both participate in.

A network can equivalently be represented by a connectivity matrix A. In the case of undirected networks, i.e. if the direction of an interaction is unknown, the connectivity matrix is symmetric. All diagonal elements of A are equal to 0, as elements are not connected to themselves. The remaining elements are equal to 0 or 1 or, if different weights can be assigned to different edges of the graph, they are real numbers in the range [0,1].

There are many ways of describing the static properties of a network. For example, one can find the degree distribution of nodes, or the clustering coefficient. This kind of analysis provides information about some properties of a network. Very often we are interested in identifying communities in the network. A community is understood as a set of nodes which are more densely connected to each other than to the remaining nodes in the network. Such communities can contain elements which are functionally similar to each other, e.g. family members, or people who have the same profession or interests. Usually members of one community are at the same time elements of other communities. Because of that, a few communities can be joined into a bigger one, and so on. This situation causes the identification of

Abstract
The results of two different clusterization methods applied to six social networks are presented. It is shown that elements classified together by the differential equation method are also assigned to one community by the Newman method. It is also shown that, even though the cumulative node degree distribution of all analysed networks is described by the stretched-exponential law, the distribution of the nodes among different functional types is not the same for all networks.

1. Introduction
Describing interactions between social elements by means of networks is a powerful method. Nodes of those networks can represent people, companies, or any other social element, while edges connecting nodes express the relations between them. All the people in the world can be seen as one large network. This network is naturally divided into subnetworks, in which further sub-subnetworks can be pointed out, and so on. The smallest and the most strongly connected communities are families, groups of friends, or coworkers. Independently of the way a network is constructed, the question we are interested in is the structure of the network. Very often a community overlaps with other communities, as some nodes belong to more than one community. Identification of communities allows one to capture dependences between individual elements of a network, which is important for understanding the interactions and the functioning of the network as a whole. Clusterization methods are based on different approaches, which are appropriate for different purposes. It is also useful to compare the results obtained from different methods. We would like to indicate elements which form communities, and possibly bonds between those

978-0-7695-3692-7/09 $25.00 © 2009 IEEE DOI 10.1109/ICBAKE.2009.20


neighbours, the connection between those nodes seems not to be random. Thus, if two given nodes are connected through a large number of other nodes of the network, the edge between them should be kept during the evolution, while in the opposite case this edge should disappear. According to this rule the differential equation determining the changes of the weights of the edges is as follows [5]:

communities to be a difficult task. Most existing methods do not take this phenomenon into account, and allow only isolated communities to be identified [3]. The second problem which can occur is connected with the size of the analysed network. If the network consists of a small number of elements, e.g. several dozen, the time needed to obtain a result will be short independently of the algorithm used, but if a network contains several thousand nodes the problem cannot be ignored. This is an important point, as some methods work faster at the expense of accuracy [5]. It should always be remembered that clusterization methods by definition return a division of a network into communities. If no information is known about the actual structure of a network, it is impossible to assess the correctness of the obtained result. What can be done is to compare the communities detected by different methods. If those results are similar, it is probable that the actual structure of the network is described correctly.

$$\frac{dA_{ij}}{dt} = G(A_{ij}) \sum_{k \neq i,j} \left( A_{ik} A_{kj} - \beta \right) \tag{2}$$

where $G(x) = x(1 - x)$.
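As an illustration only, the evolution defined by the differential equation above could be integrated numerically as sketched below. The explicit Euler scheme, the step size dt, the number of steps and the function names are assumptions made for this sketch and are not taken from [5]. Note that G(x) vanishes at x = 0 and x = 1, so a purely binary matrix does not evolve; the toy network therefore starts with weights of 0.5, which is also an assumption.

```python
import numpy as np

def dem_rate(A, beta=0.4):
    """Right-hand side G(A_ij) * sum_{k != i,j} (A_ik * A_kj - beta)."""
    n = A.shape[0]
    # (A @ A)_ij = sum_k A_ik A_kj; the k = i and k = j terms are zero
    # because the diagonal of A is zero.
    common = A @ A
    # beta is subtracted once for each of the n - 2 nodes k other than i and j.
    return A * (1.0 - A) * (common - beta * (n - 2))

def dem_evolve(A0, beta=0.4, dt=0.01, steps=100):
    """Explicit Euler integration (an assumed scheme, chosen for simplicity)."""
    A = np.array(A0, dtype=float)
    for _ in range(steps):
        # Entries equal to 0 (including the diagonal) stay 0 because G(0) = 0.
        A = np.clip(A + dt * dem_rate(A, beta), 0.0, 1.0)
    return A

def two_cliques(size=4, weight=0.5):
    """Hypothetical toy network: two cliques joined by a single bridge edge."""
    n = 2 * size
    A = np.zeros((n, n))
    A[:size, :size] = weight
    A[size:, size:] = weight
    A[size - 1, size] = A[size, size - 1] = weight
    np.fill_diagonal(A, 0.0)
    return A

A_final = dem_evolve(two_cliques())
```

In this toy case the bridge edge, whose end nodes have no common neighbours, loses weight faster than the edges inside the cliques, which is the behaviour the rule is designed to produce.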

Evolution of the network leads in a natural way to a split of the whole network into subnetworks, as some edges disappear. The assignment of nodes to subnetworks can change during the evolution, and whenever it changes the value of the modularity is calculated. The community structure accepted as the best one is that with the highest value of the modularity. The value of the parameter β was chosen on the basis of several numerical tests performed for networks whose structure was known a priori. The results indicate that all values of β larger than approximately 0.35 ensure the highest probability of a proper identification of the communities in the network. For lower values of β the probability of reconstructing the real structure of the network from noisy data is significantly smaller. Fig. 1 in [5] shows how the probability of reconstructing the real structure of the network depends on the value of β for different amplitudes of the noise. In conclusion, the value of the parameter β is chosen to be 0.4.

2. Methods
Of the methods of finding communities in a network, two are examined here. The results obtained from the differential equation method (DEM) [5] are compared with those obtained from the well-known Newman method (NM) [6]. The two methods differ in their principles, so it is interesting to compare them. The Newman method belongs to the category of agglomerative hierarchical methods. Starting with a number of communities equal to the number of nodes, communities are successively joined with each other so as to maximize the modularity value. Here the following definition of the modularity is taken [6]:

$$Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j) \tag{1}$$

3. Networks
The methods mentioned above were applied to six different networks described in the literature:
- Zachary’s karate club – social network of friendship between members of the Karate Club at a US university [7],
- Bernard&Killworth ham radio – network representing amateur HAM radio calls [8],
- Bernard&Killworth fraternity network – network representing interactions among students living in a fraternity at a West Virginia college [9],
- books about US politics network – network of books about US politicians sold by the on-line bookseller Amazon.com; edges between books represent frequent co-purchasing of books by the same buyers [10],

where $A_{ij}$ is an element of the connectivity matrix, $m = \frac{1}{2} \sum_{ij} A_{ij}$ and $k_i = \sum_j A_{ij}$.
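The modularity defined above can be evaluated directly from the connectivity matrix. The following is a minimal sketch, assuming a symmetric matrix A and one community label per node; the function name and the toy network (two triangles joined by one edge) are illustrative choices, not part of the original work.

```python
import numpy as np

def modularity(A, labels):
    """Q = (1 / 2m) * sum_ij (A_ij - k_i k_j / 2m) * delta(c_i, c_j)."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    k = A.sum(axis=1)            # k_i = sum_j A_ij
    two_m = A.sum()              # 2m = sum_ij A_ij
    # same[i, j] is True when nodes i and j carry the same community label.
    same = labels[:, None] == labels[None, :]
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

# Toy example: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

Q_split = modularity(A, [0, 0, 0, 1, 1, 1])   # each triangle is one community
Q_single = modularity(A, [0, 0, 0, 0, 0, 0])  # all nodes in one community
```

For this toy network the split into two triangles gives Q = 5/14, while placing all nodes in a single community gives Q = 0.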

In Eq. 1 the sum runs over pairs of nodes belonging to the same community, i.e. the δ-function is 0 if the clusters ci and cj are different. The higher the value of the modularity, the farther the structure of the network is from a random one. In the case of DEM, the whole connectivity matrix evolves in time. The time variation of the weight of the edge between any two nodes depends on their connections with the remaining nodes in the network. If two nodes have many common


others are binary. The number of nodes ranges from 34 for ZKCN to 198 for JMN.

- dolphin social network – network of associations between bottlenose dolphins living in Doubtful Sound [11],
- Jazz musicians network – network of bands; an edge between two bands means that they have at least one musician in common [12].

A network can be characterized by an analysis of its degree distribution. Fig. 1 shows the cumulative degree distribution P(k), defined as the probability that a vertex has a degree larger than or equal to k, for all analysed networks. All curves follow a stretched-exponential behaviour (the worst fit was obtained for ZKCN, which contains the smallest number of nodes). The obtained values of the fit parameters are gathered in Tab. 1.
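The cumulative degree distribution described above can be computed as sketched here. The sketch assumes that the degree of a node is the number of its neighbours, also for weighted networks, which the text does not state explicitly; the function names and the parameter names k0 and gamma are hypothetical.

```python
import numpy as np

def cumulative_degree_distribution(A):
    """Return (ks, P), where P[i] is the fraction of nodes with degree >= ks[i]."""
    # Degree = number of nonzero entries in a row (assumption for weighted networks).
    degrees = (np.asarray(A) > 0).sum(axis=1)
    ks = np.arange(1, degrees.max() + 1)
    P = np.array([(degrees >= k).mean() for k in ks])
    return ks, P

def stretched_exponential(k, k0, gamma):
    """Stretched-exponential form exp(-(k / k0) ** gamma)."""
    return np.exp(-(np.asarray(k, dtype=float) / k0) ** gamma)

# Toy example: a star with one centre (degree 3) and three leaves (degree 1).
star = np.zeros((4, 4))
star[0, 1:] = star[1:, 0] = 1.0
ks, P = cumulative_degree_distribution(star)
```

A fit of stretched_exponential to the pairs (ks, P) would yield the two parameters of the law; the fitting procedure used by the author is not described in the text, so none is assumed here.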

4. Results
By definition the Newman method assigns all nodes to clusters, making it difficult to distinguish between nodes with very unequal degrees. Nodes that are only weakly connected to other nodes in the network are also assigned to one of the communities. One possible representation of the obtained result is a dendrogram, which visualizes the steps of the algorithm. Elements connected first are joined at a lower level. The dendrogram in Fig. 2 shows the result of the Newman algorithm applied to Zachary's karate club network. In contrast to NM, the differential equation method eliminates some edges from the network. This implies that the communities obtained from this method cannot be directly compared to those obtained from NM. What we can do is check whether nodes classified together, i.e. to a given community, by one method are also classified together by the other method.
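One possible way to carry out such a check is to count pairs of nodes, as in the sketch below. It is an illustration under stated assumptions, not the procedure used to obtain the results reported here: the function name is hypothetical, DEM labels are given only for the nodes DEM classifies, and NM is represented by a single flat partition rather than by the full dendrogram.

```python
from itertools import combinations

def co_classification_agreement(dem_labels, nm_labels):
    """Fraction of node pairs placed together by DEM that NM also places together.

    dem_labels: dict node -> DEM cluster, only for nodes classified by DEM.
    nm_labels:  dict node -> NM cluster, for all nodes of the network.
    """
    together = 0
    agree = 0
    for u, v in combinations(sorted(dem_labels), 2):
        if dem_labels[u] == dem_labels[v]:
            together += 1
            if nm_labels[u] == nm_labels[v]:
                agree += 1
    # With no DEM pairs there is nothing to contradict, so report full agreement.
    return agree / together if together else 1.0

# Hypothetical labels for a seven-node network; nodes 3 and 4 are left out by DEM.
nm = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
dem = {0: "a", 1: "a", 2: "a", 5: "b", 6: "b"}
agreement = co_classification_agreement(dem, nm)
```

A value of 1 means that every pair joined by DEM is also joined in the chosen NM partition. When DEM merges several NM clusters, the comparison would have to be made at a higher level of the dendrogram.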

Figure 1. Cumulative degree distribution P(k)

Table 1. Analysed networks

NETWORK                               #NODES   P(k)
Zachary's karate club (ZKCN)          34       ∼exp(-(k/4.16)^1.23)
Bernard&Killworth ham radio (HRN)     44       ∼exp(-(k/4.46)^0.52)
Bernard&Killworth fraternity (FN)     58       ∼exp(-(k/22.95)^1.59)
Dolphin social network (DSN)          62       ∼exp(-(k/6.57)^2.01)
Books about US politics (PBN)         105      ∼exp(-(k/7.48)^1.26)
Jazz musicians network (JMN)          198      ∼exp(-(k/32.95)^1.85)

Figure 2. Zachary's karate club dendrogram (ʘ, ①, ⊗ denote NM clusters; • and ∗ denote DEM clusters)

Following this fact we can mark DEM communities on the NM dendrogram. In Fig. 2 DEM clusters are marked as • and ∗. As can be seen, the number of nodes taken into account is significantly smaller than the total number of nodes in the considered network, but the identified communities are consistent with those given by NM. All presented results were obtained for β=0.4. Similar figures can be drawn for all networks. Fig. 3 presents the results obtained for the Dolphin social network. Arrows in the figure indicate nodes which were mismatched between the communities identified by the two clusterization methods. The number of clusters obtained from the two methods is generally different for all analysed networks, which is understandable in light of their different approaches. The question is whether nodes classified together by one method

All analysed networks are undirected and symmetric, two of them are weighted (HRN and FN), and the


Figure 3. Dolphin social network dendrogram (ʘ, ①, ⊕, ⊗ denote NM clusters; • and ∗ denote DEM clusters)

are seen as one community by the second method as well. For all analysed networks this is the case. The results are collected in Tab. 2. The second column contains the number of communities identified by the two applied methods, NM and DEM. The third column gives the ratio of the number of nodes classified by the differential equation method, NDEM, to the total number of nodes in the network, N. In most cases, the only difference between the results of the two methods is a split of one cluster into two or more. The fraction of nodes which break this rule is given in column F.

Generally speaking, where such a comparison is possible, a high agreement between the methods is obtained. In three cases only single nodes were exchanged between clusters (FN, DSN and PBN). In one case (ZKCN) all classified nodes were assigned in a consistent way. In the case of the Dolphin social network and the Bernard&Killworth fraternity network, three of the clusters identified by NM were joined into one community by DEM. In the case of the books about US politics network, the nodes classified to two NM clusters were also split into two DEM communities, but in a slightly different way. In the case of the Bernard&Killworth ham radio and the jazz musicians networks, all nodes between which edges were not cancelled during the evolution of the connectivity matrix were classified into one community. Because of that, a comparison with NM clusters is impossible.

Nodes which are separated from clusters by DEM fall into two categories. One category covers nodes whose degrees are very low in comparison to other nodes in the network. The other category seems to be more interesting. Most clusterization algorithms are not able to take into account the overlapping of communities. This is a serious disadvantage, as in a network describing real interactions communities are rarely separated from each other. Usually one can point out nodes which mediate between two or more communities. Fig. 4 shows the books about US politics network. According to Tab. 2, DEM assigned 89 nodes of PBN, and the remaining 16 nodes, marked in white in Fig. 4, were not classified to any community. Some of those nodes can be seen as a bond between two (or more) communities.

Table 2. Clustering methods results

NETWORK                        # CLUSTERS (NM)   # CLUSTERS (DEM)   NDEM/N    F
Zachary's karate club          3                 2                  13/34     0/13
Bernard&Killworth ham radio    7                 1                  19/44     -
Bernard&Killworth fraternity   6                 3*                 23/58     1/23
Dolphin social network         4                 2*                 44/62     2/44
Books about US politics        4                 4**                89/105    1/89
Jazz musicians network         4                 1                  100/198   -

* 3 NM clusters -> 1 DEM cluster
** 2 NM clusters -> 2 DEM clusters


non-hubs. Hubs are further divided into provincial hubs (connected mainly with nodes within the same community), connector hubs (connected to most of the other communities), and kinless hubs (connected homogeneously with the other communities). Non-hub nodes are divided into four groups: ultra-peripheral nodes (connected only with nodes within the same community), peripheral nodes (most edges within the same community), non-hub connector nodes (many edges to the other communities), and non-hub kinless nodes (edges homogeneously distributed among all communities) [13]. According to this classification, no hubs are observed in the networks analysed here, except in the books about politics network. In the case of the Zachary’s karate club, the books about politics (see Fig. 5), the dolphins and the Jazz musicians networks, the whole set of nodes is divided approximately equally into ultra-peripheral and peripheral nodes, with only single nodes being classified as non-hub connectors. A different situation is observed in the case of the fraternity network (Fig. 6), where all nodes are classified either as non-hub connector nodes or as peripheral nodes situated close to the non-hub connector region in the z-P parameter space. Intermediate behaviour is observed for the Ham radio network, where nodes are split into ultra-peripheral, peripheral, and non-hub connector groups. Such an analysis allows for a social interpretation of a network (which is not the subject of this work).
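The seven roles can be expressed as a simple lookup in the z-P parameter space, as sketched below. The boundary values are not given in this text; the ones used here are those commonly quoted from [13] and should be checked against that reference. The function name is hypothetical.

```python
def node_role(z, P):
    """Assign one of the seven roles from the within-community degree z
    and the participation coefficient P.

    Boundary values are an assumption taken from the scheme of [13];
    they are not stated in the present text.
    """
    if z >= 2.5:                      # hubs
        if P <= 0.30:
            return "provincial hub"
        if P <= 0.75:
            return "connector hub"
        return "kinless hub"
    if P <= 0.05:                     # non-hubs
        return "ultra-peripheral"
    if P <= 0.62:
        return "peripheral"
    if P <= 0.80:
        return "non-hub connector"
    return "non-hub kinless"

roles = [node_role(0.0, 0.0), node_role(1.0, 0.7), node_role(3.0, 0.5)]
```

With these boundaries a node whose edges all stay inside its own community (P = 0) is ultra-peripheral, regardless of how the rest of the network is organized.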

Figure 4. Books about US politics network

Although the degrees of those nodes are high, they are not classified to any community, because they are connected with a few nodes from different communities. In [13], a new method of examining networks was proposed. In particular, seven different roles of nodes were defined. The role of a node is determined by its position in the z-P parameter space, where z is the within-community degree of the node and P is the participation coefficient. Those two values are defined as follows [13]:

$$z_i = \frac{\kappa_i - \bar{\kappa}_{s_i}}{\sigma_{\kappa_{s_i}}} \quad \text{and} \quad P_i = 1 - \sum_{s=1}^{N_m} \left( \frac{\kappa_{is}}{k_i} \right)^2 \tag{3}$$

where:
$\kappa_i$ – the number of links of node i to other nodes in its community $s_i$,
$\bar{\kappa}_{s_i}$ – the average of κ over all the nodes in $s_i$,
$\sigma_{\kappa_{s_i}}$ – the standard deviation of κ in $s_i$,
$\kappa_{is}$ – the number of links of node i to nodes in module s,
$k_i$ – the total degree of node i,
$N_m$ – the number of communities.
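The two quantities of Eq. 3 can be computed from the connectivity matrix and a community assignment as in the following sketch. It assumes that links are counted without weights, that the standard deviation is the population one, that z is set to 0 when all nodes of a community have the same κ, and that nodes without links get P = 0; none of these conventions is specified in the text, and the function name is hypothetical.

```python
import numpy as np

def z_and_P(A, labels):
    """Within-community degree z_i and participation coefficient P_i."""
    A = (np.asarray(A) > 0).astype(float)    # count links, ignoring weights
    labels = np.asarray(labels)
    communities = np.unique(labels)          # sorted community labels
    n = len(labels)
    # kappa[i, s] = number of links of node i to nodes in community s
    kappa = np.stack([A[:, labels == c].sum(axis=1) for c in communities], axis=1)
    k = kappa.sum(axis=1)                    # total degree k_i
    own = np.searchsorted(communities, labels)
    kappa_own = kappa[np.arange(n), own]     # links inside the node's own community
    z = np.zeros(n)
    for c in communities:
        members = labels == c
        sigma = kappa_own[members].std()     # population standard deviation
        if sigma > 0:
            z[members] = (kappa_own[members] - kappa_own[members].mean()) / sigma
    safe_k = np.where(k > 0, k, 1.0)         # avoid division by zero
    P = 1.0 - ((kappa / safe_k[:, None]) ** 2).sum(axis=1)
    P = np.where(k > 0, P, 0.0)              # convention for isolated nodes
    return z, P

# Toy example: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
T = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    T[i, j] = T[j, i] = 1.0
z, P = z_and_P(T, [0, 0, 0, 1, 1, 1])
```

In the toy network node 0 has all its links inside its own community, so P = 0, while node 2 has two links inside and one outside, so P = 1 - (4/9 + 1/9) = 4/9.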

Figure 5. Books about US politics network (∗-nodes classified by DEM, ♦-nodes not classified by DEM; borderlines after [13])

The value of zi expresses how strongly node i is connected to other nodes in its community, while the value of Pi measures the distribution of the edges of node i among all the communities. The first division is specified by the z-value, and classifies nodes as hubs or


[2] J. Scott, “Social Network Analysis”, SAGE Publications, London, 2005.
[3] S. Fortunato, and C. Castellano, “Community structure in graphs”, Springer, Encyclopedia of Complexity and System Science, 2008.
[4] K. Kułakowski, “Some recent attempts to simulate the Heider balance problem”, Computing in Science and Engineering 9, 2007, pp. 86-91.
[5] M.J. Krawczyk, “Differential equations as a tool for community identification”, Phys. Rev. E 77, 2008, pp. 065701(R):1-4.
[6] M.E.J. Newman, “Analysis of weighted networks”, Phys. Rev. E 70, 2004, pp. 056131:1-9.

Figure 6. Fraternity network (∗-nodes classified by DEM, ♦-nodes not classified by DEM; borderlines after [13])

[7] W.W. Zachary, “An information flow model for conflict and fission in small groups”, Journal of Anthropological Research 33, 1977, pp. 452-473.

5. Conclusions

[8] H. Bernard, P. Killworth, and L. Sailer, “Informant accuracy in social network data V”, Social Science Research 11, 1982, pp. 30-66.

Here we present results which show that the assignment of nodes to communities by the differential equation method and by the Newman method is consistent. If nodes were put together by the first method, they were also placed in one cluster by the second one. We also show that the cumulative degree distribution for all six analysed social networks is described by the stretched-exponential law. The positions of the nodes in the z-P parameter space are different for different networks. This fact reflects differences between the kinds of relations which were the source of the network formation.

[9] P. Killworth, and H. Bernard, “Informant accuracy in social network data III”, Social Networks 2, 1979, pp. 19-46.
[10] http://www.orgnet.com/ Copyright©2009, Valdis Krebs
[11] D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson, “The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations”, Behavioral Ecology and Sociobiology 54, 2003, pp. 396-405.
[12] P. Gleiser and L. Danon, “Community structure in Jazz”, Adv. Complex Syst. 6, 2003, pp. 565-573.

Acknowledgements. The research is partially supported within the FP7 project SOCIONICAL, No. 231288

[13] R. Guimera, and L.A. Nunes Amaral, “Functional cartography of complex metabolic networks”, Nature 433, 2005, pp. 895-900.

7. References

[1] S. Wasserman, and K. Faust, “Social Network Analysis: Methods And Applications”, Cambridge University Press, Cambridge, 1999.
