Focused Community Discovery

Kirsten Hildrum
IBM T.J. Watson Research Center

Philip S. Yu
IBM T.J. Watson Research Center

E-mail: {hildrum,psyu}@us.ibm.com

Abstract

We present a new approach to community discovery. Community discovery usually partitions the entire graph into communities or clusters. Focused community discovery instead allows the searcher to specify start points of interest and finds the community around those points. Focusing the search allows for a much more scalable algorithm whose running time depends only on the size of the community, not on the number of nodes in the graph, and so it scales to arbitrarily large graphs. Furthermore, our algorithm is robust to imperfect data, such as extra or missing edges in the graph. We show the effectiveness of our algorithm on both synthetic graphs and the real-life Livejournal friends graph, a publicly-available social network of over two million users and 13 million edges.

1 Introduction

This paper deals with social networks that can be described by an interaction graph in which there is an edge between two entities if they have, for example, coauthored a paper, shared a phone call, or sent emails. Given such a graph and an entity of interest in that graph, our goal is to find the community containing that entity. Our notion of community tries to capture the way information flows in social networks, so we use a notion of community that includes entities that are not directly connected, with the idea that if there are enough paths between them, information spreads from one to the other rather quickly. Community search should be:

• Focused. Many approaches work on the entire graph rather than focusing on a particular point of interest. In contrast, we give the community around a specific individual, allowing our algorithm to scale to graphs of arbitrary size.

• Scalable. Any algorithm for focused community search must work in scenarios with millions or tens of millions of entities. Thus, the work required should depend on the size of the community returned and not on the number of entities.

• Robust. A good algorithm must not be confused by links representing interactions that have nothing to do with the community of interest. It must also be robust to links missing due to unobserved data or due to indirect community structures (criminals hiding their interaction by dealing via a third party).

In this paper, we present an algorithm that meets all three of these requirements.

Identifying communities. Informally, a community is a group of entities that belong together. Real-life communities are formed by people working together, sharing a hobby, living near each other, and so on. Making this intuition mathematical is difficult. We discuss two basic directions: a distance-based approach and a cut-based approach.

The simplest approach to focused community discovery around a starting point R is to return all nodes with direct links to R. This is the approach taken by Cortes, Pregibon, and Volinsky [2] and by Aiello et al. [1] for network traffic. While scalable and focused, this is not robust. First, some neighbors of R result from essentially random interactions and do not reflect any sort of community at all. Second, not all members of a community are connected to all other members. A natural extension of this approach is to look at entities within a certain distance of the start entity. This increases the number of relevant entities found, but also increases the number of irrelevant entities found, thus increasing recall at the cost of reduced precision. Furthermore, because social network graphs tend to have a small diameter, this sort of expanding ring search is likely to be quite expensive.

We want a definition that incorporates the strength of the connection rather than just the distance, so we use a notion of a cut. The cut between two nodes is the minimum number of edges that must be cut in order to separate those two nodes. If information "flows" along edges in the social network graph, then the size of the cut measures how much information can move from one entity to another. To find the community around R, we draw a "circle" around the starting point R that cuts as few edges as possible, relative to the size of the circle we draw. Given a set S, let c(S, S̄) be the number of edges leaving S.

[Figure 1. Left: A community. Right: The core and the fringe.]

Flake, Lawrence, and Giles [3] define a community as a group with more links to themselves than to the rest of the graph, and use a minimum cut algorithm to find it. This technique is limited, as shown in Figure 1 (right). Though the six pictured nodes form a natural community (one even meeting their definition), the minimum cut is one that cuts one node off from all the others. Later work by Flake, Tarjan, and Tsioutsiouliklis [4] expands on this idea using minimum cut trees, but this second approach is not focused. Ino, Kudo, and Nakamura [5] build on the first paper, but still do not find the community in Figure 1.

We take a different approach, seeking to minimize not c(S, S̄) but a normalized version of it. As we describe later, we need an additional condition to ensure that the community is well-connected to R. More formally:

Given: A graph G and a relevant set R.
Find: A set S such that R ⊂ S, nodes in S are well-connected to R, and c(S, S̄)/g(S) (where g(S) is a suitable normalizing function) is minimal.

We consider two normalizing factors:¹

• Expansion: The number of elements in S, denoted |S|.
• Degree: The total degree of all entities in S, denoted deg(S).

¹ Both require that |S| ≤ |S̄|.
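To make the definition concrete, the cut and the two normalized objectives can be computed directly from an adjacency-set representation of the graph. The following is a minimal illustrative sketch; the function names and the graph representation are ours, not the paper's code.

# Sketch: the cut c(S, S-bar) and the two normalized objectives.
# `graph` maps each node to the set of its neighbors.

def cut_size(graph, S):
    """Number of edges with exactly one endpoint in S, i.e. c(S, S-bar)."""
    return sum(1 for v in S for u in graph[v] if u not in S)

def expansion_objective(graph, S):
    """c(S, S-bar) normalized by the expansion factor |S|."""
    return cut_size(graph, S) / len(S)

def degree_objective(graph, S):
    """c(S, S-bar) normalized by the total degree deg(S)."""
    return cut_size(graph, S) / sum(len(graph[v]) for v in S)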

Our Contribution. We provide one of the first algorithms for focused community search. The core idea of our cut-based approach is to (1) devise an objective function and (2) introduce the fringe concept, a way to limit the number of nodes that need to be examined so the algorithm can work in an incremental, greedy fashion. We show that good communities can be found in an incremental, greedy way that can also be used for streaming graphs.

2 The Algorithm

Our cut-based algorithm greedily optimizes one of the three objectives. The algorithm maintains a core set, its current best guess as to the community. The fringe is then the set of nodes that are directly connected to the core but not in it. At each step, for every node in the core, the algorithm determines whether moving that node out of the core reduces the objective. If so, it makes the move. Likewise, for each node in the fringe, it determines whether moving the node into the core lowers the objective, and if so, moves it. We repeat this process, evaluating all the nodes in the core and the fringe, until no change occurs or a cycle is detected. This initial algorithm is shown in Figure 2.

This basic algorithm has two problems. Commonly, the vertices in R would be ejected in line 11, with the result that the returned set did not include the start set. We could force R to remain in the set, but that would not force the core to be a community around R. This drift away from the start points R happens when the core set is small. In those circumstances, a node with a tenuous connection to R can enter the core and then pull in nodes with a force equal to R's, having a substantial impact on which other nodes enter. The result is a good core according to the objective function, but not a core containing R. Because of this sensitivity to the early nodes, the ordering of the vertices substantially influences the end set.

To deal with this, we made two small changes that increase the influence of R over the final core set. We considered imposing distance constraints, but that would have introduced the problems mentioned in the introduction. Instead, we first slowed down the algorithm: rather than allowing all changes, in each pass through the nodes in the core and the fringe, the algorithm now calculates the value of each change and then makes only a few of the top-valued changes. Second, in order to move a node in, we now also require that the node be connected to some minimal fraction of the nodes inside the core. The algorithm is not very sensitive to this choice, so long as it is non-zero; we chose 10%.
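As a concrete illustration, a Python sketch of the refined procedure follows. It assumes the adjacency-set graph and objective functions from the earlier sketch; the batch size and the 10% threshold follow the description above, but every name and implementation detail is our own illustration, not the authors' implementation.

def find_community(graph, R, obj, moves_per_pass=3,
                   min_core_fraction=0.10, max_passes=1000):
    """Greedy focused community search: a sketch of the refined algorithm.

    graph: dict mapping node -> set of neighbor nodes
    R: iterable of starting nodes
    obj: objective function, e.g. degree_objective from the sketch above
    """
    core = set(R)
    for _ in range(max_passes):  # guard against cycling
        fringe = {u for v in core for u in graph[v]} - core
        current = obj(graph, core)
        moves = []  # (objective value after the move, move kind, node)
        for v in core:
            if len(core) > 1:
                moves.append((obj(graph, core - {v}), 'remove', v))
        for v in fringe:
            # A fringe node must touch a minimal fraction of the core, so a
            # node with a tenuous connection cannot pull the core away from R.
            if len(graph[v] & core) >= min_core_fraction * len(core):
                moves.append((obj(graph, core | {v}), 'add', v))
        improving = [m for m in moves if m[0] < current]
        if not improving:
            break
        # The "slow down" step: apply only a few of the top-valued changes,
        # all valued against the core as it stood at the start of the pass.
        improving.sort(key=lambda m: m[0])
        for _, kind, v in improving[:moves_per_pass]:
            if kind == 'remove':
                core.discard(v)
            else:
                core.add(v)
    return core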

FIND-COMMUNITY(R)
 1  core ← R
 2  changed ← true
 3  while changed
 4  do
 5      changed ← false
 6      fringe ← neighbors(core)
 7      for each v in core
 8      do
 9          if obj(core − {v}) < obj(core)
10          then
11              core ← core − {v}
12              changed ← true
13      for each v in fringe
14      do
15          if obj(core ∪ {v}) < obj(core)
16          then
17              core ← core ∪ {v}
18              changed ← true
19  return core

Figure 2. The algorithm skeleton.

3 Performance

We run the algorithm, with the different objective functions, on both synthetic data and real data, and show that the precision and recall are good, the algorithm is scalable, and it is robust.

To generate synthetic graphs for these tests, the vertices were probabilistically divided into clusters of a specified size. Edges were then generated at random, with one probability for an edge between two nodes sharing a cluster and another for two nodes in different clusters. By construction, we know what the community of any node should be.
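A generator along these lines could look like the following sketch. The parameter names are ours, the sketch assigns fixed equal-size clusters rather than probabilistic ones for simplicity, and for large graphs one would sample edges more cleverly than this quadratic loop.

import random

def planted_partition_graph(num_nodes, cluster_size, p_in, p_out):
    """Sketch of a planted-partition generator: nodes are divided into
    clusters, then each possible edge is added independently with
    probability p_in (same cluster) or p_out (different clusters)."""
    cluster_of = {v: v // cluster_size for v in range(num_nodes)}
    graph = {v: set() for v in range(num_nodes)}
    for u in range(num_nodes):
        for v in range(u + 1, num_nodes):
            p = p_in if cluster_of[u] == cluster_of[v] else p_out
            if random.random() < p:
                graph[u].add(v)
                graph[v].add(u)
    return graph, cluster_of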

We test the precision and recall of our algorithm for the three normalization functions under our approach. The results are shown in Figure 3. Along the x-axis, we vary the ratio of expected external edges to expected internal edges. When there are very few external edges (at the extreme, no external edges and all internal edges), the algorithm should perform better than when the number of external edges is higher.²

² It may at first seem more natural to graph the ratio of the probability of an in-cluster edge to the probability of an out-cluster edge. However, since there are many, many more nodes outside the cluster, the effect of a given ratio depends on the size of the graph. The number of external edges is a function of the out-cluster edge probability and the number of out-cluster nodes. For example, if in-cluster edges are twice as likely as out-cluster edges, then for a size-100 cluster in a size-10,000 graph, there are 50 times more edges leaving the cluster than staying inside it.

Notice that for external-to-internal edge ratios less than two, the degree objective performed very well, with the expansion objective performing less well, but still with precision and recall greater than 50%. This is still much better than the direct-link-only approach, which gives a recall of 50% and a precision of 1/(1 + x), and so ranges from 91% down to 16%. Below, we examine the results a little more deeply.

For the expansion objective, while some runs return the exact, correct result, other runs drift entirely away, returning a community sharing only the starting point. This is why the median and mean lines are different. For each setting of the external-to-internal edge ratio, most of the runs are either exact matches or have drifted entirely away from the starting set R (meaning that the found community and the planted community agreed only in R).
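For reference, precision and recall against the planted cluster, and the direct-link baseline discussed above, can be computed as in this small sketch (the helper names are ours):

def precision_recall(found, planted):
    """Precision and recall of the found community against the planted one."""
    found, planted = set(found), set(planted)
    overlap = len(found & planted)
    return overlap / len(found), overlap / len(planted)

def direct_link_baseline(graph, r):
    """The direct-link-only 'community': r together with its neighbors."""
    return {r} | graph[r]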

One problem with greedy approaches is getting caught in a local minimum. In our case, this would mean that the objective value of the found cut is much higher than the objective value of the planted cut. While the ratio between the found objective and the planted objective is not always one, it is usually quite close.³ The ratios hover around 1, and almost all are less than 1.1. In some cases, the community found has a lower cut objective than the planted one.

³ We do not know the global minimum in this case.

The running time of this algorithm depends only on the community size and the number of links of community members, and so it scales to arbitrarily large graphs (see Figure 3 (right)).
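Figure 3 (right) reports work as the number of neighbor queries rather than wall-clock time. A counting wrapper of the kind below (our own instrumentation sketch, not the paper's harness) is one way to obtain such a number:

class CountingGraph(dict):
    """Wrap an adjacency dict and count neighbor lookups (queries)."""
    def __init__(self, graph):
        super().__init__(graph)
        self.queries = 0

    def __getitem__(self, v):
        self.queries += 1
        return super().__getitem__(v)

# Usage outline:
#   g = CountingGraph(graph)
#   community = find_community(g, {r}, degree_objective)
#   print(g.queries)  # work done, independent of the total graph size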

3.1 Livejournal data

Livejournal is a web log (blog) site with over four million users. Bloggers using the site can list other users as "friends," and these links appear on public user profile pages. At the time of our crawl, the graph resulting from the friends relation had about 2.2 million users and 13 million edges. Most users have a very small degree, with a median degree of 5 and a mean degree of 12.4, but some users have a much higher degree. Using this data, we first measured robustness to missing and extraneous links. Second, we used responses from users to gauge precision and recall.

Robustness. Here, we view the community around the start point as the ground truth, and then add or remove edges between the starting point R and random nodes in the graph. Figure 4 (left) shows the precision and recall of the new community, using the original as the ground truth.⁴ The x-axis shows the number of random edges added to the starting point, and the y-axis shows the precision and recall, averaged over the relevant users. The results show that the community returned is fairly robust to these additional edges. We use the degree objective for these tests.

⁴ We chose these users by sampling 100 random Livejournal users and then picking those that had a community size greater than 20 (a total of 15 users).

Figure 4 (center) shows the precision and recall as links are removed. The graph shows that information lost by removing links from the start point is recovered via the other community members. Figure 4 (right) shows the precision and recall in one particularly robust (but not unusual) community as edges are removed. The precision and recall are stable at one until about 80 neighbors have been removed (62%), and then fluctuate until about 105 (81%) of the direct links have been removed.
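In outline, this perturbation experiment can be reproduced as in the following sketch, reusing the earlier helpers; the sampling details and names are our own assumptions:

import random

def add_random_edges(graph, r, k):
    """Copy `graph` and add k extra edges from r to random non-neighbors."""
    g = {v: set(nbrs) for v, nbrs in graph.items()}
    candidates = [v for v in g if v != r and v not in g[r]]
    for v in random.sample(candidates, k):
        g[r].add(v)
        g[v].add(r)
    return g

# Usage outline: the unperturbed community serves as ground truth.
#   original  = find_community(graph, {r}, degree_objective)
#   perturbed = find_community(add_random_edges(graph, r, 50), {r},
#                              degree_objective)
#   precision, recall = precision_recall(perturbed, original)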


[Figure 3. Left: the precision and recall of the three optimization functions (expansion and degree objectives; median and mean precision and recall vs. the external-to-internal edge ratio, 0 to 5; clusters of size 50, within-cluster probability 0.5, total degree held constant). Right: scaling of the degree objective (mean and median number of neighbor queries vs. number of nodes, up to 100,000).]

[Figure 4. Left and center: robustness to extra and missing edges (mean and median precision and recall vs. edges added or removed, as a fraction of the total). Right: precision and recall of a single community as edges are removed (0 to 140 removals).]

Precision and Recall. To measure precision and recall, we requested volunteers on a bulletin board on the Livejournal website. Volunteers have a higher-than-average degree. They are frequently part of several communities (real-life friends, groups from bulletin boards, etc.) but also typically have non-community-related links. The twelve volunteers were presented with an alphabetical list of the community returned for both the degree and expansion metrics. In about three quarters of the cases, the user was able to identify either a community (such as "ex-Oxford role-playing group") or a combination of several communities. Among those cases in which a community was found, the average precision was 81% and the average recall was 94%. In several cases, the group returned consisted of more than one community; in these cases, we counted results from any community as valid. These figures probably understate the precision, since some returned users may be part of a community but not recognized by the start point. (On their own initiative, some users looked up profiles of the unrecognized users and told us which ones did belong despite being unrecognized.) The results probably overstate the recall, since users may not be able to remember who is missing. The expansion metric performed poorly in general, but produced interesting results in a few cases. We believe this is a result of its heavy bias towards low-degree nodes.

4 Conclusion

We have given a focused, scalable, cut-based community search algorithm and shown that it can recover communities planted in synthetic data. In addition, we have shown that on real data it is robust to both missing and extraneous edges, and, via a user study, that it returns real communities using only the link structure of the graph.

References

[1] B. Aiello, C. Kalmanek, P. McDaniel, S. Sen, O. Spatscheck, and J. Van der Merwe. Analysis of communities of interest in data networks. In Proceedings of PAM, 2005.

[2] C. Cortes, D. Pregibon, and C. Volinsky. Communities of interest. In Proceedings of IDA 2001, 2001.

[3] G. Flake, S. Lawrence, and C. L. Giles. Efficient identification of web communities. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 150-160, Boston, MA, August 2000.

[4] G. Flake, R. Tarjan, and K. Tsioutsiouliklis. Graph clustering and minimum cut trees. Journal of Internet Mathematics.

[5] H. Ino, M. Kudo, and A. Nakamura. Partitioning of web graphs by community topology. In WWW '05: Proceedings of the 14th International Conference on World Wide Web, pages 661-669, 2005.