OASNET: An Optimal Allocation Approach to Influence Maximization in Modular Social Networks

Tianyu Cao 1, Xindong Wu 1,2, Song Wang 1, Xiaohua Hu 3

1 Department of Computer Science, University of Vermont, USA
2 School of Computer Science and Information Engineering, Hefei University of Technology, China
3 College of Information Science and Technology, Drexel University, USA

[email protected]; [email protected]; [email protected]; [email protected]

ABSTRACT
Influence maximization in a social network is the problem of targeting a given number of nodes in the network such that the expected number of nodes activated from these seeds is maximized. A social network usually exhibits some degree of modularity. Previous research efforts that made use of this topological property are restricted to random networks with two communities. In this paper, we first transform the influence maximization problem in a modular social network into an optimal resource allocation problem in the same network, under the assumption that the communities of the social network are disconnected. We then propose a recursive relation for finding such an optimal allocation. We prove that finding an optimal allocation in a modular social network is NP-hard and propose a new optimal dynamic programming algorithm to solve the problem. We name our new algorithm OASNET (Optimal Allocation in a Social NETwork). We compare OASNET with equal allocation, proportional allocation, random allocation and selecting top-degree nodes without any allocation strategy on both synthetic and real-world datasets. Experimental results show that OASNET outperforms these four heuristics.

Categories and Subject Descriptors H.2.8 [Database Management]: Database Applications—data mining

General Terms Algorithm, Performance, Experimentation

Keywords Influence Maximization, Modular Social Network, Optimal Allocation

1. INTRODUCTION

Social networks are ubiquitous in the spread of opinions, ideas, innovations and recommendations.


In a social network, most people are influenced by their family and friends because they usually consult them before making decisions. Understanding this dynamic process gives us better insight into human social behavior, for instance how new products and fashions are adopted by society, and it also suggests how to make use of this social influence. Such insight gave birth to a new kind of marketing strategy, "viral marketing" [13], which can be explained with a simple example. A camera company sends its cameras to some test users. The test users recommend the product to their friends, and some of their friends try the camera and even recommend it to their own friends. This process goes on until no more friends make recommendations. Under this scenario, the influence maximization problem is how the company should select the test users so that the number of potential camera buyers is maximized. This problem was first introduced by Domingos and Richardson [3][13]. More formally, it is defined as extracting a set of k nodes to target for initial activation such that it yields the largest expected spread of information, where k is a given positive integer.
The influence maximization problem has attracted a lot of attention in the research community recently. In [8] the authors gave a detailed proof of the NP-completeness of this problem and proposed a greedy hill-climbing approach to solve it. The algorithm obtains a (1 - 1/e - ε) approximation to the real optimal solution. This greedy strategy, however, exhibits two major problems. First, greedy hill climbing does not make use of topological properties of the network such as degree distribution, modularity, or motifs. Second, it needs to simulate the diffusion process many times before the expected number of activations converges, so it is inefficient. In [4] the authors employed a set-cover greedy algorithm that does not involve simulation over the network. It defines the neighbors of a node as the nodes within a certain distance of it, and repeatedly selects the node that covers the highest number of uncovered nodes. While this method is much faster than the greedy algorithm in [8], the diffusion size is not guaranteed to be within a certain ratio of the real optimal solution, so its performance is less stable than that of the greedy algorithm in [8]. In [10] the authors used the equivalence between the bond percolation process and the independent cascade model, which was first discovered in [12]. They performed the bond percolation process several times on the network to obtain a group of sampled networks; by decomposing these sampled networks into strongly connected components, they save time when calculating the marginal gain in the greedy hill-climbing algorithm. To the best of our knowledge, [10]'s approach is the most efficient implementation of the greedy hill-climbing heuristic.

As far as we know, Galstyan et al. [5] were the first to make use of a network topological property to solve the influence maximization problem. They point out that under critical-mass transition models, greedy hill climbing might lead to poor performance, and they analyze how degree scale influences the targeting strategy. Although [5] assumes a community structure, the analysis is restricted to random networks composed of only two communities, and the diffusion model considered is a critical-mass transition model. A dual of the influence maximization problem is the problem of minimizing the spread of contamination, which was proposed and addressed in [9]. Based on the similarity to the influence maximization problem, Kimura et al. [9] proposed a greedy hill-climbing algorithm and, much as in [10], made use of the equivalence between the bond percolation process and the independent cascade model to improve the performance of the greedy algorithm.
In this paper we make use of the modularity of a network in a different way, and the information diffusion model is not restricted to critical-mass models. Our primary contributions are summarized as follows:
1. We view the influence maximization problem in modular social networks as a resource allocation problem. The problem is hence transformed into how to effectively allocate the initial active nodes to the communities of the social network such that the expected number of active nodes is maximized;
2. We derive a recursive relationship for optimal allocation in a social network based on the assumption that all the communities are disconnected, and prove that this optimal allocation problem is NP-hard;
3. We give a dynamic programming algorithm to solve the optimal resource allocation problem; and
4. We evaluate our proposed algorithm OASNET on both synthetic and real-world networks. Experimental results show that OASNET outperforms heuristics such as equal allocation, proportional allocation, random allocation and selecting top-degree nodes without an allocation strategy.
The rest of the paper is organized as follows. Section 2 explains information diffusion models. Section 3 introduces the growth function F(k, G, M, Γ), the recursive relationship of the optimal allocation, the proof of the NP-hardness of the optimal allocation problem, and the algorithm that solves the influence maximization problem on modular networks. Section 4 shows the results of our experiments on both synthetic and real-world networks. We conclude in Section 5.

2. INFORMATION DIFFUSION MODELS

There are two main kinds of information diffusion models: the independent cascade model and the linear threshold model. Many variants have been proposed based on these two models.
The independent cascade model is a probabilistic information diffusion model [7, 8]. The process starts with an initial set of active nodes S and unfolds in discrete steps. In step t, an active node u has a single chance to activate its neighbor v with probability p(u, v), which is a user-specified parameter. Whether this trial succeeds or fails, node u will not attempt to activate v again. The process terminates when no more activations are possible (i.e., all the active nodes have used their trials).

In the simplest prototype of the independent cascade model, the parameter p(u, v) is set to be the same for all edges in the network.
The linear threshold model [6, 8] relies on the assumption that a node u becomes active if the fraction of its activated neighbors is larger than a certain threshold θ_u. In a more general case, each neighbor v may have a different weight w(u, v) in node u's decision; then node u becomes active if the sum of the weights of its activated neighbors is at least the given threshold θ_u, i.e., Σ_v w(u, v) ≥ θ_u, where v ranges over the activated neighbors of u. The threshold of each node is usually assigned at random, uniformly distributed between 0 and 1, although it is sometimes hardwired to a constant for all nodes. Given an initial set of active nodes S, the process of activation then unfolds deterministically in discrete time steps. At time step t, if the sum of the weights of u's activated neighbors is at least θ_u, u becomes active; otherwise u stays inactive. The process terminates when there are no more activations, i.e., when no inactive node's activated neighbors meet its threshold.
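To make the two models concrete, the following is a minimal Python sketch of a single run of each diffusion process on a networkx graph. It is our own illustration of the simplest settings above (a uniform edge probability p and uniform-random thresholds); the function names are not from the paper.

import random
import networkx as nx

def independent_cascade(G, seeds, p=0.1):
    # One run of the independent cascade model with a uniform edge probability p.
    active = set(seeds)
    frontier = set(seeds)                 # nodes that still hold their single activation chance
    while frontier:
        newly_active = set()
        for u in frontier:
            for v in G.neighbors(u):
                if v not in active and random.random() < p:
                    newly_active.add(v)   # u's one and only attempt on v succeeded
        active |= newly_active
        frontier = newly_active           # exhausted nodes never try again
    return active

def linear_threshold(G, seeds, thresholds=None):
    # Deterministic linear threshold model with uniform weights: a node activates
    # once the fraction of its active neighbors reaches its threshold.
    if thresholds is None:
        thresholds = {v: random.random() for v in G}   # thresholds uniform in [0, 1]
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v in G:
            if v in active or G.degree(v) == 0:
                continue
            frac = sum(1 for u in G.neighbors(v) if u in active) / G.degree(v)
            if frac >= thresholds[v]:
                active.add(v)
                changed = True
    return active

# Example: len(independent_cascade(nx.barabasi_albert_graph(1000, 3), seeds=[0, 1, 2]))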

3. GROWTH FUNCTION

Our optimization technique is based on a growth function, which we introduce in this section. A growth function F(k, G, M, Γ) maps four input parameters to the expected number of active nodes when the diffusion process terminates, where k is the number of initial active nodes, G is the network, M is the diffusion model and Γ is the strategy for selecting the initial active nodes. In [8] there is a similar function f(S) that maps the set of initial active nodes S to the expected number of active nodes when the diffusion terminates; our growth function is an extension of f(S). In [8] the authors proved that f(S) is submodular and monotonically non-decreasing with respect to S. In our growth function F(k, G, M, Γ), the strategy Γ is a function that maps (k, G(V, E)) to a set of initial active nodes S such that S is a subset of V and |S| = k. The growth function is monotonically non-decreasing with respect to k for any reasonable strategy Γ. Reasonable strategies include the k-step greedy hill climbing proposed in [8] and the strategies we consider in this paper, such as top-k degree nodes, k random nodes, and the optimal k nodes for the influence maximization problem. There are, however, strategies that do not guarantee this property: for example, one could choose the top-k degree nodes when k is even and the lowest-k degree nodes otherwise, in which case the growth function is unlikely to be monotonically non-decreasing. In general, deriving a growth function analytically is not easy; usually we need to simulate the diffusion process on the network many times to approximate it. This is explained further in Section 4.
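As a concrete illustration, the sketch below approximates a growth function by averaging the diffusion size over repeated simulations. The names estimate_growth and top_k_degree, the model/strategy arguments and the default number of runs are our own assumptions; any simulator, such as the independent cascade sketch in Section 2, could be passed in as model.

def top_k_degree(k, G):
    # The top-k degree strategy Gamma: map (k, G) to a seed set of size k.
    return sorted(G.nodes, key=G.degree, reverse=True)[:k]

def estimate_growth(k, G, model, strategy, runs=10000):
    # Monte Carlo estimate of F(k, G, M, Gamma): average diffusion size over many runs.
    seeds = strategy(k, G)
    total = 0
    for _ in range(runs):
        total += len(model(G, seeds))     # one stochastic run of the diffusion model M
    return total / runs

# Example (hypothetical): estimate_growth(30, G, independent_cascade, top_k_degree, runs=1000)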

3.1 Optimal Allocation of Initial Active Seeds

In this subsection, we transform the influence maximization problem into an optimal resource allocation problem in a modular social network. Many networks exhibit a modular property, meaning that the network is composed of communities: the connections within each community are strong and the connections between communities are weak. We therefore make the following assumption: a network G is composed of a set of communities G1, G2, ..., Gn, and these communities are all disconnected.

This means there is no edge between a node in Gi and a node in Gj for i ≠ j. Given this strong assumption, there exists a growth function Fi(k, Gi, Mi, Γi) for each community, and the growth functions of different communities are independent of each other. We can now transform the influence maximization problem into an optimal resource allocation problem in a more formal way: the optimal allocation problem is how to allocate the k initial active nodes to the communities G1, G2, ..., Gn such that the expected diffusion size is maximized. Let d be the diffusion size for the whole network, Fi(ki, Gi, Mi, Γi) the growth function of community i, and ki the number of initial active nodes allocated to community i. Because all the communities are disconnected, we have

d = \sum_{i=1}^{n} F_i(k_i, G_i, M_i, \Gamma_i)    (1)

such that

\sum_{i=1}^{n} k_i = k    (2)

The motivation for introducing Mi and Γi is as follows. In a real social network, there are sub-groups that share a common interest in certain topics. For example, teenage boys are likely to form a group with a shared interest in games, so a game-related topic has a higher diffusion probability within this community than in others. In this sense, the diffusion model and the diffusion probability might differ across groups.

We now return to the problem of allocating k among the communities. The growth function Fi(ki, Gi, Mi, Γi) is a discrete function. We first present a recursive relationship for an optimal allocation and then explain it:

OPT(k, n) = \max_{0 \le i \le k} \{ F(i, G_n, M_n, \Gamma_n) + OPT(k - i, n - 1) \}    (3)

OPT(k, n) is the expected diffusion size of an optimal allocation of k initial active nodes to the first n communities. We make the allocation decisions in a backward manner. For the n-th community, we can allocate 0, 1, or up to k initial active nodes. Suppose the number of initial active nodes allocated to the n-th community is i; then (k - i) initial active nodes remain available for the first (n - 1) communities. If an optimal allocation indeed assigns i nodes to the n-th community, the problem reduces to allocating the remaining (k - i) initial active nodes to the first (n - 1) communities. Note that this gives an optimal sub-structure: it reduces the problem from OPT(k, n) to OPT(k - i, n - 1). Since we do not know how many initial active nodes an optimal allocation assigns to the n-th community, we exhaust all possibilities from 0 to k; the maximum diffusion size over these (k + 1) choices yields the optimal allocation. Solving this recursion naively takes exponential time, but dynamic programming reduces the running time to O(nk^2), which is pseudo-polynomial. We know that finding a best Γ on a network is NP-complete, as was proven in [8]; therefore the growth functions of the communities are themselves sub-optimal. Nevertheless, we can still build an optimal allocation on top of these sub-optimal functions. In this sense, the optimization technique can be summarized as global optimality built on local sub-optimality.

3.2 Proof of the NP Hardness of the Optimal Allocation Problem

We show that the optimal allocation problem is NP-hard by reducing the knapsack problem to it. Given an instance of the knapsack problem with n objects (wi, vi), the goal is to select a subset of objects such that the sum of the values of the selected objects is maximized while the weight constraint of the knapsack is met. Let S be the subset of objects that are selected; the problem can be formulated as

maximize  v = \sum_{object_i \in S} v_i    (4)

s.t.  \sum_{object_i \in S} w_i \le w    (5)

For each object i we create a growth function as in equation (6):

F(k_i, G_i, M_i, \Gamma_i) = \begin{cases} v_i & \text{if } k_i \ge w_i \\ 0 & \text{if } k_i < w_i \end{cases}    (6)

Substituting equation (6) into the left side of equation (4) gives

v = \sum_{object_i \in S} v_i = \sum_{i=1}^{n} F_i(k_i, G_i, M_i, \Gamma_i)    (7)

Here ki is the weight that we allocate to the i-th object. We can reformulate equation (5) as equation (8), which states that the sum of the weights allocated to the objects must not exceed the weight constraint:

\sum_{i=1}^{n} k_i \le w    (8)

In an optimal solution to equations (7) and (8), if \sum_{i=1}^{n} k_i < w, we can always convert it into another optimal solution with \sum_{i=1}^{n} k_i = w, because the function F(k_i, G_i, M_i, \Gamma_i) in equation (6) is non-decreasing with respect to k_i. Therefore equation (8) can be changed to the form \sum_{i=1}^{n} k_i = w. With equations (7) and (8), we have reduced an instance of the knapsack problem to our optimal allocation problem, so the optimal allocation problem is NP-hard. Fortunately, in the current problem settings k is usually small, so the running time of such an algorithm can still be fast in practice.
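Returning to recursion (3), the following is a minimal Python sketch of the dynamic programming solution. It assumes the growth functions have already been tabulated as a list F, where F[i][j] holds the estimated growth function value of the (i+1)-th community with j seeds for 0 ≤ j ≤ k; the table layout and the function name optimal_allocation are our own assumptions, not the authors' code.

def optimal_allocation(F, k):
    # Solve OPT(k, n) = max_{0<=i<=k} { F(i, G_n, M_n, Gamma_n) + OPT(k-i, n-1) }.
    n = len(F)
    # opt[c][j]: best expected diffusion size using j seeds on the first c communities
    opt = [[0.0] * (k + 1) for _ in range(n + 1)]
    choice = [[0] * (k + 1) for _ in range(n + 1)]    # argmax, kept for backtracking
    for c in range(1, n + 1):
        for j in range(k + 1):
            best, best_i = -1.0, 0
            for i in range(j + 1):                    # seeds given to community c
                val = F[c - 1][i] + opt[c - 1][j - i]
                if val > best:
                    best, best_i = val, i
            opt[c][j], choice[c][j] = best, best_i
    alloc, j = [0] * n, k                             # backtrack the per-community allocation
    for c in range(n, 0, -1):
        alloc[c - 1] = choice[c][j]
        j -= choice[c][j]
    return opt[n][k], alloc

The three nested loops give the O(nk^2) running time discussed above.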

3.3 Algorithm Design

We present the pseudocode of our dynamic programming based algorithm and its rationale in this section. The pseudocode is given in Algorithm 1. Lines 1 to 7 find the communities of the given network using an external community finding algorithm. In a very simple case, we could use a network's connected components as its community structure, but the size distribution of the connected components of a network is usually highly skewed: the largest component is often hundreds or even thousands of times the size of the second largest. Take the condensed matter physics collaboration 2005 network for example: the largest component has about 36458 nodes, while the second largest component has only 19 nodes. In this case the optimization technique would allocate all the initial active nodes to the largest component, which makes it useless. So on line 4 we use a community finding algorithm to cut the largest connected component into smaller communities. There are two potential problems here. First, whether the largest connected component exhibits modularity is unknown. Second, there is no guarantee that the size distribution of the newly found communities is not highly skewed. We have performed empirical tests on several standard social networks, and it turns out that the largest connected component exhibits a relatively high modularity and the size distribution is not highly skewed.

In our algorithm the communities are not guaranteed to be disconnected; usually they are connected. This somewhat violates the requirements of the optimal allocation method. The influence of the few edges that connect the communities is unknown; it might be minor or it might be large. Here we assume the influence is minor, and we return to this issue in later sections. Under this assumption, the computed allocation is an approximation to the real optimal allocation. Lines 9 to 13 calculate the growth functions mentioned earlier: Γ is the selection strategy, which can be greedy hill climbing, degree centrality, random selection, closeness or another heuristic; M is the diffusion model, either the linear threshold model or the independent cascade model; and j is the number of initial active nodes allocated to community G'_i. Once we have the growth functions, line 14 uses a subroutine to compute the optimal allocation; this subroutine uses dynamic programming to solve the optimal allocation recursion. Line 17 uses the allocation result to select the initial seeds.

3.3.1 Influence of Cross Community Edges and the Community Finding Algorithm

As stated above, in the settings of Algorithm 1 the communities are in most cases loosely connected by some edges. We now estimate the influence of these edges, analyzing only the case of the independent cascade model. The diffusion process of an independent cascade model corresponds to a bond percolation process on the network. Suppose the number of edges that connect different communities is α and the diffusion probability is p; then in expectation only αp of these edges are open during a bond percolation process. Depending on the distance between the border nodes and the initial active nodes within a community, a certain fraction of the nodes on the two sides of these αp edges are activated. Assume the fraction of nodes that can reach these edges is q. Then in total there are about αpq activations that cross communities. If this value is relatively large, the optimization technique might not work well. The value is heavily influenced by the external community finding algorithm, which determines α and has a great impact on q. We tested this value empirically on several datasets and found that whether αpq is relatively large differs across datasets. However, even on a dataset where this value is relatively large, the proposed algorithm still outperforms the degree heuristic, which is a reliable baseline for the influence maximization problem. At this point we do not have a clear explanation for this; it might be that the estimated growth function still roughly captures the slope of the real growth function of each community.
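As a rough illustration of this estimate, the sketch below counts the cross-community edges α for a given partition and returns αpq. The partition format (a dict mapping each node to a community id) and the reachable fraction q are our own illustrative assumptions.

def cross_community_estimate(G, partition, p, q):
    # alpha: edges whose endpoints lie in different communities of the partition.
    alpha = sum(1 for u, v in G.edges if partition[u] != partition[v])
    # In expectation alpha*p cross edges are open; a fraction q of endpoints can reach them.
    return alpha * p * q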

3.3.2 Time Complexity

The running time of Algorithm 1 can be decomposed into three major parts: the time used to find the community structure of the network, the time used to estimate the growth function of each community, and the time used to solve the optimal allocation. The running time of the first part is determined by the external community finding algorithm; the state-of-the-art algorithm [2] achieves O(n (log n)^2), where n is the number of nodes in the network. The time complexity of the second part is O(\sum_{i=1}^{N} k m_i s), where N is the number of communities, m_i is the number of edges in the i-th community, and s is the number of simulations needed to obtain a stable expected diffusion size. Note that O(\sum_{i=1}^{N} k m_i s) is the same as O(ks \sum_{i=1}^{N} m_i) = O(ksm), where m is the number of edges in the whole network. The time complexity of the third part is O(Nk^2). The overall time complexity of Algorithm 1 is dominated by the second part and is therefore O(kms). Compared with the greedy hill-climbing algorithm's complexity of O(knms), it scales better to large networks. However, Algorithm 1 costs more time than the comparison heuristics used in this paper.

Algorithm 1 Optimal Allocation based Influence Maximization
Input: A social network G, and a number k
Output: The set of initial target nodes
Method:
1: Let {G_i} = Connected-component-decomposition(G)
2: Sort {G_i} decreasingly by number of nodes
3: if (G_0.size() / G_1.size() > threshold) then
4:    Let {G'_i} = FindCommunity(G_0)
5: else
6:    Let {G'_i} = {G_i}
7: end if
8: Let N = {G'_i}.size()    // {G'_i} is the set of communities of the social network
9: for i = 1 to N do
10:    for j = 1 to k do
11:       Calculate F_i(j, G'_i, M_i, Γ_i) by simulating the diffusion process on G'_i many times
12:    end for
13: end for
14: Let allocation = optimalallocation(F_i, k)    // this subroutine solves the optimal allocation recursion by dynamic programming
15: Let initialseeds = {}
16: for i = 1 to allocation.length do
17:    initialseeds.add(Γ(allocation[i], G'_i))
18: end for
19: RETURN initialseeds

4. EXPERIMENTS AND RESULT ANALYSIS

4.1 Experimental Considerations

The running time of the greedy hill-climbing algorithm is O(knms), where k is the given number in the problem setting, n is the number of nodes in the network, m is the number of edges, and s is the number of simulations needed to obtain a stable diffusion size. In a sparse social network the number of edges is usually O(n), so the running time is O(kn^2 s). In our experiments, the largest network has more than 40,000 nodes, the largest k is 120, and the number of simulations is about 10,000. The greedy hill-climbing algorithm does not scale well to such cases, so we do not compare our algorithm with it. Instead we compare against four heuristics, which are explained in Section 4.2.
In order to solve the recursion, we first need the growth function. The growth function F(k, G, M, Γ) is influenced by the node selection strategy Γ. For running-time considerations, we use the top-k out-degree nodes as the strategy for selecting initial active nodes. Note that this does not undermine our claim, because the optimal recursive relationship is not influenced by this selection strategy at all. We therefore have a fixed strategy Γ for selecting initial active nodes. The growth function itself cannot be derived easily. Since the maximum value of i in the growth function in the recursive relation (3) is k, we only need to calculate the growth function on the interval [1, k].

To calculate F_i(j, G_i, M_i, Γ_i), we simulate the diffusion process over the network many times; in this way we can approximate the growth function sufficiently closely if we run enough simulations. Another issue is finding the communities. Balancing the performance of the algorithm against its running time, we use the community discovery algorithm of [2] to partition the largest component into smaller communities.

4.2 Comparison Heuristics

In order to validate the performance of our proposed method (denoted by M1), we compare it with four other heuristics:
1. M2: equal allocation;
2. M3: proportional allocation (allocation by the number of nodes in the community or by the diffusion probability of the community);
3. M4: random allocation; and
4. M5: selecting top-degree nodes of the whole network without using any allocation strategy.
In the second round of experiments on synthetic networks, proportional allocation (M3) means allocation by the diffusion probability; in all other experiments it means allocation by the number of nodes in the community. A minimal sketch of the allocation heuristics M2-M4 is given below. We will show that the proposed optimal allocation method yields a larger diffusion size than all four heuristics.
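The following sketch shows one way to read the allocation baselines M2-M4 in code; communities is assumed to be a list of community subgraphs, k is the total seed budget, and the rounding details are our own choices rather than the authors' implementation.

import random

def equal_allocation(communities, k):             # M2: split k evenly across communities
    n = len(communities)
    alloc = [k // n] * n
    for i in range(k % n):                        # hand out the remainder one by one
        alloc[i] += 1
    return alloc

def proportional_allocation(communities, k):      # M3: proportional to community size
    total = sum(len(c) for c in communities)
    alloc = [k * len(c) // total for c in communities]
    alloc[0] += k - sum(alloc)                    # give any rounding slack to one community
    return alloc

def random_allocation(communities, k):            # M4: each seed goes to a random community
    alloc = [0] * len(communities)
    for _ in range(k):
        alloc[random.randrange(len(communities))] += 1
    return alloc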

4.3 Experimental Settings

We conduct experiments on both synthetic networks and real-world social networks. The diffusion model is the independent cascade model for the experiments on synthetic networks; both the independent cascade model and the linear threshold model are used for the experiments on real-world social networks.
Synthetic networks are generated by preferential attachment, and all of their communities are disconnected (a minimal generation sketch is given at the end of this subsection). The number of initial active nodes ranges from 30 to 120 with an increment of 10. There are two rounds of experiments on synthetic networks. In the first round there are 20 communities, the number of nodes in each community varies from 10 to about 900, the whole network has 6000 nodes, and the diffusion probability is 0.1 for all communities. In the second round there are again 20 communities, each containing 1000 nodes, and the diffusion probability varies across communities from 0.1 to about 0.2.
For the experiments with the independent cascade model on real-world social networks, we set the diffusion probability to be the same for every community within a network. The social network datasets used are as follows.
1. Condensed matter physics 2005 network [11]: a network of co-authorships between scientists posting preprints on the Condensed Matter E-Print Archive. This network has 40,421 nodes. Its maximal modularity is 0.62 by the algorithm of [2].
2. PGP giant component network [1]: the giant component of the network of users of the Pretty-Good-Privacy algorithm for secure information interchange. It is a single connected component with 10,680 nodes in total. Its maximal modularity is 0.85 by the algorithm of [2].

3. Lederberg citation network and Zewail citation network: two citation networks obtained from Garfield's collection of citation network datasets produced with the HistCite software. The Lederberg citation network has 8843 nodes and the Zewail citation network has 6752 nodes. Their maximal modularities are 0.61 and 0.57, respectively, by the algorithm of [2].
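As mentioned in the description of the synthetic networks above, here is a minimal sketch of how the disconnected preferential-attachment communities could be generated. The use of networkx, the attachment parameter m = 3 and the seed handling are our own assumptions; the sizes shown match the second round of synthetic experiments.

import networkx as nx

def synthetic_modular_network(sizes, m=3):
    # One Barabasi-Albert (preferential attachment) community per entry in `sizes`,
    # combined with a disjoint union so the communities stay disconnected.
    parts = [nx.barabasi_albert_graph(n, m, seed=i) for i, n in enumerate(sizes)]
    return nx.disjoint_union_all(parts)

# Example: G = synthetic_modular_network([1000] * 20)   # 20 communities of 1000 nodes each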

4.4 Experimental Results on Synthetic Networks

Figure 1(a) shows the diffusion sizes for the first round of experiments. M1, M3 and M5 are notably better than M2 and M4: the average difference between M1 and M2 is about 50 nodes, and the average difference between M1 and M4 is about 110 nodes. However, the differences among M1, M3 and M5 are not significant; the average difference between M1 and M5 is about 20 nodes. A possible explanation is that the growth functions of the largest communities are similar, i.e., the heterogeneity of the growth functions is small. Nonetheless, optimal allocation always yields the largest diffusion size.
Figure 1(b) shows the diffusion sizes for the second round of experiments. The difference between M1 and all the other heuristics is clearly notable: the average difference between M1 and M2 is about 110 nodes, the average difference between M1 and M5 is about 290 nodes, and the other average differences fall between 110 and 290 nodes. This suggests that varying the diffusion probability yields a larger heterogeneity of the growth functions. Another interesting fact is that M3 performs worse than M2. One might expect that the higher the diffusion probability, the more initial active nodes we should allocate to that community; on the contrary, the higher the diffusion probability, the sooner the growth function reaches its cut point. This reflects a nonlinear relationship between diffusion probability and diffusion size, and suggests that the more vulnerable a community is, the fewer initial active nodes we need to initiate a large cascade.

4.5 Experimental Results on Real World Social Networks

From Figure 1(c) to Figure 1(m) we can see that on all four datasets the proposed method yields a larger diffusion size than the other four heuristics, with one exception at the beginning of Figure 1(d). Experimental results for the independent cascade model are shown in Figures 1(c) to 1(i); results for the linear threshold model are shown in Figures 1(j) to 1(m).

4.5.1 Experimental Results on the Independent Cascade Model

Figure 1(c) shows the largest difference between the proposed method and the other heuristics. The average difference between M1 and M5 is about 284 nodes, and the gap grows as k increases: when k = 120, the difference between M1 and M5 is already about 400 nodes. This relatively large difference can be explained by the heterogeneity of the growth functions and the topology of the communities. In the condensed matter collaboration 2005 network, there is a tightly connected community of 6802 nodes whose growth function is very different from those of the other communities: its value is 1784 with one initial active node, and it grows very slowly after that. For the other communities, the connections are relatively sparse and the growth function's value is below 100 with one initial active node.

The behavior of these growth functions suggests that the top-degree nodes of the whole network lie within that tightly connected community. The top-degree nodes are therefore drawn from the same community, and there is a lot of overlap among the nodes they activate. This is why the proposed method works much better on the condensed matter physics network than simply selecting top-degree nodes without any allocation strategy.
Figure 1(h) exhibits the smallest difference between the proposed method and the other heuristics. The average difference between M1 and M3 is about 20 nodes, the average difference between M1 and M4 is about 85 nodes, and all the other average differences fall between 20 and 85 nodes. Note that all five lines are very close to each other. This is a result of the lack of heterogeneity of the growth functions of the different communities: in the Zewail citation network with diffusion probability 0.1, all the growth functions grow at almost the same pace with regard to k.
Figure 1(d) and Figure 1(e) show the diffusion sizes for the PGP network under different diffusion probabilities. Comparing them indicates that the difference between M1 and the other heuristics becomes larger with a higher diffusion probability; the same is observed by comparing Figure 1(f) with Figure 1(g) and Figure 1(h) with Figure 1(i). However, the difference between M1 and the other heuristics is not necessarily larger for larger networks (with more nodes). In Figure 1(e) the average difference between M1 and the other heuristics falls between 46 and 268 nodes, while in Figure 1(g) it falls between 33 and 376 nodes, even though the PGP network has more nodes than the Lederberg citation network.

4.5.2 Experimental Results on the Linear Threshold Model

Generally, the experimental results for the linear threshold model are similar to those for the independent cascade model. When the diffusion probability of the independent cascade model is small, the linear threshold model usually yields a larger diffusion size, so the absolute differences between the proposed method and the other heuristics are usually larger for the linear threshold model. Figure 1(j) shows the largest difference between the proposed method and the other heuristics: about 3000 nodes when k is equal to 120. The lower bound on the largest difference is about 300 nodes, as revealed in Figure 1(m); all the other largest differences fall between these two values. Figures 1(j), 1(k), 1(l) and 1(m) all show that the proposed method outperforms the other heuristics.
To sum up, all the figures confirm that the proposed method outperforms the other heuristics in most cases. They also suggest that the proposed method works particularly well with a higher diffusion probability, since a higher diffusion probability makes the heterogeneity of the growth functions more pronounced. Selecting top-degree nodes without an allocation strategy usually yields the worst performance, which is especially obvious for large networks with a high diffusion probability.

5. CONCLUSIONS

In this paper, we have proposed a new resource-allocation-based approach to the influence maximization problem on modular social networks. We have made the simplifying assumption that the communities of a social network are disconnected, and we have proved that finding an optimal allocation in a modular social network under this assumption is NP-hard. We have then proposed a quasi-optimal dynamic programming algorithm to solve the optimal allocation problem in modular social networks.

Using several heuristics to measure the performance of our optimal allocation algorithm, we have shown empirically that our new algorithm OASNET outperforms equal allocation, proportional allocation, random allocation and selecting top-degree nodes without any allocation strategy.
In the current work, we assume that the network is static, without any temporal change. One future task is to extend the OASNET algorithm to take the temporal behavior of the network into account and to apply it to biomolecular networks, such as protein-protein interaction networks, to study their dynamic behavior. We hope to report our findings in the near future.

6. ACKNOWLEDGMENTS

This research is supported by the US National Science Foundation (NSF) under grants CCF-0905337 and CCF-0905291, the National Basic Research Program of China (973 Program) under award 2009CB326203, and the National Natural Science Foundation of China (NSFC) under award 60828005. The Lederberg and Zewail citation networks are from Garfield's collection of citation network datasets produced using the HistCite software. These networks are search results from the Web of Science and are used with the permission of ISI of Philadelphia.

7. REFERENCES

[1] M. Boguñá, R. Pastor-Satorras, A. Díaz-Guilera, and A. Arenas. Models of social networks based on social distance attachment. Phys. Rev. E, 70(5):056122, Nov 2004.
[2] A. Clauset, M. E. J. Newman, and C. Moore. Finding community structure in very large networks. Phys. Rev. E, 70(6):066111, Dec 2004.
[3] P. Domingos and M. Richardson. Mining the network value of customers. In KDD, pages 57–66, 2001.
[4] P. A. Estévez, P. A. Vera, and K. Saito. Selecting the most influential nodes in social networks. In IJCNN, pages 2397–2402, 2007.
[5] A. Galstyan, V. Musoyan, and P. Cohen. Maximizing influence propagation in networks with community structure. Phys. Rev. E, 79(5):056102, 2009.
[6] M. Granovetter. Threshold models of collective behavior. The American Journal of Sociology, 83(6):1420–1443, 1978.
[7] J. Goldenberg, B. Libai, and E. Muller. Talk of the network: A complex systems look at the underlying process of word-of-mouth. Marketing Letters, 12:211–223, 2001.
[8] D. Kempe, J. M. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In KDD, pages 137–146, 2003.
[9] M. Kimura, K. Saito, and H. Motoda. Minimizing the spread of contamination by blocking links in a network. In AAAI, pages 1175–1180, 2008.
[10] M. Kimura, K. Saito, and R. Nakano. Extracting influential nodes for information diffusion on a social network. In AAAI, pages 1371–1376, 2007.
[11] M. E. J. Newman. The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences of the United States of America, 98(2):404–409, 2001.
[12] M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45:167, 2003.
[13] M. Richardson and P. Domingos. Mining knowledge-sharing sites for viral marketing. In KDD, pages 61–70, 2002.

Figure 1: Experimental Results. (a) Synthetic networks composed of communities with various sizes; (b) synthetic networks with various diffusion probabilities across different communities; (c) Condensed Matter Physics network with diffusion probability 0.05; (d) PGP network with diffusion probability 0.1; (e) PGP network with diffusion probability 0.2; (f) Lederberg citation network with diffusion probability 0.1; (g) Lederberg citation network with diffusion probability 0.2; (h) Zewail citation network with diffusion probability 0.1; (i) Zewail citation network with diffusion probability 0.2; (j) Condensed Matter Physics network with the linear threshold model; (k) PGP network with the linear threshold model; (l) Lederberg citation network with the linear threshold model; (m) Zewail citation network with the linear threshold model.