Cumulative Activation in Social Networks

Xiaohan Shan, Institute of Computing Technology, CAS ([email protected])
Wei Chen, Microsoft ([email protected])
Qiang Li, Institute of Computing Technology, CAS ([email protected])
Xiaoming Sun, Institute of Computing Technology, CAS ([email protected])
Jialin Zhang, Institute of Computing Technology, CAS ([email protected])

arXiv:1605.04635v1 [cs.SI] 16 May 2016
ABSTRACT
A customer usually makes the decision to buy a product only after being cumulatively exposed to enough pieces of information about it. In this paper we use the classical independent cascade model to describe information diffusion in a social network. In this model, the activation probability of a node can be regarded as the frequency with which a customer receives information over multiple information diffusions. We study a new type of activation called cumulative activation: a node is cumulatively active if its activation probability is beyond a given threshold. Two optimization problems are investigated under this framework: the seed minimization with cumulative activation (SM-CA) problem, which asks for a seed set of minimum size such that the number of cumulatively active nodes reaches a given threshold; and the influence maximization with cumulative activation (IM-CA) problem, which asks for a seed set within a fixed budget that maximizes the number of cumulatively active nodes. We first show the nonsubmodularity of the function counting the number of cumulatively active nodes, which means that, unlike many previously studied problems in social networks, our problems cannot be solved directly via submodular function optimization. For the SM-CA problem, we design a greedy algorithm that yields a bicriteria O(ln n)-approximation when η = n, where η is the given threshold and n is the number of nodes in the network. For η < n, we show an Ω(n^δ) inapproximability result under a widely believed hardness assumption on the dense k-subgraph problem. For the IM-CA problem, we prove that it is NP-hard to approximate within n^{1−ε} for any ε > 0. Moreover, we provide two efficient heuristic algorithms for SM-CA and IM-CA, respectively. Experimental results on different real-world social networks show that our heuristic algorithms outperform many previous algorithms.

Keywords
social networks; independent cascade model; cumulative activation; influence maximization; seed minimization
1. INTRODUCTION
In recent years, with the rapidly increasing popularity of Facebook, Google, Twitter, etc., social networks have become powerful media for spreading information, ideas, and products among individuals. In particular, marketing products through social networks can attract a large number of customers. Motivated by this background, influence in social networks has been extensively studied. However, most previous work only considers the influence after a single spreading process. In the real world, by contrast, people typically make decisions after they have accumulated many pieces of information concerning an item, whether it is a new technology, a new product, etc.

Consider the following scenario. A company is going to launch a new version (called V7 for convenience) of its product, but most people are not familiar with the performance of this new product. Thus, it is better for the company to show multiple versions of advertisements for different features of V7. To market in a social network, one frequently used and highly effective approach is to choose a few influential people in the network and pay them to broadcast advertisements of V7. We call these influential people "seeds". From the perspective of potential customers, the first time they receive information about V7 from their friends, they may forward this information, but this does not necessarily lead to a purchase. Users may repeatedly receive and forward new information about V7 until they have accumulated enough pieces of information about the new product, at which time the purchase action is triggered. We model this by assuming that there is a threshold for each customer, who will buy V7 if the amount of information and recommendation that he receives exceeds his threshold. We measure the amount of information a user receives as the fraction of information cascades that reach the user, which is simply the activation probability of the user in an information cascade model. Therefore, we have an information cascade model that specifies how information or influence propagates in the social network, and the model determines the probability of a user receiving information (given a fixed seed set), which is interpreted as the fraction of information
received by the user; the user is finally activated if this fraction exceeds his own threshold.

Our diffusion process follows the widely used independent cascade model presented in [17]. A social network is defined on a directed graph with nodes representing individuals and edges representing relationships between individuals. Each edge is associated with an activation probability. Initially, a set of nodes is selected as seeds and becomes active, while all other nodes are inactive. At each step, every newly active node has one chance to activate each of its inactive out-neighbors, and if it succeeds (with the probability on the edge), the neighbor becomes active. In this diffusion model, each node has a probability of becoming active. We say a node is cumulatively active if this probability exceeds its threshold. This probability can be regarded as the amount of recommendation the node receives after a large number of diffusion processes.

Given the above cumulative activation model, the company may face one of the following two objectives: either it has a fixed budget for seed nodes and wants to maximize the number of cumulatively active nodes, or it needs to reach a predetermined number of cumulatively active nodes and wants to minimize the number of seeds. We formulate these scenarios as two optimization problems: the seed minimization with cumulative activation (SM-CA) problem and the influence maximization with cumulative activation (IM-CA) problem. Given a directed graph with a probability on each edge and a threshold for each node, a requirement η, and a budget k, the SM-CA problem is to find a seed set of minimum size such that the number of cumulatively active nodes is at least η; the IM-CA problem is to find a seed set of k nodes such that the number of cumulatively active nodes is maximized.

In this paper, we first show the nonsubmodularity of the function counting the number of cumulatively active nodes under a seed set, which means that we cannot apply the greedy algorithm directly with a guaranteed approximation ratio. For the SM-CA problem, we consider the cases η = n and η < n separately, where n is the number of nodes in the network; the complexity results in these two cases are quite different. When η = n, we show that the problem is NP-hard to approximate within (1 − ε) ln n for any ε > 0. We then construct a contribution function f(S) whose feasible solutions coincide with those of the SM-CA problem. We prove that exactly computing f(S) is #P-hard, but f(S) can be estimated with sufficient precision by Monte Carlo simulation. We also show the submodularity of f(S) and design a greedy algorithm yielding a bicriteria O(ln n)-approximation. When η < n, we construct a reduction from the dense k-subgraph problem to the SM-CA problem and show that the SM-CA problem cannot be approximated within n^{δ/2}/√6 if the dense k-subgraph problem cannot be approximated within n^δ for any δ > 0. For the IM-CA problem, we prove that it is NP-hard to approximate within n^{1−ε} for any ε > 0. Since the SM-CA problem with η < n and the IM-CA problem are hard to approximate, we design heuristic algorithms for them. To avoid the large time cost of repeatedly running Monte Carlo simulations in the greedy algorithm, we use another method, reverse influence sampling [24], to estimate P_u(S).
Finally, we conduct experiments on two real-world social networks to test the performance of our heuristic algorithms. To summarize, our contributions include: (a) we propose the seed minimization and influence maximization problems under cumulative activation (the SM-CA and IM-CA problems, respectively), which model the purchasing behavior of customers in a reasonable way; (b) we design a greedy algorithm for the SM-CA problem when η = n with a tight performance ratio; (c) we show strong hardness results for the SM-CA problem when η < n and for the IM-CA problem; (d) we design two heuristic algorithms for the SM-CA problem and two heuristic algorithms for the IM-CA problem and empirically demonstrate their performance using real-world datasets.
1.1 Related Work
The classical influence maximization problem is to find a seed set of at most k nodes that maximizes the expected number of active nodes. It was first studied as an algorithmic problem by Domingos and Richardson [9] and Richardson and Domingos [22]. Kempe et al. [17] first formulated it as a discrete optimization problem. They summarized the independent cascade model and the linear threshold model, and obtained approximation algorithms for influence maximization by applying submodular function maximization. Extensive studies follow their approach and provide more efficient algorithms [7, 8, 20]. Leskovec et al. [20] present a "lazy-forward" optimization method for selecting new seeds, which greatly reduces the number of influence spread evaluations. Chen et al. [7, 8] propose scalable algorithms that are faster than the greedy algorithms proposed in [18].

Another aspect of the influence problem is seed set minimization. Chen [5] studies the seed minimization problem under the fixed threshold model and shows strong negative results for this model. Long et al. [21] also study the independent cascade model and the linear threshold model from a minimization perspective. In [12], Goyal et al. study the problem of finding a minimum-size seed set such that the expected number of active nodes reaches a given threshold, and they provide a bicriteria approximation algorithm for this problem. Recently, Zhang et al. [25] studied the seed set minimization problem with a probabilistic coverage guarantee and designed an approximation algorithm. He et al. [16] study a positive influence model under single-step activation and propose an approximation algorithm; indeed, the setting of [16] is a special case of our work. Beyond influence maximization and seed minimization, another interesting direction is learning social influence from real online social network data, e.g., influence learning in blogspace [14] and in academic collaboration networks [23].

Paper organization. We formally define the diffusion model and the optimization problems SM-CA and IM-CA in Section 2. The approximation algorithms and hardness results for these two problems are presented in Section 3, including a greedy algorithm for the SM-CA problem with η = n in Section 3.1.1, the hardness result for the SM-CA problem with η < n in Section 3.1.2, and the inapproximability result for the IM-CA problem in Section 3.2. In Section 4, we present two heuristic algorithms for the SM-CA problem and two heuristic algorithms for the IM-CA problem. Section 5 shows our experimental results on real-world datasets. We conclude the paper with some further directions in Section 6.
2. MODEL AND PROBLEM DEFINITIONS
Our social network is defined on a directed graph G = (V, E), where V is the set of nodes representing individuals and E is the set of directed edges representing social ties between pairs of individuals. Each edge e = (u, v) ∈ E is associated with a probability p_uv, which represents the probability that u activates v. Our influence process follows the independent cascade (IC) model proposed by Kempe et al. [17]. In the IC model, discrete time steps t = 0, 1, 2, ... are used to model the diffusion process. Each node in G has two states, inactive or active, and all nodes are inactive at step 0. At step 1, a subset S ⊆ V is selected as the seed set and the nodes in it are activated directly. For any step t ≥ 1, if a node u is newly activated at step t − 1, then u has a single chance to activate each of its inactive out-neighbors v with independent probability p_uv. Once a node becomes active, it never returns to the inactive state. The diffusion process stops when there is no new activation at a time step.

We consider a new type of activation as follows. For each node u ∈ V, let P_u(S) be the probability that u becomes active in the diffusion process starting from the seed set S; in particular, P_u(S) = 1 for u ∈ S. Over many independent diffusion processes, we can regard P_u(S) as the frequency with which u successfully receives the information. Suppose each node u has an acceptance threshold τ_u; a node is called cumulatively active if P_u(S) ≥ τ_u. We call this type of activation Cumulative Activation (CA). Given a target set U ⊆ V and a seed set S, let ρ_U(S) be the number of cumulatively active nodes in U under S. When the context is clear, we omit the subscript U and write ρ(S).

We consider two optimization problems under cumulative activation: the seed minimization with cumulative activation (SM-CA) problem and the influence maximization with cumulative activation (IM-CA) problem. The SM-CA problem aims at finding a seed set S of minimum size such that at least η (η ≤ n) nodes in the target set are cumulatively activated. The IM-CA problem asks for a seed set of size k that maximizes the number of cumulatively activated nodes in the target set. The formal definitions are as follows.

Definition 1 (Seed minimization with cumulative activation). In the seed minimization with cumulative activation (SM-CA) problem, the input includes a directed graph G = (V, E) with |V| = n and |E| = m, an influence probability set P = {p_uv : p_uv ∈ [0, 1], (u, v) ∈ E}, a target set U ⊆ V, an acceptance threshold τ_u ∈ (0, 1] for each node u ∈ U, and a coverage requirement η ≤ |U|. The goal is to find a minimum-size seed set S* ⊆ V such that at least η nodes in U are cumulatively activated, that is,

S* = arg min_{S : ρ_U(S) ≥ η} |S|.
Definition 2 (Influence maximization with cumulative activation). In the influence maximization with cumulative activation (IM-CA) problem, the input includes a directed graph G = (V, E) with |V| = n and |E| = m, an influence probability set P = {p_uv : p_uv ∈ [0, 1], (u, v) ∈ E}, a target set U ⊆ V, an acceptance threshold τ_u ∈ (0, 1] for each node u, and a size budget k ≤ n. The goal is to find a seed set S* ⊆ V of size k such that the number of cumulatively active nodes in U is maximized, that is,

S* = arg max_{S : |S| = k} ρ_U(S).

Figure 1: Example for the nonsubmodularity of ρ(S)
Notice that it makes no difference to our results if we set the target set to be the set of all nodes. Thus, the target set is taken to be V unless otherwise stated.
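To make the model concrete, the following Python sketch (our own illustration; simulate_ic and the other helper names are hypothetical, not from the paper) runs one IC diffusion and estimates P_u(S) by Monte Carlo simulation; a node is then declared cumulatively active when its estimate reaches its threshold τ_u.

```python
import random
from collections import deque

def simulate_ic(graph, seeds):
    """One IC diffusion. graph maps u -> list of (v, p_uv) out-edges.
    Returns the set of nodes active when the process stops."""
    active = set(seeds)
    frontier = deque(seeds)
    while frontier:
        u = frontier.popleft()
        for v, p_uv in graph.get(u, []):
            # a newly active node gets a single chance per inactive out-neighbor
            if v not in active and random.random() < p_uv:
                active.add(v)
                frontier.append(v)
    return active

def estimate_activation_probs(graph, nodes, seeds, runs=10000):
    """Monte Carlo estimate of P_u(S) for every node u."""
    hits = dict.fromkeys(nodes, 0)
    for _ in range(runs):
        for u in simulate_ic(graph, seeds):
            hits[u] += 1
    return {u: hits[u] / runs for u in nodes}

def rho(graph, nodes, seeds, tau, runs=10000):
    """Number of cumulatively active nodes: nodes with P_u(S) >= tau_u."""
    probs = estimate_activation_probs(graph, nodes, seeds, runs)
    return sum(1 for u in nodes if probs[u] >= tau[u])
```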
3. ALGORITHM AND HARDNESS RESULTS
In this section, we provide algorithmic as well as hardness results for the SM-CA and IM-CA problems. It is well known that submodular objective functions usually admit good approximation guarantees for greedy algorithms, and most existing work on social influence takes advantage of this property (e.g., [3, 6, 12, 17]). Unfortunately, our objective function is not submodular in general, and this fact makes our problems more intractable. A function f from 2^V to the reals is called submodular if f(S ∪ {w}) − f(S) ≥ f(T ∪ {w}) − f(T) for all S ⊆ T ⊆ V and w ∉ T. In the SM-CA and IM-CA problems, given a seed set S, the objective function can be written as

ρ(S) = Σ_{u∈U} min{⌊P_u(S)/τ_u⌋, 1}.

The example in Figure 1 shows that ρ(S) is not submodular in general. Here G is a bipartite graph in which each of the nodes a, b, c has an edge to u with probability 1/2, the target set is U = {u}, and τ_u = 7/8. Let S = {a} and T = {a, b}. Since each seed independently activates u with probability 1/2, we have P_u({a, c}) = 1 − (1/2)² = 3/4 < 7/8, while P_u({a, b, c}) = 1 − (1/2)³ = 7/8 ≥ 7/8. Hence ρ(S ∪ {c}) − ρ(S) = 0 and ρ(T ∪ {c}) − ρ(T) = 1, so ρ(S) is not submodular.

In the rest of this section, we consider how to design efficient algorithms for the SM-CA and IM-CA problems, as well as their hardness. For the SM-CA problem, we treat the two cases "activate all nodes" (η = n) and "partial activation" (η < n) separately, since the results in these two cases are quite different.
3.1 Seed minimization with cumulative activation (SM-CA) problem

3.1.1 SM-CA problem with η = n
We first give a hardness result for the SM-CA problem with η = n. The result is based on the fact that the SET COVER problem is a special case of the SM-CA problem: with all edge probabilities and all thresholds set to 1, choosing a set node as a seed cumulatively activates exactly the element nodes it covers. Feige proved that SET COVER is NP-hard and cannot be approximated within (1 − ε) ln n for any fixed ε > 0, unless NP ⊆ DTIME(n^{O(log log n)}) [10]. We use this result to obtain the following hardness result directly.

Theorem 1. When η = n, the SM-CA problem is NP-hard. Moreover, for any ε > 0, the SM-CA problem cannot be approximated within (1 − ε) ln n in polynomial time unless NP ⊆ DTIME(n^{O(log log n)}).
Algorithm 1 Computing f(S) by Monte Carlo
Input: G = (V, E), {p_uv}_{(u,v)∈E}, {τ_u}, U, S, R
Output: f̂(S): estimate of f(S)
1: f̂(S) = 0
2: for u ∈ U do
3:   t_u = 0, P̂_u(S) = 0
4:   for i = 1 to R do
5:     simulate IC diffusion with seed set S
6:     if u is activated then
7:       t_u = t_u + 1
8:     end if
9:   end for
10:  P̂_u(S) = t_u / R
11:  if P̂_u(S) ≥ τ_u then
12:    f̂(S) = f̂(S) + τ_u
13:  else
14:    f̂(S) = f̂(S) + P̂_u(S)
15:  end if
16: end for
17: return f̂(S)
Algorithm 2 Greedy algorithm for SM-CA with η = n
Input: G = (V, E), {p_uv}_{(u,v)∈E}, {τ_u}, U, ε
Output: Seed set S
1: S = ∅
2: while f̂(S) < Σ_{u∈V} τ_u − ε do
3:   choose v = arg max_{u∈V} [f̂(S ∪ {u}) − f̂(S)]
4:   S = S ∪ {v}
5: end while
6: return S
Based on the above hardness result, we set our goal at an algorithm with approximation ratio close to ln n. Our key observation is that when η = n, we can replace the nonsubmodular function ρ(S) with a submodular function f(S), such that the feasible solutions of the SM-CA problem can also be characterized by f(S) instead of ρ(S). The contribution function f(S) is defined as

f(S) = Σ_{u∈V} min{P_u(S), τ_u}.

Lemma 1. For any seed set S, let f̂(S) be the estimate of f(S) output by Algorithm 1. Then for all γ > 0 and δ > 0, Pr(|f̂(S) − f(S)| ≤ γ) ≥ 1 − 1/n^δ, provided R ≥ (n² ln(2n^δ))/(2γ²).

Proof. For each node u, let X_u = Σ_{i=1}^{R} X_u^{(i)}, where X_u^{(i)} is a random variable with X_u^{(i)} = 1 if u is activated in the i-th simulation and X_u^{(i)} = 0 otherwise. That is, X_u is the number of times u becomes active over the R simulations. Thus X_u = R · P̂_u(S) and E[X_u] = R · P_u(S). By Hoeffding's inequality and the condition R ≥ (n² ln(2n^δ))/(2γ²), for any constants γ > 0 and δ > 0,

Pr(|P̂_u(S) − P_u(S)| ≥ γ/n) = Pr(|X_u − E[X_u]| ≥ Rγ/n) ≤ 2 exp(−2(Rγ/n)²/R) ≤ 1/n^δ.

In the following, we show that |f̂(S) − f(S)| ≤ Σ_{u∈V} |P̂_u(S) − P_u(S)| always holds:

|f̂(S) − f(S)| = |Σ_{u∈V} min{P̂_u(S), τ_u} − Σ_{u∈V} min{P_u(S), τ_u}|
  ≤ Σ_{u: P̂_u(S)≤τ_u, P_u(S)>τ_u} |P̂_u(S) − τ_u| + Σ_{u: P̂_u(S)>τ_u, P_u(S)≤τ_u} |τ_u − P_u(S)| + Σ_{u: P̂_u(S)≤τ_u, P_u(S)≤τ_u} |P̂_u(S) − P_u(S)|
  ≤ Σ_{u∈V} |P̂_u(S) − P_u(S)|.

Combining the two bounds, Pr(|f̂(S) − f(S)| ≤ γ) ≥ 1 − 1/n^δ.
Having the estimation algorithm for f(S), we present our greedy algorithm for the SM-CA problem with η = n in Algorithm 2. Algorithm 2 starts from an empty seed set S. At each iteration, it adds to S the node v providing the largest marginal gain in f̂(S), i.e.,

v = arg max_{u∈V} [f̂(S ∪ {u}) − f̂(S)].

The algorithm ends when f̂(S) ≥ Σ_{u∈V} τ_u − ε and outputs S as the selected seed set. Goyal et al. proved the following performance guarantee for the greedy algorithm when f(S) is monotone and submodular [12].

Theorem 2. [12] Let G = (V, E) be a social graph and f(·) be a nonnegative, monotone and submodular function defined on 2^V. Given a threshold 0 < η ≤ f(V), let S* ⊆ V be a subset of minimum size such that f(S*) ≥ η, and let S be the greedy solution using a (1 − δ)-approximate function f̂(·) with the stopping criterion f̂(S) ≥ η − ε. Then, for any ϕ > 0 and ε > 0, we have |S| ≤ |S*|(1 + ϕ)(1 + ln(η/ε)).
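A minimal sketch of this greedy loop, reusing the hypothetical estimate_activation_probs helper from the earlier sketch (again our own illustration, not the paper's code; for clarity it re-evaluates f̂ from scratch, whereas Algorithm 1 would be invoked incrementally in practice):

```python
def fhat(graph, nodes, tau, seeds, runs=2000):
    # estimate f(S) = sum_u min{P_u(S), tau_u}, as in Algorithm 1
    probs = estimate_activation_probs(graph, nodes, seeds, runs)
    return sum(min(probs[u], tau[u]) for u in nodes)

def greedy_smca_full_coverage(graph, nodes, tau, eps):
    # Algorithm 2: grow S until fhat(S) >= sum_u tau_u - eps
    seeds = set()
    target = sum(tau.values()) - eps
    while fhat(graph, nodes, tau, seeds) < target:
        # pick the node with the largest estimated marginal gain of f
        best = max((v for v in nodes if v not in seeds),
                   key=lambda v: fhat(graph, nodes, tau, seeds | {v}))
        seeds.add(best)
    return seeds
```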
To analyze the performance of Algorithm 2, we first show the monotonicity and submodularity of f(S) in the following lemma.

Lemma 2. The contribution function f(S) is monotone increasing and submodular.

Proof. (sketch) It is obvious that f(S) is an increasing function with f(∅) = 0, so we only need to show the submodularity of f(S). Note that, given a seed set S and a node u ∈ V, if we set the target set exactly to U = {u}, then the expected number of active nodes starting from S is σ(S) = P_u(S). Thus, P_u(S) is submodular, since σ(S) is a submodular function by the result of [17]. It is easy to check that the function min{P_u(S), τ_u} is also submodular. Therefore, f(S) is submodular, since it is the sum of a finite number of submodular functions.

Now we can conclude the approximation ratio of Algorithm 2 from Lemma 1, Lemma 2 and Theorem 2.

Theorem 3. When η = n, for any φ > 0 and ε > 0, Algorithm 2 ends when f̂(S) ≥ Σ_{u∈V} τ_u − ε and approximates the SM-CA problem within a factor of (1 + φ) · (1 + ln(Σ_{u∈V} τ_u / ε)) with high probability.

3.1.2 SM-CA problem with η < n
For the case η < n, the SM-CA problem becomes more difficult, since the contribution function f(S) no longer characterizes the solutions of the SM-CA problem as it does in the η = n case. We show the hardness result first. Our analysis is based on the hardness of the dense k-subgraph (DkS) problem [11]. An instance of the DkS problem consists of an undirected graph G = (V, E), where |V| = n, and a parameter k < n; the objective is to find a subset V′ ⊆ V of cardinality k that maximizes the number of edges with both endpoints in V′. The first polynomial-time approximation algorithm for the DkS problem was given by Feige et al. in 2001 [11], with performance ratio O(n^{1/3}). This was improved to O(n^{1/4+ε}) (for any ε > 0) by Bhaskara et al. [2] in 2010, which is currently the best known guarantee. On the hardness side, Khot [19] proved that the DkS problem does not admit a PTAS under the assumption that NP problems do not have subexponential-time randomized algorithms. The exact complexity of approximating the DkS problem is still open, but it is widely believed that the DkS problem can only be approximated within a polynomial ratio. Partially borrowing an idea from [15], we can prove a hardness result for the SM-CA problem with η < n based on the hardness of the DkS problem.

Theorem 4. When η < n, the SM-CA problem cannot be approximated within n^{δ/2}/√6 if the DkS problem cannot be approximated within n^δ for some δ > 0.

Proof. Suppose there is a polynomial-time approximation algorithm A with performance ratio r for the SM-CA problem with η < n. We design an algorithm for the DkS problem based on A with approximation ratio 6r², from which the theorem follows. Given an instance of the DkS problem on graph G = (V, E), construct an instance (denoted SM-CA-I) of the SM-CA problem as follows. It is defined on a one-way bipartite graph G′ = (V′ = V₁ ∪ V₂, E′), where V₁ = V, V₂ = E, and the directed edge set is E′ = {(v, e) : v ∈ V₁, e ∈ V₂, and v is an endpoint of e in E}; the probability on each edge e′ = (v, e) is p_ve = 1/2. The target set is U = V₁ ∪ V₂;
for each node e ∈ V₂, τ_e = 3/4, and for each node v ∈ V₁, τ_v = 1. For any k, let η = η(k) be the maximum threshold requirement for which A outputs a solution for SM-CA with k nodes; that is, A outputs a seed set with k nodes if the threshold is η(k) and at least k + 1 nodes if the threshold is η(k) + 1. (η(k) can be computed efficiently by using algorithm A and binary search.)

Clearly, in SM-CA-I, nodes in V₂ are no better than nodes in V₁ as seed candidates: since the target set is the set of all nodes, selecting a node in V₂ can only activate itself, while a node in V₁ may help activate nodes in V₂. So we assume that all seeds selected by algorithm A are from V₁. Since p_ve = 1/2 for each edge (v, e) ∈ E′ and τ_e = 3/4 for each node e ∈ V₂, an easy probability calculation implies that a node e ∈ V₂ can be cumulatively activated if and only if both endpoints of e are selected as seeds.

Suppose the seed set of SM-CA-I with parameter η = η(k) computed by algorithm A is S′; then we can use the corresponding node set S in graph G as an approximate solution of the DkS problem. Indeed, we have |S| = k. Since S′ cumulatively activates at least η nodes in SM-CA-I and only k of them are in V₁, at least η − k nodes in V₂ are cumulatively activated. Therefore, in graph G the number of edges induced by S is at least η − k. Without loss of generality, we can assume η ≥ k + ⌊k/2⌋, because we can easily choose k nodes from V₁ that cumulatively activate ⌊k/2⌋ nodes in V₂. It is easy to check that η − k ≥ (1/3)(η − 2).

Suppose the optimal solution of the DkS problem contains opt edges; it suffices to show opt ≤ 2(η − 2)r². Indeed, if opt ≤ 2(η − 2)r², then opt ≤ 6(η − k)r², which means there is a 6r²-approximation algorithm for the DkS problem. In SM-CA-I, by the choice of η and the fact that A is an r-approximation algorithm, any seed set of size ⌊k/r⌋ can cumulatively activate at most η nodes. Thus, at most η − ⌊k/r⌋ nodes in V₂ can be cumulatively activated by any ⌊k/r⌋ seeds in V₁. Equivalently, at most η − ⌊k/r⌋ edges are induced by any ⌊k/r⌋ vertices in G. Now fix any T ⊆ V with |T| = k. Each of the C(k, ⌊k/r⌋) subsets of ⌊k/r⌋ vertices of T induces at most η − ⌊k/r⌋ edges, and each edge induced by T is counted exactly C(k − 2, ⌊k/r⌋ − 2) times. So, if k > 2r, the total number of edges induced by T is at most

(η − ⌊k/r⌋) · C(k, ⌊k/r⌋) / C(k − 2, ⌊k/r⌋ − 2) ≤ r²(η − ⌊k/r⌋) · (k − 1)/(k − r) < 2(η − 2)r².

If k ≤ 2r, then opt ≤ C(k, 2) ≤ k²/2 ≤ 2r². Since T was chosen arbitrarily, we have opt ≤ 2(η − 2)r², which completes the proof.
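For concreteness, a small sketch (our own, with hypothetical names, using the adjacency format of the earlier IC sketch) that builds the SM-CA-I instance from a DkS instance:

```python
def build_smca_instance(dks_nodes, dks_edges):
    """One-way bipartite instance SM-CA-I from a DkS instance:
    V1 = nodes of G, V2 = edges of G, all edge probabilities 1/2,
    thresholds tau = 1 on V1 and tau = 3/4 on V2."""
    graph, tau = {}, {}
    for v in dks_nodes:
        graph[('V1', v)] = []
        tau[('V1', v)] = 1.0
    for (a, b) in dks_edges:
        e = ('V2', (a, b))
        tau[e] = 0.75
        graph[e] = []
        # both endpoints point to the edge node with probability 1/2, so the
        # edge node has P >= 3/4 exactly when both endpoints are seeds
        graph[('V1', a)].append((e, 0.5))
        graph[('V1', b)].append((e, 0.5))
    return graph, tau
```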
3.2 Influence maximization with cumulative activation (IM-CA) problem
For the IM-CA problem, we prove a strong inapproximability result, even when the base graph is bipartite.

Theorem 5. For any ε > 0, it is NP-hard to approximate the IM-CA problem within a factor of N^{1−ε}, where N is the input size.

Proof. (sketch) We reduce from the SET COVER problem. The input of the SET COVER problem includes a
ground set W = {w₁, w₂, ..., w_n}, a collection of subsets S₁, S₂, ..., S_m ⊆ W, and a positive integer k < m. The question is whether there exist k subsets whose union is W. Given an instance of the SET COVER problem, we construct an instance of the IM-CA problem as follows. There are three types of nodes: set nodes, element nodes, and dummy nodes. There is a set node u corresponding to each set, an element node v corresponding to each element, and a directed edge (u, v) with activation probability p_uv = 1 if the element represented by v belongs to the set represented by u, and p_uv = 0 otherwise. There are n^c dummy nodes x₁, x₂, ..., x_{n^c} (where c = 2⌈1/ε⌉ + ⌈log m / log n⌉ + 1), and there is a directed edge (v, x) for each element node v and dummy node x, with activation probability p_vx = 1/2. The thresholds of set nodes, element nodes and dummy nodes are τ_u = τ_v = 1 and τ_x = 1 − 1/2^n, respectively. The seed set budget is k and the target set is the set of all nodes. Notice that the input size of our IM-CA instance is N = n^c + n + m, so N^{1−ε} < n^c/(2n) ≤ n^c/(n + k).

Under our construction, if there exists a collection of k sets covering all elements of W in the SET COVER instance, then in the IM-CA instance the seed set corresponding to this collection cumulatively activates all element nodes and all dummy nodes; in total, n^c + n + k nodes become cumulatively active. On the other hand, consider the case where there is no set cover of size k. Again we may assume all seeds are selected from the set nodes, since as seed candidates, set nodes are more efficient than element nodes and dummy nodes. Thus, if there is no set cover of size k, then no k seeds can activate all the element nodes, and hence none of the dummy nodes is cumulatively activated. Therefore, the total number of cumulatively activated nodes is no more than n + k. So, if a polynomial-time algorithm could approximate the IM-CA problem within N^{1−ε}, then we could answer the decision version of the SET COVER problem in polynomial time, which is impossible under the assumption P ≠ NP.
4. EFFICIENT HEURISTIC ALGORITHMS
Given a seed set S and a node u, the computation of P_u(S) in Algorithm 1 is quite expensive: it needs O(n²) simulations to guarantee the accuracy, and each simulation takes O(m) time, leading to an O(n²m) running time for each node. Moreover, we need to simulate at every step of the greedy algorithms, and this is the main reason for the inefficiency of the Monte Carlo method. For the sake of efficiency, we instead use the Reverse Influence Sampling (RIS) method to quickly estimate P_u(S) and avoid much repetitive work in the greedy algorithms. The RIS method was first proposed by Borgs et al. in [3] and then improved by Tang et al. in [24]. To explain how it works, we first introduce the concept of a reverse reachable set.

Definition 3 (Reverse reachable set). Let u be a node in G, and let g be a random graph obtained by independently removing each edge e = (v, w) of G with probability 1 − p_vw. The reverse reachable (RR) set for u is the set of nodes in g that can reach u.

By the definition of an RR set, if a node v is an element of an RR set of u, then v can reach u through a path in G, and thus v can activate u with non-zero probability if v is a seed node. Intuitively, P_u(S) increases with
the probability that some seed appears in an RR set for u. Borgs et al. formally proved this observation.

Lemma 3. [3] Let S be a seed set and u be a fixed node. Suppose R_u is an RR set for u generated from G; then P_u(S) equals the probability that S overlaps with R_u, that is, P_u(S) = Pr(S ∩ R_u ≠ ∅).

Similar to the Monte Carlo method, we independently generate R_u θ times. Let ℛ_u be the collection of generated RR sets for u. For a seed set S, define F_{ℛ_u}(S) := |{R ∈ ℛ_u : R ∩ S ≠ ∅}| / θ. For any u ∈ V, we use F_{ℛ_u}(S) as the estimate of P_u(S). The following lemma shows that the error can be bounded if θ is large enough.

Lemma 4. For any µ > 0 and l > 0, if θ satisfies θ ≥ ln(2n^l)/(2µ²), then Pr[|F_{ℛ_u}(S) − P_u(S)| ≥ µ] ≤ n^{−l}.

Proof. For convenience, we write p for P_u(S) in the derivation. Let X = θF_{ℛ_u}(S) be the number of RR sets in ℛ_u overlapping with S. Thus X = Σ_{i=1}^{θ} X_i, where X_i = 1 if S overlaps with the i-th RR set in ℛ_u and X_i = 0 otherwise. Then E[X] = Σ_{i=1}^{θ} E[X_i] = θp. By Hoeffding's inequality and the condition θ ≥ ln(2n^l)/(2µ²), we have

Pr(|F_{ℛ_u}(S) − p| ≥ µ) = Pr(|θF_{ℛ_u}(S) − θp| ≥ θµ) = Pr(|X − E[X]| ≥ θµ) ≤ 2 exp(−2(θµ)²/θ) ≤ n^{−l}.
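A sketch of RR-set generation by reverse breadth-first search with on-the-fly edge sampling, together with the estimator F_{ℛ_u}(S) (our own illustration; reverse_graph and the function names are hypothetical):

```python
import random
from collections import deque

def random_rr_set(reverse_graph, u):
    """RR set for u: nodes reaching u in a random subgraph where each
    edge (v, w) survives independently with probability p_vw.
    reverse_graph maps w -> list of (v, p_vw) in-edges."""
    rr = {u}
    queue = deque([u])
    while queue:
        w = queue.popleft()
        for v, p_vw in reverse_graph.get(w, []):
            # sample each in-edge lazily, exactly once, as it is examined
            if v not in rr and random.random() < p_vw:
                rr.add(v)
                queue.append(v)
    return rr

def estimate_pu(reverse_graph, u, seeds, theta):
    """F_{R_u}(S): fraction of theta RR sets for u that S hits."""
    seeds = set(seeds)
    hits = sum(1 for _ in range(theta)
               if seeds & random_rr_set(reverse_graph, u))
    return hits / theta
```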
Our heuristic algorithms for the SM-CA and IM-CA problems are based on the above estimation. We design two greedy algorithms for the SM-CA problem and two greedy algorithms for the IM-CA problem. The main idea of these algorithms is to select, at each step, a node with the largest marginal increment under one of two greedy functions. The first greedy function is f(S) = Σ_{u∈V} min{P_u(S), τ_u}, the same as in Section 3.1.1. The other is ρ(S) = Σ_{u∈V} min{⌊P_u(S)/τ_u⌋, 1}, which is the number of cumulatively active nodes under seed set S. Algorithm 3 (SelectByF) and Algorithm 4 (SelectByρF) select nodes with the largest marginal increment under f(S) and ρ(S), respectively.

We first introduce Algorithm 3. In its input, req is an array storing the current requirements of all nodes in V: the requirement of a node u is the number of RR sets in ℛ_u that still need to be hit by the seed set so that u becomes cumulatively active. We say a set S hits an RR set R if S ∩ R ≠ ∅. Let inc(v) be the marginal increment generated by a node v ∈ V, and let overlap(v, ℛ_u) be the number of RR sets in ℛ_u containing node v. In the main loop of SelectByF, we select the node providing the largest marginal increment under f(·). To this end, for each node v ∈ V, we compute the marginal increment of v with respect to nodes that are not yet cumulatively active. Based on Lemma 3, the marginal increment of a node v to a node u can be estimated by overlap(v, ℛ_u) (see lines 2-7). Having
Algorithm 3 SelectByF
Input: G = (V, E), {p_uv}_{(u,v)∈E}, req, {ℛ_u}_{u∈V}, S
Output: a node with the largest marginal increment under f(S)
1: set inc(v) = 0 for all v ∈ V
2: for each node u ∈ V with req(u) > 0 do
3:   for each node v ∈ ∪_{R∈ℛ_u} R do
4:     /* compute the marginal increment of v */
5:     inc(v) = inc(v) + min(overlap(v, ℛ_u), req(u))
6:   end for
7: end for
8: sort inc(v) for all v ∈ V
9: select x = arg max_v inc(v)
10: return x

Algorithm 4 SelectByρF
Input: G = (V, E), {p_uv}_{(u,v)∈E}, req, {ℛ_u}_{u∈V}, S
Output: a node with the largest marginal increment under ρ(S)
1: set inc(v) = 0 for all v ∈ V
2: for each node u ∈ V with req(u) > 0 do
3:   for each node v ∈ ∪_{R∈ℛ_u} R do
4:     /* compute the marginal increment of v */
5:     inc(v) = inc(v) + min{⌊overlap(v, ℛ_u)/req(u)⌋, 1}
6:   end for
7: end for
8: sort inc(v) for all v ∈ V
9: /* select one better node from the nodes with the largest marginal increment */
10: for each node u ∈ V with req(u) > 0 do
11:   for all nodes v ∈ V with the largest inc values do
12:     inc(v) = inc(v) + min(overlap(v, ℛ_u), req(u))
13:   end for
14: end for
15: sort inc(v) for all nodes v ∈ V with the largest inc values
16: select x = arg max_v inc(v)
17: return x
the marginal increment values of all nodes v ∈ V, we sort these values and select the node with the largest one.

Another greedy strategy is given in SelectByρF (Algorithm 4). In this algorithm we first find the nodes with the largest marginal increment under ρ(S). Similar to Algorithm 3, we use Σ_{u: req(u)>0} min{⌊overlap(v, ℛ_u)/req(u)⌋, 1} to estimate the marginal increase of ρ(·). Because of the truncation in ρ(·), many nodes may share the same value of inc(·). To break ties, among all nodes with the largest inc value we choose the one with the maximum marginal increase of f(·).

Having these selection schemes, we now present the whole greedy algorithm for the SM-CA problem in Algorithm 5, which selects nodes by SelectByF. The greedy algorithm selecting nodes by SelectByρF is named GCAF-SM; we omit its details, since all its operations are the same as those of GF-SM except for the node selection step. We compare the performance of GCAF-SM in the experiment section.
Algorithm 5 GF-SM
Input: G = (V, E), {p_uv}_{(u,v)∈E}, {τ_u}_{u∈V}, η, θ
Output: Seed set S
1: /* initialization */
2: set S = ∅
3: set count = 0
4: generate {ℛ_u}_{u∈V}
5: set req(u) = τ_u θ for each u ∈ V
6: while count < η do
7:   x = SelectByF(G, {p_uv}_{(u,v)∈E}, req, {ℛ_u}_{u∈V}, S)
8:   S = S ∪ {x}
9:   remove all RR sets hit by x
10:  /* count the number of cumulatively active nodes */
11:  for each u ∈ V with req(u) > 0 do
12:    rem(u): the number of RR sets removed from ℛ_u
13:    req(u) = req(u) − rem(u)
14:    if req(u) ≤ 0 then
15:      count = count + 1
16:    end if
17:  end for
18: end while
19: return S
In GF-SM, we first perform initialization and then generate θ RR sets for each node in V. Since we estimate P_u(S) by F_{ℛ_u}(S) based on Lemma 4, a node u is cumulatively active only if at least θτ_u RR sets for u (i.e., the initial requirement for u) are hit by the seed set. At each step, we add a new node x to the current seed set via SelectByF. After a seed x is selected, we remove all RR sets containing x and update the requirement of each node: x contributes to every not-yet-cumulatively-active node whose RR sets contain x, and the requirements of these nodes are partially satisfied by x. The algorithm ends when the number of cumulatively active nodes reaches η.

Now we analyze the time complexity of Algorithm 5. Let EPT be the expected total in-degree in G of the nodes in an RR set; that is, the expected time to generate one RR set is EPT. Thus, the total expected generation time is O(nθ · EPT). By Lemma 4, θ = O(log n) suffices for accuracy, so the expected generation time is O(n log n · EPT). Besides generation, the main time cost lies in SelectByF, since the other operations take only O(n) time. Let EPTV be the expected number of nodes in an RR set; since EPTV ≤ EPT, SelectByF takes O(n · EPT) time in expectation. The sorting time in SelectByF is O(n log n). Thus, the expected total time cost of GF-SM is O(nη(log n + EPT)). Since EPT is a constant on real-world datasets, the practical total time cost is O(nη log n). The time complexity of GCAF-SM is the same as that of GF-SM, since the time cost of SelectByρF is also O(n(log n + EPT)).

For the IM-CA problem, we use the same selection schemes as for the SM-CA problem. The algorithms based on SelectByF and SelectByρF are named GF-IM and GCAF-IM, respectively. The pseudo-code of GF-IM is presented in Algorithm 6; the differences between GF-IM and GF-SM are the stopping condition and the counting variable. We omit the remaining details of GF-IM and of the full algorithm GCAF-IM. It is not hard to verify that the time complexities of GF-IM and GCAF-IM are both O(nk(log n + EPT)).
Algorithm 6 GF-IM
Input: G = (V, E), {p_uv}_{(u,v)∈E}, {τ_u}_{u∈V}, k, θ
Output: Seed set S
1: /* initialization */
2: set S = ∅
3: set f(S) = 0
4: set count = 0
5: set req(u) = θτ_u for each node u ∈ V
6: generate θ RR sets for each node in V
7: for j = 1 to k do
8:   x = SelectByF(G, {p_uv}_{(u,v)∈E}, req, {ℛ_u}_{u∈V})
9:   S = S ∪ {x}
10:  remove all RR sets containing x
11:  for each u ∈ V do
12:    rem(u): the number of RR sets removed from ℛ_u
13:    req(u) = req(u) − rem(u)
14:  end for
15: end for
16: return S
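A condensed runnable sketch of the GF-SM main loop, reusing the hypothetical random_rr_set helper from the earlier sketch (our own code, not the paper's; the GF-IM variant differs only in stopping after k seeds instead of η satisfied nodes):

```python
def gf_sm(reverse_graph, nodes, tau, eta, theta):
    # requirement: node u needs req(u) = tau_u * theta of its theta
    # RR sets hit by the seed set to count as cumulatively active
    rr_sets = {u: [random_rr_set(reverse_graph, u) for _ in range(theta)]
               for u in nodes}
    req = {u: tau[u] * theta for u in nodes}
    seeds, count = set(), 0
    while count < eta:
        # SelectByF: marginal increment of v under f is the capped
        # number of not-yet-satisfied RR sets that v would newly hit
        inc = dict.fromkeys(nodes, 0.0)
        for u in nodes:
            if req[u] <= 0:
                continue
            overlap = {}
            for rr in rr_sets[u]:
                for v in rr:
                    overlap[v] = overlap.get(v, 0) + 1
            for v, c in overlap.items():
                inc[v] += min(c, req[u])
        x = max(inc, key=inc.get)
        seeds.add(x)
        # remove all RR sets hit by x and update the requirements
        for u in nodes:
            if req[u] <= 0:
                continue
            kept = [rr for rr in rr_sets[u] if x not in rr]
            req[u] -= len(rr_sets[u]) - len(kept)
            rr_sets[u] = kept
            if req[u] <= 0:
                count += 1
    return seeds
```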
5. EXPERIMENTS
To test the performance of the heuristic algorithms presented in Section 4, we conduct experiments comparing them with several other algorithms on real social networks. Our experiments are run on a machine with a 2.4GHz Intel(R) Xeon(R) E5-2670 CPU, 2 processors (16 cores), 64GB memory, and Red Hat Enterprise Linux Server release 6.3 (64-bit). All algorithms tested in this paper are written in C++ and compiled with g++ 4.8.4.
5.1 Experiment setup
Datasets. We use two real-world networks in our experiments. The first is Flixster, an American social movie-rating site for discovering new movies. In the Flixster graph, each node represents a user, and a directed edge e = (u, v) indicates that u and v rated the same movie and v rated it shortly after u. We use one specific topic in this network, with 29357 nodes and 174939 directed edges, and we learn the activation probabilities on the edges using the topic-aware independent cascade model presented in [1]. The mean of the edge probabilities is 0.118 and the standard deviation is 0.025.

The second network, NetPHY, is the same as the one used in [7, 12, 13]. It is an academic collaboration network extracted from the "Physics" section of arXiv (http://www.arXiv.org): the nodes are authors and undirected edges represent coauthorship relations. We use data from 1991 to 2003, which includes 37154 nodes and 348322 edges. The influence probabilities on the edges are assigned by the weighted cascade model [17]: for each edge (u, v) ∈ E, we set p_uv = c(u, v)/d(v), where d(v) is the number of papers published by author v and c(u, v) is the number of papers that u and v coauthored. In this network, the mean of the edge probabilities is 0.107 and the standard deviation is 0.025.

Algorithms. We first recapitulate our algorithms from Section 4.

• GF: GF denotes the algorithms that greedily increase f(S) until the seed size reaches k or the number of cumulatively active nodes reaches η. One performance issue of such algorithms is that estimating P_u(S) by plain Monte Carlo simulation would require 10000 simulations for each node in each seed-selection round, which is not acceptable; we therefore adapt the RR-set machinery of TIM+ [24] and directly calculate f(S ∪ {u}) − f(S) for each node u ∈ V.

• GCAF: In each round, GCAF finds the node that brings the greatest marginal increment to the number of cumulatively active nodes and adds it to the current seed set. When two nodes have the same marginal increment, we further compare their contribution function f(·), as in GF. The implementations of GF and GCAF are similar, and details are discussed in Section 4.

The following algorithms are evaluated for comparison.

• TIM+: We use the greedy algorithm presented in [12] as the baseline for the SM-CA problem and the greedy algorithm in [17] for the IM-CA problem. For both problems, the greedy rule is the marginal increase of the expected number of active nodes, denoted σ(S). For efficiency, we use the RIS method of [3] to estimate the marginal increase.

• High-degree: This heuristic chooses nodes v in order of decreasing degree d(v); it is common to consider high-degree nodes as influential in social and other networks.

• PageRank: The popular algorithm for ranking web pages [4]. The transition probability on edge e = (u, v) is p_uv / Σ_{w:(w,v)∈E} p_wv. In this setting, a higher p_uv indicates that u is more influential on v, and thus v should vote for u more strongly. We use 0.15 as the restart probability and compute the PageRank values with the power method, stopping when two consecutive iterations differ by at most 10^{−4} in L1 norm (a sketch of this baseline follows the list).

• Random: As a baseline comparison, Random simply selects seeds in random order.
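A simplified sketch of the PageRank baseline as described above (our own implementation, not the paper's code; the handling of nodes with no incoming weight is our simplification):

```python
def pagerank(nodes, in_edges, restart=0.15, tol=1e-4):
    """in_edges maps v -> list of (u, p_uv): node v votes for its
    influencers u with weight p_uv / (total weight into v)."""
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    while True:
        new = {u: restart / n for u in nodes}
        for v in nodes:
            edges = in_edges.get(v, [])
            total = sum(p for _, p in edges)
            if total == 0:
                continue  # dangling mass is dropped in this simplified sketch
            for u, p in edges:
                # v passes (1 - restart) of its rank backwards, in
                # proportion to how strongly u influences v
                new[u] += (1 - restart) * rank[v] * (p / total)
        # power method stops when the L1 change is at most tol
        if sum(abs(new[u] - rank[u]) for u in nodes) <= tol:
            return new
        rank = new
```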
5.2 Experiment results
Algorithm performance on the IM-CA problem. We conduct two sets of experiments on the two datasets with the IC model. For the efficiency of our experiments, we set θ = 1000 in all algorithms, and for convenience we use the same acceptance threshold τ for all nodes, with τ ranging from 0.1 to 0.9. Figure 2 shows the results of the different algorithms with τ = 0.5 on the two datasets. In these two tests, TIM+, GF-IM and GCAF-IM outperform the three baseline algorithms Random, High-degree and PageRank on both datasets. In particular, GCAF-IM significantly outperforms all other algorithms on the NetPHY graph, while the results of TIM+, GF-IM and GCAF-IM are similar on the Flixster graph.

To further explore the performance of the algorithms, we report additional results of the six algorithms under different settings of τ. Figure 3 shows the performance on Flixster with τ = 0.1, 0.3, 0.7 and 0.9. We observe that as τ increases, greedily adding the node that most increases ρ(S) becomes more effective.
Figure 2: IM-CA result samples on the two datasets ((a) Flixster with τ = 0.5; (b) NetPHY with τ = 0.5)

Figure 3: IM-CA results on Flixster ((a) τ = 0.1; (b) τ = 0.3; (c) τ = 0.7; (d) τ = 0.9)

Figure 4: IM-CA results on NetPHY ((a) τ = 0.1; (b) τ = 0.3; (c) τ = 0.7; (d) τ = 0.9)
One interesting point is that TIM+ and GF-IM perform similarly when τ = 0.9. This is because when τ is large enough, f(S) is close to Σ_{u∈V} P_u(S), which is the contribution function of the classical activation scheme; in this case GF-IM cannot distinguish cumulative activation from the classical activation proposed in [17]. However, when τ is small, GF-IM slightly outperforms TIM+.

We now turn to the IM-CA results on NetPHY (see Figure 4). GCAF-IM outperforms all other algorithms when τ is at least 0.3. When τ = 0.9, GCAF-IM cumulatively activates about 2500 nodes with 1000 seeds, whereas the seeds found by the other algorithms achieve no more than 1500 cumulatively active nodes, barely more than the seed set itself. Comparing the Flixster and NetPHY figures, the performance gap between GCAF-IM and the other algorithms is larger on NetPHY; this is probably because NetPHY is denser than Flixster and cumulative activation is harder in denser graphs.

Algorithm performance on the SM-CA problem. For the seed minimization problem, the algorithms Random, High-degree and PageRank output very large seed sets and, more importantly, are quite inefficient, since they need many rounds of running as well as time-consuming simulations to estimate the current influence size. Thus, in this part we only compare the performance of TIM+, GF-SM and GCAF-SM. We set θ = 500 for the SM-CA problem.
Figure 5: SM-CA results on NetPHY ((a) τ = 0.1; (b) τ = 0.3; (c) τ = 0.5)
Figure 5 shows the performance of the three algorithms on the SM-CA problem on NetPHY. Both GF-SM and GCAF-SM outperform TIM+ significantly. For the large threshold τ = 0.5, GF-SM even outperforms GCAF-SM. This is consistent with Theorem 3: greedily selecting nodes according to the marginal increment of f(·) is better when η is close to n for SM-CA.

Algorithm performance under different τ. We also study the relationship between algorithm performance and the parameter τ individually. Figure 6 shows how the number of cumulatively active nodes changes as τ increases. From these figures, we observe that GCAF-SM is the best algorithm for all settings of τ and seed size k. As τ increases, the number of cumulatively active nodes decreases rapidly for all algorithms. TIM+ and GF-SM behave similarly, since both of them essentially follow the greedy scheme of the classical activation model.
Figure 6: IM-CA results of τ on NetPHY ((a) k = 500; (b) k = 1000)

Note that GCAF-SM directly increases the number of cumulatively active nodes ρ(S) in each round, and it only further compares f(S) when two nodes contribute equally under the contribution function ρ(·). The experimental results thus show that directly optimizing ρ(·) is the best choice, even though ρ(·) is not submodular.

Conclusion of experiments. From these experimental results we can see that GCAF-IM and GCAF-SM perform better than the other algorithms for the IM-CA and SM-CA problems, respectively, on both datasets. For IM-CA, given τ, let C_τ(A) denote the average number of cumulatively active nodes generated by algorithm A over the different values of k. We present the comparison between GCAF-IM and the other algorithms in Table 1; each percentage in Table 1 is computed as (C_τ(GCAF-IM) − C_τ(A)) / C_τ(A).
Similar to Table 1, we show comparison results between GCAF-SM and other algorithms in Table 2.
6. FUTURE WORK
One possible direction for future work is to prove the inapproximability of the SM-CA problem with η < n unconditionally, or under a weaker hardness assumption. In the other direction, it would be interesting to design an approximation algorithm for the SM-CA problem with performance ratio O(n^{1/8−2ε}) on bipartite graphs with η < n, since this would imply an O(n^{1/4−ε})-approximation algorithm for the dense k-subgraph problem. Another direction is to design a more efficient algorithm for the IM-CA problem that guarantees good performance on real-world networks.
7. REFERENCES
[1] N. Barbieri, F. Bonchi, and G. Manco. Topic-aware social influence propagation models. In ICDM'12, pages 81–90. IEEE, 2012.
[2] A. Bhaskara, M. Charikar, E. Chlamtáč, and U. Feige. Detecting high log-densities: an O(n^{1/4}) approximation for densest k-subgraph. In STOC'10, pages 201–210. ACM, 2010.
[3] C. Borgs, M. Brautbar, J. Chayes, and B. Lucier. Maximizing social influence in nearly optimal time. In SODA'14, pages 946–957. ACM-SIAM, 2014.
[4] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1):107–117, 1998.
[5] N. Chen. On the approximability of influence in social networks. SIAM Journal on Discrete Mathematics, 23(3):1400–1415, 2009.
[6] W. Chen, F. Li, T. Lin, and A. Rubinstein. Combining traditional marketing and viral marketing with amphibious influence maximization. In EC'15, pages 779–796. ACM, 2015.
[7] W. Chen, C. Wang, and Y. Wang. Scalable influence maximization for prevalent viral marketing in large-scale social networks. In KDD'10, pages 1029–1038. ACM, 2010.
[8] W. Chen, Y. Wang, and S. Yang. Efficient influence maximization in social networks. In KDD'09, pages 199–208. ACM, 2009.
[9] P. Domingos and M. Richardson. Mining the network value of customers. In KDD'01, pages 57–66. ACM, 2001.
[10] U. Feige. A threshold of ln n for approximating set cover. J. ACM, 45(4):634–652, 1998.
[11] U. Feige, G. Kortsarz, and D. Peleg. The dense k-subgraph problem. Algorithmica, 29(3):410–421, 2001.
[12] A. Goyal, F. Bonchi, L. V. Lakshmanan, and S. Venkatasubramanian. On minimizing budget and time in influence propagation over social networks. Social Network Analysis and Mining, pages 1–14, 2012.
[13] A. Goyal, W. Lu, and L. V. S. Lakshmanan. SIMPATH: an efficient algorithm for influence maximization under the linear threshold model. In ICDM'11, pages 211–220, 2011.
[14] D. Gruhl, R. Guha, D. Liben-Nowell, and A. Tomkins. Information diffusion through blogspace. In WWW'04, pages 491–501. ACM, 2004.
[15] M. T. Hajiaghayi, K. Jain, K. Konwar, L. C. Lau, I. I. Măndoiu, A. Russell, A. Shvartsman, and V. V. Vazirani. The minimum k-colored subgraph problem in haplotyping and DNA primer selection. In IWBRA, 2006.
[16] J. He, S. Ji, R. Beyah, and Z. Cai. Minimum-sized influential node set selection for social networks under the independent cascade model. In MobiHoc'14, pages 93–102. ACM, 2014.
[17] D. Kempe, J. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In KDD'03, pages 137–146. ACM, 2003.
[18] D. Kempe, J. Kleinberg, and É. Tardos. Influential nodes in a diffusion model for social networks. In ICALP'05, pages 1127–1138, 2005.
[19] S. Khot. Ruling out PTAS for graph min-bisection, dense k-subgraph, and bipartite clique. SIAM J. Comput., 36(4):1025–1071, 2006.
[20] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In KDD'07, pages 420–429. ACM, 2007.
[21] C. Long and R.-W. Wong. Minimizing seed set for viral marketing. In ICDM'11, pages 427–436. IEEE, 2011.
Table 1: Comparison of GCAF-IM and other algorithms
tau   Random    High-Degree  Pagerank  TIM      GF
0.1   272.97%   127.66%      36.60%    8.41%    4.27%
0.2   454.51%   153.73%      34.06%    8.15%    9.42%
0.3   501.33%   178.40%      24.67%    14.82%   25.20%
0.4   464.33%   226.50%      21.63%    31.66%   46.92%
0.5   366.06%   279.63%      34.20%    53.74%   70.71%
0.6   272.68%   269.67%      64.71%    69.08%   82.74%
0.7   178.62%   168.07%      79.01%    74.11%   83.84%
0.8   151.38%   143.65%      83.50%    68.91%   75.70%
0.9   144.36%   141.06%      83.39%    68.81%   72.06%
Table 2: Comparison of GCAF-SM and other algorithms
tau   Random    High-Degree  Pagerank  TIM      GF
0.1   411.37%   52.72%       14.83%    1.90%    0.58%
0.3   558.99%   86.55%       25.17%    0.24%    1.62%
0.5   483.51%   116.96%      35.61%    1.18%    4.39%
0.7   294.34%   124.94%      48.34%    7.89%    10.39%
0.9   102.73%   75.30%       41.24%    17.56%   17.90%
[22] M. Richardson and P. Domingos. Mining knowledge-sharing sites for viral marketing. In KDD'02, pages 61–70. ACM, 2002.
[23] J. Tang, J. Sun, C. Wang, and Z. Yang. Social influence analysis in large-scale networks. In KDD'09, pages 807–816. ACM, 2009.
[24] Y. Tang, X. Xiao, and Y. Shi. Influence maximization: near-optimal time complexity meets practical efficiency. In SIGMOD'14, pages 946–957. ACM, 2014.
[25] P. Zhang, W. Chen, X. Sun, Y. Wang, and J. Zhang. Minimizing seed set selection with probabilistic coverage guarantee in a social network. In KDD'14, pages 1306–1315. ACM, 2014.