2015 IEEE International Conference on Mobile Services
Efficient Sensor Placement Optimization for Early Detection of Contagious Outbreaks in Mobile Social Networks

Chuan Zhou∗, Ruisheng Shi†‡, Wenyu Zang∗§ and Li Guo∗
∗ Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
† Education Ministry Key Laboratory of Trustworthy Distributed Computing and Service, Beijing University of Posts and Telecommunications, Beijing, China
‡ School of Humanities, Beijing University of Posts and Telecommunications, Beijing, China
§ Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Email: [email protected], [email protected], [email protected], [email protected]
The representative work by Leskovec et al. [13], [12] proposed a cost-effective greedy algorithm for near-optimal sensor placement under multiple criteria to detect outbreaks in networks. Based on their work, a number of sophisticated greedy algorithms have been proposed for contamination detection using graph sampling, empirical methods, etc. [6], [20]. Existing greedy algorithms, however, share the same drawback of inefficiency when processing large networks: they are often not scalable and incur heavy computation time. Specifically, each time these algorithms evaluate the objective function for a candidate sensor set, they must run Monte-Carlo (MC) simulations of the diffusion model sufficiently many times to obtain an accurate approximation.
In this paper, we tackle the inefficiency of early outbreak detection algorithms from two complementary directions. In one direction, we design a new greedy algorithm (Upper Bound based Lazy Forward) that improves the efficiency of existing greedy algorithms by pruning unnecessary MC simulations with a new bound. In the other direction, we propose a new Quickest Path heuristic that uses a tractable distance function, as in Eq. (13), to approximately evaluate the #P-hard objective function. Our greedy and heuristic algorithms are derived under the discrete time Susceptible-Infected (SI) model, which is popular for modeling epidemics [1], [2].
We conduct experiments on four real network data sets and compare with the CELF [13], Degree [10], PageRank [4], Inter-monitor Distance [18] and Random algorithms. We measure the results with respect to detection time (the time delay until a message propagated from a diffusion source reaches a sensor) and selection time (the time an algorithm takes to select the sensors). We found that 1) the detection time of the new greedy algorithm exactly equals that of CELF; 2) the new greedy algorithm, compared to CELF, reduces MC simulation calls by more than 90% and achieves a 4-8 times speed-up;
Abstract—In this paper, we investigate the problem of placing sensors in a mobile social network to get quickly informed about contagious outbreaks, i.e., placing k sensors in a network so as to minimize the time until a contaminant, starting from a random node in the network, is detected. We aim to optimize the sensor placement from two complementary directions. One is to improve the original greedy algorithm and its extensions [13] to reduce the sensor selection time, and the other is to propose a new Quickest Path heuristic that can shorten the detection time. We test and compare our algorithms with previous algorithms on four real data sets. Experimental results show that 1) the new greedy algorithm is more efficient than existing greedy algorithms in terms of selection time, 2) the Quickest Path heuristic obtains lower detection time than centrality-based heuristics and is as effective as the greedy algorithms, and 3) the new heuristic has the potential to scale well to large networks, having low detection time and selection time.
Keywords: social network; early detection; contagious outbreak; placing sensors
I. INTRODUCTION
With the wide spread of mobile Internet devices, mobile social networks have greatly changed how information propagates in recent years. However, mobile social networks can mediate the diffusion of not only positive information such as innovations, hot topics, and novel ideas, but also negative information like malicious rumors and disinformation. Hence, many challenges need to be addressed to fully utilize mobile social networks as marketing and information dissemination platforms. In this paper, we present our recent work towards addressing one of these challenges, namely finding pivotal individuals to quickly detect contaminants in a large-scale mobile social network. Detecting the spread of a virus or of misinformation as quickly as possible can prevent a wide range of adverse influence from propagating in mobile social networks, and thus has practical applications in security and privacy. In recent years, cascade detection in networks and the associated outbreak phenomena have aroused considerable interest.
Let J_t := ∪_{0≤i≤t} I_i be the set of nodes that have been infected by step t ≥ 0. Then, at step t+1, each node u ∈ J_t may infect its out-neighbors v ∈ V\J_t independently with probability pp(u, v). Thus, a node v ∈ V\J_t is infected at step t+1 with probability 1 − ∏_{u∈J_t∩Par(v)} (1 − pp(u, v)). If node v is successfully infected, it is added to the set I_{t+1}. Note that each infected node has more than one chance to activate its susceptible out-neighbors until they get infected, and a node stays infected once it has been infected. Obviously, the cumulative infection process (J_t)_{t≥0} is Markovian.
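To make the dynamics concrete, the following minimal Python sketch simulates one run of the discrete time SI model and returns the detection time τ(u, S) defined later in Eq. (1), i.e., the first step at which a sensor in S becomes infected, capped at Tmax. The dictionary-based graph encoding and the function name are our own illustrative choices, not part of the paper.

import random

def simulate_detection_time(graph, source, sensors, t_max):
    # graph: dict {u: {v: pp(u, v)}}; source: the initially infected node u;
    # sensors: the set S; returns tau(u, S) capped at t_max (see Eq. (1)).
    infected = {source}                      # J_t: all nodes infected so far
    if source in sensors:
        return 0
    for t in range(1, t_max + 1):
        newly = set()
        for u in infected:
            # every infected node keeps trying its susceptible out-neighbors
            for v, p in graph.get(u, {}).items():
                if v not in infected and v not in newly and random.random() < p:
                    newly.add(v)
        infected |= newly
        if infected & sensors:
            return t
        if not newly:                        # no further infections are possible
            break
    return t_max

Averaging this quantity over many runs, with sources drawn from the prior π, gives an MC estimate of the expected detection time T(S) in Eq. (2).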
3) the detection time of the new heuristic is close to that of the greedy algorithms and always outperforms existing heuristics; and 4) the new heuristic, compared to CELF, significantly reduces selection time.
The contribution of this work is twofold. First, we propose a new greedy algorithm that beats CELF in terms of selection time. Second, we propose a new heuristic whose detection time is close to that of the greedy algorithms while being orders of magnitude faster than CELF in terms of selection time. Encouraged by these results, we conclude that developing new heuristics is a promising way to detect outbreaks efficiently in large networks. In addition, we theoretically derive an upper bound for the remaining time R(v), which can be used to further prune MC simulations.
B. Problem Formulation
In the discrete time SI model, given an initially infected node u ∈ V and a set of sensors S ⊆ V, the detection time is defined as

τ(u, S) := inf{t ≥ 0 : I_t ∩ S ≠ ∅, given I_0 = {u}} ∧ Tmax,   (1)

where a ∧ b := min{a, b} and Tmax is the length of the observation window. We assume inf{∅} = +∞. The detection time τ(u, S) is the time delay until a contaminant initiated from node u is detected by one of the sensors in S. Furthermore, the expected detection time from a random source node to one of the selected detectors is

T(S) := Σ_{u∈V} π(u) E[τ(u, S)],   (2)
II. RELATED WORK
Cascades and outbreaks happen ubiquitously in various networks. A common way to detect outbreaks is to select important nodes where sensors can be placed for monitoring. This strategy has been widely applied to detect water contaminations in a water distribution network [11] and virus outbreaks in a human society [6]. Some early work places sensors by topological measures, e.g., targeting high degree nodes [16] or highly connected nodes [7]. By taking advantage of the submodular property, Leskovec et al. [13] proposed to optimize the sensor placement under different criteria such as minimizing detection time or population affected. Following the same logic, Goyal et al. [9] proposed CELF++, an extension of CELF, which further reduces the number of estimation calls and runs 35%-55% faster than CELF. Recently, Zhou et al. [21], [22] enhanced CELF for the influence maximization problem by an upper bound based approach that further reduces MC calls. Besides, Berry et al. [3] equated the placement problem with a p-median problem. Li et al. [14] proposed a dynamic-programming (DP) based greedy algorithm with a near-optimal performance guarantee. This paper is inspired by these works and aims to efficiently select important nodes as sensors for early detection of contagious outbreaks in large mobile social networks.
where E[·] denotes expectation under the discrete time SI model and Π = (π(u), u ∈ V) is a given probability distribution over the network, i.e., the prior knowledge of a node being the infection source, with Σ_{u∈V} π(u) = 1. Our goal is to minimize the expected detection time over all potential infection sources. Formally, we formulate the problem as the following discrete optimization problem: we want to find a subset S* ⊆ V such that |S*| = k and T(S*) = min{T(S) : |S| = k, S ⊆ V}, i.e.,

S* = arg min_{|S|=k, S⊆V} T(S),   (3)

where k is a given parameter.
III. PRELIMINARIES
C. Greedy Algorithm
A. Notations
To make better use of the greedy algorithm, we consider an equivalent optimization problem, given in Eq. (4) below,
In this paper, we use the standard discrete time Susceptible-Infected (SI) model to describe infection spreading in mobile social networks. Susceptible nodes are those with at least one infected neighbor, and infected nodes do not recover. Specifically, consider a directed graph G = (V, E) with N nodes in V and edge labels pp : E → (0, 1]. For each edge (u, v) ∈ E, pp(u, v) denotes the propagation probability that v is activated by u through the edge; if (u, v) ∉ E, pp(u, v) := 0. Let Par(v) be the set of parent nodes of v, i.e., Par(v) := {u ∈ V : (u, v) ∈ E}. Given an initially infected set I ⊆ V, the discrete time SI model works as follows. Let I_t ⊆ V be the set of nodes that get infected at step t ≥ 0, with I_0 = I, and define J_t := ∪_{0≤i≤t} I_i.
S* = arg max_{|S|=k, S⊆V} R(S),   (4)
where R(S) := Tmax − T(S) is the remaining time for taking actions once a contaminant is detected. This alternative formulation has the key properties described in Theorem 1, which can be found in [13].
Theorem 1: The optimization problem (4) under the discrete time SI model is NP-hard. The remaining time function R : 2^V → R_+ is monotone and submodular with R(∅) = 0.
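Theorem 1 is what licenses the greedy approach below. As a purely illustrative aid, the following sketch shows how one could empirically spot-check the diminishing-returns property on small instances when f is only available as a (possibly noisy) estimate; it samples random nested pairs S ⊆ T, so it provides evidence rather than a proof, and noisy MC estimates may produce spurious violations. The helper name and sampling scheme are our own assumptions.

import random

def check_submodularity(f, ground_set, trials=200, seed=0):
    # Empirically test f(S|{v}) - f(S) >= f(T|{v}) - f(T) on random S <= T, v outside T.
    rng = random.Random(seed)
    items = list(ground_set)
    for _ in range(trials):
        T = set(rng.sample(items, rng.randint(0, len(items) - 1)))
        S = {x for x in T if rng.random() < 0.5}        # S is a random subset of T
        v = rng.choice([x for x in items if x not in T])
        if f(S | {v}) - f(S) < f(T | {v}) - f(T) - 1e-9:
            return False                                 # found a violating pair
    return True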
Although CELF significantly improves Greedy(k, R), the sensor selection time is still unaffordable on large networks [5]. In particular, in the first round, to establish the initial upper bounds, CELF needs to estimate R({v}) with MC simulations for each node v, leading to N MC calls, which is time-consuming, especially when the network is very large. This limitation raises a rather fundamental question: can we derive an upper bound of R({v}) that can be used to further prune unnecessary spread estimations (MC calls) in Greedy(k, R)? Motivated by this question, in this section we derive an initial upper bound of R({v}) for Greedy(k, R). Based on this bound, we propose a new greedy algorithm, Upper Bound based Lazy Forward (UBLF for short), which outperforms the original CELF algorithm. For simplicity, we hereafter denote R(v) := R({v}), T(v) := T({v}) and τ(u, v) := τ(u, {v}) for all u, v ∈ V.
By these properties, the problem in Eq. (4) can be approximated by the greedy algorithm in Algorithm 1 with the set function f := R. Formally, a non-negative real-valued function f on subsets of V is submodular if f(S ∪ {v}) − f(S) ≥ f(T ∪ {v}) − f(T) for all v ∈ V and S ⊆ T ⊆ V; that is, f has diminishing marginal returns. Moreover, f is monotone if f(S) ≤ f(T) for all S ⊆ T. For any submodular and monotone function f with f(∅) = 0, the problem of finding a set S of size k that maximizes f(S) can be approximated by the greedy algorithm in Algorithm 1. The algorithm iteratively selects a new sensor u that maximizes the incremental change of f and adds it to the set S until k sensors have been selected. The algorithm guarantees an approximation ratio of f(S)/f(S*) ≥ 1 − 1/e, where S is the output of the greedy algorithm and S* is the optimal solution [15].

Algorithm 1: Greedy(k, f)
1: initialize S = ∅
2: for i = 1 to k do
3:   select u = arg max_{w∈V\S} (f(S ∪ {w}) − f(S))
4:   S = S ∪ {u}
5: end for
6: output S
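As a concrete rendering of Algorithm 1, here is a minimal Python sketch of the generic greedy selection for a monotone submodular set function f; passing f as a callable is our own illustrative choice.

def greedy(nodes, f, k):
    # Algorithm 1: repeatedly add the element with the largest marginal gain of f.
    S = set()
    f_S = f(S)                                    # current objective value
    for _ in range(k):
        best_u, best_gain = None, float("-inf")
        for w in nodes:
            if w in S:
                continue
            gain = f(S | {w}) - f_S               # marginal gain of adding w
            if gain > best_gain:
                best_u, best_gain = w, gain
        S.add(best_u)
        f_S += best_gain
    return S

With f := R estimated by MC simulation, each of the k iterations scans all remaining nodes, which is exactly the O(kN) evaluation cost discussed below.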
A. The upper bound of R(v)
In this part, we derive an upper bound of R(v). Before introducing the upper bound in Theorem 2, we first prepare two propositions. Let P^u(v ∈ J_t) denote the probability that node v is infected by step t when the initially infected node is u. We have the first proposition as follows.
Proposition 1: For v ∈ V, the remaining time R(v) under the discrete time SI model can be calculated as shown in Eq. (5) below.
In Greedy(k, R), a thorny problem is that there is no efficient way to compute R(S) for a given placement S. In fact, computing R(S) is #P-hard, which can be shown by a reduction from the problem of counting s-t connectedness in a graph. Because of this #P-hardness, we resort to running Monte-Carlo (MC) simulations of the propagation model for 10,000 trials to obtain an accurate estimate of R(S), leading to expensive selection time. Another source of inefficiency in Greedy(k, R) is that there are O(kN) remaining-time estimations, where k is the number of sensors to select and N is the number of nodes. When N is large, the efficiency of the algorithm is unsatisfactory. Hence, to improve the efficiency of Greedy(k, R), one can either reduce the number of MC simulation calls used to compute R(S), or develop heuristic algorithms that perform fast, approximate estimations of R(S) at the expense of accuracy guarantees.
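To make the estimation step explicit, here is a sketch of how R(S) = Tmax − T(S) could be estimated by MC simulation; it reuses the simulate_detection_time sketch given earlier and assumes the prior π is supplied as a dictionary, both of which are our own illustrative choices (in practice one would rather sample sources from π than enumerate all of them).

def estimate_remaining_time(graph, sensors, prior, t_max, n_runs=10000):
    # MC estimate of R(S) = Tmax - T(S), where T(S) = sum_u pi(u) * E[tau(u, S)].
    expected_detection = 0.0
    for u, pi_u in prior.items():                # prior: dict {node: pi(u)}, sums to 1
        total = 0.0
        for _ in range(n_runs):
            total += simulate_detection_time(graph, u, sensors, t_max)
        expected_detection += pi_u * (total / n_runs)
    return t_max - expected_detection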
R(v) = Σ_{t=0}^{Tmax−1} Σ_{u∈V} π(u) P^u(v ∈ J_t).   (5)

Proof: In fact, by the definition in Eq. (1), we first have

E[τ(u, v)] = E^u[ Σ_{t=0}^{Tmax−1} 1_{v∉J_t} ]
           = Σ_{t=0}^{Tmax−1} E^u[ 1_{v∉J_t} ]
           = Σ_{t=0}^{Tmax−1} P^u(v ∉ J_t)
           = Σ_{t=0}^{Tmax−1} ( 1 − P^u(v ∈ J_t) )
           = Tmax − Σ_{t=0}^{Tmax−1} P^u(v ∈ J_t),

where 1_{v∉J_t} is a binary indicator function: 1_{v∉J_t} = 1 if v is not infected by step t, and 1_{v∉J_t} = 0 otherwise.
IV. PRUNING MONTE-CARLO SIMULATIONS
In order to prune MC simulations in Greedy(k, R), Leskovec et al. [13] exploited the submodular property of the objective function in Eq. (4) and proposed the Cost-Effective Lazy Forward (CELF) selection algorithm. The principle behind it is that the marginal gain of a node in the current iteration cannot exceed its marginal gain in previous iterations, and thus the number of spread estimation calls can be greatly pruned. CELF produces the same sensor set as the original greedy algorithm and is much faster, in fact about 700 times faster according to the report in [13].
Then we have

R(v) = Tmax − T(v)
     = Tmax − Σ_{u∈V} π(u) E[τ(u, v)]
     = Tmax − Σ_{u∈V} π(u) ( Tmax − Σ_{t=0}^{Tmax−1} P^u(v ∈ J_t) )
     = Σ_{t=0}^{Tmax−1} Σ_{u∈V} π(u) P^u(v ∈ J_t),

where the fourth '=' is due to the fact that Σ_{u∈V} π(u) = 1 in the above derivation.
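The local probabilities P^u(v ∈ J_t) appearing in Eq. (5) can themselves be estimated by MC simulation, which also gives a direct numerical check of Proposition 1: combining the estimates below with the prior π and summing over u and t should agree with an MC estimate of R(v). The sketch reuses the dictionary-based graph encoding of the earlier snippets and is our own illustration, not code from the paper.

import random

def estimate_infection_probabilities(graph, source, t_max, n_runs=2000):
    # MC estimate of P^u(v in J_t) for t = 0, ..., t_max - 1 and a fixed source u.
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    counts = {v: [0] * t_max for v in nodes}
    for _ in range(n_runs):
        infected = {source}
        step = {source: 0}                       # first step at which each node got infected
        for t in range(1, t_max):
            newly = set()
            for u in infected:
                for v, p in graph.get(u, {}).items():
                    if v not in infected and v not in newly and random.random() < p:
                        newly.add(v)
            for v in newly:
                step[v] = t
            infected |= newly
            if not newly:
                break
        for v, s in step.items():                # v is in J_t for every t >= s
            for t in range(s, t_max):
                counts[v][t] += 1
    return {v: [c / n_runs for c in counts[v]] for v in nodes}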
Proposition 1 reveals that we can treat the global remaining time R(v) as a summation, over all Tmax propagation steps, of the local probabilities {P^u(v ∈ J_t) : 0 ≤ t ≤ Tmax − 1, u ∈ V}. Based on Proposition 1, a natural follow-up question is: what is the relationship between the two sets {P^u(v ∈ J_t) : v ∈ V} and {P^u(v ∈ J_{t−1}) : v ∈ V}?

Proposition 2: For k ≥ 1, we have the following inequality:

P^u(v ∈ J_k) ≤ P^u(v ∈ J_{k−1}) + Σ_{w∈V} P^u(w ∈ J_{k−1}) · pp(w, v).   (7)

Proof: For k ≥ 1, by the definition of conditional expectation and the discrete time SI model, we obtain

P^u(v ∈ J_k) = E^u[ P^u(v ∈ J_k | J_{k−1}) ]
             = E^u[ 1_{v∈J_{k−1}} · 1 + 1_{v∉J_{k−1}} · ( 1 − ∏_{w∈J_{k−1}} (1 − pp(w, v)) ) ]
             ≤ P^u(v ∈ J_{k−1}) + E^u[ 1 − ∏_{w∈J_{k−1}} (1 − pp(w, v)) ]
             ≤ P^u(v ∈ J_{k−1}) + E^u[ Σ_{w∈J_{k−1}} pp(w, v) ]
             = P^u(v ∈ J_{k−1}) + Σ_{w∈V} P^u(w ∈ J_{k−1}) · pp(w, v),

where the first '≤' is due to 1_{v∉J_{k−1}} ≤ 1, and the second '≤' comes from the fact that 1 − ∏_{i=1}^{n} (1 − x_i) ≤ Σ_{i=1}^{n} x_i.

Proposition 2 identifies the ordering relationship between two adjacent elements of the series P^u(v ∈ J_0), P^u(v ∈ J_1), P^u(v ∈ J_2), .... We now simplify the result of Proposition 2 in matrix form. Let PP be the propagation probability matrix whose element (u, v) is pp(u, v). For t ≥ 0, denote by the row vector

Θ^u_t = ( θ^u_t(v) )_{v∈V}   (6)

the probabilities of nodes being infected by step t, i.e., θ^u_t(v) := P^u(v ∈ J_t). Obviously, θ^u_0(v) = 1_{{u}}(v). Proposition 2 can then be rewritten in matrix form as

Θ^u_t ≤ Θ^u_{t−1} + Θ^u_{t−1} · PP.

By iteration, we further get Θ^u_t ≤ Θ^u_0 · (E + PP)^t, where E is the unit matrix. Furthermore, since each θ^u_t(v) is a probability, it follows that

Θ^u_t ≤ Θ^u_0 · (E + PP)^t ∧ 1.   (8)

Hereafter, for a matrix A = (a(i, j)) we define A ∧ 1 := (a(i, j) ∧ 1), applied elementwise. We can now derive the upper bound of the remaining time R(v).

Theorem 2: For each node v ∈ V, the remaining time function R(v) satisfies

R(v) ≤ Σ_{u∈V} π(u) [ Σ_{t=0}^{Tmax−1} (E + PP)^t ∧ 1 ]_{(u,v)},   (9)

where E is the unit matrix and [A]_{(u,v)} is the element at position (u, v) of matrix A.

Proof: With Propositions 1 and 2 in hand, it follows that

R(v) = Σ_{t=0}^{Tmax−1} Σ_{u∈V} π(u) P^u(v ∈ J_t)
     = Σ_{t=0}^{Tmax−1} Σ_{u∈V} π(u) [ Θ^u_t ]_v
     ≤ Σ_{t=0}^{Tmax−1} Σ_{u∈V} π(u) [ Θ^u_0 · (E + PP)^t ∧ 1 ]_v
     = Σ_{u∈V} π(u) [ Σ_{t=0}^{Tmax−1} (E + PP)^t ∧ 1 ]_{(u,v)},

where [a]_v is the element at position v of a row vector a.

Define the remaining time row vector R = ( R(v) )_{v∈V}. Then the upper bound in Eq. (9) becomes

R ≤ Π · [ Σ_{t=0}^{Tmax−1} (E + PP)^t ∧ 1 ],   (10)

where Π is the prior distribution on the likelihood of nodes being the infection source. We now use an example to explain the bound calculation.

Figure 1. An illustration of the upper bound calculation.

Example 1: Consider the directed network G shown in Fig. 1, with the propagation probability matrix given in Eq. (11):

PP = [ 0    0.2  0.1  0
       0    0    0    0.3
       0    0    0    0.2
       0.1  0    0    0  ].   (11)
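As a quick numerical companion to Eq. (10), the following NumPy sketch computes the upper bound vector R ≤ Π · Σ_{t=0}^{Tmax−1} ((E + PP)^t ∧ 1) for the 4-node network above; the function name and the uniform prior are our own choices, and the printed values should match the bounds worked out in Example 1 below.

import numpy as np

def remaining_time_upper_bound(PP, prior, t_max):
    # Upper bound of Eq. (10): prior @ sum_{t=0}^{t_max-1} min((E + PP)^t, 1).
    n = PP.shape[0]
    M = np.eye(n) + PP                   # E + PP
    power = np.eye(n)                    # (E + PP)^0
    acc = np.zeros((n, n))
    for _ in range(t_max):
        acc += np.minimum(power, 1.0)    # the elementwise "wedge 1" clipping
        power = power @ M
    return prior @ acc                   # row vector of per-node upper bounds

# the 4-node network of Eq. (11) with a uniform prior and Tmax = 10
PP = np.array([[0.0, 0.2, 0.1, 0.0],
               [0.0, 0.0, 0.0, 0.3],
               [0.0, 0.0, 0.0, 0.2],
               [0.1, 0.0, 0.0, 0.0]])
print(remaining_time_upper_bound(PP, np.full(4, 0.25), t_max=10))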
Let Tmax = 10. We have

Σ_{t=0}^{Tmax−1} (E + PP)^t ∧ 1 =
[ 10.0000   7.0016   4.7009   5.6006
   3.5402  10.0000   0.6329   7.8000
   2.4336   0.8438  10.0000   7.0016
   4.7009   2.4336   1.2168  10.0000 ].

Assume the prior distribution Π is uniform over the entire graph. Based on Theorem 2, the upper bound of the remaining time vector R can be calculated as

R ≤ Π · [ Σ_{t=0}^{Tmax−1} (E + PP)^t ∧ 1 ] = (5.1687, 5.0698, 4.1376, 7.6006).

In other words, we have R(1) ≤ 5.1687, R(2) ≤ 5.0698, R(3) ≤ 4.1376, and R(4) ≤ 7.6006.

B. The UBLF algorithm
The Cost-Effective Lazy Forward (CELF) algorithm, proposed by Leskovec et al. [13], exploits the submodular property to improve the simple greedy algorithm. The idea is that the marginal gain of a node in the current iteration cannot exceed its marginal gain in previous iterations, and thus the number of remaining time estimations can be significantly reduced. However, CELF demands N (the network size) spread estimations to establish the initial bounds on the marginal increments, which is expensive on large graphs. Based on the bound derived in Theorem 2, we propose a new greedy algorithm, UBLF, that further reduces the number of remaining time estimations, especially in the initialization step. In doing so, all nodes are first ranked by their upper bound scores. We use Example 2 for illustration.
Example 2: We reuse the network of Example 1. The goal here is to find the top-1 node with the maximal remaining time. Obviously, the upper bound of R(4), 7.6006, is the largest in the graph. Thus, we use MC simulation to estimate R(4) and get R(4) ≈ 6.2689. Now, we can observe that 6.2689 is already larger than the upper bounds of R(1), R(2) and R(3). Thus, no extra MC simulations are needed for the other three nodes, and node 4 is the node with the maximal remaining time in the network.
We summarize the UBLF algorithm in Algorithm 2.

Algorithm 2: UBLF
01: Input: the propagation probability matrix PP of G = (V, E), a budget k, a prior distribution Π
02: Output: the sensor set S with k nodes
03: initialize S ← ∅, R(S) ← 0, and δ ← Π · [ Σ_{t=0}^{Tmax−1} (E + PP)^t ∧ 1 ]
04: for i = 1 to k do
05:   set I(v) ← 0 for v ∈ V\S
06:   while TRUE do {
07:     u ← arg max_{v∈V\S} δ_v
08:     if I(u) = 0
09:       δ_u ← MC(S ∪ {u}) − R(S)
10:       I(u) ← 1
11:     end if
12:     if δ_u ≥ max_{v∈V\(S∪{u})} δ_v
13:       R(S ∪ {u}) ← R(S) + δ_u
14:       S ← S ∪ {u}
15:       break
16:     end if }
17: end for
18: output S

In Algorithm 2, the column vector δ = (δ_u) holds upper bounds on the marginal increments under the current sensor set S, i.e., δ_u ≥ R(S ∪ {u}) − R(S). Before searching for the first node (i.e., when S = ∅), we estimate an upper bound for each node by Eq. (10). Then, the algorithm proceeds similarly to CELF. Note that, by submodularity, these upper bounds on the marginal increments are dynamically tightened by MC simulations and become smaller and smaller as the algorithm carries on. In the algorithm, MC(S) denotes the MC simulations we use to estimate R(S) for the sensor set S; I(v) = 0 denotes that MC simulations have not yet been run for node v in the current iteration, and I(v) = 1 means that they have.
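A compact sketch of how Algorithm 2 could be realized with a lazy max-heap is given below; it builds on the remaining_time_upper_bound and estimate_remaining_time sketches above, and all names, signatures and bookkeeping details are our own illustrative assumptions rather than code from the paper.

import heapq

def ublf(graph, PP, prior_vec, prior, k, t_max):
    # Upper Bound based Lazy Forward (Algorithm 2), sketched with a max-heap.
    # prior_vec: numpy prior for Eq. (10); prior: dict prior for the MC estimator.
    # Initial bounds come from Eq. (10), avoiding the N initial MC calls of CELF.
    delta = remaining_time_upper_bound(PP, prior_vec, t_max)
    heap = [(-delta[v], v, False) for v in range(PP.shape[0])]  # (neg bound, node, fresh?)
    heapq.heapify(heap)
    S, R_S = set(), 0.0
    while len(S) < k and heap:
        neg_gain, v, fresh = heapq.heappop(heap)
        if fresh:
            # the top entry is an up-to-date MC marginal gain: accept v (lines 12-15)
            S.add(v)
            R_S += -neg_gain
            # remaining entries keep their values (still valid upper bounds by
            # submodularity) but must be re-estimated before acceptance (line 05)
            heap = [(g, u, False) for (g, u, _) in heap if u not in S]
            heapq.heapify(heap)
        else:
            # refresh this node's bound with an actual MC estimate (lines 08-10)
            gain = estimate_remaining_time(graph, S | {v}, prior, t_max) - R_S
            heapq.heappush(heap, (-gain, v, True))
    return S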
V. QUICKEST PATH HEURISTIC
In large networks, the selection time of greedy algorithms is still unbearable, so we propose a new heuristic as an alternative. Since E[τ(u, S)] is #P-hard to compute, we want a tractable approximate evaluation to replace the heavy MC simulations of the greedy algorithms. Note that E[τ(u, S)] measures the expected time delay for a contaminant initiated at node u to propagate to a sensor in S. An intuitive idea is that the most likely propagation path should be the quickest path from node u to the set S. The question, then, is how to measure the distance from node u to the set S; the proposed Quickest Path heuristic is inspired by answering it. If a random variable X is geometrically distributed with parameter p, then E[X] = 1/p. Since the time that a node u spends activating its child node v is geometrically distributed with parameter pp(u, v), the value 1/pp(u, v) is the expected time for a contaminant to propagate from node u to node v along the edge (u, v). Based on this observation, we label the graph G = (V, E) with a distance function m : E → [1, ∞) defined by m(u, v) := 1/pp(u, v) for each edge (u, v) ∈ E. Fig. 2 gives an example of this conversion.
For any two nodes u and v, let d(u, v) be the shortest distance among the paths connecting u and v in the graph G = (V, E, m). For example, d(1, 4) = min{5 + 3.3, 10 + 5} = 8.3 in Fig. 2. Then, for a subset S ⊆ V, we define

d(u, S) := min_{v∈S} d(u, v).   (12)

Figure 2. An illustration of graph conversion (from propagation probabilities to the corresponding distances).

Intuitively, d(u, S) is the expected time that a contaminant needs to propagate from u to S along the quickest (shortest) path in the graph G = (V, E, m). Therefore, it is reasonable to adopt the following approximate relationship between the detection time and the shortest distance:

E[τ(u, S)] ≈ d(u, S).   (13)

Based on this observation, we propose the Quickest Path heuristic in Algorithm 3. Compared with the greedy algorithms, the essential difference is that E[τ(u, S)] is replaced by d(u, S).

Algorithm 3: Quickest Path Heuristic
01: Input: G = (V, E) with propagation probabilities {pp(u, v)}, and a budget k
02: Output: the sensor set S with k nodes
03: initialize S ← ∅ and the graph G with distances m
04: for i = 1 to k do
05:   w ← arg min_{v∈V\S} Σ_{u∈V} π(u) d(u, S ∪ {v})
06:   S ← S ∪ {w}
07: end for
08: output S
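One possible implementation of Algorithm 3 is sketched below. It evaluates each candidate by a multi-source Dijkstra run on the reversed, distance-labeled graph, which is simple but not the most efficient strategy; the graph encoding, the cap on unreachable sources, and the helper names are our own illustrative choices.

import heapq

def dijkstra_to_set(rev_dist, sources, n):
    # Shortest distance d(u, sources) for every node u, via multi-source Dijkstra
    # on the reversed graph (rev_dist[x][y] = length of the original edge y -> x).
    dist = [float("inf")] * n
    heap = [(0.0, s) for s in sources]
    for _, s in heap:
        dist[s] = 0.0
    heapq.heapify(heap)
    while heap:
        d, x = heapq.heappop(heap)
        if d > dist[x]:
            continue
        for y, w in rev_dist.get(x, {}).items():
            if d + w < dist[y]:
                dist[y] = d + w
                heapq.heappush(heap, (d + w, y))
    return dist

def quickest_path_heuristic(graph, prior, k):
    # Algorithm 3: greedily pick the sensor minimizing sum_u pi(u) * d(u, S + {v}).
    n = len(prior)                                   # prior: list, prior[u] = pi(u)
    rev_dist = {}                                    # edge lengths m(u, v) = 1 / pp(u, v)
    for u, nbrs in graph.items():
        for v, p in nbrs.items():
            rev_dist.setdefault(v, {})[u] = 1.0 / p
    S = set()
    for _ in range(k):
        best_v, best_cost = None, float("inf")
        for v in range(n):
            if v in S:
                continue
            dist = dijkstra_to_set(rev_dist, S | {v}, n)
            # cap unreachable sources so the sum stays finite (a pragmatic choice)
            cost = sum(prior[u] * min(dist[u], 1e9) for u in range(n))
            if cost < best_cost:
                best_v, best_cost = v, cost
        S.add(best_v)
    return S

Maintaining the distances d(u, S) incrementally as sensors are added, rather than rerunning Dijkstra for every candidate, would reduce the selection time further; the version above favors clarity.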
VI. EXPERIMENTS
We conduct experiments on four real data sets to evaluate the UBLF algorithm and the Quickest Path heuristic. The online social network data sets Facebook, Twitter, and Epinions are downloaded from http://snap.stanford.edu/data/. The Digger data set, a heterogeneous network, is obtained from http://arnetminer.org/heterinf. The details of the data sets are listed in Table I, where degree means out-degree. Note that undirected graphs can be regarded as directed graphs by treating undirected links as bidirectional. We implement the algorithms in C++. All experiments are run on a Linux (RedHat 4) machine with a 2.33GHz Intel Xeon CPU and 16GB memory.

Table I. Statistics of the four real networks.
Dataset          Facebook   Digger   Twitter   Epinions
#Node            4,040      8,194    32,986    51,783
#Edge            176,468    56,440   763,713   476,491
Average Degree   21.6       6.9      23.2      9.2
Maximal Degree   1,045      850      674       190

Benchmark methods. We implement five other algorithms for comparison: CELF [13], DEGREE [10] (the sensors are the nodes with the k highest in-degrees), PAGERANK [4], INTER-MONITOR DISTANCE [18] (a heuristic that requires any pair of sensors to be at least d hops apart, where d is chosen as large as possible while still allowing k monitors) and RANDOM. MC simulation is employed to compute the detection time of the sensor set returned by each method: to obtain the detection time of the sensor sets provided by the heuristic algorithms, we run MC simulation on the networks 10,000 times and report the mean. The simple greedy algorithm is not reported, since prior work has already shown that CELF returns the same result with less selection time [5], [13].

Parameter setting. We mainly report results with a uniform propagation probability of 0.1 assigned to each directed link in the network, i.e., pp(u, v) = 0.1 for every directed edge (u, v) ∈ E. One can refer to [8], [17], [19] for learning real values of the parameters {pp(u, v) : (u, v) ∈ E} from available data. Besides, we set the time horizon Tmax = 30 and take the prior distribution Π to be uniform over the network.

A. Results
Number of Monte-Carlo calls. In Table II, we compare the number of MC calls in the first 10 iterations of CELF and UBLF on the four data sets. We can observe that the number of MC calls in UBLF is significantly reduced compared to CELF, especially in the first round. As listed in the last column, the total number of calls in the first 10 iterations of UBLF, compared to CELF, is reduced by 92.9%, 95.0%, 95.7%, and 96.7% on the four data sets, respectively. From this observation, we conclude that UBLF is more efficient than CELF on large networks.
Detection time. Detection time is another important measure. We run tests on the four data sets and obtain detection time results with respect to the parameter k (the number of sensors), where k increases from 1 to 50. The results are shown in Fig. 3. We can observe that UBLF, as an updated version of CELF, has competitive results on the four data sets.
Table II. The number of Monte-Carlo simulations at the first ten iterations.

Iteration    Facebook          Digger            Twitter            Epinions
             CELF     UBLF     CELF     UBLF     CELF      UBLF     CELF      UBLF
1            4,040    48       8,194    67       32,986    448      51,783    437
2            23       18       14       52       323       31       216       193
3            37       31       22       23       121       23       371       87
4            21       30       32       9        28        179      102       227
5            51       43       55       41       18        112      98        169
6            33       12       38       38       78        152      46        161
7            23       42       19       22       67        36       29        82
8            19       42       38       38       38        251      15        134
9            31       13       17       82       98        134      12        120
10           5        23       28       52       82        97       115       136
Sum          4,283    302      8,457    424      33,839    1,463    52,787    1,746

Figure 3. Detection time w.r.t. the number of sensors k on the four data sets: (a) Facebook, (b) Digger, (c) Twitter, (d) Epinions.
An important observation is that the detection times of UBLF and CELF are exactly the same in all four figures, which confirms again that UBLF and CELF select the same nodes; the only difference between UBLF and CELF is the number of MC calls. The Quickest Path heuristic always performs better than the other heuristics and almost the same as UBLF, which indicates that the approximation in Eq. (13) is acceptable.
Selection time. Fig. 4 shows the time cost of selecting sensors with k = 50. We can observe that UBLF is 4-8 times faster than CELF. As for the heuristics, InDegree and QuickestPath are very fast in selecting candidate nodes, taking less than 30 seconds, while PageRank and InterMonitorDistance are slightly slower. One may argue that such an improvement by UBLF is negligible for large networks. In fact, UBLF scales well to large networks: as the size of a network increases, Monte-Carlo simulations take more time, and thus UBLF achieves better performance by pruning more unnecessary Monte-Carlo simulations.
VII. CONCLUSION
In this paper, we have proposed a new greedy algorithm and a new heuristic for fast contaminant detection in large mobile social networks. The greedy algorithm reduces the sensor selection time of existing greedy algorithms while guaranteeing a near-optimal detection time. The new heuristic significantly reduces the detection time of existing heuristics and runs faster than all the greedy algorithms. There are several interesting future directions. First, the upper bound in this study is derived under the discrete time SI model; how to derive a similar bound under continuous time and other diffusion models is still an open question. Second, the prior distribution Π is predefined as uniform in this paper; how to learn the actual distribution parameters from available data remains unexplored.

ACKNOWLEDGMENT
This work was supported by National Grand Fundamental Research 973 Program of China under Grant No.2013CB329605 and Strategic Leading Science and Technology Projects of CAS (No.XDA06030200).
Figure 4. The selection time of the algorithms.
REFERENCES
[1] Fabrizio Altarelli, Alfredo Braunstein, Luca Dall'Asta, Alejandro Lage-Castellanos, and Riccardo Zecchina. Bayesian
inference of epidemics on networks via belief propagation. arXiv preprint arXiv:1307.6786, 2013.
[13] Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, and Natalie Glance. Cost-effective outbreak detection in networks. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 420–429. ACM, 2007.
[2] Norman TJ Bailey et al. The mathematical theory of infectious diseases and its applications. Charles Griffin & Company Ltd, 1975.
[14] Rong-Hua Li, Jeffrey Xu Yu, Xin Huang, and Hong Cheng. Random-walk domination in large graphs. In Proceedings of the 30th IEEE International Conference on Data Engineering. IEEE, 2014.
[3] Jonathan Berry, William E Hart, Cynthia A Phillips, James G Uber, and Jean-Paul Watson. Sensor placement in municipal water networks with temporal integer programming models. Journal of water resources planning and management, 132(4):218–224, 2006.
[15] George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions–I. Mathematical Programming, 14(1):265–294, 1978.
[4] Sergey Brin and Lawrence Page. The anatomy of a largescale hypertextual web search engine. Computer networks and ISDN systems, 30(1):107–117, 1998. [5] Wei Chen, Yajun Wang, and Siyu Yang. Efficient influence maximization in social networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 199–208. ACM, 2009.
[16] Romualdo Pastor-Satorras and Alessandro Vespignani. Immunization of complex networks. Physical Review E, 65(3):036104, 2002.
[6] Nicholas Christakis and James Fowler. Social network sensors for early detection of contagious outbreaks. PloS ONE, 5(9):1–8, 2010.
[17] Kazumi Saito, Ryohei Nakano, and Masahiro Kimura. Prediction of information diffusion probabilities for independent cascade model. In Knowledge-Based Intelligent Information and Engineering Systems, pages 67–75. Springer, 2008.
[7] Reuven Cohen, Shlomo Havlin, and Daniel Ben-Avraham. Efficient immunization strategies for computer networks and populations. Physical review letters, 91(24):247901, 2003.
[18] Eunsoo Seo, Prasant Mohapatra, and Tarek Abdelzaher. Identifying rumors and their sources in social networks. In SPIE Defense, Security, and Sensing, pages 83891I–1–83891I–13. International Society for Optics and Photonics, 2012.
[8] Amit Goyal, Francesco Bonchi, and Laks VS Lakshmanan. Learning influence probabilities in social networks. In Proceedings of the third ACM international conference on Web search and data mining, pages 241–250. ACM, 2010.
[19] Jie Tang, Jimeng Sun, Chi Wang, and Zi Yang. Social influence analysis in large-scale networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 807–816. ACM, 2009.
[9] Amit Goyal, Wei Lu, and Laks VS Lakshmanan. Celf++: optimizing the greedy algorithm for influence maximization in social networks. In Proceedings of the 20th international conference companion on World wide web, pages 47–48. ACM, 2011.
[20] Junzhou Zhao, John C. S. Lui, Don Towsley, Xiaohong Guan, and Pinghui Wang. Social sensor placement in large scale networks: A graph sampling perspective. arXiv preprint arXiv:1305.6489, 2014.
[10] David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 137–146. ACM, 2003.
[21] Chuan Zhou, Peng Zhang, Jing Guo, and Li Guo. An upper bound based greedy algorithm for mining top-k influential nodes in social networks. In WWW, 2014.
[11] Andreas Krause and Carlos Guestrin. Optimizing sensing: From water to the web. Technical report, DTIC Document, 2009.
[22] Chuan Zhou, Peng Zhang, Wenyu Zang, and Li Guo. On the upper bounds of spread for greedy algorithms in social network influence maximization. IEEE Transactions on Knowledge and Data Engineering, PrePrint, doi:10.1109/TKDE.2015.2419659.
[12] Andreas Krause, Jure Leskovec, Carlos Guestrin, Jeanne VanBriesen, and Christos Faloutsos. Efficient sensor placement optimization for securing large water distribution networks. Journal of Water Resources Planning and Management, 134(6):516–526, 2008.