Discovering Multiple Diffusion Source Nodes in Social ...

0 downloads 0 Views 1MB Size Report
[22] Erik Volz and Lauren Ancel Meyers. Susceptible–infected–recovered epidemics in dynamic contact networks. Proceedings of the Royal Society B: Biological ...
Procedia Computer Science Volume 29, 2014, Pages 443–452 ICCS 2014. 14th International Conference on Computational Science

Discovering Multiple Diffusion Source Nodes in Social Networks Wenyu Zang12 , Peng Zhang2 , Chuan Zhou2 , and Li Guo2 1 2

Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China [email protected] Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China zhangpeng,zhouchuan,[email protected]

Abstract Social networks have greatly amplified spread of information across different communities. However, we recently observe that various malicious information, such as computer virus and rumors, are broadly spread via social networks. To restrict these malicious information, it is critical to develop effective method to discover the diffusion source nodes in social networks. Many pioneer works have explored the source node identification problem, but they all based on an ideal assumption that there is only a single source node, neglecting the fact that malicious information are often diffused from multiple sources to intentionally avoid network audit. In this paper, we present a multi-source locating method based on a given snapshot of partially and sparsely observed infected nodes in the network. Specifically, we first present a reverse propagation method to detect recovered and unobserved infected nodes in the network, and then we use community cluster algorithms to change the multi-source locating problem into a bunch of single source locating problems. At the last step, we identify the nodes having the largest likelihood estimations as the source node on the infected clusters. Experiments on three different types of complex networks show the performance of the proposed method. Keywords: source-locating, multi-source, complex networks, communities

1

Introduction

Online social networks provide a vibrant platform for information dissemination. Known as a double-edged sword, social networks have both advantage and disadvantage in information spread. On the bright side, online advertisements and news can be promoted and spread fast and efficiently via social network. On the dark side, rumors and virus can also disseminate in an uncontrollable manner in social networks. In order to understand the information dissemination, it is of utmost importance to identify the source nodes. That is, how to identify the source nodes in the contact network based on a given snapshot of infected nodes. Along with the rapid development of social networks, uncertain network structure and sparsely observed nodes have made this problem even more challenging. Selection and peer-review under responsibility of the Scientific Programme Committee of ICCS 2014 c The Authors. Published by Elsevier B.V. 

443

Multiple Diffusion Sources Detection

W.Zang, P.Zhang, C.Zhou and L.Guo

Figure 1: Multi-source Locating Problem. There are two sources in this network, i.e.,source 1 and source 2 shown in green nodes. But if we consider it as a singer source locating problem, the yellow node is the most probable solution, which is far from satisfactory.

To date, there have been efforts in studying the source locating problem [5, 6, 16]. The limitation is that these methods assume that networks are tree or approximate tree properties, and thus cannot adapt to real-world network structures. On the other hand, the works [3, 4, 13, 16] merely focus on a single source locating problems. They neglect the reality that information diffusion usually have multiple sources, as shown in Fig.1. In [5, 17], authors solve the source locating problem based on the SI (suspect-infected) propagation model, without considering recovery from infection. Unlike the previous works, in this work we aim to develop a multi-source locating method based on general networks and sparsely observed nodes. The main challenges of the proposed problem include 1) the diversity and uncertainty of the network structure; 2) multiple sources behind the given snapshot of infected nodes; 3) the stochastic recovery from infection in the information propagation process. Specifically, we present a multi-source locating solution based on sparsely observed nodes and arbitrary network structures. First, we propose a reverse propagation method to detect recovered and unobserved infected nodes in a social network. Then, we cluster the infected nodes and detected into predefined community partitions. At the last step, we identify the most likely sources in each community partition. The main contributions of the paper are as follows, • The propose reverse propagation model can detect recovered and unobserved infected nodes in a social network, which can be transplanted to various propagation models in source locating. • We proposed to use community partition method and obtain multi-source locations. The rest of the paper is organized as follows. Section 2 introduces the information propagation model in social networks. Section 3 discusses the new multi-source locating algorithm. Experimental results are presented in Section 4. Section 5 gives a brief review of related work. Section 6 concludes the paper. 444

Multiple Diffusion Sources Detection

2

W.Zang, P.Zhang, C.Zhou and L.Guo

Propagation Model

The fast development of social networks has greatly amplified information propagation among different communities. Modelling the diffusion and propagation of information over large networks has become a critical problem in data mining community. Existing propagation models can be roughly divided into two categories: Epidemic propagation model (e.g. SI, SIR and SIS models) and Influence diffusion model (e.g. IC and LT model). In the sequel, we will introduce three important propagation models (SI, IC and SIR models) and analyze their differences.

Figure 2: SI model: red nodes represent the infected nodes and blue nodes represent suspected nodes.

Figure 3: SIR model: red nodes represent the infected nodes, blue nodes represent suspected nodes and green nodes represent recovered nodes.

Figure 4: Differences between SI, IC and SIR model. SI and IC model are special cases of SIR model

We define an undirected graph G(V, E) (V − set are nodes or vertices, E − set are links). Assume that the contact-network during the epidemic process is static. The spreading process can be simulated using discrete time step model. SI Model: susceptible-infected model is the most basic epidemic model. Each node in the underlying network is in one of two states - Susceptible (S) or Infected (I). Once a node is infected, it will stay infected forever. Each infected node tries to infect each of its susceptible neighbors independently with probability p in each discrete time-step. IC Model: independent cascade model was proposed by Kempe, Kleinberg and Tardos in 2003 [11]. An infected node v infects its susceptible neighbor w with probability p independently of all other nodes. Moreover, each infected node has just one chance to infect its neighbors. SIR Model: susceptible-infected-recovered model is a stochastic process model. At the beginning, all nodes from network G are susceptible except for a set of nodes which are initially infected. In each discrete time-step, infected nodes try to infect their susceptible neighbors with probability p, and these infected nodes can also be recovered with probability q. In addition, all recovered nodes cannot be infected again. The difference between the SI and IC models is that: in the SI model, infected nodes can infect their susceptible neighbors in every time unit, but in the IC model, each infected node has just one chance to influence their susceptible neighbors. In addition, in the SI model, each node has suspected and infected two states; while in the SIR model each node has three states 445

Multiple Diffusion Sources Detection

W.Zang, P.Zhang, C.Zhou and L.Guo

including suspectible, infected and recovered. In summary, the difference of the three models lies in the recovering process as shown in Fig. 4. Moreover, the SI and IC models are all special cases of the SIR model. Specifically in the SIR model, if q ≡ 0 we get the SI model, and if q ≡ 1 we get the IC model. Without loss of generality, we will use SIR propagation method in our multiple sources locating problem.

3

Solution to Multi-source Locating Problem

Let G = (V, E) be a connected undirected graph containing N nodes defined by the set of vertices V and the set of edges E. The graph G is assumed given and static. The information sources, S ∈ G, are the vertexes that originate the information or initiate the diffusion. We can observe that the state of node set O ⊂ V , and the task is to estimate the location of source nodes S based on this snapshot. To handle the problem, we first propose a reverse propagation method to find out recovered nodes from susceptible ones. Besides, this method can also detect unobserved infected nodes in the network. Second, we partition the infected nodes into several sets based on community detection. Through this partition, we can transform multi-sources locating problem to several independent single source-locating problem. At the last step, we point out the source of each partition independently. Along this line, we split this section into three parts.

3.1

The Reverse Propagation Method

In real-world applications, one of the most difficult problems in rumor source locating is that we can only observe a few infected nodes. Namely, a lot of useful information or hints is not available in the propagation network, e.g., many recovered and infected nodes are unobserved. In order to refill these missing information and simplify the source-locating problem, we propose a score-based reverse propagation method (as shown in Algorithm 1) to recover recovered and infected nodes set. After the process in Algorithm 1, we can restore the unobserved information of propagation, which actually presents a new network for locating analysis. In this paper, we call it as extended infected network.

3.2

Infected Nodes Partitions based on Different Sources

Most existing algorithms work only on a single-source, to overcome the multi-source locating challenge, we employ the divide-and-conquer method to divide complex multi-sources locating problem into several single-source locating problems. Here we introduce three partition solutions as follows, which will be later evaluated in the experiment part. Leading eigenvector based method A good division of an infected nodes into groups will satisfy two characteristics: sparse edges between different groups, abundant edges within the same group. These characteristics can be described as a modified benefit function called modularity [15] Q = (number of edges within communities) − (expected number of such edges) This is a function of the particular division of the infected network into groups, with larger values indicating better divisions. We can maximizing it over all possible divisions of the network. However, exhaustive maximization over all possible divisions is computational intractable. The modularity function can be rewritten in matrix terms, which allows us to express the optimization task as a spectral problem in linear algebra. Newman [14] has demonstrated 446

Multiple Diffusion Sources Detection

W.Zang, P.Zhang, C.Zhou and L.Guo

Algorithm 1: Score based Reverse Propagation Algorithm Input : A social network G = (V, E), which contains a nodes set N , a links set E and a partially observed infected nodes set I ∈ V ; a constant basesore. Output: A extended infected nodes set I ∗ ∈ V , which contains recovered nodes, observed and unobserved infected nodes, nodes once contact to infected nodes but not be infected. Moreover, I ∈ I ∗ ∈ V . Initialize nodes with unique labels C and unique scores Sc, ∀n ∈ V :  1 if n ∈ I Cn , Scn = 0 otherwise Initialize nodes set I  = I for iter 1 to Nstep do for n ∈ I  do for i ∈ nneighbos do update I  = I  i , Ci = 1 update Sci = Sci + Scn , if Ci = 0 for n ∈ V do  if Scn > basecore, I ∗ = I ∗ n Return extended infected nodes set I ∗

that the modularity can be succinctly expressed in terms of the eigenvalues and eigenvectors of a matrix called the modularity matrix. Edge betweenness based method Betweenness was first proposed by Freeman [7][18] as a measure of the centrality and influence of nodes in networks. While edge betweenness centrality is defined as the number of the shortest paths that go through an edge in a graph or network (Girvan and Newman 2002). An edge with a high betweenness centrality score represents a bridge-like connector between two parts of a network. Network partition [8] based on edge betweenness method detection infected nodes of different sources by focusing on the boundaries of communities instead of their cores. Rather than constructing communities by adding the strongest edges to an initially empty vertex set, it construct them by progressively removing edges from the original graph. Edge betweenness based method identify infected nodes as follows: 1. Calculate the betweenness for all edges in the network. 2. Remove the edge with the highest betweenness. 3. Recalculate betweennesses for all edges affected by the removal. 4. Repeat from step 2 until no edges remain. Mixed membership block model MMSB (mixed membership stochastic model) [1] partition infected nodes based on the assumption that nodes infected by the same source are more likely to link each other while nodes infected by different sources have less contact. Given a graph G = (N, Y ) has N nodes and the link collection Y (Y (p, q) ∈ 0, 1). K is the number of its latent groups. The goal of MMSB is to maximize the likelihood of collection of edges with 447

Multiple Diffusion Sources Detection

W.Zang, P.Zhang, C.Zhou and L.Guo

respected to model parameters α and B.   P (Y (p, q)|zp→q , zp←q , B) p(Y | α, B) = π Z s

p,q

P (zp→q |πp )P (zp←q |πq )



 P (πp | α) dπ

(1)

p

where B(g, h) in BK×K represents the probability of having a link between a node from group g and a node from group h. And Π = πN ×K is the node’s mixed membership probabilities over partitions.

3.3

Source-location on Given Sets of Nodes

After the above two steps, we actually have translated multi-source locating problem to a bunch of single-source locating problems. The work [4] verifies that centrality measurements can handle these problems well.The degree, closeness, betweenness and eigenvector are the four well-known centrality measurements, which we will adopt and evaluate in this work. Let dij be the length of shortest path between node i and j, then the mean shortest distance with respect to node i is 1  li = dij . (2) n−1 j,j=i

We define the closeness centrality of node i by taking the inverse of the mean shortest distance li with respect to node i, that is, 1 Ci = . (3) li We define betweenness centrality as  ni st Bi = , (4) nst s,t,s=t s=i,t=i

where nist is the number of shortest paths between node s and t that pass through node i, and nst is the total number of shortest paths between s and t. The eigenvector centrality follows the principle that a node connected to some other highrank node tends to have more relative importance in the network. Let si denote the score of the ith node. Let A be the adjacency matrix of the network. For the ith node, let the centrality score be propotional to the sum of the scores of all nodes that are connected to it. Hence, N

si =

1 Aij sj, λ j=1

(5)

where λ is a constant. Equation (5) can be rewritten in vector notation as As = λs.

(6)

The eigenvector associated with the maximal eigenvalue of this equation represents the eigenvalue centrality of the nodes. Experiments in [4] show that betweenness measurement gives better discrimination to sources. Moreover, considering the bias caused by the original network topology, we defined an unbiased betweenness which divides betweenness by the degree of the node, that is, α ˆi = (Bi ) B ki

448

(7)

Multiple Diffusion Sources Detection

W.Zang, P.Zhang, C.Zhou and L.Guo

where ki represents the degree of node i and we can optimize the unbiased betweenness by choosing a proper α. We use this unbiased betweenness as our measurement to rank the suspected source nodes in the given node sets. Thus, we obtain ranks of nodes for k separated sources.

4

Experiments

We experimentally evaluate our solution to multi-source locating problem on three kind of synthetic networks. Experimental results show that our algorithm can outperform other sourcelocating problem algorithms significantly. Experimental setup. We implement our algorithms in Python and realize the SIR model as a discrete event simulation. The infected probability is equal to 0.9 and recovered probability is equal to 0.2. All reported results are averaged over 100 independent runs on different sources. We conduct experiments on three different kinds of synthetic networks, i.e., regular network, barabasi-albert network and erdos-renyi network. All these networks are generated by N etworkX with 5,000 nodes. Quality of infected nodes partition We first test different community detection algorithms on infected nodes partitions, as shown is Fig.5 and Fig.6. We test three community partition algorithms (mentioned in Section 3.2 ) on above three different networks with 500 nodes and 5000 nodes respectively. The number of sources is range from 2 to 4.

Random Regular Network

ER network

BA network

Figure 5: Infected nodes partition accuracy in networks with 500 nodes

Random Regular Network

ER network

BA network

Figure 6: Infected nodes partition accuracy in networks with 5000 nodes 449

Multiple Diffusion Sources Detection

W.Zang, P.Zhang, C.Zhou and L.Guo

We can clearly observe that, leading eigenvector based method performs better than other community partition algorithms, especially when the number of nodes in graph becomes larger. Thus, we select the leading eigenvector based algorithm to partition the infected networks in the subsequent evaluation. Performance of source-locating algorithms We then test our solution to multi-source locating problems on different kind of networks, i.e, Regular network with 5000 nodes where every node with a degree of 3, Erdos-renyi network with 5000 nodes where the probability for edge creation is 0.01 and Barabasi-albert network with 5000 nodes where there are 2 edges to attach from a new node to existing nodes. Moreover, the number of sources is set to 2 (shown in Fig.7)and 3 (shown in Fig.8). The baseline algorithms are based on different centrality measurements of network graphs described in Section 3.2.

Random Regular Network

ER network

BA network

Figure 7: Cumulative probability distribution of average distance between real sources and calculated sources, while real source number k = 2

Random Regular Network

ER network

BA network

Figure 8: Cumulative probability distribution of average distance between real sources and calculated sources, while real source number k = 3 We can see that, our solution achieved very good results on random regular network. Moreover, we also outperform other three solutions on ER and BA networks.

5

Related Works

Over the past decade, there are considerable works on information propagation, including propagation process model , e.g., SIR(susceptible-infected-recovered) model[22], SI(susceptibleinfected) model[2] and IC(independent cascade) model[12]; influence maximization problems [9, 11, 23, 10] (finding a small subset of nodes in a social network that could maximize the 450

Multiple Diffusion Sources Detection

W.Zang, P.Zhang, C.Zhou and L.Guo

spread of influence) and immunization problems [21] (the problem of finding the best nodes for removal to stop static and dynamic graphs). However, the source-location has only been extensively studied recently. The work in [4] demonstrated that the source node in infected snapshot tends to have the highest centrality measurement value. They proposed a simple modification of betweenness that accounts for cases where the source has very low centrality in the original work. But this algorithm only works well on a snow-ball like propagation model. Similarly, some other works for a single source identification under the SI model were also proposed. In particular, [19][20] proposed a maximum likelihood (ML) estimator and derived its asymptotic performance; [5] construct a maximum a posteriori (MAP) estimator to consider a priori knowledge for possible rumor source; [16] maximum probability of correct source localization based on a small fraction of nodes can be observed. But all these works are required the graph share tree properties. Besides, other types of networks and models have also been considered, such as [24] proposed a sample path algorithm under the SIR model. [3] introduce a statistical inference framework on an arbitrary network structure, and [13] use a mean-field-type approach as a product of the marginal probabilities provided by the dynamic message-passing algorithm (DMP). But these models all focus on single source-locating problems, not take into consideration the multi-source locating problems. [17] employ MDL (minimum description length principle) with the goal of identifying multiple culprits. However, it is based on SI model and ignore that infected node can be recovered in realistic situation. Contrary to these approaches, we propose a solution to multi-source locating problems based on partially observed infected nodes in arbitrary networks under the SIR model.

6

Conclusion and Future Work

In this paper, we studied a new multi-source locating problem in social networks, which plays a key role in rumor control and virus elimination in social networks. We first proposed reverse propagation method to detect the unobserved and recovered infected nodes. Then we employed the community detection methods to cluster the infected nodes into different groups. In doing so, we simplified the multi-source locating problem into the single source locating problem in each community. Experiments on various networks showed that we can obtain highly accurate estimations in detecting the source nodes.

Acknowledgments This work was supported by the NSFC (No. 61370025), 863 projects (No. 2011AA010703 and 2012AA012502), 973 project (No. 2013CB329606), and the Strategic Leading Science and Technology Projects of Chinese Academy of Sciences(No.XDA06030200).

References [1] Edoardo M Airoldi, David M Blei, Stephen E Fienberg, and Eric P Xing. Mixed membership stochastic blockmodels. The Journal of Machine Learning Research, 9:1981–2014, 2008. [2] Roy M Anderson, Robert M May, and B Anderson. Infectious diseases of humans: dynamics and control, volume 28. Wiley Online Library, 1992. [3] Nino Antulov-Fantulin, Alen Lancic, Hrvoje Stefancic, Mile Sikic, and Tomislav Smuc. Statistical inference framework for source detection of contagion processes on arbitrary network structures. arXiv preprint arXiv:1304.0018, 2013.

451

Multiple Diffusion Sources Detection

W.Zang, P.Zhang, C.Zhou and L.Guo

[4] Cesar Henrique Comin and Luciano da Fontoura Costa. Identifying the starting point of a spreading process in complex networks. Physical Review E, 84(5):056105, 2011. [5] Wenxiang Dong, Wenyi Zhang, and Chee Wei Tan. Rooting out the rumor culprit from suspects. arXiv preprint arXiv:1301.6312, 2013. [6] Vincenzo Fioriti and Marta Chinnici. Predicting the sources of an outbreak with a spectral technique. arXiv preprint arXiv:1211.2333, 2012. [7] Linton C Freeman. A set of measures of centrality based on betweenness. Sociometry, pages 35–41, 1977. [8] Michelle Girvan and Mark EJ Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002. [9] Amit Goyal, Wei Lu, and Laks VS Lakshmanan. Simpath: An efficient algorithm for influence maximization under the linear threshold model. In 2011 IEEE 11th International Conference on Data Mining (ICDM), pages 211–220. IEEE, 2011. [10] Jing Guo, Peng Zhang, Chuan Zhou, Yanan Cao, and Li Guo. Personalized influence maximization on social networks. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pages 199–208. ACM, 2013. ´ [11] David Kempe, Jon Kleinberg, and Eva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 137–146. ACM, 2003. [12] Masahiro Kimura, Kazumi Saito, and Ryohei Nakano. Extracting influential nodes for information diffusion on a social network. In AAAI, volume 7, pages 1371–1376, 2007. [13] Andrey Y Lokhov, Marc M´ezard, Hiroki Ohta, and Lenka Zdeborov´ a. Inferring the origin of an epidemy with dynamic message-passing algorithm. arXiv preprint arXiv:1303.5315, 2013. [14] Mark EJ Newman. Finding community structure in networks using the eigenvectors of matrices. Physical review E, 74(3):036104, 2006. [15] Mark EJ Newman and Michelle Girvan. Finding and evaluating community structure in networks. Physical review E, 69(2):026113, 2004. [16] Pedro C Pinto, Patrick Thiran, and Martin Vetterli. Locating the source of diffusion in large-scale networks. Physical Review Letters, 109(6):068702, 2012. [17] B Aditya Prakash, Jilles Vreeken, and Christos Faloutsos. Spotting culprits in epidemics: How many and which ones? In 2012 IEEE 12th International Conference on Data Mining (ICDM), pages 11–20. IEEE, 2012. [18] John Scott. Social network analysis. Sociology, 22(1):109–127, 1988. [19] Devavrat Shah and Tauhid Zaman. Rumors in a network: Who’s the culprit? Information Theory, IEEE Transactions on, 57(8):5163–5181, 2011. [20] Devavrat Shah and Tauhid Zaman. Rumor centrality: a universal source detector. 40(1):199–210, 2012. [21] Hanghang Tong, B Aditya Prakash, Charalampos Tsourakakis, Tina Eliassi-Rad, Christos Faloutsos, and Duen Horng Chau. On the vulnerability of large graphs. In 2010 IEEE 10th International Conference on Data Mining (ICDM), pages 1091–1096. IEEE, 2010. [22] Erik Volz and Lauren Ancel Meyers. Susceptible–infected–recovered epidemics in dynamic contact networks. Proceedings of the Royal Society B: Biological Sciences, 274(1628):2925–2934, 2007. [23] Chuan Zhou, Peng Zhang, Jing Guo, Xingquan Zhu, and Li Guo. Ublf: An upper bound based approach to discover influential nodes in social networks. In IEEE 13th International Conference on Data Mining (ICDM), pages 907–916. IEEE, 2013. [24] Kai Zhu and Lei Ying. Information source detection in the sir model: A sample path based approach. In Information Theory and Applications Workshop (ITA), 2013, pages 1–9. IEEE, 2013.

452

Suggest Documents