Improving Information Spread through a Scheduled Seeding Approach

Alon Sela, Department of Industrial Engineering, Tel-Aviv University, Tel-Aviv, Israel. Email: [email protected]
Irad Ben-Gal, Department of Industrial Engineering, Tel-Aviv University, Tel-Aviv, Israel. Email: [email protected]
Alex “Sandy” Pentland, The Media Lab, Massachusetts Institute of Technology (MIT), Cambridge, MA, USA. Email: [email protected]
Erez Shmueli, Department of Industrial Engineering, Tel-Aviv University, Tel-Aviv, Israel. Email: [email protected]
Abstract: One highly studied aspect of social networks is the identification of influential nodes that can spread ideas in a highly efficient way. The vast majority of works in this field have investigated the problem of identifying a set of nodes that, if “seeded” simultaneously, would maximize the information spread in the network. Yet the timing aspect, namely, finding not only which nodes should be seeded but also when to seed them, has not been sufficiently addressed. In this work, we revisit the problem of network seeding and demonstrate by simulations how an approach that takes the timing aspect into account can improve the rates of spread by over 23% compared to existing seeding methods. Such an approach has a wide range of applications, especially in cases where the network topology is easily accessible.
I. INTRODUCTION
Social networks provide a digital platform that has already changed the course of history. A well-known example is President Obama’s pre-election activity on social networks, which contributed to his successful campaign and ultimately led to his election [1]. Another well-known example is the Arab Spring (2010-2011), in which riots and anti-regime activities were initiated and organized through social media [10], resulting in protests by millions of people around the Arab world in an attempt to bring down their local regimes. These examples demonstrate the impact of social media and might explain why current academic research is heavily studying different aspects of social networks.
One highly studied aspect of social networks is the identification of influential nodes that can spread ideas in a highly efficient way. The vast majority of works in this field have investigated the problem of identifying a set of nodes that, if “seeded” (i.e., proactively infected) simultaneously, would maximize the information spread in the network. Most studied seeding strategies utilize topological characteristics of the network to determine which nodes to seed. For example, node centrality measures (e.g., Degree centrality, PageRank, Eigenvector centrality, Betweenness centrality, Katz centrality) are frequently used for this purpose, and each of these measures has its own advantages and disadvantages. Further information on centrality measures can be found in [12], [5], [2], [13], [14].

However, only a few recent works (e.g., [6]) have started to investigate the timing aspect of seeding: that is, not only identifying the correct set of nodes to be seeded at the beginning of the information spread process (i.e., finding “which nodes to seed”), but also choosing the specific time during the process at which each seeding action should occur (i.e., choosing “when to seed each node”). We claim that determining the correct timing for seeding is becoming more relevant nowadays as a result of information overload and the Limited Attention Paradigm [15], [9]. This paradigm assumes that a person can only process a limited amount of received information due to his/her limited attention and processing capabilities. Thus, only some messages would be fully processed, and messages that did not “catch” one’s attention are likely to be forgotten and never spread. One method to capture one’s attention and overcome the attention limitation is through the creation of social hypes, which harness herd behavior tendencies [7], [3]. Hypes are a social signal that might reflect irregular and unusual events; they therefore capture attention resources, which have been shaped through many years of evolution to recognize irregularities. Messages that arrive from different sources at adjacent time periods form an illusion of a hype. Therefore, it is imperative to wisely choose the set of messages that the user would be exposed to, and the time at which this exposure occurs, such that it would create a perceived hype and increase the probability of further spreading the message.

Consequently, the proposed information spread model and the Scheduled Seeding algorithm assume that the act of seeding incurs a financial cost, and that the “infection” impact of each seeding decreases with time as an outcome of retention loss. Thus, seedings need to be planned in order to increase the infection impact in a rather narrow time frame. In many cases, this can be achieved by seeding a rather peripheral node at a specific time at which it has the potential to infect many other nodes. Generally speaking, the condition for seeding a node is that enough (but not too many) of its neighbors have already been infected. If only a few neighbors are infected at the time of seeding, the infection impact of the seed would be too weak (since a hype is unlikely to occur). On the other hand, if too many neighbors have already been infected, the impact of the seed decreases, as many neighbors are already infected anyway.

Preliminary results through a simulative evaluation show that the proposed Scheduled Seeding algorithm increases the final number of infected nodes by 25%-35%, compared to existing methods in which seeding occurs only at the initial stage of a spread.

II. THE SCHEDULED SEEDING PROBLEM
We start this section by providing some preliminary definitions; then we describe the information spread model; finally, we specify the problem statement.

A. Preliminary definitions

We denote:
1) $G = (V, E)$: An undirected graph representation of the social network with $|V| = n$ nodes and $|E| = m$ edges.
2) $\delta^t(v) = \begin{cases} 0 & \text{(Non-Infected)} \\ 1 & \text{(Infected} \wedge \text{Infectious)} \\ 2 & \text{(Infected} \wedge \text{Non-Infectious)} \end{cases}$: The state of node $v$ at time $t$.
3) $V_X^t = \{v \mid v \in V \wedge \delta^t(v) = X\}$: All nodes which are in state $X$ at time $t$.
4) $\Gamma(v) = \{u \mid \{u, v\} \in E\}$: The neighbors of node $v$.
5) $\Gamma_X^t(v) = \Gamma(v) \cap V_X^t$: The neighbors of node $v$ which are in state $X$ at time $t$.
6) $B$: The total budget (without loss of generality, it is assumed that seeding each node costs 1 unit).
7) $N$: The maximal budget that may be spent in each time step (for the rest of this study we assume $N = 1$).
8) $O$: The number of time steps in which an infected node can infect other nodes (infectiousness period).
9) $C$: The infection threshold (if $C$ or more neighbors of $v$ are infectious, then node $v$ also gets infected).
10) $S$: The percentage of infected nodes at the end of the process.
B. The information spread dynamics

Two well-known models that capture the essence of information spread are the Linear Threshold model and the Independent Cascade model [11]. The Linear Threshold model, which was initially proposed by Mark Granovetter [8], enables each node to be in one of two states: Infected or Non-Infected. The infection threshold of a person represents the fraction of his/her social circle that is required in order to change his/her state from Non-Infected to Infected.

While in the Linear Threshold model the probability of a node becoming infected increases as more of its neighbors become infected, the Independent Cascade model does not consider the neighbors at all. Another difference between the two models is the duration in which an infected node can infect others. In the Independent Cascade model, an infection can only occur in the single time step after the infection, but in the Linear Threshold model, an infected node can infect others whenever the threshold has been reached, and the process continues.

The information spread model that we consider here follows the two “classical” models mentioned above. Like the Independent Cascade model, it assumes that a spread is possible only within a limited number of time steps. Yet, unlike the Independent Cascade model, the spread does not occur in the single time step following the infection, but rather over a pre-defined number of steps. The limited period of time during which a node can infect others represents the retention loss as well as the limited attention, by which the probability of processing new information decreases exponentially with time. In addition, the model adopts from the Linear Threshold model the assumption that the more of one’s neighbors have adopted an idea, the higher the probability of its adoption.

More formally, it is assumed that nodes can only get infected, i.e., change their state from $\delta^t(v) = 0$ (Non-Infected) to $\delta^{t+1}(v) = 1$ (Infected and Infectious), in a deterministic way, and that such an infection takes place only if the number of infectious neighbors reaches a certain threshold, i.e., $|\Gamma_1^t(v)| \geq C$. As long as a node is infectious, i.e., $\delta^t(v) = 1$, it can influence other nodes in its surroundings, but $O$ time steps after becoming infectious at time $t$, the node changes its state again to Infected and Non-Infectious, i.e., $\delta^{t+O}(v) = 2$. Similarly to previous studies, we measure the success rate of the infection process ($S$) when nodes are no longer infectious, i.e., $V_1^t = \emptyset$ for some $t > 0$.

The objective in the Scheduled Seeding problem is to maximize the success rate $S$ in a given network by identifying up to $N$ nodes to be seeded at each time step, such that the total number of seeded nodes does not exceed the available budget $B$.

To demonstrate the scheduled seeding problem, let us consider a network with $n = 5$ nodes and $m = 4$ edges as depicted in Figure 1a and the following parameters: $B = 3$, $O = 1$ and $C = 2$. Note that under this setting, it is impossible to infect all five network nodes by utilizing the entire budget for seeding at time $t = 0$, not even by selecting the most central nodes, as demonstrated in Figure 1b. However, by utilizing the budget over time, it becomes possible to infect all five nodes, as demonstrated in Figure 1c, where seeding two nodes at $t = 0$ results in a natural infection of a single node at $t = 1$; then, a third planned seeding at $t = 1$ results in a natural infection of a fifth node at $t = 2$, ending up with the infection of the entire network.

[Figure 1: (a) Network; (b) Initial seeding; (c) Scheduled seeding. Panels show successive time steps starting at T = 0, with seeded nodes marked S. The figure's annotations restate the setting: oblivion (nodes are infectious for 1 time unit after getting infected), complex contagion (at least 2 infectious peers are required for a node to get infected), limited budget (up to 3 nodes can be seeded), deterministic infection; goal: infect as many nodes as possible.]
Fig. 1: A toy example
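The spread dynamics and the toy example above can be checked with a short simulation. The following is a minimal Python sketch, not the authors' implementation: the function name `simulate`, the `schedule` dictionary and the five-node edge list are assumptions made for illustration, since the exact topology of Figure 1a is not recoverable from the text. The edge list was chosen to have $n = 5$ nodes and $m = 4$ edges and to be consistent with the behavior described for $B = 3$, $O = 1$ and $C = 2$.

```python
from itertools import combinations


def simulate(neighbors, schedule, C, O):
    """Run the deterministic spread model and return the set of infected nodes.

    neighbors: dict node -> set of neighbor nodes (undirected graph)
    schedule:  dict time step -> list of nodes seeded at that step (hypothetical format)
    C:         infection threshold (infectious neighbors needed for infection)
    O:         number of time steps a node stays infectious
    """
    infected_at = {}  # node -> time step at which it became infected
    last_seed_time = max(schedule, default=0)
    t = 0
    while True:
        # Planned seedings for this step (already infected nodes are skipped).
        for v in schedule.get(t, []):
            infected_at.setdefault(v, t)
        # Nodes in state 1 at time t: infected fewer than O steps ago.
        infectious = {v for v, t0 in infected_at.items() if t - t0 < O}
        if not infectious and t >= last_seed_time:
            return set(infected_at)
        # Natural infections take effect at time t + 1.
        for v in neighbors:
            if v not in infected_at and len(neighbors[v] & infectious) >= C:
                infected_at[v] = t + 1
        t += 1


# Assumed toy topology (5 nodes, 4 edges): a-c, b-c, c-e, d-e.
toy = {
    "a": {"c"},
    "b": {"c"},
    "c": {"a", "b", "e"},
    "d": {"e"},
    "e": {"c", "d"},
}

# Scheduled seeding: two seeds at t = 0, the third at t = 1.
scheduled_result = simulate(toy, {0: ["a", "b"], 1: ["d"]}, C=2, O=1)

# Initial seeding: try every way of spending the whole budget at t = 0.
best_initial = max(
    len(simulate(toy, {0: list(seeds)}, C=2, O=1))
    for seeds in combinations(toy, 3)
)

print(len(scheduled_result), best_initial)
```

On this assumed topology, the scheduled plan reaches all five nodes, whereas no choice of three simultaneous seeds at $t = 0$ reaches more than four, because the seeds stop being infectious before the last node has enough infectious neighbors.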
III. A SCHEDULED SEEDING ALGORITHM
It should be noted that the Scheduled Seeding problem stated above can be formulated as a classical scheduling problem and solved by combinatorial optimization. Similar scheduling problems in the OR literature are known to be NP-hard. Since actual networks contain millions or billions of nodes, fast heuristics must be developed for practical solutions.

The intuition behind the Scheduled Seeding Algorithm includes two main ideas. First, seeding a highly central node might sometimes be less effective than seeding a less central node, since at a specific point in time the less central node may have a higher potential to infect others in its surrounding neighborhood. Second, in some cases it is better to shift the seeding efforts into a new cluster rather than continue seeding nodes in an existing saturated cluster.

The algorithm starts with an initial seeding at time $t = 0$. For this purpose we choose the node with the highest Eigenvector Centrality score (Eigenvector Centrality was empirically shown to be a good estimator for the influence level of nodes in social networks [4]). At any other stage of the infection process, $t > 0$, we construct a layer of non-infected nodes that are neighbors of infectious nodes in $V_1^t$, denoted by $S_1$. Although nodes in $S_1$ have at least one infectious neighbor, some of them may not get infected naturally in their current state, but may get infected with a little “help”. Therefore, we construct a second layer, a multiset of non-infected nodes that are neighbors of nodes in $S_1$, denoted by $S_2$. This multiset consists of nodes that, if seeded, would necessarily cause nodes in $S_1$ to become infected (since the corresponding nodes in $S_1$ already have $C - 1$ infectious neighbors). Finally, we choose the node in $S_2$ that would cause the largest number of nodes in $S_1$ to become infected, by choosing the node with the highest number of repetitions in $S_2$ (a node that repeats $k$ times in $S_2$ would infect $k$ nodes in $S_1$ when seeded). If no such node exists in $S_2$, the algorithm changes tactics, “jumps” into a new cluster and selects a node with the highest Eigenvector Centrality.
Algorithm 1 Find-The-Next-Node-To-Seed
Input: $G = (V, E)$, $t$, $C$
Output: the next node to seed (EC denotes Eigenvector Centrality)
if $t = 0$ then
    return the node with the highest EC in $G$
end if
$S_1 \leftarrow \emptyset$ // $S_1$ is a regular set
for $v \in V_1^t$ do
    $S_1 \leftarrow S_1 \cup \Gamma_0^t(v)$
end for
$S_2 \leftarrow \emptyset$ // $S_2$ is a multiset
for $v \in S_1$ do
    if $|\Gamma_1^t(v)| = C - 1$ then
        $S_2 \leftarrow S_2 \uplus \Gamma_0^t(v)$ // multiset union, repetitions are kept
    end if
end for
if $|S_2| > 0$ then
    return a node with the highest frequency in $S_2$
else
    “jump” to a new cluster
    return a node with the highest EC in that cluster
end if
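Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the pseudo-code rather than the authors' code. The function name `next_node_to_seed`, the `state` and `centrality` dictionaries, and the toy graph are hypothetical. Two points are assumptions: centrality scores are passed in precomputed (node degree is used below purely as a stand-in), and the “jump” step, which the text does not specify in detail, is approximated by picking the non-infected node with the highest score anywhere in the graph.

```python
from collections import Counter


def next_node_to_seed(neighbors, state, t, C, centrality):
    """Return the next node to seed, following the structure of Algorithm 1.

    neighbors:  dict node -> set of neighbor nodes
    state:      dict node -> 0 (non-infected), 1 (infectious), 2 (non-infectious)
    centrality: dict node -> precomputed centrality score (assumed input)
    """
    non_infected = [v for v in neighbors if state[v] == 0]
    if not non_infected:
        return None  # nothing left to seed
    if t == 0:
        return max(non_infected, key=centrality.get)

    # S1: non-infected neighbors of currently infectious nodes (a regular set).
    s1 = {
        u
        for v in neighbors if state[v] == 1
        for u in neighbors[v] if state[u] == 0
    }

    # S2: multiset of non-infected neighbors of those S1 nodes that are
    # exactly one infectious neighbor short of the threshold C.
    s2 = Counter()
    for v in s1:
        infectious_count = sum(1 for u in neighbors[v] if state[u] == 1)
        if infectious_count == C - 1:
            s2.update(u for u in neighbors[v] if state[u] == 0)

    if s2:
        # The most frequent node would tip the largest number of S1 nodes.
        return s2.most_common(1)[0][0]

    # Assumed "jump" rule: best-scoring non-infected node in the whole graph.
    return max(non_infected, key=centrality.get)


# Hypothetical example: edges a-c, b-c, c-e, d-e; a and b were seeded at t = 0
# with O = 1, so at t = 1 they are non-infectious and c is infectious.
toy = {
    "a": {"c"},
    "b": {"c"},
    "c": {"a", "b", "e"},
    "d": {"e"},
    "e": {"c", "d"},
}
state = {"a": 2, "b": 2, "c": 1, "d": 0, "e": 0}
degree = {v: len(nb) for v, nb in toy.items()}  # stand-in for EC scores
choice = next_node_to_seed(toy, state, t=1, C=2, centrality=degree)
print(choice)
```

In this example node e has exactly $C - 1 = 1$ infectious neighbor, so its only non-infected neighbor, d, is returned. `Counter` is used for $S_2$ because the algorithm needs the number of repetitions of each candidate, which a plain set would discard.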
The pseudo-code for selecting the next node to seed in the Scheduled Seeding Algorithm is presented in Algorithm 1.

As long as the network is not highly clustered, jumps to new clusters are not required. Nonetheless, when the network is highly clustered, the greedy selection might sometimes become “shortsighted” regarding the future potential value of the nodes which are planned to be seeded. For example, some nodes might receive a high “local” score within their cluster, but since the cluster is already saturated (i.e., most nodes in the cluster have already adopted the new idea), it is better to seed or plan the infection of nodes which are less valuable from the “global” network perspective.

IV. EVALUATION

We evaluated the Scheduled Seeding Algorithm using an agent-based simulation over several different network topologies and different initial conditions of budget ($B$), oblivion ($O$) and infection threshold ($C$). As a baseline for comparison, we used a method which utilizes the entire budget at time $t = 0$ on the set of nodes with the highest Eigenvector Centrality score, denoted by Initial Eigenvector Seeding Algorithm.

Figure 2 visualizes one of these executions for one of the network topologies (a sample of Facebook users, with n = 150, m = 7 and 5 communities) with the following initial conditions: $B = 12$, $C = 3$, $O = 6$. In this case, the Scheduled Seeding Algorithm succeeded in infecting 84% of the nodes in the network when reaching equilibrium, compared to only 71% with the Initial Eigenvector Seeding Algorithm. As can be seen in Figure 2, among the reasons for the superiority of this heuristic is its ability to “jump” to a new cluster when the existing cluster has reached saturation.

[Figure 2: (a) Initial seeding; (b) Scheduled seeding.]
Fig. 2: A visualization of the spread over time.

In order to further validate the effectiveness of the Scheduled Seeding Algorithm, the algorithm was executed 256 times on the same network of Facebook users with varying initial conditions. Examining the results of all 256 executions, we find that the Scheduled Seeding Algorithm outperforms the Initial Eigenvector Seeding Algorithm by 22% (on average). This advantage also holds when the results are broken down by the different initial conditions (see Figure 3).

It should be noted, however, that for very low initial budgets the trend reverses. This reversal can be explained by the inability of the algorithm to exploit the information on natural infections when funds are lacking.
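The Initial Eigenvector Seeding baseline used in this evaluation selects the $B$ nodes with the highest Eigenvector Centrality and seeds them all at $t = 0$. A minimal Python sketch of that selection is given below under stated assumptions: the function names, the iteration count and the toy graph are hypothetical, and centrality is approximated by power iteration on $A + I$ (the adjacency matrix plus the identity), a standard variant that has the same eigenvectors as $A$ but also converges on bipartite graphs. It is not the simulation code used in the study.

```python
def eigenvector_centrality(neighbors, iterations=200):
    """Approximate Eigenvector Centrality by power iteration on A + I.

    neighbors: dict node -> set of neighbor nodes (connected, undirected graph)
    """
    score = {v: 1.0 for v in neighbors}
    for _ in range(iterations):
        # One multiplication by (A + I): own score plus the neighbors' scores.
        new = {
            v: score[v] + sum(score[u] for u in neighbors[v])
            for v in neighbors
        }
        # Normalize to unit Euclidean length to keep the values bounded.
        norm = sum(x * x for x in new.values()) ** 0.5
        score = {v: x / norm for v, x in new.items()}
    return score


def initial_eigenvector_seeds(neighbors, B):
    """Return the B nodes with the highest centrality, all to be seeded at t = 0."""
    score = eigenvector_centrality(neighbors)
    return sorted(neighbors, key=score.get, reverse=True)[:B]


# Hypothetical five-node graph with edges a-c, b-c, c-e, d-e.
toy = {
    "a": {"c"},
    "b": {"c"},
    "c": {"a", "b", "e"},
    "d": {"e"},
    "e": {"c", "d"},
}
seeds = initial_eigenvector_seeds(toy, B=2)
print(seeds)
```

For larger graphs, a library routine such as NetworkX's `eigenvector_centrality` would normally be preferred over this hand-written loop.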
[Figure 3: number of infected nodes for Scheduled Seeding versus Initial Eigenvector Seeding. (a) Varying budgets (5, 10, 15, 20, 30); (b) Varying infection thresholds (2, 3, 4); (c) Varying oblivion values (1, 2, 3, 4).]
Fig. 3: Evaluation results for different initial conditions.

The results were consistent on several different networks, including a C. Elegans network, the Political Blog network, the Dolphins social communication network and the Jazz musicians network.

V. SUMMARY AND DISCUSSION

The spread of information through social networks is an important area of research for many industries. Based on the observations that (1) humans are more likely to adopt information which has already been accepted by many of their friends, and (2) humans have a limited capability to process incoming messages, we claim that the arrival time of a message, as well as its source, plays an important role in the process of spreading information. Furthermore, the number of messages that can be sent in a specific period of time might be limited due to the nature of some domains. For example, sales departments are inherently limited in the number of “persuasion” phone calls that can be made in a single day (sales departments are of limited sizes and working hours). Therefore, the tendency to follow the opinions of others and the tendency to forget old messages must be addressed through a Scheduled Seeding policy.

Although optimal solutions for similar scheduling problems are frequently discussed in the Operations Research literature, these problems are known to be NP-hard. The proposed Scheduled Seeding approach, despite being a greedy heuristic which plans for the short term, was found capable of improving the spread of information compared to a state-of-the-art seeding strategy which utilizes the entire budget on the most central nodes at the initial stage. The simplicity and efficiency of the proposed algorithm, together with its high performance, might make it valuable not only as a theoretical solution, but also as a practical tool to be used by organizations that wish to spread information with limited financial resources. Health organizations, for example, could use such a tool to efficiently spread vital information on illness prevention and to increase the adoption rate of related health directives (e.g., STDs, Ebola, HIV, diabetes prevention, early-detection mammograms, etc.). Other organizations that might use such methods are telecommunication companies, for which the network structure can be formed relatively easily from their meta-data and log files.

This study opens up numerous opportunities for future work. For example, our solution assumes that the social graph is provided. This assumption is theoretical to an extent, since in many cases the social graph is a valuable commercial asset and is rarely freely available. As such, it is protected by the technology (e.g., Facebook), by regulations, and by privacy laws. A vast research potential could be found in the relaxation of some of the assumptions, in providing a fast (and preferably parallel) algorithmic implementation, and in conducting a real-life experiment that would evaluate the ability of such an algorithm to increase the adoption of commercial products.
REFERENCES
[1] USA Today, 15 August 2008. Available: http://usatoday30.usatoday.com/news/politics/story/2012-08-15/obamaromney-online/57059418/1 [Last accessed: 2 July 2015].
[2] S. Aral, L. Muchnik, and A. Sundararajan. Engineering social contagions: Optimal network seeding in the presence of homophily. Network Science, 1(02):125–153, 2013.
[3] S. E. Asch. Effects of group pressure upon the modification and distortion of judgments. Groups, leadership, and men, pages 222–236, 1951.
[4] A. Banerjee, A. G. Chandrasekhar, E. Duflo, and M. O. Jackson. The diffusion of microfinance. Science, 341(6144):1236498, 2013.
[5] S. P. Borgatti. Centrality and network flow. Social networks, 27(1):55–71, 2005.
[6] F. Chierichetti, J. Kleinberg, and A. Panconesi. How to schedule a cascade in an arbitrary graph. SIAM Journal on Computing, 43(6):1906–1920, 2014.
[7] N. A. Christakis and J. H. Fowler. Connected: The surprising power of our social networks and how they shape our lives. Little, Brown, 2009.
[8] M. Granovetter. Threshold models of collective behavior. American journal of sociology, pages 1420–1443, 1978.
[9] N. O. Hodas and K. Lerman. How visibility and divided attention constrain social contagion. In Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom), pages 249–257. IEEE, 2012.
[10] P. N. Howard, A. Duffy, D. Freelon, M. M. Hussain, W. Mari, and M. Mazaid. Opening closed regimes: what was the role of social media during the Arab Spring? Available at SSRN 2595096, 2011.
[11] D. Kempe, J. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 137–146. ACM, 2003.
[12] M. Newman. Networks: an introduction. Oxford University Press, 2010.
[13] M. E. Newman. A measure of betweenness centrality based on random walks. Social networks, 27(1):39–54, 2005.
[14] P. Shakarian, S. Eyre, and D. Paulo. A scalable heuristic for viral marketing under the tipping model. Social Network Analysis and Mining, 3(4):1225–1248, 2013.
[15] L. Weng, A. Flammini, A. Vespignani, and F. Menczer. Competition among memes in a world with limited attention. Scientific reports, 2, 2012.