Identifying and Tracking Dynamic Processes in Social Networks
Wayne Chung†, Robert Savell†, Jan-Peter Schütt‡, and George Cybenko†
† Thayer School of Engineering, Dartmouth College, Hanover NH, USA 03755; ‡ Institute of Electronics, University of the Federal Armed Forces, Hamburg, Germany
ABSTRACT
The detection and tracking of embedded malicious subnets in an active social network can be computationally daunting due to the quantity of transactional data generated in the natural interaction of the large numbers of actors comprising a network. In addition, detection of illicit behavior may be further complicated by evasive strategies designed to camouflage the activities of the covert subnet. In this work, we move beyond traditional static methods of social network analysis to develop a set of dynamic process models which encode various modes of behavior in active social networks. These models will serve as the basis for a new application of the Process Query System (PQS) to the identification and tracking of covert dynamic processes in social networks. We present a preliminary result from application of our technique to a real-world data stream: the Enron email corpus.
Keywords: Dynamic Social Network Analysis, SNA, Process Query Systems, PQS, Enron email
1. INTRODUCTION
We introduce a framework for detection and tracking of dynamic processes operating on social networks. While informed by role and structural analysis techniques found in traditional Social Network Analysis (SNA), this work seeks to expand the realm of SNA beyond the traditional stationary analyses of individual roles and community structure by capturing the processes underlying the socio-temporal dynamics of active social networks. The process-driven approach to Dynamic Social Network Analysis (DSNA) adopts the view that a network's organizational structure, as well as the characteristic messaging patterns among nodes, reflects the interaction of a collection of distributed dynamic processes operating on the network. We are currently developing applications of the Process Query System (PQS)1, 2 to identify these processes via their characteristic socio-temporal signatures.
1.1. Problem Domain
A primary impetus for the current direction of our work is the difficulty of tracking hostile embedded entities in an active social network. Malevolent organizations such as terrorist cells may use a variety of techniques to escape detection. Covert entities may cloak language via coding schemes and embedding of transmissions in otherwise innocuous background traffic. The relationships of group members may likewise be disguised. For example, organizational structures such as Leninist networks are designed to minimize the quantity of interactions among group members. The hierarchical control structure of a Leninist network is designed to isolate subordinate group members until a coordinated moment of group activation. In situations such as this, sparsity of communications among covert entities makes it difficult to identify relevant relationships and also to develop a corpus for classification of covert message traffic.
1.2. Traditional SNA vs. Process Analysis on Dynamic Social Networks
Common methods of Social Network Analysis (SNA) have tended toward static analysis of role relationships based on implicit assumptions of diffusive transaction patterns on stationary channels among actors. Semantic methods based on a stable corpus dominate the techniques for message classification. These sorts of stationary analyses are poorly suited to the current problem, in which the behaviors of interest are potentially sparse but well coordinated (synchronous). In this context, a socio-temporal analysis of the generative processes which produce message event traffic can enhance or replace more common semantic analysis and SNA tools.
Process Analysis on Dynamic Social Networks naturally extends the study of social networks to the process domain. The fundamental insight offered by traditional SNA (that the structure of social networks in diverse contexts may be defined and analyzed according to a simple set of organizational principles) extends also to the analysis of the underlying processes driving the dynamics of network behavior. Processes operating within the social network tend to exhibit patterns of temporal evolution which result in characteristic spatio-temporal and socio-temporal signatures. By shifting the focus from stationary analysis to process-driven dynamic network analysis, our methodology is able to exploit a variety of temporal evidence unavailable in a stationary framework. This results in a deeper understanding of the structural and functional relationships in the network. The advantages of the process-based approach include:
• Network relationships may be identified, in a context-independent manner, via the propagation patterns of network transactions.
• Temporally interleaved sub-net activity may be differentiated by synchronous activity patterns, leading to a deeper structural understanding of the network.
• Temporal correlations can reveal coordinated activity of covert sub-nets, even in the presence of language cloaking measures.
• Critical messages or transactions may be identified by resultant bursts in secondary network traffic.
• Critical but sparse coordination channels may be identified via secondary network activity.
• Functional relationships of sub-nets are naturally identified by the process analysis.
• Message and transactional streams are naturally segmented by the analysis to identify critical messages and thread states.
A process-based analysis of network activities exploits the fact that behaviors of interest tend to develop in characteristic spatio-temporal or socio-temporal patterns. The essence of a malevolent activity may be encoded in a process model which is independent of context and language. Process analysis on dynamic social networks provides a powerful mechanism for circumventing the cloaking techniques employed by covertly operating subnetworks.
We are currently developing an application of the Process Query System (PQS) to Dynamic Social Network Analysis with the following primary goals:
• Identify and track the organizational and coordinating activities of embedded networks of hostile entities such as terrorist cells.
• Identify and track instances of potentially malevolent processes such as agent recruitment, attack planning and coordination, logistical preparation, and financial activities such as money laundering via spatio-temporal and socio-temporal process signatures.
• Automatically analyze message and transactional streams generated by networks of hostile entities to identify key entities and key sequences of transactions.
• Learn to automatically filter and focus attention on critical sequences in message and transactional streams given a small collection of labeled sequences.
2. BACKGROUND
Social Network Analysis has traditionally focused on stationary network descriptions to perform a variety of structural analyses such as identification of cohesive subgroups and the characterization and classification of individual roles and relationships in a social network. Some typical analyses seek to establish an individual's influence in a network or to identify nodes in key structural roles such as coordinators, brokers, and bridges. Several general references on stationary SNA are available.3–5
Unfortunately, stationary analyses tend to be ill-suited to dynamic situations in which network organization and message patterns vary with time. For example, in real-world networks, functional roles of individuals or sub-networks are often temporally interleaved, with entities operating in one functional capacity at one moment and in a different capacity, with different subnet relationships, in the next. Interleaved behaviors may be difficult or impossible to distinguish in a stationary analysis. Stationary SNA is also limited in its ability to locate sparse transactional channels. These sparse channels often reflect key relationships, but are ignored in an aggregate traffic analysis. In addition, without a mechanism for tracking dynamics, a stationary analysis technique cannot analyze the transactional streams generated by the social network.
2.1. Dynamics of Network Formation
The evolution of graph theoretic models describing the mechanics of relationship formation in social networks illustrates the utility of a dynamic approach. Early theoretical work on the phenomenon of small world connectivity6, 7 espoused an implicitly stationary view of relationship formation in which connections, once formed, were permanent, thus entailing zero maintenance cost. These simple models accurately capture the scale free behavior of certain types of network formation such as networks of casual acquaintances or the aggregation of WWW page references, but they fail to reflect many interesting aspects of functioning social networks. In contrast, Newman et al.8 present a dynamic model of community organization which captures many of the relevant aspects of dynamic relationship formation in three simple rules defined on the network graph:
1. Limited nodal degree: the probability of a person developing a new acquaintance falls off sharply as the current number of friends reaches a certain level.
2. Clustering: the probability of two people becoming acquainted is significantly higher if they have one or more mutual friends.
3. Decay of friendships: relationships may be broken with a probability or decay rate established by the strength of their mutual interests.
While the scale-free methodology (the small world model)6, 7 exhibits a monotonic densification of network connections which bears little similarity to the real-world process of relationship formation, the dynamic model, though still proceeding via a process of random connection, succeeds in capturing the familiar process states of active search, selection, and relationship decay. The resulting network activity is readily perceived as a more realistic model of a functioning social network. Work by Kleinberg et al.9, 10 develops similar methods of dynamic relationship formation, with models more explicitly grounded in the notion of relationship formation as a dynamic process.
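To make the three rules of the dynamic model concrete, the following is a minimal Python sketch of a toy simulation. It is an illustration under stated assumptions, not the model of Newman et al.: the function name `simulate_friendships`, all parameter values, the hard degree cutoff (standing in for a sharp fall-off), and the constant decay probability (standing in for a rate tied to mutual interests) are hypothetical choices made for this example.

```python
import random


def simulate_friendships(n_people=50, steps=2000, max_degree=5,
                         p_random=0.01, p_mutual=0.5, p_decay=0.005, seed=0):
    """Toy simulation of the three rules; every parameter value is illustrative."""
    rng = random.Random(seed)
    friends = {i: set() for i in range(n_people)}
    for _ in range(steps):
        a, b = rng.sample(range(n_people), 2)
        if b not in friends[a]:
            # Rule 1 (limited degree): a hard cutoff approximates the sharp
            # fall-off in the probability of forming new acquaintances.
            if len(friends[a]) < max_degree and len(friends[b]) < max_degree:
                # Rule 2 (clustering): a mutual friend raises the probability.
                p = p_mutual if friends[a] & friends[b] else p_random
                if rng.random() < p:
                    friends[a].add(b)
                    friends[b].add(a)
        # Rule 3 (decay): each existing tie may break with a fixed probability
        # (a simplification of a rate set by strength of mutual interests).
        for u in range(n_people):
            for v in list(friends[u]):
                if u < v and rng.random() < p_decay:
                    friends[u].discard(v)
                    friends[v].discard(u)
    return friends


network = simulate_friendships()
mean_degree = sum(len(nb) for nb in network.values()) / len(network)
```

Because ties both form and decay, the mean degree in such a simulation fluctuates rather than growing monotonically, which is the qualitative contrast with the stationary models drawn above.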
2.2. Language Based Message Analysis
In this work, we are interested not only in the process of relationship formation but also in the socio-temporal interplay of processes operating on functionally defined subnetworks. A significant challenge to tracking this sort of dynamic interplay of processes lies in establishing functional attributes for messaging events. In the case of a message stream such as our test corpus, the Enron email data set,11 some amount of textual analysis is unavoidable due to interleaving of functionally disparate topic threads. The current application employs a minimal amount of textual classification to coarsely isolate topic threads.
Our textual/temporal clustering technique for thread preprocessing is informed by the work of McCallum et al.12, 13 and also Zhu, Lafferty et al.14 Each of these language based applications12, 13 performs a variant of Latent Dirichlet Allocation (LDA)15, 16 to extract (an unspecified number of) coherent email threads. In the case of McCallum,12, 13 LDA is performed in an author-recipient-topic variable space, enabling similarity of social roles to be established as a byproduct of the process of topic differentiation. The work of Lafferty14 combines LDA in the textual space with a clustering technique based on exponential time kernels to successfully cluster emails according to their textual/temporal correlations.
2.3. Socio-temporal Analysis to Identify Covert Networks
Finally, in an application specific to the security context, Marcus et al.17–19 demonstrate that the socio-temporal signature of messaging events is sufficient to identify an embedded Leninist subnetwork. The Leninist network structure is designed to maximize the control of the network's coordinating nodes via an implicit maximization of the ratio of structural holes20 among subordinates. At the same time, the sparsity of control channels in the network works to minimize the risk of detection. The papers demonstrate that the socio-temporal signature associated with the rapid change in inter-connectivity of the cell members at the moment of its activation may be used to effectively differentiate the cell processes from normal diffusive background chatter and random relationship formation.
3. WEAK PROCESS MODELS FOR PROCESS QUERY SYSTEMS
In this paper, we develop a framework for identification and tracking of message threads based on the Process Query System. Dynamic Social Networks present several theoretical challenges to a process-based analysis, including the stochastic nature of the evidence of process activity and the complexity of the potential configuration space of subnetworks supporting a given process. In this section, we present a brief introduction to PQS followed by an overview of the technique of weak process modeling (a technique for defining probabilistic spatio-temporal or socio-temporal models via high level abstractions), which directly addresses some of the most challenging problems associated with process-driven DSNA.
3.1. PQS for Dynamic Social Network Analysis
A PQS may be visualized as a software system that allows users to interact with temporally indexed streams of data from multiple sources. Whereas traditional databases accept user queries in the form of constraints on field values of records in the database, PQS queries take the form of a process model or description. Given a process query, PQS searches the data stream for evidence of the existence of processes consistent with the model. For an in-depth description of the Process Query System, see Cybenko et al.1, 2
The PQS architecture for our current application to an email datastream is shown in Figure 1. To begin, a model or process description is presented to the TRAFEN engine. As processing initiates, the engine begins to monitor the stream of email events (corresponding to traditional sensor events) arriving at the front end of the system. Any necessary preprocessing is implemented in this stage. The filtered events are then forwarded to the main TRAFEN engine. In this stage, the engine parses the query (model description) and scans incoming events for evidence of processes consistent with the model. During event scanning, the algorithm calculates the likelihood that each event is associated with an existing track (a process instance as described by the model and evidenced by a collection of events). Sets of tracks are collected to form hypotheses (the current set of active processes and associated event tracks). The likelihood of each hypothesis is continuously scored by the engine, and in most cases, the best hypothesis is selected in winner-takes-all fashion.
Figure 1. TRAFEN Engine (PQS) email architecture.
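The association of events with tracks can be illustrated with a small Python sketch. This is not the TRAFEN algorithm, which maintains and scores multiple competing hypotheses; it is a greedy, single-hypothesis simplification in which each event joins its most likely existing track or starts a new one. The names `Track`, `assign_events`, and `time_gap_ll`, the time-gap likelihood, and the new-track threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Track:
    """A process instance evidenced by a collection of events."""
    events: list = field(default_factory=list)
    log_likelihood: float = 0.0


def assign_events(events, event_log_likelihood, new_track_log_likelihood):
    """Greedily attach each event to its best-scoring track, or open a new track."""
    tracks = []
    for ev in events:
        best, best_ll = None, new_track_log_likelihood
        for tr in tracks:
            ll = event_log_likelihood(tr, ev)
            if ll > best_ll:
                best, best_ll = tr, ll
        if best is None:
            # No existing track explains the event better than a new process.
            best = Track()
            tracks.append(best)
        best.events.append(ev)
        best.log_likelihood += best_ll
    return tracks


def time_gap_ll(track, t, tau=1.0):
    """Assumed model: events close in time to a track's last event are likely."""
    return -abs(t - track.events[-1]) / tau


tracks = assign_events([0.0, 0.5, 10.0, 10.2, 1.0], time_gap_ll,
                       new_track_log_likelihood=-3.0)
# The score of this single hypothesis is the sum of its track scores.
hypothesis_score = sum(tr.log_likelihood for tr in tracks)
```

In this example the five timestamps separate into two tracks, one around time 0 to 1 and one around time 10. A full system would retain several alternative assignments and compare their scores rather than committing greedily.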
3.2. Weak Processes on Social Networks
In the present context, a message thread may be viewed as a collection of socio-temporally correlated events which provide evidence of the presence of a coherent process operating on the social network. Local thread states associated with propagation patterns of the thread provide cues to identify key actors and communications. Two primary impediments arise in applying deterministic models to process analysis on social networks. The first is the stochastic nature of the event data, and the second is the combinatorial explosion of the potential configuration space for a deterministically defined process model applied to a generic network. Both of these difficulties suggest a shift from a deterministic to a probabilistic framework for model definition.
To apply PQS to Dynamic Social Network Analysis, we adopt the Weak Process paradigm. In this paradigm, process models consist of two distinct descriptive layers. At the highest level of abstraction, the process models take the form of Finite State Machines or other familiar high-level process descriptions. These high-level models have an implicit spatio-temporal correlation signature which may be defined stochastically and presented to the correlation system. The system then scans the event stream for spatio-temporally or socio-temporally correlated events and collects these events into tracks and hypotheses. Hypotheses are scored according to the posterior probability of the existence of the associated processes, given the collected observations and the spatio-temporal or socio-temporal signatures of the weak process descriptions.
3.3. Weak Process Models in PQS
As shown in Figure 2, a weak process model21 is a probabilistic expression of the characteristic spatio-temporal signature of a high-level, possibly deterministic abstract process model. Weak process models are well suited to the common situation where, due either to the noisy or incomplete nature of the datastream or to uncertainty in the models, processes are best identified and tracked probabilistically. Quite often, evidence of a process is scattered among events across observation space and time. If reasonable estimates of event probabilities are available, either from a priori knowledge of the high-level process or via a learning mechanism using labeled event streams, a Bayesian analysis of the posterior probability of the existence of a process and its current state may be derived.
Figure 2. Weak Process Models.
Figure 3. Text Analysis vs. Process Analysis: Hypothesis Generation and Thread Tracking.
For example, let $X$ denote a dynamic process defined in terms of its state sequence over time $t$, so that $X_{1:t} = x_1, x_2, \ldots, x_t$. Assume that we have two observation spaces $O$ and $Q$ defined by sequences $O_{1:t} = o_1, o_2, \ldots, o_t$ and $Q_{1:t} = q_1, q_2, \ldots, q_t$ respectively. At each time $t$, spatial and temporal events can be correlated if we have the posterior probability $\Pr(s_{i,t} \mid O_{1:t}, Q_{1:t})$ of possible model states $s_i$. A complete description of the joint probabilities is often difficult to compute, but in many cases it is possible to apply simplifying assumptions such as independence of observation spaces. In this case, calculation of $\Pr(s_{i,t} \mid O_{1:t}, Q_{1:t})$ may proceed iteratively (as described in Cybenko et al.21) in the following three steps:
1. At each time $t$, correlate temporal events and compute $\Pr(s_{i,t} \mid O_{1:t})$ and $\Pr(s_{i,t} \mid Q_{1:t})$.
2. Correlate spatial events and compute $\Pr(s_{i,t} \mid O_{1:t}, Q_{1:t})$ using $\Pr(s_{i,t} \mid O_{1:t})$ and $\Pr(s_{i,t} \mid Q_{1:t})$ from Step 1.
3. Replace $\Pr(s_{i,t} \mid O_{1:t})$ and $\Pr(s_{i,t} \mid Q_{1:t})$ with $\Pr(s_{i,t} \mid O_{1:t}, Q_{1:t})$, set $t = t + 1$ and iterate.
In the case where independence assumptions are insufficient, one may alternatively learn the joint probabilities via an appropriate statistical learning method utilizing labeled instances.
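One way to read the three-step iteration is as a recursive Bayesian filter over a small discrete state space. The Python sketch below is a hedged illustration and not the procedure of the cited reference: it assumes the two observation spaces are conditionally independent given the state, so that the fused posterior is proportional to the product of the per-space posteriors divided by the shared prior. The function names, the transition matrix, and the per-state observation likelihoods are hypothetical inputs introduced for this example.

```python
def normalize(p):
    """Scale a list of non-negative weights to sum to one (assumes a nonzero sum)."""
    total = sum(p)
    return [x / total for x in p]


def fuse_step(belief, transition, lik_o, lik_q):
    """One iteration of Steps 1 to 3 for a discrete state space.

    belief      fused posterior over states from the previous time step
    transition  transition[i][j] = Pr(state j at t | state i at t-1), assumed known
    lik_o       lik_o[j] = Pr(o_t | state j), assumed known
    lik_q       lik_q[j] = Pr(q_t | state j), assumed known
    """
    n = len(belief)
    # Temporal prediction: prior over states at time t.
    prior = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Step 1: separate posteriors for each observation space.
    post_o = normalize([prior[j] * lik_o[j] for j in range(n)])
    post_q = normalize([prior[j] * lik_q[j] for j in range(n)])
    # Step 2: fuse under conditional independence given the state.
    fused = [post_o[j] * post_q[j] / prior[j] if prior[j] > 0 else 0.0
             for j in range(n)]
    # Step 3: the caller replaces both per-space posteriors with this value.
    return normalize(fused)


def track_states(belief, transition, observations):
    """Iterate fuse_step over a sequence of (lik_o, lik_q) pairs."""
    history = []
    for lik_o, lik_q in observations:
        belief = fuse_step(belief, transition, lik_o, lik_q)
        history.append(belief)
    return history


T = [[0.9, 0.1], [0.1, 0.9]]
history = track_states([0.5, 0.5], T, [([0.8, 0.2], [0.6, 0.4])] * 3)
```

With both observation streams repeatedly favoring the first state, the fused belief in that state grows at each iteration. When conditional independence does not hold, this product rule overcounts shared evidence, which is the case where learning the joint probabilities becomes necessary.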
4. METHODOLOGY
As previously mentioned, isolation of message threads propagating in a communication stream such as that generated by an email system may be complicated by a number of obfuscating factors. Messages associated with multiple threads or topics are typically intermingled in the message stream, making it difficult to assign individual messages to a particular thread. In addition, in the absence of direct evidence of causal relationships in the message traffic, evolving content and variations in topic language across the space of actors can make it difficult to track a thread as it propagates. Many solutions exist for separation of topic threads; however, most are based on some form of textual analysis and therefore tend to be context specific.
A primary goal in application of process-driven techniques to Dynamic Social Network Analysis is to develop techniques which are language and domain independent. Since the generative dynamic processes responsible for producing the message traffic tend to be generic or context-independent, process-based methods for role classification and thread tracking have the potential to be transported to diverse domains with divergent native vocabularies. In pursuit of this goal, our current methodology seeks to minimize dependence on textual correlations.
4.1. Overview
Figure 3 demonstrates the relationship of textual and process-based correlation techniques in our methodology. Along the horizontal axis, the complementary domains of analysis may be described as textual/temporal correlation on the left (temporal correlation via exponential kernel in the space of multinomial unigram (bag-of-words) classifiers defined by the corpus), and weak process based correlation on the right. Along the vertical axis, there are two modes of hypothesis generation. The lower box corresponds to generation of a naive Bayesian text classifier from a selected set of emails of interest. The upper box corresponds to generation of weak process models given a set of socio-temporally correlated events. While each mode of analysis may inform the other, reliance on textual correlation techniques (the leftmost box in the diagram) is limited in this application to a coarse preprocessing step.
Thread tracking and hypothesis generation tasks flow naturally along the arrows of the chart, with evidence garnered in thread tracking in one domain providing a basis for hypothesis generation in the complementary domain. In our current implementation, we begin at the lower box in the figure and proceed clockwise through the various modes:
1. Select a small set of emails and develop a naive Bayes text classifier to perform the coarse selection of similar emails.
2. Further refine the topic thread by performing temporal correlation in the space defined by the classifier via an exponential kernel technique described below.
3. Select a segment of evidence from Step 2 and define a weak process model reflecting message propagation patterns.
4. Submit the weak process model to the PQS engine for identification and tracking of the weak process in the data stream.
Correlated events (messages) extracted in Step 4 may in turn be selected and a text classifier developed on the new evidence.
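The textual/temporal scoring used in Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than our implementation: it uses the multinomial unigram likelihood and the kernel-weighted influence function of Section 4.2, and it assumes additive smoothing with parameter `alpha` and a kernel of the form exp(-lam * dt). The function names, both parameters, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def train_unigram(docs, vocabulary, alpha=1.0):
    """Estimate a multinomial theta over the vocabulary (alpha is an assumed smoother)."""
    counts = Counter(w for d in docs for w in d if w in vocabulary)
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def log_likelihood(doc, theta):
    """log Pr(d | theta) = sum over words of log theta(word); unknown words are skipped."""
    return sum(math.log(theta[w]) for w in doc if w in theta)


def influence(t, history, lam=0.5):
    """w(t, j): kernel-weighted sum of membership probabilities of earlier documents.

    history holds (t_i, Pr[s_i = j]) pairs; the exponential form of the kernel
    and the rate lam are assumptions made for this sketch.
    """
    return sum(p * math.exp(-lam * (t - ti)) for ti, p in history if ti < t)


vocab = {"gas", "trade", "lunch", "meeting"}
topic = train_unigram([["gas", "trade", "gas"], ["trade", "gas"]], vocab)
background = train_unigram([["lunch", "meeting"], ["meeting", "lunch", "trade"]], vocab)

doc = ["gas", "trade", "price"]
# Step 1: coarse textual selection by log-likelihood ratio against the background.
text_score = log_likelihood(doc, topic) - log_likelihood(doc, background)
# Step 2: temporal refinement from recent documents already assigned to the thread.
time_score = influence(2.0, [(0.0, 1.0), (1.0, 0.5), (3.0, 1.0)], lam=1.0)
```

A positive `text_score` indicates that the topic model explains the document better than the background model, and `time_score` grows when thread members were observed recently. How the two scores are combined into a single selection rule is left open here.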
4.2. Textual/Temporal Preprocessing
Our text preprocessing to isolate a thread topic employs a simplified version of the method presented by Zhu, Lafferty et al.14 The full Time-Sensitive Dirichlet Process Mixture model (tDPM) of the paper performs automatic thread differentiation via Latent Dirichlet methods. For the moment our work is focused on the
Figure 4. Primitive Subprocesses: A) Initiator. B) Coordinator. C) Bridge. D) Group (Triad). E) Terminator.
Figure 5. Example: Combining Thread Primitives.
       x      S      1      2      3
x      0      0      0.01   0.01   0.01
S      0      1      1      1      1
1      0      0      0.01   0      0
2      0      0      0      0.01   0
3      0      0      0      0      0.01
Table 1. Correlation matrix for lazy coordinator node x with supervisor S.
development of the weak process paradigm, with full automation of text processing set aside as an avenue for future work.
To summarize our text preprocessing methodology: in the first step, we select a small set of potentially interesting emails and also randomly select a background set. Given this collection of emails, we develop a Bayesian text classifier in the familiar multinomial unigram text classification framework. Applying this text classifier to the email stream, we obtain a measure of textual correlation or coherence of documents relative to the topic thread. Temporal correlations are estimated via an exponential time kernel. Application of the kernel and classifier to the data stream establishes the rate of textual/temporal topic (or cluster) coherence and is used to further refine the topic thread.
Following Lafferty et al.,14 consider a sequence of documents $d$ with time stamps $t$: $\{(d_1, t_1), \ldots, (d_n, t_n)\}$. We may represent each document $d_i$ as a bag-of-words vector. A topic $j$, as defined by a specific collection of emails drawn from the background corpus, may be represented by a multinomial distribution $\theta_j$ over the corpus vocabulary. The bag-of-words method defines a generative model such that the probability of cluster $j$ generating document $d_i$ is given by:
$$\Pr(d_i \mid \theta_j) = \prod_{v \in \text{vocabulary}} \theta_j(v)^{d_i(v)} \qquad (1)$$
If we define $s_i$ to be the true cluster membership of document $d_i$, then we may define a method for temporal correlation via a weight or 'influence' function given the cluster history to time $t$ $(s_1, \ldots, s_i : t_i < t)$:
$$w(t, j) = \sum_{\{i \mid t_i < t\}} \Pr[s_i = j] \cdot k(t - t_i) \qquad (2)$$