Anomaly Detection in Dynamic Networks of Varying Size 1
Timothy La Fond1 , Jennifer Neville1 , Brian Gallagher2 Purdue University, 2 Lawrence Livermore National Laboratory {tlafond,neville}@purdue.edu,
[email protected]
arXiv:1411.3749v1 [cs.SI] 13 Nov 2014
ABSTRACT Dynamic networks, also called network streams, are an important data representation that applies to many real-world domains. Many sets of network data such as e-mail networks, social networks, or internet traffic networks are best represented by a dynamic network due to the temporal component of the data. One important application in the domain of dynamic network analysis is anomaly detection. Here the task is to identify points in time where the network exhibits behavior radically different from a typical time, either due to some event (like the failure of machines in a computer network) or a shift in the network properties. This problem is made more difficult by the fluid nature of what is considered ”normal” network behavior. The volume of traffic on a network, for example, can change over the course of a month or even vary based on the time of the day without being considered unusual. Anomaly detection tests using traditional network statistics have difficulty in these scenarios due to their Density Dependence: as the volume of edges changes the value of the statistics changes as well making it difficult to determine if the change in signal is due to the traffic volume or due to some fundamental shift in the behavior of the network. To more accurately detect anomalies in dynamic networks, we introduce the concept of DensityConsistent network statistics. These statistics are designed to produce results that reflect the state of the network independent of the volume of edges. On synthetically generated graphs anomaly detectors using these statistics show a a 20400% improvement in the recall when distinguishing graphs drawn from different distributions. When applied to several real datasets Density-Consistent statistics recover multiple network events which standard statistics failed to find, and the times flagged as anomalies by Density-Consistent statistics have subgraphs with radically different structure from normal time steps.
1.
INTRODUCTION
Network analysis is a broad field but one of the more important applications is in the detection of anomalous or critical
events. These anomalies could be a machine failure on a computer network, an example of malicious activity, or the repercussions of some event on a social network’s behavior [16, 12]. In this paper, we will focus on the task of anomaly detection in a dynamic network where the structure of the network is changing over time. For example, each time step could represent one day’s worth of activity on an e-mail network. The goal is then to identify any time steps where the pattern of those communications seems abnormal compared to those of other time steps. As comparing the communication pattern of two network examples directly is complex, one simple approach is to summarize each network using a network statistic then compare the statistics. A number of anomaly detection methods rely on these statistics [15, 2, 8]. Another method is to use a network model such as ERGM [13]. However, both these methods often encounter difficulties when the properties of the network are not static. A typical real-world network experiences many changes in the course of its natural behavior, changes which are not examples of anomalous events. The most common of these is variation in the volume of edges. In the case of an email network where the edges represent messages, the total number of messages could vary based on the time or there could be random variance in the number of messages sent each day. The statistics used to measure the network properties are usually intended to capture some other effect of the network than simply the volume of edges: for example, the clustering coefficient is usually treated as a measure of the transitivity. However, the common clustering coefficient measure is statistically inconsistent as the density of the network changes. Even on an Erdos-Renyi network, which does not explicitly capture transitive relationships, the clustering coefficient will be greater as the density of the network increases. When statistics vary with the number of edges in the network, it is not valid to compare different network time steps using those statistics unless the number of edges is constant in each time step. A similar effect occurs with network models that employ features or statistics which are size sensitive: [11] show that ERGM models learn different parameters given subsets of the same graph, so even if the network properties are identical observing a smaller portion of the network leads to learning a different set of parameters.
The purpose of this work is to analytically characterize statistics by their sensitivity to network density, and offer principled alternatives that are consistent estimators, which empirically give more accurate results on networks with varying densities. The major contributions of this paper are: • We prove that several commonly used network statistics are Density Dependent and poorly reflect the network behavior if the network size is not constant. • We offer alternative statistics that are Density Consistent which measure changes to the distribution of edges regardless of the total number of observed edges. • We demonstrate through theory and synthetic trials that anomaly detection tests using Density Consistent statistics are better at identifying when the distribution of edges in a network has changed. • We apply anomaly detection tests using both types of statistics to real data to show that Density Consistent statistics recover more major events of the network stream while Density Dependent statistics flag many time steps due to a change in the total edge count rather than an identifiable anomaly. • We analyze the subgraphs that changed the most in the anomalous time steps and demonstrate that Density Consistent statistics are better at finding local features which changed radically during the anomaly.
2.
STATISTIC-BASED ANOMALY DETECTION
A statistic-based anomaly detection method is any method which makes its determination of anomalous behavior using network statistics calculated on the graph examples. The actual anomaly detection process can be characterized in the form of a hypothesis test. The network statistics calculated on examples demonstrating normal network behavior form the null distribution, while the statistic on the network being tested for anomalies forms the test statistic. If the test statistic is not likely to have been drawn from the null distribution, we can reject the null hypothesis (that the test network reflects normal behavior) and conclude that it is anomalous. Let Gt = {V, Et } be a multigraph that represents a dynamic network, where V is the node set and Et is the set of edges at time t, with eij,t the number of edges between nodes i and j at time t. The edges represent the number of interactions that occurred between the nodes observed within a discrete window of time. As the number of participating nodes is relatively static compared to the number of communications, we will assume that the node set is a constant; in time steps where a node has no communications we will treat it as being part of the network but having zero edges. Let us define some network statistic Sk (Gt ) designed to measure network property k which we will use as the test statistic (e.g., clustering coefficient). Given some set of time steps tLx ∈ {tL1 , ...tLmax } such that the set of graphs in those
times {GtLx } are all examples of normal network behavior, Sk is calculated on each of these learning set examples to estimate an empirical null distribution. If ttest is the time step we are testing for abnormality, then the value Sk (Gttest ) is the test statistic. Given a specified p-value (referred to as α), we can find threshold(s) that reject p percentage of the null distribution, then draw our conclusion about whether to reject the null hypothesis (conclude an anomaly is present) if the test statistic Sk (Gttest ) falls out of those bounds. For this work we will use a two-tailed Z-test with a p-value of α = 0.05 with the thresholds φlower and φupper . Anomalous test cases where the null hypothesis is rejected correspond to true positives; normal cases where the null hypothesis is rejected correspond to false positives. Likewise anomalous cases where the null is not rejected correspond to false negatives and normal cases where the null is not rejected correspond to true negatives. Deltacon [5] is an example of the statistic-based approach, as are many others [7, 8, 9, 14]. As these models also often incorporate network statistics, we will focus on the statistics themselves in this paper. Moreover, not all methods rely solely on statistics calculated from single networks: there are also delta statistics which measure the difference between two network examples. Netsimile is an example of such a network comparison statistic [1]. For a dynamic network, these delta statistics are usually calculated between the networks in two consecutive time steps. In practical use for hypothesis testing, these delta statistics function the same as their single network counterparts. If the network properties being tested are not static with respect to the time t this natural evolution may cause Sk (Gt ) to change over time regardless of anomalies which makes the null distribution invalid. It is useful in these instances to replace the statistic with a detrended version S ∗k (Gt ) = Sk (Gt ) − f (t) where the function f (t) is some fit to the original statistic values. The paper by [6] describes how to do dynamic anomaly detection using a linear detrending function, but other functions can be used for the detrending. This detrending operation does not change the overall properties of the statistic so for the remainder of the paper assume Sk (Gt ) refers to a detrended version, if appropriate.
2.1
Common Network Statistics
Listed here are some of the more commonly used network statistics for the anomaly detection. Graph Edit Distance: The graph edit distance (GED) [4] is defined as the number of edits that one network needs to undergo to transform into another network and is a measure of how different the network topologies are. In the context of a dynamic network the GED is applied as a delta measure to report how quickly the network is changing. GED(Gt ) = |Vt | + |Vt−1 | − 2 ∗ |Vt ∩ Vt−1 | + |Et | + |Et−1 | − 2 ∗ |Et ∩ Et−1 |
(1)
Degree Distribution Difference: P Define Di,t = j6=i eij,t to be the degree of i, the total number of messages to or from the node. The degree distribution is then a histogram of the number of nodes with a particular degree. Typically real-world network exhibit a power-law
degree distribution [3] but others are possible. To compare the degree distributions of t and t − 1, one option is to take the squared difference between the degree counts for each possible degree. We will call this the degree distribution difference (DD): maxi (Di,t ,Di,t−1 )
X
DD(Gt ) =
k=1
(
X
I(Di,t = k) − I(Di,t−1 = k))2
i
(2)
Other measures can be used to compare the two distributions but the statistical properties of the squared distance described later extend to other distance measures. Weighted Clustering Coefficient: Clustering coefficient is intended as a measure of the transitivity of the network. It is a critical property of many social networks as the odds of interacting with friends of your friends often lead to these triangular relationships. As the standard clustering coefficient is not designed for weighted graphs we will be analyzing a weighted clustering coefficient, specifically the Barrat weighted clustering coefficient (CB)[10]: CB(Gt ) =
X i
X ei,j + ei,k 1 ai,j ∗ ai,k ∗ aj,k N ∗ (ki − 1) ∗ si j,k 2 (3)
P
A second problem occurs when the edge counts in the learning set have high variance. If the statistic is dependent on the number of edges, noise in the edge counts translates to noise in the statistic values which lowers the statistical power of the test. Theorem 3.2 False Negatives for Density Dependent Statistics For any Sk (Gttest ) calculated on a network that is anomalous with respect to property k, if Sk (Gt ) is dependent on |Et | there is some value of the variance of the learning network edge counts such that Sk (Gttest ) is not detected as an anomaly.
P
Where ai,j = I[ei,j > 0], si = j ei,j , and ki = j ai,j . Other weighted clustering coefficients exists but they behave similarly to the Barrat coefficient.
3.
differs sufficiently in its number of edges compared to the learning examples the null hypothesis will be rejected regardless of the other network properties. As for why it is not sufficient to simply label these times as anomalies (due to unusual edge volume) the hypothesis test is designed to test for abnormality in a specific network property. If edge count anomalies also flag anomalies on other network properties, we cannot disambiguate the case where both are anomalous or just the message volume. If just the message volume is unusual, this might simply be an example of an exceptionally busy day where the pattern of communication is roughly the same just elevated. This is a very different case from where both the volume and the distribution of edges are unusual.
DENSITY DEPENDENCE
To illustrate why dependency of the network statistic on the edge count affects the conclusion of hypothesis tests, we will us first investigate statistics that are density dependent. Definition 3.1 A statistic S is density dependent if the value of S(Gt ) is dependent on the density of Gt (i.e., |Et |). Theorem 3.1 False Positives for Density Dependent Statistics Let L = {GtLx } be a learning set of graphs and Gttest be the test graph. If Sk (Gt ) is monotonically dependent on |Et | and {|EtLx |} is bounded by finite Elower and Eupper , there is some |Ettest | that will cause the test case to be rejected regardless of whether the network is an example of an anomaly with respect to property k. Proof. Let Sk (G) be a network statistic that is a monotonic and divergent function with respect to the number of edges in G. Given a set of learning graphs {GtL } the values of Sk ({GtL }) are bounded by max(Sk ({GtL })) and min(Sk ({GtL })), so the critical points φlower and φupper of a hypothesis test using this learning set will be within these bounds. Since an increasing |Ettest | implies Sk (Gttest ) increases or decreases, then there exists a |Ettest | such that Sk (Gttest ) is not within φlower and φupper and will be rejected by the test. If changing the number of edges in an observed network changes the output of the statistic, then if the test network
Proof. Let Sk (Gttest ) be the test statistic and {GtL } be the set of learning graphs where the edge count of any learning graph |EtLx | be drawn according to distribution ΦE . If Sk (Gt ) is a monotonic divergent function of |Et | then as the variance of ΦE increases the variance of Sk (GtLx ) increases as well. For a given α, the hypothesis test thresholds φlower and φupper will widen as the variance increases to incorporate 1 − α learning set instances. Therefore, for a given Sk (Gttest ) there is some value of var(ΦE ) such that φlower < Sk (Gttest ) < φupper . With a sufficient amount of edge count noise, the statistical power of the anomaly detector drops to zero. These theorems have been defined using a statistic calculated on a single network, but some statistics are delta measures which are measured on two networks. In these cases, the edge counts of either or both of the networks can cause edge dependency issues. Lemma 3.3 If a delta statistic Sk (Gt , Gt−1 ) is dependent on either Emin (Gt ) = min(|Et |, |Et−1 |) or E∆ (Gt ) = abs(|Et |− |Et−1 |) then Thms 3.1-3.2 apply to the delta statistic. Proof. For some delta statistic Sk (Gt , Gt−1 ) if the statistic is dependent on |Et |, |Et−1 |, or both, Theorems 3.1 and 3.2 apply to any edge count which influences the statistic. If Sk (Gt , Gt−1 ) depends on E∆ = abs(|Et | − |Et−1 |) then as either |Et | or |Et−1 | change the statistic produced changes leading to the problems described in theorems 3.1 and 3.2. If Sk (Gt , Gt−1 ) depends on Emin = min(|Et |, |Et−1 |) then if both |Et | and |Et−1 | increase or decrease the statistic is affected leading to the same types of errors.
These theorems show that dependency on edge counts can lead to both false positives and false negatives from anomaly detectors that look for unusual network properties other than edge count. In order to distinguish between the observed edge counts in each time step and the other network properties, we need a more specific data model to represent the network generation process with components for each.
4.
DENSITY INDEPENDENCE
Now that we have established the problems with density dependence, we need to define the properties that we would prefer our network statistics to have. To do this we need a more detailed model of how the graph examples were generated.
4.1
Data Model
Let the number of edges in any time step |Et | be a random variable drawn from distribution MEn (t) in times where there is a normal message volume and distribution MEa (t) in times where there is anomalous message volume. Now let the distribution of edges amongst the nodes of the graph be represented by a N ×N matrix Pt where the value of any cell pij,t is the probability of observing a message between two particular node pairs at a particular time. This is a probability distribution so the total matrix mass sums to 1. Like edge count, treat this matrix as drawn from distribution MPn (t) a in normal times and MP k (t) in anomalous times where k is the network parameter that is anomalous (for example, an atypical degree distribution). Any observed network slice can be treated as having been generated by a multinomial sampling process where |Et | edges are selected independently from N xN with probabilities Pt . Denote the sampling procedure for a graph Gt with F (|Et |, Pt ). In the next section, we will detail how this decomposition into the count of edges and their distribution allows us to define statistics which are not sensitive to the number of edges in the network.
4.2
Density Consistency and Unbiasedness
In the above data model, Pt is the distribution of edges in the network, thus any property of the network aside from the volume of edges is encapsulated by the Pt matrix. Therefore, a network statistic designed to capture some network property other than edge count should be a function of Pt alone. Let Sk (Pt ) be some test statistic designed to capture a network property k of Pt , that is independent of the density of Gt . Since Pt is not directly observable we can estimate the statistic with the empirical statistic Sˆk (Gt ) where Gt is used eij,t . to estimate Pt , with pˆij,t = |E t| ˆ t) Definition 4.1 A statistic S is density consistent if S(G is a consistent estimator of S(Pt ). If Sˆk (Gt ) is a consistent estimator of the true value of Sk (Pt ), then observing more edge examples should cause the estimated statistics to converge to the true value given Pt . More specifically it is asymptotically consistent with respect to the true value as the number of observed edges increases. Another way to describe this property is that Sˆk (Gt ) has some bias term dependent on the edge count |Et |, but the bias converges to zero as the edge count increases: lim|Et |→∞ Sˆk (Gt )− Sk (Pt ) = 0. Density consistent statistics allow us to perform
accurate hypothesis tests as long as a sufficient number of edges are observed in the networks. To begin, we will prove that the rate of false positives does not exceed the selected p-value α. Theorem 4.1 False Positive Rate for Density Consistent Statistics As |Et | → ∞ if Sˆk (Gt = F (|Et |, Pt ∼ MPn (t)) converges to Sk (Pt ), the probability of a false positive when testing a time with an edge count anomaly Gttest = F (|Et |, Pt ∼ MPn (t)) approaches α. Proof. Let all learning set graphs GtLx = F (|EtLx | ∼ MEn (tLx ), PtLx ∼ MPn (tLx )) be drawn from non-anomalous |E| and P distributions and the test instance Gttest = F (|Ettest | ∼ MEa (ttest ), Pttest ∼ MPn (ttest )) be drawn from an anomalous |E| distribution but a non-anomalous P distribution. If Sˆk (Gt ) is a consistent estimator of Sk (Pt ), lim|Et |→∞ Sˆk (Gt ) = Sk (Pt ). Then as both |EtLx | and |Ettest | increase, all learning set instances and the test set instance approach the distribution Sk (Pt ∼ MPn (t)). As any threshold is as likely to reject a learning set instance as the test instance, the false positive rate approaches α. As the bias converges to zero, graphs created with the same underlying properties will produce statistic values within the same distribution, making the test case come from the same distribution as the null. Even if the test case has an unusual number of edges, as long as the number of edges is not too small there will not be a false positive. Density consistency is also beneficial in the case of false negatives. Theorem 4.2 False Negative Rate for Density Consistent Statistics As |Et | → ∞ let Sˆk (Gt = F (|Et |, Ptn ∼ MPn (t))) converge to Sk (Ptn ) and Sˆk (Gt = F (|Et |, Pta ∼ MPa (t)) converge to Sk (Pta ). If Sk (Ptn ) and Sk (Pta ) are separable then the probability of a false negative → 0 as |Et | increases. Proof. Let all learning set graphs GtLx = F (|EtLx | ∼ MEn (tLx ), PtLx ∼ MPn (tLx )) be drawn from non-anomalous |E| and P distributions and the test instance Gttest = F (|Ettest | ∼ a MEn (ttest ), Pttest ∼ MP k (ttest )) be drawn from a non-anomalous |E| distribution but a P distribution that is anomalous on the network property being tested. Let Sˆk (Gt ) be a consisa tent estimator of Sk (Pt ). If Sk (Pt ∼ MP k (t)) and Sk (Pt ∼ n MP (t)) are separable, then lim
Sˆk (Gttest )
lim
Sˆk (GtLx )
|Ettest |→∞
and |EtLx |→∞
converge to two non-overlapping distributions and the probability of rejection approaches 1 for any α. The statistical power of a density consistent statistic depends only on whether the P matrices of normal and anomalous graphs are separable using the true statistic value: as long as |Et | is sufficiently large the bias is small enough that it is not a factor in the rate of false negatives.
A special case of density consistency is density consistent and unbiased, which refers to statistics where in addition to consistency the statistic is also an unbiased statistic of the true Sk (Pt ). ˆ t) Definition 4.2 A statistic S is density unbiased if S(G is a unbiased estimator of S(Pt ).
Graph edit distance Degree distribution Barrat clustering Mass shift Degree shift Triangle probability
Dependent X X
Consistent
Unbiased
X X X X
X
Figure 1: Statistical properties of previous network statistics and our proposed alternatives.
Unbiasedness is a desirable property because a density consistent statistic without it may produce errors due to bias when the number of observed edges is low.
5.1
4.3
Claim 5.1 GED is a density dependent statistic.
Proposed Density-Consistent Statistics
We will now define a set of Density-Consistent statistics designed to measure network properties similar to the previously described dependent statistics, but without the sensitivity to total network edge count. Probability Mass Shift: The probability mass shift (MS) is a parallel to GED as a measure of how much change has occurred between the two networks examined. Mass Shift, however, attempts to measure the change in the underlying P distributions and avoids being directly dependent on the edge counts. The probability mass shift between time steps t and t − 1 is M S(Pt ) =
X
(pij,t − pij,t−1 )2
The MS can be thought of as the total edge probability that changes between the edge distributions in t and t − 1. Probabilistic Degree Shift: We will now propose a counterpart to the degree distribution which is density consistent. P Define the probabilistic degree of a node to be P D(vi ) = j∈Vt ,i6=j pij,t . Then, let the probabilistic degree shift of a particular node in Gt be defined as the squared difference of the probabilistic degree of that node in times t and t − 1. The total probabilistic degree shift (DS) of Gt is then: X X X ( pij,t − pij,t−1 )2 i
j6=i
(5)
j6=i
This is a measure of how much the total probability mass of nodes in the graph change across a single time step. If the shape of the degree distribution is changing, the probabilistic degree of nodes will be changing as well. Triangle Probability: As the name suggests, the triangle probability (TP) statistics is an approach to capturing the transitivity of the network and an alternative to traditional clustering coefficient measures. Define the triangle probability as: T P (Pt ) =
X
pij,t ∗ pik,t ∗ pjk,t
(6)
i,j,k∈V,i6=j6=k
5.
When the edge counts of the two time steps are the same, the GED (Eq. 1) can be thought of as the difference in the distribution of edges in the network. However, the GED is sensitive to |Et | in two ways: the change in the number of edges from t−1 to t: E∆ = abs(|Et | − |Et−1 |), and the minimum number of edges in each time step Emin = min(|Et−1 |, |Et |). In both cases the statistic is density dependent, and in fact it diverges as the number of edges increases. The first case is discussed in Theorem 5.1, the second in Theorem 5.2. Theorem 5.1 As E∆ → ∞, GED(Gt ) → ∞, regardless of Pt and Pt−1 .
(4)
ij
DS(Pt ) =
Graph Edit Distance
PROPERTIES OF NETWORK STATISTICS
Now that we have described the different categories of network statistics and their relationship to the network density we will characterize several common network statistics as density dependent or consistent, comparing them to our proposed alternatives. Table 1 summarizes our findings.
Proof. Let E∆ be abs(|Et | − |Et−1 |). Since the GED corresponds to |Et | + |Et−1 | − 2 ∗ |Et ∩ Et−1 |, the minimum edit distance between Gt and Gt−1 occurs when their edge sets overlap maximally and is equal to E∆ . Therefore, as E∆ increases even the minimum (i.e., best case) GED(Gt , Gt−1 ) also increases. Theorem 5.2 As Emin → ∞, GED(Gt ) → ∞ if Pt 6= Pt−1 . b Proof. Let Gt = F (|Et |, Pta ) and Gt−1 = F (|Et−1 |, Pt−1 ), a b a with Pt 6= Pt−1 . Select two nodes i, j such that pij,t 6= pbij,t−1 . The edit distance contributed by those two nodes is abs(eij,t − eij,t−1 ). Let Emin increase but E∆ remain constant. As Emin increases the edit distance of the two nodes converges to abs(Emin ∗paij,t − Emin ∗pbij,t−1 ) = Emin ∗ abs(paij,t − pbij,t−1 ). Since every pair of nodes with differing edge probabilities in the two time steps will have increasing edit distance as Emin increases, the global edit distance will also increase.
Since the GED measure is the literal count of edges and nodes that differ in each graph, the statistic is dependent on the difference in size between the two graphs. Even if the graphs are the same size, comparing two large graphs is likely to produce more differences than two very small graphs due to random chance. In addition, even when small nonanomalous differences occur between the probability distributions of edges in two time steps, variation in the edge count can result in large differences in GED.
5.2
Degree Distribution Difference
Claim 5.2 DD is a density dependent statistic.
The degree distribution is naturally very dependent on the total degree of a network: the average degree of nodes is larger in networks with many edges. The DD measure (Eq. 2) is again sensitive to |Et | via E∆ and Emin . In both cases the statistic is density dependent. The first case, in Theorem 5.3, shows that as E∆ increases, the DD measure will also increase even if the graphs were generated with the same P probabilities. The second case, in Theorem 5.4, shows that as Emin increases, small variations in P will increase the DD measure.
Let CB(Gt , i) refer to the clustering coefficent of node i. Then, in the limit of |Et | it converges to: P
j,k
lim CB(Gt , i) = ∗(
lim I[eij,t > 0] ∗ I[eik,t > 0] ∗ I[ejk,t > 0])
|Et |→∞
P =
|Et |pij,t + |Et |pik,t P j6=i |Et |pij,t
2 ∗ ki,lim ∗
|Et |→∞
j,k
pij,t + pik,t P j6=i pij,t
2 ∗ ki,lim ∗
∗ I[pij,t > 0] ∗ I[pik,t > 0] ∗ I[pjk,t > 0]
Theorem 5.3 As E∆ ↑, DD(Gt ) ↑, regardless of Pt and Pt−1 . Proof. Pick any node i in Gt and j in Gt−1 . If |Et | increases, and |Et−1 | stays the same, the expected degree of i increases, while j stays the same; likewise the inverse is true if |Et−1 | increases and |Et | stays the same. Thus, as E∆ increases the probability of any two nodes having the same degree approaches zero, so the degree distribution difference of the two networks increases with greater E∆ . Theorem 5.4 As EM in ↑, DD(Gt ) ↑ if Pt 6= Pt−1 . P Proof. Let P D(i, Gt ) = k6=i pik,t be the probabilistic degree of node i for Gt = F (|Et |, Pt ). Pick any node i in Gt and j in Gt−1 such that P D(i, Gt ) 6= P D(j, Gt−1 ). For a constant E∆ , as Emin increases, thePexpected degrees of i and j converge to Di,t = Emin ∗ k6=i pik,t and P Dj,t−1 = Emin ∗ k6=j pjk,t−1 respectively. This means that the probability of the two nodes having the same degree approaches zero. Since every pair of nodes with differing edge probabilities in the two time steps will have unique degrees, the degree distribution difference will also increase.
P where ki,lim = j6=i I[pij,t > 0]. Since this limit can be expressed in terms of Pt alone, and CB is a sum of the clustering over all nodes, the Barrat clustering coefficient is a density consistent statistic. If we calculate the expectation of the Barrat clustering, we obtain: E[CB(Gt , i)] = E[
= E[
X ei,j + ei,k 1 ai,j ∗ ai,k ∗ aj,k ] (ki − 1) ∗ si j,k 2 X E[ei,j ] + E[ei,k ] 1 ] (ki − 1) ∗ si j,k 2
∗ E[ai,j ] ∗ E[ai,k ] ∗ E[aj,k ] X pij,t + pik,t 1 = E[ ] (ki − 1) ∗ si j,k 2 ∗ E[ai,j ] ∗ E[ai,k ] ∗ E[aj,k ] X pij,t + pik,t 1 = E[ ] (ki − 1) ∗ si j,k 2 |E |
If a node has very similar edge probabilities in matrices Pt and Pt−1 , when few edges are sampled it is likely to have the same degree in both time steps, and thus the impact on the degree distribution difference will be low. However, as the number of edges increases, even small non-anomalous difference in the P matrices will become more apparent (i.e., the node is likely to be placed in different bins in the degree distribution difference calculation), and the impact on the measure will be larger.
5.3
Weighted Clustering Coefficient
Claim 5.3 CB is a density consistent statistic. As shown in Theorem 5.5 below, the weighted Barrat clustering coefficient (CB, Eq. 3) is in fact density consistent. However, we will show later that the triangle probability statistic is also density unbiased, which gives more robust results, even on very sparse networks. Theorem 5.5 CB(Gt ) is a consistent estimator of CB(Pt ), with a bias that converges to 0. Proof. For Gt = F (|Et |, Pt ) the number of edges observed any pair of nodes can be represented using a multie eyz,t t |! . As |Et | → ∞, nomial distribution eij,t|E p ij,t ...pyz,t !...eyz,t ! ij,t the rate of sampling a particular node pair i, j is pij,t , so: lim
|Et |→∞
eij,t = |Et | ∗ pij,t
|E |
|E |
t t ∗ (1 − pij,tt ) ∗ (1 − pik,t ) ∗ (1 − pjk,t )
As this does not simplify to the limit of CB, it is not an unbiased estimator, and is thus density consistent but not unbiased. Other weighted clustering coefficients are also available but they have the same properties as the Barrat statistic.
5.4
Probability Mass Shift
Claim 5.4 MS is a density consistent statistic. As the true P is unobserved, we cannot calculate the Mass Shift (MS, Eq. 4) statistic exactly and must use the empiriP d pij,t − pˆij,t−1 )2 . cal Probability Mass Shift: M S(Gt ) = ij (ˆ As shown in Theorem 5.6 below, the bias of this estimator approaches 0 as |Et | and |Et−1 | increase, making this a density consistent statistic. d Theorem 5.6 M S(Gt ) is a consistent estimator of M S(Pt ), with a bias that converges to 0. Proof. The expectation of the empirical Mass Shift can be calculated with E[
X
(ˆ pij,t − pˆij,t−1 )2 ]
ij
= E[
X eij,t eij,t−1 2 − ) ] ( |E | |Et−1 | t ij
=
X
E[
ij
e2ij,t |Et
|2
e2ij,t
As the expectation of ten as:
+
e2ij,t−1 |Et−1 |2
−2∗
eij,t eij,t−1 ] |Et | |Et−1 |
d t ) is a consistent estimator of DS(Pt ), Theorem 5.7 DS(G with a bias that converges to 0.
for any node pair i, j can be writ-
Proof. The expectation of the empirical degree shift can be calculated with d t )] = E E[DS(G
E[e2ij,t ] = V ar(eij,t ) + E[eij,t ]2
i
= V ar(Bin(|Et |, pij,t )) + E[Bin(|Et |, pij,t )]2
The expected empirical mass shift can then be written as: E[
ij
e2ij,t |Et |2
+
e2ij,t−1 |Et−1 |2
−2∗
eij,t eij,t−1 ] |Et | |Et−1 |
j,k6=i
X |Et | ∗ pij,t ∗ (1 − pij,t ) |Et |2 ∗ p2ij,t = + 2 |E | |Et |2 t ij + + =
=
|Et−1 |2 X
X
i
j6=i
X pij,t (1 − pij,t ) X p2ij,t−1 + |Et | j6=i j6=i
p2ij,t +
!
|Et | ∗ pij,t |Et−1 | ∗ pij,t−1 −2∗ |Et | |Et−1 |
+2∗
X
p2ij,t − 2 ∗ pij,t ∗ pij,t−1 + p2ij,t−1
=
X
pij,t pik,t−1
j,k6=i
XX i
pij,t (1 − pij,t ) pij,t−1 (1 − pij,t−1 ) + |Et | |Et−1 | X pij,t (1 − pij,t ) 2 (pij,t − pij,t−1 ) + = |Et | ij
pij,t−1 ∗ pik,t−1 − 2 ∗
j,k6=i
ij
pˆij,t −
X
j6=i
pˆij,t−1
2
j6=i
+
X pij,t (1 − pij,t ) |Et | j6=i
X pij,t−1 (1 − pij,t−1 ) + |Et | j6=i
+
+
X
X pij,t−1 (1 − pij,t−1 ) X + pij,t ∗ pˆik,t +2∗ |Et | j6=i j,k6=i
|Et−1 | ∗ pij,t−1 ∗ (1 − pij,t−1 ) |Et−1 |2 |Et−1 |2 ∗ p2ij,t−1
j6=i
j6=i
X e2ij,t−1 X X e2ij,t =E ( + 2 |Et | |Et−1 |2 i j6=i j6=i X eij,t eik,t X eij,t−1 eik,t−1 +2∗ ∗ +2∗ ∗ |Et | |Et | |Et−1 | |Et−1 | j,k6=i j,k6=i # X pˆij,t pˆik,t−1 −2∗ "
= |Et | ∗ pij,t ∗ (1 − pij,t ) + |Et |2 ∗ p2ij,t
X
i hX X X pˆij,t−1 )2 pˆij,t − (
Which is density-consistent because the two additional bias terms converge to 0 as |Et | increases.
pij,t−1 (1 − pij,t−1 ) |Et−1 |
As the two additional bias terms converge to 0 as |Et | and |Et−1 | increase, the empirical mass shift is a consistent estimator of the true mass shift, and is density consistent.
Since the bias converges to 0, the statistic is density consistent. By subtracting out the empirical estimate of this bias term we can hasten the convergence. We use the following bias-corrected empirical degree shift in our experiments:
We can improve the rate of convergence as well by using our empirical estimates of the probabilities to subtract an estimate of the bias from the statistic. We use the following bias-corrected version of the empirical statistic in all experiments: ∗ d M S (Gt ) =
X
5.5
d (G ˆt) = DS
X i
(
X j6=i
pˆij,t −
X
pˆij,t−1 )2
j6=i
X pˆij,t (1 − pˆij,t ) X pˆij,t−1 (1 − pˆij,t−1 ) (8) + − |Et | |Et | j6=i j6=i
(ˆ pij,t − pˆij,t−1 )2
ij
−
∗
pˆij,t (1 − pˆij,t ) pˆij,t−1 (1 − pˆij,t−1 ) − |Et | |Et−1 |
5.6
Triangle Probability
(7)
Probabilistic Degree Shift
Claim 5.5 DS is a density consistent statistic. Again, since the true P is unobserved, we cannot calculate the Degree Shift (DS, Eq. 5) statistic exactly and must use the empirical Probability Degree Shift: P d G ˆ t ) = P (P DS( ˆij,t − j6=i pˆij,t−1 )2 . As shown in Thei j6=i p orem 5.7 below, the bias of this estimator approaches 0, making this a density consistent statistic.
Claim 5.6 TP is a density consistent and density unbiased statistic. Again, since the true P is unobserved, we cannot calculate the Triangle Probability (TP, Eq. 6) statistic exactly and must use the empirical Triangle Probability X eij,t ∗ eik,t ∗ ejk,t d T P (Gt ) = |Et |3 i,j,k∈V,i6=j6=k
which is an unbiased estimator of the true statistic (shown below in Theorem 5.8). This means that there is no minimum number of edges necessary to attain an unbiased estimate of the true triangle probability.
d Theorem 5.8 T P (Gt ) is an consistent and unbiased estimator of T P (Pt ). Proof. The expectation of the empirical Triangle Probability can be written as
ijk
eij,t eik,t ejk,t E[ ] |Et | |Et | |Et |
7000 5000
6000
1e-04
4000
0e+00
X pij,t ∗ |Et | pik,t ∗ |Et | pjk,t ∗ |Et | |Et | |Et | |Et | ijk X = pi,j pi,k pj,k =
8000
2e-04
As the number of edges on any node pair i, j can be represented with a multinomial, the expectation of each is |Et | ∗ pij,t . This lets us rewrite the triangle probability as
Figure 3: Recall when applying each statistic to flag synthetically generated anomalies. Each cell is an average over all model parameter pairs.
10000
=
Therefore the empirical triangle probability is an unbiased estimator of the true triangle probability and is a density consistent statistic.
EXPERIMENTS
Now that we have established the properties of densityconsistent and -dependent statistics we will show the tangible effects of these properties using both synthetic datasets as well as data from real networks. The purpose of the synthetic data experiments is to show the ability of hypothesis tests using various statistics to distinguish networks that have different distributions of edges but also a random number of observed edges. The real data experiments demonstrate the types of events that generate anomalies as well as the characteristics of the anomalies that hypothesis tests using each statistic are most likely to find.
6.1
Synthetic Data Experiments
To validate the ability of Density-Consistent statistics to more accurately detect anomalies than traditional statistics we evaluated their performance on sets of synthetically generated graphs. Rather than create a dynamic network we generated independent sets of graphs using differing model parameters and a random number of edges. For every combination of model parameters, statistics calculated from graphs of one set of parameters (or pairs of graphs with the same parameters in the case of delta statistics like mass shift) became the null distribution, and statistics calculated on graphs from the other set of parameters (or two graphs, one from each model parameter in the case of delta statistics) became the test examples. Treating the null distribution as normally distributed, the critical points corresponding to a p-value of 0.05 are used to accept or reject each of the test examples. The percentage of test examples that are rejected averaged over all combinations of model parameters becomes the recall for that statistic. To generate the synthetic graphs, we first sampled a uniform random variable representing the number of edges to insert into the graph. For each statistic we did three sets of trials with 1k-2k, 3k-5k, and 7k-10k edges respectively. For
3000
-1e-04
ijk
6.
7k-10k edges 0.21 ± 0.02 0.64 ± 0.06 0.78 ± 0.04 0.88 ± 0.04 0.67 ± 0.08 0.97 ± 0.03
E[ˆ pij,t pˆik,t pˆjk,t ]
ijk
X
3k-5k edges 0.05 ± 0.01 0.26 ± 0.05 0.78 ± 0.05 0.77 ± 0.05 0.63 ± 0.08 0.94 ± 0.04
9000
X
3e-04
E[PˆT (Gt )] =
1k-2k edges 0.01 ± 0.001 0.15 ± 0.03 0.44 ± 0.05 0.51 ± 0.05 0.62 ± 0.07 0.77 ± 0.04
Edit Distance Degree Dist. Clust. Coef. Mass Shift Degree Shift Triangle Prob.
0
20
40
(a)
60
80
100
0
20
40
60
80
100
(b)
Figure 4: Comparison of Mass Shifts (a) and Edit Distances (b) produced by synthetic dataset pairs. The variance of the Edit Distances is due to the variable edge counts of the graphs, and leads to errors when distinguishing the two distributions.
mass shift, edit distance, triangle probability, and clustering coefficient each synthetic graph was generated according to the edge probabilities of a stochastic blockmodel using accept/reject sampling to place the selected number of edges amongst the nodes. Each set of models was designed to produce graphs varying a certain property, i.e. triangle probability and clustering coefficient were applied to models with varying transitivity while mass shift and edit distance skew the class probabilities by a certain amount. For the degree distribution and degree shift statistics, rather than using a stochastic blockmodel we assigned degrees to each node by sampling from a power-law distribution with varying parameters then using a fast Chung-Lu graph generation process to construct the network. Unlike a standard Chung-Lu process we allow multiple edges between nodes and we continue the process until a random number of edges are inserted rather than inserting edges equal to the sum of the degrees. The recalls are calculated the same way as with the stochastic blockmodels, using pairs of graphs with the same power-law degree distribution as the null set and differing degree distributions as the test set. Figure 6.1 shows the average recall of each statistic when applied to all models of a certain range of edges. In general the more edges that are observed the more reliable the statistics are; however, the statistics we have proposed enjoy an advantage over the traditional statistics for all ranges of network sizes, and in some cases this improvement is as large as 200% or more.
Edge Count Mass Shift Degree Shift Triangle Probability Degree Distribution Edit Distance Clustering Coefficient
Edge Count Mass Shift Degree Shift Triangle Probability Degree Distribution Edit Distance Clustering Coefficient
September 2001: Enron stock begins to falter
May 2001: Mintz sends memorandum to Skilling on LJM paperwork
May 2000: Price manipulation strategy "Death Star" implemented
February 2002: Skilling testifies before Congress
Edge Count Mass Shift Degree Shift Triangle Probability Degree Distribution Edit Distance Clustering Coefficient
December 25: Christmas
November 14-16: Severe tornado warnings
February 14: Valentine's Day
January 16: Martin Luther King day
August 29: Semester begins
March 12-17: Spring Break
August 27: Classes begin
September 5: Labor day
December 2000: Skilling takes over as CEO
November 1: Basketball season begins
0
50
100
January 14: Classes resume
150
0
50
Time (weeks)
(a) Enron email network
100
150
November 1: Basketball season begins
200
Time (days)
(b) University e-mail network 2011-2
0
100
200
300
400
Time (days)
(c) Facebook Univ. subnetwork 2007-8
Figure 2: Timelines showing reported anomalies using each statistic for three real world networks. Figure 4 demonstrates the effect of density consistency using one pair of graph models and the mass shift and edit distance statistics. Each black point represents a pair of datasets drawn from the same distribution while each green point is the statistic value calculated from a pair of points from differing distributions; the number of edges sampled from each distribution was 7k-10k uniformly at random. Clearly the distribution of statistic values when calculated on graph pairs from the same distribution is different from the distribution of cross-model pairs for both statistics. However, the randomness of the size of the graphs translates to variance in the edit distance statistics calculated leading to two distributions which are not easily separable leading to reduced recall. Mass shift, on the other hand, is nearly unaffected by the edge variance leading to two very distinct statistic distributions.
6.2
Real Data Experiments
We will now demonstrate how using Density-Consistent statistics as the test statistics for anomaly detectors improves the ability of detectors to find novel anomalies in real-world networks. We analyzed three dynamic network, one composed of e-mail communication from the Enron corporation during its operation and collapse, one from e-mail communications from university students, and one from the Facebook wall postings of the subnetwork composed of students at a university. Solid points represent statistics that we have proposed in this paper while open bubbles represent classic network statistics. Ideally the major events marked with vertical lines will be found as unusual by the detectors. A hypothesis test using the statistic in question is applied to every time step to determine its category as anomalous or normal. The null distribution used is the set of statistics calculated on all other time steps of the stream. A normal distribution is fit to the statistic values and critical points selected set according to a p-value of 0.05; any time step with a statistic that exceeds the critical points is flagged as anomalous.
statistics are able to recover the critical events of the timeline including events that standard statistics are not able to find. In particular the price manipulation strategy that was implemented in the summer of 2000 and the CEO transition in December 2000 are not found by edge dependent statistics; unlike some of the later events this strategy was not accompanied with an abnormal number of edges so edge dependent statistics produced less of a signal on these points. In addition, the time steps that are flagged by traditional statistics cluster around a set of edge count anomalies just after the Enron stock begins to crash. This period has an elevated number of messages in general, so the statistics are responding to the number of edges in these time steps rather than changes in other properties of the network. Figure 2(b) is taken from the e-mail communication of students from the summer of 2011 to February 2012. In general the density-consistent statistics flag time steps where certain events are taking place such as the start of basketball season, Martin Luther King Day, and Valentine’s day. The density-dependent statistics tend to flag more randomly and often coincide with times of unusual total edge count. Figure 2(c) is constructed from the set of wall postings between members of a university Facebook group. As with the Enron data, there is a large set of time steps with an usual total edge count that are also flagged by the density-dependent statistics. The density-consistent statistics recover the major events such as the beginning of each semester and the start and end of spring break. Interestingly, they also find unusual activity before and after spring break; project deadlines often fall before or after the break which may explain this activity.
Let us take a look at the network structures found in the anomalous time steps. As the mass shift, degree shift, and triangle probability statistics are sums over the edges or nodes of the graph, by decomposing the total statistic into the values generated by each edge/node we can select a subgraph which contributes the most to the statistic. As these statistics are designed to measure changes in the probabilFigure 2(a) shows the the timeline of events that occurred during the Enron scandal and breakdown. The Density-Consistent ity distribution of edges, this subgraph can be considered as
s81
s81
s30570
s30570
s12495
s40348
s12495
s3977
s2136
s47636
s114
s96
s114
s359652
s2136
s50631
s41197
s7319
s27439
s50726
s18197
s32422
s7383
s6716
s41197
s50319
s21967
s38901
s359653
s359652
s359651
s21413
s11636
s14727
s21967
s7319
s99882
s20296
s20296
s51868
s14828
s18197
s32422
s14828
s7383
s6716
s38901
s11636
s11080
s968
s11080
s22553
s8051 s1714
s21413
s99882
s51868
s27439
s50726
s22553
s183
s359651
s5516
s968
s183
s8051
s184
s1714
s184
s5896
s63
s359653
s47636
s30541
s5516
s14727
s96
s40348
s3977
s50631
s30541
s50319
s5896
s63 s18569
Prior
Anomaly
(a) Mass Shi2: Enron
Prior
Anomaly
s18569
Prior
(b) Triangle Probability: E-‐mail
Anomaly
(c) Degree Shi2: Facebook
Figure 5: Anomalies detected by (a) Mass Shift: Enron time steps 31-32, (b) Triangle Probability: Email time steps 88-89, and (c) Degree Shift: Facebook time steps 249-250. Each plot shows most unusual subgraph for the prior time step and the anomalous one. s26
s26
s10
s41
s10
s41
s69
s1
s21
s1
s5310 s1880 s2062 s12542
s21
s13216s8189s11120s7024 s7198s25853s1335s29895 s5284s18785 s1162s1153s1166 s20377s1959 s1161s1152 s1167s9294 s1060s1481s2117 s31579 s9096 s6097s7649 s16177 s25417 s3775 s7130 s1420s1683 s9097 s4276 s42402 s6499 s49539 s3809 s80 s17373 s1609 s6864 s10135 s1687 s13496 s1686 s6825s601 s11461 s1960 s6541 s1737 s403 s1394 s2658 s1417 s2090 s1483 s9262 s2092 s4123 s2091 s879 s4275
s1165
s13216s8189s11120s7024s5284 s7198s25853s1335s29895 s1162s1153s1166 s18785 s20377s1959 s1161s1152 s1167s9294 s31579 s9096 s16177 s3775 s7130s9097 s42402 s49539
s6097s7649s1060s1481s2117 s25417 s1683 s1420 s4276 s6499 s3809 s17373
s61
s40
s56
s128
s61
s40
s56
s38
s71
s122
s25
s433612 s414280 s343058 s433613 s272821 s8559 s184673 s433614 s373895 s61956 s433611 s59006 s90482 s91701 s433610 s433615 s433609 s7304 s104653 s433616 s364021 s375234 s184672 s433617 s4817 s33033 s8556 s12348 s8561 s391730 s226910 s8565 s1352 s1354 s4816 s434336 s207244 s433635 s147477 s33165 s25877 s433636 s8558 s433637 s383584 s234833 s204321 s433638 s14762 s312254 s230085 s31132 s433608 s219818 s433607 s40849 s8554 s12728 s433606 s213517 s1290 s19202 s214644 s312253 s433605 s148507 s13990 s136231 s8563 s316141 s78350 s4558 s373411 s99585 s433604 s4745 s31050 s106731 s433603 s352505 s433602 s433639 s386337 s16664 s26798 s10652 s156721 s307432 s433601 s433640 s373410 s67116 s433600 s113855 s232772 s94603 s319005 s163962 s433599 s48004 s8557 s48005 s195636 s344137 s18044 s294996 s64422 s433641 s8519 s25006 s17357 s2867 s152120 s197733 s6163 s2856 s134649 s202920 s181248 s4073 s14862 s205700 s8527 s4043s125045 s76514 s434109 s1186 s29780 s433220 s1193 s46546 s27948 s15443 s23563 s8369 s273511 s8372 s66119s8351 s44449
s1609 s10135 s13496 s6825s601 s6541 s403 s2658 s2090 s9262 s2092 s4123 s2091s879s4275s19
s2295 s18062s5019 s19363 s5391 s14790 s91687 s14425 s3136 s74694 s433127 s1188 s15106 s76500 s323 s710 s117695 s433126 s26435 s96359 s86709 s14589 s49414 s315407 s10921 s80490 s10406 s21724 s27445 s36487 s46558 s67851 s52112 s78705 s3516 s5513 s13547 s275080
s1165 s4191
s1243
s2745
s4191
s3509
s33320
s2745
s1356
s1735
s33320
s1736
s7899
s1735
s1423
s4795
s7899
s2664
s996
s4795
s1418
s2821
s996
s1357
s4633
s5056
s2821
s35210
s2823
s2799
s4633
s5056
s2183
s4312
s2823
s2799
s2822
s1169
s2183
s4312
s738
s31337
s2822
s1169
s12813
s19534
s738
s31337
s11074
s11182
s12813
s19534
s5160
s81
s5160
s81
s17013
s8251
s8427
s45719
s17013
s9746
s2223
s45719
s28692
s44228s1269
s2223
s2094
s5468
s44228
s20015
s5468
s1767
s21606
s19167
s3694
s1269
s21606
s2313
s3161
s19167
s3694
s1859
s3160
s2313
s3161
s37618
s5134
s1859
s3160
s2354
s3433
s37618
s5134
s2151
s3168
s2354
s3433
s2152
s8081
s2151
s3168
s2153
s7012
s433612 s414280 s343058 s433613 s272821 s8559 s184673 s433614 s373895 s61956 s433611 s59006 s90482 s91701 s433610 s433615 s433609 s7304 s104653 s433616 s364021 s375234s33033 s184672 s433617s12348 s4817 s8556 s8561 s391730 s226910 s8565 s1352 s1354 s4816 s434336 s207244 s433635 s147477 s33165 s433636 s8558 s433637 s383584s25877 s234833 s204321 s433638 s14762 s312254 s230085 s31132 s433608 s219818 s433607 s40849 s8554 s12728 s433606 s213517 s1290 s19202 s214644 s312253 s433605 s148507 s13990 s136231 s8563 s316141 s78350 s4558 s373411 s99585 s433604 s4745 s31050 s106731 s433603 s352505 s433602 s433639 s386337 s16664 s26798 s10652 s156721 s307432 s433601 s433640 s373410 s67116 s433600 s113855 s232772 s94603 s319005 s163962 s433599 s48004 s8557 s48005 s195636 s344137 s18044 s294996 s64422 s433641 s8519 s25006 s17357 s2867 s152120 s197733 s6163 s2856 s134649 s202920 s181248 s4073 s14862 s205700 s8527 s4043 s76514 s434109 s1186 s125045 s29780 s433220 s1193 s46546 s27948 s15443 s5019 s23563 s2295 s8369 s18062 s273511 s19363 s8372 s5391 s66119s8351 s14790 s44449 s91687 s14425 s310974 s3136 s62511 s74694 s433579 s433127 s208374 s1188 s8363 s15106 s8338 s76500 s8334 s323 s433580 s710 s168573 s117695 s8355 s433126 s18816 s26435 s163739 s96359 s8376 s86709 s433581 s14589 s8373 s49414 s8360 s315407 s905 s10921 s8342 s80490 s8356 s10406 s8347 s21724 s433582 s27445 s433583 s36487 s8332 s46558 s8367 s67851 s6152 s52112 s8361 s78705 s74217 s3516 s38957 s5513 s187116 s13547 s433584 s275080 s117390 s251299 s29481 s38304 s8346 s97839 s63365 s99726 s2910 s74513 s412044 s358957 s2941 s45502 s12936 s13629 s24825 s50932 s4798 s23979 s433221 s283309 s433222 s15883 s433251 s17917 s284474 s274042 s4099 s136426 s282190 s403208 s27729 s4044 s84198 s379443 s6017 s122397 s433252 s15184 s318015 s267269 s433253 s4113 s4911 s433254 s220434 s124398 s270859 s433255
s8361 s74217 s38957 s187116 s433584 s117390 s29481 s8346 s63365 s2910 s412044 s2941 s12936 s24825 s4798 s433221 s433222 s433251 s284474 s4099 s282190 s403208 s27729 s4044 s84198 s379443 s6017 s122397 s433252 s15184 s318015 s267269 s433253 s4113 s4911 s433254 s220434 s124398 s270859 s433255
s23046 s130333 s97307s22364 s8528 s24803 s13673 s26807 s13348 s211968 s425432 s17682 s385739 s16223 s241808 s47755 s9668 s8870 s253631
s433256
s43375
s2154
s7025
s2154
s8062
s1291
s795
s5561
s7022
s1454
s49428 s7021 s780
s1297
s7533
s147 s3162
s1628
s7018
s13429
s3159 s49512
s995
s47860
s995
s118
s15489 s9304 s9046 s38585
s40212
s33
s5653
s10660
s8831
s40212
s6246
s5541 s36377 s1137
s9046 s38585 s2145 s8831 s6246
s412221 s82929
s1419 s1136
s69906
s9105 s2221
s6168
s2007
s10891 s8523
s433264
s4077
s4105 s433267 s433268 s1334 s433269 s433270 s433271 s4062 s413174 s296304 s4111 s271052 s4061 s433272 s4074
s2866
s385498 s5357 s434286 s424395
s3245
s702
s3245
s29905
s702
s45818
s432034
s29905
s25412
s2268
s760
s11308
s270
s173822
s2190
s5706
s5192
s96
s8917
s19
s30
s19
s30
s3528
s699
s3128 s41347 s8966 s16404 s5231 s46491 s777 s12631 s886
s141
s66
s31847
s21900
s41085
s593
s3639s41479 s8323 s998
s699
s3128 s41347 s8966 s16404 s5231 s46491 s777 s12631 s886
s704 s3184 s23937 s3325 s47707 s5543 s2155 s8242 s2156 s1961 s3244 s38109 s8249 s3238 s2189 s43712 s11479 s35336 s33303 s24619 s39263 s3294 s36993s8032 s11394 s2708s7440 s17268 s2121s21145s1760s22102 s4083s17287 s5764s2904 s98 s8394 s146 s2097
s16947
s7398
s29774
s700 s703
s8026 s22174 s17266 s14583 s50807 s36618 s11899 s19563 s2462 s46989 s15955 s38489 s1376
s13123
s2057
s38061
s20739
s40605
s20190
s41085
s593
s3639s41479 s8323 s998
s704 s3184 s23937 s3325 s47707 s5543 s2155 s8242 s2156 s1961 s3244 s38109 s8249 s3238 s2189 s43712 s11479 s35336 s33303 s24619 s39263 s3294 s36993s8032 s11394 s2708s7440s2121 s17268 s21145s1760s22102s5764s2904 s98 s8394 s146 s2097s4083s17287
s16947
s7398
s29774
s1972
s40605
s4179
s39082 s4304
s3308
s4304 s6630
s12007
s12007 s6909
s41003
s2018
s10813
s24790
s12094 s3196
s6600
s42239
s10617
s1972 s42924
s10813
s24790
s29338
s28136
s2606
s32483 s2514
s18388
s3308
s42924
s32
s13123
s2057
s38061
s20739
s10617 s20190
s4179
s6909
s3
s31847
s21900
s2606
s32483 s2514
s18388 s39082
s6630
s12094
s32
s763
s2018
s42239
s28136 s3120
s3196
s6600
s29338
s42168
s44766
s68
s763
s3120 s6315
s45563
s4059
s9054
s10604
s244283
s7535
s16629
s45563
s4059
s9054
s20823 s52
s68599
s6126 s16210
s78878
s371088
s11546 s68603
s52
s3042
s20502
s39681
s14608
s10366
s49339
s3633
s29694
s7041
s15834
s10145 s30270
s4600
s33343
s10145
s13767
s32121
s26148
s10239
s30270
s7150
s12109
s497
s14757
s14124
s13767
s32121
s26148
s10239
s6447
s4601
s13565
s35188
s7170
s44
s14757
s607
s39956
s2880
s433112
(a) Edit Distance: Enron
s178859
s7757
s211853
s186500
s323530
s32865
s93502 s13162
s242547 s112607
s508
s285945 s3512
s135101
s13121
s433114 s1031
s629
s4254
s2476
s4252 s6095
s14124
s433174
s1041 s1036
s110812
s179559
s45872
s607
s61456
s433117
s433112
s616
s433095 s1010
s2050
s433115
s2049
s142625
s66855
s1029
s349903
s433115
s350835
s13780
s2062
s431712
s634 s2073
s66855
s621
s124396
s1028
s1034 s620 s2052
s1028
s639
s395097
s433111 s433173
s344579
s433172
s92834
s433177
s610
s433116
s31824
s121144
s109470
s1008
s2093
s314397
s38654
s2738
s22091
s316236 s133589
Anomaly
(b) Barrat Clustering: E-‐mail
s1007
s314236
s2030
s198237
s132983
s593
Prior
s433118 s1007 s240653
s625
s433178 s13740 s272071
s27460
s8663
s2066
s116068
s395626 s213290
s35743 s637
s2102
s24837 s297182
s1008
s125480 s314397
s240653
s625
s433178
s2096 s433171
s36536
s433118 s35743
s8663
s2066
s116068
s11162 s25340
s2093
s36536
s198237 s637
s18681
s92834 s2084
s22091
s125480
s11634
s272071
Prior
s497
s640
s433117
s109470
s13740
Anomaly
s69712
s13123
s12109
s433094
s433116
s31824
s121144
s132983
Prior
s14143
s345914
s245383
s7159
s7150
s596 s433113
s639
s610
s39956
s28634
s4326 s2102
s11162 s25340
s13178
s9130
s247623 s4180
s138367
s35279
s231487
s621
s433111
s31556
s38190 s562
s38654
s2738
s19628
s58854
s34604 s502
s98593
s67441
s95852
s1029
s431712
s634 s124396
s344579
s433177
s563
s28634
s18681
s363005
s27362
s421907 s71436
s2072
s13780
s2062
s395097
s433173
s2084
s4252 s6095
s83996
s13135
s14328 s132817
s612
s328559
s192197
s5441
s18305
s563
s4254
s2476
s127728
s14703 s8744
s65571
s23275
s285464
s319522
s433095
s2049
s30927
s11739
s11712 s31556
s38190 s4326
s11634
s13200
s68101
s35628
s15189
s134147
s433086
s14411
s1013
s349903
s2050
s142625
s33719 s14942
s5441
s18305
s562
s19629
s322
s105488
s6124
s85086
s19931
s432914
s7532
s50872
s433176
s244564
s14363
s30927
s11739
s11712 s2880
s79069
s432493
s85095
s103858
s10145
s2079
s14412
s45872
s61456
s69
s251559
s13169
s6433
s6125
s433087
s9449 s82241
s95854 s331649 s433175 s318356 s110812
s179559
s629
s1034
s54
s131146
s37538 s506
s34117
s154024 s34833
s13121
s1036
s640
s35279
s231487
s620
s23
s13159
s13198 s433088
s134230
s23276
s135101
s1041
s433094
s67441
s95852
s433172
s69
s69201
s13174 s11975
s68608
s68035
s156069 s7534
s285945 s3512
s433114 s1031 s596 s433113
s2072
s2052
s54
s205395
s4029
s9376
s17631 s43592
s22162
s29126 s99083
s242547 s112607
s508
s138367
s328559
s192197
s2073
s23
s35627
s4829 s4561
s6434
s269927
s14927
s93502 s13162
s247623 s4180
s98593
s285464
s319522
s350835
s7170
s44
s213903
s507
s2760 s13344
s3801
s55272
s16853
s323530
s32865
s34604 s502
s612
s14411
s433174
s14988 s4600
s33719 s4601
s35188
s18100
s13194
s25852
s153092 s7672
s15263
s186500
s421907 s71436
s84092
s211853 s324803
s14328 s132817
s433176
s47311
s32279
s9835
s6447
s14363
s14942
s13565
s8036
s13199 s55601
s405790 s27329
s1378
s7757
s65571
s23275
s432914
s7532
s50872
s15617
s49339 s433086
s269927
s10145
s14412
s28801
s1010
s46
s9756
s13196
s17760
s221832 s59113
s38152
s616
s73
s371088
s240591 s20511
s15393
s44661
s32279
s9835
s14988
s46
s78878
s139244 s1759
s30097
s44089
s40088 s811
s244564
s73
s68599
s6126 s11546 s84136 s99212 s23256 s20861
s191014
s5650
s47311
s44661
s110439
s8302
s1013
s12345 s10779
s33343
s12054
s429369 s14143
s345914
s19628 s7671
s9449
s95854 s331649 s433175 s318356
s44089 s28801
s22164
s14676
s363005
s433525
s156069
s49881
s2645
s2079
s15834
s68603
s83996
s13178
s31434 s35960
s7041 s5650 s40088
s86923
s6636 s127728
s245383
s69712
s178859
s46994 s13254
s49881
s2645
s16210
s1104
s13200
s20506
s154024 s34833 s82241
s29694
s7535
s20502 s19629
s7159
s29126 s99083 s7534
s3633
s244283
s7384 s251559
s9130
s134147
s84092
s324803
s21741 s18688
s31434 s35960
s145996 s413 s258736 s130326 s434 s42891 s464 s420 s16162 s433085 s12746 s36306 s27956 s448 s194738 s454 s22316 s91683 s6416 s471 s422 s400 s444 s467 s358322 s189042 s452 s460 s163408 s153336 s469 s197018 s241131 s27668 s426 s10435 s84945 s440 s435 s52585 s385 s437 s42110 s135360 s439 s196358 s451 s438 s46427 s5893 s150006 s10979 s125685 s240698 s10955 s277755 s277754 s307635 s10992s10943 s151250 s3470 s3389 s347483 s10882s10906 s5926 s147042 s128210 s103136 s433792 s1247 s21732 s10981 s10968 s107868 s199495s10964 s5894 s58068 s2s10838 s101483s3240 s309178 s10944 s2191 s12520 s2216 s18821 s356618 s432588 s2202 s10990 s28019 s22979 s2220 s81228 s13662 s46673 s2207 s1232 s18169 s10937 s1892 s433791 s168299 s10909 s57150 s10913 s2193 s123664 s2208 s22995 s58067 s10902 s43371 s400385 s144119 s433790 s2184 s10911 s2190 s18481 s157432 s58193 s64896 s10945 s433652 s18824 s400887 s6322 s302913 s118848 s17556 s379167 s8940 s1148 s5286 s3079 s83995 s10958 s92391 s22410 s6560 s340620 s10950 s10892 s18494 s66025s3s10971
s10604
s26647 s16629 s17044 s20510 s20503
s131146
s58854
s6125
s85086
s19931
s20621 s6675
s46994 s13254
s41427 s102749 s311383 s49587 s99883 s303301 s14265 s14281 s1782 s28915 s197250 s173355 s124190 s63929 s212855 s98152 s434046 s5732 s39324 s14266 s59923 s256755 s17852 s20263 s434047 s434048 s22650 s177145 s129977 s14267 s5734 s254533 s147772 s110233 s11041 s137992 s346427 s183278 s14279 s100639 s92937 s9629 s91236 s131402 s261658 s306495 s149570 s265663 s173 s200395 s121271 s278133 s275658 s203122 s434049 s285860 s75943 s119737 s160576 s192409 s20264 s434050 s84534 s344448 s160577 s1781 s10531 s195022 s10150 s295910 s115594 s39243 s253669 s36093 s280373 s309789 s24745 s49346 s71443 s260233 s81518 s18144 s24442 s107067 s165456 s18129 s18131 s10149 s18130 s180297 s433651 s2187s241008 s2230 s112742 s134301 s345408s2201 s189852 s64198 s60536 s61404
s85095
s103858
s48422 s3632
s21741 s18688
s122119
s89330 s194368 s67458 s405596 s21308 s8964 s5263 s74725 s363802 s433648 s349997 s160870 s60994 s80588 s433647 s351912 s114006 s96849 s280347 s37745 s8942 s10991s186438 s474 s382
s15617
s10366
s17778
s20621 s6675
s12345
s169726
s268571
s13159
s27362
s432493
s23276 s55272
s16853
s39681
s44650
s48422 s3632
s57472
s74882
s3801
s30097
s14927
s17778
s10779
s125758
s258970
s69201
s13135
s6124
s6434
s8302
s2833
s13123
s44650
s18135
s265766
s8744
s134230
s6936
s433087
s14608
s14277
s214571
s15189
s22162
s7671
s34117
s811
s105085
s354566
s941
s29650
s27599
s3938
s322
s68101
s14703 s1378
s15263
s127
s16005
s255144
s35628
s68608
s68035
s433525
s2127
s20366 s3309
s2833
s38152
s51
s124365
s434282
s105488
s429369
s677
s45092
s2836
s40538
s15393
s127
s119782
s99425
s220876 s79069
s17631 s43592
s191014
s51
s12971
s434283
s205395
s4029
s13198
s37538
s13169
s6433
s4561
s110439
s3333
s6445 s6984
s51704
s941
s29650
s53594
s8649
s506
s4829 s14676
s20506
s3521
s3608 s2646
s41435
s6936 s3309
s75944
s320381
s433088
s153092
s13344
s3755
s2137
s26676
s37282
s2127
s20366
s2866
s434045
s8434
s7672 s2760
s15268
s2834
s25071
s677
s45092
s2847
s8804 s8255 s4347
s103557
s294943
s11975
s27329
s3681
s220876
s31676
s4682
s3333
s6445
s27599
s3938
s4041
s89476
s84783
s244332
s294942
s9376
s405790
s1080
s41629
s3521
s3608
s2836
s40538
s4074 s40966
s146012 s433642
s30499
s325278
s59113
s86923
s3755
s2137
s6984
s51704
s4061 s433272
s12705
s1814
s434044
s398682
s37954 s6636
s15268
s2646
s41435
s4111 s271052
s348515
s35095
s31046 s33769
s171
s387060
s10862
s35627
s240591
s1104
s2001
s3681 s19203
s31676
s2834
s26676
s37282
s4062 s296304
s260623
s170873 s327561 s202699 s14810
s118
s280948
s20511 s221832
s2711
s1080 s37954
s25071
s4105 s433267
s51862
s24061
s286704
s18100
s213903
s507
s139244 s1759
s8036
s13199 s13194
s25852
s23256 s20861
s13174
s3042
s2001
s4682
s4121
s40910
s77342
s55601
s99212 s12054
s14037
s9756
s13196
s17760
s84136 s22164
s20503 s10862
s14037
s2711
s19203
s41629
s107
s4077 s433266
s434387 s144050 s65327 s18319
s376640
s340788
s43847
s2002
s44766
s7384
s20823
s2
s36
s92
s420684
s24711
s333444
s42168
s43847
s2002
s17044
s29
s107
s64255
s225460 s18143
s69081
s267683
s18135
s20510
s68 s2
s36
s363150
s86042
s278233 s89716
s173822
s122119
s26647
s6315
s29
s433265
s146380
s89330 s41427 s194368 s102749 s67458 s311383 s405596 s49587 s21308 s99883 s8964 s303301 s5263 s14265 s74725 s14281 s363802 s1782 s433648 s28915 s349997 s197250 s160870 s173355 s60994 s124190 s80588 s63929 s433647 s212855 s351912 s98152 s114006 s434046 s96849 s5732 s280347 s39324 s37745 s14266 s8942 s59923 s10991 s256755 s474 s17852 s382 s20263 s186438 s434047 s145996 s434048 s413 s22650 s258736 s177145 s130326 s129977 s434 s14267 s42891 s5734 s464 s254533 s420 s147772 s16162 s110233 s433085 s11041 s12746 s137992 s36306 s346427 s27956 s183278 s448 s14279 s194738 s100639 s454 s92937 s22316 s9629 s91683 s91236 s6416 s131402 s471 s261658 s422 s306495 s400 s149570 s444 s265663 s467 s173 s358322 s200395 s189042 s121271 s452 s278133 s460 s275658 s163408 s203122 s153336 s434049 s469 s285860 s197018 s75943 s241131 s119737 s27668 s160576 s426 s192409 s10435 s20264 s84945 s434050 s440 s84534 s435 s344448 s52585 s160577 s385 s1781 s437 s10531 s42110 s195022 s135360 s10150 s439 s295910 s196358 s115594 s451 s39243 s438 s253669 s46427 s36093 s5893 s280373 s150006 s309789 s10979 s24745 s125685 s49346 s240698 s71443 s10955 s260233 s277755 s81518 s277754 s18144 s307635 s24442 s10992s10943 s107067 s151250 s165456 s18129 s3470 s18131 s3389 s10149 s347483 s18130 s10882 s180297 s5926 s241008 s10906 s433651 s147042 s2187 s128210 s2230 s103136 s112742 s433792 s2201 s1247 s134301 s21732 s345408 s10981 s189852 s10968 s107868 s60536 s199495 s61404s64198 s5894 s3240 s10964s2 s58068 s101483 s10838 s309178 s10944 s2191 s12520 s2216 s18821 s356618 s432588s22979 s2202 s10990 s28019 s2220 s81228 s13662 s46673 s2207 s1232 s18169 s10937 s1892 s433791 s168299 s10909 s57150 s10913 s2193 s123664 s2208 s22995s400385 s58067 s10902s433790 s43371 s144119 s2184 s10911 s2190 s18481 s157432 s58193 s64896 s10945 s433652 s18824 s400887 s6322 s302913 s118848 s17556 s379167 s8940 s1148 s5286 s3079 s83995 s10958 s92391 s22410 s6560 s340620 s10950 s10892 s18494 s66025s3s10971
s141
s41003
s3
s92
s192639
s157119
s197911
s51567
s169726
s268571
s34903
s3528
s700 s703
s8026 s22174 s17266 s14583 s50807 s36618 s11899 s19563 s2462 s46989 s15955 s38489 s1376
s66
s433264
s132431
s15774
s168653
s57472
s74882
s36288
s12203
s347719
s134596
s125758
s258970
s8
s20493
s34903
s892
s434284
s14277
s214571
s2191
s34286
s36288
s12203
s164652
s142964
s105085
s354566
s875 s21
s9324
s8
s20493
s38821
s205082
s16005
s255144
s265766
s2191s21
s34286
s64571
s426861
s124365
s434282
s49640
s381 s23932
s9324
s27707
s192052
s119782
s99425
s876
s8156
s875
s23932
s24746
s17076
s12971
s434283
s4062
s7251
s49640
s381
s207422
s115461
s53594
s8649
s97s577
s11199
s876
s8156
s195409
s434285
s75944
s320381
s984
s29262
s4062
s7251
s237845 s80301
s269611
s434045
s4070
s97
s11199
s297893
s103557
s294943
s1065
s1562
s577
s29262
s309272
s27935
s244332
s294942
s22
s33918
s984
s4070
s150041
s22285
s30499
s325278
s20525
s31296
s1065
s1562
s432034
s237845 s80301
s278233 s89716
s376640
s434044
s398682
s208
s1085
s22
s33918
s198812
s171
s387060
s4125
s8959
s20525
s31296
s274137
s118
s280948
s96
s8917
s208
s1085
s150551
s24061
s286704
s2190
s5706
s4125
s8959
s223232
s40910
s77342
s23664
s8434
s5192
s4056 s433263
s424395
s24711
s333444 s340788
s137
s5357 s434286
s69081
s267683
s18
s385498
s197911
s51567 s146380
s137
s141823
s15774
s168653
s18
s310930
s347719
s134596
s6004
s1049
s280230
s892
s434284
s9372
s2814
s23664
s232738
s164652
s142964
s4481
s16829
s6004
s1049
s2295
s32399
s9372
s2814
s1334
s372351
s38821
s205082
s1738
s52441
s4481
s16829
s433271
s434287
s64571
s426861
s2780
s26645
s2295
s32399
s433268
s7946 s391698
s27707
s192052
s3944
s269
s1738
s52441
s433270
s277664
s24746
s17076
s8522
s3693
s2780
s26645
s433269
s339466
s207422
s115461
s7023
s2288
s3944
s269
s413174
s73274 s160520
s195409
s434285
s1267
s3450
s8522
s3693
s83964 s142963
s150041
s269611
s50846
s1788
s7023
s2288
s11308
s270
s1267
s3450
s25412
s2268
s50846
s1788
s106411
s309272
s27935 s297893
s760
s4
s7945
s168360
s1814
s157119 s132431
s22285
s45818
s70
s8977
s8947
s414247
s12705 s35095
s31046 s33769
s18143 s86042
s274137 s198812
s4
s8968 s13209
s8951
s5830
s8971 s3224
s8957 s8973
s91949
s8804 s8255
s348515 s260623
s170873 s327561 s202699 s14810 s51862 s434387 s144050 s65327 s18319 s225460
s223232 s150551
s70
s433650 s144177
s8978 s8955 s8946 s41221 s8967 s183876
s330658
s2847
s141823
s134754
s184731
s4041
s89476
s84783 s4347
s310930
s433649
s40966
s146012 s433642
s280230
s8958 s260302
s4121 s433266
s232738
s4101
s282933
s420684
s372351
s701
s4135
s57576
s64255
s7946 s434287
s1163
s19227
s4064
s15803
s363150
s339466
s8163
s7415
s433262
s7840
s433265
s8958
s8968 s13209
s8971
s8957 s8947
s391698
s6437
s103
s433261
s181836
s192639
s15803 s282933 s260302 s433650 s144177
s8978 s8946 s8951 s183876
s277664
s16332
s1208
s701
s202029
s168313
s4056 s433263
s134754
s83964
s11950
s1156
s1163
s19227
s433260
s273770
s4101
s91949
s142963
s72
s1154
s8163
s7415
s168955
s118175
s4135
s168360
s624
s3512
s1164
s6437
s103
s4119
s317792
s4064
s184731
s30592
s8325
s1209
s16332
s1208
s433082
s297187
s433262
s160520
s1155
s11950
s1156
s69906
s240465
s433261
s73274
s5952
s72
s1154
s433259
s198958
s202029
s7945
s18073
s4281
s3512
s1164
s352222
s213438
s433260
s8973
s5316
s7963
s8325
s1209
s82929
s9087 s302394
s168955
s3224
s1505 s3100
s5969
s624
s1155
s412221
s9365 s30370
s4119 s433082
s106411
s30592
s5952
s4133
s276684
s433259
s414247
s6168
s2007
s4100
s265916 s7482 s159135
s352222
s41221
s12317 s17551
s1139 s32316
s18073
s4281
s144898
s288885
s4133
s57576
s9304
s9105
s5316
s7963
s47760
s143652
s4100
s433649
s5969
s433258
s23414
s144898
s330658
s1136
s20
s4106
s238343
s47760
s7840
s15489
s4512 s1471 s2149
s2145
s12317 s17551
s2221 s1505 s3100
s1419
s55 s72
s4312
s60937
s433258
s5830
s1139 s32316
s17 s126
s20
s71871
s20122
s4106
s8967
s1137
s96
s55
s433257
s112792
s4312
s8955
s5541 s36377
s17 s126
s4088
s96105
s71871
s8977
s33
s96
s72
s205701
s169307
s433257
s10891 s8523 s181836
s5653 s4512 s1471 s2149 s10660
s118
s413170
s269965
s4088
s168313
s38975
s18256
s284475
s66164
s205701
s118175
s16327
s273770
s38975
s4030
s242152
s413170
s297187
s46544
s317792
s16327
s4070
s99432
s284475
s198958 s240465
s46544
s4069
s43817
s4030
s9365 s30370 s302394 s213438
s13429
s4103
s195670
s4070
s276684
s6784
s9087
s1628
s4083
s3594
s4069
s60937 s23414 s288885 s265916 s7482 s159135
s6783
s1926
s6784
s143652
s17523
s7648
s7533
s238343
s1297
s3166
s6783
s20122
s1295
s779
s17523
s4104
s26311
s4103
s96105
s1293
s112792
s1295
s4093
s14863
s4083
s269965
s1292
s169307
s1293
s4084
s193900
s4104
s242152 s66164
s1292
s4068
s46377
s4093
s43817 s99432
s1454
s4128
s22827
s4084
s3594 s195670
s5561
s311120
s87587
s4068
s14863 s26311
s1291
s49428 s7021 s780 s779 s3166 s7648 s1926 s147 s3162 s7018 s3159 s49512 s47860 s18256
s311124
s7839
s4128
s46377 s193900
s7025 s8062 s795 s7022
s71
s433256
s43375
s311120
s87587 s22827
s477
s245042
s311124
s7839
s2153
s7012
s310974 s62511 s433579 s208374 s8363 s8338 s8334 s433580 s168573 s8355 s18816 s163739 s8376 s433581 s8373 s8360 s905 s8342 s8356 s8347 s433582 s433583 s8332s6152 s8367
s245042
s2152
s8081
s477
s38304 s97839 s99726s251299 s74513 s358957 s45502 s13629 s50932 s23979 s283309 s15883 s17917 s274042 s136426 s22364 s23046 s130333 s97307 s8528 s24803 s13673 s26807 s13348 s211968 s425432 s17682 s385739 s16223 s241808 s47755 s9668 s8870 s253631
s11074
s11182
s8251
s8427 s9746 s28692 s2094 s20015 s1767
s25
s80
s1687s6864 s1686 s11461 s1960 s1737 s1394 s1417 s1483 s69 s5310 s1880 s2062 s12542 s19
s1243 s3509 s1356 s1736 s1423 s2664 s1418 s1357 s35210
s128
s38
s122
s24837
s2096 s433171
s316236 s133589
s314236
s2030
s297182
s395626 s213290
s27460
s593
Anomaly
(c) Degree Distribu