Anomaly Detection in Networks with Application to Financial Transaction Networks Andrew Elliott1,2 , Mihai Cucuringu1,2 , Milton Martinez Luaces3 , Paul Reidy3 , and Gesine Reinert1,2
arXiv:1901.00402v1 [stat.AP] 2 Jan 2019
1 Alan Turing Institute
2 Department of Statistics, University of Oxford
3 Accenture Plc

January 3, 2019

Abstract
Detecting financial fraud is a global challenge. This paper is motivated by the task of detecting anomalies in networks of financial transactions, with accounts as nodes and a directed weighted edge between two nodes denoting a money transfer. The weight of the edge is the transaction amount. Examples of anomalies in networks include long paths of large transaction amounts, rings of large payments, and cliques of accounts. There are many methods available which detect such specific structures in networks. Here we introduce a method which is able to detect previously unspecified anomalies in networks. The method is based on a combination of features from network comparison and spectral analysis as well as local statistics, yielding 140 main features. We then use a simple feature sum method, as well as a random forest method, in order to classify nodes as normal or anomalous. We test the method first on synthetic networks which we generated, and second on a set of synthetic networks which were generated without the methods team having access to the ground truth. The first set of synthetic networks was split into a training set containing 70 percent of the networks and a test set containing the remaining 30 percent. The resulting classifier was then applied to the second set of synthetic networks. We compare our method with Oddball, a widely used method for anomaly detection in networks, as well as to random classification. While Oddball outperforms random classification, both our feature sum method and our random forest method outperform Oddball. On the test set, the random forest outperforms feature sum, whereas on the second synthetic data set, initially feature sum tends to pick up more anomalies than random forest, with this behaviour reversing for lower-scoring anomalies. In all cases, the top 2 percent of flagged anomalies contained on average over 90 percent of the planted anomalies.
1 Introduction
Financial fraud is a global challenge. According to [66], financial fraud losses in the UK payment industry across payment cards, remote banking and cheques amounted to £768.8 million in 2016, an increase of 2% compared to 2015 and close to the projected GDP of The Gambia in 2018 (source: http://statisticstimes.com/economy/countries-by-projected-gdp.php). Some of this fraudulent behaviour, such as money laundering, may manifest itself through unusual patterns in financial transaction networks. In such networks, customers are nodes, and two nodes u and v are linked by a directed edge if there is a money transfer from u to v; the edge is annotated with the transferred amount as its weight. Unusual patterns could include long paths of transactions with large weights (denoted
hereafter by long heavy paths) and large cliques. The patterns of financial fraud are highly variable, and many patterns are likely to have remained undetected to date. While anomaly detection in networks of financial transactions is a pressing problem, a similar task arises in networks for cyber security [45] and, more generally, for threat detection in networks. Fraud detection and the related problem of anomaly detection are well-studied problems; some reviews can be found in [4, 11, 22, 47, 58, 69]. Here is a brief overview of the field. Detecting money laundering is a difficult task, both because the number of cases is small in comparison to the number of legitimate transactions, and because fraudsters change their behaviour to avoid detection. Some anomaly detection methods look for patterns of attempts to conceal behaviour as normal, such as [25] and [16]. Many different techniques have been developed with the aim of uncovering money laundering and crime, ranging from autoencoders [50, 59] and SVMs [64] to social network analysis [23]; other techniques include rule-based approaches [60]. Data sets used for anomaly detection include unstructured data [71], financial records [50], social network data [23], and investment bank data [38]. Several approaches for anomaly detection exploit the network information. For example, [19] uses network statistics and a logit model to detect fraud in a factor house. Ref. [37] clusters transaction time series, and then assesses how far an individual deviates from its assigned cluster. Ref. [57] starts with a weighted network based on multiple evidence sources, constructs a 3-hop snowball sampled network, and uses features of these networks in a machine learning framework to uncover suspicious transactions. Moving away from the particular application to detection of money laundering, many anomaly detection methods use community detection in networks. Ref.
[16] declares an anomaly if the temporal evolution of network communities changes sufficiently. Ref. [27] compares the spectral embedding (a form of community detection) of individuals in different data sources in a cross data source embedding, declaring an anomaly if the embeddings deviate substantially. Ref. [37] clusters accounts and then constructs a score based on the deviations of an account from the mean performance in the cluster. Spectral localisation is the phenomenon in which a large amount of the mass of an eigenvector is placed on a small number of its entries [20, 49]. When considering eigenvectors of various matrix operators derived from the graph adjacency matrix, the nodes on which an eigenvector is localised can be regarded as different from the rest of the nodes in the network, and constitute good candidates for the anomaly detection task. There is also a large literature on localisation and spectral approaches for anomaly detection in networks. For example, Miller et al. [39, 40, 41, 42, 44] develop a series of methods to uncover anomalies using spectral features of the modularity matrix. The approach that is most relevant to this work is from [39], in which eigenvectors are assessed by the z-score of their ℓ1 norm. In [43], the authors extend these methods to use the ℓ1 norm for eigenvectors of sparse PCA, which performs well at the cost of being more computationally intensive. A different spectral approach, proposed by Tong and Lin [65], is based on matrix factorisation in bipartite networks and leverages the intuition that the nodes and edges which are badly represented by the factorisation should be considered as anomalies. Network comparison methods have also been employed for anomaly detection. For example, the approach in [52] uses network comparison measures, and compares each network in a time series to its immediate neighbours, fitting an ARMA process to the resultant time series.
The authors identify anomalies by looking for deviations from the model. Network analysis for time series is also treated in [36]. The standard anomaly detection method which we shall employ for a comparison with our results is Oddball by Akoglu et al. [3]. This is a widely used method which provides publicly available code. The original formulation of Oddball is built to detect 4 different types of anomalies in weighted networks: 1. “almost” stars; 2. “almost” cliques; 3. so-called “heavy vicinities” based on weighted degree; 4. heavy edges. To detect anomalies, [3] observes empirical relationships between measured statistics. The relationships are power laws which are identified between statistics of the 1-hop snowball sample of each node, namely the number of edges against the number of nodes, the total weight of edges against the number of edges, and the largest eigenvalue (of the adjacency matrix on the sample) against the total weight. For each relationship,
Oddball fits the resultant power law and then computes the following outlier score for each node:

    [max(observed, predicted) / min(observed, predicted)] · log(|observed − predicted| + 1).    (1)
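As a concrete illustration, the score in (1) can be computed as below; the function and argument names are ours, not taken from the Oddball code:

```python
import math

def oddball_score(observed, predicted):
    # The ratio term penalises multiplicative deviation from the fitted
    # power law; the log term grows with the absolute deviation.
    ratio = max(observed, predicted) / min(observed, predicted)
    return ratio * math.log(abs(observed - predicted) + 1)
```

A node lying exactly on the fitted power law scores 0, since the ratio is 1 and the log term vanishes.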
The authors augment this score by summing the scores (1) with the local outlier factor (see [13]) to capture points which may be close to the prediction but far from other observed points. The score then defines a ranking for the power-law relationships of each feature. The authors’ website also provides a version of Oddball, called Oddball lite, which does not use Python and which we use in this paper; Oddball lite appears not to carry out the augmentation by the local outlier factor. For comparison of the anomaly detection methods, we construct an overall ranking from each of the features using a summation; full details can be found in Section 3.7. Despite a general agreement that patterns of financial fraud are highly variable, methods for anomaly detection in financial networks often focus on detecting known anomalies, such as long heavy paths of transactions or large cliques. The method we propose here instead combines network analysis with spectral methods to detect structures which are anomalous, yet which may not be part of the set of known anomalies. We employ community detection in networks for two purposes: first, to increase the detectability of the embedded structures in the resultant subnetworks (see Sec. 3.2 for details), and second, to assign an anomaly score directly to each of the communities we identify. Our method is based on a combination of two approaches, namely an extension of the NetEMD network comparison method from [67] and an application of novel spectral localisation statistics. We combine these two approaches with community detection, which is used to decompose the network into smaller but denser subnetworks, thus allowing for parallel computation and rendering our approach computationally scalable. The split of the network into communities has the drawback that long paths may be cut (i.e., split over two or more clusters) and thus no longer be detectable.
Therefore, to alleviate this risk, we enrich our methodology with an approach which is specifically designed to find paths in networks. The two novel sets of spectral features developed in this paper are, firstly, an adaptation of the network comparison framework NetEMD to use eigenvectors. The second set of features is based on powers of the entries of eigenvectors, as well as their signs and magnitudes, and is detailed in Section 3.3. These two sets of features may be of interest more broadly than the application in this paper. Altogether, our combined method calculates 140 main features for each network, and uses these features to pinpoint anomalies. These features are processed in two different ways. Our Feature Sum method aggregates the features as an unweighted sum. Our Random Forest method is constructed by performing training and feature selection on the 140 main features, using a generating model for networks with randomly added anomalies which we develop in Sec. 2.1. As these planted anomalies are inspired by anti-money laundering, they focus on structures with heavy edges. For the synthetic models, we consider parameter regimes which ensure that in our simulated networks without anomalies, the expected count of each particular anomaly would be less than 1. To understand what is driving our performance, we also explore the features which are selected by the random forest. We find that different parameter regions favour different features, with features based on the geometric average of edge weights and on short paths dominating the space of selected features. We apply our method to a test data set derived from our model and to a data set which was provided by the Accenture team. The first data set is generated in the same way as the data set which is used to train the random forest, while the second data set is generated in a different way and was not used to train the random forest.
On both data sets, our methods outperform random classification, and they perform favourably against Oddball from [3]. While the Random Forest method outperforms the Feature Sum method on the test set, on the second data set we observe that the Random Forest is initially outperformed by the Feature Sum method. Both methods uncover on average 92.3% of the anomalous nodes within the top 1024 nodes, in a network of size n = 55000. Furthermore, both are able to uncover cliques, although cliques were not explicitly used as features in the modelling part. Therefore, both methods are able to uncover anomalies without having explicit knowledge of their structure.
For clarity, we give a brief summary of the notation used throughout the paper, with an additional list in the appendix. We let n denote the number of nodes, di the degree of node i, p the probability of connection, w the smallest weight of an embedded edge, k the size of the embedded structure, and k1 the number of incoming edges in a directed star graph. We let W denote a possibly weighted and directed adjacency matrix, and let Ws, Lcomb and LRW denote the symmetrised matrix (W + W^T), the combinatorial Laplacian, and the random-walk Laplacian, respectively. Furthermore, for a node set X, we denote the induced subgraph of W on X by W(X). The ith ordered eigenvector of the matrix in question is denoted by v^(i), with v^(i)_j its jth element; for brevity we will often omit the superscript when the context permits it. We let T(j, W) denote a node statistic for node j based on network W, with U_T(x, W) the empirical cumulative distribution function (CDF). We use µ(T, W) and σ(T, W) to represent the average and standard deviation of the statistic T over the network W. Furthermore, T̃(j, W) is a standardised version of T such that it has variance 1 over the network, i.e., σ(T̃, W) = 1. When counting a set of motifs, T^Motif(i)(j, W) denotes the ith weighted motif statistic. We use NetEMD_T̃(W, W′) to represent the NetEMD between a pair of networks with statistic T, and NetEMD(W, W′) to represent NetEMD averaged over our set of statistics C_T. We use GAW, GAW10, and GAW20 to represent geometric averages of weights, as defined in Sec. 3.1. We let FeatureRank_i^(All) denote the overall feature ranking, while FeatureRank_i^(p,w) denotes the feature rank for a given set of parameters. Furthermore, Pre(i) and Rec(i) denote the precision and recall if we threshold at the top i nodes in a ranking. We let Φ^(−1) denote the inverse CDF of the standard normal distribution (mean 0, variance 1), and pval represent a p-value.
The code uses NetworkX [29], Scikit-learn [51], Cython [9], Matplotlib [31], Rpy2 [28] and the NetEMD network comparison package [17] from Ref. [18].
2 Network Models and Network Data
To develop and measure the success of our methodology, we use two network models with known embedded anomalies. The first model guided our method development stage with an eye towards the underlying ground truth, which led to a scoring system that integrates the different anomaly detection scoring methodologies that altogether define our pipeline (see right panel of Fig. 1 for details). The second model was developed by the Accenture team, following on from discussions with the Turing team; a network generated from this model was passed to the Turing team for anomaly detection. The Turing team was blind to the model details; the details were only revealed to the Turing team at the time of writing this paper. Therefore, we regard this model as an independent test of the scoring system (see Sec. 2.2 for details).
2.1 Synthetic Data Set: Weighted ER Network
Our synthetic model is a weighted directed Erdős-Rényi (ER) random graph model, in which small structures with unusually large weights are planted at random. The edge weights of the underlying ER network are drawn independently and uniformly from the interval (0, 1). Therefore, the distribution of the non-anomalous edges is that of

    Wij ∼ Bern(p) Uni(0, 1),    (2)

where Bern(p) is a Bernoulli trial with probability p, and Uni(0, 1) is a uniform random draw from the interval (0, 1) which is independent of the network and of all other draws. We embed a random number of anomalous structures of five different types at random locations throughout the network. These five types of anomalies are cliques, paths, rings, stars, and particular directed multipartite networks (which we will also refer to as trees), as shown in the left panel of Fig. 1. The embedding procedure for such structures is as follows.
For a path of size k, we embed a directed series of edges (γ1, γ2), ..., (γk−1, γk) with an ordered set of distinct nodes (γ1, ..., γk) selected uniformly at random. For a ring, we additionally place a directed edge between γk and γ1. For the star (Fig. 1C), we choose k distinct nodes uniformly at random, and randomly and independently select one of them to be the centre node. In order to make the network directed, we select the direction of each edge independently and uniformly. This anomaly thus appears to be taking inputs (for example, money or information) from multiple nodes and then distributing them to multiple other nodes. For the clique (Fig. 1D), we select k distinct nodes uniformly at random, and choose a random direction between each pair of nodes. The final embedded structure, slightly more complex, is a special case of a directed multipartite network, as shown in Fig. 1E, and is inspired by the dynamics and flow of information or money in a real-world system. In this setting, we consider a one-dimensional ordered series of independent sets (which we refer to as shells), with nodes in a given shell passing input to all (or a subset of) nodes in the neighbouring shell upstream (to the right in Fig. 1E). In our experiments, we place 5 nodes in the leftmost shell, which are connected by directed edges to all of the 3 nodes in the middle shell, which in turn are fully connected to 1 node in the rightmost shell. For brevity in this document, and with a slight abuse of notation, we shall refer to this multipartite network as a ‘tree’. For simplicity, the directional flow of all (existing) edges across shells is from left to right, without exception. In practice, most likely one would search for a similar but noisy structure, which exhibits a significant flow imbalance across the shells. More specifically, the majority (but not all) of the edges would flow from the left shell to the middle shell, and similarly from the middle shell to the right shell.
Let us denote by ω the fraction of the edges obeying the above flow, from left to right. Given a pair of (adjacent) shells,
• if ω = 1, then all edges point from left to right, as depicted in Fig. 1E,
• if ω = 0.5, then half of the edges point in one direction (left-to-right) and half of the edges in the other direction (right-to-left),
• if ω = 0, then all edges point from right to left.
The closer ω is to 0.5, the harder it is to detect the three shells in the network. On the other hand, the closer ω is to either 0 or 1 (thus indicating an imbalanced flow in the subnetwork of shells), the easier it is to detect the three shells. Spectral and semidefinite programming algorithms that can detect similar planted clusters in directed networks (for example, where the above three shells make up the entire network, or just a small fraction of it), together with a robustness analysis in the context of a directed stochastic block model, constitute ongoing and future work. This line of work could open the way for detection of planted directed structures in networks, whose flow imbalance may signal the existence of certain anomalies, depending on the problem of interest. To construct a heavy structure on the planted anomalies, rather than drawing the weights on the edges from Uni(0, 1) as for the other edges in the network, we draw them independently from Uni(w, 1), where w is a parameter of the model. This weight construction also extends to the clique, with edges in the direction opposite to that selected for an anomalous edge being subject to the standard Uni(0, 1) edge selection procedure. Altogether, we obtain the following procedure to construct a network with anomalies.
1. Generate a directed ER network with n nodes and probability of connection p.
2. Add weights from Uni(0, 1) to each of the existing edges.
3. Draw the number of anomalies to embed uniformly at random from [5, ..., 20].
4.
Select the type of anomaly uniformly at random from {rings, paths, cliques, stars, trees}.
5. For each anomaly, select its size uniformly at random from [5, ..., 20] (the size of the tree is fixed).²
6. Select the appropriate number of nodes from the network uniformly at random.
7. Insert the chosen anomaly in the network.
8. Replace the edges in the chosen induced copy of the anomaly with weights drawn from Uni(w, 1).
The choice of the size and number of the embedded structures is motivated by the consideration that

² In an earlier version of this work, we had a minimum size of 4.
Figure 1: Left Panel: Diagram of the embedded anomalies. (A) Top yellow squares - Directed path. (B) Middle left red circles - Ring structure. (C) Middle right light blue star - Star structure with a directed flow from left to right. (D) Bottom left blue triangles - Clique structure. (E) Bottom right green diamonds - Directed multipartite structure showing flow of money/information from left to right. Right Panel: Figure showing the complete pipeline; red parallelograms represent the input, blue parallelograms represent the output scores, and the yellow rectangles represent processes which are defined in the following sections: Basic Tests Sec. 3.1, Community Detection Sec. 3.2, Path Finder Sec. 3.5, Com. Tests Sec. 3.2, Localisation Test Sec. 3.3, NetEMD Test Sec. 3.4, Combine (Random Forest/Feature Sum) Sec. 3.6.
we would like the structure to be large enough that it can be found, but not so large or numerous as to overwhelm the network. Our choice of constants in the above procedure produces on average around 147 anomalous nodes, so that for n = 10000 on average 1.475 percent of the nodes in the network are anomalous. This level of imbalance was chosen so that the anomalies do not overwhelm the underlying network.
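The generation procedure above can be sketched as follows for a single planted path anomaly; the representation of the graph as an edge-weight dictionary and all function names are our own choices, not part of the published pipeline:

```python
import random

def weighted_er(n, p, rng):
    """Steps 1-2: directed ER graph with Uni(0, 1) weights, as {(u, v): weight}."""
    return {(u, v): rng.random()
            for u in range(n) for v in range(n)
            if u != v and rng.random() < p}

def plant_heavy_path(edges, n, k, w, rng):
    """Steps 5-8 for a path anomaly: choose k distinct nodes uniformly at
    random and overwrite the path edges with weights drawn from Uni(w, 1)."""
    nodes = rng.sample(range(n), k)
    for a, b in zip(nodes, nodes[1:]):
        edges[(a, b)] = rng.uniform(w, 1)
    return nodes

rng = random.Random(0)
edges = weighted_er(200, 0.05, rng)
path = plant_heavy_path(edges, 200, k=rng.randint(5, 20), w=0.9, rng=rng)
```

The other anomaly types differ only in which node pairs receive the heavy Uni(w, 1) weights.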
2.2 Synthetic Data Sets - Testing
To test our approach on an independent data set, an internal team at Accenture (MML, PR) devised a synthetic test network with planted anomalies. The construction method of this network, as well as the anomalies, were not revealed to the Turing research group (MC, AE and GR). They were given access to several samples from the model, and they were informed that the anomalies would include heavy edges. Due to the lack of other information about the network structure, the Turing team used a configuration model as null model for the Monte Carlo tests in the anomaly detection pipeline. The Turing team submitted a list of flagged anomalies to the internal Accenture team in order to assess the performance of the method. Only at the stage of writing the paper was the network construction method revealed. This synthetic Accenture network is constructed as follows. The number of nodes is fixed to 55,000, and each node is assigned an in-degree and an out-degree. The mean in-degree is set to 21 with a standard deviation of 3, and the mean out-degree is set to 19 with a standard deviation of 2. If the sum of the in-degrees does not agree with the sum of the out-degrees, then the larger sum is reduced by deleting edges at random until the totals match. As in a standard configuration model, a corresponding number of stubs is created for each node; the stubs are then randomised and matched. In the case of self loops, the algorithm randomly selects a set of edges of the same size as the number of self loops, and swaps the ‘sender’ node of each self loop with the sender node of the randomly selected edge. This procedure can generate multi-edges, which are permitted. This process repeats until all stubs are matched. Edge weights are then drawn from a normal distribution with mean 1000 and variance 200, with all edge weights being positive.
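The stub-matching step can be sketched as follows; this is our simplified reading of the description (a single self-loop repair pass rather than the full repetition), and all names are ours:

```python
import random

def directed_configuration_model(in_deg, out_deg, rng):
    """Match out-stubs to in-stubs uniformly at random. Multi-edges are
    permitted; self loops are repaired by swapping senders with a randomly
    chosen edge (one pass here; the text implies repeating as needed)."""
    assert sum(in_deg) == sum(out_deg)
    out_stubs = [u for u, d in enumerate(out_deg) for _ in range(d)]
    in_stubs = [v for v, d in enumerate(in_deg) for _ in range(d)]
    rng.shuffle(out_stubs)
    rng.shuffle(in_stubs)
    edges = list(zip(out_stubs, in_stubs))
    for i, (u, v) in enumerate(edges):
        if u == v:
            j = rng.randrange(len(edges))
            a, b = edges[j]
            edges[i], edges[j] = (a, v), (u, b)  # swap the sender nodes
    return edges
```

The sender swap preserves every node's in- and out-degree, which is the invariant the configuration model requires.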
To this network, the following anomalies were added: a clique of size 8 and one of size 12, a ring of size 4 and one of size 10, and a heavy path of size 5 and one of size 10. In contrast to the weighted ER case (Sec. 2.1), all of the weights on the ring are constant, and the heavy path is sampled from paths that already exist in the network. Moreover, the edges in the clique of size k are not randomly arranged; instead, there is one edge between each pair of nodes, and the out- (in-) degrees of the nodes in the clique are set to [k − i : i ∈ [1, ..., k]]. Finally, the weights on the heavy edges are drawn as follows: draw a number uniformly at random from the interval (0.99, 0.999990), and use as weight the value that corresponds to this number as a percentile of the normal distribution with mean 1,000 and variance 200.
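The heavy-edge weight draw can be written as a one-line inverse-CDF lookup; we read "variance 200" literally, so σ = √200 (if the authors intended a standard deviation of 200, adjust sigma accordingly):

```python
import random
from statistics import NormalDist

def heavy_edge_weight(rng):
    # Draw a percentile uniformly from (0.99, 0.999990) and invert the
    # CDF of Normal(mean=1000, variance=200) at that percentile.
    u = rng.uniform(0.99, 0.999990)
    return NormalDist(mu=1000, sigma=200 ** 0.5).inv_cdf(u)
```

Because the percentile is always above 0.99, every heavy weight lands in the extreme upper tail of the ordinary edge-weight distribution.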
3 Anomaly Detection Method
The anomaly detection framework proposed in this paper can be split into several modules, as depicted in the right panel of Fig. 1. We first perform some basic tests, described in Sec. 3.1, which look for deviations in the degree or the local edge weights of the nodes in the network. We then divide the network into communities (described in Sec. 3.2), looking for anomalies using modules based on basic features of the communities (Sec. 3.2), novel methods based on localisation (Sec. 3.3), and finally methods based on NetEMD, a recently developed network comparison method (Sec. 3.4). We complement these methods with an approach which searches for paths and path-like structures directly in the whole network (Sec. 3.5), before finally combining the individual contributions with a random forest as described in Sec. 3.6 – we call this the Random Forest method. For comparison, we also use the sum of the scores of the features, without any feature selection; we call this the Feature Sum method.
3.1 Basic Detection Module
While the more complex methods should be able to detect many different structures, it is sensible to also include tests of some basic properties of the network to check for anomalies. To this end, we perform node level tests, which mainly investigate whether the weights that have been allocated to a given node appear to deviate from random. As a null model for the test, we randomly shuffle the weights while keeping the network fixed; thus, when testing whether the local edge weights of a node deviate from random, we simply test whether they deviate from a random draw with replacement of the weights. In order to capture heavy structures, i.e. structures which have large weights on each of their edges, we use as a test statistic the geometric mean of the strengths of the edges. To give the statistic more power in the case of an anomaly with extremely high weights but a small number of nodes, we also construct statistics over the top 10% and 20% of weights. To make the results comparable, we use the geometric average of weights, or GAW. Letting s_1^(i), s_2^(i), ..., s_{d_i}^(i) denote the ordered edge weights (both incoming and outgoing) of a given node i, we define the following scores:
    GAW(i) = ∏_{j=1}^{d_i} (s_j^(i))^{1/d_i},

    GAW10(i) = ∏_{j=d_i−⌈0.1 d_i⌉+1}^{d_i} (s_j^(i))^{1/⌈0.1 d_i⌉},

    GAW20(i) = ∏_{j=d_i−⌈0.2 d_i⌉+1}^{d_i} (s_j^(i))^{1/⌈0.2 d_i⌉}.
For each of these test statistics, we sample with replacement from the weights in the network, compute the statistic of interest, and compute the resultant Monte Carlo p-value for each node. To save computational time, we share a null distribution between nodes with the same number of edges; as sampling is very computationally efficient, for the results in this paper we use 10,000 samples in each Monte Carlo test. We then score each node as follows. If the p-value is greater than 0.05, the node gets a score of 0. If the p-value pval is below 0.05, then we assign a score of Φ^(−1)(1 − pval), where Φ denotes the CDF of the standard normal distribution. In addition to testing the local edge weights, we include a feature which is based on the degree of the nodes. This feature is the standardised version (z-score based on the empirical mean and empirical standard deviation) of the degree, to capture when the degree is larger or smaller than expected at random.
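A sketch of the GAW statistics and the Monte Carlo scoring just described; the `frac` parameter and the helper names are ours:

```python
import math
import random
from statistics import NormalDist

def gaw(weights, frac=1.0):
    """Geometric average of the top ceil(frac * d_i) edge weights of a node:
    frac=1.0 gives GAW, frac=0.1 gives GAW10, frac=0.2 gives GAW20."""
    top = sorted(weights)[-math.ceil(frac * len(weights)):]
    return math.exp(sum(math.log(s) for s in top) / len(top))

def node_score(stat, degree, all_weights, rng, n_samples=10_000):
    """Monte Carlo p-value under resampling of weights with replacement,
    mapped to Phi^{-1}(1 - p) when p < 0.05 and to 0 otherwise."""
    null = (gaw(rng.choices(all_weights, k=degree)) for _ in range(n_samples))
    pval = (1 + sum(s >= stat for s in null)) / (1 + n_samples)
    return NormalDist().inv_cdf(1 - pval) if pval < 0.05 else 0.0
```

In practice one would cache the null distribution per degree, as the text notes, rather than resampling for every node.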
3.2 Community Detection
The more intricate modules of our anomaly detection pipeline (NetEMD and Localisation) look for more subtle differences than the basic tests. However, looking for these more subtle changes is difficult, as it is unlikely that a small anomaly will cause a deviation over the whole network that is large enough to trigger an alert. To address this, we perform a community detection step to divide the network into components that are usually denser within communities than between communities. Thus, by performing community detection we obtain smaller networks to which we can apply our more intricate methods. As we are looking for structures with heavy flow through the network, we augment such structures in order to increase the likelihood of them being placed within the same community. To this end, we augment paths of heavy edges by adding additional edges along the direction of the flow. The augmentation algorithm is summarised as follows.
1. Fix a threshold at the 99th percentile of the observed weight distribution.
2. For each path of length 2 (3 nodes and 2 edges) such that the weights on both edges (w1 and w2) are greater than the threshold:
(a) If there is no edge between the first and last node, place one with weight min(w1, w2).
(b) If there is an edge of weight w3, update its weight to max(w3, min(w1, w2)).
3. Go back to Step 2 until no new edges can be added.
The intuition behind our augmentation step is that structures of heavy average weight are transformed into structures that resemble cliques.
Augmentation as regularization in the sparse regime. It is well known in the literature that standard spectral methods (for tasks such as clustering) often fail to produce meaningful results for sparse networks that exhibit strong degree heterogeneity [6, 32]. To this end, Chaudhuri et al. [15] proposed the regularized graph Laplacian defined as Lτ = Dτ^(−1/2) A Dτ^(−1/2), where Dτ = D + τI, for τ ≥ 0. The spectral algorithm introduced and analyzed in [15] splits the nodes into two random subsets and relies only on the subgraph induced by one of the subsets to compute the spectral decomposition. Qin and Rohe [53] studied the more traditional formulation of a spectral clustering algorithm that uses the spectral decomposition of the entire matrix [46], and proposed and analyzed a regularized spectral clustering. Subsequently, Joseph and Yu provided a theoretical justification for the regularization Aτ = A + τJ, where J denotes the all-ones matrix [35], partly explaining the empirical finding of [6] that the performance of regularized spectral clustering becomes insensitive to larger values of the regularization parameter, and showing that such large values can lead to better results. It would be interesting to investigate the usefulness of our augmentation step in the context of spectral clustering for sparse networks, which we leave for future work.
As a next step, we consider the augmented network and perform community detection using the Louvain algorithm [10]. For this task we use the python-louvain package from Ref. [7].
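The augmentation loop can be sketched as follows, with the network held as an edge-weight dictionary (our representation; the threshold is passed in precomputed):

```python
def augment(edges, threshold):
    """Step 2 applied repeatedly: for every heavy 2-path u -> v -> x, add
    (or strengthen) the shortcut edge (u, x), until no new edge is added."""
    while True:
        added = False
        heavy = [(u, v, w) for (u, v), w in edges.items() if w > threshold]
        succ = {}
        for u, v, w in heavy:
            succ.setdefault(u, []).append((v, w))
        for u, v, w1 in heavy:
            for x, w2 in succ.get(v, []):
                if x == u:
                    continue
                w_new = min(w1, w2)
                if (u, x) not in edges:
                    edges[(u, x)] = w_new
                    added = True
                else:
                    edges[(u, x)] = max(edges[(u, x)], w_new)
        if not added:
            return edges
```

A heavy chain thus closes up towards a clique-like structure: each pass adds shortcuts one hop longer than the last, and the loop terminates because the number of possible edges is finite.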
To test whether the density or edge weights of a given community deviate from random, we construct the following features. For these tests, we employ five different statistics on each community; three statistics are based on the density of the community and two are based on the strengths within the community. The density-based measurements we use are as follows. (1) As a first density statistic we use the density of the community divided by the density of the full network. This measure attempts to capture communities that are denser than we would expect, and dividing by the density of the overall network should help this measure generalise to observed networks with different densities. (2) For the second density statistic, we consider the same measure and divide it by the number of nodes in the community, in an attempt to further distinguish small communities (for which the penalty is smaller) from large communities. (3) For the third density statistic we use replicas from a configuration null model and perform community detection on each of them to obtain a distribution for the expected density of a community. We randomly select one community from each replica network and measure its density in the non-augmented network to obtain a null distribution. As these community detection procedures are computationally expensive, we perform only 20 of them, and then use the resultant null distribution to compute a p-value for the density of each of the communities in the resultant network, focusing on the upper tail, i.e. when they are denser than we would expect at random. To make the results comparable, we give each node a score as follows: if the resultant p-value is more than 0.5, we attribute to all nodes in the community a score of 0; otherwise we use the same procedure as with the GAW statistics, i.e., we assign a node with p-value pval the score Φ^(−1)(1 − pval).
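The first two density statistics can be written directly; the function names and the directed-density convention are our own choices:

```python
def directed_density(n, m):
    # Fraction of possible ordered node pairs that carry an edge.
    return m / (n * (n - 1))

def density_features(n_comm, m_comm, n_net, m_net):
    """Density statistic (1): community density over network density;
    statistic (2): the same ratio further divided by the community size."""
    ratio = directed_density(n_comm, m_comm) / directed_density(n_net, m_net)
    return ratio, ratio / n_comm
```

Dividing by the network density in statistic (1) makes the feature comparable across networks of different overall sparsity, as the text notes.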
The edge-weight measures are similar to the first two density statistics; however, rather than the density, we use the geometric average of the edge weights (GAW) within each community and across the network. Finally, we construct one additional feature: applying NetEMD and Localisation to very small communities can cause issues and is unlikely to be informative. To this end, if a resultant community is very small (of size less than 4), rather than running the NetEMD and Localisation modules, we assign each of its nodes a score of 1 in a special small community feature, and all other nodes a score of 0. After this community detection step, we construct two sets of networks, the first induced by the communities in the original network and the second induced by the communities in the augmented network. Crucially, each set of networks is useful for a different part of our pipeline, as we will describe in the following sections.
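The relative density and GAW features can be sketched as follows; this is a minimal Python sketch in which the edge-dictionary format and the function names (`density`, `gaw`, `community_features`) are our own illustrative choices.

```python
from math import prod

def density(n_nodes, n_edges):
    # directed density: fraction of the n(n-1) ordered node pairs with an edge
    return n_edges / (n_nodes * (n_nodes - 1)) if n_nodes > 1 else 0.0

def gaw(weights):
    # geometric average weight (for long lists, average logs to avoid overflow)
    return prod(weights) ** (1.0 / len(weights)) if weights else 0.0

def community_features(comm_nodes, comm_edges, full_nodes, full_edges):
    # comm_edges, full_edges: {(u, v): weight}
    d_rel = density(len(comm_nodes), len(comm_edges)) / density(len(full_nodes), len(full_edges))
    g_rel = gaw(list(comm_edges.values())) / gaw(list(full_edges.values()))
    return {
        "density_full": d_rel,                       # density statistic (1)
        "density_average": d_rel / len(comm_nodes),  # density statistic (2)
        "gaw_full": g_rel,                           # GAW analogue of (1)
        "gaw_average": g_rel / len(comm_nodes),      # GAW analogue of (2)
    }
```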
3.3 Spectral Localisation
Spectral localisation in networks is the phenomenon that a large amount of the mass of an eigenvector is placed on a small number of nodes. In many cases, such nodes are different in some way compared to the remainder of the nodes, and hence they may be good candidates for anomaly detection. In contrast, eigenvector delocalisation refers to the scenario in which most or all of the components of an eigenvector are small and roughly of the same magnitude. [54] calls a unit eigenvector delocalised if its entries are distributed roughly uniformly over the real or complex sphere, and quantifies this in a specific way. Our approach is motivated by a large body of work in the mathematics and physics communities which has extensively studied the spectrum of the (continuous) Laplace operator, and the interplay between the spectrum and the geometry of the domain. While most of this work has focused on understanding the spectrum, i.e., the behaviour of the eigenvalues, a lot less is known about the eigenvectors, which here are typically the objects of interest. For instance, in financial applications, lower order eigenvectors of the empirical correlation matrix are associated with the least risky portfolios, and one wishes to design methods that are able to delimit the informative part of the spectrum from the noise, typically via random matrix theory techniques. While most of the work, including [14, 8, 34], has focused on the high frequency eigenvectors (i.e., those corresponding to the larger eigenvalues), recent work [30] revealed localised eigenvectors associated with small eigenvalues. Finally, Sapoval [55] studied localised eigenvectors in different domains and pointed out their importance for physical applications, including the design of efficient noise-protective walls.
In the remainder of this section, we also consider other forms of behaviour which are not strictly localisation, e.g., a small number of nodes having positive entries while the majority of the nodes have negative values. However, for compactness, in this document we shall broadly refer to each of these behaviours as localisation. We look for localisation in the eigenvectors of three matrices, namely the symmetrised adjacency matrix, the combinatorial Laplacian of the symmetrised adjacency matrix, and the random walk Laplacian of the symmetrised adjacency matrix, which are defined as follows. Let W be a weighted (possibly directed) adjacency matrix with non-negative entries, $\mathbf{W} = (W_{ij}) \in \mathbb{R}^{n \times n}$, and let $\mathbf{W}_s$ be the symmetrised version, i.e. $\mathbf{W}_s = \mathbf{W} + \mathbf{W}^T$. The Laplacians are defined as

$$\text{combinatorial Laplacian:} \quad L_{Comb} = D - \mathbf{W}_s, \qquad (3)$$
$$\text{random-walk Laplacian:} \quad L_{RW} = D^{-1} \mathbf{W}_s, \qquad (4)$$
where D is the diagonal matrix whose entries are the sums of the incident weights (the weighted degrees) in the symmetrised adjacency matrix. Eigenvectors are always normalised so that the sum of the squares of the entries is 1. Rather than looking at the complete network, we look for localisation in each of the communities separately, to increase our chances of finding small-sized anomalies. Furthermore, rather than using the subnetworks corresponding to the communities in the original network, we look at the subnetworks of the augmented network, as we believe that the additional augmented edges around the anomalies (as well as around other heavy structures which appear at random) are likely to help the localisation methods. To this end, we look for localisation in the top 20 and bottom 20 eigenvectors of the symmetrised adjacency matrix Ws, and the top 20 non-trivial eigenvectors of the combinatorial Laplacian LComb and random-walk Laplacian LRW of the augmented subnetwork corresponding to each community. Note that, in the case of the Laplacian matrices, we consider the top 20 eigenvectors ignoring the first one, as it is trivial; in the case of LComb the first (smallest) eigenvalue is always 0, and in the case of LRW the first (largest) eigenvalue is always 1. Whenever the network has fewer than 20 nodes, we consider as many eigenvectors as possible, and if there are ties (i.e., eigenvalues with multiplicity) we break them randomly. If a replica network generated under the null model has fewer than the required number of eigenvectors, we resample the replica to obtain a network with sufficient eigenvectors for the adjacency comparison (note the Laplacian comparisons need one more eigenvector, due to the trivial one). To avoid running the procedure indefinitely, if the 10th generated network has more than 2 nodes we use it; otherwise we continue to resample until the network has more than the required number of eigenvectors. We then compare the ith eigenvector
with the networks in the ensemble of networks which are generated under the null model and which contain at least i eigenvectors. For a more in-depth explanation see Appendix B.2.5 and Appendix B.2.6. To capture the different types of localisation we use several statistics. These statistics can be divided into three broad classes, namely: norm based, sign based, and direct localisation.

Norm Based Statistics For the norm-based methods, as in Ref. [21], we look for eigenvectors such that the sum of the 4th powers of the eigenvector entries, also known as the inverse participation ratio (IPR), is larger than expected at random. As the sum of the squares of an eigenvector is 1, a large sum of fourth powers implies that there are large elements in the vector. Extending this idea, for an eigenvector v with elements $v_i$, we also look at the statistic $\sum_i (e^{|v_i|} - |v_i| - 1)$. An eigenvector with all the mass on one entry would have a statistic of $e - 2$, whereas a perfectly delocalised eigenvector ($v_i = n^{-0.5}\ \forall i$) would have a value of $n(e^{n^{-0.5}} - n^{-0.5} - 1)$, which for $n = 10{,}000$ amounts to $\approx 0.50167$, whereas the perfectly localised version would be $\approx 0.71828$. Note that [39] also uses localisation, but focuses on the $\ell_1$ norm rather than the inverse participation ratio.

Sign Based Statistics Some eigenvectors can be used to identify anomalous structures by the sign of the entry on each node, even when many entries in the eigenvector are zero. To capture this effect, we compute the number of nodes that have a strictly positive or strictly negative value, and we use the minimum of these quantities divided by Min(20, Number of Non-Trivial Eigenvectors) as a test statistic. There are a number of boundary cases for this statistic (for example a vector that is all positive), the treatment of which we detail in Appendix B.2.4.
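The two norm-based statistics can be written down directly; a short numpy sketch (function names are ours), which reproduces the reference values quoted above for a unit eigenvector:

```python
import numpy as np

def ipr(v):
    # inverse participation ratio: sum of 4th powers of a unit eigenvector
    v = np.asarray(v, dtype=float)
    return float(np.sum(v ** 4))

def exp_stat(v):
    # sum_i (e^{|v_i|} - |v_i| - 1): large when mass concentrates on few entries
    a = np.abs(np.asarray(v, dtype=float))
    return float(np.sum(np.exp(a) - a - 1.0))
```

For n = 10,000, the perfectly delocalised vector gives exp_stat ≈ 0.50167 and the perfectly localised one e − 2 ≈ 0.71828, matching the values in the text.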
Direct Localisation Statistics To directly capture the localisation phenomenon, we construct a test statistic which is the minimum number of entries required such that their sum is at least 90% of the sum of a statistic over the eigenvector. We consider as statistics for this procedure the fourth power of each of the entries and the absolute value of each of the entries. For consistency, whenever entries are exactly equal, we count all nodes with the given value. In the next step, we consider all the above test statistics and evaluate whether they deviate in the appropriate direction, based on a p-value test using a set of replicas generated under the null model. For the results throughout this paper, we use the configuration null model, which is run directly on the augmented subgraphs from the community detection step. For computational reasons, we share this set of replicas across different eigenvectors, and therefore the tests for different eigenvectors are not independent. Thus this test procedure is not a rigorous significance test, but rather a procedure to generate a score. Once we have identified vectors that deviate significantly from what is expected at random, we then select the nodes which caused the deviation. For the norm-based approaches, the divide between significantly different nodes and the rest is more difficult to identify, and thus we consider four different statistics to capture this divide, which are related to the raw value, the significance of the test statistic that highlighted the eigenvector, and the element-wise behaviour observed at random. As their derivation is slightly involved, we refer the reader to Appendix B.2.3 for additional details. For the sign-based approaches, we select the nodes which possess the rare sign, and following Sec. 3.1, we give a score related to the p-value, taking the inverse CDF of a standard normal distribution at 1 − pval as the score.
To further highlight small communities, we also construct a score normalised by the number of nodes that have the rare sign. We further note that, as with the sign-based test statistic, there are a small number of edge cases in selecting the nodes, e.g. if the number of positive entries equals the number of negative entries; see Appendix B.2.4 for details. For the direct localisation results, we compute the set of nodes required such that the statistic is greater than 90% of the total, and assign them a series of scores based on the inverse normal CDF of 1 − pval. Furthermore, a small number of additional implementation details are described in Appendix B.2, and a small number of additional features which highlight edge cases can be found in Appendix E.5. For the experiments in this paper, we fix the p-value threshold at 0.05, and the number of null model repeats at 500. Note that we compute a different set of null replicas for each matrix type (e.g. LRW, LComb, large eigenvalues of Ws and small eigenvalues of Ws), but as discussed above, for computational reasons we share the replicas between eigenvectors of the same matrix. Finally, in order to limit the total number of features, we combine (sum) the contribution from a given localisation test and scoring method in each eigenvector of a given matrix into one feature, as we would expect similar behaviour to be captured by eigenvectors of the same matrix with the same test statistic. Thus, rather than obtaining 1200 features composed of all combinations of the 4 different sets of eigenvectors (LRW, LComb, large eigenvalues of Ws, and small eigenvalues of Ws), we obtain 60 features; for a complete list see Sec. E.4.

Figure 2: Example which highlights that NetEMD only considers the shape of the distribution rather than the position or the scale. In the top plot, the NetEMD score for comparing two normal distributions is 0, whereas in the second panel the NetEMD score for comparing a normal distribution and an exponential distribution is greater than 0.
3.4 The NetEMD Module
The NetEMD module extends the NetEMD network comparison tool from [67] into an anomaly detection method by measuring the difference between the observed network and an expected network. Rather than comparing raw values of a given network statistic, NetEMD compares the shapes of the distributions of a statistic over the nodes of a network. As an illustration of NetEMD, Fig. 2 shows that the NetEMD score for comparing two normal distributions with different means and variances is 0. In contrast, the NetEMD score for comparing a normal distribution and an exponential distribution is greater than 0. To describe NetEMD, following the notation in the localisation section, we represent networks as a (possibly) weighted, (possibly) directed adjacency matrix $\mathbf{W} = (W_{ij}) \in \mathbb{R}^{n \times n}$. Let $T(j, \mathbf{W})$ be a real-valued node level statistic for node $j$ on a network $\mathbf{W}$, and let

$$U_T(x, \mathbf{W}) = \frac{1}{n}\sum_j \mathbb{1}(x \le T(j, \mathbf{W})) \qquad (5)$$

denote the empirical distribution function of the statistic $T$ evaluated on the nodes of network $\mathbf{W}$. We define the standardised version of $T$ in order to remove the effect of scale,

$$\tilde{T}(j, \mathbf{W}) = \frac{T(j, \mathbf{W})}{\sigma(T, \mathbf{W})} \qquad (6)$$

where $\sigma(T, \mathbf{W})$ is the empirical variance, i.e. $\sigma(T, \mathbf{W}) = \frac{1}{n}\sum_{i=1}^{n} (T(i, \mathbf{W}) - \mu(T, \mathbf{W}))^2$, and $\mu(T, \mathbf{W}) = \frac{1}{n}\sum_{i=1}^{n} T(i, \mathbf{W})$. For two networks $\mathbf{W}, \mathbf{W}'$, and a given statistic $T$, the NetEMD score is given by

$$\mathrm{NetEMD}_T(\mathbf{W}, \mathbf{W}') = \min_s \int_{-\infty}^{\infty} |U_{\tilde{T}}(x + s, \mathbf{W}) - U_{\tilde{T}}(x, \mathbf{W}')|\, dx,$$

where we highlight the use of $\tilde{T}$ rather than $T$ in $U_{\tilde{T}}$, in order to remove the effect of scale. The minimum over $s$ of $\int_{-\infty}^{\infty} |U_{\tilde{T}}(x + s, \mathbf{W}) - U_{\tilde{T}}(x, \mathbf{W}')|\, dx$ is achieved, as the empirical distribution functions only have finitely many jumps; see for example [70]. Let $C_T = \{T^{(m)}, m = 1, \ldots, M\}$ denote a set of node statistics such that for each node $j$ and each network $\mathbf{W}$, $T^{(m)}(j, \mathbf{W}) \in \mathbb{R}$. The NetEMD statistic for such a set of node level statistics is the average over the single NetEMD scores,

$$\mathrm{NetEMD}(\mathbf{W}, \mathbf{W}') = \frac{1}{|C_T|}\sum_{T \in C_T} \mathrm{NetEMD}_T(\mathbf{W}, \mathbf{W}'). \qquad (7)$$
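For two networks with the same number of nodes, the score can be computed in closed form: for equal-size samples the earth mover's distance equals the mean absolute difference of the sorted values, and the optimal translation s is the median of those differences. A minimal numpy sketch under this equal-size simplification (here we standardise by the empirical standard deviation; the name `netemd_stat` is ours):

```python
import numpy as np

def netemd_stat(stat_u, stat_v):
    # values of one node statistic T on the nodes of two equal-size networks
    u = np.sort(np.asarray(stat_u, dtype=float))
    v = np.sort(np.asarray(stat_v, dtype=float))
    u = u / u.std()                # remove scale (assumes non-zero spread)
    v = v / v.std()
    s = np.median(v - u)           # minimiser of the L1 distance over shifts
    return float(np.mean(np.abs(u + s - v)))
```

A statistic that differs only by location and scale gives a score of 0, as in the top panel of Fig. 2.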
Figure 3: The subnetwork statistics that are used in the NetEMD module. Statistic 1 is the total out-strength, statistic 2 is the total in-strength and statistic 3 is the total of the in-strength and out-strength. The remaining statistics are the directed motifs of size 3, where the weight is given by the product of the weights on each of the edges.

In [67], the statistics often used as T(m) are the so-called orbit counts, i.e., the number of times that a given node is part of a given subnetwork in a given position.

Our Set of Statistics We use a combination of directed network summaries as our set of NetEMD statistics, as follows. As shown in Fig. 3, we use the in-strength (i.e., the sum of all weights coming into a node), the out-strength (i.e., the sum of all weights going out of a node), the sum of the out-strength and the in-strength, and finally each of the directed motifs of size 3. Here, a motif is just a small network; no over- or under-representation is implied. Following the approach used by [68], we let the weight of each motif be the product of the weights on its edges. Formally, for a given motif M on 3 nodes, letting W([i, j, k]) be the induced subnetwork of W with nodes i, j and k, our test statistic is

$$T^{Motif(i)}(j, \mathbf{W}) = \sum_{a < b} \mathbb{1}\big(\mathbf{W}([j, a, b]) \text{ is an induced copy of } M\big) \prod_{c, d \in \{j, a, b\}} \big((W_{cd} - 1)\mathbb{1}(W_{cd} > 0) + 1\big),$$

where each factor of the product equals $W_{cd}$ when the edge $(c, d)$ is present and 1 otherwise, so that the product is the product of the weights of the present edges.

Let $N_{pos}(v) = \sum_i \mathbb{1}(v_i > 0)$ and $N_{neg}(v) = \sum_i \mathbb{1}(v_i < 0)$. Many of the boundary cases relate to the setting where all entries have the same sign, resulting in a test statistic of 0. A test statistic of 0 is problematic, as we are interested in cases where a small non-zero number of entries have the same sign (and thus those entries are anomalous). Thus, rather than assigning 0 to the test statistic, we assign it $\theta(G)$, alleviating the problem. The resulting test statistic is

$$\frac{1}{\theta(G)} \mathrm{Min}\big(N_{pos}(v) + \mathbb{1}(N_{pos}(v) = 0)\theta(G),\ N_{neg}(v) + \mathbb{1}(N_{neg}(v) = 0)\theta(G)\big),$$

which is equivalent to the original test statistic when $N_{pos} \neq 0$ and $N_{neg} \neq 0$.
Feature Construction As with the other localisation features, in vectors where the test statistic appears to deviate from expectation, we attempt to highlight the nodes causing the deviation. In doing so, we consider three different edge cases, namely vectors where Nneg or Npos equals 0, vectors where Nneg = Npos, and finally the extremely rare case where more than 50% of the nodes are considered anomalous.
The original features, as described in the paper, are given by

$$\mathrm{SignStat}^{(1)}_i = \mathbb{1}(N_{pos}(v) < N_{neg}(v))\,\mathbb{1}(v_i > 0)\,\Phi^{-1}(1 - p_{val}) + \mathbb{1}(N_{pos}(v) > N_{neg}(v))\,\mathbb{1}(v_i < 0)\,\Phi^{-1}(1 - p_{val})$$

$$\mathrm{SignStat}^{(2)}_i = \frac{\mathbb{1}(N_{pos}(v) < N_{neg}(v))\,\mathbb{1}(v_i > 0)\,\Phi^{-1}(1 - p_{val})}{N_{pos}(v)} + \frac{\mathbb{1}(N_{pos}(v) > N_{neg}(v))\,\mathbb{1}(v_i < 0)\,\Phi^{-1}(1 - p_{val})}{N_{neg}(v)}$$

where $\Phi^{-1}$ is the inverse of the CDF of a normal distribution with mean 0 and variance 1, and $\mathbb{1}$ is the indicator function.

To address the first edge case, where $N_{pos}$ or $N_{neg}$ equals zero, we replace $N_{pos}$ ($N_{neg}$) with $\widehat{N}_{pos} = N_{pos} + \mathbb{1}(N_{pos} = 0)|v|$ ($\widehat{N}_{neg} = N_{neg} + \mathbb{1}(N_{neg} = 0)|v|$), leading to the following features

$$\mathrm{SignStat}^{(1)}_i = \mathbb{1}(\widehat{N}_{pos}(v) < \widehat{N}_{neg}(v))\,\mathbb{1}(v_i > 0)\,\Phi^{-1}(1 - p_{val}) + \mathbb{1}(\widehat{N}_{pos}(v) > \widehat{N}_{neg}(v))\,\mathbb{1}(v_i < 0)\,\Phi^{-1}(1 - p_{val}) \qquad (21)$$

$$\mathrm{SignStat}^{(2)}_i = \frac{\mathbb{1}(\widehat{N}_{pos}(v) < \widehat{N}_{neg}(v))\,\mathbb{1}(v_i > 0)\,\Phi^{-1}(1 - p_{val})}{\widehat{N}_{pos}(v)} + \frac{\mathbb{1}(\widehat{N}_{pos}(v) > \widehat{N}_{neg}(v))\,\mathbb{1}(v_i < 0)\,\Phi^{-1}(1 - p_{val})}{\widehat{N}_{neg}(v)} \qquad (22)$$

where we again note that they are equivalent to their original formulation in the cases where $N_{pos} \neq 0$ and $N_{neg} \neq 0$. The second edge case occurs if $N_{neg} = N_{pos}$; in this case, we note that the features as defined in Eq. (21) and Eq. (22) are always 0. This case can happen if the vector has a large number of 0's and a small number of non-zero entries, thus allowing the test statistic to be significant while $N_{pos} = N_{neg}$. To address this, we introduce additional features to capture this behaviour

$$\mathrm{SignStatEqual}^{(1)}_i = \mathbb{1}(\widehat{N}_{pos}(v) = \widehat{N}_{neg}(v))\,(\mathbb{1}(v_i > 0) + \mathbb{1}(v_i < 0))\,\Phi^{-1}(1 - p_{val}) \qquad (23)$$

$$\mathrm{SignStatEqual}^{(2)}_i = \frac{\mathbb{1}(\widehat{N}_{pos}(v) = \widehat{N}_{neg}(v))\,(\mathbb{1}(v_i > 0) + \mathbb{1}(v_i < 0))\,\Phi^{-1}(1 - p_{val})}{\widehat{N}_{pos} + \widehat{N}_{neg}} \qquad (24)$$
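The zero-count edge-case handling can be sketched as follows; a minimal Python sketch of the SignStat features with the hatted counts (the function name is ours), with Φ−1 taken from the standard library:

```python
from statistics import NormalDist

def sign_stat_features(v, pval):
    # v: eigenvector entries; pval: p-value of the sign test for this vector
    npos = sum(1 for x in v if x > 0)
    nneg = sum(1 for x in v if x < 0)
    npos_hat = npos if npos else len(v)   # replace a zero count by |v|
    nneg_hat = nneg if nneg else len(v)
    z = NormalDist().inv_cdf(1 - pval)    # Phi^{-1}(1 - pval)
    s1, s2 = [], []
    for x in v:
        rare_pos = npos_hat < nneg_hat and x > 0
        rare_neg = npos_hat > nneg_hat and x < 0
        s1.append(z if (rare_pos or rare_neg) else 0.0)
        s2.append(z / npos_hat if rare_pos else z / nneg_hat if rare_neg else 0.0)
    return s1, s2
```

When the (hatted) counts are equal, both features vanish, which is exactly the case handled by the SignStatEqual features in Eqs. (23) and (24).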
We further note that there is one additional extremely rare edge case for this statistic. If many elements of the null distribution have the same sign, then any observed vector with more than 50% of its entries of one sign, and the remaining entries 0, can appear significant. To address this issue and to avoid adding noise to the feature set, we exclude these vectors from the features, and instead create a new set of features for the cases where the entries are large. Finally, as this extremely rare case does not appear in our data sets, we exclude the related features from our main set of 140 features and instead list them in the additional feature section in Sec. E.5.

B.2.5 Configuration Null Model Replicates Have Insufficient Size
Sometimes a null model replica is too small to have sufficient eigenvectors for a comparison to the real data. The underlying reason is that in the stub-based configuration model described in Appendix B.1, removing degree-0 nodes can affect the number of eigenvectors. When the networks are large, the impact of these missing nodes is small. However, an issue arises in Monte Carlo tests when some simulated networks are so small that the size of the network drops below the number of eigenvectors required for the null comparison. To overcome this issue, as described in the main body, we re-sample any networks that are too small, with an exception at the 10th generated network. Formally, we rely on the following procedure.
1. Set LoopCounter to 0;
2. Generate a configuration model replica;
3. If the number of nodes in the replica is less than the number of eigenvectors required for the adjacency comparison:
(a) Increase LoopCounter by 1;
(b) If LoopCounter equals 10 and the network has 2 nodes or more, then stop the procedure and use the current replica;
(c) Go to 2;
4. Use the current replica.

The re-sampling results in a slightly biased (rather than uniform) random sample over all configuration model networks. As we require fewer than 25 eigenvectors, and the communities uncovered in our generating model are often much larger than this, the re-sampling is only rarely required.

B.2.6 Comparing with Null Replicas of Insufficient Size
As discussed in the previous section, it is possible to obtain null replicas of insufficient size, so we must address how to compute significances with such a null distribution. In our Monte Carlo p-value computation, we compare the ith real eigenvector only with networks that have at least i eigenvectors, thus giving a well-defined comparison. Combining the null generation and comparison procedures, we obtain the following procedure for localisation (note that NetEMD-based features use a slightly different method). The procedure to compute the significance of the top numEigsOfInterest eigenvectors using numberOfNullNetworkRequired null replicas is given by:

1. Construct an empty list NullNetworkList;
2. Set LoopCounter to 0;
3. Generate a configuration model replica;
4. If the number of nodes in the replica is less than the number of eigenvectors required for the adjacency comparison:
(a) Increase LoopCounter by 1;
(b) If LoopCounter equals 10 and the network has 2 nodes or more, then go to 5;
(c) Go to 3;
5. Add the last generated network to NullNetworkList;
6. If length(NullNetworkList) is less than numberOfNullNetworkRequired, go to 2;
7. Compare each of the observed top numEigsOfInterest eigenvectors against the equivalent eigenvector in each of the null model replicas (if it exists).
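The generation loop above can be sketched as follows, where `sample_size()` is a hypothetical stand-in that draws a configuration-model replica and returns its node count:

```python
def generate_replicas(sample_size, needed_eigs, n_replicas):
    # Draw replicas, redrawing any that cannot supply needed_eigs eigenvectors;
    # the 10th consecutive draw is accepted if it has at least 2 nodes,
    # otherwise we keep drawing until a replica reaches the required size.
    replicas, tries = [], 0
    while len(replicas) < n_replicas:
        tries += 1
        n = sample_size()
        if n >= needed_eigs or (tries == 10 and n >= 2):
            replicas.append(n)
            tries = 0
    return replicas
```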
B.3 NetEMD Implementation Details
Similar to the previous section on localisation, there is an implementation detail of NetEMD which is not obvious. NetEMD compares pairs of networks by comparing the shapes of distributions of node statistics: it standardises the distributions and then computes the minimum EMD over translations. Distributions in which all the mass is concentrated on a single point do not undergo variance scaling, as the variance is 0. However, when using floating point features, such as eigenvector entries, it is important to recognise when we are observing a single-point feature rather than a tight distribution of points, because variance scaling would inflate the small floating point differences. To address this issue, if the range of a statistic on the network is less than $10^{-10}$, we replace the statistic with a point distribution.
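This guard amounts to a few lines; a sketch (the function name is ours):

```python
def as_point_distribution(values, tol=1e-10):
    # Collapse a statistic whose range is below tol to a point distribution,
    # so that variance scaling cannot inflate floating point noise.
    if max(values) - min(values) < tol:
        return [values[0]] * len(values)
    return list(values)
```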
B.4 PathFinder
The PathFinder method is simpler than the other approaches, but nevertheless there are implementation details which we could not fit into the main body. Thus, in this section, we discuss how we construct the original paths of size 3 to expand, and we also discuss our tie-breaking rule when considering two paths of equal fitness.

Selecting the Starting Paths of Size 3 To avoid considering all paths of size 3, of which there can be very many, we start with a subset of paths of size 3, and then, in line with our beam-width approach, we consider and expand the BeamWidth paths with the highest fitness. Following the main paper, we define fitness as the smallest edge weight in a path. The procedure to construct our paths of size 3 is as follows.

1. Make a list (PathsToConsider) with BeamWidth dummy paths of weight 0;
2. For each node (CurrentNode) in the network:
(a) Consider the node (CurrentOut) at the end of the highest-weighted outgoing edge from CurrentNode.
(b) Consider the node (CurrentIn) at the start of the highest-weighted incoming edge to CurrentNode.
(c) If the minimum fitness of the paths in PathsToConsider is less than the fitness of the path CurrentIn -> CurrentNode -> CurrentOut:
i. Add the path to PathsToConsider, removing the path with the lowest fitness.
ii. If there are no more incoming and outgoing edges to consider, go to 2 and consider the next CurrentNode.
iii. If the largest unconsidered incoming edge to CurrentNode has a higher weight than the largest unconsidered outgoing edge from CurrentNode, then update CurrentIn to the relevant node and go to (c).
iv. Update CurrentOut to the node corresponding to the largest unconsidered outgoing edge and go to (c).
(d) Go to 2 and consider the next CurrentNode.

We note that the order in which we consider the nodes does play a role in the selection of the paths of size 3, as the if statement in (c) will only add paths if they have higher fitness than those previously observed.
We further discuss such arbitrary choices in the next section, including the procedure for breaking ties in our algorithm.

Underlying Data Structure/Breaking Ties To discuss the motivation behind our tie-breaking rule, we must first discuss the underlying implementation, as we chose a tie-breaking rule to reduce the computational complexity as much as possible. The naive implementation is to store each of the elements, then sort them and select the top BeamWidth elements. However, if we consider ρ path extensions, sorting the elements takes O(ρ log ρ) time with O(ρ) storage. Our implementation instead uses a fixed-size heap (via the Python module heapq) to which we are constantly adding and removing elements. The computational complexity of adding an element to a heap and removing the least fit element is O(log(number of elements in the heap)). We set the size of the heap to the beam width, so adding the ρ path extensions has complexity O(ρ log(BeamWidth)), with O(BeamWidth) storage. Thus, we use the heap solution; however, as we have a fixed-size heap, we need to be able to break ties for paths which have the same fitness (minimum weight on any of the edges). Furthermore, as we are using the inbuilt heapq module, we break ties in fitness using the inherent Python 2 ordering of lists, which compares the first element, then the second, and so on. The lists in question are the lists of node indices involved in a path, arranged in the order they are visited, in which larger indices are considered fitter.
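The fixed-size heap with the list-based tie-breaking can be sketched as follows (`extend_beam` is our name; entries are (fitness, path) tuples so that Python's tuple ordering compares fitness first and then the node-index lists, with larger indices considered fitter, as described above):

```python
import heapq

def extend_beam(beam, candidates, beam_width):
    # beam: heap of (fitness, path) with the least fit entry at beam[0]
    for fitness, path in candidates:
        entry = (fitness, path)
        if len(beam) < beam_width:
            heapq.heappush(beam, entry)
        elif entry > beam[0]:            # fitter than the worst kept path
            heapq.heapreplace(beam, entry)
    return beam
```

Note that on equal fitness, the candidate path with the lexicographically larger node list displaces the incumbent, mirroring the Python 2 list-comparison tie-break.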
C Regions of Detectability
In the main body of the paper, we discussed limiting the parameter regimes to those where the expected number of anomalous structures that would occur at random is less than 1. In this section of the appendix, we derive the expressions used in the paper (Sec. C.1) and we show that the path of size 5 is the most restrictive case (Sec. C.2).
C.1 Derivation of Regions of Detectability
Anomalies will only be detectable if they are in some way unusual in the underlying network. In this paper, we require that the expected count of the anomaly in a network without planted anomalies is less than 1. The parameter region which ensures that the expected count is less than 1 is what we call the region of detectability for that anomaly. To compute the bounds on the expected number of anomalies, let $f^{(A(k,w))}(G, X)$ be the number of anomalies of type A, of size k and minimum weight w, observed on the subnetwork of G induced by the nodes in X. Let the total number of such anomalies over the network be $F^{(A(k,w))}(G) = \sum_{X \subseteq V, |X| = k} f^{(A(k,w))}(G, X)$. We assume that the function f preserves exchangeability in the sense that the second equality in (25) holds:

$$E[F^{(A(k,w))}(G)] = \sum_{X \subseteq V, |X| = k} E[f^{(A(k,w))}(G, X)] = \binom{n}{k} E[f^{(A(k,w))}(G, X_{arb})], \qquad (25)$$
where the final equality comes from the fact that all nodes in an ER network are exchangeable, and $X_{arb}$ is an arbitrary set of nodes of size k. For example, subnetwork counts preserve exchangeability. Here, we calculate the expectations under a directed ER model with edge probability p and edge weights $W_{ij} \sim \mathrm{Uniform}(0, 1)$. An edge is heavy if its edge weight W is at least w, where w is a fixed number; thus the probability of seeing a heavy edge is p(1 − w). The structures for which we find the expected counts are:

1. a clique on k nodes in which there is at least one heavy edge between any two nodes;
2. a star on k nodes in which the centre has in-degree at least k1 and out-degree at least k − k1 − 1, with the additional requirement that the k1 in-degree edges come from different nodes than the k − k1 − 1 out-degree edges connect to. There may be additional edges in the star;
3. a path on k nodes v1, . . . , vk starting in v1, with a directed heavy edge from vi to vi+1, for i = 1, . . . , k − 1. Again, we allow more edges to be present;
4. a ring of k nodes - a directed completion of a heavy path;
5. a particular tree with 1 root (the inner shell) connecting to 3 nodes (the middle shell), each of which connects to the same 5 nodes (the outermost shell), so that there are 9 nodes in the tree and 5 × 3 + 3 × 1 = 18 edges in total, with the specified connecting edges again being heavy.

Here are the calculations.

1. The derivation for the clique is as follows. The probability of observing a clique of size k on a fixed set of k nodes in an undirected ER network is given by $p^{\binom{k}{2}}$. Further, the probability of observing an edge in either direction in a directed ER network is given by $1 - (1 - p)^2$ (one minus the probability of there being no edge in either direction). Observing that the probability of a heavy edge in our network model is given by p(1 − w), we obtain the following inequality
$$1 > E[F^{(A^{(clique)}(k,w))}(G)] = \binom{n}{k} E[f^{(A^{(clique)}(k,w))}(G, X_{arb})] = \binom{n}{k}\big(1 - (1 - (1 - w)p)^2\big)^{\binom{k}{2}},$$

which is satisfied when

$$(1 - w)p < 1 - \left(1 - \binom{n}{k}^{-1/\binom{k}{2}}\right)^{1/2},$$

as stated in the main text.
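The clique bound can be checked numerically; a short sketch (function names are ours), confirming that the expected count equals 1 exactly at the boundary:

```python
from math import comb

def expected_cliques(n, k, p_heavy):
    # expected number of k-sets whose every pair is joined by a heavy edge
    return comb(n, k) * (1 - (1 - p_heavy) ** 2) ** comb(k, 2)

def clique_bound(n, k):
    # largest heavy-edge probability (1 - w)p keeping the expectation below 1
    return 1 - (1 - comb(n, k) ** (-1 / comb(k, 2))) ** 0.5
```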
2. The derivation for the star is as follows. We calculate the expected counts for stars on k nodes. To make the computation easier, we parameterise the number of nodes that have heavy edges into the centre as k1, where k1 < k. For a star with k nodes, the probability that the k − 1 heavy edges occur is $((1 - w)p)^{k-1}$. Starting with a set of k nodes, there are $\binom{k}{k_1}$ ways of selecting the k1 nodes with edges into the centre node. There are then k − k1 ways of choosing the central node, which also fixes the nodes receiving the outgoing edges. Putting this all together in an expression for $f^{(A^{(star)}(k,w,k_1))}(G, X_{arb})$, we arrive at

$$E[f^{(A^{(star)}(k,w,k_1))}(G, X_{arb})] = \sum_{\substack{X_1, X_2, X_3 \subset X_{arb},\ X_1 \cup X_2 \cup X_3 = X_{arb} \\ |X_1 \cap X_2| = |X_2 \cap X_3| = |X_1 \cap X_3| = 0 \\ |X_1| = k_1,\ |X_2| = 1}} ((1 - w)p)^{k-1} = \binom{k}{k_1}\binom{k - k_1}{1}((1 - w)p)^{k-1}.$$

Substituting this into equation (25) and rearranging, we obtain

$$1 > E[F^{(A^{(star)}(k,w,k_1))}(G)] = \binom{n}{k}\binom{k}{k_1}\binom{k - k_1}{1}((1 - w)p)^{k-1},$$

giving that

$$(1 - w)p < \left(\binom{n}{k}\binom{k}{k_1}\binom{k - k_1}{1}\right)^{-1/(k-1)}.$$
3. For the path on k nodes, each of the k! orderings of a fixed set of k nodes gives a distinct directed heavy path, so

$$1 > E[F^{(A^{(path)}(k,w))}(G)] = \binom{n}{k}\, k!\, ((1 - w)p)^{k-1},$$

which is satisfied if

$$(1 - w)p < \left(\binom{n}{k}\, k!\right)^{-1/(k-1)}.$$

4. For the ring on k nodes, there are (k − 1)! distinct directed rings on a fixed set of k nodes, so

$$1 > E[F^{(A^{(ring)}(k,w))}(G)] = \binom{n}{k}\,(k - 1)!\,((1 - w)p)^{k},$$

giving that

$$(1 - w)p < \left(\binom{n}{k}\,(k - 1)!\right)^{-1/k}.$$

5. For the tree, on a fixed set of 9 nodes there are $\binom{9}{5}\binom{4}{1}$ ways of choosing the outermost shell and then the root, which fixes the middle shell, so

$$1 > E[F^{(A^{(tree)}(w))}(G)] = \binom{n}{9}\binom{9}{5}\binom{4}{1}\,((1 - w)p)^{18},$$

which amounts to

$$(1 - w)p < \left(\binom{n}{9}\binom{9}{5}\binom{4}{1}\right)^{-1/18}.$$
numLesser then use v.

5. If numGreater < numLesser then use −v.
6. If numGreater = numLesser then:
(i) Set a to 0;
(ii) If $\sum_j (v_j)^{2a+1} > 0$, use v;
(iii) If $\sum_j (v_j)^{2a+1} < 0$, use −v;
(iv) If $\sum_j (v_j)^{2a+1} = 0$, increase a by 1 and go back to (ii).
As, potentially, such an algorithm may result in an infinite loop, we show that the procedure will always result in a selection between v and −v in finite time. More precisely, we shall show that there always exists an P exponent a such that j (vj )2a+1 6= 0. Note, for this proof we borrow from from other approaches e.g. [61], and thus we do not claim novelty in the method, but we include it for completeness. Clearly, if the number of elements in v which are greater than 0 was not equal to the number of elements in v less than 0, then either step 4 or step 5 would have resulted in a selection of either v or −v. Therefore, the number of positive entries must be equal to the number of negative entries. Let the negative entries be qi such that q1 ≤ q2 ≤ q3 ≤ ... ≤ q n2 < 0 and ri be the positive entries 0 < r1 ≤ r2 ≤ r3 ≤ ... ≤ r n2 . P As j (vj )2a+1 is a sum of odd powers (power sums) we can remove any pair of elements (qi , rj ) such that |qi | = |rj |, as their odd powers cancel each other in the sum. We first consider (q1 , r1 ), remove if they are equal, else move on to the next pairing, assuming they both have not already been removed. This leaves q˜1 ≤ q˜2 ≤ q˜3 ≤, ... ≤ q˜n˜ ≤ 0 and 0 ≤ r˜1 ≤ r˜2 ≤ r˜3 ≤ ... ≤ r˜n˜ . By assumption the sequence is not symmetric 2 2 and therefore n ˜ 6= 0. To show that the procedure converges, we show that for every set of values which is not symmetric around 0, P there is an odd powerP sums is not equal to 0. Clearly, P power such P that the sum ofPthe values to this P odd2a+1 if i vi2a+1 = i q˜i2a+1 + i r˜i2a+1 = 0 then i (bvi )2a+1 = i (b˜ qi ) + i (b˜ ri )2a+1 = 0. Further, as all pairs of values that cancel each other out in the sum of odd powers have been removed, q˜0 r˜0 |˜ q1 | = 6 |˜ r n˜ |. Without loss of generality, |˜ q1 | ≤ |˜ r n˜ |. Let b = |˜r1n˜ | , let q˜i0 = bi and let r˜i0 = bi . Then 2
2
2
n ˜
n ˜
n ˜
i=1
i=1
i=1
n ˜
2 2 2 2 −1 X X X X 2a+1 2a+1 0 2a+1 0= (b˜ qi ) + (b˜ ri ) = (˜ qi ) + (˜ ri0 )2a+1 + 1.
i=1
Using that the ri > 0 and that qi < 0 , it follows that n ˜ 0 2a+1 (˜ q ) + 1 ≤ 0. 2 1 Re-arranging and taking logarithms, n ˜ log(1) ≤ log( ) + (2a + 1)log(−(˜ q 0 )1 ). 2 Thus, −
log( n2˜ ) ≥ (2a + 1) log(−(˜ q10 ))
where the direction of the inequality changes as by assumption −1 < q˜10 < 0, and therefore log(−(˜ q10 )) < 0. 0 As −(˜ q )1 6= 1, the left-hand side of the last inequality is a finite positive quantity which does not depend on a. Therefore, there exists a positive value a such that the inequality does not hold. Thus, there is a power sum which is non zero, and hence the algorithm terminates after finitely many steps, which concludes our proof.
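As a concrete illustration, the selection rule above can be sketched as follows. This is a minimal sketch rather than the paper's implementation; in particular, starting the loop at $a = 1$ and the cut-off `max_power` are illustrative assumptions.

```python
import numpy as np

def canonical_sign(v, max_power=50):
    """Choose between v and -v via sums of odd powers of the entries.

    Implements steps (ii)-(iv) above: for a = 1, 2, ... compute
    s_a = sum_j v_j^(2a+1); return v if s_a > 0 and -v if s_a < 0,
    otherwise increase a. The proof above guarantees termination
    whenever the entries are not symmetric about 0.
    """
    v = np.asarray(v, dtype=float)
    for a in range(1, max_power + 1):
        s = np.sum(v ** (2 * a + 1))
        if s > 0:
            return v
        if s < 0:
            return -v
    # Entries (numerically) symmetric about 0: no sign is preferred.
    raise ValueError("no sign preferred: entries appear symmetric about 0")
```

For example, the entries $(1, 12, -9, -10)$ have a vanishing sum of cubes ($1 + 1728 = 729 + 1000$), so the rule moves on to fifth powers, which select $v$.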
E List of all anomaly detection features

E.1 Basic Node Statistics
All of the statistics below are node statistics, in the sense that each assigns a value to every node; some of them, however, assign the same value to every node within a community.
1. Standardised Degree
2. Community Density Full, i.e. the density of the community divided by the overall density.
3. Community Density Average, i.e. the density of the community divided by the overall density and also divided by the size of the community.
4. Community GAW Full, i.e. the Geometric Average Weight of the community divided by the overall Geometric Average Weight.
5. Community GAW Average, i.e. the Geometric Average Weight of the community divided by the overall Geometric Average Weight and also divided by the size of the community.
6. Geometric Average Weight (GAW)
7. Geometric Average Weight Top 10% (GAW Top 10%)
8. Geometric Average Weight Top 20% (GAW Top 20%)
9. Community Density Configuration Model - a statistic which compares the density of the community with a null distribution of densities of communities of configuration model graphs. The resulting p-value is transformed by $\max(\Phi^{-1}(1 - p_{\mathrm{val}}), 0)$, where $\Phi^{-1}$ is the inverse CDF of a standard normal distribution.
10. Small Community Category - as discussed in Sec. 3.2
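The GAW-based statistics above can be sketched as follows. The exact definitions are given elsewhere in the paper; here we assume the usual reading of GAW as the geometric mean of the (positive) incident edge weights, with the top-10% variant restricted to the largest weights. The function names and the choice of always keeping at least one weight are illustrative assumptions.

```python
import numpy as np
from statistics import NormalDist

def geometric_average_weight(weights):
    # Geometric mean of the (positive) edge weights:
    # the exponential of the arithmetic mean of the log-weights.
    w = np.asarray(weights, dtype=float)
    return float(np.exp(np.mean(np.log(w))))

def gaw_top_fraction(weights, frac=0.1):
    # GAW restricted to the top `frac` fraction of largest weights;
    # at least one weight is always kept (an illustrative choice).
    w = np.sort(np.asarray(weights, dtype=float))[::-1]
    k = max(1, int(np.ceil(frac * len(w))))
    return geometric_average_weight(w[:k])

def config_model_density_stat(pval):
    # Transform of the configuration-model p-value from item 9:
    # max(Phi^{-1}(1 - pval), 0), with Phi the standard normal CDF.
    return max(NormalDist().inv_cdf(1.0 - pval), 0.0)
```

The geometric mean damps the effect of a few very large weights relative to the arithmetic mean, which is why the top-10% and top-20% variants are computed separately.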
E.2 Path Statistics
The statistics labelled 11-40 are obtained using the PathFinder method. To maintain consistency with the embedded paths, we denote the size of a path as the number of nodes in the path; thus, we consider paths of sizes 2-31, as follows. As described in the main text, for the main experiments we only compute paths of size up to 21 (which corresponds to 20 edges), with the remaining features left blank.

11. Path of size 2
12. Path of size 3
13. Path of size 4
14. Path of size 5
15. Path of size 6
16. Path of size 7
17. Path of size 8
18. Path of size 9
19. Path of size 10
20. Path of size 11
21. Path of size 12
22. Path of size 13
23. Path of size 14
24. Path of size 15
25. Path of size 16
26. Path of size 17
27. Path of size 18
28. Path of size 19
29. Path of size 20
30. Path of size 21
31. Path of size 22
32. Path of size 23
33. Path of size 24
34. Path of size 25
35. Path of size 26
36. Path of size 27
37. Path of size 28
38. Path of size 29
39. Path of size 30
40. Path of size 31
E.3 NetEMD Measures
The statistics labelled 41-72 are features related to the 16 motifs, each with NetEMD Scores 1 and 2. The statistics labelled 73-80 are features related to the upper adjacency matrix, the lower adjacency matrix, the combinatorial Laplacian and the random walk Laplacian, each with NetEMD Scores 1 and 2 from Eq. (8) and Eq. (9).

41. Motif 1 - NetEMD Score 1 (Eq. (8))
42. Motif 1 - NetEMD Score 2 (Eq. (9))
43. Motif 2 - NetEMD Score 1 (Eq. (8))
44. Motif 2 - NetEMD Score 2 (Eq. (9))
45. Motif 3 - NetEMD Score 1 (Eq. (8))
46. Motif 3 - NetEMD Score 2 (Eq. (9))
47. Motif 4 - NetEMD Score 1 (Eq. (8))
48. Motif 4 - NetEMD Score 2 (Eq. (9))
49. Motif 5 - NetEMD Score 1 (Eq. (8))
50. Motif 5 - NetEMD Score 2 (Eq. (9))
51. Motif 6 - NetEMD Score 1 (Eq. (8))
52. Motif 6 - NetEMD Score 2 (Eq. (9))
53. Motif 7 - NetEMD Score 1 (Eq. (8))
54. Motif 7 - NetEMD Score 2 (Eq. (9))
55. Motif 8 - NetEMD Score 1 (Eq. (8))
56. Motif 8 - NetEMD Score 2 (Eq. (9))
57. Motif 9 - NetEMD Score 1 (Eq. (8))
58. Motif 9 - NetEMD Score 2 (Eq. (9))
59. Motif 10 - NetEMD Score 1 (Eq. (8))
60. Motif 10 - NetEMD Score 2 (Eq. (9))
61. Motif 11 - NetEMD Score 1 (Eq. (8))
62. Motif 11 - NetEMD Score 2 (Eq. (9))
63. Motif 12 - NetEMD Score 1 (Eq. (8))
64. Motif 12 - NetEMD Score 2 (Eq. (9))
65. Motif 13 - NetEMD Score 1 (Eq. (8))
66. Motif 13 - NetEMD Score 2 (Eq. (9))
67. Motif 14 - NetEMD Score 1 (Eq. (8))
68. Motif 14 - NetEMD Score 2 (Eq. (9))
69. Motif 15 - NetEMD Score 1 (Eq. (8))
70. Motif 15 - NetEMD Score 2 (Eq. (9))
71. Motif 16 - NetEMD Score 1 (Eq. (8))
72. Motif 16 - NetEMD Score 2 (Eq. (9))
73. Adj. Upper Eigenvectors - NetEMD Score 1 (Eq. (8))
74. Adj. Upper Eigenvectors - NetEMD Score 2 (Eq. (9))
75. Adj. Lower Eigenvectors - NetEMD Score 1 (Eq. (8))
76. Adj. Lower Eigenvectors - NetEMD Score 2 (Eq. (9))
77. Combinatorial Laplacian Eigenvectors - NetEMD Score 1 (Eq. (8))
78. Combinatorial Laplacian Eigenvectors - NetEMD Score 2 (Eq. (9))
79. Random Walk Laplacian Eigenvectors - NetEMD Score 1 (Eq. (8))
80. Random Walk Laplacian Eigenvectors - NetEMD Score 2 (Eq. (9))
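The two Laplacians used above are standard objects; the sketch below constructs them from a symmetric weighted adjacency matrix. The construction of the upper and lower adjacency matrices is specific to the paper and is not reproduced here, so passing a symmetrised matrix is an assumption of this sketch.

```python
import numpy as np

def laplacians(A):
    """Combinatorial and random-walk Laplacians of a weighted adjacency A.

    A is assumed symmetric (e.g. a symmetrised directed network).
    Isolated nodes (zero degree) get a zero row in D^{-1} A.
    """
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                          # weighted degrees
    L = np.diag(d) - A                         # combinatorial Laplacian L = D - A
    dinv = np.where(d > 0, 1.0 / np.where(d > 0, d, 1.0), 0.0)
    L_rw = np.eye(len(A)) - dinv[:, None] * A  # random-walk Laplacian I - D^{-1} A
    return L, L_rw
```

Both matrices have zero row sums (for non-isolated nodes), so the all-ones vector is an eigenvector with eigenvalue 0; localisation of the remaining eigenvectors is what the statistics above probe.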
E.4 Localisation Statistics
The statistics labelled 81-140 are the localisation statistics on the upper adjacency matrix, the lower adjacency matrix, the combinatorial Laplacian and the random walk Laplacian, with the following localisation measures:

81. Adj. Upper Eigenvectors - Inverse Participation Ratio - Norm Node Statistic 1 (Eq. (17))
82. Adj. Upper Eigenvectors - Inverse Participation Ratio - Norm Node Statistic 2 (Eq. (18))
83. Adj. Upper Eigenvectors - Inverse Participation Ratio - Norm Node Statistic 3 (Eq. (19))
84. Adj. Upper Eigenvectors - Inverse Participation Ratio - Norm Node Statistic 4 (Eq. (20))
85. Adj. Upper Eigenvectors - Exponential - Norm Node Statistic 1 (Eq. (17))
86. Adj. Upper Eigenvectors - Exponential - Norm Node Statistic 2 (Eq. (18))
87. Adj. Upper Eigenvectors - Exponential - Norm Node Statistic 3 (Eq. (19))
88. Adj. Upper Eigenvectors - Exponential - Norm Node Statistic 4 (Eq. (20))
89. Adj. Upper Eigenvectors - 90 percent Contribution in Inverse Participation Ratio
90. Adj. Upper Eigenvectors - 90 percent Contribution in absolute value
91. Adj. Upper Eigenvectors - Sign Statistic 1 (Eq. (21))
92. Adj. Upper Eigenvectors - Sign Statistic 2 (Eq. (22))
93. Adj. Upper Eigenvectors - Sign Statistic Equal 1 (Eq. (23))
94. Adj. Upper Eigenvectors - Sign Statistic Equal 2 (Eq. (24))
95. Adj. Upper Eigenvectors - Absolute value of the eigenvector
96. Adj. Lower Eigenvectors - Inverse Participation Ratio - Norm Node Statistic 1 (Eq. (17))
97. Adj. Lower Eigenvectors - Inverse Participation Ratio - Norm Node Statistic 2 (Eq. (18))
98. Adj. Lower Eigenvectors - Inverse Participation Ratio - Norm Node Statistic 3 (Eq. (19))
99. Adj. Lower Eigenvectors - Inverse Participation Ratio - Norm Node Statistic 4 (Eq. (20))
100. Adj. Lower Eigenvectors - Exponential - Norm Node Statistic 1 (Eq. (17))
101. Adj. Lower Eigenvectors - Exponential - Norm Node Statistic 2 (Eq. (18))
102. Adj. Lower Eigenvectors - Exponential - Norm Node Statistic 3 (Eq. (19))
103. Adj. Lower Eigenvectors - Exponential - Norm Node Statistic 4 (Eq. (20))
104. Adj. Lower Eigenvectors - 90 percent Contribution in Inverse Participation Ratio
105. Adj. Lower Eigenvectors - 90 percent Contribution in absolute value
106. Adj. Lower Eigenvectors - Sign Statistic 1 (Eq. (21))
107. Adj. Lower Eigenvectors - Sign Statistic 2 (Eq. (22))
108. Adj. Lower Eigenvectors - Sign Statistic Equal 1 (Eq. (23))
109. Adj. Lower Eigenvectors - Sign Statistic Equal 2 (Eq. (24))
110. Adj. Lower Eigenvectors - Absolute value of the eigenvector
111. Combinatorial Laplacian - Inverse Participation Ratio - Norm Node Statistic 1 (Eq. (17))
112. Combinatorial Laplacian - Inverse Participation Ratio - Norm Node Statistic 2 (Eq. (18))
113. Combinatorial Laplacian - Inverse Participation Ratio - Norm Node Statistic 3 (Eq. (19))
114. Combinatorial Laplacian - Inverse Participation Ratio - Norm Node Statistic 4 (Eq. (20))
115. Combinatorial Laplacian - Exponential - Norm Node Statistic 1 (Eq. (17))
116. Combinatorial Laplacian - Exponential - Norm Node Statistic 2 (Eq. (18))
117. Combinatorial Laplacian - Exponential - Norm Node Statistic 3 (Eq. (19))
118. Combinatorial Laplacian - Exponential - Norm Node Statistic 4 (Eq. (20))
119. Combinatorial Laplacian - 90 percent Contribution in Inverse Participation Ratio
120. Combinatorial Laplacian - 90 percent Contribution in absolute value
121. Combinatorial Laplacian - Sign Statistic 1 (Eq. (21))
122. Combinatorial Laplacian - Sign Statistic 2 (Eq. (22))
123. Combinatorial Laplacian - Sign Statistic Equal 1 (Eq. (23))
124. Combinatorial Laplacian - Sign Statistic Equal 2 (Eq. (24))
125. Combinatorial Laplacian - Absolute value of the eigenvector
126. Random Walk Laplacian - Inverse Participation Ratio - Norm Node Statistic 1 (Eq. (17))
127. Random Walk Laplacian - Inverse Participation Ratio - Norm Node Statistic 2 (Eq. (18))
128. Random Walk Laplacian - Inverse Participation Ratio - Norm Node Statistic 3 (Eq. (19))
129. Random Walk Laplacian - Inverse Participation Ratio - Norm Node Statistic 4 (Eq. (20))
130. Random Walk Laplacian - Exponential - Norm Node Statistic 1 (Eq. (17))
131. Random Walk Laplacian - Exponential - Norm Node Statistic 2 (Eq. (18))
132. Random Walk Laplacian - Exponential - Norm Node Statistic 3 (Eq. (19))
133. Random Walk Laplacian - Exponential - Norm Node Statistic 4 (Eq. (20))
134. Random Walk Laplacian - 90 percent Contribution in Inverse Participation Ratio
135. Random Walk Laplacian - 90 percent Contribution in absolute value
136. Random Walk Laplacian - Sign Statistic 1 (Eq. (21))
137. Random Walk Laplacian - Sign Statistic 2 (Eq. (22))
138. Random Walk Laplacian - Sign Statistic Equal 1 (Eq. (23))
139. Random Walk Laplacian - Sign Statistic Equal 2 (Eq. (24))
140. Random Walk Laplacian - Absolute value of the eigenvector
E.5 Additional Features
Finally, we include a small number of additional features which are employed primarily to highlight possible edge cases, allowing the random forest to use them if required. These statistics were constructed to capture edge cases which could arise when we constructed our test suite. However, they were not added to the set of features considered in the main body, as these edge cases did not appear in the data analysis. For brevity, as these do not appear in the list of our 140 statistics, we do not list each combination of statistic and matrix; we simply list the set of statistics.

Localisation. We flag special cases by creating the following additional features:

1. Sign based statistic - Full score Large Number
2. Sign based statistic - Average score Large Number
3. Sign based statistic - Full score Equal Large Number
4. Sign based statistic - Average score Equal Large Number

This series of features flags cases where the minimum of the number of positive entries and the number of negative entries becomes greater than or equal to $n/2$. As this statistic captures the smaller of the number of positive entries and the number of negative entries, it may seem surprising that we can get values greater than or equal to $n/2$ which are still significant. A statistic greater than or equal to $n/2$ is possible if the remaining entries are zero, and thus if a sufficient number of the null replicas are in the same position. However, as this is outside the original motivation of the feature, we split it out into a set of separate features. We perform a similar action for the case where the numbers of positive and negative terms are equal, which is only possible if there are exactly $n/2$ positive entries and $n/2$ negative entries.
F Feature Selection
F.1 List of Selected Features
1. Geometric Average Weight Top 10% (GAW Top 10%)
2. Localisation - Adjacency Upper Eigenvectors - Sign based statistic - Average score
3. NetEMD - Adj. Lower Eigenvectors - NetEMD Score 1 (Eq. (8))
4. NetEMD - Adj. Upper Eigenvectors - NetEMD Score 1 (Eq. (8))
5. Localisation - Adjacency Upper Eigenvectors - Exponential Average Large Node no threshold
6. Localisation - Adjacency Upper Eigenvectors - Inverse Participation Ratio average Large Node No Threshold
7. Localisation - Adjacency Lower Eigenvectors - Exponential Average Large Node no threshold
8. Localisation - Adjacency Upper Eigenvectors - 90 percent Contribution in absolute value
9. Localisation - Adjacency Lower Eigenvectors - 90 percent Contribution in absolute value
10. Localisation - Random Walk Laplacian - 90 percent Contribution in absolute value
11. Basic Statistics - Community GAW Full, i.e. the Geometric Average Weight of the community divided by the overall Geometric Average Weight.
12. NetEMD - Adj. Lower Eigenvectors - NetEMD Score 2 (Eq. (9))
13. Basic Statistics - Standardised Degree
14. Localisation - Adjacency Upper Eigenvectors - Inverse Participation Ratio average Large Node No Threshold
15. Localisation - Adjacency Upper Eigenvectors - 90 percent Contribution in Inverse Participation Ratio
16. Random Walk Laplacian - NetEMD Score 1 (Eq. (8))
17. Basic Statistics - Community GAW Average, i.e. the Geometric Average Weight of the community divided by the overall Geometric Average Weight and also divided by the size of the community.
18. Community Density Full, i.e. the density of the community divided by the overall density.
19. Localisation - Random Walk Laplacian - Exponential Average Large Node No threshold
20. Path Methods - Path of Size 5
21. Path Methods - Path of Size 4
22. Localisation - Adjacency Upper Eigenvectors - Sign based statistic - Full score
23. Localisation - Adjacency Upper Eigenvectors - Inverse Average Large Node p-value
24. Basic Statistics - Community Density Average, i.e. the density of the community divided by the overall density and also divided by the size of the community.
25. NetEMD - Motif 5 - NetEMD Score 2 (Eq. (9))
26. Localisation - Random Walk Laplacian - Sign based statistic - Average score
27. Localisation - Adjacency Upper Eigenvectors - Inverse Participation Ratio average Large Node Threshold
28. NetEMD - Adj. Upper Eigenvectors - NetEMD Score 2 (Eq. (9))
29. NetEMD - Motif 5 - NetEMD Score 1 (Eq. (8))
30. Localisation - Random Walk Laplacian - Sign based statistic - Full score
31. Localisation - Adjacency Upper Eigenvectors - Exponential Average Large Node p-value
32. Localisation - Adjacency Upper Eigenvectors - Exponential Average Large Node threshold
33. Localisation - Random Walk Laplacian - Inverse Participation Ratio Average Large Node No Threshold
34. Basic Statistics - Geometric Average Weight Top 10% (GAW Top 10%)
35. Localisation - Combinatorial Laplacian - 90 percent Contribution in absolute value
36. Localisation - Adjacency Lower Eigenvectors - Sign based statistic - Average score
37. Localisation - Adjacency Lower Eigenvectors - Sign based statistic - Full score
38. Basic Statistics - Community Density Configuration Model
39. Random Walk Laplacian Eigenvectors - NetEMD Score 2 (Eq. (9))
40. NetEMD - Motif 5 - NetEMD Score 2 (Eq. (9))
41. Localisation - Adjacency Lower Eigenvectors - 90 percent Contribution in Inverse Participation Ratio
42. Localisation - Combinatorial Laplacian - Exponential Average Large Node No threshold
43. Basic Statistics - Geometric Average Weight (GAW)
44. NetEMD - Motif 6 - NetEMD Score 2 (Eq. (9))
F.2 Alternative Feature Selection
To select the feature set for our random forest, we used the average rank over the 27 parameter regimes. Another approach would have been to use the average feature importance score. The resulting feature plot, which is a direct comparison to the ranking version in Fig. 7A, can be seen in Fig. 17; this approach would yield substantially fewer features. We decided to use the feature-rank approach because the feature importance scores can be confounded by correlated features: a set of features with high correlation may be seen as less important than a set of features that are not correlated; see for example Ref. [5] or http://rnowling.github.io/machine/learning/2015/08/11/random-forest-correlation-bias.html for details.
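The rank-based selection described above can be sketched as follows, assuming a regimes-by-features array of importance scores; the function name and the top-k cut-off are illustrative, not the paper's implementation.

```python
import numpy as np

def select_by_average_rank(importances, k):
    """Select features by average importance rank across parameter regimes.

    `importances` is an (n_regimes, n_features) array of random-forest
    feature importance scores. Within each regime, features are ranked
    (rank 1 = most important); ranks are averaged over regimes, and the
    indices of the k features with the smallest average rank are returned.
    """
    imp = np.asarray(importances, dtype=float)
    n_regimes, n_features = imp.shape
    # argsort of -importance gives, per regime, the feature order;
    # invert it to obtain each feature's rank within the regime.
    order = np.argsort(-imp, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(n_regimes)[:, None]
    ranks[rows, order] = np.arange(1, n_features + 1)
    avg_rank = ranks.mean(axis=0)
    return np.argsort(avg_rank)[:k]
```

Averaging ranks rather than raw scores makes the selection insensitive to how concentrated the importance mass is in any single regime, which is one reason correlated features are treated more evenly.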
Figure 17: Feature selection using the average score over the 27 parameter regimes. The red dashed line represents a possible feature cut-off point. The green line represents the cut-off point which we selected when using the average rank, as can be seen in Fig. 7A.
F.3 Full Feature Selection Plots
In the main body, we showed the feature importance plots detailing the top three features for each parameter regime. In a similar manner to Sec. F.2, we also show the relative feature importance score for each of these features; these can be seen in Fig. 18. The weight on the first few features varies considerably across parameter regimes. Notably, when p = 0.001 and 1 − w = 0 or 1 − w = 0.001, over 50% of the feature importance is given to a single feature. At the other extreme, for example in the regime p = 0.003 and 1 − w = 0.003, the feature importance is much less concentrated, with the first three features having feature importances in broadly similar ranges.
G Within-Sample Scores for the Random Forest Method
In the main paper, we reported only the out-of-sample performance of the random forest model. For completeness, in this section we report its in-sample performance, which can be seen in Fig. 19. We observe almost perfect performance across all of the regimes. This behaviour is not unexpected for a random forest, but it confirms that within the sample there is enough variation in the features to correctly identify all of the anomalies.
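A minimal sketch of computing an in-sample average precision for a random forest, on synthetic data rather than the networks used in the paper; the data generation and model settings here are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score

# Synthetic node-feature data standing in for the network features;
# the labels mark a small "anomalous" subset.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + X[:, 1] > 1.5).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
scores = clf.predict_proba(X)[:, 1]
in_sample_ap = average_precision_score(y, scores)
# A random forest typically fits its training data almost perfectly,
# so the in-sample average precision is close to 1.
```

This is why in-sample scores are reported only as a sanity check: near-perfect training performance says little about generalisation, which is why the main paper relies on out-of-sample evaluation.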
H Oddball
In this section, we present more details on Oddball, with Sec. H.1 addressing the adaptation of Oddball for directed networks, Sec. H.2 addressing the performance of each of the Oddball statistics on the Weighted ER Model, and Sec. H.3 addressing the performance of each of the Oddball statistics on the Accenture model.
[Figure: feature importance by parameter regime for Features 1-3, with the feature index shown in the left panels and the importance score in the right panels. Legend: 1. Path Size 5; 2. Path Size 6; 3. Path Size 7; 4. GAW10; 5. Community - GAW Average; 6. Localise - Upper Eigs. Adj. - Sign Stat 2; 7. Localise - Upper Eigs. Adj. - Exp. norm - Stat 3; 8. Localise - Upper Eigs. Adj. - 90% Contrib. - Abs. Value; 9. Localise - Upper Eigs. Adj. - IPR norm - Stat 3; 10. Localise - RW Lap. - 90% Contrib. - Abs. Value; 11. NetEMD - Upper Eigs. Adj. - Stat 1; 12. NetEMD - RW Lap. - Stat 1; 13. NetEMD - Lower Eigs. Adj. - Stat 1.]

Figure 18: Feature importance plots, as seen in the main body Fig. 8, combined with the corresponding score displayed on the right. To aid understanding, we also include the index of the feature in the left set of plots, and the raw value of the statistic in the right set of plots.
Figure 19: The average precision for the random forest, for each of the parameter regimes using the full model after feature selection and training on the training data (in-sample performance). The red starred line indicates the largest percentage of anomalies expected at random using this measure [51].
H.1 Oddball for Directed Networks
The original formulation of Oddball in Ref. [3] was primarily concerned with weighted undirected networks, although it states that the method is easily generalisable to directed networks. The software which accompanies the paper provides this generalisation. While in the undirected case many of the statistics are based on the ego-networks of individual nodes, in the directed generalisation the ego-networks (egonets) are computed on a symmetrised matrix and thus consider both incoming and outgoing edges. We use this extension in the analysis in the main paper and the accompanying appendices. For completeness, we include a list of the comparisons used in the directed version of Oddball.

1. Number of nodes vs number of edges
2. Number of edges vs the total weight in the egonet
3. Egonet out-degree vs egonet out-weight (edges from the ego network to the remaining network)
4. Egonet in-degree vs egonet in-weight (edges to the ego network from the remaining network)
5. Ego out-degree vs ego out-weight
6. Ego in-degree vs ego in-weight
7. Egonet total weight vs egonet maximum weight
8. Egonet total in-weight vs egonet maximum in-weight
9. Egonet total out-weight vs egonet maximum out-weight

Here "ego" refers to properties of the ego node itself, while "egonet" refers to properties of the 1-hop snowball sample. The software also provides one additional comparison, namely number of nodes vs number of degree-1 nodes, which we do not use as our networks have very few such nodes.
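A sketch of the symmetrised egonet construction described above, for a directed weighted adjacency matrix. The particular statistics returned here are a small illustrative subset of the comparisons listed above, not the accompanying software's exact output.

```python
import numpy as np

def egonet_features(A, i):
    """Egonet statistics for node i of a directed weighted adjacency A.

    As in the directed Oddball variant described above, the egonet is
    built on the symmetrised matrix (an edge in either direction makes
    a neighbour), and the statistics are then read off the original
    directed matrix A.
    """
    A = np.asarray(A, dtype=float)
    S = (A + A.T) > 0                   # symmetrised (binary) adjacency
    nbrs = np.flatnonzero(S[i])
    ego = np.concatenate(([i], nbrs))   # 1-hop snowball: ego plus neighbours
    sub = A[np.ix_(ego, ego)]           # induced directed subgraph
    return {
        "n_nodes": len(ego),
        "n_edges": int(np.count_nonzero(sub)),
        "total_weight": float(sub.sum()),
        "max_weight": float(sub.max()) if sub.size else 0.0,
    }
```

Symmetrising only for neighbourhood construction means a node that only receives money still has its senders in its egonet, while the in/out-weight statistics remain directional.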
H.2 Oddball Individual Statistics - Weighted ER Network
For the Weighted ER network, we reported only the performance of a statistic based on the summation of the underlying Oddball statistics. In this section, we present the results for each of the statistics individually, in Figs. 20, 21, 22, 23, 24, 25, 26, 27, and 28.
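For reference, the per-pair Oddball score follows the out-of-line measure of Akoglu et al. (deviation from a power-law fit in log-log space); summing the per-pair scores as below is a sketch of the summation statistic referred to above, and any normalisation applied in practice is omitted.

```python
import numpy as np

def oddball_outlierness(x, y):
    """Out-of-line score for one Oddball statistic pair.

    Fit a power law y ~ C * x^theta in log-log space and score each
    node by its deviation from the fit, as in the original Oddball
    paper. x and y must be strictly positive.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    theta, log_c = np.polyfit(np.log(x), np.log(y), 1)
    fit = np.exp(log_c) * x ** theta
    hi = np.maximum(y, fit)
    lo = np.minimum(y, fit)
    return (hi / lo) * np.log(np.abs(y - fit) + 1.0)

def summed_score(pairs):
    # Plain sum of the per-pair scores (an illustrative combination).
    return sum(oddball_outlierness(x, y) for x, y in pairs)
```

Nodes lying exactly on the fitted power law receive a score of (approximately) zero, so the summed statistic highlights nodes that deviate in several statistic pairs at once.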
None of the statistics outperform either the feature summation or the random forest approach. A visual inspection indicates that the top four best-performing Oddball statistics are ego out-degree vs ego out-weight, ego in-degree vs ego in-weight, number of nodes vs number of edges, and number of edges vs weight.
Figure 20: Performance measurements for each of the parameter regimes using the Oddball Number of Nodes vs Number of Edges feature. The first pair of rows shows the mean precision and mean recall respectively, with error bars of one standard error over the relevant networks. The final panel shows a boxplot of the average precision over the test networks. The red starred line indicates the largest percentage of anomalies expected at random using this measure [51].
H.3 Oddball Individual Statistics - Accenture Model
For completeness, this section shows the recall, precision and average precision plots for each of the underlying Oddball statistics over the 100 Accenture networks.
Figure 21: Performance measurements for each of the parameter regimes using the Oddball Number of Edges vs Weight feature. The first pair of rows shows the mean precision and mean recall respectively, with error bars of one standard error over the relevant networks. The final panel shows a boxplot of the average precision over the test networks. The red starred line indicates the largest percentage of anomalies expected at random using this measure [51].

Note that, for some of the networks, some of the underlying Oddball statistics are badly defined, in the sense that they contain Infs. The results can be found in Figs. 29, 30, 31, 32, 33, 34, 35, 36, and 37.
Figure 22: Performance measurements for each of the parameter regimes using the Oddball Weight vs Egonet Maximum Weight feature. The first pair of rows shows the mean precision and mean recall respectively, with error bars of one standard error over the relevant networks. The final panel shows a boxplot of the average precision over the test networks. The red starred line indicates the largest percentage of anomalies expected at random using this measure [51].
Figure 23: Performance measurements for each of the parameter regimes using the Oddball Egonet In-Degree vs Egonet In-Weight feature. The first pair of rows shows the mean precision and mean recall respectively, with error bars of one standard error over the relevant networks. The final panel shows a boxplot of the average precision over the test networks. The red starred line indicates the largest percentage of anomalies expected at random using this measure [51].
Figure 24: Performance measurements for each of the parameter regimes using the Oddball Egonet Out-Degree vs Egonet Out-Weight feature. The first pair of rows shows the mean precision and mean recall respectively, with error bars of one standard error over the relevant networks. The final panel shows a boxplot of the average precision over the test networks. The red starred line indicates the largest percentage of anomalies expected at random using this measure [51].
Figure 25: Performance measurements for each of the parameter regimes using the Oddball Ego In-Degree vs Ego In-Weight feature. The first pair of rows shows the mean precision and mean recall, respectively, with error bars of one standard error over the relevant networks. The final panel shows a boxplot of the average precision over the test networks. The red starred line indicates the largest percentage of anomalies expected at random using this measure [51].
[Figure: precision and recall vs. number of nodes flagged (log scale, 10^0 to 10^3), one panel per p = 0.001, ..., 0.005 with curves for several values of 1 − w (0.0 to 0.01); final panel: boxplot of the average precision score across parameter regimes for the Ego Out-Degree vs Ego Out-Weight feature.]
Figure 26: Performance measurements for each of the parameter regimes using the Oddball Ego Out-Degree vs Ego Out-Weight feature. The first pair of rows shows the mean precision and mean recall, respectively, with error bars of one standard error over the relevant networks. The final panel shows a boxplot of the average precision over the test networks. The red starred line indicates the largest percentage of anomalies expected at random using this measure [51].
[Figure: precision and recall vs. number of nodes flagged (log scale, 10^0 to 10^3), one panel per p = 0.001, ..., 0.005 with curves for several values of 1 − w (0.0 to 0.01); final panel: boxplot of the average precision score across parameter regimes for the Ego In-Weight vs Ego Maximal In-Weight feature.]
Figure 27: Performance measurements for each of the parameter regimes using the Oddball Ego In-Weight vs Ego Maximal In-Weight feature. The first pair of rows shows the mean precision and mean recall, respectively, with error bars of one standard error over the relevant networks. The final panel shows a boxplot of the average precision over the test networks. The red starred line indicates the largest percentage of anomalies expected at random using this measure [51].
[Figure: precision and recall vs. number of nodes flagged (log scale, 10^0 to 10^3), one panel per p = 0.001, ..., 0.005 with curves for several values of 1 − w (0.0 to 0.01); final panel: boxplot of the average precision score across parameter regimes for the Ego Out-Weight vs Ego Maximal Out-Weight feature.]
Figure 28: Performance measurements for each of the parameter regimes using the Oddball Ego Out-Weight vs Ego Maximal Out-Weight feature. The first pair of rows shows the mean precision and mean recall, respectively, with error bars of one standard error over the relevant networks. The final panel shows a boxplot of the average precision over the test networks. The red starred line indicates the largest percentage of anomalies expected at random using this measure [51].
[Figure: Num. Nodes vs Num. Edges. Panels: recall vs. no. of nodes and precision vs. no. of nodes (log scale, 10^0 to 10^3), and a histogram of the average precision (No. of Obs: 100).]
Figure 29: The performance for each of the parameter regimes using the Number of Nodes vs Number of Edges feature. Performance is measured in three different ways. First (far left): the average recall (error bars denoting 1 standard error) when we consider the top nodes in our ranking. Second (middle): the average precision (error bars denoting 1 standard error). Finally (far right): the distribution of the average precision over our 100 test networks. When the performance on a measure resulted in Infs, we removed the network from all of the comparisons.
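The recall and precision at the top of a ranking, as plotted in these panels, can be sketched as follows. This is a minimal illustration under stated assumptions; the names precision_recall_at_k, scores, and labels are ours, not from the paper's code.

```python
def precision_recall_at_k(scores, labels, k):
    """Precision and recall when flagging the k highest-scoring nodes.

    scores: anomaly score per node (higher = more anomalous)
    labels: 1 if the node is a planted anomaly, else 0
    """
    # Rank nodes by descending anomaly score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = ranked[:k]
    hits = sum(labels[i] for i in top_k)  # planted anomalies among the top k
    precision = hits / k                  # fraction of flagged nodes that are anomalous
    recall = hits / sum(labels)           # fraction of anomalies that were flagged
    return precision, recall

# Example: 2 of the 3 planted anomalies fall in the top 3 ranked nodes.
scores = [0.9, 0.1, 0.8, 0.7, 0.2]
labels = [1, 0, 0, 1, 1]
p, r = precision_recall_at_k(scores, labels, 3)  # p = 2/3, r = 2/3
```

Sweeping k over the ranking traces out the curves shown against "No. of Nodes".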
[Figure: Num. Edges vs Weight. Panels: recall vs. no. of nodes and precision vs. no. of nodes (log scale, 10^0 to 10^3), and a histogram of the average precision (No. of Obs: 100).]
Figure 30: The performance from scikit-learn [51], for each of the parameter regimes using the Number of Edges vs Weight feature. First (far left): the average recall (error bars denoting 1 standard error) when we consider the top nodes in our ranking. Second (middle): the average precision (error bars denoting 1 standard error). Finally (far right): the distribution of the average precision over our 100 test networks. When the performance on a measure resulted in Infs, we removed the network from all of the comparisons. For ease we display the exact number of observations in the top right corner.
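The average precision summarised in the right-hand histograms is computed via scikit-learn [51]. For a ranking without tied scores, the quantity reduces to the mean of the precision at each rank where a planted anomaly is retrieved; the pure-Python sketch below makes this explicit (the function name and variables are illustrative, not the paper's code).

```python
def average_precision(scores, labels):
    """Average precision for a ranked anomaly list.

    Equals the mean of precision@rank taken at each rank where a true
    anomaly appears; for untied scores this matches the definition used
    by sklearn.metrics.average_precision_score.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    hits, ap = 0, 0.0
    for rank, i in enumerate(ranked, start=1):
        if labels[i]:
            hits += 1
            ap += hits / rank  # precision at this rank
    return ap / n_pos

# A perfect ranking (all anomalies first) scores 1.0.
average_precision([3, 2, 1], [1, 1, 0])  # -> 1.0
```

A score of 1.0 therefore means every planted anomaly outranks every normal node, which is the behaviour the boxplots and histograms are probing across parameter regimes.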
[Figure: Egonet Out Degree vs Egonet Out Weight. Panels: recall vs. no. of nodes and precision vs. no. of nodes (log scale, 10^0 to 10^3), and a histogram of the average precision (No. of Obs: 100).]
Figure 31: The performance from scikit-learn [51], for each of the parameter regimes using the Egonet Out-Degree vs Egonet Out-Weight feature. First (far left): the average recall (error bars denoting 1 standard error) when we consider the top nodes in our ranking. Second (middle): the average precision (error bars denoting 1 standard error). Finally (far right): the distribution of the average precision over our 100 test networks. When the performance on a measure resulted in Infs, we removed the network from all of the comparisons. For ease we display the exact number of observations in the top right corner.
[Figure: Egonet In Degree vs Egonet In Weight. Panels: recall vs. no. of nodes and precision vs. no. of nodes (log scale, 10^0 to 10^3), and a histogram of the average precision (No. of Obs: 99).]
Figure 32: The performance from scikit-learn [51], for each of the parameter regimes using the Egonet In-Degree vs Egonet In-Weight feature. First (far left): the average recall (error bars denoting 1 standard error) when we consider the top nodes in our ranking. Second (middle): the average precision (error bars denoting 1 standard error). Finally (far right): the distribution of the average precision over our 100 test networks. When the performance on a measure resulted in Infs, we removed the network from all of the comparisons. For ease we display the exact number of observations in the top right corner.
[Figure: Ego Out Degree vs Ego Out Weight. Panels: recall vs. no. of nodes and precision vs. no. of nodes (log scale, 10^0 to 10^3), and a histogram of the average precision (No. of Obs: 100).]
Figure 33: The performance from scikit-learn [51], for each of the parameter regimes using the Ego Out-Degree vs Ego Out-Weight feature. First (far left): the average recall (error bars denoting 1 standard error) when we consider the top nodes in our ranking. Second (middle): the average precision (error bars denoting 1 standard error). Finally (far right): the distribution of the average precision over our 100 test networks. When the performance on a measure resulted in Infs, we removed the network from all of the comparisons.
[Figure: Ego In Degree vs Ego In Weight. Panels: recall vs. no. of nodes and precision vs. no. of nodes (log scale, 10^0 to 10^3), and a histogram of the average precision (No. of Obs: 98).]
Figure 34: The performance from scikit-learn [51], for each of the parameter regimes using the Ego In-Degree vs Ego In-Weight feature. First (far left): the average recall (error bars denoting 1 standard error) when we consider the top nodes in our ranking. Second (middle): the average precision (error bars denoting 1 standard error). Finally (far right): the distribution of the average precision over our 100 test networks. When the performance on a measure resulted in Infs, we removed the network from all of the comparisons. For ease we display the exact number of observations in the top right corner.
[Figure: Weight vs Egonet max Weight. Panels: recall vs. no. of nodes and precision vs. no. of nodes (log scale, 10^0 to 10^3), and a histogram of the average precision (No. of Obs: 100).]
Figure 35: The performance from scikit-learn [51], for each of the parameter regimes using the Weight vs Egonet Maximum Weight feature. First (far left): the average recall (error bars denoting 1 standard error) when we consider the top nodes in our ranking. Second (middle): the average precision (error bars denoting 1 standard error). Finally (far right): the distribution of the average precision over our 100 test networks. When the performance on a measure resulted in Infs, we removed the network from all of the comparisons. For ease we display the exact number of observations in the top right corner.
[Figure: Ego In Weight vs Ego max In-Weight. Panels: recall vs. no. of nodes and precision vs. no. of nodes (log scale, 10^0 to 10^3), and a histogram of the average precision (No. of Obs: 100).]
Figure 36: The performance from scikit-learn [51], for each of the parameter regimes using the Ego In-Weight vs Ego Maximum In-Weight feature. First (far left): the average recall (error bars denoting 1 standard error) when we consider the top nodes in our ranking. Second (middle): the average precision (error bars denoting 1 standard error). Finally (far right): the distribution of the average precision over our 100 test networks. When the performance on a measure resulted in Infs, we removed the network from all of the comparisons. For ease we display the exact number of observations in the top right corner.
[Figure: Ego Out Weight vs Ego max Out-Weight. Panels: recall vs. no. of nodes and precision vs. no. of nodes (log scale, 10^0 to 10^3), and a histogram of the average precision (No. of Obs: 100).]
Figure 37: The performance from scikit-learn [51], for each of the parameter regimes using the Ego Out-Weight vs Ego Maximal Out-Weight feature. First (far left): the average recall (error bars denoting 1 standard error) when we consider the top nodes in our ranking. Second (middle): the average precision (error bars denoting 1 standard error). Finally (far right): the distribution of the average precision over our 100 test networks. When the performance on a measure resulted in Infs, we removed the network from all of the comparisons. For ease we display the exact number of observations in the top right corner.