DISCOVERING PATTERNS AND ANOMALIES IN GRAPHS WITH DISCRETE AND NUMERIC ATTRIBUTES

Michael Davis
BSc. (Hons) Computer Science
MSc. Computer and Electronic Security

Thesis submitted for the degree of Doctor of Philosophy in the School of Electronics, Electrical Engineering and Computer Science, Queen’s University, Belfast
December 2013
Dedicated to my parents.
ABSTRACT
Graph representations are useful in many domains where data has a natural structure. Real-world graphs usually have discrete labels on the vertices or edges. Many graph datasets are also annotated with numeric labels or weights. Graphs are commonly used to represent complex structures such as social networks, infrastructure networks, information or communication networks and chemical, biological or ecological processes. In this thesis, we investigate pattern mining and anomaly detection in datasets with both structural and numeric attributes. Most graph mining approaches ignore numeric attributes; where they are taken into consideration, it is often assumed that graph structure and numeric attributes are independent. We show that numeric attributes are closely related to graph structure, and exploit this observation for substructure discovery and anomaly detection.

Our first contribution is Agwan (Attribute Graphs: Weighted and Numeric), a generative model for random graphs with discrete labels and weighted edges. Real-world graphs tend to exhibit a well-known set of properties, such as heavy-tailed degree distributions, clustering and community formation. Much effort has been directed into creating realistic and tractable models for unlabelled graphs, which has yielded insights into graph structure and evolution. Agwan is a model for labelled, weighted graphs. We present an algorithm for fitting the parameters of the model to real-world graphs and an algorithm to generate random graphs from the model. Using real-world directed and undirected graphs as input, we compare our approach to state-of-the-art random labelled graph generators and draw conclusions about the contribution of discrete vertex labels and numeric edge weights to graph structure.

Our second contribution is a constraint on substructure discovery based on the “outlierness” of graph numeric attributes. Most frequent substructure discovery algorithms ignore numeric attributes; we show how they can be used to improve search performance and discrimination. Reasoning that the most descriptive substructures are those which have normative numeric attributes, we propose an outlier-detection step, which is used as a constraint during substructure discovery. In our experiments, we implement our method as a pre-processing step to prune anomalous vertices and edges prior to graph mining, allowing us to evaluate it on both graph databases (using gSpan) and on Single Large Graphs (using Subdue). We measure the effect of our constraint-based approach on runtime, memory requirements and coverage of discovered patterns, relative to the unconstrained approaches. Our method is applicable to multi-dimensional numeric attributes; we also outline how it can be extended for high-dimensional numeric features.

Finally, we present Yagada (Yet Another Graph-based Anomaly Detection Algorithm), an algorithm to search for anomalies using both structural data and numeric attributes. Our motivating application is to detect suspicious or unusual activity in large, secure buildings such as airports. Yagada is explained using several security-related examples and validated with experiments on a physical Access Control database. Quantitative analysis shows that in the upper range of anomaly thresholds, Yagada detects twice as many anomalies as the best-performing numeric discretisation algorithm. Qualitative evaluation shows that the detected anomalies are meaningful, representing a combination of structural irregularities and numerical outliers.
PUBLICATIONS
Some ideas and figures have appeared previously in the publications listed below. In particular, material from [Davis 6] appears in Chapters 4 and 7; material from [Davis 1, 3] is used in Chapter 5; and material from [Davis 2, 4, 5] is used in Chapter 6. Data from the Physical Activity Loyalty Card (PAL) Scheme study [Hunter 7, 8, 9, 10, 11] is used for the experiments in Chapter 5. The PAL Scheme was a health intervention studying the effect of incentives on physical activity levels; I was involved in the design of the system for tracking participants. Details of the PAL Scheme dataset are given in Chapter 3.
[1] Michael Davis, Weiru Liu, and Paul Miller. AGWAN: A generative model for labelled, weighted graphs. In New Frontiers in Mining Complex Patterns—Second International Workshop, NFMCP 2013, Held in Conjunction with ECML/PKDD 2013, Prague, Czech Republic, September 27, 2013, Revised Selected Papers, Lecture Notes in Computer Science. Springer, 2014 (accepted for publication).

[2] Michael Davis, Weiru Liu, and Paul Miller. Finding the most descriptive substructures in graphs with discrete and numeric labels. Journal of Intelligent Information Systems. Springer, 2014 (accepted for publication).

[3] Michael Davis, Weiru Liu, and Paul Miller. AGWAN: A generative model for labelled, weighted graphs. In Workshop on New Frontiers in Mining Complex Patterns—Second International Workshop, NFMCP 2013, Held in Conjunction with ECML/PKDD 2013, Prague, Czech Republic, 27 Sept. 2013.

[4] Michael Davis, Weiru Liu, and Paul Miller. Finding the most descriptive substructures in graphs with discrete and numeric labels. In Annalisa Appice, Michelangelo Ceci, Corrado Loglisci, Giuseppe Manco, Elio Masciari, and Zbigniew W. Ras, editors, NFMCP, volume 7765 of Lecture Notes in Computer Science, pages 138–154. Springer, 2012.

[5] Michael Davis, Weiru Liu, and Paul Miller. Finding the most descriptive substructures in graphs with numeric labels. In Workshop on New Frontiers in Mining Complex Patterns—First International Workshop, NFMCP 2012, Held in Conjunction with ECML/PKDD 2012, Bristol, UK, 24 Sept. 2012.

[6] Michael Davis, Weiru Liu, Paul Miller, and George Redpath. Detecting anomalies in graphs with numeric labels. In Craig Macdonald, Iadh Ounis, and Ian Ruthven, editors, CIKM, pages 1197–1202. ACM, 2011.

[7] Ruth F. Hunter, Mark A. Tully, Michael Davis, Michael Stevenson, and Frank Kee. Physical activity loyalty cards for behavior change: A quasi-experimental study. American Journal of Preventive Medicine, 45:56–63, Jul. 2013.

[8] Ruth Hunter, Mark Tully, Michael Davis, Michael Stevenson, and Frank Kee. The Physical Activity Loyalty Card scheme: A RCT investigating the use of incentives to encourage physical activity. Journal of Science and Medicine in Sport, 15:S347–S348, Dec. 2012.

[9] Ruth F. Hunter, Mark A. Tully, Michael Davis, Michael Stevenson, and Frank Kee. Exploring the use of physical activity loyalty cards for behaviour change in public health: randomised controlled trial. The Lancet, 380, Supplement 3:S4, 2012.

[10] Ruth F. Hunter, Michael Davis, Mark A. Tully, and Frank Kee. Physical activity buddies: a network analysis of social aspects of physical activity in adults. The Lancet, 380, Supplement 3:S51, 2012.

[11] Ruth F. Hunter, Michael Davis, Mark A. Tully, and Frank Kee. The Physical Activity Loyalty Card scheme: Development and application of a novel system for incentivizing behaviour change. In Patty Kostkova, Martin Szomszor, and David Fowler, editors, eHealth, volume 91 of Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, pages 170–177. Springer, 2011.
ACKNOWLEDGMENTS
This Ph.D project would not have been possible without the guidance and input of my supervisors. I would like to thank Prof. Weiru Liu for encouraging me to start writing from the very first week of my Ph.D and always to have an eye to publication. I appreciated Prof. Liu always taking the time to thoroughly read my work and giving me detailed, constructive and insightful criticism. I would like to thank my second supervisor, Dr. Paul Miller, for encouraging me to be rigorous and principled in research, to consider where and how my research can be applied, and for creating opportunities to showcase it. I would also like to thank my examiners, Prof. Frans Coenen and Dr. Jun Hong, for a stimulating discussion and for their constructive comments, which have improved this thesis.

I want to extend my thanks to Prof. Frank Kee and Dr. Ruth Hunter at Queen’s University Belfast’s Centre for Public Health, for a very fruitful collaboration on the PAL Scheme study and for permission to use the data in my experiments. I would also like to acknowledge Mr. Dara Curley and Mr. Derrick Black at Queen’s Estates Department for their assistance in obtaining the Access Control System data and Dr. George Redpath at CEM Systems for providing two other Access Control System datasets. I am grateful to Erich Schubert at Ludwig-Maximilians-Universität München for assistance with verifying my LOF implementation, providing me with a pre-release version of ELKI for the RP + PINN + LOF experiments and for many helpful comments on two of my papers.

This Ph.D was enjoyable largely due to the friendship of my colleagues at ECIT: Colin Burgess, Kevin McAreavey, Stefan Popa and Sriram Varadarajan. Special thanks to Niall McLaughlin for sharing his encyclopaedic knowledge and telling me about Dirichlet Process GMMs; Fabian Campbell-West for always being positive and encouraging; and Andrew Bolster for suggesting I should call my algorithm “Agwan”.
I could not have made it to the end without the love and support of my wife, Sharon, who was working full-time and doing her management accounting exams, but still found time to bring her warmth and care into the family. Finally, I am grateful to my children—Siona, Michael and Tara— who have been very patient with me while I spent many evenings and weekends typing away behind a closed door. I am incredibly proud of you.
CONTENTS

Abstract
Publications
Acknowledgements

i introduction
1 introduction
  1.1 Graphs and Networks
    1.1.1 Social Networks
    1.1.2 Infrastructure Networks
    1.1.3 Information Networks
    1.1.4 Sensor Networks
    1.1.5 Chemical and Biological Networks
  1.2 Motivation and Research Questions
  1.3 Methodology and Evaluation Criteria
  1.4 Contributions
  1.5 Organisation of the Thesis

ii patterns and anomalies
2 graph properties and models
  2.1 Introduction
  2.2 Concepts and Terminology
    2.2.1 Weighted Graphs
    2.2.2 Labelled Graphs
  2.3 Graph Properties
    2.3.1 Power Laws and Degree Distributions
    2.3.2 Graph Diameter
    2.3.3 Spectral Properties
    2.3.4 Triad Participation
    2.3.5 Clustering Coefficient
  2.4 Graph Models
    2.4.1 Erdős-Rényi Random Graphs
    2.4.2 Preferential Attachment-based Models
    2.4.3 Copying Models
    2.4.4 The Small World Model
    2.4.5 The RMat Model
    2.4.6 The Kronecker Graph Model
    2.4.7 Statistical Models
    2.4.8 The MAG Model
  2.5 Conclusion
3 graph mining
  3.1 Introduction
  3.2 Substructure Mining
    3.2.1 Frequent Subgraph Mining
    3.2.2 Compression-based Substructure Mining
    3.2.3 Substructures with Overlapping Instances
  3.3 Graph-based Anomaly Detection
  3.4 Graph Datasets
    3.4.1 PAL Scheme Social Network
    3.4.2 Enron Social Network
    3.4.3 Enron Information Network
    3.4.4 QUB Sensor Network
    3.4.5 CEM Sensor Network
  3.5 Conclusion
4 numeric attributes and anomalies
  4.1 Introduction
  4.2 Weights and Numeric Attributes
    4.2.1 Mining Weighted Graphs
    4.2.2 Mining Graphs with Numeric Attributes
    4.2.3 Anomaly Detection in Weighted Graphs
  4.3 Numeric Outlier Detection
    4.3.1 Statistical Outliers
    4.3.2 Cluster-based Outliers
    4.3.3 Nearest Neighbour-based Outliers
    4.3.4 Density-based Outliers
  4.4 Outlier Detection in High-Dimensional Space
  4.5 Conclusion

iii contributions
5 agwan: a generative model for labelled graphs with numeric attributes
  5.1 Introduction
  5.2 Graph Attributes and Structure
    5.2.1 Discrete Attributes and Structure
    5.2.2 Numeric Attributes and Structure
  5.3 Agwan: A Generative Model for Labelled, Weighted Graphs
    5.3.1 Graph Generation
    5.3.2 Parameter Fitting
    5.3.3 Extending Agwan to Multiple Attributes
  5.4 Experiments
  5.5 Results
    5.5.1 Real Attributes
    5.5.2 Synthetic Attributes
  5.6 Conclusions
6 substructure discovery in labelled graphs with numeric attributes
  6.1 Introduction
  6.2 Graph Patterns and Structure
    6.2.1 Discrete Attributes and Substructure Mining
    6.2.2 Numeric Attributes and Substructure Mining
  6.3 Constraint-based Mining Using Numeric Attributes
    6.3.1 Using Outlier Values to Prune the Graph in Preprocessing
    6.3.2 Incorporating Outlier Values into Substructure Discovery
  6.4 High-dimensional Numeric Attributes
  6.5 Experiments and Results
    6.5.1 Graph Transaction Database
    6.5.2 Single Large Graphs
  6.6 Conclusions
7 yagada: detecting anomalous substructures in labelled graphs with numeric attributes
  7.1 Introduction
  7.2 Detecting Anomalies in Graphs with Numeric Labels
    7.2.1 Detecting Anomalies in Single Large Graphs
    7.2.2 Detecting Anomalies in Graph Databases
  7.3 Yagada: An Algorithm for Detecting Structural and Numeric Anomalies
  7.4 Experiments and Results
  7.5 Conclusions

iv conclusions and future work
8 conclusions
  8.1 Graph Structure and Numeric Attributes
  8.2 Substructure Discovery
  8.3 Anomaly Detection
  8.4 Multi-dimensional Numeric Attributes
  8.5 Future Work
    8.5.1 Graph Structure and Numeric Attributes
    8.5.2 Substructure Discovery
    8.5.3 Anomaly Detection
    8.5.4 High-dimensional Numeric Attributes

v appendices
a algorithms
  a.1 gSpan
  a.2 Subdue
b implementation notes
  b.1 Experimental System
  b.2 Graph Representation and Visualisation
  b.3 Dirichlet Process Gaussian Mixture Models
  b.4 LOF and PINN
  b.5 gSpan
  b.6 Subdue

bibliography
LIST OF FIGURES

Figure 1.1  Google Doodle from 15 April 2013 commemorating the birth of Leonhard Euler in 1707. His sketch of the Seven Bridges of Königsberg is on the bottom left.
Figure 1.2  Hand-drawn social network, reproduced from Jacob Moreno’s book, Who Shall Survive? (1953) [140]
Figure 1.3  Internet Backbone, from the Opte Project www.opte.org (2005)
Figure 1.4  Map of Facebook Social Connections (2010) [45]
Figure 1.5  Transaction Data Collected from a Physical Access Control System (ACS) in an Airport
Figure 1.6  Graph of Chemical Structure of Caffeine
Figure 2.1  Connected Graphs
Figure 2.2  Examples of labelled graphs
Figure 2.3  Power law distributions with γ = [2, 3]
Figure 2.4  Deviations from power-law degree distributions
Figure 2.5  Average degree c and principal eigenvalue λ1 for different graph topologies with N = 6 vertices
Figure 2.6  Triadic Closure
Figure 2.7  Closed Triad Patterns in a Directed Graph
Figure 2.8  Adjacency matrix for the Recursive Matrix (RMat) Model [49]
Figure 2.9  Example of Kronecker Multiplication [128]. A 3 × 3 initiator matrix K1 is recursively multiplied by itself to obtain its 4th Kronecker power K4
Figure 2.10  The Multiplicative Attribute Graph (MAG) Model [109, 111]
Figure 3.1  Existence of an isomorphism between graphs G and H [37]
Figure 3.2  Overlapping instances
Figure 3.3  Graph of social interactions during the PAL Scheme study
Figure 3.4  Social graph of who communicates with whom in the Enron corpus
Figure 3.5  Fragment of the Enron Communication Graph, a bipartite graph of actors and messages from the Enron corpus
Figure 3.6  Graph of QUB University Campus Access Control System
Figure 3.7  Adding Forward Edges to Graph Transactions
Figure 3.8  Graph of CEM Office Access Control System
Figure 4.1  Time of Day edge attribute for QUB Access Control Dataset
Figure 4.2  Statistical Outlier Detection, using the model of Fig. 4.1b
Figure 4.3  Outlier Detection with Cluster-based Local Outlier Factors (CBLOF)
Figure 4.4  Outlier Detection with k-Nearest Neighbours (kNN)
Figure 4.5  Outlier Detection with Local Outlier Factors (LOF)
Figure 5.1  Examples of labelled graphs
Figure 5.2  Agwan model in plate notation
Figure 5.3  Agwan model example. Vertex labels are selected according to prior probability π. Edge weight w_uv is selected from mixture model Ω_42 and w_vu is selected from mixture model Ω_24.
Figure 5.4  Vertex Strength Distribution—Real Attributes
Figure 5.5  Spectral Properties—Real Attributes
Figure 5.6  Clustering Coefficients—Real Attributes
Figure 5.7  Triad Participation—Real Attributes
Figure 5.8  Vertex Strength Distribution—Synthetic Attributes
Figure 5.9  Spectral Properties—Synthetic Attributes
Figure 5.10  Clustering Coefficients—Synthetic Attributes
Figure 5.11  Triad Participation—Synthetic Attributes
Figure 6.1  Substructure Patterns (Motifs) Commonly Found in Real Graphs
Figure 6.2  Carbon Ring
Figure 6.3  Relationship between vertex labelling and number of discovered substructures: Unlabelled Graph
Figure 6.4  Relationship between vertex labelling and number of discovered substructures: Labelled Graph (with unique vertex labels). More label values imply more substructures, with fewer instances of each.
Figure 6.5  Effect of no. of vertex partitions on complexity of substructure discovery. The number of distinct substructures and number of GI tests required is shown for RMat random graphs with 0, 1, . . . , 9 binary labels, i.e. 1–512 vertex partitions. Label values were assigned independently from a uniform distribution.
Figure 6.6  Computation time for RP + PINN + LOF on “bag of words” feature vectors on Enron bipartite graph
Figure 6.7  Distribution of LOF scores for RP + PINN + LOF on “bag of words” feature vectors on Enron bipartite graph
Figure 6.8  Runtime and memory performance of constrained and unconstrained substructure discovery in the Access Control System graph database
Figure 6.9  Analysis of subgraphs returned by constrained and unconstrained substructure discovery in the Access Control System graph database
Figure 6.10  Analysis of subgraphs returned by constrained and unconstrained substructure discovery in the Access Control System graph database
Figure 6.11  Performance of substructure discovery in the Enron graphs
Figure 6.12  Performance of substructure discovery in the Enron graphs
Figure 6.13  Analysis of subgraphs returned by substructure discovery in the Enron Social graph
Figure 6.14  Analysis of subgraphs returned by substructure discovery in the Enron Bipartite graph
Figure 7.1  Single Large Graph representing TCP SYN and ICMP PING network traffic, with two Denial of Service (DoS) attacks taking place. Anomalies are highlighted in blue.
Figure 7.2  Graph Database where each subgraph G1–G4 is a transaction at an Automated Teller Machine
Figure 7.3  Access Control System Example
Figure 7.4  Cumulative Frequency Distribution showing transactions discovered as anomalous
Figure 7.5  Outlier score distributions for elapsed time in the ACS dataset
LIST OF TABLES

Table 3.1  Characteristics of the PAL Scheme Social Graph (Fig. 3.3). {T1,T2,T3} means the attribute was measured at three time points during the study. EQ5D, MCS8 and PCS8 are scores from standardised questionnaires used to measure health outcomes.
Table 3.2  Characteristics of the Enron Social Graph
Table 3.3  Characteristics of the Enron Communication Graph
Table 3.4  Characteristics of the QUB Access Control System Graph Database
Table 3.5  Characteristics of the CEM Access Control System Graph Database
Table 5.1  Vertex Strength for PAL Undirected Graph—Real Attributes
Table 5.2  Vertex Strength for Enron Directed Graph—Real Attributes
Table 5.3  Spectral Properties for PAL Undirected Graph—Real Attributes
Table 5.4  Spectral Properties for Enron Directed Graph—Real Attributes
Table 5.5  Clustering Coefficient for PAL Undirected Graph—Real Attributes
Table 5.6  Clustering Coefficient for Enron Directed Graph—Real Attributes
Table 5.7  Triad Participation for PAL Undirected Graph—Real Attributes
Table 5.8  Triad Participation for Enron Directed Graph—Real Attributes
Table 5.9  Vertex Strength for PAL Undirected Graph—Synthetic Attributes
Table 5.10  Vertex Strength for Enron Directed Graph—Synthetic Attributes
Table 5.11  Spectral Properties for PAL Undirected Graph—Synthetic Attributes
Table 5.12  Spectral Properties for Enron Directed Graph—Synthetic Attributes
Table 5.13  Clustering Coefficient for PAL Undirected Graph—Synthetic Attributes
Table 5.14  Clustering Coefficient for Enron Directed Graph—Synthetic Attributes
Table 5.15  Triad Participation for PAL Undirected Graph—Synthetic Attributes
Table 5.16  Triad Participation for Enron Directed Graph—Synthetic Attributes
Table 5.17  Summary of results and findings: dependencies between graph labels and weights and structural properties (∗ hypothesised); relative accuracy of the properties of weighted graphs generated using Agwan and MAG models
Table 7.1  Comparison of Top 20 Anomalies discovered by EqualFreq, EqualWidth and KMeans
Table 7.2  Comparison of Top 20 Anomalies discovered by Yagada. The best-performing of the alternative algorithms are highlighted in bold.
LIST OF ALGORITHMS

Figure 1  Agwan Graph Generation
Figure 2  Agwan Parameter Fitting
Figure 3  Yagada: detect anomalies in structural and numeric data
Figure 4  Discretize: calculate an anomaly score on the numeric attributes of a vertex or edge
Figure 5  GraphSet_Projection: search for frequent substructures
Figure 6  Subdue: search for frequent substructures
ACRONYMS

acs    Access Control System
arm    Association Rule Mining
ba     Barabási and Albert
bfs    Breadth-First Search
bp     Belief Propagation
cblof  Cluster-Based Local Outlier Factors
ccdf   Complementary Cumulative Distribution Function
cdf    Cumulative Distribution Function
dfs    Depth-First Search
dgx    Discrete Gaussian Exponential
dl     Description Length
dpgmm  Dirichlet Process Gaussian Mixture Model
dpl    Densification Power Law
em     Expectation Maximisation
er     Erdős-Rényi
erg    Exponential Random Graph
gi     Graph Isomorphism
gmm    Gaussian Mixture Model
knn    k-Nearest Neighbour
lof    Local Outlier Factors
mag    Multiplicative Attribute Graph
mcmc   Markov Chain Monte Carlo
mdl    Minimum Description Length
mrf    Markov Random Field
pal    Physical Activity Loyalty Card
pdf    Probability Density Function
pinn   Projection Indexed Nearest Neighbours
rmat   Recursive Matrix
rp     Random Projection
slg    Single Large Graph
sna    Social Network Analysis
svd    Singular Value Decomposition
www    World-Wide Web
Part I

INTRODUCTION

The term ‘holistic’ refers to my conviction that what we are concerned with here is the fundamental interconnectedness of all things. I do not concern myself with such petty things as fingerprint powder, telltale pieces of pocket fluff and inane footprints. I see the solution to each problem as being detectable in the pattern and web of the whole.
— Dirk Gently’s Holistic Detective Agency, Douglas Adams
1 INTRODUCTION
In its simplest form, a graph is a set of points (called vertices) joined together by lines (called edges). Graphs are often used to represent complex networks or systems of interaction—from human relationships to neural pathways in the brain, physical road layouts to the Internet—where vertices represent some type of entity and edges represent the relationships between entities.

In this thesis, we address the problem of finding anomalous patterns in graphs. Departures from the norm may indicate unusual behaviour or otherwise interesting regions in the graph structure. In financial networks, are there unusual connections between people which indicate that fraudulent activity or money laundering is taking place? In an e-mail network, are there anomalous gaps which imply that evidence has been deleted? In a large secure building, such as a hospital, power station or airport, are there patterns of behaviour which are suspicious?

In order to find meaningful anomalies, we first need to understand what the normal patterns are. Much of this thesis is devoted to finding typical patterns in graph-based data, which we use as the basis for our anomaly detection approach.

Most graph-based pattern and anomaly detection algorithms consider the structure of the graph and discrete or categorical vertex labels. However, many graphs are also labelled with numeric attributes or weights. Our novel contribution is to consider how these numeric values are related to the structure of the graph and how we can incorporate them into pattern mining and anomaly detection. But more on that later. Let’s start at the beginning, with the invention of graph theory almost 300 years ago.
1.1 graphs and networks
Graph theory was invented by Leonhard Euler in 1735 to solve a topological problem that was famous at the time: the Seven Bridges of Königsberg [19]. In the Prussian city of Königsberg, the river Pregel branched in two and contained an island; the eponymous bridges spanned the various sections of the river. The problem was to determine whether it was possible to tour the city, crossing each bridge exactly once, and return to the starting point. Many people had conjectured that the tour was not possible, but until Euler, no-one had been able to produce a proof.

Figure 1.1: Google Doodle from 15 April 2013 commemorating the birth of Leonhard Euler in 1707. His sketch of the Seven Bridges of Königsberg is on the bottom left.

To solve the problem, Euler devised the graph, an abstract representation of the topology of the bridges. Recognising that the choice of route on a land mass (bank or island) was irrelevant to the problem, he reduced the land masses to points and represented the bridges as lines joining these points. Noting that the number of times one enters a (non-terminal) vertex must be the same as the number of times one leaves it, Euler proved that a graph traversal is only possible if all the vertices are of even degree or if exactly two vertices (the start and end of the path) are of odd degree. As the topology of the Königsberg bridges did not satisfy this condition, the problem was solved: crossing each bridge once and returning to the starting point is impossible. Although the graph was only one of Euler’s many mathematical accomplishments, it is remembered as one of his most significant (Fig. 1.1).

Since Euler, graph theory has come a long way. Applied graph theory, or network analysis, represents real-world structures as graphs for visualisation [133] or algorithmic analysis, which is the subject of this thesis. Any system composed of entities which are linked or related to each other can be represented as a graph. Common examples of graph-based data include social networks; chemical, biological or ecological processes; co-citation networks; and computer networks such as the Internet or the World Wide Web [142].
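Euler’s even-degree condition is easy to check mechanically. A minimal Python sketch (the single-letter names for the four land masses are our own, chosen for illustration):

    from collections import Counter

    def has_euler_circuit(edges):
        """Euler's condition for a closed traversal of a connected
        multigraph: every vertex must have even degree."""
        degree = Counter()
        for u, v in edges:        # undirected edges; parallel edges allowed
            degree[u] += 1
            degree[v] += 1
        return all(d % 2 == 0 for d in degree.values())

    # The Koenigsberg multigraph: north bank (A), south bank (B),
    # island (I), east land mass (E); the seven bridges are edges.
    koenigsberg = [("A", "I"), ("A", "I"), ("B", "I"), ("B", "I"),
                   ("A", "E"), ("B", "E"), ("I", "E")]
    print(has_euler_circuit(koenigsberg))  # False: all four vertices have odd degree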
The degree of a vertex is the number of edges incident on it.
Figure 1.2: Hand-drawn social network, reproduced from Jacob Moreno’s book, Who Shall Survive? (1953) [140]
Next, we will make a brief tour of some of these types of network, or graph-based data. This is not intended as an exhaustive taxonomy of networks, but rather a sampling of some representative classes of network that our algorithms can be applied to.

1.1.1 Social Networks

Social networks are one of the most widely-studied network phenomena. The study of society as connections between social actors goes back as far as the work of August Comte (1798–1857) [84], but social network analysis in terms of applied graph theory began with the work of psychiatrist Jacob Moreno in the 1930s [84, 140, 142]. Fig. 1.2 shows one of Moreno’s hand-drawn social network diagrams.

Sociologists have developed their own rich terminology. A graph is a sociogram or Social Network. Historically, analysis of social networks was called sociometry, but today the term Social Network Analysis (SNA) is preferred. Vertices are actors: the current vertex under discussion is the ego and other vertices, as viewed from the ego, are alters. Edges—called ties or links—represent some kind of relationship between people, such as friendship or communicating with one another. Relationships may be bilateral, giving rise to an undirected graph; or unilateral, giving rise to a directed graph, as in Moreno’s diagram (Fig. 1.2).

The mathematical treatment of graphs as abstract topological structures and the social science analysis of social networks proceeded more or less independently through most of the 20th century [84]. This is evident in the literature review of Chapter 2, where the analytical graph models of Sects. 2.4.1–2.4.6 were developed by pure mathematicians and computer scientists and the statistical models of Sect. 2.4.7 were devised by statisticians and social scientists. In recent years, the two strands of research have begun to converge. Traditionally, the graphs studied in SNA were very small (as in Fig. 1.2), so issues of complexity were not given much consideration. With the rise of massive online social networks in the last decade, social scientists gained unprecedented access to large datasets, but found that many of their tools were inadequate for large-scale graph analysis. The last decade has seen many fruitful collaborations between Computer Science and Social Science research.

Many of the graphs that we study are social networks, combined with other types of network: the Enron e-mail corpus (§3.4.2) is a social and communication network and the PAL Scheme (§3.4.1) is a social and sensor network.

1.1.2 Infrastructure Networks

Unlike social networks, which arise naturally, infrastructure networks are man-made [142]. Examples include power grids, road or rail networks and the telephone network. Characteristically, vertices in infrastructure networks have a fixed spatial position and edges represent the physical connections (cables or roads) between them.

One of the most studied infrastructure networks is the Internet. Fig. 1.3 shows the connections between the ≈ 13 million class C subnets from the Internet of 2005, mapped using traceroute [42]. It seems strange that although the Internet is a feat of human engineering, no-one knows exactly how big it is. Estimates vary considerably; the controversial Internet Census 2012 [7] found 420 million IP addresses that responded to ping and up to 177 million more that are active but not pingable.

Figure 1.3: Internet Backbone, from the Opte Project www.opte.org (2005)

The size and complexity of the Internet illustrates the current limits of network analysis, and the need to design data structures and algorithms which can scale to billion-vertex graphs [44, 78], where even holding the entire graph in memory may not be possible. The Graph 500 Benchmark (www.graph500.org) assesses the suitability of supercomputing systems for data-intensive algorithms on graphs up to petabyte size in the fields of Cybersecurity, Medical Informatics, Data Enrichment, Social Networks and Symbolic Networks. Graph 500 generates random graphs with RMat [48], which we discuss in Chapter 2.

We do not specifically study infrastructure networks in this thesis, but we note that the sensor networks that we do study have an underlying infrastructure layer of doors or roads which determines what routes exist between each pair of sensors. In the case of the CEM dataset (§3.4.5), we were able to infer the layout of the building and the functions of various rooms from the sensor network data.

1.1.3 Information Networks

The world’s largest and most famous information network is the World-Wide Web (WWW). In many people’s minds, the Internet and the WWW are synonymous, but from a network point of view they are quite different. The Internet is a network of computers, routers, switches and other infrastructure, connected by physical cables (or wireless access points). The WWW is a virtual network built as a layer on top of the physical infrastructure. In the WWW graph, vertices are pages of text, images and other data, and edges are hyperlinks between the pages. Other information networks, such as Wikipedia, exist as layers on top of the WWW. Many information networks also incorporate aspects of social networks: e.g., LinkedIn, Flickr, Twitter and of course Facebook (Fig. 1.4).

Figure 1.4: Map of Facebook Social Connections (2010) [45]

The main information network that we study in this thesis is from e-mail communications. Archived e-mails provide a detailed record of communication patterns in organisations or other large social networks. E-mail is often used as an evidence database for law enforcement and intelligence organisations to find hidden groups in an organisation who are engaged in illegal activity [167], or by the organisations themselves to discover insider threats [71]. The importance of e-mail to digital forensics is highlighted by the ongoing News International phone hacking inquiry. Police in Scotland Yard obtained evidence for their arrests from a database of 300 million e-mails [4]. Ten years previously, e-mail evidence played an important role in the indictments following the Enron scandal [137].

Searching a huge e-mail database is a formidable task. Detecting anomalies in the e-mail communications network can draw attention to the most important or most unusual communications. Another problem which occurred in both of the above cases was the deliberate deletion of e-mail evidence [5, 137]. Graph-based anomaly detection can be used to identify anomalous “holes” in the network where e-mails may have been deleted. Since the Enron e-mail corpus is public, we use it for our experiments in Chapters 5–6.
Figure 1.5: Transaction Data Collected from a Physical Access Control System (ACS) in an Airport
1.1.4 Sensor Networks

Sensor networks track the movements of people or things. Applications include: tracking people in building security systems; tracking vehicles in public transport networks; tracking cattle movements between farms; and baggage handling, inventory tracking or postal systems. Typically, data is collected as a sequence of transactions which record the location of an entity at some time. This transaction data can be reorganised as a graph, with vertices representing physical locations, and edges representing movements between locations.

One of the applications we are most interested in is physical building security. Many secure buildings—airports, hospitals and power stations—are equipped with Access Control Systems (ACSs) based on door sensors and electronic locks. Authorised users gain access by presenting credentials—an ID card or badge—to a door sensor. The system authenticates users and records all movements in a database (Fig. 1.5). The current generation of access control systems can detect suspicious events—such as the use of an ID Card which has been reported stolen, or an attempt to physically force a door—but cannot detect suspicious patterns—transactions which are innocent in themselves but anomalous when considered in context, for example an airport technician who regularly hangs around in the baggage handling area. We can detect suspicious patterns by organising the transaction data as a graph and searching for anomalous regions, paths or substructures. We use ACS graphs for our experiments in Chapters 5–7.
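As an illustration of this construction, the sketch below builds a directed, weighted movement graph from a toy transaction log; the three-field record of badge, sensor and timestamp is a simplification of a real ACS transaction, and the field names are hypothetical:

    from collections import defaultdict

    # Each transaction: (badge_id, door_sensor, timestamp) -- toy data.
    transactions = [
        ("user1", "lobby",    100), ("user1", "corridor", 160),
        ("user1", "baggage",  300), ("user2", "lobby",    120),
        ("user2", "corridor", 200),
    ]

    # Vertices are door sensors; a directed edge (a, b) records consecutive
    # swipes by the same person, weighted by the count of movements and
    # annotated with the elapsed times between swipes.
    count = defaultdict(int)
    elapsed = defaultdict(list)
    last = {}  # badge_id -> (sensor, timestamp) of that person's previous swipe
    for badge, sensor, t in sorted(transactions, key=lambda x: (x[0], x[2])):
        if badge in last:
            prev_sensor, prev_t = last[badge]
            count[(prev_sensor, sensor)] += 1
            elapsed[(prev_sensor, sensor)].append(t - prev_t)
        last[badge] = (sensor, t)

    for edge, n in count.items():
        print(edge, "count:", n, "elapsed times:", elapsed[edge])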
Figure 1.6: Graph of Chemical Structure of Caffeine
1.1.5 Chemical and Biological Networks

Graphs are widely used to model chemical and biological processes, from chemical reactions within cells to interactions between species in ecosystems [142]. Examples abound: metabolic networks, protein–protein interaction networks, genetic regulatory networks, neural networks, food webs and host-parasite networks. The need to find frequent molecular fragments in a database of compounds motivated research into subgraph mining algorithms [39, 98, 121, 185]. Figure 1.6 shows a graph representation of one of my favourite recreational substances: vertices represent atoms and edges represent chemical bonds.

Unlike the large graphs that we have discussed so far, molecular databases consist of many small graphs, called graph transactions. The graph database can be thought of as a large graph with many disconnected components, where each transaction forms one component. We will study subgraph mining algorithms in detail in Chapter 3 and use this approach for our work in Chapters 6–7.
1.2 motivation and research questions
This Ph.D project began with the question of how to detect anomalous behaviour patterns in a physical ACS. Previous approaches have relied on rules derived from domain knowledge [32, 79, 157]. For example, [32] proposes four separate algorithms to detect four types of suspicious patterns: temporal patterns—when a person stays in an area for an unusually long period of time (“zone overstay”); repetitive access patterns—an unusual number of repetitive accesses in a given period of time; displacement patterns—where a person appears to move a long way in an unusually short period of time; and out-of-sequence patterns—where a person appears to move from one area to another without passing through the intervening doors. Rather than this ad-hoc approach, we want to be able to detect any kind of anomalous pattern, including patterns about which we have no prior knowledge.

For our investigation, we decided to represent staff movements within the building sensor network as a graph database, and to research unsupervised graph mining algorithms which can discriminate between typical and atypical patterns. Most graph mining approaches are for unweighted graphs, but many graphs have numeric weights on the vertices or edges. In the case of our ACS, edge weights can represent the number of people who travel between a pair of sensors. Weights can also be continuous values, e.g. the absolute time when someone swiped a sensor or the elapsed time to walk between a pair of sensors. Where we have multiple numeric attributes, we want to be able to represent all of them and incorporate them into our models and algorithms. This leads us to the following research questions:

1. How are numeric attribute values related to graph structure? Should we assume that numeric attributes are independent of graph structure or is there a dependency relationship?

2. How can we integrate numeric graph labels and weights into substructure discovery? Do the numeric attributes help us to decide which are the “best” substructures?

3. How can the relationship between graph structure and attributes be applied to anomaly detection? Do anomalies in numeric attributes tell us something about structural anomalies?

4. How can the answers to these questions be applied in the case where there are multiple (multi-dimensional) numeric attributes?

Our research outputs are models and algorithms for pattern and anomaly detection in graphs with discrete and numeric labels. Our ultimate goal is a single algorithm which can search for both structural anomalies (e.g. unusual paths through a building) and numeric anomalies (e.g. unusual timing data) in a database of graph transactions. Although the initial application was physical building security, the models and algorithms can be applied to a wide range of other domains. We describe the graph datasets that we used in our investigations in Chapter 3.
1.3 methodology and evaluation criteria
The main approaches to detecting anomalous substructures in graph-based data are based on an information-theoretic approach [68, 144], so we used this as our starting point. The most descriptive substructures are considered to be the ones with least entropy; by implication, anomalous substructures have the greatest entropy.

For unsupervised learning on large, graph-based datasets, we cannot make any assumptions about the underlying distribution of the numeric attributes. There may be multiple numeric attributes, each of which follows its own distribution. As different processes give rise to different parts of the graph structure, there can be multiple distributions within a single attribute. We evaluate a number of numeric anomaly detection approaches—statistical, clustering-based, distance-based and density-based—for their suitability for unsupervised learning.

To incorporate numeric attributes into substructure discovery, they can be discretised, or the evaluation of which substructures are “best” can somehow incorporate the numeric values. We evaluate a number of numeric discretisation approaches, including histogram binning and clustering, and compare these to discretisation using numeric anomaly detection. We measure the effect of these different discretisation approaches on anomaly detection by assigning an anomaly score to each substructure, and ranking discovered substructures from most to least anomalous. Each approach is evaluated on its discriminative power: how easy is it to tell which substructures are “normal”, and how much difference is there between substructures ranked as anomalous?

The work on anomaly detection raised questions about how numeric attributes contribute to frequent substructure discovery. We investigated the effects of numeric anomaly detection on two well-known approaches to substructure discovery, Subdue [58] and gSpan [185]. We prune the result set based on the anomalousness of the substructures and measure the effect on computational performance, completeness of the result set and accuracy. We also extend the anomaly detection approach to the multi-dimensional and high-dimensional cases.

This work raised further questions: how exactly are numeric attributes related to graph structure? To investigate this, we created a generative graph model for weighted graphs. By learning parameters for the model from real-world graphs and comparing statistics for generated graphs against statistics calculated on the input graph, we discover which aspects of graph structure are correlated with the numeric attributes and which aspects of the structure are independent of the attributes.

In the thesis, we present this work in the opposite order to that in which we carried out the investigations, as the later work explains the earlier: the findings about graph structure and attributes inform the work on substructure discovery, which in turn informs the work on anomaly detection.
1.4 contributions
1. We present Agwan (Attribute Graphs: Weighted and Numeric), a model for random graphs with discrete vertex labels and weighted edges, including a fitting algorithm to learn the parameters from a real-world input graph and a generative algorithm to generate random labelled, weighted graphs with similar characteristics to the input graph.

2. We use Agwan to show that some structural properties of graphs—vertex strength distribution and spectral properties—are dependent on numeric edge weights, while other properties—triad participation and clustering—appear to be independent of the weights.

3. We introduce a numeric outlier-based constraint on substructure discovery and evaluate its effect on graph mining in single large graphs and graph databases, using two common substructure discovery algorithms (gSpan and Subdue). Our results show that removing a small number of anomalous edges or vertices reduces the number of instances of each pattern, which has a significant effect on runtime and memory usage. While the result set is not complete, we retain the most descriptive substructures. In many cases where the input graph is intractable with an unconstrained approach due to the computational or memory overheads, our approach allows the graph to be processed.

4. We show how our outlier-based constraint can be applied to multi-dimensional numeric attributes and present some preliminary work for the very high-dimensional case (thousands of attributes).

5. We present Yagada (Yet Another Graph-based Anomaly Detection Algorithm), which incorporates numeric anomaly detection into anomalous substructure discovery. Yagada is shown to have substantially greater discriminative power than anomaly detection with Subdue (without numeric attributes) and outperforms a number of discretisation approaches based on binning and clustering. We have proposed a number of improvements to Yagada which we intend to evaluate as a follow-up to this Ph.D project.
1.5 organisation of the thesis
The rest of this thesis is divided into three parts. Part II: Patterns and Anomalies covers the background to our research:

Chapter 2 is an overview of graph properties and models. It is well known that real-world graphs exhibit power-law degree distributions; we survey a number of other graph properties, many of which are also characterised by power laws. We follow this with an overview of graph models, which attempt to explain the mechanisms by which these properties arise.

Chapter 3 is a survey of approaches to graph mining and anomaly detection. We move from the macroscopic patterns of Chapter 2 to microscopic patterns: how to discover the most descriptive substructures in a graph or graph database. We also examine approaches to graph-based anomaly detection at macroscopic and microscopic scales.

In Chapter 4 we turn from the structural properties of graphs to numeric attributes. We survey the literature on integrating numeric attributes into graph mining. Then we discuss numeric anomaly detection, which forms the basis of our approach in Chapters 6–7. We evaluate several outlier detection methods—statistical, cluster-based, distance-based and density-based approaches—in terms of their suitability for unsupervised learning. Finally, we take a look at the problems of detecting outliers in high-dimensional space.

Part III: Contributions contains three chapters detailing the novel contributions of this Ph.D project:

Chapter 5 presents Agwan (Attribute Graphs: Weighted and Numeric), a generative graph model for labelled, weighted graphs. We use Agwan to better understand the relationship between labels, weights and graph structure. Furthermore, Agwan allows us to generate realistic synthetic graph datasets.

In Chapter 6, we investigate how weights and numeric attributes can be incorporated into frequent substructure discovery. Our approach is to introduce a constraint based on the “outlierness” of the numeric attributes. We show how the method can be applied to multi-dimensional attributes and present some preliminary work for the high-dimensional case. We evaluate the method using two well-known substructure mining approaches, gSpan and Subdue.

Chapter 7 presents Yagada (Yet Another Graph-based Anomaly Detection Algorithm). We develop the approach used for substructure discovery in Chapter 6 into an algorithm for detecting anomalous regions in a Single Large Graph and anomalous subgraphs in a graph transaction database. We compare Yagada to several alternative discretisation approaches and show that using outlier values gives the greatest discrimination between normal and anomalous patterns.

We sum up in Part IV: Conclusions and Future Work. Chapter 8 summarises what has been achieved in this project and presents some directions for future work.
Part II

PATTERNS AND ANOMALIES
“You’ll notice that the girls who are assigned to that particular duty are unusually tall. If the Germans were to somehow get their hands on the personnel records for all of the people who work at Bletchley Park, and graph their heights on a histogram, they would see a normal bell-shaped curve, representing most of the workers, with an abnormal bump on it—representing the unusual population of tall girls who we have brought in to work the plug boards.” “Yes, I see.” Waterhouse says, “and someone like Rudy—Dr. von Hacklheber—would notice the anomaly, and wonder about it.”
— The Cryptonomicon, Neal Stephenson
2 GRAPH PROPERTIES AND MODELS
2.1 introduction
In the last chapter, we looked at examples of real-world graphs: social networks, infrastructure networks, information networks, sensor networks and chemical and biological networks. In this chapter, we look at the properties which are common to real-world graphs (§2.3). Of particular note is the power-law distribution, which crops up in a number of graph-related contexts. Researchers have created graph models in an attempt to understand how these properties arise; we survey these models in Sect. 2.4. Before discussing graph properties and models, we introduce some concepts and our notation.
2.2 concepts and terminology
We have already encountered some graph concepts in the preceding chapter. In this section, we define our terms more formally and introduce the notation that we will use through the rest of the thesis.

definition 1: An unlabelled graph G = (V, E) consists of a set of vertices V and edges E ⊆ V × V. Two vertices v, w ∈ V are adjacent iff ⟨v, w⟩ ∈ E. If the tuple ⟨v, w⟩ is ordered, the graph is directed, otherwise it is undirected.

definition 2: A graph G′ = (V′, E′) is a subgraph of G, denoted G′ ⊆ G, iff V′ ⊆ V, E′ ⊆ E ∩ (V′ × V′).

We are often interested in tracing the flow of people, objects, information or viruses across a network, from one vertex to another, travelling along the edges. Here we formalise the notion of a path:

definition 3: A path is an ordered set of vertices P = {v_1, …, v_N : (v_i, v_{i+1}) ∈ E, 1 ≤ i < N}. The path length is the number of edges in the path, viz. N − 1. A simple path is a path which does not cross itself: {v_i, v_j ∈ P : i < j ≤ N, v_i ≠ v_j}.
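With an adjacency-set representation of a graph, Definition 3 translates directly into code. A minimal sketch:

    def is_path(G, seq):
        """Definition 3: consecutive vertices in seq must be adjacent.
        G is an adjacency-set representation of an undirected graph."""
        return all(seq[i + 1] in G[seq[i]] for i in range(len(seq) - 1))

    def is_simple_path(G, seq):
        """A simple path additionally visits no vertex twice."""
        return is_path(G, seq) and len(set(seq)) == len(seq)

    G = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
    print(is_path(G, [1, 2, 3, 4]), is_simple_path(G, [1, 2, 3, 1]))  # True False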
Figure 2.1: Connected Graphs: (a) Connected; (b) Weakly Connected; (c) Strongly Connected
definition 4: A cycle is a closed path {v_i ∈ P : v_1 = v_N}. A simple cycle is a cycle with no repeated vertices or edges aside from the necessary repetition of the start and end vertex: {v_i, v_j ∈ P : i < j < N, v_i ≠ v_j, v_1 = v_N}.

definition 5: If a path exists from every vertex to every other vertex, the graph is said to be connected (Fig. 2.1a). In the case of a directed graph we make a distinction between weak and strong connectivity. A directed graph G is weakly connected if replacing all the directed edges with undirected edges would result in a connected graph (Fig. 2.1b). It is strongly connected if there is a path from every vertex to every other vertex (Fig. 2.1c):

∀u, v ∈ V : ∃P_1 = {u, …, v} ∧ ∃P_2 = {v, …, u}    (2.1)
definition 6: The components of a graph are defined by its set of maximal connected subgraphs. A maximal connected subgraph is a connected subgraph which has no connected supergraph. That is, a subgraph G′ is a component of G iff:

G′ is connected ∧ ∄G_1 : G′ ⊂ G_1, G_1 is connected    (2.2)
Single Large Graphs (SLGs) usually have a giant component which is a deliberately informal term for the connected component that contains a significant fraction of all vertices [60, 142]. This property can be seen in the SLG datasets described in Chapter 3; it is shown most clearly in Fig. 3.3, which has a giant component and 73 small components (most of which are isolated vertices).
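Definition 6 suggests a direct procedure: grow one component at a time by breadth-first search from an unvisited vertex. A minimal sketch, which also picks out the largest (“giant”) component:

    from collections import deque

    def components(G):
        """Definition 6: the maximal connected subgraphs, found by BFS.
        G is an adjacency-set representation of an undirected graph."""
        seen, comps = set(), []
        for start in G:
            if start in seen:
                continue
            comp, queue = {start}, deque([start])
            seen.add(start)
            while queue:
                u = queue.popleft()
                for v in G[u]:
                    if v not in seen:
                        seen.add(v)
                        comp.add(v)
                        queue.append(v)
            comps.append(comp)
        return comps

    G = {1: {2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}, 6: set()}
    print(max(components(G), key=len))  # the largest component: {1, 2, 3}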
Figure 2.2: Examples of labelled graphs: (a) Social graph; (b) Molecular structure; (c) Scene Analysis
2.2.1 Weighted Graphs

A weighted graph has a numeric value attached to its edges:

definition 7: In an edge-weighted graph G = (V, E), each edge e ∈ E is a 3-tuple ⟨u, v, w_uv⟩, where w_uv is the edge weight. In the case of an undirected graph, w_uv = w_vu.

Weights are commonly used to represent the number of occurrences of each edge, such as the number of e-mails sent between individuals in a communications network [17] (§1.1.3); the number of calls to a subroutine in a software call graph [73]; or the number of cattle moved between farms in an infrastructure network [100] (§1.1.2). In these cases, w_uv ∈ ℕ. In other graphs, the edge weight may represent continuous values: donation amounts in a bipartite graph of donors and political candidates [17]; distance or speed in a transportation network [73]; or elapsed time between the sensors in a building network (§1.1.2). In these cases, w_uv ∈ ℝ.

In the literature, weights are usually understood to be one-dimensional, but some authors allow multi-dimensional weights [73]; in our notation, we allow graphs with multi-dimensional numeric attributes (see Def. 9), and reserve the term weight for the one-dimensional case. Although the concept of edge weights is easily generalised to vertex weights, vertex-weighted graphs are not common in the literature. We use weighted graph to mean an edge-weighted graph unless otherwise specified.
For an example of vertex weights, see the Enron bipartite graph, §3.4.3
2.2.2 Labelled Graphs

So far we have considered only unlabelled graphs. Where a graph represents a real-world network or phenomenon, the vertices are typically labelled, as in Fig. 2.2. In Fig. 2.2a, the vertices represent people, labelled with their favourite sport; edges represent friendships. Fig. 2.2b is a graph of a molecular structure: vertices are atoms and edges are chemical bonds. Fig. 2.2c is a planar graph, where edges may not cross. It represents a frame in a video sequence where a red object is moving across a green and blue background; vertices are superpixels in the video frame and edges represent spatial relationships between adjacent superpixels.

In our notation, we allow an arbitrary number of labels on vertices and edges:

definition 8: A labelled graph G is a tuple ⟨V, E, L, L_V, L_E⟩, where L is a set of vertex and edge labels and L_V and L_E are label-to-value mapping functions, defined below.

definition 9: Let L^D be the set of discrete labels and L^N be the set of numeric labels, such that L = L^V ∪ L^E = L^D ∪ L^N, L^D ∩ L^N = ∅. Let A^D be the set of discrete attribute values and A^N ⊂ ℝ be the set of numeric attribute values. The vertex label-to-value mapping function is denoted as:

L_V : V × (L^V ∩ L^D) → A^D    (2.3)
L_V : V × (L^V ∩ L^N) → A^N    (2.4)

The edge label-to-value mapping function is defined similarly:

L_E : E × (L^E ∩ L^D) → A^D    (2.5)
L_E : E × (L^E ∩ L^N) → A^N    (2.6)
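One simple concrete realisation of Definitions 8 and 9 is to store the mapping functions as dictionaries keyed by vertex (or edge) and label. The sketch below is illustrative only; the attribute names are hypothetical:

    # Vertex and edge label-to-value mappings stored as nested dicts:
    # discrete labels map to categorical values, numeric labels to floats.
    vertex_attrs = {
        "v1": {"sport": "Swim", "age": 34.0},   # hypothetical attributes
        "v2": {"sport": "Run",  "age": 27.5},
    }
    edge_attrs = {
        ("v1", "v2"): {"relation": "friend", "elapsed_time": 12.7},
    }

    def L_V(v, label):
        """Look up the value of a (discrete or numeric) label on vertex v."""
        return vertex_attrs[v][label]

    print(L_V("v1", "sport"), L_V("v2", "age"))  # Swim 27.5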
Armed with these definitions, we can begin our discussion of the properties of real-world graphs.
2.3 graph properties
Our research question is to design algorithms to discover frequent patterns and anomalies in graphs. Understanding the properties of real-world graphs tells us what makes a graph “typical”, and therefore what may be anomalous. We also need to understand graph properties to generate realistic synthetic graphs for our experiments (Chapter 5). Other motivations to understand graph properties include: graph compression; extrapolations; asking what-if questions; link prediction; and studying graph propagation (viruses, information, political or health opinions, etc.) [48]. As we shall see in this section, many of the statistics on real-world graphs follow a power-law distribution [48, 142], so we shall begin by looking at power laws and other heavy-tailed distributions.

Figure 2.3: Power law distributions with γ = [2, 3]: (a) linear scale; (b) log scale

2.3.1 Power Laws and Degree Distributions

Power-law distributions crop up in many graph-related contexts. Here we discuss power laws in relation to degree distributions, but we will meet them again in the following sections.

definition 10: A random variable X follows a power law distribution when its Probability Density Function (PDF) is given by: P(x) = Cx^(−γ), γ > 1, x > x_min, where C is a normalising constant. For this to be a probability distribution, we must specify some value x_min, otherwise the distribution has infinite area as x approaches zero. Likewise, γ must be greater than 1, otherwise the tail has infinite area. [141] reports exponents in the range 1.8 < γ < 3.6 across a range of real-world phenomena, from the populations of cities to the size of craters on the moon.

definition 11: The degree of a vertex u, k_u, is the number of edges connected to it. In a directed graph, the in-degree is the number of in-edges and the out-degree is the number of out-edges.
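Definition 10 can be exercised numerically: sample from a power law by inverse-transform sampling, then recover the exponent with the standard continuous maximum-likelihood estimator, γ̂ = 1 + n [Σ_i ln(x_i/x_min)]^(−1). A minimal sketch:

    import math, random

    def sample_power_law(gamma, x_min, n, seed=0):
        """Inverse-transform sampling from P(x) = C x^(-gamma), x >= x_min."""
        rng = random.Random(seed)
        return [x_min * (1.0 - rng.random()) ** (-1.0 / (gamma - 1.0))
                for _ in range(n)]

    def fit_gamma(xs, x_min):
        """Continuous maximum-likelihood estimate of the exponent:
        gamma = 1 + n / sum(ln(x_i / x_min)) over the tail x >= x_min."""
        tail = [x for x in xs if x >= x_min]
        return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)

    xs = sample_power_law(gamma=2.5, x_min=1.0, n=100_000)
    print(round(fit_gamma(xs, x_min=1.0), 2))  # approximately 2.5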
Figure 2.4: Deviations from power-law degree distributions: (a) Exponential Cutoff; (b) Discrete Gaussian Exponential (DGX)
definition 12: Power-law degree distribution. The distribution of vertex degrees pk in a SLG usually follows a power law: pk ∝ k^−γ [48, 142]. Degree distributions usually have exponents in the range 2 < γ < 3 (Fig. 2.3).

Power laws decay very slowly, so the structure of most graphs comprises a small number of high-degree vertices (called hubs) and a large number of low-degree vertices. Graphs exhibiting a power-law degree distribution are said to be scale-free, as the power-law pattern repeats itself recursively at all scales. In log-log scale, power laws appear as a straight line (Fig. 2.3b). In directed graphs, the power law holds for the separate in-degree and out-degree distributions of each vertex, with exponents γin and γout.

It is not unusual to see degree distributions deviate from a pure power law for very low or very high values of k [142]. The Exponential Cutoff distribution was proposed to account for one type of deviation [20]: pk ∝ k^−γ e^−k/κ. Here k^−γ is the power-law term and e^−k/κ is the exponential cutoff term, where κ is a constant representing the rate of decay. The Exponential Cutoff looks like a power law over the lower range of x-axis values, but the heavy tail decays exponentially at higher values (Fig. 2.4a). [20] suggests two reasons why this occurs: the first is “aging”—in the case of a bipartite network of films and actors, actors stop receiving new links when they retire—and the second is exceeding the capacity to form new links, as in an air transport network when the landing slot capacity of each airport is reached.

Another deviation is the log-normal distribution: while a power law looks like a straight line in log-log scale, a log-normal looks like a
parabola. [30] found that while the WWW as a whole exhibits a power-law degree distribution, some subsets of the WWW (university and newspaper websites) are better fitted by a discrete truncated log-normal, which they called the Discrete Gaussian Exponential (DGX) distribution (Fig. 2.4b). Another recently-proposed distribution, observed in graphs of calls between mobile phones, is the Double-Pareto Log-Normal (DPLN) distribution [165], which is comprised of two power laws with different exponents. In log-log scale, it looks like two straight lines which meet in the middle of the plot.

For weighted graphs, we generalise the concept of vertex degree k to vertex strength s, which is the sum of the weights of the incident edges of each vertex [77]:

su = Σ_{v≠u} wuv    (2.7)
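As a small illustration of Eq. 2.7—a sketch assuming NumPy, with a hypothetical weight matrix:

    import numpy as np

    # Weighted adjacency matrix of a small (hypothetical) undirected graph.
    W = np.array([[0.0, 2.0, 0.0],
                  [2.0, 0.0, 0.5],
                  [0.0, 0.5, 0.0]])
    strength = W.sum(axis=1)      # s_u = sum over v != u of w_uv (Eq. 2.7)
    print(strength)               # [2.  2.5 0.5]

    # For a directed graph, row sums give out-strengths, column sums in-strengths.
    W_dir = np.array([[0.0, 1.0], [3.0, 0.0]])
    out_s, in_s = W_dir.sum(axis=1), W_dir.sum(axis=0)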
For directed graphs, the in-strength is the sum of the weights of the in-edges and the out-strength is the sum of the weights of the out-edges. [135] found two power-law relationships in weighted, directed, time-evolving graphs:

definition 13: Weight Power Law (WPL). W(t) = E(t)^w, where W(t) is the total strength at time t, E(t) is the number of edges at time t and w is the weight exponent.

definition 14: Snapshot Power Law (Fortification). The vertex out-strength su^out ∝ (ku^out)^γ for out-degree ku^out and out-weight exponent γ. The same power-law relationship holds between in-strengths and in-degrees.

[17] found further power-law relationships in the densities, weight distributions, principal eigenvalues and ranks of weighted egonets. An analysis of the vertex strength distribution of real-world graphs is one of our contributions in Chapter 5.

2.3.2 Graph Diameter

Real-world graphs tend to have a small diameter. In 1929, Hungarian author Frigyes Karinthy wrote a short story based on this observation [105]. One of the characters in the story observes that it is easier to contact anyone on the planet than at any previous time in history. Someone else asks how they would contact Selma Lagerlöf in as few steps as possible.
Selma Lagerlöf was a Swedish novelist who received the Nobel Prize for literature in 1909. Karinthy’s story may have been partly inspired by Marconi’s Nobel Prize acceptance speech the same year.
“Nothing could be easier. . . Selma Lagerlöf just won the Nobel Prize for Literature, so she’s bound to know King Gustav of Sweden, since, by rule, he’s the one who would have handed her the Prize. And it’s well known that King Gustav loves to play tennis and participates in international tennis tournaments. He has played Mr. Kehrling, so they must be acquainted. And as it happens I myself also know Mr. Kehrling quite well.”

This informal observation received mathematical treatment in the 1950s [84], leading to the famous “Small World” experiment of Travers and Milgram [176]. They selected some people at random and asked them to pass on a message to someone they didn’t know who lived in another city, via intermediate acquaintances. Of the 64 messages which reached their destination, the mean number of acquaintances was found to be 5.2 (uncannily close to the 5 acquaintances in Karinthy’s story). As well as demonstrating the Small World effect, [176] showed that almost half the messages passed through three people, revealing the presence of hubs in the network. We define this notion more formally in relation to graph diameter:

definition 15: The diameter of a graph is the maximum over all pairs of vertices ⟨u, v⟩ of the shortest-path distance d(u, v). As a graph with one disconnected vertex has infinite diameter, and the diameter can be skewed by a long chain, a more useful measure is the effective diameter: the 90th percentile of the pairwise distances d(u, v) : u, v ∈ VS, where GS(VS, ES) is a component of G.

The Small World effect states that the diameter of real-world graphs tends to be small in relation to the size of the graph. Some more recently-discovered patterns and laws relating to graph diameter are less intuitive. As graphs evolve over time, they reach a gelling point at which several disconnected components connect into one giant component. At this point, the effective diameter of the graph tends to spike [48], but after the gelling point, the diameter tends to shrink over time, even as new vertices are added. [127] observed the shrinking diameter effect in time-evolving graphs and proposed the Densification Power Law (DPL):

definition 16: Densification Power Law. The number of edges in a graph grows super-linearly with respect to the number of vertices: Et ∝ (Vt)^β, where Et is the number of edges at time t, Vt is the number of vertices at time t, and β is the densification exponent.
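The effective diameter of Def. 15 can be estimated directly from the pairwise distances. A minimal sketch, assuming the networkx library and an arbitrary ER test graph:

    import networkx as nx
    import numpy as np

    G = nx.erdos_renyi_graph(200, 0.03, seed=1)
    # Restrict to the largest connected component, as in Def. 15.
    H = G.subgraph(max(nx.connected_components(G), key=len))

    # Collect all pairwise shortest-path distances d(u, v), u != v.
    dists = [d for u, targets in nx.all_pairs_shortest_path_length(H)
             for v, d in targets.items() if u != v]

    print("diameter:          ", max(dists))
    print("effective diameter:", np.percentile(dists, 90))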
The Small World phenomenon is sometimes known as “Six Degrees of Separation”, a term popularised by the title of a 1990 play by John Guare [88].
The DPL offers an explanation for shrinking diameters: as graphs become larger and denser, there are more edges and therefore more paths through the graph, so the longest path becomes shorter.

2.3.3 Spectral Properties

The spectral properties of a graph are calculated on its adjacency matrix.

definition 17: The adjacency matrix A of an unweighted graph is a |V| × |V| square matrix where:

auv = 1 if ⟨u, v⟩ ∈ E, 0 otherwise    (2.8)
For example, the adjacency matrix for the undirected graph from Fig. 2.1a can be represented as:

        0 1 0 0 0 0
        1 0 1 0 0 0
A =     0 1 0 1 0 0
        0 0 1 0 1 1
        0 0 0 1 0 1
        0 0 0 1 1 0
A non-zero diagonal means that the graph has self-edges. The adjacency matrix of an undirected graph is symmetrical around the diagonal; that of a directed graph is asymmetrical. The directed graph from Fig. 2.1b can be represented as:

        0 1 0 0 0 0
        0 0 1 0 0 0
A =     0 0 0 1 0 0
        0 0 0 0 1 1
        0 0 0 0 0 1
        0 0 0 0 0 0
For weighted graphs, the 1s in the adjacency matrix are replaced with the edge weights. The eigenvectors and eigenvalues of the adjacency matrix are important spectral properties:

definition 18: Eigen-decomposition. An eigenvector–eigenvalue pair of a square matrix A is a vector v and scalar λ such that Av = λv.
Figure 2.5: Average degree c and principal eigenvalue λ1 for different graph topologies with N = 6 vertices: (a) Clique: c = N − 1 = 5, λ1 = N − 1 = 5; (b) Star: c = 2(N − 1)/N = 1.67, λ1 = √(N − 1) = 2.24; (c) Chain: c = 2(N − 1)/N = 1.67, λ1 ≈ 1.80
In other words, A rotates and scales v, but v remains parallel to itself and its length is scaled by λ. The principal eigenvalue λ1 (sometimes called the spectral radius of A) is the maximum value of λ.

λ1 roughly corresponds to the average degree c (Fig. 2.5). For regular graphs such as a ring or clique, c = λ1 (Fig. 2.5a). For non-regular graphs, λ1 is a better measure of connectivity than c [151]. The average degree tells us how many vertices can reach other vertices along a path of length 1; λ1 takes longer paths into account. Figs. 2.5b and 2.5c show a 6-vertex star and chain respectively. Both have the same average degree (1.67), but the star is clearly better connected than the chain. This is captured by the principal eigenvalue (2.24 > 1.80).

Connectivity is important when considering models of propagation across a network, for modelling epidemics or the spread of information such as behaviour or opinions. [151] proves that, irrespective of the virus propagation model, the epidemic threshold of a virus can be predicted with just one parameter, the principal eigenvalue. As graphs evolve over time, the principal eigenvalue follows a power law with increasing number of edges: λ1(t) ∝ E(t)^α, where λ1(t) is the principal eigenvalue at time t, E(t) is the number of edges at time t and α ⩽ 0.5 is the power-law exponent. The power-law fit is better after the gelling point [48]. For weighted graphs, λ1 follows a power law with a higher exponent, 0.5 ⩽ α ⩽ 1.6 [48].

The spectrum or scree plot of a matrix is the set of its eigenvalues sorted by decreasing magnitude. The spectrum formed by the largest eigenvalues of the adjacency matrices of real graphs is often power-law distributed [48].
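The star-versus-chain comparison of Fig. 2.5 can be reproduced numerically. A sketch assuming networkx and NumPy:

    import networkx as nx
    import numpy as np

    for name, G in [("star", nx.star_graph(5)),    # 6 vertices: hub + 5 leaves
                    ("chain", nx.path_graph(6))]:
        A = nx.to_numpy_array(G)
        c = A.sum() / G.number_of_nodes()          # average degree
        lambda1 = max(np.linalg.eigvalsh(A))       # principal eigenvalue
        print(f"{name}: c = {c:.2f}, lambda_1 = {lambda1:.2f}")

Both graphs report c = 1.67, but the star's λ1 of 2.24 exceeds the chain's 1.80, matching the figure.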
The Singular Value Decomposition (SVD) of a matrix is closely related to its eigenvectors and eigenvalues:

definition 19: Singular Value Decomposition. If A is an arbitrary p × q matrix of rank r (p ⩾ q ⩾ r), then the SVD of A is U × Λ × V^T, where U^T U = V^T V = VV^T = Iq and Λ is a diagonal r × r matrix with non-negative diagonal elements arranged in descending order [29].

For Gramian matrices, the SVD and eigen-decomposition are the same. (A Gramian, or positive semidefinite, matrix is the product of a matrix with its transpose [29].) Where eigen-decomposition is not defined, such as for the rectangular adjacency matrix of a bipartite graph, the SVD can still be used. The SVD is also used in some graph algorithms, including HITS [113] and PageRank [41].
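A small illustration of Def. 19 on the rectangular adjacency matrix of a hypothetical bipartite graph, assuming NumPy:

    import numpy as np

    # Rows are one vertex class (e.g. authors), columns the other (e.g. papers).
    B = np.array([[1, 1, 0, 0],
                  [0, 1, 1, 0],
                  [0, 0, 1, 1]], dtype=float)

    # Eigen-decomposition is undefined for a non-square matrix, but the SVD
    # B = U @ diag(s) @ Vt always exists; s holds the singular values in
    # descending order.
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    print(s)

    # The singular values are the square roots of the eigenvalues of the
    # Gramian matrix B @ B.T.
    print(np.sqrt(np.linalg.eigvalsh(B @ B.T))[::-1])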
2.3.4 Triad Participation

Triangles form much more frequently in real graphs than they would at random. This phenomenon was observed as early as the 1950s in the social network literature [154]: if u is friends with v and w (Fig. 2.6), this does not guarantee that v and w are friends, but it makes it much more likely than random. This can be understood in terms of transitivity [142]. A relation ◦ is transitive if a ◦ b ∧ b ◦ c =⇒ a ◦ c. The relations in a graph are its edges, so for u, v, w ∈ V : ⟨u, v⟩ ∧ ⟨u, w⟩ =⇒ ⟨v, w⟩. Perfect transitivity occurs only in fully connected subgraphs (cliques, like Fig. 2.5a), but partial transitivity is a common feature of real graphs.
Figure 2.6: Triadic Closure
[177] observed that the number of triangles ∆ in a graph is proportional to the sum of the cubes of its eigenvalues. This observation can be used to rapidly estimate the number of triangles in large graphs. [177] also discovered two power laws relating to triangles:

definition 20: Triangle Participation Law. Let ∆u be the number of triangles that vertex u participates in. Then the distribution of ∆u follows a power law: while many vertices participate in only a few triangles, a few vertices participate in very many.

definition 21: Degree-Triangle Law. ∆u ∝ ku^σ: the number of triangles a vertex participates in has a super-linear relation to the vertex degree, a power law with exponent σ ≈ 1.5.
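The eigenvalue-based triangle estimate of [177] is easy to demonstrate; a sketch assuming networkx and NumPy, using the Zachary karate club graph as test data:

    import networkx as nx
    import numpy as np

    G = nx.karate_club_graph()
    A = nx.to_numpy_array(G)

    # Each triangle contributes 6 closed walks of length 3 (3 start points x
    # 2 directions), and trace(A^3) equals the sum of eigenvalue cubes, so
    # #triangles = (1/6) * sum(lambda_i^3).
    eigvals = np.linalg.eigvalsh(A)
    print(round((eigvals ** 3).sum() / 6))        # 45

    # Cross-check with an explicit triangle count.
    print(sum(nx.triangles(G).values()) // 3)     # 45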
Figure 2.7: Closed Triad Patterns in a Directed Graph: (a) Cycle; (b) Middleman; (c) In; (d) Out. For each motif y, the weight W^y_uvz is the product of the cube roots of the three edge weights in the motif, e.g. W^cycle_uvz = (wuz · wzv · wvu)^(1/3)
In directed graphs, there is more than one triangle pattern, or motif. The interpretation of the triangle depends on the direction of the edges, as shown in Fig. 2.7. To extend this to weighted, directed graphs, we can calculate the total strength of the edges for each motif [77]:

t^y_u = Σ_{v,z∈Wu\u} W^y_uvz    (2.9)
where y = {cycle, middleman, in, out} is the triangle type and W^y_uvz is calculated as shown in Fig. 2.7 for each triangle type y.

2.3.5 Clustering Coefficient

The Clustering Coefficient of a graph is a measure of transitivity. A higher clustering coefficient means that the graph exhibits a community structure.

definition 22: Global Clustering Coefficient. The clustering coefficient C is the fraction of paths of length two which participate in triangles [142]:

C = 3∆/Γ

where ∆ is the number of triangles and Γ is the number of connected triples (paths of length 2). C = 1 indicates perfect transitivity (all network components are cliques); C = 0 implies no closed triads, which occurs in some graph topologies, such as trees and square lattices [142]. Social networks tend to have high values of C, as people tend to make new friends within their existing social circles. Often we are interested in the clustering coefficient of each vertex:
definition 23: Local Clustering Coefficient. The local clustering coefficient of vertex u is defined as [142]:

Cu = nu/ku if ku > 1, 0 otherwise    (2.10)
where ku is the degree of u and nu is the number of edges shared between the vertices adjacent to u. For example, in the graph of Fig. 2.2a, the “cycle” vertex in the centre has ku = 5 neighbours, which share nu = 2 edges between them, so Cu = 2/5 = 0.4, whereas the “swim” vertex on the right is part of a clique, with Cu = 1.

Sometimes the global clustering coefficient is calculated as the average of the local clustering coefficients. This can lead to quite a different value from that of Def. 22, as there tends to be a high variance in Cu for low-degree vertices [48]. [156] calculated C(k), the average clustering coefficient for vertices of degree k, and found that for many graphs which exhibit hierarchical structure, C(k) ∝ k^−1, but C(k) is independent of k for graphs which do not have a hierarchical topology.

The clustering coefficient can be extended to directed graphs [77]:

definition 24: Directed Clustering Coefficient. For adjacency matrix A and vertex u, let ku^tot be the total degree (the sum of u's in- and out-degrees) and ku^↔ be the number of bilateral edges (the number of neighbours of u which have both an in-edge and an out-edge between themselves and u). Then the directed clustering coefficient of u is defined as:

Cu^D = (A + A^T)^3_uu / (2[ku^tot(ku^tot − 1) − 2ku^↔])    (2.11)
where (A)^3_uu is the uth element of the main diagonal of A · A · A. This can be further extended to weighted graphs [77]:

definition 25: Weighted Clustering Coefficient. For vertex u and the weighted adjacency matrix Wu of u and its neighbours, the weighted clustering coefficient of u is defined as:

Cu^W = [Wu^[1/3] + (Wu^T)^[1/3]]^3_uu / (2[ku^tot(ku^tot − 1) − 2ku^↔])    (2.12)

where Wu^[1/3] denotes the element-wise cube root of Wu.
By combining equations 2.9 and 2.12, we can calculate a separate clustering coefficient for each of the motifs in Fig. 2.7 [77].
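As an illustration of Eq. 2.11, a minimal sketch assuming NumPy (the adjacency matrix is hypothetical, and this is not the reference implementation of [77]):

    import numpy as np

    def directed_clustering(A, u):
        """Directed clustering coefficient of vertex u (Eq. 2.11)."""
        k_tot = A[u].sum() + A[:, u].sum()       # out-degree + in-degree
        k_bi = (A[u] * A[:, u]).sum()            # bilateral edges k_u^<->
        num = np.linalg.matrix_power(A + A.T, 3)[u, u]
        den = 2 * (k_tot * (k_tot - 1) - 2 * k_bi)
        return num / den if den else 0.0

    A = np.array([[0, 1, 1],
                  [0, 0, 1],
                  [1, 0, 0]], dtype=float)
    print(directed_clustering(A, 0))             # 0.5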
2.4 graph models
In the preceding section, we discussed some of the laws and properties exhibited by real-world graphs. Many of these properties were discovered by building graph models. Models also allow us to conduct simulation experiments, such as the epidemic threshold experiments mentioned in Sect. 2.3.3. We are interested in discovering frequent substructures and anomalies; as we shall see in Chapter 6, the properties of the underlying graph have a significant effect on the performance of substructure discovery algorithms.

There have been two separate approaches to graph modelling [111]. The first is to create “mechanistic” models which are simple enough to yield to mathematical analysis. We will discuss these models in Sects. 2.4.1–2.4.6. The second approach is to create stochastic models which may not be mathematically tractable, but which can be fitted to real-world graphs and used to learn the properties of those graphs. Stochastic models are discussed in Sect. 2.4.7. The MAG model, discussed in Sect. 2.4.8, attempts to bring these two strands together, to produce a model which can be analysed mathematically and is also statistically meaningful.

2.4.1 Erdős-Rényi Random Graphs

Our understanding of the mathematical properties of graph structure was pioneered by Paul Erdős and Alfréd Rényi [74, 75, 76].

definition 26: Erdős-Rényi (ER) Random Graph. An ER graph G(n, p) is parameterised by a number of vertices n and a wiring probability p between each pair of vertices.

One of the most important results from [74] was that the size of the giant component undergoes a sudden change or phase transition at a certain value of p. They derive the equation

S = 1 − e^−cS    (2.13)

where S is the fraction of vertices in the giant component and c = (n − 1)p is the mean degree. Although this equation has no closed-form solution, it can be shown graphically [142] or heuristically [48] that the phase transition point—the point at which the giant component appears—is c = (n − 1)p = 1. Therefore, a giant component only appears when p > 1/(n − 1).
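The phase transition is easy to observe empirically. A sketch assuming networkx, sweeping the mean degree c across the critical value:

    import networkx as nx

    n = 10_000
    for c in [0.5, 1.0, 1.5, 2.0]:          # mean degree c = (n - 1) p
        G = nx.fast_gnp_random_graph(n, c / (n - 1), seed=0)
        giant = max(nx.connected_components(G), key=len)
        print(f"c = {c}: giant component spans {len(giant) / n:.1%} of vertices")

Below c = 1, the largest component holds a vanishing fraction of the vertices; above it, the fraction approaches the solution of Eq. 2.13.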
The degree distribution of an ER graph is a binomial distribution [142]: pk = (n−1 choose k) p^k (1 − p)^(n−1−k). In the limit of large n, this simplifies to a Poisson distribution [142]: pk = e^−c c^k/k!. Thus, the ER model is sometimes called the Bernoulli or Poisson random graph model, and ER graphs do not follow the heavy-tailed degree distributions observed in real-world graphs.

The diameters of ER graphs are concentrated around ln n/ln c, so they do exhibit the Small World effect. This result also shows how core-periphery structure arises in a graph. If c > 1, the average size of the periphery will grow exponentially (and will eventually form the giant component). If c ⩽ 1, it will shrink exponentially and vanish [142]. The global clustering coefficient is calculated as C = c/(n − 1), which is much lower than in real-world graphs (tending to zero in the limit n → ∞) [142]. In real graphs, C is independent of n.

While ER graphs have been essential to our understanding of component sizes and expected diameters, the model does not explain other important properties of real-world graphs. One solution to the unrealistic degree distribution is to generalise the ER model to produce a heavy-tailed distribution. A number of such generalised random graph models have been proposed [12, 142, 147], but these still do not address other properties such as transitivity and clustering.

2.4.2 Preferential Attachment-based Models

Generalised random graph models start with a network property and try to reproduce it in the graph, without considering how the property arises in the first place. The appearance of power laws indicates the presence of an interesting underlying process [141]; some researchers have used this observation to develop generative models, which try to replicate the mechanisms that give rise to graph properties.

The first generative graph model was proposed in 1976 by Price [61], who investigated why power laws arise in graphs. He was inspired by Simon [168], who had identified the stochastic processes that cause power laws to arise in a variety of (non-graph) data. Price called his mechanism Cumulative Advantage: the effect whereby the “rich get richer”.
When a new vertex u is added to the graph, the probability of forming a new edge to an existing vertex v is proportional to the number of edges that v already has:

P(⟨u, v⟩) = (kv + k0) / Σ_i (ki + k0)    (2.14)

where k0 is a constant. This gives rise to a power-law degree distribution. Price was modelling a citation network, so the graph was directed and new vertices could only have out-edges.

Barabási and Albert (BA) independently discovered the same mechanism over 20 years later [23], calling it Preferential Attachment. Preferential Attachment has become one of the most influential and well-known generative approaches. It is similar to Price's model, but the graph is undirected and there is no parameter k0:

P(⟨u, v⟩) = kv / Σ_i ki    (2.15)
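A minimal sketch of Preferential Attachment (Eq. 2.15), using only the Python standard library; this is an illustration, not the BA reference implementation:

    import random

    def preferential_attachment(n, m=2, seed=0):
        """Grow an undirected BA-style graph: each new vertex attaches m
        edges to existing vertices chosen with probability proportional to
        their degree (Eq. 2.15)."""
        random.seed(seed)
        edges = [(0, 1)]
        targets = [0, 1]        # each vertex appears once per incident edge
        for u in range(2, n):
            chosen = set()
            while len(chosen) < min(m, u):
                # Uniform sampling from `targets` realises P proportional to k_v.
                chosen.add(random.choice(targets))
            for v in chosen:
                edges.append((u, v))
                targets += [u, v]
        return edges

    edges = preferential_attachment(10_000)

The repeated-vertex list makes degree-proportional sampling a uniform draw, which is the standard trick for implementing this model efficiently.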
The degree distribution of the BA model is a power law with fixed exponent γ = 3. This is somewhat inflexible, as the model cannot be adjusted to match the exponents found in real graphs. Some extensions to BA allow exponents other than 3 [18, 43, 64]. There are also extensions which yield directed graphs [13, 35, 59] with power-law in- and out-degree distributions.

The diameter of a BA graph grows as O(log n/log log n) as the size of the graph increases [36]. So BA graphs exhibit the Small World effect, but not the shrinking diameter effect. Neither does BA exhibit the DPL [48]: when a new vertex is added, it has a fixed number of edges m, so the average degree c is fixed. [22] extends BA by adding extra edges with probability P(⟨u, v⟩) ∝ ku · kv; i.e., both endpoints are selected by Preferential Attachment. This version does exhibit the DPL, but it is unknown whether it matches the shrinking diameter effect.

Unlike ER graphs, the BA model gives rise to graphs with exactly one connected component. New vertices are added, but old vertices (and edges) are never removed. This is a property of some real networks (e.g., citation networks), but most real-world graphs can lose vertices as well as gain new ones. Extensions to BA have been proposed to allow the addition of arbitrary edges [18], edge removal [142] and the addition and removal of arbitrary vertices [139].
Another property of BA graphs is that vertex age and degree are positively correlated [116]. This corresponds to the principle of “first-mover advantage” [162], which again is a feature of some real networks, including citation networks. However, in other networks such as the WWW, new vertices can rapidly obtain a high degree. [116] also found that vertices with similar degree are more likely to be connected, which is not a property of real graphs. [31] attempted to address these problems by assigning a fitness value ηu ∈ [0, 1] to each vertex, which is used to weight the probabilities for edge formation. The degree distribution emerges as a power law with an extra inverse logarithmic factor [142].

One variant of BA worthy of note is the Forest Fire model [127], which matches both the DPL and shrinking diameters. This model has two parameters, a forward-burning probability p and a backward-burning ratio r. The algorithm is analogous to a fire which starts at an “ambassador” vertex and probabilistically spreads to other vertices which are connected to vertices that are still “burning”. Some vertices create large “conflagrations” which form many out-edges before the fire dies out, resulting in a power-law degree distribution. Forest Fire also gives rise to a community structure similar to that of edge-copying methods (discussed below), because existing edges are copied to new vertices as the fire spreads. Proof of the graph properties is obtained empirically, as analytical proofs are difficult for this type of model [48].

2.4.3 Copying Models

Another generative model that emerged around the same time as BA was the Copying Model [114, 120]. This gives rise to the same heavy-tailed degree distributions as BA, but by a different process. [114] attempts to simulate the process by which the WWW arises. In each iteration, some vertices are randomly created or deleted. Then k edges are added to a random vertex v. With probability β, the edges are linked to random vertices. With probability 1 − β, the endpoints of the new edges are copied from another vertex (or from more than one vertex if the chosen vertex has fewer than k edges). Then some random edges are deleted. This yields a power-law in-degree distribution with exponent γin = 1/(1 − β).

The approach of [120] is similar: when a new vertex v is added, one new edge is added. With probability α, the tail of the new edge is
incident on v, otherwise it links to a random vertex. With probability β, the head of the new edge is incident on v. This model yields power-law in- and out-degree distributions with exponents γin = 1/(1 − α), γout = 1/(1 − β). [13] constructs a more general version of this model where the edge probability is based on the out-weight and in-weight of the endpoints. These models produce graphs with a large number of bipartite cores, giving rise to community and clustering properties.

[142] analyses copying models in some detail, noting that they mimic some biological processes. Cells in living organisms reproduce by splitting, making a copy of their DNA. This copying process is not perfect and sometimes a section of DNA is copied twice. As genes are subject to Darwinian selection, duplicated proteins may be favoured where the duplicate performs a different function to the original. Over time, this gives rise to a power-law distribution of chromosomes.

2.4.4 The Small World Model

The Small World model [182] was proposed to create graphs with a small diameter but a high clustering coefficient C. The model starts with a ring lattice of n vertices, each of degree k, with n ≫ k ≫ ln n ≫ 1. Each vertex is linked to its k/2 nearest neighbours on either side. Edges in the graph are rewired with probability p to new endpoints chosen uniformly at random. This has a highly non-linear effect on the diameter of the graph. With p = 0 we have the lattice, with high C. With p = 1 we have an ER graph, with low diameter. As p moves from 0 to 1, clustering and diameter both decrease; for a large range of p, both properties are exhibited. The main weakness of the Small World model is that the degree distribution is not heavy-tailed [142].
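The trade-off between clustering and path length as p varies can be observed with a short simulation; a sketch assuming networkx (the parameter values are arbitrary):

    import networkx as nx

    # Clustering C and mean path length as the rewiring probability p sweeps
    # from the ring lattice (p = 0) towards an ER-like graph (p = 1).
    n, k = 1000, 10
    for p in [0.0, 0.01, 0.1, 1.0]:
        G = nx.connected_watts_strogatz_graph(n, k, p, seed=0)
        C = nx.average_clustering(G)
        L = nx.average_shortest_path_length(G)
        print(f"p = {p}: C = {C:.3f}, mean path length = {L:.2f}")

Even a small p collapses the path length while C remains close to that of the lattice, which is the “small world” regime.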
Figure 2.8: Adjacency matrix for the Recursive Matrix (RMat) Model [49]
2.4.5 The RMat Model

The Recursive Matrix (RMat) model [49] matches graph properties including heavy-tailed degree distributions and community and clustering effects. The parameters can be learned from real graphs, and generated graphs can be shown empirically to have realistic degree distributions. Despite its expressive power, RMat is essentially very simple and requires only a few parameters. RMat graphs consist of 2^n vertices and E edges, parameterised by four probabilities ⟨a, b, c, d⟩ : a ⩾ b ⩾ c ⩾ d, a + b + c + d = 1.

The RMat generative algorithm starts with an empty adjacency matrix, which is divided into four equal partitions (Fig. 2.8). One partition is selected at random according to the probabilities ⟨a, b, c, d⟩, and this is repeated recursively until we reach a single 1 × 1 cell, where we assign the value 1 to indicate an edge. This is repeated E times to obtain the full graph. The algorithm can generate bipartite graphs by using a rectangular adjacency matrix.

The degree distribution of an RMat graph depends on the skew of the parameters. If a = b = c = d, the result will be an ER graph. RMat can model log-normals, the DGX distribution or power laws; the appropriate parameter settings can be learned from real graphs by fitting to the empirical degree distribution. The recursive nature of RMat causes communities to form: partitions a and d represent two groups of vertices; partitions b and c are cross-links between them, analogous to the “weak ties” observed in social networks [60].

RMat has not been studied analytically, but empirical studies have shown that in addition to the in- and out-degree distributions, it can effectively model the hop-plot (number of reachable pairs of vertices), the effective diameter, the singular values and the ranked components of the principal singular vector. However, RMat does not model the DPL or the shrinking diameter effect.
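A minimal sketch of the recursive edge-placement step (standard library only; the probability values are hypothetical, and duplicate edges are simply collapsed by a set):

    import random

    def rmat_edge(scale, a, b, c, d, rng):
        """Place one edge in a 2**scale x 2**scale adjacency matrix by
        recursively descending into one of the four quadrants, chosen with
        probabilities (a, b, c, d)."""
        row = col = 0
        for _ in range(scale):
            r = rng.random()
            row, col = row * 2, col * 2
            if r < a:                   # top-left
                pass
            elif r < a + b:             # top-right
                col += 1
            elif r < a + b + c:         # bottom-left
                row += 1
            else:                       # bottom-right
                row += 1
                col += 1
        return row, col

    rng = random.Random(0)
    a, b, c, d = 0.45, 0.25, 0.2, 0.1   # a >= b >= c >= d, summing to 1
    edges = {rmat_edge(10, a, b, c, d, rng) for _ in range(50_000)}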
Figure 2.9: Example of Kronecker multiplication [128]. A 3 × 3 initiator matrix K1 is recursively multiplied by itself to obtain its 4th Kronecker power K4: (a) K1 (3 × 3); (b) K2 = K1 ⊗ K1 (9 × 9); (c) K3 = K2 ⊗ K1 (27 × 27); (d) K4 = K3 ⊗ K1 (81 × 81)
2.4.6 The Kronecker Graph Model

Kronecker graphs [128] fulfil all the properties of RMat graphs, plus the DPL and the shrinking diameter effect. This work synthesises the previous work on random graphs in a very elegant way: it is computationally tractable, and the specification of the model yields simple closed-form expressions for degree distributions and other graph properties, which are easy to analyse. The paper proves that RMat graphs are a special case of Stochastic Kronecker graphs. (The main difference is that RMat graphs have a parameter to specify the number of edges; in Stochastic Kronecker graphs the number of edges is encoded within the initiator matrix.)

Like RMat graphs, Kronecker graphs are recursive. The model starts with a small initiator matrix. Kronecker multiplication is recursively applied to yield a final adjacency matrix of the desired size:

definition 27: The Kronecker Product of an n × m matrix A and an n′ × m′ matrix B is a matrix C of dimensions (n · n′) × (m · m′). If A = [a_i,j], then C is given by:

             | a_1,1 B   a_1,2 B   · · ·   a_1,m B |
C = A ⊗ B =  | a_2,1 B   a_2,2 B   · · ·   a_2,m B |    (2.16)
             |    ...       ...      ...      ...  |
             | a_n,1 B   a_n,2 B   · · ·   a_n,m B |
An example is shown in Fig. 2.9. Starting with a 3 × 3 initiator matrix K1 (Fig. 2.9a), this is multiplied by itself to get its Kronecker product K2 = K1 ⊗ K1 (Fig. 2.9b). Continuing to multiply the result by K1 yields the 3rd and 4th Kronecker powers (Figs. 2.9c–2.9d). The algorithm stops when it has generated a graph of the desired size. [128] proves that Kronecker graphs have a multinomial distribution for in- and out-degrees, eigenvalues and eigenvector components. A suitable choice of initiator matrix K1 makes these distributions behave like a power-law or DGX.
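Kronecker powers are one line in NumPy; a sketch with a hypothetical initiator matrix containing self-edges:

    import numpy as np

    K1 = np.array([[1, 1, 0],
                   [1, 1, 1],
                   [0, 1, 1]])     # hypothetical 3 x 3 initiator, with self-edges

    K = K1
    for _ in range(3):             # K4 = K1 (x) K1 (x) K1 (x) K1
        K = np.kron(K, K1)
    print(K.shape)                 # (81, 81)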
Community structure arises naturally due to the recursive nature of Kronecker multiplication (Fig. 2.9d). The diameter can be constrained to the diameter of K1 by including self-edges in K1 (as in Fig. 2.9a). Kronecker graphs also exhibit the DPL: a Kronecker graph with initiator matrix K1 = (V1, E1) will have densification exponent α = log |E1|/log |V1|. [107] provides a more detailed formal analysis of Kronecker graphs.

As well as being analytically tractable, Kronecker graphs are statistically meaningful. Stochastic Kronecker graphs replace the binary initiator matrix with a matrix of probabilities, to generate random graphs which mimic the properties of specific real-world graphs. When the nth Kronecker power of the probabilistic initiator matrix is calculated, each entry in the matrix contains the probability of an edge at that location. Parameter estimation for the initiator matrices is framed as an optimisation problem, allowing the parameters to be learned from real graphs using maximum likelihood and Expectation Maximisation (EM). [128] demonstrates empirically that an initiator matrix of four elements is sufficient to model the properties of several real-world networks. Variations of the Kronecker graph model include [16], which generalises the initiator matrix to a tensor in order to generate weighted graphs, and [15], which uses a related approach to generate weighted time-evolving graphs.

While they fulfil the desired structural properties, Kronecker graphs are unlabelled. In the next section, we will cover a different line of research, seen mostly in the SNA literature, which takes the attributes of graph vertices into account.

2.4.7 Statistical Models

The models in Sects. 2.4.1–2.4.6 aim to reproduce global graph properties; in the SNA literature, the focus has been on the relationships between individual vertices. One property of great interest to sociologists is assortative mixing or homophily [60, 138, 142]: the propensity for edges to form between vertices which share some common property, such as age or gender. The models in this section rely on group memberships or other attributes, which are usually represented as vertex labels (§2.2.2).

The Exponential Random Graph (ERG) model [181] is an ensemble—a set of possible graph topologies plus a probability distribution over them—which specifies the exponential family of graphs p∗.
The ER model G(n, p) can be considered to be an ensemble over all graphs with n vertices.
For each hypothesised structural feature (such as transitivity), there is a corresponding graph statistic (e.g., T, the number of closed triads) and a corresponding variable in the model. These variables act as constraints on the probability distribution over all possible graphs, making it possible to draw uniformly from the set of all graphs which satisfy those properties [142]. The best choice of probability distribution is the one which maximises the Gibbs entropy subject to the constraints. However, ERG models have been criticised for erratic behaviour when the parameters are varied [93, 142].

The Stochastic Block Model [180] is an extension of ERG in which vertices which share a common attribute are grouped into blocks. This idea was developed into the Mixed Membership Stochastic Block Model [14], which combines global attributes to partition the vertices into dense areas of connectivity (the block model) with local attributes to introduce individual variation within these blocks (mixed membership). The absence of an edge may be due to the rarity of interactions in general, or because the members of two separate blocks rarely interact with each other. [14] resolves this ambiguity with a variational approach: by positing a distribution of latent variables with free parameters, then using an EM algorithm to fit the parameters to the graph data.

[93] presents a general framework for latent space models. The paper defines a “social space” using a set of unobserved latent variables which represent the likelihood of transitive relationships. The probability of an edge depends on the Euclidean distance between the two vertices in this social space. Positions in social space are estimated using logistic regression and Markov Chain Monte Carlo (MCMC).

The models from the SNA literature tend to focus on very small graphs; issues of complexity and scalability are not considered in detail. They also tend to have a very large number of parameters. While this gives them expressive power, in general they are too complex for analytical treatment; the graph properties which arise are not amenable to mathematical proofs. Where the models fail to reproduce some characteristics of real graphs, it is hard to understand why.

The Random Dot Product model [187] is similar to the models above, in that edges are dependent on vertex attributes, but unlike them it is amenable to analytical treatment. Each vertex is assigned a vector of numeric attributes Wv ∈ R^d; the probability of an edge is equal to the dot product of the two endpoints' vectors. Proofs are provided for the small diameter, clustering and power-law degree distribution properties.
Figure 2.10: The Multiplicative Attribute Graph (MAG) Model [109, 111]: (a) combining attribute values with affinity matrices to give edge probability pij; (b) affinity matrices for different graph structures
However, the model is statistically uninteresting: there is no method to fit the model to real graphs, and no empirical results are given.

2.4.8 The MAG Model

The Multiplicative Attribute Graph (MAG) model [111] is a bold attempt to combine the mathematical rigour of the models of Sects. 2.4.1–2.4.6 with the statistical power of the latent space models of Sect. 2.4.7. MAG is parameterised by the number of vertices N, the number of vertex labels k, a set of prior probabilities for vertex attribute values {µa} : 1 ⩽ a ⩽ k, and a corresponding set of affinity matrices {Θa} which specify the probability of an edge conditioned on the vertex labels.

The calculation of the edge probability for each dyad ⟨i, j⟩ is illustrated in Fig. 2.10a. The values of each vertex attribute, LV(i, la) and LV(j, la), are used to dereference the row and column of affinity matrix Θa. The corresponding matrix elements are multiplied together to obtain the edge probability pij. Vertex attributes take binary values, so the affinity matrices are 2 × 2 matrices. As we show in Chapter 5, this can be generalised to categorical labels which take an arbitrary number of discrete values.
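A minimal sketch of this edge-probability calculation, assuming NumPy; the prior and affinity values are hypothetical, and a single binary attribute is used for brevity:

    import numpy as np

    rng = np.random.default_rng(0)

    # One binary vertex attribute with (hypothetical) prior mu and a single
    # 2 x 2 affinity matrix; high diagonal values encode homophily.
    mu = 0.6
    Theta = [np.array([[0.8, 0.3],
                       [0.3, 0.2]])]

    N, k = 100, len(Theta)
    labels = (rng.random((N, k)) < mu).astype(int)

    def edge_probability(i, j):
        # Multiply one affinity entry per attribute, indexed by the two
        # endpoints' attribute values (Fig. 2.10a).
        p = 1.0
        for a in range(k):
            p *= Theta[a][labels[i, a], labels[j, a]]
        return p

    A = np.array([[rng.random() < edge_probability(i, j) if i != j else False
                   for j in range(N)] for i in range(N)])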
Fig. 2.10b shows how the values in the affinity matrices give rise to different graph structures. High probabilities on the diagonal create assortative mixing (homophily). Low values on the diagonal with high values in the top-right and bottom-left quadrants create disassortative mixing (heterophily); extreme heterophily gives rise to a bipartite graph. Where the affinity for 0 is much higher than for 1, this creates a core-periphery structure, or assortative mixing by degree [142]. Where there is uniform affinity, this creates a random ER graph.

The proofs in the paper are given for a simplified version of the MAG model, where the priors µa and affinity matrices Θa are the same for all attributes, the graphs are undirected (the top-right and bottom-left quadrants of Θa are the same, as in Fig. 2.10), and the graphs exhibit a core-periphery structure (α > β > γ). Proofs are provided for connectedness and the existence of a giant component, small diameter and the DPL. The expected degree distribution is highly dependent on µ and Θ, but for certain bounds of µ and Θ, a log-normal degree distribution will arise. An example is given of parameters that will give rise to a power-law distribution. [111] proves that Kronecker graphs are a special case of MAG graphs: it is possible to specify a Kronecker graph in the MAG model if N = 2^k and each vertex has a unique combination of attribute values.

[111] also provides an empirical study to show that MAG can accurately model the degree distribution, singular value/singular vector distribution, triad participation and hop-plot of a real-world Yahoo!-Flickr graph. It is not possible to model the clustering coefficient with simplified MAG, as core-periphery structure is insufficient to give rise to clustering, but general MAG can model clustering. [109] provides an algorithm for learning the affinity matrices from real graphs using a variational approach and Maximum Likelihood Estimation (similar to the latent space approach of [14]). The Latent Multi-group Membership Model (LMMG) [110] is an extension to MAG which introduces a layer between the attribute values and the affinity matrices to indicate latent group memberships.

The MAG model is a very significant development, as it has succeeded in bringing mathematical rigour and statistical significance together in one approach. However, the model is not without flaws. One result in [111] proves that the optimal number of attributes is k = ρ log N for some constant ρ; i.e., there is a dependency between the number of attributes in the model and the size of the graph that will be generated. Our own experiments revealed that the average degree differs significantly between an input graph and a generated graph with the same number of vertices. MAG also assumes that the vertex attributes are independent of each other; this assumption is not always justified.
Simplified MAG has just six parameters: no. of vertices N, no. of labels k, label prior µ and affinity matrix Θ(α, β, γ).
We will study MAG further in our experiments in Chapter 5.
2.5 conclusion
In this chapter we looked at the global properties found in real-world graphs, many of which are characterised by a power-law distribution. This led us into a study of the models which try to explain how these properties arise. These are broadly divided into “mechanistic” models which are analytically tractable, and stochastic models which are statistically interesting. The MAG model attempts to bring these two strands together. In Chapter 5, we develop these ideas and propose our own generative model for labelled, weighted graphs. In the next chapter, we will survey approaches to mining graphs for patterns and anomalies.
3 GRAPH MINING
3.1 introduction
Following on from our study of graph properties and models, we now turn our attention to graph mining: algorithms to discover interesting patterns in graph datasets. We begin with a survey of approaches to substructure mining in Sect. 3.2. Next, we look at departures from known patterns. Identifying unusual patterns has great practical value: for example, to discover accounting fraud [26, 137], criminal networks [51], network intrusions [133, 164, 169], suspicious activity in physical environments [32], and countless other applications [50]. In Sect. 3.3, we give an overview of approaches to discovering graph anomalies. Finally, in Sect. 3.4 we introduce the graph datasets that we will use for our experiments in Chapters 5–7.
3.2 substructure mining
In Chapter 2, we considered the patterns observable in real-world graphs at a global level. The goal of frequent subgraph mining is to discover the most common repeating patterns or motifs in SLGs (graphs with a giant component) or in graph databases, which are collections of subgraphs [185] or transactions [98, 121]. The patterns discovered by frequent pattern mining can be used for concept learning, classification, clustering [58], anomaly detection [67] or to compress the graph structure [58]. Application areas include protein structure analysis [58], predictive toxicology [160], circuit analysis [58], software defect localisation [73], insider threat detection [69], the study of cattle movements [100] and many others.

A substructure is a connected subgraph. One of the earliest approaches to substructure mining was the Subdue system [57, 58, 94]. In its earliest incarnation in the late 1980s, Subdue selected the “best” substructures based on four psychologically-motivated heuristics: cognitive savings, connectivity, compactness and coverage [94]. Later refinements replaced these heuristics with the Minimum Description Length (MDL) principle, discussed in Sect. 3.2.2.
Figure 3.1: Existence of an isomorphism between graphs G and H [37]: (a) Graph G; (b) Graph H; (c) Isomorphism f : VG → VH, where f(a) = 1, f(b) = 6, f(c) = 8, f(d) = 3, f(g) = 5, f(h) = 2, f(i) = 4, f(j) = 7
Another approach to substructure discovery arose independently from the data mining community. Methods for finding frequent itemsets in databases began to appear in the 1990s with Association Rule Mining (ARM), exemplified by the Apriori algorithm [11]. A typical application area is supermarket basket analysis, where ARM can find surprising correlations—such as beer and nappies commonly being purchased together on a Friday evening [150]. It was not long before the idea of finding frequent itemsets in databases was extended to frequent patterns in structured data. Although the methods are analogous, substructure discovery in graphs is much more complex than itemset mining, in particular because of the graph isomorphism problem.

Frequent substructures are usually defined as those which pass some minimum support threshold [98, 121, 143, 185]. Having found many candidate substructures, we need to compare them for identity—or isomorphism—in order to measure their support. Graph Isomorphism (GI) is illustrated in Fig. 3.1. At first glance, the graphs in Figs. 3.1a–3.1b appear to be different, but in fact an isomorphism exists (Fig. 3.1c). We define GI formally as follows:

definition 28: An isomorphism between two graphs G and H is a bijection between their vertex sets f : VG → VH such that ∀u, v ∈ G:

⟨u, v⟩ are adjacent in G ⇐⇒ ⟨f(u), f(v)⟩ are adjacent in H    (3.1)
If an isomorphism exists, we say the graphs are isomorphic and denote this as G ≃ H. If G and H are in fact the same graph, then the isomorphism is an automorphism, G = H.

GI belongs to an unusual class of problems which are in NP, but are not known to be in P or to be NP-complete [82]. The best proven worst-case complexity for the general case is e^O(√(N log N)) [136], though most real-world instances can be solved in polynomial time. The interested reader is referred to [82] for an introduction and to [136] for the state-of-the-art. Suffice to say that GI is the most computationally expensive part of the discovery process.

Graph database mining algorithms usually measure the support of each subgraph pattern as the number of graph transactions which contain it. This is the subgraph isomorphism problem:

definition 29: A subgraph isomorphism exists between graphs S and G if

∃G′ : G′ is a subgraph of G, S ≃ G′    (3.2)
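Defs. 28 and 29 can be tested directly with the networkx isomorphism routines (based on the VF2 algorithm); a small sketch with toy graphs:

    import networkx as nx
    from networkx.algorithms.isomorphism import GraphMatcher

    G = nx.cycle_graph(4)
    H = nx.relabel_nodes(G, {0: "a", 1: "b", 2: "c", 3: "d"})
    print(nx.is_isomorphic(G, H))              # True: Def. 28, G isomorphic to H

    # Def. 29: is S isomorphic to some (induced) subgraph of a larger graph?
    S = nx.path_graph(3)
    gm = GraphMatcher(nx.karate_club_graph(), S)
    print(gm.subgraph_is_isomorphic())         # True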
Subgraph Isomorphism has been proved to be NP-complete [82], so algorithms which rely on it [98, 121] are severely constrained in the size of the graph transactions that they can handle.

3.2.1 Frequent Subgraph Mining

Early approaches to frequent substructure discovery were based on Apriori-style itemset mining [11]. In each iteration, candidate subgraphs are generated using Breadth-First Search (BFS) and tested against a minimum support criterion. The support of each candidate is equal to the number of transactions which contain it. Support is an anti-monotone property, i.e. the support of a graph cannot exceed that of its subgraphs.

definition 30: Anti-monotone constraint. A constraint c is a Boolean predicate which any subgraph G0 ⊆ G must fulfil. c is said to be anti-monotone if it satisfies the downward closure property:

∀G1 ⊂ G0 : c(G0) =⇒ c(G1)    (3.3)
In ARM, itemsets are sorted into lexicographic order for comparison. In frequent subgraph mining, subgraphs are given canonical labels, which involves finding the set of automorphisms and determining
which is “least” [82]. Subgraphs with identical canonical labels are isomorphic.

AGM (Apriori-based Graph Mining) [98] generates induced subgraphs GS = (VS, ES) from the database transactions, which have the property ES = E ∩ (VS × VS). AGM starts with the set of all 2 × 2 adjacency matrices and grows them by one vertex in each iteration, by joining adjacency matrices with a common core. Substructures which fail to meet a user-defined minimum support threshold at the end of each iteration are discarded.

FSG (Frequent Subgraphs) [121] is a similar approach, which grows candidate substructures one edge at a time. FSG tries to reduce the number of subgraph isomorphism tests required by using canonical labels and maintaining an embedding list of which graph transactions support each candidate substructure [121]. gFSG [124] is an extension to FSG for graph databases where vertex coordinates are available, such as databases of chemical compounds. Substructure discovery is constrained by the relative positions of the vertices, to find frequent geometric subgraphs that are rotation, scaling and translation invariant.

The main weakness of AGM and FSG is that candidate generation is very expensive. Although they avoid directly testing for GI by using canonical labels, the substructure expansion step generates a large number of redundant candidates which all have to be labelled, and canonical labelling is itself in the same complexity class as GI. Therefore, these approaches work with a database of small graph transactions, but do not scale up to larger subgraphs.

MoFa (Molecule Fragment Miner) [39] was developed to mine chemical databases but can be applied to other kinds of graph database. The number of redundant candidates is reduced by restricting the extension operation to subgraphs which actually exist (using an embedding list). MoFa also uses structural pruning and background knowledge to reduce the support computation. However, MoFa still generates many duplicates, resulting in unnecessary support calculations.

gSpan (Graph Substructure Pattern) [185] avoids the problem of generating redundant canonical labels. Canonical labels are determined by the minimum representation of vertex orderings as discovered by a Depth-First Search (DFS), and the labels are organised into a hierarchical spanning tree. Frequent substructures are discovered by traversing this tree and checking for substructures which exceed minimum support.
FFSM (Fast Frequent Subgraph Mining) [96] aims to further reduce the complexity of candidate enumeration by using a different canonical form, based on the maximal ordering of the adjacency matrix (rather than the minimal ordering used by AGM, FSG and gSpan). All possible subgraphs are arranged in a suboptimal tree where each vertex is enumerated by a join or an extension operation. FFSM uses an embedding list to remove the need for explicit subgraph isomorphism tests.

GASTON (Graph/Sequence/Tree Extraction) [143] streamlines the search by searching first for sequences (paths), then trees, then graphs. It is based on the intuition that (at least in molecular databases) the most frequent substructures are free trees. Thus simpler algorithms are used for sequence and tree mining, and the more expensive subgraph isomorphism step is only invoked when frequent trees are joined into subgraphs. GASTON uses an embedding list to grow only patterns which actually appear.

Some extensions to gSpan aim to reduce redundancy in the results. CloseGraph [186] mines only the set of closed subgraphs: a subgraph GS is closed if there is no proper supergraph of GS which has the same support as GS. Several pruning methods are introduced which reduce the size of the result set and improve performance over gSpan and FSG. SPIN (Spanning Tree-based Maximal Graph Mining) [97] has a similar goal, mining only maximal subgraphs—those which are not part of any other frequent subgraph. SPIN mines all the frequent trees in the graph, then reconstructs all maximal subgraphs from the mined trees. The authors report improved performance over FFSM and gSpan. MARGIN [175] is another maximal subgraph mining algorithm. Using the intuition that maximal subgraphs are frequent subgraphs with infrequent children, MARGIN searches along the “border” between frequent and infrequent subgraphs, reducing the number of candidate patterns.

With the exception of MoFa, the algorithms discussed above are designed for undirected graphs only. FFSM relies strongly on symmetric adjacency matrices, so cannot be used for directed graphs. GASTON's rules for uniquely constructing all sequences and trees cannot be used for directed graphs without major changes. gSpan can be extended to cope with directed graphs [100, 108].

[38] generalises the DFS-based canonical form found in gSpan, FFSM and GASTON into a family of canonical forms based on systematic
ways to construct spanning trees. This paper demonstrates that the BFS-based canonical form used by MoFa [39] is a member of the same family, and goes on to exploit it in the same way, leading to an improved version of MoFa.

The relative performance of gSpan, FFSM, GASTON and MoFa for chemical substructure mining is compared in [183]. FFSM's approach of growing candidates by joining was found to be cheap compared to the extending method used by gSpan, GASTON and MoFa. Most computation time is spent in support computation or in calculating embedding lists. Embedding lists were found to be advantageous only where the mined fragments became large, when subgraph isomorphism becomes very expensive. Using canonical forms to detect duplicates is more efficient than explicit GI tests. GASTON's approach is better still, as it avoids generating duplicate subgraphs (at least for non-cyclic subgraphs).

Due to their heritage from database mining, the frequent subgraph mining algorithms cannot be applied to SLGs. This is partly due to issues of scalability, and partly because support is calculated from the number of transactions which contain the pattern (rather than from the number of instances of the pattern). Algorithms for searching SLGs are considered in the following sections.

3.2.2 Compression-based Substructure Mining

Compression-based substructure mining evaluates substructures using the MDL principle rather than frequency. As compression-based algorithms do not count support, they can be applied to SLGs as well as to graph databases. The MDL principle [158, 159] states that the best theory to describe a set of data is the one which minimises the Description Length (DL) of the entire data set; in other words, the most frequently occurring patterns get encoded by the shortest strings. This was applied to the Subdue substructure discovery system in [57]. The DL of a graph is equal to vbits + rbits + ebits, where vbits is the number of bits required to encode the vertex labellings, rbits is the number of bits required to encode the adjacency matrix and ebits is the number of bits required to encode the edge labellings:
vbits = log2 |V| + |V| log2 |LV|    (3.4)

rbits = (|V| + 1) log2 (b + 1) + Σ_{i=1}^{|V|} log2 C(|V|, ki)    (3.5)

where ki is the number of 1s in row i of the adjacency matrix, b = max_i ki, and C(n, k) denotes the binomial coefficient; and

ebits = e(1 + log2 |LE|) + (K + 1) log2 m    (3.6)

where e is the number of edge partitions in the graph, K is the number of 1s in the adjacency matrix and m is the maximum number of edges in any partition. As an example, consider the graph in Fig. 2.2a, which has 8 vertices, 3 vertex labels and an 8 × 8 adjacency matrix. In this encoding scheme, undirected edges are recorded above the diagonal only, so the adjacency matrix becomes:
        0 1 1 0 0 0 0 0
        0 0 1 0 0 0 0 0
        0 0 0 1 1 1 0 0
A =     0 0 0 0 0 0 0 0
        0 0 0 0 0 1 1 1
        0 0 0 0 0 0 1 1
        0 0 0 0 0 0 0 1
        0 0 0 0 0 0 0 0
Now k = {2, 1, 3, 0, 3, 2, 1, 0}, b = 3 and K = 12. Then:

vbits = log2 8 + 8 log2 3 = 15.68 bits

rbits = 9 log2 4 + log2 C(8,2) + log2 C(8,1) + log2 C(8,3) + log2 C(8,0) + log2 C(8,3) + log2 C(8,2) + log2 C(8,1) + log2 C(8,0) = 27.23 bits

ebits = 1(1 + log2 1) + 12 log2 12 = 44.02 bits

DL = 15.68 + 27.23 + 44.02 = 86.93 bits

Note that this encoding scheme assumes a single vertex label and a single edge label. As Fig. 2.2a does not have labelled edges, we assume that all edges are labelled with the same value.
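The worked example can be reproduced in a few lines of Python (a sketch, not the Subdue implementation):

    from math import comb, log2

    # Upper-triangular adjacency matrix of Fig. 2.2a, as reconstructed above.
    A = [[0,1,1,0,0,0,0,0],
         [0,0,1,0,0,0,0,0],
         [0,0,0,1,1,1,0,0],
         [0,0,0,0,0,0,0,0],
         [0,0,0,0,0,1,1,1],
         [0,0,0,0,0,0,1,1],
         [0,0,0,0,0,0,0,1],
         [0,0,0,0,0,0,0,0]]
    V = len(A)
    k = [sum(row) for row in A]                     # {2, 1, 3, 0, 3, 2, 1, 0}
    K = sum(k)                                      # 12

    vbits = log2(V) + V * log2(3)                   # 15.68 bits (Eq. 3.4)
    rbits = sum(log2(comb(V, ki)) for ki in k)      # 27.23 bits: the summation
    # term of Eq. 3.5 (this matches the worked example's total; including the
    # (|V|+1) log2(b+1) term as well would add a further 9 log2 4 = 18 bits)
    ebits = 1 * (1 + log2(1)) + K * log2(K)         # 44.02 bits
    print(f"DL = {vbits + rbits + ebits:.2f} bits") # 86.93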
When Subdue finds a substructure GS, all instances of the substructure in the input graph are replaced with a single vertex, to give the compressed graph G|GS. The best substructure is the one which minimises the DL of the graph:

argmin_GS [DL(GS) + DL(G|GS)] / DL(G)    (3.7)
By iteratively compressing and finding new substructures in the compressed graph, Subdue produces a hierarchical description of the structural patterns in the graph.

Unlike the frequent subgraph mining approaches, which exhaustively search all substructures meeting the minimum support criterion, Subdue employs a greedy beam search strategy [58] (a limited-length queue of the best few patterns found so far). The advantage is that it becomes possible to search much larger graphs. However, DL is not an anti-monotone property: [57] notes that once the DL of an expanding substructure begins to increase, further expansion of the substructure typically does not yield a smaller DL, but this is not guaranteed. By pruning candidates with an increasing DL, some interesting patterns could be missed, especially if the beam width is too narrow.

Subdue has some additional heuristics to improve the evaluation of candidate substructures. Inexact matching is allowed by assigning a cost to distortions such as the deletion, insertion or substitution of vertices or edges [95]. Domain-specific background knowledge can be incorporated, using a set of rules to boost known substructures, such as carbon rings in organic chemistry [57].

GraphScope [173] uses a compression-based approach to search for communities and discontinuity time-points in dynamic (time-evolving) bipartite graphs. GraphScope takes “snapshots” of the graph at regular intervals and reorders the adjacency matrix to group communities of similar vertices together. If the communities do not change much over time, consecutive snapshots have a similar DL and are grouped into the same segment. If a new snapshot cannot fit into the segment (in terms of compression), GraphScope introduces a change point and a new segment is started.

The MDL principle is also used for anomaly detection in graphs: GBAD [67] and Autopart [47] are discussed in Sect. 3.3. Frequent subgraph mining and compression-based approaches are compared in [66]. Subdue and GASTON were able to discover the
same patterns. For graph databases, GASTON was found to outperform Subdue, but Subdue was able to handle large and complex graphs which were intractable for GASTON.

3.2.3 Substructures with Overlapping Instances

Sometimes multiple instances of the same subgraph overlap within one graph. Consider the example in Fig. 3.2, which has only one instance of vertex A but contains four instances of the substructure defined by E = {⟨A, B⟩}. As the size of the candidate substructures is expanded, the number of instances can go up or down: there are six instances of {⟨A, B⟩, ⟨A, B⟩}, four instances of {⟨A, B⟩, ⟨A, B⟩, ⟨A, B⟩} and of course only one instance of the substructure equal to the whole
B A
B
graph. In other words, where instances of the same substructure are allowed to overlap, the count of instances violates the anti-monotone property (Def. 30).
B
B
Subdue’s default behaviour is to ignore instances which overlap, which is computationally efficient but will down-rank substructures with overlapping instances. The software implementation [6] has an option
Figure 3.2: Overlapping instances
to permit overlaps: during the compression phase, overlapping instances are replaced with one duplicate vertex per instance, connected by overlap edges [146]. An alternative to Subdue’s all-or-nothing overlap scheme is to allow the user to specify which vertices are allowed to overlap [146]. This approach incorporates domain knowledge to boost relevant substructures, mitigating the cost by ignoring overlapping instances of irrelevant substructures. The literature on Subdue [57, 58, 95, 146] does not discuss the fact that the count of overlapping instances is not anti-monotone. The overlap problem is not a concern for the frequent subgraph miners of Sect. 3.2.1, as support is based on the number of graph transactions which contain an instance, rather than the number of instances. However, to search in a SLG, each instance must contribute independently to the support of the substructure. siGram [123] modifies the frequent subgraph mining approach for SLGs in two variants for BFS and DFS candidate search. They define the set of edge-disjoint embeddings as the instances of a substructure which may share vertices but do not share any edges. The overlap graph of a subgraph GS is obtained by creating a vertex for each instance of GS and an edge
3.3 graph-based anomaly detection
between each pair of non-edge-disjoint instances. The support of each subgraph is the size of the Maximum Independent Set (MIS) of the overlap graph. (An independent set is a set of vertices, none of which are adjacent; the MIS of a graph G is the largest independent set in G. A short sketch of this support measure appears at the end of this section.) By counting only instances which are edge-disjoint, the anti-monotone property is preserved, but at the cost of completeness: the support of substructures with overlapping instances may be reduced below the minimum support threshold. Grew [122] is a related heuristic approach by the same authors, extending FSG to search SLGs using vertex-disjoint embeddings. Grew uses an approximate method to find the MIS, so returns fewer patterns than siGram. [80] argues that a vertex-disjoint overlap graph is too restrictive, and defines a harmful overlap as one where the ancestors of each instance are equivalent. Support is counted as the size of the MIS of the Harmful Overlap (HO) graph. The paper proves that HO-support is anti-monotone. [89] proves the necessary and sufficient conditions for instance-counting support measures to maintain the anti-monotone property, and presents an Apriori-based algorithm similar to [123], which also counts support based on edge-disjoint instances. [46] builds on the proofs of [89] to provide a complete set of proofs for homomorphisms, isomorphisms and homeomorphisms on labelled and unlabelled, directed and undirected graphs, for vertex and edge overlaps. (A homomorphism is a mapping which preserves vertices and edges but not non-edges and has applications in social networks; a homeomorphism is an isomorphism in the category of topological spaces and has applications in the study of biological networks [46].) The MIS is proved to be the minimal anti-monotonic overlap support measure; they also introduce the Minimum Clique Partition measure (MCP), which they prove to be the maximal anti-monotonic overlap support measure. In between those measures is the Lovász measure, which is also proved to be anti-monotonic. This is significant because the Lovász measure is computable in polynomial time, whereas computing the MIS and MCP is NP-hard. So far in this chapter, we have covered the main approaches to substructure discovery in graph databases and SLGs. However, these approaches consider only discrete labels. Many graph datasets, including those presented later in this chapter, also have numeric labels. In Chapter 4, we will look at approaches to substructure discovery in graphs with numeric attributes.
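As promised above, here is a minimal sketch of edge-disjoint support counting via the overlap graph. Computing the exact MIS is NP-hard, so networkx's greedy maximal_independent_set (a maximal, not maximum, set) is used, which only lower-bounds the true support; treat this as an illustration rather than the siGram algorithm:

```python
import networkx as nx

def edge_disjoint_support(instances):
    """Approximate SLG support of a substructure from its embeddings.
    `instances` is a list of edge sets, one per embedding (a simplifying
    assumption for illustration). Overlap-graph vertices are the instances;
    edges join instances sharing at least one graph edge."""
    og = nx.Graph()
    og.add_nodes_from(range(len(instances)))
    for i in range(len(instances)):
        for j in range(i + 1, len(instances)):
            if set(instances[i]) & set(instances[j]):   # not edge-disjoint
                og.add_edge(i, j)
    return len(nx.maximal_independent_set(og))
```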
3.3 graph-based anomaly detection
Approaches to graph-based anomaly detection are broadly divided into macroscopic approaches (detecting anomalous regions in the graph) and microscopic (substructure-based) approaches.
Early approaches were supervised and domain-specific: a human expert defines fraudulent behaviour, intrusions or other anomalous patterns, and the system searches for those specific patterns. GrIDS [169] constructs "activity graphs" (graph transactions) to track the progress of a worm as it infects multiple network hosts; intrusions are detected using rules-based pattern matching. Visual Analytics [133] has also been used to detect network intrusions. Fraudulent trading in online auction websites was detected using Belief Propagation (BP) over a Markov Random Field (MRF) to combine user-level features (no. of transactions, average price of goods exchanged) with network-level features (who interacts with whom) [51]. Criminal gangs tend to emerge as bipartite cores of fraudsters and accomplices who boost their online reputations. These supervised methods are useful for the domains for which they are defined, but cannot detect previously-unknown types of anomaly and may not expose knowledge hidden within the network structure. Unsupervised approaches to graph-based anomaly detection began to emerge around ten years ago. In [132], paths and loops which occur infrequently are discovered using "rarity analysis"; the rarity of a path is defined as the reciprocal of the number of similar paths. Vertex significance is measured based on the number of rare paths connecting each vertex pair. A number of researchers use statistical measures to identify graph anomalies. [63, 152, 178] split the Enron graph into transactions based on temporal windows of one day to one month in duration. [63] calculates standard graph statistics (Betweenness Centrality, Closeness Centrality, Eigenvector Centrality, etc.) on each transaction. The graphs became denser, more centralised and more connected during the Enron crisis. [152] uses a form of time-series analysis; statistically significant effects such as excessive activity are detected using "scan statistics" calculated on each transaction. Statistical methods have also been used to identify anomalous vertices and edges in a graph. In [178], a set of vertex and edge features is defined; anomalies are detected based on statistical deviation from the patterns for each individual and the patterns for the cluster they belong to. In [155], the authors discover anomalous edges using the Katz measurement [106], a statistic for computing the influence of an actor in a social network, which had previously been used for link prediction [131].
The Enron e-mail dataset (§3.4.2) is referenced by a number of the papers reviewed here.
graph entropy”) and statistical techniques to determine the most important vertices in the Enron graph. Autopart [47] is another information-theoretic approach which discovers anomalous edges as a means to cluster the vertices. The intuition is that edges which deviate from normal patterns will increase the cost of compressing the graph, so removing anomalous edges will minimise the DL. [172] also uses graph partioning, this time to find anomalous vertices, but only in bipartite graphs: a vertex in V2 is an anomaly if it links to two vertices in V1 that do not belong in the same partition. Two approaches to graph-based anomaly detection [68, 144] are based on the Subdue substructure discovery system [58] (Sect. 3.2.2): In [144], each substructure in a SLG is evaluated based on its size and number of instances; substructures with a low score are flagged as anomalous. The weakness of this method is that it requires an exhaustive search of substructures up to a given size in order to detect anomalies, which will be intractable for many real-world graphs. [144] also presents a method to detect anomalous subgraphs in a graph transaction database. Using the intuition that subgraphs which contain anomalous patterns will have a higher DL after compression, each subgraph is compressed with Subdue over multiple iterations, and evaluated according to how much and how soon it is compressed. We adopt this approach for Yagada in Chapter 7. This paper also presents measures of graph regularity based on substructure entropy and conditional substructure entropy. GBAD [68] is a suite of three anomaly detection algorithms, also based on Subdue: GBAD-MDL is an information-theoretic approach to detect anomalous graph modifications; GBAD-P is a probabilitic approach to detect anomalous insertions; and GBAD-MPS is a Maximum Partial Substructure-based approach to detect anomalous deletions. GBAD’s definition of an anomaly is an unexpected deviation from a normative pattern, rather than an unexpected substructure. Our survey has covered approaches to detecting anomalies in graphs with discrete labels. In Chapter 4, we will discuss anomaly detection in graphs with numeric attributes, such as those presented in the next section.
Figure 3.3: Graph of social interactions during the PAL Scheme study
3.4 graph datasets
In this section, we introduce the real-world datasets which we used for our investigations and the experiments in Chapters 5–7. All of these graphs have discrete labels and weights or numeric labels. Two of the graphs are social networks, one is an information network and two are graph databases constructed from sensor networks.

3.4.1 PAL Scheme Social Network

The PAL Scheme study [Hunter 11] was a randomised controlled trial by researchers at the Centre for Public Health at Queen's University Belfast, to measure the effect of incentives on participation in physical activity. Over 400 civil service employees took part in the study. Seven Near-Field Communication (NFC) sensors were laid out in a public park near the participants' place of work to create a network of walking routes. Participants were issued with a "PAL Card", embedded with a passive Radio Frequency Identification (RFID) tag, allowing them to log their activity by swiping at the sensors when they went for a walk in the park. We reasoned that if two people frequently present their cards to the same sensor within a few seconds of each other, this implies a social relationship. A valid path through the network consists of at least two separate sensors. We counted the number of times that each pair of participants swiped a sensor within 15 seconds of each other on the
same day. If this occurred only once, it was dismissed as noise. Otherwise, it was counted as a social relationship (an edge in the social graph). The edges are weighted with the number of days on which the participants went walking together. The resulting PAL Scheme social graph is shown in Fig. 3.3 and its characteristics are in Table 3.1. The PAL Scheme graph has a giant component (the large cluster of people who are connected to each other) and 73 smaller components. Most of the smaller components are disconnected vertices; individuals who either preferred to exercise alone or who dropped out of the study at an early stage. We used the modularity algorithm from [34] to identify 95 communities, which were used to colour the graph. Although the PAL Scheme is a small social network, its rich feature set makes it interesting. Each participant completed a number of questionnaires before, during and after the study, giving a set of around 50 demographic and behavioural attributes, which we attach to the graph as vertex labels (Table 3.1). In Chapter 5, we use these attributes to investigate the relationship between discrete vertex labels and graph structure. Another interesting feature of the PAL Scheme graph is that we can analyse how it develops over time. The study was conducted over 12 weeks, so we were able to construct daily snapshots of the social relationships on each day and visualise them as a dynamic graph.

3.4.2 Enron Social Network

During the investigation that followed the collapse of Enron [137, 63], the US Federal Energy Regulatory Commission ordered that the e-mails of over 150 senior employees from the period 1998–2002 be made public. As e-mail collections on this scale are usually private, this is a unique dataset for study [115]. We created our graphs from the August 21, 2009 version of the Enron corpus [56]. The dataset contains ≈ 0.5 million e-mails, but most of these are communications between Enron employees and external e-mail addresses. As we have no information about what communications took place outside the Enron network, we restricted our analysis to the 30,000 messages that were exchanged between the 159 identifiable Enron employees. We identified these individuals and their job roles from [166]. Fig. 3.4 shows the "who communicates with whom" social graph created from the Enron dataset and Table 3.2 gives the characteristics
A video showing the dynamic evolution of this network is online at http://goo.gl/sCjTf
Graph type: Single large graph. Structure: Unipartite. Edge type: Undirected. Self-cycles: No. Average Degree: 3.53. Average Strength: 43.96. Effective Diameter: 11. Clustering Coefficient: 0.34. Average Path Length: 4.17.

Vertices: 324 elements. Labels: PAL Group; Gender; Age; Education; Occupation; Staff Grade; Employment Status; Smoker?; Drink alcohol?; Alcohol Frequency {T1,T2,T3}; Body Mass Index (BMI) {T1,T2,T3}; BMI Category {T1,T2,T3}; Physical Activity recommended? {T1,T2,T3}; Physical Activity Category {T1,T2,T3}; Metabolic Equivalents (METS) per week {T1,T2,T3}; Minutes per week {T1,T2,T3}; Self Eff. Mean {T1,T2,T3}; EQ5D State {T1,T2,T3}; EQ5D Index {T1,T2,T3}; MCS8 {T1,T2,T3}; PCS8 {T1,T2,T3}; Sick Days {T1,T2,T3}; Annual Leave Days.

Edges: 571 elements. Labels (Numeric): Number of social interactions.

Table 3.1: Characteristics of the PAL Scheme Social Graph (Fig. 3.3). {T1,T2,T3} means the attribute was measured at three time points during the study. EQ5D, MCS8 and PCS8 are scores from standardised questionnaires used to measure health outcomes.
Figure 3.4: Social graph of who communicates with whom in the Enron corpus
Graph type: Single large graph. Structure: Unipartite. Edge type: Directed. Self-cycles: Yes. Average Degree: 17.51. Average Strength: 506.58. Effective Diameter: 5. Clustering Coefficient: 0.39. Average Path Length: 2.32.

Vertices: 159 elements in 9 partitions. Labels (Discrete): Employee role.

Edges: 2767 elements in 58 partitions. Labels (Numeric): Number of e-mails.

Table 3.2: Characteristics of the Enron Social Graph
of this graph. Vertices are people, labelled with their employee role. Edges indicate messages exchanged between people: the edges are directed to indicate sender and recipient and weighted to indicate the number of messages exchanged. In Chapter 5, we use the Enron social graph to investigate the relationship between discrete vertex labels and graph structure. This dataset is used to show that our findings hold for directed graphs (comparing with the PAL Scheme graph, which is undirected). The Enron social graph is also used as the directed, unipartite graph for the substructure mining experiments in Chapter 6.

3.4.3 Enron Information Network

An alternative view of the Enron data is to consider it as an Information Network. In this representation, the focus is on the structure and content of the information flows. We represent both people and e-mail messages as a bipartite graph with over 30,000 vertices. Employee vertices are labelled with the name of the employee. Message vertices are labelled with numeric features of the message: the size of the message in bytes or the time it was sent (seconds since midnight). A small excerpt illustrating the graph structure is shown in Fig. 3.5 and the characteristics of the complete graph are given in Table 3.3. This graph provides a richer representation for data mining, and it is considerably larger than the Enron social graph. We use it for the directed, bipartite case in our substructure discovery experiments in Chapter 6. Another useful property of this dataset is that we can create high-dimensional numeric features from the text of each message, using a "bag of words" approach [129] (§6.4). In Chapter 6, we use this property to investigate the use of high-dimensional numeric features in substructure discovery.

3.4.4 QUB Sensor Network

Our last two datasets are from physical ACSs used for building security. The first of these is from Queen's University Belfast, which has ≈ 800 door sensors installed in buildings around the campus. The transaction log has ≈ 1 million entries, recording the movements of approximately 6,500 students and staff over a period of several months.
Figure 3.5: Fragment of the Enron Communication Graph, a bipartite graph of actors and messages from the Enron corpus (e-mail vertices are labelled with Time sent and Size in bytes)
Graph type: Single large graph. Structure: Bipartite. Edge type: Directed. Self-cycles: No. Average Degree: 3.05. Effective Diameter: 12. Clustering Coefficient: 0. Average Path Length: 5.44.

Vertices (Actors): 159 elements in 159 partitions. Labels (Discrete): Name, Employee Role.

Vertices (Messages): 31,396 elements in 1 partition. Labels (Numeric): Size, Time Sent, "Bag of Words".

Edges: 96,286 elements in 636 partitions. Labels (Discrete): FROM, TO, CC, BCC.

Table 3.3: Characteristics of the Enron Communication Graph
Figure 3.6: Graph of QUB University Campus Access Control System (labelled regions include the Centre for Cancer Research and Cell Biology, Medical Biology Centre, International Research Centre in Experimental Physics, Administration/Directorates, Library, and Faculty Buildings South and Central)
Graph type: Graph Transaction Database. No. of graph transactions: 70,595. Edge type: Directed. Self-cycles: Yes.

Vertices: 554,661 elements in total (maximum 340 per transaction; average 7.8 per transaction), in 468 partitions. Labels (Discrete): Door Sensor ID.

Edges: 5,952,974 elements in total (maximum 57,630 per transaction; average 84.3 per transaction), in 339 partitions. Labels (Numeric): Absolute Time, Elapsed Time.

Table 3.4: Characteristics of the QUB Access Control System Graph Database
Figure 3.7: Adding Forward Edges to Graph Transactions: (a) Chain Graph; (b) Directed Clique
This dataset is visualised as a graph in Fig. 3.6: vertices represent door sensors and weighted, directed edges represent movements between pairs of sensors. Fig. 3.6 shows that there are more sensors and a higher density of transactions in areas with greater security requirements, viz. laboratories for laser, radiation and medical research. To find interesting or suspicious movement patterns, we need to mine the movements of individuals rather than the network as a whole. To investigate this, we organised the data as a graph database, where each graph transaction represents the movement of one person within a 12-hour time period (Fig. 3.7a). The path taken by the user is represented as a sequence, or chain graph. Sometimes vertices are missing from the sequence, for example when following someone through an open door. To account for missing card swipes, we include forward edges (Fig. 3.7b), turning each chain graph into a directed clique, as sketched below. As we shall see in Chapter 6, mining cliques is substantially more complex than mining sequences. The characteristics of the QUB graph database are shown in Table 3.4. The edges in each graph transaction are labelled with two numeric attributes: the absolute time (seconds since midnight) that the user presented their ID card to the door sensor at the end of the path segment; and the elapsed time in seconds to make the journey between the pair of sensors. We use the QUB graph database for the transaction graph mining experiments in Chapter 6.
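A minimal sketch of the forward-edge construction (assuming a path is simply an ordered list of sensor IDs; the function name is ours):

```python
from itertools import combinations

def add_forward_edges(path):
    """Turn a chain of door-sensor swipes into a directed clique (Fig. 3.7)
    by adding an edge from every earlier sensor to every later one, so that
    a missed swipe (e.g. tailgating through an open door) does not break
    the pattern. `path` is the ordered list of sensors one person visited
    within a 12-hour window."""
    return [(a, b) for a, b in combinations(path, 2)]

# add_forward_edges(["d1", "d2", "d3"])
# -> [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]
```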
3.4.5 CEM Sensor Network

Our final dataset was provided by CEM Systems, a company specialising in ACS solutions for building security in airports and other large installations. This data is from an office building, and comprises 155,000 transactions from 256 employees, covering a period of approximately 15 months. The data is visualised as a graph in Fig. 3.8. We added "dummy" Start and End vertices at the beginning and end of each path through the network to infer which sensors were attached to main entrances/exits from the building. The graph visualisation allowed us to determine which sensors were for offices where many people worked, and which were for document or server rooms where access was limited to a few authorised individuals. As with the QUB dataset, we are interested in suspicious movement patterns, so we organised the data as a graph database. The characteristics of the database are given in Table 3.5. This dataset is used for the anomaly detection experiments in Chapter 7. This dataset has advantages and disadvantages compared to the QUB dataset. On the one hand, it is much smaller, with fewer people and fewer sensors. On the other hand, it contains more detailed information: there are sensors on both sides of main doors, allowing us to detect when people exit the building as well as when they come in. In summary, the five datasets cover the cases of undirected, unipartite SLG (PAL Scheme); directed, unipartite SLG (Enron social); directed, bipartite SLG (Enron communication); and directed transaction graph databases (QUB and CEM).
3.5 conclusion
In this chapter, we turned our attention from macroscopic graph patterns to microscopic patterns or substructures. We surveyed the main approaches to substructure mining in both graph databases and SLGs. This forms the background to our approach to substructure discovery in graphs with numeric attributes in Chapter 6. Next, we surveyed approaches to graph-based anomaly detection, starting with macroscopic approaches and finishing with substructure-based approaches. This forms the background to our substructure-based anomaly detection approach in Chapter 7.
Figure 3.8: Graph of CEM Office Access Control System
Graph type: Graph Transaction Database. No. of graph transactions: 37,989. Edge type: Directed. Self-cycles: Yes.

Labels. Vertices (Discrete): Door Sensor ID. Edges (Numeric): Absolute Time, Elapsed Time.

Table 3.5: Characteristics of the CEM Access Control System Graph Database
Finally, we introduced the graph datasets that we use for our experiments. Up to this point, we have focussed on the structural aspects of graph mining and anomaly detection. However, many graph datasets contain weights or numeric attributes. In the next chapter, we will survey the approaches that have been taken to integrate numeric attributes into graph mining. We follow this with a discussion of numeric anomaly detection, which forms the basis of our approach in Chapters 6–7.
N U M E R I C AT T R I B U T E S A N D A N O M A L I E S
4.1 introduction
In the previous chapter, we covered substructure mining and graph-based anomaly detection approaches for unweighted graphs with discrete labels. As we saw in Sect. 3.4, many real-world graphs have vertices or edges with numeric attributes. We begin this chapter with a survey of approaches to integrating numeric attributes into graph mining in Sect. 4.2. Our own approach to integrating numeric attributes into substructure discovery (Chapters 6–7) relies on the relationship between numeric attributes and graph structure, which we discuss in Chapter 5, and the notion of numeric anomalies. Sect. 4.3 gives the necessary background in unsupervised anomaly detection approaches, covering statistical, cluster-based, distance-based and density-based methods. Finally, in Sect. 4.4 we take a look at the problems of detecting outliers in high-dimensional space. We conduct empirical studies on high-dimensional numeric attributes in Chapter 6.
4.2 weights and numeric attributes
We begin this section with approaches to substructure discovery in weighted graphs. This is followed by the more general problem of incorporating an arbitrary number of numeric attributes into substructure discovery. We conclude with a brief survey of anomaly detection in weighted graphs.

4.2.1 Mining Weighted Graphs

[72] searches for frequent substructures in a database of software call graphs, where weighted edges represent the number of calls made to each subroutine. The weights are analysed in a post-processing step. First, the graph transactions are mined using CloseGraph. Then entropy-based feature selection is applied to the result set, yielding an ordered list of edges most likely to contain coding errors. This
approach improves the discriminativeness of the results, but does not have any effect on search performance. The alternative to post-processing the weights is to use them as a constraint (cf. Def. 30) during substructure discovery or as a pre-processing step. [179] propose a method to integrate constraints into frequent subgraph mining. They use anti-monotone constraints to prune the search space and monotone constraints to speed up the evaluation of further constraints. One of the proposed constraints is average weight, which is used to prune vertices and edges with outlier values from the input database as a pre-processing step. Like the approximate mining algorithms of Sects. 3.2.2–3.2.3, mining which uses weight-based constraints for pruning cannot guarantee completeness. [100] extends gSpan by including edge weights into the support calculation. Three weighting schemes are proposed: Average Total Weighting (ATW), the ratio of the average weights of transactions which contain a subgraph over the average weights of all transactions; Affinity Weighting (AW), which evaluates a weighting ratio function on each subgraph; and Utility Based Weighting (UBW), which counts the weighted support of each subgraph. ATW and AW preserve the anti-monotone property. UBW does not preserve anti-monotonicity, so it includes an additional heuristic to control the result set, based on the weighted share of each subgraph (the ratio of the weights in the subgraph over the weights of the transactions in which it occurs). ATW and AW are found to out-perform UBW in terms of both computational requirements and coverage. All three weighting schemes assume that a higher weight indicates greater significance, which is usually the case where the weight indicates the count of an edge occurrence, but not necessarily when the weight is an arbitrary continuous value (§2.2.1). Constraints on weighted graphs are considered within a general framework in [73]. The authors note that while weight-based constraints are not guaranteed to be anti-monotone, in practice there is frequently a correlation between graph structure and weights (our discussion in Sect. 2.4.7 sheds some light on why this is the case). They go on to investigate approximate mining results from non-anti-monotone constraints, by mapping edge weights to a new value using a measure function, then thresholding on these values. The measures evaluated in the paper are Information Gain (an entropy-based measure), Pearson's Product-Moment Correlation Coefficient (PMCC)
and Variance, but the authors note that other measures from statistics and data analysis could also be used. The experiments evaluate the completeness of mining results between unconstrained and constrained versions of gSpan and CloseGraph. One of the results is that graph structure and weights are indeed correlated in real-world graphs. If the definition of the measure function is extended to take multi-dimensional numeric attributes as its input, then the outlier detection step we take in Chapter 5 could be used as a measure function within this framework.

4.2.2 Mining Graphs with Numeric Attributes

The approaches above consider a single numeric label (the weight) in the graph database setting. [101] considers searching a SLG with multiple numeric labels: a transport logistics network, where vertices are labelled with ⟨latitude, longitude⟩ pairs and edges are labelled with pickup date, delivery date, distance, hours, laden weight, and mode of transport. Their approach is to use binning to convert each numeric attribute into a few (7–10) discrete values as a pre-processing step. Substructure discovery is performed using Subdue and FSG. As FSG cannot be applied to SLGs, they partition the graph and treat each partition as a transaction. This approach is weak, as the number of bins has to be determined manually for each attribute; close values can be assigned different labels if they are near a boundary; and although the bins are ordinal values, the authors treat them as categorical. We also note that each attribute is treated independently, though some of them (distance and hours, for example) are clearly correlated. [149] also includes a discretization step before attempting frequent subgraph matching. Several discretization methods are proposed: equal-width or equal-frequency binning, k-means or EM clustering and kernel density estimation. All of these methods require a specification of the number of bins or clusters k, so the authors propose seven methods to estimate k. However, this deals with only the first of the shortcomings mentioned above. [160] is a brief outline of an extension to Subdue to handle numeric attributes. A longer (unpublished) version of the paper [161] notes that there are three ways in which Subdue can handle numeric attributes: exact match (treat numeric attributes as if they are discrete); tolerance match (two numeric values are considered equal if their difference is less than some threshold); and difference match (two numeric
values are considered equal if they are drawn from the same Gaussian PDF). The authors propose to replace these approaches with frequency
binning, similar to [149]. [145] performs frequent subgraph mining on a graph created from the bag-of-words representation of image features. Each vertex is an image feature and edges are labelled with the geometric relationships between the features. The authors also discretize numeric values as a pre-processing step, but they treat numeric label values as a feature vector. Discrete labels are determined by clustering the numeric feature vectors in k-dimensional space. This solves some of the above problems, but close points could still fall into different clusters. In summary, the constraint-based approach has a much stronger theoretical grounding than simple discretisation approaches, but so far weight-based constraints have only been applied in the one-dimensional case. In Chapter 6, we propose a measure based on numeric outlier detection which can be applied to feature vectors of arbitrary length and used as a constraint on mining. We will lay the groundwork for our outlier-based approach in the next section.

4.2.3 Anomaly Detection in Weighted Graphs

OddBall [17] operates on the 1-step neighbourhood (or "egonet" in the parlance of SNA) of each vertex. Relying on the fact that the edge density, vertex strength, ranks and spectral properties of weighted egonets follow a power-law distribution (§2.3.1), outliers are discovered for each of these properties based on the distance to a fitting line and density-based outlier detection (§4.3.4). Several distinctive patterns emerge as anomalies using this method: near-cliques and stars, heavy vicinities and dominant heavy links. As it is based on power-law properties of graphs, OddBall is related to our presentation of Agwan in Chapter 5. However, OddBall ignores vertex and edge labels and cannot be used to mine anomalies in graph databases. For anomalous substructure mining, [70] proposes an extension to GBAD-P (§3.3) to detect anomalies in graphs with numeric labels by including the numeric values into the probability that a particular edge should exist. However, the method presented is for a single attribute only and assumes that the values are normally distributed. We will discuss more robust approaches to numeric outlier detection in the next section, which leads us towards Yagada, our method for detecting anomalies in graphs with numeric attributes, in Chapter 7.
4.3 numeric outlier detection
What is an anomaly? Popular definitions include "an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data" [24] or which "deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism" [91]. While these definitions give an intuitive feel for the notion of anomalousness, they are somewhat vague. The simplest approach to numeric anomaly detection (used in [32]) is to have a threshold τ: if a numeric value x > τ, it is considered anomalous. This method is weak, as there is no clear mechanism for determining τ; it cannot detect anomalies in the middle of the data range, only at the extremes; and the result is a binary "normal" or "anomalous" decision, with no measure of the degree of anomalousness. To address these deficiencies, we will examine four methods of unsupervised numeric anomaly detection: statistical, cluster-based, distance-based and density-based [50]. To give some context to our discussion, we consider one of the numeric attributes from the QUB ACS graph database (§3.4.4), "Time of Day". When a user presents their access card to a door sensor, the time is recorded in the database. When we create the graph database, we add Time of Day as a numeric edge attribute. The activity pattern at each door is different, depending on whether the door is a main entrance, a lift, a laboratory or a storeroom. We also observe a number of different processes at work: staff tend to arrive in the morning and leave in the evening; students may arrive a little later; cleaners may arrive earlier; and security staff may make short, periodic visits through the night. Fig. 4.1a shows six representative Time of Day datasets, partitioned by edges which share the same vertex endpoints. It is clear that apart from the top dataset, the data does not follow a normal distribution. However, we can make a reasonable approximation of the data using a Gaussian Mixture Model (GMM) with K Gaussian distributions (Fig. 4.1b) as described below. Consider a dataset T = {t0, ..., tn}, where each 0 ≤ ti < 86400 represents the absolute time of day (in seconds) that a staff member presented their access badge to a door sensor. Then the probability of observing timestamp ti is:

    P(ti) = Σ_{j=1}^{K} ωj · η(ti, µj, σ²j)    (4.1)
where ωj is the weight (the proportion of the data that is accounted for by this Gaussian) of the jth Gaussian in the mixture and η is the value of the Gaussian probability density function with mean µj and variance σ²j at time ti.

Figure 4.1: Time of Day edge attribute for QUB Access Control Dataset: (a) No. of transactions (10 minute bins); (b) Gaussian Mixture Model
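A mixture like Eqn. 4.1 can be fitted with standard tooling; the sketch below uses scikit-learn's GaussianMixture, with K chosen by hand (an assumption for illustration; as noted later, a Dirichlet Process GMM can instead infer the number of mixtures):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_time_of_day(timestamps, K=4):
    """Fit the mixture of Eqn. 4.1 to one edge's Time of Day values
    (seconds since midnight). K = 4 is an arbitrary guess here."""
    X = np.asarray(timestamps, dtype=float).reshape(-1, 1)
    return GaussianMixture(n_components=K).fit(X)

# density P(t_i) under the fitted model:
# p = np.exp(model.score_samples([[t_i]]))
```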
This model is the basis of the statistical approach to outlier detection discussed in the next section. To provide a framework for our comparison of outlier detection methods, we define a numeric outlier function O which returns some constant value q0 for inliers and some value q ≫ q0 for outliers, where q is a measure of the degree of outlierness. This definition allows us to make a binary test for normality, but also to rank anomalies in terms of their outlier score q.

definition 31: A numeric outlier function O on a dataset D is defined as:

    O : D → R

    ∀d ∈ D : O(d) = q0   if d is "normal" w.r.t. D
                    q    otherwise                       (4.2)
The value of q0 and the range of O depend on the choice of outlier detection function. In graph mining, each numeric label may be generated by a different process (more likely multiple processes), so we cannot assume prior knowledge about how numeric attribute values are distributed. Therefore, we evaluate each candidate for O based on its suitability for unsupervised outlier detection where the underlying distribution of the data is unknown.

4.3.1 Statistical Outliers

The assumption behind the statistical method of outlier detection is that the high-probability areas of a stochastic model indicate normal data; observations in the low-probability regions are treated as anomalous [50]. Using the stochastic model of Eqn. 4.1, the probability that ti is anomalous is simply:

    P(ti is an outlier) = (1 − P(ti)) > τ    (4.3)
for some threshold τ. This is illustrated graphically in Fig. 4.2a; we have summed the probability across all the mixtures of Fig. 4.1b to get P (dashed blue line) and the outlierness probability is shown as 1 − P (solid red line). To represent this in terms of Def. 31, we select an anomaly threshold 0 ≤ qa ≤ 1 and define:

    qi = q0          if 1 − P(ti) < qa
         1 − P(ti)   otherwise              (4.4)
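A hedged sketch of Eqns. 4.3–4.4 on top of a fitted mixture (the clipping is our own assumption, since a continuous density is not strictly a probability):

```python
import numpy as np

def statistical_outlier_scores(model, timestamps, q0=0.0, qa=0.95):
    """Eqns. 4.3-4.4 using a fitted sklearn GaussianMixture. Densities
    are clipped to [0, 1] before computing 1 - P."""
    X = np.asarray(timestamps, dtype=float).reshape(-1, 1)
    p = np.clip(np.exp(model.score_samples(X)), 0.0, 1.0)
    scores = 1.0 - p
    return np.where(scores < qa, q0, scores)  # q0 for inliers, 1 - P otherwise
```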
The main advantage of GMMs is that the model can be rapidly computed at relatively low computational cost. The most commonly-used method is EM [50, 134], which typically has linear complexity for each iteration (though it may be slow to converge); Dirichlet Process Gaussian Mixture Model (DPGMM) is a better alternative where there is an arbitrary number of mixtures (cf. §5.3.2). Once the model is known, outlierness can be calculated in linear time. One of the difficulties with the statistical approach is that it can only discover anomalies with respect to the global data distribution. For example, Fig. 4.1a (iii) has a series of small spikes in activity at regular intervals from 6pm to 6am, presumably a security guard who makes the same rounds every night. The GMM fails to capture this distinctive pattern because it is of low density compared to the normal day-time activity: the model in Fig. 4.1b represents these hourly spikes as a single Gaussian with high variance. Using this model, the security guard's rounds will be detected as anomalous. In general, model-based approaches are sensitive to the choice of parameters. Fig. 4.2b illustrates this problem: the distribution of anomaly scores is dependent on the data being modelled, so a different value of the threshold qa must be selected for each dataset. The optimal value for each dataset is the one which minimises the error rate [65], which implies a supervised approach.

Figure 4.2: Statistical Outlier Detection, using the model of Fig. 4.1b: (a) Probability Density (dashed blue line) and corresponding Outlierness Score (solid red line); (b) Distribution of Outlierness Scores (normal data is close to zero, outliers are close to one)

GMMs
also suffer from the so-called “curse of dimensionality”: in
high dimensions, a large number of samples are needed to train the model [134, 190]. A final observation is that the underlying distribution of the data may not be a Gaussian mixture at all. For example, the data distribution for elapsed time to walk between a pair of sensors may be better represented with a power-law or Poisson mixture model. [189] uses power-law mixture models to capture community formation behaviour in the Twitter network and Poisson mixture models for
anomaly detection. While these different mixture models have been used successfully in a variety of domains, there is no clear way to determine which model will give the best fit to an arbitrary dataset, other than by trying and comparing each of them in turn. In summary, GMMs have been used successfully where the domain is known in advance, where reasonably clean training data is available and where the density of each cluster is more-or-less uniform. It is most useful in a supervised or semi-supervised setting [50, 134, 184], but less ideal for unsupervised anomaly detection.

4.3.2 Cluster-based Outliers

While clustering and outlier detection have different goals, outliers are often detected as a by-product of clustering, as we have already seen with Autopart [47] in Sect. 3.3. One class of cluster-based anomaly detection techniques relies on the intuition that anomalies are observations which do not belong to any cluster [50]. An example of this category is FindOut [188], which uses wavelets to detect and remove clusters from numerical data; whatever is left is labelled as an anomaly. Another class of cluster-based algorithms detects anomalies based on the distance from their closest cluster centroid [50]. Often these use a two-step approach: Cluster-Based Local Outlier Factors (CBLOF) [92] first clusters the dataset using any appropriate algorithm (e.g., k-means [103]). Then each cluster is classified as "large" or "small" using two parameters α (the proportion of the data which must be accounted for by the large clusters) and β (the ratio of large cluster size to small cluster size). For data samples belonging to large clusters, the outlier score CBLOF(ti) is calculated as the distance to the cluster centroid. For samples from small clusters, CBLOF(ti) is the distance to the closest large cluster centroid. Therefore samples in big or dense clusters are less anomalous than those in small or sparse clusters. To put this into our framework of Def. 31, we choose an anomaly threshold qa and define:
    qi = q0          if CBLOF(ti) < qa
         CBLOF(ti)   otherwise              (4.5)
Fig. 4.3a shows CBLOF for the datasets of Fig. 4.1a using k-means clustering with k = 3, α = 0.95, β = 2.
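A sketch of CBLOF following the two-step description above (k-means, then the large/small cluster split); the original algorithm also weights scores by cluster size, which is omitted here for simplicity:

```python
import numpy as np
from sklearn.cluster import KMeans

def cblof_scores(X, k=3, alpha=0.95, beta=2.0):
    """CBLOF per the description above. X is an (n_samples, n_features)
    array (reshape 1-D data first)."""
    X = np.asarray(X, dtype=float)
    km = KMeans(n_clusters=k).fit(X)
    sizes = np.bincount(km.labels_, minlength=k)
    order = np.argsort(sizes)[::-1]            # largest clusters first
    large, covered = [], 0
    for idx, c in enumerate(order):
        large.append(c)
        covered += sizes[c]
        boundary = idx + 1 < k and sizes[c] >= beta * sizes[order[idx + 1]]
        if covered >= alpha * len(X) or boundary:
            break
    centroids = km.cluster_centers_
    large_centroids = centroids[large]
    scores = np.empty(len(X))
    for i, c in enumerate(km.labels_):
        if c in large:        # distance to own centroid
            scores[i] = np.linalg.norm(X[i] - centroids[c])
        else:                 # distance to closest large centroid
            scores[i] = np.linalg.norm(X[i] - large_centroids, axis=1).min()
    return scores
```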
Figure 4.3: Outlier Detection with Cluster-based Local Outlier Factors (CBLOF): (a) CBLOF Scores; (b) Distribution of CBLOF Scores
The complexity of CBLOF is dependent on the clustering approach chosen. [103] presents an efficient implementation of k-means (Lloyd’s algorithm) using k-d trees (logarithmic for each iteration); faster approximation solutions have also been proposed [104]. Once the clusters are known, CBLOF scores can be calculated in linear time. One of the stated goals of CBLOF is to detect local outliers, but it does not do any better than the GMM-based approach in this respect. Fig. 4.3a (iii) shows that the patterns of our security guard are not represented. Again like GMMs, CBLOF’s performance is highly sensitive to the choice of parameters: the number of clusters K, α and β must be determined separately for each dataset. The anomaly threshold qa is also different for each dataset, but as Fig. 4.3b shows, at least the normal values are denser around zero and become sparser with increasing CBLOF score. The dimensionality that CBLOF can handle is dependent on the clustering algorithm chosen. The k-means algorithm in [103] is suitable for data up to about 20 dimensions.
In summary, cluster-based approaches such as CBLOF suffer from many of the same difficulties as GMMs when applied to an unsupervised learning task. To eliminate the problem of creating a suitable model, we now turn to two non-parametric (model-free) approaches to outlier detection: distance-based and density-based.

4.3.3 Nearest Neighbour-based Outliers

Non-parametric approaches to outlier detection determine outlierness based on the local neighbourhood of each data point rather than a global model of the data. The main distance-based approach is k-Nearest Neighbour (kNN) [153]: let the k-distance of a point be the distance to its kth-nearest neighbour. Samples in dense parts of the dataset have very small k-distances (close to 0). Outliers have k-distances which are orders of magnitude larger. Therefore we can choose an anomaly threshold qa ≫ 0 and define:

    qi = q0               if k-distance(ti) < qa
         k-distance(ti)   otherwise              (4.6)
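Eqn. 4.6 reduces to a k-distance computation; a minimal sketch with scikit-learn:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_scores(X, k=50):
    """Eqn. 4.6: the k-distance of each point. k = 50 matches the setting
    used for Fig. 4.4; thresholding against qa is left to the caller."""
    X = np.asarray(X, dtype=float)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: self is neighbour 0
    dists, _ = nn.kneighbors(X)
    return dists[:, -1]
```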
The choice of k is independent of the statistical distribution of T. k acts as a smoothing parameter: we obtain reasonable results with k = 1, but the sensitivity of detecting anomalies is reduced for higher values of k. The k-distances for the datasets of Fig. 4.1a are shown in Fig. 4.4a, with k = 50. The computationally expensive part of kNN is the search for the k-neighbourhood of each point. A naïve search has O(N²) complexity; this can be reduced to O(N log N) using indexing such as R*-trees [27, 153] or, in higher dimensions, X-trees [28, 40]. [153] presents a heuristic to further reduce the complexity by partitioning the dataset and pruning away partitions which cannot contain outliers. The results in Fig. 4.4a are easier to interpret than the model-based approaches. Normal data is shown as a straight line close to zero, with anomalous values rising towards the edges. Looking at Fig. 4.4a (iii), kNN has successfully identified our security guard's nightly patrols as normal, with the most anomalous regions being 6–9am and 6–9pm. Fig. 4.4b shows the distribution of outlier scores. Normal data is heavily clustered close to zero, with outliers stretching out in a long tail to the right (this tail extends up to much higher values of several
Figure 4.4: Outlier Detection with k-Nearest Neighbours (kNN): (a) kNN Scores; (b) Distribution of kNN Scores
thousand seconds, not shown in Fig. 4.4b). We still have the problem of selecting a suitable threshold between normal values and anomalies, as this is different for each dataset. However, kNN still detects anomalies with respect to the global data distribution. It cannot discover small anomalous regions within a larger area of dense data points. And if the global density of the data is not uniform, kNN will detect sparse clusters as anomalies. These problems are addressed by the density-based approach discussed in the following section.

4.3.4 Density-based Outliers

Using statistical or clustering approaches, we detect outliers with respect to the global distribution of T. With distance-based approaches, we detect outliers with respect to the global density of T. However, data is often comprised of clusters of varying densities. This is the problem addressed by density-based approaches. Local Outlier Factors (LOF) [40] computes the outlierness of each observation with respect to its local neighbourhood, like kNN. However, LOF also takes the relative density of the neighbourhood into account. LOFMinPts(ti) is defined as:
    LOF_MinPts(ti) = [ Σ_{n ∈ N_MinPts(ti)} lrd_MinPts(n) / lrd_MinPts(ti) ] / |N_MinPts(ti)|    (4.7)
where MinPts is the minimum number of points to consider as the local neighbourhood; NMinPts(n) is the set of points comprising the local neighbourhood of n; and lrdMinPts is a function which computes the local reachability density of the neighbourhood. The lrd is defined as the inverse of the average reachability distance:

    lrd_MinPts(ti) = |N_MinPts(ti)| / Σ_{n ∈ N_MinPts(ti)} reach_dist_MinPts(ti, n)    (4.8)
The reachability distance is defined as:

    reach_dist_k(p, o) = max{k_distance(o), distance(p, o)}    (4.9)
where k_distance(o) is the distance to the kth-nearest neighbour of o.
MinPts is analogous but not identical to k in kNN; if all points are distinct, MinPts = k
Figure 4.5: Outlier Detection with Local Outlier Factors (LOF): (a) LOF Scores; (b) Distribution of LOF Scores

Samples belonging to a dense cluster or deep within a sparse cluster have LOF(ti) ≈ 1. Outliers have LOF values several times larger. Thus unlike the previous outlier approaches, LOF does not require an outlier threshold parameter qa. In terms of Def. 31:

    qi = 1         if LOF(ti) ≈ 1
         LOF(ti)   otherwise              (4.10)
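LOF is available off the shelf; a minimal sketch using scikit-learn (which reports negated LOF values):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def lof_scores(X, min_pts=50):
    """LOF (Eqns. 4.7-4.9) via scikit-learn. The sign is flipped so that
    inliers sit near 1 and outliers are several times larger, matching
    the discussion above."""
    lof = LocalOutlierFactor(n_neighbors=min_pts)
    lof.fit(np.asarray(X, dtype=float))
    return -lof.negative_outlier_factor_
```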
The LOF scores for the datasets of Fig. 4.1a are shown in Fig. 4.5a, with MinPts = 50. Like kNN, normal values appear as a straight line. The periodic pattern of our security guard is clearly defined in Fig. 4.1a (iii). Unlike kNN, the largest anomalies do not necessarily occur at the edges of dense clusters, but can also occur within them (Figs. 4.1a (i–ii)). These local anomalies were not detected by any of the previous methods. The distribution of LOF scores is shown in Fig. 4.5b. Although the data distributions are very different, the distributions of LOF scores are very similar: LOFs for normal data are tightly clustered around 1, with outliers in a long tail to the right (LOF(ti) > 5 are not shown). This solves the problem of selecting an anomaly threshold, as the threshold is no longer dependent on the data distribution. LOF is therefore well-suited to unsupervised learning, as we can reliably detect outliers without making any assumptions about the underlying distribution of the data or the sizes and densities of clusters. Although LOF was created more than a decade ago, it remains the state-of-the-art for unsupervised outlier detection in data of varying density. Newer algorithms have been proposed, but many of them are very similar to LOF, with some optimisation that only works for special cases. [163] presents a unified theoretical framework in which to compare density-based outlier detection methods. By generalising the notion of locality, LOF is compared to a number of specialised domain-specific algorithms. As LOF can efficiently reproduce the results of the specialised approaches, it is shown to be a good general method which can be applied to many different types of data; in most cases, highly specialised approaches are unjustified. The complexity of the nearest neighbour search in naïve LOF is O(N²), but this can be reduced to O(N log N) using indexing methods such as X-Trees [28, 40]. [148] was an attempt to improve on LOF's performance by using an approximate algorithm for local outlier detection. First, the
authors introduce an exact algorithm, LOCI (LOcal Correlation Integral). The complexity of LOCI is worse than LOF (O(N³)) and in empirical studies using LOCI and LOF as a one-class classifier, LOCI is shown to be less accurate than LOF [99]. The general idea behind LOCI is used as the basis for an approximate algorithm, aLOCI (Approximate LOCI). aLOCI approximates the nearest-neighbour search by storing data points in a k-dimensional quad-tree. The complexity of neighbourhood discovery is O(NLkg), where L is the number of levels (or scales) in the quad-tree and g is the number of grids. L and g are selected based on the intrinsic dimensionality of the dataset. This yields an algorithm which is asymptotically faster than LOF. Unfortunately, its accuracy is considerably worse than the exact algorithm. Our own experiments with aLOCI showed that many inliers were detected as outliers. The main weakness of LOF is a drop in performance on high-dimensional datasets. As we project into a very high dimensional space, Lp-norm distance measures lose their discriminative ability, which affects LOF's
density measurements. There is also a disproportionate effect
on performance. Indexed LOF shows near-linear performance at low dimensions, but the effectiveness of the X-tree index degrades at 10–20 dimensions [40] and at high (k > 50) dimensions, neighbourhood discovery requires a brute-force sequential search, with complexity O(N²k). We discuss these problems in the next section.
4.4 outlier detection in high-dimensional space
[190] is a detailed survey on recent work on outlier detection in high-dimensional space. The "curse of dimensionality" is broken down into concrete phenomena: the distance concentration effect (as dimensionality increases, the relative contrast between near and far neighbours tends to degrade); the presence of irrelevant attributes (subspace approaches tackle this directly; other approaches do so indirectly or not at all); and efficiency issues. Despite the loss of relative distances due to the concentration effect, it is still possible to differentiate between near and far points by ranking the distance values, which is sufficient for determining the local neighbourhood of each point. One interesting result is that it is not the concentration effect per se which impedes outlier detection in high dimensions: it is the presence of irrelevant attributes which makes the concentration effect a problem.
Irrelevant attributes can be explicitly handled by subspace feature selection [118, 190]. DB-CSC [90] combines graph clustering with subspace feature selection on the numeric attributes of the graphs, but the experimental datasets only have synthetic data up to 20 dimensions, so it is not clear if this approach scales to higher dimensions. Subspace Outlier Detection (SOD) [117] is a subspace LOF method which combines the tasks of finding relevant subspaces and detecting outliers, but the problem of indexing in high-dimensional space is not addressed. A solution to the indexing problem is proposed by Projection Indexed Nearest Neighbours (PINN) [62]. PINN's solution is based on the observation that real-world datasets have high representational dimensionality but low intrinsic dimensionality [174]. The Achlioptas Random Projection (RP) [9] is used to project data into a lower-dimensional space. The Johnson-Lindenstrauss Lemma [102] proves that a reduced dimensionality of k = 2 · (log n)/ε² will preserve the pairwise L2 (Euclidean) distances between data points within an error bound of 1 + ε with high probability. Candidates for the neighbourhood of each point are located in this projected space, where the computational costs are much lower, O(kN log N). The candidate set is used as an index into the original space to calculate the LOFs. One of the interesting results of the PINN paper is a proof that LOF scores are preserved within the known error bounds of the RP. Using RP + PINN + LOF makes LOF tractable for large, high-dimensional data. We verify this experimentally in Chapter 6.
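A rough RP + LOF pipeline in the spirit of PINN is sketched below; note that it omits PINN's step of mapping neighbour candidates back to the original space, so it is only an approximation of the published method:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.random_projection import (SparseRandomProjection,
                                       johnson_lindenstrauss_min_dim)

def projected_lof(X, eps=0.3, min_pts=50):
    """Project to a JL-dimensioned subspace with an Achlioptas-style
    sparse random projection, then run LOF in the projected space."""
    X = np.asarray(X, dtype=float)
    k = johnson_lindenstrauss_min_dim(n_samples=X.shape[0], eps=eps)
    k = min(int(k), X.shape[1])                # never project upwards
    Xp = SparseRandomProjection(n_components=k).fit_transform(X)
    lof = LocalOutlierFactor(n_neighbors=min_pts).fit(Xp)
    return -lof.negative_outlier_factor_
```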
4.5 conclusion
In this chapter we surveyed approaches to integrating numeric attributes into substructure discovery and anomaly detection. Most of these approaches are for graphs with a single numeric attribute or weight. In Chapter 6, we propose an alternative constraint-based approach for graphs with an arbitrary number of continuous values, represented as a multi-dimensional feature vector. Our approach in Chapter 6 is based on the anomalousness of the numeric attributes. We have surveyed a number of approaches to unsupervised outlier detection: numeric anomalies are detected as statistical outliers, outliers from data clusters, distance from a local neighbourhood, or distance from and density of the local neighbourhood. Using a set of six numeric datasets, we demonstrated the
advantages of the density-based approach for unsupervised outlier discovery when the distribution and density of the numeric dataset are not known in advance. We concluded with a look at density-based outlier discovery in high-dimensional space. In Chapter 5, we present Agwan, a generative model for labelled, weighted graphs. In Chapter 6, we apply numeric outlier detection to the problem of frequent substructure discovery in graphs with numeric labels. In Chapter 7, we present Yagada, an algorithm for anomaly detection in graphs with numeric labels.
Part III

CONTRIBUTIONS

agwan: Irish slang, short for "ah go on".
yagada: slang for "what's up?" or "how are you doing?"
— Urban Dictionary
A G WA N : A G E N E R AT I V E M O D E L F O R L A B E L L E D G R A P H S W I T H N U M E R I C AT T R I B U T E S
5.1 introduction
In Chapter 2, we looked at generative graph models (§2.4). These were broadly classified into "mechanistic" models, which are parsimonious and amenable to analytical proofs; and stochastic models, which may not be analytically tractable but which can be fitted to real-world graphs to learn their properties. In general, the mechanistic models generate unlabelled graphs, whereas stochastic models make use of vertex label information and can generate labelled graphs. In this chapter, we develop a generative model for labelled, weighted graphs. Our main motivation is to use the model to better understand the laws governing the relationship between graph structure and numeric labels or weights. Furthermore, we want to be able to create realistic random, labelled, weighted graphs for simulation experiments for our pattern discovery algorithms (Chapters 6–7). We begin the chapter by examining how vertex attributes give rise to graph structure, in terms of processes such as selection and induction. We then extend this discussion to weights and numeric attributes. This lays the groundwork for the main contribution of this chapter: Agwan (Attribute Graphs: Weighted and Numeric) is our generative model for random graphs with discrete labels and weighted edges. We include algorithms for fitting the parameters of Agwan to real-world graphs and generating random graphs from the model. As far as we are aware, Agwan is the first generative model to include both vertex labels and edge weights. The basic model has a single numeric attribute (edge weight), but as we later show, this can easily be generalised to edges labelled with an arbitrary number of numeric attributes. In our experiments, we evaluate how "realistic" our random graphs are by measuring the closeness of fit of statistics calculated on the generated graphs against the same statistics calculated on the real-world input graphs. We also compare against statistics calculated on graphs generated with the state-of-the-art in labelled graph generation, the MAG model (§2.4.8). The comparison of the two approaches gives some measure of which graph properties are influenced by the edge weight and which are independent of edge weight.

Figure 5.1: Examples of labelled graphs: (a) Social graph; (b) Molecular structure; (c) Scene Analysis
5.2 graph attributes and structure
5.2.1 Discrete Attributes and Structure

Back in Chapter 2, we gave some examples of labelled graphs (Fig. 2.2, reproduced here as Fig. 5.1 for convenience). In this section, we consider how vertex labels are related to the structure of the graph. The social network in Fig. 5.1a has a discrete label indicating the favourite sport of the actors. People with similar attributes—in this case similar sports interests—tend to form clusters in the graph. Social scientists have attempted to uncover why this is the case. Similarity is explained in terms of social network processes. The first of these is selection, also known as homophily [60, 138] or assortative mixing [142], the principle of “like attracts like”: we tend to form relationships with those similar to ourselves. (We mentioned homophily in Sect. 2.4.7 in the context of statistical models.) The second process which explains similarity is induction, the tendency to influence those around us. Induction in longitudinal social networks has been measured using data from the Framingham Heart Study, which started in 1948 and is still continuing today. The original study collected data on 5,124 subjects who lived in Framingham, Massachusetts (USA)—about two thirds of the town’s population—but what makes it unique is that the data collection included each participant’s close friends, colleagues and family members. The original project was for the purpose of studying heart disease, but Christakis and Fowler realised that they could use it to study the propagation of health behaviours. They found that behaviours such as smoking [54], obesity [53] and even happiness [83] spread through the network in much the same way that a virus does, demonstrating that well-connected social groups tend to become more homogeneous over time. Besides these social processes, similarity can also arise from confounding factors, where connected individuals are similar because they are exposed to the same external influences [171] (such as an economic downturn or living in the same neighbourhood). So, homogeneity is often observed in the attributes of connected vertices in social networks, and there are at least three possible explanations: selection, induction and confounding factors. What is not clear is the relative contribution of the three processes. We hope that the generative model that we present in this chapter will be a first step towards models of network propagation which can answer this question in future.

Correlation between graph attributes and structure is not restricted to social networks; it is also observed in other real-world processes. In the molecular structure of Fig. 5.1b, the vertex label (atom name) is conditionally dependent on its place in the structure. The degree of each vertex is dependent on the number of free electrons of each atom. The lengths of the edges (atomic bonds) are dependent on the atomic weights of the vertices. In the scene analysis example (Fig. 5.1c), the colour of each vertex (superpixel) is conditionally dependent on the colour of adjacent vertices. The velocities of adjacent vertices are also conditionally dependent, as superpixels in the foreground object will move together and those of the background will move together. Most generative algorithms do not consider vertex attributes (§2.4.1–2.4.6), but stochastic models (§2.4.7), which arise from the SNA literature, rely on dependencies between vertex attributes and structure. The MAG model (§2.4.8) can encode homophily and heterophily using affinity matrices learned from real-world graphs. In this case, the probability of an edge between each pair of vertices is dependent on the attribute values of the vertices.
5.2.2 Numeric Attributes and Structure

Many graphs have weighted edges, as we discussed in Sect. 2.2.1. While the relationship between vertex labels and graph structure has been studied in the social science literature, the relationship with edge weights is not so well understood. In this section, we start from an intuitive understanding of how edge weights arise and develop this into the hypothesis on which we base our model. Consider a graph with discrete vertex labels, G = ⟨V, E, L, L_V⟩ (Def. 8), and weighted edges ∀u, v ∈ V, e ∈ E : e = ⟨u, v, w_uv⟩ (Def. 7). As we noted earlier, edge weights can be one of two types. Sometimes the weight is a count of the number of times an event occurred. For example, in Fig. 5.1a, edge weights might represent the number of times that two actors participated in a sporting event together. In this case, w_uv ∈ ℕ. The second type of edge weight is a continuous amount, e.g. amounts of money or quantities such as distance, speed or time. In this case, w_uv ∈ ℝ. In either case, we expect the weight to be related to the vertices at its endpoints. In the MAG model, the existence of an edge is conditioned on the labels of the two endpoints. We extend this line of reasoning to arrive at the following hypothesis:

hypothesis 1: The weight of an edge w_uv is conditioned on the vertex labels L_V(u) and L_V(v) and is conditionally independent of the other vertices in the graph.

Note that we have made some simplifying assumptions. First, we assume conditional independence between the edge weight and vertices which are not endpoints of the edge. Vertices two or three steps further away may well exert an influence (indeed such patterns are observed in [53, 54, 83]), but we assume that this influence is indirect and can be modelled adequately by considering only the vertices incident on the edge. Second, we assume there is only one label on each vertex. This assumption does not reduce the generality of the model, but it does place practical limits on the number of possible label values, as we have to model the full joint distribution across all possible values. We will comment on whether it is possible to relax this assumption in Sect. 5.3.3. In our model, the existence of an edge is conditioned on the weight: if the weight is within the valid range (w_uv > 0), an edge exists; otherwise there is no edge.
Next, we consider the probability distribution of the weights W^ij = {w_uv | L_V(u) = i, L_V(v) = j}. The prior distribution is dependent on what the edge weights represent. If the weight represents the number of events occurring in a fixed interval of time—number of e-mails sent or number of people moving between sensors—it is expected to follow a Poisson distribution. If the weight is drawn from an independent and identically distributed (i.i.d.) random variable—speed, elapsed time, or amounts of money—it is expected to follow a Gaussian distribution. Other quantities—frequency of word use or number of papers written by scientists—follow a power-law distribution. In many cases, W^ij is not drawn from a simple probability distribution, as it arises from multiple processes. As we cannot assume prior knowledge about the distribution of W^ij, we model it using a GMM with an arbitrary number of Gaussian components, which provides a reasonable approximation to any general probability distribution. We avoid the problem of knowing the “correct” number of components by assuming that W^ij consists of an infinite number of components and using variational inference to determine the optimal number for our model, as described in the next section. Next, we develop Hypothesis 1 into our generative model, Agwan. We use Agwan to validate the hypothesis on some real-world datasets in the experiments later in the chapter.
5.3 agwan: a generative model for labelled, weighted graphs
Here we present our generative model, Agwan (Attribute Graph: Weighted and Numeric). The Agwan model is parameterised by π, a set of prior probabilities over L; and a set of edge weight mixture parameters Θ = {Ω^ij | i, j ∈ L}. The model is shown in Fig. 5.2 in plate notation and illustrated in Fig. 5.3 for the Enron graph (§3.4.2), which we use later for the experiments. For each combination of vertex attributes ⟨i, j⟩, the distribution of edge weights W^ij is modelled by a GMM Ω^ij with M Gaussian components:

$$\Omega^{ij} = \sum_{m=0}^{M-1} \omega^{ij}_m \cdot \eta\left(\mu^{ij}_m, (\sigma^{ij}_m)^2\right) \tag{5.1}$$
[Figure 5.2: Agwan model in plate notation. (a) Vertex Labels; (b) Edge Weights.]

where ω^ij_m is the weight of each component and η(µ^ij_m, (σ^ij_m)²) is the Gaussian PDF with mean µ^ij_m and variance (σ^ij_m)². The mixture weights form a probability distribution over the mixture components:

$$\sum_{m=0}^{M-1} \omega^{ij}_m = 1 \tag{5.2}$$

We specify Ω^ij such that the first component encodes the probability of no edge: ω^ij_0 = 1 − P(e_ij), where P(e_ij) is the probability of an edge between pairs of vertices with labels ⟨i, j⟩. For directed graphs, |Θ| = |L|² and we need to generate both w_uv and w_vu. For undirected graphs, Ω^ij = Ω^ji, so |Θ| = O(|L|²/2) and w_vu = w_uv. The model degenerates to an unweighted graph if there are two components, η_0(0, 0) and η_1(1, 0). Furthermore, if the weights ω^ij_m are the same for all ⟨i, j⟩, the model degenerates to an Erdős-Rényi random graph.
[Figure 5.3: Agwan model example. Vertex labels are selected according to prior probability π. Edge weight w_uv is selected from mixture model Ω^42 and w_vu is selected from mixture model Ω^24.]
5.3.1 Graph Generation

Algorithm 1 describes how to generate a random graph using Agwan(N, L, π, Θ). The number of vertices in the generated graph is specified by N. After assigning discrete label values to each vertex (lines 2–3, cf. Fig. 5.3), the algorithm checks each vertex pair ⟨u, v⟩ for the occurrence of an edge (lines 4–7). If m = 0, ⟨u, v⟩ is not an edge (line 7). If there is an edge, we assign its weight from component m (lines 8–9). The generated graph is returned as G = (V, E).

5.3.2 Parameter Fitting

To create realistic random graphs, we need to learn the parameters π, Θ from a real-world input graph G. During parameter fitting, we want to create a model Ω^ij for each W^ij in G. Each GMM Ω^ij has a finite number of mixture components M. If M is known, Ω^ij can be estimated using Expectation Maximisation [81]. However, not only is M unknown, but we expect that it will be different for each Ω^ij within a given graph model. We solve this problem by modelling Ω^ij as a non-parametric mixture model with an unbounded number of mixture components: a Dirichlet Process Gaussian Mixture Model (DPGMM) [33]. “Non-parametric” does not mean that the model has no parameters; rather, the number of parameters is allowed to grow as more data are observed.
Algorithm 1 Agwan Graph Generation
Require: N (no. of vertices), L (set of discrete label values), π (prior distribution over L), Θ = {Ω^ij} (set of mixture models)
1: Create vertex set V of cardinality N, edge set E = ∅
2: for all u ∈ V do
3:    Assign discrete label l_u ∈ L from prior π
4: for all u, v ∈ V : u ≠ v do
5:    i = l_u, j = l_v
6:    Select Gaussian component m at random from Ω^ij according to the mixture weights ω^ij
7:    if m ≠ 0 then
8:       Draw edge weight w_uv at random from η(µ^ij_m, (σ^ij_m)²)
9:       Create edge e = ⟨u, v, w_uv⟩, E = E ∪ {e}
return G = (V, E)
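To make the generation step concrete, the following Python sketch implements the logic of Algorithm 1. It is an illustrative sketch rather than the thesis implementation: the names `priors` and `mixtures` are our own, each mixture is assumed to be stored as a list of (weight, mean, std) tuples with component 0 encoding “no edge”, and the sketch generates a directed graph.

```python
import random

def agwan_generate(n, labels, priors, mixtures):
    """Sketch of Algorithm 1: generate a random Agwan graph.

    priors:   dict mapping label value -> prior probability pi(l)
    mixtures: dict mapping (i, j) -> list of (weight, mean, std) tuples,
              where component 0 encodes the probability of no edge.
    """
    # Assign a discrete label to each vertex from the prior pi (lines 2-3)
    vertex_labels = random.choices(labels, weights=[priors[l] for l in labels], k=n)
    edges = []
    for u in range(n):
        for v in range(n):
            if u == v:
                continue
            omega = mixtures[(vertex_labels[u], vertex_labels[v])]
            # Select mixture component m according to the component weights (line 6)
            m = random.choices(range(len(omega)), weights=[c[0] for c in omega])[0]
            if m == 0:
                continue  # component 0 means "no edge" (line 7)
            mu, sigma = omega[m][1], omega[m][2]
            # Draw the edge weight from the selected Gaussian component (lines 8-9)
            edges.append((u, v, random.gauss(mu, sigma)))
    return vertex_labels, edges
```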
In essence, the DPGMM is a probability distribution over the probability distributions of the model. The Dirichlet Process (DP) over edge weights W^ij is a stochastic process DP(α, H₀), where α is a positive scaling parameter and H₀ is a finite measure on W^ij. If we draw a sample from DP(α, H₀), the result is a random distribution over values drawn from H₀. This distribution H is discrete, represented as an infinite sum of atomic measures. If H₀ is continuous, then the infinite set of probabilities corresponding to the frequency of each possible value that H can return are distributed according to a stick-breaking process. The stick-breaking representation of H is given as:

$$\omega^{ij}_m = \omega^{ij}_m(x) \prod_{n=1}^{m-1} \left(1 - \omega^{ij}_n(x)\right) \tag{5.3}$$

$$H = \sum_{m=1}^{\infty} \omega^{ij}_m \, \delta_{\eta^*_m} \tag{5.4}$$

where {η*₁, η*₂, . . .} are the atoms representing the mixture components. We learn the mixture parameters using the variational inference algorithm for generating Dirichlet Process Mixtures described in [33]. The weights of each component are generated one at a time by the stick-breaking process, which tends to return the components with the largest weights first. In our experiments, the first 3–5 mixture components accounted for over 99% of the data. Mixture components with weights summing to less than 0.01 are dropped from the model, and the remaining weights {ω^ij_m} are normalised.
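In practice the variational fit can be obtained from an off-the-shelf implementation. The sketch below uses scikit-learn's BayesianGaussianMixture with a Dirichlet-process prior as a stand-in for the variational algorithm of [33]; the truncation level of 10 components is an assumption for illustration, while the 0.01 pruning threshold follows the text above.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def fit_omega(weight_samples, max_components=10):
    """Fit a truncated DP Gaussian mixture to the edge-weight sample W^ij.

    Returns (weight, mean, variance) triples, with the mixture weights
    renormalised after components with negligible weight are dropped.
    """
    X = np.asarray(weight_samples, dtype=float).reshape(-1, 1)
    dpgmm = BayesianGaussianMixture(
        n_components=max_components,  # truncation level of the stick-breaking process
        weight_concentration_prior_type="dirichlet_process",
        max_iter=500,
    ).fit(X)
    keep = dpgmm.weights_ > 0.01      # drop components below the 0.01 threshold
    w = dpgmm.weights_[keep] / dpgmm.weights_[keep].sum()
    return list(zip(w, dpgmm.means_[keep].ravel(), dpgmm.covariances_[keep].ravel()))
```

In a full parameter-fitting pass (Algorithm 2 below), a function like this would be called once for each vertex-label pair ⟨i, j⟩, with the NO_EDGE samples included so that the first component can absorb the probability of no edge.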
Algorithm 2 Agwan Parameter Fitting
Require: Input graph G = (V, E)
1: L = {discrete vertex label values}, d = |L|
2: Calculate vertex label priors, applying Laplace smoothing: ∀l ∈ L : P(l) = (count(l) + α) / (N + αd)
3: π = the normalised probability distribution over L such that Σᵈᵢ₌₁ P(lᵢ) = 1
4: ∀i, j ∈ L : W^ij = ∅
5: for all u, v ∈ V : u ≠ v do
6:    i = l_u, j = l_v
7:    W^ij = W^ij ∪ {w_uv}    ▷ If ⟨u, v⟩ is not an edge, then w_uv takes value NO_EDGE
8: for all i, j ∈ L do
9:    estimate Ω^ij from W^ij using variational inference
10: Θ = {Ω^ij}
return π, Θ

Algorithm 2 is the algorithm for Agwan parameter fitting. First, we estimate the vertex priors (lines 1–3). Next, we sample the edge weights for each possible combination of vertex label values, with no edge taking some value outside the range of possible edge weights (lines 4–7). Finally, we estimate the GMMs Ω^ij from the appropriate set of samples W^ij using the stick-breaking process described above.

5.3.3 Extending Agwan to multiple attributes

The Agwan model described so far is for graphs with a single discrete vertex label and a single numeric edge label (the weight). Many graphs have multiple labels on vertices and edges. Agwan can be extended to multiple numeric edge labels by generalising the concept of edge weight to k dimensions. In this case, the mean of each mixture component becomes a k-dimensional vector and the variance (σ^ij_m)² is replaced with the k × k covariance matrix Σ^ij_m. The variational algorithm can be accelerated for higher-dimensional data using a kd-tree [125] and has been demonstrated to work efficiently on datasets of hundreds of dimensions. A more difficult question is how to model multiple discrete vertex labels. A naïve approach is to replace the discrete vertex labels
L_V = {l₀, l₁, . . . , l_N}, with a new label whose values are the Cartesian product of the original label values:

$$L_V = \{\, l \,\} : l = \{\, L_V(V, l_0) \times L_V(V, l_1) \times \ldots \times L_V(V, l_N) \,\} \tag{5.5}$$

This has the same expressive power as the original representation, but there are two problems. First, we need to model the full joint probability over l, which will be a complex combinatorial problem with hundreds or thousands of parameters. Second, it is unlikely that every possible combination of l will exist in the input graph, and the Laplace smoothing of Algorithm 2 will be insufficient to deal with the missing cases. This will result in a model which is overfitted to the input graph. The MAG model handles the complexity of multiple vertex labels by assuming independence between vertex labels, so edge probabilities can be computed as the product of the probabilities for each label [111]. This approach works well for latent attributes, as the MagFit algorithm enforces independence by regularising the variational parameters using mutual information [109]. However, the MAG model has not solved this problem for real attributes, where independence cannot be assumed. Furthermore, multiplying the probabilities sets an upper limit (proportional to log N) on the number of attributes which can be used in the model. In many of our experiments (§5.5), MAG worked best with a very small number of latent variables. An alternative to multiplying independent probabilities is to calculate the GMM for each edge as the weighted summation of the GMM for each individual attribute. This removes the need to model the full joint probability without needing to assume independence between labels. As the contribution of each attribute is weighted, attributes which have little effect on graph structure will not skew the model. One possible approach for the fitting algorithm would be to use a factor graph and determine the values of π and Θ using BP. This problem remains a topic for further research.
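For the k-dimensional numeric extension described at the start of this section, the same fitting machinery applies directly. A minimal sketch, again assuming scikit-learn: the only change from the one-dimensional case is that each sample is a k-vector and each component carries a full k × k covariance matrix (the data here is a synthetic stand-in, not data from our experiments).

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# One k-dimensional feature vector per vertex pair; synthetic stand-in data.
W_ij = np.random.default_rng(0).normal(size=(500, 3))  # k = 3 numeric edge attributes

dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full",  # mean mu_m becomes a k-vector; variance becomes Sigma_m (k x k)
).fit(W_ij)

print(dpgmm.means_.shape)        # (10, 3): k-dimensional component means
print(dpgmm.covariances_.shape)  # (10, 3, 3): full covariance matrices
```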
5.4 experiments
We evaluate Agwan by learning the model parameters from real-world datasets, generating random graphs from the model and calculating a series of statistics on each graph. These statistics are used to compare how closely the model maps to the input graph.
Our input datasets are the graph of “who exercised with whom” from the PAL Scheme behavioural health study (§3.4.1, |V| = 279, |E| = 1308) and the “who communicates with whom” graph of the Enron e-mail corpus (§3.4.2, |V| = 159, |E| = 2667). Vertices in the PAL Scheme graph are labelled with 28 attributes representing demographic information and health markers obtained from questionnaire data. Edges are undirected and weighted with the number of mutual coincidences between actors during the study. Vertices in the Enron graph are labelled with the job role of the employee. As e-mail communications are not symmetric, edges are directed. Edges are weighted with the number of e-mails exchanged between sender and recipient.

Our first set of experiments uses the real attributes of the graph. We performed a separate experiment for each attribute of the input graph. For comparison, we evaluated against two alternative generative models:

erdős-rényi random graph (e-r): The E-R model G(n, p) has two parameters. We set the number of vertices n and the edge probability p to match the input graphs as closely as possible. We do not expect a very close fit, but the E-R model provides a useful baseline.

mag model with real attributes (mag-r1): The MAG model with a single real attribute has a set of binary edge probabilities, Θ = {p_ij}, instead of a set of GMMs Θ = {Ω^ij}. As E-R and MAG do not generate weighted graphs, we fixed the weight of the edges in the generated graphs to the mean edge weight from the input graphs. This ensures that statistics such as average degree strength are not skewed by unweighted edges.

The second set of experiments ignores the real attributes of the graph, replacing them with a set of randomly generated synthetic attributes. The purpose of these experiments is to evaluate the relative contribution of the discrete vertex labels and numeric attributes to the graph structure. By reducing the number of numeric attributes to zero, we can evaluate the contribution of the numeric attributes in isolation. As we increase the number of degrees of freedom in the vertex attributes, we observe the effect on the model. We compared this set of experiments against:

mag model with latent attributes (mag-lx): This version of the MAG model ignores the discrete labels provided in the
input graph, instead using MagFit to learn a set of latent attributes to describe the graph structure. We compared MAG with x = 1 . . . 9 latent binary attributes against Agwan with a single synthetic attribute taking 2^0 . . . 2^9 values, so both approaches have the same degrees of freedom.

The statistics we use to evaluate each model are Vertex Strength (§2.3.1), Singular Values and Primary Left Singular Vector Components (§2.3.3), Clustering Coefficients (§2.3.5) and Triad Participation (§2.3.4). To give an objective measure of the closeness of fit between the generated graphs and the input graph, we use a Kolmogorov-Smirnov (KS) test on each of the statistics. The KS test is commonly used for goodness of fit and is based on the following statistic:

$$KS = \sup_x |F^*(x) - S(x)| \tag{5.6}$$

where F^*(x) is the hypothesized Cumulative Distribution Function (CDF) and S(x) is the empirical distribution function based on the sampled data [87]. As trying to perform a linear fit on a heavy-tailed distribution is biased [87], we use the logarithmic variant of the KS test [109]:

$$KS = \sup_x |\log F^*(x) - \log S(x)| \tag{5.7}$$

We also calculate the L2 (Euclidean) distance for each statistic in the logarithmic scale:

$$L2 = \sqrt{\frac{1}{\log b - \log a} \sum_{x=a}^{b} \left(\log F^*(x) - \log S(x)\right)^2} \tag{5.8}$$

where [a, b] is the support of distributions F^*(x) and S(x) respectively. The model that generates graphs with the lowest KS and L2 values for each of the above statistics has the closest fit to the real-world graph. If we obtain good KS and L2 scores for Agwan, this is evidence in support of Hypothesis 1.
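Both measures reduce to a few array operations once the two CDFs have been evaluated on a common grid. A sketch under the assumption that F_star and S are strictly positive arrays sampled at the same points of the support [a, b]; the function names are ours, not from the thesis.

```python
import numpy as np

def log_ks(F_star, S):
    """Logarithmic Kolmogorov-Smirnov statistic (Equation 5.7)."""
    return np.max(np.abs(np.log(F_star) - np.log(S)))

def log_l2(F_star, S, a, b):
    """L2 distance in the logarithmic scale (Equation 5.8)."""
    sq_diff = (np.log(F_star) - np.log(S)) ** 2
    return np.sqrt(sq_diff.sum() / (np.log(b) - np.log(a)))
```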
5.5 results
For each model, we generated 10 random graphs and calculated statistics for each. The plots of the averaged CDFs of the 10 graphs for each
model are shown in Figs. 5.4–5.11. The closeness of fit of each CDF (KS and L2 statistics) is shown in Tables 5.1–5.16.

5.5.1 Real Attributes

For the PAL Scheme (undirected) graph, we show results for four vertex attributes: age; total minutes spent exercising; EQ5D State (a quality-of-life metric determined by questionnaire); and Floor (the building location and floor number where the person works). For the Enron (directed) graph, we have one vertex attribute, the person’s job role.

5.5.1.1 Vertex Strength

One of the most important graph statistics is the degree distribution, which is usually heavy-tailed for real-world graphs (§2.3.1). For weighted graphs, we generalise this to vertex strength (Equation 2.7). Fig. 5.4 shows the average Complementary Cumulative Distribution Function (CCDF) for vertex strength against the number of vertices with that strength or more. Fig. 5.4a is the CCDF for the undirected (PAL Scheme) graphs. For the directed (Enron) graphs, we plot in-strength and out-strength separately (Figs. 5.4b–5.4c). Total vertex strength is the sum of in-strength and out-strength. The graphs generated from the Agwan model have vertex strength distributions which map very closely to the input graphs. The graphs generated from MAG-R1 are better than E-R (random), but the vertex strength distribution has too few high-strength and low-strength vertices. These results indicate that vertex strength is dependent on both the vertex label distribution and the edge weight distribution.

5.5.1.2 Spectral Properties

We use SVD to calculate the singular values and singular vectors of the graph’s adjacency matrix, which act as a signature of the most important features of the graph structure (§2.3.3). For the SVD UΣV, we plot CDFs of the singular values Σ (Figs. 5.5a–5.5b) and the components of the left singular vector U corresponding to the highest singular value (Figs. 5.5c–5.5d). The spectral properties of the Agwan graphs are preserved very well: the shape of the CDFs closely match the CDFs for the input graphs.
[Figure 5.4: Vertex Strength Distribution—Real Attributes. CCDF of vertex strength vs. number of vertices for the real-world graph, Erdős-Rényi, MAG-R1 and Agwan models. (a) Undirected; (b) Directed (In-strength); (c) Directed (Out-strength).]
                      E-R     MAG-R1                                   Agwan
                              Age     Total Mins  EQ5D    Floor        Age     Total Mins  EQ5D    Floor
Vertex Strength KS    6.064   5.940   2.957       3.689   5.799        0.799   1.081       0.635   1.674
Vertex Strength L2    9.686   7.281   8.265       9.377   10.039       1.829   2.589       1.765   3.294

Table 5.1: Vertex Strength for PAL Undirected Graph—Real Attributes
                         E-R     MAG-R1   Agwan
In-Vertex Strength KS    2.469   4.700    1.455
Out-Vertex Strength KS   2.708   2.659    2.303
In-Vertex Strength L2    5.679   4.912    1.816
Out-Vertex Strength L2   5.100   3.534    2.117

Table 5.2: Vertex Strength for Enron Directed Graph—Real Attributes
This is significant as it indicates that graphs generated with Agwan have similar connectivity to the input graph (§2.3.3). MAG-R1 does not preserve the singular value curve, showing a linear relationship instead. For the singular vector components, MAG-R1 is no better than random. MAG can accurately model the spectral properties of unweighted graphs [109], but not the spectral properties of weighted graphs. These results indicate that the singular values and primary singular vector components are dependent on both the vertex label distribution and the edge weight distribution.

5.5.1.3 Clustering Coefficients
The Clustering Coefficient is a measure of transitivity and community structure in the graph (§2.3.5). Fig. 5.6 shows CCDFs for Weighted Clustering Coefficient (Equation 2.12) against vertex degree. In all cases, the performance of Agwan and MAG-R1 is similar. For the undirected graph with the Age and Floor attributes, performance is little better than random. The Total Mins and EQ5D State attributes give results which are somewhat better, but not as close as we have seen with vertex strength and spectral properties. For the directed graph, the Employee Type attribute gives reasonable performance for the low-degree vertices, but drops away for the high-degree vertices. These results suggest that the process that gives rise to clustering is partly dependent on the vertex label values, but independent of the edge weight distribution. There also seems to be another (unobserved) process which is influencing cluster formation. In [109], it was not possible to get an accurate model of clustering or triad participation using real attributes. Using latent attributes and “Simplified MAG”—where all affiliation matrices are the same—also could not model the clustering property [111]. We hypothesise that cluster formation may in fact be (conditionally) independent of the vertex label values, and is better explained by triadic closure (§2.3.4): links are likely to form between two people who share a mutual friend, independently from their vertex attributes. The apparent dependency on vertex labels may be an artefact of the connection to the mutual friend, rather than the true explanation of why clustering arises. This aspect of network formation requires further investigation.
[Figure 5.5: Spectral Properties—Real Attributes. CDFs of singular values and primary left singular vector components vs. rank. (a) Singular Values—Undirected; (b) Singular Values—Directed; (c) Primary Left Singular Vector—Undirected; (d) Primary Left Singular Vector—Directed.]
                     E-R      MAG-R1                                    Agwan
                              Age     Total Mins  EQ5D    Floor         Age     Total Mins  EQ5D    Floor
Singular Values KS   36.193   35.644  35.393      35.612  36.001        34.482  32.319      33.720  34.946
Singular Vector KS   1.323    1.239   0.964       0.984   1.134         0.491   0.248       0.450   0.371
Singular Values L2   41.815   41.298  41.052      41.227  41.623        39.629  38.211      39.100  40.060
Singular Vector L2   5.004    4.940   4.614       4.742   4.852         3.257   1.486       2.914   2.307

Table 5.3: Spectral Properties for PAL Undirected Graph—Real Attributes
                     E-R      MAG-R1   Agwan
Singular Values KS   37.235   35.752   34.894
Singular Vector KS   1.915    1.801    0.282
Singular Values L2   25.044   19.546   18.360
Singular Vector L2   7.316    7.587    0.988

Table 5.4: Spectral Properties for Enron Directed Graph—Real Attributes
[Figure 5.6: Clustering Coefficients—Real Attributes. CCDF of weighted clustering coefficient vs. vertex degree. (a) Undirected; (b) Directed (In-edges); (c) Directed (Out-edges).]
                            E-R     MAG-R1                                   Agwan
                                    Age     Total Mins  EQ5D    Floor        Age     Total Mins  EQ5D    Floor
Clustering Coefficient KS   5.224   5.048   2.083       3.343   4.895        5.132   2.493       2.042   5.161

Table 5.5: Clustering Coefficient for PAL Undirected Graph—Real Attributes
                                        E-R     MAG-R1   Agwan
Clustering Coefficient (In-Edges) KS    3.444   2.208    2.220
Clustering Coefficient (Out-Edges) KS   3.728   0.769    0.702
Clustering Coefficient KS               4.347   1.651    3.163
Clustering Coefficient (In-Edges) L2    3.528   1.607    1.528
Clustering Coefficient (Out-Edges) L2   3.145   1.191    1.002
Clustering Coefficient L2               6.949   1.438    2.284

Table 5.6: Clustering Coefficient for Enron Directed Graph—Real Attributes
                          E-R      MAG-R1                                    Agwan
                                   Age     Total Mins  EQ5D    Floor         Age     Total Mins  EQ5D    Floor
Triad Participation KS    7.012    6.877   5.704       5.704   6.685         6.328   5.106       5.829   6.768
Triad Participation L2    17.334   18.828  18.861      17.101  19.746        19.434  16.879      18.348  20.288

Table 5.7: Triad Participation for PAL Undirected Graph—Real Attributes
                                     E-R     MAG-R1   Agwan
Triad Participation (Cycles) KS      4.787   4.248    3.555
Triad Participation (Middlemen) KS   4.382   4.500    4.500
Triad Participation (Ins) KS         4.700   4.500    2.436
Triad Participation (Outs) KS        4.382   4.094    4.248
Triad Participation (Cycles) L2      3.823   3.000    3.101
Triad Participation (Middlemen) L2   5.144   4.178    4.207
Triad Participation (Ins) L2         4.630   4.826    4.332
Triad Participation (Outs) L2        3.727   3.295    3.203

Table 5.8: Triad Participation for Enron Directed Graph—Real Attributes
5.5.1.4 Triad Participation
In Fig. 5.7, we plot triad strength (Equation 2.9) against the number of participating vertices. Fig. 5.7a shows the triad strength for the undirected graph. Figs. 5.7b–5.7e show the triad strength for each of the directed graph motifs of Fig. 2.7 (Cycles, Middlemen, Ins and Outs). As Triad Participation is closely related to clustering (§2.3.4), it is not surprising that the results are similar to the clustering results. Again the accuracy of Agwan and MAG-R1 is similar: better than random, but not as close as for vertex strength and spectral properties. The results appear to be to some extent dependent on the vertex label values, but independent of the edge weight distribution. We hypothesise that, like clustering, the dependency between triad participation and vertex label values may be an artefact of triadic closure, which is not currently modelled by either MAG or Agwan. In future work, we will investigate triadic closure to determine if this can better explain the triad participation and clustering properties.

5.5.2 Synthetic Attributes

An alternative interpretation of the MAG model ignores the true attribute values from the input graph and represents attributes as latent variables, which are learned using a variational inference EM approach [109]. To compare Agwan with this approach, we replaced the real labels in the input graph with a synthetic vertex attribute taking
[Figure 5.7: Triad Participation—Real Attributes. Total triad strength vs. number of participating vertices. (a) Undirected; (b) Directed (Cycles); (c) Directed (Middlemen); (d) Directed (Ins); (e) Directed (Outs).]
[Figure 5.8: Vertex Strength Distribution—Synthetic Attributes. CCDF of vertex strength for the real-world graph, AGWAN-L0–L9, MAG-L1–L9 and Agwan with real attributes. (a) Undirected; (b) Directed (In-strength); (c) Directed (Out-strength).]
                     Agwan        MAG Latent
                     EQ5D State   L1      L2      L3      L4       L5       L6      L7      L8      L9
Vertex Strength KS   0.635        2.243   5.106   5.886   5.886    5.670    5.481   4.605   5.561   6.234
Vertex Strength L2   1.765        7.944   8.473   9.236   10.783   10.103   8.635   9.120   9.603   21.027

                     Agwan (synthetic)
                     L0      L1      L2      L3      L4      L5      L6      L7      L8      L9
Vertex Strength KS   3.401   2.197   2.303   1.050   1.758   0.916   0.975   0.875   0.854   1.589
Vertex Strength L2   6.266   4.537   3.754   2.584   2.160   1.731   1.343   0.873   0.693   1.229

Table 5.9: Vertex Strength for PAL Undirected Graph—Synthetic Attributes
                         Agwan        MAG Latent
                         Empl. Type   L1      L2      L3      L4       L5       L6      L7      L8      L9
In-Vertex Strength KS    1.455        4.700   3.704   5.991   6.522    6.142    5.704   3.951   5.347   5.193
Out-Vertex Strength KS   2.303        4.942   3.602   5.768   5.991    6.234    5.075   4.317   3.466   3.401
In-Vertex Strength L2    1.816        5.023   4.605   8.856   19.820   15.718   8.678   6.171   8.672   7.066
Out-Vertex Strength L2   2.117        3.001   3.055   7.805   14.329   10.882   3.740   3.120   2.668   3.737

                         Agwan (synthetic)
                         L0      L1      L2      L3      L4      L5      L6      L7      L8      L9
In-Vertex Strength KS    2.418   2.513   2.345   2.590   2.303   1.150   1.897   2.015   2.303   0.693
Out-Vertex Strength KS   2.996   2.234   2.090   4.248   3.151   2.071   1.514   1.966   1.386   1.204
In-Vertex Strength L2    5.638   5.774   5.473   4.355   3.060   2.224   1.367   1.299   1.412   0.665
Out-Vertex Strength L2   5.128   4.807   4.732   4.756   1.120   1.122   1.918   2.034   1.415   1.045

Table 5.10: Vertex Strength for Enron Directed Graph—Synthetic Attributes
2^0 . . . 2^9 values allocated uniformly at random, then learned the edge weight distributions using variational inference as normal. We have plotted Agwan with one real attribute alongside for comparison.

5.5.2.1 Vertex Strength

The results in Fig. 5.8 show that Agwan significantly outperforms MAG for the accuracy of the vertex strength distribution. For the undirected graph (Fig. 5.8a), the highest accuracy is obtained using Agwan with real attributes, closely followed by Agwan with 2^8 synthetic attribute values. MAG with synthetic attributes is considerably less accurate. For the directed graph (Figs. 5.8b–5.8c), the best results are obtained using Agwan with 2^4 synthetic attribute values, followed closely by Agwan with real attributes. Again, MAG with synthetic attributes is considerably less accurate. These results confirm our earlier finding that vertex strength is dependent on both the vertex label distribution and the edge weight distribution.

5.5.2.2 Spectral Properties

As before, Fig. 5.9 shows that the plots of the Agwan statistics maintain the shape of the curve for singular values and primary left singular vector components. The plots for the MAG statistics are almost a straight line. In absolute terms, the model which maps most closely is Agwan with 2^7 attribute values for singular values and Agwan with 2^6 attribute values for singular vector components. These results confirm that the singular values are dependent on both the vertex label distribution and the edge weight distribution. For the singular vectors, the results show that the degrees of freedom of the vertex labels are important: singular vectors are also dependent on both the vertex label distribution and the edge weight distribution.

5.5.2.3 Clustering Coefficients

Both Agwan and MAG are more accurate using synthetic attributes than they were with real attributes, implying clustering is not closely related to the real attribute values, but the clustering process can be approximated using synthetic labels. For the undirected graph (Fig. 5.10a), the model which maps most closely is MAG with 1 latent attribute. For the directed graph (Figs. 5.10b–
[Figure 5.9: Spectral Properties—Synthetic Attributes. CDFs of singular values and primary left singular vector components vs. rank, for the real-world graph, AGWAN-L0–L9, MAG-L1–L9 and Agwan with real attributes. (a) Undirected—Singular Values; (b) Directed—Singular Values; (c) Undirected—Primary Left Singular Vector; (d) Directed—Primary Left Singular Vector.]
                     Agwan        MAG Latent
                     EQ5D State   L1       L2       L3        L4        L5        L6        L7        L8        L9
Singular Values KS   33.720       30.901   46.771   89.148    93.658    81.082    93.413    125.855   72.059    85.863
Singular Vector KS   0.450        0.313    0.645    0.654     0.821     0.694     0.590     0.561     0.645     0.579
Singular Values L2   39.100       55.080   94.881   106.265   109.813   104.160   109.673   120.108   113.166   173.884
Singular Vector L2   2.914        0.396    3.231    3.324     3.895     3.622     2.894     3.092     2.873     3.079

                     Agwan (synthetic)
                     L0       L1       L2       L3       L4       L5       L6       L7       L8       L9
Singular Values KS   35.238   35.194   35.226   35.341   35.542   33.763   32.824   27.713   34.052   37.384
Singular Vector KS   0.675    0.827    0.847    0.950    1.139    0.559    0.221    0.183    0.258    0.361
Singular Values L2   40.448   40.394   40.391   40.504   40.873   38.980   37.613   27.296   44.148   74.019
Singular Vector L2   4.477    5.513    5.671    6.316    7.530    3.612    1.237    0.866    1.719    2.351

Table 5.11: Spectral Properties for PAL Undirected Graph—Synthetic Attributes
[Figure 5.10: Clustering Coefficients—Synthetic Attributes. CCDF of weighted clustering coefficient vs. vertex degree. (a) Undirected; (b) Directed (In-edges); (c) Directed (Out-edges).]
5.10c), Agwan with 7 synthetic attributes gives the best match, followed by MAG with 2 latent attributes. These results show that there is a relationship between clustering and vertex attributes, but clustering appears to be independent of the edge weight distribution. Again, we propose to investigate triadic closure as a better explanation for the clustering property.

5.5.2.4 Triad Participation
Similarly to clustering, Agwan and MAG are both more accurate using synthetic attributes than they were with real attributes. For the undirected graph (Fig. 5.11a), Agwan with 9 attributes gives the closest fit, followed by MAG with 1 attribute. For the directed graph (Figs. 5.11b–5.11e), Agwan with 8 attributes gives the best fit for all four triad patterns (Cycles, Middlemen, Ins, Outs). MAG performs
                     Agwan        MAG Latent
                     Empl. Type   L1       L2       L3       L4       L5        L6       L7      L8       L9
Singular Values KS   34.894       35.715   35.591   27.492   89.063   148.080   32.392   1.708   31.555   37.163
Singular Vector KS   0.282        1.636    1.630    1.453    0.616    0.765     1.586    1.525   1.526    1.552
Singular Values L2   18.360       19.285   18.938   13.768   90.672   160.831   28.601   6.158   28.074   38.490
Singular Vector L2   0.988        7.470    7.530    7.100    2.643    4.062     7.453    7.200   7.266    7.339

                     Agwan (synthetic)
                     L0       L1       L2       L3       L4       L5       L6       L7       L8       L9
Singular Values KS   37.497   37.866   37.377   36.590   36.159   34.801   33.812   32.696   26.494   8.327
Singular Vector KS   1.887    1.962    1.811    1.665    1.130    0.190    0.824    0.908    0.887    0.789
Singular Values L2   25.020   25.815   24.922   22.017   20.767   18.270   16.748   15.010   12.516   8.758
Singular Vector L2   7.814    6.764    7.798    5.949    5.471    0.388    4.088    4.421    2.725    1.396

Table 5.12: Spectral Properties for Enron Directed Graph—Synthetic Attributes
                            Agwan        MAG Latent
                            EQ5D State   L1      L2      L3      L4      L5      L6      L7      L8      L9
Clustering Coefficient KS   2.042        1.283   4.406   3.863   4.575   4.401   3.470   3.256   4.397   4.773

                            Agwan (synthetic)
                            L0      L1      L2      L3      L4      L5      L6      L7      L8      L9
Clustering Coefficient KS   5.353   5.350   3.561   4.615   4.395   4.054   4.470   3.676   3.401   3.440

Table 5.13: Clustering Coefficient for PAL Undirected Graph—Synthetic Attributes
                                       Agwan        MAG Latent
                                       Empl. Type   L1      L2      L3       L4       L5       L6      L7      L8      L9
Clustering Coefficient (In-Edges) KS   2.220        2.961   0.897   4.775    5.294    4.578    4.357   3.302   3.770   4.512
Clustering Coefficient (Out-Edges) KS  0.702        3.164   0.513   5.193    5.877    5.463    4.363   3.142   3.273   2.865
Clustering Coefficient KS              3.163        3.278   2.347   5.251    6.255    5.839    4.387   3.739   4.339   4.000
Clustering Coefficient (In-Edges) L2   1.528        2.507   1.786   6.733    12.533   7.692    5.841   4.184   5.705   4.819
Clustering Coefficient L2              2.284        2.450   2.419   10.611   22.886   13.922   5.851   4.568   5.381   7.653

                                       Agwan (synthetic)
                                       L0      L1      L2      L3      L4      L5      L6      L7      L8      L9
Clustering Coefficient (In-Edges) KS   3.477   3.567   4.386   4.159   3.704   3.682   2.678   0.662   0.492   3.200
Clustering Coefficient (Out-Edges) KS  4.945   4.316   5.134   4.969   4.948   4.747   2.605   3.204   1.501   2.620
Clustering Coefficient KS              4.580   4.018   4.837   2.691   4.369   3.933   0.848   2.575   1.524   0.686
Clustering Coefficient (In-Edges) L2   3.987   4.972   5.834   4.314   3.846   3.413   4.951   3.658   2.536   0.460
Clustering Coefficient L2              7.065   8.188   9.244   6.606   7.581   6.872   2.563   1.075   0.999   4.189

Table 5.14: Clustering Coefficient for Enron Directed Graph—Synthetic Attributes
[Figure 5.11: Triad Participation—Synthetic Attributes. Total triad strength vs. number of participating vertices. (a) Undirected; (b) Directed (Cycles); (c) Directed (Middlemen); (d) Directed (Ins); (e) Directed (Outs).]
                         Agwan        MAG Latent
                         EQ5D State   L1       L2       L3       L4       L5       L6       L7       L8       L9
Triad Participation KS   5.829        6.292    6.709    6.593    6.016    5.768    3.829    4.868    5.914    6.877
Triad Participation L2   18.348       12.047   15.550   15.821   17.494   11.038   11.646   10.367   14.507   29.136

                         Agwan (synthetic)
                         L0       L1       L2       L3       L4       L5       L6       L7       L8      L9
Triad Participation KS   6.985    7.090    6.994    5.991    5.872    6.607    6.131    5.561    2.238   1.204
Triad Participation L2   22.841   20.975   23.682   17.878   17.287   16.174   15.254   10.310   5.753   2.803

Table 5.15: Triad Participation for PAL Undirected Graph—Synthetic Attributes
                                     Agwan        MAG Latent
                                     Empl. Type   L1      L2      L3      L4       L5       L6      L7      L8      L9
Triad Participation (Cycles) KS      3.555        4.500   2.659   3.912   5.940    6.867    5.247   4.094   3.843   5.704
Triad Participation (Middlemen) KS   4.500        4.787   4.094   5.247   5.920    6.319    4.339   3.602   3.689   5.858
Triad Participation (Ins) KS         2.436        4.700   4.007   5.298   5.940    7.170    5.704   4.700   5.075   6.153
Triad Participation (Outs) KS        4.248        4.942   3.624   4.094   5.695    6.768    5.075   4.500   4.094   4.745
Triad Participation (Cycles) L2      3.101        3.017   2.407   4.816   15.981   16.270   6.781   6.121   5.378   8.763
Triad Participation (Middlemen) L2   4.207        4.310   3.586   7.121   19.126   18.575   7.016   7.204   6.517   11.150
Triad Participation (Ins) L2         4.332        3.757   3.575   7.742   12.061   16.361   9.756   8.905   9.124   13.740
Triad Participation (Outs) L2        3.203        4.537   3.305   4.615   17.093   14.603   5.950   5.399   4.698   6.315

                                     Agwan (synthetic)
                                     L0      L1      L2      L3       L4      L5      L6      L7      L8      L9
Triad Participation (Cycles) KS      4.500   3.912   2.996   5.347    3.602   3.283   2.996   3.912   1.204   1.548
Triad Participation (Middlemen) KS   4.787   4.248   3.401   5.670    3.843   3.314   3.807   4.248   1.609   1.099
Triad Participation (Ins) KS         4.700   4.942   3.912   4.571    4.700   3.283   2.862   4.094   1.609   1.099
Triad Participation (Outs) KS        4.942   2.526   1.476   7.788    3.977   3.912   3.114   2.862   1.696   0.916
Triad Participation (Cycles) L2      3.212   2.060   1.800   11.094   3.728   3.856   3.566   3.733   1.113   1.014
Triad Participation (Middlemen) L2   4.670   2.828   1.771   11.473   5.734   5.924   5.288   4.942   2.382   0.611
Triad Participation (Ins) L2         4.391   3.293   1.902   6.646    6.376   6.616   5.902   5.306   2.464   0.936
Triad Participation (Outs) L2        4.887   1.816   1.459   3.912    4.540   4.963   4.359   3.978   1.947   0.589

Table 5.16: Triad Participation for Enron Directed Graph—Synthetic Attributes
best with 1 or 2 attributes. For one motif (Ins), Agwan with real attributes outperforms MAG. As before, we conclude that real vertex labels and edge weights cannot adequately explain clustering. Synthetic vertex labels offer a more accurate model, but are sensitive to the number of degrees of freedom in the vertex label distribution. Triadic closure may offer a more satisfactory explanation for triad participation. Earlier in the chapter, we proposed Hypothesis 1, that edge weights are dependent on the labels of the edge endpoints, and conditionally independent of the rest of the graph. Our results provide evidence in support of this hypothesis, as our model of edge weights produces very accurate results for the vertex strength distribution and spectral properties of the generated graphs.
                          Vertex Labels                Edge Weights   Agwan                              MAG
Vertex Strength           Dependent                    Dependent      Accurate                           Less accurate
Singular Values           Dependent                    Dependent      Accurate                           Less accurate
Primary Singular Vector   Partially Dependent          Dependent      Accurate                           Poor
Clustering Coefficients   Conditionally Independent*   Independent    Less accurate when using real attributes; more accurate with synthetic attributes
Triad Participation       Conditionally Independent*   Independent    Less accurate when using real attributes; more accurate with synthetic attributes

Table 5.17: Summary of results and findings: dependencies between graph labels and weights and structural properties (* hypothesised); relative accuracy of the properties of weighted graphs generated using Agwan and MAG models
For clustering and triad participation, we conclude that these properties are independent of the edge weight distribution. While there appears to be a relationship with the vertex label distribution, we suggest that this may be an artefact of the true process giving rise to these properties, triadic closure. It is interesting to note that in general, MAG achieves the best results when there are one or two vertex attributes, whereas Agwan performs best when there are 7 or 8 attributes. MAG assumes that each attribute is independent, so there is a limit on the number of attributes that can be included in the model (proportional to log N). Above this limit, the performance of the model degrades. With Agwan, there is no independence assumption, so the attributes model the full joint probability. As the number of attribute values (2^x) approaches N, there is a danger of overfitting and the model performance degrades.
5.6 conclusions
In this chapter, we presented Agwan, a model for random graphs with discrete vertex labels and weighted edges. We included a fitting algorithm to learn a model of graph edge weights from real-world data, and a generative algorithm to generate random labelled, weighted graphs with similar characteristics to the real-world graph. Our results and findings are summarised in Table 5.17. The results show that Agwan can create an accurate model of weighted, labelled graphs. Vertex strength and spectral properties are modelled very closely, indicating that these properties are indeed dependent on vertex labels and edge weights.
Clustering and triad participation are modelled most accurately with synthetic attributes, and these properties appear to be independent of the edge weight distribution. It is likely that there are other graph properties which give rise to triangles and clustering which are not being adequately captured by our model. Further research into the relationship between vertex attributes and clustering is indicated. In the next chapter, we turn our attention from macro-level patterns in graphs to micro-level patterns. We will use our conclusions about Hypothesis 1 to develop a constraint-based approach to frequent pattern mining, which exploits the dependency between discrete vertex labels and numeric edge attributes.
SUBSTRUCTURE DISCOVERY IN LABELLED GRAPHS WITH NUMERIC ATTRIBUTES
6.1 introduction
In the previous chapter, we concerned ourselves with graph patterns that occur at a macro level, such as the degree distribution and spectral properties. In this chapter, we look at micro-level patterns, which consist of only a few vertices and edges but which are repeated many times: frequent substructures. Finding frequent substructures or subgraphs is a common task in graph mining (§3.2). Frequent substructures can be used for concept learning, classification, clustering or—as we shall see in the next chapter—for anomaly detection. In this chapter, we discuss the relationship between substructures and vertex and edge attributes. Most substructure discovery approaches assume discrete graph labels. We examine the relationship between discrete labels and the occurrence of substructures in Sect. 6.2.1. However, as we have seen in the previous chapter, many graph datasets also contain numeric labels or weights, representing counts, frequencies, or attributes such as size, distance or time. We examine the relationship between numeric graph labels and substructures in Sect. 6.2.2. This leads us into the main contribution of this chapter in Sect. 6.3: our method to use numeric attributes as a constraint during substructure discovery. In Sect. 6.4, we propose how our approach could be scaled up to graph elements with very high-dimensional numeric attributes and present some preliminary results in that direction. Sect. 6.5 presents our experiments validating the use of numeric outliers as a constraint on substructure mining, implemented as a pre-processing step. The approach we develop in this chapter leads us into Chapter 7, where we take it a step further to detect anomalous substructures in graphs with discrete and numeric attributes.
[Figure 6.1: Substructure Patterns (Motifs) Commonly Found in Real Graphs. (a) Chain; (b) Star; (c) Tree; (d) Clique.]
6.2 graph patterns and structure
Generative models such as Agwan (Chapter 5) illustrate the relationship between micro- and macroscopic structure in a graph. For example, triangles are a micro-pattern which occur frequently in real-world graphs (§2.3.4); the clustering coefficient is a macro-pattern which measures the number of edges which participate in triangles as a ratio of all edges (§2.3.5). In other words, triangles are a frequent pattern, or motif, which gives rise to larger-scale graph structure. Besides triangles, other common motifs include chains, stars, trees and cliques (Fig. 6.1). These motifs are also known to give rise to larger graph structures; [17] uses the presence of near-cliques and near-stars to detect anomalies in weighted graphs. Some motifs occur only in specific domains. For example, carbon rings (Fig. 6.2) commonly occur in databases of organic compounds, but are unlikely to be seen outside that domain. The job of frequent substructure discovery algorithms is to determine which graph patterns are typical of a specific dataset. While the relationship between triangles and clustering coefficient is well-studied, the relationship between arbitrary substructures and the larger graph structures that they may give rise to is not well understood.

[Figure 6.2: Carbon Ring]

6.2.1 Discrete Attributes and Substructure Mining

The generative model of the previous chapter relies on the fact that graph structure is highly correlated with vertex attributes (§5.2), due to effects such as homophily and heterophily. In this section, we consider the effect of vertex attributes on substructure discovery.
The complexity of substructure discovery is dependent not only on the size and edge density of the input graph, but also on the distribution of vertex labels. Substructure discovery algorithms rely on vertex partitioning to reduce the complexity of the GI test [82, 98, 121, 185]. Vertex partitions are similar to disjoint sets or equivalence classes:

definition 32: The vertex partition set is defined as:

$$V = \bigcup_i V_i \tag{6.1}$$

where all vertices in the same partition share the same discrete attribute values:

$$\forall v \in V_i \;\; \forall w \in V_i \;\; \forall l \in (L_V \cap L_D) : \; L_V(v, l) = L_V(w, l) \tag{6.2}$$

To allow graphs with labelled edges, we extend this notion to also define edge partitions:

definition 33: The edge partition set is defined as:

$$E = \bigcup_i E_i \tag{6.3}$$

where all edges in the same partition share the same discrete attribute values and their source and target vertices are from the same partitions:

$$\forall \langle v, w \rangle \in E_i \;\; \forall \langle x, y \rangle \in E_i : \; v \in V_j \wedge x \in V_j \wedge w \in V_k \wedge y \in V_k \tag{6.4}$$

In the case of an undirected graph, ⟨v, w⟩ ⇔ ⟨w, v⟩.
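Computing the partition sets of Definitions 32–33 is a straightforward grouping operation. A minimal sketch, assuming vertices and edges are stored as dictionaries of discrete attribute values; the data layout is illustrative, not the internal representation used by gSpan or Subdue.

```python
from collections import defaultdict

def vertex_partitions(vertex_attrs):
    """Group vertex ids by their tuple of discrete attribute values (Def. 32)."""
    parts = defaultdict(set)
    for v, attrs in vertex_attrs.items():
        parts[tuple(sorted(attrs.items()))].add(v)
    return parts

def edge_partitions(edges, vertex_attrs, edge_attrs):
    """Group edges by discrete edge attributes and endpoint partitions (Def. 33)."""
    key_of = {v: key for key, vs in vertex_partitions(vertex_attrs).items() for v in vs}
    parts = defaultdict(set)
    for (u, v) in edges:
        key = (tuple(sorted(edge_attrs[(u, v)].items())), key_of[u], key_of[v])
        parts[key].add((u, v))
    return parts
```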
for us to treat analytically, but we explore it by means of an example and some empirical results. Consider Figs. 6.3–6.4, which show a partial clique of four vertices and five edges. The unlabelled version of this graph (Fig. 6.3) contains eight distinct substructures with two or more vertices, 29 instances in total. Assume that we want to discover all possible substructures and calculate the support for each. Each candidate has to be compared to the substructures with the same vertex set and edge set. As there are
112
6.2 graph patterns and structure
[Figure 6.3: Relationship between vertex labelling and number of discovered substructures: Unlabelled Graph]
no labels, there is only one vertex partition and one edge partition, so in the worst case, every candidate has to be compared to each substructure with the same number of vertices and edges. For this trivial example, the worst case scenario requires 4 + 7 + 1 + 12 + 7 + 0 = 31 GI tests to determine the complete set of substructures. The labelled version of the same graph (Fig. 6.4) has a unique vertex and edge partition for each substructure. Each possible substructure has a unique vertex and edge partition set. As substructures with different partition sets cannot be isomorphic, we can determine that each substructure has only one instance without needing to do any GI
tests.
Figure 6.4: Relationship between vertex labelling and number of discovered substructures: Labelled Graph (with unique vertex labels). More label values imply more substructures, with fewer instances of each.
In practice, most real-world graphs lie somewhere between the extremes of no labels and a unique labelling for every vertex. Fig. 6.5 shows the relationship between the number of vertex partitions and the complexity of substructure discovery: the number of distinct substructures rises linearly with the number of partitions (Fig. 6.5a), but the number of instances of each substructure falls, leading to an exponential reduction in the number of GI tests required (Fig. 6.5b). Our experiments on real datasets (Sect. 6.5) verify that the complexity of substructure discovery increases with the homogeneity of vertices and edges.

A more subtle effect of vertex labelling is on the number of possible expansions of partially-overlapping substructures. Where each vertex is uniquely labelled, overlaps are not possible and every expansion is unique. Graphs with fewer labels are likely to have more overlapping instances. As we discussed in Sect. 3.2.3, overlapping instances do not in general contribute independently to the support of the substructure, as this would violate the anti-monotone property. However, where embedding lists are used, we need to maintain a record of each instance for the next phase of expansion, so overlapping instances may still have an impact on memory consumption.

In summary, more vertex and edge partitions lead to more substructures, with fewer instances of each. This affects the number of GI tests required during substructure discovery, which has a large impact on computational time and memory requirements.

6.2.2 Numeric Attributes and Substructure Mining

The experiments of Chapter 5 show how numeric edge weights influence graph structure at a macro level. In this section, we consider how weights and numeric attributes influence graph structure at a micro level, and propose a method to exploit this during substructure discovery. As we discussed in Chapter 4, most substructure mining approaches consider discrete attributes only. Numeric attributes have been integrated into graph mining either by discretizing them, or by using them indirectly as a constraint on graph mining (§4.2). Discretization approaches such as binning or clustering are usually unsatisfactory for a number of reasons: close values can have different labels if they are near a boundary; data is often assumed to follow
Figure 6.5: Effect of no. of vertex partitions on complexity of substructure discovery: (a) no. of unique substructures and (b) no. of GI tests (log scale), plotted against no. of vertex partitions (16–512). Results are shown for RMat random graphs with |V| = 1000, |E| = 8000 and 0, 1, . . . , 9 binary labels, i.e. 1–512 vertex partitions. Label values were assigned independently from a uniform distribution.
a uniform or Gaussian distribution; the optimal number of bins or clusters is unknown; and ordinal values are treated as categorical.

Constraint-based approaches avoid treating numeric attributes as discrete values. Instead, numeric values are used to prune the search space as a pre-processing or post-processing step, or during substructure mining. The constraint-based framework proposed by [73] presents three measure functions on edge weights (two statistical and one information-theoretic). Here we propose a new measure based on outlier detection. We generalise the framework of [73] to allow an arbitrary number of numeric attributes on vertices or edges. We also outline how our approach can scale to higher dimensions with thousands of numeric attributes. This is useful for graphs where each vertex or edge represents a document or image, from which we may wish to extract high-dimensional feature vectors.

Frequency-based approaches to substructure discovery rely on constraints that fulfil the anti-monotone property (Def. 30), most notably minimum support: the support of a graph cannot exceed the support of any of its subgraphs. However, weight-based constraints do not satisfy the anti-monotone property. [100] proposes that the traditional support calculation can be combined with a weighting ratio function in a way that does preserve the anti-monotone condition. However, [73] notes that there is frequently a correlation between graph structure and weights, which means that even though the anti-monotone property is not satisfied by weight-based constraints, in practice the results on real-world graphs are quite good. The anti-monotone property is not a concern for Subdue's compression-based mining, as the evaluation function (the compression ratio given in Equation 3.7) is not anti-monotone in any case.
6.3 constraint-based mining using numeric attributes
While previous work on attribute-based constraints has assumed independence between the structure of a graph and its attributes, this is not the case in most real-world graphs. Based on our work in Chapter 5, our approach makes the opposite assumption: that the attribute values on each vertex and edge are dependent on the values on adjacent vertices, but are conditionally independent of the rest of the graph. In this section, we use this assumption to define a constraint based on numeric outliers within the vertex and edge partition sets of the graph.
Constraint-based mining is an approximation approach: we sacrifice completeness of the result set in return for a reduced result set which contains the most interesting patterns, which we hope to obtain at a lower computational cost. One question is how to define which patterns are most interesting. Frequency-based approaches define interesting in terms of the number of transactions which contain the pattern; compression-based approaches define interesting in terms of the compression ratio, which takes the DL or size of the patterns into account as well as the number of instances. To incorporate numeric attributes into both of these approaches, we define the most descriptive substructures as those which have normative values for their numeric attributes, in addition to having an interesting structure as defined above. If the most descriptive substructures are normative in terms of their structure and numeric attributes, the corollary is that vertices or edges containing numeric outliers are abnormal and can be pruned early in the discovery process.

We determine whether numeric attributes are normal or anomalous by means of a numeric outlier detection function (Def. 31). In Chapter 4, we compared four approaches to outlier detection and concluded that density-based approaches are most suitable for unsupervised outlier detection where the underlying data distribution is not known in advance. For our experiments in this chapter, we use LOF, as it has received the most positive reviews over time in various real data domains. [163] surveys a number of variants of LOF and concludes that in most cases, LOF performs just as well as domain-specific specialised approaches. We were also motivated by the fact that a reference implementation for LOF exists, allowing us to validate our implementation.

Based on our results in Chapter 5, we can assume that edge weights are dependent on the discrete attributes of the source and target vertices. We generalise this to allow multi-dimensional numeric attributes on edges. Now we can calculate which edges are outliers, relative to the other edges in the same edge partition E_i (Def. 33):

definition 34: Outlier function. For each edge e ∈ E_i, we define d_e as a multi-dimensional feature vector across all the numeric attributes of e: d_e = L_E(e, L_N). The outlier function O(d_e) is calculated relative to the dataset D_i defined by the edge partition E_i:

O : D_i \to \mathbb{R}, \quad \forall d_e \in D_i : e \in E_i \qquad (6.5)
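As a concrete sketch of the outlier function of Definition 34, the following assumes scikit-learn, whose LocalOutlierFactor estimator exposes the negated LOF of each training point via its negative_outlier_factor_ attribute; the choice k = 20 is illustrative:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def partition_lof(D_i, k=20):
    """Outlier function O over one edge partition (Def. 34).
    D_i: (n x d) array, one numeric feature vector d_e per edge in E_i.
    Returns the LOF of each edge relative to its own partition only."""
    D_i = np.asarray(D_i)
    if len(D_i) < 2:
        # a singleton partition has no neighbours; treat it as normal
        return np.ones(len(D_i))
    lof = LocalOutlierFactor(n_neighbors=min(k, len(D_i) - 1))
    lof.fit(D_i)
    # scikit-learn stores -LOF; negate to recover O(d_e), where values
    # close to 1 are normal and values much greater than 1 are outliers
    return -lof.negative_outlier_factor_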
Outlier factors for vertex attributes can be defined relative to their vertex partition in an analogous manner.

6.3.1 Using outlier values to prune the graph in pre-processing

The first way to use numeric outliers to constrain substructure discovery is in a pre-processing step. We use the outlier value to classify vertices and edges as normal or anomalous:

definition 35: A vertex v ∈ V_i is normal if O(L_V(v, L_N)) ≤ 1, anomalous otherwise. An edge e ∈ E_i is normal if O(L_E(e, L_N)) ≤ 1, anomalous otherwise.

Anomalous vertices and edges are pruned from the graph during pre-processing, before generating all frequent 1- and 2-vertex subgraphs. After the pruning step, substructure discovery proceeds as normal. (See Appendix A for details.) One of the advantages of removing outlier vertices and edges in a pre-processing step is that it dramatically reduces the number of GI tests required, which has a significant impact on runtime and memory requirements. It would also be possible to perform the pruning as a post-processing step, as in [72]. This would improve the discriminativeness of the results but would not have any effect on runtime performance. One question that arises is the effect of removing anomalous vertices and edges on the completeness of the result set. We will examine this in the experiments.
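A minimal sketch of the pruning step of Definition 35 (ours; it assumes outlier scores have already been computed per partition, e.g. with partition_lof above):

def prune_anomalous(vertices, edges, vertex_scores, edge_scores):
    """Definition 35: keep only elements whose outlier score is <= 1.
    vertex_scores / edge_scores map each element to O(.), computed
    within its own partition."""
    keep_v = {v for v in vertices if vertex_scores[v] <= 1.0}
    # an edge survives only if it is normal AND both endpoints survive
    keep_e = [(v, w) for (v, w) in edges
              if edge_scores[(v, w)] <= 1.0 and v in keep_v and w in keep_v]
    return keep_v, keep_e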
6.3.2 Incorporating outlier values into substructure discovery

An alternative to pruning the graph in pre-processing is to incorporate the outlier result directly into the support calculation. Normally, the contribution of a transaction G towards the support of a subgraph g is equal to:

\mathrm{support}(G, g) = \begin{cases} 1 & \text{if } g \subseteq G \\ 0 & \text{otherwise} \end{cases} \qquad (6.6)
This measure can be replaced with the weighted support of each transaction, WS(G, g), where normal subgraphs have WS(G, g) = 1 and subgraphs with anomalous elements have 0 < WS(G, g) < 1.
The weight w of each vertex and edge can be calculated as the reciprocal of its LOF score l. This gives 0 < w < 1 for outlier elements. Data points within a cluster usually have a LOF score in the range 0.5 < l < 1, so normal elements will have weight 1 ≤ w < 2. We threshold w at 1 so that normal elements contribute evenly to the weighted support of the subgraph w_g. w_g is calculated as the average of the subgraph's vertex and edge weights:

w_g = \frac{1}{|V_g| + |E_g|} \left( \sum_{v \in V_g} \min\left( \frac{1}{O(L_V(v, L_N))}, 1 \right) + \sum_{e \in E_g} \min\left( \frac{1}{O(L_E(e, L_N))}, 1 \right) \right) \qquad (6.7)
The weighted support of a subgraph g from a transaction G is calculated as the maximum weighted support of all the instances of g in G:

WS(G, g) = \begin{cases} 0 & \text{if } g \not\subseteq G \\ \max_{g_i \subseteq G,\ g_i \simeq g} w_{g_i} & \text{otherwise} \end{cases} \qquad (6.8)
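The following sketch (ours, with hypothetical score lookups) computes the instance weight of Equation 6.7 and the weighted support of Equation 6.8; each instance is assumed to be given as its sets of vertices and edges:

def instance_weight(instance_vertices, instance_edges,
                    vertex_scores, edge_scores):
    """w_g of Equation 6.7: the average of capped reciprocal LOF scores,
    so normal elements contribute 1 and outliers contribute < 1."""
    ws = [min(1.0 / vertex_scores[v], 1.0) for v in instance_vertices]
    ws += [min(1.0 / edge_scores[e], 1.0) for e in instance_edges]
    return sum(ws) / len(ws)

def weighted_support(instances, vertex_scores, edge_scores):
    """WS(G, g) of Equation 6.8: the maximum w_g over all instances
    g_i of g in transaction G (0 if there are none)."""
    if not instances:
        return 0.0
    return max(instance_weight(vs, es, vertex_scores, edge_scores)
               for (vs, es) in instances)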
Thus, instances with normal numeric attributes will have support WS(G, g) = 1, while instances with anomalous numeric attributes will have support 0 < WS(G, g) < 1.

Our method can be incorporated into compression-based substructure discovery in an analogous manner. We replace the MDL-based compression ratio of Equation 3.7 with a compression ratio based on the outlier score of the graph elements. The software implementation of Subdue [6] has a simplified method of calculating the MDL based on graph size: DL(G) = |V_G| + |E_G|, i.e., each vertex and edge contributes 1 to the DL of the graph or subgraph. We extend this into a Weighted Description Length WDL(G) by making the contribution of each vertex and edge equal to the reciprocal of its LOF score:

WDL(G) = \sum_{v \in V_G} \frac{1}{O(L_V(v, L_N))} + \sum_{e \in E_G} \frac{1}{O(L_E(e, L_N))} \qquad (6.9)
Normal elements will contribute wdl ≈ 1; "super-normal" elements (those with numeric attributes deep within a cluster) will contribute 1 < wdl ≲ 2; while anomalous elements will contribute wdl < 1. Thus graphs with more normal elements will have a higher WDL. The most descriptive substructure is the one which minimises the WDL of the input graph when it is compressed with the substructure:

\operatorname{argmin}_{G_S} \frac{WDL(G_S) + WDL(G|G_S)}{WDL(G)} \qquad (6.10)

6.4 high-dimensional numeric attributes
So far, we have used LOF to determine the outlierness of the numeric attributes of each element in the graph. LOF works well for datasets with low to moderate dimensionality. At higher dimensions, we face additional challenges, the most important of which is determining the nearest neighbourhood of each point in high-dimensional space. In low dimensions, an indexed search for the kNN of each point has complexity O(nm log n), for n data points with dimensionality m. However, the X-tree index proposed in [40] is only effective up to around 20 dimensions. Beyond this, the index degrades, and a brute-force search for the kNN of each point has complexity O(n²m). This makes applying LOF directly to large, high-dimensional datasets intractable.

We address this problem using RP + PINN + LOF, as described in Sect. 4.4. The RP creates a lower-dimensional representation of the data points, which preserves the pairwise Euclidean distances within a known error bound. The kNN of each point is determined in this projected space, with complexity O(nm log n). These neighbourhoods are used as an index into the original space to calculate the LOF of each point.

To evaluate the performance of RP + PINN + LOF on graph elements with very high-dimensional numeric attributes, we use the Enron email graph. The text of each e-mail can be represented as a "bag of words", a feature vector representing the number of occurrences in the message of each word in a dictionary [129]. Using the bipartite graph representation of the Enron dataset (§3.4.3), each message is represented by a vertex, with the bag of words as its numeric attributes. Edges represent relationships between people (senders and recipients) and messages. Fig. 6.6 shows the time to compute outliers on the full set of messages from the Enron e-mail corpus (|V| = 300,000) for dictionary sizes of 1,000–4,000 words.
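As an illustration of the RP + PINN + LOF pipeline, here is a rough sketch under our assumptions (scikit-learn's SparseRandomProjection standing in for the Achlioptas-style projection); this is not the implementation of Appendix B, and a production version would batch the distance computations:

import numpy as np
from sklearn.random_projection import SparseRandomProjection
from sklearn.neighbors import NearestNeighbors

def pinn_lof(X, k=20, eps=0.3):
    """Approximate LOF via Projection-Indexed Nearest Neighbours (PINN):
    find each point's kNN in a random projection of the data, then
    compute LOF-style scores from distances in the original space."""
    # sparse random projection preserves pairwise Euclidean distances
    # within a factor of (1 +/- eps) with high probability
    rp = SparseRandomProjection(eps=eps, random_state=0)
    X_proj = rp.fit_transform(X)

    # kNN search in the low-dimensional projected space
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_proj)
    _, idx = nn.kneighbors(X_proj)
    idx = idx[:, 1:]  # drop each point's zero-distance self-match

    # re-measure distances to the candidate neighbours in the original space
    n = X.shape[0]
    kdist = np.empty(n)
    reach = np.empty((n, k))
    for i in range(n):
        d = np.linalg.norm(X[idx[i]] - X[i], axis=1)
        order = np.argsort(d)
        idx[i], reach[i] = idx[i][order], d[order]
        kdist[i] = reach[i][-1]

    # local reachability density and LOF, as in Breunig et al.
    lrd = np.empty(n)
    for i in range(n):
        rd = np.maximum(reach[i], kdist[idx[i]])
        lrd[i] = 1.0 / (rd.mean() + 1e-12)
    return np.array([lrd[idx[i]].mean() / lrd[i] for i in range(n)])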
Figure 6.6: Computation time (runtime in hours vs. no. of dimensions, 1,000–4,000) for RP + PINN + LOF on "bag of words" feature vectors on the Enron bipartite graph
This verifies that the RP + PINN + LOF approach is tractable for processing the high-dimensional attributes of large graphs. We were prevented from increasing the dictionary size above 4,000 by memory constraints (in Fig. 6.6, the sharp rise in computation time for 4,000 attributes is probably an artefact caused by exceeding physical memory and using swap). Our implementation of the Achlioptas Random Projection [9] held all objects in main memory (see Appendix B). However, we note that Achlioptas RP is designed to be "database-friendly"; it would be straightforward to reimplement it with an index on disk in order to process datasets with more than 4,000 dimensions.

Fig. 6.7 shows the distribution of LOF scores for the four dictionary sizes. Fig. 6.7a shows the distribution of LOF scores at the lower end of the scale, representing the "normal" data samples and the transition to anomalous values. LOF scores above 6 in the long tail are not shown. The shape of the curve is somewhat different to the characteristic shape of LOF for low-dimensional attributes (see Fig. 4.5 for comparison). At first glance, the threshold at which a point is an outlier seems to be ≈ 2.5 rather than 1. Figs. 6.7b–6.7e show the distribution of the long tail of LOF scores (log–log scale). As dimensionality increases, the distribution of normal values changes only slightly (Fig. 6.7a), but the distribution of outliers becomes spread out over a wider range.

These initial results are encouraging, but further investigation is required.
Figure 6.7: Distribution of LOF scores for RP + PINN + LOF on "bag of words" feature vectors on the Enron bipartite graph: (a) distribution of LOF scores (proportion of vertices, for 1,000–4,000 dimensions); (b)–(e) long tails of the distributions (no. of vertices vs. LOF score, log–log) for 1,000, 2,000, 3,000 and 4,000 dimensions
The Lp-norm distance measures used by LOF are valid on dense feature vectors (as long as each dimension adds information [190]), but we need to establish that they remain meaningful on sparse feature vectors such as our bag of words. If the distances are valid, it may be necessary to rescale the LOF scores, for example using one of the methods presented in [119]. As these results are preliminary, we do not include the higher-dimensional numeric attributes in our substructure discovery experiments in the next section. However, they do suggest some interesting directions for future research.
6.5 experiments and results
Our experiments evaluate our constraint-based mining approach implemented as a pre-processing step, as described in Sect. 6.3.1. This allows us to compare results using a frequent subgraph mining algorithm for graph databases (gSpan), and a compression-based mining algorithm for single large graphs (Subdue). Our experiments measure the runtime performance, memory performance and result set coverage of our constrained mining approach compared to unconstrained mining using the standard substructure discovery algorithms.

6.5.1 Graph Transaction Database

Our first set of experiments was performed on the ACS graph database of Sect. 3.4.4, using gSpan. Each transaction represents the movement of an individual between door sensors. Edges are labelled with two numeric attributes. We ran each set of experiments three times: once using the absolute time numeric attribute, once using elapsed time, and once using a two-dimensional feature vector of absolute + elapsed time.

Fig. 6.8 shows the runtime and memory performance of substructure discovery with LOF + gSpan (constrained search) against gSpan (unconstrained search), for a range of minimum support thresholds (minsup). With unconstrained search, the lowest possible minsup was 1.3%; below this, memory was exhausted and the search became intractable. Using constrained search, it was possible to lower minsup to 0.4% within the same time and memory limits. At minsup 1.3%, constrained search with one numeric attribute ran ≈ 100 times faster than unconstrained search; using two numeric attributes, it ran 183 times faster.
See Appendix B for implementation details.
Figure 6.8: Runtime and memory performance of constrained and unconstrained substructure discovery in the Access Control System graph database: (a) computation time (runtime in seconds, log scale) and (b) memory requirements (maximum resident set in GB, log scale), plotted against minimum support threshold (0.4–2%), for unconstrained search and for constrained search with the absolute time, elapsed time, and absolute + elapsed time constraints
Fig. 6.9a shows the number of patterns returned at different minimum support thresholds. The maximum number of patterns that could be discovered using the unconstrained search was 54. With the constrained search, it is possible to find more patterns by lowering the minimum support threshold. As with all approximation approaches, our method does not guarantee completeness. Fig. 6.9b shows the coverage of the result set for various minimum support thresholds. As a baseline, we took the 54 patterns discovered by unconstrained gSpan with minsup 1.3%. The figure shows that most of these patterns only show up between minsup 1.3–1.6%.
Figure 6.9: Analysis of subgraphs returned by constrained and unconstrained substructure discovery in the Access Control System graph database: (a) no. of discovered substructures (pattern count vs. minimum support threshold, 0.4–2%) and (b) coverage
The result set discovered using the elapsed time constraint at minsup 1.3% is the same as that found by unconstrained search at minsup 1.6%. By lowering minimum support, the constrained search gradually recovers more of the result set. At minsup 1.0%, the constrained search recovers around 45% of the patterns in 16.8 seconds, whereas the unconstrained search takes 22.7 minutes to discover 100% of the patterns.

Figs. 6.10a and 6.10b give some insights into which patterns are retained.
Figure 6.10: Analysis of subgraphs returned by constrained and unconstrained substructure discovery in the Access Control System graph database: (a) pattern counts for the 54 patterns in the unconstrained result set and (b) ranking order of the top 20 patterns (constrained vs. unconstrained search, marking patterns ranked the same and patterns demoted)
For comparison, we set the minimum support for the constrained search so as to return approximately the same number of patterns as the maximum for unconstrained search: unconstrained search with minsup 1.3% (54 patterns; 1,363 secs) is compared to constrained search with minsup 0.7% (50 patterns; 10 secs). Fig. 6.10a shows the count of each of the 54 patterns in the unconstrained result set. In the constrained result set, the support of most patterns is reduced by ≈ 10–30%; where the support level dropped below the minimum support threshold, the count shows as zero. Fig. 6.10b shows the relative ranking of the top 20 patterns for the same result set. The top 10 patterns (those with the highest support) are all retained, though the ranking order is changed.

These results show a beneficial trade-off between improved performance and reduced coverage. The patterns which are retained are the most descriptive ones, as they satisfy both minimum support
and the normal numeric attribute constraint. This conclusion could be strengthened by applying our approach to a supervised learning task on a dataset with ground truth.

6.5.2 Single Large Graphs

Figure 6.11: Performance of substructure discovery in the Enron graphs: (a) computation time in hours on the Enron Social graph (unconstrained search vs. the weight constraint, at depth 4/beam 100, depth 7/beam 10 and depth 10/beam 100) and (b) computation time in hours on the Enron Bipartite graph (unconstrained search vs. the size, time of day, and size + time of day constraints, at depth 8/beam 100, depth 9/beam 50 and depth 9/beam 100)
Figure 6.12: Performance of substructure discovery in the Enron graphs: (a) memory requirements (maximum resident set in GB) on the Enron Bipartite graph (unconstrained search vs. the size, time of day, and size + time of day constraints, at depth 8/beam 100, depth 9/beam 50 and depth 9/beam 100)

The experiments on single large graphs compare LOF + Subdue (constrained search) with standard Subdue (unconstrained search). We performed one set of experiments on the Enron social graph (§3.4.2), with edge weights representing the number of e-mails exchanged between actors; and one set of experiments on the Enron bipartite graph (§3.4.3), where vertices have numeric labels for the size of the
message in bytes and the time of day (seconds since midnight) at which the message was sent. For this dataset, we performed three sets of experiments, using size as the numeric attribute, time of day as the numeric attribute, and a two-dimensional feature vector of size + time of day.

The results for single large graphs are shown in Figs. 6.11–6.14. Fig. 6.11 shows the computation time and Fig. 6.12 shows the memory usage on the Enron graphs for a range of Subdue parameter settings: depth controls how large discovered substructures are allowed to grow before the search terminates; beam width controls the "greediness" of the search. For the Enron Social graph at depth 4, the constrained search was 142 times faster. Increasing to depth 7, the unconstrained search was intractable; narrowing the beam width to 10, the constrained search was 155 times faster. At depth 10, the unconstrained search was intractable even at beam 10, but the constrained search could still complete in reasonable time at beam 100. For the Enron Bipartite graph (Fig. 6.11b), calculating LOFs across the two-dimensional feature vector gives better performance than using either attribute in isolation. Memory requirements are similarly reduced (Fig. 6.12a). These results demonstrate that using our approach, we can relax the greediness of the Subdue search while keeping processing time and memory usage tractable.

Next, we analysed the result sets (Figs. 6.13–6.14). The constrained search returns a similar set of patterns to the unconstrained search. As Subdue uses a greedy search heuristic, it does not return a complete set of results, so it is not meaningful to compare coverage.
Figure 6.13: Analysis of subgraphs returned by substructure discovery in the Enron Social graph: (a) pattern counts, (b) compression ratios, and (c) ranking order of the top 20 patterns, for unconstrained search vs. the weight constraint
Figure 6.14: Analysis of subgraphs returned by substructure discovery in the Enron Bipartite graph: (a) pattern counts, (b) compression ratios, and (c) ranking order of the top 20 patterns, for unconstrained search vs. the size + time of day constraint
The relative counts of the top patterns are shown in Figs. 6.13a and 6.14a. As Subdue ranks patterns by their compression ratio, the most frequent patterns are not necessarily ranked highest. While the total number of instances of each pattern has been reduced by the constraint, the relative count of each pattern is approximately preserved.

The compression ratios of the top patterns are compared in Figs. 6.13b and 6.14b. The shape of the curve is approximately preserved for the constrained and unconstrained search. In Fig. 6.13b, the gradient of compression ratios is steeper using the constraint: by removing anomalous edges, we have increased the discrimination between "good" patterns and less good patterns. In the Bipartite graph (Fig. 6.14b), there are fewer interesting patterns to begin with (patterns with a compression ratio of 1.0 or less are not interesting). The compression ratios of the interesting patterns are preserved very well.

The ranking of the top 20 substructures discovered by the constrained and unconstrained searches is shown in Figs. 6.13c and 6.14c. For both graphs, 17 of the top 20 results are the same, and the ordering is very similar. Thus our constraint-based approach provides order-of-magnitude improvements in discovery time with only a small impact on which substructures are discovered.
6.6 conclusions
In this chapter we considered how discrete and numeric vertex and edge attributes contribute to graph substructures or motifs. Our main contribution is a method to use numeric outliers as a constraint during substructure discovery. Our thesis is that the most descriptive substructures are those which are not only the most frequent, but which are also normative in terms of their numeric attributes. We presented three methods of integrating our outlier-based constraint into substructure discovery: as a guided pruning step (which can be applied to any substructure discovery algorithm); as weighted support (for frequent subgraph algorithms); and as weighted description length (for compression-based algorithms). We also outlined how our approach can be extended to work when vertices or edges represent images or text and have very high-dimensional numeric features. Our experiments evaluated our constraint-based approach, implemented as a pre-processing step, on graph transaction databases (using LOF + gSpan) and on single large graphs (using LOF + Subdue).
Our results show that removing a small number of anomalous edges or vertices reduces the number of instances of each pattern, which has a significant effect on runtime and memory usage. While our result set is not complete, we retain the most descriptive subgraphs. In many cases where the input graph is intractable with an unconstrained approach due to the computational or memory overheads, our approach allows the graph to be processed. For the heuristic search with LOF + Subdue, we show that the relative count and compression ratio of each pattern are approximately preserved, but with improved discrimination between patterns in the result set. The ranking of discovered patterns is not significantly altered.

In future work, we plan to incorporate the outlier score into the support and description length calculations, as outlined in Sect. 6.3.2. As this work requires a re-implementation of gSpan and Subdue, it is out of the scope of the current project. We also propose to objectively measure the accuracy of the result set by applying our method to a classification task on a dataset with ground truth. We expect to be able to improve classification accuracy by incorporating numeric attributes.

There is a substantial amount of work to be done in outlier detection in high-dimensional space. For graphs with a sparse high-dimensional representation, such as the Enron bag of words, further investigation is required to determine whether Lp-norm distances are still discriminative. If Lp-norm distances are not appropriate, further research is required to determine whether there exists a "database-friendly" random projection for other, more suitable distance measures. Where there is a high-dimensional representation with many irrelevant attributes, we need to investigate detecting numeric anomalies in subspaces rather than in full space, by choosing only locally-relevant attributes on which to calculate the outlier score.

In this chapter, we have concentrated on finding the most descriptive substructures in graphs with discrete and numeric attributes. In Chapter 7, we build on this to find anomalous substructures.
7 YAGADA: DETECTING ANOMALOUS SUBSTRUCTURES IN LABELLED GRAPHS WITH NUMERIC ATTRIBUTES
7.1 introduction
So far, we have focussed on finding typical patterns or substructures. In this chapter, we turn our attention to atypical patterns, or anomalies. There are numerous practical applications of anomaly detection, including uncovering fraud, detecting network intrusions or detecting suspicious activity in physical environments.

Continuing from Chapter 6, we develop our method for finding the most descriptive substructures into an approach to detect anomalous substructures, based on information theory. Over multiple iterations, we find the most descriptive patterns in labelled, weighted graphs. At each iteration, the "best" pattern is compressed away. At the end of the process, patterns which are left uncompressed are anomalous. For SLGs, we can detect anomalous regions. For graph databases, we can go a step further and assign an anomaly score to each graph transaction. The intuition is that patterns which are compressed early and fully are the most typical, while patterns which are compressed late or not at all are the most anomalous. Each transaction is scored based on how much and how soon it was compressed. This allows us to rank transactions in order of anomalousness, or to select the top-n anomalous transactions.

As in Chapter 6, we are focussed on graphs which have both discrete and numeric labels. As graph structure and attributes are correlated, we expect anomalous regions to show up in the numeric attributes. By incorporating numeric data into our approach, we increase the discrimination of anomaly detection.

We begin in Sect. 7.2 with some worked examples, to illustrate the difference between ignoring numeric attributes and incorporating them into the substructure discovery phase. We also show how to calculate anomaly scores. In Sect. 7.3, we present our algorithm, Yagada (Yet Another Graph-based Anomaly Detection Algorithm), to detect anomalies in labelled graphs with numeric attributes. In
Sect. 7.4, we evaluate Yagada on Access Control System graph databases, demonstrating how it can be used to detect unusual behaviour in the setting of a secure building such as a hospital, power station or airport. We conclude in Sect. 7.5 with a discussion of how Yagada can be improved.
7.2 detecting anomalies in graphs with numeric labels
Back in Sect. 3.3, we surveyed the main approaches to graph-based anomaly detection. While a few approaches look for anomalies in global graph statistics [63, 152], most approaches search for anomalies in structural elements: vertices [17, 178, 167], edges [47, 155], paths [132], clusters [51] or substructures [68, 144]. We focus on detecting anomalous substructures, as they represent the most general case (vertices, paths, etc. are also substructures).

Two substructure-based approaches in the literature are based on Subdue. Noble and Cook [144] use substructure entropy: graph anomalies are the parts of the graph not described by the frequent substructures. Eberle and Holder's GBAD (Graph-based Anomaly Detection) [68] detects anomalies as unexpected deviations from normal substructures, in terms of vertex or edge additions, deletions or modifications.

We recall from Sect. 3.2.2 that Subdue considers the most descriptive substructure to be the one which minimises the DL of the input graph (Equation 3.7). As the denominator DL(G) is constant for any given input graph, we can restate Equation 3.7 to say that the best substructure is the one that minimises:

M(G_S, G) = DL(G_S) + DL(G|G_S) \qquad (7.1)
where DL(G|G_S) is the DL of G after compressing it with G_S. The converse is that substructures with a high value of M(G_S, G) are infrequent and therefore anomalous.

Anomaly detection approaches based on Subdue do not consider numeric attributes or weights. From Chapters 5–6, we know that graph structure and attributes are correlated, so we expect to be able to improve the discrimination between normal and anomalous substructures by taking numeric values into account. Yagada combines structural and numeric anomaly detection. To give the intuition behind our method, we present two worked examples, for anomaly detection in a SLG and in a graph database. We show that using Subdue will miss some important anomalies, which can be detected by taking numeric attribute values into account. We present the algorithm for Yagada in Sect. 7.3.
Figure 7.1: Single Large Graph representing TCP SYN and ICMP PING network traffic, with two Denial of Service (DoS) attacks taking place. Anomalies are highlighted in blue.
7.2.1 Detecting Anomalies in Single Large Graphs

Figure 7.1 shows a fragment of a bipartite graph representing traffic in a TCP/IP network. Vertices represent network servers or network protocols and have discrete labels. Edges represent traffic flows and are weighted with the number of TCP/IP packets sent or received. There are two types of network traffic: ping is used to test that another server is alive, and SYN is used to synchronise the order of TCP packets. SYN is a three-part exchange: the TCP client sends a synchronise (SYN) request; the server acknowledges this (SYN-ACK); and then waits for the client acknowledgement (ACK). The anomalies in this graph (highlighted in blue) represent two common Denial of Service (DoS) attacks. A ping flooding attack is where the client sends continuous ping requests to consume the server's resources. A SYN flooding attack is where a client starts many SYN exchanges but fails to send the final ACK, forcing the server to hold all its ports open and denying the service to legitimate users.

To illustrate structural anomaly detection, let G be the graph in Figure 7.1. Subdue uses a greedy beam search (see algorithm in Appendix A.2) to find candidate substructures, and evaluates them
using Equation 7.1. Assuming we ignore numeric attributes, at the end of the first iteration, the best substructure S_1 is:
If we compress G with S_1, we obtain a new graph G_1 = (G|S_1):

In the second iteration, we discover the next best frequent substructure S_2:

Compressing G_1 with S_2 gives G_2 = (G_1|S_2):
No more compression is possible, so we evaluate the compressed graph. Anomalies show up as uncompressed regions. The missing ACK vertex means that the Host vertices A and B are not compressed as much as the rest of the graph, so the SYN flooding attack shows up as an anomalous region. The Ping flooding attack is undetected. Evaluating the numeric edge weights as discrete labels does not help: 5, 3 and 9473 will be considered as distinct label values, resulting in substructures which are all equally anomalous.

We can improve anomaly detection by incorporating numeric attributes in a more meaningful way. Following the approach of Chapter 6, we distinguish between normal and anomalous numeric attributes, and encode that information in the graph labels. We assign a constant label q_0 to edges with normal numeric attributes and some label q_i to edges with anomalous values. To illustrate this approach, let G^a be the graph in Fig. 7.1, replacing edge weight 227 with some value q_1, edge weight 9473 with q_2 and
replacing all other edge weights with q_0. Now when we apply Subdue to G^a, after two iterations we discover the two most descriptive substructures S^a_1 and S^a_2:

When we compress G^a with S^a_1 and S^a_2, we obtain G^a_1:
By incorporating numeric attribute values, we are able to detect both the SYN flooding attack and the Ping flooding attack. The anomalous region around the SYN attack is much more prominent, as there are both structural and numeric anomalies. In our second example, we show how this approach can be applied to graph databases.

7.2.2 Detecting Anomalies in Graph Databases

Figure 7.2: Graph Database where each subgraph G_1–G_4 is a transaction at an Automated Teller Machine

Figure 7.2 shows a fragment of a graph database with four transactions, G_1–G_4, each of which represents a cash withdrawal from a bank account at an Automated Teller Machine (ATM). The hub vertex for each transaction is labelled with the account number (discrete). The hub is connected to four leaf vertices by labelled edges indicating the type of data at the vertex. The leaf vertices are labelled with
the transaction type (discrete; W means a withdrawal), the amount of the transaction (numeric), the time that the transaction took place (numeric) and the physical location where the transaction took place. A real database might represent locations as a two-dimensional numeric ⟨latitude, longitude⟩ pair, but for this trivial example we have used two discrete locations, A and B. There are two anomalies in this database fragment. Transaction G_1 took place at an unusual time of day and was for an unusual amount. Transaction G_4 took place in an unusual location.

To illustrate anomaly detection in the transaction database context, let G be the graph database in Fig. 7.2. As before, we use Subdue to find and evaluate candidate substructures. The best substructure S_1 depends on how we treat numeric attributes:
In either case, compressing with S_1 will detect the anomalous location in G_4 but will miss the anomalous time and amount in G_1. As in the SLG case, we can improve anomaly detection by encoding the numeric anomalies as labels on the graph. Let G^a be the graph database in Fig. 7.2, replacing time 0300 with some value q_1, amount 250 with some value q_2 and all other numeric values with q_0. Now the first two iterations discover the best substructures S^a_1 and S^a_2:
Compressing all subgraphs with S^a_1 and S^a_2 gives:

Transactions which represent typical behaviour are compressed completely, whereas those which are in some way atypical are only partially compressed (or not at all). By measuring how soon and how much each subgraph is compressed, we can score the transactions, allowing us to rank them from most anomalous to least anomalous. Following [144], we define c_i, the proportion of the subgraph which is compressed away after the ith iteration:

c_i = \frac{DL_{i-1}(G) - DL_i(G)}{DL_0(G)} \qquad (7.2)

where DL_i(G) is the DL of graph G after compression at the end of iteration i. We use this to define the anomaly score a_s for each transaction:

a_s = 1 - \frac{1}{n} \sum_{i=1}^{n} (n - i + 1) \cdot c_i \qquad (7.3)

where n is the number of iterations. Using this measure for our graph database, G_1 is not compressed at all, so its score is a = 1.0. G_4 is compressed only in the first iteration, so a = 1 − (1/2)(2 · 3/5) = 0.4. G_2 and G_3 are compressed on both iterations, so a = 1 − (1/2)(2 · 3/5 + 1 · 1/5) = 0.3. This gives our ranking order from most to least anomalous transactions as G_1 > G_4 > G_2 > G_3.
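A direct transcription of Equations 7.2–7.3 (ours); the worked values below reproduce the ATM example, with description lengths normalised so that c_1 = 3/5 and c_2 = 1/5 as in the text:

def anomaly_score(dl_history):
    """Transaction anomaly score (Equations 7.2-7.3).
    dl_history = [DL_0, ..., DL_n]: the transaction's description length
    after each of n compression iterations (DL_0 = uncompressed size)."""
    n = len(dl_history) - 1
    dl0 = float(dl_history[0])
    # c_i: proportion compressed away at iteration i (Equation 7.2)
    c = [(dl_history[i - 1] - dl_history[i]) / dl0 for i in range(1, n + 1)]
    # earlier compression (small i) gets the larger weight (n - i + 1)
    return 1 - sum((n - i + 1) * c[i - 1] for i in range(1, n + 1)) / n

scores = {
    "G1": anomaly_score([1.0, 1.0, 1.0]),  # never compressed      -> 1.0
    "G4": anomaly_score([1.0, 0.4, 0.4]),  # c_1 = 3/5 only        -> 0.4
    "G2": anomaly_score([1.0, 0.4, 0.2]),  # c_1 = 3/5, c_2 = 1/5  -> 0.3
}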
7.3 yagada: an algorithm for detecting structural and numeric anomalies
In the previous section, we described the spirit of our approach and illustrated how numeric anomaly scores can be integrated into the graph structure for more accurate anomaly detection. As graph structure and numeric attributes are correlated (§5.2, §6.2.2), abnormal numeric attributes can indicate the presence of anomalous regions in the graph. Our examples show that taking numeric attributes into
consideration can increase the discrimination of anomaly detection and even allow us to detect some anomalous patterns which would otherwise go undetected. Now we consider how to calculate q_i, the anomaly score for numeric attributes, and present our complete algorithm for graph-based anomaly detection, Yagada.

Subdue handles numeric attributes in one of four ways [161]:

1. Ignore numeric labels. As we showed in the examples above, ignoring numeric labels loses information and reduces the discrimination of anomaly detection.

2. Treat numeric attributes as discrete: two numeric labels i and j are treated as equal if i = j. This approach will have very poor results, as numbers which are very close, e.g. 5.0 and 5.1, will be treated as separate discrete labels.

3. Tolerance match: two numeric values are considered equal if |i − j| < t for some threshold t. This approach may work in some cases where the data is uniformly distributed and an appropriate threshold can be determined. It requires prior knowledge of the distribution of the data, and for many real-world data distributions, a single threshold is not appropriate.

4. Difference match: two numeric values are considered equal if matchcost(i, j) < t, where matchcost is a function defined as the probability that j is drawn from a Gaussian PDF with the same mean as i and a standard deviation provided as a parameter. This approach assumes that the data is distributed normally and that all numeric data in the graph are drawn from the same distribution. As we have shown in Chapters 5–6, there are many real-world datasets for which these assumptions do not hold.

None of these methods is satisfactory. Several alternative discretization approaches have been proposed for numeric attributes as a pre-processing step to graph mining with Subdue: equal-width or equal-frequency binning, k-means or EM clustering and kernel density methods [149]. We will compare Yagada to the binning/clustering approaches in the experiments.

GBAD-P [68] is a Subdue-based method to detect anomalous insertions to a graph structure. Extensions to normative patterns are evaluated to find instances with vertices or edges that are less likely than other possible extensions. [71] describes an extension to
GBAD-P to incorporate numeric attributes. The probability of a vertex or edge is calculated as:

P(\text{attribute} = \text{value} \mid \text{attribute exists}) \times P(\text{attribute exists}) \qquad (7.4)

where P(attribute = value | attribute exists) is the probability of the specific attribute value, calculated from the Gaussian distribution of the numeric attribute, and P(attribute exists) is the probability that it exists as an extension of the normative pattern. However, this approach assumes that the data is normally distributed.

Yagada uses a similar discretization approach to the one we used for graph pruning in Chapter 6. For each labelled vertex or edge in the graph, we replace all numeric labels L_N → A_N with a single weight L_OUTLIER, which takes a constant value q_0 for normal numeric values and some value q_i indicating the degree of outlierness otherwise. We then perform multiple iterations of substructure discovery and graph compression as outlined in the previous section.

Yagada is described in detail in Algorithm 3. First, we pre-process the graph by detecting numeric anomalies. Vertices are partitioned into disjoint sets according to their discrete attribute values (line 1, cf. Def. 32 in Chapter 6). Each vertex is weighted with a score calculated from its numeric attributes, L_OUTLIER → q_i (lines 2–4, see the Discretize algorithm below). This process is repeated for edges (lines 5–8). After calculating the vertex and edge weights from the numeric attributes, we can disregard the actual attribute values (lines 9–10). Next, we commence structural anomaly detection (lines 11–16). We use Subdue's greedy beam search to find and evaluate candidate substructures (lines 12–13, cf. Appendix A.2) and compress the graph with the best substructure (line 14). Where G is a graph database, we can calculate a_t, the anomaly score for each transaction t at each iteration, by Equation 7.3 (lines 15–16).

Algorithm 4 details the step of calculating L_OUTLIER from the numeric attributes. q_0 is the threshold for normal numeric attributes, which for kNN-distances is close to zero (§4.3.3). FV is a set of feature vectors created from the attributes of the vertices (or edges) under consideration (line 1). The dimensionality of FV is |L_V ∩ L_N| for vertices and |L_E ∩ L_N| for edges. fv_v ∈ FV is the feature vector for the current vertex (line 2). We calculate the kNN-distance between fv_v and FV (line 3, cf. Equation 4.6) and use this to evaluate whether the point is normal or an outlier (lines 4–5).
Algorithm 3 Yagada: detect anomalies in structural and numeric data
Require: G = (V, E, L^V, L^E, L_V, L_E), numIterations
Return: Set of anomaly scores A = {a_0, . . . , a_n}
1: Partition V into disjoint sets V_0, . . . , V_N (Def. 32)
2: for all i : 0 ≤ i < N do
3:     for all v ∈ V_i do
4:         let L_V(v, anomalousness) = discretize(v, V_i, L_V)
5: Partition E into disjoint sets E_0, . . . , E_M (Def. 33)
6: for all j : 0 ≤ j < M do
7:     for all e ∈ E_j do
8:         let L_E(e, anomalousness) = discretize(e, E_j, L_E)
9: let L^V = L^V \ L_N
10: let L^E = L^E \ L_N
11: for all iter : 0 ≤ iter < numIterations do
12:     Execute Subdue(G) to return set of substructures G_S = {G_0, . . . , G_m}
13:     Find best substructure G_min = argmin_{g ∈ G_S} M(g, G) (Equation 7.1)
14:     Compress G with G_min
15:     for all transaction graphs t ⊂ G do
16:         Calculate anomaly score a_t(G, t, iter) by Equation 7.3
Algorithm 4 Discretize: calculate an anomaly score on the numeric attributes of a vertex or edge
Require: v, V, L, k
Return: Outlier score q_i
1: let FV = L(V, L_N)
2: let fv_v = L(v, L_N)
3: let q = distance from fv_v to kth nearest neighbour in FV
4: if q ≲ q_0 then return q_0
5: else return q
For points which are normal, we return the constant value q_0; for anomalous points we return the anomaly score q.

Yagada can detect both structural and numeric anomalies in graph data. The novelty of our method is to replace numeric values in the graph with a constant q_0 if the value is normal, and an anomaly score q_i otherwise. When we subsequently search the graph for frequent substructures, q_0 will be incorporated into frequent patterns. The values q_i are infrequent, and therefore substructures which contain q_i ≠ q_0 are more anomalous.

The operation of Yagada on a graph database representing an ACS is shown in Fig. 7.3. Fig. 7.3a shows the input database, with transactions for three users (Alice, Bill and Eve) over four days. Vertices represent door sensors, weighted with the time of day that an access card was presented. Directed edges represent movements between sensors. Fig. 7.3b shows the graph after numeric anomaly detection: normal access times for each sensor are labelled as 0, while unusual access times are labelled with a score representing the degree of anomalousness. Figs. 7.3c, 7.3d and 7.3e show the transaction graph after 1, 2 and 3 iterations of substructure discovery and compression. The anomaly scores a_t for the transactions in Fig. 7.3e will highlight Eve's suspicious behaviour. Leaving by a side door on Wednesday evening is a small change to her usual pattern. Coming into the cash room early in the morning and leaving the building almost immediately shows up as a very unusual pattern.
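As a concrete sketch of the Discretize step (Algorithm 4), assuming scikit-learn for the kNN search; the values of k and q_0 are illustrative and data-dependent:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def discretize(fv_v, FV, k=10, q0=0.1):
    """Score one element's numeric feature vector fv_v against FV,
    the feature vectors of every element in the same partition
    (including fv_v itself), by its kNN-distance."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(np.asarray(FV))
    dist, _ = nn.kneighbors([fv_v])
    q = dist[0][-1]  # kNN-distance, skipping the zero distance to itself
    return q0 if q <= q0 else q  # normal points collapse to the label q0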
7.4 experiments and results
For our experiments, we used the CEM ACS dataset (§3.4.5), which tracks the movements of approximately 250 staff between door sensors in an office building over fifteen months. We organised the data as a graph database with ≈ 40,000 graph transactions, where each transaction represents the movements of a specific user on a specific day. Vertices represent door sensors and directed edges represent movements between them. Edges e = ⟨v, w⟩ are weighted with the elapsed time in seconds between presenting an access card at v and presenting the card at w.

We compare the performance of Yagada to five alternative discretization approaches. Two approaches use standard Subdue: treating numeric values as discrete, and ignoring the numeric attributes. The latter discovers structural anomalies only and gives a baseline for
Figure 7.3: Access Control System Example: (a) Input Graph Database; (b) Calculate Anomaly Scores; (c) Compression Iteration 1; (d) Compression Iteration 2; (e) Compression Iteration 3
comparison. We also compared against three of the discretization approaches suggested in [149]: sorting the dataset into ten bins of equal frequency; sorting into ten bins of equal width; and k-means clustering with k = 10. For each discretization approach, we compressed the graph database with Subdue and calculated the anomaly score a ∈ [0, 1] for each transaction, following Algorithm 3. This continued over multiple iterations until no further compression was possible. (This occurs when the best substructure has only one instance.)

We calculated anomaly scores for each transaction in the graph database for each of the six methods described above. The anomaly scores are plotted as a CDF in Fig. 7.4. Fig. 7.4a shows the full CDF of anomaly scores. One way of thinking about these plots is that each point represents the number of transactions which would be flagged as anomalous if we used its value of a as a threshold. As one would expect, the transactions are clearly divided into normal patterns representing most of the data (on the right), and a small proportion of anomalous patterns (on the left). Four of the anomaly detection methods agree almost exactly on the transition point between normal and anomalous patterns: transactions with a ≳ 0.32 are anomalous. Fig. 7.4b shows a more detailed view of the distribution of anomaly scores for the anomalous transactions.

In Fig. 7.4a, the dashed red line shows the results with standard Subdue, treating numeric values as discrete. Each small difference in numeric values is treated as a separate label, so very few patterns can be established. Approximately half of the dataset is assigned the highest possible anomaly score (a = 1), which is obviously meaningless. If we ignore the numeric attributes, Subdue detects very few anomalies (Fig. 7.4b). This shows that there are few purely structural anomalies in this dataset. Most of the anomalous patterns are a combination of structural and numeric features.

The relative performance of the remaining four numeric discretization approaches can be seen most clearly in Fig. 7.4b. KMeans clustering is somewhat more discriminative than ignoring attributes, but as KMeans tends to produce equal-sized clusters, there is poor discrimination between normal and anomalous points. The performance of the two binning approaches (EqualFreq and EqualWidth) is somewhat better. Yagada offers the greatest discrimination between normal and anomalous patterns.
Figure 7.4: Cumulative Frequency Distribution showing transactions discovered as anomalous: (a) 1 ≥ a ≥ 0; (b) 1 ≥ a ≥ 0.4
At the upper part of the range (a > 0.8), Yagada detects almost four times as many anomalies as the next best algorithm. In the mid-part of the range (a > 0.65), Yagada detects approximately twice as many anomalies. This difference decreases with decreasing values of a, until at the boundary between normal and anomalous values (a ≈ 0.32), the algorithms agree. This behaviour suggests that while all algorithms may be discovering the same anomalous patterns, Yagada detects them with much higher confidence, and provides much better discrimination between slightly anomalous and very anomalous patterns.

To evaluate this, we compared the anomaly scores for the top 20 anomalous patterns discovered by EqualFreq, EqualWidth and KMeans to the anomaly scores produced by Yagada for the same patterns (Table 7.1). Of the top 20 anomalies discovered by any of the other four approaches, Yagada detects all of them, and in most cases assigns an anomaly score which is greater than or similar to that produced by the best-performing of the alternative algorithms.

Finally, we looked at the top 20 anomalies discovered by Yagada and compared how the other algorithms scored these anomalies (Table 7.2). Subdue performed poorly as it can only detect structural anomalies. Of the discretization algorithms, EqualWidth performed the best, as the anomalies were usually found in the top one or two bins. EqualFreq usually performed less well, because anomalies were binned together with high-end normal values. This shows that EqualFreq and EqualWidth are sensitive to the boundaries chosen for the bins; data clusters close to a boundary can be split into two bins, and a bin can contain more than one cluster. KMeans performs poorly for similar reasons to EqualFreq: clusters tend to be similar sizes, so very anomalous values are detected, but slightly anomalous values are clustered with high-end normal values. Yagada's kNN distance-based measure is much less sensitive to the distribution of the data.

Of course, it is not enough to demonstrate that Yagada detects the most anomalies. We must perform some kind of qualitative analysis to assure ourselves that the anomalies that are discovered are meaningful and useful. We examined the top anomalous patterns and assured ourselves that they did represent meaningful patterns, such as failing to swipe at one sensor along a path. The patterns that Yagada discovered with highest confidence were those which combined unusual paths with unusual timings.
149
7.4 experiments and results
       Yagada  EqualFreq  EqualWidth  KMeans  Subdue
  1     1.00     0.42       0.93       0.72    0.34
  2     1.00     0.78       0.77       0.49    0.39
  3     1.00     0.78       0.68       0.49    0.39
  4     1.00     0.78       0.68       0.49    0.39
  5     1.00     0.51       0.93       0.49    0.35
  6     1.00     0.62       0.66       0.62    0.27
  7     1.00     0.62       0.66       0.62    0.23
  8     0.80     0.75       0.81       0.68    0.28
  9     0.79     0.58       0.77       0.64    0.36
 10     0.79     0.96       0.68       0.49    0.39
 11     0.79     0.68       0.77       0.49    0.39
 12     0.79     0.96       0.55       0.52    0.42
 13     0.71     0.66       0.70       0.65    0.39
 14     0.70     0.77       0.67       0.60    0.28
 15     0.66     0.68       0.93       0.54    0.35
 16     0.66     0.64       0.64       0.62    0.29
 17     0.65     0.66       0.63       0.62    0.39
 18     0.62     0.63       0.64       0.61    0.27
 19     0.62     0.64       0.63       0.61    0.31
 20     0.62     0.63       0.63       0.61    0.31
Mean    0.81     0.69       0.71       0.58    0.34

Table 7.1: Comparison of Top 20 Anomalies discovered by EqualFreq, EqualWidth and KMeans
       Yagada  EqualFreq  EqualWidth  KMeans  Subdue
  1     1.00     1.00       1.00       0.85    0.49
  2     1.00     0.96       1.00       0.85    0.40
  3     1.00     0.78       1.00       0.72    0.36
  4     1.00     0.87       0.77       0.72    0.31
  5     1.00     0.42       0.93       0.72    0.34
  6     1.00     0.78       0.77       0.49    0.39
  7     1.00     0.35       1.00       0.72    0.34
  8     1.00     0.35       1.00       0.72    0.34
  9     1.00     0.51       1.00       0.49    0.35
 10     1.00     0.78       0.68       0.49    0.39
 11     1.00     0.78       0.68       0.49    0.39
 12     1.00     0.51       0.93       0.49    0.35
 13     1.00     0.62       0.66       0.62    0.27
 14     1.00     0.62       0.66       0.62    0.23
 15     1.00     0.43       0.77       0.54    0.34
 16     1.00     0.68       0.64       0.46    0.23
 17     1.00     0.51       0.62       0.49    0.35
 18     1.00     0.51       0.62       0.49    0.35
 19     1.00     0.41       0.77       0.39    0.34
 20     1.00     0.36       0.68       0.44    0.34
Mean    1.00     0.61       0.81       0.59    0.34

Table 7.2: Comparison of Top 20 Anomalies discovered by Yagada. The best-performing of the alternative algorithms are highlighted in bold.
In many cases, the other algorithms were able to assign a higher anomaly score, as they could also take numeric attributes into account.

Our industrial partner for this project was interested in tagging each pattern using a “traffic light” system, where normal patterns are Green, anomalous patterns are Yellow and severely anomalous patterns are Red. We have determined, for this dataset, that the threshold between Green and Yellow lies at a ≈ 0.32. Figure 7.4b shows that the CDF is flat for 1 ≥ a ≳ 0.8, so we could set the threshold between Yellow and Red at a = 0.8.

We conclude that Yagada is able to detect all of the anomalies which can be discovered using the alternative approaches, usually with equal or greater confidence. Yagada outperforms the other approaches as it uncovers a larger number of meaningful anomalies at the higher values of a.

Our experiments show that using kNN distances for discretization gives better discrimination than binning or clustering. Fig. 7.5 shows the kNN distances and LOF scores across a pair of sensors in the CEM dataset. It can be seen that the data density is quite uniform; both approaches return “normal” scores across a similar range of elapsed times.

Figure 7.5: Outlier score distributions for elapsed time in the ACS dataset. (a) kNN-distance; (b) LOF.

In the case where the data clusters are of different densities, kNN distances will not perform so well. We argued in Sect. 4.3.4 that LOF is a better choice for unsupervised learning, as we are not required to select a (data-dependent) distance threshold, and LOF detects outliers more reliably if the data is composed of multiple clusters of different densities. Therefore we plan to create an improved version of Yagada using LOF as the outlier detection method.
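The contrast between the two measures is straightforward to reproduce. The following minimal sketch is illustrative only (the thesis code for this comparison was written in Matlab and C++; here scikit-learn is assumed, and the elapsed times are synthetic):

import numpy as np
from sklearn.neighbors import NearestNeighbors, LocalOutlierFactor

# Synthetic elapsed times (seconds) between a pair of sensors:
# one dense cluster of normal timings plus a few stragglers.
rng = np.random.default_rng(0)
times = np.concatenate([rng.normal(30, 2, 500),
                        [55.0, 80.0, 120.0]]).reshape(-1, 1)

k = 10
# kNN-distance score: distance to the k-th nearest neighbour.
# (The query set is the training set, so the nearest "neighbour"
# of each point is the point itself; hence k + 1.)
nn = NearestNeighbors(n_neighbors=k + 1).fit(times)
dist, _ = nn.kneighbors(times)
knn_score = dist[:, -1]                       # larger = more anomalous

# LOF score: ratio of a point's local density to that of its
# neighbours; approximately 1 for inliers, larger for outliers.
lof = LocalOutlierFactor(n_neighbors=k).fit(times)
lof_score = -lof.negative_outlier_factor_     # larger = more anomalous

On data of roughly uniform density, as in Fig. 7.5, the two scores rank the same points highest; the difference emerges when clusters of different densities are mixed, where no single kNN distance threshold fits all clusters, but LOF, being relative to the local neighbourhood, still discriminates.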
7.5 conclusions
Graph-based anomaly detection algorithms mine the structural aspects of graph data. Until now, very little consideration has been
given to how numeric attributes may indicate the presence of anomalous regions. By incorporating numeric outlier detection, Yagada can detect anomalies in graphs that consist of both structural data and numeric attributes. The experiments show that this outperforms anomaly detection based on structural data alone.

We have not yet evaluated the case where there are multiple (multi-dimensional) numeric attributes, or where the density of the numeric attribute distribution is non-uniform. Following on from our work in Chapter 6, we expect that LOF will also give better results than kNN in these cases.

We would also like to evaluate the meaningfulness of the discovered anomalies in a more objective way. Most anomaly detection papers in the literature have very subjective assessments of what constitutes an anomaly. Janssens et al. [99] suggest that the accuracy of anomaly detection approaches can be measured by using them as a one-class classifier, where the classifier is trained on positive examples only. The anomaly detection algorithm is used to determine if each test instance is a member of the class or something else (an anomaly). By applying this approach to Yagada, we could obtain an objective measurement of accuracy for various datasets.

In Chapter 6, we suggested that the numeric outlier score could be incorporated into the DL calculation, to give a weighted DL (§6.3.2). This concept could be adapted to Yagada, as an alternative to discretizing the outlier score.

In the graph database setting, we have presented a method for scoring and ranking transactions according to how much and how soon they are compressed. In the SLG setting, we can detect anomalous regions, but we did not present a method to score or rank them. One possible approach would be to use egonets [17]: the region around each vertex in the uncompressed graph could be evaluated by how much and how soon that region is compressed. This would allow each vertex in the graph to be scored according to how anomalous its neighbourhood is (a sketch of this idea is given below).

Yagada is presented for use on static graphs, so it can be used for forensic analysis of graph databases. It would be interesting to extend it for use on streaming graph data, for online or real-time detection of anomalous behaviours. This would require an incremental approach. An extension of Subdue to learn concepts from streaming data has been proposed [55]; this is a possible starting point for a streaming version of Yagada.
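Returning to the egonet idea above, a minimal sketch of the vertex scoring might look as follows (illustrative only; networkx is assumed for the graph handling, and compressed_at is a hypothetical mapping, produced by the mining loop, from each vertex to the iteration at which it was first compressed):

import networkx as nx

def egonet_anomaly_scores(G, compressed_at, decay=0.5, radius=1):
    # Score each vertex by how much and how soon its egonet was
    # compressed: vertices whose neighbourhoods were compressed
    # early and completely score near 0; vertices whose
    # neighbourhoods resisted compression score near 1.
    scores = {}
    for v in G.nodes():
        ego = nx.ego_graph(G, v, radius=radius)
        credit = 0.0
        for u in ego.nodes():
            it = compressed_at.get(u)      # None = never compressed
            if it is not None:
                credit += decay ** (it - 1)
        scores[v] = 1.0 - credit / ego.number_of_nodes()
    return scores

The decay parameter plays the “how soon” role here; the exact weighting would need to be derived in the same way as the transaction scores of Chapter 7.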
Part IV

CONCLUSIONS AND FUTURE WORK

The future is there. . . looking back at us. Trying to make sense of the fiction we will have become.
— Pattern Recognition, William Gibson
8
CONCLUSIONS
In Chapter 1, we proposed four research questions: How are numeric attribute values related to graph structure? How can numeric attributes be integrated into substructure discovery? How can information from numeric attributes be applied to structural anomaly detection? And how can our findings be generalised to multi-dimensional numeric attributes?

We addressed the first question in Chapter 5, by creating a generative graph model—Agwan—to study the relationship between network structure and attributes. We addressed the second question in Chapter 6, where we proposed using the “outlierness” of numeric attributes as a constraint on substructure discovery. We investigated the third question in Chapter 7 using Yagada, our algorithm for unsupervised anomaly detection in graphs with numeric attributes; Yagada applies the constraint-based approach of Chapter 6 to the problem of finding anomalous subgraphs. The final question is addressed in the discussion in Chapter 5 and the experimental investigation in Chapter 6. In this concluding chapter, we summarise our findings and propose some directions for future research.
8.1 graph structure and numeric attributes
Chapter 5 presented Agwan (Attribute Graphs: Weighted and Numeric), a model for random graphs with discrete vertex labels and weighted edges, including a fitting algorithm to learn Agwan’s parameters from real-world data and a generative algorithm for random labelled, weighted graphs. We measured the closeness of fit between the random graphs and the input graph across a range of statistics. We used Agwan to measure the relationship between discrete and numeric attributes and graph structure, verifying our hypothesis that edge weights are dependent on the labels of the edge endpoints, and conditionally independent of the rest of the graph. From our experiments with Agwan, we conclude that the vertex strength distribution and spectral properties of real-world graphs are
highly dependent on both the discrete vertex labels and the numeric edge weights. While some structural properties are dependent on the numeric attributes, this is not true for all properties: clustering and triad participation were shown to be independent of the edge weights.

Vertex label values appear to contribute to triangle and clustering patterns to some extent, but there is another process taking place which is not captured by our model: Agwan models homophily but does not model triadic closure. We suggest that triadic closure may explain how clustering arises, where links are formed between people who share a mutual friend, independently of their vertex attributes. The fact that vertex labels are slightly correlated with clustering patterns may be an artefact of the connection to the mutual friend rather than an explanation for the clustering.
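The mechanism we have in mind can be sketched as follows (an illustration, not part of the current Agwan implementation; p_close is a hypothetical closure probability, which Sect. 8.5.1 proposes learning from the input graph, possibly conditioned on the vertex labels):

import random
import networkx as nx

def close_triads(G, p_close, seed=0):
    # For each open triad u-w-v (u and v share the neighbour w but
    # are not themselves connected), close the triangle with
    # probability p_close. An edge weight for the new edge would
    # then be sampled from the Agwan mixture for the labels of u, v.
    rng = random.Random(seed)
    for w in list(G.nodes()):
        nbrs = list(G.neighbors(w))
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                u, v = nbrs[i], nbrs[j]
                if not G.has_edge(u, v) and rng.random() < p_close:
                    G.add_edge(u, v)
    return G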
8.2 substructure discovery in graphs with numeric attributes
Having established that dependencies exist between graph structure and attributes, we investigated how to integrate numeric attributes into substructure discovery in Chapter 6. We proposed that the most descriptive substructures are those which are not only the most frequent, but which are also normative in terms of their numeric attributes. We formulated a numeric outlier-based constraint on substructure discovery and evaluated its effect on graph mining in graph databases and SLGs.

In Chapter 4, we evaluated a number of numeric anomaly detection approaches—statistical, clustering-based, distance-based and density-based—for their suitability for unsupervised learning. On the basis of that evaluation, we selected a density-based approach, LOF, for our experiments in Chapter 6. Because outliers are computed relative to their local neighbourhood, LOF does not need to make any assumptions about the underlying distribution of the data, and it performs well for a moderate number of dimensions.

We proposed three ways in which outlier scores could be used to constrain substructure discovery: as a guided pruning step, which can be applied to any substructure discovery algorithm; as weighted support, for frequent subgraph algorithms; and as weighted description length, as an alternative to MDL evaluation. For our experiments, we implemented our approach as a pruning step during pre-processing
and evaluated the effect of pruning on runtime performance, memory performance and completeness of the result set.

From our experiments, we conclude that using numeric outliers to prune the input graph prior to substructure discovery gives a beneficial trade-off between improved performance and reduced coverage. The patterns which are retained are the most descriptive ones, satisfying both minimum support and the normal numeric attribute constraint. In many cases where searching the input graph is intractable with an unconstrained approach, due to the computational or memory overheads, the outlier-based constraint allows the graph to be processed. For compression-based substructure mining, the relative count of each pattern is approximately preserved, as is the distribution of compression ratios, making it possible to relax the greediness of the search without exceeding processing time and memory usage limits.
8.3 anomaly detection in graphs with numeric attributes
One of the main motivations for this Ph.D project was to devise an unsupervised machine learning approach to detect suspicious behaviour in ACS databases. In Chapter 7, we framed this problem in terms of detecting anomalous transactions in a graph database. As ACS data can contain both structural anomalies (unusual paths through a building) and numeric anomalies (unusual timing data), the main research questions addressed by Chapter 7 were how to integrate this structural and numeric data, and the effect on discriminating between normal and anomalous patterns.

Building on the approach described in Chapter 6, we devised an unsupervised algorithm to detect anomalous subgraphs, Yagada (Yet Another Graph-based Anomaly Detection Algorithm). At each iteration, Yagada finds the most descriptive substructure and uses it to compress the input graph database. The “most descriptive” substructure is the one which maximally compresses the graph according to the MDL principle. Following the constraint-based approach of Chapter 6, instances with anomalous numeric attributes provide less support than instances with normal numeric attributes. The anomalousness of each graph transaction is calculated based on how much and how soon it is compressed. This allows us to classify subgraphs as
normal or anomalous, or to rank subgraphs from most anomalous to least anomalous.

We conclude that integrating numeric anomaly detection and graph-based anomaly detection improves the discrimination between normal and anomalous patterns. In our experimental evaluation, we found that the alternative anomaly detection approaches performed poorly: where numeric attributes were ignored or treated as categorical, few patterns could be established; binning approaches were sensitive to the boundaries selected for the bins; and clustering approaches were sensitive to the number of clusters and performed poorly where clusters were of differing densities. Yagada detected anomalous subgraphs most consistently, detecting all of the anomalous patterns found by the comparative approaches.
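The shape of the computation can be sketched as follows (a schematic outline rather than the thesis implementation: find_best_substructure, compress and the size/coverage helpers are placeholders for the Subdue-style MDL search described above, and the decay constant stands in for the exact “how soon” weighting of Algorithm 3):

def yagada_transaction_scores(transactions, decay=0.9, max_iters=100):
    # Transactions that are compressed early and heavily by
    # descriptive substructures accumulate "normality" credit;
    # transactions that resist compression keep a score near 1.
    size = {t.id: t.size() for t in transactions}
    credit = {t.id: 0.0 for t in transactions}
    for i in range(1, max_iters + 1):
        best = find_best_substructure(transactions)  # MDL, outlier-weighted
        if best is None or best.num_instances() <= 1:
            break                  # no further compression is possible
        for t in transactions:
            covered = best.coverage(t)  # vertices + edges compressed in t
            credit[t.id] += decay ** (i - 1) * covered / size[t.id]
        transactions = [compress(t, best) for t in transactions]
    return {tid: max(0.0, 1.0 - c) for tid, c in credit.items()}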
8.4 multi-dimensional numeric attributes
During our investigations, we considered graphs with more than one numeric attribute on vertices and edges. Our algorithms treat multiple attributes as multi-dimensional numeric feature vectors.

In Sect. 5.3.3, we discussed how to extend Agwan to multiple attributes by generalising the DPGMM to multiple dimensions. This approach works for a moderate number of dimensions. However, the statistics we used to evaluate our model assume one-dimensional weights; in order to evaluate higher-dimensional Agwan, we will need to devise new statistical measures.

The constraint-based approach in Chapter 6 allows vertices and edges to have an arbitrary number of numeric attributes. In our experimental evaluation, we found that calculating LOFs across two-dimensional feature vectors gave better performance than using a single numeric feature. In higher (≳ 20) dimensions, the indexes used by LOF degrade, which has a deleterious effect on the complexity of the nearest neighbour search. In Sect. 6.4, we outlined how our approach could be extended to very high-dimensional features using RP + PINN + LOF. Our preliminary experiments in this direction suggest that RP + PINN + LOF is tractable for feature sets of several thousand dimensions, but further investigation is required to establish that the Lp-norm distance measures used by LOF remain valid on sparse feature vectors at higher dimensions. We conclude that our approach works best when there is a moderate number of informative features.
8.5 future work
As the work in this thesis has answered some questions, it has also created new questions for further investigation.

8.5.1 Graph Structure and Numeric Attributes

Agwan is suited to the task of measuring correlations between graph structure and numeric attributes; with some improvements, it could be used as a more general model for labelled, weighted graphs. Agwan’s main weakness is that it does not accurately model triad participation and clustering, so community structure is not captured. To address this, we plan to add triadic closure into the model, where the probability of closing a triangle is learned from the input graph. We need to investigate whether this probability is dependent on or independent of the vertex labels. Another possible way to reproduce community structure is with an edge-copying model.

We did not investigate how Agwan graphs evolve over time. We plan to conduct a series of experiments to measure whether Agwan graphs obey the Densification Power Law and the shrinking diameter effect. We also plan to investigate whether weighted versions of these laws hold: does the weighted density increase and the weighted diameter decrease asymptotically, like their unweighted counterparts?

Finally, Agwan models the full joint probability across all combinations of vertex labels. As the number of labels and label values increases, calculating the joint probability becomes intractable due to combinatorial explosion. An alternative for further investigation would be to consider each vertex attribute independently, calculating the edge weight distribution as the weighted summation of the distribution for each combination of vertex labels.

8.5.2 Substructure Discovery in Graphs with Numeric Attributes

As a follow-on from this thesis, we plan to measure the meaningfulness of our substructure discovery results more objectively, by applying our method to a classification task on a dataset with ground truth. We expect to be able to improve classification accuracy by incorporating numeric attributes. We plan to integrate the numeric outlier-based constraint into the weighted support measure (gSpan) and the weighted description length measure (Subdue), as described in Chapter 6. The
classification measure will allow us to compare the accuracy of these approaches against the guided pruning approach.

8.5.3 Anomaly Detection in Graphs with Numeric Attributes

Our implementation of Yagada in Chapter 7 evaluated one-dimensional numeric attributes and used kNN distances for outlier detection. Following on from our findings in Chapter 6, we plan to extend this to the multi-dimensional case using density-based outlier detection (LOF). LOF should be more robust where the data is of varying density (which was not the case in our experimental dataset). After this improvement, we plan to evaluate Yagada on a wider range of datasets.

In the graph database setting, we would like to evaluate anomalous subgraphs more objectively, by using Yagada as a one-class classifier on labelled subgraphs. Subgraphs can be classified as members of the class or as anomalies. The trade-off between the false positive and false negative rates of the anomaly predictions can be measured for different values of an anomaly threshold parameter using ROC curves.

In the SLG setting, we plan to add a method to score or rank anomalous regions. One possible approach is to score each vertex based on how much and how soon its egonet is compressed. This would give an anomaly score to each vertex based on the anomalousness of its k-nearest vertex neighbourhood.

In Chapter 6, we suggested that the numeric outlier score could be incorporated into the DL calculation, to give a weighted DL. This concept could be adapted to Yagada, as an alternative to discretizing the outlier score.

Finally, Yagada is presented for use on static graphs, making it suitable for forensic analysis of graph databases. An interesting extension would be to adapt it for streaming graph data, for online detection of anomalous behaviours.

8.5.4 High-dimensional Numeric Attributes

Our approach works well on data with a moderate number of numeric attributes. The effectiveness of numeric outlier detection in high-dimensional data requires further investigation. For graphs with a sparse high-dimensional representation, such as the Enron bag of words, further research is needed to determine whether Lp-norm distances are still discriminative. If Lp-norm distances are not appropriate, perhaps there is a more suitable distance measure; if so, further research is required to determine whether a “database-friendly” random projection exists for it. If the distances are valid, it may be necessary to rescale the LOF scores. Related to this, where there is a high-dimensional representation with many irrelevant attributes, it may be better to detect numeric anomalies in subspaces rather than in the full space, by choosing only locally-relevant attributes on which to calculate the outlier score.

We have reached the end of our thesis and come full circle. We began this Ph.D project with anomaly detection in graph-based data; the Yagada algorithm was our first published work. Yagada raised questions about how numeric attributes are related to graph structure; how numeric attributes could be incorporated into frequent substructure discovery as well as anomaly detection; what the best way is to detect numeric anomalies for unsupervised learning; and what to do if the data is multi-dimensional. These are questions that we have attempted to answer in Chapters 5–6. In Chapter 7, we suggested how the knowledge we have gained from trying to answer these questions can be re-integrated into the Yagada approach. It is fitting to finish with Yagada, as the stepping-off point for our post-Ph.D research.
Part V

APPENDICES

When asked for advice by beginners. . . Know your ending, I say, or the river of your story may finally sink into the desert sands and never reach the sea.
— I. Asimov: A Memoir, Isaac Asimov
A
ALGORITHMS
The substructure discovery algorithms used for the experiments in Chapters 6–7 are reproduced here for ease of reference.
a.1 gspan
gSpan is an algorithm for frequent substructure discovery in graph databases. There is one parameter, the minimum support threshold minSup. For full details, see [185].

Algorithm 5 GraphSet_Projection: search for frequent substructures
Require: Graph Transaction Database D, minSup
 1: Sort the labels in D by their frequency
 2: Remove infrequent vertices and edges
 3: Relabel the remaining vertices and edges
 4: S1 ← all frequent 1-edge graphs in D
 5: Sort S1 in DFS lexicographic order
 6: S ← S1
 7: for all edge e ∈ S1 do
 8:     Initialise s with e, set s.D to graphs which contain e
 9:     Subgraph_Mining(D, S, s)
10:     D ← D − e
11:     if |D| < minSup then
12:         break
13: return Discovered Subgraphs S

To apply our numeric outlier constraint (Chapter 6), we calculate outlier factors for all vertices and edges by Def. 34 as a pre-processing step. Then amend step 2 to “Remove infrequent and anomalous vertices and edges”.
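In code, this pruning step might look like the following sketch (illustrative only; lof_score stands for the outlier factor of Def. 34, theta for the chosen anomaly threshold, and a networkx-style graph is assumed):

def prune_anomalous(G, lof_score, theta):
    # Remove vertices and edges whose numeric attributes are
    # outliers (outlier factor above theta) before mining, so that
    # only substructures with normative attributes are discovered.
    bad_vertices = [v for v in G.nodes()
                    if lof_score(G.nodes[v]) > theta]
    G.remove_nodes_from(bad_vertices)   # incident edges go too
    bad_edges = [(u, v) for u, v, d in G.edges(data=True)
                 if lof_score(d) > theta]
    G.remove_edges_from(bad_edges)
    return G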
a.2 subdue
Subdue uses a greedy beam search strategy to find the most descriptive substructures in SLGs or in graph databases. It maintains a limited-length list of candidate substructures. At each iteration, the
best substructures are retained and unpromising candidates are discarded. BeamWidth is the number of candidate substructures to retain in the list after evaluation. MaxBest is the number of output substructures to be retained. Where the best substructure is used to compress the input graph over multiple iterations of Subdue, MaxBest can be set to 1. MaxSubSize is a constraint on the maximum size of substructures to be evaluated. This is important when searching in SLGs, but may not be necessary when searching in a graph database, if the individual transactions are not too large. Limit is an upper limit on the number of substructures to process. For full details, see [58].

Algorithm 6 Subdue: search for frequent substructures
Require: Graph, BeamWidth, MaxBest, MaxSubSize, Limit
 1: let ParentList = {}
 2: let ChildList = {}
 3: let BestList = {}
 4: let ProcessedSubs = 0
 5: Create a substructure from each unique vertex label and its single-vertex instances; insert the resulting substructures in ParentList
 6: while ProcessedSubs ≤ Limit and ParentList is not empty do
 7:     while ParentList is not empty do
 8:         let Parent = RemoveHead(ParentList)
 9:         Extend each instance of Parent in all possible ways
10:         Group the extended instances into Child substructures
11:         for all Child do
12:             if SizeOf(Child) ≤ MaxSubSize then
13:                 Evaluate the Child
14:                 Insert Child in ChildList in order by value
15:                 if Length(ChildList) > BeamWidth then
16:                     Destroy the substructure at the end of ChildList
17:         let ProcessedSubs = ProcessedSubs + 1
18:         Insert Parent in BestList in order by value
19:         if Length(BestList) > MaxBest then
20:             Destroy the substructure at the end of BestList
21:     Switch ParentList and ChildList
22: return BestList

To apply our numeric outlier constraint to Subdue, we calculate outlier factors for all vertices and edges by Def. 34 as a pre-processing step, as for gSpan. Then remove all anomalous vertices and edges from the graph before line 5 above.
B

IMPLEMENTATION NOTES
b.1 experimental system
Our experiments were conducted on an Intel Xeon 2.67 GHz CPU with 100 GB of main memory, running 64-bit Debian GNU/Linux 6.0 (“Squeeze”).
b.2 graph representation and visualisation
Graphs were stored in Graph Exchange XML Format1 (GEXF), which allows an arbitrary number of discrete and numeric labels to be attached to vertices and edges. We made extensive use of Gephi2 [25], the Open Graph Viz Platform, for visualising and understanding graphs, calculating graph statistics and creating animations of graph evolution. Most of the graphs represented in this thesis were created using Gephi.
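For example, a labelled, weighted graph can be written to and read from GEXF as follows (a minimal sketch using networkx, which supports the format; this is not necessarily the tooling used for the thesis experiments, and the names and values are invented):

import networkx as nx

G = nx.Graph()
# A discrete vertex label and a numeric edge weight, as in an
# Agwan-style attribute graph.
G.add_node("alice", role="staff")
G.add_node("bob", role="visitor")
G.add_edge("alice", "bob", weight=12.5)

nx.write_gexf(G, "example.gexf")   # attributes are stored in the file
H = nx.read_gexf("example.gexf")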
b.3 dirichlet process gaussian mixture models
In Chapter 5, we learned the DPGMMs using an implementation of mean-field variational inference [33] which uses the capping method from [125] so that it can operate incrementally. This code was implemented in Python by Tom Haines3 .
b.4 lof and pinn
The code to compare outlier detection approaches in Chapter 4 (GMM, CBLOF, kNN, LOF) was written in Matlab.
For the experiments in Chapters 6–7, we created our own implementation of LOF in C++. This was validated against the “official” implementation in the ELKI toolkit4 (Environment for Developing KDD Applications Supported by Index Structures [8]). For the RP + PINN + LOF experiments in Chapter 6, we used ELKI. One limitation of ELKI’s implementation of the Achlioptas Random Projection [9] is that it holds all objects in main memory. It is likely that the sharp increase in computation time at 4,000 dimensions in Fig. 6.6 is an artefact caused by running out of main memory and using swap space. As Achlioptas RP is designed to be “database-friendly”, it would be straightforward to reimplement it with an index on disk to process larger datasets for future investigations.

1 http://gexf.net/format/
2 https://gephi.org/
3 http://code.google.com/p/haines/wiki/dpgmm
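The projection itself is simple to state. A minimal sketch of Achlioptas’ database-friendly construction, following [9] (numpy is assumed; this is not the ELKI implementation):

import numpy as np

def achlioptas_project(X, k, seed=0):
    # Project an n x d data matrix X down to k dimensions using a
    # sparse random matrix R whose entries are +1 or -1 with
    # probability 1/6 each, and 0 with probability 2/3, scaled by
    # sqrt(3). Only R needs to fit in memory (or on disk); X can
    # be streamed row by row.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = np.sqrt(3) * rng.choice([-1.0, 0.0, 1.0], size=(d, k),
                                p=[1/6, 2/3, 1/6])
    return X @ R / np.sqrt(k)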
b.5 gspan
For the experiments in Chapter 6, we used the gSpan implementation from the gboost toolkit5 as it can handle graphs with directed edges and self-cycles.
b.6 subdue
For the experiments in Chapters 6–7, we created our own implementation of Subdue in C++, using the graph isomorphism test from the Boost Graph Library6 . This version of Subdue contained the extensions required for Yagada and the alternative discretization approaches used in the experiments in Chapter 7.
4 http://elki.dbs.ifi.lmu.de/
5 http://www.nowozin.net/sebastian/gboost/
6 http://www.boost.org/doc/libs/1_53_0/libs/graph/doc/isomorphism.html
BIBLIOGRAPHY
[1] Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002), 9–12 December 2002, Maebashi City, Japan, 2002. IEEE Computer Society. [2] Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), 19-22 December 2003, Melbourne, Florida, USA, 2003. IEEE Computer Society. [3] Proceedings of the 8th IEEE International Conference on Data Mining (ICDM 2008), December 15–19, 2008, Pisa, Italy, 2008. IEEE Computer Society. [4] Police have 300 million News International emails. http://www. bbc.co.uk/news/uk-15679784, Nov. 2011. BBC News.
[5] Phone hacking: Police probe suspected deletion of emails by NI executive. http://www.theguardian.com/media/2011/jul/08/ phone-hacking-emails-news-international, Jul. 2011.
The
Guardian. [6] SUBDUE Manual, version 1.5 edition, Jun. 2011. URL http:// ailab.wsu.edu/subdue.
[7] Internet census 2012: Port scanning /0 using insecure embedded devices. http://internetcensus2012.bitbucket.org/ paper.html, 2012.
[8] Elke Achert, Hans-Peter Kriegel, Erich Schubert, and Arthur Zimek. Interactive data mining with 3D-parallel-coordinatetrees. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), 2013. [9] Dimitris Achlioptas.
Database-friendly random projections:
Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003. [10] Douglas Adams. Dirk Gently’s Holistic Detective Agency. William Heinemann, London, 1987. [11] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. In VLDB ’94:
Proceedings of the 20th International Conference on Very Large Data Bases, pages 487–499, San Francisco, 1994. Morgan Kaufmann Publishers. [12] William Aiello, Fan R. K. Chung, and Linyuan Lu. A random graph model for massive graphs. In F. Frances Yao and Eugene M. Luks, editors, STOC, pages 171–180. ACM, 2000. [13] William Aiello, Fan R. K. Chung, and Linyuan Lu. Random evolution in massive graphs. In FOCS, pages 510–519. IEEE Computer Society, 2001. [14] Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014, Jun. 2008. [15] Leman Akoglu and Christos Faloutsos. RTG: a recursive realistic graph generator using random typing. Data Min. Knowl. Discov., 19(2):194–209, 2009. [16] Leman Akoglu, Mary McGlohon, and Christos Faloutsos. RTM: Laws and a recursive generator for weighted time-evolving graphs. In ICDM DBL [3], pages 701–706. [17] Leman Akoglu, Mary McGlohon, and Christos Faloutsos. OddBall: Spotting anomalies in weighted graphs.
In Mo-
hammed Javeed Zaki, Jeffrey Xu Yu, B. Ravindran, and Vikram Pudi, editors, PAKDD (2), volume 6119 of Lecture Notes in Computer Science, pages 410–421. Springer, 2010. [18] Réka Albert and Albert L. Barabási. Topology of evolving networks: Local events and universality. Physical Review Letters, 85 (24):5234–5237, Dec. 2000. [19] Gerald L. Alexanderson. Euler and Königsberg’s bridges: A historical view. Bulletin of the American Mathematical Society, 43: 567–573, 2006. [20] L.A.N. Amaral, A. Scala, M. Barthélémy, and H.E. Stanley. Classes of small-world networks. Proceedings of the National Academy of Sciences of the U.S.A., 97(21):11149–11152, Oct. 2000. [21] Isaac Asimov. I. Asimov: A Memoir. Random House, New York, USA, 1995.
[22] A.L. Barabási, H. Jeong, Z. Neda, E. Ravasz, A. Schubert, and T. Vicsek. Evolution of the social network of scientific collaborations. Physica A, 311(3–4), Apr. 2001. [23] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999. [24] Vic Barnett and Toby Lewis. Outliers in Statistical Data. Wiley Series in Probability & Statistics. Wiley, 3rd edition, 1994. [25] M. Bastian, S. Heymann, and M. Jacomy. Gephi: an open source software for exploring and manipulating networks. In International AAAI Conference on Weblogs and Social Media. AAAI, 2009. [26] Stephen Bay, Krishna Kumaraswamy, Markus G. Anderle, Rohit Kumar, and David M. Steier. Large scale detection of irregularities in accounting data. In ICDM, pages 75–86. IEEE Computer Society, 2006. [27] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. The R*-tree: An efficient and robust access method for points and rectangles. In Hector Garcia-Molina and H. V. Jagadish, editors, SIGMOD Conference, pages 322–331. ACM Press, 1990. [28] Stefan Berchtold, Daniel A. Keim, and Hans-Peter Kriegel. The X-tree: An index structure for high-dimensional data. In T. M. Vijayaraman, Alejandro P. Buchmann, C. Mohan, and Nandlal L. Sarda, editors, VLDB, pages 28–39. Morgan Kaufmann, 1996. [29] Ten Berge.
Least square optimization in multivariate analysis.
DSWO Press, Leiden, 1993. [30] Zhiqiang Bi, Christos Faloutsos, and Flip Korn. The “DGX” distribution for mining massive, skewed data. In Lee et al. [126], pages 17–26. [31] G. Bianconi and A.L. Barabási. Competition and multiscaling in evolving networks. Europhysics Letters, 54(4):436–442, 2001. [32] Robert P. Biuk-Aghai, Yain-Whar Si, Simon Fong, and Peng-Fan Yan. Security in physical environments: Algorithms and system for automated detection of suspicious activity. In PAKDD Workshop on Behavior Informatics 2010, Hyderabad, India, 2010.
[33] David M. Blei and Michael I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1:121–144, 2005. [34] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, Oct 2008. [35] Béla Bollobás, Christian Borgs, Jennifer Chayes, and Oliver Riordan. Directed scale-free graphs. In Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, SODA ’03, pages 132–139, Philadelphia, PA, USA, 2003. Society for Industrial and Applied Mathematics. [36] Bélaa Bollobás and Oliver Riordan. The diameter of a scale-free random graph. Combinatorica, 24(1):5–34, Jan. 2004. [37] Booyabazooka.
Graph isomorphism. http://en.wikipedia.org/wiki/Graph_isomorphism, Nov 2006. Files Graph isomorphism a.svg and Graph isomorphism b.svg reproduced
from Wikimedia Commons under the terms of the GNU Free Documentation License v1.2. [38] Christian Borgelt. Canonical forms for frequent graph mining. In Reinhold Decker and Hans-Joachim Lenz, editors, GfKl, Studies in Classification, Data Analysis, and Knowledge Organization, pages 337–349. Springer, 2006. [39] Christian Borgelt and Michael R. Berthold. Mining molecular fragments: Finding relevant substructures of molecules. In ICDM DBL [1], pages 51–58. [40] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. LOF: Identifying density-based local outliers. In Chen et al. [52], pages 93–104. [41] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst., 30 (1–7):107–117, Apr. 1998. [42] Matt Britt.
Internet map. http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg, Jan 2005. Reproduced from Wikimedia Commons under the terms of the Creative Commons Attribution 2.5 Generic license.
[43] T. Bu and D. Towsley.
On distinguishing between Internet
power law topology generators. In INFOCOM, volume 2, pages 638–647, 2002. [44] Paul Burkhardt and Chris Waring. An NSA big graph experiment. Technical Report NSA-RD-2013-056002v1, US National Security Agency, Research Directorate R6, May 2013. [45] Paul Butler. Visualizing friendships. https://www.facebook. com/note.php?note_id=469716398919, Dec 2010.
[46] Toon Calders, Jan Ramon, and Dries Van Dyck. Anti-monotonic overlap-graph support measures. In ICDM DBL [3], pages 73– 82. [47] Deepayan Chakrabarti. Autopart: Parameter-free graph partitioning and outlier detection. In Jean-François Boulicaut, Floriana Esposito, Fosca Giannotti, and Dino Pedreschi, editors, PKDD, volume 3202 of Lecture Notes in Computer Science, pages 112–124. Springer, 2004. [48] Deepayan Chakrabarti and Christos Faloutsos. Graph Mining: Laws, Tools, and Case Studies. Synthesis Lectures on Data Mining and Knowledge Discovery. Morgan & Claypool Publishers, 2012. [49] Deepayan Chakrabarti, Yiping Zhan, and Christos Faloutsos. RMAT: A recursive model for graph mining. In Michael W. Berry, Umeshwar Dayal, Chandrika Kamath, and David B. Skillicorn, editors, SDM. SIAM, 2004. [50] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys, 41(3): 15:1–15:58, Jul 2009. [51] Duen Horng Chau, Shashank Pandit, and Christos Faloutsos. Detecting fraudulent personalities in networks of online auctioneers. In Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou, editors, PKDD, volume 4213 of Lecture Notes in Computer Science, pages 103–114. Springer, 2006. [52] Weidong Chen, Jeffrey F. Naughton, and Philip A. Bernstein, editors. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16–18, 2000, Dallas, Texas, USA, 2000. ACM.
[53] Nicholas A. Christakis and James H. Fowler. The spread of obesity in a large social network over 32 years. New England Journal of Medicine, 357(4):370–379, 2007. [54] Nicholas A. Christakis and James H. Fowler. The collective dynamics of smoking in a large social network. New England Journal of Medicine, 358(21):2249–2258, 2008. [55] Jeffrey A Coble, Runu Rathi, Diane J Cook, and Lawrence B Holder. Iterative structure discovery in graph-based data. International Journal on Artificial Intelligence Tools, 14(01n02):101–124, 2005. [56] William W. Cohen. Enron e-mail dataset. https://www.cs.cmu. edu/~enron/, Aug. 2009.
[57] Diane J. Cook and Lawrence B. Holder. Substructure discovery using minimum description length and background knowledge. Journal of Artificial Intelligence Research, 1(1):231–255, Feb 1994. [58] Diane J. Cook and Lawrence B. Holder. Graph-based data mining. IEEE Intelligent Systems, 15(2):32–41, 2000. [59] Colin Cooper and Alan Frieze. A general model of web graphs. Random Struct. Algorithms, 22(3):311–335, May 2003. [60] Easley David and Kleinberg Jon. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press, New York, NY, USA, 2010. [61] Derek de Solla Price. A general theory of bibliometric and other cumulative advantage processes. Journal of the American Society for Information Science, 27(5–6):292–306, 1976. [62] Timothy de Vries, Sanjay Chawla, and Michael E. Houle. Finding local anomalies in very high dimensional space. In Geoffrey I. Webb, Bing Liu, Chengqi Zhang, Dimitrios Gunopulos, and Xindong Wu, editors, ICDM, pages 128–137. IEEE Computer Society, 2010. [63] Jana Diesner and Kathleen M. Carley. Exploration of communication networks from the Enron email corpus. In Proceedings of Workshop on Link Analysis, Counterterrorism and Security, SIAM International Conference on Data Mining 2005, pages 3–14, 2005.
[64] S. N. Dorogovtsev and J. F. F. Mendes. Evolution of Networks: From Biological Nets to the Internet and WWW. Oxford University Press, Inc., New York, USA, 2003. [65] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. Wiley-Interscience, 2nd edition, 2000. [66] William Eberle and Lawrence Holder. Compression versus frequency for mining patterns and anomalies in graphs. In Ninth Workshop on Mining and Learning with Graphs (MLG 2011), San Diego, CA, Aug 2011. SIGKDD. at the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2011). [67] William Eberle and Lawrence B. Holder. Mining for structural anomalies in graph-based data. In Robert Stahlbock, Sven F. Crone, and Stefan Lessmann, editors, DMIN, pages 376–389. CSREA Press, 2007. [68] William Eberle and Lawrence B. Holder. Anomaly detection in data represented as graphs. Intelligent Data Analysis, 11(6): 663–689, 2007. [69] William Eberle and Lawrence B. Holder. Mining for insider threats in business transactions and processes. In CIDM, pages 163–170. IEEE, 2009. [70] William Eberle and Lawrence B. Holder. Discovering anomalies to multiple normative patterns in structural and numeric data. In H. Chad Lane and Hans W. Guesgen, editors, FLAIRS Conference. AAAI Press, 2009. [71] William Eberle, Lawrence B. Holder, and J. Graves. Insider threat detection using a graph-based approach. Journal of Applied Security Research, 6(1):32–81, Jan 2011. [72] Frank Eichinger, Klemens Böhm, and Matthias Huber. Mining edge-weighted call graphs to localise software bugs. In Walter Daelemans, Bart Goethals, and Katharina Morik, editors, ECML/PKDD (1), volume 5211 of Lecture Notes in Computer Science, pages 333–348. Springer, 2008. [73] Frank Eichinger, Matthias Huber, and Klemens Böhm. On the usefulness of weight-based constraints in frequent subgraph mining. In Max Bramer, Miltos Petridis, and Adrian Hopgood, editors, SGAI Conf., pages 65–78. Springer, 2010.
[74] Paul Erd˝os and Alfréd Rényi. On random graphs, I. Publicationes Mathematicae, 6:290–297, 1959. [75] Paul Erd˝os and Alfréd Rényi. On the evolution of random graphs. Acta Mathematica Academiae Scientiarum Hungaricae, 5: 17–61, 1960. [76] Paul Erd˝os and Alfréd Rényi. On the strength of connectedness of a random graph. Acta Mathematica Academiae Scientiarum Hungaricae, 12:261–267, 1961. [77] Giorgio Fagiolo. Clustering in complex directed networks. Physical Review E, 76(2), August 2007. [78] Christos Faloutsos. Mining billion-node graphs: Patterns, generators and tools. In José L. Balcázar, Francesco Bonchi, Aristides Gionis, and Michèle Sebag, editors, ECML/PKDD (1), volume 6321 of Lecture Notes in Computer Science, page 1. Springer, 2010. [79] Eduardo B. Fernandez, Jose Ballesteros, Ana C. DesouzaDoucet, and Maria M. Larrondo-Petrie. Security patterns for physical access control systems. In Data and Applications Security XXI, volume 4602 of LNCS, pages 259–274, Berlin, 2007. Springer. [80] Mathias Fiedler and Christian Borgelt. Subgraph support in a single large graph. In ICDM Workshops, pages 399–404. IEEE Computer Society, 2007. [81] Mario A.T. Figueiredo and A.K. Jain. Unsupervised learning of finite mixture models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(3):381–396, 2002. [82] Scott Fortin. The graph isomorphism problem. Technical report, Univ. of Alberta, 1996. [83] James H Fowler and Nicholas A. Christakis. Dynamic spread of happiness in a large social network: longitudinal analysis over 20 years in the Framingham Heart Study. BMJ, 337, 12 2008. [84] Linton C. Freeman. The development of social network analysis: a study in the sociology of science. Empirical Press, Vancouver, 2004.
[85] Lise Getoor, Ted E. Senator, Pedro Domingos, and Christos Faloutsos, editors. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 24–27, 2003, 2003. ACM. [86] William Gibson. Pattern Recognition. G.P. Putnam’s Sons, New York, USA, 2003. [87] M.L. Goldstein, S.A. Morris, and G.G. Yen. Problems with fitting to the power-law distribution. The European Physical Journal B—Condensed Matter and Complex Systems, 41(2):255–258, 2004. [88] John Guare. Six Degrees of Separation: A Play. Vintage Books, 1990. [89] Ehud Gudes, Solomon Eyal Shimony, and Natalia Vanetik. Discovering frequent graph patterns using disjoint paths. IEEE Transactions on Knowledge and Data Engineering, 18(11):1441– 1456, 2006. [90] Stephan Günnemann, Brigitte Boden, and Thomas Seidl. DBCSC: A density-based approach for subspace clustering in graphs with feature vectors. In Dimitrios Gunopulos, Thomas Hofmann, Donato Malerba, and Michalis Vazirgiannis, editors, ECML/PKDD (1), volume 6911 of Lecture Notes in Computer Science, pages 565–580. Springer, 2011. [91] Douglas M. Hawkins. Identification of Outliers. Monographs on Applied Probability and Statistics. Chapman and Hall, London, 1980. [92] Zengyou He, Xiaofei Xu, and Shengchun Deng. Discovering cluster-based local outliers. Pattern Recognition Letters, 24(9–10): 1641–1650, 2003. [93] Peter D. Hoff, Adrian E. Raftery, and Mark S. Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460):1090–1098, Dec 2002. [94] Lawrence B. Holder. Empirical substructure discovery. In Alberto Maria Segre, editor, ML, pages 133–136. Morgan Kaufmann, 1989. [95] Lawrence B. Holder, Diane J. Cook, and Horst Bunke. Fuzzy substructure discovery. In Derek H. Sleeman and Peter Edwards, editors, ML, pages 218–223. Morgan Kaufmann, 1992.
[96] Jun Huan, Wei Wang, and Jan Prins. Efficient mining of frequent subgraphs in the presence of isomorphism. In ICDM DBL [2], pages 549–552. [97] Jun Huan, Wei Wang, Jan Prins, and Jiong Yang. SPIN: mining maximal frequent subgraphs from graph databases. In Kim et al. [112], pages 581–586. [98] Akihiro Inokuchi, Takashi Washio, and Hiroshi Motoda. An Apriori-based algorithm for mining frequent substructures from graph data.
In Djamel A. Zighed, Henryk Jan Ko-
morowski, and Jan M. Zytkow, editors, PKDD, volume 1910 of Lecture Notes in Computer Science, pages 13–23. Springer, 2000. [99] Jeroen H. M. Janssens, Ildikó Flesch, and Eric O. Postma. Outlier detection with one-class classifiers from ML and KDD. In M. Arif Wani, Mehmed M. Kantardzic, Vasile Palade, Lukasz A. Kurgan, and Yuan Qi, editors, ICMLA, pages 147–153. IEEE Computer Society, 2009. [100] Chuntao Jiang, Frans Coenen, and Michele Zito. Frequent subgraph mining on edge weighted graphs. In Torben Bach Pedersen, Mukesh K. Mohania, and A Min Tjoa, editors, DaWak, volume 6263 of Lecture Notes in Computer Science, pages 77–88. Springer, 2010. [101] Wei Jiang, Jaideep Vaidya, Zahir Balaporia, Chris Clifton, and Brett Banich. Knowledge discovery from transportation network data. In Karl Aberer, Michael J. Franklin, and Shojiro Nishio, editors, ICDE, pages 1061–1072. IEEE Computer Society, 2005. [102] William Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in modern analysis and probability (New Haven, Conn., 1982), volume 26 of Contemporary Mathematics, pages 189–206. American Mathematical Society, 1984. [103] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell., 24(7):881–892, 2002. [104] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local
search approximation algorithm for k-means clustering. Comput. Geom., 28(2–3):89–112, 2004. [105] Frigyes Karinthy. Láncszemek (Chains). Minden masképpen van (Everything is Different), 1929. Translated from Hungarian by Adam Makkai. [106] Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43, March 1953. [107] Jeremy V. Kepner and J. R. Gilbert. Graph algorithms in the language of linear algebra. Society for Industrial and Applied Mathematics, 2011. [108] Cane Wing ki Leung. Technical notes on extending gspan to directed graphs. Technical report, Singapore Management University, Oct. 2010. [109] Myunghwan Kim and Jure Leskovec. Modeling social networks with node attributes using the Multiplicative Attribute Graph model. In Fabio Gagliardi Cozman and Avi Pfeffer, editors, UAI, pages 400–409. AUAI Press, 2011. [110] Myunghwan Kim and Jure Leskovec. Latent multi-group membership graph model. In ICML. icml.cc / Omnipress, 2012. [111] Myunghwan Kim and Jure Leskovec. Multiplicative Attribute Graph model of real-world networks. Internet Mathematics, 8 (1–2):113–160, 2012. [112] Won Kim, Ron Kohavi, Johannes Gehrke, and William DuMouchel, editors. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22–25, 2004, 2004. ACM. [113] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, Sep 1999. [114] Jon M. Kleinberg, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. The web as a graph: Measurements, models, and methods. In COCOON, pages 1–17, 1999. [115] Bryan Klimt and Yiming Yang.
The Enron corpus: A new
dataset for email classification research. In Jean-François Boulicaut, Floriana Esposito, Fosca Giannotti, and Dino Pedreschi,
editors, ECML, volume 3201 of Lecture Notes in Computer Science, pages 217–226. Springer, 2004. [116] P. L. Krapivsky and S. Redner. Organization of growing random networks. Physical Review E, 63(6), 2001. [117] Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. Outlier detection in axis-parallel subspaces of high dimensional data. In Thanaruk Theeramunkong, Boonserm Kijsirikul, Nick Cercone, and Tu Bao Ho, editors, PAKDD, volume 5476 of Lecture Notes in Computer Science, pages 831–838. Springer, 2009. [118] Hans-Peter Kriegel, Peer Kröger, and Arthur Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery in Data, 3(1):1:1–1:58, March 2009. [119] Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24. SIAM/Omnipress, 2011. [120] Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. Extracting large-scale knowledge bases from the web. In Malcolm P. Atkinson, Maria E. Orlowska, Patrick Valduriez, Stanley B. Zdonik, and Michael L. Brodie, editors, VLDB, pages 639–650. Morgan Kaufmann, 1999. [121] M. Kuramochi and G. Karypis. An efficient algorithm for discovering frequent subgraphs. Knowledge and Data Engineering, IEEE Transactions on, 16(9):1038–1051, 2004. [122] Michihiro Kuramochi and George Karypis. GREW—a scalable frequent subgraph discovery algorithm. In ICDM, pages 439– 442. IEEE Computer Society, 2004. [123] Michihiro Kuramochi and George Karypis. Finding frequent patterns in a large sparse graph. Data Min. Knowl. Discov., 11(3): 243–271, Nov. 2005. [124] Michihiro Kuramochi and George Karypis. Discovering frequent geometric subgraphs.
Inf. Syst., 32(8):1101–1120, Dec.
2007. [125] Kenichi Kurihara, Max Welling, and Nikos Vlassis. Accelerated variational dirichlet process mixtures. In NIPS, 2006.
[126] Doheon Lee, Mario Schkolnick, Foster J. Provost, and Ramakrishnan Srikant, editors. Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, San Francisco, CA, USA, August 26–29, 2001, 2001. ACM. [127] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Trans. Knowl. Discov. Data, 1(1), Mar 2007. [128] Jure Leskovec, Deepayan Chakrabarti, Jon M. Kleinberg, Christos Faloutsos, and Zoubin Ghahramani. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research, 11:985–1042, 2010. [129] David D. Lewis. Naïve (Bayes) at forty: The independence assumption in information retrieval. In Claire Nedellec and Céline Rouveirol, editors, ECML, volume 1398 of Lecture Notes in Computer Science, pages 4–15. Springer, 1998. [130] Ying Li, Bing Liu, and Sunita Sarawagi, editors. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24– 27, 2008, 2008. ACM. [131] David Liben-Nowell and Jon Kleinberg. The link prediction problem for social networks. In Proceedings of the twelfth international conference on Information and knowledge management, CIKM ’03, pages 556–559, New York, NY, USA, 2003. ACM. [132] Shou-De Lin and Hans Chalupsky. Unsupervised link discovery in multi-relational data via rarity analysis. In ICDM DBL [2], pages 171–178. [133] F. Mansman, L. Meier, and Daniel A. Keim. Visualization of host behavior for network security. In John R. Goodall, Gregory J. Conti, and Kwan-Liu Ma, editors, VizSEC, Mathematics and Visualization, pages 187–202. Springer, 2007. [134] Markos Markou and Sameer Singh.
Novelty detection: a
review—part 1: statistical approaches. Signal Processing, 83(12): 2481–2497, 2003. [135] Mary McGlohon, Leman Akoglu, and Christos Faloutsos. Weighted graphs and disconnected components: patterns and a generator. In Li et al. [130], pages 524–532.
[136] Brendan D. McKay and Adolfo Piperno. Practical graph isomorphism, II. CoRR, abs/1301.1493, 2013. [137] B. McLean and P. Elkind. The Smartest Guys in the Room: The Amazing Rise and Scandalous Fall of Enron. Penguin Group USA, 2003. [138] Miller McPherson, Lynn S. Lovin, and James M. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1):415–444, 2001. [139] Cristopher Moore, Gourab Ghoshal, and M. E. J. Newman. Exact solutions for models of evolving networks with addition and deletion of nodes. Physical Review E, 74:036121+, Sep. 2006. [140] Jacob L. Moreno. Who Shall Survive? Foundations of Sociometry, Group Psychotherapy and Sociodrama. Beacon House, Beacon, NY, 1953. [141] M. E. J. Newman. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 46(5):323–351, May 2005. [142] Mark Newman. Networks: An Introduction. Oxford University Press, Inc., New York, NY, USA, 2010. [143] Siegfried Nijssen and Joost N. Kok. A quickstart in frequent structure mining can make a difference. In Kim et al. [112], pages 647–652. [144] Caleb C. Noble and Diane J. Cook. Graph-based anomaly detection. In Getoor et al. [85], pages 631–636. [145] Sebastian Nowozin, Koji Tsuda, Takeaki Uno, Taku Kudo, and Gökhan H. Bakir. Weighted substructure mining for image analysis. In CVPR. IEEE Computer Society, 2007. [146] Manuel Alfredo Pech Palacio. Spatial data modeling and mining using a graph-based representation. PhD thesis, Department of Computer Systems Engineering, University of the Americas, Puebla, Dec. 2005. [147] C. R. Palmer and J. G. Steffan. Generating network topologies that obey power laws. In Proceedings of GLOBECOM 2000, Nov. 2000.
[148] Spiros Papadimitriou, Hiroyuki Kitagawa, Phillip B. Gibbons, and Christos Faloutsos. Loci: Fast outlier detection using the local correlation integral. In Umeshwar Dayal, Krithi Ramamritham, and T. M. Vijayaraman, editors, ICDE, pages 315–326. IEEE Computer Society, 2003. [149] Gerardo Perez, Ivan Olmos, and Jesus A. Gonzalez. Subgraph isomorphism detection with support for continuous labels. In Hans W. Guesgen and R. Charles Murray, editors, FLAIRS Conference. AAAI Press, 2010. [150] Daniel J. Power. What is the “true story” about using data mining to identify a relation between sales of beer and diapers? DSS News, 3(23), Nov. 2002. [151] B. Aditya Prakash, Hanghang Tong, Nicholas Valler, Michalis Faloutsos, and Christos Faloutsos.
Virus propagation on
time-varying networks: Theory and immunization algorithms. In José L. Balcázar, Francesco Bonchi, Aristides Gionis, and Michèle Sebag, editors, ECML/PKDD (3), volume 6323 of Lecture Notes in Computer Science, pages 99–114. Springer, 2010. [152] Carey E. Priebe, John M. Conroy, David J. Marchette, and Youngser Park. Scan statistics on Enron graphs. Computational & Mathematical Organization Theory, 11(3):229–247, 2005. [153] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient algorithms for mining outliers from large data sets. In Chen et al. [52], pages 427–438. [154] Anatol Rapoport. Spread of information through a population with socio-structural bias: I. assumption of transitivity. Bulletin of Mathematical Biology, 15(4):523–533, Dec. 1953. [155] Matthew J. Rattigan and David Jensen. The case for anomalous link discovery. SIGKDD Explorations, 7(2):41–47, 2005. [156] Erzsébet Ravasz and Albert L. Barabási. Hierarchical organization in complex networks. Physical Review E, 67(2):026112+, Feb 2003. [157] G. Redpath and G. McClure. The role of electronic security systems integration in airport management. In European Conference on Security and Detection (ECOS 97), pages 137–141, Apr 1997.
[158] J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978. [159] Jorma Rissanen.
Stochastic Complexity in Statistical Inquiry.
World Scientific Publishing Co., Inc., River Edge, NJ, USA, 1989. [160] Oscar E. Romero, Jesus A. Gonzalez, and Lawrence B. Holder. Handling of numeric ranges with the subdue system.
In
R. Charles Murray and Philip M. McCarthy, editors, FLAIRS Conference. AAAI Press, 2011. [161] Oscar E. Romero, Lawrence B. Holder, and Jesus A. Gonzalez. A new approach for handling numeric ranges for graph-based knowledge discovery. Unpublished, 2011. URL http://www. lidi.info.unlp.edu.ar/WorldComp2011-Mirror/DMI2102.pdf.
[162] Matthew J. Salganik, Peter Sheridan Dodds, and Duncan J. Watts. Experimental study of inequality and unpredictability in an artificial cultural market. Science, 311(5762):854–856, 2006. [163] Erich Schubert, Arthur Zimek, and Hans-Peter Kriegel. Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection. Data Mining and Knowledge Discovery, pages 1–48, Dec. 2012. [164] Karlton Sequeira and Mohammed Javeed Zaki.
ADMIT:
anomaly-based data mining for intrusions. In KDD, pages 386– 395. ACM, 2002. [165] Mukund Seshadri, Sridhar Machiraju, Ashwin Sridharan, Jean Bolot, Christos Faloutsos, and Jure Leskovec.
Mobile call
graphs: beyond power-law and lognormal distributions. In Li et al. [130], pages 596–604. [166] Jitesh Shetty and Jafar Adibi.
Ex-employee status report.
http://www.isi.edu/~adibi/Enron/Enron.htm.
This spread-
sheet contains 161 names, but two of them appear to be duplicates. [167] Jitesh Shetty and Jafar Adibi. Discovering important nodes through graph entropy the case of Enron email database. In Proceedings of the 3rd international workshop on Link discovery, LinkKDD ’05, pages 74–81, New York, NY, USA, 2005. ACM.
[168] Herbert A. Simon. On a class of skew distribution functions. Biometrika, 42(3–4):425–440, 1955.
[169] S. Staniford-Chen, S. Cheung, R. Crawford, M. Dilger, J. Frank, J. Hoagland, K. Levitt, C. Wee, R. Yip, and D. Zerkle. GrIDS – a graph-based intrusion detection system for large networks. In Proceedings of the 19th National Information Systems Security Conference, Baltimore, USA, Oct. 1996.
[170] Neal Stephenson. Cryptonomicon. Arrow Books, London, 2000.
[171] S. V. Subramanian, Daniel Kim, and Ichiro Kawachi. Covariation in the socioeconomic determinants of self rated health and happiness: a multivariate multilevel analysis of individuals and communities in the USA. Journal of Epidemiology and Community Health, 59(8):664–669, 2005.
[172] Jimeng Sun, Huiming Qu, Deepayan Chakrabarti, and Christos Faloutsos. Relevance search and anomaly detection in bipartite graphs. SIGKDD Explorations, 7(2):48–55, 2005.
[173] Jimeng Sun, Christos Faloutsos, Spiros Papadimitriou, and Philip S. Yu. GraphScope: parameter-free mining of large time-evolving graphs. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’07, pages 687–696, New York, NY, USA, 2007. ACM.
[174] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, Dec. 2000.
[175] Lini T. Thomas, Satyanarayana R. Valluri, and Kamalakar Karlapalem. MARGIN: Maximal frequent subgraph mining. ACM Trans. Knowl. Discov. Data, 4(3):10:1–10:42, Oct. 2010.
[176] Jeffrey Travers and Stanley Milgram. An experimental study of the small world problem. Sociometry, 32:425–443, 1969.
[177] Charalampos E. Tsourakakis. Fast counting of triangles in large real networks without counting: Algorithms and laws. In ICDM DBL [3], pages 608–617.
[178] Xiaomeng Wan, Evangelos E. Milios, Nauzer Kalyaniwalla, and Jeannette Janssen. Link-based event detection in email communication networks. In Sung Y. Shin and Sascha Ossowski, editors, SAC, pages 1506–1510. ACM, 2009.
[179] Chen Wang, Yongtai Zhu, Tianyi Wu, Wei Wang, and Baile Shi. Constraint-based graph mining in large database. In Yanchun Zhang, Katsumi Tanaka, Jeffrey Xu Yu, Shan Wang, and Minglu Li, editors, APWeb, volume 3399 of Lecture Notes in Computer Science, pages 133–144. Springer, 2005.
[180] Y. Wang and G. Wong. Stochastic block models for directed graphs. Journal of the American Statistical Association, 82(397):8–19, 1987.
[181] Stanley Wasserman and Philippa Pattison. Logit models and logistic regressions for social networks. Psychometrika, 61(3):401–425, 1996.
[182] D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442, 1998.
[183] Marc Wörlein, Thorsten Meinl, Ingrid Fischer, and Michael Philippsen. A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston. In Alípio Jorge, Luís Torgo, Pavel Brazdil, Rui Camacho, and João Gama, editors, PKDD, volume 3721 of Lecture Notes in Computer Science, pages 392–403. Springer, 2005.
[184] Kenji Yamanishi and Jun-ichi Takeuchi. Discovering outlier filtering rules from unlabeled data: combining a supervised learner with an unsupervised learner. In Lee et al. [126], pages 389–394.
[185] Xifeng Yan and Jiawei Han. gSpan: Graph-based substructure pattern mining. In ICDM DBL [1], pages 721–724.
[186] Xifeng Yan and Jiawei Han. CloseGraph: mining closed frequent graph patterns. In Getoor et al. [85], pages 286–295.
[187] Stephen J. Young and Edward R. Scheinerman. Random dot product graph models for social networks. In Proceedings of the 5th international conference on Algorithms and models for the web-graph, WAW’07, pages 138–149, Berlin, 2007. Springer-Verlag.
[188] Dantong Yu, Gholamhosein Sheikholeslami, and Aidong Zhang. FindOut: Finding outliers in very large datasets. Knowl. Inf. Syst., 4(4):387–412, 2002.
[189] Bin Zhang, Liye Ma, and Ramayya Krishnan. Statistical analysis and anomaly detection of SMS social networks. In Dennis F. Galletta and Ting-Peng Liang, editors, ICIS. Association for Information Systems, 2011.
[190] Arthur Zimek, Erich Schubert, and Hans-Peter Kriegel. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining, 5(5):363–387, 2012.
colophon
This document was typeset in LaTeX using the typographical look-and-feel classicthesis developed by André Miede. The style was inspired by Robert Bringhurst’s seminal book on typography, “The Elements of Typographic Style”. classicthesis is available from: http://code.google.com/p/classicthesis/
Final Version as of 17th March 2014 (classicthesis version 4.1).