Fakultät für Elektrotechnik und Informatik
Institut für Verteilte Systeme
Fachgebiet Wissensbasierte Systeme (KBS)

Anomaly Detection in Data Streams

By
Amit Amit

Supervised by: Prof. Dr. Eirini Ntoutsi, Prof. Dr. Wolfgang Nejdl, Dr. Thomas Risse

June 2017
Outline
■ Motivation
■ Introduction to Anomaly Detection
■ Related work
■ Approach
■ Dataset
■ Results
■ Conclusion
Motivation & Problem Statement
■ GfK (Gesellschaft für Konsumforschung) collects media usage data (streams)
■ The usage data contains abnormal behavior, which distorts market predictions
■ Currently, the problem is addressed by manual quality checks:
  ❑ Neither efficient nor cost-effective
  ❑ Not suitable for real-time analytics
  ❑ The number of false negatives (missed detections) is significantly high
■ An anomaly in GfK data streams can be one of the following:
  ❑ Bugs in the measurement methodology
  ❑ Changes in the measured media outlet
  ❑ Intended changes of behavior
■ Objective:
  ❑ To build multiple probabilistic models that identify anomalies, evaluate their robustness, and show that relevant, non-trivial anomalies can be identified automatically
Anomaly Detection – At a glance
■ An anomaly is a pattern in the data which does not conform to the expected behavior
■ An anomaly is known by several names, depending on the context:
  ❑ Outlier, Deviant, Changepoint, Fraud, Discordant, Failure, Fault, Novelty
■ Hawkins: "An outlier is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism."
■ Examples: credit-card fraud, network intrusion, anomalous MRI images, etc.
Types of Anomalies
■ Point Anomaly
  ❑ A single data instance is considered to be an anomaly
■ Collective Anomaly
  ❑ A sequence of data instances (with a certain relation among them)
■ Contextual Anomaly
  ❑ A data instance that is anomalous only with respect to a specific context
■ Examples:
  ❑ Temperature on a particular day vs. the normal temperature
  ❑ Crime scene (traffic jam, emergency calls, etc.)
  ❑ Yearly rainfall in the figure below
Taxonomy – Anomaly Detection Approaches (1/2)
■ Supervised Anomaly Detection
  ❑ Models are built around two labeled classes: normal class vs. anomalous class
■ Semi-supervised Anomaly Detection
  ❑ Labels are available only for the normal class
■ Unsupervised Anomaly Detection
  ❑ No labeling information is available
  ❑ Based on the hidden structures in the data
Taxonomy – State-of-the-art Analysis (2/2)
■ Proximity-based
  ❑ kth Nearest Neighbour (k-NN), Local Outlier Factor (LOF)
■ Clustering-based
  ❑ K-means
■ Classification-based
  ❑ Support Vector Machine (SVM)
■ Probabilistic and statistical based
  ❑ Parametric techniques
    ■ 3-sigma rule, Student's t-test & Hotelling's t² test
  ❑ Non-parametric techniques
    ■ Kernel-function based, histogram based
■ High-dimensional search based
  ❑ Selecting High-Contrast Subspaces (HiCS)
■ Information-theoretic based
  ❑ Entropy-based
Related work
■ This research builds on Adams and MacKay [2007], "Bayesian Online Changepoint Detection"
■ The original work addresses changepoint detection in data streams
■ There are many other contributions to Bayesian inference for changepoint detection:
  ❑ Barry and Hartigan [1993] (offline, retrospective)
  ❑ Stephens [1994] (offline, retrospective)
  ❑ Sharifzadeh et al. [2005] (offline, retrospective)
  ❑ Jervis and Jardine [1999] (online)
  ❑ Ruanaidh et al. [1994] (online)
  ❑ Adams and MacKay [2007] (online)
Proposed framework
■ Consists of three components:
  ❑ Anomaly Detection
  ❑ Human Validation
  ❑ Machine Learning (ML)
■ The focus of this research is the anomaly detection component
■ The ML component provides feedback to the anomaly detector
■ A human annotator validates the accuracy of the detection
Approach
■ Bayesian Inference
  ❑ Formal definition:
    ■ x is a data point (or vector)
    ■ θ is the distribution parameter
    ■ α is the hyper-parameter of the parameters
    ■ X is the set of observed data {x1, x2, ..., xn}
    ■ x̃ is a new data point (whose distribution is to be predicted)
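In terms of the quantities defined above, Bayes' theorem and the posterior predictive distribution read:

```latex
% Posterior over the parameter, given the observed data and hyper-parameter
p(\theta \mid X, \alpha) =
  \frac{p(X \mid \theta)\, p(\theta \mid \alpha)}
       {\int p(X \mid \theta')\, p(\theta' \mid \alpha)\, d\theta'}

% Posterior predictive for a new data point \tilde{x}
p(\tilde{x} \mid X, \alpha) =
  \int p(\tilde{x} \mid \theta)\, p(\theta \mid X, \alpha)\, d\theta
```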
Approach
■ A synthetic data stream is segmented into partitions of differing mean
■ The partitions are denoted by the run length r_t
■ The run length r_t drops to zero when a changepoint is detected
■ The message-passing algorithm shows the probability mass being passed upwards
Approach
■ The marginal predictive distribution:
■ The posterior distribution of the run length can be calculated as:
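In Adams and MacKay [2007], these two quantities are:

```latex
% Marginal predictive distribution: average the model's prediction
% over the run-length posterior
P(x_{t+1} \mid x_{1:t}) =
  \sum_{r_t} P(x_{t+1} \mid r_t, x_t^{(r)})\, P(r_t \mid x_{1:t})

% Run-length posterior, obtained by normalizing the joint distribution
P(r_t \mid x_{1:t}) =
  \frac{P(r_t, x_{1:t})}{\sum_{r_t} P(r_t, x_{1:t})}
```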
Approach
■ The joint distribution is essentially the quantity prior × likelihood
Approach
■ The probability mass for the changepoint prior (hazard function), where:
  ❑ h(x) is the hazard function
  ❑ S(x) is the survival function
  ❑ F(x) is the cumulative distribution function (CDF)
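The standard relations between these quantities, together with the constant hazard that results from a memoryless (geometric) run-length prior with timescale λ, are:

```latex
h(x) = \frac{f(x)}{S(x)}, \qquad
S(x) = 1 - F(x), \qquad
F(x) = \int_{-\infty}^{x} f(u)\, du

% For a geometric run-length prior with timescale \lambda,
% the hazard function is constant:
H(\tau) = 1/\lambda
```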
Approach
■ In the case r_t = r_{t-1} + 1 (i.e., no changepoint), the growth probability is obtained recursively
■ The marginal likelihood (evidence) of the new datum appears as a factor in this recursion
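Following Adams and MacKay [2007], the growth and changepoint messages of the recursion are:

```latex
% Growth probability: the run continues, r_t = r_{t-1} + 1
P(r_t = r_{t-1}+1,\, x_{1:t}) =
  P(r_{t-1},\, x_{1:t-1})\; \pi_t^{(r)}\; \bigl(1 - H(r_{t-1})\bigr)

% Changepoint probability: the run length drops to zero
P(r_t = 0,\, x_{1:t}) =
  \sum_{r_{t-1}} P(r_{t-1},\, x_{1:t-1})\; \pi_t^{(r)}\; H(r_{t-1})

% where the marginal likelihood (evidence) of x_t is
\pi_t^{(r)} = P\bigl(x_t \mid r_{t-1},\, x_t^{(r)}\bigr)
```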
Approach
■ Algorithm outline:
  ❑ Boundary conditions (initialization)
  ❑ Calculate the posterior distribution of the run length r_t
  ❑ Perform prediction for the new datum
  ❑ Update the parameters and hyper-parameters
  ❑ Calculate the anomaly scores
■ Data modeling:
  ❑ Exponential family
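A minimal sketch of these steps, assuming a Gaussian data model with unknown mean and variance (a member of the exponential family, with conjugate Normal-Gamma prior and Student-t predictive) and a constant hazard rate; the function and parameter names are illustrative, not taken from the thesis:

```python
import numpy as np
from scipy.stats import t as student_t

def bocpd(data, hazard=1.0 / 250, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Bayesian online changepoint detection (after Adams & MacKay, 2007).
    Gaussian data with unknown mean/variance -> Normal-Gamma conjugate
    prior, so the predictive for each run length is a Student-t."""
    T = len(data)
    R = np.zeros((T + 1, T + 1))           # R[t, r] = P(run length r | x_1:t)
    R[0, 0] = 1.0                          # boundary condition: run starts at 0
    mu, kappa = np.array([mu0]), np.array([kappa0])
    alpha, beta = np.array([alpha0]), np.array([beta0])
    for t in range(1, T + 1):
        x = data[t - 1]
        # predictive probability of x under every run-length hypothesis
        scale = np.sqrt(beta * (kappa + 1) / (alpha * kappa))
        pred = student_t.pdf(x, df=2 * alpha, loc=mu, scale=scale)
        # growth message: the run continues (r_t = r_{t-1} + 1)
        R[t, 1:t + 1] = R[t - 1, :t] * pred * (1 - hazard)
        # changepoint message: all mass collapses to run length 0
        R[t, 0] = np.sum(R[t - 1, :t] * pred * hazard)
        R[t] /= R[t].sum()                 # normalize the run-length posterior
        # update sufficient statistics; prepend a fresh prior for r = 0
        mu_new = (kappa * mu + x) / (kappa + 1)
        beta_new = beta + kappa * (x - mu) ** 2 / (2 * (kappa + 1))
        mu = np.concatenate(([mu0], mu_new))
        kappa = np.concatenate(([kappa0], kappa + 1))
        alpha = np.concatenate(([alpha0], alpha + 0.5))
        beta = np.concatenate(([beta0], beta_new))
    return R
```

Each row `R[t]` is the exact run-length posterior after t observations; a changepoint shows up as the posterior mass collapsing to run length 0, which then feeds the scoring step.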
Approach
■ The run length is the primary factor in scoring
■ The simple assumption here is: "The likelihood of a data instance being anomalous is higher in a long run (larger run length) than in a short run (smaller run length)."
■ The anomaly score takes values in the range [0, 1]
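The slides do not give the exact scoring formula; one plausible sketch of the stated assumption, with a hypothetical decay constant `scale`, is:

```python
import math

def anomaly_score(run_length, cp_probability, scale=100.0):
    """Hypothetical score in [0, 1]: weight the changepoint probability by
    how long the interrupted run was -- a break in a long, stable run is
    more suspicious than one ending a short run."""
    assert 0.0 <= cp_probability <= 1.0
    return cp_probability * (1.0 - math.exp(-run_length / scale))
```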
Dataset (1/2)
■ The dataset used in this work is the Unified LEOtrace® Dataset (ULD)
■ The ULD combines data from mobile and desktop devices
■ LEOtrace® is GfK's operational platform for user-centric online behavioral and attitudinal measurement on the panelists' PCs
■ The software was developed five years ago and currently tracks more than 80k members in several countries
■ The ULD has 11 tables (sections) based on the type of data
■ There are 113 attributes in the dataset
Dataset (2/2)
■ The algorithm was applied to three different datasets:
  ❑ Synthetic dataset
  ❑ Well-log dataset
  ❑ ULD dataset
    ■ Amazon UK page impressions – United Kingdom panel
    ■ Total number of events – Singapore panel
    ■ Total number of events – United Kingdom panel
Evaluation
■ The following methods were used on the labeled validation dataset:
  ❑ Confusion matrix
  ❑ ROC curve:
    ■ Set a score threshold δ
    ■ Prepare the labeled and predicted data instances
    ■ Compute the TPR and FPR by comparing true and predicted labels (sequentially)
    ■ Plot the FPR on the x-axis and the TPR on the y-axis and calculate the AUC
    ■ Iterate the process to find an optimal threshold
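The steps above can be sketched as follows (function names are illustrative):

```python
def roc_points(scores, labels, thresholds):
    """Sweep the score threshold delta and collect (FPR, TPR) pairs."""
    points = []
    for delta in thresholds:
        pred = [s >= delta for s in scores]
        tp = sum(p and l for p, l in zip(pred, labels))
        fp = sum(p and not l for p, l in zip(pred, labels))
        fn = sum(not p and l for p, l in zip(pred, labels))
        tn = sum(not p and not l for p, l in zip(pred, labels))
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        points.append((fpr, tpr))
    return points

def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```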
Results
■ Synthetic dataset (with varying mean)
Results
■ Well-log dataset
  ❑ Used by the original authors to detect changepoints
Results
■ ULD Dataset-1: page impressions in the UK
■ Anomalies:
  ❑ A sudden increase in the number of page impressions for Amazon in the United Kingdom
  ❑ Fraudulent users inflated the number of unique users
■ Score threshold: 0.9
Results
■ ULD Dataset-2: number of events in Singapore
■ Anomalies:
  ❑ A single household (user) caused the anomalies (at points 145, 160 and 210) in the total number of records in Singapore
  ❑ The user was watching a video advertisement which generated a massive number of referrals
■ Score threshold: 0.8
Results
■ ULD Dataset-3: total number of events in the UK
■ Anomalies:
  ❑ The anomaly (at point 630) was caused by an error in the tracklet configuration (no HTML5 event for YouTube)
  ❑ Another anomaly (at point 430) was missed (not reported)
■ Score threshold: 0.8
Results
■ Receiver Operating Characteristic (ROC) curve
Results
■ Various evaluation measures
■ The algorithm took 0.15 seconds for a total of 1101 data instances on a Macintosh machine with 16 GB RAM and a Core i7 processor
Conclusion
■ The algorithm works reliably on various datasets, i.e., the synthetic, well-log and ULD datasets
■ The algorithm estimates the exact posterior probabilities for the current run length
■ Changepoints are potential candidates for anomalies, but not every changepoint is an anomaly
■ The scoring method assigns a score to each data instance based on the current run length
■ The anomaly score is assigned in the range [0, 1]
■ The algorithm is highly modular and allows different components to be plugged in
■ The algorithm can be improved in terms of:
  ❑ Prior parameter estimation
  ❑ Run-length pruning (e.g., discard data instances whose run-length probability is below 10⁻⁴)
  ❑ Dynamic assignment of the score threshold
Future Work
■ The proposed framework comprises three components:
  ❑ Anomaly Detection component
  ❑ Human Validation component
  ❑ Machine Learning component
■ The machine learning component will be the focus of future work
■ It will provide feedback to reduce the FPR and FNR
■ A supervised learning method can be used for the first prototype:
  ❑ Support Vector Machine
  ❑ Artificial Neural Network
References
■ Adams, R. P., & MacKay, D. J. C. (2007). Bayesian online changepoint detection. arXiv preprint arXiv:0710.3742.
■ Jaynes, E. T. (1986). Bayesian methods: General background.
■ Evans, M., Hastings, N., & Peacock, B. (2000). Statistical distributions.
■ Murphy, K. P. (2007). Conjugate Bayesian analysis of the Gaussian distribution.
■ Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys, 41(3), 15.
■ Aggarwal, C. C. (2013). An introduction to outlier analysis. In Outlier Analysis (pp. 1-40). Springer New York.
■ Turner, R., Saatci, Y., & Rasmussen, C. E. (2009). Adaptive sequential Bayesian change point detection. In Temporal Segmentation Workshop at NIPS.
■ Hawkins, D. M. (1980). Identification of outliers (Vol. 11). London: Chapman and Hall.