Anomaly Detection in Data Streams

Fakultät für Elektrotechnik und Informatik
Institut für Verteilte Systeme
Fachgebiet Wissensbasierte Systeme (KBS)

Anomaly Detection in Data Streams

By,

Amit Amit

Supervised by: Prof. Dr. Eirini Ntoutsi, Prof. Dr. Wolfgang Nejdl, Dr. Thomas Risse

June, 2017


Outline

■ Motivation
■ Introduction to Anomaly Detection
■ Related work
■ Approach
■ Dataset
■ Results
■ Conclusion


Motivation & Problem Statement

■ GfK (Gesellschaft für Konsumforschung) collects media usage data as streams
■ The usage data contains abnormal behavior, which distorts market predictions
■ The problem is currently addressed by manual quality checks, which are:
  ❑ Neither efficient nor cost-effective
  ❑ Not suitable for real-time analytics
  ❑ Prone to a significantly high number of false negatives (missed detections)
■ An anomaly in GfK data streams can stem from one of the following:
  ❑ Bugs in the measurement methodology
  ❑ Changes in the measured media outlet
  ❑ An intended change of user behavior
■ Objective:
  ❑ Find multiple probabilistic models to identify anomalies, evaluate their robustness, and show that relevant non-trivial anomalies can be identified automatically


Anomaly Detection – At a glance

■ An anomaly is a pattern in the data that does not conform to the expected behavior
■ Depending on the context, an anomaly is known by several names:
  ❑ Outlier
  ❑ Deviant
  ❑ Changepoint
  ❑ Fraud
  ❑ Discordant
  ❑ Failure
  ❑ Fault
  ❑ Novelty
■ Hawkins: “an outlier is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.”
■ Examples: credit card fraud, network intrusion, an anomalous MRI image, etc.


Types of Anomalies

■ Point Anomaly
  ❑ A single data instance is considered anomalous
  ❑ Ex: the temperature on a particular day vs. the normal temperature
■ Collective Anomaly
  ❑ A sequence of related data instances is anomalous as a whole
  ❑ Ex: indicators of a crime scene (traffic jam, emergency calls, etc.)
■ Contextual Anomaly
  ❑ A data instance is anomalous only with respect to a specific context
  ❑ Ex: yearly rainfall


Taxonomy - Anomaly Detection Approaches (1/2)

■ Supervised Anomaly Detection
  ❑ Models are built around two labeled classes: normal vs. anomalous
■ Semi-supervised Anomaly Detection
  ❑ Labels are available only for the normal class
■ Unsupervised Anomaly Detection
  ❑ No labeling information is available
  ❑ Detection is based on hidden structures in the data


Taxonomy – State-of-the-art Analysis (2/2)

■ Proximity-based
  ❑ k-th Nearest Neighbour (k-NN), Local Outlier Factor (LOF)
■ Clustering-based
  ❑ K-means
■ Classification-based
  ❑ Support Vector Machine (SVM)
■ Probabilistic and statistical
  ❑ Parametric techniques
    ■ 3-sigma rule, Student's t-test & Hotelling's T² test
  ❑ Non-parametric techniques
    ■ Kernel-function based, histogram based
■ High-dimensional search based
  ❑ Selecting High-Contrast Subspaces (HiCS)
■ Information-theoretic
  ❑ Entropy-based


Related work

■ This research builds on Adams and MacKay [2007], “Bayesian Online Changepoint Detection”
  ❑ The original work addresses changepoint detection in data streams
■ There are many other contributions to Bayesian inference for changepoint detection:
  ❑ Barry and Hartigan [1993] (offline, retrospective)
  ❑ Stephens [1994] (offline, retrospective)
  ❑ Sharifzadeh et al. [2005] (offline, retrospective)
  ❑ Jervis and Jardine [1999] (online)
  ❑ Ruanaidh et al. [1994] (online)
  ❑ Adams and MacKay [2007] (online)


Proposed framework

❑ The framework consists of three components:
  ■ Anomaly Detection
  ■ Human Validation
  ■ Machine Learning (ML)
❑ The focus of this research is the anomaly detection component
❑ The ML component provides feedback to the anomaly detector
❑ A human annotator validates the accuracy of the detections


Approach

■ Bayesian Inference
■ Formal definition (see the relations below):
  ❑ x is a data point (or vector)
  ❑ θ is the distribution parameter
  ❑ α is the hyperparameter of the parameters
  ❑ X is the set of observed data {x1, x2, ..., xn}
  ❑ x̃ is a new data point whose distribution is to be predicted
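With these symbols, the standard Bayesian relations that the approach relies on can be written out; this is a reconstruction in common notation (posterior and posterior predictive), not a copy of the slide's own equations:

    p(\theta \mid X, \alpha) \;=\; \frac{p(X \mid \theta)\, p(\theta \mid \alpha)}{p(X \mid \alpha)} \;\propto\; p(X \mid \theta)\, p(\theta \mid \alpha)

    p(\tilde{x} \mid X, \alpha) \;=\; \int p(\tilde{x} \mid \theta)\, p(\theta \mid X, \alpha)\, d\theta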


Approach

■ A synthetic data stream is segmented by changes in its mean
■ The partitions are described by the run length rt, i.e., the time since the last changepoint (see the notation below)
■ The run length rt drops to zero when a changepoint is detected
■ The message-passing algorithm shows how the probability mass is passed upwards as the run length grows
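In the notation of Adams and MacKay [2007], which the following slides rely on, the run length and the data belonging to the current run are:

    r_t \;=\; \begin{cases} 0 & \text{if a changepoint occurs at time } t, \\ r_{t-1} + 1 & \text{otherwise,} \end{cases} \qquad x_t^{(r)} \;:=\; \text{the observations of the current run of length } r_t.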


Approach

■ The marginal predictive distribution averages the predictive distribution conditioned on each run length over the run-length posterior
■ The run-length posterior is obtained by normalizing the joint distribution of run length and observed data (see the equations below)
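Following Adams and MacKay [2007], the two quantities referred to above are:

    P(x_{t+1} \mid x_{1:t}) \;=\; \sum_{r_t} P(x_{t+1} \mid r_t, x_t^{(r)})\, P(r_t \mid x_{1:t})

    P(r_t \mid x_{1:t}) \;=\; \frac{P(r_t, x_{1:t})}{P(x_{1:t})} \;=\; \frac{P(r_t, x_{1:t})}{\sum_{r_t} P(r_t, x_{1:t})}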


Approach

■ The joint distribution is essentially the quantity prior × likelihood, computed recursively over the previous run length (see the recursion below)
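The recursion behind this statement, as given by Adams and MacKay [2007], where the changepoint prior plays the role of the prior and the per-run predictive plays the role of the likelihood:

    P(r_t, x_{1:t}) \;=\; \sum_{r_{t-1}} \underbrace{P(r_t \mid r_{t-1})}_{\text{prior}}\; \underbrace{P(x_t \mid r_{t-1}, x_t^{(r)})}_{\text{likelihood}}\; P(r_{t-1}, x_{1:t-1})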


Approach

■ The probability mass for the changepoint prior is given by a hazard function (see below), where
  ❑ h(x) is the hazard function
  ❑ S(x) is the survival function
  ❑ F(x) is the cumulative distribution function (cdf)
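The standard relations between these functions (with f denoting the gap density), together with the resulting changepoint prior from Adams and MacKay [2007], are:

    h(x) \;=\; \frac{f(x)}{S(x)}, \qquad S(x) \;=\; 1 - F(x), \qquad F(x) \;=\; \int_{-\infty}^{x} f(u)\, du

    P(r_t \mid r_{t-1}) \;=\; \begin{cases} H(r_{t-1}+1) & \text{if } r_t = 0, \\ 1 - H(r_{t-1}+1) & \text{if } r_t = r_{t-1}+1, \\ 0 & \text{otherwise.} \end{cases}

For a memoryless (geometric) gap prior the hazard is constant, H(\tau) = 1/\lambda.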


Approach

■ In the case rt = rt−1 + 1 (no changepoint), the run grows and the joint probability follows the growth recursion below, where the marginal likelihood (evidence) term is the predictive probability of the new datum given the data of the current run
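A reconstruction of the growth step and of the evidence term it uses, following Adams and MacKay [2007]:

    P(r_t = r_{t-1}+1,\, x_{1:t}) \;=\; P(r_{t-1}, x_{1:t-1})\; P(x_t \mid r_{t-1}, x_t^{(r)})\; \bigl(1 - H(r_{t-1}+1)\bigr)

    \text{where the marginal likelihood (evidence) is} \quad P(x_t \mid r_{t-1}, x_t^{(r)}) \;=\; \int P(x_t \mid \theta)\, P(\theta \mid r_{t-1}, x_t^{(r)})\, d\theta.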


Approach

❑ Algorithm steps (see the sketch below):
  ■ Initialize the boundary conditions
  ■ Calculate the posterior distribution of the run length rt
  ■ Perform the prediction for the new datum
  ■ Update the parameters and hyperparameters
  ■ Calculate the anomaly scores
❑ Data modeling
  ■ Exponential family (allows closed-form conjugate updates of the predictive distribution)
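Purely as an illustration of these steps (not the thesis's implementation), a minimal BOCPD loop for a Gaussian stream with unknown mean and known variance is sketched below; the constant hazard rate, the prior parameters mu0/var0/var_x and the synthetic test stream are assumptions made for the example.

    # Minimal sketch of Bayesian Online Changepoint Detection (Adams & MacKay, 2007)
    # for a Gaussian stream with unknown mean and known variance -- an illustrative
    # exponential-family member, not the thesis's exact model or code.
    import numpy as np
    from scipy import stats

    def bocpd(data, hazard=1.0 / 250, mu0=0.0, var0=10.0, var_x=1.0):
        T = len(data)
        # R[r, t] = P(run length = r | x_1..x_t); column 0 holds the prior (run length 0).
        R = np.zeros((T + 1, T + 1))
        R[0, 0] = 1.0
        # Posterior parameters of the mean, one entry per possible run length.
        mu, var = np.array([mu0]), np.array([var0])
        for t, x in enumerate(data, start=1):
            # Predictive probability of x under each run length (Gaussian predictive).
            pred = stats.norm.pdf(x, loc=mu, scale=np.sqrt(var + var_x))
            # Growth probabilities: the run continues, no changepoint.
            R[1:t + 1, t] = R[:t, t - 1] * pred * (1 - hazard)
            # Changepoint probability: mass from all runs collapses to run length 0.
            R[0, t] = np.sum(R[:t, t - 1] * pred * hazard)
            R[:, t] /= np.sum(R[:, t])          # normalize the run-length posterior
            # Conjugate update of the Gaussian posterior for each grown run,
            # prepending the reset prior for run length 0.
            new_var = 1.0 / (1.0 / var + 1.0 / var_x)
            new_mu = new_var * (mu / var + x / var_x)
            mu = np.concatenate(([mu0], new_mu))
            var = np.concatenate(([var0], new_var))
        return R

    # Example: a mean shift at t = 100 shows up as the run-length posterior resetting.
    stream = np.concatenate([np.random.normal(0, 1, 100), np.random.normal(5, 1, 100)])
    R = bocpd(stream)
    print(np.argmax(R[:, -1]))  # most probable current run length after the last point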


Approach

❑ The run length is the primary factor for scoring
❑ The simple assumption is: “The likelihood of a data instance being anomalous is higher in a long run (larger run length) than in a short run (smaller run length).”
❑ The anomaly score takes values in [0, 1] (an illustrative rule is sketched below)
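The slide does not give the concrete scoring formula. As a clearly hypothetical illustration of how such a rule could combine the two ingredients above (the surprise of the new datum and how established the current run is), one might write:

    # Hypothetical scoring rule, NOT the thesis's definition: weight the improbability
    # of the new datum by the normalized expected run length, so that surprising points
    # inside long, well-established runs score close to 1.
    import numpy as np

    def anomaly_score(tail_prob, runlength_posterior, t):
        # tail_prob: two-sided tail probability of x_t under the predictive distribution
        #            (small when the datum is surprising), assumed to lie in [0, 1]
        # runlength_posterior: P(r_t = r | x_1..x_t) for r = 0..t
        expected_run = np.dot(np.arange(len(runlength_posterior)), runlength_posterior)
        return float((1.0 - tail_prob) * (expected_run / max(t, 1)))  # value in [0, 1]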


Dataset (1/2)

■ The dataset used in this research is the Unified LEOtrace® Dataset (ULD)
■ The ULD combines data from mobile and desktop devices
■ LEOtrace® is GfK’s operational platform for user-centric online behavioral and attitudinal measurement on the panelists’ PCs
■ The software was developed five years ago and currently tracks more than 80k panel members in several countries
■ The ULD has 11 tables (sections), organized by type of data
■ The dataset has 113 attributes


Dataset (2/2)

■ The algorithm was applied to three different datasets:
  ❑ Synthetic dataset
  ❑ Well-log dataset
  ❑ ULD dataset
    ■ Amazon UK page impressions – United Kingdom panel
    ■ Total number of events – Singapore panel
    ■ Total number of events – United Kingdom panel


Evaluation

■ The following methods were used on a labeled validation dataset:
  ❑ Confusion matrix
  ❑ ROC curve
    ■ Set a score threshold δ
    ■ Prepare the labeled and predicted data instances
    ■ Get the TPR and FPR by comparing true and predicted labels (sequentially)
    ■ Plot the FPR on the x-axis and the TPR on the y-axis and calculate the AUC
    ■ Iterate the process to find an optimal threshold (see the sketch below)
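As an illustration of this procedure (not the thesis's evaluation code), the threshold sweep, TPR/FPR computation and AUC can be done with scikit-learn; the labels and scores below are placeholder values:

    # Sketch of the ROC evaluation: sweep the score threshold over the anomaly scores,
    # compute TPR/FPR against the labels, and report the AUC.
    import numpy as np
    from sklearn.metrics import roc_curve, auc

    y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])                      # 1 = labeled anomaly
    scores = np.array([0.1, 0.3, 0.85, 0.2, 0.95, 0.4, 0.05, 0.7])   # anomaly scores in [0, 1]

    fpr, tpr, thresholds = roc_curve(y_true, scores)   # one (FPR, TPR) point per threshold
    print("AUC:", auc(fpr, tpr))
    # Pick the threshold maximizing TPR - FPR (closest to the top-left corner) as "optimal".
    best = thresholds[np.argmax(tpr - fpr)]
    print("Optimal threshold:", best)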

Results

❑ Synthetic dataset
  ■ Stream with a varying mean


Results

❑ Well-log dataset
  ■ Used by the original authors to detect changepoints


Results

❑ ULD Dataset-1: Amazon page impressions in the UK
  ■ Anomalies:
    ❑ A sudden increase in the number of page impressions for Amazon in the United Kingdom
    ❑ Fraudulent users increased the number of unique users
  ■ Score threshold: 0.9


Results

❑ ULD Dataset-2: Number of events in Singapore
  ■ Anomalies:
    ❑ A single household (user) caused the anomalies (at data points 145, 160 and 210) in the total number of records for Singapore
    ❑ The user was watching a video advertisement that generated a massive number of referrals
  ■ Score threshold: 0.8


Results

❑ ULD Dataset-3: Total number of events in the UK
  ■ Anomalies:
    ❑ The anomaly at data point 630 was caused by an error in the tracklet configuration (no HTML5 event for YouTube)
    ❑ Another anomaly, at data point 430, was missed (not reported)
  ■ Score threshold: 0.8


Results

❑ Receiver Operating Characteristic (ROC) curve


Results

❑ Various evaluation measures
❑ The algorithm took 0.15 seconds for a total of 1101 data instances on a Macintosh machine with 16 GB RAM and a Core i7 processor


Conclusion

■ The algorithm works reliably on the various datasets, i.e., the Synthetic, Well-log and ULD datasets
■ The algorithm estimates the exact posterior probabilities of the current run length
■ Changepoints are potential candidates for anomalies, but a changepoint is not necessarily an anomaly
■ The scoring method assigns a score to each data instance based on the current run length
■ The anomaly score lies in the range [0, 1]
■ The algorithm is highly modular and allows different components to be plugged in
■ The algorithm can be improved in terms of:
  ❑ Prior parameter estimation
  ❑ Run-length pruning (e.g., discard run lengths whose probability falls below 10⁻⁴)
  ❑ Dynamic assignment of the score threshold


Future Work

■ The proposed framework comprises three components:
  ❑ Anomaly Detection component
  ❑ Human Validation component
  ❑ Machine Learning component
■ The Machine Learning component will be the focus of future work
■ It will provide feedback to reduce the FPR and FNR
■ A supervised learning method can be used for a first prototype:
  ❑ Support Vector Machine
  ❑ Artificial Neural Network



References

■ Adams, R. P., & MacKay, D. J. (2007). Bayesian online changepoint detection. arXiv preprint arXiv:0710.3742.
■ Jaynes, E. T. (1986). Bayesian methods: General background.
■ Evans, M., Hastings, N., & Peacock, B. (2000). Statistical distributions.
■ Murphy, K. P. (2007). Conjugate Bayesian analysis of the Gaussian distribution. Technical note.
■ Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3), 15.
■ Aggarwal, C. C. (2013). An introduction to outlier analysis. In Outlier Analysis (pp. 1–40). Springer New York.
■ Turner, R., Saatci, Y., & Rasmussen, C. E. (2009). Adaptive sequential Bayesian change point detection. In Temporal Segmentation Workshop at NIPS.
■ Hawkins, D. M. (1980). Identification of outliers (Vol. 11). London: Chapman and Hall.

