Fakultät für Elektrotechnik und Informatik
Institut für Verteilte Systeme
Fachgebiet Wissensbasierte Systeme (KBS)

Anomaly Detection in Data Streams

By
Amit Amit

Supervised by: Prof. Dr. Eirini Ntoutsi, Prof. Dr. Wolfgang Nejdl, Dr. Thomas Risse

June 2017
Outline
■ Motivation
■ Introduction to Anomaly Detection
■ Related work
■ Approach
■ Dataset
■ Results
■ Conclusion
Motivation & Problem Statement
■ GfK (Gesellschaft für Konsumforschung) collects media usage data (streams)
■ The usage data contains abnormal behavior, which distorts market predictions
■ Currently, the problem is addressed by manual quality checks:
  ❑ Neither efficient nor cost-effective
  ❑ Not suitable for real-time analytics
  ❑ The number of false negatives (missed detections) is significantly high
■ An anomaly in GfK data streams can be one of the following:
  ❑ Bugs in the measurement methodology
  ❑ Changes in the measured media outlet
  ❑ Intended changes of behavior
■ Objective:
  ❑ To build multiple probabilistic models that identify anomalies, evaluate their robustness, and show that relevant, non-trivial anomalies can be identified automatically
Anomaly Detection – At a glance
■ An anomaly is a pattern in the data which does not conform to the expected behavior
■ An anomaly is known by several names, depending on the context:
  ❑ Outlier, Deviant, Changepoint, Fraud, Discordant, Failure, Fault, Novelty
■ Hawkins: "An outlier is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism."
■ Examples: credit-card fraud, network intrusion, anomalous MRI images, etc.
Types of Anomalies
■ Point Anomaly
  ❑ A single data instance is considered to be an anomaly
■ Collective Anomaly
  ❑ A sequence of data instances (with a certain relation among them)
■ Contextual Anomaly
  ❑ A data instance that is anomalous only with respect to a specific context
■ Examples:
  ❑ Temperature on a particular day vs. the normal temperature
  ❑ Crime scene (traffic jam, emergency calls, etc.)
  ❑ Yearly rainfall in the figure below
Taxonomy – Anomaly Detection Approaches (1/2)
■ Supervised Anomaly Detection
  ❑ Models are built around two labeled classes: normal class vs. anomalous class
■ Semi-supervised Anomaly Detection
  ❑ Labels are available only for the normal class
■ Unsupervised Anomaly Detection
  ❑ No labeling information is available
  ❑ Based on the hidden structures in the data
Taxonomy – State-of-the-art Analysis (2/2)
■ Proximity-based
  ❑ kth Nearest Neighbour (k-NN), Local Outlier Factor (LOF)
■ Clustering-based
  ❑ K-means
■ Classification-based
  ❑ Support Vector Machine (SVM)
■ Probabilistic and statistical based
  ❑ Parametric techniques
    ■ 3-sigma rule, Student's t-test & Hotelling's t² test
  ❑ Non-parametric techniques
    ■ Kernel-function based, histogram based
■ High-dimensional search based
  ❑ Selecting High-Contrast Subspaces (HiCS)
■ Information-theoretic based
  ❑ Entropy-based
Related work
■ This research builds on Adams and MacKay [2007], "Bayesian Online Changepoint Detection"
■ The original work addresses changepoint detection in data streams
■ There are many other contributions to Bayesian inference for changepoint detection:
  ❑ Barry and Hartigan [1993] (offline, retrospective)
  ❑ Stephens [1994] (offline, retrospective)
  ❑ Sharifzadeh et al. [2005] (offline, retrospective)
  ❑ Jervis and Jardine [1999] (online)
  ❑ Ruanaidh et al. [1994] (online)
  ❑ Adams and MacKay [2007] (online)
Proposed framework
■ Consists of three components:
  ❑ Anomaly Detection
  ❑ Human Validation
  ❑ Machine Learning (ML)
■ The focus of this research is the anomaly detection component
■ The ML component provides feedback to the anomaly detector
■ A human annotator validates the accuracy of the detection
Approach
■ Bayesian Inference
  ❑ Formal definition:
    ■ x is a data point (or vector)
    ■ θ is the distribution parameter
    ■ α is the hyper-parameter of the parameters
    ■ X is the set of observed data {x1, x2, ..., xn}
    ■ x̃ is a new data point (whose distribution is to be predicted)
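In terms of the quantities defined above, Bayes' theorem and the posterior predictive distribution read:

```latex
% Posterior over the parameter, given the observed data and hyper-parameter
p(\theta \mid X, \alpha) =
  \frac{p(X \mid \theta)\, p(\theta \mid \alpha)}
       {\int p(X \mid \theta')\, p(\theta' \mid \alpha)\, d\theta'}

% Posterior predictive for a new data point \tilde{x}
p(\tilde{x} \mid X, \alpha) =
  \int p(\tilde{x} \mid \theta)\, p(\theta \mid X, \alpha)\, d\theta
```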
Approach
■ A synthetic data stream is segmented into partitions of differing mean
■ The partitions are denoted by the run length r_t
■ The run length r_t drops to zero when a changepoint is detected
■ The message-passing algorithm shows the probability mass being passed upwards
Approach
■ The marginal predictive distribution:
■ The posterior distribution of the run length can be calculated as:
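In Adams and MacKay [2007], these two quantities are:

```latex
% Marginal predictive distribution: average the model's prediction
% over the run-length posterior
P(x_{t+1} \mid x_{1:t}) =
  \sum_{r_t} P(x_{t+1} \mid r_t, x_t^{(r)})\, P(r_t \mid x_{1:t})

% Run-length posterior, obtained by normalizing the joint distribution
P(r_t \mid x_{1:t}) =
  \frac{P(r_t, x_{1:t})}{\sum_{r_t} P(r_t, x_{1:t})}
```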
Approach
■ The joint distribution is essentially the quantity prior × likelihood
Approach
■ The probability mass for the changepoint prior (hazard function), where:
  ❑ h(x) is the hazard function
  ❑ S(x) is the survival function
  ❑ F(x) is the cumulative distribution function (CDF)
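The standard relations between these quantities, together with the constant hazard that results from a memoryless (geometric) run-length prior with timescale λ, are:

```latex
h(x) = \frac{f(x)}{S(x)}, \qquad
S(x) = 1 - F(x), \qquad
F(x) = \int_{-\infty}^{x} f(u)\, du

% For a geometric run-length prior with timescale \lambda,
% the hazard function is constant:
H(\tau) = 1/\lambda
```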
Approach
■ In the case r_t = r_{t-1} + 1 (i.e., no changepoint), the growth probability is obtained recursively
■ The marginal likelihood (evidence) of the new datum appears as a factor in this recursion
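Following Adams and MacKay [2007], the growth and changepoint messages of the recursion are:

```latex
% Growth probability: the run continues, r_t = r_{t-1} + 1
P(r_t = r_{t-1}+1,\, x_{1:t}) =
  P(r_{t-1},\, x_{1:t-1})\; \pi_t^{(r)}\; \bigl(1 - H(r_{t-1})\bigr)

% Changepoint probability: the run length drops to zero
P(r_t = 0,\, x_{1:t}) =
  \sum_{r_{t-1}} P(r_{t-1},\, x_{1:t-1})\; \pi_t^{(r)}\; H(r_{t-1})

% where the marginal likelihood (evidence) of x_t is
\pi_t^{(r)} = P\bigl(x_t \mid r_{t-1},\, x_t^{(r)}\bigr)
```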
Approach
■ Algorithm outline:
  ❑ Boundary conditions (initialization)
  ❑ Calculate the posterior distribution of the run length r_t
  ❑ Perform prediction for the new datum
  ❑ Update the parameters and hyper-parameters
  ❑ Calculate the anomaly scores
■ Data modeling:
  ❑ Exponential family
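A minimal sketch of these steps, assuming a Gaussian data model with unknown mean and variance (a member of the exponential family, with conjugate Normal-Gamma prior and Student-t predictive) and a constant hazard rate; the function and parameter names are illustrative, not taken from the thesis:

```python
import numpy as np
from scipy.stats import t as student_t

def bocpd(data, hazard=1.0 / 250, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Bayesian online changepoint detection (after Adams & MacKay, 2007).
    Gaussian data with unknown mean/variance -> Normal-Gamma conjugate
    prior, so the predictive for each run length is a Student-t."""
    T = len(data)
    R = np.zeros((T + 1, T + 1))           # R[t, r] = P(run length r | x_1:t)
    R[0, 0] = 1.0                          # boundary condition: run starts at 0
    mu, kappa = np.array([mu0]), np.array([kappa0])
    alpha, beta = np.array([alpha0]), np.array([beta0])
    for t in range(1, T + 1):
        x = data[t - 1]
        # predictive probability of x under every run-length hypothesis
        scale = np.sqrt(beta * (kappa + 1) / (alpha * kappa))
        pred = student_t.pdf(x, df=2 * alpha, loc=mu, scale=scale)
        # growth message: the run continues (r_t = r_{t-1} + 1)
        R[t, 1:t + 1] = R[t - 1, :t] * pred * (1 - hazard)
        # changepoint message: all mass collapses to run length 0
        R[t, 0] = np.sum(R[t - 1, :t] * pred * hazard)
        R[t] /= R[t].sum()                 # normalize the run-length posterior
        # update sufficient statistics; prepend a fresh prior for r = 0
        mu_new = (kappa * mu + x) / (kappa + 1)
        beta_new = beta + kappa * (x - mu) ** 2 / (2 * (kappa + 1))
        mu = np.concatenate(([mu0], mu_new))
        kappa = np.concatenate(([kappa0], kappa + 1))
        alpha = np.concatenate(([alpha0], alpha + 0.5))
        beta = np.concatenate(([beta0], beta_new))
    return R
```

Each row `R[t]` is the exact run-length posterior after t observations; a changepoint shows up as the posterior mass collapsing to run length 0, which then feeds the scoring step.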
Approach
■ The run length is the primary factor in scoring
■ The simple assumption here is: "The likelihood of a data instance being anomalous is higher in a long run (larger run length) than in a short run (smaller run length)."
■ The anomaly score takes values in the range [0, 1]
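The slides do not give the exact scoring formula; one plausible sketch of the stated assumption, with a hypothetical decay constant `scale`, is:

```python
import math

def anomaly_score(run_length, cp_probability, scale=100.0):
    """Hypothetical score in [0, 1]: weight the changepoint probability by
    how long the interrupted run was -- a break in a long, stable run is
    more suspicious than one ending a short run."""
    assert 0.0 <= cp_probability <= 1.0
    return cp_probability * (1.0 - math.exp(-run_length / scale))
```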
Dataset (1/2)
■ The dataset used in this work is the Unified LEOtrace® Dataset (ULD)
■ The ULD combines data from mobile and desktop devices
■ LEOtrace® is GfK's operational platform for user-centric online behavioral and attitudinal measurement on the panelists' PCs
■ The software was developed five years ago and currently tracks more than 80k members in several countries
■ The ULD has 11 tables (sections) based on the type of data
■ There are 113 attributes in the dataset
Dataset (2/2)
■ The algorithm was applied to three different datasets:
  ❑ Synthetic dataset
  ❑ Well-log dataset
  ❑ ULD dataset
    ■ Amazon UK page impressions – United Kingdom panel
    ■ Total number of events – Singapore panel
    ■ Total number of events – United Kingdom panel
Evaluation
■ The following methods were used on the labeled validation dataset:
  ❑ Confusion matrix
  ❑ ROC curve:
    ■ Set a score threshold δ
    ■ Prepare the labeled and predicted data instances
    ■ Compute the TPR and FPR by comparing true and predicted labels (sequentially)
    ■ Plot the FPR on the x-axis and the TPR on the y-axis and calculate the AUC
    ■ Iterate the process to find an optimal threshold
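The steps above can be sketched as follows (function names are illustrative):

```python
def roc_points(scores, labels, thresholds):
    """Sweep the score threshold delta and collect (FPR, TPR) pairs."""
    points = []
    for delta in thresholds:
        pred = [s >= delta for s in scores]
        tp = sum(p and l for p, l in zip(pred, labels))
        fp = sum(p and not l for p, l in zip(pred, labels))
        fn = sum(not p and l for p, l in zip(pred, labels))
        tn = sum(not p and not l for p, l in zip(pred, labels))
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        points.append((fpr, tpr))
    return points

def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```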
Results
■ Synthetic dataset (with varying mean)
Results
■ Well-log dataset
  ❑ Used by the original authors to detect changepoints
Results
■ ULD Dataset-1: page impressions in the UK
■ Anomalies:
  ❑ A sudden increase in the number of page impressions for Amazon in the United Kingdom
  ❑ Fraudulent users inflated the number of unique users
■ Score threshold: 0.9
Results
■ ULD Dataset-2: number of events in Singapore
■ Anomalies:
  ❑ A single household (user) caused the anomalies (at points 145, 160 and 210) in the total number of records in Singapore
  ❑ The user was watching a video advertisement which generated a massive number of referrals
■ Score threshold: 0.8
Results
■ ULD Dataset-3: total number of events in the UK
■ Anomalies:
  ❑ The anomaly (at point 630) was caused by an error in the tracklet configuration (no HTML5 event for YouTube)
  ❑ Another anomaly (at point 430) was missed (not reported)
■ Score threshold: 0.8
Results
■ Receiver Operating Characteristic (ROC) curve
Results
■ Various evaluation measures
■ The algorithm took 0.15 seconds for a total of 1101 data instances on a Macintosh machine with 16 GB RAM and a Core i7 processor
Conclusion
■ The algorithm works reliably on various datasets, i.e., the synthetic, well-log and ULD datasets
■ The algorithm estimates the exact posterior probabilities for the current run length
■ Changepoints are potential candidates for anomalies, but not every changepoint is an anomaly
■ The scoring method assigns a score to each data instance based on the current run length
■ The anomaly score is assigned in the range [0, 1]
■ The algorithm is highly modular and allows different components to be plugged in
■ The algorithm can be improved in terms of:
  ❑ Prior parameter estimation
  ❑ Run-length pruning (e.g., discard data instances whose run-length probability is below 10⁻⁴)
  ❑ Dynamic assignment of the score threshold
Future Work
■ The proposed framework comprises three components:
  ❑ Anomaly Detection component
  ❑ Human Validation component
  ❑ Machine Learning component
■ The machine learning component will be the focus of future work
■ It will provide feedback to reduce the FPR and FNR
■ A supervised learning method can be used for the first prototype:
  ❑ Support Vector Machine
  ❑ Artificial Neural Network
References
■ Adams, R. P., & MacKay, D. J. C. (2007). Bayesian online changepoint detection. arXiv preprint arXiv:0710.3742.
■ Jaynes, E. T. (1986). Bayesian methods: General background.
■ Evans, M., Hastings, N., & Peacock, B. (2000). Statistical distributions.
■ Murphy, K. P. (2007). Conjugate Bayesian analysis of the Gaussian distribution.
■ Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys, 41(3), 15.
■ Aggarwal, C. C. (2013). An introduction to outlier analysis. In Outlier Analysis (pp. 1-40). Springer New York.
■ Turner, R., Saatci, Y., & Rasmussen, C. E. (2009). Adaptive sequential Bayesian change point detection. In Temporal Segmentation Workshop at NIPS.
■ Hawkins, D. M. (1980). Identification of outliers (Vol. 11). London: Chapman and Hall.