OUTLIER DETECTION OVER DATA STREAMS

He Zengyou, Xu Xiaofei and Deng Shengchun
Department of Computer Science and Engineering, Harbin Institute of Technology, Post Code: 150001, Harbin, China
[email protected], [email protected], [email protected]
ABSTRACT

We study the discovery of outliers under the data stream model. The data stream model is relevant to new classes of applications involving massive datasets, such as web click stream analysis and detection of network intrusions. We adapt the frequent pattern based outlier detection method [31] to the data stream environment. Instead of finding the exact frequent patterns with multiple passes, we approximate them in a single pass, and the error in the estimated support of each frequent itemset is guaranteed not to exceed a user-specified parameter. For each incoming data point, an outlier factor is computed using the frequent patterns seen so far, and the top k outliers are output to the user.

KEYWORDS: Outlier Detection, Data Streams, Frequent Pattern, Data Mining

1. INTRODUCTION

For many recent applications, the concept of a data stream is more appropriate than that of a dataset. By nature, a stored dataset is an appropriate model when significant portions of the data are queried again and again, and updates are relatively infrequent. In contrast, a data stream is an appropriate model when a large volume of data arrives continuously and it is either unnecessary or impractical to store the data in some form of memory. Data streams are also appropriate as a model of access to large data sets stored in secondary memory, where performance requirements necessitate linear scans [17].

In the data stream model, data points can only be accessed in the order of their arrival, and random access is disallowed. Moreover, the space available for storing information is assumed to be small relative to the huge size of the unbounded stream of data points. Data mining algorithms over data streams are therefore restricted to a single pass over the data and to limited resources, which makes this a very challenging research field.

1.1 RELATED WORK ON MINING DATA STREAMS

There has been some initial work addressing data streams in the data mining community. These proposals tried to adapt traditional data mining techniques to the data stream model.

References [26,23,17] consider clustering in the data stream model; they extend the classical K-median and K-means clustering algorithms to this setting. References [18,19,20] focus on efficiently constructing decision trees and on ensemble classification in the data stream environment. Reference [30] presents an online classification system based on info-fuzzy networks.

Reference [29] discusses the problem of frequent pattern mining in data streams. The authors in [28] proposed algorithms for regression analysis of time-series data streams.

Reference [21] considers extracting information about customers from a stream of transactions and mining it in real time. Reference [22] proposes Hancock, a language for extracting signatures from data streams.

The authors in [27,24] address the problem of mining multiple data streams. Reference [27] develops algorithms for analyzing co-evolving time sequences to forecast future values and detect correlations. Reference [24] presents a collective approach to mining Bayesian networks from distributed, heterogeneous web log data streams.

Reference [25] identifies some key aspects of stream data mining algorithms and outlines a number of possible directions for future research.

1.2 RELATED WORK ON OUTLIER DETECTION

Data mining tasks can be classified into four general categories: (a) dependency detection (e.g. association rules), (b) class identification (e.g. classification, clustering), (c) class description (e.g. concept generalization), and (d) outlier/exception detection [1]. Most recent work has focused on the first three categories, which find patterns applicable to a considerable portion of the objects in a dataset. The fourth category, the outlier detection problem, focuses on identifying a relatively small part of the whole dataset. In some cases and for some applications, these "outliers" are more interesting than the common objects.

Recently, methods for finding such outliers in large datasets have been drawing increasing attention [1-13]. The statistics community conducted most of the previous studies on outlier mining [14]. These studies can be broadly classified into two categories. The first category is distribution-based, where a standard distribution is used to fit the dataset and outliers are defined with respect to the probability distribution. Yamanishi, Takeuchi and Williams [10] used a Gaussian mixture model to represent the normal behaviors; each datum is given a score based on changes in the model, and a high score indicates a high possibility of being an outlier. This approach has been combined with supervised learning to obtain general patterns for outliers [12].

Depth-based methods are the second category of outlier mining in statistics [15]. Based on some definition of depth, data objects are organized in convex hull layers in data space according to peeling depth, and outliers are expected to be found among the data objects with shallow depth values.

The distance-based outlier was introduced by Knorr and Ng [1]. A distance-based outlier in a dataset D is a data object such that at least pct% of the objects in D lie at a distance of more than d_min from it. This notion generalizes many concepts of the distribution-based approach and enjoys better computational complexity. It was further extended using the distance of a point from its kth nearest neighbor [2]: after ranking points by the distance to their kth nearest neighbor, the top k points are identified as outliers, and efficient algorithms for mining these top-k outliers are given. Alternatively, in the algorithm proposed by Angiulli and Pizzuti [11], the outlier factor of each data point is computed as the sum of the distances to its k nearest neighbors.
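As a concrete illustration of these two rankings, consider the following brute-force sketch (our own; the function name knn_outlier_scores, the toy points and the O(n^2) distance computation are illustrative assumptions, not the pruned and indexed algorithms of [2,11]):

    import numpy as np

    def knn_outlier_scores(points, k, method="kth"):
        """Score points by their k-nearest-neighbor distances.

        method="kth": distance to the kth nearest neighbor, as in [2].
        method="sum": sum of distances to the k nearest neighbors, as in [11].
        """
        points = np.asarray(points, dtype=float)
        diffs = points[:, None, :] - points[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=-1))  # pairwise Euclidean distances
        np.fill_diagonal(dists, np.inf)             # a point is not its own neighbor
        knn = np.sort(dists, axis=1)[:, :k]         # k smallest distances per point
        return knn[:, -1] if method == "kth" else knn.sum(axis=1)

    # Points with the largest scores are reported as the top-n outliers.
    pts = [[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [5, 5]]
    scores = knn_outlier_scores(pts, k=2)
    top_n = np.argsort(scores)[::-1][:1]            # -> [4], the isolated point

Note that both scores grow with a point's isolation; the two methods differ only in whether a single neighbor distance or an aggregate over k neighbors drives the ranking.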
Deviation-based techniques identify outliers by inspecting the characteristics of objects and consider an object that deviates from these features to be an outlier [9].

Breunig et al. [3] introduced the concept of the "local outlier". The outlier rank of a data object is determined by taking into account the clustering structure in a bounded neighborhood of the object, which is formalized as the "local outlier factor" (LOF). The LOCI method [13] further extended this density-based approach [3].

Clustering-based outlier detection techniques regard small clusters as outliers [5] or identify outliers by removing clusters from the original dataset [8]. The authors in [16] further extended existing clustering-based techniques by proposing the concept of the cluster-based local outlier, with a measure defined for the outlier-ness of each data object.

Aggarwal and Yu [4] discussed a new technique for outlier detection that finds outliers by observing the density distribution of projections of the data. That is, their definition considers a point to be an outlier if it lies in a local region of abnormally low density in some lower-dimensional projection.

The replicator neural network (RNN) is employed to detect outliers by Harkins, et al. [6]. The approach is based on the observation that a trained neural network will reconstruct some small number of individuals poorly, and these individuals can be considered outliers. The outlier factor used for ranking is the magnitude of the reconstruction error.

An interesting recent technique finds outliers by incorporating semantic knowledge such as the class labels of the data points [7]. In view of the class information, a semantic outlier is a data point that behaves differently from the other data points in the same class.

1.3 SUMMARY

As Sections 1.1 and 1.2 show, existing proposals on mining data streams do not address the problem of outlier detection, and the proposed outlier mining methods are not appropriate for the data stream model.

In this paper, we address the problem of detecting outliers in data streams. We begin by introducing the frequent pattern based outlier detection method [31]. In the sequel, it is extended to the data stream environment through frequent pattern approximation.

2. FREQUENT PATTERN BASED OUTLIER DETECTION METHOD

Agrawal's statement of the problem of discovering frequent itemsets in market basket databases is the following [32]. Let I = {i1, i2, …, im} be a set of m literals called items. Let the database D = {t1, t2, …, tn} be a set of n transactions, each consisting of a set of items from I. An itemset X is a non-empty subset of I. The length of an itemset X is the number of items it contains, and X is called a k-itemset if its length is k. A transaction t ∈ D is said to contain an itemset X if X ⊆ t. The support of an itemset X is the percentage of transactions in D containing X: support(X) = ||{t ∈ D | X ⊆ t}|| / ||{t ∈ D}||.

The problem of finding all frequent itemsets in D is then traditionally defined as follows: given a user-defined threshold minisupport on the permissible minimal support, find all itemsets with support greater than or equal to minisupport. Frequent itemsets are also called frequent patterns.
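To make these definitions concrete, the following minimal sketch (our own; the helper names support and frequent_patterns and the naive subset enumeration are assumptions, whereas practical miners such as Apriori prune the search but compute the same sets) evaluates them on a toy transaction database:

    from itertools import combinations

    def support(itemset, transactions):
        """Fraction of transactions containing every item of `itemset`."""
        itemset = frozenset(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def frequent_patterns(transactions, minisupport):
        """All itemsets with support >= minisupport, mapped to their support."""
        items = sorted(set().union(*transactions))
        fps = {}
        for size in range(1, len(items) + 1):
            for cand in combinations(items, size):
                s = support(cand, transactions)
                if s >= minisupport:
                    fps[frozenset(cand)] = s
        return fps

    # Toy database: 4 transactions over items {a, b, c}.
    D = [frozenset("ab"), frozenset("abc"), frozenset("ac"), frozenset("ab")]
    print(support("ab", D))           # 0.75
    print(frequent_patterns(D, 0.5))  # {a}:1.0, {b}:0.75, {c}:0.5, {a,b}:0.75, {a,c}:0.5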
From the viewpoint of knowledge discovery, frequent patterns reflect the "common patterns" that apply to many objects, or to a large percentage of the objects, in the dataset. In contrast, outlier detection focuses on a very small percentage of the data objects. Hence, the idea of using frequent patterns for outlier detection is very intuitive.

Definition 1: (FPOF, Frequent Pattern Outlier Factor) Let the database D = {t1, t2, …, tn} be a set of n transactions with items I. Given a threshold minisupport, the set of all frequent patterns is denoted FPS(D, minisupport). For each transaction t, the Frequent Pattern Outlier Factor of t is defined as:

FPOF(t) = ∑_X support(X) / ||FPS(D, minisupport)||, where X ⊆ t and X ∈ FPS(D, minisupport).    (1)

The interpretation of formula (1) is as follows. If a data object contains more frequent patterns, its FPOF value will be larger, which indicates that it is unlikely to be an outlier. In contrast, objects with smaller FPOF values have greater outlying-ness. In addition, the FPOF value always lies between 0 and 1.
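Formula (1) translates directly into code. The sketch below (our own illustration, reusing the toy database D and the hypothetical frequent_patterns helper from the sketch above) ranks transactions by FPOF; the lowest-scoring transactions are the outlier candidates:

    def fpof(t, fps):
        """Frequent Pattern Outlier Factor of transaction t, formula (1).

        fps maps each frequent itemset to its support, playing the role
        of FPS(D, minisupport) together with the support values.
        """
        t = frozenset(t)
        return sum(s for x, s in fps.items() if x <= t) / len(fps)

    fps = frequent_patterns(D, 0.5)   # from the previous sketch
    ranked = sorted(D, key=lambda t: fpof(t, fps))
    # Transactions with the smallest FPOF come first; here frozenset({'a','c'})
    # scores lowest (0.4) and is the strongest outlier candidate.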
Definition 2: For each transaction t, the itemset X is said to contradict t if X ⊄ t. The contradict-ness of X to t is defined as:

Contradict-ness(X, t) = (||X|| − ||t ∩ X||) * support(X)    (2)

In our approach, the frequent pattern outlier factor given in Definition 1 is used as the basic measure for identifying outliers. To describe the reasons why identified outliers are abnormal, the itemsets that are not contained in a transaction (the itemsets that contradict it) are good candidates. The considerations behind formula (2) are as follows. First, the greater the support of the itemset X, the greater the contradict-ness of X to t, since a larger support for X suggests a stronger deviation. Second, longer itemsets give a better description than shorter ones.

With Definition 2, it is possible to quantify the contribution of each itemset to the outlying-ness of a specified transaction. However, it is not feasible to list all contradicting itemsets, so it is preferable to present only the top k contradicting frequent patterns to the end user, as formalized in Definition 3.

Definition 3: (TKCFP, Top-K Contradicting Frequent Pattern) The meanings of D, I, minisupport and FPS(D, minisupport) are the same as in Definition 1. For each transaction t, an itemset X ∈ FPS(D, minisupport) is said to be a top-k contradicting frequent pattern if no more than (k−1) itemsets have a contradict-ness to t higher than that of X.

Our task is to mine the top-n outliers with respect to the value of the frequent pattern outlier factor. For each identified outlier, its top-k contradicting frequent patterns will also be discovered for the purpose of description.
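Continuing the same toy example, the sketch below (the function names are ours) implements formula (2) and the top-k selection of Definition 3 to explain why a flagged transaction is abnormal:

    def contradictness(x, t, fps):
        """Contradict-ness(X, t) = (||X|| - ||t ∩ X||) * support(X), formula (2)."""
        x, t = frozenset(x), frozenset(t)
        return (len(x) - len(t & x)) * fps[x]

    def top_k_contradicting(t, fps, k):
        """Top-k contradicting frequent patterns of t (Definition 3)."""
        ranked = sorted(fps, key=lambda x: contradictness(x, t, fps), reverse=True)
        return ranked[:k]

    # Why was frozenset({'a','c'}) flagged? The frequent patterns it violates
    # most strongly, here {b} and {a,b}, are returned first.
    print(top_k_contradicting(frozenset("ac"), fps, k=2))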
3. FREQUENT PATTERN APPROXIMATION

From Section 2, the key step in detecting FP-outliers is obtaining the frequent itemsets. However, existing methods for frequent pattern mining require multiple passes over the dataset, which is not possible in the data stream model. Thus, instead of finding the exact frequent patterns with multiple passes, we obtain estimated frequent patterns using the approximate counting technique over data streams developed by G. S. Manku and R. Motwani [29]. In the sequel, we briefly introduce their method, the Lossy Counting Algorithm.

The lossy counting algorithm [29] accepts two user-specified parameters: a support threshold s ∈ (0,1) and an error parameter ε ∈ (0,1) such that ε ≪ s. The incoming stream is conceptually divided into buckets of width w = ⌈1/ε⌉ transactions each. Buckets are labeled with bucket ids, starting from 1. The current bucket id is denoted b_current; after N elements its value is ⌈N/w⌉. For an element e, its true frequency in the stream seen so far is denoted f_e. The data structure D is a set of entries of the form (e, f, ∆), where e is an element of the stream, f is an integer representing its estimated frequency, and ∆ is the maximum possible error in f.

Initially, D is empty. Whenever a new element e arrives, we first look up D to see whether an entry for e already exists. If the lookup succeeds, the entry is updated by incrementing its frequency f by one. Otherwise, a new entry of the form (e, 1, b_current − 1) is created. D is also pruned at bucket boundaries, i.e., whenever N ≡ 0 mod w; the rule for deletion is that an entry (e, f, ∆) is deleted if f + ∆ ≤ b_current. When a user requests a list of items with threshold s, the entries with f ≥ (s − ε)N are output.
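A minimal sketch of the steps just described, for the single-item case (our own transcription; [29] extends the scheme to itemsets by processing transactions in batches, which is omitted here):

    import math

    class LossyCounter:
        """Sketch of single-item Lossy Counting as described above [29]."""

        def __init__(self, epsilon):
            self.epsilon = epsilon
            self.w = math.ceil(1 / epsilon)   # bucket width w = ceil(1/eps)
            self.n = 0                        # N, elements seen so far
            self.entries = {}                 # e -> (f, delta)

        def add(self, e):
            self.n += 1
            b_current = math.ceil(self.n / self.w)
            if e in self.entries:             # existing entry: increment f
                f, delta = self.entries[e]
                self.entries[e] = (f + 1, delta)
            else:                             # new entry (e, 1, b_current - 1)
                self.entries[e] = (1, b_current - 1)
            if self.n % self.w == 0:          # bucket boundary: prune
                self.entries = {x: (f, d) for x, (f, d) in self.entries.items()
                                if f + d > b_current}

        def frequent(self, s):
            """Entries with estimated frequency f >= (s - epsilon) * N."""
            return [x for x, (f, _) in self.entries.items()
                    if f >= (s - self.epsilon) * self.n]

The pruning step is what bounds the memory: any element that could not possibly have frequency above εN is discarded, at the cost of the bounded undercount recorded in ∆.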
Theoretical analysis [29] shows that the answers produced by the lossy counting algorithm have the following guarantees:
1. All itemsets whose true frequency exceeds sN are output; there are no false negatives.
2. No itemset whose true frequency is less than (s − ε)N is output.
3. Estimated frequencies are less than the true frequencies by at most εN.
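As a concrete illustration, with numbers of our own choosing: for s = 1% and ε = 0.1%, after N = 100,000 transactions every itemset occurring at least sN = 1,000 times is output, no itemset occurring fewer than (s − ε)N = 900 times is output, and each reported count undercounts the true count by at most εN = 100.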
Intuitively, the above guarantees indicate that the lossy counting algorithm approximates frequent patterns effectively and with a theoretical error bound. Hence, the quality of the approximation of the frequent patterns is guaranteed.