OUTLIER DETECTION OVER DATA STREAMS

He Zengyou, Xu Xiaofei and Deng Shengchun
Department of Computer Science and Engineering, Harbin Institute of Technology, Post Code: 150001, Harbin, China
[email protected], [email protected], [email protected]
ABSTRACT

We study the discovery of outliers under the data stream model. The data stream model is relevant to new classes of applications involving massive datasets, such as web click stream analysis and detection of network intrusions. We adapt the frequent pattern based outlier detection method [31] to the data stream environment. Instead of finding the exact frequent patterns with multiple passes, we approximate them in a single pass, and the error in the estimated support of each frequent itemset is guaranteed not to exceed a user-specified parameter. For each incoming data point, an outlier factor is computed using the frequent patterns seen so far, and the top k outliers are output to the user.

KEYWORDS: Outlier Detection, Data Streams, Frequent Pattern, Data Mining

1. INTRODUCTION

For many recent applications, the concept of a data stream is more appropriate than that of a dataset. By nature, a stored dataset is an appropriate model when significant portions of the data are queried again and again, and updates are relatively infrequent. In contrast, a data stream is an appropriate model when a large volume of data arrives continuously and it is either unnecessary or impractical to store the data in some form of memory. Data streams are also appropriate as a model of access to large data sets stored in secondary memory, where performance requirements necessitate linear scans [17].

In the data stream model, data points can only be accessed in the order of their arrival, and random access is disallowed. Moreover, the space available for storing information is assumed to be small relative to the huge size of the unbounded stream of data points. Data mining algorithms over data streams are therefore restricted to a single pass over the data and to limited resources, which makes this a very challenging research field.

1.1 RELATED WORK ON MINING DATA STREAMS

There has been some initial work addressing data streams in the data mining community. These proposals tried to adapt traditional data mining techniques to the data stream model.

References [26,23,17] consider clustering in the data stream model; they extend the classical K-median and K-means clustering algorithms to this setting. References [18,19,20] focus on efficiently constructing decision trees and on ensemble classification in the data stream environment. Reference [30] presents an online classification system based on info-fuzzy networks.

Reference [29] discusses the problem of frequent pattern mining in data streams. The authors in [28] proposed algorithms for regression analysis of time-series data streams.

Reference [21] considers extracting information about customers from a stream of transactions and mining it in real time. Reference [22] proposes Hancock, a language for extracting signatures from data streams.

The authors in [27,24] address the problem of mining multiple data streams. Reference [27] develops algorithms for analyzing co-evolving time sequences to forecast future values and detect correlations. Reference [24] presents a collective approach to mining Bayesian networks from distributed, heterogeneous web log data streams.

Reference [25] identifies some key aspects of stream data mining algorithms and outlines a number of possible directions for future research.

1.2 RELATED WORK ON OUTLIER DETECTION

Data mining tasks can be classified into four general categories: (a) dependency detection (e.g. association rules), (b) class identification (e.g. classification, clustering), (c) class description (e.g. concept generalization), and (d) outlier/exception detection [1]. Most recent work has focused on the first three categories, which find patterns applicable to a considerable portion of the objects in a dataset. The fourth category, the outlier detection problem, focuses on identifying a relatively small part of the whole dataset. In some cases and for some applications, these "outliers" are more interesting than the common objects.

Recently, methods for finding such outliers in large datasets have been drawing increasing attention [1-13]. The statistics community conducted most of the previous studies on outlier mining [14]. These studies can be broadly classified into two categories. The first category is distribution-based, where a standard distribution is used to fit the dataset and outliers are defined with respect to the probability distribution. Yamanishi, Takeuchi and Williams [10] used a Gaussian mixture model to represent the normal behaviors; each datum is given a score based on changes in the model, and a high score indicates a high possibility of being an outlier. This approach has been combined with supervised learning to obtain general patterns for outliers [12].

Depth-based methods are the second category of outlier mining in statistics [15]. Based on some definition of depth, data objects are organized in convex hull layers in data space according to peeling depth, and outliers are expected to be found among the data objects with shallow depth values.

The distance-based outlier was introduced by Knorr and Ng [1]. A distance-based outlier in a dataset D is a data object such that at least pct% of the objects in D lie at a distance of more than d_min from it. This notion generalizes many concepts of the distribution-based approach and enjoys better computational complexity. It was further extended using the distance of a point from its kth nearest neighbor [2]: after ranking points by the distance to their kth nearest neighbor, the top k points are identified as outliers, and efficient algorithms for mining these top-k outliers are given. Alternatively, in the algorithm proposed by Angiulli and Pizzuti [11], the outlier factor of each data point is computed as the sum of the distances to its k nearest neighbors.
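As a concrete illustration of these two rankings, consider the following brute-force sketch (our own; the function name knn_outlier_scores, the toy points and the O(n^2) distance computation are illustrative assumptions, not the pruned and indexed algorithms of [2,11]):

    import numpy as np

    def knn_outlier_scores(points, k, method="kth"):
        """Score points by their k-nearest-neighbor distances.

        method="kth": distance to the kth nearest neighbor, as in [2].
        method="sum": sum of distances to the k nearest neighbors, as in [11].
        """
        points = np.asarray(points, dtype=float)
        diffs = points[:, None, :] - points[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=-1))  # pairwise Euclidean distances
        np.fill_diagonal(dists, np.inf)             # a point is not its own neighbor
        knn = np.sort(dists, axis=1)[:, :k]         # k smallest distances per point
        return knn[:, -1] if method == "kth" else knn.sum(axis=1)

    # Points with the largest scores are reported as the top-n outliers.
    pts = [[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [5, 5]]
    scores = knn_outlier_scores(pts, k=2)
    top_n = np.argsort(scores)[::-1][:1]            # -> [4], the isolated point

Note that both scores grow with a point's isolation; the two methods differ only in whether a single neighbor distance or an aggregate over k neighbors drives the ranking.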
Deviation-based techniques identify outliers by inspecting the characteristics of objects and consider an object that deviates from these features to be an outlier [9].

Breunig et al. [3] introduced the concept of the "local outlier". The outlier rank of a data object is determined by taking into account the clustering structure in a bounded neighborhood of the object, which is formalized as the "local outlier factor" (LOF). The LOCI method [13] further extended this density-based approach [3].

Clustering-based outlier detection techniques regard small clusters as outliers [5] or identify outliers by removing clusters from the original dataset [8]. The authors in [16] further extended existing clustering-based techniques by proposing the concept of the cluster-based local outlier, with a measure defined for the outlier-ness of each data object.

Aggarwal and Yu [4] discussed a new technique for outlier detection that finds outliers by observing the density distribution of projections of the data. That is, their definition considers a point to be an outlier if it lies in a local region of abnormally low density in some lower-dimensional projection.

The replicator neural network (RNN) is employed to detect outliers by Harkins, et al. [6]. The approach is based on the observation that a trained neural network will reconstruct some small number of individuals poorly, and these individuals can be considered outliers. The outlier factor used for ranking is the magnitude of the reconstruction error.

An interesting recent technique finds outliers by incorporating semantic knowledge such as the class labels of the data points [7]. In view of the class information, a semantic outlier is a data point that behaves differently from the other data points in the same class.

1.3 SUMMARY

As Sections 1.1 and 1.2 show, existing proposals on mining data streams do not address the problem of outlier detection, and the proposed outlier mining methods are not appropriate for the data stream model.

In this paper, we address the problem of detecting outliers in data streams. We begin by introducing the frequent pattern based outlier detection method [31]. In the sequel, it is extended to the data stream environment through frequent pattern approximation.

2. FREQUENT PATTERN BASED OUTLIER DETECTION METHOD

Agrawal's statement of the problem of discovering frequent itemsets in market basket databases is the following [32]. Let I = {i1, i2, …, im} be a set of m literals called items. Let the database D = {t1, t2, …, tn} be a set of n transactions, each consisting of a set of items from I. An itemset X is a non-empty subset of I. The length of an itemset X is the number of items it contains, and X is called a k-itemset if its length is k. A transaction t ∈ D is said to contain an itemset X if X ⊆ t. The support of an itemset X is the percentage of transactions in D containing X: support(X) = ||{t ∈ D | X ⊆ t}|| / ||{t ∈ D}||.

The problem of finding all frequent itemsets in D is then traditionally defined as follows: given a user-defined threshold minisupport on the permissible minimal support, find all itemsets with support greater than or equal to minisupport. Frequent itemsets are also called frequent patterns.
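To make these definitions concrete, the following minimal sketch (our own; the helper names support and frequent_patterns and the naive subset enumeration are assumptions, whereas practical miners such as Apriori prune the search but compute the same sets) evaluates them on a toy transaction database:

    from itertools import combinations

    def support(itemset, transactions):
        """Fraction of transactions containing every item of `itemset`."""
        itemset = frozenset(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def frequent_patterns(transactions, minisupport):
        """All itemsets with support >= minisupport, mapped to their support."""
        items = sorted(set().union(*transactions))
        fps = {}
        for size in range(1, len(items) + 1):
            for cand in combinations(items, size):
                s = support(cand, transactions)
                if s >= minisupport:
                    fps[frozenset(cand)] = s
        return fps

    # Toy database: 4 transactions over items {a, b, c}.
    D = [frozenset("ab"), frozenset("abc"), frozenset("ac"), frozenset("ab")]
    print(support("ab", D))           # 0.75
    print(frequent_patterns(D, 0.5))  # {a}:1.0, {b}:0.75, {c}:0.5, {a,b}:0.75, {a,c}:0.5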
From the viewpoint of knowledge discovery, frequent patterns reflect the "common patterns" that apply to many objects, or to a large percentage of the objects, in the dataset. In contrast, outlier detection focuses on a very small percentage of the data objects. Hence, the idea of using frequent patterns for outlier detection is very intuitive.

Definition 1: (FPOF, Frequent Pattern Outlier Factor) Let the database D = {t1, t2, …, tn} be a set of n transactions with items I. Given a threshold minisupport, the set of all frequent patterns is denoted FPS(D, minisupport). For each transaction t, the Frequent Pattern Outlier Factor of t is defined as:

FPOF(t) = ∑_X support(X) / ||FPS(D, minisupport)||, where X ⊆ t and X ∈ FPS(D, minisupport).    (1)

The interpretation of formula (1) is as follows. If a data object contains more frequent patterns, its FPOF value will be larger, which indicates that it is unlikely to be an outlier. In contrast, objects with smaller FPOF values have greater outlying-ness. In addition, the FPOF value always lies between 0 and 1.
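Formula (1) translates directly into code. The sketch below (our own illustration, reusing the toy database D and the hypothetical frequent_patterns helper from the sketch above) ranks transactions by FPOF; the lowest-scoring transactions are the outlier candidates:

    def fpof(t, fps):
        """Frequent Pattern Outlier Factor of transaction t, formula (1).

        fps maps each frequent itemset to its support, playing the role
        of FPS(D, minisupport) together with the support values.
        """
        t = frozenset(t)
        return sum(s for x, s in fps.items() if x <= t) / len(fps)

    fps = frequent_patterns(D, 0.5)   # from the previous sketch
    ranked = sorted(D, key=lambda t: fpof(t, fps))
    # Transactions with the smallest FPOF come first; here frozenset({'a','c'})
    # scores lowest (0.4) and is the strongest outlier candidate.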
Definition 2: For each transaction t, the itemset X is said to contradict t if X ⊄ t. The contradict-ness of X to t is defined as:

Contradict-ness(X, t) = (||X|| − ||t ∩ X||) * support(X)    (2)

In our approach, the frequent pattern outlier factor given in Definition 1 is used as the basic measure for identifying outliers. To describe the reasons why identified outliers are abnormal, the itemsets that are not contained in a transaction (the itemsets that contradict it) are good candidates. The considerations behind formula (2) are as follows. First, the greater the support of the itemset X, the greater the contradict-ness of X to t, since a larger support for X suggests a stronger deviation. Second, longer itemsets give a better description than shorter ones.

With Definition 2, it is possible to quantify the contribution of each itemset to the outlying-ness of a specified transaction. However, it is not feasible to list all contradicting itemsets, so it is preferable to present only the top k contradicting frequent patterns to the end user, as formalized in Definition 3.

Definition 3: (TKCFP, Top-K Contradicting Frequent Pattern) The meanings of D, I, minisupport and FPS(D, minisupport) are the same as in Definition 1. For each transaction t, an itemset X ∈ FPS(D, minisupport) is said to be a top-k contradicting frequent pattern if no more than (k−1) itemsets have a contradict-ness to t higher than that of X.

Our task is to mine the top-n outliers with respect to the value of the frequent pattern outlier factor. For each identified outlier, its top-k contradicting frequent patterns will also be discovered for the purpose of description.
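Continuing the same toy example, the sketch below (the function names are ours) implements formula (2) and the top-k selection of Definition 3 to explain why a flagged transaction is abnormal:

    def contradictness(x, t, fps):
        """Contradict-ness(X, t) = (||X|| - ||t ∩ X||) * support(X), formula (2)."""
        x, t = frozenset(x), frozenset(t)
        return (len(x) - len(t & x)) * fps[x]

    def top_k_contradicting(t, fps, k):
        """Top-k contradicting frequent patterns of t (Definition 3)."""
        ranked = sorted(fps, key=lambda x: contradictness(x, t, fps), reverse=True)
        return ranked[:k]

    # Why was frozenset({'a','c'}) flagged? The frequent patterns it violates
    # most strongly, here {b} and {a,b}, are returned first.
    print(top_k_contradicting(frozenset("ac"), fps, k=2))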
3. FREQUENT PATTERN APPROXIMATION

From Section 2, the key step in detecting FP-outliers is obtaining the frequent itemsets. However, existing methods for frequent pattern mining require multiple passes over the dataset, which is not possible in the data stream model. Thus, instead of finding the exact frequent patterns with multiple passes, we obtain estimated frequent patterns using the approximate counting technique over data streams developed by G. S. Manku and R. Motwani [29]. In the sequel, we briefly introduce their method, the Lossy Counting Algorithm.

The lossy counting algorithm [29] accepts two user-specified parameters: a support threshold s ∈ (0,1) and an error parameter ε ∈ (0,1) such that ε ≪ s. The incoming stream is conceptually divided into buckets of width w = ⌈1/ε⌉ transactions each. Buckets are labeled with bucket ids, starting from 1. The current bucket id is denoted b_current; after N elements its value is ⌈N/w⌉. For an element e, its true frequency in the stream seen so far is denoted f_e. The data structure D is a set of entries of the form (e, f, ∆), where e is an element of the stream, f is an integer representing its estimated frequency, and ∆ is the maximum possible error in f.

Initially, D is empty. Whenever a new element e arrives, we first look up D to see whether an entry for e already exists. If the lookup succeeds, the entry is updated by incrementing its frequency f by one. Otherwise, a new entry of the form (e, 1, b_current − 1) is created. D is also pruned at bucket boundaries, i.e., whenever N ≡ 0 mod w; the rule for deletion is that an entry (e, f, ∆) is deleted if f + ∆ ≤ b_current. When a user requests a list of items with threshold s, the entries with f ≥ (s − ε)N are output.
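A minimal sketch of the steps just described, for the single-item case (our own transcription; [29] extends the scheme to itemsets by processing transactions in batches, which is omitted here):

    import math

    class LossyCounter:
        """Sketch of single-item Lossy Counting as described above [29]."""

        def __init__(self, epsilon):
            self.epsilon = epsilon
            self.w = math.ceil(1 / epsilon)   # bucket width w = ceil(1/eps)
            self.n = 0                        # N, elements seen so far
            self.entries = {}                 # e -> (f, delta)

        def add(self, e):
            self.n += 1
            b_current = math.ceil(self.n / self.w)
            if e in self.entries:             # existing entry: increment f
                f, delta = self.entries[e]
                self.entries[e] = (f + 1, delta)
            else:                             # new entry (e, 1, b_current - 1)
                self.entries[e] = (1, b_current - 1)
            if self.n % self.w == 0:          # bucket boundary: prune
                self.entries = {x: (f, d) for x, (f, d) in self.entries.items()
                                if f + d > b_current}

        def frequent(self, s):
            """Entries with estimated frequency f >= (s - epsilon) * N."""
            return [x for x, (f, _) in self.entries.items()
                    if f >= (s - self.epsilon) * self.n]

The pruning step is what bounds the memory: any element that could not possibly have frequency above εN is discarded, at the cost of the bounded undercount recorded in ∆.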
Theoretical analysis [29] shows that the answers produced by the lossy counting algorithm have the following guarantees:
1. All itemsets whose true frequency exceeds sN are output; there are no false negatives.
2. No itemset whose true frequency is less than (s − ε)N is output.
3. Estimated frequencies are less than the true frequencies by at most εN.
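As a concrete illustration, with numbers of our own choosing: for s = 1% and ε = 0.1%, after N = 100,000 transactions every itemset occurring at least sN = 1,000 times is output, no itemset occurring fewer than (s − ε)N = 900 times is output, and each reported count undercounts the true count by at most εN = 100.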
Intuitively, the above guarantees indicate that the lossy counting algorithm approximates frequent patterns effectively and with a theoretical error bound. Hence, the quality of the approximation of the frequent patterns is guaranteed.