Wang B, Yang XC, Wang GR et al. Outlier detection over sliding windows for probabilistic data streams. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 25(3): 389–400 May 2010
Outlier Detection over Sliding Windows for Probabilistic Data Streams

Bin Wang (王 斌), Member, CCF, Xiao-Chun Yang (杨晓春), Senior Member, CCF, Member, ACM, IEEE, Guo-Ren Wang (王国仁), Senior Member, CCF, Member, ACM, IEEE, and Ge Yu (于 戈), Senior Member, CCF, Member, ACM, IEEE

School of Information Science and Engineering, Northeastern University, Shenyang 110004, China
Key Laboratory of Medical Image Computing (Northeastern University), Ministry of Education, Shenyang 110004, China

E-mail: {binwang, yangxc, wanggr, yuge}@mail.neu.edu.cn

Received June 30, 2009; revised November 16, 2009.

Abstract  Outlier detection is a very useful technique in many applications, where data is generally uncertain and could be described using probability. While it has been studied intensively in the field of deterministic data, outlier detection is still novel in the emerging uncertain data field. In this paper, we study the semantics of outlier detection on a probabilistic data stream and present a new definition of distance-based outlier over a sliding window. We then show that the problem of detecting an outlier over a set of possible world instances is equivalent to the problem of finding the k-th element in its neighborhood. Based on this observation, a dynamic programming algorithm (DPA) is proposed to reduce the detection cost from O(2^|R(e, d)|) to O(k·|R(e, d)|), where R(e, d) is the d-neighborhood of e. Furthermore, we propose a pruning-based approach (PBA) to effectively and efficiently filter non-outliers on a single window, and to dynamically detect the recent m elements incrementally. Finally, detailed analysis and thorough experimental results demonstrate the efficiency and scalability of our approach.

Keywords  outlier detection, uncertain data, probabilistic data stream, sliding window

1 Introduction
Distance-based outlier detection techniques on deterministic data have been extensively studied[1-2] in areas such as network intrusion detection and event detection in wireless sensor networks. As one of the emerging database techniques, uncertain data, which captures the uncertainty of data and can reflect the real world better, attracts more and more attention nowadays. Due to the intrinsic differences between uncertain and deterministic data, such as the newly introduced probabilistic dimension called confidence, the existing techniques on deterministic data cannot be applied directly to outlier detection on uncertain data. Moreover, in the sliding window computation model, when a new element arrives, it should be inserted into the window and the expired element should be removed from the window[3]. To the best of our knowledge, no existing algorithm considers distance-based outlier detection on uncertain data streams.

In this paper, we study distance-based outlier detection on uncertain data streams by using the sliding window computation model, where outlier detection needs to be performed after each sliding step. As an example, Fig.1(a) shows a sliding window with length 5 over a probabilistic data stream. Fig.1(b) shows the values of six elements from e1 to e6. For each element ei, we use V1 and V2 to represent its two attributes and p to denote its probability, which reflects the uncertainty of its existence. Fig.1(c) maps these elements into a 2D coordinate system. For simplicity, we only consider two attributes here.

We study several challenges that arise naturally when detecting outliers on a probabilistic data stream.

Challenge 1. What does distance-based outlier mean on an uncertain data stream? When considering deterministic data, an element is a distance-based outlier if the number of its neighbor elements (including itself) within distance d is below a threshold k. For example, in Fig.1(c), if the probabilities of all elements are 1, all elements can be regarded as deterministic data. Then e5 has four neighbor elements within d = 50 from it, namely e1, e2, e4 and e5 itself. Thus, when k is 5, e5 is an outlier. However, on an uncertain data stream, each element has a probability value, which means that the
Regular Paper This work is partially supported by the National Natural Science Foundation of China under Grant Nos. 60973020, 60828004, and 60933001, the Program for New Century Excellent Talents in University of China under Grant No. NCET-06-0290, and the Fundamental Research Funds for the Central Universities under Grant No. N090504004. ©2010 Springer Science + Business Media, LLC & Science Press, China
number of neighbor elements is uncertain. Thus, a distance-based outlier on an uncertain data stream cannot be defined just based on the number of neighbor elements.

Fig.1. An example of a 1-sliding window on a probabilistic data stream. (a) 1-sliding window on a data stream. (b) Partial elements with probabilities. (c) Mapping elements in the sliding window to a 2D-space.

Our Contribution. We present a new definition of distance-based outlier on an uncertain data stream. Keeping the basic idea of the traditional definition of distance-based outlier, we employ probability in this new definition. After each sliding step, an element e is regarded as a distance-based outlier if the probability that e has enough neighbor elements in the current window is less than or equal to a threshold λ.

Challenge 2. How can such outliers on a single window be detected efficiently? We still use k to denote the minimum number of neighbor elements that counts as enough. A naive approach is to process all elements in the current window sequentially. For each of them, unfold all possible worlds of its neighborhood and sum up the probabilities of the possible worlds containing at least k elements. If this summation is not larger than λ, then the element is an outlier.

Table 1 shows all possible worlds (w0 ∼ w15) of the neighborhood of e5 (here d = 50) in the window shown in Fig.1(a) and their existence probabilities. We assume that all elements are mutually independent. The probability of w0 is (1 − 0.6) × (1 − 0.4) × (1 − 0.8) × (1 − 0.6) = 0.0192, and the probability of w10 is 0.6 × (1 − 0.4) × (1 − 0.8) × 0.6 = 0.0432. Let k be 2 and the probability threshold λ be 0.65. We can see that only the possible worlds w5 ∼ w15 contain enough elements (≥ k). We then obtain the probability 0.8336 that e5 has at least k neighbors by summing the probabilities of the possible worlds from w5 to w15. Since it is larger than λ, we say e5 is not an outlier.

Table 1. Possible Worlds of d-Neighborhoods of e5

| Possible World Instance | Pr(wi) | 2nd Element |
|---|---|---|
| w0 = ∅ | 0.0192 | − |
| w1 = {e1} | 0.0128 | − |
| w2 = {e2} | 0.0768 | − |
| w3 = {e4} | 0.0288 | − |
| w4 = {e5} | 0.0288 | − |
| w5 = {e1, e2} | 0.0512 | e2 |
| w6 = {e1, e4} | 0.0192 | e4 |
| w7 = {e1, e5} | 0.0192 | e5 |
| w8 = {e2, e4} | 0.1152 | e4 |
| w9 = {e2, e5} | 0.1152 | e5 |
| w10 = {e4, e5} | 0.0432 | e5 |
| w11 = {e1, e2, e4} | 0.0768 | e2 |
| w12 = {e1, e2, e5} | 0.0768 | e2 |
| w13 = {e1, e4, e5} | 0.0288 | e4 |
| w14 = {e2, e4, e5} | 0.1728 | e4 |
| w15 = {e1, e2, e4, e5} | 0.1152 | e2 |
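The possible-world computation for e5 can be reproduced with a few lines of Python. This is only an illustrative sketch of the naive approach: the existence probabilities 0.4, 0.8, 0.6 and 0.6 for e1, e2, e4 and e5 are inferred from the values in Table 1 (Fig.1(b) itself is not reproduced here), and the function names are hypothetical.

```python
from itertools import product

# Existence probabilities of the d-neighborhood of e5 (d = 50), inferred from
# the Pr(wi) column of Table 1 and listed in time-stamp order.
neighborhood = [("e1", 0.4), ("e2", 0.8), ("e4", 0.6), ("e5", 0.6)]


def naive_non_outlier_probability(elements, k):
    """Sum Pr(w) over every possible world w that holds at least k elements."""
    total = 0.0
    for world in product([False, True], repeat=len(elements)):
        pr = 1.0
        for present, (_, p) in zip(world, elements):
            pr *= p if present else 1.0 - p   # elements are independent
        if sum(world) >= k:
            total += pr
    return total


def naive_is_outlier(elements, k, lam):
    """(k, d, lambda)-outlier test on an already computed d-neighborhood."""
    if len(elements) < k:
        return True                            # no world can hold k elements
    return naive_non_outlier_probability(elements, k) <= lam


prob = naive_non_outlier_probability(neighborhood, k=2)
print(round(prob, 4), naive_is_outlier(neighborhood, k=2, lam=0.65))
```

The loop visits 2^|R(e, d)| worlds, which is exactly the exponential cost that motivates a cheaper method.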
This naive approach is infeasible in practice for the following two reasons: (i) when the sliding window is large, it costs too much time to process all the elements contained in the window; (ii) the number of possible worlds for each element grows exponentially with the number of its neighbors, so it is too expensive to process the element by unfolding all the possible worlds. In order to detect such distance-based outliers on uncertain data streams, a more efficient approach is needed.

Our Contribution. We propose a pruning-based approach named PBA to effectively and efficiently reduce the number of elements to be processed in the sliding window and save detection cost. Moreover, we present a dynamic programming algorithm named DPA, which can process each single element in time linear in the size of its neighborhood, avoiding the expensive unfolding of the possible worlds of its neighborhood.

Challenge 3. How to design algorithms to support online computation in the presence of rapid updates of data elements? In the sliding window computation
model against data streams, we consider the most recent m data elements; that is, the oldest l elements in an l-sliding window should be deleted and the l arriving elements should be inserted.

Our Contribution. We utilize the already detected elements to incrementally detect outliers when sliding the window dynamically. When we slide the window by one step, the non-expired elements in the previous window are shared with the new window; for example, e2, e3, e4, and e5 are non-expired elements in the 1-sliding window in Fig.1(a). We call such elements joint elements. We will show that, using the proposed pruning strategies, it is not necessary to process all joint elements, so detection can be accelerated on the fly. We consider two aspects: i) the data structure needs to be well designed and maintained, and ii) the operations for deleting expired elements and inserting newly arrived elements need to be handled carefully and accurately.

The rest of the paper is organized as follows. In Subsection 1.1, we review the related work on uncertain data, uncertain data streams, and outlier detection. In Section 2, we present a new definition of distance-based outlier on a probabilistic data stream. In Section 3 we analyze the property of outliers with probabilities and propose a dynamic programming approach (DPA) to efficiently detect a single element. In Section 4 we propose a pruning-based approach (PBA) to process the recent m elements when sliding the window. In Section 5, we report and analyze our experimental results. In Section 6, we conclude this paper with directions for future work.

1.1 Related Work
So far, there are five kinds of outlier detection techniques on deterministic data, namely techniques based on statistics, density, depth, distance and deviation[4-7], among which statistics-based, density-based and distance-based techniques are widely used. The statistics-based technique is one of the oldest; it detects anomalies in data samples by identifying low-probability events. Density-based outlier detection[4] defines density based on two parameters: the distances between different records and the number of records lying within a given range. It determines whether a record is an outlier or not according to the density of records. Distance-based outlier detection considers a point as an outlier of a dataset if the number of points within a certain distance from it is below a given threshold. Because their algorithms are complex, depth-based[5] and deviation-based techniques[6-7] are used less than the other three. The above techniques are all designed for deterministic data. However, in many applications, due
to various factors such as the limitations of equipment and delay or loss in transmission, data are acquired with uncertainty. To handle this uncertainty, uncertain data attracts more and more attention nowadays and covers a wide range of different aspects[8-12]. Probabilistic Skylines[8] aims at finding the skyline objects on uncertain data. Because a probability value is introduced in uncertain data, the traditional Top-k query is further refined into three different kinds of queries: the U-Topk query, the U-kRanks query[9], and the PT-k query[10]. Aggarwal and Yu[11] proposed a density-based technique for outlier detection on uncertain data, in which each object appears anywhere within an uncertain region with a probability distribution function (PDF). Kriegel and Pfeifle[12] proposed a density-based clustering algorithm for the same kind of uncertain data as [11].

Compared with uncertain data, techniques on uncertain data streams[13-15] mainly focus on how to process uncertain data efficiently to support on-line queries in a sliding window. A method for answering Top-k queries on uncertain data streams was proposed in [13]. [14] proposed an architecture for real-time monitoring of uncertain data streams. Several works have addressed clustering of probabilistic data streams, of which the most representative one[15] was proposed by Aggarwal and Yu.

To the best of our knowledge, there is no work focusing on distance-based outlier detection on uncertain data streams, in which each element is associated with a probability describing the uncertainty of its existence. Therefore, based on the main idea of handling the sliding window stream data model in [16], we study distance-based outlier detection on uncertain data streams over the sliding window model.

2 Semantic of Outlier on Probabilistic Data Stream

2.1 Outliers on Probabilistic Data
A distance-based outlier is an element that does not have enough elements in its d-neighborhood[1]. For an uncertain database, we need to consider all possible world instances to determine whether the d-neighborhood contains enough data. Let D = {e1, ..., en} denote a set of probabilistic data, in which each element ei has the form ⟨di, pi⟩, where pi is the probability of di. The d-neighborhood of an element ei, denoted R(ei, d), is the set of elements ej in D such that R(ei, d) = {ej | ej ∈ D, dist(di, dj) ≤ d}.

Definition 1 ((k, d, λ)-Outlier). Given an uncertain dataset D. For an element e ∈ D, let W(e, d, k) be
a subset of all possible world instances of R(e, d) such that for each possible world w ∈ W(e, d, k), |R(e, w)| ≥ k holds①. An element e ∈ D is a (k, d, λ)-outlier, if and only if the summation of the probabilities of possible world instances in W(e, d, k) is less than or equal to λ.

Notice that, for an element e ∈ D, if R(e, d) contains less than k elements, it must be an outlier, since W(e, d, k) is an empty set. For each w ∈ W(e, d, k), we use Pr(w) to represent the probability of w, in which there are at least k elements. According to Definition 1, a (k, d, λ)-outlier detection on an uncertain dataset D returns a set of data satisfying (1).

$$T = \Big\{\, e \;\Big|\; \sum_{w \in W(e,d,k)} Pr(w) \le \lambda,\ e \in D \,\Big\}. \qquad (1)$$

2.2 Sliding-Window Outliers on Probabilistic Stream

Let DS = (e1, e2, ..., en) be a probabilistic data stream, where i is the timestamp of ei. Each ei has the form of ⟨di, pi⟩, in which di is the data item and pi is the probability of di. In this paper, we use a tuple-based sliding window, which is a fixed window size of m frames, starting at position i and ending at i + m − 1. We use DS[i, i + m − 1] = (ei, ..., ei+m−1) to represent elements in the window. Our technique can be easily extended to time-based sliding window.

Problem Statement. Given a probabilistic data stream DS = (e1, e2, ..., en) and a sliding window size m, the goal is to detect the (k, d, λ)-outliers for every sliding window DS[i, i + m − 1] as i goes from 1 to n − m + 1.

3 Detecting a Single Element

In this section, we simplify the problem by considering a single item e within a sliding window DS[i, i + m − 1]. According to the definition of the problem in Section 2, the naive approach is to unfold all possible worlds in the sliding window and sum up the probabilities of possible worlds containing at least k d-neighborhoods of e. Obviously, the time complexity of checking all elements in the sliding window is O(m × 2^m), which is prohibitively costly when m is large.

3.1 Property of Outliers with Probabilities

For each element e ∈ D, we are interested in computing Σ_{w∈W(e,d,k)} Pr(w) in (1) efficiently. Since data in the stream are ordered according to their timestamps, we can say R(e, d) contains at least k elements, if and only if we find the k-th element in R(e, d). The third column in Table 1 shows the 2nd element in different possible worlds. Therefore, in order to get the summation of possible worlds that at least k elements appear in R(e, d), we sum up the probabilities that each data ei being the k-th element in R(e, d), denoted by PK(ei, d, k), shown in (2).

$$\sum_{e_i \in R(e,d)} PK(e_i, d, k). \qquad (2)$$

Theorem 1. For each element e in a sliding window S[h, h + m − 1], (3) holds.

$$\sum_{w \in W(e,d,k)} Pr(w) = \sum_{e_i \in R(e,d)} PK(e_i, d, k). \qquad (3)$$
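A brief justification of (3), offered as a sketch under the independence assumption rather than as the authors' own proof: for each ei ∈ R(e, d), let A_i denote the event that ei appears and exactly k − 1 of the elements of R(e, d) with smaller timestamps appear, so that Pr(A_i) = PK(ei, d, k). (The symbol A_i is introduced here only for illustration.) A possible world has at most one k-th element, so the events A_i are mutually exclusive, and a world contains at least k elements exactly when one of the A_i occurs. Hence:

```latex
\sum_{w \in W(e,d,k)} Pr(w)
  \;=\; Pr\Big(\bigcup_{e_i \in R(e,d)} A_i\Big)
  \;=\; \sum_{e_i \in R(e,d)} Pr(A_i)
  \;=\; \sum_{e_i \in R(e,d)} PK(e_i, d, k).
```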
Consider DS[1, 5] = (e1, ..., e5) in the example in Fig.1. Let d be 50 and k be 2. Then for e5, we have R(e5, 50) = {e1, e2, e4, e5}, and the probability of e5 being a non-outlier is the summation of the probabilities of w5 to w15 in Table 1, which is equal to 0.8336. We rank the elements in R(e5, 50) by their time stamps; e2, e4, and e5 can each be the 2nd element that appears in R(e5, 50). The probabilities of these three elements being the 2nd element are 0.32, 0.336, and 0.1776, respectively. Therefore, the summation of the probabilities of e2, e4, and e5 being the 2nd element in R(e5, 50) is 0.8336, equal to the summation of the probabilities of w5 to w15.

3.2 Dynamic Programming Approach

In order to compute Σ_{ei∈R(e,d)} PK(ei, d, k) (shown in (2)) and detect outliers efficiently, we propose a dynamic programming approach (DPA), which runs in polynomial time. Given the elements ordered by their time stamps, we use (4) to calculate PK(ej, d, k) for each element ej ∈ R(e, d).

$$PK(e_j, d, k) = p_j \cdot P(k-1,\ j-1), \qquad (4)$$

where pj is the probability of element ej and P(k − 1, j − 1) is the probability that, among the first j − 1 elements, exactly k − 1 of them appear in R(e, d).

Subproblems. We create subproblems as follows. Given an element e, let 0 ≤ i ≤ k − 1 and 0 ≤ j ≤ |R(e, d)| − 1 be two integers. Let P(i, j) be the probability that, among the first j elements, exactly i of them appear in R(e, d). Our goal is to compute the values P(k − 1, j).

Initialization.
• We have P(0, 0) = 1, in which P(0, 0) is the probability that none of the first zero elements appear in R(e, d).
① For any set S, we use |S| to express the number of elements in S.
• For each 1 ≤ i ≤ k − 1, we have P(i, 0) = 0, which is the probability that i of the first zero elements appear in R(e, d).
• For each 1 ≤ j ≤ |R(e, d)| − 1, we have P(0, j) = P(0, j − 1) · (1 − pj).
• For i > j, we have P(i, j) = 0, since the probability that i of the first j elements appear in R(e, d) is zero.

Recurrence Function. Consider the subproblem of computing a value for the entry P(i, j), where i > 0 and j > 0. We have the following two options.
1) The j-th element appears in R(e, d) and, among the first j − 1 elements, i − 1 of them must appear.
2) The j-th element does not appear in R(e, d) and, among the first j − 1 elements, i of them must appear.
Therefore, P(i, j) can be set as the summation of the above two options. The following formula summarizes the recurrence function:

$$P(i, j) = p_j \cdot P(i-1,\ j-1) + (1 - p_j) \cdot P(i,\ j-1). \qquad (5)$$

Using the above analysis, we can initialize a matrix with rows i = 0, ..., k − 1 and columns j = 0, ..., |R(e, d)| − 1. We set the values in the first row and the first column according to the initialization conditions. We use the recurrence function to compute the value of each entry, starting from the top-left entry, until we reach the bottom-right entry. The bottom row gives us all the probabilities P(k − 1, j), where 1 ≤ j ≤ |R(e, d)| − 1.

Matrix Reduction. According to the recurrence function in (5), the value of each entry depends only on its left entry and its upper-left entry. To speed up the calculation, we compute only the entries on which the bottom row depends.

We again take element e5 in our running example as an illustration. Its dynamic programming matrix is shown in Fig.2. The summation of the probabilities of each element being the 2nd element in R(e5, 50) is 0.8336.
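The recurrence can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: it assumes the neighbors are indexed 1, ..., |R(e, d)| in time-stamp order, keeps only one column of the matrix at a time (a space-saving variant of the matrix described above), and uses the probabilities 0.4, 0.8, 0.6, 0.6 inferred from Table 1 for e1, e2, e4, e5. The function name is hypothetical.

```python
def kth_element_probabilities(probs, k):
    """Return [PK(e_1), ..., PK(e_n)] for neighbors given in time-stamp order.

    probs[j-1] is the existence probability p_j of the j-th neighbor, and
    k >= 1. While processing neighbor j, col[i] holds P(i, j-1): the
    probability that exactly i of the first j-1 neighbors appear.
    """
    col = [1.0] + [0.0] * (k - 1)            # P(0, 0) = 1 and P(i, 0) = 0
    result = []
    for p in probs:
        # Equation (4): e_j is the k-th element when it appears and exactly
        # k-1 of the j-1 earlier neighbors appear.
        result.append(p * col[k - 1])
        # Recurrence (5), applied from high i to low i so that col[i-1]
        # still holds P(i-1, j-1) when col[i] is overwritten.
        for i in range(k - 1, 0, -1):
            col[i] = p * col[i - 1] + (1.0 - p) * col[i]
        col[0] *= 1.0 - p                     # P(0, j) = P(0, j-1) * (1 - p_j)
    return result


pk = kth_element_probabilities([0.4, 0.8, 0.6, 0.6], k=2)
print([round(x, 4) for x in pk], round(sum(pk), 4))
```

For the running example the four values are 0, 0.32, 0.336 and 0.1776, whose sum 0.8336 agrees with the total obtained from the possible worlds w5 to w15.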
Fig.2. Calculating Σ_{ei∈R(e5,50)} PK(ei, 50, 2) using DPA.

According to the discussion above, to calculate (k, d, λ)-outliers on elements in a sliding window with length m, DPA scans the elements in the window sequentially. For each element e, DPA calculates k(|R(e, d)| − k) cells (notice that if |R(e, d)| ≤ k, no calculation is required). Therefore, the time complexity of DPA is O(km|R|), where |R| is the average size of R(e, d) for each element in the sliding window. In the worst case where |R(e, d)| = m for every element, the total time complexity is O(km²).

4 Pruning Strategies for Recent m Elements

Given a sliding window with length m, each element ei in the sliding window could be the k-th element when we check a single element e in the sliding window. We use Pr(R(e, d), k) = Σ_{ei∈R(e,d)} PK(ei, d, k) to express the summation of the probabilities of all elements in R(e, d) being the k-th element.

4.1 Pruning Strategy

Given a sliding window with length m, in order to avoid detecting outliers for every element in the window, we propose a pruning strategy that effectively prunes elements that cannot be (k, d, λ)-outliers to speed up the detection process.

Theorem 2. Given two uncertain datasets T and T′, Pr(T, k) ≤ Pr(T′, k) holds if T ⊆ T′.

Proof. Since T ⊆ T′, we get T′ = T ∪ (T′ − T). We use [T, k] to denote the event that at least k elements appear in T and "·" to express the product event. Therefore, [T, k] · [T′ − T, 0] is a sub-event of [T′, k]. Since elements are mutually independent, [T, k] and [T′ − T, 0] are also independent. Thus Pr(T, k) × Pr(T′ − T, 0) ≤ Pr(T′, k), where [T′ − T, 0] is a certain event and Pr(T′ − T, 0) is 1. Therefore, Pr(T, k) ≤ Pr(T′, k) holds. □

Theorem 3. Given an element e in a sliding window, if we can find a radius d′ (d′ < d) such that Pr(R(e, d′), k) > λ, then all elements in R(e, d − d′) are not outliers.

Proof. We use Fig.3 to prove this theorem. Let the grey circle in the figure denote R(e, d′). We randomly choose an element ej in R(e, d′) and an element ei in R(e, d − d′). Based on the triangle inequality, we know dist(ei, ej) ≤ dist(ei, e) + dist(ej, e). Since dist(ei, e) ≤ d − d′ and dist(ej, e) ≤ d′, we get dist(ei, ej) ≤ d. Therefore, we conclude that all elements in R(e, d′) must also appear in R(ei, d). Since Pr(R(e, d′), k) > λ,
Fig.3. Pruning strategy.
then, by Theorem 2, Pr(R(ei, d), k) ≥ Pr(R(e, d′), k) > λ holds. Therefore, no ei in R(e, d − d′) can be an outlier. □

Based on Theorem 3, once such a radius d′ has been found for e, we can prune all the elements lying in R(e, d − d′) as non-outliers, which saves much of the cost of outlier detection. We call R(e, d − d′) the pruning area.

4.2 Checking Elements in a Single Window
Given an element e in a sliding window with length m, if e is detected as a non-outlier, then we can use the pruning strategy to filter some other elements in the window. Ideally, we want to check first an element that can filter more elements than the others. In this subsection we propose PBA to find a good processing order in the sliding window.

When only considering elements in one window, we cannot say which order is better because the data distribution in the window is unknown. However, when we consider the recent m elements over time, there are many joint elements in different sliding windows. We can use this feature to find a processing order with low cost. The basic idea is to retain the elements that have pruning ability and can survive as long as possible in the window. Therefore, we process the elements in a sliding window starting from the newly arrived elements and moving to the old ones.

Before discussing the detection process, we categorize each element in a sliding window into one of four states: unmarked, outlier, ignored, and filter. Given a sliding window, all elements in this window are initially unmarked. During the processing, we can mark elements with the other three states as shown in Table 2.
Table 2. Different States of an Element e

| State | Description | Signature ⟨counter, d′⟩ |
|---|---|---|
| unmarked | Element e needs to be checked. | ⟨0, −1⟩ |
| outlier | Element e is detected as an outlier. | ⟨−1, −1⟩ |
| ignored | Element e is a non-outlier pruned by another element. | ⟨c, −1⟩ (c > 0) |
| filter | Element e is a non-outlier and might be a filter to prune other elements. | ⟨c, d′⟩ (c > 0, 0 ≤ d′ ≤ d) |

We use Algorithm 1 in Fig.4 to describe our PBA approach. We start from an unmarked element with the largest time stamp. If the d-neighborhood of the element ei contains at least k elements, we then find the k-th closest element e to ei and let the radius d′ = dist(e, ei), enlarging d′ to the next closest element as long as Pr(R(ei, d′), k) ≤ λ.

Consider the running example shown in Fig.1 and let k = 2, d = 50, and λ = 0.65. We start from element e5 with the largest time stamp in the sliding window. We can find a radius d′ = 30 such that Pr(R(e5, d′), 2) > λ; therefore, e5 is not an outlier and we mark e5 as filter. We can further use e5 to prune the elements in R(e5, d − d′) = R(e5, 20): e1 and e4 are marked as ignored since they are not outliers and can be filtered. We then process element e3 and detect it as an outlier since R(e3, 50) only contains one element. We finally check e2 and mark it as filter according to Pr(R(e2, 50), 2).

Algorithm 1. Checking Recent m Elements
Input: recent m elements e1, ..., em; integers k, d; probability threshold λ;
Output: (k, d, λ)-outliers;

    while there exists at least one unmarked element do
        Choose an unmarked element ei with the largest time stamp;
        Calculate R(ei, d);
        flag = FALSE;
        if R(ei, d) contains at least k elements then
            Sort elements in R(ei, d) in ascending order of their distances to ei;
            Find the k-th closest element e to ei;
            Let d′ = dist(e, ei) and j = 0;
            while Pr(R(ei, d′), k) ≤ λ and k + j < |R(ei, d)| do
                j++;
                Let d′ = dist(e′, ei), where e′ is the (k + j)-th closest element to ei;
            if Pr(R(ei, d′), k) > λ then
                flag = TRUE;
        if flag == TRUE then
            Mark ei as filter;
            if d − d′ > 0 then
                Mark elements except ei in R(ei, d − d′) as ignored;
        else
            Mark ei as outlier;
            Output ei;

Fig.4. PBA algorithm.
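The single-window procedure can be sketched in Python as below. This is an illustrative reading of Algorithm 1, not the authors' code. The coordinates in `window` are hypothetical (Fig.1(b) is not reproduced here) and were chosen only so that the outcome resembles the running example; the probabilities are those inferred from Table 1, with an arbitrary value for e3. The helper `pr_at_least_k` evaluates Pr(R, k) with the dynamic program of Section 3, and the sketch marks only still-unmarked elements as ignored.

```python
from math import dist


def pr_at_least_k(probs, k):
    """Probability that at least k of the independent elements exist."""
    col = [1.0] + [0.0] * (k - 1)     # col[i]: exactly i elements seen so far exist
    total = 0.0
    for p in probs:
        total += p * col[k - 1]       # this element is the k-th one to appear
        for i in range(k - 1, 0, -1):
            col[i] = p * col[i - 1] + (1.0 - p) * col[i]
        col[0] *= 1.0 - p
    return total


def pba_single_window(window, k, d, lam):
    """window: list of (point, probability) pairs in time-stamp order."""
    states = ["unmarked"] * len(window)
    for i in range(len(window) - 1, -1, -1):          # newest element first
        if states[i] != "unmarked":
            continue
        center = window[i][0]
        # d-neighborhood of e_i (it contains e_i itself), sorted by distance.
        nbrs = sorted((dist(center, q), j) for j, (q, _) in enumerate(window))
        nbrs = [(dj, j) for dj, j in nbrs if dj <= d]
        d_prime = None
        # Grow the radius from the k-th closest neighbor outwards.
        for size in range(k, len(nbrs) + 1):
            probs = [window[j][1] for _, j in nbrs[:size]]
            if pr_at_least_k(probs, k) > lam:
                d_prime = nbrs[size - 1][0]
                break
        if d_prime is None:
            states[i] = "outlier"
            continue
        states[i] = "filter"
        if d - d_prime > 0:                            # non-empty pruning area
            for dj, j in nbrs:
                if j != i and dj <= d - d_prime and states[j] == "unmarked":
                    states[j] = "ignored"
    return states


# Hypothetical coordinates for e1..e5 (time-stamp order).
window = [((-20.0, 0.0), 0.4), ((30.0, 0.0), 0.8), ((200.0, 200.0), 0.5),
          ((0.0, 10.0), 0.6), ((0.0, 0.0), 0.6)]
states = pba_single_window(window, k=2, d=50.0, lam=0.65)
print(states)
```

With these assumed coordinates, e5 becomes a filter with d′ = 30 and prunes e1 and e4, e3 is reported as an outlier, and e2 is checked last and marked as a filter, mirroring the walk-through above.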
4.3 Checking Recent m Elements Incrementally
Now we discuss outlier detection on the recent m elements when sliding the window. Let each sliding step be l; when new elements arrive, the oldest l elements in the sliding window expire. A straightforward approach is to check the elements in each sliding window as discussed in Subsection 4.2. In this subsection, we propose an efficient approach that takes advantage of the joint elements shared among sliding windows. For example, let the sliding step l be 1. When the window slides from DS[1, 5] to DS[2, 6], the elements e2, e3, e4, and e5 are joint elements, which have been checked in DS[1, 5]. We need to remove the effect of e1 and consider the newly arrived element e6. After the window slides, only some of the joint elements can be affected by the expired elements and
the newly arrived elements; the states of the remaining elements will not change. Therefore, we only consider those affected elements, which improves the efficiency of dynamic outlier detection.

4.3.1 Effects on Joint Elements

The expired elements and newly arrived elements might affect the statuses of joint elements. A joint element can have one of three statuses: outlier, ignored, or filter. In this subsection we discuss the effect on a joint element according to its status.

Effect of an Expired Element on Joint Elements. When an element e′ expires, its effect on joint elements should be discarded. If a joint element e is marked as outlier, then removing e′ cannot affect the status of e since this removal cannot increase the probability Pr(R(e, d), k). If e is marked as ignored or filter, it means Pr(R(e, d), k) > λ held before e′ was removed from the sliding window. When e′ belongs to R(e, d), this removal might decrease Pr(R(e, d), k); otherwise, the status of e will not change.

Effect of an Arrival Element on Joint Elements. Given a newly arrived element e′, this insertion can also affect the statuses of joint elements. If a joint element e is marked as outlier, then this insertion might increase Pr(R(e, d), k) and make e become a non-outlier. If e is marked as ignored or filter, this insertion can only increase Pr(R(e, d), k) and cannot change the status of e.

4.3.2 Lists on d-Neighborhoods

In order to utilize the effects on joint elements, we propose a data structure that makes the checking process more efficient. Each element e in a sliding window is associated with a signature ⟨counter, d′⟩, where counter records the number of times e has been ignored and d′ is the shortest radius that satisfies Pr(R(e, d′), k) > λ. We list the value of the signature for the different statuses in the third column of Table 2. If an element e has been checked (i.e., marked as either filter or outlier), we create a list of element-distance pairs (eid, dist(e, eid)) for e.
All the elements are sorted in ascending order according to their distances dist(e, eid) to e. We use Fig.5 to show the different statuses and lists for the elements in DS[1, 5]. Let k = 2, d = 50, and λ = 0.65. The signature of e1 is ⟨1, −1⟩, which means e1 is an ignored element filtered by a certain filter ei because e1 lies in the filter range R(ei, d − d′i). The signature of e2 is ⟨1, 30⟩, which means e2 is a filter and 30 is the shortest radius satisfying Pr(R(e2, 30), 2) > λ. It also maintains a list of the elements in R(e2, d). Using the pair ⟨1, 30⟩ we can calculate the filter radius as d − d′ = 50 − 30 = 20; therefore, we know that the first two elements e2 and e4 belong to the filter range R(e2, 20) and can be filtered by e2. Similarly, e5 is a filter and maintains a list of the elements in its filter range. The element e3 is marked as an outlier and we set its signature to ⟨−1, −1⟩. It also maintains a list of its d-neighborhood R(e3, 50).

Fig.5. Data structure for elements between e1 and e5.

4.3.3 Checking Elements Incrementally Using Lists of d-Neighborhoods

By using the lists of d-neighborhoods, we can utilize the above result and develop an efficient two-step pruning-based approach to detect outliers on a sliding window as follows.

Removing an Expired Element. Given an expired element e, if it is an ignored element, we search for e in all lists; otherwise, we search for e in the lists of those elements appearing in the list of e. If the list of an element ei contains a pair (e, de), we then modify the list according to the following scenarios: (i) if ei is an outlier, we remove (e, de) from the list; (ii) if ei is a filter and e is outside of R(ei, d′ei), that is, de > d′ei, then we remove (e, de) directly since this removal cannot increase the minimal radius d′ei that satisfies Pr(R(ei, d′ei), k) > λ. Otherwise, we decrease the counter of each ignored element in the filter range R(ei, d − d′ei) by 1 since ei might not be a filter any more. We then set the signature of ei to ⟨0, −1⟩ since ei should be reprocessed after removing the expired element e.
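The signature convention of Table 2 and the filter range derived from a signature can be sketched as follows. The function names and the list layout are hypothetical, and the distance 15 assigned to e4 in the usage lines is an assumed value, since the text only states that e4 lies within the filter range of e2.

```python
def state_of(counter, d_prime):
    """Decode a signature <counter, d'> into the state named in Table 2."""
    if d_prime >= 0:
        return "filter"       # <c, d'> with a valid radius d'
    if counter == -1:
        return "outlier"      # <-1, -1>
    if counter > 0:
        return "ignored"      # <c, -1> with c > 0
    return "unmarked"         # <0, -1>


def filter_range(pairs, d, d_prime):
    """Ids in the pruning area R(e, d - d') of a filter e.

    pairs is the filter's list of (element id, distance) pairs, sorted in
    ascending order of distance.
    """
    radius = d - d_prime
    return [eid for eid, dist_to_e in pairs if dist_to_e <= radius]


# List kept by the filter e2 with signature <1, 30>; the entry for e4 uses an
# assumed distance.
e2_list = [("e2", 0), ("e4", 15), ("e5", 30), ("e1", 50)]
print(state_of(1, 30), filter_range(e2_list, d=50, d_prime=30))
```

Keeping each list sorted by distance means the filter range is always a prefix of the list, so a removal or insertion only needs to compare one distance against the stored radius.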
Fig.6. Data structure for elements after removing e1 and before processing e6 .
For instance, Fig.6 shows the checking process between removing e1 and processing the arrival element e6. The expired element e1 is an ignored element, so we need to search the lists of e2, e3, and e5. We directly remove (e1, 50) from the list of the filter e2 since 50 is greater than the radius 30 in the signature of e2. The
list of the outlier e3 does not contain e1. In the list of the filter e5, dist(e1, e5) = 20 is less than the radius 30 in the signature of e5, so we remove (e1, 20), decrease the counter of e4 by 1, and change the signature of e5 to ⟨0, −1⟩.

Inserting an Arrival Element. Given a newly arrived element, let its signature be ⟨0, −1⟩, which means it is an unprocessed element. We start from the unprocessed element e with the largest time stamp. We need to check the filters to see whether the distance between e and a filter ei is less than the filter radius of ei. If so, we insert (e, dist(e, ei)) into the list of ei and increase the counter of e by 1. Therefore, e becomes an ignored element. If we cannot find such a filter, then we check each outlier e′. If dist(e, e′) ≤ d, we set the signature of e′ to ⟨0, −1⟩ since this insertion might increase Pr(R(e′, d), k) and we need to reprocess e′. Otherwise, we check whether e is a filter by calculating R(e, d) and increase the counter of each element inside its filter range by 1.
Fig.7. Dynamically maintaining data structure after removing e1 and inserting e6.
For example, we use Fig.7 to show the checking process after inserting e6 into the sliding window. We start from the unprocessed element e6 and find the filter e2 shown in Fig.6. Since dist(e2, e6) = 10 is less than the filter radius (50 − 30 = 20) of e2, we insert e6 into the list of e2 and increase the counter of e6 by 1. We then check the outlier e3 and keep it unchanged because dist(e3, e6) > 50. We then process the next unprocessed element e5, and the final result is shown in Fig.7.
5 Experimental Results
We evaluated our outlier detection technique on probabilistic data streams from two aspects: i) performance on detecting elements in a single window, and ii) performance on detecting the recent m elements when sliding the window. We generated 10^5 two-dimensional elements; each dimension is a floating-point value uniformly distributed in [0, 1000]. Each element carries a randomly generated probability between 0 and 1.0. All elements were mutually independent. We increased the time stamp of each newly generated element by 1.
The default values of the three parameters k, d, and λ for outlier detection were 40, 20, and 0.7, respectively. The default length of the sliding window was 10^5 and the sliding step l was 3000. All the algorithms were implemented using Microsoft Visual C++. The experiments were run on an HP DX 2708 PC with an Intel 2.33GHz Dual Core CPU, 2GB memory, and a 160GB disk, running Microsoft Windows XP Professional.
5.1 Performance on Elements in a Single Window
We compared two approaches, BA and s-PBA, to evaluate performance on elements in a single window. BA is a basic approach that processes all elements in the window using the dynamic programming approach without any pruning, whereas s-PBA is our pruning-based approach on a single window. We varied k, d, λ, and the length of the sliding window, respectively.
Varying k. We varied k from 10 to 100. Fig.8(a) shows that the running time decreased when increasing the value of k, because the number of outliers decreased (from 22399 to 1995), which reduced the number of processed elements. As shown in Fig.8(b), our proposed pruning technique could filter many elements, which
Fig.8. Running time and pruned records when varying k. (a) Running time. (b) Filterability.
Bin Wang et al.: Outlier Detection on Probabilistic Data Streams
makes s-PBA run much faster than BA, as shown in Fig.8(a). Moreover, increasing k enlarged the filtering area of non-outliers and made the pruning techniques more effective.
Varying d. We varied d from 10 to 30. Fig.9(a) shows that the running time of BA is much higher than that of s-PBA. The reason is that when d became larger, the number of d-neighborhoods increased, which made it much more costly to process outliers. Without the pruning technique, the entire cost of BA increased rapidly. For s-PBA, a larger d enlarged the filtering area and produced more ignored elements (as shown in Fig.9(b)), which saved much of the time spent on non-outliers and reduced the entire cost.
Varying λ. We varied λ from 0.1 to 0.9. Fig.10(a) demonstrates that the running time of both algorithms grew slowly when increasing λ. BA ran much slower than s-PBA because more outliers needed to be processed (from 7808 to 10031). For s-PBA, a larger λ caused a smaller filtering area, which degraded the effect of our pruning techniques (as shown in Fig.10(b)) and therefore increased the running time.
Varying Length of a Sliding Window. We varied the length of the sliding window from 6 × 10^4 to 15 × 10^4. Fig.11(a) shows that the running time of both BA and s-PBA increased when increasing the length of the sliding window. The running time of s-PBA grew much
Fig.9. Running time and pruned records when varying d. (a) Running time. (b) Filterability.
Fig.10. Running time and pruned records when varying λ. (a) Running time. (b) Filterability.
Fig.11. Effects of the length of a sliding window. (a) Running time. (b) Filterability.
slower than the running time of BA, because BA needed to process more elements when the window was long. In addition, the larger the window, the more d-neighborhoods there were; therefore the filtering area became larger and s-PBA could filter more elements (as demonstrated in Fig.11(b)).
5.2 Performance on Incrementally Detecting Recent m Elements
We compared two approaches, n-PBA and r-PBA, to evaluate performance on incrementally detecting the recent m elements. n-PBA is a naive approach that processes the elements in each sliding window separately, whereas r-PBA is our proposed pruning-based approach that utilizes the joint elements among sliding windows.
Varying k. We varied k from 10 to 100. Fig.12(a) shows that both n-PBA and r-PBA ran much faster when increasing k, because the average number of unprocessed elements decreased. As shown in Fig.12(b), by utilizing the joint elements, r-PBA ran much faster than n-PBA. Moreover, increasing k enlarged the filtering area of non-outliers, made the pruning technique more effective, and decreased the number of processed elements, as shown in Fig.12(b).
Varying d. We varied d from 10 to 30. Fig.13(a) shows the running time of n-PBA and r-PBA. Both decreased as d increased, because most elements became ignored elements and fewer elements needed to be processed. However, the running times of the two algorithms became much closer when d became large enough. As expected, r-PBA ran faster than n-PBA because it shares the joint elements. The two lines became closer because, when d increased, the number of d-neighborhoods increased, which made it much more costly to maintain the neighborhood lists. The number of processed elements in Fig.13(b) also affected the advantage of r-PBA.
Fig.13. Effects of d values. (a) Running time. (b) Number of processed elements.
Varying λ. We varied λ from 0.1 to 0.9. Fig.14(a) demonstrates that both n-PBA and r-PBA ran slower when increasing λ, because a larger λ produced more outliers and decreased the filterability, so more time was needed to process them. The approach r-PBA still ran faster than n-PBA. Fig.14(b) shows that the number of processed elements of r-PBA is smaller than that of n-PBA.
Fig.12. Effects of k values. (a) Running time. (b) Number of processed elements.
Varying l. We varied the sliding step l from 3 × 10^3 to 3 × 10^4. Fig.15(a) shows that when l became larger, the running time of n-PBA stayed steady whereas r-PBA became slower. This is because a larger sliding step caused
Fig.14. Effects of λ. (a) Running time. (b) Number of processed elements.
Fig.15. Effects of the sliding step l. (a) Running time. (b) Number of processed elements.
more cost for maintaining the d-neighborhood lists and reduced the number of joint elements among different sliding windows. The numbers of processed elements of n-PBA and r-PBA in Fig.15(b) demonstrate the same result.
6 Conclusions
In this paper, we studied outlier detection on probabilistic data streams, gave a new definition of distance-based outliers on probabilistic data streams, and proposed a dynamic programming algorithm (DPA) and an effective pruning-based approach (PBA) to detect outliers efficiently. Our experimental results show that our approach is efficient and scalable for detecting outliers on probabilistic data streams. In future work, we plan to consider probabilistic data streams with multiple dimensions and exclusive conditions.
Bin Wang received the Ph.D. degree in computer science from the Northeastern University, China, in 2008. He is currently a lecturer in computer science at the Northeastern University. He is a member of CCF. His research interests include design and analysis of algorithms, databases, data quality, and distributed systems.
Xiao-Chun Yang received the Ph.D. degree in computer science from Northeastern University, China, in 2001. She is a professor at Northeastern University, China. She is a member of the ACM and the IEEE Computer Society, and a senior member of CCF. Her research interests are in the areas of data quality, query processing, data privacy, and distributed data management.
Guo-Ren Wang received the Ph.D. degree from Northeastern University, China, in 1996. He is a professor at Northeastern University of China. He is a member of IEEE, ACM, and a senior member of CCF. His research interests are XML data management, query processing and optimization, bioinformatics, high-dimensional indexing, parallel database systems, and P2P data management. He has published more than 100 research papers.
Ge Yu received his Ph.D. degree in computer science from Kyushu University of Japan in 1996. He is a professor at Northeastern University of China. He is a member of IEEE, ACM, and a senior member of CCF. His research interests include database theory and technology, distributed and parallel systems, embedded software, network information security.