One Pass Outlier Detection for Streaming Categorical Data

Swee Chuan Tan1, Si Hao Yip1, Ashfaqur Rahman2

1 SIM University, School of Business, 535A Clementi Road, Singapore
{jamestansc, shyip002}@unisim.edu.sg

2 Intelligent Sensing and Systems Lab, CSIRO, Hobart, Australia
[email protected]

Abstract. Attribute Value Frequency (AVF) is a simple yet fast and effective method for detecting outliers in categorical data. Previous work has shown that AVF requires less processing time than other existing techniques while maintaining very good outlier detection accuracy. However, AVF works on static data only; this means that AVF cannot be used in data stream applications such as sensor data monitoring. In this paper, we introduce a modified version of AVF, known as One Pass AVF, to deal with streaming categorical data. We compare this new algorithm with AVF in terms of outlier detection accuracy. We also apply One Pass AVF to detect unreliable data points (i.e., outliers) in a marine sensor data monitoring application. The proposed algorithm is experimentally shown to be as effective as AVF while being capable of detecting outliers in streaming categorical data.

1 Introduction

Outlier detection is the process of identifying instances that exhibit unusual behaviour in a system. Effective detection of outliers can lead to the discovery of valuable information in the data. Over the years, mining for outliers has received significant attention due to its wide applicability in areas such as detecting fraudulent usage of credit cards, detecting unauthorized access in computer networks, weather prediction, and environmental monitoring.

A number of existing methods are designed for detecting outliers in continuous data. Most of these methods use distances between data points to detect outliers. In the case of data with categorical attributes, attempts are often made to map categorical features to numerical values. Such mappings impose an arbitrary ordering on the categorical values and may produce unreliable results.

Another issue is related to the big data phenomenon. Many systems today are able to generate and capture real-time data continuously. Some examples include real-time data acquisition systems, condition monitoring systems, and sales transaction systems. It is a challenging task to effectively detect outliers occurring in data streams. Traditional outlier detection approaches are no longer feasible, as they only deal with static data sets and require multiple scans of the data to produce effective results. In a data stream setting, outlier detection algorithms (e.g., [8]) need to process each data item within a strict time constraint and can only afford to analyze the data with a single scan.

In this paper, we introduce a modified version of Attribute Value Frequency (denoted AVF, proposed by Koufakou et al. [6]) for outlier detection. This version, known as One Pass Attribute Value Frequency (One Pass AVF), detects outliers in streaming categorical data. Note that AVF computes the frequency of each attribute-value pair over the entire data set. In contrast, One Pass AVF computes the cumulative probability of each attribute-value pair identified up to the current point of processing the data stream. As a result, One Pass AVF is capable of detecting outliers in just a single scan of the data, which allows it to process massive streaming data. The focus of this paper is twofold: (i) to compare One Pass AVF with the original AVF in terms of detection accuracy, and (ii) to apply One Pass AVF in a real-world marine sensor data monitoring application.

2 Related Work

Most existing outlier detection methods are designed to work on numerical data. For example, distance-based and density-based techniques cannot handle data sets containing categorical attributes because the notions of 'distance' and 'density' are not well defined for such data. Yet categorical data is commonly found in many real-world databases. In the following, we review some of the existing methods for mining categorical outliers.

Frequent Pattern Outlier Factor (FPOF): This method [4] uses an association rule mining technique (e.g., [1]) to find frequently occurring itemsets in the data. It then assigns an outlier score to each data point based on the number of frequent itemsets associated with that point. As real-world data sets are usually large, this approach requires considerable processing time to find frequent itemsets. In addition, it may take many attempts to locate an appropriate support threshold for identifying frequent itemsets.

Hypergraph-based Outlier Test: The Hypergraph-based Outlier Test (HOT) was proposed [11] to deal with large data sets with missing values and mixed-type attributes. HOT uses connectivity to process data with missing values, and its detection results are easy to interpret. However, HOT cannot be extended to deal with streaming data.

Attribute Value Frequency: Most traditional outlier detection approaches, such as the Greedy algorithm [5], require multiple scans of the data to produce effective results. Indeed, these methods tend to slow down when the data set becomes large. To address this problem, Koufakou et al. [6] propose the Attribute Value Frequency (AVF) method to detect outliers that have few occurrences and irregular attribute values. The AVF outlier score is obtained from the relative frequencies of an instance's attribute values. It is formulated as AVFScore(xi) = (1/m) ∙ Σj f(xij), where m is the number of categorical attributes, the summation runs from j = 1, 2, ..., m, and f(xij) is the relative frequency of the jth attribute value of instance xi in the data set. Data points that are few and different [10] have lower AVF scores and thus a higher probability of being outliers.

As the AVF algorithm requires fewer scans of the data to identify outliers, it is significantly faster than many existing methods. Unlike FPOF, AVF is easy to implement since it does not generate frequent itemsets; it is also easier to use since it does not require users to set a minimum support threshold. One major drawback of AVF is that it works on static data sets only: the algorithm needs to load the entire data set into computer memory and scan through all the records before it can start detecting outliers. Fortunately, it is quite easy to extend AVF to deal with streaming data. In the following section, we present such an extension of AVF.
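For concreteness, the following is a minimal Python sketch of how an AVF score of this form could be computed over a static categorical data set; it requires two passes over the data, and the function and variable names are illustrative rather than taken from [6].

from collections import Counter

def avf_scores(data):
    # data: a list of tuples, each holding the m categorical attribute
    # values of one instance.
    n = len(data)
    m = len(data[0])
    # First pass: relative frequency of every (attribute, value) pair.
    freq = [Counter(row[j] for row in data) for j in range(m)]
    # Second pass: score each instance; lower scores suggest outliers.
    return [sum(freq[j][row[j]] / n for j in range(m)) / m for row in data]

# The third record differs in both attributes and receives the lowest score.
print(avf_scores([("p", "x"), ("p", "x"), ("q", "y"), ("p", "x")]))

Both passes require the whole data set to be available, which is precisely the limitation addressed in the next section.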

3 Proposed Method

We name the proposed extension of AVF the One Pass Attribute Value Frequency (One Pass AVF) method for detecting streaming categorical outliers. Note that One Pass AVF uses the cumulative probability of each attribute-value pair over the instances seen so far in the data stream, whereas AVF computes the relative frequency of each attribute-value pair over the entire data set. The use of cumulative probabilities allows One Pass AVF to perform one-pass outlier detection in constant time and with a fixed amount of memory when processing each instance.

In a data stream environment, instances flow in continuously, there is no room to store the whole data set, and the algorithm cannot access the data instances multiple times. As each instance streams in, One Pass AVF processes it and then discards it right away, thereby preventing the memory from running out of space. The One Pass AVF score is formulated as OPAVFScore(xi) = (1/m) ∙ Σj p(xij), where m is the number of categorical attributes and p(xij) is the cumulative probability of the jth attribute value of instance xi in the data stream. The summation runs from j = 1, 2, ..., m.

Algorithm 1 shows the proposed One Pass AVF algorithm. As each streaming data point x is read in, the cumulative probability of each attribute-value pair found in x is computed. Then, the OPAVFScore of x is computed and reported. Finally, x is discarded to release computer memory before the next streaming instance is processed.

Consider a toy data set with one nominal attribute A and four observations: {'p', 'p', 'q', 'p'}. In the beginning, Algorithm 1 reads in the first streaming data point 'p', with a frequency of occurrence of 1 (i.e., f(A='p') = 1) and a cumulative probability of occurrence of 1 (i.e., p(A='p') = 1/1). For the second streaming point 'p', the frequency of occurrence is 2 and its cumulative probability of occurrence is 1 (since p(A='p') = 2/2 = 1). For the third streaming point, the frequency of occurrence of 'q' is 1 and its cumulative probability of occurrence is 1/3. The fourth and last streaming point 'p' has a frequency of occurrence of 3 and a cumulative probability of occurrence of 3/4. At the end of the process, the list of OPAVFScores {1, 1, 1/3, 3/4} is produced. With respect to the stream's sequence, the algorithm correctly assigns the third observation 'q' the lowest OPAVFScore of 1/3, signifying that it is an outlier.

Algorithm 1: The proposed One Pass AVF algorithm. f(A=c) is the number of times that the event A=c has occurred so far. p(A=c) is the cumulative probability of event A=c.

Algorithm OnePassAVF begin count
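The procedure described above can be sketched in a few lines of Python; this is a minimal illustration only, and the running-count dictionaries and all names are our assumptions rather than the original listing.

from collections import defaultdict

def one_pass_avf(stream, m):
    # stream: an iterable of m-tuples of categorical values.
    # Yields the OPAVFScore of each instance as it arrives.
    counts = [defaultdict(int) for _ in range(m)]  # f(A_j = c) seen so far
    n = 0
    for x in stream:
        n += 1
        score = 0.0
        for j, value in enumerate(x):
            counts[j][value] += 1            # update running frequency f(A_j = value)
            score += counts[j][value] / n    # cumulative probability p(A_j = value)
        yield score / m                      # OPAVFScore of x; x can now be discarded

# Toy stream from the text: one attribute A with the values p, p, q, p.
print(list(one_pass_avf([("p",), ("p",), ("q",), ("p",)], m=1)))
# Output: [1.0, 1.0, 0.333..., 0.75], matching the scores {1, 1, 1/3, 3/4} derived above.

Each instance is processed in O(m) time and discarded immediately, so only the running counts are retained in memory.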
