International Journal on Artificial Intelligence Tools, Vol. 24, No. 3 (2015) 1550003 (16 pages)
© World Scientific Publishing Company
DOI: 10.1142/S0218213015500037
Semi-Supervised Outlier Detection with Only Positive and Unlabeled Data Based on Fuzzy Clustering
Armin Daneshpazhouh∗ and Ashkan Sami†
CSE and IT Department, School of Electrical and Computer Engineering,
Shiraz University, Shiraz, Iran
∗[email protected]
†[email protected]

Received 17 July 2013
Accepted 3 August 2014
Published 9 June 2015
The task of semi-supervised outlier detection is to find instances that deviate from the rest of the data, with the help of a few labeled examples. This problem is especially important in applications such as fraud detection and intrusion detection. Most existing techniques are unsupervised, while semi-supervised approaches typically require both negative and positive instances to detect outliers. However, in many real-world applications only very few positive labeled examples are available. This paper proposes an innovative approach to address this problem. The proposed method works in two steps. First, some reliable negative instances are extracted by a kNN-based algorithm. Then, fuzzy clustering that uses both the negative and positive examples is employed to detect outliers. Experimental results on real data sets demonstrate that the proposed approach outperforms previous unsupervised state-of-the-art methods in detecting outliers.

Keywords: Data mining; fraud detection; outlier detection; semi-supervised learning.
1. Introduction

One of the best-known definitions of an outlier is given by Hawkins:1 "an outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism". In many real-world applications such as fraud detection, intrusion detection, and medicine, finding such exceptional instances is more interesting than identifying the common cases. However, outliers are not always easy to identify; in fraud detection or network intrusion, for example, outliers behave much like normal data, which makes the problem of identifying them more difficult. Many outlier detection techniques have been proposed in past decades, based on statistical, depth-based, distance-based, and density-based approaches.2,3 These techniques detect outliers effectively in many data sets. However, it has been shown that most algorithms developed to detect anomalies are not

†Corresponding author
accurate,4 because they flag normal data as outliers, which increases the false alarm rate. Outlier detection techniques fall into three groups according to the kind of supervision each method uses. The first approach is unsupervised, and most existing outlier detection techniques belong to it.5,6 Without labeled information, unsupervised methods usually suffer from high false alarm rates and low detection rates.7 The second approach is supervised.8 Supervised methods require a large number of labeled examples for training, which can be difficult and time-consuming to obtain. In many application domains, labeled data is expensive and requires much human effort, while gathering unlabeled data is much cheaper. Consequently, the third approach, semi-supervised learning, has attracted much attention in recent years9,10 and has been shown to achieve better generalization performance in many fields. In binary problems, there are two sets of labeled samples: one of positive instances and one of negative instances. However, in some cases the set of negative samples is not available, because labeling negative instances is costly. For instance, in fraud detection only the information of very few fraudulent instances, labeled as positive, is available. In this case, there is an incomplete set P of some positive instances and a set U of unlabeled samples, which contains both positive and negative instances. The task is to find other positive (fraudulent) instances using a semi-supervised approach based on the input sets P and U. The main contribution of this paper is as follows. This paper proposes an innovative approach to the problem of detecting outliers using only a few positive instances (outliers). First, a kNN-based algorithm is used to extract enough negative instances from the few positive examples and the unlabeled data.
Afterwards, fuzzy rough C-means clustering is employed to accurately identify outliers. The rest of the paper is organized as follows. Section 2 presents a brief survey of related work. Section 3 describes in detail the main parts of the proposed approach. Section 4 illustrates the experimental results, followed by the conclusion in Section 5.

2. Related Work

This section reviews previous work in three categories. The first category covers different methods of detecting outliers. The second includes semi-supervised techniques that use only positive and unlabeled data for learning. The last contains recently proposed semi-supervised outlier detection techniques. Previous studies in outlier detection mainly fall into the following categories. Distribution-based methods were first developed by the statistics community.11 Standard distribution models are fitted, and points that deviate from these models are flagged as outliers. For example, Yamanishi et al.12 used a Gaussian mixture model to represent normal behavior. In Ref. 13, this method is combined with a supervised approach to obtain general patterns for outliers. Depth-based methods are the second class of outlier mining in statistics.14 Based on some definition of depth, objects are organized in convex hull layers in data space according to
peeling depth. Outliers are expected to be found among objects with shallow depth values.15 These techniques are infeasible for large, high-dimensional data sets due to their extreme computational complexity. Distance-based outliers were originally suggested by Knorr et al.16 In distance-based methods, an outlier is an object that is far from its k nearest neighbors.17 In Ref. 18, outliers are defined using point distances in all dimensions. These methods therefore suffer from high cost and unpredictable performance owing to the curse of dimensionality. Deviation-based techniques identify outliers by inspecting the characteristics of objects and considering an object that deviates from these features as an outlier. The concept of the "local outlier" was proposed in Ref. 19. LOF is a local outlier factor computed for each point, which depends on the local density of its neighbors. These methods also face the problem of high dimensionality. Cluster-based outlier detection methods consider only small clusters as outliers; they identify outliers by removing clusters from the original data set.20 The concept of the "cluster-based local outlier" is proposed in Ref. 21. The problem of learning from positive and unlabeled data (LPU) is a binary classification setting: some classes are considered positive while the others are considered negative. Traditional semi-supervised classification techniques are not suitable for the LPU problem since they require labeled training examples for all classes. Four kinds of approaches exist for this problem. (1) Ignoring the unlabeled data and learning only from the labeled positive examples. In one-class SVM,22 a region covering almost all labeled positive examples is constructed, and all data outside this region is considered negative. By discarding the useful information in the unlabeled data, this approach easily overfits. (2) The most common approach is based on co-training with labeled positive data.
It usually uses a two-phase strategy: (i) extracting likely negative examples from the unlabeled set; (ii) applying traditional learning methods for classification. PEBL,23 S-EM, NB24 and Roc-Clu-SVM25 use this strategy to classify the unlabeled instances. (3) B-Pr26 and W-SVM27 are probabilistic approaches. They do not require extracting likely negative examples, but several parameters must be estimated, which directly affects performance; moreover, they cannot cope with imbalanced data sets. (4) LGN28 assumes that words belonging to the positive class are identically distributed in the labeled positive examples and the unlabeled data, and uses this assumption to generate an artificial negative document for learning. In practice, however, this assumption is very hard or impossible to satisfy.29 The last category comprises outlier detection algorithms that use semi-supervised learning.9,10 Gao et al.9 employ an objective function to detect outliers based on clustering. The method of Xue et al.7 is based on fuzzy rough C-means clustering and uses a different objective function for outlier detection. However, none of the mentioned methods can cope with the problem of detecting outliers using only positive instances. In contrast to the existing approaches, this paper directly addresses the problem of detecting outliers based on semi-supervised learning from positive and unlabeled examples.
3. The Proposed Approach: SSODPU (Semi-Supervised Outlier Detection with Positive and Unlabeled Data)

This section presents the proposed SSODPU method, which handles the case where only a few positive examples are available for training. The proposed approach consists of two phases: (1) extracting reliable negative examples from positive and unlabeled data; (2) detecting outliers using the new labeled examples.

3.1. Extracting reliable negative examples using a kNN-based algorithm

The kNN method,30 a well-known statistical technique, has been broadly studied in data mining. kNN is an instance-based method that processes the whole training set in the classification phase: an object is assigned to a class based on the votes of its k nearest neighbors. Prior knowledge about a problem can help choose a suitable value for the parameter k, the number of an object's neighbors. In binary classification problems it is desirable for k to be odd, to make ties less likely. The kNN algorithm can be adapted to the LPU problem by employing a ranking process.31 The distances between the unlabeled examples and the k nearest neighbors of each positive sample are measured, so the unlabeled examples can be ranked according to their distance values. As a result, instances that are more similar to the positive samples move to the top of the ranking list. The proposed algorithm for extracting negative examples using kNN is inspired by the method of Ref. 32. In this algorithm, the set P of positive instances, the set U of unlabeled data, the number of neighbors k, and the percentage of negative instances N are

Algorithm: Extracting reliable negative examples using kNN
Input: positive examples set P, unlabeled examples set U, the number of nearest neighbors k, the percentage of negative examples N
1.  foreach unlabeled example ui do
2.    foreach positive example vj do
3.      compute distance(ui, vj)    // Euclidean distance is used as the distance function
4.    endfor
5.  endfor
6.  select the k nearest neighbors of each vj (j = 1, …, p)
7.  foreach unlabeled example ui do
8.    foreach positive example vj (j = 1, …, p) do
9.      compute distance(ui, kNN(vj))    // Euclidean distance is used as the distance function
10.     compute sum distance(ui, vj, kNN(vj))
11.   endfor
12. endfor
13. rank the results for each vj (j = 1, …, p)
14. NE = top N% of examples with the highest sum distance from vj and kNN(vj) (j = 1, …, p)
Output: negative examples set NE
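As an illustration, the extraction procedure above can be sketched in Python/NumPy. This is a minimal sketch, assuming the k nearest neighbours of each positive sample are taken from the unlabeled set; the function and variable names are ours, not the paper's:

```python
import numpy as np

def extract_reliable_negatives(P, U, k, N):
    """Sketch of the kNN-based reliable-negative extraction.

    P : (p, d) array of positive examples
    U : (n, d) array of unlabeled examples
    k : number of nearest neighbours per positive example
    N : fraction (0-1) of top-ranked unlabeled examples kept per list
    """
    # Step 1: Euclidean distances dists[i, j] between u_i and v_j.
    dists = np.linalg.norm(U[:, None, :] - P[None, :, :], axis=2)
    negatives = set()
    for j in range(P.shape[0]):
        # k nearest unlabeled neighbours of positive sample v_j
        knn_idx = np.argsort(dists[:, j])[:k]
        # Step 2: sum of each u_i's distance to v_j and to v_j's k neighbours.
        sums = dists[:, j] + np.linalg.norm(
            U[:, None, :] - U[knn_idx][None, :, :], axis=2).sum(axis=1)
        # Step 3: rank descending; the farthest points are reliable negatives.
        top = np.argsort(-sums)[: max(1, int(N * U.shape[0]))]
        negatives.update(int(i) for i in top)
    # NE is the union of the per-positive top-N% lists (indices into U).
    return np.array(sorted(negatives))
```

A point that is far from every positive sample and from the positives' neighbourhoods ends up in NE, which matches the intuition that reliable negatives are the instances least similar to the known outliers.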
provided as input. The first step is to calculate the distances of the unlabeled instances in U from each positive sample in P; Euclidean distance is used as the distance measure. Then, for each positive instance in P, its k nearest neighbors are selected. The second step is to compute the distances of the unlabeled instances in U from the k nearest neighbors of each positive sample in P. Afterwards, for each unlabeled instance ui, the sum of its distances from each positive sample vj and from vj's k nearest neighbors is calculated. Finally, these summation lists are ranked in descending order for each positive sample vj. The final set NE is the union of the top N% of each list.

3.2. Detecting outliers using the new labeled examples

This section presents FRSSOD (Fuzzy Rough Semi-Supervised Outlier Detection),7 which detects outliers using both positive and negative instances via a semi-supervised approach. FRSSOD employs FRCM,33 a clustering method that can handle the overlap of clusters and uncertainty at class boundaries, so that the best approximation of the given data can be achieved. A significant difference distinguishes the proposed SSODPU approach from the FRSSOD algorithm: FRSSOD needs both positive and negative instances to detect outliers, whereas in many real-world applications only a few positive examples are available. Algorithms like FRSSOD cannot cope with this situation, since they require both positive and negative instances as a training set. Unlike these algorithms, SSODPU is a two-phase method that can properly deal with this problem. The ability to use only a few positive instances for training makes SSODPU unique and demonstrates its advantage over the FRSSOD algorithm.

3.2.1.
FRSSOD method

In FRSSOD, given a set of instances X = {x1, x2, …, xn}, of which the first l < n instances are labeled yi ∈ {1, 0}, i = 1, 2, …, l (where yi = 0 if the ith instance is an outlier and yi = 1 otherwise), the goal is to find an n × c matrix u = {uik | 1 ≤ i ≤ n, 1 ≤ k ≤ c}, where 0 ≤ uik ≤ 1 is the fuzzy membership of the ith point in the kth class, satisfying

$$\sum_{k=1}^{c} u_{ik} \in (0, 1] \;\; \text{if } x_i \text{ is a normal instance}, \qquad \sum_{k=1}^{c} u_{ik} = 0 \;\; \text{if } x_i \text{ is an outlier.} \tag{1}$$

The optimization problem is

$$\min_{u,v} J_m(u, v) = \sum_{i=1}^{n}\sum_{k=1}^{c} (u_{ik})^m d_{ik}^2 + \gamma_1 \left( n - \sum_{i=1}^{n}\sum_{k=1}^{c} u_{ik} \right)^{m} + \gamma_2 \sum_{i=1}^{l} \left( y_i - \sum_{k=1}^{c} u_{ik} \right)^{2} \tag{2}$$
subject to

$$\sum_{k=1}^{c} u_{ik} \le 1, \quad 0 \le u_{ik} \le 1, \quad i = 1, 2, \ldots, n, \; k = 1, 2, \ldots, c, \tag{3}$$

$$\gamma_1 > 0, \quad \gamma_2 > 0, \quad m \in (1, +\infty).$$

The first term represents the sum of squared errors. The second term constrains the number of outliers so that it does not become too large. The third term keeps the labeling consistent with the existing labels and punishes mislabeled instances. The two weighting parameters γ1 and γ2 make these three terms compete with each other.

3.2.2. A special case of FRSSOD (data sets with only one cluster)

This section presents a special case of FRSSOD for data sets consisting of only one cluster. The whole procedure remains the same, while some of the equations change slightly. The objective function is as follows:
$$\min_{u,v} J_m(u, v) = \sum_{i=1}^{n} (u_{ic})^m d_{ic}^2 + \gamma_1 \left( n - \sum_{i=1}^{n} u_{ic} \right)^{m} + \gamma_2 \sum_{i=1}^{l} (y_i - u_{ic})^2 \tag{4}$$

subject to

$$0 \le u_{ic} \le 1, \quad i = 1, 2, \ldots, n, \quad c = \text{cluster center}, \quad \gamma_1 > 0, \; \gamma_2 > 0, \; m \in (1, +\infty). \tag{5}$$

Algorithm: FRSSOD
Input: data set X with incomplete labeled subset of size l, number of clusters c, parameters γ1, γ2, ζ, m, and stop criterion ε
repeat (iteration t)
  initialize the cluster centers vj (j = 1, 2, …, c)
  // assign instances to the lower and upper approximations as in Ref. 34
  foreach xi in X do
    foreach j from 1 to c do
      compute dij = distance(xi, vj)
    endfor
    h = the class with minimum distance dih from xi
    A = { j | j = 1, 2, …, c, j ≠ h, dij / dih ≤ ζ }
    if A ≠ ∅ then
      xi belongs to the upper approximation of class h and of every class in A,
      and to no lower approximation (xi lies in a boundary area)
    else
      xi belongs to the lower approximation of class h
    endif
  endfor
  foreach i from 1 to n do
    foreach k from 1 to c do
      if xi is in the lower approximation of class k then
        uik = 1
      elseif xi is in a boundary area then
        uik = 1 / Σj=1..c (dik / dij)^(2/(m−1))
      endif
    endfor
  endfor
  // outlier decision for points in boundary areas (thresholds follow from
  // minimizing Eq. (2); see Ref. 7 for the exact inequalities)
  case xi is unlabeled:
    if declaring xi an outlier lowers the objective, set xi as an outlier and update uik to u′ik = 0
  case xi is labeled as a normal instance:
    same test, with the threshold raised by the γ2 label-consistency penalty
  case xi is labeled as an outlier:
    same test, with the threshold lowered by the γ2 label-consistency penalty
  foreach k from 1 to c do
    compute vk(t+1) = Σi=1..n (u′ik)^m xi / Σi=1..n (u′ik)^m
  endfor
until convergence (change in centers below ε)
Output: membership matrix u, cluster centers vk
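To make the single-cluster objective concrete, the following Python sketch evaluates Eq. (4) and performs the centre update used in the iteration. The function names are ours, the default parameter values (m = 2, γ1 = 0.063, γ2 = 0.1) are the ones reported in Section 4, and this is not the full FRSSOD implementation:

```python
import numpy as np

def frssod_objective_1cluster(X, u, v, y, l, m=2.0, g1=0.063, g2=0.1):
    """Evaluate the single-cluster FRSSOD objective of Eq. (4).

    X : (n, d) data; the first l rows carry labels y (1 = normal, 0 = outlier)
    u : (n,) fuzzy memberships to the single cluster (u[i] = 0 marks an outlier)
    v : (d,) cluster centre
    """
    n = X.shape[0]
    d2 = ((X - v) ** 2).sum(axis=1)            # squared distances d_ic^2
    term1 = ((u ** m) * d2).sum()              # weighted sum of squared errors
    term2 = g1 * (n - u.sum()) ** m            # penalises declaring too many outliers
    term3 = g2 * ((y[:l] - u[:l]) ** 2).sum()  # consistency with the given labels
    return term1 + term2 + term3

def update_center(X, u, m=2.0):
    """Centre update used in the iteration: v = sum(u_i^m * x_i) / sum(u_i^m)."""
    w = u ** m
    return (w[:, None] * X).sum(axis=0) / w.sum()
```

Lowering a point's membership trades the first term against the second, which is exactly the competition between the three terms that γ1 and γ2 control.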
3.3. Complexity analysis

The proposed SSODPU method consists of two phases, so the time complexity of each phase is calculated separately. The first phase, the algorithm for extracting reliable negative examples using kNN, has three inputs: the positive examples set P (of size p), the unlabeled examples set U (of size n), and the number of nearest neighbors k. The first step, calculating the distances of the n unlabeled examples from the p positive samples, can be performed in O(n·p). Computing the k nearest neighbors of the p samples takes O(n·k·p); since p ≪ n, this part has time complexity O(n·k). Finally, the results can be ranked in O(n log n). The FRSSOD algorithm has three inputs: the data set X of size n, the incomplete labeled subset of size l, and the number of clusters c. Computing the distance of the n samples from each cluster center costs O(n·c) for one iteration and O(n·c·l) over all iterations. A worst-case analysis of the main step of the algorithm, which carries the majority of the computational load, yields a runtime of O(n²·c·l). However, since c ≪ n and l ≪ n, the time complexity reduces to O(n²), which is comparable to most existing outlier detection algorithms.

4. Experimental Results

A comprehensive performance study was conducted to evaluate the SSODPU method; this section presents the results. The proposed method is compared with eight state-of-the-art outlier detection algorithms on real-life data sets obtained from the UCI Machine Learning Repository.35 All experiments were performed on a 1.6 GHz Pentium PC with 1 GB of main memory running Windows 8 (32-bit). All algorithms were implemented in MATLAB version 7.12.0.635 (R2011a).

4.1. Modified Wisconsin breast cancer data

The data set used in this experiment is a reduced form of the Wisconsin breast cancer data. The original WDBC data set is commonly used in medical diagnosis; it has 699 instances with 9 attributes.
Each record is labeled as benign (458, or 65.5%) or malignant (241, or 34.5%). To form a very unbalanced distribution, some of the benign and malignant records were removed. As a result, two different data sets with 367 instances each (357 benign records and 10 malignant records) were built. As positive examples, 3 instances among the 10 malignant records were selected. The class distribution is shown in Table 1. Note that the malignant records are the only difference between the two data sets; they share the same benign records.

Table 1. Modified WDBC data set.

Class Name                 Class Label   Percentage of Instances   Number of Known Labels
Benign (common class)      Negative      97.28% (357)              0
Malignant (rare class)     Positive      2.72% (10)                3
4.1.1. Results on modified WDBC

The purpose of this experiment is to test the performance of the SSODPU algorithm when the number of positive training instances is small. Since no other method deals with this setting, the proposed method is compared against several state-of-the-art unsupervised algorithms: the ABOD algorithm,36 the Gaussian method,12 the INFLO algorithm,37 the KNN algorithm,18 the LDOF algorithm,38 the LOF algorithm,19 the LOOP algorithm39 and the SOD algorithm.40 Similarity is measured by Euclidean distance, so attributes with large ranges of values would easily dominate the distance calculation and hide the information in other important attributes. Min-max normalization is therefore applied to counter this effect:

$$x_{\text{new}}(i) = \frac{x(i) - \min(i)}{\max(i) - \min(i)}. \tag{6}$$
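Equation (6) is applied per attribute. A minimal NumPy sketch (with an added guard for constant columns, which the paper does not discuss):

```python
import numpy as np

def min_max_normalise(X):
    """Eq. (6): rescale every attribute (column) of X to [0, 1] so that
    no attribute dominates the Euclidean distance."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    rng = np.where(mx > mn, mx - mn, 1.0)   # avoid division by zero on constant columns
    return (X - mn) / rng
```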
Five evaluation metrics are employed to compare the results of the SSODPU method, since complete knowledge of the outliers exists: the algorithm is only aware of three outliers, but the other outliers in the data set are also known. The SSODPU algorithm provides a ranking of all instances; in the best case the top 10 instances would all be actual outliers, but this does not happen in the majority of runs. The metrics are as follows. "First" indicates the rank of the first outlier instance in the result list. "Last" shows the rank of the last known outlier detected. Median, Mean (average) and Std. Dev. (standard deviation) are three further statistical metrics used to appraise the performance of the SSODPU method. A statistical test is also performed to evaluate the proposed algorithm's impact: a paired t-test determines whether the difference between SSODPU and each other method is statistically significant, reported through the additional metric "p-value". p-values under 0.05 are shown in bold. In this experiment, 3 positive samples were chosen randomly ten times; for each data set, 30 independent runs were performed. The parameter k was varied from 1 to 65 for each algorithm and the best k was selected based on the results; the outcome did not change when k was set to alternative values. As in Ref. 7, the remaining parameters are set to m = 2, γ1 = 0.063, γ2 = 0.1 and ε = 3. Tables 2 and 3 show the results produced by the algorithms on WDBC 1 and WDBC 2, respectively. SSODPU performed best on almost all metrics, except the Median on the second data set. Overall, the results demonstrate the superiority of the proposed SSODPU method over the eight state-of-the-art outlier detection methods.
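The paired t-test can be reproduced with SciPy's `stats.ttest_rel`. The per-run "Mean" values below are illustrative placeholders, not the paper's exact measurements:

```python
from scipy import stats

# Illustrative per-run Mean values for SSODPU and one baseline, paired by run.
ssodpu_means   = [6.1, 6.8, 7.0, 6.9, 7.1, 7.2, 7.4, 7.6, 7.7, 8.8]
baseline_means = [13.5, 12.4, 10.4, 8.5, 9.9, 13.0, 11.2, 10.1, 9.7, 12.6]

# Paired t-test on the per-run differences.
t_stat, p_value = stats.ttest_rel(ssodpu_means, baseline_means)
significant = p_value < 0.05   # the paper bolds p-values under 0.05
```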
Three further evaluation metrics are also used to display the performance of the proposed method: Precision, Recall and F-measure. Their formulas are as follows:

$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}. \tag{7}$$
Table 2. Results on WDBC data set 1.

Algorithm   K    First   Last   Median   Mean   Std Dev.   p-value
ABOD        15   1       38     5.5      13.5   13.5       0.0387
Gaussian    —    2       69     10       22     24.3       0.0364
INFLO       35   1       23     5.5      8.7    7.4        0.0420
KNN         5    1       29     5.5      12.4   10.8       0.0201
LDOF        40   1       25     6.5      10.4   8.4        0.0148
LOF         10   1       18     6.5      8.5    6.1        0.0118
LOOP        35   1       26     5.5      9.9    8.3        0.0248
SOD         —    1       245    5.5      72.5   91.7       0.0406
SSODPU      —    1       14     5.5      6.1    4.06       —
SSODPU      —    1       14     5.5      6.8    4.6        —
SSODPU      —    1       15     5.5      7      5          —
SSODPU      —    1       16     5.5      6.9    5          —
SSODPU      —    1       16     5.5      7.1    5.1        —
SSODPU      —    1       18     5.5      7.2    5.5        —
SSODPU      —    1       18     5.5      7.4    5.7        —
SSODPU      —    1       18     5.5      7.6    5.9        —
SSODPU      —    1       19     5.5      7.7    6.1        —
SSODPU      —    2       21     6.5      8.8    6.4        —

(SSODPU is listed once for each of the ten random selections of the 3 positive training samples.)
Table 3. Results on WDBC data set 2.

Algorithm   K    First   Last   Median   Mean   Std Dev.   p-value
ABOD        6    1       84     6.5      20.1   29.5       0.0362
Gaussian    —    2       294    16       47.4   89.2       0.0407
INFLO       16   3       54     15.5     20.5   17.6       0.0493
KNN         2    1       64     6.5      19.1   24.6       0.0218
LDOF        25   2       67     9.5      18.7   21.5       0.0157
LOF         10   1       68     8.5      18.2   24.1       0.0296
LOOP        15   1       76     14.5     22     22.9       0.0431
SOD         —    1       259    5.5      54.8   84         0.0425
SSODPU      —    1       24     6.5      9.3    8.26       —
SSODPU      —    1       25     7.5      9.1    7.6        —
SSODPU      —    1       25     6.5      9.9    8.7        —
SSODPU      —    1       27     6.5      10.4   9.6        —
SSODPU      —    1       30     7        10.8   10.1       —
SSODPU      —    1       31     7        10.9   10.4       —
SSODPU      —    1       31     7        11.3   10.6       —
SSODPU      —    2       35     8        12.3   11         —
SSODPU      —    2       39     7.5      13.4   12.8       —
SSODPU      —    2       40     7.5      13.3   13.2       —

(SSODPU is listed once for each of the ten random selections of the 3 positive training samples.)
$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}. \tag{8}$$

$$\text{F-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}. \tag{9}$$
For each method, precision and recall are calculated from the first true outlier (malignant sample) to the last actual outlier identified. To visualize the results, the Precision-Recall graph, the Precision versus neighborhood size k graph, and the F-measure graph for the two data sets are shown in Figures 1–4. In the Precision versus neighborhood size graph, the average precision over the ten identified true outliers is calculated for each k; in the F-measure formula, the average precision and recall for the k with the best result in each algorithm are used.
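The way Eqs. (7)-(9) are evaluated on a ranked list, from the first to the last detected true outlier, can be sketched as follows (the helper name is ours):

```python
def precision_recall_f(ranked_labels, n_true_outliers):
    """Evaluate Eqs. (7)-(9) on a ranked list, cutting the list at the
    last true outlier found, as done in the paper.

    ranked_labels : list of 0/1 flags (1 = true outlier), ordered by outlier score
    """
    hits = [i for i, lab in enumerate(ranked_labels) if lab == 1]
    cutoff = hits[-1] + 1                 # position of the last detected true outlier
    tp = sum(ranked_labels[:cutoff])      # true positives within the cut
    precision = tp / cutoff               # TP / (TP + FP)
    recall = tp / n_true_outliers         # TP / (TP + FN)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f
```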
Fig. 1. (Color online) Precision-Recall graph on (a) WDBC 1 and (b) WDBC 2.

Fig. 2. (Color online) Precision with different size of neighbourhood k on (a) WDBC 1 and (b) WDBC 2.

Fig. 3. F-Measure for different methods on WDBC 1.

Fig. 4. F-Measure for different methods on WDBC 2.
4.2. UCI ML data sets

To further evaluate the SSODPU method, five benchmark data sets that are originally binary classification data sets are used: Ionosphere, SPECTF, Parkinsons, Spambase, and Musk. In addition, two more benchmark data sets, Annealing and Lymphography, are used to assess the performance of the proposed method under different measurement criteria. The data sets are described in Tables 4–6.

Table 4. Class sizes, data set size and number of attributes.

Data Set Name   Number of Instances   Number of Attributes   Target Class   Other Class
Ionosphere      351                   32                     126            225
SPECTF          264                   44                     212            55
Parkinsons      195                   22                     148            48
Spambase        4601                  57                     1813           2788
Musk            476                   166                    207            269

Table 5. Class distribution of Annealing data set.

Case                         Class Codes   Percentage of Instances
Commonly occurring classes   3             76.1
Rare classes                 1, 2, 5, U    23.9

Table 6. Class distribution of Lymphography data set.

Case                         Class Codes   Percentage of Instances
Commonly occurring classes   2, 3          95.9
Rare classes                 1, 4          4.1
Table 7. Area under the curve (AUC) and percent accuracy on the target class for the UCI data sets.

            Ionosphere      SPECTF          Parkinsons      Spambase        Musk
Method      AUC    % Acc    AUC    % Acc    AUC    % Acc    AUC    % Acc    AUC    % Acc
ABOD        0.73   79%      0.76   82%      0.72   78%      0.69   62%      0.51   47%
Gaussian    0.60   68%      0.51   64%      0.47   43%      0.33   36%      0.48   50%
INFLO       0.82   81%      0.84   89%      0.68   83%      0.79   66%      0.39   41%
KNN         0.47   61%      0.52   71%      0.70   75%      0.48   42%      0.56   57%
LDOF        0.63   71%      0.47   78%      0.49   59%      0.47   48%      0.62   67%
LOF         0.58   66%      0.41   77%      0.43   72%      0.46   36%      0.49   43%
LOOP        0.78   76%      0.71   80%      0.69   75%      0.74   75%      0.63   54%
SOD         0.55   48%      0.40   52%      0.46   65%      0.42   36%      0.52   47%
SSODPU      0.89   91%      0.81   89%      0.74   82%      0.88   81%      0.76   78%
4.2.1. Results on UCI ML data sets

Table 7 gives the results of nine different methods on the UCI data sets. The proposed method is compared with eight unsupervised algorithms: ABOD, Gaussian, INFLO, KNN, LDOF, LOF, LOOP, and SOD. Two evaluation metrics, area under the curve (AUC) and accuracy, are employed as in Ref. 41. The experiment is repeated 5 times with different numbers of outliers, randomly selected from the "Target class" of each data set, as the training set. Each time, 20 different runs are performed and the average accuracy is calculated. The results show that the SSODPU method achieved better accuracy on 3 data sets; on SPECTF and Parkinsons its difference from the two best competing methods was insignificant. As mentioned in Ref. 4, one way to test the performance of an outlier detection algorithm is to run the method on the data set and calculate the percentage of detected points that belong to the rare classes. The top ratio with the percentage of coverage is therefore employed for the Annealing and Lymphography data sets. The Annealing data set has 798 instances with 38 attributes and contains 5 classes; class 3 has the largest number of instances, and the other four classes are regarded as rare due to their small sizes. The Lymphography data set has 148 instances with 18 attributes and contains four classes; classes 2 and 3 have the largest numbers of instances, and the remaining classes are considered rare. Tables 8 and 9 show the performance of the proposed method against the FindCBLOF method21 and the kNN algorithm.18 As mentioned in Ref. 21, the two parameters needed for FindCBLOF are set to 90% and 5, respectively, and 5-nearest-neighbor is applied for the kNN algorithm. The parameter k for the SSODPU method is set to 6. In the Annealing data set, 3 percent of the instances belonging to the rare classes are randomly taken and used as the training set for SSODPU. In the Lymphography data set, 1 random instance of the rare classes is chosen for the training set.
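AUC can be computed directly from the outlier scores via the rank (Mann–Whitney) identity: the probability that a randomly chosen outlier scores higher than a randomly chosen inlier. A minimal sketch (not the evaluation code of Ref. 41):

```python
def auc_from_scores(scores, labels):
    """AUC as the fraction of (outlier, inlier) pairs where the outlier
    receives the higher score; ties count as half. labels: 1 = outlier."""
    pairs = 0
    favourable = 0.0
    for s_out, l_out in zip(scores, labels):
        if l_out != 1:
            continue
        for s_in, l_in in zip(scores, labels):
            if l_in == 1:
                continue
            pairs += 1
            if s_out > s_in:
                favourable += 1.0
            elif s_out == s_in:
                favourable += 0.5
    return favourable / pairs
```

This O(n²) form is fine for data sets of the sizes used here; a sort-based O(n log n) version is preferable for large n.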
Table 8. Detected rare classes in Annealing data set.

                                Number of Rare Classes Included (Coverage)
Top Ratio (Number of Records)   FindCBLOF    kNN         SSODPU
10% (80)                        45 (24%)     21 (11%)    67 (35%)
15% (105)                       55 (29%)     30 (16%)    103 (54%)
20% (140)                       82 (43%)     41 (22%)    125 (66%)
25% (175)                       105 (55%)    58 (31%)    133 (70%)
30% (209)                       105 (55%)    62 (33%)    141 (74%)

Table 9. Detected rare classes in Lymphography data set.

                                Number of Rare Classes Included (Coverage)
Top Ratio (Number of Records)   FindCBLOF    kNN         SSODPU
5% (7)                          4 (67%)      1 (17%)     4 (67%)
10% (15)                        4 (67%)      1 (17%)     6 (100%)
15% (22)                        4 (67%)      2 (33%)     6 (100%)
20% (30)                        6 (100%)     2 (33%)     6 (100%)
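The top-ratio/coverage numbers reported in Tables 8 and 9 follow a simple counting scheme; a small sketch with names of our own choosing:

```python
def coverage_at_top_ratio(ranked_labels, ratio):
    """Top-ratio / coverage metric of Tables 8-9.

    ranked_labels : list of 0/1 flags (1 = rare-class record), ordered by outlier score
    ratio         : fraction of the ranking to inspect (e.g. 0.10 for 10%)
    Returns (number of rare-class records detected, fraction of all rare records covered).
    """
    k = int(ratio * len(ranked_labels))   # number of top-ranked records inspected
    detected = sum(ranked_labels[:k])     # rare-class records among them
    total = sum(ranked_labels)            # all rare-class records in the data set
    return detected, detected / total
```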
Here, the top ratio is the number of instances the algorithm flags as rare-class records, and the coverage metric indicates how many of the instances within the top ratio are correctly identified rare-class records. The SSODPU results are compared with the results reported for the FindCBLOF method and the kNN algorithm in Ref. 21. The results on both data sets demonstrate the superiority of SSODPU over FindCBLOF and kNN.

5. Conclusion

This paper presented the SSODPU method, which, unlike former works, tackles the problem of detecting outliers with only a few labeled positive examples. The paper contributes a new two-phase strategy for learning from positive data and detecting unusual samples: first, some reliable negative examples are extracted, and then outliers are detected by means of FRSSOD. The results verify that this approach surpasses previous works in the literature. The results on data sets with different distributions show that the proposed method works better on unbalanced data sets than on data sets with similar numbers of outliers and inliers. Future work could explore using a different algorithm in the second phase, which finds outliers using some positive samples. In addition, larger-scale testing on more real data would further validate the approach.

References

1. D. M. Hawkins, Identification of Outliers, Vol. 11 (Springer, 1980).
2. S. Chen, W. Wang and H. Van Zuylen, A comparison of outlier detection algorithms for ITS data, Expert Systems with Applications 37 (2010) 1169–1178.
3. S. Kim, N. W. Cho, B. Kang and S.-H. Kang, Fast outlier detection for very large log data, Expert Systems with Applications 38 (2011) 9587–9596.
4. C. C. Aggarwal and P. S. Yu, Outlier detection for high dimensional data, in ACM SIGMOD Record (2001), pp. 37–46.
5. B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor and J. C. Platt, Support vector method for novelty detection, in Advances in Neural Information Processing Systems (2000), pp. 582–588.
6. D. Fustes et al., SOM ensemble for unsupervised outlier analysis. Application to outlier identification in the Gaia astronomical survey, Expert Systems with Applications 40 (2013) 1530–1541.
7. Z. Xue, Y. Shang and A. Feng, Semi-supervised outlier detection based on fuzzy rough C-means clustering, Mathematics and Computers in Simulation 80 (2010) 1911–1921.
8. M. Markou and S. Singh, A neural network-based novelty detector for image sequence analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (2006) 1664–1677.
9. J. Gao, H. Cheng and P.-N. Tan, Semi-supervised outlier detection, in Proc. of the 2006 ACM Symposium on Applied Computing (2006), pp. 635–636.
10. Y. Li, B. Fang and L. Guo, A novel data mining method for network anomaly detection based on transductive scheme, in Advances in Neural Networks – ISNN 2007 (Springer, 2007), pp. 1286–1292.
11. V. Barnett and T. Lewis, Outliers in Statistical Data, 2nd edn., Wiley Series in Probability and Mathematical Statistics (Wiley, Chichester, 1984).
12. K. Yamanishi, J.-I. Takeuchi, G. Williams and P. Milne, On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms, in Proc. of the 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (2000), pp. 320–324.
13. K. Yamanishi and J.-I. Takeuchi, Discovering outlier filtering rules from unlabeled data: Combining a supervised learner with an unsupervised learner, in Proc. of the 7th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (2001), pp. 389–394.
14. T. Johnson, I. Kwok and R. T. Ng, Fast computation of 2-dimensional depth contours, in KDD (1998), pp. 224–228.
15. Z. He, S. Deng and X. Xu, A unified subspace outlier ensemble framework for outlier detection, in Advances in Web-Age Information Management (Springer, 2005), pp. 632–637.
16. E. M. Knorr, R. T. Ng and V. Tucakov, Distance-based outliers: Algorithms and applications, The VLDB Journal — The International Journal on Very Large Data Bases 8 (2000) 237–253.
17. Y. Chen, D. Miao and H. Zhang, Neighborhood outlier detection, Expert Systems with Applications 37 (2010) 8745–8749.
18. S. Ramaswamy, R. Rastogi and K. Shim, Efficient algorithms for mining outliers from large data sets, in ACM SIGMOD Record (2000), pp. 427–438.
19. M. M. Breunig, H.-P. Kriegel, R. T. Ng and J. Sander, LOF: Identifying density-based local outliers, in ACM SIGMOD Record (2000), pp. 93–104.
20. C.-H. Wang, Outlier identification and market segmentation using kernel-based clustering techniques, Expert Systems with Applications 36 (2009) 3744–3750.
21. Z. He, X. Xu and S. Deng, Discovering cluster-based local outliers, Pattern Recognition Letters 24 (2003) 1641–1650.
22. L. M. Manevitz and M. Yousef, One-class SVMs for document classification, The Journal of Machine Learning Research 2 (2002) 139–154.
23. H. Yu, J. Han and K. C.-C. Chang, PEBL: Positive example based learning for web page classification using SVM, in Proc. of the 8th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (2002), pp. 239–248.
24. B. Liu, Y. Dai, X. Li, W. S. Lee and P. S. Yu, Building text classifiers using positive and unlabeled examples, in 3rd IEEE Int. Conf. on Data Mining (ICDM 2003) (2003), pp. 179–186.
25. X. Li and B. Liu, Learning to classify texts using positive and unlabeled data, in IJCAI (2003), pp. 587–592.
26. D. Zhang and W. S. Lee, A simple probabilistic approach to learning from positive and unlabeled examples, in Proc. of the 5th Annual UK Workshop on Computational Intelligence (UKCI) (2005), pp. 83–87.
27. C. Elkan and K. Noto, Learning classifiers from only positive and unlabeled data, in Proc. of the 14th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (2008), pp. 213–220.
28. X. Li, B. Liu and S.-K. Ng, Learning to identify unexpected instances in the test set, in IJCAI (2007), pp. 2802–2807.
29. X. Wang, Z. Xu, C. Sha, M. Ester and A. Zhou, Semi-supervised learning from only positive and unlabeled data using entropy, in Web-Age Information Management (Springer, 2010), pp. 668–679.
30. Y. Yang and X. Liu, A re-examination of text categorization methods, in Proc. of the 22nd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval (1999), pp. 42–49.
31. J. Žižka, J. Hroza, B. Pouliquen, C. Ignat and R. Steinberger, The selection of electronic text documents supported by only positive examples, in Proc. of the 8th Int. Conf. on the Statistical Analysis of Textual Data (JADT) (2006), pp. 19–21.
32. B. Zhang and W. Zuo, Reliable negative extracting based on KNN for learning from positive and unlabeled examples, Journal of Computers 4 (2009) 94–101.
33. Q. Hu and D. Yu, An improved clustering algorithm for information granulation, in Fuzzy Systems and Knowledge Discovery (Springer, 2005), pp. 494–504.
34. G. Peters, Outliers in rough k-means clustering, in Pattern Recognition and Machine Intelligence (Springer, 2005), pp. 702–707.
35. C. Blake and C. J. Merz, UCI Repository of Machine Learning Databases (1998).
36. H.-P. Kriegel and A. Zimek, Angle-based outlier detection in high-dimensional data, in Proc. of the 14th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (2008), pp. 444–452.
37. W. Jin, A. K. Tung, J. Han and W. Wang, Ranking outliers using symmetric neighborhood relationship, in Advances in Knowledge Discovery and Data Mining (Springer, 2006), pp. 577–593.
38. K. Zhang, M. Hutter and H. Jin, A new local distance-based outlier detection approach for scattered real-world data, in Advances in Knowledge Discovery and Data Mining (Springer, 2009), pp. 813–822.
39. H.-P. Kriegel, P. Kröger, E. Schubert and A. Zimek, LoOP: Local outlier probabilities, in Proc. of the 18th ACM Conf. on Information and Knowledge Management (2009), pp. 1649–1652.
40. H.-P. Kriegel, P. Kröger, E. Schubert and A. Zimek, Outlier detection in axis-parallel subspaces of high dimensional data, in Advances in Knowledge Discovery and Data Mining (Springer, 2009), pp. 831–838.
41. A. Foss and O. R. Zaïane, Class separation through variance: A new application of outlier detection, Knowledge and Information Systems 29 (2011) 565–596.