Appl Intell (2014) 41:196–211 DOI 10.1007/s10489-013-0502-0

A K-Farthest-Neighbor-based approach for support vector data description Yanshan Xiao · Bo Liu · Zhifeng Hao · Longbing Cao

Published online: 15 February 2014 © Springer Science+Business Media New York 2014

Abstract Support vector data description (SVDD) is a well-known technique for one-class classification problems. However, it incurs high time complexity when handling large-scale datasets. In this paper, we propose a novel approach, named K-Farthest-Neighbor-based Concept Boundary Detection (KFN-CBD), to improve the training efficiency of SVDD. KFN-CBD aims at identifying the examples lying close to the boundary of the target class, and these examples, instead of the entire dataset, are then used to learn the classifier. Extensive experiments have shown that KFN-CBD obtains substantial speedup over standard SVDD while maintaining accuracy comparable to training on the entire dataset.

Keywords Support vector data description · K-Farthest Neighbors

Y. Xiao (B) · Z. Hao School of Computers, Guangdong University of Technology, Guangzhou, Guangdong, China e-mail: [email protected] Z. Hao e-mail: [email protected] B. Liu School of Automation, Guangdong University of Technology, Guangzhou, Guangdong, China e-mail: [email protected] L. Cao Advanced Analytics Institute, University of Technology, Sydney, Australia e-mail: [email protected]

1 Introduction Support vector machine (SVM) [1–4] is a powerful technique in machine learning and data mining. It is based on the idea of structural risk minimization, which shows that the generalization error is bounded by the sum of the training error and a term depending on the Vapnik-Chervonenkis dimension. By minimizing this bound, high generalization performance can be achieved. Support vector data description (SVDD) [5–7] is an extension of SVM for one-class classification problems where only examples from one class (target class) are involved in the training phase. In SVDD, the training data is transformed from the original data space (input space) into a higher dimensional feature space via a mapping function. After being mapped into the feature space via a proper mapping function, the data becomes more compact and distinguishable in the feature space. A hyper-sphere is then determined to enclose the training data. A test example is classified to the target class if it lies on or inside the hyper-sphere. Otherwise, it is classified to the non-target class. To date, SVDD has been used in a wide variety of realworld applications from machine fault detection and credit card fraud detection to network intrusion [8–10]. The success of SVDD in machine learning leads to its extension to the classification problems for mining large-scale data. However, despite the promising properties of SVDD, it incurs a critical computational efficiency issue when it is applied on large-scale datasets. A typical example is outlier detection of handwritten digit data. The handwritten digit classification system deals with the handwritten digit images collected from people and builds a classifier to detect the anomalous handwritten digit images. Since different people have different handwriting and the handwriting may even change according to people’s emotions or situations, a large amount of handwritten digit images need to be trained for an


appropriate learning of the handwritten digit classification system. Another example is credit card transactional fraud detection. Such a system usually contains large-scale transaction records, and the number of examples involved during the training stage can be considerably large. The training time on these data can be too expensive for SVDD to be a practical solution. Hence, it is essential to explore novel approaches to improve the training efficiency of SVDD. This paper proposes a novel approach, called K-Farthest-Neighbor-based Concept Boundary Detection (KFN-CBD). The main motivation behind KFN-CBD is the desire to further speed up SVDD on large-scale datasets by eliminating the examples which are most likely to be non-SVs. It is known that the SVDD classifier is decided by the support vectors (SVs) which lie on or outside the hyper-sphere, and removing the non-SVs does not change the classifier. Hence, KFN-CBD aims at identifying the examples lying close to the boundary of the target class. These examples are called boundary examples in this paper. By using only the boundary examples to train the classifier, the training set size is largely cut down and the computational cost is substantially reduced. Specifically, KFN-CBD consists of two steps: concept boundary pre-processing and concept boundary learning. In the concept boundary pre-processing step, the k-farthest neighbors of each example are identified. To accelerate the k-farthest neighbor query, M-tree [11], which was originally designed for k-nearest neighbor search, is extended for k-farthest neighbor learning by performing a novel tree query strategy. To the best of our knowledge, it is the first learning method for efficient k-farthest neighbor query. In the concept boundary learning step, each example is associated with a ranking score and a subset of top ranked examples is used to train the SVDD classifier. To do this, a scoring function is proposed to evaluate the examples based on their proximity to the data boundary. According to the scoring function, the examples with larger scores are more likely to be SVs than those with smaller scores. Compared to SVDD, KFN-CBD includes only a subset of informative examples, instead of the whole dataset, to learn the classifier, so that a large number of examples, which are likely to be non-SVs, are eliminated from the training phase. Extensive experiments have shown that KFN-CBD obtains substantial speedup compared to SVDD, and meanwhile maintains classification accuracy comparable to that obtained with the entire dataset. The rest of this paper is organized as follows. Section 2 introduces previous work related to our study. Section 3 presents the details of the proposed approach. Experimental results are reported in Sect. 4. Section 5 concludes the paper and puts forward future work.


2 Related work SVDD is an extension of SVM on one-class classification problems. In this section, we first review the previous studies on speeding up SVM and SVDD in Sect. 2.1. Then, the basic idea of SVDD is briefly introduced in Sect. 2.2. 2.1 Speeding up SVM and SVDD Previous work on speeding up SVM and SVDD can be roughly classified into two categories: (1) algorithmic approaches [12–20], where optimization methods are used to make the quadratic programming (QP) problems1 solve faster; (2) data-processing approaches [21–28], which reduce the training set size using data down-sampling techniques. In practice, the data-processing approaches always run on top of the algorithmic approaches as an ultimate way to accelerate the training efficiency of SVM and SVDD. Our method falls into the category of data-processing approaches. The related work in this category will be discussed. Based on the variety of sampling techniques, the data-processing approaches can be further divided into random sampling-based approaches and selective samplingbased approaches. 2.1.1 Random sampling-based approaches These approaches randomly divide the training set into a number of subsets and train the classifier on each subset. For a test example, the results of sub-classifiers are combined for predictions. Bagging [21] and Cascade SVM [22] are typical examples. In Bagging [21], the training data is randomly divided into a series of subsets and each subset is trained to obtain a “weak classifier”. The test data is then predicted by voting on the “weak classifiers”. [22] proposes “Cascade SVM”, which randomly decomposes the learning problem into a number of independent smaller problems and the partial results are combined in a hierarchy way. Recently, [29] extends Cascade SVM from SVM to SVDD. Bagging [21] and Cascade SVM [22] have shown their capability in accelerating the training efficiency. However, there is a limitation to these methods: random sampling may not select the most informative examples for each sub-problem and much training time is devoured by a large number of less informative examples.
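As a concrete, hypothetical illustration of the random-sampling idea discussed above (not the implementation of the cited Bagging [21] or Cascade SVM [22] systems), the sketch below partitions the training data into random subsets, trains one sub-classifier per subset, and combines the sub-classifiers by voting at test time; scikit-learn's OneClassSVM is used as a stand-in for an SVDD-style learner.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def bagging_one_class(X_train, X_test, n_subsets=5, seed=0):
    """Train one sub-classifier per random subset and vote on the test data.

    A sketch of the random-sampling strategy: OneClassSVM stands in for
    an SVDD-style learner; +1 = target class, -1 = non-target.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_train))
    votes = np.zeros(len(X_test))
    for subset in np.array_split(idx, n_subsets):
        clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
        clf.fit(X_train[subset])
        votes += clf.predict(X_test)          # each sub-classifier votes +1 / -1
    return np.where(votes >= 0, 1, -1)        # majority vote

# toy usage with random data
X_tr = np.random.randn(600, 5)
X_te = np.random.randn(50, 5)
print(bagging_one_class(X_tr, X_te)[:10])
```

As noted above, the weakness of such schemes is that the random subsets are not guaranteed to contain the informative (boundary) examples, which is exactly what the selective approaches below try to address.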

1 A quadratic programming (QP) problem is one of optimizing a quadratic function of several variables subject to linear constraints on these variables. The optimization problems of SVM and SVDD are QP problems.


2.1.2 Selective sampling-based approaches

In order to choose the informative examples from the training set, selective sampling-based approaches have been proposed. Instead of learning on the whole dataset, selective sampling-based approaches aim at selecting a subset of informative examples and using the selected examples to train the classifier. In the following, we review the selective sampling-based approaches for SVM and SVDD, respectively. Considerable efforts have been made to speed up SVM by using selective sampling [23–25]. For example, CB-SVM [30] clusters the training data into a number of micro-clusters, and selects the centroids of the clusters as the representatives along the hierarchical clustering tree. The examples near the hyper-plane are used to train the classifier. [31] utilizes k-means clustering to cluster the training data, and the data on the cluster boundaries is selected for SVM learning. In [32], the potential SVs are identified by an approximate classifier trained on the centroids of the clusters, and the clusters are then replaced by the SVs. IVM [33] uses posterior probability estimates to choose a subset of examples for training. In contrast with SVM, there are only a few studies on SVDD with selective sampling. Among these studies, KSVDD [34] adopts a divide-and-conquer strategy to speed up SVDD. Firstly, it decomposes the training data into several subsets using k-means clustering. Secondly, each subset is trained by SVDD. Lastly, the global classifier is trained by using the SVs obtained in all sub-classifiers. However, the efficiency of KSVDD may be limited since it is computationally expensive to partition the dataset using k-means clustering. Moreover, it is time consuming to train the subsets for obtaining the SVs. Our proposed approach complements the algorithmic approaches. In the experiments, we explicitly compare our approach with representative random sampling-based and selective sampling-based approaches. The experimental results demonstrate that our approach obtains markedly better speedup than the compared methods.

2.2 Support vector data description

Suppose that the training dataset is S = {x1, x2, . . . , xl}, where l is the number of training examples. The basic idea of SVDD is to enclose the training data in a hyper-sphere classifier. The objective function of SVDD is given by:

$$\min_{R,\,o,\,\xi}\; R^2 + C\sum_{i=1}^{l}\xi_i \qquad (1)$$

$$\text{s.t.}\quad \|x_i - o\|^2 \le R^2 + \xi_i,\qquad \xi_i \ge 0,\; i = 1, 2, \ldots, l$$

where o is the center and R is the radius of the hyper-sphere; ξi are slack variables; and C is a user pre-defined parameter which controls the tradeoff between the hyper-sphere radius and the training errors. By introducing the Lagrange multipliers αi, we have the following findings [5]:

– For xi which lies outside the hyper-sphere, i.e., ‖xi − o‖² > R², it has αi = C.
– For xi which lies on the hyper-sphere, i.e., ‖xi − o‖² = R², it has 0 < αi < C.
– For xi which lies inside the hyper-sphere, i.e., ‖xi − o‖² < R², it has αi = 0.

In addition, assuming that x is an example lying on the hyper-sphere, the center o and the radius R of the hyper-sphere can be calculated by [5]:

$$o = \sum_{i=1}^{l}\alpha_i x_i. \qquad (2)$$

$$R^2 = (x \cdot x) - 2\sum_{i}\alpha_i (x_i \cdot x) + \sum_{i,j}\alpha_i \alpha_j (x_i \cdot x_j). \qquad (3)$$
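As a small illustration of how (2) and (3) are used at prediction time, the NumPy sketch below evaluates the squared feature-space distance of a test point to the centre and compares it with R². The multipliers alpha are assumed to be given (e.g., obtained from any QP solver); the RBF kernel and all variable names are illustrative rather than taken from the paper's implementation.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def svdd_predict(X_train, alpha, x_boundary, X_test, sigma=1.0):
    """Classify test points with an already-trained SVDD.

    alpha:       Lagrange multipliers (assumed to come from a QP solver)
    x_boundary:  one training example lying on the hyper-sphere (0 < alpha_i < C),
                 used to evaluate R^2 via Eq. (3)
    Returns +1 for target (inside or on the sphere), -1 otherwise.
    """
    def dist2_to_centre(X):
        # ||phi(x) - o||^2 = K(x,x) - 2 sum_i a_i K(x_i,x) + sum_ij a_i a_j K(x_i,x_j)
        Kxx = np.ones(len(X))                          # for the RBF kernel, K(x, x) = 1
        cross = rbf_kernel(X_train, X, sigma)          # K(x_i, x)
        const = alpha @ rbf_kernel(X_train, X_train, sigma) @ alpha
        return Kxx - 2.0 * (alpha @ cross) + const

    R2 = dist2_to_centre(x_boundary[None, :])[0]       # Eq. (3)
    return np.where(dist2_to_centre(X_test) <= R2, 1, -1)
```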

From (2) and (3), we can see that the hyper-sphere center o and the radius R are determined by examples xi with αi ≠ 0. It is known that the examples lying on or outside the hyper-sphere have αi ≠ 0, and those located inside the hyper-sphere have αi = 0. Therefore, the hyper-sphere classifier is decided by the examples which lie on or outside the hyper-sphere, and eliminating the examples which are located inside the hyper-sphere does not change the classifier. Moreover, the examples lying on or outside the hyper-sphere are called SVs, and those located inside the hyper-sphere are called non-SVs. The distribution of SVs for SVDD is shown in Fig. 1. A test example x is classified as a normal example if its distance to the hyper-sphere center o is no larger than the radius R. Otherwise, it is considered as an outlier. In SVDD, the classifier is constructed as a hyper-sphere. However, the training data may not be compactly distributed in the input space. In order to obtain a better data description, the original training data is mapped from the input space into a higher dimensional feature space via a mapping function φ(·). By using the mapping function φ(·), the training data can be more compact and distinguishable in the feature space. The inner product of two vectors in the feature space can be calculated by using the kernel function as K(x, xi) = φ(x) · φ(xi) [35–37].

3 Proposed approach

In this section, we present the proposed approach in detail. Given a training dataset S which consists of l target examples, the objective of SVDD is to build a hyper-sphere classifier by enclosing the target examples and the classifier is


Algorithm 1 The proposed KFN-CBD framework
Step 1 (Boundary Concept Pre-processing): Given a training set S, the k-farthest neighbors of each example are identified. To accelerate the k-farthest neighbor search, M-tree is used and a novel tree query strategy is proposed.
Step 2 (Boundary Concept Learning): Each example is assigned with a ranking score based on its k-farthest neighbors. A subset of top ranked examples is selected out and used to learn the SVDD classifier.

Fig. 1 The distribution of SVs in SVDD. The examples lying on the hyper-sphere surface and outside the hyper-sphere are called “boundary SVs” and “outlying SVs”, respectively. Those located inside the hyper-sphere are non-SVs. The SVDD classifier is decided by the “boundary SVs” and “outlying SVs”

thereafter applied to classify the test examples. However, SVDD incurs a critical computation issue when it is applied on large-scale datasets. To speed up SVDD on large-scale datasets, we propose to select a subset of examples from the dataset and train the SVDD classifier using the selected examples, instead of the whole dataset. In such a case, the training set size is largely cut down and the computational cost of SVDD can be substantially reduced. The motivation behind our proposed approach is illustrated in Fig. 1. The SVDD classifier is decided by the SVs which lie on or outside the hyper-sphere, while the non-SVs, which are located inside the hyper-sphere, do not determine the SVDD classifier, as discussed in Sect. 2.2. That is to say, removing the non-SVs does not change the SVDD classifier [5]. Motivated by this observation, an attractive approach for accelerating SVDD is to exclude the non-SVs from training the classifier so that the training set size can be reduced. However, the exact SVs cannot be known until the SVDD classifier is trained. To overcome this difficulty, we propose to determine the SVs by identifying the examples lying close to the boundary of the target class, namely boundary examples. To determine the boundary examples, we put forward a k-farthest-neighbor-based learning method. In SVDD, after being mapped from the input space into the feature space via an appropriately chosen kernel function, the target data becomes more compact and distinguishable in the feature space. A hyper-sphere is then constructed by enclosing the target data with good performance. Based on this geometrical characteristic of SVDD, the k-farthest-neighbor-based learning method is proposed to determine the boundary examples by considering the distances between the examples and their k-farthest neighbors. The framework of our proposed approach is presented in Algorithm 1.

3.1 Boundary concept pre-processing

The main task of this step is to determine the k-farthest neighbors for each example. This pre-processing step is performed only once for the whole dataset. Compared to knearest neighbor query, k-farthest neighbor learning aims at identifying k-farthest examples from the query point. At the end of this stage, we obtain a data structure which consists of the indexes and distances of the k-farthest neighbors for each example in the whole dataset. Similar to k-nearest neighbor query [38, 39], obtaining the k-farthest neighbors of each example can be computationally expensive on large-scale datasets. Hence, M-tree [11, 40], originally designed for efficient k-nearest neighbor learning, is extended to speed up the k-farthest neighbor search. Specifically, M-tree for k-farthest neighbor search consists of two steps: tree construction and tree query. In the tree construction step, a hierarchical tree is built. The tree construction process of k-farthest neighbor learning is the same as k-nearest neighbor query. In the tree query step, the tree is queried to obtain the k-farthest neighbors. Different from k-nearest neighbors, we propose a tree query scheme for efficient k-farthest neighbor search. The k-farthest neighbor query algorithm uses a priority PR queue to store pending requests, and a k-element FN array to contain the k-farthest neighbor candidates, which are the results at the end of the algorithm. To do this, a branchand-bound heuristic algorithm is adopted. In the following, we first introduce several important notations and the others are defined in Table 1. PR (Priority) queue The PR queue stores a queue of pending requests [ptr(T (Oi )), dmax (T (Oi ))], where dmax (T (Oi )) = d(Oi , Q) + rOi is the upper-bound distance, representing the maximum distance between Q and T (Oi ). d(Oi , Q) is the distance between the query point Q and object Oi , and rOi is the radius of subtree T (Oi ). The pending


Table 1 Symbol definition

| Symbol | Definition |
|---|---|
| Q | Query point |
| Oi | The ith data object |
| T(Oi) | Covering tree of object Oi |
| rQ | Query radius of Q |
| d(Oi, Q) | Distance between the data object Oi and the query point Q |
| dmin(T(Oi)) | The smallest possible distance between the query point Q and the objects in subtree T(Oi). The lower-bound distance dmin(T(Oi)) is defined as: dmin(T(Oi)) = max{d(Oi, Q) − rOi, 0}. |
| dmax(T(Oi)) | Upper-bound distance between the query point Q and the objects in subtree T(Oi). The upper-bound distance dmax(T(Oi)) is defined as: dmax(T(Oi)) = d(Oi, Q) + rOi. |
| dk | Minimum lower-bound distance dmin(T(Oi)) of the k objects in the FN array. If there is no object in the FN array, let dk = 0. |
| PR queue | Queue of pending requests which have not been filtered from the search. The entry in the PR queue is of the form [T(Oi), dmax(T(Oi))]. |
| FN array | During the query process, the FN array stores entries of the form [Oi, d(Q, Oi)] or [−, dmin(T(Oi))]. At the end of the query, the FN array contains the results, i.e., the k-farthest neighbors. |
| KFN(φ(xi)) | Top k-farthest neighbor list of example φ(xi) |
| f(φ(x)) | Ranking score of example φ(x) |

requests in the PR queue are sorted by dmax(T(Oi)) in descending order. Hence, the request with max{dmax(T(Oi))} is inquired first. FN (Farthest Neighbor) array The FN array stores the k-farthest neighbors or the lower-bound distances of subtrees. It contains k entries of the form [oid(Oi), d(Q, Oi)] or [−, dmin(T(Oi))], where dmin(T(Oi)) = max{d(Oi, Q) − rOi, 0} is the lower-bound distance, representing the minimum distance between Q and T(Oi). The entries in the FN array are sorted by d(Q, Oi) or dmin(T(Oi)) in descending order. The k-farthest neighbors are in the FN array at the end of the query. rQ (Query Range) At the beginning, the query range rQ is set to 0. When a new entry is inserted into the FN array, rQ is enlarged to the distance d(Q, Oi) or dmin(T(Oi)) stored in the last entry FN[k]. Based on the above notations, the basic idea of the efficient k-farthest neighbor query is as follows. The query starts at the root node, and [root, 0] is the first and only request in the PR queue. All entries in the FN array are set to [−, 0], and the query range rQ is set to 0. During the query, the entry in the PR queue with max{dmax(T(Oi))} is searched first. This is because k-farthest neighbors are more likely

to be acquired from the farthest subtree and dmax (T (Oi )) is the largest possible distance between the query point Q and the objects in subtree T (Oi ). Let dk equal the distance d(Q, Oi ) or dmin (T (Oi )) stored in the last entry FN[k]. For the entry inquired, if dmin (T (Oi )) > dk holds, it is inserted into the FN array, and dk is updated according to the last entry FN[k]. After inserting the entry into the FN array and updating dk , we let rQ = dk and the query region enlarges. Based on this region, the entries with dmax (T (Oi )) < dk are removed from the PR queue, i.e., being excluded from the search. This is because if the upper-bound distance dmax (T (Oi )) between Q and T (Oi ) is smaller than the minimum distance in the FN array, i.e., dk , T (Oi ) is unlikely to contain the kfarthest neighbors and can be safely pruned. When no object can be filtered, the query stops and the k-farthest neighbors are in the FN array. Take Figs. 2(a)–(d) as an example to further illustrate the query process of k-farthest neighbors. Here, k = 2 is set. – Fig. 2(a): The query starts at the root level and the PR queue is set to be [root, 0]. rQ is equal to 0, and all entries in the FN array are [−, 0]. – Fig. 2(b): Subtrees II and I are waiting for inquiring, and hence put into the pending PR queue. Since it has dmax (II) > dmax (I ), subtree II is the first entry in the PR queue, and is searched before subtree I . Moreover, dmin (II) and dmin (I ) are put into the FN array. Considering that it has dmin (II) > dmin (I ), dmin (II) is placed in front of dmin (I ), and dmin (I ) is the last entry in the FN array. rQ stays 0, since rQ is equal to the minimum lowerbound distance in the FN array, i.e., rQ = dmin (I ), while it has dmin (I ) = 0. – Fig. 2(c): Inquire the subtree in the first entry of the PR queue, i.e., subtree II, and then subtrees D, I and C are inserted into the PR queue by their upper-bound distances in descending order. Since dmin (D) > dmin (C) and dmin (C) > dmin (I ) hold, the two largest lower-bound distances, i.e., dmin (D) and dmin (C), are put into the FN array. rQ increases from 0 to dmin (C). – Fig. 2(d): The subtree in the first entry of the PR queue, i.e., subtree D, is filtered. O8 , O7 , I and C are sorted by their upper-bound distances in descending order and placed into PR. Then, O8 and O7 are inquired. Entries [O8 , d(Q, O8 )] and [O7 , d(Q, O7 )] are put into the FN array. This is because d(Q, O8 ) and d(Q, O7 ) are larger than dmin (D) and dmin (C), and hence d(Q, O8 ) and d(Q, O7 ) are in the FN array. After the entries in the FN array are updated, the query range rQ is enlarged and set to be the distance in the last entry of the FN array, i.e., rQ = d(Q, O7 ). Since rQ = d(Q, O7 ) holds, it has rQ > dmax (I ) and rQ > dmax (C). Considering that rQ is equal to the minimum distance stored in the FN array, if


Fig. 2 An example of 2-farthest neighbor query using M-tree. Here, I , II, A, B, C and D represent different subtrees. Q1 to Q8 are data objects

the upper-bound distance dmax(T(Oi)) between Q and T(Oi) is smaller than the minimum distance stored in the FN array, i.e., rQ, it can be safely pruned. Hence, subtrees I and C can be pruned and removed from the PR queue. Lastly, the query stops since the PR queue is empty and the 2-farthest neighbors of the query point Q, namely O7 and O8, are contained in the FN array.

3.2 Boundary concept learning

After the k-farthest neighbors are identified, the next step is to assign each example a ranking score based on its k-farthest neighbors, such that a subset of top ranked examples is chosen to train the classifier. To do this, we propose a scoring function. The objective of the scoring function is to assign higher scores to examples which are more likely to lie around the boundary. Given l target examples S = {φ(x1), φ(x2), . . . , φ(xl)}, we first compute the k-farthest neighbors of each example φ(xi) (i = 1, . . . , l), as presented in Sect. 3.1. Let KFN(φ(xi)) be the top k-farthest neighbor list of example φ(xi). In the following, when we write φ(xj) ∈ KFN(φ(xi)), it means that example φ(xj) is on the top k-farthest neighbor list of example φ(xi). Based on the top k-farthest neighbor list of each target example, our proposed scoring function is given by:

$$f\bigl(\phi(x)\bigr) = 1 - \frac{1}{k}\sum_{\phi(x_i)\in \mathrm{KFN}(\phi(x))} e^{-\|\phi(x)-\phi(x_i)\|^{2}} \qquad (4)$$

where ‖φ(x) − φ(xi)‖² is the square of the distance from φ(x) to φ(xi), and it holds that ‖φ(x) − φ(xi)‖² = φ(x) · φ(x) + φ(xi) · φ(xi) − 2φ(x) · φ(xi). Here, φ(x) · φ(xi) represents the inner product of φ(x) and φ(xi), and it can be replaced by the kernel function K(x, xi). To illustrate (4), we assume that k is equal to 1 and the top 1-farthest neighbor list contains only one example φ(xi). Hence, (4) can be rewritten as f(φ(x)) = 1 − e^{−‖φ(x)−φ(xi)‖²}, where e^{−‖·‖²} is an exponentially decaying function. By utilizing this function, when the distance between φ(x) and φ(xi) is sufficiently small, the value of e^{−‖φ(x)−φ(xi)‖²} is approximately 1 and f(φ(x)) is close to 0. Conversely, when the distance between φ(x) and φ(xi) is large enough, the value of e^{−‖φ(x)−φ(xi)‖²} approaches 0 and f(φ(x)) reaches 1. That is to say, when φ(x) is close to its k-farthest neighbors, a small ranking score f(φ(x)) is obtained. When φ(x) is far from its k-farthest neighbors, a large ranking score f(φ(x)) is attained. Moreover, ‖·‖² is always no less than 0 and 0 < e^{−‖·‖²} < 1 holds. Hence, the value of the ranking score f(φ(x)) falls into the range (0, 1). It is known that after the target data is mapped from the input space into the feature space via a proper kernel function, the target data becomes more compact in the feature space, so that a hyper-sphere classifier is obtained by enclosing the target data with satisfactory accuracy, as shown in Fig. 3(a). Considering this geometrical characteristic of SVDD, the scoring function in (4) is capable of determining the examples which are likely to lie around the boundary of the target class. Let us take Figs. 3(a)–(b) as an example to


Algorithm 2 The calculation of ranking scores with M-tree
Input: Q (query point), k (integer), T (tree built);
Output: f(Q) (ranking score);
1: Let FN = ∅;                  /* FN is an empty array */
2: FN = KFN_Query(Q, k, T);     /* obtain the k-farthest neighbors */
3: f(Q) = RankingScore(Q, FN);  /* calculate the ranking score */
4: return f(Q);

Fig. 3 Illustration of the scoring function with 3-farthest neighbors

illustrate how the scoring function contributes to the boundary example determination. Figure 3(a) shows that the SVDD classifier is a hyper-sphere and the SVs are those examples lying on the hyper-sphere surface or outside the hyper-sphere. To make the illustration clearer, we only keep examples φ(x1)–φ(x8) and remove the other examples, as presented in Fig. 3(b). φ(x1) is an SV lying near the hyper-sphere surface, while φ(x5) is a non-SV located deep inside the hyper-sphere. φ(x2), φ(x3) and φ(x4) are the 3-farthest neighbors of φ(x1), while the 3-farthest neighbors of φ(x5) are φ(x6), φ(x7) and φ(x8). Then, we consider the corresponding ranking scores of φ(x1) and φ(x5), as follows:

$$f\bigl(\phi(x_1)\bigr) = 1 - \frac{1}{3}\sum_{i=2}^{4} e^{-\|\phi(x_1)-\phi(x_i)\|^{2}}, \qquad (5)$$

$$f\bigl(\phi(x_5)\bigr) = 1 - \frac{1}{3}\sum_{i=6}^{8} e^{-\|\phi(x_5)-\phi(x_i)\|^{2}}. \qquad (6)$$
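To make the scoring rule (4) concrete, the following brute-force NumPy sketch computes the score of every point from its k farthest neighbours. Euclidean distances in the input space are used for brevity, whereas the paper evaluates the distances in the kernel-induced feature space, and the O(n²) distance matrix built here is exactly what the M-tree query of Sect. 3.1 avoids.

```python
import numpy as np

def kfn_scores(X, k):
    """Ranking scores of Eq. (4), computed by brute force.

    For each point, find its k farthest neighbours and return
    f(x) = 1 - (1/k) * sum_j exp(-||x - x_j||^2).
    """
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    np.fill_diagonal(d2, -np.inf)                          # a point is not its own neighbour
    far = np.sort(d2, axis=1)[:, -k:]                      # k largest squared distances per row
    return 1.0 - np.exp(-far).mean(axis=1)

# points far from the bulk of the data receive scores close to 1
X = np.random.randn(200, 2)
print(kfn_scores(X, k=3)[:5])
```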

For f (φ(x1 )) and f (φ(x5 )), it is observed from Fig. 3(b) that the distances between φ(x1 ) and its k-farthest neighbors are larger than those of φ(x5 ). Hence, the ranking score of φ(x1 ) is greater than φ(x5 ), namely f (φ(x1 )) > f (φ(x5 )). From this example, it is seen that by tuning an appropriate k, the examples near the data boundary can have larger distances from their k-farthest neighbors than those lying deep inside the hyper-sphere, and hence the ranking scores of boundary examples are roughly larger than those inside the hyper-sphere. From this aspect, the scoring function can be considered as a way to determine the boundary examples approximately. The pseudo codes to calculate the ranking scores with M-tree are presented in Algorithm 2. The pseudo codes for the functions in Algorithm 2 are shown in Algorithms 2.1, 2.1.1, 2.1.2 and 2.2, respectively. Selection of subset sizes After the k-farthest neighbors have been determined by the extended M-tree, each example in the dataset is assigned with a ranking score according

Algorithm 2.1 Pseudo codes for the "KFN_Query" function
Input: Q (query point), k (integer), T (tree built);
Output: FN (k-farthest neighbor array);
1: PR = [T, _];
2: for i = 1 to k do
3:   FN[i] = [_, 0];
4: while PR ≠ ∅ do
5:   Next_Node = ChooseNode(PR);
6:   FN = KFN_NodeSearch(Next_Node, Q, k);
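The query strategy of Algorithm 2.1 can be sketched compactly in Python as below. This is a simplified, hypothetical rendering: the tree is a generic ball-tree-like structure rather than a full M-tree, and the FN array here holds only exact point distances instead of the lower-bound placeholders used above; the pruning rule (discard a subtree once its upper bound d_max falls below the current k-th farthest distance d_k) is the same.

```python
import heapq
import numpy as np

class Node:
    """A ball-tree-style node: routing centre, covering radius, children or points."""
    def __init__(self, centre, radius, children=None, points=None):
        self.centre = np.asarray(centre, float)
        self.radius = float(radius)
        self.children = children or []   # internal node: list of Node
        self.points = points             # leaf node: array of points, or None

def k_farthest(query, root, k):
    """Best-first k-farthest-neighbour search (a sketch of the query strategy)."""
    q = np.asarray(query, float)
    dmax_root = np.linalg.norm(q - root.centre) + root.radius
    pending = [(-dmax_root, 0, root)]        # PR queue: max-heap on d_max via negation
    best = []                                # FN array: min-heap of (distance, point)
    counter = 1                              # tie-breaker so heapq never compares Nodes
    while pending:
        neg_dmax, _, node = heapq.heappop(pending)
        dk = best[0][0] if len(best) == k else 0.0
        if -neg_dmax < dk:                   # even the farthest possible point cannot help
            continue
        if node.points is not None:          # leaf: score the stored points
            for p in node.points:
                d = np.linalg.norm(q - p)
                if len(best) < k:
                    heapq.heappush(best, (d, tuple(p)))
                elif d > best[0][0]:
                    heapq.heapreplace(best, (d, tuple(p)))
        else:                                # internal: enqueue children by their d_max
            for child in node.children:
                dmax = np.linalg.norm(q - child.centre) + child.radius
                heapq.heappush(pending, (-dmax, counter, child))
                counter += 1
    return sorted(best, reverse=True)        # the k farthest, most distant first

# tiny usage: a single leaf node holding random points
pts = np.random.randn(50, 2)
root = Node(centre=pts.mean(0),
            radius=np.linalg.norm(pts - pts.mean(0), axis=1).max(),
            points=pts)
print(k_farthest(np.zeros(2), root, k=3))
```

With k = 2 this mirrors the walk through Figs. 2(a)–(d): the subtree with the largest upper bound is opened first, and subtrees whose upper bound drops below the current second-farthest candidate are pruned.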

Algorithm 2.1.1 Pseudo codes for the "ChooseNode" function
Input: PR (priority queue);
Output: Next_Node (the next node to be searched);
1: Let dmax(T(Or*)) = Max{dmax(T(Or))};
2:   considering all the entries in PR;
3: Remove [T(Or*), dmax(T(Or*))] from PR;
4: Next_Node = T(Or*);
5: return Next_Node;

to the scoring function (4). The examples with large ranking scores tend to lie near the data boundary and those with small ranking scores are usually located deeply inside the hyper-sphere. Thereafter, we perform the following operations to select a subset of examples for learning the SVDD classifier. Firstly, we sort the training examples according to their ranking scores and the sum H of all scores is obtained, where H = Σ_{i=1}^{l} f(φ(x_i)). Secondly, we re-index the sorted examples as {φ(x^1), φ(x^2), . . . , φ(x^l)}, where the example with the highest score is represented as φ(x^1). Thirdly, let P denote a user pre-specified percentage. Starting from example φ(x^1), we keep picking examples until the sum of the chosen examples' scores reaches the pre-defined percentage P of H. Based on this, suppose that a subset of top ranked examples is selected out and contained in subset Ssub. For the examples in subset Ssub, the sum of their


Algorithm 2.1.2 Pseudo codes for the "KFN_NodeSearch" function
Input: Q (query point), k (integer), Next_Node (next node to be searched);
Output: FN (k-farthest neighbor array);
1: Let N = Next_Node;
2: if N is not a leaf then
3:   { for each entry Or in N do
4:     { if d(Or, Q) + rOr ≥ dk then
5:         Add [T(Or), dmax(T(Or))] to PR;
6:       if dmin(T(Or)) > dk then
7:         { dk = FN_Update([_, dmin(T(Or))]);
8:           for dmax(T(Or)) < dk do
9:             Remove the entry from PR; } } }
10: else
11:   { for all Oj in N do
12:     { if d(Q, Oj) ≥ dk then
13:       { dk = FN_Update([Oj, d(Oj, Q)]);
14:         for dmax(T(Or)) < dk do
15:           Remove the entry from PR; } } }

scores satisfies:

$$\frac{1}{H}\sum_{i=1}^{|S_{sub}|-1} f\bigl(\phi(x^{i})\bigr) < P \quad\text{and}\quad \frac{1}{H}\sum_{i=1}^{|S_{sub}|} f\bigl(\phi(x^{i})\bigr) \ge P, \qquad (7)$$

where |Ssub| denotes the number of examples in Ssub. Based on the selected examples in Ssub, the problem is transformed into:

$$\max_{\alpha}\;\sum_{i}\alpha_i K\bigl(x^{i}, x^{i}\bigr) - \sum_{i}\sum_{j}\alpha_i\alpha_j K\bigl(x^{i}, x^{j}\bigr) \qquad (8)$$

$$\text{s.t.}\quad 0 \le \alpha_i \le C,\qquad \sum_{i}\alpha_i = 1,\qquad i = 1, 2, \ldots, |S_{sub}|.$$
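A sketch of the selection rule (7) in code: the examples are sorted by ranking score and taken greedily until the accumulated score reaches P·H, and only that subset is handed to the SVDD solver. The train_svdd call in the usage comment is a placeholder for any solver of (8) (e.g., a modified LibSVM) and is not implemented here.

```python
import numpy as np

def select_boundary_subset(X, scores, P=0.8):
    """Keep the top-ranked examples whose scores sum to the fraction P of H (Eq. (7))."""
    order = np.argsort(scores)[::-1]                # highest ranking score first
    H = scores.sum()
    cum = np.cumsum(scores[order])
    n_keep = int(np.searchsorted(cum, P * H)) + 1   # smallest prefix reaching P * H
    subset = order[:n_keep]
    return X[subset], subset

# usage sketch: scores come from the k-farthest-neighbour scoring function,
# then SVDD (Eq. (8)) is trained on the reduced set only.
# X_sub, idx = select_boundary_subset(X, kfn_scores(X, k=100), P=0.8)
# model = train_svdd(X_sub, C=1.0, sigma=...)   # placeholder: any SVDD/QP solver
```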

From the optimization problem (8), it can be seen that the original learning problem is transformed into a smaller one. Compared to standard SVDD, which is trained on the entire dataset, our proposed approach learns the classifier using only a subset of examples located near the data boundary. As a result, the computational cost of learning the classifier is largely reduced while the training accuracy is maintained at the level obtained with the entire dataset. In summary, the kernel function is usually used to map the training data from the input space into a higher dimensional feature space. By adopting a proper kernel function, the training data in the feature space can become more compact and


Algorithm 2.2 Pseudo codes for the "RankingScore" function
Input: Q (query point), FN (k-farthest neighbor array);
Output: f(Q) (ranking score of query point Q);
1: Sum_NormalizedDis(Q) = 0;
2: for all O in FN do
3:   Dis(Q, O) = ‖Q − O‖²;
4:   NormalizedDis(Q, O) = exp(−Dis(Q, O));
5:   Sum_NormalizedDis(Q) = Sum_NormalizedDis(Q) + NormalizedDis(Q, O);
6: f(Q) = 1 − Sum_NormalizedDis(Q)/k;
7: return f(Q);

distinguishable. Based on the transformed data, the scoring function is presented to weight the examples based on the distances to their k-farthest neighbors. By tuning the parameter k, we can make the scoring function discriminative enough to determine the boundary examples. At the same time, we can adjust the value of P , making Ssub include more informative examples which are likely to be SVs, so that Ssub can be utilized to train a more accurate classifier. The learning time of our method consists of two parts: data pre-processing and subset training. On the one hand, the data pre-processing time refers to that taken to obtain the k-farthest neighbors and compute the ranking scores. The time complexity of building a balanced M-tree is around O(n log n), where n is the number of training examples [11, 40]. For each example, retrieving the M-tree for its k-farthest neighbors is appropriately O(log n) on average, which is affected by the amount of overlap among the metric regions [11, 41]. Hence, for a training set with n examples, the total time of building the M-tree and retrieving the k-farthest neighbors is O(2n log n). Based on the kfarthest neighbors, the ranking scores are computed and it takes time O(nk). Hence, the computational cost in the data pre-processing part is O(2n log n) + O(nk). On the other hand, the subset training time refers to that taken to train the SVDD classifier on the selected examples. Assume that the percentage of examples which make up the top P percentage of H is p. p is usually smaller than P . Considering that the time complexity of SVDD is O(n2 ), training the SVDD classifier on p percentage of examples takes time O((pn)2 ). Therefore, the time complexity of our method is O(2n log n) + O(nk) + O((pn)2 ), where it has 0 < p < 1 and k is far less than n. When the training set size is large, our method achieves markedly better speedup than SVDD. 3.3 Discussion of the proposed approach 3.3.1 KFN-CBD for classifying datasets with outliers In SVDD [5], some training examples may be located outside the hyper-sphere, so that the objective function can be


Fig. 4 KFN-CBD on the noisy normal distribution dataset

optimized by seeking a tradeoff between the hyper-sphere volume and the training errors. The examples lying outside the hyper-sphere are known as outliers. These outliers can be determined by our proposed KFN-CBD method. This is because the outliers are always located outside the hyper-sphere, and naturally have larger distances from the k-farthest neighbors than the examples inside the hypersphere. As a result, the outliers are usually associated with larger ranking scores. Take Fig. 3(b) as an example. φ(x1 ) is an outlier and φ(x5 ) is an example inside the hyper-sphere. It is seen that the distances from φ(x1 ) to its k-farthest neighbors are generally larger than those of φ(x5 ). Hence, the outlier φ(x1 ) has a larger ranking score than φ(x5 ). Figures 4(a)–(c) illustrate the classification boundary obtained by KFN-CBD on a noisy normal distribution dataset when different values of P are adopted to learn the SVDD classifier. In Fig. 4(a), the entire dataset is used for training. In Fig. 4(b), P = 30 % is set and the examples making up the top 30 % of H are chosen by KFN-CBD to train the SVDD classifier. It can be seen that although only a subset of examples is used, the classification boundary is exactly drawn out. Figure 4(c) shows the classification boundary when P is set to be 40 %. It is not difficult to see that the classification boundary stays unchanged and the newly added examples fall inside the hyper-sphere, which confirms the effectiveness of KFN-CBD in selecting the SVs. Additionally, the existence of outliers may be controlled by the penalty coefficient C [5]. When 0 < C < 1 holds, there may exist some examples (namely outliers) lying outside the hyper-sphere. In our experiments, the datasets implicitly contain outliers since we choose the value of C from 2−5 to 25 . The experimental results show that KFN-CBD can obtain comparable classification accuracy with standard SVDD, which indicates the capability of KFN-CBD in handling the datasets with outliers. 3.3.2 KFN-CBD for classifying multi-distributed datasets In the multi-distributed dataset, the examples come from more than one distribution. That is to say, in the input space, the dataset consists of several clusters and each cluster represents one of the data distributions. As discussed in [42],

the formulation of SVDD can be used to cluster the multidistributed unlabelled data (known as support vector clustering). In support vector clustering, the unlabelled data is mapped into the feature space and enclosed by a hypersphere. Then, the unlabelled data is mapped back to the input space. The data which is originally located on the hypersphere surface in the feature space is re-distributed on the boundaries of different clusters in the input space. By doing this, the data with multiple distributions can be identified. The process of SVDD to classify multi-distributed data is similar to that of support vector clustering. In SVDD, the multi-distributed data is mapped into the feature space and the classifier is obtained by enclosing the data in a hypersphere. By choosing an appropriate kernel function, the examples which originally lie on the boundaries of different clusters in the input space are relocated on the hyper-sphere surface in the feature space. Similar to the single distributed datasets, the SVs of multi-distributed datasets lie on or outside the hyper-sphere surface. Therefore, KFN-CBD can be applied on multi-distributed datasets by capturing the examples near the hyper-sphere boundary. Figures 5(a)–(c) show the classification boundary of KFN-CBD on a multi-distributed dataset which consists of two banana shaped clusters. In Fig. 5(a), the whole dataset is trained. In Fig. 5(b), P = 30 % is set and the examples making up the top 30 % of H are selected to learn the SVDD classifier. It can be seen that some of the SVs have been detected. Figure 5(c) illustrates the classification boundary when P is equal to 40 %. It is observed that all SVs have been determined and the classification boundary is explicitly drawn out. From the above illustrations, it can be seen that for the dataset whose data structure is not a hyper-sphere, e.g., the multi-distributed dataset in Fig. 5(a), our method can still obtain satisfactory classification results by utilizing the kernel function. In the experiments, we explicitly investigate the performance of KFN-CBD on multi-distributed datasets. Among the experimental datasets, as shown in Table 2, the target classes of Mnist(1), Mnist(2) and Covertype(3) sub-datasets are made up of multi-distributed data. For example, the target class of Mnist(1) sub-dataset consists of the data from five classes, i.e., classes 1, 2, 3, 4 and 5, and each class


Fig. 5 KFN-CBD on the multi-distributed dataset

comes from one data distribution. Furthermore, the experimental results demonstrate that KFN-CBD is effective in dealing with the multi-distributed datasets. For example, KFN-CBD obtains comparable classification accuracy to standard SVDD on Mnist(1) sub-dataset, as shown in Table 3.
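The behaviour illustrated in Figs. 4 and 5 can be reproduced qualitatively on synthetic data: outliers and points near the data boundary receive the largest ranking scores and therefore survive the top-P selection. The following self-contained brute-force sketch (not the paper's experimental setup) uses a noisy normal cloud with a few appended outliers:

```python
import numpy as np

rng = np.random.default_rng(1)
blob = rng.normal(0.0, 1.0, size=(300, 2))        # target class: one noisy normal cloud
outliers = rng.uniform(-6.0, 6.0, size=(10, 2))   # a few scattered outliers
X = np.vstack([blob, outliers])                   # outliers occupy indices 300..309

# ranking scores of Eq. (4), brute force, k = 50
k = 50
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
scores = 1.0 - np.exp(-np.sort(d2, axis=1)[:, -k:]).mean(axis=1)

# keep the examples carrying the top 40 % of the total score mass (the role of P)
order = np.argsort(scores)[::-1]
n_keep = int(np.searchsorted(np.cumsum(scores[order]), 0.4 * scores.sum())) + 1
kept = order[:n_keep]
print("kept %d of %d examples" % (n_keep, len(X)))
print("appended outliers kept:", np.intersect1d(kept, np.arange(300, 310)).size, "of 10")
```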

4 Experiment

We investigate the accuracy, recall, precision and efficiency of KFN-CBD on large-scale datasets with example sizes from 23523 to 226641. For comparison, standard SVDD and two representative data-processing methods, i.e., random sampling [21] and selective sampling [34], are used as baselines.

4.1 Dataset descriptions

The large-sized datasets used in the experiments include the Mnist, IJCNN and Covertype datasets. The general descriptions of these datasets are given as follows:

1. The Mnist dataset² contains 70000 examples with 10 classes and 780 features. PCA is used to select 60 features for the experiments (a sketch of this step follows the dataset footnotes below). The example sizes from class 1 to 10 are 5923, 6742, 5958, 6131, 5842, 5421, 5918, 6265, 5851 and 5949, respectively.
2. The IJCNN dataset³ has 141690 examples with 2 classes and 22 features. The example sizes of class 1 and class 2 are 128126 and 13564, respectively.
3. The Covertype dataset⁴ is a relatively large dataset, containing 581012 examples with 7 classes and 54 features. The example sizes from class 1 to class 7 are 211840, 283301, 35754, 2747, 9493, 17367 and 20510, respectively.

² Available at http://yann.lecun.com/exdb/mnist/.
³ Available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
⁴ Available at http://archive.ics.uci.edu.
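As a small, hypothetical illustration of the feature-reduction step mentioned in the Mnist description above (the paper does not state which PCA implementation was used), the projection onto 60 components could be done as follows:

```python
from sklearn.decomposition import PCA

def reduce_to_60(X_raw):
    """Project the 780-dimensional Mnist vectors onto their first 60 principal components."""
    pca = PCA(n_components=60)
    return pca.fit_transform(X_raw)

# X_raw: (70000, 780) array loaded from the Mnist files (loading not shown)
# X_60 = reduce_to_60(X_raw)
```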

Table 2 Descriptions of the sub-datasets used for one-class classification

| Name | Target class | Training set size | Testing set size | # Features |
|---|---|---|---|---|
| Mnist(1) | 1, 2, 3, 4, 5 | 24477 | 35523 | 60 |
| Mnist(2) | 6, 7, 8, 9, 10 | 23523 | 36477 | 60 |
| IJCNN(1) | 1 | 102501 | 39189 | 22 |
| Covertype(1) | 1 | 169472 | 411540 | 54 |
| Covertype(2) | 2 | 226641 | 354371 | 54 |
| Covertype(3) | 3, 4, 5, 6, 7 | 68697 | 512315 | 54 |
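Given arrays X and y holding a dataset and its original class labels, the target/non-target groupings of Table 2 amount to a simple relabelling. A sketch (array names are illustrative, and loading of the raw files is not shown):

```python
import numpy as np

# target-class groupings taken from Table 2 (original class labels)
TARGET_CLASSES = {
    "Mnist(1)":     {1, 2, 3, 4, 5},
    "Mnist(2)":     {6, 7, 8, 9, 10},
    "IJCNN(1)":     {1},
    "Covertype(1)": {1},
    "Covertype(2)": {2},
    "Covertype(3)": {3, 4, 5, 6, 7},
}

def one_class_split(X, y, sub_dataset):
    """Return (target examples, non-target examples) for one sub-dataset of Table 2."""
    is_target = np.isin(y, list(TARGET_CLASSES[sub_dataset]))
    return X[is_target], X[~is_target]

# e.g. X_target, X_nontarget = one_class_split(X_mnist, y_mnist, "Mnist(1)")
```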

Since these datasets are designed for binary class or multi-class classification, we follow the operations in [5] to obtain sub-datasets for one-class classification. That is, choose one class as the target class and treat the other classes as the non-target class at each round. To ensure that the obtained sub-datasets are large, before performing the operations in [5], we re-organize each dataset as follows. For the Mnist dataset, the first five classes are grouped into one class and the last five classes are combined into another class. For the Covertype dataset, the last five classes are considered as one class. For the IJCNN dataset, since the first class occupies over 90 % of the examples, we only consider the first class as the target class. By doing this, we acquire 6 sub-datasets, i.e., Mnist(1), Mnist(2), IJCNN(1), Covertype(1), Covertype(2) and Covertype(3), for the task of one-class classification. Each number in parentheses represents the class number after some classes are combined. Table 2 shows the details of these datasets. In Table 2, “Target Class” is the classes in the original dataset, which we consider as the target class in our experiments. 4.2 Experimental settings All the experiments run on a Linux machine with 2.66 GHz processor and 4 GB DRAM. The training of SVDD is performed by modifying LibSVM,5 and the RBF kernel is 5 Available

at www.csie.ntu.edu.tw/~cjlin/libsvm/.

206

Y. Xiao et al.

Table 3 Results on large-scale sub-datasets (RS-random sampling, SS-selective sampling) (k = 100, P = 80 %)

Accuracy

Precision

Recall

SVDD

RS

SS

KFN-CBD

SVDD

KFN-CBD

SVDD

KFN-CBD

Mnist(1)

85.74

80.66

82.29

85.73

88.28

88.26

86.89

86.88

Mnist(2)

81.64

77.42

76.21

81.62

88.58

88.57

81.71

81.69

IJCNN(1)

89.15

76.85

81.71

89.12

85.76

85.73

90.22

90.18

Covertype(1)

64.49

55.59

58.41

64.48

68.99

68.98

73.11

73.09

Covertype(2)

62.36

52.57

56.66

62.34

73.54

73.53

70.76

70.74

Covertype(3)

60.58

55.93

57.59

60.65

71.05

71.02

65.64

65.62

adopted. As in [8, 29, 43], we select the kernel parameter σ from 10−5 to 105 , and the penalty coefficient C from 2−5 to 25 . The optimal parameters are selected by conducting fivefold cross validation on the target class ten times. For each of the 6 sub-datasets, the target class is partitioned into five splits with equal sizes. In each round, four of the five splits are used to form the training set; the remaining split and the non-target class are used as the testing set. We repeat this procedure ten times and compute the average results. It is seen that the training set sizes vary from 23523 to 226641, which guarantees that the experimental results are obtained based on relatively large-sized datasets. In Sect. 4.3, we fix k = 100 and P = 80 %, and compare KFN-CBD with the baselines. Here, k is the number of farthest neighbors considered in calculating the scoring function. The examples making up the top P percentage of H are selected to train the SVDD classifier. Then, we analyze the performance variations of KFN-CBD when different P and k values are adopted in Sect. 4.4. It is known that the upper-bound of outliers’ number is 1/C [5], where C is the penalty coefficient. For the selection of k, we make it no less than 1/C. This is because, in addition to outliers, the examples near the hyper-sphere surface can also be utilized to make the scoring function more distinguishable. Empirically, when k reaches 100, satisfactory classification results can be obtained. 4.3 Result comparison We investigate the performance of KFN-CBD with k = 100 and P = 80 %. Although larger values of k and P can improve the classification results, it is found that when k = 100 and P = 80 % are set, KFN-CBD reaches comparable classification quality as the whole dataset used. The performance variations of KFN-CBD under different values of k and P will be discussed later. Table 3 reports the experimental results of SVDD and KFN-CBD in terms of accuracy, precision and recall. It is clearly shown that KFN-CBD obtains comparable classification quality to SVDD. Taking Mnist(1) sub-dataset as an example, the accuracy, precision and recall of SVDD are

85.74 %, 88.28 % and 86.89 % respectively, while KFNCBD obtains very close classification results with 85.73 %, 88.26 % and 86.88 %, which indicates that KFN-CBD is effective for determining the SVs. In Table 3, the accuracy of random sampling and selective sampling on the 6 sub-datasets is also given. It can be easily seen that KFNCBD markedly outperforms random sampling and selective sampling. For example, the accuracy of KFN-CBD on the Mnist(1) sub-dataset is 85.73 %, which is higher than random sampling 80.66 % and selective sampling 82.29 %. The training time of KFN-CBD consists of the data pre-processing time Tp and the subset learning time Tl . Tp mainly refers to the k-farthest neighbor query and score evaluation, while Tl is the time taken by learning the classifier on the selected subset. For all results related to the speedup, we have taken both Tp and Tl into account. Let T (KFN) = Tp + Tl be the total training time of KFNCBD on the experimental dataset, and T (SVDD) be the learning time of SVDD. For each experimental dataset, the speedup of KFN-CBD is computed from dividing T (SVDD) by T (KFN). Similarly, the speedup of random sampling and selective sampling is calculated from dividing T (SVDD) by their corresponding training time. In this way, we can investigate the different improvements of KFN-CBD, random sampling and selective sample over SVDD. Table 4 presents the training time and SV numbers for KFN-CBD and SVDD, as well as the speedup for KFN-CBD, random sampling and selective sampling. The specific training time of random sampling and selective sampling can be obtained from dividing T (SVDD) by the corresponding speedup. T (SVDD) can be attained from Table 4. It is seen that KFN-CBD is around 6 times faster than SVDD, and selective sampling obtains appropriately 2 times speedup over SVDD. Moreover, KFN-CBD is 3 times faster than selective sampling on average. From the above analysis, it is clear that compared to SVDD, KFN-CBD obtains markedly speedup and meanwhile guarantees the classification quality. In addition, KFN-CBD turns out to be explicitly faster than both of random sampling and selective sampling, and demonstrates even better classification accuracy.

Table 4 Results on large-scale sub-datasets (RS – random sampling, SS – selective sampling, Tp – time taken by data pre-processing, Tl – time taken by subset learning) (k = 100, P = 80 %)

| Dataset | Training time SVDD | Training time KFN-CBD (Tp + Tl) | Speedup RS | Speedup SS | Speedup KFN-CBD | SV SVDD | SV KFN-CBD |
|---|---|---|---|---|---|---|---|
| Mnist(1) | 1645 | 85 + 218 | 1.42 | 1.94 | 5.43 | 1040 | 558 |
| Mnist(2) | 1583 | 58 + 201 | 1.25 | 2.03 | 6.11 | 1037 | 517 |
| IJCNN(1) | 3969 | 195 + 471 | 1.39 | 1.85 | 5.96 | 3452 | 2078 |
| Covertype(1) | 10180 | 397 + 1242 | 1.33 | 1.93 | 6.21 | 7222 | 4668 |
| Covertype(2) | 35836 | 1327 + 4494 | 1.27 | 2.05 | 6.16 | 9833 | 7388 |
| Covertype(3) | 1904 | 117 + 208 | 1.17 | 1.60 | 5.86 | 3048 | 1604 |

Fig. 6 Performance of KFN-CBD on the Covertype(1), (2) and (3) sub-datasets. (a)–(d) show the results when k is fixed to be 100 and P varies from 30 % to 90 %. (e) and (f) present the results when P is fixed to be 80 % and k varies from 100 to 500

4.4 Performance variations In this section, the performance variations of KFN-CBD under different k, P and dimension numbers are investigated. Moreover, the classification accuracy of KFN-CBD with different distance measurements is also reported. Figures 6(a)– (d) present the results of KFN-CBD with k = 100 and P

varying from 30 % to 90 % on the Covertype(1), (2) and (3) sub-datasets. In order to better reflect the relative performance of KFN-CBD compared to standard SVDD, the accuracy rate, precision rate and recall rate are given. The accuracy rate is the ratio of KFN-CBD’s accuracy to that of SVDD. The precision rate and recall rate are obtained in a similar way. The specific values of accuracy, precision and


Table 5 Classification accuracy with different distance measurements

Fig. 7 Accuracy rates of KFN-CBD on the Mnist(1) dataset with different dimension numbers

recall for KFN-CBD can be obtained by multiplying the corresponding rates with those of standard SVDD, as shown in Table 3. In Figs. 6(a)–(c), it is seen that the accuracy rate, recall rate and precision rate improve synchronously as P increases. It is noted that when P reaches 80 %, KFN-CBD obtains comparable accuracy as standard SVDD. Meanwhile, the speedup decreases from about 46.7 to around 4.65 on average, as shown in Fig. 6(d), since more examples are included in the training phase. Figures 6(e) and (f) show the accuracy rate and speedup when P is set to be 80 % and k varies from 100 to 500. We observe that with the increase of k, the accuracy rate remains almost unchanged, while the speedup dips since more processing time is needed to obtain the k-farthest neighbors. From these experimental results, it is seen that the classification quality are not very sensitive to k and even a small number of k (e.g., k = 100) can suffice. Moreover, we evaluate the performance of KFN-CBD with different dimension numbers. To do this, we form five sub-datasets by selecting the first 20 %, 40 %, 60 %, 80 % and 100 % dimensions from the Mnist(1) dataset, respectively. The Mnist(1) dataset used in the experiments contains 60 dimensions, and hence the five sub-datasets have 12, 24, 36, 48 and 60 dimensions, respectively. Based on the five sub-datasets, KFN-CBD is performed to learn the classifiers, and the accuracy rates are recorded. Let Acc(1) denote the classification accuracy of KFN-CBD on the subdataset with 12 dimensions, and Acc(SVDD) represent the accuracy of SVDD on the Mnist(1) dataset with all 60 dimensions. To conduct a clear comparison, the accuracy rate is computed from dividing Acc(1) by Acc(SVDD). The specific value of Acc(1) can be obtained by multiplying the accuracy rate and Acc(SVDD), whose value is 85.74 %, as seen from Table 3. In this way, the accuracy rates for the five sub-datasets with different dimensions can be obtained, as shown in Fig. 7. Here, P and k are set to 80 % and 100, respectively. It can be seen that as the number of dimensions increases, the accuracy rate goes up. This is because the data dimensions represent different classification information which is valuable to construct the classifier. When

| Distance | Mnist(1) | IJCNN(1) | Covertype(1) |
|---|---|---|---|
| Euclidean distance | 85.73 | 89.12 | 64.48 |
| Manhattan distance | 85.72 | 89.09 | 64.48 |
| Mahalanobis distance | 85.79 | 89.08 | 64.46 |

more dimensions are contained in the datasets, more classification information can be available to learn an accuracy classifier. When the dimension number reaches 60, the accuracy rate is about 100 %. That is to say, when all dimensions are included in the training phase, KFN-CBD can achieve comparable accuracy to SVDD, which can be also observed from Table 3. More experiments on the Mnist and IJCNN sub-datasets are presented in Figs. 8 and 9, respectively, and the similar findings can be observed. Above, we evaluate the performance of KFN-CBD using the Euclidean distance. It is also interesting to investigate its performance under different distance measurements. Table 5 presents the accuracy of KFN-CBD by employing the Euclidean, Manhattan and Mahalanobis distances, respectively. It is seen that KFN-CBD obtains relatively stable classification accuracy with different distance measurements, which verifies the robustness of our method. KFN-CBD is developed based on M-tree which is a metric tree and just requires the distance function to be a metric, namely satisfying the non-negative, identity of indiscernibles, symmetry and triangle inequality properties [11]. Hence, KFN-CBD is applicable to the distance measurement which is a metric.
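The distance measurements of Table 5 can be swapped into the scoring step by changing only the pairwise-distance routine. A sketch using SciPy's cdist (the Mahalanobis case additionally needs the inverse covariance matrix of the training data; as elsewhere, this is an illustration rather than the paper's code):

```python
import numpy as np
from scipy.spatial.distance import cdist

def kfn_scores_with_metric(X, k, metric="euclidean", **kwargs):
    """Ranking scores of Eq. (4) under a chosen distance measurement.

    metric: 'euclidean', 'cityblock' (Manhattan) or 'mahalanobis'
            (the latter via VI = inverse covariance matrix).
    """
    D = cdist(X, X, metric=metric, **kwargs)     # pairwise distances
    far = np.sort(D, axis=1)[:, -k:]             # k largest distances per row
    return 1.0 - np.exp(-far ** 2).mean(axis=1)

X = np.random.randn(500, 10)
VI = np.linalg.inv(np.cov(X, rowvar=False))
for name, kw in [("euclidean", {}), ("cityblock", {}), ("mahalanobis", {"VI": VI})]:
    print(name, kfn_scores_with_metric(X, k=100, metric=name, **kw)[:3])
```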

5 Conclusions and future work 5.1 Contribution of this work SVDD faces a critical computation issue in handling largescale datasets. This paper proposes a novel approach, named K-Farthest-Neighbor-based Concept Boundary Detection (KFN-CBD), to provide a reliable and scalable solution for speeding up SVDD on large-scale datasets. KFN-CBD holds the following characteristics: (1) the k-farthest-neighborbased scoring function is proposed to weight the examples according to their proximity to the data boundary; (2) by adopting a novel tree query strategy, M-tree is extended to speed up the k-farthest neighbor search; to the best of our knowledge, this is the first learning method for efficient kfarthest neighbor query; (3) it is able to acquire comparable classification accuracy, and with significantly reduced computational costs.


Fig. 8 Performance of KFN-CBD on the Mnist(1) and (2) sub-datasets. (a) and (b) show the results when k is fixed to be 100 and P varies from 30 % to 90 %. (c) and (d) present the results when P is fixed to be 80 % and k varies from 100 to 500

Fig. 9 Performance of KFN-CBD on the IJCNN(1) sub-dataset. (a) and (b) show the results when k is set to be 100 and P varies from 30 % to 90 %. (c) and (d) present the results when P is set to be 80 % and k varies from 100 to 500

5.2 Limitations and future work There are several limitations that may restrict the use of our method in applications and we plan to extend this work in the following directions in the future.

(1) Farthest neighbor query in high dimensions: Our work for determining the k-farthest neighbors is based on M-tree, which is a distance-based tree learning method and inevitably suffers from the “curse of dimensionality” [11]. When the training data is in high dimen-


sions, the k-farthest neighbor query may become impractical. To deal with similarity search in high dimensions, some novel learning methods, such as localitysensitive hashing [44] and probably approximately correct NN search [41], are proposed. These methods are put forward for k-nearest neighbor search, and we would like to extend them to deal with k-farthest neighbor query in high dimensions. (2) Selection of k and P : k is the number of farthest neighbors included in computing the ranking score. The experiments show that when the value of k rises, the classification accuracy increases, but the learning time goes up synchronously. Hence, the user should trade off the classification accuracy and computational effort according to different application problems. In our experiments, we observe that when k reaches 100, satisfactory classification results can be obtained. In the future, we will investigate how to select an appropriate k according to different data characteristics and dataset sizes. Furthermore, the examples making up the top P percentage of H are selected to train the classifier. When P is larger, more boundary examples are likely to be included to train the classifier and better classification accuracy can be attained, but it takes longer learning time. Together with the parameter k, we will investigate the selection of P on more real-world datasets. In this paper, we mainly focus on one-class classification problems without the rare class. We will extend our method to one-class classification problems with the rare class and evaluate its performance on real-world applications in the future. Acknowledgements This work is supported by Natural Science Foundation of China (61070033, 61203280, 61202270), Guangdong Natural Science Funds for Distinguished Young Scholar (S2013050014133), Natural Science Foundation of Guangdong province (9251009001000005, S2011040004187, S2012040007078), Specialized Research Fund for the Doctoral Program of Higher Education (20124420120004), Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry, GDUT Overseas Outstanding Doctoral Fund (405120095), Science and Technology Plan Project of Guangzhou City (12C42111607, 201200000031, 2012J5100054), Science and Technology Plan Project of Panyu District Guangzhou (2012-Z-03-67), Australian Research Council Discovery Grant (DP1096218, DP130102691), and ARC Linkage Grant (LP100200774, LP120100566).

References

1. Vapnik VN (1998) Statistical learning theory. Wiley, New York
2. Yu H, Hsieh C, Chang K, Lin C (2012) Large linear classification when data cannot fit in memory. ACM Trans Knowl Discov Data 5(4):23210–23230
3. Diosan L, Rogozan A, Pecuchet JP (2012) Improving classification performance of support vector machine by genetically optimising kernel shape and hyper-parameters. Appl Intell 36(2):280–294

4. Shao YH, Wang Z, Chen WJ, Deng NY (2013) Least squares twin parametric-margin support vector machine for classification. Appl Intell 39(3):451–464
5. Tax DMJ, Duin RPW (2004) Support vector data description. Mach Learn 54(1):45–66
6. Wang Z, Zhang Q, Sun X (2010) Document clustering algorithm based on NMF and SVDD. In: Proceedings of the international conference on communication systems, networks and applications
7. Wang D, Tan X (2013) Centering SVDD for unsupervised feature representation in object classification. In: Proceedings of the international conference on neural information processing
8. Tax DMJ, Duin RPW (1999) Support vector data description applied to machine vibration analysis. In: Proceedings of the fifth annual conference of the ASCI, pp 398–405
9. Lee SW, Park J (2006) Low resolution face recognition based on support vector data description. Pattern Recognit 39(9):1809–1812
10. Brunner C, Fischer A, Luig K, Thies T (2012) Pairwise support vector machines and their application to large scale problems. J Mach Learn Res 13(1):2279–2292
11. Ciaccia P, Patella M, Zezula P (1997) M-tree: an efficient access method for similarity search in metric spaces. In: Proceedings of the international conference on very large data bases, pp 426–435
12. Debnath R, Muramatsu M, Takahashi H (2005) An efficient support vector machine learning method with second-order cone programming for large-scale problems. Appl Intell 23(3):219–239
13. Joachims T (1998) Making large-scale support vector machine learning practical. Advances in kernel methods. MIT Press, Cambridge
14. Osuna E, Freund R, Girosi F (1997) An improved training algorithm for support vector machines. In: Proceedings of the IEEE workshop on neural networks for signal processing, pp 276–285
15. Platt J (1998) Fast training of support vector machines using sequential minimal optimization. Advances in kernel methods. MIT Press, Cambridge
16. Keerthi S, Shevade S, Bhattacharyya C, Murthy K (2001) Improvements to Platt’s SMO algorithm for SVM classifier design. Neural Comput 13(3):637–649
17. Chang CC, Lin CJ (2001) LibSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm
18. Fine S, Scheinberg K (2001) Efficient SVM training using low-rank kernel representation. J Mach Learn Res 2:243–264
19. Cossalter M, Yan R, Zheng L (2011) Adaptive kernel approximation for large-scale non-linear SVM prediction. In: Proceedings of the international conference on machine learning
20. Chen M-S, Lin K-P (2011) Efficient kernel approximation for large-scale support vector machine classification. In: Proceedings of the SIAM international conference on data mining
21. Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
22. Graf H, Cosatto E, Bottou L, Dourdanovic I, Vapnik V (2005) Parallel support vector machines: the cascade SVM. In: Proceedings of the advances in neural information processing systems, pp 521–528
23. Smola A, Schölkopf B (2000) Sparse greedy matrix approximation for machine learning. In: Proceedings of the international conference on machine learning, pp 911–918
24. Pavlov D, Chudova D, Smyth P (2000) Towards scalable support vector machines using squashing. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 295–299

25. Li B, Chi M, Fan J, Xue X (2007) Support cluster machine. In: Proceedings of the international conference on machine learning, pp 505–512
26. Zanni L, Serafini T, Zanghirati G (2006) Parallel software for training large scale support vector machines on multiprocessor systems. J Mach Learn Res 7:1467–1492
27. Collobert R, Bengio S, Bengio Y (2002) A parallel mixture of SVMs for very large scale problems. Neural Comput 14(5):1105–1114
28. Gu Q, Han J (2013) Clustered support vector machines. In: Proceedings of the international conference on artificial intelligence and statistics
29. Yuan Z, Zheng N, Liu Y (2005) A cascaded mixture SVM classifier for object detection. In: Proceedings of the international conference on advances in neural networks, pp 906–912
30. Yu H, Yang J, Han J (2003) Classifying large data sets using SVM with hierarchical clusters. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 306–315
31. Sun S, Tseng C, Chen Y, Chuang S, Fu H (2004) Cluster-based support vector machines in text-independent speaker identification. In: Proceedings of the international conference on neural networks
32. Boley D, Cao D (2004) Training support vector machine using adaptive clustering. In: Proceedings of the SIAM international conference on data mining
33. Lawrence N, Seeger M, Herbrich R (2003) Fast sparse Gaussian process methods: the informative vector machine. In: Proceedings of the advances in neural information processing systems, pp 609–616
34. Kim P, Chang H, Song D, Choi J (2007) Fast support vector data description using k-means clustering. In: Proceedings of the international symposium on neural networks: advances in neural networks, part III, pp 506–514
35. Wang CW, You WH (2013) Boosting-SVM: effective learning with reduced data dimension. Appl Intell 39(3):465–474
36. Li C, Liu K, Wang H (2011) The incremental learning algorithm with support vector machine based on hyperplane-distance. Appl Intell 34(1):19–27
37. Lee LH, Lee CH, Rajkumar R, Isa D (2012) An enhanced support vector machine classification framework by using Euclidean distance function for text document categorization. Appl Intell 37(1):80–99
38. Garcia V, Debreuve E, Barlaud M (2008) Fast k nearest neighbor search using GPU. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition workshops
39. Menendez HD, Barrero DF, Camacho D (2013) A multi-objective genetic graph-based clustering algorithm with memory optimization. In: Proceedings of the IEEE congress on evolutionary computation
40. Ciaccia P, Patella M, Zezula P (1998) Bulk loading the M-tree. In: Proceedings of the Australasian database conference, pp 15–26
41. Ciaccia P, Patella M (2000) PAC nearest neighbor queries: approximate and controlled search in high-dimensional and metric spaces. In: Proceedings of the international conference on data engineering
42. Ben-Hur A, Horn D, Siegelmann H, Vapnik V (2001) Support vector clustering. J Mach Learn Res 2:125–137
43. Davy M, Godsill S (2002) Detection of abrupt spectral changes using support vector machines—an application to audio signal segmentation. In: Proceedings of the IEEE international conference on acoustics, speech, and signal processing
44. Datar M, Immorlica N, Indyk P, Mirrokni VS (2004) Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the annual ACM symposium on computational geometry

Yanshan Xiao received the Ph.D. degree in computer science from the Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia, in 2011. She is with the Faculty of Computer, Guangdong University of Technology. Her research interests include multiple-instance learning, support vector machines and data mining.

Bo Liu is with the Faculty of Automation, Guangdong University of Technology. His research interests include machine learning and data mining. He has published papers in IEEE Transactions on Neural Networks, IEEE Transactions on Knowledge and Data Engineering, Knowledge and Information Systems, the International Joint Conference on Artificial Intelligence (IJCAI), the IEEE International Conference on Data Mining (ICDM), the SIAM International Conference on Data Mining (SDM) and the ACM International Conference on Information and Knowledge Management (CIKM).

Zhifeng Hao is a professor with the Faculty of Computer, Guangdong University of Technology. His research interests include the design and analysis of algorithms, mathematical modeling and combinatorial optimization.

Longbing Cao is a professor with the Faculty of Information Technology, University of Technology, Sydney. His research interests include data mining, multi-agent technology, and agent and data mining integration. He is a senior member of the IEEE.