International Journal of Pattern Recognition and Artificial Intelligence, Vol. 17, No. 3 (2003) 333–347
© World Scientific Publishing Company
KERNEL WHITENING FOR ONE-CLASS CLASSIFICATION
DAVID M. J. TAX
Fraunhofer Institute FIRST.IDA, Kekuléstr. 7, D-12489 Berlin, Germany
[email protected]

PIOTR JUSZCZAK
Pattern Recognition Group, Faculty of Applied Science, Delft University of Technology, Lorentzweg 1, 2628 CJ Delft, The Netherlands
[email protected]
In one-class classification one tries to describe a class of target data and to distinguish it from all other possible outlier objects. Obvious applications are areas where outliers are very diverse or very difficult or expensive to measure, such as in machine diagnostics or in medical applications. In order to have a good distinction between the target objects and the outliers, a good representation of the data is essential. The performance of many one-class classifiers critically depends on the scaling of the data and is often harmed by data distributions in (nonlinear) subspaces. This paper presents a simple preprocessing method which actively tries to map the data to a spherically symmetric cluster and is almost insensitive to data distributed in subspaces. It uses techniques from Kernel PCA to rescale the data in a kernel feature space to unit variance. This transformed data can then be described very well by the Support Vector Data Description, which basically fits a hypersphere around the data. The paper presents the methods and some preliminary experimental results.

Keywords: Novelty detection; one-class classification; kernel PCA; feature extraction.
1. Introduction

In almost all machine learning and pattern recognition research, it is assumed that a (training) dataset is available which reflects well what can be expected in practice. On this data a classifier or regressor should be fitted such that good generalization over future instances will be achieved.2 Unfortunately, it is very hard to guarantee that the training data is a truly identically distributed sample from the real application. In the data gathering process certain events can easily be missed, because of their low probability of occurrence, their measuring costs or because of changing environments. In order to detect these "unexpected" or "ill represented" objects in new, incoming data, a classifier should be fitted which detects the objects that do not
resemble the bulk of the training data in some sense. This is the goal of one-class classification,7,14 novelty detection,8 outlier detection1 or concept learning.5 Here, one class of objects, the target class, has to be distinguished from all other possible objects, the outlier objects.

A common solution for outlier or novelty detection is to fit a probability density on the target data,1,12,13 and classify an object as outlier when the object falls into a region with a density lower than some threshold value. This works well in the cases where the target data is sampled well, that is, when the sample size is sufficient and the distribution is representative. But density estimation requires large sample sizes. When the boundary of the target class has to be estimated from a limited sample, it might be better to fit the boundary directly instead of estimating the complete target density. This is Vapnik's principle of avoiding the solution of a more general problem than is actually needed.15 Using this principle, the problem is changed from density estimation to domain description.

The support vector data description (SVDD14) is a method which tries to fit a boundary with minimal volume directly around the target data, without performing density estimation. It is inspired by the (two-class) support vector classifier.15 All objects inside the hypersphere are "accepted" and classified as target objects; all other objects are labeled outliers. By minimizing the volume of the hypersphere, it is hoped that the chance of accepting outliers is minimized. Schölkopf9 presented a linear one-class classifier, based on the idea of separating the data with maximal margin from the origin. In Ref. 3 again a linear classifier was used, but here the problem was posed as a linear programming problem instead of a quadratic programming problem.

In general, the hypersphere model is not flexible enough to give a tight description of the target class and, analogous to the Support Vector Classifier (SVC), the SVDD is made more flexible by transforming the objects from the input space representation to a representation in a kernel space. It appears that not all kernels that were proposed for the SVC can be used by the SVDD. In most cases the data classes are elongated, which is useful for discrimination between two classes, but is harmful for one-class classification. An exception is the Gaussian kernel, with which good performances can be obtained. Unfortunately, even using the Gaussian kernel, a homogeneous input feature space is assumed, which means that distances in all directions in the space should be comparable. In practice, data is often distributed in subspaces, resulting in very small typical distances between objects in directions perpendicular to the subspace. Moving inside the subspace will change the objects just slightly, but moving out of the subspace will result in an illegal object, or an outlier. Although comparable distances are traveled, the class memberships of the objects differ drastically. This problem does not just harm the SVDD, but in principle all one-class methods which rely on distances or similarities between the objects.

In this paper we propose a rescaling of the data in the kernel feature space, which is robust against large differences in scaling of the input data. It rescales the
data in a kernel space such that the variances of the data are equal in all directions. We will use the techniques of Kernel PCA.10 In Sec. 2, we will present the SVDD, followed by an example where it fails. In Sec. 3, the rescaling of the data is presented, followed by some experiments and conclusions.

2. SVDD

To describe the domain of a dataset, we enclose the data by a hypersphere with minimum volume (minimizing the chance of accepting outlier objects). Assume we have a d-dimensional data set containing n data objects, X^tr : {x_i, i = 1, . . . , n}, and the hypersphere is described by center a and radius R. We will assume throughout the paper that a sum Σ_i sums over all training objects, i.e. Σ_{i=1}^n. To allow the possibility of outliers in the training set, the distance from x_i to the center a need not be strictly smaller than R^2, but larger distances should be penalized. An extra parameter ν is introduced for the trade-off between the volume of the hypersphere and the errors. Thus, an error function L, containing the volume of the hypersphere and the distances, is minimized. The solution is constrained with the requirement that (almost) all data is within the hypersphere. The constraints can be incorporated in the error function by applying Lagrange multipliers.2 This yields the following function to maximize with respect to α (for details, see Ref. 14):

L = Σ_i α_i (x_i · x_i) − Σ_{i,j} α_i α_j (x_i · x_j)   with 0 ≤ α_i ≤ 1/(nν), Σ_i α_i = 1   (1)

and

a = Σ_i α_i x_i .   (2)
The last constraint in (1) influences the effective range of the hyperparameter ν. For ν > 1 this constraint cannot be met, and therefore in practice 0 ≤ ν ≤ 1. (This hyperparameter plays the same role as ν in a comparable one-class classifier, the ν-SVC.9) Now (1) is a standard quadratic optimization problem. By the box constraints, the free parameters α_i after optimization can be in two situations: most objects x_i will satisfy ‖x_i − a‖^2 < R^2 and have α_i = 0, while just a few objects x_i have α_i > 0. Analogous to Ref. 15 these objects are called the support objects, because they determine the (center of the) hypersphere via (2). A new object z is accepted by the description (or classified as target object) when:

f(z) = ‖z − a‖^2 = (z · z) − 2 Σ_i α_i (z · x_i) + Σ_{i,j} α_i α_j (x_i · x_j) ≤ R^2 .   (3)
The radius R is determined by calculating the distance from the center a to any support vector x_i on the boundary.
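To make the optimization in (1) and the decision rule (3) concrete, the following is a minimal sketch of a linear SVDD. It is not the authors' implementation: the function names are ours, and a general-purpose solver (scipy.optimize.minimize with SLSQP) stands in for a dedicated QP solver.

import numpy as np
from scipy.optimize import minimize

def svdd_fit(X, nu=0.1):
    """Minimal linear SVDD: maximize the dual (1) over alpha under the box and sum constraints."""
    n = X.shape[0]
    K = X @ X.T                                # inner products (x_i . x_j)
    diag = np.diag(K)
    C = 1.0 / (n * nu)                         # upper bound on alpha_i

    def neg_L(a):                              # negated dual, so a minimizer maximizes (1)
        return -(a @ diag - a @ K @ a)

    def neg_L_grad(a):
        return -(diag - 2.0 * K @ a)

    res = minimize(neg_L, np.full(n, 1.0 / n), jac=neg_L_grad,
                   bounds=[(0.0, C)] * n,
                   constraints=[{'type': 'eq', 'fun': lambda a: np.sum(a) - 1.0}],
                   method='SLSQP')
    alpha = res.x
    center = alpha @ X                         # Eq. (2): a = sum_i alpha_i x_i
    # Radius from a support vector on the boundary (0 < alpha_i < C);
    # fall back to all support objects if none lies strictly inside the box.
    on_boundary = (alpha > 1e-6) & (alpha < C - 1e-6)
    if not np.any(on_boundary):
        on_boundary = alpha > 1e-6
    R2 = np.max(np.sum((X[on_boundary] - center) ** 2, axis=1))
    return center, R2

def svdd_accepts(z, center, R2):
    """Eq. (3): accept z as a target object when its squared distance to the center is at most R^2."""
    return np.sum((z - center) ** 2) <= R2

With a kernel in place of the inner products, only the Gram matrix and the evaluation of (3) change; the optimization itself stays the same.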
The hyperspherical shape for the boundary of a dataset is very restricting and will not be satisfied in the general case. Analogous to the method of Vapnik,15 we can replace the inner products (x · y) in Eq. (1) and in (3) by kernel functions K(x, y) = Φ(x) · Φ(y) (where K is a positive definite kernel, or Mercer kernel). By this replacement of the inner product by K, the data is implicitly mapped to a new feature space. Ideally, this mapping would map the data into a spherically constrained domain, such that the assumptions for the SVDD are fulfilled.

Several kernels have been proposed,15 mainly in the application of Support Vector Classifiers. A popular choice is the polynomial kernel (x · y) → K(x, y) = (x · y + 1)^p, which maps the data to a feature space spanned by all monomial features up to degree p. For one-class classification this kernel works poorly, because it tends to transform the data into elongated, flat structures instead of spherical clusters. Especially for larger degrees p, taking the power will stress the differences in the variances in different feature directions. For large p the direction with largest variance in input space will overwhelm all smaller variances in kernel space.

For another popular kernel, the Gaussian kernel, this is not the case:

(x · y) → K(x, y) = exp(−‖x − y‖^2 / σ^2) .   (4)

The width parameter σ in the kernel (from definition (4)) determines the scale or resolution at which the data is considered in input space. Although here the data is implicitly mapped to an infinitely dimensional space F,11 the inner products (or the kernel outputs) are between 0 and 1. Furthermore, K(x, x) = 1, indicating that all objects have length 1, placing the objects effectively on a hypersphere with radius 1.

For good performance of the SVDD with the Gaussian kernel, still properly scaled distances are required. The new inner product (4) now depends on the distance ‖x − y‖^2. Very inhomogeneous distances will still result in elongated clusters and large empty areas around the target class in input feature space that are still accepted.
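As a small illustration (our own helper functions, not part of the paper), the two kernels can be computed for all pairs of objects as follows; p and σ are the free parameters whose influence is examined in Sec. 4.

import numpy as np

def polynomial_kernel(X, Y, p=3):
    """K(x, y) = (x . y + 1)^p for all pairs of rows of X and Y."""
    return (X @ Y.T + 1.0) ** p

def gaussian_kernel(X, Y, sigma=4.0):
    """Eq. (4): K(x, y) = exp(-||x - y||^2 / sigma^2)."""
    sq_dist = (np.sum(X ** 2, axis=1)[:, None]
               + np.sum(Y ** 2, axis=1)[None, :]
               - 2.0 * X @ Y.T)
    return np.exp(-sq_dist / sigma ** 2)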
Fig. 1. Decision boundary of an SVDD trained on an artificial 2D dataset.
In Fig. 1 a scatterplot of an artificial two-dimensional dataset is shown. The SVDD is trained to fit a boundary around it such that about 25% of the target data is on the boundary. Although the SVDD follows the curve in the data, it does not fit the subspace structure in the data tightly. A large strip inside the curve is classified as target object, but does not contain target training objects. This is caused by the large scale difference of the data parallel and perpendicular to the subspace.

In the approach of Ref. 9 a linear hyperplane instead of a hyperspherically shaped boundary is used. This plane should separate the target data with maximal margin from the origin of the feature space. Although in input space this is incomparable with the hypersphere approach, the method can be "kernelized", and using the Gaussian kernel this method appears to be identical to the SVDD.14

3. Kernel Whitening

Instead of directly fitting a hypersphere in the kernel space, we propose to rescale the data to have equal variance. Fitting a hypersphere in the rescaled space F will be identical to fitting an ellipsoid in the original kernel space. The rescaling is easily done, using the derivation of the Kernel PCA.10 The data is basically mapped onto the principal components (with the largest eigenvalues) of the data covariance matrix and then rescaled by the corresponding eigenvalues. Therefore the eigenvectors and eigenvalues of the covariance matrix in the kernel space have to be estimated. The eigenvectors with eigenvalues close or equal to zero will be disregarded.

Assume the data X^tr is mapped to the kernel space F by some (possibly nonlinear) mapping Φ : R^d → F. When we also assume that the data is centered in this space, i.e. Σ_i Φ(x_i) = 0, the covariance matrix C of the mapped dataset can be estimated by:

C = (1/n) Σ_i Φ(x_i) Φ(x_i)^T .   (5)

The eigenvectors v and eigenvalues λ satisfy:

Cv = (1/n) Σ_j (Φ(x_j) · v) Φ(x_j) = λv .   (6)
Equation (6) shows that the eigenvectors with nonzero eigenvalue must be in the span of the mapped data {Φ(x_i)}, which means that v can be expanded as:

v = Σ_i α_i Φ(x_i) .   (7)
Multiplying Eq. (6) from the left with Φ(x_k) and using (7) gives:

(1/n) Σ_j (Φ(x_k) · Φ(x_j)) (Φ(x_j) · Σ_i α_i Φ(x_i)) = λ Σ_i α_i (Φ(x_k) · Φ(x_i))   ∀k .   (8)
When again the kernel matrix K_ij = Φ(x_i) · Φ(x_j) is introduced, it appears that the coefficients α from Eq. (7) can directly be obtained by solving the eigenvalue problem:

λα = Kα .   (9)
For normal kernel PCA the eigenvectors should be normalized to unit length, and this means that for each eigenvector v^k the α^k are rescaled to:

λ_k (α^k · α^k) = 1 .   (10)

We assumed that the data is centered in F. This can be done by transforming the original kernel matrix. Assume K is the n × n kernel matrix of the training data and K^tst the m × n matrix of some new data (or possibly the same training data). The centered kernel matrix is computed by:

K̃ = K^tst − 1*_n K − K^tst 1_n + 1*_n K 1_n   (11)

where 1_n is an n × n matrix and 1*_n an m × n matrix, both with all entries 1/n.10 We will assume that we always have centered the kernel matrices using (11).

When the coefficients α are obtained, a new object z can be mapped onto eigenvector v^k in F by:

(ẑ)_k = (v^k · Φ(z)) = Σ_i α_i^k (Φ(x_i) · Φ(z)) = Σ_i α_i^k K(x_i, z)   (12)

where (ẑ)_k means the kth component of the vector ẑ. To transform the data into a representation with equal variance in each feature direction (for directions with λ_k > 0), the normalization from Eq. (10) has to be slightly adapted. The variance of the mapped data along component v^k is:

var(X̂^tr) = (1/n) Σ_j (x̂_j)_k^2 = (1/n) Σ_j ( Σ_i α_i^k K(x_i, x_j) )^2 = (1/n) (α^k)^T K K α^k .   (13)

Using Eq. (9), this is constant for all features when, instead of (10), we use the normalization:

λ_k^2 (α^k · α^k) = 1   for all considered components k .   (14)
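The complete whitening map can be summarized in a short sketch (our own illustration; the function and variable names are not from the paper). It centers the kernel matrices as in (11), solves the eigenproblem (9), applies the normalization (14) and projects objects as in (12); a kernel function such as those of Sec. 2 is assumed to have produced K and K_tst.

import numpy as np

def kernel_whitening_fit(K, d_prime):
    """Fit the whitening map from the (uncentered) n x n training kernel matrix K."""
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    # Eq. (11), applied to the training kernel matrix itself.
    K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    lam, A = np.linalg.eigh(K_c)               # Eq. (9): lambda * alpha = K alpha
    idx = np.argsort(lam)[::-1][:d_prime]      # keep the d' largest (nonzero) eigenvalues
    lam, A = lam[idx], A[:, idx]
    # Eq. (14): rescale each alpha^k so that lambda_k^2 (alpha^k . alpha^k) = 1,
    # which gives equal variance along every retained direction.
    A = A / lam[np.newaxis, :]
    return A

def kernel_whitening_map(K_tst, K, A):
    """Eq. (12): project objects, given their m x n kernel matrix K_tst against the training set."""
    m, n = K_tst.shape
    one_n = np.full((n, n), 1.0 / n)
    one_mn = np.full((m, n), 1.0 / n)
    K_tst_c = K_tst - one_mn @ K - K_tst @ one_n + one_mn @ K @ one_n   # Eq. (11)
    return K_tst_c @ A

Passing the training kernel matrix itself as K_tst returns the whitened training set.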
The dataset X̂^tr, transformed using the mapping (12) with normalization (14), can now be used by any one-class classifier. The dimensionality d′ of this dataset depends on how many principal components v^k are taken into account. Not only do all the features have equal variances; because the data is mapped onto the principal components of the covariance matrix, the features are also uncorrelated. The fact that the data is now properly scaled makes it ideal for estimating a normal distribution, or for using the SVDD, which in the linear case just fits a hypersphere.
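Because the whitened data has (approximately) zero mean and equal variance in every retained direction, even a very simple description suffices. The hypothetical helper below, built on the whitening sketch above, thresholds the squared distance to the mean (a Gaussian model with identity covariance) so that a chosen fraction of the training targets is rejected.

import numpy as np

def fit_simple_description(Z_train, reject_fraction=0.1):
    """One-class description on whitened data: a hypersphere around the mean."""
    mu = Z_train.mean(axis=0)
    d2 = np.sum((Z_train - mu) ** 2, axis=1)
    R2 = np.quantile(d2, 1.0 - reject_fraction)   # reject about this fraction of the targets
    return mu, R2

def accepts(Z_new, mu, R2):
    return np.sum((Z_new - mu) ** 2, axis=1) <= R2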
Fig. 2. The data description of a sinusoidally distributed dataset. The left plot shows an SVDD trained in the input space; the right shows the decision boundary of the hypersphere in kernel space. In both cases a Gaussian kernel with σ = 4 is used.
In Fig. 2 an artificial 2D dataset is shown, where the data is distributed in a sinusoidal subspace. In the left subplot the output of the SVDD is shown, in the right subplot the output of the SVDD with the data scaled to unit variance. In order to model this data well, a one-class classifier has to be very flexible, and large amounts of data should be available to follow both the large sinusoidal structure and be tight around the subspace. The SVDDs are optimized to have about 30% error on the target set. The decision boundaries are given by the white lines. It is clear that the description does not model the subspace structure in the data.

4. Characteristics of the Whitening

How efficient the mapping of the data to the new representation with unit variance is, depends on the choice of the kernel and its parameters. When this feature extraction captures the data structure, it is easy to train a one-class classifier on this data and obtain good classification performance. In Table 1 decision boundaries for the artificial data for different choices of the kernels are shown. The upper row shows the results for the polynomial kernel of degree p = 1, p = 3 and p = 5 (from left to right). The lower row shows the results for the Gaussian kernel, for σ = 5, 15 and 50. The results show a large dependence on the choice of the free parameter. The rescaling tends to overfit for high values of the degree p and low values of σ. Visually it can be judged that for the polynomial kernel p = 3 is reasonable; for the Gaussian kernel any σ between 15 and 50 can be used. Applying an ill-fitting kernel results in spurious areas in the input space.

Many one-class classifiers rely on the distances between the objects in the input space. When the data is whitened in the kernel space, and all significant eigenvectors are taken into account, the influence of rescaling (one of the) features is eliminated. In Table 2 the results of rescaling one of the features are shown. In the middle row a scatterplot of the original data is shown. On this dataset, an SVDD in input space, an SVDD on the whitened data with all nonzero principal components, and an SVDD using just the first 5 principal components are trained. It appears that for this data there are just 8 nonzero principal components. In the upper row of the table, the horizontal feature was rescaled to 10% of the original size, while in the lower row the horizontal feature was 10 times enlarged. The SVDD on the (kernel-)whitened data not only gives a tight description, but is also robust against rescaling the single feature. The SVDD in input space heavily suffers from rescaling the data. Using just a few principal components from the mapped data also results in poorer descriptions and in spurious areas.
Table 1. The influence of the choice of the kernel. The upper row shows the results using the polynomial kernel with varying degrees (p = 1, 3, 5), the lower row the Gaussian kernel with varying σ (σ = 5, 15, 50).
Table 2. Influence of the scaling of the features (rows: 0.1×, 1×, 10×; columns: SVDD, whitening, whitening reduced to d′ = 5). The left column shows the decision boundary of the SVDD, the middle column the results of the data description using the whitening with all nonzero variance directions, and the right column shows the output using the first five principal components. The middle row shows the original data. In the upper row the horizontal feature is shrunk 10 times, in the lower row the horizontal feature is enlarged 10 times. For display purposes the data is scaled to show comparable scales.
Fig. 3. Typical decision boundaries of the normal distribution (left) and the support vector data description (right), trained on the normalized data in F. The SVDD tends to be tighter, especially when some outliers are present in the training data (for instance, the object at (2.5, −0.8) is a clear outlier).
The fact that the data has unit variance with uncorrelated features makes the normal distribution a good choice for describing the dataset in the kernel space. In Fig. 3 the sinusoidal data set is shown again, now with one prominent outlier present. Furthermore, typical decision boundaries of the fitted normal distribution (left) and the support vector data description (right) are shown. In most cases the differences in decision boundary between the SVDD and the Gaussian model are minor. In case the training data contains some significant outliers, the SVDD tends to obtain tighter descriptions, because it can effectively ignore prominent outliers in the data. The normal distribution is still influenced by them, and starts to accept superfluous areas in feature space. This is also visible in Fig. 3. In both cases the decision boundary was optimized such that 10% of the training data is rejected.

5. Selection of the Kernel and Kernel Parameters

By applying a suitable kernel whitening, it is sufficient to use a simple one-class classifier on the mapped data. The complexity is moved from optimizing a classifier to optimizing the preprocessing. In order to avoid a complete model selection for both the kernel whitening and the classifier, we propose to use a standard model selection criterion for the selection of the parameters of the kernel whitening: the Chernoff distance between the target and outlier class. This distance is defined as:

J_C = −log ∫ p(z|ω_tar)^s p(z|ω_out)^(1−s) dz   (15)

where 0 ≤ s ≤ 1, p(z|ω_tar) is the data distribution of the target objects and p(z|ω_out) the distribution of the outlier objects. Parameter s is free to choose. Assume we have two normally distributed classes, with means µ_1 and µ_2, and covariance matrices Σ_1 and Σ_2. In this case the Chernoff distance reduces to:
J_C = (1/2) s(1 − s)(µ_2 − µ_1)^T [(1 − s)Σ_1 + sΣ_2]^(−1) (µ_2 − µ_1) + (1/2) log( |(1 − s)Σ_1 + sΣ_2| / (|Σ_1|^(1−s) |Σ_2|^s) ) .   (16)

In the case of one-class classification, we can make some assumptions about the distribution of the outlier class to simplify this expression. We assume that the mean of the outlier class is very close to the mean of the target class, such that µ_out − µ_tar = 0. By the kernel whitening, we transformed the data such that Σ_tar = I, the identity matrix. Furthermore we will use s = 1/2, such that our final evaluation criterion becomes:

J_C = (1/2) log( |I/2 + Σ_out/2| / |Σ_out|^(1/2) ) .   (17)

For the artificial dataset shown in Fig. 2 kernel whitening is applied using a polynomial kernel with the degree ranging from p = 1 to p = 5. The outlier class is artificially created, uniformly around the target class. For each of the kernels, the number of used principal components d′ is varied. On this data a one-class classifier is fitted (in this case an SVDD). Each time the Receiver Operating Characteristic (ROC) curves are computed.6 The ROC gives the error on the outlier data for varying values of the error on the target class. From the ROC curves an error is derived, called the Area Under the ROC curve (AUC). Here, low values of the AUC indicate a good separation between the target and outlier data.

The results are shown in the left subplot of Fig. 4, where the AUC is plotted as a function of the retained dimensionality d′ of the kernel PCA. It is clearly visible that for polynomial kernels with degrees 3, 4 and 5, increasing the dimensionality helps the separation. Around d′ = 8 good performances are obtained.
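As an illustration of how (17) can be evaluated in practice (our own sketch, not the authors' code), the criterion only needs the covariance matrix of the whitened, artificially generated outliers; the selection over candidate dimensionalities d′, discussed below in connection with Fig. 4, is indicated only as a hypothetical comment.

import numpy as np

def chernoff_criterion(Z_out):
    """Eq. (17): J_C = 1/2 log( |I/2 + Sigma_out/2| / |Sigma_out|^(1/2) ),
    assuming the whitened target class has identity covariance."""
    Sigma_out = np.atleast_2d(np.cov(Z_out, rowvar=False))
    d = Sigma_out.shape[0]
    _, logdet_mix = np.linalg.slogdet(0.5 * np.eye(d) + 0.5 * Sigma_out)
    _, logdet_out = np.linalg.slogdet(Sigma_out)
    return 0.5 * (logdet_mix - 0.5 * logdet_out)

# Hypothetical selection loop: whiten target and artificial outlier data for each
# candidate dimensionality d' and keep the value where J_C stops increasing, e.g.
#   scores = {dp: chernoff_criterion(whiten_outliers(dp)) for dp in range(1, 17)}
# where whiten_outliers() maps the outlier sample with the fitted whitening of Sec. 3.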
Fig. 4. (a) The AUC for varying data dimensionality d′ for the sinusoidal data which is preprocessed with the polynomial kernel with different degrees, p = 1, . . . , 5. (b) The resulting Chernoff distance between the target and outlier class for the different data dimensionalities.
Degrees 1 and 2 do not fit the data very well, and do not improve with increasing dimensionality. In particular, for p = 1, some instability is visible for higher values of d′. For p = 1, Σ_tar and Σ_out only have two significant eigenvalues; using higher values of d′ therefore results in unstable solutions. The same picture can be constructed for the Gaussian kernel, but there it appears that the different values for σ do not result in very distinct solutions (which is already visible in Table 1).

In the right subplot of Fig. 4 the Chernoff distance between the target and outlier classes is shown. For the linear kernel p = 1, no significant increase in the distance can be observed. For the other kernels, the distance increases until a maximum is reached and the distance stabilizes. Note that the higher the degree, the faster the distance increases. While this figure suggests that using higher degrees and higher dimensionalities gives the best performance, in practice very good performances are already obtained with lower values of p and d′. By the finite sample size the increase in distance is not reflected in a decrease in AUC. In practice the Chernoff distance can be used to find a near optimal d′ for each kernel (d′ = 2 for p = 1, d′ = 5 for p = 2, d′ = 8 for p = 3, etc.). Then, for each kernel and its corresponding d′, a one-class classifier should be trained, and the AUCs should be compared. This avoids the training and evaluation of a one-class classifier on each combination of kernel and dimensionality.

6. Experiments

To show the results on real world datasets, we use the standard Concordia dataset,4 in which the digits are stored in 32 × 32 black-and-white images. Each of the digit classes can be designated to be the target class; then all other digits are considered outliers. For training 400 objects per class, and for testing 200 objects per class are available. In Fig. 5 typical images of rejected objects are shown. The one-class classifier was trained on classes "2" and "3", respectively.
Fig. 5. Examples of rejected handwritten digits from the Concordia dataset. An SVDD trained on classes 2 and 3, kernel whitened using a polynomial kernel with d = 2.
Fig. 6. An SVDD trained on digit class "4". On the left, the data was preprocessed using normal PCA, and on the right, kernel whitening with polynomial d = 3 is used.
The 32 × 32 images are first preprocessed to retain 80% of the variance, to remove pixels with (almost) zero variance over the whole dataset. Then the data was (kernel-)whitened using a polynomial kernel, degree 2. The first 20 principal components were chosen; the eigenvalues of the other principal components were always a factor 10^(−6) or more smaller than the largest eigenvalue. On this data a normal SVDD was fitted, such that about 5% of the target data is rejected. The results show that rejected objects are often skewed, are written very fatly, or contain big curls.

In Fig. 6 the results are shown to compare the outliers obtained by using normal PCA and the kernel whitening, using the SVDD with a polynomial kernel of degree 3. In the normal PCA 12 objects are rejected. Some of them look reasonable by human interpretation. In the kernel whitening preprocessing, 10 objects are rejected. Some objects are rejected in both methods, for instance, the upper left object in the PCA and the second object in the kernel whitening. Other objects are specifically rejected because they do not fit the particular model, for instance, the lower right object in both the PCA and the kernel whitening.
Fig. 7. AUC errors on the 10 classes of the Concordia handwritten digits. Left shows the AUC error on all 10 classes for a simple Gaussian density, a Mixture of Gaussians (k = 5), SVDD, and whitening with polynomial degrees 1, 2 and 3.
In Fig. 7 results on all Concordia digit classes are shown. On each of the digit classes one-class classifiers are trained and the ROC curve is computed. On each of the classes six one-class classifiers have been trained. The first two methods are density models: the Normal Density and the Mixture of Gaussians (with 5 clusters). The third is the basic SVDD directly trained in the input space, optimized such that about 10% of the target class is rejected. In the last three classifiers the data is mapped using the kernel whitening (polynomial kernel, d = 1, 2 and 3). In many cases no clear break point in the Chernoff distance graph can be observed, in particular for lower kernel degrees. When no clear optimal dimensionality d′ can be observed, the default of d′ = 20 principal components was used.

In the left subplot, the data is not preprocessed. The density methods are not capable of estimating the density and give the highest AUC error of 0.5. In most cases the best performance is obtained by applying the whitening procedure with p = 2. The SVDD can perform poorly, due to the relatively low sample size and the complexity of following the boundary in the high dimensional feature space. Whitening with higher polynomial degrees also suffers from low sample size effects.

In the right subplot, the data is preprocessed by basic PCA to retain again 80% of the variance. By the reduction of the dimensionality, in some cases some overlap between the classes is introduced and the performance of the best whitening procedures deteriorates. The density methods now work well and often outperform the poorer whitening versions. The actual performance increase or decrease is mainly determined by how well the model fits the data. That means that for the whitening procedure good performance is obtained when the data is distributed in some (nonlinear) subspace.
7. Conclusions

This paper presents a simple whitening preprocessing for one-class classification problems. It uses the idea of Kernel PCA to extract the nonlinear principal features of the dataset. After mapping the data to this new feature space (implicitly defined by the kernel function), feature directions with (almost) zero variance are removed and the other features are rescaled to unit variance. By the Kernel PCA and rescaling, the resulting data has zero mean and an identity covariance matrix. Finally, this data can in principle be described by any one-class classifier.

By this preprocessing step, one-class classifiers can be trained on data which contains large differences in scale in the input space. In particular, data in (nonlinear) subspaces can be described well. For most one-class classifiers data distributed in subspaces is problematic, because the data contains large differences in typical scale within the subspace and perpendicular to the subspace. By using a suitable kernel in the kernel PCA, these scale differences in the data are recognized and modeled in the mapping. The transformed data then has equal variance in each feature direction.
The problem of how to choose the kernel function and the values of the hyperparameters is still open. When test data is available, both from the target and the outlier class, this can be used for evaluation of the model (which then includes both the whitening and the classifier in the kernel space). In the general case of one-class classification, we only have a very poorly represented outlier class, and estimating the performance on this dataset will give a bad indication of the expected performance. In these cases we have to rely on, for instance, artificially generated outlier data.

In order to find a suitable kernel, we propose the use of the Chernoff distance between the target and (possibly artificially created) outlier data. This Chernoff distance is relatively cheap to compute, and thus the expensive optimization of the one-class classifier and the computation of the AUC for all combinations of kernel definition and data dimensionality can be avoided. For each kernel definition a near optimal dimensionality can be estimated. Only for these combinations of kernel and dimensionality do a one-class classifier and the AUC have to be computed and compared.

The subspace modeling comes at a price though. The mapping requires a reasonable sample size, in order to extract the more complex nonlinear subspaces. Using too complex mappings and too many principal components in combination with small sample sizes will result in overfitting on the data and in poor results on independent test data. A further drawback of this rescaling on the Kernel PCA basis is that the expansion in (12) is in general not sparse. This means that for each projection of a test point onto a principal direction, all training objects have to be taken into account. For large training sets this can become very expensive. Fortunately, approximations can be made, which reduce the number of objects in the expansion (7) drastically.9

Acknowledgments

This research was supported through a European Community Marie Curie Fellowship. The author is solely responsible for information communicated and the European Commission is not responsible for any views or results expressed.

References

1. C. M. Bishop, "Novelty detection and neural network validation," IEE Proc. Vision, Image and Signal Processing, Special Issue on Applications of Neural Networks 141, 4 (1994) 217–222.
2. C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
3. C. Campbell and K. P. Bennett, "A linear programming approach to novelty detection," NIPS, 2000, pp. 395–401.
4. S.-B. Cho, "Recognition of unconstrained handwritten numerals by doubly self-organizing neural network," Int. Conf. Pattern Recognition, 1996.
5. N. Japkowicz, C. Myers and M. Gluck, "A novelty detection approach to classification," Proc. Fourteenth Int. Joint Conf. Artificial Intelligence, 1995, pp. 518–523.
6. C. E. Metz, "Basic principles of ROC analysis," Seminars in Nucl. Med. VIII, 4 (1978) 283–298.
7. M. M. Moya and D. R. Hush, "Network constraints and multi-objective optimization for one-class classification," Neural Networks 9, 3 (1996) 463–474.
8. G. Ritter and M. T. Gallegos, "Outliers in statistical pattern recognition and an application to automatic chromosome classification," Patt. Recogn. Lett. 18 (1997) 525–539.
9. B. Schölkopf, R. C. Williamson, A. Smola and J. Shawe-Taylor, "SV estimation of a distribution's support," Advances in Neural Information Processing Systems, 1999.
10. B. Schölkopf, A. J. Smola and K. R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Comput. 10, 5 (1998) 1299–1319.
11. A. J. Smola, "Learning with kernels," Ph.D. thesis, Technische Universität Berlin, 1998.
12. C. Surace, K. Worden and G. Tomlinson, "A novelty detection approach to diagnose damage in a cracked beam," Proc. SPIE, 1997, pp. 943–947.
13. L. Tarassenko, P. Hayton and M. Brady, "Novelty detection for the identification of masses in mammograms," Proc. Fourth Int. IEE Conf. Artificial Neural Networks, Vol. 409, 1995, pp. 442–447.
14. D. M. J. Tax, "One-class classification," Ph.D. thesis, Delft University of Technology, http://www.ph.tn.tudelft.nl/~davidt/thesis.pdf, June 2001.
15. V. N. Vapnik, Statistical Learning Theory, Wiley, 1998.
David M. J. Tax received the M.Sc. degree in physics from the University of Nijmegen, the Netherlands in 1996. In 2001 he received the Ph.D. degree at the Delft University of Technology for a thesis on the problem of oneclass classification or novelty detection. Currently he is working in a European Community Marie Curie Fellowship called “One-class classification” in the Intelligent Data Analysis group of Fraunhofer FIRST, Berlin, in close collaboration with Delft University of Technology. His research interests include pattern recognition and machine learning with a focus on outlier detection and novelty detection, the feature selection for and the evaluation of one-class classifiers.
Piotr Juszczak received the M.Sc. degree in biomedical engineering from Wroclaw University of Technology, Poland. As an exchange student, he stayed two years at the Department of Electronic Engineering, Galway University of Technology, Ireland. As a member of the Boston Scientific team, he was involved in a research project on the simulation of the blood flow in veins. Currently, he is a Ph.D. student in Pattern Recognition Group at Delft University of Technology, the Netherlands. His research interests include statistical pattern recognition, one-class classification problems and selective sampling algorithms.