Cross-Entropy Clustering approach to One-class Classification

P. Spurek¹*, M. Wójcik²*, and J. Tabor¹*

¹ Jagiellonian University, Faculty of Mathematics and Computer Science, Łojasiewicza 6, 30-348 Kraków, Poland, [email protected]
² AGH University of Science and Technology, Faculty of Electrotechnics, Automation, Computer Science and Biomedical Engineering, Al. Mickiewicza 30, 30-059 Kraków, Poland, [email protected]

Abstract. Cross-entropy clustering (CEC) is a density-model-based clustering algorithm. In this paper we present a possible application of CEC to one-class classification, which has several advantages over classical approaches based on Expectation Maximization (EM) and Support Vector Machines (SVM). More precisely, we can use various types of Gaussian models with lower computational complexity. We test the designed method on real data coming from the monitoring systems of wind turbines.

Keywords: covariance matrix, Gaussian filter, mathematical morphology, electron microscopy

1 Introduction

One-class classification, also called novelty detection, outlier detection, or data description [10, 14, 16], can be used to detect uncharacteristic observations. It is also useful when the background class contains enormous variation, making its estimation unfeasible, for example in the case of the background class in object detection (the background class should contain everything except the object to be detected). Another typical example is given by the classification of active chemical compounds, where the inactive class is usually totally non-representative, as scientists rarely publish information about chemically non-active compounds [13, 19, 21].

One-class classification is also necessary when samples can be obtained only from a single known class. This situation is often encountered in the case of data from the monitoring systems of wind turbines [4, 6, 8, 9]. The growing number of such systems creates the need to analyse gigabytes of data obtained every day. Apart from the development of several advanced diagnostic methods for this type of machinery, there is a need for a group of methods which can act as "early warning tools". The idea of this approach could be based on a data-driven algorithm which would decide on the similarity of the current data to data which are already known. Using the one-class classification method, data from a turbine which is in good condition could be classified as one class. If there are data points outside the determined class, it means that an unknown operational state of the turbine has appeared and an expert should be informed about the situation. Thus, this simple approach can be used when no failure states are known. In this paper some of the one-class classification methods are tested against data from vertical-axis wind turbine prototypes, which are innovative machines developed in Poland (no failure has occurred yet for this kind of turbine). The tests were performed on 4-D data covering the period from 18.04.2014 till 29.04.2014, recorded every 1 second by an on-line monitoring system. The data set contained only the basic values that define the operational state of a turbine: wind speed, rotational speed of the rotor and the AC/DC power generated by the turbine. Negative values of AC power mean that the power was absorbed. The recorded data were not averaged. The data set included 985837 measurements. It should also be mentioned that operational-state classification is one of the subtasks of the complex task of intelligent wind turbine monitoring, which is highly multifaceted in terms of scientific research [5, 7].

The most straightforward method for obtaining a one-class classifier is to estimate the probability density of the data and to set a density value threshold. This approach has been successfully applied with Gaussian Mixture Models (GMM), which are widely used in classification [3, 10, 25, 24] and general density estimation tasks; they are also suitable for one-class classification. Another class of algorithms widely used in outlier detection is given by kernel methods such as one-class Support Vector Machines (1-SVM) [17] or Support Vector Data Description (SVDD) [26]. These methods inherit provable generalization properties from learning theory [17] and can handle high-dimensional feature spaces. Contrary to EM, however, their computational complexity is high.

In this paper we propose a new approach to the one-class classification problem which is strongly related to GMM and is based on cross-entropy clustering (CEC) [20, 22, 23]. The CEC model uses an approach similar to GMM, but with a different combination of models [23]. Moreover, in the CEC implementation the data is covered by a finite set of ellipsoids or balls; consequently it is easier to compute and visualize the classification border. It turns out that at the small cost of a minimally worse density approximation [23] we gain speed in implementation (we can often use the Hartigan approach to clustering, which is faster and typically finds better minima) and the ease of using more complicated density models. In particular, spherical CEC has lower computational complexity than spherical EM. Moreover, we can modify clusters on-line by adding new data, which could be of high importance in the case of monitoring wind turbines.

This paper is arranged as follows. In the next section the theoretical background of CEC and a comparison with GMM are presented. We also describe how to adapt CEC to the one-class classification problem. The last section presents numerical experiments.

* The paper was supported by the National Centre for Research and Development under Grant no. WND-DEM-1-153/01.

2 One-class classification

Classical approaches to one-class problems are usually based on two different ideas. In the first, we directly look for the decision border. The main disadvantages of SVM-based models are the numerical complexity of the method and the high number of parameters; in consequence, the decision border depends on a large number of constraints. The second approach is constructed in two steps: first we estimate a density model, and then we construct the decision algorithm. EM-based approaches use a reasonably small number of parameters while retaining good flexibility in shaping the decision border. More precisely, the model is constructed before the classification process is started. Methods of this kind give good results if we have a reasonably large data set of quite small dimension [18] (lower than 10); in such a situation density estimation works effectively. In the case of data from the monitoring systems of wind turbines we have a large amount of data in R^4, so density-based models give essentially better results. Nevertheless, for data in higher-dimensional spaces SVM works well thanks to the kernel trick.

Since our work is based on the CEC method and on one-class classification (based on density estimation), we start by introducing CEC in relation to GMM, which uses the Expectation Maximization (EM) approach. Let us recall that the standard Gaussian densities in R^d are defined by
\[
N(m, \Sigma)(x) := \frac{1}{(2\pi)^{d/2} \det(\Sigma)^{1/2}} \exp\left(-\frac{1}{2}\|x - m\|_\Sigma^2\right),
\]
where m denotes the mean, Σ is the covariance matrix and \|v\|_\Sigma^2 := v^T \Sigma^{-1} v is the square of the Mahalanobis norm. In general, EM aims at finding p_1, …, p_k ≥ 0 with \sum_{i=1}^{k} p_i = 1 and Gaussian densities f_1, …, f_k (where k is given beforehand and denotes the number of densities whose convex combination builds the desired density model) such that the convex combination f := p_1 f_1 + … + p_k f_k optimally approximates the scatter of our data X = {x_1, …, x_n} with respect to the MLE cost function
\[
\mathrm{MLE}(f, X) := -\sum_{l=1}^{n} \ln\bigl(p_1 f_1(x_l) + \ldots + p_k f_k(x_l)\bigr). \tag{1}
\]

The EM procedure consists of the Expectation and Maximization steps. While the Expectation step is relatively simple, the Maximization usually needs complicated numerical optimization even for relatively simple Gaussian models [12, 11, 2].
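As a simple illustration, the Gaussian density N(m, Σ) and the cost (1) can be evaluated directly once the mixture parameters are known; the following sketch (our own minimal Python/NumPy example, not the authors' implementation) assumes the weights p_i, means m_i and covariances Σ_i have already been estimated, e.g. by EM.

import numpy as np

def gaussian_density(x, m, cov):
    # N(m, cov)(x) with the squared Mahalanobis norm (x - m)^T cov^{-1} (x - m)
    d = len(m)
    diff = np.asarray(x, dtype=float) - np.asarray(m, dtype=float)
    maha2 = diff @ np.linalg.solve(cov, diff)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * maha2) / norm_const

def mle_cost(X, weights, means, covs):
    # cost (1): negative log-likelihood of the convex combination of densities
    cost = 0.0
    for x in X:
        mix = sum(p * gaussian_density(x, m, c)
                  for p, m, c in zip(weights, means, covs))
        cost -= np.log(mix)
    return cost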


Fig. 1. The Old Faithful waiting data fitted by GMM and CEC, with the 95% level of points classified to X_+.

The goal of CEC is to minimize a cost function which is a minor modification of that given in (1), obtained by substituting the sum with a maximum:
\[
\mathrm{CEC}(f, X) := -\sum_{l=1}^{n} \ln\bigl(\max(p_1 f_1(x_l), \ldots, p_k f_k(x_l))\bigr). \tag{2}
\]

Instead of focusing on density estimation as its main task, CEC aims directly at the clustering problem. Let it be remarked that the seemingly small difference between the cost functions (1) and (2) has profound consequences, which follow from the fact that the densities in (2) do not "cooperate" to build the final approximation of f. Roughly speaking, the advantage is obtained because the models do not mix with each other, since we take the maximum instead of the convex mixture. As mentioned above, the CEC model is given by the formula f(x) = max(p_1 f_1(x), …, p_k f_k(x)), where the f_i are Gaussian densities.

Density-based one-class classification needs an estimate of the probability density f, a data set X, and a probability ε ∈ [0, 1] of belonging to the given distribution. As output we obtain an assignment of the points to two groups: X_+, which we interpret as elements coming from the distribution, and X_-, which contains the outliers.
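The only difference with respect to the previous sketch is the replacement of the sum over components by their maximum; the fragment below (again our own hedged illustration, assuming the cluster parameters have already been found, e.g. by a Hartigan-type CEC iteration) evaluates the cost (2) and the resulting sub-density used later for classification.

import numpy as np
from scipy.stats import multivariate_normal

def cec_density(x, weights, means, covs):
    # f(x) = max_i p_i f_i(x): clusters do not mix, only the best one counts
    return max(p * multivariate_normal.pdf(x, mean=m, cov=c)
               for p, m, c in zip(weights, means, covs))

def cec_cost(X, weights, means, covs):
    # cost (2): sum over data points of -ln of the maximal weighted density
    return -sum(np.log(cec_density(x, weights, means, covs)) for x in X)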


The decision border α is given by a level set constructed so that the empirical probability of belonging to the class X_+ equals ε:
\[
\frac{|\{x \in X : f(x) \ge \alpha\}|}{|X|} = \varepsilon,
\]
where |·| denotes the cardinality of a set.

Algorithm 1 Density-based classifier
Input:
    X ⊂ R^d                                   ▷ training data
    Y ⊂ R^d                                   ▷ testing data
    ε ≥ 0                                     ▷ level of decision
Construct the model f : R^d → R_+             ▷ density estimation
Compute the decision border α such that |{x ∈ X : f(x) ≥ α}| / |X| ≈ ε
for y ∈ Y do
    if f(y) ≥ α then
        add y to Y_+                          ▷ classified as the positive class
    else
        add y to Y_-                          ▷ classified as an outlier
    end if
end for

The pseudo-code of density-based one-class classification (we can use either the CEC or the EM algorithm for density estimation) is presented in Algorithm 1. The result of the density estimation obtained by the GMM and CEC models on the Old Faithful waiting data [1] is presented in Fig. 1. The differences between the models are quite small, and the 95% borders are similar. Consequently, applying the CEC method we obtain results similar to EM with lower computation time.

Since we have only one class, it is nontrivial to construct a reasonable criterion for verifying the correctness of the classification process [17, 15]. To do so we use the standard measure of sensitivity and introduce a new approach to measuring specificity. In the first case we want to verify how many true positive examples are in the testing set. Let X be data coming from the density f : R^d → R_+. Since we do not have two classes, we assume that some percentage ε of the data is sampled from the random variable X_+ and the remaining points are treated as outliers X_-. We construct a decision border on the training set such that |X_+|/|X| ≤ ε. Consequently, sensitivity can be measured by
\[
\frac{|Y_+|}{|Y|} - \frac{|X_+|}{|X|},
\]
where Y is the testing set and X is the training set. The smaller the above value, the more "stable" is the considered classification boundary.
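A compact realization of Algorithm 1 together with the stability measure above could look as follows (a sketch under our own assumptions: any density estimator exposing a function f, such as the GMM or CEC densities sketched earlier, can be plugged in; the helper names are hypothetical).

import numpy as np

def decision_border(f, X, eps):
    # alpha chosen so that a fraction eps of the training points X
    # satisfies f(x) >= alpha, i.e. is classified to the positive class
    densities = np.array([f(x) for x in X])
    return np.quantile(densities, 1.0 - eps)

def classify(f, Y, alpha):
    # split the testing set into the positive class Y_plus and the outliers Y_minus
    Y_plus = [y for y in Y if f(y) >= alpha]
    Y_minus = [y for y in Y if f(y) < alpha]
    return Y_plus, Y_minus

def stability(f, X, Y, alpha):
    # |Y_+|/|Y| - |X_+|/|X|: the closer to zero, the more stable the border
    x_rate = np.mean([f(x) >= alpha for x in X])
    y_rate = np.mean([f(y) >= alpha for y in Y])
    return y_rate - x_rate

For instance, with eps = 0.95 the computed border reproduces the 95% level used in Fig. 1.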

[Fig. 2 shows the decision borders on two coordinates of the Abalone data for the following classifiers: (a) general GMM, (b) spherical GMM, (c) general CEC, (d) spherical CEC, (e) SVM σ = 1, (f) SVM σ = 10, (g) SVM σ = 100, (h) Mahalanobis SVM σ = 0.1, (i) Mahalanobis SVM σ = 1, (j) Mahalanobis SVM σ = 10.]

Fig. 2. Comparison of decision borders in the case of different one-class classification algorithms on the Abalone dataset from the UCI repository.


We measure specificity by a different approach. The problem lies in the fact that we want to be able to generate negative examples without being explicitly given a second class. To do so, without using additional knowledge, we assume that the elements of the second class have a uniform distribution on some large cube containing the whole positive class. Such reasoning leads us to checking which decision border contains the smaller number of elements of the negative class. Equivalently, this can be stated as finding the decision border which encloses the minimal volume.

Remark 1. Let us observe that for a known density this reduces to level sets. For a density f : R^d → R_+ and ε ∈ [0, 1], consider all U ⊂ R^d such that
\[
\int_U f(x)\,dx = \varepsilon.
\]
Then the volume of U is minimal if U is given by U = f^{-1}([a, ∞)), where a is such that \int_{f^{-1}([a,\infty))} f(x)\,dx = \varepsilon.
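Since the decision region of the density-based models is a union of ellipsoid-like level sets and the SVM region has no closed form, the enclosed volume can be estimated numerically. How the volumes in Table 1 were obtained is not stated, so the following Monte Carlo sketch (with hypothetical helper name and parameters) is only one possible realization: sample uniformly from a cube containing the positive class and multiply the fraction of samples falling inside the decision region by the cube volume.

import numpy as np

def region_volume(f, alpha, low, high, n_samples=100000, seed=0):
    # Monte Carlo estimate of the volume of {x : f(x) >= alpha}
    # inside the axis-aligned box [low, high]^d containing the data
    rng = np.random.default_rng(seed)
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    samples = rng.uniform(low, high, size=(n_samples, len(low)))
    inside = np.array([f(x) >= alpha for x in samples])
    box_volume = np.prod(high - low)
    return box_volume * inside.mean()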

3 Experiments

In this section we present the results of the CEC one-class classification method in comparison with GMM and SVM. Let us start with the Abalone data from the UCI repository [27]. For our experiments we chose two coordinates, which enabled us to easily plot the decision border and visually compare the quality of the classifiers (see Fig. 2), and we divided the data into two groups: training (40%) and testing (60%). We trained the CEC, GMM and SVM algorithms in such a way that 95% of the training data is classified to X_+. In our experiments we apply the classical and spherical versions of EM and CEC. Analogously, we use SVM with different kernels (spherical and Mahalanobis) and varied kernel widths. As discussed in the previous section, we prefer the classifier which minimizes the volume of the region dedicated to the positive class and which simultaneously has a similar percentage of positively classified points in the training and testing sets. In the case of the Abalone data set, CEC, EM and SVM give similar results. Nevertheless, SVM uses more parameters than the density-based approaches while reaching a similar volume and testing rate; see the first two columns of Tab. 1. By using circles instead of ellipses we reduce the number of parameters and consequently also the possibility of overfitting. The results of the general and spherical versions of CEC are presented in Fig. 3.

In the second example we use real data from the monitoring system of a wind turbine. The tests were performed on 4-D data covering the period from 18.04.2014 till 29.04.2014, recorded every 1 second by an on-line monitoring system. The data set included up to 985837 measurements. It turns out that the data in fact lie in a lower-dimensional subspace and are strongly correlated in one direction, as the last eigenvalue of the covariance matrix is essentially smaller than the first three: 350102.91, 2006.03, 126.14, 0.76. Therefore, we used PCA (Principal Component Analysis) to extract the three most important dimensions, which also enables easier visualization of the results (see Fig. 3).
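The dimensionality reduction described above can be reproduced with any standard PCA routine; the sketch below (our own illustration based on scikit-learn, not the monitoring system's code) projects the four recorded channels onto the three leading principal components before a one-class model is fitted.

import numpy as np
from sklearn.decomposition import PCA

def project_turbine_data(X, n_components=3):
    # X: array of shape (n_samples, 4) with the raw turbine channels
    # (wind speed, rotor speed, AC and DC power)
    pca = PCA(n_components=n_components)
    X_reduced = pca.fit_transform(X)
    # eigenvalues of the covariance matrix, in decreasing order
    print("explained variances:", pca.explained_variance_)
    return X_reduced, pca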


Method / Measure              abalone 1        abalone 2        turbine 1           turbine 2
CEC
  Time                        0.272            0.218            24.793              133.735
  Testing rate                95.57%           95.89%           95.03%              98.66%
  Volume                      5.361            5.470            0.005               0.001
  Parameters                  30 · 7 = 210     29 · 7 = 203     29 · 13 = 377       30 · 13 = 390
EM
  Time                        0.863            0.211            23.025              53.955
  Testing rate                93.82%           92.3%            95.23%              98.63%
  Volume                      4.807            4.595            0.006               0.001
  Parameters                  30 · 7 = 210     29 · 7 = 203     15 · 13 = 195       30 · 13 = 390
SVM (σ = 1)
  Time                        0.014            0.019            48.792              1034.062
  Testing rate                95.13%           95.73%           95.55%              98.03%
  Volume                      20.575           21.068           0.524               0.161
  Parameters                  85 · 3 = 255     84 · 3 = 252     4288 · 4 = 17152    19717 · 4 = 78872
SVM (σ = 10)
  Time                        0.021            0.017            48.563              1067.091
  Testing rate                94.8%            94.65%           96.44%              97.98%
  Volume                      6.65             6.567            0.316               0.099
  Parameters                  86 · 3 = 258     85 · 3 = 255     4288 · 4 = 17152    19718 · 4 = 78872
SVM (σ = 100)
  Time                        0.033            0.022            49.4126             1137.22
  Testing rate                93.42%           93.34%           96.28%              97.75%
  Volume                      5.55             5.672            0.081               0.018
  Parameters                  100 · 3 = 300    112 · 3 = 336    4289 · 4 = 17156    19718 · 4 = 78872
CEC spherical
  Time                        0.109            0.011            21.76               151.54
  Testing rate                93.26%           94.34%           94.87%              98.45%
  Volume                      5.395            6.029            0.03                0.003
  Parameters                  28 · 4 = 112     28 · 4 = 112     29 · 5 = 145        30 · 5 = 150
EM spherical
  Time                        0.863            0.211            23.025              53.955
  Testing rate                93.82%           92.3%            95.23%              98.63%
  Volume                      4.807            4.595            0.006               0.001
  Parameters                  30 · 4 = 120     30 · 4 = 120     30 · 5 = 150        30 · 5 = 150
Mahalanobis SVM (σ = 0.1)
  Time                        0.016            0.022            49.33               1459.01
  Testing rate                95.25%           94.93%           96.55%              94.87%
  Volume                      8.636            8.562            0.116               0.012
  Parameters                  87 · 3 = 261     87 · 3 = 261     4289 · 4 = 17156    19718 · 4 = 78872
Mahalanobis SVM (σ = 1)
  Time                        0.026            0.025            53.023              1061.36
  Testing rate                94.18%           93.30%           95.43%              97.85%
  Volume                      5.851            5.546            0.011               0.002
  Parameters                  96 · 3 = 288     96 · 3 = 288     4316 · 4 = 17264    19732 · 4 = 78928
Mahalanobis SVM (σ = 10)
  Time                        0.121            0.118            64.950              1209.17
  Testing rate                86.64%           87.20%           94.60%              98.20%
  Volume                      4.836            4.652            0.006               0.001
  Parameters                  441 · 3 = 1323   414 · 3 = 1242   4486 · 4 = 17944    19852 · 4 = 79408

Table 1. Comparison of CEC, GMM and SVM in the case of one-class classification.

For this large amount of data we observe the difference between the computational complexity of the CEC approach and that of the EM and SVM algorithms; see the last two columns of Tab. 1. The difference in computation time between the spherical versions of EM and CEC shows a further advantage of our method. In the case of the turbine data set we obtain a testing rate similar to EM and SVM; the basic difference lies in the number of parameters and the computation time.
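The parameter counts in Table 1 appear to follow a simple rule (our reading of the table, not stated explicitly in the text): each general Gaussian cluster contributes 1 + d + d^2 values (weight, mean and full covariance matrix), each spherical cluster 1 + d + 1, and each support vector d + 1. A small hypothetical helper makes the comparison explicit.

def gaussian_cluster_params(d, spherical=False):
    # weight + mean + covariance (full matrix, or a single radius if spherical)
    return 1 + d + (1 if spherical else d * d)

def model_params(n_components, d, spherical=False):
    # e.g. 30 clusters in R^2 with full covariances: 30 * 7 = 210 parameters
    return n_components * gaussian_cluster_params(d, spherical)

def svm_params(n_support_vectors, d):
    # each support vector stores its coordinates and a coefficient: d + 1 values
    return n_support_vectors * (d + 1)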

[Fig. 3 panels: (a) General CEC, (b) Spherical CEC.]

Fig. 3. The result of the classical and spherical CEC on the 3-D data from the monitoring of wind turbines.

4 Conclusions

In this paper a new approach to the one-class classification problem was presented. The method is based on the Cross-Entropy Clustering algorithm. More precisely, we use a density estimate built from Gaussian components together with a thresholded decision border constructed so that 5% of the data is classified as outliers. We obtained results similar to the EM-based approach with lower computational complexity, greater flexibility in applying more complicated models, and the possibility of an on-line version. The method was verified on data from a turbine monitoring system.

References

1. Azzalini, A., Bowman, A.: A look at some data on the Old Faithful geyser. Applied Statistics, 357–365 (1990)
2. Banfield, J.D., Raftery, A.E.: Model-based Gaussian and non-Gaussian clustering. Biometrics, 803–821 (1993)
3. Barnett, V., Lewis, T.: Outliers in Statistical Data, vol. 3. Wiley, New York (1994)
4. Barszcz, T., Bielecka, M., Bielecki, A., Wójcik, M.: Wind turbines states classification by a fuzzy-ART neural network with a stereographic projection as a signal normalization. In: Adaptive and Natural Computing Algorithms, vol. 6594, pp. 225–234 (2011)
5. Barszcz, T., Bielecka, M., Bielecki, A., Wójcik, M.: Wind speed modelling using Weierstrass function fitted by a genetic algorithm. Journal of Wind Engineering and Industrial Aerodynamics 109, 68–78 (2012)
6. Barszcz, T., Bielecki, A., Wójcik, M.: ART-type artificial neural networks applications for classification of operational states in wind turbines. In: Artificial Intelligence and Soft Computing, vol. 6114, pp. 11–18 (2010)
7. Bielecki, A., Barszcz, T., Wójcik, M.: Modelling of a chaotic load of wind turbines drivetrain. Mechanical Systems and Signal Processing 54–55, 491–505 (2015)
8. Bielecki, A., Barszcz, T., Wójcik, M., Bielecka, M.: ART-2 artificial neural networks applications for classification of vibration signals and operational states of wind turbines for intelligent monitoring. Diagnostyka 14(4), 21–26 (2013)
9. Bielecki, A., Barszcz, T., Wójcik, M., Bielecka, M.: Hybrid system of ART and RBF neural networks for classification of vibration signals and operational states of wind turbines. In: Artificial Intelligence and Soft Computing, vol. 8467, pp. 3–11 (2014)
10. Bishop, C.M.: Novelty detection and neural network validation. In: Vision, Image and Signal Processing, IEE Proceedings, vol. 141, pp. 217–222. IET (1994)
11. Celeux, G., Govaert, G.: Gaussian parsimonious clustering models. Pattern Recognition 28(5), 781–793 (1995)
12. Davis-Stober, C., Broomell, S., Lorenz, F.: Exploratory data analysis with MATLAB. Psychometrika 72(1), 107–108 (2007)
13. Guner, O.: History and evolution of the pharmacophore concept in computer-aided drug design. Current Topics in Medicinal Chemistry 2(12), 1321–1332 (2002)
14. Hodge, V.J., Austin, J.: A survey of outlier detection methodologies. Artificial Intelligence Review 22(2), 85–126 (2004)
15. Lukashevich, H., Nowak, S., Dunker, P.: Using one-class SVM outliers detection for verification of collaboratively tagged image training sets. In: Multimedia and Expo, 2009. ICME 2009. IEEE International Conference on, pp. 682–685. IEEE (2009)
16. Markou, M., Singh, S.: Novelty detection: a review, part 1: statistical approaches. Signal Processing 83(12), 2481–2497 (2003)
17. Schölkopf, B., Platt, J.C., Shawe-Taylor, J., Smola, A.J., Williamson, R.C.: Estimating the support of a high-dimensional distribution. Neural Computation 13(7), 1443–1471 (2001)
18. Silverman, B.W.: Density Estimation for Statistics and Data Analysis, vol. 26. CRC Press (1986)
19. Śmieja, M., Warszycki, D., Tabor, J., Bojarski, A.J.: Asymmetric clustering index in a case study of 5-HT1A receptor ligands. PLoS ONE 9(7), e102069 (2014)
20. Spurek, P., Tabor, J., Zając, E.: Detection of disk-like particles in electron microscopy images. In: Proceedings of the 8th International Conference on Computer Recognition Systems CORES 2013, pp. 411–417. Springer (2013)
21. Stahura, F.L., Bajorath, J.: Virtual screening methods that complement HTS. Combinatorial Chemistry & High Throughput Screening 7(4), 259–269 (2004)
22. Tabor, J., Misztal, K.: Detection of elliptical shapes via cross-entropy clustering. In: Pattern Recognition and Image Analysis, pp. 656–663. Springer (2013)
23. Tabor, J., Spurek, P.: Cross-entropy clustering. Pattern Recognition 47(9), 3046–3059 (2014)
24. Tarassenko, L., Nairac, A., Townsend, N., Buxton, I., Cowley, P.: Novelty detection for the identification of abnormalities. International Journal of Systems Science 31(11), 1427–1439 (2000)
25. Tax, D.M., Duin, R.P.: Outlier detection using classifier instability. In: Advances in Pattern Recognition, pp. 593–601. Springer (1998)
26. Tax, D.M., Duin, R.P.: Support vector data description. Machine Learning 54(1), 45–66 (2004)
27. Waugh, S.: Extending and benchmarking cascade-correlation. Ph.D. Dissertation, Dept. of Computer Science, University of Tasmania (1995)
