An Improved Algorithm for SVMs Classification of Imbalanced Data Sets

Cristiano Leite Castro, Mateus Araujo Carvalho, and Antônio Pádua Braga

Federal University of Minas Gerais, Department of Electronics Engineering, Av. Antônio Carlos, 6.627, Campus UFMG - Pampulha, 30.161-970, Belo Horizonte, MG, Brasil
{crislcastro,mateus.carvalho,apbraga}@ufmg.br

Abstract. Support Vector Machines (SVMs) have strong theoretical foundations and excellent empirical success in many pattern recognition and data mining applications. However, when induced by imbalanced training sets, where the examples of the target class (minority) are outnumbered by the examples of the non-target class (majority), the performance of the SVM classifier degrades. In medical diagnosis and text classification, for instance, small and heavily imbalanced data sets are common. In this paper, we propose the Boundary Elimination and Domination algorithm (BED) to enhance SVM class-prediction accuracy on applications with imbalanced class distributions. BED is an informative resampling strategy in input space. In order to balance the class distributions, our algorithm uses density information in the training set to remove noisy examples of the majority class and to generate new synthetic examples of the minority class. In our experiments, we compared BED with the original SVM and with the Synthetic Minority Oversampling Technique (SMOTE), a popular resampling strategy in the literature. Our results demonstrate that the new approach improves SVM classifier performance on several real-world imbalanced problems.

Keywords: Support Vector Machines, supervised learning, imbalanced data sets, resampling strategy, ROC curves, pattern recognition applications.

1 Introduction

Since their introduction by V. Vapnik and coworkers [1], [2], [3], Support Vector Machines (SVMs) have been successfully applied to many real-world pattern recognition problems. SVMs are based on Vapnik-Chervonenkis theory and the structural risk minimization (SRM) principle [2], [4], which aims to obtain a classifier with high generalization performance by minimizing both the global training error and the complexity of the learning machine. However, it is well established that in applications with imbalanced data sets [5], where the training examples of the target class (minority) are outnumbered by the training examples of the non-target class (majority), the SVM classifier performance becomes limited. This probably occurs because the global training error treats all errors as equally important, assuming that the class prior distributions are relatively balanced [6]. In most real-world problems, when the imbalanced class ratio is large, one can observe that the separation surface learned by SVMs is skewed toward the minority class [5]. Consequently, test examples belonging to the small class are misclassified more often than those belonging to the prevalent class [7].

D. Palmer-Brown et al. (Eds.): EANN 2009, CCIS 43, pp. 108-118, 2009. © Springer-Verlag Berlin Heidelberg 2009

In order to improve SVM performance on applications with imbalanced class distributions, we designed the Boundary Elimination and Domination algorithm (BED). It eliminates outliers and noisy examples of the majority class in lower density areas and generates new synthetic examples of the minority class by considering both the analyzed example and its k minority nearest neighbors. By increasing the number of minority examples in areas near the decision boundary, the representativeness of the support vectors of this class is improved. Moreover, BED has parameters that control the intensity of the resampling process and can be adjusted according to the level of imbalance and overlapping of the problem.

In our experiments, we used several real-world data sets extracted from the UCI Repository [8] with different degrees of class imbalance. We compared the BED algorithm with the Synthetic Minority Oversampling Technique (SMOTE) [9], a popular oversampling strategy. In both algorithms, we used SVMs as base classifiers. The performances were evaluated using metrics appropriate for imbalanced classification, such as F-measure [10], G-mean [11] and ROC (Receiver Operating Characteristics) curves [12].
This paper is organized as follows: Section 2 reviews the SVM learning algorithm and previous solutions to the problem of learning from imbalanced data sets, especially in the context of SVMs. Section 3 presents our approach to the problem and describes the BED algorithm. Section 4 describes how the experiments were performed and the results obtained. Finally, Section 5 concludes the paper.

2 Background

2.1 Support Vector Machines

In their original formulation [1], [2], SVMs were designed to estimate a linear function $f(\mathbf{x}) = \operatorname{sgn}(\langle \mathbf{w} \cdot \mathbf{x} \rangle + b)$ of parameters $\mathbf{w} \in \mathbb{R}^d$ and $b \in \mathbb{R}$, using only a training set drawn i.i.d. according to an unknown probability distribution $P(\mathbf{x}, y)$. This training set is a finite set of samples,

$$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n), \quad (1)$$

where $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$. SVM learning aims to find the hyperplane that gives the largest separating margin between the two classes. For a linearly separable training set, the margin $\rho$ is defined as the Euclidean distance between


the separating hyperplane and the closest training examples. Thus, the learning problem can be stated as follows: find $\mathbf{w}$ and $b$ that maximize the margin while ensuring that all training samples are correctly classified,

$$\min_{\mathbf{w}, b} \ \frac{1}{2}\|\mathbf{w}\|^2 \quad (2)$$

$$\text{s.t.} \quad y_i(\langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b) \geq 1, \quad i = 1, \ldots, n. \quad (3)$$

For the non-linearly separable case, slack variables $\varepsilon_i$ are introduced to allow some classification errors (soft-margin hyperplane) [3]. If a training example is located inside the margin or on the wrong side of the hyperplane, its corresponding $\varepsilon_i$ is greater than 0. The sum $\sum_{i=1}^{n} \varepsilon_i$ is an upper bound on the number of training errors. Thus, the optimal hyperplane is obtained by solving the following constrained (primal) optimization problem,

$$\min_{\mathbf{w}, b, \varepsilon_i} \ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \varepsilon_i \quad (4)$$

$$\text{s.t.} \quad y_i(\langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b) \geq 1 - \varepsilon_i, \quad i = 1, \ldots, n, \quad (5)$$

$$\varepsilon_i \geq 0, \quad i = 1, \ldots, n, \quad (6)$$
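The primal problem (4)-(6) is usually solved through its dual, but as a rough illustration of what the optimizer does, the equivalent hinge-loss objective $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \max(0, 1 - y_i(\langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b))$ can be minimized by subgradient descent. This is a minimal sketch, not the method used in the paper; the function name, learning rate and toy data are our own assumptions.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on the primal hinge-loss objective
    0.5*||w||^2 + C * sum(max(0, 1 - y_i*(<w, x_i> + b)))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                 # examples with epsilon_i > 0
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# toy linearly separable data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
preds = np.sign(X @ w + b)
```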

where the constant $C > 0$ controls the trade-off between the margin size and the misclassified examples. Instead of solving the primal problem directly, one considers the following dual formulation,

$$\max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \langle \mathbf{x}_i \cdot \mathbf{x}_j \rangle \quad (7)$$

$$\text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad i = 1, \ldots, n, \quad (8)$$

$$\sum_{i=1}^{n} \alpha_i y_i = 0. \quad (9)$$

Solving this dual problem yields the Lagrange multipliers $\alpha_i$, whose sizes are limited by the box constraints ($\alpha_i \leq C$); the parameter $b$ can be obtained from any training example (support vector) with non-zero $\alpha_i$. This leads to the following decision function,

$$f(\mathbf{x}_j) = \operatorname{sgn}\left( \sum_{i=1}^{n} y_i \alpha_i \langle \mathbf{x}_i \cdot \mathbf{x}_j \rangle + b \right). \quad (10)$$

The SVM formulation presented so far is limited to linear decision surfaces in input space, which are not appropriate for many classification tasks. The extension to more complex decision surfaces is conceptually simple and is done by mapping the data into a higher-dimensional feature space $F$, where the problem becomes linear. More precisely, a non-linear SVM first maps the input vectors by $\Phi : \mathbf{x} \mapsto \Phi(\mathbf{x})$ and then estimates a separating hyperplane in $F$,

$$f(\mathbf{x}) = \operatorname{sgn}(\langle \Phi(\mathbf{x}) \cdot \mathbf{w} \rangle + b). \quad (11)$$


It can be observed in (7) and (10) that the input vectors are only involved through their inner products $\langle \mathbf{x}_i \cdot \mathbf{x}_j \rangle$. Thus, to map the data it is not necessary to consider the non-linear function $\Phi$ in explicit form; the inner products in the feature space $F$ can be computed directly. In this context, a kernel is defined as a way to directly compute this product [4]. A kernel is a function $K$ such that, for every pair $\mathbf{x}, \mathbf{x}'$ in input space,

$$K(\mathbf{x}, \mathbf{x}') = \langle \Phi(\mathbf{x}) \cdot \Phi(\mathbf{x}') \rangle. \quad (12)$$

Therefore, a non-linear SVM is obtained by simply replacing the inner product $\langle \mathbf{x}_i \cdot \mathbf{x}_j \rangle$ in equations (7) and (10) by the kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$ that corresponds to that inner product in the feature space $F$. The kernel functions used in this work are the linear kernel, $K(\mathbf{x}, \mathbf{x}') = \langle \mathbf{x} \cdot \mathbf{x}' \rangle$, and the RBF kernel, $K(\mathbf{x}, \mathbf{x}') = \exp\left(-\|\mathbf{x} - \mathbf{x}'\|^2 / 2r^2\right)$.

2.2 Related Works

Two main approaches have been designed to address the problem of learning from imbalanced data sets: algorithmic and data preprocessing [13]. In the algorithmic approach, learning algorithms are adapted to improve the performance of the minority (positive) class. In the context of SVMs, [14] and [15] proposed techniques to modify the threshold (parameter $b$) in the decision function given by equation (10). In [16] and [17], the error for positive examples was distinguished from the error for negative examples by using different constants $C^+$ and $C^-$. The ratio $C^+/C^-$ is used to control the trade-off between the number of false negatives and false positives. This technique is known as Asymmetric Misclassification Costs SVMs [17]. Another algorithmic approach to improve SVMs on imbalanced classification is to modify the employed kernel. Based on kernel-target alignment algorithms [18], Kandola and Shawe-Taylor [19] assigned different alignment targets to positive and negative examples. In the same direction, [5] proposed the kernel boundary alignment algorithm (KBA), which adapts the decision function toward the majority class by modifying the kernel matrix.

In the data preprocessing approach, the objective is to balance the class distribution by resampling the data in input space: oversampling examples of the minority class and undersampling examples of the majority class. Oversampling works by duplicating pre-existing examples (oversampling with replacement) or by generating new synthetic data, usually obtained by interpolation. For instance, in the SMOTE algorithm [9], for each minority example its nearest minority neighbors are identified, and new minority examples are created and placed randomly between the example and the neighbors. Undersampling involves the elimination of majority class examples. The examples to be eliminated can be selected randomly (random undersampling) or through some prior information (informative undersampling). The one-sided selection proposed by Kubat and Matwin [11], for instance, is an informative undersampling approach which removes noisy, borderline, and redundant majority examples.

Previous data preprocessing strategies that aim to improve SVM learning on imbalanced data sets include the following: [20] combined the SMOTE algorithm with the Asymmetric Misclassification Costs SVMs mentioned earlier; [21] used SMOTE and also random undersampling for SVM learning on an intestinal-contraction-detection task; recently, [22] proposed the Granular SVM - Repetitive Undersampling algorithm (GSVM-RU), whose objective is to minimize the negative effect of information loss in the undersampling process.
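The SMOTE interpolation step described above can be sketched as follows. This is a simplified version of the algorithm in [9]: the function name, the Euclidean nearest-neighbour search and the toy points are ours, and the original also handles nominal attributes and oversampling rates.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority examples by interpolating each
    seed example toward one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        # k nearest minority neighbours, excluding the example itself
        nn = np.argsort(d)[1:k + 1]
        j = rng.choice(nn)
        gap = rng.random()                 # random point on the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote(X_min, n_new=8, k=2, rng=0)  # 8 synthetic minority points
```

Each synthetic point is a convex combination of two minority examples, so it always lies inside the minority region spanned by the seeds.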

3 Boundary Elimination and Domination Algorithm

In this section, we describe the Boundary Elimination and Domination algorithm (BED), designed to improve SVM performance on imbalanced applications. The basic idea behind BED is to increase the representativeness of the minority class in the training set through data preprocessing in input space. The preprocessing is done by removing (undersampling) noisy examples of the majority class and generating (oversampling) new examples of the minority class. The whole resampling process is guided by density information around the training examples. This information makes it possible for BED to identify both isolated examples and examples that lie in the class overlapping area. The elimination and synthesis of examples in these regions improve the representativeness of the support vectors, so that a SVM classifier can estimate a better separation surface.

To obtain the density information, BED defines a credibility score for each example of the training set. This score is calculated from the $k$ nearest neighbors (based on Euclidean distance) of the evaluated example, according to the following equation,

$$cs(x_i^c) = \frac{n_{C^c}}{n_{C^+} + n_{C^-}}, \quad (13)$$

where $x_i^c$ corresponds to the $i$th training example of the class $C^c$. The symbol $c = \{+, -\}$ indicates whether the example belongs to the positive (minority) class or the negative (majority) class, respectively. The value $n_{C^+}$ corresponds to the number of positive neighbors of $x_i^c$. For a given positive example, $cs(x_i^+)$ evaluates the proportion of positive examples among the $k$ nearest neighbors of $x_i^+$. Therefore, if $cs(x_i^+) \approx 1$, one can state that $x_i^+$ lies in a region with high density of positive examples. Equivalent definitions hold for the negative class.

For majority class examples $x_i^-$, the credibility score $cs(x_i^-)$ is used to find noisy examples that occupy class overlapping areas, as well as isolated examples located in minority class regions.
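Since the denominator of (13) is simply the neighbourhood size $k$, the credibility score reduces to the fraction of the $k$ nearest neighbours that share the example's own class. A minimal sketch (function name and toy data are ours, assuming Euclidean distance as in the paper):

```python
import numpy as np

def credibility(X, y, i, k=5):
    """Credibility score (13): among the k nearest neighbours of X[i]
    (Euclidean distance), the fraction that belong to X[i]'s own class."""
    d = np.linalg.norm(X - X[i], axis=1)
    nn = np.argsort(d)[1:k + 1]        # skip the example itself
    return np.mean(y[nn] == y[i])

# tiny example: a negative point surrounded by positives has zero credibility
X = np.array([[0.0], [0.1], [0.2], [0.3], [5.0]])
y = np.array([+1, +1, +1, -1, -1])
print(credibility(X, y, 3, k=3))  # 0.0 -> likely noise in a positive region
```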
Thus, BED establishes a rule to detect and eliminate these examples. This rule depends on the maxtol parameter and is given by,


if $cs(x_i^-) < maxtol \Rightarrow x_i^-$ is eliminated;
if $cs(x_i^-) \geq maxtol \Rightarrow x_i^-$ is not eliminated.

The maxtol parameter value ($0 \leq maxtol \leq 1$) should be defined by the user and determines the intensity of the undersampling stage. The higher the maxtol value, the more examples of $C^-$ are eliminated. For high levels of class imbalance and overlapping, values of maxtol close to 1 are suggested, since a large number of negative examples are likely to lie in areas belonging to $C^+$. When the imbalance is milder, smaller values of maxtol are suggested.

For minority class examples $x_i^+$, the credibility score $cs(x_i^+)$ is also used to synthesize new examples. If $cs(x_i^+)$ is considered valid by BED, a new example $\hat{x}^+$ is created as follows:

1. for each continuous attribute (feature) $m$, its value is calculated as the mean of the $m$-values of the $n_{C^+}$ nearest neighbors and the example $x_i^+$,

$$\hat{x}^+(m) = \frac{\sum_{j=1}^{n_{C^+}} x_j^+(m) + x_i^+(m)}{n_{C^+} + 1}. \quad (14)$$

2. each nominal attribute $p$ assumes the value most frequently observed among the $p$-values of the $n_{C^+}$ nearest neighbors and the example $x_i^+$.

Otherwise, if $cs(x_i^+)$ is not valid, $x_i^+$ is considered an isolated or noisy example, and no new example is created around it. The rule that evaluates a positive example is based on the mintol parameter and is given by:

if $cs(x_i^+) < mintol \Rightarrow$ no new example is created;
if $cs(x_i^+) \geq mintol \Rightarrow$ a new example $\hat{x}^+$ is created.

The parameter mintol defines the validity degree of positive examples in the input space and, like maxtol, takes values from 0 to 1. The lower the mintol value, the higher the probability that a positive example in the class overlapping area is considered valid and used to generate a new synthetic example $\hat{x}^+$. mintol values, however, should not be very close to 0, in order to ensure that isolated examples of $C^+$ do not generate new examples. The mintol adjustment should also be done by the user according to the level of class imbalance and overlapping.

At each iteration of the BED algorithm, a new training example is analyzed. The algorithm ends when the classes become balanced. The effect of the BED algorithm on an imbalanced training set is illustrated in Fig. 1: majority examples in regions where minority examples prevail are eliminated, while new minority examples improve the representation of these regions. It is expected, therefore, that our resampling strategy allows a SVM classifier to obtain a separation surface that maximizes the number of correct positive classifications. Moreover, it is important to notice that BED can be used with any classification algorithm.
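Putting the score (13), the maxtol/mintol rules and the mean-based synthesis (14) together, one pass of a BED-like resampling could look as follows. This is our sketch under simplifying assumptions (continuous attributes only, one synthetic example per valid positive, a single pass rather than iterating until balance); all names are ours.

```python
import numpy as np

def bed_resample(X, y, k=5, maxtol=0.5, mintol=0.5):
    """One pass of a BED-like resampling: drop low-credibility negatives,
    synthesize one example (eq. 14) around each valid positive."""
    def cred(i):
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]          # k nearest, excluding self
        return np.mean(y[nn] == y[i]), nn

    keep, new_pos = [], []
    for i in range(len(X)):
        score, nn = cred(i)
        if y[i] == -1:
            if score >= maxtol:              # elimination rule for negatives
                keep.append(i)
        else:
            keep.append(i)
            if score >= mintol:              # synthesis rule for positives
                pos_nn = nn[y[nn] == +1]     # positive neighbours only
                if len(pos_nn):
                    # eq. (14): mean of the example and its positive neighbours
                    new_pos.append((X[pos_nn].sum(0) + X[i]) / (len(pos_nn) + 1))
    X_out = np.vstack([X[keep]] + ([np.array(new_pos)] if new_pos else []))
    y_out = np.concatenate([y[keep], np.full(len(new_pos), +1)])
    return X_out, y_out

# noisy negative at 0.1 sits inside the positive cluster and is removed
X = np.array([[0.0], [0.2], [0.4], [0.1], [5.0], [5.2], [5.4]])
y = np.array([1, 1, 1, -1, -1, -1, -1])
X_out, y_out = bed_resample(X, y, k=3)
```

In the paper, this pass would be repeated until the class counts balance.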


Fig. 1. Illustration of the resampling process of the BED algorithm

4 Experiments and Results

4.1 Experiment Methodology

Six real-world data sets from the UCI Repository [8] with different levels of class imbalance were used in our experiments (Table 1). In order to preserve the negative-to-positive ratio, stratified 7-fold cross-validation was used to obtain training and test subsets (ratio 7:3) for each data set. The original SVM was used as the base classifier in all experiments. Furthermore, we compared our BED algorithm with a resampling strategy well known in the literature, SMOTE [9]. For all SVM classifiers, we employed linear and RBF kernels. The parameters $C$ (box constraint) and $r$ (radius of the Gaussian function) were kept equal in all runs for each data set, so that the algorithms could be compared without the influence of the SVM parameters. For the SMOTE algorithm, the percentage of minority class oversampling for each data set is shown in Table 1. The parameters of the BED algorithm ($k$, maxtol and mintol) were set empirically from average results obtained in several runs. The optimal choices of these parameters are also given in Table 1.

Table 1. Characteristics of the six data sets used in the experiments: number of attributes and number of positive and negative examples. For some data sets, the class label in parentheses indicates the target class we chose. The table also shows the optimal choice of parameters for SMOTE (% oversampling) and BED (k, maxtol and mintol).

Data Set      #Attrib  #POS  #NEG  %overs.  k     maxtol  mintol
Diabetes      8        268   500   200%     4.0   0.5     0.5
Breast        33       47    151   200%     10.0  0.5     0.5
Heart         44       55    212   300%     5.0   0.5     0.5
Car(3)        6        69    1659  1000%    3.0   0.5     0.5
Yeast(5)      8        51    1433  1000%    10.0  0.5     0.5
Abalone(19)   8        32    4145  1000%    10.0  0.5     0.5


After the parameters were set, the performance of the algorithms was evaluated using metrics appropriate for imbalanced classification. For each metric, the mean and standard deviation were calculated over 7 runs with different training and test subsets obtained from stratified 7-fold cross-validation. The metrics used in the evaluation process and the average results achieved on the test sets are presented in detail in Section 4.2 below.

4.2 Results

Table 2 shows the results for the G-mean metric, defined as $\sqrt{TPR \cdot TNR}$ [11], the geometric mean of the correct classification rates for positive (sensitivity, $TPR$) and negative (specificity, $TNR$) examples. Note that, in five out of the six data sets evaluated, BED achieved better results than SMOTE and the original SVM. The G-mean values in Table 2 indicate that BED achieved a better balance between sensitivity and specificity. It is worth noting that in the case of the Abalone data set, characterized by a huge imbalance degree, the BED algorithm worked well where both the original SVM and SMOTE were unable to reach satisfactory G-mean values.

Table 2. Comparison of G-mean values on the UCI data sets. The first column lists the data sets used; the following columns show the results achieved by the original Support Vector Machines (SVMs), the Synthetic Minority Oversampling Technique (SMOTE) and Boundary Elimination and Domination (BED). Mean and standard deviation values for each data set were calculated over 7 runs with different test subsets obtained from stratified 7-fold cross-validation.

Data Set   SVMs           SMOTE          BED
Diabetes   0.70 ± 0.04    0.71 ± 0.02    0.75 ± 0.06
Breast     0.58 ± 0.07    0.66 ± 0.08    0.71 ± 0.04
Heart      0.63 ± 0.05    0.76 ± 0.04    0.76 ± 0.02
Car        0.91 ± 0.08    0.94 ± 0.02    0.95 ± 0.02
Yeast      0.15 ± 0.25    0.54 ± 0.10    0.68 ± 0.04
Abalone    0.00 ± 0.00    0.00 ± 0.00    0.66 ± 0.12

In Table 3, we evaluate the algorithms using the F-measure [10], which considers only the performance on the positive class. F-measure is calculated from two important measures: Recall and Precision. Recall ($R$) is equivalent to sensitivity and denotes the ratio between the number of positive examples correctly classified and the total number of original positive examples. Precision ($P$), in turn, corresponds to the ratio between the number of positive examples correctly classified and the total number of examples identified as positive by the classifier. F-measure is then defined as $\frac{2 \cdot R \cdot P}{R + P}$, the harmonic mean of Recall and Precision.
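Both evaluation metrics follow directly from the confusion-matrix counts; a minimal sketch (function name and counts are ours):

```python
import math

def gmean_fmeasure(tp, fn, tn, fp):
    """G-mean = sqrt(sensitivity * specificity); F-measure = 2RP/(R+P)."""
    tpr = tp / (tp + fn)          # sensitivity (Recall)
    tnr = tn / (tn + fp)          # specificity
    precision = tp / (tp + fp)
    g = math.sqrt(tpr * tnr)
    f = 2 * tpr * precision / (tpr + precision)
    return g, f

# e.g. 40/50 positives and 180/200 negatives correctly classified
g, f = gmean_fmeasure(tp=40, fn=10, tn=180, fp=20)
```

Unlike plain accuracy, both metrics stay low when the minority class is sacrificed, which is why they are used here.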


Table 3. Comparison of F-measure values on the UCI data sets. The first column lists the data sets used; the following columns show the results achieved by the original Support Vector Machines (SVMs), the Synthetic Minority Oversampling Technique (SMOTE) and Boundary Elimination and Domination (BED). Mean and standard deviation values for each data set were calculated over 7 runs with different test subsets obtained from stratified 7-fold cross-validation.

Data Set   SVMs           SMOTE          BED
Diabetes   0.63 ± 0.06    0.65 ± 0.01    0.68 ± 0.04
Breast     0.48 ± 0.23    0.56 ± 0.10    0.55 ± 0.05
Heart      0.46 ± 0.07    0.52 ± 0.08    0.56 ± 0.05
Car        0.86 ± 0.08    0.87 ± 0.03    0.97 ± 0.01
Yeast      0.12 ± 0.21    0.12 ± 0.04    0.38 ± 0.07
Abalone    0.00 ± 0.00    0.00 ± 0.00    0.33 ± 0.01

As shown in Table 3, compared to the original SVM and SMOTE algorithms, BED performance was superior, especially for the data sets with a higher imbalance degree, which is the case in most real-world problems. The F-measure values described in Table 3 show that BED improves the classifier performance for the positive class.

Average ROC (Receiver Operating Characteristics) curves [12] were plotted for all data sets and gave similar results. The ROC curve of a binary classifier graphically shows the true positive rate as a function of the false positive rate as the decision threshold varies. To illustrate, Fig. 2 shows the curves for the Diabetes data set. Note that BED generates a better ROC curve than SMOTE and the original SVM.

Fig. 2. Average ROC curves (original SVM, SMOTE and BED) for the test set obtained from the Diabetes data set
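The points of a ROC curve such as those in Fig. 2 are obtained by sweeping the decision threshold over the classifier scores; a minimal sketch of the computation (names and toy scores are ours):

```python
import numpy as np

def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping the decision threshold
    over every observed classifier score, highest first."""
    pts = []
    for t in np.unique(scores)[::-1]:
        pred = scores >= t
        tpr = np.mean(pred[labels == 1])     # true positive rate
        fpr = np.mean(pred[labels == -1])    # false positive rate
        pts.append((fpr, tpr))
    return pts

scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3])   # toy classifier outputs
labels = np.array([1, 1, -1, 1, -1])           # y in {-1, +1} as in the paper
print(roc_points(scores, labels))
```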

5 Conclusions

It is known that representativeness of the training set is the most important requirement for obtaining classifiers with high generalization performance. In most classification problems, however, representativeness is not only expensive but often very difficult to achieve. In general, the available data sets are small, sparse, with missing values and with heavily imbalanced prior probabilities. SVM classifiers try to address the representativeness problem by controlling the complexity of the decision function through maximization of the separation margin. When applied to problems with small and imbalanced data sets, SVMs tend to smooth the response. One possible explanation is the asymptotic bounds imposed by the reduced size of the training and validation data sets. In addition, SVMs do not take into consideration the differences (costs) between the class distributions during the learning process. Here, we proposed the Boundary Elimination and Domination algorithm (BED) to tackle the representativeness problem in training sets. The results obtained on real-world applications demonstrate that BED is an efficient resampling strategy for small and imbalanced data sets, leading to a better separation surface.

References

1. Boser, B.E., Guyon, I.M., Vapnik, V.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144-152. ACM Press, New York (1992)
2. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York (1995)
3. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273-297 (1995)
4. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge (2000)
5. Wu, G., Chang, E.Y.: KBA: kernel boundary alignment considering imbalanced data distribution. IEEE Trans. Knowl. Data Eng. 17, 786-795 (2005)
6. Provost, F., Fawcett, T.: Robust classification for imprecise environments. Mach. Learn. 42, 203-231 (2001)
7. Sun, Y., Kamel, M.S., Wong, A.K.C., Wang, Y.: Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40, 3358-3378 (2007)
8. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences, http://www.ics.uci.edu/mlearn/MLRepository.html
9. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321-357 (2002)
10. Tan, P., Steinbach, M.: Introduction to Data Mining. Addison Wesley, Reading (2006)
11. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of the 14th International Conference on Machine Learning, pp. 179-186. Morgan Kaufmann, San Francisco (1997)
12. Egan, J.P.: Signal Detection Theory and ROC Analysis. Academic Press, London (1975)
13. Weiss, G.M.: Mining with rarity: a unifying framework. SIGKDD Explor. Newsl. 6, 7-19 (2004)
14. Karakoulas, G., Shawe-Taylor, J.: Optimizing classifiers for imbalanced training sets. In: Proceedings of the Conference on Advances in Neural Information Processing Systems II, pp. 253-259. MIT Press, Cambridge (1999)
15. Li, Y., Shawe-Taylor, J.: The SVM with uneven margins and Chinese document categorization. In: Proceedings of the 17th Pacific Asia Conference on Language, Information and Computation, pp. 216-227 (2003)
16. Veropoulos, K., Campbell, C., Cristianini, N.: Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 55-60 (1999)
17. Joachims, T.: Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Kluwer Academic Publishers, Norwell (2002)
18. Cristianini, N., Shawe-Taylor, J., Kandola, J.: On kernel-target alignment. In: Proceedings of Neural Information Processing Systems (NIPS 2001), pp. 367-373. MIT Press, Cambridge (2002)
19. Kandola, J., Shawe-Taylor, J.: Refining kernels for regression and uneven classification problems. In: Proceedings of the International Conference on Artificial Intelligence and Statistics. Springer, Heidelberg (2003)
20. Akbani, R., Kwek, S., Japkowicz, N.: Applying support vector machines to imbalanced datasets. In: Proceedings of the European Conference on Machine Learning, pp. 39-50 (2004)
21. Vilariño, F., Spyridonos, P., Vitrià, J., Radeva, P.: Experiments with SVM and stratified sampling with an imbalanced problem: detection of intestinal contractions. In: Proceedings of the International Workshop on Pattern Recognition for Crime Prevention, Security and Surveillance, pp. 783-791 (2005)
22. Tang, Y., Zhang, Y.Q., Chawla, N.V., Krasser, S.: SVMs modeling for highly imbalanced classification. IEEE Trans. Syst., Man, Cybern. B 39, 281-288 (2009)
