TSINGHUA SCIENCE AND TECHNOLOGY, ISSN 1007-0214, 02/10, pp. 619-628, Volume 17, Number 6, December 2012

Robust Multiclass Classification for Learning from Imbalanced Biomedical Data

Piyaphol Phoungphol¹, Yanqing Zhang¹, Yichuan Zhao²
¹ Department of Computer Science, Georgia State University, Atlanta, GA 30302-3994, USA
² Department of Mathematics and Statistics, Georgia State University, Atlanta, GA 30302-3994, USA

Received: 2012-09-21; revised: 2012-11-18. Supported by the GSU Molecular Basis of Disease Graduate Fellowship, 2011-2012. To whom correspondence should be addressed. E-mail: [email protected]

Abstract: Imbalanced data is a common and serious problem in many biomedical classification tasks. It biases the training of classifiers and lowers the prediction accuracy on minority classes. The problem has attracted considerable research interest over the past decade, but most efforts concentrate only on 2-class problems. In this paper, we study a new way of formulating the multiclass Support Vector Machine (SVM) problem for imbalanced biomedical data to improve classification performance. The proposed method applies a cost-sensitive approach and the ramp loss function to the Crammer and Singer multiclass SVM formulation. Experimental results on multiple biomedical datasets show that the proposed solution effectively handles datasets that are noisy and highly imbalanced.

Key words: multiclass classification; imbalanced data; ramp loss; Support Vector Machine (SVM); biomedical data

Introduction

Classification is an important data mining method in biomedical research. In a classification task, training examples are required to predict the target class of an unseen example. However, training data sometimes have an imbalanced class distribution[1-3]. One major reason for this problem is that the absolute number of examples in some classes is insufficient for training a classifier. This can be caused by the intrinsic rarity of the cases or by limitations of the data collection process, such as high cost or privacy concerns. For example, biomedical data derived from rare diseases and abnormal prognoses are difficult to obtain, and some biomedical data can only be acquired through expensive experiments.

This problem causes standard machine learning algorithms to be overwhelmed by the majority classes and to ignore the minority classes, since traditional classifiers tend to be biased toward the majority class as they try to optimize overall accuracy. In the biomedical field, this issue is particularly crucial: learning from such imbalanced data can uncover useful knowledge for important decisions, while misclassifying these data can be extremely costly. Figure 1 shows the ideal separating line and a biased separating line of a Support Vector Machine (SVM) between two imbalanced classes. Many research works try to improve traditional techniques or develop new algorithms to solve the class imbalance problem. However, most of those studies focus only on the binary (two-class) case. Only a few studies have addressed the multiclass imbalance problem, which is much more common and complex in real-world applications. In this paper, we study the multiclass imbalanced data problem and develop a new classification algorithm that can effectively handle the imbalance problem in many biomedical domains.


Fig. 1  SVM on an imbalanced dataset is biased toward the majority class.

Preliminaries

Consider an $m$-class classification problem with $n$ training samples $x_1, x_2, \ldots, x_n$, where each $x_i$ is a point in the feature space $\mathbb{R}^d$ with label $y_i \in \{1, \ldots, m\}$. Crammer and Singer[4] suggested a learning model of the form
$$f(x) = \arg\max_{r \in Y} W_r \cdot x \qquad (1)$$

where $W$ is an $m \times d$ matrix and $W_r$, the $r$-th row of $W$, is the coefficient vector of class $r$. Therefore, the predicted label of $x$ is the class with the highest similarity score. To find the best coefficients of the matrix $W$ in Eq. (1), we can formulate an optimization problem, shown in Formula (2), whose objective is to minimize the norm of $W$ plus the total classification error, where $C$ is a regularization parameter and $\xi_i$ is the error in predicting point $i$:
$$\min_{W}\; \frac{1}{2}\|W\|^2 + C\sum_{i=1}^{n}\xi_i \qquad (2)$$
$$\text{s.t.}\;\; \forall i,\; \forall r \neq y_i:\;\; W_{y_i} \cdot x_i - W_r \cdot x_i + \xi_i \geq 1, \qquad \forall i:\;\; \xi_i \geq 0.$$
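As an illustration (not code from the paper), the decision rule of Eq. (1) and the per-sample errors $\xi_i$ summed in Formula (2) can be written in a few lines of NumPy; labels are assumed to be 0-indexed and the toy data are made up.

```python
import numpy as np

def predict(W, x):
    """Crammer-Singer decision rule (Eq. 1): pick the row of W with the
    highest similarity score W_r . x."""
    return int(np.argmax(W @ x))

def multiclass_hinge_errors(W, X, y):
    """Per-sample errors xi_i used in Formula (2):
    xi_i = max(0, 1 + max_{r != y_i} W_r.x_i - W_{y_i}.x_i)."""
    scores = X @ W.T                          # (n, m) similarity scores
    n = X.shape[0]
    correct = scores[np.arange(n), y]         # W_{y_i} . x_i
    scores_wo_true = scores.copy()
    scores_wo_true[np.arange(n), y] = -np.inf
    runner_up = scores_wo_true.max(axis=1)    # best competing class score
    return np.maximum(0.0, 1.0 + runner_up - correct)

# toy usage: 3 classes, 4 features
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
X = rng.normal(size=(5, 4))
y = np.array([0, 1, 2, 0, 1])
print(predict(W, X[0]), multiclass_hinge_errors(W, X, y))
```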

Problems

While the formulation in Formula (2) usually yields good results, it may perform poorly when the training data are highly imbalanced or noisy.

(1) Imbalanced data. The total error is dominated by the errors of majority-class instances. Thus, a classifier will be biased toward the majority class in order to minimize the total error, as shown in Fig. 1.

(2) Noisy data. The error of each point can take values ranging from 0 to $+\infty$, so the errors of a few bad or noisy points can severely distort the overall error and deteriorate the classifier's performance.

We notice from both cases that the summation of all errors is not suitable for use as the objective function of the optimization problem. In the next section, we review current research works that improve the performance of multiclass classifiers for imbalanced data. Then, in Section 2, we introduce our new multiclass SVM objective function for imbalanced datasets.

1  Current Research Works

In the past decade, many researchers have studied the problem of imbalanced data classification in order to improve the performance of classification models. However, most of them concentrate only on binary classification. General ideas such as feature selection[5], sampling-based approaches, and cost-sensitive learning can easily be extended to multiclass problems, but this is rather difficult for algorithm-specific techniques[6, 7]. Below we group the related work that shows promising results in improving multiclass imbalanced classification.

1.1  Sampling

Sampling is a common approach for dealing with class imbalance by changing the original class frequencies in a preprocessing step; it therefore requires no change to the underlying learning algorithm. It may involve undersampling, oversampling, or both. The amount to sample in each class can be chosen empirically[1] or in relation to its misclassification cost[8, 9]. The concern is that undersampling might remove potentially valuable information, while oversampling can lead to overfitting. Therefore, many algorithms combine undersampling and oversampling to gain the advantages of both. Few experiments on sampling have been carried out with, or designed for, multiclass cases. According to Ref. [9], sampling methods are only useful for binary imbalance problems; they are not effective for multiclass imbalance problems and can even degrade classifiers' performance when the number of classes is high or the imbalance is severe.

1.2  Cost-sensitive learning

Cost-sensitive learning considers the cost of misclassifying data of one class as another class, and then tries to minimize the total cost rather than the total error. A widely used approach for applying the cost-sensitive idea to the SVM[10] is to assign a higher penalty to errors on a minority class in the optimization problem. To facilitate this approach, Zhou and Liu recommended a rescaling technique[9] for converting a cost matrix $c_{i,j}$ into class penalty factors for a multiclass SVM model. The weights $w_r$ of the classes are rescaled simultaneously according to their misclassification costs. Solving the relations in Eq. (3) yields the relative optimal weights of the imbalanced data for the multiclass SVM:
$$\frac{w_1}{w_2} = \frac{c_{1,2}}{c_{2,1}},\quad \frac{w_1}{w_3} = \frac{c_{1,3}}{c_{3,1}},\quad \ldots,\quad \frac{w_1}{w_m} = \frac{c_{1,m}}{c_{m,1}};\qquad \frac{w_2}{w_3} = \frac{c_{2,3}}{c_{3,2}},\quad \ldots,\quad \frac{w_2}{w_m} = \frac{c_{2,m}}{c_{m,2}};\qquad \ldots;\qquad \frac{w_{m-1}}{w_m} = \frac{c_{m-1,m}}{c_{m,m-1}} \qquad (3)$$
With the optimal weight for each class, the objective of the multiclass SVM in Formula (2) can be rewritten as
$$\min_{W}\; \frac{1}{2}\|W\|^2 + C\sum_{i=1}^{n} w_{y_i}\,\xi_i \qquad (4)$$
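The relations in Eq. (3) determine the weights only up to a common scale. A minimal sketch of one consistent solution is shown below (not from the paper): it assumes a user-supplied cost matrix `cost`, where `cost[i, j]` is the cost of misclassifying class i as class j, fixes the first weight, and derives the rest. This reproduces all of the pairwise relations only when the cost matrix itself is consistent.

```python
import numpy as np

def rescale_class_weights(cost):
    """One consistent solution of the relations in Eq. (3):
    fix w_0 = 1 and set w_j = cost[j, 0] / cost[0, j], so that
    w_i / w_j = cost[i, j] / cost[j, i] whenever the cost matrix is consistent.
    cost[i, j] is the (assumed) cost of misclassifying class i as class j."""
    m = cost.shape[0]
    w = np.ones(m)
    for j in range(1, m):
        w[j] = cost[j, 0] / cost[0, j]
    return w / w.max()   # optional: normalize so the largest weight is 1

# toy example with 3 classes where class 2 is rare and expensive to miss
cost = np.array([[0.0, 1.0, 1.0],
                 [1.0, 0.0, 1.0],
                 [5.0, 5.0, 0.0]])
print(rescale_class_weights(cost))   # heaviest weight for class 2
```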

1.3  One-class SVM

Unlike standard discrimination-based learning, one-class SVM only recognizes examples from one class rather than differentiating among all examples. It is useful when examples from the target class are rare or difficult to obtain. The classification is based on a predetermined threshold, which indicates whether or not to include an instance in the target class. In order to apply one-class SVM to a multiclass problem, multiple one-class SVM models are trained together. Hao and Lin[11] formulated multiple one-class SVM models as one large optimization problem to minimize the total error, while Refs. [12, 13] used separate one-class SVM models and iteratively adjusted the threshold of each model to maximize the total accuracy. The main challenge is selecting a proper threshold for each class: setting a high threshold makes inclusion more difficult, so positive examples are more likely to be misclassified, yet setting the value too low allows more examples from non-target classes into the target class region.

1.4  Boosting classifier

Boosting is one of the most effective approaches for solving the class-imbalance problem on 2-class datasets. It iteratively improves the classifier's performance by either updating the misclassification cost[14] or changing the distribution ratio[7] of each class. Some recent studies[15, 16] extended this approach to the multiclass imbalance problem and achieved promising results. The main drawback of this approach is its slow speed, because it might require the classifier to train on the dataset multiple times until the result converges or all data points are classified. Although our proposed method has similarities to this approach, our work targets optimizing G-mean directly, which is more appropriate for imbalanced data than the total loss/error. Moreover, we use a fast and effective technique, called the ConCave-Convex Procedure (CCCP)[17], to solve the optimization problem. Although the current works discussed in this section (sampling, cost-sensitive learning, one-class SVM, and boosting classifiers) show successful results in addressing multiclass imbalance, they do not take into account the noise that usually accompanies imbalanced data. In contrast, our proposed solution addresses both noise and class imbalance simultaneously.

2  Multiclass Classification Objective

In the previous section, we discussed the drawbacks of using the sum of errors over all training data as the objective function of the optimization problem. In this section, we look for a new objective function that is more suitable for imbalanced data classification. This objective function should be formulated based on a measure that indicates how good a classifier is, so we first review the popular measures that have been used to compare the performance of learning models on imbalanced data.

2.1  Metrics for imbalanced classification

Accuracy (Acc) is the most widely used measure for comparing the performance of classification models; however, it is generally not suitable for an imbalanced dataset. Alternative performance measures, such as G-mean, F-measure, and the volume under the Receiver Operating Characteristic (ROC) surface, are usually used in multiclass imbalance learning research.

2.1.1  G-mean

G-mean has been introduced to address the problems of Acc on imbalanced datasets. It is the geometric mean of all class accuracies, so the G-mean for $m$ classes can be calculated by Eq. (5):
$$\text{G-mean} = \Big(\prod_{i=1}^{m}\mathrm{Acc}_i\Big)^{1/m} \qquad (5)$$
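For a quick illustration (not from the paper), the snippet below computes the G-mean of Eq. (5) from a confusion matrix whose rows are actual classes and whose columns are predicted classes; it also shows why plain accuracy can look deceptively good on imbalanced data.

```python
import numpy as np

def g_mean(conf):
    """G-mean of Eq. (5): geometric mean of per-class accuracies (recalls).
    conf[i, j] counts samples of actual class i predicted as class j."""
    per_class_acc = np.diag(conf) / conf.sum(axis=1)
    return float(np.prod(per_class_acc) ** (1.0 / conf.shape[0]))

# a classifier that ignores the minority class gets Acc = 0.95 but G-mean = 0
conf = np.array([[95, 0],
                 [ 5, 0]])
print((95 + 0) / 100, g_mean(conf))   # 0.95 vs 0.0
```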


2.1.2  F-measure

F-measure is widely used to evaluate text classification systems. It is defined as the harmonic mean of Precision ($\pi$) and Recall ($\rho$) in the binary-class problem. Precision is the proportion of predicted positives that are actually positive, and Recall is the proportion of actual positives that are predicted positive. The values of $\pi$ and $\rho$ can be calculated from Eq. (6), where the notation TP, FP, and FN is illustrated in Table 1.
$$\pi = \frac{TP}{TP + FP}, \qquad \rho = \frac{TP}{TP + FN}, \qquad \text{F-measure} = \frac{2\pi\rho}{\pi + \rho} \qquad (6)$$
To extend F-measure to the multiclass case, two types of averaging, micro-average and macro-average, are commonly used[18]. In micro-averaging, F-measure is computed globally over all classification decisions as in Eq. (7); $\pi$ and $\rho$ are obtained by summing over all individual decisions.
$$\pi = \frac{\sum_{i=1}^{m} TP_i}{\sum_{i=1}^{m}(TP_i + FP_i)}, \qquad \rho = \frac{\sum_{i=1}^{m} TP_i}{\sum_{i=1}^{m}(TP_i + FN_i)}, \qquad \text{F-score(micro)} = \frac{2\pi\rho}{\pi + \rho} \qquad (7)$$

In macro-averaging, F-measure is computed locally over each class first, and then the average over all classes is taken; $\pi_i$ and $\rho_i$ are computed for each class as in Eq. (8).
$$F_i = \frac{2\pi_i\rho_i}{\pi_i + \rho_i}, \qquad \text{F-score(macro)} = \frac{1}{m}\sum_{i=1}^{m} F_i \qquad (8)$$
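The micro- and macro-averaged F-measures of Eqs. (7) and (8) can be computed from a multiclass confusion matrix as sketched below (illustrative only, not the authors' code); per-class TP, FP, and FN are derived from the matrix as indicated in the comments.

```python
import numpy as np

def micro_macro_f(conf):
    """Macro/micro F-measure (Eqs. (8) and (7)) from a confusion matrix
    where conf[i, j] counts samples of actual class i predicted as class j."""
    tp = np.diag(conf).astype(float)     # TP_i
    fp = conf.sum(axis=0) - tp           # predicted as class i but actually not i
    fn = conf.sum(axis=1) - tp           # actually class i but predicted otherwise

    # micro: pool the counts first, then compute a single F value
    pi = tp.sum() / (tp.sum() + fp.sum())
    rho = tp.sum() / (tp.sum() + fn.sum())
    micro = 2 * pi * rho / (pi + rho)

    # macro: compute F per class, then average
    pi_i = tp / np.maximum(tp + fp, 1e-12)
    rho_i = tp / np.maximum(tp + fn, 1e-12)
    f_i = 2 * pi_i * rho_i / np.maximum(pi_i + rho_i, 1e-12)
    macro = f_i.mean()
    return macro, micro

conf = np.array([[50, 2, 3],
                 [ 4, 30, 1],
                 [ 2, 1, 7]])
print(micro_macro_f(conf))
```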

2.1.3  Volume under ROC (multiclass AUC)

The Area Under the ROC Curve (AUC) metric is one of the most popular measures for comparing binary classifiers because it requires no assumptions about the class distribution or misclassification costs. However, extending the AUC to multiclass problems is still an open research topic. Directly calculating the Volume Under the ROC (VUC) surface[19] is complicated, so many studies[20, 21] have tried to estimate the volume by projecting it onto multiple 2-class dimensions.

2.2  A new objective function

Among the three metrics for imbalanced classification mentioned above, we choose G-mean as part of our objective function because it is one of the most popular measures and is easier to calculate than VUC. Thus, the objective function in Formula (2) can be redefined as
$$\min_{W}\; \frac{1}{2}\|W\|^2 - C\Big(\prod_{r=1}^{m}\mathrm{Acc}_r\Big)^{1/m} \qquad (9)$$
As $m$ is constant, a minimum point of Formula (9) is also a minimum point of Formula (10),
$$\min_{W}\; \frac{1}{2}\|W\|^2 - C'\prod_{r=1}^{m}\mathrm{Acc}_r \qquad (10)$$
The objective function in Formula (10) should yield the ideal model for multiclass imbalanced classification, as it is derived directly from G-mean. In practice, however, it is computationally expensive to find a solution of the optimization problem because the objective function is discontinuous, non-linear, and non-convex; only a few methods, such as genetic algorithms and particle swarm optimization, are able to handle it. In order to make the problem easier to solve, we modify Formula (10) by using the average of all class accuracies instead of their geometric mean. The modified objective function is
$$\min_{W}\; \frac{1}{2}\|W\|^2 - \frac{C}{m}\sum_{r=1}^{m}\mathrm{Acc}_r \qquad (11)$$
Let $\ell$ be the 0-1 loss function: $\ell_i$ equals 0 when the predicted label of point $i$ is correct and 1 otherwise. Thus, the accuracy of class $r$ is $\mathrm{Acc}_r = \sum_{i\,|\,y_i=r}(1-\ell_i)/N_r$, where $N_r$ is the total number of data points in class $r$. If we define $\delta_{i,r}$ as a 0-1 membership function of point $i$ to class $r$, i.e., $\delta_{i,r}$ equals 1 when $y_i = r$ and 0 otherwise, then the accuracy of class $r$ can be written as $\mathrm{Acc}_r = \sum_{i=1}^{n}\delta_{i,r}(1-\ell_i)/N_r$.

Table 1  Classification terms for F-measure

                     Predicted positive    Predicted negative
Actual positive      TP (true positive)    FN (false negative)
Actual negative      FP (false positive)   TN (true negative)


We can rewrite Formula (11) as Eq. (12):
$$\min_{W}\; \frac{1}{2}\|W\|^2 - \frac{C}{m}\sum_{r=1}^{m}\sum_{i=1}^{n}\frac{\delta_{i,r}(1-\ell_i)}{N_r} \;\equiv\; \min_{W}\; \frac{1}{2}\|W\|^2 + \sum_{i=1}^{n}\frac{C'}{N_{y_i}}\,\ell_i \;\equiv\; \min_{W}\; \frac{1}{2}\|W\|^2 + \sum_{i=1}^{n}C_{y_i}\,\ell_i \qquad (12)$$
We can see from Eq. (12) that the new objective is very close to the original objective function in Formula (2). The main differences are that (1) the regularization parameter of each point is weighted inversely by the number of data points in its class, $C_{y_i} = C'/N_{y_i}$; and (2) the error/loss of each point, $\ell_i$, is a 0-1 (step) function. Although the problem in Eq. (12) is less complex than Formula (10) and can be solved as a Mixed Integer Quadratic Program (MIQP) by popular solvers, the solving process is still very computationally expensive. In the next section, we find an alternative way to simplify Eq. (12) and solve it efficiently.

3  Multiclass Classification with Ramp Loss

Many studies have tried to solve the SVM with the 0-1 loss function by replacing it with other smooth functions such as the sigmoid function[22, 23], logistic function[24], polynomial function[23], and hyperbolic tangent function[25]. These functions, while yielding accurate results, are still computationally expensive to optimize due to their non-convex nature. Alternatively, the truncated hinge loss, or ramp loss, a non-smooth but continuous function, has been proven to be accurate and efficient for SVM problems[26-28]. Due to its efficiency and simplicity, we decide to use the ramp loss to approximate the 0-1 loss function in Eq. (12).

Fig. 2  Composition of ramp loss.

3.1  Ramp loss function

The ramp loss, or truncated hinge loss, stays in the range $[0, z]$ when the error $\xi$ is less than the constant ramp parameter $z$, and is capped at $z$ when $\xi$ is greater than $z$. An example of the ramp loss function is shown in Fig. 2a. The ramp loss $R(\xi)$ is simply the original hinge loss $\xi$ in Fig. 2b minus the truncated part $H_z(\xi) = \max(0, \xi - z)$ in Fig. 2c:
$$R(\xi) = \xi - H_z(\xi) \qquad (13)$$
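Concretely, the ramp loss of Eq. (13) is just the hinge-style error clipped at $z$, as the small sketch below illustrates (not from the paper):

```python
import numpy as np

def hinge_truncation(xi, z):
    """Truncated part H_z of Eq. (13): the portion of the error above z."""
    return np.maximum(0.0, xi - z)

def ramp_loss(xi, z):
    """Ramp loss R(xi) = xi - H_z(xi), i.e. the error clipped at z,
    so a single noisy point can contribute at most z to the objective."""
    xi = np.asarray(xi, dtype=float)
    return xi - hinge_truncation(xi, z)      # same as np.minimum(xi, z)

errors = np.array([0.2, 0.9, 1.0, 4.7, 50.0])   # hinge-style errors xi >= 0
print(ramp_loss(errors, z=1.0))                  # [0.2 0.9 1.  1.  1. ]
```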

When we replace $\ell_i$ in Eq. (12) with $R(\xi_i)$, the objective function becomes
$$\min_{W}\; \frac{1}{2}\|W\|^2 + \sum_{i=1}^{n}C_{y_i}\,\xi_i - \sum_{i=1}^{n}C_{y_i}\,H_z(\xi_i) \qquad (14)$$
Although the objective function in Formula (14) is non-convex, the Karush-Kuhn-Tucker conditions remain necessary (but not sufficient) optimality conditions. In the next section, we introduce an efficient method to solve it.

3.2  Solving multiclass ramp loss with CCCP

CCCP was first introduced by Yuille and Rangarajan[17] to solve non-convex optimization problems. According to CCCP, any objective function $f(x)$ can be decomposed into the sum of a convex function and a concave function: $f(x) = f_{\mathrm{vex}}(x) + f_{\mathrm{cave}}(x)$. CCCP solves the non-convex problem iteratively by replacing the concave part with its tangent at the current solution $x^t$ and minimizing the resulting convex surrogate until the result converges:
$$f^{t}_{\mathrm{cave}}(x) = \frac{\partial f_{\mathrm{cave}}}{\partial x}\bigg|_{x^t}\cdot x, \qquad x^{t+1} = \arg\min_{x}\; f_{\mathrm{vex}}(x) + f^{t}_{\mathrm{cave}}(x) \qquad (15)$$
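To make the iteration in Eq. (15) concrete, here is a tiny illustrative example (not from the paper): minimizing a toy one-dimensional objective split into a convex part and a concave part, with the concave part replaced by its tangent at each iterate.

```python
import numpy as np

# Toy objective f(x) = f_vex(x) + f_cave(x) with
#   f_vex(x)  = x**4 / 4    (convex part)
#   f_cave(x) = -x**2 / 2   (concave part)
# The true minima of f are at x = -1 and x = +1, where f = -1/4.
f_vex = lambda x: x ** 4 / 4.0
f_cave = lambda x: -x ** 2 / 2.0
grad_cave = lambda x: -x          # derivative of the concave part

x_t = 0.5                          # initial guess
for _ in range(50):
    slope = grad_cave(x_t)         # tangent of the concave part at x_t (Eq. (15))
    # convex surrogate: x**4/4 + slope * x; its minimizer satisfies x**3 = -slope
    x_next = float(np.cbrt(-slope))
    if abs(x_next - x_t) < 1e-10:
        break
    x_t = x_next
print(x_t, f_vex(x_t) + f_cave(x_t))   # converges to ~1.0 with value ~ -0.25
```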

In order to apply CCCP to our objective function in Formula (14), we first decompose it into convex and concave parts, as in Formula (16):
$$\min_{W}\; \underbrace{\frac{1}{2}\|W\|^2 + \sum_{i=1}^{n}C_{y_i}\,\xi_i}_{\text{convex}} \; \underbrace{-\;\sum_{i=1}^{n}C_{y_i}\,H_z(\xi_i)}_{\text{concave}} \qquad (16)$$

The tangent of the concave part with respect to each $\xi_i$ is
$$\frac{\partial f_{\mathrm{cave}}}{\partial \xi_i}\bigg|_{\xi_i^t} = \begin{cases} 0, & \text{if } \xi_i^t \leq z; \\ -C_{y_i}, & \text{if } \xi_i^t > z \end{cases} \qquad (17)$$

Thus, in each iteration, CCCP solves an optimization problem of the form
$$\min_{W}\; \frac{1}{2}\|W\|^2 + \sum_{i\,|\,\xi_i^t \leq z} C_{y_i}\,\xi_i \qquad (18)$$
$$\text{s.t.}\;\; \forall i,\; \forall r \neq y_i:\;\; W_{y_i} \cdot x_i - W_r \cdot x_i + \xi_i \geq 1, \qquad \forall i:\;\; \xi_i \geq 0.$$

The problem in Formula (18) can easily be reformulated into its dual form using the same technique as Crammer and Singer[4]. The complete procedure for solving the multiclass SVM with ramp loss is shown in Algorithm 1; points having one or more non-zero $\alpha_{i,r}$ are the support vectors of the model. Although the multiclass SVM with ramp loss in Algorithm 1 looks complex and requires solving an optimization problem multiple times until the solution converges, the successive optimizations only refine the support vectors found in the previous iteration. Each iteration is therefore much faster, because successive solutions share roughly the same group of support vectors.

Algorithm 1  Multiclass SVM with ramp loss
1: find the ratio of each class: $\mathrm{ratio}_r$
2: calculate $C_r$ for each class: $C_r = C / \mathrm{ratio}_r$
3: initialize $\alpha$
4: repeat
5:   compute $\xi_i = 1 + \max_{r \neq y_i} \sum_{j=1}^{n} \alpha_{j,r} K(x_i, x_j) - \sum_{j=1}^{n} \alpha_{j,y_i} K(x_i, x_j)$
6:   solve the dual problem
       $\min_{\alpha}\; \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}(\alpha_i \cdot \alpha_j)\,K(x_i, x_j) - \sum_{i=1}^{n}\alpha_{i,y_i}$
       s.t. $\alpha_{i,r} \leq \begin{cases} C_{y_i}, & \text{if } r = y_i \text{ and } \xi_i \leq z; \\ 0, & \text{otherwise} \end{cases}$ and $\sum_{r=1}^{m}\alpha_{i,r} = 0$, $\forall i$
7: until $\alpha$ converges
8: $f(x) = \arg\max_{r \in Y} \sum_{i=1}^{n} \alpha_{i,r} K(x_i, x)$
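The control flow of Algorithm 1 can be summarized with the hedged sketch below. It is not the authors' implementation (the paper's own code was written in Java with the Gurobi optimizer); the inner Crammer-Singer dual QP is left as a placeholder callable `solve_cs_dual`, a hypothetical name.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """RBF kernel matrix for the training points."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def ramp_mcsvm_fit(K, y, m, C, z, solve_cs_dual, max_iter=20, tol=1e-4):
    """CCCP outer loop of Algorithm 1 (sketch).
    K: (n, n) kernel matrix, y: labels in {0..m-1},
    solve_cs_dual(K, y, C_per_class, active): placeholder inner solver that
    returns alpha (n, m) for the Crammer-Singer dual over the `active` points."""
    n = len(y)
    ratio = np.bincount(y, minlength=m) / n          # step 1: class ratios
    C_per_class = C / ratio                          # step 2: C_r = C / ratio_r
    alpha = np.zeros((n, m))                         # step 3
    for _ in range(max_iter):                        # step 4: repeat
        scores = K @ alpha                           # sum_j alpha_{j,r} K(x_i, x_j)
        true_score = scores[np.arange(n), y]
        scores_wo = scores.copy()
        scores_wo[np.arange(n), y] = -np.inf
        xi = 1.0 + scores_wo.max(axis=1) - true_score   # step 5
        active = xi <= z                             # keep points below the ramp cut
        new_alpha = solve_cs_dual(K, y, C_per_class, active)   # step 6 (QP)
        if np.abs(new_alpha - alpha).max() < tol:    # step 7: until alpha converges
            return new_alpha
        alpha = new_alpha
    return alpha

def predict(K_test_train, alpha):
    """Step 8: f(x) = argmax_r sum_i alpha_{i,r} K(x_i, x)."""
    return np.argmax(K_test_train @ alpha, axis=1)
```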

4  Evaluations

4.1  Dataset

To evaluate the performance of our proposed solution, the multiclass SVM with ramp loss (Ramp-MCSVM), we compared it against two other popular multiclass SVM classifiers: the Crammer and Singer multiclass SVM (MCSVM)[4] and the cost-sensitive multiclass SVM (Cost-MCSVM). The experiments use 7 real-world imbalanced biomedical datasets: Thyroid Disease, Yeast, Contraceptive Method, Vertebral Column, Heart Disease (Switzerland and Cleveland), and Pima Diabetes. These datasets are publicly available from the UCI repository[29], and their characteristics are summarized in Table 2. The binary-class Pima Diabetes dataset is included to show that our Ramp-MCSVM algorithm is a general n-class classifier that is applicable to 2-class problems as well. For each dataset, we calculated the class ratios and marked the smallest ($r_{\min}$) and largest ($r_{\max}$) class ratios with an asterisk (*). The imbalance of a dataset is indicated by the ratio between the number of instances of the majority class and the minority class, $r_{\max}/r_{\min}$.

4.2  Parameters and performance metrics

In the experiments, we implemented all three algorithms in Java and used the Gurobi Optimizer[30] to solve the optimization problems. We experimented with two types of kernels: linear and non-linear (Radial Basis Function, RBF) kernels. The parameters, cost ($C$) and RBF gamma ($\gamma$), of each algorithm are tuned by grid search with 5-fold cross validation to maximize G-mean, where $C \in \{2^{-2}, 2^{-1}, \ldots, 2^{15}\}$ and $\gamma \in \{2^{-6}, 2^{-5}, \ldots, 2^{5}\}$. Then, the ramp-cut ($z$) parameter of Ramp-MCSVM is selected by grid search over the range $\{0.1, 0.2, \ldots, 2.0\}$ while keeping the values of the $C$ and $\gamma$ parameters the same as in the Cost-MCSVM algorithm. For each dataset, we ran 40 trials; in each trial, eighty percent of the data are randomly selected as training data while the rest are used as test data. A G-mean value for a model is then calculated from the per-class accuracies. The average G-mean values for the linear and RBF kernels are shown in Table 3 and Table 4, respectively. In addition to G-mean, we investigated another performance metric, F-measure, as discussed in Section 2.1. The micro- and macro-averaged F-measures of the models in Table 5 are calculated according to Eq. (7) and Eq. (8).
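The tuning protocol described above can be sketched as follows (illustrative only; the actual experiments used the authors' Java implementation). The `fit_predict` and `g_mean` callables are placeholders for any of the three classifiers and for the G-mean metric of Eq. (5).

```python
import numpy as np
from itertools import product

def gmean_grid_search(X, y, fit_predict, C_grid, gamma_grid, g_mean, k=5, seed=0):
    """5-fold cross-validated grid search that picks (C, gamma) maximizing G-mean.
    fit_predict(X_tr, y_tr, X_te, C, gamma) -> predicted labels for X_te;
    g_mean(y_true, y_pred) -> scalar score."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    best, best_score = None, -np.inf
    for C, gamma in product(C_grid, gamma_grid):
        scores = []
        for f in range(k):
            te = folds[f]
            tr = np.concatenate([folds[j] for j in range(k) if j != f])
            pred = fit_predict(X[tr], y[tr], X[te], C, gamma)
            scores.append(g_mean(y[te], pred))
        if np.mean(scores) > best_score:
            best, best_score = (C, gamma), float(np.mean(scores))
    return best, best_score

C_grid = [2.0 ** p for p in range(-2, 16)]      # 2^-2 ... 2^15
gamma_grid = [2.0 ** p for p in range(-6, 6)]   # 2^-6 ... 2^5
```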

Table 2  Multiclass imbalanced datasets from UCI

Dataset                       Number of instances  Number of attributes  Number of classes  Class ratios, r                                                          Imbalance ratio r_max/r_min
Thyroid Disease               3772                 21                    3                  0.025*, 0.050, 0.925*                                                    37.51
Yeast                         1484                 8                     10                 0.312*, 0.289, 0.164, 0.110, 0.034, 0.030, 0.024, 0.020, 0.014, 0.003*   92.60
Contraceptive Method          1473                 9                     3                  0.427*, 0.226*, 0.347                                                    1.89
Vertebral Column              310                  5                     3                  0.193*, 0.484*, 0.323                                                    2.50
Heart Disease (Switzerland)   123                  13                    5                  0.076, 0.381*, 0.276, 0.219, 0.048*                                      9.60
Heart Disease (Cleveland)     303                  13                    5                  0.541*, 0.182, 0.119, 0.115, 0.043*                                      12.62
Pima Diabetes                 768                  8                     2                  0.651*, 0.349*                                                           1.87

Notes: * indicates the smallest class ratio r_min and the largest class ratio r_max of each dataset.

Table 3  Average G-mean performance for linear kernel

Dataset                       MCSVM   Cost-MCSVM   Ramp-MCSVM
Thyroid Disease               0.945   0.991        0.996
Yeast                         0.382   0.513        0.576
Contraceptive Method          0.303   0.327        0.513
Vertebral Column              0.776   0.835        0.847
Heart Disease (Switzerland)   0.304   0.553        0.586
Heart Disease (Cleveland)     0.367   0.454        0.481
Pima Diabetes                 0.492   0.703        0.735

Notes: The bold number means Ramp-MCSVM has better performance than the other two classifiers.

Table 4  Average G-mean performance for RBF kernel

Dataset                       MCSVM   Cost-MCSVM   Ramp-MCSVM
Thyroid Disease               0.961   0.968        0.986
Yeast                         0.908   0.912        0.937
Contraceptive Method          0.735   0.743        0.786
Vertebral Column              0.931   0.933        0.967
Heart Disease (Switzerland)   0.897   0.912        0.920
Heart Disease (Cleveland)     0.899   0.905        0.932
Pima Diabetes                 0.913   0.938        0.951

Notes: The bold number means Ramp-MCSVM has better performance than the other two classifiers.

Table 5  Macro / micro F-measure performance

                              Linear kernel                                  RBF kernel
Dataset                       MCSVM          Cost-MCSVM     Ramp-MCSVM      MCSVM          Cost-MCSVM     Ramp-MCSVM
Thyroid Disease               0.886 / 0.903  0.923 / 0.930  0.943 / 0.949   0.970 / 0.974  0.971 / 0.975  0.993 / 0.994
Yeast                         0.465 / 0.476  0.507 / 0.541  0.531 / 0.572   0.963 / 0.967  0.969 / 0.972  0.964 / 0.969
Contraceptive Method          0.369 / 0.417  0.380 / 0.464  0.402 / 0.493   0.761 / 0.766  0.772 / 0.778  0.811 / 0.819
Vertebral Column              0.786 / 0.801  0.813 / 0.825  0.814 / 0.827   0.960 / 0.962  0.977 / 0.980  0.983 / 0.984
Heart Disease (Switzerland)   0.685 / 0.706  0.673 / 0.724  0.721 / 0.768   0.948 / 0.956  0.984 / 0.987  0.961 / 0.969
Heart Disease (Cleveland)     0.587 / 0.672  0.616 / 0.690  0.642 / 0.694   0.954 / 0.959  0.965 / 0.971  0.973 / 0.978
Pima Diabetes                 0.577 / 0.605  0.671 / 0.686  0.679 / 0.694   0.956 / 0.958  0.972 / 0.972  0.975 / 0.975

Notes: The first number is the macro-averaging F-measure and the second number is the micro-averaging F-measure. The bold number means Ramp-MCSVM has better performance than the other two classifiers.

4.3  Results

Tables 3 and 5 clearly show that, with the linear kernel, the multiclass SVM with ramp loss yields better classification results than the cost-sensitive model and the original multiclass SVM with respect to both G-mean and F-measure. Indeed, the performance improvements of Ramp-MCSVM over MCSVM and Cost-MCSVM are significant on datasets with low average G-mean values, such as Yeast, Contraceptive Method, and Heart Disease, despite the fact that those datasets have very high imbalance ratios (Yeast = 92.60, Heart Disease (Switzerland) = 9.60, Heart Disease (Cleveland) = 12.62).

With the RBF kernel, the multiclass SVM with ramp loss generally achieves the best performance among the three algorithms; it has the highest G-mean and F-measure on all datasets except for a lower F-measure on the Yeast and Heart Disease (Switzerland) datasets. Nevertheless, it is relevant to point out that the parameters used in the experiments are selected to maximize the G-mean value, not the F-measure. The improvements of Ramp-MCSVM are less significant than in the linear kernel case, possibly due to the higher-dimensional feature space and non-linear nature of the RBF kernel.

Table 6  Comparison of class errors for Contraceptive dataset

Class        Average error    Error variance
No-use       1.24             3.20
Long-term    16.99            48.62
Short-term   5.74             18.38

Table 7  Comparison of class errors for Vertebral dataset

Class              Average error    Error variance
Disk Hernia        1.38             1.84
Spondylolisthesis  1.00             1.12
Normal             1.53             1.02

4.4  Analysis of Ramp-MCSVM effectiveness

We have studied the effectiveness of using the ramp loss in imbalanced data classification. Why does the ramp-loss multiclass SVM yield much better results than the cost-sensitive multiclass SVM on some datasets, such as Contraceptive Method and Yeast, but show only slight improvements on others, such as Vertebral Column and Thyroid Disease? We provide a more detailed analysis by examining the distributions of the classification errors, $\xi$, from the cost-sensitive multiclass SVM model on two different datasets: one that shows dramatic improvement with the ramp loss (Contraceptive Method) and one with only marginal improvement (Vertebral Column). In the following, we identify why our ramp-loss multiclass SVM yields significantly better results on the Contraceptive Method dataset.

Tables 6 and 7 show the average classification errors and variances within each class for the Contraceptive Method and Vertebral Column datasets, respectively. The errors of each class are scaled by the inverse of the class ratio. In Table 7, we observe that the average errors and error variances of each class in the Vertebral Column dataset are relatively close to each other. Contrast this with the Contraceptive Method dataset, in which the averages and variances of the errors within each class vary remarkably. In addition, we also plotted the error distributions of the Contraceptive Method dataset in Fig. 3 for better clarity.

Fig. 3  Probability distribution of prediction error from Contraceptive dataset.

The error distributions of the classes are significantly different from one another; the main problem, however, lies in the variance of the errors in the second class, "long-term". Figure 3 clearly shows that the "long-term" class is extremely noisy compared with the other classes in the same dataset. These results help us understand why using the summation of all errors on an imbalanced and noisy dataset such as Contraceptive Method deteriorates the performance of the classifier, even though those errors are already scaled in the cost-sensitive multiclass SVM. Our ramp-loss multiclass SVM solves this problem by removing these noisy data points from the objective function, thus yielding higher accuracy.
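The per-class diagnostics reported in Tables 6 and 7 can be reproduced in spirit with a short helper (a sketch, not the authors' code); here `errors` would hold the errors $\xi$ from the cost-sensitive model and `ratios` the class ratios used for scaling.

```python
import numpy as np

def per_class_error_stats(errors, y, ratios):
    """Average and variance of prediction errors per class, with each error
    scaled by the inverse of its class ratio (as done for Tables 6 and 7)."""
    stats = {}
    for r in np.unique(y):
        scaled = errors[y == r] / ratios[r]
        stats[int(r)] = (scaled.mean(), scaled.var())
    return stats
```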

5  Conclusions

This paper presented a new approach for alleviating the multiclass imbalance problem, which is commonly found in many classification systems involving biomedical data. In many applications, extremely noisy and imbalanced data can seriously deteriorate the performance of existing classification tools. Our goal was to find a new technique that is more accurate and efficient for learning from imbalanced biomedical data. We explored different measures used to evaluate classifier performance, formalized a new objective function that maximizes G-mean, and then derived a new multiclass SVM algorithm that is suitable for multiclass imbalanced data. Our proposed solution applies the ramp loss function and a weighted-cost method in order to solve the resulting optimization problem efficiently. To verify the effectiveness of our ramp-loss multiclass SVM, we experimented with 7 real-world imbalanced biomedical datasets, and the empirical results show that it clearly outperforms both the original multiclass SVM and the cost-sensitive multiclass SVM. We believe that the solution proposed in this paper is a promising step toward overcoming the multiclass imbalance problem in the biomedical field. The ability to classify data more accurately and efficiently will provide significant benefits for biomedical research and the healthcare industry.

References

[1] Chawla N V, Japkowicz N. Editorial: Special issue on learning from imbalanced datasets. SIGKDD Explorations, 2004, 6: 1-6.
[2] Weiss G M. Mining with rarity: A unifying framework. SIGKDD Explorations Newsletter, 2004, 6: 7-19.
[3] Japkowicz N, Holte R. Workshop report: AAAI-2000 workshop on learning from imbalanced data sets. AI Magazine, 2001, 22: 127-136.
[4] Crammer K, Singer Y, Cristianini N, Shawe-Taylor J, Williamson B. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2001, 2: 265-292.
[5] Wasikowski M, Chen X W. Combating the small sample class imbalance problem using feature selection. Knowledge and Data Engineering, IEEE Transactions on, 2010, 22: 1388-1400.
[6] Chen X, Gerlach B, Casasent D. Pruning support vectors for imbalanced data classification. In: Proceedings of the International Joint Conference on Neural Networks. Montreal, Canada, 2005: 1883-1888.
[7] Tang Y, Zhang Y Q, Chawla N, Krasser S. SVMs modeling for highly imbalanced classification. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 2009, 39: 281-288.
[8] Elkan C. Magical thinking in data mining: Lessons from CoIL challenge 2000. In: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, USA, 2001: 426-431.
[9] Zhou Z H, Liu X Y. On multi-class cost-sensitive learning. Computational Intelligence, 2010, 26: 232-257.
[10] Bach F R, Heckerman D, Horvitz E. Considering cost asymmetry in learning classifiers. Journal of Machine Learning Research, 2006, 7: 1713-1741.
[11] Hao P Y, Lin Y H. A new multi-class support vector machine with multi-sphere in the feature space. In: Proceedings of the 20th International Conference on Industrial, Engineering, and Other Applications of Applied Intelligent Systems. Kyoto, Japan, 2007: 756-765.
[12] Tohmé M, Lengellé R. Maximum margin one class support vector machines for multiclass problems. Pattern Recognition Letters, 2011, 32: 1652-1658.
[13] Yang X Y, Liu J, Zhang M Q, Niu K. A new multi-class SVM algorithm based on one-class SVM. In: Proceedings of the 7th International Conference on Computational Science. Beijing, China, 2007: 677-684.
[14] Sun Y, Kamel M S, Wong A K, Wang Y. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 2007, 40: 3358-3378.
[15] Sun Y, Kamel M, Wang Y. Boosting for learning multiple classes with imbalanced class distribution. In: Proceedings of the 6th International Conference on Data Mining. Hong Kong, China, 2006: 592-602.
[16] Jeatrakul P, Wong K W. Enhancing classification performance of multi-class imbalanced data using the OAA-DB algorithm. In: Proceedings of the 2012 International Joint Conference on Neural Networks (IJCNN). Brisbane, Australia, 2012: 1-8.
[17] Yuille A L, Rangarajan A. The concave-convex procedure (CCCP). Advances in Neural Information Processing Systems, 2002, 15: 915-936.
[18] Özgür A, Özgür L, Güngör T. Text categorization with class-based and corpus-based keyword selection. In: Proceedings of the 20th International Conference on Computer and Information Sciences. Istanbul, Turkey, 2005: 606-615.
[19] Ferri C, Hernández-Orallo J, Salido M. Volume under the ROC surface for multi-class problems: Exact computation and evaluation of approximations. In: Proceedings of the 14th European Conference on Machine Learning. Cavtat-Dubrovnik, Croatia, 2003: 108-120.
[20] Hand D J, Till R J. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 2001, 45: 171-186.
[21] Zhang X L, Jiang C. Improved SVM for learning multi-class domains with ROC evaluation. In: Proceedings of the International Conference on Machine Learning and Cybernetics. Hong Kong, China, 2007: 2891-2896.
[22] Ratnagiri M, Rabiner L, Juang B H. Multi-class classification using a new sigmoid loss function for minimum classification error (MCE). In: Proceedings of the 9th International Conference on Machine Learning and Applications (ICMLA). Washington DC, USA, 2010: 84-89.
[23] Perez-Cruz F, Navia-Vazquez A, Figueiras-Vidal A, Artes-Rodriguez A. Empirical risk minimization for support vector classifiers. Neural Networks, IEEE Transactions on, 2003, 14: 296-303.
[24] Liu Y, Shen X. Multicategory ψ-learning. Journal of the American Statistical Association, 2006, 101: 500-509.
[25] Perez-Cruz F, Navia-Vazquez A, Alarcon-Diana P, Artes-Rodriguez A. Support vector classifier with hyperbolic tangent penalty function. In: Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Istanbul, Turkey, 2000: 3458-3461.
[26] Collobert R, Sinz F, Weston J, Bottou L. Trading convexity for scalability. In: Proceedings of the 23rd International Conference on Machine Learning. Pittsburgh, USA, 2006: 201-208.
[27] Wu Y, Liu Y. Robust truncated hinge loss support vector machines. Journal of the American Statistical Association, 2007, 102: 974-983.
[28] Phoungphol P, Zhang Y, Zhao Y. Multiclass SVM with ramp loss for imbalanced data classification. In: Proceedings of the 2012 IEEE International Conference on Granular Computing. Hangzhou, China, 2012: 160-165.
[29] Frank A, Asuncion A. UCI machine learning repository. Available: http://archive.ics.uci.edu/ml, 2010.
[30] Gurobi Optimization, Inc. Gurobi optimizer reference manual. Available: http://www.gurobi.com, 2012.
