Sample subset optimization for classifying imbalanced biological data

Pengyi Yang1,2,3, Zili Zhang4,5⋆, Bing B. Zhou1,3 and Albert Y. Zomaya1,3

1 School of Information Technologies, University of Sydney, NSW 2006, Australia
2 NICTA, Australian Technology Park, Eveleigh, NSW 2015, Australia
3 Centre for Distributed and High Performance Computing, University of Sydney, NSW 2006, Australia
4 Faculty of Computer and Information Science, Southwest University, CQ 400715, China
5 School of Information Technology, Deakin University, VIC 3217, Australia

[email protected]; [email protected]

⋆ Corresponding author
Abstract. Data in many biological problems are often compounded by an imbalanced class distribution; that is, the positive examples may be largely outnumbered by the negative examples. Many classification algorithms, such as the support vector machine (SVM), are sensitive to imbalanced class distributions and produce suboptimal classifications. It is therefore desirable to compensate for the imbalance during model training. In this study, we propose a sample subset optimization technique for classifying biological data with moderately and extremely imbalanced class distributions. Using this optimization technique with an ensemble of SVMs, we build multiple roughly balanced SVM base classifiers, each trained on an optimized sample subset. The experimental results demonstrate that the ensemble of SVMs created by our sample subset optimization technique achieves a higher area under the ROC curve (AUC) than popular sampling approaches such as random over-/under-sampling and SMOTE sampling, and than widely used ensemble approaches such as bagging and boosting.
1 Introduction
Modern molecular biology is rapidly advanced by the increasing use of computational techniques. For tasks such as RNA gene prediction [1], promoter recognition [2], splice site identification [3], and the classification of protein localization sites [4], it is often necessary to address the problem of imbalanced class distribution, because the datasets extracted from those biological systems are likely to contain a large number of negative examples (referred to as the majority class) and a small number of positive examples (referred to as the minority class). Many popular classification algorithms, such as the support vector machine (SVM), have been applied to a large variety of bioinformatics problems, including those mentioned above (e.g. refs. [1, 3, 4]). However, most of these algorithms are sensitive to the
imbalanced class distribution and may not perform well when directly applied to imbalanced data [5, 6].

Sampling is a popular approach to addressing imbalanced class distributions [7]. Simple methods such as random under-sampling and random over-sampling are routinely applied in many bioinformatics studies [8]. With random under-sampling, the size of the majority class is reduced to compensate for the imbalance, whereas with random over-sampling, the size of the minority class is increased. Although straightforward and computationally efficient, these two methods are prone either to removing informative samples or to increasing noise and duplicated samples [9]. A more sophisticated approach, known as SMOTE, synthesizes "new" samples from the original samples in the dataset [10]. However, many bioinformatics problems present several thousand samples with a highly imbalanced class distribution, and applying SMOTE to them introduces a large number of synthetic samples, which may increase data noise substantially. Alternatively, a cost metric can be specified to force the classifier to pay more attention to the minority class [11]. This requires choosing a correct cost metric, which is often unknown a priori.

Several recent studies found that ensemble learning can improve the performance of a single classifier in imbalanced data classification [6, 12]. In this study, we explore this direction further. In particular, we introduce a sample subset optimization technique for "intelligent under-sampling" in imbalanced data classification. Using this technique, we design an ensemble of SVMs specifically for learning from imbalanced biological datasets. This system has several advantages over conventional ones:

– It creates each base classifier from a roughly balanced training subset using built-in intelligent under-sampling. This is important when learning from imbalanced data because it reduces the risk of a bias towards one class at the expense of the other.
– The system adopts an ensemble framework in which multiple roughly balanced training subsets are created to train an ensemble of classifiers. This reduces the risk of removing informative samples from the majority class, which may occur when a simple under-sampling technique is applied.
– As opposed to random sampling, the sample subset optimization technique identifies optimal sample subsets. This may improve the quality of the base classifiers and result in a more accurate ensemble.
– The aforementioned biological problems often present several thousand training samples. Since the proposed technique is essentially an under-sampling approach, it avoids introducing data noise, and the generated data subsets may be more efficient for classifier training.

The rest of the paper discusses the details of the proposed sample subset optimization technique and the associated ensemble learning system. Section 2 presents the ensemble learning system. Section 3 describes the main idea of sample subset optimization. The base classifier and fitness function of the ensemble system are described in Section 4. Comparisons with typical sampling and ensemble methods are given in Section 5. Section 6 concludes the paper.
2 Ensemble system
Ensemble learning is an effective approach for improving the prediction accuracy of a single classification algorithm. The improvement is commonly achieved by using multiple classifiers (the base classifiers), each trained on a subset of samples created by random sampling, as in bagging [13], or by cost-sensitive sampling, as in boosting [14]. The base classifiers are typically combined using an integration function such as averaging [15] or majority voting [16].
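To make the integration step concrete, a minimal majority-voting combiner might look as follows in Python (our own sketch, not code from the paper; ties are broken in favour of the positive class as an illustrative choice):

import numpy as np

def majority_vote(predictions):
    """Combine binary class predictions (0/1) from L base classifiers.

    predictions: array of shape (L, n_samples), one row per base classifier.
    Returns the majority class per sample; ties go to the positive class.
    """
    predictions = np.asarray(predictions)
    votes = predictions.sum(axis=0)  # number of positive votes per sample
    return (votes * 2 >= predictions.shape[0]).astype(int)

# Example: three base classifiers voting on four test samples.
preds = [[1, 0, 1, 0],
         [1, 1, 0, 0],
         [0, 1, 1, 0]]
print(majority_vote(preds))  # -> [1 1 1 0]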
Fig. 1. A schematic representation of the proposed ensemble system. From a training set with n majority and m minority samples, L optimized, roughly balanced training subsets {m + n1}, ..., {m + nL} are generated; base classifiers c1, ..., cL are trained on them and combined by majority voting to predict the test set, with performance measured by the AUC value.
We propose an ensemble learning system specifically designed for imbalanced biological data classification; a schematic representation is shown in Figure 1. The system has three main components: sample subset optimization, the base classifier, and the fitness function. The key component is the sample subset optimization technique (described in Section 3). Suppose that a highly imbalanced dataset contains n samples from the majority class and m samples from the minority class, where n ≫ m. The system creates each sample subset by including all m minority samples and selecting a subset of the n majority samples according to an internal optimization procedure. This procedure generates multiple optimized sample subsets, each a roughly balanced subset containing the m minority samples and ni carefully selected majority samples, where ni ≪ n (i = 1...L) and L is the total number of optimized sample subsets. Using these optimized sample subsets, we obtain a group of base classifiers ci (i = 1...L),
each trained on its corresponding sample subset {m + ni}. The base classifiers are then combined using majority voting to form an ensemble of classifiers. Algorithm 1 summarizes the procedure; a line starting with "//" in the algorithm is a comment for the line that follows it.
Algorithm 1 sampleSubsetOptimization
Input: Imbalanced dataset DI
Output: Roughly balanced dataset DB
1:  cvSize = 2;
2:  cvSets = crossValidate(DI, cvSize);
3:  for i = 1 to cvSize do
4:      // obtain the internal training samples
5:      D_i^T = getTrain(cvSets, i);
6:      // obtain the internal test samples
7:      D_i^t = getTest(cvSets, i);
8:      // obtain samples of the minority class
9:      D_i^minor = getMinoritySample(D_i^T);
10:     // obtain samples of the majority class
11:     D_i^major = getMajoritySample(D_i^T);
12:     // select a subset of samples from the majority class
13:     D_i^major' = optimizeMajoritySample(D_i^major, D_i^minor, D_i^t);
14:     DB = DB ∪ (D_i^minor ∪ D_i^major');
15: end for
16: return DB;
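For readers who prefer runnable code, the overall procedure of Algorithm 1 might be sketched in Python as follows (our own translation; the minority class is assumed to be labelled 1, and optimize_majority_samples is a placeholder for the PSO procedure of Section 3, assumed to return a boolean mask over the majority samples):

import numpy as np
from sklearn.model_selection import StratifiedKFold

def sample_subset_optimization(X, y, optimize_majority_samples, cv_size=2):
    """Mirror of Algorithm 1: build a roughly balanced dataset from an
    imbalanced one (minority class labelled 1, majority class 0)."""
    balanced_X, balanced_y = [], []
    skf = StratifiedKFold(n_splits=cv_size, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]  # internal training samples
        X_te, y_te = X[test_idx], y[test_idx]    # internal test samples
        minor = X_tr[y_tr == 1]                  # minority class samples
        major = X_tr[y_tr == 0]                  # majority class samples
        # select a subset of the majority class (e.g. via the PSO of Section 3)
        keep = optimize_majority_samples(major, minor, X_te, y_te)
        balanced_X.append(np.vstack([minor, major[keep]]))
        balanced_y.append(np.hstack([np.ones(len(minor)),
                                     np.zeros(keep.sum())]))
    return np.vstack(balanced_X), np.hstack(balanced_y)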
3 Sample subset optimization
The key function in Algorithm 1 is the optimization procedure applied to select a subset of samples from the majority class (Algorithm 1, line 13). The principal idea of sample subset optimization is to use an internal cross-validation procedure to form a subset in which each sample is selected according to the internal classification accuracy. In this section, we describe its formulation using a particle swarm optimization (PSO) algorithm [17] and analyze its behavior on a synthetic dataset. The base classifier and the fitness function used for optimization are discussed in Section 4.

3.1 Formulation of sample subset optimization
We formulate sample subset optimization using a particle swarm optimization algorithm. In particular, each sample from the majority class is assigned a dimension in the particle space. That is, for n majority samples, a particle is coded as an indicator function set $p = \{I_{x_1}, I_{x_2}, ..., I_{x_n}\}$. For each dimension, the indicator function $I_{x_j}$ takes value "1" when the corresponding jth sample $x_j$ is included to train a classifier; a "0" denotes that the sample is excluded from training. By optimizing a population of L particles $p_i$ (i = 1...L), the velocity $v_{i,j}(t)$ of the ith particle and its position $s_{i,j}(t)$ in the jth dimension of the solution space are updated in each iteration t as follows:

$$v_{i,j}(t+1) = w \cdot v_{i,j}(t) + c_1 r_1 \cdot (pbest_{i,j} - s_{i,j}(t)) + c_2 r_2 \cdot (gbest_{i,j} - s_{i,j}(t)) \quad (1)$$

$$s_{i,j}(t+1) = \begin{cases} 0 & \text{if } random() > S(v_{i,j}(t+1)) \\ 1 & \text{if } random() < S(v_{i,j}(t+1)) \end{cases} \quad (2)$$

$$S(v_{i,j}(t+1)) = \frac{1}{1 + e^{-v_{i,j}(t+1)}} \quad (3)$$

where $pbest_{i,j}$ and $gbest_{i,j}$ are the particle's previous best position and the best position found by its informants, respectively; $c_1$ and $c_2$ are the cognitive and social learning coefficients, and $r_1$ and $r_2$ are random numbers; random() is a random number generator with a uniform distribution on [0, 1]. Representing this optimization procedure in pseudocode, we obtain Algorithm 2. Note that the PSO algorithm produces multiple optimized sample subsets in parallel; therefore, by specifying the popSize parameter, we can obtain any number of optimized sample subsets with a single execution of the algorithm.
Algorithm 2 optimizeMajoritySamples
Input: Majority samples D^major, Minority samples D^minor, Internal test samples D^t
Output: Optimized sample subsets D_pi^major' (i = 1...L)
1:  popSize = L;
2:  initiateParticles(D^major, popSize);
3:  for t = 1 to termination do
4:      // go through each particle in the population
5:      for i = 1 to popSize do
6:          // extract the samples according to the indicator function set
7:          D_pi^major' = extractSelectedSamples(p_i, D^major);
8:          D_pi^train = D_pi^major' ∪ D^minor;
9:          // train a classifier using selected majority samples and all minority samples
10:         h_i = trainClassifier(D_pi^train);
11:         // calculate the fitness of the trained classifier using internal test samples
12:         fitness = calculateFitness(h_i, D^t);
13:         // update velocity (Eq. (1)) and position (Eq. (2)) according to the fitness value
14:         v_{i,j}(t) = updateVelocity(v_{i,j}(t), fitness);
15:         s_{i,j}(t) = updatePosition(s_{i,j}(t), fitness);
16:     end for
17: end for
18: return D_pi^major' (i = 1...L)
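For illustration, one iteration of the binary PSO update in Eqs. (1)-(3) might be implemented as follows (our own sketch; the parameter values w, c1, and c2 are typical defaults from the PSO literature, not values reported in the paper):

import numpy as np

rng = np.random.default_rng(0)

def pso_update(v, s, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One binary PSO step: v, s, pbest, gbest are (L, n) arrays,
    one row per particle, one column per majority-class sample."""
    r1 = rng.random(v.shape)
    r2 = rng.random(v.shape)
    # Eq. (1): velocity update from personal and global best positions
    v_new = w * v + c1 * r1 * (pbest - s) + c2 * r2 * (gbest - s)
    # Eq. (3): sigmoid squashing of the velocity
    prob = 1.0 / (1.0 + np.exp(-v_new))
    # Eq. (2): stochastic 0/1 position update (1 = sample included)
    s_new = (rng.random(v.shape) < prob).astype(int)
    return v_new, s_new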
3.2 Analysis of behavior
We analyze the behavior of sample subset optimization using a synthetic imbalanced dataset. Each sample has two features, both generated from the same distribution. Specifically, 20 samples of the majority class are generated from a normal distribution N(5, 1) and 10 samples of the minority class are generated from a normal distribution N(7, 1). In addition, 5 "outlier" samples are introduced to the dataset; they are labeled as majority class but are generated from the normal distribution of the minority class. The class ratio of the data is thus 25:10. Figure 2(a) shows the original dataset and the resulting classification boundary of a linear SVM, and Figure 2(b) shows the dataset after applying sample subset optimization and the resulting classification boundary. Note that this is one of the optimized datasets used to train one base classifier; our ensemble is the aggregation of multiple base classifiers trained on multiple optimized datasets. It is evident that the class ratio is more balanced after optimization (from 25:10 to 15:10). In addition, 3 of the 5 outlier samples are removed, and 7 redundant majority samples, which have limited effect on the decision boundary of the linear SVM classifier, are removed to correct the imbalanced class distribution.
Fig. 2. The green lines are the classification boundaries created using a linear SVM with (a) the original dataset and (b) the dataset after optimization. Both panels plot Feature 1 against Feature 2, with majority and minority samples marked.
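The setup of Figure 2(a) can be reproduced approximately with the sketch below (our own code; scikit-learn's LinearSVC is used as a stand-in for the linear SVM, and the random seed is arbitrary):

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)

# 20 majority samples ~ N(5, 1), 10 minority samples ~ N(7, 1),
# plus 5 "outliers" labelled majority but drawn from the minority distribution
X = np.vstack([rng.normal(5, 1, (20, 2)),
               rng.normal(7, 1, (10, 2)),
               rng.normal(7, 1, (5, 2))])
y = np.hstack([np.zeros(20), np.ones(10), np.zeros(5)])  # class ratio 25:10

clf = LinearSVC(C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print(f"decision boundary: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")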
4 Base classifier and fitness function

We select the SVM as the base classifier for building the ensemble system, as it is routinely applied to many challenging bioinformatics problems. The design of the fitness function is another important facet of sample subset optimization: it determines the quality of the base classifiers and thus the performance of the ensemble. The following subsections describe these two components in detail.
4.1 Base classifier of support vector machine
SVM is a popular classification algorithm that has been widely used in many bioinformatics problems. Among the different kernel choices, a linear SVM with a soft margin is robust for large-scale and high-dimensional dataset classification [18]. Let us denote each sample in the dataset as a vector $x_i$ (i = 1...M), where M is the total number of samples, and let $y_i$ be the class label of sample $x_i$. Each component of $x_i$ is a feature $x_{ij}$ (j = 1...N), interpreted as the jth feature of the ith sample, where N is the dimension of the feature space. In our case, features could be GC content, dinucleotide values, or other biological markers used to characterize each sample. A linear SVM with a soft margin is trained by solving the following optimization problem:

$$\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{M} \xi_i$$

$$\text{subject to: } y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$$

where $w$ is the weight vector, $\xi_i$ are slack variables, and $b$ is the bias. The constant C determines the trade-off between maximizing the margin and minimizing the amount of slack. In this study, we utilize the implementation proposed by Hsieh et al. [19], a fast, large-scale linear SVM that is especially suited as a base classifier for ensemble learning owing to its computational efficiency.

Notice that classifiers are trained both for sample subset optimization and for composing the ensemble. These two procedures are independent of each other; the classifiers trained for sample subset optimization are not the classifiers used in the ensemble. The purpose of the classifiers trained in the sample subset optimization procedure is to provide fitness feedback on the selected samples, whereas the classifiers used for composing the ensemble are trained on the optimized sample subsets and serve as the base classifiers. To maximize the specificity of the feedback, the same classification algorithm, the linear SVM, is used in both procedures.
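As an example, such a base classifier can be trained with scikit-learn's LinearSVC, which wraps the LIBLINEAR library implementing the dual coordinate descent method of [19] (a sketch on toy data; the dataset and parameter values are illustrative only):

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# toy data standing in for an optimized training subset
X, y = make_classification(n_samples=300, n_features=16,
                           weights=[0.6, 0.4], random_state=0)

# soft-margin linear SVM; C trades margin width against slack
base_clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
print("training accuracy:", base_clf.score(X, y))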
4.2 Fitness function
To build a classifier, a subset of samples from the majority class is selected according to an indicator function set $p_i$ (see Section 3.1) and combined with the samples from the minority class to form a training set $D_{train}^{p_i}$. The goodness of an indicator function set can be assessed by the performance of the classifier trained on the samples it specifies. For imbalanced data, one effective way to evaluate classifier performance is the area under the ROC curve (AUC) metric [20]. Hence, we use $AUC(h_i(D_{train}^{p_i}, D_{test}))$ as a component of the fitness function, where $D_{train}^{p_i}$ denotes the training set generated using $p_i$ and $D_{test}$ denotes the test data. The function AUC() calculates the AUC value of a classification model $h_i(D_a, D_b)$ trained on $D_a$ and evaluated on $D_b$.
Moreover, the size of the subset is also important, because a small training set is likely to result in a poorly trained model with poor generalization. Therefore, the fitness function combines the two components:

$$fitness(p_i) = w_1 \cdot AUC(h_i(D_{train}^{p_i}, D_{test})) + w_2 \cdot Size(p_i) \quad (4)$$

where Size() determines the size of the subset specified by $p_i$. The coefficients $w_1$ and $w_2$ are empirical constants that can be adjusted to alter the relative importance of each fitness component. The default values are $w_1 = 0.8$ and $w_2 = 0.2$, as they work well across a range of datasets.
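A direct implementation of Eq. (4) might read as follows (our own sketch; we normalise Size(p_i) to [0, 1] so that the two terms are on a comparable scale, which is an assumption not spelled out in the paper):

import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import LinearSVC

def fitness(indicator, X_major, X_minor, X_test, y_test, w1=0.8, w2=0.2):
    """Eq. (4): weighted sum of test AUC and relative subset size.

    indicator: boolean vector over the majority samples (the particle p_i).
    """
    X_train = np.vstack([X_major[indicator], X_minor])
    y_train = np.hstack([np.zeros(indicator.sum()), np.ones(len(X_minor))])
    clf = LinearSVC(max_iter=10000).fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.decision_function(X_test))
    size = indicator.mean()  # subset size normalised to [0, 1] (our assumption)
    return w1 * auc + w2 * size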
5 Experimental results
In this section, we first describe the four imbalanced biological datasets used in our experiments. They are generated from several important and diverse biological problems and represent different degrees of imbalanced class distribution. We then present the performance of our ensemble algorithm compared with six other algorithms on those datasets.

5.1 Datasets
We evaluated the different algorithms using datasets generated for miRNA identification, classification of protein localization sites, and promoter prediction (Drosophila and human). Specifically, the miRNA identification dataset contains 691 positive samples and 9248 negative samples, each described by 21 features [21]. The protein localization dataset is generated from the study in [22]; we attempt to differentiate membrane proteins (258) from the rest (1226). The human promoter dataset contains 471 promoter sequences and 5131 coding sequences (CDS) and intron sequences. Compared to the human promoter dataset, the Drosophila promoter dataset has a relatively balanced class distribution, with 1936 promoter sequences and 2722 CDS and intron sequences. We calculated the 16 dinucleotide features according to [23]. The datasets are summarized and ordered by class ratio in Table 1.

Table 1. Summary of biological datasets used for evaluation.

Dataset (short name)            # Samples  # Features  Minority vs. Majority
drosophila promoter (DroProm)   6594       16          0.4156 (≈ 1:2.5)
protein localization (ProtLoc)  1484       8           0.2104 (≈ 1:5)
human promoter (HuProm)         5602       16          0.0918 (≈ 1:10)
miRNA identification (miRNA)    9939       21          0.0747 (≈ 1:13)
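For reference, dinucleotide features of the kind used for the promoter datasets can be computed as normalised dinucleotide frequencies, roughly as follows (our own sketch; see [23] for the exact feature definition used in the paper):

from itertools import product

def dinucleotide_features(seq):
    """Return the 16 dinucleotide frequencies of a DNA sequence."""
    seq = seq.upper()
    dinucs = ["".join(p) for p in product("ACGT", repeat=2)]
    total = max(len(seq) - 1, 1)
    counts = {d: 0 for d in dinucs}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    return [counts[d] / total for d in dinucs]

print(dinucleotide_features("ACGTACGT"))  # 16 values summing to ~1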
5.2 Performance comparison
The performance of a single SVM classifier was used as the baseline for all datasets. We compared the single-classifier approaches, including random under-sampling with SVM (RUS-SVM), random over-sampling with SVM (ROS-SVM), and SMOTE sampling with SVM (SMOTE-SVM), and the ensemble approaches, including boosting with SVM base classifiers (Boost-SVMs), bagging with SVM base classifiers (Bag-SVMs), and our sample subset optimization technique with SVM (SSO-SVMs).
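For orientation, the three sampling baselines can be reproduced with the imbalanced-learn package along these lines (our own sketch on synthetic data; the original experiments did not necessarily use this library):

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# toy imbalanced data standing in for one of the datasets in Table 1
X_train, y_train = make_classification(n_samples=2000, n_features=16,
                                       weights=[0.9, 0.1], random_state=0)

samplers = {
    "RUS-SVM": RandomUnderSampler(random_state=0),
    "ROS-SVM": RandomOverSampler(random_state=0),
    "SMOTE-SVM": SMOTE(random_state=0),
}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X_train, y_train)  # rebalance the training set
    clf = LinearSVC(max_iter=10000).fit(X_res, y_res)
    print(name, "trained on", len(y_res), "samples")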
Fig. 3. The comparison of different algorithms on (a) drosophila promoter, (b) protein localization, (c) human promoter, and (d) miRNA identification. The x-axis denotes the ensemble size (number of base classifiers) and the y-axis denotes the AUC value. For algorithms that use a single classifier, the same AUC value is plotted across ensemble sizes for comparison.
For the ensemble methods, we tested ensemble sizes from 10 to 100 in steps of 10. A 5-fold cross-validation procedure was applied to partition the datasets for training and testing, and each algorithm was tested on the same partition
to reduce evaluation variance. Among the six tested algorithms, four employ a randomization procedure: RUS-SVM, ROS-SVM, Bag-SVMs, and SSO-SVMs (the Boost-SVMs algorithm uses the reweighting implementation and is deterministic). For those with a randomization procedure, we repeated the test 10 times, each time with a different random seed.

Figure 3 shows the comparison of results. In most cases the ensemble approaches give higher AUC values than the single-classifier approaches. Among the single-classifier approaches, random under-sampling, random over-sampling, and SMOTE sampling do improve the classification results when the analyzed dataset has a highly imbalanced class distribution, as in Figure 3(b)(c)(d). However, the improvements become less significant when the imbalance is moderate (drosophila promoter dataset, Figure 3(a)). SMOTE sampling performs better than random under-sampling and over-sampling in the case of protein localization (Figure 3(b)), but its performance gain is marginal on the other three datasets (Figure 3(a)(c)(d)). We do not observe a significant performance difference between random under-sampling and random over-sampling, except in the case of miRNA identification (Figure 3(d)), where random over-sampling is relatively better.

Among the ensemble approaches, Boost-SVMs surprisingly performs worse than the other two in most cases, and its performance fluctuates across ensemble sizes. This may be caused by its training process: the boosting algorithm assigns increasingly more classification weight to the most "difficult" samples in each iteration. However, those "difficult" samples could be outliers, with a deleterious effect when the classifiers pay too much attention to classifying them while ignoring other, more representative samples. In this regard, Bag-SVMs and SSO-SVMs appear to be the better approaches. SSO-SVMs almost always performs best and exhibits a much smaller performance variance when different random seeds are used. It is likely that SSO-SVMs captures the most representative samples from the training set, which gives better generalization on unseen data. We also observe that the improvement is more significant when the dataset has a highly imbalanced class distribution (Figure 3(b)(c)(d)).
Table 2. The comparison of different algorithms for data classification according to AUC value. The values for the ensemble approaches are averaged across the different ensemble sizes.

Algorithm    DroProm  ProtLoc  HuProm  miRNA
Single-SVM   0.6584   0.8296   0.5740  0.7542
RUS-SVM      0.6584   0.8850   0.6016  0.7644
ROS-SVM      0.6555   0.8866   0.5986  0.8114
SMOTE-SVM    0.6400   0.8976   0.5961  0.7924
Boost-SVMs   0.7756   0.8852   0.6644  0.8891
Bag-SVMs     0.8507   0.8671   0.7264  0.9198
SSO-SVMs     0.8520   0.9098   0.7718  0.9419
Table 3. P-values from a one-tailed Student's t-test comparing the performance differences.

Algorithm                  DroProm     ProtLoc     HuProm      miRNA
SSO-SVMs vs. Single-SVM    2 × 10^-15  4 × 10^-18  1 × 10^-11  1 × 10^-14
SSO-SVMs vs. RUS-SVM       2 × 10^-15  1 × 10^-13  4 × 10^-11  2 × 10^-14
SSO-SVMs vs. ROS-SVM       2 × 10^-15  2 × 10^-13  4 × 10^-11  3 × 10^-13
SSO-SVMs vs. SMOTE-SVM     8 × 10^-16  8 × 10^-11  3 × 10^-11  9 × 10^-14
SSO-SVMs vs. Boost-SVMs    2 × 10^-8   8 × 10^-7   7 × 10^-6   2 × 10^-5
SSO-SVMs vs. Bag-SVMs      6 × 10^-4   7 × 10^-11  1 × 10^-6   2 × 10^-3
Table 2 shows the AUC values of both the single-classifier and ensemble approaches; for the ensemble approaches, the AUC value is the average over ensemble sizes from 10 to 100. The proposed SSO-SVMs performs best on all four tested datasets, amounting to improvements of 10%-20% over the baseline of a single SVM. To confirm that the improvements are statistically significant, we applied a one-tailed Student's t-test comparing SSO-SVMs with the other six methods. Table 3 shows the resulting p-values. On all four datasets, the performance of SSO-SVMs is significantly better than the other six methods, with p-values far smaller than 0.05. This confirms the effectiveness of the proposed ensemble approach.
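Such a comparison can be reproduced with SciPy's one-sided two-sample t-test, for example (our own sketch over hypothetical AUC arrays; the alternative argument requires SciPy 1.6 or later):

import numpy as np
from scipy import stats

# hypothetical repeated-run AUC values for two methods
auc_sso = np.array([0.94, 0.95, 0.93, 0.94, 0.95])
auc_bag = np.array([0.91, 0.92, 0.92, 0.91, 0.93])

# one-tailed test: is SSO-SVMs' mean AUC greater than Bag-SVMs'?
t_stat, p_value = stats.ttest_ind(auc_sso, auc_bag, alternative="greater")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")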
6 Conclusion
In this paper we introduced a sample subset optimization technique for selecting optimal sample subsets from training data. We integrated this technique into an ensemble learning framework and created an ensemble of SVMs specifically for imbalanced biological data classification. The proposed algorithm was applied to several bioinformatics tasks with moderately and highly imbalanced class distributions. According to our experimental results, (1) data-sampling approaches for a single SVM are generally less effective than ensemble approaches, and (2) the proposed sample subset optimization technique is very effective: the ensemble optimized by this technique produced the best classification results in terms of AUC on all evaluation datasets.
References

1. Meyer, I.: A practical guide to the art of RNA gene prediction. Briefings in Bioinformatics 8(6) (2007) 396–414
2. Zeng, J., Zhu, S., Yan, H.: Towards accurate human promoter recognition: a review of currently used sequence features and classification methods. Briefings in Bioinformatics 10(5) (2009) 498–508
3. Sonnenburg, S., Schweikert, G., Philips, P., Behr, J., Rätsch, G.: Accurate splice site prediction using support vector machines. BMC Bioinformatics 8(Suppl 10) (2007) S7
4. Hua, S., Sun, Z.: Support vector machine approach for protein subcellular localization prediction. Bioinformatics 17(8) (2001) 721–728
5. Akbani, R., Kwek, S., Japkowicz, N.: Applying support vector machines to imbalanced datasets. In: Proceedings of the 15th European Conference on Machine Learning. (2004) 39–50
6. Liu, Y., An, A., Huang, X.: Boosting prediction accuracy on imbalanced datasets with SVM ensembles. In: Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining. (2006) 107–118
7. Japkowicz, N., Stephen, S.: The class imbalance problem: A systematic study. Intelligent Data Analysis 6(5) (2002) 429–449
8. Batuwita, R., Palade, V.: A new performance measure for class imbalance learning. Application to bioinformatics problems. In: 2009 International Conference on Machine Learning and Applications, IEEE (2009) 545–550
9. Chawla, N., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter 6 (2004) 1–6
10. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16(1) (2002) 321–357
11. Weiss, G.: Mining with rarity: a unifying framework. ACM SIGKDD Explorations Newsletter 6(1) (2004) 7–19
12. Hido, S., Kashima, H., Takahashi, Y.: Roughly balanced bagging for imbalanced data. Statistical Analysis and Data Mining 2(5-6) (2009) 412–426
13. Breiman, L.: Bagging predictors. Machine Learning 24(2) (1996) 123–140
14. Schapire, R., Freund, Y., Bartlett, P., Lee, W.: Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics 26(5) (1998) 1651–1686
15. Tax, D., Van Breukelen, M., Duin, R.: Combining multiple classifiers by averaging or by multiplying? Pattern Recognition 33(9) (2000) 1475–1485
16. Lam, L., Suen, S.: Application of majority voting to pattern recognition: an analysis of its behavior and performance. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans 27(5) (1997) 553–568
17. Poli, R., Kennedy, J., Blackwell, T.: Particle swarm optimization. Swarm Intelligence 1(1) (2007) 33–57
18. Ben-Hur, A., Ong, C., Sonnenburg, S., Schölkopf, B., Rätsch, G.: Support vector machines and kernels for computational biology. PLoS Computational Biology 4(10) (2008)
19. Hsieh, C., Chang, K., Lin, C., Keerthi, S., Sundararajan, S.: A dual coordinate descent method for large-scale linear SVM. In: Proceedings of the 25th International Conference on Machine Learning, ACM (2008) 408–415
20. Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters 27(8) (2006) 861–874
21. Batuwita, R., Palade, V.: microPred: effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics 25(8) (2009) 989–995
22. Horton, P., Nakai, K.: A probabilistic classification system for predicting the cellular localization sites of proteins. In: Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology, AAAI Press (1996) 109–115
23. Rani, T., Bhavani, S., Bapi, R.: Analysis of E. coli promoter recognition problem in dinucleotide feature space. Bioinformatics 23(5) (2007) 582–588