Intelligent Data Analysis 19 (2015) 779–794, DOI 10.3233/IDA-150745, IOS Press
1088-467X/15/$35.00 © 2015 – IOS Press and the authors. All rights reserved.


Instance-based ensemble pruning for imbalanced learning

Weimei Zhi a,*, Huaping Guo b, Ming Fan a and Yangdong Ye a

a College of Information Engineering, Zhengzhou University, Zhengzhou, Henan, China
b College of Computer and Information Technology, Xinyang Normal University, Xinyang, Henan, China

* Corresponding author: Weimei Zhi, School of Information Engineering, Zhengzhou University, Zhengzhou, Henan 450052, China. Tel.: +86 13937101614; E-mail: [email protected].

Abstract. Class imbalance is very common in the real world, but traditional state-of-the-art classifiers do not work well on imbalanced data sets because of the skewed class distribution. This paper considers imbalanced learning from the viewpoint of ensemble pruning and proposes a novel approach called IBEP (Instance-Based Ensemble Pruning) to improve classifier performance on such data sets. Unlike traditional approaches, which address the imbalance problem in the training stage, IBEP focuses on the prediction stage. Given an unlabeled instance, IBEP searches for its k nearest neighbors to serve as the corresponding pruning set and adopts an ensemble pruning strategy that selects, based on this pruning set, a subset of ensemble members to form the sub-ensemble used to predict the instance. In this way, IBEP pays more attention to the rare class and achieves better performance on imbalanced data sets. In addition, two widely used sampling techniques, under-sampling and SMOTE, are combined with IBEP to further improve its performance. Experimental results on 14 data sets show that IBEP performs significantly better than many state-of-the-art classification methods on all metrics used in this paper, including recall, f-measure and g-mean.

Keywords: Imbalanced data sets, ensemble, ensemble pruning, KNN, base classifier

1. Introduction

The class-imbalance problem is also referred to as an imbalanced, skewed or rare class problem. For two-class problems, an imbalanced class distribution is characterized by one class (the majority or prevalent class) having many more instances than the other (the minority or rare class) [1]. In many applications, the correct identification of rare-class instances is more valuable than the contrary case. For example, in cancer detection most patients suffer from common diseases and only a few have cancer, so effectively detecting the cancer patients is what matters most. Conventional classification methods, such as C4.5, naive Bayes and neural networks, pursue high accuracy under the assumption that all classes are of similar size, so rare-class instances are often overlooked and misclassified into the majority class. Consequently, accuracy is not a suitable evaluation metric when there is class imbalance; instead, recall, f-measure and g-mean are more appropriate evaluation metrics for the class-imbalance problem [2].

Reported solutions to the imbalance problem can be categorized into data-level and algorithm-level approaches. At the data level, the objective is to re-balance the class distribution by resampling the data space.


Commonly used resampling methods include randomly over-sampling instances of the rare class, randomly under-sampling instances of the prevalent class, informatively under-sampling the prevalent class (EasyEnsemble and BalanceCascade [3]), over-sampling the rare class by generating new synthetic instances (SMOTE) [4], and cluster-based over-sampling (CBO) [5]. Resampling is a widely used way of dealing with the imbalance problem, but the optimal class distribution is always unknown and differs from data set to data set. At the algorithm level, solutions try to adapt existing classifier learning algorithms so that they are biased towards the rare class; examples include two-phase rule induction [6], cost-sensitive learning [7] and one-class learning.

Different from the above, we re-consider the imbalance problem from the novel viewpoint of ensemble pruning and propose a new method called IBEP (Instance-Based Ensemble Pruning) to tackle it. The idea is inspired by experiments we conducted on ensemble learning, which demonstrate that, due to the random disturbance introduced by bootstrap sampling, some base classifiers perform rather well in certain local areas. To exploit this locality, IBEP applies the idea of KNN to the ensemble pruning process: it searches for the k nearest neighbors of an unlabeled instance and uses them as the corresponding pruning set. Based on this pruning set, IBEP selects a subset of ensemble members to form the sub-ensemble that predicts the instance. In this way, IBEP pays more attention to the rare class and achieves better performance on the imbalance problem. The main contributions of this paper are as follows:

– Applying ensemble pruning to imbalanced learning to improve ensemble performance;
– Considering the local characteristics of the rare class and proposing IBEP (Instance-Based Ensemble Pruning) for imbalanced learning to improve performance on the rare class;
– Combining sampling techniques with IBEP to further improve performance on the rare class.

The experimental results on 14 UCI data sets show that IBEP can boost the recall, f-measure and g-mean of Bagging. Compared with other state-of-the-art classification methods, IBEP shows a clear advantage. The experimental results indicate that applying ensemble pruning to the imbalance problem in a reasonable way can effectively boost the generalization performance of classification, and they also show that sampling techniques do improve the performance of the imbalanced learning method.

The rest of the paper is organized as follows. Following the introduction, Section 2 reviews related work, including imbalanced learning, ensemble learning, ensemble pruning and sampling techniques. Section 3 introduces the idea of instance-based ensemble pruning. Section 4 describes the evaluation metrics used for imbalanced learning. Section 5 reports the experimental results of parameter learning, IBEP's performance and IBEP with sampling. Finally, Section 6 concludes this work.

2. Related work

2.1. Imbalanced learning

Technically speaking, a data set that exhibits an unequal distribution between its classes can be considered imbalanced (skewed). There are two kinds of imbalance in a data set: between-class imbalance and within-class imbalance. For the former, one class severely out-represents another. For the latter, a class may additionally contain some sub-concepts with limited instances, amounting to diverging degrees of classification difficulty [1]. Usually, the first kind of imbalance is the one discussed.

Many factors influence the modeling of a capable classifier when facing rare events. Examples include the skewed data distribution, which is considered to be the most influential factor, small sample size, separability and the existence of within-class sub-concepts [7].

The imbalance degree of a class distribution can be denoted by the ratio of the sample size of the rare class to that of the prevalent class. Reported studies indicate that a relatively balanced distribution usually attains a better result.


However, to what imbalance degree the class distribution deteriorates classification performance cannot be stated explicitly, since other factors such as sample size and separability also affect performance [7]. Small sample size means that the available sample is limited, so uncovering the regularities inherent in the small class becomes unreliable. In [8], the authors suggest that an imbalanced class distribution may not be a hindrance to classification provided the data set is large enough. The difficulty of separating the rare class from the prevalent class is the key issue of the imbalance problem: if highly discriminative patterns exist within each class, then only simple rules are required to distinguish the classes; however, if the patterns of the classes overlap, discriminative rules are hard to induce [7]. Within-class concepts mean that a single class is composed of various sub-clusters or sub-concepts, so the instances of a class are collected from different sub-concepts, which do not always contain the same number of instances. The presence of within-class concepts worsens the imbalance problem [7]. In general, imbalanced learning only considers the imbalanced data distribution and fixes the other factors.

2.2. Ensemble

An ensemble E consists of a set of individually trained classifiers (such as naive Bayes classifiers and decision trees) whose predictions are combined when classifying unlabeled instances. It is well accepted that an ensemble usually generalizes better than an individual classifier alone. Both theory and experiments have demonstrated that a good ensemble is made up of classifiers that are both accurate and diverse [9]. Many approaches have been proposed to create accurate and diverse ensemble members; examples include Bagging [10], Boosting [11], Random Forest [12], Rotation Forest [13], EasyEnsemble and BalanceCascade [3]. Bagging is a bootstrap ensemble method that constructs each base classifier by training it on a randomly sampled data set. Boosting produces each new classifier on a re-distributed training set. Random Forest and Rotation Forest obtain base classifiers by manipulating the input features to map the training set into different feature spaces. EasyEnsemble and BalanceCascade, which are designed for imbalanced problems, are combined with sampling techniques to create accurate and diverse classifiers.
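As a concrete illustration of the bootstrap construction used by Bagging (and later reused by IBEP to build its classifier library), the following is a minimal Python sketch; the use of scikit-learn's DecisionTreeClassifier as a stand-in for C4.5 and the parameter names are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging(X, y, n_estimators=200, random_state=0):
    """Train an ensemble in which each member sees one bootstrap sample of (X, y)."""
    rng = np.random.RandomState(random_state)
    n = len(X)
    ensemble = []
    for _ in range(n_estimators):
        idx = rng.randint(0, n, size=n)              # sample n instances with replacement
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def vote(ensemble, x):
    """Combine member predictions on a single instance by unweighted majority voting."""
    preds = np.array([clf.predict(x.reshape(1, -1))[0] for clf in ensemble])
    labels, counts = np.unique(preds, return_counts=True)
    return labels[np.argmax(counts)]
```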
2.3. Ensemble pruning

Although an ensemble is usually more accurate than a single classifier, existing ensemble methods often tend to construct unnecessarily large ensembles, which increases memory consumption and computational cost. Ensemble pruning, also called ensemble selection or ensemble thinning, tackles this problem by selecting a subset of ensemble members to form a sub-ensemble with lower resource consumption and response time, and with accuracy similar to or better than that of the original ensemble [14,15].

Given an ensemble with N members, finding the subset of ensemble members with the best performance involves searching the space of 2^N − 1 non-empty sub-ensembles, which is proved to be an NP-complete problem [16]. Therefore, searching for the best sub-ensemble by enumerating all candidates is computationally impracticable. Two kinds of efficient ensemble pruning approaches based on greedy search have been proposed [17–21]: forward ensemble pruning and backward ensemble pruning. For the former, given a sub-ensemble S initialized to be empty, the search iterates over the space of candidate combinations by repeatedly adding to S the classifier h ∈ H \ S that optimizes a cost function. For the latter, given a sub-ensemble S initialized to be full, the search iterates by repeatedly removing from S the classifier h ∈ S that optimizes a cost function.


Let h be a classifier and H be an ensemble. Partalas et al. [15] identify that the predictions of h and H on an instance x_i can be categorized into four cases:

(1) e_tf: h(x_i) = y_i ∧ H(x_i) ≠ y_i,
(2) e_tt: h(x_i) = y_i ∧ H(x_i) = y_i,
(3) e_ft: h(x_i) ≠ y_i ∧ H(x_i) = y_i,
(4) e_ff: h(x_i) ≠ y_i ∧ H(x_i) ≠ y_i.

They conclude that considering all four cases is crucial for designing ensemble diversity metrics [15]. Many diversity metrics are designed by considering some or all of the four cases, for example, complementariness [22] and concurrency [18]. The complementariness of h with respect to H and a pruning set D_p is calculated as

COM(h, H) = Σ_{x_i ∈ D_p} I(x_i ∈ e_tf),    (1)

where I(true) = 1 and I(false) = 0. The complementariness is exactly the number of instances that are correctly classified by h and incorrectly classified by H. The concurrency is defined as

CON(h, H) = Σ_{x_i ∈ D_p} (2 I(x_i ∈ e_tf) + I(x_i ∈ e_tt) − 2 I(x_i ∈ e_ff)),    (2)

which is similar to the complementariness, with the difference that it considers two more cases and weights them. Unlike complementariness and concurrency, Partalas et al. [22] introduce a metric called UWA (Uncertainty Weighted Accuracy) which considers all four cases given above. UWA is defined as

UWA(h, H) = Σ_{x_i ∈ D_p} (NT_i × I(x_i ∈ e_tf) + NF_i × I(x_i ∈ e_tt) − NF_i × I(x_i ∈ e_ft) − NT_i × I(x_i ∈ e_ff)),    (3)

where NT_i is the proportion of classifiers in the current ensemble S that predict x_i correctly, and NF_i = 1 − NT_i is the proportion of classifiers in S that predict x_i incorrectly. In addition to considering all four cases, UWA takes into account the strength of the decision of the current ensemble.
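To make Eqs. (1)–(3) concrete, the sketch below computes the three metrics from prediction vectors over the pruning set; the boolean-mask representation, the function names and the nt argument (the per-instance proportion NT_i) are assumptions made for readability rather than the authors' code.

```python
import numpy as np

def case_masks(h_pred, H_pred, y_true):
    """Boolean masks of the four cases e_tf, e_tt, e_ft, e_ff over the pruning set."""
    h_ok, H_ok = (h_pred == y_true), (H_pred == y_true)
    return h_ok & ~H_ok, h_ok & H_ok, ~h_ok & H_ok, ~h_ok & ~H_ok

def complementariness(h_pred, H_pred, y_true):          # Eq. (1)
    e_tf, _, _, _ = case_masks(h_pred, H_pred, y_true)
    return int(np.sum(e_tf))

def concurrency(h_pred, H_pred, y_true):                # Eq. (2)
    e_tf, e_tt, _, e_ff = case_masks(h_pred, H_pred, y_true)
    return int(np.sum(2 * e_tf + e_tt - 2 * e_ff))

def uwa(h_pred, H_pred, y_true, nt):                    # Eq. (3)
    """nt[i] is the proportion of current sub-ensemble members that classify x_i correctly."""
    e_tf, e_tt, e_ft, e_ff = case_masks(h_pred, H_pred, y_true)
    nf = 1.0 - nt
    return float(np.sum(nt * e_tf + nf * e_tt - nf * e_ft - nt * e_ff))
```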


Different from [22], Zhi et al. [20] apply energy-based learning to ensemble pruning and design a new metric (loss function) called EBM (Energy-based Metric) to supervise the pruning process. An energy-based model [23] associates each configuration (x, y) with a scalar energy E(w, x, y), where w is a parameter vector. This energy can be viewed as a measure of the compatibility between x and y. Inference searches for the answer y that minimizes the energy: given an input instance x_i, the model must choose the value y, from the set Ω, for which E(w, x_i, y) is smallest. Based on this view, the pruning set D_p is divided into two sets: (1) e_t = {x_i | x_i ∈ D_p ∧ H(x_i) = y_i} and (2) e_f = {x_i | x_i ∈ D_p ∧ H(x_i) ≠ y_i}. EBM is then defined as

EBM(h, H) = Σ_{x_i ∈ D_p} L(w, x_i, y_i),    (4)

where

L(w, x_i, y_i) = (I(x_i ∈ e_f) − I(x_i ∈ e_t)) / (|E(w, x_i, y_i) − E(w, x_i, ȳ) + m| + 1),    (5)

and

E(w, x_i, y) = − Σ_t w_t δ(h_t(x_i) = y),    (6)

with the margin condition E(w, x_i, y_i) < E(w, x_i, ȳ) − m, where ȳ = arg min_{y ∈ Ω, y ≠ y_i} E(w, x_i, y), m is an adjustment parameter, and the constant 1 prevents the denominator from being too small or zero. In fact, |E(w, x_i, y_i) − E(w, x_i, ȳ) + m| is exactly the distance between the energy margin and the energy of correctly (incorrectly) classifying x_i by H; it reflects the confidence with which x_i is correctly (incorrectly) classified by H. If |E(w, x_i, y_i) − E(w, x_i, ȳ) + m| is too small or zero, x_i lies on the margin at which the ensemble H correctly (incorrectly) classifies it, and changing the weight of a single classifier to 1 may change the ensemble prediction on that instance. In this way, EBM considers the strength of the current ensemble decision on each instance. Experimental results show that, compared with other state-of-the-art ensemble pruning methods, the EBM-based method has significantly better generalization capability on most data sets [20]. We therefore use EBM as the metric to supervise the ensemble pruning procedure in this paper.
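The following is a minimal sketch of the EBM computation in Eqs. (4)–(6), assuming that all classifier weights w_t equal 1 and that the predictions of the candidate sub-ensemble (the current selection plus the classifier being evaluated) on the pruning set are available as a 2-D array; the function signatures are illustrative. Lower values indicate a better candidate, which matches the greedy search in Algorithm 1 (Section 3), where the classifier minimizing Eq. (4) is added at each step.

```python
import numpy as np

def energy(member_preds_i, y):
    """E(w, x_i, y) = -sum_t w_t * delta(h_t(x_i) = y), with every weight w_t set to 1."""
    return -int(np.sum(member_preds_i == y))

def ebm(member_preds, y_true, labels, m=1.0):
    """EBM of a candidate sub-ensemble over the pruning set, Eq. (4).
    member_preds[t, i] is the prediction of member t on pruning instance x_i."""
    total = 0.0
    for i, yi in enumerate(y_true):
        preds_i = member_preds[:, i]
        e_true = energy(preds_i, yi)
        # most competitive wrong label: ybar = argmin_{y != yi} E(w, x_i, y)
        e_bar = min(energy(preds_i, y) for y in labels if y != yi)
        correct = e_true < e_bar                 # x_i falls in e_t, otherwise in e_f
        numerator = -1.0 if correct else 1.0     # I(x_i in e_f) - I(x_i in e_t), Eq. (5)
        total += numerator / (abs(e_true - e_bar + m) + 1.0)
    return total
```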

2.4. Sampling technique

Sampling techniques such as under-sampling and SMOTE, which have been demonstrated to be effective for imbalanced learning [24,25], are an important way of dealing with the classification of imbalanced data sets. Under-sampling tries to balance an imbalanced data set by randomly sampling a subset of the majority instances. Reported studies have demonstrated that random under-sampling is more effective than random over-sampling for imbalanced learning [24,25]. Several under-sampling methods have been proposed to further improve the performance of imbalanced learning, including Tomek's links [26], the Condensed Nearest Neighbor rule (CNN) [27], one-sided selection [28] and the Neighborhood Cleaning rule (NCL) [29].

SMOTE (Synthetic Minority Over-sampling Technique) generates new synthetic instances along the line segments between minority instances and their selected nearest neighbors. These synthetic instances help break the ties introduced by simple over-sampling and augment the original data set in a manner that generally improves learning significantly. SMOTE has proved to be a useful method for imbalanced learning and has since been applied to many algorithms; for example, in [30], Chawla et al. integrate SMOTE into a standard boosting procedure to improve the prediction of the rare class.

There are other studies on sampling techniques for imbalanced learning. Batista et al. [31] combine over-sampling and under-sampling methods to resolve the imbalance problem. Estabrooks et al. [32] propose a multiple resampling method which selects the most appropriate resampling rate adaptively. Jo et al. [5] put forward a cluster-based over-sampling method which deals with between-class imbalance and within-class imbalance simultaneously.

In order to further improve the performance of IBEP, we combine sampling techniques with IBEP. Due to limited space, only two widely used sampling techniques are selected: random under-sampling and SMOTE. The corresponding experimental results are presented in Section 5.5.
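For concreteness, the sketch below shows simplified versions of the two sampling techniques that are later combined with IBEP: random under-sampling of the majority class and SMOTE-style generation of synthetic minority instances. The interpolation follows the usual SMOTE description, but details such as the neighbor count k and how the sampling ratio sr is mapped to the number of synthetic instances are assumptions.

```python
import numpy as np

def random_undersample(X_majority, n_keep, rng=None):
    """Randomly keep n_keep majority instances (e.g., as many as there are minority instances)."""
    rng = rng or np.random.RandomState(0)
    idx = rng.choice(len(X_majority), size=n_keep, replace=False)
    return X_majority[idx]

def smote(X_minority, n_synthetic, k=5, rng=None):
    """Create synthetic minority instances on segments between an instance and one of its
    k nearest minority-class neighbors (Euclidean distance)."""
    rng = rng or np.random.RandomState(0)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randint(len(X_minority))
        dist = np.linalg.norm(X_minority - X_minority[i], axis=1)
        neighbors = np.argsort(dist)[1:k + 1]        # skip the instance itself
        j = rng.choice(neighbors)
        gap = rng.rand()                             # interpolation factor in [0, 1)
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)
```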


3. Instance-based ensemble pruning

In this paper, we apply ensemble pruning to imbalanced learning. The idea is inspired by experiments we conducted on ensemble learning, which demonstrate that, due to the random disturbance introduced by bootstrap sampling, some base classifiers perform rather well in certain local areas. By considering the local characteristics of the rare class, we propose an ensemble pruning approach called IBEP (instance-based ensemble pruning) to deal with the imbalance problem. IBEP first uses bootstrap sampling to construct a classifier library in which there always exist some base classifiers that perform rather well in certain local areas. For prediction, IBEP searches for the k nearest neighbors of an unlabeled instance x, treats them as the corresponding pruning set D_p, and selects a subset of ensemble members, based on D_p, to form the sub-ensemble that predicts x.

Algorithm 1. Instance-Based Ensemble Pruning

The training stage:
Input:
  D: training data set
  ensLearn: ensemble learning algorithm (for example, Bagging)
  N: size of ensemble H
Output: ensemble H
Process:
  1. H = ensLearn(D, N);
  2. return H;

The testing stage:
Input:
  H: ensemble obtained in the training stage
  x: unlabeled instance
  k: the number of nearest neighbors of x
  p: percentage of H's size
Output: the label of x
Process:
  1. Dp = the k nearest neighbors of x
  2. S = ∅
  3. while |S| < p ∗ |H| do
  4.   search for the classifier h ∈ H \ S that minimizes Eq. (4)
  5.   S = S ∪ {h}
  6. end while
  7. y = S(x)
  8. return y

Algorithm 1 outlines the details of IBEP, where only forward ensemble pruning is shown for clarity. In the training stage, IBEP uses bootstrap sampling to learn an ensemble H with N base classifiers on the training data set D. In the testing stage, given an unlabeled instance x, IBEP first searches for the k nearest neighbors of x and treats them as the pruning set D_p (line 1).


Then IBEP selects the sub-ensemble S with p ∗ |H| base classifiers from H using D_p (lines 3–6), predicts the class label of x using S, and returns the label (lines 7–8). k and p are two hyper-parameters, where k is the number of nearest neighbors of x and p is the percentage of H's size; their best values are learned experimentally (see Section 5.3).

One issue in Algorithm 1 is how to measure the distance between two instances (line 1 of the testing stage). There are many distance measures, such as Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, standardized Euclidean distance, Mahalanobis distance, cosine distance, Hamming distance, Jaccard similarity coefficient, correlation coefficient and correlation distance. Most KNN classifiers use the simple Euclidean distance to measure the dissimilarities between instances represented as vector inputs [33]. Our experimental results also show that the Euclidean distance is robust, so we use it as the distance metric to calculate the nearest neighbors of an instance in this paper.

How to prune the ensemble is another important problem in Algorithm 1. As mentioned in Section 2.3, given an ensemble with N members, finding the subset of ensemble members with the best performance involves searching the space of 2^N − 1 non-empty sub-ensembles. To speed up the pruning process, we employ a greedy search policy to select an optimal or suboptimal sub-ensemble: initialize S to be empty (or full), then search the space of candidate combinations by iteratively adding to (or removing from) S the classifier h ∈ H \ S (or h ∈ S) that optimizes a given metric, such as the energy-based metric (EBM) [20]. For clarity, only forward ensemble pruning is shown in Algorithm 1. In Section 5.4, we provide experimental results for both forward and backward ensemble pruning.

The running time of IBEP consists of the training stage and the testing stage. In the training stage, we construct the base classifier library produced by Bagging. If the cost of constructing a single base classifier is O(T(|D|)), the time complexity of the training stage is O(N · T(|D|)), where N is the number of classifiers and |D| is the size of the training data set. The cost of the testing stage includes the time to calculate the k nearest neighbors of the unlabeled instance and the time to search for the sub-ensemble S. Calculating the k nearest neighbors costs O(|D| · log k). The search for S is carried out in lines 2–6; the loop in line 3 executes at most N times, since 0 ≤ p ≤ 1. When searching for S, we need to compute Eqs (4) and (5). Because the sum in Eq. (5) can be updated incrementally, we assume the time complexity of computing Eq. (5) is O(1), so computing Eq. (4) costs O(|D_p|), where |D_p| = k. According to the IBEP algorithm, Eq. (4) is computed at most N times per iteration, so the cost of line 3 is not more than O(k · N). Based on this analysis, the complexity of searching for S is O(k · N²).
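Putting the pieces together, the following is a minimal sketch of the IBEP testing stage in Algorithm 1 (forward pruning only), assuming the base-classifier library has already been trained (for example, by Bagging) and that a selection metric such as the EBM function sketched in Section 2.3 is supplied; the metric interface and function names are illustrative. With k = 7 and p = 0.15 over a 200-member library, it selects 30 members per test instance, which matches the setting used in the experiments below.

```python
import numpy as np

def knn_indices(X_train, x, k):
    """Indices of the k nearest training instances to x under Euclidean distance (line 1)."""
    dist = np.linalg.norm(X_train - x, axis=1)
    return np.argsort(dist)[:k]

def ibep_predict(x, library, X_train, y_train, metric, k=7, p=0.15):
    """Predict the label of x with a sub-ensemble selected on x's own pruning set (lines 1-8)."""
    nn = knn_indices(X_train, x, k)
    Xp, yp = X_train[nn], y_train[nn]               # instance-specific pruning set D_p
    selected, remaining = [], list(library)
    target = max(1, int(p * len(library)))
    while len(selected) < target:                   # lines 3-6: greedy forward selection
        best = min(remaining, key=lambda h: metric(selected + [h], Xp, yp))
        selected.append(best)
        remaining.remove(best)
    votes = np.array([clf.predict(x.reshape(1, -1))[0] for clf in selected])
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]                # lines 7-8: majority vote of S
```

In practice the metric argument would wrap a function like the ebm sketch above, first computing the predictions of selected + [h] on the pruning set Xp before evaluating Eq. (4).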

4. Evaluation metrics

Evaluation metrics are very important in assessing classification performance. Traditionally, the most frequently used evaluation metrics are accuracy and error rate. Considering a basic two-class classification problem, instances can be categorized into four groups after a classification process, as denoted in the confusion matrix presented in Table 1. We use the minority class as the positive class and the majority class as the negative class. Accuracy and error rate are defined as:

accuracy = (TP + TN) / (TP + TN + FP + FN),    (7)

errorRate = 1 − accuracy.    (8)


Table 1
Confusion matrix

                     Predicted as positive    Predicted as negative
Actually positive    TP                       FN
Actually negative    FP                       TN

However, the evaluation metrics used for balanced classification differ greatly from those used for imbalanced classification, and accuracy is inadequate for imbalanced learning. Assume that the rare class is represented by only 10% of a training data set; a naive approach that classifies every instance as a majority-class instance would achieve an accuracy of 90%. Ninety percent accuracy across the entire data set appears good, but this description fails to reflect the fact that 0 percent of the minority instances are identified. That is to say, in this case the accuracy metric does not provide adequate information on a classifier's functionality with respect to the type of classification required. Therefore, other metrics have been proposed.

In lieu of accuracy, other assessment metrics are frequently adopted in the research community to provide a comprehensive evaluation of imbalanced learning problems, including recall, precision, f-measure and g-mean. These metrics are defined as

recall = TP / (TP + FN),    (9)

precision = TP / (TP + FP),    (10)

f-measure = ((1 + β²) × recall × precision) / (β² × recall + precision),    (11)

where β is a coefficient that adjusts the relative importance of precision versus recall (usually β = 1), and

g-mean = sqrt( (TP / (TP + FN)) × (TN / (TN + FP)) ).    (12)

From Eq. (9), recall is a measure of completeness that shows how many instances of the positive class are classified correctly. Precision is a measure of exactness, i.e., of the instances labeled as rare class, how many are actually labeled correctly [1]. Like accuracy and error rate, these two metrics share an inverse relationship. Recall cannot tell how many instances are incorrectly labeled as positive, and precision cannot tell how many positive instances are classified incorrectly. In practice, we want recall and precision to be high at the same time; however, high recall does not imply high precision, and a classifier with high recall but low precision, or with low recall but high precision, may not be a good classifier. To characterize a classifier with both high recall and high precision, the f-measure was proposed: it combines recall and precision as a measure of the effectiveness of classification, weighting the importance of recall versus precision through the coefficient β set by the user. Thus the f-measure represents a harmonic mean between recall and precision. Another metric, g-mean, evaluates the degree of inductive bias in terms of the ratio of positive accuracy to negative accuracy; that is, g-mean measures the balance of a classifier's performance between the positive class and the negative class.

In addition to the metrics mentioned above, several curves are often used in imbalanced learning, such as receiver operating characteristic curves, precision-recall curves and cost curves; we do not describe them here due to limited space. In this paper, we employ accuracy, recall, f-measure and g-mean to evaluate classification performance on imbalanced data sets. Although accuracy is inadequate on its own, poor accuracy still indicates a bad classifier; an effective classifier should improve recall, f-measure or g-mean without decreasing accuracy.
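For reference, the helper below computes the quantities in Eqs. (7)–(12) directly from the confusion-matrix counts of Table 1, with β = 1 by default; it is a sketch, not the evaluation code used in the experiments.

```python
import math

def imbalance_metrics(tp, fn, fp, tn, beta=1.0):
    """Metrics of Eqs. (7)-(12) from the confusion-matrix counts of Table 1."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)                        # Eq. (7)
    recall = tp / (tp + fn) if (tp + fn) else 0.0                     # Eq. (9)
    precision = tp / (tp + fp) if (tp + fp) else 0.0                  # Eq. (10)
    denom = beta ** 2 * recall + precision
    f_measure = (1 + beta ** 2) * recall * precision / denom if denom else 0.0   # Eq. (11)
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    g_mean = math.sqrt(recall * specificity)                          # Eq. (12)
    return {"accuracy": accuracy, "error_rate": 1 - accuracy,         # Eq. (8)
            "recall": recall, "precision": precision,
            "f_measure": f_measure, "g_mean": g_mean}
```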


Table 2
The description of the experimental data sets

Data sets        Number of instances    Number of attributes    Missing values    Ratio
Auto-mpg         398                    7                       Yes               0.176
Breast-cancer    286                    9                       Yes               0.297
Car              1728                   6                       No                0.222
Credit-g         1000                   20                      No                0.300
Diabetes         768                    8                       No                0.349
Hayes-roth       132                    4                       Yes               0.227
Heart-statlog    270                    13                      No                0.444
Hepatitis        155                    19                      Yes               0.206
Labor            57                     16                      Yes               0.351
Sick             3772                   29                      Yes               0.061
Splice           3190                   60                      No                0.240
Tic-tac-toe      958                    9                       No                0.347
Vehicle          846                    18                      No                0.235
Yeast            1484                   8                       No                0.164

5. Experiments

5.1. Data sets and platform

In order to facilitate the experiments, we designed a platform named Loyang Spoon (LySpoon) based on Python; it can be downloaded from http://nlp.zzu.edu.cn/LySpoon.asp.¹ 14 data sets were randomly selected from the UCI repository [34]. Of these data sets, breast-cancer, credit-g, diabetes, heart-statlog, hepatitis, labor, sick and tic-tac-toe are imbalanced two-class data sets. Auto-mpg, car, hayes-roth, splice, vehicle and yeast are two-class imbalanced data sets derived from multi-class data sets according to one of the following strategies: 1) treating one class of a multi-class data set as the positive class and the union of all other classes as the negative class, or 2) selecting two classes of a multi-class data set, where one class is regarded as the positive class and the other as the negative class [35]. Several data sets have missing attribute values, such as breast-cancer, hepatitis and sick; C4.5, the base classifier adopted in this paper, handles missing attribute values automatically. More details about the data sets are given in Table 2.

¹ CasebasedEnsemblePruning is the implementation of the algorithm used in this paper. Please contact [email protected] for the related source code.

5.2. Experimental setup

Three experiments are designed in this paper: (1) parameter learning, (2) IBEP's performance, and (3) IBEP with sampling.

– Experiment (1) first learns the two hyper-parameters that influence IBEP's performance (see Algorithm 1), i.e., k (the number of neighbors) and p (the percentage of H's size); it then applies under-sampling and SMOTE to IBEP and studies the impact of the sampling ratio (sr) on IBEP. More setup details and the corresponding results are given in Section 5.3.
– Experiment (2) evaluates the performance of IBEP compared with Bagging, KNN and C4.5, where IBEP uses both forward and backward greedy search strategies to select the subset of ensemble members from the original classifier library, denoted IBEP-F (IBEP with forward ensemble pruning) and IBEP-B (IBEP with backward ensemble pruning), respectively. In this experiment, k and p in Algorithm 1 are set to 7 and 15%, respectively, according to the results of the first experiment.


– Experiment (3) evaluates the performance of IBEP with sampling techniques: under-sampling and SMOTE. As in experiment (2), we set k = 7 and p = 15%. In addition, sr is set to 100% for under-sampling and to 200% for SMOTE.

For each experiment, 10-fold cross validation is performed. In each trial, we employ Bagging with C4.5 as the base classifier learner to construct the original ensemble library with 200 base classifiers.

5.3. Parameter learning

Algorithm 1 has two hyper-parameters: k, the number of nearest neighbors of an unlabeled instance, and p, the percentage of the original ensemble size. We study the influence of k and p on IBEP by increasing k in steps of 1 and p in steps of 5%. The corresponding results are shown in Fig. 1 for three representative data sets: auto-mpg, hayes-roth and tic-tac-toe (see Section 5.1 for details of the data sets). The impacts of k and p on f-measure are shown in Figs 1(a) and (b), respectively. As shown in Fig. 1(a), f-measure increases as k increases up to 6 and then remains stable near its highest value. In order to compare IBEP with KNN, we set k = 7. Figure 1(b) shows that f-measure stays nearly at its highest value until p reaches 30% and then decreases gradually as p increases; therefore, we set p = 15%.

Fig. 1. (a) The impact of k on f-measure; (b) the impact of p on f-measure (representative data sets: auto-mpg, hayes-roth and tic-tac-toe).

Moreover, two sampling techniques, random under-sampling and SMOTE, are applied to IBEP to further improve its performance. One additional parameter is involved: sr (the sampling ratio). We study the influence of sr on IBEP's performance by increasing sr gradually in steps of 25%. Due to limited space, only two representative data sets are used: breast-cancer and hepatitis. Figures 2 and 3 report the results for under-sampling and SMOTE, respectively. From Fig. 2, we observe that f-measure tends to become stable when sr reaches 100%, i.e., when the number of majority-class instances equals the number of minority-class instances. From Fig. 3, f-measure tends to become stable when sr reaches 200%.

Fig. 2. The results of under-sampling: (a) the impact of sr on f-measure for breast-cancer; (b) the impact of sr on f-measure for hepatitis.


Table 3
The comparison of IBEP-F, Bagging, C4.5 and KNN on accuracy

Dataset              IBEP-F          Bagging          C4.5             KNN
Auto-mpg             80.55 (1.55)    82.06 (1.71)v    81.71 (1.52)v    80.20 (2.59)
Breast-cancer        71.26 (2.90)    71.61 (3.11)     71.12 (3.08)     71.12 (5.45)
Car                  93.04 (1.15)    90.69 (1.75)*    89.28 (2.10)*    84.53 (1.32)*
Credit-g             71.62 (1.52)    73.32 (2.02)v    71.02 (2.36)     71.38 (1.92)
Diabetes             74.04 (1.81)    74.92 (1.42)     72.14 (2.38)*    70.94 (1.46)*
Hayes-roth           90.91 (2.14)    86.21 (2.81)*    84.85 (2.14)*    84.55 (4.39)*
Heart-statlog        76.96 (3.13)    79.48 (2.75)v    76.44 (2.68)     78.67 (2.44)
Hepatitis            83.25 (2.50)    83.24 (3.49)     81.81 (2.55)     83.10 (3.61)
Labor                85.91 (4.25)    79.66 (4.57)     69.83 (6.47)*    87.03 (7.02)
Sick                 98.74 (0.31)    98.43 (0.37)     98.47 (0.36)     96.46 (0.64)
Splice               96.65 (0.35)    96.74 (0.42)     96.33 (0.50)*    87.15 (0.71)*
Tic-tac-toe          93.95 (2.03)    88.31 (1.88)*    81.61 (3.05)*    83.24 (2.05)*
Vehicle              94.68 (1.87)    94.78 (1.38)     92.98 (1.66)*    92.84 (1.36)*
Yeast                87.49 (0.49)    88.38 (0.64)v    86.93 (1.19)     86.70 (0.58)*
IBEP-F W-T-L         –               6/0/8            13/0/1           12/0/2
IBEP-F W-T-L (Sig)   –               3/7/4            7/6/1            7/7/0

Fig. 3. The results of SMOTE: (a) the impact of sr on f-measure for breast-cancer; (b) the impact of sr on f-measure for hepatitis.

5.4. IBEP's performance

In this section, we use accuracy, recall, f-measure and g-mean to evaluate IBEP's performance on the 14 data sets. The experimental results are reported in both tables and figures.

Fig. 4. Comparison of the rank of each algorithm on each data set.

Fig. 5. Comparison of the recall of each algorithm on each data set.

Table 3 presents the accuracy of IBEP-F, Bagging, C4.5 and KNN, where * (v) denotes that IBEP-F significantly outperforms (is outperformed by) the corresponding algorithm in a pair-wise t-test at the 95% significance level. The corresponding statistics are shown in the last two rows, where W-T-L denotes the number of data sets on which IBEP-F wins, ties and loses compared with the other method, and W-T-L (Sig.) counts only the significant wins, ties and losses. Note that Bagging is the original ensemble with 200 base classifiers, while IBEP-F is the sub-ensemble with 30 base classifiers selected from Bagging (p = 15%).

As shown in Table 3, IBEP-F outperforms Bagging on 6 data sets and is outperformed by Bagging on 8. At the 95% significance level, IBEP-F outperforms Bagging on 3 data sets, is outperformed by Bagging on 4 data sets and draws with Bagging on 7 data sets. These results are acceptable even though Bagging is slightly better than IBEP-F, since we focus on imbalanced learning, for which accuracy is not an ideal performance metric. Compared with C4.5 and KNN, IBEP-F wins/ties/loses on 13/0/1 and 12/0/2 data sets, respectively; at the 95% significance level, IBEP-F wins/ties/loses on 7/6/1 and 7/7/0 data sets, respectively.

According to [36], when comparing two or more algorithms, a reasonable approach is to compare their ranks on each data set or their average rank across all data sets. On a data set, the best performing algorithm gets rank 1, the second best rank 2, and so on; in case of ties, average ranks are assigned.
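A small sketch of this ranking scheme, assuming higher scores are better and using the standard average-rank convention for ties; the helper names are illustrative.

```python
def ranks_with_ties(scores):
    """Ranks for one data set: the best score gets rank 1; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        tied = [i for i in order if scores[i] == scores[order[pos]]]
        avg = sum(range(pos + 1, pos + len(tied) + 1)) / len(tied)
        for i in tied:
            ranks[i] = avg
        pos += len(tied)
    return ranks

def average_ranks(score_table):
    """score_table[d][a] is the score of algorithm a on data set d; returns the mean rank per algorithm."""
    per_dataset = [ranks_with_ties(row) for row in score_table]
    n_algorithms = len(score_table[0])
    return [sum(r[a] for r in per_dataset) / len(per_dataset) for a in range(n_algorithms)]
```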

Fig. 6. Comparison of the f-measure of each algorithm on each data set.

Fig. 7. Comparison of the g-mean of each algorithm on each data set.

Figure 4 displays the ranks of IBEP-F, Bagging, C4.5 and KNN on accuracy. From Fig. 4 we can see that the average ranks of IBEP-F, Bagging, C4.5 and KNN are 1.79, 1.57, 3.32 and 3.32, respectively. The average rank of Bagging is the lowest, and the average rank of IBEP-F is only slightly higher than that of Bagging.

Figure 5 shows the recall of IBEP-F, Bagging, C4.5 and KNN. As shown in Fig. 5, IBEP-F wins on 12, 13 and 10 of the 14 data sets compared with Bagging, C4.5 and KNN, respectively. In particular, IBEP-F boosts the recall of C4.5 by 90.75% on hayes-roth.

The f-measure of IBEP-F, Bagging, C4.5 and KNN is illustrated in Fig. 6. From Fig. 6, IBEP-F wins on 10, 14 and 11 of the 14 data sets compared with Bagging, C4.5 and KNN, respectively. On 8 data sets, the f-measure of IBEP-F is the highest among all the algorithms.

Figure 7 depicts the g-mean of IBEP-F, Bagging, C4.5 and KNN. IBEP-F wins on 11, 13 and 10 of the 14 data sets compared with Bagging, C4.5 and KNN, and on average boosts the g-mean of Bagging by 6.85%, of C4.5 by 13.6% and of KNN by 7.69%.

Combining the above results, we conclude that IBEP with forward ensemble pruning makes good use of the local characteristics of the rare class and thus achieves better performance than the other algorithms on recall, f-measure and g-mean.

As mentioned above, there is another greedy strategy, denoted IBEP-B (IBEP with backward ensemble pruning), which can also be used for ensemble pruning. From the experimental results, we observe that IBEP-B's performance is similar to that of IBEP-F on all the metrics considered in this paper.

Fig. 8. Comparison of IBEP-F and IBEP-B (f-measure and g-mean on each data set).

Fig. 9. Comparison of IBEP (US) and IBEP (SMOTE) with IBEP (f-measure and g-mean on each data set).

We therefore only present the results for f-measure and g-mean, since these two metrics are the most important for evaluating imbalanced learning; the corresponding results are illustrated in Fig. 8. As shown in Fig. 8, IBEP-B performs similarly to IBEP-F. However, IBEP-B consumes more time than IBEP-F, since the size of the target sub-ensemble is small. Therefore, IBEP-F is the more practical choice for imbalanced learning.

5.5. IBEP with sampling

As mentioned earlier, sampling is an effective technique for dealing with imbalanced learning. Many sampling techniques have been proposed; among them, under-sampling and SMOTE are two often-used ones for coping with the classification of imbalanced data sets. We therefore apply under-sampling and SMOTE to IBEP to further improve its performance, denoting IBEP with under-sampling as IBEP (US) and IBEP with SMOTE as IBEP (SMOTE). Note that only the forward ensemble pruning approach is considered here. The corresponding results, f-measure and g-mean, are shown in Fig. 9.

From Fig. 9, we observe that IBEP (US) has the best performance, followed by IBEP (SMOTE) and IBEP. On average, IBEP (US) boosts the f-measure of IBEP by 2.23%; on some data sets, such as hayes-roth, IBEP (US) boosts the f-measure of IBEP by 9.57%.


On average, IBEP (US) boosts the g-mean of IBEP by 2.65%; on some data sets, such as breast-cancer, it boosts the g-mean of IBEP by almost 11.61%. Generally speaking, sampling does affect the performance of IBEP and indeed improves it.

All the experimental results presented above indicate that ensemble pruning is an effective way to boost the generalization performance of an ensemble in imbalanced learning and is worth further research. Moreover, the results show that considering the local characteristics of the rare class can boost the performance of imbalanced learning.

6. Conclusion

In this paper, we apply ensemble pruning to the imbalance problem and propose a method called IBEP (Instance-Based Ensemble Pruning) for imbalanced learning. To take advantage of the locality of the minority class, IBEP applies the idea of KNN to the ensemble pruning process. Experimental results show that IBEP is an effective method for imbalanced learning. Moreover, we apply two sampling techniques, under-sampling and SMOTE, to IBEP to further improve its performance. The corresponding experimental results indicate that sampling does improve the performance of classification algorithms on imbalanced data sets.

However, the proposed method has some disadvantages. For example, in the prediction stage, IBEP needs to search for the k nearest neighbors of each unlabeled instance to form the pruning set used to select the optimal sub-ensemble, which consumes considerably more time and may prevent IBEP from working effectively in real-time environments. Therefore, future work includes designing an effective method to accelerate IBEP's prediction.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant No. 61170223, "Research on multiple variables IB method and algorithm") and the Science and Technology Research Key Project of the Education Department of Henan Province (Grant No. 14A520016, "Research on classification of imbalanced data set").

References

[1] H.B. He and E.A. Garcia, Learning from imbalanced data, IEEE Transactions on Knowledge and Data Engineering 21 (2009), 1263–1284.
[2] X.Y. Liu, Q.Q. Li and Z.H. Zhou, Learning imbalanced multi-class data with optimal dichotomy weights, Proceedings of the ACM SIGMOD International Conference on Management of Data (2013), 478–487.
[3] X.Y. Liu, J.X. Wu and Z.H. Zhou, Exploratory under-sampling for class imbalance learning, Proceedings of the IEEE International Conference on Data Mining (2006), 965–969.
[4] N.V. Chawla, K.W. Bowyer, L.O. Hall and W.P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002), 321–357.
[5] T. Jo and N. Japkowicz, Class imbalances versus small disjuncts, ACM SIGKDD Explorations Newsletter 6 (2004), 40–49.
[6] M.V. Joshi, R.C. Agarwal and V. Kumar, Mining needles in a haystack: Classifying rare classes via two-phase rule induction, Proceedings of the ACM SIGMOD International Conference on Management of Data (2001), 91–102.
[7] Y.M. Sun, M.S. Kamel, A.K.C. Wong and Y. Yang, Cost-sensitive boosting for classification of imbalanced data, Pattern Recognition 40 (2007), 3358–3378.
[8] N. Japkowicz and S. Stephen, The class imbalance problem: A systematic study, Intelligent Data Analysis 6 (2002), 429–450.
[9] L.I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, John Wiley and Sons Inc., Hoboken, New Jersey, 2004.
[10] L. Breiman, Bagging predictors, Machine Learning 24 (1996), 123–140.
[11] Y. Freund and R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55 (1997), 119–139.
[12] L. Breiman, Random forests, Machine Learning 45 (2001), 5–32.
[13] J.J. Rodriguez, L.I. Kuncheva and C.J. Alonso, Rotation forest: A new classifier ensemble method, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (2006), 1619–1630.
[14] H.H. Chen, P. Tino and X. Yao, Predictive ensemble pruning by expectation propagation, IEEE Transactions on Knowledge and Data Engineering 21 (2009), 999–1013.
[15] I. Partalas, G. Tsoumakas and I.P. Vlahavas, Focused ensemble selection: A diversity-based method for greedy ensemble selection, Proceedings of the 18th European Conference on Artificial Intelligence (2008), 117–121.
[16] C. Tamon and J. Xiang, On the boosting pruning problem, Proceedings of the 11th European Conference on Machine Learning (2000), 404–412.
[17] G. Martínez-Muñoz, D. Hernández-Lobato and A. Suárez, An analysis of ensemble pruning techniques based on ordered aggregation, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2009), 245–259.
[18] R.E. Banfield, L.O. Hall, K.W. Bowyer and W.P. Kegelmeyer, Ensemble diversity measures and their application to thinning, Information Fusion 6 (2005), 49–62.
[19] I. Partalas, G. Tsoumakas and I.P. Vlahavas, Focused ensemble selection: A diversity-based method for greedy ensemble selection, Proceedings of the 18th European Conference on Artificial Intelligence (2008), 117–121.
[20] W.M. Zhi, H.P. Guo and M. Fan, Energy-based metric for ensemble selection, Proceedings of the 14th Asia-Pacific Web Conference (2012), 306–317.
[21] G. Martínez-Muñoz and A. Suárez, Aggregation ordering in bagging, Proceedings of the International Conference on Artificial Intelligence and Applications (2004), 258–263.
[22] I. Partalas, G. Tsoumakas and I.P. Vlahavas, An ensemble uncertainty aware measure for directed hill climbing ensemble pruning, Machine Learning 81 (2010), 257–282.
[23] Y. LeCun, S. Chopra, R. Hadsell, M.A. Ranzato and F.J. Huang, A tutorial on energy-based learning, in: Predicting Structured Data, MIT Press, 2006, pp. 191–246.
[24] N.V. Chawla, C4.5 and imbalanced data sets: Investigating the effect of sampling method, probabilistic estimate, and decision tree structure, Proceedings of the ICML'03 Workshop on Learning from Imbalanced Data Sets (2003).
[25] C. Drummond and R.C. Holte, C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling, Proceedings of the ICML'03 Workshop on Learning from Imbalanced Data Sets (2003).
[26] I. Tomek, Two modifications of CNN, IEEE Transactions on Systems, Man and Cybernetics 6 (1976), 769–772.
[27] P.E. Hart, The condensed nearest neighbor rule, IEEE Transactions on Information Theory 14 (1968), 515–516.
[28] M. Kubat and S. Matwin, Addressing the curse of imbalanced training sets: One-sided selection, Proceedings of the International Conference on Machine Learning (1997), 179–186.
[29] J. Laurikkala, Improving identification of difficult small classes by balancing class distribution, Proceedings of Artificial Intelligence in Medicine (2001), 63–66.
[30] N.V. Chawla, A. Lazarevic, L.O. Hall and K.W. Bowyer, SMOTEBoost: Improving prediction of the minority class in boosting, Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (2003), 107–119.
[31] G.E. Batista, R.C. Prati and M.C. Monard, A study of the behavior of several methods for balancing machine learning training data, SIGKDD Explorations 1 (2004), 20–29.
[32] A. Estabrooks, T. Jo and N. Japkowicz, A multiple resampling method for learning from imbalanced data sets, Computational Intelligence 20 (2004), 19–36.
[33] K.Q. Weinberger, J. Blitzer and L.K. Saul, Distance metric learning for large margin nearest neighbor classification, Proceedings of the 2005 Annual Conference on Neural Information Processing Systems (2005), 1473–1480.
[34] C. Blake and C. Merz, UCI repository of machine learning databases, http://www.ics.uci.edu/mlearn/MLRepository.html.
[35] X.Y. Liu and Z.H. Zhou, The influence of class imbalance on cost-sensitive learning: An empirical study, Proceedings of the IEEE International Conference on Data Mining (2006), 970–974.
[36] J. Demsar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 6 (2006), 1–30.
