Balancing Strategies and Class Overlapping*

Gustavo E. A. P. A. Batista 1,2, Ronaldo C. Prati 1, and Maria C. Monard 1

1 Institute of Mathematics and Computer Science, University of São Paulo
  P. O. Box 668, ZIP Code 13560-970, São Carlos (SP), Brazil
2 Faculty of Computer Engineering, Pontifical Catholic University of Campinas
  Rodovia D. Pedro I, Km 136, ZIP Code 13086-900, Campinas (SP), Brazil
  {gbatista,prati,mcmonard} at icmc usp br

Abstract. Several studies have pointed out that class imbalance is a bottleneck in the performance achieved by standard supervised learning systems. However, a complete understanding of how this problem affects the performance of learning is still lacking. In previous work we identified that performance degradation is not solely caused by class imbalance, but is also related to the degree of class overlapping. In this work, we take our research a step further by investigating sampling strategies which aim to balance the training set. Our results show that these sampling strategies usually lead to a performance improvement for highly imbalanced data sets with highly overlapped classes. In addition, over-sampling methods seem to outperform under-sampling methods.

1 Introduction

Supervised Machine Learning (ML) systems aim to automatically create a classification model from a set of labeled training examples. Once the model is created, it can be used to automatically predict the class label of unlabeled examples. In many real-world applications, it is common to have a huge intrinsic disproportion in the number of examples in each class. This fact is known as the class imbalance problem, and it occurs whenever examples of one class heavily outnumber examples of the other class.3 Generally, the minority class represents a circumscribed concept, while the other class represents the counterpart of that concept. Several studies have pointed out that domains with a high class imbalance might cause a significant bottleneck in the performance achieved by standard ML systems. Even though class imbalance is a problem of great importance in ML, a complete understanding of how this problem affects the performance of learning systems is not clear yet.

In spite of the poor performance of standard learning systems in many imbalanced domains, this does not necessarily mean that class imbalance is solely responsible for the decrease in performance. Rather, it is quite possible that, beyond class imbalance, certain other data conditions make the induction of good classifiers difficult. For instance, even in highly imbalanced domains, standard ML systems are able to create accurate classifiers when the classes are linearly separable. All matters considered, it is crucial to identify in which situations a skewed data set might lead to performance degradation, in order to develop new tools and/or to (re)design learning algorithms that cope with this problem. To accomplish this task, artificial data sets provide a useful framework, since their parameters can be fully and easily controlled. For instance, using artificial data sets, Japkowicz [3] showed that class imbalance is a relative problem depending on both the complexity of the concept and the overall size of the training set. Furthermore, in previous work [6], also using artificial data sets, we showed that the performance degradation in imbalanced domains is related to the degree of data overlapping between classes.

In this work, we broaden this research by applying several under- and over-sampling methods to balance the training data. Under-sampling methods balance the training set by reducing the number of majority class examples, while over-sampling methods balance the training set by increasing the number of minority class examples. Our objective is to verify whether balancing training data is an effective approach to deal with the class imbalance problem, and how the controlled parameters, namely class overlapping and class imbalance, affect each balancing method. Our experimental results on artificial data sets show that balancing training data usually leads to a performance improvement for highly imbalanced data sets with highly overlapped classes. In addition, over-sampling methods usually outperform under-sampling methods.

This work is organized as follows: Section 2 presents some notes related to evaluating the performance of classifiers in imbalanced domains. Section 3 introduces our hypothesis regarding class imbalances and class overlapping. Section 4 discusses our experimental results. Finally, Section 5 presents some concluding remarks and suggestions for future work.

* This research is partly supported by the Brazilian Research Councils CAPES and FAPESP.
3 Although in this work we deal with two-class problems, this discussion also applies to multi-class problems. Furthermore, positive and negative labels are used to denominate the minority and majority classes, respectively.

2 Evaluating Classifiers in Imbalanced Domains

As a rule, the error rate (or accuracy) treats all misclassifications as equally important. However, in most real-world applications this is an unrealistic scenario, since certain types of misclassification are likely to be more serious than others. Unfortunately, misclassification costs are often difficult to estimate. Moreover, when the prior class probabilities are very different, the use of error rate or accuracy might lead to misleading conclusions, since there is a strong bias in favour of the majority class. For instance, it is straightforward to create a classifier with an error rate of 1% (or accuracy of 99%) in a domain where the majority class holds 99% of the examples, by simply forecasting every new example as belonging to the majority class. Another point that should be considered when studying the effect of class distribution on learning systems is that misclassification costs and class distribution may not be static.

When the operating characteristics, i.e. class distribution and cost parameters, are not known at training time, measures that disassociate the errors (or hits) that occur in each class should be used to evaluate classifiers, such as the ROC (Receiver Operating Characteristic) curve. A ROC curve is a plot of the estimated proportion of positive examples correctly classified as positive — the sensitivity or true-positive rate (tpr) — against the estimated proportion of negative examples incorrectly classified as positive — the false alarm or false-positive rate (fpr) — for all possible trade-offs between the classifier's sensitivity and false alarms. ROC graphs are consistent for a given problem even if the distribution of positive and negative examples is highly skewed or the misclassification costs change. ROC analysis also allows the performance of multiple classification functions to be visualised and compared simultaneously. The area under the ROC curve (AUC) summarises the expected performance as a single scalar. The AUC has a known statistical meaning: it is equivalent to the Wilcoxon rank test, and is equivalent to several other statistical measures for evaluating classification and ranking models [2].
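As a concrete illustration of this rank-based view of the AUC, the following sketch (ours, not from the paper) computes the AUC of a set of scored examples directly from the Wilcoxon statistic; the scores and labels are made-up values.

```python
import numpy as np

def auc_from_scores(scores_pos, scores_neg):
    """AUC as the probability that a randomly drawn positive example
    is scored above a randomly drawn negative one (Wilcoxon statistic);
    ties count as half a win."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical class-membership scores from a probability estimator.
pos = np.array([0.90, 0.80, 0.35])
neg = np.array([0.70, 0.30, 0.20, 0.10])
print(auc_from_scores(pos, neg))  # 11/12, approximately 0.917
```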

3 The Effect of Class Overlapping in Imbalanced Data

There seems to be an agreement in the ML community with the statement that the imbalance between classes is the major obstacle to inducing good classifiers in imbalanced domains. However, we believe that class imbalance is not always the problem. In order to illustrate our conjecture, consider the two decision problems shown in Figure 1. The task is to build a classifier for a simple single-attribute problem whose examples should be classified into two classes, positive and negative. The conditional probabilities of both classes are given by one-dimensional unit-variance Gaussian functions, but the negative class centre is one standard deviation apart from the positive class centre in the first problem — Figures 1(a) and 1(b) — and four (instead of one) standard deviations apart in the second problem — Figures 1(c) and 1(d).

In Figure 1(a), the aim is to build an (optimal) Bayes classifier, and perfect knowledge regarding the probability distributions is assumed. The vertical line represents the optimal Bayes split. Under such conditions, the optimal Bayes split remains the same however skewed the data set is. On the other hand, Figure 1(b) depicts the same problem, but now no prior knowledge is assumed regarding the probability distributions, and the aim is to build a Naïve-Bayes classifier only with the data at hand. If there is a huge disproportion of examples between classes, the algorithm is likely to produce poorer estimates for the class with fewer examples than for the majority class. In particular, in this figure the variance is overestimated at 1.5 (continuous line) instead of the true variance of 1 (dashed line). In other words, if we know the conditional probabilities beforehand (a constraint seldom satisfied in real-world problems), which makes the construction of a true Bayes classifier possible, the class distribution should not be a problem at all. Conversely, a Naïve-Bayes classifier is likely to suffer from poor estimates due to the few data available for the minority class.

Consider now the second decision problem. As in Figure 1(a), Figure 1(c) represents the scenario where full knowledge regarding the probability distributions is assumed, while Figure 1(d) represents the scenario where the learning algorithm must induce the classifier only with the data at hand. For the reasons stated before, when perfect knowledge is assumed, the optimal Bayes classifier is not affected by the class distribution. However, if this is not the case, the final classifier is likely to be affected. Nevertheless, due to the low overlapping between the classes, the effect of class imbalance is lower here than when there is a high overlapping: the number of examples misclassified in the high-overlapping scenario is therefore higher than in the low-overlapping one. This might indicate that class probabilities are not solely responsible for hindering classification performance, but that the degree of overlapping between the classes also plays a role.

Fig. 1. Two different decision problems (see text). (a) The learner has perfect knowledge about the domain. (b) The learner only uses the data at hand. (c) The learner has perfect knowledge about the domain. (d) The learner only uses the data at hand.
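The argument above can be reproduced numerically. The sketch below (our illustration, with arbitrary sample sizes) fits per-class Gaussian parameters from an imbalanced sample, as a Naïve-Bayes learner would; with few minority examples the fitted mean and standard deviation can be far from their true values, while the majority class estimates remain tight.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitted_params(n_pos, n_neg, distance):
    """Sample an imbalanced two-class problem with unit-variance Gaussian
    conditionals and return the per-class (mean, std) estimates a learner
    would compute from the data at hand."""
    pos = rng.normal(0.0, 1.0, n_pos)       # minority class, centre 0
    neg = rng.normal(distance, 1.0, n_neg)  # majority class, centre `distance`
    return (pos.mean(), pos.std(ddof=1)), (neg.mean(), neg.std(ddof=1))

# Few minority examples -> noisy estimates for the positive class only.
(p_mu, p_sd), (n_mu, n_sd) = fitted_params(n_pos=15, n_neg=1500, distance=1.0)
print(f"positive: mean={p_mu:+.2f}, sd={p_sd:.2f}  (true: 0.00, 1.00)")
print(f"negative: mean={n_mu:+.2f}, sd={n_sd:.2f}  (true: 1.00, 1.00)")
```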

4 Experiments

The main goal of this work is to gain some insight into how balancing strategies may aid classifier induction in the presence of class imbalance and class overlapping. In previous work [6], we performed a study aimed at understanding when class imbalance causes performance degradation in learning algorithms applied to class-overlapped data sets. In this work we broaden this research by investigating how several balancing methods affect the performance of learning under such conditions. We start by describing the experimental setup used in our analysis, followed by a description of the balancing methods used in the experiments. To make this work more self-contained, we then sum up the findings reported in [6], and conclude the section with an analysis of the obtained results.

4.1 Experimental Setup

To perform our analysis, we generated 10 artificial domains. Each domain is composed of two clusters: one representing the majority class and the other representing the minority class. The data used in the experiments have two major controlled parameters. The first is the distance between the two cluster centroids, and the second is the degree of imbalance. The distance between centroids enables us to control the "level of difficulty" of correctly separating the two classes, while the degree of imbalance lets us analyse whether imbalance by itself degrades performance.

Each domain is described by a 5-dimensional unit-variance Gaussian variable and has two classes: positive and negative. For the first domain, the mean of the Gaussian function is the same for both classes. For the following domains, we stepwise add 1 standard deviation to the mean of the positive class, up to 9 standard deviations. For each domain, we generated 14 data sets. Each data set has 10,000 examples with different proportions of examples belonging to each class, ranging from 1% up to 45% in the positive class, with the remainder in the negative class, as follows: 1%, 2.5%, 5%, 7.5%, 10%, 12.5%, 15%, 17.5%, 20%, 25%, 30%, 35%, 40% and 45%. We also included a control data set with a balanced class distribution. Although the class complexity is quite simple (we generated data sets with only 5 attributes, two classes, and each class grouped in only one cluster), this situation is often faced by supervised learning systems, since most of them follow the so-called divide-and-conquer (or separate-and-conquer) strategy, which recursively divides (or separates) and solves smaller problems in order to induce the whole concept. Furthermore, the Gaussian distribution can be used as an approximation of several statistical distributions. A sketch of this generation procedure is given below.
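The following is a minimal sketch of how such a domain can be generated (our reconstruction, not the authors' code; we assume the mean shift is applied to every one of the 5 dimensions, which is consistent with the theoretical AUC values reported in Table 1 below).

```python
import numpy as np

def make_domain(shift_sd, prop_pos, n=10_000, dims=5, seed=0):
    """Generate one artificial data set: two unit-variance Gaussian
    clusters in `dims` dimensions, the negative class centred at the
    origin and the positive class shifted by `shift_sd` standard
    deviations in each dimension (an assumption, see lead-in)."""
    rng = np.random.default_rng(seed)
    n_pos = round(n * prop_pos)
    X_neg = rng.normal(0.0, 1.0, size=(n - n_pos, dims))
    X_pos = rng.normal(shift_sd, 1.0, size=(n_pos, dims))
    X = np.vstack([X_neg, X_pos])
    y = np.array([0] * (n - n_pos) + [1] * n_pos)  # 1 = positive (minority)
    return X, y

# One of the configurations: centroids 2 SDs apart per dimension, 5% positives.
X, y = make_domain(shift_sd=2.0, prop_pos=0.05)
```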

To run the experiments, we chose the C4.5 [8] algorithm for inducing decision trees. The reason for choosing C4.5 is twofold. Firstly, tree induction is one of the most effective and widely used methods for building classification models. Secondly, C4.5 is quickly becoming the community standard algorithm for evaluating learning algorithms in imbalanced domains. In this work, the induced decision trees were modified to produce probability estimation trees (PETs) [7], instead of only forecasting a class. We use the AUC as the main measure for assessing our experiments, and all experiments were evaluated using 10-fold stratified cross-validation.

Moreover, the choice of two Gaussian-distributed classes enables us to easily compute the theoretical AUC value of the optimal Bayes classifier. The AUC can be computed using Equation 1 [5], where $\Phi(\cdot)$ is the standard normal cumulative distribution function, $\delta$ is the Euclidean distance between the centroids of the two distributions, and $\phi_{pos}$ and $\phi_{neg}$ are, respectively, the standard deviations of the positive and negative distributions.

$$\mathrm{AUC} = \Phi\!\left(\frac{\delta}{\sqrt{\phi_{pos}^2 + \phi_{neg}^2}}\right) \tag{1}$$
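With unit variances, Equation 1 reduces to $\Phi(\delta/\sqrt{2})$. The small check below (ours) reproduces the theoretical values listed in Table 1 in Section 4.4, under the assumption that the per-dimension shift is applied to all 5 dimensions, so that $\delta = \text{shift} \cdot \sqrt{5}$.

```python
from math import sqrt
from scipy.stats import norm

def theoretical_auc(shift_sd, dims=5):
    """Equation 1 with unit variances: AUC = Phi(delta / sqrt(2)), where
    delta = shift_sd * sqrt(dims) is the Euclidean distance between
    centroids (assuming the shift is applied per dimension)."""
    delta = shift_sd * sqrt(dims)
    return norm.cdf(delta / sqrt(2.0))

for s in (0, 0.5, 1, 1.5, 2, 2.5, 3):
    print(f"shift {s}: AUC = {100 * theoretical_auc(s):.2f}%")
# shift 0.5 -> 78.54%, shift 1 -> 94.31%, shift 1.5 -> 99.11%, ...
```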

4.2 Summary of our Previous Findings

Figure 2 summarises the results of our previous findings [6]. For better visualization, we have omitted some proportions and distances; however, the omitted lines are quite similar to the curves for 9 standard deviations apart and for 50% of examples in each class, respectively. Figure 2(a) plots the percentage of positive examples in the data sets versus the AUC of the classifiers induced by C4.5, for different distances (in standard deviations) between the positive and negative class centroids. Consider the curve where the class centroids are 2 standard deviations apart. Observe that these classifiers have good performance, with AUC higher than 90%, even when the proportion of the positive class is barely 1%.

Fig. 2. Experimental results for C4.5 classifiers induced on data sets with several overlapping and imbalance rates [6]. (a) Variation in the proportion of positive instances versus AUC. (b) Variation in the centre of the positive class versus AUC.

Figure 2(b) plots the variation of centroid distances versus the AUC of the classifiers induced by C4.5, for different class imbalances. In this graph, we can see that the main degradation in classifier performance occurs when the centres of the positive and negative classes are 1 standard deviation apart. In this case, the degradation is significantly higher for highly imbalanced data sets, but decreases as the distance between the centres of the positive and negative classes increases. The differences in classifier performance become statistically insignificant when the distance between centres reaches 4 standard deviations, independently of how many examples belong to the positive class. Thus, these results suggest that data sets with linearly separable classes do not suffer from the class imbalance problem.

4.3 Balancing Methods

As summarized in Section 4.2, in [6] we analyzed the interaction between class imbalance and class overlapping. In this work we are interested in analysing the behaviour of methods that artificially balance the (training) data set in the presence of class overlapping. Two of the five evaluated methods, described next, are non-heuristic, while the other three make use of heuristics to balance the training data. The non-heuristic methods are:

Random under-sampling balances the class distribution through random elimination of majority class examples.

Random over-sampling balances the class distribution through random replication of minority class examples.

Several authors agree that the major drawback of Random under-sampling is that it can discard potentially useful data that could be important for the induction process. On the other hand, Random over-sampling can increase the likelihood of overfitting, since it makes exact copies of minority class examples. In this way, a symbolic classifier, for instance, might construct rules that are apparently accurate but actually cover a single replicated example.

The remaining three balancing methods use heuristics in order to overcome these limitations:

NCL. Neighbourhood Cleaning Rule [4] uses Wilson's Edited Nearest Neighbour Rule (ENN) [10] to remove majority class examples. ENN removes any example whose class label differs from the class of at least two of its three nearest neighbours. NCL modifies ENN in order to increase data cleaning. For a two-class problem the algorithm can be described as follows: for each example Ei in the training set, its three nearest neighbours are found. If Ei belongs to the majority class and the classification given by its three nearest neighbours contradicts the original class of Ei, then Ei is removed. If Ei belongs to the minority class and its three nearest neighbours misclassify Ei, then the nearest neighbours that belong to the majority class are removed.

Smote. Synthetic Minority Over-sampling Technique [1] is an over-sampling method. Its main idea is to form new minority class examples by interpolating between minority class examples that lie close together. Thus, the overfitting that may occur with Random over-sampling is avoided, and the decision boundaries for the minority class spread further into the majority class space.

Smote + ENN. Although over-sampling minority class examples can balance class distributions, some other problems usually present in data sets with skewed class distributions are not solved. Frequently, class clusters are not well defined, since some majority class examples might be invading the minority class space. The opposite can also be true, since interpolating minority class examples can expand the minority class clusters, introducing artificial minority class examples too deeply into the majority class space. Inducing a classifier in such a situation can lead to overfitting. In order to create better-defined class clusters, we propose applying ENN to the over-sampled training set as a data cleaning method. Differently from NCL, which is an under-sampling method, here ENN removes examples from both classes: any example that is misclassified by its three nearest neighbours is removed from the training set. A sketch of Smote and of the ENN cleaning step is given below.
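The following is our simplified reconstruction of the two building blocks, not the authors' code: Smote's interpolation and Wilson's ENN, using brute-force Euclidean nearest-neighbour search for clarity.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic minority examples: pick a minority example,
    pick one of its k nearest minority neighbours, and interpolate at a
    random point on the segment between them."""
    rng = rng or np.random.default_rng(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(dist)[1:k + 1]   # skip the example itself
        j = rng.choice(neighbours)
        gap = rng.random()                       # random point on the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

def enn_clean(X, y, k=3):
    """Wilson's ENN: remove every example misclassified by the majority
    vote of its k nearest neighbours (y is an integer 0/1 array)."""
    keep = []
    for i in range(len(X)):
        dist = np.linalg.norm(X - X[i], axis=1)
        neighbours = np.argsort(dist)[1:k + 1]
        if np.bincount(y[neighbours], minlength=2).argmax() == y[i]:
            keep.append(i)
    return X[keep], y[keep]

# Smote + ENN: over-sample the minority class (y == 1) to a 50/50 split,
# then clean both classes with ENN; X, y as in the Section 4.1 sketch.
```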

4.4 Experimental Results

From here on, we focus on data sets up to d = 3 standard deviations apart, since these data sets provided the most significant results. Furthermore, we generated new domains, also 5-dimensional unit-variance Gaussian variables with the same class distributions as the previous domains, every 0.5 standard deviation. Therefore, 7 domains were analysed in total, with the following distances in standard deviations between the centroids: 0, 0.5, 1, 1.5, 2, 2.5 and 3. The theoretical AUC values for these distances are shown in Table 1. As we want to gain some insight into the interaction between large class imbalances and class overlapping, we also constrain our analysis to domains with up to 20% of examples in the positive class, and compare the results with the naturally balanced data set.

Table 1. Theoretical AUC values

δ     0        0.5      1        1.5      2        2.5      3
AUC   50.00%   78.54%   94.31%   99.11%   99.92%   99.99%   99.99%

Smote, Random over-sampling and Random under-sampling have internal parameters that enable the user to set the class distribution obtained after the application of these methods. We decided to add/remove examples until a balanced distribution was reached. This decision is motivated by the results presented in [9], which show that, when AUC is used as the performance measure, the best class distribution for learning tends to be near the balanced class distribution. The sketch below illustrates the two non-heuristic methods with this balanced target.
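This is our illustration of index-based random resampling to a 50/50 distribution, not the authors' implementation.

```python
import numpy as np

def random_balance(X, y, method, rng=None):
    """Rebalance (X, y) to a 50/50 class distribution: 'under' randomly
    discards majority (y == 0) examples, 'over' randomly replicates
    minority (y == 1) examples."""
    rng = rng or np.random.default_rng(0)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    if method == "under":
        neg = rng.choice(neg, size=len(pos), replace=False)
    elif method == "over":
        extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
        pos = np.concatenate([pos, extra])
    idx = rng.permutation(np.concatenate([pos, neg]))
    return X[idx], y[idx]
```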

Fig. 3. Experimental results for distances 0, 0.5, 1 and 1.5 (positive class proportion versus AUC).

Figure 3 presents graphs4 with the experimental results for distances between centroids of 0, 0.5, 1 and 1.5 standard deviations. Note that we have changed the scale of the AUC axis in order to better present the results. These graphs show, for each distance between centroids, the mean AUC measured over the 10 folds versus the percentage of positive (minority) class examples in the training set. Distance 0 was introduced into the experiments for comparison purposes. As expected, the AUC values for this distance oscillate (due to random variation) around random performance (AUC = 50%). In our experiments, the major influence of class skew occurs when the distance is 0.5 standard deviations. In this case, the theoretical AUC value is 78.54%, but the AUC values achieved on the original data set are under 60% for proportions under 15% of positive instances. In almost all cases, the sampling methods were able to increase the AUC values of the induced classifiers.

4 Due to lack of space, tables with numerical results were not included in this article. However, detailed results, including tables, graphs and the data sets used in the experiments, can be found at http://www.icmc.usp.br/~gbatista/ida2005.

Fig. 4. Experimental results for distances 2, 2.5 and 3 (positive class proportion versus AUC). Legend: Original, NCL, Random under-sampling, Random over-sampling, Smote + ENN, Smote.

As can be observed, NCL shows some improvements over the original data; however, these improvements are smaller than those obtained by Random under-sampling. The over-sampling methods usually provide the best results, with the Smote-based methods achieving an almost constant performance across all class distributions. Smote + ENN presents better results than the other methods for almost all class distributions. We believe this is due to its data cleaning step, which seems to be more effective in highly overlapped regions. The ENN data cleaning becomes less effective as the distance increases, since there is less data to be cleaned when the clusters are further apart from each other. Consequently, the results obtained by Smote + ENN become more similar to those obtained by Smote. From distance 1.5, almost all methods present good results, with most values greater than 90% AUC and the over-sampling methods reaching almost 97% AUC. Nevertheless, the Smote-based methods produced better results and an almost constant AUC value in the most skewed region.

Figure 4 presents the experimental results for distances between centroids of 2, 2.5 and 3 standard deviations. For these distances, the over-sampling methods still provide the best results, especially for highly imbalanced data sets. Smote and Smote + ENN provide results that are slightly better than Random over-sampling; however, the data cleaning provided by ENN becomes very ineffective. Observe that the Smote-based methods provide an almost constant, near 100% AUC for all class distributions.

Fig. 5. Experimental results for proportions 1%, 2.5% and 5% (distance of centroids in SDs versus AUC).

It is interesting to note that the performance of the Random over-sampling method decreases for distance 3 and highly imbalanced data sets. This might be indicative of overfitting, but more research is needed to confirm such a statement. In general, we are interested in which methods provide the most accurate results for highly imbalanced data sets. In order to provide a more direct answer to this question, Figure 5 shows the results obtained over all distances for the most imbalanced proportions: 1%, 2.5% and 5% of positive examples. These graphs clearly show that the over-sampling methods in general, and the Smote-based methods in particular, provide the most accurate results. They also show that, as the degree of class imbalance decreases, the methods tend to achieve similar performance.

5 Conclusion and Future Work

In this work, we analysed the behaviour of five methods for balancing training data on data sets with several degrees of class imbalance and class overlapping. The results show that over-sampling methods in general, and Smote-based methods in particular, are very effective even with highly imbalanced and overlapped data sets. Moreover, the Smote-based methods were able to achieve a performance similar to that of the naturally balanced distribution, even for the most skewed distributions. The data cleaning step used in Smote + ENN seems to be especially suitable in situations with a high degree of overlapping.

In order to study this question in more depth, several further approaches can be taken. For instance, it would be interesting to vary the standard deviations of the Gaussian functions that generate the artificial data sets. It is also worthwhile to consider generating data sets where the minority class examples are distributed over several small clusters. This approach would allow the class imbalance problem to be studied together with the small disjunct problem, as proposed in [3]. Another point to explore is to analyse the ROC curves obtained from the classifiers and to simulate some misclassification cost scenarios. This approach might produce useful insights for developing or analysing methods that deal with class imbalance.

References

1. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
2. D. J. Hand. Construction and Assessment of Classification Rules. John Wiley and Sons, 1997.
3. N. Japkowicz. Class Imbalances: Are We Focusing on the Right Issue? In Proc. of the ICML'2003 Workshop on Learning from Imbalanced Data Sets (II), Washington, DC (USA), 2003.
4. J. Laurikkala. Improving Identification of Difficult Small Classes by Balancing Class Distribution. Technical Report A-2001-2, University of Tampere, 2001.
5. C. Marzban. The ROC Curve and the Area Under it as a Performance Measure. Weather and Forecasting, 19(6):1106–1114, 2004.
6. R. C. Prati, G. E. A. P. A. Batista, and M. C. Monard. Class Imbalances versus Class Overlapping: an Analysis of a Learning System Behavior. In 3rd Mexican International Conference on Artificial Intelligence (MICAI'2004), volume 2971 of LNAI, pages 312–321, Mexico City, 2004. Springer-Verlag.
7. F. Provost and P. Domingos. Tree Induction for Probability-Based Ranking. Machine Learning, 52:199–215, 2003.
8. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
9. G. M. Weiss and F. Provost. Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction. Journal of Artificial Intelligence Research, 19:315–354, 2003.
10. D. L. Wilson. Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Transactions on Systems, Man, and Cybernetics, 2(3):408–421, 1972.
