
Imbalanced Classes in Machine Learning: Brief Introduction

Héctor Manuel Sánchez Castellanos
Tecnológico de Monterrey

October 27, 2014

Outline

1 Introduction
2 Approaches
  • Sampling: Random Undersampling, Random Oversampling, SMOTE, Tomek Links, CNN, NCR
  • Algorithms: Adjusted KNN, Cost Modifying, One Class Learning, Threshold, Adjust Probabilities on Decision Trees
3 Evaluation
  • Warning
4 Conclusions
5 References


Introduction


• Largely unequal quantity of samples for each class
• The class with fewer samples is usually the more important one

The imbalanced classes problem generally arises in any classification problem where one class has "many" more samples than its counterpart.


Approaches


There are two main groups of solutions to the class imbalance problem:

• Sampling techniques: applied at the sampling stage; aimed at balancing the distribution of the classes.
• Algorithmic techniques: adapting our method so that it deals with the imbalance by itself.


Sampling Techniques

Pre-process the data via sampling of the classes so that the class imbalance problem is not so drastic.


Random Undersampling

Balance the number of instances of each class by randomly removing the excess from the majority class.

• Advantages:
  • Very easy to program
  • Reasonable performance
• Disadvantages:
  • Can remove important instances or features
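A minimal sketch of the idea in Python (not from the original slides; NumPy, binary labels, and a known majority label are assumptions):

```python
import numpy as np

def random_undersample(X, y, majority_label, seed=None):
    """Randomly drop majority-class rows until both classes
    have as many instances as the minority class."""
    rng = np.random.default_rng(seed)
    maj = np.flatnonzero(y == majority_label)
    mino = np.flatnonzero(y != majority_label)
    keep = rng.choice(maj, size=mino.size, replace=False)
    idx = np.concatenate([keep, mino])
    rng.shuffle(idx)
    return X[idx], y[idx]
```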


Random Oversampling

Balance the number of instances of each class by randomly replicating instances of the minority class (sampling with replacement).

• Advantages:
  • Very easy to program
• Disadvantages:
  • Can lead to overfitting
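The mirror image of the undersampling sketch above, under the same assumptions (NumPy, binary labels, known minority label):

```python
import numpy as np

def random_oversample(X, y, minority_label, seed=None):
    """Randomly replicate minority-class rows (with replacement)
    until both classes have the same number of instances."""
    rng = np.random.default_rng(seed)
    mino = np.flatnonzero(y == minority_label)
    maj = np.flatnonzero(y != minority_label)
    extra = rng.choice(mino, size=maj.size - mino.size, replace=True)
    idx = np.concatenate([maj, mino, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]
```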


SMOTE: Synthetic Minority Over-Sampling Technique

Take the difference between the feature vector (sample) under consideration and its nearest neighbour. Multiply this difference by a random number between 0 and 1, and add it to the feature vector under consideration. This selects a random point along the line segment between two specific samples, effectively forcing the decision region of the minority class to become more general.

• Advantages:
  • Easy to program and understand
  • Alleviates some overfitting problems
  • Causes the decision boundaries for the minority class to spread
• Disadvantages:
  • Can make the process of learning fine features difficult
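A simplified sketch of the interpolation step just described (assuming X_min holds only minority-class rows; the full algorithm also controls how many synthetic samples each point generates):

```python
import numpy as np

def smote(X_min, k=5, n_new=100, seed=None):
    """Generate n_new synthetic minority samples by interpolating
    between a random minority point and one of its k nearest
    minority-class neighbours."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(dist)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                       # random factor in [0, 1)
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(out)
```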


Tomek Links

Given two examples ei and ej belonging to different classes, let d(ei, ej) be the distance between ei and ej. The pair (ei, ej) is called a Tomek link if there is no example el such that d(ei, el) < d(ei, ej) or d(ej, el) < d(ei, ej). If two examples form a Tomek link, then either one of these examples is noise or both examples are borderline.


• Advantages:
  • Can remove "noisy" samples
• Disadvantages:
  • If the decision border is not clear, many samples can be removed
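A brute-force sketch of Tomek link detection (O(n²) pairwise distances, fine only for small datasets): by the definition above, a pair qualifies exactly when the two points have different labels and are mutual nearest neighbours.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) with different labels that are
    each other's nearest neighbour, i.e. Tomek links."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)       # ignore self-distance
    nn = dist.argmin(axis=1)             # nearest neighbour of every point
    return [(i, j) for i, j in enumerate(nn)
            if nn[j] == i and y[i] != y[j] and i < j]
```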


CNN: Condensed Nearest Neighbour Rule

Randomly draw one majority-class example and all examples from the minority class and put these examples in E′. Afterwards, use a 1-NN classifier over the examples in E′ to classify the examples in E. Every misclassified example from E is moved to E′. The idea behind this construction of a consistent subset is to eliminate the examples from the majority class that are distant from the decision border.

• Advantages:
  • Focuses on the samples at the frontier
• Disadvantages:
  • Can eliminate important features
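A rough sketch of that procedure, under the same assumptions as earlier (NumPy arrays, a single known minority label):

```python
import numpy as np

def cnn_subset(X, y, minority_label, seed=None):
    """Start E' with all minority samples plus one random majority
    sample, then add every sample that 1-NN on E' misclassifies."""
    rng = np.random.default_rng(seed)
    sub = list(np.flatnonzero(y == minority_label))
    sub.append(int(rng.choice(np.flatnonzero(y != minority_label))))
    for i in range(len(X)):
        dist = np.linalg.norm(X[sub] - X[i], axis=1)
        if y[sub][int(dist.argmin())] != y[i]:   # E' misclassifies sample i
            sub.append(i)
    return X[sub], y[sub]
```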


NCR: Neighbourhood Cleaning Rule

For each example Ei in the training set, its three nearest neighbours are found. If Ei belongs to the majority class and the classification given by its three nearest neighbours contradicts the original class of Ei, then Ei is removed. If Ei belongs to the minority class and its three nearest neighbours misclassify Ei, then the nearest neighbours that belong to the majority class are removed.

• Advantages:
  • Removes noisy samples
• Disadvantages:
  • Can remove important information
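One possible reading of the rule as code (a sketch, assuming integer labels 0/1 and a known majority label; the published rule has further refinements):

```python
import numpy as np

def ncr_clean(X, y, majority_label, k=3):
    """Drop majority samples out-voted by their k nearest neighbours;
    for out-voted minority samples, drop their majority-class
    neighbours instead."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    drop = set()
    for i in range(len(X)):
        nn = np.argsort(dist[i])[:k]
        vote = np.bincount(y[nn], minlength=2).argmax()
        if vote != y[i]:                      # neighbours misclassify i
            if y[i] == majority_label:
                drop.add(i)
            else:
                drop.update(int(j) for j in nn if y[j] == majority_label)
    keep = [i for i in range(len(X)) if i not in drop]
    return X[keep], y[keep]
```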


Algorithms

Modify your algorithm in any way you can imagine (or combine it with sampling techniques or other classification algorithms) so that it compensates for the lack of data from one class.


Adjusted KNN

Run the traditional KNN algorithm but assign different weights to the distances between data points.
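One way to realize this (a sketch, not the only adjustment): scale each neighbour's vote by inverse distance and by a per-class weight. The class_weight dictionary below is an illustrative assumption, boosting label 1 as the minority class.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=5, class_weight=None):
    """Vote among the k nearest neighbours, weighting each vote by
    1/distance and by its class weight."""
    class_weight = class_weight or {0: 1.0, 1: 5.0}
    dist = np.linalg.norm(X_train - x, axis=1)
    scores = {}
    for i in np.argsort(dist)[:k]:
        label = int(y_train[i])
        scores[label] = scores.get(label, 0.0) + \
            class_weight[label] / (dist[i] + 1e-9)  # avoid division by zero
    return max(scores, key=scores.get)
```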


Cost Modifying

Some algorithms base their classification decisions on a cost (misclassification) matrix. In cases where class imbalance is a problem, the penalty for misclassification should be adjusted to compensate for the ratio of the classes.
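A tiny worked example of a cost-sensitive decision rule (the 10x penalty is an assumption for illustration): instead of predicting the most probable class, predict the class with the lowest expected cost.

```python
import numpy as np

# cost[true_class][predicted_class]; missing the minority class (1)
# is assumed to be 10x as expensive as a false alarm.
cost = np.array([[0.0,  1.0],
                 [10.0, 0.0]])

def min_cost_prediction(p):
    """p = posterior probabilities [P(class 0), P(class 1)].
    Expected cost of predicting j is sum_i p[i] * cost[i][j]."""
    return int((np.asarray(p) @ cost).argmin())

# Plain argmax would predict 0 here, but the expected costs are
# 0.8*0 + 0.2*10 = 2.0 for predicting 0 vs 0.8*1 + 0.2*0 = 0.8 for 1.
print(min_cost_prediction([0.8, 0.2]))   # -> 1
```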


One Class Learning

In some specific instances of classification problems (such as those multimodal in the domain space), the classifier can behave better on a one-class identification problem than on a two-class classification one.
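For instance, a one-class model can be trained on the well-represented class alone and used to flag everything else; a sketch with scikit-learn's OneClassSVM (the synthetic data and the nu value are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_known = rng.normal(0.0, 1.0, size=(500, 2))   # only the well-sampled class

clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_known)

X_query = np.array([[0.1, -0.2],    # near the training cloud
                    [6.0,  6.0]])   # far away: likely the rare class
print(clf.predict(X_query))         # +1 = fits the learned class, -1 = outlier
```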


Thresholds

Create instances of one-class classifiers and apply thresholds to their membership factor so that they are tuned to compensate for the imbalance.
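A minimal sketch: given membership scores for the minority class, lowering the decision threshold below the usual 0.5 trades false alarms for fewer missed minority samples.

```python
import numpy as np

def predict_with_threshold(minority_scores, threshold=0.5):
    """Predict the minority class (1) whenever its membership
    score reaches the threshold."""
    return (np.asarray(minority_scores) >= threshold).astype(int)

scores = [0.10, 0.35, 0.55, 0.90]
print(predict_with_threshold(scores, 0.5))   # [0 0 1 1]
print(predict_with_threshold(scores, 0.3))   # [0 1 1 1] -- more minority calls
```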


Adjust Decision Trees Probabilities

Modify the selection probabilities in a decision tree to favour the classes with fewer instances.
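One common realization of this idea (not necessarily the one the slides had in mind) is class weighting in scikit-learn's decision trees, which rescales each class's contribution to the split criterion and hence the estimated leaf probabilities; the synthetic data below is an assumption:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 2.0).astype(int)    # rare positive class

# "balanced" weights classes inversely to their frequency, pushing the
# leaf probabilities toward the under-represented class.
tree = DecisionTreeClassifier(class_weight="balanced", max_depth=4).fit(X, y)
print(tree.predict_proba([[1.2, 1.1]]))
```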


Evaluating Classifiers: Warning!

• Err = (Fn + Fp) / (Fn + Fp + Tn + Tp)
• Acc = (Tn + Tp) / (Fn + Fp + Tn + Tp)

...it is straightforward to create a classifier having an accuracy of 99% (or an error rate of 1%) in a domain where the majority class proportion corresponds to 99% of the examples, by simply forecasting every new example as belonging to the majority class.
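The quoted trap is easy to reproduce; a short demonstration with a 99:1 split (synthetic data, an assumption):

```python
import numpy as np

y_true = np.zeros(10_000, dtype=int)
y_true[:100] = 1                        # 1% minority class
y_pred = np.zeros_like(y_true)          # "classifier": always say majority

print(f"accuracy: {(y_pred == y_true).mean():.2%}")                 # 99.00%
print(f"minority recall: {(y_pred[y_true == 1] == 1).mean():.2%}")  # 0.00%
```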


Conclusions

• The class imbalance problem depends on:
  • Degree of the class imbalance
  • Complexity of the search space
  • Size of the training set
  • Classifier
• The problem is not inherent to unbalanced sets as long as they contain a representative number of examples of each class.


• It is important to take into account the balance of samples of each class
• It is easy to be cheated by evaluation metrics if we do not know enough about them [10]
• The use of a specific balancing technique depends on the problem to solve
• Knowing the search space is crucial to understand the behaviour of an algorithm


References


[1] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter, 6(1):20, June 2004.
[2] N. Chawla. Data mining for imbalanced datasets: An overview. Data Mining and Knowledge Discovery Handbook, 2005.
[3] T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1923, Oct. 1998.
[4] X. Guo, Y. Yin, C. Dong, G. Yang, and G. Zhou. On the class imbalance problem. 2008 Fourth International Conference on Natural Computation, pages 192–201, 2008.
[5] N. Japkowicz. The class imbalance problem: A systematic study. Intelligent Data Analysis, 2002.


[6] S. Kotsiantis. Handling imbalanced datasets: A review. GESTS International ..., 30, 2006.
[7] M. Longadge. Class imbalance problem in data mining: Review. International Journal of Computer Science and Network, 2(1), 2013.


[8] L. Puente-Maury and A. López-Chau. Método rápido de preprocesamiento para clasificación en conjuntos de datos no balanceados [Fast preprocessing method for classification on imbalanced data sets]. rcs.cic.ipn.mx, 73:129–142, 2014.
[9] E. Ramentol, Y. Caballero, R. Bello, and F. Herrera. SMOTE-RSB*: A hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowledge and Information ..., 2012.
[10] S. Salzberg. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1:317–328, 1997.
