Imbalanced Classes in Machine Learning: A Brief Introduction
Héctor Manuel Sánchez Castellanos, Tecnológico de Monterrey
October 27, 2014
Outline
1 Introduction
2 Approaches
  Sampling: Random Undersampling, Random Oversampling, SMOTE, Tomek Links, CNN, NCR
  Algorithms: Adjusted KNN, Cost Modifying, One Class Learning, Thresholds, Adjust Probabilities on Decision Trees
3 Evaluation
  Warning
4 Conclusions
5 References
Introduction
• Largely unequal quantity of samples for each class
• The class with fewer samples is usually the most important one

The imbalanced classes problem generally arises whenever we have a classification problem in which one class has "many" more samples than its counterpart.
Approaches
There are two main groups of solutions to the class imbalance problem:
• Sampling techniques: applied at the sampling stage and aimed at balancing the distribution of the classes.
• Algorithmic techniques: adapting our method so that it deals with the imbalance by itself.
Sampling Techniques
Pre-process the data by sampling the classes so that the class imbalance is less drastic.
Random Undersampling
Balance the number of instances of each class by randomly removing the excess from the majority class (a minimal sketch follows the list below).
• Advantages:
  • Very easy to program
  • Reasonable performance
• Disadvantages:
  • Can remove important instances or features
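A minimal sketch of the idea in Python; the function and array names are illustrative, not from the slides:

import numpy as np

def random_undersample(X, y, majority_label, seed=None):
    # Keep every minority sample; keep only a random, equally sized
    # subset of the majority class.
    rng = np.random.default_rng(seed)
    maj = np.flatnonzero(y == majority_label)
    mino = np.flatnonzero(y != majority_label)
    keep = rng.choice(maj, size=mino.size, replace=False)
    idx = np.concatenate([keep, mino])
    return X[idx], y[idx]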
Random Oversampling
Balance the number of instances of each class by randomly duplicating instances of the minority class (a minimal sketch follows the list below).
• Advantages:
  • Very easy to program
• Disadvantages:
  • Can lead to overfitting
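A matching sketch for oversampling, under the same illustrative naming assumptions:

import numpy as np

def random_oversample(X, y, minority_label, seed=None):
    # Duplicate minority samples (drawing with replacement) until both
    # classes have the same number of instances.
    rng = np.random.default_rng(seed)
    mino = np.flatnonzero(y == minority_label)
    maj = np.flatnonzero(y != minority_label)
    extra = rng.choice(mino, size=maj.size - mino.size, replace=True)
    idx = np.concatenate([maj, mino, extra])
    return X[idx], y[idx]

Because the duplicated rows are exact copies, a classifier can memorise them, which is where the overfitting risk noted above comes from.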
SMOTE: Synthetic Minority Over-Sampling Technique
Take the difference between the feature vector (sample) under consideration and one of its nearest neighbours. Multiply this difference by a random number between 0 and 1, and add it to the feature vector under consideration. This selects a random point along the line segment between the two samples, and effectively forces the decision region of the minority class to become more general (a sketch follows the list below).
• Advantages:
  • Easy to program and understand
  • Alleviates some overfitting problems
  • Causes the decision boundaries for the minority class to spread
• Disadvantages:
  • Can make the process of learning fine features difficult
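A minimal NumPy sketch of the interpolation step described above, using a brute-force neighbour search; the names are assumptions, and production code would use a proper k-NN index:

import numpy as np

def smote(X_min, n_new, k=5, seed=None):
    # X_min: minority-class samples only, shape (n, n_features).
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a sample is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]      # k nearest minority neighbours
    out = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(len(X_min))       # sample under consideration
        j = nn[i, rng.integers(k)]         # one of its nearest neighbours
        gap = rng.random()                 # random number in [0, 1)
        out[s] = X_min[i] + gap * (X_min[j] - X_min[i])  # point on the segment
    return out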
Tomek Links
Given two examples e_i and e_j belonging to different classes, let d(e_i, e_j) be the distance between them. The pair (e_i, e_j) is called a Tomek link if there is no example e_l such that d(e_i, e_l) < d(e_i, e_j) or d(e_j, e_l) < d(e_i, e_j). If two examples form a Tomek link, then either one of them is noise or both are borderline (a detection sketch follows the list below).
• Advantages:
  • Can remove "noisy" samples
• Disadvantages:
  • If the decision border is not clear, many samples can be removed
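A sketch of the definition above: a pair is a Tomek link exactly when the two samples are mutual nearest neighbours and carry different labels (brute-force distances; names are illustrative):

import numpy as np

def tomek_links(X, y):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                  # nearest neighbour of each sample
    return [(i, int(j)) for i, j in enumerate(nn)
            if nn[j] == i and y[i] != y[j] and i < int(j)]

Undersampling with Tomek links then removes the majority-class member of each pair (or both members, when used for cleaning).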
CNN: Condensed Nearest Neighbour Rule
Randomly draw one majority-class example plus all examples from the minority class and put them in E′. Then use 1-NN over the examples in E′ to classify the examples in E; every misclassified example from E is moved to E′. The idea behind this construction of a consistent subset is to eliminate the examples from the majority class that are distant from the decision border (a sketch follows the list below).
• Advantages:
  • Focuses on the samples at the frontier
• Disadvantages:
  • Can eliminate important features
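A one-pass sketch of the procedure described above (the original method iterates until no example moves; names are illustrative):

import numpy as np

def condensed_nn(X, y, minority_label, seed=None):
    rng = np.random.default_rng(seed)
    # E': all minority samples plus one random majority sample.
    subset = list(np.flatnonzero(y == minority_label))
    subset.append(int(rng.choice(np.flatnonzero(y != minority_label))))
    for i in range(len(X)):
        if i in subset:
            continue
        d = np.linalg.norm(X[subset] - X[i], axis=1)   # 1-NN against E'
        if y[subset][int(d.argmin())] != y[i]:
            subset.append(i)                           # misclassified: move to E'
    return X[subset], y[subset]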
NCR: Neighbourhood Cleaning Rule
For each example E_i in the training set, its three nearest neighbours are found. If E_i belongs to the majority class and the classification given by its three nearest neighbours contradicts the original class of E_i, then E_i is removed. If E_i belongs to the minority class and its three nearest neighbours misclassify it, then the nearest neighbours that belong to the majority class are removed (a sketch follows the list below).
• Advantages:
  • Removes noisy samples
• Disadvantages:
  • Can remove important information
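A sketch of the rule as stated, for a binary problem (brute-force 3-NN; names are illustrative):

import numpy as np
from collections import Counter

def ncr(X, y, majority_label):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :3]           # three nearest neighbours
    drop = set()
    for i in range(len(X)):
        pred = Counter(y[nn[i]].tolist()).most_common(1)[0][0]
        if pred == y[i]:
            continue                            # neighbours agree: keep
        if y[i] == majority_label:
            drop.add(i)                         # misclassified majority: remove it
        else:
            # misclassified minority: remove its majority-class neighbours
            drop.update(int(j) for j in nn[i] if y[j] == majority_label)
    keep = [i for i in range(len(X)) if i not in drop]
    return X[keep], y[keep]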
Algorithms
Modify your algorithm in any way you can imagine (or combine it with sampling techniques or other classification algorithms) so that it compensates for the lack of data from one class.
Adjusted KNN
Run the traditional KNN algorithm, but assign different weights to the distances between data points so that the minority class is not drowned out by sheer numbers (one possible realisation is sketched below).
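One way to realise this, sketched below: rescale each distance by a per-class weight, so samples of the rare class look closer and enter the neighbourhood more often. The inverse-frequency default is an assumption, not something the slides prescribe:

import numpy as np

def adjusted_knn_predict(X_train, y_train, x, k=5, class_weight=None):
    if class_weight is None:
        # Assumed default: weight each class by its inverse frequency.
        labels, counts = np.unique(y_train, return_counts=True)
        class_weight = dict(zip(labels.tolist(), (counts.sum() / counts).tolist()))
    d = np.linalg.norm(X_train - x, axis=1)
    w = np.array([class_weight[c] for c in y_train.tolist()])
    nn = np.argsort(d / w)[:k]                 # larger weight => "closer"
    votes, n = np.unique(y_train[nn], return_counts=True)
    return votes[n.argmax()]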
Cost Modifying
Some algorithms base their classification decisions on a cost matrix. Where class imbalance is a problem, the penalty for misclassification should be adjusted to compensate for the class ratio (a sketch follows below).
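A sketch with a hypothetical cost matrix: predictions minimise expected cost instead of maximising probability, so a modest minority probability can already trigger a minority prediction:

import numpy as np

# Rows = true class, columns = predicted class. The 10x penalty for
# missing a minority (class 1) sample is a made-up illustration.
COST = np.array([[0.0, 1.0],     # true 0: correct, false alarm
                 [10.0, 0.0]])   # true 1: miss, correct

def cost_sensitive_decision(p_pos):
    p = np.array([1.0 - p_pos, p_pos])   # [P(y=0), P(y=1)]
    return int((p @ COST).argmin())      # class with the lowest expected cost

print(cost_sensitive_decision(0.2))      # -> 1, despite P(y=1) being only 0.2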
One Class Learning
In some specific instances of classification problems (such as multimodality in the domain space), the classifier can behave better on a one-class identification problem than on a two-class classification one (a sketch follows below).
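A sketch using scikit-learn's OneClassSVM on toy data (the nu value is an arbitrary assumption): train only on the abundant class and treat whatever the model flags as an outlier as the rare class.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_majority = rng.normal(0.0, 1.0, size=(500, 2))          # abundant class only
X_test = np.vstack([rng.normal(0.0, 1.0, size=(5, 2)),    # 5 majority-like
                    rng.normal(5.0, 0.5, size=(5, 2))])   # 5 rare/unusual

clf = OneClassSVM(nu=0.1, kernel="rbf", gamma="scale").fit(X_majority)
pred = clf.predict(X_test)      # +1 = majority-like, -1 = outlier
is_minority = pred == -1        # treat outliers as the rare class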
Thresholds
Create instances of one-class classifiers and apply thresholds to their membership scores so that they are tuned to compensate for the imbalance (a sketch follows below).
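A minimal sketch: lower the decision threshold on the minority-membership score below the usual 0.5. The 0.2 value is an arbitrary assumption, to be tuned on validation data:

import numpy as np

def predict_with_threshold(scores, threshold=0.2):
    # scores: membership scores for the minority class, in [0, 1].
    return (np.asarray(scores) >= threshold).astype(int)

print(predict_with_threshold([0.30, 0.10]))   # -> [1 0]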
Adjust Probabilities on Decision Trees
Modify the class probabilities in a decision tree to favour the classes with fewer instances (a sketch follows below).
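One practical way to get this effect, sketched with scikit-learn (class_weight is that library's mechanism, not something the slides name): reweighting samples inversely to class frequency shifts the split criteria and the leaf class probabilities toward the rare class.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] > 1.8).astype(int)               # rare positive class (~4%)

tree = DecisionTreeClassifier(class_weight="balanced", max_depth=4).fit(X, y)
print(tree.predict_proba(X[:5]))              # leaf probabilities, rebalanced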
Evaluating Classifiers: Warning!
• Err = (Fn + Fp) / (Fn + Fp + Tn + Tp)
• Acc = (Tn + Tp) / (Fn + Fp + Tn + Tp)
...it is straightforward to create a classifier having an accuracy of 99% (or an error rate of 1%) in a domain where the majority class proportion corresponds to 99% of the examples, by simply forecasting every new example as belonging to the majority class.
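The quote in numbers, with made-up labels:

import numpy as np

y_true = np.array([0] * 99 + [1])     # 99% majority class
y_pred = np.zeros(100, dtype=int)     # always predict the majority

acc = (y_pred == y_true).mean()                                       # 0.99
recall = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()  # 0.0
print(acc, recall)    # high accuracy, yet the minority class is never found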
Conclusions
• The class imbalance problem depends on:
  • Degree of the class imbalance
  • Complexity of the search space
  • Size of the training set
  • Classifier
• The problem is not inherent to unbalanced sets as long as they contain a representative number of examples of each class
• It is important to take into account the balance of samples of each class
• It is easy to be cheated by evaluation metrics if we do not know enough about them [10]
• The use of a specific balancing technique depends on the problem to solve
• Knowing the search space is crucial to understanding the behaviour of an algorithm
References I
[1] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter, 6(1):20, June 2004.
[2] N. Chawla. Data mining for imbalanced datasets: An overview. Data Mining and Knowledge Discovery Handbook, 2005.
[3] T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1923, Oct. 1998.
[4] X. Guo, Y. Yin, C. Dong, G. Yang, and G. Zhou. On the class imbalance problem. 2008 Fourth International Conference on Natural Computation, pages 192–201, 2008.
[5] N. Japkowicz. The class imbalance problem: A systematic study. Intelligent Data Analysis, 2002.
[6] S. Kotsiantis. Handling imbalanced datasets: A review. GESTS International …, 30, 2006.
[7] M. Longadge. Class imbalance problem in data mining: Review. International Journal of Computer Science and Network, 2(1), 2013.
References II
[8] L. Puente-Maury and A. López-Chau. Método rápido de preprocesamiento para clasificación en conjuntos de datos no balanceados [Fast preprocessing method for classification in unbalanced data sets]. rcs.cic.ipn.mx, 73:129–142, 2014.
[9] E. Ramentol, Y. Caballero, R. Bello, and F. Herrera. SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowledge and Information …, 2012.
[10] S. Salzberg. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 328:317–328, 1997.