2014 13th International Conference on Machine Learning and Applications
OUPS: a combined approach using SMOTE and Propensity Score Matching

William A. Rivera, Amit Goel, and J. Peter Kincaid
Institute for Simulation and Training, University of Central Florida, Orlando, FL, USA
[email protected], [email protected], [email protected]
Abstract—Building accurate classifiers is difficult when using skewed or imbalanced data, which is typical of real-world data sets. Two popular approaches for improving classification accuracy and statistical comparisons on imbalanced data sets are the synthetic minority over-sampling technique (SMOTE) and propensity score matching (PSM). We introduce a novel sampling approach, over-sampling using propensity scores (OUPS), that blends the two, is simple and easy to perform, and yields improvements in accuracy and sensitivity over both SMOTE and PSM. We assess the performance of the proposed approach through a simulation experiment and report several performance metrics showing where the approach fares well and where it falls short in comparison to the others.
I. INTRODUCTION

In classification problem solving based on machine learning, real-world data sets usually contain disproportionate sample sizes, making the task of accurately predicting group membership on new data difficult. In extreme cases of class imbalance (classes that are not equally represented) the classifier will typically prefer to assign data to the majority group due to this inherent bias. Class-imbalanced data frequently occurs in fraud detection, mammography of cancerous cells and post-term births, where extreme ratios of 100,000 to 1 have been reported [1], [2].

The machine learning community has approached class imbalance with techniques such as assigning costs to the training examples, using active learning methods, using kernel-based methods, or re-sampling the data set, e.g. over- and under-sampling [1], [3]. These approaches are useful when the data is expensive or difficult to collect, which is often the case.

Cost assignment techniques attempt to correct class imbalance by applying different misclassification costs to the majority and minority groups. Active learning typically retains the most relevant samples closest to the decision boundary, thus sub-sampling evenly between groups. Kernel-based methods operate in the feature space, using similarity measures to map the data into a dot-product space in order to look for linear relationships. Over-sampling generates new data based on the minority group to increase its size so that it closely matches the majority group. Under-sampling removes samples from the majority group so that the group counts closely match.

The difficulty with cost approaches is that the cost of misclassification is often unknown, while kernel and active learning methods tend to be inherently biased towards the majority group. Re-sampling techniques offer a simple alternative, although tuning them remains challenging [3]. Two re-sampling approaches popularly seen in the literature are the synthetic minority over-sampling technique (SMOTE) and propensity score matching (PSM). SMOTE and PSM were designed with different intents, and we were unable to find any research literature comparing, contrasting or combining the two. We feel that re-sampling techniques remain a simple and straightforward way of handling class imbalance and should be explored further. In this paper we conduct an experiment to study the relationship between SMOTE and PSM with regard to sampling, and we introduce a novel approach named over-sampling using propensity scores (OUPS). We applied four popular machine learning algorithms to data sets with various features, sizes and degrees of class imbalance and assessed sampling performance. Sections I, II and III discuss PSM, SMOTE, OUPS and the experimental design. In Section IV the results are discussed and interpreted. Section V provides our conclusions.

A. Propensity Score Matching (PSM)

The propensity score is defined as the conditional probability of membership in the treatment group (typically the minority group) versus the control group (typically the majority group) given the observation's covariates, formally:

e(x) = P(G = 1 | X)    (1)

The goal in propensity score matching (PSM) is to select a subset of observations from the majority group whose logits of the estimated propensity score are closest, under a matching metric, to those of observations in the minority group. The resulting sub-sampled data set is reduced into two mutually exclusive groups of equal size, allowing an apples-to-apples comparison between the groups.

Matching metric approaches include nearest neighbor (kNN) matching, which may be constrained within a threshold (known as caliper matching) on the calculated logit of the estimated propensity score, and Mahalanobis distance matching used in conjunction with the logit of the estimated propensity score [4], [5]. The Mahalanobis distance is defined as:
d(u, v) = (u − v)^T C^(−1) (u − v)    (2)

where u is the vector of covariates (features) of an observation from the minority group Gj, v is the vector of features of an observation from the majority group Gi, and C is the covariance matrix computed over the full set of the majority group.
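As an illustration of equation (2), the following minimal NumPy sketch (the data and variable names are ours, for illustration only) scores every majority candidate against one minority observation:

```python
import numpy as np

def mahalanobis_sq(u, v, C_inv):
    """Quadratic form of equation (2); take the square root for the
    conventional Mahalanobis distance."""
    diff = u - v
    return float(diff @ C_inv @ diff)

rng = np.random.default_rng(0)
majority = rng.normal(size=(200, 4))  # majority group Gi (rows = observations)
u = rng.normal(size=4)                # one minority observation from Gj

# C is the covariance matrix of the full majority group, as stated above.
C_inv = np.linalg.inv(np.cov(majority, rowvar=False))
d = [mahalanobis_sq(u, v, C_inv) for v in majority]
best_match = int(np.argmin(d))        # index of the closest majority case
```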
As an example, in one-to-one propensity score matching without replacement, a single score is assigned to each observation using the logistic function (equation 3), whose feature weights can be estimated through linear discriminant analysis (LDA) or logistic regression:

f(X) = 1 / (1 + e^(−(B·X + α)))    (3)

where B = {β1, β2, ..., βn} are the estimated weights, X = {x1, x2, ..., xn} are the covariates in the feature space and α is the intercept term. Observations from the majority group Gi are selected based on the match criterion and removed from the pool of available candidates until all observations in the minority group Gj have been paired.
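A minimal sketch of this one-to-one matching procedure, assuming scikit-learn for the propensity model (the experiments in Section III used the R MatchIt package; all data and names here are our illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical imbalanced data: X are covariates, y = 1 marks the
# minority (treatment) group.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (rng.random(300) < 0.2).astype(int)

# Estimate propensity scores e(x) = P(G = 1 | X) with logistic regression.
e = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
logit = np.log(e / (1 - e))  # matching is performed on the logit

# One-to-one nearest-neighbor matching without replacement on the logit.
minority = np.flatnonzero(y == 1)
pool = list(np.flatnonzero(y == 0))
pairs = []
for i in minority:
    j = min(pool, key=lambda m: abs(logit[m] - logit[i]))  # closest candidate
    pairs.append((i, j))
    pool.remove(j)  # without replacement: drop the matched candidate
```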
B. Synthetic Minority Over-sampling Technique (SMOTE)

SMOTE performs a combination of under-sampling on the majority group and over-sampling on the minority group. To generate new synthetic samples for the minority group, each minority observation is visited in turn and its k nearest neighbors (kNN) are selected, where k is determined by the over-sampling ratio specified relative to the majority sample size. For example, if an over-sampling ratio of 300% is needed, 3 nearest neighbors are selected. SMOTE then picks one of the k neighbors at random and generates a new instance N based on the features of both the random neighbor xj and the original observation xi. The new instance is created from the feature vector of xi plus the feature difference between xj and xi multiplied by a random number r:

N = xi + r · (xj − xi), 0 ≤ r ≤ 1    (4)
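The synthesis step of equation (4) can be sketched as follows (a minimal NumPy illustration of the interpolation step only; the DMwR package used in Section III provides a full implementation including the under-sampling step):

```python
import numpy as np

def smote_synthesize(minority, ratio, k=5, rng=None):
    """Generate `ratio` synthetic cases per minority observation using the
    interpolation of equation (4)."""
    if rng is None:
        rng = np.random.default_rng()
    synthetic = []
    for i, xi in enumerate(minority):
        # k nearest minority neighbors of xi (Euclidean), excluding xi itself.
        d = np.linalg.norm(minority - xi, axis=1)
        neighbors = np.argsort(d)[1:k + 1]
        for _ in range(ratio):
            xj = minority[rng.choice(neighbors)]  # one neighbor at random
            r = rng.random()                      # 0 <= r <= 1
            synthetic.append(xi + r * (xj - xi))  # equation (4)
    return np.asarray(synthetic)

# Example: 300% over-sampling -> 3 synthetic cases per minority observation.
minority = np.random.default_rng(2).normal(size=(50, 4))
new_cases = smote_synthesize(minority, ratio=3)
assert len(new_cases) == 150
```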
For the under-sampling step, observations are randomly removed from the majority group according to a specified under-sampling percentage. If the majority group had 100 samples and the minority group had 50, specifying 200% would mean that the minority group should end up with twice as many elements as the reduced majority group; hence the majority group would be reduced to 25 samples [1].
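The worked example above (100 majority and 50 minority samples at 200%) can be reproduced with a short sketch reflecting our reading of the percentage semantics:

```python
import numpy as np

def undersample_majority(majority, n_minority, percent, rng=None):
    """Randomly drop majority samples so the minority ends up at `percent`
    of the reduced majority size (200 -> minority twice the majority)."""
    if rng is None:
        rng = np.random.default_rng()
    target = int(n_minority / (percent / 100.0))   # e.g. 50 / 2 = 25 kept
    keep = rng.choice(len(majority), size=target, replace=False)
    return majority[keep]

majority = np.arange(100).reshape(100, 1)  # 100 majority samples
reduced = undersample_majority(majority, n_minority=50, percent=200)
assert len(reduced) == 25                  # matches the worked example
```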
II. OVER-SAMPLING USING PROPENSITY SCORES (OUPS)

Under-sampling data with imbalanced classes can affect accuracy, especially if the imbalance is particularly high. Although SMOTE provides a blend of over- and under-sampling, we see no need to remove any of the majority samples from imbalanced data, as doing so can also introduce variance. The algorithm for generating synthetic data can also be simplified by reducing the search-and-select process to picking fixed neighbors rather than random ones, since the newly synthesized data already contains random features.

The propensity score is a single metric which represents the overall likelihood of an observation belonging to a specific group. The metric is straightforward to calculate and works especially well with confounding variates in the feature space.

One drawback to using the PSM approach for building accurate models is that it reduces the overall number of samples, thus increasing variance. For observational studies this may not be a concern, which may help explain why the two approaches have not usually been studied together.

We therefore blend the two approaches by using the propensity score as the match criterion, without performing under-sampling on the majority group. Using the same approach that SMOTE uses for creating synthetic samples, but without randomly picking neighbors, we generate new cases according to the over-sampling amount needed to closely match the majority group size. Thus when over-sampling of 300% is needed, the 3 observations with the closest propensity scores are selected and 3 new samples are generated using equation 4 for each observation in the minority group. Algorithm 1 depicts the OUPS approach.

Algorithm 1 Over-sampling Using Propensity Scores

procedure OUPS
 1: Initialize: MinorityGroup, MajorityGroup
 2: Initialize: NewData
 3: k ← CalculateNeededKObservations()
 4: j ← length of MinorityGroup
 5: while j ≠ 0 do
 6:   kGroup[] ← findClosest(MinorityGroup, k)
 7:   newCases[] ← GenerateNewCases(kGroup)
 8:   NewData ← assign newCases
 9:   j ← j − 1
10: end while
11: NewData ← assign MinorityGroup
12: NewData ← assign MajorityGroup
13: return NewData

The OUPS approach is both simple and easy to implement, and it eliminates the need for under-sampling, which may introduce variance. By using closely matched propensity scores for generating synthetic samples we are able to synthesize similar observations based on closely matched probabilities, which results in similar proportions and effectively reduces bias.
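A minimal Python sketch of Algorithm 1 under our reading of it is shown below. The propensity model, the choice of k as the over-sampling multiple minus one, and all function names are our assumptions; the original was implemented in R:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def oups(X_min, X_maj, rng=None):
    """Sketch of Algorithm 1: over-sample the minority group to roughly the
    majority size by interpolating (equation 4) between each minority
    observation and its k nearest minority neighbors in propensity score."""
    if rng is None:
        rng = np.random.default_rng()
    # Our reading of CalculateNeededKObservations(): the over-sampling
    # multiple needed to close the gap, e.g. a 4:1 imbalance needs k = 3.
    k = max(int(round(len(X_maj) / len(X_min))) - 1, 1)

    # Propensity scores e(x) = P(G = 1 | X) from a logistic model.
    X = np.vstack([X_min, X_maj])
    y = np.r_[np.ones(len(X_min)), np.zeros(len(X_maj))]
    e = LogisticRegression().fit(X, y).predict_proba(X_min)[:, 1]

    new_cases = []
    for i, xi in enumerate(X_min):
        # Fixed (not random) neighbors: the k minority observations whose
        # propensity scores are closest to that of xi.
        nearest = np.argsort(np.abs(e - e[i]))[1:k + 1]
        for j in nearest:
            r = rng.random()
            new_cases.append(xi + r * (X_min[j] - xi))  # equation (4)
    # NewData = synthetic cases + original minority + original majority.
    return np.vstack([np.asarray(new_cases), X_min, X_maj])

# Example: a 4:1 imbalance; OUPS brings the minority side to ~parity.
rng = np.random.default_rng(4)
balanced = oups(rng.normal(1.0, 1.0, (50, 4)), rng.normal(0.0, 1.0, (200, 4)), rng)
```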
III. EXPERIMENTAL DESCRIPTION

Four data sets with varying degrees of imbalance and features from different industries were used to test our approach. New data sets were created using propensity score matching, SMOTE and OUPS. Four different machine learning algorithms were trained on the original and simulated data sets. Each resulting model was then applied to a test set created from the original data set before any synthetic data was generated, in order to evaluate how well each classification model would perform on real data. The effectiveness of each sampling approach was measured by evaluating each algorithm with a confusion matrix.

A. Experimental data sets

New data sets were synthesized using the R statistical software from four data sets obtained from the Machine Learning Repository¹. The DMwR package was used for SMOTE, and the MatchIt package for PSM. An algorithm for performing over-sampling using propensity scores (OUPS) was implemented directly. Table I summarizes the class balance for each newly synthesized data set.
¹ UCI ML databases at http://archive.ics.uci.edu/ml/
1) The Pima Indian Diabetes data set [1] contains 768 observations with a 35% class imbalance. The minority group represents individuals diagnosed with diabetes from the Pima Indian tribe located near Phoenix, AZ.
2) The Adult data set [1] contains 30721 observations with a 25% class imbalance. The minority group represents individuals who make more than $50,000 in annual income; the data was originally extracted from the census bureau database.
3) The Bank data set contains 45211 observations with a 12% class imbalance. The minority group represents individuals who subscribed to a term deposit during a bank marketing campaign.
4) The Readmit data set contains 100244 observations with a 13% class imbalance. The minority group represents individuals who were readmitted to the hospital within a 30-day period after being seen.
TABLE I: Data distribution (group counts and dependent-variable statistics)

| Data Sampling Technique | Majority (Gi = 0) | Minority (Gj = 1) | Mean   | Std. deviation | Variance | Kurtosis | Skewness |
|-------------------------|-------------------|-------------------|--------|----------------|----------|----------|----------|
| Pima                    | 500               | 268               | 0.3490 | 0.4770         | 0.2275   | -1.602   | 0.6325   |
| Pima w/ SMOTE           | 536               | 804               | 0.6    | 0.4901         | 0.2402   | -1.8351  | -0.04078 |
| Pima w/ PSM             | 171               | 171               | 0.5    | 0.5007         | 0.2507   | -2.0058  | 0        |
| Pima w/ OUPS            | 500               | 535               | 0.5169 | 0.5            | 0.25     | -1.9974  | -0.0676  |
| Adult                   | 23073             | 7648              | 0.249  | 0.4324         | 0.187    | -0.6518  | 1.1611   |
| Adult w/ SMOTE          | 22944             | 30592             | 0.5714 | 0.4949         | 0.2449   | -1.9167  | -0.2887  |
| Adult w/ PSM            | 5761              | 5761              | 0.5    | 0.5            | 0.25     | -2.0002  | 0        |
| Adult w/ OUPS           | 23073             | 22940             | 0.4986 | 0.5            | 0.25     | -2       | 0.0058   |
| Bank                    | 39922             | 5289              | 0.117  | 0.3214         | 0.1033   | 3.6803   | 2.3833   |
| Bank w/ SMOTE           | 31734             | 37023             | 0.5385 | 0.4985         | 0.2485   | -1.976   | -0.1543  |
| Bank w/ PSM             | 4686              | 4686              | 0.5    | 0.5            | 0.25     | -2       | 0        |
| Bank w/ OUPS            | 39922             | 42263             | 0.5142 | 0.4998         | 0.2498   | -1.9968  | -0.057   |
| Readmit                 | 88994             | 11250             | 0.1122 | 0.3156         | 0.0996   | 4.037    | 2.457    |
| Readmit w/ SMOTE        | 67500             | 78750             | 0.5385 | 0.4985         | 0.2485   | -1.976   | -0.1543  |
| Readmit w/ PSM          | 11250             | 11250             | 0.5    | 0.5            | 0.25     | -2       | 0        |
| Readmit w/ OUPS         | 88994             | 89951             | 0.5027 | 0.5            | 0.25     | -1.9999  | -0.0107  |
B. Algorithm Selection

Four different machine learning algorithms were applied to the synthetic data to generate coefficients for testing against a test set drawn from the original data set before any sampling was performed. Each data set was then sampled to create a training set using random iterative splitting, starting from a 50/50 train/test split and moving to a 75/25 split in increments of 5, resulting in 6 different splits for each machine learning algorithm per data set; a sketch of this splitting scheme follows the list below. The only test set used was the one generated from the original data set. Since the main concern is the sampling approach rather than evaluation of the feature space, cross validation was not applied. The following algorithms were selected:

• Logistic Regression
• Support Vector Machine
• Neural Network
• Linear Discriminant Analysis
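The splitting scheme described above can be sketched as follows (a minimal illustration assuming scikit-learn; the data and the use of stratified splits are our assumptions, since the paper does not state whether splits were stratified):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 4))            # hypothetical feature matrix
y = (rng.random(1000) < 0.2).astype(int)  # imbalanced class labels

# Six random splits, 50/50 up to 75/25 in steps of 5; re-sampling (SMOTE,
# PSM or OUPS) would be applied to the training side only, while the test
# side always comes from the original, unsampled data.
splits = []
for train_frac in (0.50, 0.55, 0.60, 0.65, 0.70, 0.75):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_frac, random_state=0, stratify=y)
    splits.append((X_tr, X_te, y_tr, y_te))
```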
C. Evaluation criteria

A confusion matrix similar to Table II was used to produce metrics of classification accuracy, sensitivity, specificity and F-measure. True positives (TP) are results correctly classified as belonging to the minority group, while true negatives (TN) are results correctly classified as belonging to the majority group. False positives (FP) are results that belonged to the majority group but were incorrectly classified as belonging to the minority group, and false negatives (FN) are results that belonged to the minority group but were incorrectly classified as belonging to the majority group. Accuracy is the proportion of correctly classified results over the sum of the entire matrix, as shown in equation 5.

TABLE II: Confusion Matrix

|                | Predicted Positive | Predicted Negative |
|----------------|--------------------|--------------------|
| Truth Positive | TP                 | FN                 |
| Truth Negative | FP                 | TN                 |

A = (TP + TN) / (TP + TN + FP + FN)    (5)

For imbalanced classes the accuracy metric can be misleading, since the imbalanced class is usually small and the penalty for misclassifying it is overlooked. In a scenario where misclassification is costly, such as incorrectly diagnosing someone with a rare condition, the true positive rate, also called sensitivity, may be more appropriate. Sensitivity, sometimes referred to as the hit rate or recall, is the ability to identify a condition correctly: TP/(TP + FN). If the desire is instead to exclude a condition correctly, the specificity or true negative rate can be used: TN/(TN + FP). The F-measure (equation 6), a single utility score representing the harmonic mean of precision and sensitivity, was also included.

F = 2 · (precision · recall) / (precision + recall)    (6)
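All four metrics follow directly from the Table II counts; a minimal Python sketch (the counts are hypothetical, for illustration only) is:

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, specificity and F-measure from the Table II
    counts (positive = minority group)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)        # equation (5)
    sensitivity = tp / (tp + fn)                      # hit rate / recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)  # (6)
    return accuracy, sensitivity, specificity, f_measure

# Hypothetical counts for illustration only.
print(metrics(tp=40, fn=10, fp=20, tn=130))
```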
The F-measure can also be misleading, since it provides a single score for two measures that have different meanings. We do not consider the F-measure as relevant as the traditional accuracy measurement.
TABLE III: Results by Sample Technique (* marks the best result for each measure)

| Sampling Technique | Accuracy | Sensitivity | Specificity | F-measure |
|--------------------|----------|-------------|-------------|-----------|
| SMOTE              | 74.85%   | 39.68%      | 92.64% *    | 23.67% *  |
| PSM                | 58.84%   | 30.64%      | 83.55%      | 16.47%    |
| OUPS               | 81.57% * | 53.32% *    | 90.78%      | 20.88%    |

TABLE IV: Results by Data set (* marks the best result for each measure within a data set)

| Sampling Technique | Accuracy | Sensitivity | Specificity | F-measure |
|--------------------|----------|-------------|-------------|-----------|
| Pima w/ SMOTE      | 73.67%   | 58.11%      | 89.66% *    | 46.75% *  |
| Pima w/ PSM        | 49.08%   | 36.84%      | 66.29%      | 26.30%    |
| Pima w/ OUPS       | 76.32% * | 63.05% *    | 85.93%      | 43.93%    |
| Adult w/ SMOTE     | 74.29%   | 49.15%      | 93.67% *    | 21.91% *  |
| Adult w/ PSM       | 73.43%   | 55.43% *    | 85.79%      | 12.86%    |
| Adult w/ OUPS      | 77.81% * | 53.85%      | 91.56%      | 21.31%    |
| Bank w/ SMOTE      | 81.41%   | 36.81%      | 97.47% *    | 18.05% *  |
| Bank w/ PSM        | 51.32%   | 14.70%      | 90.89%      | 14.10%    |
| Bank w/ OUPS       | 83.77% * | 40.44% *    | 96.71%      | 17.01%    |
| Readmit w/ SMOTE   | 69.96%   | 15.43%      | 89.86%      | 7.82%     |
| Readmit w/ PSM     | 62.30%   | 16.03%      | 91.65% *    | 11.98% *  |
| Readmit w/ OUPS    | 86.69% * | 56.03% *    | 88.86%      | 0.40%     |
IV. RESULTS
The results from each run and machine learning algorithm were averaged over each data set and also per sampling technique. Table III summarizes the performance of each sampling technique, and Table IV shows the results broken out by data set.
Overall, OUPS outperformed SMOTE and PSM in terms of accuracy and sensitivity, while SMOTE outperformed all other sampling techniques in specificity and F-measure. In addition to increased classification accuracy, OUPS produced fewer false negatives as a result of its high sensitivity. False negatives represent misclassification of a minority group example as belonging to the majority group; thus OUPS removed some of the bias caused by the imbalance, improving on the other sampling approaches by approximately 15-25%. Although OUPS did not outperform SMOTE in terms of specificity and F-measure, it performed relatively well, averaging 90.78% specificity with a close F-measure, making OUPS a reliable approach.
Table IV provides the performance percentages for each sampling technique and data set used in the experiment. OUPS and SMOTE were consistently close and were the top performers across all performance measures used. OUPS performed better on accuracy and sensitivity, while SMOTE performed better on specificity and F-measure in most cases. PSM did perform significantly better on specificity and F-measure for the Readmit data set, which also showed the highest disparity in F-measure compared to the other data sets.
V. CONCLUSION AND FUTURE WORK

We introduced a novel sampling approach (OUPS) that resulted in higher accuracy and sensitivity than SMOTE or propensity score matching alone. We conducted experiments using the accuracy, specificity, sensitivity and F-measure performance measures and analyzed the results. The OUPS approach outperformed the other sampling approaches in overall accuracy and sensitivity, while SMOTE outperformed the other sampling approaches in specificity and F-measure. We plan to study the OUPS approach further by performing significantly more simulations, including larger data sets with greater levels of imbalance (closer to 1%), and by evaluating classifier-specific performance to see which re-sampling techniques perform better with each classifier.

The use of PSM affected the Readmit data set differently than the other data sets. It appears that OUPS may perform drastically better on accuracy and sensitivity in cases where propensity score matching yields higher specificity and F-measure. Clinical data depends on sensitivity more than data from most industries: incorrectly classifying an ill patient is not only costly but can endanger lives. Further study using imbalanced clinical data remains an area for future research. We also plan to apply this work to imbalanced data sets observed in our ongoing research on distributed virtual worlds and formal models of virtual enterprise architecture [6], [7].

REFERENCES

[1] N. V. Chawla, K. W. Bowyer, and L. O. Hall, "SMOTE: Synthetic Minority Over-sampling Technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.
[2] R. B. D'Agostino, "Tutorial in biostatistics: propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group," Statistics in Medicine, vol. 17, pp. 2265-2281, 1998.
[3] H. He and E. A. Garcia, "Learning from Imbalanced Data," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263-1284, Sep. 2009.
[4] P. C. Austin, "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies," Multivariate Behavioral Research, vol. 46, no. 3, pp. 399-424, May 2011.
[5] P. R. Rosenbaum and D. B. Rubin, "The central role of the propensity score in observational studies for causal effects," Biometrika, vol. 70, no. 1, pp. 41-55, 1983.
[6] A. Goel, S. K. Jha, I. Garibay, H. Schmidt, and D. Gilbert, "A Survey of Approaches to Virtual Enterprise Architecture: Modeling Languages, Reference Models, and Architecture Frameworks," Journal of Enterprise Architecture, vol. 7, no. 4, pp. 42-51, 2011.
[7] A. Goel, H. Schmidt, and D. Gilbert, "Formal models of virtual enterprise architecture: motivations and approaches," in Proceedings of PACIS 2010, AIS, 2010, paper 117.