Classification performance of data mining algorithms applied to breast cancer data

Vítor Santos & Nuno Datia & M.P.M. Pato
ISEL, Lisbon, Portugal

ABSTRACT: In this paper, we study how several classification algorithms perform when applied to a breast cancer dataset. The challenge is to develop models for computer-aided detection (CAD) capable of classifying, at early stages, masses spotted in X-ray images. The dataset was made available at KDD CUP 2008. The imbalanced nature of the dataset and its high-dimensional feature space pose problems for the modelling, which are tackled using dimension reduction techniques. The algorithms are compared using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, which plots the true-positive rate (TPR) against the false-positive rate (FPR). Other metrics, such as patient sensitivity and FPR, are also used and discussed. We find that the Naïve Bayes classifier achieved the best performance irrespective of the dataset combination, and allows controlled trade-offs between false positives and false negatives.

1 INTRODUCTION

According to 2008 GLOBOCAN estimates, breast cancer is by far the most frequent cancer among women, with an estimated 1.38 million new cases diagnosed (23% of all cancers), and ranks second overall (10.9% of all cancers). The American Cancer Society estimates that, in the US in 2013, about 3.6 times more new cases of invasive breast cancer than non-invasive cases will be diagnosed in women. The chance that breast cancer will be responsible for a woman's death is about 3%¹. Eurostat² reported in 2009 that, in the EU, one in every six women's deaths from cancer is caused by breast cancer. The incidence and mortality have increased in developing countries (Ferlay et al. 2010, Jemal et al. 2011). As stated by the ACS, around 90% of breast cancers are curable if detected at an early stage and properly treated. Routine screening promotes early diagnosis, with regular medical surveillance depending on the presence or absence of risk factors. New methods have been developed for early detection and treatment, which helped decrease cancer-related death rates (Nicolosi et al. 2013, Schnall 2000, Warner 2003, Preidt 2013). Nowadays the number of imaging techniques is quite large and is continuously increasing. The new imaging techniques produce three-dimensional (3D) and four-dimensional (4D) digital datasets. The images can be reconstructed and visualised in 3D, allowing radiologists to localise and assess signs of disease. Another advantage is given by studying the behaviour of tissues over time.

¹ http://www.cancer.org/
² http://ec.europa.eu/eurostat/

A medical image does not, by itself, give any hint about the diagnosis or the therapy: the collected images must be analysed and interpreted by experts. The true (malignant or benign) labels are obtained by biopsy of suspicious or malignant lesions. The 3D or 4D datasets contain hundreds of images, and their number per examination is still increasing. Each image must be inspected, and this takes much time. The full inspection process can be error-prone, due to fatigue and habituation of the radiologist. The problems discussed above can be partly solved by exploiting the digital format of the new data. It is possible to store digital data in a centralised repository, allowing radiologists to cross-validate their diagnoses or to access existing examinations. Furthermore, digital data can be processed by a computer. The extensive use of medical information systems and the enormous growth of medical databases require combining traditional manual data analysis methods with computer-aided detection (CAD) solutions (Castellino 2005). Technology plays an important role in increasing the detection rate of new breast cancers. Machine learning techniques, namely classification algorithms, are among those tools. According to Domingos (2012), machine learning algorithms can figure out how to perform important tasks by generalising from examples. In this paper, we use four different data mining algorithms to classify candidate regions of interest (ROI) in medical images (mammograms).

In this particular domain, the prevalence of cancer regions is extremely low. The class imbalance may reduce the performance achieved by existing learning and classification systems (Japkowicz and Stephen 2002). In particular, it is difficult to develop a classification model suitable for real medical requirements. Our goal is to maximise the area under the receiver operating characteristic (ROC) curve, abbreviated AUC, in the clinically relevant region of 0.2–0.3 false positives (FP) per image. In a second task, our aim is to reach 100% sensitivity for malignant patients. The results reveal similar AUC values for many algorithms, but with differing FPR. Different dimension reduction techniques produce different results, and one technique may be preferred over another depending on the measure we want to optimise. The paper is organised as follows: Section 2 reviews related work; Section 3 briefly describes the dataset used in the experimental study; Section 4 reviews the four learning algorithms used and describes the ROC analysis; Section 5 presents the methodology used to develop the classification models. The experimental results are presented in Section 6. Discussion and future work are given in the last section.

2 BACKGROUND AND RELATED WORK

Data mining is a very popular technique nowadays, covering areas as diverse as health and finance. Given their size, the data stored in databases need special techniques for identifying, extracting and evaluating variables. These databases can be a helpful support for handling medical decision-making problems. Reading mammograms is a very demanding job for radiologists. Their judgements depend on training, experience, and subjective criteria. A literature survey showed that there have been several studies using machine learning algorithms for classification of instances in medical diagnostics datasets. Machine learning methods have been applied to various medical domains to improve medical decision-making (Delen et al. 2005, Bellaachia and Guven 2006, Bellazzi and Zupan 2008, Sarvestani et al. 2010). Degenhard et al. (2002) evaluated an artificial neural network classifier using receiver operating characteristic (ROC) curve analysis and compared the results with those obtained using a proposed radiological scoring system. The AUC is a widely used measure of performance for classification and diagnostic rules (Hanley and McNeil 1982, Hanley et al. 1989, Bradley 1997, Fawcett 2006). D. J. Hand (Hand 2009, Hand and Anagnostopoulos 2013) has recently argued that the AUC is fundamentally incoherent as a measure of aggregated classifier performance and proposed an alternative measure. High imbalance occurs in real-world domains where the decision system wants to detect a rare but important case (Japkowicz and Stephen 2002, Barandela et al. 2004, Verleysen and François 2005, Thai-Nghe et al. 2011).

3 THE DATA

The experiments used the dataset made available for the KDD CUP 2008. It consists of 102,294 candidate ROI, each described by 117 features, covering a set of 118 malignant and 1,594 normal patients. Each candidate has the image and patient IDs, the (x, y) location, several features, and a class label telling whether it is malignant. The features were created by proprietary algorithms applied to the X-ray images, and the class labels are based on a radiologist's interpretation, a biopsy, or both. We split the dataset into training and testing sets, and all the evaluations were made using only the test dataset (see Section 5).
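The paper does not show how the data were loaded or split. The snippet below is only an illustrative R sketch under assumed names: a CSV export of the candidate table (kddcup2008_features.csv), a patient_id column, and a 70/30 split made at the patient level are all assumptions, not details stated by the authors.

# Hypothetical loading/splitting sketch; file name, CSV format and the
# patient-wise 70/30 split are assumptions for illustration only.
set.seed(42)

dataset <- read.csv("kddcup2008_features.csv")  # assumed export of the candidate ROI table

# Split by patient ID so candidates from one patient never appear in both sets
patients  <- unique(dataset$patient_id)
train_pat <- sample(patients, size = round(0.7 * length(patients)))
train     <- dataset[dataset$patient_id %in% train_pat, ]
test      <- dataset[!(dataset$patient_id %in% train_pat), ]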

4 THE LEARNING ALGORITHMS

The learning algorithms chosen for this experimental comparison were: Support Vector Machines (SVM), Bagging using the RPART function (BAG), Random Forest (RF) and Naïve Bayes (NB). We found these algorithms the most suitable for the classification task on our dataset. They cover instance-based learning (SVM), statistical learning (NB), and ensemble learning (BAG and RF).

The SVM is a supervised learning algorithm that can be used in classification problems. It settles a decision function (hyperplane) based on selected cases from the training data, termed support vectors (SVs). In general, this hyperplane corresponds to a nonlinear decision boundary in the input space. Traditional techniques for pattern recognition try to minimise the empirical risk, i.e. to optimise the performance on the training set. The SVM, in contrast, minimises the structural risk, i.e. the probability of misclassifying yet-to-be-seen patterns drawn from a fixed but unknown probability distribution (Cortes and Vapnik 1995).

Bagging generates multiple weak predictors and combines them to get a single, more robust predictor. The aggregation averages over the versions when predicting a numerical outcome (regression) and takes a plurality vote when predicting a class (classification). The multiple versions are formed by making bootstrap copies of the learning set and using these as new learning sets (Breiman 1996).

Random Forest is a method for building multiple decision trees at training time. The predicted class is the mode of the classes estimated by the individual trees. The method combines Breiman's "bagging" idea with the random selection of features, to build decision trees with controlled variation (Ho 1998).

The fourth technique is Naïve Bayes. It relies on the well-known Bayesian approach to produce a simple, clear, and fast classifier, particularly suited when the input dimension is high. It is called naïve because it assumes mutually independent attributes. In practice this is almost never true, but it can be approximated by preprocessing the data to remove dependent features (Witten and Frank 2005).

The algorithms are available in the R³ environment. R provides various statistical and graphical tools. According to Ihaka and Gentleman (1996), it is a versatile tool built with the help of an extensive community of researchers. The implementations provide models that output the probabilities of each class. This information can be used to choose the operating point on the ROC curves, based on the costs of FP and TP.

4.1 ROC analysis

Receiver operating characteristic (ROC) curves are a technique for visualising, organising and selecting classifiers based on their performance. They are a useful tool in biomedical and bioinformatics applications, among others. Generally, a ROC graph depicts the relative trade-off between benefits (true positives) and costs (false positives). However, conclusions are often reached through inconsistent use or insufficient statistical analysis. To compare classifiers, we may want to reduce ROC performance to a single scalar value representing expected performance; that is where the AUC is used. The AUC has an important statistical property: the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance (Hanley and McNeil 1982). A random classifier has AUC = 0.5, while an ideal classifier has AUC = 1. Thus, any "interesting" classifier has 0.5 < AUC ≤ 1. The best operating point depends on the trade-off between the cost of failing to detect positive cases (type II errors) and the cost of raising false alarms (type I errors). These costs need not be equal, although this is a common assumption. For Breiman et al. (1984), the AUC is also closely related to the Gini coefficient. Hand and Till (2001) point out that Gini + 1 = 2 × AUC.
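The paper does not include source code, so the following is only a minimal R sketch of how the four classifiers could be trained and then compared by the AUC of their ROC curves. The packages (e1071, randomForest, ipred, pROC), the data frames train and test, and the class labels "benign"/"malignant" are illustrative assumptions, not the authors' implementation.

# Minimal sketch: train the four classifiers and compare them by test-set AUC.
# Assumes data frames `train` and `test` with a factor column `class`
# (levels "benign"/"malignant") and numeric feature columns.
library(e1071)        # svm(), naiveBayes()
library(randomForest) # randomForest()
library(ipred)        # bagging() over rpart trees
library(pROC)         # roc(), auc()

models <- list(
  NB  = naiveBayes(class ~ ., data = train),
  RF  = randomForest(class ~ ., data = train),
  BAG = bagging(class ~ ., data = train, nbagg = 25),
  SVM = svm(class ~ ., data = train, type = "nu-classification",
            nu = 0.5, kernel = "radial", probability = TRUE)
)

# Probability that a candidate ROI is malignant, for each model
prob_malignant <- function(name, model) {
  if (name == "SVM") {
    p <- attr(predict(model, test, probability = TRUE), "probabilities")
  } else if (name == "NB") {
    p <- predict(model, test, type = "raw")
  } else {
    p <- predict(model, test, type = "prob")
  }
  p[, "malignant"]
}

# AUC of each model on the held-out test set
aucs <- sapply(names(models), function(n) {
  scores <- prob_malignant(n, models[[n]])
  as.numeric(auc(roc(test$class, scores, levels = c("benign", "malignant"))))
})
print(sort(aucs, decreasing = TRUE))

Because each model returns class probabilities, the same scores can also be thresholded at different operating points to trade false positives against false negatives, as discussed above.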

5 THE TRAINING METHODOLOGY

Given the high dimensionality of the data, we preprocess the dataset; otherwise, some of the algorithms cannot be applied. To deal with the so-called curse of dimensionality (Verleysen and François 2005), we perform feature selection procedures using correlation measures and feature ranking meta-algorithms. We also use training sets with different class distributions to deal with the size of the data as well as with the class imbalance problem. We consider two datasets: a training set and a test set. The set of training examples is used to produce the learned concept descriptions, and a separate set of test examples is needed to evaluate the accuracy.

³ http://www.r-project.org/

Table 1: First five features with high ranking.
     IG            GR            CS            SU
1    F.9           F.10          F.9           F.9
2    F.10          F.9           F.10          F.10
3    F.20          F.20          F.20          F.20
4    Left.Breast   Left.Breast   Left.Breast   Left.Breast
5    MLO           MLO           MLO           MLO
IG = Information Gain, GR = Gain Ratio, CS = Chi-square, SU = Symmetrical Uncertainty

When testing, the class labels are not presented to the algorithm. The algorithms take a test example as input and produce a class label — Malignant (T) or Benign (F) — with an associated probability.

5.1 Feature selection

Feature selection is considered successful if the dimensionality of the data is reduced and the accuracy of a learning algorithm improves or remains the same. Measuring accuracy on a test set is better than using the training dataset because examples in the test set have not been used to induce the concept descriptions. Using the training set to measure accuracy will typically give an optimistically biased estimate, especially if the learning algorithm overfits the training data. Pre-selection of informative features for classification is a crucial, although delicate, task. Feature selection should identify the features that improve classification, using a suitable measure of correlation between attributes and a sound procedure to select attributes based on this measure. An attribute is redundant if it can be derived from another attribute or set of attributes. Some redundancies can be detected by correlation analysis. Applying the Pearson correlation coefficient to all pairs of independent variables, we found 57 highly correlated pairs, with ρ ∈ [0.96, 1].

To do feature ranking, given the high dimensionality of the dataset, we need to reduce the number of instances so that the problem becomes tractable for the ranking algorithms. To lower the impact of this reduction on the ranking, we apply an iterative procedure of 50 trials. In each trial, we build a working dataset of 50,000 cases, using sampling with replacement — about half of the original dataset. Each working dataset is subjected to the same ranking algorithm, keeping the "weight" of each attribute across the intermediate iterations. At the end, the "weights" are summed to settle the rank and sorted in descending order. Following this procedure, we use four ranking algorithms: Information Gain (Burnham and Anderson 2002), Gain Ratio, Symmetrical Uncertainty (Witten and Frank 2005) and Chi-square (Liebetrau 1983). Table 1 shows the highest ranked variables. It illustrates the stability of the ranks, which seems independent of the algorithm used. The changes, when they happened, occurred in two consecutive positions, for instance for the F.9 and F.10 features.
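As an illustration of this procedure (the paper's own implementation is not shown), a sketch along the following lines could be written in R with the caret and FSelector packages. The data frame dataset, its class column, the use of numeric feature columns and the choice of information gain for the trials are assumptions.

# Illustrative sketch: correlation-based redundancy removal followed by
# feature ranking accumulated over 50 bootstrap trials of 50,000 cases.
library(caret)     # findCorrelation()
library(FSelector) # information.gain(), gain.ratio(), chi.squared(), symmetrical.uncertainty()

features <- setdiff(names(dataset), "class")

# Redundancy removal: drop one feature of every pair with |rho| >= 0.96
corr_mat  <- cor(dataset[, features])
redundant <- features[findCorrelation(corr_mat, cutoff = 0.96)]
kept      <- setdiff(features, redundant)

# Feature ranking: sum attribute weights over 50 resampled trials
weights <- setNames(numeric(length(kept)), kept)
for (trial in 1:50) {
  work <- dataset[sample(nrow(dataset), 50000, replace = TRUE), c(kept, "class")]
  ig   <- information.gain(class ~ ., data = work)  # one of the four measures
  weights <- weights + ig[kept, "attr_importance"]
}
ranking <- sort(weights, decreasing = TRUE)
head(names(ranking), 5)  # highest ranked features, cf. Table 1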

5.2 Imbalanced dataset

A dataset is imbalanced when one of the classes (the minority class), here the malignant patients, is heavily under-represented in comparison to the other (majority) class, the normal patients. This issue is particularly important in applications where it is costly to misclassify minority-class examples. Our dataset consists of 102,294 instances, of which only 623 are positive. The dataset is clearly imbalanced, with a ratio of about 1/164 between positive and negative instances. This may have an impact on data modelling, and there are several techniques to deal with it (Japkowicz and Stephen 2002). The dataset dimensionality, however, forces the use of undersampling techniques (Drummond et al. 2003), where the instances of the larger class are reduced to match the number in the smaller class. We also use the synthetic minority oversampling technique (SMOTE) (Chawla et al. 2002). SMOTE is an approach to the construction of classifiers from imbalanced datasets in which the minority class is oversampled by creating synthetic examples, rather than by oversampling with replacement. We also restrict the number of cases of the larger class, which means that the larger class is undersampled while the minority class is augmented with synthetic instances.

5.3 Datasets

There are two strategies to select instances for the datasets: random undersampling and SMOTE. The first produces datasets with 1,246 instances, containing all 623 positive cases and 623 randomly chosen negative instances. The second produces datasets with 2,000 instances, with an approximately balanced class distribution. The reliability of the approach relies on constructing classifiers for many training sets randomly chosen from the original instance set, where the instances in each training set use only a fraction of all the observed features. We build the datasets by combining the two instance-selection strategies with three strategies to select features: keeping all the features, removing the redundant features based on correlation, and selecting the top 66 features using the feature ranking algorithms. Table 2 shows the combinations, naming the datasets T1 to T6 (an illustrative code sketch of the two sampling strategies follows the table).

Table 2: Dataset combination.
                 Feature Ranking   Feature Selection   All Features
Undersampling    T1                T2                  T3
SMOTE            T4                T5                  T6
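The two sampling strategies could be sketched in R as follows. This is a hypothetical illustration: the DMwR package, the data frame dataset, its class column with levels "benign"/"malignant", and the SMOTE parameters are all assumptions, since the paper does not give its exact configuration.

library(DMwR)  # SMOTE()
set.seed(1)

pos <- dataset[dataset$class == "malignant", ]
neg <- dataset[dataset$class == "benign", ]

# Strategy 1: random undersampling -> all 623 positives plus as many
# randomly chosen negatives (1,246 instances in total)
under <- rbind(pos, neg[sample(nrow(neg), nrow(pos)), ])

# Strategy 2: SMOTE -> synthetic minority examples plus a reduced majority.
# perc.over/perc.under are illustrative; the paper only states that the
# resulting datasets have about 2,000 instances with a near-balanced mix.
smoted <- SMOTE(class ~ ., data = dataset, perc.over = 100, perc.under = 200)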

5.4 Comparison of algorithms and ROC

ROC is a graphical plot that illustrates the performance of a binary classifier. The decision maker must decide to which of the two states each test case belongs, in a setting where a population of test cases has been defined and sampled (van Erkel and Pattynama 1998). Although ROC curves may be used to evaluate classifiers, care should be taken when using them to draw conclusions about classifier superiority. Taking this into account, we use three criteria to compare the classifiers: AUC, sensitivity (patient level), and FPR (instance level). The best classifier is the one that simultaneously satisfies (1):

sensitivity = 100% ∧ max(AUC) ∧ min(FPR)    (1)
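To make the three criteria concrete, a small R helper along the following lines could compute them from per-candidate predictions. It is only a sketch under assumptions: the factor levels, the patient identifier column, and in particular the reading of patient sensitivity as detecting at least one malignant ROI per malignant patient are not spelled out in the paper.

library(pROC)

# truth/pred: factors with levels "benign"/"malignant"; score: predicted
# probability of malignancy; patient: patient identifier per candidate ROI
evaluate <- function(truth, pred, score, patient) {
  auc_val <- as.numeric(auc(roc(truth, score, levels = c("benign", "malignant"))))

  # Patient-level sensitivity: a malignant patient counts as detected when
  # at least one of her candidate ROI is predicted malignant (assumption)
  malignant_patients <- unique(patient[truth == "malignant"])
  detected <- sapply(malignant_patients,
                     function(p) any(pred[patient == p] == "malignant"))
  sensitivity <- mean(detected)

  # Instance-level false positive rate over the benign candidates
  fpr <- mean(pred[truth == "benign"] == "malignant")

  c(AUC = auc_val, patient_sensitivity = sensitivity, FPR = fpr)
}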

6 RESULTS

The algorithms are tested with different parameters, but the best results are achieved with the default values, except for the SVM. In this case, we optimised the ν-SVM, with ν = 0.5. We use the SVM with three kernels: SVM L for linear, SVM P for polynomial, and SVM R for radial. Figure 1 shows the AUC for the selected algorithms. As we can see, the performance of RF, SVM L and BAG seems independent of the type of dataset used. The NB shows a larger variation in performance, but it is the one that achieves the best AUC. The same independence is found for the FPR when using the RF and BAG algorithms, as depicted in Figure 2. Nevertheless, the best FPR is reached using the SVM R.

Figure 1: AUC results for different algorithms.

Figure 2: False positive rate (instance).

Figure 3: AUC for different dataset constructions.

Figure 4: False positive rate for different dataset constructions.

Figure 5: Patient sensitivity for different algorithms.

Figure 6: Patient sensitivity for different dataset constructions.

In terms of dataset types, it is clear in Figure 3 that T1 gets an overall better performance. The results for the other types seem similar. Nevertheless, it appears that feature ranking (T1 and T4) produces better results. For the FPR, T1 gives a small variance in performance, but it is with T5 that we get the best FPR value (Figure 4). The NB and SVM L achieve the best results for sensitivity, reaching 100% in all tests (Figure 5). This means that these algorithms correctly identify all the patients with cancer. In relation to the dataset type, Figure 6 shows that all types reach 100% sensitivity at least once. In spite of that, T3 gives the least variance in the results and T5 the worst. These results are summarised in Table 3. The only classifier that satisfies (1) is the NB combined with T1. The algorithm that achieved the best FPR was the SVM R combined with T5, but it also gets poor AUC and sensitivity.

Table 3: Best combinations for maximising AUC and sensitivity.
Algo+Dataset   AUC     Patient sensitivity   FPR
NB+T1          0.955   100%                  0.074
SVM L+T1       0.932   100%                  0.102
SVM L+T2       0.925   100%                  0.097
SVM P+T1       0.931   100%                  0.106
SVM P+T3       0.912   100%                  0.075
RF+T1          0.931   95%                   0.101
RF+T3          0.925   97.5%                 0.100
RF+T6          0.920   95%                   0.086
SVM R+T1       0.930   97.5%                 0.100
SVM R+T5       0.850   87.5%                 0.047
BAG+T1         0.920   97.5%                 0.123
BAG+T3         0.908   100%                  0.124
BAG+T6         0.900   95%                   0.102

7 DISCUSSION

Machine learning algorithms are an important tool for classifying breast cancer datasets. The algorithms are compared using different metrics, such as AUC, sensitivity and FPR. Based on our results, the AUC appears to be one of the best ways to evaluate classifier performance. There was a good agreement between AUC and patient sensitivity regarding the performance of the classification algorithms. The SVM depends on the kernel parametrisation, but, together with NB, it is able to deliver 100% sensitivity. We find that NB achieved the best performance, with the highest AUC and one of the lowest FPR values. Feature ranking seems a good solution to reduce the dimension of the dataset while retaining the information necessary to reach high classification performance.

REFERENCES


Barandela, R., R. M. Valdovinos, J. S. Sánchez, & F. J. Ferri (2004). The imbalanced training sample problem: Under or over sampling? In Structural, Syntactic, and Statistical Pattern Recognition, pp. 806–814. Springer.
Bellaachia, A. & E. Guven (2006). Predicting breast cancer survivability using data mining techniques. Age 58(13), 10–110.
Bellazzi, R. & B. Zupan (2008). Predictive data mining in clinical medicine: current issues and guidelines. International Journal of Medical Informatics 77(2), 81–97.
Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30(7), 1145–1159.
Breiman, L. (1996). Bagging predictors. Machine Learning 24(2), 123–140.
Breiman, L., J. H. Friedman, R. A. Olshen, & C. J. Stone (1984). Classification and regression trees. Wadsworth International Group.
Burnham, K. P. & D. R. Anderson (2002). Model selection and multi-model inference: a practical information-theoretic approach. Springer Verlag.
Castellino, R. A. (2005). Computer aided detection (CAD): an overview. Cancer Imaging 5(1), 17.
Chawla, N. V., K. W. Bowyer, L. O. Hall, & W. P. Kegelmeyer (2002). SMOTE: synthetic minority oversampling technique. Journal of Artificial Intelligence Research 16, 321–357.
Cortes, C. & V. Vapnik (1995). Support-vector networks. Machine Learning 20(3), 273–297.
Degenhard, A., C. Tanner, C. Hayes, D. J. Hawkes, & M. O. Leach (2002). Comparison between radiological and artificial neural network diagnosis in clinical screening. Physiological Measurement 23(4), 727.
Delen, D., G. Walker, & A. Kadam (2005). Predicting breast cancer survivability: a comparison of three data mining methods. Artificial Intelligence in Medicine 34(2), 113–127.
Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM 55(10), 78–87.
Drummond, C., R. C. Holte, et al. (2003). C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on Learning from Imbalanced Datasets II, Volume 11. Citeseer.
Fawcett, T. (2006, June). An introduction to ROC analysis. Pattern Recognition Letters 27(8), 861–874.
Ferlay, J., H.-R. Shin, F. Bray, D. Forman, C. Mathers, & D. M. Parkin (2010). Estimates of worldwide burden of cancer in 2008: GLOBOCAN 2008. International Journal of Cancer 127(12), 2893–2917.
Hand, D. & C. Anagnostopoulos (2013). When is the area under the receiver operating characteristic curve an appropriate measure of classifier performance? Pattern Recognition Letters 34(5), 492–495.
Hand, D. J. (2009). Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning 77(1), 103–123.
Hand, D. J. & R. J. Till (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning 45(2), 171–186.
Hanley, J. A. et al. (1989). Receiver operating characteristic (ROC) methodology: the state of the art. Critical Reviews in Diagnostic Imaging 29(3), 307–35.
Hanley, J. A. & B. J. McNeil (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1), 29–36.
Ho, T. K. (1998). The random subspace method for constructing decision forests. Pattern Analysis and Machine Intelligence, IEEE Transactions on 20(8), 832–844.
Ihaka, R. & R. Gentleman (1996). R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics 5(3), 299–314.
Japkowicz, N. & S. Stephen (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis 6(5), 429.
Jemal, A., F. Bray, M. M. Center, J. Ferlay, E. Ward, & D. Forman (2011). Global cancer statistics. A Cancer Journal for Clinicians 61(2), 69–90.
Liebetrau, A. M. (1983). Measures of association, Volume 32. Sage Publications, Incorporated.
Nicolosi, S., G. Russo, I. D'Angelo, G. Vicari, M. C. Gilardi, & G. Borasi (2013). Combining DCE-MRI and 1H-MRS spectroscopy by distribution free approach results in a high performance marker: Initial study in breast patients. Journal of Biomedical Science and Engineering 6, 357–364.
Preidt, R. (2013, April). Scientists create breast cancer survival predictor. WebMD News from HealthDay.
Sarvestani, A. S., A. Safavi, N. Parandeh, & M. Salehi (2010). Predicting breast cancer survivability using data mining techniques. In Software Technology and Engineering (ICSTE), 2010 2nd International Conference on, Volume 2, pp. V2–227. IEEE.
Schnall, M. D. (2000). Breast imaging technology: Application of magnetic resonance imaging to early detection of breast cancer. Breast Cancer Research 3(1), 17.
Thai-Nghe, N., Z. Gantner, & L. Schmidt-Thieme (2011). A new evaluation measure for learning from imbalanced data. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pp. 537–542.
van Erkel, A. R. & P. M. Pattynama (1998). Receiver operating characteristic (ROC) analysis: Basic principles and applications in radiology. European Journal of Radiology 27(2), 88–94.
Verleysen, M. & D. François (2005). The curse of dimensionality in data mining and time series prediction. In Computational Intelligence and Bioinspired Systems, pp. 758–770. Springer.
Warner, J. (2003, September). Cancer death rates falling, but slowly. WebMD Health News.
Witten, I. H. & E. Frank (2005, June). Data Mining: Practical Machine Learning Tools and Techniques (2nd ed.). Morgan Kaufmann.
