Ensemble vote approach for predicting primary tumors using data mining

Mehak Naib, Amit Chhabra
Department of Computer Science & Engineering, Guru Nanak Dev University, Amritsar, India

Abstract: A primary tumor is a neoplasm, regarded in clinical parlance as malignant, that arises at one site and is capable of giving rise to metastatic tumors. Primary tumor disease is a major health problem today. This paper analyzes various data mining techniques for primary tumor prediction. The observations reveal that a hybrid approach combining three classifiers with the Vote ensemble technique on a resampled dataset outperforms every single data mining classifier tested. The study considers 19 attributes in total by adding a ‘small-intestine’ attribute to the original primary tumor dataset. With the ‘small-intestine’ attribute added, the ensemble Vote classifier achieves a high accuracy of 94.01% even though the dataset contains missing values. Evaluations are carried out with 10-fold cross validation using Weka 3-6-10.
Keywords- Data mining, Resampling, Primary Tumor, Vote, ARFF, WEKA.

II. RELATED WORK

Several studies have reported on the importance of data mining techniques in medical diagnosis. This section gives an overview of a number of research papers on the contribution of data mining to the medical field. Many studies focus on cancer. A linear regression algorithm has been applied to a blood cancer dataset, considering various demographic and clinical characteristics of patients [9]. Another study compared decision tree, Multi-Layer Perceptron, Naive Bayes, Sequential Minimal Optimization, and instance-based K-Nearest Neighbour classifiers on three different breast cancer databases, using classification accuracy and the confusion matrix with 10-fold cross validation in WEKA [4]. For the prediction of oral cancer, which is the sixth most common cancer and a major health problem worldwide, the support vector machine was found to be the most accurate classifier [6]. These studies show that model performance depends on the type of dataset used. If the dataset consists of unlabelled features, a clustering model such as the k-means algorithm is better suited to pattern recognition [5]. Some studies focus on heart disease and analyze datasets using the Naïve Bayes, K-NN, and Decision List algorithms in the Tanagra data mining tool [10]. A comparison of seven classification algorithms, namely Naive Bayes, Naive Bayes Updatable, FT Tree, KStar, J48, LMT, and Neural Network, has been presented for analyzing hepatitis prognostic data; it concludes that Naive Bayes performs better than the other classification techniques on the hepatitis dataset [3]. After analyzing these techniques, this paper selects the most accurate method for the prediction of primary tumors. The next section discusses the proposed method.

I. INTRODUCTION

Data mining plays an important role in the medical field by predicting various diseases. It is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns for business advantage [1]. Various data mining techniques, such as classification, regression, clustering and association rules, are applied to datasets to obtain predictions [8] [11]. This paper analyzes the prediction of primary tumor disease using data mining. A primary tumor is the location from which a tumor starts spreading and gives rise to secondary tumors. Sometimes the primary tumor can be deadly or harmful even if the cancer has not spread. This is true for tumors growing in the lungs, brain, and other major organ systems. The primary tumor is generally the easiest to remove. It is very important to find it; otherwise it may grow and give rise to linked tumors called secondary tumors [2]. This paper focuses on reducing the workload of doctors in predicting primary tumors in patients. Doctors perform various medical tests, yet this disease may still go without attention, and it can have severe consequences if it spreads in the human body, as it is the initial stage of a tumor. This study provides an efficient method that can help doctors through data mining prediction results. The Vote(3) ensemble technique is used to classify the primary tumor dataset. This technique combines the prediction outputs of three base classifiers, Multilayer Perceptron, Random Forest and K Nearest Neighbour, on the primary tumor dataset containing 18 attributes plus one added attribute and 339 instances. The added attribute, ‘small-intestine’, improves the accuracy of the classifiers.

978-1-4799-4236-7/14/$31.00 ©2014 IEEE

The paper is organized as follows: Section 2 discusses similar works in the data mining field. Section 3 discusses the proposed work. Experimental results and performance evaluation are presented in Section 4, and finally Section 5 concludes the paper and points out some potential future work.

III. PROPOSED METHOD

Classification is the process of finding a set of models that describe and distinguish data classes and concepts, so that the model can be used to predict the class of objects whose label is unknown [13]. In this study, the ensemble classifier Vote(3) is applied to the dataset. Vote(3) combines the prediction outputs of three base classifiers: Multilayer Perceptron, Random Forest and K Nearest Neighbour. A filtered dataset is obtained by preprocessing the original dataset. The filtered dataset is classified with various classification techniques, and the three best-performing classifiers, each from a different category, are selected for making predictions on the test dataset. The prediction outputs of the selected base classifiers are combined, and the class that receives the most votes is assigned to each test instance.
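The combination step described above can be sketched in a few lines of Python. This is only an illustration of majority voting, not the WEKA Vote implementation; the function name `majority_vote`, the sample predictions and the tie-breaking rule (first label wins) are assumptions made for the example.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most base classifiers.

    Ties are broken in favour of the label that appears first in
    `predictions` (an assumption; the paper does not describe tie handling).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical outputs of MLP, KNN and Random Forest for two test instances.
rows = [
    ("gallbladder", "gallbladder", "gallbladder"),
    ("rectum", "head and neck", "head and neck"),
]
voted = [majority_vote(r) for r in rows]
```

With three base classifiers, any class that receives two or three votes wins outright; the tie-breaking rule only matters when all three disagree.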

@attribute degree-of-diffe {well,fairly,poorly}
@attribute bone {yes,no}
@attribute bone-marrow {yes,no}
@attribute lung {yes,no}
@attribute pleura {yes,no}
@attribute peritoneum {yes,no}
@attribute liver {yes,no}
@attribute brain {yes,no}
@attribute skin {yes,no}
@attribute neck {yes,no}
@attribute supraclavicular {yes,no}
@attribute axillar {yes,no}
@attribute mediastinum {yes,no}
@attribute abdominal {yes,no}
@attribute class {lung, 'head and neck', esophagus, thyroid, stomach, 'duoden and sm.int', colon, rectum, anus, 'salivary glands', pancreas, gallbladder, liver, kidney, bladder, testis, prostate, ovary, 'corpus uteri', 'cervix uteri', vagina, breast}
(class is location of tumor).

1. Remove missing values using the unsupervised filter, which replaces them with the modes and medians of the attribute values in the training data.

Table 1: Attributes Containing Missing Values

S.no   Attribute              Missing Values
1.     Gender                 1
2.     Histologic type        67 (20%)
3.     Degree of difference   155 (46%)
4.     Skin                   1
5.     Axillar                1
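As a rough illustration of the missing-value step, the sketch below replaces each missing nominal value with the most frequent observed value of its column. It is not WEKA's filter; the helper name `impute_with_mode` and the toy rows are assumptions, and only the `?` marker for missing values is taken from the ARFF format.

```python
from collections import Counter

MISSING = "?"  # ARFF marks missing values with a question mark


def impute_with_mode(rows):
    """Replace missing nominal values with the most frequent value of the column."""
    n_cols = len(rows[0])
    modes = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] != MISSING]
        # Keep the marker if a column has no observed value at all.
        modes.append(Counter(observed).most_common(1)[0][0] if observed else MISSING)
    return [[modes[c] if v == MISSING else v for c, v in enumerate(r)] for r in rows]


# Toy rows with the columns (sex, histologic-type, skin); values are made up.
rows = [
    ["male", "adeno", "no"],
    ["female", "?", "no"],
    ["male", "adeno", "?"],
    ["?", "epidermoid", "yes"],
]
filled = impute_with_mode(rows)
```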

3. The dataset has a non-uniform distribution of classes, as shown in Figure 2: the lung class has more instances, so predictions may be biased towards the majority class and ignore minority classes. Apply the Random oversampling (without replacement) method of the supervised filter to resample the imbalanced dataset towards a uniform distribution of classes. This increases the number of instances from “339” to “3390”, with a sample size percent of “1000.0” and a bias value of “1.0” toward a uniform class distribution, as shown in Fig 3.
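The effect of resampling toward a uniform class distribution can be sketched as follows. This is a simplified stand-in rather than WEKA's supervised filter: it gives every class the same quota and draws with replacement inside each class, since a tenfold sample cannot otherwise be drawn from a small class. The function name, the seed and the toy data are assumptions.

```python
import random
from collections import Counter


def resample_uniform(rows, labels, sample_percent=1000.0, seed=1):
    """Draw a sample in which all classes are (nearly) equally represented.

    Instances are drawn with replacement within each class; this is an
    assumption of the sketch, not a statement about the paper's filter settings.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    total = int(len(rows) * sample_percent / 100.0)
    per_class, extra = divmod(total, len(by_class))
    out_rows, out_labels = [], []
    for i, label in enumerate(sorted(by_class)):
        k = per_class + (1 if i < extra else 0)  # spread any remainder
        out_rows.extend(rng.choices(by_class[label], k=k))
        out_labels.extend([label] * k)
    return out_rows, out_labels


# Toy imbalanced data: 8 lung, 3 stomach and 1 testis instance.
labels = ["lung"] * 8 + ["stomach"] * 3 + ["testis"] * 1
rows = [[i] for i in range(len(labels))]
new_rows, new_labels = resample_uniform(rows, labels)
counts = Counter(new_labels)
```

In the toy data, 12 instances at 1000% give 120 instances, 40 per class, so the single `testis` instance is repeated 40 times. This mirrors how very small classes in Table 2 end up with counts comparable to the majority class.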

IV. RESULTS AND ANALYSIS

The following steps are used in prediction:

The Primary-tumor.arff dataset is selected from the UCI repository [12]. In this dataset there are 18 attributes in total, one of which is the class attribute, and 339 instances. Attribute information in attribute-relation file format (ARFF):
@attribute age {'<30','30-59','>=60'}
@attribute sex {male,female}
@attribute histologic-type {epidermoid,adeno,anaplastic}



Addition of attribute: Making effective predictions over multiple classes requires more information. The original dataset contains 18 attributes in total; adding one more attribute that plays an important role in giving rise to various primary tumors and is linked to major organs improves the prediction results. In this research, an attribute named ‘small-intestine’ is added, which is linked to various sites in the human body such as the abdomen, liver, stomach, pancreas, gallbladder, colon and esophagus. The ‘small-intestine’ attribute was selected on the basis of medical information that it has a direct link to various parts of the body that are attributes of the primary tumor dataset [14]. For example, the ‘small-intestine’ attribute takes the value “yes” if a linked attribute such as ‘abdominal’ or ‘peritoneum’ has the value “yes” for a particular instance. With this addition, the dataset contains 19 attributes in total.
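The example rule given above can be written as a small derivation function. The sketch assumes that only the two attributes named in the example (‘abdominal’ and ‘peritoneum’) drive the new value; the text lists further linked sites, and the exact rule used to fill the attribute is not specified, so `LINKED_ATTRIBUTES` and `add_small_intestine` are hypothetical.

```python
# Assumed subset of linked sites, taken from the example in the text.
LINKED_ATTRIBUTES = ("abdominal", "peritoneum")


def add_small_intestine(instance, linked=LINKED_ATTRIBUTES):
    """Return a copy of the instance with a derived 'small-intestine' value."""
    value = "yes" if any(instance.get(a) == "yes" for a in linked) else "no"
    return {**instance, "small-intestine": value}


a = add_small_intestine({"abdominal": "yes", "peritoneum": "no", "liver": "no"})
b = add_small_intestine({"abdominal": "no", "peritoneum": "no", "liver": "yes"})
```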

Figure 1 shows the design of the proposed method. The training data contains demographic data of patients with their actual classes. The dataset passes through supervised and unsupervised filters, which produce the filtered dataset. Various data mining models are applied and their evaluation measures are compared. The models selected from each category of classifiers are combined using the Vote ensemble technique. Predictions are made on the basis of the votes collected from the classifiers.

A. Primary Tumor Dataset Preprocessing


Figure 1: Design of Proposed Method


2014 5th International Conference- Confluence The Next Generation Information Technology Summit (Confluence)

Fig 2: Imbalanced Class Distribution in WEKA

Table 2: Class Distribution in Dataset

Class                Imbalanced distribution     Balanced distribution
                     (before resampling)         (after resampling)
Lung                 84                          157
Head & neck          20                          174
Esophagus            9                           160
Thyroid              14                          171
Stomach              39                          176
Duoden and sm.int    1                           161
Colon                14                          169
Rectum               6                           180
Anus                 0*                          0*
Salivary glands      2                           180
Pancreas             28                          129
Gallbladder          16                          152
Liver                7                           149
Kidney               24                          168
Bladder              2                           163
Testis               1                           157
Prostate             10                          146
Ovary                29                          162
Corpus uteri         6                           142
Cervix uteri         2                           173
Vagina               1                           178
Breast               24                          163

4. Rank the attributes using information gain (InfoGain) attribute evaluation with the Ranker search method. Figure 4 shows the attribute evaluation and ranking produced by the attribute evaluator. The “bone-marrow” attribute gets the lowest rank (0.0184), whereas “small-intestine” and “gender” attain the highest ranks. This evaluation is based on the information gained from the dataset. It indicates that low-ranked attributes contribute less to predicting primary tumors; however, if these features are omitted, information may be lost, which can reduce classification accuracy. Hence, all attributes are selected.
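To make the ranking criterion concrete, the following sketch computes information gain for nominal attributes and sorts them, which is the idea behind an InfoGain evaluator combined with a ranker. It is an independent illustration on toy data, not WEKA code; the attribute names `informative` and `uninformative` are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(values, labels):
    """Entropy of the class minus its expected entropy after splitting on the attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_attributes(columns, labels):
    """Return (attribute, gain) pairs sorted from highest to lowest gain."""
    scores = {name: info_gain(vals, labels) for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


labels = ["lung", "lung", "colon", "colon"]
columns = {
    "informative": ["yes", "yes", "no", "no"],
    "uninformative": ["yes", "no", "yes", "no"],
}
ranking = rank_attributes(columns, labels)
```

An attribute that separates the classes perfectly receives the full class entropy as its gain, while one that is independent of the class receives zero. A low score such as the 0.0184 reported for “bone-marrow” therefore indicates a weak, but not necessarily useless, predictor.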

Fig 3: Balanced Class Distribution in WEKA

Figure 4: Attribute Ranking Using Ranker Search

Each attribute has its own worth. Based on the attribute evaluation, all attributes are selected for classification; the attribute evaluator is used only for ranking purposes. The filtered dataset is obtained once preprocessing is complete.

B. Building Classifier

Classification is performed by applying various classifiers to the dataset using WEKA. The results show that a multiple-classifier approach, using the Vote ensemble technique with majority voting as the combination rule, gives better results than single classifiers. Applying majority voting to these classifiers increases accuracy to 90.01%.

1. Naïve Bayes classifier: This classifier belongs to the Bayesian classifiers and uses estimator classes. All parameters are set to their default values. It attains an accuracy of 64.45%, with a TP rate of 0.645 and an F-measure of 0.624, without the small-intestine attribute, and 68.82% with the small-intestine attribute. This classifier does not perform well on the dataset.

2. Multilayer Perceptron classifier: This classifier belongs to the neural network classifiers. All parameters are set to their default values except “training time”, changed from “500” to “100”, “momentum”, changed from “0.2” to “0.4”, and “learning rate”, changed from “0.3” to “0.1”, to speed up execution. It attains an accuracy of 83.45%, with a TP rate and F-measure of 0.831, without the small-intestine attribute, and 90.12% with the small-intestine attribute. It performs well on the dataset.

3. K Nearest Neighbour (KNN) classifier: This classifier belongs to the lazy classifiers category and is based on the nearest neighbour algorithm. Here “K”, the number of nearest neighbours to be used, is left at its default of “1”. It achieves an accuracy of 89.7%, with a TP rate of 0.897 and an F-measure of 0.897, without the small-intestine attribute, and 93.41% with the small-intestine attribute. It performs well on the dataset.


4. Decision Trees (J48) classifier: This classifier belongs to the decision trees category. Parameters are set to their default values. It achieves an accuracy of 88.20%, with a TP rate of 0.882 and an F-measure of 0.87, without the small-intestine attribute, and 93.3% with the small-intestine attribute.

5. Random Forest (RF) classifier: This classifier belongs to the decision tree category. It is a combination of 10 randomly drawn trees. All parameters are set to their default values. It attains an accuracy of 89.6%, with a TP rate of 0.896 and an F-measure of 0.896, without the small-intestine attribute, and 93.8% with the small-intestine attribute.

6. Ensemble Vote(3) classifier: This classifier is of type Meta and combines the prediction outputs of more than two classifiers. Here three classifiers are combined: Multilayer Perceptron from the neural network classifiers, KNN from the lazy classifiers and Random Forest from the decision tree classifiers. It achieves a very high accuracy of 90.01%, with 3050 of 3390 instances correctly predicted, the highest TP rate of 0.901 and an F-measure of 0.899, without the small-intestine attribute, and 94.21% with the small-intestine attribute. It performs best among all classifiers.

Table 3 shows that hybridizing the output predictions of Multilayer Perceptron, Random Forest and K Nearest Neighbour gives the best results compared with the other algorithms. Figure 7 shows that the FP rate of the Vote classifier is the lowest, at 0.003. The lower the FP rate, the more accurate the model.

Table 3: Comparison of Classification Algorithms

Algorithm             Accuracy without ‘small-intestine’   Accuracy with ‘small-intestine’
                      attribute (%)                        attribute (%)
Naïve Bayes           64.45                                68.82
MLP                   83.45                                90.12
KNN                   89.70                                93.41
J48                   87.28                                93.30
Random Forest         89.60                                93.80
Vote (MLP+KNN+RF)     90.00                                94.21
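For readers unfamiliar with the lazy classifier used above, the sketch below shows nearest-neighbour prediction with K = 1 on nominal attributes. The distance here is a plain count of differing attribute values, which is an assumption chosen for simplicity and not necessarily the distance function used by WEKA; the training rows and labels are made up.

```python
def hamming(a, b):
    """Number of attribute positions in which two instances differ."""
    return sum(x != y for x, y in zip(a, b))


def predict_1nn(train_rows, train_labels, query):
    """Return the label of the single closest training instance (K = 1)."""
    best = min(range(len(train_rows)), key=lambda i: hamming(train_rows[i], query))
    return train_labels[best]


# Toy training data with three yes/no attributes per instance.
train_rows = [
    ("yes", "no", "no"),
    ("no", "yes", "yes"),
    ("no", "no", "yes"),
]
train_labels = ["lung", "liver", "stomach"]
pred = predict_1nn(train_rows, train_labels, ("no", "yes", "no"))
```

Note that with K = 1 on an oversampled dataset, a duplicated instance that appears in both the training and test folds is matched exactly, which should be kept in mind when reading cross-validated accuracies.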

B. Measures for performance evaluation

I. Accuracy: This is the ratio of the number of correctly classified instances to the total number of instances, multiplied by 100. Technically it can be defined as:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

II. Kappa Statistics: The Kappa statistic is a metric that compares an observed accuracy with an expected accuracy (random chance). It is calculated as:
\[ I_o = \frac{TP + TN}{TP + TN + FP + FN} \]
\[ I_e = \frac{(TN + FN)(TN + FP) + (TP + FP)(TP + FN)}{n^2}, \quad \text{where } n = TP + TN + FP + FN \]
\[ \text{Kappa} = \frac{I_o - I_e}{1 - I_e} \]
where \(I_o\) is the observed accuracy and \(I_e\) is the expected accuracy.

III. F-measure: In information retrieval, it can be used as a single measure of performance. The F-measure is the harmonic mean of precision and recall. It is calculated as follows:
\[ \text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \quad \text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN} \]
Here:
1. TN / True Negative: case was negative and predicted negative
2. TP / True Positive: case was positive and predicted positive
3. FN / False Negative: case was positive but predicted negative
4. FP / False Positive: case was negative but predicted positive
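The three measures defined above can be checked numerically with a short function. The counts passed in below are made-up values chosen only to give round numbers; the FP rate is included because the text reports it for the Vote classifier, using the standard definition FP / (FP + TN), which the paper itself does not spell out.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute accuracy, kappa, precision, recall, F-measure and FP rate."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n  # I_o, identical to accuracy
    expected = ((tn + fn) * (tn + fp) + (tp + fp) * (tp + fn)) / (n * n)  # I_e
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": observed,
        "kappa": (observed - expected) / (1 - expected),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        # Standard definition; not given explicitly in the text.
        "fp_rate": fp / (fp + tn),
    }


# Hypothetical confusion counts for a two-class problem.
m = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```

For a multiclass problem such as the primary tumor dataset, these quantities are usually computed per class in a one-versus-rest fashion and then averaged; how the averaging is weighted is a tool-specific choice.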

Figure 7: FP Rate Comparison of Vote (FP rate of the Vote technique, without and with the 'small-intestine' attribute)

Figure 5 represents the overall comparison of the algorithms, considering all changes: the resampling filter, the attribute selector evaluation and, most importantly, the addition of the ‘small-intestine’ attribute to the dataset. It shows that the Vote(3) ensemble technique is best for making effective predictions, with 94.11% accuracy. Figures 6 and 7 show the TP rate and FP rate of the Vote(3) classifier with and without the ‘small-intestine’ attribute in the dataset. From all experiments and evaluations, we conclude that more information (additional attributes and instances in the dataset) is quite helpful in making effective predictions for a multiclass dataset.

C. Testing the dataset

Testing is performed on the test dataset containing 18 instances by applying the selected Vote model to predict the primary tumor for each instance. Figure 8 shows the prediction output for the test dataset of 18 instances in WEKA, with the predicted values calculated by the Vote meta classifier.

Fig 5: Accuracy Comparison with 'small-intestine' attribute

Figure 6: TP Rate Comparison of Vote (TP rate of the Vote technique, without and with the 'small-intestine' attribute)

Figure 8: Output Predictions on Test dataset in WEKA


Table 4: Output Predictions of 15 Instances of Test Dataset

Instance No   MLP            KNN            Random Forest   Vote(3) Meta using Majority Voting
1.            Gallbladder    Gallbladder    Gallbladder     Gallbladder
2.            Lung           Lung           Lung            Lung
3.            Rectum         Head & neck    Head & neck     Head & neck
4.            Lung           Thyroid        Lung            Lung
5.            Lung           Lung           Lung            Lung
6.            Ovary          Ovary          Ovary           Ovary
7.            Colon          Corpus uteri   Corpus uteri    Corpus uteri
8.            Colon          Colon          Colon           Colon
9.            Ovary          Ovary          Ovary           Ovary
10.           Pancreas       Pancreas       Pancreas        Pancreas
11.           Lung           Lung           Lung            Lung
12.           Head & neck    Head & neck    Head & neck     Head & neck
13.           Vagina         Vagina         Vagina          Vagina
14.           Stomach        Stomach        Stomach         Stomach
15.           Head & neck    Head & neck    Head & neck     Head & neck

Table 4 shows the output prediction for each instance obtained through the majority voting combination rule in the Vote(3) ensemble technique. As it shows, instance 1 has ‘gallbladder’ as its output class, which receives ‘3’ votes from the three selected classifiers. The classes of the remaining test instances are calculated similarly.

V. CONCLUSION

A comparative study of data mining classification techniques, together with the addition of an attribute to the dataset and the hybridization of the prediction outputs of three classifiers through the Vote technique, helps in analyzing large data sets. It is also found that the Random oversampling technique creates a balanced distribution of classes, which removes bias towards the majority class and helps to identify the minority classes in predictions. Most importantly, it is found that adding an attribute to a multiclass dataset considerably improves the accuracy of the classifiers. In future, the work can be enhanced by adding more classifiers or attributes for the automation of primary tumor prediction.


REFERENCES
[1] Dave Smith, “Data Mining in the Clinical Research Environment”. Available at http://www.sas.com (Accessed 23 October 2013).
[2] Erin J. Hill and Bronwyn Harris, “Primary Tumor”. Available at http://www.wisegeek.com (Accessed 17 November 2013).
[3] Fadl Mutaher Ba-Alwi et al., “Comparative Study for Analysis the Prognostic in Hepatitis Data: Data Mining Approach”, International Journal of Scientific and Engineering Research, Vol. 4, Issue 8, August 2013, 680, ISSN 2229-5518.
[4] Gouda I. Salama et al., “Fuzzy Analysis of Breast Cancer Disease using Fuzzy c-means and Pattern Recognition”, Southeast Europe Journal of Soft Computing, September 2012.
[5] K. Lokanayaki et al., “Exploring on Various Prediction Model in Data Mining Techniques for Disease Diagnosis”, International Journal of Computer Applications (0975 – 8887), Vol. 77, No. 5, September 2013.
[6] Li-Yeh Chuang et al., “Support Vector Machine-based Prediction for Oral Cancer Using Four SNPs in DNA Repair Genes”, Proceedings of the International Multi Conference of Engineers and Computer Scientists 2011, Vol. I, 2011.
[7] Mohammad Taha Khan et al., “A Prototype of Cancer/Heart Disease Prediction Model Using Data Mining”, International Journal of Applied Engineering Research, Vol. 7, No. 11, 2012.
[8] Nevine M. Labib et al., “Data Mining for Cancer Management in Egypt Case Study: Childhood Acute Lymphoblastic Leukemia”, International Journal of Medical, Health, Pharmaceutical and Biomedical Engineering, Vol. 1, No. 8, 2007.
[9] N. Sudha Bhuvaneswari et al., “Information extraction of predicting blood cancer”, IJCS International Journal of Computer Science, Vol. 1, Issue 4, September 2013.
[10] S. Vijiyarani et al., “Disease Prediction in Data Mining Technique – A Survey”, International Journal of Computer Applications & Information Technology, Vol. II, Issue I, January 2013.
[11] Damtew A., “Designing a predictive model for heart disease detection using data mining techniques”, A Thesis Submitted to the School of Graduate Studies of Addis Ababa University, 2011.
[12] Dataset collected, [http://tunedit.org/repo/UCI] accessed 2013.
[13] Varun Kumar et al., “Knowledge discovery from database Using an integration of clustering and classification”, International Journal of Advanced Computer Science and Applications, Vol. 2, No. 3, March 2011.
[14] Small-intestine Attribute information, [http://www.cancer.gov/cancertopics/pdq/treatment/smallintestine/Patient/page1/AllPages] accessed 2014.
