2016 Second International Conference on Computational Intelligence & Communication Technology
Medical Data Mining Using Different Classification and Clustering Techniques: A Critical Survey

Richa Sharma
PhD Scholar
ASET, Amity University Uttar Pradesh, India
[email protected]

Dr. Shailendra Narayan Singh
ASET, Amity University Uttar Pradesh, India
[email protected] [email protected]

Dr. Sujata Khatri
Assistant Professor
Deen Dayal Upadhyay College, Delhi University
[email protected]

Abstract – Disease diagnosis is one application of data mining: a medical dataset is mined to identify hidden patterns and to extract useful knowledge from a medical database. Recently, researchers have used a range of classification and clustering algorithms to diagnose diseases. This paper surveys work on two complex diseases, heart disease and cancer. It critically reviews the existing literature to identify significant findings in this area, summarizes the approaches used in disease diagnosis, and discusses the tools available for processing and classifying data.
Keywords: Data Mining, Classification, Clustering, Heart Disease Diagnosis, Cancer Diagnosis, Data Mining Tools.
978-1-5090-0210-8/16 $31.00 © 2016 IEEE DOI 10.1109/CICT.2016.142
I.
INTRODUCTION
Data mining is a broad, generic term that covers many methods, taxonomies, and classifications. It evolved naturally as an integrated part of database technology and has taken hold in every field, including health care. Data mining in health care, the application of data mining technologies to medical data, is an emerging field used for disease diagnosis, prediction, and deeper understanding of medical data; it is a necessary step in discovering knowledge from extensive datasets. Recently, mining data has attracted more attention than analyzing it. Data mining comprises classification, clustering, and association rule discovery, and it extends into other areas, including data warehousing, statistics, machine learning, and artificial intelligence.
Discussion on classification and clustering techniques: As data expands day by day, an enormous amount of it is generated and stored across the globe, and valuable information is hidden in it. Over the years, various algorithms have been designed to handle these datasets; classification and clustering are the approaches generally used. Classification predicts an outcome after analyzing the input and generally uses If-Then rules: the If part represents the condition, and the Then part represents the consequence. Decision tree algorithms such as ID3 and C4.5 identify the attributes that differentiate one class from another. ID3 selects splits by information gain, an entropy-based measure that is biased toward attributes with a large number of values; C4.5 overcomes this issue by using the gain ratio, which normalizes the information gain. Genetic algorithm classification is also popular and widely used because its rules are represented very naturally. Neural network classification predicts the outcome from existing knowledge: all computation is performed by neurons and their associated weights. A neural network is trained on the dataset and can give very accurate predictions; the main criticism is its black-box nature. Cluster analysis partitions or groups data on the basis of similarity and is adaptable to changes. There are several clustering methods. A partitioning method first creates an initial set of k partitions and then uses an iterative relocation technique to improve them. A hierarchical method is either agglomerative (bottom-up) or divisive (top-down); a hierarchical decomposition is formed on the basis of these approaches. In a density-based method, a cluster grows according to the density of neighboring objects. A model-based method hypothesizes a model and finds the best fit of the data to it. In constraint-based clustering, objects are grouped according to either application-based or user-defined constraints.
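The split criteria behind the ID3 and C4.5 decision trees discussed in this survey can be made concrete with a small calculation. The following is a minimal sketch, not the implementation used in any cited study: ID3 ranks attributes by information gain, while C4.5 divides that gain by the split information (the gain ratio). The records, attribute names (`patient_id`, `chest_pain`), and labels are hypothetical toy values chosen only for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def partition(rows, labels, attribute):
    """Group the class labels by the value each row takes for `attribute`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    return list(groups.values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on `attribute` (ID3 criterion)."""
    total = len(labels)
    remainder = sum(
        len(g) / total * entropy(g) for g in partition(rows, labels, attribute)
    )
    return entropy(labels) - remainder


def gain_ratio(rows, labels, attribute):
    """Information gain divided by split information (C4.5 criterion)."""
    total = len(labels)
    split_info = -sum(
        (len(g) / total) * log2(len(g) / total)
        for g in partition(rows, labels, attribute)
    )
    if split_info == 0:
        return 0.0  # attribute has a single value, so it cannot split the data
    return information_gain(rows, labels, attribute) / split_info


# Hypothetical toy records, not taken from any dataset cited in this survey.
rows = [
    {"patient_id": 1, "chest_pain": "yes"},
    {"patient_id": 2, "chest_pain": "yes"},
    {"patient_id": 3, "chest_pain": "no"},
    {"patient_id": 4, "chest_pain": "no"},
]
labels = ["sick", "sick", "healthy", "healthy"]

for attribute in ("patient_id", "chest_pain"):
    print(attribute,
          information_gain(rows, labels, attribute),
          gain_ratio(rows, labels, attribute))
```

On this toy data both attributes reach the same information gain, because a unique identifier trivially separates every record, but the gain ratio of `patient_id` is half that of `chest_pain`, so a gain-ratio criterion prefers the attribute that can generalize.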
Data mining offers several families of models for disease diagnosis: 1) supervised learning, 2) unsupervised learning, 3) ensemble learning, and 4) hybrid classification. Classification is a supervised learning technique that predicts the target class of each sample in the data, where the classes are predefined. Clustering is an unsupervised learning technique that groups the data into clusters of similar objects: objects with similar properties fall into the same cluster, and objects with dissimilar properties fall into different clusters. Ensemble learning combines classifiers through various merging methods, with the goal of achieving better accuracy than a single classifier model. Hybrid learning combines heterogeneous machine learning approaches. Both ensemble and hybrid learning models give enhanced accuracy [1]. The aim of this survey is to closely examine and analyze previous research in medical data mining and the techniques developed for diagnosing heart disease and cancer.
II.
SURVEY ON HEART DISEASE DIAGNOSIS AND PROGNOSIS
An American Heart Association statistical report that tracks 190 countries identifies heart disease as the dominant cause of death worldwide, with 17.3 million deaths every year; this number is expected to rise to 23.6 million by 2030 (heart disease and stroke statistics update 2015) [2].
M. Shouman, T. Turner and R. Stocker [3] identified that the treatment of heart disease has received less attention, although many data mining techniques (single and hybrid) have already been applied to its diagnosis. In their work on heart disease diagnosis and treatment, they proposed a model that systematically closes this research gap. The model suggests applying single data mining techniques to heart disease diagnosis data to establish a baseline accuracy, applying hybrid data mining techniques to heart disease treatment data, and finally testing the techniques for accuracy.
M. Shouman, T. Turner and R. Stocker [4] also investigated which decision tree model is better for diagnosing patients with this disease. They applied various techniques to different decision trees on a benchmark dataset, seeking better diagnostic performance, and computed accuracy, sensitivity, and specificity to assess each tree. Their results show that the J4.8 decision tree and the bagging algorithm outperform the alternatives in diagnosing patients with heart disease.
V. Chaurasia and S. Pal [5] studied different classifiers and conducted experiments to find the best classifier for predicting heart disease. They used three classifiers: ID3, CART, and DT. Their observations show that CART is the most accurate, with an accuracy of 83.49% and a model build time of 0.23 seconds; CART also had the lowest average error, 0.3, of the classifiers compared.
G.E. Sakr, I.H. Elhajj and H.A. Huijer [6] proposed an SVM-based decision support system, designed for dementia patients, to classify and detect agitation transition. They presented a decision confidence measure and two novel SVM architectures, for agitation detection and agitation transition detection. An accuracy of 91.4% was attained in assessment and 90.9% for the conventional SVM.
G.G. Cabral and A.L.I. de Oliveira [7] analyzed medical data to determine whether or not patients are cardiac. Because the collected data are imbalanced, they employed one-class classification (OCC) to solve the problem. The four metrics used in the experiments show that FBDOCC, OCSVM, and SVDD performed well; however, FBDOCC significantly outperformed the other methods on the heart disease dataset and, according to the Wilcoxon rank-sum test, obtained the best results for all datasets. The results also show that the OCC paradigm may support the screening phase of treatment.
M. Alsalamah, Dr. S. Amin and Dr. J. Halloran [8] discussed the use of radial basis function network (RBFN) classification for heart disease. They proposed a model system comprising data collection, processing, storage, and usage procedures. Their paper explains RBFs and RBFNs and describes a novel method of using an RBFN that incorporates an automated continuous improvement strategy for classifying heart disease from patients' records.
M. Shouman, T. Turner and R. Stocker [9] examined the efficacy of an unsupervised learning technique, k-means clustering, combined with naïve Bayes in diagnosing heart disease patients. Their results reveal that integrating k-means clustering with naïve Bayes, using various initial centroid selection methods, enhances the accuracy of naïve Bayes in diagnosing heart disease patients. The two-cluster random-row initial centroid selection method achieved an accuracy of 84.5%.
N.A. Setiawan, D.W. Prabowo and H.A. Nugroho [10] provided a benchmark comparison of feature selection techniques for diagnosing coronary artery disease (CAD). Four feature selection methods were used: motivated feature selection (MFS), correlation-based feature selection (CFS), wrapper-based feature selection (WFS), and rough-set-based feature selection (RST). The Naïve Bayes and J4.8 classifiers were used to diagnose the presence of CAD. The results show that WFS and CFS are superior to MFS and RST.
S. Shaikh, A. Sawant, S. Paradkar and K. Patil [13] proposed an intelligent system based on the Naïve Bayes classification technique and implemented it in Java. In their application, the user answers a set of predefined queries. The system retrieves hidden values from the stored data, compares the values entered by the user with the trained dataset, and can answer complex queries related to heart disease.
III.
SURVEY ON CANCER DISEASE DIAGNOSIS AND PROGNOSIS
There are more than 200 types of cancer. In the U.K., 1 out of 2 people is affected by this disease during their lifetime. Cancer cells grow through normal tissue: the disease starts when gene changes make one cell or a few cells grow and multiply, and this growth forms a tumor [14]. DNA microarray technology has had a great impact on cancer genome research, because the expression levels of thousands of genes are measured simultaneously [15].
A. H. Chen and J. C. Hsu [16] proposed two systematic methods for predicting cancer classification: the GAGS method, applied to find the optimal informative gene subset, and a technique called multi-task support vector sample learning (MTSVL). They demonstrated experimentally that GAGS and MTSVL provide remarkable classification performance on leukemia and prostate cancer gene expression datasets.
P. Kumar and S.K. Wasan [17] analyzed two clustering algorithms, x-means and global k-means, for tumor classification. They applied both algorithms to a colon dataset to classify it into two classes. The results show that k-means achieves higher accuracy than x-means, but x-means executes faster than k-means.
Y. Xu, J.Y. Zhu, E. Chang and Z. Tu [18] introduced a new learning method, multiple clustered instance learning (MCIL), for classifying, segmenting, and clustering cancer cells in colon histopathology images. MCIL simultaneously performs image-level classification (cancer vs. non-cancer images), pixel-level segmentation (cancer vs. non-cancer tissue), and patch-level clustering (cancer subclasses). By embedding clustering concepts into multiple instance learning (MIL), they derived a principled solution that achieves the three tasks in an integrated framework. Empirical results demonstrated the efficiency and effectiveness of MCIL in evaluating colon cancers.
M. Yassi, A. Yassi and M. Yaghoobi [19] presented a method to distinguish types of breast cancer. They used chaotic hierarchical cluster-based multispecies particle swarm optimization (CHCMSPO) to optimize an FS. Their objective was to learn Takagi-Sugeno-Kang (TSK) type fuzzy rules with high accuracy, and they introduced chaos into the HCMSPSO. Eleven chaotic maps were used in the intelligent diagnosis system. The accuracy in distinguishing benign from malignant cancer was above 90%, and the sinusoidal chaotic map provided an accuracy of 99%. Their simulation was performed on the UCI breast cancer database [20].
Z. Mahmud and S. Sulong [21] considered several factors contributing to cervical cancer. Their study investigated the impact of age, marital status, and ethnicity on the four stages (Stage I, Stage II, Stage III and Stage IV) of cervical cancer, using the records of 444 cervical cancer patients from a departmental databank at the UKM medical center. The study found that the women, whose mean age was 57 years, were mostly diagnosed with Stage II cervical cancer, and that women younger than 57 were 4 times more likely to be diagnosed with Stage I than with Stage II. Women aged 66 or above who were diagnosed with cervical cancer were 7 times more likely to have undergone radiotherapy than women aged 45 or below, while married women aged 46 had a higher chance of Stage I cervical cancer. The authors suggested that women in Malaysia should be tested before the age of 45.
O. Anunciaco, B.C. Gomes, S. Vinga, J. Gaspar, A.L. Oliverira and J. Rueff [22] identified a suitable decision tree for detecting high-risk breast cancer groups. Their results showed that a statistically significant association with breast cancer can be found by deriving a decision tree and then selecting the best leaf; the high-risk group they identified comprised 13 cases and 1 control.
S.D. Savarkar, A.A. Ghalot and A.P. Pande [23] applied a support vector machine (SVM) and an artificial neural network (ANN) to WBC data. The SVM and ANN prediction models were found to be more accurate than human experts; with an accuracy of 97%, these models could be used to support decisions and avoid biopsy.
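Several of the studies surveyed here rely on k-means, the standard partitioning method that starts from k initial centroids and improves them by iterative relocation. The following is a minimal sketch of plain k-means (Lloyd's algorithm), not the global k-means or x-means variants compared in [17]. The points are hypothetical two-dimensional values, and taking the first k points as initial centroids is an assumption made only to keep the example deterministic.

```python
from math import dist


def kmeans(points, k, max_iter=100):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    # Assumption for this sketch: the first k points serve as initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the partition is stable
        assignment = new_assignment
        # Relocation step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment


# Hypothetical measurements forming two visibly separate groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.8, 1.2), (4.8, 5.2)]
centroids, assignment = kmeans(points, k=2)
print(centroids, assignment)
```

The result of k-means depends on the initial centroids, which is why studies such as [9] compare different initial centroid selection methods.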
IV.
TOOLS AVAILABLE FOR DATA PREPROCESSING AND CLASSIFICATION
Many data mining tools are available for data preprocessing and mining using machine learning, artificial intelligence, and other techniques. Here we discuss six powerful tools available for this purpose.
1) Rapid Miner: One of the proficient tools for data mining, based on Java technologies. The tool is highly efficient and provides an integrated environment for various business, industrial, and research applications. The framework is available for data mining, text mining, business analytics, and various business applications. It is an automated tool in which manual coding is minimal, which is an advantage for the user.
2) WEKA: The tool was primarily developed for agricultural solutions but is nowadays used in many other domains. Its advantage is its user-friendly GUI. The tool is developed in Java, and its strength lies in its collection of visualization tools, data processing techniques, and highly robust algorithms. However, the tool works efficiently only for single-relation data mining; one can use other tools to convert multiple tables into a single table and then process that table through WEKA [11] [12].
3) R-Programming: The language is popular among statisticians and data miners because of its proficiency in statistical computing. Its main advantage is its adaptability in implementing various graphs and graphical solutions. R objects can be linked to and manipulated from code in other programming languages, such as Java, C++, and .NET.
4) Orange: Orange is gaining popularity rapidly among data miners because of its varied collection of data mining algorithms, such as feature scoring, filtering, and exploration techniques. The tool is based on C++ and Python, thereby providing the robustness of C++ and the flexibility of Python.
5) KNIME: The tool is relatively new and is gaining popularity because of its OS portability, as it is coded in Java. Its salient feature is that it allows the use of various plug-ins and extensions according to the requirements, which extend the scope of KNIME. One of the important aspects of this tool is its capability to process large volumes of data.
6) NLTK: NLTK is a natural language processing tool developed in Python, so it can be combined with a large number of libraries, which increases its scope. The main advantage of the tool is that it is easily customizable: users can build their applications on top of it and modify them later [11].
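The remark in the WEKA entry about converting multiple tables into a single table before mining can be illustrated with a short preprocessing step. The following is a minimal sketch in Python rather than in any of the tools listed; the table contents, the column names, and the helper functions `join_tables` and `to_csv` are hypothetical and introduced only for illustration.

```python
import csv
import io

# Hypothetical relational tables sharing the key column "patient_id".
patients = [
    {"patient_id": 1, "age": 54, "sex": "F"},
    {"patient_id": 2, "age": 61, "sex": "M"},
]
lab_results = [
    {"patient_id": 1, "cholesterol": 230},
    {"patient_id": 2, "cholesterol": 180},
]


def join_tables(left, right, key):
    """Inner-join two lists of row dictionaries on a shared key column."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


def to_csv(rows):
    """Serialize row dictionaries as one flat CSV table with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


flat = join_tables(patients, lab_results, "patient_id")
flat_csv = to_csv(flat)
print(flat_csv)
```

The resulting flat table has one row per patient and one column per attribute, which is the single-relation layout that the entry describes.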
V.
CONCLUSION
Data mining is a fast-evolving technology that is being adopted in biomedical sciences and research. In this survey we have reviewed published work by different authors in the field of medical data mining using classification and clustering techniques, and we have discussed various tools available for data preprocessing and classification. The approaches and algorithms summarized here should help researchers in medical diagnosis and medical practitioners develop decision support systems that integrate classification and clustering techniques. No single data mining approach suits every case; the choice depends on the type of dataset. If the available dataset is labeled, the best approach is to apply classification algorithms, whereas for an unlabeled dataset it is better to apply a clustering technique, which is well suited to pattern recognition. This survey underlines the importance of research on the diagnosis of life-threatening diseases. In health care, the goal is 100% accuracy. Several studies come close to that target, but disease diagnosis still suffers from a high false alarm rate, so novel approaches are needed to reduce it, which would help in the early diagnosis of disease.
REFERENCES
[1] B.V. Sumana and T. Santhanam, “Prediction of diseases by cascading clustering and classification”, Advances in Electronics, Computers and Communications, IEEE International Conference, pp 18, Oct 2014.
[2] Online source: http://blog.heart.org/american-heartassociation-statistical-report-tracks-global-figures-firsttime.
[3] M. Shouman, T. Turner and R. Stocker, “Using data mining techniques in heart disease diagnosis and treatment”, Electronics, Communications and Computers Conference, 2012.
[4] M. Shouman, T. Turner and R. Stocker, “Using decision tree for diagnosing heart disease patients”, Conferences in Research and Practice in Information Technology, vol. 121, 2011.
[5] V. Chaurasia and S. Pal, “Early prediction of heart diseases using data mining techniques”, Caribbean Journal of Science and Technology, vol. 1, pp 208-217, 2013.
[6] G.E. Sakr, I.H. Elhajj and H.A. Huijer, “Support vector machines to define and detect agitation transition”, Affective Computing IEEE Transactions, vol. 1, 2010.
[7] G.G. Cabral and A.L.I. de Oliveira, “One-class classification for heart disease diagnosis”, IEEE International Conference on Systems, Man and Cybernetics, 2014.
[8] M. Alsalamah, Dr. S. Amin and Dr. J. Halloran, “Diagnosis of heart disease by using radial basis function network classification technique on patients’ medical records”, RF and Wireless Technologies for Biomedical and Healthcare Applications, 2014.
[9] M. Shouman, T. Turner and R. Stocker, “Integrating naïve bayes and K-Means clustering with different initial centroid selection methods in the diagnosis of heart disease patients”, ICAITA, 2012.
[10] N.A. Setiawan, D.W. Prabowo and H.A. Nugroho, “Benchmarking of feature selection techniques for coronary artery disease diagnosis”, International Conference on Information Technology and Electrical Engineering, 2014.
[11] Online source: http://thenewstack.io/six-of-the-bestopen-source-data-mining-tools.
[12] S. Singhal and M. Jena, “A study on WEKA Tool for data preprocessing, classification and clustering”, IJITEEI, 2013.
[13] S. Shaikh, A. Sawant, S. Paradkar and K. Patil, “Electronic recording system- Heart disease prediction system”, IEEE International Conference on Technologies for Sustainable Development, Feb 2015.
[14] Online source: http://www.cancerresearchuk.org/aboutcancer/what-is-cancer.
[15] A. Statnikov, C.F. Aliferis, I. Tsamardinos, D. Hardin and S. Levy, “A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis”, Bioinformatics, vol. 21, pp 631-643, 2005.
[16] A. H. Chen and J. C. Hsu, “Exploring novel algorithms for the prediction of cancer classification”, Software Engineering and Data Mining, 2010.
[17] P. Kumar and S.K. Wasan, “Analysis of X-means and global K-means using tumor classification”, Computer and Automation Engineering, vol 5, pp 832-835, 2010.
[18] Y. Xu, J.Y. Zhu, E. Chang and Z. Tu, “Multiple clustered instance learning for histopathology cancer image classification, segmentation and clustering”, Computer Vision and Pattern Recognition, 2012.
[19] M. Yassi, A. Yassi and M. Yaghoobi, “Distinguishing and clustering breast cancer according to hierarchal structures based on chaotic multispecies particle swarm optimization”, Iranian Conference on Intelligent Systems, pp 1-6, Feb 2014.
[20] Online source: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer.
[21] Z. Mahmud and S. Sulong, “Confounding effects of age, marital status and treatment on cervical cancer stages among malaysian women”, Statistics in Science, Business, and Engineering, 2012.
[22] O. Anunciaco, B.C. Gomes, S. Vinga, J. Gaspar, A.L. Oliverira and J. Rueff, “A data mining approach for detection of high-risk breast cancer groups”, Advances in Soft Computing, vol. 74, 2010.
[23] S.D. Savarkar, A.A. Ghalot and A.P. Pande, “Neural network aided breast cancer detection and diagnosis”, International Conference on Neural Networks, 2006.