International Research Journal of Applied and Basic Sciences © 2013. Available online at www.irjabs.com. ISSN 2251-838X / Vol. 4 (12): 4041-4046. Science Explorer Publications.
Intelligent Diagnosis of Asthma Using Machine Learning Algorithms

Taha Samad Soltani Heris 1, Mostafa Langarizadeh 2, Zahra Mahmoodvand 3, Maryam Zolnoori 4

1. PhD student of medical informatics, College of Paramedics, Tehran University of Medical Sciences, Tehran, Iran
2. Assistant professor of medical informatics, College of Paramedics, Tehran University of Medical Sciences, Tehran, Iran
3. M.Sc. student of health information technology, College of Management and Medical Information, Tehran University of Medical Sciences, Tehran, Iran
4. Post-doctoral student in health informatics, College of Informatics, the State University of Indiana, the United States of America

*Corresponding author email: [email protected]
ABSTRACT: Data mining in healthcare is an important field for diagnosis and for a deeper understanding of medical data; it aims to solve real-world problems in diagnosing and treating diseases. Diagnosis is one of the most important applications of machine learning, and the diagnosis of asthma is a notable challenge because of the complexity of the disease and physicians' limited knowledge of it. The purpose of this research is the accurate diagnosis of asthma using efficient machine learning algorithms. The study was conducted on a dataset of 169 asthmatic and 85 non-asthmatic patients visiting the Imam Khomeini and Masseeh Daneshvari Hospitals of Tehran. The k-nearest neighbors, random forest, and support vector machine algorithms, together with pre-processing and efficient training, were applied to this dataset, and the accuracy and specificity of the resulting systems were compared with each other and with previous research. Among the neighborhood values tested, the highest specificity was achieved with five neighbors. Our method was compared with other machine learning methods and with similar research, and ROC curves were plotted. The other methods also achieved suitable, reliable results. We therefore propose our approach based on the k-nearest neighbors algorithm, together with Relief-F pre-processing and cross-fold data sampling, as an efficient artificial intelligence method for the classification and differential diagnosis of diseases.
Keywords: asthma, data mining, diagnosis, intelligent systems, machine learning.
INTRODUCTION
Data mining is an important step in discovering and extracting knowledge: it involves exploring large datasets with the purpose of extracting previously unknown patterns among them (Han 2005). Discovering relations among data is very difficult with traditional statistical methods (Lee, Liao et al. 2000). Data mining is not a new approach; it is widely used in financial institutions in areas such as fraud detection and pricing, salespersons and retailers use it to segment their markets and to distribute their products among warehouses, and manufacturers employ it for quality control and for scheduling maintenance and repairs (Koh and Tan 2011). Data mining applications have rapidly spread through large sections of organizations offering medical care, as well as in financial prediction and weather forecasting (Obenshain 2004). There is high potential for data mining applications in healthcare. These applications can be grouped, in general, into evaluation of the effects of treatments, management of healthcare, management of relations with patients, and recognition of fraud and misuse. More specialized data mining includes medical diagnosis and the analysis of DNA microarrays (Koh and Tan 2011).
Health data mining aims to solve real-world problems in the diagnosis and treatment of diseases (Liao and Lee 2002), and it is an important research area for predicting diseases and for gaining a more profound understanding of health data. Researchers use it to diagnose different diseases, employing various approaches and algorithms with different specificity and accuracy (Shouman, Turner et al. 2012). Among the most important such studies are those on the automatic diagnosis of erythemato-squamous skin diseases (Güvenir, Demiröz et al. 1998; Übeylı and Güler 2005; Übeyli and Doğdu 2010; Xie and Wang 2011; Abdi and Giveki 2013). Data classification is carried out using a variety of methods, including k-nearest neighbors, support vector machines, and random forests, which are popular methods for multi-class diagnosis in pattern recognition (Athitsos 2005) and are among the most commonly used methods in data mining and classification (Moreno-Seco, Micó et al. 2003). In this article, we apply these algorithms to the diagnosis of asthma in order to achieve suitable diagnostic performance.

METHODS AND MATERIALS
The process of solving the problem was modeled in steps to obtain the desired output. In these steps, which are shown in Figure 1, the processes of organizing the input, pre-processing, data sampling, adjusting the parameters of the algorithms, executing the algorithms, and analyzing the outputs were carried out.

Figure 1. Common steps in carrying out data mining for classification, as recommended in the Orange data mining software: organizing input data → pre-processing → data sampling → adjusting the algorithm → executing the algorithm → evaluating the output.

Organizing input data
In this research, a dataset of patients visiting the Imam Khomeini and Masseeh Daneshvari Hospitals of Tehran was used (Zolnoori, Zarandi et al. 2007). The dataset contains 254 records, each with 24 input features and one discrete output feature, shown in Table 1. Each feature has a range of numerical values; coughing, for example, may take values from zero to 50 depending on its severity. The output takes one of two classes, one and zero, denoting patients diagnosed and not diagnosed with asthma, respectively.

Table 1. Features in the asthma dataset used in this research (asthma: 169 patients; non-asthma: 85 patients)
Feature 1: Parent1
Feature 2: Parent2
Feature 3: Rel
Feature 4: allergic rhinitis
Feature 5: BMI
Feature 6: gastroesophageal reflux
Feature 7: reaction to allergens
Feature 8: reaction to irritants
Feature 9: reaction to emotion
Feature 10: chest tightness
Feature 11: exercise-induced symptoms
Feature 12: dyspnea
Feature 13: history of eczema or hay fever
Feature 14: hospitalization before age 3
Feature 15: allergy to food before age 3
Feature 16: cold status
Feature 17: air pollution
Feature 18: exposure to smoking
Feature 19: mother smoking in pregnancy
Feature 20: exposure to pets in childhood
Feature 21: high damp in house or humid climate
Feature 22: cough
Feature 23: phlegm
Feature 24: wheeze
Pre-processing
The pre-processing of data is an important step in the data mining process. Data collection methods are often not tightly controlled, and outlier, duplicate, and incorrect values may produce invalid output; a set of actions must therefore be taken before the data is used, in order to ensure good quality (Pyle 1999). If the information used in knowledge extraction and training contains redundant or irrelevant data, these processes become very difficult. The pre-processing step may be time-consuming and include processes such as cleaning, normalization, conversion, and feature extraction; its output is the training set (Kotsiantis, Kanellopoulos et al. 2006). In this study, pre-processing was conducted in two steps. In the first step, incomplete data was eliminated from the dataset in order to achieve better agreement
in later stages. The second step was feature selection based on the Relief-F strategy and algorithm. In Relief-F, one of the most successful feature selection methods (Liu and Motoda 2008), samples are randomly chosen and the weight of each feature is calculated based on its nearest neighbors. After experimenting with different feature values and calculating the outputs, the 15 best features were selected and used as the input for the algorithms.

Data sampling
In the data-sampling step, a subset of data is extracted from the input dataset in order to divide the data into training and testing sets. In this research, the cross-validation method was used. This method estimates generalization error through repeated sampling, and the resulting estimate is usually used for choosing among models, including network architectures. In k-fold cross-validation, the samples are divided into k roughly equal subsets; the model is then trained and tested k times, each time holding out one of the subsets (Sarle 2012). In our research, we divided the samples into 10 subsets and calculated the accuracy and sensitivity of the outputs to make the necessary adjustments. After investigating the best accuracies and sensitivities, 26 records were selected for modeling and 228 records remained.

Adjusting the algorithm
In this step, the parameters of the algorithms are tested and evaluated with different values in order to find the best possible system output. The relevant parameters are the neighborhood size for the k-nearest neighbor algorithm, the kernel type for the support vector machine, and the number of trees for the random forest.
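As an illustration of the Relief-style weighting described above, here is a minimal sketch (not the authors' implementation; the two-feature toy dataset is invented for the example). A feature gains weight when it separates an instance from its nearest neighbor of the other class (the "miss") more than from its nearest neighbor of the same class (the "hit"); the 15 highest-weighted features would then be kept.

```python
import numpy as np

def relief_f(X, y, n_iter=100, rng=None):
    """Minimal Relief-style feature weighting for a binary-labelled,
    numeric feature matrix."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    # scale each feature to [0, 1] so all feature differences are comparable
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        i = rng.integers(len(Xs))
        d = np.abs(Xs - Xs[i]).sum(axis=1)        # distance to every row
        d[i] = np.inf                             # exclude the instance itself
        same, other = y == y[i], y != y[i]
        hit = np.where(same, d, np.inf).argmin()  # nearest same-class row
        miss = np.where(other, d, np.inf).argmin()  # nearest other-class row
        # reward features that separate the miss, penalize those that
        # separate the hit
        w += np.abs(Xs[i] - Xs[miss]) - np.abs(Xs[i] - Xs[hit])
    return w / n_iter

# toy data: feature 0 perfectly separates the two classes
X = np.array([[0, 0.1], [0, 0.2], [1, 0.9], [1, 1.0]])
y = np.array([0, 0, 1, 1])
weights = relief_f(X, y, n_iter=50, rng=0)
print(weights.argmax())  # → 0
```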
Executing the algorithm
The k-nearest neighbor algorithm
The KNN algorithm, also called memory-based classification, is one of the simplest and most direct data mining techniques (Alpaydin 1997). In this method, differences between features are measured with the Euclidean distance in order to work with continuous features. For example, if the first sample is (x1, ..., xn) and the second sample is (y1, ..., yn), then the distance between them is calculated using formula 1:

d(x, y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 )   (1)

The most serious problem with the Euclidean distance is that features with larger values often swamp smaller ones. The KNN algorithm usually works with continuous data, but it is effective for discrete data as well: if two discrete values differ, their difference is taken as one, and if they are equal, it is taken as zero (Shouman, Turner et al. 2012). The algorithm was run with different neighborhood values until the highest sensitivity and specificity were achieved (reported in the results section).

The support vector machine (SVM) algorithm
Cortes and Vapnik (Cortes and Vapnik 1995) introduced the support vector machine (SVM) algorithm as a very efficient method for pattern recognition. The SVM is a powerful tool for data classification. The algorithm separates the two sets of data with a hyperplane in the space of the original data for linear classification, or in a higher-dimensional space for non-linear classification (Fung and Mangasarian 2005). In classification, data is usually separated into training and testing sets, and each sample in the training set includes a target value (output). The purpose of the SVM is to produce a model from the training data that can predict the target values of the test data (Hsu, Chang et al. 2003).
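The distance measure of formula 1 in the KNN section above, extended with the 0/1 convention for discrete features, can be sketched as follows. The patient records here are hypothetical, and in practice the continuous features would be normalized first so that large-valued features do not swamp the rest:

```python
import math

def mixed_distance(a, b, discrete):
    """Euclidean-style distance of formula 1, with the 0/1 rule for
    discrete features: differing discrete values contribute 1, equal
    values contribute 0."""
    total = 0.0
    for x, y, d in zip(a, b, discrete):
        total += (0.0 if x == y else 1.0) if d else (x - y) ** 2
    return math.sqrt(total)

# two hypothetical records: [cough score, wheeze score, parental history]
a = [30.0, 10.0, 1]
b = [34.0, 13.0, 0]
print(mixed_distance(a, b, discrete=[False, False, True]))
# sqrt(16 + 9 + 1) = sqrt(26) ≈ 5.099
```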
A support vector machine constructs a hyperplane, or a set of hyperplanes, in a high-dimensional space. These hyperplanes can be used for classification, regression, or other tasks. A good separation is achieved by the hyperplane with the largest distance to the nearest training points of each class (Deepa and Thangaraj 2011).

The random forest algorithm
The random forest algorithm is a general classification method consisting of a large number of decision trees, with the class output determined by the individual trees. Random decision forests were first proposed by Ho (Ho 1995), and Leo Breiman and Adele Cutler later developed the random forest algorithm. Random forests are combinations of computational trees, each grown from independently sampled random vectors, and the same
distribution is applied for all the trees in the forest. The classifier consists of a collection of tree classifiers of the form {h(x, Θk), k = 1, ...}, in which the {Θk} are random vectors distributed independently and identically; each tree casts a single vote for the most popular class of the input x (Breiman 2001).

Evaluating the output
Output evaluation was performed on the basis of sensitivity, specificity, accuracy, the confusion matrix, and the ROC curve (Zhu, Zeng et al. 2010). Formulas 2, 3, and 4 were used to calculate sensitivity, specificity, and accuracy, where TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative diagnoses, respectively. Sensitivity is the proportion of positive cases correctly identified by the test; one hundred percent sensitivity implies that the disease is correctly diagnosed in all patients:

Sensitivity = TP / (TP + FN)   (2)

Specificity is the statistical measure of how well the test identifies the negative cases:

Specificity = TN / (TN + FP)   (3)

Finally, accuracy is the proportion of all diagnoses that are correct:

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (4)

RESULTS AND DISCUSSION
This research was conducted in the area of clinical diagnosis in the Orange data mining canvas and implemented in the Python language. The input features for the prepared application are listed in Table 1. The performance of the algorithms depends on the sizes of the training and testing datasets, so an identical set of samples of the same size was employed for evaluating and comparing the implemented algorithms. For the SVM, an optimal training output was obtained with the RBF kernel function by adjusting the cost parameter and the numerical accuracy (2.001 and zero, respectively). Euclidean distances were used in the nearest-neighbor algorithm, weighting was performed based on these distances, and the continuous values were normalized. In the random forest method, twenty trees were used. Table 2 shows the sensitivity, specificity, and accuracy of the algorithms in the diagnosis of asthma; the results of our study are also compared with the outputs of the fuzzy expert system designed by Zarandi et al. on the same dataset. Because the samples were drawn randomly, these values change on every run of the program; we therefore stopped iterating once one of the methods reached 100 percent specificity, sensitivity, and accuracy.

Table 2. Results of the analysis of the outputs of our study and comparison with those obtained by other researchers on asthma

Algorithm | Description | Class | Accuracy | Sensitivity | Specificity
KNN | K = 5 | 1 | 1 | 1 | 1
KNN | K = 5 | 0 | 1 | 1 | 1
SVM | Exp(-g|x - y|^2) | 1 | 0.9870 | 0.9934 | 0.9737
SVM | Exp(-g|x - y|^2) | 0 | 0.9870 | 0.9737 | 0.9934
RF | Number of trees = 20 | 1 | 0.9652 | 0.9868 | 0.9211
RF | Number of trees = 20 | 0 | 0.9652 | 0.9211 | 0.9868
Fuzzy expert system (Zarandi, Zolnoori et al. 2010) | Fuzzy | 1 | - | 0.94 | 1
Computer-assisted, symptom-based diagnosis (Choi, Yoo et al. 2007) | Decision rule | 1 | - | 0.562 | 0.935
CAIDSA (Chakraborty, Mitra et al. 2009) | Decision rule | 1 | 0.90 | - | -
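For readers who want to reproduce a comparison of this shape, the following sketch uses scikit-learn rather than the authors' Orange canvas, and synthetic data in place of the non-public hospital dataset. The parameter choices mirror the ones reported above (k = 5 with distance weighting, an RBF kernel, 20 trees), and the metrics follow formulas 2 to 4:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# stand-in data with the same shape as the study (254 records, 15 selected
# features, binary label); the real hospital dataset is not public
X, y = make_classification(n_samples=254, n_features=15, random_state=0)

models = {
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5, weights="distance"),
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "RF (20 trees)": RandomForestClassifier(n_estimators=20, random_state=0),
}

for name, model in models.items():
    pred = cross_val_predict(model, X, y, cv=10)   # 10-fold cross-validation
    tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
    sens = tp / (tp + fn)                          # formula 2
    spec = tn / (tn + fp)                          # formula 3
    acc = (tp + tn) / (tp + tn + fp + fn)          # formula 4
    print(f"{name}: sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
```

On the synthetic data the exact numbers will of course differ from Table 2; the point is the evaluation procedure, not the values.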
In the nearest-neighbor method, the best results were obtained with five neighbors: for this k value, specificity, accuracy, and sensitivity are all equal to one. In the SVM method, the best results, a specificity of 0.9934, a sensitivity of 0.9737, and an accuracy of 0.9870, were achieved when the radial basis function, a non-linear kernel, was used (Chang, Hsieh et al. 2010). In the random forest method, with 20 trees, the specificity, sensitivity, and accuracy were 0.9868, 0.9211, and 0.9652, respectively. All values were calculated for both output classes, one and zero, which stand for people diagnosed and not diagnosed with asthma; class 1 is, of course, the one of greater clinical interest. The confusion matrices for all three algorithms are shown in Figure 2; they indicate the numbers of true and false positive diagnoses and the numbers of true and false negative diagnoses.

Figure 2. The confusion matrices for the algorithms used in the study.

The ROC curve was plotted to reveal and evaluate the efficiency of the system. The nearer its points are to the top-left corner, the better, and the closer the prediction model is to its ideal state (Metz 1978). The curves for all three algorithms are shown in Figure 3.

Figure 3. The ROC curves for the machine learning algorithms used in our research; colors represent the different training methods employed.
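An ROC curve of the kind discussed above can be computed as follows; again, this is a sketch on synthetic stand-in data with scikit-learn, not the authors' Orange workflow:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for the 254-record, 15-feature asthma dataset
X, y = make_classification(n_samples=254, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
scores = knn.predict_proba(X_te)[:, 1]   # score for the positive (asthma) class

fpr, tpr, _ = roc_curve(y_te, scores)    # points of the ROC curve
auc = roc_auc_score(y_te, scores)        # area under it; 1.0 is the ideal
print(f"AUC = {auc:.3f}")
```

The `fpr`/`tpr` arrays are what a plotting library would draw; the area under the curve summarizes how close the model is to the ideal top-left corner.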
The results obtained from our research, the findings of studies employing the same dataset, and the conclusions of similar research on diagnosing asthma are compared in Table 2, and the important points are revealed there. In 2010, Fazel Zarandi et al. introduced a fuzzy method for diagnosing asthma on the dataset used in our study (Zarandi, Zolnoori et al. 2010). The overall specificity and sensitivity of their classification model were 100 and 94 percent, respectively; with respect to the sensitivity of the diagnosis, the algorithms used in our study yielded better results than this fuzzy method. In another study, titled "Easy diagnosis of asthma: computer-assisted, symptom-based diagnosis", the specificity and sensitivity obtained were 93.5 and 56.2 percent, respectively (Choi, Yoo et al. 2007). That study was conducted using questionnaires, and the overall diagnosis was produced by summing the scores given; on the evaluation indices, our approach is considerably more efficient.
Chakraborty et al. (2009) also proposed a system for the intelligent diagnosis of asthma, with an accuracy of 90.03 percent, using a neural network and decision rules (Chakraborty, Mitra et al. 2009). This system performs worse than ours, in which we obtained accuracies of 96.52 to 100 percent. In this study, the three approaches of k-nearest neighbors, random forest, and support vector machine, together with pre-processing based on the Relief-F strategy and cross-fold data sampling, are proposed for mining the asthma dataset in order to diagnose the disease. The highest efficiency of the k-nearest neighbor algorithm was achieved with five neighbors, at which all cases of the disease were correctly diagnosed. The other algorithms also performed better than the study based on the fuzzy expert system. Efficient pre-processing removes incomplete and problematic data, discretizes values through discretization techniques, adopts a suitable training algorithm, and selects and estimates appropriate algorithm parameters, thereby greatly affecting the output and performance of classification methods.

REFERENCES
Abdi MJ, Giveki D. 2013. "Automatic detection of erythemato-squamous diseases using PSO-SVM based on association rules." Engineering Applications of Artificial Intelligence 26(1): 603-608.
Alpaydin E. 1997. "Voting over multiple condensed nearest neighbors." Artificial Intelligence Review 11(1-5): 115-132.
Athitsos V. 2005. Boosting nearest neighbor classifiers for multiclass recognition. IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops (CVPR Workshops 2005). IEEE: 45.
Breiman L. 2001. "Random forests." Machine Learning 45(1): 5-32.
Chakraborty C, Mitra T, et al. 2009. "CAIDSA: computer-aided intelligent diagnostic system for bronchial asthma." Expert Systems with Applications 36(3).
Chang YW, Hsieh CJ, et al. 2010. "Training and testing low-degree polynomial data mappings via linear SVM." The Journal of Machine Learning Research 11: 1471-1490.
Choi BW, Yoo KH, et al. 2007. "Easy diagnosis of asthma: computer-assisted, symptom-based diagnosis." Journal of Korean Medical Science 22(5): 832-838.
Cortes C, Vapnik V. 1995. "Support-vector networks." Machine Learning 20(3): 273-297.
Deepa VB, Thangaraj P. 2011. "Classification of EEG data using FHT and SVM based on Bayesian Network." IJCSI International Journal of Computer Science Issues 8(2): 239-243.
Fung GM, Mangasarian OL. 2005. "Multicategory proximal support vector machine classifiers." Machine Learning 59(1-2): 77-97.
Güvenir HA, Demiröz G, et al. 1998. "Learning differential diagnosis of erythemato-squamous diseases using voting feature intervals." Artificial Intelligence in Medicine 13(3): 147-165.
Han J. 2005. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc.
Ho TK. 1995. Random decision forests. Proceedings of the Third International Conference on Document Analysis and Recognition. IEEE.
Hsu CW, Chang CC, et al. 2003. A practical guide to support vector classification.
Koh HC, Tan G. 2011. "Data mining applications in healthcare." Journal of Healthcare Information Management 19(2): 65.
Kotsiantis SB, Kanellopoulos D, et al. 2006. "Data preprocessing for supervised learning." International Journal of Computer Science 1(2).
Lee IN, Liao SC, et al. 2000. "Data mining techniques applied to medical information." Medical Informatics and the Internet in Medicine 25(2): 81-102.
Liao SC, Lee IN. 2002. "Appropriate medical data categorization for data mining classification techniques." Medical Informatics and the Internet in Medicine 27(1): 59-67.
Liu H, Motoda H. 2008. Computational Methods of Feature Selection. Chapman & Hall.
Metz CE. 1978. Basic principles of ROC analysis. Seminars in Nuclear Medicine. Elsevier.
Moreno-Seco F, Micó L, et al. 2003. "A modification of the LAESA algorithm for approximated k-NN classification." Pattern Recognition Letters 24(1-3): 47-53.
Obenshain MK. 2004. "Application of data mining techniques to healthcare data." Infection Control and Hospital Epidemiology 25(8): 690-695.
Pyle D. 1999. Data Preparation for Data Mining. Morgan Kaufmann Publishers Inc.
Sarle W. 2012. "What are cross-validation and bootstrapping?" From http://www.faqs.org.
Shouman M, Turner T, et al. 2012. Applying k-nearest neighbour in diagnosing heart disease patients. 2012 International Conference on Knowledge Discovery (ICKD 2012), IPCSIT.
Übeyli ED, Doğdu E. 2010. "Automatic detection of erythemato-squamous diseases using k-means clustering." Journal of Medical Systems 34(2): 179-184.
Übeylı ED, Güler I. 2005. "Automatic detection of erythemato-squamous diseases using adaptive neuro-fuzzy inference systems." Computers in Biology and Medicine 35(5): 421-433.
Xie J, Wang C. 2011. "Using support vector machines with a novel hybrid feature selection method for diagnosis of erythemato-squamous diseases." Expert Systems with Applications 38(5): 5809-5815.
Zarandi MF, Zolnoori M, et al. 2010. "A fuzzy rule-based expert system for diagnosing asthma." Scientia Iranica, Transaction E: Industrial Engineering 17(2): 129-142.
Zhu W, Zeng N, et al. 2010. "Sensitivity, specificity, accuracy, associated confidence interval and ROC analysis with practical SAS implementations." NESUG Proceedings: Health Care and Life Sciences, Baltimore, Maryland.
Zolnoori M, Zarandi MHF, et al. 2007. Designing fuzzy expert system for diagnosing and evaluating childhood asthma. Department of Information Technology Management, Tarbiat Modares University. Master of Science in Information Technology Management: 199.