Using data mining methods with a focus on Parkinson's disease

¹Michal Vadovský (2nd year), Supervisor: ²Ján Paralič

¹˒²Dept. of Cybernetics and Artificial Intelligence, FEI TU of Košice, Slovak Republic

[email protected], [email protected]

Abstract—This article first introduces medical data analysis for the purpose of early diagnosis and the basic characteristics of Parkinson's disease. It then describes the current state of research and the methods used to collect data from people suffering from this disease in order to create classification models. The mPower mobile application is also mentioned as a source of additional data capturing individuals' memory, walking, and screen-tapping activities. The last part of the paper describes our previous results in the analysis of patients' speech data using the R programming language. In conclusion, we outline the future direction of our work.

Keywords—Parkinson's disease, data mining, speech, handwriting, mPower

I. INTRODUCTION

Modern hospitals are equipped with various monitoring devices that collect many types of data. This is a relatively inexpensive way of collecting and storing primary data, which are then used in hospital information systems. The main goal of predictive data mining is to build models that work with specific information about a patient and, based on this information, provide descriptive and/or predictive outputs that assist doctors in decision making [1]. Predictive methods are used to create decision models for tasks such as prognosis, diagnosis, and treatment planning.

A typical medical procedure for diagnosing a patient's disease is tedious and time-consuming. First, the doctor needs to gather the necessary information and the results of the patient's examinations and then make a decision (a diagnosis or a proper treatment procedure). Over time, the volume of these data becomes significant, and data mining methods can accelerate the process and help doctors decide in difficult situations.

Parkinson's disease (PD) [2] is one of the most common chronic neurodegenerative diseases and affects about 3.8 million patients worldwide. The disease currently occurs 3 times more often in men than in women, which may be related to the protective effects of estrogen in women [3]. Typical primary symptoms include shaking of the hands, arms, and legs, slowness of movement, muscle rigidity, and problems with speech [4]. Currently, there is no treatment that cures the disease completely. Drugs that replace the missing dopamine are partially helpful in keeping patients in good condition.

II. CURRENT STATE

Since there is still no proper cure for PD, a number of researchers are focusing on the creation of decision-support systems that may serve for early diagnosis of this disease. PD affects the part of the brain known as the substantia nigra, which controls the movement of the body. Unlike healthy people, patients with this disease show disruptions in practical skills such as handwriting and speech. Therefore, many researchers have collected these types of data from patients.

In several publications [5] [6], P. Drotár et al. studied the handwriting of PD patients recorded on a tablet. They monitored movement on the tablet surface, in-air movement above the surface, and the pressure applied when writing. These data were converted into a range of indicators, which were then compared using data mining methods (e.g., SVM, AdaBoost, and KNN). M. Little and A. Tsanas et al. [7] [8] worked with data from Parkinson's patients and described their recorded speech signals with different indicators, which were subsequently used to classify patients. Similar data were also collected by O. Kursun and B. E. Sakar et al. [10], who focused on different types of words in Turkish.

The mobile application mPower: Mobile PD Study [9] measures and tracks symptoms associated with PD; with it, its authors managed to map a large number of healthy people as well as patients suffering from this disease. The application consists of demographic, UPDRS, and psychological (PDQ-8) surveys, along with recordings of memory, voice, movement, and tapping activities.

III. ACHIEVED RESULTS

In the current research, we worked only with data that are freely available in the UCI Machine Learning Repository and that refer to patients' speech. We focused on two datasets, which were obtained by different authors from different subjects [11].
In both datasets, patients' speech was transformed into parameters (attributes), most of which were the same in both datasets, with some additionally derived [12] [13]. In the data understanding phase, we examined each attribute individually and checked its dependence on the target attribute, which is binary and indicates whether the patient suffers from PD (1) or not (0). For this task, we used Welch's two-sample t-test, which compares the mean values of an attribute between the two classes of the target attribute. The quantity of main interest in this test is the p-value: the lower the p-value, the stronger the evidence that the attribute's mean differs between the classes, that is, that the target attribute depends on the selected numeric attribute. In the first dataset, the lowest p-value (0.028) was found between the target attribute Status and the attribute expressing the maximum vocal frequency. In the second dataset, the strongest dependence on the target attribute (p-value = 0.0000000609) was found for Jitter (local, absolute).

SCYR 2017 – 17th Scientific Conference of Young Researchers – FEI TU of Košice

A. Creating models

For the first dataset, only records of patients' ordinary speech were available. The models were created using the Naïve Bayes classifier and decision trees (algorithms C4.5, C5.0, and CART), which are easy for doctors to interpret and understand. The data were split into training and testing sets in 70/30 and 80/20 ratios, and the ratio of the target attribute values in the original data was preserved in both sets (stratified split). For each method and algorithm, we built 10 models and calculated the resulting accuracy as the average of all accuracies achieved. The highest average accuracy, 87.46%, was reached with the C4.5 algorithm (70/30 split), while the lowest average accuracy was achieved by the Naïve Bayes classifier. For the best-performing algorithm, C4.5, Table I presents the contingency table, which compares the real values in the testing set with the values predicted by the classification model [12].

TABLE I
CONTINGENCY TABLE (C4.5)

                                 The real value
Predicted value by the model       0        1
             0                    11        3
             1                     4       41
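The evaluation procedure described in this subsection can be illustrated with a short sketch. It is written in Python purely for illustration (the analysis in this paper was carried out in R), and all function and variable names are hypothetical. The first function shows one simple way to make a stratified split; the toy labels are invented and are not the study data. The second function derives accuracy, sensitivity, and specificity from Table I, read as 11 and 3 in the row for predicted value 0 and 4 and 41 in the row for predicted value 1. Treating class 1 (PD) as the positive class is an assumption of this sketch.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_share, seed=0):
    """Split record indices so that each class keeps about the same share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_share)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


def summarize(table):
    """Accuracy, sensitivity and specificity of a 2x2 contingency table.

    Keys are (predicted, real); class 1 (PD) is taken as positive.
    """
    tn, fn = table[(0, 0)], table[(0, 1)]
    fp, tp = table[(1, 0)], table[(1, 1)]
    total = tn + fn + fp + tp
    return {
        "n": total,
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of PD patients detected
        "specificity": tn / (tn + fp),  # share of healthy subjects recognized
    }


# Toy labels (hypothetical, not the study data): 30 healthy, 90 with PD.
toy_labels = [0] * 30 + [1] * 90
train_idx, test_idx = stratified_split(toy_labels, train_share=0.7, seed=1)
test_pd_share = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_pd_share)

# Counts taken from Table I (C4.5).
table_1 = {(0, 0): 11, (0, 1): 3, (1, 0): 4, (1, 1): 41}
metrics = summarize(table_1)
for name, value in metrics.items():
    print(name, round(value, 4))
```

For the counts in Table I, this gives 52 correct predictions out of 59, an accuracy of about 88.1%, with a sensitivity of about 93.2% (41 of 44) and a specificity of about 73.3% (11 of 15). The reported 87.46% is an average over 10 models, so it need not coincide with the figure obtained from a single table.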

In the second dataset, records were available of subjects pronouncing the vowels A, U, and O, the numbers 1–10, 4 short sentences, and 9 words. Our main goal was to determine which type of speech can be used to create classification models with the highest accuracy. To obtain reliable accuracy estimates when splitting the data into training and testing sets, 4-fold and 5-fold cross-validation was used, again preserving the ratio of target attribute values. The modeling was carried out only with decision tree methods: the algorithms C4.5, C5.0, CART, and RandomForest. Among the accuracies obtained, the highest (71%) was achieved by the RandomForest algorithm on records in which subjects pronounced the numbers from 1 to 10. The worst results (50.63%) were obtained by models built from records of the vowel U [13].

IV. CONCLUSION

Based on the results obtained from the data capturing patients' speech, we can conclude that classification is possible with the highest average accuracy of 87.46%. In publication [14], the highest accuracy achieved on the same data, using the Support Vector Machine method, was 76%. In addition, the accuracies broken down by type of speech may help to determine which words subjects should pronounce when their speech is recorded in the future. Moreover, obtaining transformed attributes of patients' speech is quite simple: several tools are available for this purpose, e.g., Praat Acoustic Analysis.

V. FUTURE WORK

In future work, we would like to focus on creating and comparing models that employ the different types of data we received from the mPower study. Then, based on the accuracies and other indicators, we want to focus on a particular type of data and highlight the typical values of selected parameters or their cut-off points. Our aim is also not only to determine whether a patient suffers from PD, but also to determine the stage of the disease (for example, by predicting UPDRS). The main objective is to develop and verify a decision-support system for doctors for the primary diagnosis of people with PD.

ACKNOWLEDGMENT

This publication arose thanks to the support of the Operational Programme Research and Development for the project "Centre of Information and Communication Technologies for Knowledge Systems" (ITMS code 26220120020), co-financed by the European Regional Development Fund.

REFERENCES

[1] Cios, Krzysztof J. – Moore, G. William: Uniqueness of medical data mining. In: Artificial Intelligence in Medicine. Vol. 26, No. 1-2 (2002), pp. 1-24. ISSN: 0933-3657.
[2] De Lau, L. M. – Breteler, M. M.: Epidemiology of Parkinson's disease. In: The Lancet Neurology. Vol. 5, No. 6 (2006), pp. 525-535.
[3] Dexter, D. T. – Jenner, P.: Parkinson disease: from pathology to molecular disease mechanisms. In: Free Radical Biology and Medicine. No. 62 (2013), pp. 132-144.
[4] Cnockaert, L. et al.: Low-frequency vocal modulations in vowels produced by Parkinsonian subjects. In: Speech Communication. Vol. 50, No. 4 (2008), pp. 288-300.
[5] Drotár, P. et al.: Analysis of In-Air Movement in Handwriting: A Novel Marker for Parkinson's disease. In: Computer Methods and Programs in Biomedicine. Vol. 117, No. 3 (2014), pp. 405-411.
[6] Drotár, P. et al.: Evaluation of handwriting kinematics and pressure for differential diagnosis of Parkinson's Disease. In: Artificial Intelligence in Medicine. Vol. 67 (2016), pp. 39-46.
[7] Little, M. A. et al.: Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection. In: BioMedical Engineering OnLine. Vol. 6, No. 23 (2007), 19 p.
[8] Little, M. A. et al.: Suitability of Dysphonia Measurements for Telemonitoring of Parkinson's Disease. In: IEEE Transactions on Biomedical Engineering. Vol. 56, No. 4 (2009), pp. 1015-1022. ISSN: 1558-2531.
[9] Bot, Brian M. et al.: The mPower study, Parkinson disease mobile data collected using ResearchKit. In: Scientific Data. Vol. 3, No. 160011 (2016).
[10] Sakar, B. E. et al.: Collection and analysis of a Parkinson speech dataset with multiple types of sound recordings. In: IEEE Journal of Biomedical and Health Informatics. Vol. 17, No. 4 (2013), pp. 828-834.
[11] Vadovský, M. – Paralič, J.: Data Collection Methods for the Diagnosis of Parkinson's Disease. In: International Journal on Biomedicine and Healthcare. Vol. 5, No. 1 (2017), pp. 28-32. ISSN 1805-8698.
[12] Vadovský, M. – Paralič, J.: Predikcia Parkinsonovej choroby pomocou signálov reči použitím metód dolovania v dátach (in Slovak). In: WIKT & DaZ 2016, Bratislava: STU, 2016, pp. 329-333. ISBN: 978-80-227-4619-9.
[13] Vadovský, M. – Paralič, J.: Parkinson's Disease patients' classification based on the speech signals. In: Applied Machine Intelligence and Informatics (SAMI): 2017 IEEE 15th International Symposium on, IEEE, 2017, pp. 321-325. ISBN: 978-1-5090-5654-5.
[14] Geeta, Y. et al.: Predication of Parkinson's disease using data mining methods: A comparative analysis of tree, statistical and support vector machine classifiers. In: Computing and Communication Systems (NCCCS), 2012 National Conference on, IEEE, 2012, pp. 1-8.
