Comparison of Machine Learning Techniques using the ... - IEEE Xplore

2011 20th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises

Comparison of Machine Learning Techniques using the WEKA environment for Prostate Cancer Therapy Plan

Nikolaos Mallios, Dept. of Informatics and Computer Technology, Technological Educational Institute of Lamia, Lamia, Greece [email protected]

Abstract - The improvement and exploitation of a number of prominent Data Mining techniques in numerous real-world application areas (e.g. Industry, Healthcare and Bioscience) has led to the use of such techniques in machine learning environments, in order to extract useful information from the given data and to support decision making. In this study, a comprehensive comparison of techniques is performed on a fairly large set of data consisting of real medical cases of men diagnosed with prostate cancer who are receiving medical treatment. Forty patients, previously diagnosed with prostate cancer and not having undergone radiation therapy, were examined for a therapy change after already receiving medical treatment. Six parameters were measured over eight consecutive quarters to assess each patient's state and treatment outcome. Specifically, with the aid of the open source WEKA environment, the given data is tested with a number of machine learning classification techniques in order to compare how well the chosen algorithms reproduce the practitioner's decision on a potential therapy change.

Keywords - Data Mining, WEKA, Machine Learning, Bioinformatics, Prostate Cancer

I. INTRODUCTION

The use of Data Mining classification techniques in the area of Bioinformatics is a fairly common procedure. A major problem which arises when analyzing and evaluating medical/clinical data lies in the context of medical decision making. Namely, the correct diagnosis for further treatment, or a potential therapy change for a specific patient, could be assisted by machine learning techniques applied to the given data. A lot of effort has been devoted in Bioinformatics research to addressing a number of medical problems. The term machine learning encompasses a number of tasks, e.g. recognition, diagnosis, planning and prediction [1].

1524-4547/11 $26.00 © 2011 IEEE DOI 10.1109/WETICE.2011.28

Elpiniki Papageorgiou, Dept. of Informatics and Computer Technology, Technological Educational Institute of Lamia, Lamia, Greece [email protected]

Michael Samarinas MD, Konstantinos Skriapas MD, PhD, FEBU, Dept. of Urology, General Hospital of Larissa, Greece [email protected]

An advantage of applying machine learning techniques to a set of data is the extraction of useful information and of the correlations within it. In this study the given medical data was used to evaluate the performance of a number of classification techniques. Specifically, we analyze and evaluate the therapy change decision which a doctor makes when a number of blood test parameters, mainly Prostate Specific Antigen (PSA), are measured every 3 months. Among the various parameters obtained, six per quarter were judged to be adequate.

Prostate cancer is the most common noncutaneous cancer and the second-leading cause of death from cancer in men in the United States [2]. Because prostate cancer is prevalent in many countries and exhibits a wide spectrum of aggressiveness, different methods of treatment have been developed, and the preferred methods for detection and treatment are controversial.

The paper is organized as follows: section II describes the problem from a medical point of view and provides the essential background on both the WEKA environment and the techniques used. Section III presents the experiments performed along with the results obtained from each algorithm. Section IV discusses the results, and section V gives concluding remarks along with plans for future work.

II. MATERIAL AND METHODS

A. Medical Problem Description and Data

The natural history of prostate cancer varies from indolent disease that might not cause symptoms during a patient's lifetime to highly aggressive cancer that metastasizes quickly and causes terrible suffering and untimely death. The challenge for the physician who treats patients with prostate cancer is to advise effective treatment in those for whom treatment is necessary. Selection of the appropriate treatment requires assessment of the tumor's potential aggressiveness and the general health, life expectancy, and quality of life preferences of the patient.

In order to evaluate and compare different classification techniques in the domain of medical test data, we used real data obtained from 40 patients who are currently receiving medical treatment. An important decision taken at the beginning of this study was to record the values of six important blood parameters, measured for every patient in each three-month period. A physician evaluates these values by comparing them both to critical values and to the values obtained in previous measurements, in order to decide on a therapy change. The parameters chosen for the purposes of this study were: Hematocrit (HCT), White Blood Cells (WBC), free Prostate Specific Antigen (PSA free), total Prostate Specific Antigen (PSA total), ratio PSA (i.e. PSAfree/PSAtotal) and Prostatic Acidic Phosphatase (PAP). A seventh attribute recorded the potential therapy change decision (yes/no).

The overall data used in this study comprised 1960 values, arranged as 280 rows and 7 columns. All of the data was used to test how accurately the doctor's therapy change decision could be classified. A number of carefully selected classification algorithms were chosen, which are presented later in this paper, and the WEKA data mining environment was judged sufficient for obtaining and presenting the results.

B. The WEKA environment

WEKA data mining software is an open source software tool implemented by the Machine Learning Group at the University of Waikato [3], which currently provides a sufficient toolbox of machine learning algorithms that can easily be applied to large sets of raw data (datasets).
WEKA implements various machine learning classification techniques and algorithms for regression and clustering, along with a number of visualization tools. Nowadays, it is accepted to be a powerful and adequate environment for a number of data mining tasks. The main interface of the software is a GUI chooser from which the user selects the desired application. There are four main applications: the Explorer, the Experimenter, the Knowledge Flow and a simple Command Line Interface. In [3] the developers of WEKA present its basic functionality and the changes made since its initial introduction.

Throughout this study, all data analyzed and mined with the aid of WEKA is saved in the ARFF (Attribute Relation File Format) format, WEKA's own data format, which uses special tags to distinguish the attributes, values and names of the given data. As will be shown later on, all of the chosen blood test parameters were numerical values, and the doctor's therapy change decision was a simple yes/no.

C. Techniques

The comparison performed and presented in this study encompasses a number of well-known machine learning techniques: the decision tree learner C4.5 (Release 8) [4], the Multi-Layer Perceptron (MLP) neural network trained with back propagation [5], the Naïve Bayes classifier [6], the Radial Basis Function (RBF) network [7] and the k-nearest neighbor classifier (IBk) [8].
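All five learners read the same ARFF file described in subsection B. As a rough illustration of that format, the sketch below builds a small ARFF document with six numeric blood test attributes and a nominal yes/no decision. The relation name, the attribute names and the two data rows are hypothetical and are not taken from the study's data.

```python
# Sketch: build a minimal ARFF document as a string.
# Relation name, attribute names and data rows are hypothetical.

NUMERIC_ATTRIBUTES = ["HCT", "WBC", "PSA_free", "PSA_total", "PSA_ratio", "PAP"]


def build_arff(relation, rows):
    """Return ARFF text: header tags first, then one comma-separated row per instance."""
    lines = [f"@relation {relation}", ""]
    for name in NUMERIC_ATTRIBUTES:
        lines.append(f"@attribute {name} numeric")
    # Nominal class attribute: the physician's therapy change decision.
    lines.append("@attribute therapy_change {yes,no}")
    lines += ["", "@data"]
    for *values, decision in rows:
        lines.append(",".join(str(v) for v in values) + f",{decision}")
    return "\n".join(lines) + "\n"


# Two made-up instances: six measurements followed by the decision.
example_rows = [
    (38.5, 6200, 0.4, 2.1, 0.19, 1.2, "no"),
    (30.1, 4100, 0.9, 9.8, 0.09, 3.4, "yes"),
]

arff_text = build_arff("prostate_q1", example_rows)
print(arff_text)
```

The `@relation`, `@attribute` and `@data` tags are the special tags the format uses; the last attribute plays the role of the class to be predicted.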

i. Decision Trees – J48

Among the various machine learning techniques, the decision tree is one of the most widely used. It represents a mapping from the given attributes to a class and consists of nodes which link to two or more sub-trees. Each node computes an outcome based on the attribute values of the instance, and each possible outcome is linked to one of the sub-trees. The J48 algorithm, an implementation of C4.5 release 8 [9], is an efficient method for estimation and classification of fuzzy data, and it was chosen for the purposes of this study with promising results.
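To make the idea of a node test concrete, the following sketch picks a threshold on a single numeric attribute by maximizing information gain. This is a simplification for illustration only: C4.5 uses the related gain ratio criterion and adds pruning, and the code below is not J48. The readings and decisions are made up.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * math.log2(p)
    return result


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on value <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values and keep the best one."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: information_gain(values, labels, t))


# Hypothetical total PSA readings and therapy change decisions.
psa_total = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
decision = ["no", "no", "no", "yes", "yes", "yes"]

threshold = best_threshold(psa_total, decision)
gain = information_gain(psa_total, decision, threshold)
print(threshold, gain)
```

A full tree learner would apply the same search to every attribute, split the data on the winning test, and recurse into each sub-tree.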

ii. Neural Network (MultiLayer Perceptron – MLP)

A Neural Network (NN) is an adaptive system that changes its structure based on external or internal information which flows through the network during an initial learning phase [10]. In more practical terms, NNs are non-linear statistical data modeling tools. They can be used to model complex relationships between inputs and outputs or to find patterns in data [11]. In this approach, a Multi-Layer Perceptron (MLP) trained with the back propagation algorithm was applied to categorize the practitioner's decision (therapy change), with the decision encoded by two nodes (no=0 and yes=1).
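The sketch below illustrates back propagation on a deliberately tiny network with one hidden layer and a single sigmoid output. It is a generic illustration, not WEKA's MultilayerPerceptron and not the configuration used in the study; the layer sizes, learning rate, epoch count and training data are all assumptions chosen to keep the example short.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyMLP:
    """One hidden layer, one sigmoid output, trained with back propagation."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # One weight row per hidden unit; the last entry of each row is the bias.
        self.hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                       for _ in range(n_hidden)]
        self.output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        h = [sigmoid(sum(w * xi for w, xi in zip(unit, x)) + unit[-1])
             for unit in self.hidden]
        y = sigmoid(sum(w * hi for w, hi in zip(self.output, h)) + self.output[-1])
        return h, y

    def train_step(self, x, target, lr):
        h, y = self.forward(x)
        # Error signal at the output (squared error with sigmoid derivative).
        delta_out = (y - target) * y * (1.0 - y)
        # Propagate the error back to the hidden units before changing any weight.
        delta_hidden = [delta_out * self.output[j] * h[j] * (1.0 - h[j])
                        for j in range(len(h))]
        for j, hj in enumerate(h):
            self.output[j] -= lr * delta_out * hj
        self.output[-1] -= lr * delta_out
        for j, unit in enumerate(self.hidden):
            for i, xi in enumerate(x):
                unit[i] -= lr * delta_hidden[j] * xi
            unit[-1] -= lr * delta_hidden[j]


def mean_squared_error(net, data):
    return sum((net.forward(x)[1] - t) ** 2 for x, t in data) / len(data)


# Made-up, already scaled inputs; target 0 stands for "no", 1 for "yes".
data = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.8, 0.9], 1), ([0.9, 0.7], 1)]
net = TinyMLP(n_inputs=2, n_hidden=3)
initial_error = mean_squared_error(net, data)
for _ in range(2000):
    for x, t in data:
        net.train_step(x, t, lr=0.5)
final_error = mean_squared_error(net, data)
print(initial_error, final_error)
```

The error is computed at the output first and then passed backwards through the output weights, which is what gives the algorithm its name.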

iii. Naïve Bayes

Naïve Bayes is a simple form of the Bayesian classifier. The classifier produces probabilistic rules and has received noteworthy attention for classification purposes. When a new data item is presented, the Naïve Bayes model assigns it a probability for each possible class [6]. Classification is performed by applying the well-known Bayes rule, treating the attributes as independent of one another given the class variable (label) C, and computing the probability of each value of C. Although the model is straightforward, it gives quite promising results on many real-world datasets.
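A minimal sketch of this computation for numeric attributes follows, assuming each attribute is normally distributed within a class (a common choice for numeric data, used here as an assumption rather than as a statement about the study's setup). The attribute pairs and labels are made up.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate the class priors and a per-attribute mean and variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        prior = len(members) / len(rows)
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, max(var, 1e-6)))  # floor avoids division by zero
        model[label] = (prior, stats)
    return model


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def posterior(model, row):
    """Bayes rule under the independence assumption; returns normalized class probabilities."""
    scores = {}
    for label, (prior, stats) in model.items():
        likelihood = prior
        for x, (mean, var) in zip(row, stats):
            likelihood *= gaussian_pdf(x, mean, var)
        scores[label] = likelihood
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Hypothetical (PSA total, PSA ratio) pairs with the decision taken.
rows = [(2.0, 0.25), (2.5, 0.22), (3.0, 0.24), (9.0, 0.10), (10.0, 0.08), (11.0, 0.12)]
labels = ["no", "no", "no", "yes", "yes", "yes"]

model = fit_gaussian_nb(rows, labels)
probs = posterior(model, (9.5, 0.11))
print(probs)
```

The product over attributes inside `posterior` is where the "naïve" independence assumption enters: each attribute contributes its own likelihood term without regard to the others.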

iv. Radial Basis Function (RBF)

Radial Basis Function (RBF) networks were initially introduced to address a variety of problems (old pattern recognition techniques, clustering, functional approximation etc.). Nowadays, the RBF network is acknowledged to be one of the most important NN models [7] for classification. It is based on a two-layer feed-forward model with a layer of hidden units between the inputs and the outputs [7]. When the network is used for classification, the Gaussian function is the preferred basis function [12], and a key factor for a successful implementation is finding suitable centers.
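The two-layer structure can be sketched as follows: Gaussian hidden units measure the distance of an input from each center, and a linear output layer combines their responses. The centers here are simply placed at the mean of each class and the output weights are fitted with the delta rule; both choices, along with the width, learning rate and data, are assumptions made for illustration and do not describe WEKA's RBF implementation.

```python
import math


def gaussian_rbf(x, center, width):
    """Gaussian basis function: the response decays with squared distance from the center."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))


def hidden_layer(x, centers, width):
    return [gaussian_rbf(x, c, width) for c in centers]


def train_output_weights(data, centers, width, lr=0.2, epochs=500):
    """Fit the linear output layer with the delta rule; the centers stay fixed."""
    weights = [0.0] * (len(centers) + 1)  # last entry is the bias weight
    for _ in range(epochs):
        for x, target in data:
            phi = hidden_layer(x, centers, width) + [1.0]
            error = sum(w * p for w, p in zip(weights, phi)) - target
            weights = [w - lr * error * p for w, p in zip(weights, phi)]
    return weights


def predict(x, centers, width, weights):
    phi = hidden_layer(x, centers, width) + [1.0]
    return sum(w * p for w, p in zip(weights, phi))


# Made-up, already scaled inputs; target 0 stands for "no", 1 for "yes".
data = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.8, 0.9], 1), ([0.9, 0.7], 1)]
centers = [[0.15, 0.15], [0.85, 0.8]]  # hypothetical: the mean of each class
width = 0.3

weights = train_output_weights(data, centers, width)
print(predict([0.85, 0.8], centers, width, weights))
print(predict([0.15, 0.15], centers, width, weights))
```

With poorly placed centers the hidden units barely respond to the data, which is why the choice of centers matters so much in practice.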

v. k-Nearest Neighbor (IBk)

Nearest Neighbor implementations are among the simplest classification algorithms. Such learning schemes are described as statistical learning algorithms and are built by simply storing the given data. To classify, a distance metric is chosen and any new data item is compared against the already "memorized" data items. The new item is assigned to the class which is most common amongst its k nearest neighbors. IBk is an implementation of the k-nearest-neighbours classifier [8]. The number of nearest neighbors (k) can be set manually, or determined automatically using cross-validation.
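Both steps, the majority vote and the choice of k by cross-validation, can be sketched in a few lines. The example uses Euclidean distance and leave-one-out cross-validation as assumptions; attributes are left unscaled for brevity, and the data is made up.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_rows, train_labels, query, k):
    """Majority vote among the k stored instances closest to the query."""
    ranked = sorted(zip(train_rows, train_labels),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(rows, labels, k):
    """Classify each instance using all the others and report the hit rate."""
    hits = 0
    for i in range(len(rows)):
        rest_rows = rows[:i] + rows[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        hits += knn_predict(rest_rows, rest_labels, rows[i], k) == labels[i]
    return hits / len(rows)


def choose_k(rows, labels, candidates):
    """Pick the candidate k with the best leave-one-out accuracy (first one wins ties)."""
    return max(candidates, key=lambda k: leave_one_out_accuracy(rows, labels, k))


# Hypothetical (PSA total, PSA ratio) pairs with the decision taken.
rows = [(2.0, 0.25), (2.5, 0.22), (3.0, 0.24), (9.0, 0.10), (10.0, 0.08), (11.0, 0.12)]
labels = ["no", "no", "no", "yes", "yes", "yes"]

print(knn_predict(rows, labels, (9.5, 0.11), k=3))
print(choose_k(rows, labels, [1, 3, 5]))
```

On this toy data k=5 performs badly because, once an instance is held out, its own class is outnumbered among the five remaining neighbors; cross-validation detects this and prefers a smaller k.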

III. EXPERIMENTS AND RESULTS

In order to demonstrate the experiments precisely and to assess the applicability of Machine Learning algorithms in Bioinformatics, a number of blood test parameters were measured over a period of 8 quarters, and for each patient the physician indicated whether or not a therapy change was needed. These parameters were chosen following the physician's guidelines, as they constitute the most important among a large number of blood parameters measured (around 40). Namely, data on HCT, WBC, Prostatic Acid Phosphatase and serum PSA, including the total PSA level, the rate of change of PSA (PSA velocity and doubling time), the PSA density (serum PSA divided by prostate volume), and the percentage of PSA in the free or complexed isoforms, were used in order to predict the patient's state over a period of 2 years.

TABLE I. BLOOD TEST PARAMETERS AND THEIR CRITICAL VALUES

Parameter | Critical Values
Hct | >28%
WBC | >4000 /mL
PSA free | 0.03 ng/dl
PSA total | 0.05 ng/dl
PSAf/PSAt | >0.2
Prostatic Acid Phosphatase |

The results obtained with the five techniques (J48, MLP, Naïve Bayes, RBF, IBk) which were used are considered to be quite promising. More specifically, the simulation results are shown below in Tables II and III for the 1st quarter of the given data. Table II summarizes the accuracy of each Machine Learning algorithm for all 40 patients (instances), along with the time taken and the Kappa statistic for each algorithm. Table III gives an overall synopsis based on different error rates.

TABLE II. CLASSIFICATION RESULTS FOR EACH EXAMINED ALGORITHM FOR Q1 (Simulation Results for Q1)

WEKA Techniques | Correctly classified | Incorrectly classified | Time taken (sec) | Kappa statistic
J48 | 85% (34) | 15% (6) | 0.03 | 0.4146
MLP | 85% (34) | 15% (6) | 0.13 | 0.4146
Naïve Bayes | 90% (36) | 10% (4) | 0.01 | 0.6098
RBF | 90% (36) | 10% (4) | 0.11 | 0.6098
IBk | 82.5% (33) | 17.5% (7) | 0.01 | 0.2708

TABLE III. TRAINING AND SIMULATION ERROR FOR Q1 (Simulation Results for Q1)

WEKA Techniques | Mean Absolute Error | Root Mean Squared Error | Relative Absolute Error (%) | Root Relative Squared Error (%)
J48 | 0.1737 | 0.3638 | 57.409 | 94.407
MLP | 0.1899 | 0.3651 | 62.735 | 95.039
Naïve Bayes | 0.1014 | 0.3163 | 33.494 | 81.334
RBF | 0.1423 | 0.3127 | 47.007 | 81.406
IBk | 0.1921 | 0.408 | 63.478 | 106.212
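The Kappa statistic reported in Table II measures agreement between predicted and actual classes after discounting the agreement expected by chance. The sketch below computes it from a confusion matrix. The paper does not report confusion matrices, so the matrix used here is hypothetical: it is one possible arrangement of 40 instances that yields 34 correct classifications, chosen purely to show how an accuracy of 85% can go together with a much lower Kappa value.

```python
def cohen_kappa(confusion):
    """Return (accuracy, kappa) for a square confusion matrix.

    Rows are the actual classes and columns are the predicted classes.
    """
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(size)) / total
    # Chance agreement: product of the row and column totals of each class.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(size)
    ) / total ** 2
    return observed, (observed - expected) / (1.0 - expected)


# Hypothetical confusion matrix over 40 instances (not taken from the paper):
# rows are actual no/yes, columns are predicted no/yes.
confusion = [[31, 2],
             [4, 3]]

accuracy, kappa = cohen_kappa(confusion)
print(f"accuracy={accuracy:.3f} kappa={kappa:.4f}")
```

Because most instances in such a matrix belong to the "no" class, a classifier can reach high accuracy largely by predicting "no", and Kappa penalizes exactly that kind of chance agreement.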
