2011 20th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises
Comparison of Machine Learning Techniques using the WEKA environment for Prostate Cancer Therapy Plan
Nikolaos Mallios
Dept. of Informatics and Computer Technology, Technological Educational Institute of Lamia, Lamia, Greece
[email protected]
Abstract— The development and use of a number of prominent Data Mining techniques in numerous real-world application areas (e.g. Industry, Healthcare and Bioscience) has led to the utilization of such techniques in machine learning environments, in order to extract useful information from the given data and support decision making. In this study, a comprehensive comparison of techniques is performed on a fairly large set of data consisting of real medical cases of men diagnosed with prostate cancer who are receiving medical treatment. Forty patients previously diagnosed with prostate cancer, who had not undergone radiation therapy, were examined for a therapy change after already receiving medical treatment. Six parameters were measured for eight consecutive quarters to assess each patient's state and treatment outcome. Specifically, with the aid of the open source WEKA environment, the given data is tested with a number of machine learning classification techniques in order to compare the performance of the chosen algorithms against the practitioner's decision on a potential therapy change.
Keywords - Data Mining, WEKA, Machine Learning, Bioinformatics, Prostate Cancer
I. INTRODUCTION
The use of Data Mining classification techniques in Bioinformatics is fairly common. A major problem that arises when analyzing and evaluating medical/clinical data lies in the context of medical decision making. Namely, the correct diagnosis for further treatment, or a potential therapy change for a specific patient, could be assisted by machine learning techniques applied to the given data. A lot of effort has been devoted in Bioinformatics research to addressing a number of medical problems. The term machine learning encompasses a number of tasks, e.g. recognition, diagnosis, planning, prediction etc. [1].
1524-4547/11 $26.00 © 2011 IEEE
DOI 10.1109/WETICE.2011.28
Michael Samarinas MD, Konstantinos Skriapas MD, PhD, FEBU
Dept. of Urology, General Hospital of Larissa, Greece
[email protected]
Elpiniki Papageorgiou
Dept. of Informatics and Computer Technology, Technological Educational Institute of Lamia, Lamia, Greece
[email protected]
An advantage of applying machine learning techniques to a set of data is the extraction of useful information and its correlations. In this study the given medical data was used to evaluate the performance of a number of classification techniques. Specifically, we analyze and evaluate the decision-making task of a therapy change suggested by a doctor when a number of blood test parameters, mainly Prostate Specific Antigen (PSA), are measured every 3 months. Among the various parameters obtained, six were selected as adequate for each quarter.
Prostate cancer is the most common noncutaneous cancer and the second-leading cause of death from cancer in men in the United States [2]. Because prostate cancer is prevalent in many countries and exhibits a wide spectrum of aggressiveness, different methods of treatment have been developed, and the preferred methods for detection and treatment are controversial.
The paper is organized as follows: section II describes the problem from a medical point of view and provides the essential background on both the WEKA environment and the techniques used. Section III presents the experiments performed along with the results obtained from each algorithm. Section IV discusses the results, and concluding remarks along with plans for future work are given in section V.
II. MATERIAL AND METHODS
A. Medical Problem Description and Data
The natural history of prostate cancer varies from indolent disease that might not cause symptoms during a patient's lifetime to highly aggressive cancer that metastasizes quickly and causes terrible suffering and untimely death. The challenge for the physician who treats patients with prostate cancer is to advise effective treatment in those for whom treatment is necessary. Selection of the appropriate treatment requires assessment of the tumor's potential aggressiveness and the general health, life expectancy, and quality of life preferences of the patient.
In order to evaluate and compare different classification techniques in the domain of medical test data, we used real data obtained from 40 patients who are currently receiving medical treatment. An important decision taken at the beginning of this study was to record the values of six important blood parameters, which were measured for every patient each quarter. A physician evaluates these values, comparing them both to critical values and to the values obtained in previous measurements, in order to decide on a therapy change. The parameters chosen for the purposes of this study were: Hematocrit (HCT), White Blood Cells (WBC), free Prostate Specific Antigen (PSA free), total Prostate Specific Antigen (PSA total), PSA ratio (i.e. PSAfree/PSAtotal) and Prostatic Acid Phosphatase (PAP). The last parameter recorded was the potential therapy change decision (yes/no). The overall data used in this study comprised 1960 values, arranged in 280 rows and 7 columns. All of the data was used to test the accuracy of classifying the doctor's therapy change decision. A number of carefully selected classification algorithms were chosen, which are presented later in this paper, and the WEKA data mining environment was considered sufficient for obtaining and presenting the results.
B. The WEKA environment
WEKA is an open source data mining software tool implemented by the Machine Learning Group at the University of Waikato [3], which currently provides a toolbox of machine learning algorithms that can easily be applied to large sets of raw data (datasets).
WEKA implements various machine learning classification techniques and algorithms for regression and clustering, along with a number of visualization tools. It is now widely accepted as a powerful environment for a number of data mining tasks. The main interface of the software is a GUI chooser from which the user selects the desired application. There are 4 main applications, i.e. the Explorer, the Experimenter, the Knowledge Flow and a simple Command Line Interface. In [3] the developers of WEKA present its basic functionality and all the changes made since its initial introduction.
Throughout this study, all data analyzed and mined with the aid of WEKA is saved in the ARFF (Attribute-Relation File Format) file format (WEKA's data format), which uses special tags to distinguish between the attributes, values and names of the given data. As will be exemplified later on, all of the parameters chosen (blood test parameters) were numerical values, and the doctor's therapy change decision was a simple yes/no.
C. Techniques
The comparison performed and presented in this study encompasses a number of well-known machine learning techniques: the decision tree learner C4.5 (Release 8) [4], the multi-layer perceptron (MLP) neural network trained with the back propagation algorithm [5], the Naïve Bayes classifier [6], the Radial Basis Function (RBF) network [7] and the k-nearest neighbor classifier (IBk) [8].
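Before turning to the individual techniques, the ARFF layout described in Section II.B can be made concrete with a short Python sketch. It writes a header of @-tags for six numeric blood-test attributes and a nominal yes/no class, followed by comma-separated data rows. The relation name, the attribute names and the two data rows are illustrative placeholders and are not taken from the study's dataset.

```python
def to_arff(relation, numeric_attrs, class_name, class_values, rows):
    """Render rows as ARFF text: a header of @-tags followed by CSV data."""
    lines = [f"@relation {relation}", ""]
    for name in numeric_attrs:
        lines.append(f"@attribute {name} numeric")
    # Nominal attributes list their allowed values inside braces.
    lines.append(f"@attribute {class_name} {{{','.join(class_values)}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical attribute names and values, for illustration only.
attrs = ["HCT", "WBC", "PSA_free", "PSA_total", "PSA_ratio", "PAP"]
rows = [
    [41.0, 6200, 0.02, 0.04, 0.5, 1.1, "no"],
    [33.5, 5100, 0.10, 1.20, 0.083, 2.4, "yes"],
]
arff_text = to_arff("prostate_q1", attrs, "therapy_change", ["yes", "no"], rows)
print(arff_text)
```

A file produced this way has one `@attribute` line per column, so a dataset with seven columns yields seven declarations ahead of the `@data` section.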
i. Decision Trees – J48
Among the various machine learning techniques, the decision tree is one of the most widely used. It represents a mapping of the given attributes and consists of nodes which link to two or more sub-trees. A node computes an outcome based on an attribute value of the instance, and each possible outcome is linked with one of the sub-trees. The J48 algorithm, an implementation of C4.5 release 8 [9], is an efficient method for estimation and classification of fuzzy data, and it was chosen for the purposes of this study with promising results.
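To illustrate how a node can derive its outcome from a numeric attribute, the sketch below scores binary threshold splits with the gain ratio, the criterion C4.5 is known for. It is a simplified illustration only: pruning and the other refinements of C4.5 and J48 are omitted, candidate thresholds are taken as midpoints between consecutive values, and the PSA readings and labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split value <= threshold vs. value > threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    w_left, w_right = len(left) / n, len(right) / n
    info_gain = entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))
    split_info = -(w_left * math.log2(w_left) + w_right * math.log2(w_right))
    return info_gain / split_info


def best_threshold(values, labels):
    """Try each midpoint between consecutive distinct values; return the best one."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: gain_ratio(values, labels, t))


# Hypothetical total-PSA readings and therapy-change labels.
psa_total = [0.02, 0.03, 0.04, 0.05, 0.9, 1.4, 2.1, 3.0]
change = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]
threshold = best_threshold(psa_total, change)
print(threshold, gain_ratio(psa_total, change, threshold))
```

On this toy data the chosen threshold falls between the two groups of readings, where the split separates the classes perfectly.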
ii. Neural Network (MultiLayer Perceptron – MLP)
The Neural Network is an adaptive system that changes its structure based on external or internal information which flows through the network during an initial learning phase [10]. In more practical terms, NNs are non-linear statistical data modeling tools. They can be used to model complex relationships between inputs and outputs or to find patterns in data [11]. For this approach, a Multi-Layer Perceptron (MLP) trained with the back propagation algorithm was applied in order to classify the practitioner's decision (therapy change), using two output nodes (no=0 and yes=1).
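The back propagation procedure can be sketched in a few lines of Python. The network below has one hidden layer of sigmoid units and two output nodes (index 0 for "no", index 1 for "yes"), and is trained by gradient descent on the squared error. It is a minimal sketch under stated assumptions and not WEKA's implementation: the layer sizes, learning rate, number of epochs and the two-feature training items are all hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """One hidden layer of sigmoid units, trained by back propagation."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # Each row holds the weights of one unit; the last entry is its bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0])))
                  for row in self.w_hidden]
        out = [sigmoid(sum(w * v for w, v in zip(row, hidden + [1.0])))
               for row in self.w_out]
        return hidden, out

    def train_step(self, x, target, lr=0.5):
        hidden, out = self.forward(x)
        # Output deltas: squared-error gradient passed through the sigmoid.
        d_out = [(o - t) * o * (1 - o) for o, t in zip(out, target)]
        # Hidden deltas: output deltas propagated back through the output weights.
        d_hidden = [h * (1 - h) * sum(d_out[k] * self.w_out[k][j]
                                      for k in range(len(d_out)))
                    for j, h in enumerate(hidden)]
        for k, row in enumerate(self.w_out):
            for j, v in enumerate(hidden + [1.0]):
                row[j] -= lr * d_out[k] * v
        for j, row in enumerate(self.w_hidden):
            for i, v in enumerate(x + [1.0]):
                row[i] -= lr * d_hidden[j] * v

    def loss(self, data):
        return sum(sum((o - t) ** 2 for o, t in zip(self.forward(x)[1], target))
                   for x, target in data) / len(data)

    def predict(self, x):
        out = self.forward(x)[1]
        return out.index(max(out))  # 0 = no, 1 = yes


# Hypothetical, already scaled inputs; targets are one-hot over (no, yes).
data = [([0.1, 0.2], [1.0, 0.0]), ([0.2, 0.1], [1.0, 0.0]),
        ([0.8, 0.9], [0.0, 1.0]), ([0.9, 0.7], [0.0, 1.0])]
net = TinyMLP(n_in=2, n_hidden=3, n_out=2)
loss_before = net.loss(data)
for _ in range(2000):
    for x, target in data:
        net.train_step(x, target)
loss_after = net.loss(data)
print(loss_before, loss_after)
```

Each training step performs a forward pass, computes the output and hidden deltas, and then moves every weight against its error gradient.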
iii. Naïve Bayes
Naïve Bayes is a simple form of the Bayesian classifier. The classifier produces probabilistic rules and has received noteworthy attention for classification purposes. When a new data item is presented, the Naïve Bayes model assigns it a probability for each possible class [6]. Classification is performed by applying the well-known Bayes rule, treating the attributes as conditionally independent given the class variable (label) C, and computing the probability of each class. Although the model is straightforward, it provides quite promising results on many real world datasets.
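The computation described above can be sketched as follows, assuming a Gaussian density for each numeric attribute (a common choice for numeric data, adopted here as an assumption). The function names and the two-attribute training items are hypothetical and serve only to show how the per-class probabilities are obtained.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-attribute mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        prior = len(rows) / len(X)
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, max(var, 1e-9)))  # floor avoids division by zero
        model[label] = (prior, stats)
    return model


def posterior(model, x):
    """Return P(C | x) for every class C, with attributes treated as independent."""
    log_scores = {}
    for label, (prior, stats) in model.items():
        log_p = math.log(prior)
        for value, (mean, var) in zip(x, stats):
            log_p += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        log_scores[label] = log_p
    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {c: s / total for c, s in exp_scores.items()}


# Hypothetical (PSA total, PSA ratio) items with therapy-change labels.
X = [[0.03, 0.30], [0.04, 0.35], [0.05, 0.25],
     [1.20, 0.10], [2.00, 0.08], [1.60, 0.12]]
y = ["no", "no", "no", "yes", "yes", "yes"]
model = fit_gaussian_nb(X, y)
probs = posterior(model, [1.5, 0.1])
print(probs)
```

The new item is then assigned to the class with the highest probability.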
iv. Radial Basis Function (RBF)
Radial Basis Function (RBF) networks were initially introduced to address a variety of problems (pattern recognition, clustering, function approximation etc.). Nowadays, the RBF network is acknowledged to be one of the most important NN models [7] for classification. It is based on a two-layer feed-forward model with a hidden layer (hidden units) between the input and the output [7]. When the network is used for classification purposes the Gaussian function is preferred [12], and a key factor for a successful implementation is finding suitable centers.
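A minimal sketch of the Gaussian hidden unit and of the resulting network output is given below. The centers, widths and output weights are hypothetical constants chosen for illustration; in practice they would be fitted to the data, with the centers often obtained by a clustering step.

```python
import math


def gaussian_rbf(x, center, width):
    """Gaussian basis function: exp(-||x - c||^2 / (2 * width^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))


def rbf_network_output(x, centers, widths, weights, bias):
    """Hidden layer of Gaussian units followed by a linear output unit."""
    activations = [gaussian_rbf(x, c, s) for c, s in zip(centers, widths)]
    return bias + sum(w * a for w, a in zip(weights, activations))


# Hypothetical centers, widths and weights.
centers = [[0.0, 0.0], [1.0, 1.0]]
widths = [0.5, 0.5]
weights = [-1.0, 1.0]
score_near_first = rbf_network_output([0.0, 0.0], centers, widths, weights, bias=0.0)
score_near_second = rbf_network_output([1.0, 1.0], centers, widths, weights, bias=0.0)
print(score_near_first, score_near_second)
```

Each hidden unit responds most strongly to inputs close to its center, which is why the choice of centers largely determines how well the network separates the classes.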
v. k-Nearest Neighbor (IBk)
Nearest Neighbor implementations are among the simplest classification algorithms. Such learning schemes are described as statistical learning algorithms and are built by simply storing the given data. To perform classification, a distance metric is chosen and any new data item is compared against the already "memorized" data items. The new item is assigned to the class which is most common amongst its k nearest neighbors. IBk is an implementation of the k-nearest-neighbours classifier [8]. The number of nearest neighbors (k) can be set manually, or determined automatically using cross-validation.
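The procedure can be sketched in Python as follows, assuming the Euclidean distance as the metric and leave-one-out cross-validation as one way of selecting k. The function names and the training items are hypothetical, and the sketch is not the IBk implementation itself.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify query by majority vote among its k nearest stored items."""
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


def choose_k(train_X, train_y, candidates):
    """Pick k by leave-one-out cross-validation accuracy on the stored data."""
    def loo_accuracy(k):
        hits = 0
        for i in range(len(train_X)):
            rest_X = train_X[:i] + train_X[i + 1:]
            rest_y = train_y[:i] + train_y[i + 1:]
            hits += knn_predict(rest_X, rest_y, train_X[i], k) == train_y[i]
        return hits / len(train_X)
    return max(candidates, key=loo_accuracy)


# Hypothetical (PSA total, PSA ratio) items with therapy-change labels.
X = [[0.03, 0.30], [0.04, 0.35], [0.05, 0.25],
     [1.20, 0.10], [2.00, 0.08], [1.60, 0.12]]
y = ["no", "no", "no", "yes", "yes", "yes"]
prediction = knn_predict(X, y, [1.4, 0.1], k=3)
best_k = choose_k(X, y, candidates=[1, 3, 5])
print(prediction, best_k)
```

With only three items per class, k=5 forces the vote to include a majority of items from the other class once an item is held out, so the cross-validation step rejects it.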
III. EXPERIMENTS AND RESULTS
In order to demonstrate the experiments and assess the applicability of Machine Learning algorithms in Bioinformatics, a number of blood test parameters were measured over a period of 8 quarters, and for each patient the physician indicated whether or not a therapy change was needed. These parameters were chosen following the physician's guidelines, as they are the most important among a large number of blood parameters measured (around 40). Namely, HCT, WBC, Prostatic Acid Phosphatase and serum PSA data, including the total PSA level, the rate of change of PSA (PSA velocity and doubling time), the PSA density (serum PSA divided by prostate volume), and the percentage of PSA in the free or complexed isoforms, were used in order to predict the patient's state over a period of 2 years.
The techniques (J48, MLP, Naïve Bayes, RBF, IBk) which were used are considered to be quite promising. More specifically, the simulation results for the 1st quarter of the given data are shown below in Tables II and III. Table II summarizes the accuracy of each Machine Learning algorithm for all 40 patients (instances), along with the time taken and the Kappa statistic for each algorithm. Table III gives an overall synopsis based on different error rates.
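The Kappa statistic reported in Table II measures agreement between the predicted and the actual classes after correcting for agreement expected by chance. The sketch below computes it from a confusion matrix. The confusion matrices themselves are not reported in the paper, so the matrix used here is a hypothetical one: it was chosen so that 34 of 40 instances are correct (85%), and it happens to give a kappa of about 0.4146, the value listed for J48. Other matrices could produce the same figures.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: actual, columns: predicted)."""
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / n
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Hypothetical 2x2 matrix (classes: no, yes); not taken from the paper.
confusion = [[31, 2],
             [4, 3]]
accuracy = (confusion[0][0] + confusion[1][1]) / 40
kappa = cohen_kappa(confusion)
print(accuracy, round(kappa, 4))
```

A kappa well below the raw accuracy, as here, indicates that much of the agreement could be obtained by always predicting the majority class.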
TABLE II. CLASSIFICATION RESULTS FOR EACH EXAMINED ALGORITHM FOR Q1
Simulation Results for Q1
WEKA Techniques   Correctly classified   Incorrectly classified   Time taken (sec)   Kappa statistic
J48               85% (34)               15% (6)                  0.03               0.4146
MLP               85% (34)               15% (6)                  0.13               0.4146
Naïve Bayes       90% (36)               10% (4)                  0.01               0.6098
RBF               90% (36)               10% (4)                  0.11               0.6098
IBk               82.5% (33)             17.5% (7)                0.01               0.2708

TABLE III. TRAINING AND SIMULATION ERROR FOR Q1
Simulation Results for Q1
WEKA Techniques   Mean Absolute Error   Root Mean Squared Error   Relative Absolute Error (%)   Root Relative Squared Error (%)
J48               0.1737                0.3638                    57.409                        94.407
MLP               0.1899                0.3651                    62.735                        95.039
Naïve Bayes       0.1014                0.3163                    33.494                        81.334
RBF               0.1423                0.3127                    47.007                        81.406
IBk               0.1921                0.408                     63.478                        106.212

TABLE I. BLOOD TEST PARAMETERS AND THEIR CRITICAL VALUES
Parameter                    Critical Values
Hct                          >28%
WBC                          >4000 /mL
PSA free                     0.03 ng/dl
PSA total                    0.05 ng/dl
PSAf/PSAt                    >0.2
Prostatic Acid Phosphatase