Proceedings of the 39th Hawaii International Conference on System Sciences - 2006
Knowledge Extraction from Prostate Cancer Data

Dursun Delen1 & Nainish Patil
Department of Management Science and Information Systems
Oklahoma State University, Stillwater, Oklahoma

Abstract

Although cancer research has generally been clinical and/or biological in nature, in recent years data-driven analytic research has become a common complement. In medical domains where data- and analytics-driven research is successfully applied, new and novel research directions are identified to further advance the clinical and biological studies. Specifically, it is the combination of the serious effects of cancer, the promising results of prior analytical research in related fields, the potential benefits of the expected research outcomes, and the desire to further understand the nature of cancer that provided the motivation for this research effort. Therefore, the main objective of this research has been to take advantage of available data mining tools and techniques to develop accurate prediction models for prostate cancer survivability, and to explain the prioritized importance of the prognostic factors.
1. Introduction

The American Cancer Society estimates that during 2005 about 234,430 new cases of prostate cancer will be diagnosed in the United States. One man in six will be diagnosed with prostate cancer during his lifetime, but only one man in 32 will die of this disease. African-American men are more likely to have prostate cancer, and to die from it, than are white or Asian men [15]. The reasons for this are still not known. Prostate cancer is the second leading cause of cancer death in men in the United States, exceeded only by lung cancer. According to the statistics, prostate cancer accounts for about 10% of cancer-related deaths in men [1]. The American Cancer Society estimates that about 29,528 men in the United States will die of prostate cancer during 2005. People facing cancer are naturally concerned about what the future holds. Understanding cancer and what to expect can help patients and their loved ones plan treatment, think about lifestyle changes, and make decisions about their quality of life and finances.
Many people with cancer want to know their prognosis. Research in the field of cancer detection helps to give an idea of the likely course and outcome of this disease. Complementing these biological and clinical studies, data mining, as a powerful tool for discovering patterns in medical data repositories, is finding its way into the analytics-driven medical research arena. In this study, we used two popular data mining techniques (artificial neural networks and decision trees) along with the most commonly used statistical analysis technique, logistic regression. The data set used for this study was obtained from SEER. It contained around 120,000 records and 77 variables. Data cleaning and preparation were performed to homogenize the data set and remove records that contained missing values. Ten-fold cross validation was used in model building and evaluation. The results indicated that the artificial neural network is the best predictor, with an accuracy of 91.07%, followed by the decision tree and logistic regression. The developed models can be deployed as part of a real-world clinical decision support system.
2. Background

2.1 Prostate Cancer

Prostate cancer occurs when cells in the prostate (a small gland in men, about the size of a walnut, located underneath the bladder and in front of the rectum) begin to grow out of control and then invade nearby tissues or spread throughout the body. Large collections of this out-of-control tissue are called tumors. However, some tumors are not really cancer because they cannot spread or threaten someone's life; these are called benign tumors. The tumors that can spread throughout the body or invade nearby tissues are considered cancer and are called malignant tumors. Usually, prostate cancer is very slow growing. However, sometimes it will grow quickly and spread to nearby
1 Corresponding author: tel: (918) 594-8283, fax: (918) 594-8281, email: [email protected]
0-7695-2507-5/06/$20.00 (C) 2006 IEEE
lymph nodes. Lymph nodes are small, pea-sized pieces of tissue that filter lymph, a clear liquid waste product. If prostate cancer has spread to the lymph nodes by the time it is diagnosed, it usually means there is a higher risk that the cancer has spread to other areas of the body [1]. Although there are several known risk factors for getting prostate cancer, no one knows exactly why one man gets it and another does not. Some of the most important risk factors for prostate cancer include age, ethnicity, genetics, and diet. Age is generally considered the most important risk factor for prostate cancer. The incidence of prostate cancer rises quickly after the age of 60, and the majority of men will have some form of prostate cancer after the age of 80. Another important risk factor for prostate cancer is ethnicity. There is also some evidence that a man's diet may affect his risk of developing prostate cancer. The most common dietary culprit implicated in raising prostate cancer risk is a high-fat diet, particularly a diet high in animal fats. A few studies have also suggested that a diet low in vegetables is associated with an increased risk of prostate cancer. According to researchers at the Fred Hutchinson Cancer Research Center, men who are long-term, heavy smokers face twice the risk of developing aggressive prostate cancer compared with men who have never smoked. A family history of prostate cancer also increases a man's chances of developing the disease [1].

2.2 Medical Data Mining

Health care related data mining is one of the most rewarding and challenging areas of application in data mining and knowledge discovery. The challenges arise because the data sets are large, complex, heterogeneous, hierarchical, time-varying, and of varying quality. The available healthcare data sets are fragmented and distributed in nature, making data integration a highly challenging task.
The data miner must also tackle ethical, legal, and social issues related to the use of highly sensitive healthcare data. Because analysts often lack domain knowledge, active collaboration between the domain specialist and the data miner becomes necessary [7]. Various data mining techniques are used in the area of clinical decision support, and, in particular, in cancer diagnostics and prognostics. Explanatory and confirmative techniques are the most commonly used in medical data mining. The major issues faced by data miners in the medical field are as follows:

Heterogeneity of data: Raw medical data are voluminous and heterogeneous. Medical data are
collected from various sources such as images, interviews with patients, and physicians' notes and interpretations. All of these data elements bear upon the diagnosis, prognosis and treatment of the patient and must be taken into account in data mining research. The physicians' interpretations of signals, images and other clinical data are written in unstructured free-text English, which makes standardizing and mining the data difficult. Compared with other areas of the physical sciences, the underlying data structures of medicine are poorly characterized mathematically. The conceptual structure of medicine consists of word descriptions and images, with very few formal constraints on vocabulary, the composition of images, or the allowable relationships among basic concepts. Medical data have no formal structure into which a data miner can organize information to be modeled by clustering, regression, or sequence analysis [7].

Ethical and social issues: Since data related to humans are involved in medical data mining, ethical, legal and social issues play an important role. Care should be taken to prevent the abuse of patients and the misuse of their data. The corpus of human medical data potentially available for data mining is enormous. Data ownership can easily stymie efforts at obtaining the data needed or creating links between datasets. The question of ownership of patient information is muddled, as evidenced by recurrent, highly publicized lawsuits and congressional inquiries. Another feature unique to medical data mining is the set of privacy and security concerns. There is a potential for breach of patient confidentiality and the possibility of ensuing legal action. The patient is extraordinarily candid with the physician in the expectation that such information will never be made public. Another issue is data security in data handling and data transfer. Legally and ethically, one cannot perform data analysis for frivolous or nefarious purposes.
Any use of patient data, even de-identified data, must be justified as having some expected benefit [7].

2.3 Data mining use in cancer research

This section discusses in brief some of the work in the field of data mining for predicting cancer. Mangasarian et al. [14] discuss the application of linear programming to the problem of clinical classification of patients with breast cancer. Bellazzi et al. [2] present the use of data mining tools to derive a prognostic model of the outcome of resectable hepatocellular carcinoma. Land et al. [12] discuss a new neural network technology developed to improve the diagnosis of breast cancer using mammogram findings, while Walter and Mohan [23] present an
algorithm that extracts classification rules from trained neural networks and discuss its application to breast cancer diagnosis, as well as describing how the accuracy of the networks, and of the rules extracted from them, can be improved by simple preprocessing of the data. As far as prostate cancer is concerned, Zupan et al. [25] propose a schema that enables the use of classification methods, including machine learning classifiers, for survival analysis of prostate cancer patients, while Zhang and Zhang [24] developed and validated ProstAsure, a neural network derived algorithm that analyzes the profile of multiple serum tumor markers and produces a single-valued diagnostic index that can potentially be used for early detection of prostate cancer. Emerging pattern clustering and gene expression data are increasingly being used for cancer detection. Emerging patterns have been defined as item sets whose support increases significantly from one class of data to another, with the ratio of supports exceeding a threshold called the growth rate. Emerging Patterns (EP) have high discrimination power and can capture biologically significant information from the data. EP and projected clustering techniques had been used independently to solve different types of problems, since they are strong in different domains. Larry et al. [13] proposed the integration of these two techniques to form effective and easy-to-understand clusters of gene expression data. The clusters thus formed were used to classify unseen data (cancerous and normal tissues) in the cancer detection problem. The development of microarray technology has supplied a large volume of data to many fields. It has been used in cancer research to help better predict and diagnose cancer. Machine learning for DNA microarrays can be described as selecting discriminative genes related to classification from gene expression data, training a classifier, and then classifying new data with the learned classifier. Sung-bae et al.
[21] describe how they acquired gene expression data calculated from DNA microarrays; their prediction system had two stages: feature selection and pattern classification. Statistical methods and artificial neural networks have been used to classify tumors using microarray data. Blair et al. [3] propose an alternative method, PAM, that performs well on a wide variety of problems; PAM is based on a technique known as "nearest shrunken centroids". Churio et al. [6] used statistical analyses such as the population mean or median, with confidence intervals, to study whether racial variations in prostate cancer detection exist. They evaluated patients who were referred to a urology clinic by other physicians because they had either an elevated level of
prostate-specific antigen (PSA) concentration in the blood or an abnormal finding on digital rectal exam. Tigrani et al. [22] used a derived field in analyzing the available data: "PSA velocity", which measures the change in PSA for a single patient over time. Bagirov et al. [6] demonstrate the use of optimization-based clustering techniques to cluster prostate cancer patients into risk-homogeneous patient groups in order to support future treatment decisions. Patient age, tumor stage, pathologic Gleason score, and PSA (prostate-specific antigen) level in the blood are used to generate the clusters. These clusters reveal interesting differences in patients' future health and survival. Generally, discriminant analysis and logistic regression are the two most commonly used techniques for constructing classification models. However, linear discriminant analysis (LDA) has often been criticized for its assumptions about the nature of the data, including the fact that the covariance matrices of the different classes are unlikely to be equal. Even though neural networks have been reported to have better classification capability than LDA and logistic regression, they have also been criticized for the long training process involved in designing the optimal network topology and for the difficulty of identifying the relative importance of potential input variables, and hence their applicability to classification problems is presumed to be limited [20].
3. Research Methodology

3.1 Data Understanding and Preparation

The data set used for the project was obtained from the SEER (Surveillance, Epidemiology, and End Results) Program of the National Cancer Institute. SEER is one of the largest and most comprehensive sources of information on cancer incidence and survival in the United States. The SEER Program currently collects and publishes cancer incidence and survival data from 14 population-based cancer registries and three supplemental registries covering approximately 26 percent of the US population [19]. Information on more than 3 million in situ and invasive cancer cases is included in the SEER database, and approximately 170,000 new cases are added each year within the SEER coverage areas. The SEER registries routinely collect data on patient demographics, primary tumor site, morphology, stage at diagnosis, first course of treatment, and follow-up for vital status. The SEER Program is the only comprehensive source of population-based information in the United States that includes stage of cancer at the time of diagnosis and
survival rates within each stage. The mortality data reported by SEER are provided by the National Center for Health Statistics. The SEER Program is considered the standard for quality among cancer registries around the world. Quality control has been an integral part of SEER since its inception. Every year, studies are conducted in SEER areas to evaluate the quality and completeness of the data being reported [19]. Geographic areas were selected for inclusion in the SEER Program based on their ability to operate and maintain a high-quality population-based cancer reporting system and for their epidemiologically significant population subgroups. The population covered by SEER is comparable to the general US population with regard to measures of poverty and education. The SEER population tends to be more urban and has a higher proportion of foreign-born persons than the general US population [19]. The data used in this study comprise the SEER public-use data files from nine SEER registries for the years 1973-2001. The data have been broken into nine site-group (i.e., cancer type) files, stored as ASCII text files. For the specific purpose of this research study we used the Male Genital (MALEGEN.TXT) text file, which includes the prostate cancer cases. The original data set contained seventy-seven variables and over 350,000 records. The data from the flat text file were imported into SPSS for initial data exploration and understanding. As we were interested in researching prostate cancer, the complete record set was filtered to include only the cases identified in the male genital text file. Descriptive statistics were studied for each and every variable. Mean, range, maximum, minimum and missing-value statistics were identified for the interval variables. Histograms were plotted to get an idea of the distribution of each variable. Box-and-whisker plots were created to examine the distributions and identify any outliers present in the data.
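A minimal sketch of this import-and-explore step is given below, using pandas. The fixed-width layout and the column names are hypothetical stand-ins; the real field positions come from the SEER data dictionary for the public-use files.

```python
import io
import pandas as pd

# A tiny fixed-width sample standing in for the SEER MALEGEN.TXT flat file;
# the actual column positions are defined in the SEER data dictionary.
raw = io.StringIO(
    "1988 67 12\n"
    "1991 72  4\n"
    "1995    61\n"
)
df = pd.read_fwf(raw, widths=[4, 3, 3], header=None,
                 names=["year_dx", "age", "survival_months"])

# Descriptive statistics (mean, min, max, etc.) for the interval variables
print(df.describe())

# Proportion of missing values per variable; in the study, variables with
# more than 40% missing values were flagged during exploration
missing_share = df.isna().mean()
print(missing_share)
```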
Some of the variables had more than 40% missing values. Further exploration revealed that all the records prior to 1988 had missing values. A binary categorical variable was computed distinguishing data prior to 1988 from data after 1988. A t-test was then run on the target variable, with this categorical variable as the class variable, to test whether the two groups were significantly different. The analysis found no significant difference between the two groups; therefore, we decided to delete the records prior to 1988. This reduced the data set to just over 200,000 records. As the objective of this research was to develop a prediction model that would estimate survivability as accurately as possible, we had to derive a survivability variable to use as the prediction target. Following in the footsteps of previous cancer
studies, survivability is defined as living for 60 months or more after being diagnosed with prostate cancer. Accordingly, a binary dependent variable was developed in which a person surviving more than 60 months is coded as 1 and less than 60 months as 0. Also, records with a cause of death other than prostate cancer were disregarded. Each record in the data set relates to a specific incidence of the cancer, so there can be multiple records for the same person. Since we are predicting survivability for an individual, we aggregated all records for the same person using the encoded patient identification variables provided within the data set.

3.3 Prediction Models

We used three different types of classification models: artificial neural networks, decision trees, and logistic regression. These models were selected for inclusion in this study due to their popularity in the recently published literature as well as the desirable performance they had shown in our preliminary comparative studies. What follows is a brief description of these three classification model types.

3.3.1 Artificial neural networks

Artificial neural networks (ANNs) are commonly known as biologically inspired, highly sophisticated analytical techniques, capable of modeling extremely complex non-linear functions. Formally defined, ANNs are analytic techniques modeled after the processes of learning in the cognitive system and the neurological functions of the brain, capable of predicting new observations (on specific variables) from other observations (on the same or other variables) after executing a process of so-called learning from existing data [9]. We used a popular ANN architecture called the multi-layer perceptron (MLP) with back-propagation (a supervised learning algorithm). The MLP is known to be a robust function approximator for prediction/classification problems. It is arguably the most commonly used and well-studied ANN architecture.
Our experimental runs also supported the notion that for this type of classification problem the MLP performs better than other ANN architectures such as radial basis function (RBF) networks, recurrent neural networks (RNN), and self-organizing maps (SOM). In fact, Hornik et al. [10] empirically showed that, given the right size and structure, the MLP is capable of learning arbitrarily complex nonlinear functions to arbitrary levels of accuracy. The MLP is essentially a collection of nonlinear neurons (a.k.a. perceptrons or processing elements) organized and connected to each
other (using what are commonly called weights) in a feedforward multi-layer structure. Figure 1 illustrates the graphical representation of the MLP architecture used in this study.
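As an illustrative sketch (not the authors' implementation), the feedforward pass of the Figure 1 architecture, with 17 inputs, one hidden layer of 15 processing elements, and 2 outputs, can be written in NumPy. The weights here are randomly initialized stand-ins for trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from Figure 1: 17 input neurons, 15 hidden PEs, 2 output neurons
n_in, n_hidden, n_out = 17, 15, 2

# Randomly initialized weights stand in for trained ones; back-propagation
# training would iteratively adjust these to reduce classification error.
W1, b1 = rng.normal(size=(n_in, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, n_out)), np.zeros(n_out)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """One feedforward pass: input layer -> hidden layer -> output layer."""
    h = sigmoid(x @ W1 + b1)      # hidden-layer activations (15 PEs)
    return sigmoid(h @ W2 + b2)   # two output activations (survived / not)

# One synthetic 17-variable patient record
record = rng.normal(size=n_in)
out = forward(record)
print(out)  # two activations; the larger one gives the predicted class
```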
[Figure 1. Graphical depiction of the ANN model: input layer (17 neurons), hidden layer (15 PEs), output layer (2 neurons: 1 = survived, 0 = did not survive)]

3.3.2 Decision trees

Decision trees are powerful classification algorithms that are becoming increasingly popular with the growth of data mining in the field of information systems. Popular decision tree algorithms include Quinlan's ID3, C4.5, and C5 [17], and Breiman et al.'s CART [5]. As the name implies, this technique recursively separates observations into branches to construct a tree for the purpose of improving prediction accuracy. In doing so, it uses mathematical criteria (e.g., information gain, the Gini index, and the chi-squared test) to identify a variable, and a corresponding threshold for that variable, that splits the input observations into two or more subgroups. This step is repeated at each leaf node until the complete tree is constructed. The objective of the splitting algorithm is to find the variable-threshold pair that maximizes the homogeneity (order) of the resulting subgroups of samples. The most commonly used splitting criteria include entropy-based information gain (used in ID3, C4.5, and C5), the Gini index (used in CART), and the chi-squared test (used in CHAID). Based on the favorable prediction results obtained in our preliminary runs, in this study we chose the CART algorithm as our decision tree method.

3.3.3 Logistic regression

Logistic regression is a generalization of linear regression [8]. It is used primarily for predicting binary or multi-class dependent variables. Because the response variable is discrete, it cannot be modeled directly by linear regression. Therefore, rather than predicting a point estimate of the event itself, the model predicts the odds of its occurrence. In a two-class problem, a case with a predicted probability greater than 50% is assigned to the class designated as "1", and to "0" otherwise. While logistic regression is a very powerful modeling tool, it assumes that the response variable (the log odds, not the event itself) is linear in the coefficients of the predictor variables. Furthermore, the modeler, based on his or her experience with the data and data analysis, must choose the right inputs and specify their functional relationship to the response variable.

3.4 Validation
3.4.1 Measures for performance

In this study, we used three performance measures: accuracy (equation 1), sensitivity (equation 2), and specificity (equation 3):

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

Sensitivity = TP / (TP + FN)    (2)

Specificity = TN / (TN + FP)    (3)

where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively.

3.4.2 k-Fold cross validation

In order to minimize the bias associated with the random sampling of the training and holdout data samples when comparing the predictive accuracy of two or more methods, researchers tend to use k-fold cross validation. In k-fold cross validation, also called rotation estimation, the complete dataset (D) is randomly split into k mutually exclusive subsets (the folds: D1, D2, ..., Dk) of approximately equal size. The classification model is trained and tested k times. Each time (t = 1, 2, ..., k), the model is trained on all but one fold (D \ Dt) and tested on the remaining fold (Dt). The cross validation estimate of the overall accuracy is then calculated as the average of the k individual accuracy measures (see equation 4):
CVA = (1/k) * sum(A_i, i = 1, ..., k)    (4)

where CVA stands for cross validation accuracy, k is the number of folds, and A_i is the accuracy measure of fold i. Since the cross-validation accuracy depends on the random assignment of the individual cases to the k distinct folds, a common practice is to stratify the folds. In stratified k-fold cross validation, the folds are created so that each contains approximately the same proportion of class labels as the original dataset. Empirical studies have shown that stratified cross validation tends to generate comparison results with lower bias and lower variance than regular k-fold cross-validation [11].
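The validation scheme above can be sketched in plain Python. The function names below are ours, not from the study's tooling, and the toy label distribution is illustrative.

```python
import random
from collections import Counter

def stratified_folds(labels, k=10, seed=42):
    """Assign case indices to k folds so that each fold keeps roughly the
    same proportion of class labels as the full dataset."""
    rnd = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rnd.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)  # deal each class out round-robin
    return folds

# Equations (1)-(3)
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

# Toy target: 80 survivors (coded 1) and 20 non-survivors (coded 0)
labels = [1] * 80 + [0] * 20
folds = stratified_folds(labels, k=10)
counts = [Counter(labels[i] for i in fold) for fold in folds]
print(counts[0])  # each fold keeps the original 4:1 class mix

# Equation (4): cross validation accuracy as the mean of per-fold accuracies
fold_accs = [0.9075, 0.9108, 0.9116]  # e.g., three per-fold accuracy values
cva = sum(fold_accs) / len(fold_accs)
print(round(cva, 4))
```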
4. Results

4.1 Classification results

In this study the models were evaluated based on the measures discussed above (classification accuracy, sensitivity and specificity). The results were obtained using ten-fold cross validation for each model, and are based on the average results obtained on the held-out test fold across the ten folds. We found that the decision tree model achieved a classification accuracy of 0.9000 with a sensitivity of 0.9188 and a specificity of 0.7375. The logistic regression model achieved a classification accuracy of 0.8961 with a sensitivity of 0.9130 and a specificity of 0.7361. However, the ANN performed the best of the three models, achieving an accuracy of 0.9107 with a sensitivity of 0.9310 and a specificity of 0.7383. Table 1 shows the complete set of results in tabular format. For each fold of each model type, the detailed prediction results on the validation datasets are presented in the form of confusion matrices. A confusion matrix is a matrix representation of the classification results. In a two-class prediction problem (such as the one in this research), the upper left cell denotes the number of samples classified as true that actually were true (i.e., true positives), and the lower right cell denotes the number of samples classified as false that actually were false (i.e., true negatives). The other two cells (lower left and upper right) denote the number of misclassified samples. Specifically, the lower left cell denotes the number of samples classified as false
while they actually were true (i.e., false negatives), and the upper right cell denotes the number of samples classified as true while they actually were false (i.e., false positives). Once the confusion matrices were constructed, the accuracy, sensitivity and specificity of each fold were calculated using the respective formulas presented in the previous section.

4.2 Sensitivity analysis on ANN output

We used sensitivity analysis to gain some insight into the decision variables used for the classification problem. Sensitivity analysis is a method for extracting the cause-and-effect relationship between the inputs and outputs of a neural network model. As has been noted by many investigators in the AI field, an ANN may often offer better predictive ability but little explanatory value. This criticism is generally true; however, sensitivity analysis can be performed to generate insight into the problem. Recently, it has become a commonly used method in ANN studies for identifying the degree to which each input channel (independent variables or decision variables) contributes to the identification of each output channel (dependent variables). The sensitivity analysis provides information about the relative importance of the input variables in predicting the output field(s). In the process of performing sensitivity analysis, ANN learning is disabled so that the network weights are not affected. The basic idea is that the inputs to the network are perturbed slightly, and the corresponding change in the output is reported as a percentage change in the output [16]. The first input is varied between its mean plus and minus a user-defined number of standard deviations, while all other inputs are fixed at their respective means. The network output is computed and recorded as the percent change above and below the mean of that output channel. This process is repeated for each and every input variable.
As an outcome of this process, a report (usually a column plot) is generated that summarizes the variation of each output with respect to the variation in each input. The sensitivity analysis performed for this research project, presented graphically in Figure 2, lists the input variables by their relative importance (from most important to least important). The value shown for each input variable is a measure of its relative importance among the other variables.
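The perturbation procedure described above can be sketched as follows. This is a generic illustration rather than the software used in the study, and the model here is a hypothetical stand-in for the trained ANN.

```python
import numpy as np

def sensitivity_analysis(predict, X, n_std=1.0):
    """Vary one input at a time between its mean - n_std*sd and
    mean + n_std*sd, holding all other inputs at their means, and record
    the swing in the model output; a larger swing marks a more important
    variable."""
    means, stds = X.mean(axis=0), X.std(axis=0)
    importance = []
    for j in range(X.shape[1]):
        lo, hi = means.copy(), means.copy()
        lo[j] -= n_std * stds[j]
        hi[j] += n_std * stds[j]
        importance.append(abs(predict(hi) - predict(lo)))
    return np.array(importance)

# Hypothetical stand-in for the trained ANN: only the first two of four
# inputs influence the output, with the first weighted most heavily
def model(x):
    return 0.8 * x[0] + 0.2 * x[1]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
scores = sensitivity_analysis(model, X)
print(scores.argsort()[::-1])  # variables ranked most to least important
```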
Table 1. Tabular results for 10-fold cross validation for all folds and all model types.

           Neural Networks (MLP)      Decision Tree (CART)       Logistic Regression
Fold No    Acc    Sens   Spec         Acc    Sens   Spec         Acc    Sens   Spec
1          0.9075 0.9273 0.7319       0.8993 0.9181 0.7317       0.8976 0.9136 0.7424
2          0.9108 0.9327 0.7307       0.9021 0.9234 0.7273       0.8978 0.9156 0.7327
3          0.9116 0.9316 0.7376       0.9007 0.9200 0.7290       0.8968 0.9141 0.7283
4          0.9126 0.9318 0.7506       0.9035 0.9232 0.7411       0.8986 0.9156 0.7420
5          0.9109 0.9327 0.7315       0.8956 0.9142 0.7372       0.8909 0.9081 0.7313
6          0.9092 0.9310 0.7315       0.8996 0.9183 0.7376       0.8963 0.9142 0.7328
7          0.9138 0.9333 0.7496       0.8989 0.9171 0.7498       0.8935 0.9093 0.7479
8          0.9062 0.9273 0.7229       0.8972 0.9152 0.7314       0.8946 0.9132 0.7206
9          0.9115 0.9291 0.7575       0.8997 0.9173 0.7430       0.8961 0.9112 0.7471
10         0.9126 0.9328 0.7391       0.9036 0.9214 0.7472       0.8983 0.9155 0.7356
Mean       0.9107 0.9310 0.7383       0.9000 0.9188 0.7375       0.8961 0.9130 0.7361
St. Dev.   0.0024 0.0023 0.0109       0.0026 0.0032 0.0077       0.0024 0.0026 0.0087

Note: In the per-fold confusion matrices, which show the classification of the cases in the test dataset, the columns denote the actual cases and the rows the predicted cases. Accuracy = (TP + TN) / (TP + FP + TN + FN); Sensitivity = TP / (TP + FN); Specificity = TN / (TN + FP).
Figure 2. Sensitivity analysis graph
5. Discussion and Conclusion

Data collection. One of the key components of predictive accuracy is the amount and quality of the data [7]. However, the data gathered in medicine are generally collected as a result of patient-care activity to benefit the individual patient, and research is only a secondary consideration. As a result, medical databases contain many features that create problems for data mining tools and techniques. Medical databases may consist of a large volume of heterogeneous data, including a variety of field types. The heterogeneity of the data complicates the use of data mining tools and techniques. Additionally, as with any large database, medical databases contain missing values that must be dealt with prior to the use of data mining tools. Further, as a result of the method of collection, medical databases may contain data that are redundant, incomplete, imprecise or inconsistent, which can affect the use and results of the data mining tools. Also, the collection method
can introduce noise into the data, and can affect the results of the data mining tools. In addition to the collection problems, medical databases have the unique problem of incorporating medical concepts into an understandable form. All of the above may create problems for data mining, and as a result, may require more data reduction and data preparation than data derived from other sources [7]. Even with the inherent problems associated with medical databases, the use of medical data for research has many benefits. The research can provide useful information for diagnosis, treatment and prognosis [18]. As mentioned above, the results of data mining are directly affected by the quantity and quality of the data [7]. By improving the collection of the data, medical data mining can yield even greater results and benefits. By making the collection process a primary focus, the methods of obtaining medical data can be formalized and standardized. Thus, the problem of missing, redundant or inconsistent values in data can be reduced. In addition to the above listed technical problems, the use of medical data in data mining
involves the critical issues of privacy, security and confidentiality. The privacy of the individual should be respected in any medical data collection, and patient identification should be kept confidential and secure. Additionally, medical data is governed by the Common Rule (45 CFR 46) and HIPAA, and is subject to the penalties thereunder if the proper procedures are not followed [4]. The rules under these laws can generally be satisfied by the use of anonymous, anonymized or de-identified data. Anonymous data is collected without any patient identification. Anonymized data has the patient identification information permanently removed after collection. De-identified data has the patient identification information encoded or encrypted after collection, so that it can be retrieved with the appropriate approval. Identified data should never be used without the patient's prior consent [7]. Predictive models. As shown herein, models can be developed that, based on certain predictive attributes, accurately predict the outcome of an incidence of cancer. Such predictive models can be valuable tools in medicine: they can be used to assist in determining prognosis, developing a successful treatment, or avoiding unnecessary treatment [7]. However, there are areas of concern in the development of predictive models: (1) the model should include all clinically relevant data, (2) the model should be tested on an independent sample, and (3) the model must make sense to the medical personnel who are expected to use it. It has been shown that not all predictive models constructed using data mining techniques satisfy all of these requirements [18]. While data mining can provide useful information and support to the medical staff by identifying patterns that may not be readily apparent, there are limits to what it can do.
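The distinction drawn above between anonymized data (identifiers permanently removed) and de-identified data (identifiers encoded so the mapping can be recovered with approval) can be sketched as follows; the field names and key are hypothetical, and a keyed hash (HMAC) is assumed as the encoding scheme:

```python
import hashlib
import hmac

# Hypothetical key, held only by the approving authority; with it the
# patient-to-code mapping can be regenerated (de-identified, not anonymous).
SECRET_KEY = b"institution-held-secret"

def deidentify(patient_id: str) -> str:
    """Encode a patient identifier as a keyed hash; only the key holder
    can regenerate and match codes back to patients."""
    return hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256).hexdigest()

def anonymize(record: dict) -> dict:
    """Permanently strip identifying fields; there is no way back."""
    identifying = {"name", "ssn", "patient_id"}  # hypothetical field names
    return {k: v for k, v in record.items() if k not in identifying}
```

Under the keyed scheme the same patient always maps to the same code, so de-identified records can still be linked longitudinally without exposing identity.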
Not all patterns found via data mining are “interesting”. For a pattern to be interesting, it should be logical and actionable. Therefore, data mining requires human intervention to exploit the extracted knowledge. For example, data mining can provide assistance in making the diagnosis or prescribing the treatment, but it still cannot replace the physician’s intuition and interpretive skills [18]. Conclusion. In this paper, we reported on a research project aimed at developing prediction models and explaining prognostic factors of prostate cancer survivability. We have used the SEER data in order to develop the models. We have used three modeling
techniques: one traditional statistical model (logistic regression) and two machine learning techniques (decision trees and artificial neural networks). The SEER dataset is quite large (350K+ records) and required a rather lengthy procedure of data cleansing and transformation. Ten-fold cross validation was used in model building and evaluation: the dataset was divided into 10 mutually exclusive partitions using a stratified sampling technique, and in each run 9 folds were used for training while the remaining fold was used for testing. This process was repeated 10 times so that every data point was used in both the training and testing datasets. Accuracy, sensitivity and specificity were calculated for all three model types on each of the 10 folds and then averaged across the folds in order to compare how well the models predict prostate cancer survivability. Averaging across the 10 folds, we found that the artificial neural network models performed best, with an accuracy of 91.07 percent; the decision tree came a close second, and logistic regression came third. As shown in this research, advanced data mining methods can be used to develop models that possess a high degree of predictive power. However, several issues involved with data understanding, data transformation and the predictive models warrant careful consideration. Although data mining methods are capable of extracting patterns and relationships hidden deep in large medical datasets, without cooperation and feedback from medical professionals their results would be useless. The patterns found via data mining should be evaluated by medical professionals with years of experience in the problem domain, who can decide whether they are logical, actionable and novel enough to fuel new biological and clinical research directions.
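The stratified 10-fold procedure summarized above can be sketched in a few lines; this is an illustrative reconstruction, not the implementation actually used in the study, and the labels shown are hypothetical:

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=42):
    """Split record indices into k mutually exclusive folds, preserving
    the class proportions of `labels` in every fold (stratified sampling)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, idx in enumerate(indices):
            folds[j % k].append(idx)  # deal class members round-robin across folds
    return folds

# Hypothetical 90/10 class mix; each fold in turn serves as the test set,
# while the other k-1 folds are used for training.
labels = [1] * 90 + [0] * 10
folds = stratified_folds(labels, k=10)
```

Because the folds are stratified, each one preserves the 9-to-1 class mix of the hypothetical labels, so every test fold is representative of the whole dataset.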
In short, data mining does not aim to replace medical professionals and researchers, but to complement their invaluable efforts to save more human lives. Some research extensions that could compensate for the limitations of the presented effort are as follows. First, in studying prostate cancer survivability, we have not considered potential correlations with other cancer types. It would be interesting to investigate whether a person with, say, skin cancer has a worse survivability outlook for prostate cancer. This can be done by including all possible cancer types and their prognostic factors to investigate the correlations, commonalities and differences among them. Second, newer and more promising methods such as support vector machines and rough sets could be applied to see whether the prediction accuracy can be further improved. Another viable option to improve the prediction accuracy would be a
hybrid intelligent system where the prediction results of data mining methods are augmented with expert opinions, which are captured and embedded into an expert system.
References

[1] The American Cancer Society, "All about Prostate Cancer Overview", www.cancer.org, accessed April 6, 2005.
[2] Bellazzi, R., Azzini, I., Toffolo, G., Bacchetti, S. and Lise, M., "Mining data from a knowledge management perspective: an application to outcome prediction in patients with resectable hepatocellular carcinoma", Proceedings of AIME, 2001, 40-49.
[3] Blair, E. and Tibshirani, R., "Machine Learning Methods Applied to DNA Microarray Data Can Improve the Diagnosis of Cancer", SIGKDD Explorations, 2003, 5(2): 48-55.
[4] Berman, J.J., "Confidentiality issues for medical data miners", Artificial Intelligence in Medicine, 2002, 26: 25-36.
[5] Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J., Classification and Regression Trees, Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software, 1984.
[6] Churilov, L., Bagirov, A.M., Schwartz, K., Smith, K. and Dally, M., "Improving Risk Grouping Rules for Prostate Cancer Patients with Optimization", Proceedings of the 37th Hawaii International Conference on System Sciences, 2004.
[7] Cios, K.J. and Moore, G.W., "Uniqueness of medical data mining", Artificial Intelligence in Medicine, 2002, 26: 1-24.
[8] Hastie, T., Tibshirani, R. and Friedman, J., The Elements of Statistical Learning, New York, NY: Springer-Verlag, 2001.
[9] Haykin, S., Neural Networks: A Comprehensive Foundation, New Jersey: Prentice Hall, 1998.
[10] Hornik, K., Stinchcombe, M. and White, H., "Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks", Neural Networks, 1990, 3: 359-366.
[11] Kohavi, R., "A study of cross-validation and bootstrap for accuracy estimation and model selection", Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), Montreal, Canada, 1995, Morgan Kaufmann, 1137-1145.
[12] Land, W.H., "New results in breast cancer classification obtained from an evolutionary computation/adaptive boosting hybrid using mammogram and history data", Proceedings of the 2001 IEEE Mountain Workshop on Soft Computing in Industrial Applications, IEEE, 2001, 47-52.
[13] Larry, T.H., "Using Emerging Pattern Based Projected Clustering and Gene Expression Data for Cancer Detection", 2nd Asia-Pacific Bioinformatics Conference, Vol. 29, 2004.
[14] Mangasarian, O.L., "Breast cancer diagnosis and prognosis via linear programming", Operations Research, 1995, 43(4): 570-577.
[15] National Cancer Institute, What You Need To Know About Prostate Cancer, www.cancer.gov, accessed April 6, 2005.
[16] Principe, J.C., Euliano, N.R. and Lefebvre, W.C., Neural and Adaptive Systems, New York, NY: John Wiley and Sons, 2001.
[17] Quinlan, J., "Induction of decision trees", Machine Learning, 1986, 1: 81-106.
[18] Richards, G., Rayward-Smith, V.J., Sonksen, P.H., Carey, S. and Weng, C., "Data mining for indicators of early mortality in a database of clinical records", Artificial Intelligence in Medicine, 2001, 22: 215-231.
[19] Surveillance, Epidemiology, and End Results (SEER) Program (www.seer.cancer.gov) Public-Use Data (1973-2001), National Cancer Institute, DCCPS, Surveillance Research Program, Cancer Statistics Branch, released April 2004, based on the November 2003 submission.
[20] Shieu-Ming, C. et al., "Mining the breast cancer pattern using artificial neural networks and multivariate adaptive regression splines", Expert Systems with Applications, 2004, 27: 133-142.
[21] Sung-Bae, C., "Machine Learning in DNA Microarray Analysis for Cancer Classification", 1st Asia-Pacific Bioinformatics Conference, Adelaide, Australia, 19, 2003.
[22] Tigrani, V. and John, G., "Data mining and statistics in medicine: an application in prostate cancer detection", www.robotics.stanford.edu, accessed December 23, 2004.
[23] Walter, D. and Mohan, C.K., "ClaDia: A fuzzy classifier system for disease diagnosis", Proceedings of the 2000 Congress on Evolutionary Computation, IEEE, Vol. 2, 2000, 1429-1435.
[24] Zhang, Z. and Zhang, H., "Development of a neural network derived index for early detection of prostate cancer", Proceedings of IJCNN'99 - International Joint Conference on Neural Networks, IEEE, Vol. 5, 1999, 3636-3641.
[25] Zupan, B. and Demsar, J., "Machine learning for survival analysis: a case study on recurrence of prostate cancer", Artificial Intelligence in Medicine, 2000, 20(1): 59-75.