
Decision Support Systems 74 (2015) 150–161

Contents lists available at ScienceDirect

Decision Support Systems journal homepage: www.elsevier.com/locate/dss

Predicting overall survivability in comorbidity of cancers: A data mining approach

Hamed Majidi Zolbanin ⁎, Dursun Delen, Amir Hassan Zadeh

Department of Management Science and Information Systems, Spears School of Business, Oklahoma State University, Stillwater, OK, United States

Article info

Article history:
Received 7 February 2015
Received in revised form 1 April 2015
Accepted 2 April 2015
Available online 19 April 2015

Keywords:
Medical decision making
Comorbidity
Concurrent diseases
Concomitant diseases
Predictive modeling
Random forest

Abstract

Cancer and other chronic diseases have constituted (and will continue to constitute, at an increasing pace) a significant portion of healthcare costs in the United States in recent years. Although prior research has shown that diagnostic and treatment recommendations might be altered based on the severity of comorbidities, chronic diseases are still being investigated in isolation from one another in most cases. To illustrate the significance of concurrent chronic diseases in the course of treatment, this study uses SEER's cancer data to create two comorbid data sets: one for breast and female genital cancers and another for prostate and urinary cancers. Several popular machine learning techniques are then applied to the resultant data sets to build predictive models. Comparison of the results shows that having more information about comorbid conditions of patients can improve models' predictive power, which in turn can help practitioners make better diagnostic and treatment decisions. Therefore, proper identification, recording, and use of patients' comorbidity status can potentially lower treatment costs and ease healthcare-related economic challenges.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Cancer is the second leading cause of death in the United States.¹ It is also a major cause of death worldwide, especially (and ironically) in high-income countries.² Research on the causes and behavior of cancers has resulted in significant advances in our understanding of the disease over the past four decades. Even though cancer studies have traditionally been clinical and biological in nature, recent technological advances have made data-driven analytic studies a common complement. The exploration of massive medical databases with the aid of new computational tools has confirmed the existence of coexisting diseases, including certain cancers. However, current medical research tends to follow a reductionist approach to the study of ailments, investigating them in isolation from one another rather than considering their interactions [57]. Recent findings urge a different stance toward comorbid diseases by showing how coexisting illnesses might affect the diagnosis, treatment, and evaluation of treatment effectiveness, as well as the survival of patients [2,20,21,26,27,45,61].

Another equally important reason for considering comorbidities is their impact on treatment costs, which in turn affect economies. According to the National Health Council, 133 million Americans are affected by incurable, ongoing chronic diseases, and this number is expected to grow to 157 million in 2020, with 81 million suffering from multiple conditions.³ These figures gain salience given that chronic conditions account for more than 75% of all healthcare costs. While $1.3 trillion was reported in 2007 as the adverse economic impact of chronic diseases, including cancer, this figure is projected to increase to $4.2 trillion in superfluous treatment costs and lost economic output.

Prior research has shown that cancer treatment recommendations might be significantly altered based on the severity of comorbidities. Specifically, the extent of the tumor spread is not the sole determinant of treatment. Instead, the overall health of the patient might carry greater weight in choosing treatments [17]. Given the serious interplay of coexisting complications with all phases of cancer treatment, and the increasing trend in the development of intercurrent illnesses, the present cancer classification system needs to be revised, as it does not account for the severity of comorbid conditions [46]. Even if concurrent health issues are diagnosed and accounted for during the course of treatment, excluding them from general data sets or storing them in disconnected systems hampers prospective statistical analyses that might reveal useful patterns about their interplay.

Similarly, elimination of comorbidity information may compromise the effectiveness of clinical decision support systems (CDSS). These systems “apply best-known medical knowledge to patient data for the purpose of generating case-specific decision support” [68], particularly in preventive care services and treatment recommendation [6]. The reductionist approach to the study of diseases ignores some of the potential interactions among comorbidities, thus rendering CDSS less effective. Prior studies on physicians' information needs have shown that in as many as 81% of clinical encounters in ambulatory care, clinicians may be missing critical information. As a result, providers confront serious challenges in accessing relevant information, obtaining a thorough picture of the patient's clinical state and history, and determining the optimal testing or therapeutic actions to take next [54].

It seems, therefore, that collective consideration of concurrent diseases can potentially improve the quality of clinical decision support. To demonstrate the importance of this issue, the current study investigates how the concurrence of two cancers, namely urinary with male genital and breast with female genital, might affect the predictability of disease outcomes. Improved predictability not only argues against the reductionist approach, but can also help build more accurate clinical decision support systems, which in turn would allow practitioners to make more effective decisions and eventually lower overall healthcare costs.

⁎ Corresponding author. E-mail address: [email protected] (H.M. Zolbanin).
¹ http://www.cdc.gov/nchs/fastats/leading-causes-of-death.htm.
² http://www.who.int/mediacentre/factsheets/fs310/en/index1.html.
³ http://www.nationalhealthcouncil.org/NHC_Files/Pdf_Files/AboutChronicDisease.pdf.

http://dx.doi.org/10.1016/j.dss.2015.04.003
0167-9236/© 2015 Elsevier B.V. All rights reserved.

Fig. 1. Final outcomes in the SEER data (1973–2011).

2. Motivation

Comorbidity was first defined by Feinstein [22] as “any distinct additional clinical entity that has existed or may occur during the clinical course of a patient who has the index disease under study”. Others have limited it to neoplasia, i.e., conditions and diseases that existed before a cancer diagnosis and are not adverse effects of cancer treatment [46], or to illness processes that coexist and are not related to the index disease under study [58].
Some authors have used such terms as intercurrent disease [28,30,55] or coexisting illness [44] interchangeably with comorbidity. Others have been more specific by making a subtle distinction between comorbidity and multimorbidity, which is simply defined as the presence of several diseases in one individual [57,58]. While some researchers refer to comorbidity as coexisting non-cancer medical conditions [18], others have made a distinction between cancer and non-cancer concurrent diseases, implicitly confirming that comorbidity can also refer to the coexistence of two or more different cancers [19]. Although multimorbidity would be a better term for the coexistence of two cancers in the same patient, for the sake of simplicity we use these terms interchangeably.

The impact of comorbid health conditions on patients' overall survival cannot be overstated. In more than seven million unique cancer incidents in the Surveillance, Epidemiology, and End Results (SEER) data set between 1973 and 2011, almost 26% of all deaths were due to non-cancer comorbid causes. Comorbidities accounted for more than 14% of overall outcomes, including survivals. This information is illustrated in Fig. 1.

Investigation of historical cancer incidences reveals that certain types of cancers co-occur more frequently than others. Table 1 shows the number of patients who suffered from two different types of cancers during their lives. As can be seen, urinary and male genital cancers, with 46,204 cases, co-develop most frequently in a single patient's lifetime. Coexisting breast and female genital cancers stand third with 34,056 cases. Therefore, this study focuses on two of the most common comorbid subsets of SEER cases, i.e., those individuals who were diagnosed with male genitourinary or breast and female genital cancers during their lives.

Development of two types of cancers in one patient may occur at different ages. As stated previously, comorbidity (or more accurately, multimorbidity in a study of concurrent cancers) refers to coexisting ailments in one individual. Consequently, for the two cancers to be considered “comorbid”, we limited our sample to those cases whose second cancer was diagnosed within one year after the first cancer's diagnosis. As a result, the new sample of comorbid genitourinary cancers comprised 14,243 cases. For breast and female genital cancers, the comorbid sample included 3664 cases. Figs. 2 and 3 show the distribution of patients' final outcomes in these samples. Based on these figures, coexisting complications account for a significant proportion of deaths in the study's samples, and their impact is notably greater among the male genital and urinary cancer patients. Comorbidities, including other cancers, cause more than 47% of deaths among genitourinary patients and 37% among breast and female genital patients.

Investigating the impact of comorbid cancers on patients' final outcomes is the main impetus for this study. More specifically, we want to show how two concurrent diseases may change the predictability of disease outcomes. We develop a set of predictive models for the aforementioned cancers, both for each cancer alone and for their combination. The models' performances are then compared and the role of comorbidities is discussed.

3. Background

The primary criterion in the traditional TNM system of cancer classification is the morphology of the carcinoma, and as such, it does not consider patients' overall health and comorbidity [45]. With the sizable body of literature that attests to the prognostic impact of factors other than tumor stage and treatment, comorbidities have found more

Table 1
Co-occurrence and total counts of different types of cancer.

Cancer type                       (1)        (2)        (3)        (4)        (5)        (6)        (7)        (8)        (9)
(1) Breast                  1,275,422
(2) Colon and rectum           24,259    841,426
(3) Other digestive            11,687     14,856    619,246
(4) Female genital             34,056     12,280      6,251    631,841
(5) Leukemia and lymphoma      14,559     12,501      6,000      6,452    639,434
(6) Male genital                  809     34,553     18,219          0     21,657  1,117,558
(7) Other                      31,584     21,498     13,395     12,889     22,238     43,989  1,358,973
(8) Respiratory                22,032     21,531     10,461     10,562     12,614     33,688     29,668  1,058,253
(9) Urinary                    10,478     17,147      8,859      5,609      9,984     46,204     19,052     22,166    522,376

Column numbers refer to the numbered cancer types in the first column.
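As a rough illustration of how a table like Table 1 could be tabulated, the Python sketch below counts, for each cancer category, the patients recorded with it and, for each pair of categories, the patients recorded with both. The input structure (a mapping from a patient identifier to a set of category names), the function name, and the toy data are assumptions made for illustration only. The study derived its counts from the SEER files in SAS, and the sketch counts patients, which may differ from how the published totals were counted.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(patient_cancers):
    """Return (totals, pairs) for a {patient_id: set_of_categories} mapping.

    totals[c]     -> number of patients ever recorded with category c
    pairs[(a, b)] -> number of patients recorded with both a and b,
                     where (a, b) is in alphabetical order
    """
    totals = Counter()
    pairs = Counter()
    for cancers in patient_cancers.values():
        totals.update(cancers)
        # Every unordered pair of distinct categories within one patient
        for a, b in combinations(sorted(cancers), 2):
            pairs[(a, b)] += 1
    return totals, pairs


# Toy data (hypothetical patients, not SEER records)
toy = {
    "p1": {"Male genital", "Urinary"},
    "p2": {"Breast", "Female genital"},
    "p3": {"Male genital", "Urinary", "Respiratory"},
    "p4": {"Breast"},
}
totals, pairs = cooccurrence_counts(toy)
```

Keying each pair in sorted order keeps the counts symmetric, which is why only one triangle of the matrix needs to be reported.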


Fig. 2. Final outcomes in the sample of comorbid male genital and urinary cancers.

salience in the study of cancers [44]. These studies have looked at concomitant diseases from different angles to determine how the severity of comorbidity may affect prognosis, the choice and effectiveness of initial treatment, and quality of life among cancer patients.

In a review of the extant literature on the role of comorbidity in head and neck cancer, Paleri et al. [44] found that comorbidity is a significant factor in increasing mortality rates among patients, and that this impact is greater in the early years after treatment. They also observed that comorbidity has a negative impact on disease-specific survivability, as it is likely a significant factor in diagnosis delay. Piccirillo [45] noticed a positive relationship between the severity of comorbidity and the mortality rate. He concluded that among individuals with head and neck cancer, comorbidity is an independent prognostic factor, rather than a result of differences in treatment. Others have reported the same relationship between intercurrent diseases and overall survival of newly diagnosed head and neck cancer patients [13]. In a study predicting the influence of comorbidities on treatment costs among head and neck patients, Hollenbeak et al. [29] observed the effect to be significant. Concomitant diseases have also been mentioned as influential factors in altering the choice of treatment among head and neck patients [51].

In lung cancer, Smith et al. [55] showed that when surgery is not a viable option for patients with early stage cancer, curative radiotherapy is associated with significantly longer survival. They also noticed that for these individuals, the comorbidity index significantly predicts overall survival. In another study on the role of age and comorbidity as independent factors in patients' final outcomes, Asmis et al. [3] concluded that the presence of comorbid conditions, rather than age above 65 years, was more strongly associated with poor survival. These results were observed in other research as well. For instance, Tammemagi et al. [59] found that comorbidity count and the Charlson Comorbidity Index (CCI) were significant predictors of survival, although they explained only a small share of the variation. The 2013 annual report on the status of cancer identifies comorbidity level as an important indicator for overall survival among patients with local and regional diseases. The report

Fig. 3. Final outcomes in the sample of comorbid breast and female genital cancers.

maintains, however, that comorbidity has a smaller impact when cancer has spread distantly.

In another stream of comorbidity research, Post et al. [50] observed that patients with no comorbidity had a better short-term prognosis than those with one or more other diseases. They found that among individuals with localized prostate cancer, concurrent diseases had the most significant prognostic influence in the first 3 years of survival, followed by histological grade. This effect was stronger in younger men and attenuated with increasing age. Interestingly, they noticed that the presence of two concurrent diseases had the largest effect on survival, and that co-occurrence of three or more ailments, which was found to be less likely, did not add any significance. Daskivich et al. [12] argued that comorbidity is a key consideration in clinical decision making for prostate cancer. In line with these results, Fitzpatrick [23] mentioned comorbidity as a crucial predictor of mortality in senior adults and emphasized that concomitant diseases, rather than chronological age, should be of prime concern in choosing the course of treatment among senior adults who suffer from localized prostate cancer. Other studies (e.g., [2,21]) found that higher comorbidity levels in genitourinary cancers are more strongly associated with overall mortality and cancer-specific survivability. Similarly, higher rates of pre-existing medical conditions and limited access to healthcare services were found to be significant contributors to survival disparities between indigenous and non-indigenous colon cancer patients in New Zealand [27].

Comorbidity was also found to have a significant role in breast and ovarian cancers. Tetsche et al. [62] observed greater prevalence of comorbidity among individuals with advanced stages of cancer. They noted higher rates of cancer-specific survivability in patients without coexisting ailments than in patients with comorbidities. While mammographically detected tumors were found to be significantly related to lower relative risk of death in all age groups, the detection method did not contribute to overall survival among older patients with severe comorbidities [38].

Besides the impact of comorbidity on cancer, there has been a sizable body of research, particularly in the past 15 years, on the applications of data mining and machine learning techniques in analyzing medical data. One stream has focused on the applications of artificial intelligence techniques in analyzing medical data in general. For instance, Ghazavi and Liao [24] noted the high dimensionality of medical data and offered a fuzzy model to select an optimal subset of features that might be more relevant depending upon the specific application. Others applied machine learning techniques to present a method that can be used in the diagnosis of hepatitis [48] and general liver disorders [11].

Several studies, however, used artificial intelligence for the classification and diagnosis of cancers. Among general studies, Hong and Cho [31] used gene expression data to conclude that accurate diagnosis leads to better treatment and toxicity minimization for cancer patients. Shah and Kusiak [53] analyzed gene expression data to facilitate appropriate treatment selection and drug development. Genetic programming methodologies were also applied in building rule-based diagnosis systems [63]. Li et al. [37] used unique serum proteomic patterns to discriminate cancer samples from non-cancer ones.

Apart from general studies, most researchers have used these techniques to analyze specific cancer data. Several papers have focused on breast cancer. These studies have applied different rules, algorithms, or methods in pattern recognition [8,56,67], used rules and methods such as neural networks for diagnosis [1,33,34,49], compared different methods' detection accuracy or survival prediction performance [10,15,42,65], or combined analysis methods to predict recurrence-free survival of breast cancer patients [64]. Prostate cancer has also been investigated in some studies. For instance, Delen [14] used three popular data mining techniques (i.e., decision trees, artificial neural networks, and support vector machines) along with logistic regression to develop prediction models for prostate cancer survivability. In another study, Chiu et al. [9] applied

Fig. 4. Research methodology. [Flowchart: Cancer DB 1, Cancer DB 2, ..., Cancer DB n; Combined Cancer DB; Data Preprocessing (cleaning, selecting, transforming); partitioned data (training and testing) for each of Artificial Neural Networks (ANN), Logistic Regression (LR), and Random Forest (RF); training and calibrating the model; testing the model; Tabulated Model Testing Results (accuracy, sensitivity, and specificity); Assess variable importance; Tabulated Relative Variable Importance Results.]

neural network models to predict skeletal metastasis in patients with prostate cancer. Other cancers were also explored in some studies. For instance, fuzzy neural networks were applied to detect ovarian cancer [60]; neural networks, genetic algorithms, and logistic regression were used for screening pancreatic cancer [7]; and data mining approaches were utilized to generate accurate lung cancer diagnoses [35].

As stated earlier, a majority of prior research has considered diseases in isolation from one another. With the increased understanding of the significant interplay of coexisting diseases with cancers, there has recently been a surge in comorbidity research. However, very few studies, if any, have investigated two or more concomitant cancers. Before illustrating the applied models and discussing the results, we explain the data management process.

4. Methodology

It is widely known among data scientists that if the input data sets lack the required quality to begin with, all results and findings will be questionable [16]. Consequently, a considerable amount of time and effort in data mining applications is devoted to data management. In the following subsections we describe how this critical activity was performed in our study. Fig. 4 graphically illustrates the research methodology employed in this study.

4.1. Data acquisition

The data used in this study were acquired from the Surveillance, Epidemiology, and End Results (SEER) Program of the National Cancer Institute (NCI), which is an authoritative source of information on cancer incidence and survival in the United States. “SEER currently collects and publishes cancer incidence and survival data from population-based cancer registries covering approximately 28 percent of the US population. SEER coverage includes 26 percent of African Americans, 38 percent of Hispanics, 44 percent of American Indians and Alaska Natives, 50 percent of Asians, and 67 percent of Hawaiian/Pacific Islanders.”⁴ This program collects and organizes cancer data into nine distinct categories. Most of these categories contain data for specific anatomical sites; however, when the number of overall incidents is small for some cancer types, they are grouped together to form a general class of “other” cancers. The remaining eight common cancer groups are: breast, colon and rectum, other digestive, female genital, lymphoma and leukemia, male genital, respiratory, and urinary. There are 149 variables in each file, and each file record relates to a specific cancer incidence.

The SEER project assembles cancer data from nine registries at different geographical locations. Data quality and completeness are enforced to the greatest extent possible. “The SEER program registries routinely collect data on patient demographics, primary tumor site, tumor morphology and stage at diagnosis, first course of treatment, and follow-up for vital status. This program is the only comprehensive source of population-based information in the United States that includes stage of cancer at the time of diagnosis and patient survival data. The mortality data reported by SEER are provided by the National Center for Health Statistics. The population data used in calculating cancer rates is obtained periodically from the Census Bureau. Updated annually and provided as a public service in print and electronic formats, SEER data are used by thousands of researchers, clinicians, public health officials, legislators, policymakers, community groups, and the public.”⁵ The SEER database has been widely used in a variety of analytical research projects. At the time of this study, a simple search on the National Library of Medicine's database (PUBMED) showed more than 540 publications in 2014 that used the data set either as the main data source or to report some cancer statistics.

⁴ National Cancer Institute, the Surveillance, Epidemiology, and End Results Program. Available at: http://seer.cancer.gov/about/.

4.2. Data preparation and preprocessing

Data preparation is a process that cannot be carried out blindly [32]. The process includes understanding what the data represents, exploring variable statistics and distributions, performing appropriate transformations, handling missing values, analyzing outliers, and reducing the data, among others.
As such, data preparation takes more than half, and oftentimes up to 80%, of the time involved in a data mining initiative [47].

Creating the final data set for this study included several steps. First, the data files were sorted by increasing case number (a variable used to uniquely but anonymously identify patients). Second, 10 new numerical variables were added to each cancer file. Nine of these variables corresponded to the nine cancer categories (called “cancer flags” hereafter) and the tenth variable was a counter (called “counter” hereafter). Next, using the SQL procedure in SAS Enterprise Guide 6.1, the data files of the nine categories were incrementally joined to update the cancer flags. For example, if a patient's case number appeared in both the breast and female genital cancer data files, the corresponding cancer flags were updated to 1. If, in later joins, the same patient was found to have been diagnosed with another cancer, such as respiratory, that cancer flag was also updated to 1. For patients who had been diagnosed with only one cancer, only the appropriate cancer flag was updated. The incremental joining process included 36 iterations (8 + 7 + ... + 1): the first cancer file was joined with the other eight files, the second file was joined with the remaining seven files, and so forth. Once the process was completed, each cancer file included a summary of all cancers that a specific patient had been diagnosed with during his or her lifetime. The next step summed the cancer flags to update the counter variable.

Although we only deal with four cancers in this study, it was crucial to consider all cancer files in the joining process. At the end of the aggregation, we filtered the cases to include only those patients who were diagnosed with exactly two cancers, i.e., breast with female genital or male genital with urinary. This restriction helped us leave out the potential effect of other cancers on model outcomes.
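The flag-and-counter bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the category short names, the function names, and the toy case numbers are hypothetical, and the sketch sets the flags in a single pass over the files rather than through the 36 pairwise SQL joins performed in SAS Enterprise Guide. The end result (nine flags plus a counter per case number) is intended to be the same.

```python
from itertools import combinations

# Hypothetical short names for the nine SEER category files.
CATEGORIES = [
    "breast", "colon_rectum", "other_digestive", "female_genital",
    "leukemia_lymphoma", "male_genital", "other", "respiratory", "urinary",
]


def build_flags(files):
    """Return {case_number: {category: 0 or 1, ..., "counter": n}}.

    files maps a category name to the case numbers found in that file.
    """
    flags = {}
    for category in CATEGORIES:
        for case in files.get(category, ()):
            row = flags.setdefault(case, dict.fromkeys(CATEGORIES, 0))
            row[category] = 1
    for row in flags.values():
        # The counter is the sum of the nine cancer flags.
        row["counter"] = sum(row[c] for c in CATEGORIES)
    return flags


def select_pair(flags, first, second):
    """Case numbers diagnosed with exactly the two given categories."""
    return sorted(
        case for case, row in flags.items()
        if row["counter"] == 2 and row[first] == 1 and row[second] == 1
    )


# Number of pairwise joins among nine files: 8 + 7 + ... + 1
n_pairwise_joins = len(list(combinations(CATEGORIES, 2)))

# Toy case numbers (not SEER data)
toy_files = {
    "breast": [1, 2, 3],
    "female_genital": [2, 3],
    "respiratory": [3],
    "male_genital": [4],
    "urinary": [4],
}
toy_flags = build_flags(toy_files)
breast_female = select_pair(toy_flags, "breast", "female_genital")
male_urinary = select_pair(toy_flags, "male_genital", "urinary")
```

In the toy data, case 3 carries three flags and is therefore dropped by the "exactly two cancers" filter, which mirrors the restriction that keeps other cancers from influencing the models.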
In the next step, based on the definition of comorbidity, which emphasizes the co-occurrence of diseases, we further narrowed our samples to those patients whose cancers were diagnosed within 1 year. As such, the female and the male samples had 3664 and 14,243 cases, respectively. Data preparation and preprocessing were followed by understanding the variables.

⁵ National Cancer Institute, the Surveillance, Epidemiology, and End Results Program. Available at: http://seer.cancer.gov/about/.

4.3. Data understanding

Not all of the 149 variables in the SEER cancer files were usable in this study. To better understand the variables and choose meaningful ones, we explored SEER's data record description and its coding and staging manuals. Variables that exist in the public version of the data can be broadly classified into six categories, which are summarized in Table 2. These categories help to acquire a better understanding of the variables and, more importantly, to define a target variable. A follow-up information variable, i.e., the number of months the patient survived after diagnosis, was used to define the target. All cases who survived for 60 months or more after diagnosis were coded as “alive” (survived = 1) and those who died earlier were coded as “did not survive” (survived = 0). This designation imposed restrictions on the usability of a few variables as inputs: any variable related to the cause of death or vital status of patients could not serve as an input to the model. Similarly, some variables were excluded or automatically rejected by the models because of excessive missing values. A complete list of the included and excluded variables along with their descriptions is provided in Tables 3 and 4.

Exclusion, due to extensive missingness, of a few variables that seem to be related to the extent and morphology of the disease does not reduce the predictive power of our models, as SEER has aggregated these variables to form new and more informative variables. Furthermore, we managed to keep useful variables whose missingness was not extremely high. To this end, we replaced the unknown values with regard to the overall distribution of the data.
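The one-year restriction and the 60-month target coding described above can be sketched as follows. The function names, the (year, month) representation of diagnosis dates, and the inclusive 12-month boundary are assumptions made for this illustration. The text does not say how cases still alive with less than 60 months of follow-up were treated, so the sketch does not handle them separately.

```python
def months_between(dx_a, dx_b):
    """Absolute gap in months between two (year, month) diagnosis dates."""
    return abs((dx_a[0] - dx_b[0]) * 12 + (dx_a[1] - dx_b[1]))


def is_comorbid(dx_first, dx_second, window_months=12):
    """True if the two diagnoses fall within the window (assumed inclusive)."""
    return months_between(dx_first, dx_second) <= window_months


def survived_flag(survival_months, threshold=60):
    """Target coding: 1 if survival time is at least 60 months, else 0.

    Note: censored cases with short follow-up are not distinguished here.
    """
    return 1 if survival_months >= threshold else 0


# Hypothetical examples
within_window = is_comorbid((2001, 3), (2002, 3))    # 12 months apart
outside_window = is_comorbid((2001, 3), (2002, 4))   # 13 months apart
```

Deriving the target from survival time alone is also what makes cause-of-death and vital-status variables unusable as inputs, since they would leak the outcome.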
In other words, for any record in which an informative variable had a missing value, we used the corresponding value from another record whose other variables had nearly the same values as those of the first record.

We believe that a predictive model is valuable if it does not use any information related to the target variable. That is, a useful prediction cannot and should not include any input or make any assumptions that are closely related to the outcomes. As such, although we excluded variables that were highly correlated with the defined target, we did not leave out any of the records whose cause of death was other than the specific cancers under study. This, we believe, is more realistic and can potentially make our models more useful.

4.4. Prediction models

We used four different types of prediction models in SAS Enterprise Miner 12.3. As our preliminary comparisons confirmed, the high performance data mining (HPDM) nodes of the software significantly outperformed their traditional counterparts. HPDM offers several advantages, such as reducing the dimensionality of structured inputs and performing unsupervised variable selection. Therefore, the following models were utilized: HP neural network, HP logistic regression, HP decision tree, and HP random forest. Along with these nodes, we also used the HP data partitioning and HP imputation nodes. A brief description of the

Table 2
Classification of used variables.

Category number    Category variables
I                  Basic record identification
II                 Information source
III                Demographic information
IV                 Description of neoplasm
V                  First course of therapy
VI                 Follow up information
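The replacement of unknown values described in Section 4.3 (borrowing a value from another record that agrees on most other variables) resembles what is often called hot-deck imputation. The Python sketch below is one possible reading of that description, not the procedure actually run in SAS: the similarity measure (a simple count of matching fields), the tie-breaking rule (first best donor), and the field names in the toy records are all assumptions.

```python
def similarity(record, donor, skip):
    """Number of fields other than `skip` on which the two records agree."""
    return sum(
        1 for key, value in record.items()
        if key != skip and value is not None and donor.get(key) == value
    )


def hot_deck_impute(records, field):
    """Return copies of the records with missing `field` values filled in.

    A missing value (None) is replaced by the value from the most similar
    record that has the field present. Inputs are not modified.
    """
    donors = [r for r in records if r[field] is not None]
    filled = []
    for rec in records:
        rec = dict(rec)
        if rec[field] is None and donors:
            best = max(donors, key=lambda d: similarity(rec, d, field))
            rec[field] = best[field]
        filled.append(rec)
    return filled


# Toy records with hypothetical field names and codes
toy_records = [
    {"grade": 2, "stage": "localized", "age_group": 12, "laterality": 1},
    {"grade": 3, "stage": "distant", "age_group": 15, "laterality": 2},
    {"grade": None, "stage": "distant", "age_group": 15, "laterality": 1},
]
imputed = hot_deck_impute(toy_records, "grade")
```

In the toy data the third record matches the second on two fields and the first on one, so it borrows the grade of the second record.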


Table 3
List of variables included in the models.

Variable                                    Default name (a)  Description
Record number                               REC_NO            Number of records submitted to SEER for a patient
Marital status at diagnosis                 MAR-STAT
Race                                        RACE
Origin                                      ORIGIN            Identifies patients with Spanish/Hispanic surname or of Spanish origin
Sequence number–central                     SEQ_NUM           Number and sequence of all reportable malignant, in situ, benign, and borderline primary tumors which occur over a patient's lifetime
Primary site                                SITE2OV           The site in which the primary tumor originated
Laterality                                  LATERAL           The side of a paired organ or side of the body on which the reportable tumor originated
Behavior code                               BEHO2V            Tumor behavior
Histologic type                             HISTO3V           The microscopic composition of cells and/or tissue for a specific primary. The tumor type or histology is a basis for staging and determination of treatment options. It affects prognosis and course of treatment.
Tumor grade                                 GRADE             Grading and differentiation of tumors
Tumor size                                  EOD10_PN          Regional nodes positive: exact number of regional lymph nodes examined by the pathologist that were found to contain metastasis
Regional nodes examined                     EOD10_NE          The total number of regional lymph nodes that were removed and examined by the pathologist
Tumor marker 1                              TUMOR_1V          Records prognostic indicators for breast cases (1990–2003)
Tumor marker 2                              TUMOR_2V          Prostate cases and testis cases (1990–2003)
Surgery of primary site                     SURGPRIM          A surgical therapy that removes and/or destroys tissues of the primary site
Reason for no surgery of primary site       NO_SURG           The reason that surgery was not performed on the primary site
Radiation                                   RADIATION         Indicates the method of radiation therapy performed as part of the first course of treatment
Radiation sequence with surgery             RAD_SURG          The order in which surgery and radiation therapies were administered for those patients who had both
Age recode                                  AGE_REC           Uses age at diagnosis to classify cases into age groups
Site recode ICD-O-3/WHO 2008                SITERWHO          Code based on primary site and histology in order to make analysis of site/histology groups easier
Recode ICD-O-2 to 9                         ICDOTO9V          Recode for the primary site and morphology
Histology recode                            HISTREC
Historic stage A                            HST_STGA          Derived from collaborative stage (CS) for 2004 and extent of disease (EOD) from 1973 to 2003. It is a simplified version of stage: in situ, localized, regional, distant, and unknown.
Number of primaries                         NUMPRIMS          Based on the total number of tumors in SEER
Estrogen receptor status for breast cancer  ERSTATUS          ER status recode for breast cancer
Progesterone receptor for breast cancer     PRSTATUS          PR status recode for breast cancer
Prostate path extension                     EOD10_PE          Reflects information from radical prostatectomy; used only for prostate cancer
Breast adjusted AJCC 6th T                  ADJTM_6VALUE      Only for breast cancer
Breast adjusted AJCC 6th N                  ADJNM_6VALUE      Only for breast cancer
Breast adjusted AJCC 6th M                  ADJM_6VALUE       Only for breast cancer
Adjusted AJCC stage                         ADJAJCCSTG        Only for breast cancer
Country of birth                            PLC_BRTH_CNTRY    Two values: USA/CAN and other
Cancer order                                                  Created variable to account for the order in which the comorbid cancers were diagnosed
Survived                                                      Created target variable based on values of SRV_TIME_MONTH

(a) Default names as appearing in SEER's SAS conversion code.

utilized models follows. It deserves to be mentioned that because prediction in this study is actually a classification of cases as “survived” or “not survived,” we will use prediction and classification interchangeably hereafter.

4.4.1. Artificial neural network

Artificial neural networks (ANNs) are defined as “massively parallel processors, which tend to preserve experimental knowledge and enable their further use” [25]. Modeled after the processes of learning in the

Table 4
List of variables excluded from the models.

Variable | Default name | Description | Reason of exclusion
Type of follow up | TYPEFUP | Codes the type of follow-up expected for a SEER case | Value “1” indicates autopsy or death certificate.
Cause of death to SEER site recode | ICD_5DIG | Includes both cancer and non-cancer cause of death | Related to the target variable
Cause of death to site | CODKM | A recode based on underlying cause of death | Related to the target variable
Type of reporting source | REPT_SRC | The source documents used to abstract the case. It is the source that provided the best information. | Values “6” and “7” are related to autopsy and death.
Surgical procedure of other site | SURGOTH | The surgical removal of distant lymph node(s) or other tissue(s) or organ(s) beyond the primary site | Values “0” and “9” are related to autopsy and death.
Age at diagnosis | AGE_DX | Age of the patient at diagnosis | AGE_REC was used instead.
Tumor size | EOD10_SZ | Records largest dimension of the primary in millimeters | Extensive missing values
Extension | EOD10_EX | Codes the farthest documented extension of tumor away from the primary site | Extensive missing values
Nodes | EOD10_ND | Records the highest specific lymph node chain that is involved with the tumor | Extensive missing values
Tumor size | CS_SIZE | Information on tumor size | Extensive missing values
Tumor extension | CS_EXT | Information on the extension of the tumor | Extensive missing values
Involvement of lymph nodes | CS_NODE | Information on the involvement of lymph nodes | Extensive missing values
Distant metastasis | CS_METS | Information on distant metastasis | Extensive missing values
Derived SS1977 | D_SSG77 | Derived summary stage 1977 | Extensive missing values
Derived SS2000 | D_SSG00 | Derived summary stage 2000 | Extensive missing values


cognitive system and the neurological functions of the brain, ANNs are capable of modeling extremely complex non-linear functions and predicting new observations (on specific variables) from other observations (on the same or other variables) after executing a so-called process of learning from existing data. High-performance neural networks take advantage of the parallel computing environment to enhance the predictive power of the algorithm (see footnote 6). Application of high-performance ANNs enables users to build better models with significantly more lift, which is made possible by allowing more runs to incrementally enhance the predictive power. Other features of these models include automatic standardization of input and target variables, smart defaults for most neural network parameters, automatic selection and use of a validation data subset, automatic termination of training when the validation error stops improving, and weighting of individual observations.

4.4.2. Logistic regression

Logistic regression is used to classify cases into the most likely category [41]. It is a standard for predicting binary, binomial and multinomial outcomes. Since the response variable is discrete, linear regression cannot be directly used for modeling. Instead of predicting a point estimate of the event, logistic regression predicts the odds of its occurrence. In a two-class problem, a predicted probability greater than 50% (that is, odds greater than 1) assigns the case to the desired event (designated as “1”); otherwise the case is assigned to the non-event (designated as “0”). While a powerful modeling tool, logistic regression assumes that the log odds of the response variable are linearly related to the predictor variables. Because the coefficients are therefore expressed on the log-odds scale, their interpretation can be difficult. High-performance logistic regression completes model selection in seconds or minutes. This allows the user to include more variables, explore their effects, and finally, build better models. Some features of HP logistic regression include variable selection, weighted and group analysis, and modeling capabilities for unordered multinomial data.
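The classification rule described for the two-class case can be illustrated with a minimal sketch. The intercept, coefficients, and the single case below are hypothetical values chosen only for illustration; they are not estimates from the SEER data, and the function names are not part of any SAS procedure.

```python
import math


def log_odds(intercept, coefficients, features):
    """Linear predictor: the log odds are modeled as a linear function of the predictors."""
    return intercept + sum(b * x for b, x in zip(coefficients, features))


def probability(z):
    """Convert log odds into a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(p, threshold=0.5):
    """Assign the event (1) when the probability exceeds the threshold, else the non-event (0)."""
    return 1 if p > threshold else 0


# Hypothetical model and case (assumed values, for illustration only).
intercept = -1.0
coefficients = [0.8, -0.5]
case = [2.0, 1.0]

z = log_odds(intercept, coefficients, case)  # about 0.1 on the log-odds scale
p = probability(z)                           # probability of the event
odds = p / (1.0 - p)                         # equals exp(z)
label = classify(p)

print(f"log odds={z:.3f}, probability={p:.3f}, odds={odds:.3f}, class={label}")
```

A probability above 0.5 corresponds to odds above 1 and to positive log odds, so the three scales lead to the same class assignment.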

4.4.3. Decision tree

A decision tree is a classification algorithm in which each non-leaf node indicates a test on an attribute of the input cases; each branch corresponds to an outcome of the test; and each leaf node indicates a class prediction. Classification accuracy and size of a decision tree are used to determine its quality [36]. Decision trees recursively separate observations into branches to construct a tree for the purpose of improving prediction accuracy. In doing so, they use mathematical algorithms to identify a variable and a corresponding threshold for that variable that splits the input observations into two or more subgroups. This process is repeated at each leaf node until the complete tree is constructed. The splitting algorithm seeks to find a variable-threshold pair that maximizes the homogeneity (order) of the resulting subgroups of samples. The most commonly used mathematical algorithms for splitting include entropy-based information gain (used in ID3, C4.5, C5), Gini index (used in CART), and Chi-square test (used in CHAID). The splitting criterion used in this study was FastCHAID, which is based on Chi-square test. High-performance decision tree supports interval and nominal inputs, as well as nominal targets. It supports entropy, Gini, and FastCHAID methods for tree growth and C4.5-style pruning.

4.4.4. Random forest

A random forest grows multiple decision trees. To classify a new observation from an input vector, the observation is sent as input to each of the trees in the forest. Each tree specifies a classification, or “votes” for that class. At the end, the forest chooses the classification that has obtained the most votes among all trees [5]. A random forest is grown in three steps:

1. A random sample (with replacement) is drawn from the original data. The sample size is equal to the number of cases in the training set.
2. Assuming there are M input variables, a number m, which is much smaller than M and whose value is held constant during the growth of the forest, is specified. Then at each node, m variables are randomly selected out of the M input variables. The best split on these m selected variables is used to split the node.
3. Nodes are split using the selected variables to grow the trees to the largest extent possible, without any pruning.

Random forests are computationally efficient and robust to noise [4]. Some of the more important features of random forests in the current application include high accuracy, efficiency on large databases, ability to handle a large number of input variables, specification of variable importance, and effective handling and estimation of missing data.

Footnote 6: A more detailed description of these techniques is available at http://www.sas.com/resources/factsheet/high-performance-analytics-factsheet.pdf.

4.5. Measures for performance evaluation

In binary classification models, three performance measures are commonly used: accuracy, sensitivity, and specificity. Accuracy determines a model's overall classification performance as the ratio of correct classifications to all classifications, either correct or incorrect; sensitivity measures the proportion of actual positives which are correctly identified as such; and specificity measures the proportion of negatives which are correctly identified as such. These measures can be mathematically represented by the following formulas, where TP, TN, FP, and FN stand for True Positive, True Negative, False Positive, and False Negative, respectively.

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

\[ \text{Sensitivity} = \frac{TP}{TP + FN} \]

\[ \text{Specificity} = \frac{TN}{TN + FP} \]
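As a worked check of the three measures, the sketch below recomputes them from one confusion matrix in Table 5 (random forest, female genital data set). The assignment of the four cell counts to TP, FP, FN, and TN is an assumption made here: it is the assignment under which the formulas reproduce the reported values, and the function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, misclassification rate, sensitivity, and specificity from four counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "misclassification_rate": 1.0 - accuracy,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Random forest, female genital data set (Table 5).
# Assumed cell assignment: TP=265, FP=109, FN=154, TN=571.
metrics = classification_metrics(tp=265, fp=109, fn=154, tn=571)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under this assignment the rounded results agree with the values reported for that row of Table 5 (accuracy 0.7607, sensitivity 0.6325, specificity 0.8397).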

Some texts tend to report the complement of accuracy, which is called misclassification rate. SAS Enterprise Miner, the software used to conduct the analyses in this study, uses misclassification rate in the validation sample to rank the models based on their performance. Therefore, either accuracy or misclassification rate can be used to report the overall performance of models.

5. Results

A summary of the models' performances on the six data sets (two sets for the comorbid cancers and four sets, one for each individual cancer) used in this study is presented in Table 5. As the results show, random forest outperforms the other classification models in all six data sets. A simple comparison of the random forest results reveals that the best classification accuracy for each of the comorbid cancer samples is greater than the best classification accuracy of either of its constituent cancers. While individual breast and female genital cancers achieved best accuracy rates of 0.7525 and 0.7607, respectively, their combination into the comorbid data set achieved a best accuracy of 0.7780. Similarly, whereas individual male genital and urinary cancers obtained best accuracy rates of 0.7152 and 0.7245, respectively, their comorbid set achieved a best accuracy of 0.7348. Random forest also outperforms the other models in reducing Type I errors (predicting death for those who will actually survive) in the comorbid data sets. Therefore, it can help provide more effective and efficient treatments for a greater number of patients. Type II errors (predicting survival for those patients who will not survive) in the comorbid data sets, however, are almost the same as in the individual sets. Therefore, for the random forest model, which seems to fit this data better than the other models, the results confirm that more information


Table 5
Classification results.

Random forest
Cancer | Confusion matrix, row 1 | Confusion matrix, row 2 | Misclassification rate | Accuracy | Sensitivity | Specificity
Breast | 275 128 | 144 552 | 0.2475 | 0.7525 | 0.6563 | 0.7562
Female genital | 265 109 | 154 571 | 0.2393 | 0.7607 | 0.6325 | 0.8397
Male genital | 804 398 | 819 2252 | 0.2848 | 0.7152 | 0.4954 | 0.8498
Urinary | 862 416 | 761 2234 | 0.2755 | 0.7245 | 0.5311 | 0.8430
Breast–female genital | 270 95 | 149 585 | 0.2220 | 0.7780 | 0.6443 | 0.8602
Male genital–urinary | 868 378 | 755 2272 | 0.2652 | 0.7348 | 0.5348 | 0.8573

Neural networks
Cancer | Confusion matrix, row 1 | Confusion matrix, row 2 | Misclassification rate | Accuracy | Sensitivity | Specificity
Breast | 261 162 | 158 518 | 0.2912 | 0.7088 | 0.6229 | 0.7618
Female genital | 261 136 | 158 544 | 0.2675 | 0.7325 | 0.6229 | 0.8000
Male genital | 849 479 | 774 2171 | 0.2932 | 0.7068 | 0.5231 | 0.8192
Urinary | 945 538 | 678 2112 | 0.2846 | 0.7154 | 0.5823 | 0.7970
Breast–female genital | 283 135 | 136 545 | 0.2465 | 0.7535 | 0.6754 | 0.8014
Male genital–urinary | 916 558 | 707 2092 | 0.2960 | 0.7039 | 0.5643 | 0.7894

Decision tree
Cancer | Confusion matrix, row 1 | Confusion matrix, row 2 | Misclassification rate | Accuracy | Sensitivity | Specificity
Breast | 240 119 | 179 561 | 0.2712 | 0.7288 | 0.5728 | 0.8250
Female genital | 277 121 | 142 559 | 0.2393 | 0.7607 | 0.6611 | 0.8221
Male genital | 658 297 | 965 2353 | 0.2953 | 0.7047 | 0.4054 | 0.8879
Urinary | 766 377 | 857 2273 | 0.2888 | 0.7112 | 0.4720 | 0.8577
Breast–female genital | 277 121 | 142 559 | 0.2363 | 0.7607 | 0.6610 | 0.8220
Male genital–urinary | 882 434 | 741 2216 | 0.2768 | 0.7250 | 0.5434 | 0.8362

Logistic regression
Cancer | Confusion matrix, row 1 | Confusion matrix, row 2 | Misclassification rate | Accuracy | Sensitivity | Specificity
Breast | 274 170 | 145 510 | 0.2866 | 0.7134 | 0.6539 | 0.7500
Female genital | 277 131 | 142 549 | 0.2484 | 0.7516 | 0.6611 | 0.8074
Male genital | 1083 1797 | 540 853 | 0.5496 | 0.4531 | 0.6673 | 0.3219
Urinary | 936 532 | 687 2118 | 0.2853 | 0.7147 | 0.5767 | 0.7992
Breast–female genital | 282 138 | 137 542 | 0.2502 | 0.7498 | 0.6730 | 0.7970
Male genital–urinary | 1024 1681 | 599 969 | 0.5336 | 0.4664 | 0.6309 | 0.3656

Confusion matrix legend: TP FN FP TN

on comorbid diseases can potentially increase the accuracy of survival prediction. Besides performance comparisons across different model types, it is worth noting the discrepancies between the machine learning and traditional classification techniques. Results in Table 5 show that logistic regression performs slightly worse than the machine learning techniques in breast and female genital cancers, their comorbid combination, and urinary cancer. This pattern does not hold for male genital cancer, where the gap is much larger. Each time logistic regression is applied to a data set in which male genital variables have also been included, the prediction results deviate considerably from those of the machine learning techniques. This suggests that survival patterns among male genital cancer patients are too complex to be captured by logistic regression. However, logistic regression has lower Type II errors for the comorbid sets than the other models.
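The Type I and Type II error comparison for the random forest model can be retraced from the confusion matrices in Table 5. The sketch below assumes that the upper-right cell of each matrix counts Type I errors and the lower-left cell counts Type II errors; this reading of the flattened table is an assumption, as are the variable and function names.

```python
# (Type I, Type II) counts for the random forest rows of Table 5.
# Assumption: upper-right matrix cell = Type I errors, lower-left cell = Type II errors.
rf_errors = {
    "Breast": (128, 144),
    "Female genital": (109, 154),
    "Breast-female genital": (95, 149),
    "Male genital": (398, 819),
    "Urinary": (416, 761),
    "Male genital-urinary": (378, 755),
}

comorbid_sets = {
    "Breast-female genital": ("Breast", "Female genital"),
    "Male genital-urinary": ("Male genital", "Urinary"),
}


def compare(comorbid, parts):
    """Place the comorbid set's error counts next to the range seen in its constituent sets."""
    type_i, type_ii = rf_errors[comorbid]
    part_type_i = [rf_errors[p][0] for p in parts]
    part_type_ii = [rf_errors[p][1] for p in parts]
    return {
        "type_i": type_i,
        "type_i_individual": (min(part_type_i), max(part_type_i)),
        "type_ii": type_ii,
        "type_ii_individual": (min(part_type_ii), max(part_type_ii)),
    }


summary = {name: compare(name, parts) for name, parts in comorbid_sets.items()}
for name, row in summary.items():
    print(name, row)
```

Under this reading, both comorbid sets show fewer Type I errors than either constituent set, while their Type II counts stay close to those of the individual sets.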


Table 6
Variable importance of the models.

Breast | | Female genital | | Breast–female genital |
Variable | Number of splitting rules | Variable | Number of splitting rules | Variable | Number of splitting rules
Age group | 1109 | Age group | 1130 | FG(a) grade | 1008
Historic stage | 1025 | Historic stage | 1098 | FG historic stage | 969
Radiation with surgery | 869 | Grade | 1021 | Age group | 882
Laterality | 851 | Marital status | 809 | BR historic stage | 866
Surgery of primary site | 810 | Surgery of primary site | 740 | BR laterality | 731
Tumor size | 772 | Histologic type | 722 | BR grade | 704
Number of primaries | 757 | Sequence number | 701 | Cancer order | 701
Record number | 743 | Number of primaries | 623 | FG sequence number | 655
Tumor marker 2 | 742 | Reason for no surgery | 577 | FG tumor size | 645
Sequence number | 733 | Tumor size | 569 | BR radiation with surgery | 630

Male genital | | Urinary | | Male genital–urinary |
Variable | Number of splitting rules | Variable | Number of splitting rules | Variable | Number of splitting rules
Age group | 2174 | Grade | 3132 | Age group | 2548
Country of birth | 1480 | Age group | 2907 | UR(b) grade | 2004
Grade | 1440 | Surgery of primary site | 2794 | UR historic stage | 1794
Reason for no surgery | 1302 | Histologic type | 2262 | Country of birth | 1717
Surgery of primary site | 1296 | Sequence number | 2165 | MG grade | 1579
Number of primaries | 1222 | Historic stage | 2145 | UR surgery of primary site | 1477
Historic stage | 1221 | Country of birth | 2082 | UR histologic type | 1274
Marital status | 1218 | Marital status | 1702 | Cancer order | 1267
Record number | 1122 | Recode ICD-O-2 to 9 | 1466 | UR recode ICD-O-2 to 9 | 1240
Sequence number | 1063 | Tumor size | 1452 | MG number of primaries | 1162

a FG and BR stand for female genital and breast, respectively.
b UR and MG stand for urinary and male genital, respectively.
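The "number of splitting rules" reported in Table 6 counts how often each predictor is used to split a node across the trees of a forest. A minimal sketch of that counting is given below on a toy forest; the nested-dictionary tree representation, the helper names, and the three small trees are all hypothetical and serve only to illustrate the idea.

```python
from collections import Counter


def leaf(label):
    """A terminal node carrying a class prediction."""
    return {"label": label}


def split(variable, left, right):
    """An internal node that applies a splitting rule on one predictor."""
    return {"variable": variable, "left": left, "right": right}


def count_splits(node, counts):
    """Walk one tree and count how often each predictor appears in a splitting rule."""
    if "variable" not in node:  # leaf node: no splitting rule here
        return
    counts[node["variable"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)


def variable_importance(forest):
    """Rank predictors by their total number of splitting rules over all trees."""
    counts = Counter()
    for tree in forest:
        count_splits(tree, counts)
    return counts.most_common()


# Toy forest of three small trees (hypothetical structure, not fitted to any data).
forest = [
    split("age_group", leaf(1), split("historic_stage", leaf(0), leaf(1))),
    split("historic_stage",
          split("age_group", leaf(0), leaf(1)),
          split("grade", leaf(0), leaf(1))),
    split("age_group", leaf(0), leaf(1)),
]

importance = variable_importance(forest)
print(importance)
```

Counting splits favors predictors that are selected often, regardless of how much each individual split improves the separation of the child nodes.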

Apart from the discrepancies in prediction accuracy, the models developed for individual and comorbid cancers in this study also behaved differently in how they generated their splitting rules, and thereby, in their variable importance results. The algorithm used to create the trees tries to find a predictor x and a split value c at each node that maximizes the survival differences between the two child nodes. The splitting rule is then formed by assigning each case to one child node if x > c, and to the other child node otherwise [52]. Variable importance is, therefore, determined by the number of times each predictor is used for the purpose of splitting nodes. The top 10 most important variables for each of the cancer sets are presented in Table 6. Based on these results, age group is by far the most important variable, as it ranks first in four models and among the top three predictors in the other two. Among individual cancers, age is the most important variable in breast, male genital, and female genital cancers. This variable ranks second in urinary cancer. These results confirm previous findings on the relationship between aging and poor survival. Age, historic stage, number of primaries, and surgery of primary site appear among the top 10 most important variables in all four individual cancers. Grade, marital status, and tumor size appear in three cancers. These results support prior findings on the impact of independent prognostic factors in different cancers (e.g., [2,21,26,30,38,46]). Although individual cancers share some important variables, the order in which these predictors appear is different. For example, radiation with surgery, laterality, and tumor marker are among the top 10 most important variables in breast cancer, whereas they do not appear to be as important in the other cancers. Similarly, reason for no surgery proves to be a

Fig. 5. Positioning individual cancers based on variable importance (without comorbidity).


Fig. 6. Positioning individual cancers based on variable importance (with comorbidity).

significant prognostic factor in male and female genital cancers, while it has less importance in the other two cancers. To obtain a better sense of how these individual cancers differ in terms of survival patterns, we used multidimensional scaling to understand the role of independent variables. Number of splitting rules was used as an indicator of the relative importance of input variables, and all variables, not only the top 10, were included in the modeling. Multidimensional scaling results for individual cancers without considering comorbidity are presented in Fig. 5. The results with comorbidity taken into account are shown in Fig. 6. As can be seen in these two figures, genital cancers, when considered individually, show similar patterns of behavior; i.e., the effects of tumor size and age are similar within them but different from other cancers. However, when they are considered with other cancers in comorbid sets, they show different patterns. In the comorbid genitourinary data set, regional nodes examined, tumor marker 1, and number of primaries are the variables that characterize the behavior of male genital cancer differently. In the female comorbid data set, laterality and primary site (recode) characterize female genital cancer's behavior. In other words, the aforementioned variables demonstrate a pattern of behavior in each of these cancers that is very different from other cancers. In a similar vein, when considered individually, urinary cancer's pattern of behavior is characterized by histologic type, regional nodes examined, and tumor marker 2. In the presence of comorbidity, however, this pattern is mostly driven by histologic type, histology recode, historic stage, and site recode. Breast cancer, in both cases, is mostly characterized by the variables exclusively used for it.

6. Discussion

Comorbid diseases have been shown to have significant impacts not only on diagnosis and prognosis in different cancers, but also on efficacy of treatments.
Comorbidities, especially if they are severe, can delay diagnosis of cancers, which in turn leads to more advanced stages of the disease and lower chances of survival after detection. Furthermore, comorbid medical conditions can affect treatment costs and alter the choice of treatment or complicate its course. In the presence of severe concomitant medical complexities, which can be either one disease in a fairly advanced stage or a number of illnesses with mild to medium severity, patients would receive less aggressive treatment options. This would increase the risk of death from the cancer itself [66] as well as other illnesses. Comorbidities might also affect or complicate

cancer progression or create complex interactions with different cancer therapies [40]. Such behavior would, at best, impose greater costs on patients and economies. With the increasing worldwide attention to healthcare, chronic diseases that comprise a majority of treatment expenditures seem to acquire even more scrutiny. Research findings on the significant interplay of concurrent chronic diseases accentuate the need to answer the call for discontinuation of traditional investigation of diseases in isolation from one another. Although one might argue that comorbid conditions are well taken into account in practice, they are still recorded separately for patients suffering from multiple coexisting complexities. This would limit our ability to understand the interactions among such comorbid conditions. Without consolidating these data sets, most of our analyses would be based on small samples which might not be generalizable to the whole population of cancer patients. Furthermore, we would not be able to use the benefits of recent developments in machine learning techniques that work best with larger data sets.

Advances in machine learning techniques and their application in different areas, including medicine, allow more effective analyses of historical data to discover interesting patterns. While traditional methods focused on data analysis ex post, with low accuracy in predictions in most cases, machine learning techniques have extended our ability to predict events ex ante. However, for these techniques to be useful, accurate and complete data are essential. While the accuracy of medical data, such as cancer data in SEER, is oftentimes assured through rigorous mechanisms, their completeness is suboptimal. Specifically, most of these databases do not store data on patients' comorbid conditions at the time of diagnosis or throughout the course of treatment. Consolidation of such records can potentially facilitate prospective explorations, which in turn will inform the whole field of practice.

In this research we showed how more information about patients' health status could increase the accuracy of predictive models. Although the data for the comorbid cancers that we studied were not readily available, they were recorded together as part of a larger database. This allowed us to combine the data to form new sets of comorbid cancers, which, as we showed, obtained greater prediction accuracy than each of the individual data sets. Consequently, consideration of other comorbidities can potentially increase accuracy rates even more. Prior research indicates that more than 68% of cancer patients suffer from concomitant illnesses [40]. The comorbid samples we created in this study comprised less than 1% of the original individual samples. Therefore, we believe that coding and recording comorbidity at the


time of diagnosis or admission for treatment would significantly increase such models' prediction accuracy. Results and findings of such models can inform health providers to make better decisions, based on prior evidence, when treating new cancer or other chronic-disease patients.

Due to lack of large-scale and integrated comorbidity data for cancer patients, our study's major limitation was the significant reduction in the size of the input data sets. However, the data sets we used for our analyses were still much larger in size than almost all similar survey studies. Consequently, the validity of the results should not be a problem. Some of the other limitations of this research include the following. The data obtained from SEER, although rather large, is a sample of cancer disease cases obtained from a limited number of providers located in a variety of geographic locations. Since the benefits of data mining increase with the quantity and quality of data (both number of records and number of variables), a larger and more representative sample could potentially produce better predictive and descriptive results. Another limitation of the study is the types of machine learning methods used to develop the models. As is common knowledge, most machine learning methods, including the ones used in this study, have a number of modeling parameters that need to be “optimized.” While there are techniques to improve the predictive power by systematically adjusting these modeling parameters, there is no guarantee of any kind of reaching the optimal model. Consequently, they are often called heuristic methods. Even though the results obtained suggest that additional information about comorbid conditions would increase the models' predictive powers, they cannot be claimed as leading to optimal outcomes.

7. Summary and conclusion

Cancer and other chronic diseases are increasingly absorbing healthcare expenditures all over the world. Worse still, comorbidity of such diseases thwarts the efforts made for their diagnosis and treatment, spawning more economic losses. One way to alleviate this complication is to shift from a reductionist approach in studying the diseases in isolation to the consideration of their interactions. By combining records of patients suffering simultaneously from two cancers, we created a comorbid set of cancer patients. We analyzed the resultant data regardless of the patients' final outcomes, i.e., whether they survived, died as a result of the cancer(s) they suffered from, or died due to other diseases or reasons. As opposed to some studies which build prediction models on disease-specific outcomes (i.e., only those patients who die from the adverse effects of the disease under study), we believe our treatment of the data sets is more realistic, as a true prediction model should not and could not use any variables that somehow relate to the desired target. Consequently, while our models might have lower accuracy rates compared to some existing models, they are of greater predictive value. Our models can not only be used to predict the patients' outcomes with fairly acceptable accuracy rates, but can also help practitioners make better decisions on the course of treatment. More importantly, our results show that to obtain even better accuracy rates, disease registries should record patients' concomitant diseases so that prospective analyses to build better models and find insightful patterns become possible or, at least, less costly.

This study shows the importance of comorbidity-related data in analysis and treatment of chronic diseases. In addition to its theoretical contribution to the extant body of knowledge, the findings of this research effort can also be extended to potential managerial and practical implications. Evidence-based (data-based) medical decision making is among the most important tools we currently have to improve the wellbeing of people while controlling healthcare costs. The use of decision support systems that have a holistic view (by considering multiple conditions or diseases and their interactions) towards diagnosis and treatment would pave the road towards more effective and efficient healthcare systems.

Our results, once again, underscore the importance of comorbidities in studies of cancers and other chronic diseases. Researchers working in these fields can benefit from our conclusions in gaining a better understanding of the impact and role of concurrent complications. Specifically, our results show that identifying significant variables in each cancer and building clinical decision support systems should not be conducted without paying closer attention to comorbid conditions.

References

[1] M.F. Akay, Support vector machines combined with feature selection for breast cancer diagnosis, Expert Systems with Applications 36 (2) (2009) 3240–3247.
[2] P.C. Albertsen, D.F. Moore, W. Shih, Y. Lin, H. Li, G.L. Lu-Yao, Impact of comorbidity on survival among men with localized prostate cancer, Journal of Clinical Oncology 29 (10) (2011) 1335–1341.
[3] T.R. Asmis, K. Ding, L. Seymour, F.A. Shepherd, N.B. Leighl, T.L. Winton, G.D. Goss, Age and comorbidity as independent prognostic factors in the treatment of non-small-cell lung cancer: a review of National Cancer Institute of Canada Clinical Trials Group trials, Journal of Clinical Oncology 26 (1) (2008) 54–59.
[4] S. Bhattacharyya, S. Jha, K. Tharakunnel, J.C. Westland, Data mining for credit card fraud: a comparative study, Decision Support Systems 50 (3) (2011) 602–613.
[5] L. Breiman, Random forests, Machine Learning 45 (1) (2001) 5–32.
[6] T.J. Bright, A. Wong, R. Dhurjati, E. Bristow, L. Bastian, R.R. Coeytaux, G. Samsa, V. Hasselblad, J.W. Williams, M.D. Musty, L. Wing, A.S. Kendrick, G.D. Sanders, D. Lobach, Effect of clinical decision-support systems: a systematic review, Annals of Internal Medicine 157 (1) (2012) 29–43.
[7] C.L. Chang, M.Y. Hsu, The study that applies artificial intelligence and logistic regression for assistance in differential diagnostic of pancreatic cancer, Expert Systems with Applications 36 (7) (2009) 10663–10672.
[8] T.C. Chen, T.C. Hsu, A GAs based approach for mining breast cancer pattern, Expert Systems with Applications 30 (4) (2006) 674–681.
[9] J.S. Chiu, Y.F. Wang, Y.C. Su, L.H. Wei, J.G. Liao, Y.C. Li, Artificial neural network to predict skeletal metastasis in patients with prostate cancer, Journal of Medical Systems 33 (2) (2009) 91–100.
[10] S.M. Chou, T.S. Lee, Y.E. Shao, I.F. Chen, Mining the breast cancer pattern using artificial neural networks and multivariate adaptive regression splines, Expert Systems with Applications 27 (1) (2004) 133–142.
[11] E. Çomak, K. Polat, S. Güneş, A. Arslan, A new medical decision making system: least square support vector machine (LSSVM) with Fuzzy Weighting Pre-processing, Expert Systems with Applications 32 (2) (2007) 409–414.
[12] T. Daskivich, N. Sadetsky, S.H. Kaplan, S. Greenfield, M.S. Litwin, Severity of comorbidity and non-prostate cancer mortality in men with early-stage prostate cancer, Archives of Internal Medicine 170 (15) (2010) 1396–1397.
[13] F.R. Datema, M.B. Ferrier, M.P. van der Schroeff, B. de Jong, J. Robert, Impact of comorbidity on short-term mortality and overall survival of head and neck cancer patients, Head and Neck 32 (6) (2010) 728–736.
[14] D. Delen, Analysis of cancer data: a data mining approach, Expert Systems 26 (1) (2009) 100–112.
[15] D. Delen, G. Walker, A. Kadam, Predicting breast cancer survivability: a comparison of three data mining methods, Artificial Intelligence in Medicine 34 (2) (2005) 113–127.
[16] D. Delen, A. Oztekin, L. Tomak, An analytic approach to better understanding and management of coronary surgeries, Decision Support Systems 52 (3) (2012) 698–705.
[17] C.E. Desch, L. Penberthy, C.J. Newschaffer, B.E. Hillner, M. Whittemore, D. McClish, T.J. Smith, S.M. Retchin, Factors that determine the treatment for local and regional prostate cancer, Medical Care 34 (2) (1996) 152–162.
[18] C. Diederichs, K. Berger, D.B. Bartels, The measurement of multiple chronic diseases—a systematic review on existing multimorbidity indices, The Journals of Gerontology. Series A, Biological Sciences and Medical Sciences 66 (3) (2011) 301–311.
[19] B.K. Edwards, A.M. Noone, A.B. Mariotto, E.P. Simard, F.P. Boscoe, S.J. Henley, J. Ahmedian, H. Cho, R.N. Anderson, B.A. Kohler, C.R. Eheman, E.M. Ward, Annual report to the nation on the status of cancer, 1975–2010, featuring prevalence of comorbidity and impact on survival among persons with lung, colorectal, breast, or prostate cancer, Cancer 120 (9) (2014) 1290–1314.
[20] M. Extermann, Interaction between comorbidity and cancer, Cancer Control 14 (1) (2007) 13.
[21] A.S. Fairey, N.E.B. Jacobsen, M.P. Chetner, D.R. Mador, J.B. Metcalfe, R.B. Moore, K.F. Rourke, G.T. Todd, P.M. Venner, D.C. Voaklander, E.P. Estey, Associations between comorbidity, and overall survival and bladder cancer specific survival after radical cystectomy: results from the Alberta Urology Institute Radical Cystectomy database, The Journal of Urology 182 (1) (2009) 85–93.
[22] A.R. Feinstein, The pre-therapeutic classification of co-morbidity in chronic disease, Journal of Chronic Diseases 23 (7) (1970) 455–468.
[23] J.M. Fitzpatrick, Management of localized prostate cancer in senior adults: the crucial role of comorbidity, BJU International 101 (s2) (2008) 16–22.
[24] S.N. Ghazavi, T.W. Liao, Medical data mining by fuzzy modeling with selected features, Artificial Intelligence in Medicine 43 (3) (2008) 195–206.
[25] P. Hájek, Municipal credit rating modelling by neural networks, Decision Support Systems 51 (1) (2011) 108–118.
[26] A.D. Hanchate, K.M. Clough-Gorr, A.S. Ash, S.S. Thwin, R.A. Silliman, Longitudinal patterns in survival, comorbidity, healthcare utilization and quality of care among older women following breast cancer diagnosis, Journal of General Internal Medicine 25 (10) (2010) 1045–1050.
[27] S. Hill, D. Sarfati, T. Blakely, B. Robson, G. Purdie, J. Chen, E. Dennett, D. Cormack, R. Cunningham, K. Dew, T. McCreanor, I. Kawachi, Survival disparities in Indigenous and non-Indigenous New Zealanders with colon cancer: the role of patient comorbidity, treatment and health service factors, Journal of Epidemiology and Community Health 64 (2) (2010) 117–123.
[28] R.B. Hines, C. Chatla, H.L. Bumpers, J.W. Waterbor, G. McGwin, E. Funkhouser, C.S. Coffey, J. Posey, U. Manne, Predictive capacity of three comorbidity indices in estimating mortality after surgery for colon cancer, Journal of Clinical Oncology 27 (26) (2009) 4339–4345.
[29] C.S. Hollenbeak, B.C. Stack, S.M. Daley, J.F. Piccirillo, Using comorbidity indexes to predict costs for head and neck cancer, Archives of Otolaryngology - Head and Neck Surgery 133 (1) (2007) 24–27.
[30] A. Homma, T. Sakashita, N. Oridate, F. Suzuki, S. Suzuki, H. Hatakeyama, T. Mizumachi, S. Taki, S. Fukuda, Importance of comorbidity in hypopharyngeal cancer, Head and Neck 32 (2) (2010) 148–153.
[31] J.H. Hong, S.B. Cho, The classification of cancer based on DNA microarray data that uses diverse ensemble genetic programming, Artificial Intelligence in Medicine 36 (1) (2006) 43–58.
[32] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, John Wiley & Sons, 2011.
[33] M. Karabatak, M.C. Ince, An expert system for detection of breast cancer based on association rules and neural network, Expert Systems with Applications 36 (2) (2009) 3465–3469.
[34] T. Kiyan, T. Yildirim, Breast cancer diagnosis using statistical neural networks, IU-Journal of Electrical & Electronics Engineering 4 (2) (2011) 1149–1153.
[35] A. Kusiak, J.A. Kern, K.H. Kernstine, B.T. Tseng, Autonomous decision-making: a data mining approach, IEEE Transactions on Information Technology in Biomedicine 4 (4) (2000) 274–284.

[37] [38] [40] [41] [42] [44] [45] [46] [47] [48] [49] [50] [51] [52] [53] [54] [55] [56] [57]

[36] S.
Lee, Using data envelopment analysis and decision trees for efficiency analysis and recommendation of B2C controls, Decision Support Systems 49 (4) (2010) 486–497. L. Li, H. Tang, Z. Wu, J. Gong, M. Gruidl, J. Zou, M. Tockman, R.A. Clark, Data mining techniques for cancer detection using serum proteomic profiling, Artificial Intelligence in Medicine 32 (2) (2004) 71–83. C.P. McPherson, K.K. Swenson, M.W. Lee, The effects of mammographic detection and comorbidity on the survival of older women with breast cancer, Journal of American Geriatrics Society 50 (6) (2002) 1061–1068. K.S. Ogle, G.M. Swanson, N. Woods, F. Azzouz, Cancer and comorbidity, Cancer 88 (3) (2000) 653–663. D.L. Olson, B.K. Chae, Direct marketing decision support through predictive customer response modeling, Decision Support Systems 54 (1) (2012) 443–451. I.K. Omurlu, K. Ozdamar, M. Ture, Comparison of Bayesian survival analysis and Cox regression analysis in simulated and breast cancer data sets, Expert Systems with Applications 36 (8) (2009) 11341–11346. V. Paleri, R.G. Wight, C.E. Silver, M. Haigentz Jr., R.P. Takes, P.J. Bradley, A. Rinaldo, A. Sanabria, S. Bien, A. Ferlito, Comorbidity in head and neck cancer: a critical appraisal and recommendations for practice, Oral Oncology 46 (10) (2010) 712–719. J.F. Piccirillo, Importance of comorbidity in head and neck cancer, Laryngoscope 110 (4) (2000) 593–602. J.F. Piccirillo, R.M. Tierney, I. Costas, L. Grove, E.L. Spitznagel Jr., Prognostic importance of comorbidity in a hospital-based cancer registry, Journal of the American Medical Association 291 (20) (2004) 2441–2447. S. Piramuthu, On learning to predict web traffic, Decision Support Systems 35 (2) (2003) 213–229. K. Polat, S. Güneş, Hepatitis disease diagnosis using a new hybrid system based on feature selection (FS) and artificial immune recognition system with fuzzy resource allocation, Digital Signal Processing 16 (6) (2006) 889–901. K. Polat, S. 
Güneş, Breast cancer diagnosis using least square support vector machine, Digital Signal Processing 17 (4) (2007) 694–701. P.N. Post, B.E. Hansen, P.J.M. Kil, M.L.G. Janssen-Heijnen, J.W.W. Coebergh, The independent prognostic value of comorbidity among men aged b75 years with localized prostate cancer: a population-based study, BJU International 87 (9) (2001) 821–826. S.N. Rogers, A. Aziz, D. Lowe, D.J. Husband, Feasibility study of the retrospective use of the Adult Comorbidity Evaluation index (ACE-27) in patients with cancer of the head and neck who had radiotherapy, British Journal of Oral and Maxillofacial Surgery 44 (4) (2006) 283–288. J. Rosenvinge, Lifetime Analysis of Automotive Batteries using Random Forests and Cox Regression(Master's thesis) 2013. (Retrieved form Digitala Vetenskapliga Arkivet (DiVA) Portal (DivA2:699959)). S. Shah, A. Kusiak, Cancer gene search with data-mining and genetic algorithms, Computers in Biology and Medicine 37 (2) (2007) 251–261. E.H. Shortliffe, J.J. Cimino, Biomedical Informatics, Springer Science Business Media, 2006. S.L. Smith, D. Palma, T. Parhar, C.S. Alexander, E.S. Wai, Inoperable early stage nonsmall cell lung cancer: comorbidity, patterns of care and survival, Lung Cancer 72 (1) (2011) 39–44. T.S. Subashini, V. Ramalingam, S. Palanivel, Breast mass classification based on cytological patterns using RBFNN and SVM, Expert Systems with Applications 36 (3) (2009) 5284–5290. R. Tabarés-Seisdedos, J.L. Rubenstein, Inverse cancer comorbidity: a serendipitous opportunity to gain insight into CNS disorders, Nature Reviews. Neuroscience 14 (4) (2013) 293–304.

161

[58] R. Tabarés-Seisdedos, N. Dumont, A. Baudot, J.M. Valderas, J. Climent, A. Valencia, J.L. Rubenstein, No paradox, no progress: inverse cancer comorbidity in people with other complex diseases, The Lancet Oncology 12 (6) (2011) 604–608. [59] C.M. Tammemagi, C. Neslund-Dudas, M. Simoff, P. Kvale, Impact of comorbidity on lung cancer survival, International Journal of Cancer 103 (6) (2003) 792–802. [60] T.Z. Tan, C. Quek, G.S. Ng, K. Razvi, Ovarian cancer diagnosis with complementary learning fuzzy neural network, Artificial Intelligence in Medicine 43 (3) (2008) 207–222. [61] H. Teppo, O.P. Alho, Comorbidity and diagnostic delay in cancer of the larynx, tongue and pharynx, Oral Oncology 45 (8) (2009) 692–695. [62] M.S. Tetsche, C. Dethlefsen, L. Pedersen, H.T. Sorensen, M. Norgaard, The impact of comorbidity and stage on ovarian cancer mortality: a nationwide Danish cohort study, BMC Cancer 8 (1) (2008) 31. [63] A. Tsakonas, G. Dounias, J. Jantzen, H. Axer, B. Bjerregaard, D.G. von Keyserlingk, Evolving rule-based systems in two medical domains using genetic programming, Artificial Intelligence in Medicine 32 (3) (2004) 195–216. [64] M. Ture, F. Tokatli, I. Kurt, Using Kaplan–Meier analysis together with decision tree methods (C&RT, CHAID, QUEST, C4. 5 and ID3) in determining recurrence-free survival of breast cancer patients, Expert Systems with Applications 36 (2) (2009) 2017–2026. [65] E.D. Übeyli, Implementing automated diagnostic systems for breast cancer detection, Expert Systems with Applications 33 (4) (2007) 1054–1062. [66] R. Yancik, M.N. Wesley, L.A. Ries, R.J. Havlik, B.K. Edwards, J.W. Yates, Effect of age and comorbidity in postmenopausal breast cancer patients aged 55 years and older, Journal of the American Medical Association 285 (7) (2001) 885–892. [67] W.C. Yeh, W.W. Chang, Y.Y. 
Chung, A new hybrid approach for mining breast cancer pattern using discrete particle swarm optimization and statistical method, Expert Systems with Applications 36 (4) (2009) 8204–8211. [68] K. Zheng, Clinical decision support systems, Management, Types and Standards 36 (2012) 501–509. Hamed Majidi Zolbanin is a PhD student in Management Science and Information Systems at Oklahoma State University. He holds degrees in Computer Engineering and in Management. His research interests include healthcare, machine learning, data and text mining, and behavioral studies.

Dr. Dursun Delen is the holder of the William S. Spears and Neal Patterson Endowed Chairs in Business Analytics, Director of Research for the Center for Health Systems Innovation, and Professor of Management Science and Information Systems in the Spears School of Business at Oklahoma State University (OSU). He received his PhD in Industrial Engineering and Management from OSU in 1997. Prior to his appointment as an Assistant Professor at OSU in 2001, he worked for a privately owned research and consultancy company, Knowledge Based Systems Inc., in College Station, Texas, as a research scientist for 5 years, during which he led a number of decision support, information systems and advanced analytics related research projects funded by federal agencies, including DoD, NASA, NIST and DOE. His research has appeared in major journals including Decision Support Systems, Communications of the ACM, Computers and Operations Research, Computers in Industry, Journal of Production Operations Management, Artificial Intelligence in Medicine, and Expert Systems with Applications, among others. He recently published seven books in the broader area of Business Analytics. He is often invited to national and international conferences for keynote addresses on topics related to Data/Text Mining, Business Intelligence, Decision Support Systems, Business Analytics and Knowledge Management. He regularly chairs tracks and mini-tracks at various information systems conferences, and serves several academic journals as senior editor, associate editor and editorial board member. His research and teaching interests are in data and text mining, decision support systems, knowledge management, business intelligence and enterprise modeling.

Amir Hassan Zadeh is an assistant professor of Information Systems and Supply Chain Management at Wright State University. He has previously published papers in Decision Support Systems, Production Planning & Control, and Annals of Information Systems. He has also presented several papers at national and international meetings. His research interests include data mining, business analytics, decision support systems, and operations management.