
REVIEW

Application of machine learning algorithms for clinical predictive modeling: a data-mining approach in SCT

R Shouval1,2, O Bondi3, H Mishan3, A Shimoni1, R Unger3 and A Nagler1

1The Division of Hematology and Bone Marrow Transplantation and Internal Medicine "F" Department, The Chaim Sheba Medical Center, Tel HaShomer, Israel; 2Pinchas Borenstein Talpiot Medical Leadership Program 2013, The Chaim Sheba Medical Center, Tel HaShomer, Israel; and 3The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan, Israel. Correspondence: Dr R Shouval, The Division of Hematology and Bone Marrow Transplantation, The Chaim Sheba Medical Center, Tel HaShomer, Ramat-Gan 52621, Israel. E-mail: [email protected]

Data collected from hematopoietic SCT (HSCT) centers are becoming more abundant and complex owing to the formation of organized registries and the incorporation of biological data. Typically, conventional statistical methods are used for the development of outcome prediction models and risk scores. However, these analyses carry inherent properties that limit their ability to cope with large data sets with many variables and samples. Machine learning (ML), a field stemming from artificial intelligence, is part of a wider approach to data analysis termed data mining (DM). It enables prediction in complex data scenarios familiar to practitioners and researchers. Technological and commercial applications are all around us and are gradually entering clinical research. In the following review, we introduce hematologists and stem cell transplanters to the concepts, clinical applications, strengths and limitations of such methods and discuss current research in HSCT. The aim of this review is to encourage utilization of ML and DM techniques in the field of HSCT, including prediction of transplantation outcome and donor selection.

Bone Marrow Transplantation (2014) 49, 332–337; doi:10.1038/bmt.2013.146; published online 7 October 2013. Received 1 June 2013; revised 31 July 2013; accepted 3 August 2013.

Keywords: data mining; machine learning; artificial intelligence; hematopoietic SCT; predictive modeling

INTRODUCTION

Allogeneic hematopoietic SCT (HSCT) has long been utilized in various hematologic malignant and non-malignant diseases, leading to cure and significant survival prolongation.1 Outcomes of HSCT are improving.2 Nevertheless, the procedure is still accompanied by a high rate of morbidity and mortality, making patient selection a crucial issue.1,3 Besides clinical judgment on whether, when and how to transplant, clinical scores such as the European Group for Blood and Marrow Transplantation risk score, the Hematopoietic Cell Transplantation-Comorbidity Index and others may aid clinical decisions.4–6 However, these scores rely on conventional statistical methodologies that carry inherent limitations,7 possibly leading to suboptimal performance (for example, reasonable but relatively low c-statistics, or area under the receiver operating characteristic curve).4–6,8

Conventional statistical techniques are model (hypothesis) driven: they start with a model and check whether the data fit the suggested model. The underlying assumption is that the data are generated by a stochastic model (for example, linear or logistic regression), and validation is based on goodness-of-fit tests (that is, the χ2-test, R2).7,9 This approach has proven itself over the years. Still, it carries limitations7–11: (a) it usually assumes normally distributed data, independence of variables and linear associations, whereas real data are noisy and do not fulfill such prior assumptions; (b) the conclusions are about the data fitting the model rather than a model fitting the data, forcing rigid assumptions about data behavior; (c) the computational and theoretical ability to handle very large data sets is limited; and (d) in standard survival models, the variables must be pre-selected, as a high number of variables increases the number of possible solutions ('the curse of dimensionality'), which might lead to the loss of information relevant for outcome prediction.

Given the limitations presented and the formation of large registries incorporating biological and clinical data,9,12,13 a new approach to data analysis was needed. This is true more than ever for HSCT. Computer scientists have long struggled with such complex data scenarios, historically starting from problems such as image and voice recognition and moving on to handle data of huge volume, such as purchase records at Amazon. The development of machine learning (ML) algorithms accounting for a multiplicity of factors has led to the generation of robust and accurate prediction models.

MACHINE LEARNING

ML is a field in artificial intelligence stemming from computer science. It was initially defined by Arthur Samuel as a field of study that gives a computer the ability to learn without being explicitly programmed. A more contemporary definition is that a computer program is said to learn from experience if its performance at a certain task improves with experience.14 For example, if we would like to predict mortality in HSCT (the task), the more patients/examples we provide (that is, experience), the better we will be able to predict mortality. Applications of ML are all around us. A classic example is the detection of spam e-mails. ML algorithms go over millions of e-mails, learning which properties are characteristic (for example, multiple recipients, words such as 'discount' or 'buy', and so on). Accordingly, a prediction model (termed a classifier) is produced, which is capable of classifying a new, unseen e-mail as spam or not. Other ML applications include the detection of credit card fraud, prediction of customer purchase behavior or personal interests of web users, optimization of manufacturing processes15 and a growing number of applications moving towards clinical practice and research.16

The paradigm underlying ML does not start with a predefined model; rather, it lets the data create the model by detecting underlying patterns.7,9 Thus, this approach avoids preassumptions about model types and variable interactions. Different algorithms are used to produce a function (a model) that fits the data, and not the other way around. In such procedures, a large number of variables, and combinations thereof, can be used. The models are developed on a training set and validated on a test set, as discussed further below. Learning can be divided into two main types:17,18 supervised and unsupervised. In supervised learning (Figure 1a), a prediction model is produced by learning from a data set (retrospective data) in which the outcome (label) is known; accordingly, the outcome of new, unlabeled examples can be predicted. For demonstration, we will use a hypothetical data set of patients who underwent HSCT. Each patient is an example, the attributes (ML terminology for variables) are diagnosis, age and the number of comorbidities, and the outcome we wish to predict for new cases (termed the label) is TRM at 100 days. By learning from the retrospective labeled data set, a TRM prediction model can be developed that is helpful for assessing new patients being evaluated for HSCT. Of course, such a problem can be extended to multiple examples and attributes. Prediction of discrete properties (for example, survival: yes/no) and of continuous properties (for example, survival length) are termed classification and regression, respectively. Unsupervised learning (Figure 1b) is about detecting patterns in data without predefined labels. For example, in the hypothetical data set, even when the label is unknown, it is possible to detect two separate groups in the data according to the distribution of the attributes. This process is often called clustering and is commonly used in bioinformatics for detecting patterns of gene expression in microarray studies.19,20

Figure 1. Two types of learning. (a) Supervised learning: the table represents a hypothetical data set of patients who underwent HSCT. Each patient is an instance. The attributes are age and number of comorbidities, and the known label is HSCT mortality (−, +), denoted by rectangles and circles, respectively. Patients are represented on the plot. A label prediction model for new examples is developed according to the experience gained (learned) from the retrospective labeled data set. In the case presented, a new patient aged 72 years with seven comorbidities (marked by X) is unlikely to survive according to the developed model. (b) Unsupervised learning: patients are again presented in a table but no labels are given. Unsupervised learning finds structure in the data according to the instances learned. The model discovers two separate clusters (I and II).
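For readers who want to see the two learning modes in code, the short sketch below mirrors the Figure 1 example in Python with scikit-learn (a tooling choice made here for illustration; it is not the software used in the studies discussed in this review). The ages, comorbidity counts and the new patient follow the figure; the mortality labels and the clustering call are assumptions made purely for demonstration.

```python
# Illustrative sketch only: a toy version of the Figure 1 example using scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Hypothetical attributes: [age, number of comorbidities]; label: mortality (1 = died).
# The labels are assigned here for illustration only.
X = np.array([[20, 2], [30, 1], [60, 5], [70, 6], [45, 3], [72, 7]])
y = np.array([0, 0, 1, 1, 0, 1])

# Supervised learning: learn from a labeled retrospective data set, then predict a new patient.
clf = LogisticRegression().fit(X, y)
new_patient = np.array([[72, 7]])          # age 72 years, seven comorbidities (as in Figure 1a)
print("Predicted mortality label:", clf.predict(new_patient)[0])

# Unsupervised learning: no labels are given; clustering looks for structure (Figure 1b).
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Cluster assignments:", clusters)
```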

DATA MINING

ML algorithms are tools within a wider approach to analyzing large and complex data sets called 'data mining' (DM). DM is a multidisciplinary field based on statistics, mathematics, computer science, artificial intelligence and more. It seeks to discover knowledge in databases in an automatic or semiautomatic process.14,15,17,18 In practice, the two primary goals of DM tend to be prediction and description. Prediction involves using some variables in the data set to predict unknown or future values of other variables of interest. Description, on the other hand, focuses on finding interpretable patterns in the data.21 Different standards for the data-mining process have been developed. Here, we present an approach based on the CRISP-DM standard,22 tailored for predictive clinical DM. It consists of the following stages.

Problem definition: understanding the problem at hand and the current knowledge, and defining the data-mining goals. For instance, a problem could be proper candidate selection for allogeneic HSCT in patients with acute leukemias. Assessing the current knowledge includes reviewing the literature and the suggested solutions (for example, risk scores). The goal of the data-mining project would be the development of a prediction model for TRM at 100 days and 1 year after allogeneic HSCT for acute leukemia patients, thus aiding the evaluation and selection of candidates.

Data collection and understanding: collecting or obtaining data is one of the hardest tasks. The benefits of data-mining techniques over conventional statistical methods are augmented when they are applied to large data sets. Finding or creating registries or databases that fulfill these requirements is a laborious process. Once the data are in hand, the miner proceeds with activities to become familiar with the data, identify data quality problems, gain first insights into the data and detect interesting subsets in order to form hypotheses about hidden information. Current data-mining suites (for example, SPSS, SAS, WEKA and more) contain understandable data-visualization modules.

Data preparation (preprocessing): medical and real-world databases are highly susceptible to noisy, missing and inconsistent data because of the nature of data collection and human error.13 Although ML algorithms are relatively robust and capable of handling noise, preprocessing the data is essential for improving predictive accuracy. Tasks such as integrating different databases, discretization, imputing missing values and attribute transformation are all part of this stage and depend on the modeling technique that will be chosen in the next stage.17,18 Working with big data sets containing multiple attributes might lead to noise hiding the real signal. Algorithms for feature selection (that is, attribute selection) are therefore occasionally included in the preprocessing stage and may reduce data dimensionality, thereby improving computation time and, hopefully, prediction. In addition, they may also aid the pre-selection of variables for conventional statistical models, such as logistic regression.23,24
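As an illustration of this stage, the sketch below imputes missing values and applies a simple univariate feature-selection filter with scikit-learn; the data are synthetic, and the specific choices (median imputation, keeping five attributes) are ours for demonstration rather than recommendations from the cited literature.

```python
# Illustrative preprocessing sketch (scikit-learn assumed; data and attribute counts are made up).
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))             # 200 hypothetical patients, 10 attributes
X[rng.random(X.shape) < 0.05] = np.nan     # ~5% missing values, as in real registry data
y = rng.integers(0, 2, size=200)           # binary outcome label

# Impute missing values (here, with the per-attribute median).
X_imputed = SimpleImputer(strategy="median").fit_transform(X)

# Feature (attribute) selection: keep the 5 attributes most associated with the label.
selector = SelectKBest(score_func=f_classif, k=5).fit(X_imputed, y)
X_reduced = selector.transform(X_imputed)
print("Selected attribute indices:", selector.get_support(indices=True))
```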


Modeling: predictive data-mining models involve the application of supervised ML algorithms to retrospective data in which the class label (that is, the variable you wish to predict) is known, thereby allowing outcome prediction for new cases. Models are built on the basis of a training set. Usually, a number of algorithms are applied, their parameters are calibrated to optimal values and the best one is selected according to the results on the test set. Finding the best parameters for each model is frequently a matter of trial and error. In this review, we focus on three popular ML algorithms. All of them can be used for both classification and regression problems; for simplicity, we concentrate on the former.

(1) Decision trees: a decision tree is a flowchart-like tree structure in which each internal (non-leaf) node denotes a test on an attribute, each branch represents an outcome of the test and each leaf (terminal) node holds a class label. The topmost node in a tree is the root node. The hierarchy of a decision tree is formed by asking questions iteratively about the attributes of the training set, represented by all nodes except the leaves (see the example in Figure 2). A good question splits a collection of items with heterogeneous class labels into subsets with nearly homogeneous labels, stratifying the data so that there is little variance in each stratum. Several measures have been designed to evaluate the degree of inhomogeneity in a set of items.17,25 The main advantage of decision trees is their relative interpretability, which makes them a preferred option for medical research; that said, interpretability may diminish as the trees become more complex. Examples of their successful application include the prediction of mortality in acute liver-failure patients,26 prediction of axillary lymph node metastasis in breast cancer patients27 and prediction of iron-deficiency anemia from hematological parameters.28 The predictive power of decision trees may be enhanced by applying ensemble methods such as random forest, a technique that involves generating multiple trees and voting for the most popular class;29 however, interpretability is then lost.

Figure 2. Decision trees. A hypothetical data set of patients who underwent HSCT is given in the table below. The attributes are diagnosis, age and conditioning protocol, and the label is TRM at 100 days. A decision tree for the prediction of TRM at 100 days is developed on the basis of these retrospective data. Each ellipse is an internal node containing an attribute, and the connecting lines are branches of possible values. The rectangles denote leaves, where a class label is reached and the tree ends. The right number in brackets is the number of instances reaching the leaf and the left one is the number of instances correctly classified. After the tree is constructed, it can be used for classification (that is, to predict TRM at 100 days) of new patients according to their diagnosis, age and planned conditioning protocol.

Diagnosis   Age   Protocol   TRM at 100 days (label)
AML         60    RIC        Yes
AML         35    MA         No
ALL         61    RIC        No
ALL         55    RIC        No
ALL         45    MA         Yes
MDS         40    MA         No
AML         70    RIC        Yes
AML         45    RIC        No
ALL         80    RIC        No
AML         35    MA         No
MDS         50    RIC        No
ALL         66    MA         Yes
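A minimal sketch of this idea, fitting scikit-learn's DecisionTreeClassifier (and, for comparison, a random forest) to the hypothetical Figure 2 data, is given below; the library and the one-hot encoding of the categorical attributes are our assumptions for illustration, not part of the original figure.

```python
# Illustrative sketch: fit a decision tree (and a random forest) to the hypothetical
# Figure 2 data set. scikit-learn and the one-hot encoding are assumptions.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import RandomForestClassifier

data = pd.DataFrame({
    "Diagnosis": ["AML", "AML", "ALL", "ALL", "ALL", "MDS", "AML", "AML", "ALL", "AML", "MDS", "ALL"],
    "Age":       [60, 35, 61, 55, 45, 40, 70, 45, 80, 35, 50, 66],
    "Protocol":  ["RIC", "MA", "RIC", "RIC", "MA", "MA", "RIC", "RIC", "RIC", "MA", "RIC", "MA"],
    "TRM_100d":  ["Yes", "No", "No", "No", "Yes", "No", "Yes", "No", "No", "No", "No", "Yes"],
})
X = pd.get_dummies(data[["Diagnosis", "Age", "Protocol"]])   # one-hot encode categorical attributes
y = data["TRM_100d"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))      # interpretable, flowchart-like rules

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)  # ensemble of trees
print("Random forest training accuracy:", forest.score(X, y))
```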

(2) Artificial neural networks (ANNs): these algorithms are inspired by neuronal learning. One can consider a neuron as a computational unit that receives weighted inputs from other neurons through dendrites, processes them and, if a certain threshold is reached, delivers an output through the axon. ANNs are collections of model neurons interconnected to increase computational/predictive power. As in biological neurons, connections between units are assigned different weights that are adjusted during training. The input nodes are the observed attributes used for prediction, the output nodes are the possible outcomes that the network predicts (for example, TRM) and in between are accessory computational nodes referred to as hidden layers (Figure 3).17,18,30 As mentioned above, in supervised learning, labeled retrospective data are used for model training. An ANN is trained by iteratively running labeled samples through the network and correcting the network weights according to the distance between the predicted and the real label. ANNs have been applied successfully in various clinical studies.31–35 Caocci et al.36 compared the performance of an ANN with that of logistic regression for predicting acute GVHD in a group of 78 β-thalassemia major patients who underwent HSCT. The prediction sensitivity and specificity of the ANN were 83.3% and 90.1%, respectively, versus 21.7% and 83.3% with logistic regression. These results are impressive; however, given the small number of patients and the relatively large number of variables (24 variables), one should suspect overfitting, as will be discussed later.
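The sketch below trains a small one-hidden-layer network on synthetic data with scikit-learn's MLPClassifier, simply to make the input/hidden/output structure of Figure 3 concrete; it is not the network or the data used by Caocci et al.

```python
# Illustrative ANN sketch: one hidden layer, as in Figure 3 (scikit-learn's MLPClassifier;
# the data and the synthetic label are invented for demonstration).
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))               # 300 hypothetical patients, 6 attributes
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # synthetic binary outcome for demonstration

X_scaled = StandardScaler().fit_transform(X)    # scaling helps weight-based learners
ann = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
ann.fit(X_scaled, y)                        # weights are adjusted iteratively during training
print("Training accuracy:", ann.score(X_scaled, y))
```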


Figure 3. Artificial neural network. A hypothetical three-layer ANN for the binary prediction of post-HSCT TRM. The layers are constructed from a number of interconnected nodes, which represent neurons connected by edges. Patients from a retrospective data set are iteratively presented to the network via the input layer, which communicates with one or more hidden layers, where the actual processing is performed through a system of weighted connections. The hidden layer is linked to the output node, which gives the prediction for each patient (TRM ±). Instances are passed forward and, according to the distance between the predicted and actual labels, the weights are adjusted.

ANNs are accurate and powerful predictors. However, they lack interpretability ('black box' models);18,30 thus, the reasoning behind the suggested model is not revealed to the user.

(3) Support vector machines (SVMs): usually applied to two-class classification problems, the basic concept behind these algorithms is detecting the separating line, plane or hyperplane (in two-, three- or higher-dimensional problems, respectively) that gives the greatest separation between the two classes. SVMs find the optimal hyperplane with the maximum distance to the closest points of the two classes (that is, the maximum-margin hyperplane). The instances closest to the optimal hyperplane are the support vectors and define the margin of each class (Figure 4). The SVM algorithm is capable of dealing with both linearly and nonlinearly separable problems in classification and regression tasks by applying a kernel function.37 SVMs have been successfully applied in a number of clinical studies.38,39 For instance, Lu et al.40 used an SVM to predict the malignancy of ovarian tumors preoperatively. SVMs are mathematically more complicated than decision trees and ANNs; however, they are gaining popularity because of their generally better performance and the observation that they are less prone to overfitting. It should be noted, though, that they are very difficult to interpret, especially when nonlinear kernel functions are used.

Model evaluation: as described earlier, in supervised learning, model development includes training and testing. Optimally, the whole data set is divided into two separate sets (Figure 5a). A prediction model is constructed on the training set, where associations between the attributes of the examples and their labels are learned using a specified algorithm (for example, an SVM). The second part of the data set is the test set, which simulates new samples with an unknown label. These are presented to the model generated in the training stage and a label prediction is given for each sample. Performance measures are calculated according to the ability of the model to correctly predict the labels of the test set, as will be discussed shortly.17,18

In scenarios where the data are insufficient to allow separate sets for training and testing, an alternative approach called k-fold cross-validation is often used (Figure 5b). The whole data set is divided into k subsamples; iteratively, the model is trained on k−1 subsamples and tested on the remaining subsample. Model performance is the average of the performance over the repeated tests.17,18,41

Objective performance measures reflect the model's generalizability and are the result of the cumulative classification successes and failures on the test set, which are given in a confusion matrix (also known as a contingency table) (Figure 6a). Performance measures (Figure 6b) are calculated for both the training and the test set; however, the latter is more relevant, as it tells us about the ability of our classifier to cope with new, unfamiliar data. Measuring performance only on the training set tends to yield over-optimistic estimates, as the model might have only 'remembered' the data instead of 'learning' it, allowing little room for variance. This is the problem of overfitting, in which the model developed closely fits the training data but fails to generalize when presented with new examples. Using a test set for validation is an important step in overcoming this hurdle.

Accuracy, error rate and precision are easily affected by an imbalanced class distribution,42 a common scenario in medicine (that is, the event you wish to predict occurs in only a minority of patients). Receiver operating characteristic curves depict the trade-off between the true-positive rate (that is, sensitivity) and the false-positive rate (that is, 1−specificity) and reflect the performance of classifiers without regard to class distribution. They are useful for model comparison and evaluation. Roughly speaking, a larger area under the receiver operating characteristic curve (which ranges from 0 to 1) reflects better model performance. Limitations of receiver operating characteristic analysis are beyond the scope of this review and are discussed elsewhere.17,18,42,43
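For illustration, a receiver operating characteristic curve and its area can be computed from a classifier's predicted probabilities as sketched below (scikit-learn assumed; the test-set labels and scores are invented).

```python
# Illustrative sketch: ROC curve and its area for a set of test-set scores
# (scikit-learn assumed; the labels and predicted probabilities are made up).
from sklearn.metrics import roc_curve, roc_auc_score

y_test   = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]                        # actual test-set labels
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9, 0.6, 0.3]   # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_test, y_scores)   # trade-off between TPR and FPR
print("Area under the receiver operating characteristic curve:", roc_auc_score(y_test, y_scores))
```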

Figure 4. Support vector machine. A two-dimensional plot based on two attributes, age and number of comorbidities, is used for the development of a hypothetical SVM classifier for the prediction of HSCT mortality. Rectangles and circles denote patients in the training set who survived or died, respectively. The SVM algorithm finds the maximum-margin hyperplane (MMH), that is, the hyperplane with the maximal distance to the closest examples of each class in the data set. The support vectors are the closest examples of each class and define the margins. According to a mathematical function based on the MMH, the SVM can be used to classify new, unseen examples (that is, patients who are undergoing evaluation for transplantation). SVMs can be extended to multiple dimensions and to nonlinear classification scenarios.
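A brief sketch mirroring Figure 4, training a linear maximum-margin SVM on two hypothetical attributes and then swapping in a nonlinear kernel, is given below; scikit-learn and the synthetic data are assumptions for illustration.

```python
# Illustrative SVM sketch mirroring Figure 4 (two attributes, two classes); scikit-learn assumed.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Hypothetical training set: [age, number of comorbidities]; label 1 = died post HSCT.
X = np.vstack([rng.normal([40, 2], 5, size=(40, 2)),    # survivors
               rng.normal([70, 6], 5, size=(40, 2))])   # non-survivors
y = np.array([0] * 40 + [1] * 40)

svm_linear = SVC(kernel="linear").fit(X, y)     # finds the maximum-margin hyperplane
print("Support vectors per class:", svm_linear.n_support_)

svm_rbf = SVC(kernel="rbf").fit(X, y)           # kernel function for nonlinearly separable data
print("New patient prediction:", svm_rbf.predict([[72, 7]])[0])
```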

Figure 5. Training and testing. (a) The parameters of the model are estimated on a training set, which is a subsample of the data set itself. The model produced is applied to a test set, on which predictive accuracy is calculated according to the prediction error rate; other performance metrics are calculated as well. This is the optimal scenario. (b) When faced with data sets where data are sparse, it is possible to use k-fold (usually 10-fold) cross-validation. The whole data set is divided into k equal parts and the model is iteratively trained on k−1 subsamples and tested on the remaining subsample. Model performance is the average over the folds.
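The split-sample and cross-validation schemes of Figure 5 might look as follows in code (scikit-learn assumed; the data are synthetic, and the 70/30 split and 10 folds are arbitrary illustrative choices).

```python
# Illustrative sketch of Figure 5: a held-out test set (a) and k-fold cross-validation (b).
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 8))
y = (X[:, 0] - X[:, 3] > 0).astype(int)     # synthetic label for demonstration

# (a) Separate training and test sets: fit on one part, estimate performance on the other.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("Test-set accuracy:", model.score(X_test, y_test))

# (b) 10-fold cross-validation: average performance over the folds (useful when data are sparse).
scores = cross_val_score(LogisticRegression(), X, y, cv=10, scoring="roc_auc")
print("Mean AUC over 10 folds:", scores.mean())
```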


Figure 6. Model evaluation. (a) A confusion matrix describes the cumulative successes and failures of the model's predictions over the test set. (b) Performance measure definitions. TP = true positive, FP = false positive, TN = true negative, FN = false negative, PPV = positive predictive value, NPV = negative predictive value.

(a) Confusion matrix:

                      Predicted class: Yes    Predicted class: No
Actual class: Yes     TP                      FN
Actual class: No      FP                      TN

(b) Performance measures:

Predictive accuracy = (TP + TN) / (TP + TN + FP + FN)
Error rate (1 − predictive accuracy) = (FP + FN) / (TP + TN + FP + FN)
Sensitivity (true-positive rate, recall) = TP / (TP + FN)
Specificity (true-negative rate) = TN / (FP + TN)
Precision (PPV) = TP / (TP + FP)
NPV = TN / (TN + FN)
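The Figure 6b measures can be derived directly from the four cells of a confusion matrix, as in the sketch below (scikit-learn assumed; the actual and predicted labels are invented for illustration).

```python
# Illustrative sketch: derive the Figure 6b measures from a confusion matrix.
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
error_rate  = (fp + fn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)        # true-positive rate, recall
specificity = tn / (fp + tn)        # true-negative rate
precision   = tp / (tp + fp)        # positive predictive value (PPV)
npv         = tn / (tn + fn)        # negative predictive value
print(accuracy, error_rate, sensitivity, specificity, precision, npv)
```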

Models can also be evaluated according to other measures.17 Interpretability refers to the level of understanding and insight that a model provides. As mentioned previously, decision trees display an interpretable model structure, as opposed to 'black box' models (for example, SVMs and ANNs), where the rationale behind the prediction is unclear. Feature selection algorithms (usually applied in the preprocessing stage) might not only improve model performance but also improve interpretability by eliminating attributes that are not relevant for prediction. By applying a feature selection algorithm to a database of 1160 AML patients, Sarkar et al.44 identified and ranked 15 (out of 121) attributes significant for allogeneic HSCT survival prediction.45 The selected attributes, which were not detected by a standard statistical approach, improved predictive accuracy. Robustness, another performance measure, evaluates the ability of the model to make correct predictions given noisy data or data with missing values.17,27,46 ML algorithms handle missing values in different ways that may improve predictive accuracy in comparison with standard techniques.46

Deployment: creation of the model is generally not the end of the project; refining the results and making them practical for end users (that is, clinicians) is necessary. Optimally, the models serve as the basis for the generation of decision-support systems. Ng et al.47 created a simple clinical decision-support system, accessed through an internet interface, for determining whether a patient will survive beyond 120 days after palliative chemotherapy. Predictive modeling can also enhance our knowledge about attributes significant for outcome prediction (this capability is algorithm dependent). For instance, Delen et al.48 applied a Cox regression model to a combination of attributes derived from a data-mining analysis and attributes from previous conventional statistical studies to predict the outcome of thoracic transplantations. In HSCT, given the highly divergent populations with multiple attributes that fail to meet prior assumptions (for example, independence of attributes, linear data behavior), one can think of many questions for which a data-mining approach would be beneficial: for instance, prediction of overall survival, TRM, GVHD, GVHD-related death and relapse-related death at various time points post transplant (for example, 100 days, 1 year and 5 years).48–53 In addition, applying data-mining techniques to databases with combined HSCT and non-transplant (for example, chemotherapy) data could potentially lead to the development of therapeutic decision-support systems, allowing better treatment allocation for each patient.
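As a loose illustration of this deployment step (not the system described by Ng et al.47), a trained model might be persisted and wrapped in a simple prediction helper; joblib, the file name and the attribute names below are assumptions made for the sketch.

```python
# Minimal deployment sketch: persist a trained model and expose a simple prediction helper
# (joblib, the file name and the attributes are assumptions; this is not the system of Ng et al.).
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train on a retrospective (hypothetical) data set and save the model to disk.
rng = np.random.default_rng(4)
X = rng.normal(size=(400, 3))                       # e.g., three standardized patient attributes
y = (X[:, 0] + X[:, 1] > 0).astype(int)             # synthetic outcome label
joblib.dump(LogisticRegression().fit(X, y), "trm_model.joblib")

def predict_trm_risk(attr1, attr2, attr3):
    """Return the predicted probability of the outcome for a new candidate (illustrative only)."""
    model = joblib.load("trm_model.joblib")
    return model.predict_proba([[attr1, attr2, attr3]])[0, 1]

print("Predicted risk:", predict_trm_risk(0.5, 1.2, -0.3))
```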

SUMMARY

DM is a promising approach for the development of prediction models. Nevertheless, it is not free of limitations. Lack of model interpretability is a major issue. However, the long-admired Ockham's razor principle, whereby simpler means better, does not necessarily hold in complex data scenarios, where predictive accuracy may be more important than interpretability, especially in clinical decisions.7 Secondly, standards for data analysis are still immature but are evolving,13,22,55 and standards for handling data censoring in survival analysis are also progressing.39,56–58 Thirdly, detected data patterns may simply be a product of random fluctuations in the data. In addition, overfitting, as discussed earlier, is a peril that might be hard to avoid with small data sets, although most algorithms take measures to avoid it. Numerous publications discuss the pros and cons of DM in depth.7,9,59

In conclusion, given the power of the data-mining approach to process a multiplicity of variables, describe complex nonlinear interactions and create accurate prediction models, it seems natural to apply it to the complex analysis of HSCT databases. So far, lack of interpretability and of experience with the different models has deterred clinical researchers and physicians. However, embracing these novel techniques of artificial intelligence may lead to better experience-based clinical decisions, improving patient and donor selection, reducing TRM and improving transplantation outcomes.

CONFLICT OF INTEREST

The authors declare no conflict of interest.

ACKNOWLEDGEMENTS

This work was supported by a research grant from the Israeli Cancer Association (grant no. 20130180). Dr Roni Shouval was supported by the Dr Pinchas Borenstein Talpiot Medical Leadership Program 2013 fellowship.

REFERENCES

1 Copelan EA. Hematopoietic stem-cell transplantation. N Engl J Med 2006; 354: 1813–1826.
2 Gooley TA, Chien JW, Pergam SA, Hingorani S, Sorror ML, Boeckh M et al. Reduced mortality after allogeneic hematopoietic-cell transplantation. N Engl J Med 2010; 363: 2091–2101.
3 Hamadani M, Craig M, Awan FT, Devine SM. How we approach patient evaluation for hematopoietic stem cell transplantation. Bone Marrow Transplant 2010; 45: 1259–1268.
4 Gratwohl A, Stern M, Brand R, Apperley J, Baldomero H, de Witte T et al. Risk score for outcome after allogeneic hematopoietic stem cell transplantation: a retrospective analysis. Cancer 2009; 115: 4715–4726.
5 Parimon T, Au DH, Martin PJ, Chien JW. A risk score for mortality after allogeneic hematopoietic cell transplantation. Ann Intern Med 2006; 144: 407–414.
6 Sorror ML, Maris MB, Storb R, Baron F, Sandmaier BM, Maloney DG et al. Hematopoietic cell transplantation (HCT)-specific comorbidity index: a new tool for risk assessment before allogeneic HCT. Blood 2005; 106: 2912–2919.
7 Breiman L. Statistical modeling: the two cultures (with comments and a rejoinder by the author). Stat Sci 2001; 16: 199–231.
8 Bagley SC, White H, Golomb BA. Logistic regression in the medical literature: standards for use and reporting, with particular attention to one medical domain. J Clin Epidemiol 2001; 54: 979–985.
9 Hand DJ. Data mining: statistics and more? Am Stat 1998; 52: 112–118.
10 Sun GW, Shook TL, Kay GL. Inappropriate use of bivariable analysis to screen risk factors for use in multivariable analysis. J Clin Epidemiol 1996; 49: 907–916.
11 Tu JV. Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes. J Clin Epidemiol 1996; 49: 1225–1231.
12 Jensen PB, Jensen LJ, Brunak S. Mining electronic health records: towards better research applications and clinical care. Nat Rev Genet 2012; 13: 395–405.


13 Jitao Z, Ting W. A general framework for medical data mining. Future Information Technology and Management Engineering (FITME), 2010 International Conference, Changzhou, China, 2010.
14 Mitchell T. Machine Learning. 1st edn. McGraw Hill: Blacklick, OH, USA, 1997.
15 Mitchell T. Machine learning and data mining. Commun ACM 1999; 42: 30–36.
16 Iavindrasana J, Cohen G, Depeursinge A, Muller H, Meyer R, Geissbuhler A. Clinical data mining: a review. Yearb Med Inform 2009; 121–133.
17 Han J, Kamber M, Pei J. Data Mining: Concepts and Techniques. 3rd edn. Morgan Kaufmann, 2012.
18 Witten IH, Frank E, Hall MA. Data Mining: Practical Machine Learning Tools and Techniques. 3rd edn. Morgan Kaufmann, 2011.
19 Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000; 403: 503–511.
20 Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC et al. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med 2002; 8: 68–74.
21 Kantardzic M. Data Mining: Concepts, Models, Methods, and Algorithms. Wiley-IEEE Press, 2011.
22 Chapman P, Clinton J, Kerber R, Khabaza T, Reinartz R, Shearer C et al. CRISP-DM 1.0: Step-by-Step Data Mining Guide. CRISP-DM Consortium Tech Rep, 2000.
23 Del Fiol G, Haug PJ. Classification models for the prediction of clinicians' information needs. J Biomed Inform 2009; 42: 82.
24 Hall MA, Holmes G. Benchmarking attribute selection techniques for discrete class data mining. IEEE Trans Knowl Data Eng 2003; 15: 1437–1447.
25 Kingsford C, Salzberg SL. What are decision trees? Nat Biotechnol 2008; 26: 1011–1013.
26 Nakayama N, Oketani M, Kawamura Y, Inao M, Nagoshi S, Fujiwara K et al. Algorithm to determine the outcome of patients with acute liver failure: a data-mining analysis using decision trees. J Gastroenterol 2012; 47: 664–677.
27 Takada M, Sugimoto M, Naito Y, Moon HG, Han W, Noh DY et al. Prediction of axillary lymph node metastasis in primary breast cancer patients using a decision tree-based model. BMC Med Inform Decis Mak 2012; 12: 54.
28 Dogan S, Turkoglu I. Iron-deficiency anemia detection from hematology parameters by using decision trees. Int J Sci Technol 2008; 3: 85–92.
29 Breiman L. Random forests. Mach Learn 2001; 45: 5–32.
30 Krogh A. What are artificial neural networks? Nat Biotechnol 2008; 26: 195–197.
31 Burke HB, Goodman PH, Rosen DB, Henson DE, Weinstein JN, Harrell FE et al. Artificial neural networks improve the accuracy of cancer survival prediction. Cancer 1997; 79: 857–862.
32 Sato F, Shimada Y, Selaru FM, Shibata D, Maeda M, Watanabe G et al. Prediction of survival in patients with esophageal carcinoma using artificial neural networks. Cancer 2005; 103: 1596–1605.
33 Sargent DJ. Comparison of artificial neural networks with other statistical approaches: results from medical data sets. Cancer 2001; 91: 1636–1642.
34 Rotondano G, Cipolletta L, Grossi E, Koch M, Intraligi M, Buscema M et al. Artificial neural networks accurately predict mortality in patients with nonvariceal upper GI bleeding. Gastrointest Endosc 2011; 73: 226 e1–2.
35 Lisboa PJ, Taktak AF. The use of artificial neural networks in decision support in cancer: a systematic review. Neural Netw 2006; 19: 408–415.
36 Caocci G, Baccoli R, Vacca A, Mastronuzzi A, Bertaina A, Piras E et al. Comparison between an artificial neural network and logistic regression in predicting acute graft-vs-host disease after unrelated donor hematopoietic stem cell transplantation in thalassemia patients. Exp Hematol 2010; 38: 426–433.
37 Noble WS. What is a support vector machine? Nat Biotechnol 2006; 24: 1565–1567.
38 Chia CC, Rubinfeld I, Scirica BM, McMillan S, Gurm HS, Syed Z. Looking beyond historical patient outcomes to improve clinical models. Sci Transl Med 2012; 4: 131ra49.
39 A data mining approach to MPGN type II renal survival analysis. Proceedings of the 1st ACM International Health Informatics Symposium. ACM, 2010.


40 Lu C, Van Gestel T, Suykens JA, Van Huffel S, Vergote I, Timmerman D. Preoperative prediction of malignancy of ovarian tumors using least squares support vector machines. Artif Intell Med 2003; 28: 281–306.
41 Stone M. Cross-validatory choice and assessment of statistical predictions. J R Stat Soc Series B (Methodological) 1974; 111–147.
42 Cohen G, Hilario M, Sax H, Hugonnet S, Geissbuhler A. Learning from imbalanced data in surveillance of nosocomial infection. Artif Intell Med 2006; 37: 7–18.
43 Linden A. Measuring diagnostic and predictive accuracy in disease management: an introduction to receiver operating characteristic (ROC) analysis. J Eval Clin Pract 2006; 12: 132–139.
44 Sarkar C, Cooley S, Srivastava J. Improved feature selection for hematopoietic cell transplantation outcome prediction using rank aggregation. 2012 Federated Conference on Computer Science and Information Systems (FedCSIS), 9–12 September 2012.
45 Improved feature selection for hematopoietic cell transplantation outcome prediction using rank aggregation. 2012 Federated Conference on Computer Science and Information Systems (FedCSIS), 9–12 September 2012.
46 Jerez JM, Molina I, Garcia-Laencina PJ, Alba E, Ribelles N, Martin M et al. Missing data imputation using statistical and machine learning methods in a real breast cancer problem. Artif Intell Med 2010; 50: 105–115.
47 Ng T, Chew L, Yap CW. A clinical decision support tool to predict survival in cancer patients beyond 120 days after palliative chemotherapy. J Palliat Med 2012; 15: 863–869.
48 Delen D, Oztekin A, Kong ZJ. A machine learning-based approach to prognostic analysis of thoracic transplantations. Artif Intell Med 2010; 49: 33–42.
49 De Souza CA, Vigorito AC, Ruiz MA, Nucci M, Dulley FL, Funcke V et al. Validation of the EBMT risk score in chronic myeloid leukemia in Brazil and allogeneic transplant outcome. Haematologica 2005; 90: 232–237.
50 Gratwohl A, Hermans J, Goldman JM, Arcese W, Carreras E, Devergie A et al. Risk assessment for patients with chronic myeloid leukaemia before allogeneic blood or marrow transplantation. Chronic Leukemia Working Party of the European Group for Blood and Marrow Transplantation. Lancet 1998; 352: 1087–1092.
51 Lodewyck T, Oudshoorn M, van der Holt B, Petersen E, Spierings E, von dem Borne PA et al. Predictive impact of allele-matching and EBMT risk score for outcome after T-cell depleted unrelated donor transplantation in poor-risk acute leukemia and myelodysplasia. Leukemia 2011; 25: 1548–1554.
52 Schmid C, Labopin M, Nagler A, Niederwieser D, Castagna L, Tabrizi R et al. Treatment, risk factors, and outcome of adults with relapsed AML after reduced intensity conditioning for allogeneic stem cell transplantation. Blood 2012; 119: 1599–1606.
53 Sorror M, Storer B, Sandmaier BM, Maloney DG, Chauncey TR, Langston A et al. Hematopoietic cell transplantation-comorbidity index and Karnofsky performance status are independent predictors of morbidity and mortality after allogeneic nonmyeloablative hematopoietic cell transplantation. Cancer 2008; 112: 1992–2001.
54 Xhaard A, Porcher R, Chien JW, de Latour RP, Robin M, Ribaud P et al. Impact of comorbidity indexes on non-relapse mortality. Leukemia 2008; 22: 2062–2069.
55 Fayyad U, Piatetsky-Shapiro G, Smyth P. The KDD process for extracting useful knowledge from volumes of data. Commun ACM 1996; 39: 27–34.
56 Stajduhar I, Dalbelo-Basic B, Bogunovic N. Impact of censoring on learning Bayesian networks in survival modelling. Artif Intell Med 2009; 47: 199–217.
57 Hothorn T, Bühlmann P, Dudoit S, Molinaro A, van der Laan MJ. Survival ensembles. Biostatistics 2006; 7: 355–373.
58 Sesen MB, Kadir T, Alcantara RB, Fox J, Brady M. Survival prediction and treatment recommendation with Bayesian techniques in lung cancer. AMIA Annu Symp Proc 2012; 838–847.
59 Schwarzer G, Vach W, Schumacher M. On the misuses of artificial neural networks for prognostic and diagnostic classification in oncology. Stat Med 2000; 19: 541–561.
