Microarray Data Integration and Machine Learning Techniques For Lung Cancer Survival Prediction Daniel Berrar, Brian Sturgeon, Ian Bradbury, C. Stephen Downes, Werner Dubitzky November 14, 2003
Outline • Summary of Results (1 slide) • Overview of Tasks (1 slide) • Data Integration (4 slides) • Methods (6 slides) • Results and Biological Interpretation (6 slides) • Conclusions (1 slide)
Summary of Results • With respect to tasks: – Classification task: Prediction of 5-year survival is most accurate when we build a model using only patient data (age, tumor stage,…); – Regression task: Prediction of survival in months is more accurate for the model relying on expression data than on patient data, and best when the model relies on both patient and expression data;
• With respect to methods: – “Best” model: Decision tree
Tasks • Task #1: Data integration – Integration of the Harvard and Michigan lung cancer microarray data sets and data pre-processing;
• Task #2: Classification – (a) Prediction of 5-year survival of patients based on (1) patient data, (2) expression data, and (3) both patient and expression data; – (b) Comparison of 6 machine learning methods;
• Task #3: Regression – Prediction of survival time using (1) patient data, (2) expression data, and (3) both patient and expression data;
• Task #4: Interpretation – Biological interpretation of identified genes.
Task #1: Data Integration [1/4]
Task #1: Data Integration [2/4]
[Figure: the integrated data set: 211 patients with patient data, expression data (3,588 genes), and target variables]
Task #1: Data Integration [3/4] • Data pre-processing for classification task: – Group patients into 2 classes: • LOW RISK: Survival ≥ 5 years • HIGH RISK: Survival < 5 years
– Discard patients that are censored before 60 months – Remaining number of patients: 136
• Data pre-processing for regression task: – Include all 211 patients.
• Data pre-processing for both tasks: – Generate learning set and test set by randomly splitting the entire data set (~70% : ~30%); a sketch follows below.
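As a rough illustration of this pre-processing, a minimal sketch; the column names (survival_months, censored) are hypothetical placeholders, not taken from the study's data files.

```python
# Minimal sketch of the pre-processing above; column names are hypothetical.
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, seed: int = 0):
    # Classification task: the 5-year status is known for patients who
    # reached 60 months of follow-up or who died (were not censored);
    # patients censored before 60 months are discarded.
    known = df[(df["survival_months"] >= 60) | (~df["censored"])].copy()
    known["risk"] = np.where(known["survival_months"] >= 60, "LOW", "HIGH")

    # Random ~70% / ~30% learning/test split.
    rng = np.random.default_rng(seed)
    mask = rng.random(len(known)) < 0.7
    return known[mask], known[~mask]
```

The regression task would skip the filtering step and split all 211 patients the same way.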
Task #1: Data Integration [4/4]
[Figure: learning/test splits. Task #2 (classification): learning set of 96 and test set of 40 patients (136 in total), with models built on patient data, expression data, and patient + expression data. Task #3 (regression, CART): learning set of 148 and test set of 63 patients (211 in total), likewise for all three data views.]
Tasks • Task #1: Data integration – Integration of the Harvard and Michigan lung cancer microarray data sets;
• Task #2: Classification – (a) Prediction of 5-year survival of patients based on (1) patient data, (2) expression data, and (3) both patient and expression data; – (b) Comparison of 6 machine learning methods;
• Task #3: Regression – Prediction of survival time using (1) patient data, (2) expression data, and (3) both patient and expression data;
• Task #4: Interpretation – Biological interpretation of identified genes.
Methods – Overview
• Methods used to address the classification task:
(1) k-nearest neighbour (k-NN)
(2) Decision tree C5.0
(3) Boosted decision trees
(4) Support vector machines (SVMs)
(5) Artificial neural networks (multilayer perceptrons, MLPs)
(6) Probabilistic neural networks (PNNs)
• Method used to address the regression task:
(1) Classification and regression tree (CART)
Methods – Comparison of Principles
• Consider the following 2-class problem
Methods – Decision Tree
• Recursively split the data set into decision regions and generate a rule set
• Classify the test case using the rule set
[Figure: root node splits on y ≤ split #1 (assign class) vs. y > split #1 (split again)]
• Boosted decision trees: aggregate decision trees into a committee by weighted voting and resampling of the data set (a sketch follows below).
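To make the committee idea concrete, a minimal sketch using scikit-learn's AdaBoost over CART-style stumps as a stand-in; the study itself used boosted C5.0 trees, so this is illustrative only.

```python
# AdaBoost re-weights (resamples) the data each round so that the next tree
# focuses on previously misclassified cases; prediction is a weighted vote.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

booster = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=50)
# booster.fit(X_learn, y_learn); booster.predict(X_test)
```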
Methods – Support Vector Machine • Find the optimal separating hyperplane by maximizing the margin between the 2 classes • Classify the test case using the hyperplane
[Figure: two classes separated by a maximum-margin hyperplane]
Methods – Strengths and Weaknesses
• Most (if not all) models ultimately rely on a definition of distance between objects.
• This definition is not trivial in high-dimensional space. Distance metric as a tuning parameter? → fractal distance [Aggarwal et al., ICDT, 2001]
*Lee Y., Lee C.K.: Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics 19(9), pp. 1132–1139 (2003).
Results of Task #2: Classification
Tasks • Task #1: Data integration – Integration of the Harvard and Michigan lung cancer microarray data sets;
• Task #2: Classification – (a) Prediction of 5-year survival of patients based on (1) patient data, (2) expression data, and (3) both patient and expression data; – (b) Comparison of 6 machine learning methods;
• Task #3: Regression – Prediction of survival time using (1) patient data, (2) expression data, and (3) both patient and expression data;
• Task #4: Interpretation – Biological interpretation of identified genes.
Methods – Classification and Regression Tree • The algorithm is similar to the decision tree C5.0 • The heuristic is based on recursive partitioning of the data set • Differences: CART produces binary splits only, supports continuous (regression) targets, and uses a different split criterion (Gini index rather than an entropy-based gain ratio); a regression sketch follows below.
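For illustration, a CART-style regression tree on toy data; note that plain scikit-learn trees do not handle censoring, so this sketches only the partitioning idea, not the survival-adapted CART used in the study.

```python
from sklearn.tree import DecisionTreeRegressor, export_text
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # toy stand-in for 3 expression values
y = 60.0 + 10 * X[:, 0] - 5 * X[:, 2]  # toy survival times in months

# Binary splits are chosen to reduce squared error; each leaf predicts the
# mean survival time of its cases.
cart = DecisionTreeRegressor(max_depth=2, min_samples_leaf=10).fit(X, y)
print(export_text(cart))
```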
Results of Task #3: Regression [1/3] • Evaluation criteria: – How many death events are correctly identified as death events, and how many are not? → accuracy – For the correctly identified death events, what is the deviance of the residuals between the real and the predicted survival time? (One possible reading is sketched below.)
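One plausible reading of these two criteria in code (a sketch; the exact deviance definition used in the study is not reproduced here):

```python
import numpy as np

def evaluate(event_true, event_pred, t_true, t_pred):
    event_true = np.asarray(event_true, dtype=bool)
    event_pred = np.asarray(event_pred, dtype=bool)
    # Criterion 1: fraction of death/non-death events identified correctly.
    accuracy = np.mean(event_true == event_pred)
    # Criterion 2: residuals for the correctly identified death events.
    hit = event_true & event_pred
    residuals = np.asarray(t_true)[hit] - np.asarray(t_pred)[hit]
    return accuracy, residuals
```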
Results of Task #3: Regression [2/3]
[Figure]
Results of Task #3: Regression [3/3]
[Figure]
Tasks • Task #1: Data integration – Integration of the Harvard and Michigan lung cancer microarray data sets;
• Task #2: Classification – (a) Prediction of 5-year survival of patients based on (1) patient data, (2) expression data, and (3) both patient and expression data; – (b) Comparison of 6 machine learning methods;
• Task #3: Regression – Prediction of survival time using (1) patient data, (2) expression data, and (3) both patient and expression data;
• Task #4: Interpretation – Biological interpretation of identified genes.
Task #4: Biological Interpretation [1/2] • How to interpret the results? → using literature, OMIM, PubMed, … • # of features relevant for the classification task: 8, e.g. ZNF174 (zinc finger protein) • Proteins of this family probably have an impact on the repression of growth factor gene expression [OMIM, 603900] • Example: the Wilms tumour suppressor WT1 encodes a zinc finger protein that downregulates the expression of various growth factor genes [OMIM, 603900] • Decision tree: overexpression of ZNF174 is associated with LOW RISK, underexpression with HIGH RISK • ZNF174: an important marker in Burkitt's lymphoma cells [Li et al., PNAS, May 2003]
Task #4: Biological Interpretation [2/2] • # of features relevant for the regression task: 5, e.g. NifU • Function not fully understood yet • Likely to be involved in the mobilization of iron and sulfur for nitrogenase-specific iron-sulfur cluster formation • Found to be important for breast cancer classification [Hedenfalk et al., N Engl J Med, Feb. 2001];
• Decision tree: overexpression of NifU is associated with good clinical outcome for patients with early tumour stage.
Conclusions • Integrating clinical and transcriptional data might improve survival outcome prediction; • “Best” model in this study: decision tree, but… • There is no universally best method of choice; • No Free Lunch Theorem: “No classifier is inherently superior to any other. The type of the problem determines which classifier is most appropriate.” • George Box: “Statisticians, like artists, have the bad habit of falling in love with their models.”
Acknowledgements • Brian Sturgeon • Ian Bradbury • C. Stephen Downes • Werner Dubitzky Supplementary information will be available at http://research.bioinformatics.ulster.ac.uk/~dberrar/camda03.html.
Methods – Comparison of Principles
• Consider the following 2-class problem of cases (x, y)
• t = {0.1, 0.2, …, 10}
• Class A: x = t·cos(t), y = t·sin(t)
• Class B: x = t·sin(t), y = t·cos(t)
(A data-generating sketch follows below.)
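The toy problem is easy to generate; a short sketch (the two classes are intertwined spirals, mirror images across the diagonal):

```python
import numpy as np

t = np.arange(0.1, 10.05, 0.1)                         # t = 0.1, 0.2, ..., 10
class_a = np.column_stack((t * np.cos(t), t * np.sin(t)))
class_b = np.column_stack((t * np.sin(t), t * np.cos(t)))
X = np.vstack((class_a, class_b))
y = np.repeat([0, 1], len(t))                          # 0 = class A, 1 = class B
```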
Methods – k-Nearest Neighbour
• Retrieve the nearest neighbours of the test case • Classify test case based on the class membership of the nearest neighbours
Methods – k-Nearest Neighbour
Learning:
• For each case in the learning set, determine all neighbours and rank them with respect to similarity = 1 − distance
• Determine the globally optimal number of nearest neighbours k_opt (e.g., in LOOCV)
Test:
• Use k_opt for classifying the test cases
• Interpret normalized similarities as a measure of confidence
Suppose that k_opt = 3 and the following nearest neighbours:
Case #   Similarity (= 1 − distance)   Normalized   Class
27       0.0921                        0.35795      A
29       0.0833                        0.32375      B
34       0.0819                        0.31831      A
Confidence for class A: 0.35795 + 0.31831 = 0.67626
Confidence for class B: 0.32375
(This computation is sketched below.)
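The confidence computation from the table above, reproduced as a sketch:

```python
import numpy as np

sims = np.array([0.0921, 0.0833, 0.0819])  # similarity = 1 - distance
cls = np.array(["A", "B", "A"])            # classes of cases 27, 29, 34
normed = sims / sims.sum()                 # 0.35795, 0.32375, 0.31831
for c in ("A", "B"):
    # A: 0.67625, B: 0.32375 (the slide sums the rounded values: 0.67626)
    print(c, round(normed[cls == c].sum(), 5))
```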
Methods – Support Vector Machine • Goal: find the optimal decision boundary between the 2 classes
• SVM heuristic for separable problems (non-overlapping classes): – Construct the optimal separating hyperplane by maximizing the margin
Methods – Support Vector Machine
[Figure: a linearly separable vs. a not linearly separable problem; finding the maximum margin posed as a minimization problem]
Methods – Support Vector Machine • Use a projection to a higher-dimensional space via a kernel function, e.g. (x, y) → (x, y, xy); a sketch follows below.
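A minimal sketch of this particular map: XOR-like points that no line separates in 2-D become separable by the plane z = 0 after adding the coordinate x·y.

```python
import numpy as np

def lift(points):
    """Map 2-D points (x, y) to 3-D points (x, y, x*y)."""
    x, y = points[:, 0], points[:, 1]
    return np.column_stack((x, y, x * y))

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
# In 2-D the diagonal pairs form an XOR pattern; after lifting, the third
# coordinate is +1 for one class and -1 for the other.
print(lift(X)[:, 2])   # [ 1.  1. -1. -1.]
```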
Methods – Linear SVM
[Figure: the linear SVM boundary between the two classes, and the linear SVM in the higher-dimensional space]
Methods – SVM (multi-class)
• Fruits on this side of the hyperplane given by SVM1 cannot be bananas.
[Figure: three hyperplanes SVM1, SVM2, SVM3, one per class, jointly partitioning the space]
Methods – Multilayer Perceptron • Construct a non-linear decision boundary • Classify the test case using the decision boundary
[Figure: non-linear boundary separating the two classes]
Methods – Probabilistic Neural Network
• Parallel implementation of the Bayes-Parzen classifier
• Bayes decision criterion:
– p_k: prior probability that a case belongs to class k
– c_k: cost associated with a case of this class being misclassified
– f_k: estimated density of class k
• An unknown case z is classified as a member of class i if, for all j ≠ i: p_i · c_i · f_i(z) > p_j · c_j · f_j(z)
[Figure: PNN architecture: input unit I_X,Y presented with the unknown case, pattern units for the learning cases x_1, …, x_7 and y_1, …, y_7, summation units Σ_X and Σ_Y yielding the density estimates f̂_X and f̂_Y, and output unit O_X,Y comparing the posteriors p̂(X|·) and p̂(Y|·)]
Methods – Probabilistic Neural Network • Parallel implementation of the Bayes-Parzen classifier • Takes into account class densities and class priors • Estimates class posteriors for test cases • Classifies new cases, e.g. using argmax(p_i); a sketch follows below.
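A minimal Parzen-window sketch of the PNN decision rule above, with Gaussian kernels; sigma, the priors, and the costs are free parameters here, not values from the study.

```python
import numpy as np

def pnn_classify(z, X, y, priors, costs, sigma=1.0):
    """Return the class i maximizing p_i * c_i * f_i(z)."""
    scores = {}
    for k in np.unique(y):
        Xk = X[y == k]
        # Parzen density estimate at z (Gaussian kernels; the normalizing
        # constant is omitted since it is identical for all classes).
        sq_dists = np.sum((Xk - z) ** 2, axis=1)
        f_k = np.mean(np.exp(-sq_dists / (2 * sigma ** 2)))
        scores[k] = priors[k] * costs[k] * f_k
    return max(scores, key=scores.get)
```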
Curse of Dimensionality • A.k.a. the “large p, small n” problem: many variables, few observations • Frequent in the life sciences (e.g., microarray data analysis) • Most machine learning methods have been developed for scenarios characterized by many observations and few variables • Problem: how to define similarity? • OK, similarity = 1 − distance, but how to define distance?
Curse of Dimensionality • L1 norm: Manhattan distance • L2 norm: Euclidean distance • The higher the dimension of the data, the less meaningful the Lk norm becomes: the contrast between nearest and farthest neighbour shrinks, and the faster the larger k is [Aggarwal et al., 2001, “On the surprising behavior of distance metrics in high dimensional space”] • Fractal distance: d(x, y) = (Σ_i |x_i − y_i|^fract)^(1/fract), with 0 < fract < 1
• Implemented for PNN and k-NN in the present study (a sketch follows below) • Additional tuning parameter: fract
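A sketch of the fractional (“fractal”) Lk-style distance with exponent fract < 1, as it might be plugged into k-NN or PNN; the parameter name fract follows the slide.

```python
import numpy as np

def fractal_distance(a, b, fract=0.5):
    """(sum_i |a_i - b_i|**fract) ** (1/fract), with 0 < fract < 1."""
    diff = np.abs(np.asarray(a) - np.asarray(b))
    return np.sum(diff ** fract) ** (1.0 / fract)

# fract = 1 gives the Manhattan distance, fract = 2 the Euclidean distance;
# fract < 1 preserves more contrast between neighbours in high dimensions.
```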
Task #3: CART – Background • Method: Classification and Regression Tree (CART) • Heuristic: recursive partitioning of the data set • Example: [figure]
Task #3: CART – Background • Problem: censored observations (for patients still alive at last follow-up, only a lower bound on the survival time is known)
Results of Task #3: Regression [1/3]
[Figures: node #4 (learning set) and node #4 (test set)]