Microarray Data Integration and Machine Learning Techniques for Lung Cancer Survival Prediction
Daniel Berrar, Brian Sturgeon, Ian Bradbury, C. Stephen Downes, Werner Dubitzky
November 14, 2003

Outline
• Summary of Results (1 slide)
• Overview of Tasks (1 slide)
• Data Integration (4 slides)
• Methods (6 slides)
• Results and Biological Interpretation (6 slides)
• Conclusions (1 slide)

Summary of Results
• With respect to tasks:
– Classification task: prediction of 5-year survival is most accurate when we build a model using only patient data (age, tumour stage, …);
– Regression task: prediction of survival in months is more accurate for the model relying on expression data than on patient data, and best when the model relies on both patient and expression data.
• With respect to methods:
– “Best” model: decision tree

Tasks
• Task #1: Data integration
– Integration of the Harvard and Michigan lung cancer microarray data sets and data pre-processing;
• Task #2: Classification
– (a) Prediction of 5-year survival of patients based on (1) patient data, (2) expression data, and (3) both patient and expression data;
– (b) Comparison of 6 machine learning methods;
• Task #3: Regression
– Prediction of survival time using (1) patient data, (2) expression data, and (3) both patient and expression data;
• Task #4: Interpretation
– Biological interpretation of identified genes.

Task #1: Data Integration [1/4]
(figure)

Task #1: Data Integration [2/4]
(figure: integrated data set of 211 patients, comprising patient data, expression data for 3,588 genes, and the target variables)

Task #1: Data Integration [3/4]
• Data pre-processing for the classification task:
– Group patients into 2 classes:
• LOW RISK: survival ≥ 5 years
• HIGH RISK: survival < 5 years
– Discard patients that are censored before 60 months
– Remaining number of patients: 136
• Data pre-processing for the regression task:
– Include all 211 patients.
• Data pre-processing for both tasks:
– Generate a learning set and a test set by randomly splitting the entire data set (~70% : ~30%); a minimal sketch follows.
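The slides do not show code for this step; the following is a minimal Python sketch of the labelling and splitting logic, assuming a pandas DataFrame with hypothetical columns `survival_months` and `censored` (neither name appears in the original).

```python
import numpy as np
import pandas as pd

def preprocess_for_classification(df: pd.DataFrame, seed: int = 0):
    """Label patients LOW/HIGH risk and drop early-censored cases.

    Assumes hypothetical columns 'survival_months' (float) and
    'censored' (bool: follow-up ended before a death was observed).
    """
    survived = df["survival_months"] >= 60                        # LOW RISK
    died_early = (df["survival_months"] < 60) & ~df["censored"]   # HIGH RISK
    kept = df[survived | died_early].copy()   # discard cases censored before 60 months
    kept["risk"] = np.where(kept["survival_months"] >= 60, "LOW", "HIGH")

    # Random ~70% : ~30% split into learning and test set
    mask = np.random.default_rng(seed).random(len(kept)) < 0.7
    return kept[mask], kept[~mask]
```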

Task #1: Data Integration [4/4]
(figure: for the classification task, the 136 patients are split into a learning set of 96 and a test set of 40; for the regression task (CART), the 211 patients are split into a learning set of 148 and a test set of 63; in each case, models are built on (1) patient data, (2) expression data, and (3) patient + expression data)


Methods – Overview
• Methods used to address the classification task:
(1) k-nearest neighbour (k-NN)
(2) Decision tree C5.0
(3) Boosted decision trees
(4) Support vector machines (SVMs)
(5) Artificial neural networks (multilayer perceptrons, MLPs)
(6) Probabilistic neural networks (PNNs)
• Method used to address the regression task:
(1) Classification and regression tree (CART)

Methods – Comparison of Principles
• Consider the following 2-class problem
(figure: two intertwined spiral classes; the defining equations are given on the later slide)

Methods – Decision Tree
• Recursively split the data set into decision regions and generate a rule set
• Classify the test case using the rule set
(figure: root node; if y ≤ split #1, assign a class; if y > split #1, split again)
• Boosted decision trees: aggregate decision trees into a committee by weighted voting and resampling of the data set; a minimal sketch follows.
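C5.0 itself is a commercial tool; as an illustrative stand-in (not the authors' implementation), here is a minimal scikit-learn sketch of a single tree and a boosted committee on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 136-patient classification matrix
X, y = make_classification(n_samples=136, n_features=50, n_informative=8,
                           random_state=0)

# Single tree: recursive splitting into decision regions
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Boosting: a committee of trees combined by weighted voting, with the
# learning set reweighted after each round
boosted = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # 'base_estimator' in older scikit-learn
    n_estimators=50, random_state=0,
).fit(X, y)

print(tree.score(X, y), boosted.score(X, y))
```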

Methods – Support Vector Machine
• Find the optimal separating hyperplane by maximizing the margin between the 2 classes
• Classify the test case using the hyperplane
(figure: two classes separated by a maximum-margin hyperplane)

Methods – Strengths and Weaknesses
Most (if not all) models ultimately rely on a definition of distance between objects. This definition is not trivial in high-dimensional space. Distance metric as a tuning parameter? → fractal distance [Aggarwal et al., ICDT, 2001]

* Lee Y., Lee C.K.: Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics 19(9), pp. 1132–1139 (2003).

Results of Task #2: Classification


Methods – Classification and Regression Tree
• Algorithm is similar to the decision tree C5.0
• Heuristic is based on recursive partitioning of the data set
• Differences: CART grows strictly binary trees and also handles continuous targets, i.e. regression; a minimal sketch follows.
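scikit-learn's tree module implements the CART algorithm; a minimal regression sketch on synthetic data (not the study's actual features) could look as follows.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(148, 20))        # stand-in for the 148-patient learning set
# Synthetic survival times in months, driven by the first feature
months = np.clip(60 + 20 * X[:, 0] + rng.normal(scale=10, size=148), 1, None)

# CART for regression: binary splits chosen to minimise squared error per node
cart = DecisionTreeRegressor(max_depth=3, min_samples_leaf=10, random_state=0)
cart.fit(X, months)
print(cart.predict(X[:3]))            # predicted survival in months
```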

Results of Task #3: Regression [1/3]
• Evaluation criteria:
– How many death events are correctly identified as death events, and how many are not? → accuracy
– For the correctly identified death events, what is the deviance of the residuals between the real and the predicted survival time? (one plausible reading is sketched below)
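The deck does not define these criteria formally; one plausible reading, using squared residuals as a stand-in for the unspecified deviance, is sketched below.

```python
import numpy as np

def evaluate_survival_regression(pred_months, true_months, died, threshold=60.0):
    """One plausible reading of the slide's criteria (not the authors' code).

    Accuracy: agreement on whether a death event occurs before `threshold`.
    'Deviance': here simply squared residuals over correctly identified
    death events; the original deviance definition is not preserved.
    """
    pred_event = np.asarray(pred_months) < threshold
    true_event = np.asarray(died) & (np.asarray(true_months) < threshold)
    accuracy = float((pred_event == true_event).mean())
    hit = pred_event & true_event
    residuals = np.asarray(true_months) - np.asarray(pred_months)
    return accuracy, float((residuals[hit] ** 2).sum())
```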

Results of Task #3: Regression [2/3]
(figure)

Results of Task #3: Regression [3/3]
(figure)


Task #4: Biological Interpretation [1/2]
• How to interpret the results? → Using literature, OMIM, PubMed, …
• Number of features relevant for the classification task: 8, e.g. ZNF174 (zinc finger protein)
• Proteins of this family probably have an impact on repression of growth factor gene expression [OMIM, 603900]
• Example: the Wilms tumour suppressor WT1 encodes a zinc finger protein that downregulates the expression of various growth factor genes [OMIM, 603900]
• Decision tree: overexpression of ZNF174 is associated with LOW RISK, underexpression with HIGH RISK
• ZNF174: important marker in Burkitt’s lymphoma cells [Li et al., PNAS, May 2003]

Task #4: Biological Interpretation [2/2]
• Number of features relevant for the regression task: 5, e.g. NifU
• Function not fully understood yet
• Likely to be involved in the mobilization of iron and sulfur for nitrogenase-specific iron-sulfur cluster formation
• Important for breast cancer classification [Hedenfalk et al., N Engl J Med, Feb. 2001]
• Decision tree: overexpression of NifU is associated with good clinical outcome for patients with early tumour stage.

Conclusions
• Integrating clinical and transcriptional data might improve survival outcome prediction;
• “Best” model in this study: decision tree, but…
• …there is no universal method of choice;
• No Free Lunch Theorem: “No classifier is inherently superior to any other. The type of the problem determines which classifier is most appropriate.”
• George Box: “Statisticians, like artists, have the bad habit of falling in love with their models.”

Acknowledgements
• Brian Sturgeon
• Ian Bradbury
• C. Stephen Downes
• Werner Dubitzky
Supplementary information will be available at http://research.bioinformatics.ulster.ac.uk/~dberrar/camda03.html.

Methods – Comparison of Principles
• Consider the following 2-class problem of cases (x, y):
– Class A: x = t cos(t), y = t sin(t)
– Class B: x = t sin(t), y = t cos(t)
– with t = {0.1, 0.2, …, 10}
(the two classes form intertwined spirals; a generation sketch follows)
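A few lines of Python reproduce the slide's two-spiral data exactly as defined above.

```python
import numpy as np

t = np.arange(0.1, 10.05, 0.1)        # t = 0.1, 0.2, ..., 10

# Class A: x = t cos(t), y = t sin(t); Class B: the coordinates swapped
class_a = np.column_stack((t * np.cos(t), t * np.sin(t)))
class_b = np.column_stack((t * np.sin(t), t * np.cos(t)))

X = np.vstack((class_a, class_b))
y = np.repeat(["A", "B"], len(t))     # two intertwined spirals
```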

Methods – k-Nearest Neighbour
• Retrieve the nearest neighbours of the test case
• Classify the test case based on the class membership of the nearest neighbours

Methods – k-Nearest Neighbour
Learning:
• For each case in the learning set, determine all neighbours and rank them with respect to similarity = 1 − distance
• Determine the globally optimal number of nearest neighbours k_opt (e.g., in LOOCV)
Test:
• Use k_opt for classifying the test cases
• Interpret normalized similarities as a measure of confidence
Suppose that k_opt = 3 and the following nearest neighbours:

Case #   Similarity (= 1 − distance)   Normalized   Class
27       0.0921                        0.35795      A
29       0.0833                        0.32375      B
34       0.0819                        0.31831      A

Confidence for class A: 0.35795 + 0.31831 = 0.67626
Confidence for class B: 0.32375
(a sketch of this weighted vote follows)
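The confidence computation on this slide is simple enough to state in a few lines; here is a sketch that reproduces the example numbers.

```python
import numpy as np

def knn_confidences(similarities, labels):
    """Similarity-weighted k-NN vote: normalize the similarities
    (= 1 - distance) of the k nearest neighbours, then sum per class."""
    w = np.asarray(similarities, dtype=float)
    w = w / w.sum()
    labels = np.asarray(labels)
    return {c: float(w[labels == c].sum()) for c in np.unique(labels)}

# Slide example: k_opt = 3, neighbours #27 (A), #29 (B), #34 (A)
print(knn_confidences([0.0921, 0.0833, 0.0819], ["A", "B", "A"]))
# -> {'A': 0.676..., 'B': 0.323...}
```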

Methods – Support Vector Machine
• Goal: find the optimal decision boundary between the 2 classes
• SVM heuristic for separable problems (non-overlapping classes):
– Construct the optimal separating hyperplane by maximizing the margin

Methods – Support Vector Machine
(figures: a linearly separable case vs. a case that is not linearly separable)
Margin maximization leads to a minimization problem: minimize ½‖w‖² subject to y_i(w · x_i + b) ≥ 1 for all learning cases (x_i, y_i).

Methods – Support Vector Machine
• Use projection to a higher-dimensional space via a kernel function, e.g. (x, y) → (x, y, xy); a minimal sketch follows.
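A small illustration of this projection on an XOR-style toy problem (not the study's data): the third coordinate xy makes the classes linearly separable, and a polynomial kernel achieves the same implicitly.

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like problem: not linearly separable in (x, y)
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([0, 0, 1, 1])

# Explicit map (x, y) -> (x, y, xy): xy is +1 for class 0 and -1 for class 1
X3 = np.column_stack((X[:, 0], X[:, 1], X[:, 0] * X[:, 1]))
print(SVC(kernel="linear").fit(X3, y).score(X3, y))   # now separable: 1.0

# Kernel trick: a degree-2 polynomial kernel performs the projection implicitly
print(SVC(kernel="poly", degree=2).fit(X, y).score(X, y))
```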

Methods – Linear SVM
(figures: linear SVM in the original space; linear SVM after projection to the higher-dimensional space)

Methods – SVM (multi-class)
• Fruits on this side of the hyperplane, given by SVM1, cannot be bananas.
(figure: three hyperplanes SVM1, SVM2, SVM3 jointly partitioning the classes)

Methods – Multilayer Perceptron
• Construct a non-linear decision boundary
• Classify the test case using the decision boundary
(figure: two classes separated by a non-linear boundary; a minimal sketch follows)
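As an illustration (not the authors' network), a small scikit-learn MLP learns a non-linear boundary on the two-spiral data from the earlier sketch.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Regenerate the two-spiral data from the earlier sketch
t = np.arange(0.1, 10.05, 0.1)
X = np.vstack((np.column_stack((t * np.cos(t), t * np.sin(t))),
               np.column_stack((t * np.sin(t), t * np.cos(t)))))
y = np.repeat([0, 1], len(t))

# A single hidden layer already yields a non-linear decision boundary
mlp = MLPClassifier(hidden_layer_sizes=(30,), max_iter=5000, random_state=0)
mlp.fit(X, y)
print(mlp.score(X, y))
```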

Methods – Probabilistic Neural Network
• Parallel implementation of the Bayes-Parzen classifier
• Bayes decision criterion:
– p_k: prior probability that a case belongs to class k
– c_k: costs associated with a case of this class being misclassified
– f_k: estimated density of class k
• An unknown case z is classified as a member of class i if, for all j ≠ i:
p_i c_i f_i(z) > p_j c_j f_j(z)
(figure: PNN architecture; the unknown case z feeds the pattern units x1…x7 and y1…y7 of the input layer I_X,Y; summation units Σ_X and Σ_Y form the density estimates f̂_X and f̂_Y; the output unit O_X,Y compares the posteriors p̂(X|z) and p̂(Y|z))

Methods – Probabilistic Neural Network
• Parallel implementation of the Bayes-Parzen classifier
• Takes into account class densities and class priors
• Estimates class posteriors for test cases
• Classifies new cases, e.g. using argmax(p_i); a minimal sketch follows.
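A minimal NumPy sketch of the Bayes-Parzen decision rule with Gaussian Parzen windows (an illustration, not the study's PNN; the smoothing width sigma is a hypothetical tuning parameter).

```python
import numpy as np

def pnn_posteriors(X_learn, y_learn, z, sigma=1.0, priors=None, costs=None):
    """Score each class k by p_k * c_k * f_k(z), where f_k is a Gaussian
    Parzen-window density estimate over that class's learning cases."""
    classes = np.unique(y_learn)
    priors = priors or {k: 1.0 / len(classes) for k in classes}
    costs = costs or {k: 1.0 for k in classes}
    scores = {}
    for k in classes:
        Xk = X_learn[y_learn == k]
        d2 = ((Xk - z) ** 2).sum(axis=1)               # squared distances to z
        f_k = np.exp(-d2 / (2.0 * sigma ** 2)).mean()  # Parzen density estimate
        scores[k] = priors[k] * costs[k] * f_k
    total = sum(scores.values())
    return {k: s / total for k, s in scores.items()}   # normalized posteriors

# Classify z via argmax over the posteriors, as on the slide:
# label = max(posteriors, key=posteriors.get)
```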

Curse of Dimensionality
• A.k.a. the large-p, small-n problem: many variables, few observations
• Frequent in the life sciences (e.g., microarray data analysis)
• Most machine learning methods have been developed for scenarios characterized by many observations and few variables
• Problem: how to define similarity?
• OK, similarity = 1 − distance, but how to define distance?

Curse of Dimensionality
• L1 norm: Manhattan distance
• L2 norm: Euclidean distance
• The higher the dimension of the data, the less meaningful the L_k norm becomes: the contrast between the nearest and the farthest neighbour shrinks [Aggarwal et al., 2001, “On the surprising behavior of distance metrics in high dimensional space”]
• Fractal (fractional) distance: dist_f(x, y) = (Σ_i |x_i − y_i|^f)^(1/f), with 0 < f < 1
• Implemented for PNN and k-NN in the present study; a sketch follows
• Additional tuning parameter: fract
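A sketch of the fractional distance with the tuning parameter fract, as the study describes plugging into PNN and k-NN (the implementation details here are assumptions).

```python
import numpy as np

def fractional_distance(a, b, fract=0.5):
    """Fractional L_f dissimilarity, 0 < fract < 1 (Aggarwal et al., 2001).
    Not a metric for fract < 1, but it preserves more near/far contrast
    in high-dimensional spaces than the L1 or L2 norm."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return float((diff ** fract).sum() ** (1.0 / fract))

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 3588))      # e.g. two 3,588-gene expression profiles
print(fractional_distance(a, b, fract=0.3))
```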

Task #3: CART – Background
• Method: Classification and Regression Tree (CART)
• Heuristic: recursive partitioning of the data set
• Example: (figure)

Task #3: CART – Background
• Problem: censored observations

Results of Task #3: Regression

Results of Task #3: Regression [1/3]
(figure panels: Node #4 (Learning); Node #4 (Test))