Gene Selection and Classification by Entropy-based Recursive Feature Elimination (E-RFE)
IJCNN03 - Bioinformatics, Portland, 24th July 2003
C. Furlanello, M. Serafini, S. Merler, G. Jurman
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 1/16
Practical Feature Ranking and Honest Classification
Given a data set (Training), with:
- number of genes: p
- sample size: N

- Build an accurate predictive model
- Identify the most important variables

but (two opposite constraints):
1. Avoid the "selection bias" effect
2. Maintain a computationally feasible machine

Problem: for the error estimate to be reliable, many resampling runs are needed; the RFE algorithm then becomes computationally too expensive, and the procedure can be impossible to apply to microarray data.
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 2/16
Recursive Feature Elimination (Guyon et al., 2002)
Recursively removes the feature which causes the minimum variation in a cost function:
- Given the SVM, whose weight vector is $w = \sum_{i \in SV} y_i \alpha_i^{\ast} x_i$
- Cost function: $J(w) = \frac{1}{2} \|w\|^2$
- Variation in the cost function: $\delta J(i) = \frac{1}{2} \frac{\partial^2 J}{\partial w_i^2} (\delta w_i)^2 = \frac{1}{2} w_i^2$

RFE algorithm: given F = list of ranked features = ∅ and R = set of remaining features = {1, 2, ..., n}:
1. Train the SVM on the reduced training set S (each sample x_i restricted to the features k ∈ R)
2. Compute δJ(i) for every i ∈ R
3. Remove the feature with minimum δJ(i) and put it at the top of F
4. Repeat 1-3 until R = ∅
F contains all the variables, ranked according to their importance in building the SVM
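A minimal sketch of the 1-RFE loop just described, assuming a linear-kernel SVM from scikit-learn (the original work uses its own implementation; this is illustrative only):

```python
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, C=1.0):
    """Rank features by recursive elimination; returns indices, most important first."""
    R = list(range(X.shape[1]))          # R: remaining features
    F = []                               # F: ranked list, filled from the top
    while R:
        clf = SVC(kernel="linear", C=C).fit(X[:, R], y)
        w = clf.coef_.ravel()            # weight vector of the linear SVM
        dJ = 0.5 * w ** 2                # delta J(i) = w_i^2 / 2
        worst = int(np.argmin(dJ))       # feature whose removal changes J the least
        F.insert(0, R.pop(worst))        # put it at the top of F
    return F
```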
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 3/16
Optimal number of features

[Diagram: the training set TR is resampled into pairs (TR_k, TS_k); on each TR_k, learning and feature ranking (RF_k) produce models M_k1 ... M_kn with increasing numbers of features; testing them on TS_k gives errors TE_k1 ... TE_kn, which are averaged over k into TE_1 ... TE_n; an exponential fit of this curve gives the optimal number of features n*.]

[Plot: CV error curve from 3-fold stratified cross-validation, % error vs. number of features (1 to 1000, log scale).]
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 4/16
The selection bias problem (Ambroise & McLachlan, 2002)

[Left plot: colon cancer data, % error vs. number of features, for true labels (blue) and random labels (black).]
Colon cancer data: zero (leave-one-out) error with only 8 genes (blue); after randomizing the labels, 14 genes still discriminate perfectly (black). (Similar results have been published in PNAS, Machine Learning, Genome Research, etc.)

[Right plot: synthetic data, % error vs. number of features, for 1000 informative variables out of 5000 and for 0 informative variables out of 5000.]
Synthetic data: zero (10-fold CV) error with 9 variables when 1000 variables are significant (black); zero (10-fold CV) error with 20 variables on no-information data (pink).
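A hedged sketch of the two protocols on no-information data, assuming scikit-learn (the dataset sizes and the 20-feature target are arbitrary choices for illustration, not the authors' setup):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(62, 2000))                      # random "expression" values
y = np.array([0] * 31 + [1] * 31)                    # uninformative labels

# Biased protocol: rank genes on ALL samples, then cross-validate on the survivors.
sel = RFE(LinearSVC(dual=False), n_features_to_select=20, step=0.5).fit(X, y)
biased = cross_val_score(LinearSVC(dual=False), X[:, sel.support_], y, cv=10)

# Honest protocol: the ranking is redone inside every training fold.
pipe = make_pipeline(RFE(LinearSVC(dual=False), n_features_to_select=20, step=0.5),
                     LinearSVC(dual=False))
honest = cross_val_score(pipe, X, y, cv=10)
print(biased.mean(), honest.mean())                  # biased >> honest (honest is near 0.5)
```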
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 5/16
Model Validation and error estimation

[Diagram: resampling procedure with B train/test splits TR_b / TS_b (splitting in proportion 3/4-1/4); on each TR_b, feature ranking (RF) gives an optimal feature set OFS_b and a model M_b, which is tested on TS_b to give TE_b; averaging the TE_b over the B replicates gives ATE, a predictive estimate of the test error.]

[Plot: feature frequencies, weight value vs. number of extractions over the replicated runs, one labelled point per gene.]

The multiplicity of extractions of a variable in the replicated experiments may be used as an additional measure of importance.
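A small sketch of the multiplicity count (the function name and the idea of passing one selected set per replicate are my assumptions, not the authors' code):

```python
from collections import Counter

def extraction_frequencies(selected_sets):
    """selected_sets: one optimal feature set (e.g. OFS_b) per train/test replicate."""
    counts = Counter()
    for s in selected_sets:
        counts.update(s)                 # count each gene once per replicate
    return counts                        # genes extracted often are consistently important

# usage: extraction_frequencies([{12, 58, 493}, {12, 58, 7}, {58, 7, 493}]).most_common(3)
```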
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 6/16
Distribution of feature weights

[Plot: SVM weight value vs. feature index, with the histogram of the weights; most weights are close to zero.]

Idea: instead of eliminating one variable at each step, we could discard many of them at each step (e.g. SQRT-RFE eliminates √(#remaining variables) features at each step).
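A sketch of one SQRT-RFE step under stated assumptions (the rounding rule and tie handling are not specified on the slide; weights are assumed precomputed from the current SVM):

```python
import numpy as np

def sqrt_rfe_step(R, weights):
    """R: indices of the remaining features; weights: |w_i| for those features.
    Drops the sqrt(#R) smallest-weight features in a single step."""
    k = max(1, int(np.sqrt(len(R))))
    order = np.argsort(weights)               # ascending: smallest weights first
    drop = [R[i] for i in order[:k]]
    keep = [R[i] for i in order[k:]]
    return keep, drop
```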
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 7/16
Entropy and weight distributions

[Panels: weight value vs. feature index and the corresponding weight histogram, for two situations.]
- High entropy: weights are very concentrated, so we would like to eliminate many of them at the same time.
- Low entropy: weights are quite equally distributed, so we should be careful not to eliminate too many of them.

Entropy as a measure of concentration:
$H = -\sum_{i=1}^{n_{int}} p_i \log p_i$, with $0 \le H \le \log n_{int}$
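The entropy above, computed from a histogram of the weights; a minimal sketch in which n_int = 100 bins is only an example value:

```python
import numpy as np

def weight_entropy(weights, n_int=100):
    counts, _ = np.histogram(weights, bins=n_int)
    p = counts / counts.sum()                 # p_i: fraction of weights in bin i
    p = p[p > 0]                              # empty bins contribute 0 to the sum
    return -np.sum(p * np.log(p))             # 0 <= H <= log(n_int)
```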
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 8/16
Entropy, Mean and weight distributions

[Panels: high-entropy case, weight value vs. feature index and the corresponding weight histogram.]

Weights are quite concentrated: we can eliminate many of them.

The mean of the weights can be a further discriminating measure for our purpose (we want to eliminate only the low weights).
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 9/16
Entropy-based RFE

[Flowchart:] while features remain in R:
- if #R < R_t, fall back to standard RFE (one feature per step);
- otherwise compute the SVM on the features in R, then the entropy H(w) and the mean M(w) of the weights w_i, i ∈ R;
- thresholds: H_t = (1/2) log n_int, M_t = 0.2;
- comparing H with H_t and M with M_t selects one of three elimination strategies; the eliminated features are removed from R and put at the top of F:
  - s1: eliminate the variables whose weight falls in the first histogram bin
  - s3: cautiously discard the features whose weight is in the leftmost quantile

[Panels: weight value vs. feature index illustrating the three regimes s1, s2, s3.]
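A sketch of one E-RFE elimination step under stated assumptions: which side of H_t and M_t triggers s1 versus s3, the number of bins n_int and the width of the "leftmost quantile" follow my reading of the flowchart and are not guaranteed to match the authors' implementation.

```python
import numpy as np

def erfe_step(weights, n_int=100):
    """One elimination step on the |w_i| of the remaining features;
    returns the positions (within `weights`) to discard."""
    counts, edges = np.histogram(weights, bins=n_int)
    p = counts / counts.sum()
    H = -np.sum(p[p > 0] * np.log(p[p > 0]))         # entropy of the weight histogram
    M = weights.mean()                                # mean weight
    H_t, M_t = 0.5 * np.log(n_int), 0.2               # thresholds from the flowchart
    if H > H_t and M < M_t:                           # assumed s1 regime:
        return np.flatnonzero(weights <= edges[1])    #   drop everything in the first bin
    if H > H_t:                                       # assumed s3 regime: cautiously drop
        k = max(1, len(weights) // 10)                #   the leftmost decile (quantile
        return np.argsort(weights)[:k]                #   width is an assumption)
    return np.argsort(weights)[:1]                    # otherwise behave like 1-RFE
```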
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 10/16
Synthetic data

100 samples, 1000 features, of which n are significant (Gaussian distribution). SQRT-RFE requires 63 steps, independently of the number of significant features.

[Plots, for n = 100, 400, 1000: weight value vs. feature index, the weight histogram, and number of remaining features vs. step for SQRT-RFE and E-RFE.]

- n = 100: 26 steps (5 steps to recognize the unimportant features); the 100th significant variable is ranked 104th.
- n = 400: 99 steps; the 400th significant variable is ranked 401st.
- n = 1000: 398 steps.
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 11/16
Microarray data

Colon cancer data: 2000 genes, 62 tissues: 22 normal and 40 tumor cases (Alon et al., 1999)
Lymphoma data: 4026 genes, 96 samples: 72 cancer and 24 non-cancer (Alizadeh et al., 2000)
Tumor vs. Metastases data: 16063 genes, 76 cases: 64 primary adenocarcinomas and 12 metastatic adenocarcinomas (Ramaswamy et al., 2001; Ramaswamy et al., 2003)
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 12/16
Microarray data: results

ATE: average test error rate (%) over 50 validation runs; n*: average optimal number of features; Time: elapsed time for each run (s).

Colon cancer    ATE           n*         Time
E-RFE           17.2 ± 7.6    70 ± 24    34
SQRT-RFE        18.4 ± 8.4    70 ± 27    194
RFE             18.1 ± 8.4    70 ± 27    1780

Lymphoma        ATE           n*         Time
E-RFE           3.8 ± 4.3     80 ± 9     128
SQRT-RFE        4.3 ± 4.6     80 ± 9     641
RFE             3.9 ± 4.1     80 ± 7     29612

Tumor vs. Metastases    ATE           n*         Time
E-RFE                   16.6 ± 2.9    70 ± 28    1780
SQRT-RFE                16.2 ± 4.2    70 ± 25    12125

[Plot: number of remaining features (up to 16063) vs. step for SQRT-RFE and E-RFE on the Tumor vs. Metastases data.]
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 13/16
Pitfalls of 1-RFE

First experiment: 100 samples, 5 variables
- 3 variables U_i, i = 1, 2, 3, uniformly distributed in (0, 1) and (-1, 0)
- 3 variables V_i, i = 1, 2, 3, uniformly distributed in (-1, 1)
- each pair (U_i, V_i) is subjected to a counterclockwise rotation of angle π/4 into (u_i, v_i)
- u_1, u_2, u_3, v_1, v_2 are the chosen features

The two groups of variables are equally important. 1-RFE does not consider the correlation and gives v_1, v_2, u_1, u_2, u_3, while RFE correctly mixes them: v_1, u_3, u_1, v_2, u_2.

[Table: SVM weights of the remaining features at RFE steps 1-4.]

Second experiment: 201 copies of u_1 and 200 copies of v_1: 1-RFE ranks all the u variables before the v's, while E-RFE correctly mixes them.
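A sketch of how data like those of the first experiment could be generated; the way the class labels enter the U variables (one class in (0, 1), the other in (-1, 0)) and the seed are my assumptions, since the slide only states the ranges:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
y = rng.integers(0, 2, size=n)                        # two classes
# U_i: informative, uniform in (0, 1) for one class and (-1, 0) for the other (assumed)
U = rng.uniform(0, 1, size=(n, 3)) * np.where(y[:, None] == 1, 1.0, -1.0)
V = rng.uniform(-1, 1, size=(n, 3))                   # V_i: uninformative, uniform in (-1, 1)
c, s = np.cos(np.pi / 4), np.sin(np.pi / 4)
u = c * U - s * V                                     # counterclockwise rotation of each
v = s * U + c * V                                     #   pair (U_i, V_i) by pi/4
X = np.column_stack([u[:, 0], u[:, 1], u[:, 2], v[:, 0], v[:, 1]])   # chosen features
```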
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 14/16
Experimental design

Given a data set (Training), find the optimal feature set and build the model to classify new samples. To obtain a predictive estimate of the error you need to:
1. Use a resampling procedure (3/4-1/4 train/test splits) for validation
2. Resample each training set (3-fold CV) to find the optimal number of features
3. Use the feature ranker (RFE) and the classifier (SVM) to build models with an increasing number of features
4. Apply the models to the test sets to obtain the (CV) error curve
5. Fit the error curve and estimate the optimal number of features n* as the saturation point of the fitted curve (a sketch of this step follows the list)
6. Build the optimal feature set from the first n* variables of the ranked list
7. Build the model on the different training sets using the optimal feature set
8. Test the models on the test sets to obtain the predictive estimate of the error

Problem: for the error estimate to be reliable, many resampling runs are needed; the RFE algorithm then becomes computationally too expensive, and the procedure can be impossible to apply to microarray data.
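A minimal sketch of step 5, assuming the error curve is fit with a decaying exponential and n* is taken where the fit first comes within a small tolerance of its asymptote; both the functional form and the tolerance are assumptions, not the authors' stated choices.

```python
import numpy as np
from scipy.optimize import curve_fit

def optimal_n_features(n_feat, cv_error, tol=0.01):
    """n_feat, cv_error: the (CV) error curve from step 4; returns the estimated n*."""
    f = lambda n, a, b, c: a * np.exp(-b * n) + c        # error(n) ~ a e^{-bn} + c
    (a, b, c), _ = curve_fit(f, n_feat, cv_error, p0=(0.3, 0.01, 0.1), maxfev=10000)
    grid = np.arange(1, int(max(n_feat)) + 1)
    sat = f(grid, a, b, c) <= c + tol                    # within tol of the asymptote c
    return int(grid[np.argmax(sat)]) if sat.any() else int(max(n_feat))
```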
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 15/16
References

(Alizadeh et al., 2000) A. Alizadeh, M. Eisen, M. Davis et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403: 503–511.
(Alon et al., 1999) U. Alon, N. Barkai, D. Notterman et al. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. PNAS, 96(12): 6745–6750.
(Ambroise & McLachlan, 2002) C. Ambroise & G. McLachlan (2002). Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS, 99(10): 6562–6566.
(Guyon et al., 2002) I. Guyon, J. Weston, S. Barnhill and V.N. Vapnik (2002). Gene selection for cancer classification using Support Vector Machines. Machine Learning, 46: 389–422.
(Ramaswamy et al., 2003) S. Ramaswamy et al. (2003). A molecular signature of metastasis in primary solid tumors. Nature Genetics, 33: 1–6.
(Ramaswamy et al., 2001) S. Ramaswamy et al. (2001). Multiclass cancer diagnosis using tumor gene expression signatures. PNAS, 98(26): 15149–15154.
(Vapnik, 2000) V.N. Vapnik (2000). The Nature of Statistical Learning Theory. Springer-Verlag.
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 16/16