Gene Selection and Classification by Entropy-based Recursive Feature Elimination (E-RFE)
IJCNN03 - Bioinformatics, Portland, 24th July 2003
C. Furlanello, M. Serafini, S. Merler, G. Jurman
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 1/16
Practical Feature Ranking and Honest Classification
Given a data set (Training), with:
- number of genes: p
- sample size: N

- Build an accurate predictive model
- Identify the most important variables

but (two opposite constraints):
1. Avoid the "selection bias" effect
2. Maintain a computationally feasible machine

Problem: for the error estimate to be reliable, many resampling runs are needed; the RFE algorithm then becomes computationally too expensive, and the procedure can be impossible to apply to microarray data.
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 2/16
Recursive Feature Elimination (Guyon et al., 2002)
Recursively removes the feature which causes the minimum variation in a cost function:
- Given the SVM, whose weight vector is $w = \sum_{i \in SV} y_i \alpha_i^{\ast} x_i$
- Cost function: $J(w) = \frac{1}{2} \|w\|^2$
- Variation in the cost function: $\delta J(i) = \frac{1}{2} \frac{\partial^2 J}{\partial w_i^2} (\delta w_i)^2 = \frac{1}{2} w_i^2$

RFE algorithm: given F = list of ranked features = ∅ and R = set of remaining features = {1, 2, ..., n}:
1. Train the SVM on the reduced training set S (each sample x_i restricted to the features k ∈ R)
2. Compute δJ(i) for every i ∈ R
3. Remove the feature with minimum δJ(i) and put it at the top of F
4. Repeat 1-3 until R = ∅
F contains all the variables, ranked according to their importance in building the SVM
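A minimal sketch of the 1-RFE loop just described, assuming a linear-kernel SVM from scikit-learn (the original work uses its own implementation; this is illustrative only):

```python
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, C=1.0):
    """Rank features by recursive elimination; returns indices, most important first."""
    R = list(range(X.shape[1]))          # R: remaining features
    F = []                               # F: ranked list, filled from the top
    while R:
        clf = SVC(kernel="linear", C=C).fit(X[:, R], y)
        w = clf.coef_.ravel()            # weight vector of the linear SVM
        dJ = 0.5 * w ** 2                # delta J(i) = w_i^2 / 2
        worst = int(np.argmin(dJ))       # feature whose removal changes J the least
        F.insert(0, R.pop(worst))        # put it at the top of F
    return F
```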
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 3/16
Optimal number of features

[Diagram: the training set TR is resampled into pairs (TR_k, TS_k); on each TR_k, learning and feature ranking (RF_k) produce models M_k1 ... M_kn with increasing numbers of features; testing them on TS_k gives errors TE_k1 ... TE_kn, which are averaged over k into TE_1 ... TE_n; an exponential fit of this curve gives the optimal number of features n*.]

[Plot: CV error curve from 3-fold stratified cross-validation, % error vs. number of features (1 to 1000, log scale).]
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 4/16
The selection bias problem (Ambroise & McLachlan, 2002)

[Left plot: colon cancer data, % error vs. number of features, for true labels (blue) and random labels (black).]
Colon cancer data: zero (leave-one-out) error with only 8 genes (blue); after randomizing the labels, 14 genes still discriminate perfectly (black). (Similar results have been published in PNAS, Machine Learning, Genome Research, etc.)

[Right plot: synthetic data, % error vs. number of features, for 1000 informative variables out of 5000 and for 0 informative variables out of 5000.]
Synthetic data: zero (10-fold CV) error with 9 variables when 1000 variables are significant (black); zero (10-fold CV) error with 20 variables on no-information data (pink).
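A hedged sketch of the two protocols on no-information data, assuming scikit-learn (the dataset sizes and the 20-feature target are arbitrary choices for illustration, not the authors' setup):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(62, 2000))                      # random "expression" values
y = np.array([0] * 31 + [1] * 31)                    # uninformative labels

# Biased protocol: rank genes on ALL samples, then cross-validate on the survivors.
sel = RFE(LinearSVC(dual=False), n_features_to_select=20, step=0.5).fit(X, y)
biased = cross_val_score(LinearSVC(dual=False), X[:, sel.support_], y, cv=10)

# Honest protocol: the ranking is redone inside every training fold.
pipe = make_pipeline(RFE(LinearSVC(dual=False), n_features_to_select=20, step=0.5),
                     LinearSVC(dual=False))
honest = cross_val_score(pipe, X, y, cv=10)
print(biased.mean(), honest.mean())                  # biased >> honest (honest is near 0.5)
```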
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 5/16
Model Validation and error estimation

[Diagram: resampling procedure with B train/test splits TR_b / TS_b (splitting in proportion 3/4-1/4); on each TR_b, feature ranking (RF) gives an optimal feature set OFS_b and a model M_b, which is tested on TS_b to give TE_b; averaging the TE_b over the B replicates gives ATE, a predictive estimate of the test error.]

[Plot: feature frequencies, weight value vs. number of extractions over the replicated runs, one labelled point per gene.]

The multiplicity of extractions of a variable in the replicated experiments may be used as an additional measure of importance.
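A small sketch of the multiplicity count (the function name and the idea of passing one selected set per replicate are my assumptions, not the authors' code):

```python
from collections import Counter

def extraction_frequencies(selected_sets):
    """selected_sets: one optimal feature set (e.g. OFS_b) per train/test replicate."""
    counts = Counter()
    for s in selected_sets:
        counts.update(s)                 # count each gene once per replicate
    return counts                        # genes extracted often are consistently important

# usage: extraction_frequencies([{12, 58, 493}, {12, 58, 7}, {58, 7, 493}]).most_common(3)
```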
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 6/16
Distribution of feature weights

[Plot: SVM weight value vs. feature index, with the histogram of the weights; most weights are close to zero.]

Idea: instead of eliminating one variable at each step, we could discard many of them at each step (e.g. SQRT-RFE eliminates √(#remaining variables) features at each step).
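A sketch of one SQRT-RFE step under stated assumptions (the rounding rule and tie handling are not specified on the slide; weights are assumed precomputed from the current SVM):

```python
import numpy as np

def sqrt_rfe_step(R, weights):
    """R: indices of the remaining features; weights: |w_i| for those features.
    Drops the sqrt(#R) smallest-weight features in a single step."""
    k = max(1, int(np.sqrt(len(R))))
    order = np.argsort(weights)               # ascending: smallest weights first
    drop = [R[i] for i in order[:k]]
    keep = [R[i] for i in order[k:]]
    return keep, drop
```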
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 7/16
Entropy and weight distributions

[Panels: weight value vs. feature index and the corresponding weight histogram, for two situations.]
- High entropy: weights are very concentrated, so we would like to eliminate many of them at the same time.
- Low entropy: weights are quite equally distributed, so we should be careful not to eliminate too many of them.

Entropy as a measure of concentration:
$H = -\sum_{i=1}^{n_{int}} p_i \log p_i$, with $0 \le H \le \log n_{int}$
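The entropy above, computed from a histogram of the weights; a minimal sketch in which n_int = 100 bins is only an example value:

```python
import numpy as np

def weight_entropy(weights, n_int=100):
    counts, _ = np.histogram(weights, bins=n_int)
    p = counts / counts.sum()                 # p_i: fraction of weights in bin i
    p = p[p > 0]                              # empty bins contribute 0 to the sum
    return -np.sum(p * np.log(p))             # 0 <= H <= log(n_int)
```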
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 8/16
Entropy, Mean and weight distributions

[Panels: high-entropy case, weight value vs. feature index and the corresponding weight histogram.]

Weights are quite concentrated: we can eliminate many of them.

The mean of the weights can be a further discriminating measure for our purpose (we want to eliminate only the low weights).
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 9/16
Entropy-based RFE

[Flowchart:] while features remain in R:
- if #R < R_t, fall back to standard RFE (one feature per step);
- otherwise compute the SVM on the features in R, then the entropy H(w) and the mean M(w) of the weights w_i, i ∈ R;
- thresholds: H_t = (1/2) log n_int, M_t = 0.2;
- comparing H with H_t and M with M_t selects one of three elimination strategies; the eliminated features are removed from R and put at the top of F:
  - s1: eliminate the variables whose weight falls in the first histogram bin
  - s3: cautiously discard the features whose weight is in the leftmost quantile

[Panels: weight value vs. feature index illustrating the three regimes s1, s2, s3.]
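A sketch of one E-RFE elimination step under stated assumptions: which side of H_t and M_t triggers s1 versus s3, the number of bins n_int and the width of the "leftmost quantile" follow my reading of the flowchart and are not guaranteed to match the authors' implementation.

```python
import numpy as np

def erfe_step(weights, n_int=100):
    """One elimination step on the |w_i| of the remaining features;
    returns the positions (within `weights`) to discard."""
    counts, edges = np.histogram(weights, bins=n_int)
    p = counts / counts.sum()
    H = -np.sum(p[p > 0] * np.log(p[p > 0]))         # entropy of the weight histogram
    M = weights.mean()                                # mean weight
    H_t, M_t = 0.5 * np.log(n_int), 0.2               # thresholds from the flowchart
    if H > H_t and M < M_t:                           # assumed s1 regime:
        return np.flatnonzero(weights <= edges[1])    #   drop everything in the first bin
    if H > H_t:                                       # assumed s3 regime: cautiously drop
        k = max(1, len(weights) // 10)                #   the leftmost decile (quantile
        return np.argsort(weights)[:k]                #   width is an assumption)
    return np.argsort(weights)[:1]                    # otherwise behave like 1-RFE
```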
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 10/16
Synthetic data

100 samples, 1000 features, of which n are significant (Gaussian distribution). SQRT-RFE requires 63 steps, independently of the number of significant features.

[Plots, for n = 100, 400, 1000: weight value vs. feature index, the weight histogram, and number of remaining features vs. step for SQRT-RFE and E-RFE.]

- n = 100: 26 steps (5 steps to recognize the unimportant features); the 100th significant variable is ranked 104th.
- n = 400: 99 steps; the 400th significant variable is ranked 401st.
- n = 1000: 398 steps.
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 11/16
Microarray data

Colon cancer data: 2000 genes, 62 tissues: 22 normal and 40 tumor cases (Alon et al., 1999)
Lymphoma data: 4026 genes, 96 samples: 72 cancer and 24 non-cancer (Alizadeh et al., 2000)
Tumor vs. Metastases data: 16063 genes, 76 cases: 64 primary adenocarcinomas and 12 metastatic adenocarcinomas (Ramaswamy et al., 2001; Ramaswamy et al., 2003)
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 12/16
Microarray data: results

ATE: average test error rate (%) over 50 validation runs; n*: average optimal number of features; Time: elapsed time for each run (s).

Colon cancer    ATE           n*         Time
E-RFE           17.2 ± 7.6    70 ± 24    34
SQRT-RFE        18.4 ± 8.4    70 ± 27    194
RFE             18.1 ± 8.4    70 ± 27    1780

Lymphoma        ATE           n*         Time
E-RFE           3.8 ± 4.3     80 ± 9     128
SQRT-RFE        4.3 ± 4.6     80 ± 9     641
RFE             3.9 ± 4.1     80 ± 7     29612

Tumor vs. Metastases    ATE           n*         Time
E-RFE                   16.6 ± 2.9    70 ± 28    1780
SQRT-RFE                16.2 ± 4.2    70 ± 25    12125

[Plot: number of remaining features (up to 16063) vs. step for SQRT-RFE and E-RFE on the Tumor vs. Metastases data.]
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 13/16
Pitfalls of 1-RFE

First experiment: 100 samples, 5 variables
- 3 variables U_i, i = 1, 2, 3, uniformly distributed in (0, 1) and (-1, 0)
- 3 variables V_i, i = 1, 2, 3, uniformly distributed in (-1, 1)
- each pair (U_i, V_i) is subjected to a counterclockwise rotation of angle π/4 into (u_i, v_i)
- u_1, u_2, u_3, v_1, v_2 are the chosen features

The two groups of variables are equally important. 1-RFE does not consider the correlation and gives v_1, v_2, u_1, u_2, u_3, while RFE correctly mixes them: v_1, u_3, u_1, v_2, u_2.

[Table: SVM weights of the remaining features at RFE steps 1-4.]

Second experiment: 201 copies of u_1 and 200 copies of v_1: 1-RFE ranks all the u variables before the v's, while E-RFE correctly mixes them.
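A sketch of how data like those of the first experiment could be generated; the way the class labels enter the U variables (one class in (0, 1), the other in (-1, 0)) and the seed are my assumptions, since the slide only states the ranges:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
y = rng.integers(0, 2, size=n)                        # two classes
# U_i: informative, uniform in (0, 1) for one class and (-1, 0) for the other (assumed)
U = rng.uniform(0, 1, size=(n, 3)) * np.where(y[:, None] == 1, 1.0, -1.0)
V = rng.uniform(-1, 1, size=(n, 3))                   # V_i: uninformative, uniform in (-1, 1)
c, s = np.cos(np.pi / 4), np.sin(np.pi / 4)
u = c * U - s * V                                     # counterclockwise rotation of each
v = s * U + c * V                                     #   pair (U_i, V_i) by pi/4
X = np.column_stack([u[:, 0], u[:, 1], u[:, 2], v[:, 0], v[:, 1]])   # chosen features
```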
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 14/16
Experimental design

Given a data set (Training), find the optimal feature set and build the model to classify new samples. To obtain a predictive estimate of the error you need to:
1. Use a resampling procedure (3/4-1/4 train/test splits) for validation
2. Resample each training set (3-fold CV) to find the optimal number of features
3. Use the feature ranker (RFE) and the classifier (SVM) to build models with an increasing number of features
4. Apply the models to the test sets to obtain the (CV) error curve
5. Fit the error curve and estimate the optimal number of features n* as the saturation point of the fitted curve (a sketch of this step follows the list)
6. Build the optimal feature set from the first n* variables of the ranked list
7. Build the model on the different training sets using the optimal feature set
8. Test the models on the test sets to obtain the predictive estimate of the error

Problem: for the error estimate to be reliable, many resampling runs are needed; the RFE algorithm then becomes computationally too expensive, and the procedure can be impossible to apply to microarray data.
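A minimal sketch of step 5, assuming the error curve is fit with a decaying exponential and n* is taken where the fit first comes within a small tolerance of its asymptote; both the functional form and the tolerance are assumptions, not the authors' stated choices.

```python
import numpy as np
from scipy.optimize import curve_fit

def optimal_n_features(n_feat, cv_error, tol=0.01):
    """n_feat, cv_error: the (CV) error curve from step 4; returns the estimated n*."""
    f = lambda n, a, b, c: a * np.exp(-b * n) + c        # error(n) ~ a e^{-bn} + c
    (a, b, c), _ = curve_fit(f, n_feat, cv_error, p0=(0.3, 0.01, 0.1), maxfev=10000)
    grid = np.arange(1, int(max(n_feat)) + 1)
    sat = f(grid, a, b, c) <= c + tol                    # within tol of the asymptote c
    return int(grid[np.argmax(sat)]) if sat.any() else int(max(n_feat))
```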
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 15/16
References

(Alizadeh et al., 2000) A. Alizadeh, M. Eisen, M. Davis et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403: 503–511.
(Alon et al., 1999) U. Alon, N. Barkai, D. Notterman et al. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. PNAS, 96(12): 6745–6750.
(Ambroise & McLachlan, 2002) C. Ambroise & G. McLachlan (2002). Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS, 99(10): 6562–6566.
(Guyon et al., 2002) I. Guyon, J. Weston, S. Barnhill and V.N. Vapnik (2002). Gene selection for cancer classification using Support Vector Machines. Machine Learning, 46: 389–422.
(Ramaswamy et al., 2003) S. Ramaswamy et al. (2003). A molecular signature of metastasis in primary solid tumors. Nature Genetics, 33: 1–6.
(Ramaswamy et al., 2001) S. Ramaswamy et al. (2001). Multiclass cancer diagnosis using tumor gene expression signatures. PNAS, 98(26): 15149–15154.
(Vapnik, 2000) V.N. Vapnik (2000). The Nature of Statistical Learning Theory. Springer-Verlag.
Gene Selection and Classification by Entropy-based Recursive Feature Elimination – p. 16/16