A Sparse Learning-Based Approach for Class-Specific Feature Selection
Ornella Affinito2, Angelo Ciaramella1, Sergio Cocozza2,3, Gennaro Miele4, Antonella Monticelli3, Davide Nardone1, Domenico Palumbo2 and Antonino Staiano1
1Dept. of Science and Technology, University of Naples “Parthenope”
2Dept. of Molecular Medicine and Biotechnology, University of Naples Federico II
3Institute of Experimental Endocrinology and Oncology (IEOS)
4Italian Institute of Nuclear Physics (INFN)
Abstract
Feature selection (FS) plays a key role in computational biology: models with fewer variables are easier to explain and may speed up experimental validation by providing valuable insight into the importance and role of variables. We propose a novel two-step procedure for FS. First, a Sparse-Coding Based Learning technique is used to find the best subset of features for each class of the training data. Second, the discovered feature subsets are fed to a Class-Specific Feature Selection (CSFS) scheme to assess the effectiveness of the selected features in classification tasks. To this end, an ensemble of classifiers is built by training one classifier per class on its own feature subset. To assess the effectiveness of the proposed FS approach, a number of experiments have been performed on benchmark microarray datasets.

Experimental Results
To evaluate the performance of the proposed feature selection strategy (SCBA-CSFS), we applied our methods to six publicly available microarray datasets: the ALLAML dataset [5], the human carcinomas (CAR) dataset [6], two human lung carcinoma datasets (LUNG_C and LUNG_D) [7], the diffuse large B-cell lymphoma (DLBCL) dataset, and the malignant glioma (GLIOMA) dataset [8]. A Support Vector Machine (SVM) classifier was run on all datasets using 5-fold cross-validation. We compared SCBA-CSFS with the most representative state-of-the-art supervised FS algorithms.
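The evaluation protocol just described (an SVM scored with 5-fold cross-validation) can be sketched as follows; the random matrix is a stand-in for a microarray dataset, since the benchmark data themselves are not bundled with the poster.

```python
# Sketch of the evaluation protocol: SVM accuracy under 5-fold CV.
# The random matrix stands in for a microarray dataset (samples x genes).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(72, 500)           # e.g. 72 samples, 500 gene expression values
y = rng.randint(0, 2, size=72)   # binary labels (e.g. ALL vs AML)

clf = SVC(kernel="linear")
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print("mean 5-fold accuracy: %.3f" % scores.mean())
```

The same loop is repeated for each feature-selection method, feeding the SVM only the top-ranked features instead of the full matrix.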
Introduction
Selecting small subsets out of thousands of genes in microarray data is an important task for several medical purposes. Microarray data analysis is challenging because it involves a huge number of genes compared to a relatively small number of samples. In particular, gene selection is the task of identifying the most significantly differentially expressed genes under different conditions, and it is an open research topic [1]. The selected genes are very useful in clinical applications such as recognizing disease profiles. Nonetheless, because of its high cost, the number of experiments available for classification purposes is usually limited. This small number of samples, compared to the large number of genes per experiment, is well known as the Curse of Dimensionality [2] and challenges classification as well as other data analysis tasks. Moreover, it is well known that a limited number of genes play an important role, whereas many others may be unrelated to the classification task [3]. Therefore, a critical step towards effective classification is identifying the representative genes, so as to decrease the number of genes used for classification.
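As a generic illustration of identifying representative genes (a univariate filter, not the SCBA method proposed here), one can rank genes by an ANOVA F-score and keep the top k; the injected signal in gene 5 is purely synthetic.

```python
# Univariate gene ranking: keep the k genes with the highest ANOVA F-score.
# Generic illustration of traditional feature selection, not the SCBA method.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(1)
X = rng.randn(60, 1000)          # 60 samples, 1000 genes
y = rng.randint(0, 2, size=60)
X[:, 5] += 3 * y                 # make gene 5 differentially expressed

selector = SelectKBest(f_classif, k=10).fit(X, y)
top_genes = np.argsort(selector.scores_)[::-1][:10]
print("top-ranked genes:", top_genes)
```

Because the shift injected into gene 5 is large relative to the noise, that gene ranks among the top candidates, which is exactly the behaviour a gene-selection filter is meant to exhibit.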
Average accuracy of top 80 features (%)

| Dataset | All features | RFS        | ls-2,1 | ll-2,1 | Fisher | Relief-F | mRmR  | SCBA  | SCBA-CSFS |
|---------|--------------|------------|--------|--------|--------|----------|-------|-------|-----------|
| ALLAML  | 93.21        | 97.84      | 74.27  | 95.73  | 98.95  | 98.89    | 83.16 | 96.84 | 95.67     |
| CAR     | 90.25        | 96.98      | 88.88  | 94.61  | 92.92  | 96.95    | 94.93 | 94.61 | 99.32     |
| LUNG_C  | 95.57        | 98.12      | 97.84  | 98.99  | 99.28  | 99.57    | 98.71 | 99.42 | 99.70     |
| LUNG_D  | 83.43        | 95.93      | 95.93  | 94.62  | 95.93  | 97.31    | 96.60 | 97.29 | 97.29     |
| DLBCL   | 93.74        | 99.03      | 95.42  | 99.76  | 99.76  | 99.76    | 99.80 | 99.76 | 99.76     |
| GLIOMA  | 74.00        | 88.33 (29) | 70.00  | 80.00  | 83.33  | 80.00    | 78.33 | 81.67 | 88.33     |
| Average | 88.37        | 96.03      | 87.05  | 93.95  | 95.03  | 95.41    | 91.92 | 94.93 | 96.68     |
[Figure 1: accuracy-versus-number-of-features plots for the six datasets (ALLAML, GLIOMA, LUNG_C, LUNG_D, DLBCL and CAR, containing between 2 and 11 classes).]
Figure 1. Classification accuracy comparisons of eight feature selection algorithms on six datasets. The SVM classifier with 5-fold cross-validation is used for classification. SCBA and SCBA-CSFS are our methods.
Methods and Materials
Feature selection has been widely used for eliminating redundant or irrelevant features, and it can be done in two ways: Traditional Feature Selection (TFS) for all classes, and Class-Specific Feature Selection (CSFS) [4]. CSFS is the process of finding a different set of features for each class: whereas a TFS algorithm selects a single feature subset for discriminating among all the classes in a supervised classification problem, a CSFS algorithm selects one feature subset per class.

The proposed pipeline consists of five steps:
1. Dataset splitting: train and test sets are created.
2. Class sample separation: samples are separated class by class.
3. Class balancing (optional): SMOTE oversampling.
4. Class-specific feature selection: feature selection using the Sparse-Coding Based Approach (SCBA).
5. Classification: an ensemble of classifiers is built (one per class), each trained on its own feature subset, and an ad-hoc decision rule is adopted to combine the ensemble responses.

Discussion
Based on the experimental results, we may state that applying TFS yields better results than using all the available features, and that, in most cases, applying CSFS yields better results than applying TFS methods. In particular, it is worth observing that the CSFS method achieves its best results when the datasets contain several classes (e.g., LUNG_C, LUNG_D, CAR, DLBCL). In addition, as shown in Fig. 1, the proposed method consistently outperforms its competitors using fewer features, indicating that it identifies and retrieves the most representative features while maximizing classification accuracy. With the top 80 features, SCBA-CSFS is about 1%-10% better than the competing methods on the six datasets.
[Pipeline diagram: the training set is separated class by class (C1, ..., Cn); SCBA selects a feature subset for each class; an ensemble of classifiers (e1, ..., en) produces per-class predictions (P1, ..., Pn) on the test set, which are combined into the final prediction Fp.]
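The five-step pipeline described under Methods and Materials can be sketched in a few lines. This is only an illustrative sketch: a Lasso-penalised logistic regression stands in for the SCBA sparse-coding step (whose implementation is not reproduced on the poster), the SMOTE balancing step is omitted, and the data are synthetic.

```python
# Sketch of the class-specific pipeline. An L1-penalised logistic regression
# stands in for SCBA (step 4); SMOTE (step 3) is omitted for brevity.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(90, 200)
y = rng.randint(0, 3, size=90)                      # three classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)  # step 1

subsets, members = {}, {}
for c in np.unique(y_tr):                           # step 2: class by class
    yc = (y_tr == c).astype(int)                    # this class vs the rest
    sparse = LogisticRegression(penalty="l1", C=0.5,
                                solver="liblinear").fit(X_tr, yc)
    feats = np.flatnonzero(sparse.coef_[0])         # class-specific subset
    if feats.size == 0:                             # fallback: keep all
        feats = np.arange(X_tr.shape[1])
    subsets[c] = feats
    members[c] = SVC(probability=True).fit(X_tr[:, feats], yc)  # step 5

# Decision rule: each member scores its own class; predict the argmax.
scores = np.column_stack([members[c].predict_proba(X_te[:, subsets[c]])[:, 1]
                          for c in sorted(members)])
y_pred = np.array(sorted(members))[scores.argmax(axis=1)]
print("test accuracy:", (y_pred == y_te).mean())
```

The argmax-of-probabilities decision rule is one plausible reading of the poster's "ad-hoc decision rule"; the actual rule may differ.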
Conclusions
We proposed a novel Sparse-Coding Based Approach to Feature Selection, emphasizing joint l1,2-norm minimization, combined with Class-Specific Feature Selection. Experimental results on six different datasets validate the unique aspects of SCBA-CSFS and demonstrate better performance against the state-of-the-art methods. In addition, the method is able to retrieve the most representative features that maximize classification accuracy when the dataset is made up of many classes.
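The poster does not reproduce the sparse-coding objective itself. As an illustrative sketch only (the symbols and the exact form are assumptions, not taken from the poster), a dictionary-learning formulation with a joint row-sparsity penalty of the kind referenced above typically looks like:

```latex
% Illustrative sketch, not the poster's exact formulation.
% X: data matrix, D: dictionary, A: sparse codes, \lambda: trade-off weight.
\min_{D,\,A}\; \|X - DA\|_F^2 \;+\; \lambda \,\|A\|_{1,2},
\qquad
\|A\|_{1,2} \;=\; \sum_i \Big(\sum_j |a_{ij}|^2\Big)^{1/2}
```

The joint norm sums the l2-norms of the rows of A, so entire rows are driven to zero together, which is what makes the nonzero rows interpretable as selected features.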
Contact
References
Dept. of Science and Technology, University of Naples “Parthenope” [email protected]
[email protected] [email protected] Dept. of Molecular Medicine and Biotechnology, University of Naples Federico II
[email protected] ornella.affi
[email protected] [email protected]
1. Mukherjee et al. A theoretical analysis of gene selection. Computational Systems Bioinformatics Conference (CSB 2004), Proceedings, IEEE, 2004.
2. Liu, Huan, et al. Feature Selection for Knowledge Discovery and Data Mining. Vol. 454. Springer Science & Business Media, 2012.
3. Xiong et al. Biomarker identification by feature wrappers. Genome Research 11.11 (2001): 1878-1887.
4. Pineda-Bautista et al. General framework for class-specific feature selection. Expert Systems with Applications 38.8 (2011): 10018-10024.
5. Fodor et al. Massively parallel genomics. Science 277.5324 (1997): 393.
6. Nutt et al. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Research 63.7 (2003): 1602-1607.
7. Alizadeh et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403.6769 (2000): 503-511.
8. Nutt et al. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Research 63.7 (2003): 1602-1607.