A Sparse Learning-Based Approach for Class-Specific Feature Selection
Ornella Affinito2, Angelo Ciaramella1, Sergio Cocozza2,3, Gennaro Miele4, Antonella Monticelli3, Davide Nardone1, Domenico Palumbo2 and Antonino Staiano1
1Dept. of Science and Technology, University of Naples “Parthenope”
2Dept. of Molecular Medicine and Biotechnology, University of Naples Federico II
3Institute of Experimental Endocrinology and Oncology (IEOS)
4Italian Institute of Nuclear Physics (INFN)
Abstract
Feature selection (FS) plays a key role in computational biology: models with fewer variables are easier to explain and may speed up experimental validation by providing valuable insight into the importance and role of variables. We propose a novel two-step procedure for FS. First, a Sparse-Coding Based Learning technique is used to find the best subset of features for each class of the training data. Second, the discovered feature subsets are fed to a Class-Specific Feature Selection (CSFS) scheme to assess the effectiveness of the selected features in classification tasks. To this end, an ensemble of classifiers is built by training one classifier per class on its own feature subset. To assess the effectiveness of the proposed FS approach, a number of experiments have been performed on benchmark microarray datasets.

Experimental Results
To evaluate the performance of the proposed feature selection strategy (SCBA-CSFS), we applied our methods to six publicly available microarray datasets: the ALLAML dataset [5], the human carcinomas (CAR) dataset [6], two human lung carcinoma datasets (LUNG_C and LUNG_D) [7], the diffuse large B-cell lymphoma (DLBCL) dataset, and the malignant glioma (GLIOMA) dataset [8]. A Support Vector Machine (SVM) classifier was run on all datasets using 5-fold cross-validation. We compared SCBA-CSFS with the most representative state-of-the-art supervised FS algorithms.
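The evaluation protocol just described (an SVM scored with 5-fold cross-validation) can be sketched as follows; the random matrix is a stand-in for a microarray dataset, since the benchmark data themselves are not bundled with the poster.

```python
# Sketch of the evaluation protocol: SVM accuracy under 5-fold CV.
# The random matrix stands in for a microarray dataset (samples x genes).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(72, 500)           # e.g. 72 samples, 500 gene expression values
y = rng.randint(0, 2, size=72)   # binary labels (e.g. ALL vs AML)

clf = SVC(kernel="linear")
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print("mean 5-fold accuracy: %.3f" % scores.mean())
```

The same loop is repeated for each feature-selection method, feeding the SVM only the top-ranked features instead of the full matrix.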
Introduction
Selecting small subsets out of thousands of genes in microarray data is an important task for several medical purposes. Microarray data analysis is challenging because it involves a huge number of genes compared to a relatively small number of samples. In particular, gene selection is the task of identifying the most significantly differentially expressed genes under different conditions, and it is an open research topic [1]. The selected genes are very useful in clinical applications such as recognizing disease profiles. Nonetheless, because of its high cost, the number of experiments available for classification purposes is usually limited. This small number of samples, compared to the large number of genes per experiment, is well known as the Curse of Dimensionality [2] and challenges classification as well as other data analysis tasks. Moreover, it is well known that a limited number of genes play an important role, whereas many others may be unrelated to the classification task [3]. Therefore, a critical step towards effective classification is identifying the representative genes, so as to decrease the number of genes used for classification.
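As a generic illustration of identifying representative genes (a univariate filter, not the SCBA method proposed here), one can rank genes by an ANOVA F-score and keep the top k; the injected signal in gene 5 is purely synthetic.

```python
# Univariate gene ranking: keep the k genes with the highest ANOVA F-score.
# Generic illustration of traditional feature selection, not the SCBA method.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(1)
X = rng.randn(60, 1000)          # 60 samples, 1000 genes
y = rng.randint(0, 2, size=60)
X[:, 5] += 3 * y                 # make gene 5 differentially expressed

selector = SelectKBest(f_classif, k=10).fit(X, y)
top_genes = np.argsort(selector.scores_)[::-1][:10]
print("top-ranked genes:", top_genes)
```

Because the shift injected into gene 5 is large relative to the noise, that gene ranks among the top candidates, which is exactly the behaviour a gene-selection filter is meant to exhibit.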
Average accuracy of top 80 features (%)

| Dataset | All features | RFS        | ls-2,1 | ll-2,1 | Fisher | Relief-F | mRmR  | SCBA  | SCBA-CSFS |
|---------|--------------|------------|--------|--------|--------|----------|-------|-------|-----------|
| ALLAML  | 93.21        | 97.84      | 74.27  | 95.73  | 98.95  | 98.89    | 83.16 | 96.84 | 95.67     |
| CAR     | 90.25        | 96.98      | 88.88  | 94.61  | 92.92  | 96.95    | 94.93 | 94.61 | 99.32     |
| LUNG_C  | 95.57        | 98.12      | 97.84  | 98.99  | 99.28  | 99.57    | 98.71 | 99.42 | 99.70     |
| LUNG_D  | 83.43        | 95.93      | 95.93  | 94.62  | 95.93  | 97.31    | 96.60 | 97.29 | 97.29     |
| DLBCL   | 93.74        | 99.03      | 95.42  | 99.76  | 99.76  | 99.76    | 99.80 | 99.76 | 99.76     |
| GLIOMA  | 74.00        | 88.33 (29) | 70.00  | 80.00  | 83.33  | 80.00    | 78.33 | 81.67 | 88.33     |
| Average | 88.37        | 96.03      | 87.05  | 93.95  | 95.03  | 95.41    | 91.92 | 94.93 | 96.68     |
[Figure 1: accuracy-versus-number-of-features plots for the six datasets (ALLAML, GLIOMA, LUNG_C, LUNG_D, DLBCL and CAR, containing between 2 and 11 classes).]
Figure 1. Classification accuracy comparisons of eight feature selection algorithms on six datasets. The SVM classifier with 5-fold cross-validation is used for classification. SCBA and SCBA-CSFS are our methods.
Methods and Materials
Feature selection has been widely used for eliminating redundant or irrelevant features, and it can be done in two ways: Traditional Feature Selection (TFS) for all classes, and Class-Specific Feature Selection (CSFS) [4]. CSFS is the process of finding a different set of features for each class: whereas a TFS algorithm selects a single feature subset for discriminating among all the classes in a supervised classification problem, a CSFS algorithm selects one feature subset per class.

The proposed pipeline consists of five steps:
1. Dataset splitting: train and test sets are created.
2. Class sample separation: samples are separated class by class.
3. Class balancing (optional): SMOTE oversampling.
4. Class-specific feature selection: feature selection using the Sparse-Coding Based Approach (SCBA).
5. Classification: an ensemble of classifiers is built (one per class), each trained on its own feature subset, and an ad-hoc decision rule is adopted to combine the ensemble responses.

Discussion
Based on the experimental results, we may state that applying TFS yields better results than using all the available features, and that, in most cases, applying CSFS yields better results than applying TFS methods. In particular, it is worth observing that the CSFS method achieves its best results when the datasets contain several classes (e.g., LUNG_C, LUNG_D, CAR, DLBCL). In addition, as shown in Fig. 1, the proposed method consistently outperforms its competitors using fewer features, indicating that it identifies and retrieves the most representative features while maximizing classification accuracy. With the top 80 features, SCBA-CSFS is about 1%-10% better than the competing methods on the six datasets.
[Pipeline diagram: the training set is separated class by class (C1, ..., Cn); SCBA selects a feature subset for each class; an ensemble of classifiers (e1, ..., en) produces per-class predictions (P1, ..., Pn) on the test set, which are combined into the final prediction Fp.]
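The five-step pipeline described under Methods and Materials can be sketched in a few lines. This is only an illustrative sketch: a Lasso-penalised logistic regression stands in for the SCBA sparse-coding step (whose implementation is not reproduced on the poster), the SMOTE balancing step is omitted, and the data are synthetic.

```python
# Sketch of the class-specific pipeline. An L1-penalised logistic regression
# stands in for SCBA (step 4); SMOTE (step 3) is omitted for brevity.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(90, 200)
y = rng.randint(0, 3, size=90)                      # three classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)  # step 1

subsets, members = {}, {}
for c in np.unique(y_tr):                           # step 2: class by class
    yc = (y_tr == c).astype(int)                    # this class vs the rest
    sparse = LogisticRegression(penalty="l1", C=0.5,
                                solver="liblinear").fit(X_tr, yc)
    feats = np.flatnonzero(sparse.coef_[0])         # class-specific subset
    if feats.size == 0:                             # fallback: keep all
        feats = np.arange(X_tr.shape[1])
    subsets[c] = feats
    members[c] = SVC(probability=True).fit(X_tr[:, feats], yc)  # step 5

# Decision rule: each member scores its own class; predict the argmax.
scores = np.column_stack([members[c].predict_proba(X_te[:, subsets[c]])[:, 1]
                          for c in sorted(members)])
y_pred = np.array(sorted(members))[scores.argmax(axis=1)]
print("test accuracy:", (y_pred == y_te).mean())
```

The argmax-of-probabilities decision rule is one plausible reading of the poster's "ad-hoc decision rule"; the actual rule may differ.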
Conclusions
We proposed a novel Sparse-Coding Based Approach to Feature Selection, emphasizing joint l1,2-norm minimization, combined with Class-Specific Feature Selection. Experimental results on six different datasets validate the unique aspects of SCBA-CSFS and demonstrate better performance against the state-of-the-art methods. In addition, the method is able to retrieve the most representative features that maximize classification accuracy when the dataset is made up of many classes.
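The poster does not reproduce the sparse-coding objective itself. As an illustrative sketch only (the symbols and the exact form are assumptions, not taken from the poster), a dictionary-learning formulation with a joint row-sparsity penalty of the kind referenced above typically looks like:

```latex
% Illustrative sketch, not the poster's exact formulation.
% X: data matrix, D: dictionary, A: sparse codes, \lambda: trade-off weight.
\min_{D,\,A}\; \|X - DA\|_F^2 \;+\; \lambda \,\|A\|_{1,2},
\qquad
\|A\|_{1,2} \;=\; \sum_i \Big(\sum_j |a_{ij}|^2\Big)^{1/2}
```

The joint norm sums the l2-norms of the rows of A, so entire rows are driven to zero together, which is what makes the nonzero rows interpretable as selected features.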
Contact
References
Dept. of Science and Technology, University of Naples “Parthenope” [email protected]
[email protected] [email protected] Dept. of Molecular Medicine and Biotechnology, University of Naples Federico II
[email protected] ornella.affi
[email protected] [email protected]
1. Mukherjee et al. A theoretical analysis of gene selection. Computational Systems Bioinformatics Conference (CSB 2004), Proceedings, IEEE, 2004.
2. Liu, Huan, et al. Feature Selection for Knowledge Discovery and Data Mining. Vol. 454. Springer Science & Business Media, 2012.
3. Xiong et al. Biomarker identification by feature wrappers. Genome Research 11.11 (2001): 1878-1887.
4. Pineda-Bautista et al. General framework for class-specific feature selection. Expert Systems with Applications 38.8 (2011): 10018-10024.
5. Fodor et al. Massively parallel genomics. Science 277.5324 (1997): 393.
6. Nutt et al. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Research 63.7 (2003): 1602-1607.
7. Alizadeh et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403.6769 (2000): 503-511.
8. Nutt et al. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Research 63.7 (2003): 1602-1607.