2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

Feature selection based on functional group structure for microRNA expression data analysis

Yang Yang1,2,4,∗, Tianyu Cao1, Wei Kong3,4

1 Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
2 Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering
3 Department of Computer Science and Engineering, Shanghai Maritime University, Shanghai, China
4 Key Laboratory of Systems Biomedicine (Ministry of Education), Shanghai Center for Systems Biomedicine, Shanghai Jiao Tong University, Shanghai, China
∗ To whom correspondence should be addressed

Abstract—Feature selection methods have been widely used in gene expression analysis to identify differentially expressed genes and explore potential biomarkers for complex diseases. Many studies have shown that incorporating feature structure information can greatly enhance the performance of feature selection algorithms, and genes naturally fall into groups according to common function and co-regulation; nevertheless, only a few gene expression studies have exploited this structure. As far as we know, no such study exists for microRNA (miRNA) expression analysis, owing to the lack of functional annotation for miRNAs. In this study, we focus on miRNA expression analysis because of its importance in diagnosis, prognosis prediction and the detection of new therapeutic targets for complex diseases. MiRNAs tend to work in groups to play their regulatory roles, so miRNA expression data also have group structure. We use GO-based semantic similarity to infer miRNA functional groups and propose a new feature selection method that takes group structure into consideration, called MiRFFS (MiRNA Functional group-based Feature Selection). We also apply the group information to the sparse group Lasso method, and compare MiRFFS with the sparse group Lasso as well as several existing feature selection methods. The results on three miRNA microarray profiles of breast cancer show that MiRFFS can achieve a compact feature subset with high classification accuracy.

I. INTRODUCTION

During the last decades, high-throughput experimental technologies have achieved significant breakthroughs and supported rapid development in many biomedical applications. In particular, differential expression analysis on the whole-genome scale has significantly sped up the discovery of biomarkers for complex diseases, but it has also brought many challenging computational problems, such as high dimensionality, small sample size and heterogeneous data sources. Statistical and machine learning approaches are the two major branches of tools for the analysis of gene expression data. To date, various statistical methods have been proposed in this field. For example, fold change is a simple metric for measuring the change in expression levels between two groups of samples [1]. The t-test [2] is among the most widely used hypothesis tests for selecting differentially expressed genes. Other popular statistical methods include ANOVA [3], the Wilcoxon rank sum test [4], SAM [1], LIMMA [5], SMVar [6], etc.

978-1-5090-1610-5/16/$31.00 ©2016 IEEE


Besides statistical methods, feature selection based on machine learning has also been widely used in gene expression analysis. It falls into three major categories: filter methods, wrapper methods and embedded methods [7]. The first category largely overlaps with statistical methods, because filter methods often adopt statistical significance as the ranking criterion, such as the p-value, q-value and FDR. Wrapper methods rank the genes directly by their performance on a learning model. For example, classification accuracy can serve as the ranking score, and a heuristic search can be performed to find the gene combinations with the highest scores. In the first two categories, feature selection is a procedure separate from the learning model, whereas in the third category feature selection is embedded in the learning of the model. Take SVM-RFE [8] as an example: the weight vector (projection direction) of the learned SVM model directly indicates feature importance in the classification.

Most of the aforementioned feature selection algorithms are general methods that disregard the structure or correlation of features. In fact, in many applications, features are strongly correlated and present intrinsic structures, such as groups, trees and graphs [9], [10]. In recent years, structured feature selection has attracted a lot of research interest; it uses prior structural information to help select the more important features in accord with the data properties. A typical example is structured sparse learning, which includes a series of Lasso methods, such as group Lasso (groups can be disjoint or overlapping) [11], [12], sparse group Lasso [13], tree Lasso [14], fused Lasso [15] and graph Lasso [12].

In gene expression analysis, the features (genes) exhibit natural group structures, because genes can be grouped by their biological functions. Genes within a group are usually co-regulated or involved in the same pathways and biological processes. Ma et al. applied group Lasso to microarray data analysis [16]. They first clustered genes according to their expression data and then performed group Lasso. The functional groups of genes can also be identified by their semantic similarity defined by domain knowledge, such as Gene Ontology (GO) [17] and function domain [18].

In this study, we focus on microRNA (miRNA) expression analysis because of its importance in diagnosis, prognosis prediction and the detection of new therapeutic targets for complex diseases. In particular, miRNAs have been verified as oncogenes or tumor suppressors in various cancer types [19]. The identification of miRNA biomarkers has become an important topic in cancer research over the last decade. MiRNAs tend to work in groups to play their regulatory roles, so miRNA expression data also have group structure. However, unlike coding genes, functional similarities of miRNAs are hard to obtain because functional annotation for miRNAs is lacking. Therefore, studies on structured feature selection for miRNA expression profiles have been scarce.

Here, we use GO-based semantic similarity to infer miRNA functional groups, and apply the group information to the sparse group Lasso method. Moreover, we propose a new feature selection method, called MiRFFS (MicroRNA Functional group-based Feature Selection). Similar to sparse group Lasso, it assumes that the selected feature subset is a combination of several functional groups, and that only representative members of each group are included. We define an inclusion criterion to select miRNAs that are functionally similar and, at the same time, differentially expressed. Given the criterion, we start from some seed features and fill each subset with features satisfying the criterion, then combine the groups and eliminate redundancy. The new method is compared with sparse group Lasso and with some common feature selection methods on three breast cancer miRNA microarray data sets. The results show that MiRFFS can achieve a compact feature subset with high classification accuracy, and that most of the selected miRNAs have a reported association with breast cancer in the existing literature.

II. METHODS

A. MiRNA functional groups

MiRNAs perform their function mainly by regulating the expression of target genes at the post-transcriptional level. They often work together and form co-regulatory groups.
Thus, miRNA functional groups can be inferred through the similarity of their target genes. Since Gene Ontology has been a widely used knowledge base for annotating gene functions, a lot of methods have been proposed to estimate the functional similarity between genes by computing the semantic similarity between their GO sets [20], [21]. However, very little functional annotation of miRNAs is available in public databases, so methods based on prior knowledge are hard to apply to miRNAs. Here, we infer miRNA functional similarity from the GO annotation of their target genes. Usually, pairwise gene similarities are obtained by integrating their GO semantic similarities via the best match average (BMA) equation [22], as defined in Eq. (1),

$$ sim_{g_1,g_2} = \frac{\sum_{i=1}^{k} \max_{1 \le j \le s} sim_{t_i,t'_j} + \sum_{j=1}^{s} \max_{1 \le i \le k} sim_{t_i,t'_j}}{k + s}, \tag{1} $$

where $g_1$ and $g_2$ are two genes, corresponding to two GO sets with $k$ and $s$ GO terms, respectively. $t_i$ and $t'_j$ are two terms in these two sets, respectively. $sim_{t_i,t'_j}$ denotes the semantic similarity of $t_i$ and $t'_j$. Considering the miRNA case, since each miRNA has a set of target genes, a miRNA can be regarded as a set of GO sets, which is actually a GO set with redundant GO terms. Thus, we modified the original BMA equation by adding the GO occurrence times, as shown in Eq. (2),

$$ sim_{m_1,m_2} = \frac{\sum_{i=1}^{k} \max_{1 \le j \le s} sim_{t_i,t'_j} \times N_{t_i} + \sum_{j=1}^{s} \max_{1 \le i \le k} sim_{t_i,t'_j} \times N_{t'_j}}{k' + s'}, \tag{2} $$

where $m_1$ and $m_2$ are two miRNAs, corresponding to two GO sets, $\mathcal{G}_{m_1}$ and $\mathcal{G}_{m_2}$, with $k'$ and $s'$ GO terms (duplicated GO terms are kept), respectively. $N_{t_i}$ and $N_{t'_j}$ are the numbers of times that $t_i$ and $t'_j$ occur in $\mathcal{G}_{m_1}$ and $\mathcal{G}_{m_2}$, respectively. We use Yang's method [23] to compute the semantic similarity between GO terms. To obtain target genes for miRNAs, we checked several computational prediction tools, including TargetScan [24], PicTar [25] and miRanda [26], as well as an experimentally supported target database, TarBase [27]. We finally chose miRanda because it covers the most miRNAs in the expression data. Given the similarity matrix of miRNAs, clustering can be performed to obtain functional groups.

B. The new feature selection algorithm

The new method, MiRFFS, is a two-stage feature selection algorithm. The first stage is an expansion phase, i.e., adding qualified miRNAs to the selected feature set. Initially, we take the miRNAs whose t-test p-values are among the most significant, each of which is regarded as the seed of a functional group. That is, each seed is the first member of a feature subset. Then, for each subset, we add miRNAs according to an inclusion criterion: the pairwise $CorDif$ values between the newly added miRNA and all miRNAs already in the subset should be no smaller than a certain threshold, where $CorDif$ measures function correlation and expression difference, as defined in Eq. (3),

$$ CorDif = Cor_f - |Cor_e|, \tag{3} $$

where $Cor_f$ denotes function correlation, i.e., GO-based similarity, and $Cor_e$ denotes expression correlation, i.e., the Pearson correlation coefficient [28] of expression levels. When all the subsets remain unchanged, their classification accuracies on the validation set are examined, and the top-ranked subsets are combined into a single subset. The second stage is a reduction phase, which aims to eliminate the redundancy in the subset output by the first stage. The sequential backward elimination algorithm is adopted [29].
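As a rough illustration of Eqs. (2) and (3), the following Python sketch computes an occurrence-weighted BMA score between two GO multisets and the resulting $CorDif$ value. It is a sketch under stated assumptions, not the implementation used in this work: the names `modified_bma`, `pearson` and `cor_dif` are hypothetical, the `term_sim` callable merely stands in for the GO term similarity of [23], the sums in Eq. (2) are read as running over distinct terms, and treating a constant expression profile as uncorrelated is a choice made here.

```python
from collections import Counter
from math import sqrt


def modified_bma(go_terms_1, go_terms_2, term_sim):
    """Occurrence-weighted best-match average in the spirit of Eq. (2).

    go_terms_1, go_terms_2: lists of GO term ids with duplicates kept.
    term_sim: hypothetical callable (term, term) -> similarity in [0, 1].
    """
    counts_1, counts_2 = Counter(go_terms_1), Counter(go_terms_2)
    if not counts_1 or not counts_2:
        return 0.0  # assumption: no annotation means no measurable similarity
    # Each distinct term contributes its best match in the other set,
    # weighted by its number of occurrences (N_t in Eq. (2)).
    forward = sum(n * max(term_sim(t, u) for u in counts_2)
                  for t, n in counts_1.items())
    backward = sum(n * max(term_sim(t, u) for t in counts_1)
                   for u, n in counts_2.items())
    # The denominator k' + s' counts terms with duplicates kept.
    return (forward + backward) / (len(go_terms_1) + len(go_terms_2))


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    norm = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if norm == 0.0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return sum(a * b for a, b in zip(dx, dy)) / norm


def cor_dif(cor_f, expr_a, expr_b):
    """Eq. (3): functional similarity minus absolute expression correlation."""
    return cor_f - abs(pearson(expr_a, expr_b))


# Toy term similarity, invented for illustration only.
def toy_sim(a, b):
    return 1.0 if a == b else 0.5


score = modified_bma(["GO:A", "GO:A", "GO:B"], ["GO:A", "GO:C"], toy_sim)
print(score, cor_dif(score, [1, 2, 3, 4], [1, -1, -1, 1]))
```

A pair scores high on $CorDif$ only when the two miRNAs are functionally similar and their expression profiles are nearly uncorrelated; strong positive or negative correlation lowers the score equally because of the absolute value.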
The whole pipeline is shown in Fig. 1, and the detailed algorithm is described in Algorithm 1.

Fig. 1. Flowchart of the new method

Algorithm 1
Input: MiRNA set $\mathcal{M}$, miRNA functional similarity matrix $S$
Output: Feature set $\mathcal{B}$, $\mathcal{B} \subset \mathcal{M}$
Perform clustering on $S$ and obtain functional group set $\mathcal{G}$.
for each group $g \in \mathcal{G}$ do
    Let $\mathcal{A}_g$ be the set of selected features of $g$, $\mathcal{A}_g = \emptyset$
    Add seed miRNA $s_{g1}$ into $\mathcal{A}_g$, where $pvalue_{s_{g1}} = \min_{m_i \in g} pvalue_{m_i}$
    for each miRNA $m_j \in g$ do
        Let $CorDif_{m_j,s_{gk}}$ be the $CorDif$ value of the miRNA pair $m_j$ and $s_{gk}$, where $m_j \in g$, $s_{gk} \in \mathcal{A}_g$, and let $t$ be the threshold of $CorDif$.
        if $\forall s_{gk} \in \mathcal{A}_g$, $CorDif_{m_j,s_{gk}} \ge t$ then
            Add $m_j$ into $\mathcal{A}_g$
        end if
    end for
end for
Estimate accuracy for each $\mathcal{A}_g$ using SVMs.
Put the top-ranked $\mathcal{A}_g$s into $\mathcal{B}$
Perform sequential backward search on $\mathcal{B}$
Output $\mathcal{B}$

C. Sparse group Lasso with functional group information

Sparse group Lasso is a typical structured feature selection algorithm, which results in sparsity at both the group level and the feature level. That is, sparsity is also expected within the groups. The objective function is defined in Eq. (4) [13],

$$ \min_{\beta \in R^p} \left( \left\| y - \sum_{l=1}^{L} X_l \beta_l \right\|_2^2 + \lambda_1 \sum_{l=1}^{L} \|\beta_l\|_2 + \lambda_2 \|\beta\|_1 \right), \tag{4} $$

where $y$ is a vector of $N$ labels, and $X$ is an $N \times p$ matrix of features. Suppose there are a total of $L$ groups. $X_l$ is the matrix corresponding to the $l$th group, and its coefficient vector is $\beta_l$. The second term is the regularization term used in group Lasso, and the third term aims to achieve sparsity within groups. In order to incorporate miRNA functional group information, we first perform a hierarchical clustering on miRNAs according to their GO-based similarities, and then cut the hierarchical tree into clusters, which correspond to functional groups.

III. EXPERIMENTAL RESULTS

A. Data sets

TABLE I
DATA DISTRIBUTION

| Data set | Total # miRNAs | Total # samples | # samples in the two classes |
|---|---|---|---|
| GSE22220 | 191 | 207 | ER+: 127, ER-: 80 |
| GSE26659 | 225 | 77 | Relapse: 43, Non-relapse: 34 |
| GSE40525 | 269 | 104 | Tumor: 52, Peri-tumor: 52 |

In this study, three public miRNA data sets from NCBI GEO [30] are used in the experiments, namely GSE22220 [31], GSE26659 [32] and GSE40525 [33]. All three studies aim to find differentially expressed miRNAs and explore their function in the tumorigenesis of breast cancer. GSE22220 has two categories according to ER status, namely ER positive and ER negative; the labels of GSE26659 are relapse and non-relapse; and GSE40525 also has two labels, tumor and peri-tumor. To ensure data quality, we removed the miRNAs whose expression levels were undetected or below the threshold value in more than 30% of the samples, as well as the miRNAs that are not covered by miRanda. The detailed data distributions are shown in Table I.

B. Experimental settings

In the first stage, we started from 10 subsets, each containing one seed miRNA. The $CorDif$ value was then used to select miRNAs that are functionally related to, but expressionally uncorrelated with, the seed miRNA and the other members of its subset. The threshold for $CorDif$ was determined by an empirical analysis of the sorted list of $CorDif$ values for all miRNA pairs. Taking GSE40525 as an example, as shown in Fig. 2, only a small portion of the miRNA pairs have high $CorDif$ values. It can be observed that the change rate of the curve remains stable when $CorDif$ is greater than 0.25, and the other two data sets exhibit similar patterns. Thus, we set the threshold of the $CorDif$ value to 0.25.

This study adopted five-fold cross-validation for feature selection and performance evaluation. For estimating classification accuracy, LIBSVM [34] with a linear kernel was used. The sparse group Lasso algorithm with functional group information was implemented using the R package msgl (High Dimensional Multiclass Classification Using Sparse Group Lasso) [35], where the best value of $\lambda$ was searched via cross-validation and $\alpha$ was set to 0.5. The functional groups were obtained by hierarchical clustering on the miRNA functional similarity matrix, where we used a measure called the inconsistency coefficient [36] to cut the hierarchical tree and obtain clusters. The threshold of the inconsistency coefficient was set to 1.
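The expansion stage described in these settings (one seed per subset, then admission of every miRNA whose $CorDif$ with all current members reaches the threshold of 0.25) might be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: `expand_subset`, the `cordif` lookup keyed by unordered pairs, and the order in which candidates are visited are assumptions, and the toy miRNA names and values are invented.

```python
def expand_subset(seed, candidates, cordif, threshold=0.25):
    """Grow one feature subset from a seed miRNA.

    cordif: dict mapping frozenset({a, b}) -> CorDif value of the pair
            (a hypothetical precomputed lookup).
    A candidate is admitted only if its CorDif with every current member
    of the subset is at least `threshold`.
    """
    subset = [seed]
    for mirna in candidates:  # assumption: candidates are visited in the given order
        if mirna in subset:
            continue
        if all(cordif[frozenset((mirna, member))] >= threshold
               for member in subset):
            subset.append(mirna)
    return subset


# Toy CorDif values, invented for illustration only.
toy_cordif = {
    frozenset(("miR-a", "miR-b")): 0.30,
    frozenset(("miR-a", "miR-c")): 0.40,
    frozenset(("miR-b", "miR-c")): 0.10,
    frozenset(("miR-a", "miR-d")): 0.20,
    frozenset(("miR-b", "miR-d")): 0.50,
    frozenset(("miR-c", "miR-d")): 0.50,
}

print(expand_subset("miR-a", ["miR-b", "miR-c", "miR-d"], toy_cordif))
# ['miR-a', 'miR-b']
```

Because a candidate must pass against every member and the subset only grows, a candidate rejected once can never qualify later. A single pass over the candidates therefore already reaches the state in which the subset remains unchanged, although the result does depend on the visiting order.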

TABLE II
RESULTS OF MiRFFS ON THREE DATA SETS^a

| Data set | Total # | Selected # | Total accuracy (%) | Sensitivity (%) | Specificity (%) | Selected miRNAs |
|---|---|---|---|---|---|---|
| GSE22220 | 191 | 10 | 82.6 | 95.3 | 80.1 | miR-146b-5p, miR-518a-5p, miR-584, miR-181d, miR-505, miR-137, miR-18a, miR-149, miR-18a*, miR-570 |
| GSE26659 | 225 | 9 | 75.3 | 79.1 | 77.3 | miR-505, miR-181b, miR-214, miR-148a, miR-30e, miR-148b, miR-365, miR-125b-2*, miR-142-5p |
| GSE40525 | 296 | 9 | 94.2 | 92.3 | 96.0 | miR-181c, miR-340, miR-497, miR-30d, miR-30a, miR-126*, miR-424, miR-409-3p, miR-101 |

^a MiRNAs that have evidence of association with breast cancer (from HMDD, PhenomiR and miR2Disease) are in bold.

Fig. 2. 𝐶𝑜𝑟𝐷𝑖𝑓 values for all miRNA pairs in descending order
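The curve in Fig. 2 is the list of pairwise $CorDif$ values sorted in descending order. A minimal sketch of how such a list, and the share of pairs above a candidate threshold, could be obtained is given below. The function names, the pair-keyed dictionaries `cor_f` and `cor_e`, and the three toy miRNAs are assumptions for illustration only.

```python
from itertools import combinations


def sorted_cordif(names, cor_f, cor_e):
    """Return CorDif = Cor_f - |Cor_e| for all pairs, in descending order.

    cor_f, cor_e: dicts mapping frozenset({a, b}) to the functional
    similarity and the expression correlation of the pair (hypothetical
    precomputed lookups).
    """
    values = []
    for a, b in combinations(names, 2):
        pair = frozenset((a, b))
        values.append(cor_f[pair] - abs(cor_e[pair]))
    return sorted(values, reverse=True)


def fraction_above(values, threshold):
    """Share of pairs whose CorDif is strictly greater than the threshold."""
    return sum(v > threshold for v in values) / len(values)


# Toy values, invented for illustration only.
names = ["m1", "m2", "m3"]
cor_f = {frozenset(("m1", "m2")): 0.75,
         frozenset(("m1", "m3")): 0.5,
         frozenset(("m2", "m3")): 0.25}
cor_e = {frozenset(("m1", "m2")): -0.25,
         frozenset(("m1", "m3")): 0.5,
         frozenset(("m2", "m3")): 0.0}

curve = sorted_cordif(names, cor_f, cor_e)
print(curve, fraction_above(curve, 0.25))
```

Inspecting how quickly this share shrinks as the threshold rises is one simple way to look for the kind of elbow used to justify the 0.25 cut-off.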

C. Results of MiRFFS

We applied the new MiRFFS method to the three miRNA microarray data sets. The classification performance (total accuracy, sensitivity and specificity) and the selected miRNAs in the final feature set are listed in Table II. As can be seen, the new feature selection method results in a substantial feature reduction. Only about 3% to 5% of the original miRNAs remain in the final feature set, indicating great redundancy among the features in the data sets. The classification accuracies obtained on the three data sets differ considerably. GSE40525 has the highest accuracy, nearly 19 percentage points higher than that of GSE26659 (94.2% vs. 75.3%). This may be due to the different classification tasks in each data set: it is relatively easy to differentiate tumor and non-tumor tissues in GSE40525, while the other two data sets aim to discriminate different subtypes of tumors or patients.

To examine the contribution of the $CorDif$ measure, we compared MiRFFS with a variant of itself. The difference is that, in the expansion stage, the inclusion of miRNAs in each subset considers only the correlation of expression data, without using the $CorDif$ measure. Thus, a miRNA is added if the Pearson correlation coefficients between its expression levels and those of all miRNAs already in the subset are less than a threshold. We found that this variant generally performs worse, with lower accuracy or a larger number of features. The total accuracy is comparable on GSE40525, 1 percentage point lower on GSE22220 (81.6%) and 6.5 percentage points lower on GSE26659 (68.8%). This suggests that, although miRNAs have been pre-clustered into functional groups, considering functional correlation within each group is still helpful. Selecting uncorrelated miRNAs can only avoid feature redundancy, whereas the functional association may be more important.

Furthermore, we examined the impact of the $CorDif$ threshold. As mentioned in Section III-B, the dramatic change of $CorDif$ occurs before 0.25. We found that if the threshold was too large (>0.35), the performance degraded substantially. That is, an overly strict threshold leads to false rejection of useful features. After all, only a small proportion of the miRNA pairs have strong functional association.

In addition, note that in the previous experiments only the top-ranked subsets (the first half) entered the reduction stage. Here we experimented with another variant that combines all subsets as the input for the reduction stage. Generally speaking, the performance was comparable with that of the original algorithm, with slightly higher accuracy on GSE22220 and lower accuracy on GSE40525. This indicates that including more features may sometimes help the classification, but may also bring in irrelevant features and enlarge the search space at the reduction stage.

D. Comparison with existing feature selection methods

We conducted a series of experiments to compare MiRFFS with some widely used feature selection methods, including Lasso, group Lasso (GL), sparse group Lasso (SGL) and some general feature selection methods that do not use structured information. Table III lists all the methods in the comparison. GL and SGL both use the miRNA functional group information, i.e., the groups obtained from the hierarchical tree. Correlation-based Feature Selection (CFS) [37], consistency-based selection [38] and information gain (IG) [39] are filter methods. In the IG method, the measure of symmetrical uncertainty was used. Best-first search (BFS) is a wrapper method, and the random forest (RF) filter [40] is an embedded method.
CFS, BFS and the consistency-based method determine the number of selected features automatically, while for the other two methods (IG and RF) we chose the subset of features whose weights are significantly better than those of the remaining features. The R packages msgl and FSelector [41] were used to implement these methods.

TABLE III
COMPARISON OF FEATURE SELECTION METHODS ON THREE DATA SETS

| Method | Feature #^a | Accuracy (%), GSE22220 | Accuracy (%), GSE26659 | Accuracy (%), GSE40525 |
|---|---|---|---|---|
| Lasso | 41, 26, 21 | 82.0 | 65.0 | 92.3 |
| GL^b | 23, 110, 12 | 80.0 | 65.0 | 92.3 |
| SGL^b | 26, 69, 14 | 82.0 | 68.0 | 92.3 |
| CFS | 29, 6, 6 | 78.3 | 68.8 | 93.3 |
| BFS | 4, 5, 3 | 75.8 | 68.8 | 90.4 |
| Consistency | 13, 7, 5 | 79.7 | 68.8 | 93.3 |
| IG | 80, 7, 187 | 72.9 | 68.8 | 89.4 |
| RF | 4, 2, 2 | 77.3 | 67.5 | 91.3 |
| MiRFFS | 10, 9, 9 | 82.6 | 75.3 | 94.2 |

^a In the second column, the three numbers on each row are the total numbers of selected features for GSE22220, GSE26659 and GSE40525, respectively.
^b GL and SGL, i.e., group Lasso and sparse group Lasso, both use the miRNA functional group information.
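The margin of MiRFFS over the best competing method can be recomputed directly from the accuracies in Table III. The short sketch below does this; the dictionary layout and variable names are illustrative choices, while the numbers are copied from the table.

```python
# Accuracies (%) from Table III, ordered as GSE22220, GSE26659, GSE40525.
accuracy = {
    "Lasso": (82.0, 65.0, 92.3),
    "GL": (80.0, 65.0, 92.3),
    "SGL": (82.0, 68.0, 92.3),
    "CFS": (78.3, 68.8, 93.3),
    "BFS": (75.8, 68.8, 90.4),
    "Consistency": (79.7, 68.8, 93.3),
    "IG": (72.9, 68.8, 89.4),
    "RF": (77.3, 67.5, 91.3),
    "MiRFFS": (82.6, 75.3, 94.2),
}
datasets = ("GSE22220", "GSE26659", "GSE40525")

margins = {}
for idx, name in enumerate(datasets):
    # Best accuracy among all methods other than MiRFFS on this data set.
    best_other = max(acc[idx] for method, acc in accuracy.items()
                     if method != "MiRFFS")
    # Difference in percentage points, rounded to the table's precision.
    margins[name] = round(accuracy["MiRFFS"][idx] - best_other, 1)

print(margins)  # {'GSE22220': 0.6, 'GSE26659': 6.5, 'GSE40525': 0.9}
```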

Table III shows that MiRFFS achieved the highest classification accuracy on all three data sets, improving on the best result obtained by the other methods by 0.6, 6.5 and 0.9 percentage points, respectively. At the same time, the number of selected features is relatively small. The performance enhancement can be attributed to three aspects. First, we take into consideration the functional correlation inferred from target genes and their GO terms, which requires a proper similarity matrix of the miRNAs. Second, we adopt a clustering-based feature selection scheme using the $CorDif$ metric, which helps to avoid potential feature redundancy in each cluster. Third, we refine the combined sets of features by further eliminating redundant features. These three parts ensure the efficacy of the new method.

Apparently, the structural information of miRNA functional groups also improved the Lasso method. Generally, the sparse group Lasso has the best accuracy among the three Lasso-based methods. On GSE22220 and GSE40525, group Lasso and sparse group Lasso selected far fewer features than Lasso without group information, but this is not the case on GSE26659. In the experiments, we found that the performance on GSE26659 was unstable, i.e., there was large variance in accuracy and in the number of features. This may be due to the small sample size, or because the differential signal is not strong enough to discriminate the relapse and non-relapse samples.

E. Functional analysis on the selected miRNAs

In order to perform functional enrichment on the selected miRNAs, we analyzed the enriched pathways of their target genes by using mirPath [42]. For GSE40525, the top-ranked pathways have the lowest p-values (
