Research Article

NETBAGs: a network-based clustering approach with gene signatures for cancer subtyping analysis

Aim: To evaluate a gene signature and network-based approach for cancer subtyping and classification. Materials & methods: We introduce the NETwork Based clustering Approach with Gene signatures (NETBAGs) algorithm, which clusters samples based on gene signatures and identifies molecular markers based on their significantly expressed gene network profiles. Results: Applying NETBAGs to multiple independent breast cancer datasets, we demonstrated that the clustering results were highly associated with the clinical subtypes and clearly revealed the genomic diversity of breast cancer samples. Conclusion: The NETBAGs algorithm is able to classify samples by their genomic signatures into clinically significant phenotypes so that potential biomarkers can be identified. The approach may contribute to cancer research and clinical study of complex diseases.

Keywords: cancer subtyping • cluster • expression • gene network • microarray • RNA-seq

Leihong Wu1, Zhichao Liu1, Joshua Xu1, Minjun Chen1, Hong Fang2, Weida Tong*,1 & Wenming Xiao**,1
1 Division of Bioinformatics & Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR 72079, USA
2 Office of Scientific Coordination, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR 72079, USA
*Author for correspondence: Tel.: +1 870 543 7142 Fax: +1 870 543 7854 [email protected]
**Author for correspondence: Tel.: +1 870 543 7387 Fax: +1 870 543 7854 [email protected]

Biomark. Med. (Epub ahead of print)
10.2217/bmm.15.96
ISSN 1752-0363

Microarray is one of the most widely used genomic technologies in biomedical research. The reliability and reproducibility of microarrays were once questioned, but the comprehensive MicroArray Quality Control phase I (MAQC-I) study concluded that intra-platform consistency and inter-platform concordance can be achieved when appropriate data pre-processing, normalization and feature selection algorithms are applied [1]. For instance, ranking based on fold-change of gene expression together with a nonstringent p-value cutoff generated reproducible differentially expressed gene (DEG) lists in a rat toxicogenomic study [2]. A major application of microarrays is sample classification, which leads to identification of gene signatures for sample classes [3] and/or construction of predictive models for disease diagnosis and/or prognosis [4]. However, not all types of end points are suitable for classification analysis. The second phase of the MAQC project (MAQC-II) indicated that the prediction performance of models based on microarray data is largely dependent on the nature of the end points to be predicted [5]. For instance, the MAQC-II model for breast cancer ERpos classification performed much better than the model predicting pCR/RD, regardless of the data analysis teams and approaches. The poor predictive performance for some end points may be due to their heterogeneous properties. On the other hand, for an 'easier end point' such as ERpos, many prognostic biomarkers of ER-positive breast cancer have been reported from microarray studies, some even in clinical use [6,7]. However, most of those genomic biomarkers are reported to be sensitive to the study dataset [8], and a low overlap of biomarkers is usually found between studies [9,10]. One possible reason is that conventional biomarker identification approaches do not fully account for the interaction between biomarkers; in other words, gene–gene interaction. The poor predictive performance for heterogeneous end points and the low consistency between microarray studies are therefore major challenges in current cancer genomics research [11].

Recently, network modeling has emerged as a promising method in biological and pharmacological studies [12,13]. A network-based approach enables interrogation of the complex relationships between genes/proteins, environmental factors and diseases in an integrated fashion. Previous studies have demonstrated that hub nodes (e.g., genes or proteins) with topological significance in a biological network also tend to be more important in the biological process [14–17]. Network modeling thereby offers a way to study diseases beyond individual significant genes by considering the interplay between them [12,18]. Several network-based studies have shown that network properties can help classification and clustering analysis [19,20]. For instance, Chuang et al. classified breast cancer metastasis based on gene sub-networks extracted from the whole protein–protein interaction (PPI) network [19], and Hofree et al. developed network-based stratification to classify cancers based on their mutation network information [21]. Many network-based approaches have directly used whole-genome profiles to define the nodes and edges in the network, and more recently differential network approaches, which focus on network changes such as those based on differential expression, have also been developed [22]. However, differential expression changes are calculated from known class labels (end points), and there are few differential network approaches for clustering analysis. If the gene signature of each sample could be appropriately defined independent of end points, the combination of a network-based approach and gene signatures could also be a possible route to cancer subtyping analysis.

To that end, we developed a novel algorithm called NETwork Based clustering Approach with Gene signatures (NETBAGs) to classify samples by gene signature and identify molecular markers based on their significantly expressed network profiles.
The gene signature of each sample was defined by its significantly expressed gene (SEG) profile, which was based on the relative expression level of this sample within the whole dataset. The SEG profile was then used to annotate the biological network to obtain a sample-specific, gene signature annotated network. After that, network propagation and stratification algorithms [21] were applied to cluster samples based on their signature network profiles. Breast cancer tumors are highly heterogeneous [23]; therefore, we used a breast cancer dataset to test the capability of NETBAGs for subtyping analysis. Currently, breast cancer subtyping is mainly based on clinical descriptors such as age, tumor histological grade, and expression of estrogen receptor (ER), progesterone receptor (PR) and HER2. More accurate prediction of ER, PR and HER2 status in patients would increase confidence in clinical intervention and therapy. For instance, there is still no efficient targeted therapy for breast cancer patients who are negative for all three markers (ER-, PR- and HER2-), also known as triple-negative breast cancer (TNBC). Therefore, subtyping analysis and identification of novel breast cancer biomarkers, especially for TNBC patients, are critical for new drug development, its evaluation at regulatory agencies, and patient care and prognosis.

Materials & methods

Breast cancer datasets

The MAQC-II breast cancer dataset [5] was downloaded from the Gene Expression Omnibus (GEO) database (GEO accession number: GSE20194) and contains a total of 278 samples. Since the MAQC-II project utilized only 230 samples [5], we restricted our analysis to the same subset. Microarray gene expression data were obtained using the Affymetrix Human Genome U133A array (GEO accession number: GPL96) with 22,283 probe sets. The original gene expression data were normalized with MAS5 and log2 transformed. The annotation of the chip was collected from ArrayTrack V3.5.0 [24]. A total of 19,880 out of 22,283 (89.2%) probe sets had gene annotation information, covering 12,510 genes. Genes were then filtered against the STRING database, and genes absent from the database were not considered. A total of 8884 unique genes were used.

The Cancer Genome Atlas (TCGA) breast cancer dataset was used for biomarker verification. The pre-normalized breast cancer datasets were downloaded from the publicly available Synapse database [25] and included gene expression data from microarray and RNA-seq platforms. The microarray validation dataset (validating dataset A) contained 532 samples and 17,813 genes, based on the Agilent 244K custom gene expression microarray. The RNA-seq validation dataset (validating dataset B) was obtained from the Illumina HiSeq 2000 RNA sequencing platform and included 822 samples with 20,530 genes. The gene expression data of both TCGA breast cancer datasets were pre-normalized and indexed by gene name. Clinical data for 1044 patients were also downloaded from the TCGA portal [26]. In total, clinical information was available for 532 patients with microarray data and 821 patients with RNA-seq data. An additional four gene expression datasets (GEO accession numbers GSE18864 [27,28], GSE36693 [29], GSE38959 [30] and GSE53752 [31]) were also obtained to validate the candidate TNBC biomarkers postulated in this study.
Pre-normalized expression data and their platform gene annotations were all downloaded from the GEO database. GSE18864 contained 38 triple-negative samples (ER/PR/HER2 negative) and 46 other samples. GSE36693 contained 21 basal-like tissue samples and 80 other samples. GSE38959 contained 30 TNBC cell samples and 13 normal mammary gland ductal cell samples. GSE53752 contained 51 TNBC samples and 25 normal breast tissue samples. The p-values and fold changes were then calculated between the triple-negative samples and the other samples to determine differentially expressed genes.
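The comparison between triple-negative and other samples can be sketched in Python as follows. This is a minimal illustration and not the authors' code: the use of Welch's t-test, the cutoffs (p < 0.05, two-fold change) and the function name `differential_expression` are assumptions, since the text above does not state which test or thresholds were applied to the validation datasets. The sketch also assumes the expression values are on a log2 scale, so that a difference of group means is a log2 fold change.

```python
import numpy as np
from scipy import stats


def differential_expression(expr, is_tnbc, p_cutoff=0.05, fc_cutoff=2.0):
    """Flag genes that differ between triple-negative and other samples.

    expr     : (n_genes, n_samples) array of log2 expression values
    is_tnbc  : boolean mask over samples, True for triple-negative samples
    p_cutoff, fc_cutoff : illustrative thresholds (assumptions, not from the source)
    """
    expr = np.asarray(expr, dtype=float)
    is_tnbc = np.asarray(is_tnbc, dtype=bool)
    tnbc = expr[:, is_tnbc]
    other = expr[:, ~is_tnbc]

    # Two-sided Welch's t-test per gene (assumed; the source only says "p-values")
    _, p_values = stats.ttest_ind(tnbc, other, axis=1, equal_var=False)

    # On log2 data the difference of group means is the log2 fold change
    log2_fc = tnbc.mean(axis=1) - other.mean(axis=1)

    is_deg = (p_values < p_cutoff) & (np.abs(log2_fc) > np.log2(fc_cutoff))
    return p_values, log2_fc, is_deg


if __name__ == "__main__":
    # Synthetic example: 50 genes, 40 samples, gene 0 is shifted in the TNBC group
    rng = np.random.default_rng(0)
    expr = rng.normal(8.0, 0.3, size=(50, 40))
    is_tnbc = np.arange(40) < 20
    expr[0, is_tnbc] += 3.0
    p_values, log2_fc, is_deg = differential_expression(expr, is_tnbc)
    print("genes flagged as DEG:", np.flatnonzero(is_deg))
```

Combining a significance filter with a fold-change filter follows the same pattern the introduction cites from MAQC-I, where fold-change ranking with a nonstringent p-value cutoff gave reproducible DEG lists.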

Significantly expressed gene profile

The SEG profile is the gene signature of each sample and was used to annotate its specific network. A Z-score and a fold change were used to determine the SEGs of each sample. In detail, the Z-score was calculated by Z-test as the significance of the sample in question among all samples; the fold change was measured as the value in that sample compared with the mean value across all samples. In particular, if a gene had |Z-score| > 1.96 (equivalent to p < 0.05) and a fold change > 2, it was considered an SEG and labeled as 1 in the SEG profile (otherwise, it was labeled as 0). For a gene that had more than one probe, its SEG level was the average over all its probes. Finally, the n-by-p SEG matrix contained the SEG profiles of all samples, where n is the number of samples and p is the number of genes.

Gene–gene interaction network

The gene interaction network was constructed using whole-genome information. In fact, a biological network of any scale, such as a disease-specific network, could be used in the NETBAGs algorithm. Here we used gene–gene interaction relationships from the STRING database (version 9.1), which are curated from literature, computational prediction, etc. [32], to build the initial network structure. In particular, only the top 10% of interactions were used in network construction, containing a total of 12,223 genes [21]. The structure of the network did not change during the network-based clustering analysis. For each sample, its unique network annotation was based on its SEG profile.

Consensus clustering analysis

Consensus clustering analysis was used to further improve and stabilize the clustering performance [33]. In each loop of the consensus clustering analysis, 80% of the samples and SEGs were randomly selected to form the SEG matrix, which was analyzed with the NETBAGs algorithm. The clustering results were then pooled by voting to generate a similarity scoring matrix, in which the score of a pair of samples was the frequency with which they fell into the same cluster over a total of 100 repetitions. Hierarchical clustering analysis (HCA) was then applied to the final scoring matrix to cluster the samples.

Clustering performance evaluation

The cluster number K, which determines how many clusters are generated by nonnegative matrix factorization (NMF), was evaluated from 4 to 10. First, we evaluated the performance of a single loop of the NETBAGs algorithm without consensus clustering (NETBAG_single). A total of 100 iterations were run for each value of K, and performance was evaluated by the median p-value of the χ2 test in a cross-tabulation with known clinical end points.
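The evaluation described above, a χ2 test on the cross-tabulation of cluster labels against a clinical end point, summarized by the median p-value over repeated runs, could be sketched as follows. The function names are hypothetical, and the study itself reports using the MATLAB crosstab function (see the Figure 2 caption); here SciPy's `chi2_contingency` is swapped in, with the continuity correction switched off on the assumption that the uncorrected Pearson statistic is what was reported.

```python
import numpy as np
from scipy.stats import chi2_contingency


def cluster_endpoint_chi2(cluster_labels, endpoint_labels):
    """Chi-square test of independence between clusters and one clinical end point."""
    cluster_labels = np.asarray(cluster_labels).ravel()
    endpoint_labels = np.asarray(endpoint_labels).ravel()

    # Build the cluster-by-end-point contingency table (the cross-tabulation)
    _, row_idx = np.unique(cluster_labels, return_inverse=True)
    _, col_idx = np.unique(endpoint_labels, return_inverse=True)
    table = np.zeros((row_idx.max() + 1, col_idx.max() + 1), dtype=int)
    np.add.at(table, (row_idx, col_idx), 1)

    # correction=False gives the plain Pearson chi-square (assumed to match the source)
    chi2, p_value, dof, _ = chi2_contingency(table, correction=False)
    return chi2, p_value, dof


def median_p_value(label_runs, endpoint_labels):
    """Median chi-square p-value over repeated clustering runs for one end point."""
    p_values = [cluster_endpoint_chi2(labels, endpoint_labels)[1] for labels in label_runs]
    return float(np.median(p_values))


if __name__ == "__main__":
    # Toy example: two clusters that separate ER+ from ER- samples perfectly
    clusters = [0, 0, 0, 1, 1, 1]
    er_status = ["+", "+", "+", "-", "-", "-"]
    print(cluster_endpoint_chi2(clusters, er_status))
```

A small p-value rejects independence between cluster membership and the end point, so lower median p-values (or higher χ2 values) indicate clusters that track the clinical label more closely.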

Table 1. Overlaps of genes and pathways between clusters. Diagonal: number of DEGs in each cluster, with the number of involved pathways in parentheses. Below the diagonal: pathways overlapping between clusters. Above the diagonal: genes overlapping between clusters.

Cluster  C1         C2      C3        C4        C5        C6        C7        C8        C9        C10
C1       3590 (81)  8       214       47        35        32        648       303       41        115
C2       0          76 (1)  2         2         2         1         2         3         0         2
C3       8          0       255 (14)  3         10        1         58        4         0         2
C4       3          0       2         321 (13)  8         0         6         13        0         31
C5       1          0       2         0         256 (14)  1         11        0         5         2
C6       2          0       0         2         1         184 (21)  1         2         1         36
C7       12         0       0         0         1         0         859 (17)  53        3         9
C8       12         0       1         2         1         3         2         491 (24)  4         15
C9       8          0       2         1         2         2         3         6         173 (19)  3
C10      14         0       0         0         0         1         4         5         4         436 (19)

DEG: Differentially expressed gene.


The null hypothesis is that cluster membership is independent of the clinical end point, i.e., the proportion of samples in any combination of cluster and end point is the product of the corresponding cluster and end point proportions. In total, p-values were calculated for five end points: ER± (E), PR± (P), HER2± (H), triple negative/non-triple negative (TN) and marker combinations.

We then investigated the performance of NETBAGs with the consensus clustering process. The consensus modeling process picked 80% randomly chosen samples and genes in each of 100 repetitions, and the final consensus model was constructed by counting the gene co-occurrence frequency in the same cluster over all iterations. K-means was used to cluster samples based on their similarity vectors with 100 repetitions, and the result is also shown in Table 1. Similarly, K = 10 still showed the best clustering performance for most end points. The comparison between NETBAG_single and NETBAGs indicated that the consensus modeling process might slightly improve performance when K = 6 and K = 8, showed similar performance when K = 10 or K = 4, and would not improve the weak clustering performance for the PR end point.

For comparison, we also used K-means and NMF to cluster samples based on their expression profiles; K-means is a commonly used clustering algorithm, and NMF is the clustering algorithm embedded in NETBAGs. For K-means clustering, we used the original normalized data matrix (including 22,283 probes), since K-means performed poorly on the SEG matrix (median p > 0.05, data not shown). For NMF clustering, we used the SEG matrix (including 8884 genes) to allow a better comparison with NETBAGs. The K-means and NMF clustering processes were each repeated 100 times.

Results

NETBAGs study pipeline

The study pipeline shown in Figure 1 consists of three major components. The first section is data preparation. PPI data were collected from the STRING database, and the top 10% of relationships, those with high confidence, were used to construct the structure of the biological network. In parallel, gene expression data of breast cancer samples were collected from the MAQC-II dataset to generate the SEG matrix (see 'Materials & methods' section), which was used to annotate the biological network. Because SEG profiles differ between samples, the annotated network profile of each sample, or sample-specific network, also differed from the others. In particular, there were 230 sample-specific networks, the same as the number of samples in the breast cancer dataset.

The second section is network-based analysis, which redistributes the weight of nodes in the network according to their network connections. The network propagation algorithm PRINCE [34] was applied to each annotated network, so that every node had a chance to pump signals to and receive signals from its neighbors. After network propagation, samples were clustered based on their redistributed network profiles with the NMF algorithm.

The third section is final cluster analysis and interpretation. A consensus clustering analysis was applied to the NMF results, with 100 repetitions each using 80% of the samples. A consensus scoring matrix was generated to record how many times two samples were clustered into the same cluster. The final clustering result was derived from HCA on the consensus scoring matrix. The top regulated genes of each cluster were then considered potential biomarkers, and those from two dominant clusters were further validated on external datasets (Figure 1).

NETBAGs outperforms other clustering algorithms

The clustering performance was evaluated by the mean value of χ2 over 100 repetitions. Four algorithms (NETBAG_single, NETBAGs, K-means and NMF) were run on the same dataset. As shown in Figure 2, both NETBAG_single and NETBAGs showed a better χ2 than K-means and NMF, especially as K became larger, indicating that the network propagation step in the NETBAGs algorithm contributes to the clustering performance. In addition, NETBAGs performed somewhat better than NETBAG_single, indicating that the consensus clustering process further improves model performance. Also, the combined end point of ER, PR and HER2 showed good clustering performance for all methods, which might indicate that breast cancer samples are biologically divided by multiple factors and that a finer phenotype subtype definition leads to better clustering performance.

Breast cancer subtyping analysis

The final clustering result was obtained with NETBAGs at K = 10. HCA was applied to the final scoring matrix to cluster the breast cancer samples. The result comprised ten clusters (denoted C1 to C10), as shown in Figure 3A, where the consensus score (representing sample similarity) is color coded, with blue as the highest. For comparison, a clustering analysis based directly on sample similarity calculated by Pearson correlation is shown in Figure 3B. The clusters generated by NETBAGs were more clearly separated from each other than those from Pearson correlation.
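The step from repeated clustering runs to a consensus scoring matrix and then to HCA clusters might look like the sketch below. It is a hedged illustration rather than the published implementation: the function names are hypothetical, the use of -1 to mark samples left out of a subsampled run is a convention chosen here, dividing the co-clustering count by the number of runs in which both samples were drawn is an assumption (the text only speaks of co-occurrence frequency), and average linkage is likewise assumed because the linkage method is not stated.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def consensus_matrix(label_runs):
    """Co-clustering frequency for every pair of samples.

    label_runs : (n_runs, n_samples) integer array of cluster labels;
                 -1 marks a sample that was not drawn in that run.
    """
    label_runs = np.asarray(label_runs)
    n_samples = label_runs.shape[1]
    together = np.zeros((n_samples, n_samples))
    sampled = np.zeros((n_samples, n_samples))
    for labels in label_runs:
        present = labels >= 0
        both = np.outer(present, present)          # pair drawn in this run
        same = (labels[:, None] == labels[None, :]) & both
        together += same
        sampled += both
    with np.errstate(invalid="ignore", divide="ignore"):
        score = np.where(sampled > 0, together / sampled, 0.0)
    np.fill_diagonal(score, 1.0)
    return score


def hca_clusters(score, k):
    """Cut an average-linkage tree built on (1 - consensus score) into k clusters."""
    distance = 1.0 - score
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)
    tree = linkage(condensed, method="average")    # linkage method is an assumption
    return fcluster(tree, t=k, criterion="maxclust")


if __name__ == "__main__":
    # Three toy runs over four samples; sample 3 is left out of the last run
    runs = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, -1]]
    score = consensus_matrix(runs)
    print(score)
    print(hca_clusters(score, k=2))
```

Because cluster label values are arbitrary from run to run, only whether two samples share a label is counted, which is what makes the consensus score comparable across repetitions.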

[Figure 1 flowchart, box labels by stage. Data preparation: PPI database (STRING), reliable PPI relationships, PPI network structure, sample pool, gene expression profile, SEG matrix generation (n samples, p SEGs), network annotation file. Network-based clustering: annotated network models, network propagation, smoothed sample-network matrix, NMF & consensus clustering modeling, sample similarity scoring matrix. Functional analysis and validation: HCA clustering result, DEG analysis, DEGs of selected cluster, network modules, network visualization, pathways analysis & validation, potential biomarkers.]

Figure 1. NETBAGs study pipeline, consisting of three major sections. The first section is data preparation. The second section is network-based analysis, which aims to redistribute the weight of nodes in the network according to their network connections. The third section involves final cluster analysis and interpretation. After clusters were generated, the top regulated genes of each cluster were considered potential biomarkers, and those from two dominant clusters were further validated on external datasets. NETBAG: NETwork Based clustering Approach with Gene signature.

The clinical marker statistics of all clusters are presented in Figure 3C. According to the HCA result, five clusters were enriched with current clinical subtypes, such as ER+, HER2+ or TNBC. The ER+ clusters (C1, C2 and C3) were in the same branch of the HCA tree, and all of them were also enriched with HER2-negative tumors. However, their PR statuses were quite different: C1 showed a high PR- ratio, whereas most samples in C3 were PR+. Meanwhile, cluster C4 was enriched with HER2+ tumors, as 24 of 27 samples in C4 were HER2 positive. C5, in turn, was mostly enriched with TNBC samples (23 out of 33). The other clusters (C6, C7, C8, C9 and C10) were not clearly enriched with any clinical subtype.

Diversity of clusters

DEGs for each cluster were extracted by a filter combining a two-sided t-test p-value cutoff and a fold change > 2 between samples inside and outside the cluster. The number of DEGs across the ten clusters varied from 76 to 3590. Pathway analysis was then applied to these DEG lists to identify the enriched KEGG pathways of each cluster. In Table 1, the diagonal gives the number of DEGs (pathways) of each cluster, the lower-left area gives the pathways overlapping between clusters, and the upper-right area gives the genes overlapping between clusters. The overlaps of DEGs and pathways between most clusters were small. For example, C4 and C5 were enriched with HER2+ and TNBC samples, respectively; although each contained roughly 300 DEGs, only 8 overlapped. Moreover, no pathway was shared by these two clusters, indicating significant gene diversity. Similarly, although C2 and C3 were both related to ER+ and HER2-, only two genes overlapped between their DEG lists, indicating genotype differences between the samples in C2 and C3.

Network modules

[Figure 2 plot: three bar-chart panels (K = 6, K = 8 and K = 10) with χ2 (Chi2, axis range 0 to 200) on the vertical axis; bars for NETBAG_single, NETBAGs, K-means and NMF are grouped by end point: ER status, PR status, HER2 status, basal like and EPH.]

Figure 2. χ2 comparison of four clustering algorithms under different end points and numbers of clusters. The EPH combination is the end point combining the three markers (ER, PR and HER2). Blue: NETBAG_single algorithm; Green: NETBAGs; Yellow: K-means; Red: NMF. χ2 was measured with the MATLAB crosstab function; '*' at the EPH end point means the clustering performance of NETBAGs is significantly higher than that of the other clustering methods. ER: Estrogen receptor; NETBAG: NETwork Based clustering Approach with Gene signature; PR: Progesterone receptor.

[Figure 3A and 3B: HCA heat maps with clusters C1 to C10 marked.]

(C) Receptor status per cluster (number of positive/negative samples)

Cluster  ER +/-   PR +/-    HER2 +/-  Total  Notes
C1       28/3     14/17     4/27      31     ER+ & HER2- enriched cluster
C2       21/2     16/7      1/22      23     ER+ & HER2- enriched cluster
C3       14/1     14/1      1/14      15     ER+ & HER2- enriched cluster
C4       10/17    8/19      24/3      27     HER2+
C5       5/28     7/26      1/32      33     Including 23 TNBC samples
C6       9/5      7/7       0/14      14     Others
C7       26/17    21/22     5/38      43     Others
C8       7/6      4/9       1/12      13     Others
C9       12/5     10/7      1/16      17     Others
C10      9/5      3/11      2/12      14     Others
Total    141/89   104/126   40/190    230

Figure 3. Breast cancer subtyping. (A) HCA clustering result by NETBAGs, repetition = 100, K = 10. Blue in the HCA matrix represents high similarity, while white represents low and intermediate similarity. A total of 230 samples were clustered into ten clusters, labeled C1–C10 with red arrows. (B) HCA clustering result based on sample similarity calculated by Pearson correlation. Red in the HCA matrix represents high similarity, while white represents low and intermediate similarity. (C) The clinical marker statistics of all clusters from NETBAGs. HCA: Hierarchical clustering analysis; NETBAG: NETwork Based clustering Approach with Gene signature.

PPI network models of each cluster were constructed from their DEGs. Only relationships between two DEGs were considered, and the giant components of the network model for each cluster are shown in Figure 4. Clusters were enriched with different pathways. For example, C3 was enriched with the neurotrophin signaling pathway (KEGG ID: hsa04722, p = E-06) and mismatch repair (KEGG ID: hsa03430, p = E-06), whereas C4 was also enriched with the neurotrophin signaling pathway (p = E-04) but further enriched with the T-cell receptor signaling pathway (KEGG ID: hsa04660, p = E-06); C5 was enriched with cell cycle (KEGG ID: hsa04110, p …
