OPMSP: A Computational Method Integrating Protein ...

Protein & Peptide Letters, 2016, 23, 1-14

RESEARCH ARTICLE

OPMSP: A Computational Method Integrating Protein Interaction and Sequence Information for the Identification of Novel Putative Oncogenes

Lei Chen1,$,*, Baoman Wang2,$, ShaoPeng Wang3,$, Jing Yang3, Jerry Hu4, ZhiQun Xie5, Yuwei Wang6, Tao Huang2,** and Yu-Dong Cai3,***

1College of Information Engineering, Shanghai Maritime University, Shanghai 201306, People’s Republic of China; 2Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, People’s Republic of China; 3College of Life Science, Shanghai University, Shanghai 200444, People’s Republic of China; 4Department of Mathematics and Computer Science, School of Arts and Sciences, University of Houston-Victoria, Victoria, TX 77901, USA; 5School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA; 6State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macau (SAR), People’s Republic of China

ARTICLE HISTORY: Received: June 25, 2016; Revised: September 30, 2016; Accepted: October 9, 2016. DOI: 10.2174/0929866523666161021165506

Abstract: Oncogenes are genes that have the potential to cause cancer. Oncogene research can provide insight into the occurrence and development of cancer, thereby helping to prevent cancer and to design effective treatments. This study proposes a network method called the oncogene prediction method based on shortest path algorithm (OPMSP) for the identification of novel oncogenes in a large protein network built using protein-protein interaction data. Novel putative genes were extracted from the shortest paths connecting any two known oncogenes. Then, they were filtered by a randomization test, and the linkages among them and known oncogenes were measured by protein interaction and sequence data. Thirty-seven new putative oncogenes were identified by this method. The enrichment analysis of the 37 putative oncogenes indicated that they are highly associated with several biological processes related to the initiation, progression and metastasis of tumors. Six of these genes (ESR1, CDK9, SEPT2, HOXA10, LMX1B, and NR2C2) are extensively discussed. Several lines of evidence indicate that they may be novel oncogenes.

Keywords: Oncogene, protein-protein interaction, protein sequence, BLAST, shortest path algorithm.

*Address correspondence to these authors at the College of Information Engineering, Shanghai Maritime University, People’s Republic of China; Tel: 0086-21-38282825; Fax: 0086-21-38282800; E-mail: [email protected]
**Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, People’s Republic of China; Tel: 86-215492-3269; E-mail: [email protected]
***Address correspondence to this author at the College of Life Science, Shanghai University, People’s Republic of China; Tel: 86-21-66136132; Fax: 86-21-6613-6109; E-mail: [email protected]
$These authors contributed equally to this work.
0929-8665/16 $58.00+.00 © 2016 Bentham Science Publishers. Protein & Peptide Letters, 2016, Vol. 23, No. 12

1. INTRODUCTION

The study of disease through the use of bio-markers could revolutionize our understanding of tumorigenesis and provide new candidate targets for further drug research and development [1]. With the constantly growing understanding of tumorigenesis, various theories about such malignant processes have been presented. In 1969, a specific oncogene hypothesis was proposed by Huebner and Todaro that postulated the vertical transmission of eukaryotic DNA from RNA tumor viruses in so-called virogenetic segments [2]. Such tumor-associated genes contained in viral genomes possess the ability to transform normal cells into tumor cells. The isolation of reverse transcriptase altered the entire work flow of research on RNA tumor viruses, which further allowed novel lines of investigation of their neoplastic mechanisms. In 1976, Stehelin et al. characterized genetic material of ASV (avian sarcoma virus) that may contribute to specific tumorigenesis [3]. Moreover, these researchers demonstrated that such oncogenic genes may already be present in eukaryotic cells as normal counterparts. Afterwards, growing numbers of transforming genes were characterized in various retroviruses, which further confirmed the cellular origins and validated the virogene/oncogene theory [4]. As a result, these genes were identified as oncogenes, and the counterparts were identified as proto-oncogenes.

DNA sequencing and gene expression profiling contribute to the prognosis of various malignancies. Chronic myeloid leukemia and breast cancer are examples of malignancies that have been validated to have functional prognostic indicators that can be identified by DNA copy number variants and gene expression profiles [5]. With the development of computational technology, the molecular, clinical and population traits of cancers for diagnosis and fundamental studies can be summarized and integrated by reliable databases, promoting the study of tumorigenesis. The identification and validation of functional tumor-associated genes revealed the underlying mechanisms of tumorigenesis. The analysis of tumor-associated genes may also provide a functional tool for the assessment of cancer risk and the improvement of prognosis. Furthermore, the identification of molecular markers of disease progression aids in the optimal selection of treatment routines [6]. Further studies on tumor-associated genes revealed that different oncogenic genes may contribute to different steps of tumorigenesis [7-11]. Functional genes may be identified to further improve patient management and lead to the development of targeted therapies [12].

For a long time, the identification of candidate oncogenic genes relied on experimental techniques performed on typical cell lines or animal models. These technologies can be categorized into at least three groups according to the molecular level at which they operate: the genomic level, the transcriptomic level and the proteomic level. Different technologies can be applied to identify oncogenes that initiate tumorigenesis on different levels. Take the transcriptomic level as an example. The abnormal expression of functional genes, such as TP53 and KRAS, may contribute to malignant transformation in various tumor subtypes [13-15]. Multiple experimental methods, such as RNA sequencing, real-time quantitative polymerase chain reactions and RNA interference, can verify the expression alteration of certain genes and reveal the potential oncogenic role of such alterations. However, these experiments are time-consuming and expensive to extend to pan-cancer studies.
With the development of computing technology and abundant multi-omics data, we can build effective computational methods to conduct a systematic analysis of multiple cancer subtypes. High-throughput omics technologies have developed rapidly in recent years and have generated massive multi-omics data that provide both opportunities and challenges. These data are a reliable source and an effective analytic platform for studying the underlying mechanisms of complicated biological processes, but handling data at this scale effectively is a great challenge. Building effective computational methods is an alternative technique. Recently, many studies have adopted a classic graph algorithm, the shortest path algorithm, to identify disease genes [16-22]. The shortest path algorithm was used to search for the shortest paths connecting any two validated genes in a constructed network, and the inner nodes of these paths were deemed to represent possible disease genes.

In this paper, we adopted the shortest path algorithm to build a new network method to identify novel oncogenes. This method is called the oncogene prediction method based on shortest path algorithm (OPMSP). Based on the fact that proteins that can interact with each other are more likely to have similar functions, a shortest path algorithm was applied to a large network to search for novel putative oncogenes based on known oncogenes, where the network was constructed using the current protein-protein interaction (PPI) information retrieved from the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) [23]. In addition, a strict screening procedure was applied to select the most likely putative oncogenes. The obtained putative genes were analyzed by the functional annotation tool DAVID [24] to uncover their biological meaning. Among the identified putative oncogenes, we analyzed six and discussed the molecular mechanisms modulating their expression, which further validated the effectiveness of the OPMSP. We believe that the OPMSP method will contribute to the identification of targets for novel therapies and will aid in elucidating the molecular processes underlying tumor development.

2. MATERIALS AND METHODS

2.1. Materials

Known oncogenes were compiled from the HGNC database (HUGO Gene Nomenclature Committee, http://www.genenames.org/) [25] and the oncogene family of GSEA MSigDB (Gene Set Enrichment Analysis Molecular Signatures Database, http://www.broadinstitute.org/gsea/msigdb/gene_families.jsp) [26, 27] using the keyword “oncogene” as a search criterion. In total, 251 oncogenes were collected from the HGNC database (accessed in January 2013), and 330 oncogenes were retrieved from MSigDB (Version 4.0, accessed in September 2013). In GSEA MSigDB, the genes were categorized into gene families according to their functions. The oncogenes, defined as genes for which a single mutated allele is sufficient to contribute to tumorigenesis, were collected from known literature reports [28]. HGNC is a public database that approves unique names and symbols for protein-coding genes, ncRNA genes and pseudogenes. The database collects genes from reports and researchers to create standard names and symbols that are used in the major databases, such as Ensembl, UniProt, NCBI Gene, the UCSC Genome Browser and GeneCards [25]. After combining these genes, we obtained 543 oncogenes. To apply the oncogene data to the PPI network reported in STRING (http://string-db.org/) [23], the oncogenes were mapped onto their Ensembl IDs, and those with available Ensembl IDs were retained.
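The compilation step above (251 HGNC genes and 330 MSigDB genes merged into 543, of which those with Ensembl IDs are kept) amounts to a set union followed by an ID lookup. The following is a minimal sketch under stated assumptions: the gene lists are tiny toy stand-ins, `symbol_to_id` is a hypothetical lookup table, and the `ENSP_TOY_*` identifiers are made up for illustration rather than real Ensembl IDs.

```python
def merge_and_map(sources, symbol_to_id):
    """Union the gene symbols of all sources, then keep those with a known ID.

    sources      : dict mapping a source name to a set of gene symbols
    symbol_to_id : dict mapping a gene symbol to an identifier (hypothetical)
    Returns (merged symbols, {identifier: sorted list of source names}).
    """
    merged = set().union(*sources.values())
    mapped = {}
    for symbol in sorted(merged):
        if symbol in symbol_to_id:  # symbols without an ID are dropped
            origin = sorted(name for name, symbols in sources.items()
                            if symbol in symbols)
            mapped[symbol_to_id[symbol]] = origin
    return merged, mapped


# Toy stand-ins for the two source lists (not the real 251 / 330 genes).
sources = {
    "HGNC": {"KRAS", "MYC", "ABL1"},
    "MSigDB": {"MYC", "ABL1", "BRAF", "TOYGENE"},
}
# Hypothetical symbol-to-ID table; "TOYGENE" deliberately has no entry.
symbol_to_id = {
    "KRAS": "ENSP_TOY_1",
    "MYC": "ENSP_TOY_2",
    "ABL1": "ENSP_TOY_3",
    "BRAF": "ENSP_TOY_4",
}

merged, mapped = merge_and_map(sources, symbol_to_id)
print(len(merged), "merged symbols;", len(mapped), "with an ID")
```

Keeping the source names alongside each retained ID mirrors the per-gene provenance (HGNC or MSigDB) that the authors report in their supplementary material.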
Finally, 482 oncogenes with corresponding Ensembl IDs were obtained for further study. For convenience, we denote the set containing these 482 oncogenes by D_oncogene. Detailed information for these 482 oncogene Ensembl IDs and their sources (HGNC or MSigDB) is provided in Supplementary Material I.

2.2. Description of the OPMSP

The OPMSP consists of a searching procedure and a screening procedure. The former attempts to identify putative genes in a network, while the latter filters false positives and selects core putative genes. The following section provides a detailed description of these procedures.

2.2.1. Searching Procedure

PPIs play essential roles in many intracellular and intercellular biochemical processes, and two proteins within a PPI are likely to share similar functions [16, 17, 29-34]. The oncogenes investigated in this study are therefore likely to share common features because they all have the potential to cause tumors or cancer. Thus, it is reasonable to attempt to identify putative oncogenes based on known oncogenes using PPI data.

The PPI data used in this study were downloaded from STRING (Version 9.1, http://string-db.org/) [23]. The file ‘protein.links.v9.1.txt.gz’ provided in STRING contains the PPI data of several organisms. Because ‘9606’ is the organism code of Homo sapiens in STRING, we extracted the lines starting with ‘9606’ (indicating an interaction in Homo sapiens) and obtained 2,425,314 human PPIs that involve 20,770 proteins. These PPIs were derived from various sources, such as high-throughput experiments, (conserved) co-expression, genomic context, and previous knowledge. Thus, they measure the association between proteins by both physical (i.e., direct) association and functional (i.e., indirect) association. Each human PPI contains two proteins that are represented by Ensembl IDs and a score, ranging from a minimum of 150 to a maximum of 999, that evaluates the likelihood of the occurrence of the PPI. The score of a human PPI between proteins p1 and p2 was denoted S(p1, p2).

Based on the available human PPI data, a large network was constructed. Each of the 20,770 proteins was defined as a node in the network, and two nodes were considered adjacent if and only if their corresponding proteins constituted a PPI in STRING, i.e., an edge represented a PPI. From the interaction score of each PPI, a weight was assigned to each edge, which was defined as w(e) = 1000 - S(p1, p2), where p1 and p2 are the corresponding proteins at the endpoints of e. We used 1000 minus the interaction score as the edge weight because the maximum interaction score was 999, so every weight is positive, and because a stronger interaction then corresponds to a shorter edge, as the shortest path algorithm requires.
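The weighting scheme can be illustrated with a small sketch that turns STRING-style lines into a weighted network with w(e) = 1000 - S(p1, p2) and runs a shortest-path search on it. This is a minimal illustration, not the authors' implementation: the three-column line layout, the toy protein names (`P1`, `P2`, ...), and the use of Dijkstra's algorithm via the standard-library `heapq` module are assumptions made for the example.

```python
import heapq


def parse_string_links(lines, organism="9606"):
    """Build an undirected weighted adjacency map from STRING-style lines.

    Each data line is assumed to look like '9606.ID_A 9606.ID_B 900'.
    Lines for other organisms (or a header line) are skipped.
    """
    graph = {}
    prefix = organism + "."
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue
        a, b, score = parts
        if not (a.startswith(prefix) and b.startswith(prefix)):
            continue
        weight = 1000 - int(score)  # high confidence -> short edge
        a, b = a[len(prefix):], b[len(prefix):]
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns (path, total weight) or (None, inf)."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for nbr, w in graph.get(node, {}).items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path, dist[target]


# Toy input: three human interactions and one from another organism.
toy_lines = [
    "protein1 protein2 combined_score",
    "9606.P1 9606.P2 950",
    "9606.P2 9606.P3 900",
    "9606.P1 9606.P3 200",
    "10090.M1 10090.M2 999",
]
graph = parse_string_links(toy_lines)
path, total = shortest_path(graph, "P1", "P3")
print(path, total)
```

In this toy network the direct edge P1-P3 has a low score (weight 800), so the search prefers the two high-confidence edges through P2 (weights 50 and 100), which is the behavior the weighting is meant to produce.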
Because two proteins in a PPI are widely accepted to share similar functions, and because the shortest path method has been used to address several biological problems [16-19, 30, 32, 35-38], the shortest paths that connect any two of the 482 Ensembl IDs were searched for in the network. For any shortest path (e.g., p1, p2, …, pn), where p1 and pn are oncogenes, the weights of the edges are small, which, by the definition of edge weight, means that the corresponding interactions have high interaction scores. Thus, the corresponding proteins of consecutive nodes in the path are highly likely to share similar functions. Because the endpoints of the path are oncogenes, the inner nodes of the path share functions with oncogenes (i.e., they may be novel oncogenes). Accordingly, the proteins of the inner nodes in all obtained shortest paths were extracted, and those that were among the 482 original oncogenes were discarded. The resulting gene set was regarded as the candidate gene set for the subsequent filtering steps. To quantify each candidate gene, we determined its betweenness, which is defined as the number of paths containing the candidate gene. Betweenness indicates the direct and indirect influences of the candidate genes in the network [39-41].

2.2.2. Screening Procedure

Candidate genes were obtained by applying the shortest path algorithm to the constructed network. However, there were false positives because the translation products of certain candidate genes may play fundamental roles in cell survival, with basic functions of transferring energy molecules from the external environment or generating chemical energy through redox reactions [42-44]. Thus, there are frequent associations between these types of candidate genes and other genes (i.e., they may be selected by the searching procedure with high probability). In fact, they exhibit few associations with the biological processes of tumor formation and development. To control for these genes, a screening procedure was developed that consisted of the following two parts: (1) a randomization test to filter irrelevant candidate genes and (2) the use of protein interaction and sequence information to select core candidate genes.

To execute the randomization test, we first randomly produced 1,000 sets of Ensembl IDs, for example, G1, G2, …, G1000. Each of them had 482 Ensembl IDs, and none of them was identical to the set D_oncogene. For each Gi, the shortest paths connecting any two Ensembl IDs in the set were searched for in the network, as elaborated in Section 2.2. For each candidate gene, its betweenness was determined based on these paths. Finally, for each candidate gene, we obtained 1,000 betweenness values produced by the 1,000 random sets and one betweenness value produced by the 482 oncogenes in D_oncogene. To measure the occurrence frequency of each candidate gene in all randomly produced sets, the permutation FDR was computed as follows:

$$\mathrm{FDR}(g) = \frac{\sum_{i=1}^{1000} f_i}{1000} \tag{1}$$

where f_i was set to 1 if the betweenness of g produced by Gi was larger than that of g produced by oncogenes; otherwise, it was set to 0. Candidate genes with high permutation FDRs are universal genes in the network that are not specific to cancer. To discard them, 0.05 was set as the threshold for the permutation FDR (i.e., candidate genes with permutation FDRs less than 0.05 were retained for further analysis).

To identify genes closely related to known oncogenes, the remaining candidate genes were filtered using protein interactions and sequences. As mentioned in Section 2.2, two proteins that form a PPI share common structural and functional features [16, 29, 31, 33, 34]. This fact allowed us to select core candidate genes that were closely related to known oncogenes. The maximum interaction score was calculated for each remaining candidate gene g as follows:

$$\text{Score-interaction}(g) = \max\{\, S(g, g') : g' \in D_{\text{oncogene}} \,\} \tag{2}$$

where S(g, g') is the interaction score between g and g', as mentioned in Section 2.2.1, and D_oncogene is the set consisting of validated oncogenes. A high maximum interaction score for candidate gene g means that g is highly related to at least one validated oncogene, which implies that it is a novel oncogene with high probability. Thus, candidate genes that receive high maximum interaction scores should be selected.

Proteins with similar structures share similar functions. Among the candidate genes, those with structures similar to known oncogenes were considered more likely to be novel oncogenes. Thus, the basic local alignment search tool (BLAST) [45] was used to calculate the similarity of two protein sequences. The alignment score for proteins p1 and p2 was denoted by S_s(p1, p2). We calculated the maximum alignment score for each remaining candidate gene g as follows:

$$\text{Score-sequence}(g) = \max\{\, S_s(g, g') : g' \in D_{\text{oncogene}} \,\} \tag{3}$$

where S_s(g, g') is the alignment score between g and g', as mentioned above, and D_oncogene is the set consisting of validated oncogenes. By the same argument as for the maximum interaction score, candidate genes with high maximum alignment scores are more likely to be novel oncogenes and should be selected.

Candidate genes that receive both high maximum interaction scores and high maximum alignment scores should be selected. Because 900 is used in STRING as the cutoff of the highest confidence level, it was set as the threshold for the maximum interaction score. An alignment score of 90 was taken as the threshold for homology between two proteins and was therefore set as the threshold for the maximum alignment score. The final putative oncogenes were thus those candidate genes that passed the randomization test and were assigned maximum interaction scores no less than 900 and maximum alignment scores no less than 90.

The OPMSP procedure can be summarized as follows:

Algorithm OPMSP
Input: Known oncogenes, a weighted protein network
Output: A list of putative oncogenes
1. Searching procedure
1.1 Search all shortest paths connecting any two known oncogenes
1.2 Extract inner nodes from these paths; select the corresponding genes not previously identified as oncogenes as candidate genes
2. Screening procedure
2.1 For each candidate gene, calculate its FDR using Eq. 1 and select candidate genes with FDRs smaller than 0.05
2.2 For each remaining candidate gene, calculate the maximum interaction score using Eq. 2
2.3 For each remaining candidate gene, calculate the maximum alignment score using Eq. 3
2.4 Select candidate genes with a maximum interaction score no less than 900 and a maximum alignment score no less than 90
3. Output the remaining candidate genes as putative oncogenes
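Steps 1.2 through 3 of the algorithm can be sketched in a few functions. This is a minimal illustration under stated assumptions, not the authors' code: the shortest paths are taken as given lists of node names, pairwise scores are stored in dictionaries keyed by unordered pairs, a missing pair is treated as score 0 (an assumption made here), and the toy example uses 100 random sets instead of the 1,000 used in the paper.

```python
from collections import Counter


def betweenness(paths, known):
    """Step 1.2: count, per non-oncogene inner node, the paths containing it."""
    counts = Counter()
    for path in paths:
        for node in set(path[1:-1]):  # inner nodes only, once per path
            if node not in known:
                counts[node] += 1
    return counts


def permutation_fdr(observed, random_values):
    """Eq. 1: fraction of random sets whose betweenness exceeds the observed one."""
    return sum(1 for b in random_values if b > observed) / len(random_values)


def max_score(gene, known, pair_score):
    """Eq. 2 / Eq. 3: best pairwise score between gene and any known oncogene.

    Pairs absent from pair_score count as 0 (an assumption of this sketch).
    """
    return max((pair_score.get(frozenset((gene, k)), 0) for k in known),
               default=0)


def screen(candidates, known, random_betweenness, interaction, alignment,
           fdr_cutoff=0.05, interaction_cutoff=900, alignment_cutoff=90):
    """Steps 2.1-2.4: keep candidates passing all three filters."""
    selected = []
    for gene, observed in candidates.items():
        if permutation_fdr(observed, random_betweenness[gene]) >= fdr_cutoff:
            continue
        if max_score(gene, known, interaction) < interaction_cutoff:
            continue
        if max_score(gene, known, alignment) < alignment_cutoff:
            continue
        selected.append(gene)
    return sorted(selected)


# Toy data: two known oncogenes (O1, O2) and three candidates (A, B, C).
known = {"O1", "O2"}
paths = [["O1", "A", "O2"], ["O1", "A", "B", "O2"],
         ["O2", "B", "O1"], ["O1", "C", "O2"]]
candidates = betweenness(paths, known)
random_betweenness = {
    "A": [0] * 98 + [3] * 2,   # FDR 0.02 -> passes
    "B": [5] * 50 + [0] * 50,  # FDR 0.50 -> rejected as a "universal" gene
    "C": [0] * 100,            # FDR 0.00 -> passes
}
interaction = {frozenset(("A", "O1")): 950, frozenset(("C", "O2")): 990}
alignment = {frozenset(("A", "O2")): 120, frozenset(("C", "O1")): 40}

putative = screen(candidates, known, random_betweenness, interaction, alignment)
print(putative)
```

In the toy run, B is removed by the randomization test and C by the alignment threshold, so only A survives; the three filters are applied in the same order as steps 2.1 to 2.4.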

3. RESULTS AND DISCUSSION From the constructed network, we can extract a number of putative oncogenes based on known oncogenes. The entire procedure of the OPMSP method is illustrated in Figure 1, and the results and discussion are presented below.

Figure 1. Overview of our method for the identification of novel putative oncogenes.


3.1. Results of the Searching Procedure

Based on the method described in Section 2.2, all shortest paths connecting a pair of oncogenes in the constructed network were detected. The Ensembl IDs of genes occurring as inner nodes in these paths were obtained. After excluding the Ensembl IDs among the 482 oncogenes, 818 Ensembl IDs were obtained. These Ensembl IDs were mapped to gene symbols, resulting in the 789 gene symbols listed in Supplementary Material II. Additionally, for each of the 789 gene symbols, the betweenness was determined and is provided in Supplementary Material II. These 789 genes were deemed candidate genes.

3.2. Results of the Screening Procedure

To exclude false positives among the 789 candidate genes, a randomization test was performed. Each candidate gene received a permutation FDR (Supplementary Material II). After setting 0.05 as the threshold for the permutation FDR, 199 candidate genes remained (Supplementary Material III).

To select core candidate genes from the remaining 199 genes, each candidate gene was further evaluated by calculating its maximum interaction score (cf. Eq. 2) and maximum alignment score (cf. Eq. 3). These values are listed in Supplementary Material III. Figures 2 and 3 show the distributions of the maximum interaction scores and maximum alignment scores of the 199 candidate genes, respectively, from which we can see that more than 70% of the candidate genes received maximum interaction scores no less than 900 and more than 20% received maximum alignment scores no less than 90. According to the criteria outlined in Section 2.2.2, 900 and 90 were selected as the thresholds of the maximum interaction score and the maximum alignment score, respectively. To assess the rationality of these thresholds, we also computed the maximum interaction scores and maximum alignment scores of the 482 oncogenes and determined their distributions, as shown in Figures 2 and 3, respectively. Approximately 50% of the oncogenes received maximum interaction scores no less than 900, and approximately 70% had maximum alignment scores no less than 90. Thus, we selected genes with maximum interaction scores no less than 900 and maximum alignment scores no less than 90 and obtained 37 core candidate genes (Supplementary Material IV). These 37 genes are closely related to known oncogenes and were deemed highly likely to be novel oncogenes. Thus, they were defined as putative oncogenes.

Figure 2. The distribution of the maximum interaction scores of 199 candidate genes and validated oncogenes.

Figure 3. The distribution of the maximum alignment scores of 199 candidate genes and validated oncogenes.

3.3. Enrichment Analysis of Putative Oncogenes Using DAVID

The 37 putative oncogenes yielded by the OPMSP method were analyzed by the functional annotation tool DAVID [24], which can uncover the biological meaning behind putative oncogenes. The results included the gene ontology (GO) term enrichment analysis and KEGG pathway enrichment analysis, which are provided in Supplementary Material V.

3.3.1. GO Terms Analysis

The GO term enrichment analysis results indicated that the 37 putative oncogenes were enriched in four GO terms with adjusted p-values less than 0.05. These four GO terms, in addition to some key measurements yielded by DAVID, are listed in Table 1. According to the basic classification of GO terms, two of the four GO terms (GO:0007264, GO:0008283) were biological processes, while the other two (GO:0003707, GO:0004879) were molecular functions.

Eight putative oncogenes were enriched in GO:0007264, describing small GTPase-mediated signal transduction. Abnormal regulation and activation of small GTPase-mediated signaling pathways have been widely reported to participate in the initiation and progression of various tumor subtypes, including lung cancer, pancreatic cancer and hepatic cancer, indicating a core regulatory role of this biological process in tumorigenesis [46-48]. Take pancreatic cancer as an example.
Recent publications revealed that during the initiation and progression of pancreatic cancer, small GTPase-mediated signal transduction contributes to the abnormal proliferation processes and mediates the sensitivity of tumors to chemotherapy and radiotherapy, suggesting a crucial oncogenic role of our screened GO term [48-50]. In addition to this GO term, we also identified a proliferation-associated GO term, GO:0008283. Because abnormal proliferation is one of the core characteristics of tumors, it is reasonable that our screened genes were enriched in this biological process [51-53]. Nine putative oncogenes were enriched in this group. MAP2K1, as a putative oncogene, has been confirmed to contribute to such biological processes. Recent publications confirmed that this gene may initiate and promote carcinogenesis processes by activating abnormal proliferative processes of regional cells, validating the core contribution of cell-proliferation-associated biological processes during tumor initiation and progression [54, 55].

Two molecular function GO terms are also listed in Table 1. GO:0003707, describing steroid hormone receptor activity, is also in our enrichment analysis list. Not all tumor subtypes are related to steroid-hormone-associated biological processes [56, 57]. However, tumor subtypes such as endocrine cancer, breast cancer and prostate cancer have been confirmed to have some correlation with the obtained GO term, steroid hormone receptor activity [56, 58, 59]. Considering the potential oncogenic function of such biological processes, it is reasonable that our putative oncogenes were enriched in this cluster. Another GO term, GO:0004879, is related to ligand-dependent nuclear receptor activity, involving various functional receptors, including steroid hormone receptors, retinoic acid receptors, and juvenile hormone receptors. Steroid hormone receptor activity, which contributes to steroid-hormone-associated biological processes, may be oncogenic, suggesting that the enriched GO term GO:0004879 contributes to tumorigenesis as its parent gene ontology term [56, 57].
Additionally, functional nuclear receptors, such as retinoic acid receptors and juvenile hormone receptors, have also been confirmed to be associated with tumorigenesis-mediated oncogenic biological processes, validating our prediction and enrichment results [60-62].

Table 1. GO terms enriched by 37 putative oncogenes.

| GO term ID | GO term | Putative genes associated with GO term | FDR-adjusted p-value | Category |
|---|---|---|---|---|
| GO:0007264 | small GTPase-mediated signal transduction | MAP2K1, RALBP1, GRB2, MAPK3, SOS2, ARHGAP1, GNA12, RAPGEF2 | 1.96E-02 | biological processes |
| GO:0008283 | cell proliferation | FGF6, AR, LMX1B, MAP2K1, ESRRB, WNT3A, CDK9, FGF10, NR2C2 | 2.22E-02 | biological processes |
| GO:0003707 | steroid hormone receptor activity | PPARA, AR, ESRRB, ESR1, NR2C2 | 1.02E-02 | molecular function |
| GO:0004879 | ligand-dependent nuclear receptor activity | PPARA, AR, ESRRB, ESR1, NR2C2 | 2.01E-02 | molecular function |

3.3.2. KEGG Pathway Analysis

Apart from the GO term enrichment analysis, we also obtained seven KEGG pathways in which our putative oncogenes were enriched, which are listed in Table 2. All seven pathways have been confirmed to contribute to the initiation and progression of at least one tumor subtype, validating the accuracy and efficacy of our prediction and enrichment analysis. Among the seven KEGG pathways, three pathways (hsa05200: Pathways in cancer; hsa05215: Prostate cancer; hsa05210: Colorectal cancer) were tumor-associated pathways. Our putative oncogenes FGF6, AR, MAP2K1, RALBP1, and GRB2 were enriched in such biological processes and may contribute to tumorigenesis, suggesting the accuracy and efficacy of our prediction. Our predicted oncogenes may have tissue specificity. Genes such as AR and EP300 have been clustered into prostate-cancer-associated pathways but not colorectal-cancer-associated pathways, as confirmed in recent literature [63-65].

Table 2. KEGG pathways enriched by 37 putative oncogenes.

| KEGG pathway ID | KEGG pathway | Putative genes associated with KEGG pathway | FDR-adjusted p-value |
|---|---|---|---|
| hsa05200 | Pathways in cancer | FGF6, AR, MAP2K1, RALBP1, GRB2, WNT3A, FZD1, FGF10, FZD10, EP300, SOS2, MAPK3, IKBKB | 3.08E-06 |
| hsa05215 | Prostate cancer | AR, EP300, MAP2K1, GRB2, MAPK3, SOS2, IKBKB | 1.97E-03 |
| hsa04916 | Melanogenesis | FZD10, EP300, MAP2K1, CAMK2G, WNT3A, MAPK3, FZD1 | 3.68E-03 |
| hsa04320 | Dorso-ventral axis formation | NOTCH3, MAP2K1, GRB2, MAPK3, SOS2 | 3.85E-03 |
| hsa04010 | MAPK signaling pathway | FGF6, MAP2K1, GRB2, MAPK3, SOS2, GNA12, FGF10, IKBKB, RAPGEF2 | 1.30E-02 |
| hsa04662 | B cell receptor signaling pathway | PTPN6, MAP2K1, GRB2, MAPK3, SOS2, IKBKB | 1.70E-02 |
| hsa05210 | Colorectal cancer | FZD10, MAP2K1, GRB2, MAPK3, SOS2, FZD1 | 2.96E-02 |

Apart from the KEGG pathways that directly describe certain cancer subtypes, we obtained four specific pathways that may have potential oncogenic effects. Melanogenesis, described as hsa04916, is the main biological function of the cutaneous melanin pigment, a crucial component that contributes to camouflage, mimicry and social communication [66, 67]. Strong relationships between abnormal melanogenesis and melanoma have been revealed and confirmed by various recent publications, validating our predicted oncogenic role of the pathway and the candidate genes enriched in it [68-70]. Dorso-ventral axis formation (hsa04320) is another potential tumor-associated pathway. As an embryonic-development-associated pathway, it involves various early-development-associated genes, such as NOTCH3, MAP2K1 and SOS2 [71-73]. Considering that tumorigenesis has been widely regarded as a regional characteristic dedifferentiation process involving various development-associated genes, it is reasonable that our screened KEGG pathway related to early development may also contribute to tumorigenesis, validating our prediction [74-78].

In addition to these signaling pathways, there are two crucial candidates, the MAPK signaling pathway (hsa04010) and the B cell receptor signaling pathway (hsa04662). Both KEGG pathways have been confirmed to contribute to tumorigenesis, validating the efficacy and accuracy of our prediction. The MAPK signaling pathway has been confirmed to contribute to the initiation, progression and metastasis of certain tumor subtypes, including lung cancer, hepatic cancer and gastric cancer [79-82]. Nine putative oncogenes were enriched in these biological processes. As for the B cell receptor signaling pathway, although it is an immune-associated biological process, recent publications revealed the underlying correlation between the B cell receptor and tumorigenesis [83, 84]. Take GRB2, which contributes to this biological process, as an example. In melanoma, GRB2 contributes to the specific biological processes that regulate the proliferation, migration and invasion of melanoma, validating its candidate oncogenic role [85]. Additionally, a specific loss-of-function of the B cell receptor associated gene PRDM1/BLIMP-1 has been confirmed to contribute to B-cell lymphoma, further validating the essential oncogenic role of the pathway in multiple tumor subtypes [86].

Based on the above discussion, the biological meaning of the 37 putative oncogenes was uncovered. These 37 putative oncogenes are strongly related to several biological processes related to the initiation, progression and metastasis of tumors, which demonstrates the efficacy and accuracy of the predictions yielded by the OPMSP method.

3.4. Analysis of Putative Genes Yielded by the OPMSP Method

The OPMSP method identified 37 putative oncogenes. Because candidate genes that have low permutation FDRs are more likely to be novel oncogenes, we selected six putative genes (ESR1, CDK9, SEPT2, HOXA10, LMX1B, and NR2C2) with permutation FDRs [...] G) has been linked to the initiation and progression of hepatocellular carcinoma in Mongolian races but has the reverse effect in Caucasian races [92]. Many studies have focused on ESR1 rs9340799 and ESR2 rs1256049 polymorphisms and their association with prostate cancer risk [93]. Taken together, ESR1 is a likely oncogene. Nevertheless, several studies have shown that ESR1 is a candidate tumor suppressor gene in HCC. ERα suppresses bladder cancer initiation and invasion via modulation of the AKT pathway. Therefore, whether ESR1 gene polymorphisms are associated with prostate cancer risk must be validated. We thus presume that ESR1 is a bi-functional gene.

CDK9. This gene has been confirmed to be a specific cyclin-dependent protein kinase. Based on the T family cyclins and cyclin K, CDK9 plays a specific role in kinase activity.
CDK9 and one of its cyclin partners form a heterodimer, which is the main component of the functional transcription regulator P-TEFb. The elongation process of RNA polymerase II (pol II) transcripts is stabilized by P-TEFb [94]. By controlling RNA pol II-mediated gene expression, the CDK9-related pathway functions in multiple biological processes, such as cell proliferation, differentiation, cell growth and protection from apoptosis. Most importantly, perturbation of CDK activity can result in tumorigenesis. In more differentiated primary neuroectodermal and neuroblastoma tumors, the CDK9/cyclin T1 complex is highly expressed, which indicates that kinase expression is associated with tumor differentiation grade

Detailed information regarding six putative genes (permutation FDR
