Journal of the American Medical Informatics Association, 0(0), 2018, 1–8 doi: 10.1093/jamia/ocy117 Research and Applications
Heterogeneous network embedding for identifying symptom candidate genes Kuo Yang,1 Ning Wang,1 Guangming Liu,1 Ruyu Wang,1 Jian Yu,1 Runshun Zhang,2 Jianxin Chen3 and Xuezhong Zhou1,4 1 School of Computer and Information Technology and Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing, China 2Guanganmen Hospital, China Academy of Chinese Medical Sciences, Beijing, China 3Beijing University of Chinese Medicine, Beijing, China and 4Data Center of Traditional Chinese Medicine, China Academy of Chinese Medical Sciences, Beijing, China
Corresponding Author: Xuezhong Zhou, PhD, Beijing Jiaotong University, No.3 Shangyuancun, Haidian District, Beijing, 100044 ([email protected])
Received 8 March 2018; Revised 24 July 2018; Editorial Decision 9 August 2018; Accepted 11 August 2018
ABSTRACT Objective: Investigating the molecular mechanisms of symptoms is a vital task in precision medicine to refine disease taxonomy and improve the personalized management of chronic diseases. Although there are abundant experimental studies and computational efforts to obtain the candidate genes of diseases, the identification of symptom genes is rarely addressed. We curated a high-quality benchmark dataset of symptom-gene associations and proposed a heterogeneous network embedding for identifying symptom genes. Methods: We proposed a heterogeneous network embedding representation algorithm, which constructed a heterogeneous symptom-related network that integrated symptom-related associations and applied an embedding representation algorithm to obtain the low-dimensional vector representation of nodes. By measuring the relevance between symptoms and genes via calculating the similarities of their vectors, the candidate genes of given symptoms can be obtained. Results: A benchmark dataset of 18 270 symptom-gene associations between 505 symptoms and 4549 genes was curated. We compared our method to baseline algorithms (FSGER and PRINCE). The experimental results indicated our algorithm achieved a significant improvement over the state-of-the-art method, with precision and recall improved by 66.80% (0.844 vs 0.506) and 53.96% (0.311 vs 0.202), respectively, for TOP@3 and association precision improved by 37.71% (0.723 vs 0.525) over the PRINCE. Conclusions: The experimental validation of the algorithms and the literature validation of typical symptoms indicated our method achieved excellent performance. Hence, we curated a prediction dataset of 17 479 symptom-candidate genes. The benchmark and prediction datasets have the potential to promote investigations of the molecular mechanisms of symptoms and provide candidate genes for validation in experimental settings. Key words: heterogeneous network embedding, symptom gene identification, network medicine
INTRODUCTION Symptoms and signs (called symptoms in brief) are the primary evidence for clinical diagnosis and disease classification.1 As a critical layer connecting exposomes and genomes in the knowledge network, symptoms play an important role in precision medicine to
refine disease taxonomy.2 In recent years, increasingly more phenotype (disease and symptom) databases, such as Human Phenotype Ontology (HPO),3 Human Disease Ontology (DO),4 and Orphanet Rare Disease Ontology (Orphanet)5 have been constructed. Most biomedical researchers are mainly focused on analyzing and
© The Author(s) 2018. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: [email protected]
Downloaded from https://academic.oup.com/jamia/advance-article-abstract/doi/10.1093/jamia/ocy117/5142853 by Dupre Library Serials Dept user on 27 October 2018
Journal of the American Medical Informatics Association, 2018, Vol. 0, No. 0
METHODS Dataset Disease-gene associations Disease-gene associations were collected from the DisGeNet19 and Malacards20 databases (Figure 2). First, we extracted 130 820
curated disease-gene associations between 13 074 diseases with UMLS code (CUI) and 8947 genes from the DisGeNet database, which integrates disease-gene associations from UniProt,24 PsyGeNET,25 ClinVar,26 Orphanet,5 the GWAS Catalog,27 CTD28 and HPO3 databases. Second, we collected 73 064 disease-gene associations between 6118 diseases with CUIs and 8370 genes from the Malacards database. To unify and integrate the disease terms, we mapped the original disease identifiers of the 2 databases to Unified Medical Language System (UMLS) codes. Finally, the 2 data sources were integrated to obtain 196 397 disease-gene associations that include 16 594 unique diseases and 11 497 unique genes.
Protein-protein interactions
The protein-protein interactions (PPIs) were collected from Menche et al,29 and include 213 888 records with 15 964 unique proteins. These data are integrated PPI data derived from multiple data sources, such as HPRD,21 BioGRID,22 IntAct23 and PINA.30
Disease-symptom associations
Disease-symptom associations were collected from the DO,4 HPO3 and Orphanet5 databases (Figure 2). To unify the disease terms from the different datasets, we mapped the original disease codes to UMLS codes. We collected 1008 disease-symptom associations between 204 diseases and 417 symptoms from the DO database, 87 442 disease-symptom associations between 4366 diseases and 6176 symptoms from the HPO database, and 35 039 disease-symptom associations between 2391 diseases and 3721 symptoms from the Orphanet database. By integrating the 3 data sources, we finally obtained 100 305 distinct disease-symptom associations (DSA) between 5605 diseases and 6935 symptoms.
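The integration step described above (map source identifiers to UMLS CUIs, then merge) can be sketched as a set union. This is a minimal illustration, not the authors' pipeline: the function names `to_cui` and `integrate`, the identifier maps and the toy records are all hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Association = Tuple[str, str]  # (disease identifier, gene symbol or symptom CUI)


def to_cui(pairs: Iterable[Association], id_map: Dict[str, str]) -> Set[Association]:
    """Map source-specific disease identifiers to UMLS CUIs; unmapped records are dropped."""
    return {(id_map[d], other) for d, other in pairs if d in id_map}


def integrate(*sources: Set[Association]) -> Set[Association]:
    """Union of already-normalized association sets, so duplicates collapse."""
    merged: Set[Association] = set()
    for src in sources:
        merged |= src
    return merged


# Toy records with hypothetical identifiers (not taken from DisGeNet or Malacards).
source_a = to_cui([("A:1", "GENE1"), ("A:2", "GENE2")],
                  {"A:1": "C0000001", "A:2": "C0000002"})
source_b = to_cui([("B:9", "GENE1"), ("B:8", "GENE3")],
                  {"B:9": "C0000001", "B:8": "C0000003"})
merged = integrate(source_a, source_b)
diseases = {d for d, _ in merged}
genes = {g for _, g in merged}
print(len(merged), len(diseases), len(genes))
```

Because the union is taken after normalization, an association reported by both sources under different native identifiers is counted once, which is why the integrated total is smaller than the sum of the per-source totals.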
Benchmark dataset construction of symptom-gene associations By integrating symptom-related and gene-related association data, we curated a benchmark dataset of symptom-gene associations (called BDSG) (Figure 2). In particular, to obtain the high quality symptom gene associations, we utilized the phenomenon of some “Dual Phenotypes” (DP), such as obesity, fever, back pain, and vertigo, which are not only regarded as diseases, but also as symptoms in medical fields. The associated genes of symptoms with DP characteristics can be directly derived from the disease-gene associations with high quality assurance. To identify these kinds of phenotype terms with DP characteristics, we utilized the hierarchical tree codes (eg C08: respiratory tract diseases and C08.618.248: cough) from MeSH31 terminology to relate the disease terms in our dataset. First, we collected 1051 symptom terms whose MeSH tree codes start with C23.888. Second, we extracted the disease term list and symptom term list from DSA, respectively, and identified the DP symptoms by intersecting the 2 lists. After obtaining the union set of the aforementioned 2 symptom lists, we curated 1278 symptoms with distinct UMLS CUIs. Then, by intersecting the CUIs from the diseases in the integrated disease-gene associations, we obtained 505 symptoms with the DP characteristics, from which we finally curated 18 270 high quality symptom-gene associations (Supplementary Material S1) between these 505 symptoms and 4549 genes. In addition, to curate a more comprehensive symptom-gene benchmark dataset, we further collected the symptom-gene associations derived from the SEMMED32 database, which offered semantic predictions from the titles and abstracts of PubMed33 literatures. We extracted the gene-related semantic predictions about symptom terminologies
understanding the molecular mechanisms of disease phenotypes.6–8 The underlying molecular mechanisms of symptom phenotypes have rarely been investigated, except for disease conditions that overlap with symptom phenotypes, such as obesity9 and pain.10 In addition, to advance the study of genomes and phenotypes, the U.S. National Human Genome Research Institute initiated 2 projects: eMERGE,11 which correlates whole genome scans with phenotype data extracted from electronic medical record systems, and PhenX,12 which provides investigators with high-priority, well-established, low-burden standard measures to collect phenotypic and environmental data for large-scale genomic studies. Jyotishman et al13 adopted multiple standards and biomedical terminologies to promote cross-study pooling of data and the detection of complex genotype-phenotype associations. Similar to the computational approaches for disease-gene prediction, symptom gene identification is also a key task for revealing the underlying molecular mechanisms of symptoms. Gene prediction of given diseases requires extensive experiments to test hundreds of candidate genes in a wet lab.14 In fact, experimental gene identification for symptoms and diseases is a difficult and time-consuming task.15 The success of network-based computational methods for identifying disease genes8,14,16 demonstrated that they are effective for disease gene prediction.
Preliminary work1 indicates that it is feasible to use a network propagation approach to predict the candidate genes of symptoms, and that complicated factors influence prediction performance.17 In addition, the recent curation of large-scale symptom-related association data, such as disease-gene associations (eg OMIM,18 DisGeNet19 and Malacards20), symptom-disease associations (Disease Ontology,4 HPO3 and Orphanet5) and protein-protein interactions (HPRD,21 BioGRID,22 and IntAct23), offers a rare opportunity for the development of computational approaches. However, to substantially promote these efforts, we still need to address 2 essential tasks: curating a high-quality benchmark dataset and making full use of the heterogeneous, indirect symptom-related association data, such as symptom-disease associations, disease-gene associations and protein-protein interactions, to improve symptom gene prediction performance. Here, by integrating symptom-disease and disease-gene associations, we curated a benchmark dataset of symptom-gene associations. We proposed a deep embedding representation algorithm on a heterogeneous symptom-related network to identify symptom genes (Figure 1). First, we constructed a heterogeneous symptom-related network, which includes symptom-disease, disease-gene and protein-protein associations. Then, the network embedding representation algorithm was applied to construct low-dimensional vector representations (LVR) of the nodes (symptoms and genes) in the network. By measuring the relevance between symptoms and genes as the similarity of their vectors, the candidate genes of symptoms can be obtained. We compared the prediction performance of our algorithm to the baseline algorithms (FSGER and PRINCE). The experimental results indicated our algorithm achieved a significant improvement over the baseline algorithms.
Finally, a high-quality prediction dataset of symptom-candidate gene associations was curated based on the results predicted by our method.
Figure 2. A flow chart of data collection and integration. First, 87 442 disease-symptom associations were collected by integrating disease-symptom associations from the DO, HPO and Orphanet databases. We collected and integrated 196 397 disease-gene associations from the DisGeNet and Malacards databases. Then, we selected a set of 1278 symptoms with DP characteristics from the MeSH database and the integrated associations. Finally, a benchmark dataset of 18 270 symptom-gene associations was curated.
Figure 1. An overview of the LSGER method. First, by integrating disease-symptom, disease-gene, and protein-protein associations (a), a heterogeneous symptom-related network (b) was constructed. Then, the network embedding algorithm was applied to obtain a low-dimensional vector representation of nodes (c). Finally, the relevance between the symptom and gene nodes can be measured by the similarities of their vectors (d). By sorting predicted genes by relevance, the candidate genes of given symptoms can be identified.
and finally obtained 50 907 symptom-gene associations (called SPSG) between 932 symptoms and 9382 genes.
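The dual-phenotype (DP) selection used to build BDSG reduces to a few set operations. The sketch below follows the steps stated in the text (MeSH C23.888 terms, intersection of the disease and symptom lists of the DSA, union, then intersection with diseases that have genes). The function name and the toy identifiers are hypothetical, and all terms are assumed to be already normalized to a common identifier space.

```python
from typing import Set, Tuple


def dual_phenotype_symptom_genes(
    mesh_symptoms: Set[str],
    disease_symptom: Set[Tuple[str, str]],
    disease_gene: Set[Tuple[str, str]],
) -> Set[Tuple[str, str]]:
    """Return (symptom, gene) pairs for terms that act as both disease and symptom."""
    dsa_diseases = {d for d, _ in disease_symptom}
    dsa_symptoms = {s for _, s in disease_symptom}
    # Terms that appear on both sides of the disease-symptom associations.
    dp_from_dsa = dsa_diseases & dsa_symptoms
    # Union with the MeSH-derived symptom terms (tree codes under C23.888).
    symptom_terms = mesh_symptoms | dp_from_dsa
    # Keep only the terms that also occur as diseases with known genes.
    diseases_with_genes = {d for d, _ in disease_gene}
    dp_symptoms = symptom_terms & diseases_with_genes
    # The genes of a DP term are inherited directly from its disease-gene records.
    return {(d, g) for d, g in disease_gene if d in dp_symptoms}


# Toy example with made-up identifiers.
toy_mesh = {"T_pain"}
toy_dsa = {("T_flu", "T_fever"), ("T_fever", "T_chills"), ("T_x", "T_pain")}
toy_dg = {("T_fever", "G1"), ("T_flu", "G2"), ("T_pain", "G3"), ("T_pain", "G4")}
toy_pairs = dual_phenotype_symptom_genes(toy_mesh, toy_dsa, toy_dg)
print(sorted(toy_pairs))
```

In the toy data, `T_flu` has a gene but never occurs as a symptom, so its gene is not transferred; only `T_fever` and `T_pain` qualify as DP terms.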
Based on the Fisher exact test,34 we proposed a Fisher-based statistical model to predict symptom genes (FSGER) as a baseline method. Based on the symptom-disease and disease-gene associations, we considered the diseases as a bridge to connect symptoms and genes. In detail, for symptom $s$ and gene $g$, we defined $a$, $b$, $c$ and $d$ to represent the number of diseases associated with both $s$ and $g$, associated with $s$ but not $g$, associated with $g$ but not $s$, and associated with neither $s$ nor $g$, respectively. The relevance $\mathrm{Rel}(s, g)$ between the symptom $s$ and the gene $g$ can be defined as follows:

$$\mathrm{Rel}(s, g) = 1 - \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}$$

where $n$ represents the number of all the related diseases. Then, by ranking the predicted genes by the relevance, the ranking gene lists of given symptoms can be obtained.
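As a numerical illustration of the relevance score, the sketch below evaluates the formula in log space so that the factorials do not overflow for large disease counts. It assumes that $n = a + b + c + d$, which the text does not state explicitly, and the function names are hypothetical.

```python
from math import exp, lgamma


def log_factorial(k: int) -> float:
    """log(k!) computed via the log-gamma function."""
    return lgamma(k + 1)


def fisher_relevance(a: int, b: int, c: int, d: int) -> float:
    """Rel(s, g) = 1 - (a+b)!(c+d)!(a+c)!(b+d)! / (a! b! c! d! n!).

    Assumption: n is the total number of diseases in the 2x2 table, n = a + b + c + d.
    """
    n = a + b + c + d
    log_p = (
        log_factorial(a + b) + log_factorial(c + d)
        + log_factorial(a + c) + log_factorial(b + d)
        - log_factorial(a) - log_factorial(b)
        - log_factorial(c) - log_factorial(d) - log_factorial(n)
    )
    return 1.0 - exp(log_p)


# Two diseases share both the symptom and the gene, two share neither.
print(round(fisher_relevance(2, 0, 0, 2), 4))
```

For the table $(a, b, c, d) = (2, 0, 0, 2)$ the fraction equals $16/96 = 1/6$, so the relevance is $5/6$; a table with less overlap between the symptom's and the gene's diseases gives a larger fraction and therefore a lower relevance.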
Heterogeneous symptom-related network embedding representation
Network embedding representation learning35 is an effective algorithm for learning the low-dimensional feature vectors of the nodes in a given network, and it can effectively preserve the local and global structure information of the network. Network embedding representation methods are applicable in many tasks, such as visualization, label classification and link prediction.35 In this study, we constructed a heterogeneous symptom-related network, and applied the network embedding algorithm node2vec35 to obtain the low-dimensional vector representation of the nodes in the network. As a well-known algorithm for network embedding representation, the main idea of node2vec is to learn a mapping of nodes to a low-dimensional space of features that maximizes the likelihood of preserving network neighborhoods of nodes. In detail, for a given network $G = (V, E)$, the aim of node2vec is to learn the mapping function $f: V \to \mathbb{R}^d$ (parameter $d$ is the number of feature dimensions) from nodes to feature representations. By applying the SkipGram architecture to the network,36,37 the objective function can be optimized by maximizing the log-probability of observing the network neighborhood $N_S(u)$ for node $u$ conditioned on its feature representation as follows:

$$\max_f \sum_{u \in V} \log \Pr(N_S(u) \mid f(u))$$
For the node $u \in V$, its network neighborhood $N_S(u)$ can be generated through a neighborhood sampling strategy $S$. The authors of node2vec proposed a biased random walk strategy, which can flexibly and efficiently explore the diverse neighborhoods of nodes. Given a source node $u$, a random walk of fixed length $l$ can be simulated, and node $c_i$ (that is, the $i$-th node in the random walk, with $c_0 = u$) is generated by the distribution function:

$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{\pi_{vx}}{Z} & \text{if } (v, x) \in E \\ 0 & \text{otherwise} \end{cases}$$

where $\pi_{vx}$ is the unnormalized transition probability between nodes $v$ and $x$, and $Z$ is the normalizing constant. By applying the 2 standard assumptions, conditional independence and symmetry in the
LVR-based similarity prediction model to identify symptom genes
We can obtain the LVR of the nodes in the given network based on a heterogeneous network embedding representation algorithm. The low-dimensional vector features of nodes fuse the local structure (neighbors of nodes) and global structure information of the network. We then proposed an LVR-based similarity model for symptom gene prediction (LSGER). The relevance between the symptom and gene nodes can be measured by the similarity of their low-dimensional vectors. Mathematically, given the symptom node $v_s$ and the gene node $v_g$, we measure the relevance $\mathrm{Rel}(v_s, v_g)$ between them by calculating the LVR-based cosine similarity $\cos(N_{v_s}, N_{v_g})$ of their vectors $N_{v_s}$ and $N_{v_g}$ as follows:

$$\mathrm{Rel}(v_s, v_g) = \cos(N_{v_s}, N_{v_g}) = \frac{N_{v_s} \cdot N_{v_g}}{|N_{v_s}|\,|N_{v_g}|}$$

By calculating and sorting the relevance scores between the query symptom and all candidate genes, we can obtain a ranking list of candidate genes for the query symptom. In addition, for the symptom $v_s$, we designed a pre-selection strategy for candidate genes, in which the genes of the diseases related to $v_s$ form the candidate gene pool, and compared it to a no-selection strategy, in which all genes form the candidate gene pool. Based on the 2 strategies, 2 variants of the LSGER algorithm were proposed: LSGER-AG (all genes) and LSGER-DG (with filtered disease genes).
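The ranking step can be illustrated with a few lines of plain Python. This is a sketch under assumptions rather than the authors' implementation: the function names, the optional `candidate_pool` argument and the 2-dimensional toy vectors are hypothetical. Passing no pool corresponds to the all-genes strategy (LSGER-AG), and passing the genes of the symptom's related diseases corresponds to the pre-selection strategy (LSGER-DG).

```python
from math import sqrt
from typing import Dict, Iterable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank_genes(
    symptom_vec: Vector,
    gene_vecs: Dict[str, Vector],
    candidate_pool: Optional[Iterable[str]] = None,
) -> List[Tuple[str, float]]:
    """Rank candidate genes for one symptom by descending cosine similarity."""
    if candidate_pool is None:
        pool = list(gene_vecs)  # all-genes strategy
    else:
        pool = [g for g in candidate_pool if g in gene_vecs]  # pre-selected pool
    scored = [(g, cosine(symptom_vec, gene_vecs[g])) for g in pool]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy embeddings (made up for illustration only).
toy_symptom = [1.0, 0.0]
toy_genes = {"G1": [1.0, 0.0], "G2": [0.0, 1.0], "G3": [1.0, 1.0]}
ranking_all = rank_genes(toy_symptom, toy_genes)
ranking_pool = rank_genes(toy_symptom, toy_genes, candidate_pool=["G2", "G3"])
print([g for g, _ in ranking_all], [g for g, _ in ranking_pool])
```

Pre-selection only changes which genes are scored, not the scores themselves, so a gene outside the pool can never be ranked even if its vector is close to the symptom's vector.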
Experimental setting and evaluation
We constructed 2 benchmark datasets of symptom-gene associations (BDSG and SPSG), which can be used to evaluate the prediction performance of different algorithms. In the experiment, we removed all the known genes of the symptoms in the benchmark dataset and predicted the candidate genes of every test symptom, so that no prior symptom-gene associations were available to any of the prediction algorithms. Our method was compared to the baseline algorithms FSGER and PRINCE.1 PRINCE was originally proposed by Vanunu et al38 to predict disease genes. Li et al1 extended PRINCE and applied it to the task of symptom gene prediction. In their work, a network propagation method was used in the PPI network to obtain priority scores of candidate genes. The FSGER algorithm is a Fisher-based statistical model that connects disease-symptom and disease-gene associations for symptom gene prediction. We adopted precision (PR), recall (RE), F1-score (F1),39 association precision (AP) and area under curve (AUC) as the evaluation metrics. Given a test symptom set $S$ with $M$ symptoms, for every test symptom $s \in S$, $T(s)$ represents the test gene set of symptom $s$. Given a ranking list of predicted genes, we selected the top $i$ genes $R_i(s)$ of the ranking list ($i = 3, 10$) as candidate genes. The precision, recall and F1-score for TOP@i can be defined as follows:
Fisher-based statistics model for symptom gene prediction
feature space, the low-dimensional vector features of nodes can be learned using stochastic gradient ascent over the model parameters. We constructed 2 heterogeneous networks: SDGNet, which integrated symptom-disease and disease-gene associations, and SDGPNet, which integrated symptom-disease, disease-gene, and protein-protein associations. Given a heterogeneous network $G = (V, E)$, $V$ and $E$ represent the nodes and edges of the network. Then, we applied the network embedding representation algorithm to learn the LVR of nodes. Finally, each node $v$ can be mapped to a low-dimensional vector $N_v$.
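The neighborhood sampling step can be sketched as a weighted random walk over an adjacency list. This is a simplified illustration, not the node2vec reference implementation: the `weight` callback stands in for the unnormalized transition probability and defaults to uniform weights, the node labels are made up, and the subsequent SkipGram training on the sampled walks is omitted.

```python
import random
from typing import Callable, Dict, List, Optional

Graph = Dict[str, List[str]]  # adjacency list of an undirected network
WeightFn = Callable[[Optional[str], str, str], float]


def random_walk(
    graph: Graph,
    start: str,
    length: int,
    weight: Optional[WeightFn] = None,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Simulate one walk with `length` nodes, starting at `start`."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    weight = weight or (lambda prev, v, x: 1.0)  # uniform weights by default
    walk = [start]
    while len(walk) < length:
        v = walk[-1]
        neighbours = graph.get(v, [])
        if not neighbours:
            break  # dead end: no edge (v, x) exists, so every transition has probability 0
        prev = walk[-2] if len(walk) > 1 else None
        unnormalized = [weight(prev, v, x) for x in neighbours]
        # random.choices rescales the weights to sum to 1, i.e. it divides by Z.
        walk.append(rng.choices(neighbours, weights=unnormalized, k=1)[0])
    return walk


# Toy heterogeneous network mixing symptom, disease and gene nodes.
toy_graph: Graph = {
    "symptom:fever": ["disease:flu"],
    "disease:flu": ["symptom:fever", "gene:G1"],
    "gene:G1": ["disease:flu", "gene:G2"],
    "gene:G2": ["gene:G1"],
}
toy_walk = random_walk(toy_graph, "symptom:fever", 5)
print(toy_walk)
```

Because symptom, disease and gene nodes live in one graph, a walk that starts at a symptom can reach genes through diseases and then continue along protein-protein edges, which is how indirect associations end up in the same neighborhood.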
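$$\mathrm{Precision} = \frac{1}{M} \sum_{s \in S} \frac{|T(s) \cap R_i(s)|}{|R_i(s)|}$$

$$\mathrm{F1\text{-}score} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$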
The recall was calculated over only the top 3 or 10 candidate genes, which may lead to low recall values. Since the recall was calculated in the same way for all the prediction algorithms, the comparison is fair. In addition, for every test symptom $s$, the top $k$ genes $R_k(s)$ of the ranking list were also selected ($k$ equals the number of test genes of symptom $s$). The association precision can be defined as follows:

$$\mathrm{AP} = \frac{\sum_{s \in S} |T(s) \cap R_k(s)|}{\sum_{s \in S} |R_k(s)|}$$

In addition, we also used the AUC to evaluate the prediction performance. For every test symptom, we selected the top 100 predicted genes as candidate genes and obtained the predicted scores of the symptom-candidate gene pairs. Then, we ranked all the symptom-candidate gene pairs by score and calculated the AUC values. Compared to the AUC calculation for homogeneous networks in link prediction tasks, the AUC calculation in this study may yield values that do not fully reflect the quality of the prediction results. Hence, the AUC evaluation is only a supplement to the other metrics.
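The TOP@i metrics and the association precision can be computed as in the sketch below. It uses the standard definitions (precision divides the hits by the number of selected genes, recall divides them by the number of test genes), which agree with the per-symptom values in Table 2. Averaging the F1-score per symptom is an assumption, and the function names and toy data are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def top_i_metrics(
    ranked: Dict[str, List[str]], truth: Dict[str, Set[str]], i: int
) -> Tuple[float, float, float]:
    """Mean precision, recall and F1 over all test symptoms for TOP@i."""
    precisions, recalls, f1s = [], [], []
    for symptom, test_genes in truth.items():
        top = ranked.get(symptom, [])[:i]
        hits = len(test_genes & set(top))
        p = hits / len(top) if top else 0.0
        r = hits / len(test_genes) if test_genes else 0.0
        f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    m = len(truth) or 1  # avoid division by zero for an empty test set
    return sum(precisions) / m, sum(recalls) / m, sum(f1s) / m


def association_precision(
    ranked: Dict[str, List[str]], truth: Dict[str, Set[str]]
) -> float:
    """Pooled precision when each symptom contributes its top k = |T(s)| genes."""
    hits = total = 0
    for symptom, test_genes in truth.items():
        top_k = ranked.get(symptom, [])[: len(test_genes)]
        hits += len(test_genes & set(top_k))
        total += len(top_k)
    return hits / total if total else 0.0


# Toy predictions for two symptoms (made-up gene names).
toy_truth = {"s1": {"A", "B"}, "s2": {"C"}}
toy_ranked = {"s1": ["A", "X", "B"], "s2": ["Y", "C", "Z"]}
toy_p, toy_r, toy_f = top_i_metrics(toy_ranked, toy_truth, 3)
toy_ap = association_precision(toy_ranked, toy_truth)
print(toy_p, toy_r, toy_f, toy_ap)
```

The toy case also shows why recall at a small cutoff is bounded: a symptom with 158 test genes can reach at most 3/158 recall for TOP@3, even when all three selected genes are correct.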
RESULTS
LVR-based similarity model to predict symptom genes
We compared LSGER to the PRINCE and FSGER algorithms. We adopted precision, recall and F1-score for TOP@3 and TOP@10, association precision and AUC as evaluation metrics. For the LSGER-AG and LSGER-DG algorithms, we used 2 heterogeneous networks, SDGNet and SDGPNet, as test networks. First, the experimental results (Table 1) on the BDSG dataset show that, compared to the baseline algorithm PRINCE (AP = 0.525; PR = 0.506 and RE = 0.202 for TOP@3), the FSGER algorithm achieved slightly better performance: AP improved by 2.10%; PR and RE improved by 20.55% and 17.33%, respectively, for TOP@3. The LSGER-AG with SDGPNet yielded the best performance: compared to PRINCE, AP improved by 37.71%; AUC improved by 21.60%; PR and RE improved by 66.80% and 53.96%, respectively, for TOP@3. Second, the LSGER algorithm with SDGPNet obtained slightly higher performance than with SDGNet (LSGER-AG: PR and RE improved by 1.69% and 3.67%, respectively, for TOP@3; LSGER-DG: PR and RE improved by 1.58% and 3.32%, respectively, for TOP@3), which indicated that the fusion of more gene-related information (PPI network) improved the prediction performance of the LSGER algorithm. Finally, in terms of precision and recall for TOP@3, LSGER-AG and LSGER-DG had similar prediction performance. However, in terms of AP, the prediction performance of LSGER-DG was better than that of LSGER-AG (with SDGNet: AP improved by 6.31%; with SDGPNet: AP improved by 9.54%), which indicated that the candidate gene pre-selection improved the prediction performance of the LSGER algorithm. Furthermore, we performed comparative experiments with different similarity metrics in the supplementary materials (SM). We selected 3 classical similarity metrics, cosine similarity (Sim_cos), Euclidean distance similarity (Sim_eu) and
Pearson similarity (Sim_pea), to measure the vector similarities of symptom and gene nodes. The results predicted by the LSGER-AG algorithm with the SDGNet and SDGPNet networks indicated that different similarity metrics had some degree of influence on the prediction performance of our algorithm. For example, in terms of precision (PR) and recall (RE) for TOP@3, the prediction algorithm with Sim_pea (PR = 0.852; RE = 0.314), Sim_eu (PR = 0.871; RE = 0.318) and Sim_cos (PR = 0.844; RE = 0.311) obtained similar recall but different precision. In the SM, we also compared the performance of symptom-gene prediction algorithms on the SPSG dataset. The prediction results indicated that the LSGER-DG with SDGPNet still obtained the best performance: compared to the PRINCE algorithm, the recall and F1-score improved by 35.32% and 64.24%, respectively. Compared to the BDSG dataset with highly credible symptom-gene associations, the predicted associations offered by SEMMED had a low confidence. Therefore, the evaluation results on the BDSG dataset can be of greater value than those on the SPSG dataset. Overall, our method outperformed the other prediction algorithms.
Case study: candidate genes of some typical symptoms
To illustrate the performance of the prediction algorithm, we report the prediction performance of LSGER-AG with SDGPNet for several typical symptoms (Table 2), including constipation (CUI: C0009806), nausea (CUI: C0027497), pain (CUI: C0030193), Usher syndromes (CUI: C1568248), vision disorders (CUI: C0042790), and aphasia (CUI: C0003537), which are regarded as DP symptom terms. The top 10 candidate genes of these symptoms are also listed (Table 3), and the bold genes in the table are the known genes of these symptoms. For example, for constipation, the top 9 candidate genes are known genes (PR = 0.9 for TOP@10). In addition, among the candidate genes (Table 3) of pain, we found 9 benchmark genes, and the remaining gene, ZNF470 (rank = 5), is related to amyotrophic lateral sclerosis (ALS).40 We searched HPO3 and found that pain is one of the typical symptoms of ALS. Therefore, ZNF470 might be a novel gene for pain. We further evaluated the predicted genes of pain by additional validations from protein-protein interactions and genetic functional analysis. In particular, we extracted the interactions of the top 49 predicted genes of pain in the context of the whole PPI network and showed their interaction map (Figure 3a), which includes 36 benchmark genes and 13 novel candidate genes. There are dense interactions (95 interactions) between those benchmark genes and the novel candidate genes compared to the interactions with random controls (p-value = 6.82e-68), which indicated that the novel genes are located close to benchmark genes in the PPI network. Further enrichment analysis (Gene Ontology and Pathway) of the predicted genes of pain obtained similar results (Figure 3b). For example, there are 9 candidate genes and 11 known genes on the neuroactive ligand-receptor interaction pathway (p-value = 9.90E-15).
Therefore, the additional analysis indicated that there are dense interactions among the candidate genes and known genes of pain, which partially validates the rationality of the prediction results. To fully evaluate the candidate genes that were not recorded in the BDSG dataset, we manually searched recently published biomedical papers to verify the novel candidate genes. For example, for the novel candidate genes of Usher syndromes (PR = 0.7 for TOP@10), we found that Jaworek et al41 confirmed the locus (chromosome 10p11.21-q21.1) of the USH1K gene (rank = 3) associated with type 1 Usher syndrome. The candidate gene USH1H (rank = 4), which was investigated by Dad et al,42 is likely to be associated with Usher syndrome as well. In addition, for all 4 novel candidate genes in the top 10 gene list of vision disorders, we found positive validations
$$\mathrm{Recall} = \frac{1}{M} \sum_{s \in S} \frac{|T(s) \cap R_i(s)|}{|T(s)|}$$
Table 1. The performance comparison of symptom gene prediction algorithms

| Network | Algorithm | AP | AUC | TOP@3 Precision | TOP@3 Recall | TOP@3 F1-score | TOP@10 Precision | TOP@10 Recall | TOP@10 F1-score |
|---|---|---|---|---|---|---|---|---|---|
| – | PRINCE | 0.525 | 0.736 | 0.506 | 0.202 | 0.211 | 0.420 | 0.371 | 0.296 |
| – | FSGER | 0.536 | 0.564 | 0.610 | 0.237 | 0.252 | 0.486 | 0.422 | 0.344 |
| SDGNet | LSGER-AG | 0.745 | 0.890 | 0.830 | 0.300 | 0.327 | 0.719 | 0.572 | 0.488 |
| SDGNet | LSGER-DG | 0.792 | 0.856 | 0.821 | 0.301 | 0.327 | 0.693 | 0.561 | 0.473 |
| SDGPNet | LSGER-AG | 0.723 | 0.895 | 0.844 | 0.311 | 0.338 | 0.719 | 0.576 | 0.489 |
| SDGPNet | LSGER-DG | 0.792 | 0.853 | 0.834 | 0.311 | 0.336 | 0.698 | 0.568 | 0.478 |

The bold values represent the best performance for each metric (e.g. AUC, precision and recall). AP represents association precision.
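The relative improvements quoted in the text can be reproduced from the values in Table 1 with the usual formula, (new − baseline) / baseline. The helper name below is hypothetical.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# LSGER-AG with SDGPNet versus PRINCE (values from Table 1).
pr_gain = round(relative_improvement(0.844, 0.506), 2)  # TOP@3 precision
re_gain = round(relative_improvement(0.311, 0.202), 2)  # TOP@3 recall
ap_gain = round(relative_improvement(0.723, 0.525), 2)  # association precision
print(pr_gain, re_gain, ap_gain)  # 66.8 53.96 37.71
```

These are relative gains, not percentage-point differences: the TOP@3 precision rises by 0.338 in absolute terms, which is 66.80% of the PRINCE baseline.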
Table 2. The prediction performance of some specific symptoms

| ID | Symptom (CUI) | Number of hit genes/test genes | TOP@3 Precision | TOP@3 Recall | TOP@3 F1-score | TOP@10 Precision | TOP@10 Recall | TOP@10 F1-score |
|---|---|---|---|---|---|---|---|---|
| 1 | Constipation (C0009806) | 109/158 | 1.000 | 0.019 | 0.037 | 0.900 | 0.057 | 0.107 |
| 2 | Nausea (C0027497) | 11/17 | 1.000 | 0.176 | 0.300 | 0.800 | 0.471 | 0.593 |
| 3 | Pain (C0030193) | 54/79 | 1.000 | 0.038 | 0.073 | 0.900 | 0.114 | 0.202 |
| 4 | Usher syndromes (C1568248) | 15/18 | 0.667 | 0.111 | 0.190 | 0.700 | 0.389 | 0.500 |
| 5 | Vision disorders (C0042790) | 6/6 | 1.000 | 0.500 | 0.667 | 0.600 | 1.000 | 0.750 |
| 6 | Aphasia (C0003537) | 5/9 | 0.333 | 0.111 | 0.167 | 0.600 | 0.667 | 0.632 |
Table 3. The top 10 candidate genes of some specific symptoms

| Rank | Constipation (C0009806) | Nausea (C0027497) | Pain (C0030193) | Usher syndromes (C1568248) | Vision disorders (C0042790) | Aphasia (C0003537) |
|---|---|---|---|---|---|---|
| 1 | SEMA3C | ETFDH | PROKR1 | USH1G | TSEN54 | LOC643387 |
| 2 | NRTN | ETFB | PON3 | PDZD7 | TSEN2 | ATP1A2 |
| 3 | GPBAR1 | ETFA | PNOC | USH1K | TSEN34 | PSNP2 |
| 4 | HMBS | LPL | DAO | USH1H | TTPA | GRN |
| 5 | MLNR | HMBS | ZNF470 | USH1E | CLN6 | GRIN2A |
| 6 | DUOX2 | IFNA2 | HTR3B | CIB2 | ATXN7 | ADA2 |
| 7 | TRHR | COQ4 | NTSR1 | CDH23 | CNGA3 | REEP1 |
| 8 | MLN | SLC7A7 | TRPA1 | MT-TS2 | GRM6 | MAPT |
| 9 | SCN11A | ACADM | UNC13A | USH1C | PRPH2 | L1CAM |
| 10 | CELIAC8 | TNF | BDKRB1 | WHRN | NR2E3 | NOTCH3 |

The genes in bold font represent the candidate genes that are known genes of the corresponding symptoms.
from recent independent publications. For example, Gootwine et al43 verified that achromatopsia can be caused by CNGA3 (rank = 7) mutations. Furthermore, the remaining 3 candidate genes, GRM6 (rank = 8), PRPH2 (rank = 9) and NR2E3 (rank = 10), are likely to be associated with subtypes of vision disorders, such as night blindness,44 visual acuity,45 and enhanced S-cone syndrome.46
DISCUSSION In real-world clinical settings, symptoms always play an essential role in both diagnosis and treatment of diseases. Symptoms are the most directly observable manifestations of a disease.47 Therefore, the investigation of the underlying molecular mechanism of symptoms has the potential to propel the refinement of disease taxonomy48 for precision medicine. In this study, we constructed a benchmark dataset of symptom-gene associations and proposed a heterogeneous symptom-related network embedding prediction algorithm for symptom gene prediction. The experimental results indicated our algorithm achieved a significant improvement over the
state-of-the-art method. The heterogeneous symptom-related network embedding prediction algorithm that we proposed can make full use of multiple sources of symptom-related information (eg symptom-disease, disease-gene and protein-protein associations).
In particular, we integrated the symptom-disease and disease-gene associations to curate a benchmark dataset of symptom-gene associations, which can be used to evaluate the performance of newly proposed symptom gene prediction algorithms. By systematic checking of the symptom terms (more details in SM), we curated a high-quality prediction dataset that contains 17 479 symptom-candidate gene associations between 461 symptoms and 3620 genes (Supplementary Material S2). The benchmark and prediction datasets of symptom-gene associations can also be used to further investigate the symptom-related molecular mechanisms in experimental settings. However, because curation takes a long time, most biomedical knowledge databases (eg UMLS and SEMMED) exhibit a general "temporal" lag behind state-of-the-art publications. To address this limitation, we manually validated the candidate genes against the latest literature to evaluate their reliability.
Furthermore, the experimental results indicated more information fusion can improve prediction performance. Therefore, we will consider more heterogeneous data, such as gene ontology and expression data in the next efforts. The symptom terms that were extracted from the UMLS database have hierarchy structures. For example, as a high-level category, vision disorder is the hypernym of cataracts (CUI: C0086543), cortical blindness (CUI: C0155320), and night blindness (CUI: C0028077). We will extract and curate a symptom-gene benchmark with hierarchy structures, which can impel us to design a more reliable prediction algorithm. In addition, the symptom terms from MeSH database are high-quality but with limited number. Therefore, we need further collection of various symptom terms contained in the “Clinical Finding” category of SNOMED49 to expand our dataset. However, the curation of a high-quality symptom-gene benchmark dataset will always be a systematic task that needs to be performed continuously. The semantic prediction of SEMMED would be a high-quality resource to curate the benchmark dataset with wide symptom coverage.
CONCLUSION Symptom-gene identification is a primary step towards understanding the molecular mechanisms of symptoms and refining the disease taxonomy in precision medicine. In this study, we curated a benchmark dataset of 18 270 symptom-gene associations and proposed a heterogeneous symptom-related network embedding representation algorithm for symptom gene prediction. We compared our method with the baseline algorithms (FSGER and PRINCE), and the results indicated that our algorithm achieved a significant improvement. We also curated a high-quality prediction dataset of 17 479 symptom-candidate gene associations covering 461 symptoms and 3620 genes. The analysis of the candidate genes of typical symptoms indicated that the prediction results have the potential to support investigation of the underlying molecular mechanisms of symptoms in experimental settings.
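As a reminder of how candidate genes follow from the embedding, the ranking step can be illustrated with a minimal sketch. It assumes that symptom and gene vectors have already been learned and that relevance is scored by cosine similarity; the function names, the toy two-dimensional vectors, and the gene labels (GENE_A to GENE_D) are hypothetical placeholders, not data or code from this study.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidate_genes(symptom_vec, gene_vecs, known_genes=frozenset(), top_k=3):
    """Rank genes by similarity to a symptom vector, skipping already known genes."""
    scored = [
        (gene, cosine(symptom_vec, vec))
        for gene, vec in gene_vecs.items()
        if gene not in known_genes
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embedding vectors (hypothetical, for illustration only).
symptom_vec = [1.0, 0.0]
gene_vecs = {
    "GENE_A": [0.9, 0.1],
    "GENE_B": [0.0, 1.0],
    "GENE_C": [0.5, 0.5],
    "GENE_D": [-1.0, 0.0],
}

ranking = rank_candidate_genes(symptom_vec, gene_vecs, top_k=3)
for gene, score in ranking:
    print(f"{gene}\t{score:.3f}")
```

Passing the benchmark genes of a symptom as `known_genes` restricts the output to novel candidates, which mirrors the distinction between benchmark and predicted genes drawn in this study.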
FUNDING This work was partially supported by the National Key Research and Development Program (2017YFC1703506), the Fundamental Research Funds for the Central Universities (2017YJS057 and 2017JBM020), the Special Programs of Traditional Chinese Medicine (201407001, JDZX2015170, and JDZX2015171), and the National Key Technology R&D Program (2013BAI02B01 and 2013BAI13B04).
COMPETING INTERESTS None.
CONTRIBUTORS X.Z. conceived and designed the research. K.Y. performed the experiments, analyzed the data, and drafted the manuscript. N.W., G.L., and R.W. were involved in data curation and analysis. X.Z., J.C., J.Y., and R.Z. revised the manuscript. All authors read and approved the final manuscript.
SUPPLEMENTARY MATERIAL Supplementary material is available at Journal of the American Medical Informatics Association online.
Figure 3. Protein-protein interaction (PPI) and genetic functional analysis of the predicted genes of pain. We extracted the interactions among the top 49 predicted genes of pain from the whole PPI network and show their interaction matrix (a). The 49 genes comprise 36 known genes (ie, benchmark genes) and 13 novel candidate genes. The benchmark genes and the novel candidate genes are densely connected (95 interactions) compared with random controls (p-value = 6.82e-68), which indicates that the novel genes are located close to the benchmark genes in the PPI network. Further pathway and Gene Ontology (GO) enrichment analyses of the predicted pain genes yielded similar results. The bold and underlined genes are known and candidate genes of pain, respectively.
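The comparison with random controls in the caption can be illustrated with a small sketch. The statistical test behind the reported p-value is not specified here, so the code below assumes a simple empirical permutation test: it counts interactions between benchmark and candidate genes and compares that count with counts for randomly drawn gene sets of the same size. The function names, the toy network, and the node labels are hypothetical, and an empirical test of this kind cannot resolve p-values as small as the one reported.

```python
import random


def count_cross_edges(edges, group_a, group_b):
    """Count edges with one endpoint in group_a and the other in group_b."""
    total = 0
    for u, v in edges:
        if (u in group_a and v in group_b) or (u in group_b and v in group_a):
            total += 1
    return total


def empirical_p_value(edges, nodes, benchmark, candidates, n_trials=1000, seed=0):
    """Estimate how often random gene sets link to the benchmark as densely as the candidates."""
    rng = random.Random(seed)
    observed = count_cross_edges(edges, benchmark, candidates)
    # Random controls are drawn from all non-benchmark nodes.
    pool = [n for n in nodes if n not in benchmark]
    hits = 0
    for _ in range(n_trials):
        control = set(rng.sample(pool, len(candidates)))
        if count_cross_edges(edges, benchmark, control) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_trials + 1)


# Toy PPI network (hypothetical labels, for illustration only).
benchmark = {"B1", "B2"}
candidates = {"C1", "C2"}
nodes = ["B1", "B2", "C1", "C2", "R1", "R2", "R3", "R4", "R5", "R6"]
edges = [
    ("B1", "C1"), ("B1", "C2"), ("B2", "C1"), ("B2", "C2"),
    ("R1", "R2"), ("R3", "R4"), ("B1", "R1"),
]

observed, p_value = empirical_p_value(edges, nodes, benchmark, candidates)
print(observed, p_value)
```

In this toy network only the candidate pair itself reaches the observed count of four interactions, so the estimated p-value is small; on a real PPI network, degree-preserving sampling of the controls may be a more appropriate design choice.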
REFERENCES
1. Li X, Zhou X, Peng Y, et al. Network based integrated analysis of phenotype-genotype data for prioritization of candidate symptom genes. Biomed Res Int 2014; 2014: 435853.
2. Hofmann-Apitius M, Alarcón-Riquelme ME, Chamberlain C, et al. Towards the taxonomy of human disease. Nature Reviews Drug Discovery 2015; 14 (2): 75–6.
3. Köhler S, Vasilevsky NA, Engelstad M, et al. The human phenotype ontology in 2017. Nucleic Acids Res 2017; 45 (D1): D865–76.
4. Kibbe WA, Arze C, Felix V, et al. Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucleic Acids Res 2015; 43 (D1): D1071.
5. Rath A, Olry A, Dhombres F, et al. Representation of rare diseases in health information systems: the Orphanet approach to serve a wide range of end users. Human Mutation 2012; 33 (5): 803–8.
6. Lupski JR, Stankiewicz P. Genomic disorders: molecular mechanisms for rearrangements and conveyed phenotypes. PLoS Genet 2005; 1 (6): e49.
7. Zhou H, Skolnick J. A knowledge-based approach for predicting gene-disease associations. Bioinformatics 2016; 32 (18): 2831–8.
8. Zeng X, Liao Y, Liu Y, et al. Prediction and validation of disease genes using HeteSim Scores. IEEE/ACM Trans Comput Biol Bioinf 2017; 14 (3): 687.
9. Locke AE, Kahali B, Berndt SI, et al. Genetic studies of body mass index yield new insights for obesity biology. Nature 2015; 518 (7538): 197–206.
10. de Heer EW, Have MT, Hwj VM, et al. Pain as a risk factor for common mental disorders. Results from the Netherlands Mental Health Survey and Incidence Study-2: a longitudinal, population-based study. Pain 2018; 159: 712–8.
11. McCarty CA, Chisholm RL, Chute CG, et al. The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med Genomics 2011; 4 (1): 1–11.
12. Stover PJ, Harlan WR, Hammond JA, et al. PhenX: a toolkit for interdisciplinary genetics research. Curr Opin Lipidol 2010; 21 (2): 136–40.
13. Jyotishman P, Pan H, Wang J, et al. Evaluating phenotypic data elements for genetics and epidemiological research: experiences from the eMERGE and PhenX Network Projects. AMIA Jt Summits Transl Sci Proc 2011; 2011: 41–5.
14. Le DH, Dang VT. Ontology-based disease similarity network for disease gene prediction. Vietnam J Comp Sci 2016; 3 (3): 1–9.
15. Calvo B, López-Bigas N, Furney SJ, et al. A partially supervised classification approach to dominant and recessive human disease gene prediction. Comp Methods Progr Biomed 2007; 85 (3): 229–37.
16. Jiang R. Walking on multiple disease-gene networks to prioritize candidate genes. J Mol Cell Biol 2015; 7 (3): 214.
17. Gonzalez-Perez S, Pazos F, Chagoyen M. Factors affecting interactome-based prediction of human genes associated with clinical signs. BMC Bioinformatics 2017; 18 (1): 340.
18. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 2005; 33 (1): 514–7.
19. Pinero J, Queralt-Rosinach N, Bravo A, et al. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015; 2015 (0): bav028.
20. Rappaport N, Twik M, Plaschkes I, et al. MalaCards: an amalgamated human disease compendium with diverse clinical and genetic annotation and structured search. Nucleic Acids Res 2017; 45 (D1): D877–87.
21. Keshava Prasad TS, Goel R, Kandasamy K, et al. Human Protein Reference Database–2009 update. Nucleic Acids Res 2009; 37 (Database): D767.
22. Chatr-Aryamontri A, Breitkreutz BJ, Oughtred R, et al. The BioGRID interaction database: 2015 update. Nucleic Acids Res 2015; 43 (Database issue): D470.
23. Orchard S, Ammari M, Aranda B, et al. The MIntAct project: IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res 2014; 42: 358–63.
24. Apweiler R. Activities at the universal protein resource (UniProt). Nucleic Acids Res 2014; 42 (11): 7486.
25. Gutierrez-Sacristan A, Grosdidier S, Valverde O, et al. PsyGeNET: a knowledge platform on psychiatric disorders and their genes. Bioinformatics 2015; 31 (18): 3075–3077.
26. Landrum MJ, Lee JM, Riley GR, et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 2014; 42 (Database issue): 980–5.
27. Welter D, Macarthur J, Morales J, et al. The NHGRI GWAS catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 2014; 42 (Database issue): 1001–6.
28. Peter DA, Grondin MC, Robin J, et al. The Comparative Toxicogenomics Database: update 2013. Nucleic Acids Res 2011; 39 (Database issue): 1067–72.
29. Menche J, Sharma A, Kitsak M, et al. Disease networks. Uncovering disease-disease relationships through the incomplete interactome. Science 2015; 347 (6224): 1257601.
30. Cowley MJ, Pinese M, Kassahn KS, et al. PINA v2.0: mining interactome modules. Nucleic Acids Res 2012; 40 (Database issue): 862–5.
31. Lipscomb CE. Medical Subject Headings (MeSH). Bull Med Libr Assoc 2000; 88 (3): 265.
32. Kilicoglu H, Fiszman M, Rodriguez A, et al. Semantic MEDLINE: a web application for managing the results of PubMed searches. Proc Smbm. 2008: 69–76.
33. Wheeler DL, Church DM, Lash AE, et al. Database resources of the National Center for Biotechnology Information: 2002 update. Nucleic Acids Res 2002; 30 (1): 13–16.
34. Fisher RA. On the interpretation of χ2 from contingency tables, and the calculation of P. J R Stat Soc 1922; 85 (1): 87–94.
35. Grover A, Leskovec J. Node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, CA, USA. 2016: 855–864.
36. Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space. arXiv 2013. (https://arxiv.org/abs/1301.3781v3)
37. Perozzi B, Al-Rfou R, Skiena S. DeepWalk: online learning of social representations. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA. 2014: 701–710.
38. Vanunu O, Magger O, Ruppin E, et al. Associating genes and protein complexes with disease via network propagation. PLoS Comput Biol 2010; 6 (1): e1000641.
39. Billsus D, Pazzani MJ. Learning collaborative information filters. In: Proceedings of the 15th International Conference on Machine Learning. San Francisco, CA, USA. 1998: 46–54.
40. Bauer J, Wendland J. Candida albicans Sfl1 suppresses flocculation and filamentation. Eukaryotic Cell 2007; 6 (10): 1736–1744.
41. Jaworek TJ, Bhatti R, Latief N, et al. USH1K, a novel locus for type I Usher syndrome, maps to chromosome 10p11.21-q21.1. J Hum Genet 2012; 57 (10): 633–637.
42. Dad S, Østergaard E, Thykjaer T, et al. Identification of a novel locus for a USH3 like syndrome combined with congenital cataract. Clin Genet 2010; 78 (4): 388–397.
43. Gootwine E, Ofri R, Banin E, et al. Safety and efficacy evaluation of rAAV2tYF-PR1.7-hCNGA3 vector delivered by subretinal injection in CNGA3 mutant achromatopsia sheep. Hum Gene Ther Clin Dev 2017; 28: 96–107.
44. Ma NG, Ad UI, et al. Mutations in GRM6 identified in consanguineous Pakistani families with congenital stationary night blindness. Mol Vis 2015; 21: 1261–1271.
45. Chowers I, Tiosano L, Audo I, et al. Adult-onset foveomacular vitelliform dystrophy: a fresh perspective. Prog Retinal Eye Res 2015; 47: 64–85.
46. Kuniyoshi K, Hayashi T, Sakuramoto H, et al. New truncation mutation of the NR2E3 gene in a Japanese patient with enhanced S-cone syndrome. Jpn J Ophthalmol 2016; 60 (6): 476–485.
47. Zhou XZ, Menche J, Barabási A, et al. Human symptoms–disease network. Nat Commun 2014; 5: 4212.
48. Zhou X, Lei L, Liu J, et al. A systems approach to refine disease taxonomy by integrating phenotypic and molecular networks. EBioMedicine 2018; 31: 79–91.
49. Donnelly K. SNOMED-CT: the advanced terminology and coding system for eHealth. Stud Health Technol Inform 2006; 121 (121): 279.