Linked Functional Annotation For Differentially ...

2 downloads 0 Views 563KB Size Report
Number Variation (CNV), Somatic mutation from TCGA and COSMIC repositories. ... from various cancer mutation (TCGA) and database with global mutation list.
Linked Functional Annotation For Differentially Expressed Gene (DEG) Demonstrated using Illumina Body Map 2.0 Alokkumar Jha , Yasar Khan, Aftab Iqbal, Achille Zappa , Muntazir Mehdi, Ratnesh Sahay, and Dietrich Rebholz-Schuhmann Insight Centre for Data Analytics, NUI Galway, Ireland Galway {alokkumar.jha,yasar.khan,aftab.iqbal,achille.zappa,muntazir.mehdi, ratnesh.sahay,rebholz}@insight-centre.org

Abstract. Next generation sequencing is playing a key role in therapeutic decision making in cancer. However, technology is facing a new challenge to find exact fit for clinical and its translational outcomes. In order to address this challenge a well-connected functional annotation mechanism is required in cancer studies with low FDR (false discovery rate). In this paper, we propose a linked functional annotation mechanism to connect and query distributed repositories covering Gene expression, Copy Number Variation (CNV), Somatic mutation from TCGA and COSMIC repositories. We annotate differential expression data set generated from Illumina Body map 2.0 for linking 16 different tissue types and expected risk associated with cancer for the genes with the RPKM value greater than 0.5. We argue that DEG from Illumina body map 2.0 can be associated with tissue types and expression in COSMIC and TCGA, with a future potential biomarker for that particular tissue specific cancer. We have also discovered genomic position(loci) and associated genes for ovarian cancer as novel indications which provides a protocol for future studies to use this platform for functional annotations in cancer studies. To the best of our knowledge, the proposed linked annotation mechanism is a first initiative to represent COSMIC datasets into the RDF format and link them with the TCGA datasets, resulting in 1.2 billion and 100 million triples for COSMIC and TCGA, respectively. Keywords: Differentially Expressed genes(DEG),Linked data, Clinical genomics,Copy Number Variation (CNV)

1

Introduction

The Next Generation Sequencing (NGS) technologies have opened new doors in cancer research. However, high-throughput sequencing data has to go through complex processing pipelines and downstream annotation lacks a well-integrated platform in both clinical and translational [11] research. The challenge now is to develop a computational infrastructure for handling the enormous volume of data and the biological interpretation of exponential data originating from

2

Authors Suppressed Due to Excessive Length

clinical facilities. The functional annotation of cancer genomics data is domain specific and user intuitive which leads to ambiguity while aggregating clinical outcomes from distinct resources. In this paper our main focus is on exploring patterns of a gene across different cancer and tissue types. Our experiments are based on COSMIC1 and TCGA2 knowledge repositories with mutation and differential expression data sets. In order to link and retrieve patterns of a gene and tissue specific information from various cancer mutation (TCGA) and database with global mutation list and mutation type(COSMIC), we encountered three challenges (i) transform heterogeneous data repositories and their storage formats (i.e., TCGA, COSMIC ) into the standard RDF format; (ii) discover links that can correlate patterns of a gene across CNV, mutation and gene expression data sets; (iii) scalable querying of large volume and frequently updating datasets covering 16 different tissue types and gene expressions multiple distinct repositories. The experiments conducted in this paper is based on the transcriptome study from the Illumina Body Map 2.0 (HBM)3 that includes adrenal, adipose, brain, breast, colon, heart, kidney, liver, lung, lymph, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells. Illumina Body Map provides a gene specific information across one or more tissue types can lead to potential biomarker for targeted therapy. In this study our results not only depicts novel biological outcomes but also provides a linked annotation framework that assimilates clinical outcomes from overlapping repositories. The rest of the paper is structured as follows: Section 2 discusses our motivational scenario explaining the Illumina Body Map(HBM) 2.0 use case and annotation databases; Section 3 presents the methodology and architecture of the proposed linked functional annotation framework; Section 4 presents the evaluation of linked functional annotation framework; Section 6 presents the related work in linking the TCGA repository and Section 7 concludes our work.

2

Motivational scenario

In order to understand the behaviour of disease it is important to understand the patterns in genes and samples. We chose Illumina body map 2.0 to discover underlined similarity between genes across different tissue studies. Illumina body map 2.0 is a project to understand connection among human tissues on molecular and gene level. Due to overlaps between cancer behaviours, progression and mutated genes, we have annotated top 100 genes distilled by our filtering criteria with COSMIC to explore previously observed studies from TCGA database, e.g., somatic mutation, genomic loci and other mutations linked to these genes retrieved from healthy tissues. Illumina Body Map (HBM) 2.0: It is a rich sample set of transcriptome study including 16 tissue types including adrenal, adipose, brain, breast, colon, heart, 1

http://cancer.sanger.ac.uk/cosmic https://tcga-data.nci.nih.gov/tcga/ 3 https://www.ebi.ac.uk/gxa/experiments/E-MTAB-513 2

Title Suppressed Due to Excessive Length

3

kidney, liver, lung, lymph, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells. These 16 cells data have been processed, downloaded, aligned and finally expression level have been calculated [3]. Sequencing has been done on both Paired-end (PE) and Single-end (SE), with read-length of 50bp and 75bp. Our input will be the outcome of RNA seq data analysis pipeline processed list of Differentially Expressed Genes (DEG). The gene expression extracted from Illumina body map samples return more than 52000 genes in the list and to define the cut off for each RPKM value we have followed the method mentioned in sandberg et al[8]. In our case, the data annotated for each tissue type includes both the coverages and their corresponding expression level in the form of RPKM values. The RNA seq data set provides a variety of insightful data such as copy number variation (CNV), fusion genes, structural variation, DEG, novel mutations, splice junctions and trascriptome variations (SNP). A logical connection between these datasets enable useful insight into the cancer disease with biological and clinical interpretation. Annotation databases (COSMIC & TCGA): Our main focus is on exploring behaviour of various DEG across different tissues types relating to cancer mutation (TCGA) and database with global mutation list and mutation type (COSMIC). Figure 1 shows the correspondences, i.e., links that we established between TCGA and COSMIC databases. For example our primary interest is to link copy number variation CNV, complete mutation and gene expression data.

GENE_EXPRESSION GENE_EXPRESSION Composite Element REF Protein Expression

COMPLETE_MUTATION Hugo_Symbol Entrez_Gene_Id ChromosomeStart_Position

COMPLETE_MUTATION

CNV(COPY NUMBER)

CNV(COPY NUMBER) Chromosome Start End Segment_Mean

ID_SAMPLE SAMPLE_NAME GENE_NAME REGULATION

UID

Gene name Sample name Tumour origin TCGA_ID ID_tumour Sample source ……..

ID_SAMPLE ID_tumour Primary site Histology subtype MUT_TYPE Chromosome:G_Start..G_Stop

Validation Status ……

Fig. 1: Links between COSMIC and TCGA repositories

We have done basic curative in terms of refinement. RDFied version (discussed in the Section 3.1) has kept this redundancy problem to have FDR rate as low as possible.as explained in the figure above we had to find instances to linked tow databases or couple of events within the databases. For example,In COSMIC GENE EXPRESSION and MUTATION data could be easily linked using gene names but CNV had SAMPLE IDs common with these two. later we used sample IDs after first iteration with GENE NAME. Same ways chr:start end position and GENE NAME was two major instances to link COSMIC and TCGA as explained with green arrows in Figure 1.

3

Methodology & Architecture

The linked functional annotation architecture is summarized in Figure 2, which shows its three main components: (i) RDFization: it transformations TCGA and COSMIC databases as Linked Data and further made them accessible

4

Authors Suppressed Due to Excessive Length

via several SPARQL endpoints. This would enable researchers to pose a query against a SPARQL endpoint and get the required chunk of data necessary for analysis; (ii) Linking: it discovers correspondences between selected datasets (TCGA variants, COSMIC diff expression, COSMIC Mutation, etc.). Links discovered in this component has a direct consequence on the efficient source selection, query planning, and overall execution of query in a decentralized fashion; and (iii) Scalable Query Federation: it a single-point-of-access through which distributed data sources can be queried in unison. Illumina Body Map 2.0

Source Selection Data Summaries

Access Policy Model

Query Execution

TCGA_protein_expression

COSMIC_Mutation

Query Planning

COSMIC_diff_expression

SAFE Query Federation Engine

COSMIC GENE_NAME Mutation _ID Sample_ID Tumor_ID Chr_start Chr_end TCGA_ID

RDFization

SPARQL Query

Linked TCGA and COSMIC Repositories

TCGA_variants

Differentially expressed genes (DEG)

GENE_EXPRESSION COMPLETE_MUTATION

CNV(COPY NUMBER)

TCGA_ID Chr_end Chr_start Variant_type Ref_Protien Expression

TCGA

Results

Fig. 2: Scalable Query Federation Over Gene Expression Annotation

The scalable query federation is based on our SPARQL query federation engine–called SAFE [7] – developed for the clinical trial repositories. SAFE has been tailored for efficient integration of data from multiple TCGA and COSMIC SPARQL endpoints. The main approach behind SAFE is to use the intelligent distribution of data to reduce the number of sources selected (without losing the recall) for processing federated SPARQL queries. 3.1

RDFization

In this section we describe the process of converting COSMIC raw data to COSMIC Linked Data and the corresponding statistics related to the data. COSMIC raw data files are in tab separated text (tsv) format. For the purpose of RDFizing COSMIC raw data, we have developed COSMIC RDFizer tool. The COSMIC RDFizer takes tab separated files as input and generates the resulting RDF in the form of N3 triples. These RDF triples can now be uploaded to any triple store, such as Virtuoso, Sesame etc (Virtuoso in our case). For our use case, we only need three types of data, i.e Gene Expression (GE), Gene Mutation (GM) and Copy Number Variants (CNV) data. Table 1 shows different statistics related to the RDFization of COSMIC data. Rows 2-5 represent the statistics of RDFization of different parts of COSMIC data, e.g. Row 2 represents, the number of records (Column 2), it’s size (Column 4), the corresponding triples generated (Column 2) and it’s size, for COSMIC Gene Expression data. The other two rows represents the same for COSMIC Gene Mutation and Copy Number Variant data. Row 5 represents the sum of

Title Suppressed Due to Excessive Length

5

all the three parts of data being RDFized. A total of 154 million records has been RDFized, producing 1.2 billion triples approximately. Row 6 represents the statistics related to RDF version of TCGA-OV (TCGA Ovarian Cancer), a subset of Linked TCGA4 data. The RDF dumps of COSMIC data can be available as per data policy.5 . Table 1: COSMIC Data Statistics Data COSMIC GE COSMIC GM COSMIC CNV Total TCGA-OV

3.2

Records 149 M 3.7 M 0.9 M 154M 18M

Triples Original Size RDFized Size 1185 M 7.5 GB 20 GB 84 M 916 MB 1.44 GB 11 M 82 MB 161 MB 1.28 B 8.5 GB 22 GB 100 M 2.15 GB 5 GB

COSMIC and TCGA Linking

One of the design principle of linked data is the provision of links with other data sources. In this section we will focus on the provision of links between COSMIC and TCGA. These links are at the core of facilitating data integration and data analysis tasks. There are several OWL constructs which can be used to link resources across different data sources. But we linked COSMIC with TCGA based on owl:sameAs construct. One such example can be viewed in Listing 1.1, where two COSMIC sample ids has been identified as same as two TCGA patient bar code ids. These types of links significantly enrich the information required by making data integration across different data sources possible. The use of such links is described in the next section. Listing 1.1: COSMIC and TCGA Linking Example COSMIC TCGA−OV h t t p : / / c o s m i c . s e l s . i n s i g h t . o r g / schema / ID Sample /TCGA−13−0920 owl : sameAs h t t p : / / t c g a . d e r i . i e /TCGA−13−0920 COSMIC TCGA−OV h t t p : / / c o s m i c . s e l s . i n s i g h t . o r g / schema / ID Sample /TCGA−24−1850 owl : sameAs h t t p : / / t c g a . d e r i . i e /TCGA−24−1850 4

http://tcga.deri.ie/ Due to restricted data policy for COSMIC from Wellcome trust sanger institute, we will be able to provide access to RDFied version upon request for review process only. 5

6

3.3

Authors Suppressed Due to Excessive Length

Scalable Query Federation

SAFE has been developed for accessing sensitive clinical data cubes at different locations [7]. Two main changes have been introduced in SAFE to query TCGA and COSMIC SPARQL endpoints (i) query standard RDF representation: SAFE initial version issues query over statistical clinical information stored as RDF data cubes within distinct names graphs [2]. Therefore, internal query processing (i.e., source selection, query planning, query execution) is updated to query normal RDFized version of TCGA and COSMIC repositories; and (ii) disabling access control: SAFE has a feature of processing data-access restrictions (defined as Access Policy Model [7]) while federating queries over multiple clinical site. Since, experiments conducted in this paper mainly involve public repositories this feature has been intentionally disabled. Listing 1.2 shows a sample SPARQL query federating across COSMIC and TCGA data asking for genomic loci of a mutated gene by chromosome start points which ultimately gives us disease metastasis information along with Mutation type. However, answering such a query requires integration of COSMIC and TCGA and merging results from both TCGA and COSMIC because answer to the given query can not be fulfilled by either of the data sources. Hence query federation comes into play. Result of the first four triples in the given query (i.e. cosmic-s:ID Sample, cosmic-s:Gene Name, cosmic-s:Chrom start) is fetched from COSMIC and result of the next three triples (i.e. tcga:tcga id, tcga:start) is fetched from TCGA. To produce the required information, both these results are merged on the basis of the last triple which integrates COSMIC with TCGA. Listing 1.2: Federated SPARQL Query PREFIX c o s m i c−s : PREFIX t c g a : PREFIX owl : SELECT ∗ WHERE { ? c o s m i c r e s u l t a c o s m i c−s : r e s u l t ; c o s m i c−s : ID Sample ? i d s a m p l e ; c o s m i c−s : Gene Name ? g e n e ; c o s m i c−s : c h r o m s t a r t ? c h r o m s c o s m i c . ? t c g a r e s u l t a tcga : r e s u l t ; tcga : Id ? t c g a i d ; tcga : S t a r t ? t c g a c h r o m s t a r t . ? i d s a m p l e owl : sameAs ? t c g a i d . }

4

Biological questions and annotations results from HBM 2.0 data

We have queried all the genes with RPKM > 0.5 and expressed in all tissue types and retrieved and compared their linked and manual individual query results. Figure 3 [1] is a schematic representation which satisfies earlier mentioned condition and provides us 99 genes to query. We have identified potential cancer types based on gene patterns across various tissues and further helped to understand the behaviour of most amplified cancer types.

Title Suppressed Due to Excessive Length

7

liver

heart

skeletal muscle

lung

lymph node

leukocyte

kidney

adrenal

colon

adipose

brain

breast

testis

ovary

prostate

thyroid

CCL21 JCHAIN IGHA1 IGHM IGKV1−5 IGHV1−2 IGKV3−20 IGLV3−25 IGHG1 IGKV4−1 IGHG2 IGHV3−23 IGKC IGLV3−19 IGLC2 IGLC3 IGFBP4 RPS11 RPS27 MYL9 SEMG1 TXNIP FABP4 EEF1A1 ACTB RPS12 S100A8 S100A9 LYZ CD74 HLA−DRA TMSB10 SRGN TMSB4X B2M FTL HLA−E SCGB1A1 HBB SPP1 GPX3 MALAT1 MT−ND6 MTATP6P1 MT−ATP8 MT−ND4L MT−CO1 MT−ND5 MB FABP3 MYL2 MYH7 ACTC1 MYL3 PLN MT−ATP6 MT−CYB MT−ND3 MT−CO3 MT−ND4 MT−CO2 MT−ND2 MT−ND1 MT−RNR2 MT−RNR1 ACTA1 GAPDH CKM MYBPC1 KLHL41 MT−TP PDK4 DES TNP1 PRM2 PRM1 TG SCD C3 RBP4 SAA1 AGT APOA1 SERPINA1 SAA2 FGG FGB HP ORM2 ORM1 CRP APOA2 FGL1 GC APOH APOC3 FGA ALB AMBP APCS

Fig. 3: List of Genes expressed in all tissues and highly expressed

5

Results

The main focus of our work is to provide a linked annotation mechanism where single query can retrieve all possible annotations. Initially, we have queried 99 genes 3 listed here to get CNV, Mutation and Gene Expression annotations from cBIO portal (TCGA) and CNV annotator (COSMIC) to demonstrate existing method and provide a fair comparison with the proposed linked annotation mechanism. Results from TCGA 4 clearly indicates a pattern of these genes towards uterine and ovarian cancer with significant Mutation and CNV. Ovarian cancer could be clearly picked for further investigation because of its higher amplification rate and multiple repetition in various experiments. Further studies conducted to understand somatic behaviour and loci(genomic position) of these genes and detailed information could be retrieved for mutation information. Mutation information retrieved are listed here. This study provides gene like ACTC1,B2M,CRP,FABP3,FABP4,FGA,FGB,GC,MYH7,RPRH2,SLC26A3,TG, TXNIP and these could be considered most significant driver genes for ovarian cancer from healthy human tissues. This argument gets support from COSMIC queried against 99 genes from HBM2.0 to get result for only somatic mutations in

8

Authors Suppressed Due to Excessive Length

cancer. This repeat annotation will not only provide detailed statistics reported in cosmic but also a validation for out earlier argument. Table 2 clearly indicates that locations chr1,chr4,chr14 and genes CRP,FGA,FABP3,MYH7 could be potential targeted genes for ovarian cancer detailed list could be retrieved from here

80%

Alteration frequency

70%

60%

50%

40%

30%

20%

10%

0% Cancer type +

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

-

+

+

+

+

-

-

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

CNA data

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

+

Ut

erin

eC

S( Ov TC aria GA ) Me n( TC lan GA om ) a( TC Bla G d de A) Bla r (T dd CG er (T C A) Lu ng GA Lu ad p ng ub en ) ad o( en TC o( GA TC ) Lu GA ng p ub sq ) u( TC Liv GA Ov ) aria er (T CG n S to (T C A) ma GA ch pu (T b) S to C G A m p Lu ub ac ng ) h( sq TC u G (T C He A) ad GA He &n p ad ub ec &n ) k( ec TC k( GA TC ) GA Br p ea ub st Sa (T C ) rc o GA ma Ut erin (T C ) e( GA TC ) GA Ut p erin ub ) e( Pa TC nc GA re a ) s( Ce TC rvic GA Es al op (T C ) ha GA gu ) s( TC DL GA BC Br ea (T C ) st GA (T C ) GA Pr os pu ta t b) e( Co TC lor GA Pr ec os ta l ta t (TC ) e( Co GA TC lor ) GA ec ta l 20 15 (TC ) GA cc pu RC b C (TC ) Gli GA om ) a( TC GB GA M (T C ) GA AC C (T C ) pR GA CC GB (T C ) M (T C GA ) GA ch 20 RC 13 cc C RC (T C ) C GA (TC ) GA PC P G pub) (T C ch RC GA Th C y ro (T C ) id GA (T C ) GA Th pu yro b) AM id (T CG L( A) TC GA pu AM b) GB L( TC M (T C GA ) GA 20 08 )

Mutation data

Mutation

Deletion

Amplification

Multiple alterations

Fig. 4: TCGA query output from cBIO Portal[4]

We now query the same 99 genes as earlier using the linked annotation mechanism and snippet of result for ovarian cancer is listed in Listing 1.3 with detailed functional annotation along with TCGA ids exclusive for ovarian cancer is available here. In the Listing 1.3 first part represents the COSMIC annotation and the other parts with its TCGA outcomes. MYH7 corresponds to chr-1 which is evident from previous annotations and replicated again in our study along with its TCGA ID:TCGA-13-0920-01 and its mutation type is mostly GAIN type of mutation for chr1 and chr14 which proves it dominant mutation with all there regulation of over, under and normally expressed. Translational researchers can repeat and revalidate the study for Pubmed ID:1398522 additionally with beta value (measure of methylation) of 0.041999536 and scaled estimation (Tumour purity) of 773.555 also supports this gene from epigenetic point of view. Further multiple genomic locations will help clinical practitioners to track CNV for the targeted study and which ultimately leads a direction towards better prognosis. Listing 1.3: Linked annotations for MYH7

Title Suppressed Due to Excessive Length

9

@p r e f i x c s : . @p r e f i x c : . cs : r e s u l t c s : ID Sample c : 1 4 7 4 8 2 3 ; c s : Sample Name c :TCGA−13−0920−01; c s : Gene Name c :MYH7; c s : c h r o m s t a r t c :50564627; cs : Regulation ” over ” . cs : chrom stop c : 5 1 0 1 0 3 3 1 ; cs : chr no c : 1 ; cs : chrom start m c :23418303; c s : chrom stop m c : 2 3 4 1 8 3 0 3 ; c s : c h r n o m c : 1 4 ; c s : mut type ”GAIN” ; c s : Mutation ID c : COSM75523 ; c s : P r i m a r y S i t e c : o v a r y ; c s : S i t e S u b t y p e c : NS ; c s : P r i m a r y H i s t o l o g y c : c a r c i n o m a ; c s : H i s t o l o g y S u b t y p e c : s e r o u s c a r c i n o m a ; c s : ID Tumour c : 1 3 9 8 5 2 2 ; @p r e f i x t s : . @p r e f i x t : . ts : result t s : b c r p a t i e n t b a r c o d e t :TCGA−13 −0920; t s : b e t a v a l u e ” 0 . 0 4 1 9 . . ” ; t s : chromosome t : 1 4 ; t s : chromosome t : 1 ; ts : s t a r t t :1288070; ts : stop t :1293914; ts : s c a l e d e s t i m a t e ” 773.555 ” .

Table 2: loci information for highly expressed gene in ovarian cancer from HBM 2.0

6

CHROMOSOME

Mutation Type Pubmed GENES

chr1:246407146-246740944 chr1:159683086-159684133 chr1:31842502-31849609 chr1:31842502-31849609 chr4:155506556-155506859 chr14:23857082-23886607 chr14:23857092-23886486

CNV Loss Deletion Deletion Insertion Loss Loss

17122850 20364138 20482838 20482838 20981092 20981092 20981092

CNST, LOC255654, SMYD3, TFB2M CRP FABP3 FABP3 FGA MIR208A, MYH6, MYH7 MIR208A, MYH6, MYH7

Related Works

Authors [6] provide a cancer study with 12 cancer types to logically classify massive data generation from TCGA and ICGC. Saleem et. al. [10], have covered TCGA database with few cancer types for limited number of patient data. Likewise a micro version COSMIC database have been RDFied to understand the mechanism of TP53 [12]. Similarly, a federation platform [5]FedEx is being developed to evaluate query execution time on TCGA data set which is further extended to integrated to evaluate the biological outcome by linking it through PubMed [9] Our work covers complete CNV , mutation and gene expression data and extended TCGA for the same which will provide a proof of concept for a global linked life science functional annotation platform. It is important to note that, work presented in this paper is a first attempt for transforming COSMIC into the RDF format and link it with the TCGA datasets.

7

Conclusion

In this paper we propose a linked functional annotation mechanism to explore the cancer mutation. Our experiments are based on TCGA, COSMIC and HBM 2.0 datasets which demonstrates novel indication for pattern of genes towards

10

Authors Suppressed Due to Excessive Length

ovarian cancer with significant mutation. Similar study remains valid for other cancer types. We have covered CNV, Gene Expression and Mutation data from COSMIC and TCGA (only for ovarian cancer). We have processed 1.2 billion COSMIC triples and 100 million TCGA triples which in turn generates 27 GB of data. In future this project will be expanded to cover level 1, 2 and 3 along with other datasets from COSMIC to provide in depth biological insight for each queried gene.

8

Acknowledgment

This publication has emanated from research supported by the research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289.

References 1. E. bioinformatics institute. Illumina bosd map 2.0 european bioinformatics institute. 2. J. J. Carroll, C. Bizer, P. Hayes, and P. Stickler. Named graphs, provenance and trust. In WWW, pages 613–622, 2007. 3. ericminikel. Tissue-specific gene expression data based on human bodymap 2.0, 2013. 4. J. Gao, B. A. Aksoy, U. Dogrusoz, G. Dresdner, B. Gross, S. O. Sumer, Y. Sun, A. Jacobsen, R. Sinha, E. Larsson, et al. Integrative analysis of complex cancer genomics and clinical profiles using the cbioportal. Science signaling, 6(269):pl1– pl1, 2013. 5. A. Iqbal. Fostering serendipity through big linked data. Semantic Web Challenge at ISWC, 2013. 6. C. Kandoth, M. D. McLellan, F. Vandin, K. Ye, B. Niu, C. Lu, M. Xie, Q. Zhang, J. F. McMichael, M. A. Wyczalkowski, et al. Mutational landscape and significance across 12 major cancer types. Nature, 502(7471):333–339, 2013. 7. Y. Khan, M. Saleem, A. Iqbal, M. Mehdi, A. Hogan, A. N. Ngomo, S. Decker, and R. Sahay. SAFE: policy aware SPARQL query federation over RDF data cubes. In A. Paschke, A. Burger, P. Romano, M. S. Marshall, and A. Splendiani, editors, Proceedings of the 7th International Workshop on Semantic Web Applications and Tools for Life Sciences, Berlin, Germany, December 9-11, 2014., volume 1320 of CEUR Workshop Proceedings. CEUR-WS.org, 2014. 8. D. Ramskold, E. T. Wang, C. B. Burge, and R. Sandberg. An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data. PLoS Comput Biol, 5(12):e1000598, 2009. 9. M. Saleem, M. R. Kamdar, A. Iqbal, S. Sampath, H. F. Deus, and A.-C. N. Ngomo. Big linked cancer data: Integrating linked tcga and pubmed. Web Semantics: Science, Services and Agents on the World Wide Web, 27:34–41, 2014. 10. M. Saleem, S. S. Padmanabhuni, A.-C. N. Ngomo, J. S. Almeida, S. Decker, and H. F. Deus. Linked cancer genome atlas database. In Proceedings of the 9th International Conference on Semantic Systems, pages 129–134. ACM, 2013. 11. J. Xuan, Y. Yu, T. Qing, L. Guo, and L. Shi. Next-generation sequencing in the clinic: promises and challenges. Cancer letters, 340(2):284–295, 2013. 12. A. Zappa, A. Splendiani, and P. Romano. Towards linked open gene mutations data. BMC bioinformatics, 13(Suppl 4):S7, 2012.