
Linked Biomedical Dataspace: Lessons Learned integrating Data for Drug Discovery

Ali Hasnain1, Maulik R. Kamdar1, Panagiotis Hasapis2, Dimitris Zeginis3,4, Claude N. Warren, Jr5, Helena F. Deus6, Dimitrios Ntalaperas2, Konstantinos Tarabanis3,4, Muntazir Mehdi1, and Stefan Decker1

1 Insight Center for Data Analytics, National University of Ireland, Galway {ali.hasnain,maulik.kamdar,muntazir.mehdi,stefan.decker}@insight-centre.org
2 UBITECH Research, 429 Messogion Avenue, Athens, Greece {phasapis,dntalaperas}@ubitech.eu
3 Centre for Research and Technology Hellas, Thessaloniki, Greece
4 Information Systems Lab, University of Macedonia, Thessaloniki, Greece {zeginis,kat}@uom.gr
5 Xenei.com [email protected]
6 Foundation Medicine Inc. Cambridge, MA [email protected]

Abstract. The increase in the volume and heterogeneity of biomedical data sources has motivated researchers to embrace Linked Data (LD) technologies to solve the ensuing integration challenges and enhance information discovery. As an integral part of the EU GRANATUM project, a Linked Biomedical Dataspace (LBDS) was developed to semantically interlink data from multiple sources and augment the design of in silico experiments for cancer chemoprevention drug discovery. The different components of the LBDS help both bioinformaticians and biomedical researchers to publish, link, query and visually explore the heterogeneous datasets. We have extensively evaluated the usability of the entire platform. In this paper, we showcase three different workflows depicting real-world scenarios on the use of LBDS by the domain users to intuitively retrieve meaningful information from the integrated sources. We report the important lessons that we learned through the challenges encountered and our accumulated experience during the collaborative processes, which would make it easier for LD practitioners to create such dataspaces in other domains. We also provide a concise set of generic recommendations to develop LD platforms useful for drug discovery.

Keywords: Linked Data, Drug Discovery, SPARQL Federation, Visualization, Biomedical Research

1 Introduction

Drug discovery entails the effective integration of data and knowledge from multiple disparate sources, the intuitive retrieval of vital information and the active involvement of domain scientists at all stages [33]. Biomedical data, encompassing a diverse range of spatial (gene ⇒ organism) and temporal (cell division ⇒ human lifespan) scales, is organized in separate datasets, each originally published to address a specific research problem. As a result, there are a large number of voluminous datasets available with varying representations, models, formats and semantics. Consequently, retrieving meaningful information for drug discovery-related queries, like ‘List of molecules, with 5 Hydrogen bond donors, Molecular Weight [...]

[...] <chebi/sparql> bgp (triple ?molecule rdf:type )))

4.2 Retrieving molecules that interact with Estrogen receptors

One of the primary objectives of the GRANATUM project was to identify molecules having a favorable binding affinity with Estrogen receptors-α and β for the prognosis of breast cancer drug therapy [36]. PubChem is a vast public repository cataloguing the potency of small molecules towards various biological targets, as determined by bioactivity assays (BioAssays) [21]. The central idea is to retrieve favorable agents (with Molecular Weight [...]

[...] 300,000 compounds (Workflows II, III). These compounds can be virtually screened using in silico methods like Protein-Ligand Docking [35], to obtain around 10 potential compounds for in vivo analysis. LBDS could also be used for the discovery of biological interactions (protein-protein or gene-drug interactions) by integrating ‘-omics’ datasets with GO or PubChem.

Q12. Should external links to Linked Data sources be locally materialized to enhance query responses? RDF entities in remote repositories may change, become unavailable, or be badly curated. Because interfaces request data from a federated query engine, which executes queries against remote repositories, such situations disrupt both the user experience and semantic reasoning by agents. A potential solution is the partial materialization of RDF triples from remote resources to local repositories [9]. The query engine could first try to resolve a query locally and, if that is not possible, forward the query to external repositories. The selection of triples to be cached, as well as the refresh mechanisms, depends on many parameters, which could be balanced using weighted equations.

Q13. How would the LBDS address emerging user needs? Even if the model seems to fully represent an area of interest (e.g. cancer chemoprevention) at the time of its creation, new needs might emerge in the future (e.g. new Qe) for end-users. The LBDS has to provide a maintenance mechanism that satisfies these demands. An incrementation tool was integrated with ReVeaLD to enable users to extend or merge the semantic model by adding new Qe. A naive versioning scheme lets domain users maintain and share different modifications of their extensions.

23 http://modernizr.com/
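The local-first resolution discussed under Q12 can be illustrated with a minimal Python sketch. It is not the LBDS implementation: the class name `MaterializedCache`, the `fetch_remote` callback and the fixed time-to-live refresh policy are all assumptions made for illustration, and a real system would pick the triples to materialize and the refresh schedule with the weighted criteria mentioned above.

```python
import time
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]
RemoteFetch = Callable[[str], List[Triple]]


class MaterializedCache:
    """Hypothetical local store of triples copied from remote repositories."""

    def __init__(self, fetch_remote: RemoteFetch, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch_remote = fetch_remote   # assumed callback that queries a remote source
        self._ttl = ttl_seconds             # assumed refresh policy: fixed time-to-live
        self._clock = clock
        self._store: Dict[str, Tuple[float, List[Triple]]] = {}

    def describe(self, subject: str) -> List[Triple]:
        """Resolve locally first; forward to the remote source on a miss or expiry."""
        entry = self._store.get(subject)
        now = self._clock()
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        try:
            triples = self._fetch_remote(subject)
        except Exception:
            # Remote source failed: serve the stale local copy if one exists.
            if entry is not None:
                return entry[1]
            raise
        self._store[subject] = (now, triples)
        return triples
```

Serving a stale copy when the remote source fails is one possible design choice; it keeps the interface responsive at the cost of possibly outdated triples.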

7 Recommendations

We summarize a set of generic recommendations that initiatives developing LD platforms for drug discovery might find useful.

1. End-users (i.e. domain experts) should be involved at all stages (from conceptualization to evaluation) of the LBDS development.
2. Developers must use a domain-specific semantic model for the homogenisation of the data sources and the integration of the LBDS components.
3. Quality and availability of the RDF data sources should be taken into consideration when discovering datasets.
4. SPARQL endpoints must be monitored constantly for availability and interoperability, and feedback should be used to inform data publishers.
5. Caching mechanisms must be incorporated at the data sources and QEC.
6. Data publishers must ensure that the RDF URIs are HTTP-dereferenceable.
7. User-driven tools for data extraction and annotation must be provided.
8. Retrieved information from the LBDS should be made more human-readable and personalized to meet the needs of the domain.
9. Concept maps must be used for knowledge visualization, to enable preliminary users to interpret and formulate domain problems.
10. HCI-based (Human-computer interaction) evaluations of semantic web applications must be carried out to enhance user experience and usability.
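Recommendation 4 can be made concrete with a small availability probe, sketched below in Python. This is an illustration under stated assumptions and not a tool described in this paper: the function names `http_probe` and `check_endpoints` are hypothetical, the probe is a plain `ASK` query sent as an HTTP GET with a `query` parameter as defined by the SPARQL protocol, and the endpoint URL is assumed to carry no query string of its own.

```python
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterable

# A cheap query that any non-empty SPARQL endpoint should be able to answer.
PROBE_QUERY = "ASK { ?s ?p ?o }"


def http_probe(endpoint_url: str, timeout: float = 10.0) -> bool:
    """Send the probe over HTTP GET and report whether the endpoint answered with 200."""
    url = endpoint_url + "?" + urllib.parse.urlencode({"query": PROBE_QUERY})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status == 200


def check_endpoints(endpoints: Iterable[str],
                    probe: Callable[[str], bool] = http_probe) -> Dict[str, bool]:
    """Map each endpoint URL to True (available) or False (unavailable)."""
    status: Dict[str, bool] = {}
    for url in endpoints:
        try:
            status[url] = bool(probe(url))
        except Exception:
            # Timeouts, DNS failures and HTTP errors all count as unavailable.
            status[url] = False
    return status
```

Running `check_endpoints` on a schedule and recording the results would give the availability history that could be fed back to data publishers; checking interoperability would need further probes, for example for supported SPARQL features.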

8 Conclusion

In this paper, we present the important lessons learned during the collaborative development of a Linked Biomedical Dataspace (LBDS) for supplementing drug discovery. We provide a brief overview of the different components and the state-of-the-art technologies that can be integrated to publish, interlink, access and visualize LD. We emphasize the collaborative involvement of domain users in all the decision-making processes of the LBDS development. Three workflows showcase how the LBDS can be exploited by bioinformaticians and biomedical researchers for cancer chemoprevention drug discovery. We compare the main features of our LBDS against some of the popular LD platforms available for drug discovery. Our experiences and the challenges encountered have helped us outline the important lessons and summarize generic recommendations for LD practitioners to create such dataspaces in other domains.

Acknowledgments. This research has been supported in part by the EU FP7 GRANATUM project, ref. FP7-ICT-2009-6-270139, and by Science Foundation Ireland under Grant Numbers SFI/12/RC/2289 and SFI/08/CE/I1380 (Lion 2). The authors would like to acknowledge the members of the GRANATUM Consortium for providing essential inputs during the development stages of the different components of the LBDS.

References

1. Alexander, K., Cyganiak, R., et al.: Describing linked datasets. In: LDOW (2009)
2. Antoniades, A., Georgousopoulos, C., Forgo, N., et al.: Linked2Safety: A secure linked data medical information space for semantically-interconnecting EHRs advancing patients’ safety in medical research. In: 12th International Conference on Bioinformatics & Bioengineering (BIBE). pp. 517–522. IEEE (2012)
3. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., et al.: Gene Ontology: tool for the unification of biology. Nature genetics 25(1), 25–29 (2000)
4. Belleau, F., Nolin, M.A., et al.: Bio2RDF: towards a mashup to build bioinformatics knowledge systems. Journal of biomedical informatics 41(5), 706–716 (2008)
5. Berlanga, R., et al.: Exploring and linking biomedical resources through multidimensional semantic spaces. BMC bioinformatics 13(Suppl 1), S6 (2012)
6. Bizer, C., Seaborne, A.: D2RQ-treating non-RDF databases as virtual RDF graphs. In: Proceedings of the 3rd international semantic web conference (ISWC) (2004)
7. Bodenreider, O.: The unified medical language system (UMLS): integrating biomedical terminology. Nucleic acids research 32(suppl 1), D267–D270 (2004)
8. Buil-Aranda, C., et al.: SPARQL Web-Querying Infrastructure: Ready for Action? In: 12th International Semantic Web Conference, pp. 277–293. Springer (2013)
9. Castillo, R., Leser, U.: Selecting materialized views for RDF data. In: Current Trends in Web Engineering, pp. 126–137. Springer (2010)
10. Cheung, K.H., Frost, H.R., Marshall, M.S., et al.: A journey to semantic web query federation in the life sciences. BMC bioinformatics 10(Suppl 10), S10 (2009)
11. Euzenat, J., Meilicke, C., et al.: Ontology alignment evaluation initiative: six years of experience. In: Journal on data semantics XV, pp. 158–192. Springer (2011)
12. Freitas, A., Curry, E., et al.: Querying linked data using semantic relatedness: a vocabulary independent approach. Internet Computing IEEE pp. 24–33 (2012)
13. Goble, C., Gray, A.J., Harland, L., Karapetyan, K., Loizou, A., et al.: Incorporating commercial and private data into an open linked data platform for drug discovery. In: The Semantic Web–ISWC 2013, pp. 65–80. Springer (2013)
14. Hartig, O., Bizer, C., Freytag, J.C.: Executing SPARQL queries over the web of linked data. In: The Semantic Web–ISWC 2009, pp. 293–309. Springer (2009)
15. Hasnain, A., Fox, R., Decker, S., Deus, H.F.: Cataloguing and linking life sciences LOD Cloud. In: 1st International Workshop on Ontology Engineering in a Data-driven World at EKAW12 (2012)
16. Irwin, J.J., Shoichet, B.K.: ZINC-a free database of commercially available compounds for virtual screening. Journal of chemical information and modeling 45(1), 177–182 (2005)
17. Kamdar, M.R., Iqbal, A., Saleem, M., Deus, H.F., Decker, S.: GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research. In: Conference on Semantics in Healthcare and Life Sciences (CSHALS). ISCB (2014)
18. Kamdar, M.R., Zeginis, D., Hasnain, A., Decker, S., Deus, H.F.: ReVeaLD: A user-driven domain-specific interactive search platform for biomedical research. Journal of Biomedical Informatics 47(0), 112–130 (2014)
19. Kannas, C., Achilleos, K., Antoniou, Z., Nicolaou, C., Pattichis, C., et al.: A workflow system for virtual screening in cancer chemoprevention. In: 12th International Conference on Bioinformatics & Bioengineering (BIBE). pp. 439–446. IEEE (2012)
20. Kaufmann, E., Bernstein, A.: Evaluating the usability of natural language query languages and interfaces to Semantic Web knowledge bases. Web Semantics: Science, Services and Agents on the World Wide Web 8(4), 377–393 (2010)
21. Li, Q., Cheng, T., Wang, Y., Bryant, S.H.: PubChem as a public resource for drug discovery. Drug discovery today 15(23), 1052–1057 (2010)
22. Markham, K.M., et al.: The concept map as a research and evaluation tool: Further evidence of validity. Journal of research in science teaching 31(1), 91–101 (1994)
23. Miller, G.A., Beckwith, R., Fellbaum, C., et al.: Introduction to WordNet: An on-line lexical database. International journal of lexicography 3(4), 235–244 (1990)
24. Nikolov, A., Uren, V., Motta, E., De Roeck, A.: Overcoming schema heterogeneity between linked semantic repositories to improve coreference resolution. In: The Semantic Web, pp. 332–346. Springer (2009)
25. Pence, H.E., Williams, A.: ChemSpider: an online chemical information resource. Journal of Chemical Education 87(11), 1123–1124 (2010)
26. Pietriga, E., Bizer, C., et al.: Fresnel: A browser-independent presentation vocabulary for RDF. In: The Semantic Web–ISWC 2006, pp. 158–171. Springer (2006)
27. Ruttenberg, A., Rees, J.A., et al.: Life sciences on the semantic web: the Neurocommons and beyond. Briefings in bioinformatics 10(2), 193–204 (2009)
28. Saleem, M., Khan, Y., Hasnain, A., Ermilov, I., et al.: A fine-grained evaluation of SPARQL endpoint federation systems. Semantic Web Journal (2014)
29. Saleem, M., et al.: Big linked cancer data: Integrating linked TCGA and PubMed. Web Semantics: Science, Services and Agents on the World Wide Web (2014)
30. Samwald, M., Jentzsch, A., et al.: Linked open drug data for pharmaceutical research and development. Journal of cheminformatics 3(1), 19 (2011)
31. Sandler, R.S., Halabi, S., Baron, J.A., Budinger, S., Paskett, E., et al.: A randomized trial of aspirin to prevent colorectal adenomas in patients with previous colorectal cancer. New England Journal of Medicine 348(10), 883–890 (2003)
32. Schwarte, A., et al.: FedX: Optimization techniques for federated query processing on linked data. In: The Semantic Web–ISWC 2011, pp. 601–616. Springer (2011)
33. Searls, D.B.: Data integration: challenges for drug discovery. Nature reviews Drug discovery 4(1), 45–58 (2005)
34. Shi, L., Campagne, F.: Building a protein name dictionary from full text: a machine learning term extraction approach. BMC bioinformatics 6(1), 88 (2005)
35. Sousa, S.F., et al.: Protein-ligand docking: current status and future challenges. Proteins: Structure, Function, and Bioinformatics 65(1), 15–26 (2006)
36. Speirs, V., Parkes, A.T., et al.: Coexpression of Estrogen Receptor α and β: Poor Prognostic factors in Human Breast Cancer? Cancer research 59(3), 525–528 (1999)
37. Uschold, M., Gruninger, M.: Ontologies: Principles, methods and applications. The knowledge engineering review 11(02), 93–136 (1996)
38. Visser, P.R., Jones, D.M., Bench-Capon, T., Shave, M.: An analysis of ontology mismatches; heterogeneity versus interoperability. In: AAAI 1997 Spring Symposium on Ontological Engineering, Stanford CA., USA. pp. 164–72 (1997)
39. Weininger, D.: SMILES, a chemical language and information system. Journal of chemical information and computer sciences 28(1), 31–36 (1988)
40. Whetzel, P.L., Noy, N.F., et al.: Bioportal: enhanced functionality via new web services from the national center for biomedical ontology to access and use ontologies in software applications. Nucleic acids research 39(suppl 2), W541–W545 (2011)
41. Williams, A.J., Harland, L., Groth, P., Pettifer, S., et al.: Open PHACTS: semantic interoperability for drug discovery. Drug discovery today 17(21), 1188–1198 (2012)
42. Zeginis, D., et al.: A collaborative methodology for developing a semantic model for interlinking Cancer Chemoprevention linked-data sources. Semantic Web (2013)
