The Linked Clinical Data Project: Applying Semantic Web Technologies for Clinical and Translational Research using Electronic Medical Records

Jyotishman Pathak

Richard C. Kiefer

Christopher G. Chute

Dept. of Health Sciences Research, Mayo Clinic, 200 1st Street SW, Rochester, MN 55905, USA


ABSTRACT Systematic study of clinical phenotypes is important for better understanding the genetic basis of human diseases and for enabling more effective gene-based disease management. The Linked Clinical Data (LCD) project at Mayo Clinic aims to develop a semantics-driven framework for high-throughput phenotype extraction, representation, integration, and querying from electronic medical records using emerging Semantic Web technologies, such as Linked Open Data. This poster abstract provides a brief background and overview of the recently initiated LCD project.

Categories and Subject Descriptors I.2.4 [Knowledge Representation Formalisms and Methods]: Relation systems, Distributed databases

General Terms Algorithms, Standardization, Languages.

Keywords Linked Open Data, genotype-phenotype associations, genome-wide association studies, diabetes mellitus

BACKGROUND AND INTRODUCTION Recent advances in genomic science and comparative biological studies have led to the emergence of a transdisciplinary field called “Phenomics”, which aims to capitalize on high-throughput computation and informatics technologies for the systematic study of phenotypes and how they might influence personal genomics. Several recent comparative phenomics studies [1] have demonstrated the power of positively correlating phenotypes with several measures of gene function. Despite these advances, research in phenomics faces various challenges, including (i) developing approaches for high-throughput extraction and representation of phenotypes from Electronic Medical Records (EMRs) using standardized biomedical ontologies and metadata, (ii) building techniques for storing, integrating, and querying phenotype data, and (iii) advancing phenotype-driven analysis to derive gene-disease, gene-drug, and gene-environment associations.

To address these requirements, the recently initiated Linked Clinical Data (LCD) project at Mayo Clinic aims to investigate emerging Semantic Web technologies, such as Resource Description Framework (RDF), Web Ontology Language (OWL) and Linked Data, for developing a semantics-driven framework for high-throughput phenotyping using EMRs to analyze multifactorial phenotypes, such as Peripheral Arterial Disease and Coronary Heart Disease. The main goals of the LCD project are to: (1) investigate ontology-based techniques for representing and encoding phenotype data derived from the EMR; (2) develop a framework for publishing and integrating ontology-encoded structured phenotype data for federated querying using Linked Data principles and technologies; and (3) propose and validate semantic reasoning techniques to support rapid cohort identification in cardiovascular diseases. Figure 1 shows the proposed platform architecture.

Figure 1. Proposed LCD architecture

PROJECT RATIONALE In general terms, Linked Data refers to a set of best practices for publishing and linking pieces of data, information and knowledge on the Web. The core technologies that Linked Data builds on are (1) Uniform Resource Identifiers (URIs) for identifying entities or concepts in the world, (2) the RDF data model and RDFS/OWL ontologies for representing, structuring and linking descriptions of such entities as resources, and (3) HTTP for retrieving resources, or descriptions of the resources. Given that phenomics is, by definition, a systematic approach for integrating and analyzing information spanning genes, proteins, pathways, diseases, drugs and patients, it requires an infrastructure that is flexible enough to respond to new and different data typologies, scalable enough to handle large and continuously evolving data, and able to accommodate different knowledge discovery use cases. Linked Data and related technologies present a very promising approach to meeting such requirements. What is lacking thus far is an end-to-end infrastructure and application platform that leverages existing Linked Data tools and technologies to facilitate translational research; the LCD project aims to provide such an infrastructure. More information about the project is available at: http://informatics.mayo.edu/LCD.
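To make the three core technologies concrete, the following is a minimal Python sketch, not the project's actual implementation. It builds a federated SPARQL 1.1 query of the kind used to join SNP triples to a remote OMIM service, and it prepares (without sending) the HTTP request that would carry that query. The endpoint addresses, the `v:` predicate namespace, and the function names `build_federated_query` and `prepare_request` are all hypothetical placeholders introduced for illustration; the real LCD vocabulary and endpoints are not given in this abstract.

```python
import re
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical placeholders: none of these URIs come from the LCD project.
SNP_ENDPOINT = "http://localhost:8890/sparql"    # assumed local dbSNP endpoint
OMIM_SERVICE = "http://example.org/omim/sparql"  # stand-in for a remote OMIM endpoint
VOCAB = "http://example.org/lcd/vocab#"          # invented predicate namespace


def build_federated_query(rs_ids):
    """Return a SPARQL 1.1 query that joins local SNP triples to a remote service."""
    if not rs_ids:
        raise ValueError("at least one SNP identifier is required")
    for rs in rs_ids:
        # Accept only identifiers shaped like rs123 so nothing else reaches the query.
        if not re.fullmatch(r"rs\d+", rs):
            raise ValueError(f"not an rs identifier: {rs!r}")
    values = " ".join(f'"{rs}"' for rs in rs_ids)
    return f"""PREFIX v: <{VOCAB}>
SELECT ?rsId ?gene ?disease WHERE {{
  VALUES ?rsId {{ {values} }}
  ?snp v:rsId ?rsId ;
       v:locatedInGene ?gene .
  SERVICE <{OMIM_SERVICE}> {{
    ?gene v:associatedDisease ?disease .
  }}
}}"""


def prepare_request(query, endpoint=SNP_ENDPOINT):
    """Build, but do not send, an HTTP GET that carries the query to an endpoint."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


if __name__ == "__main__":
    # Sample identifiers used only as input; no result data is implied.
    query = build_federated_query(["rs7903146", "rs12255372"])
    print(query)
    print(prepare_request(query).full_url)
```

In this sketch the URIs name the things being described, the triple patterns express the RDF-level join, and the `SERVICE` clause delegates part of the pattern to a second endpoint over HTTP. A real deployment would also need the gene identifiers in both datasets to resolve to the same URIs, which this example simply assumes.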

CURRENT STATUS AND FUTURE WORK The current focus of the LCD project is to link dbSNP [2] and OMIM [3] in order to isolate SNPs that have been linked to certain diseases. A SNP SPARQL endpoint was created using the database dump from NCBI. Data from those triples were joined via a federated SPARQL 1.1 query to the OMIM endpoint provided by Bio2RDF [4]. The result set is being compared with the information found in SNPedia (http://snpedia.com/) for validation. Once validation is complete, the project will grow to include patient data provided by the Mayo Genome Consortia (MayoGC [5]). The MayoGC collects SNP data from volunteers for medical research purposes. By specifying a disease such as diabetes, a list of patients and their SNPs that are associated with that disease may be obtained. The list of SNPs will then be entered into a SPARQL query that uses dbSNP to find genes and OMIM to get the associated diseases; the results will finally be filtered against the list of patients in the MCLSS database. Through this process we may verify that the patients found in MayoGC when going from disease to SNP are also found in MCLSS when going from SNP to disease. An additional facet of this process will be the potential for risk analysis, by finding diseases that are associated with a patient's SNPs but with which the patient has not been diagnosed.

ACKNOWLEDGMENTS The research is supported in part by the Mayo Clinic Early Career Development Award (FP00058504).

REFERENCES
1. McCarty, C., et al., The eMERGE Network: A consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Medical Genomics, 2011. 4(1): p. 13.
2. Sherry, S.T., M. Ward, and K. Sirotkin, dbSNP-Database for Single Nucleotide Polymorphisms and Other Classes of Minor Genetic Variation. Genome Research, 1999. 9(8): p. 677-679.
3. Hamosh, A., et al., Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research, 2005. 33(suppl 1): p. D514-D517.
4. Belleau, F., et al., Bio2RDF: Towards a mashup to build bioinformatics knowledge systems. Journal of Biomedical Informatics, 2008. 41(5): p. 706-716.
5. Bielinski, S.J., et al., Mayo Genome Consortia: A Genotype-Phenotype Resource for Genome-Wide Association Studies With an Application to the Analysis of Circulating Bilirubin Levels. Mayo Clinic Proceedings, 2011. 86(7): p. 606-614.
