
Grid-based Management of Biomedical Data using an XML-based Distributed Data Management System ∗

Shannon Hastings, Stephen Langella, Scott Oster, Tahsin Kurc, Tony Pan, Umit Catalyurek, Dan Janies, Joel Saltz
Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210
{hastings,langella,oster,kurc,tpan,umit,janies,jsaltz}@bmi.osu.edu

ABSTRACT

This paper presents the application of a generic, XML-based distributed data management system for Grid-enabled management and integration of biomedical data, including image, molecular, and outcome data. We discuss the use of this system in three inter-related application scenarios: management of large-scale image data, access to data from Internet-based bioinformatic data repositories, and integration of clinical data stored in an enterprise information warehouse into translational research.

1. INTRODUCTION

The ultimate goal of translational biomedical research is to bring about a better understanding of the mechanisms causing various types of diseases and, through this understanding, to deliver improved methods for diagnosis and treatment. Since biological systems are complex, achieving this goal requires gleaning information from data at multiple scales and from multiple data sources. The data types include a broad collection of high-throughput genomic and proteomic data, outcome data, and digitized imaging and anatomic pathology studies. Translational research studies are frequently carried out at many collaborating sites. Thus, it is likely that diverse data types will be captured and stored on distributed, heterogeneous platforms.

We can view a data integration system as consisting of several levels of tools that interact with each other. The topmost level consists of user interfaces that allow a user to formulate queries against databases. The second level consists of tools for the management of ontological information about data attributes and semantic relationships between different data types and data models. This level needs to be layered on metadata definitions and metadata instances, which describe the structure and content of datasets in various databases. Finally, the last level consists of middleware tools and runtime support for efficient management of metadata and data and for querying data of interest in large and distributed biomedical datasets.

In this paper, we describe the application of a generic, XML-based distributed data management middleware for Grid-based management and integration of image, genomic, and outcome data. This system supports 1) distributed and coordinated management of metadata definitions, 2) on-demand database creation on distributed platforms, and 3) querying against distributed remote data repositories exposed as XML data sources via well-defined interfaces and protocols. We present the use of this system in three application scenarios that form important components of basic and clinical research studies: management and analysis of large image data, access to and integration of data from external bioinformatic data sources, and integration of laboratory and outcome data, which is maintained in a centralized information warehouse, with other data types.

2. RELATED WORK

Advances in high-performance computing, distributed systems, and faster wide-area networks have opened the doors to Grid-based computation and information-sharing platforms. A number of large data Grid projects have been driven by the need to access distributed repositories [4, 19, 8, 16, 1]. The Biomedical Informatics Research Network (BIRN) [4] project, funded by NIH, targets shared access to medical data in a wide-area environment; the initiative focuses on support for collaborative access to and analysis of datasets generated by neuroimaging studies. The Shared Pathology Informatics Network (SPIN) [19] initiative is developing an Internet-based software infrastructure to support a network of tissue specimen datasets. MEDIGRID [16] is a project initiated in Europe to investigate the application of Grid technologies to manipulating large medical image databases.

Parallel, distributed, and federated database systems have been a major topic in the database community for a long time [5, 11]. However, most of these efforts have been directed toward relational databases. There are some recent efforts to develop Grid and Web services [7] implementations of database technologies. Raman et al. [18] discuss a number of virtualization services to make data management and access transparent to Grid applications. Bell et al. [2] develop uniform web service interfaces and data and security models for relational databases.

∗ This research was supported in part by the National Science Foundation under Grants #CCF-0342615, #ACI-9619020 (UC Subcontract #10152408), #EIA-0121177, #ACI-0203846, #ACI-0130437, #ANI-0330612, #ACI-9982087, Lawrence Livermore National Laboratory under Grant #B517095 (UC Subcontract #10184497), NIH NIBIB BISTI #P20EB000591, and Ohio Board of Regents BRTTC #BRTT02-0003.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC'05, March 13-17, 2005, Santa Fe, New Mexico, USA. Copyright 2005 ACM 1-58113-964-0/05/0003 ...$5.00.


The Data Access and Integration Services (DAIS) working group at the Global Grid Forum [6] is a focused effort that is developing service definitions and standards for data access and integration in the Grid. The XML-based data management system employed in this paper builds on these evolving standards and provides several core services for the management of metadata definitions and data.

3. AN XML-BASED DISTRIBUTED DATA MANAGEMENT SYSTEM

In this section, we provide a brief overview of the XML-based data management middleware system, called Mobius [10, 13], which is designed to manage and integrate data and metadata that may exist across many heterogeneous resources. Its design is motivated by the requirements of Grid-wide data access and integration [6]. Mobius provides a set of generic services and protocols to support distributed creation, versioning, and management of data models and data instances, on-demand creation of databases, federation of existing databases, and querying of data in a distributed environment.

Mobius GME: Global Model Exchange. Any application needs a model for its data and metadata. By creating and publishing a common schema for data captured and referenced in a collaborative study, research groups can ensure that applications developed by each group can correctly interact with the shared datasets and interoperate. While a common schema enables interoperability, each group may generate and/or reference additional data attributes. Thus, it is also necessary to support local extensions of the common schema (i.e., versions of the schema). The Global Model Exchange (GME) is a distributed service that provides a protocol for publishing, versioning, and discovering XML schemas. In this way, the GME allows distributed collections of users to examine and analyze distributed data. In GME, a schema can be the same as a schema already registered in the system, or it can be created by versioning an existing schema through the addition or deletion of attributes. A schema can also be a composition of new attributes and references to multiple existing schemas. Since the GME is a global service, it needs to be scalable. To address this, the GME is implemented with an architecture similar to the Domain Name System (DNS), in which there are multiple GMEs, each of which is an authority for a set of namespaces.

Mobius Mako: Distributed Data Storage and Retrieval. Mako is a distributed data storage service that provides users the ability to create databases on demand; store, retrieve, and query instance data; and organize instance data into collections. Mako exposes data resources as XML data services through a set of well-defined interfaces based on the Mako protocol. A data resource can be a relational database, an XML database, a file system, or any other data source. Specific data resource operations are exposed as XML operations. Mako thus provides a standard way of interacting with data resources, making it easy for applications to interact with heterogeneous data resources.

Clients interact with Mako over a network; the Mako architecture, illustrated in Figure 1, contains a set of listeners, each using an implementation of the Mako communication interface. When a listener receives a packet, the packet is materialized and passed to a packet router. In order to abstract the Mako infrastructure from the underlying data resource, there is an abstract handler for each packet type. Data services can easily be exposed through the Mako protocol by creating an implementation of the abstract handlers. Mako contains a handler implementation to expose XML databases that support the XML:DB API [20]. It also contains handler implementations to expose MakoDB, an XML database optimized for the Mako framework. Instance data in Mako is queried using XPath.

Figure 1: Mako Architecture
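To make the interaction model concrete, the following sketch shows how a client might publish a schema to a GME and then store and query instance data through Mako. This is a minimal illustration only; the GMEClient and MakoClient classes, their method names, and the service addresses are hypothetical stand-ins, not the actual Mobius API.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical client-side sketch: GMEClient and MakoClient (and their
// methods) are illustrative stand-ins, not the actual Mobius API.
public class MobiusSketch {
    public static void main(String[] args) throws Exception {
        // Publish an XML schema under a namespace for which this GME
        // instance is the authority (assumed API).
        GMEClient gme = new GMEClient("gme.bmi.osu.edu");
        gme.publish("osu.bmi.image", Files.readString(Path.of("image.xsd")));

        // Store an instance document, conforming to the published schema,
        // in an on-demand database managed by a Mako server (assumed API).
        MakoClient mako = new MakoClient("mako.bmi.osu.edu", 8080);
        mako.store("imageCollection", Files.readString(Path.of("img001.xml")));

        // Retrieve matching instances with an XPath query.
        List<String> hits =
            mako.xpathQuery("/Image[@studyId='S-42' and @modality='MR']");
        hits.forEach(System.out::println);
    }
}
```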

4. APPLICATIONS

We implemented three applications that involve data types commonly generated and referenced in basic and clinical research studies: image data, genomic data, and clinical data.

4.1 Image Analysis and Management

Analysis of image data is increasingly becoming a key component of biomedical research. Although the data values and types of data acquired by different imaging modalities may differ, a common characteristic is that the amounts of data captured by advanced imaging devices can surpass the storage capacity of a single machine, requiring the use of distributed storage systems. Another challenge is to provide support for on-demand image analysis and collaboration [17, 12]. To address these problems, an image management and analysis system should support: 1) management and evolution of image processing workflows, so that researchers can define, register, and version workflows; 2) efficient execution of image analysis workflows, which can consist of several stages of simple and complex operations (complex operations and parameter studies require execution in a parallel or distributed environment); and 3) on-demand creation and management of distributed image databases, so that, in addition to image data from imaging devices, datasets generated by the steps of a workflow execution can be stored and shared with collaborators.

To address these issues, we have implemented three services using the Mobius GME and Mako services and another middleware framework, DataCutter [3], a component-based framework that supports distributed execution of data-intensive applications composed of interacting components. These three services collectively provide support for metadata management, data storage management, and distributed execution.

The first service builds on GME and Mako to keep track of metadata associated with workflows, input image datasets, intermediate results checkpointed during execution of a workflow, and output image datasets. This service allows registration, management, and versioning of workflows and image information as XML schemas in GME. In our implementation, image analysis is described as a network of data processing components (an image processing workflow). The image schema may define attributes associated with an image, such as the type of the image, the study id for which the image was acquired, the date and time of image acquisition, etc. For a workflow, the schema defines the skeleton of the data processing network. Schemas defined by other researchers can be included in a schema for the workflow and image information for a study. This service also manages the metadata definitions associated with image datasets generated by an execution of a workflow. An instance of a workflow specifies the function names and locations of individual components, the number and placement of copies of each component, persistent checkpointing locations in the workflow (which tell the execution environment that output from checkpointed components should be stored as intermediate image datasets in the system), input and output datasets (conforming to a schema registered for image data), and data selection criteria (which specify the subset of images from the input datasets to be processed). The workflow instance can be stored in the system so that clients can search for it and execute it using the distributed execution environment (an illustrative instance is sketched at the end of this subsection).

The second service builds on Mako and is a data service that efficiently manages distributed collections of disks in the system for storage and staging of collections of input images, images generated by intermediate stages of the image analysis workflow, and output datasets. It enables an XML virtualization of relational databases through Mako protocols and MakoDB servers. It allows user-defined schemas to automatically manifest custom databases at runtime, and data adhering to these schemas to be stored in these databases. An instance of the image schema corresponds to an image dataset with images and associated image attributes stored across multiple storage systems running MakoDB servers. When image datasets are stored, they can be associated with a user-defined identifier that can later be used to retrieve the dataset from the data service.

The third service builds on the distributed execution middleware developed in [9]. This service supports efficient execution of an image analysis workflow as a network of image processing components. The network of components can consist of multiple stages of pipelined operations, parameter studies, and interactive visualization. The distributed execution service carries out the instantiation of components on distributed platforms, management of data flow between components, and data retrieval from and storage to distributed collections of MakoDB servers. An example instantiation of the system is shown in Figure 2. In Section 5, we present a preliminary performance evaluation of the system on a PC cluster.


Figure 2: The overall system architecture. A client can create new schemas for workflow, store workflow instances, and submit image processing requests in the system. Results of image analyses can be stored in the system.
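To give a concrete sense of what a workflow instance might contain, the sketch below encodes one as an XML document held in a Java string. The element and attribute names are invented for illustration and are not taken from an actual Mobius schema.

```java
// Illustrative workflow instance; element and attribute names are
// hypothetical, not an actual registered schema.
public class WorkflowInstanceSketch {
    static final String WORKFLOW_INSTANCE =
        "<workflow name='nuclear-segmentation' schema='osu.bmi.workflow'>"
        // Component placement: function name, host node, number of copies;
        // a checkpointed component's output is persisted as an intermediate
        // image dataset in the system.
        + "<component function='segment' host='node03' copies='4' checkpoint='true'/>"
        + "<component function='classify' host='node05' copies='2'/>"
        // Input dataset (conforming to a registered image schema) plus data
        // selection criteria restricting which images are processed.
        + "<input schema='osu.bmi.image' select=\"/Image[@studyId='S-42']\"/>"
        + "<output schema='osu.bmi.image.result'/>"
        + "</workflow>";
}
```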

4.2 Integrating Data from Internet-based Biomedical Data Repositories

Public databases of biomedical data, such as genetic polymorphisms and phenotypic variation, provide a rich set of information sources for studying disease susceptibility and processes. The ability to integrate data from online sources with local data and execute analyses on all the available data affords opportunities for richer and accelerated hypothesis generation and testing. We have developed a software pipeline to support the study of complex, multifactorial diseases such as coronary artery disease (CAD). A key goal of this application is to leverage the large comparative genotypic and phenotypic datasets available for inbred mice to identify human candidate genes that are associated with CAD susceptibility.

Our pipeline carries out the following procedures and analyses. We acquire mouse DNA polymorphism and phenotype data from multiple data sources, including NCBI's GenBank, the Mouse Phenome Database at the Jackson Laboratory, and local investigators. Next, a phylogenetic tree is reconstructed from the polymorphism data. The optimal tree(s) and the corresponding list of mutations that occurred as the strains were isolated are exported. Once branches of phenotypic interest are defined by a user, our novel tools can rapidly search very large databases of mutational change to pinpoint a subset of nucleotide positions whose mutational history is correlated with either 1) parallel changes in a phenotype of interest among distantly related organisms or 2) a closely related group of organisms that share a phenotype of interest. Both sets of information can provide candidate genes for functional studies, such as studies of the genetic bases of multifactorial disease.

Once a set of putative correlations has been established, we gather statistical and qualitative data on the quality of each correlation and its biomedical relevance. Correlation quality data include: i) evaluation of the potential ambiguity of coincident mutation due to missing observations of polymorphisms in some strains; and ii) investigation of the potential that a phenotypic trait is correlated with a DNA polymorphism by chance. We implement the concentrated changes test, which produces a p-value for the type I error on the null hypothesis that the DNA polymorphism is associated with the trait by chance [14]. In parallel, we assess the potential of a region of interest in the mouse genome to lead us to a biomedically meaningful gene or regulatory region in the human genome. To this end, we perform a similarity search (BLAST) [15] using the DNA flanking the mouse polymorphism as a query against the human genome and the reference sequence collections for the human proteome and transcriptome. This search returns an expectation value, which estimates the likelihood that the match is due to chance rather than common ancestry. All of these data, including ambiguity metrics, correlation p-values, the expectation score and position of the BLAST hit in the human genome, and the candidate gene symbol (if there is a hit), are then ingested into Mobius for filtered searching. We can also use the standard gene symbol to access pathway information using MatchMiner and GoMiner. As a whole, this work produces a prioritized list of candidate genes to rapidly validate as disease candidates in various reference tissues and in human clinical populations.

A challenging issue in this pipeline is that Internet-based biomedical data repositories and analysis tools differ in the formats in which they store and query data and in the methods by which clients interact with them. A data source might have a database interface or a web service interface, or it might be a web site that has to be wrapped. Our approach is to develop a system capable of generating materialized views from data sources and of creating virtual, ad-hoc data warehouses. The motivations for and advantages of such an approach are as follows. 1) Studies often involve repeated requests to the same dataset. By creating a view and storing a local snapshot of a data source, the query load on the data source can be reduced. 2) The collection of attributes stored in a data source and the data source interface may both change over time, which may cause a wrapper to become invalid. In such cases, it is very useful to maintain a data source view on which data analysis operations can continue to function. 3) The data source may provide highly suboptimal infrastructure for carrying out associative queries. For example, the data source may only allow downloading or uploading of files or documents, despite the fact that the documents could be structured and queried.

To provide a programmatic interface to public repositories for our application, we have implemented a catalog that includes a list of URIs for all data sources employed in the application, along with information on how the data can be extracted. The data source catalog maintains the URI of the data access method or web site wrapping service to handle cases where a non-standard access method or service is required. Currently the system contains wrappers for extracting data from GenBank, BLAST, MatchMiner, and the Jackson Laboratory. The system polls the registered data sources at regular intervals and updates local instances of databases that conform to our models. Mobius GME is used to manage the XML schemas that describe attribute metadata. The use of a global model management system simplifies the task of determining when data access methods are returning data defined by common data models or sub-models.
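As a brief aside on the expectation value mentioned above, this is standard BLAST statistics rather than anything specific to our pipeline: under the Karlin-Altschul model, the expected number of chance alignments with score at least $S$ between a query of length $m$ and a database of length $n$ is

$$E = K m n \, e^{-\lambda S},$$

where $K$ and $\lambda$ are parameters of the scoring system. A very small $E$ suggests that the match reflects common ancestry rather than chance.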
Instances of these data models are stored on Mobius Mako servers running on a distributed collection of PC clusters. Researchers are provided with a unified interface, which supports XPath queries, for interacting with the datasets.
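For illustration, a filtered search over the ingested candidate-gene records might look like the following fragment. The element and attribute names are hypothetical, and the client class is the same assumed MakoClient used in the Section 3 sketch.

```java
// Hypothetical filtered search over ingested candidate-gene records;
// attribute names are illustrative, not our actual schema.
MakoClient mako = new MakoClient("mako.bmi.osu.edu", 8080);
List<String> candidates = mako.xpathQuery(
    "/CandidateGene[@pValue < 0.01"          // concentrated changes test
    + " and @blastExpect < 0.00001"          // strong hit in human genome
    + " and @ambiguity < 0.2]");             // few missing observations
```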

4.3 Exposing Subsets of Outcome Data in Enterprise Data Warehouses

One of the key components of translational research is the integration of patient, clinical lab, and outcome data with other data types (e.g., molecular, genomic, and image data). In many medical centers, patient data, laboratory data, and outcome data are maintained in centralized, enterprise data warehouses. However, such systems are not widely used to store data generated in basic research studies; data in basic research is generally captured in individual researchers' labs. These systems are also not optimized for large image data.

In this application, the goal is to correlate findings from the analysis of digitized microscopy slides, stored in the image analysis system described in Section 4.1, with clinical outcome. A series of digital microscopy datasets with particular metadata attribute values (e.g., patient group or study id) is reviewed for suspicious regions of interest (ROIs). The researcher then runs an analysis algorithm on the ROIs to determine, for example, the nuclear count and density in each region [17] and stores the results in the system. After the analysis is done on all ROIs, the result data is joined with diagnosis data stored in an information warehouse.

We have used the Mobius Mako services to expose subsets of databases in the enterprise information warehouse system. The information warehouse in our case is an Oracle-based system that manages the data in relational databases. Thus, a series of relational tables must be exposed, containing information about patient lab samples, lab results taken over a series of visits, and diagnoses. We developed XML schemas that model the tables storing these datasets; these schemas are managed by Mobius GME. We employed a tool called XQuark Bridge (http://xquark.objectweb.org/), which exposes relational tables in a database as XML documents through XML schemas defined by application developers. We implemented handlers in Mako, by extending the abstract handler, to interface with XQuark and expose the information warehouse tables as XML data sources to the environment. In this way, it is possible to access the tables in the information warehouse using a common protocol and XPath queries, and to integrate them with image data stored in other Mako servers.
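As a sketch of how the join in this scenario might be carried out through the common protocol, a client could query the image-analysis Mako servers and the warehouse-backed Mako service separately and correlate the results on a shared identifier. The schemas, attribute names, and hosts below are hypothetical, and MakoClient is the assumed client class from earlier sketches.

```java
// Hypothetical client-side join; schemas and attributes are illustrative.
MakoClient imageMako = new MakoClient("mako-img.bmi.osu.edu", 8080);
MakoClient warehouseMako = new MakoClient("mako-iw.bmi.osu.edu", 8080);

// ROI analysis results for a study, from the image analysis system.
List<String> roiResults = imageMako.xpathQuery(
    "/AnalysisResult[@studyId='S-42' and @nuclearDensity > 0.8]");

// Diagnosis records for the same study, exposed from the Oracle
// warehouse through the XQuark-backed Mako handlers.
List<String> diagnoses = warehouseMako.xpathQuery(
    "/Diagnosis[@studyId='S-42']");

// The two result sets are then correlated on patient identifiers
// on the client side.
```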

5. AN EVALUATION OF THE IMAGE ANALYSIS SYSTEM

We have performed a preliminary performance evaluation of the support for image analysis and storage. We will perform evaluations of the other implementations in the near future. Due to space limitations, we present performance results as the number of Mako servers is varied. A cluster with 7 nodes was used for this evaluation. Each node of the cluster consists of two AMD Opteron processors with 8 GB of memory and 1.5 TB of disk storage in RAID5 SATA disk arrays. The nodes are interconnected via a Gigabit switch.

A simple pipeline consisting of a MakoReader->Processing->MakoWriter filter group was implemented. The MakoReader filter interacts with a Mako server, extracts image data, and passes it to the Processing filter; the Processing filter inverts the color of each pixel in an image and passes the output to the MakoWriter filter; and the MakoWriter filter inserts the inverted image into a Mako server. 7500 images were used as input; each image was a 256x256-pixel grayscale image.

In the first set of experiments, the number of MakoReader and Processing filters was fixed at 7, and each filter copy was executed on one of the nodes. The number of MakoWriter filters was varied from 1 to 7. A reader filter reads from a single Mako server and sends the data to the Processing filter. The number of Mako servers that can store data is equal to the number of MakoWriter filters. The images were initially distributed evenly among 7 Mako servers. In the second set of experiments, the number of Processing and MakoWriter filter copies was fixed at 7, and the number of MakoReader filters was varied from 1 to 7. In this experiment, each reader filter read from a single Mako server. The images were distributed evenly across an increasing number of Mako servers as the number of MakoReader filters increased (i.e., if there are N MakoReader filters, the images were distributed across N Mako servers). In both experiments, each writer filter wrote to Mako servers using demand-driven distribution.

The results of these experiments are shown in Figure 3. The numbers in the graphs are the total execution times for processing all images. In the figure, we see that the execution time of the image processing pipeline decreases as the number of reader and writer filters is increased. The graph shows good scalability of the system when the number of Mako servers that can store and serve data is increased.

Figure 3: Data I/O Scalability Experiments. "Reader" and "Writer" bars denote the experiments in which the number of MakoReader and MakoWriter filters is varied, respectively.
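The Processing filter's per-pixel operation is a simple grayscale inversion; a minimal sketch of that step is shown below. The surrounding filter framework (DataCutter) is not shown, and the class is illustrative rather than the actual filter code.

```java
// Illustrative per-pixel operation of the Processing filter: invert an
// 8-bit grayscale image stored as a flat byte array.
public final class InvertOp {
    public static byte[] invert(byte[] pixels) {
        byte[] out = new byte[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            int v = pixels[i] & 0xFF;   // interpret byte as unsigned 0..255
            out[i] = (byte) (255 - v);  // inverted gray value
        }
        return out;
    }
}
```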

108

http://xquark.objectweb.org/


6. CONCLUSIONS

XML has become a de facto standard for representing semi-structured datasets in heterogeneous and distributed environments. In this paper, we showed the application of an XML-based, generic metadata and data management system for distributed management and integration of biomedical data. This system provides core services and common protocols for 1) distributed but coordinated management of metadata definitions, 2) exposure of subsets of data stored in ad-hoc data warehouses and enterprise information systems, and 3) on-demand creation of databases along with management of complex data analysis workflows. These capabilities enable the integration of heterogeneous biomedical data sources into a biomedical data Grid. This type of distributed data environment, with strongly typed data, facilitates applications that can remove barriers to better synthesis and analysis of information in translational research.

7. REFERENCES

[1] Asia Pacific BioGrid. http://www.apgrid.org.
[2] W. H. Bell, D. Bosio, W. Hoschek, P. Kunszt, G. McCance, and M. Silander. Project Spitfire - Towards Grid Web Service Databases. http://www.cs.man.ac.uk/grid-db/documents.html.
[3] M. D. Beynon, T. Kurc, U. Catalyurek, C. Chang, A. Sussman, and J. Saltz. Distributed processing of very large datasets with DataCutter. Parallel Computing, 27(11):1457–1478, Oct. 2001.
[4] Biomedical Informatics Research Network (BIRN). http://www.nbirn.net.
[5] T. Brinkhoff, H.-P. Kriegel, and B. Seeger. Parallel processing of spatial joins using R-trees. In Proceedings of the 1996 International Conference on Data Engineering, pages 258–265, Feb. 1996.
[6] Data Access and Integration Services. https://forge.gridforum.org/projects/dais-wg.
[7] I. Foster, C. Kesselman, J. Nick, and S. Tuecke. Grid services for distributed system integration. IEEE Computer, 35(6):37–46, 2002.
[8] Grid Physics Network (GriPhyN). http://www.griphyn.org.
[9] S. Hastings, T. Kurc, S. Langella, U. Catalyurek, T. Pan, and J. Saltz. Image processing for the grid: A toolkit for building grid-enabled image processing applications. In CCGrid: IEEE International Symposium on Cluster Computing and the Grid. IEEE Press, May 2003.
[10] S. Hastings, S. Langella, S. Oster, and J. Saltz. Distributed data management and integration: The Mobius project. In GGF Semantic Grid Workshop 2004, pages 20–38. GGF, June 2004.
[11] E. G. Hoel and H. Samet. Data-parallel spatial join algorithms. In J. Chandra, editor, Proceedings of the 23rd International Conference on Parallel Processing. Volume 3: Algorithms and Applications, pages 227–234, Boca Raton, FL, USA, Aug. 1994. CRC Press.
[12] T. Kurc, S. Hastings, U. Catalyurek, J. Saltz, J. D. Fleig, B. D. Clymer, H. von Tengg-Kobligk, K. T. Baudendistel, R. Machiraju, and M. V. Knopp. A distributed execution environment for analysis of DCE-MR image datasets. In The Society for Computer Applications in Radiology (SCAR 2003), published as an abstract, 2003.
[13] S. Langella, S. Hastings, S. Oster, T. Kurc, U. Catalyurek, and J. Saltz. A distributed data management middleware for data-driven application systems. In Proceedings of Cluster 2004, Sept. 2004.
[14] W. Maddison. A method for testing the correlated evolution of two binary characters: Are gains or losses concentrated on certain branches of a phylogenetic tree? Evolution, 44:539–577, 1990.
[15] S. McGinnis and T. Madden. BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res., 32(Web Server issue):W20–W25, 2004.
[16] MEDIGRID. http://creatis-www.insa-lyon.fr/MEDIGRID/home.html.
[17] T. Pan, K. Mosaliganti, R. Machiraju, D. Cowden, and J. Saltz. Large scale image analysis and nuclear segmentation for digital microscopy images. In The Ninth Annual Conference on Advancing Practice, Instruction and Innovation through Informatics (APIII 2004), 2004. Accepted for ePoster presentation.
[18] V. Raman, I. Narang, C. Crone, L. Haas, S. Malaika, T. Mukai, D. Wolfson, and C. Baru. Data access and management services on grid. http://www.cs.man.ac.uk/grid-db/documents.html.
[19] Shared Pathology Informatics Network (SPIN). http://www.sharedpath.org.
[20] XML:DB Initiative for XML Databases. http://www.xmldb.org/.



