A Web Portal that Enables Collaborative Use of Advanced Medical Image Processing and Informatics Tools through the Biomedical Informatics Research Network (BIRN)
Shawn N. Murphy MD Ph.D.1, Michael E. Mendis1, Jeffrey S. Grethe Ph.D.3, Randy L. Gollub MD Ph.D.2, David Kennedy Ph.D.2, and Bruce R. Rosen MD Ph.D.2
1 Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA
2 Athinoula A. Martinos Center for Biomedical Imaging, MGH, Boston, MA
3 University of California-San Diego, San Diego, CA

Launched in 2001, the Biomedical Informatics Research Network (BIRN; http://www.nbirn.net) is an NIH-NCRR initiative that enables researchers to collaborate in an environment for biomedical research and clinical information management, focused particularly upon medical imaging. Although it supports a vast array of programs to transform and calculate upon medical images, three fundamental problems emerged that inhibited collaborations. First, the complexity of the programs, and at times legal restrictions, combined to keep these programs from being accessible to all members of the teams, and indeed to the general researcher, although such access was a fundamental mission of the BIRN. Second, the calculations that needed to be performed were very complex and required many steps that often needed to be performed by different groups. Third, many of the analysis programs were not interoperable. Together, these problems created tremendous logistical difficulties. The solution was to create a portal-based workflow application that allowed the complex, collaborative tasks to take place and enabled new kinds of calculations that had not previously been practical.

INTRODUCTION

The ability to send data through a succession of software programs is critical for the successful analysis of complex images. Over the years, various groups have developed "data pipelines", many of which are simple scripts, but some of which are entire applications built to handle these processes.
Although the pipelines are effective in their various local environments, they tend to fail under circumstances where a high degree of collaboration is required in a calculation. Being local, they also are not easily transferable to other institutions where calculations are being tested for reproducibility or extended for further experimentation. They are not very effective in keeping the data organized by research subject and content. Finally, the pipelines are not available to the clinical researcher as the domain space in which collaboration is taking place expands to genomics and epidemiology.

Nonetheless, the current state of the art for image processing exists in these data pipeline applications. Perhaps the most sophisticated is the LONI pipeline1 from the Laboratory of Neuro Imaging at the University of California at Los Angeles. Others in use include the Kepler pipeline2 from the University of California at Berkeley and at San Diego, as well as the jBPM workflow engine from JBoss (http://jboss.com). A system was envisioned that could consume the existing pipeline applications and achieve the following goals: 1) allow software produced by the BIRN to be made available to people inside and outside of the BIRN group; 2) allow a consistent computing platform of BIRN software to be maintained, with special attention to metadata and data provenance; and 3) allow study metadata to be tightly organized across groups to allow for collaboration and comparison of results.

METHODS

The potential for a portal-based solution was appreciated by the Morphometry BIRN test bed. In a portal-based solution, a web site accepts uploaded images and hosts the computing machinery (both hardware and software) so that image processing can be initiated via the web site. The resulting transformed images are then returned to the users, along with the summarized numerical results of any calculations upon the images, such as the volume of segmented structures. To understand what would be required from the portal-based solution, we surveyed the needs across all of the Morphometry BIRN partner sites, which included two groups at Harvard University and one each at Johns Hopkins University, Washington University, the University of California at San Diego, the University of California at Irvine, and the University of California at Los Angeles. The requirements
AMIA 2006 Symposium Proceedings Page - 579
that emerged were as follows: 1) The system must be able to incorporate the pipeline tools that are currently available for image processing, including LONI, Kepler, and jBPM. 2) The system must allow human review of the intermediate results of calculations. The most common use case supporting this requirement is the review of images to ensure that calculations have not resulted in a gross error by converging to irrelevant local minima. 3) The system must allow human handoffs. Several projects exist within the BIRN where various groups participate in various portions of a calculation; therefore, a process where one group automatically indicates which calculations are ready for the next group is necessary. 4) The system must allow data provenance to be managed so that calculations can be reproduced accurately. 5) The system must be available both for direct human interaction through a set of web pages and to software processes through a set of services, such that other computerized systems can call and interact with the system directly. 6) The system must provide a clear plan for how to represent the results of calculations and must provide access to those results by direct viewing or through software processes. 7) The system must contain the security, scalability, and reliability expected of a multi-user production system.

An example process is the Semi-Automated Shape Analysis pipeline (SASHA), shown below. First, 3D structural MRI data of the brain with good gray-white matter contrast-to-noise ratio is acquired at a participating site. In order to be shared, the image data has to be de-identified within the site's firewall: patient information is removed from the image headers, and face information is stripped from the images while leaving the brain intact. The de-identified data then needs to be uploaded to a common site where it can be accessed by the other participating sites. Second, the de-identified structural brain MRI data is automatically segmented using MGH's Freesurfer morphometry tools. The derived segmented data (e.g., the hippocampal surfaces) is consumed by the JHU site and used for shape analysis with their Large Deformation Diffeomorphic Metric Mapping (LDDMM) tool3. The combined morphometric results (surfaces, volumes, labels, deformation fields) can be viewed from the database using 3D Slicer as the common visualization platform.
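Requirements 2 and 3 above hinge on a workflow that can pause until a person reviews an intermediate result and signals the handoff to the next group. A minimal sketch of that wait-state pattern in plain Java is shown below; the step names are illustrative only, and in the actual portal this behavior is delegated to the jBPM engine rather than hand-written:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a workflow with a human-review breakpoint.
public class WaitStateWorkflow {
    public enum State { RUNNING, WAITING_FOR_REVIEW, DONE }

    private State state = State.RUNNING;
    private final List<String> log = new ArrayList<>();

    /** Run automated steps up to the human-review breakpoint. */
    public void runToBreakpoint() {
        log.add("segment images");          // automated step (e.g., Freesurfer)
        state = State.WAITING_FOR_REVIEW;   // breakpoint: wait for a human
    }

    /** A reviewer confirms the intermediate result, handing off to the next group. */
    public void signalReviewComplete(String reviewer) {
        if (state != State.WAITING_FOR_REVIEW) {
            throw new IllegalStateException("no review pending");
        }
        log.add("reviewed by " + reviewer);
        log.add("shape analysis");          // next group's step (e.g., LDDMM)
        state = State.DONE;
    }

    public State getState() { return state; }
    public List<String> getLog() { return log; }
}
```

The key design point is that the engine, not the calling code, owns the state: the workflow sits in WAITING_FOR_REVIEW indefinitely until some client (a portal form or a web service call) delivers the signal.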
Figure: The SASHA pipeline. T1 structural MRI data for Alzheimer's patients and age-matched controls is contributed by the Wash. Univ. data donor site and de-identified (e.g., an image header reading "Subject: Juan Perez, Patient ID: 911" becomes "Subject: anon, BIRN ID: 9284ka9e23sd…"). MGH Freesurfer (http://surfer.nmr.mgh.harvard.edu/) produces cortical and subcortical segmentations; JHU Large Deformation Diffeomorphic Metric Mapping (http://www.cis.jhu.edu/software/ldmm/index.html) performs shape analysis of the segmented structures; and BWH 3D Slicer (http://www.slicer.org/) visualizes the segmentation and shape analysis results.
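The de-identification step pictured above can be sketched as a small header transform: identifying fields are blanked and the local patient ID is replaced by an opaque BIRN identifier. The field names and the hash-derived ID below are hypothetical illustrations, not the actual BIRN de-identification scheme:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative header de-identification (field names are hypothetical).
public class DeIdentifier {
    /** Returns a copy of the header with identifying fields anonymized. */
    public static Map<String, String> deidentify(Map<String, String> header) {
        Map<String, String> out = new LinkedHashMap<>(header);
        out.put("Subject", "anon");                       // blank the name
        String id = header.get("PatientID");
        out.remove("PatientID");                          // drop the local ID
        out.put("BIRNID", opaqueId(id));                  // substitute an opaque ID
        return out;
    }

    // Sketch only: derives a one-way token from the local patient ID.
    // A production system would use a managed identifier registry instead.
    private static String opaqueId(String patientId) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(patientId.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 8; i++) sb.append(String.format("%02x", hash[i]));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Note that this covers only the header; stripping face information from the image voxels themselves, as described above, is a separate image-processing step.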
Current pipeline tools are used to work with this data at the various local sites. It was imperative that the portal not require the functionality of these tools to be reinvented, because this would not represent an efficient use of BIRN resources. For example, the MGH Freesurfer calculation consists of over 40 steps, and we did not wish to redo that workflow in a new portal-based tool. Therefore, the web portal needed to assimilate the following pipeline applications:

1) Kepler2 (http://www.kepler-project.org). Kepler is a visual modeling tool written in Java. It was begun in 1997 at UC Berkeley. Several recent efforts have extended the Ptolemy-II platform (http://ptolemy.eecs.berkeley.edu/) to allow for the drag-and-drop creation of scientific workflows from libraries of actors. A Ptolemy actor is often a wrapper around a call to a web service or grid service. Ptolemy leverages an XML meta-language called the Modeling Markup Language (MoML) to produce a workflow document describing the relationships of the entities, properties, and ports in a workflow. The process of creating a workflow with the Ptolemy software is centered on creating Java classes that extend a built-in Actor class.

2) LONI pipeline1 (http://www.loni.ucla.edu/twiki/bin/view/Pipeline). The LONI Pipeline is a visual environment for constructing complex scientific analyses of data. It is written in Java and utilizes an OWL-based XML representation of the workflow. The environment also takes advantage of supercomputing environments by automatically parallelizing data-independent programs in a given analysis whenever possible.

3) jBPM (http://www.jboss.com/products/jbpm). The primary focus of JBoss jBPM development has been the BPM (business process management) core engine. Besides further development of the engine, the JBoss roadmap for jBPM focuses on three areas: a) native BPEL support, b) a visual designer to model workflows, and c) enhanced process management capabilities.
jBPM can stand alone in a Java VM, inside any Java application, inside any J2EE application server, or as part of an enterprise service bus. The ability to use web services provides a way to perform distributed computing and, more broadly, a way to allow rapid deployment of new computation algorithms. This is achieved by enabling the ownership and maintenance of a web service by those who are
actually developing a specific computational algorithm at their local site.

RESULTS

The goal of the Portal was not to produce new software, but rather to link together and support existing BIRN software such that it could be more effectively utilized by various groups of collaborating users. To achieve this goal, we architected the system as shown in the diagram below. Because the BIRN is dedicated to open-source solutions, all parts of the infrastructure are available to the public for free as open-source projects, including the Kepler pipeline engine. In the diagram, the parts built by the authors of this paper are shown in dark gray, while pre-existing software that was integrated into the solution is shown in light gray.

The system relies on uploads and downloads of images and other accompanying data to and from an open-source file management system named the Storage Resource Broker (SRB, available at http://www.sdsc.edu/srb/). The SRB provides a way to access data sets and resources based on their attributes and/or logical names rather than their names or physical locations, and it allows file security to be managed on a network shared resource in conjunction with the Grid Account Management Architecture (GAMA, available at http://grid-devel.sdsc.edu/gama). The GAMA system is used for authorization and authentication and consists of two components: a back-end security service that provides secure management of credentials, and a front-end set of portlets and clients that provide tight integration into web/grid portals4.

The main system software is divided between a Web Server and an Execution Server to comply with the general architecture of the BIRN portal. The Execution Server has access to a Condor grid (http://www.cs.wisc.edu/condor/). We chose jBPM as the principal engine for scheduling and executing other applications because it is a reliable, open-source workflow engine that is particularly geared towards making human handoffs in a workflow. Its "out of the box" functionality includes a set of services that allows breakpoints to be defined in a workflow, where the workflow enters a "wait" state until human intervention occurs. This gives the chance for handoffs between groups to occur and for intermediate calculations to be checked. Additional required software
includes the open-source web portal framework GridSphere (http://www.gridsphere.org/gridsphere/gridsphere) and the open-source Apache Tomcat project (http://www.apache.org/). We combined the above pieces with a custom-designed workflow portlet that drives web access to the infrastructure, a J2EE-based (http://java.sun.com/javaee/index.jsp) interface to some of the deeply embedded functionality of jBPM, custom interfaces to the Kepler and LONI pipeline engines, and a versatile database that tracks workflows and stores results. All of the BIRN custom-designed pieces of the above workflow solution will be made available through the BIRN website in the spring of 2007.

Controlling the versions of software used to perform calculations is important to guaranteeing reproducible image processing. An important design principle of the custom workflow portlet architecture is that it defines calculation "zones" that use consistent versions of the Java Virtual Machine, the pipeline engine (jBPM, Kepler, or LONI), and all the associated programs that will be used in the calculation. To this end, users may not upload new programs and must restrict themselves to software available in a predefined calculation zone. These zones are defined by BIRN administrators.

Figure: Starting a workflow through the BIRN Portal. The Request, Confirm Request, and Check on Request forms of the workflow (WF) portlet, hosted in GridSphere on a Tomcat server, drive the WF driver (client and server) and a jBPM-overseen workflow wrapping Kepler and LONI child workflows. Image data is uploaded to and downloaded from the SRB file repository by an SRB client or the BIRN Portal, results are recorded in the custom database (DB), and the GAMA authentication server controls access.

The functionality works as follows; the creation of a request to run a workflow is illustrated in the figure above. Workflows are stored as objects, and an instance is created when requested from the "Request" form of the User Interface (UI). The Request form is used to start the Kepler workflow, which in the figure is a LONI predefined workflow being overseen by jBPM (all child workflow applications are ultimately overseen by the jBPM workflow engine). Data that had been uploaded to the SRB is retrieved. As the workflow starts, runs, and finishes, updates are made to the custom database (DB), from which they are displayed in the "Confirm Request" and "Check on Request" forms of the UI. Upon finishing, the "Check on Request" form is used to show confirmation of the run, and the resulting image data is then downloaded from the SRB. Numerical results are downloaded from the custom DB. The "Check on Request" UI allows the intermediate states of the workflow to be checked and acted upon. The UIs in the diagram are also designed to be available as web service calls so that other client applications can communicate with the user on the state of the workflows.

The custom DB stores Entity-Attribute-Value combinations5 in a star schema6. The database schema does not change as new data sources are added: new data results in additional rows in the fact and dimension tables, but new columns and tables do not need to be added for each new data source. This is very useful in large projects such as the BIRN, where many tools depend upon a specific database schema. A strategy where the database grows by adding rows for new data rather than new tables and columns allows tools developed to work with one kind of data to also work with a new type of data. Attribute definitions are managed through a concept dimension table that ensures the integrity of the ontology and provides easily managed ad-hoc query capabilities. This strategy is also used for maintaining the data provenance.
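The Entity-Attribute-Value strategy described above — new data adds rows, not columns, with attribute meanings held in a concept dimension — can be sketched in a few lines of Java. This is an in-memory stand-in with invented concept codes, not the actual Custom DB schema:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory sketch of an EAV fact table plus a concept dimension.
public class EavStore {
    // One fact row: (entity, attribute concept, value).
    private static class Fact {
        final String entity, concept;
        final double value;
        Fact(String entity, String concept, double value) {
            this.entity = entity; this.concept = concept; this.value = value;
        }
    }

    private final List<Fact> factTable = new ArrayList<>();
    private final Map<String, String> conceptDimension = new HashMap<>();

    /** The concept dimension defines what each attribute code means. */
    public void defineConcept(String code, String description) {
        conceptDimension.put(code, description);
    }

    /** A new data source just adds rows; the schema never changes. */
    public void addFact(String entity, String concept, double value) {
        if (!conceptDimension.containsKey(concept))
            throw new IllegalArgumentException("unknown concept: " + concept);
        factTable.add(new Fact(entity, concept, value));
    }

    /** Ad-hoc query: all values of one concept across entities. */
    public Map<String, Double> query(String concept) {
        Map<String, Double> results = new HashMap<>();
        for (Fact f : factTable)
            if (f.concept.equals(concept)) results.put(f.entity, f.value);
        return results;
    }
}
```

Because every result, whatever its source, lands in the same fact table, a query tool written against this schema works unchanged when a new kind of measurement is added.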
DISCUSSION

The development of this portal-based workflow application framework allows BIRN applications that previously were not generally available to become accessible to the general clinical researcher. This expands the impact that the BIRN can make on clinical research and allows efficient sharing of available hardware resources. Additionally, the workflow application allows for a stabilization of the BIRN calculation process, the enablement of more efficient collaborations, and an informatics-oriented revision of the BIRN platform such that ontology systems are effectively utilized in the storage and retrieval of results. Finally, an emergent property of the system is that both the raw and derived medical image data are stored in a format that is compatible with advanced medical informatics systems of analysis.

Besides collaboration, a well-functioning portal enables not only the initial calculation of experimental results, but also their recalculation for verification and the exploration of their parameter space. The amount of change in a result per change in an initial parameter may be graphed over the parameter space, and such graphs help to show where care must be taken with the initial estimates of the parameters.

Disadvantages of the portal exist, some with potential solutions, others inherent in the architecture. Because of the careful ontology mapping and data provenance tracking requirements, more time must be spent setting up a calculation. This discourages quick, ad-hoc calculations from being performed. If one is in the initial stages of using a new application to perform calculations, the portal will be cumbersome. The de-identification of data prior to its use in calculations is also cumbersome in the initial phases of a project. We are currently optimizing this process, and it appears a software solution should help alleviate this problem.
Finally, the architecture requires that hardware be available to perform the portal calculations. Grid-enabling the architecture is part of the solution, and may allow very effective distribution of the calculations over available national resources.

Setting up the BIRN analysis portal allows general use of BIRN resources and enables effective collaborations between sites. It allows greater exploration of recalculated experiments and the ability to routinely explore complex
parameter spaces. The BIRN analysis portal is built as a completely open-source solution and is based upon existing workflow expression standards and architecture. The requirements of the BIRN analysis portal are common to those of other large projects, offering the opportunity for code and design reuse.

This work was supported by the Morphometry BIRN (U24-RR021382) and the BIRN Coordinating Center (U24-RR019701) (Biomedical Informatics Research Network, http://www.nbirn.net), a National Center for Research Resources Project, U.S.A.

References
(1) Rex DE, Ma JQ, Toga AW. The LONI Pipeline Processing Environment. NeuroImage. 2003;19:1033-1048.
(2) Ludäscher B, Altintas I, Berkley C, Higgins D, Jaeger-Frank E, Jones M, Lee E, Tao J, Zhao Y. Scientific Workflow Management and the Kepler System. Concurrency and Computation: Practice & Experience. Published online: 13 Dec 2005.
(3) Beg MF, Buckner R, Fischl B, Park Y, Ceyhan E, Priebe C, Ceritoglu C, Kolasny A, Brown T, Quinn B, Yu P, Gold B, Ratnanather JT, Miller M, BIRN Brain Morphometry. Pattern classification of hippocampal shape analysis in a study of Alzheimer's Disease. In: Human Brain Mapping Conference; 2005.
(4) Bhatia K, Mueller K, Chandra S. GAMA: Grid Account Management Architecture. IEEE International Conference on E-Science and Grid Computing; Dec 2005.
(5) Kimball R. The Data Warehousing Toolkit. New York: John Wiley; 1997.
(6) Nadkarni PM, Brandt C. Data Extraction and Ad Hoc Query of an Entity-Attribute-Value Database. J Am Med Inform Assoc. 1998;5:511-517.
(7) Murphy SN, Gainer VS, Chueh H. A Visual Interface Designed for Novice Users to find Research Patient Cohorts in a Large Biomedical Database. AMIA Fall Symp. 2003:489-493.