Architecture of a Federated Query Engine for Heterogeneous Resources

Richard L. Bradshaw, MS1, Susan Matney, MSN, RNC1, Oren E. Livne, PhD1, Bruce E. Bray, MD1,2, Joyce A. Mitchell, PhD1,2, Scott P. Narus, PhD1,2
1Health Sciences Center, 2Department of Biomedical Informatics, University of Utah, Salt Lake City, UT

Abstract

The Federated Utah Research and Translational Health e-Repository (FURTHeR) is a Utah statewide informatics platform for the new Center for Clinical and Translational Science at the University of Utah. We have been working on one of FURTHeR's key components, a federated query engine for heterogeneous resources, which we believe has the potential to meet some of the fundamental needs of translational science: to access and integrate diverse biomedical data and promote discovery of new knowledge. The architecture of this federated query engine is described and demonstrated.

Introduction

FURTHeR is a Utah statewide informatics platform housed in the new Center for Clinical and Translational Science at the University of Utah [1]. The objective of FURTHeR is to deliver innovative and practical software tools and services that directly support data and knowledge access, integration, and discovery. Our aim is to develop a platform that federates Utah's largest patient data warehouses (University of Utah Healthcare, Intermountain Healthcare, and the Salt Lake City Veterans Administration Medical Center), public health data from the Utah Department of Health, and data from the Utah Population Database, an extensive genealogic and demographic resource [2]. One of the key components of a federated system is its Federated Query Engine (FQE). The FQE is responsible for constructing data queries, distributing them to the appropriate data sources, translating the received results into a common structure, intersecting the returned results, and aggregating the final data.
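The FQE responsibilities listed above can be sketched as a minimal pipeline. The sketch below is illustrative only (FURTHeR's actual implementation is Java EE based, and the function and dictionary names here are hypothetical): each source translates the shared query into its own form, executes it, normalizes the rows, and the engine intersects and aggregates the normalized results.

```python
# Minimal sketch of a federated query engine pipeline (illustrative names,
# not FURTHeR's actual API): translate a shared query per source, execute,
# normalize heterogeneous results, then intersect and aggregate.

def federated_query(logical_query, sources):
    """Run one logical query against every source and merge the results."""
    result_sets = []
    for source in sources:
        physical_query = source["to_physical"](logical_query)  # per-source translation
        raw_rows = source["execute"](physical_query)           # query the source
        result_sets.append({source["to_logical"](row) for row in raw_rows})
    # Intersect on the common (logical) representation, then aggregate.
    merged = set.intersection(*result_sets) if result_sets else set()
    return sorted(merged)

# Toy sources: each returns patient ids matching the query in its own format.
src_a = {
    "to_physical": lambda q: f"SELECT pid FROM labs WHERE test='{q}'",
    "execute": lambda pq: ["p1", "p2", "p3"],
    "to_logical": lambda row: row,
}
src_b = {
    "to_physical": lambda q: f"q={q}",
    "execute": lambda pq: ["P2", "P3", "P4"],
    "to_logical": lambda row: row.lower(),   # normalize heterogeneous ids
}
print(federated_query("hba1c", [src_a, src_b]))  # ['p2', 'p3']
```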
The purpose of this paper is to describe a proposed architecture for the FURTHeR FQE (FFQE). The preliminary design is illustrated by stepping through a simple example.

Background

There are at least three existing platforms in the biomedical informatics space that support federated queries: NCI's cancer Biomedical Informatics Grid (caBIG), Informatics for Integrating Biology and the Bedside (i2b2) [3,4], and the Biomedical Informatics Research Network (BIRN) [5]. Both caBIG and i2b2 can be configured to support on-the-fly federated queries over homogeneous data sources. Neither federates data on-the-fly from heterogeneous data sources (data sources with different data structures, concepts, and concept granularity). caBIG provides tools that can be used to adapt an organization's data source to one of caBIG's registered data models. i2b2's current data access services require that the underlying data repositories follow a predefined i2b2 star schema. BIRN does provide the ability to perform a federated search across heterogeneous databases, making larger and broader populations of data accessible, but individual patient records are not linked and joined together. For example, if one data source has genetic data and another has lab data, the data cannot be intersected and correlated to identify which patients had a genetic manifestation X and also had a particular lab result Y.

Another common approach to facilitating research across disparate data sources is data warehousing, in which data are extracted, transformed, and loaded from distinct heterogeneous systems into a centralized, homogeneous database. All the aforementioned Utah organizations follow this traditional data warehousing approach to facilitate research. Clinical research against statewide resources could be approached the same way, by transferring and transforming data from each organization's data warehouse into a statewide data warehouse.

Other technologies that are similar to, but outside the scope of, the FFQE are the Deep Web and the Semantic Web [6,7]. Both aim to provide database-like queries that span data sources across the Web by using new methods to discover and index Web content.
The FURTHeR approach differs from these methods in that it executes simultaneous queries against federated, heterogeneous data resources and provides a framework that enables secure patient record linking across data sources. This is accomplished by building a query on-the-fly for each data resource and transforming each resource's query result set into a homogeneous virtual data repository, where data are managed temporarily. The virtual data repository holds data only for the time needed to perform analysis; once an analytical "session" is completed, the data are purged. Partner data sharing agreements dictate how long an analytical session is maintained and which data are viewable. The underlying component of the FURTHeR architecture that makes these operations possible is the FFQE.

AMIA 2009 Symposium Proceedings Page - 70

FFQE Architecture

The FFQE design was influenced by several requirements. First, we needed to federate data from disparate systems, not only within an institution but also across institutions, into a virtual repository. Second, each data source needed to remain in its original location and format, for security, intellectual property, and data management purposes. Third, we wanted to establish semantic interoperability by using metadata to describe each physical source data model and how it maps to a "universal" model. Fourth, building on a substantial patient-identifier linking effort already performed by one of our partners, we wanted the ability to link patient data between the data sources. Finally, we needed to provide users with a consistent view of the available statewide data. Based on these requirements, we identified the following set of architectural components for the FFQE:

Metadata – data that describe a data model in terms of classes, class attributes, class attribute values, and the associations between them.

Logical Model – a standards-based data model designed to support interoperability; this is the data model of the virtual data repository.

Physical Data Source – typically an existing database or data warehouse.

Physical Model – a data model of a Physical Data Source.

Logical Query – a query composed using Logical Model constructs.
Physical Query – a query composed using Physical Model constructs; typically an SQL statement with tables, columns, and column values drawn from the Physical Model.

Logical to Physical Query Transformation – the metadata artifacts that contain the information required by the query transformation service to transform a Logical Query into a Physical Query: the associations between Logical Query class attributes and Physical Model class attributes, and the class attribute value associations between the attributes of the Logical Query and the Physical Model.

Physical Result Set – the result of a Physical Query executed against a Physical Data Source, structured after the Physical Model.

Logical Result Set – a result set formatted according to the Logical Model; a Physical Result Set is transformed into a Logical Result Set in preparation for loading into the Virtual Data Repository.

Physical to Logical Model Transformation – a metadata artifact that contains the information required by a transformation service to transform the Physical Model to the Logical Model.

Virtual Data Repository – a temporary data repository where Logical Result Sets are pooled, intersected, and analyzed by Logical Queries.

Terminology Management System – a set of tools and services to access and manage terminology content, including authoring tools, a content loader, a content exporter, and business services that allow programmatic access to the system. Its responsibility is to provide concept-to-concept translations during Logical to Physical Query Transformations and Physical Result Set to Logical Result Set translations.

Metadata Management System – a similar set of tools and services used to access and manage metadata content, with its own authoring tools, content loader, content exporter, and business services. The Terminology Management System is considered a component of the Metadata Management System because it contains structural metadata that other metadata resources use. Specifically, class attribute value set concepts are managed in the Terminology Management System, while classes, attributes, and their associations to each other are managed in the Metadata Management System. The FFQE accesses data model metadata from the Metadata Management System.

FFQE Process Flow Example

Figure 1 depicts the process flow through the FFQE components during the execution of a federated query. To begin, a researcher builds a query based on the Logical Model, using terminology from the Terminology Management System. The user query forms the Logical Query (step 1 in Figure 1).
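The metadata that drives a Logical to Physical Query Transformation — class attribute associations plus class attribute value associations — can be pictured as two lookup tables. The sketch below is a hedged illustration, not FURTHeR's metadata schema; the attribute names, concept codes, and column names are all hypothetical.

```python
# Hedged sketch of a metadata-driven Logical-to-Physical Query Transformation.
# The mapping tables and all names here are hypothetical examples, not the
# actual FURTHeR metadata content.

attribute_map = {                # Logical class attribute -> physical column
    "Person.gender": "PATIENT.SEX_CD",
    "Observation.code": "LAB.TEST_CODE",
}
value_map = {                    # Logical concept -> source-local concept
    ("Person.gender", "female"): "F",
    ("Observation.code", "LOINC:4548-4"): "HBA1C",
}

def to_physical_sql(criteria):
    """Translate {logical attribute: logical value} criteria into SQL WHERE terms."""
    terms = []
    for attr, value in sorted(criteria.items()):
        column = attribute_map[attr]                 # class attribute association
        local = value_map.get((attr, value), value)  # concept-to-concept translation
        terms.append(f"{column} = '{local}'")
    return " AND ".join(terms)

print(to_physical_sql({"Person.gender": "female",
                       "Observation.code": "LOINC:4548-4"}))
# LAB.TEST_CODE = 'HBA1C' AND PATIENT.SEX_CD = 'F'
```

In the real system the concept translations come from the Terminology Management System and the attribute associations from the Metadata Management System, rather than from in-process dictionaries.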


Figure 1 – FURTHeR federated query engine steps to complete a federated query.

The Logical Query is then translated into one or more Physical Queries using the corresponding Logical to Physical Query Transforms (steps 2 and 3). The Physical Queries are dispatched in parallel to their respective Physical Data Sources (step 4). Each Physical Data Source executes its Physical Query and generates a Physical Result Set (step 5), which is returned and transformed into a Logical Result Set (step 6) using a Physical to Logical Model Transform. Concept-to-concept translations are performed and patient identifiers are reconciled during the Physical to Logical Model Transform. Each Logical Result Set becomes part of the Virtual Data Repository when its results are "inserted" into the in-memory data structure (step 7). The final query results are aggregated by computing the intersection and union of the Logical Result Sets (step 8). The results of the federated Logical Query are filtered based on access permissions and returned to the user (step 9).

Implementation

The FFQE components have so far been purposefully described without implementation details, to keep the design separate from the implementation. The current implementation of the FFQE is based on a service-oriented architecture (SOA) built with open-source Java Enterprise Edition (Java EE) technologies. The component implementations are described below according to the steps and objects in Figure 1.

Step 2a is implemented with an object transformation service, developed in-house, that relies on metadata provided by the Metadata Management System (step 2b). The Metadata Management System is also developed in-house and is based on Java EE, SOA, and Oracle RDBMS and XML DB (www.oracle.com). The terminology engine behind the Metadata Management System is Apelon DTS [8]. Step 2 is also responsible for sending the Physical Queries in parallel to the Physical Data Sources for simultaneous query processing; JMS (Java Message Service) is the enabling technology for this parallelism. Steps 3, 4, and 5 are implemented with data services based on REST. Data are queried using the Physical Query, and the Physical Result Set is returned as XML.

The Physical to Logical Model Transform that converts a Physical Result Set into a Logical Result Set is an XQuery, an artifact retrieved from the Metadata Management System using the metadata services (step 6b). The Physical to Logical Model Transform XQuery generates an XML Logical Result Set (step 7). The XML Logical Result Set is un-marshaled, or converted, into Java objects, and Hibernate, an object-relational mapping framework, inserts the Java object representation of the Logical Result Set into the Virtual Data Repository, an Oracle TimesTen In-Memory Database (step 8) [9]. After all the Logical Result Sets have been stored in the Virtual Data Repository, the Logical Query criteria are transformed into the Hibernate Query Language (HQL) and the HQL expression is executed against the Virtual Data Repository. The final federated query results of this operation (step 9) are Java objects that are directly consumable by Java clients; they can also be accessed via web services.

Discussion

The architecture of the FFQE allows us to transform heterogeneous data into a common model. If we choose to transform to a different model, we create: 1) a Physical to Logical Model Transform artifact for each Physical Data Source, to transform its Physical Model data to the new model; 2) an object-relational mapping to the new model, used to insert the Logical Result Sets into the Virtual Data Repository; and 3) terminology mappings to the new model (which may be optional). The Physical Data Source data services do not need to be modified. For example, if we decided to use i2b2 as our Logical Model and repository, we could use the FFQE to populate the i2b2 database by creating the following:

• An XQuery to transform each Physical Result Set to the i2b2 database model (replacing the Physical to Logical Model Transform in step 6).

• The object-relational mappings to the i2b2 database model, used to insert the Logical Result Set into the Virtual Data Repository (between steps 7 and 8).

• i2b2 terminology updated and synchronized with FURTHeR terminology.

After these are created, the FFQE will populate the i2b2 database with the results of a user-specified Logical Query. When the FFQE finishes loading the database, the i2b2 tools can be used to navigate and analyze the results. The up-front work could take a few weeks, but once it is done, building i2b2 data marts from a Logical Query is straightforward. We believe this ability will significantly facilitate translational research: it enables researchers to collect cohort data much more quickly from multiple resources, potentially eliminating the time-consuming and expensive data collection work that has been required in the past.

Another feature of the FFQE architecture that adds utility is its configurable Virtual Data Repository implementation. The Virtual Data Repository sits behind a standard database interface, so changing the implementation from an in-memory database to a standard database requires a simple configuration change. In the i2b2 example, the FFQE could use either an in-memory database or a physical database. Users could use the in-memory database to build and tune the results of their queries, and then store the in-memory data structure to a physical database by changing a few FFQE settings.

The FFQE can thus be thought of as a data mart builder. Data sources are added by adding the required metadata to the Metadata Management System. A Logical Query submitted to the FFQE finds the data of interest in the corresponding data sources and builds the data mart it is configured to build. The data mart is then accessible, like any other database, to commercial business intelligence tools.

Early tests of the FFQE indicate potentially significant time savings over the work typically required to build a research-enabling data mart. It is not uncommon today for a data mart that requires data from multiple resources to take months to build. Using a preconfigured FFQE, the time spent is potentially minutes to hours, depending on the sophistication of the Logical Query and how long each data source takes to return its results. We have seen 100,000 patient demographic records transformed in 95 seconds on modest desktop-class hardware: the middle-tier business logic ran on a MacBook Pro with 4 GB RAM and a 2.5 GHz Intel Core 2 Duo, while the database ran on an enterprise-grade server with Oracle 11g database software. Knowing the limitations of the Physical Data Sources and their query response times, as well as all the required transformations and translations, we are not expecting blink-speed response times. We realize and acknowledge that the performance of federated queries that return thousands to millions of rows across multiple data sources will be an issue that we continue to address.
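The configurable Virtual Data Repository described above can be sketched with an embedded database whose target is a single configuration value. This is only an illustration: sqlite3 stands in for the TimesTen/Oracle back ends, and the setting name is hypothetical.

```python
# Sketch of a configurable Virtual Data Repository: the same code path targets
# an in-memory or an on-disk database via one configuration value. sqlite3 is
# a stand-in for TimesTen/Oracle, and "vdr.url" is a hypothetical setting name.
import sqlite3

def open_virtual_repository(config):
    # ":memory:" = transient analysis session; a file path = persisted data mart
    return sqlite3.connect(config["vdr.url"])

con = open_virtual_repository({"vdr.url": ":memory:"})
con.execute("CREATE TABLE person (id TEXT PRIMARY KEY, gender TEXT)")
con.executemany("INSERT INTO person VALUES (?, ?)",
                [("u1", "F"), ("u2", "M")])
count = con.execute("SELECT COUNT(*) FROM person").fetchone()[0]
print(count)  # 2
```

Pointing `vdr.url` at a file path instead of `:memory:` is the whole "configuration change": the loading and querying code is untouched.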
A common performance-related concern raised by peers is the use of an in-memory database for the Virtual Data Repository: the Physical Data Sources we are federating are terabyte-scale warehouses, and it is conceivable that an in-memory database could run out of memory. In practice, the current in-memory database is not memory bound; Oracle's implementation has a mechanism to overflow into a disk-based database, treating the in-memory structure as a cache.

Limitations and Challenges

Linking patients between systems is an ongoing challenge for federated queries that intersect data across sources. No statewide solution currently exists, but significant work has been performed with this goal in mind. Two approaches are currently being evaluated. The first is to manually update and maintain the existing patient ID crosswalk for all the data sharing organizations; the second is to auto-generate a reproducible encrypted patient identifier, so that each institution can supply the same unique identifier for a patient. We will more likely use the first approach initially, because of existing work and experience (it occurs in step 8), but the second is more desirable if it proves feasible.
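One way the second approach could work is a keyed hash over normalized patient traits, so every institution derives the same opaque identifier without exchanging raw identifiers. This sketch is an assumption for illustration, not the scheme FURTHeR evaluated; the trait fields, normalization rules, and shared-secret handling are all hypothetical.

```python
# Hedged sketch of a reproducible encrypted patient identifier: a keyed hash
# (HMAC-SHA256) over normalized demographic traits. The chosen traits,
# normalization, and key distribution are illustrative assumptions only.
import hashlib
import hmac

SHARED_KEY = b"key-distributed-to-partner-sites"  # placeholder secret

def linked_id(first, last, dob):
    normalized = f"{first.strip().lower()}|{last.strip().lower()}|{dob}"
    return hmac.new(SHARED_KEY, normalized.encode(), hashlib.sha256).hexdigest()

# Two institutions with differently formatted records derive the same id.
a = linked_id("Jane ", "DOE", "1970-01-01")
b = linked_id("jane", "doe ", "1970-01-01")
print(a == b)  # True
```

In a real deployment the normalization rules and key management would dominate the design, since any formatting disagreement between institutions breaks the linkage.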

Maintaining the metadata that the FFQE uses to transform data is also critical. We are working on a process that enables each organization to maintain its own metadata. The motivation for an organization to keep its metadata up to date is its access to and use of the FURTHeR tools: an organization with out-of-date metadata degrades the accuracy of its own results as well as everyone else's. For this strategy to work, FURTHeR needs to be a compelling application that provides value to each institution.

Finally, some of the most significant challenges relate to the politics of sharing data, which influence how the FFQE will interact with FURTHeR's security modules. Data security policies vary from organization to organization, and this may affect metadata content. For example, policies influence which data elements a researcher can select in a query against a given data resource, and which results can be retrieved and displayed to the user. We may also have to implement role-based security access within our metadata in order to enforce institution-specific access rules.

Future Work

FURTHeR is in its infancy, and we have much work ahead to build a mature infrastructure. The FFQE is a promising start that we are testing and constantly extending. In the near future we will focus on enhancing the FFQE's query translation capabilities, such as adding support for inline query functions and more sophisticated data models.

As the i2b2 example above illustrates, we are looking closely at i2b2 as a potential implementation that the FFQE can support. We are examining similar support for caBIG: we have studied the caGrid architecture and see some potential applications for the FFQE.

The FFQE has so far been developed on modest hardware. We will be working with the University of Utah's Center for High Performance Computing to leverage their enterprise-grade servers and their knowledge of high-performance enabling technologies, such as Grid computing.

Conclusion

The architecture of the FFQE is capable of federating data across heterogeneous data sources. It relies on metadata components to allow dynamic configuration of data sources and translation methods. The promise of the FFQE is that it has the potential to meet some of the fundamental needs of translational science: to access and integrate diverse biomedical data and promote discovery of new knowledge.

Acknowledgements

This investigation was supported by Public Health Services research grant UL1-RR025764 from the National Center for Research Resources.

References

1. Rocha R, Hurdle J, Matney SA, Narus S, Meystre S, Lasalle B, et al. Utah's statewide informatics platform for translational and clinical science. AMIA Annu Symp Proc. 2008:1114.
2. Cannon-Albright LA. Utah Family-Based Analysis: Past, Present and Future. Human Heredity. 2008;65(4):209-20.
3. Oster S, Langella S, Hastings S, Ervin D, Madduri R, Phillips J, et al. caGrid 1.0: An Enterprise Grid Infrastructure for Biomedical Research. J Am Med Inform Assoc. 2008 Mar-Apr;15(2):138-49.
4. Partners Healthcare. i2b2: Informatics for Integrating Biology and the Bedside. 2009 [cited 2009 Mar]. Available from: https://www.i2b2.org/.
5. Gupta A, Bug W, Marenco L, Qian X, Condit C, Rangarajan A, et al. Federated access to heterogeneous information resources in the Neuroscience Information Framework (NIF). Neuroinformatics. 2008 Sep;6(3):205-17.
6. Deep Web Technologies. 2008 [cited 2008]. Available from: http://www.deepwebtech.com/.
7. W3C. Web Ontology Language (OWL). 2007. Available from: http://www.w3.org/2004/OWL/.
8. Apelon. Distributed Terminology System: Managing Standards for Business Results. Ridgefield, CT: Apelon; 2008 [cited 2009 Mar 2]. Available from: http://www.apelon.com/products/white%20papers/DTS%20White%20Paper%20V34.pdf.
9. Oracle. Oracle TimesTen. Available from: http://www.oracle.com/timesten/index.html.