HealthGrid Applications and Technologies Meet Science Gateways for Life Sciences
S. Gesing et al. (Eds.)
IOS Press, 2012
© 2012 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-61499-054-3-9


Federated Queries for Comparative Effectiveness Research: Performance Analysis

Ronald C. PRICE a, Derick HUTH a, Jody SMITH a, Steve HARPER a, Wilson PACE b, Gerald PULVER b, Michael G. KAHN c, Lisa M. SCHILLING d and Julio C. FACELLI a,e

a Center for High Performance Computing, University of Utah; Departments of b Family Medicine, c Pediatrics and d Medicine, University of Colorado School of Medicine; e Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah 84112, USA.

Abstract. This paper presents a study of the performance of federated queries implemented in a system that simulates the architecture proposed for the Scalable Architecture for Federated Translational Inquiries Network (SAFTINet). Performance tests were conducted using both physical hardware and virtual machines within the test laboratory of the Center for High Performance Computing at the University of Utah. Tests were performed on SAFTINet networks ranging from 4 to 32 nodes with databases containing synthetic data for several million patients. The results show that the caGrid FQE (Federated Query Engine) is capable and suitable for comparative effectiveness research (CER) federated queries given its nearly linear scalability as partner nodes increase in number. The results presented here are also important for the specification of the hardware required to run a CER grid.

Keywords. Federated Queries, Comparative Effectiveness Research, caGrid

1. Introduction

The Scalable Architecture for Federated Translational Inquiries Network (SAFTINet) aims to federate geographically dispersed US safety net entities that collectively serve markedly diverse underserved populations. The overall project goal is to enhance the capacity and capability of a distributed research network to conduct prospective comparative effectiveness research (CER) via a multi-setting, multi-state organization. One of the core benefits of CER using clinically collected data is the ability to study patient care and outcomes in real-world settings, where conditions that affect variability in care and health outcomes are taken into account. CER for minority, underserved, and rural populations is especially valuable due to their historically limited representation in clinical research, well documented health care disparities, and the differences between documented clinical trial efficacy and real-world effectiveness in these populations [1].

Corresponding author at the Department of Biomedical Informatics, The University of Utah, 26 South 2000 East Room 5775 HSEB, Salt Lake City, UT 84112, [email protected].


Medicaid, which supports health care for many of these populations in the US, encourages its agencies to develop new architectures for better management of data and information. States are being encouraged to implement new web/service-oriented architectures, such as the Medicaid Information Technology Architecture (MITA), to connect organizational silos and align Medicaid with newer information technologies. One of the goals of the SAFTINet project is to combine data collected during the course of clinical care, typically documented in electronic systems, with claims and administrative data from State Medicaid agencies for CER use. SAFTINet's intent is to remain aligned with other initiatives, such as MITA, that promote service-oriented architectures (SOA) for CER; it therefore makes sense for the SAFTINet project to adopt this type of architecture.

Among SOA approaches, grid computing technologies [2, 3] were selected for SAFTINet because they provide a much stronger security model, can be more easily integrated into analytical workflows [4, 5], and allow the deployment of virtual organizations in a relatively straightforward manner. While the concept of grid computing originated in HPC (High Performance Computing) [6, 7], it is becoming apparent that its principles are highly aligned with the informatics requirements of emerging distributed biomedical research applications [8]. A comprehensive account of applications of grid computing in biomedicine was presented by Rick Stevens in the August 2006 issue of CTWatch Quarterly [9].

Examples of biomedical grid architectures include: caGrid [10, 11], the underpinning of one of the most comprehensive multi-institutional cancer research infrastructures, caBIG (https://cabig.nci.nih.gov/); GeneGrid [12], a service-oriented-architecture (SOA) implementation of a virtual bioinformatics laboratory focused on antibody and drug research and development; the Shared Pathology Informatics Network (SPIN) [13, 14], founded to establish an Internet-based virtual database that allows researchers to locate appropriate human tissue specimens for cancer research; the NeuroLOG project, which supports neurosciences middleware in a grid environment [15]; and AMGA, which provides a metadata catalogue in a grid environment [16]. The European HealthGrid initiative supports drug discovery and telediagnosis, aiming to enable radiologists from geographically dispersed hospitals to share standardized mammograms for diagnostic and epidemiologic inquiries [17, 18]. The Globus MEDICUS project supports the federation of DICOM medical imaging devices into a healthcare grid to address image sharing among providers [19]. The WISDOM initiative provides grid-based virtual screening tools for developing new anti-malaria drugs [20]. Hadzic and Chang have provided ontology services as a grid resource [21], and a secure grid medical data manager has been reported by Montagnat et al. [22]. A special section of IEEE Transactions on Nanobioscience was dedicated to the topic of grid and web services for life sciences [17, 18], and the annual conference organized by the HealthGrid organization (http://community.healthgrid.org/) reports on progress in applying grid architectures to the biomedical sciences [23].
Among grid computing technologies and development environments, the SAFTINet team chose the caGrid [24, 25] software development tools because they allow leveraging the existing federated query engine to execute federated queries across participating partner sites and can be easily deployed in an existing grid such as the Translational Informatics and Data Management (Triad) Grid (http://triadcommunity.org/). While many aspects of the caGrid technology appear adequate for the SAFTINet infrastructure, relatively little has been published on the performance of federated queries in large deployments like those envisioned for SAFTINet.
The goal of this work is to evaluate the scalability of federated queries as the number of SAFTINet grid nodes increases.

2. Methods

To evaluate the scalability of federated queries performed using DCQL (Distributed caGrid Query Language), we used a simplified cohort definition that included subject age and diagnosis: patients older than 18 years of age with an encounter-based ICD9 code of 493.X (the ICD9 codes corresponding to a diagnosis of asthma). The DCQL queries were performed using three different data sets, labeled here small, medium and large. The test data was created by thoroughly anonymizing a database containing records for approximately three million individuals. Anonymization was accomplished via the following process (a minimal code sketch of steps 1 and 2 appears after Table 1):

1) ID numbers: ID numbers from the existing database (person, location, provider) were replaced with integers randomly selected from a uniform distribution ranging from 0 to 99,999,999, and kept consistent in order to maintain links between tables.
2) Dates of diagnosis onset and resolution: a random integer between -60 and 60 was drawn from a uniform distribution for each person, and every problem date for a given individual was shifted by the offset drawn for that person.
3) Dates of birth were limited to year of birth, except that for those born prior to 1930, a year was randomly selected between 1910 and 1929.
4) Diagnosis codes: after deleting spurious codes, any ICD9 code attributed to fewer than 25 individuals was dropped from the database.

This resulted in a dataset with records for 2,972,969 patients, of whom 1.43% had a history of asthma. To produce datasets with prevalence rates of approximately ten and twenty percent, 87% and 94%, respectively, of the patients without a history of asthma were excluded from the dataset. The attenuated datasets were then fortified to approximately two million records each by multiplication: five-fold for the 10% dataset and nine-fold for the 20% dataset. Relationships between individuals and problems were maintained, with new randomly generated ID numbers assigned to the created person and problem records. These data sets were used to populate a reduced two-table OMOP (Observational Medical Outcomes Partnership, http://omop.fnih.org/) data model comprising Person and Condition Occurrence. The total numbers of entries for the three data sets considered here are presented in Table 1.

Table 1: The total number of entries for the three data sets considered here.

Data Set Size   Number of Persons   Total Number of Condition Occurrences   Number of Individuals with Desired ICD Code
Small           2,972,969           4,616,071                               7,966
Medium          2,116,540           5,694,920                               34,350
Large           1,969,596           7,749,990                               61,128
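To make the anonymization procedure concrete, the following is a minimal Java sketch of steps 1 and 2 above. It is illustrative only: the class and method names are ours, and the actual SAFTINet anonymization scripts are not published here.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Sketch of anonymization steps 1 and 2. A production script would also need to
// guard against ID collisions and would likely use SecureRandom instead of Random.
public class AnonymizerSketch {
    private final Random rng = new Random();
    // Old ID -> new random ID, kept consistent so links between tables survive.
    private final Map<Long, Long> idMap = new HashMap<>();
    // Per-person date offset in days, drawn once and reused for all of that person's dates.
    private final Map<Long, Integer> offsetDays = new HashMap<>();

    /** Step 1: replace an ID with a random integer in [0, 99,999,999], consistently. */
    public long anonymizeId(long oldId) {
        return idMap.computeIfAbsent(oldId, k -> (long) rng.nextInt(100_000_000));
    }

    /** Step 2: shift every problem date for a person by that person's fixed offset. */
    public LocalDate shiftDate(long personId, LocalDate problemDate) {
        int offset = offsetDays.computeIfAbsent(personId,
                k -> rng.nextInt(121) - 60); // uniform in [-60, 60]
        return problemDate.plusDays(offset);
    }
}
```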


A data grid service was created using a model-driven approach based on the OMOP CDM V2.0 data model; the grid service essentially wraps the grid software around a configured and tuned server-side MySQL database that implements the OMOP data model.
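For reference, a native SQL version of the asthma cohort query (used later in this paper as a database tuning baseline) can be expressed directly against the two-table OMOP-style schema. The JDBC sketch below is an assumption-laden illustration: the column names (person_id, year_of_birth, condition_source_value) and connection parameters are hypothetical stand-ins, not the exact OMOP CDM V2.0 names used by the project.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.time.Year;

// Hypothetical native SQL equivalent of the adult asthma cohort query, run
// directly against the two-table (Person, Condition Occurrence) MySQL schema.
public class NativeCohortQuery {
    public static void main(String[] args) throws SQLException {
        String sql =
            "SELECT DISTINCT p.person_id " +
            "FROM person p " +
            "JOIN condition_occurrence c ON c.person_id = p.person_id " +
            "WHERE c.condition_source_value LIKE '493.%' " + // ICD9 asthma codes 493.X
            "  AND p.year_of_birth <= ?";  // approximate age > 18 from year of birth
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/omop", "user", "password");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, Year.now().getValue() - 18);
            try (ResultSet rs = ps.executeQuery()) {
                int count = 0;
                while (rs.next()) count++;
                System.out.println("Cohort size at this node: " + count);
            }
        }
    }
}
```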

Figure 1: SAFTINet prototype grid architecture used in this work to test DCQL scalability. The architecture is composed of several distributed data services (only one is depicted in the figure, in the left box) that expose to the grid the data in the extended OMOP model, and a web client in which the federated query logic is implemented (top right box). The user accesses the system using a standard browser (lower right box).

The adult asthma federated query was implemented on the client side using Java to set up the necessary environment to call the FQE (Federated Query Engine) and display the results appropriately. The FQE performs the aggregation of the distributed queries automatically and returns the results to the Java client for display. The architecture used for this work is given in Figure 1.

All the experiments were conducted on either physical or virtual machines available at the Center for High Performance Computing (CHPC) at the University of Utah. Linux-based VMs were configured with one AMD Opteron 250 CPU @ 2.3 GHz, 3 GB of memory and a 66 GB SCSI disk for the server machines, i.e. those on which the grid services with the data were deployed. For the client machine, i.e. the one used for issuing the queries, physical hardware was always used. Two different machines were used for running the client: a small configuration (MacBook Pro, Dual Core 2.4 GHz, 4 GB of RAM) when testing the effect of using real hardware for the servers, and a large configuration (16-core Intel Xeon X5560 @ 2.8 GHz, 48 GB of RAM and a 500 GB SCSI disk) for the scaling tests. All the systems used here ran CentOS Linux (kernel version 2.6.18-274.12.1.el5). We used a single CPU on the server side to avoid variations in execution time caused by the VM management software's CPU scheduling; this made server-side execution time essentially constant for each VM in the configuration under evaluation. We deployed up to 32 server VMs to measure performance for networks of 2, 4, 8, 16 and 32 nodes, which in the SAFTINet environment correspond to networks with an equivalent number of partner sites. Extreme care was taken to ensure that each test was run with the same resources available on both the client and server sides. The first invocation of the grid node container includes a significant amount of initialization; therefore, measurements were taken only during the second invocation, so as not to include this overhead and to produce a consistent set of performance numbers. While we did not perform statistical testing of the fluctuation of the measured times, random observation indicated fluctuations on the order of a few percent. Due to the nature of thread pools and memory management in the JVM (Java Virtual Machine), plus process management in the OS on both the client and server sides, we occasionally observed minor variations in execution time of approximately 1% of the total.
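The measurement discipline described above (discard the first, initialization-heavy invocation and time the second) can be sketched as follows; executeFederatedQuery() is a hypothetical placeholder for the project's Java client logic, not actual SAFTINet code.

```java
// Warm-up-then-measure pattern: the first call absorbs container/JVM
// initialization; only the second call is timed.
public class QueryTimer {
    public static void main(String[] args) throws Exception {
        executeFederatedQuery();                 // warm-up run: overhead discarded
        long start = System.nanoTime();
        executeFederatedQuery();                 // measured run
        long wallSeconds = (System.nanoTime() - start) / 1_000_000_000L;
        System.out.println("Wall time (s): " + wallSeconds);
    }

    private static void executeFederatedQuery() throws Exception {
        // Placeholder for the client that builds the DCQL request, calls the
        // caGrid Federated Query Engine, and collects the aggregated results.
    }
}
```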

3. Results

Because one of the goals of SAFTINet is to be minimally intrusive on the IT infrastructure of the participating sites, it is appealing to use a VM deployment strategy for establishing grid nodes at the participating sites. This convenience should be weighed against any performance degradation that may result from this decision. Therefore, our first test was to run the federated query against data grid services deployed on physical machines and then run it on equivalent virtual machines. The results are presented in Table 2. It is apparent from the table that over two nodes DCQL performs correctly: for all three data sets the number of person records retrieved corresponds exactly to twice the number of persons with condition code 493.X in Table 1. This was also verified for all the other tests performed here. The use of VMs increases the query times by an average of 5% (ranging from 1.8% for the medium data set to 8.4% for the large data set). While this is not negligible, it is in line with observations from our previous work [26] and probably quite tolerable for this application.


Table 2. Comparison of execution times of DCQL using physical and VM nodes.

Node Type   Number of grid nodes   Data Set Size   Wall Time (seconds)   Returned Cohort Size
Physical    2                      Small           155                   15,932
Physical    2                      Medium          805                   68,700
Physical    2                      Large           1,675                 122,256
VM          2                      Small           164                   15,932
VM          2                      Medium          820                   68,700
VM          2                      Large           1,816                 122,256

Table 3. Comparison of execution times of DCQL using 2 to 32 VM grid nodes.

Number of grid nodes   Data Set Size   Wall Time (seconds)   Returned Cohort Size
2                      Small           155                   15,932
2                      Medium          691                   68,700
2                      Large           1,320                 122,256
4                      Small           165                   31,864
4                      Medium          719                   137,400
4                      Large           1,415                 244,512
8                      Small           194                   63,728
8                      Medium          782                   274,800
8                      Large           1,526                 489,024
16                     Small           253                   127,456
16                     Medium          1,075                 549,600
16                     Large           2,037                 978,048
32                     Small           367                   254,912
32                     Medium          1,559                 1,099,200
32                     Large           2,960                 1,956,096


Table 3 presents the scaling results using VMs; as in the two-node tests, DCQL retrieves the expected number of cases based on the number of individuals with ICD9 code 493.X in Table 1. The execution logs show that the time spent in the database is a small fraction of the total execution time, and comparison with native SQL indicates that the database is well tuned, so very little overall performance improvement could be expected from further improving the efficiency of the database. The results from Table 3 are also depicted in Figure 2. It is apparent that for all data sets tested here the total execution time increases linearly with the number of grid nodes in the network. A more careful analysis of the DCQL timing using CPU monitoring tools shows that there are two main contributions to the wall time plotted in the figure. During the first phase, data gathering, there is no CPU utilization on the client side; this phase corresponds to the intercept of the linear fit in Figure 2. As depicted in Figure 3, the gathering time (y-intercept) is proportional to the size of the data set but independent of the number of nodes in the grid, showing that the gathering process is essentially an embarrassingly parallel task in which each node operates independently. The second phase corresponds to the aggregation of results by the FQE; this execution time is proportional to the number of nodes in the grid, but as depicted in Figure 3 the rate of increase is also proportional to the data set size.
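These two contributions can be summarized in a simple empirical cost model (our notation, not part of the original analysis), where n is the number of grid nodes and D the data set size:

```latex
% Empirical DCQL wall-time model: a node-independent gathering term plus an
% aggregation term that grows with the number of nodes; both scale with D.
T(n, D) \approx t_{\mathrm{gather}}(D) + t_{\mathrm{agg}}(D)\, n,
\qquad t_{\mathrm{gather}}(D) \approx \alpha D,
\quad t_{\mathrm{agg}}(D) \approx \beta D
```

The intercepts and slopes of the fits in Figure 2 estimate t_gather(D) and t_agg(D), and Figure 3 shows both growing roughly linearly in D.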

Figure 2. Linear scaling of the DCQL tests performed here for the Small, Medium and Large data sets: execution time in seconds versus number of nodes (2 to 32). The linear fits are y = 7x + 138 (small), y = 29x + 595 (medium) and y = 55x + 1163 (large).
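As a quick check on the fits quoted in Figure 2, the slope and intercept for the small data set can be recovered from the Table 3 points with an ordinary least-squares fit. The short Java sketch below (not part of the original analysis) reproduces y ≈ 7x + 138:

```java
// Least-squares fit of wall time versus number of nodes for the small data set,
// using the points from Table 3; reproduces the y = 7x + 138 line in Figure 2.
public class ScalingFit {
    public static void main(String[] args) {
        double[] nodes = {2, 4, 8, 16, 32};
        double[] secs  = {155, 165, 194, 253, 367};
        int n = nodes.length;
        double sx = 0, sy = 0, sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sx += nodes[i]; sy += secs[i];
            sxy += nodes[i] * secs[i]; sxx += nodes[i] * nodes[i];
        }
        double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx); // ~7.1 s/node
        double intercept = (sy - slope * sx) / n;                 // ~138 s
        System.out.printf("y = %.1fx + %.0f%n", slope, intercept);
    }
}
```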


Figure 3. Linear dependency of the intercept (gathering phase) and slope (rate of increase of the aggregation time) on the data set size for the linear scaling tests from Figure 2. Data set size in number of records, slope in seconds/node and intercept in seconds.

The maximum memory use on the client side is presented in Table 4 as a function of the data set size and the number of nodes. While it is apparent that memory use increases with both the number of nodes and the data set size, we have not been able to extract relationships as clear as those found for the CPU utilization.

Table 4. Maximum client-side memory used in the tests performed in this study (values rounded to the closest MB).

Number of Nodes   Small Data Set   Medium Data Set   Large Data Set
4                 7,798            18,191            18,577
8                 11,697           18,490            20,281
16                16,660           20,718            20,893
32                18,579           21,448            29,247

During all the tests we carefully monitored the performance of the physical systems used to host the VMs to ensure that there was no contention for either CPU or memory. Therefore these results represent a best-case scenario in which the hardware resources are not constrained. A limitation of this study is the lack of assessment of the impact of network delays on performance. Network performance was not a bottleneck in these experiments because the VMs running the grid nodes and the client were all located within the CHPC test lab. We expect that this will be the case for most large participating sites, but data originating from small sites may suffer network delays due to the extremely low bandwidth available in such settings.


4. Conclusions

These results demonstrate that DCQL, as implemented in the current version of the caGrid FQE, is capable and suitable for CER federated queries given its nearly linear scalability as partner sites increase in number. The results presented here are also important for the specification of the hardware required to run a CER grid.

Acknowledgements

This work has been partially supported by grant R01HS019908 from the Agency for Healthcare Research and Quality and by a generous allocation of computer resources at the Center for High Performance Computing. MGK and GP were also supported by NIH/NCRR Colorado CTSI Grant Number TL1 RR025778.

References

[1] N.E. Adler, K. Newman, Socioeconomic Disparities in Health: Pathways and Policies, Health Affairs, 21 (2002) 60-76.
[2] T.N. Truong, M. Nayak, H.H. Huynh, T. Cook, P. Mahajan, L.T. Tran, J. Bharath, S. Jain, H.B. Pham, C. Boonyasiriwat, N. Nguyen, E. Andersen, Y. Kim, S. Choe, J. Choi, T.E. Cheatham, J.C. Facelli, Computational Science and Engineering Online (CSE-Online): A Cyber-Infrastructure for Scientific Computing, J. Chem. Inf. Model., 46 (2006) 971-984.
[3] I. Foster, Service-Oriented Science, Science, 308 (2005) 814-817.
[4] A.S. McGough, J. Cohen, J. Darlington, E. Katsiri, W. Lee, S. Panagiotidi, Y. Patel, An End-to-end Workflow Pipeline for Large-scale Grid Computing, J. Grid Computing, 3 (2006) 259-281.
[5] J. Yu, R. Buyya, A Taxonomy of Workflow Management Systems for Grid Computing, J. Grid Computing, 3 (2006) 171-200.
[6] F. Berman, G. Fox, T. Hey, Grid Computing: Making the Global Infrastructure a Reality, John Wiley & Sons, London, 2003, pp. 1080.
[7] J. Andreeva, S. Campana, F. Fanzago, J. Herrala, High-Energy Physics on the Grid: the ATLAS and CMS Experience, J. Grid Computing, 6 (2008) 3-13.
[8] J.C. Facelli, The Impact of Grid Computing in Biomedical Informatics, in: INFOLAC2008-AAIM, Buenos Aires, Argentina, 2008.
[9] R. Stevens, Trends in Cyberinfrastructure for Bioinformatics and Computational Biology, CTWatch Quarterly, 2 (2006) 1-5.
[10] J. Saltz, S. Oster, S. Hastings, S. Langella, T. Kurc, W. Sanchez, M. Kher, A. Manisundaram, K. Shanbhag, P. Covitz, caGrid: design and implementation of the core architecture of the cancer biomedical informatics grid, Bioinformatics, 22 (2006) 1910-1916.
[11] S. Langella, S. Hastings, S. Oster, T. Pan, A. Sharma, J. Permar, D. Ervin, B.B. Cambazoglu, T. Kurc, J. Saltz, Sharing Data and Analytical Resources Securely in a Biomedical Research Grid Environment, J. American Medical Informatics Association, 15 (2008) 363-373.
[12] P.V. Jithesh, P. Donachy, T. Harmer, N. Kelly, R. Perrott, S. Wasnik, J. Johnston, M. McCurley, M. Townsley, S. McKee, GeneGrid: Architecture, Implementation and Application, J. Grid Computing, 4 (2006) 209-222.

[13] T.A. Drake, J. Braun, A. Marchevsky, I.S. Kohane, C. Fletcher, H. Chueh, B. Beckwith, D. Berkowicz, F. Kuo, Q.T. Zeng, U. Balis, A. Holzbach, A. McMurry, C.E. Gee, C.J. McDonald, G. Schadow, M. Davis, E.M. Hattab, L. Blevins, J. Hook, M. Becich, R.S. Crowley, S.E. Taube, J. Berman, A system for sharing routine surgical pathology specimens across institutions: the Shared Pathology Informatics Network, 38 (2007) 1212.
[14] M.J. Becich, Lessons Learned from the Shared Pathology Informatics Network (SPIN): A Scalable Network for Translational Research and Public Health, J. Am. Med. Inform. Assoc., 14 (2007) 534-535.
[15] J. Montagnat, A. Gaignard, D. Lingrand, J.R. Balderrama, P. Collet, P. Lahire, NeuroLOG: A community-driven middleware design, in: Studies in Health Technology and Informatics, 2008, pp. 49-58.
[16] I. Blanquer, V. Hernandez, J. Salavert, D. Segrelles, Using Grid-Enabled Distributed Metadata Database to Index DICOM-SR, in: Studies in Health Technology and Informatics, 2009, pp. 117-126.
[17] E. Bartocci, D. Cacciagrano, N. Cannata, F. Corradini, E. Merelli, L. Milanesi, P. Romano, An Agent-Based Multilayer Architecture for Bioinformatics Grids, IEEE Transactions on Nanobioscience, 6 (2007) 142-148.
[18] L. Milanesi, G. Armano, V. Breton, P. Romano, Guest Editorial: Special Section on Grid, Web Services, Software Agents, and Ontology Applications for Life Sciences, IEEE Transactions on Nanobioscience, 6 (2007) 101-103.
[19] S.G. Erberich, J.C. Silverstein, A. Chervenak, R. Schuler, M.D. Nelson, C. Kesselman, Globus MEDICUS - federation of DICOM medical imaging devices into healthcare Grids, Stud. Health Technol. Inform., 126 (2007) 269-278.
[20] N. Jacq, J. Salzemann, F. Jacq, Y. Legré, E. Medernach, J. Montagnat, A. Maaß, M. Reichstadt, H. Schwichtenberg, M. Sridhar, V. Kasam, M. Zimmermann, M. Hofmann, V. Breton, Grid-Enabled Virtual Screening Against Malaria, J. Grid Computing, 6 (2008) 29-43.
[21] M. Hadzic, E. Chang, Grid Services Complemented by Domain Ontology Supporting Biomedical Community, in: Scientific Applications of Grid Computing, 1st International Workshop, SAG, Beijing, China, 2004, pp. 86-98.
[22] J. Montagnat, Á. Frohner, D. Jouvenot, C. Pera, P. Kunszt, B. Koblitz, N. Santos, C. Loomis, R. Texier, D. Lingrand, P. Guio, R.B.D. Rocha, A.S.d. Almeida, Z. Farkas, A Secure Grid Medical Data Manager Interfaced to the gLite Middleware, J. Grid Computing, 6 (2008) 45-59.
[23] N. Jacq, H. Müller, I. Blanquer, Y. Legré, V. Breton, D. Hausser, V. Hernández, T. Solomonides, M. Hofmann-Apitius, From Genes to Personalized HealthCare: Grid Solutions for the Life Sciences, IOS Press, Amsterdam, 2007.
[24] S. Oster, S. Langella, S. Hastings, D. Ervin, R. Madduri, J. Phillips, T. Kurc, F. Siebenlist, P. Covitz, K. Shanbhag, I. Foster, J. Saltz, caGrid 1.0: an enterprise Grid infrastructure for biomedical research, J. Am. Med. Inform. Assoc., 15 (2008) 138-149.
[25] J. Saltz, S. Oster, S. Hastings, S. Langella, T. Kurc, W. Sanchez, caGrid: design and implementation of the core architecture of the cancer biomedical informatics grid, Bioinformatics, 22 (2006) 1910-1916.
[26] R.C. Price, W. Pettey, T. Freeman, K. Keahey, M. Leecaster, M. Samore, J. Tobias, J.C. Facelli, SaTScan on a Cloud: On-Demand Large Scale Spatial Analysis of Epidemics, 2010.