‘Spring through the Gateway’ – Deploying Genomic Workflows with XSEDE [Extended Abstract]

David Rhee, R Brent Calder, Kevin Shieh, Joseph Hargitai, Pilib Ó Broin, Aaron Golden

Albert Einstein College of Medicine
1300 Morris Park Avenue, Bronx, NY 10461
1-718-678-1150

ABSTRACT
The use of sequencing technologies has revolutionized the field of genomics, allowing us to study structural and functional variation within the genome down to base-pair level. These technologies can also be used to probe the associated epigenome, where DNA-binding proteins alter the structural integrity of the genome, restricting or enabling localized gene expression in a heritable fashion. Assays that identify the binding locations of these proteins allow such ‘epigenetic marks’ to be discovered and correlated with the molecular functions and phenotypes being studied. Because epigenetic marks are inherently plastic, being easily perturbed by environmental stimuli, they are a compelling and important area of study in the context of human development and disease. One of the most commonly studied epigenetic marks is DNA methylation: the attachment of a methyl group to cytosines in CpG dinucleotides, sites where a cytosine nucleotide is immediately followed by a guanine nucleotide. This modification can directly block the binding of regulatory proteins at that location, effectively ‘silencing’ transcriptional activity. There are roughly 2.8 million such CpG loci in the human genome, making them an excellent target for genome-wide methylation assays using sequencing technologies. Several assays have been developed to determine cytosine methylation status based on the use of restriction enzymes. The two most commonly used techniques are Methyl-Seq [1] and HELP-tagging [2].
Whilst both are based on the differential binding of the MspI and HpaII restriction enzymes, their analytical procedures differ: for Methyl-Seq, methylation status is defined as the ratio of HpaII/MspI tags per locus, whereas with HELP-tagging, a geometry-based angle calculation over the HpaII/MspI tags is applied for methylation quantification instead. Both approaches are somewhat simplistic, however, since they fail to account for noise arising from the random sampling nature of sequencing and from other experimental artifacts, which result in varied sequencing coverage depth. Recently, a more rigorous statistical analysis of the methylation status associated with HpaII/MspI tags was performed [3]. By applying a Bayesian hierarchical model framework to deal with these artifacts, it yields a more robust assessment of methylation at each CpG locus. The resulting software, called msBayes, combines R scripts with the WinBUGS statistical package [4], which provides the Markov Chain Monte Carlo (MCMC) routines used to generate samples from the posterior distribution of the model parameters. However, while msBayes provides a superior way to infer methylation status, its use is hindered both by its restriction to a non-Unix operating system and by the length of time required for a typical MCMC analysis. Our assessment with a typical dataset yields a processing time of ~3 seconds per CpG site on a Windows 7 machine with a 1.3 GHz Intel processor and 4 GB RAM. With approximately 1.6 to 2.3 million sites per typical assay, and given the likelihood of multiple replicates and samples being processed per epigenomics experiment, it is immediately obvious that this is not a viable solution. Given the prominent use of epigenetic assays in the research community, and recognizing that the basic approach adopted by Wu et al. could be optimized to run more efficiently on high performance resources, we set about developing a UNIX-based variant, msBayes2.0.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third party components of this work must be honored. For all other uses, contact the Owner/Author. Copyright is held by the owner/author(s). XSEDE '13, Jul 22-25 2013, San Diego, CA, USA. ACM 978-1-4503-2170-9/13/07.
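As a point of reference for what msBayes improves on, the simple Methyl-Seq-style per-locus ratio described above can be sketched as follows. This is an illustrative sketch, not the authors' code, and the tag counts are hypothetical:

```python
# Naive Methyl-Seq-style methylation call for a single CpG locus.
# HpaII cuts only unmethylated CCGG sites, while MspI cuts regardless of
# methylation, so a high HpaII/MspI tag ratio implies low methylation.

def methylation_ratio(hpaii_count, mspi_count):
    """Return the HpaII/MspI tag ratio, or None if the locus has no
    MspI coverage (methylation status undefined)."""
    if mspi_count == 0:
        return None
    return hpaii_count / mspi_count

# Hypothetical tag counts at two loci:
print(methylation_ratio(45, 50))  # 0.9 -> largely unmethylated
print(methylation_ratio(2, 50))   # 0.04 -> largely methylated
```

Note that this ratio treats a locus with 2 tags the same as one with 2,000 tags, which is exactly the coverage-depth noise the Bayesian approach addresses.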
This version, developed as a modular system, consists of a core module written in C/OpenMP, which is responsible for communicating with the OpenBUGS package [5], and a Python script which handles data preprocessing and post-OpenBUGS analysis. The user interface module, written in a combination of SQLite, HTML, Python and PHP, facilitates user interaction, job scheduling, and the handling of multiple concurrent HELP-tagging datasets over a web portal while processing on local cluster resources. msBayes2.0 decreased the processing time per CpG site by an order of magnitude relative to the WinBUGS version, and our ability to parallelize the analysis at fine grain, with one site per core, yielded an effectively linear speed-up in the number of cores used. We then further developed this infrastructure to deploy processing jobs to remote resources via the Einstein Genome Gateway, allowing researchers to transparently access XSEDE cyberinfrastructure resources (see Figure 1). Using all 32 cores of a Trestles node at the San Diego Supercomputer Center (SDSC) resulted in an effective processing time of approximately 0.01 seconds per CpG site, a performance increase of approximately 300-fold over the original msBayes implementation (see Table 1).
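The one-site-per-core parallelization is possible because each CpG site is quantified independently. A minimal sketch, using a Python thread pool as a stand-in for the actual C/OpenMP core module, with quantify_site standing in for the per-site MCMC analysis and hypothetical input data:

```python
# Illustrative sketch of fine-grained, one-site-per-core parallelization.
# The production core module is C/OpenMP; here a portable thread pool
# stands in for OpenMP threads.
from multiprocessing.dummy import Pool  # thread-based Pool

def quantify_site(site):
    """Stand-in for the per-site Bayesian MCMC quantification."""
    chrom, pos, hpaii, mspi = site
    total = hpaii + mspi
    # Placeholder score; the real module samples a posterior per site.
    score = round(hpaii / total, 3) if total else 0.0
    return (chrom, pos, score)

# Hypothetical input: (chromosome, position, HpaII tags, MspI tags)
sites = [("chr1", 1000, 12, 30), ("chr1", 2500, 40, 5), ("chr2", 700, 0, 0)]

with Pool(4) as pool:
    # Sites are independent, so throughput scales near-linearly with cores.
    results = pool.map(quantify_site, sites)
print(results)
```

Because there is no shared state between sites, the only scaling limits are scheduling overhead and the per-node core count, consistent with the effectively linear speed-up reported above.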

Table 1. Computational performance of msBayes and msBayes2.0

                          msBayes*   msBayes2.0*   msBayes2.0±
  No. cores used          single     single        32
  Time per site (s)       ~3         ~0.35         ~0.01
  Fold change             1          ~10           ~300

*A dual-booting option was implemented to run msBayes on Windows 7 and msBayes2.0 on Ubuntu 12.04 on the same machine (1.3 GHz Intel processor, 4 GB RAM). ±Trestles is configured with 32-core 2.4 GHz AMD Magny-Cours processors per node.

In parallel with this work, researchers at Einstein are using the Spring Framework to refactor the Wiki-based Automated Sequence Processor (WASP), originally designed to coordinate the operation of the entire sequencing facilities on the Bronx campus [6]. WASP has a unique design that integrates sequencing machines, distributed data storage servers, high performance clusters, web servers, and scientists into a single ecosystem that is easily accessible via a central web portal. Whilst highly successful in revolutionizing user interactivity, the current design, a blend of diverse software technologies (Perl, Python, awk, R, C++, Ajax & PHP), became somewhat inflexible; this is a critical issue given the dynamic nature of sequencing technologies. On the other hand, one of its great successes has been the fact that the end users, basic researchers and clinicians, remain unaware of the data storage, management and, most importantly, processing services they request when pursuing massively parallel sequencing experiments through Einstein’s Sequencing Core Facility. This abstraction of both systems and computational complexity from the user in effect makes WASP an ideal middleware solution for the genomic sciences, particularly as one can leverage remote national cyberinfrastructure resources [7]. Our decision to refactor the original WASP LIMS-workflow system using the Java/Spring Framework rather than comparable technologies such as .NET was based on Java/Spring's infrastructural support, which allowed us to focus on application development applicable to multiple deployment environments.

Thus, by using the Spring Java/J2EE application framework as the basis for the redesign, the new WASP System represents a more robust, highly modularized computational infrastructure that fulfills the original design goals of WASP while also providing a more mature processing environment and flexibility for subsequent utility development via a plug-in system, which allows third-party developers to build their own ‘bioinformatics pipelines’ to extend the WASP System. One particularly interesting feature of this new environment is Spring Batch, which we use to comprehensively manage the life cycles of the complex analytical workflows formed from these plug-in components. Progress tracking facilitates a more robust and dynamic ability to deal with issues in the process life cycle as they occur. Furthermore, the fact that Spring can interface with both cloud and grid resources via Cloud Tools and the Crux Toolkit complements our vision of using this system as middleware to remote cyberinfrastructure resources.
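As a quick check on the arithmetic in Table 1, the fold changes follow directly from the per-site processing times:

```python
# Fold changes in Table 1, derived from the reported per-site times.
t_msbayes = 3.0      # seconds/site, original WinBUGS msBayes, single core
t_v2_single = 0.35   # seconds/site, msBayes2.0, single core
t_v2_32core = 0.01   # seconds/site, msBayes2.0, 32-core Trestles node

print(round(t_msbayes / t_v2_single))  # 9, reported as ~10-fold
print(round(t_msbayes / t_v2_32core))  # 300-fold
```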

Figure 1. Workflow of msBayes2.0

We decided to use the Gateway-enabled variant of msBayes2.0 as the driver to experiment with Spring Batch and the WASP System’s ability to deploy genomic processing jobs to the XSEDE infrastructure. Our motivation was to attempt to overcome the somewhat unsophisticated and time-dependent nature of our existing implementation. Operating via a web portal, the typical workflow of msBayes2.0 can be broken down into a multi-step process (see Figure 1). A user is first greeted with a job submission page, inviting them to submit their name and e-mail, experimental details, and MspI and HpaII library files containing genomic locations and corresponding tag counts. The upload process automatically checks the input files for errors and inserts the data into an SQLite database. The second step occurs behind the scenes: a Python script periodically scans for newly submitted jobs in its queue and distributes workloads to available off-site HPC clusters, such as XSEDE resources or the local HPC clusters on the Einstein campus. Upon receipt, the core module preprocesses and parses the input data and performs methylation quantification via multi-core processing. Finally, as the Python script periodically scans for and identifies finished jobs on the compute nodes/clusters, it organizes the completed data into a BED file: a tab-delimited text file containing the chromosome, chromosome start position, chromosome end position, user-provided project name (‘name’) and methylation level (‘score’). The output of msBayes2.0 is made available to the submitting user via an e-mailed link to the BED file.
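The five-column BED output described above can be sketched as follows. The helper and the coordinate/score values are illustrative, not the production script:

```python
# Illustrative writer for the tab-delimited BED output described above:
# chrom, start, end, name (user-provided project name), score (methylation).
import csv
import io

def write_bed(rows, project_name, fh):
    """Write (chrom, start, end, score) tuples as 5-column BED lines."""
    writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
    for chrom, start, end, score in rows:
        writer.writerow([chrom, start, end, project_name, score])

# Hypothetical quantified sites for a project named 'hESC_HELP':
buf = io.StringIO()
write_bed([("chr1", 999, 1000, 0.82), ("chr2", 4999, 5000, 0.1)],
          "hESC_HELP", buf)
print(buf.getvalue())
```

BED uses 0-based, half-open start/end coordinates, so a single CpG site occupies a one-base interval as shown.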
Our experience using XSEDE resources via the Gateway-enabled variant of msBayes2.0 showed us that there are temporal limitations to this design: a script runs every 2 hours to ‘pick up’ job submissions on the main gateway server, compute time is at the whim of load and queue depth on the remote resource, and a second script polls the subdirectory scheduled to contain the completed analysis every 2 hours, issuing the e-mail command when results are detected. Altogether, a typical analysis from start to finish requires approximately 22-46 hours per dataset, depending entirely on the compute resources used and assuming no extended delays. For example, quantifying all ~1.8 million sites of the human embryonic stem cell methylation dataset [8] on a local HPC resource with 12-core 2.60 GHz Intel Xeon processors per compute node takes ~42 hours to complete.
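The polling design described above can be sketched as follows. The directory layout, the BED-suffix completion convention, and the interval constant are illustrative assumptions; only the 2-hour duty cycle comes from the text:

```python
# Sketch of the periodic 'pick up' polling described above. A job counts
# as finished once its BED file appears in the results directory.
import os
import tempfile

POLL_INTERVAL = 2 * 60 * 60  # the 2-hour duty cycle noted in the text

def find_finished_jobs(results_dir):
    """Return the BED files (completed analyses) present in results_dir."""
    return sorted(f for f in os.listdir(results_dir) if f.endswith(".bed"))

# Demonstration with a throwaway directory standing in for the gateway's
# results area: one finished job, one still in flight.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "job42.bed"), "w").close()
    open(os.path.join(d, "job43.partial"), "w").close()
    finished = find_finished_jobs(d)
print(finished)  # ['job42.bed']

# The real script would loop forever, e-mailing a link per finished job:
#   while True:
#       for bed in find_finished_jobs(RESULTS_DIR):
#           send_email_link(bed)          # hypothetical notify helper
#       time.sleep(POLL_INTERVAL)
```

The fixed sleep is what adds up to several hours of dead time per job, which is precisely the ‘duty cycle’ the Spring Batch redesign eliminates.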

Similarly, quantifying the same dataset on the XSEDE resource Trestles with only 12 cores also takes ~42 hours; scaling up to 32 cores on Trestles decreases the analysis time to ~18 hours (see Table 2). Furthermore, we have encountered procedural flaws with this approach: issues on the remote XSEDE computing resource that could compromise job completion are not dealt with, nor is it possible to monitor the status of the workflow or correct procedural issues in an expeditious fashion.

Table 2. Deployment of msBayes2.0 on remote compute resources

                          Local HPC    Trestles    Trestles
  No. cores used          12           12          32
  Time (core module)      ~42 hours    ~42 hours   ~18 hours
  Time (gateway)          ~46 hours                ~22 hours

By using Spring, computational resources, such as those within the XSEDE stable, can be pre-configured within the operating WASP System, detailing the specific file transport, scheduler and software configuration setup for each compute resource. Then, when a plug-in requests work on a generic cluster resource available to the WASP System, its scheduler can decide where the work needs to go and provisions files and configures the resulting work unit for the destination via Spring Batch. The WASP System then monitors progress and cleans up on completion or in the event of a failure (at which point the plug-in's Batch flow has an opportunity to recover). The end-user experience is the same in every other respect; the user is presented with the same web-portal interface (see Figure 2). We developed a prototype plug-in specifically designed to deploy our msBayes2.0 workflow to XSEDE as a work unit, and by taking advantage of Spring Batch's properties we could both decrease the temporal duty cycle associated with the original script-polled start/stop steps and monitor the job's progress. Of perhaps even greater importance was Spring Batch's ability to remotely control the job life cycle and, in particular, to dynamically respond to problems occurring in the work unit life cycle, both when transferring to/from XSEDE resources and during job deployment. This ability to monitor and regulate job deployment on remote XSEDE compute nodes was lacking in our earlier attempts at submitting jobs via the Gateway-enabled version, and it effectively cuts down the ‘duty cycle’ of any given job.

As such, msBayes2.0 is a work unit within the new Spring-based WASP System, deployed on XSEDE grid resources via the Einstein Genome Gateway. It represents the prototype for processing both genomics and epigenomics datasets on national cyberinfrastructure resources. By combining the Spring Framework's capabilities with the WASP System's fundamental design focus on the automated processing and management of large sequencing datasets, we believe our approach offers an excellent new paradigm for leveraging XSEDE's significant computational resources in this new era of Big Data. The Einstein Genome Gateway itself is hosted on Indiana University's Quarry system, which has full access to the XSEDE infrastructure and thus the network capacity necessary for moving ‘big data’ between resources. In addition, we believe our system offers an alternative platform for processing and managing large genomics and epigenomics datasets to existing gateway technologies such as the Galaxy platform [9], thanks to WASP's powerful LIMS together with a platform for custom bioinformatics pipeline development by third-party developers via its plug-in system. The prototype msBayes2.0 is under continuous development, and we plan to release it as open-source software when development of the next-generation WASP System is complete.
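The pre-configured resource descriptors described above can be illustrated with a Python analogue (the production system uses Spring configuration, not Python; the class, field names and values here are assumptions for illustration):

```python
# Illustrative analogue of pre-configured compute-resource descriptors:
# each resource records its file transport, scheduler and capacity, and a
# dispatcher picks one for a generic "cluster" work request.
from dataclasses import dataclass

@dataclass
class ComputeResource:
    name: str
    transport: str  # file transport mechanism, e.g. scp vs. GridFTP
    scheduler: str  # batch scheduler on the resource
    cores: int      # cores available per node

# Hypothetical registry mirroring the Table 2 resources:
RESOURCES = [
    ComputeResource("local-hpc", "scp", "sge", 12),
    ComputeResource("trestles", "gridftp", "pbs", 32),
]

def pick_resource(min_cores):
    """Choose the first configured resource with enough cores per node."""
    for r in RESOURCES:
        if r.cores >= min_cores:
            return r
    raise RuntimeError("no suitable resource configured")

print(pick_resource(32).name)  # trestles
```

In the actual system this selection, plus file provisioning and cleanup, is handled by the WASP scheduler and Spring Batch rather than application code.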

Figure 2. Overview of prototype gateway-enabled msBayes2.0 WASP plug-in system

Categories and Subject Descriptors
D.1.3 [Programming Techniques]: Concurrent Programming – Distributed Programming, Parallel Programming; D.2.11 [Software Engineering]: Software Architectures – Data abstraction, Domain-specific architectures; D.2.m [Software Engineering]: Miscellaneous – Rapid Prototyping.

General Terms
Management, Measurement, Performance, Reliability, Experimentation

Keywords
Bioinformatics, Grid computing, Epigenomics, Computational Biology, Gateway, DNA methylation, Parallel Computing, High Performance Computing

REFERENCES
[1] Brunner, A.L., Johnson, D.S., Kim, S.W., Valouev, A., Reddy, T.E., Neff, N.F., Anton, E., Medina, C., Nguyen, L., Chiao, E., Oyolu, C.B., Schroth, G.P., Absher, D.M., Baker, J.C., Myers, R.M. 2009. Distinct DNA methylation patterns characterize differentiated human embryonic stem cells and developing human fetal liver. Genome Research, 19, 1044-1056.
[2] Suzuki, M., Jing, Q., Lia, D., Pascual, M., McLellan, A., Greally, J.M. 2010. Optimized design and data analysis of tag-based cytosine methylation assays. Genome Biology, 11, R36.
[3] Wu, G., Yi, N., Absher, D., Zhi, D. 2011. Statistical quantification of methylation levels by next-generation sequencing. PLoS ONE, 6, e21034.
[4] Lunn, D., Thomas, A., Best, N., Spiegelhalter, D. 2000. WinBUGS - a Bayesian modelling framework: concepts, structure, and extensibility. Statistics and Computing, 10, 325-337.
[5] Lunn, D., Spiegelhalter, D., Thomas, A., Best, N. 2009. The BUGS project: evolution, critique and future directions. Statistics in Medicine, 28, 3049-3067.
[6] McLellan, A.S., Dubin, R.A., Jing, Q., Ó Broin, P., Moskowitz, D., Suzuki, M., Calder, R.B., Hargitai, J., Golden, A., Greally, J.M. 2012. The WASP System: an open source environment for managing and analyzing genomic data. Genomics, 100, 345-351.
[7] Golden, A., McLellan, A.S., Dubin, R.A., Jing, Q., Ó Broin, P., Moskowitz, D., Zhang, Z., Suzuki, M., Hargitai, J., Calder, R.B., Greally, J.M. 2012. The Einstein Genome Gateway using WASP - a high throughput multi-layered life sciences portal for XSEDE. Proceedings of the 4th International Workshop on Science Gateways for Life Sciences, IOS Press, Amsterdam.
[8] Suzuki, M., Jing, Q., Lia, D., Pascual, M., McLellan, A.S., Greally, J.M. 2010. Optimized design and data analysis of tag-based cytosine methylation assays. Genome Biology, 11, R36.
[9] Giardine, B., Riemer, C., Hardison, R.C., Burhans, R., Elnitski, L., Shah, P., Zhang, Y., Blankenberg, D., Albert, I., Taylor, J., Miller, W., Kent, W.J., Nekrutenko, A. 2005. Galaxy: a platform for interactive large-scale genome analysis. Genome Research, 15(10), 1451-1455.