2014 ASE BigData/SocialInformatics/PASSAT/BioMedCom 2014 Conference, Harvard University, December 14-16, 2014

Koonkie: An Automated Software Tool for Processing Environmental Sequence Information using Hadoop

Dongjae Kim1, Kishori M. Konwar2, Niels W. Hanson3, Steven J. Hallam2,3
1 Department of Computer Science, University of British Columbia
2 Department of Microbiology & Immunology, University of British Columbia
3 Graduate Program in Bioinformatics, University of British Columbia
[email protected], {kishori, nielsh, shallam}@mail.ubc.ca

Abstract

Next-generation sequencing platforms produce increasingly vast text datasets from environmental samples. Processing these datasets often requires custom-designed software and high-performance computing resources that in turn require regular monitoring and maintenance. The inherent design of traditional grid systems limits user control and on-demand availability, which are needed for efficient processing. Recent advances in high-speed Internet bandwidth and the emergence of affordable and scalable cloud-computing services offer attractive alternatives to traditional grid systems. However, adapting cloud-computing services to use available and emerging bioinformatics tools requires non-trivial setup and expertise. Here we present Koonkie, a flexible and scalable Hadoop-based software tool that can be used to process environmental sequence information with scalable cloud-computing services. Koonkie can automatically connect, configure, and launch processes on clouds, in a pay-as-you-go manner, without incurring additional monitoring and maintenance costs. Moreover, by exploiting novel features of the YARN resource negotiator introduced in Hadoop v2, Koonkie can perform tasks, like BLAST, that require large side-data distributions and would incur prohibitive communication costs in previous versions of Hadoop. We demonstrate Koonkie's flexibility and performance by annotating Illumina sequence data from the Human Microbiome Project in a pipeline inspired by MetaPathways, a modular pipeline for analyzing environmental sequence information. Isolating the BLAST task, we find that Koonkie's performance scales linearly with the number of requisitioned AWS nodes. The Koonkie software is available for download from http://www.github.com/hallamlab/koonkie.

1

Introduction

We have entered the Big Data era, where the volume of information stored and generated from multifarious sources, spanning social media and financial services to ecosystems and galactic systems, approaches petabytes with no signs of subsiding. Concomitant with the rise of high-speed Internet and cost reductions in compute and disk storage, a number of programming models have become available to navigate and harness the data deluge by distributing both processing and storage requirements over clusters of commodity hardware. Examples of such models include Scope, Clustera,

Sector and Sphere, and especially MapReduce [5, 7, 8, 11]. Popularized in its open-source implementation, Apache Hadoop, MapReduce enables the use of many compute instances while minimizing uncertainties related to data partitioning, distributed processing, replication, and fault tolerance. Indeed, MapReduce is extensively used at the enterprise level by companies such as Google, eBay, Yahoo, Facebook, LinkedIn, and Quantcast [36]. Cloud-based computing boasts advantages in accessibility, availability, and affordability, where multiple grids can be abstracted and coordinated to appear as a single resource. Moreover, cloud systems require only an Internet connection for round-the-clock access, and once data exists on the cloud it can be easily shared with others over high-bandwidth connections. Finally, many cloud-computing services are pay-as-you-go, providing significant cost improvements over the large up-front centralized infrastructure and personnel investments associated with traditional grid systems.

MapReduce implementations such as Hadoop allow users to more easily develop scalable code to run on thousands of compute instances. Hadoop enables high-throughput parallel computing on a variety of machine types through the simplicity of its computational model and easily configured infrastructure. In MapReduce, users are only required to create two high-level functions, known as Map and Reduce, and then submit these to the system to be distributed and executed. The Map and Reduce compute stages are separated by an extensive communication phase known as "the shuffle", where keyed output from the map phase is routed to reduce functions (Figure 1) [7]. Although it is necessary to configure Hadoop for optimal system performance, e.g., the number of reducers or machine-specific memory parameters, Hadoop configuration is generally less cumbersome than the data partitioning, fault tolerance, and concurrency issues encountered with traditional grid systems. Indeed, Microsoft Azure, Google Cloud Services, and Amazon Web Services (AWS) offer cloud storage or computational instances interfacing directly or indirectly with Hadoop-based services. Moreover, companies such as Cloudera and Hortonworks provide specialized Hadoop distributions that support multiple data processing and management applications to build client-driven workflows [13, 16, 22, 27, 36].
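To make the Map and Reduce contract concrete, the following is a minimal Hadoop Streaming word-count pair in Python, the scripting language used throughout the pipeline described later. It illustrates the programming model only and is not part of Koonkie; the streaming jar path in the comment is an assumption that varies by installation.

```python
#!/usr/bin/env python
"""Minimal Hadoop Streaming example (illustrative only, not Koonkie code).

Typically run as two scripts through the streaming jar, e.g.:
  hadoop jar hadoop-streaming.jar -mapper "wordcount.py map" \
      -reducer "wordcount.py reduce" -input /data/in -output /data/out
"""
import sys
from itertools import groupby


def mapper(stream=sys.stdin, out=sys.stdout):
    """Map phase: emit one 'key<TAB>value' line per word in the input split."""
    for line in stream:
        for word in line.split():
            out.write("%s\t1\n" % word)


def reducer(stream=sys.stdin, out=sys.stdout):
    """Reduce phase: lines arrive grouped and sorted by key after the shuffle."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in stream)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        out.write("%s\t%d\n" % (key, total))


if __name__ == "__main__":
    # One file can serve both roles; the role is chosen on the command line.
    mapper() if sys.argv[1:] == ["map"] else reducer()
```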


Figure 1: The structure of a MapReduce job is centred around three distributed stages: Map, Shuffle, and Reduce. First, input data are split into separate blocks by the Hadoop Distributed File System (HDFS) (1). In the Map phase, input data splits are distributed to mapper instances where the Map tasks are performed, outputting the resulting key-value pairs (2). The Shuffle phase then merges and sorts the key-value pairs from the Map phase and distributes them to their appropriate reduce instances (3). In the Reduce phase, the reduce tasks are performed on the sorted key-value pairs received from the shuffle (4), outputting the results of the reduce function (5). Note that all steps acquire their required resources (input and support data) over the network via the HDFS.

1.1

Big Data Problems in Computational Biology and Bioinformatics

High-throughput next-generation sequencing technologies are rapidly transforming biology into an information science and creating bioinformatics challenges and opportunities in areas as diverse as microbial ecology [17], personalized medicine [10], and synthetic biology [3]. For example, the Human Genome and Human Microbiome Projects have instigated intensive interest in personalized medicine, incorporating point-of-care diagnostics and Big Data science spanning multiple levels of biological information (DNA, RNA, protein, and metabolites) and metabolic network interactions within the host milieu [4, 6, 23]. Many tasks in computational biology revolve around processing increasingly large DNA sequence datasets represented as unstructured text. Fortunately, much of this processing is embarrassingly parallel, so it is common to split sequence datasets amongst worker computational resources for parallel computation. Despite this, many tasks, such as homology searches using the Basic Local Alignment Search Tool (BLAST) [2], may still require months of continuous computation. Without fault tolerance, these large homology searches may never reach completion, a shortcoming of existing high-throughput computational frameworks like Message Passing Interface (MPI) systems. Here, Hadoop's shared-nothing, redundancy-backed fault tolerance ensures that a task will eventually reach completion, and its readily accessible interface for text data makes Hadoop attractive for large biological sequence datasets. Furthermore, many bioinformatics tools and pipelines are written in the Python programming language, which is readily supported through Hadoop Streaming. Hadoop's widespread deployment in many cloud-computing setups and its flexible use of commodity hardware mean that many tasks implemented using MapReduce can be readily deployed in a wide variety of HPC environments.

Recently there have been specific applications of Hadoop in bioinformatics [35], including BioPig [26], an extension of the Pig data-processing language that translates queries over sequence data stored on the Hadoop Distributed File System (HDFS) into MapReduce tasks [27, 31]; CloudBurst, for highly parallelized sequence read-mapping using Hadoop [30]; and Biodoop, which implements a BLAST-based gene-set enrichment analysis and gene association mapping in Hadoop [20]. Despite these specific applications, a general-purpose software tool supporting environmental sequence processing relevant to microbial ecology and human microbiomics using Hadoop remains to be developed and implemented on the cloud. Bioinformatics pipelines depend on a variety of software and system library requirements, input and support files, and flexibility of usage that is more dynamic and heterogeneous than common numerical or natural-language processing tasks conducted on the cloud. For example, sequence processing from assembly and read mapping to open reading frame (ORF) prediction and functional annotation incorporates multiple software tools with different input and output formats, ranging from stand-alone scripts written in Python, Perl, or Java to executables compiled from C or C++ (e.g., ABySS, bwa, Prodigal, LAST, CD-HIT, RAxML) [32, 21, 15, 18, 9, 34]. Although pipelines with predefined processing steps are routinely used on environmental sequence information (e.g., MetaPathways, MG-RAST, HUMAnN) [19, 24, 1], many in-house pipelines are designed ad hoc for a particular purpose. For example, a bioinformatician might want to deploy a custom pipeline incorporating several short-read sequence assembly algorithms to obtain optimal contig mapping. However, the input and output requirements of each algorithm might require local support of large database files that are not amenable to splitting. Along the same lines, an end-user without extensive computing experience might want to conduct a massive homology search of unknown environmental sequences against one or more reference databases using BLAST, finally parsing the BLAST output with a specialized script to assign functional annotations; such homology searches are prominent computational bottlenecks and are growing in complexity and difficulty with the increasing size of subject and query datasets. Today, most environmental sequence analysis pipelines use traditional high-performance computing techniques. However, such computational systems share common challenges, including maintenance, staffing, heterogeneous setup and use, idiosyncratic or asynchronous performance between compute nodes and clusters, and lack of on-demand throughput for time-sensitive jobs. While efforts have been made to circumvent these challenges using a master-worker


model [12], the use of cloud-computing services and MapReduce has the potential to transform environmental sequence analysis in a scalable and cost-effective manner, a prerequisite for personalized medical services or comprehensive comparative genomics projects.

1.2


Koonkie

Here we present Koonkie1 [25], a flexible Hadoop-based software tool that can be used effectively for scalable processing of environmental sequence information. Koonkie can automatically launch, connect, and configure cloud-based services with specialized bioinformatics applications requiring large local reference resources. Thus, Koonkie enables users to navigate and harness the cloud without extensive coding experience or infrastructure maintenance costs. First, we review the Hadoop distributed file system (HDFS), the issues it has with local reads on worker instances, and how these can drastically increase the communication costs and overall completion time of Hadoop jobs with large side-data distributions. Next, we observe how the recent addition of the YARN resource management system in Hadoop 2.x provides features that can be exploited to reduce communication costs, and how this reduction improves performance when processing heterogeneous datasets. We then describe how these features were implemented in the Koonkie software, and follow up with an example pipeline and a theoretical discussion of runtime. Finally, we empirically demonstrate Koonkie's performance improvements on computationally heavy BLAST jobs using an assembled Illumina sample from the Human Microbiome Project [37] on the AWS cloud.

1.3

Challenges with MapReduce and HDFS

In a classic Hadoop 1.x run, all necessary resources are obtained from the Hadoop distributed file system (HDFS), where files are stored together as cohesive blocks. Before a Hadoop run can be launched, all files involved must first be uploaded to the HDFS and then retrieved from the HDFS onto the worker nodes when required for computation. This communication cost is an ongoing concern, and the HDFS attempts to maintain data locality by replicating multiple copies of data blocks across servers. Typically each data block is replicated three times, twice on different nodes of the same rack and once on a different rack, striking a balance between reliability, read and write performance, and block distribution across the cluster. However, if jobs have large side-data requirements, communication costs tend to dominate. Moreover, under the JobTracker model, workers were unaware of resources already available on their local disk from one process to the next. Thus, the HDFS may be queried for the same resources multiple times on the same node, a clear communication redundancy. Although Hadoop can potentially make supporting resources available locally via its Job Configuration and Distributed Cache [39], these methods suffer from bandwidth limitations when handling the large sequence databases required for BLAST-based searches of large next-generation sequence datasets via HDFS.

Recently, features added with the Yet Another Resource Negotiator (YARN) in Hadoop v2 can be utilized to address the distribution of large side-data files. YARN splits the resource-allocation responsibilities of the JobTracker into three collaborating managers: the ResourceManager (RM), the NodeManager (NM), and the ApplicationMaster (AM) [38]. While this collaboration addresses a number of outstanding issues related to distributed scheduling, fault tolerance, and resource allocation, of particular interest here is the NM. As a Java process, the NM can directly access the local disks of distributed workers, allowing HDFS queries to be bypassed in favour of local reads. Koonkie exploits this novel feature of YARN to enable local access to large support files common to tasks in biological sequence analysis, such as BLAST reference databases, as an alternative to querying the HDFS and suffering stifling communication costs.

1 A koonkie is a strong domesticated elephant used to catch wild elephants for domestication in Assam, a state in India situated at the foothills of the Himalayas.

Figure 2: Koonkie uses local reads and writes to decrease communication time. Improvements in resource allocation via YARN in Hadoop 2.x allow the use of local reads and writes at the worker nodes for efficient access to necessary side-data files (A). Previous versions of Hadoop obtained all input and necessary side data directly from the HDFS, significantly increasing communication costs in tasks with large side-data distributions (B).

2

Koonkie Design

Koonkie takes full advantage of two new features, short-circuit reads and centralized cache management (CCM), to implement local file system reads and writes by redirecting Hadoop resource localization calls (Figure 2). So-called short-circuit reads allow the NM to use Unix domain sockets for local file system access of HDFS blocks, bypassing the DataNode. Short-circuiting avoids the communication cost of the NodeManager fetching the data from the DataNode and HDFS, which can be substantial if the data/block size is large. The CCM supports these reads by holding frequently accessed file paths in memory on every worker node, allowing the NM to quickly look up file locations without consulting the NameNode. In Koonkie, we implemented the CCM to act as a lookup table for large support files like BLAST reference databases. Here we will describe how the short-circuit reads and


CCM writes are used in a Koonkie Hadoop job. The job first scans its designated directories for the required files specified in its configuration file or on the command line. If the files are found, the NodeManager queries the local disk rather than the HDFS. If the required files are not found, the blocks containing the files are downloaded from the HDFS to the local disk of the Hadoop task instance. To avoid repeating this communication cost, Koonkie saves the retrieved files to the local task disk, and the YARN NodeManager is updated to query the local copies should they be needed for future computations. Provided that the local disk does not run out of space, this local-read procedure ensures that a Koonkie Hadoop job will always perform at least as well as it would under previous implementations of Hadoop.
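The control flow just described can be pictured with the following Python sketch. The helper and path names are hypothetical (Koonkie's actual localization logic lives in its Java/YARN integration), but the pattern is the same: reuse a side-data file if it already sits on the task's local disk, otherwise pull it from the HDFS once and let subsequent tasks on that node find the local copy.

```python
#!/usr/bin/env python
"""Sketch of Koonkie-style side-data localization (hypothetical helper names).

Assumes short-circuit local reads are enabled in hdfs-site.xml, e.g.:
    dfs.client.read.shortcircuit = true
    dfs.domain.socket.path       = /var/run/hadoop-hdfs/dn_socket
and that frequently used files (such as BLAST databases) can be pinned with
centralized cache management, e.g.:
    hdfs cacheadmin -addPool koonkie
    hdfs cacheadmin -addDirective -path /refdb/refseq_v61.faa -pool koonkie
"""
import os
import subprocess

# Local directories scanned for side data, as listed in the job configuration.
LOCAL_SIDE_DATA_DIRS = ["/mnt/koonkie/refdb", "/tmp/koonkie"]


def localize(hdfs_path):
    """Return a local path for an HDFS side-data file, copying it only if needed."""
    filename = os.path.basename(hdfs_path)

    # 1. Scan the designated local directories first (no HDFS traffic).
    for directory in LOCAL_SIDE_DATA_DIRS:
        candidate = os.path.join(directory, filename)
        if os.path.exists(candidate):
            return candidate

    # 2. Otherwise copy the blocks from the HDFS to the task's local disk once.
    target_dir = LOCAL_SIDE_DATA_DIRS[0]
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    target = os.path.join(target_dir, filename)
    subprocess.check_call(["hdfs", "dfs", "-get", hdfs_path, target])

    # 3. Later tasks on this node will now hit the local copy in step 1.
    return target


if __name__ == "__main__":
    print(localize("/refdb/refseq_v61.faa"))
```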

2.1

AWS and StarCluster

Although commercial Hadoop services exist, none allow a customizable Hadoop installation with private support files, meaning the burdensome transfer of required files for every setup incurs needless communication, bandwidth, and runtime costs [31]. Therefore, we decided to build our own custom Amazon Machine Image (AMI) for an environmental sequence annotation task, preloading the image with all the necessary support files, e.g., a large reference sequence database. AWS offers the ability to requisition a wide variety and volume of machines on demand. In Koonkie, we integrated StarCluster (http://star.mit.edu/cluster/), a Python package developed at MIT that automates AWS cluster setup, with our custom images. StarCluster installs Sun Grid Engine (SGE) for queuing and requisitions a number of instances loaded with a specified image. The cluster is then prepared by opening ports and conducting key exchanges to set up the master-worker model. Once cluster configuration has completed, a series of Python scripts called plugins can be invoked to further set up the cluster. We used plugins to install and configure Hadoop and the required binaries across a collection of requisitioned AWS machines. For image storage and implementation, the Amazon S3 image limit of 10 GB per file is simply too restrictive for large sequence databases. We therefore chose EBS-backed images with a 1 TB limit on image size, which is more in line with our database requirements.
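As an illustration of the plugin mechanism, the sketch below shows the general shape of a StarCluster plugin that unpacks and starts Hadoop on every requisitioned node. It is a simplified stand-in for Koonkie's actual plugins; the tarball location, install paths, and class name are assumptions.

```python
from starcluster.clustersetup import ClusterSetup
from starcluster.logger import log


class HadoopSetup(ClusterSetup):
    """Simplified StarCluster plugin: unpack and start Hadoop on each node."""

    def __init__(self, hadoop_tarball="/opt/hadoop-2.4.0.tar.gz"):
        # The tarball and reference databases are assumed to be baked into
        # the custom EBS-backed AMI, so nothing large crosses the network here.
        self.hadoop_tarball = hadoop_tarball

    def run(self, nodes, master, user, user_shell, volumes):
        for node in nodes:
            log.info("Unpacking Hadoop on %s" % node.alias)
            node.ssh.execute("tar xzf %s -C /opt" % self.hadoop_tarball)
        # Format the namenode on the master and bring up HDFS and YARN.
        master.ssh.execute("/opt/hadoop-2.4.0/bin/hdfs namenode -format -force")
        master.ssh.execute("/opt/hadoop-2.4.0/sbin/start-dfs.sh")
        master.ssh.execute("/opt/hadoop-2.4.0/sbin/start-yarn.sh")
```

A plugin like this is registered in the StarCluster configuration file and listed in the cluster template, so that it runs automatically after node provisioning and key exchange.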

2.2

A Sequence Analysis Pipeline

To demonstrate the utility of Koonkie in processing environmental sequence information using Hadoop, we developed an example pipeline performing two common bioinformatics tasks: sequence homology search with the seed-and-extend algorithm BLAST, and a popular taxonomic summary algorithm, the Lowest Common Ancestor (LCA). The seven stages of the pipeline involve a wide variety of inputs and outputs in terms of format, structure, and volume of data (Figure 3). The stages also require a wide range of algorithms and implementations, ranging from custom Python scripts to precompiled executables written in C/C++. The pipeline starts with assembled sequences, which are processed for

quality control; gene/open reading frame (ORF) predictions are then performed, BLAST and LCA searches are conducted, and a final summary table is generated. There are multiple ways of implementing the stages of the pipeline, but not all tasks benefit from massive parallelization with Hadoop. Therefore, we use the following theoretical metrics of the input files to decide on the most efficient implementation: runtime, the estimated time for the Hadoop system to run a stage of the pipeline in the asymptotic Big-O sense, and bit complexity, the number of bits sent by the system from one instance to another across the network during a pipeline stage. For any stage of the pipeline, let n_I be the size of the input and n_R the size of support files that are constant between runs, such as the reference protein databases.

1. Preprocessing: Input sequences are preprocessed to remove sequences below a user-defined length threshold, generally 180 base pairs, after splicing the sequences into smaller fragments wherever ambiguous bases appear (a minimal sketch of this filter appears after this list). The input size is O(n_I), so it is more efficient to implement this job locally on the master instance with zero bit complexity.

2. ORF prediction: Using the filtered sequences from Stage (1), ORFs are predicted using the Prokaryotic Dynamic Programming Gene-finding Algorithm (Prodigal), which can detect incomplete or fragmentary ORFs [15]. The output from Prodigal contains gene coordinate information, nucleotide, and conceptually translated amino acid sequences for predicted ORFs, which are exported as files in two specialized file formats, GFF and FASTA. The Prodigal executable is a precompiled binary written in C that runs quickly; for example, a million sequences with an average length of 275 base pairs take roughly 6 minutes. In the context of a Hadoop run, it is more efficient to run this step on the master instance and incur zero bit complexity.

3. Self-BLAST RefScore calculation: The BLAST bit-score is a measure of the strength of homology between two sequences. In order to account for the length of the query sequence, it is common to normalize the bit-score by its ideal self-BLAST score as a BLAST-score ratio (BSR) [29]. There are two straightforward, but inefficient, methods of BSR computation. One is to create a query file and a database from the entire set of sequences and run BLAST of the query against that database. The other involves the same steps, except that one sequence is queried against the database at a time. In the first case, BLAST would compute approximately O(n_I^2) comparisons instead of the alternative O(n_I) comparisons. However, while the second approach performs only O(n_I) comparisons, it is also very slow, because each BLAST run involves two steps: (i) creation of an index of the target sequence and (ii) running BLAST comparisons of the query sequence against the indexed sequence. The sluggishness comes from the invocation overhead of the formatting and BLAST commands. Therefore, a reasonable solution is to partition the initial query file into blocks of k sequences, where k is on the order of tens to hundreds, and BLAST each blocked query file against its partnered database. This takes approximately O(k n_I) comparisons to compute the same result, and means we can submit BSR calculation as a MapReduce job with O(n_I/k) blocks, incurring O(n_I) rather than O(n_I^2) bit complexity.


Figure 3: Many sequence analysis pipelines proceed by way of many heterogeneous software steps. Starting with the raw sequences, the pipeline proceeds in seven independent steps: first the sequences are quality controlled (1) before proceeding with potential gene or open reading frame (ORF) prediction (2). Next, each ORF sequence is compared against itself to obtain a BLAST reference score (RefScore) for normalization (3). A parallel BLAST job is performed via Hadoop with its large reference databases stored locally on the worker nodes for fast reads and writes (4). BLAST results are filtered based on quality statistics (e.g., length, e-value, BLAST-score ratio) (5). Next, annotated ORFs have their taxonomy summarized by the Lowest Common Ancestor (LCA) algorithm (6). Finally, functional and taxonomic results from steps (5) and (6) are summarized on a per-ORF basis into a summary table, ready for downstream analysis (7).


4. BLAST: BLAST, an optimized approximation to the O(n^2) dynamic programming Smith-Waterman algorithm [33], is one of the most routinely performed bioinformatics tasks. Due to the complex and large nature of BLAST inputs, which include a formatted database and the set of query sequences, BLAST can become a computational bottleneck limiting pipeline performance on large datasets. On the other hand, naively implementing BLAST as a Hadoop task is not effective either: many reference sequence databases are larger than 30 GB, and since each search requires the complete database, each task adds O(n_R) to the bit complexity. If we use m BLAST processes with large reference databases like NCBI RefSeq that are significantly larger than our input sequences (n_R >> n_I), we cannot expect an m-fold speedup, because the total bit complexity O(m n_R) would become a memory bandwidth bottleneck. Therefore, we include the reference databases as part of the AMI, reducing the bit complexity to O(n_I) and attaining a speedup factor close to m, where m is the number of Reduce tasks running on separate reduce task instances (a sketch of such a task appears after this list). Moreover, this does not require any complicated setup on the part of the user.


5. Parsing and filtering BLAST results: The tabular BLAST results pertaining to each ORF from step (4) are commonly filtered against homology standards by a series of cut-off values. For each ORF, the hit with the best e-value (the expected number of matches occurring by chance) from each


reference database is selected and given an "information score" based on the number of informative words in its annotation (with uninformative words such as adverbs, articles, and auxiliary verbs removed). This step merely requires a single pass over the data by a Python script and is more suitable to run on the master instance with zero bit complexity.


6. Lowest Common Ancestor (LCA) algorithm: If the RefSeq protein database is used in step (4) [28], then each sequence is annotated with a taxonomic identity in addition to an enzymatic function. Since there are often multiple matches to multiple taxonomies, we can use the LCA rule to decide on a consensus taxonomy by climbing up the NCBI Taxonomy database hierarchy [14] (a sketch of the rule appears after this list). The LCA rule is implemented as a Python script that takes the filtered RefSeq BLAST outputs and the NCBI Taxonomy database. Although, in theory, this step needs to query a database like the BLAST step, the NCBI Taxonomy database is sufficiently small that it is better suited to local processing on the master instance.


7. Functional and taxonomic summary table: Final reports are generated using the outputs from Stages (5) and (6). Each row of the report consists of the functional and taxonomic annotations of an individual ORF. The script is implemented in Python and is better suited to being run on the master instance.
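As referenced in stage (1), the quality-control filter is a simple local script. The following is a minimal sketch of that filter, assuming FASTA input and the 180 bp default threshold; the function names are ours, not Koonkie's.

```python
import re

MIN_LENGTH = 180  # default minimum fragment length, in base pairs


def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def preprocess(path, min_length=MIN_LENGTH):
    """Splice sequences at ambiguous bases (anything other than A, C, G, or T)
    and keep only fragments of at least `min_length` base pairs."""
    for header, seq in read_fasta(path):
        for i, fragment in enumerate(re.split(r"[^ACGTacgt]+", seq)):
            if len(fragment) >= min_length:
                yield "%s_frag%d" % (header, i), fragment
```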
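The heart of stages (3) and (4), running BLAST inside a Hadoop task against a reference database that already sits on the worker's local disk, can be pictured with the hedged Hadoop Streaming sketch below. The one-record-per-line input format, the database path, and the BLAST parameters are assumptions for illustration; Koonkie's own task wrapper may differ, but the point is that only the O(n_I) query sequences travel over the network.

```python
#!/usr/bin/env python
"""Hadoop Streaming task: BLAST a block of ORFs against a local reference database.

Assumes BLAST+ is installed on every worker and that the formatted reference
database (e.g., RefSeq) was baked into the machine image, so it is read locally
rather than localized from the HDFS.
"""
import os
import subprocess
import sys
import tempfile

LOCAL_DB = "/mnt/koonkie/refdb/refseq_v61"  # hypothetical on-image database path


def run_block(records, db=LOCAL_DB, out=sys.stdout):
    """Write a block of (orf_id, protein_seq) records to a temporary FASTA file
    and run a single blastp invocation over the whole block, amortizing the
    program start-up and database-loading cost across the k sequences."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".faa", delete=False) as query:
        for orf_id, seq in records:
            query.write(">%s\n%s\n" % (orf_id, seq))
        query_path = query.name
    try:
        result = subprocess.check_output(
            ["blastp", "-query", query_path, "-db", db,
             "-outfmt", "6", "-evalue", "1e-5"])
        out.write(result.decode())
    finally:
        os.unlink(query_path)


if __name__ == "__main__":
    # Each input line is "orf_id<TAB>protein_sequence"; the shuffle delivers a
    # block of such records to every reduce task.
    block = [line.rstrip("\n").split("\t", 1) for line in sys.stdin if line.strip()]
    run_block(block)
```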
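For stage (6), the LCA rule itself is compact: walk each hit's lineage up the NCBI Taxonomy tree and return the deepest taxon shared by all hits. The sketch below assumes the taxonomy has already been parsed into a child-to-parent dictionary (e.g., from NCBI's nodes.dmp); it illustrates the rule rather than the exact script used in the pipeline.

```python
def lineage(taxid, parent, root=1):
    """Return the path from `taxid` up to the root of the taxonomy."""
    path = [taxid]
    while taxid != root:
        taxid = parent[taxid]
        path.append(taxid)
    return path


def lowest_common_ancestor(taxids, parent, root=1):
    """Return the deepest taxon shared by the lineages of every hit in `taxids`."""
    if not taxids:
        return root
    shared = None
    for taxid in taxids[1:]:
        ancestors = set(lineage(taxid, parent, root))
        shared = ancestors if shared is None else shared & ancestors
    if shared is None:  # a single hit is its own consensus taxonomy
        return taxids[0]
    # Walk the first lineage leaf-to-root; the first node shared by all is the LCA.
    for taxid in lineage(taxids[0], parent, root):
        if taxid in shared:
            return taxid
    return root


# Toy lineage: E. coli (562) and K. pneumoniae (573) meet at Enterobacteriaceae (543).
parent = {562: 561, 561: 543, 573: 570, 570: 543, 543: 91347,
          91347: 1236, 1236: 1224, 1224: 2, 2: 1}
assert lowest_common_ancestor([562, 573], parent) == 543
```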

3


Empirical Studies

In order to empirically demonstrate the functionality and performance of the Koonkie pipeline, we performed two general studies. In the first, using Human Microbiome Project sequence information sourced from a stool sample, we report the number of functionally and taxonomically annotated ORFs. Next, we isolate the Hadoop BLAST portion of the pipeline to show how BLAST performance improves with an increased number of requisitioned AWS compute nodes. In our AWS computational examples, we used m3.2xlarge instances with 8 virtual CPUs running on Intel Xeon E5-2670 v2 (Ivy Bridge) processors, 30 GB of memory, and 2x80 GB of local SSD instance storage running the Ubuntu 13.04 64-bit OS, using one reducer per machine in the US West (Oregon) AWS region.


Table 1: Functional and taxonomic analysis statistics.

Sample Attribute              | Value
Sample Name                   | stool SRS011405
Sequences                     | 84,380
Avg. Length                   | 1,118
Total Bases                   | 94,302,353
Predicted ORFs                | 141,472
Functionally Annotated ORFs   | 74,814
Taxonomically Annotated ORFs  | 74,814
AWS Instances                 | 40
Time Taken                    | 6 hrs, 23 min

With Koonkie, we processed a sample from the Human Microbiome Project (Table 1) with the pipeline of tasks described above and the NCBI RefSeq database (version 61), consisting of 22,929,834 sequences for a total size of 8.8 GB. Isolating the BLAST step, we find that BLAST compute time decreases with the number of requisitioned AWS instances, but approaches an asymptote with diminishing returns (Figure 4). This asymptote represents the time to load the RefSeq database into memory from disk, a fixed cost likely only improved by faster disks (i.e., solid-state).

Figure 4: Sequence homology runtime decreases with the number of AWS instances. 141,000 ORFs predicted from the Human Microbiome Project sample SRS011405 were annotated against the RefSeq database (22 million sequences) with the BLAST algorithm using progressively larger collections of m3.2xlarge AWS instances. Runtime decreases almost linearly with the number of AWS instances; however, a plateau is likely because of the fixed upfront time required by the BLAST program on each instance to load the RefSeq database and fixed Hadoop communication costs.

Another issue is the memory requirement of concurrent reduce tasks. Multiple reducers per node allow the use of multiple processors, but because reducers do not share memory, if all reducers load the same resource then the total memory usage on the node, M, will be the size of the resource, r, multiplied by the number of reducers, k (i.e., r × k). A possible solution in the case of BLAST is to split the compiled database into volumes of size r < M/k with the 'makeblastdb' command. However, this comes at the cost of additional local disk access, which, if all reducers access the same BLAST database split, will be slower due to the increased seek time of interleaving concurrent reads on the local disk. The current setup of Hadoop does not allow control over the number of reducers per node, making the above solution difficult to implement reliably in practice and representing a clear area of improvement for the YARN NodeManager.

To compare our results to existing Hadoop implementations, we ran our sample dataset on a set of v2.4.0 Hadoop binaries downloaded from the Apache FTP server. Without any specialized improvements, using 40 machines with 4 processors each, v2.4.0 Hadoop required 14 hours and 32 minutes to complete, which included localization of the resources from the HDFS by the YARN NodeManagers. In contrast, Koonkie required 6 hours and 23 minutes to accomplish the same task because it skipped the localization step, as each worker already had the database on its local disk. This represents a total speedup of approximately 2.5×. These comparisons show that at worst Koonkie will perform sequence homology tasks as fast as or better than standard Hadoop implementations (e.g., CloudBlast). Indeed, Koonkie will perform just as well with any other sample where the distribution of sequence lengths is similar, as load-balancing problems will only occur when samples have highly variable sequence-length distributions, causing tasks to finish at different times.
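Returning to the reducer-memory constraint above, a back-of-the-envelope calculation under assumed settings (one reducer per virtual CPU, the m3.2xlarge figures given earlier) makes the bound concrete; the snippet below is illustrative arithmetic, not a measurement.

```python
import math

memory_per_node_gb = 30.0   # M: m3.2xlarge RAM, from the instance description above
reducers_per_node = 8       # k: hypothetically one reducer per virtual CPU
database_gb = 8.8           # NCBI RefSeq v61 protein database used in this study

max_volume_gb = memory_per_node_gb / reducers_per_node        # bound: r < M / k
volumes_needed = int(math.ceil(database_gb / max_volume_gb))  # at least 3 volumes

print("r < %.2f GB per reducer; split RefSeq into at least %d volumes"
      % (max_volume_gb, volumes_needed))
```

Under these assumptions each reducer could hold at most 3.75 GB of database, so the 8.8 GB RefSeq build would need to be split into at least three volumes (for example, via makeblastdb's maximum file size option).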

4

Conclusions

Koonkie enables bioinformatics pipeline development with a wide variety of computational requirements on the cloud using Hadoop. We demonstrated this in our example pipeline using common functional and taxonomic annotations for environmental sequence information. By exploiting the novel features of short-circuit reads and centralized cache management implemented in Hadoop 2.x, Koonkie performs massively parallel Hadoop jobs with large side data distributions. For BLAST jobs requiring access to a complete copy of the NCBI reference database, this is a critical improvement. Moreover, we demonstrate how Koonkie performs large-scale homology searches on modern Illumina datasets that scale linearly with the number of requisitioned AWS nodes. Using our demonstration pipeline as an illustrative example, we also explored a theoretical framework for efficient choices of distributing particular pipeline stages based on asymptotic bit complexity, i.e., the number of bits sent


to the system. Finally, Koonkie extends the StarCluster framework to automatically requisition and configure AWS clusters on demand, ready for general pipeline development. Although Koonkie effectively utilizes AWS to provide an on-demand cloud computational environment, we would like to make Koonkie independent of the AWS cloud, allowing it to automatically install and set up a Hadoop service from any collection of compute machines. Enabling the use of low-cost nodes with less RAM and disk space than BLAST currently requires is a primary concern for bringing down the total cost of a Hadoop run. Improvements to YARN worker task memory management would allow jobs with large side-data distributions, like BLAST, to run more efficiently on machines with less memory, further reducing the cost of cloud-computing services. We could also see Koonkie self-optimizing pipeline tasks for the Hadoop framework based on their empirical bit complexity. Moreover, a flexible framework of accepted pipeline stages would allow users to mix and match pipeline stages and bring more flexibility to the development process. Koonkie code and tutorials are available at http://www.github.com/hallamlab/koonkie.

5

Acknowledgments

This work was carried out under the auspices of Genome Canada, Genome British Columbia, Genome Alberta, the Natural Sciences and Engineering Research Council (NSERC) of Canada, the Canada Foundation for Innovation (CFI), and the Canadian Institute for Advanced Research (CIFAR) through grants awarded to SJH. KMK was supported by the Tula Foundation funded Centre for Microbial Diversity and Evolution (CMDE). NWH was supported by a four-year doctoral fellowship (4YF) administered through the UBC Graduate Program in Bioinformatics.

References

[1] Sahar Abubucker, Nicola Segata, Johannes Goll, Alyxandria M Schubert, Jacques Izard, Brandi L Cantarel, and et al. Metabolic reconstruction for metagenomic data and its application to the human microbiome. PLoS computational biology, 8(6), 2012. [2] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. J Mol Biol, 215(3):403–410, October 1990. [3] E. Aronova, Karen S. Baker, and N. Oreskes. Big Science and Big Data in Biology. Historical Studies in the Natural Sciences, 40(2), 2010.

[6] Gregory M Cooper, Bradley P Coe, Santhosh Girirajan, Jill A. Rosenfeld, Tiffany H Vu, and et al. A copy number variation morbidity map of developmental delay. Nature genetics, 43(9):838–46, 2011. [7] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):1–13, 2008. [8] DJ DeWitt and E Paulson. Clustera: an integrated computation and data management system. Proceedings of the Very Large Databases (VLDB 2008), 1(212):28–41, 2008. [9] Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu, and Weizhong Li. CD-HIT: accelerated for clustering the nextgeneration sequencing data. Bioinformatics, 28(23):3150– 3152, December 2012. [10] Geoffry S Ginsburg and James J McCarthy. Personalized medicine: revolutionizing drug discovery and patient care. Trends in Biotechnology, 19(12):491–496, 2001. [11] Yunhong Gu and Robert L Grossman. Sector and Sphere: the design and implementation of a high-performance data cloud. Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences, 367(1897):2429–45, 2009. [12] Niels W Hanson, Kishori M Konwar, Shang-Ju Wu, and Steven J Hallam. MetaPathways v2.0: A master-worker model for environmental Pathway/Genome Database construction on grids and clouds. In Computational Intelligence in Bioinformatics and Computational Biology, 2014 IEEE Conference on, pages 1–7, May 2014. [13] Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. Zookeeper: Wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Conference, USENIXATC’10, pages 11–11, 2010. [14] Daniel H Huson, Alexander F Auch, Ji Qi, and Stephan C Schuster. MEGAN analysis of metagenomic data. Genome Research, 17(3):377–386, 2007. [15] Doug Hyatt, Gwo-Liang Chen, Philip F Locascio, Miriam L Land, Frank W Larimer, and Loren J Hauser. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11:119, 2010. [16] Mohammad Islam, Angelo K. Huang, Mohamed Battisha, Michelle Chiang, Santhosh Srinivasan, and et al. Oozie: Towards a scalable workflow management system for hadoop. In Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies, pages 1–10, 2012. [17] CM Jessup, Rees Kassen, and SE Forde. Big questions, small worlds: microbial model systems in ecology. Trends in Ecology and Evolution, 19(4):189–197, 2004. [18] Szymon M Kielbasa, Raymond Wan, Kengo Sato, Paul Horton, and Martin C Frith. Adaptive seeds tame genomic sequence comparison. Genome Res, 21(3):487–493, March 2011.

[4] Manimozhiyan Arumugam, Jeroen Raes, Eric Pelletier, Denis Le Paslier, Takuji Yamada, and et al. Enterotypes of the human gut microbiome. Nature, 473(7346):174–80, 2011.

[19] Kishori M Konwar, Niels W Hanson, Antoine P Pagé, and Steven J Hallam. MetaPathways: a modular pipeline for constructing pathway/genome databases from environmental sequence information. BMC Bioinformatics, 14(1):202, 2013.

[5] Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, and et al. SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets. Proceedings of the Very Large Databases (VLDB 2008), 2008.

[20] Simone Leo, Federico Santoni, and Gianluigi Zanetti. Biodoop: Bioinformatics on Hadoop. Proceedings of the 2009 International Conference on Parallel Processing Workshops, pages 415–422, 2009.



[21] Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26(5):589–595, March 2010.

[38] V.K. Vavilapalli and A.C. Murthy. Apache hadoop yarn: Yet another resource negotiator. Proceedings of the ACM Symposium on Cloud Computing (SOCC 2013), 2013.

[22] Matthew L Massie, Brent N Chun, and David E Culler. The ganglia distributed monitoring system: design, implementation, and experience. Parallel Computing, 30(7):817–840, 2004.

[39] Tom White. Hadoop: The Definitive Guide. O’Reilly Media, Inc., 1st edition, 2009.

[23] Mark I McCarthy, Gonçalo R Abecasis, Lon R Cardon, David B Goldstein, Julian Little, and et al. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature Reviews Genetics, 9(5):356–69, 2008. [24] Folker Meyer, D Paarmann, M D'Souza, R Olson, E M Glass, and et al. The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics, 9:386, 2008. [25] A.J.W. Milroy. Elephant catching in Assam. Asian Elephant Specialist Group Newsletter, (8):38–45, 1927. [26] Henrik Nordberg, Karan Bhatia, Kai Wang, and Zhong Wang. BioPig: a Hadoop-based analytic toolkit for large-scale sequence data. Bioinformatics, 29(23):3014–9, 2013. [27] Christopher Olston, Benjamin Reed, and U Srivastava. Pig Latin: a not-so-foreign language for data processing. Proceedings of ACM SIGMOD 2008, 2008. [28] Kim D Pruitt, Tatiana Tatusova, and Donna R Maglott. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research, 33(Database issue):D501–4, 2005. [29] David A Rasko, Garry S A Myers, and Jacques Ravel. Visualization of comparative genomic analyses by BLAST score ratio. BMC Bioinformatics, 6:2, 2005. [30] Michael C Schatz. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25(11):1363–9, 2009. [31] K. Shvachko, Hairong Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1–10, May 2010. [32] Jared T Simpson, Kim Wong, Shaun D Jackman, Jacqueline E Schein, Steven J M Jones, and Inanç Birol. ABySS: A parallel assembler for short read sequence data. Genome Research, 19(6):1117–1123, 2009. [33] Temple F Smith and Michael S Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–7, 1981. [34] A Stamatakis, A J Aberer, C Goll, S A Smith, S A Berger, and F Izquierdo-Carrasco. RAxML-Light: a tool for computing terabyte phylogenies. Bioinformatics, 28(15):2064–2066, July 2012. [35] Ronald C Taylor. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics, 11(Suppl 12):S1, 2010. [36] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, and et al. Hive - a petabyte scale data warehouse using Hadoop. Proceedings of the IEEE 26th International Conference on Data Engineering (ICDE 2010), pages 996–1005, 2010. [37] Peter J Turnbaugh, Ruth E Ley, Micah Hamady, Claire M Fraser-Liggett, Rob Knight, and Jeffrey I Gordon. The Human Microbiome Project. Nature, 449(7164):804–810, October 2007.
