

84

Posters

EMBnet.journal 19.A

Scripting for large-scale sequencing based on Hadoop

André Schumacher¹,²,³, Luca Pireddu⁴, Aleksi Kallio⁵, Matti Niemenmaa⁶, Eija Korpelainen⁵, Gianluigi Zanetti⁴, Keijo Heljanko²,³

¹ ICSI, Berkeley, USA
² Helsinki Institute for Information Technology HIIT, Helsinki, Finland
³ Aalto University, Espoo, Finland
⁴ CRS4, Pula, Italy
⁵ CSC-IT Center for Science, Helsinki, Finland
⁶ Aalto University, Espoo, Finland

Motivation and Objectives

The large volumes of data generated by modern sequencing experiments present significant challenges in their manipulation and analysis. Traditional approaches, such as scripting and relational database queries, are often found to be inadequate, frustratingly slow, or complicated to scale. These problems have already been faced in the “big data revolution” in data-based activities, which has resulted in novel computational paradigms such as MapReduce and scalable tools such as Hadoop and Pig.

We describe our ongoing work on SeqPig, a tool that facilitates the use of the Pig Latin scripting language to manipulate, analyze and query sequencing data. SeqPig provides access to popular data formats and implements a number of high-level functions. Most importantly, it gives users access to Hadoop, a platform of proven scalability, from a high-level scripting language, whether the cluster is run locally or in the cloud.

Methods

SeqPig operates on top of Hadoop and Pig and augments them to facilitate their use to process sequencing data. Hadoop is a distributed computing framework that implements the MapReduce programming model, which expresses computations as sequences of side-effect-free Map and Reduce functions. Hadoop was initially developed at Yahoo!, but has since been widely adopted, e.g. by Facebook, Twitter and LinkedIn. Pig is a set-based scripting language whose instructions are compiled to a sequence of MapReduce jobs, which are then executed on a Hadoop cluster. It effectively simplifies the use of a Hadoop cluster through its concise SQL-like logic. Both Hadoop and Pig are projects supported by the Apache Software Foundation (http://hadoop.apache.org, http://pig.apache.org).

SeqPig

SeqPig extends Pig with a number of features and functionalities conceived for processing sequencing data. Specifically, it provides: 1) data input and output components, 2) specialized functions to extract fields and to transform data and 3) a collection of scripts for frequent tasks (e.g., pileup, QC statistics).

SeqPig provides import and export functions for file formats commonly used for sequencing data: Fastq, Qseq, SAM and BAM. SeqPig supports ad hoc (scripted or even interactive) distributed manipulation and analysis of large sequencing datasets. Unlike traditional methods, the scalable nature of Pig allows the speed of its operations to scale with the computing resources available. SeqPig includes functions to access SAM flags, split reads by base (for computing base-level statistics), reverse-complement reads, calculate read reference positions in a mapping (for pile-ups, extracting SNP positions), and more. The authors are currently working on expanding the library of functions. SeqPig is an open source project that welcomes and encourages contributions from the community.

Using cloud-based resources

SeqPig has been tested on Amazon’s Elastic MapReduce service. Users may rent computing time on the cloud to run their SeqPig scripts, and even share their S3 storage buckets with other cloud-enabled software.

Dependencies

SeqPig builds on Hadoop-BAM (Niemenmaa et al., 2012), Seal (Pireddu et al., 2011), and Picard (http://picard.sourceforge.net). Hadoop-BAM implements a number of file formats for Hadoop, while Seal and Picard implement some of the sequence analysis functionality that SeqPig exposes at a higher level.

Results and Discussion

SeqPig enables the manipulation and analysis of sequencing data on the Hadoop big-data computational platform. At CRS4, SeqPig is already used routinely for some steps in the production workflow; in addition, SeqPig scripts have been used for ad hoc investigations into data quality issues, comparison of alignment tools, and reformatting or packaging data. In the future we plan to expand its function library and thoroughly test its scalability and performance characteristics.

Acknowledgements

This work was supported by the Cloud Software and D2I Programs of the Finnish Strategic Centre for Science, Technology and Innovation TIVIT and by the Sardinian (Italy) Regional Grant L7-2010/COBIK.

References

Niemenmaa M, Kallio A, Schumacher A, Klemelä P, Korpelainen E, and Heljanko K. (2012) Hadoop-BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics 28(6):876-877. doi:10.1093/bioinformatics/bts054

Pireddu L, Leo S, and Zanetti G. (2011) SEAL: a distributed short read mapping and duplicate removal tool. Bioinformatics 27(15):2159-2160. doi:10.1093/bioinformatics/btr325
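
Illustrative sketch

The MapReduce model described under Methods, and two of the read-level operations listed for SeqPig (splitting reads by base and reverse-complementing reads), can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses neither Hadoop, Pig, nor the SeqPig function library, and the names `READS`, `map_read`, `reduce_counts`, `base_counts` and `reverse_complement` are hypothetical, chosen for this example. The map and reduce functions are written without side effects, which is the property that allows a framework such as Hadoop to run them in parallel over partitions of the input.

```python
from collections import Counter
from functools import reduce

# Hypothetical input: (read_name, bases) pairs standing in for sequencing reads.
READS = [("r1", "ACGT"), ("r2", "AACN"), ("r3", "GGTA")]

# Translation table for complementing upper-case DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(bases):
    """Return the reverse complement of an upper-case DNA string."""
    return bases.translate(_COMPLEMENT)[::-1]


def map_read(read):
    """Map step: split one read by base, emitting (base, 1) pairs."""
    _name, bases = read
    return [(base, 1) for base in bases]


def reduce_counts(acc, pair):
    """Reduce step: fold one (base, count) pair into a new Counter.

    A new Counter is returned each time, so the accumulator passed in
    is never mutated.
    """
    base, count = pair
    return acc + Counter({base: count})


def base_counts(reads):
    """Chain the map and reduce steps to get per-base counts over all reads."""
    pairs = (pair for read in reads for pair in map_read(read))
    return reduce(reduce_counts, pairs, Counter())


if __name__ == "__main__":
    print(dict(base_counts(READS)))
    print(reverse_complement("AACN"))
```

On a real cluster the map step would run on many input splits at once and the framework would group the emitted pairs by key before reducing; the sequential `reduce` call here stands in for that machinery.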
