Posters

EMBnet.journal 19.A

Scripting for large-scale sequencing based on Hadoop

André Schumacher1,2,3, Luca Pireddu4, Aleksi Kallio5, Matti Niemenmaa6, Eija Korpelainen5, Gianluigi Zanetti4, Keijo Heljanko2,3

1 ICSI, Berkeley, USA
2 Helsinki Institute for Information Technology HIIT, Helsinki, Finland
3 Aalto University, Espoo, Finland
4 CRS4, Pula, Italy
5 CSC-IT Center for Science, Helsinki, Finland
6 Aalto University, Espoo, Finland

Motivation and Objectives

The large volumes of data generated by modern sequencing experiments present significant challenges in their manipulation and analysis. Traditional approaches, such as scripting and relational database queries, are often found to be inadequate, frustratingly slow, or complicated to scale. These problems have already been faced in the “big data revolution” in data-based activities, which has produced novel computational paradigms such as MapReduce and scalable tools such as Hadoop and Pig.

We describe our ongoing work on SeqPig, a tool that facilitates the use of the Pig Latin scripting language to manipulate, analyze and query sequencing data. SeqPig provides access to popular data formats and implements a number of high-level functions. Most importantly, it gives users access to Hadoop, a platform of proven scalability, from a high-level scripting language, whether the cluster is run locally or in the cloud.

Methods

SeqPig operates on top of Hadoop and Pig and augments them to facilitate the processing of sequencing data. Hadoop is a distributed computing framework that implements the MapReduce programming model, which expresses computations as sequences of side-effect-free Map and Reduce functions. Hadoop was initially developed at Yahoo!, but has since been widely adopted, e.g. by Facebook, Twitter and LinkedIn. Pig is a set-based scripting language whose instructions are compiled to a sequence of MapReduce jobs, which are then executed on a Hadoop cluster. It effectively simplifies the use of a Hadoop cluster through its concise SQL-like logic. Both Hadoop and Pig are projects supported by the Apache Software Foundation (http://hadoop.apache.org, http://pig.apache.org).

SeqPig

SeqPig extends Pig with a number of features and functionalities conceived for processing sequencing data. Specifically, it provides: 1) data input and output components, 2) specialized functions to extract fields and to transform data and 3) a collection of scripts for frequent tasks (e.g., pileup, QC statistics).

SeqPig provides import and export functions for file formats commonly used for sequencing data: Fastq, Qseq, SAM and BAM. SeqPig supports ad hoc (scripted or even interactive) distributed manipulation and analysis of large sequencing datasets. Unlike traditional methods, the scalable nature of Pig allows the speed of its operations to scale with the computing resources available. SeqPig includes functions to access SAM flags, split reads by base (for computing base-level statistics), reverse-complement reads, calculate read reference positions in a mapping (for pile-ups, extracting SNP positions), and more.

The authors are currently working on expanding the library of functions, and SeqPig is an open source project that welcomes and encourages contributions from the community.

Using cloud-based resources

SeqPig has been tested on Amazon’s Elastic MapReduce service. Users may rent computing time on the cloud to run their SeqPig scripts, and even share their S3 storage buckets with other cloud-enabled software.

Dependencies

SeqPig builds on Hadoop-BAM (Niemenmaa et al., 2012), Seal (Pireddu et al., 2011), and Picard (http://picard.sourceforge.net). Hadoop-BAM implements a number of file formats for Hadoop, while Seal and Picard implement some of the
sequence analysis functionality that SeqPig exposes at a higher level.

Results and Discussion

SeqPig enables the manipulation and analysis of sequencing data on the Hadoop big-data computational platform. At CRS4, SeqPig is already used routinely for some steps in the production workflow; in addition, SeqPig scripts have been used for ad hoc investigations into data quality issues, comparison of alignment tools, and reformatting or packaging data. In the future we plan to expand its function library and thoroughly test its scalability and performance characteristics.

Acknowledgements

This work was supported by the Cloud Software and D2I Programs of the Finnish Strategic Centre for Science, Technology and Innovation TIVIT and by the Sardinian (Italy) Regional Grant L7-2010/COBIK.

References

Niemenmaa M, Kallio A, Schumacher A, Klemelä P, Korpelainen E, and Heljanko K. (2012) Hadoop-BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics 28(6):876-877. doi:10.1093/bioinformatics/bts054

Pireddu L, Leo S, and Zanetti G. (2011) SEAL: a distributed short read mapping and duplicate removal tool. Bioinformatics 27(15):2159-2160. doi:10.1093/bioinformatics/btr325
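
As an illustration of the read-level operations mentioned in the SeqPig section (accessing SAM flags, reverse-complementing reads, and splitting reads by base for base-level statistics), the following is a minimal Python sketch. It is not SeqPig code and does not use the SeqPig, Pig, or Hadoop APIs; all function names are hypothetical, and the map and reduce steps only mimic the MapReduce model in a single process. The flag bit values are the ones defined in the SAM format specification.

```python
from collections import Counter
from functools import reduce

# Flag bits as defined in the SAM format specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

# Translation table for complementing bases (hypothetical helper).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def decode_flags(flag):
    """Return the names of all flag bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def map_read(seq):
    """Map step: split a read by base, counting each (position, base) pair."""
    return Counter((pos, base) for pos, base in enumerate(seq))


def reduce_counts(left, right):
    """Reduce step: merge two partial count tables."""
    return left + right


def base_counts(reads):
    """Per-position base counts over all reads, in map/reduce style."""
    return reduce(reduce_counts, map(map_read, reads), Counter())


if __name__ == "__main__":
    print(decode_flags(99))
    print(reverse_complement("ACGTN"))
    print(base_counts(["ACG", "ACT"]))
```

Because the map and reduce steps are free of side effects, a framework such as Hadoop can run the map step on many reads in parallel and merge the partial counts in any order, which is the property the MapReduce model described in the Methods section relies on.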
