Scalable Computing Methods for Processing Next Generation Sequencing Data in the Cloud

Keijo Heljanko¹, Matti Niemenmaa¹, André Schumacher¹, Aleksi Kallio², Eija Korpelainen², and Petri Klemelä²

¹ Department of Information and Computer Science, School of Science, Aalto University ([email protected])
² CSC — IT Center for Science ([email protected])

14.5.2012

Next Generation Sequencing

- Europe has more than 200 Next Generation Sequencing systems, with a combined capacity to sequence more than 300 human genomes per day
- The amount of NGS data worldwide is predicted to double every 5 months
- The datasets are large: the 1000 Genomes Project (http://www.1000genomes.org) is a freely available 50 TB data set of human genomes
- Without cloud computing methods, genomics research will hit infrastructural limits

Genomics Research

- Genomics research has high economic relevance: it is used by biomedical researchers, hospital diagnostics, the food industry, agronomy, the pharmaceutical industry, ...
- Example use cases of NGS data analytics include:
  - Human genetics (detecting genotype variations that relate to diseases)
  - Personalized medicine (which will rely critically on the ability to rapidly and reliably map the genetic make-up of individual patients)
  - Genomic selection for agriculture (which will exploit genomics data in farmed livestock to select the next generation of breeding animals and crop plants)
- All these use cases require advanced analytics algorithms to be developed to cope with the data growth rate

No Processor Clock Speed Increases Ahead

- Herb Sutter: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. Dr. Dobb's Journal, 30(3), March 2005 (updated graph in August 2009).

Implications of the End of the Free Lunch

- The clock speeds of microprocessors are not going to improve much in the foreseeable future
  - The efficiency gains in single-threaded performance are going to be only moderate
- The number of transistors in a microprocessor is still growing at a high rate
  - One of the main uses of these transistors has been to increase the number of computing cores per processor
  - The number of cores in a low-end workstation (such as those employed in large-scale datacenters) is going to keep growing steadily
- Programming models need to change to efficiently exploit all the available concurrency; scalability to a high number of cores/processors will need to be a major focus

Warehouse Scale Computing

- The smallest unit of computation at Google scale is a warehouse full of computers
- Luiz André Barroso, Urs Hölzle: The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan & Claypool Publishers, 2009. http://dx.doi.org/10.2200/S00193ED1V01Y200905CAC006
- The book says: "... we must treat the datacenter itself as one massive warehouse-scale computer (WSC)."

Google MapReduce

- A scalable batch processing framework for warehouse-scale computing, developed at Google for Big Data: computing the Web index
- The MapReduce framework takes care of all issues related to parallelization, synchronization, load balancing, and fault tolerance
- When deciding whether MapReduce is the correct fit for an algorithm, one has to remember the fixed data-flow pattern of MapReduce: the algorithm has to map efficiently onto this pattern in order to make efficient use of the underlying computing hardware
- For a formalization of MapReduce, see: Howard J. Karloff, Siddharth Suri, Sergei Vassilvitskii: A Model of Computation for MapReduce. SODA 2010: 938-948

MapReduce and Functional Programming

- Based on functional programming in the large:
  - The user is only allowed to write side-effect-free functions "Map" and "Reduce"
  - Re-execution is used for fault tolerance: if a node executing a Map or a Reduce task fails to produce a result due to hardware failure, the task will be re-executed on another node
  - Side effects in the functions would make this impossible, as one could not re-create the environment in which the original task executed
  - One just needs fault-tolerant storage of the task inputs
  - The functions themselves are written in a standard imperative programming language, usually Java

Map and Reduce Functions

- The framework only allows the user to write two functions: a "Map" function and a "Reduce" function
- The Map function is fed blocks of data (block size 64-128 MB), and it produces (key, value) pairs
- The framework groups all values with the same key into a (key, (..., list of values, ...)) format, and these are then fed to the Reduce function for that key
- A special Master node takes care of scheduling and of fault tolerance by re-executing Mappers or Reducers
- A minimal sketch of such a Map/Reduce pair is shown below
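
As a concrete illustration of the two user-written functions, here is a minimal sketch using the Hadoop Java API that counts aligned reads per chromosome. The input format (one alignment per line, chromosome name in the first tab-separated field) and all class names are hypothetical choices for this example, not part of the system described in these slides.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReadCount {

  // Map: fed one input record at a time from a block of the input file;
  // emits a (chromosome, 1) pair per aligned read.
  public static class ReadCountMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text chromosome = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Assumed toy input format: chromosome name in the first tab-separated field.
      chromosome.set(line.toString().split("\t")[0]);
      context.write(chromosome, ONE);
    }
  }

  // Reduce: the framework has grouped all values with the same key;
  // sum them to get the number of reads per chromosome.
  public static class ReadCountReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text chromosome, Iterable<LongWritable> counts,
                          Context context)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable count : counts) {
        total += count.get();
      }
      context.write(chromosome, new LongWritable(total));
    }
  }
}
```

Both functions are side-effect free apart from emitting output through the framework, which is exactly what makes transparent re-execution on failure possible.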

MapReduce Diagram

[Figure: MapReduce execution overview. The user program forks a Master and Worker processes; the Master assigns map and reduce tasks to Workers. Map workers read the input splits and write intermediate (key, value) data to their local disks; reduce workers remotely read the intermediate files and write the output files.]

Figure from: J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004

Apache Hadoop

- An open source implementation of the MapReduce framework, originally developed by Doug Cutting and heavily used by, e.g., Yahoo! and Facebook
- "Moving Computation is Cheaper than Moving Data": ship code to data, not data to code
- Map and Reduce workers are also storage nodes of the underlying distributed filesystem: a task is first allocated to a node holding a copy of its data, and failing that, to a node in the same rack (to conserve scarce cross-rack network bandwidth)
- Project Web page: http://hadoop.apache.org/
- A sketch of a job driver that submits such a computation follows below
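
Continuing the hypothetical ReadCount example from the earlier slide, a driver along the following lines is how such a job would typically be configured and submitted with the Hadoop Java API (class names and paths are again assumptions for illustration; Job.getInstance is the newer Hadoop API call).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReadCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "read count");
    // The jar containing the Map and Reduce code is shipped to the workers:
    // code moves to the data, not the other way around.
    job.setJarByClass(ReadCountDriver.class);
    job.setMapperClass(ReadCount.ReadCountMapper.class);
    job.setReducerClass(ReadCount.ReadCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    // Input splits live in HDFS; map tasks are preferably scheduled on
    // nodes (or at least racks) that hold a replica of their split.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```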

Apache Hadoop (cont.)

- Builds reliable systems out of unreliable commodity hardware by replicating most components (exceptions: the Master/JobTracker and the NameNode of the Hadoop Distributed File System)
- Tuned for large files (gigabytes of data)
- Designed for very large (1 PB+) data sets
- Designed for streaming data access in batch processing: high bandwidth rather than low latency
- For scalability, NOT a POSIX filesystem
- Written in Java; runs as a set of userspace daemons

Hadoop Distributed Filesystem (HDFS)

- A distributed, replicated filesystem: all data is replicated by default on three different DataNodes
- Inspired by the Google File System
- Each node is usually a Linux compute node with a small number of hard disks (4-12)
- A single NameNode that maintains the file locations; many DataNodes (1000+)

Hadoop Distributed Filesystem (cont.)

- Any piece of data is available if at least one DataNode replica is up and running
- Rack optimized: by default one replica is written locally, a second in the same rack, and a third in another rack (to guard against rack failures, e.g., of the rack switch or the rack power feed)
- Uses a large block size; 128 MB is a common default, designed for batch processing
- For scalability: a write-once, read-many filesystem (see the API sketch below)
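
For orientation, a minimal sketch of the write-once/read-many access pattern through the HDFS Java FileSystem API; the path is a hypothetical example and error handling is omitted.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
  public static void main(String[] args) throws Exception {
    // Picks up the configured default filesystem (HDFS on a cluster).
    FileSystem fs = FileSystem.get(new Configuration());
    Path path = new Path("/user/example/notes.txt");  // hypothetical path

    // Write once: the file is written sequentially and then closed;
    // its blocks are replicated (by default to three DataNodes).
    try (FSDataOutputStream out = fs.create(path, true /* overwrite */)) {
      out.writeBytes("written once, read many times\n");
    }

    // Read many: streaming sequential reads are the intended access pattern.
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      System.out.println(reader.readLine());
    }
  }
}
```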

Implications of Write Once

- All applications need to be re-engineered to only do sequential writes. Example systems working on top of HDFS:
  - MapReduce batch processing system
  - HBase (Hadoop Database), a database system with only sequential writes; a Google Bigtable clone
  - Apache Pig and Hive data mining tools
  - Mahout machine learning libraries
  - Lucene and Solr full-text search
  - Nutch web crawling

Two Large Hadoop Installations

- Yahoo! (2009): 4000 nodes, 16 PB raw disk, 64 TB RAM, 32K cores
- Facebook (2010): 2000 nodes, 21 PB storage, 64 TB RAM, 22.4K cores
  - 12 TB of (compressed) data added per day, 800 TB of (compressed) data scanned per day
  - A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, H. Liu: Data warehousing and analytics infrastructure at Facebook. SIGMOD Conference 2010: 1013-1020. http://doi.acm.org/10.1145/1807167.1807278

CSC Genome Browser

- CSC provides tools and infrastructure for bioinformatics
- Bioinformatics is the largest customer group of CSC (in user numbers)
- Finland has limited resources to produce data
- Analysis software systems are an area where Finland has potential for global impact

Genome Browser Requirements

- Interactive browsing with zooming in and out, "Google Earth" style
- Single datasets of 100 GB to 1 TB+ with interactive visualization at different zoom levels
  - Preprocessing is used to compute summary data for the higher zoom levels
  - The dataset is too large to compute the summary data in real time from the full data
  - The scalable cloud data processing paradigm MapReduce, as implemented in Hadoop, is used to compute the summary data in preprocessing (currently up to 15 x 12 = 180 cores)

Genome Browser GUI

Read aggregation: problem

- A common problem in DNA sequence analysis is the alignment of a large number of smaller subsequences to a common reference sequence
- Once the alignment has been determined, one needs to analyze how well the reference sequence is covered by the subsequences (a.k.a. reads)
- For interactive visualization, the large data set has to be summarized
- Cloud computing enables interactive visualization of sequencing data

Read aggregation: problem (cont.)

[Figure: worked example of read aggregation with summary-block size B = 2]

Read aggregation via Hadoop

Pipeline: BAM -> sort -> summarize

- The center point of each read is computed (Map)
- Reads are sorted according to that center point
- For a fixed summary-block size B, every B reads are merged into a single aggregated read (Reduce)
- Note: in the previous example we had B = 2
- A simplified sketch of this Map/Reduce pair follows below
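
A simplified sketch of the read-aggregation Map/Reduce pair described above. The actual pipeline operates on BAM records via Hadoop-BAM; this toy version assumes plain-text reads with tab-separated start and end coordinates, and a single reducer (or a partitioner that preserves genomic order), purely to show the roles of the two functions.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReadAggregation {

  static final int B = 2;  // summary-block size, as in the example above

  // Map: compute the center point of each read; emitting it as the key
  // makes the framework's shuffle phase sort the reads by center point.
  public static class CenterMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text read, Context context)
        throws IOException, InterruptedException {
      // Assumed toy input: "start<TAB>end" alignment coordinates per line.
      String[] f = read.toString().split("\t");
      long start = Long.parseLong(f[0]);
      long end = Long.parseLong(f[1]);
      context.write(new LongWritable((start + end) / 2), read);
    }
  }

  // Reduce: merge every B consecutive reads into one aggregated read
  // spanning from the smallest start to the largest end in the group.
  public static class AggregateReducer
      extends Reducer<LongWritable, Text, LongWritable, Text> {
    private final List<long[]> buffer = new ArrayList<>();

    @Override
    protected void reduce(LongWritable center, Iterable<Text> reads, Context ctx)
        throws IOException, InterruptedException {
      for (Text r : reads) {
        String[] f = r.toString().split("\t");
        buffer.add(new long[] { Long.parseLong(f[0]), Long.parseLong(f[1]) });
        if (buffer.size() == B) {
          flush(ctx);
        }
      }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
      if (!buffer.isEmpty()) {
        flush(ctx);  // emit the last, possibly smaller, group
      }
    }

    private void flush(Context ctx) throws IOException, InterruptedException {
      long min = Long.MAX_VALUE;
      long max = Long.MIN_VALUE;
      for (long[] r : buffer) {
        min = Math.min(min, r[0]);
        max = Math.max(max, r[1]);
      }
      ctx.write(new LongWritable(min), new Text(min + "\t" + max));
      buffer.clear();
    }
  }
}
```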

Triton Cluster Before Latest Upgrade

- 112 AMD Opteron 2.6 GHz compute nodes with 12 cores and 32-64 GB of memory each, totalling 1344 cores
- InfiniBand and 1 Gbit Ethernet networks
- 30 TB+ of local disk
- 40 TB+ of fileserver work space

Experiments

- One 50 GB (compressed) BAM input file from the 1000 Genomes project
- Run on the Triton cluster with 1-15 compute nodes of 12 cores each
- Four repetitions for each number of worker nodes
- Two types of runs: sorting according to starting position ("sorted") and read aggregation using five summary-block sizes at once ("summarized")

Mean speedup

[Figure: mean speedup as a function of the number of worker nodes (1, 2, 4, 8, 15). Left panel: 50 GB sorted; right panel: 50 GB summarized for B = 2, 4, 8, 16, 32. Each panel shows curves for ideal speedup, input file import, sorting/summarizing, output file export, and total elapsed time.]

MapReduce for Read Alignment

- The SEAL system is a parallelization of the BWA read alignment tool on top of Hadoop (http://biodoop-seal.sourceforge.net/)
- It allows multiple BWA instances running on the same physical machine to share the large index of the reference sequence
- For more details, see the paper: Luca Pireddu, Simone Leo, Gianluigi Zanetti: SEAL: a distributed short read mapping and duplicate removal tool. Bioinformatics 27(15): 2159-2160 (2011)
- We are cooperating with the SEAL author to better integrate our tools

Results

- Version 4.0 of Hadoop-BAM released: http://sourceforge.net/projects/hadoop-bam/
- In addition to visualization, the Apache Pig framework has been integrated to allow easy ad-hoc queries on genomics data using the Pig data mining language
- A paper on Hadoop-BAM in Bioinformatics: Niemenmaa, M., Kallio, A., Schumacher, A., Klemelä, P., Korpelainen, E., and Heljanko, K.: Hadoop-BAM: Directly Manipulating Next Generation Sequencing Data in the Cloud. Bioinformatics, accepted for publication. http://dx.doi.org/10.1093/bioinformatics/bts054

Future plans

- Investigate parallelized variant detection using a Hadoop-ified GATK
- MapReduce algorithms for the generation of very large (200 million+ node) graphs in a computer-aided verification domain
- Algorithms to solve games on very large graphs
- Using the PACT (TU Berlin) and Spark (Berkeley) programming models for algorithms that are difficult for MapReduce (e.g., iterative graph algorithms)
- Parallel cloud datastore (HBase) for parallelized database functionality; extensions are needed to support multi-row transactions
- We are open to working with you to parallelize your problems!