HPC for fast and accurate NGS workflows

Using HPC techniques to accelerate NGS workflows

Pierre Carrier, Richard Walsh, Bill Long, Carlos Sosa, Jef Dawson
Cray Life Science working group, Cray Inc.

TrinityRNASeq: collaboration with Brian Haas and Timothy Tickle (The Broad Institute) and Thomas William (TU Dresden)


Outline

● Overview of an important NGS workflow example: MegaSeq
● What is the HPC approach? What is distributed memory? Toward the 1-hour genome
● Example case: TrinityRNASeq
  ● Overview of TrinityRNASeq (Inchworm, Chrysalis, Butterfly)
  ● Inchworm: structure of the sequential code and the solution with MPI
  ● MPI scaling results for the mouse and axolotl RNA sequences
● Your feedback: what are the key bottlenecks you encounter?


NGS Workflow Modular Tools

● Most bioinformatics workflows include several intermediate and separate modular programs, often written in Perl, Python, Java, or C/C++ (i.e., they are serial codes, sometimes with OpenMP multi-threading):
  ● BWA
  ● bowtie/bowtie2
  ● tophat
  ● GATK
  ● BioPerl
  ● blat
  ● samtools
  ● Picard
  ● R
  ● The list goes on…

● Currently, relatively few bioinformatics programs are parallelized for distributed memory using the well-established Message Passing Interface (MPI) standard (a minimal sketch of the distributed-memory model follows this slide):
  ● mpiBLAST/abokiaBLAST
  ● ABySS
  ● Ray
  ● HMMER
  ● PhyloBayes-MPI

A typical bioinformatics workflow is a combination of all these separate programs.
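To make "distributed memory" concrete, here is a minimal, hedged sketch (not taken from any of the tools listed above; the record count and buffer are hypothetical): each MPI rank is a separate process with its own private address space, so a large data set can be split such that every rank only ever allocates its own share.

```cpp
// Minimal sketch of the distributed-memory model with MPI (C++).
// The record count and buffer are hypothetical; this is not code from any tool above.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const long total_records = 100000000L;            // hypothetical data-set size
    long my_records = total_records / nranks;         // each rank holds only its share...
    if (rank < total_records % nranks) ++my_records;  // ...plus one extra for the remainder

    std::vector<char> my_buffer(my_records);          // allocated in this rank's private address space only

    std::printf("rank %d of %d holds %ld records\n", rank, nranks, my_records);

    MPI_Finalize();
    return 0;
}
```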


Overview of an NGS workflow: MegaSeq
Megan J. Puckelwartz et al., "Supercomputing for the parallelization of whole genome analysis", http://bioinformatics.oxfordjournals.org/content/30/11/1508

Parallelism can be obtained by computing multiple genomes simultaneously.

[Figure: MegaSeq software pipeline, with the number of parallel instances per stage — BWA (C): 1/ReadGroup; Picard (Java): 2/ReadGroup; samtools (C): 25/ReadGroup; Picard (Java): 100/genome; GATK, 6 steps (Java): 25/genome and 1/genome; snpEff (Perl, Python).]
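This coarse-grained parallelism (many genomes in flight at once) does not require changing the serial tools at all; one simple, hedged way to drive it is to give each MPI rank its own genome to push through the pipeline. The sketch below is an illustration only, not the MegaSeq implementation; the sample names and the `run_pipeline.sh` wrapper are hypothetical.

```cpp
// Hypothetical driver: one MPI rank works on one genome at a time, each running the
// unchanged serial tools. This illustrates coarse-grained parallelism, not the MegaSeq
// code; the sample names and wrapper script are made up.
#include <mpi.h>
#include <cstdlib>
#include <string>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // Hypothetical sample identifiers; in practice these would come from a manifest file.
    const std::vector<std::string> samples = {"genome_001", "genome_002", "genome_003", "genome_004"};

    // Round-robin assignment: rank r processes samples r, r + nranks, r + 2*nranks, ...
    for (std::size_t i = rank; i < samples.size(); i += nranks) {
        const std::string cmd = "./run_pipeline.sh " + samples[i];  // hypothetical wrapper around BWA/Picard/GATK/snpEff
        std::system(cmd.c_str());                                   // each rank runs its own serial pipeline instance
    }

    MPI_Finalize();
    return 0;
}
```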

Overview of an NGS workflow: MegaSeq
Megan J. Puckelwartz et al., "Supercomputing for the parallelization of whole genome analysis", http://bioinformatics.oxfordjournals.org/content/30/11/1508

We believe this outstanding improvement can be pushed further with the HPC programming techniques discussed next, which help reduce the computing time needed for one full genome to about 1 hour.

Each genome requires ~1 TB of space.

In the MegaSeq example, that would mean computing all 240 genomes in less than an hour.


Overview of an NGS workflow

Let's look more closely at this general pipeline for one genome (Genome 1):

[Figure: Extract fastq (Picard) → ALN (BWA) → SAMPE (BWA) → Compress/Split/Sort/Duplicates/Reorder (samtools, Picard) → GATK steps: RealignerTargetCreator → IndelRealigner → BaseRecalibrator → PrintReads → HaplotypeCaller → VariantFiltration → Annotate variants (snpEff).]


All of these steps access the same database on disk, and running the full pipeline takes approximately < 2 days.

Architecture of MPI Inchworm
MPI Inchworm Phase 1 of 2: Reading the k-mers

[Figure: each MPI rank (1, 2, …, n) reads k-mers together with their observed counts (for example GGCTCTATGAGATGGGTTTGAAGTG ×4, TAAAAACCTTCTCTGGATTAACTCA ×7, and many k-mers seen only once or twice). Each MPI rank sends its k-mers to targeted MPI ranks.]
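A hedged sketch of the idea in this phase, under assumptions of mine (the hash-based ownership rule and the "count, tab, k-mer" message layout are illustrative, not the actual MPI-Inchworm wire format): each k-mer is mapped to an owner rank by hashing its sequence, so the global k-mer table ends up partitioned across the "k-mer server" ranks.

```cpp
// Sketch of the k-mer reading phase: hash each k-mer to an owner rank and send it there.
// Encoding, hash, and message layout are illustrative assumptions,
// not the actual MPI-Inchworm implementation.
#include <mpi.h>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Owner rank of a k-mer: a hash of its sequence, modulo the number of ranks.
static int owner_rank(const std::string& kmer, int nranks) {
    return static_cast<int>(std::hash<std::string>{}(kmer) % nranks);
}

// Reader side: forward one k-mer and its count to the rank that owns it.
// (If the owner happens to be this rank itself, a real code would insert locally
// instead of sending a message to itself.)
void send_kmer(const std::string& kmer, std::uint64_t count, int nranks) {
    const int dest = owner_rank(kmer, nranks);
    const std::string msg = std::to_string(count) + "\t" + kmer;
    MPI_Send(msg.data(), static_cast<int>(msg.size()), MPI_CHAR, dest, /*tag=*/0, MPI_COMM_WORLD);
}

// K-mer server side: receive one message and add the k-mer to the local table slice.
void receive_kmer(std::unordered_map<std::string, std::uint64_t>& local_table) {
    char buf[1024];
    MPI_Status st;
    MPI_Recv(buf, static_cast<int>(sizeof(buf)), MPI_CHAR, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
    int len = 0;
    MPI_Get_count(&st, MPI_CHAR, &len);
    const std::string msg(buf, len);
    const std::size_t tab = msg.find('\t');
    local_table[msg.substr(tab + 1)] += std::stoull(msg.substr(0, tab));
}
```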

Architecture of MPI Inchworm (figure: Brian Haas, BioIT 2014)
MPI Inchworm Phase 2 of 2: Building the Contigs

[Figure: rank 1 holds the growing contig GGCTCTATGAGATGGGTTTGAAGTG and looks for the greedy extension to the right among A, C, G, and T. Using MPI_Send/MPI_Recv, rank 1 sends the query "greedy extension to: GCTCTATGAGATGGGTTTGAAGTG ?" to the owning rank (rank 2 of ranks 2…n); rank 2 responds: T.]
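To make that exchange concrete, here is a hedged C++/MPI fragment of what the query/response could look like; the message tags, fixed-size receive buffer, and tie-breaking rule are my assumptions for illustration, not the actual MPI-Inchworm protocol.

```cpp
// Illustrative sketch of the phase-2 exchange. A contig-builder rank asks the rank
// owning a (k-1)-mer prefix which base best extends the contig to the right; the
// server looks at its local counts of prefix+{A,C,G,T} and replies with the winner.
// Tags, buffer sizes, and tie-breaking are assumptions, not the MPI-Inchworm protocol.
#include <mpi.h>
#include <cstdint>
#include <string>
#include <unordered_map>

const int TAG_QUERY = 1;
const int TAG_REPLY = 2;

// Contig-builder side: e.g. prefix = "GCTCTATGAGATGGGTTTGAAGTG" as in the figure.
char request_extension(const std::string& prefix, int server_rank) {
    MPI_Send(prefix.data(), static_cast<int>(prefix.size()), MPI_CHAR,
             server_rank, TAG_QUERY, MPI_COMM_WORLD);
    char base = 0;  // e.g. 'T' as in the figure, or 0 if no extension exists
    MPI_Recv(&base, 1, MPI_CHAR, server_rank, TAG_REPLY, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return base;
}

// K-mer-server side: answer one extension query from its local k-mer table slice.
void serve_one_query(const std::unordered_map<std::string, std::uint64_t>& table) {
    char buf[256];
    MPI_Status st;
    MPI_Recv(buf, static_cast<int>(sizeof(buf)), MPI_CHAR, MPI_ANY_SOURCE, TAG_QUERY,
             MPI_COMM_WORLD, &st);
    int len = 0;
    MPI_Get_count(&st, MPI_CHAR, &len);
    const std::string prefix(buf, len);

    char best = 0;
    std::uint64_t best_count = 0;
    const char bases[4] = {'A', 'C', 'G', 'T'};
    for (char b : bases) {
        auto it = table.find(prefix + b);  // candidate k-mer = prefix + extension base
        if (it != table.end() && it->second > best_count) {
            best_count = it->second;
            best = b;
        }
    }
    MPI_Send(&best, 1, MPI_CHAR, st.MPI_SOURCE, TAG_REPLY, MPI_COMM_WORLD);
}
```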

Results for mouse
Advantage of distributed memory: time scalability

[Figure: total time (seconds, 0–5000) versus number of OpenMP threads or MPI ranks (0–30) for MPI-Inchworm and OpenMP-Inchworm on the mouse data set. The OpenMP version is limited by the core count of a single node.]



Results for mouse and axolotl
Advantage of distributed memory: memory scalability

[Figure: Mouse MPI Inchworm on an XC30 (Intel IVB-10) — absolute time (seconds, log scale) versus number of MPI ranks (roughly 100 to 1000), with separate curves for the Contig Builder and Kmer Server ranks.]

The Mexican axolotl (~30 GB) computation is impossible to perform with the serial code on 64 GB or 128 GB nodes; it requires a special large-memory node.

[Figure: Axolotl MPI Inchworm on an XC30 (Intel IVB-10) — absolute time (seconds, log scale) versus number of MPI ranks (roughly 300 to 1000).]

More memory is required for the axolotl: use more MPI ranks to distribute the memory.

Distributed memory allows you to do research on problems that would otherwise not be tractable.
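As a purely illustrative back-of-the-envelope (the numbers below are hypothetical, not measurements from the plots): if the in-memory k-mer table and associated data structures need a total of M GB, then distributing them over P MPI ranks leaves each rank holding roughly M/P GB plus some per-rank overhead. A table that would need, say, 500 GB on a single node clearly cannot fit in 64 GB or 128 GB of node memory, but spread over 300 ranks it shrinks to under 2 GB per rank, which is why adding ranks can make an assembly like the axolotl feasible on ordinary nodes.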

Some references

[Figure: covers of programming reference books — C/Fortran (three titles), C++, and C/Fortran 2008 PGAS.]

Your feedback concerning your workflows?

[email protected]
[email protected]
