Using HPC techniques to accelerate NGS workflows
Pierre Carrier, Richard Walsh, Bill Long, Carlos Sosa, Jef Dawson (Cray Life Science Working Group, Cray Inc.)
TrinityRNASeq: collaboration with Brian Haas and Timothy Tickle (The Broad Institute) and Thomas William (TU Dresden)
ISMB 2014, 7/6/2014
Outline
● Overview of an important NGS workflow example: MegaSeq
● What is the HPC approach? What is distributed memory? Toward the 1-hour genome
● Example case: TrinityRNASeq
  ● Overview of TrinityRNASeq (Inchworm, Chrysalis, Butterfly)
  ● Inchworm: structure of the sequential code and the MPI solution
  ● MPI scaling results for the mouse and axolotl RNA sequences
● Your feedback: What are the key bottlenecks you encounter?
NGS Workflow Modular Tools
● Most bioinformatics workflows consist of several intermediate, separate modular programs, often written in Perl, Python, Java, or C/C++ (i.e., they are serial codes, sometimes with OpenMP multi-threading):
  ● BWA
  ● bowtie/bowtie2
  ● tophat
  ● GATK
  ● BioPerl
  ● blat
  ● samtools
  ● Picard
  ● R
  ● The list goes on…
● Currently, relatively few bioinformatics programs are parallelized for distributed memory using the well-established Message Passing Interface (MPI) standard:
  ● mpiBLAST/abokiaBLAST
  ● ABySS
  ● Ray
  ● HMMER
  ● PhyloBayes-MPI
● A typical bioinformatics workflow is a combination of all these separate programs (a minimal sketch contrasting the shared-memory and distributed-memory models follows below).
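The distinction matters because an OpenMP code shares one node's memory among its threads, whereas each MPI rank is a separate process that owns only its private portion of the data and exchanges information through explicit messages. The following is a minimal illustrative sketch of the distributed-memory model; it is not taken from any of the codes above, and the problem size and variable names are hypothetical.

// Hypothetical illustration of distributed memory with MPI (C++): each rank
// allocates and processes only its own slice of a large dataset, and partial
// results are combined by explicit communication.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // Hypothetical total problem size; each rank allocates only its own slice,
    // so the per-process memory footprint shrinks as ranks are added.
    const long N = 100000000;
    const long local_n = N / nranks;
    std::vector<double> local(local_n, 1.0);   // private to this rank

    double local_sum = 0.0;
    for (long i = 0; i < local_n; ++i) local_sum += local[i];

    // Ranks cannot see each other's memory; results are combined by messages.
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) std::printf("global sum = %.0f\n", global_sum);
    MPI_Finalize();
    return 0;
}

An OpenMP version of the same loop would keep the entire array in one node's memory, so both its footprint and its scaling are bounded by that single node; this is the limitation that appears later in the mouse Inchworm timings.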
Overview of an NGS workflow: MegaSeq
Megan J. Puckelwartz et al., Supercomputing for the parallelization of whole genome analysis, http://bioinformatics.oxfordjournals.org/content/30/11/1508
Parallelism can be obtained by computing multiple genomes simultaneously (a hedged dispatch sketch follows after this slide).
[Figure: software pipeline with approximate task counts per stage:]
  ● Picard (Java): 2/ReadGroup
  ● BWA (C): 1/ReadGroup
  ● samtools (C): 25/ReadGroup
  ● Picard (Java): 100/genome
  ● GATK (6x) (Java): 25/genome, 1/genome
  ● snpEff (Perl, Python)
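The coarse-grained parallelism MegaSeq exploits (many independent per-genome pipeline instances running at once) can be sketched in a few lines of MPI. This is an illustration only, not the MegaSeq implementation; the genome names and the run_pipeline.sh wrapper are hypothetical placeholders.

// Hypothetical sketch: run one copy of a serial per-genome pipeline per MPI rank.
// This is coarse-grained ("embarrassingly parallel") workflow parallelism, not a
// parallelization of the tools themselves.
#include <mpi.h>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // Hypothetical list of genomes; in practice this would be read from disk.
    std::vector<std::string> genomes = {"genome_001", "genome_002", "genome_003", "genome_004"};

    // Round-robin assignment: rank r processes genomes r, r+nranks, r+2*nranks, ...
    for (std::size_t i = rank; i < genomes.size(); i += nranks) {
        // run_pipeline.sh is a hypothetical wrapper that chains the serial tools
        // (BWA, Picard, samtools, GATK, snpEff) for one genome.
        std::string cmd = "./run_pipeline.sh " + genomes[i];
        std::printf("rank %d: %s\n", rank, cmd.c_str());
        int status = std::system(cmd.c_str());
        if (status != 0)
            std::fprintf(stderr, "rank %d: pipeline failed for %s\n", rank, genomes[i].c_str());
    }

    MPI_Finalize();
    return 0;
}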
Overview of an NGS workflow: MegaSeq
Megan J. Puckelwartz et al., Supercomputing for the parallelization of whole genome analysis, http://bioinformatics.oxfordjournals.org/content/30/11/1508
We believe this outstanding improvement can be pushed further with the HPC programming techniques discussed next, which help reduce the computing time for one full genome to 1 hour.
Each genome requires ~1 TB of space.
In the MegaSeq example, that would mean computing all 240 genomes in less than an hour.
Overview of an NGS workflow
Let's look more closely at this general pipeline (Genome 1):
[Flowchart: Extract fastq (Picard) → ALN (BWA) → SAMPE (BWA) → Compress/Split/Sort/Duplicates/reorder (samtools, Picard) → GATK steps: RealignerTargetCreator → IndelRealigner → BaseRecalibrator → PrintReads → Haplotype Caller → VariantFiltration → Annotate variants (snpEff)]
Overview of an NGS workflow (continued)
[Same pipeline flowchart, now repeated for Genome 1, Genome 2, …, Genome n; every pipeline instance accesses the same database disk. Annotated runtime: approximately < 2 days.]
MPI Inchworm Phase 1 of 2: distributing the k-mer catalog
[Figure: MPI ranks 1, 2, …, n each read a portion of the k-mer catalog (k-mers with abundance counts, e.g. GGCTCTATGAGATGGGTTTGAAGTG).]
Each MPI rank sends k-mers to targeted MPI ranks.
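A self-contained sketch of what this k-mer distribution step might look like, assuming a simple "hash the k-mer, take it modulo the number of ranks" placement rule. It is not the actual MPI Inchworm code; the tag, wire format, and catalog container are assumptions.

// Hypothetical sketch of MPI Inchworm Phase 1: every rank hashes each (k-mer, count)
// pair it has read and ships it to the rank that will own it in the distributed catalog.
#include <mpi.h>
#include <cstdio>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Illustrative placement rule: hash the k-mer string onto one of the ranks.
static int owner_rank(const std::string& kmer, int nranks) {
    return static_cast<int>(std::hash<std::string>{}(kmer) % nranks);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    const int KMER_TAG = 1;   // hypothetical message tag

    // Pretend each rank has read these (k-mer, count) pairs from its slice of the input.
    std::vector<std::pair<std::string, unsigned>> my_kmers = {
        {"GGCTCTATGAGATGGGTTTGAAGTG", 4}, {"TGGTGTTACCTGAAGTCGAAGCAGG", 1}};

    // First tell every rank how many k-mers to expect from each sender.
    std::vector<int> send_counts(nranks, 0), recv_counts(nranks, 0);
    for (const auto& kc : my_kmers) send_counts[owner_rank(kc.first, nranks)]++;
    MPI_Alltoall(send_counts.data(), 1, MPI_INT,
                 recv_counts.data(), 1, MPI_INT, MPI_COMM_WORLD);

    // Non-blocking sends of "count kmer" text messages to the owning ranks.
    std::vector<std::string> msgs;
    std::vector<MPI_Request> reqs(my_kmers.size(), MPI_REQUEST_NULL);
    msgs.reserve(my_kmers.size());
    for (std::size_t i = 0; i < my_kmers.size(); ++i) {
        msgs.push_back(std::to_string(my_kmers[i].second) + " " + my_kmers[i].first);
        MPI_Isend(msgs.back().c_str(), static_cast<int>(msgs.back().size()) + 1, MPI_CHAR,
                  owner_rank(my_kmers[i].first, nranks), KMER_TAG, MPI_COMM_WORLD, &reqs[i]);
    }

    // Receive the expected number of k-mers and add them to the local catalog.
    int expected = 0;
    for (int c : recv_counts) expected += c;
    std::map<std::string, unsigned> local_catalog;
    for (int i = 0; i < expected; ++i) {
        char buf[256];
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, MPI_ANY_SOURCE, KMER_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        unsigned count = 0;
        char kmer[200];
        std::sscanf(buf, "%u %199s", &count, kmer);
        local_catalog[kmer] += count;
    }
    MPI_Waitall(static_cast<int>(reqs.size()), reqs.data(), MPI_STATUSES_IGNORE);

    std::printf("rank %d owns %zu distinct k-mers\n", rank, local_catalog.size());
    MPI_Finalize();
    return 0;
}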
Architecture of MPI Inchworm (Brian Haas, BioIT 2014)
MPI Inchworm Phase 2 of 2: Building the Contigs
[Figure: Rank 1 holds a contig ending in the k-mer GGCTCTATGAGATGGGTTTGAAGTG and looks for the greedy extension to the right (A, C, G, or T). Rank 1 sends a query (MPI_Send) to the rank that owns the candidate k-mers, e.g. Rank 2: "Greedy extension to: GCTCTATGAGATGGGTTTGAAGTG ?". Rank 2 responds (received via MPI_Recv on Rank 1) with the best extension base, here T.]
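A hedged sketch of the query/response exchange in the figure: the contig-builder rank sends the (k-1)-base suffix of its contig to the rank owning the candidate k-mers, and that rank replies with the most abundant extension base. This is not the actual MPI Inchworm implementation; the tags, buffer sizes, and catalog type are assumptions.

// Hypothetical sketch of MPI Inchworm Phase 2: a contig builder asks a k-mer
// server for the greedy right-extension of the current contig end.
#include <mpi.h>
#include <map>
#include <string>

const int QUERY_TAG = 2;   // hypothetical request/reply tags
const int REPLY_TAG = 3;

// Builder side: send the (k-1)-base suffix of the contig end and wait for the best base.
char request_extension(const std::string& suffix, int server_rank) {
    MPI_Send(suffix.c_str(), static_cast<int>(suffix.size()) + 1, MPI_CHAR,
             server_rank, QUERY_TAG, MPI_COMM_WORLD);
    char base = 0;   // server replies 'A', 'C', 'G', 'T', or 0 if no extension exists
    MPI_Recv(&base, 1, MPI_CHAR, server_rank, REPLY_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return base;
}

// Server side: answer one query using this rank's slice of the k-mer catalog.
void answer_one_query(const std::map<std::string, unsigned>& local_catalog) {
    char buf[256];
    MPI_Status status;
    MPI_Recv(buf, sizeof(buf), MPI_CHAR, MPI_ANY_SOURCE, QUERY_TAG, MPI_COMM_WORLD, &status);
    std::string suffix(buf);

    // Greedy rule: of the four candidate k-mers (suffix + base), report the base
    // whose k-mer is most abundant in the local catalog.
    char best_base = 0;
    unsigned best_count = 0;
    for (char base : {'A', 'C', 'G', 'T'}) {
        auto it = local_catalog.find(suffix + base);
        if (it != local_catalog.end() && it->second > best_count) {
            best_count = it->second;
            best_base = base;
        }
    }
    MPI_Send(&best_base, 1, MPI_CHAR, status.MPI_SOURCE, REPLY_TAG, MPI_COMM_WORLD);
}

In practice the builder would repeat this request for every extension step and the server would loop over incoming queries; only the single exchange shown in the figure is sketched here.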
Results for mouse
Advantage of distributed memory: time scalability
[Plot: total time (seconds, 0 to 5000) versus number of threads or MPI ranks (0 to 30) for OMP-Inchworm and MPI-Inchworm. The OpenMP version is limited by the core count of one single node.]
Results for mouse and axolotl
Advantage of distributed memory: memory scalability
[Plot: Mouse MPI Inchworm on XC30 (Intel IVB-10): absolute time (seconds, log scale) versus number of MPI ranks, shown separately for the Kmer Server and Contig Builder phases.]
[Plot: Axolotl MPI Inchworm on XC30 (Intel IVB-10): absolute time (seconds, log scale) versus number of MPI ranks.]
The Mexican axolotl (~30 GB) computation is impossible with the serial code on 64 or 128 GB nodes; it requires a special large-memory node.
More memory is required for the axolotl: use more MPI ranks to distribute the memory.
Distributed memory allows you to do research on problems that would otherwise be out of reach.
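To see why adding ranks relieves the memory pressure, assume (illustratively) that the ~30 GB axolotl k-mer catalog is spread roughly evenly across the ranks; the 300-rank figure is taken from the smallest rank count in the axolotl runs, and the even split is an assumption:

    per-rank share ≈ 30 GB / 300 ranks ≈ 0.1 GB per rank

which fits easily on a standard 64 GB node alongside the rest of the application, whereas the serial code must hold the entire ~30 GB in a single process.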
Your feedback concerning your workflows?
[email protected] [email protected]