
Gustaf: Generic Multi-split Alignment Finder


Kathrin Trappe (1), Anne-Katrin Emde (1), Tobias Rausch (2), Knut Reinert (1)

(1) Department of Computer Science, Freie Universität Berlin, Berlin, Germany
(2) European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Germany

Large-scale population and disease association studies have shown both the importance and the difficulty of detecting structural variants (SVs) in genomic and transcriptomic sequencing data. Although very fast and precise, current read mapping tools usually fail to map sequencing reads that cross SV breakpoints or exon-exon boundaries. These events cause one or more splits in the read-to-reference alignment, with parts of the read mapping to different locations on the reference sequence. We present GUSTAF [1], a sound generic multi-split detection method implemented in the C++ library SeqAn [4]. GUSTAF uses SeqAn's exact local aligner Stellar [2] to find partial read alignments, and reports precise breakpoints of SVs and exon-exon boundaries.

Preliminary Results

Structural Variants

True (TP) and false (FP) hits of Gustaf on a contig set with simulated variants:

• Simulated assembled contigs (37 bp to 607 bp) from Illumina read data
• Variants include insertions, deletions, inversions, translocations (tandem duplications)
• TP: deviation up to 5 bp (position) and 10% length
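The TP criterion in the list above can be written down as a small check. The sketch below is an illustration only, not Gustaf's evaluation code: the `Variant` record, the function names, and the choice to measure the 10% length tolerance relative to the simulated variant are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A simulated or predicted variant (hypothetical record layout)."""
    kind: str    # e.g. "deletion", "inversion"
    chrom: str
    pos: int     # breakpoint position in bp
    length: int  # variant length in bp


def is_true_positive(pred, truth, max_pos_dev=5, max_len_dev=0.10):
    """Apply the tolerance from the list: <= 5 bp in position, <= 10% in length."""
    if pred.kind != truth.kind or pred.chrom != truth.chrom:
        return False
    pos_ok = abs(pred.pos - truth.pos) <= max_pos_dev
    # Assumption: the 10% tolerance is relative to the simulated length.
    len_ok = abs(pred.length - truth.length) <= max_len_dev * truth.length
    return pos_ok and len_ok


def count_hits(predictions, truths):
    """Return (TP, FP); each simulated variant can be matched at most once."""
    unmatched = list(truths)
    tp = fp = 0
    for pred in predictions:
        hit = next((t for t in unmatched if is_true_positive(pred, t)), None)
        if hit is None:
            fp += 1
        else:
            unmatched.remove(hit)
            tp += 1
    return tp, fp
```

For one simulated deletion at position 1000 with length 200, a predicted deletion at position 1003 with length 190 counts as a TP, whereas a prediction 10 bp away or of a different variant type counts as an FP.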

[Figure: bar chart of the number of predictions ("# predictions"), split into TP and FP, per variant type: insertion, deletion, inversion, translocation. Numeric labels that appear to belong to the chart survived extraction, but their assignment to bars is not recoverable: 1338, 331, 22, 0, 1080, 1523, 2124, 4117, 5758.]

We observe:

• Promising high number of true positive (TP) hits with a moderate number of false positive (FP) hits
• FPs mostly result from low repeat resolution, wrongly classified small indels and tandem duplications, and unfiltered pseudo deletions
• Work in progress ...

[Figure: schematic diagrams of the variant types Deletion, Tandem Duplication, Dispersed Duplication, Inversion, Insertion, and Translocation, each drawn as numbered sequence segments.]

Generic Multi-Split Mapping with Gustaf

Gustaf's workflow:

Find partial alignments using Stellar

Read
>gi|217331561|gb|FJ423752.1| Homo sapiens STRN4/GPSN2 fusion mRNA, partial sequence
CTGGGGGACTTGGCAGATCTCACCGTCACCAACGACAACGACCTCAGCTGCGATGTGGAGATTCTGGACGCAAAGACAAGGGAGAAGCTGTGTTTCTTGGA

Partial alignments of the read (bottom row) against the reference (top row):

chr 19
CTTCGAGGTGGAGATTCTGGACGCAAAGACAAGGGAGAAGCTGTGTTTCTTGGA
|| ||| |||||||||||||||||||||||||||||||||||||||||||||||
CTGCGATGTGGAGATTCTGGACGCAAAGACAAGGGAGAAGCTGTGTTTCTTGGA

chr 19 (complement)
CTGGGGGACTTGGCAGATCTCACCGTCACCAACGACAACGACCTCAGCTGCGATGT-AAG-TTCT
||||||||||||||||||||||||||||||||||||||||||||||||||||||||  || ||||
CTGGGGGACTTGGCAGATCTCACCGTCACCAACGACAACGACCTCAGCTGCGATGTGGAGATTCT

chr 4 (complement)
GTGGAGATTCTGCACGCAAAGACAGGGGAGAAGCTGTGTTTC
|||||||||||| ||||||||||| |||||||||||||||||
GTGGAGATTCTGGACGCAAAGACAAGGGAGAAGCTGTGTTTC

Stellar finds all maximal ε-matches.
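To make the notion of an ε-match concrete, the following sketch computes the error rate of the chr 4 alignment shown above. It assumes the definition from [2], namely a local alignment of at least a minimum length whose error rate (mismatches and gaps per alignment column) does not exceed ε. The thresholds `epsilon=0.05` and `min_length=30` and all function names are hypothetical and are not taken from Stellar.

```python
# Rows of the chr 4 (complement) alignment: reference on top, read below.
CHR4_REF = "GTGGAGATTCTGCACGCAAAGACAGGGGAGAAGCTGTGTTTC"
CHR4_READ = "GTGGAGATTCTGGACGCAAAGACAAGGGAGAAGCTGTGTTTC"


def count_errors(ref_row, read_row):
    """Count alignment columns that are mismatches or gaps ('-')."""
    if len(ref_row) != len(read_row):
        raise ValueError("aligned rows must have equal length")
    return sum(1 for a, b in zip(ref_row, read_row) if a != b)


def error_rate(ref_row, read_row):
    """Errors per alignment column."""
    return count_errors(ref_row, read_row) / len(ref_row)


def is_epsilon_match(ref_row, read_row, epsilon=0.05, min_length=30):
    """Check the assumed ε-match conditions: long enough and few enough errors."""
    long_enough = len(ref_row) >= min_length
    return long_enough and error_rate(ref_row, read_row) <= epsilon
```

The chr 4 alignment has 2 mismatches in 42 columns, an error rate of about 0.048, so it passes under the assumed ε of 0.05 and would be rejected under a stricter ε of 0.04.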

Compatibility check and breakpoint computation

Get exact breakpoint & chain alignments

[Figure: graph with compatibility information. Nodes are start, the Stellar matches listed below, and end (read position 102). Edge weight: edit distance of the next match; * is the breakpoint indicator. Edge labels that survived extraction: 2*, 3, 4*, 0, 5; their assignment to edges is not fully recoverable.]

Match 0: chr: 19, db: 47230700...47230763, read: 1...66
Match 1: chr: 19, db: 14673330...14673384, +, read: 48...102
Match 2: chr: 4, db: 125384889...125384931, read: 55...97

Find best chain & report breakpoints

Find best chain

[Figure: best chain of Stellar matches: start → (3) → Match 0 (chr: 19, db: 47230700...47230763, read: 1...66) → (2*) → Match 1 (chr: 19, db: 14673330...14673384, +, read: 48...102) → (0) → end (102).]

Split-alignment determines precise breakpoint [3]

Multiple reads support breakpoints
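The best-chain step can be sketched as a shortest-path search over the match graph. The code below is a simplified illustration and not Gustaf's implementation. Only the edge weight (edit distance of the next match) comes from the poster. The compatibility rule, the `max_gap` threshold, and the cost of an edge into the end node (the number of uncovered read bases) are assumptions. The edit distances 3, 2 and 2 were counted from the three partial alignments of the example read, and the breakpoint indicator (*) is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """One partial alignment of the read (field names are illustrative)."""
    match_id: int
    read_begin: int      # first aligned read position (1-based)
    read_end: int        # last aligned read position
    edit_distance: int


# The three matches of the example read, with edit distances counted by hand.
POSTER_MATCHES = [Match(0, 1, 66, 3), Match(1, 48, 102, 2), Match(2, 55, 97, 2)]


def best_chain(matches, read_length, max_gap=10):
    """Return (cost, [match ids]) of the cheapest start-to-end chain, or None."""
    order = sorted(matches, key=lambda m: (m.read_begin, m.read_end))
    best = {}  # match_id -> (cost of cheapest chain ending here, chain)
    for j, m in enumerate(order):
        candidates = []
        # Edge start -> m: only if m begins close to the read start (assumed).
        if m.read_begin - 1 <= max_gap:
            candidates.append((m.edit_distance, [m.match_id]))
        # Edge p -> m: p begins and ends before m, and the gap between them is small.
        for p in order[:j]:
            compatible = (
                p.match_id in best
                and p.read_begin < m.read_begin
                and p.read_end < m.read_end
                and m.read_begin - p.read_end - 1 <= max_gap
            )
            if compatible:
                cost, chain = best[p.match_id]
                candidates.append((cost + m.edit_distance, chain + [m.match_id]))
        if candidates:
            best[m.match_id] = min(candidates, key=lambda c: c[0])
    # Edge m -> end: costs the number of uncovered read bases (assumed).
    finals = []
    for m in order:
        tail = read_length - m.read_end
        if m.match_id in best and tail <= max_gap:
            cost, chain = best[m.match_id]
            finals.append((cost + tail, chain))
    return min(finals, key=lambda c: c[0]) if finals else None
```

Calling `best_chain(POSTER_MATCHES, 102)` under these assumptions selects Match 0 followed by Match 1 with a total cost of 5, which agrees with the chain shown in the figure. Because predecessors always begin earlier in the read, processing the matches in sorted order is enough to evaluate the graph without a general shortest-path algorithm.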

Conclusions

GUSTAF is very versatile: It allows for multiple splits at arbitrary locations in the read, is independent of read length and sequencing platform, and supports both single-end and paired-end reads. Our results show that GUSTAF accurately detects inversions, inter- and intra-chromosomal translocations, insertions and deletions in genomic sequencing data, and identifies precise exon-exon junctions in RNA-Seq data, including gene fusion transcripts. In the future, we will extend the set of detectable variants by taking pseudo deletions into account and by differentiating between tandem and dispersed duplications, while also resolving repeat regions accurately.

[1] Trappe, K.: Multi-Split-Mapping of NGS reads for variant detection. Master's thesis, Freie Universität Berlin (2012), http://www.seqan.de/projects/gustaf/
[2] Kehr, B., Weese, D., Reinert, K.: STELLAR: fast and exact local alignments. BMC Bioinformatics 12(Suppl 9), S15 (2011), http://www.seqan.de/projects/stellar/
[3] Emde, A.-K., Schulz, M. H., Weese, D., Sun, R., Vingron, M., Kalscheuer, V. M., Haas, S. A., Reinert, K.: Detecting genomic indel variants with exact breakpoints in single- and paired-end sequencing data using SplazerS. Bioinformatics 28(5), 619-627 (2012), http://www.seqan.de/projects/splazers/
[4] Döring, A., Weese, D., Rausch, T., Reinert, K.: SeqAn: An efficient, generic C++ library for sequence analysis. BMC Bioinformatics 9, 11 (2008)