Gustaf: Generic Multi-split Alignment Finder
Kathrin Trappe¹, Anne-Katrin Emde¹, Tobias Rausch², Knut Reinert¹
¹ Department of Computer Science, Freie Universität Berlin, Berlin, Germany
² European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Germany
Large-scale population and disease association studies have shown both the importance and the difficulty of detecting structural variants (SVs) in genomic and transcriptomic sequencing data. Although very fast and precise, current read mapping tools usually fail to map sequencing reads that cross SV breakpoints or exon-exon boundaries. These events cause one or more splits in the read-to-reference alignment, with parts of the read mapping to different locations on the reference sequence. We present GUSTAF [1], a sound generic multi-split detection method implemented in the C++ library SeqAn [4]. GUSTAF uses SeqAn’s exact local aligner Stellar [2] to find partial read alignments, and reports precise breakpoints of SVs and exon-exon boundaries.
Preliminary Results
Structural Variants
True (TP) and false (FP) hits of Gustaf on a contig set with simulated variants:
• Simulated assembled contigs (37 bp to 607 bp) from Illumina read data
• Variants include insertions, deletions, inversions, translocations (tandem duplications)
• TP: predicted position deviates by at most 5 bp and predicted length by at most 10%
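[Figure: bar chart of the number of TP and FP predictions for insertion, deletion, inversion and translocation]
[Figure: schematic arrangements of numbered partial matches for Deletion, Insertion, Inversion, Translocation, Tandem Duplication and Dispersed Duplication]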
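The TP criterion listed above (position within 5 bp, length within 10%) can be sketched in a few lines. This is only an illustration, not the evaluation script behind the poster: the `Variant` record, the function names, the requirement that type and chromosome agree, and the choice to measure the length deviation relative to the simulated length are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical record for a simulated or predicted variant."""
    kind: str     # e.g. "deletion", "inversion"
    chrom: str
    pos: int      # breakpoint position in bp
    length: int   # variant length in bp


def is_true_positive(pred, truth, max_pos_dev=5, max_len_dev=0.10):
    """True if pred matches truth within the stated tolerances.

    Assumption: type and chromosome must agree, and the length
    deviation is measured relative to the simulated (true) length.
    """
    if pred.kind != truth.kind or pred.chrom != truth.chrom:
        return False
    if abs(pred.pos - truth.pos) > max_pos_dev:
        return False
    return abs(pred.length - truth.length) <= max_len_dev * truth.length


def count_hits(predictions, truths):
    """Return (TP, FP): a prediction is TP if it matches any simulated variant."""
    tp = sum(any(is_true_positive(p, t) for t in truths) for p in predictions)
    return tp, len(predictions) - tp
```

Under these assumptions, a predicted 215 bp deletion at position 1004 would count as a hit for a simulated 200 bp deletion at position 1000, whereas the same prediction at position 1006 would not.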
We observe:
• Promisingly high number of true positive (TP) hits with a moderate number of false positive (FP) hits
• FP mostly result from low repeat resolution, wrongly classified small indels and tandem duplications, and unfiltered pseudo deletions
• Work in progress ...
Generic Multi-Split Mapping with Gustaf
Gustaf's workflow:
Find partial alignments using Stellar
Stellar finds all maximal ε-matches
Read
>gi|217331561|gb|FJ423752.1| Homo sapiens STRN4/GPSN2 fusion mRNA, partial sequence
CTGGGGGACTTGGCAGATCTCACCGTCACCAACGACAACGACCTCAGCTGCGATGTGGAGATTCTGGACGCAAAGACAAGGGAGAAGCTGTGTTTCTTGGA
chr 19
CTTCGAGGTGGAGATTCTGGACGCAAAGACAAGGGAGAAGCTGTGTTTCTTGGA
|| ||| |||||||||||||||||||||||||||||||||||||||||||||||
CTGCGATGTGGAGATTCTGGACGCAAAGACAAGGGAGAAGCTGTGTTTCTTGGA
chr 19 (complement)
CTGGGGGACTTGGCAGATCTCACCGTCACCAACGACAACGACCTCAGCTGCGATGT-AAG-TTCT
||||||||||||||||||||||||||||||||||||||||||||||||||||||||  || ||||
CTGGGGGACTTGGCAGATCTCACCGTCACCAACGACAACGACCTCAGCTGCGATGTGGAGATTCT
chr 4 (complement)
GTGGAGATTCTGCACGCAAAGACAGGGGAGAAGCTGTGTTTC
|||||||||||| ||||||||||| |||||||||||||||||
GTGGAGATTCTGGACGCAAAGACAAGGGAGAAGCTGTGTTTC
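To make the notion of an ε-match concrete, the following sketch counts the errors in one of the gapped alignments shown above and checks them against an error-rate bound. It assumes that an ε-match is a local alignment whose error rate (mismatches plus gap columns, divided by the number of alignment columns) does not exceed ε and whose length reaches some minimum; the function names and the threshold values are illustrative, not Stellar's interface or defaults.

```python
def alignment_errors(row_a, row_b, gap="-"):
    """Count (mismatches, gap columns) in a gapped pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    mismatches = sum(
        1 for a, b in zip(row_a, row_b) if gap not in (a, b) and a != b
    )
    gaps = sum(1 for a, b in zip(row_a, row_b) if gap in (a, b))
    return mismatches, gaps


def is_epsilon_match(row_a, row_b, epsilon, min_length):
    """Assumed definition: long enough and error rate at most epsilon."""
    mismatches, gaps = alignment_errors(row_a, row_b)
    columns = len(row_a)
    return columns >= min_length and (mismatches + gaps) <= epsilon * columns


# The 65-column alignment with two gaps, copied from the example read.
db_row = "CTGGGGGACTTGGCAGATCTCACCGTCACCAACGACAACGACCTCAGCTGCGATGT-AAG-TTCT"
read_row = "CTGGGGGACTTGGCAGATCTCACCGTCACCAACGACAACGACCTCAGCTGCGATGTGGAGATTCT"

print(alignment_errors(db_row, read_row))
print(is_epsilon_match(db_row, read_row, epsilon=0.05, min_length=30))
```

The error count obtained this way (one mismatch and two gap columns) agrees with the edge weight 3 that the compatibility graph assigns to this match.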
Get exact breakpoint & chain alignments
Compatibility check and breakpoint computation
Graph with compatibility information
Nodes: start, Stellar matches 0 to 2, end 102
0: chr: 19 db: 47230700...47230763 read: 1...66
1: chr: 19 db: 14673330...14673384 + read: 48...102
2: chr: 4 db: 125384889...125384931 read: 55...97
Edge weight: edit distance of next match
*: breakpoint indicator
Edge weights shown: 3, 2*, 4*, 0, 5
Find best chain & report breakpoints
Find best chain
Multiple reads support breakpoints
Best chain of Stellar matches: start -[3]-> 0 -[2*]-> 1 -[0]-> end 102
Split-alignment determines precise breakpoint [3]
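The chaining step can be illustrated as a shortest-path search over the three matches of the example read. This is a minimal sketch under stated assumptions, not Gustaf's implementation: the `Match` record, the compatibility rule (a match may follow another if it begins and ends later in the read), the treatment of read coordinates as end-exclusive, and the penalty of one unit per read base left uncovered at the start or end are all hypothetical choices. They are consistent with the edge weights visible in the best chain, but they are not meant to reproduce every label in the graph, and breakpoint indicators are not modelled.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """Hypothetical record for one partial alignment of the read."""
    ident: int
    chrom: str
    read_begin: int  # 1-based begin position in the read
    read_end: int    # treated as end-exclusive (assumption)
    errors: int      # edit distance of the match


def compatible(a, b):
    """Assumed rule: b may follow a if it begins and ends later in the read."""
    return a.read_begin < b.read_begin and a.read_end < b.read_end


def build_graph(matches, read_end):
    """Adjacency lists of (successor, weight); weight = errors of next match."""
    graph = {"start": [], "end": []}
    for m in matches:
        # Assumption: uncovered read bases before/after a match cost 1 each.
        graph["start"].append((m.ident, (m.read_begin - 1) + m.errors))
        graph[m.ident] = [("end", read_end - m.read_end)]
    for a in matches:
        for b in matches:
            if compatible(a, b):
                graph[a.ident].append((b.ident, b.errors))
    return graph


def best_chain(graph):
    """Dijkstra from 'start' to 'end'; returns (cost, path) or None."""
    queue = [(0, 0, ["start"])]
    counter = 1  # tie-breaker so that paths are never compared
    done = set()
    while queue:
        cost, _, path = heapq.heappop(queue)
        node = path[-1]
        if node == "end":
            return cost, path
        if node in done:
            continue
        done.add(node)
        for successor, weight in graph[node]:
            if successor not in done:
                heapq.heappush(queue, (cost + weight, counter, path + [successor]))
                counter += 1
    return None


# Read coordinates as shown in the graph; error counts from the alignments.
matches = [
    Match(0, "19", 1, 66, 3),
    Match(1, "19", 48, 102, 2),
    Match(2, "4", 55, 97, 2),
]
print(best_chain(build_graph(matches, read_end=102)))
```

With these inputs the cheapest path visits matches 0 and 1, the same chain that the poster reports as best.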
Conclusions
GUSTAF is very versatile: it allows for multiple splits at arbitrary locations in the read, is independent of read length and sequencing platform, and supports both single-end and paired-end reads. Our results show that GUSTAF accurately detects inversions, inter- and intra-chromosomal translocations, insertions and deletions in genomic sequencing data, and identifies precise exon-exon junctions in RNA-Seq data, including gene fusion transcripts. In the future, we will extend the set of detectable variants by taking pseudo deletions into account and differentiating between tandem and dispersed duplications, while also resolving repeat regions accurately.
[1] Trappe, K.: Multi-Split-Mapping of NGS reads for variant detection. Master's thesis, Freie Universität Berlin (2012), http://www.seqan.de/projects/gustaf/
[2] Kehr, B., Weese, D., Reinert, K.: STELLAR: fast and exact local alignments. BMC Bioinformatics 12(Suppl 9), S15 (2011), http://www.seqan.de/projects/stellar/
[3] Emde, A.-K., Schulz, M. H., Weese, D., Sun, R., Vingron, M., Kalscheuer, V. M., Haas, S. A., Reinert, K.: Detecting genomic indel variants with exact breakpoints in single- and paired-end sequencing data using SplazerS. Bioinformatics 28(5), 619-627 (2012), http://www.seqan.de/projects/splazers/
[4] Döring, A., Weese, D., Rausch, T., Reinert, K.: SeqAn: an efficient, generic C++ library for sequence analysis. BMC Bioinformatics 9, 11 (2008)