GenomeTools – a versatile and efficient bioinformatics toolkit
Gordon Gremme, Sascha Steinbiss and Stefan Kurtz
Research Group for Genome Informatics, Center for Bioinformatics, University of Hamburg, Hamburg, Germany
Architecture overview (figure): the GenomeTools library is divided into a core and an extended part; tools and external bindings build on the library.
• Library components shown: memory management, data structures, efficient file access, option parser, math and combinatorics, multithreading support, bit-packed strings, tools and runtime, sequence parsers, encoded sequences, alignment, chaining, hidden Markov models, matching algorithms, enhanced suffix array construction and access, FM index construction and access, short read mapping, annotation parsers, annotation handling, annotation storage, stream processing, AnnotationSketch (annotation visualization), LTR (de novo LTR retrotransposon prediction, LTR retrotransposon annotation), bindings (scripting languages)
• External bindings shown: Java application via Java bindings (JNA), Python script via Python bindings (ctypes), Ruby script via Ruby bindings (DL)
Bindings
• GenomeTools bindings available for
  • Java (via JNA FFI)
  • Python (via ctypes FFI)
  • Ruby (via Ruby::DL FFI)
  • Lua (embedded)
• no further software needed to build the bindings
• interface consistency across all supported languages
• native language idioms are preserved (e.g. error handling)
• performance benefits from C core
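To illustrate what an FFI-based binding of this kind involves, the following minimal sketch calls a C function from Python through ctypes. It deliberately uses `strlen` from the C standard library rather than any GenomeTools function, since the poster does not list the GenomeTools C API; the wrapper name `c_strlen` is hypothetical, and a POSIX system is assumed.

```python
import ctypes
import ctypes.util

# Assumption: POSIX system. find_library may return None, in which case
# CDLL(None) exposes the symbols already loaded into the running process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature so ctypes converts arguments and results correctly.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Hypothetical wrapper: length of a NUL-terminated byte string, computed in C."""
    return libc.strlen(data)
```

A binding layer built this way needs only the shared library and the interpreter's standard FFI module, which is in line with the "no further software needed" point above.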
Architecture figure labels: Lua script via Lua bindings; tools: tallymer, sketch, ltrharvest, ltrdigest, sequniq, mgth, ...
Index structures
• index all substrings of a sequence set
• allow efficient queries on large sequences
  • exact or approximate pattern matching
  • maximal repeat detection
  • maximal unique matches
  • minimal unique substrings
  • ...
• GenomeTools currently supports
  • enhanced suffix array [Abouelhoda et al., 2004]
  • FM index [Ferragina and Manzini, 2000]
• on-disk and in-memory construction
• flexible and efficient access
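As a rough illustration of how an index over all suffixes supports exact pattern matching, here is a minimal sketch of a plain suffix array with binary search. It is not the enhanced suffix array of Abouelhoda et al. and not the GenomeTools implementation: construction is naive, no additional tables are built, and the function names are hypothetical.

```python
def build_suffix_array(text):
    """Naive construction: sort all suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern in text.

    Suffixes are sorted, so their length-m prefixes are non-decreasing and
    all matches form one contiguous interval of the suffix array.
    """
    m = len(pattern)
    # Left boundary: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    # Right boundary: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[left:lo])
```

Each query costs a logarithmic number of string comparisons regardless of how many sequences were indexed, which is why such structures pay off once built.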
Pros and cons of current genome analysis software
• collections of software programs for genome analysis (“tools” as standalone binaries: BLAST, HMMER, EMBOSS, ...)
  ++ “out of the box” usability
  +/– optimization for one job
  – integration requires glue scripts
  – extensibility
• programming frameworks for custom software development (“library”: Bio*, SeqAn, ...)
  + extensibility
  + easy integration
  – external software dependencies
  – interface (in)consistency
  – efficiency
Annotation drawing
• AnnotationSketch [Steinbiss et al., 2009b]
• generic high-quality annotation visualization component
• input: annotation graphs from an input stream
• output: PNG/PS/PDF/SVG
• drawing on GUI widgets possible (e.g. GTK via Cairo)
• extensively configurable via “smart” style files with Lua callback functions
• open plugin interface allows creation of custom visualization schemes (e.g. for expression data)
Design fundamentals of GenomeTools
Goals
• rigorously tested code
  • built-in unit tests
  • extensive test suite
• portability
• speed and space efficiency
• minimalism
  • if in doubt, use the simpler solution
Avoiding a monolithic codebase
• object-oriented implementation in plain C
• allows classical OO approaches
  • encapsulation, interfaces
  • design patterns
• implemented via strict adherence to clean code design guidelines [Gremme et al., 2007]
Encoded sequences
• bit-compressed sequence collections over alphabets of size ≤ 253
• wildcard support (unique characters)
• characters mapped to integers
• alphabet transformation/reduction
• default: most space-efficient representation automatically chosen
  Example: human genome → 2.000008 bits/base, GenBank → 2.014166 bits/base
• fast access to sequence contents
• substring extraction
• random/sequential character access
• sequence comparison, k-mer streaming
• metadata: sequence lengths, descriptions, input file names, ...
• forward/reverse/complement views
• access to SS without storing S
• efficient and convenient sequential access via iterators
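The figures of roughly 2 bits/base are consistent with a four-letter DNA alphabet, since each of A, C, G and T can be mapped to a 2-bit integer. The sketch below shows that idea with random character access. It is an illustration only: the byte layout and the names `encode`, `char_at` and `decode` are assumptions, and wildcards, larger alphabets and metadata, which the actual encoded sequences support, are left out.

```python
# Assumed mapping of characters to 2-bit integer codes (illustrative only).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def encode(seq):
    """Pack a wildcard-free DNA string into 2 bits per base, 4 bases per byte."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        packed[i // 4] |= CODE[base] << (2 * (i % 4))
    return packed


def char_at(packed, i):
    """Random access: extract the base at position i without decoding the rest."""
    return BASES[(packed[i // 4] >> (2 * (i % 4))) & 3]


def decode(packed, length):
    """Sequential access: rebuild the first `length` bases."""
    return "".join(char_at(packed, i) for i in range(length))
```

The sequence length has to be stored separately, because the final byte may be only partly used.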
Application: Repeat detection
LTR retrotransposon identification
• index-based de novo detection and annotation of LTR retrotransposons
• LTRharvest [Ellinghaus et al., 2008], LTRdigest [Steinbiss et al., 2009a]
• LTRharvest makes use of GenomeTools index structures to generate candidate pairs of degenerate repeats
• LTRdigest employs local alignment algorithms and hidden Markov models to annotate the inner regions
• clustering and postprocessing allow grouping into putative families
• efficient implementation and stream usage allow processing of large data sets, e.g. from mammals or plants
K-mer frequency based repeat detection
• Tallymer [Kurtz et al., 2008]
• uses enhanced suffix arrays for counting, indexing, searching k-mers
Figure 4: Structure of an LTR retrotransposon and model parameters for LTRharvest and LTRdigest. (Figure labels: 5' and 3' LTR, TSD, PBS, PPT, gag, pol, PR, RT, EN, (env), protein domain matches, LTR palindromic motifs, LTR similarity, LTR length, LTR distance (LTR retrotransposon length), b5, e5, b3, e3.)
Figure 5: Stream usage in an LTR retrotransposon identification pipeline. (Figure labels: GtLTRharvestStream, GtLTRharvestFASTAOutStream, GtLTRharvestTabOutStream, GtLTRdigestStream, GtLTRFileoutStream, GtGFF3InStream, GtGFF3OutStream; ESA, encoded sequence, tRNAs, pHMMs, constraints, parameters; GFF3, FASTA, tabular output.)
Figure 3: Example visualization generated by AnnotationSketch.
AnnotationSketch style files: dynamic styles, dynamic captions
Availability and software requirements
• free, open source software under the BSD-like ISC license
• supported platforms:
  • UNIX (Linux, BSD, Mac OS X, ...)
  • Windows (with Cygwin)
• source available for download at http://genometools.org
Basic dependencies
• C/C++ compiler (gcc, clang, ...)
• GNU make
Optional external dependencies
• Cairo (for AnnotationSketch)
• HMMER3 (for LTRdigest)
Annotation graphs
• Sequence Ontology (SO) defines terms and relationships [Eilbeck et al., 2005]
• store annotations as DAGs
  • nodes: single features with properties (location etc.)
  • edges: SO part_of relationships
• annotations can be parsed from GFF3/BED/GTF/... or constructed manually via API functions
• pass connected components via root nodes
• iterator-based graph traversal (DFS/custom/...)
Figure 1: Example annotation graph. (Figure labels: Gene: CG11076; Transcript: CG11076-RA, CG11076-RB; Exon: CG11076:1, CG11076:2, CG11076:3.)
Stream processing
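The combination of annotation graphs and streams can be sketched with Python generators: each stream pulls root nodes from its predecessor one at a time, so only the graph currently being processed needs to be in memory. This is a minimal illustration, not the GenomeTools API. The class `FeatureNode`, the three stream functions and the length filter are all hypothetical names chosen for this sketch.

```python
class FeatureNode:
    """One annotation feature; child links model part-of edges of the DAG."""

    def __init__(self, ftype, name, start, end):
        self.ftype, self.name = ftype, name
        self.start, self.end = start, end
        self.children = []

    def add_child(self, child):
        self.children.append(child)
        return child

    def traverse(self):
        """Depth-first iteration; a node shared by several parents is visited once."""
        seen, stack = set(), [self]
        while stack:
            node = stack.pop()
            if id(node) in seen:
                continue
            seen.add(id(node))
            yield node
            stack.extend(reversed(node.children))


def source_stream(roots):
    """Source: hands out one connected component (root node) at a time."""
    for root in roots:
        yield root


def min_length_stream(stream, min_length):
    """Processing stream (hypothetical filter): drops roots shorter than min_length."""
    for root in stream:
        if root.end - root.start + 1 >= min_length:
            yield root


def sink_stream(stream):
    """Sink: renders every node of every remaining graph as a tab-separated line."""
    return [
        f"{node.ftype}\t{node.name}\t{node.start}\t{node.end}"
        for root in stream
        for node in root.traverse()
    ]
```

Chaining the calls as `sink_stream(min_length_stream(source_stream(roots), 100))` builds the whole pipeline without touching any data until the sink starts iterating.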
• Source stream: input parser (e.g. GFF3, BED, ...) or de novo annotation generator (gene prediction, ...)
• Processing stream: can modify, add or remove annotation graphs from a stream
• Sink stream: outputs annotation in a certain format (e.g. GFF3, BED, ...)
Figure 2: An example streaming pipeline. Streams allow “lazy” memory-efficient sequential processing of annotation graphs. (Figure labels: input, parameters, intermediate output, output; the example annotation graphs are the genes CG11076 and CG11078 with their mRNAs and exons.)
References
Abouelhoda, M. I. et al. (2004). Replacing Suffix Trees with Enhanced Suffix Arrays. Journal of Discrete Algorithms, 2:53–86.
Eilbeck, K. et al. (2005). The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol, 6:R44.
Ellinghaus, D. et al. (2008). LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics, 9:18.
Ferragina, P. and Manzini, G. (2000). Opportunistic Data Structures with Applications. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, page 390.
Gremme, G. et al. (2007). The GenomeTools design. http://genometools.org/design.html.
Kurtz, S. et al. (2008). A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics, 9:517.
Steinbiss, S. et al. (2009a). Fine-grained annotation and classification of de novo predicted LTR retrotransposons. Nucleic Acids Res, 37(21):7002–7013.
Steinbiss, S., Gremme, G., et al. (2009b). AnnotationSketch: a genome annotation drawing library. Bioinformatics, 25(4):533–534.
12th Annual Bioinformatics Open Source Conference (BOSC) 2011 · Vienna, July 15–16, 2011
Mail: [email protected]
WWW: http://www.zbh.uni-hamburg.de