GenomeTools – a versatile and efficient bioinformatics toolkit

Gordon Gremme, Sascha Steinbiss and Stefan Kurtz

Research Group for Genome Informatics, Center for Bioinformatics, University of Hamburg, Hamburg, Germany

[Diagram: GenomeTools architecture. The GenomeTools library (Core and Extended) is shown with the components memory management, data structures, efficient file access, option parser, multithreading support, math and combinatorics, bit-packed strings, tools and runtime, scripting languages, sequence parsers, encoded sequences, alignment, chaining, hidden Markov models, matching algorithms, enhanced suffix array construction and access, FM index construction and access, short read mapping, annotation parsers, annotation handling, annotation storage, stream processing, annotation visualization (AnnotationSketch), de novo LTR retrotransposon prediction and LTR retrotransposon annotation. External bindings: Java bindings (JNA) for Java applications, Python bindings (ctypes) for Python scripts, Ruby bindings (DL) for Ruby scripts.]

Bindings

• GenomeTools bindings available for
  • Java (via JNA FFI)
  • Python (via ctypes FFI)
  • Ruby (via Ruby::DL FFI)
  • Lua (embedded)
• no further software needed to build the bindings
• interface consistency across all supported languages
• native language idioms are preserved (e.g. error handling)
• performance benefits from C core
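The FFI approach named for the Python bindings can be illustrated with a minimal ctypes sketch. It does not use the GenomeTools API: the C standard library stands in for a C core, and the wrapper name `c_strlen` is hypothetical. It assumes a POSIX system where the C library can be loaded by ctypes.

```python
import ctypes
import ctypes.util

# Load the C standard library as a stand-in for a C core library.
# find_library may return None on some systems; CDLL(None) then falls back
# to the symbols already loaded into the running process (POSIX only).
_libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
_libc.strlen.argtypes = [ctypes.c_char_p]
_libc.strlen.restype = ctypes.c_size_t


def c_strlen(text: str) -> int:
    """Thin Python wrapper around the C function strlen (hypothetical name)."""
    # Misuse is reported as a Python exception, the idiom of the host language.
    if not isinstance(text, str):
        raise TypeError("text must be a str")
    return _libc.strlen(text.encode("ascii"))


if __name__ == "__main__":
    print(c_strlen("ACGT"))
```

No compiler or generated glue code is involved: the wrapper only declares the C signature and converts arguments, which is what "no further software needed to build the bindings" amounts to in this setting.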

[Architecture diagram, continued: Lua bindings for Lua scripts; tools: tallymer, sketch, ltrharvest, ltrdigest, sequniq, mgth, ...]

Index structures

• index all substrings of a sequence set
• allow efficient queries on large sequences
  • exact or approximate pattern matching
  • maximal repeat detection
  • maximal unique matches
  • minimal unique substrings
  • ...
• GenomeTools currently supports
  • enhanced suffix array [Abouelhoda et al., 2004]
  • FM index [Ferragina and Manzini, 2000]
• on-disk and in-memory construction
• flexible and efficient access
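How a suffix-based index answers exact pattern queries can be shown with a minimal sketch. It builds a plain suffix array by naive sorting and searches it by binary search. This is not the enhanced suffix array of Abouelhoda et al. (no LCP or child tables) and not the GenomeTools implementation; the function names are hypothetical.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction by sorting suffix strings; fine for small inputs only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return all start positions of pattern in text, in ascending order."""
    m = len(pattern)

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[first:lo] start with the pattern.
    return sorted(sa[first:lo])


if __name__ == "__main__":
    seq = "ACGTACGTTACG"
    sa = build_suffix_array(seq)
    print(find_occurrences(seq, sa, "ACG"))
```

Because all suffixes that start with the pattern are adjacent in the sorted order, each query needs only two binary searches over the index, regardless of how often the pattern occurs.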

Pros and cons of current genome analysis software

Collections of software programs for genome analysis (“tools” as standalone binaries: BLAST, HMMER, EMBOSS, ...)
++ “out of the box” usability
+/– optimization for one job
– integration requires glue scripts
– extensibility

Programming frameworks for custom software development (“library”: Bio*, SeqAn, ...)
+ extensibility
+ easy integration
– external software dependencies
– interface (in)consistency
– efficiency

Annotation drawing

• AnnotationSketch [Steinbiss et al., 2009b]
• generic high-quality annotation visualization component
• input: annotation graphs from an input stream
• output: PNG/PS/PDF/SVG
• drawing on GUI widgets possible (e.g. GTK via Cairo)
• extensively configurable via “smart” style files (using Lua callback functions)
• open plugin interface allows creation of custom visualization schemes (e.g. for expression data)
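The idea of a style file whose entries may be callback functions can be sketched as follows. This is an illustration in Python rather than Lua, and it does not reproduce the AnnotationSketch style file format: the table layout, the keys `fill` and `caption`, and the names `Feature` and `resolve_style` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Feature:
    """A drawable annotation feature (hypothetical minimal model)."""
    kind: str
    name: str
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start + 1


# Hypothetical style table: each value is either a constant or a callback
# that receives the feature and computes the value at drawing time.
STYLE: Dict[str, Dict[str, Any]] = {
    "gene": {"fill": "grey", "caption": lambda f: f.name},
    "exon": {
        "fill": lambda f: "red" if f.length > 1000 else "blue",
        "caption": lambda f: f"{f.name} ({f.length} bp)",
    },
}


def resolve_style(style: Dict[str, Dict[str, Any]], feature: Feature,
                  key: str, default: Any = None) -> Any:
    """Look up a style value, calling it if it is a callback."""
    value = style.get(feature.kind, {}).get(key, default)
    return value(feature) if callable(value) else value


if __name__ == "__main__":
    exon = Feature("exon", "CG11076:1", 100, 250)  # coordinates are made up
    print(resolve_style(STYLE, exon, "fill"), resolve_style(STYLE, exon, "caption"))
```

Letting a style value be a function of the feature is what makes a style "dynamic": colours and captions can depend on the data being drawn without changes to the drawing code.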

Design fundamentals of GenomeTools

Avoiding a monolithic codebase
• object-oriented implementation in plain C
• allows classical OO approaches
  • encapsulation, interfaces
  • design patterns
• implemented via strict adherence to clean code design guidelines [Gremme et al., 2007]

Goals
• rigorously tested code
  • built-in unit tests
  • extensive test suite
• portability
• speed and space efficiency
• minimalism
  • if in doubt, use the simpler solution

Encoded sequences

• bit-compressed sequence collections over alphabets of size ≤ 253
  • wildcard support (unique characters)
  • characters mapped to integers
  • alphabet transformation/reduction
  • default: most space-efficient representation automatically chosen
  • example: human genome → 2.000008 bits/base, GenBank → 2.014166 bits/base
• fast access to sequence contents
  • substring extraction
  • random/sequential character access
  • sequence comparison, k-mer streaming
• metadata: sequence lengths, descriptions, input file names, ...
• forward/reverse/complement views
  • access to a reverse or complement view of a sequence S without storing it separately
• efficient and convenient sequential access via iterators
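The principle behind a figure of roughly 2 bits per base can be illustrated with a minimal sketch of a 2-bit DNA encoding with random access. It is not the GenomeTools encoded sequence format: the class name `TwoBitSequence`, the side table for wildcards and the bit layout are assumptions chosen for brevity.

```python
from typing import Dict

_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASES = "ACGT"


class TwoBitSequence:
    """DNA sequence packed at 2 bits per base (hypothetical sketch)."""

    def __init__(self, seq: str) -> None:
        self.length = len(seq)
        # Four bases fit into one byte.
        self.data = bytearray((self.length + 3) // 4)
        # Characters outside A, C, G, T (wildcards such as N) are kept in a
        # side table; their 2-bit slot is filled with a dummy code 0.
        self.wildcards: Dict[int, str] = {}
        for i, ch in enumerate(seq):
            code = _CODE.get(ch)
            if code is None:
                self.wildcards[i] = ch
                code = 0
            self.data[i // 4] |= code << (2 * (i % 4))

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, i: int) -> str:
        """Random access to the character at position i."""
        if not 0 <= i < self.length:
            raise IndexError(i)
        if i in self.wildcards:
            return self.wildcards[i]
        return _BASES[(self.data[i // 4] >> (2 * (i % 4))) & 3]

    def substring(self, start: int, end: int) -> str:
        """Decode positions start..end-1."""
        return "".join(self[i] for i in range(start, end))

    def bits_per_base(self) -> float:
        """Payload bits per base, ignoring the wildcard side table."""
        return 8 * len(self.data) / self.length


if __name__ == "__main__":
    s = TwoBitSequence("ACGTNACG")
    print(s.substring(0, len(s)), s.bits_per_base())
```

Keeping the rare non-ACGT characters outside the packed array is one way to stay close to 2 bits per base while still supporting wildcards; other layouts are possible.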

Application: Repeat detection

LTR retrotransposon identification
• index-based de novo detection and annotation of LTR retrotransposons
• LTRharvest [Ellinghaus et al., 2008], LTRdigest [Steinbiss et al., 2009a]
• LTRharvest makes use of GenomeTools index structures to generate candidate pairs of degenerate repeats
• LTRdigest employs local alignment algorithms and hidden Markov models to annotate the inner regions
• clustering and postprocessing allow grouping into putative families
• efficient implementation and stream usage allow processing of large data sets, e.g. from mammals or plants

Figure 4: Structure of an LTR retrotransposon and model parameters for LTRharvest and LTRdigest. [Diagram labels: 5' and 3' LTRs flanked by target site duplications (TSD); LTR length, LTR distance (LTR retrotransposon length), LTR similarity, LTR palindromic motifs; boundary positions b5, e5, b3, e3; inner region with PBS, PPT and protein domain matches; gag, pol, (env); PR, RT, EN.]

Figure 5: Stream usage in an LTR retrotransposon identification pipeline. [Diagram labels: FASTA input, encoded sequence, ESA; parameters, constraints, tRNAs, pHMMs; GtLTRharvestStream, GtLTRharvestFASTAOutStream, GtLTRharvestTabOutStream, GtGFF3OutStream, GtGFF3InStream, GtLTRdigestStream, GtLTRFileoutStream; outputs: GFF3, tabular output, FASTA (PPT, Dom1, PBS, 3' LTR).]

K-mer frequency based repeat detection
• Tallymer [Kurtz et al., 2008]
• uses enhanced suffix arrays for counting, indexing, searching k-mers

Figure 3: Example visualization generated by AnnotationSketch.

AnnotationSketch style files provide
• dynamic styles
• dynamic captions

Availability and software requirements
• free, open source software under the BSD-like ISC license
• supported platforms:
  • UNIX (Linux, BSD, Mac OS X, ...)
  • Windows (with Cygwin)
• source available for download at http://genometools.org

Annotation graphs
• Sequence Ontology (SO) defines terms and relationships [Eilbeck et al., 2005]
• store annotations as DAGs
  • nodes: single features with properties (location etc.)
  • edges: SO part_of relationships
• annotations can be parsed from GFF3/BED/GTF/... or constructed manually via API functions
• pass connected components via root nodes
• iterator-based graph traversal (DFS/custom/...)

Figure 1: Example annotation graph. [Node labels: Gene: CG11076; Transcript: CG11076-RA, CG11076-RB; Exon: CG11076:1, CG11076:2, CG11076:3.]
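Annotation graphs (features as nodes, part_of edges, traversal from a root node) and their lazy, stream-wise processing can be sketched together in a few lines. This is a minimal illustration, not the GenomeTools API: the class `FeatureNode`, the three stream functions and all coordinates are hypothetical, and Python generators stand in for the stream objects.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class FeatureNode:
    """One feature; children are its parts (part_of edges point child -> parent)."""
    kind: str
    name: str
    start: int
    end: int
    children: List["FeatureNode"] = field(default_factory=list)

    def add_child(self, child: "FeatureNode") -> "FeatureNode":
        self.children.append(child)
        return child

    def traverse_dfs(self) -> Iterator["FeatureNode"]:
        """Yield this node and all descendants depth-first, each exactly once.

        A node may have several parents (the graph is a DAG, not a tree),
        so visited nodes are remembered.
        """
        seen = set()
        stack = [self]
        while stack:
            node = stack.pop()
            if id(node) in seen:
                continue
            seen.add(id(node))
            yield node
            stack.extend(reversed(node.children))


def source_stream(roots: Iterable[FeatureNode]) -> Iterator[FeatureNode]:
    """Source: hands out one connected component (root node) at a time."""
    for root in roots:
        yield root


def min_length_stream(stream: Iterable[FeatureNode], min_len: int) -> Iterator[FeatureNode]:
    """Processing stream: drops annotation graphs whose root is too short."""
    for root in stream:
        if root.end - root.start + 1 >= min_len:
            yield root


def tabular_sink(stream: Iterable[FeatureNode]) -> Iterator[str]:
    """Sink: formats every node of every remaining graph as one text line."""
    for root in stream:
        for node in root.traverse_dfs():
            yield f"{node.kind}\t{node.start}\t{node.end}\tName={node.name}"


if __name__ == "__main__":
    gene = FeatureNode("gene", "CG11076", 1000, 2000)  # made-up coordinates
    ra = gene.add_child(FeatureNode("mRNA", "CG11076-RA", 1000, 2000))
    rb = gene.add_child(FeatureNode("mRNA", "CG11076-RB", 1200, 2000))
    exon = FeatureNode("exon", "CG11076:2", 1200, 1500)
    ra.add_child(exon)
    rb.add_child(exon)  # shared exon: two parents
    for line in tabular_sink(min_length_stream(source_stream([gene]), 100)):
        print(line)
```

Because each stage is a generator, a graph is pulled through the whole pipeline only when the sink asks for it, so only one annotation graph needs to be in memory at a time.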

Stream processing

• Source stream: input parser (e.g. GFF3, BED, ...) or de novo annotation generator (gene prediction, ...)
• Processing stream: can modify, add or remove annotation graphs from a stream
• Sink stream: outputs annotation in a certain format (e.g. GFF3, BED, ...)

Figure 2: An example streaming pipeline. Streams allow “lazy” memory-efficient sequential processing of annotation graphs. [Diagram: annotation graphs for the genes CG11076 and CG11078 (mRNAs CG11076-RA, CG11076-RB, CG11078-RA, CG11078-RB and their exons) pass one at a time from the input through a source stream, processing streams (with parameters and intermediate output) and a sink stream to the output.]

Basic dependencies
• C/C++ compiler (gcc, clang, ...)
• GNU make

Optional external dependencies
• Cairo (for AnnotationSketch)
• HMMER3 (for LTRdigest)

References

Abouelhoda, M. I. et al. (2004). Replacing Suffix Trees with Enhanced Suffix Arrays. Journal of Discrete Algorithms, 2:53–86.

Eilbeck, K. et al. (2005). The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol, 6:R44.

Ellinghaus, D. et al. (2008). LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics, 9:18.

Ferragina, P. and Manzini, G. (2000). Opportunistic Data Structures with Applications. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, page 390.

Gremme, G. et al. (2007). The GenomeTools design. http://genometools.org/design.html.

Kurtz, S. et al. (2008). A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics, 9:517.

Steinbiss, S. et al. (2009a). Fine-grained annotation and classification of de novo predicted LTR retrotransposons. Nucleic Acids Res, 37(21):7002–7013.

Steinbiss, S., Gremme, G., et al. (2009b). AnnotationSketch: a genome annotation drawing library. Bioinformatics, 25(4):533–534.

12th Annual Bioinformatics Open Source Conference (BOSC) 2011 · Vienna, July 15–16, 2011

Mail: [email protected]

WWW: http://www.zbh.uni-hamburg.de
