An automated ideogram drawing software package

Stefan Böhringer*, René Gödde, Daniel Böhringer, Thorsten Schulte, Jörg T Epplen

Molecular Human Genetics, Ruhr-Universität Bochum, Germany

*Corresponding author: [email protected]
Phone: +49 234 32 28101, Fax: +49 234 31 14196
Molecular Human Genetics, Universitätsstr. 150, 44780 Bochum, Germany

Keywords: ideogram, automation, genome, annotation, open source
Abstract
A software package for the convenient visualization of ideograms is presented. The default configuration is tailored to results from human genome screens. The software draws a chromosomal ideogram and can highlight arbitrary marker positions. Labels can be attached freely and are positioned automatically to avoid overlap. To automate complex visualization tasks, a tool is provided that extracts marker positions by name from the location database (ldb). The software is highly configurable and can draw arbitrary caryograms, banding patterns and chromosome groupings. Annotations may be customized through program options or by implementing new annotation subclasses. Output is in Postscript format, which may be converted to arbitrary graphic file formats or used directly for high quality printing.
Introduction
Modern genetic analyses of complex disorders in humans continue to expand marker density in the genome (cf., e.g., GAMES). Visualising the strength of association of thousands of markers is a challenging yet indispensable task in these projects. Data should be presented comprehensively, but the representation should highlight important results clearly. We have developed a software package to fulfil these demands while remaining as user friendly as possible.
Methods
The software package is implemented in the Perl5 language (PERL). A single command accepts several configuration files and combines information from a number of sources. All input files use the "property list" (plist) format, which was introduced by Apple™ (APPLE) and can be easily parsed by computer programs as well as edited by hand. The file format is documented in a Perl5 module that reads and writes plist files and allows for easy data conversion from different sources.

In the following we describe the most important configurable parts. Because there is a plethora of options, we refer to the online documentation for a complete list and the exact syntax of the options. In its simplest form the program (coloredChromosomes.pl) is invoked without parameters; it then reads configuration files from default locations and places its output in the temporary directory ("/tmp"). Fig. 1 displays the default output together with several distances as defined in the configuration file (arrows with letters A - G). By default a human ideogram is drawn. All chromosomes are placed in lanes and grouped therein. Within a single lane chromosomes are aligned at their centromeres. Chromosomes are stretched vertically to optimally fill the space that remains after subtracting all vertical margins.

A refinement of this placement structure is shown in Fig. 2. This representation introduces the notion of subgroups, which can be used to build up two levels of grouping within a single lane. This option can be used to display diploid caryograms or to employ complex annotations.

Conceptually, annotations are separated into internal and external annotations. Internal annotations draw within the shape of a single chromosome, whereas external annotations draw alongside the shape of a chromosome. The main program only draws the shapes and names of the chromosomes. All further drawing is performed by annotation modules. For example, in Fig. 2 the left chromosome of each pair is annotated with the banding module, which draws a banding pattern inside the chromosome shape. In contrast, the interior of the right companion is drawn by the plain module, which simply fills it with a plain colour. Fig. 2 also displays an external annotation: the band names to the left of the chromosome pairs. The corresponding module bandingNames takes band sizes into account and insets certain names to avoid overlap. Its source code can be used as an example for writing similar modules. Further examples are given in the next section. The invocation of the application is summarised in Appendix A and detailed in the online documentation.

A second aspect of genomic annotation is the combination of data from different sources. We have developed a program to retrieve the localisation of arbitrary markers in the genome from the location database (ldb; LDB). The whole database can be downloaded by FTP (file transfer protocol) and stored on local disk. Our program (lociLocations.pl) looks up loci given in a text file or via standard input and produces a plist file that maps these locus names to chromosomal locations. Aliases for locus names can be given in an additional file. The program can also calculate the distribution of chromosomal distances between the loci whose locations are to be resolved. This option can be used to estimate how uniformly a given marker set saturates the genome. Again, details of the program options and of the format of input and output files are given in the online documentation. We now give some examples of how the program can be applied.
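The distance distribution mentioned above can be illustrated with a minimal sketch. It is not the code of lociLocations.pl; the locus names, the positions, the unit (Mb) and the function names adjacentDistances and gapSummary are assumptions made for this illustration. The sketch computes the gaps between neighbouring loci on each chromosome and summarises them.

```perl
use strict;
use warnings;

# Hypothetical locus map: name => [chromosome, position in Mb].
# Names, positions and the unit are invented for illustration only.
my %loci = (
    M1 => ['1', 10.0],
    M2 => ['1', 12.5],
    M3 => ['1', 20.0],
    M4 => ['2',  5.0],
    M5 => ['2', 15.0],
);

# Return the gaps between neighbouring loci, computed per chromosome.
sub adjacentDistances {
    my ($loci) = @_;
    my %byChromosome;
    push @{ $byChromosome{ $_->[0] } }, $_->[1] for values %$loci;

    my @gaps;
    for my $chr (sort keys %byChromosome) {
        my @pos = sort { $a <=> $b } @{ $byChromosome{$chr} };
        push @gaps, $pos[$_] - $pos[$_ - 1] for 1 .. $#pos;
    }
    return @gaps;
}

# Summarise the gaps by count, minimum, maximum and mean.
sub gapSummary {
    my @gaps = sort { $a <=> $b } @_;
    return unless @gaps;
    my $sum = 0;
    $sum += $_ for @gaps;
    return {
        count => scalar @gaps,
        min   => $gaps[0],
        max   => $gaps[-1],
        mean  => $sum / @gaps,
    };
}

my $summary = gapSummary(adjacentDistances(\%loci));
printf "gaps: %d, min: %.1f, mean: %.2f, max: %.1f\n",
    @{$summary}{qw(count min mean max)};
```

A maximum gap that is large relative to the mean points to regions that the marker set covers poorly.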
Examples
Fig. 3 shows an example of label annotations. Some candidate genes for a certain condition (in this case Multiple Sclerosis) are shown. A small rectangle is drawn inside the chromosome, and a label connected to the rectangle by a line is drawn outside the chromosome. Two modules manage this drawing: itags (internal tags) draws the rectangles and etags (external tags) draws the labels. If two labels would overlap when placed directly side by side, they are moved and a Bézier curve connects rectangle and label. The algorithm that decides on the movements tries to minimise the total amount of movement when a new label is introduced. The result therefore depends on the order of label placement, but the algorithm works well in practice.

Fig. 4 shows an example of a complex annotation. We have used the program for a genome screen in Multiple Sclerosis involving about 6000 microsatellite markers. We show only a single chromosome, for which the marker names are truncated because of a nondisclosure agreement in the collaboration (GAMES). In this case a subgroup configuration is chosen, showing the banding pattern on the left-hand chromosome and the annotations on a plainly drawn chromosome on the right-hand side. Here each label is associated with a value ranging from 0 to 1. In this particular case the value is the p-value of a statistical test evaluating the association of the respective marker with disease. Each rectangle drawn inside the chromosome is colour coded. Values < 0.05 (significant results) are highlighted in yellow; values >= 0.05 are displayed in green, red and blue, and these colours are blended smoothly. The points of interpolation and the colours used can be chosen at will in the configuration file holding the label information. Labels are sorted so that small values are drawn last, which gives significant results priority when rectangles overlap.

Another notable feature is that not all rectangles have labels attached. A cut-off can be defined that determines, by the value attached to the label, whether the label is actually drawn (in this case all values < 0.05 have labels attached). In addition, the p-value is shown directly below the label text, which can be extended by such a second line of information. For all participants of the GAMES project we have prepared a location file from the ldb database, so that only the significance values of the individual markers are needed to represent a result graphically.
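The colour coding, drawing order and label cut-off described for Fig. 4 can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the positions of the interpolation points, the marker names and values, and the function names colourForValue and drawOrder are hypothetical, and linear blending is assumed as one plausible way to blend colours smoothly.

```perl
use strict;
use warnings;

# Hypothetical interpolation points [value, [r, g, b]], sorted by value.
# The colours follow the green/red/blue ramp named in the text; the
# positions of the points are assumptions.
my @points = (
    [0.05, [0, 1, 0]],    # green
    [0.50, [1, 0, 0]],    # red
    [1.00, [0, 0, 1]],    # blue
);
my @highlight = (1, 1, 0);    # yellow for values below the threshold
my $threshold = 0.05;

# Map a value in [0, 1] to an RGB triple by linear interpolation.
sub colourForValue {
    my ($v) = @_;
    return @highlight if $v < $threshold;
    return @{ $points[0][1] } if $v <= $points[0][0];
    for my $i (1 .. $#points) {
        my ($v0, $c0) = @{ $points[$i - 1] };
        my ($v1, $c1) = @{ $points[$i] };
        next if $v > $v1;
        my $t = ($v - $v0) / ($v1 - $v0);
        return map { $c0->[$_] + $t * ($c1->[$_] - $c0->[$_]) } 0 .. 2;
    }
    return @{ $points[-1][1] };    # values beyond the last point
}

# Largest values first, so that small (significant) values are drawn last.
sub drawOrder {
    my ($values) = @_;
    return sort { $values->{$b} <=> $values->{$a} } keys %$values;
}

# Hypothetical marker names and p-values.
my %pValues     = (ABC => 0.72, DEF => 0.01, GHI => 0.30);
my $labelCutoff = 0.05;

for my $name (drawOrder(\%pValues)) {
    my $v     = $pValues{$name};
    my $label = $v < $labelCutoff ? sprintf('%s (%.2f)', $name, $v) : '(no label)';
    printf "%-3s rgb=(%.2f, %.2f, %.2f) %s\n", $name, colourForValue($v), $label;
}
```

Drawing in descending order of value means that a later, more significant rectangle paints over an earlier one where the two overlap.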
Software installation and data conversion
The software has been tested under Linux and Windows 2000 but should work on any platform with a Perl5 installation. The Perl5 website (PERL) lists supported platforms. The Postscript output can be printed directly (Ghostscript offers Postscript filters for almost any printer; GHOST). If a bitmap representation is required, tools like Gimp (GIMP) or ImageMagick (IMAGE) can be used. Bitmaps in this publication were produced using Gimp. Note, however, that both Gimp and ImageMagick require Ghostscript for Postscript import. All tools mentioned are free software and are available for most Unix platforms and Windows.
Conclusions
The ideogram drawing software has automated a large scale ideogram drawing project (GAMES) without any need for manual postprocessing. The software also has applications in education, in a broad range of genetic presentations and in the web visualisation of ideograms for arbitrary species. The package can be fully customised and is designed for easy extensibility (cf. Appendix B). Full source code is provided to allow flexible user customisation. By making our software available, we hope to bundle different efforts in ideogram drawing. It can be downloaded from our web page (BIOINF).
Acknowledgements
This work was supported by the Heinrich und Alma von Vogelsang foundation.
Appendix

A Program invocation

    coloredChromosomes.pl [--help] [--chromosomeSpec specificationFile] [--placement placementName] [--labelFile labelFile] [--labelMap mapFile] [--labelPlacement placementName] [--o output path]

A specification file other than the default is given by --chromosomeSpec; it describes the chromosomal layout and banding pattern of some organism. The option --placement chooses a placement pattern for the chromosomes on the paper from the set of patterns present in the configuration file. --labelFile denotes the path of an additional file that holds annotation information to be drawn within or alongside the chromosomal shapes. For such a labelFile the option --labelPlacement specifies a section in the main configuration file that gives further options to control annotation. If annotations are complex, --labelMap enables a further level of indirection by giving a file that represents chromosomal locations by name; these names can be referred to in the labelFile. A mapFile is produced by the program lociLocations.pl as described in the Methods section.

B Programming annotations

Complete source code is available to extend this software package. However, if only additional annotations are required, one can easily implement a new annotation class, which can then be used without any further modification of the main source code. As an example we show a simple class for internal annotations that draws a plain colour within the chromosome shape, named CCAplain (all annotation class names have to be prefixed with "CCA").

    package CCAplain;
    require 5.000;

    # Inherit from the base annotation class.
    @ISA = qw(CCAnnotation);
    use CCAnnotation;

    sub draw {
        my ($self) = @_;

        # Positional information of the current chromosome
        # (shown for demonstration only; not needed by this class).
        my ($x, $y, $p, $q, $w) = ($self->{position}{x}, $self->{position}{y},
            $self->{position}{p}, $self->{position}{q}, $self->{position}{w});
        my $hw = $w / 2.0;

        # Set the configured colour and fill the prepared chromosome path.
        $self->{ps}->setrgbcolor($self->{Colors}->rgbColor($self->{annotation}{color}));
        $self->{ps}->fill();
    }

    1;
Each new class has to inherit from the base annotation class CCAnnotation and has to implement the draw() method. $self->{ps} represents a Postscript object on which Postscript commands can be invoked. The annotation object is also initialized to hold the layout of the current chromosome. Other configuration information is available and is detailed in the online documentation for the CCAnnotation class (e.g. the global colour object $self->{Colors}). The only difference between internal and external annotations is that the drawing area is clipped to the chromosome shape for internal annotations. An object for internal annotations therefore does not need to take care of irregularities in the shape of the chromosome and can assume a rectangular area. In addition, the path for the chromosome shape can be expected to have been built up prior to the invocation of draw(). This enables the plain module to just set the colour and fill. The positional information ($x, $y, $p, $q, $w) is therefore shown for demonstration purposes only and could be omitted.
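The draw() contract described above can be exercised without the full package by substituting simple stand-ins for the objects the class receives. The following sketch is an assumption-laden illustration: MockPS and MockColors are hypothetical helper classes, the inline CCAnnotation is a minimal placeholder and not the package's base class, the colour table is invented, and it is assumed (from the listing above) that rgbColor returns three components that are passed on to setrgbcolor.

```perl
use strict;
use warnings;

# Stand-in for the Postscript object: records calls instead of emitting
# Postscript. Only the two methods used by CCAplain are mocked.
package MockPS;
sub new         { return bless { log => [] }, shift }
sub setrgbcolor { my $self = shift; push @{ $self->{log} }, "setrgbcolor @_" }
sub fill        { my $self = shift; push @{ $self->{log} }, 'fill' }

# Stand-in for the global colour object; the colour table is invented.
package MockColors;
my %table = (grey => [0.8, 0.8, 0.8]);
sub new      { return bless {}, shift }
sub rgbColor { my ($self, $name) = @_; return @{ $table{$name} } }

# Minimal placeholder for the base class, so that this file runs on its own.
package CCAnnotation;
sub new { my ($class, %args) = @_; return bless {%args}, $class }

# The class under test: sets the configured colour and fills the current path.
package CCAplain;
our @ISA = ('CCAnnotation');
sub draw {
    my ($self) = @_;
    $self->{ps}->setrgbcolor($self->{Colors}->rgbColor($self->{annotation}{color}));
    $self->{ps}->fill();
}

package main;

my $ps         = MockPS->new;
my $annotation = CCAplain->new(
    ps         => $ps,
    Colors     => MockColors->new,
    annotation => { color => 'grey' },
);
$annotation->draw;
print "$_\n" for @{ $ps->{log} };    # the recorded drawing commands
```

Recording the calls in this way makes it possible to check a new annotation class in isolation before it is used with the main program.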
References
APPLE. Apple Computer Inc. http://developer.apple.com/techpubs/macosx/Cocoa/CocoaTopics.html
BIOINF. Bioinformatics resources of the department for Molecular Human Genetics in Bochum, Germany. http://mhg.uni-bochum.de/bioinformatics
GAMES. Genetic analysis of Multiple Sclerosis in Europeans. http://www.mrcbsu.cam.ac.uk/MSgenetics/GAMES
GHOST. Ghostscript and Ghostview. http://www.cs.wisc.edu/~ghost
GIMP. The GNU Image Manipulation Program. http://www.gimp.org
IMAGE. Image conversion. http://www.imagemagick.org
LDB. Location data base. http://cedar.genetics.soton.ac.uk/public_html/ldb.html
PERL. Practical extraction and report language. http://www.perl.org
Figure 1 Ideogram output as produced by default parameters. Double arrows and delimiters indicate distances that can be set in the configuration file. A: inner group spacing, B: between group spacing, C: chromosome width, D: left margin, E: bottom margin, F: chromosome top margin, G: chromosome bottom margin
Figure 2 Example of an ideogram using subgroups. Pairs of chromosomes are drawn (subgroup) and grouped together (chromosomes 1-3, 4-5, 6-12 etc.). Band names are drawn to the left of the chromosome pairs. The concepts behind the drawing mechanisms are explained in the text.
Figure 3 Annotations to visualize the distribution of a set of markers over the human genome.
Figure 4 Large scale annotation resulting from a genome screen searching for genetic association. The locus names have been truncated to three letters to honor a nondisclosure agreement in the GAMES collaboration.