BIOINFORMATICS APPLICATIONS NOTE
Vol. 19 no. 17 2003, pages 2325–2327 DOI: 10.1093/bioinformatics/btg316
libsequence: a C++ class library for evolutionary genetic analysis Kevin Thornton Committee on Genetics, University of Chicago, 1101 East 57th Street, Chicago, IL 660 37, USA Received on March 31, 2003; revised on May 12, 2003; accepted on May 31, 2003
1
INTRODUCTION
The rapid increase in sequence data both from multiple species and from multiple individuals within species poses a demand for flexible software tools to deal with such data. Currently, much progress has been made on the front of generating routines for obtaining and manipulating large sequence databases, and for parsing the output of commonly used bioinformatics tools, in large part due to the efforts of the bioperl project (http://www.bioperl.org). However, widely available tools for the analysis of single nucleotide polymorphism (SNP) data are limited by an inability to perform ‘batch’ analyses (i.e. quickly summarize large volumes of data), and/or are simply not portable across different operating systems. This paper describes libsequence, a C++ class library providing an object-oriented framework for the analysis of molecular population genetic data. The library facilitates the development of custom software in three ways. First, it provides methods for data storage/manipulation. Second, a wide range of commonly used statistics to summarize and analyze SNP data are implemented. Third, the software is integrated with tools to perform analyses based on coalescent simulation. While the library itself is targeted primarily at developers, several software packages making use of the library have been written which will be useful to end-users and are able to perform a wide range of common analyses of SNP data and coalescent simulation data. Once libsequence
Published by Oxford University Press.
is installed on a system, all of these other packages can be installed.
2 LIBRARY DESIGN 2.1 General comments The C++ language (Stroustrup, 1997) has several advantages for software development. First, its support for object-oriented programming allows the addition of new functionality to software while requiring little extra code to be written, resulting in a flexible and maintainable code base. Second, support for polymorphic data types guarantees that custom extensions to existing class hierarchies remain compatible with existing library features. When the source code to object-oriented libraries is available, programmers can extend the library functionality (through the C++ inheritance mechanism) for their own purposes with relatively little effort, and these extensions remain compatible with the existing code base.
2.2
Limitations
libsequence was written to impose no arbitrary limitations on a programmer. There are no built-in limits on the length of a sequence or the size of a polymorphism table. Rather, the only limitation is the amount of RAM on a system and the length constraints imposed by the C++ system itself. On the author’s system (Linux on a Pentium II workstation), the maximum length of a std::string is approximately 4.2 billion, which imposes the upper limit on the length of an object of a sequence, and the built-in maximum size of a polymorphism table is the square of that value (the maximum length of a string × the maximum number of elements in a vector).
2.3
Development notes
libsequence is written in ANSI/ISO C++ and was developed on an ×86 GNU/Linux system using the GNU C Compiler (GCC) 2.95, 3.1 and 3.2 compiler systems. It should compile with other compilers that have good support for the ANSI standard and for template instantiation (and template member functions). The build system was generated using autoproject (http://www.gnu.org/ directory/devel/specific/autoproject.html) and GNU libtool
2325
Downloaded from bioinformatics.oxfordjournals.org at University of Portland on May 23, 2011
ABSTRACT Summary: A C++ class library is available to facilitate the implementation of software for genomics and sequence polymorphism analysis. The library implements methods for data manipulation and the calculation of several statistics commonly used to analyze SNP data.The object-oriented design of the library is intended to be extensible, allowing users to design custom classes for their own needs. In addition, routines are provided to process samples generated by a widely used coalescent simulation. Availability: The source code (in C++) is available from http://www.molpopgen.org Contact:
[email protected]
K. Thornton
(http://www.gnu.org/software/libtool/libtool.html) is used to portably generate dynamic and static archives. A developer’s reference manual (containing several code examples) for the library was generated directly from comments in the library source code using the doxygen tool (http://www.doxygen.org) and is available in both html and pdf formats.
2.4
Operating systems
libsequence, and programs linking to it, have been compiled and tested on ×86 GNU/Linux Systems with GCC 2.95, 3.1 and 3.2, Sun systems with GCC 2.95, and Apple OS X systems with GCC 3.1 Support for OS X is limited to machines with a recent version of the Developer’s CD installed (December 2002 at the time of this writing). In general, a recent version of GCC (http://gcc.gnu.org) is preferable, but any standards-compliant compiler system should suffice.
License and availability
libsequence is distributed as source code under the terms of the GNU Public License (GPL) (http://www.gnu.org/ licenses.html). The source code, documentation, and examples are available for download from http://www.molpopgen.org. Documentation in html format can be found at http:// www.molpopgen.org/software/libsequence/doc/html. There are also several software packages based on libsequence available for download, also under the terms of the GPL, from http://www.molpopgen.org. Online documentation (‘man’) pages for these programs are available.
3 METHODS IMPLEMENTED 3.1 Sequence manipulation libsequence provides a flexible system for sequence manipulation, which is of use to the broader genomics community. Currently, the common FASTA format is supported, but developers can add custom formats via the C++ inheritance mechanism, requiring the writing of only two functions (one for input, one for output). These custom will be compatible with all library functions. This design principle extends to dealing with formatted alignment data and polymorphism tables. In general, however, bioperl may be a preferable platform if format conversion utilities are desired, as the intent here is to facilitate reading in data for more compuationally intensive tasks. In addition to input/output, sequences can be reversed, complemented, translated, and substrings can be taken from them. There are also methods for comparing sequences for equality, which take into account the possible presence of missing data in the sequence. libsequence also provides several function objects which facilitate writing code for tasks such as counting codon tables from a coding sequence (CDS) and calculating base composition.
2326
Summary or test statistic
Reference
Watterson’s theta Nucleotide diversity (π ) Theta H Number of polymorphic sites Number of mutations Number of external mutations Tajima’s D Fu and Li’s D, F , D ∗, F ∗ Walls B, B and Q Hudson’s partition test Hudson’s C 4-Gamete test (Rmin ) Number of haplotypes Haplotype diversity FST
Watterson (1975) Tajima (1983) Fay and Wu (2000) N/A N/A Fu and Li (1993) Tajima (1989) Fu and Li (1993); Simonsen et al. (1995) Wall (1999) Hudson et al. (1994) Hudson (1987) Hudson and Kaplan (1985) Depaulis and Veuille (1998) Depaulis and Veuille (1998) Hudson et al. (1992a,b); Slatkin (1993)
3.2
Divergence
libsequence implements two methods for calculating divergence between sequences. The first is the 2-parameter method of Kimura (Kimura, 1980) to calculate a distance per site between two sequences. The second is the Ka /Ks method of Comeron (Comeron, 1995), which is an extension of Li’s method (Li, 1993). Comeron’s method calculates amino acid (Ka ) and silent divergence (Ks ) per site between coding (CDS) sequence. The implementation of the method in libsequence uses Grantham’s distance (Grantham, 1974) to weight alternate evolutionary pathways in the case of codons that are divergent at more than one site. Also, there is the option to restrict the amount of divergence at a codon to a maximum of 1, 2 or all 3 positions. A set of classes also exist to simplify the writing of alternate weighting schemes, if they are desired. The point estimates of Ka and Ks from this method are highly correlated with maximum-likelihood estimates (i.e. Yang, 2000), but have the advantage of being much more rapid to compute. There is the disadvantage, however, that model-based hypothesis testing cannot be performed with these ‘approximate’ methods, and one should prefer the likelihood framework (Yang and Bielawski, 2000).
3.3
SNP and coalescent simulation analysis
The main objective in the development of libsequence was to simplify writing custom programs for SNP analysis. SNP analysis often has two components—the summary of the data and hypothesis testing using coalescent simulation (Hudson, 1983). A modified version of the C code by Hudson (Hudson, 2002) to generate genealogies under a coalescent process is included in the library with a C++ class interface. libsequence provides routines for a large number of summary statistics of SNP data in both sequence and simulated formats (Table 1). The details of all calculations are found in the library reference manual. libsequence was
Downloaded from bioinformatics.oxfordjournals.org at University of Portland on May 23, 2011
2.5
Table 1. List of methods for SNP analysis in libsequence
libsequence
written specifically to interact with Richard Hudson’s coalescent simulation program ms (Hudson, 2002), and separate versions of many routines are provided for simulated data where there would be a gain in efficiency to do so. There is also a routine for calculating the linkage disequilibrium coefficients r 2 , D, and D [see Hudson (2001) for a recent review] with an optional frequency filter.
REFERENCES
2327
Downloaded from bioinformatics.oxfordjournals.org at University of Portland on May 23, 2011
Comeron,J.M. (1995) A method for estimating the numbers of synonymous and nonsynonymous substitutions per site. J. Mol. Evolution, 41, 1152–1159. Depaulis,F. and Veuille,M. (1998) Neutrality tests based on the distribution of haplotypes under an infinite-site model. Mol. Biol. Evolution, 15, 1788–1790. Fay,J.C. and Wu,C.I. (2000) Hitchhiking under positive Darwinian selection. Genetics, 155, 1405–1413. Fu,Y.X. and Li,W.H. (1993) Statistical tests of neutrality of mutations. Genetics, 133, 693–709. Grantham,R. (1974) Amino acid difference formula to help explain protein evolution. Science, 85, 862–864. Hudson,R.R. (1983) Properties of a neutral allele model with intragenic recombination. Theoret. Population Biol., 23, 183–201. Hudson,R.R. (1987) Estimating the recombination parameter of a finite population-model without selection. Genet. Res., 50, 245–250. Hudson,R.R. (2002) Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics, 18, 337–338. Hudson,R.R. and Kaplan,N.L. (1985) Statistical properties of the number of recombination events in the history of a sample of DNA-sequences. Genetics, 111, 147–164. Hudson,R.R., Slatkin,M. and Maddison,W.P. (1992a) Estimation of levels of gene flow from DNA-sequence data. Genetics, 132, 583–589.
Hudson,R.R., Boos,D.D. and Kaplan,N.L. (1992b) A statistical test for detecting geographic subdivision. Mol. Biol. Evolution, 9, 138–151. Hudson,R.R., Bailey,K., Skarecky,D., Kwiatowski,J. and Ayala,F.J. (1994) Evidence for positive selection in the superoxidedismutase (Sod) region of Drosophila-melanogaster. Genetics, 136, 1329–1340. Hudson,R.R. (2001) Linkage disequilibrium and recombination. In: Balding,D.J., Bishop,M. and Cannings,C. (eds), Handbook of Statistical Genetics. Wiley, New York, pp. 309–324. Kimura,M. (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evolution, 16, 111–120. Li,W.H. (1993) Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J. Mol. Evolution, 36, 96–99. Simonsen,K.L., Churchill,G.A. and Aquadro,C.F. (1995) Properties of statistical tests of neutrality for DNA polymorphism data. Genetics, 141, 413–429. Slatkin,M. (1993) Isolation by distance in equilibrium and nonequilibrium populations. Evolution, 47, 264–279. Stroustrup,B. (1997) The C++ Programming Language, 3rd edn. Addison-Wesley, Reading, MA. Tajima,F. (1983) Evolutionary relationship of DNA sequences in finite populations. Genetics, 105, 437–460. Tajima,F. (1989) Statistical-method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics, 123, 585–595. Wall,J.D. (1999) Recombination and the power of statistical tests of neutrality. Genet. Res., 74, 65–79. Watterson,G.A. (1975) Number of segregating sites in genetic models without recombination. Theoret. Population Biol., 7, 256–276. Yang,Z. (2000) Phylogenetic Analysis by Maximum Likelihood (PAML), Version 3.0. University College London, London. Yang,Z.H. and Bielawski,J.P. (2000) Statistical methods for detecting molecular adaptation. Trends Ecol. Evolution, 15, 496–503.