
International Conference Mathematical and Computational Biology 2011
International Journal of Modern Physics: Conference Series
Vol. 9 (2012) 1–10
© World Scientific Publishing Company

DOI: 10.1142/S2010194512005041


CVTree: A WHOLE-GENOME-BASED AND ALIGNMENT-FREE APPROACH TO MICROBIAL PHYLOGENY

BAILIN HAO
T-Life Research Center and Department of Physics, Fudan University, Shanghai, 200433 China
and
Institute of Theoretical Physics, Academia Sinica, Beijing 100190, China
[email protected]

The number of sequenced genomes of Archaea, Bacteria, and Fungi is growing rapidly. Several thousand genomes of these unicellular organisms will be available in a few years. Because of the extremely large differences in genome size and gene content, it is difficult to use traditional alignment-based methods to infer phylogeny from the genomes. An alignment-free and whole-genome-based approach called CVTree has been developed and successfully applied to these organisms. As CVTree has been successfully applied to genomes of viruses, chloroplasts, Bacteria, Archaea and fungi, in this brief review we will mainly touch on some mathematical problems related to the foundation of the new approach, including a few as yet unsolved problems, such as the violation of the triangular inequalities by the dissimilarity measure used in the CVTree method.

Keywords: Alignment-free phylogeny; composition vector; bacteria taxonomy.

PACS numbers: 87.10.-e

1. Phylogeny versus Taxonomy

We begin with a simple glossary. “Prokaryote” is the collective name of Bacteria and Archaea, the two major domains of unicellular organisms that do not possess a real nucleus. In what follows we often use the colloquial name bacteria for both. “Phylogeny” and “taxonomy” are not synonymous but closely related notions. Taxonomy deals with the classification of species, extant as well as extinct. Phylogeny traces the evolutionary relationships of organisms back to their common ancestor. Historically, taxonomy came before phylogeny, with its own nomenclature and validation rules. As taxonomic ranks such as domain, phylum, class, order, family, genus, and species are imposed on the organisms by human beings, taxonomy may seem to have a more subjective character. However, both phylogeny and taxonomy draw their conclusions from morphological and molecular data. Ideally, a meaningful taxonomy should agree with a faithful phylogeny in demarcating the natural borderlines between species.




In 1965 Emile Zuckerkandl and Linus Pauling [1] conceived the idea that the amino acid sequences of proteins contain traces of evolutionary history. Their suggestion to infer phylogenetic information from the comparison of orthologous proteins laid the foundation for the molecular phylogeny of multicellular plants and animals. However, the progress of bacterial phylogeny along this line was hindered by the lack of a proper molecular clock until Carl Woese and coworkers [2] showed that the small subunit (SSU) ribosomal RNA (16S rRNA) of bacteria serves this purpose well.

2. Success and Limitation of 16S rRNA Analysis

The success of 16S rRNA phylogeny is witnessed by the fact that the contents of the new edition of Bergey’s Manual of Systematic Bacteriology [3] now “follow a phylogenetic framework based on analysis of the nucleotide sequence of the SSU RNA, rather than a phenotypic structure” (George Garrity’s Preface to Vol. 2). Consequently, the congruence of bacterial phylogeny and taxonomy on the basis of 16S rRNA analysis calls for independent verification of both. Furthermore, 16S rRNA analysis lacks resolving power at the taxonomic rank of species and below. On this point it is best to quote competent biologists themselves: “Although 16S phylogeny is arguably excellent for classification of Bacteria and Archaea from the domain level down to the family or genus, it lacks resolution below that level”, as Jim Staley, the Chair of the Bergey’s Manual Trust from 2001 to 2008, put it [4]; and “The single category for which SSU sequence divergence cannot provide a sharp resolution is species” [5].

3. The CVTree Approach

The basic idea of our new approach to bacterial phylogeny was announced on 17 June 2002 at an international conference in celebration of the 80th birthday of Prof. C. N. Yang [6]. We had a hard time getting the main paper published in a biological journal [7]. The method later acquired the name CVTree from its public domain web server [8]. Now we use the term CVTree for the algorithm, for the web server, and for the trees obtained by using the CVTree method. The CVTree approach [7–9] has several prominent features:

(1) As input data CVTree takes all the protein products from a prokaryote genome instead of using the short SSU rRNA sequences. In this way one circumvents the task of choosing orthologous genes. In addition, lateral gene transfer and lineage-dependent gene loss appear as mechanisms of genome evolution and should not cause difficulties for the inference of phylogenetic trees.

(2) Methodologically, CVTree is alignment-free and, consequently, parameter-free, as sequence alignment involves many parameters such as scoring matrices and gap penalties. The subtraction of a background caused by neutral mutations significantly improves the quality of the inferred trees.

(3) The CVTree results are justified by direct comparison with bacterial taxonomy




in contrast to traditional statistical re-sampling procedures such as bootstrapping or jackknifing. This was impossible a decade ago, as it was then questioned whether prokaryotic proteins contain phylogenetic information [10] and whole-genome-based phylogeny could “not resolve the major branchings of the Bacteria” [11].

As the CVTree algorithm has been elaborated before [7–9], here we give only a brief description. Take all the protein products encoded in a genome and count the number of overlapping K-peptides to form a raw Composition Vector (CV) of dimension $20^K$ by arranging the frequencies of appearance $f(\alpha_1\alpha_2\cdots\alpha_K)$ of the peptides $\alpha_1\alpha_2\cdots\alpha_K$ in lexicographic order, where each letter $\alpha_i$ is one of the 20 amino acids. Then the number of every K-peptide is predicted from that of the related (K − 1)- and (K − 2)-peptides by using a (K − 2)-th order Markovian relation

$$
f^0(\alpha_1\alpha_2\cdots\alpha_K) = \frac{f(\alpha_1\alpha_2\cdots\alpha_{K-1})\, f(\alpha_2\cdots\alpha_K)}{f(\alpha_2\cdots\alpha_{K-1})} \cdot \frac{(L-K+1)(L-K+3)}{(L-K+2)^2}. \tag{1}
$$

The above relation is written for a single protein sequence of length L. The superscript in $f^0$ indicates that it is a predicted value rather than an actual count $f$. When $L \gg K$ the second factor in Eq. (1) may be ignored and one may use

$$
f^0(\alpha_1\alpha_2\cdots\alpha_K) \approx \frac{f(\alpha_1\alpha_2\cdots\alpha_{K-1})\, f(\alpha_2\cdots\alpha_K)}{f(\alpha_2\cdots\alpha_{K-1})} \tag{2}
$$

instead. Then the difference between the predicted $f^0$ and the actual count $f$ is used to replace the corresponding component of the raw CV to yield a new, renormalized CV.

Suppose for species A and B we have two normalized CVs $\vec{A} = (a_1, a_2, \cdots, a_N)$ and $\vec{B} = (b_1, b_2, \cdots, b_N)$, where $N = 20^K$. The correlation of the two CVs is defined by the cosine of the angle between them:

$$
C(\vec{A}, \vec{B}) = \frac{\sum_{i}^{N} a_i \times b_i}{\left( \sum_{i}^{N} a_i^2 \times \sum_{j}^{N} b_j^2 \right)^{1/2}}. \tag{3}
$$

This is a normalized correlation, $C(\vec{A}, \vec{B}) \in [-1, 1]$. The correlation distance or dissimilarity between the two species is defined as

$$
D(\vec{A}, \vec{B}) = \frac{1 - C(\vec{A}, \vec{B})}{2} \in [0, 1]. \tag{4}
$$
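As an illustration of Eqs. (2)–(4), the following minimal Python sketch builds a sparse composition vector and evaluates the dissimilarity. It is not the CVTree implementation, and several points are assumptions made here: the function names are hypothetical, counts are pooled over all proteins and inserted into Eq. (2) directly, the denominator of Eq. (2) is taken to be the count of the middle (K − 2)-peptide, and each component is taken to be the relative difference (f − f⁰)/f⁰, since the text only speaks of “the difference”.

```python
from collections import Counter
from math import sqrt


def count_peptides(proteins, k):
    """Count overlapping k-peptides, pooled over all protein sequences."""
    counts = Counter()
    for protein in proteins:
        for i in range(len(protein) - k + 1):
            counts[protein[i:i + k]] += 1
    return counts


def composition_vector(proteins, k):
    """Sparse CV: K-peptide -> (f - f0) / f0, with f0 predicted as in Eq. (2).

    Assumption of this sketch: the relative difference is used as component.
    """
    if k < 3:
        raise ValueError("the (K-2)-th order Markov prediction needs K >= 3")
    f_k = count_peptides(proteins, k)
    f_k1 = count_peptides(proteins, k - 1)
    f_k2 = count_peptides(proteins, k - 2)
    letters = sorted(set("".join(proteins)))
    cv = {}
    for prefix, n_prefix in f_k1.items():
        for letter in letters:
            peptide = prefix + letter
            n_suffix = f_k1.get(peptide[1:], 0)
            if n_suffix == 0:
                continue  # predicted count is zero; component stays zero
            # middle (K-2)-peptide: shared by the prefix and the suffix
            predicted = n_prefix * n_suffix / f_k2[peptide[1:-1]]
            value = (f_k.get(peptide, 0) - predicted) / predicted
            if value != 0.0:
                cv[peptide] = value
    return cv


def dissimilarity(cv_a, cv_b):
    """Eq. (4): D = (1 - C) / 2, with C the cosine of Eq. (3)."""
    dot = sum(value * cv_b.get(peptide, 0.0) for peptide, value in cv_a.items())
    norm_a = sqrt(sum(value * value for value in cv_a.values()))
    norm_b = sqrt(sum(value * value for value in cv_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for an all-zero composition vector")
    return (1.0 - dot / (norm_a * norm_b)) / 2.0


# Toy input (made-up sequences, not real proteins), K = 3
genome_a = ["MKVLAAGIVGLLLAQ", "MKKLLAAGWVG"]
genome_b = ["MSTNPKPQRKTKRNT", "MKVLAAGIV"]
d_ab = dissimilarity(composition_vector(genome_a, 3), composition_vector(genome_b, 3))
```

Storing only non-zero components in a dictionary avoids allocating all $20^K$ entries, which matters for K = 6 or 7.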

We treat $D(\vec{A}, \vec{B})$ as a distance and build a phylogenetic tree from the “distance matrix” by using the Neighbor-Joining (NJ) algorithm from the Phylip [12] package. We note that the NJ algorithm has long had the reputation of being a robust method, and quite recently it was proved to be a quartet method in phylogeny [13]. Although the underlying idea of CVTree is quite simple, it is not easy for a practicing biologist to implement. Therefore, we have designed a public domain CVTree Web Server, which has been published twice [8,14] since 2004.
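To hand a dissimilarity matrix to a Neighbor-Joining program, it has to be written in that program's input format. The sketch below assumes the classic square-matrix layout read by PHYLIP distance programs such as `neighbor` (a first line with the number of taxa, then one row per taxon starting with a name padded to ten characters); the function name `to_phylip` and the numbers in the example are hypothetical.

```python
def to_phylip(names, matrix):
    """Format a square dissimilarity matrix in the classic PHYLIP layout."""
    if len(names) != len(matrix) or any(len(row) != len(names) for row in matrix):
        raise ValueError("matrix must be square and match the number of names")
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        # Strict PHYLIP reads exactly 10 characters as the taxon name;
        # truncation may create duplicates, so short unique labels are safer.
        label = name[:10].ljust(10)
        lines.append(label + " " + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"


# Made-up dissimilarities for three taxa, for illustration only
example = to_phylip(
    ["Ecoli", "Bsubtilis", "Mjannasch"],
    [[0.0, 0.41, 0.47],
     [0.41, 0.0, 0.46],
     [0.47, 0.46, 0.0]],
)
```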



The CVTree method has been successfully applied to viruses [15,16], Bacteria and Archaea [7,17,18], chloroplasts [19], and fungi [20]. We cite a recent paper [21] where detailed references to these applications may be found.


4. A Few Mathematical Problems

Being a newly proposed method that differs from traditional prokaryotic phylogeny in many respects, the CVTree approach raises some foundational mathematical problems which have been solved only partially. We list a few of these problems.

4.1. Decomposition and reconstruction of symbolic sequences: problem of uniqueness

Since bacterial genomes differ very much in size and gene content, it is almost impossible to align two genomes. In order to be alignment-free we are forced to compare the collections of overlapping K-peptides obtained from the protein sequences translated from the genomes. It is natural to ask what the relation is between a primary protein sequence and the collection of K-peptides derived from it. Are they equivalent, and in what sense? This question leads to the following mathematical problem.

Consider an alphabet Σ made of a finite number of letters. The cardinality of Σ is 4 for the nucleotide alphabet or 20 for the amino acid alphabet. Take a sequence of length L over this alphabet and fix a small integer K. Decompose this sequence into overlapping K-tuples, starting from the first letter of the sequence and shifting by one letter at a time. In this way, we get a collection of L − K + 1 K-tuples. This decomposition can always be done. A more interesting problem is the converse: reconstruct a sequence of length L from the collection of K-tuples, using each K-tuple once and only once. The reconstruction problem is always solvable, as at least the original sequence can be recovered. A deeper question is the uniqueness of the reconstruction. We know it is not unique when K is too small (think of the extreme case K = 1) and it is unique if K is large enough (think of the extreme case K = L − 1). How does the uniqueness of the reconstruction depend on K?
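The decomposition and the converse problem can be made concrete with a small brute-force sketch. It is only an illustration under assumptions made here: the function names are hypothetical, every sequence that uses each K-tuple exactly once is counted regardless of its first letter, and a cut-off stops the search once enough reconstructions have been found.

```python
from collections import Counter


def decompose(seq, k):
    """Overlapping K-tuples of seq, shifting by one letter at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_reconstructions(seq, k, cutoff=1000):
    """Count distinct sequences using every K-tuple of seq exactly once.

    The search stops as soon as `cutoff` reconstructions have been found.
    """
    remaining = Counter(decompose(seq, k))
    total = sum(remaining.values())
    found = 0

    def extend(last, used):
        nonlocal found
        if found >= cutoff:
            return
        if used == total:
            found += 1
            return
        for candidate in list(remaining):
            # the next tuple must overlap the previous one in K-1 letters
            if remaining[candidate] > 0 and candidate[:-1] == last[1:]:
                remaining[candidate] -= 1
                extend(candidate, used + 1)
                remaining[candidate] += 1

    for first in list(remaining):
        remaining[first] -= 1
        extend(first, 1)
        remaining[first] += 1
    return found
```

For the toy sequence `"XABACAY"` with K = 2 the two loops through B and C can be traversed in either order, so the count is 2 (`XABACAY` and `XACABAY`); with K = 3 the same sequence is reconstructed uniquely.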
It turns out that the uniqueness problem may be transformed into the problem of counting the Eulerian loops in a graph determined by the original protein sequence, and one can make use of the wealth of results in graph theory accumulated since Leonhard Euler studied the problem of the Seven Bridges of Königsberg in the year 1735. We posted a short note on this problem to the electronic archive arXiv [23] and continued our effort on the main front of attack on bacterial phylogeny. A few years later there appeared a paper by Kontorovich [24] in an entirely different context that cited our e-preprint [23]. In that paper a theorem was proved stating that there exist automata which can determine whether a symbolic sequence has a unique reconstruction for a given K. Unfortunately, it was an existence theorem and one could not realize the automaton explicitly by following the proof. This fact stimulated a student in my



group to design such an automaton explicitly [25,26]. It should be noted that it is a deterministic finite-state automaton (FSA), but not a minimal one. The problem of how to construct a minimal FSA for this task remains open to the present time.

In the process of working on the uniqueness problem we realized its connection to a special class of formal languages, namely, the factorizable languages. A language L is called factorizable if every sub-word of any word x ∈ L also belongs to L. From this definition it follows that all uniquely reconstructible sequences over an alphabet Σ under a given K form a factorizable language. This remark, together with a finiteness consideration for the language, makes Kontorovich’s theorem a simple corollary of the definition. For a more detailed elaboration of factorizable languages and the uniqueness problem please consult our review papers [27,28].

Based on the mathematical knowledge outlined above one can write at least three computer programs. The first program implements brute-force reconstruction using K-tuples from a given collection. This program yields all the reconstructed sequences if there are many. However, one must introduce a cut-off, because some proteins may have a huge number of reconstructions and a program without a cut-off may never stop. The second program implements the so-called BEST Theorem of graph theory, which gives the number of reconstructions for a given protein at a given K. This program generates only one number, namely, the number of different Eulerian loops, without listing the reconstructed sequences themselves. The third program implements an FSA and yields the least information, i.e., a “yes” or “no” answer to the uniqueness problem. If it is “no”, one increments K by one and runs the program again until reaching a K value that leads to unique reconstruction. Equipped with these programs one may screen databases of real proteins.
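A program of the second kind can be sketched as follows. This is a minimal illustration, not the authors' program, and it rests on assumptions stated here: the first and last (K − 1)-tuples of the sequence are required to differ, so that the start and end of every reconstruction are forced and one closing edge turns the graph into an Eulerian one; the number of arborescences is obtained from a minor of the graph Laplacian (matrix-tree theorem); and the result is divided by the factorials of the K-tuple multiplicities so that identical K-tuples are not distinguished. All function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from math import factorial


def determinant(matrix):
    """Exact determinant of an integer matrix via rational elimination."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n = len(rows)
    result = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if rows[r][col] != 0), None)
        if pivot is None:
            return 0
        if pivot != col:
            rows[col], rows[pivot] = rows[pivot], rows[col]
            result = -result
        result *= rows[col][col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for j in range(col, n):
                rows[r][j] -= factor * rows[col][j]
    return int(result)


def count_reconstructions_best(seq, k):
    """Number of distinct reconstructions of seq from its K-tuples (BEST theorem)."""
    if k < 2:
        raise ValueError("this sketch needs K >= 2")
    tuples = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    start, end = seq[:k - 1], seq[-(k - 1):]
    if start == end:
        raise ValueError("sketch covers only sequences whose end (K-1)-tuples differ")
    # nodes are (K-1)-tuples; every K-tuple is an edge from its prefix to its suffix
    edges = Counter((t[:-1], t[1:]) for t in tuples.elements())
    edges[(end, start)] += 1  # closing edge makes in-degree equal out-degree
    nodes = sorted({node for edge in edges for node in edge})
    index = {node: i for i, node in enumerate(nodes)}
    laplacian = [[0] * len(nodes) for _ in nodes]
    out_degree = Counter()
    for (u, v), multiplicity in edges.items():
        out_degree[u] += multiplicity
        laplacian[index[u]][index[u]] += multiplicity
        laplacian[index[u]][index[v]] -= multiplicity
    minor = [row[1:] for row in laplacian[1:]]
    count = determinant(minor)  # number of arborescences
    for node in nodes:
        count *= factorial(out_degree[node] - 1)
    for multiplicity in tuples.values():
        count //= factorial(multiplicity)  # identical K-tuples are interchangeable
    return count
```

For the toy sequence `"XABACAY"` and K = 2 the function returns 2, corresponding to `XABACAY` and `XACABAY`. Unlike a brute-force search, the cost here is dominated by one determinant, whatever the number of reconstructions.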
It has been shown that most proteins occurring in Nature do have a unique reconstruction at moderate K values [29], e.g., K = 5 or 6. This fact speaks in favor of using K-peptides to form CVs for genome comparison. However, there exists a small number of proteins that have a huge number of reconstructions at not-too-large K values. Most of these are fibrous proteins performing mechanical functions [27].

4.2. Choice of the best K value

Although K looks like a parameter, it actually is not. It functions like a knob on an optical microscope that tunes the resolution of the instrument. We never adjust K for one or another genome. In fact, for a given collection of genomes we build trees for all K = 3 to 7 and watch the convergence of branchings with increasing K. The minimal value K = 3 follows from the fact that we use a (K − 2)-th order Markov prediction, and it can be shown that there is no need to go beyond K = 7. For a quick overview of the tree one can look at K = 5 or 6, which are the best K values in the sense of agreement with the taxonomy. The last statement may be explained in the following qualitative way.




The longer K is, the more species-specific a K-peptide would be, provided it does not belong to the so-called low-complexity sequences. When very species-specific K-peptides are used one would end up with a star-tree, i.e., all species are distinct and one cannot reveal the inter-relationship among them. Only by using shorter K-peptides does one take into account features common to more species.

In order for a K-peptide to be specific, a random K-peptide, whose probability is $20^{-K}$ under the assumption that all amino acids have an equal probability of $20^{-1}$, should rarely appear by chance in a collection of proteins of total length L. The expected count of such a K-peptide should be small enough: $L \times 20^{-K} \ll 1$. On the other hand, in our (K − 2)-th order Markov prediction the connection of a designated K-peptide to its “neighbors” is determined by the number of (K − 2)-peptides, which should not be too few: $L \times 20^{-(K-2)} > 1$. Combining these two inequalities, i.e., $\log L / \log 20 < K < 2 + \log L / \log 20$, and taking L to be of the order of $10^6$, which is typical for most bacterial genomes, so that $\log L / \log 20 \approx 4.6$, we get

$$
4.6 < K < 6.6, \tag{5}
$$

i.e., the appropriate K values are 5 and 6, in agreement with our repeated comparisons of CVTrees with bacterial taxonomy. If one takes L to be $10^5$ or $10^7$, for viruses or fungi respectively, then the “best” K values would be either 4–5 or 6–7, in agreement with our observations in [16] and [20].

One reason not to use too large a K value lies in the total number of different K-peptides in the collection of all protein products of a genome. When K is small, all types of peptides are present and the total number is given by $20^K$. When K gets larger, the exponential growth is limited by the total number L of amino acids in the protein collection. Suppose that there are M protein sequences with total length $L = \sum_{i=1}^{M} L_i$, $L_i$ being the length of the i-th protein and M being of the order of several thousand. Then the total number of different K-peptides is bounded by the linearly descending function $L - M(K - 1)$. If one draws the two functions for $L \sim 10^6$, then a peak is located around K = 6–7. That means all essential types of K-peptides have appeared for K ≤ 7 and there is no need to go beyond K = 7.

We note in passing that statistical re-sampling tests such as bootstrap or jackknife of the CVTree results also show that K = 5 and 6 lead to the best trees [30] for bacteria.

4.3. Optimal subtraction procedure

According to Kimura’s theory of neutral evolution [31], mutations take place randomly at the molecular level and natural selection shapes the direction of evolution. Therefore,



in the original counts of K-peptides in our raw CVs, a significant fraction must result from neutral mutations that have nothing to do with speciation and phylogeny. Indeed, the subtraction of a random background caused by neutral mutations, which highlights the role of natural selection, has significantly improved the CVTree phylogeny. However, the (K − 2)-th order Markov prediction Eq. (1) used in the CVTree approach is by no means the only choice. For example, several prediction formulae were given in the thesis by Qiang Li [32] and some others were listed in the review [22]. Which statistical prediction is the optimal one? First of all, there is no mathematical principle on which to base the optimum. As we indicated at the beginning of this paper, one should rely on direct comparison of the resulting phylogeny with bacterial taxonomy to make the judgment. As far as we know, Eq. (1) remains the best prediction for the time being. First, it can be derived from the relation between joint probability and conditional probability with a clear understanding of where a Markov approximation is made [7,9]. Second, it can be derived independently by invoking the maximal entropy principle [33]. Last but not least, the CVTrees based on Eq. (1) agree well with bacteriologists’ taxonomy in most major and finer branchings.

4.4. Distance versus dissimilarity

Curiously enough, the dissimilarity measure Eq. (4) does not guarantee the fulfillment of all the triangular inequalities encountered in the calculation, but an overwhelming majority of the inequalities do hold. For example, for a dataset of 1404 prokaryotic genomes plus 8 Eukaryotes as outgroups there are in total $\binom{1412}{3} = 468\,198\,020$ triangles. An exhaustive check of the inequalities reveals the following numbers of violations: 10006, 200, 0, 0, and 3, for K = 3, 4, 5, 6, and 7, respectively. Even 10006 is a tiny fraction of the total number of triangles.
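An exhaustive check of this kind is straightforward to sketch. The function below is an illustration with a hypothetical name, and it assumes that a “violation” means a triangle whose longest side exceeds the sum of the other two; at most one of the three inequalities of a triangle can fail, so counting triangles and counting failed inequalities coincide under this reading. The number of triangles quoted in the text is reproduced by `math.comb`.

```python
from itertools import combinations
from math import comb


def count_triangle_violations(d, tol=0.0):
    """Count triples (i, j, k) whose longest side exceeds the sum of the other two.

    `d` is a symmetric square matrix of dissimilarities (list of lists).
    """
    violations = 0
    for i, j, k in combinations(range(len(d)), 3):
        sides = (d[i][j], d[j][k], d[i][k])
        longest = max(sides)
        if longest > sum(sides) - longest + tol:
            violations += 1
    return violations


# 1404 prokaryotes + 8 eukaryotes = 1412 taxa
n_triangles = comb(1412, 3)
```

For 1412 taxa the triple loop visits several hundred million triangles, so a production check would be vectorized or written in a compiled language; the pure-Python loop is meant for small matrices only.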
For K = 5 and 6, corresponding to “the best” CVTrees (see below), there are no violations at all. The 3 violations at K = 7 come from 5 strains of one and the same species, namely, Yersinia pestis, which behaves well, i.e., as monophyletic branches, in all CVTrees from K = 3 to 7.

Rewriting the cosine between two CVs $\vec{A}$ and $\vec{B}$ as

$$
\cos\theta = \frac{\vec{A}^{T}\vec{B}}{\|\vec{A}\| \cdot \|\vec{B}\|},
$$

the correlation distance (4) may be written as

$$
D(\vec{A}, \vec{B}) = \frac{1 - \cos\theta}{2} = \frac{1}{2}\left(1 - \frac{\vec{A}^{T}\vec{B}}{\|\vec{A}\| \cdot \|\vec{B}\|}\right).
$$

Using the identity

$$
\left\| \frac{\vec{A}}{\|\vec{A}\|} - \frac{\vec{B}}{\|\vec{B}\|} \right\|^{2} = 2\left(1 - \frac{\vec{A}^{T}\vec{B}}{\|\vec{A}\| \cdot \|\vec{B}\|}\right),
$$




the dissimilarity measure (4) turns out to be the square of a Euclidean distance [22]:

$$
D(\vec{A}, \vec{B}) = \frac{1}{4} \left\| \frac{\vec{A}}{\|\vec{A}\|} - \frac{\vec{B}}{\|\vec{B}\|} \right\|^{2}. \tag{6}
$$
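A quick numerical check makes both points tangible: the cosine form (4) and the squared-distance form (6) agree, and a squared Euclidean distance can break the triangular inequality. The sketch below uses three made-up planar vectors at 0°, 45° and 90° with arbitrary lengths; the helper names are hypothetical.

```python
from math import cos, sin, radians, sqrt, isclose


def norm(v):
    return sqrt(sum(x * x for x in v))


def dissimilarity(a, b):
    """Eq. (4), written with the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return (1.0 - dot / (norm(a) * norm(b))) / 2.0


def quarter_squared_distance(a, b):
    """Eq. (6): one quarter of the squared distance between the unit vectors."""
    na, nb = norm(a), norm(b)
    return sum((x / na - y / nb) ** 2 for x, y in zip(a, b)) / 4.0


def direction(degrees, length=1.0):
    """Planar vector of the given length at the given angle."""
    return (length * cos(radians(degrees)), length * sin(radians(degrees)))


a, b, c = direction(0, 2.0), direction(45, 0.5), direction(90, 3.0)

# The two forms agree regardless of the vector lengths.
same = isclose(dissimilarity(a, c), quarter_squared_distance(a, c), abs_tol=1e-12)

# D(a, c) is about 0.5, while D(a, b) + D(b, c) is about 0.293:
# the "detour" through b is shorter than the direct "distance".
violated = dissimilarity(a, c) > dissimilarity(a, b) + dissimilarity(b, c)
```

The violation appears here because b lies almost “between” a and c on the unit circle; squaring shrinks the two short chords more strongly than the long one.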

It is well known that a Euclidean distance satisfies the triangular inequality, but its square may not. However, what is the condition under which, for a collection of CVs, only a tiny fraction of the triangles formed by these CVs fails to comply with the triangular inequality? The fact that for the “best” choices of K, namely 5 and 6, all the triangular inequalities do hold hints at a possible connection between the fulfillment of triangular inequalities and the quality of a phylogenetic tree, but at present we are not in a position to formulate this connection explicitly.

This observation brings to mind a deeper question in the context of phylogeny, namely, the problem of ultrametricity. If in an additive phylogenetic tree the distance between any two extant species is defined as the evolution time or the total branch length to their common ancestor, then among any three extant species one may pick two such that their distances to the third species are equal to each other and greater than or at least equal to the distance between the two. Such a distance is an example of a so-called ultrametric distance. Ultrametricity is a strong form of the triangular inequality in which every triangle is required to be isosceles or equilateral.

Generally speaking, a “distance matrix” like the one we formed using the dissimilarity measure (4) may not be a metric, as the triangular inequalities are not all guaranteed to be satisfied. Since we have seen that the “best” trees in the biological sense do not violate the triangular inequalities at all, one may think about how to “repair” a “distance matrix” with very weak violations of some of the triangular inequalities so that all the inequalities hold. If one finds a way to do this, the next task may be to seek a way to make the distance matrix fully ultrametric. Any progress along this line may be useful in phylogeny and may revive the interest of physicists in ultrametricity, which was first introduced to physics a quarter of a century ago [34].

4.5. Calibration of branch lengths in CVTrees

The CV approach provides good topology of the phylogenetic trees, as witnessed by its agreement with taxonomy. However, the calibration of branch lengths remains an unsolved problem. The use of overlapping K-peptides to form CVs brings about dependency among the components of a CV and makes the analysis difficult. The subtraction procedure further worsens the situation. Nevertheless, a calibration formula has been given in [32]. The final calibrated dissimilarity measure simply reads

$$
D_{\mathrm{cal}}(\vec{A}, \vec{B}) = 1 - \left[ 1 - 2 D(\vec{A}, \vec{B}) \right]^{1/K}, \tag{7}
$$

where $D(\vec{A}, \vec{B})$ is the original dissimilarity measure (4). However, the calibration slightly changes the topology, and the number of violated triangular inequalities increases, though remaining a tiny fraction of the total number of inequalities.
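The calibration is a one-line transformation once Eq. (7) is read as D_cal = 1 − [1 − 2D]^(1/K), which is the reading assumed in this sketch. Since 1 − 2D equals the correlation C of Eq. (3), the calibrated measure is 1 − C^(1/K). The function name is hypothetical, and restricting the input to 0 ≤ D ≤ 1/2 (non-negative correlation) is an assumption made here so that the fractional power stays real; how negative correlations are treated is not specified in the text.

```python
def calibrated_dissimilarity(d, k):
    """Eq. (7) as read in this sketch: D_cal = 1 - (1 - 2 D) ** (1 / K)."""
    if k < 1:
        raise ValueError("K must be a positive integer")
    if not 0.0 <= d <= 0.5:
        # 1 - 2 D would be negative and its K-th root would not be real
        raise ValueError("this sketch assumes 0 <= D <= 1/2")
    return 1.0 - (1.0 - 2.0 * d) ** (1.0 / k)


# Hypothetical dissimilarities, recalibrated for K = 5
calibrated = [calibrated_dissimilarity(d, 5) for d in (0.0, 0.1, 0.3, 0.5)]
```

The map is monotonically increasing in D, so it rescales the entries of the “distance matrix” without reordering them.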



Here we would like to argue that calibration of branch lengths is not that important for bacterial phylogenetic trees. Indeed, the need to determine the evolution time of various branches comes from the traditional phylogeny of multicellular plants and animals, for which fossil data may be used to check the timing. The requirement of scaling the branch lengths rests on the assumptions that trees are additive and that there exists a molecular clock running at a more or less constant rate. The latter assumption can hardly hold true for bacterial species living in extremely diversified ecological conditions. Furthermore, bacterial fossil data, though they exist, are not useful at all for dating purposes. Therefore, we advocate the viewpoint that at least for the time being one should be content with providing faithful topology for bacterial phylogenetic trees in order to support the more practical need of having a reasonable classification of these unseen but abundant creatures of Nature.

Acknowledgments

This work was partially supported by the National Basic Research Program of China (973 Project No. 2007CB814800) and the Shanghai Leading Academic Discipline Project No. B111.

References

1. E. Zuckerkandl and L. Pauling, J. Theor. Biol., 8, 357 (1965).
2. C. R. Woese and G. E. Fox, Proc. Natl. Acad. Sci. USA, 74, 5088 (1977).
3. The Bergey’s Manual Trust, Bergey’s Manual of Systematic Bacteriology, 2nd Ed., Vol. 1–5 (Springer, New York, 2001–2011).
4. J. T. Staley, Phil. Trans. R. Soc. B361, 1899 (2006).
5. P. Yarza, M. Richter, J. Peplies et al., Syst. Appl. Microbiol. 31, 241 (2008).
6. B. Hao, J. Qi and B. Wang, Mod. Phys. Lett. B17, 91 (2003).
7. J. Qi, B. Wang and B. Hao, J. Mol. Evol. 58, 1 (2004).
8. J. Qi, H. Luo and B. Hao, Nucl. Acids Res. 32, Web Server Issue, W45 (2004).
9. B. Hao and J. Qi, J. Bioinf. & Comput. Biol., 2, 1 (2004).
10. A. A. Teichmann and G. Mitchison, J. Mol. Evol. 49, 98 (1999).
11. M. Huynen, B. Snel and P. Bork, Science, 286, 1443a (1999).
12. J. Felsenstein, PHYLIP: http://evolution.genetics.washington.edu/phylip.html
13. R. Mihaescu, D. Levy and L. Pachter, Algorithmica, 54, 1 (2009).
14. Z. Xu and B. Hao, Nucl. Acids Res. 37, Web Server Issue, W147 (2009).
15. L. Gao, J. Qi, H. Wei, Y. Sun and B. Hao, Chinese Science Bulletin 48, 1170 (2003).
16. L. Gao and J. Qi, BMC Evol. Biol., 7, 41 (2007).
17. L. Gao, J. Qi, J. Sun and B. Hao, Science in China Series C Life Science, 50, 587 (2007).
18. J. Sun, Z. Xu and B. Hao, Chinese Sci. Bull. 55, 2323 (2010).
19. K. H. Chu, J. Qi, Z. G. Yu and V. Anh, Mol. Biol. Evol., 28, 70 (2004).
20. H. Wang, Z. Xu, L. Gao and B. Hao, BMC Evol. Biol., 9, 195 (2009).
21. Q. Li, Z. Xu and B. Hao, J. Biotech., 149, 115 (2010).
22. R. H. Chan, R. W. Wang and H. M. Yeung, Composition vector method for phylogenetics - a review, in Proc. 9th Int. Symp. Operations Research and Its Applications (ORSC & APORC, Chengdu, China, 2010), p. 13.
23. B. Hao, H. Xie and S. Zhang, Composition representation of protein sequences and the number of Eulerian loops, arXiv: physics/0103028 (March 2001).
24. C. Kontorovich, Theor. Comput. Sci., 329, 271 (2004).
25. Q. Li and H. Xie, Finite automata for testing uniqueness of Eulerian trails, arXiv: cs.CC/0507052 (July 2005).
26. Q. Li and H. Xie, J. Comput. & Syst. Sci., 74, 870 (2008).
27. X. Shi, H. Xie, S. Zhang and B. Hao, J. Korean Phys. Soc., 50, 118 (2007).
28. B. Hao and H. Xie, Factorizable language: from dynamics to biology, in Reviews of Nonlinear Science and Complexity, ed. H. G. Schuster (Wiley-VCH, Weinheim, 2008), p. 147.
29. L. Xia and C. Zhou, J. Syst. & Complexity, 20, 18 (2007).
30. G. Zuo, Z. Xu, H. Yu and B. Hao, Genomics, Proteomics & Bioinformatics, 8, 262 (2010).
31. M. Kimura, The Neutral Theory of Molecular Evolution (Cambridge University Press, Cambridge UK, 1985).
32. Q. Li, A heuristic evolutionary model for K-string composition and the problem of uniqueness of reconstruction of sequences, PhD Thesis, Fudan University, Shanghai, 2009.
33. R. Hu and B. Wang, Physica, A290, 464 (2001).
34. R. Rammal, G. Toulouse and M. A. Virasoro, Rev. Mod. Phys., 58, 765 (1986).