Journal of Molecular Evolution

J. Mol. Evol. 13, 127-149 (1979)

© by Springer-Verlag 1979

A Graph Theoretic Approach to the Development of Minimal Phylogenetic Trees

L.R. Foulds¹, M.D. Hendy¹, and David Penny²

¹Department of Mathematics; ²Department of Botany and Zoology, Massey University, Palmerston North, New Zealand

Summary. The problem of determining the minimal phylogenetic tree is discussed in relation to graph theory. It is shown that this problem is an example of the Steiner problem in graphs, which is to connect a set of points by a minimal length network where new points can be added. There is no reported method of solving realistically-sized Steiner problems in reasonable computing time. A heuristic method of approaching the phylogenetic problem is presented, together with a worked example with 7 mammalian cytochrome c sequences. It is shown in this case that the method develops a phylogenetic tree that has the smallest possible number of amino acid replacements. The potential and limitations of the method are discussed. It is stressed that objective methods must be used for comparing different trees. In particular it should be determined how close a given tree is to a mathematically determined lower bound. A theorem is proved which is used to establish a lower bound on the length of any tree, and if a tree is found with a length equal to the lower bound, then no shorter tree can exist.

Key words: Cytochrome c - Phylogenetic tree - Minimal spanning tree - Graph theory - Molecular evolution - Steiner problem in graphs.

Come, listen, my men, while I tell you again
The five unmistakable marks
By which you may know, wheresoever you go
The warranted genuine Snarks

Lewis Carroll

The Hunting of the Snark

The theory of evolution predicts that existing biological species have been linked in the past by common ancestors. Since the time of Charles Darwin many scientists have suggested phylogenies to link both existing species and/or organisms in the fossil record. With some groups, particularly vertebrates, there is a reasonable fossil record


L.R. Foulds et al.

and this together with comparative information from existing organisms enables phylogenies to be determined with a fair degree of agreement on the basic outlines. But for most species the fossil record is at best inadequate and at worst non-existent, and in these cases the evolution of the group must be determined from a knowledge of the existing species. There has been no general agreement on methods of determining phylogenies, particularly when there is no fossil record, and it is not surprising that there is often little agreement even on the main outline of the evolution within some major classes.

Eck and Dayhoff (1966), and Fitch and Margoliash (1967) introduced methods for building phylogenetic trees from protein sequence data. Since that time several variations of these methods have been reported and will be referred to in the discussion. However, none of these methods has been able to decide whether any better trees could be found for the data.

Several attempts have been made to establish more objective methods of determining phylogenies. The process has recently been subdivided into six steps (Penny 1976). They are:

1. The collection and/or selection of data for study
2. Selecting a biological model
3. Establishing a criterion for the optimal tree
4. Determining a minimal network or unrooted tree
5. Determining ancestral states
6. Finding the root of the tree

In the present case an evolutionary model has been assumed (as opposed to the nonevolutionary model of classical numerical taxonomy) and in this case the criterion is to minimize some function of the number of changes to the data (Camin and Sokal, 1965; Farris, 1970, 1972; Fitch, 1971, 1975, 1977; Penny 1976). Methods of developing phylogenies do not necessarily lead to a rooted phylogenetic tree. An unrooted tree, as opposed to a rooted tree, does not show the location of the ancestor common to the taxa being investigated.
Most of the objective methods of making phylogenies result in an unrooted tree and additional information must be used to find on which link the root occurs (Penny 1976). In such trees there is no direction to the links whereas in rooted trees the two ends of the links represent different points in time. Both rooted and unrooted trees are examples of networks, but networks also include cases where there may be closed circuits. The most important remaining problem in determining the minimal tree is to find the unrooted tree of minimal length joining the species. This paper reports an improved method based on that of Eck and Dayhoff (1966), that in some cases can be shown to determine the minimal (most parsimonious) tree. We use only information on amino acid sequences and in particular sequences of the respiratory protein cytochrome c. However it is to be emphasised that other comparative information such as anatomical or morphological data could be used and a subsequent paper will illustrate this.

Biochemical Background The general problem is to construct a phylogenetic tree for a number of taxa based on a set of data for those taxa. In particular we shall consider the case where the data

Development of Minimal Phylogenetic Trees

are amino acid sequences, one sequence from each taxon. The sequences are of the same length and thus alignable such that the kth amino acid character (position) for all taxa may be assumed to have descended from the same ancestral amino acid. Thus each position is a character and the 20 possible amino acids constitute the range of character states (Dayhoff, 1972). For every possible pair of amino acids we define a distance between them equal to the fewest possible nucleotide substitutions that would be required to convert a codon for one amino acid into a codon for the other.

Each amino acid in the sequence is coded for by one triplet of three nucleic acid bases or nucleotides. There are 64 possible triplets from the 4 nucleotides: A, C, G, U. There are more triplets (64) than amino acids (20) and in nature there can be from 1 to 6 different triplets coding for a particular amino acid, although in any particular message only one triplet will code for an amino acid at a particular site in the protein. A good introduction to this material can be found in Watson (1975), which also includes the triplet codes. Throughout this paper we adopt the terms: amino acid replacement and nucleotide substitution.

Three related criteria have been used to measure the length of a link of a tree that has been built from amino acid sequence information. The term length is defined in the next section. They are:

1. Minimum Amino Acid Changes, in which each amino acid replacement is counted equally. But a change from, say, glutamic to aspartic acid requires only one base change (see below) whereas a change from methionine (AUG) to aspartic acid (GAU or GAC) must require three. A second criterion can therefore be used, namely,

2. Minimum Weighted Amino Acid Changes, where each amino acid replacement is weighted for the minimum number of nucleotide substitutions required in the triplet that codes for that position in the protein.
These values are taken from Fitch and Margoliash (1967) except that isoleucine to arginine, glutamic acid, glutamine and lysine require respectively 1, 2, 2 and 1 nucleotide substitutions. This method considers each replacement independently but it may still underestimate the number of nucleotide substitutions required in a tree. This can be seen in the following example. Each of the amino acids Asp (D, GAPy), Glu (E, GAPu), Gly (G, GGX) is one nucleotide substitution from each other, and any sequence of changes involving them (say G → D → E) would appear to involve two nucleotide substitutions. But an examination of the codes shows that the sequence of changes D → G → E; or E → G → D; or G → E and G → D would involve three nucleotide substitutions. Any of the other sequences of changes, e.g. G → D → E or D → G and D → E, would require only two. To include all the changes that would have to occur on a tree an additional criterion is used, namely,

3. Minimum Nucleotide Substitutions (Fitch and Farris, 1973), which is the same as the minimum mutation distance (Moore et al., 1973). This includes the Minimum Weighted Amino Acid Changes together with any additional substitutions that are necessary because of the exact tree topology and the redundancy of the code.

In the present work the Minimum Weighted Amino Acid Changes criterion is used while finding a minimal network because it can consider each amino acid replacement


separately. But wherever possible the order of amino acid replacement has taken account of the possibility of minimizing the Minimum Nucleotide Substitutions, which is normally evaluated after a tree has been created. A method for evaluating the Minimum Nucleotide Substitutions for a given tree was first reported by Fitch (1971) and later improved by Fitch and Farris (1973). The mathematical proof for this type of method was given by Hartigan (1973) and by Moore et al. (1973).

It is important to note that an amino acid replacement that does occur in evolution is reversible, in that it is normally possible initially for a reversion to the original state. In this respect the problem faced is a more general approach than that of Camin and Sokal (1965) and Estabrook (1968), who considered the case where the ancestral state of each character was known and where changes to a character were irreversible.

The Mathematical Problem

Graph theory is the branch of mathematics which is concerned with the study of discrete arrangements of, and relationships between, objects. The interested reader is recommended the text by Harary (1969). A graph is a mathematical structure which can be represented by a set of points drawn in a plane, pairs of which may be connected by lines called links. A path is a sequence of distinct links in a graph with the property that each link in the sequence, other than the first, begins at the point where its predecessor ends. A circuit is a sequence of distinct links in a graph which is identical to a path except that the first point of the first link is coincident with the last point of the last link in the sequence. A graph is said to be connected if there exists at least one path between every pair of distinct points. A tree is a graph that is connected but does not contain any circuits. Any connected graph will be termed a network.

The phylogenetic problem introduced in the previous section can be formulated in graph theoretic terms.
Each taxon is represented by a point in the tree to be constructed. Each link in the tree is associated with the changes between the species it connects. Let

n = the number of species to be analysed,
B = (s1, s2, ..., sn), the set of species,
d(si, sj) = the minimum number of nucleotide substitutions that have to be made in order to transform the sequence of species si into the sequence of species sj.

A link in a tree connecting si and sj is said to have length d(si, sj). Both a path and a tree are defined to have length equal to the sum of the lengths of their constituent links. Note that we define

$$d(s_i, s_i) = 0, \qquad i = 1, 2, \ldots, n.$$

Also the distance function d is symmetric in the sense that

$$d(s_i, s_j) = d(s_j, s_i), \qquad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, n.$$
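The distance d defined above can be illustrated with a short sketch. It is not the authors' program: it assumes the standard genetic code (written with T in place of U), and the names `aa_distance` and `d` are chosen here for illustration only. The distance between two amino acids is taken as the fewest nucleotide differences between any codon for one and any codon for the other, and the distance between two aligned sequences is the sum over positions.

```python
from itertools import product

BASES = "TCAG"  # DNA alphabet; T stands in for the U of the messenger RNA
# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_for(amino_acid):
    """All codons of the standard code that translate to a one-letter amino acid."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]


def aa_distance(a, b):
    """Fewest nucleotide substitutions turning some codon for a into some codon for b."""
    return min(
        sum(x != y for x, y in zip(ca, cb))
        for ca in codons_for(a)
        for cb in codons_for(b)
    )


def d(seq_i, seq_j):
    """Distance between two aligned sequences: sum of the per-position distances."""
    if len(seq_i) != len(seq_j):
        raise ValueError("sequences must be aligned to the same length")
    return sum(aa_distance(a, b) for a, b in zip(seq_i, seq_j))


s1, s2, s3 = "RRLM", "TTLK", "STIM"      # first three sequences of Table 1
print(d(s1, s2), d(s2, s3), d(s1, s3))   # pairwise distances
print(aa_distance("E", "D"), aa_distance("M", "D"))
```

By construction `d(s, s)` is 0 and `d` is symmetric, as required of the distance function in the text.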


However d does not necessarily obey the triangle inequality because d(si, sj) + d(sj, sk) < d(si, sk) occurs for certain i, j, and k. As an example of this inconsistency, consider the sequences for Turtle, Dogfish and Bonito at site 33 in Dayhoff (1972). Now the problem is to identify the tree(s) whose length is minimal and which connects all species in B.

In order to illustrate various points the following example will be introduced. It has six sequences differing at only four characters, numbered 1, 2, 3 and 4. The sequences and the distances between them are given in Table 1.

Unfortunately it is not possible to guarantee for some sets of species that a minimal tree (the one with the least number of base changes) will connect every pair of species, si and sj, by a link of length only d(si, sj) nucleotide substitutions. As an example consider a universe B of just the first three species: s1, s2, and s3 and only character 1 in the example. Then all the relevant information is shown in Fig. 1(a). As can be seen in the Figure each link has been labelled. The label associated with the link connecting species si and sj consists of blocks of the form r αβ (p), where:

r is a character (position in the sequence) at which si and sj differ,
α is the character state in position r of si,
β is the character state in position r of sj,
p is the number of nucleotide substitutions associated with the αβ change.

Usually, when p = 1, we write r αβ (1) as r αβ. However the structure in Fig. 1(a) is not minimal as it contains a circuit. A minimal tree can be obtained by breaking this circuit by removing any link. Suppose for instance the link between s1 and s3 is removed. Then s1 and s3 are connected by a path representing two base changes even though they differ by only one nucleotide substitution.

So far this discussion has considered the problem as similar to one known in graph theory as the minimal spanning tree problem.
In terms of the terminology developed so far, this problem is to identify the most parsimonious tree, T (the minimal spanning tree) which connects all species in B by links between pairs of species si and sj having length d(si, sj). That is, to find the tree, T for which

Table 1.

        Character         Distance to
        1  2  3  4        s1  s2  s3  s4  s5  s6
s1      R  R  L  M         0   3   3   1   2   2
s2      T  T  L  K             0   3   2   1   2
s3      S  T  I  M                 0   2   2   1
s4      R  T  L  M                     0   1   1
s5      T  T  L  M                         0   1
s6      S  T  L  M                             0


Fig. 1. Examples of development of minimal trees

$$\sum_{s_i s_j \in T} d(s_i, s_j)$$

is a minimum and T connects (spans) all species in B. Methods for the solution of this problem have been developed by Prim (1957) and Kruskal (1956). Both methods guarantee to find the shortest tree. It would be very convenient if the phylogenetic problem of this paper was the same as the minimal spanning tree problem. Then the above mentioned methods could be applied and the problem would be solved. However there is a crucial difference which will be explained by means of the example. Consider once again only the first three species, s1, s2, and s3, and all sites 1, 2, 3, and 4. Then d(s1, s2) = 3, d(s2, s3) = 3, and d(s1, s3) = 3. A minimal spanning tree is displayed in Figure 1(b). This tree has a total of six nucleotide substitutions. However this is not the minimal phylogenetic tree. It can be seen in Figure 1(b) that the two links s1s2 and s1s3 have the change 2RT in common. These two changes can be combined to form the minimal phylogenetic tree with five changes shown in Fig. 1(c).

This example raises the possibility of a minimal phylogenetic tree containing points other than those belonging to the original species in B. These new points we call Steiner points. The phylogenetic problem is an example of a problem known in graph theory as the Steiner problem in graphs, which is to identify the minimal tree, T, that connects all species in B by links between pairs of species of some larger set, B', which contains B. To make this clear again consider the example. Suppose the set of species, B, is still s1, s2, and s3 but three new sequences (represented by the Steiner points s4, s5, and s6) are added.

Let B' = (s1, s2, s3, s4, s5, s6).

Given the distances in Table 1 the problem is to find a minimal tree which spans the species in B. A minimal tree is shown in Figure 1(c), in which s5 and s6 are not connected. Two other solutions are shown in Figures 1(d) and (e). All three trees possess five changes. However, there do exist data sets which can be represented by minimal phylogenetic trees which do not require Steiner points.

It is possible to transform the phylogenetic problem into the Steiner problem in graphs by defining B' as follows. At each site at which at least two species are at variance, record all letters present. In the example, R, S, and T occur at character 1. Then define B' to be the set of sequences which correspond to all combinations. For instance there are 3, 2, 2, and 2 letters present at characters 1, 2, 3 and 4 respectively. Hence there are 3 × 2 × 2 × 2 = 24 elements in B'. These are: RRLM (s1), TTLK (s2), STIM (s3), RTLM (s4), TTLM (s5), STLM (s6), plus 18 others. One can form the complete distance matrix for all pairs in B' and apply a method suitable for the Steiner problem in graphs to find the minimal phylogenetic tree.

Methods for solving the Steiner problem in graphs have been developed by Dreyfus and Wagner (1972) and Hakimi (1972). However it has been reported by Christofides (1975) that neither approach can guarantee finding the minimal tree in reasonable computing time for problems with more than ten species. Now for phylogenetic problems with 20 or more species, the number of elements in B' would be enormous. There is no reported method for realistically sized problems. Hence a heuristic solution procedure, which will usually find a tree with a relatively small, if not minimal, number of changes, is in order.

Earlier in this section it was mentioned that there was another approach to the minimal spanning tree problem other than that of Prim. This second approach is due to Kruskal.
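The construction of B' for the three-species example can be sketched as follows. This is an illustration under stated assumptions rather than any of the cited Steiner methods: every replacement is counted as one substitution (which holds for the replacements in Table 1), only one extra point of B' is tried at a time (enough for three species), and the helper names are hypothetical.

```python
from itertools import product

B = ["RRLM", "TTLK", "STIM"]            # s1, s2, s3 from Table 1


def hamming(a, b):
    """Number of differing positions (one substitution per replacement here)."""
    return sum(x != y for x, y in zip(a, b))


def candidate_set(species):
    """B': every combination of the letters observed at each position."""
    letters = [sorted(set(column)) for column in zip(*species)]
    return ["".join(combo) for combo in product(*letters)]


def spanning_tree_length(points):
    """Length of a minimal spanning tree (Prim's method) under hamming()."""
    in_tree, total = {points[0]}, 0
    while len(in_tree) < len(points):
        length, nearest = min(
            (hamming(p, q), q) for p in in_tree for q in points if q not in in_tree
        )
        in_tree.add(nearest)
        total += length
    return total


B_prime = candidate_set(B)
without_steiner = spanning_tree_length(B)
# Try each single extra point of B' as a possible Steiner point.
trials = {x: spanning_tree_length(B + [x]) for x in B_prime if x not in B}
best = min(trials.values())
best_points = sorted(x for x, length in trials.items() if length == best)
print(len(B_prime), without_steiner, best, best_points)
```

The exhaustive search is only feasible because B' is tiny here; the text's point is that B' grows far too quickly for this to work on realistic data.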
The main purpose of this paper is to present a solution procedure to the phylogenetic problem which is based in part upon the method of Kruskal. Kruskal's method can be described as follows. It begins by ordering the prospective links of the tree in nondecreasing order of length d(si, sj). One then begins at the top of this list of links and examines each link in turn, the shortest first. As each link is examined it is included in the tree if it does not create a circuit with links already included, otherwise it is rejected. The method terminates when a spanning tree has been produced, i.e. when n - 1 links have been included for a tree of n points. Note that, unlike Prim's method (and Farris, 1970, and Eck and Dayhoff's method, 1966, which build up one connected component), Kruskal's method (like Fitch and Margoliash's method, 1967) may, in its intermediate stages, create a number of disconnected components which are all ultimately connected to produce one final spanning tree.

The method for the phylogenetic problem presented in this paper combines the strategies of Eck and Dayhoff's ancestral sequence method, Kruskal's method, and that of coalescing, illustrated in Fig. 1(b) and (c). The method will be rigorously described in the next section. Briefly, at each iteration the method identifies the shortest link joining two unconnected points. These points may be original points or Steiner points created by coalescing. (Strictly speaking we should refer only to the added points in


a minimal solution as Steiner points; however, at the intermediate stage the necessity of some added points for the minimal solution has not been proved. For convenience we refer to all additional points as Steiner points, some of which may later be deleted.) This link is then added to the tree. An examination is then carried out to see whether any coalescing is possible. The coalescements yielding the largest reduction in the total length of the tree are performed. If circuits are present an attempt is made to break them by removing links which give the largest reduction in the total length of the tree. If there is more than one largest link, the circuit is left unbroken. The method terminates when all original species are connected.

The Process

We shall use the abbreviation PST, standing for Phylogenetic Steiner Tree, to refer to any phylogenetic tree on a set of sequences which may contain Steiner points. In the case where no Steiner points are added the PST is simply a spanning tree. We say two PST's are identical if and only if

1) the two trees have the same topology, and
2) each corresponding point refers to the same sequence.

If either or both conditions fail to hold between a pair of PST's then we say they are distinct. The length of a PST is the algebraic sum of the lengths of each of the component links of the PST. A MPST, standing for Minimal Phylogenetic Steiner Tree, will refer to any PST spanning a given set of sequences which has minimal length for all PST's on that set of sequences.

The process of determining the MPST is described below using four examples, the first three of which are artificial, and in the fourth we construct a MPST for a set of seven mammals. It is useful in these calculations to have some estimates of the length of the MPST, and we begin by describing simple methods of calculating upper and lower bounds on this length.

Upper and Lower Bounds

Obviously the length of any constructed PST represents an upper bound, U, on the length of a MPST. One ready way of calculating such an upper bound is to use the Prim (1957) process, for which algorithms are readily available. An efficient computer program for this process has been written by Kevin and Whitney (1972).

We have developed a process for calculating lower bounds. Let σ represent the full set of sequences for the n species, at characters labelled 1, 2, ..., m. Let I = (1, 2, ..., m) be the index set for σ. If we take any subset I1 = (i, j, ..., k) of I, we can define the corresponding subsequence σ1 as being that subsequence of σ formed by selecting characters i, j, ..., k, i.e., the subsequence whose index set is I1. If we take I1' = I - I1 as being the complement of I1 in I (i.e. those elements of I not in I1) then I1' will similarly define a subsequence σ1'. As I = I1 ∪ I1' and I1 ∩ I1' = ∅, I1 and I1' define a partition of I. Also σ1' will be complementary to σ1, so that σ1, σ1' will partition σ into two disjoint subsequences.


Let L be the length of the MPST of σ,
Let L1 be the length of the MPST of σ1,
Let L1' be the length of the MPST of σ1'.

Lemma. L ≥ L1 + L1'.

Proof. Let τ be a MPST for σ. The length of each link in τ is made up of all the nucleotide substitutions required to traverse that link. We can identify which of these substitutions belong to σ1 and which to σ1', and the sum of the number of these substitutions will give the length of the link. We can create two new trees τ1 and τ1' from τ, each with the same topology as τ, but of different link lengths (which could possibly be 0), where in τ1 the length of each link is the number of substitutions of σ1 along the corresponding link in τ, and similarly in τ1' the length of each link is the number of substitutions of σ1' along the corresponding link in τ. Consequently τ1 will be a PST for σ1, and τ1' will be a PST of σ1'. Let the lengths of τ1 and τ1' be M1 and M1'; then by our construction L = M1 + M1'. But, as the MPST's of σ1 and σ1' have lengths L1 and L1', which are minimal, L1 ≤ M1 and L1' ≤ M1', so that L ≥ L1 + L1'.

Repeated application of the lemma to a partition of σ into several subsequences shows that the sum of the lengths of their MPST's, L0, satisfies L ≥ L0, i.e. L0 is a lower bound on the length of all PST's of σ. This is the same as Camin and Sokal's (1965) "minimum evolutionary steps" but generalized to include reversible changes. This is also the same as Fitch's m0 (1965, 1977).

Corollary. If τ is a PST of σ of length L0 then τ is an MPST.

The significance of this corollary is that there is no way in which the characters of a tree of length L can be partitioned into two or more subsets whose corresponding minimal trees have lengths that sum to a number greater than L. In some simple instances, the complete partition, in which each subsequence represents the changes in a single character position, may be adequate.

Example 1

S = (s1, s2, s3) with the species having sequences and distances given in Table 2. All changes in Table 2 represent one nucleotide substitution. We now follow Kruskal, and search for the links of minimal length, not previously chosen, which, when added individually to the existing graph, do not create a circuit. Now d(s1, s2) = d(s2, s3) = 3 so these two are chosen.
These now give us a partial graph, illustrated in Fig. 2(a). We now attempt to reduce the graph by "coalescing", introducing Steiner points, where necessary, which represent the sequence of a possible intermediary species.


Table 2.

        Character            Distance to
        1  2  3  4  5        s1  s2  s3
s1:     A  B  C  D  E         0   3   4
s2:     B  B  D  C  E             0   3
s3:     A  C  D  C  D                 0

Fig. 2. Introduction of a Steiner point in minimal tree development

At each point we compare all the pairs of links incident there, and identify those changes that are common to the two links, choosing to coalesce on a pair with the maximal number of common changes. In our example, s2 is the only point with a pair of incident links, and there is only one common change, 1AB. To represent this we introduce the Steiner point s4, to represent the possible sequence: s4 = ABDCE. This is represented in Fig. 2(b). Thus the link s1s2 has been replaced by the path s1s4s2, and link s2s3 by path s2s4s3. Now, as all points are connected to the tree the algorithm is complete, so we have a spanning tree of length 5. Comparing this with our lower bound L = 5, we must have a MPST.

A tree is said to be precise if the path from si to sj is actually of length d(si, sj) for each pair si, sj. The tree just constructed is precise; however, this is not usually the case, as we shall see in a subsequent example.
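The coalescing step of Example 1 can be sketched in a few lines. This is an illustrative reading of the step, not the authors' code: it assumes every replacement costs one substitution (as stated for Table 2), treats a change as common to two links when the two neighbours share a state that differs from the state at the common point, and uses hypothetical helper names.

```python
def changes(src, dst):
    """Changes along a link, written as in the figures: position, old state, new state."""
    return [f"{k}{a}{b}" for k, (a, b) in enumerate(zip(src, dst), start=1) if a != b]


def coalesce(point, neighbour_a, neighbour_b):
    """Steiner sequence obtained by coalescing the links point-a and point-b.

    Wherever the two neighbours agree with each other, the Steiner point
    adopts their shared state; elsewhere it keeps the state of `point`.
    """
    steiner = "".join(
        a if a == b else p for p, a, b in zip(point, neighbour_a, neighbour_b)
    )
    # The common changes are those separating the Steiner point from `point`.
    return steiner, changes(steiner, point)


s1, s2, s3 = "ABCDE", "BBDCE", "ACDCD"     # sequences of Table 2
s4, common = coalesce(s2, s1, s3)
before = len(changes(s1, s2)) + len(changes(s2, s3))      # links s1s2 and s2s3
after = sum(len(changes(s4, s)) for s in (s1, s2, s3))    # star centred on s4
print(s4, common, before, after)
```

The drop from `before` to `after` is the single common change that the two links no longer pay for twice.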

Example 2

S = (s1, s2, s3) with the species having sequences and distances given in Table 3. U = 7, L = 6 using the complete partition. Following the algorithm as above, we would obtain firstly the link s1s2 as d(s1, s2)


Table 4. Sequences and distances for the five species of Example 3

        Character            Distance to
        1  2  3  4  5        s1  s2  s3  s4  s5
s1:     A  B  C  D  E         0   2   2   3   4
s2:     A  C  C  C  E             0   2   3   3
s3:     A  D  C  B  E                 0   4   3
s4:     A  C  D  D  D                     0   4
s5:     E  C  A  B  E                         0

= 3, as in Fig. 3(a). Now we look at the links of length 4. As d(s1, s3) = d(s2, s3) = 4 we simultaneously add the links s1s3, s2s3, obtaining a circuit, Fig. 3(b). There are now three pairs of incident links meeting at a point, and of these the pair s1s3, s2s3 have two substitutions in common, one more than the other two pairs. Hence we coalesce this pair of links to create the Steiner point s4, as in Fig. 3(c). We turn now to the other pairs of links and on each pair we can coalesce, combining one common substitution, to create Steiner points s5 and s6, as in Fig. 3(d). At this stage we still have a circuit, which must be broken by the removal of one link before a tree is obtained. As each link represents one nucleotide substitution the choice is arbitrary from a graph theoretic viewpoint. If we had had the replacement BC requiring two substitutions, while BD and CD required only one, then obviously, minimizing substitutions, we remove the link s4s5, which also removes the need for the Steiner points s4 and s5. This gives us the tree of Fig. 3(e), of length 6, which is minimal, as well as precise.

It might have been the case that the replacements BC, BD and CD required two substitutions (hence U = 8, L = 7), and in this case there could exist a character state F, such that BF, CF and DF each required a single substitution. In this instance, the insertion of a new Steiner point s7, which had character state F in position 3, to replace the s4s5s6s4 circuit would give a minimal tree of length 7, as in Fig. 3(f). This too would be precise, although its uniqueness depends on the uniqueness of the character state F.

There are some instances, however, where opting for the minimal number of substitutions does not resolve the circuit. If the two longer links of the s4s5s6s4 circuit are of equal length, and there is no character state such as the character state F above which could resolve the circuit, then we have insufficient information to make a choice.
There are several alternative courses of action that could be taken. If this was a problem concerned only with finding a minimal solution, then we would make an arbitrary deletion. For our evolutionary model, however, this is unsatisfactory as the model should not contain any arbitrariness. The second, and favored, alternative is to delay the decision at this stage. Later additions to the Table often determine which links need to be deleted, and indeed a previously arbitrary decision may have deleted a link which would have been more favorable. Otherwise if at the end of the problem


Fig. 3. Steps in the construction of the minimal tree in example 2

the circuits are still unresolved, we must present all alternatives, having solved the problem as far as possible mathematically. The choice is left to the biologist who may wish to use some additional information. A third alternative, taken by several workers such as Fitch and Margoliash (1967), was to accept ambiguity in the determination of the Steiner point sequences, and introduce non-integral link lengths. This however we find unsatisfactory, because


the loss of precise knowledge of the sequences of the Steiner points makes other aspects of our algorithm much more complex. In the next example we consider the case, as above, where a circuit is obtained initially, but subsequently broken by later additions.

Example 3

S = (s1, s2, s3, s4, s5) with the species having sequences and distances given in Table 4. U = 10, L = 8 using the complete partition. Again we assume that each replacement is a single nucleotide substitution. Selecting those distances of length two, we obtain the circuit s1s2s3s1 as illustrated in Fig. 4(a). This circuit is irresolvable as it stands, because no coalescing is possible.
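The two bounds quoted for this example can be checked with a short sketch. It assumes, as the example does, that each replacement is a single substitution; the upper bound is the length of a minimal spanning tree (found here with Prim's method) and the lower bound is the complete partition, in which a character showing k distinct states needs at least k - 1 changes. The function names are illustrative only.

```python
SEQS = {                       # the five sequences analysed in Example 3
    "s1": "ABCDE",
    "s2": "ACCCE",
    "s3": "ADCBE",
    "s4": "ACDDD",
    "s5": "ECABE",
}


def dist(a, b):
    """Each replacement is taken to cost one nucleotide substitution."""
    return sum(x != y for x, y in zip(a, b))


def upper_bound(seqs):
    """Length of a minimal spanning tree on the given sequences (Prim's method)."""
    names = list(seqs)
    in_tree, total = {names[0]}, 0
    while len(in_tree) < len(names):
        length, nxt = min(
            (dist(seqs[p], seqs[q]), q)
            for p in in_tree for q in names if q not in in_tree
        )
        in_tree.add(nxt)
        total += length
    return total


def lower_bound(seqs):
    """Complete partition: each character needs (distinct states - 1) changes."""
    return sum(len(set(column)) - 1 for column in zip(*seqs.values()))


U, L = upper_bound(SEQS), lower_bound(SEQS)
print(U, L)
```

The same two functions applied to the three sequences of Example 2 (ABCDE, BBDCE, ACBCD) give the bounds quoted there.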

Fig. 4. Steps in the construction of the minimal tree in example 3


Table 3. Sequences and distances for the three species of Example 2

        Character            Distance to
        1  2  3  4  5        s1  s2  s3
s1      A  B  C  D  E         0   3   4
s2      B  B  D  C  E             0   4
s3      A  C  B  C  D                 0

Hence we move on to the distances of length 3, creating the graph of Fig. 4(b) with three circuits of length 3: s1s2s3s1 (as before), s1s2s4s1 and s2s3s5s2. However the two latter circuits each can be coalesced, collapsing to the Steiner points s6 and s7, bringing us to a graph of only one circuit again, shown in Fig. 4(c). Now we have sufficient information to resolve our original circuit, as the link s1s3 of length two is the longest link in the circuit, and hence is removed to create the tree of Fig. 4(d). This tree is of length 8, and as L = 8, is minimal.

For our final example we look at the sequences for the set of seven mammals listed below. The cytochrome c sequences given by Eck and Dayhoff (1966) contain 104 amino acids, but as 89 of these amino acids are the same for each of these mammals, we isolate and study only those 15 characters where the amino acids differ. The character numbers used below refer to their actual position in the cytochrome c sequence of length 104.

Example 4 S = (s1, s2, s3, s4, s5, s6, s7) with the sequences given in Table 5. The specific binomens are given by Eck and Dayhoff (1966). All the amino acid changes represent one nucleotide substitution, except for M ↔ Q, P ↔ V, G ↔ K, E ↔ T, G ↔ T and D ↔ T, which each require a minimum of two substitutions. The distance table for this example is given in the first seven columns of Table 6. The upper bound, U = 24.

Table 5.

             11  12  15  44  46  47  50  58  60  62  83  88  89  92 103
s1: Man       I   M   S   P   Y   S   A   I   G   D   V   K   E   A   N
s2: Monkey    I   M   S   P   Y   S   A   T   G   D   V   K   E   A   N
s3: Horse     V   Q   A   P   F   T   D   T   K   E   A   K   T   E   N
s4: Dog       V   Q   A   P   F   S   D   T   G   E   A   T   G   A   K
s5: Pig       V   Q   A   P   F   S   D   T   G   E   A   K   G   E   N
s6: Whale     V   Q   A   V   F   S   D   T   G   E   A   K   G   A   N
s7: Rabbit    V   Q   A   V   F   S   D   T   G   D   A   K   D   A   N

Table 6.

      s1  s2  s3  s4  s5  s6  s7  s8
s1     0   1  15  12  11  12  11  10
s2         0  14  11  10  11  10   9
s3             0   8   5   8   9   6
s4                 0   3   4   6   2
s5                     0   3   5   1
s6                         0   2   2
s7                             0   4
s8                                 0

In order to calculate L, we must consider the number of substitutions necessary at each character. Except at character 89, only two amino acids are represented, so the number of substitutions necessary at that character is the number of substitutions for the pair. At character 89 there are four different amino acids, and the graph representing the possible permutations is given in Fig. 5(a). The numerals on each link indicate the minimum number of substitutions on that link. As can easily be seen, each possible minimal spanning tree is of length 4. If we use the complete partition, where each amino acid site is considered independently, we obtain 21 = 1 + 2 + 1 + 2 + 1 + 1 + 1 + 1 + 2 + 1 + 1 + 1 + 4 + 1 + 1. However at sites 44 and 62, requiring 2 and 1 substitutions, the combinations PD, PE, VE, VD each occur, and these combinations, as shown generally by Fitch (1975, 1977), need a minimum of 4 substitutions, a value which is greater than 2 + 1, as seen in Fig. 5(b). Fitch called these unavoidable discordancies and gave a method that easily recognizes such simple cases as this (1975, 1977). Any spanning tree for sites 44 and 62 must have one of the changes P ↔ V or D ↔ E repeated in parallel. The minimal spanning trees avoid repetition of P ↔ V, which requires two substitutions, and we obtain a tree of length 4. Hence if we partition with (44, 62) together, and the remaining sites independently, we obtain L = 21 - 3 + 4 = 22. From our theorem we know that no spanning tree for these mammals can exist with length less than 22.

We now proceed with the construction of a spanning tree. From scanning Table 6 we see d(s1, s2) = 1 is the least nondiagonal entry, so taking account of links of length 1 we obtain Fig. 5(c). d(s6, s7) = 2 and adding this link we obtain Fig. 5(d). Scanning for links of length 3 we find d(s4, s5) = d(s5, s6) = 3, giving Fig. 5(e). At this stage we can coalesce as the links s4s5 and s5s6 have the change 92AE in common.
Hence we create a Steiner point s8 which differs from s5 at character 92, with A replacing E. This is illustrated in Fig. 5(f). s8 represents a point on our graph, so we must add another row and column to Table 6. The new link lengths of 1 and 2, d(s5, s8) = 1, d(s4, s8) = d(s6, s8) = 2, are already precisely represented on the graph. We now search for links of length 4. d(s4, s6) = d(s7, s8) = 4, but again these are precisely represented on the graph. Searching for links of length 5, d(s3, s5) = d(s5, s7) = 5. s7 and s5 are already linked by a path of length 5, but we need to add s3, as in Fig. 5(g). Again we see no coalescing is possible.
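The arithmetic of this example can be checked mechanically. The following Python sketch is not the authors' program: it assumes the sequences as read from Table 5, the two-substitution pairs named in the text, and, following the text's own reasoning, a lower bound obtained by summing minimal spanning tree lengths over the states observed in each block of a partition. The names `SEQS`, `cost`, `dist`, `mst_length` and `lower_bound` are illustrative only.

```python
# Characters (positions in cytochrome c) and sequences as read from Table 5.
SITES = [11, 12, 15, 44, 46, 47, 50, 58, 60, 62, 83, 88, 89, 92, 103]
SEQS = {
    "s1": "IMSPYSAIGDVKEAN",  # Man
    "s2": "IMSPYSATGDVKEAN",  # Monkey
    "s3": "VQAPFTDTKEAKTEN",  # Horse
    "s4": "VQAPFSDTGEATGAK",  # Dog
    "s5": "VQAPFSDTGEAKGEN",  # Pig
    "s6": "VQAVFSDTGEAKGAN",  # Whale
    "s7": "VQAVFSDTGDAKDAN",  # Rabbit
}
# Amino acid pairs that, per the text, need two nucleotide substitutions.
TWO_STEP = {frozenset(p) for p in ("MQ", "PV", "GK", "ET", "GT", "DT")}


def cost(a, b):
    """Minimum nucleotide substitutions between two amino acids in this data set."""
    if a == b:
        return 0
    return 2 if frozenset((a, b)) in TWO_STEP else 1


def dist(x, y):
    """d(si, sj): sum of per-character costs between two aligned sequences."""
    return sum(cost(a, b) for a, b in zip(x, y))


def mst_length(points):
    """Length of a minimal spanning tree over the distinct points (Prim's procedure)."""
    points = list(dict.fromkeys(points))  # drop duplicates, keep order
    in_tree = {points[0]}
    total = 0
    while len(in_tree) < len(points):
        w, p = min((dist(a, b), b) for a in in_tree for b in points if b not in in_tree)
        in_tree.add(p)
        total += w
    return total


def lower_bound(partition):
    """Sum of spanning tree lengths over the states observed in each block of sites."""
    total = 0
    for block in partition:
        idx = [SITES.index(site) for site in block]
        states = ["".join(seq[i] for i in idx) for seq in SEQS.values()]
        total += mst_length(states)
    return total


complete = [[site] for site in SITES]
merged = [[44, 62]] + [[site] for site in SITES if site not in (44, 62)]
print(lower_bound(complete))  # 21
print(lower_bound(merged))    # 22

# Steiner point s8: the pig sequence with A instead of E at character 92.
i92 = SITES.index(92)
s8 = SEQS["s5"][:i92] + "A" + SEQS["s5"][i92 + 1:]
print([dist(SEQS[k], s8) for k in ("s4", "s5", "s6", "s7")])  # [2, 1, 2, 4]
```

The last line reproduces the new column of Table 6 for s4 to s7, which is why those distances are already precisely represented on the graph once s8 is inserted.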


Fig. 5. Steps in the construction of the minimal tree for a set of seven mammalian sequences (example 4)


Now we search for distances of lengths 6, 7 and 8. Each of these is represented by a path of the precise length. When we examine for d(si, sj) = 9 we find d(s2, s8) = d(s3, s7) = 9. Although s3 and s7 are already linked, the path between them is of length 10, so we can introduce a link of length 9 between them and coalesce, introducing new Steiner points on the links s3s5 and s7s6. The graph would now contain a circuit, whose longest link, of length 5, would be between these two new Steiner points. Hence by our algorithm we would need to break the circuit by removing the link of longest length, and we would return to our original graph. However the link s8s2 does not introduce such a complication, so adding this link we obtain the graph of Fig. 5(h). There is no coalescing possible on the link s8s2. All species have now been linked, and the resulting tree is of length 22 = L. Hence this tree is minimal.

However it is not precise, as we saw above when accepting a path of length 10 linking s3 to s7 when d(s3, s7) = 9, nor is it unique, as we shall illustrate below. For some pairs si, sj in the tree, d(si, sj) is strictly less than the sum of the lengths of the links in the path joining si and sj. For example (path lengths in parentheses): d(s1, s3) = 15 (16), d(s1, s7) = 11 (14), d(s2, s3) = 14 (15), d(s2, s7) = 10 (13). A new link s3s2 would be eliminated in the same way as with the s3s7 link mentioned previously. However the new link, of length 10, connecting s7s2, can be coalesced to create a circuit, as in Fig. 5(i). For a minimal spanning tree, we would break either the s9s10 link (as in Fig. 5(h)) or the s6s8 link. Both these are of length 2, and our procedure, acting on the original data, cannot distinguish between these two trees. We illustrate both trees with species' names added in Fig. 6(a) and 6(b). The lengths on the links represent the number of substitutions along that link. The added Steiner points would represent cytochrome-c sequences:


Fig. 6. The two minimal trees for a set of seven mammalian sequences. The species are MAN (Homo sapiens), MONKEY (Macaca mulatta), RABBIT (Oryctolagus cuniculus), DOG (Canis familiaris), HORSE (Equus caballus), PIG (Sus scrofa), WHALE (Rachianectes glaucus)

             11  12  15  44  46  47  50  58  60  62  83  88  89  92 103
s8            V   Q   A   P   F   S   D   T   G   E   A   K   G   A   N
s9            V   Q   A   P   F   S   D   T   G   D   A   K   G   A   N
s10           V   Q   A   V   F   S   D   T   G   D   A   K   G   A   N
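The two alternative trees can be checked against the Steiner sequences just listed. The Python sketch below is illustrative only. It assumes that the circuit of Fig. 5(i) runs through s8, s9, s10 and s6, with the monkey attached at s9 and the rabbit at s10; this reading is consistent with the link lengths quoted in the text but cannot be confirmed from the figure itself. The names `SHARED`, `BREAK_S9_S10` and `BREAK_S6_S8` are hypothetical.

```python
# Amino acid pairs that, per the text, need two nucleotide substitutions.
TWO_STEP = {frozenset(p) for p in ("MQ", "PV", "GK", "ET", "GT", "DT")}


def dist(x, y):
    """Minimum number of substitutions between two aligned sequences."""
    return sum(0 if a == b else 2 if frozenset((a, b)) in TWO_STEP else 1
               for a, b in zip(x, y))


# Observed sequences (Table 5) and the three Steiner points listed above.
SEQ = {
    "s1": "IMSPYSAIGDVKEAN", "s2": "IMSPYSATGDVKEAN", "s3": "VQAPFTDTKEAKTEN",
    "s4": "VQAPFSDTGEATGAK", "s5": "VQAPFSDTGEAKGEN", "s6": "VQAVFSDTGEAKGAN",
    "s7": "VQAVFSDTGDAKDAN", "s8": "VQAPFSDTGEAKGAN", "s9": "VQAPFSDTGDAKGAN",
    "s10": "VQAVFSDTGDAKGAN",
}

# Links assumed common to both trees (a reading of Fig. 5(i), not taken from it).
SHARED = [("s1", "s2"), ("s2", "s9"), ("s9", "s8"), ("s8", "s5"),
          ("s8", "s4"), ("s5", "s3"), ("s6", "s10"), ("s10", "s7")]
BREAK_S9_S10 = SHARED + [("s6", "s8")]   # circuit broken at the s9s10 link
BREAK_S6_S8 = SHARED + [("s9", "s10")]   # circuit broken at the s6s8 link


def tree_length(links):
    """Total number of substitutions over all links of a tree."""
    return sum(dist(SEQ[a], SEQ[b]) for a, b in links)


def path_length(links, start, goal):
    """Length of the unique path between two vertices of a tree."""
    adj = {}
    for a, b in links:
        w = dist(SEQ[a], SEQ[b])
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))
    stack = [(start, None, 0)]
    while stack:
        node, parent, total = stack.pop()
        if node == goal:
            return total
        stack.extend((n, node, total + w) for n, w in adj[node] if n != parent)
    raise ValueError("vertices are not connected")


print(tree_length(BREAK_S9_S10), tree_length(BREAK_S6_S8))  # 22 22
# The tree is minimal but not precise: d(s3, s7) = 9, yet the path has length 10.
print(dist(SEQ["s3"], SEQ["s7"]), path_length(BREAK_S9_S10, "s3", "s7"))  # 9 10
```

Under these assumptions both ways of breaking the circuit give a tree of length 22, equal to the lower bound L, which is the sense in which the procedure cannot distinguish between them.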

Discussion

The development of bounds as suggested by Fitch (1975, 1977) makes possible an absolute (rather than relative) objective evaluation of phylogenetic trees. This could reduce disagreement between authors dealing with similar sets of species. We illustrate in Fig. 7 the diversity of topologies produced by a number of authors. It should be noted that several of the trees in Fig. 7 have been taken from trees with additional species and we have not determined whether these larger trees are minimal. A three letter code has been used to indicate the species and link lengths are shown. The tree of Eck and Dayhoff (1966) was based on amino acid changes, and the numbers of base changes were derived from these. The tree of Strydom et al. (1972) did not explicitly give numbers of substitutions; they were illustrated on a scaled diagram. The estimate of length given is based on the number of substitutions for their topology. The trees of Eck and Dayhoff, and of Fitch and Margoliash (1967), did not include the whale. The addition of the whale would require at least one more substitution. Other than the omission of the whale, the tree of Eck and Dayhoff is identical to our second tree; all of the other trees are topologically distinct. In comparing the lengths of the trees, we should note that Fitch (1973) uses the Minimum Nucleotide Substitution criterion, and in our two trees the sequence of changes at character 89, D → G followed by G → E, requires one further implicit nucleotide substitution, as illustrated in our introduction. Each of the comparative trees was a subset of a larger tree, and its topology is determined in part by the influence of further species. However it should be noted that when our method is applied to larger sets of species the topology restricted to these seven mammals remains minimal, although one of the two alternatives is deleted.
In order to understand how different methods lead to different trees, it is necessary to review briefly some of the previous methods. Some early methods used a trial and error method of constructing trees and then selected an optimal one for the criterion being used. This may be an adequate approach for a few species and where there is additional information available on the relationships of the species. The comment has been made (Farris, 1972), "the trial and error methods are quite laborious; it is generally not feasible to investigate a large number of alternatives...". A method due to Wagner (1961), based upon Prim's procedure for solving the minimum spanning tree problem, has been introduced into the literature of systematic biology by Farris (1970), who provides computational details and formalisation. The theoretical basis of the method has been outlined by Kluge and Farris (1969), and Farris et al. (1970). The method has been employed as a major analytical tool by Baird and Eckardt (1972).

Fig. 7a. Panels: Strydom et al. (1972), estimated length 26; Fitch (1973), length = 25; Penny (1974), length = 24; Fitch (1976), length = 24

Fig. 7b. Panels: (a) length = 22; (b) length = 22; Eck & Dayhoff (1966), length = 21 (no whale); Fitch & Margoliash (1967), length = 22.6 (no whale)

Fig. 7. Phylogenetic trees obtained by various authors for a common set of mammalian sequences

Eck and Dayhoff (1966) have used a method for protein sequences that is apparently based on the Prim approach in that species are added one at a time to the tree, each species being added to all positions until the best position is found for the species. Boulter et al. (1972) use this approach but comment that, "in retrospect, a wrong decision can be made at any step and the final tree may be only a close approximation to the 'best' tree." In order to improve the method, Boulter et al. (1972) use the refinement that after the final tree is constructed, individual branches are removed and tried at all other positions of the tree, a refinement introduced by Dayhoff in 1969.

Methods based on the Kruskal approach have been referred to by Strydom et al. (1972) and Penny (1976). This approach can have a problem with links that have evolved faster than the average rate. For example a tree shown by Strydom et al. (1972) has the primate link (man and monkey) diverging from the land vertebrates before the divergence of birds and the remaining mammals. A method of reducing this effect was referred to previously (Penny 1976). The present method overcomes this problem of differences in the rate of evolution on different links since all species are retained for further comparisons, even after they are linked into the graph. This is an important aspect of the method and in this case ensures that, in particular, the horse is included in its optimal position. Another important feature of the present method is that when new ancestral species are determined, they are compared with all the existing species. This means that the links are not in a fixed position while the tree is being determined. It has been found (with larger data sets) that improvements are usually made to the existing portion of the network as later species are included.

One of the most useful features of our approach is the objectivity in having a lower bound on the size of the minimal solution. The strategy involved is to attempt to determine a tree of length equal to the lower bound, and with the criterion of a minimal spanning tree. We are then guaranteed that the solution is indeed minimal. However the methods illustrated so far do not guarantee that this ideal result is always attainable in reasonable computing time for larger problems and one may have to be satisfied with a tree that is relatively close in length to the greatest lower bound obtained.

It should be noted that Eck and Dayhoff (1966) did find the minimal tree for the seven species of mammals that they used. But it was not recognised as being the best possible tree because no optimality criterion was employed. The method of Eck and Dayhoff will not always find the best solution. This was shown by Peacock and Boulter (1975) who used the method to generate trees from sets of simulated amino acid sequences that had been generated by a random process. They were able to show with 16 species that the method gave a good approximation of the best tree but they estimated that there was an error of 2-8%.
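For comparison, the plain Kruskal procedure referred to above can be sketched on the seven observed species alone, using the distances of Table 6 and no Steiner points. This Python sketch is illustrative and is not the authors' method. It assumes that the upper bound U of Example 4 is the length of a minimal spanning tree on the given species, an assumption that the figure of 24 supports. The names `kruskal`, `NAMES` and `UPPER` are hypothetical.

```python
# Upper triangle of Table 6 for the seven observed species (the s8 column is omitted).
NAMES = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]
UPPER = [
    [1, 15, 12, 11, 12, 11],  # s1 to s2..s7
    [14, 11, 10, 11, 10],     # s2 to s3..s7
    [8, 5, 8, 9],             # s3 to s4..s7
    [3, 4, 6],                # s4 to s5..s7
    [3, 5],                   # s5 to s6, s7
    [2],                      # s6 to s7
]


def kruskal(names, upper):
    """Add links in order of increasing length, skipping any that would close a circuit."""
    edges = sorted(
        (w, names[i], names[i + 1 + k])
        for i, row in enumerate(upper)
        for k, w in enumerate(row)
    )
    parent = {n: n for n in names}

    def find(x):
        # Representative of the component containing x, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:  # a and b lie in different components, so no circuit is formed
            parent[ra] = rb
            tree.append((a, b, w))
    return tree


tree = kruskal(NAMES, UPPER)
print(tree)
print(sum(w for _, _, w in tree))  # 24
```

After the first two links, s1s2 and s6s7, the graph has two separate components of linked species, which is the feature of the Kruskal approach that the present method shares. The resulting length of 24 exceeds the tree of length 22 obtained with Steiner points.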
There are eight characteristics that distinguish the present method from most of the previous methods for determining phylogenetic trees. These are:
1. it is based on the Kruskal approach in that there may be more than one component of linked species during the process,
2. all species are kept for further comparisons after they are linked into the graph,
3. newly determined ancestral species (Steiner points) are immediately compared with all other species whether or not they are linked to the tree,
4. the final tree is independent of the order in which the species are listed,
5. particular attention is given to calculating a lower bound to test whether a given tree is minimal or near minimal,
6. it is explicitly recognised that the process is generating a network in which circuits are possible,
7. it is a dynamic method in that both Steiner points and links can be removed,
8. the method is not affected by variations in the rate of nucleotide substitutions on different links.

The present method aims at developing a tree that will link all the species with the smallest possible number of nucleotide substitutions. It is another question as to how well this will reflect their actual phylogeny. Also the determination of the root, or original ancestor, of the tree cannot be decided by the method; additional information is necessary to consider this.

The accuracy of the tree is very much dependent upon the suitability and reliability of the data. There may be errors in the sequence entries. A single error could conceivably make a significant difference to the tree if it affected closely related sequences. If there are many closely related species and a few others very different from each other, this may result in long unbranched links. From our experience with large data sets these long links in the minimal tree may lead to inconsistencies with current theory. Some species have the same cytochrome-c sequences, e.g. human and chimpanzee (Dayhoff, 1972). We assume such species to be identical and hence the method cannot shed any light on their relative evolutionary behavior. Hence other information must be used in this case. Another limitation of the data is that for some sequences it has not been possible to distinguish between the amino acid pairs glutamic acid and glutamine, or aspartic acid and asparagine.

If amino acid substitutions occurred at random then it would be unlikely for any individual change to occur twice in closely related species. There are over 100 amino acid sites and 20 amino acids can occur at each site (although not all can be reached by a single nucleotide substitution). Similarly there is a low probability of a reversion to a previous acid at a site, at least before other substitutions had led to several changes at other sites. But with cytochrome-c, for example, only a few sites in any species can accept an amino acid substitution at a given time (Fitch, 1976). Therefore parallel substitutions in related species and reversions will occur more frequently than at random. This is likely to lead to similar species appearing to be even more closely related than usually expected. This emphasizes once again the necessity of careful data selection.

The implementation of the complete method by computer is a relatively complex task. It was decided to use an interactive approach with a terminal to allow greater flexibility in the development of the algorithm.
A program was written to: store the data, record the sequences of Steiner points created, record which species were linked, perform comparisons between sequences, discover which link should be added next to the structure created so far, and help in deciding how to coalesce links. With this interactive method it took approximately 15 min to construct the optimal trees for the seven mammals. The program is being developed further. The method has been applied to larger sets of sequences. In particular we have found minimal trees for a set of 23 different vertebrate cytochrome-c sequences. This and larger trees will be discussed subsequently. We would welcome the opportunity to discuss applying the method to other sets of data.
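One of the tasks listed above, discovering which link should be added next, can be sketched in Python as follows. This is a hypothetical reconstruction from the worked example, not the program written for the authors. It assumes the rule used in Example 4: scan the distances of Table 6 in increasing order and skip every pair whose distance is already precisely represented by a path in the graph. The names `next_candidates` and `FIG_5G` are illustrative.

```python
# Upper triangle of Table 6, including the Steiner point s8.
NAMES = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
UPPER = [
    [1, 15, 12, 11, 12, 11, 10],
    [14, 11, 10, 11, 10, 9],
    [8, 5, 8, 9, 6],
    [3, 4, 6, 2],
    [3, 5, 1],
    [2, 2],
    [4],
]
DIST = {
    (NAMES[i], NAMES[i + 1 + k]): w
    for i, row in enumerate(UPPER)
    for k, w in enumerate(row)
}


def path_lengths(links):
    """Shortest path length between every pair in the current graph (inf if unlinked)."""
    inf = float("inf")
    d = {(a, b): (0 if a == b else inf) for a in NAMES for b in NAMES}
    for (a, b), w in links.items():
        d[a, b] = d[b, a] = min(d[a, b], w)
    for k in NAMES:  # Floyd-Warshall relaxation
        for a in NAMES:
            for b in NAMES:
                if d[a, k] + d[k, b] < d[a, b]:
                    d[a, b] = d[a, k] + d[k, b]
    return d


def next_candidates(links):
    """Smallest distance not precisely represented, and the pairs that have it."""
    paths = path_lengths(links)
    pending = [(w, pair) for pair, w in DIST.items() if paths[pair] != w]
    if not pending:
        return None, []
    shortest = min(w for w, _ in pending)
    return shortest, sorted(pair for w, pair in pending if w == shortest)


# Graph of Fig. 5(g), with link lengths as stated in the text.
FIG_5G = {("s1", "s2"): 1, ("s6", "s7"): 2, ("s5", "s8"): 1,
          ("s4", "s8"): 2, ("s6", "s8"): 2, ("s3", "s5"): 5}
print(next_candidates(FIG_5G))  # (9, [('s2', 's8'), ('s3', 's7')])
```

For the graph of Fig. 5(g) the sketch returns the two pairs at distance 9 that Example 4 examines. Choosing between them, and deciding how to coalesce, is left to the user, as in the interactive approach described above.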

Acknowledgement. The authors are indebted to Mr. L.K. Thomas of the Department of Computer Science at Massey University for writing the interactive computer program which was used as an aid in implementing the method. We would like to thank Professor W.M. Fitch for constructive criticism of the manuscript.

References

Baird, R.C., Eckardt, M.J. (1972). Syst. Zool. 21, 80-90
Boulter, D., Ramshaw, J.A.M., Thompson, E.W., Richardson, M., Brown, R.M. (1972). Proc. R. Soc. Lond. B. 181, 441-455
Camin, J.H., Sokal, R.R. (1965). Evolution 19, 311-326
Christofides, N. (1975). Graph Theory: an Algorithmic Approach. 1st ed. New York: Academic Press
Dayhoff, M.O. (1969). Atlas of Protein Sequence and Structure 4, National Biomedical Research Foundation, Silver Spring, Md.
Dayhoff, M.O. (1972). Atlas of Protein Sequence and Structure 5, National Biomedical Research Foundation, Silver Spring, Md.
Dreyfus, S.E., Wagner, R.A. (1971). Networks 1, 195-214
Eck, R.V., Dayhoff, M.O. (1966). Atlas of Protein Sequence and Structure 1966, National Biomedical Research Foundation, Silver Spring, Md.
Estabrook, G.F. (1968). J. Theor. Biol. 21, 421-438
Farris, J.S. (1970). Syst. Zool. 19, 83-92
Farris, J.S. (1972). Am. Nat. 106, 645-668
Farris, J.S., Kluge, A.G., Eckardt, M.J. (1970). Syst. Zool. 19, 172-189
Fitch, W.M. (1971). Syst. Zool. 20, 406-416
Fitch, W.M. (1973). J. Mol. Evol. 2, 123-136
Fitch, W.M. (1975). Proceedings of the Eighth International Conference on Numerical Taxonomy, Ed. G.F. Estabrook, Freeman, San Francisco, pp. 189-230
Fitch, W.M. (1976). J. Mol. Evol. 8, 13-40
Fitch, W.M. (1977). Am. Nat. 111, 223-258
Fitch, W.M., Margoliash, E. (1967). Science 155, 279-284
Fitch, W.M., Farris, J.S. (1974). J. Mol. Evol. 3, 263
Hakimi, S.L. (1972). Networks 1, 113-131
Harary, F. (1969). Graph Theory. 1st ed. Reading, Massachusetts: Addison-Wesley
Hartigan, J.A. (1973). Biometrics 29, 53-65
Kevin, V., Whitney, M. (1972). Comm. of A.C.M. 15, 273
Kluge, A.G., Farris, J.S. (1969). Syst. Zool. 18, 1-32
Kruskal, J.R. (1956). Proc. Amer. Math. Soc. 7, 48-50
Moore, G.W., Barnabas, J., Goodman, M. (1973). J. Theor. Biol. 38, 459-485
Peacock, D., Boulter, D. (1975). J. Mol. Biol. 95, 513-527
Penny, D. (1974). J. Mol. Evol. 3, 179-188
Penny, D. (1976). J. Mol. Evol. 8, 95-116
Prim, R.C. (1957). Bell Syst. Tech. J. 36, 1389-1401
Strydom, D.J., Van der Walt, S.J., Botes, D.P. (1972). Comp. Biochem. Physiol. 43B, 21-24
Wagner, W.H. (1959). IX Int. Botanical Congr. Montreal. 841-845
Watson, J.D. (1975). Molecular Biology of the Gene. 3rd ed. Menlo Park, Calif.: W.A. Benjamin, Inc.

Received March 9, 1977/Revised September 5, 1978