J Mol Evol (2002) 54:191–199 DOI: 10.1007/s00239-001-0001-5
© Springer-Verlag New York Inc. 2002
Stretch Coding and Block Coding: Two New Strategies to Represent Questionably Aligned DNA Sequences
Daniel L. Geiger
Research Associate, Santa Barbara Museum of Natural History, 2559 Puesta del Sol Road, Santa Barbara, CA 93105, USA
Received: 1 August 2000 / Accepted: 10 July 2001
Abstract. Most coding strategies that address the problem of questionable alignment (elision, case sensitive, missing, polymorphic, gaps as presence/absence matrix) conflict with phylogenetic principles, particularly those relating to the concept of homology (shared similarity explained by common ancestry). In some cases, the test of conjunction is failed. In other cases, characters that are coded ambiguously can lead to character-state optimization in the terminal taxa that conflicts with the original observations. Only data exclusion and contraction avoid these pitfalls. In highly dissimilar sequences additional character states can represent the available information. Two new methods that accomplish this, block and stretch coding, are introduced here. These two new coding strategies are not in conflict with the test of conjunction and do not contradict the original observations. They are comparable to coding practices with morphological data once the intrinsic differences due to character-state identity and topographical identity have been taken into account. It is suggested that, of the three recoding methods, the one be selected that preserves the maximum potential phylogenetic information as measured with the minimum number of steps required for the particular part of the data matrix.
Key words: DNA sequence alignment; Character coding; Stretch coding; Block coding; Homology; Test of conjunction
Correspondence to: Daniel L. Geiger
Introduction
Phylogenetic systematics relies increasingly on molecular data, particularly DNA sequences. One of the most contested areas in molecular phylogenetics is DNA sequence alignment, in which shared similarities of the bases between taxa are inferred. Sequence alignment is hence a necessary step to convert the raw sequence data of individual taxa into a data matrix. It is logically impossible to obtain a phylogeny from data on individual taxa, because a data matrix representing information on multiple taxa must be at hand, which necessarily involves alignment. Alignment may be performed by a number of computer programs [e.g., CLUSTAL of Higgins and Sharp (1988), MALIGN of Wheeler and Gladstein (1994)] and, alternatively, may be done manually. Wheeler (1996) claimed to have developed an alignment-free phylogenetic analysis program (POY). However, POY is only alignment free insofar as alignment and phylogenetic reconstruction are implemented in a single computer program [see also Simmons and Ochoterena (2000) for further inconsistencies]. To determine minimum costs, a particular set of values (or sets in case of equally costly solutions) in the form of a particular alignment(s) must be specified, even if stored only temporarily during computer manipulation of the data. The storage condition applies to any alignment method, be it minimum cost edit, pairwise, or multiple alignments. In other words, any conclusion in combinatorics is based on particular states of the elements to be freely combined. The advantages and disadvantages of certain alignment procedures have been considered with regard to
introducing gaps (Waterman et al. 1991), to maximize computational efficiency (Waterman et al. 1991), and in the context of evolutionary relevance (Gatesy et al. 1993). The main concerns have been computational efficiency, how complex an evolutionary model is needed for meaningful alignment, and whether the particular model is realistic. Identification of Questionable Alignment. In this contribution I do not consider the advantages and shortcomings of the various approaches to alignment, but concentrate on one particular issue, that of questionable alignment. Questionable alignment can be broadly defined as variation of nucleotide position due to changed alignment parameters (ti/tv ratio, gap insertion cost, gap extension cost). Although this definition gives an accurate description, it serves little for handling of actual data, because the parameter spaces are not numerically defined. Can the parameter spaces be defined in an objective manner? I consider it to be impossible. The problem lies in the fact that all DNA sequence alignment is inherently subjective. First, a distinction between explicitness and objectivity needs to be made. Explicit statements are fully declared but are not necessarily externally justified. Objective statements are externally justified, which usually implies that they are also explicit. In the context of DNA sequence alignment the application of a particular set of alignment parameters (ti/tv ratio, gap insertion cost, gap extension cost) may be contested, because different alignments may result from alternative sets of alignment parameters. As a consequence, the extent and the particular state of regions of questionable alignment may vary because of the particular set of parameters used. There are two possible objective means to identify regions of questionable alignment. (1) The universal parameter value akin to the gravitational constant or value for the speed of light is used. 
Thus far this approach has not yielded any success and can be disregarded. (2) The entire range of possible values is examined by a so-called sensitivity analysis (Wheeler 1995). First, intervals may be subjectively chosen as integer numbers on a linear scale, or on a log scale with bases such as 2, e, and 10 (Wheeler 1995). The range of values for gap weights has a logical lower limit of 0.5 because of the triangle inequality (Wheeler 1993) but has an open upper bound. The only logical upper limit is infinity, i.e., base change only. Most workers would argue that the infinity boundary is unrealistic. Restricting the range of alignment parameters used to less than the objective 0.5 to infinity automatically introduces subjectivity (Gatesy et al. 1993). Whether the range is explicitly restricted in computer-facilitated alignment or while performing manual alignment is irrelevant to the question of the subjective choice of parameters. Even alignment with an explicitly defined range of parameter values is subjective. Lutzoni et al. (2000, p. 634) argued for alignment by eye “to correct for obvious alignment
errors,” although this approach is liable to human errors. As an example, their Fig. 10, positions 81–86, shows –CTTAA for Drosophila, Aedes, and Schistocerca, but Heptagenia is aligned as CTTATC and Cicindela as –ACTTA, which are obvious alignment errors.
Focus of this Contribution. I focus my attention on how to handle alignment that is considered questionable, regardless of how it is determined. The handling of questionable alignment has consequences for many subsequent inferences, such as the postulation of phylogenetic hypotheses, the reconstruction of ancestral character states, and the evaluation of patterns of character evolution. Hence, coding of questionable alignments deserves a critical inspection. I first examine the goals of alignment in terms of identifying shared similarities as evidence of past common ancestry, continue to evaluate previously proposed strategies with respect to these goals, and propose two new procedures: stretch coding and block coding.
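However a questionable region is determined, the working definition given earlier (variation of nucleotide position under changed alignment parameters) can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the function names, the toy sequences, and the idea of comparing column partners between two finished alignments are all hypothetical and are not taken from any alignment program.

```python
def residue_partners(alignment, gap="-"):
    """Map each residue (taxon, k-th base) to the residues sharing its column."""
    length = len(next(iter(alignment.values())))
    seen = {taxon: 0 for taxon in alignment}   # bases counted so far per taxon
    partners = {}
    for col in range(length):
        column = []
        for taxon, seq in alignment.items():
            if seq[col] != gap:
                column.append((taxon, seen[taxon]))
                seen[taxon] += 1
        for residue in column:
            partners[residue] = frozenset(column) - {residue}
    return partners


def questionable_residues(alignments):
    """Residues whose column partners differ between at least two alignments."""
    maps = [residue_partners(a) for a in alignments]
    first = maps[0]
    return sorted(r for r in first if any(m[r] != first[r] for m in maps[1:]))


# Hypothetical toy data: two alignments of the same two sequences,
# as might result from two different gap costs.
alignment_a = {"t1": "AACCGG", "t3": "AAC-GG"}
alignment_b = {"t1": "AACCGG", "t3": "AA-CGG"}
print(questionable_residues([alignment_a, alignment_b]))
```

Residues are compared by their column partners rather than by absolute column number, because alignments obtained under different parameters need not have the same length.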
Goal of Alignment
The goal of DNA sequence alignment is to identify shared similarities that represent evidence of past common ancestry. From that evidence of common ancestry one can retrodict past ancestors, leading to a hypothesis of relationships of the terminal taxa in the form of a cladogram (Geiger et al. 2001). Shared similarity consists of a pair of components: “special similarity” and “positional correspondence” of Remane (1952), also called “character-state identity” and “topographical identity” by Brower and Schawaroch (1996). This dichotomy is echoed in the philosophical, epistemological literature as “propositional belief” and “objectual belief” (Audi 1998). These pairs of terms refer to the shared character states (A, G, C, T, –) and the shared position (for instance, base 213), respectively. Both similarity conditions need to be met satisfactorily for any given datum to be admissible as evidence of common ancestry. In the case of DNA sequences, character-state identity (A, G, C, T, –) is not ambiguous, but uncertainties about topographical identity can yield questionable alignment (Brower and Schawaroch 1996). The statement “Organism X has a G” offers little information, but “Organism X has a G in position 213” carries potential phylogenetic information. Both components of the shared similarity statement are satisfied. If an area of questionable alignment is identified, the goal is to represent it in a way that is compatible with principles of phylogenetic reconstruction. It is this methodological justification criterion that will be applied to uphold or reject certain coding strategies discussed below. Whether a coding strategy will result in a shorter tree, i.e., whether it should fare better or worse under the optimality criterion of parsimony, is irrelevant.
A central concept of phylogenetics is that of homology: shared similarity explained by common ancestry (Fitzhugh 1999). The term homology has been applied to different entities and concepts (Hall 1994). My preference of its application is guided by the original circumscription by Owen (1847, 1849) of the word homologue, homology, and their connection to the archetype (see also Panchen 1994). What is recognized prior to phylogenetic analysis of the data matrix is shared similarity without reference to an ancestor. This shared similarity has also been termed primary, putative, or weak homology (e.g., de Pinna 1991; Hawkins et al. 1997). Subsequently these shared similarities can be explained by common ancestry. If these shared similarities can be accounted for by descent from a common ancestor, then these shared similarities explained by common ancestry are termed homologies. Homology in this restricted sense has also been termed secondary, confirmed, or strong homology (e.g., de Pinna 1991). The connection of homology with causal explanation by means of ancestors was also highlighted by Owen, whose special homology made reference to a common nonarchetypal ancestor, whereas general homology recognized the connection to the archetype (Panchen 1994). Those shared similarities that cannot be accounted for by descent from a common ancestor are universally called homoplasies. Note that the use of homology in the restricted sense produces a pair of terms—homology and homoplasy— that are shared similarities distinguished by being (not) explained by common ancestry. Furthermore, the inexistence of the terms primary and secondary homoplasy (Sanderson and Hufford 1996) lends further support for the restricted use of the term homology as shared similarities explained by common ancestry. Two problems relating to homology are encountered in the present context and are highlighted in Fig. 1. Failure of the Test of Conjunction. 
If any character in one taxon is considered similar to more than one character in another taxon, i.e., if one character evolves into two (disregarding cases of homonomy), then the test of conjunction is failed (de Pinna 1991). The failure of the test of conjunction can be diagnosed in a data matrix if a character of one taxon is coded in one column but is coded for another taxon in more than one column. This situation is indicated in Fig. 1 with underlined positions.
Contradiction of the Original Observation. Some coding methods allow a character-state optimization at the terminal node (terminal taxon) after a tree has been obtained that will be incompatible with the original data. Consider taxon 1 with CC; taxon 2 with questionable alignment, –C or C–; and taxon 3 with – –. Taxon 2 is recoded ??. One possibility of showing tree ((1, 2)3) under ACCTRAN is with the terminal sequences ((CC, CC)– –), inferring two steps from 3 to (1, 2). The character-state optimization at the terminal node of taxon 2 (CC) is in conflict with a single base having been observed and, hence, contradicts the original data. This problem with character-state optimizations at the terminal nodes (hereafter character-state optimization) is highlighted by the positions on a gray background.
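The taxon 2 example can be checked mechanically. The following sketch, with a hypothetical function name, enumerates every assignment that a region recoded as ?? could receive from a given set of states and keeps only those whose bases match what was observed; everything else contradicts the original observation in the sense used here. It assumes that an optimization is compatible exactly when removing gaps from it yields the observed bases in the observed order.

```python
from itertools import product


def compatible_optimizations(observed, states, gap="-"):
    """List the state assignments that keep the observed bases unchanged."""
    bases = observed.replace(gap, "")            # what was actually sequenced
    compatible = []
    for combo in product(states, repeat=len(observed)):
        candidate = "".join(combo)
        if candidate.replace(gap, "") == bases:  # same bases, same order
            compatible.append(candidate)
    return compatible


# Taxon 2 shows a single C whose position is unresolved (-C or C-).
# With the states C and gap available, only two of the four
# possible assignments agree with the observation.
print(compatible_optimizations("-C", "C-"))
```

Under these assumptions CC and – – are both rejected: the first postulates a base that was never observed, and the second removes one that was.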
Gaps: Missing Data or Fifth Character State?
The conjunction of a topographical identity with a character-state identity employs philosophical classification theory: a particular datum (G) is put into a bin (base at position 213) through comparison with comparable data (other orthologous sequences). In that process gaps may be introduced, which represent the absence of bases. Nelson and Platnick (1981) have pointed out that the absence of a character state cannot be observed. One does not note the absence of legs in snakes, but the smooth continuity of the body surface; the absence of legs is inferred during the classification process. Corresponding arguments can be applied to gaps. What we can observe in DNA sequences is the adjacent position of two bases (Simmons and Ochoterena 2000). Gaps are also inferred during the classification process because the adjacent positions in one taxon are set apart by inserted bases in other taxa. Acknowledging Nelson and Platnick’s (1981) point, the conventional use of “absence” and “gap” is continued here, despite being somewhat imprecise. From the above it follows that gaps are a consequence of alignment, which is a classification act; they do not constitute missing data, but a discrete entity and, hence, should be treated as a fifth character state. There is a direct analogy to coding practices with morphological data. Consider the legless condition of snakes compared to other reptiles. If snake taxa were coded ? for legs, then legs would be postulated for snakes, which is in conflict with the inference arising from comparison of legless snakes with leg-bearing lizards. The “legless” condition is a clear inference, which should be coded as such using a separate state. The same reasoning applies to gaps.
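The difference between the two treatments of a gap can be illustrated with a minimal Fitch-style parsimony sketch for a single character on a fixed tree. This is an illustrative toy, not the procedure of any particular program: the function name, the nested-tuple tree format, and the three hypothetical taxa x, y, and z are assumptions, and only one of the possibly several most parsimonious reconstructions is returned.

```python
def fitch(tree, coding, states="ACGT-"):
    """Return (steps, terminal assignment) for one character on a rooted binary tree."""
    steps = 0
    assignment = {}

    def down(node):
        # First pass: state sets from the terminals toward the root.
        nonlocal steps
        if isinstance(node, str):                       # terminal taxon
            code = coding[node]
            return (set(states) if code == "?" else {code}), node
        (lset, lnode), (rset, rnode) = down(node[0]), down(node[1])
        shared = lset & rset
        if not shared:
            steps += 1                                  # a change is required
        return (shared or lset | rset), ((lset, lnode), (rset, rnode))

    def up(state_set, node, parent_state):
        # Second pass: keep the parent's state whenever the set allows it.
        state = parent_state if parent_state in state_set else min(state_set)
        if isinstance(node, str):
            assignment[node] = state
        else:
            for child_set, child_node in node:
                up(child_set, child_node, state)

    root_set, annotated = down(tree)
    up(root_set, annotated, None)
    return steps, assignment


tree = (("x", "y"), "z")                                # hypothetical taxa
as_missing = {"x": "?", "y": "G", "z": "G"}             # gap coded as missing
as_fifth_state = {"x": "-", "y": "G", "z": "G"}         # gap coded as a state
print(fitch(tree, as_missing))
print(fitch(tree, as_fifth_state))
```

With the gap coded as missing, taxon x is optimized as G at no cost, which postulates a base where none was found, just as legs would be postulated for snakes. Coded as a fifth state, the gap is retained at the terminal and one step is counted.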
The existence of a gap and the absence of legs represent conclusions of classification acts that are open to explanation by an evolutionary event of a loss (or gain) of a base or legs; the two have the same evolutionary significance. It may be argued that the classification of base–gap/leg–legless is different from the classification of a base as A/C/G/T or the leg as short/medium-length/long. The coding then contains two characters: presence or absence of base/leg and state of base/leg, with the taxon showing a gap/legless condition having an inapplicable state. There is no a priori reason for coding a character as a single multistate character or as two, one being a presence/absence character and the other coding for the particular expression of the present condition. However, the
Fig. 1. The effect of character coding on homology statements in questionably aligned DNA sequences, using hypothetical sequences of four taxa. Underlining: Violation of test of conjunction. Gray background: Character states not found in original data. Box: Switching the positions of the single base plus gap results in further possible optimizations which also contradict the original observations. For details see the text.
latter version introduces the problem of inapplicables (Strong and Lipscomb 2000). Hence, I favor the less problematic of the two equivalent coding methods, namely, coding gaps as a fifth character state.
Alternative Coding Strategies
Unambiguous alignment is always unproblematic. The clearest case is with conserved regions harboring uninformative states. As soon as regions of unequal length are compared, questionable or ambiguous alignment may arise. Statements of shared similarity may depend upon the alignment algorithm and parameters chosen. Even with a particular setting, multiple, equally parsimonious, or costly alignments may be found, although not many programs report more than one. Ambiguous alignment can be addressed using various flexible coding strategies. I consider each strategy and illustrate the consequences
in terms of failure of the test of conjunction and contradiction with the original observation, using a simple example. This example consists of four hypothetical sequences (Fig. 1; original data). The sequences are characterized by initial and terminal conserved regions of five bases each, adjacent to variable regions spanning two positions, i.e., six and seven. In the variable region, taxon 1 shows a CC, and for taxon 2 a double gap (– –) is found. Taxa 3 and 4 show one base each, C and A, respectively, and one gap. The absolute and relative positions of base and gap are unresolved for the last two taxa (Fig. 1; four possible alignments).
Elision. Finding a consensus of multiple alignments resulting from the choice of different alignment parameters has been addressed by Wheeler et al. (1995), who proposed a method called “elision.” The alternative alignments for each taxon (Fig. 1; four possible alignments) are appended to one another (Fig. 1; Elision) and analyzed together. Essentially, elision is a weighting procedure, giving characters with unequivocal position more weight than those where the specific similarity statement is uncertain. Those characters with unambiguous alignment are found multiple times (positions 1–5 and 8–12), whereas any particular type of column in regions of ambiguous alignment (positions 6 and 7) is included at a lower frequency (four times versus once in Fig. 1). Character weighting beyond the simplest assumption of equal weights is highly controversial, because it introduces subjectivity into the analysis (Wheeler 1986). Elision also has effects with respect to the homology concept (Wheeler et al. 1995, pp. 5–6): “The elided data . . . have the disturbing property of assigning multiple primary homologies to the same datum . . . 
the implications for homology are unsettling, since individual bases must have individual histories, but are not treated as such.” To rephrase their finding, the test of conjunction is failed for all positions with variable alignment. The other multiply occurring positions are instances of character weighting and not failures of the test of conjunction. The shortcomings of elision have also been discussed by Lutzoni et al. (2000).
Case Sensitive. PAUP (Swofford 1990) used its ability to distinguish between upper- and lowercase characters to address questionable alignment. Unequivocally aligned bases are shown in uppercase; ambiguously aligned bases are shown in lowercase. Lowercase states are then treated as missing (?) but preserve the original information in a convenient form (Fig. 1; Case sensitive, Recoded). The first problem is the inability to distinguish between an unambiguously aligned and an ambiguously aligned gap, because a hyphen does not have upper- and lowercase symbols. If hyphens could be coded in upper- or lowercase, they would encounter the same problems as the base characters. For one, an a priori alignment must be chosen subjectively, unless gaps are specified as missing characters. The latter is clearly inappropriate, as pointed out above; it would result, in the present example, in a CC character-state optimization for all taxa, i.e., render the character uninformative (not shown). Assuming that gaps are treated as a fifth character state, lowercase coding becomes pointless, because a particular position for the lowercase characters must be chosen to begin with. Additionally, the lowercase (= missing) characters can now take on any state (A, T, G, C, –). Character-state optimizations will consider only existing states, C and –, which will lead to contradictions to the original observations. Taxon 4 will never show the observed state A, and taxon 3 may show a double gap although one base was observed (Fig. 1; Case sensitive, Optimizations). Case-sensitive coding confuses the uncertain expression of a state (character-state identity) with the uncertain position (topographical identity) of an observed state in the sequence.
Missing Data. If a region of ambiguous alignment is coded as missing data [= ? (e.g., Whiting et al. 1997)], the parsimony algorithm is offered more possibilities to optimize the states for these particular missing data entries. Such a strategy seemingly offers flexibility but comes at a high price, because the same problems with character-state optimizations as discussed above apply. One additional type of inconsistency can result in taxon 4, when CC is optimized; it infers the wrong type and the wrong number of bases (Fig. 1; Missing data, Optimizations).
Polymorphic. Regions with ambiguous alignment can be coded as polymorphic, restricting the possible states a particular position may exhibit (for overview see Wiens 1995). Positions 6 and 7 are coded “1 = C or –” for taxon 3 and “2 = A or –” for taxon 4 (Fig. 1; Polymorphic, Recoded). 
The possibility of assigning unobserved character states to a position is barred, but still only existing states (C or –) will be optimized. Taxon 4 will always show a double gap, hence, it contradicts the observations consistently. Taxon 3 contradicts the observations in two of the four possible optimizations (Fig. 1; Polymorphic, Optimizations). Exclusion. Questionably aligned regions may be excluded from the analysis [Fig. 1; Exclusion, Recoded (e.g., Gatesy et al. 1993; Cerchio and Tucker 1998)]. This method will result, in the example used here, in a sequence with only uninformative characters, leaving the relationships of the four taxa unresolved. The exclusion of characters does not conflict with the homology concept, nor does it contradict the original observations. The exclusion of characters after their acquisition was characterized by Wheeler (1986, p. 108) as “to give up” with
Fig. 2. Coding strategies for a few highly dissimilar taxa. Original data show one of the many possible alignments of this particular data set. Stretch coding and block coding illustrate two alternative coding strategies compared to presence/absence coding discussed in the text.
a certain, highly homoplastic data set if even reluctantly applied weighting did not provide better resolution. In the same sense, if multiple weighting schemes in sequence alignment lead to different similarity statements, then we may well give up these characters, i.e., exclude them. Although unsatisfactory, it is more appropriate to admit the inability to recognize shared similarity than to make unfounded assertions.
Contraction. This less stringent method provides similarity statements at a higher level of generality. Positions 6 and 7 are recoded into a single character with character states as types of linear combinations of bases: 1 = CC; 2 = – –; 3 = contains C; 4 = contains A (Fig. 1; Contraction, Recoded). It shows that all taxa have distinct character states and that shared similarities cannot be expressed. Neither does the test of conjunction fail, nor do the optimized character states contradict the original observations: the homology concept is intact (Fig. 1; Contraction, Optimizations). Contraction could reduce the number of possible alignments implicitly, in a case in which taxa 4 and 5 both would show C’s (not shown). These could be assumed to represent a shared similarity so that both would be coded “1.” This may be viewed as a reasonable assumption, because there is no alignment problem between taxon 4 and taxon 5 to the exclusion of taxa 1 and 2. The alternative assumption, that the C’s in taxa 4 and 5 are not shared similarities, could be expressed by coding taxon 3 with character state “1” and taxon 4 with character state “2.” There is no coding method retaining any characters that can keep the information that all four possible alignments are equally possible. Only data exclusion implicitly acknowledges all four possible alignments. A conceptually similar approach has been proposed by Wägele (1994). A superficially similar approach was proposed by Wheeler (1999) with his fixed character states and Lutzoni et al. (2000). 
These coding strategies also treat multiple positions as a single character. However, an additional step matrix with weights based on the minimum steps between the particular sequences is employed. Wheeler’s (1999) and Lutzoni and co-workers’ (2000) methods differ primarily in how the weights for the step matrix are determined. In both cases, the particular value of the weights depends on the ti/tv ratio and the gap cost function. Additionally, to determine the minimum cost, a particular alignment must be assumed. Accordingly, due to the step matrix, the underlying problem of questionable alignment is not addressed. Contraction has been applied by Geiger (2000) to data of Lee and Vacquier (1995). The reduction of two or more positions to a single one may raise two questions. The first regards character weighting. This problem has bearing only if two otherwise equivalent coding schemes are compared. The latter is not the case here. Data contraction is carried out because of problems relating to representation of shared similarities. Issues relating to homology clearly take precedence over weighting. Hence, the character-weighting argument, despite, in general, being of legitimate concern, has only limited force in the current context. As an option, the contracted character could be given the weight equal to the minimum number of positions. The second question arising from data reduction relates to the implicit assumption about a reduced number of events. It is fully addressed under Presence/Absence Coding, below.
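The contraction of positions 6 and 7 into a single character can be sketched as follows. The function name and the state numbering by order of first appearance are assumptions made for illustration. The sketch also adopts the optional assumption discussed above that identical gap-free contents represent a shared similarity and receive the same state.

```python
def contract(region_by_taxon, gap="-"):
    """Recode a multi-column region as a single character per taxon."""
    state_of = {}                                # gap-free content -> state label
    recoded = {}
    for taxon, region in region_by_taxon.items():
        content = region.replace(gap, "")        # placement of gaps is ignored
        if content not in state_of:
            state_of[content] = str(len(state_of) + 1)
        recoded[taxon] = state_of[content]
    return recoded, state_of


# Variable region (positions 6 and 7) of the four-taxon example; the gap
# placement shown for taxa 3 and 4 is one arbitrary choice and does not
# influence the result.
region = {"taxon1": "CC", "taxon2": "--", "taxon3": "C-", "taxon4": "-A"}
recoded, legend = contract(region)
print(recoded)
print(legend)
```

Because only the gap-free content is compared, the result is identical for all four possible alignments of taxa 3 and 4, which is why the contracted character cannot contradict the original observations.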
Highly Dissimilar Taxa
What is the most appropriate action if only one or a few taxa are extremely dissimilar compared to the remainder (Fig. 2)? Exclusion or contraction of data may eliminate much important information on the relationships of the majority of taxa. One may consider missing data coding for that particular stretch in highly dissimilar taxa (Fig. 2; taxa 6 and 7), because essentially one does not know anything about the specific similarities in these taxa. However, as pointed out above, such a coding strategy is inappropriate because it confuses the uncertain expression of a state (= character-state identity) with the uncertain position (= topographical identity) of an observed state in the sequence and will create conflicts during character-state optimization. The two coding strategies introduced below address the question of how best to represent sequence data in highly dissimilar taxa.
Stretch Coding. The observed dissimilarity is a piece of available information about taxa 6 and 7 to be represented as such. This information can be expressed using new character states, i.e., the entire unalignable region can be coded as a single state, 6 (Fig. 2; Stretch coding). In this fashion it can be shown that taxa 6 and 7 are, on the one hand, very similar to one another but, on the other hand, highly dissimilar from taxa 1–5. Minor differences between taxon 6 and taxon 7 could be coded using an additional character state for taxon 7 (not shown). For instance, instead of the first base in taxa 6 and 7 both being a T, taxon 6 may show a T but taxon 7 may have a G. Taxon 7 would have a single character state, “7.” One caveat applies: as an additional state(s) must be specified for a particular character, it implies a shared similarity statement with the characters in taxa 1 to 5, which is not feasible. Explicit similarity statements of T and A in taxa 6 and 7 are made, disregarding alternative alignment possibilities, e.g., the T may be found either in position 2 as indicated or in position 4; the relative position of T may not be the same for taxa 6 and 7. One may argue that shared character states need to be explained and that the classification of the character states under a particular character is of lesser importance, because the shared similarity statements are restricted to within the blocks of taxa 1 to 5 and taxa 6 and 7 (Fig. 2; boxes). Then the information is sufficiently represented by stretch coding. Geiger (2000) utilized this coding method for data of Lee and Vacquier (1995).
Block Coding. To circumvent the problem of unspecifiable similarity statements across blocks within a stretch (Fig. 
2; Original data: characters 1–7, taxa 1–5; characters 1–7, taxa 6 and 7), one may consider treating states in taxa 1–5 and taxa 6 and 7 separately (Fig. 2; Block coding, right boxes), inserting autapomorphies for the corresponding taxa in the other block (Fig. 2; Block coding, left boxes). Autapomorphies are used here to carry out the function of inapplicables. No current computer program treats inapplicables appropriately, specifically ignoring the particular value (cf. Strong and Lipscomb 2000). There is no particular symbol for inapplicables; usually the ? is employed, and inapplicables are treated as missing data, with the undesirable consequences discussed above under Case Sensitive and Missing Data recoding strategies. Autapomorphies can force the program not to infer any relationships that are not based on actual data but will introduce additional steps. There are two blocks with autapomorphies only (characters 1 and 2, taxa 1–5, upper-left box; characters 3–9, taxa 6 and 7, lower-right box) and two blocks containing the sequence
information (characters 3–9, taxa 1–5; upper-right box; characters 1 and 2, taxa 6 and 7, lower-left box). The shared similarity statements of the sequence information in each block are unconnected to one another. No shared similarity statements are made between the T in taxa 6 and 7 with respect to any T (in position 2 or 4 of original data) in taxa 1–5. However, problems with character weighting arise because the information from the first seven characters is now coded in nine. One may argue that the autapomorphies in the respective blocks should be coded the same within each block but different from the information-bearing entries, i.e., as synapomorphies. Characters 1 and 2 for taxa 1–5 would all be coded 6, and characters 3–9 in taxa 6 and 7 would all be coded 7. Additional steps beyond the actual observation are also introduced, though fewer than when using autapomorphies. This would imply that the original sequence was nine positions long. Such coding would introduce unobserved synapomorphies and accentuate the between-block differences beyond the original observations, which is not supported here. Both stretch and block coding strategies have theoretical advantages and disadvantages, and both have some implicit assumptions. Block coding may seem preferable for the following reasons. When considering taxa 1–5 and taxa 6 and 7 separately, then the similarity statements are clear; only by combining the data might problems arise. Therefore, the information from the two blocks should be coded separately. As the other block does not contribute any information to the data found in the block under consideration, the “empty” blocks should be assigned uninformative autapomorphies as functional placeholders for inapplicables. Presence/Absence Coding. Block coding is somewhat similar to the established procedure to code gaps in a supplemental presence/absence (p/a) matrix (e.g., Baum et al. 1998) (Fig. 2; presence/absence). 
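Before turning to the details of p/a coding, the two new strategies described above can be summarized in a short sketch. Everything in it is an assumption made for illustration: the function names, the placeholder labels of the form aut_<taxon>, and the toy sequences, which are hypothetical and are not the data of Fig. 2. Rows are lists of state labels so that a state does not have to be a single letter.

```python
def stretch_code(matrix, start, end, dissimilar, new_state):
    """Give the dissimilar taxa one new state across columns start..end-1."""
    recoded = {}
    for taxon, seq in matrix.items():
        row = list(seq)
        if taxon in dissimilar:
            row[start:end] = [new_state] * (end - start)
        recoded[taxon] = row
    return recoded


def block_code(main_block, other_block):
    """Code two blocks in separate columns, filling the empty cells with
    taxon-specific placeholder states (autapomorphies)."""
    width_main = len(next(iter(main_block.values())))
    width_other = len(next(iter(other_block.values())))
    recoded = {}
    for taxon, seq in main_block.items():
        recoded[taxon] = [f"aut_{taxon}"] * width_other + list(seq)
    for taxon, seq in other_block.items():
        recoded[taxon] = list(seq) + [f"aut_{taxon}"] * width_main
    return recoded


# Hypothetical seven-column stretch: taxa t1-t5 align with one another,
# whereas t6 and t7 carry only a T and an A of uncertain position.
main = {"t1": "ACGTACG", "t2": "ACGTACC", "t3": "ACGAACG",
        "t4": "ACCTACG", "t5": "ATGTACG"}
stretch = stretch_code({**main, "t6": "-T---A-", "t7": "-T---A-"},
                       0, 7, {"t6", "t7"}, "6")
blocks = block_code(main, {"t6": "TA", "t7": "TA"})
print(stretch["t6"])
print(blocks["t1"])
print(blocks["t6"])
```

In this sketch the seven original columns become nine under block coding, and no state is shared between the two blocks, so no similarity statement is made across them. Under stretch coding the number of columns is unchanged and taxa t6 and t7 share the single new state.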
P/a coding adds nonadditive binary characters for the presence (1) or absence (0) of the gaps (Fig. 2; a–f). Block coding differs from p/a coding in two ways. (i) The characters with gaps are represented twice in the p/a strategy, namely, within the sequence and as the supplemental binary matrix. Block coding recodes the characters within the sequence portion. Issues with character weighting are reduced to a minimum with block coding. (ii) The gaps in the sequence part of the p/a matrix are treated as missing characters, while no missing character states are introduced with block coding. As any missing character state misrepresents an actual observation, any use of a ? as an indicator of uncertain assignment of observed character states (= character-state identity) to a specific character (= topographical identity) is misleading. P/a coding may also be seen as coding of insertion/deletion events (Simmons and Ochoterena 2000). Such a coding strategy presumes knowledge about evolutionary events that can be postulated only through phylogenetic
analysis. A pattern of character-state distribution is observed from which past evolutionary events (speciation, mutation, insertion, deletion) are inferred by means of a phylogenetic tree. Using a p/a matrix is similar to ordering character states a priori; both involve knowledge about the evolutionary process, which is impossible to have without a time machine.
Comparison to Practices in Morphology. The coding strategies (particularly block coding) are comparable to practices in morphology. With morphological character data, a state of a particular character that is highly dissimilar in a given taxon is coded as a separate state and is not forced into one of the existing states. Consider taxa 1–5 having state “red,” taxa 6–10 having state “blue,” and taxon 11 reflecting light only at 500 nm. Nobody would try to argue that the state of taxon 11 is either red or blue, because it is very different from either. The new state “green” is introduced for taxon 11. The principles discussed here (contraction, stretch coding, block coding) can be equally applied to problems of topographical identity with morphological data. Compare the legs of five centipede taxa to the legs of two insects. (For simplicity of the argument, I assume that all legs of the centipedes are similar, but the argument is essentially the same even with different leg types within and/or between centipede taxa, because a limited number of insect legs still has to be compared to many more centipede legs.) Under data contraction, all leg characters will be contracted into one column: legs absent, 0; legs present, 1. The character becomes uninformative. In stretch coding, all 100 centipede legs are given one character state (0), but it is recognized that all insect legs are different from all centipede legs but similar to other insect legs. All insect legs are given a character state not found in the centipedes: 1.
Comparing the 100 centipede leg pairs to the three insect leg pairs, the data matrix is filled in with “1” for the remaining 97 characters in the insects. The insects are recognized to be very different from the centipedes with respect to any leg condition. For block coding, the three insect legs are given one code (i), and the corresponding characters for the centipedes are filled in with autapomorphies (1–5). The 100 centipede legs are given one code (c) in a separate block, and the corresponding positions in the insects are filled in with autapomorphies (6, 7). The strict compatibility of coding methods between morphological and molecular data is demonstrated.
Despite the above similarities between block coding and coding of morphological data, one practical difference must be addressed. A single observation can be coded as an additional character state in morphology but not with molecular data. With sequence data the problematic part in the classification process is the establishment of topographical identity; therefore, a single position cannot harbor ambiguous alignment. The minimum number of characters required for ambiguous alignment
is two adjacent positions. If the positional information places a base in a particular position (= topographical identity), the identification of the character state (= character-state identity) is not an issue. For morphological characters, in contrast, the problematic part in establishing shared similarities is usually character-state identity. Hence, for any single set of observations classified in a character, any particular observation may not be classifiable in one of the other states. The observation that cannot be classified in an existing state is given a new state. The common denominator is that whenever one of the two conditions for postulating shared similarity (topographical identity or character-state identity) is not met, additional units (character states, positions) are introduced. This unique indication should not affect the relationships to the remainder of the taxa. In stretch coding it is achieved by using additional states to recode the highly dissimilar taxa, in block coding autapomorphies are introduced, and in morphology a new character state is created. In no case is the particular datum coded as missing, because otherwise the original observation can be contradicted during character-state optimization.
Contraction Versus Stretch/Block Coding. The three recoding methods discussed here are capable of showing patterns of shared similarity in an appropriate fashion based on the methodological justification criterion. All three can be applied to both examples shown. When should the reductive strategy of contraction and when should the recoding strategies of stretch and block coding be employed? I suggest that the coding strategy resulting in the greatest number of potentially informative character states, i.e., the number of character states without the autapomorphies-as-inapplicables, should be chosen.
It will result in the greatest number of shared character states that can potentially advance our understanding of character-state evolution and relationships among the terminal taxa. It can be metrically expressed as the largest number of minimum steps minus the number of autapomorphous states for that particular part of the data matrix. Topology-dependent metrics such as Goloboff’s (1991) decisiveness could be applied but are considered inappropriate, because the construction of a data matrix (observational phase) must be kept independent of the analytical or explanatory phase of tree searches (Brady 1994). The calculation of the number of synapomorphies proceeds as follows for the example shown in Fig. 2. If the stretch coding matrix were contracted, then taxa 1 and 2 would share state “1,” taxa 3 and 4 would share state “2,” taxon 5 would be an autapomorphous state, “3,” and taxa 6 and 7 would share state “4.” The number of synapomorphous states (1, 2, 4) to the exclusion of the autapomorphous state (3) is 3, whereas with stretch coding this number is 10. Stretch coding harbors over three times as much potential phylogenetic information.
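The count for the contracted column can be reproduced with a short sketch. It assumes that a "synapomorphous state" is a state carried by at least two taxa in a column, which is one reading of the criterion above; the function name and the taxon labels are hypothetical. The stretch-coded count of 10 depends on the Fig. 2 matrix, which is not reproduced here, so only the contracted case is checked.

```python
from collections import Counter


def shared_state_count(matrix):
    """Sum, over all columns, the number of states carried by at least two
    taxa. States found in a single taxon (autapomorphies) are excluded."""
    rows = list(matrix.values())
    total = 0
    for column in zip(*rows):  # iterate over characters, not taxa
        counts = Counter(column)
        total += sum(1 for n in counts.values() if n >= 2)
    return total


# Contracted coding of the Fig. 2 region as described in the text:
# taxa 1 and 2 share "1", taxa 3 and 4 share "2", taxon 5 has "3",
# taxa 6 and 7 share "4".
contracted = {"t1": "1", "t2": "1", "t3": "2", "t4": "2",
              "t5": "3", "t6": "4", "t7": "4"}

n_shared = shared_state_count(contracted)  # states 1, 2, and 4 qualify
```

The same function could be applied to a stretch-coded or block-coded matrix of the same region, and the coding with the largest count would be the one retained under the criterion suggested here.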
Conclusion
Most flexible coding strategies fail the test of conjunction or allow character-state optimizations in contradiction to the original observations. Data exclusion has the undesirable property of eliminating much of the information that can be used to infer relationships of the terminal taxa. Data contraction holds a middle ground by using a higher level of generality for a particular part of the sequence. For highly divergent taxa, either exclusion or contraction may eliminate too much of the information that can be used to infer relationships among a subset of taxa. The techniques described here, stretch coding and block coding, remedy this situation without failing the test of conjunction or allowing character-state optimizations that contradict the original data. Thus, the appropriate coding strategies permit meaningful character-state reconstructions in the ancestors and allow discovery of patterns in DNA sequence evolution that are isolated from arbitrary decisions during the alignment.
Acknowledgments. I would like to thank a number of people for engaging discussions on the above topic. The mention of their names does not signal their agreement with the above. They have been instrumental, however, in sharpening my arguments: Mark Dawson, Paul Flook, Verena and Dominik Brantschen-Geiger, Gonzalo Giribet, Guillermo Herrera, Robert Lavenberg, James McLean, Gavin Naylor, Mason Posner, Jan Pawlowski, Mark Siddall, Petra Sierwald, Joseph Slowinski, Ellen Strong, Christine Thacker, and Ward Wheeler. Brian Brown, Douglas Eernisse, Kirk Fitzhugh, Jacqueline Reich, Don Reynolds, and Christine Thacker read early versions of the manuscript and made valuable suggestions. Two anonymous reviewers helped to improve the manuscript further. This is W.M. Keck Program in Molecular Systematics Contribution No. 4.
References
Audi R (1998) Epistemology. A contemporary introduction to the theory of knowledge. Routledge, London
Baum DA, Small RL, Wendel JF (1998) Biogeography and floral evolution of Baobabs (Adansonia, Bombacaceae) as inferred from multiple data sets. Syst Biol 47:181–207
Brady RH (1994) Pattern description, process explanation, and the history of morphological sciences. In: Grande L, Rieppel O (eds) Interpreting the hierarchy of nature. Academic Press, San Diego, pp 7–31
Brower AVZ, Schawaroch V (1996) Three steps of homology assessment. Cladistics 12:265–272
Cerchio S, Tucker P (1998) Influence of alignment on the mtDNA phylogeny of Cetacea: Questionable support for the Mysticeti/Physeteroidea clade. Syst Biol 47:336–344
de Pinna MCC (1991) Concepts and tests of homology in the cladistic paradigm. Cladistics 7:367–394
Fitzhugh K (1999) The inferential basis of homology. In: XVIIIth Meeting of the Willi Hennig Society Abstracts, pp 20–21
Gatesy J, DeSalle R, Wheeler W (1993) Alignment-ambiguous nucleotide sites and the exclusion of systematic data. Mol Phylogenet Evol 2:152–157
Geiger DL (2000) Distribution and biogeography of the Haliotidae (Gastropoda: Vetigastropoda) world-wide. Boll Malacol 35:57–120
Geiger DL, Fitzhugh K, Thacker CE (2001) Timeless characters: A response to Vermeij (1999). Paleobiology 27:179–180
Goloboff PA (1991) Homoplasy and the choice among cladograms. Cladistics 7:215–232
Hall BK (ed) (1994) Homology: The hierarchical basis of comparative biology. Academic Press, San Diego
Hawkins JA, Hughes CE, Scotland RW (1997) Primary homology assessment, characters and character states. Cladistics 13:275–283
Higgins DG, Sharp PM (1988) CLUSTAL: A package for performing multiple sequence alignment on a microcomputer. Gene 73:237–244
Lee Y-H, Vacquier VD (1995) Evolution and systematics in Haliotidae (Mollusca, Gastropoda): inference from DNA sequences of sperm lysin. Mar Biol 124:267–278
Lutzoni F, Wagner P, Reeb V, Zoller S (2000) Integrating ambiguously aligned regions of DNA sequences in phylogenetic analyses without violating positional homology. Syst Biol 49:628–651
Nelson GJ, Platnick N (1981) Systematics and biogeography: Cladistics and vicariance. Columbia University Press, New York
Owen R (1847) Report on the archetype and homologies in the vertebrate skeleton. Meet Br Assoc Adv Sci Rep 16:169–340
Owen R (1849) On the nature of limbs. John van Voorst, London
Panchen AL (1994) Richard Owen and the concept of homology. In: Hall BK (ed) Homology: The hierarchical basis of comparative biology. Academic Press, San Diego
Remane A (1952) Die Grundlagen des natürlichen Systems, der vergleichenden Anatomie und der Phylogenetik. Theoretische Morphologie und Systematik I. Akademische Verlagsgesellschaft, Leipzig
Sanderson MJ, Hufford L (1996) Homoplasy: The recurrence of similarity in evolution. Academic Press, San Diego
Simmons MP, Ochoterena H (2000) Gaps as characters in sequence-based phylogenetic analysis. Syst Biol 49:369–381
Strong EE, Lipscomb D (2000) Character coding and inapplicable data. Cladistics 15:363–371
Swofford DL (1990) Phylogenetic analysis using parsimony. Illinois Natural History Survey, Champaign
Wägele JW (1994) On the information content of characters in comparative morphology and molecular systematics. J Zool Syst Evol Res 33:42–47
Waterman MS, Joyce J, Eggert M (1991) Computer alignment of sequences. In: Miyamoto MM, Cracraft J (eds) Phylogenetic analysis of DNA sequences. Oxford University Press, Oxford, pp 59–72
Wheeler QD (1986) Character weighting and cladistic analysis. Syst Zool 35:102–109
Wheeler WC (1993) The triangle inequality and character analysis. Mol Biol Evol 10:707–712
Wheeler WC (1995) Sequence alignment, parameter sensitivity and the phylogenetic analysis of molecular data. Syst Biol 44:321–331
Wheeler WC (1996) Optimization alignment: The end of multiple sequence alignment in phylogenetics? Cladistics 12:1–9
Wheeler WC (1999) Fixed character states and the optimization of molecular sequence data. Cladistics 15:379–385
Wheeler WC, Gladstein DS (1994) MALIGN: a multiple sequence alignment program. J Hered 85:417–418
Wheeler WC, Gatesy J, DeSalle R (1995) Elision: A method for accommodating multiple molecular sequence alignments with alignment-ambiguous sites. Mol Phylogenet Evol 4:1–9
Whiting MF, Carpenter JC, Wheeler QD, Wheeler WC (1997) The Strepsiptera problem: phylogeny of the holometabolous insect orders inferred from 18S and 28S ribosomal DNA sequences and morphology. Syst Biol 46:1–68
Wiens JJ (1995) Polymorphic characters in phylogenetic systematics. Syst Biol 44:482–500