
DNA+Pro: an Improved Progressive Multiple Sequence Alignment Algorithm for Evolutionary Analysis Using Combined DNA-Protein Sequences

Xiaolong Wang1*, Shuang-yong Xu2, Deming Gou3

1. Department of Biotechnology, Ocean University of China, Qingdao, 266003, P. R. China
2. New England Biolabs, Inc., Ipswich, Massachusetts, USA
3. College of Life Science, Shenzhen University, Shenzhen, 518060, Guangdong, P. R. China

E-mail: [email protected]

Nature Precedings : hdl:10101/npre.2010.4898.1 : Posted 14 Sep 2010


ABSTRACT

Alignment of DNA and protein sequences is a basic tool in the study of evolutionary, structural and functional relationships among macromolecules. Present sequence alignment methods are error-prone and often produce systematic bias. Errors in sequence alignments can lead to misinterpretation of evolutionary, structural and functional information in genes, proteins and genomes. In traditional sequence alignment algorithms, DNA and protein sequences are aligned separately. It has long been believed that the phylogenetic signal disappears more rapidly from DNA sequences than from the encoded proteins, so it is generally considered preferable to align sequences at the amino acid level. Here we present a new method, DNA+Pro, which aggregates DNA and protein sequences into combined DNA-protein sequences and aligns them in a combined fashion. We demonstrate that combining sequences improves the quality of multiple sequence alignment and helps solve practical evolutionary problems in primate immunodeficiency virus proteins and bacterial restriction enzymes. In addition to a higher theoretical information content, distance estimates are more biologically significant in combined alignments than in protein-only or DNA-only alignments. By integrating information buried separately in DNA and protein sequences, DNA+Pro improves the accuracy of multiple sequence alignment of closely related proteins and prevents certain errors that may occur in phylogenetic analysis using protein-only approaches.

INTRODUCTION

Bioinformatics, as well as functional and comparative genomics, seeks to uncover the functional and structural sequence changes that lead to genetic differences between species. The goal of these comparative studies is to provide an accurate reconstruction of the evolutionary histories of related genes, proteins and genomes. However, all evolutionary, structural or functional studies that rely on sequence analyses require accurate sequence alignments, i.e., the correct identification of homologous nucleotides or amino acids and the accurate positioning of gaps indicating insertions and deletions. For example, progressive multiple sequence alignment algorithms, such as ClustalW (1), build a multiple alignment from pairwise alignments between sequences, performed in order of decreasing relatedness according to a guide tree. Although quite a few newer sequence alignment algorithms are available, such as MUSCLE (2-3), MAFFT (4-5) and T-coffee (6-7), multiple sequence alignment is still error-prone and subject to misinterpretation. Different sequence alignment tools sometimes lead to drastically different alignments and phylogenetic trees for the same set of sequence data, and can support entirely different mechanisms driving evolutionary, structural and functional changes in sequences.

Here we present an improved sequence alignment algorithm, DNA+Pro, which improves the quality of progressive sequence alignment and prevents certain errors in phylogenetic analysis by combining DNA and protein sequences into aggregated DNA-protein sequences and aligning them in a combined fashion. We demonstrate its utility in multiple examples.


MATERIALS AND METHODS

Proteins, coding DNA sequences and online resources

The human immunodeficiency virus (HIV) and simian immunodeficiency virus (SIV) strains included in Fig. 1A, 1B, 2A, 2B, S1A-S1F and S2A-S2F were derived from the seed alignment of Pfam family pf00516. The complete coding DNA sequences (CDS) of their envelope glycoprotein gp120 (Env) and gag polyprotein (Gag) were retrieved from GenBank. The coding sequences of gag for HV1W1, HV1BN, HV1ZH and HV2D2 are not available in the public nucleotide databases, so they were not included. The HIV strains included in Fig. S3 were selected from reference 12. The coding DNA sequences and amino acid sequences of BamHI-like isoschizomers (DNA recognition and cleavage site G/GATCC) included in Fig. 4 were derived from REBASE and GenBank.

Combined DNA-protein (CDP) scoring matrix

A CDP scoring matrix, such as the CDP-Gon250 scoring matrix (Table S1), is a 24 x 24 array derived by merging a nucleotide substitution matrix with an amino acid substitution matrix, such as the Gonnet250, Blosum62 or PAM250 scoring matrix. In a CDP matrix, substitutions between any pair of amino acids, or any pair of DNA bases, are allowed, but high penalties are given to "substitutions" between a DNA base and an amino acid, which prevents a DNA base from being matched to an amino acid during the alignment process.

As shown in Table S1, the first line of a CDP matrix is the symbol set: lowercase letters (a, c, g, u/t) stand for DNA/RNA bases, and uppercase letters (A/B, C/X, D, E, F, G/Z, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y) stand for amino acids. The numbers that follow form the 24 x 24 scoring matrix. Because ClustalW does not support a case-sensitive alphabet, DNA+PRO avoids conflicts between the codes for DNA bases and amino acids by temporarily changing the amino acid codes in combined DNA-protein sequences: A is replaced with B, C with X, and G with Z. When the output combined alignment is resolved, DNA+PRO changes the amino acid codes back to the ordinary codes and writes them in uppercase, with the DNA bases in lowercase.
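The merging step described above can be illustrated with a minimal Python sketch. It is not the DNA+PRO implementation (which is written in VBA), and the scores are toy values rather than the published CDP-Gon250 entries. The function names, the cross penalty of -100, and the choice to write thymine as U in the case-insensitive alphabet (so that it cannot collide with threonine, T) are all assumptions made for illustration.

```python
# Hypothetical sketch: merge a nucleotide and an amino acid scoring function
# into one 24 x 24 combined DNA-protein (CDP) matrix.
# All scores are toy values, not the published CDP-Gon250 matrix.

BASES = "ACGU"  # assumption: thymine written as U to avoid clashing with threonine (T)
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
RECODE = {"A": "B", "C": "X", "G": "Z"}  # temporary codes described in the text


def recode_amino_acid(aa):
    """Return the temporary code for an amino acid (A->B, C->X, G->Z)."""
    return RECODE.get(aa, aa)


def build_cdp_matrix(nuc_score, aa_score, cross_penalty=-100):
    """Build the symbol list and a dict {(s, t): score} over all 24 symbols."""
    aa_symbols = [recode_amino_acid(a) for a in AMINO_ACIDS]
    original = dict(zip(aa_symbols, AMINO_ACIDS))  # temporary code -> ordinary code
    symbols = list(BASES) + aa_symbols
    matrix = {}
    for s in symbols:
        for t in symbols:
            if s in BASES and t in BASES:
                matrix[s, t] = nuc_score(s, t)          # base vs base
            elif s in original and t in original:
                matrix[s, t] = aa_score(original[s], original[t])  # aa vs aa
            else:
                matrix[s, t] = cross_penalty            # base vs aa: prohibited
    return symbols, matrix


def toy_nuc(a, b):
    """Toy nucleotide scores (assumption)."""
    return 5 if a == b else -4


def toy_aa(a, b):
    """Toy amino acid scores (assumption); a real run would use Gonnet250."""
    return 4 if a == b else -1


symbols, cdp = build_cdp_matrix(toy_nuc, toy_aa)
```

In practice the two scoring functions would be replaced by lookups into real substitution matrices; only the block structure (two diagonal blocks plus a strongly negative off-diagonal block) is the point of the sketch.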

Combined DNA-protein sequence alignment

Using an in-house program, DNA+PRO, the coding DNA sequences were translated using the standard translation table SGC0 and converted into combined DNA-protein sequences, in which every triplet codon is immediately followed by the one-letter code of the corresponding amino acid. DNA+PRO is coded in Microsoft Visual Basic for Applications (VBA) and integrated with Microsoft Excel. The combined DNA-protein sequences were then subjected to progressive alignment by calling CLUSTAL W (v. 2.0.12) with a set of user-defined settings and a combined DNA-protein (CDP) scoring matrix, CDP-Gon250 (Table S1). The optimal ClustalW parameters for aligning the combined sequences are listed in Table S2. When the alignment is finished, the output combined alignment is fed back to DNA+PRO for final refinement: gap-only columns are removed, and gaps produced within a triplet codon are moved to the right of the codon to keep the combined alignment in 4-letter groups. Finally, the combined DNA-protein alignment is visualized in Microsoft Excel in combined, protein, and DNA views (Fig. 1B). In these alignment views, every triplet codon and its encoded amino acid are shown in a distinct color, so that both synonymous and nonsynonymous mutations can be easily identified. The DNA+Pro software and the supplementary data can be downloaded free of charge from our website, www.dnapluspro.com.
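The conversion of a coding sequence into a combined DNA-protein sequence, and the recovery of the DNA and protein views from it, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the VBA code of DNA+PRO: the function names are hypothetical, the standard genetic code is used, stop codons are written as "*", and the temporary A/B, C/X, G/Z recoding used for ClustalW is left out.

```python
# Hypothetical sketch: build a combined DNA-protein sequence in which every
# codon (lowercase) is followed by its amino acid (uppercase), then split
# a combined sequence back into its DNA and protein views.

_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order at each position.
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


def to_combined(cds):
    """Convert an ungapped coding sequence into 4-letter codon+residue groups."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    groups = []
    for i in range(0, len(cds), 3):
        codon = cds[i:i + 3]
        groups.append(codon.lower() + CODON_TABLE[codon])
    return "".join(groups)


def split_views(combined):
    """Return (DNA view, protein view) of a combined sequence or alignment row."""
    dna = "".join(combined[i:i + 3] for i in range(0, len(combined), 4))
    protein = combined[3::4]  # every fourth character is the residue (or a gap)
    return dna, protein


combined = to_combined("ATGGCTTGTGGA")
dna_view, protein_view = split_views(combined)
```

Because gaps are kept in 4-letter groups after refinement, `split_views` also works on an aligned row such as `"atgM----gctA"`, where the gapped group contributes `---` to the DNA view and `-` to the protein view.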

Protein-only sequence alignment

For protein-only sequence alignment, the multiple sequence alignment programs used were CLUSTAL W v. 2.0.12, MAFFT v. 5.861, MUSCLE v. 3.6, T-COFFEE v. 3.93 and PRANK. All programs were run with default settings. As shown in Fig. 1A, DNA+PRO reverse-translates a protein-only alignment into a combined DNA-protein alignment, so that protein-only alignments can be compared with combined alignments in a combined view.
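Reverse translation of a protein-only alignment into a combined view amounts to threading the original codons onto the gapped protein row. The Python sketch below shows one way this could be done; the function name is hypothetical, and it assumes that the coding sequence is in frame and already corresponds residue by residue to the aligned protein (no check that each codon actually encodes its residue is made).

```python
# Hypothetical sketch: thread the codons of an ungapped coding sequence onto
# an aligned (gapped) protein row, producing a combined DNA-protein row.

def thread_codons(aligned_protein, cds):
    """Return the combined row for one aligned protein and its coding sequence."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3].lower() for i in range(0, len(cds), 3)]
    n_residues = sum(ch != "-" for ch in aligned_protein)
    if n_residues > len(codons):
        raise ValueError("fewer codons than aligned residues")
    groups = []
    pos = 0  # index of the next unused codon
    for ch in aligned_protein:
        if ch == "-":
            groups.append("----")  # a protein gap becomes a 4-letter gap group
        else:
            groups.append(codons[pos] + ch.upper())
            pos += 1
    return "".join(groups)


row = thread_codons("M-AC", "ATGGCTTGT")
```

Applying the same function to every row of a protein-only alignment yields rows of equal length, which can then be compared column by column with an alignment produced directly from combined sequences.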

Phylogeny analysis

Phylogenetic trees were constructed from protein-only alignments, combined alignments, or combined alignments reverse-translated from protein-only alignments. Phylogenetic analyses were conducted in MEGA4 (28). The evolutionary history was inferred using the Neighbor-Joining method (29). The optimal tree with the sum of branch length is shown. The percentages of replicate trees in which the associated taxa clustered together in the bootstrap test (1000 replicates) are shown next to the branches (30). The evolutionary distances were computed using the Poisson correction method (31) and are in units of the number of amino acid substitutions per site. All positions containing gaps and missing data were eliminated from the dataset (complete deletion option).

For protein-only alignments, the distance between two aligned protein sequences ($P_{AA}$) is defined as the proportion of amino acid sites that are different:

$$P_{AA} = D_{AA}/L_{AA} \tag{1}$$

where $D_{AA}$ is the number of amino acid sites that differ between the two aligned protein sequences and $L_{AA}$ is the number of valid common amino acid sites compared.

When the protein-only alignment is reverse-translated into a DNA alignment, the distance between the two aligned DNA sequences ($P_{Base}$) is defined as the proportion of DNA bases that are different:

$$L_{Base} = 3L_{AA} \tag{2}$$

$$P_{Base} = D_{Base}/L_{Base} = D_{Base}/3L_{AA} \tag{3}$$

where $D_{Base}$ is the number of DNA base sites that differ between the two aligned DNA sequences and $L_{Base}$ is the number of valid common DNA bases compared.

In a combined alignment, whether aligned by DNA+Pro or reverse-translated from a protein-only alignment, the distance between two combined DNA-protein sequences ($P$) is defined as the proportion of DNA base and amino acid sites that are different:

$$L = L_{AA} + L_{Base} = 4L_{AA} \tag{4}$$

$$P = D/L = (D_{AA} + D_{Base})/4L_{AA} \tag{5}$$

where $L_{AA}$ and $L_{Base}$ are, respectively, the numbers of common amino acid sites and DNA base sites compared, and $D_{AA}$ and $D_{Base}$ are, respectively, the numbers of amino acid sites and DNA base sites that differ between the two aligned combined DNA-protein sequences.

For a given number of base substitutions, $d_{Base}$, in a coding DNA sequence of a given length, $l_{Base} = 3l_{AA}$ and $p_{Base} = d_{Base}/l_{Base} = d_{Base}/3l_{AA}$. $P$ reaches its minimum when all of the base substitutions are synonymous, i.e., there is no amino acid substitution:

$$d_{AA,min} = 0 \tag{6}$$

$$p_{AA,min} = 0 \tag{7}$$

$$p_{min} = d_{Base}/4l_{AA} = \tfrac{3}{4}\,p_{Base} \tag{8}$$

On the other hand, $P$ reaches its maximum when all of the base substitutions are non-synonymous and each amino acid substitution is caused by only one DNA base substitution:

$$d_{AA,max} = d_{Base} \tag{9}$$

$$p_{AA,max} = d_{Base}/l_{AA} = 3\,p_{Base} \tag{10}$$

$$p_{max} = 2d_{Base}/4l_{AA} = \tfrac{3}{2}\,p_{Base} \tag{11}$$

In general, base substitutions may be either synonymous or non-synonymous, and each amino acid substitution is caused by one, two or three base substitutions. Therefore,

$$0 \le P_{AA} \le 3P_{Base} \tag{12}$$

$$\tfrac{3}{4}\,P_{Base} \le P \le \tfrac{3}{2}\,P_{Base} \tag{13}$$

According to equation (12), distances calculated from a protein-only alignment may seriously underestimate the distance between two coding DNA sequences when the rate of synonymous mutations is high. Conversely, the distance between two coding DNA sequences may seriously underestimate the distance between their encoded protein sequences when the rate of non-synonymous mutations is high. Equation (13), however, suggests that distances calculated from combined alignments give better estimates of the distances for both the coding DNA and the encoded protein sequences, because they depend not only on the rate of base substitution but also on the rate of amino acid substitution.
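The three distances and the bounds in equations (12) and (13) can be checked numerically with a short Python sketch. The function name and the two toy sequences are hypothetical; the sketch assumes two rows of a combined alignment arranged in 4-letter groups (three bases followed by one residue) and skips any group containing a gap, in the spirit of the complete deletion option.

```python
# Hypothetical sketch: compute P_AA, P_Base and P (equations 1, 3 and 5)
# for two rows of a combined alignment stored as 4-letter groups.

def combined_distances(seq1, seq2):
    """Return (P_AA, P_Base, P) for two aligned combined DNA-protein rows."""
    if len(seq1) != len(seq2) or len(seq1) % 4 != 0:
        raise ValueError("rows must have equal length, a multiple of 4")
    l_aa = d_aa = d_base = 0
    for i in range(0, len(seq1), 4):
        g1, g2 = seq1[i:i + 4], seq2[i:i + 4]
        if "-" in g1 or "-" in g2:
            continue  # skip gapped groups (complete deletion)
        l_aa += 1
        d_aa += g1[3] != g2[3]                                  # residue differs
        d_base += sum(a != b for a, b in zip(g1[:3], g2[:3]))   # bases differ
    if l_aa == 0:
        raise ValueError("no common ungapped sites")
    p_aa = d_aa / l_aa
    p_base = d_base / (3 * l_aa)
    p = (d_aa + d_base) / (4 * l_aa)
    return p_aa, p_base, p


# Toy rows: one synonymous change (gct -> gcc, both Ala) and one
# non-synonymous change (tgt Cys -> tgg Trp).
p_aa, p_base, p = combined_distances("atgMgctAtgtCggaG", "atgMgccAtggWggaG")
```

For these toy rows, $L_{AA} = 4$, $D_{AA} = 1$ and $D_{Base} = 2$, so $P_{AA} = 1/4$, $P_{Base} = 2/12$ and $P = 3/16$, which lies between $\tfrac{3}{4}P_{Base} = 1/8$ and $\tfrac{3}{2}P_{Base} = 1/4$ as equation (13) requires.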

RESULTS

Combined DNA-protein sequence improves multiple sequence alignment

As shown in Fig. 1A, the traditional protein-only alignment produced by ClustalW, together with its combined and DNA views reverse-translated by DNA+Pro, suggests that part of the variable (V2) region has a high rate of amino acid and DNA base substitutions. ClustalW appears to have incorrectly squeezed distinct but "homologous" sequences together between highly conserved sequence blocks (Fig. S1A). Using phylogeny-guided heuristic iteration (2-3) or consistency-based methods (4-7), the alignments given by MAFFT (Fig. S1B), MUSCLE (Fig. S1C) and T-coffee (Fig. S1D) were improved to some degree, but the problem of mismatched insertions remains. Using a phylogeny-aware algorithm (8-9), PRANK provides a different alignment of this region (Fig. S1E). The PRANK algorithm keeps insertions "flagged" to ensure that independent insertions are not matched (9). As shown in Fig. S1E, PRANK identified distinct insertions correctly, but produced too many gaps and ignored true homologies among insertions. The PRANK algorithm therefore provides only a partial solution to the erroneous treatment of insertions in progressive alignment.

This type of misalignment can be avoided by using the DNA+Pro algorithm, which combines DNA and protein sequences into combined DNA-protein sequences and aligns them in a combined alignment. The principle and implementation of the DNA+Pro algorithm are described in the Materials and Methods section. Proteins and their complete coding DNA sequences (CDS) were converted into combined DNA-protein sequences, in which every triplet codon is immediately followed by the one-letter code of its encoded amino acid. The combined sequences were then aligned with the progressive alignment algorithm by calling CLUSTAL W with a combined DNA-protein (CDP) scoring matrix, such as the CDP-Gon250 matrix (Table S1), and a set of user-defined settings (Table S2). A combined alignment of HIV and SIV gp120 DNA and protein sequences is shown in Fig. 1B; the combined alignment can be visualized in combined, protein-only or DNA-only view. As shown in Fig. 1B and Fig. S1F, DNA+Pro not only identified the distinct insertions and showed the homologies among them in the variable region, but also provided a more accurate alignment in the conserved region.

Combined alignment improves phylogenetic analyses of HIV and SIV proteins

Traditional sequence alignment algorithms perform heuristic pairwise alignments at the branching points of a guide phylogenetic tree that approximates the evolutionary history of the sequences (1-7). Unfortunately, an input guide tree is rarely consistent with the output phylogenetic tree constructed from the multiple sequence alignment, and the output tree is often inconsistent with the real evolutionary history. Consequently, both the output phylogenetic tree and the alignment are error-prone. As shown in Fig. S2, the phylogenetic trees of the env gene inferred from the protein-only alignments by CLUSTAL W (Fig. S2A), MAFFT (Fig. S2B), T-coffee (Fig. S2C), MUSCLE (Fig. S2D) and PRANK (Fig. S2E) vary significantly, and the tree inferred from the combined alignment suggests a drastically different evolutionary process (Fig. S2F). To compare and evaluate the accuracy of these alignment algorithms and phylogenetic trees, we built sequence alignments and performed phylogenetic analyses for different HIV genes, such as env and gag, using protein-only and combined alignments.

It has been suggested that different regions of an HIV or SIV genome may have different evolutionary histories. By systematically examining trees from all possible combinations of four SIVs in four genomic regions, it was inferred that the chimpanzee SIV (SIVcpz) is mosaic: its left-hand region (gag and pol) comes from a red-capped mangabey virus, and its right-hand region (env) is the ancestor of a virus found in several Cercopithecus monkeys (10). The mosaic structure of SIVcpz requires that a host was infected with two different monkey viruses and that these recombined. It was speculated that the recombination may have occurred in a dually infected monkey, with the mosaic virus then transmitted to a chimpanzee, but it is more probable that the dual infection occurred in a chimpanzee, since chimpanzees hunt and eat small monkeys (10).

Since HIV originated from SIVcpz, it is interesting to know whether this kind of dual infection and recombination has also happened in humans. Bootscan analysis, which breaks the genome into small sections and analyzes each section independently, has been used to identify areas of recombination within an HIV-1 genome (11-12). However, the apparent phylogenetic incongruence between different regions of the genome that was taken as evidence of recombination has been shown not to be statistically significant (12). A likely explanation for the differences in evolutionary rates across the genome is that different regions of the genome are under different selective pressures (12).

We constructed phylogenetic trees for the env and gag genes from protein-only and combined alignments. As shown in Fig. 2A, the protein-only trees for env and gag differ, implying that some of the HIV genomes, such as HV1J3, HV1B1, HB1A2, HV2BE and HV2G1, are somewhat mosaic. However, the combined trees (Fig. 2B) show a nearly consistent evolutionary process for both env and gag, suggesting that different regions of these HIV genomes have undergone similar evolutionary histories since they diverged from SIVcpz, and therefore that dual infection and recombination have never or only rarely happened in these HIV strains. We believe that these combined trees are more reliable than the protein-only trees, because they are not only more consistent across different genes but also have generally higher bootstrap values (Fig. 2B). They also have a stronger and more biologically meaningful theoretical basis, as described in the Materials and Methods section.

By testing the combined method on many different HIV and SIV strains (data not shown), we found that combined trees for pol, gag and env are generally more consistent than protein-only trees, implying a systematic inconsistency in protein-only trees and alignments. Sometimes combined alignments suggest different evolutionary processes for different HIV genes, consistent with the protein-only trees but with generally better bootstrap values (Fig. S3A, S3B). Artifacts of traditional protein-only or DNA-only alignment algorithms may cause a "mosaic structure" to be misinterpreted in closely related genomes, and the inappropriate attribution of recombinant origins to divergent sequences obscures the true evolutionary properties of these viruses (12). DNA+Pro provides a more accurate analysis tool that can prevent some, if not all, inconsistencies and errors of this type.

Combined alignment improves interpretation of evolutionary events

With the aid of the phylogenetic tree inferred from the combined alignment and the 64-color views of the combined alignment (Fig. 1B), one can readily interpret the mutation events that happened during the evolution of this region. For example, GGNSSNGNGDSSK, EKGNISPKNNTSNNTS and NNSTKDNIKNDNST were identified as three insertions in HV1ZH, HV1RH and HV1J3, respectively. According to the phylogenetic tree, HV1ZH is the ancestor of HV1RH and HV1J3. From the combined view of the alignment, we can speculate on the mutation events by which HV1ZH evolved. In HV1RH, apart from some base substitutions and a one-residue insertion (K) on the left, the right part is replaced by a repetitive sequence (PKNNTSNNTS). In HV1J3, apart from some base substitutions and a two-residue deletion (SP) in the middle, one of the two tandem repeats (NNST) shifted from the right to the left. Moreover, as shown in the combined and DNA views of the combined alignment, the coding sequences of the variable region became increasingly AT-rich and repetitive during evolution.

Comparing the combined view with the protein view and the DNA view suggests that insertions and deletions (ins/del) in this variable region might be caused by slipped strand mispairing: after the virus has inserted the two copies of its RNA genome packaged in the virion into the host's cell, the viral reverse transcriptase, encoded by the pol gene, reverse transcribes the RNA to DNA. As the polymerase progresses, it hops from one copy of the genome to the other (10), so ins/del can easily occur in the variable regions of HIV genomes. Considering the recombination feature of HIV reverse transcriptase and the slipped strand mispairing property of short tandem repeats, this mechanism is more convincing than the mutation events suggested by other alignments, such as a large number of amino acid substitutions (Fig. S1A-S1D) or distinct insertions at the same position (Fig. S1E). As shown in the combined alignments of env and gag (supplementary supporting material), this kind of ins/del happened repeatedly in the other HIV strains and in other variable regions. Therefore, slipped strand mispairing might be the major cause of the highly frequent indels in the variable regions of HIV genomes.

Combined alignment improves phylogenetic analyses of bacterial restriction enzymes

The OkrAI-DNA complex structure has been solved recently and shows that OkrAI is a minimal version of BamHI, since OkrAI carries three small deletions in comparison with the BamHI structure (1, 3A, and 7 of BamHI helical region are missing in OkrAI) (24). We are interested in the phylogeny of BamHI isoschizomers (BamHI, OkrAI, Bsp98I and DdsI) and BamHI-like putative restriction enzymes found in the sequenced microbial genomes. Fig. 3A and 3B show the phylogenetic trees constructed for the 7 BamHI isoschizomers based on protein-only and combined DNA+protein alignments. Both methods identified DdsI as the most closely related isoschizomer (pairwise alignment indicates 61% amino acid sequence similarity and 38% amino acid sequence identity). However, the two trees differ somewhat for the other six enzymes. The tree derived from the protein-only alignment indicated that BsuBSP1ORFAP is more closely related to BamHI (BsuBSP1ORFAP: 42% similarity and 25% identity to BamHI in amino acid sequence), but the tree derived from the DNA+protein alignment indicated that it is more similar to HauORF2756P. The combined tree might be more trustworthy, because it has better bootstrap values.

Restriction-modification (R-M) systems are subject to rapid evolution and are sometimes found associated with mobile genetic elements such as transposons and plasmids. Closely related isoschizomers found in diverse bacterial species suggest that R-M systems can be acquired by horizontal transfer. More consistent alignment of restriction enzymes and methylases, and construction of phylogenetic trees from these alignments, will help improve understanding of the evolutionary relationships among the few thousand R-M systems. By testing the combined method on many different restriction enzymes, as well as other functional proteins, we found that combined trees are generally more consistent than protein-only trees with the robust multi-gene phylogenetic tree for the corresponding bacterial strains (data not shown), indicating that combined alignments are more accurate than protein-only alignments and thus better suited to phylogenetic analysis.

Information content of DNA, protein and combined sequences

Theoretical information content is measured by Shannon entropy (25). For DNA and protein sequences, the information content is determined by the frequencies with which bases or amino acids occur in a sequence (26):

$$S_{DNA} = -\sum_{i=1}^{4} P_i \log_2 P_i$$

$$S_{Pro} = -\sum_{j=1}^{20} P_j \log_2 P_j$$

The information contents of DNA and protein sequences reach their upper limits when bases and amino acids occur in equal frequencies:

$$\max(S_{DNA}) = -4 \times \left(\tfrac{1}{4} \log_2 \tfrac{1}{4}\right) = -\log_2 \tfrac{1}{4} = 2$$

$$\max(S_{Pro}) = -20 \times \left(\tfrac{1}{20} \log_2 \tfrac{1}{20}\right) = -\log_2 \tfrac{1}{20} = 4.3219$$

The maximum information content of protein sequences is larger than that of DNA sequences. Protein sequences therefore seem to be more informative than DNA sequences and to perform better in sequence alignment. This impression is misleading: a triplet codon can take 64 - 3 = 61 states (the 64 codons less the 3 stop codons), whereas an amino acid can take only 20. Hence, a coding DNA sequence actually carries more information than its encoded protein sequence. The problem is how to fully utilize the information buried in DNA sequences.

DNA+Pro provides a new way to exploit the information buried separately in DNA and protein sequences. In a combined DNA-protein sequence, an amino acid is uniquely determined by its encoding triplet codon, so the entropy depends solely on the frequencies of the triplet codons and is independent of the frequencies of the amino acids. Let $P_k$, $P_l$, $P_m$ be the frequencies of the three bases of a triplet codon. The theoretical information content of a combined DNA-protein sequence is given by

$$S_{DNA+Pro} = -\sum_{k,l,m=1}^{4} P_k P_l P_m \log_2 (P_k P_l P_m)$$

The maximum information content of combined sequences is reached when the DNA bases occur in equal frequencies:

$$\max(S_{DNA+Pro}) = -64 \times \left(\tfrac{1}{64} \log_2 \tfrac{1}{64}\right) = -\log_2 \tfrac{1}{64} = 6$$

Therefore, the maximum information content of combined DNA-protein sequences is larger than that of protein sequences. DNA+Pro computes information contents for DNA, protein and combined sequences. The computed information contents of combined sequences are always greater than those of the corresponding protein sequences, and this forms the theoretical basis of the combined alignments.
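The entropy formulas above can be reproduced with a few lines of Python. The sketch below is illustrative only and uses hypothetical function names; it computes entropies in bits (base-2 logarithm) and evaluates the combined-sequence formula literally, as a sum over all 64 products of base frequencies.

```python
# Hypothetical sketch: Shannon entropy (in bits) of DNA, protein and
# combined DNA-protein sequences, following the formulas in the text.
import math
from collections import Counter
from itertools import product


def frequencies(symbols):
    """Return the relative frequency of each distinct symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return [n / total for n in counts.values()]


def entropy_bits(probs):
    """Shannon entropy -sum(p * log2(p)), ignoring zero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def combined_entropy_bits(base_probs):
    """Entropy over triplets, each with probability P_k * P_l * P_m."""
    triplets = [pk * pl * pm for pk, pl, pm in product(base_probs, repeat=3)]
    return entropy_bits(triplets)


max_dna = entropy_bits([1 / 4] * 4)                 # equal base frequencies
max_pro = entropy_bits([1 / 20] * 20)               # equal amino acid frequencies
max_combined = combined_entropy_bits([1 / 4] * 4)   # equal base frequencies
observed_dna = entropy_bits(frequencies("acgtacgt"))
```

With equal frequencies, the three maxima come out at 2, about 4.3219 and 6 bits, matching the values derived above. An entropy based on observed codon frequencies, rather than products of base frequencies, would be an alternative reading of the text and is not shown here.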

DISCUSSION

Problems in DNA-only or protein-only alignments

The small size of the alphabet makes alignment of nucleotide sequences inherently difficult: even a pair of completely unrelated DNA sequences will typically display ~25% identity over their entire length, and it is often possible to find extended local alignments in which >50% of the aligned nucleotides are identical. This makes it difficult to distinguish true homology from random similarity. The simple fact that proteins are built from 20 amino acids while DNA contains only four bases means that the "signal-to-noise ratio" in protein sequence alignments is better than in DNA sequence alignments (15). Besides this theoretical information-content advantage, protein sequence alignments also benefit from amino acid substitution matrices, such as the PAM (16), BLOSUM (17) and Gonnet (18) series. These matrices contain empirically derived scores for each possible amino acid pair and provide a rational basis for aligning amino acids. In addition, it has long been believed that the phylogenetic signal disappears much more rapidly from DNA sequences than from the encoded proteins, owing to the generally higher rates of synonymous mutations than nonsynonymous ones (19-20). It is therefore generally considered preferable to align protein-coding sequences at the amino acid level. For example, some programs, such as RevTrans (15), construct multiple DNA alignments by translating protein-coding DNA sequences, aligning the resulting peptide sequences, and building a multiple DNA sequence alignment by reverse translating the aligned protein sequences.

However, some important information carried by DNA sequences, such as synonymous mutations and frame-shift mutations, is lost when they are translated into protein sequences. An interesting coding DNA sequence alignment algorithm, COMBAT (21-22), has been reported to combine DNA and protein sequence alignments. However, the algorithm has so far not been used for multiple sequence alignment or phylogenetic analysis, but only for pairwise sequence alignment, database searching and comparative genome studies (21-23). Using combined DNA-protein sequences and CDP scoring matrices, DNA+Pro allows multiple DNA and protein sequences to be aligned in a single alignment and overcomes problems that commonly exist in DNA-only or protein-only alignments.

Problems in multiple sequence alignment algorithms

Conventional progressive or consistency-based algorithms (1-7) always match distinct neighboring insertions and non-insertions in the same column if they show significant homology. This problem has been attributed to the repeated penalizing of gap opening (12), but it cannot be avoided by reducing the gap-opening penalties in traditional protein-only alignment, because small gap-opening penalties result in "gappy" alignments. Recently, Löytynoja and Goldman (9) introduced a phylogeny-aware approach (PRANK) that "flags" the gaps made in previous alignments, so that distinct insertion events are kept separate even when they occur at exactly the same position. The PRANK method, however, produces too many unnecessary gaps and ignores possible true homologies among insertions (Fig. S1E). The authors argued that inserted characters are not descendants of any other insertions or ancestral characters, and should not be aligned with anything (9). However, if an inserted sequence is homologous to another sequence, it is possible that it is the ancestor of the other sequence, so their homology should not be ignored. Proper handling of these insertions affects downstream phylogenetic analysis as well as structural and functional studies.

In conventional protein-only alignments, it is almost impossible to distinguish an insertion from another distinct insertion or a non-insertion if they are "homologous" to each other at the amino acid level, because no information is available to discriminate between them. With combined DNA-protein sequences, this problem becomes much easier to solve. In a combined DNA-protein sequence, every information unit consists of three bases and one amino acid. In a combined alignment, triplet codons followed by their encoded amino acids show detailed evolutionary events, such as synonymous and non-synonymous base substitutions, and in DNA+Pro these mutations are clearly shown by the 64-color views. In a highly conserved region, the rate of synonymous mutations is much higher than that of nonsynonymous ones. In a variable region, however, the rate of non-synonymous mutations is higher than in a conserved region. These correlated DNA bases and amino acids, together with the 64-color views, make it much easier to distinguish base substitutions from insertions or deletions, and therefore easier to distinguish an insertion from another distinct insertion or a non-insertion, because of the enlarged alphabet and the resulting increase in information content. Although sequences may appear "homologous" to each other in a protein-only alignment (Fig. 1A, top), they show significant differences in the reverse-translated combined view (Fig. 1A, middle) or DNA view (Fig. 1A, bottom). By combining the information buried separately in DNA and protein sequences, DNA+Pro makes it possible to distinguish base substitutions from insertions, allows homologous sites to be aligned more accurately (Fig. 1B), and overcomes problems that commonly exist in conventional DNA-only or protein-only alignment while avoiding unnecessary gaps.


Problems in molecular phylogeny analysis

A traditional molecular phylogenetic tree is constructed from either a DNA-only or a protein-only alignment. Sequence distances, however, are underestimated in both cases. If a protein-only alignment is used to construct a phylogenetic tree, synonymous mutations are ignored because they do not contribute to amino acid substitutions, so the mutational events that happened in the DNA sequences are underestimated. If a DNA alignment reverse-translated from a protein alignment is used instead, both synonymous and non-synonymous mutations are counted, but amino acid substitution events are ignored. Although every amino acid substitution has at least one non-synonymous base substitution counted in the DNA sequences, non-synonymous mutations are not distinguished from synonymous ones, so the biological effect of non-synonymous mutations is underestimated. In contrast, when combined alignments are used to construct a phylogenetic tree, both synonymous and non-synonymous base substitutions are counted, and they are also discriminated by whether or not the amino acid sequence changes. In addition to more accurate alignment of sequences, more biologically significant estimates of sequence distances form a stronger basis for phylogenetic analysis.
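The distinction drawn above, between codon changes that alter the residue and those that do not, can be read directly off two rows of a combined alignment. The Python sketch below is a hypothetical illustration with an assumed function name; it classifies each 4-letter group as identical, synonymous (codon differs, residue unchanged), nonsynonymous (residue differs) or gapped.

```python
# Hypothetical sketch: classify aligned 4-letter groups (codon + residue)
# in two rows of a combined DNA-protein alignment.

def classify_groups(seq1, seq2):
    """Count identical, synonymous, nonsynonymous and gapped groups."""
    if len(seq1) != len(seq2) or len(seq1) % 4 != 0:
        raise ValueError("rows must have equal length, a multiple of 4")
    counts = {"identical": 0, "synonymous": 0, "nonsynonymous": 0, "gapped": 0}
    for i in range(0, len(seq1), 4):
        g1, g2 = seq1[i:i + 4], seq2[i:i + 4]
        if "-" in g1 or "-" in g2:
            counts["gapped"] += 1
        elif g1 == g2:
            counts["identical"] += 1
        elif g1[3] == g2[3]:
            counts["synonymous"] += 1      # bases differ, same amino acid
        else:
            counts["nonsynonymous"] += 1   # amino acid differs
    return counts


counts = classify_groups("atgMgctAtgtC----", "atgMgccAtggWggaG")
```

A protein-only comparison of these toy rows would see only the Cys/Trp difference, and a DNA-only comparison would see two base differences without knowing which one changes the protein; the combined rows retain both pieces of information.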

CONCLUSION

Multiple sequence alignment is of crucial importance in subsequent genomic analyses, such as phylogeny inference, structure modeling and detection of positive selection (27). Our analysis shows that errors in traditional protein-only or DNA-only alignments may lead to inconsistencies and errors in evolutionary and comparative studies of closely related proteins. It is not that the progressive algorithm itself is defective. Rather, correct alignment requires that the information buried separately in DNA and protein sequences be fully exploited in a combined manner. DNA+Pro is useful in cases where multiple sequence alignment of closely related proteins and/or their coding DNA sequences forms the basis for further investigations, such as phylogenetic and evolutionary analyses. This combined alignment method gives a more accurate picture of the mechanisms of protein evolution, and it may also be useful in structural and functional analyses of proteins.

Acknowledgements

This research was supported by the National Science Foundation of China through grant 81072567. SYX is employed at New England Biolabs, Inc. (NEB). Thus, although NEB did not provide direct financial support, it can be considered an indirect funder of this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The authors wish to thank Prof. Zhenmin Bao, Prof. Quanqi Zhang and Prof. Jingjie Hu of the Department of Biotechnology, Ocean University of China, for their help and encouraging discussions during this work.

Author Contributions

XW conceived, designed and programmed the software, analyzed the data and wrote the paper. SYX provided conceptual advice and contributed to the multiple sequence alignment and phylogenetic analyses of the restriction enzymes. SYX and DG contributed to manuscript editing, discussed the results, suggested improvements, and commented on and approved the manuscript.

Figure Legends

Fig. 1. Comparison of combined (DNA+protein) and protein only sequence alignments of the variable (V2) region of the HIV gp120 (Env) protein. (A) A protein only alignment produced by ClustalW is reverse translated and visualized in DNA+Pro in protein view (top), combined view (middle) and DNA view (bottom). (B) A combined sequence alignment produced by DNA+Pro is visualized in combined view (top), protein view (middle) and DNA view (bottom). DNA and protein sequences are written in lowercase and uppercase letters, respectively. In the combined sequences, every triplet codon is immediately followed by the one-letter code for its encoded amino acid. The phylogenetic trees shown on the left are inferred from the corresponding alignments. Three representative inserted sequences are shown in blue boxes, and their evolutionary relationships are shown by blue arrows.

Fig. 2. Combined (DNA+protein) alignments suggest a more consistent evolutionary process for different HIV genes than protein only alignments. (A) The phylogenetic trees for the env and gag genes constructed from protein only alignments. (B) The phylogenetic trees for env and gag inferred from combined DNA+protein alignments. A subgroup that is more consistent in the combined trees than in the protein only trees, and has better bootstrap values, is shown by blue boxes.

Fig. 3. Comparison of phylogenetic trees for restriction enzyme BamHI homologs (isoschizomers OkrAI, Bsp98I, DdsI, and putative BamHI-like restriction enzymes) constructed from protein only and combined (DNA+Protein) alignments. (A) ClustalW; (B) DNA+Pro. BsuBSP1ORFAP (red boxes) is shown to be more closely related to BamHI in the protein only tree, but more similar to HauORF2756P in the combined tree, with higher bootstrap values.



REFERENCES

1. Thompson J.D., Higgins, D.G. and Gibson, T.J. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673-4680.
2. Katoh K, Kuma K, Toh H, Miyata T. 2005. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33(2): 511-518.
3. Katoh K, Misawa K, Kuma K, Miyata T. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30(14): 3059-3066.
4. Edgar RC. 2004. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics. 5: 113.
5. Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5): 1792-1797.
6. Poirot O, O'Toole E, Notredame C. 2003. Tcoffee@igs: A web server for computing, evaluating and combining multiple sequence alignments. Nucleic Acids Res. 31(13): 3503-6.
7. Notredame C, Higgins DG, Heringa J. 2000. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol. 302(1): 205-17.
8. Löytynoja A, Goldman N. 2005. An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci USA. 102(30): 10557-10562.
9. Löytynoja A, Goldman N. 2008. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science. 320(5883): 1632-5.
10. Bailes E, Gao F, Bibollet-Ruche F, Courgnaud V, Peeters M, Marx PA, Hahn BH, Sharp PM. 2003. Hybrid origin of SIV in chimpanzees. Science. 300(5626): 1713.
11. Salminen, M. O., J. K. Carr, D. S. Burke, and F. E. McCutchan. 1995. Identification of breakpoints in intergenotypic recombinants of HIV-1 by bootscanning. AIDS Res. Hum. Retrovir. 11: 1423-1425.
12. Anderson JP, Rodrigo AG, Learn GH, Madan A, Delahunty C, Coon M, Girard M, Osmanov S, Hood L, Mullins JI. 2000. Testing the hypothesis of a recombinant origin of human immunodeficiency virus type 1 subtype E. J Virol. 74(22): 10752-65.
13. Corvaglia AR, François P, Hernandez D, Perron K, Linder P, Schrenzel J. 2010. A type III-like restriction endonuclease functions as a major barrier to horizontal gene transfer in clinical Staphylococcus aureus strains. Proc Natl Acad Sci U S A. 107(26): 11954-11958. Epub 2010 Jun 14.
14. Snyder EE, Kampanya N, Lu J, et al. 2007. PATRIC: The VBI PathoSystems Resource Integration Center. Nucleic Acids Res. 35 (Database issue) 401-406.
15. Wernersson R, Pedersen AG. 2003. RevTrans: Multiple alignment of coding DNA from aligned amino acid sequences. Nucleic Acids Res. 31(13): 3537-9.
16. Henikoff S. and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915-10919.
17. Styczynski MP, Jensen KL, Rigoutsos I, Stephanopoulos G. 2008. BLOSUM62 miscalculations improve search performance. Nat. Biotech. 26(3): 274-275.
18. Gonnet GH, Cohen MA, Brenner SA. 1992. Exhaustive matching of the entire protein sequence database. Science. 256: 1443-1445.
19. Yang Z. and Nielsen, R. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol. Biol. Evol., 17, 32-43.
20. Yang Z. and Nielsen, R. 1998. Synonymous and nonsynonymous rate variation in nuclear genes of mammals. J. Mol. Evol., 46, 409-418.
21. Hein J. 1994. An algorithm combining DNA and protein alignment. J. Theor. Biol., 167, 169-174.
22. Hein J. and Støvlbæk, J. 1996. Combined DNA and protein alignment. Methods Enzymol., 266, 402-418.
23. Bing Sun, Jacob T. Schwartz, Ofer H. Gill and Bud Mishra. 2006. COMBAT: Search Rapidly for Highly Similar Protein-Coding Sequences Using Bipartite Graph Matching. Lecture Notes in Computer Science, 3992/2006, 654-661. DOI: 10.1007/11758525_89
24. Vanamee ES, Viadiu H, Chan SH, Ummat A, Hartline AM, Xu SY, Aggarwal AK. 2010. Asymmetric DNA recognition by the OkrAI endonuclease, an isoschizomer of BamHI. Nucleic Acids Res. 2010 Sep 9. [Epub ahead of print].
25. Shannon, Claude E. 1951. Prediction and entropy of printed English. The Bell System Technical Journal, 30: 50-64.
26. Weiss O, Jiménez-Montaño MA, Herzel H. 2000. Information content of protein sequences. J Theor Biol. 206(3): 379-386.
27. Notredame C. 2007. Recent Evolutions of Multiple Sequence Alignment Algorithms. PLoS Comput Biol. 3(8): e123.
28. Tamura K, Dudley J, Nei M & Kumar S. 2007. MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Molecular Biology and Evolution 24: 1596-1599.
29. Saitou N, Nei M. 1987. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4: 406-425.
30. Felsenstein J. 1985. Confidence limits on phylogenies: An approach using the bootstrap. Evolution 39: 783-791.
31. Zuckerkandl E & Pauling L. 1965. Evolutionary divergence and convergence in proteins, pp. 97-166 in Evolving Genes and Proteins, edited by V. Bryson and H.J. Vogel. Academic Press, New York.


[Fig. 1, panels 1A and 1B: alignment views and trees for HV1J3, HV1OY, HV1B1, HV1C4, HV1A2, HV1RH, HV1EL, HV1ND, HV1Z84, HV1MA, HV1ZH and SIVCZ. Figure image not reproduced.]

[Fig. 2, panels 2A (Protein only) and 2B (DNA+Pro): env and gag trees for HIV-1, HIV-2 and SIV isolates. Figure image not reproduced.]

[Fig. 3, panels 3A (Protein only) and 3B (DNA+Pro): trees of BamHI homologs OkrAI, HchORF2488P, Bsp98I, Csp7822ORF584P, HauORF2756P, BsuBSP1ORFAP, DdsI and BamHI. Figure image not reproduced.]
