Theories on PHYlogenetic ReconstructioN (PHYRN)

Gaurav Bhardwaj (1,2), Zhenhai Zhang (1,3), Yoojin Hong (1,4), Kyung Dae Ko (1,2), Gue Su Chang (3), Evan J. Smith (1,2), Lindsay A. Kline (1,2), D. Nicholas Hartranft (1,2), Edward C. Holmes (1,2), Randen L. Patterson (1,2), and Damian B. van Rossum (1,2).
(1) Center for Computational Proteomics, The Pennsylvania State University, USA
(2) Department of Biology, The Pennsylvania State University, USA
(3) Department of Biochemistry and Molecular Biology, The Pennsylvania State University, USA
(4) Department of Computer Science and Engineering, The Pennsylvania State University, USA

Address correspondence to:

Randen L. Patterson, 230 Life Science Bldg, University Park, PA 16802. Tel: 001-814-865-1668; Fax: 001-814-863-1357; E-mail: [email protected].
Damian B. van Rossum, 518 Wartik Laboratory, University Park, PA 16802. Tel: 001-814-863-1007; Fax: 001-814-863-1357; E-mail: [email protected].

Abstract
The inability to resolve deep node relationships of highly divergent/rapidly evolving protein families is a major factor that stymies evolutionary studies. In this manuscript, we propose a Multiple Sequence Alignment (MSA)-independent method to infer evolutionary relationships. We previously demonstrated that phylogenetic profiles built using position specific scoring matrices (PSSMs) are capable of constructing informative evolutionary histories (1;2). Here, we theorize that PSSMs derived specifically from the query sequences used to construct the phylogenetic tree will improve this method for the study of rapidly evolving proteins. To test this theory, we performed phylogenetic analyses of a benchmark protein superfamily (reverse transcriptases (RT)) as well as simulated datasets. When we compare the results obtained from our method, PHYlogenetic ReconstructioN (PHYRN), with those of MSA-dependent methods, we observe that PHYRN provides a 4- to 100-fold increase in accurate measurements at deep nodes. As phylogenetic profiles, rather than an MSA, are used as the information source, we propose PHYRN as a paradigm shift in studying evolution when MSA approaches fail. Perhaps most importantly, due to the improvements in our computational approach and the availability of vast amounts of sequencing data, PHYRN is scalable to thousands of sequences. Taken together with PHYRN’s adaptability to any protein family, this method can serve as a tool for resolving ambiguities in evolutionary studies of rapidly evolving/highly divergent protein families.

Introduction
Phylogenetic profiles have been suggested by us and others as a unified framework for measuring structural, functional, and evolutionary characteristics of proteins and protein families (1;3-6). Proteins within a phylogenetic profile can be defined in an N (query) by M (PSSM) matrix. Under this paradigm, a protein is defined as a vector in which each entry quantifies the alignment of a query sequence with a PSSM (3;7). In the case of evolutionary measurements, we previously demonstrated that phylogenetic profiles built in this manner can be used to construct phylogenetic trees using Euclidean distance measurements (1;2). Our previous study used reverse transcriptases (RT) as a benchmark dataset, and demonstrated that phylogenetic profiles perform well even at extreme levels of divergence (i.e. the “twilight zone of sequence similarity”) (2). This work highlighted how pre-existing PSSMs, obtained from the Conserved Domain Database (CDD (8)), could be utilized to construct an informative M-dimension. When we generated trees with the entire CDD we obtained a tree with perfect monophyly. Despite the perfect monophyly, the statistical support at most deep nodes was lacking. Interestingly, when we analyzed the alignments from the phylogenetic profiles, we determined that the most frequently occurring PSSM alignments were the 16 RT domain-containing profiles present in CDD. When trees were constructed using only these 16 RT PSSMs for our M-dimension, we still observed significant monophyly that is well above random (Supplemental Figure 3 in (2)). Based on these results, it is reasonable to consider that expanding only the informative profiles within our knowledge base will improve the robustness of phylogenetic profile-based measurements, in addition to improving computational performance. In this manuscript, we present data supporting this supposition.
Further, we present a pipeline for enriching and amplifying informative PSSMs as well as an algorithmic improvement that drastically reduces computational expense. As evidence for these theories we analyzed biological (RTs) and ROSE-simulated (9) datasets. When compared with other ab-initio multiple-sequence alignment (MSA) methods, PHYRN reliably recapitulates true evolutionary history in simulated datasets, and provides deep-node measurements with robust statistical support.

Results
PHYRN Pipeline
The algorithm begins by compiling a set of protein queries belonging to the same protein family/superfamily (Figure 1). We use CDD (8) and other approaches to define conserved domains present in members of this superfamily (e.g. the RT domain in reverse transcriptases). From this subset of knowledgebase PSSMs we utilize pairwise comparisons to define boundaries of homology. These homologous protein fragments are then utilized to construct a database/library of query-based PSSMs using PSI-BLAST (6 iterations, e-value threshold = 10^-6) (10). In this manner, a query set of 100 sequences can make a library of at least 100 PSSMs. We then use rpsBLAST (8) to obtain pairwise alignments between full-length queries and the query-specific PSSM library. This alignment information (% identity, % coverage) is then encoded into a phylogenetic profile matrix. Next, we calculate the Euclidean distance between each pair of queries (2). The results from these calculations can then be plotted as a phylogenetic tree using a variety of tree-building algorithms (e.g. Neighbor-Joining (11), Maximum Likelihood (12), Minimum Evolution (13), etc.; see Methods for a complete description of PHYRN). In the following sections we highlight the methods for creating informative PSSM libraries and the scoring schemes utilized.

Enriching and Amplifying Informative PSSMs
Although CDD provides a comprehensive resource for conserved domains, in all cases the number of PSSMs for any given domain is relatively small (<100).
Figure 1: PHYRN Concept and Work Flow- PHYRN begins by (i-ii) defining and extracting the domain-specific region among the query sequences. (iii) Domain-specific regions are then used to create a PSSM library using PSI-BLAST. (iv-v) Positive alignments are then calculated between queries and the PSSM library using rpsBLAST, and encoded as a PHYRN product score (%identity X %coverage) matrix. (vi) The product score matrix is converted to a Euclidean distance matrix by calculating the Euclidean distance between each query pair. (vii) Phylogenetic trees can then be graphed using Neighbor-Joining (NJ) or Minimum Evolution (ME) as available in MEGA.

We have previously demonstrated that the quality of phylogenetic profile measurements is proportional to the size and variety of the domain-specific PSSM library (2;3;7). This increase in information content is exemplified in Figure 2. When the silkworm Non-LTR AAA92147.1 is analyzed with NCBI CDD, only one RT-specific PSSM alignment is returned (Figure 2a). When the same query is analyzed using PHYRN and the 16 RT PSSMs from CDD, we observe 3 overlapping alignments from 3 different PSSMs (~19% of the library, Figure 2b). These overlapping alignments define the boundary from which to generate PSSMs using PSI-BLAST. This same approach was used for 100 RT query sequences; post-expansion, we obtain 102 RT-specific PSSMs. When we reanalyzed the silkworm sequence with the amplified library, ~56% of the PSSM library returned a result within the homologous region.

Scoring Scheme
In order to encode alignment information between queries and our PSSM libraries, we utilize a product score (%identity X %coverage) during our PHYRN analysis. Equation sets for %identity and %coverage are as defined in (2). The algebraic derivation in Figure 3a demonstrates that our product score is equivalent to (1 - p-distance) X gap weight. In data not shown, we determined that the gap weight is a negligible variable, and when removed does not alter our results. Figure 3b depicts the distribution of % gaps, % identity, and % coverage of alignments between 100 RT queries and 102 RT-specific PSSMs. Overall, 92% of alignments have less than 25% identity, and the average percentage identity between the 100 RT sequences is 21.8% (± 5.6% s.d.). Within a smaller subset of 88 RT sequences, 3644 (95.2%) of the 3828 possible pairs have less than 25% sequence identity. As a whole, RT sequences reside in the “twilight zone” of sequence similarity, which is why deducing evolutionary relationships within the RT family is extremely challenging. Furthermore, % gap and % coverage measurements have wide variances.
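The product score and its rearranged form can be checked numerically. The following is a minimal sketch, not the authors' implementation: the `Alignment` class, the function names, and the example numbers are hypothetical, and both percentages are expressed as fractions in [0, 1] for simplicity. The field names follow the definitions given in Figure 3a.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """One query-vs-PSSM alignment (hypothetical container; fields follow Figure 3a)."""
    ids: int    # number of identical residues
    aqlen: int  # aligned query length excluding gaps
    gaps: int   # number of gaps in the query-sided alignment
    plen: int   # sequence length of the PSSM


def product_score(a: Alignment) -> float:
    """%identity x %coverage, each expressed as a fraction."""
    alen = a.aqlen + a.gaps           # alignment length including gaps
    pct_identity = a.ids / alen
    pct_coverage = a.aqlen / a.plen
    return pct_identity * pct_coverage


def one_minus_p_times_w(a: Alignment) -> float:
    """Rearranged form: (1 - p) x w."""
    one_minus_p = a.ids / a.plen                # identical sites over the domain
    gap_weight = a.aqlen / (a.aqlen + a.gaps)   # w
    return one_minus_p * gap_weight


# Made-up numbers, chosen only to illustrate the identity.
example = Alignment(ids=42, aqlen=180, gaps=20, plen=200)
print(product_score(example), one_minus_p_times_w(example))
```

Both functions return the same value up to floating-point rounding, because the two expressions differ only in how the same four quantities are grouped.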

Phylogenetic Trees generated using PHYRN

Phylogenetic Reconstruction of the RT superfamily
Despite the low identity and high variance in % gaps and coverage, when a phylogenetic tree is constructed from a phylogenetic profile comprised of 100 RT query sequences and 102 RT-specific PSSMs using Neighbor-Joining (11) (see Methods), we obtain a robust monophyletic tree with deep statistical support (bootstrap and jackknife, Figure 3c). In all aspects, this tree is superior to the tree constructed pre-expansion of the RT-specific PSSMs (Fig 3 in Chang et al (2)). Specifically, the Hepadnaviruses now form a clear monophyletic clade, and the Mt Plasmids now reside in the prokaryotic group as expected.

Figure 2: Enrichment and Amplification of Signal Source PSSMs- (a) Family/Superfamily specific PSSMs can be identified using NCBI Conserved Domain Database. (b) rpsBLAST can then be used to identify overlapping alignments between individual queries and family/superfamily specific CDD PSSMs. Overlapping alignments are then used to define the domain-specific region as described in Methods. (c) Domain-specific regions as identified are then used to generate PSSMs using PSI-BLAST and the NCBI non-redundant (nr) database. Panel annotations: rps-BLAST (default settings); 16 RT PSSMs from CDD; expansion via PSI-BLAST (6 iterations, e = 10^-6) to n = 102 RT PSSMs; library hits for AAA92147: pre-expansion 3/16 = 18.75%, post-expansion 57/102 = 55.88%.

As previously mentioned, phylogenetic profile measurements improve with larger datasets (1;3;7). Therefore, we wondered whether our resolution could be increased by the inclusion of additional sequences across multiple taxa, thereby increasing the size of the PSSM library. We collected 716 full-length RT-containing sequences from the literature (14-20) and PSI-BLAST-aided searches of the NCBI non-redundant database. These sequences were subsequently included in our RT-specific PSSM library as previously described. Figure 4 depicts a linearized phylogenetic tree of 716 full-length retroelements measured using 846 PSSMs generated from the RT domain and rooted with retrointrons. The pairwise distances among them were acquired based on Euclidean distance measurement in the 716 × 846 data matrix, and an unrooted phylogenetic tree was derived from the 716 × 716 distance matrix using a neighbor-joining method. The tree is drawn to scale, with branch lengths in the same units as those of the Euclidean distances calculated from the data matrix. Even at this size and level of divergence (~17% identity between groups), the PHYRN tree has robust monophyly. Within these monophyletic nodes, there are multiple subclades which are evident from this analysis. For example, we observe numerous subgroups of TY3 retroelements, and proper subclade groupings of retroviral RTs. Recently, Simon et al reported that there is a plethora of uncharacterized bacterial RTs (19). Our analysis is congruent with this proposal as we also observe novel clades of bacterial origin. While these results are promising, the evolutionary history is unknown and therefore we cannot fully evaluate the performance of PHYRN using this dataset. Further, during the course of these experiments, we determined that although effective, the pipeline as described is computationally expensive for large datasets. To overcome these limitations, we made the following changes to the pipeline.
Specifically, we compiled all PSSMs into a single database that can be used by standard rps-BLAST (8). This drastically reduced the number of operations and enabled us to use a sliding window of e-value thresholds to recover positive alignments. In data not shown, these changes afforded us a >600-fold increase in computational speed as well as improved speciation (see Methods for complete details).

Figure 3a, derivation of the product score:

\[
\begin{aligned}
\text{score} &= \%i \times \%c \qquad (\%i:\ \text{percent identity},\ \%c:\ \text{percent coverage}) \\
&= \frac{ids}{alen} \times \frac{aqlen}{plen} \\
&= \frac{ids}{plen} \times \frac{aqlen}{aqlen + gaps} \qquad (\text{since } alen = aqlen + gaps) \\
&= (\text{proportion of amino acid sites identical over a domain}) \times (\text{gap weight}) \\
&= (1 - \text{proportion of amino acid sites different over a domain}) \times (\text{gap weight}) \\
&= (1 - p) \times w
\end{aligned}
\]

where
ids: the number of identical residues
alen: the alignment length including gaps
aqlen: the alignment length in the query excluding gaps (= query_to − query_from + 1)
plen: the sequence length of a PSSM
gaps: the number of gaps in the query-sided alignment
w = aqlen / (aqlen + gaps): the gap weight based on the number of gaps in the query-sided alignment

[Figure 3b: distributions of % gap, % cov, and % Id. Figure 3c: tree panel, labeled "100 × 102 distance matrix".]

Figure 3: Phylogenetic profile based measurements of evolutionary distance- (a) Algebraic derivation of p-distance from the PHYRN product scoring scheme (%identity X %coverage). (b) Distribution of 100 retroelements by measurements of % identity, % coverage, and % gap. (c) Unrooted phylogenetic tree of 100 full-length retroelements measured using 102 PSSMs generated from the RT domain. The pairwise distances among them were acquired based on Euclidean distance measurement in the 100 X 102 data matrix, and an unrooted phylogenetic tree was derived from the 100 X 102 distance matrix using a minimum evolution method. The tree is drawn to scale, with branch lengths in the same units as those of the Euclidean distances calculated from the data matrix. Bootstrap and jackknife (80% fraction of samples) values were obtained from 1,000 replicates and are reported as percentages.
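The bootstrap and jackknife support values in Figure 3c come from resampling the M-dimension (the PSSM columns) of the data matrix. A minimal sketch of that resampling step follows. It is an illustration under assumptions rather than the PHYLIP-based procedure the authors used: the function names are hypothetical, and the jackknife is assumed here to draw columns without replacement.

```python
import random


def bootstrap_columns(n_cols, rng):
    """Draw n_cols column indices with replacement (a column may repeat)."""
    return [rng.randrange(n_cols) for _ in range(n_cols)]


def jackknife_columns(n_cols, fraction, rng):
    """Draw a fraction of the column indices without replacement (assumption)."""
    k = int(round(fraction * n_cols))
    return sorted(rng.sample(range(n_cols), k))


def resample(matrix, cols):
    """Build the resampled N x len(cols) matrix from the chosen columns."""
    return [[row[j] for j in cols] for row in matrix]


rng = random.Random(0)                       # fixed seed for repeatability
toy = [[0.1, 0.2, 0.3, 0.4, 0.5],            # toy 2 x 5 product-score matrix
       [0.5, 0.4, 0.3, 0.2, 0.1]]
boot = resample(toy, bootstrap_columns(5, rng))
jack = resample(toy, jackknife_columns(5, 0.8, rng))
print(len(boot[0]), len(jack[0]))
```

Each resampled matrix would then be converted to a distance matrix and a tree, and the replicate trees summarized by a majority-rule consensus.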

Phylogenetic Reconstruction of Rose Simulations
Rose (Random Model of Sequence Evolution - Version 1.3; http://bibiserv.techfak.uni-bielefeld.de/rose/) implements a probabilistic model for protein sequence evolution (9). In this simulation, sequences are created from a common ancestor to produce a dataset of known size, divergence, and history. In this artificial evolutionary process, the accurate history is recorded since the multiple sequence alignment is created simultaneously. This gives us full control over evolutionary rates and lets us test the efficacy of our approach against a known answer.

Figure 5 provides the results from PHYRN and MUSCLE (21) using 67 sequences simulated for 17% identity by ROSE (see Methods for full description of PSSM generation). Whereas MUSCLE performs poorly on this dataset (Figure 5b), PHYRN recaptures 93% of the true evolutionary history and has only one deep node incorrectly identified (Figure 5a). In a second simulation, we maintained a similar level of divergence while increasing the size of the simulated dataset to 584 sequences. In this simulation, MUSCLE performance decreases significantly (no deep nodes are correctly obtained, Figure 6a), while PHYRN performance is still robust (Figure 6b). PHYRN recaptures 76% of the true evolutionary history at the deepest 64 nodes. Taken together, these results demonstrate the power of PHYRN for deriving deep evolutionary information.
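To make the notion of a recapitulation rate concrete, the sketch below counts how many clades of a known tree reappear in an inferred tree. This is a simplified stand-in, not the consensus-tree procedure used in this study: trees are assumed to be rooted and written as nested tuples of leaf names, and the function names are hypothetical.

```python
def clades(tree):
    """Collect the leaf set under every internal node of a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):            # a leaf is a sequence name
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def recapitulation_rate(true_tree, inferred_tree):
    """Fraction of non-root clades of the true tree found in the inferred tree."""
    true_clades = clades(true_tree)
    all_leaves = max(true_clades, key=len)   # the root clade holds every leaf
    informative = {c for c in true_clades if c != all_leaves}
    if not informative:
        raise ValueError("true tree has no informative clades")
    recovered = informative & clades(inferred_tree)
    return len(recovered) / len(informative)


true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred = ((("A", "C"), "B"), ("D", "E"))   # A and C wrongly paired
print(recapitulation_rate(true_tree, inferred))
```

In this toy case the true tree has three informative clades, {A,B}, {A,B,C}, and {D,E}, and the inferred tree recovers the last two.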

Figure 4: Towards comprehensive phylogenies.- Linearized phylogenetic tree of 716 full-length retroelements measured using 846 PSSMs generated from the RT domain and rooted with retrointrons. The pairwise distances among them were acquired based on Euclidean distance measurement in the 716 X 846 data matrix, and an unrooted phylogenetic tree was derived from the 716 X 716 distance matrix using a neighbor-joining (NJ) method. The tree is drawn to scale, with branch lengths in the same units as those of the Euclidean distances calculated from the data matrix.
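The conversion from an N X M data matrix to an N X N distance matrix described in this caption can be sketched as follows. This is a minimal illustration with a made-up toy matrix, not the code used to produce the figure; the function name is hypothetical.

```python
import math


def euclidean_distance_matrix(profile):
    """Return the N x N matrix of Euclidean distances between rows of an N x M matrix."""
    n = len(profile)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(profile[i], profile[j])))
            dist[i][j] = dist[j][i] = d      # the matrix is symmetric
    return dist


# Toy 3 x 2 product-score matrix (3 queries, 2 PSSMs); values are invented.
profile = [
    [0.3, 0.0],
    [0.0, 0.4],
    [0.3, 0.0],
]
for row in euclidean_distance_matrix(profile):
    print(row)
```

The resulting square, symmetric matrix is the input a distance-based tree builder such as neighbor-joining expects; queries with identical profiles, like the first and third rows here, end up at distance zero.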

Figure 5: PHYRN recapitulates ‘true evolutionary history’ better than MUSCLE in simulated protein families- Consensus tree between the original ROSE tree and the tree generated using (a) PHYRN and (b) MUSCLE. Simulated protein family generated using ROSE, with an average distance of 550 (p distance ~0.83). Red circles mark the branch points (nodes) that are not recapitulated correctly. (no. of query sequences = 67)

Figure 6: Deep Node Recapitulation of ‘true evolutionary history’ in mega-phylogenies- Consensus tree between the original ROSE tree and the tree generated using (a) MUSCLE and (b) PHYRN. Simulated protein family generated using ROSE, with an average distance of 550 (p distance ~0.82). Green circles mark the deep nodes that are recapitulated correctly in the consensus trees. (no. of query sequences = 584)

Discussion
Our case study of the RT superfamily and simulated datasets demonstrates that PHYRN is capable of inferring deep evolutionary relationships between highly divergent proteins. A number of implications can be derived from this study: (i) phylogenies built with PHYRN recapture more of the true evolutionary history and have robust statistical support; (ii) phylogenies built on pairwise alignments outperform conventional MSA methods; and (iii) this method is scalable to thousands of sequences. This improved performance is due to the improved information content contained in the PSSM libraries used in this study. We improved the efficacy of our PSSMs by: (i) limiting the PSSMs to homologous domains, (ii) optimizing the PSI-BLAST settings for their generation, and (iii) creating a pipeline that is sufficiently fast to handle large datasets. With MSA-dependent methods, increasing the number of query sequences makes it increasingly difficult to obtain an optimal multiple sequence alignment (22); in PHYRN, by contrast, increasing the number of query sequences also increases the dimensionality of the phylogenetic profile, thus increasing the alignment information space. This increase in information space leads to better, more robust measurements of relative rates. This ‘comprehensive survey’ approach, where more sequences are better, is in contrast to the ‘random walk’ approach of MSA-dependent methods, where additional sequences are a problem and trees are limited to discrete taxa. Further, the use of frequency tables in the phylogenetic profiles provides more informative measurements for calculating relative rates of evolution. This approach gives PHYRN the potential to generate trees with thousands of sequences, where the only theoretical limit is the available sequencing data. Indeed, when we expanded the RT tree from 100 to 716 sequences comprising >14 groups, we obtained a tree that is consistent with, yet of higher resolution than, previously reported RT studies (2;15-17).
As PHYRN is well suited to making measurements on large divergent datasets, we hypothesize that this approach may be capable of solving a number of unanswered questions related to the ancient origins of life and speciation. Moreover, since PHYRN functions in the twilight zone of sequence similarity, this algorithm may be able to inform whether functionally or structurally similar proteins share a common ancestor or arose via convergent evolution. In conclusion, our study provides strong evidence that, even in its nascent stage, PHYRN measurements can provide key insight into evolutionary relationships among distantly related and/or rapidly evolving proteins.
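To make the dimensionality argument concrete, the sketch below follows the step described in our methods, where a queries-by-PSSMs matrix of product scores is converted into an N X N Euclidean distance matrix. It is a minimal illustration only, not the PHYRN pipeline: the function name `euclidean_distance_matrix` and the toy scores are hypothetical, and real product scores would come from the PSSM alignments.

```python
import math


def euclidean_distance_matrix(profiles):
    """Return the N x N matrix of Euclidean distances between profile rows.

    `profiles` is a list of N rows, one per query, each holding M scores
    (one per PSSM). The name and layout are assumptions for illustration.
    """
    n = len(profiles)
    if n and any(len(row) != len(profiles[0]) for row in profiles):
        raise ValueError("every query needs a score for every PSSM")
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            squared = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            # The matrix is symmetric, so fill both triangles at once.
            dist[i][j] = dist[j][i] = math.sqrt(squared)
    return dist


# Toy product scores (made up): 3 queries scored against 4 PSSMs.
toy_profiles = [
    [3.0, 0.0, 1.0, 0.0],
    [0.0, 4.0, 1.0, 0.0],
    [3.0, 4.0, 1.0, 0.0],
]
toy_distances = euclidean_distance_matrix(toy_profiles)
# Pairwise distances: 5.0 (queries 1-2), 4.0 (queries 1-3), 3.0 (queries 2-3).
```

Adding a PSSM adds a column to every row, so each pairwise distance is measured over one more dimension. A matrix of this form is the kind of input a distance-based method such as NJ or ME takes.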

Acknowledgements

This work was supported by the Searle Young Investigators Award and startup money from PSU (RLP), NCSA grant TG-MCB070027N (RLP, DVR), the National Science Foundation 428-15 691M (RLP, DVR), and the National Institutes of Health R01 GM087410-01 (RLP, DVR). This project was also funded by a Fellowship from the Eberly College of Sciences and the Huck Institutes of the Life Sciences (DVR) and a grant with the Pennsylvania Department of Health using Tobacco Settlement Funds (DVR). The Department of Health specifically disclaims responsibility for any analyses, interpretations or conclusions. We would like to thank Teresa Killick, Sree Chintapalli, and Anand Padmanbha for their help and support during the project, as well as Jason Holmes at the Pennsylvania State University CAC center for technical assistance. We would also like to thank Drs. Robert E. Rothe, Jim White, Glenn M. Sharer, Cookie van Volmar, Barbara VanRossum, Russell Hamilton Carroll, C. Heesch and C. Hong for creative dialogue.

M (no. of PSSMs) matrix of product scores. Euclidean distance was calculated based on the PHYRN product score of each query, and an N X N Euclidean distance matrix was generated. As with the RT trees, phylogenetic trees were inferred using the Neighbor Joining (NJ) and Minimum Evolution (ME) methods, as available in MEGA (13).

Generating phylogenetic trees using MUSCLE

The optimal Multiple Sequence Alignment (MSA) for a given dataset was obtained using MUSCLE v3.6 (23). Phylogenetic trees for these optimal MSAs were inferred using MEGA’s Neighbor-joining (NJ) and Minimum-evolution (ME) algorithms, with pairwise deletion and p-distance as default settings.

Consensus trees with ROSE ‘true history’

We used the ‘consense’ program of the PHYLIP v3.67 package (24;25) to generate consensus trees between PHYRN and Rose trees, as well as between MUSCLE and Rose trees. Recapitulation rates and percentages were then calculated from the consensus tree newick files.

Bootstrap and Jackknife

We generated 3,000 random samples from our PHYRN M-dimension, using the random number generator code from the PHYLIP source code (http://evolution.genetics.washington.edu/phylip.html). During resampling, the same column could be selected more than once. We then used the Fitch program with default settings in the PHYLIP 3.67 package to generate minimum-evolution (ME) trees for each sample, followed by the Consense program to generate a majority-rule consensus tree of all samples. For jackknife resampling, we followed a similar approach to generate 1,000 random samples; however, only 80% of the original M-dimensional data was resampled each time. The Fitch and Consense programs were then used in the same manner as for bootstrap resampling.

Reference List

1. Ko,K.D., Hong,Y., Chang,G.S., Bhardwaj,G., van Rossum,D.B., and Patterson,R.L. 2008. Phylogenetic Profiles as a Unified Framework for Measuring Protein Structure, Function and Evolution. Physics Archives arXiv:0806.239, q-bio.Q.
2. Chang,G.S., Hong,Y., Ko,K.D., Bhardwaj,G., Holmes,E.C., Patterson,R.L., and van Rossum,D.B. 2008. Phylogenetic profiles reveal evolutionary relationships within the "twilight zone" of sequence similarity. Proc. Natl. Acad. Sci. U.S.A. 105:13474-13479.
3. Ko,K.D., Hong,Y., Bhardwaj,G., Killick,T.M., van Rossum,D.B., and Patterson,R.L. 2009. Brainstorming through the Sequence Universe: Theories on the Protein Problem. Physics Archives arXiv:0911.0652v1, q-bio.QM:1-21.
4. Pellegrini,M., Marcotte,E.M., Thompson,M.J., Eisenberg,D., and Yeates,T.O. 1999. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. U.S.A. 96:4285-4288.
5. Ranea,J.A., Yeats,C., Grant,A., and Orengo,C.A. 2007. Predicting protein function with hierarchical phylogenetic profiles: the Gene3D Phylo-Tuner method applied to eukaryotic genomes. PLoS Comput. Biol. 3:e237.
6. Wu,J., Mellor,J.C., and DeLisi,C. 2005. Deciphering protein network organization using phylogenetic profile groups. Genome Inform. 16:142-149.

7. Hong,Y., Lee,D., Kang,J., van Rossum,D.B., and Patterson,R.L. 2009. Adaptive BLASTing through the Sequence Dataspace: Theories on Protein Sequence Embedding. Physics Archives arXiv:0911.0650v1, q-bio.QM:1-21.
8. Marchler-Bauer,A., Anderson,J.B., Cherukuri,P.F., Weese-Scott,C., Geer,L.Y., Gwadz,M., He,S., Hurwitz,D.I., Jackson,J.D., Ke,Z. et al. 2005. CDD: a Conserved Domain Database for protein classification. Nucleic Acids Res. 33 Database Issue:D192-D196.
9. Stoye,J., Evers,D., and Meyer,F. 1998. Rose: generating sequence families. Bioinformatics 14:157-163.
10. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W., and Lipman,D.J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
11. Saitou,N., and Nei,M. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-425.
12. Adams,M.A., Suits,M.D., Zheng,J., and Jia,Z. 2007. Piecing together the structure-function puzzle: experiences in structure-based functional annotation of hypothetical proteins. Proteomics 7:2920-2932.
13. Tamura,K., Dudley,J., Nei,M., and Kumar,S. 2007. MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol. Biol. Evol. 24:1596-1599.
14. Arkhipova,I.R., Pyatkov,K.I., Meselson,M., and Evgen'ev,M.B. 2003. Retroelements containing introns in diverse invertebrate taxa. Nat. Genet. 33:123-124.
15. Curcio,M.J., and Belfort,M. 2007. The beginning of the end: links between ancient retroelements and modern telomerases. Proc. Natl. Acad. Sci. U.S.A. 104:9107-9108.
16. Medhekar,B., and Miller,J.F. 2007. Diversity-generating retroelements. Curr. Opin. Microbiol. 10:388-395.
17. Eickbush,T.H., and Jamburuthugoda,V.K. 2008. The diversity of retrotransposons and the properties of their reverse transcriptases. Virus Res. 134:221-234.
18. Simon,D.M., Kelchner,S.A., and Zimmerly,S. 2009. A broadscale phylogenetic analysis of group II intron RNAs and intron-encoded reverse transcriptases. Mol. Biol. Evol. 26:2795-2808.
19. Simon,D.M., and Zimmerly,S. 2008. A diversity of uncharacterized reverse transcriptases in bacteria. Nucleic Acids Res. 36:7219-7229.
20. Simon,D.M., Clarke,N.A., McNeil,B.A., Johnson,I., Pantuso,D., Dai,L., Chai,D., and Zimmerly,S. 2008. Group II introns in eubacteria and archaea: ORF-less introns and new varieties. RNA 14:1704-1713.
21. Edgar,R.C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792-1797.
22. Kemena,C., and Notredame,C. 2009. Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics 25:2455-2465.
23. Edgar,R.C. 2004. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

24. Felsenstein,J. 1997. An alternating least squares approach to inferring phylogenies from pairwise distances. Syst. Biol. 46:101-111.
25. Felsenstein,J. 2008. Comparative methods with sampling error and within-species variation: contrasts revisited and revised. Am. Nat. 171:713-725.
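
Supplementary sketch. The resampling described under Bootstrap and Jackknife operates on the M columns (PSSMs) of the product-score matrix. The following is a minimal sketch of that idea under stated assumptions: it uses Python's standard `random` module rather than the PHYLIP random number generator, the function names are hypothetical, and the jackknife is assumed to keep a random 80% of columns without replacement. Tree building with Fitch and Consense is not reproduced here.

```python
import random


def bootstrap_columns(profiles, rng):
    """Draw M columns with replacement; the same column may appear repeatedly."""
    m = len(profiles[0])
    picked = [rng.randrange(m) for _ in range(m)]
    # Apply the same column choice to every query so rows stay comparable.
    return [[row[c] for c in picked] for row in profiles]


def jackknife_columns(profiles, rng, keep_fraction=0.8):
    """Keep a random subset of columns (assumed: 80%, without replacement)."""
    m = len(profiles[0])
    keep = max(1, int(round(keep_fraction * m)))
    picked = sorted(rng.sample(range(m), keep))
    return [[row[c] for c in picked] for row in profiles]


# Toy matrix (made up): 2 queries x 10 PSSM columns, values encode the column.
toy_matrix = [list(range(10)), [10 + c for c in range(10)]]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

boot_sample = bootstrap_columns(toy_matrix, rng)
jack_sample = jackknife_columns(toy_matrix, rng)
```

Each resampled matrix would then be converted to a distance matrix and passed to a tree-building program, with the resulting trees summarized by majority rule.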