CLUSTERING OF EUKARYOTIC ORTHOLOGS BASED ON SEQUENCE AND DOMAIN SIMILARITIES USING THE MARKOV GRAPH-FLOW ALGORITHM

HATICE GULCIN OZER1, JINGCHUN CHEN2,3,4, FA ZHANG2,3,4,5,6, BO YUAN1,2,3,4,*

Program in Biophysics1, Departments of Biomedical Informatics2 and Pharmacology3, Program in Pharmacogenomics4, The Ohio State University, Columbus, OH 43210; Institute of Computing Technology5, Graduate School6, The Chinese Academy of Sciences, Beijing, China 100080

Orthologs are genes in different species that evolved from a common ancestral gene by speciation. Normally, orthologs retain the same function in the course of evolution. It is hence important to identify orthologs in order to transfer functional and other information between genes in different organisms with a high degree of reliability. For example, protein-protein interactions identified by high-throughput proteomics already cover three-quarters of yeast proteins to date. Similar information is being generated on a massive scale for other important model systems such as fly and worm. Mapping their orthologous counterparts in human will thus tremendously assist in the understanding of the biological functions of human genes at the systems level. Unfortunately, a confounding factor in this process is that cross-species comparisons often identify genes that, although highly similar, do not represent true orthologs and may in fact be functionally dissimilar because of the large numbers of paralogs within protein families. Existing clustering methods based on two-way best genome-wide similarities have so far not separated paralogs from orthologs effectively. We present a fully automatic computational method to cluster orthologs and in-paralogs from multiple species. We use the program BLASTP to generate a pairwise distance matrix, which is then normalized for each homologous group between and within the species included. We also use protein domains and their organization in protein sequences as an additional criterion for filtering false relationships. Ortholog clusters are first seeded with multiple reciprocal best pairwise matches, after which the Markov graph-flow algorithm is applied to include in-paralogs. Classification parameters such as the inflation index are optimized according to the functional consistency within each of the clusters, which is inferred by comparing the ontological annotations available for the sequences belonging to the same cluster. We also use existing structural classifications for proteins to validate our results. We apply our programs to six completely sequenced eukaryotic genomes and assign confidence values for both orthologs and in-paralogs. Comparing our results with similar efforts at NCBI and TIGR, we note a significant improvement in the clustering of orthologs with recent paralogs. This provides an automatic and robust method to cluster orthologous genes of multiple genomes.

* Corresponding author: [email protected]

1. Introduction

With the rapidly growing amount of sequence data, the need for automatic analysis methods for biological discovery is growing too. Many scientists are now asking which genes in the human genome share exactly the same biological function with genes in simpler organisms. The concepts of orthology and paralogy originated in the field of molecular systematics [1], and have recently been applied to functional characterization and classification on the scale of whole-genome comparisons [2-9]. Orthologs are genes in different species that evolved from a gene in a common ancestor, while paralogs are homologs generated by gene duplications within the same genome [1]. Since orthologs are likely to retain the same biological function, identification of orthologous groups is useful in many areas of bioinformatics.

The phylogenetic approach to identifying orthologs includes clustering of homologs, generation of correct multiple alignments for each group of homologous domains, construction of a phylogenetic tree for each group, and finally extraction of orthologs from these trees. Unfortunately, this approach involves some steps that are difficult to automate and demands substantial computing power.

An alternative to phylogenetic methods is to use all-against-all sequence comparison between genomes to detect orthologs. Since orthologs retain their function through speciation, whereas paralogs are generated by gene duplication events followed by presumably faster divergence (no selective pressure), orthologs should be more similar to each other than paralogs are. Hence, orthologs should show a reciprocal best-hit pattern between two genomes (i.e., the first sequence finds the second sequence as its best hit in the second species, and vice versa). All-against-all BLAST searches have recently gained popularity for finding orthologs [6,7,9,10]. Initially, they were used mainly to detect simple one-to-one relationships [11].
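The reciprocal best-hit criterion described above can be illustrated with a minimal sketch. The tuple layout of the hits, the function names, and the toy scores are assumptions made for illustration; they are not taken from the programs used in this study.

```python
def best_hits(hits):
    """Map (query, target_species) to the best-scoring target sequence.

    `hits` holds tuples (query, query_species, target, target_species, bit_score).
    This layout is a hypothetical one chosen for the sketch.
    """
    best = {}
    for query, q_sp, target, t_sp, score in hits:
        if q_sp == t_sp:
            continue  # only cross-species hits can define ortholog candidates
        key = (query, t_sp)
        if key not in best or score > best[key][1]:
            best[key] = (target, score)
    return best


def reciprocal_best_hits(hits):
    """Return the set of sequence pairs that are each other's best hit."""
    hits = list(hits)
    species_of = {}
    for query, q_sp, target, t_sp, _ in hits:
        species_of[query] = q_sp
        species_of[target] = t_sp
    best = best_hits(hits)
    pairs = set()
    for (query, _t_sp), (target, _score) in best.items():
        back = best.get((target, species_of[query]))
        if back is not None and back[0] == query:
            pairs.add(tuple(sorted((query, target))))
    return pairs


# Toy data: h1 and y1 find each other; y2 prefers h1, so h2/y2 is not reciprocal.
example_hits = [
    ("h1", "human", "y1", "yeast", 250.0),
    ("h1", "human", "y2", "yeast", 120.0),
    ("y1", "yeast", "h1", "human", 245.0),
    ("y2", "yeast", "h1", "human", 118.0),
    ("h2", "human", "y2", "yeast", 90.0),
    ("y2", "yeast", "h2", "human", 80.0),
]
rbh_pairs = reciprocal_best_hits(example_hits)
```

In the toy data only the pair h1/y1 satisfies the criterion, which mirrors the simple one-to-one relationships that early all-against-all searches were used to detect.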
With the appearance of several fully sequenced genomes, the Clusters of Orthologous Groups (COG) database was constructed [2,4,11]. Sequences from distinct genomes that are reciprocal best hits are identified as a pair of orthologs, and COGs are built from such relationships among at least three distinct lineages (triangles), so that orthologs are identified across distant phylogenetic lineages. TIGR Orthologous Gene Alignments (TOGA) [5] applies a COG-based approach to the TIGR gene indices [12]. Complications associated with ortholog group construction for eukaryotic genomes include extensive gene duplication and functional redundancy, the multidomain structure of many proteins, and the predominance of incomplete eukaryotic genome sequences [12, 14]. These challenges demand an approach able to distinguish between in-paralogs and out-paralogs. Duplicated genes that arose after the species split are called in-paralogs (or recent paralogs). These paralogs are likely to retain similar functions, and hence should be grouped with true orthologs. On the other hand, out-paralogs (or ancient paralogs), which arise from duplication events before speciation, are likely to have diverged in function, and thus should be excluded from ortholog groups [15,16]. The generation of orthologs, in-paralogs and out-paralogs through speciation and gene duplication is depicted in Figure 1A, and their relationships are shown in Figure 1B.

Figure 1. (A) Generation of orthologs, in-paralogs and out-paralogs through speciation and gene duplications. (B) Relationships among orthologs, in-paralogs and out-paralogs.

Previous work on the clustering of sequences, and more specifically of orthologous groups, includes the following. ProtoMap [17] describes the relationships between sequences as a weighted graph based on the similarity scores, and implements an accretion algorithm for clustering with different cutoffs. GeneRAGE [18] is based on single-linkage hierarchical clustering, in which relationships between sequences are represented by a binary yes/no flag regardless of the strength of the relationships. It incorporates a procedure for resolving multidomain proteins by clustering proteins in all groups to which their domains belong, avoiding the incorrect inclusion in the same cluster of sequences connected only by indirect sharing of domains. Abascal and Valencia proposed an alternative approach for clustering sequence space that focuses on the space surrounding a given protein by analyzing only local sequence relationships, instead of clustering the full sequence space in search of distant relationships and a complete overview of sequence space. Hence this approach can be applied to individual sequences without requiring complete genomes [19]. The INPARANOID approach [15], which detects ortholog groups from two species by all-against-all comparison using NCBI BLAST, defines orthologs as a pair of sequences from two species that score higher with each other than with any other sequence in the other genome. It has proven to be an effective approach for finding orthologs, despite its scalability issues. The OrthoMCL approach [16] was developed in response to this scalability problem and is similar to INPARANOID in principle. It also assumes that recent paralogs are more similar to each other than to any sequence from another species. Thus in-paralogs and out-paralogs can be differentiated more effectively, and in-paralogs are subsequently better grouped with previously defined orthologs. Through application of the MCL [20] implementation of the Markov clustering algorithm, which allows simultaneous classification of global relationships in a similarity space, the many-to-many orthologous relationships in multiple genomes can be resolved.

Sequence determines structure, and structure determines function. So when we study sequence similarity, we eventually hope to discover or validate similarity in structure and function. This approach is often successful. However, there are many examples where two sequences have little or no similarity but the molecules still fold into the same structure and share the same function, or where two very similar sequences have different functions. To achieve a higher level of sensitivity and specificity in assessing similarity, it is necessary to add other criteria. Since structural information is available for less than five percent of all protein sequences, similarity at the level of protein domains and their organization (available for more than 85% of all existing protein sequences) is used to establish the similarity relationships in this study.
Existing clustering methods have so far not separated paralogs from orthologs effectively. In this study we present a fully automatic computational method to cluster orthologs and in-paralogs from multiple species. NCBI BLASTP was used to generate a pairwise distance matrix. Domain similarity was then included as a second similarity criterion. The graph-flow algorithm (the MCL implementation of the Markov Cluster algorithm) was applied to group orthologs and to distinguish in-paralogs from out-paralogs. Finally, two validation processes were applied to the clustering: 1) comparison of the ontological annotations available for the sequences belonging to the same cluster, to assess functional homogeneity; and 2) comparison with the structure-based classifications (SCOP, CATH, and FSSP). These validation processes allowed us to compare our clustering results with NCBI’s KOG database and TIGR’s TOGA database, and to quantify the improvement achieved by adding the domain similarity criterion.

2. Methods

2.1. Sequence data and alignment score refinements

Using an approach similar to the OrthoMCL approach, we conducted an all-against-all NCBI BLASTP search on the protein sequences from the human (28,928 sequences), arabidopsis (26,192 sequences), worm (22,439 sequences), fly (16,106 sequences), and yeast (6,195 sequences) genomes. This search generated a tabular output containing sequence pairs with their alignments and bit scores. However, the results contained multiple alignments for some sequence pairs: the aligned local regions of a query sequence either completely or partially overlapped one another or, in other instances, were disjoint. These represented cases where the query sequence aligned with another sequence in multiple places. Such cases required additional processing, since the MCL implementation expects a graph with a one-to-one correspondence between local alignments and the bit scores for those alignments. A parsing program written in PERL was developed to search the tabular output of BLASTP iteratively for instances of partial overlaps, complete overlaps, and complete disjoints, and to eliminate these multiple alignments. Hence, the parsing program provides a one-to-one correspondence between alignments of sequences and their scores.

To improve the accuracy of the sequence alignment scores, domain similarity was added as a further similarity criterion, since sequence alignment scores are limited to sequence similarity in one dimension. In its second step, the parsing program examines domain similarities and keeps the pairs that have the same domains in the same order in their overlapping region. The domain databases Pfam 12.0, TIGR 3.0, PRINTS 37.0, PROSITE 18.10, ProDom 2002.1, SMART 4.0, SuperFamily 1.63, and PIR SuperFamily 2.41 were used. Table 1 illustrates the mapping of protein domains.
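The domain filter of the second parsing step (same domains, in the same order, within the overlapping region) might look roughly like the following sketch. The data layout, the function names, the rule that a domain must lie fully inside the aligned region, and the rejection of pairs with no domain in that region are all assumptions made for illustration; the original PERL program may handle these cases differently.

```python
def domains_in_region(domains, start, end):
    """Names of the domains lying fully inside [start, end], in sequence order.

    `domains` is a list of (name, domain_start, domain_end) tuples
    (a hypothetical layout chosen for this sketch).
    """
    inside = [(d_start, name) for name, d_start, d_end in domains
              if d_start >= start and d_end <= end]
    return [name for _, name in sorted(inside)]


def same_domain_architecture(q_domains, q_region, s_domains, s_region):
    """True if both aligned regions carry the same domains in the same order.

    Assumption: a pair with no annotated domain in the aligned region is
    rejected rather than kept.
    """
    q_arch = domains_in_region(q_domains, *q_region)
    s_arch = domains_in_region(s_domains, *s_region)
    return bool(q_arch) and q_arch == s_arch


# Toy proteins: `query` and `subject` share the order kinase -> SH2,
# while `shuffled` carries the same domains in the opposite order.
query = [("kinase", 10, 250), ("SH2", 300, 380)]
subject = [("kinase", 5, 240), ("SH2", 290, 370)]
shuffled = [("SH2", 12, 95), ("kinase", 120, 360)]

keep_pair = same_domain_architecture(query, (1, 400), subject, (1, 380))
drop_pair = same_domain_architecture(query, (1, 400), shuffled, (1, 380))
```

Comparing ordered lists rather than sets is what distinguishes domain organization from mere domain content: the shuffled protein shares both domains with the query but is still filtered out.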
The high similarity of in-paralogs can bias the BLAST scores; for example, scores of pairs within the same genome are likely to be higher than scores between genomes. Thus, the raw BLAST scores must be normalized to eliminate the effect of this systematic bias. The normalization method introduced in the OrthoMCL approach [16] was used in the third step of the parsing program. First, the pairs with scores under 100 were eliminated to remove false positives, so that only the "most recent" paralogs (in-paralogs) would be included. Then Wij, the average score among all ortholog and in-paralog pairs from genomes i and j (when i=j, Wij is the average score among those paralog pairs with reciprocal best hits within the genome), and W, the average score among all pairs, were calculated. Finally, the raw scores of the pairs were divided by Wij/W to obtain the final normalized scores.

Table 1. Mapping of protein domains

Proteome Databases (Version)   Proteins Analyzed   At Least One Domain   At Least Two Domains   At Least Three Domains
Human (ENS 19.2)               28,928              21,749 (75%)          13,657 (47%)           7,114 (25%)
Arabidopsis (UniProt 1.6)      26,192              20,964 (80%)          11,792 (45%)           5,913 (23%)
Worm (ENS 19.116)              22,439              15,186 (68%)          8,374 (37%)            3,740 (17%)
Fly (ENS 19.3)                 16,106              11,499 (71%)          6,609 (41%)            3,409 (21%)
Yeast (UniProt 1.6)            6,195               4,334 (70%)           2,289 (37%)            1,069 (17%)
Zebrafish (ENS 19.3)           30,783              26,071 (85%)          16,356 (53%)           8,974 (29%)
Total                          136,643             99,803 (73%)          59,077 (43%)           30,219 (22%)
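The score normalization described in Section 2.1 (dropping pairs with scores under 100, then dividing each raw score by Wij/W) can be sketched as follows. This is a simplified illustration, not the original parsing program: the tuple layout and names are hypothetical, and Wij is computed here over all retained pairs of a genome pair rather than only over the ortholog and in-paralog pairs.

```python
from collections import defaultdict


def normalize_scores(pairs, min_score=100.0):
    """Divide each raw score by W_ij / W.

    `pairs` holds tuples (seq_a, species_a, seq_b, species_b, raw_score);
    this layout is an assumption of the sketch. Pairs scoring below
    `min_score` are discarded before the averages are taken.
    """
    kept = [p for p in pairs if p[4] >= min_score]
    if not kept:
        return {}
    overall_mean = sum(p[4] for p in kept) / len(kept)  # W
    totals = defaultdict(lambda: [0.0, 0])              # per genome pair (i, j)
    for _, sp_a, _, sp_b, score in kept:
        entry = totals[frozenset((sp_a, sp_b))]
        entry[0] += score
        entry[1] += 1
    normalized = {}
    for seq_a, sp_a, seq_b, sp_b, score in kept:
        total, count = totals[frozenset((sp_a, sp_b))]
        w_ij = total / count                             # W_ij
        normalized[(seq_a, seq_b)] = score / (w_ij / overall_mean)
    return normalized


# Toy data: within-human scores are systematically higher than human-yeast scores.
example_pairs = [
    ("h1", "human", "h2", "human", 400.0),
    ("h3", "human", "h4", "human", 600.0),
    ("h1", "human", "y1", "yeast", 200.0),
    ("h3", "human", "y3", "yeast", 300.0),
    ("h5", "human", "y5", "yeast", 50.0),   # below the cutoff, dropped
]
normalized = normalize_scores(example_pairs)
```

In this toy example W is 375, the within-human average is 500 and the human-yeast average is 250, so the within-human pair h1/h2 (raw score 400) and the cross-species pair h1/y1 (raw score 200) both end up at about 300, which removes the within-genome bias.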

2.2. Clustering

The Markov Clustering (MCL) algorithm, which is designed to find cluster structure in a graph where dissimilarity between vertices is implicitly defined by the connectivity characteristics of the graph, was applied to group orthologs and to distinguish in-paralogs from out-paralogs. At the heart of the MCL algorithm lies the idea of simulating flow within a graph, promoting flow where the current is strong and demoting flow where the current is weak. This is achieved by inserting a new operator, called inflation, into the Markov process; inflation is responsible for both the strengthening and the weakening of the current [20]. In its final step, the parsing program generates a graph in the MCL-compliant format from the filtered and normalized results of the BLAST search. This graph was fed as input to the MCL implementation to obtain the clustering results.

2.3. Validation of clusters

The consistency of the clusters generated by the MCL algorithm was examined through a validation process. A second parsing program was developed for this process, which evaluates the concordance of the sequences belonging to the same cluster by comparing their ontological annotations. Table 2 summarizes the mapping of gene ontologies. MCL has a parameter I, the inflation index, which controls the granularity of the clustering results and ranges over [1.2-5.0]. The higher the value of the inflation index, the finer the granularity of the resulting clusters. To identify the optimum value for this parameter, MCL was tested with different inflation index values (1.2, 1.3, …, 4.9, 5.0). Since 4.0 was determined to provide the most consistent clusters, this optimum value is used throughout the study.

Table 2. Mapping of gene ontology (GO). Gene ontology terms were annotated according to http://www.geneontology.org, using Swiss-Prot and TrEMBL keywords, protein domain and family information, and manual curation. The child and parent relationships were parsed from the GO graph database from the same web site.

Proteome Databases (Version)   Proteins Analyzed   At Least One GO Term
Human (ENS 19.2)               28,928              17,632 (61%)
Arabidopsis (UniProt 1.6)      26,192              13,068 (50%)
Worm (ENS 19.116)              22,439              9,959 (44%)
Fly (ENS 19.3)                 16,106              8,351 (52%)
Yeast (UniProt 1.6)            6,195               5,292 (85%)
Zebrafish (ENS 19.3)           30,783              19,440 (63%)
Total                          130,643             73,742 (56%)
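The flow simulation of Section 2.2 alternates two operators: expansion (matrix squaring, which spreads flow) and inflation (entrywise powering followed by column renormalization, which strengthens strong currents and weakens weak ones). The following is a minimal pure-Python sketch of that idea and not the MCL implementation of [20]; the self-loop weight of 1.0, the convergence tolerance, the membership threshold, and the way clusters are read off the final matrix are assumptions made for illustration.

```python
def normalize_columns(m):
    """Scale every column of a square matrix so that it sums to 1."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                m[i][j] /= total
    return m


def expand(m):
    """Expansion: square the matrix, letting flow spread along longer paths."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: raise entries to the power r, then renormalize columns."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(weights, inflation=2.0, max_iter=100, tol=1e-9, eps=1e-6):
    """Cluster a symmetric weight matrix; returns sorted lists of node indices."""
    n = len(weights)
    m = [[float(x) for x in row] for row in weights]
    for i in range(n):
        if m[i][i] == 0.0:
            m[i][i] = 1.0  # add a self-loop (assumed weight)
    m = normalize_columns(m)
    for _ in range(max_iter):
        new = inflate(expand(m), inflation)
        delta = max(abs(new[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = new
        if delta < tol:
            break
    # Nodes joined by any remaining flow end up in the same cluster.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for j in range(n):
        for i in range(n):
            if m[i][j] > eps:
                parent[find(i)] = find(j)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


# Two triangles (nodes 0-2 and 3-5) joined by one weak edge between 2 and 3.
example_graph = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
clusters = mcl(example_graph, inflation=4.0)
```

With the inflation index of 4.0 used in this study, the weak edge between the two triangles carries almost no flow after the first inflation step, so the toy graph separates into the clusters [0, 1, 2] and [3, 4, 5].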

Our sequence clusters were compared, using BLAST, with the clusters of the three structure-based classification databases SCOP, CATH, and FSSP, and the consistency between the sequence clusters and the structure clusters was scored. Since many of the SCOP, CATH, and FSSP classifications are based on protein domains rather than on entire protein sequences, we used a PERL script to consider both single-domain and multidomain classifications.

The functional homogeneity index was created based on the Gene Ontology (GO) database, which establishes a standard way to annotate a gene with its molecular function, biological process and cellular component. In the domain of molecular function, a set of annotation terms is organized into a Directed Acyclic Graph (DAG) according to the hierarchy of the terms. This makes it possible to measure functional similarity between genes using information content [21]. First, given the annotations of all genes for one species, the information content of an annotation term T in the molecular function domain of GO is defined as:

F = -ln(g/G)

where g is the number of genes that are annotated with T or its child terms, and G is the total number of annotated genes. Therefore, the more specific a term is, the higher its information content. Given an ortholog cluster of genes, the functional homogeneity index is defined as the information content of the GO term that 1) is a common parent term of all the GO terms that appear in the annotations of all the genes in the cluster; and 2) has the highest information content among all such common parent terms. It can easily be seen that if a cluster contains some genes that are irrelevant to the cluster, their common parent annotation term will be at a much higher level in the DAG, and therefore the cluster’s functional homogeneity index will be much lower.
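The information content F = -ln(g/G) and the functional homogeneity index defined above can be sketched on a toy DAG. The term names, the dictionary layout, and the decision to count a term as its own ancestor (so that genes sharing one identical term score that term's information content) are assumptions made for illustration; they are not taken from the validation program used in this study.

```python
import math


def ancestors(term, parents):
    """Return `term` together with all of its ancestors in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(annotations, parents):
    """F = -ln(g/G) for every term that covers at least one annotated gene."""
    total_genes = len(annotations)  # G
    counts = {}
    for terms in annotations.values():
        covered = set()
        for term in terms:
            covered |= ancestors(term, parents)
        for term in covered:
            counts[term] = counts.get(term, 0) + 1  # g
    return {term: -math.log(g / total_genes) for term, g in counts.items()}


def homogeneity_index(cluster, annotations, parents, ic):
    """Highest information content among the common parent terms of a cluster."""
    common = None
    for gene in cluster:
        for term in annotations.get(gene, ()):
            anc = ancestors(term, parents)
            common = anc if common is None else common & anc
    if not common:
        return 0.0
    return max(ic[term] for term in common)


# Toy molecular-function DAG (child -> parents) and annotations for four genes.
go_parents = {
    "catalytic": ["molecular_function"],
    "binding": ["molecular_function"],
    "kinase": ["catalytic"],
    "phosphatase": ["catalytic"],
}
go_annotations = {
    "g1": {"kinase"},
    "g2": {"kinase"},
    "g3": {"phosphatase"},
    "g4": {"binding"},
}
ic = information_content(go_annotations, go_parents)
tight = homogeneity_index({"g1", "g2"}, go_annotations, go_parents, ic)
mixed = homogeneity_index({"g1", "g3"}, go_annotations, go_parents, ic)
loose = homogeneity_index({"g1", "g4"}, go_annotations, go_parents, ic)
```

The three toy clusters illustrate the argument in the text: two kinases share the specific term "kinase" (index ln 2), a kinase and a phosphatase only share "catalytic" (index ln(4/3)), and adding an unrelated binding protein pushes the common parent up to the root, whose information content is zero.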

Finally, the validation process designed in this study was applied to the clusters of NCBI and TIGR in order to compare the results of the three studies. Then zebrafish was added to the first protein set and the whole procedure was applied to the new protein set with six species, in order to check whether the clustering results improve.

3. Results

First we applied our computational strategy to the five species human, fly, worm, yeast and arabidopsis, to be able to compare our resulting clusters with NCBI’s KOG and TIGR’s TOGA databases. The percentage distribution of the sequences among the clusters of OSU, NCBI and TIGR is depicted in Figure 2.

Figure 2. Percentage distribution of the sequences among clusters of OSU, NCBI and TIGR.

Then we applied our strategy after adding zebrafish to our sequence set, to analyze how an additional species affects the clustering results. Clustering results for the 5-species studies of OSU, NCBI and TIGR, and for the 6-species study of OSU, are listed in Table 3. We also determined which clusters might be lineage-specific and how many orthologs are expanded in a species-specific way in each study (Table 4). With the MCL algorithm, the resulting clusters are expected to become more accurate as more nodes are added to the graph. As depicted in Figure 3 and Figure 4, the consistency of the resulting clusters was further improved by the addition of the sixth species.

The validation of our classification results using functional ontology and structural information (Figures 5 and 6) demonstrated that the use of protein domains and their organization significantly increased the accuracy. Note that the error rate could not be reduced by sequence similarity alone, even at very high similarity (BLAST score > 200), indicating the need for additional criteria to relate “similar” sequences. We therefore conclude that sequence similarity alone is not sufficient for the functional classification of proteins. Similarity in protein sequence, in the conserved domains a protein contains and their organization, in its secondary and three-dimensional structure, and in the regulation of its expression and function are all required to classify a protein correctly into a functional cluster. Thus, our current effort is only the first step towards this goal.

Table 3. Clustering results for 5 species study of OSU, NCBI and TIGR, and 6 species study of OSU

                                                       5 Species                                  6 Species
                                                       OSU           NCBI          TIGR           OSU
Number of Sequences Clustered                          40088         46934         36272          4532
Total Number of Clusters                               10128         4845          27692          13824
Mean Size of Clusters                                  3.96          9.67          1.31           3.95
Number of Clusters Larger than the Mean                2970 (29%)    1073 (22%)    4847 (17%)     4028 (29%)
Number of Clusters Smaller than the Mean               7158 (71%)    3772 (78%)    22845 (83%)    9796 (71%)
Number of Orthologous Clusters Shared by All Species   413 (4%)      1784 (37%)    359 (1.3%)     189 (1.4%)

Table 4. The number of lineage-specific clusters and species-specific expansions (paralogs) for each of the species within the resulting clustering of each study.

                 OSU (5 species)             NCBI                        TIGR                        OSU (6 species)
                 Lineage Specific  Paralog   Lineage Specific  Paralog   Lineage Specific  Paralog   Lineage Specific  Paralog
Yeast            17                213       12                7         408               3         17                210
Worm             108               1094      2                 0         640               1         103               1074
Arabidopsis      229               2272      5                 4         8894              461       226               2259
Zebrafish        NA                NA        NA                NA        NA                NA        157               2846
Fly              101               687       2                 0         1510              8         121               678
Human            191               1953      4                 2         11393             183       227               1660
Total Clusters   10128                       4845                        27692                       13824

Figure 3. Number of clusters that are species specific, that do not have enough functional annotation data, and that were evaluated in the validation process, for each study: OSU (5), NCBI (5), TIGR (5) and OSU (6).

Figure 4. Consistency (%) of the clusters of OSU (5), NCBI (5), TIGR (5) and OSU (6). Validation was performed on clusters that are not lineage-specific and in which at least 60% of the sequences have functional annotations. Since each study has a different total number of clusters, validation results are depicted as the percentage of clusters with consistencies in the given range.

Figure 5. Clustering error rate in percent, normalized to the total number of clusters, for each similarity cutoff: 1) matched domains; and 2-5) combinations of BLAST score (S) and overlap length (L, number of amino acid residues): 100S+150L, 100S+200L, 200S+150L and 200S+200L. The comparisons were performed against the structure-based classification databases SCOP, CATH, and FSSP, and indicate clearly that the error rate using matched domains (~1%) is significantly lower than that using any combination of sequence-identity-based criteria.

Figure 6. Clustering assessment using the total average functional homogeneity (Y-axis), which was calculated from the average information content of all the gene ontology (GO) terms in a given cluster (Chen et al., in preparation), for matched domains, 100S + 150L, 100S + 200L, 200S + 150L and 200S + 200L. The higher the number, the more homogeneous the clusters. The figure is the sum over all clusters. Clearly, the matched domains (first bar) result in functionally more homogeneous clusters than any of the sequence-similarity-based criteria (see Figure 5).

4. Conclusion

In this paper we described a fully automatic algorithm to cluster orthologs and in-paralogs from multiple species. We showed that the Markov graph-flow algorithm can be used effectively to cluster protein sequences based on sequence and domain similarities. Since we observed that sequence similarity alone is not robust enough to cluster truly homologous sequences, as shown by the NCBI and TIGR clusters, we incorporated domain architecture as an additional criterion, which significantly increases both the accuracy and the tightness of our orthologous clusters. We performed an extensive comparative assessment of our clusters against those of NCBI and TIGR, and showed a significant improvement using our method. Since the algorithm is expected to perform even better as more genomes are added, we plan to expand our approach to all existing eukaryotic species. In the future, we also plan to modify the current normalization protocol with inter- versus intra-species weights to further improve the handling of lineage-specific expansions of paralogous genes.

References

1. W. M. Fitch, Syst. Zool. 19, 99 (1970).
2. R. L. Tatusov, E. V. Koonin, D. J. Lipman, Science 278, 631 (1997).
3. R. L. Tatusov, M. Y. Galperin, D. A. Natale, E. V. Koonin, Nucleic Acids Res. 28, 33 (2000).
4. R. L. Tatusov, D. A. Natale, I. V. Garkavtsev, T. A. Tatusova, U. T. Shankavaram, B. S. Rao, B. Kiryutin, M. Y. Galperin, N. D. Fedorova, E. V. Koonin, Nucleic Acids Res. 29, 22 (2001).
5. Y. Lee, R. Sultana, G. Pertea, J. Cho, S. Karamycheva, J. Tsai, B. Parvizi, F. Cheung, V. Antonescu, J. White, et al., Genome Res. 12, 493 (2002).
6. S. A. Chervitz, L. Aravind, G. Sherlock, C. A. Ball, E. V. Koonin, S. S. Dwight, M. A. Harris, K. Dolinski, S. Mohr, T. Smith, et al., Science 282, 2022 (1998).
7. A. R. Mushegian, J. R. Garey, J. Martin, L. X. Liu, Genome Res. 8, 590 (1998).
8. S. J. Wheelan, M. S. Boguski, L. Duret, W. Makalowski, Gene 238, 163 (1999).
9. G. M. Rubin, M. D. Yandell, J. R. Wortman, G. L. Gabor Miklos, C. R. Nelson, I. K. Hariharan, M. E. Fortini, P. W. Li, R. Apweiler, W. Fleischmann, et al., Science 287, 2204 (2000).
10. S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, D. J. Lipman, Nucleic Acids Res. 25, 3389 (1997).
11. R. L. Tatusov, A. R. Mushegian, P. Bork, N. P. Brown, W. S. Hayes, M. Borodovsky, et al., Curr. Biol. 6, 279 (1996).
12. J. Quackenbush, F. Liang, I. Holt, G. Pertea, J. Upton, Nucleic Acids Res. 28, 141 (2000).
13. R. F. Doolittle, Annu. Rev. Biochem. 64, 287 (1995).
14. S. Henikoff, E. A. Greene, S. Pietrokovski, P. Bork, T. K. Attwood, L. Hood, Science 278, 609 (1997).
15. M. Remm, C. E. Storm, E. L. Sonnhammer, J. Mol. Biol. 314, 1041 (2001).
16. L. Li, C. J. Stoeckert, D. S. Roos, Genome Res. 13, 2178 (2003).
17. G. Yona, N. Linial, M. Linial, Proteins 37, 360 (1999).
18. A. J. Enright, C. A. Ouzounis, Bioinformatics 16, 451 (2000).
19. F. Abascal, A. Valencia, Bioinformatics 18, 908 (2002).
20. S. Van Dongen, "Graph clustering by flow simulation", Ph.D. thesis, University of Utrecht, The Netherlands (2000).
21. P. W. Lord, R. D. Stevens, A. Brass, C. A. Goble, Pacific Symposium on Biocomputing 8, 601 (2003).
