Gclust Server: Phylogenetic Profiling with Pre-Defined Organism Sets

Naoki Sato

Department of Life Sciences, Graduate School of Arts and Sciences, University of Tokyo, 3-8-1 Komaba, Meguro-ku, Tokyo 153-8902, Japan

Keywords: endosymbiosis, nodulation, homolog group, genomic clustering

1 Introduction

I describe a new type of comparative genomic database for phylogenetic profiling based on pre-calculated homologous protein groups. The methodology rests on the observation that proteins conserved in a set of organisms (the in-category) that share a certain morphology, physiological function, enzyme activity, or metabolic pathway (referred to as 'trait T') may be involved in trait T. This principle has been used to infer protein-protein interactions and co-evolved proteins. An obvious example is that many enzymes related to photosynthesis or chloroplast function are conserved in plants and cyanobacteria, both of which share the function ‘oxygenic photosynthesis’, namely, photosynthesis with oxygen evolution. We previously applied this principle to extract ‘chloroplast proteins of endosymbiont origin’, or CPRENDOs [1]. Phylogenetic profiling has other applications as well, such as identifying proteins conserved among organisms that carry out symbiotic nitrogen fixation.

2 Method and Results

2.1 General Consideration

Phylogenetic profiling is not as simple as it might appear. Each trait T is realized by various enzymes and regulators (parts) that are shared by the in-category. These parts may or may not be shared by organisms that lack trait T (the out-category). In addition, the regulators (or transcription factors) may be members of a large protein family and are not easily recognized by simple sequence comparison as parts involved in the particular trait T. Nevertheless, the proteins shared by the in-category (whether or not they are present in the out-category) certainly include candidate proteins for trait T and are useful in functional genomic studies. This is the basis of our strategy, which involves both informatics (initial screening) and experiments (final screening).

Phylogenetic profiling requires well-defined protein clusters constructed over a set of organisms that includes both the in-category and the out-category. This is especially important because proteins of unknown function are being analyzed. Clustering often encounters difficulties with multidomain proteins, promiscuous proteins, and fragments. The current version of the Gclust software applies various criteria similar to those used by biologists for manual clustering, and thus produces clusters that are meaningful from the biologists' viewpoint. Several computational methods exist for genomic protein clustering, but no single method or algorithm is capable of cleanly classifying real protein sequence data, because of the above-mentioned difficulties in protein similarity.

In addition, the in-category and the out-category must be deliberately defined for successful phylogenetic profiling. Too many genomes, or a biased selection of genomes, yields unusual clusters, which in turn give unsound profiling. Therefore, no universal or unitary set of protein clusters is useful for all phylogenetic profiling.
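The difficulty posed by multidomain proteins can be illustrated with a toy example. The sketch below is not the Gclust algorithm; it is a deliberately naive single-linkage clustering that joins any two proteins whose BLASTP E-value falls below a fixed cutoff. The protein names, the E-values, and the function `single_linkage` are all hypothetical. A single fusion protein that carries domains from two unrelated families is enough to merge both families into one cluster.

```python
from collections import defaultdict

# Hypothetical BLASTP hits: (query, subject, E-value). All names are invented.
HITS = [
    ("kinaseA", "kinaseB", 1e-40),
    ("transpA", "transpB", 1e-35),
    ("fusionAB", "kinaseA", 1e-30),  # fusion protein, kinase domain
    ("fusionAB", "transpA", 1e-28),  # fusion protein, transporter domain
    ("kinaseB", "transpB", 0.5),     # insignificant hit, never joined
]


def single_linkage(hits, max_evalue=1e-10):
    """Naive clustering: join every pair with E-value <= max_evalue.

    This is an illustrative simplification, not the Gclust heuristic.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, evalue in hits:
        root_q, root_s = find(query), find(subject)
        if evalue <= max_evalue and root_q != root_s:
            parent[root_q] = root_s

    clusters = defaultdict(set)
    for protein in list(parent):
        clusters[find(protein)].add(protein)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    print("with fusion protein:   ", single_linkage(HITS))
    without = [h for h in HITS if "fusionAB" not in h[:2]]
    print("without fusion protein:", single_linkage(without))
```

With the fusion protein present, the kinase and transporter families collapse into a single cluster; without it, they remain separate. This is why domain information and additional criteria are needed before clusters can be trusted for profiling.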
The Gclust database server provides an interface for searching, analyzing, and exploiting pre-calculated protein clusters for several different sets of organisms, for use in various types of research.

2.2 Construction of Gclust Databases

The Gclust databases are designed as pre-calculated similarity matrices, which include all homologs in the selected dataset. They are constructed by the GCLUST software (version 3.5.3) from the results of all-against-all BLASTP [2] (version 2.2.12) searches of publicly available genomic databases [1]. The homolog matrices obtained by GCLUST are then converted into an organism-sorted cluster list by the TBSORT software, and subsequently into an organism group-sorted cluster list. The latter is used for phylogenetic profiling. A data loader in the Gclust server automatically converts these data to an internal format (flat files), which are served to users on request through CGI scripts.

A homolog group contains all reliable homologs that represent a protein family. Gclust uses BLASTP E-values, overlap scores, and the number of organisms as input, and exploits a new heuristic to select the various thresholds for construction of homolog clusters automatically. Gclust also automatically defines the domain structure of all proteins from the BLASTP results, and this information is used to link relevant homologs and to eliminate multidomain or fragmented proteins. A general explanation of the algorithm was presented in a previous paper [1]; an additional algorithm introduced after that publication will be published elsewhere. The software is also available as source code from the Gclust database web site.

2.3 Web Interface

The Gclust website is accessible at the URL given in [4]. For phylogenetic profiling, the 'Search Menu' provides two methods. In the first method, organism groups as defined in the menu are selected. For each group, the selection is yes, no, or indifferent ('any', for short). If ‘yes’ is selected for an organism group, the number of organisms within that group that belong to each cluster is counted, and the cluster is selected if this number is larger than 50% of the number of organisms in the group. A similar selection is repeated for each organism group. The threshold value (50%) may vary depending on the purpose of the phylogenetic profiling. In the second method, all species are selected individually.

In the Search Results window, the matching clusters are listed (Fig. 1). Each cluster is displayed in the Cluster Display window as a similarity matrix. If the cluster is large (>30 sequences), the similarity matrix is not displayed, and only a basic description of each sequence is shown. The sequences can be retrieved for further analysis, or an alignment can be made on screen by the Clustal W software and manipulated with the Jalview software [3]. Jalview runs in the server, but the client must have Java (version 1.4 or higher) installed. Homologs that are not included in the clustering, such as multidomain proteins, are also displayed as ‘Related sequences’.
Because some clusters are always sensitive to the dataset and to the stringency of clustering, such related sequences may be useful when a protein family is split into two clusters in a particular database.
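The group-based selection rule described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the server's actual code: the organism groups, organism codes, cluster contents, and the functions `matches_profile` and `select_clusters` are all hypothetical. The text specifies the rule only for ‘yes’ (more than 50% of the group present); the handling of ‘no’ as the mirror image (the fraction must not exceed the threshold) is an assumption made here.

```python
# Hypothetical organism groups and clusters; all names are invented.
GROUPS = {
    "plants": {"Ath", "Osa", "Ppa"},
    "cyanobacteria": {"Syn", "Ana", "Glv", "Pro"},
    "non_photosynthetic": {"Eco", "Bsu", "Sce"},
}

# Each cluster is represented only by the set of organisms it contains.
CLUSTERS = {
    "c1": {"Ath", "Osa", "Syn", "Ana", "Glv"},
    "c2": {"Ath", "Osa", "Ppa", "Syn", "Ana", "Eco"},
    "c3": {"Ath", "Syn", "Ana", "Pro"},
}


def matches_profile(members, profile, groups, threshold=0.5):
    """Check one cluster against a yes/no/any profile over organism groups."""
    for group, choice in profile.items():
        organisms = groups[group]
        fraction = len(members & organisms) / len(organisms)
        if choice == "yes":
            # Rule from the text: strictly more than the threshold.
            if not fraction > threshold:
                return False
        elif choice == "no":
            # ASSUMPTION: 'no' is treated as the mirror image of 'yes'.
            if fraction > threshold:
                return False
        elif choice != "any":
            raise ValueError(f"unknown choice {choice!r} for group {group!r}")
    return True


def select_clusters(clusters, profile, groups, threshold=0.5):
    """Return the sorted IDs of all clusters that satisfy the profile."""
    return sorted(
        cluster_id
        for cluster_id, members in clusters.items()
        if matches_profile(members, profile, groups, threshold)
    )


if __name__ == "__main__":
    profile = {"plants": "yes", "cyanobacteria": "yes", "non_photosynthetic": "no"}
    print(select_clusters(CLUSTERS, profile, GROUPS))
```

In this toy dataset, cluster c2 is rejected because exactly half of the cyanobacteria are present, which is not larger than 50%. Lowering the hypothetical `threshold` parameter admits it, which shows how the choice of threshold changes the outcome of profiling.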

Figure 1: Selection of clusters according to the given phylogenetic profile.

3 Prospects

No other database specializes in phylogenetic profiling, an analysis that is otherwise difficult for most biologists. We hope that the server will be useful to many experimental biologists. The Gclust database will be expanded by incorporating new genomes, such as those of rice, moss, and fungi.

References

[1] Sato, N., Ishikawa, M., Fujiwara, M., and Sonoike, K., Mass identification of chloroplast proteins of endosymbiont origin by phylogenetic profiling based on organism-optimized homologous protein groups, Genome Informatics, 16:56-68, 2005.
[2] Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J., Gapped BLAST and PSI-BLAST: A new generation of protein database search programs, Nucleic Acids Res., 25:3389-3402, 1997.
[3] Clamp, M., Cuff, J., Searle, S.M., and Barton, G.J., The Jalview Java alignment editor, Bioinformatics, 20:426-427, 2004.
[4] http://gclust.c.u-tokyo.ac.jp/