
Vol. 20 no. 8 2004, pages 1331–1333 DOI: 10.1093/bioinformatics/bth076

BIOINFORMATICS APPLICATIONS NOTE

WebVar: a resource for the rapid estimation of relative site variability from multiple sequence alignments

Flavio Mignone, David S. Horner and Graziano Pesole∗

Dipartimento di Scienze Biomolecolari e Biotecnologie, Università di Milano, 20113 Milano, Italy

Received on August 7, 2003; revised on October 25, 2003; accepted on October 31, 2003
Advance Access publication February 10, 2004

∗To whom correspondence should be addressed.

Bioinformatics 20(8) © Oxford University Press 2004; all rights reserved.

ABSTRACT

Summary: WebVar is an online resource that provides estimates of relative site variability from multiple alignments of homologous protein or nucleic acid sequences. WebVar provides a variety of graphic and textual representations of estimates, designed to assist in phylogenetic analysis.
Availability: The WebVar server is located at http://www.pesolelab.it/Tools/WebVar.html
Contact: [email protected]

INTRODUCTION

The characterization of relative site variability in alignments of homologous nucleotide and amino acid sequences is of great importance both in the estimation of functionally important residues in comparative sequence analysis (in the presence or absence of protein structure models) and in phylogenetic reconstruction (where failure to account for site-by-site substitution rate variation can cause the recovery of incorrect trees). Several authors have recently suggested that deconstruction of the effects of site substitution rate heterogeneity should be a routine part of phylogenetic analysis (Gribaldo and Philippe, 2002). Maximum-likelihood-based methods to estimate relative site variability have proved effective in phylogenetic reconstruction. However, they are computationally intensive for large datasets, they presume that the distribution of site rates can indeed be modeled by a predetermined distribution and they require the prior formulation of a phylogenetic tree describing the evolutionary relationships between the sequences in question.

We have previously proposed and demonstrated the effectiveness of simple tree-independent algorithms, SiteVar and SiteVarProt, to estimate relative site variability from large alignments of homologous nucleic acid and protein sequences, respectively (Pesole and Saccone, 2001; Horner and Pesole, 2003). These algorithms are based on two simple assumptions. First, substitutions observed in pairwise comparisons between closely related sequences are likely, in general, to occur at more variable sites. Second, our algorithms assume that, between closely related sequences, substitutions that are empirically observed to occur rarely will tend to be observed at less constrained sites. These algorithms have been shown to be effective in simulation studies, and in our hands they have performed well with real data. In contrast to maximum-likelihood methods, our approach yields rapid results even with very large datasets (Horner and Pesole, 2003).

Fig. 1. Screen grab of the Home Page (A) and of a sample output page from the WebVar site (B).

METHODOLOGY

The algorithm used by WebVar is shown below:

$$V_i = \sum_{j=1}^{N(N-1)/2} \frac{\delta_{ij}}{K_j} \tag{1}$$

where $N$ is the number of sequences in the alignment (so that the sum runs over all $N(N-1)/2$ pairwise comparisons), $\delta_{ij}$ is a measure of pair distance for site $i$ in the $j$-th pairwise comparison and $K_j$ is the overall genetic distance for the $j$-th pairwise comparison, as determined by the Kimura formula for both DNA and protein sequences. For DNA sequences, $\delta_{ij}$ is assumed to be 0, 1 or 2 depending on the nucleotide change observed at site $i$ in the $j$-th pairwise comparison (0 for no change, 1 for transitions, 2 for transversions). In the case of protein sequences, a series of substitution weight matrices was derived from the substitution frequencies used to create the BLOSUM substitution matrices according to the following equation:

$$\delta_{ij} = 1 - \frac{f_{ab}}{\frac{1}{2}(f_{aa} + f_{bb})} \tag{2}$$

where $f_{ab}$ is the frequency of observed substitutions between amino acid states $a$ and $b$ in the BLOSUM blocks and $f_{aa}$ and $f_{bb}$ are the observed frequencies of no change for amino acids $a$ and $b$, respectively. For each pairwise comparison, the appropriate weight matrix is selected on the basis of pairwise identity levels between the sequences compared. According to Equation (1), the relative contribution of an observed substitution at a given position to the measure of variability for that position will be inversely proportional to the corresponding pairwise genetic distance and proportional to the perceived dissimilarity of the two amino acid or nucleotide states encountered.
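As a rough illustration of Equations (1) and (2), the following Python sketch computes the site variability for a small DNA alignment and the protein weight for one amino acid pair. It is not the SiteVar code. It assumes that the "Kimura formula" for DNA is the Kimura two-parameter distance, that gaps and ambiguity codes are skipped, and that pairs with zero or undefined distance contribute nothing. All function names, and the optional `max_distance` cutoff argument, are hypothetical.

```python
import math
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def site_delta(a, b):
    """DNA pair distance: 0 no change, 1 transition, 2 transversion.

    Returns None for gaps or ambiguity codes (assumption: such sites are skipped).
    """
    if a not in BASES or b not in BASES:
        return None
    if a == b:
        return 0
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return 1 if same_class else 2


def k2p_distance(deltas):
    """Kimura two-parameter distance from per-site delta values (assumed K_j)."""
    scored = [d for d in deltas if d is not None]
    if not scored:
        return None
    p = scored.count(1) / len(scored)  # proportion of transitions
    q = scored.count(2) / len(scored)  # proportion of transversions
    a, b = 1 - 2 * p - q, 1 - 2 * q
    if a <= 0 or b <= 0:
        return None  # too divergent: the distance is undefined
    return -0.5 * math.log(a * math.sqrt(b))


def site_variability(alignment, max_distance=None):
    """Equation (1): V_i is the sum over sequence pairs of delta_ij / K_j."""
    v = [0.0] * len(alignment[0])
    for s, t in combinations(alignment, 2):
        deltas = [site_delta(x, y) for x, y in zip(s.upper(), t.upper())]
        k = k2p_distance(deltas)
        if not k:  # None (undefined) or zero (identical sequences)
            continue
        if max_distance is not None and k > max_distance:
            continue  # optional cutoff on pairwise distance
        for i, d in enumerate(deltas):
            if d:
                v[i] += d / k
    return v


def protein_delta(f_ab, f_aa, f_bb):
    """Equation (2): weight for a substitution between amino acids a and b."""
    return 1.0 - f_ab / (0.5 * (f_aa + f_bb))
```

Dividing each observed change by the pairwise distance means that a substitution seen between two nearly identical sequences adds more to the variability of its site than the same substitution seen between distant sequences, which is the first assumption stated in the Introduction.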

The number of sequences compared is significant, as it can reasonably be expected that the higher the number of sequences considered, the more reliable the estimate of the relative variability of the individual sites should be (Pesole and Saccone, 2001; Horner and Pesole, 2003).

IMPLEMENTATION

We have implemented SiteVar and SiteVarProt as Python scripts with a user interface written in Perl and PHP on a Web server, WebVar, hosted at the University of Milan. The WebVar server allows the user to upload a FASTA format multiple sequence alignment and to specify several analysis and result format options through a menu-driven interface. Default values for all settings are provided. Further facilities, such as independent estimates of site variability for different groups of sequences within the same dataset, will be added to the site in the future.
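Before uploading, it can help to confirm that a file really is an alignment, that is, that all FASTA records have the same length. The sketch below is one way to do this in Python. It is an illustration only and says nothing about how the WebVar server parses its input. The function name and the requirement of at least two sequences are assumptions.

```python
def read_fasta_alignment(text):
    """Parse FASTA text into (names, sequences) and check that it is aligned."""
    names, chunks = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            names.append(line[1:].strip())  # header line without the '>'
            chunks.append([])
        elif chunks:
            chunks[-1].append(line)  # sequence data may span several lines
        else:
            raise ValueError("sequence data found before the first header")
    sequences = ["".join(parts).upper() for parts in chunks]
    if len(sequences) < 2:
        raise ValueError("at least two sequences are needed for pairwise comparison")
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences differ in length, so this is not an alignment")
    return names, sequences
```

Gap characters are kept as they appear in the file, so each column index still refers to the same alignment position in every sequence.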

USING WebVar

The user interface for the WebVar site (http://www.pesolelab.it/Tools/WebVar.html) initially asks the user to choose between DNA and protein analysis. The user may then upload a FASTA format multiple sequence alignment for analysis. Results are presented graphically, allowing the user to visualize the relative variability of each position in the multiple sequence alignment (Fig. 1). The user may also download a tab-delimited text file with the relative variability of each site scaled between 0 and 1, and additionally scaled to the mean substitution rate for the dataset as a whole to allow direct comparison with rate estimates from other programs such as MrBayes or PAML. This file is compatible with Excel and other spreadsheet applications.

Several additional options for formatting of results are provided. These include the provision of a NEXUS character partition block dividing the residues into groups of similar relative variability, which can be appended to a NEXUS sequence alignment for phylogenetic analysis in PAUP* (Swofford, 1998) or MrBayes (Huelsenbeck and Ronquist, 2001), and a set of FASTA format sequence masks for the identification of the most variable sites. Furthermore, the user can define a maximum cutoff value for pairwise distances used in the estimation of site variability. This facility is useful in the analysis of highly divergent sequences, where mutational saturation (the phenomenon where multiple substitutions have occurred at single sites, undetected by pairwise comparison and resulting in underestimation of pairwise distance) has been shown to be problematic in the estimation of site variability (Horner and Pesole, 2003). With lower cutoff values, only comparisons between closely related sequences, which are less likely to be mutationally saturated, are used in the estimation of variability. The Website advises the user what proportion of possible comparisons has been used in the analysis.

ACKNOWLEDGEMENTS

This project has been partially funded by EMBnet contract 2001-TMPC-02 and Telethon. D.S.H. is funded by Marie Curie category 30 individual fellowship number MCFI-2001-00634.

REFERENCES

Gribaldo,S. and Philippe,H. (2002) Ancient phylogenetic relationships. Theor. Popul. Biol., 61, 391–408.
Horner,D.S. and Pesole,G. (2003) The estimation of relative site variability among aligned homologous protein sequences. Bioinformatics, 19, 600–606.
Huelsenbeck,J.P. and Ronquist,F. (2001) MrBAYES: Bayesian inference of phylogenetic trees. Bioinformatics, 17, 754–755.
Pesole,G. and Saccone,C. (2001) A novel method for estimating substitution rate variation among sites in a large dataset of homologous DNA sequences. Genetics, 157, 859–865.
Swofford,D.L. (1998) PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Sinauer, Sunderland, MA.
