BIOINFORMATICS APPLICATIONS NOTE

MView: a web-compatible database search or multiple alignment viewer

Abstract

Summary: MView is a tool for converting the results of a sequence database search into the form of a coloured multiple alignment of hits stacked against the query. Alternatively, an existing multiple alignment can be processed. In either case, the output is simply HTML, so the result is platform independent and does not require a separate application or applet to be loaded.
Availability: Free from http://www.sander.ebi.ac.uk/mview/ subject to copyright restrictions.
Contact: [email protected]

Often when running FASTA (Pearson, 1990) or BLAST (Altschul et al., 1990), it is desired to visualize the database hits stacked against the query sequence. The BEAUTY post-processor (Worley et al., 1996) parses BLAST output and shows the query and hits as a stack of line segments, but this is neither generally applicable, nor are the sequences themselves stacked. At the other extreme, given a full multiple alignment generated by other tools, viewers [e.g. Belvu (Sonnhammer, 1995), an X window client], editors [e.g. Cinema (Attwood et al., 1997), a Java applet] and formatters for hard copy [e.g. Boxshade (Hofmann and Baron, 1992)] exist.

By comparison, MView is intended as a filter for post-processing searches or alignments to generate a reformatted alignment with HTML mark-up. Its principal uses so far are as an embedded tool inside Web applications for viewing precomputed searches and alignments under GeneQuiz (Scharf et al., 1994; Andrade et al., 1998), and in displaying FASTA/BLAST search output (Lopez, 1997).

Driven from the command line, the program comprises a back end implementing different parsers, and a front end providing a set of formatting options applied to the internally assembled (or simply read in) multiple alignment. HTML mark-up may be switched off so that plain ASCII output can be produced for loading into another tool, such as Belvu.
The basic output consists of columns of optional descriptor information followed by a column of alignment strings (see Figure 1). Descriptors include a number giving the original rank of a sequence in the input, a sequence identifier, which may be hyperlinked to the SRS system (Etzold et al., 1996), a text field, a field of scoring information from searches, and a field reporting the per cent identity of each sequence with respect to a preferred sequence in the alignment, usually the query in the case of a search.

Multiple alignments require minimal parsing and are subjected only to formatting stages. Search hits are first stacked against the ungapped query sequence and require special processing. Ungapped search (e.g. BLAST) hit fragments are assembled into a single string by overlaying them preferentially by score onto a template string, while gapped search (e.g. FASTA) hits have columns corresponding to query gaps excised. Consequently, the stacked alignment is a patchwork of reconstituted sequences that nevertheless is informative and visually striking.

MView offers three kinds of input filtering. A threshold maximum pairwise sequence identity can be specified to screen out close homologues, and an upper bound can be set on the number of sequences to be reported. Additionally, specific to the type of input, cut-offs on, for example, BLAST p-value or score, can be set.

Three colour schemes are supported: none, colour residues by property using amino acid physicochemical classes (Taylor, 1986), or colour by identity and property. In the last case, residues are coloured if identical to their counterpart in any preferred sequence in the alignment, normally the query in the case of a search. This reference sequence is also used to calculate the displayed per cent identities.

Formatting options include pagination of the alignment: the default is to produce one single scrollable band, but this can be broken into panes by specifying a desired number of alignment columns per pane. A ruler can be attached to the top of the alignment, and various other minor settings are possible, such as a choice of gap character.
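The stacking of search hits described above can be illustrated with a minimal sketch. It is written in Python rather than MView's own Perl, and it is not MView's code: the names `Fragment`, `overlay_fragments`, `excise_query_gaps` and `percent_identity` are hypothetical. Two assumptions are made: that "preferentially by score" means a higher-scoring fragment wins wherever fragments overlap, and that per cent identity is counted only over columns where the hit has a residue.

```python
from dataclasses import dataclass

GAP = "-"  # gap character; MView lets the user choose one


@dataclass
class Fragment:
    """One ungapped hit fragment (hypothetical type), in 1-based query coordinates."""
    query_start: int
    residues: str
    score: float


def overlay_fragments(query_length, fragments, gap=GAP):
    """Assemble ungapped fragments of one hit into a single row.

    Assumption: where fragments overlap, the higher-scoring one is kept.
    """
    row = [gap] * query_length  # template string, initially all gaps
    # Paint the lowest score first so that better fragments overwrite it.
    for frag in sorted(fragments, key=lambda f: f.score):
        for offset, residue in enumerate(frag.residues):
            pos = frag.query_start - 1 + offset
            if 0 <= pos < query_length:
                row[pos] = residue
    return "".join(row)


def excise_query_gaps(aligned_query, aligned_hit, gap=GAP):
    """For a gapped pairwise alignment, drop columns where the query has a gap."""
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned strings must have equal length")
    return "".join(h for q, h in zip(aligned_query, aligned_hit) if q != gap)


def percent_identity(reference, row, gap=GAP):
    """Per cent identity of a row to the reference sequence.

    Assumption: only columns where the row has a residue are counted.
    """
    aligned = [(r, s) for r, s in zip(reference, row) if s != gap]
    if not aligned:
        return 0.0
    matches = sum(1 for r, s in aligned if r == s)
    return 100.0 * matches / len(aligned)


if __name__ == "__main__":
    query = "MKTAYIAKQR"
    hit_fragments = [
        Fragment(query_start=1, residues="MKTA", score=50.0),
        Fragment(query_start=3, residues="SAYI", score=80.0),
        Fragment(query_start=8, residues="KQR", score=30.0),
    ]
    row = overlay_fragments(len(query), hit_fragments)
    print(row)                                    # MKSAYI-KQR
    print(excise_query_gaps("MK-TAY", "MKLTGY"))  # MKTGY
    print(round(percent_identity(query, row), 1))
```

Because every row produced this way has the same length as the ungapped query, rows can be stacked directly beneath it, which is what makes the patchwork alignment possible.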
© Oxford University Press

Fig. 1. A single pane of output from MView assembled from a search of P2CA_RAT (a class 2C protein phosphatase) using BLASTP 1.4.9 against a non-redundant database. Fields from the left are: rank of each hit; SRS-linked identifier; score, p-value, fragment count as reported by BLASTP; per cent identity of hit fragments to query. Maximum pairwise identity was set at 80%, colouring is by identity to query and by property, with other residues in grey. These proteins belong to an extended family (Bork et al., 1996) that also includes a mitochondrial phosphatase, the adenylate cyclases and several bacterial phosphatases. The BLAST scores in the lower rows are weak, but the tool readily allows the user to identify promising candidates for further study.

Parsers all inherit from a generic class offering incremental parsing of record-based files, and are quite easy to add. At the time of writing, parsers for protein sequence searches have been implemented for FASTA 1.6, 2.0, 3.0, BLASTP 1.4, BLAST2 (WashU) 2.0, BLAST2 (NCBI) 2.0, PSI-BLAST 2.0 (Altschul et al., 1997). Multiple alignment formats recognized are Pearson/FASTA, MSF, CLUSTALW, MaxHom/HSSP, and a trivial one comprising paired columns of identifiers and aligned sequences.

MView and its underlying class libraries are implemented in Perl, Version 5 (Wall et al., 1996) for UNIX, and should be easily portable to other systems. Formatting and colouring of HTML alignments require a fixed-width font (e.g. Courier) and support for the tag, so a recent version of a browser such as Netscape is recommended.
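The parser design mentioned above, a generic class offering incremental parsing of record-based files from which format-specific parsers inherit, might look roughly like the following sketch. It is written in Python rather than Perl and does not reproduce MView's class libraries: `RecordParser`, `FastaAlignmentParser` and their method names are hypothetical, chosen only to show why adding a format can be cheap under such a design.

```python
import io


class RecordParser:
    """Generic incremental parser (hypothetical): yields one record at a time."""

    def __init__(self, stream):
        self.stream = stream  # any iterable of text lines, e.g. an open file

    def starts_record(self, line):
        """Return True if this line begins a new record (format-specific)."""
        raise NotImplementedError

    def parse_record(self, lines):
        """Turn the lines of one record into a parsed object (format-specific)."""
        raise NotImplementedError

    def __iter__(self):
        buffer = []
        for line in self.stream:
            line = line.rstrip("\n")
            if self.starts_record(line) and buffer:
                # A new record begins: emit the one collected so far.
                yield self.parse_record(buffer)
                buffer = []
            # Lines before the first record start are skipped.
            if buffer or self.starts_record(line):
                buffer.append(line)
        if buffer:
            yield self.parse_record(buffer)


class FastaAlignmentParser(RecordParser):
    """Pearson/FASTA format: a '>' header line followed by sequence lines."""

    def starts_record(self, line):
        return line.startswith(">")

    def parse_record(self, lines):
        header = lines[0][1:].strip()
        identifier, _, description = header.partition(" ")
        sequence = "".join(part.strip() for part in lines[1:])
        return identifier, description, sequence


if __name__ == "__main__":
    text = ">seq1 first protein\nMKT-AY\nIAK\n>seq2\nMKSAAY\n"
    for identifier, description, sequence in FastaAlignmentParser(io.StringIO(text)):
        print(identifier, description, sequence)
```

In this sketch a new format only has to say where a record starts and how to interpret its lines; the base class reads the input one record at a time, so a large search report need not be held in memory in full.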

Acknowledgements

Thanks to S.Hoersch, R.Lopez, C.Reich, A.Franchini and M.Andrade for testing and suggestions.

References

Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Altschul,S.F., Madden,T.L., Schaeffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Andrade,M., Brown,N.P., Leroy,C., Hoersch,S., Reich,C., Franchini,A., de Daruvar,A., Tamames,J., Valencia,A., Ouzounis,C. and Sander,C. (1998) GeneQuiz: Automated genome sequence analysis and annotation (manuscript in preparation). http://www.sander.ebi.ac.uk/genequiz/.
Attwood,T., Payne,A.W.R., Michie,A.D. and Parry-Smith,D.J. (1997) A Colour INteractive Editor for Multiple Alignments—CINEMA. EMBnet.news, 3(3). http://www2.ebi.ac.uk/embnet/news/.
Bork,P., Brown,N.P., Hegyi,H. and Schultz,J. (1996) The protein phosphatase 2C (PP2C) superfamily: Detection of bacterial homologues. Protein Sci., 5, 1421–1425.
Etzold,T., Ulyanov,A. and Argos,P. (1996) SRS: information retrieval system for molecular biology data banks. Methods Enzymol., 266, 114–128.
Hofmann,K. and Baron,M.D. (1992) BOXSHADE. ISREC, Switzerland; Institute for Animal Health, U.K. http://ulrec3.unil.ch/software/BOX_form.html.
Lopez,R. (1997) Fasta3 and blast service at the EBI. EMBnet.news, 4(1). http://www2.ebi.ac.uk/embnet/news/.
Pearson,W.R. (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol., 183, 63–98.
Scharf,M., Schneider,R., Casari,G., Bork,P., Valencia,A., Ouzounis,C. and Sander,C. (1994) GeneQuiz: a workbench for sequence analysis. In Altman,R., Brutlag,D., Karp,P., Lathrop,R. and Searls,D. (eds), Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, pp. 348–353.
Sonnhammer,E. (1995) Belvu—a multiple alignment viewer. Sanger Centre, UK. http://www.sanger.ac.uk/~esr/Belvu.html.
Taylor,W.R. (1986) The classification of amino acid conservation. J. Theor. Biol., 119, 205–218.
Wall,L., Christiansen,T. and Schwartz,R.L. (1996) Programming Perl, 2nd edn. Nutshell Handbooks, O’Reilly & Associates, Inc., Sebastopol, CA, USA.
Worley,K.C., Wiese,B.A. and Smith,R.F. (1996) BEAUTY: An enhanced BLAST-based search tool that integrates multiple biological information resources into sequence similarity search results. Genome Res., 5, 173–184.
