ENDscript: a workflow to display sequence and structure information

BIOINFORMATICS APPLICATIONS NOTE

Vol. 18 no. 5 2002 Pages 767–768

Patrice Gouet 1,∗ and Emmanuel Courcelle 2

1 Laboratoire de Bio-Cristallographie, IBCP-CNRS UMR 5086, 7 passage du Vercors, 69367 Lyon Cedex 07, France and 2 Laboratoire de Biologie Moléculaire et des Relations Plantes Microorganismes, BP 27 Chemin de Borde Rouge, 31326 Castanet Tolosan, France

Received on October 16, 2001; revised on November 4, 2001; accepted on December 17, 2001

ABSTRACT

Summary: ENDscript is a web server that groups popular programs such as BLAST, Multalin and DSSP. It takes as its query the co-ordinates file of a protein in Protein Data Bank format and generates PostScript and PNG figures showing residues conserved after a multiple alignment against homologous sequences, secondary structure elements, accessibility, hydropathy and intermolecular contacts. The user can thus quickly relate the 1D, 2D and 3D information of a protein of known structure.

Availability: http://genopole.toulouse.inra.fr/ENDscript

Contact: [email protected]

∗ To whom correspondence should be addressed.

© Oxford University Press 2002

BACKGROUND

ENDscript is a web tool in the tradition of PDBsum (Laskowski et al., 1997), a database of the known 3-dimensional structures of proteins, and of the suite of programs developed in the group of Geoff Barton to perform sequence alignments (AMPS, Barton and Sternberg, 1987) and colouring (ALSCRIPT, Barton, 1993). ENDscript takes as input a protein co-ordinate file in Protein Data Bank format (Berman et al., 2000) and runs the following programs: DSSP (Kabsch and Sander, 1983) to extract secondary structure elements and to compute accessibility per residue; CNS (Brünger et al., 2000) to calculate distances for intermolecular contacts and protein–ligand interactions; BLAST (Altschul et al., 1997) to search databases such as SWISS-PROT (Bairoch and Apweiler, 2000) or PDBaa (sequences derived from the 3-dimensional Protein Data Bank) for homologous sequences; Multalin (Corpet, 1988) or ClustalW (Thompson et al., 1994) to perform multiple sequence alignments; ESPript (Gouet et al., 1999), BobScript (Esnouf, 1997) and MolScript (Kraulis, 1991) to render all this information in coloured PostScript files; and ImageMagick (ImageMagick Studio LLC) to convert these files to images in PNG format. Such figures are sought routinely by scientists who want to understand the relations between the 3D structure of a protein and its sequence. ENDscript aims to help them deal with the ever-growing flow of information due to the development of structural genomics.

The ENDscript CGI interface is based on that of ESPript, which was completely rewritten. A few additional programs were written in Fortran and in Perl to pipe information between the different stages of ENDscript. The server runs on Unix and Linux platforms.
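As an illustration of the input stage, the sketch below reads the one-letter sequence of each chain from a co-ordinate file in Protein Data Bank format. It is not the code used by the server: the function name `chain_sequences` is hypothetical, only the 20 standard residues are mapped (anything else becomes `X`), and only the first model of a multi-model NMR entry is read. It relies on the fixed-column layout of ATOM records (atom name in columns 13–16, residue name in 18–20, chain identifier in 22, residue number and insertion code in 23–27).

```python
# Sketch only: an assumed, minimal reader for PDB-format ATOM records.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequences(pdb_lines):
    """Return {chain_id: one-letter sequence} built from C-alpha ATOM records."""
    residues_by_chain = {}
    seen = set()
    for line in pdb_lines:
        if line.startswith("ENDMDL"):
            break  # multi-model (e.g. NMR) entry: keep the first model only
        if not line.startswith("ATOM") or line[12:16].strip() != "CA":
            continue  # one C-alpha atom per residue is enough for the sequence
        res_name = line[17:20]
        chain_id = line[21]
        res_key = (chain_id, line[22:27])  # residue number plus insertion code
        if res_key in seen:
            continue  # second alternate conformation of a residue already read
        seen.add(res_key)
        letter = THREE_TO_ONE.get(res_name, "X")  # X marks a non-standard residue
        residues_by_chain.setdefault(chain_id, []).append(letter)
    return {chain: "".join(letters) for chain, letters in residues_by_chain.items()}


if __name__ == "__main__":
    import sys

    # Usage: python chain_sequences.py structure.pdb
    with open(sys.argv[1]) as handle:
        for chain, sequence in chain_sequences(handle).items():
            print(chain, len(sequence), sequence)
```

Reading only C-alpha atoms keeps the sketch short. Residues that are missing from the co-ordinates are therefore absent from the returned sequence.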

RUNNING ENDSCRIPT

The user can upload the co-ordinates file of a protein from a workstation to the ENDscript server or enter a PDB identifier. Both X-ray and NMR structures are supported. A click on the RUN button of the interface leads to the first PostScript figure, produced by ESPript. As an example, Figure 1(a) was obtained from the PDB file deposited under the code 1G94. It corresponds to a psychrophilic alpha-amylase solved with a bound hepta-saccharide and crystallized with one monomer in the asymmetric unit (Aghajari, 1998). The sequence of each protein chain contained in the PDB query (CHAIN A in our example) is checked by an in-house program named SPDB. Each monomer appears in the resulting figure as a one-letter code sequence, displayed with additional information. Secondary structure elements are shown above the sequence, along with grey stars marking residues modelled in alternate conformations in crystallographic structures. Relative accessibility is shown below the sequence by a blue-coloured bar. A second bar presents the hydropathy, calculated from the sequence according to the algorithm of Kyte and Doolittle (1982). Intermolecular contacts are represented by characters on the bottom line. From their colour and format, the user can deduce whether a residue is involved in a short or long, crystallographic or non-crystallographic contact with a neighbouring molecule. A different code is used on the same line to highlight residues interacting with hetero-compounds as well as cysteines involved in disulfide bridges.

Fig. 1. Sample outputs of ENDscript obtained from the PDB file 1G94, showing (a) the partial sequence of 1G94 (448 residues in total) with (i) on the top, secondary structures (squiggles are helices, arrows beta strands, TT letters beta turns) and residues in alternate conformations (grey stars), (ii) on the bottom, a bar of accessibility (black is accessible, grey intermediate, white buried), a bar of hydropathy (black is hydrophobic, grey hydrophilic, white neutral) and a line of molecular contacts (letters are crystallographic contacts, quotes are protein–ligand interactions, digits are disulphide bridges); (b) an excerpt from a multiple alignment, obtained after a BLAST search against the Protein Data Bank sequence database (PDBaa), in which secondary structure elements of all sequences are displayed; (c) a 3D representation of the full protein, coloured according to sequence similarity from white (low) to black (high).

In the next step, the user can check a box labelled ‘enable the BLAST search’. After a click on the RUN button, a new ESPript figure is produced, displaying the result of a multiple alignment of the sequence of the first monomer of the query with matching sequences detected by a BLAST search in either the SWISS-PROT or the PDBaa database. The user can show on the alignment the secondary structure elements of all sequences of known 3D structure, as in Figure 1(b). A final PostScript file is generated by BobScript (Figure 1(c)). The protein is drawn in ribbon mode and coloured according to sequence similarity. The user can also generate a script written in Virtual Reality Modeling Language (VRML) with the program MolScript (Kraulis, 1991), in order to rotate the ribbon representation of the protein on screen. When a multiple alignment has been performed, a link to the web server PHYLODENDRON (copyright 1997 by D. G. Gilbert) is available in the result frames to draw phylogenetic trees.
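The hydropathy bar mentioned earlier in this section follows the method of Kyte and Doolittle (1982), which averages a per-residue hydropathy index over a window sliding along the sequence. The sketch below shows that calculation with the published index values. The function name `hydropathy_profile`, the default window of 9 residues and the shortening of the window at the chain ends are assumptions made for illustration, not settings documented for ENDscript.

```python
# Kyte-Doolittle hydropathy index for the 20 standard residues.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Mean hydropathy in a window centred on each residue.

    The window size and the shortened windows at both ends are assumptions
    of this sketch. A KeyError is raised for non-standard letters.
    """
    half = window // 2
    values = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    profile = []
    for i in range(len(values)):
        start = max(0, i - half)
        stop = min(len(values), i + half + 1)
        segment = values[start:stop]
        profile.append(sum(segment) / len(segment))
    return profile


if __name__ == "__main__":
    # First residues of 1G94 chain A, as they appear in Figure 1(a).
    for residue, score in zip("TPTTFVHLFEW", hydropathy_profile("TPTTFVHLFEW")):
        print(residue, round(score, 2))
```

Positive averages indicate hydrophobic stretches and negative averages hydrophilic ones. A display such as the bar in Figure 1(a) would then map these values onto a small number of shades.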


ACKNOWLEDGEMENT

We thank Dr Richard Haser and Dr Daniel Kahn for their support in the development of ENDscript.


REFERENCES

Aghajari,N. (1998) PhD Thesis, Université d’Aix Marseille et Université de Liège.
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.
Barton,G.J. (1993) ALSCRIPT a tool to format multiple sequence alignments. Protein Eng., 6, 37–40.
Barton,G.J. and Sternberg,M.J. (1987) A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons. J. Mol. Biol., 198, 327–337.
Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
Brünger,A.T., Adams,P.D., Clore,G.M., DeLano,W.L., Gros,P., Grosse-Kunstleve,R.W., Jiang,J.S., Kuszewski,J., Nilges,M., Pannu,N.S., Read,R.J., Rice,L.M., Simonson,T. and Warren,G.L. (2000) Crystallography & NMR system: a new software suite for macromolecular structure determination. Acta Crystallogr. D, 54, 905–921.
Corpet,F. (1988) Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res., 16, 10881–10890.
Esnouf,R.M. (1997) An extensively modified version of MolScript that includes greatly enhanced coloring capabilities. J. Mol. Graphics, 15, 132–134.
Gouet,P., Courcelle,E., Stuart,D.I. and Metoz,F. (1999) ESPript: multiple sequence alignments in PostScript. Bioinformatics, 15, 305–308.
Kabsch,W. and Sander,C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577–2637.
Kraulis,P.J. (1991) MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. J. Appl. Cryst., 24, 946–950.
Kyte,J. and Doolittle,R. (1982) A simple method for displaying the hydropathic character of a protein. J. Mol. Biol., 157, 105–132.
Laskowski,R.A., Hutchinson,E.G., Michie,A.D., Wallace,A.C., Jones,M.L. and Thornton,J.M. (1997) PDBsum: a Web-based database of summaries and analyses of all PDB structures. Trends Biochem. Sci., 22, 488–490.
Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
