DSC: Public Domain Protein Secondary Structure Prediction

Ross D. King (1,3), Mansoor Saqi (2), Roger Sayle (2), Michael J.E. Sternberg (1)

Introduction

DSC (Discrimination of protein Secondary structure Class) is a protein secondary structure prediction method that uses multiply aligned homologous sequences and linear statistics. This paper describes the public domain versions of DSC (ftp://ftp.icnet.uk/icrf-public/bmm/king/dsc/dsc.tar.z). The advantages of DSC are:

1. DSC has high accuracy. It has a prediction accuracy of 70.1% (per residue) on a standard set of 126 proteins. This percentage was confirmed by the recent CASP2 blind prediction challenge (see below).
2. DSC is based on simple linear statistics. Existing high-accuracy prediction methods are 'black-box' predictors based on complex non-linear statistics (e.g. neural networks in PHD (Rost and Sander, 1993) and nearest-neighbour methods in NNSSP (Salamov and Solovyev, 1995)).
3. DSC's accuracy is predictable a priori. This permits evaluation of the utility of the prediction, e.g. it is possible to identify in advance a set of chains whose predictions have a mean accuracy of > 80%.
4. DSC source code is freely available (in C). This allows DSC to be easily ported to other systems and to be used on proprietary sequences (e.g. DSC is running in-house at Glaxo Wellcome plc).
5. DSC has a prediction server (see below). This server can generate its own sequence alignments.

A full scientific description of DSC is given elsewhere (King and Sternberg, 1996). There were two aims in developing DSC: to obtain high accuracy by identification of a set of concepts important for prediction; and to obtain insights into the folding process. The important concepts in secondary structure prediction were identified as: residue conformational propensities; sequence edge effects; moments of hydrophobicity; position of insertions and deletions in aligned homologous sequence; moments of conservation; auto-correlation of secondary structure; residue ratios; secondary-structure feedback effects; and filtering the prediction to remove isolated predictions (King and Sternberg, 1996). Explicit use of edge effects, moments of conservation, and auto-correlation are new to DSC. The relative importance of the concepts used in prediction was analysed by step-wise addition of information and the examination of weights in the discrimination function. The simple and explicit structure of the prediction allows the method to be easily reimplemented.
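To illustrate what a prediction based on a linear discrimination function looks like in general, the following minimal Python sketch scores each secondary structure class as a weighted sum of per-residue attributes and picks the highest-scoring class. It is not the DSC implementation: the attribute names, the weights, and the function name `discriminate` are hypothetical values chosen only so that the example runs, and the attributes and weights actually used by DSC are those described in King and Sternberg (1996).

```python
from typing import Dict, Tuple

CLASSES = ("helix", "strand", "coil")

# Hypothetical weights, one linear function per class.
# These are NOT the attributes or weights used by DSC.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "helix": {"helix_propensity": 1.0, "strand_propensity": -0.5,
              "hydrophobic_moment": 0.5},
    "strand": {"helix_propensity": -0.5, "strand_propensity": 1.0,
               "hydrophobic_moment": 0.2},
    "coil": {"helix_propensity": -0.4, "strand_propensity": -0.4,
             "hydrophobic_moment": -0.3},
}
BIAS: Dict[str, float] = {"helix": 0.0, "strand": 0.0, "coil": 0.3}


def discriminate(features: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Return the class with the largest linear score, plus all scores."""
    scores = {}
    for cls in CLASSES:
        # Linear score: bias plus the weighted sum of the attributes.
        scores[cls] = BIAS[cls] + sum(
            WEIGHTS[cls][name] * value for name, value in features.items()
        )
    best = max(CLASSES, key=lambda cls: scores[cls])
    return best, scores


if __name__ == "__main__":
    residue = {"helix_propensity": 1.2, "strand_propensity": 0.3,
               "hydrophobic_moment": 0.6}
    predicted, scores = discriminate(residue)
    print(predicted, scores)
```

Because each score is a plain weighted sum, the contribution of every attribute can be read directly from its weight, which is the sense in which a linear method is easier to inspect than a neural network or nearest-neighbour predictor.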

1 Biomolecular Modelling Laboratory, Imperial Cancer Research Fund, Lincoln's Inn Fields, London WC2A 3PX, U.K.
2 Bioinformatics Group, Dept. of Biomolecular Structure, Glaxo Medicines Research Centre, Gunnels Wood Road, Stevenage, Herts, SG1 2NY, U.K.
3 To whom correspondence should be sent. Current address: Department of Computer Science, The University of Wales Aberystwyth, Aberystwyth, SY23 3DB, Wales, U.K. Tel: +44 1970-622432 Fax: +44 1970-622455 Email [email protected]

Keywords: statistics, WWW, software, sequence alignment

Stand-alone DSC

A stand-alone version of DSC (written in C) is available by ftp. This version of DSC requires the input of a set of aligned sequences: it generates as output the predicted secondary structure of the sequence. DSC also estimates the accuracy of its prediction for the sequence; this gives an indication of reliability. DSC can recognise aligned sequences in six formats: MSF, PHD output, CLUSTALW, PIR, Fasta, and simple ASCII. For each residue, the secondary structure class is predicted to be either helix (alpha-helix or 3-10 helix), beta-strand, or random coil, and the estimated probability of each class is displayed. The output for each input format is tailored to return the predictions in an appropriate form. For example, with an input MSF file, the output file includes the original MSF file with four extra sequences: the predicted class and the predicted probabilities for each class. The predictions are aligned with the sequence predicted.

The stand-alone version of DSC incorporates three prediction features not included in King and Sternberg (1996). The rules used to filter the final predictions can now be used iteratively. DSC can also now remove any remaining isolated residues from the prediction, i.e. residues that do not neighbour residues of the same secondary structure class. Finally, DSC can now remove sections of aligned sequence that are poorly aligned to the sequence to be predicted. The default is that if a sequence of 40 residues has a per-residue identity of less than 20% then the middle 21 residues are masked out and not used in the prediction.

Web Interface to DSC

In addition to the stand-alone versions of DSC, two WWW services based on DSC have been developed. The simplest one is a WWW page that takes a set of multiply aligned sequences and outputs a prediction of the secondary structure of the query sequence. This is simply a WWW interface to the distributed DSC prediction method.

The second service is a WWW page that accepts a single sequence, generates a multiple alignment for the sequence, and then uses DSC to predict the secondary structure of the enquiry sequence using the aligned sequences. The multiple alignment is done, in keeping with the philosophy behind DSC, using public domain software. It is planned to eventually distribute the alignment method bundled with DSC. The procedure used to form the multiple alignment is as follows:

1. Blastp (Altschul et al., 1990) is used to identify possible homologous sequences to the query sequence. Blastp was chosen as the first step for speed and efficiency. OWL (Bleasby and Wootton, 1990) is used as the search database as it contains a large and reliable set of sequences. The Poisson threshold used by Blastp can be set by the user, the default value being 1e-09.
2. The retrieved sequences are aligned with the enquiry sequence using the Smith-Waterman method (Smith and Waterman, 1981). Homologous sequences that have long sequence extensions at the N or C terminal end are then trimmed to make the sequences of almost equal extension. This is done to improve the quality of the final sequence alignment.
3. The sequences are globally aligned using Clustalw (Thompson et al., 1994).
4. The aligned sequences are then read into DSC and the alignment and the prediction of the probe sequence forwarded to the user.

The various component programs used in forming the alignment/prediction are joined together using a Perl wrapper program. The Perl cgi-bin library is used to interface with HTML (Brenner and Aoki, 1996).

Discussion and Conclusion

In the design of the alignment method many choices were made, e.g. the default Poisson threshold, how to trim sequences, or how informative different levels of homology are, etc. It cannot be assumed the choices were optimal and more research is needed to determine the best methods for aligning sequences for use in secondary structure prediction.

DSC was recently blind tested at CASP2 (http://PredictionCenter.llnl.gov). The 70% accuracy of DSC was confirmed, making DSC the second most successful method in the competition (more accurate than all human experts). Only the PHD method, based on neural networks and a larger set of training proteins, was generally more accurate. On the 15 proteins predicted by both PHD and DSC, DSC had an accuracy of 70% and PHD 73% (per protein, the CASP2 standard measure); PHD was more accurate than DSC 11 times, DSC more accurate than PHD 3 times, and they were of equal accuracy once. Direct comparison of the servers was not possible at that time.

Secondary structure prediction is the foundation of almost all structural prediction and no single prediction method is perfect. It is therefore useful when predicting a sequence to use several prediction methods, and to interpret the meaning of the predictions. DSC is an accurate and reliable prediction method and capable of correctly identifying regions of secondary structure missed by other methods. DSC is an important addition to the available suite of secondary structure prediction programs.
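The difference between a per-residue accuracy (as quoted for the 126-protein set) and a per-protein accuracy (the CASP2 measure quoted above) can be made concrete with a short Python sketch. It is illustrative only: the function names and the toy strings of H/E/C labels are assumptions made for this example, and the code is not taken from the DSC distribution or from the CASP2 evaluation software.

```python
from typing import List, Sequence, Tuple

Chain = Tuple[str, str]  # (predicted classes, observed classes), e.g. "HHEC"


def chain_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues in one chain whose class is predicted correctly."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("chains must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def per_residue_accuracy(chains: List[Chain]) -> float:
    """Pool all residues from all chains, so long chains count for more."""
    correct = sum(sum(p == o for p, o in zip(pred, obs)) for pred, obs in chains)
    total = sum(len(obs) for _, obs in chains)
    return 100.0 * correct / total


def per_protein_accuracy(chains: List[Chain]) -> float:
    """Average the accuracy of each chain, so every chain counts equally."""
    scores = [chain_accuracy(pred, obs) for pred, obs in chains]
    return sum(scores) / len(scores)


def head_to_head(a: Sequence[float], b: Sequence[float]) -> Tuple[int, int, int]:
    """Count chains where method a is better, method b is better, or they tie."""
    a_better = sum(x > y for x, y in zip(a, b))
    b_better = sum(x < y for x, y in zip(a, b))
    return a_better, b_better, len(a) - a_better - b_better


if __name__ == "__main__":
    toy = [("HHHHCCCC", "HHHHCCCC"), ("EC", "CC")]  # made-up data
    print(per_residue_accuracy(toy), per_protein_accuracy(toy))
```

On the made-up data, the long chain is fully correct and the short chain is half correct, so the pooled per-residue figure (9 of 10 residues) is higher than the per-protein mean of the two chain accuracies. The two measures therefore need not agree on the same set of predictions, and figures quoted under different measures should be compared with that in mind.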

Availability

The code for DSC (in C) is available at: ftp://ftp.icnet.uk/icrf-public/bmm/king/dsc/dsc.tar.z

The WWW interface to DSC that inputs an already formed multiple sequence alignment is at: http://www.icnet.uk/bmm/dsc/dsc_read_align.html

The WWW interface to DSC that inputs a single sequence and forms a multiple sequence alignment is at: http://www.icnet.uk/bmm/dsc/dsc_form_align.html

References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990). Basic local alignment search tool. J. Mol. Biol., 215, 403-410.
Bleasby, A.J., and Wootton, J.C. (1990). Construction of validated, non-redundant composite protein sequence databases. Protein Eng., 3, 153-159.
Brenner, S.E., and Aoki, W. (1996). Introduction to CGI/Perl. New York: MIS:Press.
King, R.D., and Sternberg, M.J.E. (1996). Identification and application of the concepts important for accurate and reliable protein secondary structure prediction. Protein Science, 5, 2298-2310.
Rost, B., and Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232, 584-599.
Salamov, A.A., and Solovyev, V.V. (1995). Prediction of protein secondary structure by combining nearest-neighbour algorithms and multiple sequence alignments. J. Mol. Biol., 247, 11-15.
Smith, T.F., and Waterman, M.S. (1981). Identification of common molecular subsequences. J. Mol. Biol., 147, 195-197.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22, 4673-4680.