Vol. 16 no. 3 2000 Pages 245–250
BIOINFORMATICS
Adaptive encoding neural networks for the recognition of human signal peptide cleavage sites
B. Jagla 1 and J. Schuchhardt 2,∗
1 Freie Universität Berlin, Institut für Med./Techn. Physik und Lasermedizin, Director Prof. Dr. Ing. G. Müller, Prof. h.c. Dr. h.c., Krahmerstrasse 6-10, 12045 Berlin, Germany and 2 Humboldt Universität zu Berlin, Innovationskolleg Theoretische Biologie, Invalidenstraße 43, 10115 Berlin, Germany
Received on September 1, 1999; revised on October 29, 1999; accepted on December 13, 1999
Abstract
Motivation: Data representation and encoding are essential for classification of protein sequences with artificial neural networks (ANN). Biophysical properties are appropriate for low dimensional encoding of protein sequence data. However, in general there is no a priori knowledge of the relevant properties for extraction of representative features.
Results: An adaptive encoding artificial neural network (ACN) for recognition of sequence patterns is described. In this approach parameters for sequence encoding are optimized within the same process as the weight vectors by an evolutionary algorithm. The method is applied to the prediction of signal peptide cleavage sites in human secretory proteins and compared with an established predictor for signal peptides.
Conclusion: Knowledge of physico-chemical properties is not necessary for training an ACN. The advantage is a low dimensional data representation leading to computational efficiency, easy evaluation of the detected features, and high prediction accuracy.
Availability: A cleavage site prediction server is located at the Humboldt University http://itb.biologie.hu-berlin.de/~jo/sig-cleave/ACNpredictor.cgi
Contact:
[email protected];
[email protected]
Introduction
Classification of patterns in biological sequences is an important task in bioinformatics. Applications range from the identification of functional motifs in DNA sequences (e.g. splice sites or promoter sites) to the prediction of protein secondary and tertiary structure. Artificial neural networks (ANN) are an established method for pattern classification (Baldi and Brunak, 1998; Schneider and Wrede, 1998).
∗ To whom correspondence should be addressed.
© Oxford University Press 2000
A prerequisite for applying an ANN to sequence data is the translation of the letters or words of the sequence into real numbers, yielding numerical vectors that are then processed by the neural network. This is not a trivial problem, since the solvability of the classification task may crucially depend on a proper data representation. In most applications of artificial neural networks to sequence data, distributed encoding is used (Baldi and Brunak, 1998). Employing orthogonal unit vectors, implicit correlations between residues are avoided; however, if the underlying alphabet is large, this approach leads to a large number of parameters, which is unfavourable for statistical and computational reasons. For example, an amino acid sequence window encompassing Ns = 19 amino acids will be mapped to a 19 × 20 = 380-dimensional real vector.
This number can be reduced by low dimensional encoding. Using physico-chemical properties, e.g. hydrophobicity and volume (Nc = 2), the sequence window of length 19 (Ns = 19) will be mapped to a 19 × 2 = 38-dimensional vector. But the right choice of properties is crucial and generally difficult. In fact, the relevant properties are usually not known in advance. Determination of an optimal set of properties is a combinatorial problem (Weiss and Herzel, 1998) and may require extensive calculations even for few properties.
We apply the method of weight sharing ANN (LeCun et al., 1989; Rumelhart et al., 1986) in order to circumvent the extensive search and to automatically find a classifier with a low dimensional encoding matrix (Lohmann et al., 1996; Schneider and Wrede, 1998). The adaptive encoding approach is able to extract completely new features from scratch and completely avoids the problem of combinatorial search in a predefined search space of properties. The general idea of a weight sharing approach is to profit from correlations in the input signal in order to reduce the number of parameters. More specifically,
in this application the same encoding vector is used for all input positions. A similar approach was used by Riis and Krogh (1996) in the context of secondary structure prediction. Instead of using a fixed set of properties, encoding parameters are now integrated into the artificial neural network and are optimized within the same training process as the weight vectors. Only the number of properties remains to be determined. The adaptive encoding approach is especially suitable for biological interpretation, since the modular structure of the network architecture allows separate analysis of encoding and weight matrices.
As an example application, we chose the prediction of cleavage sites in human signal peptides of secretory proteins (Claros et al., 1997; Schneider et al., 1995; Nielsen et al., 1996, 1997a,b, 1999). Signal sequences are non-homologous in their primary structure (Watson, 1984), and are therefore a challenging problem for classification methods. Even though there is a wide range of tools for predicting signal sequences and cleavage sites (Chou and Elrod, 1999; Claros et al., 1997; Nakai and Horton, 1999; Nielsen et al., 1999; Schneider and Wrede, 1994), the underlying biological features and mechanisms are not yet fully understood.
The signal peptide of proteins targeted into the endoplasmic reticulum (ER) consists of three regions: a positively charged n-region, followed by a hydrophobic h-region and a neutral but polar c-region (von Heijne, 1985). The cleavage site (c-region) is generally characterized by neutral amino acids with small side chains at positions −1 and −3 (von Heijne, 1983; Nielsen et al., 1997a,b). The most frequently used publicly available tool for identifying targeting signals and their cleavage sites is SignalP (Nielsen et al., 1997a,b), which serves as a reference in this work.
Data and methods
Selection of cleavage site data and pattern generation
For reasons of comparison, we used a data set assembled by Nielsen et al. (1997a,b) encompassing 416 sequences of human secretory proteins. A data set of homology reduced cleavage sites of SWISSPROT entries (Bairoch and Boeckmann, 1994) is available by anonymous FTP from virus.cbs.dtu.dk/pub/signalp. A sequence window encompassing 19 amino acids [−15, +4] relative to the cleavage site was used to generate positive patterns. Since there may be additional cleavage sites before the start of the mature protein (Visvikis et al., 1990) and overprediction in the N-terminal region of the mature protein is undesirable, the first 10 sliding window positions towards the mature protein were used for selecting non-cleavage site patterns (negative patterns). The positive and negative data sets were divided up
randomly into three approximately equal-sized parts for a three-fold cross-validation. Thirty networks were trained: 10 repeats of three-fold cross-validation. The network with the best classification performance on test data was chosen for further evaluation.
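The pattern generation and data splitting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the convention that `cleavage_index` is the 0-based index of residue +1, and the reading of "the first 10 sliding window positions towards the mature protein" as the window shifted by 1 to 10 residues are all assumptions made here.

```python
import random
from typing import List, Tuple

WINDOW_BEFORE = 15      # positions -15 .. -1
WINDOW_AFTER = 4        # positions +1 .. +4
N_NEGATIVE_SHIFTS = 10  # window shifted by 1 .. 10 residues (assumed reading)


def extract_patterns(sequence: str, cleavage_index: int) -> Tuple[str, List[str]]:
    """Return (positive_window, negative_windows) for one precursor sequence.

    cleavage_index is a hypothetical convention: the 0-based index of
    residue +1, i.e. the number of residues in the signal peptide.
    """
    start = cleavage_index - WINDOW_BEFORE
    end = cleavage_index + WINDOW_AFTER
    if start < 0 or end + N_NEGATIVE_SHIFTS > len(sequence):
        raise ValueError("sequence too short for the requested windows")
    positive = sequence[start:end]
    negatives = [sequence[start + k:end + k]
                 for k in range(1, N_NEGATIVE_SHIFTS + 1)]
    return positive, negatives


def three_fold_splits(patterns: List[str],
                      rng: random.Random) -> List[Tuple[List[str], List[str]]]:
    """Shuffle once, cut into three nearly equal parts, return (train, test) pairs."""
    shuffled = list(patterns)
    rng.shuffle(shuffled)
    folds = [shuffled[i::3] for i in range(3)]
    splits = []
    for held_out in range(3):
        test = folds[held_out]
        train = [p for i, fold in enumerate(folds) if i != held_out for p in fold]
        splits.append((train, test))
    return splits
```

Applied to 416 sequences, this scheme gives 416 positive and 4160 negative patterns. Calling `three_fold_splits` ten times with different shuffles corresponds to the thirty training runs mentioned in the text.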
Sequence encoding
Using distributive encoding, an amino acid is mapped to a 20-dimensional unit vector: A → e_1 = (1, 0, ..., 0); ...; Y → e_20 = (0, 0, ..., 1). Using Nc properties (Schneider and Wrede, 1993; Taylor, 1986), each amino acid is mapped to an Nc-dimensional encoding vector, A → c_1 = c_A; ...; Y → c_20 = c_Y. The 20 encoding vectors can be combined into an (Nc × 20)-dimensional encoding matrix (Code). A row of this matrix will be termed ‘property vector’ since it is analogous to the vector obtained from encoding the 20 amino acids by a physico-chemical property.
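To make the difference in dimensionality concrete, the two encodings can be sketched as below. The function names and the alphabetical residue order A to Y are assumptions for illustration, and any matrix passed as `code` is a placeholder rather than learned values.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes, A .. Y (assumed order)


def one_hot_encode(window: str) -> List[float]:
    """Distributive encoding: each residue becomes a 20-dimensional unit vector."""
    vector: List[float] = []
    for residue in window:
        unit = [0.0] * len(AMINO_ACIDS)
        unit[AMINO_ACIDS.index(residue)] = 1.0
        vector.extend(unit)
    return vector


def property_encode(window: str, code: List[List[float]]) -> List[float]:
    """Low dimensional encoding with an (Nc x 20) matrix, one row per property."""
    vector: List[float] = []
    for residue in window:
        column = AMINO_ACIDS.index(residue)
        # Append the column of the Code matrix belonging to this residue.
        vector.extend(row[column] for row in code)
    return vector
```

For a window of 19 residues, `one_hot_encode` returns 380 values, whereas `property_encode` with a two-row matrix returns 38.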
Adaptive encoding neural network (ACN)
The adaptive encoding neural network (ACN) presented here is a feed-forward neural network (Hertz et al., 1991) with an integrated encoding layer. Each amino acid of the input sequence is encoded using the same property vectors at each position. The resulting real numbers serve as input for a perceptron network (Rosenblatt, 1958). The network output (O) is calculated by applying a threshold and a hyperbolic tangent function to the sum of the weighted input data:

$$
O(X_1, \ldots, X_{N_s}) = \tanh\left( \sum_{i=1}^{N_s} \left\langle \sum_{l=1}^{N_c} w_{i,l}\, \mathrm{Code}_{l,\mathrm{Index}(X_i)} - \frac{\theta}{N_s} \right\rangle \right) \qquad (1)
$$

Nc is the number of encoding properties; Ns is the length of the sequence window; w is the weight matrix containing the position specific adaptive parameters; θ is the threshold value. The function Index(X_i) maps the amino acid X_i to the corresponding column of the Code matrix. The encoding matrix is the same for each position in the sequence, and no threshold values or non-linear functions are connected with it. The expression in angle brackets gives the contribution of an amino acid at a sequence position to the network output (positional contribution). The threshold value was split up equally among all positions in order to simplify the interpretation: a positive value supports the identification as a cleavage site and a negative value suppresses it.
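The output computation described in this section can be sketched in a few lines. This is an illustrative reading of the description (per-position weighted code values minus an equal share of the threshold, summed and passed through tanh), with hypothetical function names and the residue order A to Y assumed.

```python
import math
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order of the Code matrix


def positional_contributions(window: str, weights: List[List[float]],
                             code: List[List[float]], theta: float) -> List[float]:
    """Per-position terms: sum_l w[i][l] * Code[l][Index(X_i)] - theta / Ns."""
    n_s = len(window)
    contributions = []
    for i, residue in enumerate(window):
        column = AMINO_ACIDS.index(residue)
        weighted = sum(weights[i][l] * code[l][column] for l in range(len(code)))
        contributions.append(weighted - theta / n_s)
    return contributions


def acn_output(window: str, weights: List[List[float]],
               code: List[List[float]], theta: float) -> float:
    """Network output: tanh of the summed positional contributions."""
    return math.tanh(sum(positional_contributions(window, weights, code, theta)))
```

Here `weights` has one row per window position and one column per property, and `code` has one row per property and 20 columns. Returning the positional contributions separately mirrors the interpretation given in the text: positive entries support a cleavage site call, negative entries argue against it.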
Table 1. Performance of adaptive encoding networks (ACN) using 1, 2, and 3 encoding vectors. The mean correlation coefficient (cc) of 10 repeats of a three-fold cross-validation is given for training and test performance

Number of encoding properties   Number of parameters   cc train   cc test
1                               40                     0.54       0.54
2                               79                     0.75       0.69
3                               118                    0.77       0.69
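The parameter counts in Table 1 are consistent with a simple tally: one Nc × 20 encoding matrix, one Ns × Nc weight matrix and a single threshold. The paper does not spell out this breakdown, so the function below is an assumption that happens to reproduce the tabulated numbers.

```python
def acn_parameter_count(n_properties: int, window_length: int = 19,
                        alphabet_size: int = 20) -> int:
    """Count adaptive parameters of an ACN (assumed breakdown, see lead-in)."""
    encoding = n_properties * alphabet_size   # entries of the Code matrix
    weights = n_properties * window_length    # position specific weights w
    threshold = 1                             # single threshold theta
    return encoding + weights + threshold


for n_c in (1, 2, 3):
    print(n_c, acn_parameter_count(n_c))  # prints 1 40, 2 79, 3 118
```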
Since only the input layer is affected, the adaptive encoding procedure can be transferred to multi-layer networks and nucleic acid encoding in a straightforward manner. Networks were trained using a (1, 100) evolution strategy (Rechenberg, 1993; Schneider and Wrede, 1993; Schneider et al., 1996). During training, the weight vector (w) and the encoding matrix (Code) are varied and selected iteratively, starting from an initial set of parameters. The variation from one generation to the next is determined by an automatically adapting step size parameter (σ). The initial step size (σ) was always set to 0.1, and network training continued until a value of less than 0.001 was reached. A version of the backpropagation algorithm (Hertz et al., 1991; Riis and Krogh, 1996) was also applied, yielding comparable results (data not shown).
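A (1, 100) evolution strategy with an adapting step size can be sketched as below. The text does not give the mutation rule, so the log-normal self-adaptation of σ, the learning rate `tau`, the generation cap and the function name are assumptions. The loss function is a stand-in for the classification error of a network whose weights and encoding matrix are flattened into one parameter list.

```python
import math
import random
from typing import Callable, List, Tuple


def evolution_strategy(loss: Callable[[List[float]], float],
                       start: List[float],
                       offspring: int = 100,
                       sigma: float = 0.1,
                       sigma_stop: float = 0.001,
                       max_generations: int = 2000,
                       seed: int = 0) -> Tuple[List[float], float]:
    """(1, offspring) evolution strategy; returns (parameters, final sigma)."""
    rng = random.Random(seed)
    tau = 1.0 / math.sqrt(len(start))  # assumed learning rate for sigma
    parent = list(start)
    for _ in range(max_generations):
        if sigma < sigma_stop:
            break  # stopping rule from the text: step size below 0.001
        best = None
        for _ in range(offspring):
            # Each offspring first mutates the step size, then the parameters.
            child_sigma = sigma * math.exp(tau * rng.gauss(0.0, 1.0))
            child = [x + child_sigma * rng.gauss(0.0, 1.0) for x in parent]
            score = loss(child)
            if best is None or score < best[0]:
                best = (score, child, child_sigma)
        # Comma selection: the best offspring always replaces the parent.
        _, parent, sigma = best
    return parent, sigma
```

Because the parent is discarded in every generation, the step size that produced the best offspring is inherited along with its parameters, which is what lets σ shrink as the search approaches an optimum.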
Results and discussion
Estimating the optimal number of encodings
To determine the optimal number of encodings, several ACN were trained and tested as described in Data and methods, employing Nc = 1, 2, 3 properties. Analysis of these networks showed that two properties are sufficient for predicting signal peptide cleavage sites of human secretory proteins (Table 1). For our network architecture there is no significant improvement on the test data when increasing Nc from 2 to 3. We therefore concentrate on networks applying two properties.

Analysis of property vectors
A trained ACN consists of an encoding matrix (Code) and a weight matrix (w) that can be analysed separately. The encoding matrix (Figure 1, upper row) can be interpreted in terms of column vectors or in terms of row vectors: column vectors represent the encoding of a specific amino acid, whereas the 20-dimensional row vectors can be interpreted in analogy to a physico-chemical property, e.g. hydrophobicity. Histograms are included for an additional quantitative representation of the matrix. The two learned properties differ significantly (the angle between the two vectors is about 73°). In Property1 (dark) leucine, isoleucine, and valine are the only positive components. Property2 (light grey), on the other hand, is characterized by the uncharged and small amino acids alanine, glycine, valine, cysteine, serine, and leucine.
The property vectors are compared with 140 known physico-chemical properties by calculating the angle between the vectors. The best agreement is found with hydrophobicity (Engelman et al., 1986) for Property1 (angle = 46°) and with an HPLC retention coefficient (Meek, 1980) for Property2 (angle = 44°). Remarkably, both learned properties have little in common with any of the 140 known physico-chemical properties. (The complete list of properties is available on request from the authors.) The ACN extracts this information from the data set alone. With this method, new problem-specific encodings can be derived even where an obvious connection to known properties is lacking, as for the HPLC retention coefficient shown here.

Fig. 1. Encoding matrix found by an ACN. A network employing two-dimensional adaptive encoding was applied to the prediction of signal peptide cleavage sites. Each amino acid is encoded by two values, code1 and code2. Above: matrix representation by a density plot. The upper row represents the property vector of the first property and the second row that of the second property. Below: bar chart representation. The first property vector is represented by black bars and the second property vector by light grey bars. The horizontal axis lists the amino acids A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y. (The real values of both encodings are compiled in Table 2.)
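The angle comparison between property vectors described above can be sketched as follows. The function names are hypothetical, and the example vectors in the usage are toy values rather than the learned encodings or any published property scale.

```python
import math
from typing import Dict, Sequence, Tuple


def angle_degrees(u: Sequence[float], v: Sequence[float]) -> float:
    """Angle between two vectors in degrees (0 = parallel, 90 = orthogonal)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding before taking the arccosine.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def closest_property(vector: Sequence[float],
                     known: Dict[str, Sequence[float]]) -> Tuple[str, float]:
    """Return (name, angle) of the known property with the smallest angle."""
    name = min(known, key=lambda key: angle_degrees(vector, known[key]))
    return name, angle_degrees(vector, known[name])
```

For example, `closest_property([0.9, 1.0], {"x": [1, 0], "diag": [1, 1]})` selects `"diag"`. In the setting of the paper, each vector would have 20 components, one per amino acid.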
Analysis of weight vectors
The weights of the ACN determine which encoding is important at a specific position. Weights are plotted according to their position in a histogram (Figure 2), showing the following features:
• Property1 is emphasized in positions −6 to −12
• from position −5 to −6 the weight changes from a small value to a very high value
Table 2. Property values found by an ACN trained for the identification of signal sequence cleavage sites of human secretory proteins. The values are rounded to the third decimal position

Property1 Property2
A C D E F G
0.622 −0.315; 0.263 −0.074; −0.457 −0.217; −0.234 −0.378; −0.396 −0.046
M N P Q R
−0.323 −0.10; −0.091 −0.229; −0.296 −0.66; −0.419 −0.741; −0.685 −0.296
H I K L
−0.581 −0.235; 0.049 0.337; −0.708 0.126; 0.198 −0.286
S T V W Y
−0.768 −0.363; −0.175 −0.241; 0.101 0.046; −0.279 −0.117; 0.493 −0.093; 0.521 0.208
• towards position −15 the values decline continuously
• between −5 and +4 only positions −2 and +3 adopt significant values
• Property2 is concentrated at positions −1 and −3.
As seen from the values at positions −10 to −13, there is some information common to both encodings.
Fig. 2. Weight parameters of the ACN are displayed in a Hinton diagram (above) and a bar chart (below). The x-axis gives the sequence position relative to the cleavage site (−15 to −1 and 1 to 4). Above: Upper row: weight parameters of Property1. Lower row: weight parameters of Property2. Below: Bar chart representation. Dark grey bars: contribution of Property1; light grey bars: contribution of Property2; medium grey bars: total contribution as the sum of absolute values.

Comparison with known properties of the cleavage site
Statistically, the amino acids most frequently found in the h-region of eukaryotic signal peptides are leucine, valine, and alanine. This correlates with Property1, except for alanine. Analysis of the cleavage site region showed that small, non-polar amino acids dominate the positions −1 and −3 (c-region). These amino acids are represented by Property2. The h/c boundary is located between the two regions (positions −4 and −5) (von Heijne, 1985). This boundary is reflected by very low weights assigned to positions −5 and −4.
In general, the weighting of the positions is similar to the information analysis by Nielsen et al. (1997a,b), showing that positions −6 to −12 contribute strongly to the identification of cleavage sites. However, the detailed shape differs. The weighting in the h-region decreases from −6 to −12 (ACN), while the statistical analysis has a maximum in the −12 region. It stands to reason that, for the exact localisation of the cleavage site, a major part of the weight is allocated near the cleavage site. Furthermore, weight vectors will be influenced by the choice of negative examples, and therefore deviate from the statistical analysis, which is restricted to the positive examples. In total, the ACN identifies most of the properties that distinguish the h-region and the c-region, and the sequence locations of the two regions are clearly identified.

Comparison with other methods
An earlier approach to the prediction of cleavage sites was based on weight matrices (von Heijne, 1986). With these methods about 80% of the cleavage sites could be identified. More reliable tools currently available use artificial neural networks, such as SignalP (Nielsen et al., 1997a,b). We tested the reliability of the ACN by comparing it with the signal sequence predictor SignalP, version 1.0. Only the cleavage site specific module of SignalP (C value), which is also an ANN, was used for comparison. All 416 positive and 4160 negative patterns were analysed with both methods. The main differences between the two methods are the architecture and the number of parameters. The cleavage site module of SignalP is a two-layer ANN with two hidden neurones and about 760 weight parameters. The ACN uses only 79 parameters. Besides these, there are other differences between the ACN and SignalP. First, the negative data sets used for training may differ. Second, the task is somewhat different because other modules in SignalP help to select the cleavage site. The results are compiled in Table 3.
Table 3. Comparison of an ACN and SignalP (Nielsen et al., 1997a,b). Two-thirds (ACN) and four-fifths (SignalP) of 416 cleavage site sequences (P and U) are used for training. Ten sequences following the cleavage site are used as negative examples (N and O). The results are measured for training and test. The number of parameters is also shown. P = correctly classified as cleavage sites; U = incorrectly classified as non-cleavage sites; N = correctly classified as non-cleavage sites; O = incorrectly classified as cleavage sites; cc = correlation coefficient (Matthews, 1975); Performance: (P+N)/(P+N+O+U)

Method    No. of parameters   P     U     N      O     cc     Performance
ACN       79                  312   104   4105   55    0.78   97%
SignalP   760                 347   69    4039   121   0.76   96%
The performance of the ACN is comparable to that of SignalP. The correlation coefficient (cc; Matthews, 1975) and the percentage of correctly classified patterns give nearly the same values. Nevertheless, the systems differ in the way this accuracy is achieved. Overprediction is lower for the ACN than for SignalP, meaning that a predicted cleavage site is more likely to be a true cleavage site for the ACN. On the other hand SignalP detects more cleavage sites at the price of higher overprediction. Within the modular framework of SignalP this may be a desirable property. According to our observation (data not shown) training of the ACN converged about 10 times faster than in comparable networks employing distributive encoding. This is approximately proportional to the reduction in number of parameters.
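The summary measures in Table 3 can be recomputed from the four counts. The performance formula is the one given in the table caption; the correlation coefficient is written here in the standard Matthews form, which the text cites but does not spell out, so that formula and the function names are assumptions. The precision helper is added only to quantify the overprediction remark.

```python
import math


def matthews_cc(p: int, u: int, n: int, o: int) -> float:
    """Matthews correlation coefficient from P, U, N, O as defined in Table 3."""
    numerator = p * n - o * u
    denominator = math.sqrt((p + o) * (p + u) * (n + o) * (n + u))
    return numerator / denominator


def performance(p: int, u: int, n: int, o: int) -> float:
    """Fraction of correctly classified patterns: (P+N)/(P+N+O+U)."""
    return (p + n) / (p + n + o + u)


def precision(p: int, o: int) -> float:
    """Share of predicted cleavage sites that are true cleavage sites."""
    return p / (p + o)


acn = dict(p=312, u=104, n=4105, o=55)
signalp = dict(p=347, u=69, n=4039, o=121)
print(round(matthews_cc(**acn), 2), round(100 * performance(**acn)))          # 0.78 97
print(round(matthews_cc(**signalp), 2), round(100 * performance(**signalp)))  # 0.76 96
```

With these counts, about 85% of the sites predicted by the ACN are true cleavage sites (312 of 367), compared with about 74% for SignalP (347 of 468), while SignalP recovers 347 of the 416 true sites against 312 for the ACN.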
Conclusion
Using the ACN, we could reduce the number of parameters to one-tenth without loss of prediction accuracy compared with the distributive encoding technique. The architecture of the ACN allows for a straightforward biological interpretation of network parameters, which turned out to be consistent with all known facts about cleavage sites. Computational expense is about 10 times less than for distributive encoding, which is important for applications to large data sets. In addition, the use of few parameters allows for the application to small amounts of data while still yielding statistically reliable predictors. Thus the ACN is the method of choice when applying ANN to either nucleic acid or amino acid sequence classification tasks.

Outlook
In order to assess reliability and coherence of encodings found by the ACN, extensive statistical analysis is currently being performed. The methods described here will be applied to targeting sequence identification and the analysis
of mutations in cleavage sites. The effect of errors in the data set is a difficult problem that will be dealt with in the future.
Acknowledgements
Paul Wrede, Gisbert Schneider, Arif Malik, Hanspeter Herzel, Gerhard Müller and Stefan Welke are thanked for discussion and comments. Work was supported by DFG: Graduierten Kolleg ‘Dynamik und Evolution makromolekularer und zellulärer Prozesse’.

References
Bairoch,A. and Boeckmann,B. (1994) The SWISS-PROT protein sequence data bank: current status. Nucl. Acids Res., 22, 3578–3580.
Baldi,P.F. and Brunak,S. (1998) Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge, MA.
Chou,K.C. and Elrod,D.W. (1999) Protein subcellular location prediction. Protein Eng., 12, 107–118.
Claros,M.G., Brunak,S. and von Heijne,G. (1997) Prediction of N-terminal protein sorting signals. Curr. Opin. Struct. Biol., 7, 394–398.
Engelman,D.M., Steitz,T.A. and Goldman,A. (1986) Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annu. Rev. Biophys. Biophys. Chem., 15, 321–353.
von Heijne,G. (1983) Patterns of amino acids near signal sequence cleavage sites. Eur. J. Biochem., 133, 17–21.
von Heijne,G. (1985) Signal sequences. The limits in variation. J. Mol. Biol., 184, 99–105.
von Heijne,G. (1986) A new method for predicting signal sequence cleavage sites. Nucl. Acids Res., 14, 4683–4690.
Hertz,J., Krogh,A. and Palmer,R.G. (1991) Introduction to the Theory of Neural Computation. Addison-Wesley, New York, NY.
LeCun,Y., Boser,B., Denker,J.S., Henderson,D., Howard,R.E., Hubbard,W. and Jackel,L.D. (1989) Backpropagation applied to handwritten Zip code recognition. Neural Comput., 1, 541–551.
Lohmann,R., Schneider,G. and Wrede,P. (1996) Structure optimization of an artificial neural filter detecting membrane-spanning amino acid sequence. Biopolymers, 38, 13–29.
Matthews,B.W. (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, 405, 442–451.
Meek,J.L. (1980) Prediction of peptide retention times in high-pressure liquid chromatography on the basis of amino acid composition. Proc. Natl Acad. Sci. USA, 77, 1632–1636.
Nakai,K. and Horton,P. (1999) PSORT: a program for detecting the sorting signals and predicting their subcellular localization. Trends Biochem. Sci., 24, 34–35.
Nielsen,H., Engelbrecht,J., von Heijne,G. and Brunak,S. (1996) Defining a similarity threshold for a functional protein sequence pattern: the signal peptide cleavage site. Proteins, 24, 165–177.
Nielsen,H., Engelbrecht,J., Brunak,S. and von Heijne,G. (1997a) Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng., 10, 1–6.
Nielsen,H., Engelbrecht,J., Brunak,S. and von Heijne,G. (1997b) A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Int. J. Neural. Syst., 8, 581–599.
Nielsen,H., Brunak,S. and von Heijne,G. (1999) Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Eng., 12, 3–9.
Rechenberg,I. (1993) Evolutionsstrategie - Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Frommann-Holzboog, Stuttgart.
Riis,S.K. and Krogh,A. (1996) Improving prediction of protein secondary structure using structured neural networks and multiple sequence alignments. J. Comput. Biol., 3, 163–183.
Rosenblatt,F. (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev., 65, 386–408.
Rumelhart,D.E., Hinton,G.E. and Williams,R.J. (1986) Learning internal representations by error propagation. In Rumelhart,D.E. and McClelland,J.L. (eds), Parallel Distributed Processing. Vol. I, Bradford Books, Cambridge, MA, pp. 318–362.
Schneider,G. and Wrede,P. (1993) Development of artificial neural filters for pattern recognition in protein sequences. J. Mol. Evol., 36, 586–595.
Schneider,G. and Wrede,P. (1994) The rational design of amino acid sequences by artificial neural networks and simulated molecular evolution: de novo design of an idealized leader peptidase cleavage site. Biophys. J., 66, 335–344.
Schneider,G., Schuchhardt,J. and Wrede,P. (1995) Peptide design in machina: Development of artificial mitochondrial protein precursor cleavage sites by simulated molecular evolution. Biophys. J., 68, 434–447.
Schneider,G., Schuchhardt,J. and Wrede,P. (1996) Evolutionary optimization in multimodal search space. Biol. Cybern., 203–207.
Schneider,G. and Wrede,P. (1998) Artificial neural networks for computer-based molecular design. Prog. Biophys. Mol. Biol., 70, 175–222.
Taylor,W.R. (1986) The classification of amino acid conservation. J. Theor. Biol., 119, 205–218.
Visvikis,S., Chan,L., Siest,G., Drouin,P. and Boerwinkle,E. (1990) An insertion deletion polymorphism in the signal peptide of the human apolipoprotein B gene. Hum. Genet., 84, 373–375.
Watson,M.E.E. (1984) Compilation of published signal sequences. Nucl. Acids Res., 12, 5145–5164.
Weiss,O. and Herzel,H. (1998) Correlations in protein sequences and property codes. J. Theor. Biol., 190, 341–353.