Bioinformatics Advance Access published March 28, 2014
Motifs tree: a new method for predicting post-translational modifications Christophe Charpilloz 1,2 , Anne-Lise Veuthey 2 , Bastien Chopard 1,2 and Jean-Luc Falcone 1,2,∗ 1
ˆ Department of Computer Science, University of Geneva, Battelle - batiment A, route de Drize 7, 1227 Carouge, Switzerland. 2 ´ Swiss Institute of Bioinformatics, Centre medical universitaire, rue Michel-Servet 1, 1211 Geneva 4, Switzerland.
ABSTRACT Motivation: Post-translational modifications (PTMs) are important steps in the maturation of proteins. Several models exist to predict specific PTMs, from manually detected patterns to machine learning (ML) methods. On one hand the manual detection of patterns does not provide the most efficient classifiers and requires an important workload, on the other hand models built by ML are hard to interpret and do not increase biological knowledge. Therefore we developed a novel method based on patterns discovery and decision trees to predict PTMs. The proposed algorithm builds a decision tree, by coupling the C4.5 algorithm with genetic algorithms, producing high performance white box classifiers. Our method was tested on the initiator methionine cleavage (IMC) and Nα -terminal acetylation (N-Ac), two of the most common PTMs. Results: The resulting classifiers perform well when compared to existing models. On a set of eukaryotic proteins they display a crossvalidated MCC of 0.83 (IMC) and 0.65 (N-Ac). When used to predict potential substrates of NatB and NatC our classifiers display better performance than the state of the art. Moreover we present an analysis of the model predicting IMC for H. sapiens proteins and demonstrate that we are able to extract experimentally known facts without prior knowledge. Those results validate the fact that our method produces white box models. Availability: Predictors for IMC and N-Ac and all datasets are freely available at: http://terminus.unige.ch/. Contact:
[email protected]
1 INTRODUCTION Post-translational modifications (PTMs) are modifications occurring during protein maturation or biosynthesis. These modifications can consist of attachments of functional groups (e.g. methylation), changes of the chemical nature (e.g. deamidation), cleavage of one or more residues (e.g. initiator methionine cleavage), or structural changes (e.g. disulfide bonds). The PTMs broaden the diversity of functional groups of the 20 standard amino acids, thus producing diverse forms of proteins that cannot be derived only from its ∗ to
whom correspondence should be addressed
genes (Walsh, 2006; Schwartz et al., 2009). Since the mature form of a protein cannot be inferred only by genes, the knowledge of a protein’s PTMs helps to understand the roles, the possible interactions, or the activity of a protein. Numerous predictors for PTMs have been developed, based on different machine learning models. For example artificial neural networks (ANN) have been widely used to predict various PTMs, like phosphorylation (Blom et al., 1999), N-terminal myristoylation (Bologna et al., 2004), C-mannosylation (Julenius, 2007). More recently random forest (RF) have been succesfully used to predict PTMs sites, for ubiquitination (Radivojac et al., 2010), γcarboxylation (Zhang et al., 2012), glycosylation sites (Chuang et al., 2012). Although some of these predictors provide very good prediction capabilities for the problem they tackle, they often are black boxes. Indeed, their mathematical complexity makes them hard to interpret in term of biological meaning (Berthold et al., 2010). Unfortunately this restricts the application of these model for biological problems which, in our opinion, require a model providing explanations for the prediction. The purpose of this paper is to introduce a new method to automatically build a PTM predictor, using only the information contained in the proteins primary structure and which can be interpreted by biologists. We focused on two PTMs: first the Nα terminal acetylation (Nα -Ac): a PTM involving the transfer of an acetyl group to the N-terminal residue α-amino group. It is one of the most common covalent irreversible modifications and occurs in about 50% of yeast proteins and 80% of human proteins (Polevoda and Sherman, 2002, 2003). In eukaryotes, Nα -Ac is catalyzed by N-terminal acetyltransferases (Nats) (Pestana and Pitot, 1975; Gautschi et al., 2003; Polevoda et al., 2008). Six Nats have been identified (NatA to NatF), each acetylating specifics N-terminal regions (Polevoda and Sherman, 2003; Polevoda et al., 2009). Because Nα -Ac can occur on proteins having their initiator methionine cleaved or not (Polevoda et al., 2009), it is also required to know if the initiator methionine cleavage (IMC) occurs in order to produce an accurate predictor for Nα -Ac. Hence, the second PTM studied is the IMC which is catalyzed by methionine aminopeptidases (MetAPs) (Kendall and Bradshaw, 1992). As pointed by Eisenhaber and Eisenhaber (2010) it is unlikely to discover a unique pattern describing the requirement of all
© The Author (2014). Published by Oxford University Press. All rights reserved. For Permissions, please email:
[email protected]
1
Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on August 26, 2015
Associate Editor: Dr. John Hancock
2
METHODS
Our method is based on combinations of biomolecular motif descriptors. Each descriptors can then be used to compute a similarity score by aligning the descriptor with an amino acids sequence (Gonnet and Lisacek, 2002). These scores are then compared with cutoff values to discriminate sequences into two groups (Bucher et al., 1996). These descriptors are then combined in a binary decision tree (DT) where they correspond to the test nodes. We call such model a motifs tree (MT). Dataset. Because our method relies on supervised machine learning, we need good quality datasets in order to train the classifiers. All data used in this study were extracted from the release 2012 07 of UniProtKB (11th of July, 2012). (1) Nα -Ac dataset. To extract entries from UniProtKB we build two queries, one for the Nα -acetylated proteins and one for proteins that do not undergo Nα -Ac. Our datasets were based only on experimental evidence. We select entries in the database according to the following criteria: each entry must have been reviewed by a UniProtKB curator; it’s existence must be experimentally proven and a chromosomic gene must be linked to the entry. An entry is labeled as Nα -acetylated if the residue exposed to the Nat is annotated as N-acetyl. The acetylation must also be experimentally proven. An entry is labeled as non Nα -acetylated if the exposed N-termial residue is not annotated as N-acetyl (regardless the confidence) and one reference must state that the entry has been sequenced at protein level with a method able to detect eventual acetylation. Protein with N-terminal residues blocked by an unidentified modification are discarded. Those criteria are detailed in supplementary information A (see “2.1 Criteria used to build the datasets”). Although it is know that Nα -Ac is not always a total modification, this fact is currently not taken into account in the available protein databases. Hence, we qualify a protein as acetylated if the PTM was experimentally observed, regardless of the modification ratio. The extraction process was repeated for several taxonomic groups. Table 1 shows the sizes and the PTM ratio of the datasets extracted from UniProtKB depending on the chosen taxon: Eukaryota, Metazoa and H. sapiens. We also stress that the taxon datasets are not mutually exclusive: 79% of the Eukaryota dataset is composed by Metazoa sequences and 65% of the Metazoa dataset is composed by H. sapiens sequences. (2) Initiator methionine cleavage dataset. There is no specific query to build an IMC dataset. Our IMC datasets were extracted from the Nα -Ac datasets by checking the presence of the feature of type initiator methionine with the value removed. The criteria used for the Nα -Ac datasets imply experimental evidences for the IMC too. In the
2
Table 1. Number of sequences and content of the different datasets extracted from UniProtKB for the two considered PTMs: the IMC and the Nα -Ac. The “Nb. of seq.” and ratio columns respectively indicate the number of sequence in the PTM’s dataset and the ratio of proteins undergoing the corresponding PTMs. Init. Met cleav. Nb. of seq. Ratio
Taxon Eukaryota Metazoa H. sapiens
2519 2004 1322
0.72 0.72 0.69
Nα -Ter. acet. Nb. of seq. Ratio 2486 1971 1289
0.64 0.71 0.87
case of the IMC datasets, we have kept 33 proteins that were filtered out of the Nα -Ac dataset because the method used to sequence the N-terminus was not able to determine acetylation, while being able to determine the IMC status. The datasets’ composition is detailed in table 1. Model. There are several approaches to define the motif descriptors: regular expression, consensus sequence with degenerated positions, consensus sequence with mismatches, weight matrix, flexible pattern, profile, and so on (Bucher et al., 1996; Bork and Gibson, 1996). We will, in the context of this study, define a motif as a sequence of elements (called here tokens). The five categories of tokens we used are presented along with their similarity measure with an amino acid. The similarity of an amino acid a with a token t is denoted by σ(t, a), and ranges between 0 and 1.
• Any amino acid: this token matches with any amino acid and its similarity measure is always 1. This token is represented with the symbol “ ”.
.
• Fixed amino acid: which are tokens imposing a match with a single amino acid. The similarity measure is 1 if and only if the token is aligned on the amino acid described by the token, otherwise it is 0.
• Inclusion: these tokens describe sets of amino acids. The similarity measure is 1 if and only if the token is aligned on an amino acid included in the set, otherwise it is 0. For instance [ACM] is a token having a similarity measure of 1 with Ala, Cys and Met and 0 with the other amino acids.
• Exclusion: these tokens are the complement of the previous one. The token similarity measure is 1 if and only if the token is aligned on an amino acid not included in the amino acids set described by the token, otherwise it is 0. For example ¬[EPT] has a similarity of 0 with Glu, Pro and Thr and 1 with the other amino acids.
• Physico-Chemical similarity: which are tokens describing how similar is an amino acid to a reference amino acid according to a physico-chemical property (the AAindex1 database, Kawashima and Kanehisa 2000). Those tokens are represented by the reference amino acid r, followed by the AAindex1 p (i.e. t = {r, p}). For example {S,KYTJ820101} is a token where the amino acids with a similar hydropathy index (Kyte and Doolittle, 1982) than serine, have a high similarity score. The similarity is computed as follows: σ({r, p}, a) = 1 − |¯ p(r) − p¯(a)|, where p¯(x) is the value of the property for x, normalized between 0 and 1. Although we restricted our choice only to these five types of tokens, to keep the model as simple as possible, these tokens generate approximately 2 · 106 possibilities (as there exist 220 possible sets of amino acids). The previous definition of tokens allows to build a similarity matrix between the twenty amino acids and the tokens. Then to compute a similarity score between a motif and a sequence, we use the Needleman-Wunsch algorithm
Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on August 26, 2015
enzymes, because there is no biological sense to build an acetylation predictor based on an “average” motif, as no enzyme recognizes this “average” motif. Our main idea is to combine several discriminant motifs optimized with genetic algorithms (GA). These motifs are combined using a binary decision tree. Our choice is mainly motivated by the need for white box models, that is to say classifiers that are interpretable by biologists, in order to help identifying the required biological features. The method described in this paper is tested by evaluating its capacity to predict the IMC and the Nα -Ac of eukaryotic proteins. The choice of predicting those PTMs has been made because several methods to predict Nα -Ac have been published which allow us to test the efficiency of our method by comparison. The published methods range from black box machine learning (ML) methods, e.g. SVM (Liu and Lin, 2004) and ANN (Lars et al., 2005), to manual pattern detection (‘by eye’). For example Martinez et al. (2008); Cai and Lu (2008) and more recently Bienvenut et al. (2012), predict PTMs using manually extracted rules based only on the information provided by the first two or three amino acids in the sequence, which may be insufficient to predict correctly the PTMs.
power. This step is important as we try to build an interpretable model. All operators used along with model parameters are detailed in supplementary information A (“1 Genetic algorithm”). The retained parameters are: Tournament size: 5; Population size: 250; Max generations: 150; Mutation probability: 0.75; Number of plague “remove”: 20; Number of plague “clean”: 100; Number of amino acids: 6; Gap penalty: -0.0625; Pruning factor (α): 0.5; Bucket size: 6. They were chosen by scanning a wide range of values (data not shown). We observed that the method performance is independent of the chosen values for most of the parameters. For example reducing the maximum generations does not change the quality of prediction, but produces deeper trees. The only parameter that affects the quality of prediction is the size of the fragment used for the alignment (the “Number of amino acids” parameter in supplementary information A, table 1). This parameter specifies the number of amino acids taken into account for the alignment. During tests we noticed that fragments that are too long, e.g. 15 amino acids, produce classifiers with poor generalization capacities (see section 3.1). We therefore used the minimum size that disambiguate the proteins undergoing and not undergoing Nα -Ac. Software to build a motif-tree. All software used to build our model, namely a motif-tree, has been developed by the authors.
3
RESULTS
In this section we first assess the learning capability of our method by evaluating the quality of the predictions obtained with the datasets extracted from UniProtKB. We compare then our predictors to the state of the art used to predict IMC and Nα -Ac.
3.1
Generalization and stability
Two potential problems arise from the algorithm we used to build our classifier. The first (common to all machine learning algorithms) is a lack of generalization, which is the ability of the algorithm to correctly classify proteins that are not present in the training set. The second problem is the stability of our model, that is to say the consistency of the results despite the stochastic nature the GA. Indeed, we have no guarantee that every GA evolution will converge to a good solution. Cross-validation (CV) is a widely used process to evaluate generalization of a classifier, allowing us to estimate the average generalization error of a machine learning method (Hastie et al., 2001). To evaluate the stability we simply applied 10 independent stratified 10-folds cross-validations on our datasets, combining the CV results to obtain the average and the standard deviation. So if the CV results have a high average classification score with low standard deviations, the method is stable and produces classifiers with good generalization capability. As we are only taking into account the first six amino acids, redundancies and ambiguities appear in the dataset. Therefore we have removed those duplicates from the training set of each fold. This ensures that the test set contains only sequences never seen during the learning phase. More details are provided in supplementary information A (“2.3 Redundancy and ambiguities resolution”). The crossvalidation results are presented in table 2. We can see that the learning and generalization capabilities of our method are good for both IMC and Nα -Ac, as it is shown by the classification scores. However, we must pay attention to the accuracy values. Since our training set class are imbalanced, a trivial classifier could easily reach a high accuracy. For instance, 87 % of human proteins are acetylated in our dataset and a bad classifier which predicts all proteins as acetylated will obtain an
3
Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on August 26, 2015
with the similarity matrix to obtain the score of the best possible global alignment between a motif and an amino acids sequence. These motifs are then combined in a binary decision tree manner. The need of combining motifs arises because a single motif, as described above, was not able to produce an accurate prediction for the Nα -Ac prediction. A binary decision tree uses motifs as nodes and class labels as leaves. A sequence “moves” down in the tree the following way: (1) when a sequence reaches a node, it is aligned with the node’s motif to get a similarity score. This score is then compared to the node threshold (cutoff values) in order to select the next branch to take. (2) When a sequence reaches a leaf, it is classified as undergoing a specific PTM or not, depending on the leaf label. This representation is highly readable: each path in the tree from the root to a given leaf can be represented as a logical clause in conjunctive normal form. Building the motifs tree. The algorithm used to build a motifs tree is very similar to C4.5 (Quinlan, 1992). This algorithm recursively adds test nodes that split the training set. In our case the tests conducted by the nodes are based on a motif and its alignment with a sequence. To choose a motif at each node, the C4.5 algorithm selects the best motif among all possible motifs, that is to say the one yielding two subsets with the best class separation. The problem is that with the symbols used to describe our motifs, there exists approximately 106·n possible motifs of length n. For example, searching the best motif of length 5 means searching the best motif among 1030 motifs. Because an exhaustive search for the best possible motif is not feasible, we cannot use the C4.5 algorithm. Therefore we relied on genetic algorithms (GA) (Goldberg, 1989) to explore the motif space (i.e. the set of all possible tokens sequences). The idea is that we may not need the best possible motif to build the motifs tree, but a good approximation of the best motif is probably enough. Approximating the best motif. Genetic algorithms generate a solution to an optimization problem by mimicking Darwinian evolution (reproduction, inheritance, mutation and selection) to explore the space of admissible solutions. The idea is to reproduce a survival-of-the-fittest model, where several solutions of the problem are generated, then modified by bio-inspired methods, and the best ones are selected for the next round of evolution. In the GA terminology, an admissible solution of the problem is called an individual, its representation in the GA is called a genome and its elements genes. The individuals of the evolution process form the population. To understand how we used the GA to approximate the best motif, we need to define what is an individual, what is the initial population, how the fitness function is computed, and which genetic operators are used. In our setup, an individual is a tokens sequence of variable length. The initial population is randomly created by generating n individuals with m tokens, randomly drawn from the category of token described in the model section, with m equal to the length of the sequences in the dataset (six amino acids in this study). The fitness function used is based on the normalized information gain ratio (Russell and Norvig, 2010) and the Matthews correlation coefficient (MCC) (Matthews, 1975). The best threshold is selected among all different scores evaluated in the set. To do so, we consider each score as a potential candidate for the threshold, so each of them is used sequentially as a cutoff value. As the cutoff value allows to split the training data, the information gain ratio can be computed and the score maximizing this gain is chosen to be the threshold. Then we compute the MCC based on the split induced by the threshold. We used the following the GA operators: (1) The k-tournament selection operator. (2) The one point crossover where the same break point is used in both parents to produce two new offspring of the same length (Banzhaf et al., 1998). The break point is randomly chosen at each application of the operator. (3) The one point mutation which change the value of one gene in the individual. Our mutation operator can add a random new token, delete a random token, or substitute a token in the motif at a random position by a random new token. (4) We define a plague operator which is used to simplify an individual (i.e. a motif). This operator removes the tokens that do not improve the quality of the solution, and simplifies inclusion and exclusion tokens. Therefore this operator improve the readability of a motif without altering its discriminant
Table 2. Results assessing the quality of the IMC prediction and Nα -Ac prediction. Score values are the mean on 10 independent stratified crossvalidations, each made with 10 folds. The standard deviation is given between parenthesis. PTM
Taxon
Acc.
Sen.
Spe.
MCC
IMC
Eukaryota Metazoa H. sapiens
0.93 0.94 0.95
0.95 0.96 0.96
0.89 0.91 0.93
0.83 (0.0001) 0.86 (0.0002) 0.89 (0.0001)
N-Ac.
Eukaryota Metazoa H. sapiens
0.84 0.85 0.93
0.89 0.90 0.96
0.76 0.73 0.59
0.65 (0.0001) 0.64 (0.0002) 0.56 (0.0006)
Comparison with TermiNator3
Now that we know that our method is reliable, we compare our method with the state of the art: TermiNator3 (Martinez et al., 2008). We choose not to compare our model against NetAcet, another well known Nα -Ac predictor, because it has only been trained on NatA substrates from S. cerevisiae. For this comparison, we trained our classifiers with the full datasets (instead of running a CV experiment) described above, one for each taxon: Eukaryota, Metazoa and H. sapiens and for each PTM: Nα -Ac and IMC. Six predictors were produced, whose performances were compared to TermiNator3. Those trainings are justified by the fact that, now we are convinced that our model generalizes well and is stable, we wanted to use all the available information to build the most accurate predictors. Moreover the patterns used by TermiNator3 seems to have been built based on their full dataset. The comparison is presented in table 3 and we obtained cross-validated results close to TermiNator3 with our method (table 2). When trained on the full dataset, results are on a par with TermiNator3 for the prediction of IMC. However our classifiers perform better than TermiNator3 for Nα -Ac prediction. 3.2.1 Potential NatB and NatC As it was introduced above, several enzymes catalyze the Nα -Ac, however the information regarding the Nats catalyzing the PTM is rarely available. But by looking at the known specific substrates, authors have proposed substrates to identify the Nat catalyzing the acetylation depending on the first two amino acids. For the NatB, the following substrates are proposed: MD-, ME-, MN-, and for the NatC the following substrates are proposed: MF-, MI-, ML-, MW- (Polevoda et al., 2009).
4
Init. Met cleav. Nα -Ter. acet. Acc. Sen. Spe. MCC Acc. Sen. Spe. MCC
Taxon
Service
Eukaryota
Terminus 0.99 0.99 0.99 0.98 0.99 0.98 0.99 0.97 TermiNator3 0.96 0.99 0.89 0.91 0.87 0.92 0.77 0.71
Metazoa
Terminus 0.99 0.99 0.99 0.98 0.99 1.0 0.96 0.96 TermiNator3 0.97 1.00 0.90 0.92 0.88 0.92 0.80 0.72
H. sapiens
Terminus 0.99 0.99 0.99 0.97 0.99 0.99 0.96 0.96 TermiNator3 0.97 0.99 0.92 0.93 0.90 0.91 0.82 0.63
Unfortunately, the number of experimentally identified substrates of those specific Nats is scarce. To estimate the capability of our classifiers regarding to NatB and NatC substrates, we built two new datasets: one for potential NatB and one for potential NatC. From the Eukaryota dataset, all proteins matching the theoretical requirements for NatB or C are considered as potential substrates and extracted into those new datasets. The proteins are extracted with their original class (i.e. Nα -acetylated or not Nα -acetylated), because not all proteins matching the substrates are acetylated. We applied a 10-folds CV on the whole Eukaryota dataset. Then we measured the performance of the model only on the potential NatB and NatC. The results in table 4 display that the patterns used in TermiNator3 are too stringent. Indeed, the results show that if a sequence starts like the NatB proposed substrate, it is always classified as Nα -acetylated (the sensitivity is 1.0 and the specificity is 0.0). For the sequences starting like the NatC known substrates it is the opposite (the sensitivity of 0.0 and the specificity of 1.0) indicating that all these sequences are classified as not Nα -acetylated. Therefore, in both cases the MCC obtained is 0.0, meaning that in this case TermiNator3 performs no better than random prediction. The pattern used by TermiNator3 (Martinez et al., 2008) take only into account, at most, the first three amino acids, but the information provided by these three amino acids is probably insufficient to decide if a protein undergoes Nα -Ac. Our model takes into account the first six amino acids and produces a cross-validated MCC higher that 0.0, therefore it performs better than random. So our model has been able to find specificities between proteins undergoing Nα -Ac, as it is showed by the increase of specificity in the case of the NatB substrates (+0.39) and the increase of sensitivity in the case of the NatC substrates (+0.55). Finally we tested our predictor on the 7 experimentally identified substrates of NatB and C in H. sapiens (Starheim et al., 2008, 2009). As shown in table 5, all substrates were correctly predicted by the motifs tree. This shows that our model is able to discover subtle features specific to those proteins, even when they are accounting only for less than 20% of the whole dataset.
3.3 Analysis of the initiator Met cleavage motifs tree The main goal of this paper is to present a new automatic approach in order predict PTMs based only on the protein primary structure, called motifs tree, and we have presented the performances of our
Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on August 26, 2015
accuracy score of 0.87. To evaluate our results, we compare then the obtained accuracy score against a so-called baseline (the “Ratio” columns in table 1), which is the proportion of the majority class in the training set. All classifiers obtained here display a significant improvement over the baselines of our training sets. For example, with the Eukaryota dataset, the accuracy rises from 0.72 (baseline) to 0.93 in the case of IMC. In the case of Nα -Ac it rises from 0.64 (baseline) to 0.84. The results also show that the method is very stable, since the standard deviation is less than 1 % of the classification scores, meaning that every run will produce a good classifier.
3.2
Table 3. Prediction scores for TermiNator3 and Terminus (our online predictors). The “PTM” column indicates to which post-translational modification the scores apply. The “Dataset” column indicates which one of our datasets extracted from UniProtKB is used.
Table 4. Cross-validated scores obtained by Eukaryota classifiers versus TermiNator3. The “Potential” means that only sequences matching to the potential NatB or potential NatC substrates are considered. The “Nb. seq.” column indicates the number of sequences and the “N-Acet.” column the ratio of sequences undergoing Nα -Ac in each dataset. Service
A
E S
D P TA E N V
S
D
G
A
G E
A
T
S
A
L
L
Q
F
K
Substrate Nb. seq. N-Acet. Acc. Sen. Spe. MCC
Motifs tree TermiNator3
NatB
384
0.91
0.89 0.93 0.39 0.90 1.00 0.00
0.31 0.00
Motifs tree TermiNator3
NatC
100
0.38
0.66 0.55 0.73 0.64 0.00 1.00
0.28 0.00
UniProt ID Taxon
ED
N
L Q F K M I
E
A
S
P A
H
A
A
L
G
P E
R
A
Sequence Nat Terminus TermiNator3
S
A
E
P
S
D
Q04206
H. sapiens MDELFPL B
Ac-M(1)
Ac-M(1)
Q9NVJ2† P42345 P31943† P52597†
H. sapiens H. sapiens H. sapiens H. sapiens
Ac-M(1) Ac-M(1) Ac-M(1) Ac-M(1)
M(1) M(1) M(1) M(1)
MLALISR MLGTGPA MLGTEGG MLGPEGG
C C C C
T
G
G V T
T
G
V
E A TE
LS DG Q L D
G VT
P V
GA E Q TG V T AD A
K L L
R
F
I C
C
M D
R P I
F
N
Y
S
†
E
EE DKP
T
G P G TH
E SL Q
A G K D R AQ M A NY V S D T A V G
G
A
L
N F
P S
Y
K
P
S S
W
V
N
Q I W
M
VRR
Sequences used to build the H. sapiens classifier or the Eukaryota classifier.
classifiers to predict Nα -Ac and IMC. In this section we will show how we can use our model to infer knowledge about the underlying biological process (e.g. enzyme substrate specificity). Due to the lack of space, we will illustrate this feature by analyzing the smallest motifs tree, predicting IMC in H. sapiens. However the same approach can be applied to all motifs trees (article in preparation). For details about the other motifs trees produced for this paper, see supplementary information B (“1 Motifs trees”). The analyzed tree is the product of a training on the full H. sapiens dataset. We point out that the motifs found during different runs of training are very close and combined in similar trees. As it seems that all learning phases converges to a particular point in the solution space, we can focus our analysis on one motifs tree. As this model is based on combination of motifs, it can be interesting to analyze the discovered motifs to propose assumptions about the substrates of the enzymes catalyzing the chemical process of a given PTM. Hence, we studied how sequences are split at each node and we tried to extract the features that separate the two sets of sequences induced by the split. Let us note that there are two genes encoding for MetAPs in human, MetAP1 and MetAP2 (Bradshaw et al., 1998), but the information about which enzyme catalyzes the cleavage is not known and is not taken into consideration in the model. Also, even if it does not add information, the initiator methionine is kept in the sequences. First of all, we see that the motifs tree (fig. 1) is composed of three tests (motifs), all of them leading to at least one leaf (i.e. a predicted class):
Fig. 1. The motifs tree for the prediction of initiator methionine cleavage for H. sapien proteins extracted from UniProtKB. Each node of the tree is represented by the motif used for its test. Leaves are represented using a sequence logo made with all the sequences ending in that leaf and the label under a leaf specifies the class corresponding to the prediction made at the leaf and its accuracy. The initiator methionine is always present in all sequences, but is not displayed in the logo as it does not provide any information. Moreover each sequence logo is rescaled according to its highest value (the maximum being 4.32 bits). The branches are labeled with the alignment score condition required on the test to follow the path indicated by the branch. The sequence logo on top illustrates the composition of the H. sapien proteins extracted from UniProtKB.
• the sequences that do not contain the signal described by the first motif are classified as not undergoing the initiator methionine cleavage; • the sequences containing both the first two motif signals are classified as undergoing the initiator methionine cleavage; • the sequences reaching the last node are classified as not undergoing the initiator methionine cleavage if the signal of the third motif is detected in the sequence. To understand what features are exploited by the motifs tree to discriminate the sequences, we will focus on the first node. The first motif is described by the following tokens sequence:
5
Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on August 26, 2015
Table 5. Predictions of acetylated proteins with known Nats using the Terminus H. sapiens classifier. The “Sequence” column displays only the first seven amino acids of the protein exposed to the Nat. The “Nat” column indicates which Nat catalyzes the Nα -Ac. The “Terminus” and the “TermiNator3” columns indicate respectively the prediction of the services.
acids sequences, sj = (aj1 , aj2 , . . . , ajn ). The profile of m on all sequences in S, is a vector c = (c1 , c2 , . . . , ck ), whose ci are given by : |S| 1 ∑ σ(ti , xji ) (1) ci = |S| j=1
.{S,CHAM830104}{S,CHAM830104}.{F,GARJ730101}.. and our analysis is split into two steps: (1) scores analysis and identification of discriminant tokens in the motif; and (2) positions of interest in the amino acids sequences. We begin by identifying the discriminant tokens (step 1). To do so, we compute the average motif score profile. The profile is computed for a set of sequences aligned on a motif. For a given alignment, each token contributes to the alignment score either by its similarity with the aligned amino acid, either by being gapped. If all contributions of each token on each sequence are summed and normalized, we obtain an average motif score profile. Formally, let m = (t1 , t2 , . . . , tk ), a k tokens motif, and S = {sj } a set of amino
6
Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on August 26, 2015
Fig. 2. (a) The motif score profile (eq. 1) difference between the sequences achieving an alignment score less than or equal to the threshold and the achieving an alignment score greater than the threshold. On this plot we can see that the tokens at position 2 and 3 in the motif have an important contribution in the alignment scores of sequences achieving a score higher than the threshold. (b) The normalized histogram of aligned position illustrates on which positions in the amino acids sequence a token is aligned. The sequences considered to build this histogram are the one following the right branch after the first motif. The colors of the stack indicate the position in the amino acids sequences. A stack lower that 1.0 reflect that the token is aligned with sequence gaps. E.g., a stack with a height of 0.4 means the token is aligned with an amino acid for 40% of the alignments, and is gapped for the remaining 60%.
where xj is the aligned sequence, i.e. sj with the alignment gaps. So xji is the i-th symbols in the sequence j which is aligned with m. It can be either an amino acid or a gap (σ with a gap always equals the gap penalty, i.e. -0.0625). So to identify discriminant tokens in the motif, we compute the profiles for the sequences following the left (cl ) and the right (cr ) branch and plot the following difference cr − cl . A positive difference points to a token increasing the score of the sequences following the right branch, a negative difference points to a token increasing the score of the sequences following the left branch. So, as we want to identify the features contributing to the signal strength, we are interested in the positive differences. In the case of the first motif, the profiles difference emphasizes the discriminant power of the tokens at position 2 and 3 in the motif (fig. 2 (a)). The two tokens are the same, namely the token {S,CHAM830104}. This property is interesting because it gives the maximum similarity (i.e. 1.0) with the Ser and the following amino acids: A, C, G, P, T and obviously S. It gives a similarity of 0.5 with the Ser for the following amino acids: D, E, F, I, K, M, N, Q, R, W, Y and V, and has no similarity with L (i.e. 0.0). So it clearly promotes the presence of A, C, T G, P, S and T. Regarding the amino acids producing a similarity of 0.5, it is interesting to note that the threshold is 5.4375, which is the the maximum alignment score possible with the motif minus 0.5. So the use of this property in the first motif seems to play the role of a selector for the amino acids having a similarity score of 1. Now that we have identified two tokens having an impact on the alignment score, we must identify where, in the protein sequence, the specificity induced by the token is discriminant (step 2). To do so, we rely on a plot showing how many time a token i is aligned with the residue at position j of the sequences following the right branch. This histogram shows that the two tokens of interest are mainly aligned on the second amino acid (the one immediately after the initiator methionine), and, in less extent, on the third amino acid (fig. 2 (b)). This rough analysis allow to conclude that this node splits the protein set based on the presence of an alanine, cysteine, glycine, proline and serine immediately after the initiator methionine. Moreover, as this node lead to a leaf for the sequences in which the signal is not detected, we can observe that the proteins not having those amino acids at the second position do not undergo the IMC. Therefore the following rule can be proposed: if a sequence start with M¬[ACGPSTV], the methionine is not cleaved. This has been experimentally observed (Burstein and Schechter, 1978; Meinnel et al., 2005) and is corroborated by our model. Moreover this rule is compatible with the pattern in Martinez et al. (2008, table 1). If we take into account only the information regarding the IMC in the cited publication, we can build the following rule: a match with M¬[ACGPST] for the first two amino acids imply no IMC. The same approach can be used to extract information from the other motifs. We will summarize the main lines here. The motifs, the motif score profiles and histograms of the aligned positions, allowing us to extract theses results are provided in supplementary information B (“1.1 Motifs tree for init. Met. cleavage prediction
• a match with M[ACGPS] implies that initiator methionine cleavage occurs; • a match with M[TV]..¬[EK] implies that initiator methionine cleavage occurs; • otherwise the protein does not undergo IMC. To conclude this short analysis we wonder if the simplified rules perform as well as the motifs tree. Indeed the rules produce a MCC of 0.93 while the full motifs tree produces a MCC of 0.97. This is a very good result: we have been able to use the model to infer good rules that allowed us to find the sequence requirement for IMC in H. sapiens. We also showed that the motifs tree is very sensitive to subtle features that are hardly detectable by human, as
the motifs tree produce better classification scores than the inferred rules. Our model, and the tools proposed to analyze it, lead us to draw conclusion on human MetAPs substrates that are very similar to the experimental results published in literature (Burstein and Schechter, 1978; Meinnel et al., 2005; Frottin et al., 2006; Martinez et al., 2008; Xiao et al., 2010). Therefore we can claim that our model is a white box.
3.4
N-terminus prediction service
We developed a free and open online service to allow researchers to use our motifs trees for predicting IMC and Nα -Ac on their sequences of interest. The service is accessible both through a web interface or through a simple REST API (supported in almost every programming languages) and can be used to access the predictors programmatically. The service is available at the following address: http://terminus.unige.ch/ .
4
DISCUSSION
We presented a new method to predict PTMs called motifs tree. The method was tested for the IMC and Nα -Ac by building a classifier for proteins in different taxa. The resulting models are accurate on our datasets and perform as well as the previously published state of the art results, namely TermiNator3. Moreover our results are cross-validated, showing that our model can build classifiers with good generalization capabilities. We did not compare our model with NetAcet because it has been only trained on a small dataset restricted to NatA substrates from S. cerevisiae. Also, we have shown that our Nα -Ac classifier, can take into account subtle information allowing it to improve the classification of potential NatB and NatC substrates which is a feature that is lacking in TermiNator3 and NetAcet. As with all machine learning approaches, the quality of the predictor depends on the quality of the dataset. In biology negative sets are difficult to build because the rely on the non-observation of a phenomena which is not directly annotated in databases. To confirm that our predictor was not biased due to noise in the dataset we have use an hold-out test set. This set is only composed of experimentally confirmed non-acetylated eukaryotic proteins (Bienvenut et al., 2012). All proteins in the hold-out test set and were not seen during training. The Eukaryota motifs tree produce a specificity of 0.85 on this hold-out test set, which is above the cross-validated specificity (+0.09). This good result illustrates that our algorithm induces correct rules to predict non-acetlyation, probably because our methodology can cope with noise in the dataset, or because our dataset is now clean enough to produce accurate predictors. We add that the ability to learn with noise is a desirable feature for a machine learning method. For more details, see supplementary information A (“3.3 Validation for N-terminal acetylation classifiers”). Also our method produces a white box model that shows how features are used to classify sequences. In a preliminary analysis, we have illustrated that our model can provide helpful information about the composition of sequences that promotes or inhibits a PTM. This is also a valuable advantage versus the predictors presented in Liu and Lin (2004); Lars et al. (2005), which are hard or impossible to interpret. We are convinced that a model used for classification in biology should be readable by experts. Indeed models used in machine learning are able to capture characteristics that are very
7
Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on August 26, 2015
in H. sapiens”). First, it is important to remember that we are going through a decision tree, and the alignments are applied on sequences that have been selected by the preceding motifs. The profiles difference of the second motif (supplementary information B, 1.1.2) indicates that the token at position 10 has a major contribution in producing discriminant alignment score between proteins. This token is [AFIKNQSW] and the histogram of aligned positions shows that it is almost always aligned with the second amino acid in the sequence. But we already know that the sequences reaching this node should be carry [ACGPSTV] as the second residue. So we can denoise this token by only considering the intersection between [AFIKNQSW] and [ACGPST], leading to a simplified form of the token: [AS]. The motif seems to detect the presence of an alanine and a serine a the second position. Another token contributes well to the profiles difference, the token 13, which is a fixed amino acid token for the Ser. This token is mainly aligned on the second and third amino acid in the sequences. As a relevant match implies that the sequence undergoes the IMC, this lead us to propose that sequences starting with M[AS] are cleaved. But the MA sequences are highly represented in the set of sequences having a relevant match with the motif (68% of the set), and may hide the contribution of other tokens. So we removed those sequences from the proteins set and produced a new profiles difference. These new profiles emphasize the contribution of the second token in the motif, which is a fixed amino acid token for the Pro, and is always aligned on the second amino acid in the sequence. So, considering the preceding motif and the information provided by the tokens at position 10 and 13, we can conclude that proteins starting with M[APS] undergo IMC. Therefore proteins reaching the last motif should be mainly composed of sequences starting with M[CGTV]. Again the profiles difference (supplementary information B, 1.1.3) indicates that the token 9: [EK] has the greatest difference in the profiles. The histogram shows that it is always aligned on the fifth residue. This is a very interesting feature, because it shows that the MetAPs activity is not only influenced by the amino acid on the second and third position in the sequence, but also by amino acids farther in the sequence. In this case it is also interesting to note that these two amino acids are charged. We conclude that a protein starting by M[CGVT] does not undergo IMC if a glutamic acid or lysine is present in the sequence at position 5. We have extracted simplified rules for each motif. These rules can be combined to produce the sequence requirement for the IMC to occur or not:
hard to see in the data. Those characteristics or features may be exploited to understand the studied biological process. The purpose of this first analysis was only to validate the white box quality of our model, by retrieving experimentally known biological facts from our motifs trees. However in the future we may be able to use the motifs tree as a tool to propose new biological hypothesis that could be tested experimentally. Acknowledgement. We would like to thank Alexandros Kalousis from the Computer Science department (University of Geneva) for the discussion that helped us to improve our methodology. The SIB activities are supported by the State Secretariat for Education, Research and Innovation (SERI).
REFERENCES
8
Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on August 26, 2015
Banzhaf, W., Francone, F. D., Keller, R. E., and Nordin, P. (1998). Genetic programming: an introduction: on the automatic evolution of computer programs and its applications. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. Berthold, M. R., Borgelt, C., Hppner, F., and Klawonn, F. (2010). Guide to Intelligent Data Analysis: How to Intelligently Make Sense of Real Data. Springer Publishing Company, Incorporated, 1st edition. Bienvenut, W. V., Sumpton, D., Martinez, A., Lilla, S., Espagne, C., Meinnel, T., and Giglione, C. (2012). Comparative large scale characterization of plant versus mammal proteins reveals similar and idiosyncratic N-α-acetylation features. Mol Cell Proteomics, 11(6), M111.015131. Blom, N., Gammeltoft, S., and Brunak, S. (1999). Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J Mol Biol, 294(5), 1351–1362. Bologna, G., Yvon, C., Duvaud, S., and Veuthey, A.-L. (2004). N-terminal myristoylation predictions by ensembles of neural networks. Proteomics, 4(6), 1626–1632. Bork, P. and Gibson, T. J. (1996). Applying motif and profile searches. Methods Enzymol, 266, 162–184. Bradshaw, R. A., Brickey, W. W., and Walker, K. W. (1998). N-terminal processing: the methionine aminopeptidase and N-α-acetyl transferase families. Trends Biochem Sci, 23(7), 263–267. Bucher, P., Karplus, K., Moeri, N., and Hofmann, K. (1996). A flexible motif search technique based on generalized profiles. Comput Chem, 20(1), 3–23. Burstein, Y. and Schechter, I. (1978). Primary structures of N-terminal extra peptide segments linked to the variable and constant regions of immunoglobulin light chain precursors: implications on the organization and controlled expression of immunoglobulin genes. Biochemistry, 17(12), 2392–2400. Cai, Y.-D. and Lu, L. (2008). Predicting N-terminal acetylation based on feature selection method. Biochem Biophys Res Commun, 372(4), 862–865. Chuang, G.-Y., Boyington, J. C., Joyce, M. G., Zhu, J., Nabel, G. J., Kwong, P. D., and Georgiev, I. (2012). Computational prediction of N-linked glycosylation incorporating structural properties and patterns. Bioinformatics, 28(17), 2249–2255. Eisenhaber, B. and Eisenhaber, F. (2010). Prediction of posttranslational modification of proteins from their amino acid sequence. In O. Carugo and F. Eisenhaber, editors, Data Mining Techniques for the Life Sciences, volume 609 of Methods in Molecular Biology, pages 365–384. Humana Press. Frottin, F., Martinez, A., Peynot, P., Mitra, S., Holz, R. C., Giglione, C., and Meinnel, T. (2006). The proteomics of N-terminal methionine cleavage. Mol Cell Proteomics, 5(12), 2336–2349. Gautschi, M., Just, S., Mun, A., Ross, S., R¨ucknagel, P., Dubaqui´e, Y., EhrenhoferMurray, A., and Rospert, S. (2003). The yeast n-α-acetyltransferase nata is quantitatively anchored to the ribosome and interacts with nascent polypeptides. Mol Cell Biol, 23(20), 7403–7414. Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edition.
Gonnet, P. and Lisacek, F. (2002). Probabilistic alignment of motifs with sequences. Bioinformatics, 18(8), 1091–1101. Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2nd edition. Julenius, K. (2007). Netcglyc 1.0: prediction of mammalian c-mannosylation sites. Glycobiology, 17(8), 868–876. Kawashima, S. and Kanehisa, M. (2000). AAindex: amino acid index database. Nucleic acids research, 28(1), 374. Kendall, R. L. and Bradshaw, R. A. (1992). Isolation and characterization of the methionine aminopeptidase from porcine liver responsible for the co-translational processing of proteins. J Biol Chem, 267(29), 20667–20673. Kyte, J. and Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. J Mol Biol, 157(1), 105–132. Lars, K., Dyrlov, B. J., and Nikolaj, B. (2005). NetAcet: prediction of N-terminal acetylation sites. Bioinformatics, 21(7), 1269–1270. Liu, Y. and Lin, Y. (2004). A novel method for N-terminal acetylation prediction. Genomics Proteomics Bioinformatics, 2(4), 253–255. Martinez, A., Traverso, J. A., Valot, B., Ferro, M., Espagne, C., Ephritikhine, G., Zivy, M., Giglione, C., and Meinnel, T. (2008). Extent of N-terminal modifications in cytosolic proteins from eukaryotes. Proteomics, 8(14), 2809–2831. Matthews, B. W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta, 405(2), 442–451. Meinnel, T., Peynot, P., and Giglione, C. (2005). Processed N-termini of mature proteins in higher eukaryotes and their major contribution to dynamic proteomics. Biochimie, 87(8), 701–712. Pestana, A. and Pitot, H. C. (1975). Acetylation of nascent polypeptide chains on rat liver polyribosomes in vivo and in vitro. Biochemistry, 14(7), 1404–1412. Polevoda, B. and Sherman, F. (2002). The diversity of acetylated proteins. Genome Biol, 3(5), reviews0006. Polevoda, B. and Sherman, F. (2003). N-terminal acetyltransferases and sequence requirements for N-terminal acetylation of eukaryotic proteins. J Mol Biol, 325(4), 595–622. Polevoda, B., Brown, S., Cardillo, T. S., Rigby, S., and Sherman, F. (2008). Yeast n-αterminal acetyltransferases are associated with ribosomes. J Cell Biochem, 103(2), 492–508. Polevoda, B., Arnesen, T., and Sherman, F. (2009). A synopsis of eukaryotic n-αterminal acetyltransferases: nomenclature, subunits and substrates. BMC Proc, 3 Suppl 6, S2. Quinlan, J. R. (1992). C4.5: Programs for Machine Learning (Morgan Kaufmann Series in Machine Learning). Morgan Kaufmann, 1st edition. Radivojac, P., Vacic, V., Haynes, C., Cocklin, R. R., Mohan, A., Heyen, J. W., Goebl, M. G., and Iakoucheva, L. M. (2010). Identification, analysis, and prediction of protein ubiquitination sites. Proteins, 78(2), 365–380. Russell, S. J. and Norvig, P. (2010). Artificial Intelligence - A Modern Approach. Pearson Education, 3rd edition. Schwartz, D., Chou, M. F., and Church, G. M. (2009). Predicting protein posttranslational modifications using meta-analysis of proteome scale data sets. Mol Cell Proteomics, 8(2), 365–379. Starheim, K. K., Arnesen, T., Gromyko, D., Ryningen, A., Varhaug, J. E., and Lillehaug, J. R. (2008). Identification of the human n-α-acetyltransferase complex b (hnatb): a complex important for cell-cycle progression. Biochem J, 415(2), 325–331. Starheim, K. K., Gromyko, D., Evjenth, R., Ryningen, A., Varhaug, J. E., Lillehaug, J. R., and Arnesen, T. (2009). Knockdown of human N-α-terminal acetyltransferase complex c leads to p53-dependent apoptosis and aberrant human arl8b localization. Mol Cell Biol, 29(13), 3569–3581. Walsh, C. (2006). Posttranslational Modification Of Proteins: Expanding Nature’s Inventory. Roberts and Company Publishers. Xiao, Q., Zhang, F., Nacev, B. A., Liu, J. O., and Pei, D. (2010). Protein Nterminal processing: substrate specificity of escherichia coli and human methionine aminopeptidases. Biochemistry, 49(26), 5588–5599. Zhang, N., Li, B.-Q., Gao, S., Ruan, J.-S., and Cai, Y.-D. (2012). Computational prediction and analysis of protein γ-carboxylation sites based on a random forest method. Mol Biosyst, 8(11), 2946–2955.