Int. J. Bioinformatics Research and Applications, Vol. 1, No. 3, 2006

Improved protein fold assignment using support vector machines

Robert E. Langlois, Alice Diec, Ognjen Perisic, Yang Dai and Hui Lu*

Department of Bioengineering, University of Illinois at Chicago, Chicago, Illinois 60607, USA
Fax: 312 413 2018
E-mail: [email protected]
E-mail: [email protected]
E-mail: [email protected]
E-mail: [email protected]
E-mail: [email protected]
*Corresponding author

Abstract: Because of the large gap between the number of known protein sequences and the number of known protein structures, the ability to construct a computational model that predicts structure from sequence information has become an important area of research. Knowledge of a protein's structure is crucial to understanding its biological role. In this work, we present a support vector machine based method for recognising a protein's fold from sequence information alone, even when the sequence has little similarity to sequences of known structure. We have focused on improving multi-class classification, parameter tuning, descriptor design, and feature selection. The current implementation demonstrates better prediction accuracy than previous similar approaches and performs comparably to straightforward threading.

Keywords: fold recognition; support vector machines; machine learning; proteomics; structure prediction.

Reference to this paper should be made as follows: Langlois, R.E., Diec, A., Perisic, O., Dai, Y. and Lu, H. (2006) 'Improved protein fold assignment using support vector machines', Int. J. Bioinformatics Research and Applications, Vol. 1, No. 3, pp.319–334.

Biographical notes: Robert Ezra Langlois is a second-year PhD student in Bioinformatics in the Department of Bioengineering at the University of Illinois at Chicago (UIC). He earned his BS in Bioengineering at UIC in May 2003. He is currently supported by an NIH training grant: Cellular Signaling in Cardiovascular System. His research interests include machine learning, protein folding, structure prediction, protein function prediction, and binding prediction of signaling proteins.

Alice Diec earned her Master's degree in Bioinformatics from the Department of Bioengineering at UIC in October 2004. She is currently working at the Washington University Genome Center.

Ognjen Perisic is a third-year PhD student in Bioinformatics in the Department of Bioengineering at UIC. His research interests are in computational biophysics, free energy calculation, non-equilibrium statistical physics in biology, and protein structure prediction.

Copyright © 2006 Inderscience Enterprises Ltd.


Yang Dai received a PhD Degree in Management Science and Engineering from the University of Tsukuba, Japan, in 1991. She was a Research Associate and Assistant Professor at the Department of Management Science of Kobe University of Commerce (1991–1997) and at the Department of Mathematical and Computing Sciences of Tokyo Institute of Technology (1997–2001), both in Japan. Since 2001, she has held a faculty position in the Department of Bioengineering, University of Illinois at Chicago. Her current research focuses on bioinformatics, computational biology, machine learning, and data mining, as well as algorithm design for network optimisation, combinatorial optimisation, and global optimisation.

Hui Lu is an Assistant Professor in the Department of Bioengineering at the University of Illinois at Chicago. He earned his PhD from the University of Illinois at Urbana-Champaign and his BS from Beijing University. His research interests in bioinformatics include machine learning, protein folding, protein structure prediction, protein-protein and protein-DNA interactions, molecular dynamics and Monte Carlo simulations, protein function annotation, microarray and gene expression analysis, and gene regulation networks.

1 Introduction

With the completion of the human genome project, the accumulation of sequence information has grown, and continues to grow, at an exponential pace. The much slower growth in the number of structures in the Protein Data Bank (PDB) is enough to illustrate the importance of protein structure prediction relative to more costly and time-consuming experimental methods such as X-ray crystallography and Nuclear Magnetic Resonance (NMR) spectroscopy. A computational method that accurately assigns a structure to a specific protein sequence will provide insights into its function and evolutionary origins. There are four main strategies to solve this problem. First, homology modelling matches sequences to a particular structure using sequence similarity; common tools include HMMer (Karplus et al., 1998) and MODELLER (Sánchez and Sali, 1997; Jones, 1999), among others. Second, fold recognition, for which threading is the main method. Threading detects structural similarities between sequences of little similarity; typical current programs include GenTHREADER (McGuffin and Jones, 2003), PROSPECT (Xu and Xu, 2000), PROSPECTOR (Skolnick and Kihara, 2001) and many others. Threading compares a sequence against a library of structures: the sequence is slid along each template structure and the match is evaluated using a combination of scoring functions. Third, de novo approaches start from the assumption that the native protein state lies at the global free energy minimum. The conformational space is searched with molecular dynamics or Monte Carlo simulations using empirical or physics-based potentials; examples include Rosetta (Bonneau et al., 2001) and TOUCHSTONE (Kihara et al., 2001), among others. In the past few years, a fourth strategy has emerged: machine-learning techniques, which cast the problem as classification, have shown promise in identifying the three-dimensional fold of a protein from sequence alone when no significant identity exists to proteins of known structure.


Two popular approaches in machine learning are neural networks and support vector machines (Cortes and Vapnik, 1995). Although each method has its own suitable application areas, support vector machines (SVM) have demonstrated better performance than other machine learning methods in a number of tasks: classifying microarray data (Brown et al., 2000), text classification (Joachims, 1998), and fold recognition (Dubchak et al., 1999; Ding and Dubchak, 2001; Yu et al., 2003). Likewise, SVM has been combined with Hidden Markov Models (HMM) for remote homology detection (Jaakkola et al., 1999). However, as a relatively new technique, there are many open research issues on how best to implement SVM for a given task. In this paper, we present our approach to building a multi-class SVM classifier in the context of protein fold recognition and the advances made in designing and selecting feature vectors.

A wide range of techniques has been used to solve the protein fold recognition problem. The most successful one is threading. Threading attempts to assign a sequence to a set of known structures based on a set of empirical potentials (Xu and Xu, 2000). Limiting our scope to discriminative techniques still leaves a wide variety of approaches. One of the most accurate approaches in protein family/superfamily recognition is called SVM-pairwise (Liao and Noble, 2003). This approach maps a sequence to a feature vector using the Smith–Waterman similarity score (SW score): the feature vector of a target sequence consists of its SW score against each sequence in the training set, and an SVM is used to discriminate the classes. Another technique, called SVM-Fisher (Jaakkola et al., 2000), uses an HMM to create a feature vector for a protein sequence from the gradient of the log probability with respect to each model parameter. Likewise, SVM I-sites (Hou et al., 2004) takes a different approach to vectorising a protein sequence: the sequence is broken into subsequences of variable length and scored against a library of structural fragments. In other words, the feature vector consists of the sequence-fragment correlations for each fragment in the library. This approach is about as accurate as SVM-pairwise with greater efficiency. Turning the problem on its head, the string kernel approach (Leslie et al., 2002) saves a step by implicitly representing the sequence as a feature vector. That is, the sequences are represented as vectors in the high-dimensional feature space via a string-based feature map. Using the kernel trick, this can be done efficiently with accuracy similar to SVM-Fisher. Finally, Ding and Dubchak (2001) reduce a protein sequence to a set of structural and physical-chemical properties. The advantage of this approach lies in its efficiency and expressiveness, and it can be applied to fold recognition.

The first key factor in the use of machine-learning algorithms is how to efficiently build a compact yet informative descriptor set. In this paper, we demonstrate how a carefully designed secondary structure descriptor can improve the performance of SVM-based classification. Employing SVM, we have tested its effectiveness on a dataset constructed from the ASTRAL40 database (Brenner et al., 2000). Our dataset comprises 53 folds spanning six classes, with at least 20 examples in each fold. Our SVM procedure achieves an accuracy of about 20% when using amino acid composition alone and 40% for our own secondary structure descriptor. When all features were combined, we achieved 48% accuracy (fine-grained, see Table 1) and 53% confidence on a randomised split of the 53-fold Structural Classification of Proteins (SCOP) dataset mentioned previously, and an accuracy of 85% for class-level (coarse-grained, see Table 2) classification.

Table 1  SCOP fold-level results broken down by SCOP class: summary of fine-grained average results

Class     Composition            New SS                 Composition and new SS
          Acc (%)   Conf (%)     Acc (%)   Conf (%)     Acc (%)   Conf (%)
a         29.09     39.73        49.09     54.33        58.18     64.93
b         22.00     24.44        33.33     29.19        38.67     39.42
c         15.00     23.65        49.29     47.50        52.86     54.11
d         13.33     27.81        37.78     56.45        38.89     50.15
f         30.00     75.00        40.00     44.44        70.00     87.50
g         63.33     77.78        33.33     43.21        63.33     82.14
Average   22.64     31.95        41.70     44.96        48.49     53.74

Table 2  SCOP class-level results (coarse-grained results)

Class      Count                Composition            New SS                 Composition and new SS
           Folds   Sequences    Acc (%)   Conf (%)     Acc (%)   Conf (%)     Acc (%)   Conf (%)
a          11      429          56.36     68.13        91.82     90.18        96.36     90.60
b          15      728          64.67     65.54        90.67     87.74        92.67     93.29
c          14      855          82.14     47.33        85.00     77.78        89.29     78.62
d          9       353          6.67      30.00        57.78     67.53        60.00     72.00
f          1       27           30.00     100.00       30.00     100.00       50.00     100.00
g          3       151          80.00     96.00        83.33     83.33        76.67     92.00
Weighted                        57.92     57.92        82.26     82.26        85.28     85.28
Average                         53.31     67.83        73.10     84.43        77.50     87.75

The other key factor addressed here is feature selection. From our current understanding of protein structure prediction, there are many candidate features that may or may not help in fold recognition; thus, a feature selection process can be very useful. We have implemented and tested a feature selection protocol that may improve the results and/or decrease the running time. This protocol will be crucial when the number of features grows much larger. We describe our implementation of SVM and the feature selection protocol in 'Methods', and the performance of our protocol in 'Results'. A summary and possible future improvements are presented in 'Discussion'.

2 Methods

2.1 Support vector machines

SVM is a binary classification method: using a non-linear transformation, it maps the data to a high-dimensional feature space where a linear classification is performed. Training is equivalent to solving the quadratic optimisation problem

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\, w \cdot w + C \sum_i \xi_i \qquad (1)$$

subject to

$$y_i\bigl(\phi(x_i) \cdot w + b\bigr) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, m,$$

where each example (xi, yi), i = 1, ..., m, consists of a feature vector xi labelled by yi ∈ {+1, −1}, and C is a parameter. More precisely, this model is the so-called soft-margin SVM, which tolerates noise within the data. The model generates a separating plane defined by the equation f(x) = φ(x) · w + b = 0. Through the representation w = Σj αj φ(xj), we obtain φ(xi) · w = Σj αj φ(xi) · φ(xj). This gives an efficient way to solve the SVM without explicit use of the non-linear transformation (Cristianini and Shawe-Taylor, 1999).
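As a rough illustration of this soft-margin, Gaussian-kernel setup (the paper later cites LIBSVM, which scikit-learn's SVC wraps), the sketch below trains a single binary classifier. The feature matrix and labels are random placeholders, and the C and γ values are simply borrowed from Table 5 for the sake of the example; none of this is the authors' actual code.

```python
# Minimal sketch of a soft-margin SVM with a Gaussian (RBF) kernel, assuming
# scikit-learn. X and y are random placeholders for fixed-length descriptors
# and their +1/-1 labels; C = 8 and gamma = 0.262 are taken from Table 5
# purely as an illustration.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 125))                 # e.g., 125-dimensional descriptors
y = np.where(rng.random(200) > 0.5, 1, -1)      # binary fold-vs-fold labels

clf = SVC(C=8.0, kernel="rbf", gamma=0.262)     # C: soft-margin penalty
clf.fit(X, y)

# decision_function returns f(x) = phi(x).w + b evaluated through the kernel,
# i.e., without ever forming the non-linear map phi explicitly.
print(clf.decision_function(X[:3]), clf.predict(X[:3]))
```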

2.2 Extending SVM to multiple classes

Extending SVM to handle multiple labels efficiently has taken many forms. Some implementations (Chang and Lin, 2001) alter the above problem to handle multiple classes implicitly, while others use standard machine learning techniques to extend the basic binary classifier explicitly (Ding and Dubchak, 2001). Given the maturity and flexibility of the latter methods, we implemented three such techniques: one-vs-one, one-vs-others, and the Decision Directed Acyclic Graph (DDAG) (Platt et al., 2000). For DDAG and one-vs-one, training is performed on every pairwise classifier, i.e., n(n − 1)/2 classifiers, where n is the number of classes. For one-vs-others, only n classifiers are trained. Note that, counter-intuitively, the training times for one-vs-one and one-vs-others are not very different (Platt et al., 2000). Predicting the label for the 'versus' methods is as follows: sum the predictions of each classifier and take the label with the highest output (see Table 3). As for DDAG, prediction starts from a list of all labels; the labels at opposite ends of the list (label 1 and label n for a list of size n) are tested with the corresponding pairwise classifier, the losing label is removed, and the process repeats with the classifier for the new pair of end labels, as sketched in the code below. Figure 1 graphically illustrates the DDAG decision process. The DDAG classifier is significantly faster (it makes only n − 1 predictions, where n is the number of classes) than the previous two methods, whose speeds are comparable with one another.
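The voting and elimination schemes just described can be made concrete with a short sketch. This is an illustration rather than the authors' implementation; `predict_pair(i, j, x)` is a hypothetical callable standing in for the classifier trained on classes i and j, returning whichever label it prefers for example x.

```python
# Hedged sketch of one-vs-one (AVA) voting and DDAG prediction built on
# already-trained pairwise classifiers; predict_pair(i, j, x) is hypothetical.
from collections import Counter

def one_vs_one_vote(labels, predict_pair, x):
    """Every pairwise classifier casts one vote; the most-voted label wins."""
    votes = Counter()
    for a in range(len(labels)):
        for b in range(a + 1, len(labels)):
            votes[predict_pair(labels[a], labels[b], x)] += 1
    return votes.most_common(1)[0][0]

def ddag_predict(labels, predict_pair, x):
    """DDAG: test the two labels at the ends of the list, drop the loser,
    and repeat, so only n - 1 pairwise predictions are needed."""
    remaining = list(labels)
    while len(remaining) > 1:
        winner = predict_pair(remaining[0], remaining[-1], x)
        if winner == remaining[0]:
            remaining.pop()        # last label eliminated
        else:
            remaining.pop(0)       # first label eliminated
    return remaining[0]

# Example with a toy rule standing in for trained classifiers.
toy = lambda i, j, x: min(i, j)
print(one_vs_one_vote([1, 2, 3], toy, None), ddag_predict([1, 2, 3], toy, None))
```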


Table 3  The voting systems in the different multiple-class extensions

Method             Class vote
AVA      OVA       1        2        3
1v2      1         1(1)     0(0)     0(0)
1v3      2         1(1)     0(0)     0(1)
2v3      3         0(1)     0(1)     0(0)
                   3(3)     0(1)     0(1)

This table illustrates the voting system used to combine classifiers into a single output; entries are shown as AVA (OVA).

Figure 1  A graphical representation of the DDAG voting system

The previously described feature vector (xi) is a fixed set of attributes describing the state. Our problem is to take each state, in this case a protein sequence, and assign it a label (a SCOP fold-level classification). The labels come from a fixed set, the assumption being that any given input (sequence) will fall into one of these categories. It follows that finding a good mapping between feature and state lies at the crux of our work. In the following sections, we elucidate the state, the labels, and the features used to describe the state.

2.3 Parameter tuning

The first step in training an SVM classifier is to choose the kernel. As suggested in the literature (Chang and Lin, 2001), we chose the Gaussian kernel. A multi-scale grid search was used to find the best combination of the two parameters: C for the soft-margin SVM and γ for the Gaussian kernel. The optimal parameters were selected by the best weighted accuracy averaged over all classifiers (pairwise for one-vs-one and DDAG) under five-fold cross-validation. This procedure proved optimal when compared with other techniques in which each classifier maintains its own set of parameters (results not shown). However, without careful separation between training and testing, this technique may fail rather spectacularly in real applications. The training accuracy reported here reflects the average cross-validation accuracy of each individual two-class classifier; thus, the high percentages reported here do not indicate over-training.
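A grid search of this kind could look roughly like the sketch below, assuming scikit-learn; the parameter grids, the random data, and the balanced-accuracy scorer (a stand-in for the weighted accuracy used in the paper) are all assumptions. A second, finer grid around the best coarse cell would give the multi-scale behaviour described above.

```python
# Illustrative (C, gamma) grid search with five-fold cross-validation;
# grids, data, and the scorer are assumptions, not the paper's settings.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 125)), rng.integers(0, 2, size=100)

param_grid = {
    "C": 2.0 ** np.arange(-3, 7, 2),       # coarse scale; refine around the best cell
    "gamma": 2.0 ** np.arange(-7, 3, 2),
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                      scoring="balanced_accuracy", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```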


2.4 Feature selection

We applied feature selection to a set of features in the dataset created by Ding and Dubchak (2001). That dataset is similar to ours but has fewer folds: it consists of 27 folds with no two sequences having greater than 35% sequence identity. Ding and Dubchak (2001) developed six feature vector sets utilising structural and physical-chemical properties extracted from the protein sequences. The first feature set, composition (C), is simply the percent composition vector of the 20 amino acids. Predicted secondary structure (S), hydrophobicity (H), normalised van der Waals volume (V), polarity (P) and polarisability (Z) were constructed differently. Details of the feature construction can be found in the literature (Ding and Dubchak, 2001). Here, we summarise their final vector, with all six independent feature vectors, in Table 4, which represents a total of 20 + 21 × 5 = 125 features.

Table 4  Summary of feature vector dimensions for protein fold recognition

Symbol   Parameter                           Dimension
C        Amino acid composition              20
S        Predicted secondary structure       21
H        Hydrophobicity                      21
P        Polarity                            21
V        Normalised van der Waals volume     21
Z        Polarisability                      21
All      All six feature vectors combined    125

In many classification problems, input vectors may be of high dimension. For computational efficiency, discarding irrelevant features prior to training may be favourable, particularly when the number of available features significantly exceeds the number of examples. This is especially the case in many bioinformatic applications, including the protein fold recognition problem considered here. Feature selection is performed for each binary classifier. The Fisher score was used as the feature-ranking value; it is defined as

$$F(r) = \frac{\bigl(\mu_r^+ - \mu_r^-\bigr)^2}{\bigl(\sigma_r^+\bigr)^2 + \bigl(\sigma_r^-\bigr)^2} \qquad (2)$$

where μr± is the mean value of the rth feature in the positive/negative class and σr± is the corresponding standard deviation. To find the optimal subset of features, the protocol starts with the top 5% of the ranked feature list and adds 5% in each round until the best performance on the training set is reached. The optimal percentage of features is then used on the independent test set.
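A rough sketch of this protocol is given below: features are ranked by the Fisher score of equation (2) and the feature set is grown in 5% steps. The `evaluate` callable (e.g., cross-validated training accuracy of the binary classifier) is a hypothetical stand-in for the scoring step, which the paper does not spell out in code.

```python
# Hedged sketch of Fisher-score feature ranking and 5%-increment selection.
import numpy as np

def fisher_scores(X, y):
    """F(r) = (mu_r+ - mu_r-)^2 / ((sigma_r+)^2 + (sigma_r-)^2), assuming y in {+1, -1}."""
    pos, neg = X[y == 1], X[y == -1]
    num = (pos.mean(axis=0) - neg.mean(axis=0)) ** 2
    den = pos.var(axis=0) + neg.var(axis=0) + 1e-12   # guard against zero variance
    return num / den

def select_features(X, y, evaluate, step=0.05):
    order = np.argsort(fisher_scores(X, y))[::-1]     # best-ranked features first
    best_subset, best_score = None, -np.inf
    for frac in np.arange(step, 1.0 + step, step):
        subset = order[: max(1, int(round(frac * len(order))))]
        score = evaluate(X[:, subset], y)             # e.g., cross-validated accuracy
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset
```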


2.5 Dataset

The state in our problem is a set of sequences, which are classified into different categories according to the SCOP system (Murzin et al., 1995). SCOP classifies a protein according to a hierarchy of class, fold, superfamily and family. Previous work (Dubchak et al., 1999) has shown that class-level classification can already be achieved with high accuracy. Similarly, superfamily- and family-level recognition has not proven hard for current homology-modelling techniques. We have therefore focused on assigning protein sequences at the fold level. Given the difficulties in reconstructing Dubchak's 27-fold dataset (missing identifiers and reclassified sequences), we opted to create our own, more complete dataset. Note that while Dubchak does provide an online dataset, it contains the features but not the original sequences. The ASTRAL40 (Brenner et al., 2000) database was used to ensure that no example had more than 40% identity to another. Moreover, only folds with no fewer than 20 examples were taken, to ensure testing and training sets large enough for accurate and significant results, respectively. The final dataset consists of 53 folds. The training and testing sets consist of 2,013 and 530 examples, respectively: for each of the 53 folds, the test set has exactly ten examples, whereas the training set has no fewer than ten examples.
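The split rule just described can be sketched as follows; `domains` is a hypothetical mapping from SCOP fold identifier to its list of non-redundant sequences, and the random hold-out is only one plausible way to realise the rule.

```python
# Sketch of the dataset-construction rule: keep folds with at least 20
# non-redundant domains and hold out exactly ten per fold for testing.
import random

def split_by_fold(domains, min_examples=20, test_per_fold=10, seed=0):
    rng = random.Random(seed)
    train, test = [], []
    for fold, seqs in domains.items():
        if len(seqs) < min_examples:
            continue                       # fold too small for reliable statistics
        seqs = list(seqs)
        rng.shuffle(seqs)
        test += [(fold, s) for s in seqs[:test_per_fold]]
        train += [(fold, s) for s in seqs[test_per_fold:]]
    return train, test
```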

2.6 Accuracy scores

The following are the standard accuracy and confidence scores used in this paper. Note that an interesting review of extending these binary accuracy scores to multi-class problems is given in Baldi et al. (2000).

$$\text{Accuracy} = \frac{TP}{TP + FN} \qquad (3)$$

$$\text{Confidence} = \frac{TP}{TP + FP} \qquad (4)$$
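For the multi-class results reported later, these scores reduce to per-class recall and precision; a small sketch of that computation (not the authors' code) is:

```python
# Per-class accuracy (recall, equation (3)) and confidence (precision,
# equation (4)) from lists of true and predicted fold labels.
from collections import Counter

def per_class_scores(y_true, y_pred):
    tp, fn, fp = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fn[t] += 1                     # missed example of class t
            fp[p] += 1                     # false alarm for class p
    scores = {}
    for c in set(y_true) | set(y_pred):
        acc = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        conf = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        scores[c] = (acc, conf)
    return scores

print(per_class_scores(["a", "a", "b", "b"], ["a", "b", "b", "a"]))
```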

3 Results

3.1 Descriptor designs

Our state, a protein sequence, does not make a very good input for most standard machine-learning algorithms. These sequences vary in length and, because of insertions and deletions, exhibit some positional dependence. Here, our task is to map the sequences to a unique set of fixed-length features and, in doing so, attempt to remove this positional dependence. Given that protein sequences consist of a fixed alphabet of residues, our first attempt was to look at the relative frequency of each residue in a protein sequence. However, as described in previous literature, this is not a very expressive descriptor. Indeed, for the 27-fold test case published before, the discriminative power of amino acid composition is around 50%; the performance in our blind 53-fold experiment is about 20% (Figure 2).
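The composition descriptor itself is simply the relative frequency of each of the 20 standard residues; a minimal sketch:

```python
# Amino acid composition: relative frequency of each of the 20 standard
# residues, ignoring any non-standard characters in the sequence.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    seq = sequence.upper()
    total = sum(seq.count(a) for a in AMINO_ACIDS) or 1
    return [seq.count(a) / total for a in AMINO_ACIDS]

print(composition("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```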

Figure 2  Overall accuracy of the various descriptors tested on the harder dataset

Next, we designed a descriptor based on the secondary structure assignment of each residue in a sequence using PSI-PRED (McGuffin et al., 2000) (not the actual DSSP (Kabsch and Sander, 1983) assignments). The idea behind this descriptor is to capture the main secondary structure elements and the corresponding topology. Here, we analysed the number of structural units and the number of element patterns. For instance, we count the number of alpha helices, then the number of alpha helices followed by beta sheets, and so on. This is performed without preference up to some predefined limit, e.g., four-element patterns (see Figure 3). Additionally, this descriptor was extended using bins to count the sizes of individual secondary structure elements. Finally, we combine this descriptor with amino acid composition. Note that this descriptor is different from other secondary structure descriptors published in the literature (Ding and Dubchak, 2001; Yu et al., 2003).

Figure 3  An example of the secondary structure descriptor
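The pattern-counting idea can be sketched as follows, assuming a predicted secondary structure string over H (helix), E (strand) and C (coil). The pattern-length limit and the length bins below are illustrative choices, since the paper does not specify them exactly.

```python
# Hedged sketch of the secondary-structure descriptor: collapse the per-residue
# prediction into a string of elements, count element patterns up to a fixed
# length (up to 4, as in the text), and add binned element sizes.
from collections import Counter
from itertools import groupby, product

def ss_descriptor(ss_string, max_pattern=4, length_bins=(5, 10, 20)):
    # runs of identical states, e.g. "HHHEEEC" -> [("H", 3), ("E", 3), ("C", 1)]
    runs = [(state, len(list(g))) for state, g in groupby(ss_string)]
    elements = "".join(state for state, _ in runs)

    features = []
    for k in range(1, max_pattern + 1):                  # count k-element patterns
        counts = Counter(elements[i:i + k] for i in range(len(elements) - k + 1))
        features += [counts["".join(p)] for p in product("HEC", repeat=k)]

    for state in "HEC":                                   # binned element lengths
        lengths = [n for s, n in runs if s == state]
        for lo, hi in zip((0,) + length_bins, length_bins + (10**9,)):
            features.append(sum(lo < n <= hi for n in lengths))
    return features

print(len(ss_descriptor("CCHHHHHHCCEEEEECCHHHHCCEEEECC")))
```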

3.2 Secondary structure descriptor performance

Figure 2 and Table 5 show the performance of the secondary structure descriptor. Its accuracy, 40%, is much better than that of composition, 20%. This is a considerable improvement over previously published work, in which the secondary structure descriptor was reported to perform slightly worse than amino acid composition (Ding and Dubchak, 2001).

Table 5  SCOP class-level accuracy for the secondary structure descriptor using the Gaussian kernel (C = 8, γ = 0.262)

SCOP class     Accuracy (%)   Confidence (%)   Number of folds
All alpha      49.96          45.45            11
All beta       26.64          29.33            15
Alpha/beta     39.15          44.29            14
Alpha + beta   55.89          36.67            9
Membrane       44.44          40.00            1
Small          49.96          45.45            3

As anticipated, using structural and topological information derived from sequence information provides a strong signal for classifying folds. However, this approach is limited by the accuracy of secondary structure prediction. As seen in Table 5, the best performance comes from folds that consist primarily of alpha helices, because alpha helical secondary structure is predicted relatively accurately. An interesting problem arising from the SCOP classification is that quite a few alpha helical folds are wrongly predicted as folds in the alpha + beta class. This stems from the misconception that all-alpha folds consist only of alpha helices: in those instances, a small or insignificant beta sheet is present in a particular group of folds, so such folds have a greater chance of falling into the wrong category.

Although the secondary structure descriptor is far more accurate than composition, the best performance comes from combining the two. The accuracy increases to 48.5% while the confidence level reaches 53.7% (Figure 2). This accuracy is lower than that of previous publications on SVM-based fold recognition because we are predicting 53 folds rather than 27. From the fold-level results in Table 6, we found no obvious correlation between fold class and prediction accuracy (contrary to what the coarse-grained results might suggest). In more than a few cases, failures in secondary structure prediction account for the misclassification.

Table 6  A summary of the fold-level classification displaying the accuracy for each fold

Fold       #      Comp                   New SS                 Comp and New SS
                  Acc (%)   Conf (%)     Acc (%)   Conf (%)     Acc (%)   Conf (%)
a.1        31     50.00     71.43        100.00    76.92        90.00     69.23
a.102      25     10.00     20.00        20.00     66.67        60.00     85.71
a.118      52     40.00     40.00        90.00     69.23        80.00     72.73
a.2        22     50.00     55.56        10.00     33.33        50.00     100.00
a.24       31     30.00     42.86        60.00     46.15        50.00     55.56
a.26       26     40.00     57.14        40.00     80.00        40.00     80.00
a.3        34     30.00     30.00        50.00     55.56        60.00     54.55
a.39       42     10.00     08.33        30.00     60.00        60.00     66.67
a.4        117    40.00     11.76        90.00     26.47        90.00     26.47
a.45       20     20.00     100.00       40.00     50.00        50.00     83.33
a.60       29     00.00     00.00        10.00     33.33        10.00     20.00
b.1        241    70.00     12.28        90.00     21.95        60.00     17.65
b.10       41     40.00     50.00        50.00     29.41        50.00     35.71
b.18       25     00.00     00.00        00.00     00.00        00.00     00.00
b.2        27     10.00     50.00        00.00     00.00        10.00     20.00
b.29       33     20.00     33.33        30.00     37.50        20.00     28.57
b.34       53     50.00     38.46        40.00     40.00        60.00     37.50
b.40       87     30.00     17.65        50.00     17.86        70.00     29.17
b.42       24     10.00     14.29        00.00     00.00        40.00     44.44
b.43       27     00.00     00.00        10.00     33.33        20.00     50.00
b.47       31     50.00     41.67        80.00     57.14        80.00     61.54
b.55       28     20.00     22.22        50.00     50.00        50.00     71.43
b.6        38     10.00     20.00        20.00     40.00        40.00     66.67
b.60       21     20.00     66.67        60.00     85.71        60.00     100.00
b.71       21     00.00     00.00        00.00     00.00        00.00     00.00
b.82       31     00.00     00.00        20.00     25.00        20.00     28.57
c.1        182    50.00     05.38        90.00     25.71        80.00     28.57
c.2        100    40.00     25.00        60.00     50.00        60.00     46.15
c.23       66     10.00     12.50        40.00     40.00        40.00     36.36
c.26       42     00.00     00.00        30.00     60.00        30.00     33.33
c.3        46     20.00     28.57        70.00     53.85        80.00     57.14
c.37       122    10.00     03.33        50.00     19.23        60.00     27.27
c.47       51     10.00     07.69        70.00     58.33        80.00     53.33
c.52       23     00.00     00.00        20.00     50.00        20.00     50.00
c.55       53     00.00     00.00        10.00     20.00        10.00     14.29
c.56       24     00.00     00.00        00.00     00.00        20.00     66.67
c.66       35     20.00     28.57        40.00     66.67        40.00     66.67
c.67       35     10.00     100.00       90.00     100.00       90.00     100.00
c.69       51     30.00     20.00        60.00     54.55        70.00     77.78
c.94       25     10.00     100.00       60.00     66.67        60.00     100.00
d.142      20     00.00     00.00        10.00     33.33        20.00     66.67
d.144      26     30.00     100.00       50.00     83.33        70.00     77.78
d.15       56     30.00     42.86        80.00     57.14        70.00     58.33
d.153      25     20.00     50.00        70.00     100.00       70.00     77.78
d.169      25     20.00     50.00        50.00     62.50        40.00     50.00
d.17       24     00.00     00.00        10.00     100.00       00.00     00.00
d.3        20     00.00     00.00        00.00     00.00        10.00     50.00
d.58       133    20.00     07.41        50.00     21.74        50.00     20.83
d.92       24     00.00     00.00        20.00     50.00        20.00     50.00
Weighted          22.64     22.64        41.70     41.70        48.49     48.49
Average           22.64     31.95        41.70     44.96        48.49     53.74


3.3 Comparison with a threading program

To evaluate our SVM setup, we compared its performance with the threading program PROSPECT (Xu and Xu, 2000). The program was downloaded from Ying Xu's webpage at the University of Georgia. We ran the threading program on each of our test sequences using a fold library composed of our training set. The program was not manually tuned but used the default parameters. Finally, we calculated the accuracy in the same way as for the SVM analysis. The overall accuracy from threading is 56.2%, which is higher than the 48.5% from SVM. The performance on the individual classes is 72%, 52%, 52%, 47%, 30%, and 77%. Note that threading uses much more information from the sequences and structures than SVM, such as structural contacts, statistical potentials, and sequence similarities; SVM performance is expected to increase when such information is included. On the other hand, we want to emphasise that the current threading setup does not reflect the true performance of the state of the art in threading, where the template structure database is properly set up and human knowledge is included. This comparison between SVM and threading is intended to provide a quick evaluation of SVM to ensure that we are on the right track.

3.4 Feature selection Feature selection is done for two reasons. First, removing unnecessary or ‘bad’ features can improve the accuracy of most machine learning algorithms. While some work claimed feature selection would improve the accuracy of SVM experiments (Weston et al., 2000), this is not the case in some other applications (Liu, 2004). Second, feature selection does provide insights into the quality and productivity of each feature. The features selected by the Fisher score equation (2) in the 75th percentile across all OVO classifiers were tallied and the frequencies presented in Figure 4. Note that the order of features reflects the order of C, S, H, P, V and Z (using Ding and Dubchak dataset). As shown in Figures 4 and 5, composition (C) and secondary structure (S) generate the strongest signals. Further evidence is provided in Figure 4 where the combination of C and S performs quite well. Hydrophobicity also induces a moderate signal (Figure 4), as further supported in Figure 5 demonstrating that this combination yields the best results. Figure 4

Frequency of features ranked in the 75th percentile

Figure 5  Protein fold recognition: single and jury voting detail

4 Discussion

In this work, we have developed an SVM protocol for protein fold recognition. It involves three components: descriptors, classification, and feature selection. On a test set of 53 folds, our accuracy is 48.5%, compared with 56.2% from automated threading. Note that threading uses considerably more information than our method; it is encouraging that the SVM can achieve qualitatively comparable results at this stage. We expect that with further development of features, the performance of SVM can improve dramatically.

We have found that carefully designed secondary structure descriptors perform much better than composition. In our SVM setup, we can achieve 52% accuracy with composition using the 27-fold dataset from the literature, which is comparable to the original publication (Ding and Dubchak, 2001). On the 53-fold dataset, composition alone generates only 20% accuracy because more folds have to be classified. Our secondary structure descriptor improved the accuracy to 40%, in contrast with the mechanically designed descriptors previously reported to perform worse than composition. Thus, the other descriptors in the previous work should be carefully checked, and improvements are expected.

Improving the accuracy (the generality of the classifier) is one of the main motivations for using feature selection. However, as the results have demonstrated, the Fisher score is probably not the best approach to improving performance. A more sophisticated approach like forward selection (Weston et al., 2000) would reveal the ideal set of features to achieve the highest accuracy. Nevertheless, better accuracy is not the only reason for feature selection. Here, we investigated the features that provide the strongest signal for classifying a particular sequence. Using the insights gained from feature selection, future work will hence focus on improving current features and developing new features similar to those found successful.

The secondary structure descriptor would benefit from the confidence labels generated by PSI-PRED. By incorporating these confidence labels, it may be possible to leverage a previously unused resource; that is, there may be some pattern in these labels that can be used directly as a signal to categorise a sequence. As for the problem


of fuzzy classifications, i.e., a fold in the all-alpha class containing beta sheets, a pre-processing step could eliminate insignificant beta sheets (or alpha helices). This procedure would depend on an empirical threshold mined from the training set. Such an approach could be extended to amino acid composition and hydrophobicity. Note that this approach differs from feature selection in that a particular feature is not removed, only reduced in cases where certain criteria are not met. For example, in the case of composition, one particular fold may rely on a disulphide bridge between two cysteines, while other folds contain a few cysteines that do not bond owing to distance or neighbour constraints. By removing the contribution of cysteine in the latter cases, the cysteine signal in composition can be greatly strengthened for identifying the former fold. Interestingly, the fold-level results indicate that it would be helpful to reorganise the various folds so that those that cannot be distinguished by our descriptor fall into the same temporary class. Targets that fall into such a class can then have their classification further refined.

5 Future work

The next step is to expand the structural characteristics in our descriptor using insights from ab initio folding. The idea here is to make use of structurally conserved fragments that are unique to each fold in our dataset. A set of statistical potentials would then be used to identify and quantify the strongest correlations in a test sequence. It follows that a feature vector incorporating this descriptor can be built for each fold. While these same insights form the building blocks of threading, it would be interesting to see whether a machine learning approach like SVM could do a better job of discriminating between folds. Moreover, no one has yet used boosting (Schapire and Singer, 1999) for fold recognition. While this technique is not robust against noisy data, variations are being developed to overcome this shortcoming. Interestingly, the results of Ding and Dubchak (2001) were improved upon using a technique related to boosting (Yu et al., 2003).

6 Conclusions

As machine learning techniques improve, they are finding more and more uses in different fields. In the area of fold recognition, these techniques still leave much to be desired when compared against mature techniques like threading. However, there are numerous advantages to developing more principled techniques. Such techniques require no special knowledge to implement, have fewer parameters to tune, and provide theoretical guarantees. Moreover, machine-learning techniques can provide unique insights into the importance of a sequence feature in determining structure (i.e., via feature selection).

Acknowledgments

This work is partially supported by startup funds from UIC Bioengineering to H.L. R.E.L. is supported by NIH training grant T32 HL 07692: Cellular Signaling in Cardiovascular System (PI, John Solaro).


References

Baldi, P., Brunak, S., Chauvin, Y., Andersen, C.A.F. and Nielsen, H. (2000) 'Assessing the accuracy of prediction algorithms for classification: an overview', Bioinformatics, Vol. 16, No. 5, pp.412–424.

Bonneau, R., Tsai, J., Ruczinski, I., Chivian, D., Strauss, C.E.M. and Baker, D. (2001) 'Rosetta in CASP4: progress in ab initio protein structure prediction', Proteins: Structure, Function, and Genetics, Vol. 45, No. S5, pp.119–126.

Brenner, S.E., Koehl, P. and Levitt, M. (2000) 'The ASTRAL compendium for protein structure and sequence analysis', Nucl. Acids Res., Vol. 28, No. 1, pp.254–256.

Brown, M.P.S., Grundy, W.N., Lin, D., Cristianini, N., Sugnet, C.W., Furey, T.S., Ares Jr., M. and Haussler, D. (2000) 'Knowledge-based analysis of microarray gene expression data by using support vector machines', PNAS, Vol. 97, No. 1, pp.262–267.

Chang, C-C. and Lin, C-J. (2001) LIBSVM: A Library for Support Vector Machines, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Cortes, C. and Vapnik, V. (1995) 'Support-vector networks', Machine Learning, Vol. 20, No. 3, pp.273–297.

Cristianini, N. and Shawe-Taylor, J. (1999) An Introduction to Support Vector Machines and other Kernel-Based Learning Methods, Cambridge University Press, Cambridge, UK.

Ding, C.H.Q. and Dubchak, I. (2001) 'Multi-class protein fold recognition using support vector machines and neural networks', Bioinformatics, Vol. 17, No. 4, pp.349–358.

Dubchak, I., Muchnik, I., Mayor, C., Dralyuk, I. and Kim, S-H. (1999) 'Recognition of a protein fold in the context of the structural classification of proteins (SCOP) classification', Proteins: Structure, Function, and Genetics, Vol. 35, No. 4, pp.401–407.

Hou, Y.N., Hsu, W., Lee, M.L. and Bystroff, C. (2004) 'Remote homolog detection using local sequence-structure correlations', Proteins: Structure, Function, and Bioinformatics, Vol. 57, No. 3, pp.518–530.

Jaakkola, T., Diekhans, M. and Haussler, D. (1999) 'Using the Fisher kernel method to detect remote protein homologies', Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology (ISMB-99), 6–10 August, Heidelberg, Germany.

Jaakkola, T., Diekhans, M. and Haussler, D. (2000) 'A discriminative framework for detecting remote protein homologies', Journal of Computational Biology, Vol. 7, No. 1, pp.95–114.

Joachims, T. (1998) 'Text categorization with support vector machines: learning with many relevant features', European Conference on Machine Learning (ECML-98), 21–24 April, Chemnitz, Germany.

Jones, D.T. (1999) 'Protein secondary structure prediction based on position-specific scoring matrices', Journal of Molecular Biology, Vol. 292, pp.195–202.

Kabsch, W. and Sander, C. (1983) 'Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features', Biopolymers, Vol. 22, No. 12, pp.2577–2637.

Karplus, K., Barrett, C. and Hughey, R. (1998) 'Hidden Markov models for detecting remote protein homologies', Bioinformatics, Vol. 14, No. 10, pp.846–856.

Kihara, D., Lu, H., Kolinski, A. and Skolnick, J. (2001) 'TOUCHSTONE: an ab initio protein structure prediction method that uses threading-based tertiary restraints', Proceedings of the National Academy of Sciences (USA), Vol. 98, No. 18, pp.10125–10130.

Leslie, C., Eskin, E., Cohen, A., Weston, J. and Noble, W.S. (2002) 'Mismatch string kernels for discriminative protein classification', Neural Information Processing Systems.

Liao, L. and Noble, W.S. (2003) 'Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships', Journal of Computational Biology, Vol. 10, No. 6, pp.857–868.

Liu, Y. (2004) 'A comparative study on feature selection methods for drug discovery', Journal of Chemical Information and Computer Sciences, Vol. 44, No. 5, pp.1823–1828.

McGuffin, L.J., Bryson, K. and Jones, D.T. (2000) 'The PSIPRED protein structure prediction server', Bioinformatics, Vol. 16, No. 4, pp.404–405.

McGuffin, L.J. and Jones, D.T. (2003) 'Improvement of the GenTHREADER method for genomic fold recognition', Bioinformatics, Vol. 19, No. 7, pp.874–881.

Murzin, A.G., Brenner, S.E., Hubbard, T. and Chothia, C. (1995) 'SCOP: a structural classification of proteins database for the investigation of sequences and structures', Journal of Molecular Biology, Vol. 247, pp.536–540.

Platt, J.C., Cristianini, N. and Shawe-Taylor, J. (2000) 'Large margin DAGs for multiclass classification', Advances in Neural Information Processing Systems.

Schapire, R.E. and Singer, Y. (1999) 'Improved boosting algorithms using confidence-rated predictions', Machine Learning, Vol. 37, No. 3, pp.297–336.

Skolnick, J. and Kihara, D. (2001) 'Defrosting the frozen approximation: PROSPECTOR – a new approach to threading', Proteins: Structure, Function, and Genetics, Vol. 42, No. 3, pp.319–331.

Sánchez, R. and Sali, A. (1997) 'Evaluation of comparative protein structure modeling by MODELLER-3', Proteins: Structure, Function, and Genetics, Vol. 29, No. S1, pp.50–58.

Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T. and Vapnik, V. (2000) 'Feature selection for SVMs', in Leen, T.K., Dietterich, T.G. and Tresp, V. (Eds.): Proceedings of Advances in Neural Information Processing Systems, The MIT Press, Cambridge, USA, Vol. 13, pp.668–674.

Xu, Y. and Xu, D. (2000) 'Protein threading using PROSPECT: design and evaluation', Proteins: Structure, Function, and Genetics, Vol. 40, No. 3, pp.343–354.

Yu, C-S., Wang, J-Y., Yang, J-M., Lyu, P-C., Lin, C-J. and Hwang, J-K. (2003) 'Fine-grained protein fold assignment by support vector machines using generalized n-peptide coding schemes and jury voting from multiple-parameter sets', Proteins: Structure, Function, and Genetics, Vol. 50, No. 4, pp.531–536.
