
Int. J. Functional Informatics and Personalised Medicine, Vol. 1, No. 1, 2008

Efficient Super Granular SVM Feature Elimination (Super GSVM-FE) model for protein sequence motif information extraction

Bernard Chen*
Department of Computer Science, Georgia State University, 34 Peachtree Street, Room 1417 A, Atlanta, GA 30303, USA
E-mail: [email protected]
*Corresponding author

Stephen Pellicer
Department of Computer Science, Georgia State University, 34 Peachtree Street, Room 1417 B, Atlanta, GA 30303, USA
E-mail: [email protected]

Phang C. Tai
Department of Biology, Georgia State University, 402 Kell Hall, 24 Peachtree Center Ave., Atlanta, GA 30303, USA
E-mail: [email protected]

Robert Harrison
Department of Computer Science, Georgia State University, 34 Peachtree Street, Room 1440, Atlanta, GA 30303, USA
E-mail: [email protected]

Yi Pan
Department of Computer Science, Georgia State University, 34 Peachtree Street, Room 1442, Atlanta, GA 30303, USA
E-mail: [email protected]


Abstract: Protein sequence motifs are attracting increasing attention in the sequence analysis area. These conserved regions have the potential to determine the conformation, function and activities of proteins. We develop a new method that combines the concept of granular computing with the power of Ranking-SVM to further extract protein sequence motif information generated from the FGK model. By applying this new feature elimination model, the quality of motif information increases dramatically in all three evaluation measures. Since the training step of Ranking-SVM is very time consuming, we also provide a feasible way to reduce the training time dramatically without sacrificing quality.

Keywords: FIK model; FGK model; ranking SVM; feature elimination; protein sequence motif.

Reference to this paper should be made as follows: Chen, B., Pellicer, S., Tai, P.C., Harrison, R. and Pan, Y. (2008) 'Efficient Super Granular SVM Feature Elimination (Super GSVM-FE) model for protein sequence motif information extraction', Int. J. Functional Informatics and Personalised Medicine, Vol. 1, No. 1, pp.8–25.

Biographical notes: Bernard Chen received the BS Degree in Computer Science from Fu-Jen Catholic University, Taipei, Taiwan, in 2002. He is currently working toward the PhD Degree in the Department of Computer Science, Georgia State University, under the supervision of Dr. Y. Pan. He is also a Molecular Basis of Disease (MBD) PhD fellow. His main research interests include bioinformatics, data mining algorithms, fuzzy logic, soft computing, microarray and protein array analysis, and high-performance computing.

Stephen Pellicer received the BS Degree in Computer Science from the University of Georgia, Athens, Georgia. He is currently working toward the PhD Degree in the Department of Computer Science, Georgia State University, under the supervision of Dr. Y. Pan. He is also a Molecular Basis of Disease (MBD) PhD fellow. His main research interests include high-performance computing, peer-to-peer networks, data mining algorithms and bioinformatics.

Phang C. Tai received the PhD Degree in Microbiology from the University of California, Davis. He did postdoctoral work at Harvard Medical School, Boston, MA, and is currently a Regents' Professor and Chair of the Department of Biology, Georgia State University, Atlanta. His research interest is in molecular biology and microbial physiology. His current research focuses on the mechanism of protein secretion across bacterial membranes, with emphasis on the structure and function of the SecA protein.

Robert Harrison received the BS Degree in Biophysics from Pennsylvania State University, University Park, PA, in 1979 and the PhD Degree in Molecular Biochemistry and Biophysics from Yale University in 1985. He is currently a Professor in the Department of Computer Science at Georgia State University. He is a Georgia Cancer Coalition Distinguished Scholar, and his research has been supported by the NIH and the Georgia Cancer Coalition. His current research interests include computational approaches to the prediction and design of molecular structure, machine learning, and the development of grammar-based models for systems and structural biology.

Yi Pan is the Chair and a Professor in the Department of Computer Science and a Professor in the Department of Computer Information Systems at Georgia State University. He received his PhD Degree in Computer Science from the University of Pittsburgh, USA, in 1991. His research interests include parallel and distributed computing, optical networks, wireless networks and bioinformatics. He has published more than 100 journal papers, with over 30 papers in various IEEE journals. His recent research has been supported by NSF, NIH, NSFC, AFOSR, AFRL, JSPS, IISF and the states of Georgia and Ohio.

1  Introduction

Proteins can be regarded as among the most important elements in the process of life; they can be separated into different families according to sequential or structural similarities. The close relationship between protein sequence and structure plays an important role in current analysis and prediction technologies. Therefore, understanding the hidden knowledge between protein structures and their sequences is an important task in today's bioinformatics research.

The biological term sequence motif denotes a relatively small number of functionally or structurally conserved sequence patterns that occur repeatedly in a group of related proteins. These motif patterns may be able to predict other proteins' structural or functional areas, such as enzyme-binding sites, DNA or RNA binding sites, prosthetic attachment sites, etc. PROSITE (Hulo et al., 2004), PRINTS (Attwood et al., 2002) and BLOCKS (Henikoff et al., 1999) are three of the most popular motif databases. PROSITE sequence patterns are created from observation of short conserved sequences. Analysis of the 3D structures of PROSITE patterns suggests that recurrent sequence motifs imply common structure and function. Fingerprints from PRINTS contain several motifs from different regions of multiple sequence alignments, increasing the discriminating power for predicting the existence of similar motifs, because the individual parts of the fingerprint are mutually conditional. Since sequence motifs from PROSITE, PRINTS and BLOCKS are developed from multiple alignments, they only search for conserved elements of sequence alignments within the same protein family and carry little information about conserved sequence regions that transcend protein families (Zhong et al., 2005).

Some of the commonly used tools for protein sequence motif discovery include MEME (Bailey and Elkan, 1994), Gibbs Sampling (Lawrence et al., 1993) and Block Maker (Henikoff et al., 1995). Some newer algorithms include MITRA (Eskin and Pevzner, 2002), ProfileBranching (Price et al., 2003) and the generic motif discovery algorithm for sequential data (Jensen et al., 2006). When using these tools, users must supply several protein sequences, often in FASTA format, as input data. Again, sequence motifs discovered by the above methods may carry little information that crosses family boundaries, because the size of the input dataset is limited.

Han and Baker (1998) used the K-means clustering algorithm to find recurring protein sequence motifs, choosing a set of initial points as centroids at random (Han and Baker, 1998). Wei et al. proposed an improved K-means clustering algorithm that selects initial centroid locations more wisely (Zhong et al., 2005). Because the performance of K-means clustering is very sensitive to the selection of initial points, this improved initialisation yielded better results in their experiments. The main reason that both of the above papers use K-means instead of a more advanced clustering technique is the extremely large input dataset: K-means is famous for its efficiency, and other clustering methods with higher time and space costs may not be suitable for this task.

In our previous work (Chen et al., 2006a, 2006b), we proposed granular computing models that utilise the Fuzzy C-means clustering algorithm to divide the whole data space into ten smaller subsets and then apply the improved K-means algorithm to each subset to discover relevant information. The total execution time is only 20% of that of Han and Baker (1998), and the models obtain sequence motif information of even higher quality.

The input datasets in our previous works (Chen et al., 2006a, 2006b) and related works (Zhong et al., 2005; Han and Baker, 1998) are generated from whole protein sequences by the sliding window technique; however, the information we try to obtain is sequence motif knowledge, which covers only several small parts of each sequence. Therefore, not all segments in the dataset can provide significant information. Besides, fuzzy C-means (FCM) is capable of assigning one data point to more than one information granule, but not all granules to which FCM assigns a data point need its information. In this paper, in order to obtain more precise motif information, we build on our previous effort and combine it with ranking SVM to eliminate the redundant or less meaningful segments in our original dataset.

In Tang et al. (2005), a granular feature elimination model applied to microarray data, called GSVM-RFE, is proposed. Although microarrays can deal with tens of thousands of genes simultaneously, our dataset is much larger still: their experiment divides the data into four clusters, whereas 48 clusters are mined in this paper. In addition, we use a greedy K-means clustering algorithm to stabilise the initial centroid locations. Therefore, we call our model 'Super' GSVM-FE to indicate that it has the potential to be applied to a huge data space.

2  Granular computing techniques

2.1 FGK model

Granular computing represents information in the form of aggregates, also called 'information granules' (Lin, 2002; Yao, 2001). For a huge and complicated problem, it uses the divide-and-conquer concept to split the original task into several smaller subtasks to save time and space complexity. Also, in the process of splitting the original task, it comprehends the problem without including meaningless information. As opposed to traditional data-oriented numeric computing, granular computing is knowledge-oriented (Yao, 2001).

A granular computing based model called the 'Fuzzy-Greedy-Kmeans model' (FGK model) was proposed in our previous work (Chen et al., 2006b). This model works by using FCM to build a set of information granules and then applying our greedy K-means clustering algorithm to obtain the final information. The greedy method collects five traditional K-means results and then selects the initial centroids based on those results. Because the centroids of higher quality clusters have the potential to generate better clusters in the sixth round, we divide the initial centroid selection procedure into five steps: first gathering centroid seeds belonging to clusters with structural similarity greater than 80%, then proceeding with 75%, 70%, 65% and 60% (a sketch of this seeding follows below). The major advantages of the FGK model are reduced time and space complexity, filtered outliers, and higher quality granular information results.
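An illustrative sketch (not the authors' original code) of this tiered seeding, assuming the five preliminary runs have been pooled into hypothetical arrays `candidates` (one centroid per row) and `similarities` (each centroid's cluster structural similarity, in percent); the minimum-distance check anticipates the 250 city-block separation mentioned in Section 3.1:

```python
import numpy as np

def greedy_seed_selection(candidates, similarities, n_seeds, min_dist=250.0):
    """Pick initial centroids for the final K-means round, preferring
    centroids whose clusters showed high structural similarity, tier by
    tier (>80%, then 75%, 70%, 65%, 60%)."""
    seeds = []
    for tier in (80.0, 75.0, 70.0, 65.0, 60.0):
        for idx in np.argsort(-similarities):          # best clusters first
            if len(seeds) >= n_seeds or similarities[idx] < tier:
                break
            cand = candidates[idx]
            # keep seeds well separated (city-block metric, Section 3.3)
            if all(np.abs(cand - s).sum() >= min_dist for s in seeds):
                seeds.append(cand)
        if len(seeds) >= n_seeds:
            break
    return np.array(seeds)
```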


2.2 Ranking-SVM

Assume that there exists an input space $X \subseteq \mathbb{R}^n$, where $n$ denotes the number of features, and an output space of 'ranks' represented by labels $Y = \{0, 1, 2, \ldots, q\}$, where $q$ denotes the number of ranks. Further assume that there exists a total order between the ranks, $q \succ q-1 \succ \cdots \succ 1 \succ 0$, where $\succ$ denotes a preference relationship (Cao et al., 2006). A set of ranking functions $f \in F$ exists, and each of them can determine the preference relations between instances:

$$\vec{x}_i \succ \vec{x}_j \iff f(\vec{x}_i) > f(\vec{x}_j). \qquad (1)$$

Suppose that we are given a set of ranked instances $S = \{(\vec{x}_i, y_i)\}_{i=1}^{t}$ from the space $X \times Y$. The task is to select the best function $f'$ from $F$ that minimises a given loss function with respect to the given ranked instances. Herbrich et al. (2000) propose formalising the above learning problem as learning for classification on pairs of instances. First, we assume that $f$ is a linear function,

$$f_{\vec{w}}(\vec{x}) = \langle \vec{w}, \vec{x} \rangle \qquad (2)$$

where $\vec{w}$ denotes a vector of weights and $\langle \cdot, \cdot \rangle$ stands for an inner product. Plugging equation (2) into (1), we obtain

$$\vec{x}_i \succ \vec{x}_j \iff \langle \vec{w}, \vec{x}_i - \vec{x}_j \rangle > 0. \qquad (3)$$

Therefore, the relation $\vec{x}_i \succ \vec{x}_j$ has been replaced by the new vector $\vec{x}_i - \vec{x}_j$. Thus, we can take any pair of instances and their ranks to create a new vector and a new label. Suppose $\vec{x}_1$ and $\vec{x}_2$ denote the first and second instances and $y_1$ and $y_2$ represent their ranks; then we form

$$\left( \vec{x}_1 - \vec{x}_2,\; z \right), \quad z = \begin{cases} +1 & \text{if } y_1 \succ y_2 \\ -1 & \text{if } y_2 \succ y_1 \end{cases} \qquad (4)$$

From the original training dataset $S$, we can create a new training dataset $S'$ of all such pairs and assign each a $z$ value ($+1$ or $-1$) based on equation (4). We may then treat $S'$ as classification data and construct an SVM model.
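The construction of $S'$ can be transcribed directly from equations (1)–(4); the sketch below is illustrative (plain NumPy, not the ranking solver used in our experiments) and assumes `X` holds one instance per row with `y` holding the corresponding rank labels:

```python
import numpy as np
from itertools import combinations

def to_pairwise(X, y):
    """Equation (4): every pair of instances with different ranks yields
    a difference vector x_i - x_j labelled z = +1 or -1."""
    X_pairs, z = [], []
    for i, j in combinations(range(len(y)), 2):
        if y[i] == y[j]:
            continue                     # equal ranks give no constraint
        X_pairs.append(X[i] - X[j])
        z.append(+1 if y[i] > y[j] else -1)
    return np.array(X_pairs), np.array(z)
```

Any linear binary classifier trained on these pairs yields a weight vector $\vec{w}$, and new instances can then be ranked by $f(\vec{x}) = \langle \vec{w}, \vec{x} \rangle$.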

2.3 Super Granular SVM Feature Elimination model

Basically, this new model is the next generation of the FGK model. It also uses the fuzzy concept to divide the original dataset into several smaller information granules. For each granule, after five iterations of traditional K-means clustering, greedy K-means is applied. The next step differs from the FGK model: we adapt ranking SVM (Joachims, 2002) to rank all members of each cluster generated by the greedy K-means clustering algorithm, and we then filter out the lower ranked members. The number of segments to eliminate is decided by a user-defined filtering percentage; the results for different percentages are discussed in Section 4. After the feature elimination step, we collect all surviving data points in each information granule and rerun greedy K-means with the same initial centroids generated previously. Finally, we collect the results from all granules to create the final protein sequence motif information. Figure 1 is a sketch of the Super GSVM-FE model.

In order to compare the results, we present another similar feature elimination approach that modifies only one component of the model: we shrink the cluster size instead of using ranking SVM. The number of segments eliminated is decided by a user-defined distance threshold: if the distance between a member and the centre of the cluster is greater than the threshold, the data point is filtered out. The major advantage of this approach is that not all clusters lose the same number of members. If a cluster is compact to begin with, fewer members are eliminated; if a cluster is loosely formed, more data points are removed. The results for different thresholds are also discussed and compared in Section 4. Both elimination strategies are sketched in code after Figure 1.

Figure 1  The sketch of the Super GSVM-FE model
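A minimal sketch of the two elimination strategies, assuming each cluster member is stored as a flattened profile vector and that Ranking-SVM scores are already available; the function names and signatures are illustrative only:

```python
import numpy as np

def eliminate_by_rank(segments, scores, filter_pct):
    """Rank-based elimination: drop the lowest-scoring filter_pct percent
    of the cluster, using scores produced by the trained Ranking-SVM."""
    keep = int(round(len(segments) * (1.0 - filter_pct / 100.0)))
    order = np.argsort(-scores)                    # highest ranked first
    return segments[order[:keep]]

def eliminate_by_shrink(segments, centroid, dist_threshold):
    """Shrink-based elimination: drop members farther than dist_threshold
    from the centroid (city-block distance), so compact clusters lose
    fewer members than loose ones."""
    dists = np.abs(segments - centroid).sum(axis=1)
    return segments[dists <= dist_threshold]
```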

3  Experiment setup

3.1 Dataset

The original dataset used in this work includes 2710 protein sequences obtained from the Protein Sequence Culling Server (PISCES) (Wang and Dunbrack, 2003). No sequence in this database shares more than 25% sequence identity. Sliding windows of nine successive residues are generated from each protein sequence; each window represents one sequence segment of nine continuous positions. More than 560,000 segments are generated by this method and clustered into 800 clusters. The frequency profile from HSSP (Sander and Schneider, 1991) is constructed based on the alignment of each protein sequence from the Protein Data Bank (PDB), where all the sequences are considered homologous in the sequence database. We also obtained secondary structure assignments from DSSP (Kabsch and Sander, 1983), a database of secondary structure assignments for all protein entries in the PDB.

In order to perform a complete analysis, we applied our new approach to information granule number eight from Chen et al. (2006a), which contains 43,254 data segments and 48 clusters. After five iterations of traditional K-means clustering are executed, 45 initial centroids are selected for greedy K-means; all of these initial centres are at a distance of at least 250 from the existing centroids. Another three initial seeds are generated randomly, with the same minimum-distance check.

3.2 Representation of sequence segments

Sliding windows of nine successive residues are generated from the protein sequences. Each window corresponds to a sequence segment, which is represented by a 9 × 20 matrix plus the nine corresponding secondary structure assignments obtained from DSSP. The nine rows represent the nine positions of the sliding window and the 20 columns represent the 20 amino acids. In this frequency-profile (HSSP) representation, each entry of the matrix gives the frequency of a specific amino acid residue at a given sequence position in the multiple sequence alignment. DSSP originally assigns secondary structure to eight different classes. In this paper, we convert these eight classes into three classes as follows: H, G and I to H (helices); B and E to E (sheets); all others to C (coils).
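The eight-to-three conversion can be written down directly (a small sketch; every DSSP symbol outside H, G, I, B and E, including blanks, maps to C):

```python
def dssp_to_three_state(ss8):
    """Collapse the eight DSSP classes to three: H, G, I -> H (helices);
    B, E -> E (sheets); everything else -> C (coils)."""
    helix, sheet = set("HGI"), set("BE")
    return "".join("H" if s in helix else "E" if s in sheet else "C"
                   for s in ss8)

# e.g. dssp_to_three_state("HGITSBE S") -> "HHHCCEECC"
```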

3.3 Distance measure

According to Zhong et al. (2005) and Han and Baker (1998), the city block metric is more suitable for this field of study, since it considers every position of the frequency profile equally. The following formula is used to calculate the distance between two sequence segments (Han and Baker, 1998):

$$\text{Distance} = \sum_{i=1}^{L} \sum_{j=1}^{N} \left| F_k(i,j) - F_c(i,j) \right| \qquad (5)$$

where $L$ is the window size and $N$ is 20, representing the 20 different amino acids. $F_k(i,j)$ is the value of the matrix at row $i$ and column $j$ used to represent the sequence segment, and $F_c(i,j)$ is the value of the matrix at row $i$ and column $j$ used to represent the centroid of a given sequence cluster.
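Assuming the segment and centroid profiles are stored as 9 × 20 NumPy arrays, equation (5) is a one-liner:

```python
import numpy as np

def city_block_distance(segment, centroid):
    """Equation (5): sum of absolute differences over all L x N entries
    (L = 9 window positions, N = 20 amino acids)."""
    return np.abs(segment - centroid).sum()
```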

3.4 Structural similarity measure

A cluster's average structural similarity is calculated using the following formula:

$$\frac{\sum_{i=1}^{ws} \max(p_{i,H},\, p_{i,E},\, p_{i,C})}{ws} \qquad (6)$$

where $ws$ is the window size and $p_{i,H}$ shows the frequency of occurrence of helix among the segments of the cluster at position $i$; $p_{i,E}$ and $p_{i,C}$ are defined in a similar way. If the structural homology for a cluster exceeds 70%, the cluster can be considered structurally identical (Sander and Schneider, 1991). If the structural homology for the cluster exceeds 60% but is below 70%, the cluster can be considered weakly structurally homologous (Zhong et al., 2005).
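Equation (6) can be computed from the members' converted secondary structure strings; a sketch, assuming one equal-length H/E/C string per cluster member:

```python
def structural_similarity(ss_segments):
    """Equation (6): average, over the ws window positions, of the
    frequency of the dominant secondary structure class at each position.
    Returns a percentage; >70% counts as structurally identical."""
    ws, n = len(ss_segments[0]), len(ss_segments)
    score = 0.0
    for i in range(ws):
        column = [s[i] for s in ss_segments]
        score += max(column.count(c) for c in "HEC") / n
    return 100.0 * score / ws
```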

3.5 Davies-Bouldin Index (DBI) measure

Besides using secondary structure information as a biological evaluation criterion, we include an evaluation method used in computer science, as in our previous work (Chen et al., 2006b). The DBI measure (Davies and Bouldin, 1979) is a function of the inter-cluster and intra-cluster distances. A good clustering result should reflect a relatively large inter-cluster distance and a relatively small intra-cluster distance. The DBI measure combines these two distances into one function, which is defined as follows:

$$\mathrm{DBI} = \frac{1}{k} \sum_{p=1}^{k} \max_{q \neq p} \left\{ \frac{d_{\text{intra}}(C_p) + d_{\text{intra}}(C_q)}{d_{\text{inter}}(C_p, C_q)} \right\} \qquad (7)$$

where

$$d_{\text{intra}}(C_p) = \frac{\sum_{i=1}^{n_p} \left\| g_i - g_{pc} \right\|}{n_p} \quad \text{and} \quad d_{\text{inter}}(C_p, C_q) = \left\| g_{pc} - g_{qc} \right\|.$$

Here $k$ is the total number of clusters, and $d_{\text{intra}}$ and $d_{\text{inter}}$ denote the intra-cluster and inter-cluster distances, respectively; $n_p$ is the number of members in cluster $C_p$. The intra-cluster distance is defined as the average distance between the members of cluster $p$ and the cluster's centroid $g_{pc}$. The inter-cluster distance of two clusters is computed as the distance between the two clusters' centroids. A lower DBI value indicates a higher quality clustering result.
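A sketch of equation (7); the city-block metric is assumed for both norms to stay consistent with Section 3.3 (that choice is ours, for illustration):

```python
import numpy as np

def davies_bouldin(clusters, centroids):
    """Equation (7): clusters is a list of (n_p, d) member arrays and
    centroids a (k, d) array; lower values indicate better clustering."""
    k = len(clusters)
    intra = [np.abs(members - g).sum(axis=1).mean()     # d_intra(C_p)
             for members, g in zip(clusters, centroids)]
    total = 0.0
    for p in range(k):
        total += max((intra[p] + intra[q]) /
                     np.abs(centroids[p] - centroids[q]).sum()
                     for q in range(k) if q != p)
    return total / k
```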

3.6 HSSP-BLOSUM62 measure

BLOSUM62 (Henikoff and Henikoff, 1992) is a scoring matrix based on known alignments of diverse sequences. Using this matrix, we can assess the consistency of the amino acids appearing in the same position of the motif information generated by our method. Because the different amino acids appearing in the same position should be close to each other biochemically, the corresponding value in the BLOSUM62 matrix should be positive. For example, if a rule indicates that amino acids A1 and A2 are two elements that frequently appear in some specific position, A1 and A2 should have similar biochemical properties. Hence, the measure is defined as follows.

If $k = 0$:

$$\text{HSSP-BLOSUM62 measure} = 0$$

Else if $k = 1$:

$$\text{HSSP-BLOSUM62 measure} = \mathrm{BLOSUM62}_{ii}$$

Else:

$$\text{HSSP-BLOSUM62 measure} = \frac{\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \mathrm{HSSP}_i \cdot \mathrm{HSSP}_j \cdot \mathrm{BLOSUM62}_{ij}}{\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \mathrm{HSSP}_i \cdot \mathrm{HSSP}_j} \qquad (8)$$

Here $k$ is the number of amino acids whose frequency in a given position exceeds a certain threshold (8% in this paper); $\mathrm{HSSP}_i$ indicates the percentage with which amino acid $i$ appears; and $\mathrm{BLOSUM62}_{ij}$ denotes the BLOSUM62 value for amino acids $i$ and $j$. A higher HSSP-BLOSUM62 value indicates more significant motif information. When $k$ equals zero, no amino acid passes the threshold at that position, so the measure is assigned zero. When $k$ equals one, only one amino acid appears in the position. Unlike the first time we proposed this measurement (Chen et al., 2006b), where we assigned a zero score to this situation, we believe a positive value should be assigned to reward the clear signal; therefore, we assign the corresponding amino acid's diagonal value in BLOSUM62. To the best of our knowledge, this is the first evaluation method from a biochemical point of view that considers both HSSP and BLOSUM62 information.
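Equation (8) for a single window position can be sketched as follows; the `blosum62` lookup is an assumed input (a symmetric dict of substitution scores, diagonal entries included), not a bundled resource:

```python
from itertools import combinations

def hssp_blosum62(freqs, blosum62, threshold=0.08):
    """Equation (8): freqs maps amino acid -> HSSP frequency at this
    position; only amino acids above the 8% threshold participate."""
    kept = [(a, f) for a, f in freqs.items() if f > threshold]
    if len(kept) == 0:
        return 0.0                       # no amino acid passes the threshold
    if len(kept) == 1:
        a, _ = kept[0]
        return float(blosum62[(a, a)])   # reward a single clear signal
    num = den = 0.0
    for (ai, fi), (aj, fj) in combinations(kept, 2):
        num += fi * fj * blosum62[(ai, aj)]
        den += fi * fj
    return num / den
```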

3.7 Ranking-SVM setup

In ranking SVM (Joachims, 2002), the target value is used to encode pairwise preference constraints. We set the target value for each member in a cluster by counting the number of positions at which the member's secondary structure matches the representative structure of the cluster. Since we use window size nine in our experiment, the highest target value is 9 and the lowest is 0 (see the sketch below).
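The target assignment is a simple position-wise match count; a minimal sketch:

```python
def rank_target(member_ss, representative_ss):
    """Number of window positions where a member's secondary structure
    matches the cluster's representative structure (0-9 for window 9)."""
    return sum(m == r for m, r in zip(member_ss, representative_ss))

# rank_target("HHHHCCEEE", "HHHHHCEEE") -> 8
```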

4  Experimental results

4.1 Training Ranking-SVM: execution time comparison

Since the training step of Ranking-SVM is very time consuming, we want to minimise it. Because the training time of an SVM increases exponentially with the number of training data, and because the dataset we train on is preprocessed by clustering (so the data within a cluster are already similar to one another to some degree), we believe we may acquire results of similar quality if we train on only some, rather than all, of the data in a cluster. Therefore, we adopt a random sampling approach: we randomly select a certain percentage of the data in each cluster and feed those samples to the Ranking-SVM, as sketched below. Table 1 shows the total execution time required for training Ranking-SVM over all 48 clusters (since we have 48 clusters, we need to train 48 different Ranking-SVMs). The first column indicates the percentage of the whole dataset in each cluster used to train the support vector machine; the second and third columns give the required time in seconds and in days, respectively. Figure 2 is derived from Table 1.
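The sampling step can be sketched as follows (illustrative names; the trained model is later used to score all members of the cluster, not just the sample):

```python
import numpy as np

def sample_for_training(segments, targets, pct, seed=0):
    """Randomly keep pct percent of a cluster's members for
    Ranking-SVM training."""
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(round(len(segments) * pct / 100.0)))
    idx = rng.choice(len(segments), size=n_keep, replace=False)
    return segments[idx], targets[idx]
```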

Table 1  Execution time required for training Ranking-SVM with different percentages of the whole dataset in clusters

  Training percentage (%)   Execution time (s)   Execution time (days)
  100                       811631               9.4
  90                        407607               4.7
  80                        279522               3.2
  70                        163686               1.9
  60                         79451               0.9
  50                         37551               0.4
  40                         13796               0.2

Figure 2 shows that we can save execution time dramatically by training on partial data in each cluster. The question, then, is what happens to the quality of the Ranking-SVM if we provide only partial data for training. The next section answers this question.

Figure 2  Execution time, in days, required for training Ranking-SVM with different percentages of the whole dataset in clusters (see online version for colours)

4.2 Quality comparison

Tables 2 and 3 give the number of clusters with structural similarity higher than 60% and 70%, respectively, as generated by the different methods. The first column denotes the percentage of the original data filtered from each cluster; the remaining column headings indicate what percentage of the whole dataset in each cluster was used to train Ranking-SVM, and the last column corresponds to features eliminated by the shrink approach. Thus, the value '35' in the 30% row and the 80% column of Table 2 means: "If we filter out 30% of the data segments using a Ranking-SVM trained on 80% of the whole dataset in each cluster, we obtain 35 clusters with secondary structure similarity greater than 60%."

To the best of our knowledge, there are no related papers using Ranking-SVM for feature elimination on protein sequence motif data transcending protein family boundaries. Therefore, we use the shrink approach, a traditional clustering refinement, for comparison. The average HSSP-BLOSUM62 values for clusters with high structural similarity (>60%) and the DBI measures are given in Tables 4 and 5.

Table 2  Number of clusters with secondary structure similarity >60% comparison

  Filter (%)   All   90%   80%   70%   60%   50%   40%   Shrink
  0             12    12    12    12    12    12    12    12
  10            21    20    22    18    18    17    17    15
  15            24    25    23    23    22    20    21    15
  20            28    27    29    25    28    24    22    14
  25            31    31    30    30    29    28    25    15
  30            36    35    35    32    31    30    30    17
  35            40    39    40    37    36    34    33    18
  40            45    43    42    41    39    37    37    19

Table 3  Number of clusters with secondary structure similarity >70% comparison

  Filter (%)   All   90%   80%   70%   60%   50%   40%   Shrink
  0              0     0     0     0     0     0     0     0
  10             1     0     2     1     1     0     1     0
  15             3     2     2     1     2     2     1     0
  20             6     4     4     2     2     4     1     0
  25             7     7     6     6     4     3     3     1
  30             9     8     8     8     8     5     5     1
  35            12    10    10     9     8     7     7     1
  40            16    13    15    13    11    11    10     1

Table 4  Comparison of HSSP-BLOSUM62 measure

  Filter (%)   All     90%     80%     70%     60%     50%     40%     Shrink
  0            0.749   0.749   0.749   0.749   0.749   0.749   0.749   0.749
  10           1.05    0.918   0.814   0.660   0.663   0.800   0.791   0.634
  15           0.828   0.733   0.745   0.799   0.698   0.742   0.696   0.731
  20           0.751   0.888   0.881   0.786   0.832   0.772   0.705   0.818
  25           0.852   0.853   0.796   0.879   0.737   0.750   0.838   0.760
  30           0.910   0.819   0.868   0.803   0.743   0.772   0.779   0.655
  35           0.793   0.773   0.765   0.815   0.813   0.841   0.844   0.897
  40           0.658   0.729   0.692   0.593   0.694   0.736   0.696   0.694

Table 5  Comparison of DBI measure

  Filter (%)   All    90%    80%    70%    60%    50%    40%    Shrink
  0            6.39   6.39   6.39   6.39   6.39   6.39   6.39   6.39
  10           6.23   6.27   6.25   6.19   6.37   6.22   6.17   6.14
  15           6.21   6.09   6.04   6.21   6.22   6.10   6.05   6.03
  20           6.04   6.10   5.96   6.00   6.11   6.02   6.05   5.90
  25           5.98   6.01   5.88   5.84   5.97   5.96   5.90   5.79
  30           5.91   5.91   5.83   5.84   5.79   5.80   5.81   5.79
  35           5.84   5.79   5.82   5.81   5.77   5.77   5.76   5.75
  40           5.70   5.74   5.66   5.67   5.70   5.76   5.68   5.64

The results in Tables 2–5 reveal that the quality of the clusters improves steadily in all three measures as part of the original data is filtered out. Comparing with the shrink approach from a secondary structure similarity point of view, it is not hard to see that ranking SVM generates much better results regardless of what percentage of the whole dataset is trained. The support vector machine approach produces more clusters with structural similarity higher than 60% almost all of the time, and if we compare the number of clusters sharing over 70% structural similarity, Ranking-SVM unquestionably surpasses the shrink approach. This indicates that our proposed model has high potential for producing high quality protein sequence motif information.

In the HSSP-BLOSUM62 measurement, the ranking support vector machine gives a higher value almost all of the time, implying that our model generates more biochemically meaningful motif information by ruling out less meaningful data points. For the DBI measurement, which is a purely computational evaluation, the shrink method always receives a lower (better) value. This is mainly because the shrink method simply narrows the cluster from the outside; in other words, it directly optimises for shorter intra-cluster distances, so it can always generate the best DBI value. Although the SVM approach has larger DBI values, the difference between the methods is small; more importantly, ranking SVM shows the same tendency of decreasing DBI as the shrink approach.

If we consider Ranking-SVM trained on all data segments in each cluster and try to find the optimal filtering percentage, then, based on the results shown above, filtering 30% of the whole data size seems to yield the best results. The motif information we want transcends protein family boundaries; thus, if we filter out too many segments, we might generate motif information that falls into some specific protein family. The reason we choose 30% as representative is that it matches the criterion of filtering only part of the data while creating results of higher structural similarity. Considering the evaluation of secondary structural similarity greater than 60%, it gives the first big improvement (an extra five clusters) compared with filtering out 25%. More importantly, it achieves very high HSSP-BLOSUM62 values, which indicates that the motif information is biochemically meaningful.

Comparing Ranking-SVMs trained on different percentages of the original data, we can see that more training data generally yields better quality. However, the results generated from training on all data, 90%, and 80% are very similar to each other in all evaluation measures; the quality only starts to drop slightly when Ranking-SVM is trained on less than 70% of the data in each cluster. Figures 3–5 are derived from Tables 2–5 for the case where 30% of the original dataset is filtered. The results support our assumption that we may acquire very similar results if we train on only part, instead of all, of the data in a cluster; according to our figures and tables, 80% seems to be the best value for this 'part'. It requires only one-third of the execution time of training on all the data, shows a similar number of clusters with high structural similarity, yields high HSSP-BLOSUM62 values, and reduces the DBI measure.


Figure 3  Number of clusters with secondary structure similarity greater than 60% and 70% for different percentages of data in each cluster trained by Ranking-SVM, when 30% of the original data is filtered (see online version for colours)

Figure 4  HSSP-BLOSUM62 value for different percentages of data in each cluster trained by Ranking-SVM, when 30% of the original data is filtered (see online version for colours)

Figure 5  DBI measure for different percentages of data in each cluster trained by Ranking-SVM, when 30% of the original data is filtered (lower indicates better) (see online version for colours)


4.3 Sequence motifs

We select some representative motif information generated by filtering 30% of the whole dataset using a Ranking-SVM trained on 80% of the data in each cluster. Tables 6–10 illustrate different sequence motifs before and after feature elimination. In this paper, instead of using the existing format (Zhong et al., 2005; Chen et al., 2006a, 2006b, 2007a, 2007b), we propose a new representation format based on the amino acid sequence logo (Crooks et al., 2004) to show the motifs discovered by our new approach. With this new format, the frequency of each amino acid can be easily interpreted; more importantly, it is what biologists are used to. We believe that this new format will be adopted in related research on finding sequence motifs that transcend protein family boundaries.

•  The upper box gives the number of members belonging to the motif, the secondary structural similarity and the average HSSP-BLOSUM62 value.

•  The graph uses the amino acid logo to show the types of amino acids that frequently appear in each position; only amino acids with frequency higher than 8% are shown. The height of a symbol within a stack indicates the relative frequency of that amino acid at that position.

•  The x-axis label indicates the representative secondary structure of each position.

Table 6  Helix-Coil motif (see online version for colours)

Table 7  Sheet-Coil motif (see online version for colours)

Table 8  Helix motif (see online version for colours)

Table 9  Coil motif (see online version for colours)

Table 10  Coil motif (see online version for colours)

5  Conclusions

A novel granular feature elimination model called Super GSVM-FE, which combines Fuzzy C-means, the greedy K-means clustering algorithm and Ranking-SVM, has been proposed to extract protein sequence motif information. In this model, we utilise fuzzy clustering to split the whole dataset into several information granules and analyse each granule with the greedy K-means clustering algorithm. After that, we rate all members of all clusters by ranking SVM and then filter out the less meaningful segments to obtain higher quality motif knowledge. Analysis of the sequence motifs also shows that filtering some portion of the original dataset may reveal subtle motif information hidden by noisy data points.

This is the first time the need for feature elimination in a dataset of protein sequence motifs transcending protein family boundaries has been justified, and we provide two major reasons: (1) the information we try to generate concerns sequence motifs, but the original input data are derived from whole protein sequences by the sliding window technique; and (2) fuzzy C-means clustering can assign one segment to more than one information granule, yet not all data segments have a direct relation to every granule they are assigned to.

Additionally, we performed a comprehensive analysis of the trade-off between execution time and the quality of motif information. Since we apply Ranking-SVM within clusters, the training data are already similar to one another to some degree. Therefore, based on our results, training on 80% of the original data not only saves training time but also obtains competitive quality of extracted protein sequence motif information. This opens a new research direction in combining clustering with support vector machines, and we believe other research with huge input data sizes may adopt our model to generate high quality filtered results.

Furthermore, a new representation format for sequence motifs is presented. It improves on the previously used frequency table by clearly showing the noticeable amino acids together with a frequency scale. Thus, the protein sequence motifs transcending protein families discovered and extracted by computer scientists can be easily understood by biologists. We believe this paper provides many new directions in multiple research areas.

Acknowledgements This research was supported in part by the US National Institutes of Health (NIH) under grants R01 GM34766-17S1 and P20 GM065762-01A1, and the US National Science Foundation (NSF) under grants CCF-0514750 and ECS-0334813. This work was also supported by the Georgia Cancer Coalition and used computer hardware supplied by the Georgia Research Alliance.

References

Attwood, T.K., Blythe, M., Flower, D.R., Gaulton, A., Mabey, J.E., Naudling, N., McGregor, L., Mitchell, A., Paine, G.M. and Scordis, P. (2002) 'PRINTS and PRINTS-S shed light on protein ancestry', Nucleic Acids Res., Vol. 30, No. 1, pp.239–241.

Bailey, T.L. and Elkan, C. (1994) 'Fitting a mixture model by expectation maximization to discover motifs in biopolymers', Proc. Int. Conf. Intell. Syst. Mol. Biol., Vol. 2, pp.28–36.

Cao, Y., Xu, J., Liu, T.Y., Li, H., Huang, Y. and Hon, H.W. (2006) 'Adapting ranking SVM to document retrieval', Proceedings of SIGIR '06, pp.186–193.

Chen, B., Tai, P.C., Harrison, R. and Pan, Y. (2006a) 'FGK model: an efficient granular computing model for protein sequence motifs information discovery', IASTED Proc. International Conference on Computational and Systems Biology (CASB), Dallas, pp.56–61.

Chen, B., Tai, P.C., Harrison, R. and Pan, Y. (2006b) 'FIK model: novel efficient granular computing model for protein sequence motifs and structure information discovery', IEEE Proc. 6th Symposium on Bioinformatics and Bioengineering (BIBE), Washington DC, pp.20–26.

Chen, B., Pellicer, S., Tai, P.C., Harrison, R. and Pan, Y. (2007a) 'Super Granular SVM Feature Elimination (Super GSVM-FE) model for protein sequence motif information extraction', Proceedings of the 2007 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 2007), Hawaii, pp.317–323.

Chen, B., Pellicer, S., Tai, P.C., Harrison, R. and Pan, Y. (2007b) 'Super Granular Shrink-SVM Feature Elimination (Super GS-SVM-FE) model for protein sequence motif information extraction', IEEE Proc. 7th Symposium on Bioinformatics and Bioengineering (BIBE), pp.379–386.

Crooks, G.E., Hon, G., Chandonia, J.M. and Brenner, S.E. (2004) 'WebLogo: a sequence logo generator', Genome Research, Vol. 14, pp.1188–1190.

Davies, D.L. and Bouldin, D.W. (1979) 'A cluster separation measure', IEEE Trans. Pattern Anal. Machine Intell., Vol. 1, pp.224–227.

Eskin, E. and Pevzner, P.A. (2002) 'Finding composite regulatory patterns in DNA sequences', Bioinformatics, Vol. 18, Suppl. 1, pp.354–363.

Han, K.F. and Baker, D. (1998) 'Recurring local sequence motifs in proteins', J. Mol. Biol., Vol. 251, pp.176–187.

Henikoff, S. and Henikoff, J.G. (1992) 'Amino acid substitution matrices from protein blocks', Proc. Natl. Acad. Sci. USA, Vol. 89, pp.10915–10919.

Henikoff, S., Henikoff, J.G. and Pietrokovski, S. (1999) 'Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations', Bioinformatics, Vol. 15, No. 6, pp.471–479.

Henikoff, S., Henikoff, J.G., Alford, W.J. and Pietrokovski, S. (1995) 'Automated construction and graphical presentation of protein blocks from unaligned sequences', Gene, Vol. 163, pp.GC17–GC26.

Herbrich, R., Graepel, T. and Obermayer, K. (2000) 'Large margin rank boundaries for ordinal regression', in Advances in Large Margin Classifiers, Chapter 7, pp.115–132.

Hulo, N., Sigrist, C.J.A., Le Saux, V., Langendijk-Genevaux, P.S., Bordoli, L., Gattiker, A., De Castro, E., Bucher, P. and Bairoch, A. (2004) 'Recent improvements to the PROSITE database', Nucleic Acids Res., Vol. 32, Database issue, pp.D134–D137.

Jensen, K.L., Styczynski, M.P., Rigoutsos, I. and Stephanopoulos, G.N. (2006) 'A generic motif discovery algorithm for sequential data', Bioinformatics, Vol. 22, No. 1, pp.21–28.

Joachims, T. (2002) 'Optimizing search engines using clickthrough data', Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.133–142.

Kabsch, W. and Sander, C. (1983) 'Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features', Biopolymers, Vol. 22, pp.2577–2637.

Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. and Wootton, J.C. (1993) 'Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment', Science, Vol. 262, pp.208–214.

Lin, T.Y. (2002) 'Data mining and machine oriented modeling: a granular computing approach', Journal of Applied Intelligence, Kluwer, Vol. 13, No. 2, pp.113–124.

Price, A., Ramabhadran, S. and Pevzner, P.A. (2003) 'Finding subtle motifs by branching from sample strings', Bioinformatics, Vol. 19, Suppl. 2, pp.II149–II155.

Sander, C. and Schneider, R. (1991) 'Database of homology-derived protein structures and the structural meaning of sequence alignment', Proteins: Struct. Funct. Genet., Vol. 9, No. 1, pp.56–68.

Tang, Y., Zhang, Y., Huang, Z. and Hu, X. (2005) 'Granular SVM-RFE gene selection algorithm for reliable prostate cancer classification on microarray expression data', IEEE Proc. 5th Symposium on Bioinformatics and Bioengineering (BIBE), Minneapolis, pp.290–293.

Wang, G. and Dunbrack Jr., R.L. (2003) 'PISCES: a protein sequence-culling server', Bioinformatics, Vol. 19, No. 12, pp.1589–1591.

Yao, Y.Y. (2001) 'On modeling data mining with granular computing', Proceedings of COMPSAC 2001, pp.638–643.

Zhong, W., Altun, G., Harrison, R., Tai, P.C. and Pan, Y. (2005) 'Improved K-means clustering algorithm for exploring local protein sequence motifs representing common structural property', IEEE Transactions on Nanobioscience, Vol. 4, No. 3, pp.255–265.