Available online at www.sciencedirect.com
Journal of Biomedical Informatics 41 (2008) 165–179 www.elsevier.com/locate/yjbin
Mining sequential patterns for protein fold recognition
Themis P. Exarchos a,b, Costas Papaloukas b,c, Christos Lampros a,b, Dimitrios I. Fotiadis b,d,*
a Department of Medical Physics, Medical School, University of Ioannina, GR 451 10 Ioannina, Greece
b Unit of Medical Technology and Intelligent Information Systems, Department of Computer Science, University of Ioannina, P.O. Box 1186, GR 45110 Ioannina, Greece
c Department of Biological Applications and Technology, University of Ioannina, GR 45110 Ioannina, Greece
d Biomedical Research Institute – FORTH, GR 45110 Ioannina, Greece
Received 20 November 2006 Available online 17 May 2007
Abstract
Protein data contain discriminative patterns that can be used in many beneficial applications if they are defined correctly. In this work sequential pattern mining (SPM) is utilized for sequence-based fold recognition. Protein classification in terms of fold recognition plays an important role in computational protein analysis, since it can contribute to the determination of the function of a protein whose structure is unknown. Specifically, one of the most efficient SPM algorithms, cSPADE, is employed for the analysis of protein sequence. A classifier uses the extracted sequential patterns to classify proteins in the appropriate fold category. For training and evaluating the proposed method we used the protein sequences from the Protein Data Bank and the annotation of the SCOP database. The method exhibited an overall accuracy of 25% in a classification problem with 36 candidate categories. The classification performance reaches up to 56% when the five most probable protein folds are considered.
© 2007 Elsevier Inc. All rights reserved.
Keywords: Data mining; Sequential patterns; Fold recognition
* Corresponding author. Address: Unit of Medical Technology and Intelligent Information Systems, Department of Computer Science, University of Ioannina, P.O. Box 1186, GR 45110 Ioannina, Greece. Fax: +30 26510 97092. E-mail address: [email protected] (D.I. Fotiadis).
1532-0464/$ - see front matter © 2007 Elsevier Inc. All rights reserved. doi:10.1016/j.jbi.2007.05.004
1. Introduction
Structure prediction is a challenging field closely related to function determination, which is of high interest for biologists and the pharmaceutical industry. As the genome projects worldwide progress, we are presented with an exponentially increasing number of protein sequences which are not accompanied by any knowledge concerning their structure or biochemical function. Proteins have structural features which define functional similarities, so the need for structure estimation methods is high. One way to define their structure is to link them with proteins in annotated databases, whose three-dimensional structure (fold) is known. Determining how amino acid sequences are related to those of proteins with known structure helps us make predictions for their structural, functional and evolutionary attributes [1]. The proteins that share the same fold category have considerable structural similarities even when no evolutionary relationship (homology) of their sequences can be detected [2,3].
Various methods have been developed to identify the fold category to which a protein of unknown structure belongs (fold recognition). These methods are divided into two methodological approaches: (a) the informatics based methods, which comprise the sequence-based methods [4–15] and the structure based methods [16–19], and (b) the biophysics based methods [20–22]. Sequence based methods use protein sequence or predicted secondary structure information to perform sequence comparison and detect whether two proteins share a fold or not. Structure based
or threading methods create an energy function describing how well a probe sequence matches a target fold. In fold recognition by threading, the amino acid sequence of a protein is evaluated for how well it fits into one of the known three-dimensional (3D) protein structures. Besides purely sequence based or structure based methods, a combination of them is also possible [23]. On the other hand, methods based on biophysics perform ab initio structure prediction. They detect a native conformation or ensemble of conformations of the protein that are at or near the global free-energy minimum [24].
Sequence-based methods are very common in fold recognition. Machine learning techniques, such as genetic algorithms [8], support vector machines [9,10], hidden Markov models [11,13,14] and segmentation conditional random fields [15], have been adopted to exploit protein sequence or secondary structure information. The amino acid composition (protein sequence), specifically, has been employed in many areas of bioinformatics, such as protein structural class prediction [25–27], discrimination of DNA binding proteins [28] and discrimination of outer membrane proteins [29]. However, although significant improvement has been made in the field of fold recognition, the accuracy of the existing methods remains limited and there is a need to develop new methods.
In this study, a novel classification method for biological data is proposed. The method uses sequential patterns that are extracted with data mining techniques and is validated in the common problem of protein fold recognition. Previous studies that utilized data mining techniques for biological data analysis [30], and for proteins specifically [31], provided very promising results, revealing that data mining can play a vital role in the field of bioinformatics.
Here, data mining is employed in the form of sequential pattern mining (SPM) [32], a technique appropriate for analyzing sequential data such as time series, texts and biosequences (e.g., proteins, DNA). Our approach extracts a large number of sequential patterns in order to characterize each class (protein fold in our case). The patterns discovered using the protein data could also assist the domain experts by providing them with previously unknown knowledge. Sequential patterns can match significant combinations of amino acids that may correspond to functionally or structurally important regions in the proteins, for example dipeptide combinations [33]. They follow the notion of deterministic motifs and allow a flexible pattern length through the variable insertion of gaps between the amino acids of the patterns. A motif is defined as the occurrence in the protein's sequence of a particular cluster of residue types [34]. However, determining consistent motifs first requires multiple alignment of the input protein sequences, which is not needed in mining sequential patterns.
The proposed method was applied in automated protein fold recognition, by classifying an unknown protein to the corresponding fold. In the training phase, sequential patterns are extracted from the training data with the use of the cSPADE algorithm [35]. During testing, a classifier uses the extracted sequential patterns and classifies the unknown proteins.
Our method introduces several novelties. The employment of SPM for protein structure analysis offers the potential of discovering new knowledge in the form of patterns. Furthermore, the method uses only the protein's sequence for classification, which is easier to acquire, whereas other similar approaches make use of the secondary structure [6], as well as other features [36]. For training and testing we employed a dataset with low similarity between proteins. The classification results indicate that our method performs well in terms of accuracy (considering a 36-class classification problem where the accuracy of random prediction is 2.8%) and compares favorably with the Sequence Alignment and Modeling (SAM) approach (version 3.3.1) [11,12], which is an effective and widely used tool for sequence-based classification of proteins in structural/functional categories and thus for fold recognition [37,38].
In the following paragraphs, the adopted method is presented and the training and testing procedures are explained. The employed dataset and the results of the classification method are then described. The advantages and disadvantages of our approach are given in the discussion section, where possible further improvements are also discussed.
2. Materials and methods
The formulation of SPM can cover almost any categorical sequential domain [33,39,40]. In order to apply SPM to a specific domain, the following notions are required: a database of sequences D, a set of items (alphabet) I, a definition of the transaction id (tid) and a definition of an itemset. In our problem, protein sequences form the database D. The set of items I consists of the 20 amino acids that compose the protein sequences plus one symbol for the unknown amino acid. The tid denotes the position of the amino acid in the protein sequence, and an itemset consists of a single item (one of the 21 letters), since only one amino acid exists in a specific position of the protein sequence. The SPM procedure can benefit from constraints that allow for flexible gaps in the extracted sequential patterns. More details are provided in Appendix A.
Several algorithms have been reported in the literature which implement the above described SPM procedure [32,41,42]. However, limited work has been done in constrained SPM [35,40,43,44]. An algorithm that performs constrained SPM is the cSPADE algorithm [35]. cSPADE finds the set of all frequent sequences with constraints, such as minimum and maximum gap between sequence items, based on the SPADE algorithm [45]. In terms of performance and computational effort, cSPADE is considered superior to other constrained SPM approaches [40,44].
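As a rough illustration of what support under gap constraints means, the following Python sketch tests whether a pattern occurs in a sequence with bounded distances between consecutive pattern letters, and counts the fraction of sequences that contain it. It is a brute-force check written for this illustration only and is not the cSPADE algorithm; the function names `contains` and `support` are hypothetical, and the gap is assumed to be the difference between the positions (tids) of two consecutive pattern letters. The toy database is the set of fictitious sequences used as an example in Section 2.2.

```python
from typing import Iterable


def contains(sequence: str, pattern: str, min_gap: int = 1, max_gap: int = 4) -> bool:
    """True if `pattern` occurs in `sequence` with every pair of consecutive
    pattern letters between min_gap and max_gap positions apart."""
    if not pattern:
        return True
    # Positions at which the pattern prefix matched so far may end.
    ends = {i for i, aa in enumerate(sequence) if aa == pattern[0]}
    for letter in pattern[1:]:
        ends = {
            j
            for i in ends
            for j in range(i + min_gap, min(i + max_gap, len(sequence) - 1) + 1)
            if sequence[j] == letter
        }
        if not ends:
            return False
    return bool(ends)


def support(database: Iterable[str], pattern: str, **gaps: int) -> float:
    """Fraction of sequences in the database that contain the pattern."""
    db = list(database)
    return sum(contains(s, pattern, **gaps) for s in db) / len(db)


# Assumption: toy database taken from the example of Section 2.2.
database = ["SLFEQL", "STYEL", "STVAEL", "XSLTKT", "MLTA"]
for p in ("SL", "SE", "ST", "EL", "SEL"):
    ids = [n for n, s in enumerate(database, start=1) if contains(s, p)]
    print(p, ids, support(database, p))
```

Keeping a set of all admissible end positions, rather than greedily taking the first match, matters here because an early match can violate a later gap constraint that a later match would satisfy.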
2.1. Description of the method
The employed method is closely related to the feature mining problem [46]. Feature mining combines two powerful data mining techniques, SPM and classification algorithms, in order to provide appropriate feature selection for sequential domains. More specifically, sequential patterns are extracted and constitute the features to be used in the classification algorithms. The flowchart of our method is shown in Fig. 1. First, we select the constraints to be incorporated in the cSPADE algorithm, and then sequential patterns are extracted from the protein sequences. The scoring function computes the score of an unknown protein for every pattern, and then the final score of the unknown protein with respect to a fold is calculated, leading to the protein classification (fold recognition).
Fig. 1. Flowchart of the proposed method (training: selection of the algorithm's constraints and sequential pattern mining on the protein primary sequences; testing: pattern score calculation and fold score calculation, leading to fold recognition).
2.2. Training phase
During training (see also Fig. 2), the cSPADE algorithm generates one set of sequential patterns for every fold under consideration. These patterns constitute the features to be used in classifying the unknown proteins. Several experiments were performed concerning the gap and the support constraints. As already mentioned, the current method closely resembles the feature mining problem. For this reason, even though SPM is an unsupervised technique, we employed it in a supervised manner, since we generated sequential patterns for each category (fold) separately. In other words, a pattern_i extracted from fold_i indicates an implication (rule) of the form pattern_i ⇒ fold_i.
To illustrate the above procedure, we consider five amino acid sequences (indicating protein sequences) which belong to the same fold: (1) SLFEQL, (2) STYEL, (3) STVAEL, (4) XSLTKT and (5) MLTA. Each amino acid of these fictitious sequences constitutes both an item and an itemset, since only one amino acid exists in a single position of the protein. These proteins constitute a database to be mined for sequential patterns using a minimum support, for example equal to 60% (i.e., the patterns should appear in at least 0.6 * 5 = 3 proteins), and maximum gap = 4 (i.e., the pattern is a 4-distance subsequence of the protein sequence). After applying the cSPADE algorithm with the above parameters, several sequential patterns are extracted, such as: SL (contained in sequences 1,2,4), SE (contained in sequences 1,2,3), ST (contained in sequences 2,3,4), EL (contained in sequences 1,2,3) and SEL (contained in sequences 1,2,3).
2.3. Testing phase
In multi-class classification problems (such as fold recognition), the method should assign a sequence of unknown structural category to exactly one of many categories. Our classification method combines all the extracted sequential patterns from all folds according to a straightforward approach. When classifying an unknown protein to one of the folds, all the extracted sequential patterns from all folds are examined to find which of them are contained in the protein. For a pattern contained in a protein, the score of this protein with respect to this fold is increased by:
$$\mathrm{score}_i^j = \frac{\text{length of the pattern}_i^j - k}{\text{number of patterns in fold}_i}, \qquad (1)$$
where i represents a fold, j represents a pattern of a fold, pattern_i^j is the jth pattern of the ith fold and k is a value employed to assign the minimum score to the minimal pattern. It should be mentioned that if a pattern is contained in a protein sequence more than once, it receives the same score as if it were contained only once. The number of patterns in fold_i in the denominator of the scoring function is adopted for normalization. The scores for each fold are summed and the new protein is assigned to the fold exhibiting the highest sum. Fig. 3 depicts the testing procedure schematically.
The score of a protein with respect to a fold is calculated based on the number of sequential patterns of this fold contained in the protein. The higher the number of patterns of a fold contained in a protein, the higher the score of the protein for this fold. It should be noted that some adjustments and weightings are required when calculating the score. Specifically, the length of the pattern in the numerator makes longer sequential patterns more significant than shorter ones. Also, the score of a protein with respect to a fold is normalized by dividing it by the number of sequential patterns extracted from this fold.
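The scoring rule of Eq. (1) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `contains`, `fold_score` and `classify` are hypothetical, the gap is taken as the difference between the positions of consecutive pattern letters, the denominator counts every pattern listed for the fold, and patterns no longer than k are skipped (an assumption made so that no pattern contributes a zero or negative score). The input data are the sequence and the patterns of the worked example in Table 1, with k = 1.

```python
from fractions import Fraction
from typing import Dict, List, Tuple


def contains(sequence: str, pattern: str, min_gap: int = 1, max_gap: int = 4) -> bool:
    """Gap-constrained subsequence test (brute force, for illustration)."""
    ends = {i for i, aa in enumerate(sequence) if aa == pattern[0]}
    for letter in pattern[1:]:
        ends = {
            j
            for i in ends
            for j in range(i + min_gap, min(i + max_gap, len(sequence) - 1) + 1)
            if sequence[j] == letter
        }
    return bool(ends)


def fold_score(protein: str, patterns: List[str], k: int = 1, max_gap: int = 4) -> Fraction:
    """Sum of Eq. (1) over the patterns of one fold contained in the protein.
    A pattern counts once, however many times it occurs."""
    total = Fraction(0)
    for p in patterns:
        # Assumption: patterns with length <= k are not used.
        if len(p) > k and contains(protein, p, max_gap=max_gap):
            total += Fraction(len(p) - k, len(patterns))
    return total


def classify(
    protein: str, fold_patterns: Dict[str, List[str]], k: int = 1, max_gap: int = 4
) -> Tuple[str, Dict[str, Fraction]]:
    """Return the fold with the highest summed score, plus all scores."""
    scores = {f: fold_score(protein, ps, k, max_gap) for f, ps in fold_patterns.items()}
    return max(scores, key=scores.get), scores


# Data of the worked example in Table 1 (k = 1, maximum gap = 4).
folds = {"Fold 1": ["SEL", "SGG"], "Fold 2": ["QLG", "QAAA", "FEGG"]}
best, scores = classify("SLFEQLGGQAAVQA", folds, k=1)
print(best, scores)
```

Exact fractions are used only so that the sums can be compared with the values of Table 1 (1 for Fold 1 and 8/3 for Fold 2) without rounding; floating-point scores would serve equally well for ranking.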
Fig. 2. Schematic representation of the training phase. For the training dataset of every fold (a1, ..., a118, b1, ..., c1, ..., d15, ..., g3), cSPADE (parameters: minimum support, maximum gap, minimum gap) extracts a set of sequential patterns S_i, where S_i denotes the set of sequential patterns extracted from fold i.
Fig. 3. Schematic representation of the testing phase. For each unknown protein of the test dataset, the scoring function uses the sequential patterns extracted from every fold during training (S_a1, ..., S_g3) to compute the score of the test protein for every fold (a1, ..., a118, b1, c1, ..., d15, ..., g3). The test protein is classified to the fold with the highest score.
Table 1
An example of the classification procedure adopted by the proposed method
Sequence: SLFEQLGGQAAVQA
Pattern | Comments | Score
Fold 1 patterns
SEL | Gap between S and E = 3; gap between E and L = 2. The pattern is a 3-distance subsequence | (length of pattern − 1)/(number of Fold 1 patterns) = (3 − 1)/2 = 1
SGG | Gap between S and G = 6; gap between G and G = 1. SGG is a 6-distance subsequence (>4) | Score = 0
Final score of protein with respect to Fold 1 = 1 + 0 = 1
Fold 2 patterns
QLG | Gap between Q and L = 1; gap between L and G = 1. The pattern QLG is a 1-distance subsequence | (length of pattern − 1)/(number of Fold 2 patterns) = (3 − 1)/3 = 2/3
QAAA | Gap between Q and A = 1; gap between A and A = 1; gap between A and A = 3. The pattern is a 3-distance subsequence | (length of pattern − 1)/(number of Fold 2 patterns) = (4 − 1)/3 = 1
FEGG | Gap between F and E = 1; gap between E and G = 3; gap between G and G = 1. The pattern is a 3-distance subsequence | (length of pattern − 1)/(number of Fold 2 patterns) = (4 − 1)/3 = 1
Final score of protein with respect to Fold 2 = 2/3 + 1 + 1 = 8/3
Classification → Fold 2
Again, let us consider the following sequence: SLFEQLGGQAAVQA, which we want to classify in one of two candidate folds. Suppose that during training, the following sequential patterns were extracted using 4 as the maximum gap between two consecutive items in a sequence:
Fold 1 patterns: SEL, SGG
Fold 2 patterns: QLG, QAAA, FEGG
Since the maximum gap = 4, we have to check whether a pattern is at most a 4-distance subsequence of the sequence when calculating the score. If not, this pattern should not be considered. Note that this is a simplistic example that does not include all the sequential patterns that can be extracted. For example, since pattern SEL is a frequent sequential pattern, patterns SE and EL are also frequent, based on the Apriori principle [32] (see footnote 1). Table 1 depicts the classification procedure using the sequential patterns extracted from the two folds. Here we set k = 1. To produce the classification decision, the final scores (sums) are compared. In this example, the sum of scores for Fold 2 is greater than the sum of scores for Fold 1, thus the unknown protein is classified in Fold 2.
The above scoring function (Eq. (1)) is a heuristic one, selected after a series of experiments. For example, we also tried the number of times a sequential pattern is contained in the protein raised to the power of n (n = 1, 2, ...), the logarithm of the length of the pattern, the length of the pattern raised to the power of n (n = 1, 2, ...), the support of the pattern and others, but all of these yielded lower classification results.
1 If pattern SEL is frequent, then pattern SL should also be frequent (together with SE and EL). However, when using gap constraints, the Apriori principle cannot be applied and thus we cannot claim that the pattern SL is also frequent.
3. Dataset
In order to validate the proposed classifier, an appropriate group of protein sequences was taken from the Protein Data Bank (PDB) [47]. Each member of this group corresponds to a specific fold of the Structural Classification of Proteins (SCOP) database [2]. As protein members we used those included in the ASTRAL SCOP 1.69 dataset, where no proteins with more than 40% identity between them are included. The identity between protein sequences was defined by the BLAST identity in both (BIB) criterion. The complete dataset used in the current study is shown in Table 2. Specifically, the 36 most populated SCOP folds, with at least 30 members, were used to derive the training and test data. The threshold of 30 members was adopted in order to obtain enough sequences to properly train the classifier. From the 2410 proteins in total, two thirds from each category were used for training and the rest for evaluation (Table 2).
4. Results
Our method was evaluated using the above described dataset. Table 3 depicts the experimental results obtained
Table 2
The dataset used (36 SCOP folds)
Fold | Index | Training set | Test set
All alpha proteins | | 260 | 131
Globin-like | a1 | 21 | 11
Cytochrome c | a3 | 20 | 10
DNA-binding 3-helical bundle | a4 | 103 | 52
Four-helical up-and-down bundle | a24 | 28 | 15
EF-hand | a39 | 31 | 15
SAM domain-like | a60 | 25 | 12
Alpha–alpha superhelix | a118 | 32 | 16
All beta proteins | | 406 | 203
Immunoglobin-like beta sandwich | b1 | 132 | 66
Common fold of diphtheria toxin/transcription factors/cytochrome f | b2 | 20 | 10
Galactose-binding domain-like | b18 | 21 | 10
ConA-like lectins/glucanases | b29 | 24 | 12
SH3-like barrel | b34 | 44 | 22
OB-fold | b40 | 61 | 31
Trypsin-like serine proteases | b47 | 25 | 12
PH domain-like | b55 | 24 | 12
Double-stranded beta-helix | b82 | 28 | 14
Nucleoplasmin-like | b121 | 27 | 14
Alpha and beta proteins (a/b) | | 658 | 329
(TIM)-barrel | c1 | 143 | 71
NAD(P)-binding Rossmann fold | c2 | 91 | 46
FAD/NAD(P)-binding domain | c3 | 22 | 11
Flavodoxin-like | c23 | 58 | 29
Adenine nucleotide alpha hydrolase-like | c26 | 35 | 17
P-loop containing nucleotide | c37 | 91 | 46
Thioredoxin-like | c47 | 39 | 20
Ribonuclease H-like motif | c55 | 31 | 15
Phosphorylase/hydrolase-like | c56 | 20 | 10
S-Adenosyle-L-methionine-dependent methyltransferases | c66 | 40 | 20
PLP-dependent transferases | c67 | 31 | 15
Hydrolases | c69 | 34 | 17
Periplasmic binding protein-like II | c94 | 23 | 12
Alpha and beta proteins (a+b) | | 189 | 95
b-Grasp | d15 | 44 | 22
Cystatin-like | d17 | 20 | 10
Ferredoxin-like | d58 | 102 | 51
Protein kinase-like (PK-like) | d144 | 23 | 12
Membrane and cell surface proteins and peptides | | 25 | 12
Single transmembrane helix | f23 | 25 | 12
Small proteins | | 68 | 34
Knottins (small inhibitors, toxins, lectins) | g3 | 68 | 34
Overall | | 1606 | 804
Table 3
Experimental results obtained using various parameters for minimum support, maximum gap value and k (training and test set)
MinSup (%) | MaxGap | k | Training set: fold prediction (%) | Training set: class prediction (%) | Test set: fold prediction (%) | Test set: class prediction (%)
20 | 4 | 2 | 54.0 | 62.5 | 19.8 | 35.5
20 | 4 | 1 | 30.5 | 35.1 | 14.9 | 22.6
40 | 4 | 1 | 40.6 | 52.1 | 16.9 | 35.5
40 | 4 | 2 | 59.7 | 78.1 | 24.9 | 60.3
50 | 1 (a) | 1 | 14.5 | 22.9 | 7.6 | 16.7
50 | 5 | 2 | 49.5 | 73.4 | 20.7 | 45.4
The best results are obtained using MinSup = 40%, MaxGap = 4 and k = 2.
(a) Using MaxGap 1, the extracted sequential patterns are composed only of consecutive amino acids.
using different values for minimum support, maximum gap value and k (scoring function), for both training and test sets. The best results were obtained using minimum support = 40%, maximum gap = 4 (minimum gap was set to 1) and k = 2 (the latter implies that only patterns with length ≥ 3 were employed). Table 3 also depicts the performance of the SPM methodology in the task of class prediction. If a protein that belongs to fold_i is classified into fold_j, and fold_i and fold_j belong to the same class, this is considered a correct class prediction. Table 4 shows the number of the extracted sequential patterns and the corresponding performance of the classifier (using minimum support = 40%, maximum gap = 4, minimum gap = 1 and k = 2) in the training set.
Table 4
Number of the extracted sequential patterns, sensitivity for all folds and overall accuracy of the method when evaluated in the training set
Fold index | # Sequential patterns | Sensitivity (proteins) | Sensitivity (%)
a1 | 1304 | 15/21 | 71.4
a3 | 905 | 17/20 | 85.0
a4 | 998 | 30/103 | 29.1
a24 | 1442 | 28/28 | 100.0
a39 | 935 | 26/31 | 83.9
a60 | 987 | 19/25 | 76.0
a118 | 2624 | 28/32 | 87.5
Class A | 9195 | 163/260 | 62.7
b1 | 1610 | 68/132 | 51.5
b2 | 1968 | 19/20 | 95.0
b18 | 2616 | 20/21 | 95.2
b29 | 2572 | 21/24 | 87.5
b34 | 726 | 10/44 | 22.7
b40 | 1695 | 25/61 | 41.0
b47 | 2465 | 24/25 | 96.0
b55 | 1286 | 16/24 | 66.7
b82 | 2922 | 20/28 | 71.4
b121 | 4296 | 26/27 | 96.3
Class B | 22156 | 249/406 | 61.3
c1 | 4802 | 16/143 | 11.2
c2 | 4790 | 85/91 | 93.4
c3 | 12219 | 22/22 | 100.0
c23 | 2157 | 30/58 | 51.7
c26 | 4485 | 29/35 | 82.9
c37 | 3244 | 32/91 | 35.2
c47 | 1370 | 22/39 | 56.4
c55 | 3964 | 30/31 | 96.8
c56 | 3792 | 20/20 | 100.0
c66 | 3032 | 27/40 | 67.5
c67 | 7515 | 31/31 | 100.0
c69 | 4327 | 29/34 | 85.3
c94 | 5053 | 19/23 | 82.6
Class C | 60739 | 392/658 | 59.6
d15 | 841 | 21/44 | 47.7
d17 | 1226 | 14/20 | 70.0
d58 | 1248 | 22/102 | 21.6
d144 | 4663 | 22/23 | 95.7
Class D | 7978 | 79/189 | 41.8
f23 | 311 | 16/25 | 64.0
Class F | 311 | 16/25 | 64.0
g3 | 120 | 60/68 | 88.2
Class G | 120 | 60/68 | 88.2
Overall | 100499 | 959/1606 | 59.7
In Table 5, the classification results for each fold separately are presented in terms of Top-1 to Top-5 sensitivity [18] and overall accuracy. Top-1 to Top-5 sensitivity is computed by considering a classification as correct if the score of the actual (true) fold is among the N highest ones (N = 1, ..., 5). In our case this sensitivity reached up to 56.5%. In Table 6 we present the results of our method for different values of sequence identity (≤40%, ≤25% and 25–40%), which correspond to different levels of classification difficulty. It should be mentioned that the ≤25% sequence identity refers only to the test sequences. The training of the method was performed using proteins from the whole dataset (sequence identity ≤40%). Moreover, in order to obtain a detailed analysis of all the correct and wrong classifications of our method, we calculated the confusion matrix of the proposed method for the fold recognition problem (Table 7). The (i, j) element of the confusion matrix denotes the number of test proteins that belong to category i and were classified as category j.
We also compared our method with SAM [11,12], which is widely used for the same classification problem [37,38]. SAM was evaluated following both possible approaches, scores and E-values ranking, using the same training and test sets (Table 2) as the proposed method. In Table 6 the SAM classification results for both scores and E-values ranking are shown for the three previously mentioned sequence identity levels: ≤40%, ≤25% and 25–40%. Again, the ≤25% sequence identity refers only to the test sequences. A comparison of the results obtained by our method and SAM (using scores ranking and E-values ranking) is presented in Table 8. In this table only the Top-1 to Top-5 accuracy for the six classes and overall are presented. It should be noted that the sequence identity here is ≤40%. Finally, in order to evaluate the robustness of the proposed method, receiver operating characteristics (ROC) analysis was performed following the class reference formulation [39,48], where each category was considered separately, against all others.
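The Top-N sensitivity and the class-level prediction criterion described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the tiny `results` list are hypothetical, ties between scores are broken arbitrarily by the sort order, and the class of a fold is assumed to be the leading letter of its index (for example, c23 belongs to class c), as suggested by the fold indices of Table 2.

```python
from typing import Dict, List, Tuple

# One entry per test protein: (true fold, {fold: score}).
Result = Tuple[str, Dict[str, float]]


def top_n_hit(scores: Dict[str, float], true_fold: str, n: int) -> bool:
    """True if the true fold is among the n highest-scoring folds."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return true_fold in ranked[:n]


def top_n_sensitivity(results: List[Result], n: int) -> float:
    """Fraction of test proteins whose true fold is ranked in the top n."""
    return sum(top_n_hit(s, t, n) for t, s in results) / len(results)


def fold_class(fold: str) -> str:
    """Assumption: the class is the leading letter of the fold index."""
    return fold[0]


def class_accuracy(results: List[Result]) -> float:
    """Fraction of proteins assigned to a fold of the correct class."""
    correct = sum(
        fold_class(max(s, key=s.get)) == fold_class(t) for t, s in results
    )
    return correct / len(results)


# Hypothetical scores for three test proteins and three folds.
results = [
    ("a1", {"a1": 0.9, "a3": 0.5, "b1": 0.1}),
    ("b1", {"a1": 0.8, "a3": 0.7, "b1": 0.6}),
    ("a3", {"a1": 0.4, "a3": 0.3, "b1": 0.2}),
]
for n in (1, 2, 3):
    print(n, top_n_sensitivity(results, n))
print(class_accuracy(results))
```

In this toy input the third protein (true fold a3) is assigned to a1, which is a wrong fold prediction but a correct class prediction, mirroring the distinction drawn in Table 3.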
The ROC curves with the corresponding areas under the curve (AUC) for all folds are shown in Fig. 4. The AUC varies from 0.549 for fold d58 to 0.996 for fold g3. In order to compute the multi-class AUC the following formula is used [48]:
$$\mathrm{AUC}_{\mathrm{total}} = \sum_{c_i \in C} \mathrm{AUC}(c_i)\, p(c_i), \qquad (2)$$
where AUC(c_i) is the area under the class reference ROC curve for class (fold) c_i (i = 1, 2, ..., 36), C is the set of classes (folds) and p(c_i) is the ratio of test proteins belonging to fold_i to the total number of test proteins. Using Eq. (2), AUC_total = 0.81, indicating a reliable model (classifier). The AUC_total was also computed using Eq. (2) for SAM using both scores ranking and E-values ranking, and AUC_total = 0.70 for both cases.
Furthermore, some indicative examples of sequential patterns extracted from folds a1, b1, c1, f23 and g3, using minimum support 40%, maximum gap 4 and minimum gap 1, are given below.
Table 5
Classification results of the proposed method in the test set with the best obtained parameters in terms of Top-1 to Top-5 sensitivity for every fold
Fold index | Top-1 | Top-1 (%) | Top-2 (%) | Top-3 (%) | Top-4 (%) | Top-5 (%)
a1 | 2/11 | 18.2 | 27.3 | 54.5 | 81.8 | 90.9
a3 | 2/10 | 20.0 | 30.0 | 50.0 | 50.0 | 50.0
a4 | 15/52 | 28.8 | 32.7 | 40.4 | 50.0 | 53.8
a24 | 5/15 | 33.3 | 33.3 | 40.0 | 53.3 | 53.3
a39 | 10/15 | 66.7 | 73.3 | 73.3 | 73.3 | 73.3
a60 | 2/12 | 16.7 | 33.3 | 33.3 | 41.7 | 50.0
a118 | 6/16 | 37.5 | 62.5 | 68.8 | 68.8 | 75.0
Class A | 42/131 | 32.1 | 40.5 | 48.9 | 57.3 | 61.1
b1 | 24/66 | 36.4 | 48.5 | 54.5 | 62.1 | 66.7
b2 | 3/10 | 30.0 | 30.0 | 40.0 | 60.0 | 60.0
b18 | 2/10 | 20.0 | 30.0 | 40.0 | 50.0 | 50.0
b29 | 1/12 | 8.3 | 8.3 | 33.3 | 58.3 | 58.3
b34 | 0/22 | 0.0 | 9.1 | 9.1 | 18.2 | 27.3
b40 | 6/31 | 19.4 | 32.3 | 45.2 | 51.6 | 58.1
b47 | 7/12 | 58.3 | 75.0 | 75.0 | 91.7 | 91.7
b55 | 0/12 | 0.0 | 0.0 | 0.0 | 0.0 | 8.3
b82 | 0/14 | 0.0 | 0.0 | 0.0 | 14.3 | 21.4
b121 | 9/14 | 64.3 | 78.6 | 85.7 | 85.7 | 85.7
Class B | 52/203 | 25.6 | 35.0 | 41.9 | 51.2 | 55.7
c1 | 0/71 | 0.0 | 7.0 | 22.5 | 36.6 | 47.9
c2 | 32/46 | 69.6 | 78.3 | 82.6 | 84.8 | 93.5
c3 | 1/11 | 9.1 | 9.1 | 45.5 | 63.6 | 63.6
c23 | 7/29 | 24.1 | 41.4 | 48.3 | 55.2 | 69.0
c26 | 8/17 | 47.1 | 52.9 | 52.9 | 58.8 | 58.8
c37 | 5/46 | 10.9 | 23.9 | 32.6 | 52.2 | 60.9
c47 | 0/20 | 0.0 | 5.0 | 15.0 | 30.0 | 35.0
c55 | 0/15 | 0.0 | 20.0 | 46.7 | 60.0 | 66.7
c56 | 0/10 | 0.0 | 30.0 | 40.0 | 50.0 | 60.0
c66 | 1/20 | 5.0 | 30.0 | 35.0 | 40.0 | 40.0
c67 | 7/15 | 46.7 | 60.0 | 66.7 | 73.3 | 73.3
c69 | 1/17 | 5.9 | 29.4 | 47.1 | 52.9 | 58.8
c94 | 7/12 | 58.3 | 58.3 | 66.7 | 75.0 | 75.0
Class C | 69/329 | 21.0 | 32.8 | 43.8 | 54.4 | 61.7
d15 | 2/22 | 9.1 | 18.2 | 22.7 | 31.8 | 31.8
d17 | 0/10 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
d58 | 2/51 | 3.9 | 11.8 | 17.6 | 19.6 | 19.6
d144 | 2/12 | 16.7 | 33.3 | 41.7 | 41.7 | 58.3
Class D | 6/95 | 6.3 | 14.7 | 20.0 | 23.2 | 25.3
f23 | 5/12 | 41.7 | 41.7 | 41.7 | 50.0 | 50.0
Class F | 5/12 | 41.7 | 41.7 | 41.7 | 50.0 | 50.0
g3 | 26/34 | 76.5 | 79.4 | 79.4 | 82.4 | 82.4
Class G | 26/34 | 76.5 | 79.4 | 79.4 | 82.4 | 82.4
Overall | 200/804 | 24.9 | 34.6 | 42.8 | 51.5 | 56.5
Table 6
Classification results of the proposed method, SAM using the E-values ranking and the scores ranking in the test set for sequence identity ≤40%, ≤25% and 25–40%
Fold index | Proposed method ≤40 (%) | Proposed method ≤25 (%) | Proposed method 25–40 (%) | SAM E-values ranking ≤40 (%) | SAM E-values ranking ≤25 (%) | SAM E-values ranking 25–40 (%) | SAM scores ranking ≤40 (%) | SAM scores ranking ≤25 (%) | SAM scores ranking 25–40 (%)
a1 | 18.2 | 0.0 | 33.3 | 81.8 | 60.0 | 100.0 | 81.8 | 60.0 | 100.0
a3 | 20.0 | 14.3 | 33.3 | 60.0 | 57.1 | 66.7 | 60.0 | 57.1 | 66.7
a4 | 28.8 | 28.3 | 33.3 | 3.8 | 4.3 | 0.0 | 1.9 | 2.2 | 0.0
a24 | 33.3 | 36.4 | 25.0 | 6.7 | 9.1 | 0.0 | 13.3 | 9.1 | 25.0
a39 | 66.7 | 60.0 | 80.0 | 86.7 | 80.0 | 100.0 | 73.3 | 60.0 | 100.0
a60 | 16.7 | 11.1 | 33.3 | 16.7 | 22.2 | 0.0 | 16.7 | 22.2 | 0.0
a118 | 37.5 | 38.5 | 33.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
Class A | 32.1 | 29.7 | 40.0 | 25.2 | 19.8 | 43.3 | 23.7 | 16.8 | 46.7
b1 | 36.4 | 38.0 | 33.3 | 50.0 | 50.0 | 50.0 | 31.8 | 32.0 | 31.3
b2 | 30.0 | 30.0 | - | 0.0 | 0.0 | - | 10.0 | 10.0 | -
b18 | 20.0 | 22.2 | 0.0 | 30.0 | 33.3 | 0.0 | 30.0 | 33.3 | 0.0
b29 | 8.3 | 12.5 | 0.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0
b34 | 0.0 | 0.0 | 0.0 | 36.4 | 15.4 | 66.7 | 36.4 | 15.4 | 66.7
b40 | 19.4 | 17.4 | 25.0 | 6.5 | 4.3 | 12.5 | 3.2 | 4.3 | 0.0
b47 | 58.3 | 66.7 | 55.6 | 83.3 | 66.7 | 88.9 | 66.7 | 33.3 | 77.8
b55 | 0.0 | 0.0 | 0.0 | 25.0 | 10.0 | 100.0 | 25.0 | 10.0 | 100.0
b82 | 0.0 | 0.0 | 0.0 | 14.3 | 8.3 | 50.0 | 7.1 | 8.3 | 0.0
b121 | 64.3 | 54.5 | 100.0 | 7.1 | 9.1 | 0.0 | 0.0 | 0.0 | 0.0
Class B | 25.6 | 24.8 | 27.8 | 32.0 | 25.5 | 50.0 | 24.1 | 18.8 | 38.9
c1 | 0.0 | 0.0 | 0.0 | 14.1 | 14.8 | 11.8 | 9.9 | 9.3 | 11.8
c2 | 69.6 | 66.7 | 75.0 | 23.9 | 13.3 | 43.8 | 17.4 | 10.0 | 31.3
c3 | 9.1 | 0.0 | 33.3 | 100.0 | 100.0 | 100.0 | 81.8 | 87.5 | 66.7
c23 | 24.1 | 33.3 | 0.0 | 27.6 | 19.0 | 50.0 | 27.6 | 19.0 | 50.0
c26 | 47.1 | 50.0 | 0.0 | 11.8 | 12.5 | 0.0 | 17.6 | 18.8 | 0.0
c37 | 10.9 | 9.1 | 15.4 | 80.4 | 81.8 | 76.9 | 41.3 | 33.3 | 61.5
c47 | 0.0 | 0.0 | 0.0 | 25.0 | 23.1 | 28.6 | 30.0 | 30.8 | 28.6
c55 | 0.0 | 0.0 | 0.0 | 13.3 | 15.4 | 0.0 | 6.7 | 7.7 | 0.0
c56 | 0.0 | 0.0 | 0.0 | 10.0 | 14.3 | 0.0 | 10.0 | 14.3 | 0.0
c66 | 5.0 | 7.1 | 0.0 | 20.0 | 14.3 | 33.3 | 20.0 | 14.3 | 33.3
c67 | 46.7 | 50.0 | 42.9 | 80.0 | 62.5 | 100.0 | 66.7 | 37.5 | 100.0
c69 | 5.9 | 6.7 | 0.0 | 5.9 | 6.7 | 0.0 | 11.8 | 13.3 | 0.0
c94 | 58.3 | 44.4 | 100.0 | 25.0 | 22.2 | 33.3 | 25.0 | 22.2 | 33.3
Class C | 21.0 | 19.9 | 23.9 | 32.5 | 28.6 | 43.2 | 24.6 | 19.9 | 37.5
d15 | 9.1 | 6.7 | 14.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
d17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
d58 | 3.9 | 2.2 | 16.7 | 3.9 | 2.2 | 16.7 | 2.0 | 0.0 | 16.7
d144 | 16.7 | 0.0 | 25.0 | 91.7 | 75.0 | 100.0 | 66.7 | 50.0 | 75.0
Class D | 6.3 | 2.7 | 18.2 | 13.7 | 5.5 | 40.9 | 9.5 | 2.7 | 31.8
f23 | 41.7 | 45.5 | 0.0 | 25.0 | 27.3 | 0.0 | 33.3 | 36.4 | 0.0
Class F | 41.7 | 45.5 | 0.0 | 25.0 | 27.3 | 0.0 | 33.3 | 36.4 | 0.0
g3 | 76.5 | 65.2 | 100.0 | 44.1 | 43.5 | 45.5 | 50.0 | 47.8 | 54.5
Class G | 76.5 | 65.2 | 100.0 | 44.1 | 43.5 | 45.5 | 50.0 | 47.8 | 54.5
Overall | 24.9 | 22.9 | 30.6 | 29.4 | 24.1 | 44.7 | 23.8 | 18.4 | 39.3
The sequence identity refers only to the test proteins.
To better understand these patterns, we should note that:
• The notation x(0, 3) means that at least 0 and at most 3 residues of any type may occur in this position. (0, 3) is derived from the constraint we set to the cSPADE algorithm, i.e., to extract sequential patterns with minimum gap 1 and maximum gap 4.
• The intervening amino acids, denoted with x, do not belong to the sequential pattern.
• Each non-x letter defines one particular type of amino acid residue in that position in the pattern.
• The dash character "-" does not express anything particular and is used only to make the patterns easier to write and read.
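The notation described in the list above maps naturally onto a regular expression, which gives one simple way to test whether a protein sequence contains a pattern. The sketch below is an illustration under stated assumptions: the function name `pattern_to_regex` is hypothetical, and the pattern string `S-x(0,3)-E-x(0,3)-L` is a made-up example written in the notation above (it is the SEL pattern of Table 1 with its gaps made explicit), not one of the patterns reported by the authors.

```python
import re

# Matches a gap token such as "x(0,3)" or "x(0, 3)".
_GAP = re.compile(r"x\((\d+),\s*(\d+)\)")


def pattern_to_regex(pattern: str):
    """Translate e.g. 'S-x(0,3)-E-x(0,3)-L' into the regex 'S.{0,3}E.{0,3}L'.
    Dashes are separators only; x(m,n) allows m to n residues of any type."""
    parts = []
    for token in pattern.split("-"):
        token = token.strip()
        gap = _GAP.fullmatch(token)
        if gap:
            parts.append(".{%s,%s}" % gap.groups())
        else:
            # A non-x token is a literal amino acid letter.
            parts.append(re.escape(token))
    return re.compile("".join(parts))


# Hypothetical pattern, checked against the sequence of Table 1.
rx = pattern_to_regex("S-x(0,3)-E-x(0,3)-L")
print(rx.pattern, bool(rx.search("SLFEQLGGQAAVQA")))
```

Because the regex engine backtracks over the bounded wildcards, it explores alternative placements of each letter, so the gap limits are respected between every pair of consecutive pattern letters.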
5. Discussion
In this work, we proposed a novel method for analyzing protein sequences based on sequential pattern mining. The
Table 7
The confusion matrix of the proposed classification method in the test set, for the fold recognition problem. Columns correspond to the true fold and rows to the predicted fold.

Predicted \ True: a1 a3 a4 a24 a39 a60 a118 b1 b2 b18 b29 b34 b40 b47 b55 b82 b121 c1 c2 c3 c23 c26 c37 c47 c55 c56 c66 c67 c69 c94 d15 d17 d58 d144 f23 g3
a1    2 1 1 0 2 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 1 0
a3    0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
a4    0 0 15 0 0 0 1 1 0 0 0 1 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 1 0 0
a24   4 0 5 5 2 3 5 4 0 0 0 3 2 1 1 1 1 4 2 0 1 2 9 6 2 0 0 0 0 1 2 3 10 2 1 0
a39   0 0 0 1 10 0 0 0 1 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
a60   0 0 8 0 0 2 0 0 0 0 0 0 2 0 2 0 0 0 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 1
a118  1 0 4 1 0 5 6 0 0 0 0 0 1 0 3 1 0 3 1 0 1 0 10 2 2 0 2 2 1 0 2 1 4 2 0 0
b1    0 0 2 0 0 0 0 24 0 0 0 1 2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 1 0 0 0
b2    0 1 1 0 0 0 0 8 3 1 2 2 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2 1 0 1
b18   0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
b29   0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
b34   0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
b40   0 0 6 0 1 0 0 3 1 1 1 1 6 0 3 1 0 2 0 0 2 0 1 0 0 0 1 0 0 1 3 1 2 1 0 0
b47   0 0 0 0 0 0 0 4 2 2 4 1 0 7 0 0 0 3 1 0 0 0 0 0 0 0 0 0 2 0 0 0 1 0 0 1
b55   0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1
b82   0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
b121  0 0 0 2 0 0 0 6 1 0 0 0 0 0 0 2 9 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
c1    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
c2    3 2 0 2 0 0 0 4 0 1 2 3 6 4 0 2 1 37 32 8 11 1 9 2 4 5 3 5 6 2 1 0 14 0 2 0
c3    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
c23   0 1 0 0 0 1 2 0 0 1 0 1 1 0 0 2 0 5 2 0 7 0 0 1 3 1 1 0 0 0 1 1 3 0 0 0
c26   0 0 3 1 0 0 1 0 0 0 0 0 5 0 0 0 0 3 3 0 1 8 5 0 3 0 4 0 2 1 0 1 3 0 0 1
c37   0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 2 1 1 2 1 5 1 0 0 3 0 1 0 0 1 0 3 0 1
c47   0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c55   0 0 2 0 0 0 0 1 0 0 0 0 0 0 0 0 0 2 2 0 0 2 1 0 0 0 0 1 0 0 1 0 0 0 0 0
c56   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 3 0 0 0
c66   0 0 2 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 1 0
c67   0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 3 0 0 0 3 3 0 0 1 2 7 2 0 0 0 0 0 0 0
c69   0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0
c94   1 3 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 3 1 0 2 0 0 1 0 1 1 0 0 7 1 0 0 0 0 0
d15   0 0 0 0 0 1 0 2 0 0 0 2 2 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 2 0 2 0 0 0
d17   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
d58   0 0 0 1 0 0 0 1 0 1 0 3 0 0 1 2 0 0 0 0 2 0 0 0 0 0 0 0 0 0 3 0 2 0 1 1
d144  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0
f23   0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0
g3    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 26

The entry in the column of fold i and the row of fold j denotes the number of test proteins that belong to fold i and were classified as fold j. The diagonal entries (highlighted in the original table) indicate the correct classifications.
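Per-fold sensitivity and overall accuracy follow directly from a confusion matrix such as Table 7. The sketch below is illustrative only: the function names and the two-fold toy matrix are hypothetical, and it assumes that rows index the predicted fold and columns index the true fold. As a consistency check on the real data, the g3 column of Table 7 sums to 34 with 26 on the diagonal, i.e. 26/34 = 76.5%, which matches the Top-1 value reported for class G.

```python
def per_fold_sensitivity(matrix, labels):
    """Return {fold: correct / total} for each true fold.

    Assumption: matrix[p][t] counts test proteins whose true fold is
    labels[t] and whose predicted fold is labels[p].
    """
    result = {}
    for t, label in enumerate(labels):
        # Column sum = number of test proteins that truly belong to fold t.
        total = sum(row[t] for row in matrix)
        result[label] = matrix[t][t] / total if total else None
    return result


def overall_accuracy(matrix):
    """Fraction of all test proteins that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical two-fold example (not data from the paper).
toy_labels = ["foldX", "foldY"]
toy_matrix = [
    [3, 1],  # predicted foldX: 3 true foldX, 1 true foldY
    [1, 5],  # predicted foldY: 1 true foldX, 5 true foldY
]
toy_sensitivity = per_fold_sensitivity(toy_matrix, toy_labels)
toy_accuracy = overall_accuracy(toy_matrix)
```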
Table 8
Comparison of classification results in the test set for the proposed method and SAM (scores ranking and E-values ranking) in terms of Top-1 to Top-5 accuracy for the six classes and overall

Method                  Class    Top-1 (%)  Top-2 (%)  Top-3 (%)  Top-4 (%)  Top-5 (%)
SAM scores ranking      A        23.7       31.3       35.1       39.7       42.7
                        B        24.1       34.0       38.9       45.3       49.8
                        C        24.6       34.3       40.7       43.5       47.1
                        D        9.5        20.0       21.1       28.4       30.5
                        F        33.3       33.3       33.3       33.3       33.3
                        G        50.0       64.7       70.6       70.6       73.5
                        Overall  23.8       33.6       38.4       42.8       46.3
SAM E-values ranking    A        25.2       32.8       36.6       38.9       42.7
                        B        32.0       37.4       39.4       45.8       50.2
                        C        32.5       37.1       43.5       48.9       52.3
                        D        13.7       18.9       21.1       25.3       29.5
                        F        25.0       25.0       25.0       25.0       25.0
                        G        44.1       61.8       67.6       67.6       70.6
                        Overall  29.4       35.3       39.6       44.3       48.0
Proposed method         A        32.1       40.5       48.9       57.3       61.1
                        B        25.6       35.0       41.9       51.2       55.7
                        C        21.0       32.8       43.8       54.4       61.7
                        D        6.3        14.7       20.0       23.2       25.3
                        F        41.7       41.7       41.7       50.0       50.0
                        G        76.5       79.4       79.4       82.4       82.4
                        Overall  24.9       34.6       42.8       51.5       56.5
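The Top-k accuracy reported in Table 8 counts a test protein as correctly classified when its true fold is among the k highest-ranked candidate folds. A minimal sketch of that computation is given below; the function name, the per-fold score dictionaries and the toy values are hypothetical and are not taken from the paper.

```python
def top_k_accuracy(scores_per_protein, true_folds, k):
    """Fraction of proteins whose true fold is among the k best-scoring folds.

    Assumption: scores_per_protein[i] maps each candidate fold to a score
    for protein i, with higher scores meaning more probable folds.
    """
    hits = 0
    for scores, truth in zip(scores_per_protein, true_folds):
        # Rank the candidate folds from highest to lowest score.
        ranked = sorted(scores, key=scores.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_folds)


# Hypothetical scores for three test proteins and three candidate folds.
toy_scores = [
    {"a1": 0.9, "b1": 0.5, "c1": 0.1},
    {"a1": 0.2, "b1": 0.7, "c1": 0.6},
    {"a1": 0.3, "b1": 0.4, "c1": 0.1},
]
toy_truth = ["a1", "c1", "c1"]
```

With these toy values, `top_k_accuracy(toy_scores, toy_truth, k)` is non-decreasing in k, which mirrors the way each row of Table 8 grows from Top-1 to Top-5.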
SPM technique was employed using the cSPADE algorithm in order to mine the sequential patterns. Using a simple scoring function which utilizes the extracted sequential patterns, the sequences are classified into the corresponding class. Our approach was tested on the problem of protein fold recognition and efficiently classified unknown proteins into 36 candidate folds. To evaluate the method, an appropriate group of protein sequences was acquired from the PDB.

We also compared the proposed method with SAM, which is widely used as a benchmark in fold recognition [37,38]. However, SAM requires higher computational effort during training, since it trains the model with the Baum–Welch algorithm [49], which is an iterative procedure. Given M training sequences of length L and a model consisting of S states, the Baum–Welch algorithm has complexity $O(MS^2L)$ per iteration, and many iterations are required until it converges. On the other hand, the cSPADE algorithm minimizes I/O costs by reducing database scans. Only three scans of the database of sequences are required: one for generating frequent 1-sequences, another for generating frequent 2-sequences, and one more for generating all frequent k-sequences. Also, cSPADE minimizes computational costs by using efficient search schemes (details can be found in [45]). Furthermore, cSPADE scales almost linearly in the database size and a number of other database parameters.

The obtained results indicate that the proposed approach performs slightly better than SAM using the scores ranking, but worse when the E-values ranking is considered. More specifically, the proposed method exhibits an overall accuracy of 24.9%, while SAM's overall accuracy was 23.8% using the scores ranking and 29.4% using the E-values ranking. It is worth mentioning that when the sequence identity of the test proteins is ≤25%, our method shows comparable performance with SAM even when E-values are used (23% vs. 24%, Table 6). Moreover, the AUC for the proposed method is higher than that of SAM, demonstrating its robustness relative to SAM. In addition, the proposed method performs well when considering the Top-k accuracy. Top-k accuracy is very important, since prediction of a certain set of candidate folds can assist domain experts in limiting their uncertainty. The Top-5 accuracy of the proposed method was 56.5%, while SAM's was 46.3% and 48.0% using scores ranking and E-values ranking, respectively. Other methods in the literature reported higher accuracy, but they used different datasets, either with fewer candidate categories [9,36,50] or with proteins that share higher homology [17,34,51].

The proposed method is suitable for analyzing biosequences like protein sequences due to their sequential nature and is able to discover strong sequential dependencies (patterns) between amino acids that may correspond to functionally or structurally important elements in proteins. Also, when the gap is one and the length of the amino acid patterns is two, these patterns are composed of consecutive amino acids that correspond to dipeptide combinations. In addition, our approach is able to provide the
[Fig. 4 plots: six ROC panels, (a) to (f), each showing Sensitivity (vertical axis) against 1-Specificity (horizontal axis), both ranging from 0 to 1. AUC values from the panel legends:
(a) a1, 0.936; a3, 0.945; a4, 0.788; a24, 0.646; a39, 0.965; a60, 0.826; a118, 0.858
(b) b1, 0.881; b2, 0.839; b18, 0.882; b29, 0.912; b34, 0.852
(c) b40, 0.721; b47, 0.963; b55, 0.852; b82, 0.790; b121, 0.970
(d) c1, 0.755; c2, 0.816; c3, 0.978; c23, 0.757; c26, 0.818; c37, 0.741; c47, 0.744
(e) c55, 0.749; c56, 0.775; c66, 0.808; c67, 0.891; c69, 0.873; c94, 0.887
(f) d15, 0.770; d17, 0.655; d58, 0.549; d144, 0.982; f23, 0.949; g3, 0.996]
Fig. 4. ROC curves and areas under ROC curves (AUC) for the best performance classifier for folds: (a) a1, a3, a4, a24, a39, a60 and a118, (b) b1, b2, b18, b29 and b34, (c) b40, b47, b55, b82 and b121, (d) c1, c2, c3, c23, c26, c37 and c47, (e) c55, c56, c66, c67, c69 and c94 and (f) d15, d17, d58, d144, f23 and g3.
patterns that led to the classification of an unknown protein to a specific fold (the extracted patterns that are contained in the unknown protein); thus, it can lead to pattern discovery. Furthermore, the training phase of the method, i.e., the determination of the sequential patterns, is a fast procedure because the cSPADE algorithm is used. In general, for sequential pattern mining the computational load increases exponentially as longer sequences need to be mined. For example, with m attributes there are $O(m^k)$ potentially frequent sequences of length at most k [45]. The cSPADE algorithm handles this aspect efficiently [35].

However, the proposed method has some disadvantages. An adequate number of sequences from every class is
(a)
Sequence ID   Sequence
1             SLFE
2             MVNWAA
3             LSADQ

(b)
Sequence ID (sid)   Transaction ID (tid) (position)   Itemset
1                   1                                 S
1                   2                                 L
1                   3                                 F
1                   4                                 E
2                   1                                 M
2                   2                                 V
2                   3                                 N
2                   4                                 W
2                   5                                 A
2                   6                                 A
3                   1                                 L
3                   2                                 S
3                   3                                 A
3                   4                                 D
3                   5                                 Q
Fig. 5. (a) The sequence representation of a database of sequences, (b) the representation of a database in the form of tuples to be used for sequential pattern mining.
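The transformation shown in Fig. 5, together with the support and gap-constraint definitions of Appendix A, can be illustrated with a short sketch. This is a naive illustration under stated assumptions, not the cSPADE algorithm: the function names are hypothetical, every itemset is taken to be a single amino acid, and the gap is the difference between the positions of two consecutive matched items.

```python
def to_tuples(sequences):
    """Map {sid: sequence} to (sid, tid, item) tuples; tid is the 1-based position."""
    return [(sid, tid, item)
            for sid, seq in sequences.items()
            for tid, item in enumerate(seq, start=1)]


def contains_with_gap(pattern, seq, min_gap=1, max_gap=None):
    """True if the items of `pattern` occur in order in `seq`, with the
    position difference of consecutive matches in [min_gap, max_gap]."""
    def search(p_idx, prev_pos):
        if p_idx == len(pattern):
            return True
        if prev_pos is None:
            candidates = range(len(seq))  # first item may match anywhere
        else:
            stop = len(seq) if max_gap is None else min(len(seq), prev_pos + max_gap + 1)
            candidates = range(prev_pos + min_gap, stop)
        return any(seq[pos] == pattern[p_idx] and search(p_idx + 1, pos)
                   for pos in candidates)
    return search(0, None)


def support(pattern, sequences, **gaps):
    """Fraction of sequences that contain the pattern at least once."""
    hits = sum(contains_with_gap(pattern, s, **gaps) for s in sequences.values())
    return hits / len(sequences)


# The three sequences of Fig. 5(a), keyed by sequence ID.
FIG5_DB = {1: "SLFE", 2: "MVNWAA", 3: "LSADQ"}
FIG5_TUPLES = to_tuples(FIG5_DB)
```

Calling `support("SE", FIG5_DB, max_gap=1)` and `support("SE", FIG5_DB, max_gap=4)` shows how the maximum gap changes which sequences count towards the support of a pattern.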
required (theoretically at least two, but the more the better) for proper sequential pattern extraction [32]. This is a limitation compared with methods that can build a classification model even from a single sequence per class (e.g., SAM in the case of fold recognition), albeit with a small decrease in their performance [52]. In addition, when classifying an unknown sequence, the sequential patterns extracted from all classes in the training phase must be checked one by one to find out whether they are contained in the sequence. Since the number of sequential patterns extracted from biological data is considerable, a large number of comparisons must be performed to reach the classification decision (the average time for classifying a protein of length 100 was about 4 s using a Pentium IV at 2.4 GHz with 512 MB RAM). Another disadvantage of the proposed method is the 0% reported accuracy for some folds. This can be attributed to the difficulty of discriminating proteins belonging to folds with low sequence identity, such as the TIM barrel fold [53]. A possible improvement could be the adoption of secondary structure information. Moreover, the utilization of SPM, besides finding valid and causal relationships in the biological data, will also find all the spurious and particular relationships among the data in the specific dataset. For this reason, results of any SPM procedure should be considered as exploratory and hypothesis-generating.

Further improvement might focus on employing additional types of biological information, for example the protein secondary structure, besides the protein sequence. Modeling both the protein sequence and the secondary structures using sequential pattern mining would be of great interest; however, this would greatly increase the complexity. Another issue is the implementation of a more sophisticated scoring function through the utilization of artificial neural networks or genetic algorithms. Moreover, it would be of great interest to apply the proposed method at the superfamily level, which can address a more practical scientific question such as function determination.

6. Conclusions
We presented a new method based on data mining techniques for protein sequence analysis and classification. Specifically, sequential pattern mining was used for the classification of proteins into folds. The approach we followed leads to knowledge discovery in the field of protein classification and indicates the important role of data mining in bioinformatics. Our method compares well with other systems in the literature that accomplish the task of fold recognition. However, several improvements could be considered in order to increase its efficiency.

Appendix A. Sequential pattern mining
Data mining can be defined as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [39]. A well-known data mining technique is sequential pattern mining (SPM), which is defined as follows [32]: Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of items. A subset $X \subseteq I$ is an itemset and $|X|$ is the size of $X$. A sequence $s = (s_1, s_2, \ldots, s_m)$ is an ordered list of itemsets, where $s_i \subseteq I$, $i \in \{1, \ldots, m\}$. The length l of a sequence
$s = (s_1, s_2, \ldots, s_m)$ is defined as $\sum_{i=1}^{m} |s_i|$, and a sequence with length l is an l-sequence. A sequence $s_a = (a_1, a_2, \ldots, a_n)$ is contained in another sequence $s_b = (b_1, b_2, \ldots, b_m)$ if there exist integers $1 \leq i_1 < i_2 < \ldots < i_n \leq m$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_n \subseteq b_{i_n}$. In SPM, a database of sequences D is transformed into a database of tuples (sid, tid, X), where sid is a sequence id and tid is a transaction id, denoting the position of the itemset X in the sequence (Fig. 5). The support of a sequence $s_a$ in the database D, $\mathrm{sup}_D(s_a)$ (in the standard database sequence formation and not in the tuples formation), is the percentage of sequences $s \in D$ which contain $s_a$ (at least once). If the sequence $s_a$ is contained more than once in a sequence, $\mathrm{sup}_D(s_a)$ remains the same. Given a support threshold minSup, a sequence $s_a$ is a frequent l-sequential pattern on D (or frequent l-sequence) if $\mathrm{sup}_D(s_a) \geq \mathit{minSup}$. The problem of mining sequential patterns is to find all frequent sequential patterns for a database D, given a support threshold minSup.

Several constraints can be incorporated when mining for sequential patterns [40]. One of the simplest constraints is the gap constraint, which imposes a limit on the maximum distance between two consecutive itemsets in the sequence. This constraint is very useful to reflect the impact of an item on another one when each transaction occurs at a particular instant of time (position). When using gap constraints, the notion of "contained in" is adapted to incorporate also the restriction $i_k - i_{k-1} \leq d$. Using d = 1 (maximum gap = 1), the extracted sequential patterns are composed of consecutive itemsets. Similar to the maximum gap constraint is the minimum gap constraint, which states that the distance between two consecutive itemsets must be at least a specified value ($i_k - i_{k-1} \geq d'$).

References
[1] Whitford D. Proteins: structure and function. John Wiley & Sons; 2005.
[2] Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995;247:536–40.
[3] Moult J, Fidelis K, Rost B, Hubbard T, Tramontano A. Critical assessment of methods of protein structure prediction (CASP)–round 6. Proteins 2005;61(Suppl 7):3–7.
[4] Fischer D, Eisenberg D. Protein fold recognition using sequence-derived predictions. Protein Sci 1996;5:947–55.
[5] Di Francesco V, Geetha V, Garnier J, Munson PJ. Fold recognition using predicted secondary structure sequences and hidden Markov models of proteins folds. Proteins: Struct Funct Genet 1997;1:123–8.
[6] Hargbo J, Elofsson A. Hidden Markov models that use predicted secondary structures for fold recognition. Proteins 1999;36:68–87.
[7] Karplus K, Sjölander K, Barrett C, Cline M, Haussler D, Hughey R, et al. Predicting protein structure using hidden Markov models. Proteins: Struct Funct Genet 1997;1:134–9.
[8] Dandekar T, Argos P. Identifying the tertiary fold of small proteins with different topologies from sequence and secondary structure using the genetic algorithm and extended criteria specific for strand regions. J Mol Biol 1996;256:645–60.
[9] Ding C, Dubchak I. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics 2001;17:349–58.
[10] Shi JY, Pan Q, Zhang SW, et al. Protein fold recognition with support vector machines fusion network. Prog Biochem Biophys 2006;33(2):155–62.
[11] Hughey R, Krogh A. Hidden Markov models for sequence analysis: extension and analysis of the basic method. CABIOS 1996;12(2):95–107.
[12] Karplus K, Karchin R, Shackelford G, Hughey R. Calibrating E-values for hidden Markov models using reverse-sequence null models. Bioinformatics 2005;21:4107–15.
[13] Lindahl E, Elofsson A. Identification of related proteins on family, superfamily and fold level. J Mol Biol 2000;295:613–25.
[14] Karchin R, Cline M, Mandel-Gutfreund Y, Karplus K. Hidden Markov models that use predicted local structure for fold recognition: alphabets of backbone geometry. Proteins 2003;51:504–14.
[15] Liu Y, Carbonell J, Weigele P. Protein fold recognition using segmentation conditional random fields (SCRFs). J Comput Biol 2006;13(2):394–406.
[16] Bowie JU, Lüthy R, Eisenberg D. A method to identify protein sequences that fold into a known three-dimensional structure. Science 1991;253:164–70.
[17] Flockner H, Domingues F, Sippl MJ. Proteins folds from pair interactions: a blind test in fold recognition. Proteins: Struct Funct Genet 1997;1:129–33.
[18] Xu J. Fold recognition by predicted alignment accuracy. IEEE/ACM Trans Comput Biol Bioinform 2005;2(2):157–65.
[19] Sander O, Sommer I, Lengauer T. Local protein structure prediction using discriminative models. BMC Bioinform 2006;7(14).
[20] Murzin AG. Structure classification based assessment of CASP3 predictions for the fold recognition targets. Proteins: Struct Funct Genet 1999;37:88–103.
[21] Orengo CA, Bray JE, Hubbard T, LoConte L, Sillitoe I. Analysis and assessment of ab initio three-dimensional prediction, secondary structure, and contacts prediction. Proteins: Struct Funct Genet 1999;37:149–70.
[22] Ortiz AR, Kolinski A, Rotkiewicz P, Ilkowski B, Skolnick J. Ab initio folding of proteins using restraints derived from evolutionary information. Proteins: Struct Funct Genet 1999;3:177–85.
[23] Elofsson A, Fischer D, Rice DW, LeGrand SM, Eisenberg D. A study of combined structure-sequence profiles. Folding & Design 1996;1:451–61.
[24] Bonneau R, Baker D. Ab initio protein structure prediction: progress and prospects. Annu Rev Bioph Biom 2001;30:173–89.
[25] Chou KC, Zhang CT. A correlation-coefficient method to predicting protein-structural classes from amino acid compositions. Eur J Biochem 1992;207:429–33.
[26] Wang ZX, Yuan Z. How good is the prediction of protein structural class by the component-coupled method? Proteins 2000;38:165–75.
[27] Luo RY, Feng ZP, Liu JK. Prediction of protein structural class by amino acid and polypeptide composition. Eur J Biochem 2002;269:4219–25.
[28] Bhardwaj N, Langlois RE, Zhao G, Lu H. Kernel-based machine learning protocol for predicting DNA-binding proteins. Nucleic Acid Res 2005;33(20):6486–93.
[29] Gromiha M, Suwa M. A simple statistical method for discriminating outer membrane proteins with better accuracy. Bioinformatics 2005;21(7):961–8.
[30] Galitsky BA, Kuznetsov SO, Vinogradov DV. Applying hybrid reasoning to mine for associative features in biological data. J Biomed Inform 2007;40:203–20.
[31] Radivojac P, Chawla NV, Dunker AK, Obradovic Z. Classification and knowledge discovery in protein databases. J Biomed Inform 2004;37:224–39.
[32] Agrawal R, Srikant R. Mining sequential patterns. In: 11th Intl Conf on Data Eng 1995; p. 3–14.
[33] Wang K, Hu Y, Hu Yu J. Scalable Sequential Pattern Mining for Biological Sequences. Proceedings of the 13th ACM conference on Information and knowledge management, USA. 2004; p. 178–87.
[34] Blekas K, Fotiadis DI, Likas A. Motif-based protein sequence classification using neural networks. J Comput Biol 2005;12:64–82.
[35] Zaki MJ. Sequence mining in categorical domains: incorporating constraints. In: Proc of the 9th international conference on information and knowledge management USA. 2000; p. 422–29.
[36] Aung Z, Tan KL. Automatic 3D protein structure classification without structural alignment. J Comput Biol, Mary Ann Liebert, Inc. Publishers, June 2005.
[37] Ginalski K, Grishin NV, Godzik A, Rychlewski L. Practical lessons from protein structure prediction. Nucleic Acids Res 2005;33:1874–91.
[38] Fischer D. Servers for protein structure prediction. Curr Opin Struct Biol 2006;16(2):178–82.
[39] Tan PN, Steinbach M, Kumar V. Introduction to data mining. Addison Wesley; 2005.
[40] Srikant R, Agrawal R. Mining sequential patterns: Generalizations and performance improvements. In: Proc 5th Int Conf Extending Database Technology EDBT, vol. 1057; 1996, p. 3–17.
[41] Pei J, Han J, Mortazavi-Asl B, Wang J, Pinto H, Chen Q, et al. Mining sequential patterns by pattern-growth: the prefixspan approach. IEEE Trans Knowledge Data Eng 2004;16:1424–40.
[42] Ayres J, Gehrke J, Yiu T, Flannick J. Sequential pattern Mining Using Bitmaps. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Canada. 2002; p. 429–35.
[43] Mannila H, Toivonen H, Verkamo I. Discovery of frequent episodes in event sequences. Data Min Knowl Disc 1997;1(3):259–89.
[44] Garofalakis M, Rastogi R, Shim K. SPIRIT: Sequential Pattern Mining with Regular Expression Constraint. In: Proceedings of the 25th International Conference on Very Large Databases. 1999; p. 223–34.
[45] Zaki MJ. Efficient enumeration of frequent sequences. In: 7th Intl Conf Info and Knowledge Management 1998 USA. p. 68–75.
[46] Lesh N, Zaki MJ, Ogihara M. Scalable feature mining for sequential data. IEEE Intell Syst 2000;15(2):48–56.
[47] Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, et al. The protein data bank. Nucleic Acids Res 2000;28:235–42.
[48] Fawcett T. ROC Graphs: Notes and Practical Considerations for Researchers. Technical Report HPL-2003-4, HP Labs, 2003.
[49] Baum LE. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 1972;3:1–8.
[50] Aung Z, Tan KL. Automatic Protein Structure Classification through Structural Fingerprinting. 4th IEEE Symposium on Bioinformatics and Bioengineering Taichung, Taiwan, ROC, 2004. p. 508–15.
[51] Grundy WN, Bailey TL, Elkan CP, Baker ME. Meta-MEME: motif-based hidden Markov models of protein families. Comput Appl Biosci 1997;13:397–406.
[52] Lampros C, Papaloukas C, Exarchos TP, Goletsis Y, Fotiadis DI. Sequence-based protein structure prediction using a reduced state-space hidden Markov model. Comput Biol Med; 2006 [in press].
[53] Lorentzen E, Pohl E, Zwart P, Stark A, Russell RB, Knura T, et al. Crystal structure of an archaeal class I aldolase and the evolution of (βα)8 barrel proteins. J Biol Chem 2003;278:47253–60.