Detection of Splice Sites Using Support Vector Machine

Pritish Varadwaj, Neetesh Purohit, and Bhumika Arora
Indian Institute of Information Technology, Allahabad
[email protected], [email protected], [email protected]

Abstract. Automatic identification and annotation of the exon and intron regions of genes from DNA sequences has been an important research area in computational biology. Several approaches, viz. Hidden Markov Models (HMM), Artificial Intelligence (AI) based machine learning and Digital Signal Processing (DSP) techniques, have been used extensively and independently by various researchers to address this challenging task. In this work, we propose a Support Vector Machine based kernel learning approach for detection of splice sites (the exon-intron boundaries) in a gene. Electron-Ion Interaction Potential (EIIP) values of nucleotides are used for mapping character sequences to corresponding numeric sequences. A Radial Basis Function (RBF) kernel SVM is trained using the EIIP numeric sequences and then tested on a test gene dataset for detection of splice sites by shifting a window of 12 residues. The window size and the important parameters of the SVM kernel have been optimized for better accuracy. Receiver Operating Characteristic (ROC) curves have been used to display the sensitivity of the classifier, and the results show 94.82% accuracy for splice site detection on the test dataset.

Keywords: Splice site, Support vector machine, Electron-ion interaction potential.

S. Ranka et al. (Eds.): IC3 2009, CCIS 40, pp. 493–502, 2009. © Springer-Verlag Berlin Heidelberg 2009

1 Introduction

The successful completion of several genome projects in the recent past has yielded a vast amount of sequence data. Rational analysis of these genomic data to extract relevant information can have profound implications for automated annotation and functional motif identification. Identification of genes from sequence data is an important area of research in computational biology. The complexity of gene finding increases because coding regions, called exons, are interrupted by non-coding regions, called introns, and are complemented by intergenic regions. The exon-intron border is known as the donor splice site, whereas the intron-exon border is known as the acceptor splice site.

Identification of the exon, intron and splice site regions of a gene is not a new problem. Existing approaches fall into three categories: classical probabilistic, artificial intelligence based, and Digital Signal Processing (DSP) based. Hidden Markov Models (HMM), dynamic programming and Bayesian networks fall in the first category, while Artificial Neural Network (ANN) and Support Vector Machine (SVM) based approaches belong to the artificial intelligence category. Furthermore, the discrete nature of the DNA representation has motivated many signal processing engineers to obtain an equivalent numeric sequence for DNA strands and then apply various DSP methods to obtain interpretable results. The proposed SVM based method for splice site detection is essentially a hybrid of the computational intelligence and DSP based approaches.

We briefly review the relevant previous work. ANN based approaches have been adopted in [1],[2],[3] for gene recognition. A window of 99 nucleotides was used, and the center nucleotide of the window was classified as coding or non-coding using an ANN with 9 inputs, 14 hidden layers and 1 output. A genetic algorithm was used for evolving the biases and interconnection weights. In [4],[5],[6],[7] many other variations of ANNs, or rule based systems combined with ANNs, have been used for gene identification. Various statistical coefficients and frequency indicators calculated from the genomic sequence have been used as input features in [1-7], but unfortunately none of these methods achieves a satisfactory level of accuracy across different genes and different species. A survey of computational approaches to gene identification has been reported in [8].

The support vector machine (SVM) is an alternative machine learning method which has a more robust theoretical background and often gives better results than an ANN. If the input data belong to two classes with n common features, then each input sample can be represented as a vector in an n-dimensional space, and the classification problem reduces to finding a hyperplane in that space which separates the two classes. The SVM finds two parallel hyperplanes, of any orientation, such that the margin between them is maximized. The input samples (vectors) which fall on these parallel hyperplanes are called support vectors. Several kernel functions have been recommended for SVMs, but the radial basis function (RBF) often gives better performance.
The performance of the RBF kernel depends heavily on two parameters, C and γ. The optimum values of these parameters vary from problem to problem, an aspect that has been ignored by most researchers using SVMs. SVMs have been used for several bioinformatics problems [9-13], and the problem of splice site detection has also been addressed recently [14-22]. These methods involve feature and model selection (simple or hidden Markov models) and SVM kernel engineering for splice site prediction using conditional positional probabilities. They are cumbersome compared to the proposed approach, which is simple and straightforward.

For DSP based approaches, various schemes for converting a character sequence into a numeric sequence and applying DSP techniques for gene finding and similar applications have been summarized in [23],[24]. The binary or Voss representation [25], the one, two or three dimensional tetrahedron representation [26], the EIIP method [27],[28], paired numeric, paired and weighted spectral rotation (PWSR) and paired spectral content [29] are popular conversion techniques. All such techniques utilize statistical properties of exon and intron regions observed in many genes of different species, e.g. the period-3 property or the frequency of particular nucleotides in exon or intron regions. Either time domain methods, e.g. correlation structures [30] and the average magnitude difference function (AMDF) [29], or frequency domain methods, e.g. the DFT [31],[32],[33], wavelets [27],[28] and autoregressive models [34], have been used for exploring these properties of genes. To overcome the spectral spreading problem of transform methods, a few filtering approaches have also been suggested [35].

In short, many researchers have done good work by adopting one of the three approaches mentioned above and have obtained satisfactory results, but the datasets used in most such studies are probably biased towards a single chromosome or a single gene, and hence do not give the desired output for other genes of the same species. The hybrid method adopted in the proposed scheme is a novel approach which gives high identification accuracy for a large variety of genes with less preprocessing. Being a natural characteristic feature of nucleotides, EIIP values appear more appealing for obtaining numeric sequences, and this method has therefore been used in the present work. Furthermore, we found the SVM to be a better classifier than the ANN, at least for this purpose, and preferred it accordingly, which resulted in higher prediction accuracy. In this work, an EIIP mapped numeric dataset of genomic sequences has been prepared for training the SVM, and the optimum values of the SVM parameters have been explicitly determined. Use of these optimized values during the testing phase increased the accuracy of the results considerably.

This paper is organized as follows. The methodology is described in detail in section 2. Results and discussion are reported in section 3. Finally, the conclusion and future work appear in section 4, followed by the references.

2 Materials and Methods

We selected Arabidopsis thaliana as the model species, and genomic sequence data were collected across all five of its chromosomes. Data were collected from the Exon-Intron Database (EID) [36]; the sequence entries were then subjected to similarity screening to obtain a final dataset with less than 23% inter-sequence similarity. The selected dataset comprises 1000 splice site bearing sequences of 12 residues length (500 acceptor and 500 donor splice sites) and 1000 non-splice site sequences of 12 residues length (500 each from exon and intron regions). We randomly split the whole dataset of 2000 sequences into training and test sets at a ratio of 2:3. The training set of 800 sequences comprised 400 splice site (donor and acceptor) and 400 non-splice site sequences; the test set of 1200 sequences comprised 600 splice site (donor and acceptor) and 600 non-splice site sequences from exon and intron regions. The training set was used for training the various classifiers, while the test examples were not exposed to the system during the learning, kernel selection and hyper-parameter selection phases. The training and test datasets can be obtained from http://profile.iiita.ac.in/pritish/svmdata

The genomic character sequences of the training and test sets were converted into numeric sequences using the EIIP values listed in Table 1.

Table 1. EIIP values of nucleotide residues

Nucleotide letter   Name       EIIP value
A                   Adenine    0.1260
G                   Guanine    0.0806
C                   Cytosine   0.1340
T                   Thymine    0.1335
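The character to numeric mapping of Table 1 can be sketched in a few lines of Python. This is only an illustration: the helper name `eiip_encode` and the example window are assumptions made here, not part of the original work.

```python
# EIIP values copied from Table 1.
EIIP = {"A": 0.1260, "G": 0.0806, "C": 0.1340, "T": 0.1335}


def eiip_encode(sequence):
    """Map a DNA string to its list of EIIP values (hypothetical helper)."""
    try:
        return [EIIP[base] for base in sequence.upper()]
    except KeyError as err:
        # Any symbol outside A/G/C/T (e.g. N) is rejected rather than guessed.
        raise ValueError(f"unsupported nucleotide: {err.args[0]!r}") from None


window = "TGTAGGATCAGT"        # illustrative 12-residue window
vector = eiip_encode(window)   # 12 numeric features for the classifier
print(len(vector), vector[:4])
```

Each 12-residue window therefore becomes a 12-dimensional numeric vector, which is the form of input the classifier described below operates on.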


As described earlier, other character to numeric mapping schemes appear to be biased by their attempt to exploit various statistical properties of genes. EIIP values are a natural characteristic feature of nucleotides and probably carry the chemical information patterns that the spliceosome needs to recognize [37]. Moreover, recognition by the spliceosome is unlikely to be guided by a single EIIP value; rather, it should be the environmental effect of neighboring residues which makes detection of the specific pattern possible. For this reason we considered sequences of 12 residues (sequence window size = 12) after validating the results with window lengths of 6, 8, 10, 12, 14 and 16 residues. The ROC plot for these window sizes is shown in Fig. 1.

[Figure: "ROC plot of performance vs. different window size". x-axis: false positive rate (1 - specificity), 0 to 1; y-axis: true positive rate (sensitivity), 0 to 1; legend: 12, 14, 10, 08 and 06 residues.]

Fig. 1. ROC plot of prediction accuracy with different input vector sequence window length

Thus a window size of 12 was selected. The window is moved from one end of a gene (3’ / 5’) to the other (5’ / 3’), advancing one residue per move, and the presence or absence of a splice site is detected for each window position (Fig. 2).

[Figure: two example sequences, each with a 12-residue window marked from position i to position i+11. Acceptor splice site (intron region followed by exon region): 5’ end T A C G T G T A G G A T C A G T G A 3’ end. Donor splice site (exon region followed by intron region): 5’ end A G C C T A T G A G T T A T G T A G 3’ end.]

Fig. 2. Window sliding by one residue at a time to create input vector for classifier
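The window sliding of Fig. 2 can be sketched as follows. This is a minimal illustration, assuming a plain Python string as input; the function name `sliding_windows` and the example sequence are chosen here for demonstration only.

```python
def sliding_windows(sequence, size=12):
    """Yield (start_index, window) for every window of `size` residues.

    The window advances one residue per step, so a sequence of length n
    produces n - size + 1 windows (none if the sequence is too short).
    """
    for i in range(len(sequence) - size + 1):
        yield i, sequence[i:i + size]


gene = "TACGTGTAGGATCAGTGA"   # illustrative 18-residue sequence
windows = list(sliding_windows(gene))
for start, window in windows:
    # Window i covers positions i .. i+11 of the sequence.
    print(start, start + 11, window)
```

Each emitted window would then be converted to EIIP values and passed to the classifier, one prediction per window position.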


In this process there are four possibilities for the type of nucleotide pattern falling within the window: a) the sequence pattern belongs to an exon region, b) the sequence pattern belongs to an intron region, c) the sequence pattern belongs to a donor splice site and d) the sequence pattern belongs to an acceptor splice site. Data of these four types were prepared as 2000 such windows, with target label -1 for pattern types a) and b) and +1 for pattern types c) and d). The input vector for the training as well as the test set is thus $X^i = (X_1^i, X_2^i, \ldots, X_{12}^i)$, each labeled by the corresponding $y^i = +1$ or $y^i = -1$ depending on whether it represents a splice site or a non-splice site pattern, respectively.

The training set was subjected to the SVM classifier. This involved fixing several hyper-parameters, whose values determine the function that the SVM optimizes and therefore have a crucial effect on the performance of the trained classifier [38]. We tried several kernels: linear, polynomial and radial basis function (RBF). We found the RBF kernel to be the most suitable (as the number of features is not very large), for which training errors on splice site data (false negatives) outweigh errors on non-splice site data (false positives). The classical RBF used in this work has the same structure as an SVM with the Gaussian kernel

$$K(X_i, X_j) = \exp\left(-\gamma \lVert X_i - X_j \rVert^2\right), \quad \gamma > 0 \qquad (1)$$
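As a quick numerical illustration of kernel (1), the following sketch evaluates it for two short vectors. The function name `rbf_kernel` and the example vectors are assumptions for demonstration; SVM libraries compute this kernel internally.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian kernel of equation (1): exp(-gamma * ||x - y||^2)."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    if len(x) != len(y):
        raise ValueError("vectors must have equal length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two illustrative 2-dimensional vectors built from EIIP values.
x = [0.1260, 0.0806]
y = [0.1340, 0.1335]
print(rbf_kernel(x, x, gamma=1.59))   # identical vectors give exactly 1.0
print(rbf_kernel(x, y, gamma=1.59))   # distinct vectors give a value below 1
```

The kernel value is 1 for identical inputs and decays towards 0 as the squared distance grows, with γ controlling how quickly it decays.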

The kernel (1) is best suited to data whose class-conditional probability distribution approaches the Gaussian distribution. It maps such data into a different space where the data become linearly separable. To visualize this, observe that the kernel, being exponential, can be expanded into an infinite series, giving rise to an infinite-dimensional polynomial kernel; each of these polynomial kernels is able to transform certain dimensions to make them linearly separable. However, this kernel is difficult to design, in the sense that it is difficult to arrive at an optimum γ and to choose the corresponding C that works best for a given problem.

Since searching for the best hyperplane parameters is associated with the problem of overfitting, we implemented a grid parameter search exploring all combinations of C and γ with a ten-fold cross-validation routine, where γ ranged from $2^{-15}$ to $2^{4}$ and C ranged from $2^{-5}$ to $2^{15}$ [39]. To identify an optimal hyper-parameter set we performed a two-step grid search on C and γ using 10-fold cross-validation, dividing the training set into 10 subsets of equal size (80 each). Each subset in turn is tested using the classifier trained on the remaining 9 subsets. Pairs of (C, γ) were tried and the one with the best cross-validation accuracy was picked. The best cross-validation performance was obtained by the RBF kernel with parameter γ = 1.59 and cost factor C = 97 (Fig. 3), giving a classification accuracy of 98.25% during cross-validation.

[Figure: "Contour map of 'C' and 'Gama' for RBF". x-axis: value of 'gama', 0 to 20; y-axis: value of 'C', 0 to 50; data marker: Accuracy_p, X = 13, Y = 39, Level = 0.9825.]

Fig. 3. Contour plot of grid search result showing optimum values of hyper-parameter
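The grid search with 10-fold cross-validation can be outlined as below. This is a sketch under stated assumptions, not the authors' implementation: the exponent step of 1, the helper names, and the caller-supplied `cv_accuracy` function (which would wrap the actual SVM training) are all hypothetical, and the second refinement step of the two-step search is not shown.

```python
import random


def log2_grid(lo_exp, hi_exp, step=1):
    """Return [2**lo_exp, ..., 2**hi_exp] with exponents spaced by `step`."""
    return [2.0 ** e for e in range(lo_exp, hi_exp + 1, step)]


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds, preserving the class balance."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)   # deal indices out round-robin
    return folds


def grid_search(labels, cv_accuracy, c_values, gamma_values, k=10):
    """Return (best_accuracy, C, gamma) over all (C, gamma) pairs.

    `cv_accuracy(C, gamma, folds)` is a hypothetical caller-supplied function
    that trains on k-1 folds, tests on the held-out fold, and averages.
    """
    folds = stratified_folds(labels, k)
    return max((cv_accuracy(c, g, folds), c, g)
               for c in c_values for g in gamma_values)


labels = [+1] * 400 + [-1] * 400     # 800 training windows, balanced
c_values = log2_grid(-5, 15)         # C from 2^-5 to 2^15
gamma_values = log2_grid(-15, 4)     # gamma from 2^-15 to 2^4
folds = stratified_folds(labels)
print(len(c_values) * len(gamma_values), [len(f) for f in folds])
```

With a balanced training set of 800 windows, each of the 10 folds holds 80 windows (40 per class), matching the subset size described above.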

3 Results and Discussion

To optimize the SVM parameters γ and C, 10-fold cross-validation was applied to the training dataset, exploring various combinations of C ($2^{-5}$ to $2^{15}$) and γ ($2^{-15}$ to $2^{4}$). In 10-fold cross-validation, the training dataset (800 sequence entries, each of 12 residues length) was split into 10 subsets of 80 sequence entries (40 splice site and 40 non-splice site); one subset was used as the test dataset while the other subsets were used for training the classifier. The process is repeated 10 times with a different subset held out each time, ensuring that all subsets are used for both training and testing. A two-step grid optimization was used, and the result (Fig. 3) gives optimized values of C and γ of 97 and 1.59 respectively. This best combination of γ and C was used for training the RBF kernel based SVM classifier on the entire training dataset of 800 sequence entries.

The efficiency of the SVM classifier was further evaluated using the following quantities: a) TN, true negatives: the number of non-splice sites correctly classified as non-splice sites, b) FN, false negatives: the number of splice sites incorrectly classified as non-splice sites, c) TP, true positives: the number of splice sites correctly classified as splice sites, d) FP, false positives: the number of non-splice sites incorrectly classified as splice sites. Using these quantities, several statistical metrics were calculated to measure the effectiveness of the proposed RBF-SVM classifier. Sensitivity (Sn) and specificity (Sp), which indicate the ability of a prediction system to classify splice site and non-splice site information, were calculated by equations (2) and (3), and the corresponding receiver operating characteristic (ROC) curve has been plotted (Fig. 4).

$$Sn(\%) = \frac{TP}{TP + FN} \times 100 \qquad (2)$$

$$Sp(\%) = \frac{TN}{TN + FP} \times 100 \qquad (3)$$


To indicate the overall performance of the classifier system, the accuracy (Ac), i.e. the percentage of correctly classified sequences, and the Matthews Correlation Coefficient (MCC) were computed as follows:

$$Ac(\%) = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \qquad (4)$$

$$MCC = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TN + FP)(TN + FN)(TP + FP)(TP + FN)}} \qquad (5)$$

[Figure: "Receiver Operating Characteristic (ROC) curve". x-axis: false positive rate (1 - specificity), 0 to 1; y-axis: true positive rate (sensitivity), 0 to 1.]

Fig. 4. Receiver operating characteristic (ROC) plot for classifier with optimized values of C and γ
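Equations (2) to (5) can be evaluated directly from the four confusion-matrix counts. The sketch below uses made-up counts purely for illustration (they are not results from this study), and the function name `classification_metrics` is an assumption; it also assumes both classes are present so that no denominator in (2) to (4) is zero.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute Sn, Sp, Ac (in percent) and MCC from confusion-matrix counts."""
    sn = 100.0 * tp / (tp + fn)                      # equation (2)
    sp = 100.0 * tn / (tn + fp)                      # equation (3)
    ac = 100.0 * (tp + tn) / (tp + tn + fp + fn)     # equation (4)
    denom = math.sqrt((tn + fp) * (tn + fn) * (tp + fp) * (tp + fn))
    # MCC is conventionally set to 0 when a marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0   # equation (5)
    return {"Sn": sn, "Sp": sp, "Ac": ac, "MCC": mcc}


# Hypothetical counts: 100 splice sites and 100 non-splice sites.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
print(metrics)
```

For these hypothetical counts the function gives Sn = 90%, Sp = 80% and Ac = 85%, with an MCC of roughly 0.70.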

Sensitivity (Sn) is found to be 96.55%, with a false negative proportion of 3.45%, whereas specificity (Sp) is found to be 93.10%, with a false positive proportion of 6.90%. Similarly, Youden's Index (sensitivity + specificity - 1) is 0.8865 and the Matthews Correlation Coefficient (MCC) is found to be 0.9124. The overall accuracy (Ac) is calculated as 94.82%, which is significantly higher than that of existing methods. The area under the ROC curve is found to be 0.98095 with standard error 0.00567.

We chose the RBF kernel with optimized parameters γ and C. Using 10-fold cross-validation, γ and C were optimized at 1.59 and 97, with an overall classification accuracy on the training dataset of 98.25%, which is reasonably good. While the reported accuracy on the training dataset may indicate the effectiveness of a prediction method, it may not accurately portray how the method will perform on novel, hitherto undiscovered splice sites. Therefore, testing the SVM methodology on independent out-of-sample datasets, not used in the cross-validation, is critical. When we applied the SVM classifier to the entire test dataset, it obtained an accuracy of 94.82% using the RBF kernel with γ = 1.59 and C = 97. These findings suggest that SVM-based splice site detection may be helpful in identifying potential exon-intron boundaries and hence in gene annotation.

4 Conclusion and Future Work

In spliceosome mediated RNA splicing, introns are removed from the transcribed pre-mRNA and exons are joined to form the final mRNA, which is subsequently translated into protein. The spliceosome recognizes the splice sites (donor and acceptor) on pre-mRNA, and it does so for a very wide range of pre-mRNAs with very accurate recognition of intron-exon boundaries. This is only possible if either the intron-exon boundary carries a typical pattern or the spliceosome can somehow remember the boundary sequence information. However, the number of possible proteins in a species, and of associated splice sites, supports only the former theory: the spliceosome has no memory and instead recognizes a splice site by a typically conserved residual-chemical pattern. The importance of EIIP values as chemical features has already been established by other researchers; we further used sequences of 12 residues length, which probably carry the chemical environmental effect of the splice site surroundings. Seeking a chemical environmental effect is more logical than finding statistical features of the whole sequence, and hence this work has an edge over other works in the sense that it is probably closer to the mechanism adopted by the spliceosome in real life.

This work started with the simple idea of mathematically modeling the detection of intron-exon boundary information, the very phenomenon used by the spliceosome in nature. The good accuracy obtained across all five chromosomes of Arabidopsis thaliana is very encouraging for further extension to other species. In applying the above method to other species, we would like to establish an inter-species model for finding splice site patterns. We would also like to extend our work by adopting a more robust way of character to numeric sequence conversion and by enhancing the accuracy through fine tuning of the SVM parameters and window size.

References

[1] Uberbacher, E.C., Xu, Y., Mural, R.J.: Discovering and understanding genes in human DNA sequence using GRAIL. Methods Enzymol. 266, 259–281 (1996)
[2] Fickett, J.W., Tung, C.-S.: Assessment of protein coding measures. Nucleic Acids Res. 20, 6441–6450 (1992)
[3] Fogel, G.B., Chellapilla, K., Corne, D.W.: Identification of coding regions in DNA sequences using evolved neural networks. In: Fogel, G.B., Corne, D.W. (eds.) Evolutionary Computation in Bioinformatics, pp. 195–218. Morgan Kaufmann, San Francisco (2002)
[4] Hebsgaard, S.M., Korning, P.G., Tolstrup, N., Engelbrecht, J., Rouze, P., Brunak, S.: Splice site prediction in Arabidopsis thaliana pre mRNA by combining local and global sequence information. Nucleic Acids Res. 24(17), 3439–3452 (1996)
[5] Reese, M.G.: Application of a time-delay neural network to promoter annotation in the Drosophila melanogaster genome. Comput. Chem. 26(1), 51–56 (2001)
[6] Ranawana, R., Palade, V.: A neural network based multi-classifier system for gene identification in DNA sequences. Neural Comput. Appl. 14(2), 122–131 (2005)
[7] Sherriff, A., Ott, J.: Applications of neural networks for gene finding. Adv. Genet. 42, 287–297 (2001)
[8] Bandyopadhyay, S., Maulik, U., Roy, D.: Gene identification: classical and computational intelligence approaches. IEEE Transactions on Systems, Man, and Cybernetics 38(1) (January 2008)
[9] Rätsch, G., Sonnenburg, S., Srinivasan, J., Witte, H., Müller, K.R., Sommer, R., Schölkopf, B.: Improving the C. elegans genome annotation using machine learning. PLoS Computational Biology 3(2), e20 (2007)
[10] Jaakkola, T., Haussler, D.: Exploiting Generative Models in Discriminative Classifiers. In: Kearns, M., Solla, S., Cohn, D. (eds.) Advances in Neural Information Processing Systems, vol. 11, pp. 487–493. MIT Press, Cambridge (1999)
[11] Zien, A., Rätsch, G., Mika, S., Schölkopf, B., Lengauer, T., Müller, K.R.: Engineering Support Vector Machine Kernels That Recognize Translation Initiation Sites. Bioinformatics 16(9), 799–807 (2000)
[12] Brown, M.P.S., Grundy, W.N., Lin, D., Cristianini, N., Sugnet, C., Furey, T.S., Ares, J.M., Haussler, D.: Knowledge-based analysis of microarray gene expression data using support vector machines. PNAS 97, 262–267 (2000)
[13] Tsuda, K., Kawanabe, M., Rätsch, G., Sonnenburg, S., Müller, K.: A New Discriminative Kernel from Probabilistic Models. Advances in Neural Information Processing Systems 14, 977 (2002)
[14] Sonnenburg, S., Rätsch, G., Jagota, A., Müller, K.R.: New Methods for Splice-Site Recognition. In: Dorronsoro, J.R. (ed.) ICANN 2002. LNCS, vol. 2415, p. 329. Springer, Heidelberg (2002)
[15] Sonnenburg, S.: New Methods for Splice Site Recognition. Master’s thesis, Humboldt University (Supervised by Müller, K.-R., Burkhard, H.-D., Rätsch, G.) (2002)
[16] Lorena, A., de Carvalho, A.: Human Splice Site Identifications with Multiclass Support Vector Machines and Bagging. In: Kaynak, O., Alpaydın, E., Oja, E., Xu, L. (eds.) ICANN 2003. LNCS, vol. 2714. Springer, Heidelberg (2003)
[17] Yamamura, M., Gotoh, O.: Detection of the Splicing Sites with Kernel Method Approaches Dealing with Nucleotide Doublets. Genome Informatics 14, 426–427 (2003)
[18] Rätsch, G., Sonnenburg, S.: Accurate Splice Site Detection for Caenorhabditis elegans. In: Schölkopf, B., Tsuda, K., Vert, J.P. (eds.) Kernel Methods in Computational Biology. MIT Press, Cambridge (2004)
[19] Degroeve, S., Saeys, Y., Baets, B.D., Rouzé, P., de Peer, Y.V.: SpliceMachine: predicting splice sites from high-dimensional local context representations. Bioinformatics 21(8), 1332–1338 (2005)
[20] Huang, J., Li, T., Chen, K., Wu, J.: An approach of encoding for prediction of splice sites using SVM. Biochimie 88, 923–929 (2006)
[21] Zhang, Y., Chu, C.H., Chen, Y., Zha, H., Ji, X.: Splice site prediction using support vector machines with a Bayes kernel. Expert Systems with Applications 30, 73–81 (2006)
[22] Baten, A., Chang, B., Halgamuge, S., Li, J.: Splice site identification using probabilistic parameters and SVM classification. BMC Bioinformatics 7(suppl. 5), S15 (2006)
[23] Anastassiou, D.: Genomic signal processing. IEEE Signal Process. Mag. 18(4), 8–20 (2001)
[24] Zhang, X., Chen, F., Zhang, Y., Agner, S.C., Akay, M., Lu, Z., Waye, M.M.Y., Tsui, S.K.: Signal processing techniques in genomic engineering. Proc. IEEE 90(12), 1822–1833 (2002)
[25] Voss, R.F.: Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys. Rev. Lett. 68(25), 3805–3808 (1992)
[26] Silverman, B.D., Linsker, R.: A measure of DNA periodicity. J. Theor. Biol. 118, 295–300 (1986)
[27] Ning, J., Moore, C.N., Nelson, J.C.: Preliminary wavelet analysis of genomic sequences. In: Proc. IEEE Bioinformatics Conf., pp. 509–510 (2003)
[28] Deergha Rao, K., Swamy, M.N.S.: Analysis of genomics and proteomics using DSP techniques. IEEE Transactions on Circuits and Systems 55(1) (February 2008)
[29] Akhtar, M., Epps, J., Ambikairajah, E.: Signal processing in sequence analysis: advances in eukaryotic gene prediction. IEEE Journal of Selected Topics in Signal Processing 2(3) (June 2008)
[30] Li, W.: The study of correlation structure of DNA sequences: A critical review. Comput. Chem. 21(4), 257–271 (1997)
[31] Anastassiou, D.: Genomic signal processing. IEEE Signal Process. Mag. 18(4), 8–20 (2001)
[32] Tiwari, S., Ramaswamy, S., Bhattacharya, A., Bhattacharya, S., Ramaswamy, R.: Prediction of probable genes by Fourier analysis of genomic sequences. Comput. Appl. Biosci. 13, 263–270 (1997)
[33] Kotlar, D., Lavner, Y.: Gene prediction by spectral rotation measure: A new method for identifying protein-coding regions. Genome Res. 18, 1930–1937 (2003)
[34] Rao, N., Shepherd, S.J.: Detection of 3-periodicity for small genomic sequences based on AR techniques. In: Proc. IEEE Int. Conf. Comm., Circuits Syst., vol. 2, pp. 1032–1036 (2004)
[35] Vaidyanathan, P.P., Yoon, B.-J.: Gene and exon prediction using allpass-based filters. Presented at the IEEE Workshop Genomic Signal Processing and Statistics, Raleigh, NC (2002)
[36] Saxonov, S., Daizadeh, I., Fedorov, A., Gilbert, W.: An exhaustive database of protein-coding intron-containing genes. Nucleic Acids Res. 28(1), 185–190 (2000)
[37] Burge, C.B., et al.: Splicing precursors to mRNAs by the spliceosomes. In: Gesteland, R.F., Cech, T.R., Atkins, J.F. (eds.) The RNA World, pp. 525–560. Cold Spring Harbor Lab. Press (1999)
[38] Cortes, C., Vapnik, V.: Support-Vector Networks. Machine Learning 20(3) (1995)
[39] Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
