
Integrating binding site predictions using meta classification methods

Y. Sun¹, M. Robinson¹, R. Adams¹, A. G. Rust², P. Kaye¹, N. Davey¹
¹ Science and Technology Research School, University of Hertfordshire, United Kingdom
E-mail: comrys, m.robinson, r.g.adams, p.kaye, n.davey@herts.ac.uk
² Institute for Systems Biology, 1441 North 34th Street, Seattle, WA 98103, USA
E-mail: [email protected]

Abstract

Currently the best algorithms for transcription factor binding site prediction are severely limited in accuracy. There is good reason to believe that predictions from these different classes of algorithms could be used in conjunction to improve the quality of predictions. In this paper, we apply single layer networks and support vector machines to the predictions from 12 key algorithms. Furthermore, we use a 'window' of consecutive results in the input vectors in order to contextualise the neighbouring results. Moreover, we improve the classification results with the aid of under- and over-sampling techniques. We find that by integrating the 12 base algorithms, support vector machines and single layer networks can give better binding site predictions.

1 Introduction

In this paper, we address the problem of identifying transcription factor binding sites on sequences of DNA. There are many different algorithms in current use to search for binding sites [1, 2, 3, 4]. However, most of them produce a high rate of false positive predictions. The task addressed here is to reduce these false positive predictions by means of classification techniques from the machine learning field. To do this we first integrate the results from 12 different algorithms for identifying binding sites, using nonlinear classification techniques. To further improve classification results, we employ windowed inputs, where a fixed number of consecutive results are used as an input vector, so as to contextualise the neighbouring results. The dataset contains two classes, labelled as either binding sites or non-binding sites, with the large majority being non-binding sites. We make use of sampling techniques, working with a traditional neural network, the single layer network (SLN), and a contemporary classification algorithm, the support vector machine (SVM). We expound the problem domain in the next section. In Section 3, we introduce the datasets used in this paper.

We explain how we apply under- and over-sampling techniques in Section 4. Section 5 briefly introduces our experiments and gives all the experimental results. The paper ends in Section 6 with a conclusion.

2 Problem Domain

One of the most exciting and active areas of research in biology is currently understanding how the exquisitely fine resolution of gene expression is achieved at the molecular level. It is clear that this is a highly nontrivial problem. While the mapping between the coding region of a gene and its protein product is straightforward and relatively well understood, the mapping between a gene's expression profile and the information contained in its non-coding region is neither so simple, nor well understood at present. It is estimated that as much as 50% of the human genome is cis-regulatory DNA [5], undeciphered for the most part and tantalisingly full of regulatory instructions. Cis-regulatory elements form the nodes connecting the genes in regulatory networks, controlling many important biological phenomena, and as such are an essential focus of research in this field [6]. It is known that many of the mechanisms of gene regulation act directly at the transcriptional or sequence level, for example in those genes known to play integral roles during embryogenesis [6]. One set of regulatory interactions is that between a class of DNA-binding proteins known as transcription factors and the short sequences of DNA which they bind by virtue of their three dimensional conformation. Transcription factors will bind to a number of different but related sequences. The current state-of-the-art algorithms for transcription factor binding site prediction are, in spite of recent advances, still severely limited in accuracy. We show that in a large sample of annotated yeast promoter sequences, a selection of 12 key algorithms were unable to reduce the false positive predictions below 80%, with between 20% and 65% of annotated binding sites recovered.

These algorithms represent a wide variety of approaches to the problem of transcription factor binding site prediction, such as the use of regular expression searches, PWM scanning, statistical analysis, coregulation and evolutionary comparisons. There is, however, good reason to believe that the predictions from these different classes of algorithms are complementary and could be integrated to improve the quality of predictions. In the work described here we take the results from the 12 aforementioned algorithms and combine them in two different ways, as shown in the next section. We then investigate whether the integrated classification results of the algorithms can produce better binding site predictions than any one algorithm alone.

Fig. 1. The dataset has one column for each possible binding position, each with a possible binding prediction (Y or N). The 12 algorithms give their own prediction for each sequence position, and one such column is shown. [The figure depicts a DNA sequence, the known binding sites (Y/N) and the corresponding 12 key algorithm results.]

3 Description of the Dataset

The dataset is from a large sample of annotated yeast promoter sequences. It contains a large number of possible binding positions, with a prediction result from each of the 12 algorithms at every position; see Figure 1. The 12 algorithms can be categorised into higher-order groups as single sequence algorithms (7) [1, 7, 8, 9]; coregulatory algorithms (3) [2, 10]; a comparative algorithm (1) [3]; and an evolutionary algorithm (1) [4]. The label information contains the best information we have been able to gather on the location of known binding sites in the sequences. Throughout we have used the following notation: 0 denotes the prediction that there is no binding site at a location and 1 the prediction that there is a binding site at that location, while a separate marker indicates that an algorithm is not able to analyse the sequence. Two datasets are generated based on the original set.

• Single Inputs: We take the results from the 12 algorithms and combine them into one 12-ary feature vector as a single input. All repeated and inconsistent data points were removed, leaving a reduced set of data points. We then randomly chose part of the data points for training, while the remainder was reserved for testing.

• Windowed Inputs: Here we use a 'window' of consecutive results in the input vector in order to contextualise the neighbouring results; see Figure 2. In the current work the window size is fixed. We use the first two thirds of the data points for training. By using windowed inputs, far fewer data points have to be removed as repeated or inconsistent, so a larger proportion of the data is retained. In this part, we perform cross-validation first to choose the better classification algorithm, and then train the selected classifier on all of the training data points. For testing, we run experiments on the final third of the data points, first without repeated and inconsistent data points, and second with the full continuous set of rows for analysis by biologists. (An illustrative sketch of how single and windowed inputs can be constructed is given below.)

Fig. 2. The window size is fixed in the current study. The middle label of the consecutive possible binding positions in a window is used as the label for the new windowed input, so the length of each windowed input is the window size multiplied by 12. [The figure shows per-position labels and how overlapping windows, each with its own label, are formed.]
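The following Python sketch is only an illustration of the two input constructions just described, not the authors' code; it assumes the per-position predictions are already encoded as a (positions × 12) binary array, and the function names, the default window size and the duplicate-handling rule are assumptions made here.

```python
import numpy as np

def make_single_inputs(preds, labels):
    """Single inputs: each position's 12 algorithm predictions form one feature vector.
    preds  : (n_positions, 12) array of 0/1 predictions
    labels : (n_positions,) array of 0/1 known binding-site labels
    """
    return preds.copy(), labels.copy()

def make_windowed_inputs(preds, labels, window=3):
    """Windowed inputs: concatenate the predictions of `window` consecutive
    positions (assumed odd) and label the window with its middle position's label."""
    assert window % 2 == 1, "an odd window size is assumed so a middle position exists"
    half = window // 2
    X, y = [], []
    for i in range(half, len(preds) - half):
        X.append(preds[i - half:i + half + 1].reshape(-1))  # length = window * 12
        y.append(labels[i])                                  # middle label
    return np.array(X), np.array(y)

def drop_repeated_and_inconsistent(X, y):
    """One simple reading of 'repeated and inconsistent': keep each distinct
    feature vector once, and drop vectors that occur with conflicting labels."""
    seen = {}
    for xi, yi in zip(map(tuple, X.tolist()), y):
        seen.setdefault(xi, set()).add(int(yi))
    kept = [(list(k), labs.pop()) for k, labs in seen.items() if len(labs) == 1]
    Xc = np.array([k for k, _ in kept])
    yc = np.array([l for _, l in kept])
    return Xc, yc
```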

4 Sampling Techniques for Imbalanced Dataset Learning

In our dataset only a small fraction of the total vectors are binding positions, so this is an imbalanced dataset [11]. Since the dataset is imbalanced, supervised classification algorithms can be expected to over-predict the majority class, namely the non-binding site category. In this work, to cope with the imbalanced data, we concentrate on the data-based method, i.e., using under-sampling of the majority class (negative samples) and over-sampling of the minority class (positive samples). We combine over-sampling and under-sampling methods in our experiments. For the under-sampling, we randomly select a subset of data points from the majority class. The over-sampling case is more complex. In [11], the author raises an important issue: the class imbalance problem is only a problem when the minority class contains very small subclusters. This indicates that simply over-sampling with replacement may not significantly improve minority class recognition. To overcome this problem, we apply a synthetic minority over-sampling technique, as proposed in [12]. For each pattern in the minority class, we search for its k nearest neighbours by computing Hamming distances. A new pattern belonging to the minority class is then generated by applying the majority voting principle to this pattern and its k nearest neighbours in the feature vector space. For simplicity, in the following experiments, we fix k to a single value. (A sketch of this combined sampling scheme is given below.)
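As an illustration only (not the authors' implementation), the sketch below combines random under-sampling of the negatives with a synthetic over-sampling step that, for each positive pattern, takes a feature-wise majority vote over the pattern and its k nearest positive neighbours under Hamming distance. The value of k, the function names and the way the synthetic points are added (one per original positive, which doubles the positive class) are assumptions.

```python
import numpy as np

def hamming_knn_oversample(X_pos, k=3):
    """For each minority (positive) pattern, find its k nearest minority neighbours
    by Hamming distance and create one synthetic pattern by a feature-wise majority
    vote over the pattern and those neighbours (ties are broken towards 0)."""
    synthetic = []
    for i, x in enumerate(X_pos):
        dist = np.count_nonzero(X_pos != x, axis=1)   # Hamming distance to every positive
        dist[i] = X_pos.shape[1] + 1                  # exclude the pattern itself
        neighbours = np.argsort(dist)[:k]
        group = np.vstack([x[None, :], X_pos[neighbours]])
        votes = group.sum(axis=0)
        synthetic.append((2 * votes > len(group)).astype(int))
    return np.array(synthetic)

def undersample_negatives(X_neg, n_keep, seed=0):
    """Randomly keep only n_keep negative examples."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_neg), size=min(n_keep, len(X_neg)), replace=False)
    return X_neg[idx]

def build_resampled_training_set(X, y, n_neg_keep, k=3):
    """Combine the two steps: synthetic over-sampling of positives (doubling them)
    and random under-sampling of negatives down to n_neg_keep examples."""
    X_pos, X_neg = X[y == 1], X[y == 0]
    X_pos_all = np.vstack([X_pos, hamming_knn_oversample(X_pos, k=k)])
    X_neg_kept = undersample_negatives(X_neg, n_neg_keep)
    X_new = np.vstack([X_pos_all, X_neg_kept])
    y_new = np.concatenate([np.ones(len(X_pos_all)), np.zeros(len(X_neg_kept))])
    return X_new, y_new
```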







5 Experiments

The classification techniques we use are the single layer network (SLN) [13] and the support vector machine (SVM) [14]. The SVM experiments were completed using LIBSVM, which is available from the URL http://www.csie.ntu.edu.tw/~cjlin/libsvm. The others were implemented using the NETLAB toolbox, which is available from the URL http://www.ncrg.aston.ac.uk/netlab/. All the user-chosen parameters are obtained by cross-validation. In the training dataset, the positive examples, namely binding sites, are double over-sampled. When combining this with under-sampling, the number of negative examples to retain is selected by cross-validation on the training dataset.

5.1 Common performance metrics

It is apparent that for a problem domain with an imbalanced dataset, the classification accuracy rate alone is not a sufficient performance measure. Instead, recall, precision and F-score [15, 16] are calculated to understand the performance of the classification algorithm on the minority class. Based on the confusion matrix computed from the test results (see Table 1, where TN is the number of true negative samples, FP the number of false positive samples, FN the number of false negative samples and TP the number of true positive samples), several common performance metrics can be defined as follows:

Recall = TP / (TP + FN)                                   (1)
Precision = TP / (TP + FP)                                (2)
F-score = 2 × Recall × Precision / (Recall + Precision)   (3)
Accuracy = (TP + TN) / (TP + TN + FP + FN)                (4)
false positive rate = FP / (FP + TN)                      (5)

Table 1. A confusion matrix.

                    predicted negative   predicted positive
actual negative            TN                   FP
actual positive            FN                   TP
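A minimal Python sketch of equations (1)-(5), computed directly from the four confusion-matrix counts of Table 1; the function name and the example counts are purely illustrative and do not come from the paper (zero denominators are not handled).

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return the Section 5.1 metrics as percentages, rounded to one decimal."""
    recall    = tp / (tp + fn)                                 # eq. (1)
    precision = tp / (tp + fp)                                 # eq. (2)
    f_score   = 2 * recall * precision / (recall + precision)  # eq. (3)
    accuracy  = (tp + tn) / (tp + tn + fp + fn)                # eq. (4)
    fp_rate   = fp / (fp + tn)                                 # eq. (5)
    metrics = {"R.": recall, "P.": precision, "F.": f_score,
               "Accu.": accuracy, "fp": fp_rate}
    return {name: round(100 * value, 1) for name, value in metrics.items()}

# Illustrative call with made-up counts:
# confusion_metrics(tp=230, fp=720, fn=80, tn=2970)
```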

5.2 Classification results

5.2.1 Classification results with Single Inputs: The results are shown in Table 2. Compared with the best base algorithm, the SLN and SVM increase the F-score by 5.3 and 11.7 percentage points respectively, and decrease fp by 6.2 and 14.7 percentage points respectively. It can be seen that the SVM outperforms the SLN on this dataset, and that the integrated predictor is much better than the single best algorithm.

Table 2. Common performance metrics (%) calculated from confusion matrices on the consistent single-input test set.

Classifier    R.     P.     F.     Accu.   fp
best Alg.     74.2   20.4   32.0   70.4    30.1
SLN           75.8   24.7   37.3   76.1    23.9
SVM           69.4   31.9   43.7   83.2    15.4
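The percentage-point improvements quoted above follow directly from Table 2; a small illustrative check of the arithmetic (values copied from the table):

```python
# F-score and false positive rate (%) from Table 2.
best = {"F.": 32.0, "fp": 30.1}
sln  = {"F.": 37.3, "fp": 23.9}
svm  = {"F.": 43.7, "fp": 15.4}

for name, clf in (("SLN", sln), ("SVM", svm)):
    print(name,
          "F-score +%.1f" % (clf["F."] - best["F."]),  # +5.3 and +11.7
          "fp -%.1f" % (best["fp"] - clf["fp"]))       # -6.2 and -14.7
```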


5.2.2 Classification results with Windowed Inputs: All experimental results are presented in Table 3. The top half of the table gives the results for the consistent windowed data, as described in Section 3. It can be seen that for both the SLN and the SVM, the F-score is increased and fp is decreased when compared with the single best base algorithm on the consistent test set, and again the SVM outperforms the SLN on this dataset. For the full test set, which includes the inconsistent data points, the F-score of the SVM is again increased, whilst the SLN keeps the same value as the best base algorithm. On the other hand, the SVM reduces fp by 4.4 percentage points, while the SLN reduces it by 1.7 percentage points.

Table 3. Common performance metrics (%) calculated from confusion matrices on the consistent and full window test sets.

test set     Classifier    R.     P.     F.     Accu.   fp
consistent   best Alg.     39.2   14.4   21.0   86.0    11.6
consistent   SLN           38.3   16.6   23.2   88.0     9.6
consistent   SVM           36.8   21.6   27.3   90.7     6.6
full         best Alg.     37.2   17.2   23.5   86.1    11.0
full         SLN           33.5   18.1   23.5   87.4     9.3
full         SVM           32.5   23.3   27.1   89.9     6.6

6 Conclusions

The first significant result presented here is that by integrating the 12 algorithms we can considerably improve binding site prediction. In particular, the SVM increases the F-score by 11.7 percentage points over the best base algorithm and decreases the false positive rate by 14.7 percentage points. The second significant result is that windowing the input vectors is beneficial, as it brings in additional contextual information from the genome. Moreover, the windowing of the data eliminates many of the inconsistencies and repetitions among the data points and thereby gives a much larger training and test set. As expected, the SVM gave better classification results than the SLN. Future work will investigate i) using real-valued algorithm results in the input vector; ii) using algorithm-based techniques to cope with the imbalanced dataset; and iii) considering a range of supervised meta-classifiers or ensemble learning algorithms.

References

[1] http://www.hgmp.mrc.ac.uk/Software/EMBOSS/.
[2] Bailey, T.L. & Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology: 28-36, AAAI Press.
[3] http://family.caltech.edu/SeqComp/index.html.
[4] Blanchette, M. & Tompa, M. (2003) FootPrinter: a program designed for phylogenetic footprinting. Nucleic Acids Research: Vol. 31, No. 13, 3840-3842.
[5] Markstein, M., Stathopoulos, A., Markstein, V., Markstein, P., Harafuji, N., Keys, D., Lee, B., Richardson, P., Rokhsar, D., Levine, M. (2002) Decoding Noncoding Regulatory DNAs in Metazoan Genomes. Proceedings of the 1st IEEE Computer Society Bioinformatics Conference (CSB 2002), Stanford, CA, USA.
[6] Arnone, M. I. and Davidson, E. H. (1997) The hardwiring of development: Organization and function of genomic regulatory systems. Development: 124, 1851-1864.
[7] Apostolico, A., Bock, M.E., Lonardi, S., & Xu, X. (2000) Efficient Detection of Unusual Words. Journal of Computational Biology: Vol. 7, No. 1/2.
[8] Rajewsky, N., Vergassola, M., Gaul, U. & Siggia, E.D. (2002) Computational detection of genomic cis regulatory modules, applied to body patterning in the early Drosophila embryo. BMC Bioinformatics: 3:30.
[9] Thijs, G., Marchal, K., Lescot, M., Rombauts, S., De Moor, B., Rouzé, P., & Moreau, Y. (2001) A Gibbs sampling method to detect over-represented motifs in upstream regions of coexpressed genes. Proceedings of RECOMB 2001: pp. 305-312.
[10] Hughes, J.D., Estep, P.W., Tavazoie, S., & Church, G.M. (2000) Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. Journal of Molecular Biology: Mar 10; 296(5): 1205-1214.
[11] Japkowicz, N. (2003) Class imbalances: Are we focusing on the right issue? Workshop on Learning from Imbalanced Datasets II, ICML, Washington DC.
[12] Chawla, N. V., Bowyer, K. W., Hall, L. O. and Kegelmeyer, W. P. (2002) SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research: Vol. 16, pp. 321-357.
[13] Bishop, C.M. (1995) Neural Networks for Pattern Recognition. Oxford University Press, New York.
[14] Schölkopf, B. and Smola, A. J. (2002) Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press.
[15] Buckland, M. and Gey, F. (1994) The relationship between Recall and Precision. Journal of the American Society for Information Science: Vol. 45, No. 1, pp. 12-19.
[16] Joshi, M., Kumar, V., and Agarwal, R. (2001) Evaluating boosting algorithms to classify rare classes: Comparison and improvements. First IEEE International Conference on Data Mining, San Jose, CA.
