Effect of Using Varying Negative Examples in Transcription Factor Binding Site Predictions

Faisal Rezwan (1), Yi Sun (1), Neil Davey (1), Rod Adams (1), Alistair G. Rust (2), and Mark Robinson (3)

(1) School of Computer Science, University of Hertfordshire, College Lane, Hatfield, Hertfordshire AL10 9AB, UK
(2) Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
(3) Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA

{F.Rezwan,Y.2.Sun,N.Davey,R.G.Adams}@herts.ac.uk, [email protected], [email protected]

Abstract. Identifying transcription factor binding sites computationally is a hard problem because it produces many false predictions. Combining the predictions from existing predictors using classification methods such as Support Vector Machines (SVM) can improve the overall predictions. However, the conventional negative examples (that is, examples of non-binding sites) in this type of problem are highly unreliable. In this study, we have used different types of negative examples. One class of negative examples has been taken from far away from the promoter regions, where the occurrence of binding sites is very low, and another has been produced by randomization. We thus observed the effect of using different negative examples when predicting transcription factor binding sites in mouse. We have also devised a novel cross-validation technique for this type of biological problem.

1 Introduction

Gene expression levels can be regulated in vivo by DNA-binding proteins called transcription factors: genes are turned on and off according to whether specific sites on the genome have these regulatory proteins attached to them. Transcription factors are themselves encoded by genes, and the resulting regulatory interconnections form complex systems known as Genetic Regulatory Networks (GRNs). The stretches of DNA to which transcription factors bind in a sequence-specific manner are called transcription factor binding sites (TFBSs). The location of TFBSs within a genome yields valuable information about the basic connectivity of a GRN, and as such is an important precursor to understanding the many biological systems that are underlain by GRNs.

There are many experimental and computational approaches for identifying regulatory sites. Many experimental techniques are costly and often time-consuming, and therefore not amenable to a genome-wide approach [1]. Experimental approaches that are genome-wide, such as ChIP-chip and ChIP-seq, are themselves dependent on the availability of specific antibodies and still require additional verification. Computational approaches for the prediction of TFBSs are therefore essential to complement and guide experimental exploration and verification.

Computational approaches are typically prone to predicting many false positives, limiting their utility significantly. Fig. 1 shows different computational predictions on yeast data (described in [2]); compared with the real (annotated) binding sites, they yield many false positives and, in some cases, many false negatives. Some of the prediction algorithms even failed to produce any predictions. Improving the quality of these computational methods for predicting the locations of TFBSs is an important research goal.

Fig. 1. At each position in the genome sequence we have an annotation and 12 algorithmic predictions of that annotation

In earlier works [2] [3] [4] [5], the results of a group of different predictors have been combined to produce a prediction that is better than any of the individual predictions. One of the major problems with training a classifier on these combined predictions is that the data constituting the training set can be unreliable. Whereas the labelling of a known binding site is normally correct (as it has been experimentally verified), the labelling of the negative class may be much more unreliable. This is the major issue we address in this paper. We also show how the performance of the classifier on unseen data improves when the negative vectors used during training are changed.

2 Background

Binding sites differ from one species to another. Some organisms have a much more complex organization of their gene regulatory regions, which makes their binding sites difficult to predict. Moreover, unlike in simple organisms, the location of binding sites may not be proximal to the promoter site and can be thousands of base pairs (bps) away, both upstream and downstream, as well as inside intronic regions. We have therefore chosen a complex multi-cellular organism, mouse (M. musculus), as a model organism to validate our method. The mouse genome has all of the above properties and, in addition, has more non-coding DNA sequence than yeast.


We have treated the problem as a classical binary classification problem in which the examples are divided into two classes: the positive class contains the data annotated as binding sites, and the negative class the rest. We have run experiments using the same positive examples but different sets of negative examples. The first set of negative examples consists of those parts of the promoter regions that are not annotated as binding sites. However, in biological data it is hard to establish that an example really belongs to the negative class. There is therefore a possibility of noise in the negative examples used in our first experiment, as base pairs marked as non-binding may be part of an unknown binding site, and using them as negative examples may make the classifier perform incorrectly. To reduce this kind of noise we constructed two further sets of negative examples, namely distal negative examples and randomized negative examples (described in detail in Section 3.2). A detailed description of the genomic data we use is given in [2] and [5]; the next section provides only a summary.

3 Description of Data

3.1 Genomic Data

For each gene there is a corresponding promoter region where the binding sites will probably be located. The mouse data set has 47 annotated promoter sequences with an average length of 1294 bps (see Table 1). Most of the promoters are upstream of their associated genes, and a few of them extend over the first exon, including intronic regions. Seven different prediction algorithms have been used (discussed in [3]); combined, they form a single data set, a 60851 by 8 data matrix in which the first column is the annotation label and the remaining columns are the scores from the seven prediction algorithms. The label is "1" if the position is part of a TFBS, otherwise "0" (see Fig. 2).

Table 1. A summary of the mouse data

Total number of sequences              47
Total sequence length                  60851 bps
Average sequence length                1294.70 bps
Average number of TFBSs per sequence   2.87
Average TFBS width                     12.78 bps
Total number of TFBSs                  135
TFBS density in total data set         2.85%
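
For concreteness, the data set described above can be thought of as a 60851 by 8 matrix. The following minimal Python sketch shows one way such a matrix might be loaded and split into labels and predictor scores; the file name and the whitespace-delimited format are illustrative assumptions, not part of the original study.

import numpy as np

# Illustrative assumption: one row per base pair; first column = annotation
# label (1 = part of a TFBS, 0 = not), remaining seven columns = scores from
# the seven prediction algorithms.
data = np.loadtxt("mouse_promoters.txt")      # hypothetical file name

labels = data[:, 0].astype(int)               # vector of 0/1 annotation labels
scores = data[:, 1:]                          # matrix of algorithm scores

print(data.shape)                             # expected: (60851, 8)
print("TFBS density: %.2f%%" % (100.0 * labels.mean()))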

3.2 Problems with the Data and Solutions

From the statistics in Table 1, it is clear that the mouse data set is imbalanced: the feature of interest is significantly under-represented. This is likely to result in a classifier that has been over-trained on the majority class and acts only as a weak classifier for the minority class, giving false classifications.


Fig. 2. At each position in the genome sequence we have an annotation and several algorithmic predictions of that annotation. We train an SVM to produce the annotation given the algorithmic predictions.

Table 2. Statistics of mouse data

Original   Inconsistent    Unique
60,851     12,119 (20%)    32,747 (54%)

To overcome this problem, we chose a data-based method (described in [6]) in which the minority class (binding-site examples) is over-sampled and the majority class (non-binding-site examples) is under-sampled. The Synthetic Minority Over-sampling Technique (SMOTE) [6] has been used to over-sample the minority class in the training set.

Another problem with the data set is that, whereas one can be reasonably confident that the bps labelled as being part of a binding site are correctly labelled, no such confidence can be extended to the rest of the promoter region: there may be many as yet undiscovered sites therein. This implies that the negative labels could be incorrect on many vectors. In our data set there are also a number of vectors that are repeated. Repeats that belong to both classes are called inconsistent vectors, and they make up about 20% of the data. Repeats that occur in only one class are simply called repeats. Vectors that occur exactly once and do not belong to the repeated or inconsistent groups are called unique. The breakdown of the mouse data set is given in Table 2. To deal with inconsistent and repeated data, we have taken the simplest approach of removing all such vectors (keeping one copy of each consistent vector). As a result, we have lost over 46% of the mouse data. From a biological point of view this is unsatisfactory, as a prediction is required for every base pair in the genome. So we keep the inconsistent and repeated vectors in the test set to make it more biologically meaningful, whilst using the above-mentioned approach for training sets only.

It has already been mentioned that the negative data might be particularly unreliable. We therefore investigated whether modifying the negative training data so that it contained only vectors that were highly unlikely to be part of a binding site could improve performance. To do so, we constructed two synthetic training sets with new negative examples. It is important to note that the final test set has never been altered in any way and therefore consists of vectors from the original data. For the first type of negative examples, called distal negative examples, we selected regions of the mouse genome that are at least 4500 to 5000 bps away from their associated genes. For the randomized negative examples, we placed all the training vectors labelled with a zero in the annotation into a matrix, and each column was then independently randomly reordered. This effectively randomizes each vector whilst maintaining the overall statistical properties of each of our original prediction algorithms. It is unlikely that a real binding site would elicit such a random combination of predictions, so the probability of these randomized vectors being part of a TFBS is almost zero.
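
The column-wise randomization described above is straightforward to implement. Below is a minimal sketch, assuming the negative training vectors are collected in a NumPy array; the variable and function names are illustrative, not from the original study.

import numpy as np

def randomize_negatives(neg_vectors, seed=0):
    """Independently shuffle each column of the negative-example matrix.

    Each column keeps the marginal score distribution of its prediction
    algorithm, but the joint pattern across algorithms is destroyed, so the
    resulting vectors are very unlikely to resemble real binding sites.
    """
    rng = np.random.default_rng(seed)
    randomized = neg_vectors.copy()
    for col in range(randomized.shape[1]):
        randomized[:, col] = rng.permutation(randomized[:, col])
    return randomized

# Usage (illustrative): neg_scores holds the algorithm scores of all training
# vectors annotated as non-binding.
# randomized_negatives = randomize_negatives(neg_scores)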

4 Methods

As mentioned in Section 2, we have run three types of experiments:

Experiment 1: Using negative example sequences not annotated as TFBSs from the original data
Experiment 2: Replacing the negative examples with distal negative examples
Experiment 3: Replacing the negative examples with randomized negative examples

In addition, we have applied some pre-processing (data division, normalization and sampling) to the training set and some post-processing to the prediction set. The whole process is depicted in Fig. 2.

4.1 Pre-processing: Preparing Training and Test Set

First we normalized each column, and then searched for any repeated or inconsistent data vectors in the mouse data and eliminated them from the training set, as these are a source of misleading prediction results. We did the same after mixing the positive examples with the distal and randomized negative examples. After removing the repetition and inconsistency in the data set, we shuffled the data rows to mix the positive and negative examples randomly, giving a new data set. We took two-thirds of the new data set as training data. However, as mentioned before, the test set contains only data points from the original mouse data, which is biologically meaningful, not from the new data set. We used the sampling techniques mentioned in Section 3.2 on the training set to make it more balanced. In this sampling, the final ratio between the majority class and the minority class was set to 1.0 and 2.0, in order to explore which balance of positive and negative classes is better suited to training.
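
The pre-processing just described can be sketched in Python as follows, using pandas to handle repeated and inconsistent vectors and imbalanced-learn for the sampling. This is a minimal sketch under stated assumptions: the min-max normalization, the sampling ratios and all names are illustrative choices, not necessarily those used in the original study.

import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

def preprocess_training_set(scores, labels, seed=0):
    """Normalize, remove repeated/inconsistent vectors, shuffle, split and balance."""
    # Column-wise normalization of the algorithm scores (illustrative: min-max).
    mins, maxs = scores.min(axis=0), scores.max(axis=0)
    scores = (scores - mins) / np.where(maxs > mins, maxs - mins, 1.0)

    # Remove inconsistent vectors (identical score vectors appearing with both
    # labels) and keep a single copy of consistent repeats.
    df = pd.DataFrame(scores)
    df["label"] = labels
    feature_cols = list(range(scores.shape[1]))
    n_labels = df.groupby(feature_cols)["label"].transform("nunique")
    df = df[n_labels == 1].drop_duplicates()

    # Shuffle and take two-thirds as training data (the test set is taken from
    # the original, unaltered data).
    df = df.sample(frac=1.0, random_state=seed)
    train = df.iloc[: int(2 * len(df) / 3)]
    X, y = train.drop(columns="label").values, train["label"].values

    # Balance the training set: under-sample the majority class to a 2:1 ratio,
    # then over-sample the minority class with SMOTE to 1:1 (ratios illustrative).
    X, y = RandomUnderSampler(sampling_strategy=0.5, random_state=seed).fit_resample(X, y)
    X, y = SMOTE(random_state=seed).fit_resample(X, y)
    return X, y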

4.2 The Classifier and Its Performance Measures

After constructing the training set with this pre-processing, we trained a Support Vector Machine on the training data using LIBSVM (available at http://www.csie.ntu.edu.tw/~cjlin/libsvm). We used an SVM with a Gaussian kernel. As is well known, such a classifier has two hyperparameters: the cost parameter, C, and the width of the Gaussian kernel, gamma. These two parameters affect the shape and position of the decision boundary. It is important to find good values of the parameters, and this is normally done by a process of cross-validation.

Table 3. Confusion Matrix

                   Predicted Negatives     Predicted Positives
Actual Negatives   True Negatives (TN)     False Positives (FP)
Actual Positives   False Negatives (FN)    True Positives (TP)
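
As an illustration of this step, the following minimal sketch trains an RBF-kernel SVM with the two hyperparameters mentioned above. It uses scikit-learn's SVC, which wraps LIBSVM; the data and the particular C and gamma values are placeholders, not the values selected in the study.

import numpy as np
from sklearn.svm import SVC

# Illustrative stand-in for the balanced training set produced by the
# pre-processing step; in the study this would be the real mouse data.
rng = np.random.default_rng(0)
X_train = rng.random((200, 7))
y_train = rng.integers(0, 2, 200)

# An SVM with a Gaussian (RBF) kernel and its two hyperparameters: the cost
# parameter C and the kernel width gamma.
clf = SVC(kernel="rbf", C=1.0, gamma=0.1)
clf.fit(X_train, y_train)
predictions = clf.predict(X_train[:5])    # per-position 0/1 predictions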

\[
\begin{aligned}
\text{Recall} &= \frac{TP}{TP + FN} && (1) \\[4pt]
\text{Precision} &= \frac{TP}{TP + FP} && (2) \\[4pt]
\text{F-score} &= \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} && (3) \\[4pt]
\text{FP-rate} &= \frac{FP}{FP + TN} && (4) \\[4pt]
\text{Accuracy} &= \frac{TP + TN}{TP + FN + FP + TN} && (5)
\end{aligned}
\]

As we are dealing with a strongly imbalanced data set (only 2.85% of the mouse data is TFBSs), simply using Accuracy (the rate of correct classification) as the performance measure is inappropriate, because predicting everything as non-binding sites would then give a very good Accuracy. Other measures are more suitable for this classification problem. Taking account of both Recall and Precision by using the F-score should give a good measure of classification performance, since the F-score is the harmonic mean of Recall and Precision. In addition, reducing the FP-rate is another major concern when verifying a classifier's performance. The performance measures are defined in Equations (1) to (5).
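
For reference, a small sketch computing the measures in Equations (1) to (5) directly from confusion-matrix counts (the function name and example counts are illustrative):

def performance_measures(tp, fp, tn, fn):
    """Compute the measures of Equations (1) to (5) from confusion-matrix counts."""
    recall = tp / (tp + fn)                                      # Eq. (1)
    precision = tp / (tp + fp)                                   # Eq. (2)
    f_score = 2 * recall * precision / (recall + precision)      # Eq. (3)
    fp_rate = fp / (fp + tn)                                     # Eq. (4)
    accuracy = (tp + tn) / (tp + fn + fp + tn)                   # Eq. (5)
    return recall, precision, f_score, fp_rate, accuracy

# Example with arbitrary counts:
print(performance_measures(tp=30, fp=400, tn=5000, fn=70))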

4.3 Post-processing (Filtering)

The original biological algorithms predict contiguous sets of bps as binding sites. In this study, however, each bp is predicted independently of its neighbouring nucleotides. As a result, the classifier outputs many short predictions, sometimes with a length of only one or two bps. Therefore, we removed (replaced the positive prediction with a negative one) predictions with a length equal to or smaller than a threshold value, and measured the effect on the performance. In the present study, a range of threshold values (from 4 to 7) is used rather than a single one, in order to assess a feasible threshold size.

1. Remove all inconsistent and repeated data points
2. Split the data into two-thirds training from the new data set, one-third test from the original data set
3. Split the training data into 5 partitions
4. This gives 5 different training (four-fifths) and validation (one-fifth) sets. The validation set is drawn from the related original data set
5. Use sampling to produce more balanced training sets
6. For each pair of C/gamma values
   6.1. For each of the 5 training sets
      6.1.1. Train an SVM
      6.1.2. Measure performance on the corresponding validation set, exactly as the final test will be measured, i.e. use the Performance Measure after the predictions on the validation set have been filtered (by the post-processing described in Section 4.3)
   6.2. Average the Performance Measure over the 5 trials
7. Choose the C/gamma pair with the best average Performance Measure
8. Pre-process the complete training set and train an SVM with the best C/gamma combination
9. Test the trained model on the unseen test set

Fig. 3. Pseudo-code of the cross-validation method
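
As a concrete illustration of this length filter (applied in step 6.1.2 of Fig. 3), here is a minimal Python sketch. It assumes the per-bp predictions for a sequence are held, in genomic order, in a 0/1 array; the function name is illustrative.

import numpy as np

def filter_short_predictions(pred, threshold):
    """Set to 0 any run of consecutive positive (1) predictions whose length
    is less than or equal to `threshold`."""
    pred = np.asarray(pred).copy()
    n = len(pred)
    i = 0
    while i < n:
        if pred[i] == 1:
            j = i
            while j < n and pred[j] == 1:
                j += 1                      # find the end of the positive run
            if j - i <= threshold:          # run too short: remove the prediction
                pred[i:j] = 0
            i = j
        else:
            i += 1
    return pred

# Example: with a threshold of 4, the two short runs are removed and only the
# run of five consecutive positives survives.
print(filter_short_predictions([0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1], threshold=4))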

4.4 Cross-Validation

In our previous work on this problem, performance was measured on the validation data simply using classification Accuracy. However, this may not be the most effective method: the trained model has to perform well on the test set, where Accuracy is not the measure used. We therefore decided to investigate what happens if we measure performance on the validation set exactly as we do on the test set. That is, we compute the performance measures after the predictions have been filtered on length and with the repeated/inconsistent vectors placed back in the validation set. A step-by-step description of exactly what we have done is shown in Fig. 3. We call this Optimized F-score Cross-validation when the F-score is used as the cross-validation criterion, and Optimized Accuracy Cross-validation when Accuracy is used.
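
A condensed sketch of the procedure in Fig. 3 is given below, reusing the filter_short_predictions helper sketched in Section 4.3. It is only an outline under stated assumptions: the parameter grid is illustrative, the rows of X are assumed to be in genomic order (so that length filtering within a fold is meaningful), and the per-fold re-balancing and the re-insertion of repeated/inconsistent vectors into the validation folds are omitted for brevity.

import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.svm import SVC

def optimized_fscore_cv(X, y, threshold=4, n_splits=5):
    """Choose C/gamma by the F-score measured on each validation fold after
    the length filter has been applied (steps 6 and 7 of Fig. 3)."""
    param_grid = [(C, gamma) for C in (1, 10, 100) for gamma in (0.01, 0.1, 1.0)]
    best_pair, best_score = None, -1.0
    for C, gamma in param_grid:
        fold_scores = []
        # Contiguous folds keep each validation fold in genomic order.
        for train_idx, val_idx in KFold(n_splits=n_splits).split(X):
            clf = SVC(kernel="rbf", C=C, gamma=gamma)
            clf.fit(X[train_idx], y[train_idx])
            pred = filter_short_predictions(clf.predict(X[val_idx]), threshold)
            fold_scores.append(f1_score(y[val_idx], pred, zero_division=0))
        if np.mean(fold_scores) > best_score:
            best_pair, best_score = (C, gamma), float(np.mean(fold_scores))
    return best_pair, best_score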

5 Results and Discussion

Before presenting our experimental results, let us see how the base algorithms perform at identifying cis-binding sites on the same test set used in all the experiments described in Sections 5.1 to 5.3. We calculated the performance measures of the seven algorithms discussed in [3]. Among them, the best results came from two of the algorithms, EvoSelex and MotifLocator, and are shown in Table 4.

Table 4. The results of base algorithms on mouse data

Base Algorithm   Recall   Precision   F-score   FP-rate
MotifLocator     0.43     0.07        0.12      0.24
EvoSelex         0.35     0.08        0.13      0.17

From the results in Table 4, it is evident that the prediction algorithms wrongly classify many sites as binding sites which actually are not. As a result the Recall and the FP-rate are relatively high, but the Precision is very low. We now discuss the results of our own experiments.

5.1 Experiment 1: Using Negative Example Sequences Not Annotated as TFBSs

In this experiment we do not make any changes to the negative examples in the training set, but vary the way in which the cross-validation takes place. We have used four variations (see Table 5).

Table 5. The result of using the original negative examples in the mouse data with varying cross-validation methods

Cross-validation Method          Recall   Precision   F-score   FP-rate
Optimized Accuracy               0.20     0.14        0.16      0.05
Optimized F-score                0.24     0.17        0.20      0.04
Optimized Accuracy + filtering   0.17     0.15        0.16      0.04
Optimized F-score + filtering    0.24     0.17        0.20      0.05

First, as a baseline, we cross-validate in the normal way using Accuracy as the performance measure. We then change this criterion so that the F-score is used, in accord with how the model will be assessed on the final test set. Both these validation methods can be combined with the post-processing that filters out short predictions when assessing the performance of the model on a validation set. It is important to observe that the original algorithms did not perform with high fidelity, as seen in Table 4. From Table 5, the overall performance of the classifiers, as measured by their F-score, is fairly similar. Note that a high Accuracy can be achieved simply by over-predicting the majority class, the non-binding sites. Cross-validating with the F-score gives better results than with Accuracy (see Table 5), but the improvement is not that noticeable. Even post-processing during cross-validation did not yield the expected improvement.

5.2 Experiment 2: Replacing Negative Examples with Distal Negative Examples

In this experiment we replace the negative examples in the training set with distal negative examples, and vary the cross-validation methods in the same way as described in Experiment 1. As these negative examples come from thousands of bps away from the promoter regions, there is less chance of labelling genuine binding sites as negative, which helps the classifier to characterize the positive and negative examples properly. The results improved dramatically: the F-score jumps from 20% to 81% (Table 6). The classifier is able to identify almost all of the binding sites in the test data, and the number of false positives is also very low. However, there is still no difference between using Accuracy and the F-score as the cross-validation criterion when post-processing is used. With the F-score criterion the results improve almost to the same extent even without post-processing during cross-validation, so the F-score, rather than Accuracy, can be an ideal candidate for the cross-validation criterion.

Table 6. The result of using the distal negative examples in the mouse data with varying cross-validation methods

Cross-validation Method          Recall   Precision   F-score   FP-rate
Optimized Accuracy               0.37     0.97        0.53      0.0006
Optimized F-score                0.65     0.99        0.79      0.0002
Optimized Accuracy + filtering   0.68     0.99        0.81      0.0001
Optimized F-score + filtering    0.68     0.99        0.81      0.0001

5.3 Experiment 3: Replacing Negative Examples with Randomized Negative Examples

In this experiment we replaced the negative examples in the training set with randomized negative examples (described in Section 3.2), and varied the cross-validation methods as before (described in Experiment 1). As mentioned above, randomization destroys any biological properties the data vectors might have. It therefore gives us greater confidence in the negative examples than when using the original training set or the set of distal negative examples, which may still contain examples that make it difficult for the classifier to characterize the classes.

Table 7. The result of using the randomized negative examples in the mouse data with varying cross-validation methods

Cross-validation Method          Recall   Precision   F-score   FP-rate
Optimized Accuracy               0.76     0.69        0.73      0.02
Optimized F-score                0.69     0.97        0.81      0.001
Optimized Accuracy + filtering   0.76     1.0         0.86      0.00
Optimized F-score + filtering    0.76     1.0         0.86      0.00

From the results in Table 7, we can see that using randomized negative examples improves the results substantially, even without using post-processing during cross-validation: the F-score has improved to 86%. This shows that training with reliably negative examples allows the classifier to characterize the data quite well. As the classifier can predict the binding sites in the unseen test set properly, post-processing improves the predictions in both directions (decreasing false positives and false negatives).

Apart from assessing the predictions with performance measures, we have visualized the data (as in Fig. 1) to see whether our predictions are as good as the results suggest. We took a fraction of the mouse genome (the upstream regions of the genes MyoD1, Q8CFN5, Vim, and U36283) and compared our best results from the different experiments along with the prediction algorithms and the annotation. In Fig. 4, the upper seven results are from the original prediction algorithms, and the next one is the experimentally annotated binding sites from ABS [7] or ORegAnno [8]. The last three results are our best predictions from the three different types of experiments (described in Section 4). The figure shows that the prediction algorithms generate a lot of false predictions. Using the original mouse data (Experiment 1) also does not give good predictions, whereas using distal or randomized negative examples (Experiment 2 or 3) improves the predictions considerably. These predictions are almost identical to the annotations, and the experiment with randomized negative examples gives slightly better predictions than the one with distal negative examples.

Fig. 4. At each position in the genome sequence we have an annotation and several algorithmic predictions of that annotation. We train an SVM to produce the annotation given the algorithmic predictions.

6 Conclusion

The identification of binding sites for transcription factors in a sequence of DNA is a very difficult problem. In previous studies, it was shown that the basic algorithms individually could not produce accurate predictions and consequently produced many false positives. Although the combination of these algorithms using a two-class SVM (see Experiment 1) gave better results than each individual prediction algorithm, there were still a lot of false positives, due to the unreliability of the negative examples in our data set. Our current approach is similar to using a one-class classifier; however, our previous attempts to use such classifiers gave results similar to those of the original two-class SVM (see Experiment 1). Our present results show that a change in the provenance of the negative examples significantly improves the resulting predictions, and that the new cross-validation technique can bring further improvements. Consequently, our major result here is the effect of changing the source of the negative examples used in the training data. Together with the novel cross-validation method, our procedure can be a step in the right direction for dealing with this type of biological data. At present we are still at a preliminary stage of our work, and any claim that we have solved the binding site problem would be premature, but the results so far certainly look promising. Future work will involve using the most recent prediction algorithms available, and we would also like to extend our work to the genomes of other organisms.

References

1. Tompa, M., Li, N., Bailey, T.L., Church, G.M., De Moor, B., Eskin, E., Favorov, A.V., Frith, M.C., Fu, Y., Kent, W.J., Makeev, V.J., Mironov, A.A., Noble, W.S., Pavesi, G., Pesole, G., Régnier, M., Simonis, N., Sinha, S., Thijs, G., van Helden, J., Vandenbogaert, M., Weng, Z., Workman, C., Ye, C., Zhu, Z.: Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol. 23(1), 137–144 (2005)
2. Sun, Y., Robinson, M., Adams, R., te Boekhorst, R., Rust, A.G., Davey, N.: Integrating genomic binding site predictions using real-valued meta classifiers. Neural Computing and Applications 18, 577–590 (2008)


3. Sun, Y., Robinson, M., Adams, R., Rust, A.G., Davey, N.: Prediction of Binding Sites in the Mouse Genome using Support Vector Machine. In: Kurkova, V., Neruda, R., Koutnik, J. (eds.) ICANN 2008, Part II. LNCS, vol. 5164, pp. 91–100. Springer, Heidelberg (2008)
4. Sun, Y., Robinson, M., Adams, R., Rust, A.G., Davey, N.: Using Pre and Post-processing Methods to Improve Binding Site Predictions. Pattern Recognition 42(9), 1949–1958 (2009)
5. Robinson, M., Castellano, C.G., Rezwan, F., Adams, R., Davey, N., Rust, A.G., Sun, Y.: Combining experts in order to identify binding sites in yeast and mouse genomic data. Neural Networks 21(6), 856–861 (2008)
6. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16, 321–357 (2002)
7. Blanco, E., Farré, D., Albà, M.M., Messeguer, X., Guigó, R.: ABS: a database of Annotated regulatory Binding Sites from orthologous promoters. Nucleic Acids Res. 34(Database issue), D63–D67 (2006)
8. Montgomery, S.B., Griffith, O.L., Sleumer, M.C., Bergman, C.M., Bilenky, M., Pleasance, E.D., Prychyna, Y., Zhang, X., Jones, S.J.M.: ORegAnno: an open access database and curation system for literature-derived promoters, transcription factor binding sites and regulatory variation. Bioinformatics (March 2006)