Using a Hybrid Adaboost Algorithm to Integrate Binding Site Predictions

Yi Sun, Mark Robinson, Rod Adams, Paul Kaye, Alistair G. Rust, Neil Davey
University of Hertfordshire, College Lane, Hatfield, Hertfordshire AL10 9AB
comrys, m.robinson, r.g.adams, p.h.kaye, [email protected]
Institute of Systems Biology, 1441 North 34th Street, Seattle, WA 98103, USA
[email protected]

Abstract— Currently the best algorithms for transcription factor binding site prediction are severely limited in accuracy. There is good reason to believe that predictions from these different classes of algorithms could be used in conjunction to improve the quality of predictions. In previous work, we applied single layer networks, rule sets and support vector machines to predictions from 12 key real-valued algorithms. Furthermore, we used a 'window' of consecutive results in the input vector in order to contextualise the neighbouring results. In this paper, we improve the classification results with the aid of a hybrid Adaboost algorithm working on the dataset with windowed inputs. In the proposed algorithm, we first apply weighted majority voting. Those data points which cannot be classified 'easily' using weighted majority voting are then classified using the Adaboost algorithm. We find that our method outperforms each of the original individual algorithms and the other classifiers used previously on this task.

I. INTRODUCTION

In this paper, we further address the problem of identifying transcription factor binding sites on sequences of DNA [18]. There are many different algorithms in current use to search for binding sites [21], [3], [22], [4]. However, most of them produce a high rate of false positive predictions. The problem addressed here is to reduce these false positive predictions by means of classification techniques taken from the field of machine learning. In [18] we first integrated the results from 12 different base algorithms for identifying binding sites, using non-linear classification techniques. To further improve classification results, we employed windowed inputs, where a fixed number of consecutive results were used as an input vector, so as to contextualise the neighbouring results. The data has two classes, labelled as either binding sites or non-binding sites, with the large majority being non-binding sites. We made use of sampling techniques, working with traditional classifiers, namely single layer networks (SLN) and rule sets (C4.5-Rules), and a contemporary classification algorithm, support vector machines (SVM). In this work, we focus on the dataset with windowed inputs. We first apply the Adaboost algorithm to this data. We then propose a novel hybrid Adaboost algorithm to further improve the classification results.

We expound the problem domain in the next section. In Section III, we describe the datasets used in this paper. We briefly introduce the Adaboost algorithm and propose a hybrid Adaboost algorithm in Section IV. A set of common metrics for assessing classifier performance is covered in Section V. Section VI briefly introduces our experiments and Section VII gives all the experimental results. The paper ends in Section VIII with conclusions.

II. PROBLEM DOMAIN

One of the most exciting and active areas of research in biology currently is understanding how the exquisitely fine resolution of gene expression is achieved at the molecular level. It is clear that this is a highly non-trivial problem. While the mapping between the coding region of a gene and its protein product is straightforward and relatively well understood [7], the mapping between a gene's expression profile and the information contained in its non-coding region is neither so simple, nor well understood at present. It is estimated that as much as 50% of the human genome is cis-regulatory DNA [14], undeciphered for the most part and tantalisingly full of regulatory instructions. A cis-regulatory component consists of DNA that encodes a site for protein-DNA interaction in a gene regulatory system; conversely, a trans-regulatory component consists of a protein that binds to a cis-regulatory DNA sequence. Cis-regulatory elements form the nodes connecting the genes in the regulatory networks, controlling many important biological phenomena, and as such are an essential focus of research in this field [2]. Lines of research likely to benefit directly from more effective means of elucidating the cis-regulatory logic of genes include embryology, cancer research and the pharmaceutical industry. It is known that many of the mechanisms of gene regulation act directly at the transcriptional or sequence level, for example in those genes known to play integral roles during embryogenesis [2]. One set of regulatory interactions are those between a class of DNA-binding proteins known as transcription factors and short sequences of DNA which are bound by the proteins by virtue of their three dimensional conformation. Transcription factors will bind to a number of different but related sequences. A base substitution in

a cis-regulatory element will commonly simply modify the intensity of the protein-DNA interaction rather than abolish it. This flexibility ensures that cis-regulatory elements, and the networks in which they form the connecting nodes, are fairly robust to various mutations. Unfortunately, it complicates the problem of predicting cis-regulatory elements against the random background of the non-coding DNA sequences. The current state-of-the-art algorithms for transcription factor binding site prediction are, in spite of recent advances, still severely limited in accuracy. We show that in a large sample of annotated yeast promoter sequences, a selection of 12 key algorithms were unable to reduce the false positive predictions below 80%, with between 20% and 65% of annotated binding sites recovered. These algorithms represent a wide variety of approaches to the problem of transcription factor binding site prediction, such as the use of regular expression searches, PWM scanning, statistical analysis, co-regulation and evolutionary comparisons. There is, however, good reason to believe that the predictions from these different classes of algorithms are complementary and could be integrated to improve the quality of predictions. In the work described here we take the results from the 12 afore-mentioned algorithms and combine them into feature vectors, as shown in the next section. We then investigate whether the integrated classification results of the algorithms can produce better classifications than any one algorithm alone.

Fig. 1. The dataset has one column for each possible binding position, each with a possible binding prediction (binary or real valued). The 12 algorithms give their own prediction for each sequence position and one such column is shown.


III. DESCRIPTION OF THE DATA

The data consists of a sequence of possible binding positions, with a prediction result for each position from each of the 12 algorithms. The 12 algorithms can be categorised into higher order groups as: single sequence algorithms (7) [21], [1], [15], [19]; co-regulatory algorithms (3) [3], [9]; a comparative algorithm (1) [22]; and an evolutionary algorithm (1) [4]. The label information contains the best information we have been able to gather for the location of known binding sites in the sequences. We use −1 to denote the prediction that there is no binding site at a given location and +1 to denote the prediction that there is a binding site at that location. For each of the 12 base algorithms, a prediction result can be either binary or real valued, see Figure 1. The data therefore consists of 12-ary real vectors, each with an associated binary label. In this work, we divide our dataset into a training set and a test set: the first part of the data is the training set and the remainder is the test set. Amongst the data there are repeated vectors, some with the same label (repeated items) and some with different labels (inconsistent items). It is obviously unhelpful to have these repeated or inconsistent items in the training and test sets, so they are removed. As the data is drawn from a sequence of DNA nucleotides, the labels of nearby locations are relevant to the label of a particular location. We therefore contextualise the training and test data by windowing the vectors as shown in Figure 2. We use the locations up to three either side, giving a window size of 7 and a consequent input vector size of 84.

Fig. 2. The window size is set to 7 in this study. The middle label of 7 consecutive prediction sites is the label for a new windowed input. The length of each windowed input is now 84.
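As a concrete illustration of the windowing step described above, the following is a minimal sketch, assuming the per-position predictions are held in an N×12 array; the function and variable names are illustrative, not from the original work.

```python
import numpy as np

def window_inputs(preds, labels, half_width=3):
    """Contextualise each position by concatenating the 12-dimensional
    prediction vectors of its neighbours (window size 7 -> 84 inputs).

    preds:  (N, 12) array, one row of algorithm predictions per position.
    labels: (N,) array of +1/-1 binding-site labels.
    Positions too close to either end are dropped for simplicity.
    """
    n, d = preds.shape
    xs, ys = [], []
    for i in range(half_width, n - half_width):
        # Flatten the 2*half_width+1 consecutive rows into one vector.
        xs.append(preds[i - half_width:i + half_width + 1].ravel())
        # The label of the middle position labels the whole window.
        ys.append(labels[i])
    return np.array(xs), np.array(ys)
```

With half_width=3 each windowed input covers 7 positions of 12 predictions each, giving the 84-dimensional vectors used in the paper.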

In [18], we reported several classification results on this data, the best of which came from the SVM. In this paper, we focus on the datasets with windowed inputs.

IV. THE HYBRID ADABOOST ALGORITHM

A. The Adaboost algorithm

The Adaboost algorithm [10] produces a sequence of weak classifiers that collectively form a strong classifier. The idea is that firstly a weak classifier is trained. The data points that are poorly classified have their weights increased, and the reweighted dataset is used to train a second weak classifier. This process iterates for a specified number of iterations. The final strong classifier is a linear combination of the weak classifiers. Here the weak classifier used is a single layer neural network. See Table I for a description of the Adaboost process [10]. The Adaboost algorithm has a number of interesting properties. In [10], it was shown that the training error of the strong classifier approaches zero exponentially in the number of iterations. In [16], it was suggested that the generalisation performance of the Adaboost algorithm is related to the margin (separation) of the examples, and that it can achieve a large margin rapidly.
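As a concrete illustration of the process in Table I, here is a minimal sketch of the boosting loop (not the authors' implementation). The weak learner is passed in as a callable standing in for the single layer network; all names are illustrative.

```python
import numpy as np

def adaboost(X, y, train_weak, T=50):
    """Schematic AdaBoost loop following Table I.

    X: (N, d) inputs; y: (N,) labels in {-1, +1}.
    train_weak(X, y, w) must return a function h with h(X) in {-1, +1}.
    Returns the strong classifier H(x) = sign(sum_t alpha_t * h_t(x)).
    """
    n = len(y)
    w = np.full(n, 1.0 / n)            # initialise weights w_i = 1/N
    hs, alphas = [], []
    for _ in range(T):
        h = train_weak(X, y, w)
        miss = h(X) != y
        eps = w[miss].sum()            # weighted training error
        if eps == 0 or eps >= 0.5:     # termination test from Table I
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        # Increase the weight of misclassified points, then renormalise
        # (division by the sum plays the role of the constant Z_t).
        w *= np.exp(-alpha * y * h(X))
        w /= w.sum()
        hs.append(h)
        alphas.append(alpha)
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, hs)))
```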

B. The hybrid Adaboost algorithm

We now propose a new classification procedure, a hybrid Adaboost algorithm. Our idea was to identify the data points that the base algorithms find most difficult; the Adaboost algorithm is then applied to only those points. We use weighted majority voting (WMV) [8] to identify the 'difficult' data points. Each base algorithm votes with its confidence, which is measured by the probability that a given pattern $I$ is positive, computed from the training set with single inputs as follows:

$$P(+|I) = \frac{\text{the number of true positive examples}}{\text{the number of positive predictions}} \quad (1)$$

Let $S_+$ denote the summed confidence of the base algorithms that give a positive prediction, and $S_-$ that of those giving a negative prediction. A threshold $\theta$ is picked. Those data points whose $S_+$ and $S_-$ values are close together are going to be hard to classify: if $|S_+ - S_-| < \theta$, the point is deemed 'difficult' and collected into the 'difficult' subset. For the other data points, WMV determines whether they are predicted as binding sites or not. Unfortunately this naive method did not work well. The main reason is that the dataset is imbalanced, which means not enough positive predictions are seen, and this affects the confidence levels, that is, the sums $S_+$ and $S_-$. To increase the effect of the positive predictions a multiplier $m$ is used. The condition used to determine whether a data point is 'difficult' now becomes:

$$|m \cdot S_+ - S_-| < \theta \quad (2)$$

Thus the complete method we applied is a variant of the WMV. A description of the hybrid Adaboost algorithm is shown in Table II. In practice, we use cross-validation to choose $m$ and $\theta$ (see section VI-B).

TABLE II
THE HYBRID ADABOOST ALGORITHM

Inputs: data points $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, multiplier $m$, threshold $\theta$;
A variant of the weighted majority voting:
 – Calculate confidences based on the training dataset;
 – For each test data point, compute its sum of positive confidences $S_+$ and sum of negative confidences $S_-$;
 – if $|m \cdot S_+ - S_-| < \theta$, put the point into a new ('difficult') dataset, else classify it with the signed confidence value.
Over-sampling and under-sampling are applied to the training set with windowed inputs (see section VI-A);
A strong classifier is generated using the Adaboost algorithm (see section IV-A) on the training set;
Use the resulting strong classifier to classify the 'difficult' data points.
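A sketch of the splitting rule in Table II follows, under the assumption that each base algorithm's confidence has already been estimated from the training set via equation (1); whether the multiplier also enters the easy-point classification is not fully specified in the text, so the sign rule below is one plausible reading. All names are illustrative.

```python
import numpy as np

def split_difficult(preds, conf, m=2.0, theta=1.0):
    """Partition test points using the weighted-majority-voting rule.

    preds: (N, 12) array of base-algorithm votes in {-1, +1}
           (real-valued outputs thresholded to a sign beforehand).
    conf:  (12,) per-algorithm confidences from equation (1).
    Returns (easy_labels, difficult_mask); difficult points are the
    ones handed on to the Adaboost strong classifier.
    """
    pos = np.where(preds > 0, conf, 0.0).sum(axis=1)   # S+ per point
    neg = np.where(preds < 0, conf, 0.0).sum(axis=1)   # S- per point
    difficult = np.abs(m * pos - neg) < theta          # |m*S+ - S-| < theta
    # Assumed reading: easy points take the sign of the boosted margin.
    easy_labels = np.sign(m * pos - neg)
    return easy_labels, difficult
```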

V. CLASSIFIER PERFORMANCE

It is apparent that for a problem domain with an imbalanced dataset, the classification accuracy rate is not sufficient as a standard performance measure. To evaluate the classifiers used in this work, we apply several common performance metrics, such as recall, precision and F-score [5], [13], which are calculated to understand the performance of the classification algorithm on the minority class. Based on the confusion matrix (see Table III) computed from the test results, where TN is the number of true negative examples, FP the number of false positive examples, FN the number of false negative examples and TP the number of true positive examples, several common performance metrics can be defined as follows:

TABLE III
A CONFUSION MATRIX

TN | FP
FN | TP

TABLE I
THE ADABOOST ALGORITHM

Inputs: data points $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, where $y_i \in \{-1, +1\}$, number of iterations $T$;
Initialise weights: $w_i^{(1)} = 1/N$;
Do for $t = 1, \ldots, T$:
 – Train the classifier on the training set with respect to $\{w_1^{(t)}, \ldots, w_N^{(t)}\}$, and obtain a hypothesis $h_t : x \mapsto \{-1, +1\}$;
 – Calculate the weighted training error of $h_t$: $\epsilon_t = \sum_{i: h_t(x_i) \neq y_i} w_i^{(t)}$;
 – Set: $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$;
 – Update weights: $w_i^{(t+1)} = w_i^{(t)} \exp(-\alpha_t y_i h_t(x_i)) / Z_t$, where $Z_t$ is a normalisation constant such that $\sum_{i=1}^{N} w_i^{(t+1)} = 1$;
 – Terminate if $\epsilon_t = 0$ or $\epsilon_t \geq 1/2$, and set $T = t - 1$.
The final strong classifier is then: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.

Recall = TP / (TP + FN)   (3)

Precision = TP / (TP + FP)   (4)

F-score = (2 × Recall × Precision) / (Recall + Precision)   (5)

Accuracy = (TP + TN) / (TP + FN + TN + FP)   (6)

fp rate = FP / (FP + TN)   (7)
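For concreteness, these five measures can be computed directly from the confusion-matrix counts; the small helper below is a sketch with illustrative names, not code from the original work.

```python
def metrics(tp, fp, fn, tn):
    """Performance measures (3)-(7) from the confusion matrix counts."""
    recall = tp / (tp + fn)                                   # (3)
    precision = tp / (tp + fp)                                # (4)
    f_score = 2 * recall * precision / (recall + precision)   # (5)
    accuracy = (tp + tn) / (tp + fn + tn + fp)                # (6)
    fp_rate = fp / (fp + tn)                                  # (7)
    return recall, precision, f_score, accuracy, fp_rate
```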

VI. EXPERIMENTS

The classification technique used in previous work [18] was the support vector machine (SVM) [17]. The SVM experiments were completed using LIBSVM, which is available from the URL http://www.csie.ntu.edu.tw/~cjlin/libsvm.

A. Sampling Techniques for Imbalanced Dataset Learning

In our dataset, only a small fraction of the vectors correspond to binding positions, so this is an imbalanced dataset [12]. Since the dataset is imbalanced, supervised classification algorithms will be expected to over-predict the majority class, namely the non-binding site category. There are various methods of dealing with imbalanced data [11]. In this work, we concentrate on the data-based method [6]: using under-sampling of the majority class (negative examples) and over-sampling of the minority class (positive examples). We combine both over-sampling and under-sampling methods in our experiments. For under-sampling, we randomly selected a subset of data points from the majority class. The over-sampling case is more complex. In [12], the author addresses an important issue: the class imbalance problem is only a problem when the minority class contains very small sub-clusters. This indicates that simply over-sampling with replacement may not significantly improve minority class recognition. To overcome this problem, we apply a synthetic minority over-sampling technique as proposed in [6], sketched in the code after this section. For each pattern in the minority class, we search for its k nearest neighbours in the minority class using Euclidean distance. Since the dataset mixes continuous and binary features, we follow the suggestion in [6]: when the binary features differ between a pattern and its nearest neighbours, the median of the standard deviations of all continuous features for the minority class is included in the Euclidean distance. A new pattern belonging to the minority class can then be generated as follows: for continuous features, the difference of each feature between the pattern and its nearest neighbour is taken, multiplied by a random number between 0 and 1, and added to the corresponding feature of the pattern; for binary features, majority voting over the corresponding elements of the k nearest neighbours' feature vectors is employed. We take 5 nearest neighbours, and double the number of items in the minority class. The actual ratio of minority to majority class is determined by the under-sampling rate of the majority class. According to our previous experience, we set the final ratio to a half, which works well in this work.

B. Parameter Settings

Since we do not have enough data to build an independent validation set to evaluate the model, all the user-chosen parameters are obtained using cross-validation. The training set is divided into 5 equal subsets, one of which is used as a validation set, and there are therefore 5 possible such sets. For each classifier a range of reasonable parameter settings is selected. Each parameter setting is validated on each of the five validation sets, having previously been trained on the other 4/5 of the training data. The mean performance over these 5 validations is taken as the overall performance of the classifier with this parameter setting. The parameter setting with the best performance is then used with this classifier in the subsequent experiments. For example, in the hybrid Adaboost algorithm there are two parameters, m and θ, in the weighted majority voting part. Six different combinations were evaluated, with m and θ each drawn from a small range of values.
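The following is a minimal sketch of the synthetic over-sampling step for the continuous features, following the description above; the binary-feature voting and the mixed-distance adjustment from [6] are omitted for brevity, and all names are illustrative.

```python
import numpy as np

def oversample_minority(X_min, k=5, n_new=None, rng=None):
    """Generate synthetic minority-class patterns (continuous features).

    For each synthetic pattern, pick a minority point, pick one of its
    k nearest minority neighbours, and interpolate:
        x_new = x + r * (neighbour - x),  r uniform in (0, 1).
    Doubling the minority class corresponds to n_new = len(X_min).
    """
    rng = rng or np.random.default_rng()
    n_new = n_new or len(X_min)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]      # indices of k nearest neighbours
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = nn[i, rng.integers(k)]
        r = rng.random()
        new.append(X_min[i] + r * (X_min[j] - X_min[i]))
    return np.array(new)
```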

VII. RESULTS

The results are shown in Table IV, together with the best base algorithm (the one with the highest F-score).

TABLE IV
COMMON PERFORMANCE METRICS (%) ON THE POSSIBLE BINDING SITES WITH WINDOWED INPUTS

Classifier   recall   precision   F-score   Accuracy   fp rate
best Alg.    40.95    17.46       24.48     82.22      14.66
SVM          38.81    20.25       26.61     84.93      11.58
Adaboost     37.38    22.66       28.21     86.61       9.66
Hybrid       40.71    22.50       28.98     85.95      10.62

Compared with the best base algorithm, all classifiers increase the precision and decrease the fp rate. It can be seen that the hybrid Adaboost algorithm is clearly the best classifier: it outperforms the others in terms of the F-score. Its F-score improves by 4.50 and 2.37 percentage points when compared with the best base algorithm and the SVM respectively. The recall value of the hybrid Adaboost algorithm is nearly the same as that of the best base algorithm, while the SVM and Adaboost algorithms have become more conservative, predicting binding sites less often but with greater accuracy.

VIII. CONCLUSIONS

The significant result presented here is that when using the hybrid Adaboost algorithm, we can considerably improve binding site prediction by decreasing the false positive predictions, while at the same time improving the F-score. In fact, we are able to reduce the false positive rate of the best base algorithm from 14.66% to 10.62%, whilst maintaining about the same number of true positive predictions and improving the F-score by 4.50 percentage points. Hence our new algorithm presents a notable improvement on all the other algorithms we have tried. For instance, we also considered using the SVM in place of the variant of WMV in an alternative hybrid Adaboost algorithm. However, this version performed no better than the ordinary Adaboost algorithm, and therefore not as well as our original hybrid algorithm. Future work will investigate i) searching for a method to find a suitable ratio of minority to majority classes, which could give better results; ii) using algorithm-based techniques to cope with the imbalanced dataset; iii) considering a wider range of supervised meta-classifiers or ensemble learning algorithms. Another important avenue to explore will be to examine the biological significance of the results, and we are currently working on a visualisation tool for this purpose.

REFERENCES

[1] A. Apostolico, M. E. Bock, S. Lonardi and X. Xu, "Efficient Detection of Unusual Words," Journal of Computational Biology, Vol. 7, No. 1/2, 2000.
[2] M. I. Arnone and E. H. Davidson, "The hardwiring of development: Organization and function of genomic regulatory systems," Development 124, 1851-1864, 1997.

[3] T. L. Bailey and C. Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers," Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, 28-36, AAAI Press, 1994.
[4] M. Blanchette and M. Tompa, "FootPrinter: a program designed for phylogenetic footprinting," Nucleic Acids Research, Vol. 31, No. 13, 3840-3842, 2003.
[5] M. Buckland and F. Gey, "The relationship between Recall and Precision," Journal of the American Society for Information Science, Vol. 45, No. 1, pp. 12-19, 1994.
[6] N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer, "SMOTE: Synthetic Minority Over-sampling Technique," Journal of Artificial Intelligence Research, Vol. 16, pp. 321-357, 2002.
[7] F. H. Crick, "The genetic code," Scientific American 207, pp. 66-74, 1962.
[8] T. Fawcett, "Using rule sets to maximize ROC performance," Proceedings of the IEEE International Conference on Data Mining (ICDM-2001), Los Alamitos, CA, pp. 131-138, IEEE Computer Society, 2001.
[9] J. D. Hughes, P. W. Estep, S. Tavazoie and G. M. Church, "Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae," Journal of Molecular Biology, 296(5):1205-1214, 2000.
[10] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, 55(1):119-139, 1997.
[11] G. Wu and E. Y. Chang, "Class-boundary alignment for imbalanced dataset learning," Workshop on Learning from Imbalanced Datasets II, ICML, Washington DC, 2003.
[12] N. Japkowicz, "Class imbalances: Are we focusing on the right issue?" Workshop on Learning from Imbalanced Datasets II, ICML, Washington DC, 2003.
[13] M. Joshi, V. Kumar and R. Agarwal, "Evaluating boosting algorithms to classify rare classes: Comparison and improvements," First IEEE International Conference on Data Mining, San Jose, CA, 2001.
[14] M. Markstein, A. Stathopoulos, V. Markstein, P. Markstein, N. Harafuji, D. Keys, B. Lee, P. Richardson, D. Rokhsar and M. Levine, "Decoding Noncoding Regulatory DNAs in Metazoan Genomes," Proceedings of the 1st IEEE Computer Society Bioinformatics Conference (CSB 2002), Stanford, CA, USA, 14-16 August, 2002.
[15] N. Rajewsky, M. Vergassola, U. Gaul and E. D. Siggia, "Computational detection of genomic cis-regulatory modules, applied to body patterning in the early Drosophila embryo," BMC Bioinformatics, 3:30, 2002.
[16] R. E. Schapire, Y. Freund, P. L. Bartlett and W. S. Lee, "Boosting the margin: A new explanation for the effectiveness of voting methods," The Annals of Statistics, 26(5):1651-1686, October 1998.
[17] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, The MIT Press, 2002.
[18] Y. Sun, M. Robinson, R. Adams, P. Kaye, A. G. Rust and N. Davey, "Using Real-valued Meta Classifiers to Integrate Binding Site Predictions," Proceedings of IJCNN 2005, 2005.
[19] G. Thijs, K. Marchal, M. Lescot, S. Rombauts, B. De Moor, P. Rouzé and Y. Moreau, "A Gibbs sampling method to detect over-represented motifs in the upstream regions of coexpressed genes," Proceedings of RECOMB 2001, 305-312, 2001.
[20] T. F. Wu, C. J. Lin and R. C. Weng, "Probability Estimates for Multi-class Classification by Pairwise Coupling," Journal of Machine Learning Research, 5, pp. 975-1005, 2004.
[21] http://www.hgmp.mrc.ac.uk/Software/EMBOSS/
[22] http://family.caltech.edu/SeqComp/index.html
