
An Improved Genetic Algorithm for DNA Motif Discovery with Public Domain Information

Xi Li1,2 and Dianhui Wang1

1 Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, VIC, 3086, Australia [email protected]
2 Department of Primary Industries, Bioscience Research Division, Victorian AgriBiosciences Centre, 1 Park Drive, La Trobe Research and Development Park, Bundoora, VIC, 3083, Australia

Abstract. Recognition of transcription factor binding sites (TFBSs, or DNA motifs), which helps in understanding the regulation of gene expression, is one of the major challenges in the post-genomics era. Computational approaches based on brute-force search techniques or heuristic search algorithms have been developed to discover binding sites, and a number of them have achieved some degree of success. However, prediction accuracy can be noticeably affected by the naturally low signal-to-noise ratio of DNA sequences. In this paper, a novel DNA motif discovery approach using a genetic algorithm is proposed to explore ways of improving algorithm performance. We use publicly available motif models, such as Position Frequency Matrices (PFMs), to initialize the population. By considering both the conservation and the complexity of DNA motifs, a novel fitness function is developed to better evaluate the motif models during the evolution process. A final model refinement process is also introduced to optimize the motif models. The experimental results demonstrate comparable (superior) performance of our approach relative to two recently proposed genetic algorithm motif discovery approaches.

1 Background

The short segments (usually ≤ 30 bp) of DNA sequence to which Transcription Factors (TFs) bind are called Transcription Factor Binding Sites (TFBSs). It is believed that the interaction between a TF and its TFBS dominates the regulation of gene expression. Most TFBSs are located within the promoter region (100–1000 bp) of a gene's Transcription Start Site (TSS), although they can also be found in a gene's downstream and coding regions. A DNA motif refers to a common conserved pattern shared by a group of TFBSs recognized by the same TF. The DNA motif discovery problem can be regarded as extracting short (5–30 letters) unknown motifs from a set of genes of interest.

M. Köppen et al. (Eds.): ICONIP 2008, Part I, LNCS 5506, pp. 521–528, 2009. © Springer-Verlag Berlin Heidelberg 2009


Traditional binding site identification methods, such as DNase footprinting [4], provide trustworthy results but are labor-intensive and time-consuming. Now that massive genomic sequence data and gene expression profiles are available, numerous computational tools have been developed as an alternative way to address the DNA motif finding problem. Many studies classify the search algorithms into exhaustive approaches and heuristic approaches. Exhaustive approaches, such as CONSENSUS [13], are guaranteed to find the most over-represented words by enumerating the search space and counting the frequencies of all possible occurrences. Heuristic approaches, on the other hand, are often built on probabilistic models: they start from an initial estimate of the motif model and iteratively optimize its parameters to maximize the likelihood, as in MEME [1]. Tompa et al. [12] evaluated 13 motif discovery tools on eukaryotic datasets and, as a complement, Hu et al. [5] assessed the performance of 5 representative tools on prokaryotic data. The results of both studies show that current motif discovery tools still lack accuracy and are far from perfect.

Genetic Algorithms (GAs) have been introduced to the motif discovery problem ([2], [3], [7], [14]), and recently proposed GA methods have shown evidence of improvement. For example, Wei and Jensen [14] presented a GA-based cis-element discovery framework named GAME, which employs a Bayesian posterior distribution as the fitness function to evolve the population. In 2008, Chan et al. [2] proposed another GA approach called GALF-P, which employs a local filtering operator and adaptive post-processing to reduce the false-positive rate. Comparative studies carried out by Chan et al. [2] on both synthetic and experimental datasets show that GALF-P achieves better overall performance than GAME as well as 3 other commonly used motif finding tools.

In this work, we present an improved approach to the identification of DNA motifs using a genetic algorithm (IGAMD). Our method incorporates experimentally validated Position Frequency Matrices (PFMs) to initialize the population. We employ information content and a complexity measure as the fitness function to evaluate the candidate solutions effectively. A new genetic operator named substitution is introduced to avoid local maxima. We also develop a final model refinement process to improve model quality. Comparative studies on the promoter regions of 302 co-regulated genes containing 580 experimentally verified TFBSs from 10 E. coli TF families demonstrate the advantageous performance of our approach over two GA-based methods, GAME and GALF-P.

2 The Proposed Approach

In our method, prior knowledge in the form of a PFM is used to construct the initial population. During the evolution process, fitter individuals have a better chance to reproduce or to be kept in the next generation. We also explore a novel fitness function and a combination of genetic operators to improve the algorithm performance. A final motif model refinement process is applied to optimize the predicted results.

2.1 Representation

An individual represents a potential solution to the optimization problem. In the DNA motif finding domain, an individual can be regarded as a motif model extracted from a set of over-represented short DNA sequences (denoted as k-mers, where k is the length of the motif); the goal of our GA approach is therefore to find the optimal individual. In [2] and [14], an individual is represented by a string of integers storing the start position of a k-mer in each target sequence. In contrast to this position-based representation, an individual in our study is represented as a vector {v1, v2, ..., vn}, where vm is a k-mer from the m-th sequence and n is the number of input sequences.
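The difference between the two representations can be illustrated with a minimal Python sketch. The function name `positions_to_kmers` and the toy sequences are hypothetical and only serve to show how a position-based individual maps to the k-mer vector described above.

```python
from typing import List


def positions_to_kmers(sequences: List[str], starts: List[int], k: int) -> List[str]:
    """Turn a position-based individual (one start index per sequence)
    into the k-mer vector {v1, ..., vn} used in this paper."""
    if len(sequences) != len(starts):
        raise ValueError("one start position is required per sequence")
    return [seq[s:s + k] for seq, s in zip(sequences, starts)]


# Toy input: three sequences that each contain the 4-mer ACGT (illustrative data).
sequences = ["ACGTACGTAA", "TTACGTGGCA", "GGGACGTTTT"]
starts = [0, 2, 3]  # position-based individual, as in [2] and [14]
individual = positions_to_kmers(sequences, starts, k=4)
print(individual)  # k-mer vector individual
```

Storing the k-mers directly means a PFM can be built from an individual without going back to the input sequences.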

2.2 Fitness Score Function

It is crucial to define a suitable fitness function that measures the quality of our proposed motif model representation. Some previous GA approaches used information-theoretic fitness functions ([7], [14]), some used distance-oriented functions to score individuals [3], and some combined both [2]. Recently, Congdon et al. [3] presented a benchmark to assess GA performance using 1) Information Content (IC) alone, 2) a distance-oriented function alone, and 3) several variations of the combination of the two as the fitness function. The results show that IC alone is outperformed by the other two options [3]. We believe that combining metrics that capture different aspects of an individual is a reasonable solution. The information content (IC), computed as the relative entropy of the binding sites with respect to the background base distribution [11], is applied here to evaluate the conservation significance of a motif model. It can be regarded as an indication of how much a motif model deviates from the background distribution. The IC value of a DNA motif model M is computed by

$$IC(M) = \sum_{j=1}^{k} \sum_{b=A}^{T} f_b(j) \log \frac{f_b(j)}{p_b} \tag{1}$$

where $f_b(j)$ is the normalized frequency of nucleotide b ∈ Σ = {A, T, C, G} (for DNA sequences) at the j-th position of all instances in M, and $p_b$ is the background base frequency estimated from the intergenic regions of the investigated genome. A higher IC value implies higher conservation of a motif model. However, a group of repetitive segments, such as runs of A, which cannot be considered possible binding sites, also produces a significantly high IC value. Selecting the individual with the highest IC as the optimal motif model is therefore clearly vulnerable. To address this, the model complexity score proposed by Mahony et al. [9] measures the compositional structure of a model, so that models with low complexity scores can be excluded. In our study, an individual with a complexity score below a threshold value is removed from the population. The complexity measure is given by

$$c(M) = \left(\frac{1}{4}\right)^{k} \prod_{b=A}^{T} \left( \frac{k}{\sum_{j=1}^{k} f(j,b)} \right)^{\sum_{j=1}^{k} f(j,b)} \tag{2}$$
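The behaviour of Eqs. (1) and (2) on a low-complexity model can be checked with a small Python sketch. It assumes natural logarithms, a uniform background, and that f(j, b) in Eq. (2) is the same normalized frequency as in Eq. (1); the function names and toy k-mers are illustrative, not part of the original method description.

```python
import math
from collections import Counter
from typing import Dict, List

BASES = "ACGT"


def column_frequencies(kmers: List[str]) -> List[Dict[str, float]]:
    """Normalized base frequencies f_b(j) for each column j of equal-length k-mers."""
    n, k = len(kmers), len(kmers[0])
    columns = []
    for j in range(k):
        counts = Counter(kmer[j] for kmer in kmers)
        columns.append({b: counts[b] / n for b in BASES})
    return columns


def information_content(freqs: List[Dict[str, float]], background: Dict[str, float]) -> float:
    """Eq. (1): relative entropy against the background; 0 * log(0) is taken as 0."""
    total = 0.0
    for col in freqs:
        for b in BASES:
            if col[b] > 0:
                total += col[b] * math.log(col[b] / background[b])
    return total


def complexity(freqs: List[Dict[str, float]]) -> float:
    """Eq. (2); a base that never occurs contributes a factor of 1."""
    k = len(freqs)
    score = 0.25 ** k
    for b in BASES:
        s = sum(col[b] for col in freqs)
        if s > 0:
            score *= (k / s) ** s
    return score


uniform = {b: 0.25 for b in BASES}          # assumed background
poly_a = column_frequencies(["AAAA"] * 3)   # repetitive, low-complexity model
mixed = column_frequencies(["ACGT"] * 3)    # equally conserved, but compositionally rich
# Both models reach the same IC (4 * ln 4), yet complexity separates them:
# (1/4)^4 for the poly-A model versus 1 for the mixed model.
print(information_content(poly_a, uniform), complexity(poly_a))
print(information_content(mixed, uniform), complexity(mixed))
```

In this toy case IC alone cannot distinguish the two models, which is the vulnerability the complexity threshold is meant to cover.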

To describe both the conservation property and the compositional structure of a given motif model during the evolution process, we propose a new fitness function:

$$ICC(M) = \sum_{j=1}^{k} \sum_{b=A}^{T} f_b(j) \log \frac{k f_b(j)}{p_b \sum_{j=1}^{k} f(j,b)} + k \log\left(\frac{1}{4}\right) \tag{3}$$
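One way to read Eq. (3), assuming that f(j, b) in Eq. (2) and f_b(j) in Eq. (1) denote the same normalized frequency, is as the sum of IC(M) and log c(M). The following derivation sketch uses the shorthand S_b for the per-base column sum, which is introduced here only for readability.

```latex
\begin{align*}
\log c(M) &= k \log\frac{1}{4} + \sum_{b=A}^{T} S_b \log\frac{k}{S_b},
  \qquad S_b = \sum_{j=1}^{k} f_b(j) \\
IC(M) + \log c(M)
  &= \sum_{j=1}^{k} \sum_{b=A}^{T} f_b(j)
     \left[ \log\frac{f_b(j)}{p_b} + \log\frac{k}{S_b} \right] + k \log\frac{1}{4} \\
  &= \sum_{j=1}^{k} \sum_{b=A}^{T} f_b(j) \log\frac{k\, f_b(j)}{p_b\, S_b}
     + k \log\frac{1}{4} \;=\; ICC(M)
\end{align*}
```

Under this reading, a low-complexity model is penalized additively on the log scale, while a model with balanced base composition (c(M) close to 1) keeps a score close to its IC.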

2.3 Population Initialization

To initialize the population, a verified PFM motif model K of a given length k is supplied. We adopt the formula used in roulette-wheel selection to generate a Position Cumulate Frequency Matrix (PCM) based on K, that is, $p_{ij} = f_{ij} / \sum_{z=A}^{T} f_{iz}$. Here z, j ∈ {A, C, G, T} for DNA sequences, i ∈ {1, 2, ..., k} is the i-th column, and $f_{ij}$ is the frequency of nucleotide j in the i-th column. Roulette-wheel selection is then applied to produce a list of consensus sequences {C1, C2, ..., Cs} from the PCM, where s is the population size. Each consensus can be regarded as a possible motif instance from K. Each of them is then used to scan the input sequences and collect the group of k-mers with the highest similarity, which forms an individual. If more than one such k-mer is found in a sequence, the first occurrence is stored in the individual.
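The initialization steps above can be sketched in Python as follows. The source does not specify the similarity measure, so this sketch assumes the number of matching positions; the function names, the seed handling, and the toy PFM are likewise hypothetical.

```python
import random
from typing import Dict, List

BASES = "ACGT"


def sample_consensus(pfm_columns: List[Dict[str, float]], rng: random.Random) -> str:
    """Draw one base per column by roulette-wheel selection on the column frequencies."""
    consensus = []
    for col in pfm_columns:
        total = sum(col[b] for b in BASES)
        pick = rng.random() * total
        cumulative = 0.0
        for b in BASES:
            cumulative += col[b]
            if pick < cumulative:
                consensus.append(b)
                break
        else:
            # Guard against floating-point rounding at the upper edge of the wheel.
            consensus.append(max(BASES, key=lambda x: col[x]))
    return "".join(consensus)


def best_kmer(sequence: str, consensus: str) -> str:
    """First k-mer of the sequence with the most positions matching the consensus
    (assumed similarity measure)."""
    k = len(consensus)
    best, best_score = sequence[:k], -1
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        score = sum(a == b for a, b in zip(kmer, consensus))
        if score > best_score:  # strict '>' keeps the first occurrence on ties
            best, best_score = kmer, score
    return best


def init_population(pfm_columns, sequences: List[str], size: int, seed: int = 0):
    """Build `size` individuals, each a list with one k-mer per input sequence."""
    rng = random.Random(seed)
    population = []
    for _ in range(size):
        consensus = sample_consensus(pfm_columns, rng)
        population.append([best_kmer(seq, consensus) for seq in sequences])
    return population


# Toy PFM whose columns are fully conserved, so every sampled consensus is ACGT.
pfm_k = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
]
population = init_population(pfm_k, ["TTACGTGG", "GGACGAAA"], size=3)
print(population)
```

With a less conserved PFM, different consensus strings are sampled and the individuals start from different regions of the search space.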

2.4 Evolution Process

Conventional roulette-wheel selection is used to choose individuals as parents for reproduction, so an individual with higher fitness has a higher chance of being selected to reproduce. For reproduction, a crossover operator, a mutation operator, and a substitution operator are proposed. When the crossover operator is chosen, with a certain crossover probability, a single crossover point is randomly generated for both parents, and all k-mers beyond this point are exchanged between the two parents. One-point crossover is also employed in both [14] and [2], and their promising experimental results suggest that it is a reasonable choice. When mutation is performed, with a given mutation probability, one randomly chosen k-mer_i of the selected parent is replaced by a random k-mer from sequence i.

We also introduce a new operator named substitution, which can be regarded as the replacement of an individual I_old by I_new derived from I_old itself. The substitution operator, triggered with a given probability, starts from a randomly chosen individual I_old in the population and constructs a consensus string C from the PFM of I_old. C is constructed by choosing the nucleotide with the second-highest frequency in r columns and the nucleotide with the highest frequency in the remaining columns, with equal probability. Instead of simply generating the number r at random, we adopt a distribution proposed in [7]: $P(r = k) = 1/2^k$ for k ∈ [2, n], where n is the number of input sequences, and $P(r = 1) = 1 - \sum_{k=2}^{n} 1/2^k$. Unlike mutation, a new set of k-mers, one from each sequence, with the highest similarity to C is grouped to form a new individual that replaces I_old. We believe that using the three operators together during the evolution process maintains a stable level of population diversity and can help avoid local maxima.

Reproduction continues until the size of the population has doubled. To keep the population size constant across generations, we apply one of two selection schemes as the replacement strategy: winners-take-all selection and tournament selection [14]. Winners-take-all guarantees that the best candidate solutions are kept in every generation and saves computing time, but the population may stabilize after a small number of generations, with the risk of being trapped in a local optimum. Thus, by default, tournament selection is chosen with a higher probability than winners-take-all. Although some individuals with low fitness scores may be selected, the average fitness of the following generation increases smoothly, which helps avoid premature convergence. The evolution process terminates when the number of generations reaches a fixed number or the best solution remains unchanged for a given number of iterations.
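The three operators can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' implementation: the columns that receive the second-highest base are chosen uniformly at random, r is clamped to the motif width, at least two input sequences are assumed, and all function names are hypothetical.

```python
import random
from typing import Dict, List, Tuple

BASES = "ACGT"


def one_point_crossover(p1: List[str], p2: List[str],
                        rng: random.Random) -> Tuple[List[str], List[str]]:
    """Exchange all k-mers beyond one random cut point (needs >= 2 sequences)."""
    cut = rng.randint(1, len(p1) - 1)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]


def mutate(individual: List[str], sequences: List[str], k: int,
           rng: random.Random) -> List[str]:
    """Replace the k-mer of one random sequence i by a random k-mer from sequence i."""
    child = list(individual)
    i = rng.randrange(len(child))
    start = rng.randrange(len(sequences[i]) - k + 1)
    child[i] = sequences[i][start:start + k]
    return child


def sample_r(n: int, rng: random.Random) -> int:
    """P(r = j) = 1/2^j for j in 2..n; r = 1 receives the remaining probability."""
    tail = [0.5 ** j for j in range(2, n + 1)]
    weights = [1.0 - sum(tail)] + tail
    return rng.choices(range(1, n + 1), weights=weights)[0]


def substitution_consensus(pfm_columns: List[Dict[str, float]], r: int,
                           rng: random.Random) -> str:
    """Second-highest base in r randomly chosen columns, highest base elsewhere."""
    k = len(pfm_columns)
    second_best_columns = set(rng.sample(range(k), min(r, k)))  # clamp is an assumption
    consensus = []
    for j, col in enumerate(pfm_columns):
        ranked = sorted(BASES, key=lambda b: col[b], reverse=True)
        consensus.append(ranked[1] if j in second_best_columns else ranked[0])
    return "".join(consensus)


rng = random.Random(1)
child_a, child_b = one_point_crossover(["A1", "A2", "A3"], ["B1", "B2", "B3"], rng)
mutant = mutate(["ACGT", "ACGT"], ["ACGTAC", "TTACGT"], 4, rng)
columns = [{"A": 0.7, "C": 0.2, "G": 0.1, "T": 0.0}] * 2
print(child_a, child_b, mutant, sample_r(5, rng))
print(substitution_consensus(columns, 0, rng), substitution_consensus(columns, 2, rng))
```

The consensus returned by `substitution_consensus` would then be used to scan every input sequence for its most similar k-mer, forming the individual that replaces I_old.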

2.5 Final Model Refinement

After the evolution process meets the termination criteria, a final model refinement is applied to improve the overall algorithm performance. For each model, a consensus pattern is composed of the dominant (highest-frequency) letter in each column of its PFM. We merge all motif models that share the same consensus pattern and differ only slightly in information content ($|IC_a - IC_b| \le \gamma$) into a new motif model. In this way, some mis-clustered k-mers may be grouped together. Because the number of k-mers in a model can increase after merging, the merge directly affects model quality. To address this, we develop a false-positive instance removal method. For a given motif model M, the removal process starts by finding the k-mer $k_{max}$ whose removal maximizes the IC of M. If the resulting increase in IC is larger than a predefined small constant β, $k_{max}$ is removed permanently and the IC is recalculated for the next iteration. The iteration continues until no k-mer in M satisfies the removal criterion. The removal method is applied to all merged models. In this way, we believe that k-mers with a large negative impact on model quality are removed, whereas some weak true-positive binding sites can be kept in the model.

The Maximum a Posteriori (MAP) score, a log-likelihood scoring function proposed by Liu et al. [8] that measures model conservation by giving a statistical significance score, is employed here to rank the merged motif models. The model with the highest MAP score is the final output motif model. The MAP score is given by

$$MAP = \frac{\log(x_m)}{k} \left[ \sum_{l=1}^{k} \sum_{b} f(l,b) \log f(l,b) - \frac{1}{x_m} \sum_{s_i} \log p_0(s_i) \right] \tag{4}$$

where $x_m$ is the number of k-mers used to build the PFM, k is the width of the PFM, and $p_0(s_i)$ is the probability of generating a k-mer $s_i$ from the Markov background model. The model refinement stage extends the ability to identify weak binding sites and can purify the motif model by removing false-positive instances, in line with our goal of improving the overall algorithm performance.
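The false-positive removal loop and the MAP ranking score can be sketched in Python as follows. The sketch assumes natural logarithms and, for simplicity, a zero-order background model in which p0(s) is the product of the base frequencies; the function names, the value of β, and the toy k-mers are illustrative only.

```python
import math
from collections import Counter
from typing import Dict, List

BASES = "ACGT"


def information_content(kmers: List[str], background: Dict[str, float]) -> float:
    """IC of the PFM built from the k-mers, as in Eq. (1)."""
    n, k = len(kmers), len(kmers[0])
    total = 0.0
    for j in range(k):
        counts = Counter(kmer[j] for kmer in kmers)
        for b in BASES:
            f = counts[b] / n
            if f > 0:
                total += f * math.log(f / background[b])
    return total


def remove_false_positives(kmers: List[str], background: Dict[str, float],
                           beta: float) -> List[str]:
    """Greedily drop the k-mer whose removal raises IC the most, while the gain exceeds beta."""
    kept = list(kmers)
    while len(kept) > 1:
        current = information_content(kept, background)
        best_gain, best_index = float("-inf"), -1
        for i in range(len(kept)):
            reduced = kept[:i] + kept[i + 1:]
            gain = information_content(reduced, background) - current
            if gain > best_gain:
                best_gain, best_index = gain, i
        if best_gain <= beta:
            break
        del kept[best_index]
    return kept


def map_score(kmers: List[str], background: Dict[str, float]) -> float:
    """Eq. (4) with an assumed zero-order background: p0(s) = product of base frequencies."""
    x_m, k = len(kmers), len(kmers[0])
    entropy_term = 0.0
    for l in range(k):
        counts = Counter(kmer[l] for kmer in kmers)
        for b in BASES:
            f = counts[b] / x_m
            if f > 0:
                entropy_term += f * math.log(f)
    background_term = sum(math.log(background[c]) for kmer in kmers for c in kmer) / x_m
    return math.log(x_m) / k * (entropy_term - background_term)


uniform = {b: 0.25 for b in BASES}
merged_model = ["ACGT"] * 5 + ["TTTT"]  # one suspicious instance after merging
cleaned = remove_false_positives(merged_model, uniform, beta=0.1)
print(cleaned, map_score(cleaned, uniform))
```

In this toy model only the outlying TTTT instance is removed; once all remaining k-mers agree, no removal increases IC by more than β and the loop stops.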

3 Results

3.1 Experimental Datasets

The promoter regions of 302 co-regulated genes from 10 E. coli TF families were downloaded from RSAT (http://rsat.ulb.ac.be/rsat/). The 302 promoter regions contain 580 experimentally verified TFBSs in total. Information on the 580 TFBSs and the 10 PFMs constructed from true binding sites was obtained from RegulonDB [6]. The datasets cover different motif properties: motif widths from 8 to 23, total numbers of sites per set from 15 to 185, and total numbers of sequences per set from 7 to 121. This allows us to demonstrate the stability of algorithm performance under different scenarios. The datasets are described in detail in Table 1.

Table 1. Details of the experimental datasets

TF Name   No. of seq.   No. of BS   Width of BS   Ave. No. of BS per seq.
ArgR      8             18          19            2.25
CRP       121           185         23            1.52
Fis       42            103         16            2.5
Fur       21            54          20            2.6
IHF       46            65          14            1.4
LexA      12            17          21            1.4
Lrp       11            33          13            3
NarL      19            73          8             3.8
NtrC      7             17          18            2.4
PhoP      15            15          18            1

To evaluate the performance of the proposed algorithm, we use precision P and recall R from the field of information retrieval [10]. A single measure, the F-measure F, which is calculated as a weighted average of precision and recall [10], is also used to indicate prediction accuracy. The best value of the F-measure is 1 and the worst is 0.
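A minimal Python sketch of these measures is given below. The source only describes F as a weighted average of precision and recall, so the balanced form F = 2PR / (P + R) used here is an assumption, as are the function names and the example counts of true positives (tp), false positives (fp) and false negatives (fn).

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = tp / (tp + fp); recall = tp / (tp + fn); 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (assumed form): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts: 8 predicted sites are true, 2 are false, 8 true sites are missed.
p, r = precision_recall(tp=8, fp=2, fn=8)
f = f_measure(p, r)
print(p, r, f)
```

The harmonic form rewards methods that balance the two measures: a predictor with very high precision but very low recall still receives a low F value.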

3.2 Comparisons with GA Applications

In this section we evaluate our approach in detail through experimental comparisons with two GA methods, GAME [14] and GALF-P [2].


Table 2. The average results of IGAMD, GAME and GALF-P on the 10 datasets for 20 runs

        IGAMD                              GAME                               GALF-P
TF      P            R            F       P            R            F       P            R            F
ArgR    0.91 ± 0.13  0.59 ± 0.14  0.71    0.53 ± 0.10  0.81 ± 0.17  0.64    0.50 ± 0.05  0.83 ± 0.07  0.63
CRP     0.45 ± 0.17  0.70 ± 0.09  0.55    0.75 ± 0.02  0.63 ± 0.04  0.68    0.67 ± 0.17  0.51 ± 0.13  0.58
Fis     0.40 ± 0.05  0.40 ± 0.10  0.40    0.28 ± 0.14  0.10 ± 0.05  0.15    0.38 ± 0.02  0.21 ± 0.03  0.27
Fur     0.83 ± 0.06  0.93 ± 0.02  0.87    0.85 ± 0.02  0.90 ± 0.02  0.87    0.90 ± 0.02  0.87 ± 0.03  0.88
IHF     0.28 ± 0.06  0.46 ± 0.03  0.35    0.08 ± 0.12  0.04 ± 0.07  0.05    0.22 ± 0.10  0.20 ± 0.06  0.21
LexA    0.93 ± 0.09  0.81 ± 0.04  0.87    0.82 ± 0.06  0.86 ± 0.03  0.84    0.95 ± 0.06  0.74 ± 0.05  0.83
Lrp     0.24 ± 0.06  0.24 ± 0.12  0.24    0.19 ± 0.11  0.16 ± 0.10  0.17    0.16 ± 0.06  0.16 ± 0.07  0.16
NarL    0.18 ± 0.05  0.14 ± 0.06  0.16    0.09 ± 0.07  0.02 ± 0.02  0.03    0.06 ± 0.05  0.03 ± 0.02  0.04
NtrC    0.94 ± 0.08  0.57 ± 0.10  0.71    0.58 ± 0.04  0.78 ± 0.07  0.67    0.60 ± 0.07  0.82 ± 0.08  0.69
PhoP    0.31 ± 0.03  0.71 ± 0.06  0.43    0.13 ± 0.14  0.17 ± 0.17  0.15    0.35 ± 0.08  0.56 ± 0.12  0.43
Ave     0.55         0.55         0.55    0.43         0.45         0.44    0.48         0.5          0.49

For each approach we adjusted the probabilities of the genetic operators (from 0.1 to 0.9), the population size (from 500 to 2000), and the number of generations (from 500 to 2000), and kept all other parameters at their defaults across 20 runs. The averages of precision, recall and F-measure over 20 runs for each approach on the 10 datasets, together with the standard deviations of precision and recall (following the ± symbol), are shown in Table 2. The best results are displayed in bold. According to our results, IGAMD performs comparably to the other two methods. It has the best F-measure in 7 of the 10 datasets and the best precision in 7 of the 10 datasets, and it has the best recall in 6 of 10. Overall, IGAMD outperforms the other two in terms of average precision, recall and F-measure. We notice that for datasets with a high average number of binding sites per sequence, such as Fis (2.45), Lrp (3) and NarL (3.8), all algorithms returned low prediction accuracy. In our view, this may be caused by the assumption, made by all three algorithms during the evolution process, of one binding site per sequence. Although IGAMD improves prediction accuracy compared with GALF-P and GAME, it still fails on some datasets that contain less conserved binding sites.

4 Conclusion

In this study, we propose IGAMD, a novel genetic-algorithm-based DNA motif discovery application. Public domain information is used to construct the initial motif models, which reduces the search space. Two model quality metrics are combined into the fitness function to provide a stable evolution process, and a model refinement procedure can potentially increase prediction precision. As the results section shows, IGAMD demonstrates comparable performance across 10 datasets against two other GA approaches. The major remaining issue is to further improve the reliability and robustness of IGAMD on low signal-to-noise data. We believe that a more comprehensive model representation, along with a new fitness function, would enhance the ability to handle less conserved binding sites.


Acknowledgments

The first author would like to thank Dr Tim Sawbridge at the Victorian AgriBiosciences Centre for his kind support during the degree study.

References

1. Bailey, T.L., Elkan, C.: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In: Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28–36. AAAI Press, Menlo Park (1994)
2. Chan, T.-M., Leung, K.-S., Lee, K.-H.: TFBS identification based on genetic algorithm with combined representations and adaptive post-processing. Bioinformatics 24, 341–349 (2008)
3. Congdon, C.B., Aman, J.C., Nava, G.M., Gaskins, H.R., Mattingly, C.J.: An Evaluation of Information Content as a Metric for the Inference of Putative Conserved Noncoding Regions in DNA Sequences Using a Genetic Algorithms Approach. IEEE/ACM Trans. on Computational Biology and Bioinformatics 5, 1–14 (2008)
4. Galas, D.J., Schmitz, A.: DNAse footprinting: a simple method for the detection of protein-DNA binding specificity. Nucleic Acids Res. 5, 3157–3170 (1978)
5. Hu, J., Li, B., Kihara, D.: Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res. 33, 4899–4913 (2005)
6. Huerta, A.M., Salgado, H., Thieffry, D., Collado-Vides, J.: RegulonDB: a database on transcriptional regulation in Escherichia coli. Nucleic Acids Res. 26, 55–59 (1998)
7. Li, L.P., Liang, Y., Bass, R.L.L.: GAPWM: A Genetic Algorithm Method for Optimizing a Position Weight Matrix. Bioinformatics 23, 1188–1194 (2007)
8. Liu, X.S., Brutlag, D.L., Liu, J.S.: An algorithm for finding protein-DNA binding sites with applications to chromatin immunoprecipitation microarray experiments. Nature Biotechnology 20, 835–839 (2002)
9. Mahony, S., Hendrix, D., Golden, A., Smith, T.J., Rokhsar, D.S.: Transcription factor binding site identification using the Self-Organizing Map. Bioinformatics 21, 1807–1814 (2005)
10. Shaw Jr., W.M., Burgin, R., Howell, P.: Performance standards and evaluations in IR test collections: cluster-based retrieval models. Information Processing & Management 33, 1–14 (1997)
11. Stormo, G.D., Fields, D.S.: Specificity, free energy and information content in protein-DNA interactions. Trends in Biochemical Sciences 23, 109–113 (1998)
12. Tompa, M., Li, N., Bailey, T.L., et al.: Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology 23, 137–144 (2005)
13. van Helden, J., André, B., Collado-Vides, J.: Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. Journal of Molecular Biology 281, 827–842 (1998)
14. Wei, Z., Jensen, S.T.: GAME: detecting cis-regulatory elements using a genetic algorithm. Bioinformatics 22, 1577–1584 (2006)