Employing Genetic Algorithm to Construct Epigenetic Tree-Based Features for Enhancer Region Prediction

Pui Kwan Fong1, Nung Kion Lee1,∗, and Mohd Tajuddin Abdullah2

1 Department of Cognitive Sciences, Universiti Malaysia Sarawak, Kota Samarahan, Malaysia
2 Center of Tasik Kenyir EcoSystem, Universiti Malaysia Terengganu, Kuala Terengganu, Malaysia
[email protected]

Abstract. This paper presents a GA-based method to generate novel logic-based features, represented by parse trees, from DNA sequences enriched with H3K4me1 histone signatures. Current methods, which mostly utilize k-mer content features, are not able to represent the possibly complex interactions of various DNA segments in H3K4me1 regions. We hypothesize that modeling such complex interactions is significant for the recognition of H3K4me1 marks. Our proposed method employs a tree structure to model the logical relationships between k-mers from the marks. To benchmark the generated features, we compare them with the commonly used k-mer content features on the mouse (mm9) genome dataset. Our results show that the logical rule features improve performance in terms of f-measure for all the datasets tested.

Keywords: Genetic Algorithm, Feature extraction, Histone modifications.

1 Introduction

Initiation of gene transcription involves a variety of regulatory elements, and locating cis-regulatory elements helps to explain complex gene regulation. One essential class of cis-regulatory element, the enhancer, comprises clusters of transcription factor binding sites (TFBS), each spanning about 6 to 20 base pairs (bp). An enhancer is capable of regulating the expression of genes located ten to a hundred thousand bp away, regardless of its location [1]. Locating enhancer regions remains a challenging task because enhancers act at a distance and consist of short DNA sequences. In addition, enhancer binding sites degenerate easily while retaining their original function [2]. Thus, it is difficult to find a general sequence pattern that represents a specific type of enhancer. Pioneering computational methods focus on motif profile searching to discover TFBS. The review in [3] highlights that these methods achieve high prediction accuracy only for lower organisms and often produce many false positive hits for complex organisms.

∗ Corresponding author.

C.K. Loo et al. (Eds.): ICONIP 2014, Part III, LNCS 8836, pp. 390–397, 2014. © Springer International Publishing Switzerland 2014

Recently, advances in ChIP-chip and ChIP-seq techniques for genome-wide mapping of epigenetic marks [4,5] have facilitated the use of these features for enhancer prediction. Epigenetic marks such as histone modifications are prominent landmarks for enhancer identification, and these features are widely used [6,7] with different representation approaches, the choice of which can strongly affect the prediction results. In this paper, we hypothesize that DNA features co-exist, that their interactions are complex, and that they need to be represented as higher-order features as opposed to using only content information such as k-mer frequency [7]. A framework for modelling complex parse tree features using Genetic Algorithms (GA) [8] is proposed. Tree features generated from this framework are scrutinized for their competence in discriminating enrichment of H3K4me1 in DNA sequences.
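To illustrate what a higher-order, logic-based feature over k-mers could look like, the following is a minimal sketch. The nested-tuple encoding, the operator set (AND, OR, NOT), the example k-mers, and the function name `evaluate` are all assumptions made for illustration; they are not taken from the framework itself.

```python
from typing import Tuple, Union

# Hypothetical encoding: a tree is either a k-mer string (leaf)
# or a tuple of the form (operator, child, child, ...).
Tree = Union[str, Tuple]


def evaluate(tree: Tree, seq: str) -> bool:
    """Return True if the logical k-mer rule encoded by `tree` holds for `seq`."""
    if isinstance(tree, str):
        # Leaf node: true when the k-mer occurs anywhere in the sequence.
        return tree in seq
    op, *children = tree
    if op == "AND":
        return all(evaluate(child, seq) for child in children)
    if op == "OR":
        return any(evaluate(child, seq) for child in children)
    if op == "NOT":
        if len(children) != 1:
            raise ValueError("NOT takes exactly one child")
        return not evaluate(children[0], seq)
    raise ValueError(f"unknown operator: {op}")


# Illustrative rule: GATA is present AND (CAGC is present OR TTTT is absent).
rule = ("AND", "GATA", ("OR", "CAGC", ("NOT", "TTTT")))
print(evaluate(rule, "ACGATAGGCAGCTT"))
print(evaluate(rule, "GATATTTT"))
```

A single rule of this kind can express co-occurrence or mutual exclusion of several k-mers, which a plain k-mer frequency vector cannot state directly.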

2 Related Works

Early enhancer prediction approaches employ motif profile search and comparative genomics. Motif profile search utilizes annotated motif databases such as JASPAR or TRANSFAC to construct statistical or supervised learning models for predicting associated sites, while the comparative genomics approach identifies evolutionarily conserved regions using multiple sequence alignment algorithms. Supervised/statistical models tend to return many false positives because their representation model is non-specific and a matching cut-off value is difficult to determine. While conservation analysis is useful, it can only detect evolutionarily conserved sites. Therefore, machine learning methods which utilize different features related to binding sites have been proposed [6,7]. These methods employ features associated with enhancer sites or regions for supervised algorithm training. Ultimately, the set of features used in training determines the prediction accuracy.

Significant findings revealed that enrichment of H3K4me1 correlates strikingly with enhancers, with the distance between them being approximately 100 bp to 1500 or 2000 bp [5], [7]. Thus, there is an increasing need to locate DNA sequences enriched with these modified histones; experimentally, ChIP-chip or ChIP-seq is used to produce high-resolution mapping and profiling [4,5]. However, these experimental techniques are tedious and expensive, so histone modification information is not readily available for all organisms. Therefore, the key approach of this paper is to propose a computational method for determining and characterizing the DNA locations of histone modification marks.

Characterizing histone modification marks using computational methods has proven successful with different feature representation methods. A combination of content (k-mer frequency) and context (distance from gene) based features is used by [9] to predict H3K4me1 in the yeast genome. The study showed that utilizing both features could achieve a high H3K4me1 prediction accuracy of 90.86%, while only 72.61% or less is achieved when small k-mer frequency features (less than 9-mer) are used. Clearly, simple k-mer features are insufficient to represent DNA sequences with histone modification enrichment. Thus, a new method to represent H3K4me1 sequences with complex combinations of nucleotides is proposed instead of just employing fixed k-mer frequency.
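For reference, the fixed k-mer frequency representation discussed above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `kmer_frequencies`, the normalization by the number of sliding windows, and the lexicographic ordering of k-mers are choices made here and are not taken from [9] or from this paper.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_frequencies(seq: str, k: int) -> List[float]:
    """Normalized frequency of every k-mer over the alphabet ACGT.

    The vector has 4**k entries in lexicographic order (AA, AC, AG, ...).
    Windows containing other symbols (e.g. N) count toward the total
    but do not contribute to any entry.
    """
    seq = seq.upper()
    n_windows = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    total = max(n_windows, 1)  # avoid division by zero for short sequences
    return [counts["".join(p)] / total for p in product("ACGT", repeat=k)]


# Example: 2-mer frequencies of a short sequence (16-dimensional vector).
vec = kmer_frequencies("ACGTAC", 2)
print(len(vec), vec[1])  # vec[1] is the frequency of "AC"
```

Each entry records only how often one fixed k-mer occurs, so the vector carries no information about how different k-mers relate to one another, and its length grows as 4^k.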


[Figure: overview of the framework. Panel A: Preparation of dataset (H3K4me1 and random region extraction based on coordinate). Panel B: Genetic Algorithm for feature generation. Panel C: Classification using Support Vector Machine. Other labels: Genome; Coordinate; Extracted DNA; Tree features generated by GA; Compute matching; Top features selected based on selection criteria; Genmax; Convert tree feature to linear form; Feature vector; Positive and negative sequences for training and ...]