Employing Genetic Algorithm to Construct Epigenetic Tree-Based Features for Enhancer Region Prediction

Pui Kwan Fong1, Nung Kion Lee1,∗, and Mohd Tajuddin Abdullah2

1 Department of Cognitive Sciences, Universiti Malaysia Sarawak, Kota Samarahan, Malaysia
2 Center of Tasik Kenyir EcoSystem, Universiti Malaysia Terengganu, Kuala Terengganu, Malaysia
[email protected]

Abstract. This paper presents a GA-based method for generating novel logic-based features, represented by parse trees, from DNA sequences enriched with H3K4me1 histone signatures. Current methods, which mostly rely on k-mer content features, cannot represent the potentially complex interactions among DNA segments in H3K4me1 regions. We hypothesize that modeling such complex interactions is important for recognizing H3K4me1 marks. Our proposed method employs a tree structure to model the logical relationships between k-mers from the marks. To benchmark the generated features, we compare them with the commonly used k-mer content features on the mouse (mm9) genome dataset. Our results show that the logical rule features improve performance in terms of F-measure on all the datasets tested.

Keywords: Genetic Algorithm, Feature extraction, Histone modifications.

1 Introduction

Initiation of gene transcription involves a variety of regulatory elements, and locating cis-regulatory elements improves our understanding of complex gene regulation. One essential cis-regulatory element, the enhancer, comprises clusters of transcription factor binding sites (TFBS), each spanning about 6 to 20 base pairs (bp). An enhancer can regulate the expression of genes located ten to a hundred thousand bp away, regardless of its location [1]. Locating enhancer regions remains a challenging task because enhancers are distant-acting and consist of short DNA sequences. In addition, enhancer binding sites degenerate easily while retaining their original function [2]. It is therefore difficult to find a general sequence pattern that represents a specific type of enhancer.

Pioneering computational methods focus on motif profile searches to discover TFBS. The review in [3] highlights that these methods achieve high prediction accuracy only for lower organisms and often produce many false positive hits for complex organisms. Recently, advances in ChIP-chip and ChIP-seq techniques for genome-wide mapping of epigenetic marks [4,5] have facilitated the use of these features for enhancer prediction. Epigenetic marks such as histone modifications are prominent landmarks for enhancer identification, and these features are widely used [6,7] with different representation approaches, which can strongly affect the prediction results.

In this paper, we hypothesize that DNA features co-exist, that their interactions are complex, and that they need to be represented by higher-order features as opposed to content information alone, such as k-mer frequency [7]. We propose a framework for modelling complex parse tree features using Genetic Algorithms (GA) [8]. Tree features generated by this framework are scrutinized for their ability to discriminate H3K4me1 enrichment in DNA sequences.

∗ Corresponding author.

C.K. Loo et al. (Eds.): ICONIP 2014, Part III, LNCS 8836, pp. 390–397, 2014. © Springer International Publishing Switzerland 2014

2 Related Works

Early enhancer prediction approaches employ motif profile search and comparative genomics. Motif profile search uses annotated motif databases such as JASPAR or TRANSFAC to construct statistical or supervised learning models for predicting associated sites, while the comparative genomics approach identifies evolutionarily conserved regions using multiple sequence alignment algorithms. Supervised and statistical models tend to return many false positives because their representation model is non-specific and a suitable matching cut-off value is difficult to determine. Conservation analysis is useful, but it can detect only evolutionarily conserved sites. Machine learning methods that utilize different features related to binding sites have therefore been proposed [6,7]. These methods use features associated with enhancer sites or regions to train supervised algorithms. Ultimately, the set of features used in training determines the prediction accuracy.

Significant findings have revealed that H3K4me1 enrichment correlates strikingly with enhancers, with the distance between them being approximately 100 bp to 1500 or 2000 bp [5,7]. There is thus an increasing need to locate DNA sequences enriched with these modified histones; experimentally, ChIP-chip or ChIP-seq is used to produce high-resolution mapping and profiling [4,5]. However, these experimental techniques are tedious and expensive, so histone modification information is not readily available for all organisms. The key aim of this paper is therefore to propose a computational method for determining and characterizing the DNA locations of histone modification marks.

Computational methods have successfully characterized histone modification marks using different feature representations. A combination of content-based (k-mer frequency) and context-based (distance from gene) features is used in [9] to predict H3K4me1 in the yeast genome. That study showed that using both feature types could achieve a high H3K4me1 prediction accuracy of 90.86%, whereas 72.61% or less is achieved when only small k-mer frequencies (features shorter than 9-mers) are used. Clearly, simple k-mer features are insufficient to represent DNA sequences with histone modification enrichment. We therefore propose a new method that represents H3K4me1 sequences with complex combinations of nucleotides instead of fixed k-mer frequencies alone.
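To make the k-mer content baseline concrete, the following minimal Python sketch computes relative k-mer frequencies for one DNA sequence. It is an illustration only: the function name `kmer_frequencies`, the use of overlapping windows, and the decision to skip windows containing ambiguous bases are assumptions, not details taken from [7] or [9].

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence: str, k: int) -> dict[str, float]:
    """Return the relative frequency of every possible k-mer over A, C, G, T.

    Hypothetical helper for illustration; not the cited authors' code.
    """
    sequence = sequence.upper()
    # Count each overlapping window of length k. Windows that contain
    # ambiguous bases (for example N) are skipped (an assumption).
    counts = Counter(
        sequence[i:i + k]
        for i in range(len(sequence) - k + 1)
        if set(sequence[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    # Fixed-length result: one entry per possible k-mer, in lexicographic order.
    return {
        "".join(kmer): (counts["".join(kmer)] / total if total else 0.0)
        for kmer in product("ACGT", repeat=k)
    }


# Example: dinucleotide (2-mer) frequencies of a short toy sequence.
freqs = kmer_frequencies("ACGTACGTAC", k=2)
```

The resulting dictionary has 4^k entries, so the feature vector grows exponentially with k. Each entry also describes a single k-mer in isolation, which is the limitation that motivates features capturing combinations of k-mers.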

[Figure: overview of the proposed framework; caption not recoverable. Panel A, Preparation of dataset: genome; coordinates; H3K4me1 and random region extraction based on coordinate; extracted DNA. Panel B, Genetic Algorithm for feature generation: tree features generated by GA; genetic; Genmax; compute matching; top features selected based on selection criteria. Panel C, Classification using Support Vector Machine: positive and negative sequences for training and ...; convert tree feature to linear form; compute matching; feature vector.]
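As a rough illustration of the tree-based features described in the abstract, the sketch below represents a logical rule over k-mers as a parse tree, evaluates it against a DNA sequence, and builds a binary feature vector from several trees. Everything here is an assumption made for illustration: the operator set (AND, OR, NOT), the rule that a leaf matches when its k-mer occurs anywhere in the sequence, the infix "linear form", and all class and function names are hypothetical and may differ from the authors' actual design.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Kmer:
    """Leaf node: true when the k-mer occurs anywhere in the sequence."""
    motif: str


@dataclass(frozen=True)
class Not:
    child: "Node"


@dataclass(frozen=True)
class And:
    left: "Node"
    right: "Node"


@dataclass(frozen=True)
class Or:
    left: "Node"
    right: "Node"


Node = Union[Kmer, Not, And, Or]


def matches(node: Node, sequence: str) -> bool:
    """Recursively evaluate the logical tree against one DNA sequence."""
    if isinstance(node, Kmer):
        return node.motif in sequence
    if isinstance(node, Not):
        return not matches(node.child, sequence)
    if isinstance(node, And):
        return matches(node.left, sequence) and matches(node.right, sequence)
    if isinstance(node, Or):
        return matches(node.left, sequence) or matches(node.right, sequence)
    raise TypeError(f"unknown node type: {type(node).__name__}")


def to_linear(node: Node) -> str:
    """Write the tree as a parenthesized infix string (assumed linear form)."""
    if isinstance(node, Kmer):
        return node.motif
    if isinstance(node, Not):
        return f"(NOT {to_linear(node.child)})"
    if isinstance(node, And):
        return f"({to_linear(node.left)} AND {to_linear(node.right)})"
    if isinstance(node, Or):
        return f"({to_linear(node.left)} OR {to_linear(node.right)})"
    raise TypeError(f"unknown node type: {type(node).__name__}")


def feature_vector(trees: list[Node], sequence: str) -> list[int]:
    """One binary feature per tree: 1 if the rule matches the sequence."""
    return [int(matches(tree, sequence)) for tree in trees]


# Toy rules over 3-mers (motifs chosen arbitrarily for the example).
rule_a = And(Kmer("ACG"), Or(Kmer("TTA"), Not(Kmer("GGG"))))
rule_b = Not(Kmer("GGG"))
vector = feature_vector([rule_a, rule_b], "TTACGA")
```

Under these assumptions, a GA would search over such trees (for example by swapping subtrees or mutating leaves), and the binary vectors would serve as input to a classifier such as the Support Vector Machine named in the framework figure.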
