2009 International Conference on Computers and Devices for Communication

CIS

Identification and Analysis of Coding and Noncoding Regions of a DNA Sequence by Positional Frequency Distribution of Nucleotides (PFDN) Algorithm

M. Roy, S. Biswas and S. Barman (Mandal)

Abstract: During the last several years, substantial progress has been made in developing high-throughput experimental techniques that produce large amounts of genomic data pertaining to molecular activities in cells. Consequently, a great deal of research is focused on addressing important problems in molecular biology by analyzing these data using mathematical and computational approaches. Genomic signal processing has been an active area of research for the past two decades and has increasingly attracted the attention of researchers from the digital signal processing community all over the world. An important step in genomic annotation is to identify the protein coding regions of a DNA sequence, especially in the study of eukaryotic genomes. Because exons and introns lack obvious distinguishing sequence features, separating protein coding regions from non-coding regions effectively is a challenging problem. A variety of computational algorithms have been developed to predict exons, most of them based on statistical methods. The signal processing approaches of recent years may identify hidden periodicities and features that cannot be revealed easily by conventional statistical methods. In this paper the authors present an algorithm that separates coding regions from non-coding regions based on the positional frequency distribution of nucleotides. The results show that exon regions exhibit more random behavior than intron regions. Such behavior was also observed in FFT power spectrum analysis of DNA sequences. Case studies on genes from different organisms show that the algorithm is an effective approach to exon prediction.

Index Terms – DNA, Fourier transform, Fast Fourier transform, Discrete Fourier transform, Genomic signal processing.

The first and second authors are with Women’s Polytechnic, Chandannagar, Govt. of W.B. The third author is with the Institute of Radio Physics & Electronics, University of Calcutta.

978-81-8465-152-2/09/$26.00©2009 CODEC

I. INTRODUCTION

A DNA sequence is a long molecule that carries genetic information. However, only those segments of the DNA molecule that contain relevant information are called genes. The problem of gene recognition is to define an algorithm that takes a DNA sequence as input and produces as output a feature table describing the location and structure of the patterns in the sequence. Many attempts have been made to find the informational content of DNA sequences. Fig. 1 shows a simple schematic of part of a DNA molecule, with the double helix straightened out for simplicity [1].

Fig.1: DNA double helix structure

The four bases or nucleotides attached to the sugar phosphate backbone are denoted by the letters A, C, G and T (Adenine, Cytosine, Guanine and Thymine). The base A always pairs with T, and C with G. The two strands of the DNA molecule are therefore complementary to each other. The forward genome sequence corresponds to the upper strand of the DNA molecule. The RNA (ribonucleic acid) molecule is closely related to DNA. It also has four bases, but Thymine is replaced by Uracil (denoted by U). RNA molecules are short-lived single-stranded molecules that cells use as temporary copies of portions of DNA in the course of transcription from DNA to mRNA. The example DNA sequence taken here is ATTCATAGT. The ordering is from the so-called 5’ end to the 3’ end (left to right). The DNA sequence can be divided into genes and intergenic spaces [2], as shown in Fig. 2. A gene is in turn subdivided into exons and introns. Even though all the cells in an organism have identical genes, only selected subsets are active in any family of cells. Only exons are involved in protein coding. The bases in the exons can be divided into groups of three adjacent bases, called codons.
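The base pairing and codon grouping described above can be illustrated with a short Python sketch applied to the example sequence ATTCATAGT. The helper names (`complement_strand`, `to_rna`, `codons`) are hypothetical and chosen only for this illustration; they do not come from the paper.

```python
# Illustrative helpers only; the names are assumptions, not from the paper.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_strand(dna: str) -> str:
    """Base-by-base complement (A<->T, C<->G), kept in the same left-to-right order."""
    return "".join(COMPLEMENT[b] for b in dna.upper())


def to_rna(dna: str) -> str:
    """Write the sequence in the RNA alphabet: Thymine (T) replaced by Uracil (U)."""
    return dna.upper().replace("T", "U")


def codons(dna: str) -> list[str]:
    """Split into consecutive groups of three bases, dropping an incomplete tail."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]


seq = "ATTCATAGT"
print(complement_strand(seq))  # TAAGTATCA
print(to_rna(seq))             # AUUCAUAGU
print(codons(seq))             # ['ATT', 'CAT', 'AGT']
```

Each codon is one of 4 × 4 × 4 = 64 possible three-letter words over the alphabet {A, C, G, T}.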

There are 64 possible codons. A coding region always starts with the start codon ATG and ends with one of the three stop codons TAG, TGA or TAA.

Fig.2: Coding & Non-coding region of DNA sequence.

Each codon instructs the cell machinery to synthesize an amino acid. The codon sequence uniquely identifies an amino acid sequence, which defines a protein. There are 20 amino acids, hence the mapping from codons to amino acids is many to one, as shown in the mapping table (Fig. 3).

Fig.3: Mapping Table

The introns do not take part in protein synthesis because they are removed in the process of splicing. The main problem with gene detection approaches is that an algorithm that succeeds for certain genes does not work for others. The development of a universal gene locating algorithm is still a challenging research problem. A variety of gene prediction algorithms have been developed using signal processing methods (Tiwari et al., 1997; Anastassiou, 2001; Guan et al., 2004; Yin et al., 2006). Most DSP-based gene finding algorithms rely on the 3-base hidden periodicity, which is identified as a pronounced peak at the frequency N/3 of the Fourier power spectrum of the DNA sequence (N is the length of the DNA sequence). This periodicity is prevalent in most protein coding regions but does not exist in non-coding regions. This paper presents an algorithm based on the positional frequency distribution of nucleotides in a DNA sequence to identify exons and introns accurately. The algorithm is tested on known genes from several organisms. Case studies indicate that the method described in this paper is an effective protein coding region prediction method in terms of accuracy and efficiency. The results of this algorithm have also been compared with those of FFT power spectrum based protein coding region identification methods, and the comparison gave a satisfactory result.

II. ALGORITHM FOR EXON – INTRON PREDICTION BY POSITIONAL FREQUENCY DISTRIBUTION OF NUCLEOTIDE (PFDN)

For a DNA sequence of length N, find the frequency distribution of each nucleotide at the three different positions of the N/3 codons. The algorithm to compute the frequency distribution of nucleotides is as follows:

1. Set three position-pointers i0, i1, i2 to indicate the three nucleotide positions of each codon.
2. Compute the frequency distribution of each nucleotide at positions i0, i1, i2.
3. Set the three position-pointers to the next codon position.
4. Repeat step 2 and step 3 until the position-pointers reach the N/3 codon position.
5. Make a comparative analysis of the positional frequency distribution of nucleotides in intron and exon regions.

The above algorithm has been tested on the Human Beta Globin and C-Elegan genome databases.

III. FFT ANALYSIS OF DNA SEQUENCES

For a DNA string x[n] with N nucleotide characters (a, t, c, g) we generate four binary indicator sequences xa[n], xc[n], xg[n], xt[n] [7]. For example, for

x[n] = [a t t c c g a g g c a]

the binary indicator sequences are given by:

xa[n] = [1 0 0 0 0 0 1 0 0 0 1]
xt[n] = [0 1 1 0 0 0 0 0 0 0 0]
xg[n] = [0 0 0 0 0 1 0 1 1 0 0]
xc[n] = [0 0 0 1 1 0 0 0 0 1 0]

Let Xa[k], Xt[k], Xg[k] and Xc[k] be the FFT of the corresponding binary sequences, given by

$$X_S[k] = \sum_{n=0}^{N-1} x_S[n] \, e^{-j 2\pi n k / N}, \qquad S = a, g, t, c, \quad k = 0, 1, 2, \ldots, N-1.$$
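As a rough illustration of the indicator sequences and the transform above, the following Python sketch builds the four binary sequences and sums the squared DFT magnitudes over the four nucleotides, which is the total power spectral content used in this section. It assumes NumPy is available; the function names `indicator_sequences` and `power_spectrum` are hypothetical and not taken from the paper.

```python
import numpy as np


def indicator_sequences(dna: str) -> dict[str, np.ndarray]:
    """Return one 0/1 array per nucleotide: 1 where that nucleotide occurs."""
    dna = dna.lower()
    return {s: np.array([1.0 if b == s else 0.0 for b in dna]) for s in "atgc"}


def power_spectrum(dna: str) -> np.ndarray:
    """Sum over S in {a, t, g, c} of |X_S[k]|^2 for k = 0 .. N-1.

    np.fft.fft uses the same exp(-j*2*pi*n*k/N) kernel as the formula above.
    """
    seqs = indicator_sequences(dna)
    return sum(np.abs(np.fft.fft(x)) ** 2 for x in seqs.values())


x = "attccgaggca"
ind = indicator_sequences(x)
print(ind["a"].astype(int))  # [1 0 0 0 0 0 1 0 0 0 1]

# A toy sequence with exact period 3 concentrates its non-DC power at k = N/3.
toy = "atg" * 10
spec = power_spectrum(toy)
peak = int(np.argmax(spec[1 : len(toy) // 2 + 1])) + 1
print(peak, len(toy) // 3)
```

The toy sequence is only a sanity check of the period-3 peak at k = N/3; it is not one of the genes studied in the paper.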


Then

$$PS[k] = \sum_{S \in \{a, g, c, t\}} |X_S[k]|^2$$

gives the total power spectral content at k. The power spectrum may be used as a preliminary indicator to detect probable coding regions in DNA sequences. The power spectrum reveals pronounced randomness in exon regions compared to that of intron regions.

IV. RESULTS AND DISCUSSIONS

Tables 1 and 2 show the detailed results for the known Human Beta Globin exon and intron regions. The bar graphs (Fig. 4 and Fig. 5) directly reflect the table data. Inspection of the bar graphs reveals more randomness in the positional frequency distribution of nucleotides in the exon region, whereas such random behavior is almost absent in the intron region. More specifically, in the intron segments the normalized frequency of each nucleotide remains almost the same across all three codon positions, whereas in the exon regions the normalized frequency of each nucleotide fluctuates strongly from one codon position to the next. Similar characteristics have also been observed in the FFT power spectrum analysis of both coding and non-coding regions. The DNA spectrum given by PS[k] has been advocated as a measure that discriminates between coding and non-coding regions. The plots show a sharp contrast between the spectra of coding and non-coding regions of the DNA sequence. The introns have a flat Fourier spectrum devoid of any periodicity, in contrast to the exons, which show peaks, as plotted in Figs. 6 and 7. We have also observed similar behavior in the C-Elegan Chromosome III database, as shown in Figs. 8, 9, 10 and 11. Case studies indicate that the PFDN algorithm described in this paper is an effective approach to predict protein coding regions of a DNA sequence.
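The PFDN steps of Section II and the comparison discussed above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `pfdn` and `positional_spread` are hypothetical, the counts are assumed to be normalized by the total number of bases (which is consistent with the tabulated values summing to roughly one), and the max-minus-min "spread" is just one simple way to quantify the fluctuation across codon positions.

```python
from collections import Counter


def pfdn(dna: str) -> dict[str, list[float]]:
    """Normalized count of each nucleotide at codon positions 0, 1 and 2."""
    dna = dna.upper()
    n = len(dna) - len(dna) % 3            # use whole codons only
    if n == 0:
        raise ValueError("sequence must contain at least one codon")
    counts = Counter()
    for start in range(0, n, 3):           # move the pointers codon by codon
        for pos in range(3):               # pointers i0, i1, i2
            counts[(dna[start + pos], pos)] += 1
    # Assumption: normalize by the total number of bases used.
    return {b: [counts[(b, p)] / n for p in range(3)] for b in "ATCG"}


def positional_spread(freqs: dict[str, list[float]]) -> dict[str, float]:
    """Max minus min over the three codon positions, per nucleotide."""
    return {b: max(v) - min(v) for b, v in freqs.items()}


# Values as reported in the paper for Human Beta Globin (exon 223 bp, intron 850 bp).
exon = {"A": [0.0090, 0.081, 0.1081], "T": [0.099, 0.0495, 0.1036],
        "C": [0.1081, 0.0855, 0.0675], "G": [0.1171, 0.1171, 0.0540]}
intron = {"A": [0.1001, 0.0907, 0.0895], "T": [0.1307, 0.1354, 0.1425],
          "C": [0.053, 0.0553, 0.0541], "G": [0.0494, 0.0518, 0.0471]}

print(positional_spread(exon))
print(positional_spread(intron))
```

For the reported values, the smallest exon spread is larger than the largest intron spread, which matches the qualitative observation in the text. Whether such a threshold generalizes to other genes is not established by this sketch.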

Table 1: PFDN for Exon region of Human Beta Globin taking 223 bp.

Nucleotide    Position 0    Position 1    Position 2
A             0.0090        0.081         0.1081
T             0.099         0.0495        0.1036
C             0.1081        0.0855        0.0675
G             0.1171        0.1171        0.0540

Table 2: PFDN for Intron region of Human Beta Globin taking 850 bp.

Nucleotide    Position 0    Position 1    Position 2
A             0.1001        0.0907        0.0895
T             0.1307        0.1354        0.1425
C             0.053         0.0553        0.0541
G             0.0494        0.0518        0.0471

Fig.4: PFDN for Exon of Human Beta Globin 223 bp (bar graphs for nucleotides A, T, C and G at codon positions 0, 1 and 2).

Fig.5: PFDN for Intron region of Human Beta Globin taking 850 bp (bar graphs for nucleotides A, T, C and G at codon positions 0, 1 and 2).

Fig.6: FFT power spectrum of exon length 223 bp of Human Beta Globin (power spectrum versus frequency).

Fig.7: FFT power spectrum of intron length 850 bp of Human Beta Globin (power spectrum versus frequency).

Fig.8: PFDN for Exon region of C-Elegan Chromosome III taking 111 bp.

Fig.9: PFDN for Intron of C-Elegan Chromosome III taking 1254 bp.

Fig.10: FFT power spectrum vs frequency plot of exon length 111 bp of C-Elegan Chromosome III.

Fig.11: FFT power spectrum vs frequency plot of Intron length 1254 bp of C-Elegan Chromosome III.

REFERENCES

[1] Vaidyanathan, P.P., Yoon, B.J., “The role of signal-processing concepts in genomics and proteomics”, Journal of the Franklin Institute, special issue on Genomics, 2004.

[2] Tuqan, J., Rushdi, A., “A DSP based approach for finding the codon bias in DNA sequences”, IEEE journal on signal processing, vol. 2, no. 3, June 2008.

[3] Anastassiou, D., “Frequency-domain analysis of biomolecular sequences”, Bioinformatics 16, 1073-1081.

[4] Fickett, J.W., Tung, C.S., “Assessment of protein coding regions in DNA sequences”, Nucleic Acid Res, 10, 5303-5318, 2000.

[5] Tiwari, S., Ramachandran, S., Bhattacharya, A., Bhattacharya, S., and Ramaswamy, R., “Prediction of probable genes by Fourier analysis of genomic sequences”, CABIOS, vol. 3, no. 3, 263-270, 1997.

[6] Yin, C., Stephen, S., Yau, T., “Prediction of protein coding regions by the 3 base periodicity analysis of a DNA sequence”, Journal of Theoretical Biology 247, 687-694, 2007.

[7] Achuthsankar Nair, S. and Sreenadhan, S., “An improved digital filtering technique using nucleotide frequency indicators for locating exons”, Journal of CSI, Vol. 36, No. 1, Jan.-Mar., 20