NEW APPROACHES TO GENOME SEQUENCE ANALYSIS BASED ON DIGITAL SIGNAL PROCESSING John A. Berger* , Sanjit K. Mitra
Marco Carli, Alessandro Neri
University of California, Santa Barbara Dept. of Electrical and Computer Engineering Santa Barbara, CA 93106-9560
University of Rome TRE Department of Electrical Engineering via della Vasca Navale, 84 00146 Rome - Italy
e-mail:
(berger,mitra)@ece.ucsb.edu
* This work is supported in part by a University of California MICRO grant with matching support from Lucent Technologies, Philips Research Laboratories, and Microsoft Corp.

ABSTRACT

In this paper, two tools based on standard digital signal processing approaches are presented for analyzing long-range correlations in DNA sequences. To illustrate the evolution of a DNA sequence, a three-dimensional DNA walk is introduced, and a Gauss-wavelet-based analysis is performed to reveal the fractal behavior of the walk sequence. The second tool explores characteristics of the data using a lossless, Huffman-based encoding technique. This technique compresses the DNA sequences and allows analysis to be performed in the encoded domain. Experimental results show the effectiveness of the proposed approaches and their promise for future work.

1. INTRODUCTION

Over the last few years, a massive, wide-ranging research effort has successfully deciphered the entire human genome sequence. The complete description of the human genome is roughly three billion characters in length. In the same period, scientists have completed the sequencing of a few other organisms, and the rate at which data is being acquired continues to grow rapidly. Interpreting the meaning of these genome sequences is one of the most exciting challenges facing scientists today.

Despite the extremely large size of the DNA sequences used to represent an organism's entire genome, transformations of the data may be used to handle such records in a much more manageable fashion. It is essential to find analysis tools or domain representations that are both statistically effective and easily accessible. One motivation, among many, for such analysis techniques is the study of the fractal behavior of DNA sequences. The goal is to determine if such behavior is responsible for long-range correlations in genome sequences.

Genomic information is inherently discrete in nature because there is a finite number of nucleotides in the DNA alphabet. This fact suggests that we may interpret the DNA sequence as a discrete-time sequence that can be studied using standard techniques from the field of digital signal processing. Although DNA sequences are truly symbolic signals, numerical assignments may be made for analysis purposes, but only after careful consideration.

In this paper, two new tools for DNA sequence analysis are presented that investigate a genome's fractal characteristics. Both techniques analyze numerical sequences in which designations are assigned to represent the nucleotides in an intuitive manner. In addition, the methods described are applicable to generic source DNA sequences. Consequently, the utility of these numerical designations is illustrated through specific examples.

This paper is organized as follows. Section 2 provides a brief overview of the purine-pyrimidine DNA walk. In Section 3, the proposed three-dimensional DNA walk is introduced and its application to visualizing DNA sequences is shown. In Section 4, a Gauss-wavelet-based analysis is used to analyze the fractal and scaling components in a DNA walk sequence. Next, a novel representation based on digital communications techniques is offered in Section 5. Finally, concluding remarks are given in Section 6.

2. DNA WALK - A REVIEW

Graphical representations of DNA sequences are useful because they allow visual observations of nucleotide composition, base pair patterns, and sequence evolution. A DNA sequence is composed of four bases: adenine (A), thymine (T), cytosine (C), and guanine (G). One important task in the study of genome sequences is to determine densities of specific nucleotides and to understand the implications for exons, or coding regions. Several methods for addressing
this problem graphically are presented in the literature [1]. The first step is to convert the four-letter genome alphabet into a numerical format. A previously proposed technique [1, 2] plots the purine (A, G) and pyrimidine (C, T) content within DNA sequences by using a two-dimensional DNA walk. Let us denote a DNA sequence of length N as X = {x[i]; i = 1, 2, . . . , N}. For a position k within this sequence, define the value x[k] = +1 if a pyrimidine is present or the value x[k] = −1 if a purine is present. Furthermore, let us denote the DNA walk sequence of length N as S = {s[i]; i = 1, 2, . . . , N}, where for any position k the value s[k] is the cumulative sum of the x[i] for 1 ≤ i ≤ k:

$$ s[k] = \sum_{i=1}^{k} x[i]. \tag{1} $$

From the sequence S, observations can be made regarding only the relative content of purines and pyrimidines. Although this observation is enlightening, the DNA walk concept may be extended to a more useful form by increasing the dimensionality of the numerical DNA sequence.

Fig. 1. The two-dimensional DNA walk for HUMDPI illustrates the relative content of purine and pyrimidine nucleotides within the sequence. This DNA walk is limited in its effectiveness for demonstrating the complete behavior of the sequence. (Axes: s(k) versus index k.)
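The purine-pyrimidine walk of Eq. (1) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `purine_pyrimidine_steps` and `dna_walk` are hypothetical, and the short input string is a made-up example rather than the HUMDPI sequence.

```python
from itertools import accumulate

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def purine_pyrimidine_steps(sequence):
    """Map each nucleotide to x[k]: +1 for a pyrimidine, -1 for a purine."""
    steps = []
    for base in sequence.upper():
        if base in PYRIMIDINES:
            steps.append(1)
        elif base in PURINES:
            steps.append(-1)
        else:
            # The binary walk of Section 2 has no value for other symbols.
            raise ValueError(f"unexpected symbol: {base!r}")
    return steps


def dna_walk(sequence):
    """Return s[1], ..., s[N]: the running sum of the steps, as in Eq. (1)."""
    return list(accumulate(purine_pyrimidine_steps(sequence)))


# Made-up example input (not a real genome fragment).
print(dna_walk("ACGTTC"))
```

Plotting the returned list against its index gives a curve of the kind shown in Fig. 1: a rising trend indicates a pyrimidine-rich region and a falling trend a purine-rich one.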
3. PROPOSED DNA WALK

If we increase the dimensionality of the numerical DNA sequence, the representation is no longer limited to purine and pyrimidine designations, or to other strictly binary classifications. Use of the complex plane still allows us to differentiate between the purines and pyrimidines, but it now allows further separation among the four nucleotides. For a position k within the DNA sequence, define the value x[k] = +1 for A, x[k] = −1 for G, x[k] = +i for T, and x[k] = −i for C. As before, the three-dimensional DNA walk elements are computed using Eq. (1). The resulting sequence S is complex-valued, and it readily reduces to the binary case since purines are limited to values on the real axis, while pyrimidines are restricted to the imaginary axis. It should be noted that the numerical designations for a symbolic signal should by no means be used to place an explicit numerical ordering on the nucleotides in terms of magnitude.

In the study of genome sequences, the entire nucleotide content may not be completely known. The proposed method treats this nucleotide ambiguity numerically: the quaternary code can be extended to a pentanary code by assigning the value x[k] = 0 to an unknown nucleotide element at position k. This additional symbol is introduced to deal with sequencing estimation problems, which are investigated in Section 4.

Fig. 2. The three-dimensional DNA walk for HUMDPI is shown. This graphical representation highlights the evolution of the DNA sequence and exposes trends in nucleotide composition. (Axis labels: Real Part of Sum, Imaginary Part of Sum, Cumulative Sum.)

To compare with results reported in [1, 3], the human desmoplakin I cDNA sequence (Accession HUMDPI), of length N = 9,588, has been chosen. The two-dimensional
DNA walk for the HUMDPI sequence is provided in Figure 1. For comparison, the three-dimensional DNA walk for the same sequence is provided in Figure 2.

An important statistical quantity characterizing any random walk is the root mean square fluctuation F[k] about the average of the displacement [2]. It is well known that long-range correlations may be observed if the fluctuations can be described by a power law, namely $F[k] \sim k^{\alpha}$ with $\alpha \neq 1/2$. The fluctuation is defined by

$$ F^2[k] \equiv E\left[ \left( \Delta s[k] - E\left[ \Delta s[k] \right] \right)^2 \right], \tag{2} $$

where $\Delta s[k] = s[k_0 + k] - s[k_0]$. Note that in Eq. (2), $E[\cdot]$ denotes an average over all positions $k_0$ in the walk sequence. Compared to the results presented in [2] for the purine-pyrimidine walk, the proposed method provides similar results for observing long-range correlations in introns. The following section proposes a way to analyze the fractal characteristics of the three-dimensional DNA walk.

4. GAUSS-WAVELET-BASED ANALYSIS

Currently, it is not entirely clear that genome sequences can be sufficiently characterized by long-range correlations, despite the appearance of dependencies among the nucleotides over ranges of 1,000 base pairs or more. Some authors [1, 2] refer to these observed dependencies of nucleotide arrangement as "patchiness," and there is no definitive conclusion regarding the significance of such findings. Over the last few years, researchers have tried to extract scale-independent or fractal information from the structure of the genome sequence. In general, the wavelet transform has proven to be well suited for characterizing the scaling properties of fractal objects. A good review of using wavelet analysis to study genome sequences can be found in Arneodo et al. [3].

In our work, we have chosen a Gaussian-based wavelet [4, 5]. Using the nth-order derivatives of the Gaussian function as the analyzing wavelets, $\psi^{(n)}(t) = d^n \left( e^{-t^2/2} \right) / dt^n$, the continuous wavelet transform (CWT) of a signal S(t) with respect to the wavelet ψ(t) is defined as

$$ W_{\psi} S(b, a) \equiv \int_{-\infty}^{\infty} S(t)\, \psi^{*}_{b,a}(t)\, dt, \tag{3} $$

where $a, b \in \mathbb{R}$, $a > 0$, S(t) represents the walk, and

$$ \psi_{b,a}(t) = \frac{1}{\sqrt{a}}\, \psi\!\left( \frac{t - b}{a} \right). \tag{4} $$

Denote the scale $a = 2^{-u}$ and the translation $b = k 2^{-u}$, where u and k belong to the integer set Z. The discrete wavelet transform (DWT) is simply determined by using the CWT with these values for a and b. The CWT of S(t) evaluated at the point $(k 2^{-u}, 2^{-u})$ on the time-scale plane is therefore a number that represents the correlation between the signal S(t) and the scaled and translated wavelet at that time-scale point.

The goal of the wavelet analysis is to extract information from the three-dimensional DNA walk in the transform domain. Given a DNA sequence with noise, or uncertainty in nucleotide composition, the wavelet analysis has been performed and the results have been compared to the wavelet transform of a known DNA sequence. Solely to illustrate the usefulness of the wavelet transform analysis, we proceed in a heuristic manner as follows.

Consider the three-dimensional walk for the HUMDPI sequence with 5% of the nucleotides randomly denoted as unknown, a common problem in DNA sequencing. By comparing the spectral characteristics of the sequence with unknown elements to those of the original sequence, we can observe the discrepancies in the spectral domain. The wavelet analysis is performed over M = 32 scales using the first-order derivative of the Gaussian function as the analyzing wavelet. Figure 3 shows the wavelet transform result for the uncorrupted HUMDPI sequence, and the patchiness of the sequence is evident. For comparison, Figure 4 shows the wavelet transform of the "noisy" HUMDPI sequence. Even for such a small percentage of unknown nucleotides in the original sequence, the effects are clearly noticeable.

Fig. 3. Wavelet transform analysis of the three-dimensional DNA walk for the original HUMDPI sequence using 32 scales. Density is represented as a function of position i within the DNA sequence and the absolute values of the coefficients are shown.
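The heuristic experiment of Section 4 can be approximated with the following Python sketch. It makes several assumptions that are not taken from the paper: the input is a short random sequence rather than HUMDPI, the unknown symbol is written as `N`, the scales are the integers 1 to 32 rather than dyadic values, the integral of Eq. (3) is replaced by a direct sum over integer positions with a truncated kernel, and samples outside the sequence are treated as zero. All function names are hypothetical.

```python
import math
import random
from itertools import accumulate

# Complex step assignment of Section 3; any other symbol (unknown) maps to 0.
STEP = {"A": 1 + 0j, "G": -1 + 0j, "T": 1j, "C": -1j}


def complex_walk(sequence):
    """Three-dimensional DNA walk: running sum of the complex steps, Eq. (1)."""
    return list(accumulate(STEP.get(base, 0j) for base in sequence.upper()))


def gauss_wavelet_1(t):
    """First-order derivative of the Gaussian exp(-t^2 / 2)."""
    return -t * math.exp(-0.5 * t * t)


def cwt(walk, scales, half_width=4.0):
    """Direct-sum approximation of Eq. (3) at integer translations b.

    Assumption: the kernel is truncated to |t - b| <= half_width * a and
    the walk is taken to be zero outside its support.
    Returns rows[scale_index][b] of complex coefficients.
    """
    n = len(walk)
    rows = []
    for a in scales:
        reach = int(math.ceil(half_width * a))
        offsets = range(-reach, reach + 1)
        # Scaled wavelet of Eq. (4), sampled at integer offsets t - b.
        kernel = [gauss_wavelet_1(j / a) / math.sqrt(a) for j in offsets]
        row = []
        for b in range(n):
            acc = 0j
            for j, w in zip(offsets, kernel):
                t = b + j
                if 0 <= t < n:
                    acc += walk[t] * w  # psi is real, so conjugation is a no-op
            row.append(acc)
        rows.append(row)
    return rows


def mask_unknown(sequence, fraction, seed=0):
    """Mark a random fraction of positions as unknown ('N' is an assumed symbol)."""
    rng = random.Random(seed)
    chars = list(sequence)
    count = int(round(fraction * len(chars)))
    for idx in rng.sample(range(len(chars)), count):
        chars[idx] = "N"
    return "".join(chars)


# Made-up random input standing in for a real sequence.
rng = random.Random(1)
seq = "".join(rng.choice("ACGT") for _ in range(120))
noisy = mask_unknown(seq, 0.05)
scales = range(1, 33)
clean_mag = [[abs(c) for c in row] for row in cwt(complex_walk(seq), scales)]
noisy_mag = [[abs(c) for c in row] for row in cwt(complex_walk(noisy), scales)]
diff = max(abs(x - y) for r, q in zip(clean_mag, noisy_mag) for x, y in zip(r, q))
print("largest change in coefficient magnitude:", diff)
```

The two magnitude arrays play the role of the plots in Figures 3 and 4. Because each unknown symbol contributes a zero step, it shifts every later value of the walk, which is one reason a small fraction of unknowns can still be visible across many scales.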
Fig. 4. This plot shows the absolute values of the wavelet coefficients for the three-dimensional noisy HUMDPI DNA walk sequence. Compared to Figure 3, the plot shows similar strand bias but the noise has produced corruption in the transform domain.

5. DIGITAL COMMUNICATIONS TECHNIQUE - HUFFMAN CODES

The second DNA analysis technique proposed in this contribution is based on a digital communications approach to dealing with DNA sequences. Genome sequence analysis presents many difficult problems for scientists. The obstacles involved in the sequencing process, for example, include dealing with large amounts of data, lacking complete a priori knowledge of the genome length, and recognizing nucleotide symbol identity with complete accuracy. These impediments are typical of those encountered in standard telecommunications problems.

By using a quaternary, real-valued DNA numerical sequence, we proceed to analyze the strings via the standard, lossless Huffman encoding technique [6]. For the kth element, x[k], of the sequence X, we denote x[k] = γ1 for A, x[k] = γ2 for T, x[k] = γ3 for C, and x[k] = γ4 for G. The Huffman encoding process is performed on X. This numerical designation allows for the efficient computation of occurrence probabilities of nucleotide triplets within the sequence, correlations among other nucleotides, and probable locations of nucleotide combinations within the entire genome. Working in the encoded domain will allow for a further reduction of analytical complexity if the sequences are very long. For a source symbol γi, we have code word K(γi) occurring with probability πi, where K is the coding of the source. The average length of the n code words is given by

$$ L = \sum_{i=1}^{n} d_i \pi_i, \tag{5} $$

where $d_i$ is the length of each individual code word. In general, L is larger than the symbol length of the original sequence, but the total number of code words will be less.

The human β-globin intergenomic sequence (Accession HUMHBB), of length N = 73,308, which is studied in [1, 7], is addressed here. The Huffman encoding algorithm reduces the number of symbols for this sequence from N = 73,308 in the sequence domain to N = 20,841 in the encoded domain. Although the codebook generated for the Huffman encoder is not unique for each sequence, knowing the symbol probabilities, and necessarily the codebook, a priori allows for a uniquely decodable sequence. Determining the correlations of the symbols in the encoded domain has not proven to be extremely useful, mainly because the codebook is not unique for each sequence. However, this transformation allows us to visualize DNA sequences from a new perspective. Consequently, this technique is worthy of mention not only because it hints at the value of information-theoretic techniques for studying DNA sequences, but also because it shows that compressing and then exploring symbolic strings from a digital communications perspective is applicable to DNA data.

6. CONCLUDING REMARKS
In this paper, two new tools for analyzing DNA sequences have been proposed. These techniques treat genome sequences as discrete sets of numbers, with each element representing a nucleotide. To deal with such a huge set of data, an intuitive transformation of characters to numbers has been performed. Accordingly, the sequences can be thought of as discrete-time signals and studied using conventional digital signal processing approaches. The numerical assignments strongly affect the results of the analysis. The tools proposed are based on an improvement of the DNA walk description technique and on a new Huffman-based encoding technique. Both have been tested on several DNA sequences and the results have been verified to match results reported in the literature.

7. REFERENCES

[1] A. Arneodo, Y. D'Aubenton-Carafa, B. Audit, E. Bacry, J. F. Muzy, and C. Thermes, "What can we learn with wavelets about DNA sequences?," Physica A, vol. 249, pp. 439–448, 1998.

[2] C.-K. Peng, S. V. Buldyrev, A. L. Goldberger, S. Havlin, F. Sciortino, M. Simons, and H. E. Stanley, "Long-range correlations in nucleotide sequences," Nature, vol. 356, pp. 168–170, 1992.

[3] A. Arneodo, E. Bacry, P. V. Graves, and J. F. Muzy, "Characterizing long-range correlations in DNA sequences from wavelet analysis," Physical Review Letters, vol. 74, no. 16, pp. 3293–3296, 1995.

[4] M. Carli, F. Coppola, G. Jacovitti, and A. Neri, "Translation, orientation, and scale estimation based on Laguerre-Gauss circular harmonic pyramids," in SPIE Conference Photonics West, San Jose, Jan. 2002.

[5] G. Jacovitti and A. Neri, "Multiresolution circular harmonic decomposition," IEEE Transactions on Signal Processing, vol. 48, no. 11, pp. 3242–3247, 2000.

[6] J. Adámek, Foundations of Coding, John Wiley & Sons, Inc., New York, 1991, pp. 17–23.

[7] B. Alberts, D. Bray, J. Lewis, M. Raff, K. Roberts, and J. D. Watson, Molecular Biology of the Cell, Garland Publishing, Inc., New York, 1994, pp. 98–104.