Genome Informatics 12: 510–511 (2001)
Quadtree Representation of DNA Sequences
Natsuhiro Ichinose
Tetsushi Yada
Toshihisa Takagi
Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokane-dai, Minato-ku, Tokyo 108-8639, Japan
Keywords: quadtree representation, chaos game representation, frequency of oligo nucleotides
1 Introduction
Quadtree representation is a method for representing the frequency distribution of all oligonucleotides in long DNA sequences. The method is theoretically equivalent to chaos game representation [2, 3], which represents orbits of the chaos game system (or iterated function system [1]) driven by the DNA sequences. Since the relation between a point in the orbit and a subsequence in the driving DNA sequence is continuous and one-to-one [1], the frequency distribution of the orbit is equivalent to that of successive nucleotides.

Figure 1 shows an example of the configuration of 2-mer nucleotides in the quadtree representation. In this configuration the frequency of each 2-mer is represented on the corresponding area by color or gray scale. The 2-mers are arranged hierarchically: the bottom-left four areas correspond to a first nucleotide of “A”, the bottom-right areas to “G”, the top-left areas to “C” and the top-right areas to “T”. The area corresponding to each first nucleotide is divided in four, and each second nucleotide is assigned to one of the divided areas in the same way. If the distribution of 3-mers or longer nucleotides is considered, each area is again divided in four and each nucleotide is assigned to one of the subareas hierarchically.

    y
    3  CC CT TC TT
    2  CA CG TA TG
    1  AC AT GC GT
    0  AA AG GA GG
       0  1  2  3  x

Figure 1: Configuration of 2-mer nucleotides in the quadtree representation.

The quadtree representation is a hierarchical representation of subsequences. This feature is important because information from distinct subsequences can be integrated. However, it has the problem that information about shorter nucleotides occupies a larger region and therefore dominates the picture. In this paper, therefore, we develop a method in which information about longer nucleotides (e.g. tandem repeats) can be extracted by filtering or biasing the information about shorter nucleotides.
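The hierarchical arrangement described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the names `X_BIT`, `Y_BIT`, `kmer_to_xy` and `layout` are hypothetical, and the bit assignment (x bit set for G or T, y bit set for C or T, first nucleotide as the most significant bit) is read off the layout of Figure 1.

```python
from itertools import product

# Assumed bit assignment, read off Figure 1:
# the x bit is 1 for G/T, the y bit is 1 for C/T.
X_BIT = {"A": 0, "C": 0, "G": 1, "T": 1}
Y_BIT = {"A": 0, "G": 0, "C": 1, "T": 1}


def kmer_to_xy(kmer):
    """Return the (x, y) cell of a k-mer.

    Each further nucleotide subdivides the current area in four, so the
    first nucleotide ends up as the most significant bit of x and y.
    """
    x = y = 0
    for base in kmer:
        x = 2 * x + X_BIT[base]
        y = 2 * y + Y_BIT[base]
    return x, y


def layout(k):
    """Return the grid of all k-mers, rows ordered top to bottom (y descending)."""
    size = 2 ** k
    grid = [[None] * size for _ in range(size)]
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        x, y = kmer_to_xy(kmer)
        grid[y][x] = kmer
    return grid[::-1]


if __name__ == "__main__":
    # For k = 2 this prints the same arrangement as Figure 1.
    for row in layout(2):
        print(" ".join(row))
```

Calling `layout(3)` subdivides every cell once more, which is the 3-mer case mentioned in the text.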
2 Method and Results
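The successive nucleotides are counted according to the following equations:

\[ H(x, y) = \text{number of } \{\, x = x_n,\ y = y_n \mid n = L, L+1, \dots, N \,\} \tag{1} \]

\[ x_{n+1} = (2x_n + f(\sigma_n)) \bmod 2^L, \qquad f(\sigma) = \begin{cases} 0 & \sigma = \mathrm{A} \text{ or } \mathrm{C} \\ 1 & \sigma = \mathrm{G} \text{ or } \mathrm{T} \end{cases} \tag{2} \]

\[ y_{n+1} = (2y_n + g(\sigma_n)) \bmod 2^L, \qquad g(\sigma) = \begin{cases} 0 & \sigma = \mathrm{A} \text{ or } \mathrm{G} \\ 1 & \sigma = \mathrm{C} \text{ or } \mathrm{T} \end{cases} \tag{3} \]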
where $\sigma_n \in \{\mathrm{A}, \mathrm{G}, \mathrm{C}, \mathrm{T}\}$ is the $n$-th nucleotide of a DNA sequence of length $N$; $L$ is the length of the oligonucleotides; $x_n$ and $y_n$ are integers in $[0, 2^L)$ that together identify a subsequence (for example, $x_n = 1$ and $y_n = 0$ correspond to “AG” when $L = 2$, as shown in Fig. 1); and $H(x, y)$ is the number of occurrences of the oligonucleotide corresponding to the pair $(x, y)$. The frequency distribution is visualized as the logarithm of odds $h$:

\[ h(x, y) = \log_{10} \frac{H(x, y)}{E(x, y)} \tag{4} \]

where $E(x, y)$ is the expected value of $H(x, y)$ under some model. The choice of model for $E(x, y)$ acts as the filter mentioned in the previous section: to filter out the 2-mer information, we adopt a simple Markov model as the model of the expected value. Figures 2(a) and (b) show the quadtree representation of human chromosome 21 [4] without and with the filter, respectively: the expected value is uniform in the former, and is determined by the Markov model of the sequence in the latter.

Figure 2: Quadtree representation of human chromosome 21 (L = 9): (a) uncorrelated random model, (b) simple Markov model. Areas with high brightness indicate nucleotides with high frequency.
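As a minimal sketch of Eqs. (1) to (4), the counting and the log-odds can be written as follows. Several points are assumptions rather than details given in the text: the function names are hypothetical, the input is assumed to contain only A, C, G and T, the "simple Markov model" is read as a first-order Markov chain estimated from the sequence itself, and $h$ is returned only for cells with $H(x, y) > 0$, since the logarithm is undefined otherwise.

```python
import math
from collections import Counter

X_BIT = {"A": 0, "C": 0, "G": 1, "T": 1}  # f in Eq. (2)
Y_BIT = {"A": 0, "G": 0, "C": 1, "T": 1}  # g in Eq. (3)
BASE = {(X_BIT[b], Y_BIT[b]): b for b in "ACGT"}  # inverse of (f, g)


def count_cells(seq, L):
    """H(x, y) of Eq. (1): occurrences of each L-mer, keyed by its cell."""
    mask = (1 << L) - 1  # "& mask" is the same as "mod 2**L"
    x = y = 0
    H = Counter()
    for n, base in enumerate(seq, start=1):
        x = (2 * x + X_BIT[base]) & mask
        y = (2 * y + Y_BIT[base]) & mask
        if n >= L:  # the first full L-mer ends at position L
            H[(x, y)] += 1
    return H


def cell_to_kmer(x, y, L):
    """Decode a cell back to its L-mer, most significant bit first."""
    return "".join(BASE[((x >> i) & 1, (y >> i) & 1)]
                   for i in range(L - 1, -1, -1))


def expected_uniform(seq, L):
    """Uncorrelated random model: every L-mer is equally likely."""
    windows = len(seq) - L + 1
    return lambda kmer: windows / 4 ** L


def expected_markov(seq, L):
    """First-order Markov model (an assumed reading of 'simple Markov model')."""
    windows = len(seq) - L + 1
    mono = Counter(seq)                # single-nucleotide counts
    di = Counter(zip(seq, seq[1:]))    # dinucleotide counts
    out = Counter(seq[:-1])            # transitions leaving each base

    def expect(kmer):
        p = mono[kmer[0]] / len(seq)
        for a, b in zip(kmer, kmer[1:]):
            p *= di[(a, b)] / out[a]
        return windows * p

    return expect


def log_odds(seq, L, expected):
    """h(x, y) of Eq. (4) for every cell that occurs at least once."""
    E = expected(seq, L)
    return {(x, y): math.log10(c / E(cell_to_kmer(x, y, L)))
            for (x, y), c in count_cells(seq, L).items()}
```

For example, `log_odds(seq, 9, expected_uniform)` would correspond to the unfiltered setting of Fig. 2(a) and `log_odds(seq, 9, expected_markov)` to the filtered setting of Fig. 2(b), under the assumptions stated above. A higher-order filter would replace `expected_markov` with a model conditioned on more preceding nucleotides.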
3 Discussions
In Fig. 2(a) the frequency of the dinucleotide “CG” is low; this is a well-known phenomenon. Because the quadtree representation is hierarchical, a characteristic such as low “CG” frequency produces fractal-like patterns [3]. In Fig. 2(b) these influences of the 2-mer nucleotides are filtered out, and linear patterns can be observed instead (labeled 1 to 4 in Fig. 2(b)). Patterns 1 and 2 correspond to AT-rich and CG-rich sequences matching the regular expressions “[AT]+” and “[CG]+”, respectively. Patterns 3 and 4 both correspond to tandem repeats matching “([AG][CT])+”. Interestingly, the pattern “([AC][GT])+” was not observed, which implies that the distribution of the types of tandem repeats is skewed. Although this paper shows only the case of the 2-mer filter, the method can be extended to filters for longer nucleotides by using higher-order Markov models. In future work, we will extend the method to analyze statistical differences between distinct sequences.
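Why these regular expressions give linear patterns follows directly from Eqs. (2) and (3), and can be checked with a short sketch. The helper names are hypothetical and a small length (L = 4) is used so that all k-mers can be enumerated. For “[AT]+”, f and g agree on every nucleotide, so x = y (a diagonal); for “[CG]+” they always differ, so x + y = 2^L - 1 (the anti-diagonal); for “([AG][CT])+” the y bits alternate 0, 1, so y is constant (a horizontal line). The phase-shifted expression “([CT][AG])+” is included as an assumption: it is one possible reading of why two patterns (3 and 4) share one expression, and is not stated in the text.

```python
import re
from itertools import product

X_BIT = {"A": 0, "C": 0, "G": 1, "T": 1}  # f in Eq. (2)
Y_BIT = {"A": 0, "G": 0, "C": 1, "T": 1}  # g in Eq. (3)


def kmer_to_xy(kmer):
    """Map a k-mer to its quadtree cell, first nucleotide most significant."""
    x = y = 0
    for base in kmer:
        x = 2 * x + X_BIT[base]
        y = 2 * y + Y_BIT[base]
    return x, y


def cells_matching(pattern, L):
    """Cells of all L-mers that match the regular expression in full."""
    rx = re.compile(pattern)
    kmers = ("".join(p) for p in product("ACGT", repeat=L))
    return [kmer_to_xy(k) for k in kmers if rx.fullmatch(k)]


L = 4
diagonal = cells_matching("[AT]+", L)             # x == y
anti_diagonal = cells_matching("[CG]+", L)        # x + y == 2**L - 1
horizontal = cells_matching("([AG][CT])+", L)     # y == 0b0101
horizontal_shifted = cells_matching("([CT][AG])+", L)  # y == 0b1010 (assumed phase)
vertical = cells_matching("([AC][GT])+", L)       # x == 0b0101

print(sorted({y for _, y in horizontal}), sorted({y for _, y in horizontal_shifted}))
```

By the same argument “([AC][GT])+” would lie on a vertical line (constant x), which is the pattern reported as not observed.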
References
[1] Barnsley, M., Fractals Everywhere, Academic Press, Inc., 1988.
[2] Jeffrey, H.J., Chaos game representation of gene structure, Nucleic Acids Res., 21:2487–2491, 1990.
[3] Hao, B.-L., Lee, H.C., and Zhang, S.-Y., Fractals related to long DNA sequences and complete genome, Chaos, Solitons & Fractals, 11:825–836, 2000.
[4] Hattori, M., Fujiyama, A., Taylor, T.D., Watanabe, H., Yada, T., Park, H.S., Toyoda, A., Ishii, K., Totoki, Y., Choi, D.K., et al., The DNA sequence of human chromosome 21, Nature, 405:311–319, 2000.