Sequence Structure of Hidden 10.4-base Repeat in the Nucleosomes ...

Journal of Biomolecular Structure & Dynamics, ISSN 0739-1102 Volume 26, Issue Number 3, (2008) ©Adenine Press (2008)

Open Access Article

The authors, the publisher, and the right holders grant the right to use, reproduce, and disseminate the work in digital form to all users.

Sequence Structure of Hidden 10.4-base Repeat in the Nucleosomes of C. elegans

http://www.jbsdonline.com

Abstract

By measuring prevailing distances between YY, YR, RR, and RY dinucleotides in the large database of the nucleosome DNA fragments from C. elegans, the consensus sequence structure of the nucleosome DNA repeat of C. elegans was reconstructed: (YYYYYRRRRR)n. An actual period was estimated to be 10.4 bases. The pattern is fully consistent with the nucleosome DNA patterns of other eukaryotes, as established earlier, and, thus, the YYYYYRRRRR repeat can be considered as consensus nucleosome DNA sequence repeat across eukaryotic species. Similar distance analysis for [A, T] dinucleotides suggested the related pattern (TTTYTARAAA)n where the TT and AA dinucleotides display rather out of phase behavior, contrary to the “AA or TT” in-phase periodicity, considered in some publications. A weak 5-base periodicity in the distribution of TA dinucleotides was detected.

Key words: Nucleosome; Chromatin; 10-base periodicity; Nucleosome positioning; Distance analysis; Dinucleotides; RR/YY nucleosome pattern; AA/TT nucleosome pattern; Nucleosome sequence probe; DNA bendability; Bendability matrix.

F. Salih (1), B. Salih (1), E. N. Trifonov (1, 2, *)

(1) Genome Diversity Center, Institute of Evolution, University of Haifa, Mount Carmel, Haifa 31905, Israel

(2) Division of Functional Genomics and Proteomics, Faculty of Science, Masaryk University, Kamenice 5, Brno CZ-62500, Czech Republic

Introduction

Gene expression studies of the last decade have made clear what had long been strongly suspected: DNA sequence and nucleosome positioning are among the important factors in gene regulation (1). Accessibility of transcription factor binding sites crucially depends on nucleosome positioning (2). The nucleosomes are distributed in a highly nonrandom fashion around transcription start sites (3, 4). Replication depends on nucleosome positioning (5), and the nucleosomes are involved in epigenetic phenomena (5, 6). The reviving interest in the involvement of nucleosomes in gene expression calls for renewed efforts to elucidate the sequence rules responsible for nucleosome positioning, especially now that large nucleosome DNA sequence databases have become available (7).

Genomic DNA of C. elegans displays a strong ~10-base periodicity of AA and TT dinucleotides associated with the nucleosomes (7-9). Other dinucleotides also make a moderate contribution to the periodicity (7). Earlier studies of the sequence patterns in nucleosome DNA documented specific phase shifts between various dinucleotides within the ~10-base nucleosome sequence repeat (10-12). The phase preferences presumably reflect optimal orientations of the base pair stacks in the nucleosome DNA relative to the surface of the histone octamer (10, 13). These phase (orientational) preferences of all 16 dinucleotides within one period of the nucleosome DNA were first presented in the form of a “bendability matrix” (10). According to the matrix, the AA and TT dinucleotides, counter-phase to one another, are the major contributors to the periodical pattern of the nucleosome DNA. The matrix also indicates significant contributions from other RR and YY dinucleotides, likewise counter-phase.

* Phone: +972 4 828 8096; Fax: +972 4 824 6554; Email: [email protected]


274 Salih et al.

Later studies confirmed these early data (11, 12, 14), though the full picture of the consensus sequence pattern(s) of the ~10-base repeat still remains rather uncertain. With the advance of DNA sequencing, large databases of the sequences involved in the nucleosomes are now becoming available. The first such collection, of over 310,000 sequences, is provided by a massive study of the chromatin of C. elegans (7). In this work we subjected the sequences of the database to positional auto- and cross-correlation analyses and generated distance histograms that make it possible to locate the typical positions of RR, YY, YR, and RY dinucleotides within the 10.4-base repeat of the nucleosome DNA. The [A, T] distributions are analyzed as well.

Results and Discussion

[R, Y] Sequence Pattern of the 10.4 Base Repeat

In the terminology of signal processing, the occurrence of various distances between two like-named elements along the sequence is described by the autocorrelation function, which displays maxima at the distances that occur more often. Similarly, the cross-correlation function describes the distances between two different elements. In Figure 1, the preferred distances between YY dinucleotides and, respectively, YR, RR, RY (cross-correlations), and YY (autocorrelation) are presented in the form of distance histograms. The autocorrelation for YR dinucleotides is shown as well (Fig. 1E). For the calculation, an ensemble of nucleosome sequence fragments with a total length of 7.5 × 10^6 bases was randomly extracted from the 312,492 pyrosequencing reads [supplemental material to (7), see Methods]. For evaluating the nucleosome DNA periodicities this is a sufficiently large ensemble, as any set of 1.0 × 10^6 bases generates quantitatively indistinguishable patterns (data not presented).

Figure 1: Distribution of distances between [R, Y] dinucleotides in the nucleosome DNA fragments of C. elegans. The ordinates show actual counts in the ensemble of sequence fragments of total size 7.5 × 10^6 bases. A, histogram of distances between YY dinucleotides; B, distances from YY to YR dinucleotides; C, distances from YY to RR; D, distances from YY to RY dinucleotides; E, distances between YR dinucleotides.

The preferred distances (maxima) from the YY dinucleotide as reference point shift gradually downstream in the order YR, RR, RY (Fig. 1 A-D). The respective peak positions are listed in Table I. The autocorrelations YY to YY (Fig. 1A) and YR to YR (Fig. 1E) display a periodicity of about 10 bases. The phase shift of 5 bases between YY and RR dinucleotides (Fig. 1C) demonstrates that the nucleosome sequence pattern of C. elegans follows the same counter-phase YY/RR rule as other organisms (10, 12, 14), and is in full agreement with the alternating pattern (xYYYxxRRRx)n suggested earlier as a major component of the universal nucleosome sequence probe (6, 13). Other preferred distances, involving also the dinucleotides YR and RY (Fig. 1 and Table I), may be used to build a more specific 10-base pattern. Since the nucleosome possesses dyad symmetry, both the Watson and the Crick strands of the nucleosome DNA have to carry the same pattern. Therefore, both strands of the consensus double-stranded repeating unit should read the same.
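The distance histograms described above (and specified in Methods) can be illustrated with a minimal Python sketch. It is not the authors' program: the function names `to_ry` and `distance_histogram`, the handling of non-ACGT symbols, and the toy input are assumptions made for illustration. The sketch collapses each sequence to the R/Y alphabet, takes every occurrence of a heading dinucleotide, and counts how often a second dinucleotide starts d bases downstream, for d up to a maximal distance (60 bases in Methods).

```python
from collections import Counter

# Purine (R) / pyrimidine (Y) classes; any other symbol is kept as "N"
# so that it can never match an R/Y dinucleotide (an assumption).
RY_CLASS = {"A": "R", "G": "R", "C": "Y", "T": "Y"}


def to_ry(seq):
    """Collapse a DNA string to the two-letter R/Y alphabet."""
    return "".join(RY_CLASS.get(base, "N") for base in seq.upper())


def distance_histogram(seqs, first, second, max_dist=60):
    """Count downstream distances from `first` to `second` dinucleotides.

    Element d-1 of the returned list is the number of times `second`
    starts exactly d bases downstream of a `first` dinucleotide.
    """
    counts = Counter()
    for seq in seqs:
        ry = to_ry(seq)
        # Stop once the heading dinucleotide enters the last `max_dist`
        # bases, so that every distance 1..max_dist is equally observable.
        last_start = len(ry) - max_dist - 2
        for i in range(last_start + 1):
            if ry[i:i + 2] != first:
                continue
            for d in range(1, max_dist + 1):
                if ry[i + d:i + d + 2] == second:
                    counts[d] += 1
    return [counts[d] for d in range(1, max_dist + 1)]


# Toy input: three copies of an idealized repeat, not real nucleosome data.
toy = ["TTTTTAAAAA" * 3]
yy_to_yy = distance_histogram(toy, "YY", "YY", max_dist=10)  # autocorrelation
yy_to_rr = distance_histogram(toy, "YY", "RR", max_dist=10)  # cross-correlation
print(yy_to_yy)
print(yy_to_rr)
```

On this idealized input the YY to YY counts peak at distance 10 and the YY to RR counts peak at distance 5, which is the counter-phase behavior discussed in the text. Real fragments would give noisy histograms that need the filtering described in Methods.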
Assuming that there is only one preferred position (range) for each of the [R, Y] dinucleotides (which is not necessarily true), the symmetrical pattern that best fits the distance distributions in Figure 1 would be (YYYYYRRRRR)n (see

Figure 2: Consensus sequence structure of the basic 10-base repeat of the nucleosome DNA of C. elegans. A, pattern of repeating [R, Y] dinucleotides; B, pattern of repeating [A, T] dinucleotides.


Figure 2A). The shift distances 2.5, 5, and 7.5 of the pattern fit well the values 2.9, 5.0, and 6.5 in the last column of Table I, considering the respective error bars (the bars correspond to the maximal differences between the observed shifts and their respective averages). The integer value 10 of the period is used here for simplicity of presentation. The actual period of the repeat can be calculated as the mean of the individual distances between the neighboring peak positions listed in Table I. It is 10.4 ± 0.7 bases (the error of the mean is indicated), which matches the period of 10.40 bases calculated (15) from the coordinates of phosphates in the available x-ray derived nucleosome structures. To satisfy the overall non-integer period of 10.40, the standard repeat sequence above should occasionally contain an extra R or Y. All other auto- and cross-correlation plots (YR/RY, RR/RR, RY/RY, etc.) are consistent with the above consensus pattern (data not shown). As Figure 1 and Table I illustrate, the peak positions do not strictly obey the ideal (YYYYYRRRRR)n repetition with the period of 10.4 bases, though they are close to the respective error bar ranges. The “noise” can be partly explained by the presence of other sequence patterns in the genomic DNA. One distorting contribution is the 3-base periodicity of protein-coding sequences in general (16) and in C. elegans specifically (9). This component, which interferes with the chromatin pattern, has to be and, indeed, is filtered

Figure 3: Distribution of distances between [A, T] dinucleotides in the nucleosome DNA fragments of C. elegans. The same sequences as in Figure 1 are used. A, histogram of distances between TT dinucleotides; B, distances from TT to TA dinucleotides; C, distances from TT to AA; D, distances from TT to AT dinucleotides.

out in Figures 1 and 3 by smoothing the raw data plots, i.e., by averaging (see Methods) every three consecutive positions (see an example of raw data in Fig. 4A below). As the collection of the nucleosome sequence fragments is representative of all sequence types of the C. elegans genome (7), the protein-coding content of the fragments is close to that of the genome as a whole, 27% (17). This causes a detectable 3-base periodicity in the nucleosome DNA sequences, with amplitudes about 1/6 of the 10-base oscillation amplitudes (9). Other possible distortions can be caused by distributions of distances originating from the sequence-wise anomalous central parts of the nucleosome DNA (9, 11, 14, 18, 19). These contributions are largely uncertain and cannot be removed at this stage. Tandemly repeating sequences make up only 2.7% of the C. elegans genome (17), and their influence is not visible in the calculated distance histograms. Notwithstanding the noise, the resulting pattern (YYYYYRRRRR)n fits well the rather complex picture of the preferred distances between the dinucleotides (Fig. 1), and is in full agreement with the bulk of earlier studies.
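The three-point averaging described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the name `smooth3` is hypothetical, the two end positions are left unchanged because the paper does not say how edges were treated, and the input is a synthetic period-3 spike train rather than measured counts.

```python
def smooth3(hist):
    """Replace each interior count by the mean of itself and its two neighbours.

    The first and last positions are copied unchanged (an assumption).
    """
    if len(hist) < 3:
        return [float(x) for x in hist]
    out = [float(hist[0])]
    for k in range(1, len(hist) - 1):
        out.append((hist[k - 1] + hist[k] + hist[k + 1]) / 3.0)
    out.append(float(hist[-1]))
    return out


# Synthetic counts with a pure 3-base periodicity (illustrative only).
raw = [0, 0, 9, 0, 0, 9, 0, 0, 9, 0, 0, 9]
once = smooth3(raw)    # one round, as stated for Figures 1 and 3
twice = smooth3(once)  # second round, as stated for Figure 4B
print(once)
print(twice)
```

Because every window of three adjacent positions contains exactly one spike, a single pass flattens the interior of this period-3 signal completely, while a slower ~10-base oscillation would be attenuated far less. That is the property the filtering relies on.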


Figure 4: Distribution of distances between TA dinucleotides in the nucleosome DNA fragments of C. elegans. A, original data, dominated by multiples of 3 bases (peaks at 6, 9, 12, 18, 27, 36, 45, 54); B, same after repeated smoothing by running window of 3 bases. Positions of nearest integers to 5.2 × n are indicated by triangles.
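The integer positions quoted for a non-integer period are simple to reproduce. The short Python sketch below lists the nearest integers to multiples of a period; the function name `nearest_integer_multiples` is hypothetical, and rounding exact halves upward is an assumption that does not affect the values shown.

```python
import math


def nearest_integer_multiples(period, count):
    """Nearest integers to period * n for n = 1..count (halves round up)."""
    return [math.floor(period * n + 0.5) for n in range(1, count + 1)]


# Full 10.4-base period: expected repeat distances of the pattern.
print(nearest_integer_multiples(10.4, 4))   # [10, 21, 31, 42]

# Half period 10.4 / 2 = 5.2 bases: positions marked by triangles in Figure 4B.
print(nearest_integer_multiples(5.2, 10))   # [5, 10, 16, 21, 26, 31, 36, 42, 47, 52]
```

The second list contains every position a strict 5.2-base periodicity would predict up to 52 bases. Which of them actually appear as peaks is an empirical matter read from the histogram.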

The presentation of the nucleosome 10.4-base repeat in the form (YYYYYRRRRR)n is a convenient simplification, meant only to indicate that the preferred positions within the typical (consensus) repeat (see Fig. 2A) are 0 for the dinucleotides RY, 2.5 for YY, 5 for YR, and 7.5 for RR (with uncertainties of 0.4 bases or less). The pattern should repeat at the distances 10, 21, 31, 42… bases, which are the nearest integers to multiples of the 10.4-base repeat. The positional preferences are not necessarily “sharp”, so that the immediately neighboring positions at a given standard location of a dinucleotide are almost equally suitable (13). For example, the optimal location 5 within the period for the YR element means that the flanking positions 4 (YYYYRRRRRR) and 6 (YYYYYYRRRR) may be occupied by YR nearly as often. The most adequate presentation of the sequence structure of the 10.4-base nucleosome DNA repeat would be a 16 × 10 matrix (or 16 × 21, close to two periods) in which the affinities of all 16 dinucleotides to all positions within the period are indicated. The earliest version of such a matrix is presented in (10). The following section is the next step towards the eventual derivation of the full bendability matrix for the nucleosomes of C. elegans.

[A, T] Components of the 10.4 Base Repeat

The in-phase AA and/or TT nucleosome positioning pattern (WWxxxxxxxx)n


(where WW is either AA or TT) has been suggested in many studies (7, 13, 15, 20) in addition to the classical counter-phase pattern (xTTxxxAAxx)n (10, 13, 14). In this section we address the question of which of these patterns (or both) is characteristic of the C. elegans nucleosomes. The [Y, R] pattern (YYYYYRRRRR)n above (that is, the counter-phase alternation of YY and RR) would suggest that something similar holds for A and T (TTTTTAAAAA?). Figure 3 presents the preferred distances between TT, AA, TA, and AT, obtained by calculations similar to those of Figure 1. The behavior of T and A is not exactly the same as that of Y and R. In particular, the most frequent distance between TT and AA is 8.0 ± 2.5 bases, rather than 5 as in the case of YY and RR (see Table II). This suggests that the dinucleotides TT, TC, CT, and CC are not evenly distributed within the runs YYYYY but rather occupy some favorite positions, as do AA, AG, GA, and GG within the runs RRRRR. The peak-to-peak distances for all [A, T] dinucleotides (Table II) suggest the pattern (TTTYTARAAA)(TTTY…), consistent both with the [R, Y] pattern above and with the respective shifts observed: TT to TA distance 3.5 (3.1 ± 2.1), TT to AA distance 7 (8.0 ± 2.5), and TT to AT distance 8.5 (8.7 ± 2.8) (see Fig. 2B). The TA to TA dinucleotide distances show a weak periodicity of about 5 bases (Fig. 4B, peaks at 5, 10, 16, 21, 26, 31, and 47, 52, the nearest integers to multiples of the period 10.4/2 = 5.2 bases). The possibility of a 5-base periodical distribution of some flexible base pair stacks (such as YR*YR) along nucleosome DNA follows from the works of Zhurkin et al., as early as 1979 (21, 22). The rationale is that minor and major groove positions separated by 5 bases (with roll angles opening away from the histone octamer surface) are the most deformable in the nucleosome DNA (“mini-kinks”). All previous attempts to detect the 5-base period in the nucleosomes failed, and the

Figure 5: Matrices of optimal positions corresponding to one-line presentations (YYYYYRRRRR)n (A) and (TTTYTARAAA)n (B) of the consensus nucleosome repeat. Most prominent positions for various dinucleotides within the period are indicated by + sign.

TA distance histogram (Fig. 4) is, to our knowledge, the first observation of that kind (also F. Cui and V. B. Zhurkin, personal communication). Two observations are contradictory: the preferential TT to TA distances are apparently separated from one another by about 10 bases (Fig. 3), whereas the distances between TA dinucleotides show a 5-base periodicity (Fig. 4B). The assignment of TA to the middle position of the TTTYTARAAA pattern can thus be only tentative. Future, more detailed analyses may resolve the contradiction. Note that the amplitudes of the oscillations involving TA are rather weak in both Figure 1 and Figure 3. The analysis of the contributions of all 16 dinucleotides to the periodical pattern in the nucleosomes of C. elegans (7) shows that TA is the only dinucleotide that does not display any visible 10-base periodicity. However, the histogram in Figure 3B (TT to TA distances) does show a preferred shift of about 3 bases. Taking the 5-base periodicity of TA into consideration would suggest that TA could also be rather frequent at the position occupied by the dominating AT. The simple consensus pattern is unable to reflect this feature. More appropriate, again, would be the bendability matrix or the matrix of optimal positions. Such matrices, for the [R, Y] pattern and for the [A, T] pattern, are shown in Figure 5. Again, the period 10 (instead of 10.4) is used here for simplicity of presentation. No attempt is made at this stage to derive quantitative estimates of the affinity of the dinucleotides to the preferred positions within the repeat as indicated in the matrices.

Since the oscillating AA and TT dinucleotides are dominant in the genome of C. elegans (7), the pattern (TTTYTARAAA)n derived above can be considered a good approximation to the hidden 10.4-base nucleosomal DNA repeat in C. elegans, perhaps better than the general (YYYYYRRRRR)n pattern, though the eventual full bendability matrix with all 16 dinucleotides would be an even more appropriate description, especially if some elements have more than one preferred position within the period (like TA, Fig. 4B). Although derivation of such a matrix by the auto- and cross-correlations above is possible, the anomalous central piece of the nucleosome DNA pattern (9, 11, 14, 18, 19) would certainly interfere with the reconstruction. A more adequate procedure would perhaps be some version of multiple alignment (10, 11, 19), with the aim of deriving not only the periodical pattern but also the dinucleotide distributions in the central section of the nucleosome DNA.

In some reports the in-phase periodicity of AA or TT dinucleotides has been suggested as one of the nucleosome positioning patterns (13, 15, 20). In that case all the preferred distances between AA or TT dinucleotides (AA to AA, TT to TT, AA to TT, and TT to AA) would be the same, close to 10 bases. As the analysis above shows, the most frequent TT to AA distance in C. elegans is rather 7-8 bases (2-3 bases for the AA to TT distance). When, however, the WW to WW distances (WW = AA or TT) are scored (that is, the total sum of the AA/AA + TT/TT + AA/TT + TT/AA distances), a clear periodicity of 10-11 bases appears (8, 15). This apparently happens because the amplitudes of the oscillations for the AA/AA and TT/TT distances are significantly higher than those of the nearly counter-phase AA/TT and TT/AA oscillations (compare Fig. 3A and Fig. 3C). Obviously, the total sum (WW/WW) would then be dominated by distances that are multiples of about 10, that is, by the “AA or TT” periodicity.
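The summation argument above can be checked with a toy model in Python. Everything numerical here except the 10.4-base period and the TT to AA shift of about 7.5 bases is an assumption chosen for illustration: each distance histogram is replaced by a cosine, and the amplitudes `auto_amp` and `cross_amp` are hypothetical values that only encode "autocorrelations stronger than cross-correlations". The sketch is not derived from the measured histograms.

```python
import math

PERIOD = 10.4  # bases, the period estimated in the text


def oscillation(d, amplitude, shift):
    """Cosine stand-in for a histogram peaking at shift + k * PERIOD."""
    return amplitude * math.cos(2.0 * math.pi * (d - shift) / PERIOD)


def summed_ww(d, auto_amp=1.0, cross_amp=0.4):
    """Sum of the AA/AA, TT/TT, TT/AA and AA/TT model oscillations at distance d."""
    auto = 2.0 * oscillation(d, auto_amp, 0.0)        # AA/AA + TT/TT, in phase
    tt_aa = oscillation(d, cross_amp, 7.5)            # TT to AA, peak near 7-8
    aa_tt = oscillation(d, cross_amp, PERIOD - 7.5)   # AA to TT, peak near 2-3
    return auto + tt_aa + aa_tt


# Distance (within roughly one period) at which the summed signal is largest.
best = max(range(3, 16), key=summed_ww)
print(best)
```

In this model the summed signal peaks at a distance of 10 bases even though the two cross terms individually peak near 7.5 and 2.9 bases, because the weaker, nearly counter-phase terms largely cancel each other in the sum.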
In other words, the periodically repeating combination (TTxxxxxxxx)(TT…) occurs more often than (TTxxxxxAAx)(TT…). The same holds for (AAxxxxxxxx)(AA…) and (AAxTTxxxxx)(AA…). Nevertheless, all four combinations fit the same, close to counter-phase, periodical pattern (TTTYTARAAA)(TTTYTARAAA)… Obviously, there is no such fit of the four combinations to the “AA or TT” pattern (WWxxxxxxxx)(WWxxxxxxxx)…

In conclusion, the sequence structure of the hidden 10.4-base repeat in the genome of C. elegans is calculated by distance analysis to be TTTYTARAAA, or YYYYYRRRRR in a more general form. The YY/RR pattern is, indeed, the universal pattern across species. It is observed in yeast (14), in human (12), and in mixed sequences of various eukaryotes (10, 11), and C. elegans is no exception. The C. elegans


database of nucleosome DNA sequences and fragments thereof (7) appears to be sufficient for calculating the DNA bendability matrix (10) for C. elegans, the ultimate complete description of the 10.4-base nucleosome DNA repeat structure. This would require more advanced computations (work in progress).

Methods

Sequences

The nucleosome DNA sequence fragments are taken from the database of 312,492 nucleosome DNA ends (fragment sizes of 50 to 200 bases) generated by Johnson et al. (7) and available as supplemental data to the paper (7) at www.genome.org. Only sequences longer than 100 bases were taken for the distance analysis. Three subsets of the database, 2.5 × 10^6 bases in total each, were taken from the top, the middle, and the bottom of the list. The results of the calculations for all three sets were combined.

Distance Analysis

This is equivalent to positional auto- and cross-correlations. All the distances between the respective dinucleotides (counted unidirectionally, as the difference between the sequence coordinates of the dinucleotides) were scored and presented in the form of histograms in the interval up to 60 bases. To avoid overrepresentation of short distances, counting was stopped when the heading dinucleotide reached the last 60 bases of a sequence fragment.

Filtering

To eliminate the 3-base periodical component of the sequences, the scores of every three adjacent positions in the original distance histograms were averaged, and the middle position score was replaced by the average. The histograms of Figures 1 and 3 are all filtered only once, as described. To further decrease the raggedness of the histogram (in the case of the autocorrelation of TA, Fig. 4), a second round of filtering was applied (Fig. 4B).

Acknowledgement

Discussions with V. B. Zhurkin are highly appreciated.

References and Footnotes

1. M. Yaniv, S. C. R. Elgin. Curr Op in Genetics and Development 18, 107-108 (2008).
2. J. D. Anderson, J. Widom. J Mol Biol 296, 979-987 (2000).
3. I. Ioshikhes, E. N. Trifonov, M. Q. Zhang. Proc Natl Acad Sci USA 96, 2891-2895 (1999).
4. T. N. Mavrich, C. Jiang, I. P. Ioshikhes, X. Li, B. J. Venters, S. J. Zanton, L. P. Tomsho, J. Qi, R. L. Glaser, S. C. Schuster, D. S. Gilmour, I. Albert, B. F. Pugh. Nature 453, 358-362 (2008).
5. S. Henikoff. Nature Rev Genet 9, 15-26 (2008).
6. F. Salih, B. Salih, S. Kogan, E. N. Trifonov. J Biomolec Str Dyn 26, 9-16 (2008).
7. S. M. Johnson, F. J. Tan, H. L. McCullough, D. P. Riordan, A. Z. Fire. Genome Research 16, 1505-1516 (2006).
8. J. Widom. J Mol Biol 259, 579-588 (1996).
9. A. Valouev, J. Ichikawa, T. Tonthat, J. Stuart, S. Ranade, H. Peckham, K. Zeng, J. A. Malek, G. Costa, K. McKernan, A. Sidow, A. Fire, S. M. Johnson. Genome Research 18, 1051-1063 (2008).
10. G. Mengeritsky, E. N. Trifonov. Nucl Acids Res 11, 3833-3851 (1983).
11. I. Ioshikhes, A. Bolshoy, K. Derenshteyn, M. Borodovsky, E. N. Trifonov. J Molec Biol 262, 129-139 (1996).
12. S. B. Kogan, M. Kato, R. Kiyama, E. N. Trifonov. J Biomol Str Dyn 24, 43-48 (2006).
13. F. Salih, B. Salih, E. N. Trifonov. J Biomolec Str Dyn 24, 489-493 (2007).
14. A. B. Cohanim, Y. Kashi, E. N. Trifonov. J Biomol Str Dyn 22, 687-694 (2005).
15. A. B. Cohanim, Y. Kashi, E. N. Trifonov. J Biomol Str Dyn 23, 559-566 (2006).
16. E. N. Trifonov. J Molec Biol 194, 643-652 (1987).
17. The C. elegans Sequencing Consortium. Science 282, 2012-2018 (1998).
18. D. Boffelli, P. De Santis, A. Palleschi, M. Savino. Bioph Chem 39, 127-136 (1991).
19. M. Kato, Y. Onishi, Y. Wada-Kiyama, T. Abe, T. Ikemura, S. Kogan, A. Bolshoy, E. N. Trifonov, R. Kiyama. J Molec Biol 332, 111-125 (2003).
20. S. C. Satchwell, H. R. Drew, A. A. Travers. J Mol Biol 191, 659-675 (1986).
21. V. B. Zhurkin, Y. P. Lysov, V. I. Ivanov. Nucl Acids Res 6, 1081-1096 (1979).
22. V. B. Zhurkin. FEBS Letters 158, 293-297 (1983).

Date Received: July 18, 2008

Communicated by the Editor Ramaswamy H. Sarma
